I am using Apache Spark to analyse data from Cassandra and will insert the results back into Cassandra, designing new tables as required by our queries. I want to know whether it is possible for Spark to do this analysis in real time. If yes, then how? I have read many tutorials on this, but found nothing.
I want to perform the analysis and insert the results into Cassandra as soon as new data arrives in my table.
This is possible with Spark Streaming; you should take a look at the demos and documentation that come packaged with the Spark Cassandra Connector.
https://github.com/datastax/spark-cassandra-connector
This includes support for streaming, as well as support for creating new tables on the fly.
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/8_streaming.md
Spark Streaming extends the core API to allow high-throughput,
fault-tolerant stream processing of live data streams. Data can be
ingested from many sources such as Akka, Kafka, Flume, Twitter,
ZeroMQ, TCP sockets, etc. Results can be stored in Cassandra.
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/5_saving.md#saving-rdds-as-new-tables
Use saveAsCassandraTable method to automatically create a new table
with given name and save the RDD into it. The keyspace you're saving
to must exist. The following code will create a new table words_new in
keyspace test with columns word and count, where word becomes a
primary key:
case class WordCount(word: String, count: Long)
val collection = sc.parallelize(Seq(WordCount("dog", 50), WordCount("cow", 60)))
collection.saveAsCassandraTable("test", "words_new", SomeColumns("word", "count"))
I am using Apache Beam to read messages from PubSub and write them to BigQuery. What I'm trying to do is write to multiple tables according to the information in the input. To reduce the number of writes, I am windowing the input from PubSub.
A small example:
messages
.apply(new PubsubMessageToTableRow(options))
.get(TRANSFORM_OUT)
.apply(ParDo.of(new CreateKVFromRow()))
.apply(Window.into(FixedWindows.of(Duration.standardMinutes(10L))))
// group by key
.apply(GroupByKey.create())
// Are these two rows what I want?
.apply(Values.create())
.apply(Flatten.iterables())
.apply(BigQueryIO.writeTableRows()
.withoutValidation()
.withCreateDisposition(CreateDisposition.CREATE_NEVER)
.withWriteDisposition(WriteDisposition.WRITE_APPEND)
.withExtendedErrorInfo()
.withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors())
.to((SerializableFunction<ValueInSingleWindow<TableRow>, TableDestination>) input -> {
// Simplified for readability
Integer destination = (Integer) input.getValue().get("key");
return new TableDestination(
new TableReference()
.setProjectId(options.getProjectID())
.setDatasetId(options.getDatasetID())
.setTableId(destination + "_Table"),
"Table Destination");
}));
I couldn't find anything in the documentation, but I was wondering: how many writes are done per window? If there are multiple tables, is it one write per table for all elements in the window? Or is it one write per element, since the table might be different for each element?
Since you're using PubSub as a source, your job appears to be a streaming job, so the default insertion method is STREAMING_INSERTS (see docs). I don't see any benefit or reason to reduce writes with this method, as billing is based on the size of the data. By the way, your example doesn't really reduce the number of writes effectively anyway.
Although it is a streaming job, the FILE_LOADS method has also been supported for a few versions now. If withMethod is set to FILE_LOADS, you can define withTriggeringFrequency on BigQueryIO. This setting defines the frequency at which the load jobs happen. Here the connector handles everything for you and you don't need to group by key or window the data. A load job will be started for each table.
Since it seems to be totally fine for you if it takes some time until your data is in BigQuery, I'd suggest using FILE_LOADS, as loading is free as opposed to streaming inserts. Just mind the quotas when defining the triggering frequency.
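As a rough sketch (not a drop-in replacement for your pipeline), the write could look like this with FILE_LOADS. The shard count, the 10-minute frequency and the PCollection name rows are assumptions, and no Window/GroupByKey is needed because the connector batches per destination table itself:

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.Method;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.values.ValueInSingleWindow;
import org.joda.time.Duration;

// "rows" is assumed to be the PCollection<TableRow> coming out of your
// PubsubMessageToTableRow transform, without the Window/GroupByKey steps.
rows.apply(BigQueryIO.writeTableRows()
    .withMethod(Method.FILE_LOADS)
    // One load job per destination table roughly every 10 minutes;
    // keep this under the per-table load-job quotas.
    .withTriggeringFrequency(Duration.standardMinutes(10))
    // Required when a triggering frequency is set on an unbounded input.
    .withNumFileShards(100)
    .withCreateDisposition(CreateDisposition.CREATE_NEVER)
    .withWriteDisposition(WriteDisposition.WRITE_APPEND)
    .to((SerializableFunction<ValueInSingleWindow<TableRow>, TableDestination>) input -> {
        // Same dynamic-destination logic as in the question, simplified.
        Integer destination = (Integer) input.getValue().get("key");
        return new TableDestination(
            options.getProjectID() + ":" + options.getDatasetID() + "." + destination + "_Table",
            "Table Destination");
    }));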
I am using Apache Beam and Google Cloud Dataflow to insert information into a Cloud SQL database. So far this has been working great writing to one table. The information being sent is now being broadened to include information destined for another table in the database.
I was curious whether there is a way to choose an SQL query dynamically based on the information I am receiving, or whether I can somehow build the pipeline to execute multiple queries. Either would work...
Or, am I stuck with having to create a separate pipeline?
Cheers,
EDIT: Adding my current pipeline config
MainPipeline = Pipeline.create(options);
MainPipeline.apply(PubsubIO.readStrings().fromSubscription(MAIN_SUBSCRIPTION))
.apply(JdbcIO.<String> write()
.withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create("com.mysql.cj.jdbc.Driver", JDBC_URL)
.withUsername(JDBC_USER).withPassword(JDBC_PASS))
.withStatement(QUERY_SQL).withPreparedStatementSetter(new NewPreparedStatementSetter() {
}));
I don't think you can have dynamic queries on JdbcIO based on the input elements; it is configured once at construction time, as far as I can see.
However, I can think of a couple of potential workarounds, if they suit your use case.
One is just to write your own ParDo in which you call the JDBC driver manually. This is basically re-implementing part of JdbcIO with the new features added. Such a ParDo can be as flexible as you like.
Another is to split the input PCollection into multiple outputs, as in the sketch below. That will work if your use case is limited to some predefined set of queries that you can choose from based on the input. This way you split the input into multiple PCollections and then attach differently configured IOs to each.
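Here is a rough sketch of that second workaround using a multi-output ParDo. The routing condition, the two statements and the setter classes (TABLE_A_SQL, TABLE_B_SQL, TableASetter, TableBSetter) are placeholders you would fill in, and dataSourceConfig stands for the JdbcIO.DataSourceConfiguration you already build in your pipeline:

import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.TupleTag;
import org.apache.beam.sdk.values.TupleTagList;

final TupleTag<String> tableATag = new TupleTag<String>() {};
final TupleTag<String> tableBTag = new TupleTag<String>() {};

PCollectionTuple split = MainPipeline
    .apply(PubsubIO.readStrings().fromSubscription(MAIN_SUBSCRIPTION))
    .apply(ParDo.of(new DoFn<String, String>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
            // Placeholder routing rule: decide per element which table it belongs to.
            if (c.element().contains("table_b")) {
                c.output(tableBTag, c.element());
            } else {
                c.output(c.element()); // main output -> table A
            }
        }
    }).withOutputTags(tableATag, TupleTagList.of(tableBTag)));

split.get(tableATag).apply("WriteTableA", JdbcIO.<String>write()
    .withDataSourceConfiguration(dataSourceConfig)
    .withStatement(TABLE_A_SQL)
    .withPreparedStatementSetter(new TableASetter()));

split.get(tableBTag).apply("WriteTableB", JdbcIO.<String>write()
    .withDataSourceConfiguration(dataSourceConfig)
    .withStatement(TABLE_B_SQL)
    .withPreparedStatementSetter(new TableBSetter()));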
How can we read data from a relational database using a custom data source? I am new to Flink streaming and am having trouble adding a new custom data source. Please help me add a custom data source and read data continuously from the source DB.
As suggested by Chengzhi, relational databases are not designed to be processed in a streaming fashion and it would be better to use Kafka, Kinesis or some other system for that.
However, you could write a custom source function that uses a JDBC connection to fetch the data. It would have to continuously query the DB for any new data. The issue here is that you need a way to determine which data you have already read/processed and which you have not. Off the top of my head, you could use a couple of things, like remembering the last processed primary key and using it in a subsequent query like:
SELECT * FROM events WHERE event_id > $last_processed_event_id;
Alternatively you could clear the events table inside some transaction like:
SELECT * FROM unprocessed_events;
DELETE FROM unprocessed_events WHERE event_id IN $PROCESSED_EVENT_IDS;
event_id can be anything that lets you uniquely identify the records, maybe it could be some timestamp or a set of fields.
Another thing to consider is that you would have to manually take care of checkpointing (of the last_processed_event_id offset) if you want to provide any reasonable at-least-once or exactly-once guarantees; see the sketch below.
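A rough sketch of such a source, under the assumptions above (an events table with an increasing event_id and a payload column, polled once a second), could look like this; it checkpoints the offset as operator state so it can resume after a failure:

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.source.RichSourceFunction;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

/** Polls a relational table and emits rows with an event_id greater than the last one seen. */
public class JdbcPollingSource extends RichSourceFunction<String> implements CheckpointedFunction {

    private final String jdbcUrl;
    private volatile boolean running = true;

    private transient ListState<Long> offsetState;
    private long lastProcessedEventId = 0L;

    public JdbcPollingSource(String jdbcUrl) {
        this.jdbcUrl = jdbcUrl;
    }

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             PreparedStatement stmt = conn.prepareStatement(
                 "SELECT event_id, payload FROM events WHERE event_id > ? ORDER BY event_id")) {
            while (running) {
                stmt.setLong(1, lastProcessedEventId);
                try (ResultSet rs = stmt.executeQuery()) {
                    while (rs.next()) {
                        // Emit the record and advance the offset under the checkpoint lock,
                        // so the checkpointed offset always matches what was emitted.
                        synchronized (ctx.getCheckpointLock()) {
                            ctx.collect(rs.getString("payload"));
                            lastProcessedEventId = rs.getLong("event_id");
                        }
                    }
                }
                Thread.sleep(1000L); // poll interval
            }
        }
    }

    @Override
    public void cancel() {
        running = false;
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        offsetState.clear();
        offsetState.add(lastProcessedEventId);
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        offsetState = context.getOperatorStateStore()
            .getListState(new ListStateDescriptor<>("last-event-id", Long.class));
        for (Long value : offsetState.get()) {
            lastProcessedEventId = value;
        }
    }
}

You would then attach it with env.addSource(new JdbcPollingSource(jdbcUrl)).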
I'm using org.apache.spark.sql.SparkSession to read a Cassandra table into a Spark Dataset<Row>. The dataset contains the whole table's information, and if I add a new row to Cassandra it seems to work asynchronously in the background and update the dataset with the new row without reading the table again.
Is there any way to limit, or a built-in limit on, the amount of data read in from the table?
At what size does a Dataset<Row> become difficult for Spark to process?
What are the requirements for Spark to handle the calculations if the Cassandra table is half a terabyte?
If Spark writes a large new table of information into Cassandra, does that cause more problems for Spark writing it or for Cassandra receiving it? I just wonder which product would lose data or break down first.
If someone could tell me how SparkSession.read() and Dataset<Row> actually work in the background and what they require to perform well, that would be really useful. Thank you.
SparkSession.read() invokes the underlying datasource's scan method. For Cassandra that is the Spark Cassandra Connector.
The Spark Cassandra Connector breaks up the C* token ring into chunks, and each chunk more or less becomes a Spark partition. Each executor core then reads a single Spark partition at a time.
There is a video explaining this at DataStax Academy.
The actual size of a row is pretty much unrelated to stability; the data is broken up by token range, so you should only run into difficulties if the underlying Cassandra data has very large hot spots. These would lead to very large Spark partitions, which could lead to memory issues. In general, a well-distributed C* database should have no problems at any size.
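For reference, this is roughly what the read path looks like from the Java DataFrame API (the keyspace, table and contact point here are assumptions). Nothing is read until an action runs, at which point each token-range chunk is read as one Spark partition by an executor core:

import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
    .appName("cassandra-read")
    .config("spark.cassandra.connection.host", "127.0.0.1") // assumed contact point
    .getOrCreate();

// Lazy: this only registers the Cassandra scan, it does not pull any data yet.
Dataset<Row> words = spark.read()
    .format("org.apache.spark.sql.cassandra")
    .option("keyspace", "test")  // assumed keyspace and table
    .option("table", "words")
    .load();

// The scan happens here, one token-range chunk per Spark partition.
long bigCounts = words.filter(col("count").gt(10)).count();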
I've been upgrading a Java Spark project from using txt file input to reading from MongoDB. My question is: can we query just the data needed? For example, I have millions of records and I want to get only the records from the beginning of this week and start processing them.
Looking at MongoDB documentation, they all start like this:
// Create a JavaSparkContext using the SparkSession's SparkContext object
JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
// Load data and infer schema, disregard toDF() name as it returns Dataset
Dataset<Row> implicitDS = MongoSpark.load(jsc).toDF();
Basically, MongoSpark loads the whole collection into the context and then transforms it into a DataFrame, which means that even if I only need 1,000 records from this week, the program still has to get the whole million records before doing anything else.
I wonder if there is something else that would allow me to pass the query directly to MongoSpark instead of doing this?
Thank you.
A DataFrame, or even an RDD, represents a lazy collection, so doing:
Dataset<Row> implicitDS = MongoSpark.load(jsc).toDF();
will not cause any computation to happen inside Spark, and no data will be requested from MongoDB.
Only when you perform an action will Spark request data to be processed. At that stage the Mongo Spark Connector will partition the data you have requested and return the partition information to the Spark driver. The Spark driver will allocate tasks to the Spark workers, and each worker will ask the Mongo Spark Connector for its relevant partition.
One of the nice features of DataFrames / Datasets is that when using filters the underlying Mongo Connector code constructs an aggregation pipeline to filter the data in MongoDB before sending it to Spark. This means that not all the data is sent across the wire! Just the data you need.
Things to be aware of: make sure you are using the latest Mongo Spark Connector. There is also a ticket to push the filters down into the partitioning logic as well, potentially reducing the number of empty partitions and providing further speed-ups. An example of a pushed-down filter is sketched below.
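For example, something like the following should let the connector push the week filter down to MongoDB (the SparkSession spark is the one from the question; the field name createdAt and the cutoff timestamp are assumptions about your schema):

import static org.apache.spark.sql.functions.col;

import com.mongodb.spark.MongoSpark;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

// Still lazy: nothing has been requested from MongoDB at this point.
Dataset<Row> implicitDS = MongoSpark.load(jsc).toDF();

// The comparison is pushed down to MongoDB, so only this week's documents
// cross the wire once an action (count, collect, write, ...) runs.
Dataset<Row> thisWeek = implicitDS.filter(
    col("createdAt").geq(java.sql.Timestamp.valueOf("2019-01-07 00:00:00")));

long recordsThisWeek = thisWeek.count(); // action: triggers the filtered read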