Java Spark and MongoDB: query only the needed data

I've been upgrading a Java Spark project from using txt file input to reading from MongoDB. My question is: can we query only the data we need? For example, I have millions of records and I only want the records from the beginning of this week so I can start processing them.
Looking at the MongoDB documentation, the examples all start like this:
// Create a JavaSparkContext using the SparkSession's SparkContext object
JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
// Load data and infer schema, disregard toDF() name as it returns Dataset
Dataset<Row> implicitDS = MongoSpark.load(jsc).toDF();
Basically, MongoSpark loads the whole collection into the context and then transforms it into a DataFrame, which means that even if I only need the 1,000 records from this week, the program still has to fetch all 1 million records before doing anything else.
I wonder if there is something else that allows me to pass the query directly to MongoSpark instead of doing this?
Thank you.

A DataFrame, or even an RDD, represents a lazy collection, so doing:
Dataset<Row> implicitDS = MongoSpark.load(jsc).toDF();
will not cause any computation to happen inside Spark, and no data will be requested from MongoDB.
Only when you perform an action will Spark request data to be processed. At that stage the Mongo Spark Connector partitions the data you have requested and returns the partition information to the Spark driver. The Spark driver allocates tasks to the Spark workers, and each worker asks the Mongo Spark Connector for its relevant partition.
One of the nice features of DataFrames / Datasets is that when you use filters, the underlying Mongo Spark Connector code constructs an aggregation pipeline to filter the data in MongoDB before sending it to Spark. This means that not all the data is sent across the wire, just the data you need.
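For example, a date filter applied to the Dataset should be translated into a $match stage and evaluated inside MongoDB (a minimal sketch; the createdAt field name and the "last seven days" cut-off are assumptions about your schema, not part of the connector's API):

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.current_date;
import static org.apache.spark.sql.functions.date_sub;

import com.mongodb.spark.MongoSpark;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

// The load is lazy; the filter below should be pushed down to MongoDB,
// so only the matching documents cross the wire.
// "createdAt" is a hypothetical date field -- replace it with your own.
Dataset<Row> thisWeek = MongoSpark.load(jsc).toDF()
        .filter(col("createdAt").geq(date_sub(current_date(), 7)));

thisWeek.count();   // the action that actually triggers the read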
Things to be aware of: make sure you are using the latest Mongo Spark Connector. There is also a ticket to push the filters down into the partitioning logic as well, potentially reducing the number of empty partitions and providing further speed-ups.

Related

Spark read CSV file using Data Frame and query from PostgreSQL DB

I'm new to Spark. I'm loading a huge CSV file into a DataFrame using the code given below:
Dataset<Row> df = sqlContext.read().format("com.databricks.spark.csv").schema(customSchema)
.option("delimiter", "|").option("header", true).load(inputDataPath);
Now, after loading the CSV data into the DataFrame, I want to iterate through each row and, based on some columns, query a PostgreSQL DB (performing some geometry operations). Later I want to merge some of the fields returned from the DB with the DataFrame records. What's the best way to do this, considering the huge number of records?
Any help appreciated. I'm using Java.
As @mck also pointed out: the best way is to use a join.
With Spark you can read an external JDBC table using the DataFrame API,
for example:
val props = Map(....)
spark.read.format("jdbc").options(props).load()
See the DataFrameReader Scaladoc for more options and for the properties and values you need to set.
Then use a join to merge the fields.
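In Java that could look roughly like this (a sketch only; the JDBC URL, credentials, table name and the id join column are hypothetical placeholders, not values from the question):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Read the PostgreSQL table through Spark's JDBC data source.
// All connection details below are hypothetical placeholders.
Dataset<Row> pgDF = sqlContext.read()
        .format("jdbc")
        .option("url", "jdbc:postgresql://dbhost:5432/mydb")
        .option("dbtable", "geometry_table")
        .option("user", "dbuser")
        .option("password", "dbpass")
        .load();

// Merge the DB fields into the CSV DataFrame on a shared key column
// ("id" is assumed to exist on both sides).
Dataset<Row> merged = df.join(pgDF, df.col("id").equalTo(pgDF.col("id")), "left");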

Aggregations using Spark Data Frames in Java for Large Data

I have around 6-8 TB of data in a sharded table with 5 partitions; this table is in HBase. I have built a Java-based Spark job that reads data from this table, performs some aggregations to get aggregates for a set of columns treated as the key, and finally writes the results back into another table. Initially, I tried the Spark map and foreach APIs and performed the aggregations in memory using data structures such as HashMap; the results were then upserted into the table over a JDBC connection. However, the performance was really bad and the job never completed. Then I wrote a new job using DataFrames. I pull the data using the HBase RDD API, convert it into a DataFrame, then perform the groupBy and aggregations, and finally save the results using
" finalDF.save("org.apache.phoenix.spark", SaveMode.Overwrite, output_conf);"
This was also taking time, so I divided the task based on a key range and processed one range (say, 1 million users) at a time, repartitioning the data to 2001 partitions to ensure high compression.
DataFrame sessionDF = new PhoenixRDD(sqlContext.sparkContext(), inputTable,
        JavaConverters.asScalaBufferConverter(cols).asScala().toSeq(),
        Option.apply(filter), Option.apply(source), hconf)
    .toDataFrame(sqlContext)
    .repartition(partitions);
The Spark job properties used are as follows:
--spark.app.name test
--spark.master yarn
--spark.deploy.mode cluster
--spark.driver.cores 2
--spark.driver.memory 4G
--spark.executor.instances 8
--spark.executor.cores 2
--spark.executor.memory 16G
--spark.executor.heartbeatInterval 6000000
--spark.default.parallelism 2001
--spark.yarn.executor.memoryOverhead 4096
--spark.yarn.scheduler.heartbeat.interval-ms 6000000
--spark.network.timeout 6000000
--spark.serializer org.apache.spark.serializer.KryoSerializer
--spark.shuffle.io.retryWait 60s
--spark.shuffle.io.maxRetries 10
The problem is that this job takes around 8-10 hours to process just one million users, which is close to 1 TB of data, and after that it usually starts giving "org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 1" and takes another 5-6 hours to finish. I tried increasing executors and memory, but I still end up with this issue somewhere during the run, and it is getting difficult to process the whole data set.
Can someone please advise how I can improve the processing of this job?
Please let me know if you need any further information.
Here is a cut-down version of the aggregation step:
finalDF
    .select(col("OID"), col("CID"), col("P"))
    .groupBy(col("OID"), col("CID"))
    .agg(sum(when(col("P").equalTo(lit("sd")).or(col("P").equalTo(lit("hd"))), lit(1))
            .otherwise(lit(0))).alias("P"));
There are many more fields and other aggregations as part of this statement.

Is there a limit on how much SparkSession.read() reads from a Cassandra table into a Dataset&lt;Row&gt;? (Spark performance)

I'm using org.apache.spark.sql.SparkSession to read a Cassandra table into a Spark Dataset&lt;Row&gt;. The Dataset holds the whole table's information, and if I add a new row to Cassandra it seems to work asynchronously in the background and update the Dataset with the new row, without reading the table again.
Is there any way to limit the data read in from the table, or is there a built-in limit?
At what size does a Dataset&lt;Row&gt; start to become difficult for Spark to process?
What does Spark need in order to handle the calculations if the Cassandra table is half a terabyte?
If Spark writes a large new table of information into Cassandra, does that cause more problems for Spark writing it or for Cassandra ingesting it? I just wonder which product would lose data or break down first.
If someone could tell me how SparkSession.read() exactly works in the background, or what a Dataset&lt;Row&gt; requires to perform well, that would be really useful. Thank you.
SparkSession.read() invokes the underlying datasource's scan method. For Cassandra that is the Spark Cassandra Connector.
The Spark Cassandra Connector breaks the C* token ring up into chunks, and each chunk more or less becomes a Spark partition. Single Spark partitions are then read by each executor core.
There is a video explaining this at DataStax Academy.
The actual size of a row is pretty much unrelated to stability; the data is broken up by token range, so you should only run into difficulties if the underlying Cassandra data has very large hot spots. Those would lead to very large Spark partitions, which could lead to memory issues. In general, a well-distributed C* database should have no problems at any size.
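For reference, the Dataset-based read from Java usually goes through the connector's DataFrame source; a minimal sketch (the keyspace, table and host below are placeholders):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .appName("cassandra-read")
        .config("spark.cassandra.connection.host", "127.0.0.1")   // placeholder host
        .getOrCreate();

// Each token-range chunk described above surfaces as one partition of this Dataset.
Dataset<Row> ds = spark.read()
        .format("org.apache.spark.sql.cassandra")
        .option("keyspace", "my_keyspace")   // placeholder
        .option("table", "my_table")         // placeholder
        .load();

ds.printSchema();   // no table data is read until an action touches the rows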

Reading from a custom storage backend with Apache Spark

I'm still fairly new to Spark, and I have a question.
Let's say I need to submit a Spark application to a 4-node cluster, where each node has a standalone storage backend (e.g. RocksDB) containing exactly the same key-value rows, from which I need to read the data to process. I can create an RDD by getting all the rows I need from the storage and calling parallelize on the dataset:
public <K, V> JavaRDD<V> parallelize(Map<K, V> data) {
    // Materialises every value in driver memory before Spark distributes it.
    return sparkcontext.parallelize(new ArrayList<>(data.values()));
}
However, I still need to load every row that I need to process into memory from disk on every node in the cluster, even though each node is only going to process a part of it, since the data has to be in the Map structure before the RDD is created.
Is there another way to do this, or am I looking at this wrong? The database is not supported by Hadoop, and I can't use HDFS for this use case. It's not supported by JDBC either.
Thank you in advance.

Real-time analytics using Apache Spark

I am using Apache Spark to analyse data from Cassandra and will insert the data back into Cassandra, designing new tables as per our queries. I want to know whether it is possible for Spark to do the analysis in real time. If yes, then how? I have read many tutorials about this but have found nothing.
I want to perform the analysis and insert into Cassandra as soon as data arrives in my table.
This is possible with Spark Streaming; you should take a look at the demos and documentation that come packaged with the Spark Cassandra Connector.
https://github.com/datastax/spark-cassandra-connector
This includes support for streaming, as well as support for creating new tables on the fly.
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/8_streaming.md
Spark Streaming extends the core API to allow high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources such as Akka, Kafka, Flume, Twitter, ZeroMQ, TCP sockets, etc. Results can be stored in Cassandra.
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/5_saving.md#saving-rdds-as-new-tables
Use the saveAsCassandraTable method to automatically create a new table with the given name and save the RDD into it. The keyspace you're saving to must exist. The following code will create a new table words_new in keyspace test with columns word and count, where word becomes a primary key:
case class WordCount(word: String, count: Long)
val collection = sc.parallelize(Seq(WordCount("dog", 50), WordCount("cow", 60)))
collection.saveAsCassandraTable("test", "words_new", SomeColumns("word", "count"))
