Aggregations using Spark DataFrames in Java for Large Data

I have around 6-8 TB of data in each of the 5 partitions of a sharded table. This table is in HBase. I have built a Java-based Spark job that reads data from this table, performs some aggregations to compute aggregates for a set of columns treated as the key, and finally writes the results back into another table. Initially, I tried the Spark map and foreach APIs and performed the aggregations in memory using data structures such as HashMap; the results were finally upserted into the table using a JDBC connection. However, the performance was really bad and the job never completed. I then wrote a new job using DataFrames. I pull the data using the HBaseRDD API and convert it into a DataFrame, then I perform the groupBy and aggregations, and finally save the results using
" finalDF.save("org.apache.phoenix.spark", SaveMode.Overwrite, output_conf);"
This was also taking time, so I divided the task based on a key range and processed one range (say 1 million users) at a time, repartitioning the data into 2001 partitions to ensure high compression.
DataFrame sessionDF = new PhoenixRDD(
        sqlContext.sparkContext(),
        inputTable,
        JavaConverters.asScalaBufferConverter(cols).asScala().toSeq(),
        Option.apply(filter),
        Option.apply(source),
        hconf)
    .toDataFrame(sqlContext)
    .repartition(partitions);
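For context, here is a simplified, self-contained sketch of the per-range pipeline (read one key range, repartition, aggregate, write back), using the Phoenix DataFrame read/write source instead of constructing PhoenixRDD directly; the table names, zkUrl, range bounds and partition count below are placeholders rather than the actual values from the job:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SaveMode;
import static org.apache.spark.sql.functions.*;

public class RangeAggregationSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("range-aggregation");
        JavaSparkContext jsc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(jsc);

        String rangeStart = "U000000000";   // placeholder key-range bounds for one batch
        String rangeEnd   = "U001000000";
        int partitions = 2001;

        // Read only the current key range from the Phoenix table.
        DataFrame sessionDF = sqlContext.read()
            .format("org.apache.phoenix.spark")
            .option("table", "INPUT_TABLE")            // placeholder table name
            .option("zkUrl", "zk-host:2181")           // placeholder ZooKeeper quorum
            .load()
            .filter(col("OID").between(rangeStart, rangeEnd))
            .repartition(partitions);

        // Aggregate per (OID, CID); same cut-down expression as shown further below.
        DataFrame finalDF = sessionDF
            .groupBy(col("OID"), col("CID"))
            .agg(sum(when(col("P").equalTo(lit("sd")).or(col("P").equalTo(lit("hd"))), lit(1))
                    .otherwise(lit(0))).alias("P"));

        // Write the aggregates for this range back to Phoenix.
        finalDF.write()
            .format("org.apache.phoenix.spark")
            .mode(SaveMode.Overwrite)
            .option("table", "OUTPUT_TABLE")           // placeholder table name
            .option("zkUrl", "zk-host:2181")
            .save();

        jsc.stop();
    }
}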
The spark job properties used are as below:
--spark.app.name test
--spark.master yarn
--spark.deploy.mode cluster
--spark.driver.cores 2
--spark.driver.memory 4G
--spark.executor.instances 8
--spark.executor.cores 2
--spark.executor.memory 16G
--spark.executor.heartbeatInterval 6000000
--spark.default.parallelism 2001
--spark.yarn.executor.memoryOverhead 4096
--spark.yarn.scheduler.heartbeat.interval-ms 6000000
--spark.network.timeout 6000000
--spark.serializer org.apache.spark.serializer.KryoSerializer
--spark.shuffle.io.retryWait 60s
--spark.shuffle.io.maxRetries 10
The problem is that this job takes around 8-10 hours to process just one million users (close to 1 TB of data), and after that it usually starts throwing "org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 1" and takes another 5-6 hours to finish. I tried increasing executors and memory, but I still end up with this issue somewhere during the run, and it is getting difficult to process the whole dataset this way.
Can someone please advise how I can improve the processing of this job?
Please let me know if you need any further information.
Here is the cut-down version of aggregation step:
finalDF
    .select(col("OID"), col("CID"), col("P"))
    .groupBy(col("OID"), col("CID"))
    .agg(sum(when(col("P").equalTo(lit("sd")).or(col("P").equalTo(lit("hd"))), lit(1)).otherwise(lit(0))).alias("P"));
There are many more fields and other aggregations as part of this statement.
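To show the shape, here is a simplified sketch of how several such aggregates can sit in the same agg() call; only "OID", "CID" and "P" come from the real statement, and the extra metrics are made-up examples:

import org.apache.spark.sql.DataFrame;
import static org.apache.spark.sql.functions.*;

public class AggregationSketch {
    // Several aggregates in one agg() call over the same grouping.
    // "OID", "CID", "P" are from the question; ROW_CNT and DISTINCT_P are hypothetical.
    static DataFrame aggregate(DataFrame finalDF) {
        return finalDF
            .groupBy(col("OID"), col("CID"))
            .agg(
                sum(when(col("P").equalTo(lit("sd")).or(col("P").equalTo(lit("hd"))), lit(1))
                        .otherwise(lit(0))).alias("P"),
                count(lit(1)).alias("ROW_CNT"),              // hypothetical extra metric
                countDistinct(col("P")).alias("DISTINCT_P")  // hypothetical extra metric
            );
    }
}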

Related

How to ensure all data belonging to a user goes to the same file when using spark?

We have a use case to prepare a Spark job that will read data from multiple providers, containing info about users in some arbitrary order, and write it back to files in S3. The condition is that all of a user's data must end up in a single file. There are roughly 1 million unique users, and each one has about 10 KB of data at most. We thought of creating at most 1000 files and letting each file contain about 1000 users' records.
We're using the Java DataFrame APIs to create the job against Spark 2.4.0. I can't wrap my head around the most logical way of doing this. Should I do a groupBy on the user-id and then somehow collect the rows until I reach 1000 users and then roll over (if that's even possible), or is there a better way? Any help or a hint in the right direction is much appreciated.
Update:
After following the suggestion from the answer I went ahead with the following code snippet, but I still saw 200 files being written instead of 1000.
Properties props = PropLoader.getProps("PrepareData.properties");
SparkSession spark = SparkSession.builder().appName("prepareData").master("local[*]")
.config("fs.s3n.awsAccessKeyId", props.getProperty(Constants.S3_KEY_ID_KEY))
.config("fs.s3n.awsSecretAccessKey", props.getProperty(Constants.S3_SECERET_ACCESS_KEY)).getOrCreate();
Dataset<Row> dataSet = spark.read().option("header", true).csv(pathToRead);
dataSet.repartition(dataSet.col("idvalue")).coalesce(1000).write().parquet(pathToWrite);
spark.close();
But if I use 100 instead of 1000, then I see 100 files. I then followed the link shared by @Alexandros, and the following code snippet generated more than 20000 files within their individual directories, and the execution time also shot up like crazy.
dataSet.repartition(1000, dataSet.col("idvalue")).write().partitionBy("idvalue").parquet(pathToWrite);
You can use repartition and then the coalesce function:
df.repartition(col("user_id")).coalesce(1000)
df.repartition(1000, col("user_id"))
The first one guarantees there will not be any empty partitions, while in the second solution some partitions might be empty.
Refer : Spark SQL - Difference between df.repartition and DataFrameWriter partitionBy?
https://spark.apache.org/docs/1.6.3/api/java/org/apache/spark/sql/DataFrame.html#coalesce(int)
Update:
To make this work:
dataSet.repartition(dataSet.col("idvalue")).coalesce(1000).write().parquet(pathToWrite);
repartition by a column alone uses spark.sql.shuffle.partitions (default: 200) partitions, and coalesce can only reduce that number, so it doesn't give 1000 files, though it does work for 100 files. To make it work you will have to first repartition to 1000 partitions, which is the same as approach 2.
dataSet.repartition(1000, dataSet.col("idvalue")).write().partitionBy("idvalue").parquet(pathToWrite);
I think the above code will create one million files or more instead of 1000.
dataSet.repartition(1000, dataSet.col("idvalue")).write().parquet(pathToWrite);
It will create 1000 files, but you will have to build a mapping between ids and files by reading each file once you have finished writing them.
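Pulling this together, a rough end-to-end sketch of the last approach in the asker's Spark 2.4 Java setup (the paths are placeholders and the S3 credential configuration from the question is omitted):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PrepareDataSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("prepareData")
            .master("local[*]")
            .getOrCreate();

        Dataset<Row> dataSet = spark.read().option("header", true).csv("s3a://bucket/input/"); // placeholder path

        // Hash-partition by idvalue into exactly 1000 partitions: every row for a given
        // idvalue lands in the same partition, so it ends up in the same output file.
        dataSet.repartition(1000, dataSet.col("idvalue"))
               .write()
               .parquet("s3a://bucket/output/"); // placeholder path

        spark.stop();
    }
}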

spark application from kafka stream takes long time to produce recommendation

I am reading a stream of data in my Spark application from a Kafka stream. My requirement is to produce product recommendations for a user when he makes any request (search/browse, etc.).
I already have a trained model containing scores for users. I am using Java and the org.apache.spark.mllib.recommendation.MatrixFactorizationModel model, which I read once at the start of my Spark application. Whenever there is a browsing event, I call the recommendProducts(user_id, num_of_recommended_products) API to produce recommendations for that user from the existing trained model.
This API takes ~3-5 seconds to generate the result per user, which is very slow, and hence my stream processing lags behind. Are there any ways in which I can optimise the time of this API? I am considering increasing the stream duration from 15 seconds to 1 minute as an optimisation (not sure of its results yet).
Calling recommendProducts in real time doesn't make much sense. Since an ALS model can make predictions only for users that have been seen in the training dataset, it is better to call recommendProductsForUsers once, store the output in a store that supports fast lookups by key, and fetch results from there when needed.
If adding a storage layer is not an option, you can also take the output of recommendProductsForUsers, partition it by id, checkpoint and cache the predictions, and then join it with the input stream by id.
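A minimal sketch of the first suggestion, assuming the Java MLlib API and 10 recommendations per user; the output path and the choice of store are placeholders:

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel;
import org.apache.spark.mllib.recommendation.Rating;

public class PrecomputeRecommendations {
    // Score every known user once with recommendProductsForUsers, then persist the
    // result so the streaming job only has to do a key lookup.
    static void precompute(MatrixFactorizationModel model, String outputPath) {
        JavaPairRDD<Object, Rating[]> recs =
            JavaPairRDD.fromJavaRDD(model.recommendProductsForUsers(10).toJavaRDD());

        // Persist the per-user recommendations; a real deployment would more likely
        // write to a key-value store (Redis, HBase, Cassandra, ...) instead of files.
        recs.saveAsObjectFile(outputPath);
    }
}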

HBase: execute small job using cluster

I have a Java function that runs on a single HBase row (a Result): it takes a Result as input and outputs a byte[]. I would like to run this function on 10K-100K HBase rows and collect the results. I have a List<byte[]> of the row keys I'd like to run this function on; they are distributed evenly across all regions of the table. I would like to do so under these constraints:
Not ship all the rows from the server to the client
No long job init; the entire operation is expected to run in under a second
Utilize processing power of the Hadoop cluster and not the processing power of the client
Obviously, not depend on the size of the HBase table, which can be billions of rows
What's the best way to achieve this? I've thought of these options:
Spark - I'm not sure if this is a good option if my job runs on a tiny % of the number of rows in the table
Coprocessor - is there a way to run coprocessors in bulk on a List<byte[]> of rowkeys and collect the result? Will the work be processed in parallel by the cluster?
Implementing a custom HBase filter and then doing a bulk Get on the List<byte[]> with the custom filter - the Get will be processed by all region servers in parallel and can run custom logic, but this seems like a hack and I'm not sure a custom filter can return data that wasn't present in one of the columns of the row. (A sketch of the plain batch Get, without the custom filter, follows this list.)
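For reference, here is a rough sketch of the plain batch Get from the last option (without the custom filter). Note that it still runs the Result -> byte[] function on the client, so on its own it does not satisfy the "utilize the cluster" constraint; the table name is a placeholder:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;

public class BatchGetSketch {
    interface RowFunction {
        byte[] apply(Result r);   // the existing Result -> byte[] function
    }

    static List<byte[]> process(List<byte[]> rowKeys, RowFunction fn) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("my_table"))) {   // placeholder table
            List<Get> gets = new ArrayList<>(rowKeys.size());
            for (byte[] key : rowKeys) {
                gets.add(new Get(key));
            }
            Result[] results = table.get(gets);   // one batch call, fanned out to the region servers
            List<byte[]> out = new ArrayList<>(results.length);
            for (Result r : results) {
                out.add(fn.apply(r));             // per-row function still runs on the client
            }
            return out;
        }
    }
}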

Processing millions of records from mysql in java and store the result in another database

I have around 15 million records in MySQL (read only) which will be fetched using joins of 10 tables. Around 50,000 new records are inserted daily, and the number will keep increasing in the future.
Each record will be processed independently by a Java program. Multiple processing steps will be applied to the same record, and the output will be calculated based on that processing.
Results will be stored in another database.
Processing shall be completed within an hour.
My questions are
How do I design the processing engine (a cluster of Java programs) in a distributed manner to make the processing as fast as possible? To be more precise, I want to boot many spot instances at that time and finish the processing.
Will MySQL be a read bottleneck?
I don't have any experience with big data solutions. Shall I use Spark or some other MapReduce solution? If yes, then how shall I proceed?
I was in a similar situation where we were collecting about 15 million records per day. What I did was create some collection tables that I rotated and performed the initial processing on. Once that was done, I moved the data to the next phase, where further processing was done before adding it to the large collection of data. Breaking it down will get the best performance and avoid having to run through a large set of data.
I'm not sure what you mean about processing the data and why you want to do it in Java; you may have a good reason for that. I would imagine that performance would be much better if you offloaded that to MySQL and let it do as much of the processing as possible.
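If Spark does get picked for question 3, a rough sketch of what the read side might look like is below: a partitioned JDBC read so the rows are fetched over many parallel connections instead of one. The host names, the pre-joined view, the partition column and its bounds are placeholders, and the per-record processing step is left as a comment:

import java.util.Properties;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class MysqlBatchSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("recordProcessor").getOrCreate();

        Properties props = new Properties();
        props.put("user", "reader");        // placeholder credentials
        props.put("password", "secret");

        // Partitioned JDBC read: 64 tasks each pull a slice of the id range in parallel.
        Dataset<Row> records = spark.read().jdbc(
            "jdbc:mysql://mysql-host:3306/sourcedb",   // placeholder URL
            "joined_records",                          // placeholder view wrapping the 10-table join
            "id",                                      // placeholder numeric partition column
            1L, 15000000L, 64, props);

        // Per-record processing (map / UDFs) would go here.
        Dataset<Row> results = records;

        results.write().mode(SaveMode.Append)
            .jdbc("jdbc:mysql://results-host:3306/resultsdb", "processed_output", props);
    }
}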

Using DynamoDB for Timeseries Data with visualization goal

I've been advised to look into DynamoDB to store timeseries data but I'm not quite sure about it given that my final goal is data visualization.
I have sensors that send data once every 10 minutes and I'd like to visualize the data in some charts with a weekly view by default (1008 data points (datetime/values) per week). Let's suppose that I provision 10,000 Reads/Second (AWS 'default' maximum) and let's assume that 1 record will fit in 1 unit of capacity (1kb).
Besides things getting expensive, does this mean that I cannot even support just 10 clients simultaneously? Am I wrong, or is DynamoDB simply not the right tool for the job?
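For reference, the arithmetic implied by that concern, under the question's own assumption of one capacity unit per 1 KB record: a weekly chart needs 1008 reads, so 10,000 reads/second ÷ 1008 reads per chart ≈ 9.9 full weekly-chart loads per second, i.e. roughly 10 clients rendering a fresh weekly view at the same instant.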
DynamoDB is very good for storing your incoming event data, but it should not be the only tool you work with. You can integrate DynamoDB with other tools:
Put a cache (ElastiCache, for example) in front of your DynamoDB so that repeated queries are served from the cache instead of from DynamoDB.
Put a buffer queue (SQS, for example) in front of your DynamoDB so that your sensors can send their reports at varying rates, while you keep a lower, balanced rate of writes into your DynamoDB.
You can also keep multiple formats of your data inside DynamoDB, based on your access pattern. For example, you can have a single record that holds the data points of a whole week per sensor, and update this single record every 10 minutes instead of appending a new record for every report (a sketch of such an update follows below). The weekly record per sensor can instead be daily or monthly as you see fit. That way you only have to read 1, 7, or some other small number of records to serve a chart.
Update, with a link to more on DynamoDB table design from the DynamoDB documentation: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html
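A minimal sketch of that weekly-record idea, assuming the AWS SDK for Java v1 and a list attribute to which the 10-minute readings are appended; the table name, key attribute names and week-key format are placeholders:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.UpdateItemRequest;

public class WeeklyRecordWriter {
    // Append one 10-minute reading to the single weekly item of a sensor,
    // creating the list attribute on the first write of the week.
    static void appendReading(AmazonDynamoDB client, String sensorId, String weekKey, double value) {
        Map<String, AttributeValue> key = new HashMap<>();
        key.put("sensor_id", new AttributeValue().withS(sensorId));   // placeholder hash key
        key.put("week", new AttributeValue().withS(weekKey));         // placeholder range key, e.g. "2017-W34"

        Map<String, AttributeValue> values = new HashMap<>();
        values.put(":r", new AttributeValue().withL(new AttributeValue().withN(Double.toString(value))));
        values.put(":empty", new AttributeValue().withL(new ArrayList<AttributeValue>()));

        UpdateItemRequest request = new UpdateItemRequest()
            .withTableName("sensor_weekly")   // placeholder table name
            .withKey(key)
            .withUpdateExpression("SET readings = list_append(if_not_exists(readings, :empty), :r)")
            .withExpressionAttributeValues(values);

        client.updateItem(request);
    }

    public static void main(String[] args) {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();
        appendReading(client, "sensor-42", "2017-W34", 21.5);
    }
}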
