DataSet javaRDD() performance - java

I'm retrieving data from Cassandra in a Spark application using Spark SQL. The data is retrieved as a Dataset; however, I need to convert this Dataset to a JavaRDD using the javaRDD() function. It works, but it takes about 2 hours. Are there any parameters I can adjust to improve this time?

Dataset APIs are built on top of the Spark SQL engine, which uses Catalyst to generate an optimized logical and physical query plan. Across the R, Java, Scala and Python DataFrame/Dataset APIs, all relational queries go through the same optimizer, which provides the space and speed efficiency. Whereas the typed Dataset[T] API is optimized for data engineering tasks, the untyped Dataset[Row] (an alias of DataFrame) is even faster and well suited to interactive analysis.
For more details, see Spark RDD vs Dataset performance.

Resilient Distributed Dataset (RDD) is the main abstraction of the Spark framework, while Spark SQL (a Spark module for structured data processing) gives Spark more information about the structure of both the data and the computation being performed, and uses this extra information to perform additional optimizations.
Up until Spark 1.6, RDDs used to perform better than their Spark SQL counterpart, the DataFrame (https://community.hortonworks.com/articles/42027/rdd-vs-dataframe-vs-sparksql.html); however, the Spark 2.1 upgrades have made Spark SQL considerably more efficient:
(Performance comparison chart: https://i.stack.imgur.com/TmhXf.png)
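To illustrate the point, here is a minimal Java sketch (assuming Spark 2.x and the spark-cassandra-connector on the classpath; the keyspace, table and column names are hypothetical) that loads a Cassandra table through Spark SQL and only drops down to the RDD API after Catalyst has had a chance to prune and filter the data:

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class CassandraDatasetExample {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("cassandra-dataset")
                    .getOrCreate();

            // Read from Cassandra through Spark SQL; projections and filters applied on the
            // Dataset are optimized by Catalyst and can be pushed down by the connector.
            Dataset<Row> events = spark.read()
                    .format("org.apache.spark.sql.cassandra")
                    .option("keyspace", "my_keyspace")   // hypothetical
                    .option("table", "events")           // hypothetical
                    .load()
                    .select("user_id", "event_time")     // prune columns before leaving the optimized plan
                    .filter("event_time > '2018-01-01'");

            // Only convert to the RDD API at the very end, after filtering/projection,
            // so the expensive conversion touches as little data as possible.
            JavaRDD<Row> rows = events.javaRDD();
            System.out.println(rows.count());
            spark.stop();
        }
    }

Pushing the select/filter into the Dataset before calling javaRDD() lets Catalyst and the connector cut down the amount of data that has to be materialized as RDD rows.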

Related

Generic approach to mirroring data from Oracle to another database

We have a source Oracle database with a lot of tables (say 100) that we need to mirror to a target database, i.e. copy data increments periodically into the target tables. The target database is currently Oracle, but in the near future it will probably be changed to a different database technology.
Currently we could create a PL/SQL procedure that dynamically generates the DML (insert, update or merge statements) for each table from the Oracle metadata (assuming that the source and target tables have exactly the same attributes).
But we would rather build a database-technology-independent solution, so that when we change the target database to another one (e.g. MS SQL or Postgres) we will not need to change the whole data-mirroring logic.
Does anyone have a suggestion for how to do this differently (preferably in Java)?
Thanks for any advice.
The problem you have is called CDC - change data capture. In the case of Oracle this is complicated, because Oracle usually charges extra for this capability.
So you can use:
PL/SQL or Java with plain SQL to incrementally detect changes in the data (see the sketch after this list). It requires plenty of work and performance is poor.
Tools based on Oracle triggers, which detect data changes and push them into some queue.
Tools that can parse the content of the Oracle archive logs. These are commercial products: GoldenGate (from Oracle) and SharePlex (Quest, formerly Dell). GoldenGate also offers a Java API (XStream) that allows you to inject a Java visitor into the data stream. These technologies also support pushing data changes into a Kafka stream.
There are also plenty of tools, like Debezium, Informatica and Tibco, that cannot parse the archived logs themselves but instead rely on Oracle's built-in LogMiner tool. These tools usually do not scale well and cannot cope with higher data volumes.
In summary: if you have the money, pick GoldenGate or SharePlex; if you don't, pick Debezium or any other Java CDC project based on LogMiner.
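As a rough illustration of the first option, here is a minimal JDBC sketch. It assumes every source table carries a LAST_MODIFIED timestamp column; the table, columns and the Oracle-style MERGE statement are hypothetical:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Timestamp;

    public class IncrementalCopy {
        // Copies rows changed since the given timestamp from the source to the target table.
        public static void copyChanges(Connection source, Connection target, Timestamp since) throws SQLException {
            String select = "SELECT ID, NAME, LAST_MODIFIED FROM CUSTOMERS WHERE LAST_MODIFIED > ?";
            // MERGE keeps the target in sync whether the row is new or updated (Oracle syntax shown).
            String merge = "MERGE INTO CUSTOMERS t USING (SELECT ? AS ID, ? AS NAME FROM dual) s "
                         + "ON (t.ID = s.ID) "
                         + "WHEN MATCHED THEN UPDATE SET t.NAME = s.NAME "
                         + "WHEN NOT MATCHED THEN INSERT (ID, NAME) VALUES (s.ID, s.NAME)";
            try (PreparedStatement read = source.prepareStatement(select);
                 PreparedStatement write = target.prepareStatement(merge)) {
                read.setTimestamp(1, since);
                try (ResultSet rs = read.executeQuery()) {
                    while (rs.next()) {
                        write.setLong(1, rs.getLong("ID"));
                        write.setString(2, rs.getString("NAME"));
                        write.addBatch();
                    }
                }
                write.executeBatch();
            }
        }
    }

This illustrates why the hand-rolled approach is labor-intensive: you need a reliable change marker on every table, deletes are not captured at all, and the generated DML has to be maintained per target database dialect.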

Update a huge text file using Apache Spark

I have around 300GB of full data, and every day I get an update of around 10GB on top of it. Both files are in text format. I would like to update the full data based on the daily updates. How do I handle this situation with Apache Spark in a distributed manner?
I have tried to create a JavaRDD with a map function that overrides the call method, and converted the two files to Dataset[Row]. Now I'm planning to run Spark SQL join queries over the datasets. Is this the right approach? Can anyone guide me here, as this is my first step with Apache Spark?
How do I achieve the parallel processing here?
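A minimal Java sketch of the join-based approach described above, assuming Spark 2.x, comma-delimited text files whose first field is the record key, and hypothetical file paths:

    import static org.apache.spark.sql.functions.col;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class DailyMerge {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("daily-merge").getOrCreate();

            // Load both text files as Datasets; the first field is assumed to be the record key.
            Dataset<Row> full  = spark.read().csv("/data/fulldata.txt").withColumnRenamed("_c0", "key");
            Dataset<Row> delta = spark.read().csv("/data/update.txt").withColumnRenamed("_c0", "key");

            // Keep only the full-data rows whose key does NOT appear in the delta (anti join)...
            Dataset<Row> deltaKeys = delta.select(col("key").alias("dkey"));
            Dataset<Row> untouched = full.join(deltaKeys, col("key").equalTo(col("dkey")), "left_anti");

            // ...then add every delta row (new or updated) to get the refreshed full data.
            Dataset<Row> merged = untouched.union(delta);

            // Write the result back out as the new full data.
            merged.write().mode("overwrite").csv("/data/fulldata_new");
            spark.stop();
        }
    }

Spark distributes the join, union and write across the cluster on its own, so the parallel processing comes for free once the files are loaded as Datasets.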

Spring JPA and Streaming - Is the data fetched incrementally?

I am looking at the streaming query results section of the Spring documentation. Does this functionality fetch all the data at once but provide it as a stream? Or does it fetch data incrementally so that it is more memory efficient?
If it doesn't fetch data incrementally, is there any other way to achieve this with spring data jpa?
It depends on your platform.
Instead of simply wrapping the query results in a Stream, data-store-specific methods are used to perform the streaming.
With MySQL, for example, the streaming is performed in a truly streaming fashion, but of course if the underlying datastore (or the driver being used) doesn't support such a mechanism (yet), it won't make a difference.
MySQL is, IIRC, currently the only driver that provides streaming in this fashion without additional configuration, whereas other databases/drivers rely on the standard fetch-size setting, as described by the venerable Vlad Mihalcea here: https://vladmihalcea.com/whats-new-in-jpa-2-2-stream-the-result-of-a-query-execution/ (note the trade-off between performance and memory use). Other databases are most likely going to need a reactive database client in order to perform true streaming.
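For illustration, a minimal sketch of a streaming repository method, assuming Spring Data JPA with Hibernate on top of MySQL; the Customer entity, repository and service below are hypothetical:

    import java.util.stream.Stream;

    import javax.persistence.Entity;
    import javax.persistence.Id;
    import javax.persistence.QueryHint;

    import org.springframework.data.jpa.repository.Query;
    import org.springframework.data.jpa.repository.QueryHints;
    import org.springframework.data.repository.Repository;
    import org.springframework.stereotype.Service;
    import org.springframework.transaction.annotation.Transactional;

    @Entity
    class Customer {      // hypothetical entity
        @Id
        Long id;
        String name;
    }

    interface CustomerRepository extends Repository<Customer, Long> {

        // -2147483648 is Integer.MIN_VALUE, which the MySQL driver interprets as
        // "stream results row by row"; other drivers treat the hint as a plain fetch size.
        @QueryHints(@QueryHint(name = "org.hibernate.fetchSize", value = "-2147483648"))
        @Query("select c from Customer c")
        Stream<Customer> streamAll();
    }

    @Service
    class CustomerExporter {
        private final CustomerRepository repository;

        CustomerExporter(CustomerRepository repository) {
            this.repository = repository;
        }

        // The stream must be consumed inside a (read-only) transaction and closed when done,
        // hence the try-with-resources block.
        @Transactional(readOnly = true)
        public long countLongNames() {
            try (Stream<Customer> customers = repository.streamAll()) {
                return customers.filter(c -> c.name != null && c.name.length() > 10).count();
            }
        }
    }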
Whatever the underlying streaming method, what matters most is how you process the stream. Using Spring's StreamingResponseBody, for example, would allow you to stream large amounts of data directly from the database to the client with minimal memory use. Still, it's a very specific use case, so don't start streaming everything just yet unless you're sure it's worth it.

Mongodb map reduce vs Apache Spark map reduce

I have a use case in which I have 3M records in my MongoDB.
I want to aggregate data based on some condition.
I found two ways to accomplish it:
Using MongoDB's map-reduce query functionality
Using Apache Spark's map-reduce functionality by connecting MongoDB to Spark.
I successfully executed my use case using both methods and found their performance to be similar.
My questions are: do MongoDB and Apache Spark use the same map-reduce algorithm, and which method (map-reduce using Spark or native MongoDB map-reduce) is more efficient?
In the broad sense of the map-reduce algorithm, yes, although the implementations are different (i.e. JavaScript vs. a Java JAR).
If your question is more about the suitability of the two for your use case, you should consider other aspects, especially since you've found both to perform similarly for your use case. Let's explore below:
Assuming that you have the resources (time, money, servers) and expertise to maintain an Apache Spark cluster alongside the MongoDB cluster, then having a separate processing framework (Spark) and data storage (MongoDB) is ideal: the MongoDB servers keep their CPU/RAM for database querying, and the Spark nodes keep theirs for intensive ETL. Afterwards, write the result of the processing back into MongoDB.
If you are using the MongoDB Connector for Apache Spark, you can take advantage of the Aggregation Pipeline and (secondary) indexes to ETL only the range of data Spark needs, as opposed to pulling unnecessary data onto the Spark nodes, which means more processing overhead, hardware requirements and network latency (see the sketch after the resource list below).
You may find the following resources useful:
MongoDB Connector for Spark: Getting started - contains example for aggregation.
MongoDB Spark Connector Java API
M233: Getting started with Spark and MongoDB - free online course
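For illustration, a minimal Java sketch of pipeline push-down with the MongoDB Connector for Spark, assuming connector 2.x; the URI, database, collection and $match condition are hypothetical:

    import static java.util.Collections.singletonList;

    import com.mongodb.spark.MongoSpark;
    import com.mongodb.spark.rdd.api.java.JavaMongoRDD;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.bson.Document;

    public class MongoSparkPushdown {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                    .setAppName("mongo-pushdown")
                    .set("spark.mongodb.input.uri", "mongodb://localhost/shop.orders");
            try (JavaSparkContext jsc = new JavaSparkContext(conf)) {
                JavaMongoRDD<Document> orders = MongoSpark.load(jsc);

                // The $match stage runs inside MongoDB (and can use indexes)
                // before any documents are shipped to the Spark executors.
                JavaMongoRDD<Document> completed = orders.withPipeline(
                        singletonList(Document.parse("{ $match: { status: 'COMPLETED' } }")));

                System.out.println(completed.count());
            }
        }
    }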
If you don't have the resources and expertise to maintain a Spark cluster, then keep it in MongoDB. It's worth mentioning that for most aggregation operations, the Aggregation Pipeline provides better performance and a more coherent interface than MongoDB's map-reduce, so if you can convert your map-reduce into an aggregation pipeline, I would recommend doing so. Also see Aggregation Pipeline Optimisation for extra optimisation tips.
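For example, a count-or-sum-by-field map-reduce can usually be expressed as a short pipeline; here is a minimal sketch with the MongoDB Java driver (3.7+), with hypothetical database, collection and field names:

    import static com.mongodb.client.model.Accumulators.sum;
    import static com.mongodb.client.model.Aggregates.group;
    import static com.mongodb.client.model.Aggregates.match;
    import static com.mongodb.client.model.Filters.gte;

    import java.util.Arrays;

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;

    public class AggregationExample {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> orders =
                        client.getDatabase("shop").getCollection("orders");

                // Equivalent of a map-reduce that emits (status, amount) and sums per status,
                // but executed by the aggregation framework ($match can also use indexes).
                orders.aggregate(Arrays.asList(
                        match(gte("created", "2018-01-01")),
                        group("$status", sum("total", "$amount"))
                )).forEach(doc -> System.out.println(doc.toJson()));
            }
        }
    }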
If your use case doesn't require real-time processing, you can configure a delayed or hidden node of a MongoDB Replica Set, which will serve as a dedicated server/instance for your aggregation/map-reduce processing, separating the processing node(s) from the data-storage node(s). See also Replica Set Architectures.

A way to analyze and compute huge Oracle data on a daily basis

I need to calculate summary data from various transaction tables in the primary Oracle database of our core engine. I have planned to write this as a multi-threaded Java program scheduled as a job that runs every midnight; the program will extract the data from various transaction log tables, joining other tables with them, then calculate and store the result back into a separate table. The log tables usually contain millions of rows, with some tables partitioned daily and some monthly.
The GUI (dashboard) platform would request this information through a separate web service that already exists to provide various other details. Almost all the modules in the project use the Spring framework, so I thought of using Spring Batch with its scheduling capability. As I started some research before beginning the design, I found various other techniques in use, such as ETL tools, scheduling in the database itself, real-time data analysis and other similar approaches.
Am I over-complicating the problem at hand? Is my earlier approach the right one? Or is there a way, or a Java framework, to do this processing in real time, say while the core engine is processing the data, so that there is no need to write a separate job for the calculation?
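For what it's worth, here is a minimal sketch of the Spring Batch approach mentioned above, assuming Spring Batch 4.x; the SQL, table and column names are hypothetical. It uses a cursor-based reader so the millions of log rows are streamed in chunks rather than loaded into memory at once:

    import java.util.Map;

    import javax.sql.DataSource;

    import org.springframework.batch.core.Job;
    import org.springframework.batch.core.Step;
    import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
    import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
    import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
    import org.springframework.batch.item.database.JdbcCursorItemReader;
    import org.springframework.batch.item.database.builder.JdbcCursorItemReaderBuilder;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.jdbc.core.ColumnMapRowMapper;

    @Configuration
    @EnableBatchProcessing
    public class NightlySummaryJobConfig {

        // Cursor-based reader: rows are streamed from Oracle instead of being held in memory.
        @Bean
        public JdbcCursorItemReader<Map<String, Object>> logReader(DataSource dataSource) {
            return new JdbcCursorItemReaderBuilder<Map<String, Object>>()
                    .name("transactionLogReader")
                    .dataSource(dataSource)
                    .sql("SELECT account_id, amount FROM transaction_log WHERE log_date = TRUNC(SYSDATE) - 1")
                    .rowMapper(new ColumnMapRowMapper())
                    .fetchSize(1000)
                    .build();
        }

        @Bean
        public Step summarizeStep(StepBuilderFactory steps,
                                  JdbcCursorItemReader<Map<String, Object>> logReader) {
            return steps.get("summarizeStep")
                    .<Map<String, Object>, Map<String, Object>>chunk(1000)  // read/aggregate/write in chunks
                    .reader(logReader)
                    .processor(row -> row)                                   // plug your summarization logic in here
                    .writer(rows -> { /* e.g. a JdbcBatchItemWriter into the summary table */ })
                    .build();
        }

        @Bean
        public Job nightlySummaryJob(JobBuilderFactory jobs, Step summarizeStep) {
            // Trigger this job from a scheduler (cron, @Scheduled, etc.) every midnight.
            return jobs.get("nightlySummaryJob").start(summarizeStep).build();
        }
    }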
You can have a look at Spring XD, which is an engine for processing high-volume data.
Spring XD offers a lot of readers (JDBC, file, JMS), processors and writers (JDBC, file, JMS) out of the box, and you can easily write your own readers, writers and processors.
Spring XD uses the Unix-style source, pipe, sink model to connect multiple processors. You can see this post for a small example of applying Spring XD to high-volume Twitter data.
