Processing more than 100k records of data - java

I am developing a spring-mvc application.
I have a requirement to process more than 100k records of data, and I can't make it database dependent, so I have to implement all the logic in Java.
For now I am creating a number of threads and assigning, say, 1,000 records to each thread to process.
I am using org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor.
Questions:
What is the suggested number of threads that I should use?
Should I divide the records equally among the threads, or
should I give a predefined number of records to each thread and increase the number of threads?
Is ThreadPoolTaskExecutor OK, or should I use something else?
Should I maintain the record ids assigned to each thread in Java or in the database? (Note: if I use the database, I have to make an extra database call for each record and update it after processing that record.)
Can anyone please suggest best practices for this scenario?
Any kind of suggestion would be great.
Note: execution time is the main concern.
Update:
Processing includes a huge number of database calls.
You can think of it as a search done in Java: take one record, compare it (in Java) with other records from the DB, then take another record and do the same.
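For reference, a rough sketch of my current setup with ThreadPoolTaskExecutor, assuming a placeholder MyRecord type and that the 100k records have already been split into sub-lists of roughly 1,000:

```java
import java.util.List;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

public class ChunkDispatcher {

    public void dispatch(List<List<MyRecord>> chunks) {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(8);       // tune against CPU cores and DB capacity
        executor.setMaxPoolSize(16);
        executor.setQueueCapacity(500);    // backlog of chunks waiting for a free thread
        executor.initialize();

        for (List<MyRecord> chunk : chunks) {
            executor.submit(() -> processChunk(chunk));   // one task per ~1,000-record chunk
        }
    }

    private void processChunk(List<MyRecord> chunk) {
        // per-record logic: compare each record against other records from the DB
        // (MyRecord is a placeholder for the actual record type)
    }
}
```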

In order to process a huge amount of data, you can use the Spring Batch framework.
Check the documentation and the wiki page.
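As a rough illustration only, here is a minimal chunk-oriented step and job definition, assuming Spring Batch 4.x builder factories and a placeholder MyRecord type; the reader, processor, and writer beans would wrap your own loading and comparison logic:

```java
import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
@EnableBatchProcessing
public class RecordJobConfig {

    @Bean
    public Step processStep(StepBuilderFactory steps,
                            ItemReader<MyRecord> reader,
                            ItemProcessor<MyRecord, MyRecord> processor,
                            ItemWriter<MyRecord> writer) {
        return steps.get("processStep")
                .<MyRecord, MyRecord>chunk(1000)   // read/process/write in commits of 1,000
                .reader(reader)
                .processor(processor)
                .writer(writer)
                .build();
    }

    @Bean
    public Job processJob(JobBuilderFactory jobs, Step processStep) {
        return jobs.get("processJob").start(processStep).build();
    }
}
```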

ExecutorService should be fine for you; there is no need to use Spring for this. The thread count is the tricky part. I can only say it depends; why not experiment to figure out the optimal number?
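For illustration, a minimal sketch with a plain ExecutorService, assuming the records are already in memory and a hypothetical process(MyRecord) method holds the per-record comparison; the pool size is just a starting point to tune by measurement:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelProcessor {

    public void processAll(List<MyRecord> records) throws InterruptedException {
        // IO-heavy work (many DB calls) usually tolerates more threads than CPU cores
        int threads = Runtime.getRuntime().availableProcessors() * 2;
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        int chunkSize = 1000;
        for (int start = 0; start < records.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, records.size());
            List<MyRecord> chunk = records.subList(start, end);
            pool.submit(() -> chunk.forEach(this::process));   // one task per chunk
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);   // upper bound; adjust to your SLA
    }

    private void process(MyRecord record) {
        // compare this record against other records from the DB
    }
}
```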

Related

Multi threading in Spring Boot

I have a DynamoDB table with 1000+ rows. I need to write a Spring Boot app that reads the table rows one by one and makes a REST call to another service that accepts one JSON payload at a time. Looping through the table one row at a time does not seem to be an optimal solution. Can this be achieved with multi-threading, and if so, how? Or is there a better option? Can someone help?
You can read N records at a time, for example 50. You can use more than one thread to read records from the database, since you are only reading and not writing. Once you have read the records, you can create N threads from an ExecutorService to call the external service; each thread handles the data of one record, as in the sketch below.
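A loose sketch of that idea; fetchBatch and callExternalService are hypothetical placeholders (the real code would page through DynamoDB and POST the JSON), and only the threading structure is the point here:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class TableFanOut {

    private final ExecutorService callers = Executors.newFixedThreadPool(20); // bounded parallel REST calls

    public void run() throws InterruptedException {
        int pageSize = 50;
        int offset = 0;
        List<Row> batch;
        while (!(batch = fetchBatch(offset, pageSize)).isEmpty()) {
            for (Row row : batch) {
                callers.submit(() -> callExternalService(row));   // one record per task
            }
            offset += pageSize;
        }
        callers.shutdown();
        callers.awaitTermination(30, TimeUnit.MINUTES);
    }

    private List<Row> fetchBatch(int offset, int size) {
        // placeholder: page through the table (e.g. a scan with an exclusive start key)
        return List.of();
    }

    private void callExternalService(Row row) {
        // placeholder: serialize the row to JSON and POST it to the other service
    }

    private static class Row { /* table attributes */ }
}
```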
You could use @Async. It's one of the easiest ways of running work on multiple threads in Spring Boot.
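A small sketch of the @Async route, assuming a hypothetical ExternalClient wrapper around the REST call; @EnableAsync must be present on a configuration class, and the method has to be invoked through the Spring proxy for the annotation to take effect:

```java
import java.util.concurrent.CompletableFuture;
import org.springframework.scheduling.annotation.Async;
import org.springframework.stereotype.Service;

@Service
public class RecordDispatchService {

    private final ExternalClient client;   // hypothetical wrapper around the REST call

    public RecordDispatchService(ExternalClient client) {
        this.client = client;
    }

    @Async   // each call runs on the task executor configured via @EnableAsync
    public CompletableFuture<Void> send(String jsonPayload) {
        client.post(jsonPayload);          // one JSON document per invocation
        return CompletableFuture.completedFuture(null);
    }
}
```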
First of all, why do you have to read the DB table one row at a time? You can issue just one SQL statement and get all the rows you need (in your repository layer if it's a Spring Boot app) and then use multi-threading as described by others (@Async, ExecutorService, etc.) in your service classes.

Processing millions of records from MySQL in Java and storing the result in another database

I have around 15 million records in MySQL (read only) which are fetched using joins of 10 tables. Around 50,000 new records are inserted daily. The number will keep increasing in the future.
Each record will be processed independently by a Java program. Multiple processing steps will be applied to the same record, and the output will be calculated based on that processing.
Results will be stored in another database.
Processing must be completed within an hour.
My questions are:
How do I design the processing engine (a cluster of Java programs) in a distributed manner to make the processing as fast as possible? To be more precise, I want to boot many spot instances at that time and finish the processing.
Will MySQL be a read bottleneck?
I don't have any experience with big data solutions. Should I use Spark or some other MapReduce solution? If so, how should I proceed?
I was in a similar situation where we were collecting about 15 million records per day. What I did was create some collection tables that I rotated and performed the initial processing on. Once that was done, I moved the data to the next phase, where further processing was done before adding it to the large collection of data. Breaking it down this way gives the best performance and avoids having to run through one large set of data.
I'm not sure what you mean about processing the data or why you want to do it in Java; you may have a good reason for that. I would imagine that performance would be much better if you offloaded as much of the processing as possible to MySQL and let it do the work.

Spring Batch Job Design - Multiple Readers

I’m struggling with how to design a Spring Batch job. The overall goal is to retrieve ~20 million records and save them to a SQL database.
I’m doing it in two parts. First I retrieve the 20 million ids of the records I want to retrieve and save those to a file (or DB). This is a relatively fast operation. Second, I loop through my file of Ids, taking batches of 2,000, and retrieve their related records from an external service. I then repeat this, 2,000 Ids at a time, until I’ve retrieved all of the records. For each batch of 2,000 records I retrieve, I save them to a database.
Some may be asking why I’m doing this in two steps. I eventually plan to make the second step run in parallel so that I can retrieve batches of 2,000 records in parallel and hopefully greatly speed up the download. Having the Ids allows me to partition the job into batches. For now, let’s not worry about parallelism and just focus on how to design a simpler sequential job.
Imagine I already have solved the first problem of saving all of the Ids locally. They are in a file, one Id per line. How do I design the steps for the second part?
Here’s what I’m thinking…
Read 2,000 Ids using a flat file reader. I’ll need an aggregator since I only want to do one query to my external service for each batch of 2K Ids. This is where I’m struggling. Do I nest a series of readers? Or can I do ‘reading’ in the processor or writer?
Essentially, my problem is that I want to read lines from a file, aggregate those lines, and then immediately do another ‘read’ to retrieve the respective records. I almost want to chain readers together.
Finally, once I’ve retrieved the records from the external service, I’ll have a List of records. Which means when they arrive at the Writer, I’ll have a list of lists. I want a list of objects so that I can use the JdbcItemWriter out of the box.
Thoughts? Hopefully that makes sense.
Andrew
This is a matter of design and is subjective, but based on the Spring Batch example I found (from SpringSource) and my personal experience, the pattern of doing additional reading in the processor step is a good solution to this problem. You can also chain together multiple processors/readers in the 'processor' step. So, while the names don't exactly match, I find myself doing more and more 'reading' in my processors.
http://docs.spring.io/spring-batch/trunk/reference/html/patterns.html#drivingQueryBasedItemReaders
Given that you want to call your external service just once per chunk of 2,000 records, you'll actually want to do this service call in an ItemWriter. That is the standard recommended way to do chunk-level processing.
You can create a custom ItemWriter<Long> implementation. It will receive the list of 2,000 IDs as input and call the external service. The result from the external service should allow you to create a List<Item>. Your writer can then simply forward this List<Item> to your JdbcItemWriter<Item> delegate.
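Something along these lines, as a sketch only: it assumes Spring Batch 4.x's List-based write signature, a hypothetical ExternalService with a fetchByIds method, and a delegate that would typically be a JdbcBatchItemWriter<Item> configured elsewhere:

```java
import java.util.List;
import org.springframework.batch.item.ItemWriter;

public class ExternalServiceWriter implements ItemWriter<Long> {

    private final ExternalService externalService;   // hypothetical client for the remote call
    private final ItemWriter<Item> delegate;          // e.g. a JdbcBatchItemWriter<Item>

    public ExternalServiceWriter(ExternalService externalService, ItemWriter<Item> delegate) {
        this.externalService = externalService;
        this.delegate = delegate;
    }

    @Override
    public void write(List<? extends Long> ids) throws Exception {
        // one external call per chunk of 2,000 ids
        List<Item> items = externalService.fetchByIds(ids);
        delegate.write(items);
    }
}
```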

Is JDBC multi-threaded insert possible?

I'm currently working on a Java project in which I need to prepare a big (to me) MySQL database. I have to do web scraping using Jsoup and store the results in my database as well. As I estimate, I will have roughly 1,500,000 to 2,000,000 records to insert. In my first trial, I just used a loop to insert these records, and it took one week to insert about 1/3 of the required records, which I think is too slow. Is it possible to make this process multi-threaded, so that I can split my records into 3 sets, say 500,000 records per set, and insert them into one database (one table, specifically)?
Multi-threading isn't going to help you here. You'll just move the contention bottleneck from your app server to the database.
Instead, try using batch inserts; they generally make this sort of thing orders of magnitude faster. See "3.4 Making Batch Updates" in the JDBC tutorial.
Edit: As @Jon commented, you need to decouple the fetching of the web pages from their insertion into the database, otherwise the whole process will go at the speed of the slowest operation. You could have multiple threads fetching web pages and adding the data to a queue, and then a single thread draining the queue into the database using batch inserts.
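A condensed sketch of that pipeline; the connection URL, table, and scrape method are placeholders, and the point is only the shape: several fetcher threads feed a queue, and one consumer drains it with PreparedStatement batches:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

public class ScrapeAndLoad {

    private final BlockingQueue<String[]> queue = new LinkedBlockingQueue<>(10_000);

    public void run(List<String> urls) throws Exception {
        ExecutorService fetchers = Executors.newFixedThreadPool(8);   // parallel page fetches
        for (String url : urls) {
            fetchers.submit(() -> {
                try {
                    queue.put(scrape(url));   // hypothetical Jsoup-based extraction
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        fetchers.shutdown();

        // single consumer: drain the queue and insert with JDBC batches
        try (Connection con = DriverManager.getConnection("jdbc:mysql://localhost/scraper", "user", "pass");
             PreparedStatement ps = con.prepareStatement("INSERT INTO pages(url, title) VALUES (?, ?)")) {
            con.setAutoCommit(false);
            int pending = 0;
            for (int remaining = urls.size(); remaining > 0; remaining--) {
                String[] row = queue.take();      // blocks until a scraped row arrives
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.addBatch();
                if (++pending == 1000) {          // flush every 1,000 rows
                    ps.executeBatch();
                    con.commit();
                    pending = 0;
                }
            }
            ps.executeBatch();                    // flush the trailing partial batch
            con.commit();
        }
    }

    private String[] scrape(String url) {
        return new String[] {url, "placeholder title"};   // stand-in for the real Jsoup code
    }
}
```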
Just make sure two (or more) threads don't use the same connection at the same time; using a connection pool resolves that. C3P0 and Apache DBCP come to mind...
You can insert these records from different threads provided they use different primary key values.
You should also look at Spring Batch which I believe will be useful in your case.
You can chunk your record set into batches and do this, but you should perhaps think about other factors as well.
Are you doing a network round trip for each INSERT? If so, latency could be the real enemy. Try batching those requests to cut down on network traffic.
Do you have transactions turned on? If so, the size of the rollback log could be the problem.
I'd recommend profiling the app server and the database server to see where the time is being spent. You can waste a lot of time guessing about the root cause.
I think a multi-threaded approach would be useful for your issue, but you have to use a connection pool such as C3P0 or the Tomcat 7 connection pool for better performance.
Another solution is to use a batch-operation provider such as Spring Batch; other utilities for batch operations exist as well.
Another solution is to use a PL/SQL procedure with a structured input parameter.

How to retrieve huge (>2000) amount of entities from GAE datastore in under 1 second?

We have a part of our application that needs to load a large set of data (>2000 entities) and perform a computation on this set. The size of each entity is approximately 5 KB.
In our initial, naïve implementation, the bottleneck seems to be the time required to load all the entities (~40 seconds for 2000 entities), while the time required to perform the computation itself is very small (<1 second).
We have tried several strategies to speed up entity retrieval:
Splitting the retrieval request into several parallel instances and then merging the result: ~20 seconds for 2000 entities.
Storing the entities at an in-memory cache placed on a resident backend: ~5 seconds for 2000 entities.
The computation needs to be dynamically computed, so doing a precomputation at write time and storing the result does not work in our case.
We are hoping to be able to retrieve ~2000 entities in just under one second. Is this within the capability of GAE/J? Are there any other strategies we might be able to implement for this kind of retrieval?
UPDATE: Supplying additional information about our use case and parallelization result:
We have more than 200,000 entities of the same kind in the datastore, and the operation is retrieval-only.
We experimented with 10 parallel worker instances, and a typical result that we obtained could be seen in this pastebin. It seems that the serialization and deserialization required when transferring the entities back to the master instance hampers the performance.
UPDATE 2: Giving an example of what we are trying to do:
Let's say that we have a StockDerivative entity that needs to be analyzed to know whether it's a good investment or not.
The analysis performed requires complex computations based on many factors, both external (e.g. the user's preferences, market conditions) and internal (i.e. from the entity's properties), and would output a single "investment score" value.
The user could request the derivatives to be sorted by investment score and ask to be presented with the N highest-scored derivatives.
200,000 × 5 KB is 1 GB. You could keep all of this in memory on the largest backend instance, or use multiple instances. This would be the fastest solution; nothing beats memory.
Do you need the whole 5 KB of each entity for the computation?
Do you need all 200k entities when querying, before the computation? Do the queries touch all entities?
Also, check out BigQuery. It might suit your needs.
Use Memcache. I cannot guarantee that it will be sufficient, but if it isn't, you probably have to move to another platform.
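If you go that route, a tiny sketch with the low-level GAE Memcache API; StockDerivative is the entity from the question, the key scheme is made up, and each cached value has to be Serializable and under about 1 MB:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import com.google.appengine.api.memcache.MemcacheService;
import com.google.appengine.api.memcache.MemcacheServiceFactory;

public class DerivativeCache {

    private final MemcacheService cache = MemcacheServiceFactory.getMemcacheService();

    // Serve each entity from memcache when possible, falling back to the datastore
    // only for the misses, then repopulate the cache for the next request.
    public Map<String, StockDerivative> load(List<String> ids) {
        Map<String, Object> hits = cache.getAll(ids);
        Map<String, StockDerivative> result = new HashMap<>();
        Map<String, StockDerivative> toCache = new HashMap<>();
        for (String id : ids) {
            Object cached = hits.get(id);
            if (cached instanceof StockDerivative) {
                result.put(id, (StockDerivative) cached);
            } else {
                StockDerivative loaded = loadFromDatastore(id);   // placeholder slow path
                result.put(id, loaded);
                toCache.put(id, loaded);
            }
        }
        cache.putAll(toCache);
        return result;
    }

    private StockDerivative loadFromDatastore(String id) {
        return new StockDerivative();   // placeholder for the real datastore fetch
    }
}
```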
This is very interesting, but yes, it's possible, and I've seen some mind-boggling results.
I would have done the same: the map-reduce concept.
It would be great if you could provide more metrics: how many parallel instances do you use, and what are the results of each instance?
Also, does your process involve retrieval alone, or retrieval and storing?
How many elements do you have in your datastore? 4000? 10000? The reason I ask is that you could cache them from the previous request.
In the end, it does not appear that we can retrieve >2000 entities from a single instance in under one second, so we are forced to use in-memory caching placed on our backend instance, as described in the original question. If someone comes up with a better answer, or if we find a better strategy/implementation for this problem, I will change or update the accepted answer.
Our solution involves periodically reading entities in a background task and storing the result in a JSON blob. That way we can quickly return more than 100k rows. All filtering and sorting is done in JavaScript using SlickGrid's DataView model.
As someone has already commented, MapReduce is the way to go on GAE. Unfortunately the Java library for MapReduce is broken for me, so we're using a non-optimal task to do all the reading, but we're planning to get MapReduce going in the near future (and/or the Pipeline API).
Mind that, last time I checked, the Blobstore wasn't returning gzipped entities > 1 MB, so at the moment we're loading the content from a compressed entity and expanding it into memory; that way the final payload gets gzipped. I don't like that, since it introduces latency. I hope they fix the issues with GZIP soon!
