Predicate splitting and parallel processing - java

I'm new to Spring Batch, and I don't know how to come up with the right solution for my problem.
I have a CSV file of a million or two of records. These records are grouped by an id.
id;head-x;head-y;...
1;;;
1;;;
1;;;
...
1;;;
2;;;
2;;;
2;;;
...
2;;;
3;;;
3;;;
...
3;;;
...
...
What I want is to process these records as groups: read all the records of group 1, process and convert them into a business model, and save it to my database.
I need to do this work in parallel to speed up processing. If possible, I want to process groups 2 and 3 at the same time as group 1.
I've started with StepBuilderFactory#chunk(), but that gives me fixed-size chunks: I can get multiple groups inside a chunk, or an incomplete one.
Do you have any idea how to do this?

Since your records are already grouped by id in that order, you can use a SingleItemPeekableItemReader that reads multiple physical records with the same id into a single logical item. Once this is in place, you can synchronize the reader (to make it thread-safe) and configure a multi-threaded step to process items in parallel.
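Here is a minimal sketch of such a grouping reader, assuming hypothetical Record and RecordGroup classes and a delegate (for example a FlatFileItemReader) wrapped in a SingleItemPeekableItemReader. The synchronized read() keeps each group's assembly atomic so the reader can be shared by a multi-threaded step; in a real job you would also implement ItemStream (or register the delegate as a stream on the step) so the file is opened and closed properly.

    import java.util.ArrayList;
    import java.util.List;

    import org.springframework.batch.item.ItemReader;
    import org.springframework.batch.item.support.SingleItemPeekableItemReader;

    // Record and RecordGroup are hypothetical types standing in for one CSV line
    // and one logical group of lines sharing the same id.
    public class GroupingRecordReader implements ItemReader<RecordGroup> {

        private final SingleItemPeekableItemReader<Record> delegate;

        public GroupingRecordReader(SingleItemPeekableItemReader<Record> delegate) {
            this.delegate = delegate;
        }

        @Override
        public synchronized RecordGroup read() throws Exception {
            Record current = delegate.read();
            if (current == null) {
                return null; // no more input, the step ends
            }
            List<Record> group = new ArrayList<>();
            group.add(current);
            // keep consuming physical records while the next one has the same id
            Record next = delegate.peek();
            while (next != null && next.getId().equals(current.getId())) {
                group.add(delegate.read());
                next = delegate.peek();
            }
            return new RecordGroup(current.getId(), group);
        }
    }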
You can also take a look at the AggregateItemReader (which is part of the samples) to aggregate multiple physical records into a single logical one: multi-line orders sample. Here too a multi-threaded step would improve the performance of your job.
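And here is a minimal sketch of the corresponding multi-threaded step, using the StepBuilderFactory style from your question; GroupingRecordReader is the reader sketched above, and RecordGroup plus the writer bean are hypothetical names. Each chunk then holds complete logical groups rather than arbitrary slices of the file.

    import org.springframework.batch.core.Step;
    import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
    import org.springframework.batch.item.ItemWriter;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.core.task.SimpleAsyncTaskExecutor;

    @Configuration
    public class GroupProcessingStepConfig {

        @Bean
        public Step processGroupsStep(StepBuilderFactory stepBuilderFactory,
                                      GroupingRecordReader groupingRecordReader,
                                      ItemWriter<RecordGroup> recordGroupWriter) {
            return stepBuilderFactory.get("processGroupsStep")
                    // each chunk now holds 10 logical items, i.e. 10 complete id groups
                    .<RecordGroup, RecordGroup>chunk(10)
                    .reader(groupingRecordReader)
                    .writer(recordGroupWriter)
                    // run chunks in parallel; the reader above is synchronized, so it is safe to share
                    .taskExecutor(new SimpleAsyncTaskExecutor("group-"))
                    .throttleLimit(4)
                    .build();
        }
    }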

Related

Processing millions of records from MySQL in Java and storing the result in another database

I have around 15 million records in MySQL (read only) which will be fetched using joins of 10 tables. Around 50,000 new records are inserted daily, and that number will keep increasing.
Each record will be processed independently by a Java program. Multiple processing steps will be applied to the same record, and an output will be calculated based on that processing.
Results will be stored in another database.
Processing must be completed within an hour.
My questions are:
How do I design the processing engine (a cluster of Java programs) in a distributed manner to make the processing as fast as possible? To be more precise, I want to boot many spot instances at that time and finish the processing.
Will MySQL be a read bottleneck?
I don't have any experience in big data solutions. Shall I use Spark or another MapReduce solution? If yes, how shall I proceed?
I was in a similar situation where we were collecting about 15 million records per day. What I did was create some collection tables that I rotated and performed initial processing. Once that was done, I moved the data to the next phase where further processing was done before adding it to the large collection of data. Breaking it down will get the best performance and avoid having to run through a large set of data.
I'm not sure what you mean by processing the data or why you want to do it in Java; you may have a good reason for that. I would imagine that performance would be much better if you offloaded that work to MySQL and let it do as much of the processing as possible.

Processing more than 100k records of data

I am developing a Spring MVC application.
I have a requirement to process more than 100k records of data, and I can't make it database-dependent, so I have to implement all the logic in Java.
For now I am creating a number of threads and assigning, say, 1,000 records to each thread to process.
I am using org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor.
Questions:
What is the suggested number of threads I should use?
Should I divide the records equally among the threads, or
should I give a predefined number of records to each thread and increase the number of threads?
Is ThreadPoolTaskExecutor OK, or should I use something else?
Should I keep track of the record ids assigned to each thread in Java or in the database? (Note: if I use the database, I have to make an extra database call for each record and update it after processing that record.)
Can anyone please suggest best practices for this scenario?
Any kind of suggestion would be great.
Note: Execution time is main concern.
Update:
The processing includes a huge number of database calls.
You can think of it as searching done in Java: take one record, compare it (in Java) with other records from the DB, then take another record and do the same.
In order to process a huge amount of data, you can use the Spring Batch framework.
Check the documentation and the wiki page.
ExecutorService should be fine for you; there's no need to use Spring. But the thread count will be tricky. I can only say it depends; why not experiment to figure out the optimal number?
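For illustration, here is a minimal sketch of the ExecutorService approach, assuming hypothetical Record and RecordProcessor types for the per-record work; the thread count and batch size are only starting points to tune, since the best values depend on how database-heavy the processing is.

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class BatchRunner {

        public void processAll(List<Record> records, RecordProcessor processor) throws InterruptedException {
            // a common starting point: more threads than cores for I/O-bound work such as DB calls
            int threads = Runtime.getRuntime().availableProcessors() * 2;
            int batchSize = 1000;
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            try {
                for (int from = 0; from < records.size(); from += batchSize) {
                    List<Record> batch = records.subList(from, Math.min(from + batchSize, records.size()));
                    // each task processes one slice of the record list
                    pool.submit(() -> batch.forEach(processor::process));
                }
            } finally {
                pool.shutdown();
                pool.awaitTermination(1, TimeUnit.HOURS);
            }
        }
    }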

Spring Batch Job Design - Multiple Readers

I’m struggling with how to design a Spring Batch job. The overall goal is to retrieve ~20 million records and save them to a SQL database.
I’m doing it in two parts. First I retrieve the 20 million ids of the records I want to retrieve and save those to a file (or DB). This is a relatively fast operation. Second, I loop through my file of Ids, taking batches of 2,000, and retrieve their related records from an external service. I then repeat this, 2,000 Ids at a time, until I’ve retrieved all of the records. For each batch of 2,000 records I retrieve, I save them to a database.
Some may be asking why I’m doing this in two steps. I eventually plan to make the second step run in parallel so that I can retrieve batches of 2,000 records in parallel and hopefully greatly speed up the download. Having the Ids allows me to partition the job into batches. For now, let’s not worry about parallelism and just focus on how to design a simpler sequential job.
Imagine I already have solved the first problem of saving all of the Ids locally. They are in a file, one Id per line. How do I design the steps for the second part?
Here’s what I’m thinking…
Read 2,000 Ids using a flat file reader. I’ll need an aggregator since I only want to do one query to my external service for each batch of 2K Ids. This is where I’m struggling. Do I nest a series of readers? Or can I do ‘reading’ in the processor or writer?
Essentially, my problem is that I want to read lines from a file, aggregate those lines, and then immediately do another ‘read’ to retrieve the respective records. I almost want to chain readers together.
Finally, once I’ve retrieved the records from the external service, I’ll have a List of records. Which means when they arrive at the Writer, I’ll have a list of lists. I want a flat list of objects so that I can use the JdbcBatchItemWriter out of the box.
Thoughts? Hopefully that makes sense.
Andrew
This is a matter of design and is subjective, but based on the Spring Batch example I found (from SpringSource) and my personal experience, the pattern of doing additional reading in the processor step is a good solution to this problem. You can also chain together multiple processors/readers in the 'processor' step. So, while the names don't exactly match, I find myself doing more and more 'reading' in my processors.
http://docs.spring.io/spring-batch/trunk/reference/html/patterns.html#drivingQueryBasedItemReaders
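As a minimal sketch of that driving-query pattern, assuming a hypothetical ExternalService with a fetchById method and a hypothetical Item type: the reader only supplies the id, and the processor performs the additional 'read'.

    import org.springframework.batch.item.ItemProcessor;

    public class IdToItemProcessor implements ItemProcessor<Long, Item> {

        private final ExternalService externalService;

        public IdToItemProcessor(ExternalService externalService) {
            this.externalService = externalService;
        }

        @Override
        public Item process(Long id) throws Exception {
            // the additional "read" happens here, once per driving id
            return externalService.fetchById(id);
        }
    }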
Given that you want to call your external service just once per chunk of 2,000 records, you'll actually want to do this service call in an ItemWriter. That is the standard recommended way to do chunk-level processing.
You can create a custom ItemWriter<Long> implementation. It will receive the list of 2,000 IDs as input and call the external service. The result from the external service should allow you to create a List<Item>. Your writer can then simply forward this List<Item> to your JdbcBatchItemWriter<Item> delegate.
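A minimal sketch of that delegating writer, using the Spring Batch 4-style List-based write signature and assuming hypothetical ExternalService and Item types; the JdbcBatchItemWriter<Item> delegate is assumed to be configured elsewhere with the target INSERT statement.

    import java.util.List;

    import org.springframework.batch.item.ItemWriter;
    import org.springframework.batch.item.database.JdbcBatchItemWriter;

    public class ExternalServiceWriter implements ItemWriter<Long> {

        private final ExternalService externalService;
        private final JdbcBatchItemWriter<Item> delegate;

        public ExternalServiceWriter(ExternalService externalService, JdbcBatchItemWriter<Item> delegate) {
            this.externalService = externalService;
            this.delegate = delegate;
        }

        @Override
        public void write(List<? extends Long> ids) throws Exception {
            // one external service call per chunk of (up to) 2,000 ids
            List<Item> items = externalService.fetchByIds(ids);
            delegate.write(items);
        }
    }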

How to implement Spring batch remoting and still maintain order of writing?

I am new to Spring Batch and have just begun to conduct a POC to prove that Spring Batch is capable of processing 1m records in an hour. The architecture, however, demands that we demonstrate horizontal scalability as well.
I have read through both the Partitioning and Remote Chunking strategies. Both make sense to me. The essential difference between the two is that Remote Chunking requires a durable message queue, as the actual write out to the database or file happens from the master. In partitioning, a durable message queue is not needed, as the write happens from the slave.
Where I am totally lost, however, is how to ensure that the results of these two variants of parallel processing are written out in the correct sequence.
Let's take partitioning, for example. As far as I understand, if a particular step dealing with 1,000 records is partitioned into 10 parallel step executions, each having its own Reader, Processor, and Writer, one of the executions could easily complete before another. The result is that the ItemWriter of one step execution could write the results of processing records 300-400 to a table before the results of processing records 200-300 are written to the same table, because that particular step execution could be lagging behind.
What this means is that I now have an output table which does have all the results of the processing, but not in the correct sorted order. A further round of sequential processing may be required just to bring them back to the correct sorted order of 1 through 1000.
I am struggling to understand, how I can ensure correct sorted output and at the same time scale the system horizontally through the remote processing strategies described in Spring Batch.
I have read both these books: http://www.manning.com/templier/ as well as http://www.apress.com/9781430234524, but there is nothing in them that answers my question.
I think you can't do that, because tables are naturally unsorted. If you need the results ordered in some way, add an order column managed by the writer. The first partition writes 1-100, the second partition 101-200, and so on. The next step's reader can then get the items ordered by that column. Holes in the order column, due to missing writes in previous partitions, are not an issue. My 2 cents.
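For illustration, a minimal sketch of the follow-up step's reader, assuming a results table with a writer-managed seq order column and a hypothetical ProcessedRecord bean: the ORDER BY restores the global ordering regardless of which partition finished first.

    import javax.sql.DataSource;

    import org.springframework.batch.item.database.JdbcCursorItemReader;
    import org.springframework.batch.item.database.builder.JdbcCursorItemReaderBuilder;
    import org.springframework.jdbc.core.BeanPropertyRowMapper;

    public class OrderedReaderConfig {

        public JdbcCursorItemReader<ProcessedRecord> orderedReader(DataSource dataSource) {
            return new JdbcCursorItemReaderBuilder<ProcessedRecord>()
                    .name("orderedReader")
                    .dataSource(dataSource)
                    // seq is the order column stamped by each partition's writer
                    .sql("SELECT seq, value FROM results ORDER BY seq")
                    .rowMapper(new BeanPropertyRowMapper<>(ProcessedRecord.class))
                    .build();
        }
    }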

Processing a large amount of data

The question goes like this.
From one application I am getting approximately 200,000 encrypted values.
Task:
Read all the encrypted values into one VO/list.
Reformat them and add headers/trailers.
Dump these records to the DB in one shot, with the header and trailer in separate, defined columns.
I don't want to use any file between the processes.
What would be the best way to store the 200,000 records: a list or something else?
How do I dump these records into the DB in one shot? Is it better to divide them into chunks and have a separate thread work on each?
Please suggest a less time-consuming solution for this.
I am using Spring Batch for this, and this process will be one job.
Spring Batch is made for this type of operation. You will want a chunk-oriented tasklet. This type of tasklet uses a reader, an item processor, and a writer. It also streams the data, so you will never have all the items in memory at one time.
I'm not sure of the incoming format of your data, but there are existing readers for pretty much any use-case. And if you can't find the type you need, you can create your own. You will then want to implement ItemProcessor to handle any modifications you need to do.
For writing, you can just use JdbcBatchItemWriter.
As for these headers/footers, I would need more details on this. If they are an aggregation of all the records, you will need to process them beforehand. You can put the end results into the ExecutionContext.
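A minimal sketch of such a chunk-oriented step, assuming hypothetical EncryptedValue and FormattedRecord types, a reader bean for the incoming values, and a reformatting ItemProcessor; the JdbcBatchItemWriter writes the records in chunks (here 1,000 at a time) with the header and trailer values mapped to their own columns.

    import javax.sql.DataSource;

    import org.springframework.batch.core.Step;
    import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
    import org.springframework.batch.item.ItemProcessor;
    import org.springframework.batch.item.ItemReader;
    import org.springframework.batch.item.database.JdbcBatchItemWriter;
    import org.springframework.batch.item.database.builder.JdbcBatchItemWriterBuilder;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;

    @Configuration
    public class DumpStepConfig {

        @Bean
        public JdbcBatchItemWriter<FormattedRecord> recordWriter(DataSource dataSource) {
            return new JdbcBatchItemWriterBuilder<FormattedRecord>()
                    .dataSource(dataSource)
                    // header and trailer go into separate, defined columns
                    .sql("INSERT INTO records (header, payload, trailer) VALUES (:header, :payload, :trailer)")
                    .beanMapped()
                    .build();
        }

        @Bean
        public Step dumpStep(StepBuilderFactory stepBuilderFactory,
                             ItemReader<EncryptedValue> encryptedValueReader,
                             ItemProcessor<EncryptedValue, FormattedRecord> reformatProcessor,
                             JdbcBatchItemWriter<FormattedRecord> recordWriter) {
            return stepBuilderFactory.get("dumpStep")
                    .<EncryptedValue, FormattedRecord>chunk(1000)
                    .reader(encryptedValueReader)
                    .processor(reformatProcessor)
                    .writer(recordWriter)
                    .build();
        }
    }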
There are a couple of generic tricks to make bulk insertion go faster:
Consider using the database's native bulk insert.
Sort the records into ascending order on the primary key before you insert them.
If you are inserting into an empty table, drop the secondary indexes first and then recreate them.
Don't do it all in one database transaction.
I don't know how well these tricks translate to spring-batch ... but if they don't you could consider bypassing spring-batch and going directly to the database.
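As one example of the last two points, here is a minimal plain-JDBC sketch that batches inserts and commits in chunks instead of one huge transaction, assuming a hypothetical Row type whose list is pre-sorted by primary key; for MySQL, adding rewriteBatchedStatements=true to the JDBC URL lets the driver collapse each batch into multi-row inserts.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.util.List;

    public class BulkInserter {

        private static final int COMMIT_EVERY = 5_000;

        public void insert(Connection connection, List<Row> rows) throws SQLException {
            connection.setAutoCommit(false);
            try (PreparedStatement ps = connection.prepareStatement(
                    "INSERT INTO target_table (id, payload) VALUES (?, ?)")) {
                int count = 0;
                for (Row row : rows) { // rows are assumed to be pre-sorted by primary key
                    ps.setLong(1, row.getId());
                    ps.setString(2, row.getPayload());
                    ps.addBatch();
                    if (++count % COMMIT_EVERY == 0) {
                        ps.executeBatch();   // send the accumulated batch to the server
                        connection.commit(); // commit in chunks, not one huge transaction
                    }
                }
                ps.executeBatch();
                connection.commit();
            }
        }
    }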
