Spring Batch Job Design - Multiple Readers - Java

I’m struggling with how to design a Spring Batch job. The overall goal is to retrieve ~20 million records and save them to a sql database.
I’m doing it in two parts. First I retrieve the 20 million ids of the records I want to retrieve and save those to a file (or DB). This is a relatively fast operation. Second, I loop through my file of Ids, taking batches of 2,000, and retrieve their related records from an external service. I then repeat this, 2,000 Ids at a time, until I’ve retrieved all of the records. For each batch of 2,000 records I retrieve, I save them to a database.
Some may be asking why I’m doing this in two steps. I eventually plan to make the second step run in parallel so that I can retrieve batches of 2,000 records in parallel and hopefully greatly speed up the download. Having the Ids allows me to partition the job into batches. For now, let’s not worry about parallelism and just focus on how to design a simpler sequential job.
Imagine I already have solved the first problem of saving all of the Ids locally. They are in a file, one Id per line. How do I design the steps for the second part?
Here’s what I’m thinking…
Read 2,000 Ids using a flat file reader. I’ll need an aggregator since I only want to do one query to my external service for each batch of 2,000 Ids. This is where I’m struggling. Do I nest a series of readers? Or can I do ‘reading’ in the processor or writer?
Essentially, my problem is that I want to read lines from a file, aggregate those lines, and then immediately do another ‘read’ to retrieve the respective records. I almost want to chain readers together.
Finally, once I’ve retrieved the records from the external service, I’ll have a List of records. Which means when they arrive at the Writer, I’ll have a list of lists. I want a flat list of objects so that I can use the JdbcBatchItemWriter out of the box.
Thoughts? Hopefully that makes sense.
Andrew

This is a matter of design and is subjective, but based on the Spring Batch examples I found (from SpringSource) and my personal experience, the pattern of doing additional reading in the processor step is a good solution to this problem. You can also chain together multiple processors/readers in the 'processor' step. So, while the names don't exactly match, I find myself doing more and more 'reading' in my processors.
http://docs.spring.io/spring-batch/trunk/reference/html/patterns.html#drivingQueryBasedItemReaders
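As a rough illustration of that pattern, here is a minimal sketch of an ItemProcessor that does the extra read per item. `RecordService`, `Record` and `fetchById` are placeholder names for your own external client and model, not anything provided by Spring Batch.

```java
import org.springframework.batch.item.ItemProcessor;

// Sketch of "reading inside the processor": the step's reader emits only IDs,
// and this processor loads the full record for each one.
public class RecordLoadingProcessor implements ItemProcessor<Long, Record> {

    private final RecordService recordService; // hypothetical client for the external service

    public RecordLoadingProcessor(RecordService recordService) {
        this.recordService = recordService;
    }

    @Override
    public Record process(Long id) throws Exception {
        // one lookup per item; returning null would filter the item out of the chunk
        return recordService.fetchById(id);
    }
}
```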

Given that you want to call your external service just once per chunk of 2,000 records, you'll actually want to make this service call in an ItemWriter. That is the standard recommended way to do chunk-level processing.
You can create a custom ItemWriter<Long> implementation. It will receive the list of 2,000 IDs as input and call the external service. The result from the external service should allow you to create a List<Item>. Your writer can then simply forward this List<Item> to your JdbcBatchItemWriter<Item> delegate.
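A minimal sketch of that writer, assuming the List-based ItemWriter signature used before Spring Batch 5 and a hypothetical `RecordService` whose `fetchByIds` method accepts the whole batch of IDs (`Item` stands in for your own record type):

```java
import java.util.ArrayList;
import java.util.List;

import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.item.database.JdbcBatchItemWriter;

// Receives one chunk of IDs (2,000 if that is the step's chunk size), makes a single
// call to the external service, and forwards the resulting items to the JDBC writer.
public class ExternalServiceWriter implements ItemWriter<Long> {

    private final RecordService recordService;         // hypothetical external client
    private final JdbcBatchItemWriter<Item> delegate;   // standard Spring Batch writer

    public ExternalServiceWriter(RecordService recordService, JdbcBatchItemWriter<Item> delegate) {
        this.recordService = recordService;
        this.delegate = delegate;
    }

    @Override
    public void write(List<? extends Long> ids) throws Exception {
        // one service call per chunk of IDs
        List<Item> items = recordService.fetchByIds(new ArrayList<>(ids));
        // the list is already flat, so the out-of-the-box JDBC writer can take it as-is
        delegate.write(items);
    }
}
```

Register this writer on the step and set the chunk size to 2,000 so each write() call sees exactly one batch of IDs.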

Related

Predicate splitting and parallel processing

I'm new to Spring Batch, and I don't know how to come up with the right solution for my problem.
I have a CSV file with a million or two records. These records are grouped by an id.
id;head-x;head-y;...
1;;;
1;;;
1;;;
...
1;;;
2;;;
2;;;
2;;;
...
2;;;
3;;;
3;;;
...
3;;;
...
...
What I want is to process these records as groups: read all the records of group 1, process and convert them into a business model, and save it to my database.
I need to do this work in parallel to speed up processing. I want to process groups 2 and 3, if possible, at the same time as group 1.
I've started by using StepBuilderFactory#chunk(), but this gives me fixed-size chunks. I can get multiple groups inside a chunk, or an incomplete one.
Do you have any idea how to do this?
Since your records are already grouped by Id in that order, you can use a SingleItemPeekableItemReader to read multiple physical records with the same Id into a single logical item. Once this is in place, you can synchronize the reader (to make it thread-safe) and configure a multi-threaded step to process items in parallel.
You can also take a look at the AggregateItemReader (which is part of the samples) to aggregate multiple physical records into a single logical one: see the multi-line orders sample. Here too, a multi-threaded step would improve the performance of your job.
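For reference, here is a rough sketch of the grouping approach built on SingleItemPeekableItemReader. `Record`, `Group` and `getId()` are placeholder names for your own line mapping and aggregate model; the delegate would typically wrap a FlatFileItemReader.

```java
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.support.SingleItemPeekableItemReader;

// Reads one complete id-group per call: it keeps consuming physical records while the
// peeked record still carries the same id, then returns the aggregated logical item.
// read() is synchronized so the reader can be shared by a multi-threaded step.
public class GroupItemReader implements ItemReader<Group> {

    private final SingleItemPeekableItemReader<Record> delegate;

    public GroupItemReader(SingleItemPeekableItemReader<Record> delegate) {
        this.delegate = delegate;
    }

    @Override
    public synchronized Group read() throws Exception {
        Record current = delegate.read();
        if (current == null) {
            return null; // end of input
        }
        Group group = new Group(current.getId());
        group.add(current);
        Record next = delegate.peek();
        while (next != null && next.getId().equals(current.getId())) {
            group.add(delegate.read()); // consume the record we just peeked at
            next = delegate.peek();
        }
        return group;
    }
}
```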

Processing more than 100k records of data

I am developing a spring-mvc application.
I have a requirement to process more than 100k records of data, and I can't make it database-dependent, so I have to implement all the logic in Java.
For now I am creating a number of threads and assigning, say, 1000 records to each thread to process.
I am using org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor.
Question:
What is the suggested number of threads I should use?
Should I divide the records equally among the threads, or
should I give a predefined number of records to each thread and increase the number of threads?
Is ThreadPoolTaskExecutor OK, or should I use something else?
Should I maintain the record ids assigned to each thread in Java or in the database? (Note: if I use the database, I have to make an extra database call for each record and update it after processing that record.)
Can anyone please suggest best practices for this scenario?
Any kind of suggestion would be great.
Note: execution time is the main concern.
Update:
Processing includes a huge number of database calls.
You can think of it as searching done in Java: take one record, compare it (in Java) with other records from the DB, then take another record and do the same.
In order to process a huge amount of data, you can use the Spring Batch framework.
Check the reference documentation and the wiki page.
An ExecutorService should be fine for you; there is no need to use Spring for this. But the number of threads is the tricky part. I can only say it depends, so why not experiment to figure out the optimal number?
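As a rough sketch of the fixed-batch approach with a plain ExecutorService (`MyRecord` and `processBatch` are placeholder names, and the pool and batch sizes are just starting points to tune):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelProcessor {

    // Splits the records into fixed-size batches and processes them on a bounded pool.
    public void processAll(List<MyRecord> records) throws Exception {
        // IO-heavy work usually tolerates more threads than cores; treat this as a tuning knob
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors() * 2);
        int batchSize = 1000;

        List<Future<?>> futures = new ArrayList<>();
        for (int i = 0; i < records.size(); i += batchSize) {
            List<MyRecord> batch = records.subList(i, Math.min(i + batchSize, records.size()));
            futures.add(pool.submit(() -> processBatch(batch))); // your own per-batch logic
        }
        for (Future<?> future : futures) {
            future.get(); // propagate any failure and wait for completion
        }
        pool.shutdown();
    }

    private void processBatch(List<MyRecord> batch) {
        // placeholder for the record comparison / DB lookups described in the question
    }
}
```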

How to implement Spring batch remoting and still maintain order of writing?

I am new to Spring Batch and have just begun a POC to prove that Spring Batch is capable of processing 1m records in an hour. The architecture, however, demands that we demonstrate horizontal scalability as well.
I have read through both the Partitioning and Remote Chunking strategies. Both make sense to me. The essential difference between the two is that Remote Chunking requires a durable message queue, as the actual write out to the database or file happens from the master. In partitioning, a durable message queue is not needed, as the write happens from the slave.
Where I am totally lost, however, is how to ensure that the results of these two variants of parallel processing are written out in the correct sequence.
Let's take partitioning for example. As far as I understand, if a particular step dealing with 1000 records is partitioned into 10 parallel step executions, each having its own Reader, Processor and Writer, one of the executions could easily complete before another. The result is that the ItemWriter of one step execution could write the results of processing records 300-400 to a table before the results of processing records 200-300 are written to the same table, because that step execution could be lagging behind.
What this means is that I now have an output table which does have all the results of the processing, but they are not in the correct sorted order. A further round of sequential processing may be required just to bring them back to the correct sorted order of 1 through 1000.
I am struggling to understand, how I can ensure correct sorted output and at the same time scale the system horizontally through the remote processing strategies described in Spring Batch.
I have read both of these books, http://www.manning.com/templier/ as well as http://www.apress.com/9781430234524, but there is nothing in them that answers my question.
I don't think you can do that, because database tables are naturally unsorted. If you need the rows ordered in some way, add an order column managed by the writer: the first partition writes 1-100, the second partition writes 101-200, and so on. The next step's reader will then get the items ordered by that column. Gaps in the order column due to missing writes in earlier partitions are not an issue. My 2 cents.
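One way to sketch that idea, shown here as a step-scoped processor rather than inside the writer itself. `startIndex` is a hypothetical key the partitioner would put into each partition's step execution context, and `MyRecord`/`setOrderSeq` are placeholder names:

```java
import java.util.concurrent.atomic.AtomicLong;

import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Component;

// Stamps each item with an explicit, partition-wide unique sequence number so the writer
// can store it in an ORDER_SEQ column; a later step can then read ORDER BY ORDER_SEQ.
@Component
@StepScope
public class OrderColumnProcessor implements ItemProcessor<MyRecord, MyRecord> {

    private final AtomicLong sequence;

    public OrderColumnProcessor(@Value("#{stepExecutionContext['startIndex']}") Long startIndex) {
        // each partition starts numbering at its own offset (1, 101, 201, ...)
        this.sequence = new AtomicLong(startIndex);
    }

    @Override
    public MyRecord process(MyRecord item) {
        item.setOrderSeq(sequence.getAndIncrement());
        return item;
    }
}
```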

Processing a large amount of data

The question goes like this.
From one application I am getting approximately 200,000 encrypted values.
Task:
Read all encrypted values into one VO/list.
Reformat them and add headers/trailers.
Dump these records to the DB in one shot, with the header and trailer in separate, defined columns.
I don't want to use any files between the processes.
What would be the best way to store the 200,000 records, a list or something else?
How do I dump these records to the DB in one shot? Is it better to divide them into chunks and have separate threads work on them?
Please suggest a less time-consuming solution for this.
I am using spring batch for this and this process will be one job.
Spring Batch is made for this type of operation. You will want a chunk-oriented tasklet. This type of tasklet uses a reader, an item processor, and a writer. Also, this type of tasklet streams items, so you will never have all of them in memory at one time.
I'm not sure of the incoming format of your data, but there are existing readers for pretty much any use case, and if you can't find the type you need, you can create your own. You will then want to implement ItemProcessor to handle any modifications you need to make.
For writing, you can just use JdbcBatchItemWriter.
As for these headers/footers, I would need more details on this. If they are an aggregation of all the records, you will need to process them beforehand. You can put the end results into the ExecutionContext.
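A minimal configuration sketch along those lines, using StepBuilderFactory (available up to Spring Batch 4.x) and the setter-style writer setup; `EncryptedValue`, `OutputRecord` and the table/column names are placeholders, and the reader and processor are beans you supply yourself:

```java
import javax.sql.DataSource;

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.database.BeanPropertyItemSqlParameterSourceProvider;
import org.springframework.batch.item.database.JdbcBatchItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class LoadJobConfig {

    @Bean
    public JdbcBatchItemWriter<OutputRecord> writer(DataSource dataSource) {
        JdbcBatchItemWriter<OutputRecord> writer = new JdbcBatchItemWriter<>();
        writer.setDataSource(dataSource);
        // header and trailer land in their own columns, as the question asks
        writer.setSql("INSERT INTO target (header, payload, trailer) VALUES (:header, :payload, :trailer)");
        writer.setItemSqlParameterSourceProvider(new BeanPropertyItemSqlParameterSourceProvider<>());
        return writer;
    }

    @Bean
    public Step loadStep(StepBuilderFactory steps,
                         ItemReader<EncryptedValue> reader,
                         ItemProcessor<EncryptedValue, OutputRecord> processor,
                         JdbcBatchItemWriter<OutputRecord> writer) {
        // items stream through in chunks of 1000, so the full 200,000 never sit in memory
        return steps.get("loadStep")
                .<EncryptedValue, OutputRecord>chunk(1000)
                .reader(reader)
                .processor(processor)
                .writer(writer)
                .build();
    }
}
```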
There are a couple of generic tricks to make bulk insertion go faster:
Consider using the database's native bulk insert.
Sort the records into ascending order on the primary key before you insert them.
If you are inserting into an empty table, drop the secondary indexes first and then recreate them.
Don't do it all in one database transaction.
I don't know how well these tricks translate to spring-batch ... but if they don't, you could consider bypassing spring-batch and going directly to the database.
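For what it's worth, a plain-JDBC sketch of two of those tricks, sorting by the primary key and committing in moderate batches instead of one huge transaction (`MyRecord` and the table/column names are made up):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Comparator;
import java.util.List;

import javax.sql.DataSource;

public class BulkInserter {

    public void insert(DataSource dataSource, List<MyRecord> records) throws SQLException {
        // insert in ascending primary-key order
        records.sort(Comparator.comparingLong(MyRecord::getId));

        try (Connection con = dataSource.getConnection();
             PreparedStatement ps = con.prepareStatement(
                     "INSERT INTO target (id, payload) VALUES (?, ?)")) {
            con.setAutoCommit(false);
            int count = 0;
            for (MyRecord record : records) {
                ps.setLong(1, record.getId());
                ps.setString(2, record.getPayload());
                ps.addBatch();
                if (++count % 1000 == 0) {
                    ps.executeBatch();
                    con.commit(); // keep each transaction small
                }
            }
            ps.executeBatch();
            con.commit();
        }
    }
}
```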

How to create a copy of a table in HBase on the same cluster? Or, how to serve requests from the original state while operating on a working state

Is there an efficient way to create a copy of table structure+data in HBase, in the same cluster? Obviously the destination table would have a different name. What I've found so far:
The CopyTable job, which has been described as a tool for copying data between different HBase clusters. I think it would support intra-cluster operation, but I have no knowledge of whether it has been designed to handle that scenario efficiently.
Using the Export and Import jobs. Doing that sounds like a hack, but since I'm new to HBase, maybe it is a real solution?
Some of you might be asking why I'm trying to do this. My scenario is that I have millions of objects I need access to, in a "snapshot" state if you will. There is a batch process that runs daily which updates many of these objects. If any step in that batch process fails, I need to be able to "roll back" to the original state. Not only that, during the batch process I need to be able to serve requests to the original state.
Therefore the current flow is that I duplicate the original table to a working copy, continue to serve requests using the original table while I update the working copy. If the batch process completes successfully I notify all my services to use the new table, otherwise I just discard the new table.
This has worked fine using BDB but I'm in a whole new world of really large data now so I might be taking the wrong approach. If anyone has any suggestions of patterns I should be using instead, they are more than welcome. :-)
All data in HBase has a certain timestamp. You can do reads (Gets and Scans) with a parameter indicating that you want the latest version of the data as of a given timestamp. One thing you could do is serve your requests with reads that use this parameter, pointing to a time before the batch process begins. Once the batch completes, bump your read timestamp up to the current state.
A couple things to be careful of, if you take this approach:
HBase tables are configured to store the most recent N versions of a given cell. If you overwrite the data in the cell with N newer values, then you will lose the older value during the next compaction. (You can also configure them with a TTL to expire cells, but that doesn't quite sound like it matches your case.)
Similarly, if you delete the data as part of your process, then you won't be able to read it after the next compaction.
So, if you don't issue deletes as part of your batch process, and you don't write more versions of the same data that already exists in your table than you've configured it to save, you can keep serving old requests out of the same table that you're updating. This effectively gives you a snapshot.
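As a rough sketch of that timestamp-bounded read with the HBase client API (the table name, row key, and method `readAsOf` are made up for illustration; `snapshotTs` would be recorded just before the batch process starts):

```java
import java.io.IOException;

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SnapshotReads {

    // Serves reads "as of" snapshotTs, ignoring anything the running batch has written since.
    public void readAsOf(Connection connection, long snapshotTs) throws IOException {
        try (Table table = connection.getTable(TableName.valueOf("objects"))) {

            Get get = new Get(Bytes.toBytes("some-row-key"));
            get.setTimeRange(0, snapshotTs); // upper bound is exclusive
            Result single = table.get(get);

            Scan scan = new Scan();
            scan.setTimeRange(0, snapshotTs); // same idea for scans
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    // hand the pre-batch version of each row to the serving layer
                }
            }
        }
    }
}
```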
