Parallel Processing Spring Batch - java

We have a requirement to read a CSV file and for each line read, there shall be a REST call. The REST call returns an array of elements, for each element returned there shall be an update in DB.
There is an NFR to achieve parallel processing for the above requirement.
After reading the CSV, the processing of each individual line has to happen in parallel, i.e., there shall be a group of workers making concurrent REST calls, one per line read.
Subsequently, for each array element found in the response of the REST call, there shall be parallel updates to the DB as well.
Any valuable suggestions / ideas on how to achieve this requirement in Spring Batch?
We have thought of 2 approaches, the first one is to go with Master / Worker design for doing REST calls on each CSV line that is read.
Each worker here will correspond to one line read from the CSV file; it shall perform the REST call, and when the response is returned, each of these workers shall become a master itself and launch another set of workers based on the number of array elements returned from the REST call.
Each worker then launched shall handle one element of the response returned above and will perform DB updates in parallel.
Is this even possible and a good solution?
The second one is to use a JobStep-based approach, i.e., to launch one Job from another. If we follow this approach, how do we communicate the data between the 2 Jobs? Suppose our first job (Job1) reads the CSV and makes the REST call; the second Job (Job2) shall be responsible for processing the individual response elements from the REST call. In that case, how do we communicate the response element data from Job1 to Job2? In this scenario, can we have Job1 launching multiple Job2 instances for parallel DB updates?
The solutions we have outlined may not be very clear, and we are not sure whether they are even correct and feasible.
Apologies for any confusion caused, but we are unsure how to come up with a design for this requirement.
In either case, we are not clear on how the failures will be tracked and how the job can be re-run from the failed state.
Any help is much appreciated!!

Related

When Tasklet#execute should return CONTINUABLE?

I read that:
When processing is complete in your Tasklet implementation, you return
an org.springframework.batch.repeat.RepeatStatus object. There are two
options with this: RepeatStatus.CONTINUABLE and RepeatStatus.FINISHED.
These two values can be confusing at first glance. If you return
RepeatStatus.CONTINUABLE, you aren't saying that the job can continue.
You're telling Spring Batch to run the tasklet again. Say, for
example, that you wanted to execute a particular tasklet in a loop
until a given condition was met, yet you still wanted to use Spring
Batch to keep track of how many times the tasklet was executed,
transactions, and so on. Your tasklet could return
RepeatStatus.CONTINUABLE until the condition was met. If you return
RepeatStatus.FINISHED, that means the processing for this tasklet is
complete (regardless of success) and to continue with the next piece
of processing.
But I can't imagine an example of using this feature. Could you explain it to me? When will the tasklet be invoked the next time?
Let's say that you have a large set of items (for example files), and you need to enrich each one of them in some way, which requires consuming an external service. The external service might provide a chunked mode that can process up to 1000 requests at once instead of making a separate remote call for each single file. That might be the only way you can bring down your overall processing time to the required level.
However, this is not possible to implement using Spring Batch's Reader/Processor/Writer API in a nice way, because the Processor is fed item by item and not entire chunks of them. Only the Writer actually sees chunks of items.
You could implement this using a Tasklet that reads the next up to 1000 unprocessed files, sends a chunked request to the service, processes the results, writes output files and deletes or moves the processed files.
Finally it checks if there are more unprocessed files left. Depending on that it returns either FINISHED or CONTINUABLE, in which case the framework would invoke the Tasklet again to process the next up to 1000 files.
This is actually a quite realistic scenario, so I hope that illustrates the purpose of the feature.
This allows you to break up processing of a complex task across multiple iterations.
The functionality is similar to a while(true) loop with continue/break.
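A minimal sketch of such a Tasklet, assuming a batch size of 1000 and hypothetical helpers findUnprocessedFiles, callChunkedService and archive for the remote call and the bookkeeping (they are placeholders, not part of any real API):

import java.nio.file.Path;
import java.util.List;

import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;

public class ChunkedEnrichmentTasklet implements Tasklet {

    private static final int BATCH_SIZE = 1000;

    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception {
        // Pick up the next batch of unprocessed files (hypothetical helper).
        List<Path> batch = findUnprocessedFiles(BATCH_SIZE);
        if (batch.isEmpty()) {
            return RepeatStatus.FINISHED;   // nothing left, this step is done
        }

        // One remote call for the whole batch instead of one call per file (hypothetical helpers).
        callChunkedService(batch);
        archive(batch);

        // Ask the framework to invoke this tasklet again, in a new transaction,
        // to handle the next batch of up to 1000 files.
        return RepeatStatus.CONTINUABLE;
    }

    private List<Path> findUnprocessedFiles(int limit) { /* e.g. list a staging directory */ return List.of(); }
    private void callChunkedService(List<Path> batch)  { /* single bulk request to the service */ }
    private void archive(List<Path> batch)             { /* move or delete processed files */ }
}

Each invocation of execute() runs in its own transaction, so the framework keeps track of the repetitions for you, as described in the quoted passage.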

Reading huge file in Java

I am reading a huge file (almost 5 million lines). Each line contains a Date and a Request, and I must parse the Requests between concrete Dates. I use a BufferedReader to read the file until the start Date and then start parsing lines. Can I use Threads for parsing the lines, because it takes a lot of time?
It isn't entirely clear from your question, but it sounds like you are reparsing your 5 million-line file every time a client requests data. You certainly can solve the problem by throwing more threads and more CPU cores at it, but a better solution would be to improve the efficiency of your application by eliminating duplicate work.
If this is the case, you should redesign your application to avoid reparsing the entire file on every request. Ideally you should store data in a database or in-memory instead of processing a flat text file on every request. Then on a request, look up the information in the database or in-memory data structure.
If you cannot eliminate the 5 million-line file entirely, you can periodically recheck the large file for changes, skip/seek to the end of the last record that was parsed, then parse only new records and update the database or in-memory data structure. This can all optionally be done in a separate thread.
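A rough sketch of that incremental approach, assuming the file is append-only and using a hypothetical parseAndStore callback for each newly appended line (class and field names are made up):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.function.Consumer;

public class IncrementalFileTailer {

    private final String path;
    private long lastOffset; // byte offset just past the last record we already parsed

    public IncrementalFileTailer(String path) {
        this.path = path;
    }

    /** Parse only the records appended since the previous call. */
    public synchronized void pollNewRecords(Consumer<String> parseAndStore) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            if (file.length() <= lastOffset) {
                return;                       // nothing new since last poll
            }
            file.seek(lastOffset);            // skip everything already parsed
            String line;
            while ((line = file.readLine()) != null) {
                parseAndStore.accept(line);   // e.g. update a DB or in-memory index
            }
            lastOffset = file.getFilePointer();
        }
    }
}

pollNewRecords could be scheduled on a separate thread, as mentioned above, so requests never have to re-read the whole file.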
Firstly, 5 million lines of 1,000 characters each is only 5 GB, which is not necessarily prohibitive for a JVM. If this is actually a critical use case with lots of hits, then buying more memory is almost certainly the right thing to do.
Secondly, if that is not possible, most likely the right thing to do is to build an ordered Map keyed by date. Every date is a key in the map and points to a list of the line numbers that contain requests for that date. You can then go directly to the relevant line numbers.
Something of the form
TreeMap<Date, List<Integer>>
would do nicely. That should have a memory usage of order 5,000,000*32/8 bytes = 20Mb, which should be fine.
You could also use the FileChannel class to keep the I/O handle open as you jump from one line to another. This also allows memory mapping.
See http://docs.oracle.com/javase/7/docs/api/java/nio/channels/FileChannel.html
And http://en.wikipedia.org/wiki/Memory-mapped_file
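For illustration only (not part of the original answer), a rough sketch of the index-plus-seek idea: one sequential pass builds a TreeMap from date to the byte offset where that date's lines begin, and lookups then use FileChannel.position() to jump straight to the requested range. It assumes single-byte encoded, '\n'-terminated lines starting with a yyyy-MM-dd date; the class and method names are made up:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class RequestIndex {

    // Ordered map: date -> byte offset of the first line carrying that date.
    private final TreeMap<LocalDate, Long> firstOffsetByDate = new TreeMap<>();

    /** One sequential pass that records where each date's block of lines begins.
        Assumes single-byte encoding and '\n' line endings, so offset == characters read. */
    public void buildIndex(Path file) throws IOException {
        long offset = 0;
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.US_ASCII)) {
            String line;
            while ((line = reader.readLine()) != null) {
                LocalDate date = LocalDate.parse(line.substring(0, 10)); // "yyyy-MM-dd ..." lines assumed
                firstOffsetByDate.putIfAbsent(date, offset);
                offset += line.length() + 1;
            }
        }
    }

    /** Jump straight to the first line of the start date and read until the end date. */
    public List<String> requestsBetween(Path file, LocalDate start, LocalDate end) throws IOException {
        Map.Entry<LocalDate, Long> entry = firstOffsetByDate.ceilingEntry(start);
        if (entry == null) {
            return Collections.emptyList();
        }
        List<String> result = new ArrayList<>();
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            channel.position(entry.getValue()); // seek: no need to re-read the start of the file
            BufferedReader reader = new BufferedReader(
                    Channels.newReader(channel, StandardCharsets.US_ASCII.name()));
            String line;
            while ((line = reader.readLine()) != null) {
                if (LocalDate.parse(line.substring(0, 10)).isAfter(end)) {
                    break;
                }
                result.add(line);
            }
        }
        return result;
    }
}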
A good way to parallelize a lot of small tasks is to wrap the processing of each task in a FutureTask and then pass each task to a ThreadPoolExecutor to run. The executor should be initialized with the number of CPU cores your system has available.
When you call executor.execute(future), the future is queued for background processing. To avoid creating and destroying too many threads, the thread pool executor creates only as many threads as you specified and executes the futures one after another.
To retrieve the result of a future, call future.get(). If the future hasn't completed yet (or hasn't even started), this method blocks until it has. The other futures keep executing in the background while you wait.
Remember to call executor.shutdown() when you don't need it anymore, to make sure it terminates the background threads it otherwise keeps around until the keepalive time has expired or it is garbage-collected.
tl;dr pseudocode:
create executor
for each line in file
    create new FutureTask which parses that line
    pass future task to executor
    add future task to a list
for each entry in task list
    call entry.get() to retrieve result
executor.shutdown()
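A runnable sketch of that pseudocode, assuming a hypothetical parseRequest(String) that turns one line into a Request object (both are placeholders for the real parsing logic):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.FutureTask;

public class ParallelLineParser {

    /** Hypothetical result type produced by parsing one line. */
    static class Request {
        final String raw;
        Request(String raw) { this.raw = raw; }
    }

    /** Hypothetical line parser; replace with the real parsing logic. */
    private static Request parseRequest(String line) {
        return new Request(line);
    }

    public static List<Request> parseAll(List<String> lines)
            throws InterruptedException, ExecutionException {
        // One worker thread per available CPU core.
        ExecutorService executor =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        List<FutureTask<Request>> tasks = new ArrayList<>();
        try {
            for (String line : lines) {
                FutureTask<Request> task = new FutureTask<>(() -> parseRequest(line));
                executor.execute(task);      // queue the task for background processing
                tasks.add(task);
            }
            List<Request> results = new ArrayList<>();
            for (FutureTask<Request> task : tasks) {
                results.add(task.get());     // blocks until this task is done; others keep running
            }
            return results;
        } finally {
            executor.shutdown();             // allow the worker threads to terminate
        }
    }
}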

Spring Batch flow control

I have a job in spring batch with a reader, a processor and a writer.
First, I would like to know in what order these 3 components run: are they sequential (for commit-interval=1), or is a new item read before the previous one has been written in order to avoid delays?
I am interested in this, because I have the following case:
I want to have an "assembly-line": read->process->write->read again->...
This means that nothing is read before the previous item has been written.
Is this thing already assured out-of-the-box? If not, how can I accomplish such a thing?
The interaction between an ItemReader, ItemProcessor, and ItemWriter is as follows in Spring Batch:
Until the chunk size is reached
    ItemReader.read()
While there are items that have not been processed
    ItemProcessor.process()
ItemWriter.write() // single call for all items in the chunk
That being said, with the chunk size set to 1, the sequence is read, process, write, repeat.
It's important to note that not only is the above contract guaranteed, but each step runs to completion before the next step executes (splits notwithstanding).
You can read more about how the various components interact in the documentation here: http://docs.spring.io/spring-batch/trunk/reference/html/configureStep.html#chunkOrientedProcessing
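For reference, a minimal sketch of a step configured with a chunk size of 1, which gives exactly the read -> process -> write -> read again sequence described above. It assumes a Spring Batch 4-style StepBuilderFactory and that the reader, processor and writer beans are defined elsewhere:

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class AssemblyLineStepConfig {

    @Bean
    public Step assemblyLineStep(StepBuilderFactory stepBuilderFactory,
                                 ItemReader<String> reader,
                                 ItemProcessor<String, String> processor,
                                 ItemWriter<String> writer) {
        return stepBuilderFactory.get("assemblyLineStep")
                .<String, String>chunk(1)   // commit-interval = 1: read, process, write, repeat
                .reader(reader)
                .processor(processor)
                .writer(writer)
                .build();
    }
}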

Time Based Streaming

I am trying to figure out how to get time-based streaming but on an infinite stream. The reason is pretty simple: Web Service call latency results per unit time.
But, that would mean I would have to terminate the stream (as I currently understand it) and that's not what I want.
In words: If 10 WS calls came in during a 1 minute interval, I want a list/stream of their latency results (in order) passed to stream processing. But obviously, I hope to get more WS calls at which time I would want to invoke the processors again.
I could totally be misunderstanding this. I had thought of using Collectors.groupingBy(x -> someTimeGrouping), so that all calls are grouped by whatever measurement interval I chose. But then no code will be aware of this until I call a terminal operation, at which point the monitoring process is done.
Just trying to learn java 8 through application to previous code
By definition and construction a stream can only be consumed once, so if you send your results to an infinite stream, you will not be able to access them more than once. Based on your description, it sounds like it would make more sense to store the latency results in a collection, say an ArrayList, and when you need to analyse the data, use the stream functionality to group them.
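A small illustrative sketch of that suggestion (class and field names are made up): collect latency samples into a list as calls complete, then group them per minute with a stream whenever you want to analyse them:

import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.stream.Collectors;

public class LatencyMonitor {

    /** Hypothetical sample type: when the call finished and how long it took. */
    static class Sample {
        final Instant timestamp;
        final long latencyMillis;
        Sample(Instant timestamp, long latencyMillis) {
            this.timestamp = timestamp;
            this.latencyMillis = latencyMillis;
        }
    }

    // Thread-safe list so web-service callbacks can append samples concurrently.
    private final List<Sample> samples = new CopyOnWriteArrayList<>();

    public void record(Instant timestamp, long latencyMillis) {
        samples.add(new Sample(timestamp, latencyMillis));
    }

    /** Group the latencies recorded so far by the minute in which they occurred. */
    public Map<Instant, List<Long>> latenciesPerMinute() {
        return samples.stream().collect(Collectors.groupingBy(
                s -> s.timestamp.truncatedTo(ChronoUnit.MINUTES),
                Collectors.mapping(s -> s.latencyMillis, Collectors.toList())));
    }
}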

Exchange data in real time over AJAX with multiple threads

I am developing an application in JSF 2.0 and I would like to have a multiline textbox which displays output data which is being read (line by line) from a file in real time.
So the goal is to have a page with a button on it that triggers the backend to start reading from the file and then displaying the results as it's reading in the textbox.
I had thought about doing this in the following way:
Have the local page keep track of what lines it has retrieved/displayed in the textbox so far.
Periodically the local page will poll the backend using AJAX and request any new data that has been read (tell it what lines the page has so far and only retrieve the new lines since then).
This will continue until the entire file has been completely retrieved.
The issue is that the bean method that reads from the file runs a blocking while loop. So reading from the data structure it is writing to, at the same time, will require additional Threads, correct? I hear that spawning new Threads in a web application is a potentially dangerous move and that Thread pools should be used instead, etc.
Can anyone shed some insight on this?
Update: I tried a couple of different things with no luck. But I did manage to get it working by spawning a separate Thread to run my blocking loop while the main thread could be used to read from it whenever an AJAX request is processed. Is there a good library I could use to do something similar to this that still gives JSF some lifecycle control over this Thread?
Have you considered implementing the Future interface (included in Java5+ Concurrency API)? Basically, as you read in the file, you could split it into sections and simply create a new Future object (for each section). Then you can have the object return once the computation has completed.
This way you prevent having to access the structure while it is still being manipulated by the loop and you also split the operations into smaller computations reducing the amount of time locking occurs (total lock time might be greater but you get faster response to other areas). If you maintain the order in which your Future objects were created then you don't need to track line #'s. Note that calling Future.get() does block until the object is 'ready'.
The rest of your approach would be similar: make the Ajax call to get the content of all 'ready' Future objects from a FIFO queue.
I think I understand what you're trying to accomplish.. maybe a bit more info would help.
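For illustration only (not from the answer above), a rough sketch of that idea: a small executor reads file sections in the background, and the AJAX poll drains only the Future results that are already done, in the order they were submitted. The section splitting, the readSection logic, and the JSF wiring are assumed:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FileSectionReader {

    // Small fixed pool instead of spawning raw threads from the bean.
    private final ExecutorService executor = Executors.newFixedThreadPool(2);
    private final ConcurrentLinkedQueue<Future<String>> pending = new ConcurrentLinkedQueue<>();

    /** Kick off background reading; called when the user presses the button. */
    public void startReading(List<String> sections) {
        for (String section : sections) {
            // Hypothetical readSection(...) does the slow, blocking work.
            pending.add(executor.submit(() -> readSection(section)));
        }
    }

    /** Called from the AJAX poll: return only the sections that are ready, in order. */
    public List<String> drainReadySections() throws Exception {
        List<String> ready = new ArrayList<>();
        Future<String> head;
        while ((head = pending.peek()) != null && head.isDone()) {
            ready.add(head.get());   // will not block, the future is already done
            pending.poll();
        }
        return ready;
    }

    public void shutdown() {
        executor.shutdown();         // call when the bean is destroyed
    }

    private String readSection(String section) {
        // placeholder for the actual blocking file read
        return section;
    }
}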
