When should Tasklet#execute return CONTINUABLE?

I read that:
When processing is complete in your Tasklet implementation, you return
an org.springframework.batch.repeat.RepeatStatus object. There are two
options with this: RepeatStatus.CONTINUABLE and RepeatStatus.FINISHED.
These two values can be confusing at first glance. If you return
RepeatStatus.CONTINUABLE, you aren't saying that the job can continue.
You're telling Spring Batch to run the tasklet again. Say, for
example, that you wanted to execute a particular tasklet in a loop
until a given condition was met, yet you still wanted to use Spring
Batch to keep track of how many times the tasklet was executed,
transactions, and so on. Your tasklet could return
RepeatStatus.CONTINUABLE until the condition was met. If you return
RepeatStatus.FINISHED, that means the processing for this tasklet is
complete (regardless of success) and to continue with the next piece
of processing.
But I can't imagine an example of using this feature. Could you explain it to me? When will the tasklet be invoked the next time?

Let's say that you have a large set of items (for example files), and you need to enrich each one of them in some way, which requires consuming an external service. The external service might provide a chunked mode that can process up to 1000 requests at once instead of making a separate remote call for each single file. That might be the only way you can bring down your overall processing time to the required level.
However, this is not possible to implement using Spring Batch's Reader/Processor/Writer API in a nice way, because the Processor is fed item by item and not entire chunks of them. Only the Writer actually sees chunks of items.
You could implement this using a Tasklet that reads the next batch of up to 1000 unprocessed files, sends a chunked request to the service, processes the results, writes output files, and deletes or moves the processed files.
Finally, it checks whether there are more unprocessed files left and, depending on that, returns either FINISHED or CONTINUABLE; in the latter case the framework invokes the Tasklet again to process the next batch of up to 1000 files.
This is actually a quite realistic scenario, so I hope that illustrates the purpose of the feature.
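A minimal sketch of such a Tasklet, assuming a hypothetical FileEnrichmentService that hands out unprocessed files and enriches them through the chunked remote call (the service interface and batch size of 1000 are illustrative, not part of Spring Batch):

    import java.util.List;

    import org.springframework.batch.core.StepContribution;
    import org.springframework.batch.core.scope.context.ChunkContext;
    import org.springframework.batch.core.step.tasklet.Tasklet;
    import org.springframework.batch.repeat.RepeatStatus;

    public class BatchEnrichmentTasklet implements Tasklet {

        // Hypothetical collaborator that lists, enriches, and archives files.
        public interface FileEnrichmentService {
            List<String> nextUnprocessedFiles(int maxFiles);
            void enrichAndArchive(List<String> files);
            boolean hasUnprocessedFiles();
        }

        private final FileEnrichmentService fileService;

        public BatchEnrichmentTasklet(FileEnrichmentService fileService) {
            this.fileService = fileService;
        }

        @Override
        public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) {
            // Take the next batch of at most 1000 unprocessed files.
            List<String> batch = fileService.nextUnprocessedFiles(1000);
            if (batch.isEmpty()) {
                return RepeatStatus.FINISHED;   // nothing left to do, the step is done
            }

            // One chunked call to the external service, then write/move the results.
            fileService.enrichAndArchive(batch);

            // CONTINUABLE asks the framework to call execute() again for the next batch.
            return fileService.hasUnprocessedFiles()
                    ? RepeatStatus.CONTINUABLE
                    : RepeatStatus.FINISHED;
        }
    }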

This allows you to break up processing of a complex task across multiple iterations.
The functionality is similar to a while(true) loop with continue/break.

Related

Parallel Processing Spring Batch

We have a requirement to read a CSV file, and for each line read there shall be a REST call. The REST call returns an array of elements; for each element returned there shall be an update in the DB.
There is an NFR to achieve parallel processing in the above requirement.
After reading the CSV, the processing of each individual line has to happen in parallel, i.e., there shall be a group of workers making concurrent REST calls, one per line read.
Subsequently, for each array element found in the response of the REST call, there shall be parallel updates to the DB as well.
Any valuable suggestions / ideas on how to achieve this requirement in Spring Batch?
We have thought of two approaches. The first one is to go with a Master/Worker design for making the REST calls for each CSV line that is read.
Each worker here corresponds to one line read from the CSV file; it performs the REST call, and when the response is returned, each of these workers becomes a master itself and launches another set of workers based on the number of array elements returned from the REST call.
Each worker then launched shall handle one element of the response returned above and will perform DB updates in parallel.
Is this even possible and a good solution?
The second one is a JobStep-based approach, i.e., launching one Job from another. If we follow this approach, how do we communicate data between the two Jobs? Suppose our first Job (Job1) reads the CSV and makes the REST call, and the second Job (Job2) is responsible for processing the individual response elements from that call. How do we pass the response element data from Job1 to Job2? And in this scenario, can Job1 launch multiple Job2 instances for parallel DB updates?
The solutions we have outlined may not be very clear and may well be confusing; we are not sure whether they are even correct and feasible.
Apologies for any confusion caused, but we are at a loss as to how to come up with a design for this requirement.
In either case, we are not clear on how the failures will be tracked and how the job can be re-run from the failed state.
Any help is much appreciated!!

Reading huge file in Java

I am reading a huge file (almost 5 million lines). Each line contains a Date and a Request, and I must parse the Requests that fall between two given Dates. I use a BufferedReader to read the file until the start Date and then start parsing lines. Can I use threads for parsing the lines, since it takes a lot of time?
It isn't entirely clear from your question, but it sounds like you are reparsing your 5 million-line file every time a client requests data. You certainly can solve the problem by throwing more threads and more CPU cores at it, but a better solution would be to improve the efficiency of your application by eliminating duplicate work.
If this is the case, you should redesign your application to avoid reparsing the entire file on every request. Ideally you should store data in a database or in-memory instead of processing a flat text file on every request. Then on a request, look up the information in the database or in-memory data structure.
If you cannot eliminate the 5 million-line file entirely, you can periodically recheck the large file for changes, skip/seek to the end of the last record that was parsed, then parse only new records and update the database or in-memory data structure. This can all optionally be done in a separate thread.
Firstly, 5 million lines of 1000 characters each is only 5 GB, which is not necessarily prohibitive for a JVM. If this is actually a critical use case with lots of hits, then buying more memory is almost certainly the right thing to do.
Secondly, if that is not possible, most likely the right thing to do is to build an ordered Map based on the date. So every date is a key in the map and points to a list of line numbers which contain the requests. You can then go direct to the relevant line numbers.
Something of the form
TreeMap<Date, List<Integer>> index = new TreeMap<>();
would do nicely. That should have a memory usage on the order of 5,000,000 * 32 / 8 bytes = 20 MB, which should be fine.
You could also use the FileChannel class to keep the I/O handle open as you jump from one line to another. This also allows memory mapping.
See http://docs.oracle.com/javase/7/docs/api/java/nio/channels/FileChannel.html
And http://en.wikipedia.org/wiki/Memory-mapped_file
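A minimal sketch of building such an index, assuming each line starts with an ISO date followed by a semicolon (the file name and line format are illustrative). It stores byte offsets rather than line numbers so that the reader can seek straight to the matching records; a FileChannel or memory-mapped buffer could be used the same way, but RandomAccessFile.seek() keeps the sketch short:

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.time.LocalDate;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.TreeMap;

    public class DateIndexExample {

        public static void main(String[] args) throws IOException {
            // Ordered map: date -> byte offsets of the lines carrying that date.
            TreeMap<LocalDate, List<Long>> index = new TreeMap<>();

            try (RandomAccessFile file = new RandomAccessFile("requests.log", "r")) {
                long offset = file.getFilePointer();
                String line;
                while ((line = file.readLine()) != null) {
                    // Assumes lines look like "2016-03-01;GET /foo" -- adjust to your format.
                    LocalDate date = LocalDate.parse(line.substring(0, 10));
                    index.computeIfAbsent(date, d -> new ArrayList<>()).add(offset);
                    offset = file.getFilePointer();
                }

                // Later: jump directly to the records of one date instead of rescanning the file.
                for (long position : index.getOrDefault(LocalDate.of(2016, 3, 1), Collections.emptyList())) {
                    file.seek(position);
                    System.out.println(file.readLine());
                }
            }
        }
    }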
A good way to parallelize a lot of small tasks is to wrap the processing of each task in a FutureTask and then pass each task to a ThreadPoolExecutor to run them. The executor should be initialized with the number of CPU cores your system has available.
When you call executor.execute(future), the future is queued for background processing. To avoid creating and destroying too many threads, the executor only creates as many threads as you specified and executes the futures one after another.
To retrieve the result of a future, call future.get(). If the future hasn't completed yet (or hasn't even started), this method blocks until it has completed; other futures keep executing in the background while you wait.
Remember to call executor.shutdown() when you don't need the executor anymore, to make sure it terminates the background threads that it would otherwise keep around until the keep-alive time has expired or it is garbage-collected.
tl;dr pseudocode:
create executor
for each line in file:
    create a new FutureTask which parses that line
    pass the future task to the executor
    add the future task to a list
for each entry in the task list:
    call entry.get() to retrieve the result
executor.shutdown()
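A runnable sketch of that pseudocode, assuming a hypothetical parseLine method and an input file named requests.log; it uses ExecutorService.submit(), which wraps each task in a FutureTask internally:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ParallelLineParser {

        // Hypothetical parser; replace with the real Request parsing logic.
        static String parseLine(String line) {
            return line.trim();
        }

        public static void main(String[] args) throws IOException, InterruptedException, ExecutionException {
            // One worker thread per available CPU core.
            ExecutorService executor =
                    Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

            List<Future<String>> futures = new ArrayList<>();
            try (BufferedReader reader = Files.newBufferedReader(Paths.get("requests.log"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    final String current = line;
                    // submit() queues the task; it runs as soon as a pool thread is free.
                    futures.add(executor.submit(() -> parseLine(current)));
                }
            }

            for (Future<String> future : futures) {
                // get() blocks until this particular task has finished.
                String result = future.get();
                // ... use the parsed result here ...
            }

            // Release the pool's threads once all results have been collected.
            executor.shutdown();
        }
    }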

Spring Batch flow control

I have a job in spring batch with a reader, a processor and a writer.
First, I would like to know in what order these 3 components run: are they sequential (for commit-interval=1), or is a new item read before the previous one has been written in order to avoid delays?
I am interested in this, because I have the following case:
I want to have an "assembly-line": read->process->write->read again->...
This means that nothing is read before the previous item has been written.
Is this thing already assured out-of-the-box? If not, how can I accomplish such a thing?
The interaction between an ItemReader, ItemProcessor, and ItemWriter is as follows in Spring Batch:
Until the chunk size is reached:
    ItemReader.read()
While there are items that have not been processed:
    ItemProcessor.process()
ItemWriter.write() // single call for all items in the chunk
That being said, with the chunk size set to 1, items are processed read, process, write, repeat.
It's important to note that not only is the above contract guaranteed, but each step is run to completion before the next step executes (splits notwithstanding).
You can read more about how the various components interact in the documentation here: http://docs.spring.io/spring-batch/trunk/reference/html/configureStep.html#chunkOrientedProcessing
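For reference, a minimal Java-config sketch of a step with a commit interval of 1, assuming hypothetical reader, processor, and writer beans (the StepBuilderFactory style shown here is one common way to configure this):

    import org.springframework.batch.core.Step;
    import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
    import org.springframework.batch.item.ItemProcessor;
    import org.springframework.batch.item.ItemReader;
    import org.springframework.batch.item.ItemWriter;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;

    @Configuration
    public class AssemblyLineStepConfig {

        @Bean
        public Step assemblyLineStep(StepBuilderFactory stepBuilderFactory,
                                     ItemReader<String> reader,
                                     ItemProcessor<String, String> processor,
                                     ItemWriter<String> writer) {
            // chunk(1) = commit-interval 1: read, process, write one item per transaction.
            return stepBuilderFactory.get("assemblyLineStep")
                    .<String, String>chunk(1)
                    .reader(reader)
                    .processor(processor)
                    .writer(writer)
                    .build();
        }
    }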

Time Based Streaming

I am trying to figure out how to get time-based streaming but on an infinite stream. The reason is pretty simple: Web Service call latency results per unit time.
But, that would mean I would have to terminate the stream (as I currently understand it) and that's not what I want.
In words: If 10 WS calls came in during a 1 minute interval, I want a list/stream of their latency results (in order) passed to stream processing. But obviously, I hope to get more WS calls at which time I would want to invoke the processors again.
I could totally be misunderstanding this. I had thought of using Collectors.groupingBy(x -> someTimeGrouping) so that all calls are grouped by whatever measurement interval I chose, but then no code will be aware of this until I call a terminal operation, at which point the monitoring process is done.
I am just trying to learn Java 8 by applying it to previous code.
By definition and construction, a stream can only be consumed once, so if you send your results to an infinite stream, you will not be able to access them more than once. Based on your description, it sounds like it would make more sense to store the latency results in a collection, say an ArrayList, and when you need to analyse the data, use the Stream API to group them.
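A minimal sketch of that approach, with an illustrative Sample type and one-minute buckets; each call to the reporting method creates a fresh stream over the collection, so the one-shot nature of streams isn't an obstacle:

    import java.time.Instant;
    import java.time.temporal.ChronoUnit;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.CopyOnWriteArrayList;
    import java.util.stream.Collectors;

    public class LatencyMonitor {

        // Illustrative sample type: when the call completed and how long it took.
        static class Sample {
            final Instant timestamp;
            final long latencyMillis;

            Sample(Instant timestamp, long latencyMillis) {
                this.timestamp = timestamp;
                this.latencyMillis = latencyMillis;
            }
        }

        // Thread-safe list so web-service callbacks can append concurrently.
        private final List<Sample> samples = new CopyOnWriteArrayList<>();

        public void record(Instant timestamp, long latencyMillis) {
            samples.add(new Sample(timestamp, latencyMillis));
        }

        // Stream the samples collected so far and group them into one-minute buckets.
        public Map<Instant, Double> averageLatencyPerMinute() {
            return samples.stream()
                    .collect(Collectors.groupingBy(
                            s -> s.timestamp.truncatedTo(ChronoUnit.MINUTES),
                            Collectors.averagingLong(s -> s.latencyMillis)));
        }
    }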

Spring Batch Multithreaded Job with fixed order

I created a spring batch job that reads chunks (commit level = 10) of a flat CSV file and writes the output to another flat file. Plain and simple.
To test local scaling I also configured the tasklet with a TaskExecutor with a pool of 10 threads, thus introducing parallelism by using a multithreaded step pattern.
As expected these threads concurrently read items until their chunk is filled and the chunk is written to the output file.
Also as expected the order of the items has changed because of this concurrent reading.
But is it possible to maintain the fixed order, preferably still leveraging the increased performance gained by using multiple threads?
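For context, a multi-threaded step of the kind described is commonly configured roughly like this in Java config (the bean names are illustrative and the StepBuilderFactory API is assumed):

    import org.springframework.batch.core.Step;
    import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
    import org.springframework.batch.item.ItemReader;
    import org.springframework.batch.item.ItemWriter;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.core.task.TaskExecutor;
    import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

    @Configuration
    public class MultiThreadedStepConfig {

        @Bean
        public TaskExecutor stepTaskExecutor() {
            ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
            executor.setCorePoolSize(10);   // 10 worker threads, as in the question
            executor.setMaxPoolSize(10);
            executor.initialize();
            return executor;
        }

        @Bean
        public Step csvCopyStep(StepBuilderFactory stepBuilderFactory,
                                ItemReader<String> reader,
                                ItemWriter<String> writer) {
            return stepBuilderFactory.get("csvCopyStep")
                    .<String, String>chunk(10)          // commit interval 10
                    .reader(reader)
                    .writer(writer)
                    .taskExecutor(stepTaskExecutor())   // chunks now run concurrently
                    .throttleLimit(10)                  // allow all 10 threads to be used
                    .build();
        }
    }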
I can't think of an easy way. A workaround would be to prefix all lines with an ID that is created sequentially while reading. After the job finishes, sort the lines by that ID. Sounds hacky, but it should work.
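A minimal post-processing sketch of that workaround, assuming each line was written with a numeric sequence prefix such as "000042;payload" (the file names and separator are illustrative):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Comparator;
    import java.util.List;
    import java.util.stream.Collectors;

    public class OutputSorter {

        public static void main(String[] args) throws IOException {
            // Read the out-of-order output produced by the multi-threaded step.
            List<String> lines = Files.readAllLines(Paths.get("output-unsorted.csv"));

            // Sort by the numeric sequence prefix, then strip the prefix off again.
            List<String> sorted = lines.stream()
                    .sorted(Comparator.comparingLong(
                            (String line) -> Long.parseLong(line.substring(0, line.indexOf(';')))))
                    .map(line -> line.substring(line.indexOf(';') + 1))
                    .collect(Collectors.toList());

            Files.write(Paths.get("output-sorted.csv"), sorted);
        }
    }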
I don't think there is an easy solution. Using only one writer thread (that also sorts the items before writing) together with multiple reading threads could work, but it would not be as scalable.
