I have a job in Spring Batch with a reader, a processor and a writer.
First, I would like to know in what order these 3 components run: are they sequential (for commit-interval=1), or is a new item read before the previous one has been written in order to avoid delays?
I am interested in this, because I have the following case:
I want to have an "assembly-line": read->process->write->read again->...
This means that nothing is read before the previous item has been written.
Is this already guaranteed out of the box? If not, how can I accomplish such a thing?
The interaction between an ItemReader, ItemProcessor, and ItemWriter is as follows in Spring Batch:
Until the chunk size is reached
ItemReader.read()
While there are items that have not been processed
ItemProcessor.process()
ItemWriter.write() // single call for all items in the chunk.
That being said, with the chunk size set to 1, items are handled read, process, write, repeat.
It's important to note that not only is the above contract guaranteed, but each step is run to completion before the next step executes (splits notwithstanding).
You can read more about how the various components interact in the documentation here: http://docs.spring.io/spring-batch/trunk/reference/html/configureStep.html#chunkOrientedProcessing
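For reference, here is a minimal Java-config sketch of such a chunk-oriented step with commit-interval 1 (the reader, processor and writer beans are placeholders you would define yourself):

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
@EnableBatchProcessing
public class ChunkStepConfiguration {

    @Bean
    public Step readProcessWriteStep(StepBuilderFactory stepBuilderFactory,
                                     ItemReader<String> reader,
                                     ItemProcessor<String, String> processor,
                                     ItemWriter<String> writer) {
        // chunk(1) == commit-interval=1: read one item, process it,
        // write it as a one-element chunk, commit, then repeat
        // until the reader returns null.
        return stepBuilderFactory.get("readProcessWriteStep")
                .<String, String>chunk(1)
                .reader(reader)
                .processor(processor)
                .writer(writer)
                .build();
    }
}

With a larger chunk size, the reader and processor are still called item by item, but write() receives the whole chunk at once, as in the pseudo-code above.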
Related
I understand that there are some limitations regarding item processing order when it comes to using multi-threading.
From my understanding, when we configure a step (read, process, write) to use multi-threading (a taskExecutor, for example), we can't guarantee the incoming order of items, since we don't know which thread will finish processing first.
Is there a safe and simple way to read and process a flat file (FlatFileItemReader) while preserving the initial order of its items using multi-threading?
Thank you
I read that:
When processing is complete in your Tasklet implementation, you return
an org.springframework.batch.repeat.RepeatStatus object. There are two
options with this: RepeatStatus.CONTINUABLE and RepeatStatus.FINISHED.
These two values can be confusing at first glance. If you return
RepeatStatus.CONTINUABLE, you aren't saying that the job can continue.
You're telling Spring Batch to run the tasklet again. Say, for
example, that you wanted to execute a particular tasklet in a loop
until a given condition was met, yet you still wanted to use Spring
Batch to keep track of how many times the tasklet was executed,
transactions, and so on. Your tasklet could return
RepeatStatus.CONTINUABLE until the condition was met. If you return
RepeatStatus.FINISHED, that means the processing for this tasklet is
complete (regardless of success) and to continue with the next piece
of processing.
But I can't imagine an example of using this feature. Could you explain it to me? When will the tasklet be invoked the next time?
Let's say that you have a large set of items (for example files), and you need to enrich each one of them in some way, which requires consuming an external service. The external service might provide a chunked mode that can process up to 1000 requests at once instead of making a separate remote call for each single file. That might be the only way you can bring down your overall processing time to the required level.
However, this is not possible to implement using Spring Batch's Reader/Processor/Writer API in a nice way, because the Processor is fed item by item and not entire chunks of them. Only the Writer actually sees chunks of items.
You could implement this using a Tasklet that reads the next up to 1000 unprocessed files, sends a chunked request to the service, processes the results, writes output files and deletes or moves the processed files.
Finally it checks if there are more unprocessed files left. Depending on that it returns either FINISHED or CONTINUABLE, in which case the framework would invoke the Tasklet again to process the next up to 1000 files.
This is actually a quite realistic scenario, so I hope that illustrates the purpose of the feature.
This allows you to break up processing of a complex task across multiple iterations.
The functionality is similar to a while(true) loop with continue/break.
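As a rough sketch of such a Tasklet (FileEnrichmentService is a made-up collaborator standing in for the external chunked service):

import java.util.List;
import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;

public class EnrichFilesTasklet implements Tasklet {

    // Made-up collaborator standing in for the external chunked service.
    public interface FileEnrichmentService {
        List<String> nextUnprocessedFiles(int maxCount);
        void enrichAndWrite(List<String> files);
    }

    private final FileEnrichmentService fileService;

    public EnrichFilesTasklet(FileEnrichmentService fileService) {
        this.fileService = fileService;
    }

    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception {
        // Take the next batch of up to 1000 unprocessed files.
        List<String> batch = fileService.nextUnprocessedFiles(1000);
        if (batch.isEmpty()) {
            return RepeatStatus.FINISHED;       // nothing left to do, the step is complete
        }
        // One chunked remote call for the whole batch, then write/move the results.
        fileService.enrichAndWrite(batch);
        // Tell Spring Batch to invoke this tasklet again for the next batch.
        return RepeatStatus.CONTINUABLE;
    }
}

Each invocation of execute() runs in its own transaction, which matches the bookkeeping ("transactions, and so on") mentioned in the quoted passage above.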
We have a requirement to read a CSV file and for each line read, there shall be a REST call. The REST call returns an array of elements, for each element returned there shall be an update in DB.
There is an NFR to achieve parallel processing for the above requirement.
After reading the CSV, the processing of each individual line has to happen in parallel, i.e., there shall be a group of workers making concurrent REST calls, one per line read.
Subsequently, for each array element found in the response of the REST call, there shall be parallel updates to the DB as well.
Any valuable suggestions / ideas on how to achieve this requirement in Spring Batch?
We have thought of 2 approaches. The first one is to go with a Master/Worker design for doing the REST calls on each CSV line that is read.
Each worker here will correspond to one line read from the CSV file; it shall perform the REST call, and when the response is returned, each of these workers shall become a master itself and launch another set of workers based on the number of array elements returned from the REST call.
Each worker launched in this second stage shall handle one element of the response returned above and will perform the DB updates in parallel.
Is this even possible and a good solution?
The second one is to use a JobStep-based approach, i.e., to launch a Job from another. If we have to follow this approach, how do we communicate the data between the 2 Jobs? Suppose our first job (Job1) reads the CSV and makes a REST call; the second job (Job2) shall be responsible for processing the individual response elements from the REST call. In that case, how do we communicate the response element data from Job1 to Job2? In this scenario, can we have Job1 launching multiple Job2 instances for parallel DB updates?
The solutions we have thought of may not be very clear and may be confusing; we are not sure if they are even correct and feasible.
Apologies for any confusion caused, but we are clueless about how to come up with a design for this requirement.
In either case, we are not clear on how the failures will be tracked and how the job can be re-run from the failed state.
Any help is much appreciated!!
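For illustration only, here is a hedged sketch of the per-line parallelism expressed as a single multi-threaded chunk step rather than the master/worker design described above; CsvLine and Element are made-up domain types and the injected reader, processor and writer beans are placeholders:

import java.util.List;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.task.TaskExecutor;

@Configuration
public class CsvRestDbStepConfiguration {

    // Made-up domain types for the sketch.
    public static class CsvLine { /* parsed CSV fields */ }
    public static class Element { /* one element of the REST response */ }

    @Bean
    public Step csvRestDbStep(StepBuilderFactory steps,
                              ItemReader<CsvLine> csvReader,                        // wrap in a SynchronizedItemStreamReader for thread safety
                              ItemProcessor<CsvLine, List<Element>> restProcessor,  // one REST call per CSV line
                              ItemWriter<List<Element>> dbWriter,                   // DB update for every element returned
                              TaskExecutor taskExecutor) {
        return steps.get("csvRestDbStep")
                .<CsvLine, List<Element>>chunk(10)
                .reader(csvReader)
                .processor(restProcessor)   // runs concurrently across worker threads
                .writer(dbWriter)
                .taskExecutor(taskExecutor) // makes the read/process/write loop multi-threaded
                .throttleLimit(8)           // upper bound on concurrent worker threads
                .build();
    }
}

Whether this satisfies the NFR better than a master/worker or job-launching design is a decision this sketch does not settle; it only shows one way to get concurrent REST calls and DB updates per line.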
Suppose I have 10 records and some of them are corrupted; how will Spring handle the restart?
For example, suppose records no. 3 and 7 are corrupt and they go to different reducers; how will Spring handle the restart?
1. How will it maintain the queue to track where it last failed?
2. What are the different ways we can solve this?
Spring Batch will do exactly what you tell it to do.
Restart for Spring Batch means running the same job that failed with the same set of input parameters. However, a new instance (execution) of this job will be created.
The job will run on the same data set that the failed instance of the job ran on.
In general, it is not a good idea to modify the input data set for your job - the input data of a MapReduce job must be immutable (I assume you will not modify the same data set you use as input).
In your case the job is likely to complete with BatchStatus.COMPLETED unless you put very specific logic in the last step of your Spring Batch job.
This last step will validate all records and, if any broken records are detected, artificially set the status of the job to BatchStatus.FAILED, like below:
jobExecution.setStatus(BatchStatus.FAILED);
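A hedged sketch of such a final validation step as a Tasklet (findBrokenRecords() is a placeholder for whatever check you run against your processed data):

import java.util.Collections;
import java.util.List;
import org.springframework.batch.core.BatchStatus;
import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;

public class ValidateRecordsTasklet implements Tasklet {

    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception {
        List<Long> brokenRecordIds = findBrokenRecords();   // placeholder for your own validation logic
        if (!brokenRecordIds.isEmpty()) {
            // Artificially mark the whole job execution as failed so it can be restarted later.
            chunkContext.getStepContext()
                        .getStepExecution()
                        .getJobExecution()
                        .setStatus(BatchStatus.FAILED);
        }
        return RepeatStatus.FINISHED;
    }

    private List<Long> findBrokenRecords() {
        // ... query the processed data set for corrupt records ...
        return Collections.emptyList();
    }
}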
Now how to restart the job is a good question that I will answer in a few moments.
However, before restarting, the question you need to ask is: if the input data set for your MapReduce job and the code of your MapReduce job have not changed, how is a restart going to help you?
I think you need to have some kind of data set where you dump all the bad records that the original MapReduce job failed to process. Then how to process these broken records is for you to decide.
Anyway, restarting a Spring Batch job is easy once you know the ID of the failed JobExecution. Below is the code:
final Long restartId = jobOperator.restart(failedJobId);
final JobExecution restartExecution = jobExplorer.getJobExecution(restartId);
EDIT
Read about the ItemReader, ItemWriter and ItemProcessor interfaces.
I think that you can achieve tracking by using CompositeItemProcessor.
In Hadoop every record in a file must have a unique ID. So, I think you can store the list of IDs of the bad records in the job context. Update the JobParameter that you would have created when the job started for the first time; call it badRecordsList. Now when you restart/resume your job, you will read the value of badRecordsList and will have a reference.
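A hedged sketch of one way to keep such a badRecordsList across a restart, using the job-level ExecutionContext and a StepExecutionListener (recordBadId() would be called from your own processing code):

import java.util.ArrayList;
import java.util.List;
import org.springframework.batch.core.ExitStatus;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.StepExecutionListener;
import org.springframework.batch.item.ExecutionContext;

public class BadRecordTrackingListener implements StepExecutionListener {

    private final List<String> badRecordIds = new ArrayList<>();

    // Call this whenever a corrupt record is detected during processing.
    public void recordBadId(String id) {
        badRecordIds.add(id);
    }

    @Override
    public void beforeStep(StepExecution stepExecution) {
        ExecutionContext jobContext = stepExecution.getJobExecution().getExecutionContext();
        if (jobContext.containsKey("badRecordsList")) {
            // On restart, pick up the IDs collected by the previous (failed) execution.
            badRecordIds.addAll((List<String>) jobContext.get("badRecordsList"));
        }
    }

    @Override
    public ExitStatus afterStep(StepExecution stepExecution) {
        // Store the collected IDs in the job-level ExecutionContext.
        stepExecution.getJobExecution()
                     .getExecutionContext()
                     .put("badRecordsList", new ArrayList<>(badRecordIds));
        return stepExecution.getExitStatus();
    }
}

The job-level ExecutionContext is persisted by the framework, so the list should be available again when the failed execution is restarted.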
I created a Spring Batch job that reads chunks (commit interval = 10) of a flat CSV file and writes the output to another flat file. Plain and simple.
To test local scaling I also configured the tasklet with a TaskExecutor with a pool of 10 threads, thus introducing parallelism by using the multi-threaded step pattern.
As expected these threads concurrently read items until their chunk is filled and the chunk is written to the output file.
Also as expected the order of the items has changed because of this concurrent reading.
But is it possible to maintain the fixed order, preferably still leveraging the increased performance gained by using multiple threads?
I can't think of an easy way. A workaround would be to prefix all lines with an ID which is created sequentially while reading. After finishing the job, sort the lines by that ID. Sounds hacky, but it should work.
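A hedged sketch of that workaround as a delegating reader that stamps each item with its read position (NumberedLine is a made-up wrapper type):

import java.util.concurrent.atomic.AtomicLong;
import org.springframework.batch.item.ItemReader;

public class SequencingItemReader implements ItemReader<SequencingItemReader.NumberedLine> {

    // Made-up wrapper pairing a line with the position it was read at.
    public static class NumberedLine {
        public final long sequence;
        public final String line;

        public NumberedLine(long sequence, String line) {
            this.sequence = sequence;
            this.line = line;
        }
    }

    private final ItemReader<String> delegate;   // e.g. the FlatFileItemReader
    private final AtomicLong sequence = new AtomicLong();

    public SequencingItemReader(ItemReader<String> delegate) {
        this.delegate = delegate;
    }

    @Override
    public synchronized NumberedLine read() throws Exception {
        // Reading and numbering happen atomically, so the sequence reflects file order.
        String line = delegate.read();
        if (line == null) {
            return null;   // end of input
        }
        return new NumberedLine(sequence.incrementAndGet(), line);
    }
}

The writer can then include the sequence number in each output line, and a final step (or an external sort) can reorder the file afterwards, as suggested above.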
I don't think there is an easy solution. Using only one writer thread (that also performs a sort when writing) together with multiple reading threads could work, but it would not be as scalable.