Chaining steps in spring batch - java

I was reading the documentation for the Spring Batch project, and I want to know if there is an out-of-the-box configuration to chain steps, meaning the output of the first step becomes the input of the second one, and so on.
I'm not asking about step flows where one step executes after another; it's more about using the output of one step's item processor as the input of the next step.
What I have in mind is to use a normal step with a reader and a processor, and in the writer create a flat file that could be read by the reader of the next step, but this seems inefficient, as it writes objects that are already in the JVM only to restore them with the second reader.
I'm not sure if this is possible with normal Spring config, or if JSR-352 works exactly the way I want.

Instead of multiple steps use multiple ItemProcessors in a chain. You can chain them using a CompositeItemProcessor.
EDIT:
I was reading about the Spring Batch strategies and I did not find any out-of-the-box XML configuration to chain steps in a kind of pipeline. The best option that fits my needs is to use an ItemProcessorAdapter to run the different pieces of logic that I need in the steps, and a CompositeItemProcessor (6.21) to chain them.
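For reference, a minimal Java-config sketch of such a chain; the item types and the two delegate processors (Step1Logic, Step2Logic) are hypothetical placeholders:

// uses org.springframework.batch.item.support.CompositeItemProcessor
@Bean
public CompositeItemProcessor<InputItem, OutputItem> compositeProcessor() {
    CompositeItemProcessor<InputItem, OutputItem> processor = new CompositeItemProcessor<>();
    // each delegate receives the output of the previous one, forming a pipeline
    processor.setDelegates(Arrays.asList(new Step1Logic(), new Step2Logic()));
    return processor;
}

The composite is then set as the processor of a single chunk-oriented step, so the whole pipeline runs inside one step.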

Related

Multiple reader/processor/writer in spring batch

I am new to Spring Batch and I have a peculiar problem. I want to get the results of 3 different JPA queries with JpaPagingItemReader, process them individually, and write them into one consolidated XML file using StaxEventItemWriter.
For example, the resulting XML would look like:
<root>
<query1>
...
</query1>
<query2>
...
</query2>
<query3>
...
</query3>
</root>
Please let me know how to achieve this.
Also, I currently implemented my configuration with one query, but the reader/writer is quite slow: it took around 59 minutes to generate a 20 MB file, since I am running it in a single-threaded environment for now, as opposed to a multithreaded one. If there are any other suggestions around this, please do let me know. Thanks.
EDIT:
I tried the following approach:
I created 3 different steps and added 1 reader, processor and writer to each of them, but the problem I am facing now is that the writer is not able to write to the same file or append to it.
This is what the StaxEventItemWriter class does:
FileUtils.setUpOutputFile(file, restarted, false, overwriteOutput);
Here the 3rd argument, append, is hardcoded to false.
Your second approach seems like the right direction: create 3 different readers/processors/writers, and write a custom writer that extends AbstractFileItemWriter, where appending is allowed via setAppendAllowed. Also, I have seen that xmlWriter writes XML faster than StaxEventItemWriter, but there is some trade-off in boilerplate code.
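A rough sketch of such a writer, assuming the Spring Batch 4.x AbstractFileItemWriter API; the MyItem type and the fragment formatting are placeholders:

import java.util.List;
import org.springframework.batch.item.support.AbstractFileItemWriter;

public class AppendingXmlItemWriter extends AbstractFileItemWriter<MyItem> {

    public AppendingXmlItemWriter() {
        setName("appendingXmlItemWriter");
        setAppendAllowed(true); // the switch that StaxEventItemWriter does not expose
    }

    @Override
    protected String doWrite(List<? extends MyItem> items) {
        // build the XML fragment for this chunk; real marshalling/escaping is up to you
        StringBuilder fragment = new StringBuilder();
        for (MyItem item : items) {
            fragment.append("<item>").append(item).append("</item>\n");
        }
        return fragment.toString();
    }

    @Override
    public void afterPropertiesSet() {
        // nothing mandatory to validate in this sketch
    }
}

Each of the 3 steps can then point an instance of this writer at the same resource with append enabled, so later steps do not truncate the file.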
One option off the top of my head is to:
- create a StaxEventItemWriter
- create 3 instances of a step that has a JpaPagingItemReader and writes the corresponding <queryX>...</queryX> section to the shared writer
- write the <root> and </root> tags in a JobExecutionListener (see the sketch after this list), so the steps don't care about the envelope
There are other considerations here, like whether it's always 3 files, etc., but the general idea is to separate concerns between processors, steps, jobs, tasks and listeners so that each performs a clear piece of work.
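A hedged sketch of that listener; the output path is a placeholder and must match the resource the shared writer appends to:

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobExecutionListener;

public class RootTagListener implements JobExecutionListener {

    private final Path output = Paths.get("build/consolidated.xml"); // placeholder path

    @Override
    public void beforeJob(JobExecution jobExecution) {
        try {
            // truncate/create the file and write the opening tag before any step runs
            Files.write(output, "<root>\n".getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public void afterJob(JobExecution jobExecution) {
        try {
            // close the envelope once all steps have appended their sections
            Files.write(output, "</root>\n".getBytes(StandardCharsets.UTF_8), StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}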
Use JVisualVM to monitor the bottlenecks inside your application.
Since you said it is taking 59 minutes to create a 20 MB file, you will get better insight into where you are taking performance hits.
VisualVM tutorial:
Open VisualVM, connect to your application => Sampler => CPU => CPU Samples.
Take snapshots at various times and analyse where it is spending the most time. Just by checking this you will get enough data for optimisation.
Note: JVisualVM ships with the Oracle JDK 8 distribution; you can simply type jvisualvm at the command prompt/terminal. If it is not there, download it from here.

Spring Batch: Reading data from one source but writing different data to 2 separate files

I have written a Spring Batch program to read/process/write data into a single file. I have a new business requirement wherein, from the same data I am reading, I have to build another list with different data, then process/format that data and write it to a separate file.
I have looked into MultiFormatItemWriter, in which I can define separate FlatFileItemWriters, and into CompositeItemWriter as well, but I am unable to understand how to send the different lists to these different file writers.
Please suggest some options, with sample code if possible.
A combination of ClassifierCompositeItemProcessor and ClassifierCompositeItemWriter is what you are looking for. The classifier allows you to route items to the right processor/writer based on their class.
You can find an example here.
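For illustration, a minimal sketch of the writer side; TypeA and the two delegate writers are hypothetical, and the lambda is an org.springframework.classify.Classifier:

@Bean
public ClassifierCompositeItemWriter<Object> classifierWriter(
        FlatFileItemWriter<Object> mainFileWriter,
        FlatFileItemWriter<Object> otherFileWriter) {
    ClassifierCompositeItemWriter<Object> writer = new ClassifierCompositeItemWriter<>();
    // route each item to the delegate that matches its class
    writer.setClassifier(item -> item instanceof TypeA ? mainFileWriter : otherFileWriter);
    return writer;
}

One caveat: ClassifierCompositeItemWriter does not propagate ItemStream callbacks, so the delegate FlatFileItemWriters typically have to be registered as streams on the step (e.g. via .stream(...) in the step builder) so they are opened and closed properly. ClassifierCompositeItemProcessor is configured the same way, with ItemProcessor delegates.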

How to define multiple threads in MultiResourceItemReader?

I am using the MultiResourceItemReader class of Spring Batch, which uses a FlatFileItemReader bean as its delegate. My files contain XML requests; my batch reads the requests from the files, sends them to a URL, and writes the responses to corresponding output files. I want to use one thread per file to decrease execution time. In my current requirement I have four input files, so I want four threads to read, process and write the files. I tried a simple task executor with
task-executor="simpleTaskExecutor" throttle-limit="20"
but after using this the FlatFileItemReader throws an exception.
I am a beginner; please suggest how to implement this. Thanks in advance.
There are a couple of ways to go here. However, the easiest would be to partition by file using the MultiResourcePartitioner. That, in combination with the TaskExecutorPartitionHandler, will give you reliable parallel processing of your input files. You can read more about partitioning in section 7.4 of our documentation: http://docs.spring.io/spring-batch/trunk/reference/html/scalability.html
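In Java config, the partitioned setup could look roughly like this; the bean names, the file pattern, and the grid size of 4 are assumptions, and the worker reader is step-scoped so each partition binds its own file:

// uses org.springframework.batch.core.partition.support.MultiResourcePartitioner
@Bean
public Step masterStep(StepBuilderFactory steps, Step workerStep) throws IOException {
    // one partition (and one worker step execution) per matching input file
    MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
    partitioner.setResources(
            new PathMatchingResourcePatternResolver().getResources("file:input/*.xml"));
    return steps.get("masterStep")
            .partitioner("workerStep", partitioner)
            .step(workerStep)
            .gridSize(4)
            .taskExecutor(new SimpleAsyncTaskExecutor())
            .build();
}

@Bean
@StepScope
public FlatFileItemReader<String> workerReader(
        @Value("#{stepExecutionContext['fileName']}") Resource file) {
    // each partition injects its own file into a step-scoped reader instance
    FlatFileItemReader<String> reader = new FlatFileItemReader<>();
    reader.setResource(file);
    reader.setLineMapper(new PassThroughLineMapper());
    return reader;
}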

Passing outputs between spring batch steps [duplicate]

This question already has answers here:
How can we share data between the different steps of a Job in Spring Batch?
I have two business logic steps:
- download XML from an external resource, parse it and transform it into objects
- dispatch the output (object list) to an external queue
@Bean
public Job job() throws Exception {
    return this.jobs.get("job").start(getXmlViaHttpStep()).next(pushMessageToQueue()).build();
}
So my first step is a Tasklet which downloads the file (via HTTP) and converts it into objects.
My second step is another Tasklet that is supposed to dispatch the output of the previous step.
Now, how do I pass the output list from step 1 into step 2 (as its input)?
I could save it to a temp file, but isn't there a better-practice scenario for this?
I can see at least two options that are both viable.
Option 1: set up the job as one step
You can set up your job to contain one step where the reader simply reads the input from your URL and the writer posts to your queue.
Option 2: set up the job as two steps with intermediate storage
However, you may want to divide the job into two steps to be able to re-run a step if it fails, to simplify debugging, etc. In that case, the following approach may work out for you:
Step 1: Create a step in which a FlatFileItemReader or similar is used to download the file. The step can then configure a FlatFileItemWriter to move the contents to disk.
Step 2: Open the file produced by the ItemWriter of the previous step. One alternative is to use org.springframework.batch.item.xml.StaxEventItemReader together with a Jaxb2Marshaller to handle the processing (as described in this blog). Configure the output step to post messages to a queue using e.g. org.springframework.batch.item.jms.JmsItemWriter. The writer is (as always) chunked, so multiple messages can be posted per write.
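For that second step, a hedged sketch of the reader/writer pair; the Payload class, the fragment name, the file location and the injected JmsTemplate are all assumptions:

@Bean
public Step publishStep(StepBuilderFactory steps, JmsTemplate jmsTemplate) {
    Jaxb2Marshaller marshaller = new Jaxb2Marshaller();
    marshaller.setClassesToBeBound(Payload.class); // hypothetical JAXB-annotated class

    // reads one <payload> fragment at a time from the file written in step 1
    StaxEventItemReader<Payload> reader = new StaxEventItemReader<>();
    reader.setResource(new FileSystemResource("build/downloaded.xml"));
    reader.setFragmentRootElementName("payload");
    reader.setUnmarshaller(marshaller);

    // posts each chunk of items to the queue; the jmsTemplate is assumed
    // to have a default destination configured
    JmsItemWriter<Payload> writer = new JmsItemWriter<>();
    writer.setJmsTemplate(jmsTemplate);

    return steps.get("publishStep")
            .<Payload, Payload>chunk(10)
            .reader(reader)
            .writer(writer)
            .build();
}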
Personally, I would probably set up the whole thing as Option 2. I find that simple steps without too many transformations are easier to follow and also easier to test, but that is just a matter of taste.

How do I log from a mapper? (hadoop with commoncrawl)

I'm using the commoncrawl example code from their "Mapreduce for the Masses" tutorial. I'm trying to make modifications to the mapper and I'd like to be able to log strings to some output. I'm considering setting up a noSQL db and just pushing my output to it, but it doesn't feel like a good solution. What's the standard way to do this kind of logging from Java?
While there is no special solution for logs aside from the usual logger (at least none I am aware of), I can suggest some approaches:
a) If the logs are for debugging purposes, write the usual debug logs. In the case of failed tasks you can find them via the UI and analyse them.
b) If these logs are some kind of output you want to get alongside the other output of your job, assign them some special key and write them to the context. Then in the reducer you will need some special logic to put them into the output.
c) You can create a directory on HDFS and have the mapper write there. This is not the classic way for MapReduce, because it is a side effect; in some cases it can be fine. Especially taking into account that each mapper will create its own file, you can use the command hadoop fs -getmerge ... to get all logs as one file.
d) If you want to be able to monitor the progress of your job, the number of errors, etc., you can use counters.
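For option d), a minimal mapper sketch using the org.apache.hadoop.mapreduce API; the counter group and names are arbitrary:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CountingMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (value.getLength() == 0) {
            // counters are aggregated across all tasks and shown in the job UI
            context.getCounter("debug", "empty-records").increment(1);
            return;
        }
        context.getCounter("debug", "good-records").increment(1);
        context.write(new Text("ok"), value);
    }
}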
