Multiple reader/processor/writer in spring batch - java

I am new to Spring Batch and I have a peculiar problem. I want to get the results from 3 different JPA queries with a JpaPagingItemReader, process them individually, and write them into one consolidated XML file using StaxEventItemWriter.
For example, the resultant XML would look like:
<root>
<query1>
...
</query1>
<query2>
...
</query2>
<query3>
...
</query3>
</root>
Please let me know how to achieve this.
Also, I have currently implemented my configurer with one query, but the reader/writer is quite slow: it took around 59 minutes to generate a 20 MB file, since I am running it in a single-threaded environment for now as opposed to a multi-threaded one. If there are any other suggestions around this, please do let me know. Thanks.
EDIT:
I tried the following approach:
I created 3 different steps and added 1 reader, processor, and writer to each of them, but the problem I am facing now is that the writer is not able to write to the same file or append to it.
This is what is written in the StaxEventItemWriter class:
FileUtils.setUpOutputFile(file, restarted, false, overwriteOutput);
Here the 3rd argument, append, is false by default.

The second approach in your question seems like the right direction: you could create 3 different readers/processors/writers and create a custom writer extending AbstractFileItemWriter, which does allow appending (via setAppendAllowed). Also, I have seen that xmlWriter writes XML faster than StaxEventItemWriter, but there is some trade-off in writing boilerplate code.
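For illustration only, here is a rough sketch of such a writer, assuming the Spring Batch 4.x API and a marshaller configured to emit XML fragments (no per-item XML declaration); the class name and wiring below are hypothetical:

import java.io.StringWriter;
import java.util.List;
import javax.xml.transform.stream.StreamResult;
import org.springframework.batch.item.support.AbstractFileItemWriter;
import org.springframework.oxm.Marshaller;

// Sketch only: writes each item as an XML fragment and allows appending,
// so three steps can target the same output file one after the other.
public class AppendingXmlFragmentWriter<T> extends AbstractFileItemWriter<T> {

    private final Marshaller marshaller; // e.g. a Jaxb2Marshaller set up to omit the XML declaration

    public AppendingXmlFragmentWriter(Marshaller marshaller) {
        this.marshaller = marshaller;
        setAppendAllowed(true);          // AbstractFileItemWriter allows this, StaxEventItemWriter does not
        setName("appendingXmlFragmentWriter");
    }

    @Override
    public String doWrite(List<? extends T> items) {
        StringBuilder fragment = new StringBuilder();
        for (T item : items) {
            StringWriter out = new StringWriter();
            try {
                marshaller.marshal(item, new StreamResult(out)); // one element per item
            } catch (java.io.IOException e) {
                throw new IllegalStateException("Failed to marshal item", e);
            }
            fragment.append(out).append('\n');
        }
        return fragment.toString();
    }
}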

One option off the top of my head is to
create a StaxEventItemWriter
create 3 instances of a step that has a JpaPagingItemReader and writes the corresponding <queryX>...</queryX> section to the shared writer
write the <root> and </root> tags in a JobExecutionListener, so the steps don't care about the envelope
There are other considerations here, like whether it is always 3 queries, etc., but the general idea is to separate concerns between processors, steps, jobs, tasks, and listeners so that each performs a clear piece of work.
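As a rough sketch of that wiring (Spring Batch 4.x builder API; the step and listener bean names are placeholders, and rootTagListener is assumed to write the envelope tags in beforeJob/afterJob):

import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobExecutionListener;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class ExportJobConfig {

    @Bean
    public Job exportJob(JobBuilderFactory jobs,
                         Step query1Step, Step query2Step, Step query3Step,
                         JobExecutionListener rootTagListener) {
        // rootTagListener writes <root> in beforeJob() and </root> in afterJob(),
        // so the three steps only ever deal with their own <queryX> fragments.
        return jobs.get("exportJob")
                .listener(rootTagListener)
                .start(query1Step)
                .next(query2Step)
                .next(query3Step)
                .build();
    }
}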

Use VisualVM to monitor the bottlenecks inside your application.
Since you said it is taking 59 minutes to create a 20 MB file, you will get better insight into where you are taking performance hits.
VisualVM tutorial
Open VisualVM, connect to your application => Sampler => CPU => CPU Samples.
Take snapshots at various times and analyse where it is spending the most time. Just by checking this you will get enough data for optimisation.
Note: jvisualvm ships with the Oracle JDK 8 distribution; you can simply type jvisualvm at the command prompt/terminal. If it is not there, download it from here.

Related

Chaining steps in spring batch

I was reading the Spring documentation for the Spring Batch project, and I want to know if there is an out-of-the-box configuration to chain steps, meaning the output of the first step becomes the input of the second one, and so on.
I'm not asking about step flows where one executes after the other; it is more about using the output of one step's item processor as the input of the next one.
What I have in mind is to use a normal step with a reader and processor, and have the writer create a flat file that could be read by the reader of the next step, but this seems inefficient as it needs to write out objects that are already in the JVM and restore them with the second reader.
I'm not sure if this is possible with normal Spring config, or whether the JSR does not work exactly the way I want.
Instead of multiple steps use multiple ItemProcessors in a chain. You can chain them using a CompositeItemProcessor.
EDIT :
I was reading about the Spring Batch strategies and I did not find any out-of-the-box XML configuration to chain the steps in a kind of pipeline; the best option that fits my needs is to use an ItemProcessorAdapter to run the different pieces of logic I need in the steps, and a CompositeItemProcessor (6.21) to make a chain of them.
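For illustration, a minimal Java-config sketch of such a chain (the record and processor class names are made up; the same can be declared in XML):

import java.util.Arrays;
import org.springframework.batch.item.support.CompositeItemProcessor;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class ProcessorChainConfig {

    // Delegates run in order; the output type of each must match the input type of the next.
    @Bean
    public CompositeItemProcessor<RawRecord, EnrichedRecord> chainedProcessor() {
        CompositeItemProcessor<RawRecord, EnrichedRecord> composite = new CompositeItemProcessor<>();
        composite.setDelegates(Arrays.asList(new ValidationProcessor(), new EnrichmentProcessor()));
        return composite;
    }
}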

How to define multiple threads in MultiResourceItemReader?

I am using the MultiResourceItemReader class of Spring Batch, which uses a FlatFileItemReader bean as its delegate. My files contain XML requests; my batch reads the requests from the files, sends them to a URL, and writes the responses to the corresponding output files. I want to define one thread per file to decrease execution time. In my current requirement I have four input files, and I want to define four threads to read, process and write the files. I tried a simpleTaskExecutor with
task-executor="simpleTaskExecutor" throttle-limit="20"
But after using this, the FlatFileItemReader throws an exception.
I am a beginner; please suggest how to implement this. Thanks in advance.
There are a couple of ways to go here. However, the easiest would be to partition by file using the MultiResourcePartitioner. That, in combination with the TaskExecutorPartitionHandler, will give you reliable parallel processing of your input files. You can read more about partitioning in section 7.4 of our documentation here: http://docs.spring.io/spring-batch/trunk/reference/html/scalability.html
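For illustration, a rough Java-config sketch of that setup (the resource pattern, bean names and grid size are placeholders):

import java.io.IOException;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.partition.support.MultiResourcePartitioner;
import org.springframework.batch.core.partition.support.TaskExecutorPartitionHandler;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.support.PathMatchingResourcePatternResolver;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

@Configuration
public class PartitionedJobConfig {

    @Bean
    public Step partitionedStep(StepBuilderFactory steps, Step workerStep) throws IOException {
        // one partition per input file; each partition exposes its file under the "fileName" key
        MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
        partitioner.setResources(new PathMatchingResourcePatternResolver()
                .getResources("file:input/*.xml"));

        TaskExecutorPartitionHandler handler = new TaskExecutorPartitionHandler();
        handler.setTaskExecutor(new SimpleAsyncTaskExecutor());
        handler.setStep(workerStep);   // the existing read/process/write step, run once per file
        handler.setGridSize(4);        // four input files -> four parallel executions

        return steps.get("partitionedStep")
                .partitioner("workerStep", partitioner)
                .partitionHandler(handler)
                .build();
    }
}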

restart SAX parser from the middle of the document

I'm working on a project that needs to parse a very big XML file (about 10 GB). Because the processing time is really long (on the order of days), it's possible that my code will exit in the middle of the process, so I want to save its status once in a while and then be able to restart it from the last save point.
Is there a way to start (restart) a SAX parser somewhere other than the beginning of an XML file?
P.S: I'm programming using Python, but solutions for Java and C++ are also acceptable.
Not really sure if this answers your question, but I would take a different approach. 10 GB is not THAT much data, so you could implement two-phase parsing.
Phase 1 would be to split the file into smaller chunks based on some tag, so you end up with a number of smaller files. For example, if your first file is A.xml, you split it into A_0.xml, A_1.xml, etc.
Phase 2 would do the real heavy lifting on each chunk, so you invoke it on A_0.xml, then after that on A_1.xml, etc. You can then restart from a particular chunk after your code has exited.
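If it helps, here is a very rough Java sketch of phase 1, assuming the records are line-delimited <record> elements inside a <root> envelope (the element names, chunk size and file names are all placeholders):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class XmlSplitter {

    private static final int RECORDS_PER_CHUNK = 100_000;

    public static void main(String[] args) throws IOException {
        int chunk = 0;
        int recordsInChunk = 0;
        BufferedWriter out = newChunk(chunk);
        try (BufferedReader in = Files.newBufferedReader(Paths.get("A.xml"))) {
            String line;
            while ((line = in.readLine()) != null) {
                // drop the original declaration/envelope; each chunk gets its own
                if (line.contains("<?xml") || line.trim().equals("<root>") || line.trim().equals("</root>")) {
                    continue;
                }
                out.write(line);
                out.newLine();
                // start a new chunk file every RECORDS_PER_CHUNK complete records
                if (line.contains("</record>") && ++recordsInChunk == RECORDS_PER_CHUNK) {
                    closeChunk(out);
                    out = newChunk(++chunk);
                    recordsInChunk = 0;
                }
            }
        }
        closeChunk(out);
    }

    private static BufferedWriter newChunk(int index) throws IOException {
        BufferedWriter w = Files.newBufferedWriter(Paths.get("A_" + index + ".xml"));
        w.write("<root>");   // re-open the envelope so every chunk is well-formed on its own
        w.newLine();
        return w;
    }

    private static void closeChunk(BufferedWriter w) throws IOException {
        w.write("</root>");
        w.newLine();
        w.close();
    }
}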

Search optimization when data owner is someone else

In my project, we have 2 REST calls which take too much time, so we are planning to optimize them. Here is how it works currently: we make the 1st call to system A and then pass the response to system B for further processing. Once we get the response from system B, we have to manipulate it further before passing it to the UI layer, and this entire process takes a lot of time. We planned on using Solr/Lucene, but since we are not the data owners, we can't implement that. Can someone please shed some light on how best this can be handled? We are using Spring MVC and Spring Web Flow. Thanks in advance!!
[EDIT:] This is not the actual scenario; I am writing this as an example for better understanding. Think of it as making a store-locator call for a particular zip code to get a list of 100 stores, and then sending those 100 stores to another call to get a list of their inventory, etc. So this list of stores would change for every zip code, and so would the inventory.
If your query parameters to System A / System B are frequently the same, you can add a caching framework to your code. If you use Spring 3, you can use the cache abstraction easily with an @Cacheable annotation on the code calling System A. See:
http://static.springsource.org/spring/docs/3.1.0.M1/spring-framework-reference/html/cache.html
The cache subsystem will cache the result of the annotated method, including the work done by its processing code.
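A minimal sketch, assuming Spring 3.1+ with caching enabled (via <cache:annotation-driven/> or @EnableCaching); the service, client and cache names below are made up:

import java.util.List;
import org.springframework.cache.annotation.Cacheable;
import org.springframework.stereotype.Service;

@Service
public class StoreLocatorService {

    private final SystemAClient systemAClient; // hypothetical client wrapping the REST call to System A

    public StoreLocatorService(SystemAClient systemAClient) {
        this.systemAClient = systemAClient;
    }

    // Repeated lookups for the same zip code are served from the "storesByZip"
    // cache instead of calling System A (and re-running the processing) again.
    @Cacheable("storesByZip")
    public List<Store> findStores(String zipCode) {
        return systemAClient.fetchStores(zipCode);
    }
}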

How do I log from a mapper? (hadoop with commoncrawl)

I'm using the Common Crawl example code from their "MapReduce for the Masses" tutorial. I'm trying to make modifications to the mapper and I'd like to be able to log strings to some output. I'm considering setting up a NoSQL DB and just pushing my output to it, but it doesn't feel like a good solution. What's the standard way to do this kind of logging from Java?
While there is no special solution for logs aside from the usual logger (at least none I am aware of), I can suggest a few options.
a) If the logs are for debugging purposes, simply write the usual debug logs. In case of failed tasks you can find them via the UI and analyse them.
b) If these logs are a kind of output you want alongside the other output of your job, assign them a special key and write them to the context. Then in the reducer you will need some special logic to route them to the output.
c) You can create a directory on HDFS and have the mapper write there. This is not the classic way for MR because it is a side effect, but in some cases it can be fine, especially taking into account that each mapper will create its own file; you can then use the command hadoop fs -getmerge ... to get all the logs as one file.
d) If you want to be able to monitor the progress of your job, the number of errors, etc., you can use counters.
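For option d), a minimal sketch of incrementing counters from a mapper (the mapper and enum names are hypothetical):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CrawlMapper extends Mapper<LongWritable, Text, Text, Text> {

    // Counter values are aggregated across all tasks and shown in the job UI/history.
    enum Stats { RECORDS_SEEN, PARSE_ERRORS }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.getCounter(Stats.RECORDS_SEEN).increment(1);
        try {
            // ... parse the record and context.write(...) the result ...
        } catch (Exception e) {
            context.getCounter(Stats.PARSE_ERRORS).increment(1);
        }
    }
}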
