When processing a step with chunk processing (specifying a commit-interval) in Spring Batch, is there a way to know, inside the writer, when all the records in a file have been read and processed? My idea was to put the collection of records read from the file into the ExecutionContext once all the records have been read.
Please help.
I don't know whether one of the pre-built CompletionPolicy implementations does what you want, but if none does, you can write a custom CompletionPolicy that marks a chunk as complete when the reader returns null; that way the chunk (and therefore the writer) holds all of the items read from the file.
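If you do go that route, here is a rough sketch (assuming Spring Batch's CompletionPolicySupport; the item type and delegate reader are placeholders) of a reader that doubles as the chunk completion policy:

    import org.springframework.batch.item.ItemReader;
    import org.springframework.batch.repeat.RepeatContext;
    import org.springframework.batch.repeat.policy.CompletionPolicySupport;

    // Sketch only: a reader that also acts as the chunk CompletionPolicy, so the
    // chunk is considered complete only once the delegate reader is exhausted.
    public class WholeFileCompletionPolicy<T> extends CompletionPolicySupport
            implements ItemReader<T> {

        private final ItemReader<T> delegate;
        private T lastItem;

        public WholeFileCompletionPolicy(ItemReader<T> delegate) {
            this.delegate = delegate;
        }

        @Override
        public T read() throws Exception {
            lastItem = delegate.read();
            return lastItem;
        }

        @Override
        public boolean isComplete(RepeatContext context) {
            // complete the chunk only when the reader has signalled end of input
            return lastItem == null;
        }
    }

The same bean is then wired in as both the step's reader and its completion policy (the step builder has a chunk(...) variant that takes a CompletionPolicy instead of a commit-interval).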
That said, are you sure this is exactly what you want? Storing all the items in the ExecutionContext is not good practice, and you will also lose chunk processing, restartability, and the other Spring Batch features...
So I have been using Java Batch Processing for some time now. My jobs were either import/export jobs which chunked from a reader to a writer, or I would write Batchlets that would do some more complex processing. Since I am beginning to hit memory limits I need to rethink the architecture.
So I want to better leverage the chunked reader/processor/writer pattern, but I am unsure how to distribute the work across the three components. Only during processing does it become clear whether zero, one or several records need to be written.
The reader is quite clear: it reads the data to be processed from the DB. But I am unsure how to write the records back to the database. I see these options:
Make the processor store the variable amount of data in the DB itself.
Make the processor pass a variable amount of data to the writer, which then performs the writing.
Place the entire logic into the writer.
Which way would be the best for this kind of task?
Looking at https://www.ibm.com/support/pages/system/files/inline-files/WP102706_WLB_JSR352.002.pdf, especially the chapters Chunk/The Processor and Chunk/The Writer, it becomes obvious that this is up to me.
The processor can return an object, and the writer will have to understand and write this object. So for the above case where the processor has zero, one or many items to write per input record, it should simply return a list. This list can contain zero, one or several elements. The writer has to understand the list and write its elements to the database.
Since the logic is divided this way, the code is still pluggable and can easily be extended or maintained.
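For illustration, a minimal sketch of such a processor under the Jakarta Batch API (the String input type and the split logic are made up; substitute your own domain objects):

    import java.util.ArrayList;
    import java.util.List;
    import jakarta.batch.api.chunk.ItemProcessor;
    import jakarta.inject.Named;

    // Illustrative only: "expansion" here is a simple split of the input string.
    @Named
    public class ExpandingItemProcessor implements ItemProcessor {

        @Override
        public Object processItem(Object item) throws Exception {
            String input = (String) item;
            List<String> outputs = expand(input); // zero, one or many output records
            // null skips the item entirely; otherwise the whole list goes to the writer
            return outputs.isEmpty() ? null : outputs;
        }

        private List<String> expand(String input) {
            List<String> result = new ArrayList<>();
            for (String part : input.split(";")) {
                if (!part.isBlank()) {
                    result.add(part.trim());
                }
            }
            return result;
        }
    }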
Addendum: Since both the reader and the writer connect to the same database this time, I ran into the problem that the commit at the end of each chunk also invalidated the reader's connection. The solution was to use a non-JTA datasource for the reader.
Typically, an item processor processes an input item passed from an item reader, and the processing result can be null or a domain object, so it is not naturally suited to cases where the result is split into multiple objects. I would assume that even in your case, producing multiple objects from a single processing iteration is not the common path. So I would suggest returning a list (or any other collection type) only when necessary; in the more common cases, the item processor still returns null (to skip the current item) or a single domain object.
When the item writer iterates through the accumulated items, it can check whether an item is a collection and, if so, write out all of its elements. For a plain domain object, it just writes it as usual.
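A sketch of such a writer (Jakarta Batch API again; persist() is a placeholder for the actual database write):

    import java.util.List;
    import jakarta.batch.api.chunk.AbstractItemWriter;
    import jakarta.inject.Named;

    @Named
    public class FlatteningItemWriter extends AbstractItemWriter {

        @Override
        public void writeItems(List<Object> items) throws Exception {
            for (Object item : items) {
                if (item instanceof List) {
                    // the processor expanded this input record into several outputs
                    for (Object element : (List<?>) item) {
                        persist(element);
                    }
                } else {
                    // ordinary case: a single domain object
                    persist(item);
                }
            }
        }

        private void persist(Object record) {
            // placeholder for the actual JDBC/JPA write
            System.out.println("writing " + record);
        }
    }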
Using a non-JTA datasource for the reader is fine. You typically want to keep the reader's connection open from start to end so it can keep reading from its result set. In an item writer, the connection is usually acquired at the beginning of the write operation and released when the chunk transaction commits or rolls back.
Some resources that may be of help:
Jakarta Batch API,
jberet-support JdbcItemReader,
jberet-support JdbcItemWriter
I am very new to Java and have been tasked with using Spring Batch to read in some text files. So far, Spring Batch resources online have helped me get to the point where I am reading, processing and writing some simple test .csv files into Mongo.
The problem I have now is that the actual file I would like to read has over 600 columns, meaning that with the way I am currently reading the file into Java, I would need 600+ fields in my @Document Mongo model.
I have been thinking of a couple of ways to get around this.
First, I was thinking I could read each line in as a String and then, in my processor, split everything up and format the data to return a list of documents for my MongoTemplate, but returning a List is not viable from the overridden process method.
So my question to you guys is: what is the best way to handle reading files with hundreds of columns in Spring Batch? Or what would be the best resource to start reading to point me in the right direction?
Thanks!
I had the same problem. I used
http://opencsv.sourceforge.net/apidocs/com/opencsv/CSVReader.html
for reading the CSV files.
I suggest you use a Map instead of 600 Java fields.
Besides, 600x600 Java strings is not a big deal for Java, nor for Mongo.
To work with Mongo, you can use http://jongo.org/.
If you really need batch processing of the data, your flow should be something like this (a sketch follows below):
Loop here: divide the work into batches (say 300 rows per loop).
Read the next 300 rows (each as a Java object or a Map of its ~600 columns) from the file into memory.
Sanitize or process them if needed.
Store them in MongoDB.
Return when EOF is reached.
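A rough sketch of that flow, using opencsv's CSVReader and the plain MongoDB Java driver rather than Jongo (file name, collection name and batch size are made up):

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.opencsv.CSVReader;
    import org.bson.Document;

    import java.io.FileReader;
    import java.util.ArrayList;
    import java.util.List;

    public class CsvToMongoLoader {

        private static final int BATCH_SIZE = 300;

        public static void main(String[] args) throws Exception {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017");
                 CSVReader csv = new CSVReader(new FileReader("wide-file.csv"))) {

                MongoCollection<Document> collection =
                        client.getDatabase("mydb").getCollection("records");

                String[] header = csv.readNext();          // the 600+ column names
                List<Document> batch = new ArrayList<>(BATCH_SIZE);
                String[] row;

                while ((row = csv.readNext()) != null) {
                    // build a Map-like document instead of a 600-field class
                    Document doc = new Document();
                    for (int i = 0; i < header.length && i < row.length; i++) {
                        doc.append(header[i], row[i]);     // sanitize/convert here if needed
                    }
                    batch.add(doc);

                    if (batch.size() >= BATCH_SIZE) {
                        collection.insertMany(batch);      // store one batch in MongoDB
                        batch.clear();
                    }
                }
                if (!batch.isEmpty()) {
                    collection.insertMany(batch);          // flush the final partial batch
                }
            }
        }
    }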
I ended up just reading each line in as a String object. Then, in the processor, I loop over the String with a delimiter, create my Mongo repository objects, and store them. So I am basically doing all of the writing inside the processor method, which I would say is definitely not best practice, but it gives me the desired end result.
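For reference, a rough sketch of that shape (hypothetical column names and collection; the persistence call is just one way to do it, and the step's writer effectively becomes a no-op):

    import org.bson.Document;
    import org.springframework.batch.item.ItemProcessor;
    import org.springframework.data.mongodb.core.MongoOperations;

    // Illustrative only: splitting the raw line and writing from inside the
    // processor, as described above (acknowledged as not best practice).
    public class LineSplittingProcessor implements ItemProcessor<String, String> {

        private final MongoOperations mongo;
        private final String[] columnNames;   // the 600+ header names, injected once

        public LineSplittingProcessor(MongoOperations mongo, String[] columnNames) {
            this.mongo = mongo;
            this.columnNames = columnNames;
        }

        @Override
        public String process(String line) throws Exception {
            String[] values = line.split(",", -1);        // keep empty trailing columns
            Document doc = new Document();
            for (int i = 0; i < columnNames.length && i < values.length; i++) {
                doc.append(columnNames[i], values[i]);
            }
            mongo.insert(doc, "records");                 // the writing happens here
            return line;                                  // passed through so the step still has items to "write"
        }
    }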
I currently have a Spring Batch job that does the following:
Reads a list of csv files using a MultiResourceItemReader which delegates to a FlatFileItemReader.
Splits each file into chunks and writes each chunk as a JMS message, with each message containing the list of lines in the chunk and the filename of the underlying resource in JSON format.
What I want is for each chunk to only contain lines from a single file resource so that the filename on the JMS message will link up to the corresponding file.
The problem is that when processing of one file resource is complete, the reader just continues and processes the next resource, meaning that lines from multiple resource files end up in the same chunk and the filename property will not necessarily match the underlying data in the chunk.
Is there any clean way to prevent the reader from including lines from separate file resources in the same chunk?
EDIT: I believe the solution will require a custom chunk completion policy that somehow determines whether the current item being read is from the same resource as the previous line; I'm not sure how feasible this is, though. Any thoughts?
I changed my implementation to use a MultiResourcePartitioner to create a partitioned step per file, and everything is working now.
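For anyone landing here, a trimmed-down sketch of that setup (Spring Batch 5-style builders; bean names, grid size and the file pattern are illustrative, and workerStep stands in for the existing chunk-oriented step that uses this reader and the JMS writer):

    import java.io.IOException;

    import org.springframework.batch.core.Step;
    import org.springframework.batch.core.configuration.annotation.StepScope;
    import org.springframework.batch.core.partition.support.MultiResourcePartitioner;
    import org.springframework.batch.core.partition.support.Partitioner;
    import org.springframework.batch.core.repository.JobRepository;
    import org.springframework.batch.core.step.builder.StepBuilder;
    import org.springframework.batch.item.file.FlatFileItemReader;
    import org.springframework.batch.item.file.builder.FlatFileItemReaderBuilder;
    import org.springframework.batch.item.file.mapping.PassThroughLineMapper;
    import org.springframework.beans.factory.annotation.Value;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.core.io.Resource;
    import org.springframework.core.io.support.PathMatchingResourcePatternResolver;

    @Configuration
    public class PartitionedFileJobConfig {

        // one partition per input file; each partition's step execution context
        // gets the resource URL under the default key "fileName"
        @Bean
        public Partitioner filePartitioner() throws IOException {
            MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
            partitioner.setResources(
                    new PathMatchingResourcePatternResolver().getResources("file:input/*.csv"));
            return partitioner;
        }

        // the master step fans out one worker step execution per file, so a chunk
        // can never mix lines from different files
        @Bean
        public Step masterStep(JobRepository jobRepository, Partitioner filePartitioner, Step workerStep) {
            return new StepBuilder("masterStep", jobRepository)
                    .partitioner("workerStep", filePartitioner)
                    .step(workerStep)
                    .gridSize(4)
                    .build();
        }

        // the reader is step-scoped and bound to the single file of its partition
        @Bean
        @StepScope
        public FlatFileItemReader<String> fileReader(
                @Value("#{stepExecutionContext['fileName']}") Resource resource) {
            return new FlatFileItemReaderBuilder<String>()
                    .name("fileReader")
                    .resource(resource)
                    .lineMapper(new PassThroughLineMapper())
                    .build();
        }
    }

Because every worker step reads exactly one resource, the filename on each JMS message always matches the lines in its chunks.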
I have an XML file with 60k entities. I want to process it in batches of 20k. I am using a SAX parser to parse the entities and store them in a list.
So far I have parsed all 60k entities, stored them in a file/array/list, and then processed each one separately. I don't think that is the best solution.
Is there any way to read only 20k entities from the XML file, process them, and then read from the XML file again?
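For what it's worth, one way to stay within a single SAX pass is to buffer the parsed entities and flush them every 20,000; a minimal sketch (the element name, the String representation of an entity, and processBatch() are all made up):

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    public class BatchingEntityHandler extends DefaultHandler {

        private static final int BATCH_SIZE = 20_000;

        private final List<String> buffer = new ArrayList<>(BATCH_SIZE);
        private StringBuilder currentText;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            if ("entity".equals(qName)) {
                currentText = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (currentText != null) {
                currentText.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("entity".equals(qName)) {
                buffer.add(currentText.toString());
                currentText = null;
                if (buffer.size() >= BATCH_SIZE) {
                    processBatch(buffer);   // handle 20k entities, then free the memory
                    buffer.clear();
                }
            }
        }

        @Override
        public void endDocument() {
            if (!buffer.isEmpty()) {
                processBatch(buffer);       // flush the final partial batch
                buffer.clear();
            }
        }

        private void processBatch(List<String> entities) {
            // placeholder: process/persist the current batch of entities here
            System.out.println("Processing " + entities.size() + " entities");
        }

        public static void main(String[] args) throws Exception {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new File("entities.xml"), new BatchingEntityHandler());
        }
    }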
I think you can use multithreading. Create 3 threads and have each thread read 20k entities.
Hi, I am doing a POC/baseline design for reading from a database and writing to flat files. I am struggling with a couple of issues here, but first I will describe the output format of the flat file.
Please let me know how to design the input side, where I need to read the transactions from different tables, process the records, and work out the summary fields, and then how to design the ItemWriter, which has such a complex output. Please advise. I am able to read from a single table and write to a file successfully, but the task above looks complex.
Extend the FlatFileItemWriter so it only opens the file once and appends to it instead of overwriting it. Then reuse that same file writer across the multiple readers, in the order you would like their output to appear. (Make sure that each object read by the readers implements something the writer understands! Maybe an interface named BatchWriteable would be a good fit.)
Some back-of-the-envelope pseudocode:
Before everything starts:
    Open the file.
    Write the file headers.
Batch step (repeat as many times as necessary):
    Read a batch section.
    Process the batch section.
    Write the batch section.
When done:
    Write the file footer.
    Close the file.
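One way to approximate this with Spring Batch's stock FlatFileItemWriter, rather than a custom subclass, is to enable append mode and hook in header/footer callbacks; a sketch (file path, line aggregator and callback contents are illustrative):

    import org.springframework.batch.item.file.FlatFileItemWriter;
    import org.springframework.batch.item.file.builder.FlatFileItemWriterBuilder;
    import org.springframework.batch.item.file.transform.PassThroughLineAggregator;
    import org.springframework.core.io.FileSystemResource;

    public class ReportWriterFactory {

        // Sketch only: a writer that appends to one file across steps and writes a
        // header once and a footer at the end via callbacks.
        public static FlatFileItemWriter<String> reportWriter() {
            return new FlatFileItemWriterBuilder<String>()
                    .name("reportWriter")
                    .resource(new FileSystemResource("output/report.txt"))
                    .append(true)                                // keep adding to the same file
                    .headerCallback(writer -> writer.write("== REPORT HEADER =="))
                    .footerCallback(writer -> writer.write("== REPORT FOOTER =="))
                    .lineAggregator(new PassThroughLineAggregator<>())
                    .build();
        }
    }

Note that the footer callback runs when the writer is closed, so if several steps share the same file you would register the footer callback only on the last writer.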