Spring Batch MultiResourceItemReader Chunk Commit Per Resource - java

I currently have a Spring Batch job that does the following:
Reads a list of csv files using a MultiResourceItemReader which delegates to a FlatFileItemReader.
Splits each file into chunks and writes each chunk as a JMS message, with each message containing the list of lines in the chunk and the filename of the underlying resource in JSON format.
What I want is for each chunk to only contain lines from a single file resource so that the filename on the JMS message will link up to the corresponding file.
The problem is that when processing of one file resource is complete, the reader will just continue and process the next resource meaning that lines from multiple resource files are being inserted into the same chunk and the filename property will not necessarily match the underlying data in the chunk.
Is there any clean way to prevent the reader from including lines from separate file resources in the same chunk?
EDIT: I believe the solution will require a custom chunk completion policy that somehow determines whether the current item being read comes from the same resource as the previous one; I'm not sure how feasible that is, though. Any thoughts?

I changed my implementation to use a MultiResourcePartitioner to create a partitioned step per file, and everything is working now.
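For reference, a minimal sketch of that approach in older-style Java config (the bean names, file pattern, and worker step wiring are illustrative, not my exact setup):

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.core.partition.support.MultiResourcePartitioner;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.mapping.PassThroughLineMapper;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.Resource;
import org.springframework.core.io.support.PathMatchingResourcePatternResolver;

@Configuration
public class PartitionedJobConfig {

    @Bean
    public Step masterStep(StepBuilderFactory steps, Step workerStep) throws Exception {
        // One partition per file; the partitioner stores each file's URL in
        // that partition's step execution context under the key "fileName".
        MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
        partitioner.setResources(new PathMatchingResourcePatternResolver()
                .getResources("file:input/*.csv")); // illustrative pattern
        return steps.get("masterStep")
                .partitioner("workerStep", partitioner)
                .step(workerStep) // the existing chunk step that writes to JMS
                .build();
    }

    @Bean
    @StepScope
    public FlatFileItemReader<String> reader(
            @Value("#{stepExecutionContext['fileName']}") Resource resource) {
        // Each worker step execution reads exactly one file, so every chunk
        // (and every JMS message built from it) maps to a single resource.
        FlatFileItemReader<String> reader = new FlatFileItemReader<>();
        reader.setName("fileReader");
        reader.setResource(resource);
        reader.setLineMapper(new PassThroughLineMapper());
        return reader;
    }
}

Since the step-scoped reader sees only one resource per step execution, the filename on each JMS message is guaranteed to match the lines in the chunk.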

Related

Spring Batch - How to process a file itself as an item?

I am new to Spring Batch development. I have the following requirement.
There will be an S3 source with zip files, and each zip file will contain multiple PDF files and XML files (e.g. 100 PDFs and 100 XML files; each XML file contains data about its PDF).
The batch needs to read each PDF file together with its associated XML file and push them to a REST service/DB.
When I looked at examples, most of them covered how to read a line from a file and process it. Here the items themselves are files: I want to read one PDF file (as bytes) plus its XML file (converted into a POJO) as a set and push these to the REST service one by one.
Right now I am doing all the reading and processing inside a single tasklet, but I am sure there is a better solution. Please suggest, and thank you.
The chunk-oriented processing model requires you to first define what an item is. In your case, one option is to consider an item to be the PDF file (data) together with its associated XML file (metadata). You can create a class that represents such an item and a custom item reader for it, as sketched below. Once that is in place, you can use the reader in a chunk-oriented step with a processor or writer that sends the data to your REST endpoint.
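A rough sketch of that idea, assuming the zip has already been downloaded from S3 and extracted, and that each foo.pdf sits next to a foo.xml (all class and method names here are made up for illustration):

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Iterator;
import java.util.List;
import javax.xml.bind.JAXB;
import org.springframework.batch.item.ItemReader;

// Hypothetical metadata POJO bound from the XML descriptor.
class DocumentMetadata {
    public String title;
}

// One item = a PDF's bytes plus the POJO parsed from its companion XML.
class DocumentItem {
    final byte[] pdfBytes;
    final DocumentMetadata metadata;

    DocumentItem(byte[] pdfBytes, DocumentMetadata metadata) {
        this.pdfBytes = pdfBytes;
        this.metadata = metadata;
    }
}

class DocumentItemReader implements ItemReader<DocumentItem> {

    private final Iterator<Path> pdfFiles;

    DocumentItemReader(List<Path> pdfFiles) {
        this.pdfFiles = pdfFiles.iterator();
    }

    @Override
    public DocumentItem read() throws Exception {
        if (!pdfFiles.hasNext()) {
            return null; // null tells the step there is no more input
        }
        Path pdf = pdfFiles.next();
        // Assumed convention: foo.pdf is described by foo.xml in the same directory.
        Path xml = pdf.resolveSibling(
                pdf.getFileName().toString().replace(".pdf", ".xml"));
        return new DocumentItem(Files.readAllBytes(pdf),
                JAXB.unmarshal(xml.toFile(), DocumentMetadata.class));
    }
}

The writer (or a processor in front of it) then takes each DocumentItem and posts it to the REST service, and you keep chunking, retry, and restartability for free.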

Spring Batch Chunk processing

When processing a step with chunk processing (specifying a commit-interval) in Spring Batch, is there a way to know inside the writer when all the records in a file have been read and processed? My idea was to pass the collection of records read from the file to the ExecutionContext once all the records have been read.
Please help.
I don't know whether one of the pre-built CompletionPolicy implementations does what you want, but if not, you can write a custom CompletionPolicy that marks a chunk as complete when the reader returns null; that way a single chunk holds all the items read from the file. A sketch follows below.
That said, are you sure this is exactly what you want? Storing all items in the ExecutionContext is not good practice, and you will also lose chunk processing, restartability, and all the other Spring Batch features...
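If you do go that route, a minimal sketch of such a policy (the class name is mine; it treats end of input as the only chunk boundary, so the entire file becomes one chunk):

import org.springframework.batch.repeat.RepeatContext;
import org.springframework.batch.repeat.RepeatStatus;
import org.springframework.batch.repeat.policy.CompletionPolicySupport;

// Completes the chunk only when the reader signals end of input
// (a non-continuable RepeatStatus), i.e. when read() has returned null.
public class WholeFileCompletionPolicy extends CompletionPolicySupport {

    @Override
    public boolean isComplete(RepeatContext context, RepeatStatus result) {
        return result == null || !result.isContinuable();
    }
}

You would plug it in via the chunk-completion-policy attribute of the <chunk> element (or chunk(new WholeFileCompletionPolicy()) in Java config). Keep in mind this makes the whole file a single chunk/transaction, which is exactly why the caveats above apply.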

How to define multiple threads in MultiResourceItemReader?

I am using the MultiResourceItemReader class of Spring Batch, with a FlatFileItemReader bean as its delegate. My files contain XML requests; my batch job reads the requests from the files, sends them to a URL, and writes the responses to corresponding output files. I want to define one thread per file to decrease execution time. In my current requirement I have four input files, so I want four threads to read, process, and write the files. I tried a simple task executor with
task-executor="simpleTaskExecutor" throttle-limit="20"
but after using this the FlatFileItemReader throws an exception.
I am a beginner; please suggest how to implement this. Thanks in advance.
There are a couple of ways to go here; however, the easiest would be to partition by file using the MultiResourcePartitioner. That, in combination with the TaskExecutorPartitionHandler, will give you reliable parallel processing of your input files (a sketch is below). You can read more about partitioning in section 7.4 of our documentation here: http://docs.spring.io/spring-batch/trunk/reference/html/scalability.html
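A rough Java-config sketch of that combination (the bean names, file pattern, and thread sizing are illustrative):

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.partition.support.MultiResourcePartitioner;
import org.springframework.batch.core.partition.support.TaskExecutorPartitionHandler;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.support.PathMatchingResourcePatternResolver;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

@Configuration
public class ParallelFilesConfig {

    @Bean
    public Step masterStep(StepBuilderFactory steps, Step workerStep) throws Exception {
        // One partition per matching file.
        MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
        partitioner.setResources(new PathMatchingResourcePatternResolver()
                .getResources("file:input/*.xml")); // illustrative pattern

        // Runs each partition's worker step execution on its own thread,
        // so four input files give you four parallel executions.
        TaskExecutorPartitionHandler handler = new TaskExecutorPartitionHandler();
        handler.setStep(workerStep);
        handler.setTaskExecutor(new SimpleAsyncTaskExecutor());
        handler.setGridSize(4); // a hint; the partitioner makes one partition per file

        return steps.get("masterStep")
                .partitioner("workerStep", partitioner)
                .partitionHandler(handler)
                .build();
    }
}

Because each partition gets its own step execution, a step-scoped FlatFileItemReader bound to #{stepExecutionContext['fileName']} is only ever used by one thread, which sidesteps the thread-safety problem behind the exception you are seeing.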

Processing data based on the metadata in the file using apache camel

I have to set up Camel to process data files where the first line of the file is the metadata, followed by millions of lines of actual data. The metadata dictates how the data is to be processed. What I am looking for is something like this:
1. Read the first line (metadata) and populate a bean with the metadata.
2. Then send the data 1000 lines at a time to the data processor, which will refer to the bean from step 1.
Is it possible in Apache Camel?
Yes.
An example architecture might look something like this:
You could set up a simple queue that could be populated with file names (or whatever identifier you are using to locate each individual file).
From the queue, you could route through a message translator bean whose sole job is to translate a request for a filename into a POJO containing the metadata from the first line of the file.
(You have a few options here)
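To make that step concrete, a hedged sketch of the front of such a route (the endpoint URIs, the FileMetadata POJO, and the MetadataTranslator bean are all made up for illustration):

import org.apache.camel.builder.RouteBuilder;

// Hypothetical metadata POJO populated from the first line of the file.
class FileMetadata {
    String fileName;
    String processingMode; // whatever the first line dictates
}

// Hypothetical translator bean: file name in, metadata POJO out.
class MetadataTranslator {
    public FileMetadata translate(String fileName) throws java.io.IOException {
        FileMetadata meta = new FileMetadata();
        meta.fileName = fileName;
        try (java.io.BufferedReader r = java.nio.file.Files
                .newBufferedReader(java.nio.file.Paths.get(fileName))) {
            meta.processingMode = r.readLine(); // first line is the metadata
        }
        return meta;
    }
}

public class MetadataRoute extends RouteBuilder {
    @Override
    public void configure() throws Exception {
        from("activemq:queue:incomingFiles")          // queue of file names
            .bean(MetadataTranslator.class, "translate")
            .to("direct:dispatch");                   // downstream processing continues here
    }
}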
Your approach to processing the 1000-line sets will depend on whether or not the output or resulting data created from those sets needs to be recomposed into a single message and processed again later. If so, you could implement a composed message processor made up of a message producer/consumer, a message aggregator, and a router. The message producer/consumer would receive the POJO with the metadata created in step 2 and enqueue as many new requests as are necessary to process all of the lines in the file. The router would route from this queue through your processing pipeline and into the message aggregator. Once aggregated, a single unified message with all of your important data will be available for you to do with as you will.
If instead each 1000-line set can be processed independently and rejoining is not required, then it is not necessary to aggregate the messages. Instead, you can use a router to route from step 2 to a producer/consumer that will, as above, enqueue the necessary number of new requests for each file. Finally, the router will route from this final queue to a consumer that will do the processing.
Since you have a large quantity of data to deal with, it will likely be difficult to pass 1000-line groups of data around in messages, especially if they are being placed in a queue (you don't want to run out of memory). I recommend passing around some kind of indicator that identifies which lines of the file a specific request is for, and then parsing those 1000 lines only when you need them. You could do this in a number of ways, for example by calculating the byte offset of a specific line in the file and then using a stream's skip() method to jump to that line when the request hits the bean that will be processing it, as in the sketch below.
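A small plain-Java illustration of that skip() idea (the file name, and the assumption that each request carries a precomputed byte offset and line count, are mine):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class LineBlockReader {

    // Reads up to 'count' lines starting at a precomputed byte offset, so a
    // request only needs to carry (file, offset, count) instead of the lines.
    static void processBlock(String file, long byteOffset, int count) throws Exception {
        try (FileInputStream in = new FileInputStream(file)) {
            long skipped = in.skip(byteOffset); // jump straight to the block
            if (skipped != byteOffset) {
                throw new IllegalStateException("could not seek to " + byteOffset);
            }
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(in, StandardCharsets.UTF_8));
            String line;
            for (int i = 0; i < count && (line = reader.readLine()) != null; i++) {
                System.out.println(line); // hand each line to the actual processor here
            }
        }
    }
}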
Here are some resources provided on the Apache Camel website that describe the enterprise integration patterns that I mentioned above:
http://camel.apache.org/message-translator.html
http://camel.apache.org/composed-message-processor.html
http://camel.apache.org/pipes-and-filters.html
http://camel.apache.org/eip.html

Java: Reading a file containing both text and binary data

I'm having a problem with a new file format I'm being asked to implement at work.
Basically, the file is a text file which contains a bunch of headers containing information about the data in UTF-8, and then the rest of the file is the numerical data in binary. I can write the data and read it back just fine, and I recently added the code to write the headers.
The problem is that I don't know how to read a file that contains both text and binary data. I want to be able to read in and deal with the header information (which is fairly extensive) and then be able to continue reading the binary data without having to re-iterate through the headers. Is this possible?
I am currently using a FileInputStream to read the binary data, but I don't know how to start it at the beginning of the data rather than the beginning of the whole file. One of FileInputStream's constructors takes a FileDescriptor as the argument, and I think that's my answer, but I don't know how to get one from another file-reading class. Am I approaching this correctly?
You can reposition a FileInputStream to any arbitrary point by getting its channel via getChannel() and calling position() on that channel.
The one caveat is that this position affects all consumers of the stream. It is not suitable if you have different threads (for example) reading from different parts of the same file. In that case, create a separate FileInputStream for each consumer.
Also, this technique only works for file streams, because the underlying file can be randomly accessed. There is no equivalent for sockets, or named pipes, or anything else that is actually a stream.
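A short sketch of the technique (the assumption that the first header line records the byte offset of the binary section is mine, just to keep the example self-contained):

import java.io.BufferedReader;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class MixedFileReader {

    public static void main(String[] args) throws Exception {
        try (FileInputStream in = new FileInputStream("data.bin")) {
            BufferedReader headers = new BufferedReader(
                    new InputStreamReader(in, StandardCharsets.UTF_8));

            // Assumption for this sketch: the first header line records the
            // byte offset where the binary section begins.
            long dataOffset = Long.parseLong(headers.readLine().trim());
            // ... parse the remaining header lines with 'headers' here ...

            // The BufferedReader has read ahead into the binary data, so
            // reposition the underlying file to the start of the data section.
            // Do not use the reader after this point.
            in.getChannel().position(dataOffset);

            DataInputStream data = new DataInputStream(in);
            double firstValue = data.readDouble(); // assumed binary layout
            System.out.println("first value: " + firstValue);
        }
    }
}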
