I created a spring batch job that reads chunks (commit level = 10) of a flat CSV file and writes the output to another flat file. Plain and simple.
To test local scaling I also configured the tasklet with a TaskExecutor backed by a pool of 10 threads, introducing parallelism via the multi-threaded step pattern.
As expected, these threads concurrently read items until their chunk is filled, and then the chunk is written to the output file.
Also as expected, the order of the items changes because of this concurrent reading.
But is it possible to maintain the fixed order, preferably still leveraging the increased performance gained by using multiple threads?
I can't think of an easy way. A workaround would be to prefix each line with an ID that is assigned sequentially while reading; after the job finishes, sort the lines by that ID. It sounds hacky, but it should work.
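A minimal sketch of that workaround, using a synchronized wrapper around the delegate reader (SequencingReader and NumberedLine are names invented here, not Spring Batch API):

```java
import org.springframework.batch.item.ItemReader;

public class SequencingReader implements ItemReader<NumberedLine> {

    private final ItemReader<String> delegate; // e.g. a FlatFileItemReader<String>
    private long sequence = 0;

    public SequencingReader(ItemReader<String> delegate) {
        this.delegate = delegate;
    }

    @Override
    public synchronized NumberedLine read() throws Exception {
        // Reading and numbering happen atomically, so the ID reflects file order
        // even when multiple step threads call read() concurrently.
        String line = delegate.read();
        return line == null ? null : new NumberedLine(++sequence, line);
    }
}

class NumberedLine {
    final long id;
    final String line;
    NumberedLine(long id, String line) { this.id = id; this.line = line; }
}
```

The writer would then prefix each output line with the ID, and a final single-threaded step (or an external sort) restores the original order.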
I don't think there is an easy solution. Using a single writer thread (one that also sorts while writing) together with multiple reader threads could work, but it would not be as scalable.
I understand that there are some limitations regarding item processing order when using multi-threading.
From my understanding, when we configure a Step (read, process, write) to use multi-threading (a taskExecutor, for example), we can't guarantee that items keep their incoming order, since we don't know which thread will process first.
Is there a safe and simple way to read and process a flat file (FlatFileItemReader) in the initial order of its items while using multi-threading?
Thank you
I read that:
When processing is complete in your Tasklet implementation, you return an org.springframework.batch.repeat.RepeatStatus object. There are two options with this: RepeatStatus.CONTINUABLE and RepeatStatus.FINISHED. These two values can be confusing at first glance. If you return RepeatStatus.CONTINUABLE, you aren't saying that the job can continue. You're telling Spring Batch to run the tasklet again. Say, for example, that you wanted to execute a particular tasklet in a loop until a given condition was met, yet you still wanted to use Spring Batch to keep track of how many times the tasklet was executed, transactions, and so on. Your tasklet could return RepeatStatus.CONTINUABLE until the condition was met. If you return RepeatStatus.FINISHED, that means the processing for this tasklet is complete (regardless of success) and to continue with the next piece of processing.
But I can't imagine an example of using this feature. Could you explain it to me? When will the tasklet be invoked the next time?
Let's say that you have a large set of items (for example files), and you need to enrich each one of them in some way, which requires consuming an external service. The external service might provide a chunked mode that can process up to 1000 requests at once instead of making a separate remote call for each single file. That might be the only way you can bring down your overall processing time to the required level.
However, this is not possible to implement using Spring Batch's Reader/Processor/Writer API in a nice way, because the Processor is fed item by item and not entire chunks of them. Only the Writer actually sees chunks of items.
You could implement this using a Tasklet that reads the next up to 1000 unprocessed files, sends a chunked request to the service, processes the results, writes output files and deletes or moves the processed files.
Finally, it checks whether any unprocessed files remain. Depending on that, it returns either FINISHED or CONTINUABLE; in the latter case the framework invokes the Tasklet again to process the next batch of up to 1000 files.
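A minimal sketch of such a Tasklet (EnrichmentService and the directory layout are assumptions made for illustration):

```java
import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical client for the external service's chunked mode.
interface EnrichmentService {
    void enrichAll(List<Path> batch); // one chunked call for up to 1000 files
}

public class ChunkedEnrichmentTasklet implements Tasklet {

    private static final int BATCH_SIZE = 1000;

    private final EnrichmentService service;
    private final Path inbox = Paths.get("files/unprocessed");
    private final Path done = Paths.get("files/processed");

    public ChunkedEnrichmentTasklet(EnrichmentService service) {
        this.service = service;
    }

    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext)
            throws Exception {
        // Read the next batch of up to 1000 unprocessed files.
        List<Path> batch;
        try (Stream<Path> files = Files.list(inbox)) {
            batch = files.limit(BATCH_SIZE).collect(Collectors.toList());
        }
        if (batch.isEmpty()) {
            return RepeatStatus.FINISHED; // nothing left to do
        }

        // One chunked remote call instead of up to 1000 individual ones.
        service.enrichAll(batch);

        // Move the processed files out of the way.
        for (Path file : batch) {
            Files.move(file, done.resolve(file.getFileName()));
        }

        // More files may remain: tell the framework to invoke this tasklet again.
        return RepeatStatus.CONTINUABLE;
    }
}
```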
This is actually a quite realistic scenario, so I hope that illustrates the purpose of the feature.
This allows you to break up processing of a complex task across multiple iterations.
The functionality is similar to a while(true) loop with continue/break.
I have 3 executors in my Spark Streaming job, which consumes from Kafka; the executor count depends on the partition count of the topic. When a message is consumed from the topic, I start a query on Hazelcast. Each executor runs some filtering operation on Hazelcast and returns duplicated results, because the data's status is not yet updated when one executor has returned it, so another executor finds the same data.
My question is: is there a way to combine all the results found by the executors during streaming into a single list?
Spark executors are distributed across the cluster, so if you are trying to deduplicate data across the cluster, deduplication is difficult. You have the following options:
Use accumulators. The problem here is that accumulators are not consistent while the job is running, and you may end up reading stale data.
Offload this work to an external system: store your output in some external storage which can deduplicate it (probably HBase). The efficiency of this storage system becomes key here.
I hope this helps
To avoid reading duplicate data, you need to maintain the offsets somewhere, preferably in HBase. Every time you consume data from Kafka, read the offsets from HBase, check which offset of each topic has already been consumed, and then start reading and writing from there. After each successful write, you must update the offset count.
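A hedged sketch of that flow, assuming a plain Kafka consumer and an HBase-backed offset store (OffsetStore and its methods are invented for illustration):

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

// Assumed wrapper around an HBase table; not a real API.
interface OffsetStore {
    long lastOffset(TopicPartition tp); // returns -1 if nothing stored yet
    void save(TopicPartition tp, long offset);
}

public class OffsetManagedConsumer {

    private final KafkaConsumer<String, String> consumer;
    private final OffsetStore offsetStore;

    public OffsetManagedConsumer(Properties props, OffsetStore offsetStore) {
        this.consumer = new KafkaConsumer<>(props);
        this.offsetStore = offsetStore;
    }

    public void run(String topic, int partition) {
        TopicPartition tp = new TopicPartition(topic, partition);
        consumer.assign(Collections.singletonList(tp));
        // Resume from the offset recorded in HBase, not from Kafka's own commit.
        consumer.seek(tp, offsetStore.lastOffset(tp) + 1);

        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, String> record : records) {
                write(record);                          // write to the sink first...
                offsetStore.save(tp, record.offset());  // ...then advance the stored offset
            }
        }
    }

    private void write(ConsumerRecord<String, String> record) {
        // Write the record to the output store.
    }
}
```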
Do you think that way it solves the issue?
I am currently writing a Spring Batch job which is supposed to transfer all the files from my application to a shared location. The batch consists of a single step with a reader that reads byte[], a processor that converts it to PDF, and a writer that creates the new file at the shared location.
1) Since it's an IO-bound operation, should I use a ThreadPoolTaskExecutor in my batch? Will using it cause data to be lost?
2) Also, in my ItemWriter I am writing using a FileOutputStream. My server is in Paris and the shared location is in New York. While writing the file in such a scenario, is there a better or more efficient way to achieve this with the least delay?
Thanks in advance
1) If you can separate the IO-bound operation into its own thread and other parts into their own threads, you could go for a multithreaded approach. Data won't be lost if you write it correctly.
One approach could be as follows (a sketch follows the list):
Have one reader thread read the data into a buffer.
Have a second thread perform the transformation to PDF.
Have a third thread perform the write-out (if it's not to the same disk as the reading).
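A sketch of that pipeline using BlockingQueues to hand chunks between the three threads (readChunk, toPdf, and writeOut are stand-ins for the real reader, converter, and writer):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ThreeStagePipeline {

    private static final byte[] POISON = new byte[0]; // end-of-stream marker

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<byte[]> rawChunks = new ArrayBlockingQueue<>(16);
        BlockingQueue<byte[]> pdfChunks = new ArrayBlockingQueue<>(16);

        Thread reader = new Thread(() -> {
            byte[] chunk;
            while ((chunk = readChunk()) != null) {
                put(rawChunks, chunk);
            }
            put(rawChunks, POISON);
        });

        Thread converter = new Thread(() -> {
            byte[] chunk;
            while ((chunk = take(rawChunks)) != POISON) {
                put(pdfChunks, toPdf(chunk));
            }
            put(pdfChunks, POISON);
        });

        Thread writer = new Thread(() -> {
            byte[] pdf;
            while ((pdf = take(pdfChunks)) != POISON) {
                writeOut(pdf);
            }
        });

        reader.start(); converter.start(); writer.start();
        reader.join(); converter.join(); writer.join();
    }

    // Blocking-queue helpers that translate InterruptedException.
    private static void put(BlockingQueue<byte[]> q, byte[] b) {
        try { q.put(b); } catch (InterruptedException e) { throw new RuntimeException(e); }
    }
    private static byte[] take(BlockingQueue<byte[]> q) {
        try { return q.take(); } catch (InterruptedException e) { throw new RuntimeException(e); }
    }

    private static byte[] readChunk() { return null; }      // stub: next block, or null at EOF
    private static byte[] toPdf(byte[] raw) { return raw; } // stub: PDF conversion
    private static void writeOut(byte[] pdf) {}             // stub: write to the share
}
```

No chunk is ever lost: each one sits in a queue until the next stage takes it, which is why data won't be lost if the hand-off is written correctly.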
2) So it's a mapped network share? There's probably not much you can do from Java, then. But before you get worried about it, you should make sure it's an actual issue and not premature optimization.
I guess you can achieve this task with partitioning: create a master step that returns the file paths, and a multi-threaded slave step, with two possible approaches:
A simple tasklet which reads the file, converts it to PDF, and writes it to the shared drive. Rather than using FileOutputStream, use FileChannel with buffered reads/writes; it has much better performance (see the sketch after this list).
Use a chunk-oriented reader, specifically an ItemStream, and pass it to a custom ItemWriter to write the file (I never got the chance to write to PDF, but I believe at the end of the day it will still be a stream with a different encoding).
I would recommend the second one; it's always better to use a library than custom code.
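A minimal sketch of copying a file to the share with FileChannel (the paths are placeholders):

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class ChannelCopy {

    public static void copy(Path source, Path target) throws IOException {
        try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(target,
                     StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            long position = 0;
            long size = in.size();
            // transferTo may copy fewer bytes than requested, so loop until done.
            while (position < size) {
                position += in.transferTo(position, size - position, out);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder paths for the local file and the network share.
        copy(Paths.get("report.pdf"), Paths.get("//nyc-share/reports/report.pdf"));
    }
}
```

transferTo lets the OS move bytes between the channels without pushing them through user space, which is usually faster than a read/write loop over a FileOutputStream.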
I can understand why network apps would use multiplexing (to avoid creating too many threads), and why programs would use async calls for pipelining (more efficient). But I don't understand the efficiency purpose of AsynchronousFileChannel.
Any ideas?
It's a channel that you can use to read files asynchronously, i.e. the I/O operations are done on a separate thread, so that the thread you're calling it from can do other things while the I/O operations are happening.
For example: The read() methods of the class return a Future object to get the result of reading data from the file. So, what you can do is call read(), which will return immediately with a Future object. In the background, another thread will read the actual data from the file. Your own thread can continue doing things, and when it needs the read data, you call get() on the Future object. That will then return the data (if the background thread hasn't completed reading the data, it will make your thread block until the data is ready). The advantage of this is that your thread doesn't have to wait the whole length of the read operation; it can do some other things until it really needs the data.
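A minimal example of that pattern (the file name is a placeholder):

```java
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.Future;

public class AsyncReadExample {

    public static void main(String[] args) throws Exception {
        try (AsynchronousFileChannel channel =
                 AsynchronousFileChannel.open(Paths.get("data.bin"), StandardOpenOption.READ)) {
            ByteBuffer buffer = ByteBuffer.allocate(4096);

            // Kicks off the read in the background and returns immediately.
            Future<Integer> result = channel.read(buffer, 0);

            doOtherWork(); // the calling thread is free in the meantime

            // Blocks only if the background read hasn't finished yet.
            int bytesRead = result.get();
            System.out.println("Read " + bytesRead + " bytes");
        }
    }

    private static void doOtherWork() { /* ... */ }
}
```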
See the documentation.
Note that AsynchronousFileChannel will be a new class in Java SE 7, which is not released yet.
I've just come across another, somewhat unexpected reason for using AsynchronousFileChannel. When performing random record-oriented writes across large files (exceeding physical memory so caching isn't helping everything) on NTFS, I find that AsynchronousFileChannel performs over twice as many operations, in single-threaded mode, versus a normal FileChannel.
My best guess is that, because asynchronous I/O boils down to overlapped I/O on Windows 7, the NTFS file system driver is able to update its own internal structures faster when it doesn't have to create a sync point after every call.
I micro-benchmarked against RandomAccessFile to see how it would perform (results are very close to FileChannel, and still half the performance of AsynchronousFileChannel).
Not sure what happens with multi-threaded writes. This is on Java 7, on an SSD (the SSD is an order of magnitude faster than magnetic, and another order of magnitude faster on smaller files that fit in memory).
Will be interesting to see if the same ratios hold on Linux.
The main reason I can think of to use asynchronous IO is to better utilize the processor. Imagine you have some application which does some sort of processing on a file. And also let's assume you can process the data contained in the file in chunks. If you don't make use of asynchronous IO then your application will probably behave something like this:
Read a block of data. No processor utilization at this point as you're blocked waiting for the data to be read.
Process the data you just read. At this point your application will start consuming CPU cycles as it processes the data.
If there is more data to read, go to #1.
The processor utilization goes up, drops to zero, goes up again, and so on. Ideally you don't want to be idle if you want your application to be efficient and to process the data as fast as possible. A better approach would be:
Issue an async read.
When the read completes, issue the next async read and then process the data.
The first step is the bootstrapping. You have no data yet so you have to issue a read. From then on, when you get notified a read has completed you issue another async read and then process the data. The benefit here is that by the time you finish processing the chunk of data the next read has probably finished, so you always have data available to process and thus you're more efficiently using the processor. If your processing finishes before the read has finished you might need to issue multiple asynchronous reads so that you have more data to process.
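A hedged sketch of that loop using AsynchronousFileChannel's completion-handler API (the file name and chunk handling are assumptions):

```java
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.channels.CompletionHandler;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.CountDownLatch;

public class OverlappedProcessing {

    public static void main(String[] args) throws Exception {
        AsynchronousFileChannel channel =
                AsynchronousFileChannel.open(Paths.get("data.bin"), StandardOpenOption.READ);
        CountDownLatch done = new CountDownLatch(1);
        ByteBuffer buffer = ByteBuffer.allocate(64 * 1024);

        // Bootstrap: issue the first read; the attachment carries the file position.
        channel.read(buffer, 0L, 0L, new CompletionHandler<Integer, Long>() {
            @Override
            public void completed(Integer bytesRead, Long position) {
                if (bytesRead < 0) {          // end of file
                    done.countDown();
                    return;
                }
                // Copy the chunk so the shared buffer can be reused for the next read.
                buffer.flip();
                ByteBuffer chunk = ByteBuffer.allocate(buffer.remaining());
                chunk.put(buffer).flip();
                buffer.clear();

                long next = position + bytesRead;
                // Issue the next read *before* processing, so I/O and CPU work overlap.
                channel.read(buffer, next, next, this);
                process(chunk);
            }

            @Override
            public void failed(Throwable exc, Long position) {
                done.countDown();
            }
        });

        done.await();
        channel.close();
    }

    private static void process(ByteBuffer chunk) {
        // CPU-bound work on the chunk goes here.
    }
}
```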
Nick
Here's something no one has mentioned:
A plain FileChannel implements InterruptibleChannel, so it, as well as anything that uses it such as the OutputStream returned by Files.newOutputStream(), has the unfortunate[1][2] behaviour that performing any blocking operation on it (e.g. read() or write()) from a thread in interrupted state will cause the Channel itself to close with java.nio.channels.ClosedByInterruptException.
If this is a problem, using AsynchronousFileChannel instead is a possible alternative.
[1] http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6608965
[2] https://bugs.openjdk.java.net/browse/JDK-4469683