Using ThreadPoolTaskExecutor while transferring files over the network

I am currently writing a Spring Batch job that is supposed to transfer all the files from my application to a shared location. The job consists of a single step with a reader that reads byte[], a processor that converts it to PDF, and a writer that creates a new file at the shared location.
1) Since it's an IO-bound operation, should I use ThreadPoolTaskExecutor in my batch? Will using it cause data to be lost?
2) Also, in my ItemWriter I am writing using a FileOutputStream. My server is in Paris and the shared location is in New York. When writing the file in such a scenario, is there a better or more efficient way to achieve this with the least delay?
Thanks in advance

1) If you can separate the IO-bound operation into its own thread and other parts into their own threads, you could go for a multithreaded approach. Data won't be lost if you write it correctly.
One approach could be as follows:
Have one reader thread read the data into a buffer.
Have a second thread perform the transformation into PDF.
Have a third thread perform the write out (if it's not to the same disk as the reading).
2) So it's a mapped network share? There's probably not much you can do from Java then. But before you get worried about it, you should make sure it's an actual issue and not premature optimization.
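The three-thread layout from point 1) can be sketched with two bounded queues. This is only a minimal illustration and not Spring Batch code: the class name ThreeStagePipeline, the convert() placeholder (upper-casing instead of a real PDF conversion) and the in-memory list standing in for the shared location are all assumptions. It uses lambdas, so it needs Java 8 or later.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: one thread per stage, connected by bounded queues.
final class ThreeStagePipeline {
    // End-of-input marker; compared by identity, never by content.
    private static final byte[] END = new byte[0];

    static List<String> run(List<byte[]> inputs) throws InterruptedException {
        BlockingQueue<byte[]> raw = new ArrayBlockingQueue<>(16);
        BlockingQueue<byte[]> converted = new ArrayBlockingQueue<>(16);
        List<String> written = new ArrayList<>(); // touched only by the writer thread

        Thread reader = new Thread(() -> {
            try {
                for (byte[] item : inputs) {
                    raw.put(item); // blocks while the converter is behind
                }
                raw.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        Thread converter = new Thread(() -> {
            try {
                for (byte[] item = raw.take(); item != END; item = raw.take()) {
                    converted.put(convert(item));
                }
                converted.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        Thread writer = new Thread(() -> {
            try {
                for (byte[] item = converted.take(); item != END; item = converted.take()) {
                    // Stand-in for writing the file to the shared location.
                    written.add(new String(item, StandardCharsets.UTF_8));
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        reader.start();
        converter.start();
        writer.start();
        reader.join();
        converter.join();
        writer.join(); // join() makes the writer's results visible to the caller
        return written;
    }

    // Placeholder for the real byte[] -> PDF conversion (assumption).
    static byte[] convert(byte[] in) {
        String text = new String(in, StandardCharsets.UTF_8);
        return text.toUpperCase(Locale.ROOT).getBytes(StandardCharsets.UTF_8);
    }
}
```

Because each stage has exactly one thread and the queues are FIFO, items come out in the order they went in, and nothing is dropped as long as every stage forwards the END marker. If a stage is interrupted before forwarding it, the later stages would wait forever, so real code needs a proper shutdown path.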

I guess you can achieve the above task with partitioning. Create a master step that returns the file paths, and a multithreaded slave step that follows one of two approaches:
1) A simple tasklet that reads the file, converts it to PDF and writes it to the shared drive. Rather than using FileOutputStream, use FileChannel with buffered reads and writes; it has much better performance.
2) Use a chunk-oriented reader, specifically an ItemStream, and pass the items to a custom ItemWriter that writes to the file. (I never got the chance to write PDFs, but I believe at the end of the day it will be a stream with a different encoding.)
I would recommend the second one; it's always better to use a library than custom code.

Related

Java concurrent writes from multiple threads to a single text file?

I have a multi-threaded Java 7 program (a jar file) which uses JDBC to perform work (it uses a fixed thread pool).
The program works fine and it logs things as it progresses to the command shell console window (System.out.printf()) from multiple concurrent threads.
In addition to the console output I also need to add the ability for this program to write to a single plain ASCII text log file - from multiple threads.
The volume of output is low, and the file will be relatively small as it's a log file, not a data file.
Can you please suggest a good and relatively simple design/approach to get this done using Java 7 features (I don't have Java 8 yet)?
Any code samples would also be appreciated.
Thank you very much.
EDIT:
I forgot to add: in Java 7 using Files.newOutputStream() static factory method is stated to be thread safe - according to official Java documentation. Is this the simplest option to write a single shared text log file from multiple threads?

If you want to log output, why not use a logging library such as log4j2? This will allow you to tailor your log to your specific needs, and it can log without synchronizing your threads on stdout (you know that running System.out.print involves locking on System.out?)
Edit: For the latter, if the things you log are thread-safe, and you are OK with adding LMAX's disruptor.jar to your build, you can configure async loggers (just add "async") that will have a logging thread take care of the whole message formatting and writing (and keeping your log messages in order) while allowing your threads to run on without a hitch.

Given that you've said the volume of output is low, the simplest option would probably be to just write a thread-safe writer which uses synchronization to make sure that only one thread can actually write to the file at a time.
If you don't want threads to block each other, you could have a single thread dedicated to the writing, using a BlockingQueue - threads add write jobs (in whatever form they need to - probably just as strings) to the queue, and the single thread takes the values off the queue and writes them to the file.
Either way, it would be worth abstracting out the details behind a class dedicated for this purpose (ideally implementing an interface for testability and flexibility reasons). That way you can change the actual underlying implementation later on - for example, starting off with the synchronized approach and moving to the producer/consumer queue later if you need to.
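A minimal sketch of the queue-based variant, hidden behind an interface as suggested, could look like this. The names LogSink and QueueLogSink are made up for illustration, and the code sticks to Java 7 features (no lambdas).

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical abstraction so the implementation can be swapped later.
interface LogSink {
    void log(String line);
}

// Producer/consumer variant: callers only enqueue, one thread owns the file.
final class QueueLogSink implements LogSink, Runnable {
    // Sentinel compared by identity, so logging the text "STOP" is still safe.
    private static final String STOP = new String("STOP");

    private final BlockingQueue<String> queue = new LinkedBlockingQueue<String>();
    private final Path file;
    private final Thread writerThread;

    QueueLogSink(Path file) {
        this.file = file;
        this.writerThread = new Thread(this, "log-writer");
        this.writerThread.start(); // last statement, all fields are already set
    }

    @Override
    public void log(String line) {
        queue.add(line); // unbounded queue: never blocks the calling thread
    }

    @Override
    public void run() {
        try (BufferedWriter out = Files.newBufferedWriter(file, StandardCharsets.US_ASCII,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            for (String line = queue.take(); line != STOP; line = queue.take()) {
                out.write(line);
                out.newLine();
            }
        } catch (IOException e) {
            e.printStackTrace();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Writes everything queued so far, then stops the writer thread. */
    void shutdown() throws InterruptedException {
        queue.add(STOP);
        writerThread.join();
    }
}
```

Worker threads call sink.log(...) and the application calls sink.shutdown() once at exit. A synchronized implementation of the same LogSink interface could replace this class without touching the callers.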

Keep a common PrintStream reference where you'll write to (instead of System.out) and set it to System.out or channel it through to a FileOutputStream depending on what you want.
Your code won't change much (barely at all) and PrintStream is already synchronized too.
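As a rough sketch of that idea (SharedLog and its method names are hypothetical, not an existing API), all threads print through one holder class and the target is switched in a single place:

```java
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.PrintStream;

// Hypothetical helper: every thread logs through SharedLog.printf(...).
final class SharedLog {
    private static volatile PrintStream out = System.out;

    /** Sends all later output to the given file, appending to existing content. */
    static synchronized void useFile(String path) throws FileNotFoundException {
        out = new PrintStream(new FileOutputStream(path, true), true);
    }

    /** Switches back to the console and closes the file stream, if any. */
    static synchronized void useConsole() {
        PrintStream old = out;
        out = System.out;
        if (old != System.out) {
            old.close();
        }
    }

    static void printf(String format, Object... args) {
        PrintStream target = out;    // read the volatile field once
        target.printf(format, args); // PrintStream synchronizes each call internally
        target.flush();
    }
}
```

Existing System.out.printf(...) calls then become SharedLog.printf(...). Each call is written as a unit, but this sketch sends output to either the console or the file, not both at once.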

Should multiple threads read from the same DataInputStream?

I'd like my program to get a file, and then create 4 files based on its byte content.
Working with only the main thread, I just create one DataInputStream and do my thing sequentially.
Now, I'm interested in making my program concurrent. Maybe I can have four threads - one for each file to be created.
I don't want to read the file's bytes into memory all at once, so my threads will need to query the DataInputStream constantly to stream the bytes using read().
What is not clear to me is, should my 4 threads call read() on the same DataInputStream, or should each one have their own separate stream to read from?

I don't think this is a good idea. See http://download.java.net/jdk7/archive/b123/docs/api/java/io/DataInputStream.html
DataInputStream is not necessarily safe for multithreaded access. Thread safety is optional and is the responsibility of users of methods in this class.

Assuming you want all of the data in each of your four new files, each thread should create its own DataInputStream.
If the threads share a single DataInputStream, at best each thread will get some random quarter of the data. At worst, you'll get a crash or data corruption due to multithreaded access to code that is not thread safe.

If you want to read data from one file into four separate ones, you should not share a plain DataInputStream. You can, however, wrap that stream and add functionality that would make it thread safe.
For example you may want to read in a chunk of data from your DataInputStream and cache that small chunk. When all 4 threads have read the chunk you can dispose of it and continue reading. You would never have to load the complete file into memory. You would only have to load a small amount.

If you look at the documentation of DataInputStream, it is a FilterInputStream, which means read operations are delegated to another InputStream. Suppose the underlying stream is a FileInputStream: on most platforms, concurrent reads of the same file through separate streams are supported.
So in your case, you should create four different FileInputStreams, wrap each in its own DataInputStream, and use one per thread. The reads will not interfere with each other.

Short answer is no.
Longer answer: have a single thread read the DataInputStream, and put the data into one of four Queues, one per output file. Decide which Queue based upon the byte content.
Have four threads, each one reading from a Queue, that write to the output files.
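The single-reader, queue-per-output layout might be sketched as follows. The class name QueueFanOut and the routing rule (byte value modulo the number of outputs) are assumptions, since the question does not say how the byte content decides the target file. Passing one byte per queue entry keeps the sketch short; real code would hand over larger chunks.

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: one reader thread, one queue and one writer thread per output.
final class QueueFanOut {
    private static final int END = -1; // never a valid byte value (0..255)

    static void split(InputStream source, OutputStream[] outs)
            throws IOException, InterruptedException {
        int n = outs.length;
        List<BlockingQueue<Integer>> queues = new ArrayList<>();
        Thread[] writers = new Thread[n];
        for (int i = 0; i < n; i++) {
            BlockingQueue<Integer> q = new ArrayBlockingQueue<>(1024);
            OutputStream out = outs[i];
            queues.add(q);
            writers[i] = new Thread(() -> {
                try {
                    for (int b = q.take(); b != END; b = q.take()) {
                        out.write(b);
                    }
                    out.flush();
                } catch (IOException | InterruptedException e) {
                    throw new IllegalStateException(e);
                }
            });
            writers[i].start();
        }
        // Only this thread ever touches the DataInputStream.
        try (DataInputStream in = new DataInputStream(new BufferedInputStream(source))) {
            int b;
            while ((b = in.read()) != -1) {
                queues.get(route(b, n)).put(b);
            }
        } finally {
            for (BlockingQueue<Integer> queue : queues) {
                queue.put(END); // tell every writer that the input is finished
            }
        }
        for (Thread t : writers) {
            t.join();
        }
    }

    // Assumed routing rule for illustration only.
    static int route(int b, int n) {
        return b % n;
    }
}
```

The input is streamed, so at most the queue capacity per output is held in memory. One caveat of this sketch: if a writer thread dies, its queue can fill up and block the reader, so production code needs error propagation.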

How to get the Thread trace during reading a File

I have a file and I want to do the following (just to learn more about threads reading and writing files).
When an application starts and the file is read, I want to have information about all the streams that are open and how many threads are reading from the same stream.
Is there a way to get all the stream information via reflection? Is there another way?

I would suggest some sort of StreamFactory class that will maintain that information for you. Threads can then call InputStream getStream(File) and closeStream(InputStream) or some such, and the factory will maintain the list of which thread has what streams open and provide some statistics functions such as:
public Collection<InputStream> getOpenStreams()
and
public int getNumThreadsWithStream(InputStream);
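One possible shape for such a factory is sketched below. Only the four method names come from the suggestion above; everything else is an assumption, in particular the choice to hand out one shared stream per file and to count the threads that requested it.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: the factory is the only place where streams are opened.
final class StreamFactory {
    private final Map<File, InputStream> byFile = new HashMap<>();
    private final Map<InputStream, Set<Thread>> users = new HashMap<>();

    /** Returns the stream for the file and records the calling thread as a user. */
    public synchronized InputStream getStream(File file) throws IOException {
        InputStream in = byFile.get(file);
        if (in == null) {
            in = new FileInputStream(file);
            byFile.put(file, in);
            users.put(in, new HashSet<Thread>());
        }
        users.get(in).add(Thread.currentThread());
        return in;
    }

    /** Unregisters the calling thread; the last user actually closes the stream. */
    public synchronized void closeStream(InputStream in) throws IOException {
        Set<Thread> threads = users.get(in);
        if (threads == null) {
            return; // not opened through this factory, or already closed
        }
        threads.remove(Thread.currentThread());
        if (threads.isEmpty()) {
            users.remove(in);
            byFile.values().remove(in);
            in.close();
        }
    }

    public synchronized Collection<InputStream> getOpenStreams() {
        return new ArrayList<>(users.keySet());
    }

    public synchronized int getNumThreadsWithStream(InputStream in) {
        Set<Thread> threads = users.get(in);
        return threads == null ? 0 : threads.size();
    }
}
```

The bookkeeping is only as accurate as the callers are disciplined: a thread that ends without calling closeStream stays counted, and reads on the shared stream still need their own coordination.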

I believe this is something you have to keep track of yourself. If you are sharing a file stream between threads (and I suggest you don't do this; use one thread for reading the stream and pass work to a thread pool if you need to), you can keep track of all the threads which have attempted to read the stream.

Spring Batch Multithreaded Job with fixed order

I created a spring batch job that reads chunks (commit level = 10) of a flat CSV file and writes the output to another flat file. Plain and simple.
To test local scaling I also configured the tasklet with a TaskExecutor with a pool of 10 threads, thus introducing parallelism by using a multithreaded step pattern.
As expected these threads concurrently read items until their chunk is filled and the chunk is written to the output file.
Also as expected the order of the items has changed because of this concurrent reading.
But is it possible to maintain the fixed order, preferably still leveraging the increased performance gained by using multiple threads?

I can't think of an easy way. A workaround would be to prefix all lines with an ID which is created sequentially while reading. After finishing the job, sort the lines by the id. Sounds hacky, but should work.
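Outside of Spring Batch, the number-then-sort idea can be illustrated in plain Java. This is a sketch under assumptions: the class SortByReadOrder is hypothetical, upper-casing stands in for the real processing, and the ID is kept in a small holder object rather than as a textual line prefix.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: tag items sequentially on read, sort by the tag at the end.
final class SortByReadOrder {
    static final class Numbered {
        final long id;
        final String line;

        Numbered(long id, String line) {
            this.id = id;
            this.line = line;
        }
    }

    static List<String> process(List<String> lines, int threads) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        Queue<Numbered> finished = new ConcurrentLinkedQueue<>(); // filled in completion order
        long nextId = 0;
        for (String line : lines) {
            long id = nextId++; // assigned sequentially by the single reading thread
            pool.execute(() -> finished.add(new Numbered(id, line.toUpperCase(Locale.ROOT))));
        }
        pool.shutdown();
        if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
            throw new IllegalStateException("workers did not finish in time");
        }
        // Restore the original order after all workers are done.
        List<Numbered> sorted = new ArrayList<>(finished);
        sorted.sort(Comparator.comparingLong((Numbered n) -> n.id));
        List<String> result = new ArrayList<>();
        for (Numbered n : sorted) {
            result.add(n.line);
        }
        return result;
    }
}
```

The sort is a separate pass over all results, so this trades memory (or a second pass over the output file) for parallel processing.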

I don't think there is an easy solution. Using a single writer thread (which also sorts when writing) together with multiple reader threads could work, but it would not be as scalable.

Why use Java's AsynchronousFileChannel?

I can understand why network apps would use multiplexing (to not create too many threads), and why programs would use async calls for pipelining (more efficient). But I don't understand the efficiency purpose of AsynchronousFileChannel.
Any ideas?

It's a channel that you can use to read files asynchronously, i.e. the I/O operations are done on a separate thread, so that the thread you're calling it from can do other things while the I/O operations are happening.
For example: The read() methods of the class return a Future object to get the result of reading data from the file. So, what you can do is call read(), which will return immediately with a Future object. In the background, another thread will read the actual data from the file. Your own thread can continue doing things, and when it needs the read data, you call get() on the Future object. That will then return the data (if the background thread hasn't completed reading the data, it will make your thread block until the data is ready). The advantage of this is that your thread doesn't have to wait the whole length of the read operation; it can do some other things until it really needs the data.
See the documentation.
Note that AsynchronousFileChannel will be a new class in Java SE 7, which is not released yet.
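The Future-based read() described above looks roughly like this in code. The class and method names (AsyncReadDemo, readStart) are invented for the example, which reads at most maxBytes from the start of a file and decodes them as UTF-8.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

// Hypothetical sketch of a single asynchronous read.
final class AsyncReadDemo {
    static String readStart(Path file, int maxBytes)
            throws IOException, InterruptedException, ExecutionException {
        try (AsynchronousFileChannel channel =
                     AsynchronousFileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buffer = ByteBuffer.allocate(maxBytes);
            Future<Integer> pending = channel.read(buffer, 0); // returns immediately

            // ... the calling thread is free to do unrelated work here ...

            int bytesRead = pending.get(); // blocks only if the read is still running
            if (bytesRead < 0) {
                return ""; // -1 means the file is empty
            }
            buffer.flip();
            return StandardCharsets.UTF_8.decode(buffer).toString();
        }
    }
}
```

The buffer must not be touched between read() and get(), because the background operation is still writing into it.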

I've just come across another, somewhat unexpected reason for using AsynchronousFileChannel. When performing random record-oriented writes across large files (exceeding physical memory so caching isn't helping everything) on NTFS, I find that AsynchronousFileChannel performs over twice as many operations, in single-threaded mode, versus a normal FileChannel.
My best guess is that because the asynchronous io boils down to overlapped IO in Windows 7, the NTFS file system driver is able to update its own internal structures faster when it doesn't have to create a sync point after every call.
I micro-benchmarked against RandomAccessFile to see how it would perform (results are very close to FileChannel, and still half the performance of AsynchronousFileChannel).
Not sure what happens with multi-threaded writes. This is on Java 7, on an SSD (the SSD is an order of magnitude faster than magnetic, and another order of magnitude faster on smaller files that fit in memory).
Will be interesting to see if the same ratios hold on Linux.

The main reason I can think of to use asynchronous IO is to better utilize the processor. Imagine you have some application which does some sort of processing on a file. And also let's assume you can process the data contained in the file in chunks. If you don't make use of asynchronous IO then your application will probably behave something like this:
1) Read a block of data. There is no processor utilization at this point, as you're blocked waiting for the data to be read.
2) Process the data you just read. At this point your application starts consuming CPU cycles as it processes the data.
3) If there is more data to read, go to step 1.
Processor utilization will go up, then down to zero, then up, then down to zero, and so on. Ideally you never want to be idle if your application is to be efficient and process the data as fast as possible. A better approach would be:
1) Issue an async read.
2) When the read completes, issue the next async read and then process the data.
The first step is the bootstrapping. You have no data yet so you have to issue a read. From then on, when you get notified a read has completed you issue another async read and then process the data. The benefit here is that by the time you finish processing the chunk of data the next read has probably finished, so you always have data available to process and thus you're more efficiently using the processor. If your processing finishes before the read has finished you might need to issue multiple asynchronous reads so that you have more data to process.
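A minimal sketch of this read-ahead loop, using two buffers and the Future-returning read() of AsynchronousFileChannel, is shown here. ReadAhead and sumBytes are hypothetical names, and summing the bytes is just a stand-in for the real processing.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

// Hypothetical sketch: process chunk k while chunk k+1 is being read.
final class ReadAhead {
    static long sumBytes(Path file, int chunkSize)
            throws IOException, InterruptedException, ExecutionException {
        try (AsynchronousFileChannel channel =
                     AsynchronousFileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer current = ByteBuffer.allocate(chunkSize);
            ByteBuffer next = ByteBuffer.allocate(chunkSize);
            long position = 0;
            long sum = 0;

            Future<Integer> pending = channel.read(current, position); // bootstrap read
            while (true) {
                int bytesRead = pending.get(); // wait for the chunk in 'current'
                if (bytesRead <= 0) {
                    break; // end of file
                }
                position += bytesRead;

                next.clear();
                pending = channel.read(next, position); // issue the next read first

                current.flip(); // then process the chunk that just arrived
                while (current.hasRemaining()) {
                    sum += current.get() & 0xFF;
                }

                ByteBuffer swap = current; // the buffer being filled becomes 'current'
                current = next;
                next = swap;
            }
            return sum;
        }
    }
}
```

Only one read is in flight at a time here. If processing is faster than the disk, several reads with more buffers would be needed, as noted above.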
Nick

Here's something no one has mentioned:
A plain FileChannel implements InterruptibleChannel so it, as well as anything that uses it such as the OutputStream returned by Files.newOutputStream(), has the unfortunate[1][2] behaviour that performing any blocking operation on it (e.g. read() and write()) in a thread in interrupted state will cause the Channel itself to close with java.nio.channels.ClosedByInterruptException.
If this is a problem, using AsynchronousFileChannel instead is a possible alternative.
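The behaviour is easy to reproduce. In the following sketch (InterruptDemo is a made-up name) the current thread sets its own interrupt flag and then writes to a FileChannel:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch showing that an interrupted thread closes the channel it uses.
final class InterruptDemo {
    /** Returns true if the write failed and left the channel closed. */
    static boolean closedByInterrupt(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.WRITE)) {
            Thread.currentThread().interrupt(); // set the interrupt flag
            try {
                channel.write(ByteBuffer.wrap(new byte[] {1}));
                return false; // the write went through
            } catch (ClosedByInterruptException e) {
                return !channel.isOpen(); // the channel itself is now unusable
            } finally {
                Thread.interrupted(); // clear the flag so the caller is unaffected
            }
        }
    }
}
```

After the exception, every other thread sharing that channel sees it as closed too, which is why this matters for long-lived shared channels.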
[1] http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6608965
[2] https://bugs.openjdk.java.net/browse/JDK-4469683
