Java - Process bytes as they are being read from a file


Is there a way to have one thread in java make a read call to some FileInputStream or similar and have a second thread processing the bytes being loaded at the same time?
I've tried a number of things - my current attempt has one thread running this:
FileChannel inStream;
try {
    inStream = (new FileInputStream(inFile)).getChannel();
} catch (FileNotFoundException e) {
    e.printStackTrace();
}
int result;
try {
    result = inStream.read(inBuffer);
} ...
And a second thread wanting to access the bytes as they are being loaded. Clearly the read call in the first thread blocks until the buffer is full, but I want to be able to access the bytes loaded into the buffer before that point. Currently, everything I try leaves the buffer and its backing array unchanged until the read completes. This not only defeats the point of the threading but also suggests the data is being loaded into some intermediate buffer somewhere and then copied into my buffer later, which seems daft.
One option would be to do a bunch of smaller reads into the array with offsets on subsequent reads, but that adds extra overhead.
Any ideas?

When you read data sequentially, the OS will read ahead the data before you need it. As the system is doing this for you already, you may not get the benefit you might expect.
"Why can't I just make my FileChannel or FileInputStream 'flow' into my ByteBuffer or some byte array?"
That is sort of what it does already.
If you want more seamless loading of the data, you can use a memory-mapped file: it "appears" in the memory of the program immediately and is loaded in the background as you use it.
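As a rough illustration of the memory-mapped approach, here is a minimal sketch. The class and method names (MappedSum, sumBytes) are made up for this example, and summing the bytes is only a stand-in for whatever processing you actually need.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MappedSum {
    // Hypothetical helper: sums every byte of the file (as an unsigned value)
    // by walking a read-only memory mapping of it.
    static long sumBytes(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // One mapping can cover at most Integer.MAX_VALUE bytes;
            // a larger file would need several mappings.
            MappedByteBuffer mapped =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            long sum = 0;
            while (mapped.hasRemaining()) {
                // The OS pages the file contents in as they are touched.
                sum += mapped.get() & 0xFF;
            }
            return sum;
        }
    }
}
```

There is no explicit read call here: the processing loop simply touches the buffer, and the operating system decides when to load each page.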

What I usually do with requirements like this is to use multiple buffer class instances, preferably sized to allow efficient loading - a multiple of cluster-size, say. As soon as the first buffer gets loaded up, queue it off (i.e. push its pointer/instance onto a producer-consumer queue) to the thread that will process it, and immediately create (or depool) another buffer instance and start loading that one. To control overall data flow, you can create a suitable number of buffer objects at startup and store them in a 'pool queue' (another producer-consumer queue), and then you can circulate the objects full of data from the pool, to the file-read thread, then to the buffer-processing thread, then back to the pool.
This keeps the file->processing queue 'topped up' with buffer-objects full of data, no bulk copying required, no unavoidable delays, no inefficient inter-thread comms of single bytes, no messy locking of buffer-indexes, no chance that the file-read thread and data-processing thread can ever operate on the same buffer object.
If you want/need to use a threadPool to perform the processing, you can easily do so but you may need a sequence-number in the buffer objects if you need any resulting output from this subsystem to be in the same order as it was read from the file.
The buffer-objects may also contain result data members, exception/errorMessage fields, anything that you might want. The file and/or result data could easily be forwarded on to other thread/s from the data-processing (e.g. a logger or GUI display of progress) before getting repooled. Since it's all just pointer/instance queueing, the huge amount of data will flow around your system quickly and efficiently.
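A minimal sketch of that buffer circulation, assuming two ArrayBlockingQueue instances play the roles of the pool queue and the file->processing queue. The Chunk class, its fields, and the sum method are hypothetical names for this example, and summing bytes stands in for the real processing.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class BufferCirculation {
    // Hypothetical buffer object: payload plus bookkeeping fields.
    static final class Chunk {
        final byte[] data;
        int length;        // number of valid bytes, or -1 at end of input
        IOException error; // set if the read failed
        Chunk(int size) { data = new byte[size]; }
    }

    // Reads 'in' on a separate thread and processes the chunks on the calling thread.
    static long sum(InputStream in, int chunkSize, int chunkCount) throws InterruptedException {
        BlockingQueue<Chunk> pool = new ArrayBlockingQueue<>(chunkCount);
        BlockingQueue<Chunk> filled = new ArrayBlockingQueue<>(chunkCount);
        for (int i = 0; i < chunkCount; i++) {
            pool.add(new Chunk(chunkSize));
        }

        Thread reader = new Thread(() -> {
            try {
                int n;
                do {
                    Chunk chunk = pool.take(); // waits for a free buffer, which bounds memory use
                    try {
                        n = in.read(chunk.data);
                    } catch (IOException e) {
                        chunk.error = e;       // hand the failure over to the consumer
                        n = -1;
                    }
                    chunk.length = n;
                    filled.put(chunk);         // only the reference is queued, no copying
                } while (n >= 0);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        reader.start();

        long total = 0;
        while (true) {
            Chunk chunk = filled.take();
            if (chunk.length < 0) {
                if (chunk.error != null) throw new UncheckedIOException(chunk.error);
                break;
            }
            for (int i = 0; i < chunk.length; i++) {
                total += chunk.data[i] & 0xFF; // placeholder for real processing
            }
            pool.put(chunk);                   // recycle the buffer
        }
        reader.join();
        return total;
    }
}
```

Because each Chunk is owned by exactly one thread at a time (whoever last took it from a queue), no locking of buffer indexes is needed.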

I would recommend using a SynchronousQueue. The processing thread retrieves data from the queue and the file-reading thread "publishes" the data from your file to it.

Use a PipedInputStream/PipedOutputStream pair to create a familiar-looking pipe with a buffer.
You can also use a FileInputStream to read the file byte by byte if necessary. Note that fis.read() returns -1 only at end of file and can block while it waits for data; available() gives an estimate of how many bytes can be read without blocking.
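A small sketch of the pipe idea, with one thread copying the source into the pipe and the calling thread consuming from the other end. PipeDemo and countThroughPipe are hypothetical names, and counting bytes is just a placeholder for real work.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;

final class PipeDemo {
    // Copies 'source' into a pipe on a producer thread while the calling
    // thread reads the bytes from the other end of the pipe.
    static long countThroughPipe(InputStream source) throws IOException, InterruptedException {
        PipedOutputStream pipeOut = new PipedOutputStream();
        PipedInputStream pipeIn = new PipedInputStream(pipeOut, 8192); // 8 KiB pipe buffer

        Thread producer = new Thread(() -> {
            // Closing the output end is what signals end of stream to the reader.
            try (PipedOutputStream out = pipeOut) {
                byte[] buf = new byte[1024];
                int n;
                while ((n = source.read(buf)) != -1) {
                    out.write(buf, 0, n); // blocks while the pipe buffer is full
                }
            } catch (IOException e) {
                // The pipe is closed by try-with-resources; the reader sees end of stream.
            }
        });
        producer.start();

        long count = 0;
        try (PipedInputStream in = pipeIn) {
            while (in.read() != -1) { // blocks until data arrives or the writer closes
                count++;
            }
        }
        producer.join();
        return count;
    }
}
```

The piped streams copy every byte through an intermediate buffer, so this is convenient rather than fast.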

Related

How can I get two threads to read from one inputStream?

I have one input stream coming in that is periodically receiving data. One of my threads (let's call it threadA) reads every message from the stream and makes sure the data is ok, and throws an error otherwise. My other thread (let's call it threadB) needs to read a few specific messages and then process them. As of now I have threadA just store the important messages in a global variable, and threadB read the messages from the global variable.
Is there any way to allow for two threads to read from the same source to avoid this?
edit: the data coming in are responses to commands threadB issued. My issue is that threadB needs the replies from certain commands, which are issued in no particular pattern, but it does not need all the replies.

You probably could create a threadsafe inputstream or a wrapper and if the stream supports mark/reset you could also have two streams read the data in parallel. However, you'd have to handle situations where one thread reads faster than the other thus making mark/reset unusable or having to skip data - there's so much involved, I doubt you'll want to bother with all this.
I'd suggest you keep your basic setup but try to get rid of global variables, e.g. by using the observer pattern, passing references to the shared store to the threads etc.
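One way the observer idea could look, as a sketch only: threadA stays the sole reader and publishes each validated message to whoever subscribed with a matching filter. MessageDispatcher, subscribe and publish are hypothetical names, and messages are assumed to be Strings for simplicity.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;
import java.util.function.Predicate;

final class MessageDispatcher {
    // A filter paired with the listener that wants the matching messages.
    private static final class Subscription {
        final Predicate<String> filter;
        final Consumer<String> listener;
        Subscription(Predicate<String> filter, Consumer<String> listener) {
            this.filter = filter;
            this.listener = listener;
        }
    }

    // CopyOnWriteArrayList lets subscribers register while another thread publishes.
    private final List<Subscription> subscriptions = new CopyOnWriteArrayList<>();

    void subscribe(Predicate<String> filter, Consumer<String> listener) {
        subscriptions.add(new Subscription(filter, listener));
    }

    // Called by the single reading thread (threadA) for every validated message.
    void publish(String message) {
        for (Subscription s : subscriptions) {
            if (s.filter.test(message)) {
                s.listener.accept(message);
            }
        }
    }
}
```

threadB could subscribe with a listener that adds to its own BlockingQueue and then take() from that queue, so it only ever sees the replies it asked for and no global variable is involved.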

Should multiple threads read from the same DataInputStream?

I'd like my program to get a file, and then create 4 files based on its byte content.
Working with only the main thread, I just create one DataInputStream and do my thing sequentially.
Now, I'm interested in making my program concurrent. Maybe I can have four threads - one for each file to be created.
I don't want to read the file's bytes into memory all at once, so my threads will need to query the DataInputStream constantly to stream the bytes using read().
What is not clear to me is, should my 4 threads call read() on the same DataInputStream, or should each one have their own separate stream to read from?

I don't think this is a good idea. See http://download.java.net/jdk7/archive/b123/docs/api/java/io/DataInputStream.html
DataInputStream is not necessarily safe for multithreaded access. Thread safety is optional and is the responsibility of users of methods in this class.

Assuming you want all of the data in each of your four new files, each thread should create its own DataInputStream.
If the threads share a single DataInputStream, at best each thread will get some random quarter of the data. At worst, you'll get a crash or data corruption due to multithreaded access to code that is not thread safe.

If you want to read data from 1 file into 4 separate ones you will not share DataInputStream. You can however wrap that stream and add functionality that would make it thread safe.
For example you may want to read in a chunk of data from your DataInputStream and cache that small chunk. When all 4 threads have read the chunk you can dispose of it and continue reading. You would never have to load the complete file into memory. You would only have to load a small amount.

If you look at the documentation of DataInputStream, you will see it is a FilterInputStream, which means the read operation is delegated to another InputStream. Suppose the stream you use underneath is a FileInputStream: on most platforms, concurrent reads of the same file through separate streams are supported.
So in your case, you should create four different FileInputStreams, wrapped in four DataInputStreams, and use each one in its own thread. The read operations will not interfere with each other.

Short answer is no.
Longer answer: have a single thread read the DataInputStream, and put the data into one of four Queues, one per output file. Decide which Queue based upon the byte content.
Have four threads, each one reading from a Queue, that write to the output files.
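A sketch of that single-reader, queue-per-output layout. Everything here is an assumption made for illustration: the FanOut name, the routing rule (byte value modulo the number of outputs), and the in-memory ByteArrayOutputStream sinks that stand in for the output files. Queueing individual bytes keeps the example short; a real version would more likely queue records or chunks.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

final class FanOut {
    private static final int END = -1; // sentinel that tells a writer thread to stop

    // One reader (the calling thread) routes each byte to one of 'outputs' queues;
    // one writer thread per queue drains it into its own sink.
    static List<byte[]> split(InputStream in, int outputs) throws IOException, InterruptedException {
        List<BlockingQueue<Integer>> queues = new ArrayList<>();
        List<ByteArrayOutputStream> sinks = new ArrayList<>();
        List<Thread> writers = new ArrayList<>();
        for (int i = 0; i < outputs; i++) {
            BlockingQueue<Integer> queue = new LinkedBlockingQueue<>(1024);
            ByteArrayOutputStream sink = new ByteArrayOutputStream(); // stands in for a file
            Thread writer = new Thread(() -> {
                try {
                    for (int b = queue.take(); b != END; b = queue.take()) {
                        sink.write(b);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            writer.start();
            queues.add(queue);
            sinks.add(sink);
            writers.add(writer);
        }
        try {
            int b;
            while ((b = in.read()) != -1) {
                queues.get(b % outputs).put(b); // placeholder routing rule
            }
        } finally {
            for (BlockingQueue<Integer> queue : queues) {
                queue.put(END);
            }
        }
        for (Thread writer : writers) {
            writer.join();
        }
        List<byte[]> results = new ArrayList<>();
        for (ByteArrayOutputStream sink : sinks) {
            results.add(sink.toByteArray());
        }
        return results;
    }
}
```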

Java - Reading A Binary File In Parallel

I have a binary file that contains blocks of information (I'll refer to them as packets henceforth). Each packet consists of a fixed-length header and a variable-length body. I have to determine the length of the body from the packet header itself. My task is to read these packets from the file and perform some operation on them. Currently I'm performing this task as follows:
Open the file as a random access file and go to a specific (user-specified) start position.
Read the first packet from this position.
Perform the specific operation.
Then, in a loop:
    read the next packet
    perform my operation
This goes on until I hit the end-of-file marker.
As you can guess, when the file size is huge, reading each packet serially and processing it is a time-consuming affair. I want to somehow parallelize this operation, i.e. generate the packets and put them in some blocking queue, and then retrieve the packets from the queue in parallel and perform my operation on each.
Can someone suggest how may I generate these packets in parallel?

You should only have one thread read in the file sequentially since I'm assuming the file lies in a single drive. Reading the file is limited by your IO speed so there's no point in parallelizing that in the CPU. In fact, reading non-sequentially will actually significantly decrease your performance since regular hard drives are designed for sequential IO. For each packet it reads in, it should put that object into a thread-safe queue.
Now you can start parallelizing the processing of the packets. Create multiple threads and have them each read in packets from the queue. Each thread should do their processing and put it into some "finished" queue.
Once the IO thread has finished reading in the file, a flag should be set so that the working threads stop once the queue is empty.
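A sketch of this layout under a few stated assumptions: the packet header is taken to be a 4-byte big-endian body length (the real header format is not given in the question), the "processing" just adds up body sizes, and instead of a shared flag the reader pushes one end marker ("poison pill") per worker, which is a common substitute for the flag. PacketPipeline and totalBodyBytes are hypothetical names.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

final class PacketPipeline {
    // End marker, compared by identity so a real zero-length body is not mistaken for it.
    private static final byte[] POISON = new byte[0];

    static long totalBodyBytes(InputStream raw, int workers)
            throws IOException, InterruptedException {
        BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(64);
        AtomicLong total = new AtomicLong();

        Thread[] pool = new Thread[workers];
        for (int i = 0; i < workers; i++) {
            pool[i] = new Thread(() -> {
                try {
                    for (byte[] body = queue.take(); body != POISON; body = queue.take()) {
                        total.addAndGet(body.length); // placeholder for real processing
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            pool[i].start();
        }

        // Single sequential reader: the calling thread.
        DataInputStream in = new DataInputStream(raw);
        try {
            while (true) {
                int length;
                try {
                    length = in.readInt();   // assumed header: body length only
                } catch (EOFException endOfFile) {
                    break;
                }
                byte[] body = new byte[length];
                in.readFully(body);
                queue.put(body);             // blocks if the workers fall behind
            }
        } finally {
            for (int i = 0; i < workers; i++) {
                queue.put(POISON);           // one end marker per worker
            }
        }
        for (Thread t : pool) {
            t.join();
        }
        return total.get();
    }
}
```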

If you are using a disk with platters (i.e. not an SSD) then there is no point having more than one thread read the file, since all you will do is thrash the disk, causing the disk arm to introduce millisecond delays. If you have an SSD it's a different story and you could parallelise the reading.
Instead you should have one thread reading the data from the file and creating the packets, then doing the following:
wait on a shared semaphore 'A' (which has been initialised to some number that will be your 'max buffered packets' count)
lock a shared object
append the packet to a LinkedList, then release the lock
signal another shared semaphore 'B' (this one is tracking the count of the packets in the buffer)
Then you can have many other threads doing the following:
wait on the 'B' semaphore (to ensure there is a packet to be processed)
lock the shared object
do removeFirst() on the LinkedList and store the packet in a local variable, then release the lock
signal semaphore 'A' to allow another packet into the buffered packet list
This will ensure you are reading packets as fast as possible (from a platter disk) by striping them in one continuous sequence, and it will ensure that you are processing multiple packets at once without any polling.
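The two-semaphore scheme can be sketched roughly like this. SemaphoreBuffer, put and take are hypothetical names; freeSlots corresponds to semaphore 'A' and available to semaphore 'B'. Note that the consumer has to remove the packet from the list, not just look at it.

```java
import java.util.LinkedList;
import java.util.concurrent.Semaphore;

final class SemaphoreBuffer<T> {
    private final LinkedList<T> packets = new LinkedList<>();
    private final Semaphore freeSlots;                     // semaphore 'A'
    private final Semaphore available = new Semaphore(0);  // semaphore 'B'

    SemaphoreBuffer(int maxBufferedPackets) {
        freeSlots = new Semaphore(maxBufferedPackets);
    }

    // Called by the single file-reading thread.
    void put(T packet) throws InterruptedException {
        freeSlots.acquire();              // wait until there is room in the buffer
        synchronized (packets) {          // lock the shared object
            packets.addLast(packet);
        }
        available.release();              // announce that one more packet is ready
    }

    // Called by any number of processing threads.
    T take() throws InterruptedException {
        available.acquire();              // wait until a packet is ready
        T packet;
        synchronized (packets) {
            packet = packets.removeFirst();
        }
        freeSlots.release();              // allow another packet into the buffer
        return packet;
    }
}
```

In practice java.util.concurrent.ArrayBlockingQueue gives the same bounded, blocking behaviour out of the box, so hand-rolling this is mostly useful for understanding what happens underneath.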

I guess the known fast method is using java.nio.MappedByteBuffer

PipedOutputStream in HashMap for threads

I have couple of threads that run in background. They do share a common HashMap.
Is it possible to store (safely) PipedOutputStream there?
I have this following scenario:
When first background thread receives a specific event, it should start read text data from a huge file into a buffer.
Second background thread (they are independent) should be notified somehow and then read data from the buffer (pipe) as it arrives.
Because all threads can access the HashMap, is it ok to store there all the streams?

You can use a ConcurrentHashMap. I don't see much point in using a Pipe here as files will be read by the OS in advance of where you are reading anyway.

HashMap is not synchronized, so you have to add your own locking or use a synchronized collection. As for the streams, since you can only write to an OutputStream and only read from an InputStream, you will have no problem writing from one thread and reading from another.

Why use Java's AsynchronousFileChannel?

I can understand why network apps would use multiplexing (to not create too many threads), and why programs would use async calls for pipelining (more efficient). But I don't understand the efficiency purpose of AsynchronousFileChannel.
Any ideas?

It's a channel that you can use to read files asynchronously, i.e. the I/O operations are done on a separate thread, so that the thread you're calling it from can do other things while the I/O operations are happening.
For example: The read() methods of the class return a Future object to get the result of reading data from the file. So, what you can do is call read(), which will return immediately with a Future object. In the background, another thread will read the actual data from the file. Your own thread can continue doing things, and when it needs the read data, you call get() on the Future object. That will then return the data (if the background thread hasn't completed reading the data, it will make your thread block until the data is ready). The advantage of this is that your thread doesn't have to wait the whole length of the read operation; it can do some other things until it really needs the data.
See the documentation.
Note that AsynchronousFileChannel will be a new class in Java SE 7, which is not released yet.
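A short sketch of the Future-based usage described above. FutureRead and readStart are hypothetical names, and the "other work" is left as a comment.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

final class FutureRead {
    // Starts an asynchronous read of up to maxBytes from the start of the file,
    // then blocks only at the point where the data is actually needed.
    static byte[] readStart(Path file, int maxBytes)
            throws IOException, InterruptedException, ExecutionException {
        try (AsynchronousFileChannel channel =
                     AsynchronousFileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buffer = ByteBuffer.allocate(maxBytes);
            Future<Integer> pending = channel.read(buffer, 0); // returns immediately

            // ... the calling thread could do unrelated work here,
            // as long as it does not touch 'buffer' yet ...

            pending.get();   // blocks until the read has completed (-1 means end of file)
            buffer.flip();
            byte[] result = new byte[buffer.remaining()];
            buffer.get(result);
            return result;
        }
    }
}
```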

I've just come across another, somewhat unexpected reason for using AsynchronousFileChannel. When performing random record-oriented writes across large files (exceeding physical memory so caching isn't helping everything) on NTFS, I find that AsynchronousFileChannel performs over twice as many operations, in single-threaded mode, versus a normal FileChannel.
My best guess is that because the asynchronous io boils down to overlapped IO in Windows 7, the NTFS file system driver is able to update its own internal structures faster when it doesn't have to create a sync point after every call.
I micro-benchmarked against RandomAccessFile to see how it would perform (results are very close to FileChannel, and still half of the performance of AsynchronousFileChannel).
Not sure what happens with multi-threaded writes. This is on Java 7, on an SSD (the SSD is an order of magnitude faster than magnetic, and another order of magnitude faster on smaller files that fit in memory).
Will be interesting to see if the same ratios hold on Linux.

The main reason I can think of to use asynchronous IO is to better utilize the processor. Imagine you have some application which does some sort of processing on a file. And also let's assume you can process the data contained in the file in chunks. If you don't make use of asynchronous IO then your application will probably behave something like this:
1. Read a block of data. No processor utilization at this point as you're blocked waiting for the data to be read.
2. Process the data you just read. At this point your application will start consuming CPU cycles as it processes the data.
3. If there is more data to read, go to #1.
The processor utilization will go up, then to zero, then up, then to zero, and so on. Ideally you want to never be idle if you want your application to be efficient and process the data as fast as possible. A better approach would be:
1. Issue an async read.
2. When the read completes, issue the next async read and then process the data.
The first step is the bootstrapping. You have no data yet so you have to issue a read. From then on, when you get notified a read has completed you issue another async read and then process the data. The benefit here is that by the time you finish processing the chunk of data the next read has probably finished, so you always have data available to process and thus you're more efficiently using the processor. If your processing finishes before the read has finished you might need to issue multiple asynchronous reads so that you have more data to process.
Nick
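One possible shape for that "issue the next read, then process" loop, as a sketch using two buffers and the Future-returning read of AsynchronousFileChannel. OverlappedReader and sum are hypothetical names, and summing bytes is a stand-in for the real processing.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

final class OverlappedReader {
    static long sum(Path file, int chunkSize)
            throws IOException, InterruptedException, ExecutionException {
        try (AsynchronousFileChannel channel =
                     AsynchronousFileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer current = ByteBuffer.allocate(chunkSize);
            ByteBuffer next = ByteBuffer.allocate(chunkSize);
            long position = 0;
            long total = 0;

            Future<Integer> pending = channel.read(current, position); // bootstrap read
            while (true) {
                int n = pending.get();        // wait for the outstanding read
                if (n < 0) {
                    break;                    // end of file
                }
                position += n;

                next.clear();
                pending = channel.read(next, position); // start the next read first...

                current.flip();
                while (current.hasRemaining()) {        // ...then process while it runs
                    total += current.get() & 0xFF;
                }

                ByteBuffer swap = current;    // the buffer being filled becomes current
                current = next;
                next = swap;
            }
            return total;
        }
    }
}
```

If processing is much faster than the disk, more than two buffers and several outstanding reads would be needed to keep the processor busy.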

Here's something no one has mentioned:
A plain FileChannel implements InterruptibleChannel so it, as well as anything that uses it such as the OutputStream returned by Files.newOutputStream(), has the unfortunate[1][2] behaviour that performing any blocking operation on it (e.g. read() and write()) in a thread in interrupted state will cause the Channel itself to close with java.nio.channels.ClosedByInterruptException.
If this is a problem, using AsynchronousFileChannel instead is a possible alternative.
[1] http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6608965
[2] https://bugs.openjdk.java.net/browse/JDK-4469683
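To see the behaviour for yourself, a small sketch along these lines should do. InterruptDemo and channelClosedByInterrupt are hypothetical names; the method sets the current thread's interrupt flag, attempts a read on a plain FileChannel, and reports whether the channel ended up closed.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class InterruptDemo {
    // Returns true if reading in an interrupted thread closed the channel.
    static boolean channelClosedByInterrupt(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            Thread.currentThread().interrupt();   // simulate a pending interrupt
            try {
                channel.read(ByteBuffer.allocate(16));
                return false;                     // the read went through normally
            } catch (ClosedByInterruptException e) {
                return !channel.isOpen();         // the channel itself has been closed
            } finally {
                Thread.interrupted();             // clear the flag so the caller is unaffected
            }
        }
    }
}
```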
