Java - Reading A Binary File In Parallel

I have a binary file that contains blocks of information (I'll refer to them as packets henceforth). Each packet consists of a fixed-length header and a variable-length body; I have to determine the length of the body from the packet header itself. My task is to read these packets from the file and perform some operation on them. Currently I'm performing this task as follows:
Opening the file as a random access file and going to a specific start position (a user-specified start position). Reading the 1st packet from this position. Performing the specific operation
Then in a loop
reading the next packet
performing my operation
This goes on till I hit the end of file marker.
As you can guess, when the file is huge, reading each packet serially and processing it is time-consuming. I want to somehow parallelize this, i.e. run the packet-generation step on its own, put the packets into some blocking queue, and then retrieve packets from the queue in parallel and perform my operation on each.
Can someone suggest how I may generate these packets in parallel?

You should only have one thread read the file sequentially, since I'm assuming the file lies on a single drive. Reading the file is limited by your IO speed, so there's no point in parallelizing that on the CPU. In fact, reading non-sequentially will actually significantly decrease your performance, since regular hard drives are designed for sequential IO. For each packet it reads in, the IO thread should put that object into a thread-safe queue.
Now you can start parallelizing the processing of the packets. Create multiple threads and have them each read in packets from the queue. Each thread should do their processing and put it into some "finished" queue.
Once the IO thread has finished reading in the file, a flag should be set so that the working threads stop once the queue is empty.
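A minimal sketch of that design, assuming (for illustration) that each packet starts with a 4-byte big-endian body length; all class and method names are invented for the example, and a zero-length byte array serves as the end-of-stream sentinel (one per worker):

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// One IO thread parses packets and feeds a bounded queue; worker threads
// drain it. An empty sentinel array signals end-of-file to each worker.
public class PacketPipeline {
    private static final byte[] POISON = new byte[0];

    // Assumed layout: 4-byte big-endian body length, then the body.
    static byte[] readPacket(DataInputStream in) throws IOException {
        int bodyLen;
        try {
            bodyLen = in.readInt();           // fixed-length header (assumption)
        } catch (EOFException eof) {
            return null;                      // clean end of file
        }
        byte[] body = new byte[bodyLen];
        in.readFully(body);                   // variable-length body
        return body;
    }

    public static int countPackets(InputStream src, int workers) throws InterruptedException {
        BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(1024);
        AtomicInteger processed = new AtomicInteger();

        Thread reader = new Thread(() -> {
            try (DataInputStream in = new DataInputStream(src)) {
                byte[] p;
                while ((p = readPacket(in)) != null) {
                    queue.put(p);             // blocks when the queue is full
                }
                for (int i = 0; i < workers; i++) queue.put(POISON); // one per worker
            } catch (IOException | InterruptedException e) {
                throw new RuntimeException(e);
            }
        });

        Thread[] pool = new Thread[workers];
        for (int i = 0; i < workers; i++) {
            pool[i] = new Thread(() -> {
                try {
                    byte[] p;
                    while ((p = queue.take()) != POISON) {
                        processed.incrementAndGet(); // "perform my operation" here
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            pool[i].start();
        }
        reader.start();
        reader.join();
        for (Thread t : pool) t.join();
        return processed.get();
    }

    public static void main(String[] args) throws Exception {
        // Two packets: bodies of length 3 and 1, each length-prefixed.
        byte[] data = {0, 0, 0, 3, 'a', 'b', 'c', 0, 0, 0, 1, 'x'};
        System.out.println(countPackets(new ByteArrayInputStream(data), 4)); // 2
    }
}
```

The bounded queue doubles as the "flag": rather than polling a boolean, the workers simply exit when they take a sentinel, which the IO thread enqueues only after the last real packet.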

If you are using a disk with platters (i.e. not an SSD) then there is no point having more than one thread read the file, since all you will do is thrash the disk, causing the disk arm to introduce millisecond delays. If you have an SSD it's a different story and you could parallelise the reading.
Instead you should have one thread reading the data from the file and creating the packets, then doing the following:
wait on a shared semaphore 'A' (which has been initialised to some number that will be your 'max buffered packets' count)
lock a shared object
append the packet to a LinkedList
signal another shared semaphore 'B' (this one is tracking the count of the packets in the buffer)
Then you can have many other threads doing the following:
wait on the 'B' semaphore (to ensure there is a packet to be processed)
lock the shared object
do removeFirst() on the LinkedList (so the packet is actually taken out of the buffer) and store the packet in a local variable
signal semaphore 'A' to allow another packet into the buffered packet list
This will ensure you are reading packets as fast as possible (from a platter disk) by striping them in one continuous sequence, and it will ensure that you are processing multiple packets at once without any polling.
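The two-semaphore scheme above might look like this as code; the class name is made up for the example, and semaphore 'A' counts free slots while 'B' counts filled ones:

```java
import java.util.LinkedList;
import java.util.concurrent.Semaphore;

// Bounded buffer built from the two semaphores described above.
// 'a' limits the number of buffered packets; 'b' counts packets available.
public class SemaphoreBuffer<T> {
    private final Semaphore a;                      // free slots ("max buffered packets")
    private final Semaphore b = new Semaphore(0);   // filled slots
    private final LinkedList<T> buffer = new LinkedList<>();

    public SemaphoreBuffer(int maxBuffered) {
        this.a = new Semaphore(maxBuffered);
    }

    // Called by the single file-reading thread.
    public void put(T packet) throws InterruptedException {
        a.acquire();                    // wait on 'A': is there room?
        synchronized (buffer) {         // lock the shared object
            buffer.addLast(packet);     // append the packet
        }
        b.release();                    // signal 'B': one more packet available
    }

    // Called by each worker thread.
    public T take() throws InterruptedException {
        b.acquire();                    // wait on 'B': is there a packet?
        T packet;
        synchronized (buffer) {         // lock the shared object
            packet = buffer.removeFirst(); // take and remove the first packet
        }
        a.release();                    // signal 'A': a slot is free again
        return packet;
    }
}
```

In practice a java.util.concurrent.ArrayBlockingQueue gives you exactly this behaviour in one class, but the version above makes the semaphore choreography explicit.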

I guess the well-known fast method is using java.nio.MappedByteBuffer.
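For instance, a minimal sketch that maps the file and walks the same assumed length-prefixed packet layout as in the question (the helper name, temp file, and 4-byte header are all assumptions for the demo):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Map the whole file, then walk the assumed length-prefixed packet layout
// directly in the mapped region; pages are faulted in lazily by the OS.
public class MappedReader {
    static int countPackets(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            int packets = 0;
            while (map.remaining() >= 4) {
                int bodyLen = map.getInt();   // assumed 4-byte big-endian length header
                byte[] body = new byte[bodyLen];
                map.get(body);                // here you'd hand 'body' to a worker pool
                packets++;
            }
            return packets;
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("packets", ".bin");
        Files.write(tmp, new byte[] {0, 0, 0, 2, 'h', 'i', 0, 0, 0, 1, '!'});
        System.out.println(countPackets(tmp)); // 2
        tmp.toFile().deleteOnExit();          // mapped files may resist deletion on Windows
    }
}
```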

Related

Faster DatagramChannel write

I'm building a high performance network application that needs to pump out hundreds of megabits per second. It is UDP based and I'm using a DatagramChannel to send the data.
My application is separated into two parts, with one thread doing each part. One part is reading from something, doing some processing and putting the result in a queue to be sent. The second part is reading from the queue and writing to the DatagramChannel.
The thread doing the writing is falling behind greatly (sometimes causing out of memory errors due to how much data is in the queue) and I'm wondering if there are ways to speed up the DatagramChannel write operation.

Should multiple threads read from the same DataInputStream?

I'd like my program to get a file, and then create 4 files based on its byte content.
Working with only the main thread, I just create one DataInputStream and do my thing sequentially.
Now, I'm interested in making my program concurrent. Maybe I can have four threads - one for each file to be created.
I don't want to read the file's bytes into memory all at once, so my threads will need to query the DataInputStream constantly to stream the bytes using read().
What is not clear to me is, should my 4 threads call read() on the same DataInputStream, or should each one have their own separate stream to read from?
I don't think this is a good idea. See http://download.java.net/jdk7/archive/b123/docs/api/java/io/DataInputStream.html
DataInputStream is not necessarily safe for multithreaded access. Thread safety is optional and is the responsibility of users of methods in this class.
Assuming you want all of the data in each of your four new files, each thread should create its own DataInputStream.
If the threads share a single DataInputStream, at best each thread will get some random quarter of the data. At worst, you'll get a crash or data corruption due to multithreaded access to code that is not thread safe.
If you want to read data from 1 file into 4 separate ones you will not share DataInputStream. You can however wrap that stream and add functionality that would make it thread safe.
For example you may want to read in a chunk of data from your DataInputStream and cache that small chunk. When all 4 threads have read the chunk you can dispose of it and continue reading. You would never have to load the complete file into memory. You would only have to load a small amount.
If you look at the documentation for DataInputStream, it is a FilterInputStream, which means the read operations are delegated to another InputStream. Suppose the underlying stream is a FileInputStream; on most platforms, concurrent reads from separate streams are supported.
So in your case, you should initialize four different FileInputStreams, resulting in four DataInputStreams, used in four threads separately. The read operations will not interfere with one another.
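A minimal sketch of that arrangement: each of the four threads opens its own FileInputStream over the same file, so no read position is shared. The byte-summing helper and the temporary file are just for illustration:

```java
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Four threads, four independent streams over one file: no shared position.
public class FourReaders {
    // Each call opens its own stream, so calls are safe to run concurrently.
    static long sumBytes(Path file) throws IOException {
        long sum = 0;
        try (DataInputStream in = new DataInputStream(new FileInputStream(file.toFile()))) {
            int b;
            while ((b = in.read()) != -1) sum += b;   // stand-in for per-file filtering
        }
        return sum;
    }

    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("input", ".bin");
        Files.write(tmp, new byte[] {1, 2, 3});
        long[] results = new long[4];
        Thread[] threads = new Thread[4];
        for (int i = 0; i < 4; i++) {
            final int id = i;
            threads[i] = new Thread(() -> {
                try {
                    results[id] = sumBytes(tmp);      // independent stream per thread
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) t.join();
        for (long r : results) System.out.println(r); // each thread saw all 3 bytes
    }
}
```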
Short answer is no.
Longer answer: have a single thread read the DataInputStream, and put the data into one of four Queues, one per output file. Decide which Queue based upon the byte content.
Have four threads, each one reading from a Queue, that write to the output files.
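A sketch of that layout; the mod-4 routing rule and all names are placeholders for "decide based upon the byte content", only the reader side is shown, and a -1 sentinel in each queue tells its writer thread to stop:

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// One reader thread routes each byte to one of four queues; four writer
// threads (not shown) would each drain one queue into its output file.
public class Demultiplexer {
    private static final int EOF = -1;   // sentinel pushed to every queue

    static List<BlockingQueue<Integer>> demux(InputStream src)
            throws IOException, InterruptedException {
        List<BlockingQueue<Integer>> queues = new ArrayList<>();
        for (int i = 0; i < 4; i++) queues.add(new LinkedBlockingQueue<>());
        try (DataInputStream in = new DataInputStream(src)) {
            int b;
            while ((b = in.read()) != -1) {
                queues.get(b % 4).put(b); // route on byte content (placeholder rule)
            }
        }
        for (BlockingQueue<Integer> q : queues) q.put(EOF); // tell writers to stop
        return queues;
    }

    public static void main(String[] args) throws Exception {
        List<BlockingQueue<Integer>> qs =
            demux(new ByteArrayInputStream(new byte[] {0, 1, 2, 3, 4}));
        System.out.println(qs.get(0)); // queue 0 received 0, 4, then the -1 sentinel
    }
}
```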

Timing issues with socket Java I/O

Trying to understand how Java sockets operate. A question is: what can you do simultaneously if you are using the Java socket API, and what happens if we send and read data with some delay?
READ & WRITE at once. If one socket-client is connected to one socket-server, can they BOTH read and write at the same time? As far as I understand, TCP is full-duplex, so theoretically a socket should be able to read and write at once, but we have to create two threads for both client and server. Am I right?
WRITE to N clients at once. If several socket-clients are connected to one socket-server, can the server read from several clients at the same moment, and can it write to several clients at the same moment?
If the maximum possible physical speed of the network card is 1 kbyte/sec and 5 clients are connected, at what speed can the server write to one client?
How can I implement sequential sending of data in both directions? I mean, I want to send N bytes from server to client, then M bytes from client to server, then N bytes from server to client, etc. The problem is that if either side has written something to the channel, the other side will only see end-of-stream (read() == -1) once the channel is closed, which means we cannot reuse the connection and have to open another one. Or should we place readers and writers in different threads which do their job with read() and write() until the connection is closed?
Imagine we have a delay between calling write(); flush() on one side, and calling read() on the other side. During the delay, where would the written data be stored? Would it be transmitted? What is the maximum size of that "delayed" data to be stored somewhere in between?
Correct. If you're using blocking I/O, you'll need a reader thread and a writer thread for each Socket connection.
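A self-contained sketch of that thread pair on the loopback interface (all names invented; the in-process echo server exists only to make the demo runnable): the main thread acts as the reader while a dedicated thread writes, so both directions of the socket are in use at once.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

// Blocking I/O with one writer thread and one reader thread per socket.
public class DuplexDemo {
    static String roundTrip(String msg) throws Exception {
        try (ServerSocket server = new ServerSocket(0)) {        // any free port
            Thread echo = new Thread(() -> {                     // toy echo server
                try (Socket s = server.accept();
                     BufferedReader in = new BufferedReader(
                         new InputStreamReader(s.getInputStream()));
                     PrintWriter out = new PrintWriter(s.getOutputStream(), true)) {
                    out.println(in.readLine());                  // echo one line back
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            });
            echo.start();

            try (Socket client = new Socket("localhost", server.getLocalPort())) {
                Thread writer = new Thread(() -> {               // dedicated writer thread
                    try {
                        new PrintWriter(client.getOutputStream(), true).println(msg);
                    } catch (IOException e) {
                        throw new RuntimeException(e);
                    }
                });
                writer.start();
                // Meanwhile this thread plays the reader role.
                BufferedReader in = new BufferedReader(
                    new InputStreamReader(client.getInputStream()));
                String reply = in.readLine();
                writer.join();
                echo.join();
                return reply;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip("ping")); // ping
    }
}
```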
You could use a single thread to write to N clients at once, but you run the risk of blocking on a write. I won't address writing speeds here, as they depend on several things, but obviously the cumulative writing speed to all clients would be under 1 kbyte/sec.
Yes, you'll need 2 threads, you can't do this with a single thread (or you could, but as you said yourself, you'd need to constantly open and close connections).
It would be stored in a buffer somewhere. Depending on your code it could be in a buffered stream, or in the socket's own buffer. I believe the default buffer size of BufferedOutputStream is 8K, and the socket's own buffer size would depend on the environment. It shouldn't really matter though; the streaming nature of TCP/IP removes the need to think about buffers unless you really need to do fine-tuning.

Java - Process bytes as they are being read from a file

Is there a way to have one thread in java make a read call to some FileInputStream or similar and have a second thread processing the bytes being loaded at the same time?
I've tried a number of things - my current attempt has one thread running this:
FileChannel inStream = null;
try {
    inStream = new FileInputStream(inFile).getChannel();
} catch (FileNotFoundException e) {
    e.printStackTrace();
}
int result;
try {
    result = inStream.read(inBuffer);
} ...
And a second thread wanting to access the bytes as they are being loaded. Clearly the read call in the first thread blocks until the buffer is full, but I want to be able to access the bytes loaded into the buffer before that point. Currently, everything I try leaves the buffer and its backing array unchanged until the read completes. This not only defeats the point of this threading but also suggests the data is being loaded into some intermediate buffer somewhere and then copied into my buffer later, which seems daft.
One option would be to do a bunch of smaller reads into the array with offsets on subsequent reads, but that adds extra overhead.
Any ideas?
When you read data sequentially, the OS will read ahead the data before you need it. As the system is doing this for you already, you may not get the benefit you might expect.
why can't I just make my Filechannel or FileInputStream "flow" into my ByteBuffer or some byte array?
That is sort of what it does already.
If you want a more seamless loading of the data, you can use a memory mapped files as it "appears" in the memory of the program immediately and is loaded in the background as you use it.
What I usually do with requirements like this is to use multiple buffer class instances, preferably sized to allow efficient loading - a multiple of cluster size, say. As soon as the first buffer gets loaded up, queue it off (i.e. push its pointer/instance onto a producer-consumer queue) to the thread that will process it, and immediately create (or depool) another buffer instance and start loading that one. To control overall data flow, you can create a suitable number of buffer objects at startup and store them in a 'pool queue' (another producer-consumer queue), and then you can circulate the objects full of data from the pool, to the file-read thread, then to the buffer-processing thread, then back to the pool.
This keeps the file->processing queue 'topped up' with buffer-objects full of data, no bulk copying required, no unavoidable delays, no inefficient inter-thread comms of single bytes, no messy locking of buffer-indexes, no chance that the file-read thread and data-processing thread can ever operate on the same buffer object.
If you want/need to use a threadPool to perform the processing, you can easily do so but you may need a sequence-number in the buffer objects if you need any resulting output from this subsystem to be in the same order as it was read from the file.
The buffer objects may also contain result data members, exception/errorMessage fields, anything that you might want. The file and/or result data could easily be forwarded on to other threads from the data-processing stage (e.g. a logger or a GUI display of progress) before getting repooled. Since it's all just pointer/instance queueing, the huge amount of data will flow around your system quickly and efficiently.
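A compact sketch of this pooling scheme: buffers circulate pool -> reader -> processor -> pool, so only references move between threads, never bulk data. Byte-summing stands in for the real processing, and all names are invented for the example:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Two producer-consumer queues: 'pool' holds empty buffers, 'filled' holds
// loaded ones. An empty marker buffer tells the processor to stop.
public class BufferPoolPipeline {
    private static final ByteBuffer DONE = ByteBuffer.allocate(0); // end marker

    static long run(InputStream src, int poolSize, int bufSize) throws Exception {
        BlockingQueue<ByteBuffer> pool = new ArrayBlockingQueue<>(poolSize);
        BlockingQueue<ByteBuffer> filled = new ArrayBlockingQueue<>(poolSize + 1);
        for (int i = 0; i < poolSize; i++) pool.put(ByteBuffer.allocate(bufSize));

        Thread reader = new Thread(() -> {
            try {
                byte[] chunk = new byte[bufSize];
                int n;
                while ((n = src.read(chunk)) != -1) {
                    ByteBuffer buf = pool.take();        // depool an empty buffer
                    buf.clear();
                    buf.put(chunk, 0, n).flip();         // load it and hand it off
                    filled.put(buf);
                }
                filled.put(DONE);
            } catch (IOException | InterruptedException e) {
                throw new RuntimeException(e);
            }
        });

        long[] total = {0};
        Thread processor = new Thread(() -> {
            try {
                ByteBuffer buf;
                while ((buf = filled.take()) != DONE) {
                    while (buf.hasRemaining()) total[0] += buf.get(); // "process" it
                    pool.put(buf);                        // return buffer to the pool
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        reader.start();
        processor.start();
        reader.join();
        processor.join();
        return total[0];
    }

    public static void main(String[] args) throws Exception {
        byte[] data = {1, 2, 3, 4, 5};
        System.out.println(run(new ByteArrayInputStream(data), 4, 2)); // 15
    }
}
```

Because the pool queue is bounded, the reader naturally throttles itself when the processor falls behind, which is the "topped up but never overflowing" behaviour described above.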
I would recommend using a SynchronousQueue. The reader will retrieve data from the queue and the writer will "publish" the data from your file.
Use a PipedInputStream/PipedOutputStream pair to create a familiar-looking pipe with a buffer.
You can also use a FileInputStream to read the file byte by byte if necessary. The fis.read() call returns -1 at end of stream, and you can always check available() to see how many bytes can be read without blocking.
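The piped-stream suggestion might be sketched like this (names invented; a loop writing five bytes stands in for the file-reading side). The writer thread pushes bytes into the pipe's internal buffer as it loads them, and the processing thread reads them out concurrently through an ordinary InputStream:

```java
import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;

// A producer thread writes into the pipe while the consumer reads bytes
// as they arrive, without waiting for the whole "file" to load.
public class PipeDemo {
    static int pipeSum() throws Exception {
        PipedOutputStream out = new PipedOutputStream();
        PipedInputStream in = new PipedInputStream(out, 8192); // 8 KB pipe buffer

        Thread producer = new Thread(() -> {
            try {
                for (int i = 1; i <= 5; i++) out.write(i);     // stands in for file reads
            } catch (IOException e) {
                throw new RuntimeException(e);
            } finally {
                try { out.close(); } catch (IOException ignored) {} // signals EOF
            }
        });
        producer.start();

        int sum = 0, b;
        while ((b = in.read()) != -1) sum += b; // consumer sees bytes as they arrive
        producer.join();
        return sum;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(pipeSum()); // 15
    }
}
```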

Why use Java's AsynchronousFileChannel?

I can understand why network apps would use multiplexing (to not create too many threads), and why programs would use async calls for pipelining (more efficient). But I don't understand the efficiency purpose of AsynchronousFileChannel.
Any ideas?
It's a channel that you can use to read files asynchronously, i.e. the I/O operations are done on a separate thread, so that the thread you're calling it from can do other things while the I/O operations are happening.
For example: The read() methods of the class return a Future object to get the result of reading data from the file. So, what you can do is call read(), which will return immediately with a Future object. In the background, another thread will read the actual data from the file. Your own thread can continue doing things, and when it needs the read data, you call get() on the Future object. That will then return the data (if the background thread hasn't completed reading the data, it will make your thread block until the data is ready). The advantage of this is that your thread doesn't have to wait the whole length of the read operation; it can do some other things until it really needs the data.
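A sketch of that Future-based style (the helper name and the temporary file are just for the demo):

```java
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.Future;

// read() returns immediately with a Future; the calling thread keeps working
// and only blocks in get() if the background read hasn't finished yet.
public class AsyncReadDemo {
    static String readFirst(Path file, int n) throws Exception {
        try (AsynchronousFileChannel ch =
                 AsynchronousFileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate(n);
            Future<Integer> pending = ch.read(buf, 0); // starts the read, returns now

            // ... this thread could do other useful work here ...

            int bytesRead = pending.get();             // block only when data is needed
            return new String(buf.array(), 0, bytesRead);
        }
    }

    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("demo", ".bin");
        Files.write(tmp, "hello".getBytes());
        System.out.println(readFirst(tmp, 5)); // hello
        Files.delete(tmp);
    }
}
```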
See the documentation.
Note that AsynchronousFileChannel will be a new class in Java SE 7, which is not released yet.
I've just come across another, somewhat unexpected reason for using AsynchronousFileChannel. When performing random record-oriented writes across large files (exceeding physical memory so caching isn't helping everything) on NTFS, I find that AsynchronousFileChannel performs over twice as many operations, in single-threaded mode, versus a normal FileChannel.
My best guess is that because the asynchronous io boils down to overlapped IO in Windows 7, the NTFS file system driver is able to update its own internal structures faster when it doesn't have to create a sync point after every call.
I micro-benchmarked against RandomAccessFile to see how it would perform (results are very close to FileChannel, and still half the performance of AsynchronousFileChannel).
Not sure what happens with multi-threaded writes. This is on Java 7, on an SSD (the SSD is an order of magnitude faster than magnetic, and another order of magnitude faster on smaller files that fit in memory).
Will be interesting to see if the same ratios hold on Linux.
The main reason I can think of to use asynchronous IO is to better utilize the processor. Imagine you have some application which does some sort of processing on a file. And also let's assume you can process the data contained in the file in chunks. If you don't make use of asynchronous IO then your application will probably behave something like this:
Read a block of data. No processor utilization at this point as you're blocked waiting for the data to be read.
process the data you just read. At this point your application will start consuming CPU cycles as it processes the data.
If more data to read, goto #1.
The processor utilization will repeatedly climb and then drop back to zero. Ideally you never want to be idle if you want your application to be efficient and process the data as fast as possible. A better approach would be:
Issue async read
When read completes issue next async read and then process data
The first step is the bootstrapping. You have no data yet so you have to issue a read. From then on, when you get notified a read has completed you issue another async read and then process the data. The benefit here is that by the time you finish processing the chunk of data the next read has probably finished, so you always have data available to process and thus you're more efficiently using the processor. If your processing finishes before the read has finished you might need to issue multiple asynchronous reads so that you have more data to process.
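A sketch of this loop using the CompletionHandler flavour of AsynchronousFileChannel (names and the byte-summing "processing" are invented for the demo). For simplicity this version processes the chunk before chaining the next read into the same buffer; the fully overlapped variant described above would rotate two buffers so the next read proceeds while the current chunk is processed:

```java
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.channels.CompletionHandler;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicLong;

// The bootstrap read is issued once; each completion then processes its
// chunk and chains the next read, until read() reports end of file (-1).
public class OverlappedReader {
    static long sumFile(Path file, int chunkSize) throws Exception {
        final AsynchronousFileChannel ch =
            AsynchronousFileChannel.open(file, StandardOpenOption.READ);
        final ByteBuffer buf = ByteBuffer.allocate(chunkSize);
        final AtomicLong sum = new AtomicLong();
        final CountDownLatch done = new CountDownLatch(1);

        ch.read(buf, 0L, 0L, new CompletionHandler<Integer, Long>() { // bootstrap
            @Override
            public void completed(Integer n, Long pos) {
                if (n == -1) { done.countDown(); return; }            // end of file
                buf.flip();
                while (buf.hasRemaining()) sum.addAndGet(buf.get());  // process chunk
                buf.clear();
                ch.read(buf, pos + n, pos + n, this);                 // chain next read
            }
            @Override
            public void failed(Throwable exc, Long pos) { done.countDown(); }
        });
        done.await();
        ch.close();
        return sum.get();
    }

    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("chunks", ".bin");
        Files.write(tmp, new byte[] {1, 2, 3, 4, 5});
        System.out.println(sumFile(tmp, 2)); // 15
        Files.delete(tmp);
    }
}
```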
Here's something no one has mentioned:
A plain FileChannel implements InterruptibleChannel so it, as well as anything that uses it such as the OutputStream returned by Files.newOutputStream(), has the unfortunate[1][2] behaviour that performing any blocking operation on it (e.g. read() and write()) in a thread in interrupted state will cause the Channel itself to close with java.nio.channels.ClosedByInterruptException.
If this is a problem, using AsynchronousFileChannel instead is a possible alternative.
[1] http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6608965
[2] https://bugs.openjdk.java.net/browse/JDK-4469683
