My batch process needs to read lines from huge files (1-3 GB), each of which can be processed independently of the others. The files can have 10-50M rows. I was thinking of spawning about a dozen threads, each processing a predetermined range of buffers concurrently, e.g. T1 reads range 0-1, T2 reads 1-2, etc. That means, of course, that T2 needs to jump straight to buffer position 2 without reading 0-2.
Is this type of segmentation of buffered file reading for the purposes of concurrency possible with Java NIO?
There is no point to this. The CPU may allow multiple threads but the disk is still single-threaded. All this will do is cause disk thrashing. Forget it.
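That said, the literal answer to the NIO question is yes: FileChannel supports positional reads that neither depend on nor advance the channel's current position, so a thread can jump straight to any offset. A minimal sketch, assuming a hypothetical file name and an arbitrary 64 KB buffer size:

import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class PositionalRead {
    public static void main(String[] args) throws Exception {
        try (FileChannel channel = FileChannel.open(
                Paths.get("huge-file.txt"), StandardOpenOption.READ)) {  // hypothetical file
            ByteBuffer buffer = ByteBuffer.allocate(64 * 1024);
            long offset = 2L * 64 * 1024;  // jump straight to "buffer position 2"
            // A positional read neither uses nor changes the channel's own position,
            // so several threads can issue reads like this against one shared channel.
            int bytesRead = channel.read(buffer, offset);
            System.out.println("Read " + bytesRead + " bytes at offset " + offset);
        }
    }
}

Whether that buys you any throughput is another matter, for the reason above.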
I have many gzipped files which contain records that I am trying to sequence into a single consolidated file. CPU power is not a constraint.
I want to spin up threads that read from GZIPInputStreams as necessary. The amount that will be read from each file at any given time varies and is unpredictable. The most obvious way to solve this problem is to have a thread pool where a task is submitted to read from a GZIPInputStream whenever a backing buffer falls below a low watermark.
I am concerned that reading from a single GZIPInputStream from different threads could expose a memory-visibility issue, since the class may assume that its data is consumed from only one thread.
To be clear, I am not suggesting that more than one thread will read from the same GZIPInputStream concurrently; rather, the lack of synchronization may leave the stream's internal state inconsistent when it is read from one thread and then immediately read from another.
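For what it's worth, the usual cure for that concern is to hand the stream between threads through something that establishes a happens-before edge, such as a BlockingQueue from java.util.concurrent: everything the first thread did to the stream before put() is then visible to the thread that take()s it. A sketch under those assumptions (the file name is made up; the point is the put/take hand-off):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.zip.GZIPInputStream;

public class StreamHandOff {
    // Handing the stream through a BlockingQueue gives a happens-before edge.
    private static final BlockingQueue<InputStream> handOff =
            new ArrayBlockingQueue<>(1);

    public static void main(String[] args) throws Exception {
        InputStream in = new GZIPInputStream(new FileInputStream("records.gz"));

        Thread reader1 = new Thread(() -> {
            try {
                byte[] chunk = new byte[8192];
                in.read(chunk);          // this thread reads first...
                handOff.put(in);         // ...then publishes the stream
            } catch (IOException | InterruptedException e) {
                throw new RuntimeException(e);
            }
        });

        Thread reader2 = new Thread(() -> {
            try {
                InputStream s = handOff.take();  // safe to read after take()
                byte[] chunk = new byte[8192];
                s.read(chunk);
            } catch (IOException | InterruptedException e) {
                throw new RuntimeException(e);
            }
        });

        reader1.start();
        reader2.start();
        reader1.join();
        reader2.join();
        in.close();
    }
}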
I want to define a thread pool with 10 threads and read the contents of a file, but different threads must not read the same content (i.e. divide the content into 10 pieces and have each piece read by one thread).
Well, what you would do is roughly this:
Get the length of the file.
Divide by N.
Create N threads.
Have each one skip to (file_size / N) * thread_no and read (file_size / N) bytes into a buffer.
Wait for all threads to complete.
Stitch the buffers together.
(If you were slightly clever about it, you could avoid the last step ...)
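A sketch of those steps, assuming a hypothetical file name, N = 4, chunks small enough to fit in an int for ByteBuffer.allocate, and ignoring the remainder when the length isn't divisible by N:

import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SplitRead {
    public static void main(String[] args) throws Exception {
        final int n = 4;
        try (FileChannel channel = FileChannel.open(
                Paths.get("big-file.dat"), StandardOpenOption.READ)) {
            long chunkSize = channel.size() / n;  // divide by N (remainder ignored here)
            ByteBuffer[] buffers = new ByteBuffer[n];
            ExecutorService pool = Executors.newFixedThreadPool(n);  // create N threads
            for (int i = 0; i < n; i++) {
                final int threadNo = i;
                pool.execute(() -> {
                    try {
                        ByteBuffer buf = ByteBuffer.allocate((int) chunkSize);
                        // Skip to (file_size / N) * thread_no and read one chunk;
                        // positional reads don't touch the channel's shared position.
                        channel.read(buf, chunkSize * threadNo);
                        buffers[threadNo] = buf;
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);  // wait for all threads to complete
            // buffers[0..n-1], in order, are the stitched-together file contents
        }
    }
}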
HOWEVER, it is doubtful that you would get much speed-up by doing this. Indeed, I wouldn't be surprised if you got a slowdown in many cases. With a typical OS, I would expect that you would get as good, if not better, performance by reading the file using one big read(...) call from one thread.
The OS can fetch the data faster from the disk if you read it sequentially. Indeed, a lot of OSes optimize for this use-case, and use read-ahead and in-memory buffering (using OS-level buffers) to give high effective file read rates.
Reading a file with multiple threads means that each thread will typically be reading from a different position in the file. Naively, that would require the OS to seek the disk heads backwards and forwards between the different positions ... which will slow down I/O considerably. In practice, the OS will do various things to mitigate that, but even so, simultaneously reading data from different positions on a disk is still bad for I/O throughput.
I am extracting lines matching a pattern from log files, so I assigned each log file to a Runnable object that writes the matching lines to a result file (via well-synchronized writer methods).
Important snippet under discussion:

ExecutorService executor = Executors.newFixedThreadPool(NUM_THREAD);
for (File eachLogFile : hundredsOfLogFilesArrayObject) {
    executor.execute(new RunnableSlavePatternMatcher(eachLogFile));
}
Important criteria:

The number of log files could be very few, like 20, or for some users the number of log files could exceed 1000. I recorded a series of tests in an Excel sheet, and I am really concerned about the red-marked results. 1. I assume that if the number of threads created is equal to the number of files to be processed, the processing time would be less than when the number of threads is smaller than the number of files, which didn't happen. (Please advise me if my understanding is wrong.)
Result: (screenshot of the test timings not reproduced)
2. I would like to identify a value for NUM_THREAD that is efficient for a small number of files as well as for thousands of files.

Please suggest answers to questions 1 and 2.

Thanks!
Chandru
You just found that your program is not CPU-bound but (likely) I/O-bound.
This means that beyond 10 threads the OS can't keep up with the reads requested by all the threads that want their data, and additional threads just sit waiting for their next block.
Also, because writing the output is synchronized across all threads, that may even be the biggest bottleneck in your program (a producer-consumer solution may be the answer here, to minimize the time threads spend waiting to write their output; see the sketch below).
The optimal number of threads depends on how fast you can read the files (the faster you can read, the more threads are useful).
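A sketch of that producer-consumer idea (class and file names are mine, not from the question): the pattern-matching threads only enqueue their hits, and a single writer thread drains the queue, so workers never contend on a shared writer lock.

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class MatchWriter implements Runnable {
    private static final String POISON_PILL = "\u0000EOF";  // shutdown marker
    private final BlockingQueue<String> matches = new LinkedBlockingQueue<>();

    // Matcher threads call this instead of writing to the file themselves.
    public void submit(String line) throws InterruptedException {
        matches.put(line);
    }

    // Call once, after all matcher threads have completed.
    public void finish() throws InterruptedException {
        matches.put(POISON_PILL);
    }

    @Override
    public void run() {
        try (BufferedWriter out = new BufferedWriter(new FileWriter("result.txt"))) {
            while (true) {
                String line = matches.take();
                if (line.equals(POISON_PILL)) {
                    break;  // all producers are done
                }
                out.write(line);
                out.newLine();
            }
        } catch (IOException | InterruptedException e) {
            throw new RuntimeException(e);
        }
    }
}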
It appears that 2 threads are enough to use all your processing power. Most likely you have two cores with hyper-threading.
Mine is an Intel i5, 2.4 GHz, 4 CPUs, 8 GB RAM. Is this detail helpful?
Depending on the model, this has 2 cores and hyper-threading.
I assume that if the number of threads created is equal to the number of files to be processed then the processing time would be less,
This will maximise the overhead, but won't give you more cores than you already have.
When parallelizing, using many more threads than you have CPU cores will usually increase the overall time. Your system will spend overhead time switching from thread to thread on one CPU core instead of executing the tasks one after another.
If you have 8 CPU cores on your computer, you might observe some improvement using 8/9/10 threads instead of only 1, while using 20+ threads will actually be less efficient.
One problem is that I/O doesn't parallelize well, especially on a non-SSD disk, since sequential reads (what happens when one thread reads one file) are much faster than random reads (what happens when the read head has to jump around between different files read by several threads). I would guess you could speed up the program by reading the files from the thread that sends the jobs to the executor:
for (File file : hundredsOfLogFilesArrayObject) {
    byte[] fileContents = readContentsOfFile(file);
    executor.execute(new RunnableSlavePatternMatcher(fileContents));
}
As for the optimal thread count, that depends.
If your app is I/O bound (which is quite possible if you're not doing extremely heavy processing of the contents), a single worker thread which can process the file contents while the original thread reads the next file will probably suffice.
If you're CPU bound, you probably don't want many more threads than you've got cores:
ExecutorService executor = Executors.newFixedThreadPool(
        Runtime.getRuntime().availableProcessors());
Although, if your threads get suspended a lot (waiting for synchronization locks, or something), you may get better results with more threads. Or if you've got other CPU-munching activities going on, you may want fewer threads.
You can try using a cached thread pool.
public static ExecutorService newCachedThreadPool()
Creates a thread pool that creates new threads as needed, but will reuse previously constructed threads when they are available. These pools will typically improve the performance of programs that execute many short-lived asynchronous tasks. Calls to execute will reuse previously constructed threads if available.
You can read more in the Executors Javadoc.
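A sketch of what that looks like, adapting the snippet from the question (the shutdown call is my addition, so the pool's threads can die off and the program can exit):

ExecutorService executor = Executors.newCachedThreadPool();
for (File eachLogFile : hundredsOfLogFilesArrayObject) {
    executor.execute(new RunnableSlavePatternMatcher(eachLogFile));
}
executor.shutdown();  // stop accepting tasks; idle pool threads terminate on their own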
I have a requirement where I need to read a text file, transform it, and write it to some other file. I wish to do this in a parallel fashion: one thread for reading, one for transforming, and another for writing.
Now, to share data between the threads I need some channel. I was thinking of using a BlockingQueue for this but would like to explore other (better) alternatives if available.
Guava has an EventBus, but I am not sure whether it is a good fit for this requirement. What other alternatives are available, and which one is best from a performance point of view?
Unless your transform step is really intensive, this is probably a waste of time.
Think of it this way. What are you asking for?
You're asking for something that
Takes an incoming stream of data
Copies it to another thread
Presents it to that thread as an incoming stream of data
What data structure best represents an incoming stream of data for step 3? (Hint: it's the InputStream you started with!)
What value do the first two steps add? The "transform" thread can read from disk just as fast as it could read from disk through another thread. Adding the thread in between does not speed up the disk read.
You would start to consider adding another thread when
Your problem can be usefully divided into independent pieces of work (say, each thread works on a chunk of text)
The cost of splitting the problem into those pieces of work is significantly smaller than the overhead of adding an additional thread and coordinating between them (which is small, but not free!)
The problem requires more resources than a single CPU can provide (a thread gives you access to more CPU resources, but doesn't provide much value in terms of I/O throughput)
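That said, if the transform really is expensive enough to justify it, a minimal sketch of the three-stage pipeline with BlockingQueues might look like this (file names, queue capacities, and the toUpperCase stand-in transform are all placeholders):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class Pipeline {
    private static final String EOF = "\u0000EOF";  // end-of-stream marker

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> toTransform = new ArrayBlockingQueue<>(1024);
        BlockingQueue<String> toWrite = new ArrayBlockingQueue<>(1024);

        Thread reader = new Thread(() -> {
            try (BufferedReader in = new BufferedReader(new FileReader("in.txt"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    toTransform.put(line);  // blocks when the transformer falls behind
                }
                toTransform.put(EOF);
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        });

        Thread transformer = new Thread(() -> {
            try {
                String line;
                while (!(line = toTransform.take()).equals(EOF)) {
                    toWrite.put(line.toUpperCase());  // stand-in for the real transform
                }
                toWrite.put(EOF);
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        });

        Thread writer = new Thread(() -> {
            try (BufferedWriter out = new BufferedWriter(new FileWriter("out.txt"))) {
                String line;
                while (!(line = toWrite.take()).equals(EOF)) {
                    out.write(line);
                    out.newLine();
                }
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        });

        reader.start(); transformer.start(); writer.start();
        reader.join(); transformer.join(); writer.join();
    }
}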
I have a program in which each thread reads many lines at a time from a file, processes the lines, and writes them out to a different file. Four threads split the list of files to process among them. I'm having strange performance issues across two cases:
Four files with 50,000 lines each
Throughput starts at 700 lines/sec processed, declines to ~100 lines/sec
30,000 files with 12 lines each
Throughput starts around 800 lines/sec and remains steady
This is internal software I'm working on so unfortunately I can't share any source code, but the main steps of the program are:
Split list of files among four worker threads
Start all threads.
Thread reads up to 100 lines at once and stores them in a String[] array.
Thread applies transformation to all lines in array.
Thread writes lines to a file (not same as input file).
Steps 3-5 repeat for each thread until all files are completely processed.
What I don't understand is why 30k files with 12 lines each give me greater performance than a few files with many lines each. I would have expected the overhead of opening and closing the files to be greater than that of reading a single file. In addition, the decline in performance in the former case is exponential in nature.
I've set the maximum heap size to 1024 MB and it appears to use 100 MB at most, so an overtaxed GC isn't the problem. Do you have any other ideas?
From your numbers, I guess that GC is probably not the issue. I suspect that this is a normal behavior of a disk, being operated on by many concurrent threads. When the files are big, the disk has to switch context between the threads many times (producing significant disk seek time), and the overhead is apparent. With small files, maybe they are read as a single chunk with no extra seek time, so threads do not interfere with each other too much.
When working with a single, standard disk, serial I/O is usually better than parallel I/O.
I am assuming that the files are located on the same disk, in which case you are probably thrashing the disk (or invalidating the disk/OS cache) with multiple threads attempting to read concurrently and write concurrently. A better pattern may be to have a dedicated reader/writer thread handle the I/O, and then alter your pattern so that the job of transforming (which sounds expensive) is handled by multiple threads. Your I/O thread can fetch data and overlap writing with the transform operations as results become available. This should stop the disk thrashing and balance the I/O and CPU sides of your pattern.
Have you tried running a Java profiler? That will point out which parts of your code are running the slowest. From this discussion, it seems like the NetBeans profiler is a good one to check out.
Likely your thread is holding on to the buffered String[]s for too long. Even though your heap is much larger than you need, the throughput could be suffering due to garbage collection. Look at how long you're holding on to those references.
You might also be waiting while the VM allocates more memory: asking for -Xmx1024m doesn't allocate that much immediately; it grabs what it needs as more memory is required. You could also try -Xms1024m -Xmx1024m (i.e. allocate all of the memory at the start) to test whether that's the case.
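For example, to pre-allocate the whole heap up front and watch collection activity (the main class name here is a placeholder):

java -Xms1024m -Xmx1024m -verbose:gc com.example.LineProcessor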
You might have a stop-and-lock condition going on with your threads (one thread reads 100 lines into memory and holds onto the lock until it's done processing, instead of giving it up when it has finished reading from the file). I'm no expert on Java threading, but it's something to consider.
I would review this process. If you use BufferedReader and BufferedWriter there is no advantage to reading and processing 100 lines at a time. It's just added complication and another source of potential error. Do it one line at a time and simplify your life.
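A sketch of that simplification, where transform() stands in for whatever processing the program does and the inputFile/outputFile variables are placeholders; the buffers already batch the underlying disk I/O, so the hand-rolled 100-line batching adds nothing:

try (BufferedReader in = new BufferedReader(new FileReader(inputFile));
     BufferedWriter out = new BufferedWriter(new FileWriter(outputFile))) {
    String line;
    while ((line = in.readLine()) != null) {
        out.write(transform(line));  // one line at a time
        out.newLine();
    }
}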