I have a program in which each thread reads many lines at a time from a file, processes the lines, and writes the lines out to a different file. Four threads split the list of files to process among them. I'm having strange performance issues across two cases:
Case 1: four files with 50,000 lines each. Throughput starts at 700 lines/sec processed, then declines to ~100 lines/sec.
Case 2: 30,000 files with 12 lines each. Throughput starts around 800 lines/sec and remains steady.
This is internal software I'm working on so unfortunately I can't share any source code, but the main steps of the program are:
1. Split the list of files among the four worker threads.
2. Start all threads.
3. Each thread reads up to 100 lines at once and stores them in a String[] array.
4. The thread applies the transformation to all lines in the array.
5. The thread writes the lines to a file (not the same as the input file).
6. Steps 3-5 repeat for each thread until all its files are completely processed.
What I don't understand is why 30k files with 12 lines each give me greater throughput than a few files with many lines each. I would have expected the overhead of opening and closing all those files to be greater than that of reading a single file continuously. In addition, the decline in performance in the former case looks exponential.
I've set the maximum heap size to 1024 MB and it appears to use 100 MB at most, so an overtaxed GC isn't the problem. Do you have any other ideas?
From your numbers, I guess that GC is probably not the issue. I suspect that this is normal behavior for a disk being operated on by many concurrent threads. When the files are big, the disk has to switch context between the threads many times (incurring significant seek time), and the overhead is apparent. With small files, maybe each is read as a single chunk with no extra seek time, so the threads do not interfere with each other much.
When working with a single, standard disk, serial IO is usually better than parallel IO.
I am assuming that the files are located on the same disk, in which case you are probably thrashing the disk (or invalidating the disk/OS cache) with multiple threads attempting to read and write concurrently. A better pattern may be to have a dedicated reader/writer thread handle the IO, and then alter your design so that the transform step (which sounds expensive) is handled by multiple worker threads. The IO thread can fetch ahead and overlap writing with the transform operations as results become available, as sketched below. This should stop the disk thrashing and balance the IO and CPU sides of your workload.
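A minimal sketch of that shape: one reader thread, a fixed transform pool, and one writer thread connected by bounded queues. The batch size, queue capacities, single output file, and transform() body are placeholder assumptions, not a definitive implementation.

import java.io.*;
import java.nio.file.*;
import java.util.*;
import java.util.concurrent.*;

public class PipelineSketch {
    static final List<String> POISON = new ArrayList<>(); // end-of-stream marker, compared by reference

    static List<String> transform(List<String> lines) { return lines; } // placeholder transform

    public static void main(String[] args) throws Exception {
        BlockingQueue<List<String>> raw = new ArrayBlockingQueue<>(16);
        BlockingQueue<List<String>> done = new ArrayBlockingQueue<>(16);
        final int WORKERS = 4;

        // The only thread that reads from disk.
        Thread reader = new Thread(() -> {
            try {
                for (String name : args) {
                    try (BufferedReader in = Files.newBufferedReader(Paths.get(name))) {
                        List<String> batch = new ArrayList<>();
                        for (String line; (line = in.readLine()) != null; ) {
                            batch.add(line);
                            if (batch.size() == 100) { raw.put(batch); batch = new ArrayList<>(); }
                        }
                        if (!batch.isEmpty()) raw.put(batch);
                    }
                }
                for (int i = 0; i < WORKERS; i++) raw.put(POISON); // one marker per worker
            } catch (Exception e) { throw new RuntimeException(e); }
        });

        // CPU-bound transform workers.
        ExecutorService workers = Executors.newFixedThreadPool(WORKERS);
        for (int i = 0; i < WORKERS; i++) {
            workers.execute(() -> {
                try {
                    for (List<String> batch; (batch = raw.take()) != POISON; )
                        done.put(transform(batch));
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });
        }

        // The only thread that writes to disk (one output file, for brevity).
        Thread writer = new Thread(() -> {
            try (BufferedWriter out = Files.newBufferedWriter(Paths.get("out.txt"))) {
                for (List<String> batch; (batch = done.take()) != POISON; )
                    for (String line : batch) { out.write(line); out.newLine(); }
            } catch (Exception e) { throw new RuntimeException(e); }
        });

        reader.start(); writer.start();
        reader.join();
        workers.shutdown();
        workers.awaitTermination(1, TimeUnit.HOURS);
        done.put(POISON); // all transforms finished; tell the writer to stop
        writer.join();
    }
}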
Have you tried running a Java profiler? That will point out what parts of your code are running the slowest. From this discussion, it seems like the NetBeans profiler is a good one to check out.
Likely your thread is holding on to the buffered String[]s for too long. Even though your heap is much larger than you need, the throughput could be suffering due to garbage collection. Look at how long you're holding on to those references.
You might also be waiting while the VM allocates more memory: asking for -Xmx1024m doesn't allocate that much immediately; the VM grabs more as it is required. You could also try -Xms1024m -Xmx1024m (i.e. allocate all of the memory at the start) to test whether that's the case.
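For example (the main class name here is a placeholder):

java -Xms1024m -Xmx1024m com.example.LineProcessor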
You might have a stop-and-lock condition going on with your threads (one thread reads 100 lines into memory and holds onto a lock until it's done processing, instead of releasing it when it has finished reading from the file). I'm no expert on Java threading, but it's something to consider.
I would review this process. If you use BufferedReader and BufferedWriter, there is no advantage to reading and processing 100 lines at a time. It's just added complication and another source of potential error. Do it one line at a time and simplify your life.
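For instance, a minimal per-line loop; inputFile, outputFile, and transform() stand in for your own names:

try (BufferedReader in = new BufferedReader(new FileReader(inputFile));
     BufferedWriter out = new BufferedWriter(new FileWriter(outputFile))) {
    String line;
    while ((line = in.readLine()) != null) {
        out.write(transform(line)); // whatever per-line processing you do
        out.newLine();
    }
}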
Related
I want to define a thread pool with 10 threads and read the content of a file, but different threads must not read the same content (i.e. divide the content into 10 pieces and have each piece read by one thread).
Well what you would do would be roughly this:
get the length of the file,
divide by N.
create N threads
have each one skip to (file_size / N) * thread_no and read (file_size / N) bytes into a buffer
wait for all threads to complete.
stitch the buffers together.
(If you were slightly clever about it, you could avoid the last step, as the sketch below does ...)
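A rough sketch of those steps, assuming the file size divides evenly by N and fits in a single byte[] (so under 2 GB). Each thread fills a disjoint slice of one shared array, which is the "slightly clever" part that avoids the stitching step:

import java.io.*;

public static byte[] readWithThreads(String path, int n) throws Exception {
    long size = new File(path).length();
    long chunk = size / n;                 // assumes size % n == 0
    byte[] result = new byte[(int) size];
    Thread[] threads = new Thread[n];
    for (int i = 0; i < n; i++) {
        int t = i;
        threads[i] = new Thread(() -> {
            try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
                raf.seek(chunk * t);       // skip to this thread's segment
                raf.readFully(result, (int) (chunk * t), (int) chunk);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
        threads[i].start();
    }
    for (Thread th : threads) th.join();   // wait for all threads to complete
    return result;                         // slices are already in place
}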
HOWEVER, it is doubtful that you would get much speed-up by doing this. Indeed, I wouldn't be surprised if you got a slow down in many cases. With a typical OS, I would expect that you would get as good, if not better performance by reading the file using one big read(...) call from one thread.
The OS can fetch the data faster from the disc if you read it sequentially. Indeed, a lot of OSes optimize for this use-case, and use read-ahead and in-memory buffering (using OS-level buffers) to give high effective file read rates.
Reading a file with multiple threads means that each thread will typically be reading from a different position in the file. Naively, that would require the OS to seek the disk heads backwards and forwards between the different positions ... which will slow down I/O considerably. In practice, the OS will do various things to mitigate that, but even so, simultaneously reading data from different positions on a disk is still bad for I/O throughput.
This question relates to the latest version of Java.
30 producer threads push strings to an abstract queue. One writer thread pops from the same queue and writes the strings to a file that resides on a 5400 rpm HDD RAID array. The data is pushed at a rate of roughly 111 MBps and popped/written at a rate of roughly 80 MBps. The program lives for 5600 seconds, enough for about 176 GB of data to accumulate in the queue. On the other hand, I'm restricted to a total of 64 GB of main memory.
My question is: What type of queue should I use?
Here's what I've tried so far.
1) ArrayBlockingQueue. The problem with this bounded queue is that, regardless of the initial size of the array, I always end up with liveness issues as soon as it fills up. In fact, a few seconds after the program starts, top reports only a single active thread. Profiling reveals that, on average, the producer threads spend most of their time waiting for the queue to free up. This is regardless of whether or not I use the fair-access policy (with the second argument in the constructor set to true).
2) ConcurrentLinkedQueue. As far as liveness goes, this unbounded queue performs better. Until I run out of memory, about seven hundred seconds in, all thirty producer threads are active. After I cross the 64GB limit, however, things become incredibly slow. I conjecture that this is because of paging issues, though I haven't performed any experiments to prove this.
I foresee two ways out of my situation.
1) Buy an SSD. Hopefully the I/O rate increases will help.
2) Compress the output stream before writing to file.
Is there an alternative? Am I missing something in the way either of the above queues is constructed/used? Is there a cleverer way to use them? The Java Concurrency in Practice book proposes a number of saturation policies (Section 8.3.3) for the case where bounded queues fill up faster than they can be drained, but unfortunately none of them (abort, caller runs, and the two discard policies) apply in my scenario.
Look for the bottleneck. You produce more than you consume, so a bounded queue makes absolute sense, since you don't want to run out of memory.
Try to make your consumer faster. Profile it and look at where the most time is spent. Since you write to a disk, here are some thoughts:
Could you use NIO for your problem? (Maybe FileChannel#transferTo().)
Flush only when needed.
If you have enough CPU reserves, compress the stream (as you already mentioned; see the sketch after this list).
Optimize your disks for speed (RAID cache, etc.).
Use faster disks.
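For the compression idea specifically, a minimal sketch; whether it pays off depends on how compressible the data is and how much CPU headroom you have. The file name and buffer sizes are arbitrary:

import java.io.*;
import java.util.zip.GZIPOutputStream;

OutputStream out = new BufferedOutputStream(
        new GZIPOutputStream(new FileOutputStream("output.gz"), 64 * 1024),
        64 * 1024);
out.write(chunk); // 'chunk' is whatever byte[] the consumer popped from the queue
out.close();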
As @Flavio already said, for the producer-consumer pattern I see no problem there, and it should stay the way it is now. In the end the slowest party controls the speed.
I can't see the problem here. In a producer-consumer situation, the system will always go with the speed of the slower party. If the producer is faster than the consumer, it will be slowed down to the consumer speed when the queue fills up.
If your constraint is that you can not slow down the producer, you will have to find a way to speed up the consumer. Profile the consumer (don't start too fancy, a few System.nanoTime() calls often give enough information), check where it spends most of its time, and start optimizing from there. If you have a CPU bottleneck you can improve your algorithm, add more threads, etc. If you have a disk bottleneck try writing less (compression is a good idea), get a faster disk, write on two disks instead of one...
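For example, a crude split of consumer time between waiting on the queue and actually writing; all the names here are made up:

long waitNanos = 0, writeNanos = 0;
while (running) {
    long t0 = System.nanoTime();
    byte[] chunk = queue.take();   // time spent waiting for the producers
    long t1 = System.nanoTime();
    out.write(chunk);              // time spent on actual disk I/O
    long t2 = System.nanoTime();
    waitNanos += t1 - t0;
    writeNanos += t2 - t1;
}
System.out.printf("wait: %d ms, write: %d ms%n",
        waitNanos / 1_000_000, writeNanos / 1_000_000);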
According to the Java "Queue Implementations" tutorial, there are other classes that might be right for you:
LinkedBlockingQueue
PriorityBlockingQueue
DelayQueue
SynchronousQueue
LinkedTransferQueue (an implementation of the TransferQueue interface)
I don't know the performance or memory usage of these classes, but you can try them for yourself.
I hope that this helps you.
Why do you have 30 producers? Is that number fixed by the problem domain, or is it just a number you picked? If the latter, you should reduce the number of producers until they produce at a total rate that is only slightly larger than the consumption rate, and use a blocking queue (as others have suggested). Then you will keep your consumer, which is the performance-limiting part, busy while minimizing use of other resources (memory, threads).
You have only two ways out: make the producers slower or the consumer faster. Producers can be slowed down in many ways, particularly by using bounded queues. To make the consumer faster, try memory-mapped files (https://www.google.ru/search?q=java+memory-mapped+file). Look at https://github.com/peter-lawrey/Java-Chronicle.
Another way is to free the writing thread from the work of preparing write buffers from strings. Let the producer threads emit ready buffers, not strings. Use a limited number of buffers, say, 2 * threadnumber = 60. Allocate all buffers at the start and then reuse them. Use a queue for the empty buffers. A producing thread takes a buffer from that queue, fills it, and puts it into the writing queue. The writing thread takes buffers from the writing queue, writes them to disk, and puts them back into the empty-buffers queue.
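A sketch of that recycling scheme; the buffer size is arbitrary, and 'bytesToWrite' and 'channel' stand in for your encoded data and output FileChannel:

import java.nio.ByteBuffer;
import java.util.concurrent.*;

int producers = 30;
BlockingQueue<ByteBuffer> empty = new ArrayBlockingQueue<>(2 * producers);
BlockingQueue<ByteBuffer> filled = new ArrayBlockingQueue<>(2 * producers);
for (int i = 0; i < 2 * producers; i++)
    empty.put(ByteBuffer.allocate(1 << 20)); // allocate all 60 buffers once, up front

// In each producer thread: take an empty buffer, fill it, hand it over.
ByteBuffer buf = empty.take();
buf.clear();
buf.put(bytesToWrite);
buf.flip();
filled.put(buf);

// In the writer thread: write the buffer out, then recycle it.
ByteBuffer b = filled.take();
channel.write(b);
empty.put(b);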
Yet another approach is to use asynchronous I/O. Producers initiate the writing operation themselves, without a special writing thread. A completion handler returns the used buffer to the empty-buffers queue.
I am extracting lines matching a pattern from log files. Hence I allotted each log file to a Runnable object which writes the matched lines to a result file (via well-synchronized writer methods).
Important snippet under discussion:
ExecutorService executor = Executors.newFixedThreadPool(NUM_THREAD);
for (File eachLogFile : hundredsOfLogFilesArrayObject) {
    executor.execute(new RunnableSlavePatternMatcher(eachLogFile));
}
Important criteria:
The number of log files could be very few, like 20, or could cross 1000 for some users. I recorded a series of tests in a spreadsheet, and I am really concerned about the results marked in red. 1. I assume that if the number of threads created equals the number of files to be processed, the processing time would be less than when the number of threads is smaller than the number of files, which didn't happen. (Please advise me if my understanding is wrong.)
Result: (timing results were recorded in a spreadsheet, not reproduced here)
2. I would like to identify a value for NUM_THREAD that is efficient for a small number of files as well as for thousands of files.
Please suggest answers for questions 1 and 2.
Thanks!
Chandru
You just found that your program is not CPU-bound but (likely) IO-bound.
This means that beyond 10 threads the OS can't keep up with the requested reads of all the threads that want their data, so additional threads simply wait longer for their next block of data.
Also, because writing the output is synchronized across all threads, that may even be the biggest bottleneck in your program (a producer-consumer solution may be the answer here, to minimize the time threads spend waiting to output; see the sketch below).
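A minimal shape for that; the queue capacity and names are arbitrary, and the sentinel is compared by reference:

BlockingQueue<String> matches = new LinkedBlockingQueue<>(10_000);
final String POISON = new String("EOF"); // unique end-of-work marker

// In each matcher thread, the synchronized write becomes a cheap queue insert:
matches.put(matchedLine);

// A single writer thread drains the queue, so the disk sees one sequential writer:
Thread writer = new Thread(() -> {
    try (BufferedWriter out = new BufferedWriter(new FileWriter("result.txt"))) {
        for (String line; (line = matches.take()) != POISON; ) {
            out.write(line);
            out.newLine();
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
});
writer.start();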
The optimal number of threads depends on how fast you can read the files (the faster you can read, the more threads are useful).
It appears that 2 threads are enough to use all your processing power. Most likely you have two cores with hyper-threading.
Mine is an Intel i5 2.4 GHz 4-CPU with 8 GB RAM. Is this detail helpful?
Depending on the model, this has 2 cores and hyper-threading.
I assume that if the number of threads created is equal to the number of files to be processed then the processing time would be less,
This will maximise the overhead, but won't give you more cores than you already have.
When parallelizing, using many more threads than you have available CPU cores will usually increase the overall time. The system spends overhead time switching from thread to thread on one CPU core instead of letting it execute the tasks one after another.
If you have 8 CPU cores on your computer, you might observe some improvement using 8/9/10 threads instead of only 1, while using 20+ threads will actually be less efficient.
One problem is that I/O doesn't parallelize well, especially if you have a non-SSD, since sequential reads (what happens when one thread reads a file) are much faster than random reads (when the read head has to jump around between different files read by several threads). I would guess you could speed up the program by reading the files from the thread sending the jobs to the executor:
for (File file : hundredsOfLogFilesArrayObject) {
    byte[] fileContents = readContentsOfFile(file); // sequential read on the submitting thread
    executor.execute(new RunnableSlavePatternMatcher(fileContents));
}
As for the optimal thread count, that depends.
If your app is I/O bound (which is quite possible if you're not doing extremely heavy processing of the contents), a single worker thread which can process the file contents while the original thread reads the next file will probably suffice.
If you're CPU bound, you probably don't want many more threads than you've got cores:
ExecutorService executor = Executors.newFixedThreadPool(
        Runtime.getRuntime().availableProcessors());
Although, if your threads get suspended a lot (waiting for synchronization locks, or something), you may get better results with more threads. Or if you've got other CPU-munching activities going on, you may want fewer threads.
You can try using cached thread pool.
public static ExecutorService newCachedThreadPool()
Creates a thread pool that creates new threads as needed, but will reuse previously constructed threads when they are available. These pools will typically improve the performance of programs that execute many short-lived asynchronous tasks. Calls to execute will reuse previously constructed threads if available.
You can read more here
My Java program uses java.util.concurrent.Executor to run multiple threads. Each one starts a Runnable class that reads from a comma-delimited text file on the C: drive and loops through the lines to split and parse the text into floats. After that, the data is stored into:
static Vector
static ConcurrentSkipListMap
My PC is Win 7 64-bit with an Intel Core i7 (six cores * 2 hardware threads) and 24 GB of RAM. I have noticed that the program runs for 2 minutes and finishes all 1700 files, but the CPU usage is only around 10% to 15%, no matter how many threads I assign using:
Executor executor=Executors.newFixedThreadPool(50);
Executors.newFixedThreadPool(500) doesn't give better CPU usage or a shorter time to finish the tasks. There is no network traffic; everything is on the local C: drive. There is enough RAM for more threads to use, but the program throws an OutOfMemoryError when I increase the thread count to 1000.
How come more threads don't translate into more CPU usage and less processing time? Why?
Edit: My hard drive is a 200 GB SSD.
Edit: Finally found where the problem was. Each thread writes its results to a log file shared by all threads; the more times I ran the app, the larger the log file grew and the slower it got. Since the file is shared, this definitely slowed down the process. After I stopped writing to the log file, it finished all tasks in 10 seconds!
The OutOfMemoryError is probably coming from Java's own limits on its memory usage. Try using some of the arguments here to increase the maximum memory.
For speed, Adam Bliss starts with a good suggestion. If this is the same file over and over, then I imagine having multiple threads try to read it at the same time could result in a lot of contention over locks on the file. More threads would even mean more contention, which could even result in worse overall performance. So avoid that and simply load the file once if it's possible. Even if it's a large file, you have 24 GB of RAM. You can hold quite a large file, but you may need to increase the JVM's allowed memory to allow the whole file to be loaded.
If there are multiple files being used, then consider this fact: your disk can only read one file at a time. So having multiple threads trying to use the disk all at the same time probably won't be too effective if the threads aren't spending much time processing. Since you have so little CPU usage, it could be that the thread loads part of the file, then runs very quickly on the part that got buffered, and then spends a lot of time waiting for the rest of the file to load. If you're loading the file over and over, that could even still apply.
In short: Disk IO probably is your culprit. You need to work to reduce it so that the threads aren't contending for file content so much.
Edit:
After further consideration, it's more likely a synchronization issue. Threads are probably getting held up trying to add to the result list. If access is frequent, this will result in a huge amount of contention for the lock on the object. Consider having each thread save its results in a local list (like an ArrayList, which is not thread-safe), and then copy all the values into the final, shared list in chunks to reduce contention.
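Something along these lines, where all the names are illustrative:

// Inside each worker thread: collect into an unshared list first.
List<String> local = new ArrayList<>();
for (String line : myChunk) {
    local.add(process(line));
}
// One lock acquisition for the whole batch instead of one per element.
synchronized (sharedResults) {
    sharedResults.addAll(local);
}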
You're probably being limited by IO, not cpu.
Can you reduce the number of times you open the file to read it? Maybe open it once, read all the lines, keep them in memory, and then iterate on that.
Otherwise, you'll have to look at getting a faster hard drive. SSDs can be quite speedy.
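If the file fits comfortably in the heap, something like this reads it exactly once (Java 7+; the path is a placeholder):

import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.List;

List<String> lines = Files.readAllLines(Paths.get("C:/data/input.csv"),
        StandardCharsets.UTF_8);
for (String line : lines) {
    // split/parse here; no disk access inside the loop
}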
Is it possible that your threads are somehow given low priority on the system? Increasing the number of threads in that case wouldn't correspond to an increase in CPU usage, since the amount of CPU time allotted to your program may be throttled somewhere else.
Are there any configuration files/ initialization steps where something like this could possibly occur?
My batch process needs to read lines from huge files (1-3 GB), each line of which can be processed independently of the others. The files can have 10-50M rows. I was thinking of spawning about a dozen threads, each of which would process a predetermined range of the file concurrently, e.g. T1 reads range 0-1, T2 reads range 1-2, etc. That means, of course, that T2 needs to jump straight to the start of its range without first reading through everything before it.
Is this type of segmentation of buffered file reading for the purposes of concurrency possible with Java NIO?
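Concretely, the kind of segmented access I have in mind would look something like this sketch (the file name and thread count are placeholders, and segment boundaries will usually fall mid-line, which would still need handling):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

try (FileChannel ch = new RandomAccessFile("huge.log", "r").getChannel()) {
    long size = ch.size();
    int n = 12;                                 // roughly a dozen threads
    long chunk = size / n;
    for (int i = 0; i < n; i++) {
        long start = i * chunk;
        long len = (i == n - 1) ? size - start : chunk;
        // Each thread gets its own view; map() itself reads no bytes eagerly.
        MappedByteBuffer segment = ch.map(FileChannel.MapMode.READ_ONLY, start, len);
        // hand 'segment' off to worker thread i here
    }
}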
There is no point to this. The CPU may allow multiple threads but the disk is still single-threaded. All this will do is cause disk thrashing. Forget it.