What causes this performance drop?

What causes this performance drop? - java

I'm using the Disruptor framework for performing fast Reed-Solomon error correction on some data. This is my setup:
RS Decoder 1
/ \
Producer- ... - Consumer
\ /
RS Decoder 8
The producer reads blocks of 2064 bytes from disk into a byte buffer.
The 8 RS decoder consumers perform Reed-Solomon error correction in parallel.
The consumer writes files to disk.
In the disruptor DSL terms, the setup looks like this:
RsFrameEventHandler[] rsWorkers = new RsFrameEventHandler[numRsWorkers];
for (int i = 0; i < numRsWorkers; i++) {
rsWorkers[i] = new RsFrameEventHandler(numRsWorkers, i);
}
disruptor.handleEventsWith(rsWorkers)
.then(writerHandler);
When I don't have a disk output consumer (no .then(writerHandler) part), the measured throughput is 80 M/s, as soon as I add a consumer, even if it writes to /dev/null, or doesn't even write, but it is declared as a dependent consumer, performance drops to 50-65 M/s.
I've profiled it with Oracle Mission Control, and this is what the CPU usage graph shows:
Without an additional consumer:
With an additional consumer:
What is this gray part in the graph and where is it coming from? I suppose it has to do with thread synchronisation, but I can't find any other statistic in Mission Control that would indicate any such latency or contention.

Your hypothesis is correct, it is a thread synchronization issue.
From the API Documentation for EventHandlerGroup<T>.then (Emphasis mine)
Set up batch handlers to consume events from the ring buffer. These handlers will only process events after every EventProcessor in this group has processed the event.
This method is generally used as part of a chain. For example if the handler A must process events before handler B:
This should necessarily decrease throughput. Think about it like a funnel:
The consumer has to wait for every EventProcessor to be finished, before it can proceed through the bottleneck.

I can see two possibilities here, based on what you've shown. You might be affected by one or both, I'd recommend testing both.
1) IO processing bottleneck.
2) Contention on multiple threads writing to buffer.
IO processing
From the data shown, you have stated that as soon as you enable the IO component, your throughput decreases and kernel time increases. This could quite easily be the IO wait time while your consumer thread is writing. Context switch to perform a write() call is significantly more expensive than doing nothing. Your Decoders are now capped at the maximum speed of the consumer. To test this hypothesis, you could remove the write() call. In other words, open the output file, prepare the string for output, and just not issue the write call.
Suggestions
Try removing the write() call in the Consumer, see if it reduces kernel time.
Are you writing to a single flat file sequentially - if not, try this
Are you using smart batching (ie: buffering until endOfBatch flag and then writing in a single batch) to ensure that the IO is bundled up as efficiently as possible?
Contention on multiple writers
Based on your description I suspect your Decoders are reading from the disruptor and then writing back to the very same buffer. This is going to cause issues with multiple writers aka contention on the CPUs writing to memory. One thing I would suggest is to have two disruptor rings:
Producer writes to #1
Decoder reads from #1, performs RS decode and writes the result to #2
Consumer reads from #2, and writes to disk
Assuming your RBs are sufficiently large, this should result in good clean walking through memory.
The key here is not having the Decoder threads (which may be running on a different core) write to the same memory that was just owned by the Producer. With only 2 cores doing this, you will probably see improved throughput unless the disk speed is the bottleneck.
I have a blog article here which describes in more detail how to achieve this including sample code. http://fasterjava.blogspot.com.au/2013/04/disruptor-example-udp-echo-service-with.html
Other thoughts
It would also be helpful to know what WaitStrategy you are using, how many physical CPUs are in the machine, etc.
You should be able to significantly reduce CPU utilisation by moving to a different WaitStrategy given that your biggest latency will be IO writes.
Assuming you are using reasonably new hardware, you should be able to saturate the IO devices with only this setup.
You will also need to make sure the files are on different physical devices to achieve reasonable performance.

Related

HBase BufferedMutator vs PutList performance

I recently came across BufferedMutator class of HBase which can be used for batch inserts and deletes.
I was previously using a List to put data as hTable.put(putList) to do the same.
Benchmarking my code didn't seem to show much difference too, where I was instead doing mutator.mutate(putList);.
Is there a significant performance improvement of using BufferedMutator over PutList?

Short Answer
BufferedMutator generally provides better throughput than just using Table#put(List<Put>) but needs proper tuning of hbase.client.write.buffer, hbase.client.max.total.tasks, hbase.client.max.perserver.tasks and hbase.client.max.perregion.tasks for good performance.
Explanation
When you pass a list of puts to the HBase client, it groups the puts by destination regions and batches these groups by destination region server. A single rpc request is sent for each batch. This cuts down the rpc overhead, especially in cases when the Puts are very small thus making rpc overhead per request significant.
The Table client sends all the Puts to the region servers immediately and waits for response. This means that any batching that can happen is limited to the number of Puts in the single API call and the api calls are synchronous from the caller's perspective.
However, the BufferedMutator keeps buffering the Puts in a buffer and decides to flush the buffered puts based on current buffered size in background threads wrapped around by a class called AsyncProcess. From the caller's perspective, each API call is still synchronous, but the whole buffering strategy gives much better batching. The background flush model also allows a continuous flow of requests, which combined with better batching means ability to support more client threads. However, due to this buffering strategy, the larger the buffer, the worse the per operation latency as seen by the caller, but higher throughput can be sustained by having a much larger number of client threads.
Some of the configs that control BufferedMutator throughput are:
hbase.client.write.buffer: Size (bytes) of the buffer (Higher gives better peak throughput, consumes more memory)
hbase.client.max.total.tasks: Number of pending requests across the cluster before AsyncProcess starts blocking requests (Higher is better, but can starve CPU on client, or cause overload on servers)
hbase.client.max.perserver.tasks: Number of pending requests for one region server before AsyncProcess starts blocking requests.
hbase.client.max.perregion.tasks: Number of pending requests per region.
Also, for the sake of completeness, it should go without saying that if the bottleneck is on the server side instead of client side, you won't see much performance gains by using BufferedMutator over Table on the client.

How to write java thread pool programme to read content of file?

I want to define thread pool with 10 threads and read the content of the file. But different threads must not read same content.(like divide content into 10 pieces and read each pieces by one thread)

Well what you would do would be roughly this:
get the length of the file,
divide by N.
create N threads
have each one skip to (file_size / N) * thread_no and read (file_size / N) bytes into a buffer
wait for all threads to complete.
stitch the buffers together.
(If you were slightly clever about it, you could avoid the last step ...)
HOWEVER, it is doubtful that you would get much speed-up by doing this. Indeed, I wouldn't be surprised if you got a slow down in many cases. With a typical OS, I would expect that you would get as good, if not better performance by reading the file using one big read(...) call from one thread.
The OS can fetch the data faster from the disc if you read it sequentially. Indeed, a lot of OSes optimize for this use-case, and use read-ahead and in-memory buffering (using OS-level buffers) to give high effective file read rates.
Reading a file with multiple threads means that each thread will typically be reading from a different position in the file. Naively, that would entail the OS to seeking the disk heads backwards and forwards between the different positions ... which will slow down I/O considerably. In practice, the OS will do various things to mitigate that, but even so, simultaneously reading data from different positions on a disk is still bad for I/O throughput.

How to determine optimal number of threads for high latency network requests?

I am writing a utility that must make thousands of network requests. Each request receives only a single, small packet in response (similar to ping), but may take upwards of several seconds to complete. Processing each response completes in one (simple) line of code.
The net effect of this is that the computer is not IO-bound, file-system-bound, or CPU-bound, it is only bound by the latency of the responses.
This is similar to, but not the same as There is a way to determine the ideal number of threads? and Java best way to determine the optimal number of threads [duplicate]... the primary difference is that I am only bound by latency.
I am using an ExecutorService object to run the threads and a Queue<Future<Integer>> to track threads that need to have results retrieved:
ExecutorService executorService = Executors.newFixedThreadPool(threadPoolSize);
Queue<Future<Integer>> futures = new LinkedList<Future<Integer>>();
for (int quad3 = 0 ; quad3 < 256 ; ++quad3) {
for (int quad4 = 0 ; quad4 < 256 ; ++quad4) {
byte[] quads = { quad1, quad2, (byte)quad3, (byte)quad4 };
futures.add(executorService.submit(new RetrieverCallable(quads)));
}
}
... I then dequeue all the elements in the queue and put the results in the required data structure:
int[] result = int[65536]
while(!futures.isEmpty()) {
try {
results[i] = futures.remove().get();
} catch (Exception e) {
addresses[i] = -1;
}
}
My first question is: Is this a reasonable way to track all the threads? If thread X takes a while to complete, many other threads might finish before X does. Will the thread pool exhaust itself waiting for open slots, or will the ExecutorService object manage the pool in such a way that threads that have completed but not yet been processed be moved out of available slots so that other threads my begin?
My second question is what guidelines can I use for finding the optimal number of threads to make these calls? I don't even know order-of-magnitude guidance here. I know it works pretty well with 256 threads, but seems to take roughly the same overall time with 1024 threads. CPU utilization is hovering around 5%, so that doesn't appear to be an issue. With that large a number of threads, what are all the metrics I should be looking at to compare different numbers? Obviously overall time to process the batch, average time per thread... what else? Is memory an issue here?

It will shock you, but you do not need any threads for I/O (quantitatively, this means 0 threads). It is good that you have studied that multithreading does not multiply your network bandwidth. Now, it is time to know that threads do computation. They are not doing the (high-latency) communication. The communication is performed by a network adapter, which is another process, running really in parallel with with CPU. It is stupid to allocate a thread (see which resources allocated are listed by this gentlemen who claims that you need 1 thread) just to sleep until network adapter finishes its job. You need no threads for I/O = you need 0 threads.
It makes sense to allocate the threads for computation to make in parallel with I/O request(s). The amount of threads will depend on the computation-to-communication ratio and limited by the number of cores in your CPU.
Sorry, I had to say that despite you have certainly implied the commitment to blocking I/O, so many people do not understand this basic thing. Take the advise, use asynchronous I/O and you'll see that the issue does not exist.

As mentioned in one of the linked answers you refer to, Brian Goetz has covered this well in his article.
He seems to imply that in your situation you would be advised to gather metrics before committing to a thread count.
Tuning the pool size
Tuning the size of a thread pool is largely a matter of avoiding two mistakes: having too few threads or too many threads. ...
The optimum size of a thread pool depends on the number of processors available and the nature of the tasks on the work queue. ...
For tasks that may wait for I/O to complete -- for example, a task that reads an HTTP request from a socket -- you will want to increase the pool size beyond the number of available processors, because not all threads will be working at all times. Using profiling, you can estimate the ratio of waiting time (WT) to service time (ST) for a typical request. If we call this ratio WT/ST, for an N-processor system, you'll want to have approximately N*(1+WT/ST) threads to keep the processors fully utilized.
My emphasis.

Have you considered using Actors?
Best practises.
Actors should be like nice co-workers: do their job efficiently
without bothering everyone else needlessly and avoid hogging
resources. Translated to programming this means to process events and
generate responses (or more requests) in an event-driven manner.
Actors should not block (i.e. passively wait while occupying a Thread)
on some external entity—which might be a lock, a network socket,
etc.—unless it is unavoidable; in the latter case see below.
Sorry, I can't elaborate, because haven't much used this.
UPDATE
Answer in Good use case for Akka might be helpful.
Scala: Why are Actors lightweight?

Pretty sure in the described circumstances, the optimal number of threads is 1. In fact, that is surprisingly often the answer to any quesion of the form 'how many threads should I use'?
Each additonal thread adds extra overhead in terms of stack (and associated GC roots), context switching and locking. This may or not be measurable: the effor to meaningfully measure it in all target envoronments is non-trivial. In return, there is little scope to provide any benifit, as processing is neither cpu nor io-bound.
So less is always better, if only for reasons of risk reduction. And you cant have less than 1.

I assume the desired optimization is the time to process all requests. You said the number of requests is "thousands". Evidently, the fastest way is to issue all requests at once, but this may overflow the network layer. You should determine how many simultaneous connections can network layer bear, and make this number a parameter for your program.
Then, spending a thread for each request require a lot of memory. You can avoid this using non-blocking sockets. In Java, there are 2 options: NIO1 with selectors, and NIO2 with asynchronous channels. NIO1 is complex, so better find a ready-made library and reuse it. NIO2 is simple but available only since JDK1.7.
Processing the responses should be done on a thread pool. I don't think the number of threads in the thread pool greatly affects the overall performance in your case. Just make tuning for thread pool size from 1 to the number of available processors.

In our high-performance systems, we use the actor model as described by #Andrey Chaschev.
The no. of optimal threads in your actor model differ with your CPU structure and how many processes (JVMs) do you run per box. Our finding is
If you have 1 process only, use total CPU cores - 2.
If you have multiple process, check your CPU structure. We found its good to have no. of threads = no. of cores in a single CPU - e.g. if you have a 4 CPU server each server having 4 cores, then using 4 threads per JVM gives you best performance. After that, always leave at least 1 core to your OS.

An partial answer, but I hope it helps. Yes, memory can be an issue: Java reserves 1 MB of thread stack by default (at least on Linux amd64). So with a few GB of RAM in your box, that limits your thread count to a few thousand.
You can tune this with a flag like -XX:ThreadStackSize=64. That would give you 64 kB, which is plenty in most situations.
You could also move away from threading entirely and use epoll to respond to incoming responses. This is far more scalable but I have no practical experience with doing this in Java.

Java and queues: saturation issues with multithreaded I/O

This question relates to the latest version of Java.
30 producer threads push strings to an abstract queue. One writer thread pops from the same queue and writes the string to a file that resides on a 5400 rpm HDD RAID array. The data is pushed at a rate of roughly 111 MBps, and popped/written at a rate of roughly 80MBps. The program lives for 5600 seconds, enough for about 176 GB of data to accumulate in the queue. On the other hand, I'm restricted to a total of 64GB of main memory.
My question is: What type of queue should I use?
Here's what I've tried so far.
1) ArrayBlockingQueue. The problem with this bounded queue is that, regardless of the initial size of the array, I always end up with liveness issues as soon as it fills up. In fact, a few seconds after the program starts, top reports only a single active thread. Profiling reveals that, on average, the producer threads spend most of their time waiting for the queue to free up. This is regardless of whether or not I use the fair-access policy (with the second argument in the constructor set to true).
2) ConcurrentLinkedQueue. As far as liveness goes, this unbounded queue performs better. Until I run out of memory, about seven hundred seconds in, all thirty producer threads are active. After I cross the 64GB limit, however, things become incredibly slow. I conjecture that this is because of paging issues, though I haven't performed any experiments to prove this.
I foresee two ways out of my situation.
1) Buy an SSD. Hopefully the I/O rate increases will help.
2) Compress the output stream before writing to file.
Is there an alternative? Am I missing something in the way either of the above queues are constructed/used? Is there a cleverer way to use them? The Java Concurrency in Practice book proposes a number of saturation policies (Section 8.3.3) in the case that bounded queues fill up faster than they can be exhausted, but unfortunately none of them---abort, caller runs, and the two discard policies---apply in my scenario.

Look for the bottleneck. You produce more then you consume, a bounded queue makes absolutely sense, since you don't want to run out of memory.
Try to make your consumer faster. Profile and look where the most time is spent. Since you write to a disk here some thoughts:
Could you use NIO for your problem? (maybe FileChannel#transferTo())
Flush only when needed.
If you have enough CPU reserves, compress the stream? (as you already mentioned)
optimize your disks for speed (raid cache, etc.)
faster disks
As #Flavio already said, for the producer-consumer pattern, i see no problem there and it should be the way it is now. In the end the slowest party controls the speed.

I can't see the problem here. In a producer-consumer situation, the system will always go with the speed of the slower party. If the producer is faster than the consumer, it will be slowed down to the consumer speed when the queue fills up.
If your constraint is that you can not slow down the producer, you will have to find a way to speed up the consumer. Profile the consumer (don't start too fancy, a few System.nanoTime() calls often give enough information), check where it spends most of its time, and start optimizing from there. If you have a CPU bottleneck you can improve your algorithm, add more threads, etc. If you have a disk bottleneck try writing less (compression is a good idea), get a faster disk, write on two disks instead of one...

According to java "Queue implementation" there are other classes that should be right for you:
LinkedBlockingQueue
PriorityBlockingQueue
DelayQueue
SynchronousQueue
LinkedTransferQueue
TransferQueue
I don't know the performance of these classes or the memory usage but you can try by your self.
I hope that this helps you.

Why do you have 30 producers. Is that number fixed by the problem domain, or is it just a number you picked? If the latter, you should reduce the number of producers until they produce at total rate that is larger than the consumption by only a small amount, and use a blocking queue (as others have suggested). Then you will keep your consumer busy, which is the performance limiting part, while minimizing use of other resources (memory, threads).

you have only 2 ways out: make suppliers slower or consumer faster. Slower producers can be done in many ways, particullary, using bounded queues. To make consumer faster, try https://www.google.ru/search?q=java+memory-mapped+file . Look at https://github.com/peter-lawrey/Java-Chronicle.
Another way is to free writing thread from work of preparing write buffers from strings. Let the producer threads emit ready buffers, not strings. Use limited number of buffers, say, 2*threadnumber=60. Allocate all buffers at the start and then reuse them. Use a queue for empty buffers. Producing thread takes a buffer from that queue, fills it and puts into writing queue. Writing thread takes buffers from writing thread, writes to disk and puts into the empty buffers queue.
Yet another approach is to use asynchronous I/O. Producers initiate writing operation themselves, without special writing thread. Completion handler returns used buffer into tthe empty buffers queue.

how to log asynchronously in a heavily multithreaded environment?

I am trying to to log asynchronously in a heavily multi-threaded environment in java on linux platform.what would be a suitable data structure(lock-free) to bring in low thread contention?
I need to log GBs of messages. I need to do it in async/lock-free manner so I don't kill performance on the main logic(the code that invokes the logger apis).

Logback has an AsyncAppender that might meet your needs.

The simplest way to do it is to write into multiple files - one for each thread.
Make sure you put timestamps at the start of each record, so it is easier to merge them into a single log file.
example unix command:
cat *.log | sort | less
But for a better / more useful answer you do need to clarify your question by adding a lot more detail.

I would use Java Chronicle, mostly because I wrote it but I suggest it here because you can write lock free and garbage free logging with a minimum of OS calls. This requires one log per thread, but you will have kept these to a minimum already I assume.
I have used this library to write 1 GB/second from a two threads. You may find having more threads will not help as much as you think if logging is a bottle neck for you.
BTW: You have given me an idea of how the log can be updated from multiple threads/processes, but it will take a while to implement and test.

To reduce contention, you can first put log messages in a buffer, private to each thread. When the buffer is full, put it in a queue handled by a separate log thread, which then merges messages from different threads and writes them to a file. Note, you need that separate thread in any case, in order not to slowdown the working threads when the next buffer is to be written on disk.

It is impossible to avoid queue contention as your logging thread will most likely log faster than your writer (disk i/o) thread can keep up, but with some smart wait strategies and thread pinning you can minimize latency and maximize throughput.
Take a look on CoralLog developed by Coral Blocks (with which I am affiliated) which uses a lock-free queue and can log a 64-byte message in 52 nanoseconds on average. It is capable of writing more than 5 million messages per second.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.