I recently came across the BufferedMutator class of HBase, which can be used for batch inserts and deletes.
I was previously building a List of Puts and calling hTable.put(putList) to do the same.
Benchmarking my code didn't seem to show much difference either, when I instead did mutator.mutate(putList);.
Is there a significant performance improvement of using BufferedMutator over PutList?
Short Answer
BufferedMutator generally provides better throughput than just using Table#put(List<Put>) but needs proper tuning of hbase.client.write.buffer, hbase.client.max.total.tasks, hbase.client.max.perserver.tasks and hbase.client.max.perregion.tasks for good performance.
Explanation
When you pass a list of puts to the HBase client, it groups the puts by destination region and batches these groups by destination region server. A single RPC request is sent for each batch. This cuts down the RPC overhead, especially when the Puts are very small and the per-request RPC overhead is therefore significant.
The Table client sends all the Puts to the region servers immediately and waits for the response. This means that any batching that can happen is limited to the number of Puts in the single API call, and the API calls are synchronous from the caller's perspective.
The BufferedMutator, however, keeps accumulating Puts in a buffer and decides when to flush the buffered Puts based on the current buffered size, using background threads wrapped by a class called AsyncProcess. From the caller's perspective, each API call is still synchronous, but the buffering strategy gives much better batching. The background flush model also allows a continuous flow of requests, which, combined with better batching, means the client can support more threads. Due to this buffering strategy, however, the larger the buffer, the worse the per-operation latency as seen by the caller, but higher throughput can be sustained by using a much larger number of client threads.
Some of the configs that control BufferedMutator throughput are:
hbase.client.write.buffer: Size (bytes) of the buffer (Higher gives better peak throughput, consumes more memory)
hbase.client.max.total.tasks: Number of pending requests across the cluster before AsyncProcess starts blocking requests (Higher is better, but can starve CPU on client, or cause overload on servers)
hbase.client.max.perserver.tasks: Number of pending requests for one region server before AsyncProcess starts blocking requests.
hbase.client.max.perregion.tasks: Number of pending requests per region.
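For reference, a minimal sketch of the usage pattern (the table name, list of Puts, and the 8 MB buffer size are placeholders, not values from the question):

import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;

public class BufferedWrite {
    static void write(List<Put> putList) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.setLong("hbase.client.write.buffer", 8L * 1024 * 1024);   // placeholder buffer size

        try (Connection connection = ConnectionFactory.createConnection(conf);
             BufferedMutator mutator =
                     connection.getBufferedMutator(TableName.valueOf("my_table"))) {
            for (Put put : putList) {
                mutator.mutate(put);   // buffered; flushed in the background as the buffer fills
            }
            mutator.flush();           // push out anything still sitting in the buffer
        }                              // close() also flushes
    }
}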
Also, for the sake of completeness, it should go without saying that if the bottleneck is on the server side instead of client side, you won't see much performance gains by using BufferedMutator over Table on the client.
Related
I have a third party API, which I call using an HTTP GET request. Each request takes a few seconds to get a response.
Currently I am using a CompletableFuture which I am executing on a FixedThreadPool of size 64. This causes the threads to be blocked until they receive a response for the GET request, i.e. the threads sit idle after sending the GET request until they receive a response. So the maximum number of simultaneous requests I can send out is limited by my thread pool size, i.e. 64 here.
What can I use instead of CompletableFuture so that my threads don't sit idle waiting for the response?
As @user207421 says:
A truly asynchronous (i.e. event driven) HTTP client application is complicated.
A multi-threaded (but fundamentally synchronous) HTTP client application is simpler, and scales to as many threads as you have memory for.
Assuming that you have 64 worker threads processing requests, the actual bottleneck is likely to be EITHER your physical network bandwidth, OR your available client-side CPU. If you have hit those limits, then:
increasing the number of worker threads is not going to help, and
switching to an asynchronous (event driven) model is not going to help.
A third possibility is that the bottleneck is server-side resource limits or rate limiting. In this scenario, increasing the client-side thread count might help, have no effect, or make the problem worse. It will depend on how the server is implemented, the nature of the requests, etc.
If your bottleneck really is the number of threads, then a simple thing to try is reducing the worker thread stack size so that you can run more of them. The default stack size is typically 1MB, and that is likely to be significantly more than it needs to be. (This will also reduce the ... erm ... memory overhead of idle threads, if that is a genuine issue.)
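For example, a minimal sketch of requesting a smaller stack per worker thread (the 256 KB figure and thread name are arbitrary; the JVM treats the stack size as a hint and may round it up or ignore it, and -Xss changes the default instead):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;

// Request a 256 KB stack for each worker instead of the (typically 1 MB) default.
ThreadFactory smallStackFactory = runnable ->
        new Thread(null, runnable, "http-worker", 256 * 1024);

ExecutorService pool = Executors.newFixedThreadPool(64, smallStackFactory);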
There are a few Java asynchronous HTTP client libraries around. But I have never used one and cannot recommend one. And like @user207421, I am not convinced that the effort of changing will actually pay off.
What can I [do] so that my threads don't sit idle waiting for the response?
Idle threads are actually not the problem. An idle thread only uses memory (and has some possible secondary effects which probably don't matter here). Unless you are short of memory, it will make little difference.
Note: if there is something else for your client to do while a thread is waiting for a server response, the OS thread scheduler will switch to a different thread.
So my maximum number of simultaneous requests I can send out is limited by my thread [pool] size i.e. 64 here.
That is true. However, sending more simultaneous requests probably won't help. If the client-side threads are sitting idle, that probably means that the bottleneck is either the network, or something on the server side. If this is the case, adding more threads won't increase throughput. Instead individual requests will take (on average) longer, and throughput will stay the same ... or possibly drop if the server starts dropping requests from its request queue.
Finally, if you are worried about the overhead of a large pool of worker threads sitting idle (waiting for the next task to do), use an executor service or connection pool that can shrink and grow its thread pool to meet changing workloads.
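A minimal sketch of such a shrinking pool (the 64-thread maximum and 30-second idle timeout are arbitrary):

import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// A pool that runs up to 64 workers but lets idle workers die after 30 seconds.
ThreadPoolExecutor pool = new ThreadPoolExecutor(
        64, 64, 30, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
pool.allowCoreThreadTimeOut(true);   // core threads are also reclaimed when idle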
I'm using the Disruptor framework for performing fast Reed-Solomon error correction on some data. This is my setup:
           RS Decoder 1
          /            \
Producer -     ...      - Consumer
          \            /
           RS Decoder 8
The producer reads blocks of 2064 bytes from disk into a byte buffer.
The 8 RS decoder consumers perform Reed-Solomon error correction in parallel.
The consumer writes files to disk.
In the disruptor DSL terms, the setup looks like this:
RsFrameEventHandler[] rsWorkers = new RsFrameEventHandler[numRsWorkers];
for (int i = 0; i < numRsWorkers; i++) {
    rsWorkers[i] = new RsFrameEventHandler(numRsWorkers, i);
}
disruptor.handleEventsWith(rsWorkers)
         .then(writerHandler);
When I don't have a disk output consumer (no .then(writerHandler) part), the measured throughput is 80 M/s. As soon as I add a consumer, even one that writes to /dev/null or doesn't write at all but is declared as a dependent consumer, performance drops to 50-65 M/s.
I've profiled it with Oracle Mission Control, and this is what the CPU usage graph shows:
(CPU usage graphs: one without the additional consumer, one with the additional consumer.)
What is this gray part in the graph and where is it coming from? I suppose it has to do with thread synchronisation, but I can't find any other statistic in Mission Control that would indicate any such latency or contention.
Your hypothesis is correct, it is a thread synchronization issue.
From the API Documentation for EventHandlerGroup<T>.then (Emphasis mine)
Set up batch handlers to consume events from the ring buffer. These handlers will only process events after every EventProcessor in this group has processed the event.
This method is generally used as part of a chain. For example if the handler A must process events before handler B:
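The chained setup it is describing is, roughly (A and B here are placeholder handlers):

disruptor.handleEventsWith(A).then(B);   // B only sees an event after A has processed it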
This should necessarily decrease throughput. Think about it like a funnel:
The consumer has to wait for every EventProcessor to be finished, before it can proceed through the bottleneck.
I can see two possibilities here, based on what you've shown. You might be affected by one or both, I'd recommend testing both.
1) IO processing bottleneck.
2) Contention on multiple threads writing to buffer.
IO processing
From the data shown, you have stated that as soon as you enable the IO component, your throughput decreases and kernel time increases. This could quite easily be the IO wait time while your consumer thread is writing. A context switch to perform a write() call is significantly more expensive than doing nothing. Your Decoders are now capped at the maximum speed of the consumer. To test this hypothesis, you could remove the write() call: open the output file, prepare the string for output, but just don't issue the write call.
Suggestions
Try removing the write() call in the Consumer, see if it reduces kernel time.
Are you writing to a single flat file sequentially? If not, try this.
Are you using smart batching (i.e. buffering until the endOfBatch flag and then writing in a single batch) to ensure that the IO is bundled up as efficiently as possible? (A sketch follows below.)
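A sketch of what that smart batching could look like in the writer handler (the FrameEvent type and its getBytes() accessor are assumptions, not code from the question):

import java.io.BufferedOutputStream;
import java.io.OutputStream;

import com.lmax.disruptor.EventHandler;

public class BatchingWriterHandler implements EventHandler<FrameEvent> {
    private final BufferedOutputStream out;

    public BatchingWriterHandler(OutputStream target) {
        this.out = new BufferedOutputStream(target, 1 << 20);   // 1 MB user-space buffer
    }

    @Override
    public void onEvent(FrameEvent event, long sequence, boolean endOfBatch) throws Exception {
        out.write(event.getBytes());   // accumulate in the buffer
        if (endOfBatch) {
            out.flush();               // one write to the OS per Disruptor batch, not per event
        }
    }
}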
Contention on multiple writers
Based on your description I suspect your Decoders are reading from the disruptor and then writing back to the very same buffer. This is going to cause issues with multiple writers aka contention on the CPUs writing to memory. One thing I would suggest is to have two disruptor rings:
Producer writes to #1
Decoder reads from #1, performs RS decode and writes the result to #2
Consumer reads from #2, and writes to disk
Assuming your RBs are sufficiently large, this should result in good clean walking through memory.
The key here is not having the Decoder threads (which may be running on a different core) write to the same memory that was just owned by the Producer. With only 2 cores doing this, you will probably see improved throughput unless the disk speed is the bottleneck.
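A rough sketch of that two-ring wiring in the Disruptor DSL (the ring sizes are placeholders, FrameEvent is an assumed event type, and the decoder handler is assumed to have been modified to take the output ring buffer so it can republish decoded frames):

import java.util.concurrent.Executors;

import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.dsl.Disruptor;

Disruptor<FrameEvent> decodeRing = new Disruptor<>(FrameEvent::new, 1 << 14,
        Executors.defaultThreadFactory());
Disruptor<FrameEvent> writeRing = new Disruptor<>(FrameEvent::new, 1 << 14,
        Executors.defaultThreadFactory());

RingBuffer<FrameEvent> out = writeRing.getRingBuffer();

// Decoders consume raw frames from ring #1 and publish decoded frames into ring #2.
RsFrameEventHandler[] rsWorkers = new RsFrameEventHandler[numRsWorkers];
for (int i = 0; i < numRsWorkers; i++) {
    rsWorkers[i] = new RsFrameEventHandler(numRsWorkers, i, out);   // assumed extra constructor argument
}
decodeRing.handleEventsWith(rsWorkers);

// The single writer consumes from ring #2 and writes to disk.
writeRing.handleEventsWith(writerHandler);

writeRing.start();
decodeRing.start();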
I have a blog article here which describes in more detail how to achieve this, including sample code: http://fasterjava.blogspot.com.au/2013/04/disruptor-example-udp-echo-service-with.html
Other thoughts
It would also be helpful to know what WaitStrategy you are using, how many physical CPUs are in the machine, etc.
You should be able to significantly reduce CPU utilisation by moving to a different WaitStrategy given that your biggest latency will be IO writes.
Assuming you are using reasonably new hardware, you should be able to saturate the IO devices with only this setup.
You will also need to make sure the files are on different physical devices to achieve reasonable performance.
This question relates to the latest version of Java.
30 producer threads push strings to an abstract queue. One writer thread pops from the same queue and writes the string to a file that resides on a 5400 rpm HDD RAID array. The data is pushed at a rate of roughly 111 MBps, and popped/written at a rate of roughly 80MBps. The program lives for 5600 seconds, enough for about 176 GB of data to accumulate in the queue. On the other hand, I'm restricted to a total of 64GB of main memory.
My question is: What type of queue should I use?
Here's what I've tried so far.
1) ArrayBlockingQueue. The problem with this bounded queue is that, regardless of the initial size of the array, I always end up with liveness issues as soon as it fills up. In fact, a few seconds after the program starts, top reports only a single active thread. Profiling reveals that, on average, the producer threads spend most of their time waiting for the queue to free up. This is regardless of whether or not I use the fair-access policy (with the second argument in the constructor set to true).
2) ConcurrentLinkedQueue. As far as liveness goes, this unbounded queue performs better. Until I run out of memory, about seven hundred seconds in, all thirty producer threads are active. After I cross the 64GB limit, however, things become incredibly slow. I conjecture that this is because of paging issues, though I haven't performed any experiments to prove this.
I foresee two ways out of my situation.
1) Buy an SSD. Hopefully the I/O rate increases will help.
2) Compress the output stream before writing to file.
Is there an alternative? Am I missing something in the way either of the above queues are constructed/used? Is there a cleverer way to use them? The Java Concurrency in Practice book proposes a number of saturation policies (Section 8.3.3) in the case that bounded queues fill up faster than they can be exhausted, but unfortunately none of them---abort, caller runs, and the two discard policies---apply in my scenario.
Look for the bottleneck. You produce more than you consume, so a bounded queue makes absolute sense, since you don't want to run out of memory.
Try to make your consumer faster. Profile it and look where the most time is spent. Since you write to a disk, here are some thoughts:
Could you use NIO for your problem? (maybe FileChannel#transferTo())
Flush only when needed.
If you have enough CPU reserves, compress the stream? (as you already mentioned)
Optimize your disks for speed (RAID cache, etc.)
Use faster disks
As @Flavio already said, for the producer-consumer pattern I see no problem there; it should be the way it is now. In the end, the slowest party controls the speed.
I can't see the problem here. In a producer-consumer situation, the system will always go with the speed of the slower party. If the producer is faster than the consumer, it will be slowed down to the consumer speed when the queue fills up.
If your constraint is that you can not slow down the producer, you will have to find a way to speed up the consumer. Profile the consumer (don't start too fancy, a few System.nanoTime() calls often give enough information), check where it spends most of its time, and start optimizing from there. If you have a CPU bottleneck you can improve your algorithm, add more threads, etc. If you have a disk bottleneck try writing less (compression is a good idea), get a faster disk, write on two disks instead of one...
According to the Java "Queue Implementations" documentation, there are other classes that might be right for you:
LinkedBlockingQueue
PriorityBlockingQueue
DelayQueue
SynchronousQueue
LinkedTransferQueue
TransferQueue
I don't know the performance or memory usage of these classes, but you can try them yourself.
I hope this helps.
Why do you have 30 producers? Is that number fixed by the problem domain, or is it just a number you picked? If the latter, you should reduce the number of producers until they produce at a total rate that is only slightly larger than the consumption rate, and use a blocking queue (as others have suggested). Then you will keep your consumer busy, which is the performance-limiting part, while minimizing use of other resources (memory, threads).
You have only two ways out: make the producers slower or the consumer faster. Slowing down the producers can be done in many ways, in particular by using bounded queues. To make the consumer faster, try memory-mapped files (https://www.google.ru/search?q=java+memory-mapped+file). Look at https://github.com/peter-lawrey/Java-Chronicle.
Another way is to free the writing thread from the work of preparing write buffers from strings. Let the producer threads emit ready buffers, not strings. Use a limited number of buffers, say 2 * threadnumber = 60. Allocate all buffers at the start and then reuse them. Use a queue for empty buffers. A producing thread takes a buffer from that queue, fills it, and puts it into the writing queue. The writing thread takes buffers from the writing queue, writes them to disk, and puts them back into the empty-buffers queue.
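A minimal sketch of that buffer-recycling scheme (buffer count, buffer size, and class name are illustrative; each chunk of data is assumed to fit in one buffer):

import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BufferPool {
    static final int BUFFER_COUNT = 60;        // ~2 * number of producer threads
    static final int BUFFER_SIZE  = 1 << 20;   // 1 MB per buffer

    final BlockingQueue<ByteBuffer> empty  = new ArrayBlockingQueue<>(BUFFER_COUNT);
    final BlockingQueue<ByteBuffer> filled = new ArrayBlockingQueue<>(BUFFER_COUNT);

    BufferPool() {
        for (int i = 0; i < BUFFER_COUNT; i++) {
            empty.add(ByteBuffer.allocateDirect(BUFFER_SIZE));
        }
    }

    // Producer side: block until a buffer is free, fill it, queue it for writing.
    void produce(byte[] data) throws InterruptedException {
        ByteBuffer buf = empty.take();
        buf.clear();
        buf.put(data);
        buf.flip();
        filled.put(buf);
    }

    // Writer side: take a filled buffer, write it out, then recycle it.
    void drainOneTo(FileChannel channel) throws Exception {
        ByteBuffer buf = filled.take();
        while (buf.hasRemaining()) {
            channel.write(buf);
        }
        empty.put(buf);
    }
}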
Yet another approach is to use asynchronous I/O. Producers initiate the writing operation themselves, without a special writing thread. A completion handler returns the used buffer to the empty-buffers queue.
Brief
I am running a multithreaded tcp server that uses a fixed thread pool with an unbounded Runnable queue. The clients dispatch the runnables to the pool.
In my stress test scenario, 600 clients attempt to log in to the server and immediately broadcast messages to every other client, simultaneously and repeatedly, without end and without sleeping (right now the clients just discard the incoming messages). Using a quad-core with 1GB reserved for heap memory, and a parallel GC for both the young and old generations, the server crashes with an OOM exception after 20 minutes. Monitoring the garbage collector reveals that the tenured generation is slowly increasing, and a full GC only frees up a small fraction of memory. A snapshot of a full heap shows that the old generation is almost completely occupied by Runnables (and their outgoing references).
It seems the worker threads are not able to finish executing the Runnables faster than the clients are able to queue them for execution (For each incoming "event" to the server, the server will create 599 runnables as there are 600 - 1 clients - assuming they are all logged in at the time).
Question
Can someone please help me conceive a strategy on how to handle the overwhelmed thread pool workers?
Also
If I bound the queue, what policy should I implement to handle rejected execution?
If I increase the size of the heap, wouldn't that only prolong the OOM exception?
A calculation could be made to measure the amount of work represented by the aggregated Runnables. Perhaps this measurement could be used as the basis for a locking mechanism to coordinate the clients' dispatching of work?
What reaction should the client experience when the server is overwhelmed with work?
Do not use an unbounded queue. I cannot tell you what the bound should be; your load tests should give you an answer to that question. Anyhow, make the bound configurable: at least dynamically configurable, better yet adaptable to some load measurement.
You did not tell us how the clients submit their requests, but if HTTP is involved, there already is a status code for the overloaded case: 503 Service Unavailable.
I would suggest you limit the capacity of the queue and "push back" on the publisher to stop it publishing, or drop the requests gracefully. You can do the former by making the queue block when it is full.
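One sketch of that push-back, using only standard ThreadPoolExecutor pieces (the pool and queue sizes are placeholders): with CallerRunsPolicy, a full queue makes the submitting thread run the task itself, which slows down how quickly it can enqueue more work. A blocking put on the queue is an alternative way to get a similar effect.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

ThreadPoolExecutor pool = new ThreadPoolExecutor(
        4, 4,                                         // fixed pool of 4 workers
        0L, TimeUnit.MILLISECONDS,
        new ArrayBlockingQueue<>(10_000),             // bounded task queue
        new ThreadPoolExecutor.CallerRunsPolicy());   // overflow work runs on the publisher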
You should be able to calculate your maximum throughput based on you network bandwidth and message size. If you are getting less than this, I would consider changing how your server distributes data.
Another approach is to make your message handling more efficient. You could have each reading thread from each client write directly to the listening clients. This avoids the need for an explicit queue (you might think of the buffers in the Socket as a queue of bytes) and limits the speed to whatever the server can handle. It will also not use more memory under load (than it does when idle)
Using this approach you can achieve as high a message rate as your network bandwidth can handle (even with a 10 Gig-E network). This moves the bottleneck elsewhere, meaning you still have a problem, but your server shouldn't fail.
BTW: If you use direct ByteBuffers you can do this without creating garbage and with a minimum of heap. e.g. ~1 KB of heap per client.
It sounds as if you're doing load testing. I would determine what you consider to be "acceptable heavy load". What is the heaviest amount of traffic you can expect a single client to generate? Then double it, or triple it, or scale it in a similar manner. Use this threshold to throttle or deny clients that use this much bandwidth.
This has a number of perks. First, it gives you the kind of analysis you need to determine server load (users per server). Second it gives you a first line of defense against DDOS attacks.
You have to somehow throttle the incoming requests, and the mechanism for doing that should depend on the work you are trying to do. Anything else will simply result in an OOM under enough load, and thus open you up to DoS attacks (even unintentional ones).
Fundamentally, you have 4 choices:
Make clients wait until you are ready to accept their requests
Actively reject client requests until you are ready to accept new requests
Allow clients to timeout while trying to reach your server when it is not ready to receive requests
A blend of 2 or 3 of the above strategies.
The right strategy depends on how your real clients will react under the various circumstances – is it better for them to wait, possibly (effectively) indefinitely, or is it better that they know quickly that their work won't get done unless they try again later?
Whichever way you do it, you need to be able to count the number of tasks currently queued and either add a delay, block completely, or return an error condition based on the number of items in the queue.
A simple blocking strategy can be implemented by using a BlockingQueue implementation. However, this doesn't give particularly fine-grained control.
Or you can use a Semaphore to control permits to add tasks to the queue, which has the advantage of supplying a tryAcquire(long timeout, TimeUnit unit) method if you want to apply a mild throttling.
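A sketch of that semaphore-based throttle (the permit count, timeout, and pool size are placeholders):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class ThrottledExecutor {
    private static final int MAX_PENDING = 10_000;

    private final Semaphore permits = new Semaphore(MAX_PENDING);
    private final ExecutorService pool = Executors.newFixedThreadPool(8);

    // Returns false if the server is too busy to accept the task right now.
    public boolean trySubmit(Runnable task) throws InterruptedException {
        if (!permits.tryAcquire(100, TimeUnit.MILLISECONDS)) {
            return false;                // mild throttling: caller can retry or report overload
        }
        pool.execute(() -> {
            try {
                task.run();
            } finally {
                permits.release();       // free the permit once the work is done
            }
        });
        return true;
    }
}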
Whichever way, don't allow the threads that service the clients to grow without bounds, or else you'll simply end up with an OOM for a different reason!
As the title says, my module uses a BlockingQueue to deliver data. The server can produce a large amount of logging information. To avoid affecting the performance of the server, I wrote multi-threaded clients to consume this data and persist it in data caches. Because a huge amount of data can be produced per minute, I am unsure what size I should give the queue when I initialize it. I know I can set a queue policy so that, if more data is produced, I can drop the overflow. But what size should the queue be in order to hold as much of this data as possible?
Could you give me some suggestions? As far as I know, it is related to my server's JVM stack size and the size of a single logging record in my JVM?
Make it "as large as is reasonable". For example, if you are OK with it consuming up to 1Gb of memory, then allocate its size to be 1Gb divided by the average number of bytes of the objects in the queue.
If I had to pick a "reasonable" number, I would start with 10000. The reason is, if it grows to larger than that, then making it larger isn't a good idea and isn't going to help much, because clearly the logging requirement is outpacing your ability to log, so it's time to back off the clients.
"Tuning" through experimentation is usually the best approach, as it depends on the profile of your application:
If there are highs and lows in your application's activity, then a larger queue will help "smooth out" the load on your server
If your application has a relatively steady load, then a smaller queue is appropriate as a larger queue only delays the inevitable point when clients are blocked - you would be better to make it smaller and dedicate more resources (a couple more logging threads) to consuming the work.
Note also that a very large queue may impact garbage collection responsiveness to freeing up memory, as it has to traverse a much larger heap (all the objects in the queue) each time it runs, increasing the load on both CPU and memory.
You want to make the size as small as you can without impacting throughput and responsiveness too much. To assess this you'll need to set up a test server and hit it with a typical load to see what happens. Note that you'll probably need to hit it from multiple machines to put a realistic load on the server, as hitting it from one machine can limit the load due to the number of CPU cores and other resources on the test client machine.
To be frank, I'd just make the size 10000 and tune the number of worker threads rather than the queue size.
Contiguous writes to disk are reasonably fast (easily 20MB per second). Instead of storing data in RAM, you might be better off writing it to disk without worrying about memory requirements. Your clients then can read data from files instead of RAM.
To know the size of a Java object, you could use any Java profiler. YourKit is my favorite.
I think the real problem is not size of queue but what you want to do when things exceed your planned capacity. ArrayBlockingQueue will simply block your threads, which may or may not be the right thing to do. Your options typically are:
1) Block the threads (use ArrayBlockingQueue) based on memory committed for this purpose
2) Return an error to the "layer above" and let that layer decide what to do... maybe send an error to the client
3) Throw away some data... say, data that was enqueued long ago (see the sketch after this list)
4) Start writing to disk, once you overflow RAM capacity.
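A small sketch of options 1 and 3 side by side (the capacity of 10,000 is only a placeholder):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class LogBuffer {
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);

    // Option 1: block the producing thread until space is available.
    public void enqueueBlocking(String logLine) throws InterruptedException {
        queue.put(logLine);
    }

    // Option 3: if the queue is full, discard the oldest entries to make room.
    public void enqueueDropOldest(String logLine) {
        while (!queue.offer(logLine)) {
            queue.poll();   // drop the entry that was enqueued longest ago
        }
    }

    public String take() throws InterruptedException {
        return queue.take();   // consumer side
    }
}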