Profiling Netty Performance


I'm writing a Netty application. The application is running on a 64-bit, eight-core Linux box.
The Netty application is a simple router that accepts requests (incoming pipeline) reads some metadata from the request and forwards the data to a remote service (outgoing pipeline).
This remote service will return one or more responses to the outgoing pipeline. The Netty application will route the responses back to the originating client (the incoming pipeline)
There will be thousands of clients. There will be thousands of remote services.
I'm doing some small-scale testing (ten clients, ten remote services) and I don't see the sub-10-millisecond performance I'm expecting at the 99.9th percentile. I'm measuring latency from both the client side and the server side.
I'm using a fully async protocol that is similar to SPDY. I capture the time (I just use System.nanoTime()) when we process the first byte in the FrameDecoder. I stop the timer just before we call channel.write(). I am measuring sub-millisecond time (99.9 percentile) from the incoming pipeline to the outgoing pipeline and vice versa.
I also measured the time from the first byte in the FrameDecoder to when a ChannelFutureListener callback was invoked on the (above) channel.write(). The time was in the high tens of milliseconds (99.9 percentile), but I had trouble convincing myself that this was useful data.
My initial thought was that we had some slow clients. I watched channel.isWritable() and logged when this returned false. This method did not return false under normal conditions.
Some facts:
We are using the NIO factories. We have not customized the worker size
We have disabled Nagle (tcpNoDelay=true)
We have enabled keep alive (keepAlive=true)
CPU is idle 90+% of the time
Network is idle
The GC (CMS) is being invoked every 100 seconds or so for a very short amount of time
Is there a debugging technique that I could follow to determine why my Netty application is not running as fast as I believe it should?
It feels like channel.write() adds the message to a queue and we (application developers using Netty) don't have transparency into this queue. I don't know if the queue is a Netty queue, an OS queue, a network card queue or what. Anyway, I'm reviewing examples of existing applications and I don't see any anti-patterns I'm following.
Thanks for any help/insight
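For anyone wanting to reproduce the kind of measurement described in the question (take System.nanoTime() at the start, stop before the write, report the 99.9th percentile), here is a minimal JDK-only sketch. The class name LatencySamples and its methods are made up for illustration and are not part of Netty; the percentile uses the nearest-rank method, and the recorder is not thread-safe, so it assumes one recorder per thread.

```java
import java.util.Arrays;

public class LatencySamples {
    private final long[] nanos;
    private int count;

    public LatencySamples(int capacity) {
        nanos = new long[capacity];
    }

    // Record one elapsed time; samples beyond the capacity are silently dropped.
    public void record(long elapsedNanos) {
        if (count < nanos.length) {
            nanos[count++] = elapsedNanos;
        }
    }

    // Nearest-rank percentile, expressed per mille (999 means the 99.9th percentile).
    // Integer arithmetic avoids floating-point rounding when computing the rank.
    public long percentilePerMille(int perMille) {
        if (count == 0) {
            throw new IllegalStateException("no samples recorded");
        }
        long[] sorted = Arrays.copyOf(nanos, count);
        Arrays.sort(sorted);
        long rank = Math.max(1, ((long) perMille * count + 999) / 1000);
        return sorted[(int) Math.min(rank, count) - 1];
    }

    public static void main(String[] args) {
        LatencySamples samples = new LatencySamples(10_000);
        for (int i = 0; i < 10_000; i++) {
            long start = System.nanoTime();   // e.g. first byte seen in the decoder
            Math.sqrt(i);                     // stand-in for the work being timed
            samples.record(System.nanoTime() - start);
        }
        System.out.println("p99.9 = " + samples.percentilePerMille(999) + " ns");
    }
}
```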

Netty creates Runtime.getRuntime().availableProcessors() * 2 workers by default, which is 16 in your case. That means you can handle up to 16 channels simultaneously; other channels will wait until you return from the ChannelUpstreamHandler.handleUpstream/SimpleChannelHandler.messageReceived handlers. So don't do heavy operations in these (I/O) threads, otherwise you can stall the other channels.
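A rough illustration of that advice using only the JDK (no Netty classes are involved, so this shows the idea rather than Netty's API). The worker-count formula is the one quoted above; OffloadSketch, BLOCKING_POOL, handleMessage and slowLookup are hypothetical names, and the pool size of 4 is an arbitrary assumption.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class OffloadSketch {
    // Default worker count as described above: twice the available processors.
    public static int defaultWorkerCount() {
        return Runtime.getRuntime().availableProcessors() * 2;
    }

    // Hypothetical pool for slow work, kept separate from the I/O worker threads.
    // Daemon threads so the sketch never keeps the JVM alive on its own.
    static final ExecutorService BLOCKING_POOL = Executors.newFixedThreadPool(4, r -> {
        Thread t = new Thread(r, "blocking-pool");
        t.setDaemon(true);
        return t;
    });

    // Stand-in for a message handler: hand the slow part to the other pool
    // and return to the caller (the I/O thread) immediately.
    public static CompletableFuture<String> handleMessage(String request) {
        return CompletableFuture.supplyAsync(() -> slowLookup(request), BLOCKING_POOL);
    }

    // Hypothetical slow operation (database call, blocking RPC, and so on).
    static String slowLookup(String request) {
        try {
            Thread.sleep(50);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "routed:" + request;
    }

    public static void main(String[] args) {
        System.out.println("I/O workers: " + defaultWorkerCount());
        CompletableFuture<String> pending = handleMessage("req-1");
        System.out.println("handler returned, done=" + pending.isDone());
        System.out.println(pending.join());
    }
}
```

The point of the design is that the handler's own running time stays tiny, so one slow request cannot hold up the other channels assigned to the same worker.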

You haven't specified your Netty version, but it sounds like Netty 3.
Netty 4 is now stable, and I would advise that you update to it as soon as possible.
You have specified that you want ultra-low latency as well as tens of thousands of clients and services. These don't really mix well. NIO inherently has higher latency than OIO. The pitfall, however, is that OIO probably won't be able to reach the number of clients you are hoping for. Nonetheless, I would try an OIO event loop / factory and see how it goes.
I myself have a TCP server which takes around 30ms on localhost to send, receive and process a few TCP packets (measured from the time the client opens a socket until the server closes it). If you really do require such low latencies, I suggest you switch away from TCP because of the SYN/ACK spam that is required to open a connection; this is going to use a large part of your 10ms.

Measuring time in a multi-threaded environment is very difficult if you are using simple things like System.nanoTime(). Imagine the following on a 1 core system:
Thread A is woken up and begins processing the incoming request.
Thread B is woken up and begins processing the incoming request. But since we are working on a 1 core machine, this ultimately requires that Thread A is put on pause.
Thread B is done and performed perfectly fast.
Thread A resumes and finishes, but appears to have taken twice as long as Thread B, because you actually measured the time it took for Thread A + Thread B to finish.
There are two approaches on how to measure correctly in this case:
You can enforce that only one thread is used at all times.
This allows you to measure the exact performance of the operation, if the OS does not interfere, because in the above example Thread B can be outside of your program as well. A common approach in this case is to take the median to filter out the interference, which will give you an estimate of the speed of your code. You can, however, assume that on an otherwise idle multi-core system there will be another core to process background tasks, so your measurement will usually not be interrupted. Setting this thread to high priority helps as well.
You use a more sophisticated tool that plugs into the JVM to actually measure the atomic executions and time it took for those, which will effectively remove outside interference almost completely. One tool would be VisualVM, which is already integrated in NetBeans and available as a plugin for Eclipse.
As general advice: it is not a good idea to use more threads than cores, unless you know that those threads will frequently be blocked by some operation. This is not the case when using non-blocking NIO for I/O operations, as there is no blocking.
Therefore, in your special case, you would actually reduce the performance for clients, as explained above, because communication would be put on hold up to 50% of the time under high load. In the worst case, that could even cause a client to run into a timeout, as there is no guarantee when a thread is actually resumed (unless you explicitly request fair scheduling).
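One way to see the difference between "time that passed" and "time this thread actually ran" without a profiler is to compare System.nanoTime() with the per-thread CPU time exposed by ThreadMXBean. This is only a sketch: WallVsCpuTime and measure are hypothetical names, the sleep stands in for a thread being paused while something else runs, and per-thread CPU time is not supported or enabled on every JVM (the code returns -1 in that case).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class WallVsCpuTime {
    // Returns {wallNanos, cpuNanos} for the task, measured on the calling thread.
    // cpuNanos is -1 when per-thread CPU time is unsupported or disabled.
    public static long[] measure(Runnable task) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        boolean cpuAvailable = threads.isCurrentThreadCpuTimeSupported()
                && threads.isThreadCpuTimeEnabled();
        long cpuStart = cpuAvailable ? threads.getCurrentThreadCpuTime() : -1;
        long wallStart = System.nanoTime();
        task.run();
        long wall = System.nanoTime() - wallStart;
        long cpu = cpuAvailable ? threads.getCurrentThreadCpuTime() - cpuStart : -1;
        return new long[] {wall, cpu};
    }

    public static void main(String[] args) {
        // The sleep simulates the thread being off-CPU: wall time grows, CPU time barely moves.
        long[] t = measure(() -> {
            try {
                Thread.sleep(100);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        System.out.println("wall ns = " + t[0] + ", cpu ns = " + t[1]);
    }
}
```

If the wall-clock figure is much larger than the CPU figure for code that does no I/O, the thread was most likely waiting to be scheduled rather than doing work.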

Related

How is a single threaded server able to cater to multiple clients even thru non-blocking I/O?

When implementing a server, we can delegate each client request to its own thread. I read that the problem with this approach is that each thread will have its own stack and this would be very "expensive". The alternative approach is to make the server single-threaded and handle all client requests on this one server thread, with I/O requests issued as non-blocking requests. My doubt is: if one server thread is running multiple client requests simultaneously, won't the server code have an instruction pointer, a set of local variables and a function call stack for each client request, and won't this again be "expensive" as before? How are we really saving?

I read that problem with this approach is that each thread will have its own stack and this would be very "expensive".
Depends on how tight your system resources are. The typical JVM stack space allocated per thread defaults to 1 MB on many current architectures, although this can be tuned with the -Xss command line argument. How much system memory your JVM has at its disposal and how many threads you need determines if you want to pay the high price of writing the server single-threaded.
My doubt is that if one server thread is running multiple client requests simultaneously, won't server code have instruction pointer, set of local variables, function calls stacks for each client request, then won't this again be "expensive" as before
It will certainly need to store per-request context information in the heap, but I suspect that it would take a lot less than 1 MB to hold the variables necessary to service the incoming connections.
Like most things, what we are really competing against when we look to optimize a program, whether to reduce memory or other system resource use, is code complexity. It is harder to get right and harder to maintain.
Although threaded programs can be highly complex, isolating a request handler in a single thread can make the code extremely simple unless it needs to coordinate with other requests somehow. Writing a high performance single threaded server would be much more complex than the threaded version in most cases. Of course, there would also be limits on the performance given that you can't make use of multiple processors.
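Besides -Xss, the JDK also lets you request a stack size for an individual thread through the four-argument Thread constructor. A small sketch follows; StackSizeSketch and sumOnThread are made-up names, 256 KB is an arbitrary value, and the JVM documents the stack size as a hint that it may round or ignore.

```java
public class StackSizeSketch {
    // Sums 1..n on a thread created with the requested stack size (in bytes).
    // The JVM treats stackBytes as a hint: it may round it up or ignore it entirely.
    public static long sumOnThread(int n, long stackBytes) {
        long[] result = new long[1];
        Thread worker = new Thread(null, () -> {
            long sum = 0;
            for (int i = 1; i <= n; i++) {
                sum += i;
            }
            result[0] = sum;
        }, "small-stack-worker", stackBytes);
        worker.start();
        try {
            worker.join(); // join() makes the worker's write to result visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return result[0];
    }

    public static void main(String[] args) {
        System.out.println(sumOnThread(100, 256 * 1024));
    }
}
```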

Using non-blocking I/O, a single I/O thread can handle many connections. The I/O thread will get a notification when:
a client wants to connect
the socket's write buffer has space again after having been full on the previous round
the socket's read buffer has data available for reading
So the thread makes use of event-multiplexing to serve the connections concurrently using a selector. A thread waits for a set of selection-keys from the selector, and the selection key contains the state of the event you have registered for and you can attach user data like a 'session' to the selection-key.
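To make the selector loop concrete, here is a minimal single-threaded echo server sketch using java.nio, plus a tiny client round trip so it can be run as-is. SelectorEchoSketch, serve and echoOnce are hypothetical names; the ByteBuffer attached to each key plays the role of the 'session'. It deliberately skips OP_WRITE handling and error recovery, so treat it as an illustration of the event loop and not as production code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.concurrent.atomic.AtomicBoolean;

public class SelectorEchoSketch {
    // One I/O thread: wait for events, then accept or echo for whichever keys are ready.
    static void serve(Selector selector, AtomicBoolean stop) throws IOException {
        while (!stop.get()) {
            selector.select(); // blocks until a registered event fires or wakeup() is called
            Iterator<SelectionKey> ready = selector.selectedKeys().iterator();
            while (ready.hasNext()) {
                SelectionKey key = ready.next();
                ready.remove();
                if (key.isAcceptable()) {
                    SocketChannel client = ((ServerSocketChannel) key.channel()).accept();
                    if (client != null) {
                        client.configureBlocking(false);
                        // The attachment is the per-connection "session" state.
                        client.register(selector, SelectionKey.OP_READ, ByteBuffer.allocate(1024));
                    }
                } else if (key.isReadable()) {
                    SocketChannel client = (SocketChannel) key.channel();
                    ByteBuffer session = (ByteBuffer) key.attachment();
                    if (client.read(session) < 0) {
                        client.close(); // peer closed; closing also cancels the key
                        continue;
                    }
                    session.flip();
                    // Sketch only: real code would register OP_WRITE instead of spinning.
                    while (session.hasRemaining()) {
                        client.write(session);
                    }
                    session.clear();
                }
            }
        }
        for (SelectionKey key : selector.keys()) {
            key.channel().close();
        }
    }

    // Starts the loop on a daemon thread, sends one message, returns the echoed reply.
    public static String echoOnce(String message) {
        AtomicBoolean stop = new AtomicBoolean(false);
        try (Selector selector = Selector.open();
             ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress("127.0.0.1", 0)); // port 0: any free port
            server.configureBlocking(false);
            server.register(selector, SelectionKey.OP_ACCEPT);
            Thread io = new Thread(() -> {
                try {
                    serve(selector, stop);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }, "io-thread");
            io.setDaemon(true);
            io.start();

            byte[] payload = message.getBytes(StandardCharsets.UTF_8);
            ByteBuffer reply = ByteBuffer.allocate(payload.length);
            try (SocketChannel client = SocketChannel.open(server.getLocalAddress())) {
                client.write(ByteBuffer.wrap(payload));
                while (reply.hasRemaining() && client.read(reply) >= 0) {
                    // keep reading until the whole echo has arrived
                }
            }
            stop.set(true);
            selector.wakeup(); // unblock select() so the loop can observe the stop flag
            io.join();
            return new String(reply.array(), 0, reply.position(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(echoOnce("ping"));
    }
}
```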
A very typical design pattern used here is the reactor pattern.
But often you want to prevent blocking the I/O thread with longer running requests. So you offload the work to a different pool of threads. Then the reactor changes to the proactor pattern.
And often you want to scale the number of I/O threads. So you can have a bunch of I/O threads in parallel.
But the total number of threads in your application should remain limited.
It all depends on what you want. Above are techniques I frequently used while working for Hazelcast.
I would not start to write all this logic from scratch. If you want to make use of networking, I would have a look at Netty. It takes care of most of the heavy lifting and has all kinds of optimizations built in.
I'm not 100% sure if a thread that doesn't write to its stack will actually consume 1MB of physical memory. In Linux the (shared) zero-page is used for a memory allocation, so no actual page frame (physical memory) is allocated unless the stack of the thread is actually written to; this will trigger a copy on write to do the actual allocation of a page frame. Apart from saving memory, this also prevents wasting memory bandwidth on zeroing out the stack. Memory consumption of a thread is one thing, but context switching is another problem. If you have many more threads than cores, context switching can become a real performance problem.

RabbitMQ-Is it a good practice to create multiple consumers for a single queue in one application process

I have just started working on a new project backed by RabbitMQ, and multiple consumer instances are created listening to the same queue when the application starts. However, they share the same connection, with different channels.
The messages from the queue are massive (millions of messages for one single producing action), so I guess the original code author was trying to do something to make consuming faster.
I am trying to find some posts discussing on this but I can't find a very certain answer.
What I get so far is:
Each channel will have a separate dispatch thread
The operation commands on the same channel are serialized even though they are called from multiple threads
So
Creating multiple consumers, and thus multiple channels, gives multiple dispatch threads, but I don't think that improves message dispatching, since a single dispatch thread should already be fast enough.
The ack operations can be parallelized across different channels, but I am not sure this gives any better performance.
Since more channels consume more system resources, I wonder: is this practice good?

There seem to be a few things going on here, so let's try to look at this scenario from a holistic perspective.
For starters, it sounds like the original designer of this code understood some basics about RabbitMQ (or learned a few things by trial and error), but may have had trouble putting all the pieces together; hopefully I can help.
RabbitMQ connections are, in reality, AMQP-over-TCP connections (and thus are somewhere around the session layer of the OSI model). TCP connections are supposed to be opened up and used until some sort of network interruption or application shutdown closes them (and for this reason, AMQP has trouble with firewalls and other smart network devices). Using a single TCP connection for message processing activities for a single logical process is a good idea, as creating and destroying TCP connections is usually an expensive process for the computer, which leads to
RabbitMQ channels are used to multiplex communication streams in the AMQP-Over-TCP connection (and are defined in the AMQP Protocol Spec). All they do is specify an integer value (I can't remember the number of bytes, but it doesn't matter anyway) used to preface the subsequent command or response on a TCP connection. Most AMQP operations are channel-specific. For the purposes of higher-level operations, channels are treated similar to connections, as they are application-level constructs.
Now, where I think the question starts to go off the rails a bit is here:
The messages from the queue are massive(millions messages for one single producing behavior ) so I guess the very first code author is trying to do something to make consuming faster.
A fundamental assumption about a system which uses queues is that messages are consumed at approximately the same rate that they are produced. Queues exist to buffer uneven producing activities. The mathematics and statistics of how queues work are quite interesting, and assuming the production of messages is done in response to some real-world stimulus, your system is virtually guaranteed to behave in a predictable manner. Therefore, your design goal is to ensure that there are enough consumers to process the messages that are produced, and to respond to changing conditions as needed. Your goal should not be to "speed up" the consumers (unless they have some specific issue), but rather to have enough consumers to process the total load.
Further, the average number of items in the queue at any time should approach zero. It is usually a good idea to have overcapacity so that you don't wind up with an unstable situation where messages start accumulating in the queue (and the queue ends up looking like the Stack Overflow Close Vote Queue).
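As a back-of-the-envelope illustration of "enough consumers plus some overcapacity", the usual sizing arithmetic is: offered load = arrival rate x average processing time per message, and consumers needed = offered load / target utilization, rounded up. The sketch below just encodes that formula; ConsumerCapacity and consumersNeeded are hypothetical names, and the numbers in main (1,000 messages per second, 5 ms each, 70% target utilization) are invented for the example, not taken from the question.

```java
public class ConsumerCapacity {
    // Consumers needed so that each one is busy at most targetUtilization of the time.
    // arrivalsPerSecond: average message production rate
    // secondsPerMessage: average time one consumer spends on one message
    // targetUtilization: fraction in (0, 1]; lower values leave more headroom
    public static int consumersNeeded(double arrivalsPerSecond,
                                      double secondsPerMessage,
                                      double targetUtilization) {
        if (targetUtilization <= 0 || targetUtilization > 1) {
            throw new IllegalArgumentException("targetUtilization must be in (0, 1]");
        }
        double offeredLoad = arrivalsPerSecond * secondsPerMessage;
        return (int) Math.ceil(offeredLoad / targetUtilization);
    }

    public static void main(String[] args) {
        // Hypothetical workload: 1,000 msg/s, 5 ms per message, aim for 70% busy.
        System.out.println(consumersNeeded(1000, 0.005, 0.7));
    }
}
```

With those made-up numbers the offered load is 5 consumer-seconds per second, so 5 consumers would be saturated and 8 leaves the headroom that keeps the queue from building up.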
And that brings us to an attempt to answer your fundamental question, which seems to deal with threading and possibly detailed implementation of the Java client, which I will readily admit I have not used (I'm a .NET guy).
Here are some design guidelines for your software:
Ensure that a single thread uses no more than one channel.
Use one TCP connection per logical consuming process.
Balance the number of logical processes on a single physical machine such that resource contention is not a problem (you don't want to starve your consumers of computer resources).
Try to use BASIC.GET as opposed to a push-based consumer. Use of consumers is difficult in practice, and there is no performance benefit at the protocol level over a BASIC.GET. Note that I do not know if the Java library has implemented these differently such that it does cause a performance difference; stranger things have been known to happen.
If you do use consumers, make sure pre-fetch is set to 0 (disabled) and that AutoAck is set to false if reliable processing is important (most applications require reliable processing). Along with this, make sure you are acknowledging messages upon completion of processing!
Periodically reboot your consuming threads, channels, and processors - or do a BASIC.Recover. There are degrees of randomness that will result in unacknowledged messages accumulating over time, and this will deal with it.
Again, if you prefer to use consumers, generally speaking to share consumers across channels is a bad idea. Each consumer should get its own channel.

Netty - The best way to send message concurrently in java

I have 150 threads.
Each Thread has Netty Client and it is connected to server.
Should I use more 150 threads to send?
Should I use 75 threads to send?
Should I use no thread to send?
My local test is not meaningful. (I can't operate server over 50)
please help me.

There is no golden rule for this. Depending on your application, you can find that:
just one connection with one thread is enough to use all the resources of the machine.
Using around the number of CPUs to 2 * the number of CPUs is enough to use all the CPU of the machine.
If you have synchronous requests (instead of asynchronous ones) and a high network latency you might find that you are spending most of the time waiting for data in which case more connections would help mitigate this latency.
My preference is to allow asynchronous messaging/requests and allow a single connection to use all the CPU/resources on the machine if it makes sense, because while you might get better results when you test with 150 busy connections, in the real world they might not all be active at once or to the same degree.

Does downloading with multiple threads actually speed things up?

So, I was starting up Minecraft a few days ago and opened up its developer console to see what it was doing while it was updating itself. I noticed one of the lines said the following:
Downloading 32 files. (16 threads)
Now, the first thing that came to mind was: the processor can still only do one thing at a time, all threads do is split each of their tasks up and distribute the CPU power between them, so what would the purpose be of downloading multiple files on multiple threads if each thread is still only being run on a single processor?
Then, in the process of deciding whether or not I should ask this question on SO, I remembered that multiple cores can reside on one processor. For example, my processor is quad-core. So, you can actually accomplish 4 downloads truly simultaneously. Now that sounds like it makes sense. Except for the fact that there are 16 threads being used for Minecraft's download. So, basically my question is:
Does increasing the number of threads during a download help the speed at all? (Assuming a multi-core processor, and the thread count is less than the core count.)
And
If you increase the number of threads to past the number of cores, does speed still increase? (It sounds to me like the downloads would be max-speed after 4 threads, on a quad-core processor.)

Downloads are network-bound, not CPU-bound. So theoretically, using multiple threads will not make it faster.
On the one hand, if your program downloads using synchronous (blocking) I/O, then multiple threads simply enables less blocking to occur. In general, on the other hand, it is more sensible to just use a single thread with asynchronous I/O.
On the gripping hand, asynchronous I/O is trickier to code correctly than synchronous I/O (which is straightforward). So the developers may have just decided to favour ease of programming over pure performance. (Or they may favour compatibility with older Java platforms: real async I/O is only available with NIO2 (which came with Java 7).)

When one thread downloads one file, it will spend some time waiting. When one thread downloads N files, one after another, it will spend, on average, N times as much total wait time.
When N threads each download one file, each of those threads will spend some time waiting, but some of those waits will be overlapped (e.g., thread A and thread B are both waiting at the same time.) The end result is that it may take less wall-clock time to get all N of the files.
On the other hand, if the threads are waiting for files from the same server, each thread's individual wait time may be longer.
The question of whether or not there is an overall performance benefit depends on the client, on the server, and on the available network bandwidth. If the network can't carry bytes as fast as the server can pump them out, then multi-threading the client probably won't save any time. If the server is single-threaded, then multi-threading the client definitely won't help. But if the conditions are right (e.g., if you have a fast internet connection and especially if the files are coming from a server farm instead of a single machine), then multi-threading potentially can speed things up.
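The overlapping-waits argument can be demonstrated with a toy simulation in which each "download" is just a sleep, standing in for time spent waiting on the network. OverlappedWaits, fakeDownload and the 100 ms wait are invented for the example; it says nothing about bandwidth limits or server behaviour, which are exactly the conditions that decide whether the effect shows up in practice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class OverlappedWaits {
    // Stand-in for one download: the thread only waits, as it would on network I/O.
    static void fakeDownload(long waitMillis) {
        try {
            Thread.sleep(waitMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // One thread, one file after another: the waits add up.
    public static long sequentialMillis(int files, long waitMillis) {
        long start = System.nanoTime();
        for (int i = 0; i < files; i++) {
            fakeDownload(waitMillis);
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    // One thread per file: the waits overlap.
    public static long parallelMillis(int files, long waitMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(files);
        try {
            long start = System.nanoTime();
            List<Future<?>> pending = new ArrayList<>();
            for (int i = 0; i < files; i++) {
                pending.add(pool.submit(() -> fakeDownload(waitMillis)));
            }
            for (Future<?> f : pending) {
                f.get(); // wait for every simulated download to finish
            }
            return (System.nanoTime() - start) / 1_000_000;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("sequential: " + sequentialMillis(4, 100) + " ms");
        System.out.println("parallel:   " + parallelMillis(4, 100) + " ms");
    }
}
```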

Normally it will not be faster, but there are always exceptions.
Assuming you open a new connection for each download thread, it can be faster if:
The network (either your own network or the target system) limits the download speed of each connection, or
You are downloading from multiple servers, etc.
Or if the "download" is not a plain download, but downloads something and then does some CPU-intensive processing on it.
In such cases you may see the download speed increase when using multiple threads.

Java TCP/IP Socket Performance Problem

Our application is reading data very fast over TCP/IP sockets in Java. We are using the NIO library with non-blocking sockets and a Selector to indicate readiness to read. On average, the overall processing time for reading and handling the read data is sub-millisecond.
However we frequently see spikes of 10-20 milliseconds. (running on Linux).
Using tcpdump we can see the time difference between tcpdump's reading of 2 discrete messages, and compare that with our application's time. We see that tcpdump seems to have no delay, whereas the application can show 20 milliseconds.
We are pretty sure this is not GC, because the GC log shows virtually no Full GC, and in JDK 6 (from what I understand) the default GC is parallel, so it should not be pausing the application threads (unless doing Full GC).
It looks almost as if there is some delay for Java's Selector.select(0) method to return the readiness to read, because at the TCP layer, the data is already available to be read (and tcpdump is reading it).
Additional Info: at peak load we are processing about 6,000 messages per second at 150 bytes average per message, or about 900 KB per second.

An eden collection still incurs a STW (stop-the-world) pause, so 20ms may be perfectly normal depending on allocation behaviour and heap size/size of the live set.

Is your Java code running under RTLinux, or some other distro with hard real-time scheduling capability? If not, 10-20 msec of jitter in the processing times seems completely reasonable, and expected.

I had the same problem in a Java service that I work on. When sending the same request repeatedly from the client, the server would block at the same spot in the stream for 25-35ms. Turning off Nagle's algorithm on the socket fixed this for me. This can be accomplished by calling setTcpNoDelay(true) on the Socket.
This may result in increased network congestion because ACKs will now be sent as separate packets.
See http://en.wikipedia.org/wiki/Nagle%27s_algorithm for more info on Nagle's algorithm.
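For reference, a minimal sketch of that call on a plain java.net.Socket. To keep it runnable without any external server it connects to a throwaway listener on the loopback interface; NoDelaySketch and connectWithNoDelay are made-up names, and in a real client you would set the option on the socket you already have.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class NoDelaySketch {
    // Connects to a throwaway loopback listener, disables Nagle's algorithm,
    // and reports whether the socket now has TCP_NODELAY set.
    public static boolean connectWithNoDelay() {
        try (ServerSocket listener = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket socket = new Socket(listener.getInetAddress(), listener.getLocalPort())) {
            socket.setTcpNoDelay(true); // small writes are sent immediately, not coalesced
            return socket.getTcpNoDelay();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("TCP_NODELAY enabled: " + connectWithNoDelay());
    }
}
```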

From the tcpdump faq:
WHEN IS A PACKET TIME-STAMPED? HOW ACCURATE ARE THE TIME STAMPS?
In most OSes on which tcpdump and libpcap run, the packet is time stamped as part of the process of the network interface's device driver, or the networking stack, handling it. This means that the packet is not time stamped at the instant that it arrives at the network interface; after the packet arrives at the network interface, there will be a delay until an interrupt is delivered or the network interface is polled (i.e., the network interface might not interrupt the host immediately - the driver may be set up to poll the interface if network traffic is heavy, to reduce the number of interrupts and process more packets per interrupt), and there will be a further delay between the point at which the interrupt starts being processed and the time stamp is generated.
So odds are the timestamp is made in the privileged kernel layer, and the lost 20ms goes to context-switching overhead back to user space and into Java and the JVM's network selector logic. Without more analysis of the system as a whole, I don't think it's possible to make an affirmative selection of cause.
