Recently I was playing with Java sockets and NIO for writing a server, although it is still not really clear to me why Java NIO could be superior to standard sockets. When writing a server using either of these technologies, in most cases it comes down to having a dispatcher thread that accepts connections and passes them on to worker threads.
I have read that in a threaded model we need a dedicated thread per connection, but we can still create a fixed-size thread pool and reuse its threads to handle different connections (so that the cost of creating and tearing down threads is reduced).
But with Java NIO it looks similar. We have one thread that accepts requests and some worker thread(s) processing data when it is received.
An example I found where Java NIO would be better is a server that maintains many non-busy connections, like a chat server or HTTP server. But I can't really understand why.
There are several distinct reasons.
Using multiplexed I/O with a Selector can save you a lot of threads, which saves you a lot of thread stacks, which saves you a lot of memory. On the other hand, it moves scheduling from the operating system into your program, so it can cost you a bit of CPU, and it will also cost you a lot of programming complication. Given that select() was designed when the alternative was more processes, not more threads, it is in fact debatable whether the extra complication is really worth it, as against using threads and spending the programming money saved on more memory.
MappedByteBuffers are a slightly faster way of reading files than either java.io or using java.nio.channels with ByteBuffers.
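As a minimal sketch of the memory-mapped approach mentioned above (the class and method names are made up for illustration, and whether it is actually faster depends on file size and access pattern), reading a file through a MappedByteBuffer could look like this:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class MappedRead {
    // Sums every byte of a file through a read-only memory mapping.
    static long sumBytes(Path file) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // Maps the whole file; a single mapping cannot exceed Integer.MAX_VALUE bytes.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            long sum = 0;
            while (buf.hasRemaining()) {
                sum += buf.get() & 0xFF; // treat each byte as unsigned
            }
            return sum;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical demo: writes a 4-byte temp file and sums it back through the mapping.
    static long demo() {
        try {
            Path tmp = Files.createTempFile("mapped", ".bin");
            Files.write(tmp, new byte[] {1, 2, 3, (byte) 250});
            return sumBytes(tmp);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The pages are loaded lazily by the operating system, so no explicit read call or intermediate byte array is involved.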
If you are just copying from one channel to another, using 'direct' buffers saves you from having to copy data from the native JNI space into the JVM space and back again; or using the FileChannel.transferTo() method can save you from copying data from kernel space into user space.
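A rough sketch of the channel-to-channel copy described above, using FileChannel.transferTo(). The class name and the demo helper are assumptions for illustration; whether the kernel can really avoid the user-space copy depends on the platform and on the channel types involved.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

class ChannelCopy {
    // Copies src to dst. transferTo may move fewer bytes than requested, hence the loop.
    static long copy(Path src, Path dst) {
        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dst, StandardOpenOption.CREATE,
                     StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            long size = in.size();
            long position = 0;
            while (position < size) {
                position += in.transferTo(position, size - position, out);
            }
            return position;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical demo: copies a 10,000-byte temp file and checks the contents match.
    static boolean demo() {
        try {
            byte[] data = new byte[10_000];
            Arrays.fill(data, (byte) 7);
            Path src = Files.createTempFile("copy-src", ".bin");
            Path dst = Files.createTempFile("copy-dst", ".bin");
            Files.write(src, data);
            long copied = copy(src, dst);
            return copied == data.length && Arrays.equals(data, Files.readAllBytes(dst));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The same call works with a SocketChannel as the target, which is the usual case for serving static files.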
Even though NIO supports the Dispatcher model, NIO sockets are blocking by default, and when you use them as such they can be faster than either plain IO or non-blocking NIO for a small number (< 100) of connections. I also find blocking NIO simpler to work with than non-blocking NIO.
I use non-blocking NIO when I want to use busy waiting. This allows me to have a thread which never gives up the CPU, but this is only useful in rare cases, i.e. where latency is critical.
From my benchmarks, the real strength (besides the threading model) is that it consumes less memory bandwidth (kernel <=> Java). E.g. if you open several UDP NIO multicast channels and have high traffic, you will notice that beyond a certain number of processes, each new process lowers the throughput of all running UDP receivers. With the traditional socket API I can start 3 receiving processes at full throughput. If I start a 4th, I reach a limit and the data received per second drops on all the running processes. With NIO I can start about 6 processes before this effect kicks in.
I think this is mostly because NIO kind of directly bridges to native or kernel memory, while the old socket copies buffers to the VM process space.
This is important in grid computing and high-load server applications (10 Gbit networks or InfiniBand).
When implementing a server, we can delegate each client request to its own thread. I read that the problem with this approach is that each thread will have its own stack and this would be very "expensive". The alternative approach is to make the server single-threaded and handle all client requests on this one server thread, with I/O requests issued as non-blocking requests. My doubt is that if one server thread is running multiple client requests simultaneously, won't the server code need an instruction pointer, a set of local variables, and a function call stack for each client request, and won't this again be "expensive" as before? How are we really saving?
I read that the problem with this approach is that each thread will have its own stack and this would be very "expensive".
It depends on how tight your system resources are. The typical JVM stack space allocated per thread defaults to 1 MB on many current architectures, although this can be tuned with the -Xss command line argument. How much system memory your JVM has at its disposal and how many threads you need determine whether you want to pay the high price of writing the server single-threaded.
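Besides the global -Xss flag, a stack size can also be requested per thread through the four-argument Thread constructor. The sketch below assumes a hypothetical helper class; note that the JVM treats the value as a hint and may round it up or ignore it entirely.

```java
import java.util.concurrent.atomic.AtomicInteger;

class SmallStackThreads {
    // Starts `count` threads that each request `stackBytes` of stack and
    // returns how many of them actually ran.
    static int runAll(int count, long stackBytes) {
        AtomicInteger ran = new AtomicInteger();
        Thread[] threads = new Thread[count];
        for (int i = 0; i < count; i++) {
            // The last argument is the requested stack size in bytes (a hint only).
            threads[i] = new Thread(null, ran::incrementAndGet, "worker-" + i, stackBytes);
            threads[i].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return ran.get();
    }
}
```

For example, SmallStackThreads.runAll(50, 256 * 1024) asks for 256 KB per thread instead of the default.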
My doubt is that if one server thread is running multiple client requests simultaneously, won't the server code need an instruction pointer, a set of local variables, and a function call stack for each client request, and won't this again be "expensive" as before?
It will certainly need to store per-request context information in the heap, but I suspect that it would take a lot less than 1 MB to hold the variables necessary to service the incoming connections.
Like most things, what we are really competing against when we look to optimize a program, whether to reduce memory or other system resource use, is code complexity. It is harder to get right and harder to maintain.
Although threaded programs can be highly complex, isolating a request handler in a single thread can make the code extremely simple unless it needs to coordinate with other requests somehow. Writing a high performance single threaded server would be much more complex than the threaded version in most cases. Of course, there would also be limits on the performance given that you can't make use of multiple processors.
Using non-blocking I/O, a single I/O thread can handle many connections. The I/O thread is notified when:
a client wants to connect
the write buffer of the connection's socket has space again after being full in the previous round
the read buffer of the connection's socket has data available for reading
So the thread makes use of event-multiplexing to serve the connections concurrently using a selector. A thread waits for a set of selection-keys from the selector, and the selection key contains the state of the event you have registered for and you can attach user data like a 'session' to the selection-key.
A very typical design pattern used here is the reactor pattern.
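To make the selector loop concrete, here is a minimal single-threaded echo server sketch. The class name, buffer size, and roundTrip helper are assumptions for illustration, not code from any particular product. As a simplification it never registers OP_WRITE and simply assumes the socket send buffer has room, which a real server should not do.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedSelectorException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;

class SelectorEcho implements Runnable {
    private final Selector selector;
    private final ServerSocketChannel server;

    SelectorEcho() {
        try {
            selector = Selector.open();
            server = ServerSocketChannel.open();
            server.bind(new InetSocketAddress("127.0.0.1", 0)); // port 0: the OS picks one
            server.configureBlocking(false);
            server.register(selector, SelectionKey.OP_ACCEPT);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    int port() { return server.socket().getLocalPort(); }

    @Override
    public void run() {
        try {
            while (true) {
                selector.select(); // blocks until at least one registered event is ready
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = it.next();
                    it.remove();
                    if (key.isAcceptable()) accept();
                    else if (key.isReadable()) echo(key);
                }
            }
        } catch (IOException | ClosedSelectorException e) {
            // selector closed or failed: the I/O thread stops
        }
    }

    private void accept() throws IOException {
        SocketChannel client = server.accept();
        if (client == null) return;
        client.configureBlocking(false);
        // The attachment plays the role of the per-connection 'session'.
        client.register(selector, SelectionKey.OP_READ, ByteBuffer.allocate(4096));
    }

    private void echo(SelectionKey key) {
        SocketChannel client = (SocketChannel) key.channel();
        ByteBuffer buf = (ByteBuffer) key.attachment();
        try {
            if (client.read(buf) < 0) { client.close(); return; } // peer closed
            buf.flip();
            while (buf.hasRemaining()) client.write(buf); // simplification: no OP_WRITE
            buf.clear();
        } catch (IOException e) {
            try { client.close(); } catch (IOException ignored) { }
        }
    }

    void close() {
        try { selector.close(); server.close(); }
        catch (IOException e) { throw new UncheckedIOException(e); }
    }

    // Hypothetical demo: sends msg through the server and returns what comes back.
    static String roundTrip(String msg) {
        SelectorEcho srv = new SelectorEcho();
        Thread io = new Thread(srv, "io-thread");
        io.setDaemon(true);
        io.start();
        try (Socket s = new Socket("127.0.0.1", srv.port())) {
            byte[] out = msg.getBytes(StandardCharsets.UTF_8);
            s.getOutputStream().write(out);
            byte[] in = new byte[out.length];
            new DataInputStream(s.getInputStream()).readFully(in);
            return new String(in, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            srv.close();
        }
    }
}
```

Every connection costs one registered key plus a 4 KB buffer here, rather than a thread and its stack.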
But often you want to prevent blocking the I/O thread with longer running requests. So you offload the work to a different pool of threads. Then the reactor changes to the proactor pattern.
And often you want to scale the number of I/O threads. So you can have a bunch of I/O threads in parallel.
But the total number of threads in your application should remain limited.
It all depends on what you want. Above are techniques I frequently used while working for Hazelcast.
I would not start to write all this logic from scratch. If you want to make use of networking, I would have a look at Netty. It takes care of most of the heavy lifting and has all kinds of optimizations built in.
I'm not 100% sure that a thread that doesn't write to its stack will actually consume 1 MB of physical memory. In Linux the (shared) zero page is used for a memory allocation, so no actual page frame (physical memory) is allocated unless the stack of the thread is actually written to; this will trigger a copy-on-write to do the actual allocation of a page frame. Apart from saving memory, this also prevents wasting memory bandwidth on zeroing out the stack. Memory consumption of a thread is one thing, but context switching is another problem. If you have many more threads than cores, context switching can become a real performance problem.
I run multiple game servers and I want to develop a custom application to manage them. Basically all the game servers will connect to the application to exchange data. I don't want any of this data getting lost so I think it would be best to use TCP. I have looked into networking and understand how it works however I have a question about cpu usage. More servers are being added and in the next few months it could potentially reach around 100 - 200 and will continue to grow as needed. Will new threads for each server use a lot of cpu and is it a good idea to do this? Does anyone have any suggestions on how to go about this? Thanks.
You should have a look at non-blocking I/O. With blocking I/O, each socket will consume one thread, and the number of threads in a system is limited. And even if you can create 1000+, it is a questionable approach.
With non-blocking I/O, you can serve multiple sockets with a single thread. This is a more scalable approach, and you control how many threads are running at any given moment.
More servers are being added and in the next few months it could potentially reach around 100 - 200 and will continue to grow as needed. Will new threads for each server use a lot of cpu and is it a good idea to do this?
It is a standard answer to caution away from 100s of threads and to the NIO solution. However, it is important to note that the NIO approach has a significantly more complex implementation. Isolating the interaction with a server connection to a single thread has its advantages from a code standpoint.
Modern OSes can fork 1000s of threads with little overhead aside from the stack memory. If you are sure of your scaling factors (i.e. you're not going to reach 10k connections or something) and you have the core memory, then I would say that a thread per TCP connection could work very well. I've very successfully run applications with 1000s of threads and have not seen fall-offs in performance due to context switching, which used to be the case with earlier processors/kernels.
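For comparison, a thread-per-connection server stays very short. The following is only a sketch under assumptions: the class name, the echo behaviour, and the roundTrip helper are invented for illustration, and a cached thread pool is just one possible way to hand each connection its own thread.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class ThreadPerConnectionEcho implements AutoCloseable {
    private final ServerSocket server;
    private final ExecutorService handlers = Executors.newCachedThreadPool();

    ThreadPerConnectionEcho() {
        try {
            server = new ServerSocket(0, 50, InetAddress.getLoopbackAddress());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        handlers.execute(this::acceptLoop); // the accept loop gets its own thread
    }

    int port() { return server.getLocalPort(); }

    private void acceptLoop() {
        try {
            while (true) {
                Socket client = server.accept(); // blocks until a client connects
                handlers.execute(() -> handle(client));
            }
        } catch (IOException e) {
            // accept() throws once the server socket is closed: stop accepting
        }
    }

    // All state for one connection lives on this thread's stack.
    private static void handle(Socket client) {
        try (Socket c = client) {
            InputStream in = c.getInputStream();
            OutputStream out = c.getOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            // connection dropped; only this handler thread is affected
        }
    }

    @Override
    public void close() {
        try { server.close(); } catch (IOException ignored) { }
        handlers.shutdownNow();
    }

    // Hypothetical demo: sends msg through the server and returns what comes back.
    static String roundTrip(String msg) {
        try (ThreadPerConnectionEcho srv = new ThreadPerConnectionEcho();
             Socket s = new Socket(InetAddress.getLoopbackAddress(), srv.port())) {
            byte[] out = msg.getBytes(StandardCharsets.UTF_8);
            s.getOutputStream().write(out);
            byte[] in = new byte[out.length];
            new DataInputStream(s.getInputStream()).readFully(in);
            return new String(in, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The handler reads like straight-line code, which is the maintainability advantage described above; the cost is one thread per open connection.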
I have a Java application that requires communication between different processes. The processes could run in the same JVM or in different JVMs, but they run on the same machine.
My application needs to submit "messages" to another process (same or different JVM) and forget about them, similar to a messaging queue like IBM "MQ", but simple and using only memory, with no I/O to the hard disk, for performance gains.
I'm not sure what the best approach is from a performance perspective.
I wonder if RMI is efficient in terms of performance; I think it requires some overhead.
What about a TCP/IP socket using localhost?
Any other thoughts?
I wonder if RMI is efficient in terms of performance; I think it requires some overhead.
RMI is efficient for what it does. It does much more than most people need, but it is usually more than fast enough. You should be able to get on the order of 1-3 K messages per second with a latency around 1 millisecond.
What about a TCP/IP socket using localhost?
That is always an option but with plain Java Serialization this will not be a lot faster than using RMI. How you do the serialization and deserialization is critical for high performance.
An important note is that much of the time is spent serializing and deserializing the message, something most transports don't help you with, so if you want maximum performance you have to consider an efficient marshalling strategy. Most transport protocols only benchmark raw bytes.
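To illustrate why the marshalling strategy matters, the sketch below compares default Java serialization with a hand-written fixed layout for the same message. The Tick type and its fields are hypothetical, and encoded size is used here only as a rough proxy for cost; it is not a benchmark.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;

class Marshalling {
    // Hypothetical message type used only for this comparison.
    static final class Tick implements Serializable {
        private static final long serialVersionUID = 1L;
        final long timestamp;
        final int instrumentId;
        final double price;

        Tick(long timestamp, int instrumentId, double price) {
            this.timestamp = timestamp;
            this.instrumentId = instrumentId;
            this.price = price;
        }
    }

    // Default Java serialization: writes a stream header and a class descriptor
    // in addition to the field values.
    static byte[] javaSerialize(Tick t) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(t);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Manual fixed layout: 8 (long) + 4 (int) + 8 (double) = 20 bytes, no metadata.
    static byte[] manualEncode(Tick t) {
        return ByteBuffer.allocate(20)
                .putLong(t.timestamp)
                .putInt(t.instrumentId)
                .putDouble(t.price)
                .array();
    }

    // Reads the fields back in the same order they were written.
    static Tick manualDecode(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        return new Tick(buf.getLong(), buf.getInt(), buf.getDouble());
    }
}
```

The manual layout gives up schema evolution and type information in exchange for a smaller, allocation-light encoding, so it only fits when both sides agree on the format.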
Ironically, if you are willing to use disk, it can be faster than TCP or UDP (like ZeroMQ), plus you get persistence for "free".
This library (I am the author) can perform millions of messages per second between processes with latency as low as 100 nanoseconds (350x lower than ZeroMQ): https://github.com/peter-lawrey/Java-Chronicle Advantages are:
ultra-fast serialization and deserialization, something most transport benchmarks avoid including, as it often takes much longer than the transport itself.
you can monitor what is happening between queues any time after the message was sent.
replay all messages.
the producer can be any amount of data ahead of your consumer to handle micro-burst gracefully up to the size of your disk space. e.g. the consumer can be TBs behind.
supports replication over TCP.
restart of either the consumer or producer is largely transparent.
If you are developing a server application, consider ZeroMQ. It has great performance, makes interprocess communication easier to build, and lets you build asynchronous APIs.
ZeroMQ claims fantastic performance for interprocess communication. Even better than TCP sounds great. We are considering this solution for our clustering scheme.
Pieter Hintjens gives a great answer on the performance comparison between different message brokers.
I've read several posts about java.net vs java.nio here on StackOverflow and on some blogs. But I still cannot grasp when one should prefer NIO over threaded sockets. Can you please examine my conclusions below and tell me which ones are incorrect and which ones are missing?
Since in the threaded model you need to dedicate a thread to each active connection and each thread takes around 250 kilobytes of memory for its stack, with the thread-per-socket model you will quickly run out of memory with a large number of concurrent connections, unlike with NIO.
In modern operating systems and processors a large number of active threads and context switch time can be considered almost insignificant for performance
NIO throughput can be lower because select() and poll(), used by asynchronous NIO libraries, are more expensive in high-load environments than waking up threads and putting them to sleep.
NIO has always been slower but it allows you to process more concurrent connections. It's essentially a time/space trade-off: traditional IO is faster but has a heavier memory footprint, NIO is slower but uses fewer resources.
Java has a hard limit on concurrent threads of 15000 / 30000 depending on the JVM, and this will limit the thread-per-connection model to this number of concurrent connections at most, but JVM7 will have no such limit (cannot confirm this data).
So, as a conclusion, you can have this:
If you have tens of thousands of concurrent connections, NIO is a better choice unless request processing speed is a key factor for you
If you have fewer than that, thread per connection is a better choice (given that you can afford the amount of RAM to hold the stacks of all concurrent threads up to the maximum)
With Java 7 you may want to go over NIO 2.0 in either case.
Am I correct?
That seems right to me, except for the part about Java limiting the number of threads – that is typically limited by the OS it's running on (see How many threads can a Java VM support? and Can't get past 2542 Threads in Java on 4GB iMac OSX 10.6.3 Snow Leopard (32bit)).
To reach that many threads you'll probably need to adjust the stack size of the JVM.
I still think the context switch overhead for the threads in traditional IO is significant. At a high level, you only gain performance using multiple threads if they won't contend for the same resources as much, or they spend time much higher than the context switch overhead on the resources.
The reason for bringing this up is that with new storage technologies like SSD, your threads come back to contend for the CPU much more quickly.
There is not a single "best" way to build NIO servers, but the preponderance of this particular question on SO suggests that people think there is! Your question summarizes the use cases that are suited to both options well enough to help you make the decision that is right for you.
Also, hybrid solutions are possible too! You could hand the channel off to threads when they are going to do something worthy of their expense, and stick to NIO when it is better.
I would say start with thread-per-connection and adapt from there if you run into problems.
If you really need to handle a million connections you should consider writing (or finding) a simple request broker in C (or whatever) that will use far less memory per connection than any java implementation can. The broker can receive requests asynchronously and queue them to backend workers written in your language of choice.
The backends thus only need a thread per active request, and you can just have a fixed number of them so the memory and database use is predetermined to some degree. When large numbers of requests are running in parallel the requests are made to wait a bit longer.
Thus I think you should never have to resort to NIO select channels or asynchronous I/O (NIO 2) on 64-bit systems. The thread-per-connection model works well enough and you can do your scaling to "tens or hundreds of thousands" of connections using some more appropriate low-level technology.
It is always helpful to avoid premature optimization (i.e. writing NIO code before you really have massive numbers of connections coming in) and don't reinvent the wheel (Jetty, nginx, etc.) if possible.
What is most often overlooked is that NIO allows zero-copy handling. E.g. if you listen to the same multicast traffic from within multiple processes using old-school sockets on one single server, every multicast packet is copied from the network/kernel buffer to each listening application. So if you build a grid of e.g. 20 processes, you get memory bandwidth issues. With NIO you can examine the incoming buffer without having to copy it to application space. The process then copies only the parts of the incoming traffic it is interested in.
For another application example, see http://www.ibm.com/developerworks/java/library/j-zerocopy/.
I have pretty much already decided not to use asynchronous, non-blocking Java NIO. The complexity versus benefit is very questionable in general, and I think it's not worth it in this project particularly.
But most of what I read about NIO, and comparisons with older java.io.* focuses on non-blocking, asynchronous NIO versus thread-per-connection synchronous I/O using java.io.*. However, NIO can be used in synchronous, blocking, thread-per-connection mode, which is rarely discussed it seems.
Here's the question: Is there any performance advantage of synchronous, blocking NIO versus traditional synchronous, blocking I/O (java.io.*)? Both would be thread-per-connection. How does the complexity compare?
Note that this is a general question, but at the moment I am primarily concerned with TCP socket communication.
An advantage of NIO over "traditional" IO is that NIO can use direct buffers that allow the OS to use DMA for some operations (e.g. reading from a network connection directly into a memory-mapped file) and thereby avoid copying data to intermediate buffers.
If you're moving large amounts of data in a scenario where this technique does avoid copy operations that would otherwise be performed, this can have a big impact on performance.
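As a hedged illustration of blocking NIO in thread-per-connection mode with a direct buffer, consider the sketch below. The class name, buffer size, and roundTrip helper are assumptions for illustration; whether the direct buffer actually avoids a copy depends on the JVM and operating system.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;

class BlockingNioEcho {
    // Echoes one connection until end of stream. Channels are blocking by default,
    // so read() waits for data just like a java.io stream would.
    static void serve(SocketChannel client) throws IOException {
        ByteBuffer buf = ByteBuffer.allocateDirect(8192); // off-heap buffer
        while (client.read(buf) != -1) {
            buf.flip();
            while (buf.hasRemaining()) {
                client.write(buf);
            }
            buf.clear();
        }
    }

    // Hypothetical demo: one server thread, one client, one message echoed back.
    static String roundTrip(String msg) {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            Thread handler = new Thread(() -> {
                try (SocketChannel c = server.accept()) {
                    serve(c);
                } catch (IOException e) {
                    // connection dropped or server closed
                }
            });
            handler.setDaemon(true);
            handler.start();
            try (SocketChannel ch = SocketChannel.open(server.getLocalAddress())) {
                byte[] out = msg.getBytes(StandardCharsets.UTF_8);
                ByteBuffer src = ByteBuffer.wrap(out);
                while (src.hasRemaining()) {
                    ch.write(src);
                }
                ByteBuffer in = ByteBuffer.allocate(out.length);
                while (in.hasRemaining() && ch.read(in) != -1) {
                    // keep reading until the full echo has arrived
                }
                return new String(in.array(), 0, in.position(), StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In terms of complexity this is close to the java.io version: the main difference is the explicit flip() and clear() bookkeeping on the ByteBuffer.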
It basically boils down to the number of concurrent connections and how busy those connections are. Blocking (standard thread per connection) is faster, both in latency and throughput (about twice as fast for a simple echo server). So if your system can cope with maintaining a thread for each connection (<1000 connections as a rule of thumb), go for the blocking approach. If you have lots of mostly idle connections (e.g. Comet long-poll requests or IMAP idle connections), then switching to a non-blocking architecture could help scale your system.
I cannot speak to the technology in particular, but it is not unusual for asynchronous libraries to provide synchronous operations to facilitate debugging.
For instance if you are having problems you can eliminate the asynchronous portions of the logic without rewriting your entire process. This is especially helpful since synchronous processes are typically much easier to work with.