Threading model for a high-capacity Java server

I'm creating a Java server for a fairly large simulation and I have a couple of high-level design questions.
Some background:
The server will run a simulation.
Clients will connect to the server via TCP from mobile devices and interact with data structures in the simulation. Initially I will try a simple polling scheme in the clients. I find it hard to maintain long-lived TCP connections between mobile devices and the server, and I'm not yet sure whether the clients will try to keep an open TCP connection or set one up and tear it down for each transmission.
When a client is active on a mobile device, I would like to have the client poll the server at least a few times a minute.
The simulation will keep running regardless of whether clients are connected or not.
The total number of existing clients could get very large, many thousands.
Clients mostly poll the server for simulation state, but sometimes also issue control commands to the simulation.
All messages are small in size.
I expect the server to run under Linux on multi-core CPU server hardware.
Currently I have the following idea for the threading model in the server:
The simulation logic is executed by a few threads. The simulation logic threads both read from and write to the simulation data structures.
For each client there is a Java thread performing a blocking read call to the socket for that client. When a poll command is received from a client, the corresponding client thread reads info from the simulation data structures (one client poll would typically be interested in a small subset of the total data structures) and sends a reply to the client on the client's socket. Thus, access to the data structures would need to be synchronized between the client threads and the simulation threads (I would try to have the locks on smaller subsets of the data). If a control command is received from the client, the client thread would write to the data structures.
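Roughly, I picture something like the sketch below for each client (Simulation and its methods are just placeholders, and synchronization details are omitted):

    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.net.ServerSocket;
    import java.net.Socket;

    // Sketch: one handler thread per connected client, each doing blocking reads.
    // Simulation and its methods are placeholders for the real simulation data structures.
    public class ThreadPerClientSketch {

        interface Simulation {
            String readStateFor(String pollCommand);   // reads a small subset, synchronized internally
            void applyControlCommand(String command);  // write path, also synchronized
        }

        static void serve(Simulation simulation) throws IOException {
            ServerSocket server = new ServerSocket(4000);
            while (true) {
                Socket client = server.accept();
                new Thread(new ClientHandler(client, simulation)).start();  // one thread per client
            }
        }

        static class ClientHandler implements Runnable {
            private final Socket socket;
            private final Simulation simulation;

            ClientHandler(Socket socket, Simulation simulation) {
                this.socket = socket;
                this.simulation = simulation;
            }

            public void run() {
                try (DataInputStream in = new DataInputStream(socket.getInputStream());
                     DataOutputStream out = new DataOutputStream(socket.getOutputStream())) {
                    while (true) {
                        String command = in.readUTF();              // blocks until the client sends something
                        if (command.startsWith("POLL")) {
                            out.writeUTF(simulation.readStateFor(command));
                        } else {
                            simulation.applyControlCommand(command);
                        }
                        out.flush();
                    }
                } catch (IOException e) {
                    // client went away; drop the connection and let this thread end
                }
            }
        }
    }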
For a small number of clients, I think this would work fine.
Question 1: Would this threading model hold for a large number (thousands) of connected clients? I'm not familiar with what memory/CPU overhead there would be in such a Java implementation.
Question 2: I would like to avoid having the server asynchronously send messages to the clients but in certain scenarios I may need to have the server send "update yourself now" messages asynchronously to some or many clients and I'm not quite sure how to do that. Having the simulation logic thread(s) send those messages doesn't seem right... maybe some "client notification thread pool" concept?
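For Question 2, what I have in mind is roughly the sketch below: the simulation threads only submit notification tasks, and a small pool does the actual socket writes (ClientConnection is a placeholder):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Sketch of a "client notification thread pool": simulation threads never write to sockets
    // directly, they only hand off notification tasks to this small pool.
    class ClientNotifier {

        // Placeholder for whatever object wraps a client's socket/output stream.
        interface ClientConnection {
            void sendUpdateNotification();   // writes an "update yourself now" message to the client
        }

        private final ExecutorService pool = Executors.newFixedThreadPool(8);  // size is a guess

        // Called from a simulation thread when some clients should refresh themselves.
        void notifyClients(Iterable<ClientConnection> clients) {
            for (ClientConnection client : clients) {
                pool.submit(client::sendUpdateNotification);
            }
        }

        void shutdown() {
            pool.shutdown();
        }
    }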

You ask two questions; I'll answer the first.
I've previously written an application that involved thousands of threads in a single process. We did once run into a problem with the maximum number of threads on the Linux server; for us, I think the limit was about 1000 threads. This affected our Java application because Java threads use native threads. We raised the limit, and the application scaled without issue to about 2000 threads, which was all we needed; I don't know what would have happened had we needed to scale it much higher.
The fact that the default maximum number of threads was 1000 suggests that it might not be wise to run too many thousands of threads on a single Linux server. I believe the primary issue is that sufficient memory for a stack needs to be allocated for each thread.
Our intended long-term fix was to change to an architecture where threads from a thread pool each serviced multiple sockets. This really isn't too much of an issue; for each socket, the thread just has to process any pending messages before going on to the next socket. You would have to be careful about synchronizing memory access, but your application already needs to do that, since the simulation data is shared between multiple threads, so that part would not be a huge change.

Related

How is a single-threaded server able to cater to multiple clients even through non-blocking I/O?

When implementing a server, we can delegate one client request to one thread. I read that the problem with this approach is that each thread will have its own stack, and this would be very "expensive". An alternative approach is to make the server single-threaded and handle all client requests on this one server thread using non-blocking I/O. My doubt is: if one server thread is handling multiple client requests simultaneously, won't the server code still need an instruction pointer, a set of local variables, and a function call stack for each client request, and won't this again be "expensive" as before? How are we really saving?
I read that the problem with this approach is that each thread will have its own stack and this would be very "expensive".
Depends on how tight your system resources are. The typical JVM stack space allocated per thread defaults to 1 MB on many current architectures, although this can be tuned with the -Xss command line argument. How much system memory your JVM has at its disposal and how many threads you need determine whether you want to pay the high price of writing the server single-threaded.
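For example (server.jar is a placeholder and the numbers are only rough), at the 1 MB default, roughly 4 GB set aside for stacks gives you about 4,000 threads, while shrinking the stack to 256 KB fits about 16,000 threads in the same space:

    java -Xss256k -jar server.jar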
My doubt is: if one server thread is handling multiple client requests simultaneously, won't the server code still need an instruction pointer, a set of local variables, and a function call stack for each client request, and won't this again be "expensive" as before?
It will certainly need to store per-request context information on the heap, but I suspect it would take far less than 1 MB of information to hold the variables necessary to service each incoming connection.
Like most things, what we are really competing against when we look to optimize a program, whether to reduce memory or other system resource use, is code complexity. It is harder to get right and harder to maintain.
Although threaded programs can be highly complex, isolating a request handler in a single thread can make the code extremely simple, unless it needs to coordinate with other requests somehow. Writing a high-performance single-threaded server would be much more complex than the threaded version in most cases. Of course, there would also be limits on performance, given that you can't make use of multiple processors.
Using non-blocking I/O, a single I/O thread can handle many connections. The I/O thread will get a notification when:
a client wants to connect
the write buffer of a connection's socket has space again, after having been full the previous round
the read buffer of a connection's socket has data available for reading
So the thread uses event multiplexing to serve the connections concurrently using a selector. The thread waits for a set of selection keys from the selector; each selection key contains the state of the events you have registered for, and you can attach user data such as a 'session' to the selection key.
A very typical design pattern used here is the reactor pattern.
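A bare-bones sketch of such a reactor loop with java.nio (error handling, partial reads/writes and the actual protocol are omitted; Session is just a placeholder for the per-connection state you attach):

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.nio.channels.ServerSocketChannel;
    import java.nio.channels.SocketChannel;
    import java.util.Iterator;

    // Sketch of a single-threaded reactor: one Selector multiplexing all connections.
    // Session is a placeholder for whatever per-connection state you attach to the key.
    public class ReactorSketch {
        public static void main(String[] args) throws IOException {
            Selector selector = Selector.open();
            ServerSocketChannel server = ServerSocketChannel.open();
            server.bind(new InetSocketAddress(4000));
            server.configureBlocking(false);
            server.register(selector, SelectionKey.OP_ACCEPT);

            while (true) {
                selector.select();                                   // block until some channel is ready
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = it.next();
                    it.remove();
                    if (key.isAcceptable()) {
                        SocketChannel client = server.accept();
                        client.configureBlocking(false);
                        client.register(selector, SelectionKey.OP_READ, new Session());  // attach session
                    } else if (key.isReadable()) {
                        SocketChannel channel = (SocketChannel) key.channel();
                        Session session = (Session) key.attachment();
                        ByteBuffer buffer = ByteBuffer.allocate(512);
                        if (channel.read(buffer) == -1) {            // client closed the connection
                            key.cancel();
                            channel.close();
                        } else {
                            buffer.flip();
                            session.onData(buffer);                  // placeholder: decode and handle
                        }
                    }
                }
            }
        }

        static class Session {
            void onData(ByteBuffer data) { /* decode a message and handle it */ }
        }
    }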
But often you want to prevent longer-running requests from blocking the I/O thread. So you offload the work to a different pool of threads, and the reactor changes into the proactor pattern.
And often you want to scale the number of I/O threads. So you can have a bunch of I/O threads in parallel.
But the total number of threads in your application should remain limited.
It all depends on what you want. Above are techniques I frequently used while working for Hazelcast.
I would not start to write all this logic from scratch. If you want to make use of networking, I would have a look at Netty. It takes care of most of the heavy lifting and has all kinds of optimizations built in.
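For illustration, a minimal Netty 4 server skeleton looks roughly like this (a sketch; EchoHandler stands in for your own protocol handlers):

    import io.netty.bootstrap.ServerBootstrap;
    import io.netty.channel.ChannelHandlerContext;
    import io.netty.channel.ChannelInboundHandlerAdapter;
    import io.netty.channel.ChannelInitializer;
    import io.netty.channel.EventLoopGroup;
    import io.netty.channel.nio.NioEventLoopGroup;
    import io.netty.channel.socket.SocketChannel;
    import io.netty.channel.socket.nio.NioServerSocketChannel;

    // Minimal Netty 4 server skeleton: Netty owns the selector loops (event loop groups),
    // you only supply the per-connection pipeline.
    public class NettyServerSketch {
        public static void main(String[] args) throws InterruptedException {
            EventLoopGroup boss = new NioEventLoopGroup(1);    // accepts connections
            EventLoopGroup workers = new NioEventLoopGroup();  // I/O threads, typically 2 * cores
            try {
                ServerBootstrap bootstrap = new ServerBootstrap()
                        .group(boss, workers)
                        .channel(NioServerSocketChannel.class)
                        .childHandler(new ChannelInitializer<SocketChannel>() {
                            @Override
                            protected void initChannel(SocketChannel ch) {
                                ch.pipeline().addLast(new EchoHandler());  // your protocol handlers go here
                            }
                        });
                bootstrap.bind(4000).sync().channel().closeFuture().sync();
            } finally {
                boss.shutdownGracefully();
                workers.shutdownGracefully();
            }
        }

        // Placeholder handler: echoes back whatever arrives.
        static class EchoHandler extends ChannelInboundHandlerAdapter {
            @Override
            public void channelRead(ChannelHandlerContext ctx, Object msg) {
                ctx.writeAndFlush(msg);
            }
        }
    }

The NioEventLoopGroup plays the role of the selector-based I/O threads described above, so you get the reactor/proactor structure without writing it yourself.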
I'm not 100% sure that a thread which never writes to its stack will actually consume 1 MB of physical memory. In Linux, the (shared) zero page is used for memory allocations, so no actual page frame (physical memory) is allocated until the thread's stack is actually written to; that write triggers a copy-on-write which does the actual allocation of a page frame. Apart from saving memory, this also prevents wasting memory bandwidth on zeroing out the stack.
Memory consumption of a thread is one thing, but context switching is another problem: if you have many more threads than cores, context switching can become a real performance problem.

Netty - The best way to send messages concurrently in Java

I have 150 threads.
Each thread has a Netty client and is connected to the server.
Should I use another 150 threads to send?
Should I use 75 threads to send?
Should I use no extra threads to send?
My local test is not meaningful (I can't run the server with more than 50).
Please help me.
There is no golden rule for this. Depending on your application, you may find that:
just one connection with one thread is enough to use all the resources of the machine.
Using somewhere between the number of CPUs and 2 * the number of CPUs is enough to use all the CPU of the machine.
If you have synchronous requests (instead of asynchronous ones) and a high network latency you might find that you are spending most of the time waiting for data in which case more connections would help mitigate this latency.
My preference is to allow asynchronous messaging/requests and allow a single connection to use all the CPU/resources on the machine if it makes sense, because while you might get better results when you test with 150 busy connections, in the real world they might not all be active at once or to the same degree.

Performant multithreading with clients on a socket

At the moment I have a project where we are developing a Java Texas Hold'em application. Of course this application is based on a client-server socket system. I am saving all joined clients (I get them with the socketServer.accept() method) in an ArrayList. At the moment I create one thread for each joined client, which permanently checks whether the client has sent any data to the server. My classmate told me it would be much better if I created one big thread that iterates through the whole client ArrayList and checks every client's InputStreamReader. Should I trust him?
Creating a thread per Socket isn't a good idea if your application will have a lot of clients.
I'd recommend looking into external libraries and how they handle their connections. Examples: http://netty.io/, https://mina.apache.org/
Neither approach is feasible. Having a thread per connection will quickly exhaust resources in any loaded system. Having one thread pinging all connections in a loop will produce terrible performance.
The proper way is to multiplex on the sockets: have a sane number of threads (16, why not), distribute all sockets between those threads, and multiplex on those sockets using a select() variant, whichever is available in Java for this.
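In Java that select() variant is the java.nio Selector. A rough sketch of the idea, with accepted sockets handed out round-robin to a fixed set of I/O threads (the actual message handling is omitted):

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.nio.channels.ServerSocketChannel;
    import java.nio.channels.SocketChannel;
    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;

    // Sketch: a fixed number of I/O threads, each owning its own Selector and multiplexing
    // a subset of all sockets, instead of one thread per socket.
    public class MultiplexedServerSketch {

        static class IoWorker implements Runnable {
            private final Selector selector;
            private final Queue<SocketChannel> pending = new ConcurrentLinkedQueue<>();

            IoWorker() throws IOException {
                this.selector = Selector.open();
            }

            void adopt(SocketChannel channel) {
                pending.add(channel);
                selector.wakeup();                        // break out of select() to register it
            }

            public void run() {
                try {
                    while (true) {
                        selector.select();
                        for (SocketChannel ch; (ch = pending.poll()) != null; ) {
                            ch.register(selector, SelectionKey.OP_READ);
                        }
                        for (SelectionKey key : selector.selectedKeys()) {
                            if (key.isReadable()) {
                                // read from (SocketChannel) key.channel() and process pending messages
                            }
                        }
                        selector.selectedKeys().clear();
                    }
                } catch (IOException e) {
                    // log and let this worker die
                }
            }
        }

        public static void main(String[] args) throws IOException {
            int ioThreads = 16;
            IoWorker[] workers = new IoWorker[ioThreads];
            for (int i = 0; i < ioThreads; i++) {
                workers[i] = new IoWorker();
                new Thread(workers[i], "io-" + i).start();
            }

            ServerSocketChannel server = ServerSocketChannel.open();
            server.bind(new InetSocketAddress(4000));
            int next = 0;
            while (true) {
                SocketChannel client = server.accept();     // only this thread accepts
                client.configureBlocking(false);
                workers[next++ % ioThreads].adopt(client);  // round-robin across the I/O threads
            }
        }
    }

Each worker only ever touches its own Selector, so the only cross-thread handoff is the pending queue plus wakeup().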

XDBC app server showing only 2 active threads in the MarkLogic admin console

Our multi-threaded Java application is using the Java XCC library. In the MarkLogic admin console, under the Status tab, only 2 threads are shown as active while the application is running; that is most probably the reason for the bottleneck in our project. Please advise what is wrong here.
To run XCC requests in parallel effectively, you need to make sure you are using a separate Session for each thread. See:
https://docs.marklogic.com/javadoc/xcc/com/marklogic/xcc/Session.html
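A rough sketch of that pattern (the connection URI and query are placeholders): share one ContentSource across the application, but give each task or thread its own Session.

    import java.net.URI;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import com.marklogic.xcc.ContentSource;
    import com.marklogic.xcc.ContentSourceFactory;
    import com.marklogic.xcc.Session;

    // Sketch: one shared ContentSource, but a separate Session per task/thread,
    // so requests can actually run in parallel on the server.
    public class XccParallelSketch {
        public static void main(String[] args) throws Exception {
            ContentSource contentSource = ContentSourceFactory.newContentSource(
                    new URI("xcc://user:password@host:8000/mydb"));   // placeholder URI

            ExecutorService pool = Executors.newFixedThreadPool(16);
            for (int i = 0; i < 100; i++) {
                pool.submit(() -> {
                    Session session = contentSource.newSession();     // per-task Session
                    try {
                        session.submitRequest(session.newAdhocQuery("xdmp:random()")).close();  // placeholder query
                    } catch (Exception e) {
                        // log/handle the failed request
                    } finally {
                        session.close();
                    }
                });
            }
            pool.shutdown();
        }
    }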
Having only 2 active threads running is not necessarily a sign of a problem; it's possible that your requests are being processed as fast as you issue them and read the response. If your queries are fast enough, there is no need for more threads. Without more information about your queries, response times and server load, it's not possible to say whether there is a bottleneck or not. How many threads are you running? Compare the response time as you increase the number of threads. Check that you have sufficient network I/O so that your requests are not bottlenecked in the network layer.
I suggest profiling your queries and using the Performance History console to see if the server is running at high utilization. Try increasing the number of client threads, possibly running them from different servers.

Profiling Netty Performance

I'm writing a Netty application. The application is running on a 64-bit, eight-core Linux box.
The Netty application is a simple router that accepts requests (incoming pipeline), reads some metadata from the request, and forwards the data to a remote service (outgoing pipeline).
This remote service will return one or more responses to the outgoing pipeline. The Netty application will route the responses back to the originating client (the incoming pipeline).
There will be thousands of clients. There will be thousands of remote services.
I'm doing some small-scale testing (ten clients, ten remote services) and I don't see the sub-10-millisecond performance I'm expecting at the 99.9th percentile. I'm measuring latency from both the client side and the server side.
I'm using a fully async protocol that is similar to SPDY. I capture the time (I just use System.nanoTime()) when we process the first byte in the FrameDecoder. I stop the timer just before we call channel.write(). I am measuring sub-millisecond time (99.9 percentile) from the incoming pipeline to the outgoing pipeline and vice versa.
I also measured the time from the first byte in the FrameDecoder to when a ChannelFutureListener callback was invoked on the (above) message.write(). The time was in the high tens of milliseconds (99.9th percentile), but I had trouble convincing myself that this was useful data.
My initial thought was that we had some slow clients. I watched channel.isWritable() and logged when this returned false. This method did not return false under normal conditions.
Some facts:
We are using the NIO factories. We have not customized the worker size
We have disabled Nagel (tcpNoDelay=true)
We have enabled keep alive (keepAlive=true)
CPU is idle 90+% of the time
Network is idle
The GC (CMS) is being invoked every 100 seconds or so for a very short amount of time
Is there a debugging technique that I could follow to determine why my Netty application is not running as fast as I believe it should?
It feels like channel.write() adds the message to a queue and we (application developers using Netty) don't have visibility into this queue. I don't know whether the queue is a Netty queue, an OS queue, a network card queue or something else. In any case, I'm reviewing examples of existing applications and I don't see any anti-patterns I'm following.
Thanks for any help/insight
Netty creates Runtime.getRuntime().availableProcessors() * 2 workers by default, 16 in your case. That means you can handle up to 16 channels simultaneously; other channels will wait until you return from the ChannelUpstreamHandler.handleUpstream/SimpleChannelHandler.messageReceived handlers, so don't do heavy operations in these (I/O) threads, otherwise you can stall the other channels.
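If you are on Netty 3, which those handler names suggest, the usual way to move heavy work off the I/O threads is an ExecutionHandler in front of your business handler. A rough sketch (pool sizes and the handler are placeholders):

    import java.net.InetSocketAddress;
    import java.util.concurrent.Executors;
    import org.jboss.netty.bootstrap.ServerBootstrap;
    import org.jboss.netty.channel.ChannelHandlerContext;
    import org.jboss.netty.channel.ChannelPipeline;
    import org.jboss.netty.channel.ChannelPipelineFactory;
    import org.jboss.netty.channel.Channels;
    import org.jboss.netty.channel.MessageEvent;
    import org.jboss.netty.channel.SimpleChannelHandler;
    import org.jboss.netty.channel.socket.nio.NioServerSocketChannelFactory;
    import org.jboss.netty.handler.execution.ExecutionHandler;
    import org.jboss.netty.handler.execution.OrderedMemoryAwareThreadPoolExecutor;

    // Netty 3 sketch: the ExecutionHandler moves everything after it in the pipeline onto
    // a separate thread pool, so slow application logic no longer blocks the I/O workers.
    public class OffloadSketch {
        public static void main(String[] args) {
            ServerBootstrap bootstrap = new ServerBootstrap(new NioServerSocketChannelFactory(
                    Executors.newCachedThreadPool(),    // boss threads (accept)
                    Executors.newCachedThreadPool()));  // I/O worker threads (2 * cores by default)

            final ExecutionHandler executionHandler = new ExecutionHandler(
                    new OrderedMemoryAwareThreadPoolExecutor(16, 1 << 20, 1 << 24));  // placeholder sizes

            bootstrap.setPipelineFactory(new ChannelPipelineFactory() {
                public ChannelPipeline getPipeline() {
                    // handlers added after executionHandler run on the business thread pool
                    return Channels.pipeline(executionHandler, new BusinessHandler());
                }
            });
            bootstrap.bind(new InetSocketAddress(4000));
        }

        // Placeholder for the application handler that does the slow work.
        static class BusinessHandler extends SimpleChannelHandler {
            @Override
            public void messageReceived(ChannelHandlerContext ctx, MessageEvent e) {
                ctx.getChannel().write(e.getMessage());  // echo back, stands in for real routing logic
            }
        }
    }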
You haven't specified your Netty version, but it sounds like Netty 3.
Netty 4 is now stable, and I would advise that you update to it as soon as possible.
You have specified that you want ultra-low latency times, as well as tens of thousands of clients and services. This doesn't really mix well. NIO inherently has somewhat higher latency than OIO. However, the pitfall here is that OIO probably won't be able to reach the number of clients you are hoping for. Nonetheless, I would use an OIO event loop / factory and see how it goes.
I myself have a TCP server which takes around 30 ms on localhost to send, receive and process a few TCP packets (measured from the time the client opens a socket until the server closes it). If you really do require such low latencies, I suggest you switch away from TCP due to the SYN/ACK spam that is required to open a connection; this is going to use a large part of your 10 ms.
Measuring time in a multi-threaded environment is very difficult if you are using simple things like System.nanoTime(). Imagine the following on a 1 core system:
Thread A is woken up and begins processing the incoming request.
Thread B is woken up and begins processing the incoming request. But since we are working on a 1 core machine, this ultimately requires that Thread A is put on pause.
Thread B is done and performed perfectly fast.
Thread A resumes and finishes, but appears to have taken twice as long as Thread B, because you actually measured the time it took for Thread A + Thread B to finish.
There are two approaches on how to measure correctly in this case:
You can enforce that only one thread is used at all times.
This allows you to measure the exact performance of the operation, provided the OS does not interfere, because in the above example Thread B could be outside of your program as well. A common approach in this case is to take the median of many runs to filter out the interference, which will give you an estimation of the speed of your code. You can, however, assume that on an otherwise idle multi-core system there will be another core to process background tasks, so your measurement will usually not be interrupted. Setting this thread to high priority helps as well.
You use a more sophisticated tool that plugs into the JVM to measure the individual executions and the time they took, which will remove outside interference almost completely. One tool would be VisualVM, which is already integrated into NetBeans and available as a plugin for Eclipse.
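A cheap complementary check is to compare wall-clock time with per-thread CPU time: a large gap between the two tells you the thread spent much of the measured interval descheduled rather than running (sketch; doWork() is a placeholder):

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadMXBean;

    // Sketch: wall-clock time (System.nanoTime) includes time spent descheduled while other
    // threads run; per-thread CPU time does not, so a large gap hints at scheduling effects.
    public class TimingSketch {
        public static void main(String[] args) {
            ThreadMXBean threads = ManagementFactory.getThreadMXBean();

            long wallStart = System.nanoTime();
            long cpuStart = threads.getCurrentThreadCpuTime();   // ns of CPU used by this thread (-1 if unsupported)

            doWork();                                            // placeholder for the code under test

            long cpuNanos = threads.getCurrentThreadCpuTime() - cpuStart;
            long wallNanos = System.nanoTime() - wallStart;
            System.out.printf("wall: %.3f ms, cpu: %.3f ms%n", wallNanos / 1e6, cpuNanos / 1e6);
        }

        private static void doWork() {
            // placeholder workload
            long sum = 0;
            for (int i = 0; i < 10_000_000; i++) sum += i;
            if (sum == 42) System.out.println("unlikely");
        }
    }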
As general advice: it is not a good idea to use more threads than cores, unless you know that those threads will be blocked by some operation frequently. This is not the case when using non-blocking NIO for I/O operations, as there is no blocking.
Therefore, in your special case, you would actually reduce the performance for clients, as explained above, because communication would be put on hold for up to 50% of the time under high load. In the worst case, that could cause a client to run into a timeout, as there is no guarantee when a thread is actually resumed (unless you explicitly request fair scheduling).
