In a Java servlet environment, what are the factors that are the bottleneck for number of simultaneous users.
Number of HTTP connections the server can allow per port
Number of HTTP connections the server can allow across several ports (I can have multiple WAS profiles on several HTTP ports)
Number of servlets in pool
Number of threads configured for WAS to use to service connections
RAM available to server (is there any any correletation between number of service threads assuming 0-memory leak in application)
Are there any other factors?
Edited:
To leave business logic out of the picture, assume have only one servlet printing one line on Log4j.
Can my Tomcat server handle 6000 simultaneous HTTP connections? Why
not (file handles? CPU time per request?)?
Can I have thread pool size as 5000 (do idle threads cost CPU/RAM)?
Can I have oracle connection pool size as 500 connections (do idle
connections cost CPU/RAM)?
Is the amount of garbage that is generated for each connection have an impact? For example, if for each HTTP connection 20KB of objects are created and left behind by Tomcat.. then by the time 2500 requests are processed 100MB heap would be used and this may trigger a GC pause of 300ms.
Can we say something like this: if Tomcat uses 0.2 sec of CPU time for processing a single HTTP request, then it would be able to handle roughly 500 http connections in a second. So, 6000 connections would need 5 seconds.
Interesting question, If we leave apart all the performance deciding attributes finally it boils down to how much work you are doing in the servlet or how much time it takes if it has highest I/O, CPU and memory. Now lets move down with you list with the above statement in mind;-
Number of HTTP connections the server can allow per port
There are limit for file descriptors but that again gets triggered by how much time a servlet is taking complete a request or how much time it takes from request first byte receive to finish sending the entire response. Because if it take only 1ms and you are using Netty and persistent connection, you can reach a really high >> 6000.
Number of servlets in pool
Theoretically >> 6000. But how many thread are processing your requests? Is there a thread pool that is burning your requests ? So you want to increase threads, but how much lets say 2000 concurrent threads. Is your CPU behaving poor with context switching ? Is it I/O bound? if yes it makes sense to context switch but then you will be hitting those network limits because a lot of thread waiting on network I/O, so ultimately how much time you spent on a piece of work.
DB
If it oracle, bless you with connection management, you definitely need rigorous monitoring here. Now this is just another limiting factor and can be considered as an just another blocking I/O. By definition of I/O, latency/throughput matters and becomes a bottleneck the moment it becomes the bigger than the smallest piece of work.
So, finally, you need to break down following or more attributes for all the servlets
Is it CPU bound? If yes, how much cycles it takes or can it be converted safely to some time unit. e.g. 1ms for just the compute piece of work.
Is it I/O bound, If yes similarly find the unit.
and others
A long list of what you have, e.g. CPU, Memory, GB/s
Now you know how much work needs to be done and all you do is divide by what you have and keep tuning , so that you find out the optimal and also find out what else attribute you have not considered and consider them.
The biggest bottleneck I have experienced is the time it takes to process the request.
The faster you can service a request, the more connections you can handle.
It's a difficult question to answer due to every application being different.
To figure this out for an application I support, I created a unit test that spawns many threads and I watch the memory usage in VisualVM in eclipse.
You can see how your memory consumption changes with the number of threads in use.
And you should be able to get a thread dump and see how much memory the thread is using.
You can extrapolate an average out to understand how much RAM you might need for N number of users.
The bottleneck will be a moving target since you'll optimize one area until you can scale larger, then another area will become your bottleneck.
If the response time of the servlet is a bottleneck, you'll could use some queuing mathematics to determine how many requests can be queued optimally based on the avg response time.
http://www4.ncsu.edu/~hp/SSME_QueueingTheory.pdf
Hope this helps.
Updated to address your additional questions:
Can my Tomcat server handle 6000 simultaneous HTTP connections? Why not (file handles? CPU time per request?)?
It's possible but probably not. Also you should probably add a web layer in front of the application server if you plan on doing high volume.
Suppose you have 6000 users all pounding away on your application. Each request a user sends only exists on the server for a moment [hopefully], and your peak thread count may have never reached over 20.
I'd recommend setting up some monitoring to understand how your application performs under real use cases. Check out http://Hawt.io which uses Jolokia to grab JMX metrics via http.
If your serious about analytics I'd recommend using something like Graphite to aggregate your JMX metrics. https://github.com/graphite-project/graphite-web
I've written a collector for Jolokia to send metrics to Carbon/Graphite, and may be able to open-source it with approval from my management. Let me know if you are interested.
Can I have thread pool size as 5000 (do idle threads cost CPU/RAM)?
Idle threads are not much to worry about, though setting your thread pool too high could allow your application server to receive too many requests. If this happens you may end up flooding your DB with connections it cant handle, or your memory allocation may not be enough to handle so many requests. This could start overall application performance degradation.
Set too low, and your app server could start queuing request again causing performance degradation.
It's normally to have some queuing during spikes or high volume times, but you don't want to overload your application server. Check out queuing theory to understand more about this.
Also, this is where having a web server in front of the app server could help you. If you have Apache serve your static content, only dynamic requests will reach the application servers in most cases.
Tuning is very specific to your individual application. I'd recommend staying with the defaults and just optimize your code until you can gather enough data to know which knob should be turned.
Can I have oracle connection pool size as 500 connections (do idle connections cost CPU/RAM)?
Same situation as the application thread pool size. Though your pool size for DB should be much smaller than the app thread count.
500 would be too high for most web applications unless you have very high volume, in which case you may need a DB cluster environment like Oracle RAC.
If the pool is set too high and you start using a lot of connections, your DB hardware will not be able to keep up and you will end up with performance problem on the database server.
The time it takes for a query to return may increase, in turn causing your application response time to increase. The "log jam" effect.
Use profiling or metrics to determine the avg number of active DB connections under normal use, and use that as a baseline for determining the max allowed.
Is the amount of garbage that is generated for each connection have an impact? For example, if for each HTTP connection 20KB of objects are created and left behind by Tomcat.. then by the time 2500 requests are processed 100MB heap would be used and this may trigger a GC pause of 300ms.
The numbers would be different, but yes. Also remember the Full GC are more concern. The incremental GCs will not pause your application. Check out "concurrent mark and sweep" and "Garbage first".
Can we say something like this: if Tomcat uses 0.2 sec of CPU time for processing a single HTTP request, then it would be able to handle roughly 500 http connections in a second. So, 6000 connections would need 5 seconds.
It's not quite that easy as each request is coming in, there are also some being processed and completed. Check out queuing theory to understand this better.
http://www4.ncsu.edu/~hp/SSME_QueueingTheory.pdf
There is another common bottleneck : the size of the database connection pool. But I have an additional remark : when you exhaust the number of allowed HTTP connections, of the number of threads allowed to serve request, you will only reject some requests. But when you exhaust memory (too much sessions with too much data for example), you can crash the whole application.
The difference is that in the case of heavy load for a short time, when load later falls down :
in first case, the application is up and can serve requests normally
in second case the application is down and must be restarted
EDIT :
I forgot to remember real use cases. The biggest problem I ever found for serving numerous concurrent connections is the quality of the database requests (assuming you use a database). There is not a direct impact since there is no maximum number, but you can easily hog all database server resources. Common examples of poor database requests :
no index on a table with a large number of rows
a request (on a big table) that makes no use of any index
the n+1 syndrome : with a ORM when you map a one to many relation to a collection no eagerly when you always need data from the collection
the load full database syndrome : with a ORM when you map all relations as eager, any single request ends in loading a high quantity of dependent data.
What is worse with those problems, is that they can cause no harm in tests when the database is young because there are not that many rows, but with time and increasing number of rows performances fall giving a unusable application over few users.
Number of HTTP connections the server can allow per port
Unlimited except by kernel resources, e.g. FDs, socket buffer soace, etc.
Number of HTTP connections the server can allow across several ports (I can have multiple WAS profiles on several HTTP ports)
As the number of connections per port is unlimited, this irrelevant.
Number of servlets in pool
Irrelevant except insofar as it increases the rate of incoming requests.
Number of threads configured for WAS to use to service connections
Relevant in an indirect way, see below.
RAM available to server (is there any any correletation between number of service threads assuming 0-memory leak in application)
Relevant if it limits the number of threads below the configured number of threads mentioned above.
The fundamental limitation is request service time. The shorter, the better. The longer it is, the longer the thread is tied up in that request, the longer wait queues get, ... Queuing theory dictates that the 'sweet spot' is no more than 70% server utilization. Beyond that, wait times grow rapidly with increasing utilization.
So anything that contributes to request service time is significant: for example, thread pool size, connection pool size, concurrency bottlenecks, ...
You should also consider that the use case itself is limiting the amount of concurrency. Imagine a collaborative environment where the order of actions matters. This forces you to synchronize actions - even if you would have been able to process all of them at once.
In java land this could be a simple thing as sharing a single resource which is using blocking access. (e.g. shared Random number generators (not per thread), shared Vectors, concurrent structures like ConcurrentHashMap etc.).
The more synchronization the less you will be able to fully utilize your server hardware.
So apart from running out of memory or saturating the CPU or hitting the garbage collection limit this synchronization might be a problem which does not only need to be solved in your code but maybe even requires you to soften some requirements of the high level workflow.
Seeing point 6, you can use these tools to see if your hardware is being the bottleneck: Assuming that you're on linux, you can use VmStat to see some statistics on your RAM usage, top or atop (depending on your distro) to see processes taking a toll in your CPU and RAM, nload and iftop to see what is consuming network bandwith, and iotop to see what is reading and writing to your disk.
Related
When implementing a server, we can delegate one client request to one thread. I read that problem with this approach is that each thread will have its own stack and this would be very "expensive". Alternative approach is that have server be single threaded and implement all client requests on this one server thread with I/O requests as non-blocking request. My doubt is that if one server thread is running multiple client requests simultaneously, won't server code have instruction pointer, set of local variables, function calls stacks for each client request, then won't this again be "expensive" as before. How are we really saving?.
I read that problem with this approach is that each thread will have its own stack and this would be very "expensive".
Depends on how tight your system resources are. The typical JVM stack-space allocated per thread defaults to 1mB on many current architectures although this can be tuned with the -Xss command line argument. How much system memory your JVM has at its disposal and how many threads you need determines if you want to pay the high price of writing the server single threaded.
My doubt is that if one server thread is running multiple client requests simultaneously, won't server code have instruction pointer, set of local variables, function calls stacks for each client request, then won't this again be "expensive" as before
It will certainly will need to store per request context information in the heap but I suspect that it would be a lot less than 1mB worth of information to hold the variables necessary to service the incoming connections.
Like most things, what we are really competing against when we look to optimize a program, whether to reduce memory or other system resource use, is code complexity. It is harder to get right and harder to maintain.
Although threaded programs can be highly complex, isolating a request handler in a single thread can make the code extremely simple unless it needs to coordinate with other requests somehow. Writing a high performance single threaded server would be much more complex than the threaded version in most cases. Of course, there would also be limits on the performance given that you can't make use of multiple processors.
Using non blocking I/O, A single I/O thread can handle many connections. The I/O thread will get notification when:
client wants to connect
the write buffer of the socket of the connection has space when the write buffer of the socket was full the previous round.
the read buffer of the socket of the connection has data available for reading
So the thread makes use of event-multiplexing to serve the connections concurrently using a selector. A thread waits for a set of selection-keys from the selector, and the selection key contains the state of the event you have registered for and you can attach user data like a 'session' to the selection-key.
A very typical design pattern used here is the reactor pattern.
But often you want to prevent blocking the I/O thread with longer running requests. So you offload the work to a different pool of threads. Then the reactor changes to the proactor pattern.
And often you want to scale the number of I/O threads. So you can have a bunch of I/O threads in parallel.
But the total number of threads in your application should remain limited.
It all depends on what you want. Above are techniques I frequently used while working for Hazelcast.
I would not start to write all this logic from scratch. If you want to make use of networking, I would have a look at Netty. It takes care of most of the heavy lifting and has all kinds of optimizations built in.
I'm not 100% sure if the a thread that doesn't write to its stack will actually consume 1MB of physical memory. In Linux the (shared) zero-page is used for a memory allocation, so no actual page frame (physical memory) is allocated unless the stack of the thread is actually written to; this will trigger a copy on write to do the actual allocation of a page-frame. Apart from saving memory, this also prevents wasting memory bandwidth on zeroing out the the stack. Memory consumption of a thread is one thing; but context switching is another problem. If you have many more threads than cores, context switching can become a real performance problem.
I'm just curious how to solve the connection-pooling problem in the scalable java application.
Imagine I have java web application with HikariCP set up (max pool size is 20) and PosgtreSQL with max allowed connections 100.
And now I want to implement scalability approach for my web app (no matter how) end even with autoscaling. So I don't know how many web app replicas will be eventually, it may dynamically change (caused by some reasons e.g. cluster workload).
But there is the problem. When I create more then 5 web app replicas cause my total connection count exceeds max allowed connection.
Are there any best practices to solve this problem (except evident increasing max allowed connections/decreasing pool size)?
Thanks
You need an orchestrator over the web application. It would be responsible for the scaling in-out and it will manage the connections in order not to exceed the limitation of 100. It will open-close the connections according to the traffic.
Nevertheless, my recommendation is to take into consideration the migration into a no-SQL database which is more suitable solution for scalability and performance.
I'll start by saying that whatever you do, as long as you're restricted by 100 connections to your DB - it will not scale!
That said, you can optimize and "squeeze" performance out of it by applying a couple of known tricks. It's important to understand the trade-offs (availability vs. consistency, latency vs. throughput and etc):
Caching: if you can anticipate certain select queries you can calculate them offline (maybe even from a replica?) and cache the results. The tradeoff: the user might get results which are not up-to-date
Buffering/throttling: all updates/inserts go to a queue and there are only a few workers which are allowed to pull from the queue and update the DB. Tradeoff: you get more availability but becomes "eventually consistent" (since updates won't be visible right away).
It might come to that you'll have to run the selects in async manner as well, which means that the user submits a query, and when it's ready it'll be "pushed" back to the client (or the client can keep "polling" every few seconds). It can be implemented with a callback as well.
By separating the updates (writes) from reads you'll be able to get more performance by creating replicas that are "read only" and which can be used by the webservers for read-queries.
We are developing a site that will allow users to send semi-real-time events to other users. The UI will display an icon when there is a new event for a user (pretty standard stuff).
I have read that periodic short polling does not scale as well as websockets because it puts more pressure on the web server. I am not quite sure why this would be the case?
We are using tomcat NIO (which does not have a one-to-one connection per thread ratio). As I understand it, Tomcat NIO is pretty good at handling longer HTTP connection timeouts with a small number of threads.
So, if the periodic polling time is less than the connection timeout, then the polling should not have to create another TCP handshake, as it will just reuse an existing HTTP 1.1 connection.
Thus, the above does not seem like it would create too much pressure on the server. It may not be as real-time as long polling or websockets, but I do not see why it should not scale (assuming that the server can quickly respond with a response indicating a new event or not – we use an in memory concurrent hashmap, so this should be pretty fast with no necessary DB access).
Am I missing anything?
Thanks,
-Adam
Short polling may not be as trendy as long polling and web sockets but it works and works everywhere.
Trello (backed by some of the same people as SO) normally uses web sockets but when they encountered a crippling bug in their web sockets implementation on launch day they were saved by short polling:
We hit a problem right after launch. Our WebSocket server implementation started behaving very strangely under the sudden and heavy real-world usage of launching at TechCrunch disrupt, and we were glad to be able to revert to plain polling and tune server performance by adjusting the active and idle polling intervals. It allowed us to degrade gracefully as we increased from 300 to 50,000 users in under a week. We’re back on WebSockets now, but having a working short-polling system still seems like a very prudent fallback.
The full story is well worth a read.
I'd particularly highlight,
The use of HAProxy to terminate the client connection. Meaning that internal web servers are shielded from slow and misbehaving clients and the overhead of repeatedly creating connections becomes less of an issue due to HAProxy's scalability/efficiency;
Trello's polling frequency was adjustable meaning that under heavy load they could tell all clients to poll less frequently thus exchanging responsiveness for increased capacity.
In Brazil at least there are many retail trading platforms that use short polling, with very short polling intervals for rapid publication of stock prices, and regularly support thousands of concurrent users.
Unlike long polling and web sockets, short polling doesn't require a persistent connection so with something like HAProxy in the middle your maximum number of "connections" could actually be greater than the number of concurrent sockets supported by your hardware (although at that point you'd probably be seeing some degradation in responsiveness).
I'm writing a Netty application. The application is running on a 64 bit eight core linux box
The Netty application is a simple router that accepts requests (incoming pipeline) reads some metadata from the request and forwards the data to a remote service (outgoing pipeline).
This remote service will return one or more responses to the outgoing pipeline. The Netty application will route the responses back to the originating client (the incoming pipeline)
There will be thousands of clients. There will be thousands of remote services.
I'm doing some small scale testing (ten clients, ten remotes services) and I don't see the sub 10 millisecond performance I'm expecting at a 99.9 percentile. I'm measuring latency from both client side and server side.
I'm using a fully async protocol that is similar to SPDY. I capture the time (I just use System.nanoTime()) when we process the first byte in the FrameDecoder. I stop the timer just before we call channel.write(). I am measuring sub-millisecond time (99.9 percentile) from the incoming pipeline to the outgoing pipeline and vice versa.
I also measured the time from the first byte in the FrameDecoder to when a ChannelFutureListener callback was invoked on the (above) message.write(). The time was a high tens of milliseconds (99.9 percentile) but I had trouble convincing myself that this was useful data.
My initial thought was that we had some slow clients. I watched channel.isWritable() and logged when this returned false. This method did not return false under normal conditions
Some facts:
We are using the NIO factories. We have not customized the worker size
We have disabled Nagel (tcpNoDelay=true)
We have enabled keep alive (keepAlive=true)
CPU is idle 90+% of the time
Network is idle
The GC (CMS) is being invoked every 100 seconds or so for a very short amount of time
Is there a debugging technique that I could follow to determine why my Netty application is not running as fast as I believe it should?
It feels like channel.write() adds the message to a queue and we (application developers using Netty) don't have transparency into this queue. I don't know if the queue is a Netty queue, an OS queue, a network card queue or what. Anyway I'm reviewing examples of existing applications and I don't see any anti-patterns I'm following
Thanks for any help/insight
Netty creates Runtime.getRuntime().availableProcessors() * 2 workers by default. 16 in your case. That means you can handle up to 16 channels simultaneously, other channels will wait untils you release the ChannelUpstreamHandler.handleUpstream/SimpleChannelHandler.messageReceived handlers, so don't do heavy operations in these (IO) threads, otherwise you can stuck the other channels.
You haven't specified your Netty version, but it sounds like Netty 3.
Netty 4 is now stable, and I would advise that you update to it as soon as possible.
You have specified that you want ultra low latency times, as well as tens of thousands of clients and services. This doesn't really mix well. NIO is inherently reasonably latent as opposed to OIO. However the pitfall here is that OIO probably wont be able to reach the number of clients you are hoping for. None the less I would use an OIO event loop / factory and see how it goes.
I myself have a TCP server, which takes around 30ms on localhost to send and receive and process a few TCP packets (measured from the time client opens a socket until server closes it). If you really do require such low latencies I suggest you switch away from TCP due to the SYN/ACK spam that is required to open a connection, this is going to use a large part of your 10ms.
Measuring time in a multi-threaded environment is very difficult if you are using simple things like System.nanoTime(). Imagine the following on a 1 core system:
Thread A is woken up and begins processing the incoming request.
Thread B is woken up and begins processing the incoming request. But since we are working on a 1 core machine, this ultimately requires that Thread A is put on pause.
Thread B is done and performed perfectly fast.
Thread A resumes and finishes, but took twice as long as Thread B. Because you actually measured the time it took to finish for Thread A + Thread B.
There are two approaches on how to measure correctly in this case:
You can enforce that only one thread is used at all times.
This allows you to measure the exact performance of the operation, if the OS does not interfere. Because in the above example Thread B can be outside of your program as well. A common approach in this case is to median out the interference, which will give you an estimation of the speed of your code.You can however assume, that on an otherwise idle multi-core system, there will be another core to process background tasks, so your measurement will usually not be interrupted. Setting this thread to high priority helps as well.
You use a more sophisticated tool that plugs into the JVM to actually measure the atomic executions and time it took for those, which will effectively remove outside interference almost completely. One tool would be VisualVM, which is already integrated in NetBeans and available as a plugin for Eclipse.
As a general advice: it is not a good idea to use more threads than cores, unless you know that those threads will be blocked by some operation frequently. This is not the case when using non-blocking NIO for IO-operations as there is no blocking.
Therefore, in your special case, you would actually reduce the performance for clients, as explained above, because communication would be put on hold up to 50% of the time under high load. In worst case, that could cause a client to even run into a timeout, as there is no guarantee when a thread is actually resumed (unless you explicitly request fair scheduling).
I've read several posts about java.net vs java.nio here on StackOverflow and on some blogs. But I still cannot catch an idea of when should one prefer NIO over threaded sockets. Can you please examine my conclusions below and tell me which ones are incorrect and which ones are missed?
Since in threaded model you need to dedicate a thread to each active connection and each thread takes like 250Kilobytes of memory for it's stack, with thread per socket model you will quickly run out of memory on large number of concurrent connections. Unlike NIO.
In modern operating systems and processors a large number of active threads and context switch time can be considered almost insignificant for performance
NIO throughoutput can be lower because select() and poll() used by asynchronous NIO libraries in high-load environments is more expensive than waking up and putting to sleep threads.
NIO has always been slower but it allows you to process more concurrent connections. It's essentially a time/space trade-off: traditional IO is faster but has a heavier memory footprint, NIO is slower but uses less resources.
Java has a hard limit per concurrent threads of 15000 / 30000 depending on JVM and this will limit thread per connection model to this number of concurrent connections maximum, but JVM7 will have no such limit (cannot confirm this data).
So, as a conclusion, you can have this:
If you have tens of thousands concurrent connections - NIO is a better choice unless a request processing speed is a key factor for you
If you have less than that - thread per connection is a better choice (given that you can afford amount of RAM to hold stacks of all concurrent threads up to maximum)
With Java 7 you may want to go over NIO 2.0 in either case.
Am I correct?
That seems right to me, except for the part about Java limiting the number of threads – that is typically limited by the OS it's running on (see How many threads can a Java VM support? and Can't get past 2542 Threads in Java on 4GB iMac OSX 10.6.3 Snow Leopard (32bit)).
To reach that many threads you'll probably need to adjust the stack size of the JVM.
I still think the context switch overhead for the threads in traditional IO is significant. At a high level, you only gain performance using multiple threads if they won't contend for the same resources as much, or they spend time much higher than the context switch overhead on the resources.
The reason for bringing this up, is with new storage technologies like SSD, your threads come back to contend on the CPU much quicker
There is not a single "best" way to build NIO servers, but the preponderance of this particular question on SO suggests that people think there is! Your question summarizes the use cases that are suited to both options well enough to help you make the decision that is right for you.
Also, hybrid solutions are possible too! You could hand the channel off to threads when they are going to do something worthy of their expense, and stick to NIO when it is better.
I would say start with thread-per-connection and adapt from there if you run into problems.
If you really need to handle a million connections you should consider writing (or finding) a simple request broker in C (or whatever) that will use far less memory per connection than any java implementation can. The broker can receive requests asynchronously and queue them to backend workers written in your language of choice.
The backends thus only need a thread per active request, and you can just have a fixed number of them so the memory and database use is predetermined to some degree. When large numbers of requests are running in parallel the requests are made to wait a bit longer.
Thus I think you should never have to resort to NIO select channels or asynchronous I/O (NIO 2) on 64-bit systems. The thread-per-connection model works well enough and you can do your scaling to "tens or hundreds of thousands" of connections using some more appropriate low-level technology.
It is always helpful to avoid premature optimization (i.e. writing NIO code before you really have massive numbers of connections coming in) and don't reinvent the wheel (Jetty, nginx, etc.) if possible.
What most often is overlooked is that NIO allows zero copy handling. E.g. if you listen to the same multicast traffic from within multiple processes using old school sockets on one single server, any multicast packet is copied from the network/kernel buffer to each listening application. So if you build a GRID of e.g. 20 processes, you get memory bandwidth issues. With nio you can examine the incoming buffer without having to copy it to application space. The process then copies only parts of the incoming traffic it is interested in.
another application example:
see http://www.ibm.com/developerworks/java/library/j-zerocopy/ for an example.