I'm using a JProfiler evaluation license to profile a client application that writes data to a socket. I connect successfully to the client after starting it and click on Record Probe Sockets. The Time Line tab shows two vertical red lines for when the application starts and terminates. The Sockets tab shows nothing though.
I know data is being written to a socket because I see the data on the server. The client uses multiple threads to write data to the socket.
Is there something else I need to do to profile socket usage with JProfiler? What I really want to see is how much time my application is using to write to the socket and how much time it is blocked while waiting to write to a socket.
The profiler can take event-based and sampling-based measurements. Event-based measurement is only available for classes that are not part of the java.* packages, because it works by modifying classes on the fly as they are loaded. So your socket operations are measured by sampling.
Sampling means the profiler inspects the stack traces at a fixed interval and checks what the code is doing at that moment (in your case, whether it is a socket operation). The default sampling interval is 5 ms. I would suggest decreasing that interval and seeing if it helps: your socket operations might only take around 1 ms and so never show up, because every sample happens to land at a moment when no socket operation is in progress.
Edit: I would also check that socket profiling is enabled in the first place.
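If the probe stays empty, a rough fallback (not a JProfiler feature, just a hand-rolled measurement; the endpoint and payload below are placeholders) is to time the write calls yourself, since a write that blocks on a full send buffer shows up as a long write()/flush():

    import java.io.IOException;
    import java.io.OutputStream;
    import java.net.Socket;

    public class TimedSocketWrite {
        public static void main(String[] args) throws IOException {
            // Placeholder host, port and payload; substitute the client's real target.
            try (Socket socket = new Socket("server.example.com", 9000)) {
                OutputStream out = socket.getOutputStream();
                byte[] payload = new byte[8192];

                long start = System.nanoTime();
                out.write(payload);   // may block if the socket send buffer is full
                out.flush();
                long micros = (System.nanoTime() - start) / 1000L;

                System.out.println("write + flush took " + micros + " us");
            }
        }
    }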
Related
I'm creating a Java server for a quite big simulation and I have a couple of high level design questions.
Some background:
The server will run a simulation.
Clients will connect to the server via TCP connections from mobile devices and interact with data structures in the simulation. Initially I will try to use a simple polling scheme in the clients. I find it hard to maintain long-lived TCP connections between mobile devices and the server and I'm not yet sure whether the clients will try to keep an open TCP connection or whether they will set it up and tear it down for each transmission.
When a client is active on a mobile device, I would like to have the client poll the server at least a few times a minute.
The simulation will keep running regardless of whether clients are connected or not.
The total number of existing clients could get very large, many thousands.
Clients mostly poll the server for simulation state, but sometimes also issue control commands to the simulation.
All messages are small in size.
I expect the server to run under Linux on multi-core CPU server hardware.
Currently I have the following idea for the threading model in the server:
The simulation logic is executed by a few threads. The simulation logic threads both read and write from/to the simulation data structures.
For each client there is a Java thread performing a blocking read call to the socket for that client. When a poll command is received from a client, the corresponding client thread reads info from the simulation data structures (one client poll would typically be interested in a small subset of the total data structures) and sends a reply to the client on the client's socket. Thus, access to the data structures would need to be synchronized between the client threads and the simulation threads (I would try to have the locks on smaller subsets of the data). If a control command is received from the client, the client thread would write to the data structures.
For a small number of clients, I think this would work fine.
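A rough sketch of that thread-per-client model (the class names, port and wire format below are invented for illustration, not part of the actual design):

    import java.io.*;
    import java.net.ServerSocket;
    import java.net.Socket;

    public class ThreadPerClientServer {
        public static void main(String[] args) throws IOException {
            final SimulationState state = new SimulationState();
            try (ServerSocket serverSocket = new ServerSocket(5000)) {
                while (true) {
                    final Socket client = serverSocket.accept();
                    new Thread(new Runnable() {
                        public void run() {
                            try (BufferedReader in = new BufferedReader(
                                         new InputStreamReader(client.getInputStream()));
                                 PrintWriter out = new PrintWriter(client.getOutputStream(), true)) {
                                String line;
                                while ((line = in.readLine()) != null) {   // blocking read per client
                                    if (line.startsWith("POLL")) {
                                        out.println(state.snapshotFor(line)); // synchronized read
                                    } else {
                                        state.applyCommand(line);             // synchronized write
                                    }
                                }
                            } catch (IOException ignored) {
                                // client disconnected
                            }
                        }
                    }).start();
                }
            }
        }

        // Placeholder for the shared simulation data structures.
        static class SimulationState {
            synchronized String snapshotFor(String pollRequest) { return "state"; }
            synchronized void applyCommand(String command) { /* mutate simulation data */ }
        }
    }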
Question 1: Would this threading model hold for a large number (thousands) of connected clients? I'm not familiar with what memory/CPU overhead there would be in such a Java implementation.
Question 2: I would like to avoid having the server asynchronously send messages to the clients but in certain scenarios I may need to have the server send "update yourself now" messages asynchronously to some or many clients and I'm not quite sure how to do that. Having the simulation logic thread(s) send those messages doesn't seem right... maybe some "client notification thread pool" concept?
You ask two questions; I'll answer the first.
I've previously written an application that involved thousands of threads in one application. We did once run into a problem with the maximum number of threads on the Linux server; for us, I think the limit was about 1000 threads. This affected our Java application because Java threads use native threads. We set the limit higher, and the application scaled without issue to about 2000 threads, which was all we needed; I don't know what would have happened had we needed to scale it much higher.
The fact that the default maximum number of threads was 1000 suggests that it might not be wise to run too many thousands of threads on a single Linux server. I believe the primary issue is that sufficient memory for a stack needs to be allocated for each thread.
Our intended long-term fix was to change to an architecture where threads from a thread pool each serviced multiple sockets. This really isn't too much of an issue: for each socket, the thread just has to process any pending messages before going on to the next socket. You would have to be careful about synchronizing memory access, but your application already needs to do that, since the simulation already interacts with multiple threads, so that part would not be a huge change.
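A hedged sketch of that direction using java.nio, with one selector thread servicing many sockets (the port, buffer size and the hand-off point are assumptions, not part of the original answer):

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.*;
    import java.util.Iterator;

    public class SelectorServer {
        public static void main(String[] args) throws IOException {
            Selector selector = Selector.open();
            ServerSocketChannel server = ServerSocketChannel.open();
            server.bind(new InetSocketAddress(5000));
            server.configureBlocking(false);
            server.register(selector, SelectionKey.OP_ACCEPT);
            ByteBuffer buffer = ByteBuffer.allocate(512);   // messages are small

            while (true) {
                selector.select();
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = it.next();
                    it.remove();
                    if (key.isAcceptable()) {
                        SocketChannel client = server.accept();
                        client.configureBlocking(false);
                        client.register(selector, SelectionKey.OP_READ);
                    } else if (key.isReadable()) {
                        SocketChannel client = (SocketChannel) key.channel();
                        buffer.clear();
                        if (client.read(buffer) == -1) {
                            key.cancel();
                            client.close();
                        } else {
                            // Parse the poll/command here, or hand it off to a worker
                            // pool so the selector thread stays responsive.
                        }
                    }
                }
            }
        }
    }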
We have an issue with a server at work and I'm trying to understand what is happening. It's a Java application that runs on a Linux server; the application receives information from a TCP socket, analyses it, and then writes the results to the database.
Sometimes the volume of packets gets very high and the Java application needs to write to the database many times per second (around 100 to 500 times).
I tried to reproduce the issue on my own computer and looked at how the application behaves with JProfiler.
The memory usage keeps going up. Is this a memory leak? (Sorry, I'm not a Java programmer, I'm a C++ programmer.)
After 133 minutes
After 158 minutes
I have many blocked threads. Does that mean the application was not programmed correctly?
Are there too many connections to the database (the application uses the BasicDataSource class for a connection pool)?
The program doesn't have a FIFO to manage the database writes for the continuous stream of information arriving on the TCP port. My questions are (remember that I'm not a Java programmer, and I don't know whether this is how a Java application should work or whether the program could be written more efficiently):
Do you think something is wrong with the code, i.e. it is not correctly managing the writes, reads and updates on the database and so consumes too much memory and CPU time, or is this just how the BasicDataSource class works?
If you think it's an issue, how could I improve it? By creating a FIFO and removing the part of the code that creates too many threads? Or are those threads not the application's own threads but BasicDataSource's threads?
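To illustrate the FIFO idea from the question (a hedged sketch only; Packet and writeToDatabase are invented placeholders, not the application's real types): the socket-reading side enqueues work, and a small fixed number of writer threads drain the queue into the database, which bounds both the thread count and the number of concurrent database writes.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class FifoPipeline {
        // Bounded FIFO: when it is full, the readers block, which applies back-pressure
        // instead of letting work pile up in memory.
        private final BlockingQueue<Packet> queue = new ArrayBlockingQueue<Packet>(10000);

        // Called by the TCP-reading code for every parsed packet.
        public void onPacket(Packet p) throws InterruptedException {
            queue.put(p);
        }

        // Start one (or a few) of these; they are the only threads touching the database.
        public void startWriter() {
            new Thread(new Runnable() {
                public void run() {
                    try {
                        while (true) {
                            writeToDatabase(queue.take());
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            }).start();
        }

        private void writeToDatabase(Packet p) {
            // JDBC insert via the connection pool would go here.
        }

        static class Packet { /* parsed TCP payload */ }
    }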
There are several areas to dig into, but first I would try and find what is actually blocking the threads in question. I'll assume everything before the app is being looked at as well, so this is from the app down.
I know the graphs show free memory, but they are just points in time, so I can't see a trend. GC logging is available; I haven't used JProfiler much, though, so I'm not sure how to point you to it in that tool. I know that in DynaTrace I can see GC events and their duration, as well as any other blocking events and their root cause. If that isn't available, there are command-line switches to log GC activity so you can see its duration and frequency. That is one area that could block.
I would also look at how many connections you have in your pool. If there are 100-500 requests per second trying to write and they are stacking up because you don't have enough connections to work them, then that could be a problem as well. The image shows all transactions but doesn't speak to the pool size. Transactions blocked with nowhere to go could lead to your memory jumps as well.
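For reference, pool sizing on BasicDataSource looks roughly like this (a sketch assuming Commons DBCP 2; the older 1.x API uses setMaxActive/setMaxWait, and the URL, credentials and numbers are illustrative only):

    import javax.sql.DataSource;
    import org.apache.commons.dbcp2.BasicDataSource;

    public class PoolConfig {
        static DataSource createPool() {
            BasicDataSource ds = new BasicDataSource();
            ds.setUrl("jdbc:mysql://localhost:3306/appdb"); // placeholder URL
            ds.setUsername("app");
            ds.setPassword("secret");
            ds.setMaxTotal(20);        // maximum concurrent connections to the database
            ds.setMaxWaitMillis(2000); // fail fast instead of letting callers queue up forever
            return ds;
        }
    }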
There is also the flip side: your database may not be able to handle the traffic and is pegged, and that is what is blocking the connections, so you would want to monitor that end of things too and see whether it is a possible cause of the blocking.
There is also the chance that the blocking is coming from the SQL being run, e.g. waiting for page locks to be released.
Lots of areas to look at, but I would address and verify one layer at a time starting with the app and working down.
I haven't found a clear answer on this one.
I have a client/server application in Java 7. The server and client are on separate computers. The client has a short (1 line of 10 characters) command to issue to the server and the server responds (120 character string). This will be repeated every X seconds--where X is the rate in the configuration file. This could be anywhere from 1 second up to Integer.MAX_VALUE seconds.
Every time that I've created a client/server application, the philosophy has been: create the connection, do the business, close the connection, and then do whatever else with the data. This seems to be the way things should be done--especially when using try-with-resources.
What are the hiccups with leaving a socket connection hanging out there for X seconds? Is it really a best practice to close down and restart or is it a better practice for the socket to remain connected and just send the command every X seconds?
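For concreteness, the "keep it open" variant looks roughly like this (a sketch only; the host, port, command text and interval are invented):

    import java.io.*;
    import java.net.Socket;
    import java.nio.charset.StandardCharsets;

    public class PollingClient {
        public static void main(String[] args) throws Exception {
            long intervalSeconds = 5;  // X from the configuration file (assumed value)
            try (Socket socket = new Socket("server.example.com", 4000);
                 BufferedWriter out = new BufferedWriter(
                         new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.US_ASCII));
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(socket.getInputStream(), StandardCharsets.US_ASCII))) {
                while (true) {
                    out.write("GETSTATUS\n");     // the ~10-character command
                    out.flush();
                    String reply = in.readLine(); // the ~120-character response
                    System.out.println("server said: " + reply);
                    Thread.sleep(intervalSeconds * 1000);
                }
            }
        }
    }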
I think the answer depends a bit on the number of clients you expect to have.
If you will never have very many client connections open, then I'd say leave the connection open and call it good, especially if latency is an issue - even on LANs, I've seen connections take several milliseconds to initialize. If you expect hundreds or thousands of clients to connect and do this, however, I would reconnect every time. As others have said, leaving non-blocking sockets open will often mean you have a thread left running, which can take several megabytes of stack space on a per-thread basis. Do this several thousand times and you will have a big problem on most machines.
Another issue is port space. Just because the TCP/IP stack gives us 65535 total ports doesn't mean all are usable - in fact, most local firewalls will prohibit most from being used, so even if you had enough memory to run thousands of simultaneous threads, you could very likely run out of ports if you leave a lot of connections open simultaneously.
IMHO the client should open, do its thing and then close.
on the server...
In UNIX one usually forks a process to answer the call (each call); however, on Windows one typically creates a new thread for each inbound call.
I'm writing a Netty application. The application is running on a 64-bit, eight-core Linux box.
The Netty application is a simple router that accepts requests (incoming pipeline) reads some metadata from the request and forwards the data to a remote service (outgoing pipeline).
This remote service will return one or more responses to the outgoing pipeline. The Netty application will route the responses back to the originating client (the incoming pipeline)
There will be thousands of clients. There will be thousands of remote services.
I'm doing some small scale testing (ten clients, ten remote services) and I don't see the sub-10-millisecond performance I'm expecting at the 99.9th percentile. I'm measuring latency from both the client side and the server side.
I'm using a fully async protocol that is similar to SPDY. I capture the time (I just use System.nanoTime()) when we process the first byte in the FrameDecoder. I stop the timer just before we call channel.write(). I am measuring sub-millisecond time (99.9 percentile) from the incoming pipeline to the outgoing pipeline and vice versa.
I also measured the time from the first byte in the FrameDecoder to when a ChannelFutureListener callback was invoked on the (above) message.write(). The time was a high tens of milliseconds (99.9 percentile) but I had trouble convincing myself that this was useful data.
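For reference, the second measurement described above would look something like this in Netty 3 (a hedged sketch, not the actual code; the handler name is invented and the start time would really be captured in the FrameDecoder):

    import org.jboss.netty.channel.*;

    public class WriteLatencyHandler extends SimpleChannelHandler {
        @Override
        public void messageReceived(ChannelHandlerContext ctx, MessageEvent e) {
            final long start = System.nanoTime();          // ideally captured at the first decoded byte
            ChannelFuture future = e.getChannel().write(e.getMessage());
            future.addListener(new ChannelFutureListener() {
                public void operationComplete(ChannelFuture f) {
                    long micros = (System.nanoTime() - start) / 1000L;
                    // Fires once the message has been flushed to the socket buffer, so it
                    // includes any time the write spent queued inside Netty.
                    System.out.println("write completed in " + micros + " us, success=" + f.isSuccess());
                }
            });
        }
    }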
My initial thought was that we had some slow clients. I watched channel.isWritable() and logged when this returned false. This method did not return false under normal conditions.
Some facts:
We are using the NIO factories. We have not customized the worker size.
We have disabled Nagle (tcpNoDelay=true)
We have enabled keep alive (keepAlive=true)
CPU is idle 90+% of the time
Network is idle
The GC (CMS) is being invoked every 100 seconds or so for a very short amount of time
Is there a debugging technique that I could follow to determine why my Netty application is not running as fast as I believe it should?
It feels like channel.write() adds the message to a queue and we (application developers using Netty) don't have transparency into this queue. I don't know whether the queue is a Netty queue, an OS queue, a network card queue or something else. Anyway, I'm reviewing examples of existing applications and I don't see any anti-patterns I'm following.
Thanks for any help/insight
Netty creates Runtime.getRuntime().availableProcessors() * 2 worker threads by default, so 16 in your case. That means you can handle up to 16 channels simultaneously; other channels will wait until you return from the ChannelUpstreamHandler.handleUpstream/SimpleChannelHandler.messageReceived handlers, so don't do heavy operations in these (I/O) threads, otherwise you can stall the other channels.
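A hedged Netty 3 sketch of how the worker count can be set explicitly and how slow work can be moved off the I/O threads via an ExecutionHandler (RouterHandler, the port and the pool sizes are placeholders):

    import java.net.InetSocketAddress;
    import java.util.concurrent.Executors;
    import org.jboss.netty.bootstrap.ServerBootstrap;
    import org.jboss.netty.channel.*;
    import org.jboss.netty.channel.socket.nio.NioServerSocketChannelFactory;
    import org.jboss.netty.handler.execution.ExecutionHandler;
    import org.jboss.netty.handler.execution.OrderedMemoryAwareThreadPoolExecutor;

    public class RouterServer {
        public static void main(String[] args) {
            NioServerSocketChannelFactory factory = new NioServerSocketChannelFactory(
                    Executors.newCachedThreadPool(),   // boss (accept) threads
                    Executors.newCachedThreadPool(),   // I/O worker threads
                    32);                               // explicit worker count instead of cores * 2

            final ExecutionHandler executionHandler = new ExecutionHandler(
                    new OrderedMemoryAwareThreadPoolExecutor(16, 1048576, 1048576));

            ServerBootstrap bootstrap = new ServerBootstrap(factory);
            bootstrap.setPipelineFactory(new ChannelPipelineFactory() {
                public ChannelPipeline getPipeline() {
                    // Everything after executionHandler runs off the I/O thread.
                    return Channels.pipeline(executionHandler, new RouterHandler());
                }
            });
            bootstrap.bind(new InetSocketAddress(8080));
        }

        // Placeholder for the application's routing logic.
        static class RouterHandler extends SimpleChannelHandler { }
    }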
You haven't specified your Netty version, but it sounds like Netty 3.
Netty 4 is now stable, and I would advise that you update to it as soon as possible.
You have specified that you want ultra-low latency as well as tens of thousands of clients and services. This doesn't really mix well: NIO is inherently somewhat higher-latency than OIO. The pitfall, however, is that OIO probably won't be able to reach the number of clients you are hoping for. Nonetheless, I would try an OIO event loop / factory and see how it goes.
I myself have a TCP server that takes around 30 ms on localhost to send, receive and process a few TCP packets (measured from the time the client opens a socket until the server closes it). If you really do require such low latencies, I suggest you switch away from TCP, because the SYN/ACK exchange required to open a connection will use up a large part of your 10 ms.
Measuring time in a multi-threaded environment is very difficult if you are using simple things like System.nanoTime(). Imagine the following on a 1 core system:
Thread A is woken up and begins processing the incoming request.
Thread B is woken up and begins processing the incoming request. But since we are working on a 1 core machine, this ultimately requires that Thread A is put on pause.
Thread B is done and performed perfectly fast.
Thread A resumes and finishes, but appears to have taken twice as long as Thread B, because you actually measured the time it took for Thread A + Thread B combined.
There are two approaches on how to measure correctly in this case:
You can enforce that only one thread is used at all times.
This allows you to measure the exact performance of the operation, provided the OS does not interfere (in the example above, Thread B could just as well be outside your program). A common approach in this case is to take the median over many runs to filter out the interference, which gives you an estimate of the speed of your code (see the sketch after this list). You can, however, assume that on an otherwise idle multi-core system there will be another core to process background tasks, so your measurement will usually not be interrupted. Setting this thread to high priority helps as well.
You use a more sophisticated tool that plugs into the JVM to actually measure the atomic executions and time it took for those, which will effectively remove outside interference almost completely. One tool would be VisualVM, which is already integrated in NetBeans and available as a plugin for Eclipse.
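A minimal sketch of the first approach (single thread, median over many runs; doOperation is a stand-in for whatever is being measured):

    import java.util.Arrays;

    public class MedianTiming {
        public static void main(String[] args) {
            int runs = 1001;
            long[] samples = new long[runs];
            for (int i = 0; i < runs; i++) {
                long start = System.nanoTime();
                doOperation();                       // the code under test (placeholder)
                samples[i] = System.nanoTime() - start;
            }
            Arrays.sort(samples);
            System.out.println("median: " + samples[runs / 2] + " ns");
        }

        private static void doOperation() {
            // Placeholder workload.
            Math.sqrt(System.currentTimeMillis());
        }
    }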
As a general advice: it is not a good idea to use more threads than cores, unless you know that those threads will be blocked by some operation frequently. This is not the case when using non-blocking NIO for IO-operations as there is no blocking.
Therefore, in your particular case, you would actually reduce performance for clients, as explained above, because under high load communication would be put on hold up to 50% of the time. In the worst case, that could even cause a client to run into a timeout, as there is no guarantee of when a thread is actually resumed (unless you explicitly request fair scheduling).
Our application is reading data very fast over TCP/IP sockets in Java. We are using the NIO library with a non-blocking Sockets and a Selector to indicate readiness to read. On average, the overall processing times for reading and handling the read data is sub-millisecond.
However we frequently see spikes of 10-20 milliseconds. (running on Linux).
Using tcpdump we can see the time difference between tcpdump's reading of 2 discrete messages, and compare that with our application's time. We see that tcpdump appears to have no delay, whereas the application can show 20 milliseconds.
We are pretty sure this is not GC, because the GC log shows virtually no Full GC, and in JDK 6 (from what I understand) the default GC is parallel, so it should not be pausing the application threads (unless doing Full GC).
It looks almost as if there is some delay for Java's Selector.select(0) method to return the readiness to read, because at the TCP layer, the data is already available to be read (and tcpdump is reading it).
Additional Info: at peak load we are processing about 6,000 x 150 bytes avg per message, or about 900 MB per second.
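One way to localize the delay (a diagnostic sketch only, not the poster's code; the endpoint and buffer size are made up) is to time how long select() blocks versus how long the subsequent read and dispatch take:

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.*;
    import java.util.Iterator;

    public class SelectLoopTiming {
        public static void main(String[] args) throws IOException {
            Selector selector = Selector.open();
            SocketChannel channel = SocketChannel.open(new InetSocketAddress("localhost", 9000)); // placeholder endpoint
            channel.configureBlocking(false);
            channel.register(selector, SelectionKey.OP_READ);
            ByteBuffer buffer = ByteBuffer.allocateDirect(64 * 1024);

            while (true) {
                long beforeSelect = System.nanoTime();
                selector.select();                       // equivalent to select(0): block until ready
                long afterSelect = System.nanoTime();

                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = it.next();
                    it.remove();
                    if (key.isReadable()) {
                        buffer.clear();
                        ((SocketChannel) key.channel()).read(buffer);
                    }
                }
                long afterRead = System.nanoTime();
                System.out.printf("select blocked %.2f ms, read+dispatch %.2f ms%n",
                        (afterSelect - beforeSelect) / 1e6, (afterRead - afterSelect) / 1e6);
            }
        }
    }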
Eden collection still incurs a stop-the-world pause, so 20 ms may be perfectly normal depending on allocation behaviour and the heap size / size of the live set.
Is your Java code running under RTLinux, or some other distro with hard real-time scheduling capability? If not, 10-20 msec of jitter in the processing times seems completely reasonable, and expected.
I had the same problem in a Java service that I work on. When sending the same request repeatedly from the client, the server would block at the same spot in the stream for 25-35 ms.
Turning off Nagle's algorithm in the socket fixed this for me. This can be accomplished by calling setTcpNoDelay(true) on the Socket. This may result in increased network congestion because ACKs will now be sent as separate packets.
See http://en.wikipedia.org/wiki/Nagle%27s_algorithm for more info on Nagle's algorithm.
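A tiny sketch of that change (the host and port are placeholders):

    import java.io.IOException;
    import java.net.Socket;

    public class NoDelayClient {
        public static void main(String[] args) throws IOException {
            try (Socket socket = new Socket("server.example.com", 9000)) {
                socket.setTcpNoDelay(true);   // disable Nagle's algorithm for this socket
                // ... write requests / read responses as before ...
            }
        }
    }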
From the tcpdump FAQ:
WHEN IS A PACKET TIME-STAMPED? HOW ACCURATE ARE THE TIME STAMPS?
In most OSes on which tcpdump and libpcap run, the packet is time stamped as part of the process of the network interface's device driver, or the networking stack, handling it. This means that the packet is not time stamped at the instant that it arrives at the network interface; after the packet arrives at the network interface, there will be a delay until an interrupt is delivered or the network interface is polled (i.e., the network interface might not interrupt the host immediately - the driver may be set up to poll the interface if network traffic is heavy, to reduce the number of interrupts and process more packets per interrupt), and there will be a further delay between the point at which the interrupt starts being processed and the time stamp is generated.
So odds are, the timestamp is made in the privileged kernel layer, and the lost 20 ms is due to context-switching overhead back to user space and into Java and the JVM's network selector logic. Without more analysis of the system as a whole, I don't think it's possible to pin down the cause definitively.