I have a multi-threaded application which scales well to begin with, but running on a 16-CPU server, performance levels off once I exceed 5 or 6 hardware threads. I suspect the bottleneck is one of the synchronized methods, but I need to be sure it's the guilty method before I start diving into the code and trying to replace the algorithm with a non-blocking one.
Running Java with the -Xprof argument tells me that, as I expected, the threads are spending most of their time blocked. Is there a way that I can break that down into how much time they spend blocked at a particular method?
In YourKit (http://yourkit.com), the monitor view will tell you which lock classes are hot and who is holding the contended locks, with a breakdown by lock instance and caller stack. There is a 30-day evaluation period for the tool.
The jvisualvm tool that comes with the JDK can help you a little, although its CPU profiling information is rather limited (it's more of a visualizer for -Xprof data). I generally find it more useful for memory profiling.
JProfiler has a pretty nice CPU profiler with some really cool functionality that might help you, but it's commercial.
Or, you could add statistics gathering to your code (e.g., measuring how long each synchronized method you suspect takes to execute, split into time waiting for the lock versus time executing the method body), although that's a lot more work.
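A minimal sketch of such hand-rolled contention statistics (the class and method names are illustrative, not from the original code) could separate monitor-wait time from execution time like this:

import java.util.concurrent.atomic.AtomicLong;

public class ContentionStats {
    private final Object lock = new Object();
    private final AtomicLong waitNanos = new AtomicLong();
    private final AtomicLong execNanos = new AtomicLong();

    public void suspectMethod() {
        long before = System.nanoTime();
        synchronized (lock) {
            long entered = System.nanoTime();
            waitNanos.addAndGet(entered - before); // time spent waiting for the monitor
            doWork();                              // the original method body goes here
            execNanos.addAndGet(System.nanoTime() - entered);
        }
    }

    private void doWork() { /* ... */ }

    public String report() {
        return "waiting: " + waitNanos.get() / 1000000 + " ms, executing: "
                + execNanos.get() / 1000000 + " ms";
    }
}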
Could you try this method? If it works across multiple CPUs it should find the problem, but that's a big "if".
Basically, when you see that a thread is blocked, its call stack tells you exactly why. If you're not sure you're seeing the real problem, take a few samples.
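You can approximate this in-process with a few lines of Java (a rough sketch; the sampling interval and count are arbitrary assumptions): dump the stacks of all BLOCKED threads a few times and see which frame keeps appearing.

import java.util.Map;

public class BlockedThreadSampler {
    public static void main(String[] args) throws InterruptedException {
        for (int sample = 0; sample < 10; sample++) {
            for (Map.Entry<Thread, StackTraceElement[]> e
                    : Thread.getAllStackTraces().entrySet()) {
                if (e.getKey().getState() == Thread.State.BLOCKED) {
                    System.out.println(e.getKey().getName() + " is BLOCKED at:");
                    for (StackTraceElement frame : e.getValue()) {
                        System.out.println("    " + frame);
                    }
                }
            }
            Thread.sleep(500); // repeat a few times, as suggested above
        }
    }
}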
Eclipse TPTP is another very good and free profiler.
I'm profiling a Java application using Java Mission Control, and it's saying on the main page of the flight recording that "This recording contains few profiling samples even though CPU load is high. The profiling data is thus likely not relevant."
It seems to be telling the truth. I asked it to sample every 10 ms for 3 minutes, which should be 18,000 samples, but I only see 996 samples.
It goes on to explain, "The profiling data is thus likely not relevant. This might be because the application is running a lot of JNI code or that the JVM is spending a lot of time in GC, class loading, JIT compilation etc."
Hmm, I don't have any native methods, and it shouldn't be loading classes or doing any JIT at the stage I recorded (well into the repetitive number-crunching part of the code). It doesn't look like it's spending an inordinate amount of time garbage collecting either.
We used to use hprof to profile this product, with much success. Hprof helped immensely in figuring out where we were relying on main-thread execution, so we could parallelize the hotspots into multiple threads. But that tool was discontinued in Java 9, so we're moving on to Java Mission Control. It has a lot going for it, but if it can't identify what line numbers the VM threads are sitting on at random sample times, it's not very useful. Is there some other tool to use? Or is there a way to debug this further from within Java Mission Control? It also looks like JVisualVM is no longer included in Java 9.
If you have many more running threads than cores, the sampling thread could be starved and not able to wake up at the interval you specified.
The answer is probably as simple as you having more threads than cores, so most of them are not scheduled on a CPU at the time of sampling. The JFR method sampler only keeps samples of threads actually on CPU; the idea is to provide you with a view of where you are actually spending time executing your Java code.
Now, we know that there are cases where you want to get random samples of all threads, no matter what they are doing. We are adding new profiling capabilities/events in JDK 10.
I am currently doing some Java optimization. It seems the best approach to assess my progress is to do repeated runs and collect run time statistics using System.nanoTime.
Most of my background is in embedded DSP applications, where I would have access to CPU cycle counters (which are a great measure for optimization). Since the JRE acts like a synthetic CPU, is there a way to get the number of instructions or JRE clock equivalents that were executed?
Thanks in advance for any hints. J.R.
The bytecode count is almost certainly useless for what you want; bytecodes are entirely virtual/notional in terms of performance at run time. The only thing which matters is elapsed time for code which has been warmed up.
I suggest you have a look at JMH for micro-benchmarking your code. http://openjdk.java.net/projects/code-tools/jmh/
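A minimal JMH benchmark sketch (the benchmarked method here is purely illustrative) looks like this; JMH takes care of warm-up, forking and statistics, so you measure steady-state, JIT-compiled code rather than compilation noise:

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5)
@Measurement(iterations = 10)
@Fork(1)
@State(Scope.Thread)
public class MyBenchmark {
    private long x = 12345;

    @Benchmark
    public long measureWork() {
        // replace with the code you actually want to time;
        // returning the result keeps the JIT from eliminating it
        return x * 31 + 7;
    }
}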
If you come from the DSP/realtime background I suggest looking at the latency distribution and minimising your allocation rate.
NOTE: The JVM injects safepoint checks between instructions to see whether the thread needs to stop, e.g. to perform garbage collection, and a garbage collection can take seconds. However, safepoint checks are often optimised away to avoid slowing down your program.
In short, the time between instructions might be nothing, it might be something, it might be seconds or even minutes, so I wouldn't bother counting them.
I have seen many questions on this (and other) forums with the same title, but none of them seemed to address exactly my problem, which is this:
I have got a JVM that eats all the CPU on the machine that hosts it. I would like to throttle it; however, I cannot rely on any throttling tool/technique external to Java, as I cannot make assumptions about where this VM will be run. For instance, I cannot use processor affinity, because if the VM runs on a Mac the OS won't make processor affinity available.
What I would need is an indication as to whether means exist within Java to ensure the thread does not take the full CPU.
I would like to point out straight away that I cannot use techniques based on alternating process execution and pauses, as suggested in some forums, because the thread needs to generate values continuously.
Ideally I'd like some means of, for instance, setting a VM or thread priority, or capping in some way the percentage of CPU consumed.
Any help would be much appreciated.
What I would need is an indication as to whether means exist within Java to ensure the thread does not take the full CPU.
There is no way that I know of to do this within Java except for tuning your application to use less CPU.
You could put some Thread.sleep(...) calls in your calculation methods. A profiler would help by showing you the hot loops/methods/etc.
Forking fewer threads would also reduce the CPU used: consider moving to fixed-size thread pools or lowering the number of threads in your pools.
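For instance, a minimal sketch of capping concurrency with a fixed-size pool (the sizing policy here is just an assumption; tune it for your workload):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CappedPool {
    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        // leave some cores free for the rest of the system
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, cores / 2));
        for (int i = 0; i < 100; i++) {
            final int job = i;
            pool.submit(() -> System.out.println("job " + job + " on "
                    + Thread.currentThread().getName()));
        }
        pool.shutdown(); // accept no new tasks; queued ones run to completion
    }
}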
It may not be CPU that is the problem but other resources. Watch your IO bandwidth for example. Slowing down your network or disk reads/writes might restore your server to proper operation.
From outside the JVM, you could use the Unix nice command to adjust the priority of the running JVM so it doesn't dominate the system. It will still get CPU when available, but other applications will get more of it.
I take it you want something more reliable than setting the threads' priorities?
If you want throttled execution of some code that is constantly generating values, you need to look into chunking up the work the thread(s) do and coding in your own timer. For example, java.util.Timer allows scheduling execution at a fixed rate.
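A minimal sketch of that chunking idea (the chunk size and the 100 ms period are arbitrary assumptions):

import java.util.Timer;
import java.util.TimerTask;

public class ChunkedGenerator {
    public static void main(String[] args) {
        Timer timer = new Timer();
        timer.scheduleAtFixedRate(new TimerTask() {
            @Override
            public void run() {
                for (int i = 0; i < 1000; i++) {
                    produceValue(); // one chunk of work per tick
                }
            }
        }, 0, 100); // start immediately, repeat every 100 ms
    }

    static void produceValue() { /* ... */ }
}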
Any other technique will still consume as much CPU as is available (1 core per thread, assuming no locks preventing concurrent execution) when the scheduler doesn't have other tasks to prioritize ahead of yours.
The detail is simply that you said the thread "must generate values continuously", and if that is true to the extreme, then CPU saturation is actually the goal.
But, if you define "continuously" as X values per second, then there is room to work.
Because then you can run your process at 100% CPU, measure the number of values over time, and if you find that it generates more values than necessary (more than X/sec), you can insert pauses into the process as appropriate until the value rate reaches your desired goal.
The plan is to continually monitor and adjust the pauses to maintain your value rate over time. Then your process will take as much CPU as necessary to meet your values/sec goal.
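A rough sketch of that feedback loop (targetPerSec and the 1 ms pause are assumed values, not from the original answer):

public class RateThrottledProducer {
    public static void main(String[] args) throws InterruptedException {
        final double targetPerSec = 10000.0; // desired value rate (assumption)
        long produced = 0;
        long start = System.nanoTime();
        while (true) {
            produceValue();
            produced++;
            double elapsedSec = (System.nanoTime() - start) / 1e9;
            // ahead of the target rate? give the CPU back for a moment
            if (produced / elapsedSec > targetPerSec) {
                Thread.sleep(1);
            }
        }
    }

    static void produceValue() { /* ... */ }
}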
Addenda:
If you have a benchmark of values/sec that you are happy with, then interjecting the sleeps will give "all the priority necessary" to the other applications while still maintaining your throughput. If, on the other hand, you don't have any solid requirement, that is, the requirement is "run as fast as possible when nothing else is running, with no actual requirement for ANY results if some other process dominates the CPU", then that's truly a kernel issue for the host OS, and not something the JVM has any direct, portable mechanism to address.
On Unix systems, you have the nice(1) command to adjust process (not thread) priority, and Windows has its own mechanism. With these commands, you can knock the priority of your Java process to just above "idle" (the default "process" that always runs when nothing else is running). But it's platform specific, as this is an inherently platform-specific problem. This may well be managed through platform-specific startup scripts that launch your Java program (or even a Java launcher that detects the platform and "does the right thing" before executing your actual code).
Most systems will allow you to lower your own process's priority, but few will let you raise it unless you're an admin/superuser or have whatever the appropriate role is on your host OS.
Check to see if you have any "tight loops" in your code.
while (true) {
    if (object.checkSomething()) {
        ...
    }
}
If you do, then you are burning the CPU cycles on millions of checks that are probably not that time critical. The JVM will oblige (because it doesn't know if the check is "important" or not) and you'll get 100% CPU.
If you find such loops, rewrite them like so
while (true) {
    if (object.checkSomething()) {
        ...
    }
    try {
        Thread.sleep(100);
    } catch (InterruptedException e) {
        // restore the interrupt flag rather than swallowing it silently
        Thread.currentThread().interrupt();
    }
}
and the sleeping will voluntarily release the CPU within the loop, preventing it from running too quickly (and checking the condition too many times).
Really interesting thread. I found out that Java does not provide a means of doing what I want, and the only way to do it is from outside the JVM.
I ended up using nice to alter the scheduling priority in my test (Linux) environment; I will still need to find something similar for Windows-based OSs.
Everyone's intervention has been much appreciated.
On a multicore box, the Java thread scheduler's decisions are rather arbitrary: it assigns thread priorities based on when a thread was created, from which thread it was created, etc.
The idea is to run a tuning process using PSO (particle swarm optimization) that would randomly set thread priorities and eventually converge on optimal priorities, where the fitness function is the total run time of the program.
Of course there would be more parameters; for example, the priorities could shift during the run to find an optimal priority function.
How practical and interesting does the idea sound? Any suggestions are welcome.
Just some background: I've been programming in Java/C/C++ for a few years now across various projects. Another alternative would be building a thread scheduler based on this idea in C, where the default thread scheduler is the OS's.
Your approach as described is a static approach, i.e. you need to run the program several times, then come up with a scheduling solution, then ship your scheduling information with the program.
The problem is that for most non-trivial programs, their performance will partly depend on the specific data they're working with. Even if you find an optimal way to schedule threads for one data set, there is absolutely no guarantee that it will improve speed on another one. In most cases, running what will be a long and arduous optimization every time they want to do a new release will not be worth it for devs, unless perhaps for large computation efforts (where the programs are likely to be manually tuned and not written in java anyway).
I'd say a self-learning thread scheduler is a nice idea, but you can't treat it as a classical optimization problem here. You either need to be sure that your scheduling order will remain optimal (unlikely) or find an optimization method that works at runtime. And the issue here might be that it wouldn't take much for the overhead of your scheduler to destroy any performance gain you might get.
I think this is a somewhat subjective question, but overall, no, I don't think it would work.
Best way to find out: start an open-source project and see people's usage/reaction.
It sounds very interesting to me, but I personally don't find it very useful. Perhaps we're just not at the point where concurrent programming is as prevalent and easy as it could be.
With the rise of functional programming, I guess the world will move towards avoiding thread synchronization as much as possible (thus making thread scheduling less of a factor in overall performance).
From my personal, subjective experience, most performance problems in software can be solved by improving the one bottleneck that accounts for 90% of the slowdown, and this optimizer may help find it. I am not sure how much the scheduling strategy could improve overall performance, though.
Don't get discouraged, though! I'm just talking out of thin air. It sounds fun, so why not just play with it anyway :)
I have an interpreter written in Java, and I am trying to test the performance results of various optimisations in it. To do this I parse the code and then repeatedly run the interpreter over it; this continues until I get 5 runs which differ by a very small margin (0.1 s in the times below), then the mean is taken and printed. No I/O or randomness happens in the interpreter, yet if I run it again I get different run times:
91.8s
95.7s
93.8s
97.6s
94.6s
94.6s
107.4s
I have tried, to no avail, the server and client VM, the serial and parallel GC, large tables, and both Windows and Linux, all on a 1.6.0_14 JVM. The computer has no processes running in the background. So I am asking: what may be causing these large variations, or how can I find out?
The actual issue was that the program had to iterate to a fixed-point solution and the values were stored in a HashSet. The hash values differed between runs (the default Object.hashCode is not stable across JVM runs), resulting in a different iteration order, which in turn changed the number of iterations needed to reach the solution.
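A small sketch of that failure mode (the Node class is illustrative): objects relying on the default identity hash land in different buckets each run, so HashSet iteration order varies between JVM runs.

import java.util.HashSet;
import java.util.Set;

public class UnstableIteration {
    // no hashCode/equals override, so the identity hash decides the bucket
    static class Node { }

    public static void main(String[] args) {
        Set<Node> set = new HashSet<Node>();
        for (int i = 0; i < 5; i++) {
            set.add(new Node());
        }
        // these identity hashes (and hence the iteration order) typically
        // differ from one JVM run to the next
        for (Node n : set) {
            System.out.println(System.identityHashCode(n));
        }
    }
}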
"Wall clock time" is rarely a good measurement for benchmarking. A modern OS is extremely unlikely to "[have] no processes running in the background" -- for all you know, it could be writing dirty block buffers to disk, because it's decided that there's no other contention.
Instead, I recommend using ThreadMXBean to track actual CPU consumption.
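For example, a minimal sketch (runInterpreter stands in for the code under test): measure the CPU time actually charged to the thread rather than wall-clock time.

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class CpuTimeBenchmark {
    public static void main(String[] args) {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        if (!bean.isCurrentThreadCpuTimeSupported()) {
            System.err.println("CPU time measurement not supported on this JVM");
            return;
        }
        long start = bean.getCurrentThreadCpuTime();
        runInterpreter(); // the code under test
        long cpuNanos = bean.getCurrentThreadCpuTime() - start;
        System.out.printf("CPU time: %.3f s%n", cpuNanos / 1e9);
    }

    static void runInterpreter() { /* ... */ }
}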
Your variations don't look that large. It's simply the nature of the beast that there are other things running outside of your direct control, both in the OS and the JVM, and you're not likely to get exact results.
Things that could affect runtime:
if your test runs are creating objects (which may be invisible to you, e.g. within library calls), then your repeats may trigger a GC
Different GC algorithms and configurations will react differently and have different thresholds for incremental GC. You could try running System.gc() before every run, although the JVM is not guaranteed to collect when you call that (although it always has when I've played with it). Depending on the size of your test, and how many iterations you're running, this may be an unpleasantly (and nearly uselessly) slow thing to wait for.
Are you doing any sort of randomization within your tests? E.g. if you're testing integers, values whose absolute value is below 128 may be handled slightly differently in memory, because boxed Integers in the range -128 to 127 come from a shared cache (see the sketch after this list).
Ultimately I don't think it's possible to get an exact figure; probably the best you can do is an average over the cluster of results.
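A small illustration of the boxing cache mentioned in the list above (a documented property of Integer.valueOf, not something from the original answer):

public class BoxingCache {
    public static void main(String[] args) {
        Integer a = Integer.valueOf(127), b = Integer.valueOf(127);
        Integer c = Integer.valueOf(128), d = Integer.valueOf(128);
        System.out.println(a == b); // true: the cached instance is reused
        System.out.println(c == d); // false: two fresh allocations
    }
}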
The garbage collector may be responsible: even though your logic is the same, the GC work may be scheduled based on external clocks/events. But I don't know that much about the JVM's GC implementation.
This seems like a significant variation to me; I would try running with -verbosegc.
You should be able to get the variation to much less than a second if your process has no IO, output or network of any significance.
I suggest profiling your application, there is highly likely to be significant saving if you haven't done this already.