I am facing a situation where some method calls are not being recorded by the VisualVM application. I wanted to find out the reason and came across this answer on SO. The third point mentions a potential issue with the sampling method (which is the only option I see enabled, probably because I am doing remote profiling). It mentions safe points in code and safe-point polling by the code itself. What do these terms mean?
The inaccuracy of Java sampling profilers and its relation to safepoints is very well discussed in Evaluating the Accuracy of Java Profilers (PLDI'10).
Essentially, Java profilers may produce inaccurate results when sampling because samples can only be taken at safepoints. Since the placement of safepoints can be changed by the compiler, the execution of some methods may never be sampled by the profiler: the profiler is scheduled to record a sample (the time interval is up), but it must wait for the next safepoint, and if that safepoint has been, for example, moved around by the compiler, the method that would ideally be sampled is never observed.
As already explained by the previous answer, a safepoint is an event or a position in the code where the VM interrupts execution in order to run some internal VM code (for example, GC).
Safe-point polling is one way of implementing a safepoint, or rather the safepoint trigger. It means that the code being executed regularly checks a flag to see if a safepoint is required; if it is (e.g. because a GC has been requested), the thread is suspended and the safepoint is executed. See e.g. GC safe-point (or safepoint) and safe-region.
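The JVM implements the poll itself at the machine-code level (typically as a cheap read of a special polling page), but the idea of cooperative polling can be sketched in plain Java. This is only a conceptual analogue, not what the JIT actually emits; the class and field names here are made up for illustration.

    // Conceptual analogue of safepoint polling: the worker checks a flag at
    // well-defined points and parks itself when the "VM" asks it to stop.
    class CooperativeWorker implements Runnable {

        // In the real JVM this is a read of a polling page, not a Java field.
        static volatile boolean safepointRequested = false;

        @Override
        public void run() {
            while (true) {
                doSomeWork();      // between polls the thread cannot be stopped
                pollSafepoint();   // the "safepoint poll" the JIT inserts into compiled code
            }
        }

        private static void pollSafepoint() {
            if (safepointRequested) {
                parkUntilReleased();   // the thread is now "at a safepoint"
            }
        }

        private static void doSomeWork() { /* application code */ }

        private static void parkUntilReleased() { /* block until the VM operation has finished */ }
    }

If a long-running stretch of code contains no poll (the JIT may, for example, omit polls inside counted loops), the thread simply cannot reach a safepoint during that stretch, which is exactly why a sampler may never observe certain methods.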
This blog post discusses safe points.
Basically they are points in the code where the JITter allows interruptions for GC, stack traces etc.
The post also points out that safepoints delay stack samples and may not occur at the places where you would like them to, and that this is a problem.
In my opinion, that's a small problem.
The whole reason you take a stack sample (as opposed to just a program-counter sample) is to show you all the call-sites leading to the current state, because those are likely to be much more juicy sources of slowness than whatever the program counter is doing.
(If it's doing anything. You might be in the middle of I/O, where the PC is meaningless, but the call-sites are still just as important.)
If the stack sample has to wait a few cycles to get to a safe point, all that means is it happens at the end of a block of instructions, not in the middle.
If you examine the sample you can still get a good idea what's happening.
I'm hoping profiler writers come to realize they don't need to sweat the small stuff.
What's more important is not to miss the big stuff.
Related
I have a Java application and one of the methods is performance-critical.
I created a loop to call this method 10 times and I am checking for performance issues by using the profiler on every iteration. It turned out that the execution time decreases with each iteration; thus, the 10th iteration has a smaller execution time than the 9th iteration.
Any idea why such case is happening?
Could it be due to the loop overheads?
You are warming up the CPU caches and the JVM, and thus the performance changes.
Profilers put the JVM into an unusual mode, and depending on which profiling approach you are using, it may only be sampling at a regular interval.
I find that profilers are good for giving you relative measurements and for improving your understanding of the code, but always take their readings with a pinch of salt.
Do not trust just a single measurement.
Outside of using profilers, microbenchmarking is a good way to go, although it is a very tricky subject.
Note that HotSpot tends not to kick in and optimise the bytecode until the target code has been called 10,000 or more times.
http://java.dzone.com/articles/microbenchmarking-java, and How do I write a correct micro-benchmark in Java? may help to get you started. There is also a lot of good advice on the Mechanical Sympathy Forum.
A good microbenchmarking framework is JMH (http://openjdk.java.net/projects/code-tools/jmh/); it helps keep GC and other JVM stop-the-world events out of the timings, and it gives guidance on how to prevent HotSpot from optimising away the very code that you are trying to measure.
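For illustration, here is roughly what a JMH benchmark of a single hot method could look like. The class and method names (CriticalMethodBenchmark, criticalMethod) are placeholders for your own code, and the warmup/measurement settings are only a starting point.

    import java.util.concurrent.TimeUnit;

    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.BenchmarkMode;
    import org.openjdk.jmh.annotations.Fork;
    import org.openjdk.jmh.annotations.Measurement;
    import org.openjdk.jmh.annotations.Mode;
    import org.openjdk.jmh.annotations.OutputTimeUnit;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.State;
    import org.openjdk.jmh.annotations.Warmup;

    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    @Warmup(iterations = 5)        // lets HotSpot compile and optimise the code first
    @Measurement(iterations = 5)   // only these iterations are reported
    @Fork(1)                       // run in a fresh JVM to avoid profile pollution
    @State(Scope.Thread)
    public class CriticalMethodBenchmark {

        @Benchmark
        public int measureCriticalMethod() {
            // Returning the result stops HotSpot from treating the call as dead code.
            return criticalMethod(42);
        }

        // Placeholder for the performance-critical method under test.
        private int criticalMethod(int input) {
            return Integer.bitCount(input * 31);
        }
    }

Run it with the JMH runner; the warmup iterations absorb the JIT-compilation effect that makes the later iterations of a hand-written loop look faster.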
This is something I think I've seen before with other profiling tools in other environments, but it's particularly dramatic in this case.
I'm taking a CPU profile of a task that runs for about 12 minutes, and it's showing almost half the time spent in a method that literally does nothing: it's got an empty body. What can cause this? I don't believe that the method is being called a ridiculous number of times, certainly not to account for half the execution time.
For what it's worth, the method in question is called startContent() and it's used to notify a parsing event. The event is passed down a chain of filters (perhaps a dozen of them), and the startContent() method on each filter does almost nothing except to call startContent() on the next filter in the chain.
This is pure Java code, and I'm running it on a Mac.
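To make the shape of that code concrete, each filter essentially just forwards the event to the next receiver in the chain. This is only a simplified sketch of the delegation pattern described above, not the actual Saxon classes:

    // Simplified sketch of the delegation chain; not Saxon's real code.
    interface Receiver {
        void startContent();
    }

    class ProxyFilter implements Receiver {
        private final Receiver next;   // the next filter in the chain

        ProxyFilter(Receiver next) {
            this.next = next;
        }

        @Override
        public void startContent() {
            // Does almost nothing itself, just forwards the event.
            next.startContent();
        }
    }

    class FinalSink implements Receiver {
        @Override
        public void startContent() {
            // Empty body: this is the kind of method the sampler blames for half the run time.
        }
    }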
Attached is a screen shot of the CPU sampler output:
and here is a sample showing the call stack:
(After a delay due to vacation) Here are a couple of pictures showing the output from the profiler. These figures are much more like what I would expect the profile to look like. The profiler output seems entirely meaningful, while the sampler output is spurious.
As some of you will have guessed, the job in question is a run of the Saxon XML schema validator (on a 9Gb input file). The profile shows about half the time being spent validating element content against simple types (which happens during endElement processing) and about half being spent testing key constraints for uniqueness; the two profiler views highlight the activity involved in these two aspects of the task.
I'm not able to supply the data as it comes from a client.
I have not used VisualVM, but I suspect the problem is likely because of the instrumentation overhead on such an empty method. Here's the relevant passage in JProfiler's documentation (which I have used extensively):
If the method call recording type is set to Dynamic instrumentation, all methods of profiled classes are instrumented. This creates some overhead which is significant for methods that have very short execution times. If such methods are called very frequently, the measured times of those methods will be far too high. Also, due to the instrumentation, the hot spot compiler might be prevented from optimizing them. In extreme cases, such methods become the dominant hot spots although this is not true for an uninstrumented run. An example is the method of an XML parser that reads the next character. This method returns very quickly, but may be invoked millions of times in a short time span.
Basically, a profiler adds its own "time length detection code" around each method, but in an empty method the profiler will spend all its time doing that bookkeeping rather than actually running the method.
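To see why, here is a hand-rolled illustration of what dynamic instrumentation effectively adds around each method. The Profiler class below is made up for the example; it is not VisualVM's or JProfiler's real hook, just a stand-in for the injected bookkeeping.

    // Illustration only: a stand-in for the timing code that instrumentation injects.
    final class Profiler {
        static long enterMethod(String name) {
            return System.nanoTime();                  // record the entry timestamp
        }
        static void exitMethod(String name, long start) {
            long elapsed = System.nanoTime() - start;  // attribute the elapsed time to the method
            // a real profiler would accumulate 'elapsed' per call site here
        }
    }

    class InstrumentedFilter {
        // The original method body is empty, so the timing calls below are
        // essentially all the work that ever happens inside this method.
        public void startContent() {
            long start = Profiler.enterMethod("startContent");
            try {
                // empty body
            } finally {
                Profiler.exitMethod("startContent", start);
            }
        }
    }

For a method that does real work the bookkeeping is noise; for an empty method called millions of times it is the entire measured cost.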
I recommend, if it's possible, telling VisualVM to stop instrumenting that thread, if it supports such filtering.
It is generally assumed that using a profiler is much better (for finding performance problems, as opposed to measuring things) than - anything else, really - certainly than the bone-simple way of random pausing.
This assumption is only common wisdom - it has no basis in theory or practice.
There are numerous scholarly peer-reviewed papers about profiling, but none that I've read even address the point, let alone substantiate it.
It's a blind spot in academia, not a big one, but it's there.
Now to your question -
In the screenshot showing the call stack, that is what's known as the "hot path", accounting for roughly 60% of in-thread CPU time. Assuming the code with "saxon" in the name is what you're interested in, it is this:
net.sf.saxon.event.ReceivingContentHandler.startElement
net.sf.saxon.event.ProxyReceiver.startContent
net.sf.saxon.event.ProxyReceiver.startContent
net.sf.saxon.event.StartTagBuffer.startContent
net.sf.saxon.event.ProxyReceiver.startContent
com.saxonica.ee.validate.ValidationStack.startContent
com.saxonica.ee.validate.AttributeValidator.startContent
net.sf.saxon.event.TeeOutputter.startContent
net.sf.saxon.event.ProxyReceiver.startContent
net.sf.saxon.event.ProxyReceiver.startContent
net.sf.saxon.event.Sink.startContent
First, this looks to me like it has to be doing I/O, or at least waiting for some other process to give it content. If so, you should be looking at wall-clock time, not CPU time.
Second, the problem(s) could be at any of those call sites where a function calls the one below. If any such call is not truly necessary and could be skipped or done less often, it will reduce time by a significant fraction.
My suspicion is drawn to StartTagBuffer and to validate, but you know best.
There are other points I could make, but these are the major ones.
ADDED after your edit to the question.
I tend to assume you are looking for ways to optimize the code, not just ways to get numbers for their own sake.
It still looks like only CPU time, not wall-clock time, because there is no I/O in the hot paths. Maybe that's OK in your case, but what it means is, of your 12-minute wall clock time, 11 minutes could be spent in I/O wait, with 1 minute in CPU. If so, you could maybe cut out 30 seconds of fat in the CPU part, and only shorten the time by 30 seconds.
That's why I prefer sampling on wall-clock time, so I have overall perspective.
By looking at hot paths alone, you're not getting a true picture.
For example, if the hot path says function F is on the hot path for, say 40% of the time, that only means F costs no less than 40%. It could be much more, because it could be on other paths that aren't so hot. So you could have a juicy opportunity to speed things up by a lot, but it doesn't get much exposure in the specific path that the profiler chose to show you, so you don't give it much attention.
In fact, a big time-taker might not show up at all because on any specific hot path there's always something else a little bigger, like new, or because it goes by multiple names, such as templated collection class constructors.
It's not showing you any line-resolution information.
If you want to inspect a supposedly high-cost routine for the reason for the cost, you have to look at the lines within it. There's a tendency when looking at a routine to say "It's just doing what it's supposed to do.", but if you are looking at a specific costly line of code, which most often is a method call, you can ask "Is it really necessary to do this call? Maybe I already have the information." It's far more specific in suggesting what you could fix.
Can it actually show you some raw stack samples?
In my experience these are far more informative than any summary, like a hot path, that the profiler can present.
The thing to do is examine the sample and come to a full understanding of what the program was doing, and the reason why, at that point in time.
Then repeat for several more samples.
You will see things that don't need to be done, that you can fix to get substantial speedup.
(Unless the code is already optimal, in which case it will be nice to know.)
The point is, you're looking for problems, not measurements.
Statistically, it's very rough, but good enough, and no problem will escape.
My guess is that the method Sink.startContent actually is called a ridiculous number of times.
Your screenshot shows the Sampling tab, which usually produces realistic timings if used over a long enough interval. If you use the Profiler tab instead, you will also get the invocation count. (You'll also get less realistic timings and your program will get very, very slow, but you only need to do this for a few seconds to get a good idea of the invocation counts.)
It's hard to predict what optimizations and especially inlining HotSpot performs, and the sampling profiler can only attribute the time of inlined methods to the call sites. I suspect that some of the invocation code in saxon might for some reason be attributed to your empty callback function. In that case, you're just suffering the cost of XML, and switching to a different parser might be the only option.
I've had a lot of useful information and guidance from this thread, for which many thanks. However, I don't think the core question has been answered: why is the CPU sampling in VisualVM giving an absurdly high number of hits in a method that does nothing, and that isn't called any more often than many other methods?
For future investigations I will rely on the profiler rather than the sampler, now I have gained a bit of insight into how they differ.
From the profiler I haven't really gained a lot of new information about this specific task, in so far as it has largely confirmed what I would have guessed; but that itself is useful. It has told me that there's no magic bullet for speeding up this particular process, but it has put bounds on what might be achieved by some serious redesign, e.g. a possible future enhancement that appears to have some promise is generating a bytecode validator for each user-defined simple type in the schema.
I would like to provide my system with a way of detecting whether out of memory exception has occurred or not. The aim for this exercise is to expose this flag through JMX and act correspondingly (e.g. by configuring a relevant alert on the monitoring system), as otherwise these errors sit unnoticed for days.
Naive approach for this would be to set an uncaught exception handler for every thread and check whether the raised exception is instance of OutOfMemoryError and set a relevant flag. However, this approach isn't realistic for the following reasons:
The exception can occur anywhere, including 3rd party libraries. There is nothing I can do to prevent them catching Throwable and keeping it for themselves.
Libraries can spawn their own threads and I have no way of enforcing uncaught exception handlers for these threads.
One possible scenario I see is bytecode manipulation (e.g. attaching some sort of aspect on top of OutOfMemoryError), however I am not sure whether that's the right approach or whether it is doable in general.
We have -XX:+HeapDumpOnOutOfMemoryError enabled, but I don't see this as a solution for this problem as it was designed for something else - and it provides no Java callback when this happens.
Has anyone done this? How would you solve it or suggest solving it? Any ideas are welcome.
You could use an out-of-memory warning system; this OutOfMemoryError Warning System can be an inspiration. You can configure a listener which is invoked after a certain memory threshold (say 80%) is breached, and use that invocation to start taking corrective measures.
We use something similar, where we suspend the component's service when its memory usage reaches 80% and start the clean-up action; the component comes back only when the used memory drops below another configurable threshold.
There is an article based on the post that Scorpion has already given a link to.
The technique is again based on using MemoryPoolMXBean and subscribing to the "memory threshold exceeded" event, but it's slightly different from what was described in the original post.
The author states that when you subscribe to the plain "memory threshold exceeded" event, there is a possibility of a "false alarm": imagine a situation where memory consumption is above the threshold, but a garbage collection will be performed soon and a lot of memory will be freed by it. In fact, that situation is quite common in real-world applications.
Fortunately, there is another threshold, "collection usage threshold", and a corresponding event, which is fired based on memory consumption right after garbage collection. When you receive that event, you can be much more confident you're running out of memory.
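A minimal sketch of that approach using the standard java.lang.management API (the 80% figure and the listener body are placeholders; you would raise your JMX flag or alert where the println is):

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryNotificationInfo;
    import java.lang.management.MemoryPoolMXBean;
    import java.lang.management.MemoryType;
    import javax.management.NotificationEmitter;

    public class LowMemoryWatcher {

        public static void install() {
            // Arm the "collection usage" threshold on every heap pool that supports it.
            for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
                if (pool.getType() == MemoryType.HEAP && pool.isCollectionUsageThresholdSupported()) {
                    long max = pool.getUsage().getMax();
                    if (max > 0) {
                        pool.setCollectionUsageThreshold((long) (max * 0.8)); // 80% is an example value
                    }
                }
            }

            // The platform MemoryMXBean emits the threshold-exceeded notifications.
            NotificationEmitter emitter = (NotificationEmitter) ManagementFactory.getMemoryMXBean();
            emitter.addNotificationListener((notification, handback) -> {
                if (MemoryNotificationInfo.MEMORY_COLLECTION_THRESHOLD_EXCEEDED
                        .equals(notification.getType())) {
                    // Heap is still above the threshold *after* a GC: raise the flag/alert here.
                    System.err.println("Heap usage still above threshold after GC");
                }
            }, null, null);
        }
    }

Because the event is based on usage measured right after a collection, it avoids the false alarms described above.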
We have -XX:+HeapDumpOnOutOfMemoryError enabled, but I don't see this as a solution for this problem as it was designed for something else - and it provides no Java callback when this happens.
This flag should be all that you need. Set the output directory of the resulting heap dump file in some known location that you check regularly. Having a callback would be of no use to you. If you are out of memory, you can't guarantee that the callback code has enough memory to execute! All you can do is collect the data and use an external program to analyze why you ran out of memory. Any attempt at recovering in process can create bigger problems.
Bytecode instrumentation is possible - but hard. HPjmeter's monitoring tool has the ability to predict future OOM's (with caveats) -- but only on HP-UX/Itanium based systems. You could dedicate a daemon thread to calculating used memory in process and trigger an alert when this is exceeded, but really you're not solving the problem.
You can catch any and all uncaught exceptions with the static Thread.setDefaultUncaughtExceptionHandler. Of course, it doesn't help if someone is catching all Throwables. (I don't think anything will, though with an OOME I'd suspect you'd get a cascading effect until something outside the offending try block blew up.) Hopefully the thread would have released enough memory for the exception handler to work; OOM errors do tend to multiply as you try to deal with them.
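For what it's worth, a minimal sketch of that handler, keeping the flag in a static AtomicBoolean that a JMX MBean could expose (the class name and the alerting side are placeholders):

    import java.util.concurrent.atomic.AtomicBoolean;

    public final class OomFlag {

        // Flag that a JMX MBean attribute could expose.
        private static final AtomicBoolean OOM_SEEN = new AtomicBoolean(false);

        public static void install() {
            Thread.UncaughtExceptionHandler previous = Thread.getDefaultUncaughtExceptionHandler();
            Thread.setDefaultUncaughtExceptionHandler((thread, throwable) -> {
                if (throwable instanceof OutOfMemoryError) {
                    OOM_SEEN.set(true);   // keep the handler as allocation-free as possible
                }
                if (previous != null) {
                    previous.uncaughtException(thread, throwable);
                }
            });
        }

        public static boolean isOutOfMemorySeen() {
            return OOM_SEEN.get();
        }
    }

As noted above, this only catches errors that actually reach the top of a thread, so it complements the MemoryPoolMXBean approach rather than replacing it.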
Debugging performance problems using a standard debugger is almost hopeless, since the level of detail is too high. Another way is to use a profiler, but profilers seldom give me good information, especially when there are GUI and background threads involved, as I never know whether the user was actually waiting for the computer or not. A different way is simply pressing Control + C and seeing where in the code it stops.
What I would really like is Fast Forward, Play, Pause and Rewind functionality combined with some visual representation of the code. This means that I could run the code on Fast Forward until I navigate the GUI to the critical spot. Then I would set the code to run in slow mode, while getting some visual representation of which lines are being executed (possibly some kind of zoomed-out view of the code). I could, for example, set the execution speed to something like 0.0001x. I believe I would get a very good visualization this way of whether the problem is inside a specific module, or maybe in the communication between modules.
Does this exist? My specific need is in Python, but I would be interested in seeing such functionality in any language.
The "Fast Forward to critical spot" function already exists in any debugger, it's called a "breakpoint". There are indeed debuggers that can slow down execution, but that will not help you debug performance problems, because it doesn't slow down the computer. The processor and disk and memory is still exactly as slow as before, all that happens is that the debugger inserts delays between each line of code. That means that every line of code suddenly take more or less the same time, which means that it hides any trace of where the performance problem is.
The only way to find the performance problems is to record every call done in the application and how long it took. This is what a profiler does. Indeed, using a profiler is tricky, but there probably isn't a better option. In theory you could record every call and the timing of every call, and then play that back and forwards with a rewind, but that would use an astonishing amount of memory, and it wouldn't actually tell you anything more than a profiler does (indeed, it would tell you less, as it would miss certain types of performance problems).
You should be able to figure out, with the profiler, what is taking a long time. Note that this can be because certain function calls take a long time since they do a lot of processing, or because system calls take a long time since something (network/disk) is slow. Or it can be that a very fast call is made loads and loads of times. A profiler will help you figure this out. But it helps if you can turn the profiler on just for the critical section (reduces noise) and if you can run that critical section many times (improves accuracy).
The methods you're describing, and many of the comments, seem to me to be relatively weak probabilistic attempts to understand the performance impact. Profilers do work perfectly well for GUIs and other idle-thread programs, though it takes a little practice to read them. I think your best bet is there, though -- learn to use the profiler better, that's what it's for.
The specific use you describe would simply be to attach the profiler but don't record yet. Navigate the GUI to the point in question. Hit the profiler record button, do the action, and stop the recording. View the results. Fix. Do it again.
I assume there is a phase in the app's execution that takes too long - i.e. it makes you wait.
I assume what you really want is to see what you could change to make it faster.
A technique that works is random-pausing.
You run the app under the debugger, and in the part of its execution that makes you wait, pause it, and examine the call stack. Do this a few times.
Here are some ways your program could be spending more time than necessary.
I/O that you didn't know about and didn't really need.
Allocating and releasing objects very frequently.
Runaway notifications on data structures.
others too numerous to mention...
No matter what it is, when it is happening, an examination of the call stack will show it.
Once you know what it is, you can find a better way to do it, or maybe not do it at all.
If the program is taking 5 seconds when it could take 1 second, then the probability you will see the problem on each pause is 4/5. In fact, any function call you see on more than one stack sample, if you could avoid doing it, will give you a significant speedup.
AND, nearly every possible bottleneck can be found this way.
Don't think about function timings or how many times they are called. Look for lines of code that show up often on the stack, that you don't need.
Example Added: If you take 5 samples of the stack, and there's a line of code appearing on 2 of them, then it is responsible for about 2/5 = 40% of the time, give or take. You don't know the precise percent, and you don't need to know.
(Technically, on average it is (2+1)/(5+2) = 3/7 = 43%. Not bad, and you know exactly where it is.)
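If you'd rather not pause a debugger by hand, a crude in-process version of the same idea can be sketched as below: take a few coarse stack snapshots and read them yourself. The sample count, interval, and the startWorkload placeholder are arbitrary choices for the example.

    import java.util.Map;

    // Poor man's sampler: take a handful of stack snapshots and inspect them manually,
    // just as you would inspect the pauses taken in a debugger.
    public class StackSnapshots {

        public static void main(String[] args) throws InterruptedException {
            startWorkload();                   // placeholder: start the code you are investigating

            for (int sample = 1; sample <= 5; sample++) {
                Thread.sleep(1000);            // roughly one snapshot per second
                System.out.println("=== sample " + sample + " ===");
                for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
                    System.out.println(e.getKey().getName());
                    for (StackTraceElement frame : e.getValue()) {
                        System.out.println("    at " + frame);
                    }
                }
            }
        }

        private static void startWorkload() {
            // Hypothetical stand-in for the slow part of your application.
            Thread t = new Thread(() -> {
                while (true) {
                    Math.sqrt(Math.random());  // busy work so the sampler has something to see
                }
            }, "worker");
            t.setDaemon(true);
            t.start();
        }
    }

Any call site that keeps showing up across snapshots is exactly the kind of candidate described above.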
I'm profiling a program using sampling profiling in YourKit and JProfiler, and also "manually" (I launch it and press Ctrl-Break several times to get thread dumps).
All three methods give me extremely strange results: some tens of percents of time spent in a 3-line method that does not even do any allocation or synchronization and doesn't have loops etc. Moreover, after I made this method into a NOP and even removed its invocation completely, the observable program performance didn't change at all (although it got a negligible memory leak, since it was a method for freeing a cheap resource).
I'm thinking that this might be because of the constraints that JVM puts on the moments at which a thread's stacktrace may be taken, and it somehow turns out that in my program it is exactly the moments where this method is invoked, although there is absolutely nothing special about it or the context in which it is invoked.
What can be the explanation for this phenomenon?
What are the aforementioned constraints?
What further measurements can I take to clarify the situation?
To my mind, these results only show that this method gets called a huge number of times. Since its code is quite small, and it may be called as a leaf of the invocation tree, its impact on your profiling results seems negligible. However, I have seen that kind of weird result many times.
Some 3rd-party libraries cause the heap dumps to go completely haywire due to unexpected usage patterns; for example, if cglib is used, it will mask the actual cause of the issues and instead show a lot of Proxy objects (if I remember correctly) filling up the VM.
So in short, code generation and reflection may cause the stats to go wrong.
When you did the Ctrl-Break several times and got thread dumps, I'm curious what you saw. The call stacks are, to my mind, the most useful information. If your 3-liner is on the stack a large % of time, then you can see why by looking at where it is called from, and where that is called from, etc. Those call sites are just as responsible for the time being spent as the 3-liner is.
If the stack traces seem to make no sense, it may be because they are being delayed until after something completes. If that is so, look up the stack to see what has just completed, because the break could have occurred within that.