I'm trying to optimize my code, but it's giving me problems.
I've got this list of objects:
List<DataDescriptor> descriptors;
public class DataDescriptor {
public int id;
public String name;
}
There is 1700 objects with unique id (0-1699) and some name, it's used to decode what type of data I get later on.
The method that I try to optimize works like that:
public void processData(ArrayList<DataDescriptor> descriptors, ArrayList<IncomingData> incomingDataList) {
for (IncomingData data : incomingDataList) {
DataDescriptor desc = descriptors.get(data.getDataDescriptorId());
if (desc.getName().equals("datatype_1")) {
doOperationOne(data);
} else if (desc.getName().equals("datatype_2")) {
doOperationTwo(data);
} else if ....
.
.
} else if (desc.getName().equals("datatype_16")) {
doOperationSixteen(data);
}
}
}
This method is called about milion times when processing data file and every time incomingDataList contains about 60 elements, so this set of if/elses is executed about 60 milion times.
This takes about 15 seconds on my desktop (i7-8700).
Changing code to test integer ids instead of strings obviously shaves off few seconds, which is nice, but I hoped for more :)
I tried profiling using VisualVM, but for this method (with string testing) it says that 66% of time is spent in "Self time" (which I believe would be all this string testing? and why doesnt it says that it is in String.equals method?) and 33% is spent on descriptors.get - which is simple get from ArrayList and I don't think I can optimize it any further, other than trying to change how data is structured in memory (still, this is Java, so I don't know if this would help a lot).
I wrote "simple benchmark" app to isolate this String vs int comparisons. As I expected, comparing integers was about 10x faster than String.equals when I simply run the application, but when I profiled it in VisualVM (I wanted to check if in benchmark ArrayList.get would also be so slow), strangely both methods took exactly the same amount of time. When using VisualVM's Sample, instead of Profile, application finished with expected results (ints being 10x faster), but VisualVM was showing that in his sample both types of comparisons took the same amount of time.
What is the reason for getting such totally different results when profiling and not? I know that there is a lot of factors, there is JIT and profiling maybe interferes with it etc. - but in the end, how do you profile and optimize Java code, when profiling tools change how the code runs? (if it's the case)
Profilers can be divided into two categories: instrumenting and sampling. VisualVM includes both, but both of them have disadvantages.
Instrumenting profilers use bytecode instrumentation to modify classes. They basically insert the special tracing code into every method entry and exit. This allows to record all executed methods and their running time. However, this approach is associated with a big overhead: first, because the tracing code itself can take much time (sometimes even more than the original code); second, because the instrumented code becomes more complicated and prevents from certain JIT optimizations that could be applied to the original code.
Sampling profilers are different. They do not modify your application; instead they periodically take a snapshot of what the application is doing, i.e. the stack traces of currently running threads. The more often some method occurs in these stack traces - the longer (statistically) is the total execution time of this method.
Sampling profilers typically have much smaller overhead; furthermore, this overhead is manageable, since it directly depends on the profiling interval, i.e. how often the profiler takes thread snapshots.
The problem with sampling profilers is that JDK's public API for getting stack traces is flawed. JVM does not get a stack trace at any arbitrary moment of time. It rather stops a thread in one of the predefined places where it knows how to reliably walk the stack. These places are called safepoints. Safepoints are located at method exits (excluding inlined methods), and inside the loops (excluding short counted loops). That's why, if you have a long linear peace of code or a short counted loop, you'll never see it in a sampling profiler that relies on JVM standard getStackTrace API.
This problem is known as Safepoint Bias. It is described well in a great post by Nitsan Wakart. VisualVM is not the only victim. Many other profilers, including commercial tools, also suffer from the same issue, because the original problem is in the JVM rather than in a particular profiling tool.
Java Flight Recorder is much better, as long as it does not rely on safepoints. However, it has its own flaws: for example, it cannot get a stack trace, when a thread is executing certain JVM intrinsic methods like System.arraycopy. This is especially disappointing, since arraycopy is a frequent bottleneck in Java applications.
Try async-profiler. The goal of the project is exactly to solve the above issues. It should provide a fair view of the application performance, while having a very small overhead. async-profiler works on Linux and macOS. If you are on Windows, JFR is still your best bet.
Related
I was trying to get timing data for various Java programs. Then I had to perform some regression analysis based on this timing data. Here are the two methods I used to get the timing data:
System.currentTimeMillis(): I used this initially, but I wanted the timing data to be constant when the same program was run multiple
times. The variation was huge in this case. When two instances of the
same code were executed in parallel, the variation was even more. So
I dropped this and started looking for some profilers.
-XX countBytecodes Flag in Hotspot JVM: Since the variation in timing data was huge, I thought of measuring the number of byte codes executed, when this code was executed. This should have given a more static count, when the same program was executed multiple times. But This also had variations. When the programs were executed sequentially, the variations were small, but during parellel runs of the same code, the variations were huge. I also tried compiling using -Xint, but the results were similar.
So I am looking for some profiler that could give me the count of byte codes executed when a code is executed. The count should remain constant (or correlation close to 1) across runs of the same program. Or if there could be some other metric based on which I could get timing data, which should stay almost constant across multiple runs.
I wanted the timing data to be constant when the same program was run multiple times
That is not possible on a real machine unless it is designed for hard real time system which your machine will almost certainly be not.
I am looking for some profiler that could give me the count of byte codes executed when a code is executed.
Assuming you could do this, it wouldn't prove anything. You wouldn't be able to see for example that ++ is 90x cheaper than % depending on the hardware you run it on. You won't be able to see that a branch miss of an if is up to 100x more expensive than a speculative branch. You wouldn't be able to see that a memory access to an area of memory which triggers a TLB miss can be more expensive than copying 4 KB of data.
if there could be some other metric based on which I could get timing data, which should stay almost constant across multiple runs.
You can run it many times and take the average. This will hide any high results/outliers and give you a favourable idea of throughput. It can be a reproducible number for a given machine, if run long enough.
This is something I think I've seen before with other profiling tools in other environments, but it's particularly dramatic in this case.
I'm taking a CPU profile of a task that runs for about 12 minutes, and it's showing almost half the time spent in a method that literally does nothing: it's got an empty body. What can cause this? I don't believe that the method is being called a ridiculous number of times, certainly not to account for half the execution time.
For what it's worth, the method in question is called startContent() and it's used to notify a parsing event. The event is passed down a chain of filters (perhaps a dozen of them), and the startContent() method on each filter does almost nothing except to call startContent() on the next filter in the chain.
This is pure Java code, and I'm running it on a Mac.
Attached is a screen shot of the CPU sampler output:
and here is a sample showing the call stack:
(After a delay due to vacation) Here are a couple of pictures showing the output from the profiler. These figures are much more what I would expect the profile to look like. The profiler output seems entirely meaningful, while the sampler output is spurious.
As some of you will have guessed, the job in question is a run of the Saxon XML schema validator (on a 9Gb input file). The profile shows about half the time being spent validating element content against simple types (which happens during endElement processing) and about half being spent testing key constraints for uniqueness; the two profiler views show highlight the activity involved in these two aspects of the task.
I'm not able to supply the data as it comes from a client.
I have not used VisualVM, but I suspect the problem is likely because of the instrumentation overhead on such an empty method. Here's the relevant passage in JProfiler's documentation (which I have used extensively):
If the method call recording type is set to Dynamic instrumentation, all methods of profiled classes are instrumented. This creates some overhead which is significant for methods that have very short execution times. If such methods are called very frequently, the measured time of those method will be far to high. Also, due to the instrumentation, the hot spot compiler might be prevented from optimizing them. In extreme cases, such methods become the dominant hot spots although this is not true for an uninstrumented run. An example is the method of an XML parser that reads the next character. This method returns very quickly, but may be invoked millions of times in a short time span.
Basically, a profiler adds it's own "time length detection code", essentially, but in an empty method the profiler will spend all it's time doing that rather than actually allowing the method to run.
I recommend, if it's possible, to tell VisualVM to stop instrumenting that thread, if it supports such a filtering.
It is generally assumed that using a profiler is much better (for finding performance problems, as opposed to measuring things) than - anything else, really - certainly than the bone-simple way of random pausing.
This assumption is only common wisdom - it has no basis in theory or practice.
There are numerous scholarly peer-reviewed papers about profiling, but none that I've read even address the point, let alone substantiate it.
It's a blind spot in academia, not a big one, but it's there.
Now to your question -
In the screenshot showing the call stack, that is what's known as the "hot path", accounting for roughly 60% of in-thread CPU time. Assuming the code with "saxon" in the name is what you're interested in, it is this:
net.sf.saxon.event.ReceivingContentHandler.startElement
net.sf.saxon.event.ProxyReceiver.startContent
net.sf.saxon.event.ProxyReceiver.startContent
net.sf.saxon.event.StartTagBuffer.startContent
net.sf.saxon.event.ProxyReceiver.startContent
com.saxonica.ee.validate.ValidationStack.startContent
com.saxonica.ee.validate.AttributeValidator.startContent
net.sf.saxon.event.TeeOutputter.startContent
net.sf.saxon.event.ProxyReceiver.startContent
net.sf.saxon.event.ProxyReceiver.startContent
net.sf.saxon.event.Sink.startContent
First, this looks to me like it has to be doing I/O, or at least waiting for some other process to give it content. If so, you should be looking at wall-clock time, not CPU time.
Second, the problem(s) could be at any of those call sites where a function calls the one below. If any such call is not truly necessary and could be skipped or done less often, it will reduce time by a significant fraction.
My suspicion is drawn to StartTagBuffer and to validate, but you know best.
There are other points I could make, but these are the major ones.
ADDED after your edit to the question.
I tend to assume you are looking for ways to optimize the code, not just ways to get numbers for their own sake.
It still looks like only CPU time, not wall-clock time, because there is no I/O in the hot paths. Maybe that's OK in your case, but what it means is, of your 12-minute wall clock time, 11 minutes could be spent in I/O wait, with 1 minute in CPU. If so, you could maybe cut out 30 seconds of fat in the CPU part, and only shorten the time by 30 seconds.
That's why I prefer sampling on wall-clock time, so I have overall perspective.
By looking at hot paths alone, you're not getting a true picture.
For example, if the hot path says function F is on the hot path for, say 40% of the time, that only means F costs no less than 40%. It could be much more, because it could be on other paths that aren't so hot. So you could have a juicy opportunity to speed things up by a lot, but it doesn't get much exposure in the specific path that the profiler chose to show you, so you don't give it much attention.
In fact, a big time-taker might not show up at all because on any specific hot path there's always something else a little bigger, like new, or because it goes by multiple names, such as templated collection class constructors.
It's not showing you any line-resolution information.
If you want to inspect a supposedly high-cost routine for the reason for the cost, you have to look at the lines within it. There's a tendency when looking at a routine to say "It's just doing what it's supposed to do.", but if you are looking at a specific costly line of code, which most often is a method call, you can ask "Is it really necessary to do this call? Maybe I already have the information." It's far more specific in suggesting what you could fix.
Can it actually show you some raw stack samples?
In my experience these are far more informative than any summary, like a hot path, that the profiler can present.
The thing to do is examine the sample and come to a full understanding of what the program was doing, and the reason why, at that point in time.
Then repeat for several more samples.
You will see things that don't need to be done, that you can fix to get substantial speedup.
(Unless the code is already optimal, in which case it will be nice to know.)
The point is, you're looking for problems, not measurements.
Statistically, it's very rough, but good enough, and no problem will escape.
My guess is that the method Sink.startContent actually is called a ridiculous number of times.
Your screenshot shows the Sampling tab, which usually results in realistic timings if user over a long enoung interval. If you use Profiler tab instead, you will also get the invocation count. (You'll also get less realistic timings and your program will get very very slow, but you only need to do this for a few seconds to get a good idea about the invocation counts).
It's hard to predict what optimizations and especially inlining HotSpot performs, and the sampling profiler can only attribute the time of inlined methods to the call sites. I suspect that some of the invocation code in saxon might for some reason be attributed to your empty callback function. In that case, you're just suffering the cost of XML, and switching to a different parser might be the only option.
I've had a lot of useful information and guidance from this thread, for which many thanks. However, I don't think the core question has been answered: why is the CPU sampling in VisualVM giving an absurdly high number of hits in a method that does nothing, and that isn't called any more often than many other methods?
For future investigations I will rely on the profiler rather than the sampler, now I have gained a bit of insight into how they differ.
From the profiler I haven't really gained a lot of new information about this specific task, in so far as it has largely confirmed what I would have guessed; but that itself is useful. It has told me that there's no magic bullet to speeding up this particular process, but has put bounds on what might be achieved by some serious redesign, e.g a possible future enhancement that appears to have some promise is generating a bytecode validator for each user-defined simple type in the schema.
I have been looking into benchmarking lately, I have always been interested in logging program data etc. I was interested in knowing if we can implement our own memory usage code and implement our own time consumption code efficently inside our program. I know how to check long it takes for a code to run:
public static void main(String[]args){
long start = System.currentTimeMillis();
// code
System.out.println(System.currentTimeMillis() - start);
}
I also looked into Robust Java benchmarking, Part 1: Issues, this tutorial is very comprehensive. Displays the negative effects of System.currentTimeMillis();. The tutorial then suggests that we use System.nanoTime(); (making it more accurate?).
I also looked at Determining Memory Usage in Java for memory usage. The website shows how you can implement it. The code that has been provided looks inefficent because the person is calling
long L = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
After this he calls System.gc(); (4 * 4) = 16 times. Then repeating the process again.
Doesn't this also take up memory?
So in conlusion, is it possible to implement an efficent benchmarking code inside your java program?
Yes it is possible to effectively implement performance benchmarks in java code. The important question is that any kind of performance benchmark is going to add its own overhead and how much of it do you want. System.currentMill..() is good enough benchmark for performance and in most of the cases nanoTime() is an overkill.
For memory System.gc will show you varied results for different runs (as gc run is never guranteed.) I generally use Visual VM for memory profiling (its free) and then use TDA for dumps analyzing.
One way to do it less invasively is using Aspect oriented programing. You can create just one Aspect that runs on a particular Annotation or set of methods and write an #Around advice to collect performance data.
Here is a small snippet:
public class TestAspect {
#LogPerformance
public void thisMethodNeedsToBeMonitored(){
// Do Something
}
public void thisMethodNeedsToBeMonitoredToo(){
// Do Something
}
}
#interface LogPerformance{}
#Aspect
class PerformanceAspect{
#Around("the pointcut expression to pick up all " +
"the #PerfMonitor annotated methods")
public void logPerformance(){
// log performance here
// Log it to a file
}
}
It may be impossible to benchmark without some Heisenberg effect, i.e. your benching code also being measured. However, if you measure at a high enough granularity the effect will be negligible.
Any benchmarking code is going to be less efficient than non-benchmarked code just based on having more stuff to do. That said, Java in particular causes issues as the article states due to garbage collection happening whenever the jre feels like it. Even the documentation for System.gc says it makes a "best effort".
As for your specific questions:
System.gc shouldn't take up more memory, but it will take processor resources.
It is somewhat possible based on what you're trying to benchmark. There will always be some interference. If you are willing to go outside of your code, there are tools like VisualVM to watch memory usage from outside of your application.
Edit: Corrected the wording of what System.gc docs say.
I want to filter what classes are being cpu-profiled in Java VisualVm (Version 1.7.0 b110325). For this, I tried under Profiler -> Settings -> CPU-Settings to set "Profile only classes" to my package under test, which had no effect. Then I tried to get rid of all java.* and sun.* classes by setting them in "Do not profile classes", which had no effect either.
Is this simply a bug? Or am I missing something? Is there a workaround? I mean other than:
paying for a better profiler
doing sampling by hand (see One could use a profiler, but why not just halt the program?)
switch to the Call Tree view, which is no good since only the Profiler view gives me the percentages of consumed CPU per method.
I want to do this mainly to get halfway correct percentages of consumed CPU per method. For this, I need to get rid of the annoying measurements, e.g. for sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run() (around 70%). Many users seem to have this problem, see e.g.
Java VisualVM giving bizarre results for CPU profiling - Has anyone else run into this?
rmi.transport.tcp.tcptransport Connectionhandler consumes much CPU
Can't see my own application methods in Java VisualVM.
The reason you see sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run() in the profile is that you left the option Profile new Runnables selected.
Also, if you took a snapshot of your profiling session you would be able to see the whole callstack for any hotspot method - this way you could navigate from the run() method down to your own application logic methods, filtering out the noise generated by the Profile new Runnables option.
OK, since your goal is to make the code run as fast as possible, let me suggest how to do it.
I'm no expert on VisualVM, but I can tell you what works. (Only a few profilers actually tell you what you need to know, which is - which lines of your code are on the stack a healthy fraction of wall-clock time.)
The only measuring I ever bother with is some stopwatch on the overall time, or alternatively, if the code has something like a framerate, the number of frames per second. I don't need any sort of further precision breakdown, because it's at best a remote clue to what's wasting time (and more often totally irrelevant), when there's a very direct way to locate it.
If you don't want to do random-pausing, that's up to you, but it's proven to work, and here's an example of a 43x speedup.
Basically, the idea is you get a (small, like 10) number of stack samples, taken at random wall-clock times.
Each sample consists (obviously) of a list of call sites, and possibly a non-call site at the end.
(If the sample is during I/O or sleep, it will end in the system call, which is just fine. That's what you want to know.)
If there is a way to speed up your code (and there almost certainly is), you will see it as a line of code that appears on at least one of the stack samples.
The probability it will appear on any one sample is exactly the same as the fraction of time it uses.
So if there's a call site or other line of code using a healthy fraction of time, and you can avoid executing it, the overall time will decrease by that fraction.
I don't know every profiler, but one I know that can tell you that is Zoom.
Others may be able to do it.
They may be more spiffy, but they don't work any quicker or better than the manual method when your purpose is to maximize performance.
I'm profiling a program using sampling profiling in YourKit and JProfiler, and also "manually" (I launch it and press Ctrl-Break several times to get thread dumps).
All three methods give me extremely strange results: some tens of percents of time spent in a 3-line method that does not even do any allocation or synchronization and doesn't have loops etc. Moreover, after I made this method into a NOP and even removed its invocation completely, the observable program performance didn't change at all (although it got a negligible memory leak, since it was a method for freeing a cheap resource).
I'm thinking that this might be because of the constraints that JVM puts on the moments at which a thread's stacktrace may be taken, and it somehow turns out that in my program it is exactly the moments where this method is invoked, although there is absolutely nothing special about it or the context in which it is invoked.
What can be the explanation for this phenomenon?
What are the aforementioned constraints?
What further measurements can I take to clarify the situation?
To my minds, these results only show that this method gets called a huge number of times. Since its code is quite small, and it may be called as an invocation tree leaf, its impact on your profiling results seems neglectable. However, I had many time that kind of weird results.
Some 3rd party libraries cause the heap dumps to go completely haywire due to unexpected usage patterns, for example if cglib is used, it will mask away the actual cause of the issues and instead show a lot of Proxy objects (if I remember correctly) filling up the VM instead.
So in short, code generation and reflection may cause the stats to go wrong.
When you did the Ctrl-Break several times and got thread dumps, I'm curious what you saw. The call stacks are, to my mind, the most useful information. If your 3-liner is on the stack a large % of time, then you can see why by looking at where it is called from, and where that is called from, etc. Those call sites are just as responsible for the time being spent as the 3-liner is.
If the stack traces seem to make no sense, it may be because they are being delayed until after something completes. If that is so, look up the stack to see what has just completed, because the break could have occurred within that.