I have an interpreter written in Java. I am trying to test the performance results of various optimisations in the interpreter. To do this I parse the code and then repeatedly run the interpreter over it; this continues until I get 5 runs that differ by a very small margin (0.1 s in the times below), at which point the mean is taken and printed. No I/O or randomness happens in the interpreter. Yet if I run the whole benchmark again I get different run times:
91.8s
95.7s
93.8s
97.6s
94.6s
94.6s
107.4s
I have tried, to no avail, the server and client VM, the serial and parallel GC, large pages, and both Windows and Linux, all on the 1.6.0_14 JVM. The computer has no processes running in the background. So I am asking: what may be causing these large variations, or how can I find out?
The actual issue turned out to be that the program had to iterate to a fixed-point solution and the values were stored in a HashSet. The hash values differed between runs, resulting in a different iteration order, which in turn changed the number of iterations needed to reach the solution.
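A minimal sketch of the mechanism, assuming the set elements rely on the default identity hash code (the Node class below is made up for illustration; run the program twice and the printed order will very likely differ):

import java.util.HashSet;
import java.util.Set;

// Iteration order over a HashSet of objects that fall back to the default
// identity hash code is not stable across JVM runs, which is one way a
// fixed-point loop can need a different number of iterations per run.
public class HashOrderDemo {
    static final class Node {                     // no hashCode()/equals() override:
        final String name;                        // identity hash is used instead,
        Node(String name) { this.name = name; }   // and it varies between runs
    }

    public static void main(String[] args) {
        Set<Node> worklist = new HashSet<>();
        for (String n : new String[] {"a", "b", "c", "d"}) {
            worklist.add(new Node(n));
        }
        // Print the order actually seen by the iteration.
        for (Node n : worklist) {
            System.out.print(n.name + " ");
        }
        System.out.println();
    }
}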
"Wall clock time" is rarely a good measurement for benchmarking. A modern OS is extremely unlikely to "[have] no processes running in the background" -- for all you know, it could be writing dirty block buffers to disk, because it's decided that there's no other contention.
Instead, I recommend using ThreadMXBean to track actual CPU consumption.
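A minimal sketch of this, where runWorkload() is only a placeholder for whatever you are measuring: ThreadMXBean reports per-thread CPU time in nanoseconds, which excludes time the thread spends descheduled by the OS.

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Measure CPU time consumed by the current thread rather than wall-clock time.
public class CpuTimeTimer {
    public static void main(String[] args) {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        if (!bean.isCurrentThreadCpuTimeSupported()) {
            System.out.println("CPU time measurement not supported on this JVM");
            return;
        }
        long startCpu = bean.getCurrentThreadCpuTime();
        runWorkload();                                   // the code being benchmarked
        long cpuNanos = bean.getCurrentThreadCpuTime() - startCpu;
        System.out.printf("CPU time: %.3f s%n", cpuNanos / 1e9);
    }

    private static void runWorkload() {
        // placeholder for the interpreter run being measured
        double x = 0;
        for (int i = 0; i < 50_000_000; i++) x += Math.sqrt(i);
        if (x < 0) System.out.println(x);                // keep the loop from being eliminated
    }
}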
Your variations don't look that large. It's simply the nature of the beast that there are other things running outside of your direct control, both in the OS and the JVM, and you're not likely to get exact results.
Things that could affect runtime:
if your test runs are creating objects (may be invisible to you, within library calls, etc) then your repeats may trigger a GC
Different GC algorithms and configurations will react differently and have different thresholds for incremental collection. You could try running System.gc() before every run, although the JVM is not guaranteed to collect when you call it (it always has when I've played with it); a minimal sketch of this follows the list. Depending on the size of your test, and how many iterations you're running, this may be an unpleasantly (and nearly uselessly) slow thing to wait for.
Are you doing any sort of randomization within your tests? e.g. if you're testing boxed Integers, values between -128 and 127 are cached by the JVM and may be handled slightly differently in memory.
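Here is the System.gc()-before-each-run idea from the second point above as a minimal sketch; runWorkload() is a placeholder for whatever is actually being timed, and the Thread.sleep() pause is only a crude settling delay, not a guarantee:

// Request a GC before each timed run so collections triggered by garbage from a
// previous run are less likely to land inside the measurement. System.gc() is only a hint.
public class GcBetweenRuns {
    public static void main(String[] args) throws InterruptedException {
        for (int run = 0; run < 5; run++) {
            System.gc();
            Thread.sleep(100);                        // crude pause; no guarantee the GC ran
            long start = System.nanoTime();
            runWorkload();                            // placeholder for the real test
            long elapsed = System.nanoTime() - start;
            System.out.printf("run %d: %.3f s%n", run, elapsed / 1e9);
        }
    }

    private static void runWorkload() {
        long sum = 0;
        for (int i = 0; i < 10_000_000; i++) sum += i;
        if (sum == 42) System.out.println(sum);       // defeat dead-code elimination
    }
}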
Ultimately I don't think it's possible to get an exact figure, probably the best you can do is an average figure around the cluster of results.
The garbage collection may be responsible. Even though your logic is the same, it may be that the GC logic is being scheduled on external clock/events.
But I don't know that much about JVMs GC implementation.
This seems like a significant variation to me; I would try running with -verbose:gc.
You should be able to get the variation to much less than a second if your process has no I/O, output, or network activity of any significance.
I suggest profiling your application; there are highly likely to be significant savings if you haven't done this already.
Related
Is there any easy, cheap (not requiring testing the program on many hardware configurations), and painless method to define the hardware requirements (CPU, RAM, etc.) needed to run my own program? How should this be done?
I have a quite resource-hungry program written in Java and I don't know how to define a hardware specification that will be enough to run this application smoothly.
No, I don't think there is any generally applicable way to determine the minimum requirements that does not involve testing on some specified reference hardware.
You may be able to find some of the limitations by using Virtual Machines of some kind - it is easier to modify the parameters of some VM than modifying hardware. But there are artifacts generated by the interaction between host and VM that may influence your results.
It is also difficult to define the criteria for "acceptable performance" in general without knowing a lot about use cases.
Many programs will use more resources if they are available, but can also get along with less.
For example, consider a program using a thread pool with a size based on the number of CPU cores. When running on a CPU with more cores, more work can be done in parallel, but at the same time overhead due to thread creation, synchronisation and aggregation of results increases. The effects are non-linear in the number of CPUs and depend a lot on the actual program and data. Similarly, the effects of decreasing available memory range from potentially throwing OutOfMemoryErrors for some inputs (but possibly not for others) to just running GC a bit more frequently (and the effects of that depend on the GC strategy, ranging from noticeable freezes to just a bit more CPU load).
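A minimal sketch of that pattern, for concreteness: the pool, and therefore how much hardware the program tries to use, simply scales with whatever availableProcessors() reports on the machine it lands on.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// A thread pool sized from the number of available cores.
public class CoreSizedPool {
    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        for (int i = 0; i < cores * 4; i++) {
            final int task = i;
            pool.submit(() -> System.out.println("task " + task
                    + " on " + Thread.currentThread().getName()));
        }
        pool.shutdown();
    }
}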
All that is without even considering that programs don't usually live in isolation - they run on an operating system in parallel with other tasks that also consume resources.
I am currently doing some Java optimization. It seems the best approach to assess my progress is to do repeated runs and collect run time statistics using System.nanoTime.
Most of my background has been with embedded DSP applications. In my embedded development I had access to CPU cycle counters, which are a great measure for optimization. The JRE acts like a synthetic CPU; is there a way to get information on the number of instructions or JRE clock equivalents that were executed?
Thanks in advance for any hints. J.R.
The byte code count is almost certainly useless for what you want. Bytecodes are entirely virtual/notional in terms of run-time performance. The only thing that matters is elapsed time for code which has been warmed up.
I suggest you have a look at JMH for micro-benchmarking your code. http://openjdk.java.net/projects/code-tools/jmh/
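For illustration, a minimal JMH benchmark might look like the sketch below; the class and workload are made up, it assumes the JMH annotations (jmh-core) are on the classpath, and it is run through JMH's own harness (the Maven archetype or the Runner API), which handles warm-up and reports averages with error bounds.

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class InterpreterBenchmark {
    private int[] data;

    @Setup
    public void setup() {
        data = new int[1024];
        for (int i = 0; i < data.length; i++) data[i] = i;
    }

    @Benchmark
    public long sum() {            // returning the result stops dead-code elimination
        long s = 0;
        for (int v : data) s += v;
        return s;
    }
}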
If you come from the DSP/realtime background I suggest looking at the latency distribution and minimising your allocation rate.
NOTE: The JVM injects "safe points" between instructions to check if the thread needs to stop e.g. to perform garbage collection. A garbage collection can take seconds. However safepoints are often optimised away to avoid slowing down your program.
In short this means the time between instructions might be nothing, it might be something, it might be seconds or even minutes so I wouldn't bother counting them.
Edit: Of the several extremely generous and helpful responses this question has already received, it is obvious to me that I didn't make an important part of this question clear when I asked it earlier this morning. The answers I've received so far are more about optimizing applications & removing bottlenecks at the code level. I am aware that this is way more important than trying to get an extra 3- or 5% out of your JVM!
This question assumes we've already done just about everything we could to optimize our application architecture at the code level. Now we want more, and the next place to look is at the JVM level and garbage collection; I've changed the question title accordingly. Thanks again!
We've got a "pipeline" style backend architecture where messages pass from one component to the next, with each component performing different processes at each step of the way.
Components live inside of WAR files deployed on Tomcat servers. Altogether we have about 20 components in the pipeline, living on 5 different Tomcat servers (I didn't choose the architecture or the distribution of WARs for each server). We use Apache Camel to create all the routes between the components, effectively forming the "connective tissue" of the pipeline.
I've been asked to optimize the GC and general performance of each server running a JVM (5 in all). I've spent several days now reading up on GC and performance tuning, and have a pretty good handle on what each of the different JVM options do, how the heap is organized, and how most of the options affect the overall performance of the JVM.
My thinking is that the best way to optimize each JVM is not to optimize it as a standalone. I "feel" (that's about as far as I can justify it!) that trying to optimize each JVM locally without considering how it will interact with the other JVMs on other servers (both upstream and downstream) will not produce a globally-optimized solution.
To me it makes sense to optimize the entire pipeline as a whole. So my first question is: does SO agree, and if not, why?
To do this, I was thinking about creating a LoadTester that would generate input and feed it to the first endpoint in the pipeline. This LoadTester might also have a separate "Monitor Thread" that would check the last endpoint for throughput. I could then do all sorts of processing where we check for average end-to-end travel time for messages, maximum throughput before faulting, etc.
The LoadTester would generate the same pattern of input messages over and over again. The variable in this experiment would be the JVM options passed to each Tomcat server's startup options. I have a list of about 20 different options I'd like to pass the JVMs, and figured I could just keep tweaking their values until I found near-optimal performance.
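As a rough, hedged sketch of that LoadTester idea (sendToFirstEndpoint() and countProcessedAtLastEndpoint() are hypothetical stand-ins for however the real pipeline is fed and observed, e.g. HTTP or JMS calls through Camel):

import java.util.concurrent.atomic.AtomicLong;

// A driver feeds a fixed message pattern into the first endpoint while a
// monitor thread watches the last endpoint and reports throughput.
public class LoadTester {
    private static final AtomicLong sent = new AtomicLong();

    public static void main(String[] args) throws InterruptedException {
        Thread monitor = new Thread(() -> {
            long last = 0;
            while (!Thread.currentThread().isInterrupted()) {
                long done = countProcessedAtLastEndpoint();
                System.out.println("throughput: " + (done - last) + " msg/s, in flight: "
                        + (sent.get() - done));
                last = done;
                try { Thread.sleep(1000); } catch (InterruptedException e) { return; }
            }
        });
        monitor.setDaemon(true);
        monitor.start();

        for (int i = 0; i < 100_000; i++) {           // same input pattern every run
            sendToFirstEndpoint("message-" + (i % 100));
            sent.incrementAndGet();
        }
        Thread.sleep(2000);                           // let the monitor print a couple of reports
    }

    private static void sendToFirstEndpoint(String msg) { /* hypothetical */ }
    private static long countProcessedAtLastEndpoint() { return sent.get(); /* hypothetical */ }
}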
This may not be the absolute best way to do this, but it's the best way I could design with what time I've been given for this project (about a week).
Second question: what does SO think about this setup? How would SO create an "optimizing solution" any differently?
Last but not least, I'm curious as to what sort of metrics I could use as a basis of measure and comparison. I can really only think of:
Find the JVM option config that produces the fastest average end-to-end travel time for messages
Find the JVM option config that produces the largest volume throughput without crashing any of the servers
Any others? Any reasons why those 2 are bad?
Reading this back, I can see how it might be construed as a monolithic question, but really what I'm asking is how SO would optimize JVMs running along a pipeline; feel free to cut and dice my solution however you like.
Thanks in advance!
Let me go up a level and say I did something similar in a large C app many years ago.
It consisted of a number of processes exchanging messages across interconnected hardware.
I came up with a two-step approach.
Step 1. Within each process, I used this technique to get rid of any wasteful activities.
That took a few days of sampling, revising code, and repeating.
The idea is that there is a chain, and the first thing to do is remove inefficiencies from the links.
Step 2. This part is laborious but effective: Generate time-stamped logs of message traffic.
Merge them together into a common timeline.
Look carefully at specific message sequences.
What you're looking for is
Was the message necessary, or was it a retransmission resulting from a timeout or other avoidable reason?
When was the message sent, received, and acted upon? If there is a significant delay between being received and acted upon, what is the reason for that delay? Was it just a matter of being "in line" behind another process that was doing I/O, for example? Could it have been fixed with different process priorities?
This activity took me about a day to generate logs, combine them, find a speedup opportunity, and revise code.
At this rate, after about 10 working days, I had found/fixed a number of problems, and improved the speed dramatically.
What is common about these two steps is I'm not measuring or trying to get "statistics".
If something is spending too much time, that very fact exposes it to a diligent programmer taking a close, meticulous look at what is happening.
I would start by finding the recommended JVM values for your hardware/software mix, or just start with what is already out there.
Next I would make sure that I have monitoring in place to measure business throughput and SLAs.
I would not try to tweak just the GC if there is no reason to.
First you will need to find the major bottlenecks in your application: whether it is I/O bound, SQL bound, etc.
The key here is to MEASURE, IDENTIFY the TOP bottlenecks, FIX them, and conduct another iteration with a repeatable load.
HTH...
The biggest trick I am aware of when running multiple JVMs on the same machine is limiting the number of cores the GC will use. Otherwise, when one JVM does a full GC it will attempt to grab every core, impacting the performance of all the JVMs even though they are not performing a GC. One suggestion is to limit the number of GC threads to 5/8 of the cores or less (I can't remember where that is written).
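If I remember correctly, the flag usually cited for this on HotSpot is -XX:ParallelGCThreads (verify it against your JVM version and collector); for Tomcat you would add it to CATALINA_OPTS. For example, on an 8-core box shared by several JVMs you might try something like:

java -XX:ParallelGCThreads=4 ...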
I think you should test the system as a whole to ensure you have realistic interaction between the services. However, I would assume you may need to tune each service differently.
Changing command-line options is useful if you cannot change the code. However, if you profile and optimise the code you can make far more difference than by tuning the GC parameters (in which case you may need to tune them again afterwards).
For this reason, I would only change the command-line parameters as a last resort, after there is little improvement left to be made in the application code.
I'm working on refactoring some code in Java, so I'm timing things to make sure the code doesn't get any slower. However, the new refactored code seems to take more time than the original code. Remarkably, when I run the code with a profiler, the new code is significantly faster than the old code. The primary difference is that the old code is recursive, while the new code is iterative. Can a profiler affect the recursive code by a factor of several hundred thousand while only affecting the iterative code by a factor of 1.5?
I'm running on Mac OS X 10.6.6, 3 GB RAM, 2.4 GHz CPU, using the default NetBeans 6.9 profiler with Java 1.6.0_22 64-Bit Server.
(Both methods have self-timing code using System.currentTimeMillis() to allow me to compare the times when not using a profiler, but this shouldn't affect things noticeably.)
Yes. Most profilers do instrumentation at the level of method invocations. In the recursive form, the profiler must take many more measurements than in the iterative form. While profilers do try to subtract their overhead from the reported numbers, this is very difficult to do reliably. Different profilers will be better or worse at this.
I'm working on refactoring some code in Java, so I'm timing things to make sure the code doesn't get any slower. However, the new refactored code seems to take more time than the original code.
Yes. Code typically runs slower under a profiler.
Therefore, you should compare times of old / new version of your application, either both run under the profiler, or both run normally.
Also be aware that a profiler can actually distort performance characteristics. And different profilers may disagree about where the code hotspots are. So it is a good idea to run/compare versions of your application without profiling before you adopt an optimization that you are trialing.
(Both methods have self-timing code using System.currentTimeMillis() to allow me to compare the times when not using a profiler, but this shouldn't affect things noticeably.)
There are traps here too:
The best possible granularity of currentTimeMillis() is 1 millisecond, but on some OSes it might be tens of milliseconds.
If you don't take steps to avoid this, your manual timings can include distorting overheads such as JIT compilation time and GC time; see the sketch below for one way to reduce this. The effects can be quite subtle.
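A minimal sketch of a hand-rolled timing loop that tries to sidestep both traps, assuming the code under test can be called as a method (workload() is a placeholder): it uses System.nanoTime() and discards warm-up iterations so JIT compilation is mostly out of the way.

public class ManualTiming {
    public static void main(String[] args) {
        for (int i = 0; i < 10_000; i++) workload();   // warm-up, results discarded

        int runs = 20;
        long best = Long.MAX_VALUE;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            workload();
            best = Math.min(best, System.nanoTime() - start);
        }
        System.out.printf("best of %d runs: %.3f ms%n", runs, best / 1e6);
    }

    static long sink;                                  // written to so the JIT cannot drop the work

    private static void workload() {
        long s = 0;
        for (int i = 0; i < 1_000_000; i++) s += i;
        sink = s;
    }
}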
I would say if you want to measure speed, just measure speed, don't profile. They're not the same thing. Instrumenting profilers put a lot of overhead into each function call, and if all you want is an overall speed difference, it won't be accurate because you're partly measuring the cost of the instrumentation itself.
If you want to find out what is taking the time, that is different from measuring. A wall-clock-time stack-sampling profiler (not instrumentation) that reports line-level percentages is your best bet. It doesn't matter if it slows the program down, because its purpose is not to measure speed; its purpose is to find out where the time is going and why, on a percentage basis. It would be OK if it or something else slowed the program down by 10%, or 10 times, as long as it showed you where the time was being taken, independent of speed.
I only say this because lots of people are confused about this point, and the confusion gets solidified into lots of profilers.
More on that subject.
Is there any Java profiler that allows profiling short-lived applications? The profilers I have found so far seem to work only with applications that keep running until user termination. However, I want to profile applications that work like command-line utilities: they run and exit immediately. Tools like VisualVM or the NetBeans Profiler do not even recognize that the application was run.
I am looking for something similar to Python's cProfile, in that the profiler result is returned when the application exits.
You can profile your application using the JVM's built-in HPROF agent.
It provides two methods:
sampling the active methods on the stack
timing method execution times using injected bytecode (BCI, byte code injection)
Sampling
This method reveals how often methods were found on top of the stack.
java -agentlib:hprof=cpu=samples,file=profile.txt ...
Timing
This method counts the actual invocations of a method. The instrumenting code has been injected by the JVM beforehand.
java -agentlib:hprof=cpu=times,file=profile.txt ...
Note: this method will slow down the execution time drastically.
For both methods, the default filename is java.hprof.txt if the file= option is not present.
Full help can be obtained using java -agentlib:hprof=help or can be found in Oracle's documentation.
Sun Java 6 has the java -Xprof switch that'll give you some profiling data.
-Xprof output cpu profiling data
A program running for 30 seconds is not short-lived. What you want is a profiler that can start your program, instead of you having to attach to a running process. I believe most profilers can do that, but you would most likely prefer one integrated into an IDE. Have a look at NetBeans.
Profiling a short-running Java application has a couple of technical difficulties:
Profiling tools typically work by sampling the processor's SP or PC register periodically to see where the application is currently executing. If your application is short-lived, insufficient samples may be taken to get an accurate picture.
You can address this by modifying the application to run a number of times in a loop, as suggested by #Mike. You'll have problems if your app calls System.exit(), but the main problem is ...
The performance characteristics of a short-lived Java application are likely to be distorted by JVM warm-up effects. A lot of time will be spent in loading the classes required by your app. Then your code (and library code) will be interpreted for a bit, until the JIT compiler has figured out what needs to be compiled to native code. Finally, the JIT compiler will spend time doing its work.
I don't know if profilers attempt to compensate for JVM warm-up effects. But even if they do, these effects influence your application's real behavior, and there is not a great deal that the application developer can do to mitigate them.
Returning to my previous point... if you run a short-lived app in a loop you are actually doing something that modifies its normal execution pattern and removes the JVM warm-up component. So when you optimize the method that takes (say) 50% of the execution time in the modified app, that is really 50% of the time excluding JVM warm-up. If JVM warm-up is using (say) 80% of the execution time when the app is executed normally, you are actually optimizing 50% of 20% ... and that is not worth the effort.
If it doesn't take long enough, just wrap a loop around it, an infinite loop if you like. That will have no effect on the inclusive time percentages spent either in functions or in lines of code. Then, given that it's taking plenty of time, I just rely on this technique. That tells which lines of code, whether they are function calls or not, are costing the highest percentage of time and would therefore gain the most if they could be avoided.
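A minimal sketch of that wrapping, where runTheRealProgramOnce() is a hypothetical stand-in for the application's actual entry point:

// Repeat the real work so a sampling profiler has enough wall-clock time to
// collect samples; inclusive percentages per function/line are unaffected by the repetition.
public class LoopWrapper {
    public static void main(String[] args) {
        while (true) {                      // or a large fixed count if you prefer
            runTheRealProgramOnce();        // hypothetical stand-in for the app's main work
        }
    }

    private static void runTheRealProgramOnce() {
        double x = 0;
        for (int i = 1; i < 1_000_000; i++) x += Math.log(i);
        if (x < 0) System.out.println(x);   // keep the work observable
    }
}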
Start your application with profiling turned on, waiting for the profiler to attach. Any profiler that conforms to the Java profiling architecture should work; I've tried this with the NetBeans profiler.
Basically, when your application starts, it waits for a profiler to be attached before executing. So, technically, even the very first line of code can be profiled.
With this approach, you can profile all kinds of things: threads, memory, CPU, method/class invocation times and durations...
http://profiler.netbeans.org/
The SD Java Profiler can capture statement block execution-count data no matter how short your run is. Relative execution counts will tell you where the time is spent.
You can use a measurement (metering) recording: http://www.jinspired.com/site/case-study-scala-compiler-part-9
You can also inspect the resulting snapshots: http://www.jinspired.com/site/case-study-scala-compiler-part-10
Disclaimer: I am the architect of JXInsight/OpenCore.
I suggest you try YourKit. It can profile from the start and dump the results when the program finishes. You have to pay for it, but you can get an eval license or use the EAP version without one (time limited).
YourKit can take a snapshot of a profiling session, which can later be analyzed in the YourKit GUI. I use this feature to profile a short-lived command-line application I work on. See my answer to this question for details.