Benchmarking inside Java code

I have been looking into benchmarking lately; I have always been interested in logging program data, etc. I was interested in knowing whether we can implement our own memory-usage and time-consumption measurement code efficiently inside our program. I know how to check how long it takes for code to run:
public static void main(String[] args) {
    long start = System.currentTimeMillis();
    // code
    System.out.println(System.currentTimeMillis() - start);
}
I also looked into Robust Java benchmarking, Part 1: Issues; this tutorial is very comprehensive and shows the drawbacks of System.currentTimeMillis(). The tutorial then suggests that we use System.nanoTime() instead (making it more accurate?).
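For reference, the same timing pattern with System.nanoTime() looks roughly like this (a minimal sketch; nanoTime measures elapsed time only and has no relation to wall-clock time):
public static void main(String[] args) {
    long start = System.nanoTime();
    // code under test
    long elapsed = System.nanoTime() - start;
    System.out.println(elapsed / 1_000_000 + " ms (" + elapsed + " ns)");
}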
I also looked at Determining Memory Usage in Java for memory usage. The website shows how you can implement it. The code that has been provided looks inefficient because the person is calling
long L = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
After this he calls System.gc() (4 * 4) = 16 times, and then repeats the whole process.
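For reference, the measurement itself boils down to this pattern (a rough sketch of what that page does, not its exact code):
// Rough sketch of the used-memory measurement pattern described above.
// The repeated System.gc() calls only *hint* at a collection; the JVM may ignore them.
Runtime runtime = Runtime.getRuntime();
for (int i = 0; i < 16; i++) {
    runtime.gc();
}
long usedBytes = runtime.totalMemory() - runtime.freeMemory();
System.out.println("Approx. used memory: " + usedBytes + " bytes");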
Doesn't this also take up memory?
So, in conclusion, is it possible to implement efficient benchmarking code inside your Java program?

Yes, it is possible to effectively implement performance benchmarks in Java code. The important point is that any kind of performance benchmark adds its own overhead; the question is how much of it you are willing to accept. System.currentTimeMillis() is a good enough benchmark for performance, and in most cases nanoTime() is overkill.
For memory, System.gc() will show you varied results across runs (a GC run is never guaranteed). I generally use VisualVM for memory profiling (it's free) and then use TDA for analyzing dumps.
One way to do it less invasively is to use aspect-oriented programming. You can create a single aspect that runs on a particular annotation or set of methods and write an @Around advice to collect performance data.
Here is a small snippet:
public class TestAspect {

    @LogPerformance
    public void thisMethodNeedsToBeMonitored() {
        // Do something
    }

    public void thisMethodNeedsToBeMonitoredToo() {
        // Do something
    }
}

@interface LogPerformance {}

@Aspect
class PerformanceAspect {

    @Around("the pointcut expression to pick up all " +
            "the @LogPerformance annotated methods")
    public Object logPerformance(ProceedingJoinPoint pjp) throws Throwable {
        // Measure performance here (e.g. System.nanoTime() before and after proceed())
        // and log it to a file, then let the original method run
        return pjp.proceed();
    }
}

It may be impossible to benchmark without some Heisenberg effect, i.e. your benchmarking code also being measured. However, if you measure at a high enough granularity, the effect will be negligible.

Any benchmarking code is going to be less efficient than non-benchmarked code simply because it has more work to do. That said, Java in particular causes issues, as the article states, because garbage collection happens whenever the JRE feels like it. Even the documentation for System.gc says it makes a "best effort".
As for your specific questions:
System.gc shouldn't take up more memory, but it will take processor resources.
It is somewhat possible based on what you're trying to benchmark. There will always be some interference. If you are willing to go outside of your code, there are tools like VisualVM to watch memory usage from outside of your application.
Edit: Corrected the wording of what System.gc docs say.

Related

How to understand and optimize high self time of a factory method

I profiled CPU usage of a use case and looked into the biggest call-time contributors in the call tree. There I stumbled upon a method with quite a high self time.
I checked the method and found the following code
@Override
public IArticleDataProvider getArticleDataProvider() {
    return new ArticleDataProvider();
}
The method does nothing else but instantiation. The call tree shows that the instantiation itself isn't slow - no black magic - so how can I make sense of this 117 ms delay? I reran my usage and the second time it was within microseconds, as expected. Also, the <clinit> wasn't needed anymore, which makes sense.
I'm aware that single samples are not a good benchmark for performance, only a starting point for deeper dives, but is this something to be expected while profiling? At the very least it calls for outlier detection.
This sounds like more than just a JVM warm-up issue, right?
Are there more factors to be taken into account here?
This is my first real case using Java profiling. I've attended a couple of talks about Java profiling and read some guides beforehand. I'm thankful for any feedback.
It looks simply like the time needed for class loading. That time is not included in the <clinit> node; the latter covers only the invocation of the static initializer.

Profiling Java code changes execution times

I'm trying to optimize my code, but it's giving me problems.
I've got this list of objects:
List<DataDescriptor> descriptors;
public class DataDescriptor {
    public int id;
    public String name;
}
There are 1700 objects with unique ids (0-1699) and some name; this is used to decode what type of data I get later on.
The method that I try to optimize works like that:
public void processData(ArrayList<DataDescriptor> descriptors, ArrayList<IncomingData> incomingDataList) {
    for (IncomingData data : incomingDataList) {
        DataDescriptor desc = descriptors.get(data.getDataDescriptorId());

        if (desc.getName().equals("datatype_1")) {
            doOperationOne(data);
        } else if (desc.getName().equals("datatype_2")) {
            doOperationTwo(data);
        } else if ....
            .
            .
        } else if (desc.getName().equals("datatype_16")) {
            doOperationSixteen(data);
        }
    }
}
This method is called about a million times while processing a data file, and every time incomingDataList contains about 60 elements, so this set of if/elses is executed about 60 million times.
This takes about 15 seconds on my desktop (i7-8700).
Changing the code to test integer ids instead of strings obviously shaves off a few seconds, which is nice, but I hoped for more :)
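For illustration, the integer-id variant amounts to something like this (a rough sketch; the handler table and its setup are hypothetical, not the actual code):
// Hypothetical sketch: dispatch on the numeric descriptor id instead of comparing names.
@FunctionalInterface
interface DataOperation {
    void apply(IncomingData data);
}

// in the same class as processData; filled once at startup,
// e.g. operationsById[0] = this::doOperationOne;
private final DataOperation[] operationsById = new DataOperation[1700];

public void processData(ArrayList<IncomingData> incomingDataList) {
    for (IncomingData data : incomingDataList) {
        DataOperation op = operationsById[data.getDataDescriptorId()];
        if (op != null) {
            op.apply(data);
        }
    }
}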
I tried profiling using VisualVM, but for this method (with string testing) it says that 66% of the time is spent in "Self time" (which I believe would be all this string testing? and why doesn't it say it is in the String.equals method?) and 33% is spent on descriptors.get - which is a simple get from an ArrayList, and I don't think I can optimize it any further, other than trying to change how the data is structured in memory (still, this is Java, so I don't know if this would help a lot).
I wrote "simple benchmark" app to isolate this String vs int comparisons. As I expected, comparing integers was about 10x faster than String.equals when I simply run the application, but when I profiled it in VisualVM (I wanted to check if in benchmark ArrayList.get would also be so slow), strangely both methods took exactly the same amount of time. When using VisualVM's Sample, instead of Profile, application finished with expected results (ints being 10x faster), but VisualVM was showing that in his sample both types of comparisons took the same amount of time.
What is the reason for getting such totally different results when profiling and not? I know that there is a lot of factors, there is JIT and profiling maybe interferes with it etc. - but in the end, how do you profile and optimize Java code, when profiling tools change how the code runs? (if it's the case)
Profilers can be divided into two categories: instrumenting and sampling. VisualVM includes both, but both of them have disadvantages.
Instrumenting profilers use bytecode instrumentation to modify classes. They basically insert special tracing code at every method entry and exit. This makes it possible to record all executed methods and their running time. However, this approach carries a big overhead: first, because the tracing code itself can take a lot of time (sometimes even more than the original code); second, because the instrumented code becomes more complicated and prevents certain JIT optimizations that could be applied to the original code.
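Conceptually, the injected tracing amounts to something like this (a hand-written illustration; the Profiler hooks are hypothetical, and real tools rewrite bytecode rather than source):
// What an instrumenting profiler effectively turns a method into.
public int computeSomething(int x) {
    Profiler.methodEntered("computeSomething");    // injected at method entry
    try {
        return x * 2;                              // original method body
    } finally {
        Profiler.methodExited("computeSomething"); // injected at every exit path
    }
}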
Sampling profilers are different. They do not modify your application; instead they periodically take a snapshot of what the application is doing, i.e. the stack traces of the currently running threads. The more often a method occurs in these stack traces, the longer (statistically) its total execution time.
Sampling profilers typically have much smaller overhead; furthermore, this overhead is manageable, since it directly depends on the profiling interval, i.e. how often the profiler takes thread snapshots.
The problem with sampling profilers is that the JDK's public API for getting stack traces is flawed. The JVM does not take a stack trace at an arbitrary moment in time. Rather, it stops a thread at one of the predefined places where it knows how to reliably walk the stack. These places are called safepoints. Safepoints are located at method exits (excluding inlined methods) and inside loops (excluding short counted loops). That's why, if you have a long linear piece of code or a short counted loop, you'll never see it in a sampling profiler that relies on the JVM's standard getStackTrace API.
This problem is known as Safepoint Bias. It is described well in a great post by Nitsan Wakart. VisualVM is not the only victim. Many other profilers, including commercial tools, also suffer from the same issue, because the original problem is in the JVM rather than in a particular profiling tool.
Java Flight Recorder is much better, since it does not rely on safepoints. However, it has its own flaws: for example, it cannot get a stack trace when a thread is executing certain JVM intrinsic methods like System.arraycopy. This is especially disappointing, since arraycopy is a frequent bottleneck in Java applications.
Try async-profiler. The goal of the project is exactly to solve the above issues. It should provide a fair view of the application performance, while having a very small overhead. async-profiler works on Linux and macOS. If you are on Windows, JFR is still your best bet.

Why are floating point operations much faster with a warmup phase?

I initially wanted to test something different with floating-point performance optimisation in Java, namely the performance difference between division by 5.0f and multiplication by 0.2f (multiplication seems to be slower without warm-up but faster with it, by a factor of about 1.5 in each case).
After studying the results I noticed that I had forgotten to add a warm-up phase, as is so often suggested when doing performance optimisations, so I added it. And, to my utter surprise, the code turned out to be about 25 times faster on average over multiple test runs.
I tested it with the following code:
public static void main(String args[])
{
    float[] test = new float[10000];
    float[] test_copy;

    //warmup
    for (int i = 0; i < 1000; i++)
    {
        fillRandom(test);
        test_copy = test.clone();
        divideByTwo(test);
        multiplyWithOneHalf(test_copy);
    }

    long divisionTime = 0L;
    long multiplicationTime = 0L;

    for (int i = 0; i < 1000; i++)
    {
        fillRandom(test);
        test_copy = test.clone();
        divisionTime += divideByTwo(test);
        multiplicationTime += multiplyWithOneHalf(test_copy);
    }

    System.out.println("Divide by 5.0f: " + divisionTime);
    System.out.println("Multiply with 0.2f: " + multiplicationTime);
}

public static long divideByTwo(float[] data)
{
    long before = System.nanoTime();
    for (float f : data)
    {
        f /= 5.0f;
    }
    return System.nanoTime() - before;
}

public static long multiplyWithOneHalf(float[] data)
{
    long before = System.nanoTime();
    for (float f : data)
    {
        f *= 0.2f;
    }
    return System.nanoTime() - before;
}

public static void fillRandom(float[] data)
{
    Random random = new Random();
    for (float f : data)
    {
        f = random.nextInt() * random.nextFloat();
    }
}
Results without warm-up phase:
Divide by 5.0f: 382224
Multiply with 0.2f: 490765
Results with warm-up phase:
Divide by 5.0f: 22081
Multiply with 0.2f: 10885
Another interesting change that I cannot explain is the reversal of which operation is faster (division vs. multiplication). As mentioned earlier, without the warm-up the division seems to be a tad faster, while with the warm-up it seems to be twice as slow.
I tried adding an initialization block setting the values to something random, but it didn't affect the results, and neither did adding multiple warm-up phases. The numbers on which the methods operate are the same, so that cannot be the reason.
What is the reason for this behaviour? What is this warm-up phase and how does it influence the performance, why are the operations so much faster with a warm-up phase, and why does the faster operation change?
Before the warm-up, Java will be running the byte codes via an interpreter; think how you would write a program that could execute Java byte codes in Java. After warm-up, HotSpot will have generated native assembler for the CPU that you are running on, making use of that CPU's feature set. There is a significant performance difference between the two: the interpreter will run many, many CPU instructions for a single byte code, whereas HotSpot generates native assembler code just as gcc does when compiling C code. That is, the difference between the time to divide and the time to multiply will ultimately come down to the CPU that you are running on, where each will be just a single CPU instruction.
The second part of the puzzle is that HotSpot also records statistics that measure the runtime behaviour of your code; when it decides to optimise the code, it will use those statistics to perform optimisations that are not necessarily possible at compilation time. For example, it can reduce the cost of null checks, branch mispredictions and polymorphic method invocations.
In short, one must discard the results pre-warmup.
Brian Goetz wrote a very good article here on this subject.
========
APPENDED: overview of what 'JVM Warm-up' means
JVM 'warm up' is a loose phrase, and is no longer strictly speaking a single phase or stage of the JVM. People tend to use it to refer to the idea of where JVM performance stabilizes after compilation of the JVM byte codes to native byte codes. In truth, when one starts to scratch under the surface and delves deeper into the JVM internals it is difficult not to be impressed by how much Hotspot is doing for us. My goal here is just to give you a better feel for what Hotspot can do in the name of performance, for more details I recommend reading articles by Brian Goetz, Doug Lea, John Rose, Cliff Click and Gil Tene (amongst many others).
As already mentioned, the JVM starts by running Java through its interpreter. While strictly speaking not 100% correct, one can think of an interpreter as a large switch statement inside a loop that iterates over every JVM byte code (command). Each case within the switch statement is a JVM byte code, such as adding two values together, invoking a method, invoking a constructor and so forth. The overhead of the iteration, and of jumping around between the commands, is very large. Thus execution of a single command will typically use over 10x more assembly instructions, which means > 10x slower, as the hardware has to execute so many more instructions and the caches get polluted by this interpreter code which we would ideally rather have focused on our actual program. Think back to the early days of Java, when Java earned its reputation for being very slow; this is because it was originally a fully interpreted language.
Later on, JIT compilers were added to Java; these compilers would compile Java methods to native CPU instructions just before the methods were invoked. This removed all of the overhead of the interpreter and allowed the execution of code to be performed in hardware. While execution within hardware is much faster, this extra compilation created a stall on startup for Java, and this is partly where the terminology of a 'warm-up phase' took hold.
The introduction of HotSpot to the JVM was a game changer. Now the JVM would start up faster, because it would start life running Java programs with its interpreter, and individual Java methods would be compiled in a background thread and swapped in on the fly during execution. The generation of native code can also be done at differing levels of optimisation, sometimes using very aggressive optimisations that are strictly speaking incorrect, and then de-optimising and re-optimising on the fly when necessary to ensure correct behaviour. For example, class hierarchies imply a large cost in figuring out which method will be called, as HotSpot has to search the hierarchy and locate the target method. HotSpot can become very clever here: if it notices that only one class has been loaded, it can assume that will always be the case and optimise and inline methods accordingly. Should another class get loaded that tells HotSpot there is actually a decision between two methods to be made, it will remove its previous assumptions and recompile on the fly.

The full list of optimisations that can be made under different circumstances is very impressive, and is constantly changing. HotSpot's ability to record information and statistics about the environment it is running in, and the workload it is currently experiencing, makes the optimisations very flexible and dynamic. In fact, it is very possible that over the lifetime of a single Java process the code for that program will be regenerated many times over as the nature of its workload changes. This arguably gives HotSpot a large advantage over more traditional static compilation, and is largely why a lot of Java code can be considered to be just as fast as writing C code. It also makes understanding microbenchmarks a lot harder; in fact, it makes the JVM code itself much more difficult for the maintainers at Oracle to understand, work with and diagnose problems in.

Take a minute to raise a pint to those guys; HotSpot, and the JVM as a whole, is a fantastic engineering triumph that rose to the fore at a time when people were saying that it could not be done. It is worth remembering that, because after a decade or so it is quite a complex beast ;)
So, given that context, in summary: warming up a JVM in microbenchmarks means running the target code over 10k times and throwing the results away, so as to give the JVM a chance to collect statistics and to optimise the 'hot regions' of the code. 10k is a magic number because the Server HotSpot implementation waits for that many method invocations or loop iterations before it starts to consider optimisations. I would also advise having method calls between the core test runs, because while HotSpot can do 'on-stack replacement' (OSR), it is not common in real applications and it does not behave exactly the same as swapping out whole implementations of methods.
You aren't measuring anything useful "without a warm-up phase"; you're measuring the speed of interpreted code plus how long it takes for the on-stack replacement to be generated. Maybe divisions cause compilation to kick in earlier.
There are sets of guidelines and various packages for building microbenchmarks that don't suffer from these sorts of issues. I would suggest that you read the guidelines and use the ready-made packages if you intend to continue doing this sort of thing.
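For anyone continuing down this path, a minimal JMH sketch of the division-vs-multiplication comparison might look like the following (the class and method names are illustrative, not from the question; for simplicity the data is not refilled per invocation):
import java.util.Random;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class FloatOpsBenchmark {

    float[] data;

    @Setup
    public void setup() {
        // fixed seed so every run sees the same input
        Random random = new Random(42);
        data = new float[10_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = random.nextFloat();
        }
    }

    @Benchmark
    public float[] divideByFive() {
        for (int i = 0; i < data.length; i++) {
            data[i] /= 5.0f;
        }
        return data; // returning the result keeps the loop from being optimized away
    }

    @Benchmark
    public float[] multiplyByPointTwo() {
        for (int i = 0; i < data.length; i++) {
            data[i] *= 0.2f;
        }
        return data;
    }
}
JMH runs warm-up iterations and separate forks by default, so the interpreted-vs-compiled effect discussed above stays out of the measured iterations.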

Does periodic garbage collection help JVM performance?

I just encountered the following code (slightly simplified):
/* periodically requests garbagecollect to improve memory usage and
   garbage collect performance under most JVMs */
static class GCThread implements Runnable {
    public void run() {
        while (true) {
            try {
                Thread.sleep(300000);
            } catch (InterruptedException e) {}
            System.gc();
        }
    }
}

Thread gcThread = new Thread(new GCThread());
gcThread.setDaemon(true);
gcThread.start();
I respect the author of the code, but I no longer have easy access to him to ask him to defend the assertion in the comment at the top.
Is this true? It very much goes against my intuition that this little hack should improve anything. I would expect the JVM to be much better equipped to decide when to perform a collection.
The code is running in a web application inside IBM WebSphere on z/OS.
It depends.
The JVM can completely ignore System.gc() so this code could do absolutely nothing.
Secondly, a GC has a cost impact. If your program wouldn't otherwise have done a GC (say, it doesn't generate much garbage, or it has a huge heap and never needs to GC) then this code will be added overhead.
If the program would normally run with just minor GCs and this code causes a major GC, you will have a negative impact.
All in all, this kind of optimisation makes absolutely no sense whatsoever unless you have concrete evidence that it provides benefit and you would need to re-evaluate that evidence every time the program materially changed.
I also share your assumption. If this were really an optimization, it would have found its way into the JVM. Calling the garbage collector explicitly should be avoided - it can even have a negative effect (because you are "disturbing" the JVM).
JVMs will probably have a setting for the GC interval. See here for Sun's. And hardcoding the value is rather questionable for anything, especially for garbage collection.
Maybe it could be a good thing if your application could time the calls so that GC occurs when the application has nothing useful to do, so the memory is clean whenever the next load peak comes. But this is not what this loop does (it simply runs every 5 minutes), so I would advise against it.
But I invite you to test it - run the application with and without this loop for similar work volumes and measure which takes more time (and maybe total memory, if that's important). Repeat the test a few times. Maybe we'll get some surprising insights.

Java Performance Testing [duplicate]

This question already has answers here:
Is stopwatch benchmarking acceptable?
I want to do some timing tests on a Java application. This is what I am currently doing:
long startTime = System.currentTimeMillis();
doSomething();
long finishTime = System.currentTimeMillis();
System.out.println("That took: " + (finishTime - startTime) + " ms");
Is there anything "wrong" with performance testing like this? What is a better way?
Duplicate: Is stopwatch benchmarking acceptable?
The one flaw in that approach is that the "real" time doSomething() takes to execute can vary wildly depending on what other programs are running on the system and what its load is. This makes the performance measurement somewhat imprecise.
A more accurate way of tracking the time it takes to execute code, assuming the code is single-threaded, is to look at the CPU time consumed by the thread during the call. You can do this with the JMX classes; in particular, with ThreadMXBean. You can retrieve an instance of ThreadMXBean from java.lang.management.ManagementFactory and, if your platform supports it (most do), use the getCurrentThreadCpuTime method in place of System.currentTimeMillis to do a similar test. Bear in mind that getCurrentThreadCpuTime reports time in nanoseconds, not milliseconds.
Here's a sample (Scala) method that could be used to perform a measurement:
def measureCpuTime(f: => Unit): java.time.Duration = {
  import java.lang.management.ManagementFactory.getThreadMXBean
  if (!getThreadMXBean.isThreadCpuTimeSupported)
    throw new UnsupportedOperationException(
      "JVM does not support measuring thread CPU-time")
  var finalCpuTime: Option[Long] = None
  val thread = new Thread {
    override def run(): Unit = {
      f
      finalCpuTime = Some(getThreadMXBean.getThreadCpuTime(
        Thread.currentThread.getId))
    }
  }
  thread.start()
  while (finalCpuTime.isEmpty && thread.isAlive) {
    Thread.sleep(100)
  }
  java.time.Duration.ofNanos(finalCpuTime.getOrElse {
    throw new Exception("Operation never returned, and the thread is dead " +
      "(perhaps an unhandled exception occurred)")
  })
}
(Feel free to translate the above to Java!)
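Taking up that invitation, here is a rough Java equivalent (my own sketch, not the answer's original code):
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.time.Duration;
import java.util.concurrent.atomic.AtomicLong;

public final class CpuTimer {

    /** Runs the task on a fresh thread and returns the CPU time that thread consumed. */
    public static Duration measureCpuTime(Runnable task) throws InterruptedException {
        ThreadMXBean threadMXBean = ManagementFactory.getThreadMXBean();
        if (!threadMXBean.isThreadCpuTimeSupported()) {
            throw new UnsupportedOperationException(
                    "JVM does not support measuring thread CPU-time");
        }
        AtomicLong cpuNanos = new AtomicLong(-1);
        Thread worker = new Thread(() -> {
            task.run();
            // record the CPU time of this worker thread just before it exits
            cpuNanos.set(threadMXBean.getThreadCpuTime(Thread.currentThread().getId()));
        });
        worker.start();
        worker.join();
        long nanos = cpuNanos.get();
        if (nanos < 0) {
            throw new IllegalStateException("Operation never completed normally "
                    + "(perhaps an unhandled exception occurred)");
        }
        return Duration.ofNanos(nanos);
    }
}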
This strategy isn't perfect, but it's less subject to variations in system load.
The code shown in the question is not a good performance measuring code:
The compiler might choose to optimize your code by reordering statements. Yes, it can do that. That means your entire test might fail. It can even choose to inline the method under test and reorder the measuring statements into the now-inlined code.
HotSpot might choose to reorder your statements, inline code, cache results, delay execution...
Even assuming the compiler/HotSpot didn't trick you, what you measure is "wall time". What you should be measuring is CPU time (unless you use OS resources and want to include these as well, or you measure lock contention in a multi-threaded environment).
The solution? Use a real profiler. There are plenty around, both free profilers and demos / time-locked trials of commercial-strength ones.
Using a Java profiler is the best option, and it will give you all the insight that you need into the code: response times, thread call traces, memory utilisation, etc.
I suggest JENSOR, an open-source Java profiler, for its ease of use and low CPU overhead. You can download it, instrument the code, and get all the info you need about your code.
You can download it from: http://jensor.sourceforge.net/
Keep in mind that the resolution of System.currentTimeMillis() varies between operating systems; on Windows I believe it is around 15 ms. So if your doSomething() runs faster than the timer resolution, you'll get a delta of 0. You could run doSomething() in a loop multiple times, but then the JVM may optimize it.
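A rough sketch of that loop idea (the iteration count is arbitrary, and the JIT may still optimize away work whose result is never used):
// Amortize the coarse timer resolution over many calls.
int iterations = 1000;
long start = System.currentTimeMillis();
for (int i = 0; i < iterations; i++) {
    doSomething();
}
long totalMillis = System.currentTimeMillis() - start;
System.out.println("Average: " + (totalMillis / (double) iterations) + " ms per call");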
Have you looked at the profiling tools in NetBeans and Eclipse? These tools give you a better handle on what is REALLY taking up all the time in your code. I have found problems I did not realize I had by using these tools.
Well that is just one part of performance testing. Depending on the thing you are testing you may have to look at heap size, thread count, network traffic or a whole host of other things. Otherwise I use that technique for simple things that I just want to see how long they take to run.
That's good when you are comparing one implementation to another or trying to find a slow part in your code (although it can be tedious). It's a really good technique to know and you'll probably use it more than any other, but be familiar with a profiling tool as well.
I'd imagine you'd want to doSomething() before you start timing too, so that the code is JITted and "warmed up".
Japex may be useful to you, either as a way to quickly create benchmarks, or as a way to study benchmarking issues in Java through the source code.
