I am doing some Java performance comparisons between my classes, and I'm wondering if there is some sort of Java performance framework that makes writing performance measurement code easier?
I.e. what I am doing now is trying to measure what effect making a method synchronized, as in PseudoRandomUsingSynch.nextInt(), has compared to using an AtomicInteger as my "synchronizer".
So I am trying to measure how long it takes to generate random integers using 3 threads accessing a synchronized method, looping for say 10000 times.
I am sure there is a much better way of doing this. Can you please enlighten me? :)
public static void main(String[] args) throws InterruptedException, ExecutionException {
    PseudoRandomUsingSynch rand1 = new PseudoRandomUsingSynch((int) System.currentTimeMillis());
    int n = 3;
    ExecutorService execService = Executors.newFixedThreadPool(n);
    long timeBefore = System.currentTimeMillis();
    for (int idx = 0; idx < 100000; ++idx) {
        Future<Integer> future = execService.submit(rand1);
        Future<Integer> future1 = execService.submit(rand1);
        Future<Integer> future2 = execService.submit(rand1);
        int random1 = future.get();
        int random2 = future1.get();
        int random3 = future2.get();
    }
    long timeAfter = System.currentTimeMillis();
    long elapsed = timeAfter - timeBefore;
    System.out.println("elapsed:" + elapsed);
}
the class
public class PseudoRandomUsingSynch implements Callable<Integer> {
    private int seed;

    public PseudoRandomUsingSynch(int s) { seed = s; }

    public synchronized int nextInt(int n) {
        byte[] s = DonsUtil.intToByteArray(seed);
        SecureRandom secureRandom = new SecureRandom(s);
        return (secureRandom.nextInt() % n);
    }

    @Override
    public Integer call() throws Exception {
        return nextInt((int) System.currentTimeMillis());
    }
}
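For comparison, the AtomicInteger variant I have in mind looks roughly like this (the LCG update step is only a placeholder for whatever lock-free update one would really use):
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

public class PseudoRandomUsingAtomic implements Callable<Integer> {
    private final AtomicInteger seed;

    public PseudoRandomUsingAtomic(int s) { seed = new AtomicInteger(s); }

    public int nextInt(int n) {
        // CAS loop instead of a synchronized method
        for (;;) {
            int current = seed.get();
            int next = current * 1103515245 + 12345; // placeholder LCG step
            if (seed.compareAndSet(current, next)) {
                return Math.abs(next % n);
            }
        }
    }

    @Override
    public Integer call() throws Exception {
        return nextInt((int) System.currentTimeMillis());
    }
}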
Regards
Ignoring the question of whether a microbenchmark is useful in your case (Stephen C's points are very valid), I would point out:
Firstly, don't listen to people who say 'it's not that hard'. Yes, microbenchmarking on a virtual machine with JIT compilation is difficult. It's actually really difficult to get meaningful and useful figures out of a microbenchmark, and anyone who claims it's not hard is either a supergenius or doing it wrong. :)
Secondly, yes, there are a few such frameworks around. One worth looking at (though it's at a very early pre-release stage) is Caliper, by Kevin Bourrillion and Jesse Wilson of Google. Looks really impressive from a few early looks at it.
More micro-benchmarking advice - micro benchmarks rarely tell you what you really need to know ... which is how fast a real application is going to run.
In your case, I imagine you are trying to figure out if your application will perform better using an Atomic object than using synchronized ... or vice versa. And the real answer is that it most likely depends on factors that a micro-benchmark cannot measure. Things like the probability of contention, how long locks are held, the number of threads and processors, and the amount of extra algorithmic work needed to make atomic update a viable solution.
EDIT - in response to this question.
so is there a way i can measure all these probability of contention, locks held duration, etc ?
In theory yes. Once you have implemented the entire application, it is possible to instrument it to measure these things. But that doesn't give you your answer either, because there isn't a predictive model you can plug these numbers into to give the answer. And besides, you've already implemented the application by then.
But my point was not that measuring these factors allows you to predict performance. (It doesn't!) Rather, it was that a micro-benchmark does not allow you to predict performance either.
In reality, the best approach is to implement the application according to your intuition, and then use profiling as the basis for figuring out where the real performance problems are.
OpenJDK guys have developed a benchmarking tool called JMH:
http://openjdk.java.net/projects/code-tools/jmh/
This provides quite an easy-to-set-up framework, and there are a couple of samples showing how to use it.
http://hg.openjdk.java.net/code-tools/jmh/file/tip/jmh-samples/src/main/java/org/openjdk/jmh/samples/
Nothing can prevent you from writing the benchmark wrong, but they did a great job at eliminating the non-obvious mistakes, such as false sharing between threads, and at preventing dead-code elimination from skewing your results.
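To give a flavour, a minimal JMH benchmark for the synchronized-vs-AtomicInteger question might look like the sketch below (class and method names are my own invention; it assumes the JMH annotations are on the classpath):
import java.util.concurrent.atomic.AtomicInteger;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Threads;

@State(Scope.Benchmark)
public class SyncVsAtomicBenchmark {

    private final AtomicInteger atomicCounter = new AtomicInteger();
    private int lockedCounter;

    @Benchmark
    @Threads(3) // run the benchmark method from 3 threads concurrently
    public int atomic() {
        return atomicCounter.incrementAndGet();
    }

    @Benchmark
    @Threads(3)
    public synchronized int locked() {
        return ++lockedCounter;
    }
}
JMH consumes the return values for you, so the increments cannot be eliminated as dead code.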
These guys designed a good JVM measurement methodology so you won't fool yourself with bogus numbers, and then published it as a Python script so you can re-use their smarts -
Statistically Rigorous Java Performance Evaluation (pdf paper)
You probably want to move the loop into the task. As it is, you just start all the threads and almost immediately you're back to single-threaded.
Usual microbenchmarking advice: allow for some warm-up. As well as the average, the deviation is interesting. Use System.nanoTime instead of System.currentTimeMillis.
Specific to this problem is how much the threads contend. With a large number of contending threads, CAS loops can perform wasted work. Creating a SecureRandom is probably expensive, and so might be System.currentTimeMillis, to a lesser extent. I believe SecureRandom should already be thread-safe, if used correctly.
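A rough sketch of that restructuring, reusing the PseudoRandomUsingSynch class from the question, with a warm-up pass and System.nanoTime:
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SynchBenchmark {
    public static void main(String[] args) throws Exception {
        final PseudoRandomUsingSynch rand = new PseudoRandomUsingSynch((int) System.currentTimeMillis());
        ExecutorService pool = Executors.newFixedThreadPool(3);

        // Each task loops internally, so the three threads really run concurrently.
        Callable<Integer> task = new Callable<Integer>() {
            public Integer call() {
                int last = 0;
                for (int i = 0; i < 100000; i++) {
                    last = rand.nextInt(Integer.MAX_VALUE);
                }
                return last; // return a result so the work cannot be optimised away
            }
        };

        for (int run = 0; run < 5; run++) { // the first runs double as JIT warm-up
            long before = System.nanoTime();
            List<Future<Integer>> futures = new ArrayList<Future<Integer>>();
            for (int t = 0; t < 3; t++) {
                futures.add(pool.submit(task));
            }
            for (Future<Integer> f : futures) {
                f.get();
            }
            System.out.println("run " + run + ": " + (System.nanoTime() - before) / 1000000 + " ms");
        }
        pool.shutdown();
    }
}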
In short, you are thus searching for a "Java unit performance testing tool"?
Use JUnitPerf.
Update: in case it's not clear yet: it also supports concurrent (multithreaded) testing. Here's an extract from the "LoadTest" chapter of the aforementioned link, which includes a code sample:
For example, to create a load test of 10 concurrent users, with each user running the ExampleTestCase.testOneSecondResponse() method for 20 iterations, and with a 1 second delay between the addition of users, use:
int users = 10;
int iterations = 20;
Timer timer = new ConstantTimer(1000);
Test testCase = new ExampleTestCase("testOneSecondResponse");
Test repeatedTest = new RepeatedTest(testCase, iterations);
Test loadTest = new LoadTest(repeatedTest, users, timer);
I measure performance with the example code at the end.
If I call the checkPerformanceResult method with the parameter numberOfTimes set to 100, the parallel stream significantly outperforms the sequential stream (sequential=346, parallel=78).
If I set the parameter to 1000, the sequential stream significantly outperforms the parallel stream (sequential=3239, parallel=9337).
I did a lot of runs and the result is the same.
Can someone explain this behaviour to me and what is going on under the hood here?
import java.util.function.Supplier;
import java.util.stream.IntStream;

public class ParallelStreamExample {

    public static long checkPerformanceResult(Supplier<Integer> s, int numberOfTimes) {
        long startTime = System.currentTimeMillis();
        for (int i = 0; i < numberOfTimes; i++) {
            s.get();
        }
        long endTime = System.currentTimeMillis();
        return endTime - startTime;
    }

    public static int sumSequentialStreamThread() {
        IntStream.rangeClosed(1, 10000000).sum();
        return 0;
    }

    public static int sumParallelStreamThread() {
        IntStream.rangeClosed(1, 10000000)
                .parallel().sum();
        return 0;
    }

    public static void main(String[] args) {
        System.out.println(checkPerformanceResult(ParallelStreamExample::sumSequentialStreamThread, 1000));
        System.out.println("break");
        System.out.println(checkPerformanceResult(ParallelStreamExample::sumParallelStreamThread, 1000));
    }
}
Using threads doesn't always make the code run faster
When working with a few threads there is always an overhead of managing each thread (the OS assigning CPU time to each thread, keeping track of the next line of code to run after a context switch, etc.).
In this specific case
Each thread created in sumParallelStreamThread does very simple in-memory operations (calling a function that returns a number).
So the difference between sumSequentialStreamThread and sumParallelStreamThread is that in sumParallelStreamThread each simple operation has the overhead of creating a thread and running it (assuming that there isn't any thread optimization happening in the background).
And sumSequentialStreamThread does the same thing without the overhead of managing all the threads; that's why it runs faster.
When to use threads
The most common use case for working with threads is when you need to perform a bunch of I/O tasks.
What is considered an I/O task?
It depends on several factors; you can find a debate on it here.
But I think people will generally agree that making an HTTP request or executing a database query can be considered an I/O operation.
Why is it more suitable?
Because I/O operations usually involve some period of waiting for a response.
For example, when querying a database, the thread performing the query will wait for the database to return the response (even if it's less than half a second). While this thread is waiting, a different thread can perform other actions, and that is where we can gain performance.
I find that running tasks that involve only RAM and CPU operations in different threads usually makes the code run slower than with one thread.
Benchmark discussion
Regarding the benchmark remarks I see in the comments, I am not sure whether they are correct, but in these types of situations I would double-check my benchmark against a profiling tool (or just use one to begin with), like JProfiler or YourKit; they are usually very accurate.
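Independently of profiling, one cheap sanity check (this is an assumption about what skews the numbers, not something I verified) is to warm the code up before timing it, so the JIT has already compiled the hot paths:
// Hypothetical tweak to the question's harness: run the workload a few
// thousand times untimed before the measured loop.
public static long checkPerformanceResult(Supplier<Integer> s, int numberOfTimes) {
    for (int i = 0; i < 10000; i++) {
        s.get(); // warm-up, results discarded
    }
    long startTime = System.currentTimeMillis();
    for (int i = 0; i < numberOfTimes; i++) {
        s.get();
    }
    return System.currentTimeMillis() - startTime;
}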
Some principles of clean code are:
functions should do one thing at one abstraction level
functions should be at most 20 lines long
functions should never have more than 2 input parameters
How many cpu cycles are "lost" by adding an extra function call in Java?
Are there compiler options available that transform many small functions into one big function in order to optimize performance?
E.g.
void foo() {
    bar1();
    bar2();
}
void bar1() {
    a();
    b();
}
void bar2() {
    c();
    d();
}
Would become
void foo() {
    a();
    b();
    c();
    d();
}
How many cpu cycles are "lost" by adding an extra function call in Java?
This depends on whether it is inlined or not. If it is inlined, it will cost nothing (or a negligible amount).
If it is not compiled at runtime, it hardly matters, because the cost of interpreting dominates any micro-optimisation, and the code is likely not called enough to matter (which is why it wasn't optimised).
The only time it really matters is when the code is called often, yet for some reason it is prevented from being optimised. I would only assume this is the case because a profiler is telling you it is a performance issue, and in that case manual inlining might be the answer.
I design, develop and optimise latency-sensitive code in Java, and I choose to manually inline methods much less than 1% of the time, and only after a profiler, e.g. Flight Recorder, suggests there is a significant performance problem.
In the rare event it matters, how much difference does it make?
I would estimate between 0.03 and 0.1 microseconds in real applications for each extra call; in a micro-benchmark it would be far less.
Are there compiler options available that transform many small functions into one big function in order to optimize performance?
Yes. In fact what can happen is that not only are all these methods inlined, but the methods which call them are inlined as well, so that none of the calls exist at runtime, but only if the code is called enough to be optimised. I.e. not only are a, b, c and d and their code inlined into foo, but foo is inlined into its caller as well.
By default the Oracle JVM can inline to a depth of 9 levels (until the code gets to more than 325 bytes of bytecode).
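If you want to see what HotSpot actually decides, you can ask it to log its inlining decisions with diagnostic flags (the class name here is a placeholder):
java -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining -XX:MaxInlineLevel=9 MyApp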
Will clean code help performance
The JVM runtime optimiser has common patterns it optimises for. Clean, simple code is generally easier to optimise, and when you try something tricky or non-obvious, you can end up being much slower. If it is harder for a human to understand, there is a good chance it is hard for the optimiser to understand/optimise.
Runtime behaviour and cleanliness of code (a compile-time or lifetime property of code) belong to different requirement categories. There might be cases where optimizing for one category is detrimental to the other.
The question is: which category really needs your attention?
In my view, cleanliness of code (or malleability of software) suffers from a huge lack of attention. You should focus on that first. Only if other requirements start to fall behind (e.g. performance) should you inquire whether that's due to how clean the code is. That means you need to really compare; you need to measure the difference it makes. With regard to performance, use a profiler of your choice: run the "dirty" code variant and the clean variant and check the difference. Is the difference marked? Only if the "dirty" variant is significantly faster should you lower the cleanliness.
Consider the following piece of code, which compares code that does 3 things in one for loop to another version that has 3 different for loops, one for each task.
@Test
public void singleLoopVsMultiple() {
    for (int j = 0; j < 5; j++) {
        // single loop
        int x = 0, y = 0, z = 0;
        long l = System.currentTimeMillis();
        for (int i = 0; i < 100000000; i++) {
            x++;
            y++;
            z++;
        }
        l = System.currentTimeMillis() - l;

        // multiple loops doing the same thing
        int a = 0, b = 0, c = 0;
        long m = System.currentTimeMillis();
        for (int i = 0; i < 100000000; i++) {
            a++;
        }
        for (int i = 0; i < 100000000; i++) {
            b++;
        }
        for (int i = 0; i < 100000000; i++) {
            c++;
        }
        m = System.currentTimeMillis() - m;

        System.out.println(String.format("%d,%d", l, m));
    }
}
When I run it, here is the output I get for time in milliseconds.
6,5
8,0
0,0
0,0
0,0
After a few runs, the JVM is able to identify hotspots of intensive code and optimises parts of the code to make them significantly faster. In our previous example, after 2 runs the JVM had already optimised the code so much that the discussion around for-loops became redundant.
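Part of what you see here is most likely dead-code elimination (an assumption suggested by the zero timings, not something I verified): x, y, z, a, b and c are never read after the loops, so the JIT is free to drop the loops entirely. Consuming the counters at the end of each iteration defeats that, e.g.:
// Print every counter so the JIT has to actually compute them
System.out.println(String.format("%d,%d,%d,%d,%d,%d", x, y, z, a, b, c));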
Unless we know what's happening inside, we cannot predict the performance implications of changes like introduction of for-loops. The only way to actually improve the performance of a system is by measuring it and focusing only on fixing the actual bottlenecks.
There is a chance that cleaning your code may make it faster for the JVM. But even if that is not the case, every performance optimisation comes with added code complexity. Ask yourself whether the added complexity is worth the future maintenance effort. After all, the most expensive resource on any team is the developer, not the servers, and any additional complexity slows the developer down, adding to the project cost.
The way to deal with it is to figure out your benchmarks: what kind of application you are making and where its bottlenecks are. If you are making a web app, perhaps the DB is taking most of the time, and reducing the number of functions will not make a difference. On the other hand, if it is an app running on a system where performance is everything, every small thing counts.
I'm parallelizing a quite complex program to make it faster. For this I use the ExecutorService most of the time. Until now it has worked pretty well, but then I noticed that just one line of code makes my program run half as fast as it could: the line with exactScore.get().
I don't know why, but it sometimes needs more than 0.1 s just to get the Double value of the Future object.
Why is this? How can I make it run faster? Is there a way to write directly into the Double[] while multithreading?
Thanks
int processors = Runtime.getRuntime().availableProcessors();
ExecutorService service = Executors.newFixedThreadPool(processors);

// initialize output
Double[] presortedExScores = new Double[sortedHeuScores.length];

for (int i = 0; i < sortedHeuScores.length; i++) {
    final int index = i;
    final Collection<MolecularFormula> formulas_for_exact_method = multimap.get(sortedHeuScores[i]);
    for (final MolecularFormula formula : formulas_for_exact_method) {
        Future<Double> exactScore = service.submit(new Callable<Double>() {
            @Override
            public Double call() throws Exception {
                return getScore(computeTreeExactly(computeGraph(formula)));
            }
        });
        presortedExScores[index] = exactScore.get();
    }
}
That is to be expected. It isn't "slower" then; it is just doing its job.
From the javadoc for get():
Waits if necessary for the computation to complete, and then retrieves its result.
Long story short: it seems that you do not understand the concepts you are using in your code. The idea of a Future is that it does things at some point in the future.
And by calling get() you express: I don't mind waiting now until the results of that computation "behind" that Future become available.
Thus: you have to step back and look into your code again; to understand how your different "threads of activity" really work; and how/when they come back together.
One idea that comes to mind: right now, you are creating your Future objects in a loop, and directly after you create a Future, you call get() on it. That completely contradicts the idea of creating multiple Futures. In other words, instead of going:
foreach X
    create future X.i
    wait/get future X.i
you could do something like
foreach X
    create future X.i
foreach X
    wait/get future X.i
In other words: allow your futures to really do things in parallel; instead of enforcing sequential processing.
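Applied to the code in the question, that two-phase pattern might look roughly like this (a sketch reusing the question's own identifiers; exception handling is elided just as in the original):
// Phase 1: submit everything, so all tasks can run concurrently
List<Future<Double>> exactScores = new ArrayList<>();
for (final MolecularFormula formula : formulas_for_exact_method) {
    exactScores.add(service.submit(new Callable<Double>() {
        @Override
        public Double call() throws Exception {
            return getScore(computeTreeExactly(computeGraph(formula)));
        }
    }));
}

// Phase 2: collect the results; we still block here, but by now the
// pool is already busy working on all the tasks at once
for (Future<Double> f : exactScores) {
    presortedExScores[index] = f.get(); // same write as before, just deferred
}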
If that doesn't help "enough", then as said: you have to look at your overall design, and determine if there are ways to further "pull apart" things. Right now all activity happens "closely" together; and surprise: when you do a lot of work at the same time, that takes time. But as you might guess: such a re-design could be a lot of work; and is close to impossible without knowing more about your problem/code base.
A more sophisticated approach would be to write code where each Future has a way of expressing "I am done"; then you would "only" start all Futures and wait until the last one comes back. But as said, I can't design a full solution for you here.
The other really important take-away here: don't just blindly use some code that "happens" to work. One essence of programming is to understand each and any concept used in your source code. You should have a pretty good idea what your code is doing before running it and finding "oh, that get() makes things slow".
I am using a native C++ library inside a Java program. The Java program is written to make use of many-core systems, but it does not scale: the best speed is with around 6 cores, i.e., adding more cores slows it down. My tests show that the call to the native code itself causes the problem, so I want to make sure that different threads access different instances of the native library, and therefore remove any hidden (memory) dependency between the parallel tasks.
In other words, instead of the static block
static {
System.loadLibrary("theNativeLib");
}
I want multiple instances of the library to be loaded, for each thread dynamically. The main question is if that is possible at all. And then how to do it!
Notes:
- I have implementations in Java 7 fork/join as well as Scala/akka. So any help in each platform is appreciated.
- The parallel tasks are completely independent. In fact, each task may create a couple of new tasks and then terminates; no further dependency!
Here is the test program in fork/join style, in which processNatively is basically a bunch of native calls:
class Repeater extends RecursiveTask<Long> {
    final int n;
    final processor mol;

    public Repeater(final int m, final processor o) {
        n = m;
        mol = o;
    }

    @Override
    protected Long compute() {
        processNatively(mol);
        final List<RecursiveTask<Long>> tasks = new ArrayList<>();
        for (int i = n; i < 9; i++) {
            tasks.add(new Repeater(n + 1, mol));
        }
        long count = 1;
        for (final RecursiveTask<Long> task : invokeAll(tasks)) {
            count += task.join();
        }
        return count;
    }
}

private final static ForkJoinPool forkJoinPool = new ForkJoinPool();

public void repeat(processor mol) {
    final long middle = System.currentTimeMillis();
    final long count = forkJoinPool.invoke(new Repeater(0, mol));
    System.out.println("Count is " + count);
    final long after = System.currentTimeMillis();
    System.out.println("Time elapsed: " + (after - middle));
}
Putting it differently:
If I have N threads that use a native library, what happens if each of them calls System.loadLibrary("theNativeLib") dynamically, instead of calling it once in a static block? Will they share the library anyway? If yes, how can I fool the JVM into seeing it as N different libraries loaded independently? (The value of N is not known statically.)
The javadoc for System.loadLibrary states that it's the same as calling Runtime.getRuntime().loadLibrary(name). The javadoc for this loadLibrary (http://docs.oracle.com/javase/7/docs/api/java/lang/System.html#loadLibrary(java.lang.String) ) states that "If this method is called more than once with the same library name, the second and subsequent calls are ignored.", so it seems you can't load the same library more than once. In terms of fooling the JVM into thinking there are multiple instances, I can't help you there.
You need to ensure you don't have a bottleneck on any shared resources. E.g. say you have 6 hyper-threaded cores; you may find that 12 threads are optimal, or you might find that 6 threads are optimal (each thread having a dedicated core).
If you have a heavy floating-point routine, it is likely that hyper-threading will be slower rather than faster.
If you are using all the cache, trying to use more can slow your system down. If you are at the limit of CPU-to-main-memory bandwidth, attempting to use more bandwidth can slow your machine.
But then, how can I refer to the different instances? I mean the loaded classes will have the same names and packages, right? What happens in general if you load two dynamic libraries containing classes with the same names and packages?
There is only one instance; you cannot load a DLL more than once. If you want a different data set for each thread, you need to construct it externally to the library and pass it in, so that each thread works on different data.
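If the native library does expose some kind of per-context handle, a ThreadLocal is one way to give each worker thread its own data. This is an entirely hypothetical sketch (NativeLib, createContext and process are invented names, not part of any real API):
public class PerThreadNative {
    static {
        System.loadLibrary("theNativeLib"); // loaded once; the JVM enforces this
    }

    // Hypothetical: one native context handle per thread
    private static final ThreadLocal<Long> CONTEXT =
            ThreadLocal.withInitial(() -> NativeLib.createContext());

    public static double process(double[] input) {
        // Each thread passes its own context, so threads never share native state
        return NativeLib.process(CONTEXT.get(), input);
    }
}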
Does the java compiler (the default javac that comes in JDK1.6.0_21) optimize code to prevent the same method from being called with the same arguments over and over? If I wrote this code:
public class FooBar {
    public static void main(String[] args) {
        foo(bar);
        foo(bar);
        foo(bar);
    }
}
Would the method foo(bar) only run once? If so, is there any way to prevent this optimization? (I'm trying to compare runtime for two algos, one iterative and one comparative, and I want to call them a bunch of times to get a representative sample)
Any insight would be much appreciated; I took this problem to the point of insanity (I thought my computer was insanely fast for a little while, so I kept on adding method calls until I got the "code too large" error at 43671 lines).
The optimization you are observing is probably nothing to do with repeated calls ... because that would be an invalid optimization. More likely, the optimizer has figured out that the method calls have no observable effect on the computation.
The cure is to change the method so that it does affect the result of the computation ...
It doesn't; that would cause a big problem if foo is not pure (i.e. it changes the global state of the program). For example:
public class FooBar {
    private static int i = 0;

    private static int foo() {
        return ++i;
    }

    public static void main(String[] args) {
        foo();
        foo();
        foo();
        System.out.println(i); // prints 3, so the calls cannot be elided
    }
}
You haven't provided enough information to allow for any definitive answers, but the jvm runtime optimizer is extremely powerful and does all sorts of inlining, runtime dataflow and escape analysis, and all manner of cache tricks.
The end result is to make the sort of micro-benchmarks you are trying to perform all but useless in practice; and extremely difficult to get right even when they are potentially useful.
Definitely read http://www.ibm.com/developerworks/java/library/j-benchmark1.html for a fuller discussion on the problems you face. At the very least you need to ensure:
foo is called in a loop that runs thousands of times
foo() returns a result, and
that result is used
The following is the minimum starting point, assuming foo() is non-trivial and therefore unlikely to be inlined. Note: you still have to expect loop unrolling and other cache-level optimizations. Also watch out for the HotSpot compilation threshold (I believe this is ~5000 calls on -server, IIRC), which can completely stuff up your measurements if you try to re-run the measurements in the same JVM.
public class FooBar {
    public static void main(String[] args) {
        int sum = 0;
        int ITERATIONS = 10000;
        for (int i = 0; i < ITERATIONS; i++) {
            sum += foo(i); // foo's result feeds into sum, so it cannot be eliminated
        }
        System.out.printf("%d iterations returned %d sum%n", ITERATIONS, sum);
    }
}
Seriously, you need to do some reading before you can make any meaningful progress towards writing benchmarks on a modern JVM. The same optimizations that allow modern Java code to match or even sometimes beat C++ make benchmarking really difficult.
The Java compiler is not allowed to perform such optimizations because method calls very likely cause side effects, for example I/O actions, changes to all the fields they can reach, or calls to other methods that do so.
In functional languages, where each function call is guaranteed to return the same result if called with the same arguments (changes to state are forbidden), a compiler might indeed optimize away multiple calls by memoizing the result.
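Java will not memoize for you, but you can do it by hand when a method is genuinely pure; a minimal sketch (using Java 8 function types for brevity):
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class Memo {
    // Wraps a pure function so repeated calls with the same argument hit a cache.
    static <T, R> Function<T, R> memoize(Function<T, R> f) {
        Map<T, R> cache = new ConcurrentHashMap<>();
        return t -> cache.computeIfAbsent(t, f);
    }

    public static void main(String[] args) {
        Function<Integer, Integer> square = memoize(x -> {
            System.out.println("computing " + x);
            return x * x;
        });
        square.apply(4); // prints "computing 4"
        square.apply(4); // cache hit; computes nothing
    }
}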
If you feel your algorithms are too fast, try to give them some large or complicated problem sets. There are only a few algorithms which are always quite fast.