Why does this method not get optimized away? - java

This Java method gets used in benchmarks for simulating slow computation:
static int slowItDown() {
    int result = 0;
    for (int i = 1; i <= 1000; i++) {
        result += i;
    }
    return result;
}
This is IMHO a very bad idea, as its body can get replaced by return 500500. This seems never to happen¹, probably because such an optimization is irrelevant for real code, as Jon Skeet stated.
Interestingly, a slightly simpler method with result += 1; gets fully optimized away (caliper reports 0.460543 ns).
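For reference, that simpler variant would look something like this (a sketch; the question doesn't show the exact benchmarked method):

// Sketch of the trivially foldable variant: the body doesn't depend on i,
// so the JIT can reduce the whole method to "return 1000;".
static int slowItDown() {
    int result = 0;
    for (int i = 1; i <= 1000; i++) {
        result += 1;
    }
    return result;
}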
But even if we agree that optimizing away methods returning a constant result is useless for real code, there's still loop unrolling, which could lead to something like
static int slowItDown() {
    int result = 0;
    for (int i = 1; i <= 1000; i += 2) {
        result += 2 * i + 1;
    }
    return result;
}
So my question remains: Why is no optimization performed here?
¹ Contrary to what I wrote originally; I must have seen something that wasn't there.

Well, the JVM does optimize away such code. The question is how many times it has to be detected as a real hotspot (benchmarks usually do more than this single method) before it will be analyzed this way. In my setup it required 16830 invocations before the execution time went to (almost) zero.
It’s correct that such code does not appear in real code directly. However, it might remain after several inlining operations on other hotspots dealing with values that are not compile-time constants but runtime constants or de-facto constants (values that could change in theory but don’t in practice). When such a piece of code remains, it’s a great benefit to optimize it away entirely, but that is not expected to happen early, e.g. when calling right from the main method.
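As a purely hypothetical illustration of how such a shape can arise, imagine a parameterized sum whose argument turns out to be a de-facto constant at the call site; after inlining, the JIT faces exactly the loop from the question (sumTo and caller are made-up names):

// Hypothetical example: sumTo is general-purpose, but at this call site
// n is a de-facto constant, so after inlining the loop bound is fixed at 1000.
static int sumTo(int n) {
    int result = 0;
    for (int i = 1; i <= n; i++) {
        result += i;
    }
    return result;
}

static int caller() {
    return sumTo(1000); // after inlining: the same loop as slowItDown()
}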
Update: I simplified the code and the optimization came even earlier.
public static void main(String[] args) {
    final int inner = 10;
    final float innerFrac = 1f / inner;
    int count = 0;
    for (int j = 0; j < Integer.MAX_VALUE; j++) {
        long t0 = System.nanoTime();
        for (int i = 0; i < inner; i++) slowItDown();
        long t1 = System.nanoTime();
        count += inner;
        final float dt = (t1 - t0) * innerFrac;
        System.out.printf("execution time: %.0f ns%n", dt);
        if (dt < 10) break;
    }
    System.out.println("after " + count + " invocations");
    System.out.println(System.getProperty("java.version"));
    System.out.println(System.getProperty("java.vm.version"));
}

static int slowItDown() {
    int result = 0;
    for (int i = 1; i <= 1000; i++) {
        result += i;
    }
    return result;
}
…
execution time: 0 ns
after 15300 invocations
1.7.0_13
23.7-b01
(64Bit Server VM)

Related

Strange performance drop after innocent changes to a trivial program

Imagine you want to count how many non-ASCII chars a given char[] contains. Imagine the performance really matters, so we can skip our favorite slogan.
The simplest way is obviously
int simpleCount() {
    int result = 0;
    for (int i = 0; i < string.length; i++) {
        result += string[i] >= 128 ? 1 : 0;
    }
    return result;
}
Then you think that many inputs are pure ASCII and that it could be a good idea to deal with them separately. For simplicity assume you write just this
private int skip(int i) {
    for (; i < string.length; i++) {
        if (string[i] >= 128) break;
    }
    return i;
}
Such a trivial method could be useful for more complicated processing, and here it can do no harm, right? So let's continue with
int smartCount() {
    int result = 0;
    for (int i = skip(0); i < string.length; i++) {
        result += string[i] >= 128 ? 1 : 0;
    }
    return result;
}
It's the same as simpleCount. I'm calling it "smart" as the actual work to be done is more complicated, so skipping over ASCII quickly makes sense. If there's no ASCII prefix, or a very short one, it can cost a few cycles more, but that's all, right?
Maybe you want to rewrite it like this; it's the same, just possibly more reusable, right?
int smarterCount() {
    return finish(skip(0));
}

int finish(int i) {
    int result = 0;
    for (; i < string.length; i++) {
        result += string[i] >= 128 ? 1 : 0;
    }
    return result;
}
And then you run a benchmark on some very long random string and get this:
The parameters determine the ASCII to non-ASCII ratio and the average length of a non-ASCII sequence, but as you can see they don't matter. Trying different seeds and whatever doesn't matter. The benchmark uses caliper, so the usual gotchas don't apply. The results are fairly repeatable, the tiny black bars at the end denote the minimum and maximum times.
Does anybody have an idea what's going on here? Can anybody reproduce it?
Got it.
The difference is in the possibility for the optimizer/CPU to predict the number of loops in for. If it is able to predict the number of repeats up front, it can skip the actual check of i < string.length. Therefore the optimizer needs to know up front how often the condition in the for-loop will succeed and therefore it must know the value of string.length and i.
I made a simple test by replacing string.length with a local variable that is set once in the setup method. Result: smarterCount now has about the same runtime as simpleCount. Before the change, smarterCount took about 50% longer than simpleCount. smartCount did not change.
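A sketch of that first change, with names assumed from the question's code (the field would be filled in once, in the benchmark's setup method):

// Assumed sketch: cache the array length in a field set once during setup,
// so the loop bound is effectively a constant to the optimizer.
private int len; // in the setup method: len = string.length;

int finish(int i) {
    int result = 0;
    for (; i < len; i++) {
        result += string[i] >= 128 ? 1 : 0;
    }
    return result;
}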
It looks like the optimizer loses the information about how many loop iterations it will have to do when a call to another method occurs. That's the reason why finish() immediately ran faster with the constant set, but not smartCount(), as smartCount() has no clue what i will be after the skip() step. So I did a second test, where I copied the loop from skip() into smartCount().
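A sketch of that second test (not the poster's exact code):

// smartCount with the skip() loop manually inlined: the JIT now sees, within
// a single method, where i comes from before the counting loop starts.
int smartCountInlined() {
    int i = 0;
    for (; i < string.length; i++) {
        if (string[i] >= 128) break;
    }
    int result = 0;
    for (; i < string.length; i++) {
        result += string[i] >= 128 ? 1 : 0;
    }
    return result;
}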
And voilà, all three methods return within the same time (800-900 ms).
My tentative guess would be that this is about branch prediction.
This loop:
for (int i = 0; i < string.length; i++) {
    result += string[i] >= 128 ? 1 : 0;
}
Contains exactly one branch, the backward edge of the loop, and it is highly predictable. A modern processor will be able to accurately predict this, and so fill its whole pipeline with instructions. The sequence of loads is also highly predictable, so it will be able to pre-fetch everything the pipelined instructions need. High performance results.
This loop:
for (; i < string.length - 1; i++) {
    if (string[i] >= 128) break;
}
Has a dirty great data-dependent conditional branch sitting in the middle of it. That is much harder for the processor to predict accurately.
Now, that doesn't entirely make sense, because (a) the processor will surely quickly learn that the break branch will usually not be taken, (b) the loads are still predictable, and so just as pre-fetchable, and (c) after that loop exits, the code goes into a loop which is identical to the loop which goes fast. So I wouldn't expect this to make all that much difference.

Empirical analysis for binary search not matching theoretical analysis

I'm currently doing a test for the binary search's average case. All I do is generate a random value and then search for it in arrays of different sizes using binary search. Below is the code I used:
public static void main(String[] args)
{
    // This array keeps track of the times of the binary search
    long[] ArrayTimeTaken = new long[18];
    // Values of the array lengths that we test for
    int[] ArrayAssignValues = new int[18];
    ArrayAssignValues[0] = 1000000;
    ArrayAssignValues[1] = 10000000;
    ArrayAssignValues[2] = 20000000;
    ArrayAssignValues[3] = 30000000;
    ArrayAssignValues[4] = 40000000;
    ArrayAssignValues[5] = 50000000;
    ArrayAssignValues[6] = 60000000;
    ArrayAssignValues[7] = 70000000;
    ArrayAssignValues[8] = 80000000;
    ArrayAssignValues[9] = 90000000;
    ArrayAssignValues[10] = 100000000;
    ArrayAssignValues[11] = 110000000;
    ArrayAssignValues[12] = 120000000;
    ArrayAssignValues[13] = 130000000;
    ArrayAssignValues[14] = 140000000;
    ArrayAssignValues[15] = 150000000;
    ArrayAssignValues[16] = 160000000;
    ArrayAssignValues[17] = 170000000;
    // Code that runs the binary search
    for (int i = 0; i < ArrayAssignValues.length; i++)
    {
        float[] arrayExperimentTest = new float[ArrayAssignValues[i]];
        // We fill the array with ascending numbers
        for (int j = 0; j < arrayExperimentTest.length; j++)
        {
            arrayExperimentTest[j] = j;
        }
        Random Generator = new Random();
        int ValuetoSearchfor = (int) Generator.nextInt(ArrayAssignValues[i]);
        System.out.println(ValuetoSearchfor);
        ValuetoSearchfor = (int) arrayExperimentTest[ValuetoSearchfor];
        // Here we perform the binary search
        ArrayTimeTaken[i] = BinarySearch(arrayExperimentTest, ValuetoSearchfor);
    }
    ChartCreate(ArrayTimeTaken);
    System.out.println("Done");
}
Here is my code for the binary search:
static long BinarySearch(float[] ArraySearch, int ValueFind)
{
    System.gc();
    long startTime = System.nanoTime();
    int low = 0;
    int high = ArraySearch.length - 1;
    int mid = Math.round((low + high) / 2);
    while (ArraySearch[mid] != ValueFind)
    {
        if (ValueFind < ArraySearch[mid])
        {
            high = mid - 1;
        }
        else
        {
            low = mid + 1;
        }
        mid = (low + high) / 2;
    }
    long TimeTaken = System.nanoTime() - startTime;
    return TimeTaken;
}
Now the problem is that the results aren't making sense. Below is a graph:
Can someone explain why the 1st array takes so much time? I've run the code a good few times and it's basically the same graph every time. Does Java somehow cache results? Can anyone explain why the 1st binary search takes so long relative to the others, even though the array size is tiny compared to the rest?
It looks like you're doing these searches one after another, starting with the lowest values. If that's true, then the code will be running much slower, because the JIT compiler won't have had a chance to warm up yet. Generally, for benchmarking like this, you want to run through all relevant code to give the JIT compiler time to compile it and optimize before you do the real testing.
For more information on the JIT compiler, read this
You should also see this question to learn more about benchmarking.
Another possible cause of the slowness is that the JVM could still be in the process of starting up, running its own background code while you're timing it, causing a slowdown.
Benchmarking is not done this way; you should run at least 1000 cycles as a "warm-up" and only then start measuring. Benchmarking can be more complicated than it seems: it should be carefully constructed so as not to be affected by other programs running at the same time, etc. Here and here you can find some good tips.
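A minimal warm-up pattern along those lines might look like this (a sketch reusing the question's BinarySearch method; it assumes the System.gc() call is moved out of the measured method so the warm-up loop stays cheap):

// Sketch: exercise the code path well past the default compile threshold
// before taking the measurement, so the timed run executes compiled code.
static long measure(float[] array, int value) {
    for (int i = 0; i < 20000; i++) {
        BinarySearch(array, value); // warm-up runs, results discarded
    }
    return BinarySearch(array, value); // measured run
}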

Why does Method access seem faster than Field access?

I was doing some tests to find out what the speed differences are between using getters/setters and direct field access. I wrote a simple benchmark application like this:
public class FieldTest {

    private int value = 0;

    public void setValue(int value) {
        this.value = value;
    }

    public int getValue() {
        return this.value;
    }

    public static void doTest(int num) {
        FieldTest f = new FieldTest();

        // test direct field access
        long start1 = System.nanoTime();
        for (int i = 0; i < num; i++) {
            f.value = f.value + 1;
        }
        f.value = 0;
        long diff1 = System.nanoTime() - start1;

        // test method field access
        long start2 = System.nanoTime();
        for (int i = 0; i < num; i++) {
            f.setValue(f.getValue() + 1);
        }
        f.setValue(0);
        long diff2 = System.nanoTime() - start2;

        // print results
        System.out.printf("Field Access: %d ns\n", diff1);
        System.out.printf("Method Access: %d ns\n", diff2);
        System.out.println();
    }

    public static void main(String[] args) throws InterruptedException {
        int num = 2147483647;

        // wait for the VM to warm up
        Thread.sleep(1000);

        for (int i = 0; i < 10; i++) {
            doTest(num);
        }
    }
}
Whenever I run it, I get consistent results such as these: http://pastebin.com/hcAtjVCL
I was wondering if someone could explain to me why field access seems to be slower than getter/setter method access, and also why the last 8 iterations execute incredibly fast.
Edit: Having taken into account assylias's and Stephen C's comments, I have changed the code to http://pastebin.com/Vzb8hGdc where I got slightly different results: http://pastebin.com/wxiDdRix
The explanation is that your benchmark is broken.
The first iteration is done using the interpreter.
Field Access: 1528500478 ns
Method Access: 1521365905 ns
The second iteration is done by the interpreter to start with and then we flip to running JIT compiled code.
Field Access: 1550385619 ns
Method Access: 47761359 ns
The remaining iterations are all done using JIT compiled code.
Field Access: 68 ns
Method Access: 33 ns
etcetera
The reason they are unbelievably fast is that the JIT compiler has optimized the loops away. It has detected that they were not contributing anything useful to the computation. (It is not clear why the first number seems consistently slower than the second, but I doubt that the optimized code is measuring field versus method access in any meaningful way.)
Re the UPDATED code / results: it is obvious that the JIT compiler is still optimizing the loops away.
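One common way to keep such loops from being eliminated is to make their results observable, for example (a sketch adapting the loop from doTest, not the poster's code):

// Feed the loop's result into a value that is eventually printed, so the
// JIT cannot prove the loop body is dead code and remove it.
long sink = 0;
long start = System.nanoTime();
for (int i = 0; i < num; i++) {
    f.setValue(f.getValue() + 1);
    sink += f.getValue();
}
long diff = System.nanoTime() - start;
System.out.println("checksum: " + sink); // consuming sink keeps the loop alive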

Java iterative vs recursive

Can anyone explain why the following recursive method is faster than the iterative one (both are doing string concatenation)? Isn't the iterative approach supposed to beat the recursive one? Plus, each recursive call adds a new frame on top of the stack, which can be very space-inefficient.
private static void string_concat(StringBuilder sb, int count) {
    if (count >= 9999) return;
    string_concat(sb.append(count), count + 1);
}

public static void main(String[] arg) {
    long s = System.currentTimeMillis();
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 9999; i++) {
        sb.append(i);
    }
    System.out.println(System.currentTimeMillis() - s);

    s = System.currentTimeMillis();
    string_concat(new StringBuilder(), 0);
    System.out.println(System.currentTimeMillis() - s);
}
I ran the program multiple times, and the recursive one always ends up 3-4 times faster than the iterative one. What could be the main reason causing the iterative one to be slower?
See my comments.
Make sure you learn how to properly microbenchmark. You should be timing many iterations of both and averaging these for your times. Aside from that, you should make sure the VM isn't giving the second an unfair advantage by not compiling the first.
In fact, the default HotSpot compilation threshold (configurable via -XX:CompileThreshold) is 10,000 invocations, which might explain the results you see here. HotSpot doesn't really do any tail optimizations, so it's quite strange that the recursive solution is faster. It's quite plausible that StringBuilder.append is compiled to native code primarily for the recursive solution.
I decided to rewrite the benchmark and see the results for myself.
public final class AppendMicrobenchmark {

    static void recursive(final StringBuilder builder, final int n) {
        if (n > 0) {
            recursive(builder.append(n), n - 1);
        }
    }

    static void iterative(final StringBuilder builder) {
        for (int i = 10000; i >= 0; --i) {
            builder.append(i);
        }
    }

    public static void main(final String[] argv) {
        /* warm-up */
        for (int i = 200000; i >= 0; --i) {
            new StringBuilder().append(i);
        }

        /* recursive benchmark */
        long start = System.nanoTime();
        for (int i = 1000; i >= 0; --i) {
            recursive(new StringBuilder(), 10000);
        }
        System.out.printf("recursive: %.2fus\n", (System.nanoTime() - start) / 1000000D);

        /* iterative benchmark */
        start = System.nanoTime();
        for (int i = 1000; i >= 0; --i) {
            iterative(new StringBuilder());
        }
        System.out.printf("iterative: %.2fus\n", (System.nanoTime() - start) / 1000000D);
    }
}
Here are my results...
C:\dev\scrap>java AppendMicrobenchmark
recursive: 405.41us
iterative: 313.20us
C:\dev\scrap>java -server AppendMicrobenchmark
recursive: 397.43us
iterative: 312.14us
These are times for each approach averaged over 1000 trials.
Essentially, the problems with your benchmark are that it doesn't average over many trials (law of large numbers), and that it is highly dependent on the ordering of the individual benchmarks. The original results I got for your version:
C:\dev\scrap>java StringBuilderBenchmark
80
41
This made very little sense to me. Recursion on the HotSpot VM is more than likely not going to be as fast as iteration because as of yet it does not implement any sort of tail optimization that you might find used for functional languages.
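For illustration, tail-call elimination would effectively rewrite the question's recursive method into a plain loop like this (a sketch of the transformation, which HotSpot does not perform):

// What string_concat would look like after hypothetical tail-call
// elimination: the recursion becomes a loop, with no stack growth.
private static void string_concat(StringBuilder sb, int count) {
    while (count < 9999) {
        sb.append(count);
        count++;
    }
}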
Now, the funny thing that happens here is that the default HotSpot JIT compilation threshold is 10,000 invokes. Your iterative benchmark will more than likely be executing for the most part before append is compiled. On the other hand, your recursive approach should be comparatively fast since it will more than likely enjoy append after it is compiled. To eliminate this from influencing the results, I passed -XX:CompileThreshold=0 and found...
C:\dev\scrap>java -XX:CompileThreshold=0 StringBuilderBenchmark
8
8
So, when it comes down to it, they're both roughly equal in speed. Note, however, that the iterative one appears to be a bit faster if you average with higher precision. Order might still make a difference in my benchmark, too, as the latter benchmark will have the advantage of the VM having collected more statistics for its dynamic optimizations.

JVM native code compilation craziness - I seem to suffer odd performance penalties for some time even after the code is compiled. Why?

Issue
When benchmarking a simple QuickSort implementation in Java, I faced unexpected humps in the n vs. time graphs I was plotting:
I know HotSpot will attempt to compile code to native once it sees that certain methods are being heavily used, so I ran the JVM with -XX:+PrintCompilation. After repeated trials, it seems to compile the algorithm's methods always in the same way:
# iteration 6 -> sorting.QuickSort::swap (15 bytes)
# iteration 7 -> sorting.QuickSort::partition (66 bytes)
# iteration 7 -> sorting.QuickSort::quickSort (29 bytes)
I am repeating the above graphic with this added info, just to make things a bit clearer:
At this point, we must all be asking ourselves: why are we still getting those ugly humps AFTER the code is compiled? Maybe it has something to do with the algorithm itself? It sure could be, and luckily for us there's a quick way to sort that out, with -XX:CompileThreshold=0:
Bummer! It really must be something the JVM is doing in the background. But what?
I theorized that although code is being compiled, it may take a while until the compiled code actually starts to be used. Maybe adding a couple of Thread.sleep()s here and there could help us sort this issue out?
Ouch! The green-colored plot is the QuickSort code run with a 1000 ms interval between each run (details in the appendix), while the blue-colored plot is our old one (just for comparison).
At first, giving HotSpot time only seems to make matters worse! Maybe it only seems worse because of some other factor, such as caching issues?
Disclaimer: I am running 1000 trials for each point of the graphs shown, and using System.nanoTime() to measure the results.
EDIT
Some of you may at this stage wonder how the use of sleep() might distort the results. I ran the Red Plot (no native compilation) again, now with the sleeps in-between:
Scary!
Appendix
Here I present the QuickSort code I am using, just in case:
public class QuickSort {

    public <T extends Comparable<T>> void sort(int[] table) {
        quickSort(table, 0, table.length - 1);
    }

    private static <T extends Comparable<T>> void quickSort(int[] table,
            int first, int last) {
        if (first < last) { // There is data to be sorted.
            // Partition the table.
            int pivotIndex = partition(table, first, last);
            // Sort the left half.
            quickSort(table, first, pivotIndex - 1);
            // Sort the right half.
            quickSort(table, pivotIndex + 1, last);
        }
    }

    /**
     * @author http://en.wikipedia.org/wiki/Quick_Sort
     */
    private static <T extends Comparable<T>> int partition(int[] table,
            int first, int last) {
        int pivotIndex = (first + last) / 2;
        int pivotValue = table[pivotIndex];
        swap(table, pivotIndex, last);
        int storeIndex = first;
        for (int i = first; i < last; i++) {
            if (table[i] - pivotValue <= 0) {
                swap(table, i, storeIndex);
                storeIndex++;
            }
        }
        swap(table, storeIndex, last);
        return storeIndex;
    }

    private static <T> void swap(int[] a, int i, int j) {
        int h = a[i];
        a[i] = a[j];
        a[j] = h;
    }
}
as well as the code I am using to run my benchmarks:
public static void main(String[] args) throws InterruptedException, IOException {
    QuickSort quickSort = new QuickSort();
    int TRIALS = 1000;
    File file = new File(Long.toString(System.currentTimeMillis()));
    System.out.println("Saving # \"" + file.getAbsolutePath() + "\"");
    for (int x = 0; x < 30; ++x) {
        // if (x > 4 && x < 17)
        //     Thread.sleep(1000);
        int[] values = new int[x];
        long start = System.nanoTime();
        for (int i = 0; i < TRIALS; ++i)
            quickSort.sort(values);
        double duration = (System.nanoTime() - start) / TRIALS;
        String line = x + "\t" + duration;
        System.out.println(line);
        FileUtils.writeStringToFile(file, line + "\r\n", true);
    }
}
Well, it seems that I sorted the issue out on my own.
I was right about the idea that compiled code could take a while to kick in. The problem was a flaw in the way I actually implemented my benchmarking code:
if (x > 4 && x < 17)
    Thread.sleep(1000);
Here I assumed that, since the only "affected" area was between 4 and 17, I could just sleep over those values. This is simply not so. The following plot may be enlightening:
Here I am comparing the original no-compilation run (red) to another no-compilation run, but with sleeps in between. As you can see, they operate in different orders of magnitude, meaning that mixing results of code with and without sleeps yields unsound results, as I was guilty of doing.
Yet the original question remains unanswered: what causes the humps to occur even after compilation took place? Let's try to find that out by putting a 1 s sleep at ALL points taken.
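A sketch of that variant, using the loop from the appendix with the conditional sleep made unconditional:

// The appendix loop, sleeping before every measured point so the compiled
// code has time to kick in at every array size.
for (int x = 0; x < 30; ++x) {
    Thread.sleep(1000);
    int[] values = new int[x];
    long start = System.nanoTime();
    for (int i = 0; i < TRIALS; ++i)
        quickSort.sort(values);
    System.out.println(x + "\t" + (System.nanoTime() - start) / (double) TRIALS);
}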
That yields the expected result: the odd humps were happening while the native code still hadn't kicked in.
Comparing a 50 ms sleep with a 1000 ms sleep yields, yet again, the expected result:
(the gray one seems to still show a bit of delay)
