I initially wanted to test something different with floating-point performance optimisation in Java, namely the performance difference between division by 5.0f and multiplication by 0.2f (multiplication seems to be slower without warm-up but faster with it, by a factor of about 1.5).
After studying the results I noticed that I had forgotten to add a warm-up phase, as is so often suggested when doing performance measurements, so I added one. And, to my utter surprise, the code turned out to be about 25 times faster on average over multiple test runs.
I tested it with the following code:
import java.util.Random;

// enclosing class added so the snippet compiles; method names adjusted to
// match what they actually do (divide by 5.0f, multiply by 0.2f)
public class FloatMathBenchmark
{
    public static void main(String[] args)
    {
        float[] test = new float[10000];
        float[] test_copy;
        // warm-up phase
        for (int i = 0; i < 1000; i++)
        {
            fillRandom(test);
            test_copy = test.clone();
            divideByFive(test);
            multiplyByOneFifth(test_copy);
        }
        long divisionTime = 0L;
        long multiplicationTime = 0L;
        for (int i = 0; i < 1000; i++)
        {
            fillRandom(test);
            test_copy = test.clone();
            divisionTime += divideByFive(test);
            multiplicationTime += multiplyByOneFifth(test_copy);
        }
        System.out.println("Divide by 5.0f: " + divisionTime);
        System.out.println("Multiply with 0.2f: " + multiplicationTime);
    }

    public static long divideByFive(float[] data)
    {
        long before = System.nanoTime();
        // indexed loop so the writes actually reach the array
        // (a for-each loop would only modify the local loop variable)
        for (int i = 0; i < data.length; i++)
        {
            data[i] /= 5.0f;
        }
        return System.nanoTime() - before;
    }

    public static long multiplyByOneFifth(float[] data)
    {
        long before = System.nanoTime();
        for (int i = 0; i < data.length; i++)
        {
            data[i] *= 0.2f;
        }
        return System.nanoTime() - before;
    }

    public static void fillRandom(float[] data)
    {
        Random random = new Random();
        for (int i = 0; i < data.length; i++)
        {
            data[i] = random.nextInt() * random.nextFloat();
        }
    }
}
Results without warm-up phase:
Divide by 5.0f: 382224
Multiply with 0.2f: 490765
Results with warm-up phase:
Divide by 5.0f: 22081
Multiply with 0.2f: 10885
Another interesting change that I cannot explain is the reversal in which operation is faster (division vs. multiplication). As mentioned earlier, without the warm-up the division seems to be a tad faster, while with the warm-up it seems to be twice as slow.
I tried adding an initialization block setting the values to something random, but it did not affect the results, and neither did adding multiple warm-up phases. The numbers on which the methods operate are the same, so that cannot be the reason.
What is the reason for this behaviour? What is this warm-up phase and how does it influence the performance, why are the operations so much faster with a warm-up phase and why is there a turn in which operation is faster?
Before the warm-up, Java will be running the byte codes via an interpreter; think of how you would write a program that could execute Java byte codes, in Java. After warm-up, HotSpot will have generated native assembler for the CPU that you are running on, making use of that CPU's feature set. There is a significant performance difference between the two: the interpreter runs many, many CPU instructions for a single byte code, whereas HotSpot generates native assembler code just as gcc does when compiling C code. That is, the difference between the time to divide and the time to multiply will ultimately come down to the CPU that one is running on, where each will be just a single CPU instruction.
The second part of the puzzle is that HotSpot also records statistics that measure the runtime behaviour of your code; when it decides to optimise the code, it uses those statistics to perform optimisations that are not necessarily possible at compilation time. For example, it can reduce the cost of null checks, branch mispredictions and polymorphic method invocations.
In short, one must discard the results pre-warmup.
Brian Goetz wrote a very good article on this subject.
========
APPENDED: overview of what 'JVM Warm-up' means
JVM 'warm-up' is a loose phrase, and is no longer, strictly speaking, a single phase or stage of the JVM. People tend to use it to refer to the point where JVM performance stabilizes after the compilation of the JVM byte codes to native code. In truth, when one starts to scratch under the surface and delve deeper into the JVM internals, it is difficult not to be impressed by how much HotSpot is doing for us. My goal here is just to give you a better feel for what HotSpot can do in the name of performance; for more details I recommend reading articles by Brian Goetz, Doug Lea, John Rose, Cliff Click and Gil Tene (amongst many others).
As already mentioned, the JVM starts by running Java through its interpreter. While strictly speaking not 100% correct, one can think of an interpreter as a large switch statement inside a loop that iterates over every JVM byte code (command). Each case within the switch statement handles one JVM byte code, such as adding two values together, invoking a method, invoking a constructor and so forth. The overhead of the iteration, and of jumping around between commands, is very large. Thus executing a single command will typically use over 10x as many assembly instructions, which means > 10x slower execution, as the hardware has to execute so many more instructions and the caches get polluted by this interpreter code, which we would ideally rather have focused on our actual program. Think back to the early days of Java, when Java earned its reputation for being very slow; this is because it was originally a fully interpreted language.
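For a feel of the shape, here is a toy sketch of mine (not the real thing; the opcodes, operand handling and dispatch are invented purely for illustration): every single "command" pays for a fetch, a dispatch jump and the handler body.

// Toy byte-code interpreter: a loop around a big switch.
static int interpret(int[] bytecode, int[] stack) {
    int pc = 0, sp = 0;                      // program counter, stack pointer
    while (pc < bytecode.length) {
        switch (bytecode[pc++]) {            // dispatch: one jump per command
            case 1:                          // hypothetical IADD opcode
                stack[sp - 2] += stack[sp - 1];
                sp--;
                break;
            case 2:                          // hypothetical PUSH <value> opcode
                stack[sp++] = bytecode[pc++];
                break;
            default:
                throw new IllegalStateException("unknown opcode");
        }
    }
    return stack[0];                         // result left on the stack
}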
Later on, JIT compilers were added to Java; these compilers would compile Java methods to native CPU instructions just before the methods were invoked. This removed all of the overhead of the interpreter and allowed the execution of code to be performed in hardware. While execution in hardware is much faster, this extra compilation created a stall on startup for Java, and this is partly where the terminology of a 'warm-up phase' took hold.
The introduction of HotSpot to the JVM was a game changer. Now the JVM would start up faster, because it would start life running Java programs with its interpreter, while individual Java methods would be compiled in a background thread and swapped in on the fly during execution. The generation of native code could also be done at differing levels of optimisation, sometimes using very aggressive optimisations that are, strictly speaking, incorrect, and then de-optimising and re-optimising on the fly when necessary to ensure correct behaviour.
For example, class hierarchies imply a large cost in figuring out which method will be called, as HotSpot has to search the hierarchy and locate the target method. HotSpot can become very clever here: if it notices that only one class has been loaded, it can assume that will always be the case and optimise and inline methods accordingly. Should another class get loaded that tells HotSpot there is actually a decision between two methods to be made, it will remove its previous assumptions and recompile on the fly. The full list of optimisations that can be made under different circumstances is very impressive, and is constantly changing.
HotSpot's ability to record information and statistics about the environment that it is running in, and the workload that it is currently experiencing, makes the optimisations it performs very flexible and dynamic. In fact, it is very possible that over the lifetime of a single Java process the code for that program will be regenerated many times over as the nature of its workload changes. This arguably gives HotSpot a large advantage over more traditional static compilation, and is largely why a lot of Java code can be considered just as fast as writing C code. It also makes understanding microbenchmarks a lot harder; in fact, it makes the JVM code itself much more difficult for the maintainers at Oracle to understand, work with and diagnose problems in. Take a minute to raise a pint to those guys: HotSpot and the JVM as a whole is a fantastic engineering triumph that rose to the fore at a time when people were saying it could not be done. It is worth remembering that, because after a decade or so it is quite a complex beast ;)
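A concrete sketch of that class-hierarchy point (my own illustration of the general idea, not HotSpot's actual mechanics):

interface Shape { double area(); }

final class Circle implements Shape {
    final double r;
    Circle(double r) { this.r = r; }
    public double area() { return Math.PI * r * r; }
}

// While Circle is the only Shape implementation ever loaded, HotSpot can
// devirtualise and inline area() here. Loading a second implementation
// later invalidates that assumption and triggers deoptimisation and
// recompilation of this method.
class ShapeSum {
    static double total(Shape[] shapes) {
        double sum = 0;
        for (Shape s : shapes) {
            sum += s.area();
        }
        return sum;
    }
}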
So, given that context, in summary: we refer to warming up a JVM in microbenchmarks as running the target code over 10k times and throwing the results away, so as to give the JVM a chance to collect statistics and to optimise the 'hot regions' of the code. 10k is a magic number because the Server HotSpot implementation waits for that many method invocations or loop iterations before it starts to consider optimisations. I would also advise having method calls between the core test runs, because while HotSpot can do 'on-stack replacement' (OSR), it is not common in real applications and it does not behave exactly the same as swapping out whole implementations of methods.
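A minimal sketch of that advice (the names are mine): keep the measured work in a method of its own, so that HotSpot can compile and swap in the whole method after enough invocations, rather than relying on OSR of one long-running loop.

// The hot work lives in its own method, so after ~10k calls HotSpot
// compiles work() as a whole and subsequent calls run the native version.
static long timedRun(float[] data) {
    long start = System.nanoTime();
    work(data);
    return System.nanoTime() - start;
}

static void work(float[] data) {
    for (int i = 0; i < data.length; i++) {
        data[i] *= 0.2f;
    }
}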
You aren't measuring anything useful "without a warm-up phase"; you're measuring some mixture of the speed of interpreted code and the time it takes for the on-stack-replacement code to be generated. Maybe divisions cause compilation to kick in earlier.
There are sets of guidelines and various packages for building microbenchmarks that don't suffer from these sorts of issues. I would suggest that you read the guidelines and use the ready-made packages if you intend to continue doing this sort of thing.
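For illustration, here is roughly what this division-vs-multiplication test might look like using JMH, one of those ready-made packages (a sketch; the class name, iteration counts and setup choices are mine, not tuned recommendations):

import java.util.Random;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
@Fork(1)
public class DivideVsMultiply {
    float[] data;

    @Setup(Level.Iteration)
    public void setup() {
        data = new float[10000];
        Random random = new Random(42);
        for (int i = 0; i < data.length; i++) {
            data[i] = random.nextInt() * random.nextFloat();
        }
    }

    @Benchmark
    public float[] divide() {
        for (int i = 0; i < data.length; i++) {
            data[i] /= 5.0f;
        }
        return data;   // returning the array defeats dead-code elimination
    }

    @Benchmark
    public float[] multiply() {
        for (int i = 0; i < data.length; i++) {
            data[i] *= 0.2f;
        }
        return data;
    }
}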
I'm trying to optimize my code, but it's giving me problems.
I've got this list of objects:
List<DataDescriptor> descriptors;
public class DataDescriptor {
    public int id;
    public String name;
}
There are 1700 objects with unique ids (0-1699) and some name; the name is used to decode what type of data I get later on.
The method that I try to optimize works like that:
public void processData(ArrayList<DataDescriptor> descriptors, ArrayList<IncomingData> incomingDataList) {
    for (IncomingData data : incomingDataList) {
        DataDescriptor desc = descriptors.get(data.getDataDescriptorId());

        if (desc.getName().equals("datatype_1")) {
            doOperationOne(data);
        } else if (desc.getName().equals("datatype_2")) {
            doOperationTwo(data);
        // ... analogous branches for datatype_3 through datatype_15 ...
        } else if (desc.getName().equals("datatype_16")) {
            doOperationSixteen(data);
        }
    }
}
This method is called about a million times when processing a data file, and every time incomingDataList contains about 60 elements, so this set of if/elses is executed about 60 million times.
This takes about 15 seconds on my desktop (i7-8700).
Changing the code to test integer ids instead of strings obviously shaves off a few seconds, which is nice, but I hoped for more :)
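For illustration, such an integer-id variant might look like this (a sketch; it assumes the trailing number in each name identifies the operation, which is not stated in the question):

// One-time precomputation: map "datatype_N" to the int N per descriptor.
int[] typeCode = new int[descriptors.size()];
for (int i = 0; i < descriptors.size(); i++) {
    String name = descriptors.get(i).name;
    typeCode[i] = Integer.parseInt(name.substring("datatype_".length()));
}

// Hot loop: an integer switch instead of a chain of String.equals calls.
for (IncomingData data : incomingDataList) {
    switch (typeCode[data.getDataDescriptorId()]) {
        case 1:  doOperationOne(data);     break;
        case 2:  doOperationTwo(data);     break;
        // ... cases 3 through 15 ...
        case 16: doOperationSixteen(data); break;
    }
}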
I tried profiling with VisualVM, but for this method (with string testing) it says that 66% of the time is spent in "Self time" (which I believe would be all this string testing? And why doesn't it say the time is in the String.equals method?) and 33% is spent on descriptors.get - which is a simple get from an ArrayList, and I don't think I can optimize it any further, other than trying to change how the data is structured in memory (still, this is Java, so I don't know if that would help a lot).
I wrote a "simple benchmark" app to isolate this String vs int comparison. As I expected, comparing integers was about 10x faster than String.equals when I simply ran the application, but when I profiled it in VisualVM (I wanted to check whether ArrayList.get would also be so slow in the benchmark), strangely both methods took exactly the same amount of time. When using VisualVM's Sample, instead of Profile, the application finished with the expected results (ints being 10x faster), but VisualVM was showing that in its sample both types of comparison took the same amount of time.
What is the reason for getting such totally different results when profiling and when not? I know there are a lot of factors - there is the JIT, and profiling may interfere with it, etc. - but in the end, how do you profile and optimize Java code when profiling tools change how the code runs (if that's the case)?
Profilers can be divided into two categories: instrumenting and sampling. VisualVM includes both, but both of them have disadvantages.
Instrumenting profilers use bytecode instrumentation to modify classes. They basically insert special tracing code into every method entry and exit. This makes it possible to record all executed methods and their running time. However, this approach is associated with a big overhead: first, because the tracing code itself can take much time (sometimes even more than the original code); second, because the instrumented code becomes more complicated and prevents certain JIT optimizations that could be applied to the original code.
Sampling profilers are different. They do not modify your application; instead they periodically take a snapshot of what the application is doing, i.e. the stack traces of the currently running threads. The more often a method occurs in these stack traces, the longer (statistically) its total execution time.
Sampling profilers typically have much smaller overhead; furthermore, this overhead is manageable, since it directly depends on the profiling interval, i.e. how often the profiler takes thread snapshots.
The problem with sampling profilers is that the JDK's public API for getting stack traces is flawed. The JVM does not take a stack trace at an arbitrary moment in time. Rather, it stops a thread at one of the predefined places where it knows how to reliably walk the stack. These places are called safepoints. Safepoints are located at method exits (excluding inlined methods) and inside loops (excluding short counted loops). That's why, if you have a long linear piece of code or a short counted loop, you'll never see it in a sampling profiler that relies on the JVM's standard getStackTrace API.
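As an illustration of the counted-loop case (my own example, not from the original explanation): HotSpot may emit no safepoint poll inside a loop with an int counter, so a safepoint-based sampler can attribute almost no samples to it, no matter how hot it is.

// A "counted loop": the int induction variable lets the JIT prove the loop
// terminates, so it may omit the safepoint poll inside the loop body.
// A safepoint-biased sampler will then rarely, if ever, sample inside it.
static long sum(int[] data) {
    long total = 0;
    for (int i = 0; i < data.length; i++) {
        total += data[i];
    }
    return total;
}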
This problem is known as Safepoint Bias. It is described well in a great post by Nitsan Wakart. VisualVM is not the only victim. Many other profilers, including commercial tools, also suffer from the same issue, because the original problem is in the JVM rather than in a particular profiling tool.
Java Flight Recorder is much better, since it does not rely on safepoints. However, it has its own flaws: for example, it cannot get a stack trace when a thread is executing certain JVM intrinsic methods like System.arraycopy. This is especially disappointing, since arraycopy is a frequent bottleneck in Java applications.
Try async-profiler. The goal of the project is exactly to solve the above issues. It should provide a fair view of the application performance, while having a very small overhead. async-profiler works on Linux and macOS. If you are on Windows, JFR is still your best bet.
Will a sequential algorithm written in Java execute faster (in Eclipse) on a machine with 60 GB of RAM and 16 cores, compared to a dual-core machine with 16 GB of RAM? I expected that the algorithm would run faster, but experiments on Google Compute Engine and on my laptop showed that this is not the case. I would appreciate it if someone could explain why this happens.
Java doesn't parallelize the code automatically for you; you need to do it yourself.
There are some abstractions, like parallel streams, that give you concise parallelism, but still, the performance of your program is governed by Amdahl's law. Having more memory will help in launching more threads and applying parallel algorithms to leverage more cores.
Example:
Arrays.sort is a sequential dual-pivot quicksort that runs in O(n log n) time; its overall performance is governed by the clock rate.
Arrays.parallelSort is a parallel merge sort; it uses more space (so here memory is important), dividing the array into pieces, sorting each piece and merging them.
But someone had to write this parallel sort in order to benefit from multi-core machines.
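A rough demonstration of the difference (a sketch, and a naive one: it ignores the warm-up caveats discussed earlier, so treat the numbers as indicative only):

import java.util.Arrays;
import java.util.Random;

public class SortDemo {
    public static void main(String[] args) {
        int[] a = new Random(1).ints(10_000_000).toArray();
        int[] b = a.clone();

        long t0 = System.nanoTime();
        Arrays.sort(a);             // sequential dual-pivot quicksort
        long t1 = System.nanoTime();
        Arrays.parallelSort(b);     // fork/join merge sort across cores
        long t2 = System.nanoTime();

        System.out.printf("sort: %d ms, parallelSort: %d ms%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
    }
}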
What can be done automatically for you is highly concurrent and parallel GC, which affects the overall performance of your program.
You are asking about a sequential algorithm, which clearly means there are no multiple threads and no parallelism or multi-processing involved in the execution of the code. Let's say the code is:
a = 5;
b = a + 5;
c = b + 5;
...
and so on...
We cannot execute any of the later lines in parallel, because each depends on the value computed before it.
A simple loop,
for i from 1 to 100 increment 1
a = a + i
will have to be executed 100 times, in order, because each iteration depends on the value of a from the previous one; executing iterations independently would change the result, and hence it cannot be parallelized.
Also, since you are not using threads in your code, and Java has no inbuilt automatic parallelism, there go your chances even if the code were somewhat parallelizable.
If it's a single-threaded piece of code, the system it runs on still has some influence on the execution time. This is characterized by the IPC (instructions per cycle):
https://en.wikipedia.org/wiki/Instructions_per_cycle
Your code will definitely run faster on a newer system than on a 10-year-old one, but maybe the single-thread difference between the two machines you mentioned is not significant enough.
I have been developing with Java for some time now, and I always strive to do things in the most efficient way. So far that has mostly meant trying to condense the number of lines of code I have. But when starting to work with 2D rendering, it is more about how long it takes to compute a certain piece of code, as it is called many times a second.
My question:
Is there some way to measure how long it takes to compute a certain piece of code in Eclipse, Java, ... ?
First, some nitpicking. You title this question ...
Putting a number to the efficiency of an algorithm
There is no practical quantifiable measure of "efficiency" for an algorithm. Efficiency (as normally conceived) is a measure of "something" relative to an ideal / perfect; e.g. a hypothetical 100% efficient steam engine would convert all of the energy in the coal being burned into useful "work". But for software, there is no ideal to measure against. (Or if there is, we can't be sure that it is the ideal.) Hence "efficiency" is the wrong term.
What you actually mean is a measure of the "performance" of an algorithm. But an algorithm is an abstract concept, and its performance cannot be measured.
What you actually want is a measure of the performance of a specific implementation of an algorithm; i.e. some actual code.
So how do you quantify performance?
Well, ultimately there is only one sound way to quantify performance. You measure it, empirically. (How you do that ... and the limitations ... are a matter I will come to.)
But what about theoretical approaches?
A common theoretical approach is to analyse the algorithm to give you a measure of computational complexity. The classic measure is Big-O Complexity. This is a very useful measure, but unfortunately Big-O Complexity does not actually measure performance at all. Rather, it is a way of characterizing the behaviour of an algorithm as the problem size scales up.
To illustrate, consider these algorithms for adding N numbers together:
int sum(int[] input) {
    int sum = 0;
    for (int i = 0; i < input.length; i++) {
        sum += input[i];
    }
    return sum;
}

int sum(int[] input) {
    int tmp = p(1000); // calculates the 1000th prime number (p assumed given)
    int sum = 0;
    for (int i = 0; i < input.length; i++) {
        sum += input[i];
    }
    return sum;
}
We can prove that both versions of sum have a complexity of O(N), according to the accepted mathematical definitions. Yet it is obvious that the first one will be faster than the second one ... because the second one also performs a large (and pointless) calculation.
In short: Big-O Complexity is NOT a measure of Performance.
What about theoretical measures of Performance?
Well, as far as I'm aware, there are none that really work. The problem is that real performance (as in time taken to complete) depends on various complicated things in the compilation of code to executables AND on the way that real execution platforms (hardware) behave. It is too complicated to do a theoretical analysis that will reliably predict actual performance.
So how do you measure performance?
The naive answer is to benchmark like this:
Take a clock measurement
Run the code
Take a second clock measurement
Subtract the first measurement from the second ... and that is your answer.
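In Java, that recipe looks something like this (a sketch; the doWork parameter stands in for whatever code is being measured):

// Naive timing: one clock reading before, one after, subtract.
static long timeOnce(Runnable doWork) {
    long start = System.nanoTime();          // first clock measurement
    doWork.run();                            // run the code under test
    return System.nanoTime() - start;        // second measurement minus first
}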
But it doesn't work. Or more precisely, the answer you get may be wildly different from the performance that the code exhibits when you use it in a real world context.
Why?
There may be other things happening on the machine ... or things that have already happened ... that influence the code's execution time. Another program might be running. You may have files pre-loaded into the file system cache. You may get hit by CPU clock scaling ... or a burst of network traffic.
Compilers and compiler flags can often make a lot of difference to how fast a piece of code runs.
The choice of inputs can often make a big difference.
If the compiler is smart, it might deduce that some or all of your benchmarked code does nothing "useful" (in the context) ... and optimize it away entirely.
And for languages like Java and C#, there are other important issues:
Implementations of these languages typically do a lot of work during startup to load and link the code.
Implementations of these languages are typically JIT compiled. This means that the language runtime system does the final translation of the code (e.g. bytecodes) to native code at runtime. The performance characteristics of your code after JIT compilation change drastically, and the time taken to do the compilation may be significant ... and may distort your time measurements.
Implementations of these languages typically rely on a garbage collected heap for memory management. The performance of a heap is often uneven, especially at startup.
These things (and possibly others) contribute to something that we call (in Java) JVM warmup overheads; particularly JIT compilation. If you don't take account of these overheads in your methodology, then your results are liable to be distorted.
So what is the RIGHT way to measure performance of Java code?
It is complicated, but the general principle is to run the benchmark code lots of times in the same JVM instance, measuring each iteration. The first few measurements (during JVM warmup) should be discarded, and the remaining measurements should be averaged.
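A minimal hand-rolled harness following that principle might look like this (a sketch only; real frameworks handle many more pitfalls, such as dead-code elimination):

static double averageNanos(Runnable task, int warmupRuns, int measuredRuns) {
    for (int i = 0; i < warmupRuns; i++) {
        task.run();                            // discarded: JIT warms up here
    }
    long total = 0;
    for (int i = 0; i < measuredRuns; i++) {
        long start = System.nanoTime();
        task.run();
        total += System.nanoTime() - start;
    }
    return total / (double) measuredRuns;      // average ns per measured run
}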
These days, the recommended way to do Java benchmarking is to use a reputable benchmarking framework. The two prime candidates are Caliper and Oracle's jmh tool.
And what are the limitations of performance measurements that you mentioned?
Well I have alluded to them above.
Performance measurements can be distorted by various environmental factors on the execution platform.
Performance can be dependent on inputs ... and this may not be revealed by simple measurement.
Performance (e.g. of C / C++ code) can be dependent on the compiler and compiler switches.
Performance can be dependent on hardware; e.g. processor speed, number of cores, memory architecture, and so on.
These factors can make it difficult to make general statements about the performance of a specific piece of code, and to make general comparisons between alternative versions of the same code. As a rule, we can only make limited statements like "on system X, with compiler Y and input set Z the performance measures are P, Q, R".
The number of lines of code has very little correlation with the execution speed of a program.
Your program will look completely different after it's processed by the compiler. In general, modern compilers perform many optimizations, such as loop unrolling, getting rid of variables that are not used, getting rid of dead code, and hundreds more.
So instead of trying to "squeeze" the last bit of performance/memory out of your program by using short instead of int, char[] instead of String, or whichever trick you think will "optimize" your program (that's premature optimization), just write it using objects and types that make sense to you, so it will be easier to maintain. Your compiler, interpreter or VM should take care of the rest. If it doesn't, only then do you start looking for bottlenecks and playing with hacks.
So what makes programs fast then? Algorithmic efficiency (at least it tends to make the biggest difference if the algorithm/data structure was not designed right). This is what computer scientists study.
Let's say you're given 2 data structures. An array, and a singly linked list.
An array stores things in a block, one after the other.
+-+-+-+-+-+-+-+
|1|3|2|7|4|6|1|
+-+-+-+-+-+-+-+
To retrieve the element at index 3, you simply go to the 4th square and retrieve it. You know where it is because you know it's 3 squares after the first one.
A singly linked list will store things in a node, which may not be stored contiguously in memory, but each node will have a tag (pointer, reference) on it telling you where the next item in the list is.
+-+ +-+ +-+ +-+ +-+ +-+ +-+
|1| -> |3| -> |2| -> |7| -> |4| -> |6| -> |1|
+-+ +-+ +-+ +-+ +-+ +-+ +-+
To retrieve the element at index 3, you have to start at the first node, then follow its tag to node 1, then on to node 2, and finally you arrive at node 3 - all because you don't know where the nodes are, so you have to follow a path to them.
Now say you have an array and an SLL, both containing the same data, with length n; which one would be faster? It depends on how you use them.
Let's say you do a lot of insertions at the end of the list. The algorithms (pseudocode) would be:
Array:
array[list.length] = element to add
increment length field
SLL:
currentNode = first element of SLL
while currentNode has next node:
currentNode = currentNode's next element
currentNode's next element = new Node(element to add)
increment length field
As you can see, in the array algorithm it doesn't matter what the size of the array is: it always takes a constant number of operations. Let's say a[list.length] takes 1 operation, assigning to it is another operation, and incrementing the field and writing it to memory is 2 more operations; that makes 4 operations every time. But if you look at the SLL algorithm, it takes at least list.length operations just to find the last element in the list. In other words, the time it takes to add an element to the end of an SLL increases linearly with its size, t(n) = n, whereas for the array it's more like t(n) = 4.
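A direct Java translation of that SLL pseudocode (a sketch; note that java.util.LinkedList avoids this cost by also keeping a reference to the last node):

class Node {
    int value;
    Node next;
    Node(int value) { this.value = value; }
}

class NaiveList {
    Node head;

    // O(n): must walk the whole list before linking the new node.
    void addLast(int value) {
        Node added = new Node(value);
        if (head == null) { head = added; return; }
        Node current = head;
        while (current.next != null) {
            current = current.next;
        }
        current.next = added;
    }
}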
I suggest reading the free book written by my data structures professor. It even has working code in C++ and Java.
Generally speaking, speed vs. lines of code is not an effective measure of performance, since it depends heavily on your hardware and your compiler. There is something called Big-O notation, which gives a picture of how fast an algorithm will run as the number of inputs increases.
For example, if your algorithm is O(n), then the time it takes the code to run scales linearly with the input size. If your algorithm is O(1), then the running time is constant regardless of input size.
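A quick illustration of those two cases (my own example):

// O(n): the running time grows linearly with the array length.
static int sum(int[] values) {
    int total = 0;
    for (int v : values) {
        total += v;
    }
    return total;
}

// O(1): constant time, no matter how long the array is.
static int first(int[] values) {
    return values[0];
}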
I found this particular way of measuring performance useful because you learn that it's not really the lines of code that affect speed; it's your code's design that affects speed. Code that handles the problem more efficiently can be faster than a less efficient method written in a tenth of the lines of code.
I put together a microbenchmark that seemed to show that the following types of calls took roughly the same amount of time across many iterations after warmup.
static.method(arg);
static.finalAnonInnerClassInstance.apply(arg);
static.modifiedNonFinalAnonInnerClassInstance.apply(arg);
Has anyone found evidence that these different types of calls will, in the aggregate, have different performance characteristics? My finding is that they don't, but I found that a little surprising (especially knowing that the bytecode is quite different for at least the static call), so I want to know whether others have evidence either way.
If they indeed have exactly the same performance, that would mean there is no penalty for having that level of indirection in the modified non-final case.
I know the standard optimization advice would be "write your code and profile it", but I'm writing a framework code-generation kind of thing, so there is no specific code to profile, and the choice between static and non-final is fairly important for both flexibility and, possibly, performance. I am using framework code in the microbenchmark, which is why I can't include it here.
My test was run on Windows JDK 1.7.0_06.
If you benchmark it in a tight loop, the JVM will cache the instance, so there's no apparent difference.
If the code is executed in a real application:
if it's expected to be executed back-to-back very quickly - for example, String.length() used in for(int i=0; i<str.length(); i++){ short_code; } - the JVM will optimize it; no worries.
if it's executed frequently enough that the instance is most likely in the CPU's L1 cache, the extra load of the instance is very fast; no worries.
otherwise, there is a non-trivial overhead; but then it's executed so infrequently that the overhead is almost impossible to detect amid the overall cost of the application; no worries.
First of all, I would like to know the fundamental difference between loop optimization and loop transformation. Also, a simple loop in C follows:
for (i = 0; i < N; i++)
{
    a[i] = b[i] * c[i];
}
but we can unroll it to:
for (i = 0; i < N/2; i++)
{
    a[i*2] = b[i*2] * c[i*2];
    a[i*2 + 1] = b[i*2 + 1] * c[i*2 + 1];
}
But we can unroll it even further. What is the limit up to which we can unroll it, and how do we find that?
There are many more techniques, like loop tiling, loop distribution, etc. How does one determine when to use the appropriate one?
I will assume that the OP has already profiled his/her code and has discovered that this piece of code is actually important, and actually answer the question :-) :
The compiler will try to make the loop unrolling decision based on what it knows about your code and the processor architecture.
In terms of making things faster:
As someone pointed out, unrolling does reduce the number of loop termination condition compares and jumps.
Depending on the architecture, the hardware may also support an efficient way to index nearby memory locations (e.g., mov eax, [ebx + 4]) without adding additional instructions (this may expand to more micro-ops, though - not sure).
Most modern processors use out-of-order execution to find instruction-level parallelism. This is hard to do when the next N instructions sit behind multiple conditional jumps (i.e., the hardware would need to be able to discard variable levels of speculation).
There is more opportunity to reorder memory operations earlier so that the data fetch latency is hidden.
Code vectorization (e.g., converting to SSE/AVX), may also occur which allows parallel execution of the code in some cases. This is also a form of unrolling.
In terms of deciding when to stop unrolling:
Unrolling increases code size. The compiler knows that there are penalties for exceeding instruction cache size (all modern processors), trace cache (P4), loop buffer cache (Core2/Nehalem/SandyBridge), micro-op cache (SandyBridge), etc. Ideally it uses static cost-benefit heuristics (a function of the specific code and architecture) to determine which level of unrolling will result in the best overall net performance. Depending on the compiler, the heuristics may vary (often I find that it would be nice to tweak this oneself).
Generally, if the loop contains a large amount of code it is less likely to be unrolled because the loop cost is already amortized, there is plenty of ILP available, and the code bloat cost of unrolling is excessive. For smaller pieces of code, the loop is likely to be unrolled, since the cost is likely to be low. The actual number of unrolls will depend on the specifics of the architecture, compiler heuristics and code, and will be what the compiler decides is optimal (it may not be :-) ).
In terms of when YOU should be doing these optimizations:
When you don't think the compiler did the correct thing. The compiler may not be sophisticated (or sufficiently up to date) enough to use the knowledge of the architecture you are working on optimally.
Possibly, the heuristics just failed (they are just heuristics after all). In general, if you know the piece of code is very important, try unrolling it, and if it improves performance, keep it; otherwise throw it out. Also, only do this when you have roughly the whole system in place, since what may be beneficial when your code's working set is 20k may not be beneficial when it is 31k.
This may seem rather off-topic to your question, but I cannot help stressing its importance.
The key is to write correct code and get your code working as per the requirements, without being bothered about micro-optimization.
If you later find your program to be lacking in performance, then you profile(!) your application to find the problem areas, and then try to optimize them.
Remember, as one of the wise guys said: it is only 10% of your code that runs 90% of the total run time of your application. The trick is to identify that code through profiling, and then try to optimize it.
Well, considering that your first attempt at optimizing is already wrong in 50% of all cases (try any odd N), I really wouldn't try anything more complex.
Also, instead of multiplying your indices, just add 2 to i and loop up to N again - that avoids the unnecessary shifting (a minor effect as long as we stay with powers of 2, but still). A corrected sketch follows below.
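Putting those fixes together, the loop might look like this (a sketch, assuming a, b, c and N as in the question; this fragment reads the same in C and Java, and the leftover element is handled so that odd N works too):

int i;
for (i = 0; i + 1 < N; i += 2)
{
    a[i] = b[i] * c[i];
    a[i + 1] = b[i + 1] * c[i + 1];
}
if (i < N)
{
    a[i] = b[i] * c[i];   /* N odd: one element left over */
}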
To summarize: you created incorrect code that is slower than what a compiler could produce - that's the perfect example of why you shouldn't do this stuff, I assume.