Strange behaviour of the HotSpot loop condition optimizer - java

Based on the discussions around an answer to this question, I discovered a really strange behaviour of the Java HotSpot optimizer. The observed behaviour can be seen at least in the Oracle VM 1.7.0_17, but seems to occur in older Java 6 versions as well.
First of all, I was already aware that the optimizer knows some methods in the standard API are invariant and have no side effects. When executing a loop like double x=0.5; for(double d = 0; d < Math.sin(x); d += 0.001);, the expression Math.sin(x) is not evaluated on every iteration; the optimizer knows that Math.sin has no relevant side effects and that its result is invariant as long as x is not modified in the loop.
Now I noticed that simply changing x from 0.5 to 1.0 disables this optimization. Further tests indicate that the optimization is only enabled if abs(x) < asin(1/sqrt(2)), i.e. if abs(x) < pi/4. Is there a good reason for that which I don't see, or is that an unnecessary limitation on when the optimization applies?
Edit: The optimization seems to be implemented in hotspot/src/share/vm/opto/subnode.cpp.
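For reproduction, here is a minimal harness along the lines of what the question describes (my own sketch; it uses System.nanoTime for timing and only a crude warm-up, so treat the numbers as indicative only):

// Sketch: time the questioner's loop for the two values of x.
// With x = 0.5 the call reportedly gets hoisted out of the loop;
// with x = 1.0 it reportedly does not.
public class SinHoistTest {
    static long time(double x) {
        long start = System.nanoTime();
        for (int run = 0; run < 10000; run++) {
            for (double d = 0; d < Math.sin(x); d += 0.001) {
                // empty body: only the loop condition matters
            }
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        time(0.5); // warm-up so the JIT compiles the method
        time(1.0);
        System.out.println("x = 0.5: " + time(0.5) + " ns");
        System.out.println("x = 1.0: " + time(1.0) + " ns");
    }
}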

I think your question is specifically about the Oracle JVM, because the implementation of Math is implementation-dependent. Here is a good answer about the Dalvik implementation, for example:
native code for Java Math class
Generally
sin(a) * sin(a) + cos(a) * cos(a) = 1
sin(pi/2 - a) = cos(a)
sin(-a) = -sin(a)
cos(-a) = cos(a)
so we don't need separate sin/cos implementations for x < 0 or x > pi/4; those cases reduce to the core range.
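To make the idea concrete, here is a toy illustration of that range reduction (my own sketch with Taylor-polynomial stand-ins; this is not HotSpot's actual code):

// Toy range reduction: a core approximation trusted only on [0, pi/4],
// extended to [0, pi/2] via the identity sin(x) = cos(pi/2 - x).
public class RangeReduction {
    static double coreSin(double x) {   // Taylor polynomial, |x| <= pi/4
        double x2 = x * x;
        return x * (1 - x2 / 6 * (1 - x2 / 20 * (1 - x2 / 42)));
    }

    static double coreCos(double x) {   // Taylor polynomial, |x| <= pi/4
        double x2 = x * x;
        return 1 - x2 / 2 * (1 - x2 / 12 * (1 - x2 / 30));
    }

    static double sinReduced(double x) {  // assumes 0 <= x <= pi/2
        return (x <= Math.PI / 4) ? coreSin(x) : coreCos(Math.PI / 2 - x);
    }

    public static void main(String[] args) {
        System.out.println(sinReduced(1.0) + " vs " + Math.sin(1.0));
    }
}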
I suppose this is the answer (http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=5005861):
We are aware of the almabench results and the osnews article on
trigonometric performance. However, the HotSpot implementation of
sin/cos on x86 for years has used and continues to use fsin/fcos x87
instructions in a range where those instructions meet the quality of
implementation requirements, basically [-pi/4, pi/4]. Outside of that
range, the results of fsin/fcos can be anywhere in the range [-1, 1]
with little relation to the true sine/cosine of the argument. For
example, fsin(Math.PI) only gets about half the digits of the result
correct. The reason for this is that the fsin/fcos instruction
implementations use a less than ideal algorithm for argument
reduction; the argument reduction process is explained in bug
4857011.
Conclusion: you have seen the results of the argument-reduction algorithm in action, not a limitation of the optimization.

Related

Java getChars method in Integer class, why is it using bitwise operations instead of arithmetic?

So I was examining the Integer class's source code (JDK 8) to understand how an int gets converted to a String. It seems to use a package-private method called getChars (line 433) to convert an int to a char array.
While the code is not that difficult to understand, there are multiple places where bitwise shift operations are used instead of simple arithmetic multiplication or division, such as the following lines:
// really: r = i - (q * 100);
r = i - ((q << 6) + (q << 5) + (q << 2));
and
q = (i * 52429) >>> (16+3);
r = i - ((q << 3) + (q << 1)); // r = i-(q*10) ...
I just do not understand the point of doing that. Is this actually an optimization, and does it affect the runtime of the algorithm?
Edit:
To put it another way: since the compiler does this type of optimization internally, is this manual optimization necessary?
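For reference, the arithmetic behind those lines is 100 = 64 + 32 + 4 and 10 = 8 + 2, while 52429/2^19 ≈ 0.1000004, so the unsigned shift yields i/10 for the small values getChars feeds it. A quick sketch (mine, not from the JDK) verifying all three identities:

public class GetCharsCheck {
    public static void main(String[] args) {
        for (int i = 0; i <= 65536; i++) {
            int mul100 = (i << 6) + (i << 5) + (i << 2); // 100 = 64 + 32 + 4
            int mul10  = (i << 3) + (i << 1);            // 10 = 8 + 2
            int div10  = (i * 52429) >>> (16 + 3);       // ~ i * 0.1000004
            if (mul100 != i * 100 || mul10 != i * 10 || div10 != i / 10) {
                System.out.println("mismatch at " + i);
            }
        }
        System.out.println("checked 0..65536");
    }
}

Note that (i * 52429) overflows int for large i, but because the true product stays below 2^32 the unsigned shift >>> still produces the right answer in this range.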
I don't know the reason for this specific change, and unless you find the original author, you're unlikely to get an authoritative answer anyway.
But I'd like to respond to the wider point, which is that a lot of code in the runtime library (java.* and many internal packages) is optimized to a degree that would be very unusual (and I dare say irresponsible) to apply to "normal" application code.
And that has basically two reasons:
It's called a lot and in many different environments. Optimizing a method in your server to take 0.1% less CPU time when it's only executed 50 times per day on 3 servers won't be worth the effort you put into it. If, however, you can make Integer.toString 0.1% faster for everyone who will ever execute it, then this can turn into a very big win indeed.
If you optimize your application code on a specific VM then updating that VM to a newer version can easily undo your optimization, when the compiler decides to optimize differently. With code in java.* this is far less of an issue, because it is always shipped with the runtime that will run it. So if they introduce a compiler change that makes a given optimization no longer optimal, then they can change the code to match this.
tl;dr java.* code is often optimized to an insane degree because it's worth it and they can know that it will actually work.
There are a couple of reasons this is done. As a long-time embedded developer, using tiny microcontrollers that sometimes didn't even have multiplication and division instructions, I can tell you that this is significantly faster. The key here is that the multiplier is a constant. If you were multiplying two variables, you'd either need to use the slower multiply and divide operators or, if they didn't exist, perform multiplication using a loop with the add operator.

Java/C: OpenJDK native tanh() implementation wrong?

I was digging through some of the native C source code behind the Java Math functions, especially tanh(), as I was curious to see how they implemented that one.
However, what I found surprised me:
double tanh(double x) {
    ...
    if (ix < 0x40360000) {          /* |x| < 22 */
        if (ix < 0x3c800000)        /* |x| < 2**-55 */
            return x*(one+x);       /* tanh(small) = small */
        ...
}
As the comment indicates, the Taylor series of tanh(x) around 0 starts with:
tanh(x) = x - x^3/3 + ...
Then why does it look like they implemented it as:
tanh(x) = x * (1 + x)
= x + x^2
This is clearly not the correct expansion, and it is an even worse approximation than just using tanh(x) = x (which would be faster), as indicated by this plot:
(The bold line is the expression given above. The other, gray one is log(abs(x(1+x) - tanh(x))). The sigmoid is, of course, tanh(x) itself.)
So, is this a bug in the implementation, or is it a hack to work around some problem (like numerical issues, which I can't really think of here)? Note that I expect the outcome of both approaches to be exactly the same, as there are not enough mantissa bits to actually perform the addition 1 + x for x < 2^(-55).
EDIT: I will include a link to the version of the code at the time of writing, for future reference, as this might get fixed.
Under the conditions in which that code is executed, and supposing that IEEE-754 double-precision floating point representations and arithmetic are in use, 1.0 + x will always evaluate to 1.0, so x * (1.0 + x) will always evaluate to x. The only externally (to the function) observable effect of performing the computation as is done instead of just returning x would be to set the IEEE "inexact" status flag.
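That claim is easy to demonstrate (a sketch; 0x1.0p-56 is a hexadecimal floating-point literal for 2^-56, safely below the 2^-55 cutoff in the code):

double x = 0x1.0p-56;                     // 2^-56, below the 2^-55 cutoff
System.out.println(1.0 + x == 1.0);       // true: x is below half an ulp of 1.0
System.out.println(x * (1.0 + x) == x);   // true: the whole expression is just x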
Although I know of no way to query the FP status flags from Java, other native code could conceivably query them. More likely than not, however, the practical reason for the implementation is given by these remarks in the Javadocs for java.lang.StrictMath:
To help ensure portability of Java programs, the definitions of some of the numeric functions in this package require that they produce the same results as certain published algorithms. These algorithms are available from the well-known network library netlib as the package "Freely Distributable Math Library," fdlibm. These algorithms, which are written in the C programming language, are then to be understood as executed with all floating-point operations following the rules of Java floating-point arithmetic.
The Java math library is defined with respect to fdlibm version 5.3. Where fdlibm provides more than one definition for a function (such as acos), use the "IEEE 754 core function" version (residing in a file whose name begins with the letter e). The methods which require fdlibm semantics are sin, cos, tan, asin, acos, atan, exp, log, log10, cbrt, atan2, pow, sinh, cosh, tanh, hypot, expm1, and log1p.
(Emphasis added.) You will note in the C source code an #include "fdlibm.h" that seems to tie it to the Javadoc comments.

How fast is left shift (<<2) compared to multiply by 2(*2) in java? [duplicate]

Shifting bits left and right is apparently faster than multiplication and division operations on most, maybe even all, CPUs if you happen to be using a power of 2. However, it can reduce the clarity of code for some readers and some algorithms. Is bit-shifting really necessary for performance, or can I expect the compiler or VM to notice the case and optimize it (in particular, when the power-of-2 is a literal)? I am mainly interested in the Java and .NET behavior but welcome insights into other language implementations as well.
Almost any environment worth its salt will optimize this away for you. And if it doesn't, you've got bigger fish to fry. Seriously, do not waste one more second thinking about this. You will know when you have performance problems. And after you run a profiler, you will know what is causing it, and it should be fairly clear how to fix it.
You will never hear anyone say "my application was too slow, then I started randomly replacing x * 2 with x << 1 and everything was fixed!" Performance problems are generally solved by finding a way to do an order of magnitude less work, not by finding a way to do the same work 1% faster.
Most compilers today will do more than convert multiply or divide by a power-of-two to shift operations. When optimizing, many compilers can optimize a multiply or divide with a compile time constant even if it's not a power of 2. Often a multiply or divide can be decomposed to a series of shifts and adds, and if that series of operations will be faster than the multiply or divide, the compiler will use it.
For division by a constant, the compiler can often convert the operation to a multiply by a 'magic number' followed by a shift. This can be a major clock-cycle saver since multiplication is often much faster than a division operation.
Henry Warren's book, Hacker's Delight, has a wealth of information on this topic, which is also covered quite well on the companion website:
http://www.hackersdelight.org/
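As an illustration, here is a sketch of the signed divide-by-10 transform using the magic constant given in Hacker's Delight: a multiply-high, a shift, and a +1 correction for negative inputs so the result truncates toward zero like Java's / operator.

// Sketch: n / 10 without a division instruction, for any int n.
static int div10(int n) {
    int q = (int) (((long) n * 0x66666667L) >> 34); // multiply-high + shift
    return q + (n >>> 31);                          // +1 correction when n < 0
}

For example, div10(1234) yields 123 and div10(-1234) yields -123, matching Java's truncating integer division.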
See also a discussion (with a link or two) in:
Reading assembly code
Anyway, all this boils down to allowing the compiler to take care of the tedious details of micro-optimizations. It's been years since doing your own shifts outsmarted the compiler.
Humans are wrong in these cases:
99% of the time when they try to second-guess modern (and all future) compilers,
99.9% of the time when they try to second-guess modern (and all future) JITs at the same time,
99.999% of the time when they try to second-guess modern (and all future) CPU optimizations.
Program in a way that accurately describes what you want to accomplish, not how to do it. Future versions of the JIT, VM, compiler, and CPU can all be independently improved and optimized. If you specify something so tiny and specific, you lose the benefit of all future optimizations.
You can almost certainly depend on a multiplication by a literal power of two being optimised to a shift operation. This is one of the first optimisations that students of compiler construction learn. :)
However, I don't think there's any guarantee for this. Your source code should reflect your intent, rather than trying to tell the optimiser what to do. If you're making a quantity larger, use multiplication. If you're moving a bit field from one place to another (think RGB colour manipulation), use a shift operation. Either way, your source code will reflect what you are actually doing.
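For instance, the RGB case naturally reads as shifts and masks (a small sketch):

int rgb = 0xFFA07A;              // a packed 0xRRGGBB value
int r = (rgb >>> 16) & 0xFF;     // red field moved down:   0xFF
int g = (rgb >>> 8)  & 0xFF;     // green field moved down: 0xA0
int b = rgb & 0xFF;              // blue field:             0x7A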
Note that shifting right and dividing will (in Java, certainly) give different results for negative odd numbers.
int a = -7;
System.out.println("Shift: "+(a >> 1));
System.out.println("Div: "+(a / 2));
Prints:
Shift: -4
Div: -3
Since Java doesn't have unsigned integer types, a Java compiler can't simply replace such a division with a shift; it would need an extra correction for negative values.
On the computers I tested, integer divisions are 4 to 10 times slower than other operations.
While compilers may replace division by a power of 2 with a shift, so that you see no difference, divisions by values that are not powers of 2 are significantly slower.
For example, I have a (graphics) program with many, many divisions by 255.
Actually, my computation is:
r = (((top.R - bottom.R) * alpha + (bottom.R * 255)) * 0x8081) >> 23;
I can assure you that it is a lot faster than my previous computation:
r = ((top.R - bottom.R) * alpha + (bottom.R * 255)) / 255;
So no, compilers cannot do all the optimization tricks for you.
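If you want to convince yourself that the 0x8081 trick is exact over the range the blend above can produce (0 to 255*255), an exhaustive check is cheap (my own sketch):

// Sketch: verify (x * 0x8081) >>> 23 == x / 255 for the blend's value range.
for (int x = 0; x <= 255 * 255; x++) {
    if (((x * 0x8081) >>> 23) != x / 255) {
        System.out.println("mismatch at " + x);
    }
}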
I would ask: what are you doing that it would matter? First, design your code for readability and maintainability. The likelihood that bit shifting versus standard multiplication will make a performance difference is extremely small.
It is hardware dependent. If we are talking about a microcontroller or an i386, then shifting might be faster, but, as several answers state, your compiler will usually do the optimization for you.
On modern (Pentium Pro and beyond) hardware, pipelining makes this totally irrelevant, and straying from the beaten path usually means you lose a lot more optimizations than you can gain.
Micro optimizations are not only a waste of your time, they are also extremely difficult to get right.
If the compiler (compile-time constant) or JIT (runtime constant) knows that the divisor or multiplicand is a power of two and integer arithmetic is being performed, it will convert it to a shift for you.
According to the results of this microbenchmark, shifting is twice as fast as dividing (Oracle Java 1.7.0_72).
Most compilers will turn multiplication and division into bit shifts when appropriate. It is one of the easiest optimizations to do. So, you should do what is more easily readable and appropriate for the given task.
I am stunned as I just wrote this code and realized that shifting by one is actually slower than multiplying by 2!
(EDIT: changed the code to stop overflowing after Michael Myers' suggestion, but the results are the same! What is wrong here?)
import java.util.Date;

public class Test {
    public static void main(String[] args) {
        Date before = new Date();
        for (int j = 1; j < 50000000; j++) {
            int a = 1;
            for (int i = 0; i < 10; i++) {
                a *= 2;
            }
        }
        Date after = new Date();
        System.out.println("Multiplying " + (after.getTime() - before.getTime()) + " milliseconds");
        before = new Date();
        for (int j = 1; j < 50000000; j++) {
            int a = 1;
            for (int i = 0; i < 10; i++) {
                a = a << 1;
            }
        }
        after = new Date();
        System.out.println("Shifting " + (after.getTime() - before.getTime()) + " milliseconds");
    }
}
The results are:
Multiplying 639 milliseconds
Shifting 718 milliseconds
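The usual suspects in numbers like these are JIT warm-up and dead-code elimination: neither loop's result is used, so the JIT is free to discard either body. A sketch of a less misleading harness (my own, using System.nanoTime and a consumed result; a tool like JMH would still be more reliable):

// Sketch: keep the computed value live and time with System.nanoTime().
long sum = 0;
long start = System.nanoTime();
for (int j = 1; j < 50000000; j++) {
    int a = 1;
    for (int i = 0; i < 10; i++) {
        a *= 2;        // or: a = a << 1;
    }
    sum += a;          // consume the result so it cannot be eliminated
}
long elapsed = System.nanoTime() - start;
System.out.println("Multiplying: " + elapsed / 1000000 + " ms (sum=" + sum + ")");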

Faster implementation of Math.abs() by bit-manipulation

The normal implementation of Math.abs(x) (as implemented by Oracle) is given by
public static double abs(double a) {
    return (a <= 0.0D) ? 0.0D - a : a;
}
Isn't it faster to just set the one bit coding for the sign of the number to zero (or one)?
I suppose that there is only one bit coding the sign of the number, and that it is always the same bit, but I may be wrong about this.
Or are our computers generally unfit to do operations on single bits with single machine instructions?
If a faster implementation is possible, can you give it?
Edit:
It has been pointed out to me that Java code is platform independent, and as such cannot depend on the machine instructions of any single machine. To optimize code, however, the JVM HotSpot optimizer does consider the specifics of the machine and may well apply the very optimization under consideration.
Through a simple test, however, I have found that, at least on my machine, Math.abs doesn't seem to get optimized down to a single machine instruction. My code was as follows:
long before = System.currentTimeMillis();
int o = 0;
for (double i = 0; i < 1000000000; i++)
    if ((i - 500) * (i - 500) > ((i - 100) * 2) * ((i - 100) * 2)) // 4680 ms
        o++;
System.out.println(o);
System.out.println("using multiplication: " + (System.currentTimeMillis() - before));

before = System.currentTimeMillis();
o = 0;
for (double i = 0; i < 1000000000; i++)
    if (Math.abs(i - 500) > (Math.abs(i - 100) * 2)) // 4778 ms
        o++;
System.out.println(o);
System.out.println("using Math.abs: " + (System.currentTimeMillis() - before));
Which gives me the following output:
234
using multiplication: 4985
234
using Math.abs: 5587
Supposing that the multiplication is performed by a single machine instruction, it seems that, at least on my machine, the JVM HotSpot optimizer doesn't reduce Math.abs to a single-instruction operation.
My first thought was that it's because of NaN (Not-a-Number) values, i.e. if the input is NaN it should be returned without any change. But this seems not to be a requirement, as harold's test has shown that the JVM's internal optimization does not preserve the sign of NaNs (unless you use StrictMath).
The documentation of Math.abs says:
In other words, the result is the same as the value of the expression:
Double.longBitsToDouble((Double.doubleToLongBits(a)<<1)>>>1)
So the option of bit manipulations was known to the developers of this class but they decided against it.
Most probably because optimizing this Java code makes no sense. The HotSpot optimizer replaces the method's invocation with the appropriate FPU instruction once it encounters it in a hot spot, in most environments. This happens with a lot of java.lang.Math methods as well as Integer.rotateLeft and similar methods. They may have a pure Java implementation, but if the CPU has an instruction for it, the JVM will use that instead.
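For completeness, the bit-manipulation variant that the Javadoc formula describes can also be written with a mask instead of the two shifts (a sketch; both forms clear the sign bit, bit 63):

static double absBits(double a) {
    // Clear the sign bit of the IEEE-754 representation and reinterpret.
    return Double.longBitsToDouble(Double.doubleToRawLongBits(a) & 0x7fffffffffffffffL);
}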
I'm not a Java expert, but I think the problem is that this definition is not expressible in the language: bit operations on the raw memory of floats are machine-format specific and thus not portable. I'm not sure whether any of the JIT compilers will do the optimization.

is there a fast implementation of the log1p function?

I want a fast log1p function for Java. Java has Math.log1p, but it is apparently too slow for my needs.
I have found this code for log1p here:
http://golang.org/src/pkg/math/log1p.go
for the Go language.
Is it the same as the one in Java, or is it a faster one (assuming I translate it to Java)?
Is anyone aware of another fast implementation of log1p?
Thanks.
In "What Every Computer Scientist Should Know About Floating-Point Arithmetic" (https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html) there is a short algorithm to compute log1p within 5 epsilon for 0<=x<3/4 and given certain requirements on the arithmetic
double xp1 = 1 + x;
if (xp1 == 1)
    return x;
else
    return x * Math.log(xp1) / (xp1 - 1);
Maybe this performs better on your system than the built-in log1p implementation. However, use it with care (see the paper for things that could go wrong, e.g. in extended-based systems) and have some tests ready.
Since log1p(x) = Math.log(x + 1) mathematically, finding a fast natural-log algorithm is sufficient for what you need. (Note, though, that evaluating Math.log(x + 1) directly loses accuracy for very small x, which is precisely the case log1p exists to handle.)
Fast Natural Logarithm in Java
I have found the following approximation here, and there is not much information about it except that it is called "Borchardt's Algorithm" and that it is from the book "Dead Reckoning: Calculating Without Instruments". The approximation is not very good (some might say very bad...); it gets worse the larger the values are. But the approximation is also a monotonic, slowly increasing function, which is good enough for my use case.
public static double log(double x) {
    return 6 * (x - 1) / (x + 1 + 4 * Math.sqrt(x));
}
This approximation is 11.7 times faster than Math.log().
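A quick way to see the quality trade-off is to print the approximation next to Math.log over a few decades (a sketch):

// Sketch: compare Borchardt's approximation against Math.log.
for (double x = 0.5; x <= 512; x *= 2) {
    double approx = 6 * (x - 1) / (x + 1 + 4 * Math.sqrt(x));
    System.out.printf("x=%7.2f  approx=%9.5f  Math.log=%9.5f%n",
            x, approx, Math.log(x));
}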
See this site. Also, a performance comparison of math libraries in Java.
But probably what you need is to link to compiled C++ code, as detailed here.
