Faster implementation of Math.abs() by bit-manipulation - java

The normal implementation of Math.abs(x) (as implemented by Oracle) is given by
public static double abs(double a) {
    return (a <= 0.0D) ? 0.0D - a : a;
}
Wouldn't it be faster to just set the single bit that encodes the sign of the number to zero (or one)?
I assume that exactly one bit encodes the sign, and that it is always the same bit, but I may be wrong about that.
Or are our computers generally unable to operate on single bits with atomic instructions?
If a faster implementation is possible, can you give it?
edit:
It has been pointed out to me that Java code is platform independent, and as such it cannot depend on the atomic instructions of particular machines. To optimize code, however, the JVM's HotSpot optimizer does take the specifics of the machine into account and may well apply the very optimization under consideration.
Through a simple test, however, I have found that at least on my machine the Math.abs function does not seem to get optimized to a single atomic instruction. My code was as follows:
long before = System.currentTimeMillis();
int o = 0;
for (double i = 0; i < 1000000000; i++)
    if ((i - 500) * (i - 500) > ((i - 100) * 2) * ((i - 100) * 2)) // 4680 ms
        o++;
System.out.println(o);
System.out.println("using multiplication: " + (System.currentTimeMillis() - before));

before = System.currentTimeMillis();
o = 0;
for (double i = 0; i < 1000000000; i++)
    if (Math.abs(i - 500) > (Math.abs(i - 100) * 2)) // 4778 ms
        o++;
System.out.println(o);
System.out.println("using Math.abs: " + (System.currentTimeMillis() - before));
Which gives me the following output:
234
using multiplication: 4985
234
using Math.abs: 5587
Supposing that multiplication is performed by an atomic instruction, it seems that, at least on my machine, the JVM HotSpot optimizer does not optimize the Math.abs function into a single-instruction operation.

My first thought was that it's because of NaN (not-a-number) values, i.e. if the input is NaN it should be returned unchanged. But this seems not to be a requirement, as harold's test has shown that the JVM's internal optimization does not preserve the sign of NaNs (unless you use StrictMath).
The documentation of Math.abs says:
In other words, the result is the same as the value of the expression:
Double.longBitsToDouble((Double.doubleToLongBits(a)<<1)>>>1)
So the option of bit manipulations was known to the developers of this class but they decided against it.
Most probably because optimizing this Java code makes no sense. In most environments, the HotSpot optimizer will replace the invocation with the appropriate FPU instruction once it encounters it in a hot spot. This happens with a lot of the java.lang.Math methods, as well as with Integer.rotateLeft and similar methods. They may have a pure Java implementation, but if the CPU has an instruction for it, the JVM will use it.

I'm not a Java expert, but I think the problem is that such a definition is not expressible in the language: bit operations on floats are machine-format specific, hence not portable, and thus not allowed in Java. I'm not sure whether any of the JIT compilers will do the optimization.
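For reference, the expression quoted from the Javadoc amounts to clearing the sign bit, which is exactly the bit manipulation the question asks about. A minimal sketch of that version (the method name is mine, not from the JDK):

public static double bitAbs(double a) {
    // IEEE-754 doubles keep the sign in the most significant bit; masking it off yields |a|
    return Double.longBitsToDouble(Double.doubleToLongBits(a) & 0x7FFFFFFFFFFFFFFFL);
}

Whether this beats the branching version in practice depends on what the JIT emits for each, so it would need benchmarking rather than assuming.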

Related

Java getChars method in Integer class, why is it using bitwise operations instead of arithmetic?

So I was examining the Integer class's source code (JDK 8) to understand how an int gets converted to a String. It seems to use a package-private method called getChars (line 433) to convert an int to a char array.
While the code is not that difficult to understand, there are multiple places where bitwise shift operations are used instead of simple arithmetic multiplication/division, such as the following lines of code:
// really: r = i - (q * 100);
r = i - ((q << 6) + (q << 5) + (q << 2));
and
q = (i * 52429) >>> (16+3);
r = i - ((q << 3) + (q << 1)); // r = i-(q*10) ...
I just do not understand the point of doing that. Is this actually an optimization, and does it affect the runtime of the algorithm?
Edit:
To put it another way: since the compiler does this type of optimization internally, is this manual optimization necessary?
I don't know the reason for this specific choice, and unless you find the original author, it's unlikely you'll get an authoritative answer anyway.
But I'd like to respond to the wider point, which is that a lot of code in the runtime library (java.* and many internal packages) is optimized to a degree that would be very unusual (and I dare say irresponsible) to apply to "normal" application code.
And that has basically two reasons:
It's called a lot, and in many different environments. Optimizing a method in your server to take 0.1% less CPU time when it's only executed 50 times per day on each of 3 servers won't be worth the effort you put into it. If, however, you can make Integer.toString 0.1% faster for everyone who will ever execute it, then this can turn into a very big change indeed.
If you optimize your application code for a specific VM, then updating that VM to a newer version can easily undo your optimization when the compiler decides to optimize differently. With code in java.* this is far less of an issue, because it is always shipped with the runtime that will run it. So if they introduce a compiler change that makes a given optimization no longer optimal, they can change the code to match.
tl;dr: java.* code is often optimized to an insane degree because it's worth it, and the authors can know that it will actually work.
There are a couple of reasons this is done. As a long-time embedded developer, using tiny microcontrollers that sometimes didn't even have multiplication and division instructions, I can tell you that shifts are significantly faster. The key here is that the multiplier is a constant. If you were multiplying two variables, you'd either need to use the slower multiply and divide instructions or, if they didn't exist, perform multiplication using a loop with the add operator.
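To make the shift identities concrete: (q << 6) + (q << 5) + (q << 2) is 64q + 32q + 4q = 100q, (q << 3) + (q << 1) is 8q + 2q = 10q, and 52429 is roughly 2^19/10, so the multiply-and-shift approximates division by 10 on the range the JDK applies it to. A small self-check (my own demo, not JDK code):

for (int i = 0; i <= 65536; i++) {
    int q = (i * 52429) >>> (16 + 3);   // approximates i / 10
    int r = i - ((q << 3) + (q << 1));  // i - q*10
    if (q != i / 10 || r != i % 10)
        throw new AssertionError("mismatch at " + i);
}
System.out.println("shift tricks match plain arithmetic for 0..65536");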

Java/C: OpenJDK native tanh() implementation wrong?

I was digging through some of the native C source code behind the Java Math functions, especially tanh(), as I was curious to see how they implemented that one.
However, what I found surprised me:
double tanh(double x) {
    ...
    if (ix < 0x40360000) {            /* |x| < 22 */
        if (ix < 0x3c800000)          /* |x| < 2**-55 */
            return x * (one + x);     /* tanh(small) = small */
        ...
}
As the comment indicates, the Taylor series of tanh(x) around 0 starts with:
tanh(x) = x - x^3/3 + ...
Then why does it look like they implemented it as:
tanh(x) = x * (1 + x)
= x + x^2
Which is clearly not the correct expansion, and an even worse approximation than just using tanh(x) = x (which would be faster), as indicated by this plot:
(The bold line is the one indicated on top. The other gray one is the log(abs(x(1+x) - tanh(x))). The sigmoid is of course the tanh(x) itself.)
So, is this a bug in the implementation, or is it a hack to fix some problem (like numerical issues, which I can't really think of)? Note that I expect the outcome of both approaches to be exactly the same, as there are not enough mantissa bits to actually perform the addition 1 + x for x < 2^(-55).
EDIT: I will include a link to the version of the code at the time of writing, for future reference, as this might get fixed.
Under the conditions in which that code is executed, and supposing that IEEE-754 double-precision floating point representations and arithmetic are in use, 1.0 + x will always evaluate to 1.0, so x * (1.0 + x) will always evaluate to x. The only externally (to the function) observable effect of performing the computation as is done instead of just returning x would be to set the IEEE "inexact" status flag.
Although I know of no way to query the FP status flags from Java, other native code could conceivably query them. More likely than not, however, the practical reason for the implementation is given by these remarks in the Javadocs for java.lang.StrictMath:
To help ensure portability of Java programs, the definitions of some of the numeric functions in this package require that they produce the same results as certain published algorithms. These algorithms are available from the well-known network library netlib as the package "Freely Distributable Math Library," fdlibm. These algorithms, which are written in the C programming language, are then to be understood as executed with all floating-point operations following the rules of Java floating-point arithmetic.
The Java math library is defined with respect to fdlibm version 5.3. Where fdlibm provides more than one definition for a function (such as acos), use the "IEEE 754 core function" version (residing in a file whose name begins with the letter e). The methods which require fdlibm semantics are sin, cos, tan, asin, acos, atan, exp, log, log10, cbrt, atan2, pow, sinh, cosh, tanh, hypot, expm1, and log1p.
(Emphasis added.) You will note in the C source code an #include "fdlibm.h" that seems to tie it to the Javadoc comments.
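The claim about 1.0 + x is easy to verify from plain Java (a quick demo of my own; the hex float literal makes the magnitude explicit):

double x = 0x1.0p-60;                    // far below the 2^-55 cutoff
System.out.println(1.0 + x == 1.0);      // true: x is absorbed by the 1.0
System.out.println(x * (1.0 + x) == x);  // true: the whole expression is just x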

Strange behaviour of the Hotspot loop condition optimizer

Based on the discussions around an answer to this question, I discovered a really strange behaviour of the Java HotSpot optimizer. The observed behaviour can at least be seen in the Oracle VM 1.7.0_17, but seems to occur in older Java 6 versions as well.
First of all, I was already aware that the optimizer obviously knows that some methods in the standard API are invariant and have no side effects. When executing a loop like double x = 0.5; for (double d = 0; d < Math.sin(x); d += 0.001);, the expression Math.sin(x) is not evaluated on each iteration; the optimizer is aware that the method Math.sin has no relevant side effects and that the result is invariant, as long as x is not modified in the loop.
Now I noticed that simply changing x from 0.5 to 1.0 disabled this optimization. Further tests indicate that the optimization is only enabled if abs(x) < asin(1/sqrt(2)). Is there a good reason for that which I don't see, or is that an unnecessary limitation on the optimizing conditions?
Edit: The optimization seems to be implemented in hotspot/src/share/vm/opto/subnode.cpp
I think your question is specifically about the Oracle JVM, because the implementation of Math is implementation-dependent. Here is a good answer about the Dalvik implementation, for example:
native code for Java Math class
Generally
sin(a) * sin(a) + cos(a) * cos(a) = 1
sin(pi/2 - a) = cos(a)
sin(-a) = -sin(a)
cos(-a) = cos(a)
so we don't need sin/cos implementations for x < 0 or x > pi/4.
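As an illustration of how those symmetries fold an argument into the accurate range (my own sketch, not the actual HotSpot code; it assumes |a| <= pi/2 so a single folding step suffices):

static double sinByFolding(double a) {
    if (a < 0)
        return -sinByFolding(-a);          // sin(-a) = -sin(a)
    if (a > Math.PI / 4)
        return Math.cos(Math.PI / 2 - a);  // sin(a) = cos(pi/2 - a)
    return Math.sin(a);                    // |a| <= pi/4, where fsin-style kernels are accurate
}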
I suppose this is the answer (http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=5005861):
We are aware of the almabench results and the osnews article on
trigonometric performance. However, the HotSpot implementation of
sin/cos on x86 for years has used and continues to use fsin/fcos x87
instructions in a range where those instructions meet the quality of
implementation requirements, basically [-pi/4, pi/4]. Outside of that
range, the results of fsin/fcos can be anywhere in the range [-1, 1]
with little relation to the true sine/cosine of the argument. For
example, fsin(Math.PI) only gets about half the digits of the result
correct. The reason for this is that the fsin/fcos instruction
implementations use a less than ideal algorithm for argument
reduction; the argument reduction process is explained in bug
4857011.
Conclusion: you have seen the results of the argument reduction algorithm in action, not a limitation of the optimization.

Which of these pieces of code is faster in Java?

a) for(int i = 100000; i > 0; i--) {}
b) for(int i = 1; i < 100001; i++) {}
The answer is there on this website (question 3). I just can't figure out why. From the website:
3. a
When you get down to the lowest level (machine code, but I'll use assembly since it mostly maps one-to-one), the difference between an empty loop decrementing to 0 and one incrementing to 50 (for example) is often along the lines of:
counting down:
    ld a,50
loop:
    dec a
    jnz loop

counting up:
    ld a,0
loop:
    inc a
    cmp a,50
    jnz loop
That's because the zero flag in most sane CPUs is set by the decrement instruction when it reaches zero. The same can't usually be said for the increment instruction when it reaches 50 (since there's nothing special about that value, unlike zero), so you need to compare the register with 50 to set the zero flag.
However, asking which of the two loops:
for(int i = 100000; i > 0; i--) {}
for(int i = 1; i < 100001; i++) {}
is faster (in pretty much any environment, Java or otherwise) is useless, since neither of them does anything useful. The fastest version of both those loops is no loop at all. I challenge anyone to come up with a faster version than that :-)
They'll only become useful when you start doing some useful work inside the braces and, at that point, the work will dictate which order you should use.
For example if you need to count from 1 to 100,000, you should use the second loop. That's because the advantage of counting down (if any) is likely to be swamped by the fact that you have to evaluate 100000-i inside the loop every time you need to use it. In assembly terms, that would be the difference between:
counting down (the ascending value must be recomputed):
    ld b,100000
    sub b,a
    dsw b

counting up (the counter is used directly):
    dsw a
(dsw is, of course, the infamous do something with assembler mnemonic).
Since you'll only be taking the hit for an incrementing loop once per iteration, and you'll be taking the hit for the subtraction at least once per iteration (assuming you'll be using i; otherwise there's little need for the loop at all), you should just go with the more natural version.
If you need to count up, count up. If you need to count down, count down.
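In Java terms, the point about re-deriving the ascending value looks like this (doSomethingWith is a hypothetical stand-in for the loop body):

for (int i = 100000; i > 0; i--) {
    doSomethingWith(100000 - i); // counting down: an extra subtraction every iteration
}

for (int i = 0; i < 100000; i++) {
    doSomethingWith(i);          // counting up: the value is the counter itself
}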
On many compilers, the machine instructions emitted for a loop going backwards are more efficient, because testing for zero (and therefore zeroing a register) is faster than a load immediate of a constant value.
On the other hand, a good optimising compiler should be able to inspect the loop body and determine that going backwards won't cause any side effects...
BTW, that is a terrible interview question in my opinion. Unless you are talking about a loop which runs 10 million times AND you have ascertained that the slight gain is not outweighed by many instances of recreating the forward loop value (n - i), any performance gain will be minimal.
As always, don't micro-optimise without performance benchmarking, and never at the expense of harder-to-understand code.
These kinds of questions are largely an irrelevant distraction that some people get obsessed with. Call it the Cult of Micro-optimization or whatever you like, but is it faster to loop up or down? Seriously? You use whichever is appropriate for what you're doing. You don't write your code around saving two clock cycles or whatever it is.
Let the compiler do what it's for, and make your intent clear (both to the compiler and to the reader). Another common Java pessimization is:
public final static String BLAH = new StringBuilder().append("This is a ").append(3).append(" test").toString();
because excessive concatenation does result in memory fragmentation, but for a constant the compiler can (and will) optimize this:
public final static String BLAH = "This is a " + 3 + " test";
where it won't optimize the first, and the second is easier to read.
And how about (a > b) ? a : b vs Math.max(a, b)? I know I'd rather read the second, so I don't really care that the first doesn't incur a function-call overhead.
There are a couple of useful things in this list, like knowing that a finally block isn't called on System.exit(), which is potentially useful. Knowing that dividing a float by 0.0 doesn't throw an exception is useful too.
But don't bother second-guessing the compiler unless it really matters (and I bet you that 99.99% of the time it doesn't).
A better question is:
Which is easier to understand/work with?
This is far more important than a notional difference in performance. Personally, I would point out that performance shouldn't be the criteria for determining the difference here. If they didn't like me challenging their assumption on this, I wouldn't be unhappy about not getting the job. ;)
On a modern Java implementation this is not true.
Summing up the numbers up to one billion as a benchmark:
Java(TM) SE Runtime Environment 1.6.0_05-b13
Java HotSpot(TM) Server VM 10.0-b19
up 1000000000: 1817ms 1.817ns/iteration (sum 499999999500000000)
up 1000000000: 1786ms 1.786ns/iteration (sum 499999999500000000)
up 1000000000: 1778ms 1.778ns/iteration (sum 499999999500000000)
up 1000000000: 1769ms 1.769ns/iteration (sum 499999999500000000)
up 1000000000: 1769ms 1.769ns/iteration (sum 499999999500000000)
up 1000000000: 1766ms 1.766ns/iteration (sum 499999999500000000)
up 1000000000: 1776ms 1.776ns/iteration (sum 499999999500000000)
up 1000000000: 1768ms 1.768ns/iteration (sum 499999999500000000)
up 1000000000: 1771ms 1.771ns/iteration (sum 499999999500000000)
up 1000000000: 1768ms 1.768ns/iteration (sum 499999999500000000)
down 1000000000: 1847ms 1.847ns/iteration (sum 499999999500000000)
down 1000000000: 1842ms 1.842ns/iteration (sum 499999999500000000)
down 1000000000: 1838ms 1.838ns/iteration (sum 499999999500000000)
down 1000000000: 1832ms 1.832ns/iteration (sum 499999999500000000)
down 1000000000: 1842ms 1.842ns/iteration (sum 499999999500000000)
down 1000000000: 1838ms 1.838ns/iteration (sum 499999999500000000)
down 1000000000: 1838ms 1.838ns/iteration (sum 499999999500000000)
down 1000000000: 1847ms 1.847ns/iteration (sum 499999999500000000)
down 1000000000: 1839ms 1.839ns/iteration (sum 499999999500000000)
down 1000000000: 1838ms 1.838ns/iteration (sum 499999999500000000)
Note that the time differences are brittle; small changes somewhere near the loops can turn them around.
Edit:
The benchmark loops are
long sum = 0;
for (int i = 0; i < limit; i++) {
    sum += i;
}
and
long sum = 0;
for (int i = limit - 1; i >= 0; i--) {
    sum += i;
}
Using a sum of type int is about three times faster, but then sum overflows.
With BigInteger it is more than 50 times slower:
BigInteger up 1000000000: 105943ms 105.943ns/iteration (sum 499999999500000000)
Typically real code will run faster counting upwards. There are a few reasons for this:
Processors are optimised for reading memory forwards.
HotSpot (and presumably other bytecode->native compilers) heavily optimise forward loops, but don't bother with backward loops because they happen so infrequently.
Upwards is usually more obvious, and cleaner code is often faster.
So happily doing the right thing will usually be faster. Unnecessary micro-optimisation is evil. I haven't purposefully written backward loops since programming 6502 assembler.
There are really only two ways to answer this question.
To tell you that it really, really doesn't matter, and you're wasting your time even wondering.
To tell you that the only way to know is to run a trustworthy benchmark on your actual production hardware, OS and JRE installation that you care about.
So, I made you a runnable benchmark you could use to try that out here:
http://code.google.com/p/caliper/source/browse/trunk/test/examples/LoopingBackwardsBenchmark.java
This Caliper framework is not really ready for prime time yet, so it may not be totally obvious what to do with it, but if you really care enough you can figure it out. Here are the results it gave on my Linux box:
     max  benchmark          ns
       2  Forwards            4
       2  Backwards           3
      20  Forwards            9
      20  Backwards          20
    2000  Forwards         1007
    2000  Backwards        1011
20000000  Forwards      9757363
20000000  Backwards    10303707
Does looping backwards look like a win to anyone?
Are you sure that the interviewer who asks such a question expects a straight answer ("number one is faster" or "number two is faster"), or whether the question is asked to provoke a discussion, as is happening in the answers people are giving here?
In general, it's impossible to say which one is faster, because it heavily depends on the Java compiler, JRE, CPU and other factors. Using one or the other in your program just because you think that one of the two is faster without understanding the details to the lowest level is superstitious programming. And even if one version is faster than the other on your particular environment, then the difference is most likely so small that it's irrelevant.
Write clear code instead of trying to be clever.
Such questions are rooted in old best-practice recommendations.
It's all about comparison: comparing to 0 is known to be faster. Years ago this might have been seen as quite important. Nowadays, especially with Java, I'd rather let the compiler and the VM do their job, and I'd focus on writing code that is easiest to maintain and understand.
Unless there are reasons for doing it otherwise. Remember that Java apps don't always run on HotSpot and/or fast hardware.
With regard to testing for zero in the JVM: it can apparently be done with ifeq, whereas testing for anything else requires if_icmpeq, which also involves putting an extra value on the stack.
Testing for > 0, as in the question, might be done with ifgt, whereas testing for < 100001 would need if_icmplt.
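As a rough javap-style sketch of what those two conditions can compile to (illustrative only, not actual javac output; details vary by compiler):

// for (int i = 100000; i > 0; i--) {}
    ldc 100000
    istore_1
loop:
    iload_1
    ifle done         // compare against zero: no constant to push
    iinc 1, -1
    goto loop
done:

// for (int i = 1; i < 100001; i++) {}
    iconst_1
    istore_1
loop:
    iload_1
    ldc 100001
    if_icmpge done    // push the constant, then a two-operand compare
    iinc 1, 1
    goto loop
done: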
This is about the dumbest question I have ever seen. The loop body is empty. If the compiler is any good, it will simply emit no code at all. The loop doesn't do anything, can't throw an exception, and doesn't modify anything outside of its scope.
Assuming your compiler isn't that smart, or that you actually didn't have an empty loop body:
The "backwards loop counter" argument makes sense for some assembly languages (it may make sense to the java byte code too, I don't know it specifically). However, the compiler will very often have the ability to transform your loop to use decrementing counters. Unless you have loop body in which the value of i is explicitly used, the compiler can do this transformation. So again you often see no difference.
I decided to bite and necro back the thread.
Both of the loops are ignored by the JVM as no-ops, so essentially even if one of the loops counted to 10 and the other to 10000000, there would have been no difference.
Looping down to zero is another thing (helped by the jne instruction, but again, it's not compiled like that); the linked site is plain weird (and wrong).
This type of question doesn't fit any JVM (nor any other compiler that can optimize).
The loops are identical, except for one critical part:
i > 0;
and
i < 100001;
The greater-than-zero check is done by checking the NZP bits (commonly known as condition codes, or the negative/zero/positive bits) of the processor.
The NZP bits are set whenever an operation such as a load, an AND, or an addition is performed.
A greater-than check against an arbitrary value cannot directly use these bits (and therefore takes a bit longer...). The general solution is to negate one of the values (by doing a bitwise NOT and then adding 1) and add it to the compared value. If the result is zero, they're equal; if positive, the second value (the one not negated) is greater; if negative, the first value (the negated one) is greater. This check takes slightly longer than the direct NZP check.
I'm not 100% certain that this is the reason behind it though, but it seems like a possible reason...
The answer is a (as you probably found out on the website)
I think the reason is that the i > 0 condition for terminating the loop is faster to test.
The bottom line is that for any non-performance-critical application, the difference is probably irrelevant. As others have pointed out, there are times when using ++i instead of i++ could be faster; however, especially in for loops, any modern compiler should optimize that distinction away.
That said, the difference probably has to do with the underlying instructions generated for the comparison. Testing whether a value equals 0 only requires a NOR across its bits, whereas testing whether a value equals an arbitrary constant requires loading that constant into a register and then comparing the two registers (which probably costs an extra gate delay or two). That said, with pipelining and modern ALUs, I'd be surprised if the distinction was significant to begin with.
I've been running tests for about 15 minutes now, with nothing running other than Eclipse (just in case), and I saw a real difference; you can try it out yourself.
First I timed how long Java takes to do "nothing", just to have a baseline; that took around 500 nanoseconds.
Then I tested how long it takes to run a for statement that counts up:
for(i=0;i<100;i++){}
Then five minutes later I tried the "backwards" one:
for(i=100;i>0;i--)
And I got a huge difference (at a tiny absolute level) of 16% between the two for statements, the latter being 16% faster.
Average time for running the "increasing" for statement during 2000 tests: 1838 n/s
Average time for running the "decreasing" for statement during 2000 tests: 1555 n/s
Code used for such tests:
public static void main(String[] args) {
    long time = 0;
    for (int j = 0; j < 100; j++) {
        long startTime = System.nanoTime();
        int i;
        /*for (i = 0; i < 100; i++) {
        }*/
        for (i = 100; i > 0; i--) {
        }
        long endTime = System.nanoTime();
        time += endTime - startTime;
    }
    time = time / 100;
    System.out.print("Time: " + time);
}
Conclusion:
The difference is basically nothing. It already takes a significant amount of "nothing" to do "nothing" relative to the for-statement tests, making the difference between them negligible; just importing a library such as java.util.Scanner takes far longer than running either for statement. This will not improve your application's performance significantly, but it's still really cool to know.

faster Math.exp() via JNI?

I need to calculate Math.exp() from Java very frequently. Is it possible to get a native version that runs faster than Java's Math.exp()?
I tried JNI + C, but it's slower than plain Java.
This has already been requested several times (see e.g. here). Here is an approximation to Math.exp(), copied from this blog posting:
public static double exp(double val) {
    final long tmp = (long) (1512775 * val + (1072693248 - 60801));
    return Double.longBitsToDouble(tmp << 32);
}
It is basically the same as a lookup table with 2048 entries and linear interpolation between the entries, but all of this with IEEE floating-point tricks. It's about 5 times faster than Math.exp() on my machine, but this can vary drastically if you compile with -server.
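One quick way to judge whether the inaccuracy is acceptable is to print the relative error against Math.exp over a few sample inputs (a throwaway check of my own):

for (double v = -5.0; v <= 5.0; v += 1.0) {
    double approx = exp(v);   // the approximation above
    double exact = Math.exp(v);
    System.out.printf("%6.2f  approx=%12.6f  exact=%12.6f  err=%6.3f%%%n",
            v, approx, exact, 100 * Math.abs(approx - exact) / exact);
}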
+1 to writing your own exp() implementation. That is, if this is really a bottleneck in your application. If you can deal with a little inaccuracy, there are a number of extremely efficient exponent approximation algorithms out there, some of them dating back centuries. As I understand it, Java's exp() implementation is fairly slow, even for an algorithm that must return "exact" results.
Oh, and don't be afraid to write that exp() implementation in pure Java. JNI has a lot of overhead, and the JVM is able to optimize bytecode at runtime, sometimes even beyond what C/C++ can achieve.
Use Java's.
Also, cache the results of exp; then you can look up the answer faster than calculating it again.
You'd want to wrap whatever loop's calling Math.exp() in C as well. Otherwise, the overhead of marshalling between Java and C will overwhelm any performance advantage.
You might be able to get it to run faster if you do them in batches. Making a JNI call adds overhead, so you don't want to do it for each exp() you need to calculate. I'd try passing an array of 100 values and getting the results to see if it helps performance.
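The Java side of such a batched call might look like this (class, method, and library names are hypothetical; the matching C implementation is not shown):

public final class NativeExp {
    static {
        System.loadLibrary("nativeexp"); // assumed native library name
    }
    // one JNI crossing per array instead of one per value
    public static native void expBatch(double[] in, double[] out);
}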
The real question is: has this become a bottleneck for you? Have you profiled your application and found this to be a major cause of slowdown? If not, I would recommend using Java's version. Try not to pre-optimize, as this will just slow development down. You may spend an extended amount of time on a problem that may not be a problem.
That being said, I think your test gave you your answer: if JNI + C is slower, use Java's version.
Commons Math3 ships with an optimized version: FastMath.exp(double x). It sped up my code significantly.
Fabien ran some tests and found it to be almost twice as fast as Math.exp():
0.75s for Math.exp sum=1.7182816693332244E7
0.40s for FastMath.exp sum=1.7182816693332244E7
Here is the javadoc:
Computes exp(x), function result is nearly rounded. It will be correctly rounded to the theoretical value for 99.9% of input values, otherwise it will have a 1 ULP error.
Method:
Lookup intVal = exp(int(x))
Lookup fracVal = exp(int(x-int(x) / 1024.0) * 1024.0 );
Compute z as the exponential of the remaining bits by a polynomial minus one
exp(x) = intVal * fracVal * (1 + z)
Accuracy: Calculation is done with 63 bits of precision, so result should be correctly rounded for 99.9% of input values, with less than 1 ULP error otherwise.
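Usage is essentially a drop-in replacement, assuming the commons-math3 artifact is on the classpath:

import org.apache.commons.math3.util.FastMath;

double y = FastMath.exp(x); // instead of Math.exp(x)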
Since the Java code will get compiled to native code with the just-in-time (JIT) compiler, there's really no reason to use JNI to call native code.
Also, you shouldn't cache the results of a method whose input parameters are floating-point real numbers. The gains in time will be very much outweighed by the amount of space used.
The problem with using JNI is the overhead involved in making the call to JNI. The Java virtual machine is pretty optimized these days, and calls to the built-in Math.exp() are automatically optimized to call straight through to the C exp() function, and they might even be optimized into straight x87 floating-point assembly instructions.
There's simply an overhead associated with using the JNI, see also:
http://java.sun.com/docs/books/performance/1st_edition/html/JPNativeCode.fm.html
So, as others have suggested, try to batch operations that would involve using the JNI.
Write your own, tailored to your needs.
For instance, if all your exponents are powers of two, you can use bit shifting. If you work with a limited range or set of values, you can use lookup tables. If you don't need pinpoint precision, use an imprecise, but faster, algorithm.
There is a cost associated with calling across the JNI boundary.
If you could move the loop that calls exp() into the native code as well, so that there is just one native call, then you might get better results, but I doubt it will be significantly faster than the pure Java solution.
I don't know the details of your application, but if you have a fairly limited set of possible arguments for the call, you could use a pre-computed look-up table to make your Java code faster.
There are faster algorithms for exp depending on what you're trying to accomplish. Is the problem space restricted to a certain range? Do you only need a certain resolution, precision, or accuracy?
If you define your problem very well, you may find that you can use a table with interpolation, for instance, which will blow nearly any other algorithm out of the water.
What constraints can you apply to exp to gain that performance trade-off?
-Adam
I run a fitting algorithm, and the minimum error of the fitting result is way larger than the precision of Math.exp().
Transcendental functions are always much slower than addition or multiplication and are a well-known bottleneck. If you know that your values are in a narrow range, you can simply build a lookup table (two sorted arrays: one for inputs, one for outputs). Use Arrays.binarySearch to find the correct index and interpolate with the elements at [index] and [index+1].
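A minimal sketch of that table-plus-interpolation idea (my own illustration; it assumes queries stay strictly inside [lo, hi], and its accuracy depends on the table density):

import java.util.Arrays;

final class ExpTable {
    private final double[] xs, ys;

    ExpTable(double lo, double hi, int n) {
        xs = new double[n];
        ys = new double[n];
        double step = (hi - lo) / (n - 1);
        for (int k = 0; k < n; k++) {
            xs[k] = lo + k * step;
            ys[k] = Math.exp(xs[k]); // precomputed once
        }
    }

    double exp(double x) {
        int i = Arrays.binarySearch(xs, x);
        if (i >= 0) return ys[i];    // exact table hit
        int hi = -i - 1;             // insertion point
        int lo = hi - 1;
        double t = (x - xs[lo]) / (xs[hi] - xs[lo]);
        return ys[lo] + t * (ys[hi] - ys[lo]); // linear interpolation
    }
}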
Another method is to split the number. Let's take, e.g., 3.81 and split it into 3 + 0.81.
Now you multiply e = 2.718 by itself three times and get 20.08.
Now to 0.81. All values between 0 and 1 converge fast with the well-known exponential series
1 + x + x^2/2 + x^3/6 + x^4/24 + ... etc.
Take as many terms as you need for precision; unfortunately, convergence is slower as x approaches 1. Let's say you go up to x^4; then you get 2.2445 instead of the correct 2.2448.
Then multiply the result 2.718^3 = 20.08 by 2.718^0.81 = 2.2445, and you have the result 45.07, with an error of about one part in two thousand (correct: 45.15).
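That splitting idea as a sketch (my own code; it assumes x >= 0 and uses the four-term series truncation described above):

static double expSplit(double x) {
    int n = (int) x;       // integer part (x >= 0 assumed)
    double f = x - n;      // fractional part, in [0, 1)
    double intPart = 1.0;
    for (int k = 0; k < n; k++)
        intPart *= Math.E; // e^n by repeated multiplication
    // 1 + f + f^2/2 + f^3/6 + f^4/24, evaluated in Horner form
    double fracPart = 1 + f * (1 + f / 2 * (1 + f / 3 * (1 + f / 4)));
    return intPart * fracPart;
}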
It might not be relevant any more, but just so you know, in the newest releases of OpenJDK (see here), Math.exp is being made an intrinsic (if you don't know what that is, check here).
This will make its performance unbeatable on most architectures, because it means the HotSpot VM will replace the call to Math.exp with a processor-specific implementation of exp at runtime. You can never beat such calls, as they are optimized for the architecture...
