Bit manipulation performance: masking or shifting

Bit manipulation performance: masking or shifting - java

I'm not really familiar with low-level (hardware-close) specifics (forgot much).
My app needs to perform millions (or even more) bit manipulation operations in very short time periods, so performance matters.
I need to check if a certain section (consisting of 4, 5 or 6 bits) of an int value is equal to a specified value.
I can solve this either by using an int as a complete mask; or by using bit shift(s) (to get rid of the disturbing sections) and then doing a direct compare (==). Do these have equal performance? Which is faster?

Generally speaking ((a & b )== c) ought to be very fast, and faster than the same operation
with an extra shift. ((a>>n)&b)==c)
It's likely that other optimization techniques, such as loop unrollong, will be a lot more effective than trying to guess what shift and mask operations are the fastest.
If you really care about performance at that level, the answer is to benchmark all the likely variations in the actual deployment environment.

Related

Java's BigInteger implementation

I'm new here so please excuse my noob mistakes. I'm currently working on a little project of mine that sees me dealing with digits with a length in the forty thousands and beyond.
I'm currently using BigInteger to handle these values, and I need something that performs faster. I've read that BigInteger uses an array of integers in its implementation, and what I need to know is whether BigInteger is using each index in this array to represent each decimal point, as in 1 - 9, or is it using something more efficient.
I ask this because I already have an implementation in mind that uses bit operations, which makes it more efficient, memory and processing wise.
So the final question is - is BigInteger already efficient enough, and should I just rely on that? It would better to know this rather than putting it to the test unnecessarily, which would take a lot of time.
Thank you.

At least with Oracle's Java 8 and OpenJDK 8, it doesn't store one decimal digit per int. It stores full 32-bit portions per 32-bit int in the int[], which can be seen with its source code.
Bit operations are fast for it, since it's a sign-magnitude value and the magnitude is stored packed just as you'd expect, just make sure that you use the relevant BigInteger bitwise methods rather than implementing your own.
If you still need more speed, try something like GMP, though be aware that it uses a LGPL or GPL license. It would also be better to use it outside of Java.

Why must Java booleans be at least 1 byte in size?

It is known that C++ bools must be at least 1 byte in size so that pointers can be created for each [https://stackoverflow.com/a/2064565/7154924]. But there are no pointers to primitive types in Java. Yet, they still take up at least 1 byte [https://stackoverflow.com/a/383597/7154924].
Why is this the case - why can't Java booleans be 1 bit in size? Computation time aside, if one has a large boolean array, surely one could conceive a compiler that does the appropriate shifting to retrieve the individual bit corresponding to a boolean value?

There is no reason why a boolean must be one byte in size. In fact, is likely that booleans already aren't 1 byte in (effective) size in some scenarios: when packed next to other elements larger than 1 byte on the stack or in an object, they are likely to be larger (i.e., adding them to an object may cause the size to grow by more than one byte).
Any JVM is free to implement booleans as 1 bit, but as far as know none choose to do so, probably largely because:
Accessing one bit is often more expensive than accessing a byte, particularly when writing.
To read a bit, a CPU using a "classic RISC" instruction set would often need need an additional and instruction to extract the relevant bit out of a packed byte (or larger word) of boolean bits. Some might even need a additional instruction to load a constant to and. In the case of indexing an array of boolean, where the bit-index isn't fixed at compile-time, you'd need a variable shift. Some CPUs such as x86 have an easier time since they have memory sourcetestinstructions, including specific bit-test instructions taking a variable position such asbt`. Such a CPU probably has similar read performance in both representations.
Writing is worse: rather than a simple byte write to set a boolean value you now need to read the value, modify the appropriate bit and write it back. Some platforms such as x86 have memory source-and-destination RMW instructions such as and and or that will help, but these are still significantly more expensive than plain writes. In the worst case, repeatedly writing the same element will result in a dependency chain through memory that could slow your code down by an order of magnitude (a series of plain stores can't form a dependency chain).
Even worse, the write method above is totally thread-unsafe. Two threads working on "independent" booleans might clobber each other, so the runtime would have to use atomic update operations just to write a bit for any field where the object cannot be proven local to the thread.
The space savings outside of arrays is usually very small, and is often zero: alignment concerns mean that a single bit will often end up taking the same space as a byte on the stack or in the layout for an object. Only if you had many primitive boolean values on the stack or an object would you see a savings (for example, objects are typically aligned to 8-byte boundaries, so if you have an object whose non-boolean fields are int or larger, you'd need at least 4 boolean values to save any space, and often you'd need 8).
This leaves the last remaining "big win" for bit-representation boolean in arrays of boolean, where you could have an asymptotic 8x space savings for large arrays. In fact, this case was motivating enough in the C++ world that vector<bool> there has a "special" implementation where each bool takes one bit - a never ending source of headaches due to all the required special cases and non-intuitive behavior (and often used as an example of a mis-feature that can't be removed now).
If it weren't for the memory model I could imagine a world where Java
implemented arrays of boolean in a bit-wise manner. They don't have
the same issues as vector<bool> (mostly because of the extra layer of abtraction provided by the JIT and also because an array provides a simpler interface than vector) and it could be done efficiently, I think. There is that pesky memory model though. That model allows writes to different array elements to be safe if done by different threads (i.e,. they act as independent variables for the purpose of the memory model). All common CPUs support this directly if you implement boolean as a byte, since they have independent byte accesses. No CPUs offer independent bit-access though: you are stuck using atomic operations (x86 offers the lock bt* operations, but these are slow: other platforms have even worse options). That would destroy the performance of any boolean array implemented as a bit-array.
Finally, as described above, implementing boolean as a bit has significant downsides - but what about the upside?
As it turns out, if the user really wants this bit-packed representation of for boolean they can do so themselves! They can pack 8 boolean values into a byte (or 32 values into an int or whatever) in an object (and this is common for flags, etc) and the generated accessor code should be about efficient as efficient as if the JVM natively supported boolean-as-bit. In fact, when you know you want an array-of-bits representation for a large number of booleans, you can simply use BitSet - this has the representation you want and sidesteps the atomic issues by not offering any thread-safety guarantees. So by implementing boolean as a byte, you sidestep all the problems above, but still let the user "opt-in" to bit-level representation if they want, without much runtime penalty.

How to deal with underflow in scientific computing?

I am working on probabilistic models, and when doing inference on those models, the estimated probabilities can become very small. In order to avoid underflow, I am currently working in the log domain (I store the log of the probabilities). Multiplying probabilities is equivalent to an addition, and summing is done by using the formula:
log(exp(a) + exp(b)) = log(exp(a - m) + exp(b - m)) + m
where m = max(a, b).
I use some very large matrices, and I have to take the element-wise exponential of those matrices to compute matrix-vector multiplications. This step is quite expensive, and I was wondering if there exist other methods to deal with underflow, when working with probabilities.
Edit: for efficiency reasons, I am looking for a solution using primitive types and not objects storing arbitrary-precision representation of real numbers.
Edit 2: I am looking for a faster solution than the log domain trick, not a more accurate solution. I am happy with the accuracy I currently get, but I need a faster method. Particularly, summations happen during matrix-vector multiplications, and I would like to be able to use efficient BLAS methods.
Solution: after a discussion with Jonathan Dursi, I decided to factorize each matrix and vector by its largest element, and to store that factor in the log domain. Multiplications are straightforward. Before additions, I have to factorize one of the added matrices/vectors by the ratio of the two factors. I update the factor every ten operations.

This issue has come up recently on the computational science stack exchange site as well, and although there the immediate worry there was overflow, the issues are more or less the same.
Transforming into log space is certainly one reasonable approach. Whatever space you're in, to do a large number of sums correctly, there's a couple of methods you can use to improve the accuracy of your summations. Compensated summation approaches, most famously Kahan summation, keep both a sum and what's effectively a "remainder"; it gives you some of the advantages of using higher precision arithmeitic without all of the cost (and only using primitive types). The remainder term also gives you some indication of how well you're doing.
In addition to improving the actual mechanics of your addition, changing the order of how you add your terms can make a big difference. Sorting your terms so that you're summing from smallest to largest can help, as then you're no longer adding terms as frequently that are very different (which can cause significant roundoff problems); in some cases, doing log2 N repeated pairwise sums can also be an improvement over just doing the straight linear sum, depending on what your terms look like.
The usefullness of all these approaches depend a lot on the properties of your data. The arbitrary precision math libraries, while enormously expensive in compute time (and possibly memory) to use, have the advantage of being a fairly general solution.

I ran into a similar problem years ago. The solution was to develop an approximation of log(1+exp(-x)). The range of the approximation does not need to be all that large (x from 0 to 40 will more than suffice), and at least in my case the accuracy didn't need to be particularly high, either.
In your case, it looks like you need to compute log(1+exp(-x1)+exp(-x2)+...). Throw out those large negative values. For example, suppose a, b, and c are three log probabilities, with 0>a>b>c. You can ignore c if a-c>38. It's not going to contribute to your joint log probability at all, at least not if you are working with doubles.

Option 1: Commons Math - The Apache Commons Mathematics Library
Commons Math is a library of lightweight, self-contained mathematics and statistics components addressing the most common problems not
available in the Java programming language or Commons Lang.
Note: The API protects the constructors to force a factory pattern while naming the factory DfpField (rather than the somewhat more intuitive DfpFac or DfpFactory). So you have to use
new DfpField(numberOfDigits).newDfp(myNormalNumber)
to instantiate a Dfp, then you can call .multiply or whatever on this. I thought I'd mention this because it's a bit confusing.
Option 2: GNU Scientific Library or Boost C++ Libraries.
In these cases you should use JNI in order to call these native libraries.
Option 3: If you are free to use other programs and/or languages, you could consider using programs/languages for numerical computations such as Octave, Scilab, and similar.
Option 4: BigDecimal of Java.

Rather than storing values in logarithmic form, I think you'd probably be better off using the same concept as doubles, namely, floating-point representation. For example, you might store each value as two longs, one for sign-and-mantissa and one for the exponent. (Real floating-point has a carefully tuned design to support lots of edge cases and avoid wasting a single bit; but you probably don't need to worry so much about any of those, and can focus on designing it in a way that's simple to implement.)

I don't understand why this works, but this formula seems to work and is simpler:
c = a + log(1 + exp(b - a))
Where c = log(exp(a)+exp(b))

Operation on different data types

Considering the basic data types like char, int, float, double etc..in any standard language C/C++, Java etc
Is there anything like.."operating on integers are faster than operating on characters".. by operating I mean assignment, arithmetic op/ comparison etc.
Are data types slower than one another?

For almost anything you're doing this has almost no effect, but purely for informational purposes, it is usually fastest to work with data types whose size is machine word size (i.e. 32 bits on x86 and 64-bits on amd64). Additionally, SSE/MMX instructions give you benefits as well if you can group these and work on them at the same time

Rules for this are a bit like rules for English spelling and/or grammar. The rules are broken at least as often as they're followed.
Just for example, for years "everybody has known" that floating point operations are slower than integers, especially for more complex operations like multiply and divide. In reality, some processors do some integer operations (especially multiplication and division) by converting the operands to floating point, doing the operation in floating point, then converting the result back to an integer. As you'd expect from that, the floating point operation is actually faster (though only a little bit).
Most of the time, however, it doesn't matter much -- in a lot of cases, it's quite reasonable to think of the operations on the processor itself as free, and concern yourself primarily with optimizing your use of bandwidth to memory. Of course, doing that well is often even harder...

yes , some data types are definitely slower than others. For example , floats are more complicated than int's and thus may incur additional penalties when doing divides and multiplies. It all depends on how your hardware is setup and what kind of instructions it supports.
Data types which is longer than the machine word size will also be slower because it takes more cycles to perform operations.

depending on what you do, the difference can be quite large, especially when working with floats versus double versus long double.
In modern processors it comes down to simd instructions, which have certain length, most commonly 128 bit. so four float versus two double numbers.
However some processors only have 32 bit simd instructions(PPC) and GPU hardware has a factor of eight performance difference between float and double.
when you add trigonometric , exponential, and square root functions into the mix, float numbers are going to have better performance overall given number of factors.

Almost all of the answers on this page are mostly right. The answer, however, varies wildly depending upon your hardware, language, compiler, and VM (in managed languages like Java). On most CPUs, your best performance will be to do the operations on a data type that fits the native operand size of your CPU. In some cases, some compilers will optimize this for you, however.
On most modern desktop CPUs the difference between floating point and integer operations has become pretty trivial. However, on older hardware and a lot of embedded systems the difference in all of these factors can still be really, really big.
The important thing is to know the specifics of your target architecture and your tools.

This answer relates to the Java case (only).
The literal answer is that the relative speed of the primitive types and operators depends on your processor hardware and your JVM implementation.
But a better answer is that it usually doesn't make a a lot of difference to performance what representations you use. Indeed, any clever data type optimizations you do to make your code run fast on your current machine / JVM may turn out to be anti-optimizations on a different machine / JVM combination.
In general, it is better to pick a data type that represents your data in a correct and natural way, and leave it to the compiler to sort out the details. However, if you are creating large arrays of a primitive type, it is worth knowing that Java uses compact representations for arrays of boolean, byte and short.

faster Math.exp() via JNI?

I need to calculate Math.exp() from java very frequently, is it possible to get a native version to run faster than java's Math.exp()??
I tried just jni + C, but it's slower than just plain java.

This has already been requested several times (see e.g. here). Here is an approximation to Math.exp(), copied from this blog posting:
public static double exp(double val) {
final long tmp = (long) (1512775 * val + (1072693248 - 60801));
return Double.longBitsToDouble(tmp << 32);
}
It is basically the same as a lookup table with 2048 entries and linear interpolation between the entries, but all this with IEEE floating point tricks. Its 5 times faster than Math.exp() on my machine, but this can vary drastically if you compile with -server.

+1 to writing your own exp() implementation. That is, if this is really a bottle-neck in your application. If you can deal with a little inaccuracy, there are a number of extremely efficient exponent estimation algorithms out there, some of them dating back centuries. As I understand it, Java's exp() implementation is fairly slow, even for algorithms which must return "exact" results.
Oh, and don't be afraid to write that exp() implementation in pure-Java. JNI has a lot of overhead, and the JVM is able to optimize bytecode at runtime sometimes even beyond what C/C++ is able to achieve.

Use Java's.
Also, cache results of the exp and then you can look up the answer faster than calculating them again.

You'd want to wrap whatever loop's calling Math.exp() in C as well. Otherwise, the overhead of marshalling between Java and C will overwhelm any performance advantage.

You might be able to get it to run faster if you do them in batches. Making a JNI call adds overhead, so you don't want to do it for each exp() you need to calculate. I'd try passing an array of 100 values and getting the results to see if it helps performance.

The real question is, has this become a bottle neck for you? Have you profiled your application and found this to be a major cause of slow down? If not, I would recommend using Java's version. Try not to pre-optimize as this will just cause development slow down. You may spend an extended amount of time on a problem that may not be a problem.
That being said, I think your test gave you your answer. If jni + C is slower, use java's version.

Commons Math3 ships with an optimized version: FastMath.exp(double x). It did speed up my code significantly.
Fabien ran some tests and found out that it was almost twice as fast as Math.exp():
0.75s for Math.exp sum=1.7182816693332244E7
0.40s for FastMath.exp sum=1.7182816693332244E7
Here is the javadoc:
Computes exp(x), function result is nearly rounded. It will be correctly rounded to the theoretical value for 99.9% of input values, otherwise it will have a 1 UPL error.
Method:
Lookup intVal = exp(int(x))
Lookup fracVal = exp(int(x-int(x) / 1024.0) * 1024.0 );
Compute z as the exponential of the remaining bits by a polynomial minus one
exp(x) = intVal * fracVal * (1 + z)
Accuracy: Calculation is done with 63 bits of precision, so result should be correctly rounded for 99.9% of input values, with less than 1 ULP error otherwise.

Since the Java code will get compiled to native code with the just-in-time (JIT) compiler, there's really no reason to use JNI to call native code.
Also, you shouldn't cache the results of a method where the input parameters are floating-point real numbers. The gains obtained in time will be very much lost in amount of space used.

The problem with using JNI is the overhead involved in making the call to JNI. The Java virtual machine is pretty optimized these days, and calls to the built-in Math.exp() are automatically optimized to call straight through to the C exp() function, and they might even be optimized into straight x87 floating-point assembly instructions.

There's simply an overhead associated with using the JNI, see also:
http://java.sun.com/docs/books/performance/1st_edition/html/JPNativeCode.fm.html
So as others have suggested try to collate operations that would involve using the JNI.

Write your own, tailored to your needs.
For instance, if all your exponents are of the power of two, you can use bit-shifting. If you work with a limited range or set of values, you can use look-up tables. If you don't need pin-point precision, you use an imprecise, but faster, algorithm.

There is a cost associated with calling across the JNI boundary.
If you could move the loop that calls exp() into the native code as well, so that there is just one native call, then you might get better results, but I doubt it will be significantly faster than the pure Java solution.
I don't know the details of your application, but if you have a fairly limited set of possible arguments for the call, you could use a pre-computed look-up table to make your Java code faster.

There are faster algorithms for exp depending on what your'e trying to accomplish. Is the problem space restricted to a certain range, do you only need a certain resolution, precision, or accuracy, etc.
If you define your problem very well, you may find that you can use a table with interpolation, for instance, which will blow nearly any other algorithm out of the water.
What constraints can you apply to exp to gain that performance trade-off?
-Adam

I run a fitting algorithm and the minimum error of the fitting result is way larger
than the precision of the Math.exp().
Transcendental functions are always much more slower than addition or multiplication and a well-known bottleneck. If you know that your values are in a narrow range, you can simply build a lookup-table (Two sorted array ; one input, one output). Use Arrays.binarySearch to find the correct index and interpolate value with the elements at [index] and [index+1].
Another method is to split the number. Lets take e.g. 3.81 and split that in 3+0.81.
Now you multiply e = 2.718 three times and get 20.08.
Now to 0.81. All values between 0 and 1 converge fast with the well-known exponential series
1+x+x^2/2+x^3/6+x^4/24.... etc.
Take as much terms as you need for precision; unfortunately it's slower if x approaches 1. Lets say you go to x^4, then you get 2.2445 instead of the correct 2.2448
Then multiply the result 2.781^3 = 20.08 with 2.781^0.81 = 2.2445 and you have the result
45.07 with an error of one part of two thousand (correct: 45.15).

It might not be relevant any more, but just so you know, in the newest releases of the OpenJDK (see here), Math.exp should be made an intrinsic (if you don't know what that is, check here).
This will make performance unbeatable on most architectures, because it means the Hotspot VM will replace the call to Math.exp by a processor-specific implementation of exp at runtime. You can never beat these calls, as they are optimized for the architecture...

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.