As far as I know, the modulo operator % is a fairly costly operation, backed by a division underneath, which is among the slowest operations a CPU performs.
Is it worth substituting this operation explicitly with its bitwise analogue, number & (divisor - 1), or can the JIT do this for us implicitly?
As far as I'm aware, the JIT doesn't optimize such expressions, so:
number % divisor is not faster (it is slower or, at best, the same speed) than number & (divisor - 1) when divisor is a constant (so divisor - 1 can be computed at compile time).
It is difficult to say how big the difference will be, because on a modern CPU it depends on the surrounding code, cache state, and many other factors.
PS: Keep in mind that this optimization works only if divisor is a power of 2.
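For illustration, a minimal sketch of the equivalence (the values here are made up; the identity only holds for a power-of-two divisor and a non-negative dividend):

int number = 1234;
int divisor = 16;                        // must be a power of two
int viaModulo = number % divisor;        // 2
int viaMask = number & (divisor - 1);    // 2, identical for non-negative number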
Related
I know that we can optimise "find even numbers" code by using the bitwise operator &. The following program:
if(i%2==0) sout("even")
else sout("odd")
can be optimised to:
if(i&1==0) sout("even")
else sout("odd")
The above approach works only for 2 as a divisor. What if we have to optimise the code for other divisors like 4, 9, 20, 56 and so on? Is there a way to optimise this code further?
You obviously didn't even try what you posted, because it doesn't compile (even with a reasonable sout added). First, expression statements in Java end in a semicolon; second, i&1==0 parses as i & (1==0), i.e. i & true, and the & operator doesn't accept an int and a boolean.
If i is negative and odd, i%2 is -1 while i&1 is +1. That's because % is a remainder, not a modulo.
In the limited cases where i%n and (i&(n-1)) are the same -- i nonnegative and n a power of two -- the Java runtime compiler (JIT) will, as the commenters said, actually produce the same code for both, and obfuscating the source will only make your program more likely to be or become wrong, without providing any benefit.
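For the record, a version of the even/odd test that compiles and handles negative values correctly (a sketch, with sout replaced by System.out.println):

if ((i & 1) == 0) {
    System.out.println("even");   // works for negative i too: the low bit of any even int is 0
} else {
    System.out.println("odd");
}

Testing (i % 2 == 0) is equally correct as an even test, since even negatives also yield 0, and the JIT emits the same code for both.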
Fifty years ago, when people were writing in assembler for machines with microsecond clocks (i.e. not even megahertz, much less gigahertz), this sometimes helped -- only sometimes, because usually only a small fraction of the code matters to execution time. In this century it's at best a waste and often harmful.
As part of a Monte Carlo simulation, I have to roll a group of dice until certain values show up a certain number of times. My code that does this calls upon a dice class, which generates a random number between 1 and 6 and returns it. Originally the code looked like
public void roll() {
value = (int)(Math.random()*6) + 1;
}
and it wasn't very fast. By exchanging Math.random() for
ThreadLocalRandom.current().nextInt(1, 7);
a section that called this method roughly 250 million times ran in about 60% of the original time.
As part of the full simulation it will call upon this method billions of times at the very least, so is there any faster way to do this?
Pick a random generator that is as fast and as good as you need it to be, and that isn't slowed down to a tiny fraction of its normal speed by thread-safety mechanisms. Then pick a method of generating the [1..6] integer distribution that is as fast and as precise as you need it to be.
The fastest simple generator that is of sufficiently high quality to pass standard test suites for PRNGs like TestU01 (instead of failing systematically, like the Mersenne Twister) is Sebastiano Vigna's xorshift64*. I'm showing it as C code, but Sebastiano has it in Java as well:
#include <stdint.h>

/* xorshift64* step: updates the state in place and returns the next value.
   The state must be seeded to a nonzero value. */
uint64_t xorshift64s(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x >> 12;
    x ^= x << 25;
    x ^= x >> 27;
    *state = x;
    return x * 2685821657736338717ULL;
}
Sebastiano Vigna's site has lots of useful info, links and benchmark results, including papers for the mathematically inclined.
At that high resolution you can simply use 1 + xorshift64s(state) % 6, and the bias will be immeasurably small. If that is not fast enough, replace the modulo division with multiplication by the inverse. If that is not fast enough - if you cannot afford two MULs per variate - then it gets tricky and you need to come back here. xorshift1024* (Java) plus some bit trickery for the variate would be an option.
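As a side note for a Java port: long is signed, so a literal translation of that expression can go negative. A hedged sketch using Long.remainderUnsigned, assuming xorshift64s() is a Java port of the generator above that keeps its state internally:

long r = xorshift64s();                             // full 64-bit output
int roll = 1 + (int) Long.remainderUnsigned(r, 6);  // 1..6, bias on the order of 2^-64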
Batching - generating an array full of numbers and processing that, then refilling the array and so on - can unlock some speed reserves. Needlessly wrapping things in classes achieves the opposite.
P.S.: if ThreadLocalRandom and xorshift* are not fast enough for your purposes even with batching then you might be going about things in the wrong way, or you might be doing it in the wrong language. Or both.
P.P.S.: in languages like Java (or C#, or Delphi), abstraction is not free, it has a cost. In Java you also have to reckon with things like mandatory gratuitous array bounds checking, unless you have a compiler that can eliminate those checks. Teasing high performance out of a Java program can get very involved... In C++ you get abstraction and performance for free.
Darth is correct that Xorshift* is probably the best generator to use. Use it to fill a ring buffer of bytes, then fetch the bytes one at a time to roll your dice, refilling the buffer when you've fetched enough. To get the actual die roll, avoid division and bias by using rejection sampling. The rest of the code then looks something like this (in C):
do {
    if (bp >= buffer + sizeof buffer) {
        /* refill buffer with Xorshifts, then reset the cursor: */
        bp = buffer;
    }
    v = *bp++ & 7;          /* low 3 bits of the next byte: 0..7 */
} while (v > 5);            /* reject 6 and 7 to keep the distribution unbiased */
return v + 1;               /* map 0..5 to a die roll 1..6 */
This will allow you to get on average 6 die rolls per 64-bit random value.
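Put together in Java, the whole scheme might look like this (a sketch under the assumptions above, not the answerer's code; the class and method names are made up):

import java.util.concurrent.ThreadLocalRandom;

final class DiceSource {
    private long state = ThreadLocalRandom.current().nextLong() | 1L; // nonzero seed
    private final byte[] buffer = new byte[4096];
    private int pos = buffer.length;                                  // forces an initial refill

    private long nextXorshift64s() {
        long x = state;
        x ^= x >>> 12;                   // >>> is the unsigned shift Java needs here
        x ^= x << 25;
        x ^= x >>> 27;
        state = x;
        return x * 2685821657736338717L;
    }

    private void refill() {
        for (int i = 0; i < buffer.length; i += 8) {
            long r = nextXorshift64s();  // one 64-bit value yields 8 buffered bytes
            for (int j = 0; j < 8; j++) {
                buffer[i + j] = (byte) (r >>> (8 * j));
            }
        }
        pos = 0;
    }

    int roll() {                         // returns a die roll in 1..6
        int v;
        do {
            if (pos == buffer.length) {
                refill();
            }
            v = buffer[pos++] & 7;       // low 3 bits: 0..7
        } while (v > 5);                 // reject 6 and 7 to avoid bias
        return v + 1;
    }
}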
I want to run a particle simulation with periodic boundary conditions - for simplicity, let's assume a 1D simulation with a region of length 1.0. I could enforce these conditions using the following short snippet:
if (x > 1.0)
x -= 1.0;
if (x < 0.0)
x += 1.0;
but it feels "clumsy" (1) - especially when generalizing this to higher dimensions. I tried doing something like
x = x % 1.0;
which takes good care of the case x > 1.0 but doesn't do what I want for x < 0.0 (2). A few examples of the output of the "modulus" version and the "manual" version to show the difference:
Value: 1.896440, mod: 0.896440, manual: 0.896440
Value: -0.449115, mod: -0.449115, manual: 0.550885
Value: 1.355568, mod: 0.355568, manual: 0.355568
Value: -0.421918, mod: -0.421918, manual: 0.578082
(1) For my current application, the "clumsy" way is probably good enough. But in my pursuit of becoming a better programmer, I'd still like to learn if there's a "better" (or at least better-looking...) way to do this.
(2) Yes, I've read this post, so I know why it doesn't do what I want, and that it's not supposed to. My question is not about this, but rather about what to do instead.
You can use % with this slight modification: x = (x + 1.0) % 1.0. (Note this assumes x never drops below -1.0, which holds if positions move by less than the region length per step.)
The best approach is probably to subtract the floor of the value from the value itself. Computing a floating-point remainder in compliance with IEEE standards is rather complicated, and unless one is on a system which can detect and accelerate "easy cases", especially those where the divisor is a power of two, a pair of if statements is apt to be faster.
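For the unit-length region that is a one-liner (a minimal sketch in Java):

x -= Math.floor(x);                 // wraps any finite x into [0.0, 1.0)

and for a region of length L it generalizes to x -= L * Math.floor(x / L);.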
It might be interesting, though, to consider why fmod was designed the annoying way it was: if fmod were designed to return a value between 0 and the divisor, then the precision of the result when the dividend is a very small positive number would be much better than the precision when the dividend is a very small negative number (where the precision would be limited to that of the divisor). The advantages of having fmod's precision be relatively symmetric about zero probably outweigh the advantages of having the results be non-negative, but that doesn't imply the IEEE approach is the only good way to design a range-limiting function.
An alternative approach, which would combine the advantages of the IEEE approach and the zero-to-divisor approach, would be to specify that a mod function must yield a result whose numerical value is (for d positive) less than d/2, but no less than -d/2. Such a definition would always yield a result representable in the source operands' type (if d is a very small value such that d/2 is not precisely representable, the range of the modulus would be symmetrical). Unfortunately, I know of no library mod functions that work this way.
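A hedged Java sketch of such a reduction (the helper name is made up; Java's Math.IEEEremainder behaves very similarly, differing only in how ties at exactly +/- d/2 are resolved):

// Result has magnitude at most d/2; rint rounds the quotient to the nearest integer.
static double symmetricMod(double x, double d) {
    return x - d * Math.rint(x / d);
}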
I'm not really familiar with low-level (close-to-hardware) specifics (I've forgotten much of it).
My app needs to perform millions (or even more) bit manipulation operations in very short time periods, so performance matters.
I need to check if a certain section (consisting of 4, 5 or 6 bits) of an int value is equal to a specified value.
I can solve this either by using an int as a full-width mask, or by using bit shifts (to get rid of the irrelevant sections) and then doing a direct compare (==). Do these have equal performance? Which is faster?
Generally speaking, ((a & b) == c) ought to be very fast, and faster than the same operation with an extra shift: (((a >> n) & b) == c).
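For example, to test whether bits 8..11 of value hold the pattern 0xA, the shift can be folded into the constants (a sketch with a made-up field layout):

boolean withShift = ((value >> 8) & 0xF) == 0xA;    // shift, mask, compare
boolean maskOnly  = (value & 0x0F00) == 0x0A00;     // mask and compare; shift folded into the constants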
It's likely that other optimization techniques, such as loop unrolling, will be a lot more effective than trying to guess which shift and mask operations are fastest.
If you really care about performance at that level, the answer is to benchmark all the likely variations in the actual deployment environment.
I have a situation where I might need to apply a multiplier to a value in order to get the correct results. This involves computing the value using floating point division.
I'm thinking it would be a good idea to check the values before I perform floating-point logic on them to save processor time; however, I'm not sure how efficient it will be at run time either way.
I'm assuming that the if check is 1 or 2 instructions (been a while since assembly class), and that the floating point operation is going to be many more than that.
//Check
if (a != 10) { //1 or 2 instructions?
b *= (float) a / 10; //Many instructions?
}
Value a is going to be 10 most of the time; however, there are a few instances where it won't be. Is the floating-point division going to take very many cycles even if a is equal to the divisor?
Will the previous code with the if statement execute more efficiently than simply the next one without?
//Don't check
b *= (float) a / 10; //Many instructions?
Granted, there won't be any noticeable difference either way; however, I'm curious about the behavior of the floating-point division when the divisor is equal to the dividend, in case things get processor heavy.
Assuming this is in some incredibly tight loop, executed billions of times, where the difference of 1-2 instructions matters (since otherwise you should probably not bother):
Yes, you are right to weigh the cost of the additional check each time against the savings when the check is true. But my guess is that it has to be true a lot to overcome not only the extra overhead, but also the fact that you're introducing a branch, which can do more to slow you down via pipeline stalls in the JIT-compiled code than you'll gain otherwise.
If a == 10 a whole lot, I'd imagine there's a better and faster way to take advantage of that somehow, earlier in the code.
IIRC, floating-point multiplication is much less expensive than division, so this might be faster than both:
b *= (a * 0.1);
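For completeness, the three alternatives under discussion side by side (an illustrative sketch, not a benchmark; the method names are made up):

static float scaledGuarded(float b, int a) {
    if (a != 10) {                     // branch; cheap when well predicted
        b *= (float) a / 10;
    }
    return b;
}

static float scaledPlain(float b, int a) {
    return b * ((float) a / 10);       // division every time; a == 10 just multiplies by 1.0f
}

static float scaledReciprocal(float b, int a) {
    return b * (a * 0.1f);             // multiply by the reciprocal; 0.1f is inexact, so the last bit may differ
}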
If you do end up needing to optimize this code, I would recommend using Caliper to do micro-benchmarks of your inner loops. It's very hard to predict accurately what effect these small modifications will have, especially in Java, where how the VM behaves is a bit of an unknown, since in theory it can optimize the code on the fly. Best to try several strategies and see what works.
http://code.google.com/p/caliper/