While reading through Java 8's Integer class, I came upon the following FIX-ME (line 379):
// TODO-FIXME: convert (x * 52429) into the equiv shift-add
// sequence.
The entire comment reads:
// I use the "[invariant division by multiplication][2]" trick to
// accelerate Integer.toString. In particular we want to
// avoid division by 10.
//
// The "trick" has roughly the same performance characteristics
// as the "classic" Integer.toString code on a non-JIT VM.
// The trick avoids .rem and .div calls but has a longer code
// path and is thus dominated by dispatch overhead. In the
// JIT case the dispatch overhead doesn't exist and the
// "trick" is considerably faster than the classic code.
//
// TODO-FIXME: convert (x * 52429) into the equiv shift-add
// sequence.
//
// RE: Division by Invariant Integers using Multiplication
// T Gralund, P Montgomery
// ACM PLDI 1994
//
I cannot imagine that I should be worried about this, as it has been present for quite a while.
But can someone shed light on what this FIX-ME means and whether it has any side effects?
Side notes:
I see this has been removed in JDK 10.
The paper referenced in the link does not seem to address the issue directly.
52429 is the closest integer to 2^19 / 10, so division by 10 can be achieved by multiplying by 52429 and then dividing by 2^19, where the latter is a trivial bit shift instead of a full division.
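A quick brute-force check makes the trick (and its valid range) concrete. This is a minimal Java sketch of my own, assuming 16-bit inputs as in the snippet below; note that i * 52429 can overflow a signed int, but the product still fits in 32 unsigned bits, so the unsigned shift >>> recovers the correct quotient:
public class Div10Check {
    public static void main(String[] args) {
        for (int i = 0; i <= 0xFFFF; i++) {
            // i * 52429 may wrap negative, but >>> treats the bits as unsigned
            int q = (i * 52429) >>> 19;
            if (q != i / 10) {
                System.out.println("mismatch at i = " + i);
                return;
            }
        }
        System.out.println("(i * 52429) >>> 19 == i / 10 for all 16-bit i");
    }
}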
The code author appears to be suggesting that the multiplication could be done more optimally using shift/add operations instead, per this (C language) snippet:
uint32_t div10(uint16_t in)
{
// divides by multiplying by 52429 / (2 ^ 19)
// 52429 = 0xcccd
uint32_t x = in << 2; // multiply by 4 : total = 0x0004
x += (x << 1); // multiply by 3 : total = 0x000c
x += (x << 4); // multiply by 17 : total = 0x00cc
x += (x << 8); // multiply by 257 : total = 0xcccc
x += in; // one more makes : total = 0xcccd
return x >> 19;
}
What I can't answer is why they apparently thought this might be faster than a straight multiplication in a Java environment.
At the machine-code level it would only be faster on a (nowadays rare) CPU without a hardware multiplier, where the simplest (albeit perhaps naïve) multiply routine would need 16 shift/add operations to multiply two 16-bit numbers.
On the other hand, a hand-crafted function like the one above can multiply by a constant in fewer steps by exploiting the numeric properties of that constant, in this case reducing it to four shift/add operations instead of 16.
FWIW (and somewhat impressively) the clang compiler on macOS, even with just the -O1 optimisation flag, actually converts the code above back into a single multiplication:
_div10: ## #div10
pushq %rbp
movq %rsp, %rbp
imull $52429, %edi, %eax ## imm = 0xCCCD
shrl $19, %eax
popq %rbp
retq
It also turns:
uint32_t div10(uint16_t in) {
return in / 10;
}
into exactly the same assembly code, which just goes to show that modern compilers really do know best.
Related
I'm making a doom style pseudo-3D game.
The world is rendered pixel by pixel into a buffered image, which is later displayed on the JPanel. I want to keep this approach so that lighting individual pixels will be easier.
I want to be able to color the textures in the game to many different colors.
Coloring the whole texture and storing it in a separate buffered image takes too much time and memory for my purpose. So I am tinting each pixel of the texture during the rendering stage.
The problem I am having is that tinting each pixel is quite expensive. When an uncolored wall covers the entire screen, I get around 65 fps. And when a colored wall covers the screen, I get 30 fps.
This is my function for tinting the pixels:
//Change the color of the pixel using its brightness.
public static int tintABGRPixel(int pixelColor, Color tintColor) {
//Calculate the luminance. The decimal values are pre-determined.
double lum = ((pixelColor>>16 & 0xff) * 0.2126 +
(pixelColor>>8 & 0xff) * 0.7152 +
(pixelColor & 0xff) * 0.0722) / 255;
//Calculate the new tinted color of the pixel and return it.
return ((pixelColor>>24 & 0xff) << 24) |
((int)(tintColor.getBlue()*lum) & 0xff) |
(((int)(tintColor.getGreen()*lum) & 0xff) << 8) |
(((int)(tintColor.getRed()*lum) & 0xff) << 16);
}
Sorry for the illegible code. This function calculates the brightness of the original pixel, multiplies the new color by the brightness, and converts it back into an int.
It only contains simple operations, but this function is called up to a million times per frame in the worst case. The bottleneck is the calculation in the return statement.
Is there a more efficient way to calculate the new color?
Would it be best if I changed my approach?
Thanks
Do the work in Parallel
Threads aren't necessarily the only way to parallelise code; CPUs often have instruction sets such as SIMD which allow you to compute the same arithmetic on multiple numbers at once. GPUs take this idea and run with it, allowing you to run the same function on hundreds to thousands of numbers in parallel. I don't know how to do this in Java, but I'm sure with some googling it's possible to find a method that works.
Algorithm - Do less work
Is it possible to reduce the number of times the function needs to be called? Calling any function a million times per frame is going to hurt. Unless the overhead of each function call is managed (inlining it, reusing the stack frame, caching the result if possible), you'll want to do less work.
Possible options could be:
Make the window/resolution of the game smaller.
Work with a different representation. Are you doing a lot of operations that are easier to do when pixels are HSV instead of RGB? Then only convert to RGB when you are about to render the pixel.
Use a limited number of colours for each pixel. That way you can work out the possible tints in advance, so they are only a lookup away, as opposed to a function call.
Tint as little as possible. Maybe there is some UI that is tinted and shouldn't be. Maybe lighting effects only travel so far.
As a last resort, make tinted the default. If tinting is done that often, then possibly "untinting" happens far less, and you can get better performance by doing that instead.
Performance - (Micro-)optimising the code
If you can settle for an "approximate tint", this SO answer gives an approximation for the brightness (lum) of a pixel that should be cheaper to compute. (The formula from the link is Y = 0.33 R + 0.5 G + 0.16 B, which can be written Y = (R+R+B+G+G+G)/6.)
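For reference, here is that approximation as a small Java helper; a sketch only, using the same channel layout as the question's code:
static int fastLum(int pixelColor) {
    int r = pixelColor >> 16 & 0xff;
    int g = pixelColor >> 8 & 0xff;
    int b = pixelColor & 0xff;
    // Y = (R + R + B + G + G + G) / 6, roughly 0.33R + 0.5G + 0.17B
    return (r + r + b + g + g + g) / 6; // 0..255, integer math only
}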
The next step is to time your code (profile is a good term to know for googling) to see what takes up the most resources. It may well be that it isn't this function here, but another piece of code. Or waiting for textures to load.
From this point on we will assume the function provided in the question takes up the most time. Let's see what it is spending its time on. I don't have the rest of your code, so I can't benchmark all of it, but I can compile it and look at the bytecode that is produced. Using javap on a class containing the function I get the following (bytecode has been cut where there are repeats).
public static int tintABGRPixel(int, Color);
Code:
0: iload_0
1: bipush 16
3: ishr
4: sipush 255
7: iand
8: i2d
9: ldc2_w #2 // double 0.2126d
12: dmul
13: iload_0
...
37: dadd
38: ldc2_w #8 // double 255.0d
41: ddiv
42: dstore_2
43: iload_0
44: bipush 24
46: ishr
47: sipush 255
50: iand
51: bipush 24
53: ishl
54: aload_1
55: pop
56: invokestatic #10 // Method Color.getBlue:()I
59: i2d
60: dload_2
61: dmul
62: d2i
63: sipush 255
66: iand
67: ior
68: aload_1
69: pop
...
102: ireturn
This can look scary at first, but Java bytecode is nice, in that you can match each line (or instruction) to a point in your function. It hasn't done anything crazy like rewrite it or vectorize it or anything that makes it unrecognizable.
The general method to see if a change has made an improvement, is to measure the code before and after. With that knowledge you can decide if a change is worth keeping. Once the performance is good enough, stop.
Our poor man's profiling is to look at each instruction and see (on average, according to online sources) how expensive it is. This is a little naive, as how long each instruction takes to execute can depend on a multitude of things, such as the hardware it is running on, the versions of software on the computer, and the instructions around it.
I don't have a comprehensive list of the time cost for each instruction, so I'm going to go with some heuristics.
integer operations are faster than floating operations.
constants are faster than local memory, which is faster than global memory.
powers of two can allow for powerful optimisations.
I stared at the bytecode for a while, and all I noticed was that from lines [8 - 42] there are a lot of floating point operations. This section of code works out lum (the brightness). Other than that, nothing else stands out, so let's rewrite the code with our first heuristic in mind. If you don't care for the explanation, I'll provide the final code at the end.
Let us just consider what the blue colour value (which we will label B) will be by the end of the function. The changes will apply to red and green too, but we will leave them out for brevity.
double lum = ((pixelColor>>16 & 0xff) * 0.2126 +
(pixelColor>>8 & 0xff) * 0.7152 +
(pixelColor & 0xff) * 0.0722) / 255;
...
... | ((int)(tintColor.getBlue()*lum) & 0xff) | ...
This can be rewritten as
int x = (pixelColor>>16 & 0xff), y = (pixelColor>>8 & 0xff), z = (pixelColor & 0xff);
double a = 0.2126, b = 0.7152, c = 0.0722;
double lum = (a*x + b*y + c*z) / 255;
int B = (int)(tintColor.getBlue()*lum) & 0xff;
We don't want to be doing as many floating point operations, so let us do some refactoring. The idea is that the floating point constants can be written as fractions. For example, 0.2126 can be written as 2126/10000.
int x = (pixelColor>>16 & 0xff), y = (pixelColor>>8 & 0xff), z = (pixelColor & 0xff);
int a = 2126, b = 7152, c = 722;
int top = a*x + b*y + c*z;
double temp = (double)(tintColor.getBlue() * top) / 10000 / 255;
int B = (int)temp & 0xff;
So now we do three integer multiplications (imul) instead of three dmuls. The cost is one extra floating-point division, which alone would probably not be worth it. We can avoid this issue by piggybacking on the other division that we are already doing. Combining the two sequential divisions into one is as simple as changing / 10000 / 255 to / 2550000. We can also set up the code for one more optimization by moving the casting and division to one line.
int x = (pixelColor>>16 & 0xff), y = (pixelColor>>8 & 0xff), z = (pixelColor & 0xff);
int a = 2126, b = 7152, c = 722;
int top = a*x + b*y + c*z;
int temp = (int)((double)(tintColor.getBlue()*top) / 2550000);
int B = temp & 0xff;
This could be a good place to stop. However, if you need to squeeze a tiny bit more performance out of this function, we can optimise dividing by a constant and casting a double to an int (which I believe are two expensive operations) to a multiply (by a long) and a shift.
int x = (pixelColor>>16 & 0xff), y = (pixelColor>>8 & 0xff), z = (pixelColor & 0xff);
int a = 2126, b = 7152, c = 722;
int top = a*x + b*y + c*z;
int Btemp = (int)((tintColor.getBlue() * top * 1766117501L) >> 52);
int B = Btemp & 0xff;
where the magic numbers are ones that clang conjured up when I compiled a C++ version of the code. I am not able to explain how to produce this magic, but it works as far as I have tested, with a couple of values for x, y, z, and tintColor.getBlue(). When testing I assumed all the values are in the range [0 - 256), and I tried only a couple of examples.
The final code is below. Be warned that this is not well tested and may have edge cases that I've missed, so let me know if there are any bugs. Hopefully it is fast enough.
public static int tintABGRPixel(int pixelColor, Color tintColor) {
// Calculate the luminance. The decimal values are pre-determined.
int x = pixelColor>>16 & 0xff, y = pixelColor>>8 & 0xff, z = pixelColor & 0xff;
int top = 2126*x + 7152*y + 722*z;
int Btemp = (int)((tintColor.getBlue() * top * 1766117501L) >> 52);
int Gtemp = (int)((tintColor.getGreen() * top * 1766117501L) >> 52);
int Rtemp = (int)((tintColor.getRed() * top * 1766117501L) >> 52);
//Calculate the new tinted color of the pixel and return it.
return ((pixelColor>>24 & 0xff) << 24) | Btemp & 0xff | (Gtemp & 0xff) << 8 | (Rtemp & 0xff) << 16;
}
EDIT: Alex found that the magic number should be 1755488566L instead of 1766117501L.
To get better performance you'll have to get rid of objects like Color during image manipulation. Also, if you know that a method is going to be called a million times (image.width * image.height times), then it's best to inline it. In general the JVM would probably inline the method itself, but you should not take the risk.
You can use PixelGrabber to get all the pixels into an array. Here's a general usage
final int[] pixels = new int[width * height];
final PixelGrabber pixelgrabber = new PixelGrabber(image, 0, 0, width, height, pixels, 0, width);
pixelgrabber.grabPixels(); // must be called to fill the array; declares InterruptedException
for(int i = 0; i < height; i++) {
for(int j = 0; j < width; j++) {
int p = pixels[i * width + j]; // same as image.getRGB(j, i);
int alpha = ( ( p >> 24) & 0xff );
int red = ( ( p >> 16) & 0xff );
int green = ( ( p >> 8) & 0xff );
int blue = ( p & 0xff );
//do something i.e. apply luminance
}
}
Above is just an example of how to iterate row and column indexes; in your case the nested loop is not needed, since a single loop over the pixels array will do. This should reasonably improve the performance.
This can probably also be parallelized easily using Java 8 streams; however, be careful when using streams with images, as streams are a lot slower than plain old loops.
You can also try replacing int with byte where applicable (i.e. individual color components don't need to be stored in an int). Basically, use primitive datatypes, and among primitive datatypes use the smallest that fits.
At this point you are really close to the metal on this calculation. I think you'll have to change your approach to really improve things, but a quick idea is to cache the lum calculation (see the sketch below). That is a simple function of pixel color, and your lum isn't dependent on anything but that. If you cache that, it could save you a lot of calcs. While you're caching, you could cache this calc too:
((pixelColor>>24 & 0xff) << 24)
I don't know if that'll save you a ton of time, but I think at this point that is just about all you could do from a micro-optimization stand point.
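Here is one way the caching could look: a minimal sketch, assuming the tint color changes rarely, that quantizes lum to 256 levels and precomputes the tinted RGB once per tint color (the name buildTintTable is mine, not from the original code):
static int[] buildTintTable(java.awt.Color tint) {
    int[] table = new int[256];
    for (int lum = 0; lum < 256; lum++) {
        int b = tint.getBlue() * lum / 255;
        int g = tint.getGreen() * lum / 255;
        int r = tint.getRed() * lum / 255;
        table[lum] = r << 16 | g << 8 | b;
    }
    return table;
}
// per pixel: (alpha bits) | table[lum8], where lum8 is the 0..255 luminance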
Now you could refactor your pixel loop to use parallelism and do those pixel calcs in parallel on your CPU; this might set you up for the next idea too.
If neither of those ideas works, I think you might need to push the color calculations off to the GPU. This is all bare-metal math that has to happen millions of times, which is what graphics cards do best. Unfortunately this is a deep topic that requires a lot of learning in order to pick the best option. Here are some interesting things to research:
https://code.google.com/archive/p/java-gpu/
https://github.com/nativelibs4java/JavaCL
http://jogamp.org/jogl/www/
https://www.lwjgl.org/
I know some of those are huge frameworks, which isn't what you asked for, but they might contain other relatively unknown libs that you could use to push these math calcs off to the GPU. The @Parallel annotation or the JavaCL bindings looked like they could be the most useful.
I am looking for a bit-wise test equivalent to (num%2) == 0 || (num%3) == 0.
I can replace num%2 with num&1, but I'm still stuck with num%3 and with the logical-or.
This expression is also equivalent to (num%2)*(num%3) == 0, but I'm not sure how that helps.
Yes, though it's not very pretty: you can do something analogous to the old "sum all the decimal digits until you have only one left" trick for testing whether a number is divisible by 9, except in binary and with divisibility by 3. You can use the same principle for other numbers as well, but many combinations of base/divisor introduce annoying scaling factors, so you're not just summing digits anymore.
Anyway, 16^n - 1 is divisible by 3, so you can use radix 16, that is, sum the nibbles. Then you're left with one nibble (well, 5 bits really), and you can just look that up. For example, in C# (edit: brute-force tested, definitely works):
static bool IsMultipleOf3(uint x)
{
const uint lookuptable = 0x49249249;
uint t = (x & 0x0F0F0F0F) + ((x & 0xF0F0F0F0) >> 4);
t = (t & 0x00FF00FF) + ((t & 0xFF00FF00) >> 8);
t = (t & 0x000000FF) + ((t & 0x00FF0000) >> 16);
t = (t & 0xF) + ((t & 0xF0) >> 4);
return ((lookuptable >> (int)t) & 1) != 0;
}
The trick from my comment, x * 0xaaaaaaab <= 0x55555555, works through a modular multiplicative inverse. 0xaaaaaaab * 3 = 1 mod 2^32, which means that 0xaaaaaaab * x = x / 3 if and only if x % 3 = 0. "If" because 0xaaaaaaab * 3 * y = y (since 1 * y = y), so if x is of the form 3 * y then it maps back to y. "Only if" because no two inputs map to the same output, so everything not divisible by 3 maps to something higher than the highest value you can get by dividing anything by 3 (which is 0xFFFFFFFF / 3 = 0x55555555).
You can read more about this (including the more general form, which includes a rotation) in Division by Invariant Integers using Multiplication (T. Granlund and P. L. Montgomery).
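The same trick carries over to Java. A small sketch (my own, not from the paper), relying on the fact that Java's int multiplication wraps mod 2^32 and using Integer.compareUnsigned for the unsigned comparison the C version gets for free:
static boolean isMultipleOf3(int x) {
    // 0xAAAAAAAB is the multiplicative inverse of 3 mod 2^32
    return Integer.compareUnsigned(x * 0xAAAAAAAB, 0x55555555) <= 0;
}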
Your compiler may not know this trick. For example, this:
uint32_t foo(uint32_t x)
{
return x % 3 == 0;
}
Becomes, on Clang 3.4.1 for x64,
movl %edi, %eax
movl $2863311531, %ecx # imm = 0xAAAAAAAB
imulq %rax, %rcx
shrq $33, %rcx
leal (%rcx,%rcx,2), %eax
cmpl %eax, %edi
sete %al
movzbl %al, %eax
ret
G++ 4.8:
mov eax, edi
mov edx, -1431655765
mul edx
shr edx
lea eax, [rdx+rdx*2]
cmp edi, eax
sete al
movzx eax, al
ret
What it should be:
imul eax, edi, 0xaaaaaaab
cmp eax, 0x55555555
setbe al
movzx eax, al
ret
I guess I'm a bit late to this party, but here's a slightly faster (and slightly prettier) solution than the one from harold:
bool is_multiple_of_3(std::uint32_t i)
{
i = (i & 0x0000FFFF) + (i >> 16);
i = (i & 0x00FF) + (i >> 8);
i = (i & 0x0F) + (i >> 4);
i = (i & 0x3) + (i >> 2);
const std::uint32_t lookuptable = 0x49249249;
return ((lookuptable >> i) & 1) != 0;
}
It's C++11, but that doesn't really matter for this piece of code. It's also brute-force tested for 32-bit unsigned ints. It saves you at least one bit-fiddling op for each of the first four steps. It also scales beautifully to 64 bits - only one additional step needed at the beginning.
The last two lines are obviously and shamelessly taken from harold's solution (nice one, I wouldn't have done that so elegantly).
Possible further optimizations:
The & ops in the first two steps will be optimized away by just using the lower-half registers on architectures that have them (x86, for example).
The largest possible output from the third step is 60, and from the fourth step it's 15 (when the function argument is 0xFFFFFFFF). Given that, we can eliminate the fourth step, use a 64-bit lookuptable and shift directly into that following the third step. This turns out to be a bad idea for Visual C++ 2013 in 32-bit mode, as the right shift turns into a non-inline call to code that does a lot of tests and jumps. However, it should be a good idea if 64-bit registers are natively available.
The point above needs to be reevaluated if the function is modified to take a 64-bit argument. The maximum outputs from the last two steps (which will be steps 4 and 5 after adding one step at the beginning) will be 75 and 21 respectively, which means we can no longer eliminate the last step.
The first four steps are based on the fact that a 32-bit number can be written as
(high 16 bits) * 65536 + (low 16 bits) =
(high 16 bits) * 65535 + (high 16 bits) + (low 16 bits) =
(high 16 bits) * 21845 * 3 + ((high 16 bits) + (low 16 bits))
So the whole thing is divisible by 3 if and only if the right parenthesis is divisible by 3. And so on, as this holds for 256 = 85 * 3 + 1, 16 = 5 * 3 + 1, and 4 = 3 + 1. (Of course, this is generally true for even powers of two; odd powers are one less than the nearest multiple of 3.)
The numbers that are input into the following steps will be larger than 16-bit, 8-bit, and 4-bit respectively in some cases, but that's not a problem, as we're not dropping any high-order bits when shifting right.
Well, I can do that through logic, but I bet there is a mathematical operation or expression to do that. Does one exist? If yes, what is it?
Here is the algorithm:
private int calcNumberOfLongs(int size) {
if (size % 64 == 0) {
return size / 64;
} else {
return size / 64 + 1;
}
}
Let me be clear what I want:
For 150 bits I need three 64-bit longs. Two of course only give me 128 bits. So that's the first computation.
The second computation, this one even more important because it will be executed all the time, is to go from bit position to long. For example:
bit 5 -> first long
bit 64 -> first long
bit 65 -> second long
bit 140 -> third long
What is the mathematical expression and / or bitwise operation to get this information?
Ok, from the answer below it looks like to go from bit position to long, we just use:
long position = bit position / 64
The continuation is here: How to turn a division into a bitwise shift when power of two?
I don't believe there is a built-in function to do it, although you could simplify your code to:
return (size + 63) / 64;
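Since 64 is a power of two, both computations reduce to shifts and masks. A small sketch, assuming 0-based bit positions as in the bit position / 64 formula above (the question's 1-based examples would need bitPosition - 1 first):
static int longsNeeded(int sizeInBits) {
    return (sizeInBits + 63) >>> 6; // ceil(sizeInBits / 64)
}

static int longIndex(int bitPosition) {
    return bitPosition >>> 6;       // bitPosition / 64
}

static int bitWithinLong(int bitPosition) {
    return bitPosition & 63;        // bitPosition % 64
}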
I have implemented the Kahan floating-point summation algorithm in Java. I want to compare it against the built-in floating-point addition in Java and infinite-precision addition in Mathematica. However, the data set I have is not good for testing, because the numbers are close to each other (condition number ~= 1).
Running Kahan on my data set gives almost the same result as the built-in +.
Could anyone suggest how to generate a large amount of data that can potentially cause serious round-off error?
However the data set I have is not good for testing, because the numbers are close to each other.
It sounds like you already know what the problem is. Get to it =)
There are a few things that you will want:
Numbers of wildly different magnitudes, so that most of the precision of the smaller number is lost with naive summation.
Numbers with different signs and nearly equal (or equal) magnitudes, such that catastrophic cancellation occurs.
Numbers that have some low-order bits set, to increase the effects of rounding.
To get you started, you could try some simple three-term sums, which should show the effect clearly:
1.0 + 1.0e-20 - 1.0
Evaluated with simple summation, this will give 0.0; clearly incorrect. You might also look at sums of the form:
a0 + a1 + a2 + ... + an - b
Where b is the sum a0 + ... + an evaluated naively.
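Here is a minimal Java sketch tying the first bullet to the asker's setup: one large term plus many tiny ones, so naive summation loses every tiny term while Kahan's compensation recovers them (the count and magnitudes are arbitrary choices for illustration):
public class KahanDemo {
    public static void main(String[] args) {
        int n = 10_000_000;        // exact sum: 1.0 + n * 1e-16 = 1.000000001
        double naive = 1.0;
        double sum = 1.0, c = 0.0; // Kahan running sum and compensation
        for (int i = 0; i < n; i++) {
            naive += 1e-16;        // rounds back to 1.0 every time
            double y = 1e-16 - c;  // re-add what was lost last step
            double t = sum + y;
            c = (t - sum) - y;     // low-order bits lost in this addition
            sum = t;
        }
        System.out.println("naive: " + naive); // 1.0
        System.out.println("kahan: " + sum);   // ~1.000000001
    }
}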
You want a heap of high precision numbers? Try this:
double[] nums = new double[SIZE];
for (int i = 0; i < SIZE; i++)
nums[i] = Math.random();
Are we talking about number pairs or sequences?
If pairs, start with 1 for both numbers, then in every iteration divide one by 3 and multiply the other by 3. It's easy to calculate the theoretical sums of those pairs, and you'll get a whole host of rounding errors. (Some from the division and some from the addition. If you don't want division errors, use 2 instead of 3.)
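A rough Java sketch of that generator (my choice of 30 iterations is arbitrary):
// In exact arithmetic a + b = 3^-i + 3^i; the drift of the computed
// sums from those values is exactly the rounding error to exercise.
double a = 1.0, b = 1.0;
for (int i = 1; i <= 30; i++) {
    a /= 3.0; // inexact: 1/3 has no finite binary representation
    b *= 3.0; // still exact: 3^30 fits in the 53-bit mantissa
    System.out.println(i + ": " + (a + b));
}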
By experiment, I found the following pattern:
public static void main(String[] args) {
System.out.println(1.0 / 3 - 0.01 / 3);
System.out.println(1.0 / 7 - 0.01 / 7);
System.out.println(1.0 / 9 - 0.001 / 9);
}
I've subtracted close negative powers of prime numbers (which should not have an exact representation in binary form). However, there are cases where such an expression evaluates correctly, for example
System.out.println(1.0 / 9 - 0.01 / 9);
You can automate this approach by iterating over the power of the subtrahend and stopping when multiplication by the appropriate value doesn't yield an integer, for example:
System.out.println((1.0 / 9 - 0.001 / 9) * 9000);
if (1000 - (1.0 / 9 - 0.001 / 9) * 9000 > 1.0)
System.out.println("Found it!");
Scalacheck might be something for you. Here is a short sample:
cat DoubleSpecification.scala
import org.scalacheck._
object DoubleSpecification extends Properties ("Doubles") {
/*
(a/1000 + b/1000) = (a+b) / 1000
(a/x + b/x ) = (a+b) / x
*/
property ("distributive") = Prop.forAll { (a: Int, b: Int, c: Int) =>
(c == 0 || a*1.0/c + b*1.0/c == (a+b) * 1.0 / c) }
}
object Runner {
def main (args: Array[String]) {
DoubleSpecification.check
println ("...done")
}
}
To run it, you need Scala and the scalacheck jar. I used version 2.8 (needless to say, your classpath will vary):
scalac -cp /opt/scala/lib/scalacheck.jar:. DoubleSpecification.scala
scala -cp /opt/scala/lib/scalacheck.jar:. DoubleSpecification
! Doubles.distributive: Falsified after 6 passed tests.
> ARG_0: 28 (orig arg: 1030341)
> ARG_1: 9 (orig arg: 2147483647)
> ARG_2: 5
Scalacheck takes some random values (orig args) and, if the test fails, tries to simplify them in order to find simple examples.
I did some tests on the pow(exponent) method. Unfortunately, my math skills are not strong enough to handle the following problem.
I'm using this code:
BigInteger.valueOf(2).pow(var);
Results:
var | time in ms
2000000 | 11450
2500000 | 12471
3000000 | 22379
3500000 | 32147
4000000 | 46270
4500000 | 31459
5000000 | 49922
See? The 2,500,000 exponent is calculated almost as fast as 2,000,000, and 4,500,000 is calculated much faster than 4,000,000.
Why is that?
To give you some help, here's the original implementation of BigInteger.pow(exponent):
public BigInteger pow(int exponent) {
if (exponent < 0)
throw new ArithmeticException("Negative exponent");
if (signum==0)
return (exponent==0 ? ONE : this);
// Perform exponentiation using repeated squaring trick
int newSign = (signum<0 && (exponent&1)==1 ? -1 : 1);
int[] baseToPow2 = this.mag;
int[] result = {1};
while (exponent != 0) {
if ((exponent & 1)==1) {
result = multiplyToLen(result, result.length,
baseToPow2, baseToPow2.length, null);
result = trustedStripLeadingZeroInts(result);
}
if ((exponent >>>= 1) != 0) {
baseToPow2 = squareToLen(baseToPow2, baseToPow2.length, null);
baseToPow2 = trustedStripLeadingZeroInts(baseToPow2);
}
}
return new BigInteger(result, newSign);
}
The algorithm uses repeated squaring (squareToLen) and multiplication (multiplyToLen). The time for these operations to run depends on the size of the numbers involved. The multiplications of the large numbers near the end of the calculation are much more expensive than those at the start.
The multiplication is only done when this condition is true: ((exponent & 1)==1). The number of square operations depends on the number of bits in the number (excluding leading zeros), but a multiplication is only required for the bits that are set to 1. It is easier to see the operations that are required by looking at the binary representation of the number:
2000000: 0000111101000010010000000
2500000: 0001001100010010110100000
3000000: 0001011011100011011000000
3500000: 0001101010110011111100000
4000000: 0001111010000100100000000
4500000: 0010001001010101000100000
5000000: 0010011000100101101000000
Note that 2.5M and 4.5M are lucky in that they have fewer high bits set than the numbers surrounding them. The next time this happens is at 8.5M:
8000000: 0011110100001001000000000
8500000: 0100000011011001100100000
9000000: 0100010010101010001000000
The sweet spots are exact powers of 2.
1048575: 0001111111111111111111111 // 16408 ms
1048576: 0010000000000000000000000 // 6209 ms
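To make the operation counts concrete, here is a tiny Java simulation of the quoted loop's structure (counting only, no big-number math): it performs one multiply per 1-bit of the exponent and one squaring per bit below the highest, which is why the bit patterns above matter:
static void countOps(int exponent) {
    int multiplies = 0, squarings = 0;
    while (exponent != 0) {
        if ((exponent & 1) == 1) multiplies++;   // mirrors multiplyToLen
        if ((exponent >>>= 1) != 0) squarings++; // mirrors squareToLen
    }
    System.out.println(multiplies + " multiplies, " + squarings + " squarings");
}
Bear in mind that the multiplies are not equally expensive: the ones triggered by high bits of the exponent happen when the operands are already huge, which is why the high bits dominate the timings.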
Just a guess:
the exponent is handled bit by bit, and if the least significant bit is 1, additional work is done.
If L is the number of bits in the exponent,
and A is the number of bits which are 1,
and t1 is the time to process the common part,
and t2 is the additional processing time when the LSB is 1,
then the run time would be
L*t1 + A*t2
i.e. the time depends on the number of 1s in the binary representation.
Now writing a little program to verify my theory...
I'm not sure how many times you've run your timings. As some of the commenters have pointed out, you need to time operations many, many times to get good results (and they can still be wrong).
Assuming you have timed things well, remember that there are a lot of shortcuts that can be taken in math. You don't have to do the operations 5*5*5*5*5*5 to calculate 5^6.
Here is one way to do it much more quickly: http://en.wikipedia.org/wiki/Exponentiation_by_squaring
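The linked method is the same repeated-squaring structure as the BigInteger code quoted above; a minimal long-based Java sketch (illustrative only, with no overflow handling):
static long pow(long base, int exponent) {
    long result = 1;
    while (exponent != 0) {
        if ((exponent & 1) == 1)
            result *= base; // this bit of the exponent is set
        base *= base;       // square for the next bit
        exponent >>>= 1;
    }
    return result;
}
For example, pow(5, 6) computes 15625 with three squarings and two multiplies instead of five multiplies.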