My mini benchmark:
import java.math.*;
import java.util.*;
import java.io.*;
public class c
{
    static Random rnd = new Random();

    public static String addDigits(String a, int n)
    {
        if (a == null) return null;
        if (n <= 0) return a;
        for (int i = 0; i < n; i++)
            a += rnd.nextInt(10);
        return a;
    }

    public static void main(String[] args) throws IOException
    {
        int n = 10000; // number of iterations
        int k = 10;    // number of digits added at each iteration
        BigInteger a;
        BigInteger b;
        String as = "";
        String bs = "";
        as += rnd.nextInt(9) + 1;
        bs += rnd.nextInt(9) + 1;
        a = new BigInteger(as);
        b = new BigInteger(bs);
        FileWriter fw = new FileWriter("c.txt");
        long t1 = System.nanoTime();
        a.multiply(b);
        long t2 = System.nanoTime();
        //fw.write("1,"+(t2-t1)+"\n");
        if (k > 0) {
            as = addDigits(as, k - 1);
            bs = addDigits(bs, k - 1);
        }
        for (int i = 0; i < n; i++)
        {
            a = new BigInteger(as);
            b = new BigInteger(bs);
            t1 = System.nanoTime();
            a.multiply(b);
            t2 = System.nanoTime();
            fw.write(((i + 1) * k) + "," + (t2 - t1) + "\n");
            if (i < n - 1)
            {
                as = addDigits(as, k);
                bs = addDigits(bs, k);
            }
            System.out.println((i + 1) * k);
        }
        fw.close();
    }
}
It measures the multiplication time of n-digit BigIntegers.
Result:
You can easily see the trend, but why is there so much noise above 50,000 digits?
Is it because of the garbage collector, or is there something else affecting my results?
When performing the test, there were no other applications running.
Result from test with only odd digits. The test was shorter (n=1000, k=100)
Odd digits (n=10000, k=10)
As you can see, there is huge noise between 65,000 and 70,000 digits. I wonder why...
Odd digits (n=10000, k=10), System.gc() every 1000 iterations
This results in noise between 50,000 and 70,000 digits.
I also suspect this is a JVM warmup effect. Not warmup involving classloading or the JIT compiler, but warmup of the heap.
Put a (java) loop around the whole benchmark, and run it a number of times. (If this gives you the same graphs as before ... you will have evidence that this is not a warmup effect. Currently you don't have any empirical evidence one way or the other.)
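For example, a minimal sketch of that outer loop (runBenchmark is a hypothetical method holding the body of main from the question):

    // Repeat the whole benchmark; later runs execute against a warmed-up JIT
    // and an already-grown heap, so comparing runs exposes any warmup effect.
    for (int run = 0; run < 5; run++) {
        runBenchmark("c_run" + run + ".txt");   // hypothetical extraction of main()'s body
    }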
Another possibility is that the noise is caused by your benchmark's interactions with the OS and/or other stuff running on the machine.
You are writing your timing data to an unbuffered stream. That means LOTS of syscalls, and (potentially) lots of fine-grained disc writes.
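A sketch of the buffering change, assuming the rest of the loop stays the same:

    // Buffer the output so each write(...) does not become its own syscall.
    Writer fw = new BufferedWriter(new FileWriter("c.txt"));
    // ... timing loop unchanged, still calling fw.write(((i + 1) * k) + "," + (t2 - t1) + "\n");
    fw.close();   // close() flushes the buffer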
You are making LOTS of calls to nanoTime(), and that might introduce noise.
If something else is running on your machine (e.g. you are web browsing) that will slow down your benchmark for a bit and introduce noise.
There could be competition over physical memory ... if you've got too much running on your machine for the amount of RAM.
Finally, a certain amount of noise is inevitable, because each of those multiply calls generates garbage, and the garbage collector is going to need to work to deal with it.
Finally finally, if you manually run the garbage collector (or increase the heap size) to "smooth out" the data points, what you are actually doing is concealing one of the costs of the multiply calls. The resulting graphs look nice, but they are misleading:
The noisiness reflects what will happen in real life.
The true cost of the multiply actually includes the amortized cost of running the GC to deal with the garbage generated by the call.
To get measurements that reflect the way that BigInteger behaves in real life, you need to run the test a large number of times, calculate average times, and fit a curve to the average data points.
Remember, the real aim of the game is to get scientifically valid results ... not a smooth curve.
If you do a microbenchmark, you must "warm up" the JVM first to let the JIT optimize the code, and then you can measure the performance. Otherwise you are measuring the work done by the JIT and that can change the result on each run.
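A hand-rolled warm-up phase for the benchmark above might look roughly like this (a sketch; the iteration count is arbitrary, and a and b are the BigIntegers from the question):

    // Warm-up: run the measured operation well past the JIT compilation
    // threshold (10,000 invocations by default) and throw the timings away.
    for (int i = 0; i < 20000; i++) {
        a.multiply(b);
    }
    // Only measurements taken after this point reflect compiled code.
    long t1 = System.nanoTime();
    a.multiply(b);
    long t2 = System.nanoTime();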
The "noise" happens probably because the cache of the CPU is exceeded and the performance starts degrading.
Related
I recently built a Fibonacci generator that uses recursion and hashmaps to reduce complexity. I am using System.nanoTime() to keep track of the time it takes my program to print 10,000 Fibonacci numbers. It started out well at less than a second, but gradually became slower and now it takes more than 4 seconds. Can someone explain why this might be happening? The code is below:
import java.util.*;
import java.math.*;
public class FibonacciGeneratorUnlimited {

    static int numFibCalls = 0;
    static HashMap<Integer, BigInteger> d = new HashMap<Integer, BigInteger>();
    static Scanner fibNumber = new Scanner(System.in);
    static BigInteger ans = new BigInteger("0");

    public static void main(String[] args) {
        d.put(0, new BigInteger("0"));
        d.put(1, new BigInteger("1"));
        System.out.print("Enter the term:\t");
        int n = fibNumber.nextInt();
        long startTime = System.nanoTime();
        for (int i = 0; i <= n; i++) {
            System.out.println(i + " : " + fib_efficient(i, d));
        }
        System.out.println((double) (System.nanoTime() - startTime) / 1000000000);
    }

    public static BigInteger fib_efficient(int n, HashMap<Integer, BigInteger> d) {
        numFibCalls += 1;
        if (d.containsKey(n)) {
            return (d.get(n));
        } else {
            ans = (fib_efficient(n - 1, d).add(fib_efficient(n - 2, d)));
            d.put(n, ans);
            return ans;
        }
    }
}
If you are restarting the program every time you generate a new Fibonacci sequence, then your program most likely isn't the problem. It might just be that your processor got hot after running the program a few times, or that a background process on your computer suddenly started, causing your program to slow down.
More memory (java -Xmx<size>, e.g. java -Xmx512m) or less caching:
public static BigInteger fib_efficient(int n, HashMap<Integer, BigInteger> d) {
    numFibCalls++;
    if ((n & 3) <= 1) { // only two out of every four values are cached
        BigInteger cached = d.get(n);
        if (cached != null) {
            return cached;
        } else {
            BigInteger ans = fib_efficient(n - 1, d).add(fib_efficient(n - 2, d));
            d.put(n, ans);
            return ans;
        }
    } else {
        return fib_efficient(n - 1, d).add(fib_efficient(n - 2, d));
    }
}
Two subsequent numbers are cached out of four in order to stop the
recursion on both branches for:
fib(n) = fib(n-1) + fib(n-2)
BigInteger isn't the nicest class where performance and memory is concerned.
It started out well at less than a second, but gradually became slower and now it takes more than 4 seconds.
What do you mean by this? Do you mean that you ran this exact same program with the same input and its run-time changed from < 1 second to > 4 seconds?
If you have the same exact code running with the same exact inputs in a deterministic algorithm...
then the differences are probably external to your code - maybe other processes are taking up more CPU on one run.
Do you mean that you increased the inputs from some value X to 10,000 and now it takes > 4 seconds?
Then that's just a matter of the algorithm taking longer with larger inputs, which is perfectly normal.
recursion and hashmaps to reduce complexity
That's not quite how complexity works. You have improved the best-case and the average-case, but you have done nothing to change the worst-case.
Now for some actual performance improvement advice
Stop printing out the results... that's eating up over 99% of your processing time. Seriously, though, switch out "System.out.println(i + " : " + fib_efficient(i, d))" with "fib_efficient(i,d)" and it'll execute over 100x faster.
Concatenating strings and printing to console are very expensive processes.
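For instance, the timed loop could fill the memo without printing (a sketch of the suggested change):

    // Time only the Fibonacci computation; console I/O and string
    // concatenation are left out of the measured region entirely.
    long startTime = System.nanoTime();
    for (int i = 0; i <= n; i++) {
        fib_efficient(i, d);   // compute and memoize, no println
    }
    System.out.println((double) (System.nanoTime() - startTime) / 1000000000);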
It happens because the complexity of this Fibonacci implementation is O(n^2). This means that the running time grows quadratically with the input size, as you can see in the graph for O(n^2) in this link. Check this answer to see a complete explanation of its complexity.
Now, the complexity of your algorithm increases because you are using a HashMap to search and insert elements each time the function is invoked. Consider removing this HashMap; one alternative is sketched below.
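For example, if the goal is just to produce the numbers in order, a simple iterative loop that keeps only the last two values needs neither recursion nor a map (a sketch, not the poster's code):

    // Iterative Fibonacci: O(n) BigInteger additions, no recursion, no HashMap.
    BigInteger prev = BigInteger.ZERO;   // fib(0)
    BigInteger curr = BigInteger.ONE;    // fib(1)
    for (int i = 0; i <= n; i++) {
        System.out.println(i + " : " + prev);
        BigInteger next = prev.add(curr);
        prev = curr;
        curr = next;
    }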
The short code below isolates the problem. Basically, I'm timing the method addToStorage. I start by executing it one million times, and I'm able to get its time down to around 723 nanoseconds. Then I do a short pause (using a busy-spinning method so as not to release the CPU core) and time the method again N times, at a different code location. To my surprise, I find that the smaller N is, the higher the addToStorage latency.
For example:
If N = 1 then I get 3.6 micros
If N = 2 then I get 3.1 and 2.5 micros
If N = 5 then I get 3.7, 1.8, 1.7, 1.5 and 1.5 micros
Does anyone know why this is happening and how to fix it? I would like my method to consistently perform at the fastest time possible, no matter where I call it from.
Note: I don't think it is thread-related, since I'm not using Thread.sleep. I've also tested using taskset to pin my thread to a CPU core, with the same results.
import java.util.ArrayList;
import java.util.List;
public class JvmOdd {

    private final StringBuilder sBuilder = new StringBuilder(1024);
    private final List<String> storage = new ArrayList<String>(1024 * 1024);

    public void addToStorage() {
        sBuilder.setLength(0);
        sBuilder.append("Blah1: ").append(System.nanoTime()).append('\n');
        sBuilder.append("Blah2: ").append(System.nanoTime()).append('\n');
        sBuilder.append("Blah3: ").append(System.nanoTime()).append('\n');
        sBuilder.append("Blah4: ").append(System.nanoTime()).append('\n');
        sBuilder.append("Blah5: ").append(System.nanoTime()).append('\n');
        sBuilder.append("Blah6: ").append(System.nanoTime()).append('\n');
        sBuilder.append("Blah7: ").append(System.nanoTime()).append('\n');
        sBuilder.append("Blah8: ").append(System.nanoTime()).append('\n');
        sBuilder.append("Blah9: ").append(System.nanoTime()).append('\n');
        sBuilder.append("Blah10: ").append(System.nanoTime()).append('\n');
        storage.add(sBuilder.toString());
    }

    public static long mySleep(long t) {
        long x = 0;
        for (int i = 0; i < t * 10000; i++) {
            x += System.currentTimeMillis() / System.nanoTime();
        }
        return x;
    }

    public static void main(String[] args) throws Exception {
        int warmup = Integer.parseInt(args[0]);
        int mod = Integer.parseInt(args[1]);
        int passes = Integer.parseInt(args[2]);
        int sleep = Integer.parseInt(args[3]);
        JvmOdd jo = new JvmOdd();

        // first warm up
        for (int i = 0; i < warmup; i++) {
            long time = System.nanoTime();
            jo.addToStorage();
            time = System.nanoTime() - time;
            if (i % mod == 0) System.out.println(time);
        }

        // now see how fast the method is:
        while (true) {
            System.out.println();
            // Thread.sleep(sleep);
            mySleep(sleep);
            long minTime = Long.MAX_VALUE;
            for (int i = 0; i < passes; i++) {
                long time = System.nanoTime();
                jo.addToStorage();
                time = System.nanoTime() - time;
                if (i > 0) System.out.print(',');
                System.out.print(time);
                minTime = Math.min(time, minTime);
            }
            System.out.println("\nMinTime: " + minTime);
        }
    }
}
Executing:
$ java -server -cp . JvmOdd 1000000 100000 1 5000
59103
820
727
772
734
767
730
726
840
736
3404
MinTime: 3404
There is so much going on in here that I don't know where to start. But let's start here...
long time = System.nanoTime();
jo.addToStorage();
time = System.nanoTime() - time;
The latency of addToStorage() cannot be measured using this technique. It simply runs too quickly, meaning you're likely below the resolution of the clock. Without running this, my guess is that your measurements are dominated by clock edge counts. You'll need to bulk up the unit of work to get a measure with lower levels of noise in it.
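One way to bulk it up is to time a whole batch of calls and report the average (a sketch; the batch size is arbitrary):

    // Measure a batch so the interval is far above the clock's resolution,
    // then divide to get the average cost per call.
    final int batch = 10000;   // hypothetical batch size
    long start = System.nanoTime();
    for (int i = 0; i < batch; i++) {
        jo.addToStorage();
    }
    long avgNanos = (System.nanoTime() - start) / batch;
    System.out.println("avg per call: " + avgNanos + " ns");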
As for what is happening: there are a number of call-site optimizations, the most important being inlining. Inlining would totally eliminate the call site, but it's a path-specific optimization. If you call the method from a different place, that call would follow the slow path of performing a virtual method lookup followed by a jump to that code. So to see the benefits of inlining from a different path, that path would also have to be "warmed up".
I would strongly recommend that you look at JMH (an OpenJDK project). There are facilities in there, such as Blackhole, which will help with the effects of CPU clocks winding down. You might also want to evaluate the quality of the benchmark with the help of tools like JITWatch (an Adopt OpenJDK project), which takes the logs produced by the JIT and helps you interpret them.
There is so much to this subject, but the bottom line is that you can't write a simplistic benchmark like this and expect it to tell you anything useful. You will need to use JMH.
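For reference, a minimal JMH benchmark for this method might look roughly like the following (a sketch; it assumes the JMH annotations from org.openjdk.jmh are on the classpath, and the class name is made up):

    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.State;

    @State(Scope.Thread)
    public class AddToStorageBench {

        private final JvmOdd jo = new JvmOdd();

        @Benchmark
        public void addToStorage() {
            // JMH takes care of warmup iterations, forking and statistics.
            // (A real benchmark would also reset jo's storage between iterations.)
            jo.addToStorage();
        }
    }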
I suggest watching this: https://www.infoq.com/presentations/jmh about microbenchmarking and JMH
There's also a chapter on microbenchmarking & JMH in my book: http://shop.oreilly.com/product/0636920042983.do
Java internally uses a JIT (Just-In-Time) compiler. Based on the number of times the same method executes, it optimizes the instructions so the method performs better. For small counts, the method runs in the normal, unoptimized way, which shows up as a longer execution time. When the same method is called more times, it gets JIT-compiled and executes in less time because of the optimized instructions generated for it.
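If you want to see when the JIT actually compiles your methods during a run, one option is to start the JVM with the -XX:+PrintCompilation flag, for example:

    $ java -XX:+PrintCompilation -server -cp . JvmOdd 1000000 100000 1 5000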
The two following versions of the same function (which basically tries to recover a password by brute force) do not give the same performance:
Version 1:
private static final char[] CHARS = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789".toCharArray();
private static final int N_CHARS = CHARS.length;
private static final int MAX_LENGTH = 8;
private static char[] recoverPassword()
{
    char word[];
    int refi, i, indexes[];
    for (int length = 1; length <= MAX_LENGTH; length++)
    {
        refi = length - 1;
        word = new char[length];
        indexes = new int[length];
        indexes[length - 1] = 1;
        while (true)
        {
            i = length - 1;
            while ((++indexes[i]) == N_CHARS)
            {
                word[i] = CHARS[indexes[i] = 0];
                if (--i < 0)
                    break;
            }
            if (i < 0)
                break;
            word[i] = CHARS[indexes[i]];
            if (isValid(word))
                return word;
        }
    }
    return null;
}
Version 2:
private static final char[] CHARS = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789".toCharArray();
private static final int N_CHARS = CHARS.length;
private static final int MAX_LENGTH = 8;
private static char[] recoverPassword()
{
    char word[];
    int refi, i, indexes[];
    for (int length = 1; length <= MAX_LENGTH; length++)
    {
        refi = length - 1;
        word = new char[length];
        indexes = new int[length];
        indexes[length - 1] = 1;
        while (true)
        {
            i = refi;
            while ((++indexes[i]) == N_CHARS)
            {
                word[i] = CHARS[indexes[i] = 0];
                if (--i < 0)
                    break;
            }
            if (i < 0)
                break;
            word[i] = CHARS[indexes[i]];
            if (isValid(word))
                return word;
        }
    }
    return null;
}
I would expect version 2 to be faster, as it does (and that is the only difference):
i = refi;
...as compare to version 1:
i = length -1;
However, it's the opposite: version 1 is faster by over 3%!
Does someone know why? Is it due to some optimization done by the compiler?
Thank you all for your answers so far.
Just to add that the goal is actually not to optimize this piece of code (which is already quite optimized), but more to understand, from a compiler / CPU / architecture perspective, what could explain such a performance difference.
Your answers have been very helpful, thanks again!
It is difficult to check this in a micro-benchmark because you cannot say for sure how the code has been optimised without reading the generated machine code, and even then the CPU can do plenty of tricks to optimise it further, e.g. it turns the x86 code into RISC-style micro-operations before actually executing it.
A computation takes as little as one cycle and the CPU can perform up to three of them at once. An access to L1 cache takes 4 cycles and for L2, L3, main memory it takes 11, 40-75, 200 cycles.
Storing values to avoid a simple calculation is actually slower in many cases. BTW using division and modulus is quite expensive and caching this value can be worth it when micro-tuning your code.
The correct answer should be retrievable with a disassembler (or a .class -> .java decompiler), but my guess is that the compiler might have decided to get rid of refi altogether and to keep length - 1 in an auxiliary register.
I'm more of a c++ guy, but I would start by trying:
const int refi = length - 1;
inside the for loop. Also you should probably use
indexes[ refi ] = 1;
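In Java, the equivalent of that const would be a final local variable:

    final int refi = length - 1;   // Java has no const for locals; final expresses the same intent
    indexes[refi] = 1;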
Comparing running times of code does not give exact or guaranteed results
First of all, this is not the way to compare performance; a running-time analysis is needed here. Both pieces of code have the same loop structure, and their asymptotic running times are the same. You may get different running times when you run the code, but they mostly differ because of cache hits, I/O times, and thread and process scheduling. There is no guarantee that code always completes in an exact time.
However, there are still differences in your code; to understand them you should look into your CPU architecture. I can explain it basically in terms of the x86 architecture.
What happens behind the scenes?
i = refi;
The CPU loads refi and i into its registers from RAM; there are two accesses to RAM if the values are not in the cache, and the value of i will be written back to RAM. However, it always takes a different amount of time depending on thread and process scheduling. Furthermore, if the values are in virtual memory it will take longer.
i = length -1;
The CPU also accesses i and length from RAM or the cache; there is the same number of accesses. In addition, there is a subtraction here, which means extra CPU cycles. That is why you would expect this one to take longer to complete. That is the expectation in theory, but the issues I mentioned above explain why the measured times can come out differently.
Summary
As I explained, this is not the way to compare performance. I think there is no real difference between these pieces of code. There are lots of optimizations inside the CPU and also in the compiler. You can see the optimized code if you decompile the .class files.
My advice is to concentrate on Big-O running-time analysis first. If you find better algorithms, that is the best way of optimizing code. In case you still have bottlenecks in your code, you may then try micro-benchmarking.
See also
Analysis of algorithms
Big O notation
Microprocessor
Compiler optimization
CPU Scheduling
To start with, you can't really compare the performance by just running your program - micro benchmarking in Java is complicated.
Also, a subtraction on modern CPUs can take as little as a third of a clock cycle on average. On a 3GHz CPU, that is 0.1 nanoseconds. And nothing tells you that the subtraction actually happens as the compiler might have modified the code.
So:
You should try to check the generated assembly code (a sample command is sketched below).
If you really care about the performance, create an appropriate micro-benchmark.
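For the first point, on a HotSpot JVM the generated machine code can be dumped with the diagnostic PrintAssembly flag (this requires the hsdis disassembler library to be installed; the class name below is a placeholder):

    $ java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly RecoverPasswordBenchmark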
Can anyone explain why the following recursive method is faster than the iterative one (both are doing string concatenation)? Isn't the iterative approach supposed to beat the recursive one? Plus, each recursive call adds a new frame on top of the stack, which can be very space-inefficient.
private static void string_concat(StringBuilder sb, int count) {
    if (count >= 9999) return;
    string_concat(sb.append(count), count + 1);
}

public static void main(String[] arg) {
    long s = System.currentTimeMillis();
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 9999; i++) {
        sb.append(i);
    }
    System.out.println(System.currentTimeMillis() - s);

    s = System.currentTimeMillis();
    string_concat(new StringBuilder(), 0);
    System.out.println(System.currentTimeMillis() - s);
}
I ran the program multiple times, and the recursive one always ends up 3-4 times faster than the iterative one. What could be the main reason that makes the iterative one slower?
See my comments.
Make sure you learn how to properly microbenchmark. You should be timing many iterations of both and averaging these for your times. Aside from that, you should make sure the VM isn't giving the second an unfair advantage by not compiling the first.
In fact, the default HotSpot compilation threshold (configurable via -XX:CompileThreshold) is 10,000 invokes, which might explain the results you see here. HotSpot doesn't really do any tail optimizations so it's quite strange that the recursive solution is faster. It's quite plausible that StringBuilder.append is compiled to native code primarily for the recursive solution.
I decided to rewrite the benchmark and see the results for myself.
public final class AppendMicrobenchmark {

    static void recursive(final StringBuilder builder, final int n) {
        if (n > 0) {
            recursive(builder.append(n), n - 1);
        }
    }

    static void iterative(final StringBuilder builder) {
        for (int i = 10000; i >= 0; --i) {
            builder.append(i);
        }
    }

    public static void main(final String[] argv) {
        /* warm-up */
        for (int i = 200000; i >= 0; --i) {
            new StringBuilder().append(i);
        }

        /* recursive benchmark */
        long start = System.nanoTime();
        for (int i = 1000; i >= 0; --i) {
            recursive(new StringBuilder(), 10000);
        }
        System.out.printf("recursive: %.2fus\n", (System.nanoTime() - start) / 1000000D);

        /* iterative benchmark */
        start = System.nanoTime();
        for (int i = 1000; i >= 0; --i) {
            iterative(new StringBuilder());
        }
        System.out.printf("iterative: %.2fus\n", (System.nanoTime() - start) / 1000000D);
    }
}
Here are my results...
C:\dev\scrap>java AppendMicrobenchmark
recursive: 405.41us
iterative: 313.20us
C:\dev\scrap>java -server AppendMicrobenchmark
recursive: 397.43us
iterative: 312.14us
These are times for each approach averaged over 1000 trials.
Essentially, the problems with your benchmark are that it doesn't average over many trials (law of large numbers), and that it is highly dependent on the ordering of the individual benchmarks. The original result I was given for yours:
C:\dev\scrap>java StringBuilderBenchmark
80
41
This made very little sense to me. Recursion on the HotSpot VM is more than likely not going to be as fast as iteration because as of yet it does not implement any sort of tail optimization that you might find used for functional languages.
Now, the funny thing that happens here is that the default HotSpot JIT compilation threshold is 10,000 invokes. Your iterative benchmark will more than likely be executing for the most part before append is compiled. On the other hand, your recursive approach should be comparatively fast since it will more than likely enjoy append after it is compiled. To eliminate this from influencing the results, I passed -XX:CompileThreshold=0 and found...
C:\dev\scrap>java -XX:CompileThreshold=0 StringBuilderBenchmark
8
8
So, when it comes down to it, they're both roughly equal in speed. Note however that the iterative appears to be a bit faster if you average with higher precision. Order might still make a difference in my benchmark, too, as the latter benchmark will have the advantage of the VM having collected more statistics for its dynamic optimizations.
Trying to answer this question: What is the difference between instanceof and Class.isAssignableFrom(...)?
I made a performance test:
class A {}
class B extends A {}

A b = new B();

void execute() {
    boolean test = A.class.isAssignableFrom(b.getClass());
    // boolean test = A.class.isInstance(b);
    // boolean test = b instanceof A;
}

@Test
public void testPerf() {
    // Warm up the code
    for (int i = 0; i < 100; ++i)
        execute();
    // Time it
    int count = 100000;
    final long start = System.nanoTime();
    for (int i = 0; i < count; i++) {
        execute();
    }
    final long elapsed = System.nanoTime() - start;
    System.out.println(count + " iterations took " + TimeUnit.NANOSECONDS.toMillis(elapsed) + "ms.");
}
Which gave me:
A.class.isAssignableFrom(b.getClass()) : 100000 iterations took 15ms
A.class.isInstance(b) : 100000 iterations took 12ms
b instanceof A : 100000 iterations took 6ms
But playing with the number of iterations, I can see the performance is constant. For Integer.MAX_VALUE :
A.class.isAssignableFrom(b.getClass()) : 2147483647 iterations took 15ms
A.class.isInstance(b) : 2147483647 iterations took 12ms
b instanceof A : 2147483647 iterations took 6ms
Thinking it was a compiler optimization (I ran this test with JUnit), I changed it into this :
@Test
public void testPerf() {
    boolean test = false;
    // Warm up the code
    for (int i = 0; i < 100; ++i)
        test |= b instanceof A;
    // Time it
    int count = Integer.MAX_VALUE;
    final long start = System.nanoTime();
    for (int i = 0; i < count; i++) {
        test |= b instanceof A;
    }
    final long elapsed = System.nanoTime() - start;
    System.out.println(count + " iterations took " + TimeUnit.NANOSECONDS.toMillis(elapsed) + "ms. AVG= " + TimeUnit.NANOSECONDS.toMillis(elapsed / count));
    System.out.println(test);
}
But the performance is still "independent" of the number of iterations.
Could someone explain that behavior ?
A hundred iterations is not nearly enough for warmup. The default compile threshold is 10000 iterations (a hundred times more), so best go at least a bit over that threshold.
Once the compilation has been triggered, the world is not stopped; the compilation takes place in the background. That means that its effect will start being observable only after a slight delay.
There is ample space for optimization of your test in such a way that the entire loop is collapsed into its final result. That would explain the constant numbers.
Anyway, I always do the benchmarks by having an outer method call the inner method something like 10 times. The inner method does a big number of iterations, say 10,000 or more, as needed to make its runtime rise into at least tens of milliseconds. I don't even bother with nanoTime since if microsecond precision is important to you, it is just a sign of measuring too short a time interval.
When you do it like this, you are making it easy for the JIT to execute a compiled version of the inner method after it was substituted for the interpreted version. Another benefit is that you get assurance that the times of the inner method are stabilizing.
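A sketch of that pattern, reusing the execute() method from the question (the iteration counts are arbitrary, and both methods live on the same test class as execute()):

    // Outer loop: call the inner method ~10 times so that later rounds run the
    // JIT-compiled version and the reported times have a chance to stabilise.
    void benchmark() {
        for (int round = 0; round < 10; round++) {
            long start = System.currentTimeMillis();
            inner();
            System.out.println("round " + round + ": " + (System.currentTimeMillis() - start) + " ms");
        }
    }

    // Inner method: repeat the operation under test enough times that one
    // call to inner() takes at least tens of milliseconds.
    void inner() {
        for (int i = 0; i < 1000000; i++) {
            execute();
        }
    }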
If you want to make a real benchmark of a simple function, you should use a micro-benchmarking tool like Caliper. It will be much simpler than trying to make your own benchmark.
The JIT compiler can eliminate loops which don't do anything. This can be triggered after 10,000 iterations.
What I suspect you are timing is how long it takes for the JIT to detect that the loop doesn't do anything and remove it. This will be a little longer than it takes to do 10,000 iterations.
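A sketch of a variant that is harder to collapse: make each iteration test a different object, and use the accumulated result afterwards:

    // The objects alternate between an A and a B instance, so the instanceof
    // check is no longer loop-invariant and the loop body cannot simply be hoisted out.
    Object[] objs = { new A(), new B() };   // hypothetical mixed sample
    int hits = 0;
    for (int i = 0; i < count; i++) {
        if (objs[i % objs.length] instanceof B) hits++;
    }
    System.out.println(hits);   // using the result keeps the loop from being dead code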