My program iterates over given data, and I have observed strange behavior: the first few samples the algorithm processes show slow run times, but the subsequent samples and iterations run at a nearly consistent (and noticeably lower) run time than the first few.
Why is this so? I even tried calling the function once outside of the iterating loop as a warm-up call, hoping that if the JVM was optimizing the code, it would do so during that warm-up call.
// warm up function call
warpInfo = warp.getDTW(testSet.get(startIndex), trainSet.get(0), distFn, windowSize);

this.startTime = System.currentTimeMillis();
for (int i = startIndex; i < endIndex; i++) {
    test = testSet.get(i); // current test instance (this assignment was missing from the snippet)
    for (int j = 0; j < trainSet.size(); j++) {
        train = trainSet.get(j);
        instStartTime = System.currentTimeMillis();
        warpInfo = warp.getDTW(test, train, distFn, windowSize);
        if (warpInfo.getWarpDistance() < bestDist) {
            bestDist = warpInfo.getWarpDistance();
            classPredicted = train.getTSClass();
        }
        instEndTime = System.currentTimeMillis();
        instProcessingTime = instEndTime - instStartTime;
        // record timing and results here
    }
    // record other information here
}
After searching, I came across a few links (an SO answer to a similar question and a Java performance comparison) which mention the option of specifying -XX:CompileThreshold=1 when running the program, to tell the JVM to compile a method to native code after that many invocations. Also see the Oracle page on HotSpot VM options.
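For reference, the flag is passed on the java command line when launching the program; a minimal example (MyBenchmark is a hypothetical main class):

java -XX:CompileThreshold=1 MyBenchmark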
After performing a test run with my program, I can state that it solves the problem.
**EDIT:** Based on Zarev's comments, I'd say I was wrong to post this as an answer. The CompileThreshold flag only allows for earlier compilation of the code, so it is not what causes the larger run times of the first iterations; in reality it may cause the program to be compiled without consideration for the profiling details the JIT would otherwise gather.
I'm running a simulation for several repetitions, and I'm trying to see if it would be possible to parallelise these repetitions to improve computation time.
Currently I simply run the simulation several times in a for loop, but ideally these could all be run at the same time, and then the results from each repetition saved to an array for processing.
Currently my code is something like
public static void getAllTheData() {
    int nReps = 10;
    double[][] allResults = new double[nReps][];
    // this for loop is what I want to parallelise
    for (int r = 0; r < nReps; r++) {
        // run the simulation to completion here
        simulation.doStuffToCompletion();
        allResults[r] = simulation.getValues();
    }
    // average allResults here and do further analysis
}
My question is, how can I run all these repetitions at the same time in parallel, and save the results from each parallel run in an allResults type array?
Thanks very much.
You can try parallel streams to get parallelism, with something like one of the lines below.
Stream.of(allResults).parallel().forEach(i -> processData(i));
Arrays.stream(allResults).parallel().forEach(i -> processData(i));
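If the goal is to parallelise the repetitions themselves rather than the post-processing, a parallel IntStream over the repetition indices is one option. Here is a minimal sketch, assuming a hypothetical Simulation class that can be constructed independently per repetition (sharing a single instance across threads would race):

import java.util.stream.IntStream;

public static void getAllTheDataParallel() {
    int nReps = 10;
    double[][] allResults = new double[nReps][];
    IntStream.range(0, nReps).parallel().forEach(r -> {
        Simulation simulation = new Simulation(); // one independent instance per repetition
        simulation.doStuffToCompletion();
        allResults[r] = simulation.getValues();   // each r writes a distinct slot, so no contention
    });
    // average allResults here and do further analysis
}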
Java code runs on a single thread by default; to do work in parallel you need to look into threads. I suggest looking into asynchronous function calls and the ways to do them in Java.
Here is a previous question asked on this matter: How to asynchronously call a method in Java
With an asynchronous call, you can run your simulation on however many threads you want and collect the data as your simulation runs finish.
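As a sketch of that approach, here is one way to do it with an ExecutorService and Futures, again assuming the same hypothetical Simulation class as above:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public static double[][] runRepsInParallel(int nReps) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(
            Runtime.getRuntime().availableProcessors());
    List<Future<double[]>> futures = new ArrayList<>();
    for (int r = 0; r < nReps; r++) {
        // each repetition runs as an independent task on the pool
        futures.add(pool.submit(() -> {
            Simulation simulation = new Simulation();
            simulation.doStuffToCompletion();
            return simulation.getValues();
        }));
    }
    double[][] allResults = new double[nReps][];
    for (int r = 0; r < nReps; r++) {
        allResults[r] = futures.get(r).get(); // blocks until repetition r finishes
    }
    pool.shutdown();
    return allResults;
}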
In Java I want to measure the time for:
1000 integer comparisons (the < operator),
1000 integer additions (a + b, each case for different a and b),
other simple operations.
I know I can do it in the following way:
Random rand = new Random();
long elapsedTime = 0;
for (int i = 0; i < 1000; i++) {
    int a = Integer.MIN_VALUE + rand.nextInt(Integer.MAX_VALUE);
    int b = Integer.MIN_VALUE + rand.nextInt(Integer.MAX_VALUE);
    long start = System.currentTimeMillis();
    if (a < b) {}
    long stop = System.currentTimeMillis();
    elapsedTime += (stop - start); // stop - start, not start - stop
}
System.out.println(elapsedTime);
I know that this question may seem somewhat unclear.
How do those values depend on my processor (i.e. what is the relation between the time for those operations and my processor) and on the JVM? Any suggestions?
I'm looking for understandable readings...
How do those values depend on my processor (i.e. what is the relation between the time for those operations and my processor) and on the JVM? Any suggestions?
It is not dependent on your processor, at least not directly.
Normally, when you run code enough times, the JVM will compile it to native code. When it does this, it removes code which doesn't do anything, so what you would really be measuring here is the time it takes to perform a System.currentTimeMillis() call, which is typically about 0.00003 ms. This means you will get 0 about 99.997% of the time and see a 1 very rarely.
I say normally, but in this case your code won't be compiled to native code, as the default threshold is 10,000 iterations. I.e. you would be testing how long it takes the interpreter to execute the byte code. This is much slower, but still a fraction of a millisecond, i.e. you have a higher chance of seeing a 1, but it's still unlikely.
If you want to learn more about low-level benchmarking in Java, I suggest you read about JMH and the author's blog http://shipilev.net/
If you want to see what machine code is generated from Java code, I suggest you try JITWatch.
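For a flavour of what that looks like, here is a minimal JMH sketch for the integer-comparison case (class and method names are mine; JMH must be on the classpath and the benchmark launched through its runner):

import java.util.Random;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class CompareBenchmark {
    int a, b;

    @Setup(Level.Iteration)
    public void setup() {
        Random rand = new Random();
        a = rand.nextInt();
        b = rand.nextInt();
    }

    @Benchmark
    public boolean lessThan() {
        // returning the result stops the JIT from eliminating the comparison
        return a < b;
    }
}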
This is the context of my program.
A function has a 50% chance to do nothing and a 50% chance to call itself twice.
What is the probability that the program will finish?
I wrote this piece of code, and it apparently works great. The answer, which may not be obvious to everyone, is that this program has a 100% chance to finish (the termination probability p satisfies p = 0.5 + 0.5·p², whose only solution is p = 1). But there is a StackOverflowError (how convenient ;) ) when I run this program, occurring in Math.random(). Could someone point out to me where it comes from, and tell me if my code is wrong?
static int bestDepth = 0;
static int numberOfPrograms = 0;

@Test
public void testProba() {
    for (int i = 0; i < 1000; i++) {
        long time = System.currentTimeMillis();
        bestDepth = 0;
        numberOfPrograms = 0;
        loop(0);
        LOGGER.info("Best depth:" + bestDepth + " in " + (System.currentTimeMillis() - time) + "ms");
    }
}

public boolean loop(int depth) {
    numberOfPrograms++;
    if (depth > bestDepth) {
        bestDepth = depth;
    }
    if (proba()) {
        return true;
    } else {
        return loop(depth + 1) && loop(depth + 1);
    }
}

public boolean proba() {
    return Math.random() > 0.5;
}
java.lang.StackOverflowError
    at java.util.Random.nextDouble(Random.java:394)
    at java.lang.Math.random(Math.java:695)
    ...
I suspect the stack, and the number of function calls it can hold, is limited, but I don't really see the problem here.
Any advice or clue is obviously welcome.
Fabien
EDIT: Thanks for your answers, I ran it with java -Xss4m and it worked great.
Whenever a function is called or a local variable is created, space for it is placed and reserved on the stack.
Now, it seems that you are recursively calling the loop function. Each call places the arguments and the return address on the stack, which means a lot of information accumulates there.
However, the stack is limited. The CPU has built-in mechanics that protect against issues where data pushed onto the stack would eventually overwrite the code itself (as the stack grows down). This is called a General Protection Fault. When that general protection fault happens, the OS notifies the currently running task. That is what surfaces here as the StackOverflowError.
This seems to be happening in Math.random().
In order to handle your problem, I suggest you increase the stack size using the -Xss option of Java.
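For example (the 8m value and the class name are illustrative):

java -Xss8m MyMainClass

Alternatively, a larger stack can be requested for a single thread via the four-argument Thread constructor; note that the stackSize argument is only a hint which some JVMs may ignore:

// run the recursion on a dedicated thread with a ~64 MB stack
new Thread(null, () -> loop(0), "deep-recursion", 64 * 1024 * 1024).start();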
As you said, the loop function calls itself recursively. Now, tail-recursive calls can be rewritten into loops by a compiler, so they occupy no stack space (this is called tail call optimization, TCO). Unfortunately, the Java compiler does not do that; also, your loop is not tail-recursive. Your options here are:
1. Increase the stack size, as suggested by the other answers. Note that this will just defer the problem further in time: no matter how large your stack is, its size is still finite. You just need a longer chain of recursive calls to break out of the space limit.
2. Rewrite the function in terms of loops.
3. Use a language which has a compiler that performs TCO.
3.1. You will still need to rewrite the function to be tail-recursive.
3.2. Or rewrite it with trampolines (only minor changes are needed). A good paper explaining trampolines and generalizing them further is called "Stackless Scala with Free Monads".
To illustrate the point in 3.2, here's how the rewritten function would look (in Scala, using scalaz's Trampoline):
import scalaz.Free.Trampoline
import scalaz.Trampoline._

def loop(depth: Int): Trampoline[Boolean] = {
  numberOfPrograms = numberOfPrograms + 1
  if (depth > bestDepth) {
    bestDepth = depth
  }
  if (proba()) done(true)
  else for {
    // suspend defers each recursive call, so the pending work is built up
    // on the heap instead of the JVM call stack
    r1 <- suspend(loop(depth + 1))
    r2 <- suspend(loop(depth + 1))
  } yield r1 && r2
}
The initial call would then be loop(0).run.
Increasing the stack size is a nice temporary fix. However, as proved by this post, though the loop() function is guaranteed to return eventually, the average stack depth required by loop() is infinite. Thus, no matter how much you increase the stack, your program will eventually run out of memory and crash.
There is nothing we can do to prevent this for certain; we always need to encode the stack in memory somehow, and we'll never have infinite memory. However, there is a way to reduce the amount of memory you're using by about 2 orders of magnitude. This should give your program a significantly higher chance of returning, rather than crashing.
We can do this by noticing that, at each layer in the stack, there's really only one piece of information we need to run your program: the piece that tells us whether we need to call loop() again or not after returning. Thus, we can emulate the recursion using a stack of bits. Each emulated stack frame will require only one bit of memory (right now it requires 64-96 times that, depending on whether you're running a 32- or 64-bit JVM).
The code would look something like this (though I don't have a Java compiler right now so I can't test it):
import java.util.BitSet;

static int bestDepth = 0;
static int numLoopCalls = 0;

public void emulateLoop() {
    // Our fake stack. We set a bit to 1 when that point on the stack still
    // needs a second call to loop() made, 0 when it doesn't.
    BitSet fakeStack = new BitSet();
    int currentDepth = 0; // int, not long: BitSet indices are ints
    numLoopCalls = 0;
    while (currentDepth >= 0) {
        numLoopCalls++;
        if (proba()) {
            // "Return" from the current function, going up the call stack until
            // we hit a point where loop() needs to be called a second time
            fakeStack.clear(currentDepth);
            while (!fakeStack.get(currentDepth)) {
                currentDepth--;
                if (currentDepth < 0) {
                    return;
                }
            }
            // At this point, we've hit a point where loop() needs to be called
            // a second time. Mark it as called, and call it.
            fakeStack.clear(currentDepth);
            currentDepth++;
        } else {
            // Need to call loop() twice, so we push a 1 and continue the while-loop
            fakeStack.set(currentDepth);
            currentDepth++;
            if (currentDepth > bestDepth) {
                bestDepth = currentDepth;
            }
        }
    }
}
This will probably be slightly slower, but it will use about 1/100th the memory. Note that the BitSet is stored on the heap, so there is no longer any need to increase the stack size to run this. If anything, you'll want to increase the heap size.
The downside of recursion is that it starts filling up your stack, which will eventually cause a StackOverflowError if your recursion is too deep. If you want to ensure that the test ends, you can increase your stack size using the answers given in the following Stack Overflow thread:
How to increase the Java stack size?
I'm trying to time the performance of my program by using System.currentTimeMillis() (or alternatively System.nanoTime()), and I've noticed that every time I run it, it gives a different result for the time it took to finish the task.
Even the straightforward test:
long totalTime;
long startTime;
long endTime;

startTime = System.currentTimeMillis();
for (int i = 0; i < 1000000000; i++) {
    for (int j = 0; j < 1000000000; j++) {
    }
}
endTime = System.currentTimeMillis();

totalTime = endTime - startTime;
System.out.println("Time: " + totalTime);
produces all sorts of different outputs, from 0 to 200. Can anyone say what I'm doing wrong or suggest an alternative solution?
The loop doesn't do anything, so you are timing how long it takes to detect that the loop is pointless.
Timing the loops more accurately won't help; you need to do something slightly useful to get repeatable results.
I suggest you try -server if you are running on 32-bit Windows.
A billion billion clock cycles takes about 10 years, so it's not really iterating that many times.
This is exactly the expected behavior -- it's supposed to get faster as you rerun the timing. As you rerun a method many times, the JIT devotes more effort to compiling it to native code and optimizing it; I would expect that after running this code for long enough, the JIT would eliminate the loop entirely, since it doesn't actually do anything.
The best and simplest way to get precise benchmarks on Java code is to use a tool like Caliper that "warms up" the JIT to encourage it to optimize your code fully.
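To make the warm-up idea concrete, here is a minimal hand-rolled sketch of what such tools automate (doWork() is just a stand-in workload; a real framework also handles forking, dead-code elimination, and statistics far more carefully):

public class WarmupDemo {
    static long doWork() {
        long s = 0;
        for (int i = 0; i < 1_000; i++) s += (long) i * i;
        return s;
    }

    public static void main(String[] args) {
        long sink = 0;
        // warm-up: enough iterations for the JIT to compile doWork()
        for (int i = 0; i < 20_000; i++) sink += doWork();
        // measured runs, timed with nanoTime()
        int runs = 10_000;
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) sink += doWork();
        long elapsed = System.nanoTime() - start;
        // print sink so the computation stays live and cannot be eliminated
        System.out.println("avg ns/op: " + (elapsed / (double) runs) + " (sink=" + sink + ")");
    }
}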
I have one problem that I can't explain. Here is the code in main function:
String numberStr = "3151312423412354315";
System.out.println(numberStr + "\n");
System.out.println("Lehman method: ");
long beginTime = System.currentTimeMillis();
System.out.println(Lehman.getFullFactorization(numberStr));
long finishTime = System.currentTimeMillis();
System.out.println((finishTime-beginTime)/1000. + " sec.");
System.out.println();
System.out.println("Lehman method: ");
beginTime = System.currentTimeMillis();
System.out.println(Lehman.getFullFactorization(numberStr));
finishTime = System.currentTimeMillis();
System.out.println((finishTime-beginTime)/1000. + " sec.");
In case it's relevant: the method Lehman.getFullFactorization(...) returns an ArrayList of the prime divisors in String format.
Here is the output:
3151312423412354315
Lehman method:
[5, 67, 24473, 384378815693]
0.149 sec.
Lehman method:
[5, 67, 24473, 384378815693]
0.016 sec.
I was surprised when I saw it. Why is the second execution of the same method much faster than the first? At first I thought the first timing included the JVM's startup and resource loading, but that's impossible, because the JVM obviously starts before the main method executes.
In some cases, Java's JIT compiler (see http://java.sun.com/developer/onlineTraining/Programming/JDCBook/perf2.html#jit) kicks in on the first execution of a method and performs optimizations of that method's code. That is supposed to make all subsequent executions faster. I think this might be what happens in your case.
Try doing it more than 10,000 times and it will be much faster. This is because the code first has to be loaded (expensive), then runs in interpreted mode (OK speed), and is finally compiled to native code (much faster).
Can you try this?
int runs = 100 * 1000;
for (int i = -20000 /* warmup */; i < runs; i++) {
    if (i == 0)
        beginTime = System.nanoTime();
    Lehman.getFullFactorization(numberStr);
}
finishTime = System.nanoTime();
System.out.println("Average time was " + (finishTime - beginTime) / 1e9 / runs + " sec.");
I suppose the JVM has cached the results (maybe partially) of the first calculation, and you observe a faster second calculation. JIT in action.
There are two things that make the second run faster.
The first time, the class containing the method must be loaded. The second time, it is already in memory.
Most importantly, the JIT optimizes code that is often executed: during the first call, the JVM starts by interpreting the byte code and then compiles it into machine code and continues the execution. The second time, the code is already compiled.
That's why micro-benchmarks in Java are often hard to validate.
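A tiny demonstration of the effect described above: time the same call repeatedly, and the first runs are typically slower than the later, JIT-compiled ones (exact numbers vary by JVM; fib() is just a stand-in workload):

public class JitWarmupEffect {
    static long fib(int n) { return n < 2 ? n : fib(n - 1) + fib(n - 2); }

    public static void main(String[] args) {
        for (int run = 1; run <= 5; run++) {
            long start = System.nanoTime();
            fib(30);
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.println("run " + run + ": " + ms + " ms"); // run 1 is usually the slowest
        }
    }
}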
My guess is that it's saved in the L1/L2 cache on the CPU for optimization.
Or Java doesn't have to interpret it again and recalls it from memory as already-compiled code.