This now very common algorithm question was asked by a proctor during a whiteboard exam session. My job was to observe, listen, and objectively judge the answers given, but I had no control over the question asked and could not interact with the person answering.
Five minutes were given to analyze the problem, during which the candidate could write bullet notes or pseudo-code. (Pseudo-code was also allowed during the actual code writing, as long as it was clearly marked as such; people who included pseudo-code as comments or TODO tasks before working out the algorithm got bonus points.)
"A child is climbing up a staircase with n steps, and can hop either 1 step, 2 steps, or 3 steps at a time. Implement a method to count how many possible ways the child can jump up the stairs."
The person who got this question couldn't get started on the recursive algorithm on the spot, so the proctor eventually led him, piece by piece, to his own solution, which in my opinion was not optimal (well, different from my chosen solution, making it difficult to grade someone objectively with respect to code optimization).
Proctor:
public class Staircase {
    public static int stairs;

    public Staircase() {
        int a = counting(stairs);
        System.out.println(a);
    }

    static int counting(int n) {
        if (n < 0)
            return 0;
        else if (n == 0)
            return 1;
        else
            return counting(n - 1) + counting(n - 2) + counting(n - 3);
    }

    public static void main(String[] args) {
        Staircase child;
        long t1 = System.nanoTime();
        for (int i = 0; i < 30; i++) {
            stairs = i;
            child = new Staircase();
        }
        System.out.println("Time:" + ((System.nanoTime() - t1) / 1000000));
    }
}
Mine:
public class Steps {
    public static int stairs;
    int c2 = 0;

    public Steps() {
        int a = step2(0);
        System.out.println(a);
    }

    public static void main(String[] args) {
        Steps steps;
        long t1 = System.nanoTime();
        for (int i = 0; i < 30; i++) {
            stairs = i;
            steps = new Steps();
        }
        System.out.println("Time:" + ((System.nanoTime() - t1) / 1000000));
    }

    public int step2(int c) {
        if (c + 1 < stairs) {
            if (c + 2 <= stairs) {
                if (c + 3 <= stairs) {
                    step2(c + 3);
                }
                step2(c + 2);
            }
            step2(c + 1);
        } else {
            c2++;
        }
        return c2;
    }
}
OUTPUT:
Proctor: Time: 356
Mine: Time: 166
Could someone clarify which algorithm is better or more optimal? The execution time of my algorithm appears to be less than half as long (though I am referencing and updating an additional integer, which I thought was rather inconsequential), and it allows for setting arbitrary starting and ending steps without needing to know their difference first (although for anything higher than n = 40 you will need a beast of a CPU).
My question (feel free to ignore the above example): how do you properly benchmark a similar recursion-based problem (Tower of Hanoi, etc.)? Do you just look at the timing, or do you take other things into consideration (heap usage?)?
Teaser: You may perform this computation easily in less than one millisecond. Details follow...
Which one is "better"?
The question of which algorithm is "better" may refer to the execution time, but also to other things, like the implementation style.
The Staircase implementation is shorter, more concise and IMHO more readable. More importantly, it does not involve state. The c2 variable that you introduced destroys the advantages (and beauty) of a purely functional recursive implementation. This could easily be fixed, although the implementation would then become rather similar to the Staircase one.
Measuring performance
Regarding the question about execution time: Properly measuring execution time in Java is tricky.
Related reading:
How do I write a correct micro-benchmark in Java?
Java theory and practice: Anatomy of a flawed microbenchmark
HotSpot Internals
In order to properly and reliably measure execution times, there exist several options. Apart from a profiler, like VisualVM, there are frameworks like JMH or Caliper, but admittedly, using them may be some effort.
For the simplest form of a very basic, manual Java Microbenchmark you have to consider the following:
Run the algorithms multiple times, to give the JIT a chance to kick in
Run the algorithms alternatingly and not only one after the other
Run the algorithms with increasing input size
Somehow save and print the results of the computation, to prevent the computation from being optimized away
Don't print anything to the console during the benchmark
Consider that timings may be distorted by the garbage collector (GC)
Again: These are only rules of thumb, and there may still be unexpected results (refer to the links above for more details). But with this strategy, you usually obtain a good indication about the performance, and at least can see whether it's likely that there really are significant differences between the algorithms.
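To make these rules of thumb concrete, here is a minimal sketch of such a manual micro-benchmark; the two workload methods are placeholders to be replaced by the algorithms under test:

```java
public class MiniBenchmark {
    // Placeholder workloads; substitute the algorithms under test.
    static long algoA(int n) { long s = 0; for (int i = 0; i < n; i++) s += i; return s; }
    static long algoB(int n) { long s = 0; for (int i = n - 1; i >= 0; i--) s += i; return s; }

    public static void main(String[] args) {
        int rounds = 10;
        int[] sizes = { 1_000, 10_000, 100_000, 1_000_000 };
        long[][] timesA = new long[rounds][sizes.length];
        long[][] timesB = new long[rounds][sizes.length];
        long sink = 0; // consume results so the JIT cannot remove the computation

        for (int round = 0; round < rounds; round++) {   // repeat, so the JIT can kick in
            for (int s = 0; s < sizes.length; s++) {     // increasing input sizes
                long t0 = System.nanoTime();
                sink += algoA(sizes[s]);                 // alternate A and B in each round
                long t1 = System.nanoTime();
                sink += algoB(sizes[s]);
                long t2 = System.nanoTime();
                timesA[round][s] = t1 - t0;
                timesB[round][s] = t2 - t1;
            }
        }
        // Print only afterwards, discarding the warm-up rounds.
        for (int round = rounds / 2; round < rounds; round++) {
            for (int s = 0; s < sizes.length; s++) {
                System.out.println("n=" + sizes[s]
                        + " A=" + timesA[round][s] + "ns B=" + timesB[round][s] + "ns");
            }
        }
        System.out.println("sink=" + sink);
    }
}
```

This is still far from a rigorous benchmark, but it follows the rules above: repeated alternating runs, growing input sizes, a sink variable so the computation cannot be optimized away, and no printing inside the timed region.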
The differences between the approaches
The Staircase implementation and the Steps implementation are not very different.
The main conceptual difference is that the Staircase implementation is counting down, and the Steps implementation is counting up.
The main difference that actually affects the performance is how the Base Case is handled (see Recursion on Wikipedia). In your implementation, you avoid calling the method recursively when it is not necessary, at the cost of some additional if statements. The Staircase implementation uses a very generic treatment of the base case, by just checking whether n < 0.
One could consider an "intermediate" solution that combines ideas from both approaches:
class Staircase2
{
    public static int counting(int n)
    {
        int result = 0;
        if (n >= 1)
        {
            result += counting(n - 1);
            if (n >= 2)
            {
                result += counting(n - 2);
                if (n >= 3)
                {
                    result += counting(n - 3);
                }
            }
        }
        else
        {
            result += 1;
        }
        return result;
    }
}
It's still recursive without state, and it sums up the intermediate results, avoiding many of the "useless" calls by using some if checks. It's already noticeably faster than the original Staircase implementation, but still a tad slower than the Steps implementation.
Why both solutions are slow
For both implementations, there's not really anything to be computed. The method consists of a few if statements and some additions. The most expensive thing here is actually the recursion itself, with the deeply nested call tree.
And that's the key point here: It's a call tree. Imagine what it is computing for a given number of steps, as a "pseudocode call hierarchy":
compute(5)
  compute(4)
    compute(3)
      compute(2)
        compute(1)
          compute(0)
        compute(0)
      compute(1)
        compute(0)
      compute(0)
    compute(2)
      compute(1)
        compute(0)
      compute(0)
    compute(1)
      compute(0)
  compute(3)
    compute(2)
      compute(1)
        compute(0)
      compute(0)
    compute(1)
      compute(0)
    compute(0)
  compute(2)
    compute(1)
      compute(0)
    compute(0)
One can imagine that this grows exponentially as the number becomes larger, and that the same results are computed hundreds, thousands or millions of times. This can be avoided.
The fast solution
The key idea to make the computation faster is to use Dynamic Programming. This basically means that intermediate results are stored for later retrieval, so that they don't have to be computed again and again.
It's implemented in this example, which also compares the execution time of all approaches:
import java.util.Arrays;

public class StaircaseSteps
{
    public static void main(String[] args)
    {
        for (int i = 5; i < 33; i++)
        {
            runStaircase(i);
            runSteps(i);
            runDynamic(i);
        }
    }

    private static void runStaircase(int max)
    {
        long before = System.nanoTime();
        long sum = 0;
        for (int i = 0; i < max; i++)
        {
            sum += Staircase.counting(i);
        }
        long after = System.nanoTime();
        System.out.println("Staircase up to " + max + " gives " + sum + " time " + (after - before) / 1e6);
    }

    private static void runSteps(int max)
    {
        long before = System.nanoTime();
        long sum = 0;
        for (int i = 0; i < max; i++)
        {
            sum += Steps.step(i);
        }
        long after = System.nanoTime();
        System.out.println("Steps up to " + max + " gives " + sum + " time " + (after - before) / 1e6);
    }

    private static void runDynamic(int max)
    {
        long before = System.nanoTime();
        long sum = 0;
        for (int i = 0; i < max; i++)
        {
            sum += StaircaseDynamicProgramming.counting(i);
        }
        long after = System.nanoTime();
        System.out.println("Dynamic up to " + max + " gives " + sum + " time " + (after - before) / 1e6);
    }
}
class Staircase
{
    public static int counting(int n)
    {
        if (n < 0)
            return 0;
        else if (n == 0)
            return 1;
        else
            return counting(n - 1) + counting(n - 2) + counting(n - 3);
    }
}

class Steps
{
    static int c2 = 0;
    static int stairs;

    public static int step(int c)
    {
        c2 = 0;
        stairs = c;
        return step2(0);
    }

    private static int step2(int c)
    {
        if (c + 1 < stairs)
        {
            if (c + 2 <= stairs)
            {
                if (c + 3 <= stairs)
                {
                    step2(c + 3);
                }
                step2(c + 2);
            }
            step2(c + 1);
        }
        else
        {
            c2++;
        }
        return c2;
    }
}
class StaircaseDynamicProgramming
{
    public static int counting(int n)
    {
        int[] results = new int[n + 1];
        Arrays.fill(results, -1);
        return counting(n, results);
    }

    private static int counting(int n, int[] results)
    {
        int result = results[n];
        if (result == -1)
        {
            result = 0;
            if (n >= 1)
            {
                result += counting(n - 1, results);
                if (n >= 2)
                {
                    result += counting(n - 2, results);
                    if (n >= 3)
                    {
                        result += counting(n - 3, results);
                    }
                }
            }
            else
            {
                result += 1;
            }
        }
        results[n] = result;
        return result;
    }
}
The results on my PC are as follows:
...
Staircase up to 29 gives 34850335 time 310.672814
Steps up to 29 gives 34850335 time 112.237711
Dynamic up to 29 gives 34850335 time 0.089785
Staircase up to 30 gives 64099760 time 578.072582
Steps up to 30 gives 64099760 time 204.264142
Dynamic up to 30 gives 64099760 time 0.091524
Staircase up to 31 gives 117897840 time 1050.152703
Steps up to 31 gives 117897840 time 381.293274
Dynamic up to 31 gives 117897840 time 0.084565
Staircase up to 32 gives 216847936 time 1929.43348
Steps up to 32 gives 216847936 time 699.066728
Dynamic up to 32 gives 216847936 time 0.089089
Small changes in the order of statements ("micro-optimizations") may have a small impact, or occasionally make a noticeable difference. But using an entirely different approach can make the real difference.
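As a footnote, the memoized recursion above can also be turned into a bottom-up loop, which removes the recursion entirely. This variant is my own sketch, not part of the comparison above:

```java
class StaircaseIterative {
    // Counts the ways to climb n steps with hops of 1, 2 or 3,
    // building up from the base case instead of recursing down.
    public static long counting(int n) {
        if (n < 0) return 0;
        long[] ways = new long[Math.max(n + 1, 3)];
        ways[0] = 1;                      // one way to be at step 0: do nothing
        if (n >= 1) ways[1] = 1;
        if (n >= 2) ways[2] = 2;
        for (int i = 3; i <= n; i++) {
            ways[i] = ways[i - 1] + ways[i - 2] + ways[i - 3];
        }
        return ways[n];
    }
}
```

For the staircase problem this computes the same sequence (1, 1, 2, 4, 7, 13, ...) with O(n) additions and no call stack to speak of.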
Related
My code for finding the numbers in the Fibonacci sequence is working correctly, but it is really slow (it is almost impossible to find numbers over n = 55). Where did I make a mistake?
public static void main(String[] args) {
    for (;;) {
        System.out.println("The value of the fibonacci sequence ");
        Scanner numer = new Scanner(System.in);
        long n = numer.nextLong();
        long result = Fib(n);
        System.out.println(result);
    }
}

private static long Fib(long n) {
    if (n <= 0) {
        System.out.println("Error");
        return 0;
    } else if (n == 1 || n == 2) {
        return 1;
    } else {
        return Fib(n - 1) + Fib(n - 2);
    }
}
Do you understand how many times you compute Fib(k) for each k? If you want your program to run fast, don't use recursion so blindly; use a loop instead:
private static long Fib(long n) {
    if (n <= 0) {
        System.out.println("Error");
        return 0;
    }
    if (n == 1 || n == 2) {
        return 1;
    }
    long prev = 1;
    long current = 1;
    for (int i = 3; i <= n; i++) {
        long next = prev + current;
        prev = current;
        current = next;
    }
    return current;
}
If you want to solve this problem only by recursion, then you need to store already-computed results. That can be done either by memoization or by lazy dynamic programming, and it's considerably harder than just using a loop.
Even with the improvements proposed by @Jason, this is still an algorithm with exponential time complexity, O(2^N).
To reduce the time complexity to O(N), also called linear time, I propose the following approach, called "memoization": we save each value we compute and return it directly in O(1) time whenever it is asked for again, so the stack frames will be far fewer and will grow at most linearly with the value of n.
public static void main(String[] args) {
    Scanner scanner = new Scanner(System.in);
    long n = scanner.nextLong(); // receive customer input
    HashMap<Long, Long> map = new HashMap<>();
    long result = Fib(n, map);
    System.out.println(result);
}

private static long Fib(long n, HashMap<Long, Long> map) {
    if (n <= 0) {
        return 0;
    }
    if (n == 1) {
        return 1;
    }
    if (map.containsKey(n)) {
        return map.get(n);
    }
    map.put(n, Fib(n - 1, map) + Fib(n - 2, map));
    return map.get(n);
}
You can compare the two solutions by submitting them to the Fibonacci exercise on LeetCode.
Both approaches (yours and a memoization approach similar to mine) are also explained under the "Solution" tab in the link above, where you can learn more about the time complexities.
I was running this bit of code to compare performance of 3 equivalent methods of calculating wraparound coordinates:
public class Test {
    private static final float MAX = 1000000;

    public static void main(String[] args) {
        long time = System.currentTimeMillis();
        for (float i = -MAX; i <= MAX; i++) {
            for (float j = 100; j < 10000; j++) {
                method1(i, j);
                //method2(i, j);
                //method3(i, j);
            }
        }
        System.out.println(System.currentTimeMillis() - time);
    }

    private static float method1(float value, float max) {
        value %= max + 1;
        return (value < 0) ? value + max : value;
    }

    private static float method2(float value, float max) {
        value %= max + 1;
        if (value < 0)
            value += max;
        return value;
    }

    private static float method3(float value, float max) {
        return ((value % max) + max) % max;
    }
}
I ran this code three times with method1, method2, and method3 (one method per test).
What was peculiar was that method1 spawned several Java processes and managed to utilize nearly 100% of my dual-core CPU, while method2 and method3 spawned only one process and therefore could only take up 25% of the CPU (since there are 4 virtual cores). Why would this happen?
Here are the details of my machine:
Java 1.6.0_65
Macbook Air 13" late 2013
After reviewing the various comments and failing to consistently replicate the behavior, I have decided that this is indeed a fluke of the JRE. Thanks to the comments, I have learned that benchmarking this way is not very useful, and that I should use a micro-benchmarking framework such as Google's Caliper to test such minute differences.
Some resources I found:
Google's java micro-benchmark page
IBM page discussing a flawed benchmark and common pitfalls
Oracle's page on micro-benchmarking
Stackoverflow question on micro-benchmarking
I read in a couple of blogs that in Java the modulo/remainder operator is slower than bitwise AND. So I wrote the following program to test it.
public class ModuloTest {
    public static void main(String[] args) {
        final int size = 1024;
        int index = 0;

        long start = System.nanoTime();
        for (int i = 0; i < Integer.MAX_VALUE; i++) {
            getNextIndex(size, i);
        }
        long end = System.nanoTime();
        System.out.println("Time taken by Modulo (%) operator --> " + (end - start) + "ns.");

        start = System.nanoTime();
        final int shiftFactor = size - 1;
        for (int i = 0; i < Integer.MAX_VALUE; i++) {
            getNextIndexBitwise(shiftFactor, i);
        }
        end = System.nanoTime();
        System.out.println("Time taken by bitwise AND --> " + (end - start) + "ns.");
    }

    private static int getNextIndex(int size, int nextInt) {
        return nextInt % size;
    }

    private static int getNextIndexBitwise(int size, int nextInt) {
        return nextInt & size;
    }
}
But in my runtime environment (MacBook Pro 2.9 GHz i7, 8 GB RAM, JDK 1.7.0_51) I am seeing otherwise: the bitwise AND is significantly slower, in fact twice as slow as the remainder operator.
I would appreciate it if someone could help me understand whether this is intended behavior or I am doing something wrong.
Thanks,
Niranjan
Your code reports bitwise-and being much faster on each Mac I've tried it on, both with Java 6 and Java 7. I suspect the first portion of the test on your machine happened to coincide with other activity on the system. You should try running the test multiple times to verify you aren't seeing distortions based on that. (I would have left this as a 'comment' rather than an 'answer', but apparently you need 50 reputation to do that -- quite silly, if you ask me.)
For starters, the bitwise-AND trick only works with natural-number dividends and power-of-2 divisors. So if you need negative dividends, floats, or divisors that are not powers of 2, stick with the default % operator.
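The negative-dividend restriction is easy to demonstrate: Java's % returns a remainder with the sign of the dividend, while the mask always yields a non-negative result. A small sketch:

```java
public class MaskVsMod {
    public static void main(String[] args) {
        int size = 8;                   // a power of two
        int mask = size - 1;
        System.out.println(13 % size);  // 5
        System.out.println(13 & mask);  // 5  -> identical for non-negative dividends
        System.out.println(-13 % size); // -5 -> Java's remainder keeps the dividend's sign
        System.out.println(-13 & mask); // 3  -> the mask gives the floor-mod result instead
    }
}
```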
My tests (with JIT warmup and 1M random iterations), on an i7 with a ton of cores and a busload of RAM, show about 20% better performance from the bitwise operation. This can vary per run, depending on how the process scheduler runs the code.
using Scala 2.11.8 on JDK 1.8.91
4Ghz i7-4790K, 8 core AMD, 32GB PC3 19200 ram, SSD
This example in particular will always give you a wrong result. Moreover, I believe any program calculating modulo by a power of 2 will be faster than the equivalent bitwise AND.
REASON: When you use N % X where X is the kth power of 2, only the last k bits are considered for the modulo, whereas in the case of the bitwise AND operator the runtime actually has to visit each bit of the number in question.
Also, I would like to point out that the HotSpot JVM optimizes repetitive calculations of a similar nature (branch prediction being one example). In your case, the method using modulo simply returns the last 10 bits of the number, because 1024 is the 10th power of 2.
Try using some prime number value for size and check the same result.
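To illustrate the suggestion, here is a small hypothetical check (not part of the original benchmark): with a power-of-two size the two expressions agree for non-negative inputs, but with a prime size the masked version no longer computes a remainder at all.

```java
public class SizeCheck {
    static int viaMod(int n, int size)  { return n % size; }
    static int viaMask(int n, int mask) { return n & mask; }

    public static void main(String[] args) {
        // Power of two: n % 1024 equals n & 1023 for non-negative n.
        System.out.println(viaMod(1500, 1024) + " vs " + viaMask(1500, 1023)); // 476 vs 476
        // Prime size: the mask trick silently computes something else.
        System.out.println(viaMod(1500, 1021) + " vs " + viaMask(1500, 1020)); // 479 vs 476
    }
}
```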
Disclaimer: Micro-benchmarking is not considered good practice.
Is this method missing something?
public static void oddVSmod() {
    float tests = 100000000;
    oddbit(tests);
    modbit(tests);
}

public static void oddbit(float tests) {
    for (int i = 0; i < tests; i++)
        if ((i & 1) == 1) { System.out.print(" " + i); }
    System.out.println();
}

public static void modbit(float tests) {
    for (int i = 0; i < tests; i++)
        if ((i % 2) == 1) { System.out.print(" " + i); }
    System.out.println();
}
With that, I used NetBeans' built-in profiler (advanced mode) to run this. I set the tests variable up to 10×10^8, and every time it showed that modulo is faster than bitwise.
Thank you all for valuable inputs.
@pamphlet: Thank you very much for the concerns, but negative comments are fine with me. I confess that I did not do proper testing as suggested by AndyG. AndyG could have used a softer tone, but it's okay; sometimes negatives help in seeing the positives. :)
That said, I changed my code (as shown below) in a way that I can run that test multiple times.
public class ModuloTest {
    public static final int SIZE = 1024;

    public int usingModuloOperator(final int operand1, final int operand2) {
        return operand1 % operand2;
    }

    public int usingBitwiseAnd(final int operand1, final int operand2) {
        return operand1 & operand2;
    }

    public void doCalculationUsingModulo(final int size) {
        for (int i = 0; i < Integer.MAX_VALUE; i++) {
            usingModuloOperator(i, size);
        }
    }

    public void doCalculationUsingBitwise(final int size) {
        for (int i = 0; i < Integer.MAX_VALUE; i++) {
            usingBitwiseAnd(i, size);
        }
    }

    public static void main(String[] args) {
        final ModuloTest moduloTest = new ModuloTest();
        final int invocationCount = 100;
        // testModuloOperator(moduloTest, invocationCount);
        testBitwiseOperator(moduloTest, invocationCount);
    }

    private static void testModuloOperator(final ModuloTest moduloTest, final int invocationCount) {
        for (int i = 0; i < invocationCount; i++) {
            final long startTime = System.nanoTime();
            moduloTest.doCalculationUsingModulo(SIZE);
            final long timeTaken = System.nanoTime() - startTime;
            System.out.println("Using modulo operator // Time taken for invocation counter " + i + " is " + timeTaken + "ns");
        }
    }

    private static void testBitwiseOperator(final ModuloTest moduloTest, final int invocationCount) {
        for (int i = 0; i < invocationCount; i++) {
            final long startTime = System.nanoTime();
            moduloTest.doCalculationUsingBitwise(SIZE);
            final long timeTaken = System.nanoTime() - startTime;
            System.out.println("Using bitwise operator // Time taken for invocation counter " + i + " is " + timeTaken + "ns");
        }
    }
}
I called testModuloOperator() and testBitwiseOperator() in a mutually exclusive way. The result was consistent with the idea that bitwise AND is faster than the modulo operator. I ran each of the calculations 100 times and recorded the execution times, then removed the first five and last five recordings and used the rest to compute the average time. Below are my test results.
Using modulo operator, the avg. time for 90 runs: 8388.89ns.
Using bitwise-AND operator, the avg. time for 90 runs: 722.22ns.
Please suggest if my approach is correct or not.
Thanks again.
Niranjan
Possible Duplicate:
How slow are Java exceptions?
The following two programs take approximately the same amount of time to run:
public class Break {
    public static void main(String[] argv) {
        long r = 0, s = 0, t = 0;
        for (long x = 10000000; x > 0; x--) {
            long y = x;
            while (y != 1) {
                if (y == 0)
                    throw new AssertionError();
                try2: {
                    try1: {
                        for (;;) {
                            r++;
                            if (y % 2 == 0)
                                break try1;
                            y = y * 3 + 1;
                        }
                    } /*catch(Thr _1)*/ {
                        for (;;) {
                            s++;
                            if (y % 2 == 1)
                                break try2;
                            y = y / 2;
                        }
                    }
                } /*catch(Thr _2)*/ {
                    t++;
                }
            }
        }
        System.out.println(r + ", " + s + ", " + t);
    }
}
public class Try {
    private static class Thr extends Throwable {}
    private static final Thr thrown = new Thr();

    public static void main(String[] argv) {
        long r = 0, s = 0, t = 0;
        for (long x = 10000000; x > 0; x--) {
            long y = x;
            while (y != 1) {
                try {
                    if (y == 0)
                        throw new AssertionError();
                    try {
                        for (;;) {
                            r++;
                            if (y % 2 == 0)
                                throw thrown;
                            y = y * 3 + 1;
                        }
                    } catch (Thr _1) {
                        for (;;) {
                            s++;
                            if (y % 2 == 1)
                                throw thrown;
                            y = y / 2;
                        }
                    }
                } catch (Thr _2) {
                    t++;
                }
            }
        }
        System.out.println(r + ", " + s + ", " + t);
    }
}
$ for x in Break Try; do echo $x; time java $x; done
Break
1035892632, 1557724831, 520446316
real 0m10.733s
user 0m10.719s
sys 0m0.016s
Try
1035892632, 1557724831, 520446316
real 0m11.218s
user 0m11.204s
sys 0m0.017s
But the time taken by the next two programs is relatively different:
public class Return {
    private static int tc = 0;

    public static long find(long value, long target, int depth) {
        if (depth > 100)
            return -1;
        if (value % 100 == target % 100) {
            tc++;
            return depth;
        }
        long r = find(target, value * 29 + 4221673238171300827L, depth + 1);
        return r != -1 ? r : find(target, value * 27 + 4494772161415826936L, depth + 1);
    }

    public static void main(String[] argv) {
        long s = 0;
        for (int x = 0; x < 1000000; x++) {
            long r = find(0, x, 0);
            if (r != -1)
                s += r;
        }
        System.out.println(s + ", " + tc);
    }
}
public class Throw {
    public static class Found extends Throwable {
        // note the static!
        public static int value = 0;
    }
    private static final Found found = new Found();
    private static int tc = 0;

    public static void find(long value, long target, int depth) throws Found {
        if (depth > 100)
            return;
        if (value % 100 == target % 100) {
            Found.value = depth;
            tc++;
            throw found;
        }
        find(target, value * 29 + 4221673238171300827L, depth + 1);
        find(target, value * 27 + 4494772161415826936L, depth + 1);
    }

    public static void main(String[] argv) {
        long s = 0;
        for (int x = 0; x < 1000000; x++) {
            try {
                find(0, x, 0);
            } catch (Found _) {
                s += found.value;
            }
        }
        System.out.println(s + ", " + tc);
    }
}
$ for x in Return Throw; do echo $x; time java $x; done
Return
84227391, 1000000
real 0m2.437s
user 0m2.429s
sys 0m0.017s
Throw
84227391, 1000000
real 0m9.251s
user 0m9.215s
sys 0m0.014s
I'd imagine a simple try/throw/catch mechanism would look sort of like an at least-partially-tail-call-optimised return (so it is directly known where control should return to (the closest catch)), but, of course, the JRE implementation does a lot of optimisation.
Why is there a large difference in the latter but not the former? Will it be because control flow analysis determines the former two programs to be pretty much the same and actual try/throw/catches are particularly slow, or because Return's find is unfolded to some level into something that avoids method calls while Throw's one can't be, or ..?
Thanks.
Edit: this question seems different to me from How slow are Java exceptions? because the latter doesn't ask why there is such a large difference in a case similar to this one. It also disregards the fact that time is spent creating the exception object (which, unless fillInStackTrace is overridden, includes something like traversing the stack and creating an array for it). It does, however, apparently answer part of my question: "Will it be because control flow analysis determines the former two programs to be pretty much the same?" Though it seems odd that this was mentioned in the answer that does fill in the stack trace (which would presumably dwarf any actual throw/catch operation, unless it does some complicated analysis to determine that the stack is never seen, which would make @Stephen's answer odd).
Your benchmarks don't take account of JVM warm-up effects, so there is considerable doubt that the results you are seeing truly indicate how try / break / return would perform in a real program.
(You should declare each timed test in a method, and call the methods a number of times. Then you discard the output from the first few calls ... or until the numbers stabilize ... to eliminate the once-off costs of JIT compilation, class loading and so on from the figures.)
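A minimal sketch of that structure, with a placeholder workload standing in for the timed tests:

```java
public class WarmedTiming {
    // Stand-in workload; replace with the Break/Try bodies under test.
    static long workload() {
        long acc = 0;
        for (int i = 0; i < 1_000_000; i++) acc += i % 7;
        return acc;
    }

    public static void main(String[] args) {
        long sink = 0;
        long[] times = new long[15];
        for (int run = 0; run < times.length; run++) {
            long t0 = System.nanoTime();
            sink += workload();              // keep the result so it can't be optimized away
            times[run] = System.nanoTime() - t0;
        }
        // Report only the later runs, after JIT compilation has settled.
        for (int run = 5; run < times.length; run++) {
            System.out.println("run " + run + ": " + times[run] + " ns");
        }
        System.out.println("sink=" + sink);
    }
}
```

With output like this it is usually easy to see at which run the numbers stabilize, and to discard everything before that point.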
If you really want to find out what is going on, you should get the JIT compiler to dump the native code it generates for each case.
I suspect that you will find that in the first case the JIT compiler is turning the throw / catch within the method into a simple branch instruction. In the second case, it is likely that the JIT compiler is generating more complicated code ... presumably because it doesn't recognize this as equivalent to a branch.
Why the difference? Well, there is a cost/benefit trade-off for each optimization attempted by the JIT optimizer. Each new optimization supported by the JIT compiler has an implementation and maintenance cost, and at runtime the compiler needs to examine the code it is compiling to see whether the preconditions for the optimization are met. If they are, the optimization can be performed; otherwise the JIT compiler has wasted its time.
Now in these examples, we have an exception that is (in one case) thrown and caught in the same method, and (in the other case) propagated across a method call/return boundary. In the former example, sufficient preconditions for the optimization and the (probable) optimized code sequence are both pretty straightforward. In the latter example, the optimizer has to deal with the possibility that the methods that throw and catch the exception are in different compilation units (and hence one could be reloaded), the possibility of overriding, and so on. In addition, the generated code sequence would be significantly more complicated: a non-standard return-from-call sequence followed by a branch.
So my theory is that the JIT compiler writers didn't think that the more complicated optimization would pay off. And given that most people don't write Java code like that, they are probably right.
So basically I needed to optimize this piece of code today. It tries to find the longest sequence produced by some function for the first million starting numbers:
public static void main(String[] args) {
    int mostLen = 0;
    int mostInt = 0;
    long currTime = System.currentTimeMillis();
    for (int j = 2; j <= 1000000; j++) {
        long i = j;
        int len = 0;
        while ((i = next(i)) != 1) {
            len++;
        }
        if (len > mostLen) {
            mostLen = len;
            mostInt = j;
        }
    }
    System.out.println(System.currentTimeMillis() - currTime);
    System.out.println("Most len is " + mostLen + " for " + mostInt);
}

static long next(long i) {
    if (i % 2 == 0) {
        return i / 2;
    } else {
        return i * 3 + 1;
    }
}
My mistake was to try to introduce multithreading:
void doSearch() throws ExecutionException, InterruptedException {
    final int numProc = Runtime.getRuntime().availableProcessors();
    System.out.println("numProc = " + numProc);
    ExecutorService executor = Executors.newFixedThreadPool(numProc);
    long currTime = System.currentTimeMillis();
    List<Future<ValueBean>> list = new ArrayList<Future<ValueBean>>();
    for (int j = 2; j <= 1000000; j++) {
        MyCallable<ValueBean> worker = new MyCallable<ValueBean>();
        worker.setBean(new ValueBean(j, 0));
        Future<ValueBean> f = executor.submit(worker);
        list.add(f);
    }
    System.out.println(System.currentTimeMillis() - currTime);

    int mostLen = 0;
    int mostInt = 0;
    for (Future<ValueBean> f : list) {
        final int len = f.get().getLen();
        if (len > mostLen) {
            mostLen = len;
            mostInt = f.get().getNum();
        }
    }
    executor.shutdown();
    System.out.println(System.currentTimeMillis() - currTime);
    System.out.println("Most len is " + mostLen + " for " + mostInt);
}
public class MyCallable<T> implements Callable<ValueBean> {
    public ValueBean bean;

    public void setBean(ValueBean bean) {
        this.bean = bean;
    }

    public ValueBean call() throws Exception {
        long i = bean.getNum();
        int len = 0;
        while ((i = next(i)) != 1) {
            len++;
        }
        return new ValueBean(bean.getNum(), len);
    }
}
public class ValueBean {
    int num;
    int len;

    public ValueBean(int num, int len) {
        this.num = num;
        this.len = len;
    }

    public int getNum() {
        return num;
    }

    public int getLen() {
        return len;
    }
}
long next(long i) {
    if (i % 2 == 0) {
        return i / 2;
    } else {
        return i * 3 + 1;
    }
}
Unfortunately, the multithreaded version worked 5 times slower than the single-threaded on 4 processors (cores).
Then I tried a somewhat cruder approach:
static int mostLen = 0;
static int mostInt = 0;

synchronized static void updateIfMore(int len, int intgr) {
    if (len > mostLen) {
        mostLen = len;
        mostInt = intgr;
    }
}

public static void main(String[] args) throws InterruptedException {
    long currTime = System.currentTimeMillis();
    final int numProc = Runtime.getRuntime().availableProcessors();
    System.out.println("numProc = " + numProc);
    ExecutorService executor = Executors.newFixedThreadPool(numProc);
    for (int i = 2; i <= 1000000; i++) {
        final int j = i;
        executor.execute(new Runnable() {
            public void run() {
                long l = j;
                int len = 0;
                while ((l = next(l)) != 1) {
                    len++;
                }
                updateIfMore(len, j);
            }
        });
    }
    executor.shutdown();
    executor.awaitTermination(30, TimeUnit.SECONDS);
    System.out.println(System.currentTimeMillis() - currTime);
    System.out.println("Most len is " + mostLen + " for " + mostInt);
}

static long next(long i) {
    if (i % 2 == 0) {
        return i / 2;
    } else {
        return i * 3 + 1;
    }
}
and it worked much faster, but it was still slower than the single-threaded approach.
I hope it's not because I screwed up the way I'm doing multithreading, but rather that this particular calculation/algorithm is not a good fit for parallel computation. If I change the calculation to make it more processor-intensive by replacing the method next with:
long next(long i) {
    Random r = new Random();
    for (int j = 0; j < 10; j++) {
        r.nextLong();
    }
    if (i % 2 == 0) {
        return i / 2;
    } else {
        return i * 3 + 1;
    }
}
both multithreaded versions start to execute more than twice as fast as the single-threaded version on a 4-core machine.
So clearly there must be some threshold you can use to determine whether it is worth introducing multithreading, and my question is:
What is the basic rule that would help decide whether a given calculation is intensive enough to be optimized by running it in parallel (without spending the effort to actually implement it)?
The key to efficiently implementing multithreading is to make sure the cost is not too high. There are no fixed rules, as they heavily depend on your hardware.
Starting and stopping threads has a high cost. Of course, you already used an executor service, which reduces these costs considerably because it uses a pool of worker threads to execute your Runnables. However, each Runnable still comes with some overhead. Reducing the number of Runnables and increasing the amount of work each one has to do will improve performance, but you still want enough Runnables for the executor service to distribute them efficiently over the worker threads.
You have chosen to create one Runnable for each starting value, so you end up creating 1,000,000 Runnables. You would probably get much better results if you let each Runnable handle a batch of, say, 1000 start values. That means you only need 1000 Runnables, greatly reducing the overhead.
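A sketch of that batching idea (the names are my own; the Collatz step is taken from the question's next method). Each task computes the best result within a contiguous range of start values, and the main thread merges the per-batch winners:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BatchedSearch {
    static long next(long i) { return (i % 2 == 0) ? i / 2 : i * 3 + 1; }

    // Computes the best {length, start} pair within [from, to).
    static long[] bestInRange(int from, int to) {
        int bestLen = 0, bestStart = from;
        for (int j = from; j < to; j++) {
            long i = j;
            int len = 0;
            while ((i = next(i)) != 1) len++;
            if (len > bestLen) { bestLen = len; bestStart = j; }
        }
        return new long[] { bestLen, bestStart };
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        List<Future<long[]>> futures = new ArrayList<>();
        int batch = 1000;                        // 1000 tasks instead of 1,000,000
        for (int from = 2; from <= 1_000_000; from += batch) {
            final int f = from;
            final int t = Math.min(from + batch, 1_000_001);
            Callable<long[]> task = () -> bestInRange(f, t);
            futures.add(pool.submit(task));
        }
        long bestLen = 0, bestStart = 0;
        for (Future<long[]> fut : futures) {     // merge the per-batch winners
            long[] r = fut.get();
            if (r[0] > bestLen) { bestLen = r[0]; bestStart = r[1]; }
        }
        pool.shutdown();
        System.out.println("Most len is " + bestLen + " for " + bestStart);
    }
}
```

No shared mutable state is touched inside the tasks, so there is no synchronization cost in the hot loop; the only coordination happens once per batch when the Future is read.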
I think there is another component to this which you are not considering. Parallelization works best when the units of work have no dependence on each other. Running a calculation in parallel is sub-optimal when later calculation results depend on earlier calculation results. The dependence could be strong in the sense of "I need the first value to compute the second value". In that case, the task is completely serial and later values cannot be computed without waiting for earlier computations. There could also be a weaker dependence in the sense of "If I had the first value I could compute the second value faster". In that case, the cost of parallelization is that some work may be duplicated.
This problem lends itself to being optimized without multithreading, because some of the later values can be computed faster if you already have the previous results in hand. Take, for example, j == 4. One pass through the inner loop produces i == 2, but you computed the result for j == 2 just two iterations ago; if you saved the value of len, you can compute it as len(4) = 1 + len(2).
Using an array to store previously computed values of len and a little bit-twiddling in the next method, you can complete the task more than 50x faster.
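A sketch of that caching idea (the array bound and names are my own; note that this counts every halving/tripling step, which is one more than the len convention in the question's loop):

```java
public class CachedSearch {
    // Returns {mostLen, mostInt}: the longest Collatz step count for start
    // values up to `limit`, caching each start value's length in an array.
    static int[] search(int limit) {
        int[] lenCache = new int[limit + 1]; // lenCache[k] = steps from k to 1 (0 = not computed)
        int mostLen = 0, mostInt = 0;
        for (int j = 2; j <= limit; j++) {
            long i = j;
            int steps = 0;
            // Walk until we reach 1, or hit a value whose length is already cached.
            while (i != 1 && (i > limit || lenCache[(int) i] == 0)) {
                i = (i % 2 == 0) ? i / 2 : i * 3 + 1;
                steps++;
            }
            int total = steps + (i == 1 ? 0 : lenCache[(int) i]);
            lenCache[j] = total;
            if (total > mostLen) { mostLen = total; mostInt = j; }
        }
        return new int[] { mostLen, mostInt };
    }

    public static void main(String[] args) {
        long currTime = System.currentTimeMillis();
        int[] r = search(1_000_000);
        System.out.println(System.currentTimeMillis() - currTime);
        System.out.println("Most len is " + r[0] + " for " + r[1]);
    }
}
```

Since j increases monotonically, every value below j is already cached when the walk dips below it, so most walks terminate after only a few steps.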
"Will the performance gain be greater than the cost of context switching and thread creation?"
That cost is very OS-, language-, and hardware-dependent; this question has some discussion about the cost in Java, along with some numbers and some pointers on how to calculate it.
You also want to have one thread per CPU, or fewer, for CPU-intensive work. Thanks to David Harkness for the pointer to a thread on how to work out that number.
Estimate the amount of work a thread can do without interacting with other threads (directly or via common data). If that piece of work can be completed in 1 microsecond or less, the overhead is too high and multithreading is of no use. If it takes 1 millisecond or more, multithreading should work well. If it's in between, experimental testing is required.
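One rough way to apply this rule is to time a sample of work units single-threaded before deciding. A sketch, using the question's sequence computation as the unit of work:

```java
public class WorkSizer {
    // The Collatz-length computation from the question, as one unit of work.
    static int unitOfWork(long j) {
        int len = 0;
        while (j != 1) {
            j = (j % 2 == 0) ? j / 2 : j * 3 + 1;
            len++;
        }
        return len;
    }

    public static void main(String[] args) {
        int sink = 0;
        // Warm up first, then time a larger sample of units.
        for (int j = 2; j < 10_000; j++) sink += unitOfWork(j);
        int samples = 100_000;
        long t0 = System.nanoTime();
        for (int j = 2; j < 2 + samples; j++) sink += unitOfWork(j);
        double nanosPerUnit = (System.nanoTime() - t0) / (double) samples;
        System.out.println("~" + nanosPerUnit + " ns per unit (sink=" + sink + ")");
        if (nanosPerUnit < 1_000) { // well under a microsecond: don't make one task per unit
            System.out.println("Too small for one task each; batch many units per Runnable.");
        }
    }
}
```

The thresholds printed here follow the rule of thumb above; for work this small, batching many units into each task is the practical answer.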