Will a sequential algorithm written in Java execute faster (run from Eclipse) on a machine with 60 GB of RAM and 16 cores than on a dual-core machine with 16 GB of RAM? I expected the algorithm to run noticeably faster, but experiments on Google Compute Engine and on my laptop showed that this is not the case. I would appreciate it if someone could explain why this happens.
Java doesn't parallelize the code automatically for you; you need to do it yourself.
There are abstractions, like parallel streams, that give you concise parallelism, but the performance of your program is still governed by Amdahl's law. Having more memory helps you launch more threads and apply parallel algorithms that leverage more cores.
Example:
Arrays.sort is a sequential Dual-Pivot Quicksort that runs in O(n log n) time; its overall performance is governed by the clock rate.
Arrays.parallelSort is a parallel merge sort; it uses more space (so here memory is important), dividing the array into pieces, sorting each piece, and merging them.
But someone had to write this parallel sort in order to benefit from multicore machines.
What can be done automatically for you is a highly concurrent and parallel GC, which affects the overall performance of your program.
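For a feel for the difference, here is a minimal sketch contrasting the two (the array size is arbitrary, and the crude System.nanoTime() timing is only illustrative, not a rigorous benchmark):

import java.util.Arrays;
import java.util.Random;

public class SortComparison {
    public static void main(String[] args) {
        int[] data = new Random(42).ints(10_000_000).toArray();
        int[] copy = data.clone();

        long t0 = System.nanoTime();
        Arrays.sort(data);           // sequential dual-pivot quicksort, one core
        long t1 = System.nanoTime();
        Arrays.parallelSort(copy);   // parallel merge sort on the common ForkJoinPool
        long t2 = System.nanoTime();

        System.out.printf("sort: %d ms, parallelSort: %d ms%n",
                          (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
    }
}

On a multicore machine parallelSort typically wins for large arrays, while for small arrays the sequential sort is often faster because the fork/join overhead dominates.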
You are asking about a sequential algorithm, which by definition means there are no multiple threads, no parallelism, and no multi-processing involved in executing the code. Let's say the code is:
a = 5;
b = a + 5;
c = b + 5;
...
and so on...
We cannot execute any of the later lines early, because each depends on the value computed by the line before it.
A simple loop,
for i from 1 to 100 increment 1
a = a + i
will have to be executed 100 times, in order, because each iteration reads the value of a written by the previous one; running iterations out of order would change the result, so the loop cannot be naively parallelized.
Also, since you are not using threads in your code, and Java has no built-in automatic parallelization, there go your chances even if the code were somewhat parallelizable.
If it's a single-threaded piece of code, the system it runs on still has some influence on the execution time. This is reflected in the IPC (instructions per cycle):
https://en.wikipedia.org/wiki/Instructions_per_cycle
Your code will definitely run faster on a new system than on a ten-year-old one, but perhaps the single-thread difference between the two machines you mentioned is not significant.
Related
We are using Java 7 and working on a multithreaded data-crunching application. Due to certain constraints we are not using Spark or any other map-reduce approach to solve this problem. The idea of this project is to maximize the performance of the application using multi-threading.
My understanding is that at any given point, assuming the CPU is not running anything apart from the OS, the number of threads working simultaneously will equal the number of hardware threads (hyper-threads) the CPU provides. But there is also the Java GC, which will kick in every now and then, and we have to consider that as well.
Also, I am aware that if I create too many threads I will actually degrade performance because of the time spent in context switching.
The question is: what would be the best way to consider all these things and create the appropriate number of threads? Any idea or thought process? Is there any other process that I should consider?
The question is what would be the best way to consider all these things and create appropriate number of threads
I would use Java 8, which does this for you, e.g.
Results result = listOfWork.parallelStream()
                           .map(t -> t.doWork())
                           .collect(Collectors.reducing(.....));
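For instance, here is a runnable variant under the assumption that the "work" is squaring integers and the reduction is a sum (both are placeholders for whatever your tasks actually do):

import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class ParallelStreamDemo {
    public static void main(String[] args) {
        List<Integer> listOfWork =
                IntStream.rangeClosed(1, 1_000).boxed().collect(Collectors.toList());

        // The map step is spread over your cores via the common ForkJoinPool.
        int result = listOfWork.parallelStream()
                               .mapToInt(t -> t * t) // placeholder for doWork()
                               .sum();               // placeholder reduction

        System.out.println(result);
    }
}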
However if you are stuck on Java 7, you can use an ExecutorService.
int procs = Runtime.getRuntime().availableProcessors();
ExecutorService es = Executors.newFixedThreadPool(procs);
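A sketch of how you might feed work to that pool and collect the results on Java 7 (the squaring task is a stand-in for your real data crunching):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PoolDemo {
    public static void main(String[] args) throws Exception {
        int procs = Runtime.getRuntime().availableProcessors();
        ExecutorService es = Executors.newFixedThreadPool(procs);

        // Submit each unit of work; the pool runs up to 'procs' of them at once.
        List<Future<Long>> futures = new ArrayList<>();
        for (long i = 0; i < 1000; i++) {
            final long n = i;
            futures.add(es.submit(new Callable<Long>() {
                @Override
                public Long call() {
                    return n * n; // placeholder for your real crunching
                }
            }));
        }

        long total = 0;
        for (Future<Long> f : futures) {
            total += f.get(); // blocks until that particular task has finished
        }
        es.shutdown();
        System.out.println("total = " + total);
    }
}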
But there is java GC which will kick-in every now and then
Unless you are using a concurrent collector like CMS, the GC doesn't run at the same time as your threads, so it doesn't matter what those threads are doing (in terms of sizing your thread pool).
Is there any other process that I should consider?
If you have other processes on the machines which use the CPU a lot you should consider them.
I actually did research on this last semester. When using threads, a good rule of thumb for increasing the performance of CPU-bound processes is to use a number of threads equal to the number of cores, except on a hyper-threaded system, where one should use twice as many threads as cores. The other rule of thumb is for I/O-bound processes: quadruple the number of threads per core, except on a hyper-threaded system, where one can quadruple the number of threads per hardware thread.
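Expressed as code, those rules of thumb might look like the sketch below; note that Runtime.getRuntime().availableProcessors() reports logical processors, so on a hyper-threaded machine it already includes the doubling. Treat the numbers as starting points to measure against, not gospel.

// Starting points derived from the rules of thumb above.
int logicalProcs = Runtime.getRuntime().availableProcessors(); // hardware threads, not cores

int cpuBoundThreads = logicalProcs;     // one thread per hardware thread
int ioBoundThreads  = logicalProcs * 4; // ~4x, since I/O-bound threads mostly wait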
I am exploring OpenJDK JMH for benchmarking my code. As I understand it, JMH by default forks multiple JVMs in order to defend the test from previously collected "profiles", which is explained very well in this sample code.
However, my question is: what impact will it have on the results if I execute using the following two approaches:
1) 1 fork, 100 iterations
2) 10 forks, 10 iterations each
And which approach will give the more accurate result?
It depends. Multiple forks are needed to estimate run-to-run variance; see JMHSample_13_RunToRun. Therefore, a single fork is definitely worse. Then, if you ask which is better, 1x100 or 10x10, that again depends on which is the worse concern for you: run-to-run variance or in-run variance.
It depends on how much the results vary per fork vs. per iteration, which is workload specific.
If you want a rigorous statistical approach to figuring out this tradeoff, check out "Rigorous Benchmarking in Reasonable Time" (Kalibera, Jones). Equation 3 gives the optimal counts per level (in your case, these would be number of forks to run and number of iterations per fork) by using the observed variances between forks and between iterations.
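For reference, the two setups from the question map directly onto JMH annotations; a minimal sketch (the benchmark body is a placeholder):

import java.util.concurrent.ThreadLocalRandom;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Warmup;

public class ForkTradeoffBench {

    // Approach 2 from the question: 10 forks, 10 measurement iterations each.
    // Approach 1 would instead be @Fork(1) with @Measurement(iterations = 100).
    @Fork(10)
    @Warmup(iterations = 5)
    @Measurement(iterations = 10)
    @Benchmark
    public double placeholderWork() {
        // Placeholder workload; return the value so it is not dead-code eliminated.
        return Math.log(ThreadLocalRandom.current().nextDouble() + 1.0);
    }
}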
The title, I admit, is a bit misleading, but I am sort of confused about why this happens.
I've written a program in Java that takes an argument x and instantiates x threads to do the program's work. The machine I'm running it on has 8 cores and can handle 32 threads in parallel (each core has 4 hyperthreads). When I run the program with more than 8 threads (e.g. 22), I notice that it runs faster with an even number of threads than with 23 threads (which is actually slower). The performance difference between the two is about 10%. Why would this be? Thread overhead doesn't really account for this, and I would have thought that as long as I'm running fewer than 32 threads, increasing the number of threads should only make it faster.
To give you an idea of what the program is doing: it takes a 1000 x 1000 array, and each thread is assigned a portion of that array to update (when the division is uneven, the leftover rows go to the last thread instantiated).
Is there any good reason for the odd/even thread performance difference?
Two reasons I can imagine:
The need to synchronize the memory accesses of your cores/threads. This can keep invalidating CPU core caches, which brings down performance. Try giving the threads really disjoint tasks and don't let them work on the same part of the array; memory isn't managed in individual bytes but in whole cache lines, so threads writing to neighbouring elements can still interfere with each other (false sharing).
Hyper-threaded CPUs often don't deliver full per-thread performance. The logical threads may, for example, have to share some floating-point units. This doesn't matter when, say, one thread is integer-math heavy and the other is float heavy, but having four threads all needing the floating-point units probably means waiting, switching contexts, signalling the other thread, switching context back, waiting again...
These are just two guesses. To get more than guesses, you should have given the actual CPU you are using, the partitioning scheme you are using, and a more detailed description of the computational task.
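To make the "disjoint tasks" advice concrete, here is a sketch of partitioning a 1000 x 1000 array by whole rows, so each thread writes only to its own rows; the thread count and the update itself are placeholders, and the leftover rows go to the last thread as described in the question:

import java.util.ArrayList;
import java.util.List;

public class RowPartition {
    public static void main(String[] args) throws InterruptedException {
        final int[][] grid = new int[1000][1000];
        final int nThreads = 8; // placeholder; try even vs. odd counts

        List<Thread> threads = new ArrayList<>();
        int rowsPerThread = grid.length / nThreads;
        for (int t = 0; t < nThreads; t++) {
            final int from = t * rowsPerThread;
            // The last thread picks up the leftover rows, as in the question.
            final int to = (t == nThreads - 1) ? grid.length : from + rowsPerThread;
            Thread worker = new Thread(() -> {
                for (int r = from; r < to; r++) {
                    for (int c = 0; c < grid[r].length; c++) {
                        grid[r][c] += r + c; // placeholder update
                    }
                }
            });
            threads.add(worker);
            worker.start();
        }
        for (Thread t : threads) t.join();
    }
}

Because each thread owns a disjoint set of rows (separate inner arrays), no two threads write to the same cache line of data.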
I'm trying to take a typical "A/B testing" approach to two different implementations of a real-life algorithm, using the same data set in both cases. The algorithm is deterministic in terms of execution, so I really expect the results to be repeatable.
On the Core 2 Duo this is indeed the case: using just the Linux "time" command, I get variations in execution time of around 0.1% (over 10 runs).
On the i7 I get all sorts of variation, easily 30% up or down from the average. I assume this is due to the various CPU optimizations the i7 performs (dynamic overclocking, etc.), but it makes this kind of testing really hard. Is there any other way to determine which of the two algorithms is "best"? Are there other sensible metrics I can use?
Edit: The algorithm does not run for very long, and that short-lived execution is exactly the real-life scenario I'm trying to benchmark, so running it repeatedly is not really an option.
See if you can turn off dynamic overclocking in your BIOS. Also, kill all other processes that might run while you are benchmarking.
Well, you could use O-notation principles to reason about the performance of the algorithms. This gives you an algorithm's theoretical scaling behaviour.
http://en.wikipedia.org/wiki/Big_O_notation
If you absolutely must know the real-life speed of the algorithm, then of course you must benchmark it on a system. But using O-notation you can see past all that and focus only on the factors/variables that actually matter.
You didn't indicate how you're benchmarking. You might want to read this if you haven't yet: How do I write a correct micro-benchmark in Java?
If you're running a sustained test, I doubt dynamic clocking is causing your variations; the CPU should stay at its maximum turbo speed. If it runs long enough it may drop one multiplier for heat, though I doubt that unless you're overclocking and near the thermal envelope.
Hyper-Threading might be playing a role. You could disable that in your BIOS and see if it makes a difference in your numbers.
On Linux you can lock the CPU speed to stop clock-speed variation. ;)
You need to make the benchmark as realistic as possible. For example, if you run an algorithm flat out and take an average, you might get very different results from performing the same tasks every 10 ms. I have seen 2x to 10x variation between flat out and relatively low load, even with a locked clock speed.
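A sketch of those two load patterns, with a placeholder task; the flat-out loop and the 10 ms pacing are the two scenarios compared above:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class PacingDemo {
    static volatile long sink; // keeps the JIT from eliding the work

    static void task() {
        long acc = 0;
        for (int i = 0; i < 100_000; i++) acc += i * 31; // placeholder work
        sink = acc;
    }

    public static void main(String[] args) throws Exception {
        // Flat out: back-to-back runs keep caches and clocks "hot".
        for (int i = 0; i < 1_000; i++) {
            long t0 = System.nanoTime();
            task();
            System.out.println("flat-out ns: " + (System.nanoTime() - t0));
        }

        // Paced: one run every 10 ms, closer to a low-load production pattern.
        ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
        ses.scheduleAtFixedRate(() -> {
            long t0 = System.nanoTime();
            task();
            System.out.println("paced ns: " + (System.nanoTime() - t0));
        }, 0, 10, TimeUnit.MILLISECONDS);
        Thread.sleep(10_000);
        ses.shutdown();
    }
}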
I'm working on a system at the moment. It's a complex system but it boils down to a Solver class with a method like this:
public int solve(int problem); // returns the solution, or 0 if no solution found
Now, when the system is up and running, a run time of about 5 seconds for this method is expected and is perfectly fast enough. However, I plan to run some tests that look a bit like this:
List<Integer> problems = getProblems();
List<Integer> solutions = new ArrayList<Integer>(problems.size());
Solver solver = getSolver();
for (int problem: problems) {
solutions.add(solver.solve(problem));
}
// see what percentage of solutions are zero
// get arithmetic mean of non-zero solutions
// etc etc
The problem is that I want to run this on a large number of problems and don't want to wait forever for the results. Say I have a million test problems and I want the tests to complete in the time it takes me to make a cup of tea. I have two questions:
Say I have a million-core processor, that instances of Solver are thread-safe but use no locking (they're immutable or something), and that all their computation is in memory (i.e. there's no disk, network, or other I/O going on). Can I just replace the solutions list with a thread-safe list, kick off threads to solve each problem, and expect it to be faster? How much faster? Can it run in 5 seconds?
Is there a decent cloud computing service out there for Java where I can buy 5 million seconds of time and get this code to run in five seconds? What do I need to do to prepare my code for running on such a cloud? How much does 5 million seconds cost anyway?
Thanks.
You have expressed your problem with two major points of serialisation: problem production and solution consumption (currently expressed as Lists of Integers). You want to start getting the first problems as soon as you can (currently you won't get any until all the problems are produced).
I am also assuming that the problem list order correlates with the solution list order (that is, solutions.get(3) is the solution for problems.get(3)), which would be a huge problem for parallelising it. You'd be better off having a Pair<P, S> of problem/solution so you don't need to maintain the correlation.
Parallelising the solver method will not be difficult, although exactly how you do it will depend a lot on the compute cost of each solve call (generally, the more expensive the call, the lower the relative overhead of parallelising it, so if calls are very cheap you need to batch them). If you end up with a distributed solution you'll have much higher costs, of course. The Executor framework and the fork/join extensions are a great starting point.
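As a sketch of that starting point, here is one way to use the Executor framework with an ExecutorCompletionService so each problem/solution Pair is consumed as soon as it is solved, removing the ordering constraint (the Pair class is an assumption; Solver is as declared in the question):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelSolve {
    // Solver as declared in the question.
    interface Solver { int solve(int problem); }

    // Hypothetical problem/solution pairing, as suggested above.
    static final class Pair {
        final int problem, solution;
        Pair(int problem, int solution) { this.problem = problem; this.solution = solution; }
    }

    static List<Pair> solveAll(final Solver solver, List<Integer> problems)
            throws InterruptedException, ExecutionException {
        ExecutorService es =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        CompletionService<Pair> cs = new ExecutorCompletionService<>(es);

        for (final int p : problems) {
            cs.submit(() -> new Pair(p, solver.solve(p))); // solver must be thread-safe
        }

        List<Pair> results = new ArrayList<>();
        for (int i = 0; i < problems.size(); i++) {
            results.add(cs.take().get()); // consume each solution as soon as it is ready
        }
        es.shutdown();
        return results;
    }
}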
You're asking extremely big questions. There is overhead per thread, and a key thing to note is that threads run within the parent process. If you wanted to run a million of these solvers at the same time, you'd have to fork them into their own processes.
You could run one program per input and then use a simple batch scheduler like Condor (for Linux) or HPC (for Windows). You can run those on Amazon too, but there's a bit of a learning curve; it's not just "upload Java code and go".
Sure, you could use a standard worker-thread paradigm to run things in parallel. But there will be some synchronization overhead (e.g., updates to the solutions list will cause lock contention when everything tries to finish at the same time), so it won't run in exactly 5 seconds. But it would be faster than 5 million seconds :-)
Amazon EC2 runs between US$0.085 and US$0.68 per hour depending on how much CPU you need (see pricing), so maybe about $120. Of course, you'll need to set up something to distribute your jobs across the various CPUs. One option might be to use Hadoop (see this question about whether Hadoop is right for running simulations).
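For scale: 5,000,000 CPU-seconds is roughly 1,389 CPU-hours, and 1,389 hours at the $0.085/hour rate comes to about $118, which is where the ~$120 figure comes from.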
You could read things like Guy Steele's talk on parallelism for more info on how to think parallel.
Use an appropriate Executor. Have a look at http://download.oracle.com/javase/6/docs/api/java/util/concurrent/Executors.html#newCachedThreadPool()
Check out these articles on concurrency:
http://www.vogella.de/articles/JavaConcurrency/article.html
http://www.baptiste-wicht.com/2010/09/java-concurrency-part-7-executors-and-thread-pools/
Basically, Java 7's new Fork/Join framework will work really well for this. Essentially you set up your million+ tasks and it spreads them as best it can across all available processors. You would have to provide your own custom "cloud" task executor, but it can be done.
This assumes, of course, that your "solving" algorithm is embarrassingly parallel. In short, as long as each Solver is fully self-contained, the work should be splittable among an arbitrary number of processors.
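A sketch of that Fork/Join approach, splitting the problem list recursively until chunks are small enough to solve directly (the threshold is an arbitrary tuning knob, and Solver is the interface from the question):

import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

public class SolveTask extends RecursiveAction {
    // Solver as declared in the question: int solve(int problem).
    interface Solver { int solve(int problem); }

    private static final int THRESHOLD = 1_000; // tuning knob, an assumption
    private final Solver solver;
    private final List<Integer> problems;
    private final int[] solutions; // one slot per problem, so no locking is needed
    private final int from, to;

    SolveTask(Solver solver, List<Integer> problems, int[] solutions, int from, int to) {
        this.solver = solver; this.problems = problems;
        this.solutions = solutions; this.from = from; this.to = to;
    }

    @Override
    protected void compute() {
        if (to - from <= THRESHOLD) {
            for (int i = from; i < to; i++) {
                solutions[i] = solver.solve(problems.get(i)); // 0 means "no solution"
            }
        } else {
            int mid = (from + to) >>> 1; // split in half and let the pool balance work
            invokeAll(new SolveTask(solver, problems, solutions, from, mid),
                      new SolveTask(solver, problems, solutions, mid, to));
        }
    }

    static int[] solveAll(Solver solver, List<Integer> problems) {
        int[] solutions = new int[problems.size()];
        new ForkJoinPool().invoke(
                new SolveTask(solver, problems, solutions, 0, solutions.length));
        return solutions;
    }
}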