I have a rather big ArrayList.
I have to go through every index and do an expensive calculation.
My first idea to speed it up was to put the work into a thread.
It works, but it is still extremely slow. I tinkered with the calculation to make it less expensive, but it's still too slow. The best solution I came up with is basically this one:
public void calculate(){
calculatePart(0);
calculatePart(1);
}
public void calculatePart(int offset) {
new Thread() {
@Override
public void run() {
int i = offset;
while(arrayList.size() > i) {
//Do the calculation
i +=2;
}
}
}.start();
}
Yet this feels like a lazy, unprofessional solution. That is why I'm asking whether there is a cleaner and even faster solution.
Assuming that the task performed on each element doesn't lead to data races, you could leverage the power of parallelism. To maximize the number of computations happening at the same time, you would have to give tasks to each of the processors available in your system.
In Java, you can get the number of processors (cores) available using this:
int parallelism = Runtime.getRuntime().availableProcessors();
The idea is to create a number of threads equal to the number of available processors.
So, if you have 4 processors available, you can create 4 threads and ask them to process items at a gap of 4. Suppose you have a list of size 10 which needs to be processed in parallel.
Then,
Thread 1 processes items at index 0,4,8
Thread 2 processes items at index 1,5,9
Thread 3 processes items at index 2,6
Thread 4 processes items at index 3,7
I tried to simulate your scenario with the following code:
import java.util.Arrays;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
public class SpeedUpTest {
public static void main(String[] args) throws InterruptedException, ExecutionException {
long seqTime, twoThreadTime, multiThreadTime;
List<Integer> list = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
long time = System.currentTimeMillis();
sequentialProcessing(list);
seqTime = System.currentTimeMillis() - time;
int parallelism = 2;
ExecutorService executorService = Executors.newFixedThreadPool(parallelism);
time = System.currentTimeMillis();
List<Future> tasks = new ArrayList<>();
for (int offset = 0; offset < parallelism; offset++) {
int finalParallelism = parallelism;
int finalOffset = offset;
Future task = executorService.submit(() -> {
int i = finalOffset;
while (list.size() > i) {
try {
processItem(list.get(i));
} catch (InterruptedException e) {
e.printStackTrace();
}
i += finalParallelism;
}
});
tasks.add(task);
}
for (Future task : tasks) {
task.get();
}
twoThreadTime = System.currentTimeMillis() - time;
executorService.shutdown(); // release the 2-thread pool so its non-daemon workers don't keep the JVM alive
parallelism = Runtime.getRuntime().availableProcessors();
executorService = Executors.newFixedThreadPool(parallelism);
tasks = new ArrayList<>();
time = System.currentTimeMillis();
for (int offset = 0; offset < parallelism; offset++) {
int finalParallelism = parallelism;
int finalOffset = offset;
Future task = executorService.submit(() -> {
int i = finalOffset;
while (list.size() > i) {
try {
processItem(list.get(i));
} catch (InterruptedException e) {
e.printStackTrace();
}
i += finalParallelism;
}
});
tasks.add(task);
}
for (Future task : tasks) {
task.get();
}
multiThreadTime = System.currentTimeMillis() - time;
executorService.shutdown(); // shut down the second pool so the program can exit after printing the results
log("RESULTS:");
log("Total time for sequential execution : " + seqTime / 1000.0 + " seconds");
log("Total time for execution with 2 threads: " + twoThreadTime / 1000.0 + " seconds");
log("Total time for execution with " + parallelism + " threads: " + multiThreadTime / 1000.0 + " seconds");
}
private static void log(String msg) {
System.out.println(msg);
}
private static void processItem(int index) throws InterruptedException {
Thread.sleep(5000);
}
private static void sequentialProcessing(List<Integer> list) throws InterruptedException {
for (int i = 0; i < list.size(); i++) {
processItem(list.get(i));
}
}
}
OUTPUT:
RESULTS:
Total time for sequential execution : 50.001 seconds
Total time for execution with 2 threads: 25.102 seconds
Total time for execution with 4 threads: 15.002 seconds
Speaking purely theoretically: if you have X elements and your calculation must perform N operations on each one, then your computer (processor) must perform X*N operations in total.
Parallel threads can make it faster only if some of the operations in the calculation involve waiting (e.g. file or network operations); that waiting time can be used by other threads. But if all operations are pure CPU work (e.g. mathematics) and the thread never waits, then on a single core the time required to perform X*N operations stays the same.
Also, each thread must give other threads the chance to take control of the CPU at some point. This happens automatically between method calls, or if you have a Thread.yield() call in your code.
As an example, a method like:
public void run()
{
long a=0;
for (long i=1; i < Long.MAX_VALUE; i++)
{
a+=i;
}
}
will not give other threads a chance to take control of the CPU until it has fully completed and exited.
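For contrast, here is a minimal sketch (purely illustrative; the class name WaitingTasksDemo is made up, and Thread.sleep stands in for a file or network wait) where two threads do help, because each task spends its time waiting rather than using the CPU:
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
public class WaitingTasksDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        Runnable simulatedIoTask = () -> simulatedIo();
        long start = System.currentTimeMillis();
        Future<?> first = pool.submit(simulatedIoTask);
        Future<?> second = pool.submit(simulatedIoTask);
        first.get();
        second.get();
        // Roughly 3 seconds instead of 6, because the two waits overlap.
        System.out.println("Elapsed: " + (System.currentTimeMillis() - start) + " ms");
        pool.shutdown();
    }
    private static void simulatedIo() {
        try {
            Thread.sleep(3000); // stand-in for a blocking file or network operation
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}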
The task I'm trying to implement is computing the Collatz sequence for each number in a given interval using several threads and seeing how much improvement is gained compared to one thread.
However, one thread is always faster no matter whether I choose 2 threads (edit: 2 threads are now faster, but not by much, while 4 threads are slower than 1 thread, and I have no idea why; I could even say that the more threads, the slower it gets). I hope someone can explain. Maybe I'm doing something wrong.
Below is the code I have written so far. I'm using ThreadPoolExecutor for executing the tasks (one task = one Collatz sequence for one number in the interval).
The Collatz class:
public class ParallelCollatz implements Runnable {
private long result;
private long inputNum;
public long getResult() {
return result;
}
public void setResult(long result) {
this.result = result;
}
public long getInputNum() {
return inputNum;
}
public void setInputNum(long inputNum) {
this.inputNum = inputNum;
}
public void run() {
//System.out.println("number:" + inputNum);
//System.out.println("Thread:" + Thread.currentThread().getId());
//int j=0;
//if(Thread.currentThread().getId()==11) {
// ++j;
// System.out.println(j);
//}
long result = 1;
//main recursive computation
while (inputNum > 1) {
if (inputNum % 2 == 0) {
inputNum = inputNum / 2;
} else {
inputNum = inputNum * 3 + 1;
}
++result;
}
// try {
//Thread.sleep(10);
//} catch (InterruptedException e) {
// TODO Auto-generated catch block
// e.printStackTrace();
//}
this.result=result;
return;
}
}
And the main class where I run the threads (yes, for now I create two lists with the same numbers, since after running with one thread the initial values are lost):
ThreadPoolExecutor executor = (ThreadPoolExecutor)Executors.newFixedThreadPool(1);
ThreadPoolExecutor executor2 = (ThreadPoolExecutor)Executors.newFixedThreadPool(4);
List<ParallelCollatz> tasks = new ArrayList<ParallelCollatz>();
for(int i=1; i<=1000000; i++) {
ParallelCollatz task = new ParallelCollatz();
task.setInputNum((long)(i+1000000));
tasks.add(task);
}
long startTime = System.nanoTime();
for(int i=0; i<1000000; i++) {
executor.execute(tasks.get(i));
}
executor.shutdown();
boolean tempFirst=false;
try {
tempFirst =executor.awaitTermination(5, TimeUnit.HOURS);
} catch (InterruptedException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
}
System.out.println("tempFirst " + tempFirst);
long endTime = System.nanoTime();
long durationInNano = endTime - startTime;
long durationInMillis = TimeUnit.NANOSECONDS.toMillis(durationInNano); // total execution time in milliseconds
System.out.println("time " + durationInMillis);
List<ParallelCollatz> tasks2 = new ArrayList<ParallelCollatz>();
for(int i=1; i<=1000000; i++) {
ParallelCollatz task = new ParallelCollatz();
task.setInputNum((long)(i+1000000));
tasks2.add(task);
}
long startTime2 = System.nanoTime();
for(int i=0; i<1000000; i++) {
executor2.execute(tasks2.get(i));
}
executor2.shutdown();
boolean temp =false;
try {
temp=executor2.awaitTermination(5, TimeUnit.HOURS);
} catch (InterruptedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
System.out.println("temp "+ temp);
long endTime2 = System.nanoTime();
long durationInNano2 = endTime2 - startTime2;
long durationInMillis2 = TimeUnit.NANOSECONDS.toMillis(durationInNano2); // total execution time in milliseconds
System.out.println("time2 " + durationInMillis2);
For example, running with one thread it completes in 3280 ms; running with two threads, 3437 ms. Should I be considering another concurrent structure for calculating each element?
EDIT
Clarification: I'm not trying to parallelize individual sequences, but an interval of numbers where each number has its own sequence (which is not related to the other numbers).
EDIT2
Today I ran the program on a good PC with 6 cores and 12 logical processors and the issue persists. Does anyone have an idea where the problem might be? I have also updated my code. 4 threads do worse than 2 threads for some reason (even worse than 1 thread). I also applied what was given in the answer, but there was no change.
Another Edit
What I have noticed is that if I put a Thread.sleep(1) in my ParallelCollatz run() method, then performance gradually increases with the thread count. Perhaps this detail tells someone what is wrong? However, no matter how many tasks I give it, if there is no Thread.sleep(1), 2 threads perform fastest, 1 thread is in 2nd place, and the rest hover around a similar number of milliseconds, slower than both 1 and 2 threads.
New Edit
I also tried putting more work (a for loop computing not 1 but 10 or 100 Collatz sequences) into the run() method of the Runnable class, so that each thread itself would do more work. Unfortunately, this did not help either.
Perhaps I'm launching the tasks incorrectly? Does anyone have any ideas?
EDIT
So it would seem that adding more work to the run method fixes it a bit, but for more threads (8+) the issue still remains. I still wonder whether the cause of this is that it takes more time to create and run the threads than to execute the task. Or should I create a new post with this question?
You are not waiting for your tasks to complete, only measuring the time it takes to submit them to the executor.
executor.shutdown() does not wait for all tasks to finish. You need to call executor.awaitTermination after that.
executor.shutdown();
executor.awaitTermination(5, TimeUnit.HOURS);
https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ExecutorService.html#shutdown()
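A minimal sketch of the intended measurement (reusing the names from the question; assume it runs where InterruptedException can propagate or is caught as in your code): take the end timestamp only after awaitTermination has returned, so the measured time covers the work itself and not just the submission.
long start = System.nanoTime();
for (ParallelCollatz task : tasks) {
    executor.execute(task);
}
executor.shutdown();                                              // stop accepting new tasks
boolean finished = executor.awaitTermination(5, TimeUnit.HOURS);  // block until all submitted tasks are done
long elapsedMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
System.out.println("finished=" + finished + ", elapsed=" + elapsedMillis + " ms");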
Update
I believe that our testing methodology is flawed. I repeated your test on my machine (1 processor, 2 cores, 4 logical processors), and the time measured differed wildly from run to run.
I believe the following are the main reasons:
JVM startup and JIT compilation time. At the beginning, the code runs in interpreted mode.
The result of the calculation is ignored. I have no intuition about what is removed by the JIT and what we are actually measuring.
Println calls in the code.
To test this, I converted your test to JMH.
In particular:
I converted the Runnable to a Callable and return the sum of the results to prevent the JIT from eliminating the computation as dead code (alternatively, you can use Blackhole from JMH).
My tasks have no state; I moved all moving parts to local variables. No GC is needed to clean up the tasks.
I still create executors in each round. This is not perfect, but I decided to keep it as is.
The results I received below are consistent with my expectations: one core is waiting in the main thread, the work is performed on a single core, and the numbers are roughly the same.
Benchmark Mode Cnt Score Error Units
SpeedTest.multipleThreads avgt 20 559.996 ± 20.181 ms/op
SpeedTest.singleThread avgt 20 562.048 ± 16.418 ms/op
Updated code:
public class ParallelCollatz implements Callable<Long> {
private final long inputNumInit;
public ParallelCollatz(long inputNumInit) {
this.inputNumInit = inputNumInit;
}
@Override
public Long call() {
long result = 1;
long inputNum = inputNumInit;
//main recursive computation
while (inputNum > 1) {
if (inputNum % 2 == 0) {
inputNum = inputNum / 2;
} else {
inputNum = inputNum * 3 + 1;
}
++result;
}
return result;
}
}
and the benchmark itself:
@State(Scope.Benchmark)
public class SpeedTest {
private static final int NUM_TASKS = 1000000;
private static List<ParallelCollatz> tasks = buildTasks();
@Benchmark
@Fork(value = 1, warmups = 1)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@SuppressWarnings("unused")
public long singleThread() throws Exception {
ThreadPoolExecutor executorOneThread = (ThreadPoolExecutor) Executors.newFixedThreadPool(1);
return measureTasks(executorOneThread, tasks);
}
@Benchmark
@Fork(value = 1, warmups = 1)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@SuppressWarnings("unused")
public long multipleThreads() throws Exception {
ThreadPoolExecutor executorMultipleThread = (ThreadPoolExecutor) Executors.newFixedThreadPool(4);
return measureTasks(executorMultipleThread, tasks);
}
private static long measureTasks(ThreadPoolExecutor executor, List<ParallelCollatz> tasks) throws InterruptedException, ExecutionException {
long sum = runTasksInExecutor(executor, tasks);
return sum;
}
private static long runTasksInExecutor(ThreadPoolExecutor executor, List<ParallelCollatz> tasks) throws InterruptedException, ExecutionException {
List<Future<Long>> futures = new ArrayList<>(NUM_TASKS);
for (int i = 0; i < NUM_TASKS; i++) {
Future<Long> f = executor.submit(tasks.get(i));
futures.add(f);
}
executor.shutdown();
boolean tempFirst = false;
try {
tempFirst = executor.awaitTermination(5, TimeUnit.HOURS);
} catch (InterruptedException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
}
long sum = 0L;
for (Future<Long> f : futures) {
sum += f.get();
}
//System.out.println(sum);
return sum;
}
private static List<ParallelCollatz> buildTasks() {
List<ParallelCollatz> tasks = new ArrayList<>();
for (int i = 1; i <= NUM_TASKS; i++) {
ParallelCollatz task = new ParallelCollatz((long) (i + NUM_TASKS));
tasks.add(task);
}
return tasks;
}
}
I'm having big trouble understanding the Java ThreadPoolExecutor. For example, I want to calculate the squares of the numbers 1-1000:
public static void main(String[] args) throws InterruptedException, ExecutionException {
Callable<ArrayList<Integer>> c = new squareCalculator(1000);
ExecutorService executor = Executors.newFixedThreadPool(5);
Future<ArrayList<Integer>> result = executor.submit(c);
for(Integer i: result.get()){
System.out.println(i);
}
}
And the Callable:
public class squareCalculator implements Callable<ArrayList<Integer>>{
private int i;
private int max;
private int threadID;
private static int id;
private ArrayList<Integer> squares;
public squareCalculator(int max){
this.max = max;
this.i = 1;
this.threadID = id;
id++;
squares = new ArrayList<Integer>();
}
public ArrayList<Integer> call() throws Exception {
while(i <= max){
squares.add(i*i);
System.out.println("Proccessed number " +i + " in thread "+this.threadID);
Thread.sleep(1);
i++;
}
return squares;
}
}
Now my problem is that I only get one thread doing the calculations. I expected to get 5 threads.
If you want the Callable to run 5 times concurrently, you need to submit it 5 times.
You only submitted it once, and then ask for its result 5 times.
Javadoc of submit():
Submits a value-returning task for execution and returns a
Future representing the pending results of the task. The
Future's get method will return the task's result upon
successful completion.
You see that Javadoc for submit() uses the singular for "task", not "tasks".
The fix is easy: submit it multiple times:
Future<ArrayList<Integer>> result1 = executor.submit(c);
Future<ArrayList<Integer>> result2 = executor.submit(c);
Future<ArrayList<Integer>> result3 = executor.submit(c);
/// etc..
result1.get();
result2.get();
result3.get();
// etc..
The ExecutorService will use one thread to execute each Callable task that you submit. Therefore, if you want to have multiple threads calculating the squares, you have to submit multiple tasks, for example one task for each number. You would then get a Future<Integer> from each task, which you can store in a list and call get() on each one to get the results.
public class SquareCalculator implements Callable<Integer> {
private final int i;
public SquareCalculator(int i) {
this.i = i;
}
@Override
public Integer call() throws Exception {
System.out.println("Processing number " + i + " in thread " + Thread.currentThread().getName());
return i * i;
}
public static void main(String[] args) throws Exception {
ExecutorService executor = Executors.newFixedThreadPool(5);
List<Future<Integer>> futures = new ArrayList<>();
// Create a Callable for each number, submit it to the ExecutorService and store the Future
for (int i = 1; i <= 1000; i++) {
Callable<Integer> c = new SquareCalculator(i);
Future<Integer> future = executor.submit(c);
futures.add(future);
}
// Wait for the result of each Future
for (Future<Integer> future : futures) {
System.out.println(future.get());
}
executor.shutdown();
}
}
The output then looks something like this:
Processing number 2 in thread pool-1-thread-2
Processing number 1 in thread pool-1-thread-1
Processing number 6 in thread pool-1-thread-1
Processing number 7 in thread pool-1-thread-2
Processing number 8 in thread pool-1-thread-2
Processing number 9 in thread pool-1-thread-2
...
1
4
9
...
This is a funny problem to try to do in parallel, because creating the result array (or list) itself runs in O(n) time, since it gets initialized with zeros on creation.
public static void main(String[] args) throws InterruptedException {
final int chunks = Runtime.getRuntime().availableProcessors();
final int max = 1001;
ExecutorService executor = Executors.newFixedThreadPool(chunks);
final List<ArrayList<Long>> results = new ArrayList<>(chunks);
for (int i = 0; i < chunks; i++) {
final int start = i * max / chunks;
final int end = (i + 1) * max / chunks;
final ArrayList<Long> localResults = new ArrayList<>(0);
results.add(localResults);
executor.submit(new Runnable() {
@Override
public void run() {
// Reallocate enough space locally so it's done in parallel.
localResults.ensureCapacity(end - start);
for (int j = start; j < end; j++) {
localResults.add((long)j * (long)j);
}
}
});
}
executor.shutdown();
executor.awaitTermination(Long.MAX_VALUE, TimeUnit.MICROSECONDS);
int i = 0;
for (List<Long> list : results) {
for (Long l : list) {
System.out.printf("%d: %d\n", i, l);
++i;
}
}
}
Overhead from dealing with the wrapper classes will kill performance here, so you should use something like Fastutil. Then you could join the chunks with something like Guava's Iterables.concat, only a List version that's compatible with Fastutil's LongList.
This might also make a good ForkJoinTask, but again, you'll need efficient logical (mapping, not copying; the reverse of List.subList) list concatenation functions to realize a speedup.
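As a rough sketch of the primitive-collection idea (assuming fastutil is on the classpath; the class name PrimitiveSquares and the chunking scheme are just illustrative), each chunk fills its own LongArrayList so no Long boxing happens in the hot loop:
import it.unimi.dsi.fastutil.longs.LongArrayList;
import it.unimi.dsi.fastutil.longs.LongList;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
public class PrimitiveSquares {
    public static void main(String[] args) throws Exception {
        final int chunks = Runtime.getRuntime().availableProcessors();
        final int max = 1001;
        ExecutorService executor = Executors.newFixedThreadPool(chunks);
        List<LongList> results = new ArrayList<>(chunks);
        List<Future<?>> futures = new ArrayList<>(chunks);
        for (int i = 0; i < chunks; i++) {
            final int start = i * max / chunks;
            final int end = (i + 1) * max / chunks;
            final LongArrayList local = new LongArrayList(end - start); // primitive-backed, no boxing
            results.add(local);
            futures.add(executor.submit(() -> {
                for (int j = start; j < end; j++) {
                    local.add((long) j * (long) j);
                }
            }));
        }
        for (Future<?> f : futures) {
            f.get(); // wait for every chunk to finish
        }
        executor.shutdown();
        long index = 0;
        for (LongList list : results) {
            for (int k = 0; k < list.size(); k++) {
                System.out.printf("%d: %d%n", index++, list.getLong(k));
            }
        }
    }
}
Here the final concatenation is simulated by simply iterating the chunk lists in order; a real logical List view over them is what the paragraph above is asking for.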
In Java, I have some simple multithreaded code:
public class ThreadedAlgo {
public static final int threadsCount = 3;
public static void main(String[] args) {
// start timer prior to computation
long time = System.currentTimeMillis();
// create threads
Thread[] threads = new Thread[threadsCount];
class ToDo implements Runnable {
public void run() { ... }
}
// create job objects
for (int i = 0; i < threadsCount; i++) {
ToDo job = new ToDo();
threads[i] = new Thread(job);
}
// start threads
for (int i = 0; i < threadsCount; i++) {
threads[i].start();
}
// wait for threads above to finish
for (int i = 0; i < threadsCount; i++) {
try {
threads[i].join();
} catch (InterruptedException e) {
e.printStackTrace();
}
}
// display time after computation
System.out.println("Execution time: " + (System.currentTimeMillis() - time));
}
}
It works fine. Now I want to run it with 2 or 3 threads and measure the time each thread spends on its computation. Then I will compare the times: call them t1 and t2, and if |t1 - t2| < some small epsilon, I will say that my algorithm performs with fine granularity under the given conditions, that is, the time spent by the threads is roughly the same.
How can I measure the time of a thread?
Use System.nanoTime() at the beginning and end of the thread (job) methods to calculate the total time spent in each invocation. In your case, all threads will be executed with the same (default) priority, so time slices should be distributed pretty fairly. If your threads are interlocked, use fair locks for the same reason, e.g. new ReentrantLock(true).
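A minimal sketch of that approach (the class and field names are illustrative, not from the question): each job records its own elapsed time around its work, and the main thread reads it after join().
class TimedJob implements Runnable {
    private volatile long elapsedNanos; // written by the worker thread, read by main after join()
    public long getElapsedNanos() {
        return elapsedNanos;
    }
    @Override
    public void run() {
        long start = System.nanoTime();
        doWork();
        elapsedNanos = System.nanoTime() - start;
    }
    private void doWork() {
        // the actual computation of the ToDo job goes here
    }
}
After joining the threads, compare the getElapsedNanos() values of the jobs and check whether |t1 - t2| is below your epsilon.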
Add the timing logic inside your run() methods.
I have what is probably a basic question. When I create 100 million Hashtables it takes approximately 6 seconds (runtime = 6 seconds per core) on my machine if I do it on a single core. If I do this multithreaded on 12 cores (my machine has 6 cores that allow hyperthreading) it takes around 10 seconds (runtime = 112 seconds per core).
This is the code I use:
Main
public class Tests
{
public static void main(String args[])
{
double start = System.currentTimeMillis();
int nThreads = 12;
double[] runTime = new double[nThreads];
TestsThread[] threads = new TestsThread[nThreads];
int totalJob = 100000000;
int jobsize = totalJob/nThreads;
for(int i = 0; i < threads.length; i++)
{
threads[i] = new TestsThread(jobsize,runTime, i);
threads[i].start();
}
waitThreads(threads);
for(int i = 0; i < runTime.length; i++)
{
System.out.println("Runtime thread:" + i + " = " + (runTime[i]/1000000) + "ms");
}
double end = System.currentTimeMillis();
System.out.println("Total runtime = " + (end-start) + " ms");
}
private static void waitThreads(TestsThread[] threads)
{
for(int i = 0; i < threads.length; i++)
{
while(threads[i].finished == false)//keep waiting until the thread is done
{
//System.out.println("waiting on thread:" + i);
try {
Thread.sleep(1);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
}
}
}
Thread
import java.util.HashMap;
import java.util.Map;
public class TestsThread extends Thread
{
int jobSize = 0;
double[] runTime;
boolean finished;
int threadNumber;
TestsThread(int job, double[] runTime, int threadNumber)
{
this.finished = false;
this.jobSize = job;
this.runTime = runTime;
this.threadNumber = threadNumber;
}
public void run()
{
double start = System.nanoTime();
for(int l = 0; l < jobSize ; l++)
{
double[] test = new double[65];
}
double end = System.nanoTime();
double difference = end-start;
runTime[threadNumber] += difference;
this.finished = true;
}
}
I do not understand why creating the objects simultaneously in multiple threads takes longer per thread than doing it serially in only one thread. If I remove the line where I create the Hashtable, this problem disappears. If anyone could help me with this I would be very grateful.
Update: This problem has an associated bug report and has been fixed with Java 1.7u40. And it was never an issue for Java 1.8 as Java 8 has an entirely different hash table algorithm.
Since you are not using the created objects, that operation will get optimized away. So you're only measuring the overhead of creating threads. And of course, the more threads you start, the more overhead there is.
I have to correct my answer regarding a detail I didn't know yet: there is something special about the classes Hashtable and HashMap. They both invoke sun.misc.Hashing.randomHashSeed(this) in the constructor. In other words, their instances escape during construction, which has an impact on memory visibility. This implies that their construction, unlike, say, an ArrayList's, cannot be optimized away, and multi-threaded construction slows down due to what happens inside that method (i.e. synchronization).
As said, that's special to these classes and of course to this implementation (my setup: 1.7.0_13). For ordinary classes the construction time goes straight to zero for such code.
Here is a more sophisticated benchmark. Watch the difference between DO_HASH_MAP = true and DO_HASH_MAP = false (when false, it will create an ArrayList instead, which has no such special behavior).
import java.util.*;
import java.util.concurrent.*;
public class AllocBench {
static final int NUM_THREADS = 1;
static final int NUM_OBJECTS = 100000000 / NUM_THREADS;
static final boolean DO_HASH_MAP = true;
public static void main(String[] args) throws InterruptedException, ExecutionException {
ExecutorService threadPool = Executors.newFixedThreadPool(NUM_THREADS);
Callable<Long> task=new Callable<Long>() {
public Long call() {
return doAllocation(NUM_OBJECTS);
}
};
long startTime=System.nanoTime(), cpuTime=0;
for(Future<Long> f: threadPool.invokeAll(Collections.nCopies(NUM_THREADS, task))) {
cpuTime+=f.get();
}
long time=System.nanoTime()-startTime;
System.out.println("Number of threads: "+NUM_THREADS);
System.out.printf("entire allocation required %.03f s%n", time*1e-9);
System.out.printf("time x numThreads %.03f s%n", time*1e-9*NUM_THREADS);
System.out.printf("real accumulated cpu time %.03f s%n", cpuTime*1e-9);
threadPool.shutdown();
}
static long doAllocation(int numObjects) {
long t0=System.nanoTime();
for(int i=0; i<numObjects; i++)
if(DO_HASH_MAP) new HashMap<Object, Object>(); else new ArrayList<Object>();
return System.nanoTime()-t0;
}
}
What about doing it on 6 cores? Hyperthreading isn't the same as having double the cores, so you might want to try the number of real cores too.
Also, the OS won't necessarily schedule each of your threads onto its own core.
Since all you are doing is measuring the time and churning memory, your bottleneck is likely to be your L3 cache or the bus to main memory. In this case, coordinating the work between threads could be producing so much overhead that it makes things worse instead of better.
This is too long for a comment, but your inner loop can be just:
double start = System.nanoTime();
for(int l = 0; l < jobSize ; l++){
Map<String,Integer> test = new HashMap<String,Integer>();
}
// runtime is an AtomicLong for thread safety
runtime.addAndGet(System.nanoTime() - start); // time in nano-seconds.
Taking the time can be as slow as creating a HashMap, so you might not be measuring what you think you are if you call the timer too often.
BTW, Hashtable is synchronized, and you might find that using HashMap is faster and possibly more scalable.
I've programmed a (very simple) benchmark in Java. It simply increments a double value up to a specified value and measures the time.
When I use this single-threaded or with a low number of threads (up to 100) on my 6-core desktop, the benchmark returns reasonable and repeatable results.
But when I use, for example, 1200 threads, the average multicore duration is significantly lower than the singlecore duration (about 10 times or more). I've made sure that the total number of incrementations is the same no matter how many threads I use.
Why does the performance drop so much with more threads? Is there a trick to solve this problem?
I'm posting my source, but I don't think that there is a problem with it.
Benchmark.java:
package sibbo.benchmark;
import java.text.DecimalFormat;
import java.util.LinkedList;
import java.util.List;
public class Benchmark implements TestFinishedListener {
private static final double TARGET = 1e10;
private static final int THREAD_MULTIPLICATOR = 2;
public static void main(String[] args) throws InterruptedException {
Benchmark b = new Benchmark(TARGET);
b.start();
}
private int coreCount;
private List<Worker> workers = new LinkedList<>();
private List<Worker> finishedWorkers = new LinkedList<>();
private double target;
public Benchmark(double target) {
this.target = target;
getSystemInfos();
printInfos();
}
private void getSystemInfos() {
coreCount = Runtime.getRuntime().availableProcessors();
}
private void printInfos() {
System.out.println("Usable cores: " + coreCount);
System.out.println("Multicore threads: " + coreCount * THREAD_MULTIPLICATOR);
System.out.println("Loops per core: " + new DecimalFormat("###,###,###,###,##0").format(TARGET));
System.out.println();
}
public synchronized void start() throws InterruptedException {
Thread.currentThread().setPriority(Thread.MAX_PRIORITY);
System.out.print("Initializing singlecore benchmark... ");
Worker w = new Worker(this, 0);
workers.add(w);
Thread.sleep(1000);
System.out.println("finished");
System.out.print("Running singlecore benchmark... ");
w.runBenchmark(target);
wait();
System.out.println("finished");
printResult();
System.out.println();
// Multicore
System.out.print("Initializing multicore benchmark... ");
finishedWorkers.clear();
for (int i = 0; i < coreCount * THREAD_MULTIPLICATOR; i++) {
workers.add(new Worker(this, i));
}
Thread.sleep(1000);
System.out.println("finished");
System.out.print("Running multicore benchmark... ");
for (Worker worker : workers) {
worker.runBenchmark(target / THREAD_MULTIPLICATOR);
}
wait();
System.out.println("finished");
printResult();
Thread.currentThread().setPriority(Thread.NORM_PRIORITY);
}
private void printResult() {
DecimalFormat df = new DecimalFormat("###,###,###,##0.000");
long min = -1, av = 0, max = -1;
int threadCount = 0;
boolean once = true;
System.out.println("Result:");
for (Worker w : finishedWorkers) {
if (once) {
once = false;
min = w.getTime();
max = w.getTime();
}
if (w.getTime() > max) {
max = w.getTime();
}
if (w.getTime() < min) {
min = w.getTime();
}
threadCount++;
av += w.getTime();
if (finishedWorkers.size() <= 6) {
System.out.println("Worker " + w.getId() + ": " + df.format(w.getTime() / 1e9) + "s");
}
}
System.out.println("Min: " + df.format(min / 1e9) + "s, Max: " + df.format(max / 1e9) + "s, Av per Thread: "
+ df.format((double) av / threadCount / 1e9) + "s");
}
@Override
public synchronized void testFinished(Worker w) {
workers.remove(w);
finishedWorkers.add(w);
if (workers.isEmpty()) {
notify();
}
}
}
Worker.java:
package sibbo.benchmark;
public class Worker implements Runnable {
private double value = 0;
private long time;
private double target;
private TestFinishedListener l;
private final int id;
public Worker(TestFinishedListener l, int id) {
this.l = l;
this.id = id;
new Thread(this).start();
}
public int getId() {
return id;
}
public synchronized void runBenchmark(double target) {
this.target = target;
notify();
}
public long getTime() {
return time;
}
@Override
public void run() {
synWait();
value = 0;
long startTime = System.nanoTime();
while (value < target) {
value++;
}
long endTime = System.nanoTime();
time = endTime - startTime;
l.testFinished(this);
}
private synchronized void synWait() {
try {
wait();
} catch (InterruptedException e) {
e.printStackTrace();
}
}
}
You need to understand that the OS (or Java thread scheduler, or both) is trying to balance between all of the threads in your application to give them all a chance to perform some work, and there is a non-zero cost to switch between threads. With 1200 threads, you have just reached (and probably far exceeded) the tipping point wherein the processor is spending more time context switching than doing actual work.
Here is a rough analogy:
You have one job to do in room A. You stand in room A for 8 hours a day, and do your job.
Then your boss comes by and tells you that you have to do a job in room B also. Now you need to periodically leave room A, walk down the hall to room B, and then walk back. That walking takes 1 minute per day. Now you spend 3 hours, 59.5 minutes working on each job, and one minute walking between rooms.
Now imagine that you have 1200 rooms to work in. You are going to spend more time walking between rooms than doing actual work. This is the situation that you have put your processor into. It is spending so much time switching between contexts that no real work gets done.
EDIT: Now, as per the comments below, maybe you spend a fixed amount of time in each room before moving on; your work will progress, but the number of context switches between rooms still affects the overall runtime of a single task.
OK, I think I've found my problem, but so far, no solution.
When measuring the time each thread runs to do its part of the work, the possible minimum differs for different total numbers of threads, while the maximum is the same every time: the case where a thread is started first, then paused very often, and finishes last. For example, this maximum value could be 10 seconds. Assuming that the total amount of work stays the same no matter how many threads I use, the number of operations done by a single thread changes with the thread count. For example, using one thread it has to do 1000 operations, but using ten threads each of them has to do just 100 operations. So, using ten threads, the minimum amount of time one thread can take is much lower than when using one thread; the minimum with ten threads would be 1 second, which happens if a thread does its work without interruption. Therefore, calculating the average amount of time each thread needs to do its work is nonsense.
EDIT
The solution would be to simply measure the amount of time between the start of the first thread and the completion of the last.
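A sketch of that measurement (illustrative; the names threadCount and workers are made up, it assumes the worker Runnables are wrapped in explicitly created threads rather than starting themselves in their constructor as the Worker above does, and it runs where InterruptedException can propagate):
Thread[] threads = new Thread[threadCount];
for (int i = 0; i < threadCount; i++) {
    threads[i] = new Thread(workers.get(i)); // workers is the list of Runnables to measure
}
long wallStart = System.nanoTime();
for (Thread t : threads) {
    t.start();  // the first thread starts here
}
for (Thread t : threads) {
    t.join();   // returns only after that thread has completed
}
double wallSeconds = (System.nanoTime() - wallStart) / 1e9;
System.out.println("Wall time from first start to last completion: " + wallSeconds + " s");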