I work on a high-concurrency app. In the app code I try to avoid synchronization where possible. Recently, when comparing the test performance of unsynchronized and synchronized versions of the code, it turned out that the synchronized code ran three to four times faster than its unsynchronized counterpart.
After some experiments I came to this test code:
private static final Random RND = new Random();
private static final int NUM_OF_THREADS = 3;
private static final int NUM_OF_ITR = 3;
private static final int MONKEY_WORKLOAD = 50000;
static final AtomicInteger lock = new AtomicInteger();
private static void syncLockTest(boolean sync) {
System.out.println("syncLockTest, sync=" + sync);
final AtomicLong jobsDone = new AtomicLong();
final AtomicBoolean stop = new AtomicBoolean();
for (int i = 0; i < NUM_OF_THREADS; i++) {
Runnable runner;
if (sync) {
runner = new Runnable() {
@Override
public void run() {
while (!stop.get()){
jobsDone.incrementAndGet();
synchronized (lock) {
monkeyJob();
}
Thread.yield();
}
}
};
} else {
runner = new Runnable() {
@Override
public void run() {
while (!stop.get()){
jobsDone.incrementAndGet();
monkeyJob();
Thread.yield();
}
}
};
}
new Thread(runner).start();
}
long printTime = System.currentTimeMillis();
for (int i = 0; i < NUM_OF_ITR;) {
long now = System.currentTimeMillis();
if (now - printTime > 10 * 1000) {
printTime = now;
System.out.println("Jobs done\t" + jobsDone);
jobsDone.set(0);
i++;
}
try {
Thread.sleep(1000);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
stop.set(true);
}
private static double[] monkeyJob() {
double[] res = new double[MONKEY_WORKLOAD];
for (int i = 0; i < res.length; i++) {
res[i] = RND.nextDouble();
res[i] = 1./(1. + res[i]);
}
return res;
}
I played with the number of threads, the workload, and the number of test iterations. Each time, the synchronized code performed much faster than the unsynchronized one.
Here are results for two different values of NUM_OF_THREADS
Number of threads: 3
syncLockTest, sync=true
Jobs done 5951
Jobs done 5958
Jobs done 5878
syncLockTest, sync=false
Jobs done 1399
Jobs done 1397
Jobs done 1391
Number of threads: 5
syncLockTest, sync=true
Jobs done 5895
Jobs done 6464
Jobs done 5886
syncLockTest, sync=false
Jobs done 1179
Jobs done 1260
Jobs done 1226
Test environment
Windows 7 Professional
Java Version 7.0
Here's a similar case: Synchronized code performs faster than unsynchronized one
Any ideas?
Random is a thread-safe class, so all threads compete to update its single internal seed. You are most likely avoiding that contention on calls into the Random class by synchronizing around the main job.
This is fascinating. I think @jtahlborn nailed it. If I move the Random and make it local to the thread, the job counts for the non-sync version jump by more than 10x while the synchronized ones don't change:
Here are my times with a static Random RND:
syncLockTest, sync=true
Jobs done 8800
Jobs done 8839
Jobs done 8896
syncLockTest, sync=false
Jobs done 1401
Jobs done 1381
Jobs done 1423
Here are my times with a Random rnd local variable per thread:
syncLockTest, sync=true
Jobs done 8846
Jobs done 8861
Jobs done 8866
syncLockTest, sync=false
Jobs done 25956 << much faster
Jobs done 26065 << much faster
Jobs done 26021 << much faster
I also wondered if this was GC related but moving the double[] res to being a thread local did not help the speeds at all. Here's the code I used:
...
@Override
public void run() {
// made this be a thread local but it affected the times only slightly
double[] res = new double[MONKEY_WORKLOAD];
// turned rnd into a local variable instead of static
Random rnd = new Random();
while (!stop.get()) {
jobsDone.incrementAndGet();
if (sync) {
synchronized (lock) {
monkeyJob(res, rnd);
}
} else {
monkeyJob(res, rnd);
}
}
}
...
private static double[] monkeyJob(double[] res, Random rnd) {
for (int i = 0; i < res.length; i++) {
res[i] = rnd.nextDouble();
res[i] = 1. / (1. + res[i]);
}
return res;
}
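If allocating a Random per thread by hand is inconvenient, Java 7 also provides java.util.concurrent.ThreadLocalRandom, which hands each thread its own generator. The following is a minimal sketch of the monkeyJob loop rewritten that way. The class name and the small driver in main are illustrative assumptions, not part of the original test, and I have not measured it against the numbers above.

```java
import java.util.concurrent.ThreadLocalRandom;

class ThreadLocalRandomJob {
    // Same arithmetic as monkeyJob above, but each calling thread draws from
    // its own generator, so threads never compete for one shared seed.
    static double[] monkeyJob(double[] res) {
        ThreadLocalRandom rnd = ThreadLocalRandom.current(); // per-thread instance
        for (int i = 0; i < res.length; i++) {
            res[i] = 1. / (1. + rnd.nextDouble());
        }
        return res;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] threads = new Thread[3];
        for (int t = 0; t < threads.length; t++) {
            threads[t] = new Thread(new Runnable() {
                @Override
                public void run() {
                    double[] res = monkeyJob(new double[50000]);
                    System.out.println(Thread.currentThread().getName()
                            + " first value " + res[0]);
                }
            });
            threads[t].start();
        }
        for (Thread thread : threads) {
            thread.join();
        }
    }
}
```

ThreadLocalRandom.current() must be called on the thread that uses the generator, which is why it is obtained inside monkeyJob rather than stored in a static field.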
Related
The task I'm trying to implement is finding Collatz sequence for numbers in a set interval using several threads and seeing how much improvement is gained compared to one thread.
However, one thread is always faster no matter how many threads I choose. (Edit: 2 threads are faster, but not by much, while 4 threads are slower than 1 thread, and I have no idea why. I could even say that the more threads, the slower it gets.) I hope someone can explain. Maybe I'm doing something wrong.
Below is the code that I have written so far. I'm using ThreadPoolExecutor for executing the tasks (one task = one Collatz sequence for one number in the interval).
The Collatz class:
public class ParallelCollatz implements Runnable {
private long result;
private long inputNum;
public long getResult() {
return result;
}
public void setResult(long result) {
this.result = result;
}
public long getInputNum() {
return inputNum;
}
public void setInputNum(long inputNum) {
this.inputNum = inputNum;
}
public void run() {
//System.out.println("number:" + inputNum);
//System.out.println("Thread:" + Thread.currentThread().getId());
//int j=0;
//if(Thread.currentThread().getId()==11) {
// ++j;
// System.out.println(j);
//}
long result = 1;
//main recursive computation
while (inputNum > 1) {
if (inputNum % 2 == 0) {
inputNum = inputNum / 2;
} else {
inputNum = inputNum * 3 + 1;
}
++result;
}
// try {
//Thread.sleep(10);
//} catch (InterruptedException e) {
// TODO Auto-generated catch block
// e.printStackTrace();
//}
this.result=result;
return;
}
}
And the main class where I run the threads (yes, for now I create two lists with the same numbers, since the initial values are lost after running with one thread):
ThreadPoolExecutor executor = (ThreadPoolExecutor)Executors.newFixedThreadPool(1);
ThreadPoolExecutor executor2 = (ThreadPoolExecutor)Executors.newFixedThreadPool(4);
List<ParallelCollatz> tasks = new ArrayList<ParallelCollatz>();
for(int i=1; i<=1000000; i++) {
ParallelCollatz task = new ParallelCollatz();
task.setInputNum((long)(i+1000000));
tasks.add(task);
}
long startTime = System.nanoTime();
for(int i=0; i<1000000; i++) {
executor.execute(tasks.get(i));
}
executor.shutdown();
boolean tempFirst=false;
try {
tempFirst =executor.awaitTermination(5, TimeUnit.HOURS);
} catch (InterruptedException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
}
System.out.println("tempFirst " + tempFirst);
long endTime = System.nanoTime();
long durationInNano = endTime - startTime;
long durationInMillis = TimeUnit.NANOSECONDS.toMillis(durationInNano); //Total execution time in nano seconds
System.out.println("laikas " +durationInMillis);
List<ParallelCollatz> tasks2 = new ArrayList<ParallelCollatz>();
for(int i=1; i<=1000000; i++) {
ParallelCollatz task = new ParallelCollatz();
task.setInputNum((long)(i+1000000));
tasks2.add(task);
}
long startTime2 = System.nanoTime();
for(int i=0; i<1000000; i++) {
executor2.execute(tasks2.get(i));
}
executor2.shutdown();
boolean temp =false;
try {
temp=executor2.awaitTermination(5, TimeUnit.HOURS);
} catch (InterruptedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
System.out.println("temp "+ temp);
long endTime2 = System.nanoTime();
long durationInNano2 = endTime2 - startTime2;
long durationInMillis2 = TimeUnit.NANOSECONDS.toMillis(durationInNano2); //Total execution time in nano seconds
System.out.println("laikas2 " +durationInMillis2);
For example running with one thread it completes in 3280ms. Running with two threads 3437ms. Should I be considering another concurrent structure for calculating each element?
EDIT
Clarification: I'm not trying to parallelize individual sequences, but an interval of numbers where each number has its own sequence (which is not related to the other numbers).
EDIT2
Today I ran the program on a good PC with 6 cores and 12 logical processors, and the issue persists. Does anyone have an idea where the problem might be? I also updated my code. For some reason 4 threads do worse than 2 threads (even worse than 1 thread). I also applied what was given in the answer, but saw no change.
Another Edit
What I have noticed is that if I put a Thread.sleep(1) in my ParallelCollatz run() method, the performance gradually increases with the thread count. Perhaps this detail tells someone what is wrong? However, no matter how many tasks I give it, without the Thread.sleep(1) 2 threads perform fastest, 1 thread is in second place, and the other thread counts hover around a similar number of milliseconds, slower than both 1 and 2 threads.
New Edit
I also tried putting more work (a for loop calculating not 1 but 10 or 100 Collatz sequences) in the run() method of the Runnable class so that each thread would do more work per task. Unfortunately, this did not help either.
Perhaps I'm launching the tasks incorrectly? Anyone any ideas?
EDIT
So it would seem that adding more work to the run method fixes it a bit, but the issue still remains for larger thread counts (8+). I still wonder whether the cause is that it takes more time to create and run the threads than to execute the task. Or should I create a new post with this question?
You are not waiting for your tasks to complete; you are only measuring the time it takes to submit them to the executor.
executor.shutdown() does not wait for all tasks to finish. You need to call executor.awaitTermination after that.
executor.shutdown();
executor.awaitTermination(5, TimeUnit.HOURS);
https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ExecutorService.html#shutdown()
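To make the difference concrete, here is a small self-contained sketch that counts how many tasks have actually run by the time awaitTermination returns. The class name, the task body, and the one-minute timeout are assumptions chosen for illustration only.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class AwaitTerminationTiming {
    // Submits `tasks` trivial jobs and returns how many had finished
    // by the time awaitTermination returned.
    static int runAndWait(int threads, int tasks) {
        final AtomicInteger finished = new AtomicInteger();
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < tasks; i++) {
            executor.execute(new Runnable() {
                @Override
                public void run() {
                    finished.incrementAndGet();
                }
            });
        }
        executor.shutdown(); // stops accepting new tasks and returns immediately
        try {
            // blocks until the queued tasks are done or the timeout expires
            if (!executor.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("tasks did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return finished.get();
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        int done = runAndWait(4, 100000);
        long millis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        System.out.println(done + " tasks finished in " + millis + " ms");
    }
}
```

Reading the end timestamp only after awaitTermination has returned true is what makes the measured interval cover the work itself rather than just the submission loop.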
Update
I believe that the testing methodology is flawed. I repeated your test on my machine (1 processor, 2 cores, 4 logical processors) and the time measured differed wildly from run to run.
I believe the following are main reasons:
JVM startup & JIT compilation time. At the beginning, the code is running in interpreted mode.
The result of the calculation is ignored. I have no intuition about what is removed by the JIT and what we are actually measuring.
println calls in the code.
To test this, I converted your test to JMH.
In particular:
I converted the Runnable to a Callable, and I return the sum of the results so that the JIT cannot discard the computation as dead code (alternatively, you can use Blackhole from JMH).
My tasks have no state, I moved all moving parts to local variables. No GC is needed to cleanup the tasks.
I still create executors in each round. This is not perfect, but I decided to keep it as is.
The results below are consistent with my expectations: one core is waiting in the main thread, the work is performed on a single core, and the numbers are roughly the same.
Benchmark Mode Cnt Score Error Units
SpeedTest.multipleThreads avgt 20 559.996 ± 20.181 ms/op
SpeedTest.singleThread avgt 20 562.048 ± 16.418 ms/op
Updated code:
public class ParallelCollatz implements Callable<Long> {
private final long inputNumInit;
public ParallelCollatz(long inputNumInit) {
this.inputNumInit = inputNumInit;
}
@Override
public Long call() {
long result = 1;
long inputNum = inputNumInit;
//main recursive computation
while (inputNum > 1) {
if (inputNum % 2 == 0) {
inputNum = inputNum / 2;
} else {
inputNum = inputNum * 3 + 1;
}
++result;
}
return result;
}
}
and the benchmark itself:
@State(Scope.Benchmark)
public class SpeedTest {
private static final int NUM_TASKS = 1000000;
private static List<ParallelCollatz> tasks = buildTasks();
@Benchmark
@Fork(value = 1, warmups = 1)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@SuppressWarnings("unused")
public long singleThread() throws Exception {
ThreadPoolExecutor executorOneThread = (ThreadPoolExecutor) Executors.newFixedThreadPool(1);
return measureTasks(executorOneThread, tasks);
}
@Benchmark
@Fork(value = 1, warmups = 1)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@SuppressWarnings("unused")
public long multipleThreads() throws Exception {
ThreadPoolExecutor executorMultipleThread = (ThreadPoolExecutor) Executors.newFixedThreadPool(4);
return measureTasks(executorMultipleThread, tasks);
}
private static long measureTasks(ThreadPoolExecutor executor, List<ParallelCollatz> tasks) throws InterruptedException, ExecutionException {
long sum = runTasksInExecutor(executor, tasks);
return sum;
}
private static long runTasksInExecutor(ThreadPoolExecutor executor, List<ParallelCollatz> tasks) throws InterruptedException, ExecutionException {
List<Future<Long>> futures = new ArrayList<>(NUM_TASKS);
for (int i = 0; i < NUM_TASKS; i++) {
Future<Long> f = executor.submit(tasks.get(i));
futures.add(f);
}
executor.shutdown();
boolean tempFirst = false;
try {
tempFirst = executor.awaitTermination(5, TimeUnit.HOURS);
} catch (InterruptedException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
}
long sum = 0l;
for (Future<Long> f : futures) {
sum += f.get();
}
//System.out.println(sum);
return sum;
}
private static List<ParallelCollatz> buildTasks() {
List<ParallelCollatz> tasks = new ArrayList<>();
for (int i = 1; i <= NUM_TASKS; i++) {
ParallelCollatz task = new ParallelCollatz((long) (i + NUM_TASKS));
tasks.add(task);
}
return tasks;
}
}
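The question's later edits hint that one task per number may be too fine-grained. One way to test that idea is to submit one task per contiguous chunk of the interval instead of one task per number. The sketch below is only an illustration under that assumption: the class name and the chunking scheme are mine, and whether it actually scales on a given machine would still need to be measured, ideally with JMH as above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ChunkedCollatz {
    // Number of terms in the Collatz sequence starting at n (same loop as call() above).
    static long steps(long n) {
        long result = 1;
        while (n > 1) {
            n = (n % 2 == 0) ? n / 2 : n * 3 + 1;
            ++result;
        }
        return result;
    }

    // Sums steps(n) for n in [from, to) on the calling thread.
    static long sumRange(long from, long to) {
        long sum = 0;
        for (long n = from; n < to; n++) {
            sum += steps(n);
        }
        return sum;
    }

    // Splits [from, to) into at most `threads` contiguous chunks, one task each.
    static long sumParallel(long from, long to, int threads) {
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> parts = new ArrayList<Future<Long>>();
            long chunk = (to - from + threads - 1) / threads; // ceiling division
            for (long lo = from; lo < to; lo += chunk) {
                final long start = lo;
                final long end = Math.min(to, lo + chunk);
                parts.add(executor.submit(new Callable<Long>() {
                    @Override
                    public Long call() {
                        return sumRange(start, end);
                    }
                }));
            }
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get(); // waits for that chunk to finish
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(sumParallel(1000001, 2000001, 4));
    }
}
```

With this shape the executor queue holds a handful of tasks instead of a million, so queue handling and Future allocation become a much smaller share of the total work.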
In Java, I have simple multithreaded code:
public class ThreadedAlgo {
public static final int threadsCount = 3;
public static void main(String[] args) {
// start timer prior computation
long time = System.currentTimeMillis();
// create threads
Thread[] threads = new Thread[threadsCount];
class ToDo implements Runnable {
public void run() { ... }
}
// create job objects
for (int i = 0; i < threadsCount; i++) {
ToDo job = new ToDo();
threads[i] = new Thread(job);
}
// start threads
for (int i = 0; i < threadsCount; i++) {
threads[i].start();
}
// wait for threads above to finish
for (int i = 0; i < threadsCount; i++) {
try {
threads[i].join();
} catch (InterruptedException e) {
e.printStackTrace();
}
}
// display time after computation
System.out.println("Execution time: " + (System.currentTimeMillis() - time));
}
}
It works fine. Now I want to run it with 2 or 3 threads and measure the time each thread spends on its computation. Then I will compare the times: denote them by t1 and t2, and if |t1 - t2| < some small epsilon, I will say that my algorithm performs with fine granularity under the given conditions, that is, the time spent by the threads is roughly the same.
How can I measure the time of a thread?
Use System.nanoTime() at the beginning and end of the thread (job) methods to calculate the total time spent in each invocation. In your case, all threads run with the same (default) priority, so time slices should be distributed fairly evenly. If your threads contend for locks, use 'fair locks' for the same reason, e.g. new ReentrantLock(true);
Add the timing logic inside your run methods.
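Both suggestions can be sketched as follows: each thread records its own elapsed time into its own array slot, and the main thread reads the slots after join(). The class name, the helper method, and the dummy workload in main are hypothetical. Note that this measures wall-clock time per thread, not CPU time.

```java
class PerThreadTiming {
    // Runs `threadsCount` copies of `job` and returns each thread's
    // elapsed wall-clock time in nanoseconds.
    static long[] timeThreads(int threadsCount, final Runnable job) {
        final long[] elapsed = new long[threadsCount];
        Thread[] threads = new Thread[threadsCount];
        for (int i = 0; i < threadsCount; i++) {
            final int index = i;
            threads[i] = new Thread(new Runnable() {
                @Override
                public void run() {
                    long start = System.nanoTime();
                    job.run();
                    elapsed[index] = System.nanoTime() - start; // own slot only
                }
            });
        }
        for (Thread t : threads) {
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join(); // after join, the thread's write to elapsed is visible here
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return elapsed;
    }

    public static void main(String[] args) {
        long[] elapsed = timeThreads(2, new Runnable() {
            @Override
            public void run() {
                double x = 0;
                for (int i = 1; i <= 5000000; i++) {
                    x += Math.sqrt(i);
                }
                if (x < 0) {
                    System.out.println(x); // never true; only keeps x in use
                }
            }
        });
        System.out.println("t1 = " + elapsed[0] + " ns, t2 = " + elapsed[1] + " ns");
        System.out.println("|t1 - t2| = " + Math.abs(elapsed[0] - elapsed[1]) + " ns");
    }
}
```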
I have what probably is a basic question. When I create 100 million Hashtables it takes approximately 6 seconds (runtime = 6 seconds per core) on my machine if I do it on a single core. If I do this multi-threaded on 12 cores (my machine has 6 cores that allow hyperthreading) it takes around 10 seconds (runtime = 112 seconds per core).
This is the code I use:
Main
public class Tests
{
public static void main(String args[])
{
double start = System.currentTimeMillis();
int nThreads = 12;
double[] runTime = new double[nThreads];
TestsThread[] threads = new TestsThread[nThreads];
int totalJob = 100000000;
int jobsize = totalJob/nThreads;
for(int i = 0; i < threads.length; i++)
{
threads[i] = new TestsThread(jobsize,runTime, i);
threads[i].start();
}
waitThreads(threads);
for(int i = 0; i < runTime.length; i++)
{
System.out.println("Runtime thread:" + i + " = " + (runTime[i]/1000000) + "ms");
}
double end = System.currentTimeMillis();
System.out.println("Total runtime = " + (end-start) + " ms");
}
private static void waitThreads(TestsThread[] threads)
{
for(int i = 0; i < threads.length; i++)
{
while(threads[i].finished == false)//keep waiting untill the thread is done
{
//System.out.println("waiting on thread:" + i);
try {
Thread.sleep(1);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
}
}
}
Thread
import java.util.HashMap;
import java.util.Map;
public class TestsThread extends Thread
{
int jobSize = 0;
double[] runTime;
boolean finished;
int threadNumber;
TestsThread(int job, double[] runTime, int threadNumber)
{
this.finished = false;
this.jobSize = job;
this.runTime = runTime;
this.threadNumber = threadNumber;
}
public void run()
{
double start = System.nanoTime();
for(int l = 0; l < jobSize ; l++)
{
double[] test = new double[65];
}
double end = System.nanoTime();
double difference = end-start;
runTime[threadNumber] += difference;
this.finished = true;
}
}
I do not understand why creating the objects simultaneously in multiple threads takes longer per thread than doing it serially in only 1 thread. If I remove the line where I create the Hashtable, this problem disappears. If anyone could help me with this I would be very grateful.
Update: This problem has an associated bug report and has been fixed with Java 1.7u40. And it was never an issue for Java 1.8 as Java 8 has an entirely different hash table algorithm.
Since you are not using the created objects, that operation will get optimized away. So you’re only measuring the overhead of creating threads, and that overhead grows with the number of threads you start.
I have to correct my answer regarding a detail I did not know before: there is something special about the classes Hashtable and HashMap. They both invoke sun.misc.Hashing.randomHashSeed(this) in the constructor. In other words, their instances escape during construction, which has an impact on memory visibility. This implies that their construction, unlike that of, say, an ArrayList, cannot be optimized away, and multi-threaded construction slows down due to what happens inside that method (i.e. synchronization).
As said, that’s special to these classes and of course this implementation (my setup:1.7.0_13). For ordinary classes the construction time goes straight to zero for such code.
Here I add a more sophisticated benchmark code. Watch the difference between DO_HASH_MAP = true and DO_HASH_MAP = false (when false it will create an ArrayList instead which has no such special behavior).
import java.util.*;
import java.util.concurrent.*;
public class AllocBench {
static final int NUM_THREADS = 1;
static final int NUM_OBJECTS = 100000000 / NUM_THREADS;
static final boolean DO_HASH_MAP = true;
public static void main(String[] args) throws InterruptedException, ExecutionException {
ExecutorService threadPool = Executors.newFixedThreadPool(NUM_THREADS);
Callable<Long> task=new Callable<Long>() {
public Long call() {
return doAllocation(NUM_OBJECTS);
}
};
long startTime=System.nanoTime(), cpuTime=0;
for(Future<Long> f: threadPool.invokeAll(Collections.nCopies(NUM_THREADS, task))) {
cpuTime+=f.get();
}
long time=System.nanoTime()-startTime;
System.out.println("Number of threads: "+NUM_THREADS);
System.out.printf("entire allocation required %.03f s%n", time*1e-9);
System.out.printf("time x numThreads %.03f s%n", time*1e-9*NUM_THREADS);
System.out.printf("real accumulated cpu time %.03f s%n", cpuTime*1e-9);
threadPool.shutdown();
}
static long doAllocation(int numObjects) {
long t0=System.nanoTime();
for(int i=0; i<numObjects; i++)
if(DO_HASH_MAP) new HashMap<Object, Object>(); else new ArrayList<Object>();
return System.nanoTime()-t0;
}
}
What about if you do it on 6 cores? Hyperthreading isn't the exact same as having double the cores, so you might want to try the amount of real cores too.
Also the OS won't necessarily schedule each of your threads to their own cores.
Since all you are doing is measuring the time and churning memory, your bottleneck is likely to be in your L3 cache or the bus to main memory. In this case, coordinating the work between threads could be producing so much overhead that the result is worse instead of better.
This is too long for a comment but your inner loop can be just
double start = System.nanoTime();
for(int l = 0; l < jobSize ; l++){
Map<String,Integer> test = new HashMap<String,Integer>();
}
// runtime is an AtomicLong for thread safety
runtime.addAndGet(System.nanoTime() - start); // time in nano-seconds.
Taking the time can be as slow as creating a HashMap, so you might not be measuring what you think you are if you call the timer too often.
BTW Hashtable is synchronized and you might find using HashMap is faster, and possibly more scalable.
I have threads of the following form:
Each execution of each thread is supposed to run a function in the class. That function is completely safe to run by itself. The function returns a value, say an int.
After all threads have been executed, the function values need to be accumulated.
So, it goes (in pseudo-code) something like that:
a = 0
for each i between 1 to N
spawn a thread independently and call the command v = f(i)
when thread finishes, do safely: a = a + v
end
I am not sure how to use Java in that case.
The problem is not creating the thread, I know this can be done using
new Thread() {
public void run() {
...
}
}
the problem is accumulating all the answers.
Thanks for any info.
I would probably do something like:
public class Main {
int a = 0;
int[] values;
int[] results;
public Main() {
// Init values array
results = new int[N];
}
public int doStuff() {
LinkedList<Thread> threads = new LinkedList<Thread>();
for (final int i : values) {
Thread t = new Thread() {
public void run() {
accumulate(foo(i));
}
};
threads.add(t);
t.start();
}
for (Thread t : threads) {
try {
t.join();
} catch (InterruptedException e) {
// Act accordingly, maybe ignore?
}
}
return a;
}
synchronized void accumulate(int v) {
// Synchronized because a += v is actually
// tmp = a + v;
// a = tmp;
// which can cause a race condition AFAIK
a += v;
}
}
Use an ExecutorCompletionService, an Executor, and a Callable.
Start with a Callable that calls your int function:
public class MyCallable implements Callable<Integer> {
private final int i;
public MyCallable(int i) {
this.i = i;
}
public Integer call() {
return Integer.valueOf(myFunction(i));
}
}
Create an Executor:
private final Executor executor = Executors.newFixedThreadPool(10);
10 is the maximum number of threads to execute at once.
Then wrap it in an ExecutorCompletionService and submit your jobs:
CompletionService<Integer> compService = new ExecutorCompletionService<Integer>(executor);
// Make sure to track the number of jobs you submit
int jobCount = 0;
for (int i = 0; i < n; i++) {
compService.submit(new MyCallable(i));
jobCount++;
}
// Get the results
int a = 0;
for (int i = 0; i < jobCount; i++) {
a += compService.take().get().intValue();
}
ExecutorCompletionService allows you to pull tasks off of a queue as they complete. This is a little different from joining threads. Although the overall outcome is the same, if you want to update a UI as the threads complete, you won't know what order the threads are going to complete using a join. That last for loop could be like this:
for (int i = 0; i < jobCount; i++) {
a += compService.take().get().intValue();
updateUi(a);
}
And this will update the UI as tasks complete. Using a Thread.join won't necessarily do this since you'll be getting the results in the order that you call the joins, not the order that the threads complete.
Through the use of the executor, this will also allow you to limit the number of simultaneous jobs you're running at a given time so you don't accidentally thread-bomb your system.
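If the order in which results arrive does not matter and nothing needs updating along the way, ExecutorService.invokeAll is a shorter route to the same sum. The sketch below assumes a hypothetical f(i) = i * i standing in for the question's function; the class and method names are also illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class SumWithInvokeAll {
    // Hypothetical stand-in for the question's f(i).
    static int f(int i) {
        return i * i;
    }

    // Computes f(1) + ... + f(n) using a fixed pool of `threads` workers.
    static int sum(int n, int threads) {
        List<Callable<Integer>> jobs = new ArrayList<Callable<Integer>>();
        for (int i = 1; i <= n; i++) {
            final int arg = i;
            jobs.add(new Callable<Integer>() {
                @Override
                public Integer call() {
                    return f(arg);
                }
            });
        }
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        try {
            int a = 0;
            // invokeAll returns only after every job has completed
            for (Future<Integer> result : executor.invokeAll(jobs)) {
                a += result.get(); // summed on the calling thread, no locking needed
            }
            return a;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(sum(10, 4)); // prints 385
    }
}
```

Because the accumulation happens on the calling thread only, no synchronized block or atomic variable is required for a.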
I am trying to understand threads in Java. As an exercise, I created an Ice Cream class as follows.
public class ThreadIceCream {
private String flavor = "";
private String[] specialFlavors = { "Vanilla", "Chocolate", "Butter Pecan", "Strawberry", "Chocolate Chip", "Cherry", "Coffee" };
// Constructor for ThreadIceCream class
public ThreadIceCream() {
int randInt = (int) (Math.random() * specialFlavors.length);
flavor = specialFlavors[randInt];
System.out.println("Enjoy your " + flavor + " IceCream!");
}
}
The ThreadIceCream class is a simple class that creates an IceCream object with a random flavor every time the class is initialized. Here is the TestStub I am using.
public class TestStub {
public static void main(String[] args) {
ThreadIceCream Th1 = new ThreadIceCream();
ThreadIceCream Th2 = new ThreadIceCream();
}
}
Now I want to create 10 ice creams (i.e. create 10 instances of the ThreadIceCream class simultaneously) and I want to use threads in Java to do this. I tried a few things but they were nowhere close.
Well it's not really that hard:
Thread[] threads = new Thread[10];
for(int i = 0; i < 10; i++) {
threads[i] = new Thread(new Runnable() {
public void run() {
ThreadIceCream tic = new ThreadIceCream();
}
});
threads[i].start();
}
for(int i = 0; i < 10; i++) {
threads[i].join();
}
Sure, this won't do much because the work performed by each thread is so small that the overhead to start the threads is actually higher, but whatever.
You should also learn to use the ExecutorService for higher efficiency. Pure threads are heavyweight and are rarely a good solution for anything, especially in groups. Here's an ExecutorService version of the above:
ExecutorService exec = Executors.newFixedThreadPool(10);
for(int i = 0; i < 10; i++) {
exec.submit(new Runnable() {
public void run() {
ThreadIceCream tic = new ThreadIceCream();
}
});
}
exec.shutdown();
exec.awaitTermination(Long.MAX_VALUE, TimeUnit.DAYS);
Here we are creating a pool of 10 threads and submitting 10 tasks. The threads are recycled between task executions, so at most 10 threads are ever created, no matter how many tasks you submit. Since the tasks are so small, several tasks may even be executed on the same thread, but that's actually a good thing.
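The recycling can be observed directly by recording which worker thread ran each task. This is a small sketch with assumed names (PoolReuseDemo, workerNames) and a deliberately small pool; how many distinct threads show up varies from run to run, but it can never exceed the pool size.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class PoolReuseDemo {
    // Runs `tasks` tiny jobs on a pool of `poolSize` threads and returns
    // the names of the worker threads that executed at least one job.
    static Set<String> workerNames(int poolSize, int tasks) {
        final Set<String> names = Collections.synchronizedSet(new HashSet<String>());
        ExecutorService exec = Executors.newFixedThreadPool(poolSize);
        for (int i = 0; i < tasks; i++) {
            exec.submit(new Runnable() {
                @Override
                public void run() {
                    names.add(Thread.currentThread().getName());
                }
            });
        }
        exec.shutdown();
        try {
            if (!exec.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("tasks did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        // 10 tasks, but never more than 3 distinct worker threads
        System.out.println(workerNames(3, 10));
    }
}
```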