I am learning java.util.concurrent and found an article about performance (http://www.javamex.com/tutorials/concurrenthashmap_scalability.shtml). I decided to repeat a small part of that performance test for study purposes, so I have written a write test for HashMap and ConcurrentHashMap.
I have two questions about it:
Is it true that, for the best performance, I should use a number of threads equal to the number of CPU cores?
I understand that performance varies from platform to platform. But the results show that HashMap is a little faster than ConcurrentHashMap. I think it should be the same, or vice versa. Maybe I have made a mistake in my code.
Any criticism is welcome.
package Concurrency;
import java.util.concurrent.*;
import java.util.*;
class Writer2 implements Runnable {
private Map<String, Integer> map;
private static int index;
private int nIteration;
private Random random = new Random();
char[] chars = "abcdefghijklmnopqrstuvwxyz".toCharArray();
private final CountDownLatch latch;
public Writer2(Map<String, Integer> map, int nIteration, CountDownLatch latch) {
this.map = map;
this.nIteration = nIteration;
this.latch = latch;
}
private synchronized String getNextString() {
StringBuilder sb = new StringBuilder();
for (int i = 0; i < 5; i++) {
char c = chars[random.nextInt(chars.length)];
sb.append(c);
}
sb.append(index);
if(map.containsKey(sb.toString()))
System.out.println("duplicate:" + sb.toString());
return sb.toString();
}
private synchronized int getNextInt() { return index++; }
@Override
public void run() {
while(nIteration-- > 0) {
map.put(getNextString(), getNextInt());
}
latch.countDown();
}
}
public class FourtyTwo {
static final int nIteration = 100000;
static final int nThreads = 4;
static Long testMap(Map<String, Integer> map) throws InterruptedException{
String name = map.getClass().getSimpleName();
CountDownLatch latch = new CountDownLatch(nThreads);
long startTime = System.currentTimeMillis();
ExecutorService exec = Executors.newFixedThreadPool(nThreads);
for(int i = 0; i < nThreads; i++)
exec.submit(new Writer2(map, nIteration, latch));
latch.await();
exec.shutdown();
long endTime = System.currentTimeMillis();
System.out.format(name + ": that took %,d milliseconds %n", (endTime - startTime));
return (endTime - startTime);
}
public static void main(String[] args) throws InterruptedException {
ArrayList<Long> result = new ArrayList<Long>() {
@Override
public String toString() {
Long result = 0L;
long size = this.size();
for(Long i : this)
result += i;
return String.valueOf(result/size);
}
};
Map<String, Integer> map1 = Collections.synchronizedMap(new HashMap<String, Integer>());
Map<String, Integer> map2 = new ConcurrentHashMap<>();
System.out.println("Running test...");
for(int i = 0; i < 5; i++) {
//result.add(testMap(map1));
result.add(testMap(map2));
}
System.out.println("Average time:" + result + " milliseconds");
}
}
/*
OUTPUT:
ConcurrentHashMap: that took 5 727 milliseconds
ConcurrentHashMap: that took 2 349 milliseconds
ConcurrentHashMap: that took 9 530 milliseconds
ConcurrentHashMap: that took 25 931 milliseconds
ConcurrentHashMap: that took 1 056 milliseconds
Average time:8918 milliseconds
SynchronizedMap: that took 6 471 milliseconds
SynchronizedMap: that took 2 444 milliseconds
SynchronizedMap: that took 9 678 milliseconds
SynchronizedMap: that took 10 270 milliseconds
SynchronizedMap: that took 7 206 milliseconds
Average time:7213 milliseconds
*/
One
The ideal number of threads varies, not by CPU, but by what you are doing. If, for example, what you are doing with your threads is highly disk-intensive, your CPU isn't likely to be maxed out, so running 8 threads may just cause heavy thrashing. If, however, you have huge amounts of disk activity, followed by heavy computation, followed by more disk activity, you would benefit from staggering the threads, splitting out your activities and grouping them, as well as using more threads. You would, for example, in such a case, likely want to group together file activity that uses a single file, but maybe not activity where you are pulling from a bunch of files (unless they are written contiguously on the disk). Of course, if you overthink disk IO, you could seriously hurt your performance, but I'm making a point of saying that you shouldn't just shirk it, either. In such a program, I would probably have threads dedicated to disk IO and threads dedicated to CPU work. Divide and conquer. You'd have fewer IO threads and more CPU threads.
It is common for a synchronous server to run many more threads than cores/CPUs because most of those threads either do work for only a short time or don't do much CPU intensive work. It's not useful to have 500 threads, though, if you will only ever have 2 clients and the context switching of those excess threads hampers performance. It's a balancing act that often requires a little bit of tuning.
In short
Think about what you are doing
Network activity is light, so more threads are generally good
CPU-intensive things don't do much good if you have 2x more of those threads than cores... usually a little more than 1x or a little less than 1x is optimal, but you have to test, test, test
Having 10 disk-IO-intensive threads may hurt all 10 threads, just like having 30 CPU-intensive threads... the thrashing hurts them all
Try to spread out the pain
See if it helps to spread out the CPU, IO, etc, work or if clustering is better... it will depend on what you are doing
Try to group things up
If you can, separate out your disk, IO, and network tasks and give them their own threads that are tuned to those tasks
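As a starting point for the CPU-bound case described above, here is a minimal sketch (the class name PoolSizing and the placeholder work are mine, not from the question) that sizes a fixed pool from the machine's core count:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PoolSizing {
    public static void main(String[] args) throws InterruptedException {
        // For CPU-bound work, a pool sized to the core count is a sensible starting point
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService cpuPool = Executors.newFixedThreadPool(cores);

        for (int i = 0; i < cores * 4; i++) {
            final int taskId = i;
            cpuPool.submit(() -> {
                long sum = 0;
                // placeholder for real CPU-bound work
                for (int j = 0; j < 1_000_000; j++) sum += j;
                return sum + taskId;
            });
        }
        cpuPool.shutdown();
        cpuPool.awaitTermination(1, TimeUnit.MINUTES);
        System.out.println("Done with " + cores + " threads");
    }
}
```

From there, tune up or down (and measure) as the answer suggests; IO-heavy tasks would get their own, separately sized pool.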
Two
In general, thread-unsafe methods run faster. Similarly, localized synchronization runs faster than synchronizing an entire method. As such, HashMap is normally significantly faster than ConcurrentHashMap. Another example would be StringBuffer compared to StringBuilder. StringBuffer is synchronized, and it is not only slower; the synchronization itself is heavier (more code, etc.). It should rarely be used. StringBuilder, however, is unsafe if you have multiple threads hitting it. With that said, StringBuffer and ConcurrentHashMap can race, too. Being "thread-safe" doesn't mean that you can just use it without thought, particularly given the way these two classes operate. For example, you can still have a race condition if you are reading and writing at the same time (say, using contains(Object) while you are doing a put or remove). If you want to prevent such things, you have to use your own class or synchronize your calls to your ConcurrentHashMap.
I generally use the non-concurrent maps and collections and just use my own locks where I need them. You'll find that it's much faster that way, and the control is great. Atomics (e.g. AtomicInteger) are nice sometimes, but really not generally useful for what I do. Play with the classes, play with synchronization, and you'll find that you can master them more efficiently than with the shotgun approach of ConcurrentHashMap, StringBuffer, etc. You can have race conditions whether or not you use those classes if you don't do it right... but if you do it yourself, you can also be much more efficient and more careful.
Example
Note that we have a new Object that we are locking on. Use this instead of synchronized on a method.
public final class Fun<K, V> { // sketch; assumes Fun implements Map<K, V>
private final Object lock = new Object();
/*
 * (non-Javadoc)
 *
 * @see java.util.Map#clear()
 */
@Override
public void clear() {
// Doing things...
synchronized (this.lock) {
// Where we do sensitive work
}
}
/*
 * (non-Javadoc)
 *
 * @see java.util.Map#put(java.lang.Object, java.lang.Object)
 */
@Override
public V put(final K key, @Nullable final V value) {
// Doing things...
synchronized (this.lock) {
// Where we do sensitive work
}
// Doing things...
return null; // the previous value would be returned here
}
}
And From Your Code...
I might not put that sb.append(index) in the lock or might have a separate lock for index calls, but...
private final Object lock = new Object();
private String getNextString() {
StringBuilder sb = new StringBuilder();
for (int i = 0; i < 5; i++) {
char c = chars[random.nextInt(chars.length)];
sb.append(c);
}
synchronized (lock) {
sb.append(index);
if (map.containsKey(sb.toString()))
System.out.println("duplicate:" + sb.toString());
}
return sb.toString();
}
private int getNextInt() {
synchronized (lock) {
return index++;
}
}
The article you linked doesn't state that ConcurrentHashMap is generally faster than a synchronized HashMap, only that it scales better; i.e. it is faster for a large number of threads. As you can see in their graph, for 4 threads the performance is very similar.
Apart from that, you should test with more items; as my results show, they can vary quite a bit:
Running test...
SynchronizedMap: that took 13,690 milliseconds
SynchronizedMap: that took 8,210 milliseconds
SynchronizedMap: that took 11,598 milliseconds
SynchronizedMap: that took 9,509 milliseconds
SynchronizedMap: that took 6,992 milliseconds
Average time:9999 milliseconds
Running test...
ConcurrentHashMap: that took 10,728 milliseconds
ConcurrentHashMap: that took 7,227 milliseconds
ConcurrentHashMap: that took 6,668 milliseconds
ConcurrentHashMap: that took 7,071 milliseconds
ConcurrentHashMap: that took 7,320 milliseconds
Average time:7802 milliseconds
Note that your code isn't clearing the Map between loops, adding nIteration more items every time... is that what you wanted?
synchronized isn't necessary on your getNextInt/getNextString as written: each Writer2 instance is only ever used by one thread, and the lock is on this, i.e. per instance, so it excludes nothing across threads anyway. Note, though, that the static index field is shared across all instances, so incrementing it without a common lock is still a race.
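On the point about the map never being cleared: if the intent was for every timed run to start from an empty map, a clear() between runs avoids measuring an ever-growing map. A standalone sketch of the idea (class name ClearBetweenRuns is mine):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ClearBetweenRuns {
    public static void main(String[] args) {
        Map<String, Integer> map = new ConcurrentHashMap<>();
        for (int run = 0; run < 3; run++) {
            map.clear(); // each run starts from an empty map
            for (int i = 0; i < 1000; i++) {
                map.put("key" + i, i);
            }
            // prints size=1000 for every run instead of a growing count
            System.out.println("run " + run + " size=" + map.size());
        }
    }
}
```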
Related
Guava's Enums.getIfPresent(Class, String) calls Enums.getEnumConstants under the hood:
@GwtIncompatible // java.lang.ref.WeakReference
static <T extends Enum<T>> Map<String, WeakReference<? extends Enum<?>>> getEnumConstants(
Class<T> enumClass) {
synchronized (enumConstantCache) {
Map<String, WeakReference<? extends Enum<?>>> constants = enumConstantCache.get(enumClass);
if (constants == null) {
constants = populateCache(enumClass);
}
return constants;
}
}
Why does it need a synchronized block? Wouldn't that incur a heavy performance penalty? Java's Enum.valueOf(Class, String) does not appear to need one. Further, if synchronization is indeed necessary, why do it so inefficiently? One would hope that if the enum is present in the cache, it can be retrieved without locking, and that you only lock if the cache needs to be populated.
For Reference: Maven Dependency
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<version>23.2-jre</version>
</dependency>
Edit: By locking I'm referring to double-checked locking.
I've accepted @maaartinus's answer, but wanted to write a separate "answer" about the circumstances behind the question and the interesting rabbit hole it led me to.
tl;dr - Use Java's Enum.valueOf, which is thread-safe and does not synchronize, unlike Guava's Enums.getIfPresent. Also, in the majority of cases it probably doesn't matter.
Long story:
I'm working on a codebase that utilizes lightweight Java threads (Quasar Fibers). In order to harness the power of Fibers, the code they run should be primarily async and non-blocking, because Fibers are multiplexed onto Java/OS threads. It becomes very important that individual Fibers do not "block" the underlying thread; if the underlying thread is blocked, it will block all Fibers running on it and performance degrades considerably. Guava's Enums.getIfPresent is one of those blockers, and I'm certain it can be avoided.
Initially, I started using Guava's Enums.getIfPresent because it returns null on invalid enum values, unlike Java's Enum.valueOf, which throws IllegalArgumentException (which to my taste is less preferable than a null value).
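For reference, a null-returning behavior can be had without Guava by wrapping Enum.valueOf; this is a sketch, and the class and method names (Enums2, valueOfOrNull) are mine, not a standard API:

```java
public final class Enums2 {
    // Hypothetical helper: Enum.valueOf semantics, but null instead of an exception
    public static <T extends Enum<T>> T valueOfOrNull(Class<T> type, String name) {
        if (name == null) return null;
        try {
            return Enum.valueOf(type, name);
        } catch (IllegalArgumentException e) {
            return null; // invalid constant name
        }
    }

    enum Color { RED, GREEN }

    public static void main(String[] args) {
        System.out.println(valueOfOrNull(Color.class, "RED"));    // RED
        System.out.println(valueOfOrNull(Color.class, "PURPLE")); // null
    }
}
```

As the benchmark below shows, the try/catch makes the invalid-value path comparatively expensive, which is fine if invalid inputs are rare.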
Here is a crude benchmark comparing various methods of converting to enums:
Java's Enum.valueOf with catching IllegalArgumentException to return null
Guava's Enums.ifPresent
Apache Commons Lang EnumUtils.getEnum
Apache Commons Lang 3 EnumUtils.getEnum
My Own Custom Immutable Map Lookup
Notes:
Apache Commons Lang 3 uses Java's Enum.valueOf under the hood and is hence identical
Earlier versions of Apache Commons Lang use a very similar WeakHashMap solution to Guava's, but without synchronization. They favor cheap reads and more expensive writes (my knee-jerk reaction says that's how Guava should have done it)
Java's decision to throw IllegalArgumentException is likely to have a small cost associated with it when dealing with invalid enum values. Throwing/catching exceptions isn't free.
Guava is the only method here that uses synchronization
Benchmark Setup:
uses an ExecutorService with a fixed thread pool of 10 threads
submits 100K Runnable tasks to convert enums
each Runnable task converts 100 enums
each method of converting enums will convert 10 million strings (100K x 100)
Benchmark Results from a run:
Convert valid enum string value:
JAVA -> 222 ms
GUAVA -> 964 ms
APACHE_COMMONS_LANG -> 138 ms
APACHE_COMMONS_LANG3 -> 149 ms
MY_OWN_CUSTOM_LOOKUP -> 160 ms
Try to convert INVALID enum string value:
JAVA -> 6009 ms
GUAVA -> 734 ms
APACHE_COMMONS_LANG -> 65 ms
APACHE_COMMONS_LANG3 -> 5558 ms
MY_OWN_CUSTOM_LOOKUP -> 92 ms
These numbers should be taken with a heavy grain of salt and will change depending on other factors. But they were good enough for me to conclude to go with Java's solution for the codebase using Fibers.
Benchmark Code:
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import com.google.common.base.Enums;
import com.google.common.collect.ImmutableMap;
import com.google.common.collect.ImmutableMap.Builder;
public class BenchmarkEnumValueOf {
enum Strategy {
JAVA,
GUAVA,
APACHE_COMMONS_LANG,
APACHE_COMMONS_LANG3,
MY_OWN_CUSTOM_LOOKUP;
private final static ImmutableMap<String, Strategy> lookup;
static {
Builder<String, Strategy> immutableMapBuilder = ImmutableMap.builder();
for (Strategy strategy : Strategy.values()) {
immutableMapBuilder.put(strategy.name(), strategy);
}
lookup = immutableMapBuilder.build();
}
static Strategy toEnum(String name) {
return name != null ? lookup.get(name) : null;
}
}
public static void main(String[] args) {
final int BENCHMARKS_TO_RUN = 1;
System.out.println("Convert valid enum string value:");
for (int i = 0; i < BENCHMARKS_TO_RUN; i++) {
for (Strategy strategy : Strategy.values()) {
runBenchmark(strategy, "JAVA", 100_000);
}
}
System.out.println("\nTry to convert INVALID enum string value:");
for (int i = 0; i < BENCHMARKS_TO_RUN; i++) {
for (Strategy strategy : Strategy.values()) {
runBenchmark(strategy, "INVALID_ENUM", 100_000);
}
}
}
static void runBenchmark(Strategy strategy, String enumStringValue, int iterations) {
ExecutorService executorService = Executors.newFixedThreadPool(10);
long timeStart = System.currentTimeMillis();
for (int i = 0; i < iterations; i++) {
executorService.submit(new EnumValueOfRunnable(strategy, enumStringValue));
}
executorService.shutdown();
try {
executorService.awaitTermination(1000, TimeUnit.SECONDS);
} catch (InterruptedException e) {
throw new RuntimeException(e);
}
long timeDuration = System.currentTimeMillis() - timeStart;
System.out.println("\t" + strategy.name() + " -> " + timeDuration + " ms");
}
static class EnumValueOfRunnable implements Runnable {
Strategy strategy;
String enumStringValue;
EnumValueOfRunnable(Strategy strategy, String enumStringValue) {
this.strategy = strategy;
this.enumStringValue = enumStringValue;
}
@Override
public void run() {
for (int i = 0; i < 100; i++) {
switch (strategy) {
case JAVA:
try {
Enum.valueOf(Strategy.class, enumStringValue);
} catch (IllegalArgumentException e) {}
break;
case GUAVA:
Enums.getIfPresent(Strategy.class, enumStringValue);
break;
case APACHE_COMMONS_LANG:
org.apache.commons.lang.enums.EnumUtils.getEnum(Strategy.class, enumStringValue);
break;
case APACHE_COMMONS_LANG3:
org.apache.commons.lang3.EnumUtils.getEnum(Strategy.class, enumStringValue);
break;
case MY_OWN_CUSTOM_LOOKUP:
Strategy.toEnum(enumStringValue);
break;
}
}
}
}
}
I guess the reason is simply that enumConstantCache is a WeakHashMap, which is not thread-safe.
Two threads writing to the cache at the same time could end up in an endless loop or the like (at least, such a thing happened with HashMap when I tried it years ago).
I guess you could use DCL, but it may not be worth it (as stated in a comment).
Further, if synchronization is indeed necessary, why do it so inefficiently? One would hope that if the enum is present in the cache, it can be retrieved without locking. Only lock if the cache needs to be populated.
This may get too tricky. For visibility using volatile, you need a volatile read paired with a volatile write. You could get the volatile read easily by declaring enumConstantCache volatile instead of final. The volatile write is trickier. Something like
enumConstantCache = enumConstantCache;
may work, but I'm not sure about that.
10 threads, each one having to convert String values to Enums and then perform some task
The Map access is usually much faster than anything you do with the obtained value, so I guess you'd need many more threads to run into a problem.
Unlike HashMap, WeakHashMap needs to perform some cleanup (called expungeStaleEntries). This cleanup gets performed even in get (via getTable). So get is a modifying operation, and you really don't want to execute it concurrently.
Note that reading a WeakHashMap without synchronization means performing mutation without locking, which is plain wrong, and not only in theory.
You'd need your own version of WeakHashMap that performs no mutations in get (which is simple) and guarantees some sane behavior when it is written to during a read by a different thread (which may or may not be possible).
I guess something like SoftReference<ImmutableMap<String, Enum<?>>> with some re-loading logic could work well.
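One possible shape of that idea, sketched with plain JDK types instead of Guava's ImmutableMap (the class and method names here are mine, and the race noted in the comment is deliberately tolerated):

```java
import java.lang.ref.SoftReference;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public final class EnumCache {
    enum Color { RED, GREEN }

    // Hypothetical sketch: a softly referenced immutable snapshot, rebuilt on demand.
    // Reads never mutate the published map, so the fast path needs no lock.
    private static volatile SoftReference<Map<String, Color>> cache =
            new SoftReference<>(null);

    static Color lookup(String name) {
        Map<String, Color> map = cache.get();
        if (map == null) {
            map = reload(); // may race; worst case the map is built twice
            cache = new SoftReference<>(map);
        }
        return map.get(name);
    }

    private static Map<String, Color> reload() {
        Map<String, Color> m = new HashMap<>();
        for (Color c : Color.values()) m.put(c.name(), c);
        return Collections.unmodifiableMap(m); // immutable snapshot, safe to share
    }

    public static void main(String[] args) {
        System.out.println(lookup("RED"));  // RED
        System.out.println(lookup("NOPE")); // null
    }
}
```

Unlike WeakHashMap, the published map is never mutated after construction, so concurrent readers are safe; the SoftReference lets the GC reclaim the cache under memory pressure.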
I'm implementing an online store where customers drop in an order and my program generates a summary of who the client is and what items they bought.
public class Summarizer{
private TreeSet<Order> allOrders = new TreeSet<Order>(new OrderComparator());
public void oneThreadProcessing(){
Thread t1;
for(Order order: allOrders){
t1 = new Thread(order);
t1.start();
try {
t1.join();
} catch (InterruptedException e) {
e.printStackTrace();
}
}
}
public void multipleThreadProcessing(){
Thread[] threads = new Thread[allOrders.size()];
int i = 0;
for(Order order: allOrders){
threads[i] = new Thread(order);
threads[i].start();
i++;
}
for(Thread t: threads){
try {
t.join();
} catch (InterruptedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
public static void main (String[] args) {
Summarizer s = new Summarizer();
long startTime = System.currentTimeMillis();
s.oneThreadProcessing();
long endTime = System.currentTimeMillis();
System.out.println("Processing time (msec): " + (endTime - startTime));
System.out.println("-------------Multiple Thread-------------------------------");
long startTime1 = System.currentTimeMillis();
s.multipleThreadProcessing();
long endTime1 = System.currentTimeMillis();
System.out.println("Processing time (msec): " + (endTime1 - startTime1));
}
}
This is my Order class:
public class Order implements Runnable, Comparable<Order> {
private int clientId;
@Override
public void run() {
/*
Print out summary in the form:
Client id: 1001
Item: Shoes, Quantity: 2, Cost per item: $30.00, Total Cost: $60.00
Item: Bag, Quantity: 1, Cost per item: $15.00, Total Cost: $15.00
Total: $75.00
*/
}
}
Assuming I fill this TreeSet with all the orders in a particular day, sorted by client ID, I want to generate the summaries of all these orders using only one thread, and then using multiple threads, and compare the performance. The requirement is for the multithreaded version to perform better, and I would assume it would, but every time I run my main method the multithreaded version is almost always slower. Am I doing something wrong? I know one can never be certain that a multithreaded program will be faster, but since I'm not dealing with any locking yet, shouldn't my program be faster in multipleThreadProcessing()? Is there any way to ensure it is faster?
Multithreading is not a magic powder one sprinkles onto code to make it run faster. It requires careful thought about what you're doing.
Let's analyze the multi threaded code you've written.
You have many Orders (How many btw? dozens? hundreds? thousands? millions?) and when executed, each order prints itself to the screen. That means that you're spawning a possibly enormous number of threads, just so each of them would spend most of its time waiting for System.out to become available (since it would be occupied by other threads). Each thread requires OS and JVM involvement, plus it requires CPU time, and memory resources; so why should such a multi threaded approach run faster than a single thread? A single thread requires no additional memory, and no additional context switching, and doesn't have to wait for System.out. So naturally it would be faster.
To get the multi threaded version to work faster, you need to think of a way to distribute the work between threads without creating unnecessary contention over resources. For one, you probably don't need more than one thread performing the I/O task of writing to the screen. Multiple threads writing to System.out would just block each other. If you have several IO devices you need to write to, then create one or more threads for each (depending on the nature of the device). Computational tasks can sometimes be executed quicker in parallel, but:
A) You don't need more threads doing computation work than the number of cores your CPU has. If you have too many CPU-bound threads, they'd just waste time context switching.
B) You need to reason through your parallel processing, and plan it thoroughly. Give each thread a chunk of the computational tasks that need to be done, then combine the results of each of them.
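To make that concrete for the order-summary case: build the summary strings in parallel on a small, core-sized pool, and let a single thread do all the System.out printing. This is only a sketch; the class name SummaryDemo and the trivial "formatting" stand in for the question's real Order logic:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SummaryDemo {
    // CPU-bound formatting runs on a pool sized to the core count
    static List<String> summarize(List<String> clientIds) throws Exception {
        ExecutorService pool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        List<Future<String>> futures = new ArrayList<>();
        for (String id : clientIds) {
            futures.add(pool.submit(() -> "Client id: " + id)); // stand-in for real formatting
        }
        List<String> summaries = new ArrayList<>();
        for (Future<String> f : futures) {
            summaries.add(f.get()); // preserves submission order
        }
        pool.shutdown();
        return summaries;
    }

    public static void main(String[] args) throws Exception {
        // A single thread does all the System.out I/O, so threads never contend on it
        for (String s : summarize(List.of("1001", "1002", "1003"))) {
            System.out.println(s);
        }
    }
}
```

The design choice is exactly the split described above: many CPU threads for the computation, one thread for the IO device.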
For massively parallel computing I tend to use executors and callables. When I have thousands of objects to be computed, I don't feel good about instantiating thousands of Runnables, one for each object.
So I have two approaches to solve this:
I. Split the workload across a small number of workers: x workers for y objects in total (splitting the object list into x partitions of roughly y/x elements each).
public static <V> List<List<V>> partitions(List<V> list, int chunks) {
final ArrayList<List<V>> lists = new ArrayList<List<V>>();
final int size = Math.max(1, list.size() / chunks + 1);
final int listSize = list.size();
for (int i = 0; i <= chunks; i++) {
final List<V> vs = list.subList(Math.min(listSize, i * size), Math.min(listSize, i * size + size));
if(vs.size() == 0) break;
lists.add(vs);
}
return lists;
}
II. Create x workers which fetch objects from a queue.
Questions:
Is creating thousands of Runnables really expensive, and to be avoided?
Is there a generic pattern/recommendation how to do it by solution II?
Are you aware of a different approach?
Creating thousands of Runnables (objects implementing Runnable) is no more expensive than creating normal objects.
Creating and running thousands of Threads can be very heavy, but you can use Executors with a pool of threads to solve this problem.
As for the different approach, you might be interested in java 8's parallel streams.
Combining various answers here :
Is creating thousands of Runnables really expensive, and to be avoided?
No, it's not in and of itself. It's how you will make them execute that may prove costly (spawning a few thousand threads certainly has its cost).
So you would not want to do this :
List<Computation> computations = ...
List<Thread> threads = new ArrayList<>();
for (Computation computation : computations) {
Thread thread = new Thread(computation); // assuming Computation implements Runnable
threads.add(thread);
thread.start();
}
// If you need to wait for completion:
for (Thread t : threads) {
t.join();
}
Because it would 1) be unnecessarily costly in terms of OS resources (native threads, each having a stack on the heap), 2) spam the OS scheduler with a vastly concurrent workload, most certainly leading to plenty of context switches and associated cache invalidations at the CPU level, and 3) be a nightmare to catch and deal with exceptions (your threads should probably define an uncaught-exception handler, and you'd have to deal with it manually).
You'd probably prefer an approach where a finite Thread pool (of a few threads, "a few" being closely related to your number of CPU cores) handles many many Callables.
List<Computation> computations = ...
ExecutorService pool = Executors.newFixedThreadPool(someNumber);
List<Future<Result>> results = new ArrayList<>();
for (Computation computation : computations) {
results.add(pool.submit(new ComputationCallable(computation)));
}
for (Future<Result> result : results) {
doSomething(result.get());
}
The fact that you reuse a limited number of threads should yield a really nice improvement.
Is there a generic pattern/recommendation how to do it by solution II?
There are. First, your partition code (getting from a List<V> to a List<List<V>>) can be found inside collection tools such as Guava, with more generic and fail-proofed implementations.
But more than this, two patterns come to mind for what you are achieving :
Use the Fork/Join Pool with Fork/Join tasks (that is, spawn a task with your whole list of items, and each task will fork sub tasks with half of that list, up to the point where each task manages a small enough list of items). It's divide and conquer. See: http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ForkJoinTask.html
If your computation were to be "add the integers from a list", it could look like this (there might be a boundary bug in there; I did not really check):
public static class Adder extends RecursiveTask<Integer> {
protected List<Integer> globalList;
protected int start;
protected int stop;
public Adder(List<Integer> globalList, int start, int stop) {
super();
this.globalList = globalList;
this.start = start;
this.stop = stop;
System.out.println("Creating for " + start + " => " + stop);
}
@Override
protected Integer compute() {
if (stop - start > 1000) {
// Too many arguments, we split the list
Adder subTask1 = new Adder(globalList, start, start + (stop-start)/2);
Adder subTask2 = new Adder(globalList, start + (stop-start)/2, stop);
subTask2.fork();
return subTask1.compute() + subTask2.join();
} else {
// Manageable size of arguments, we deal in place
int result = 0;
for(int i = start; i < stop; i++) {
result += globalList.get(i); // sum the list values, not the indices
}
return result;
}
}
}
public void doWork() throws Exception {
List<Integer> computation = new ArrayList<>();
for(int i = 0; i < 10000; i++) {
computation.add(i);
}
ForkJoinPool pool = new ForkJoinPool();
RecursiveTask<Integer> masterTask = new Adder(computation, 0, computation.size());
Future<Integer> future = pool.submit(masterTask);
System.out.println(future.get());
}
Use Java 8 parallel streams in order to launch multiple parallel computations easily (under the hood, Java parallel streams can fall back to the Fork/Join pool actually).
Others have shown what this might look like.
Are you aware of a different approach?
For a different take at concurrent programming (without explicit task / thread handling), have a look at the actor pattern. https://en.wikipedia.org/wiki/Actor_model
Akka comes to mind as a popular implementation of this pattern...
@Aaron is right; you should take a look at Java 8's parallel streams:
void processInParallel(List<V> list) {
list.parallelStream().forEach(item -> {
// do something
});
}
If you need to specify chunks, you could use a ForkJoinPool as described here:
void processInParallel(List<V> list, int chunks) {
ForkJoinPool forkJoinPool = new ForkJoinPool(chunks);
forkJoinPool.submit(() -> {
list.parallelStream().forEach(item -> {
// do something with each item
});
});
}
You could also have a functional interface as an argument:
void processInParallel(List<V> list, int chunks, Consumer<V> processor) {
ForkJoinPool forkJoinPool = new ForkJoinPool(chunks);
forkJoinPool.submit(() -> {
list.parallelStream().forEach(item -> processor.accept(item));
});
}
Or in shorthand notation:
void processInParallel(List<V> list, int chunks, Consumer<V> processor) {
new ForkJoinPool(chunks).submit(() -> list.parallelStream().forEach(processor::accept));
}
And then you would use it like:
processInParallel(myList, 2, item -> {
// do something with each item
});
Depending on your needs, the ForkJoinPool#submit() returns an instance of ForkJoinTask, which is a Future and you may use it to check for the status or wait for the end of your task.
You'd most probably want the ForkJoinPool instantiated only once (not instantiate it on every method call) and then reuse it to prevent CPU choking if the method is called multiple times.
Is creating thousands of Runnables really expensive, and to be avoided?
Not at all; the Runnable/Callable interfaces each have only one method to implement, and the amount of "extra" code in each task depends on the code you are running. But it's certainly no fault of the Runnable/Callable interfaces.
Is there a generic pattern/recommendation how to do it by solution II?
Pattern 2 is more favorable than pattern 1. This is because pattern 1 assumes that each worker will finish at the exact same time. If some workers finish before other workers, they could just be sitting idle since they only are able to work on the y/x-size queues you assigned to each of them. In pattern 2 however, you will never have idle worker threads (unless the end of the work queue is reached and numWorkItems < numWorkers).
An easy way to use the preferred pattern, pattern 2, is to use the ExecutorService invokeAll(Collection<? extends Callable<T>> list) method.
Here is an example usage:
List<Callable<?>> workList = // a single list of all of your work
ExecutorService es = Executors.newCachedThreadPool();
es.invokeAll(workList);
Fairly readable and straightforward usage, and the ExecutorService implementation will automatically use solution 2 for you, so you know that each worker thread has their use time maximized.
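For completeness, solution II can also be written out by hand with a BlockingQueue that the workers drain, so a fast worker simply takes more items and none sits idle. A sketch (the class name QueueWorkers and the counter stand-in for real work are mine):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class QueueWorkers {
    // Workers pull from one shared queue; no pre-assigned partitions to finish unevenly
    static int drain(int items, int workers) throws InterruptedException {
        BlockingQueue<Integer> queue = new LinkedBlockingQueue<>();
        for (int i = 0; i < items; i++) queue.add(i);

        AtomicInteger processed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int w = 0; w < workers; w++) {
            pool.submit(() -> {
                Integer item;
                // poll() returns null once the queue is empty, ending this worker
                while ((item = queue.poll()) != null) {
                    processed.incrementAndGet(); // stand-in for real work on `item`
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return processed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("processed=" + drain(100, 4)); // prints processed=100
    }
}
```

This only works as written because all items are queued before the workers start; for a producer that keeps feeding the queue you would use take() plus a poison-pill sentinel to shut the workers down.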
Are you aware of a different approach?
Solutions 1 and 2 are two common approaches for generic work. There are many different implementations available for you to choose from (such as java.util.concurrent, Java 8 parallel streams, or Fork/Join pools), but the concept of each implementation is generally the same. The only exception is if you have specific tasks in mind with non-standard running behavior.
I am trying to execute a simple calculation (it calls Math.random() 10,000,000 times). Surprisingly, running it in a simple method performs much faster than using an ExecutorService.
I have read another thread, "ExecutorService's surprising performance break-even point --- rules of thumb?", and tried to follow the answer by executing the Callable in batches, but the performance is still bad.
How do I improve the performance based on my current code?
import java.util.*;
import java.util.concurrent.*;
public class MainTest {
public static void main(String[]args) throws Exception {
new MainTest().start();
}
final List<Worker> workermulti = new ArrayList<Worker>();
final List<Worker> workersingle = new ArrayList<Worker>();
final int count=10000000;
public void start() throws Exception {
int n=2;
workersingle.add(new Worker(1));
for (int i=0;i<n;i++) {
// worker will only do count/n job
workermulti.add(new Worker(n));
}
ExecutorService serviceSingle = Executors.newSingleThreadExecutor();
ExecutorService serviceMulti = Executors.newFixedThreadPool(n);
long s,e;
int tests=10;
List<Long> simple = new ArrayList<Long>();
List<Long> single = new ArrayList<Long>();
List<Long> multi = new ArrayList<Long>();
for (int i=0;i<tests;i++) {
// simple
s = System.currentTimeMillis();
simple();
e = System.currentTimeMillis();
simple.add(e-s);
// single thread
s = System.currentTimeMillis();
serviceSingle.invokeAll(workersingle); // single thread
e = System.currentTimeMillis();
single.add(e-s);
// multi thread
s = System.currentTimeMillis();
serviceMulti.invokeAll(workermulti);
e = System.currentTimeMillis();
multi.add(e-s);
}
long avgSimple=sum(simple)/tests;
long avgSingle=sum(single)/tests;
long avgMulti=sum(multi)/tests;
System.out.println("Average simple: "+avgSimple+" ms");
System.out.println("Average single thread: "+avgSingle+" ms");
System.out.println("Average multi thread: "+avgMulti+" ms");
serviceSingle.shutdown();
serviceMulti.shutdown();
}
long sum(List<Long> list) {
long sum=0;
for (long l : list) {
sum+=l;
}
return sum;
}
private void simple() {
for (int i=0;i<count;i++){
Math.random();
}
}
class Worker implements Callable<Void> {
int n;
public Worker(int n) {
this.n=n;
}
@Override
public Void call() throws Exception {
// divide count with n to perform batch execution
for (int i=0;i<(count/n);i++) {
Math.random();
}
return null;
}
}
}
The output for this code
Average simple: 920 ms
Average single thread: 1034 ms
Average multi thread: 1393 ms
EDIT: the performance suffers because Math.random() is a synchronized method. After replacing Math.random() with a new Random object for each thread, the performance improved.
The output for the new code (after replacing Math.random() with Random for each thread)
Average simple: 928 ms
Average single thread: 1046 ms
Average multi thread: 642 ms
Math.random() is synchronized. Kind of the whole point of synchronized is to slow things down so they don't collide. Use something that isn't synchronized and/or give each thread its own object to work with, like a new Random.
You'd do well to read the contents of the other thread. There's plenty of good tips in there.
Perhaps the most significant issue with your benchmark is that according to the Math.random() contract, "This method is properly synchronized to allow correct use by more than one thread. However, if many threads need to generate pseudorandom numbers at a great rate, it may reduce contention for each thread to have its own pseudorandom-number generator"
Read this as: the method is synchronized, so only one thread is likely to be able to usefully use it at the same time. So you pay a bunch of overhead to distribute the tasks, only to force them to run serially again.
When you use multiple threads, you need to be aware of the overhead of using additional threads. You also need to determine if your algorithm has work which can be performed in parallel or not. So you need work which can run concurrently and which is large enough to exceed the overhead of using multiple threads.
In this case, the simplest workaround is to use a separate Random in each thread. The problem you have is that as a micro-benchmark, your loop doesn't actually do anything and the JIT is very good at discarding code which doesn't do anything. A workaround for this is to sum the random results and return it from the call() as this is usually enough to prevent the JIT from discarding the code.
Lastly, if you want to sum lots of numbers, you don't need to save them and sum them later; you can sum them as you go.
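Combining both suggestions, a minimal sketch (not the poster's exact code) might look like this: each worker owns its own Random, and call() returns a running sum so the JIT cannot discard the loop as dead code.

```java
import java.util.*;
import java.util.concurrent.*;

public class RandomBench {
    static final int COUNT = 10_000_000;

    // Each worker owns its Random, so there is no contention on the
    // shared state inside Math.random(). Returning the running sum
    // keeps the JIT from eliminating the loop as dead code.
    static class Worker implements Callable<Double> {
        final Random random = new Random();
        final int iterations;
        Worker(int iterations) { this.iterations = iterations; }
        @Override public Double call() {
            double sum = 0;
            for (int i = 0; i < iterations; i++) sum += random.nextDouble();
            return sum;
        }
    }

    public static void main(String[] args) throws Exception {
        int n = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(n);
        List<Worker> workers = new ArrayList<>();
        for (int i = 0; i < n; i++) workers.add(new Worker(COUNT / n));
        double total = 0;
        for (Future<Double> f : pool.invokeAll(workers)) total += f.get();
        pool.shutdown();
        // nextDouble() is uniform on [0,1), so the total is ~COUNT/2.
        System.out.println("sum of randoms: " + total);
    }
}
```

On Java 7+ you could also use ThreadLocalRandom.current() instead of managing Random instances by hand.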
First and once more, thanks to all that already answered my question. I am not a very experienced programmer and it is my first experience with multithreading.
I have put together an example that behaves much like my program; I hope it makes the problem clearer.
public class ThreadMeasuring {
private static final int TASK_TIME = 1; // milliseconds
private static class Batch implements Runnable {
CountDownLatch countDown;
public Batch(CountDownLatch countDown) {
this.countDown = countDown;
}
@Override
public void run() {
long t0 =System.nanoTime();
long t = 0;
while(t<TASK_TIME*1e6){ t = System.nanoTime() - t0; }
if(countDown!=null) countDown.countDown();
}
}
public static void main(String[] args) {
ThreadFactory threadFactory = new ThreadFactory() {
int counter = 1;
@Override
public Thread newThread(Runnable r) {
Thread t = new Thread(r, "Executor thread " + (counter++));
return t;
}
};
// The total work to be divided into tasks is fixed (problem dependent).
// Increasing ntasks means decreasing the task time proportionally.
// 4 is an arbitrary example.
// These tasks will be executed thousands of times, inside a loop alternating
// with serial processing that needs their results and prepares the next ones.
int ntasks = 4;
int nthreads = 2;
int ncores = Runtime.getRuntime().availableProcessors();
if (nthreads<ncores) ncores = nthreads;
Batch serial = new Batch(null);
long serialTime = System.nanoTime();
serial.run();
serialTime = System.nanoTime() - serialTime;
ExecutorService executor = Executors.newFixedThreadPool( nthreads, threadFactory );
CountDownLatch countDown = new CountDownLatch(ntasks);
ArrayList<Batch> batches = new ArrayList<Batch>();
for (int i = 0; i < ntasks; i++) {
batches.add(new Batch(countDown));
}
long start = System.nanoTime();
for (Batch r : batches){
executor.execute(r);
}
// wait for all threads to finish their task
try {
countDown.await();
} catch (InterruptedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
long tmeasured = (System.nanoTime() - start);
System.out.println("Task time= " + TASK_TIME + " ms");
System.out.println("Number of tasks= " + ntasks);
System.out.println("Number of threads= " + nthreads);
System.out.println("Number of cores= " + ncores);
System.out.println("Measured time= " + tmeasured);
System.out.println("Theoretical serial time= " + TASK_TIME*1000000*ntasks);
System.out.println("Theoretical parallel time= " + (TASK_TIME*1000000*ntasks)/ncores);
System.out.println("Speedup= " + (serialTime*ntasks)/(double)tmeasured);
executor.shutdown();
}
}
Instead of doing the calculations, each batch just waits for a given time. The program calculates the speedup, which in theory would always be 2, but can drop below 1 (an actual slowdown) if TASK_TIME is small.
My calculations take at most 1 ms and are commonly faster. For 1 ms I see a small speedup of around 30%, but in practice, with my program, I notice a slowdown.
The structure of this code is very similar to my program, so if you could help me to optimise the thread handling I would be very grateful.
Kind regards.
Below, the original question:
Hi.
I would like to use multithreading on my program, since it could increase its efficiency considerably, I believe. Most of its running time is due to independent calculations.
My program has thousands of independent calculations (several linear systems to solve), but they only occur at the same time in small groups of a dozen or so. Each of these groups takes a few milliseconds to run. After one of these groups of calculations, the program has to run sequentially for a little while, and then I have to solve the linear systems again.
Actually, these independent linear systems sit inside a loop that iterates thousands of times, alternating with sequential calculations that depend on the previous results. My idea to speed up the program is to compute these independent calculations in parallel threads, by dividing each group into (the number of processors I have available) batches of independent calculations. So, in principle, there should be no queuing at all.
I tried using FixedThreadPool and CachedThreadPool, and it got even slower than serial processing. It seems to take too much time creating new threads each time I need to solve the batches.
Is there a better way to handle this problem? The pools I've used seem better suited to cases where each task runs for longer, rather than to thousands of short-lived tasks...
Thanks!
Best Regards!
Thread pools don't create new threads over and over. That's why they're pools.
How many threads were you using and how many CPUs/cores do you have? What is the system load like (normally, when you execute them serially, and when you execute with the pool)? Is synchronization or any kind of locking involved?
Is the algorithm for parallel execution exactly the same as the serial one (your description seems to suggest that serial was reusing some results from previous iteration).
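A quick way to confirm that a pool reuses its threads rather than creating new ones is to record which thread names actually run the tasks. This sketch (not from the original post) submits 100 tasks to a 2-thread pool; only the pool's two worker threads ever appear.

```java
import java.util.Set;
import java.util.concurrent.*;

public class PoolReuse {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        Set<String> names = ConcurrentHashMap.newKeySet();
        for (int i = 0; i < 100; i++) {
            // Record the name of whichever worker thread runs this task.
            pool.submit(() -> { names.add(Thread.currentThread().getName()); });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        // 100 tasks, but at most 2 distinct worker threads ran them,
        // e.g. [pool-1-thread-1, pool-1-thread-2]
        System.out.println(names);
    }
}
```

If thread creation were the bottleneck, you would expect 100 distinct names here instead of two.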
From what I've read: "thousands of independent calculations... happen at the same time... would take some milliseconds to run", it seems to me that your problem is a perfect fit for GPU programming.
And I think that answers your question. GPU programming is becoming more and more popular. There are Java bindings for CUDA & OpenCL. If it is possible for you to use them, I say go for it.
I'm not sure how you perform the calculations, but if you're breaking them up into small groups, then your application might be ripe for the Producer/Consumer pattern.
Additionally, you might be interested in using a BlockingQueue. The calculation consumers will block until there is something in the queue and the block occurs on the take() call.
private static class Batch implements Runnable {
CountDownLatch countDown;
public Batch(CountDownLatch countDown) {
this.countDown = countDown;
}
CountDownLatch getLatch(){
return countDown;
}
@Override
public void run() {
long t0 =System.nanoTime();
long t = 0;
while(t<TASK_TIME*1e6){ t = System.nanoTime() - t0; }
if(countDown!=null) countDown.countDown();
}
}
class CalcProducer implements Runnable {
private final BlockingQueue<Batch> queue;
CalcProducer(BlockingQueue<Batch> q) { queue = q; }
public void run() {
try {
while(true) {
CountDownLatch latch = new CountDownLatch(ntasks);
for(int i = 0; i < ntasks; i++) {
queue.put(produce(latch));
}
// don't need to wait for the latch, only consumers wait
}
} catch (InterruptedException ex) { ... handle ...}
}
Batch produce(CountDownLatch latch) {
return new Batch(latch);
}
}
class CalcConsumer implements Runnable {
private final BlockingQueue<Batch> queue;
CalcConsumer(BlockingQueue<Batch> q) { queue = q; }
public void run() {
try {
while(true) { consume(queue.take()); }
} catch (InterruptedException ex) { ... handle ...}
}
void consume(Batch batch) throws InterruptedException {
batch.run();
batch.getLatch().await();
}
}
class Setup {
void main() {
BlockingQueue<Batch> q = new LinkedBlockingQueue<Batch>();
int numConsumers = 4;
CalcProducer p = new CalcProducer(q);
Thread producerThread = new Thread(p);
producerThread.start();
Thread[] consumerThreads = new Thread[numConsumers];
for(int i = 0; i < numConsumers; i++)
{
consumerThreads[i] = new Thread(new CalcConsumer(q));
consumerThreads[i].start();
}
}
}
Sorry if there are any syntax errors, I've been chomping away at C# code and sometimes I forget the proper java syntax, but the general idea is there.
If you have a problem which does not scale to multiple cores, you need to change your program or you have a problem which is not as parallel as you think. I suspect you have some other type of bug, but cannot say based on the information given.
This test code might help.
Time per million tasks 765 ms
code
ExecutorService es = Executors.newFixedThreadPool(4);
Runnable task = new Runnable() {
@Override
public void run() {
// do nothing.
}
};
long start = System.nanoTime();
for(int i=0;i<1000*1000;i++) {
es.submit(task);
}
es.shutdown();
es.awaitTermination(10, TimeUnit.SECONDS);
long time = System.nanoTime() - start;
System.out.println("Time per million tasks "+time/1000/1000+" ms");
EDIT: Say you have a loop which serially does this.
for(int i=0;i<1000*1000;i++)
doWork(i);
You might assume that changing to a loop like this would be faster, but the problem is that the overhead could be greater than the gain.
for(int i=0;i<1000*1000;i++) {
final int i2 = i;
ex.execute(new Runnable() {
public void run() {
doWork(i2);
}
});
}
So you need to create batches of work (at least one per thread) so there are enough tasks to keep all the threads busy, but not so many tasks that your threads are spending time in overhead.
final int batchSize = 10*1000;
for(int i=0;i<1000*1000;i+=batchSize) {
final int i2 = i;
ex.execute(new Runnable() {
public void run() {
for(int i3=i2;i3<i2+batchSize;i3++)
doWork(i3);
}
});
}
EDIT2: Running a test which copies data between threads.
for (int i = 0; i < 20; i++) {
ExecutorService es = Executors.newFixedThreadPool(1);
final double[] d = new double[4 * 1024];
Arrays.fill(d, 1);
final double[] d2 = new double[4 * 1024];
es.submit(new Runnable() {
@Override
public void run() {
// nothing.
}
}).get();
long start = System.nanoTime();
es.submit(new Runnable() {
@Override
public void run() {
synchronized (d) {
System.arraycopy(d, 0, d2, 0, d.length);
}
}
});
es.shutdown();
es.awaitTermination(10, TimeUnit.SECONDS);
// read the values in d2.
for (double x : d2) ;
long time = System.nanoTime() - start;
System.out.printf("Time to pass %,d doubles to another thread and back was %,d ns.%n", d.length, time);
}
It starts badly but warms up to ~50 µs.
Time to pass 4,096 doubles to another thread and back was 1,098,045 ns.
Time to pass 4,096 doubles to another thread and back was 171,949 ns.
... deleted ...
Time to pass 4,096 doubles to another thread and back was 50,566 ns.
Time to pass 4,096 doubles to another thread and back was 49,937 ns.
Hmm, CachedThreadPool seems to be created just for your case. It does not recreate threads if you reuse them soon enough, and if you spend a whole minute before you use new thread, the overhead of thread creation is comparatively negligible.
But you can't expect parallel execution to speed up your calculations unless you can also access data in parallel. If you employ extensive locking, many synchronized methods, etc., you'll spend more on overhead than you gain from parallel processing. Check that your data can be efficiently processed in parallel and that you don't have non-obvious synchronization lurking in the code.
Also, CPUs process data efficiently when it fully fits into cache. If each thread's data set is bigger than half the cache, two threads will compete for cache and issue many RAM reads, while one thread, only employing one core, may perform better because it avoids RAM reads in the tight loop it executes. Check this, too.
Here's a pseudo outline of what I'm thinking:
class WorkerThread extends Thread {
Queue<Calculation> calcs;
MainCalculator mainCalc;
public void run() {
while(true) {
while(calcs.isEmpty()) sleep(500); // busy waiting? Context switching probably won't be so bad.
Calculation calc = calcs.pop(); // is it pop to get and remove? you'll have to look
CalculationResult result = calc.calc();
mainCalc.returnResultFor(calc,result);
}
}
}
Another option, if you're calling external programs. Don't put them in a loop that does them one at a time or they won't run in parallel. You can put them in a loop that PROCESSES them one at a time, but not that execs them one at a time.
Process calc1 = Runtime.getRuntime().exec("myCalc paramA1 paramA2 paramA3");
Process calc2 = Runtime.getRuntime().exec("myCalc paramB1 paramB2 paramB3");
Process calc3 = Runtime.getRuntime().exec("myCalc paramC1 paramC2 paramC3");
Process calc4 = Runtime.getRuntime().exec("myCalc paramD1 paramD2 paramD3");
calc1.waitFor();
calc2.waitFor();
calc3.waitFor();
calc4.waitFor();
InputStream is1 = calc1.getInputStream();
InputStreamReader isr1 = new InputStreamReader(is1);
BufferedReader br1 = new BufferedReader(isr1);
String resultStr1 = br1.readLine();
InputStream is2 = calc2.getInputStream();
InputStreamReader isr2 = new InputStreamReader(is2);
BufferedReader br2 = new BufferedReader(isr2);
String resultStr2 = br2.readLine();
InputStream is3 = calc3.getInputStream();
InputStreamReader isr3 = new InputStreamReader(is3);
BufferedReader br3 = new BufferedReader(isr3);
String resultStr3 = br3.readLine();
InputStream is4 = calc4.getInputStream();
InputStreamReader isr4 = new InputStreamReader(is4);
BufferedReader br4 = new BufferedReader(isr4);
String resultStr4 = br4.readLine();