For massive parallel computing I tend to use executors and callables. When I have thousand of objects to be computed I feel not so good to instantiate thousand of Runnables for each object.
So I have two approaches to solve this:
I. Split the workload into a small amount of x-workers giving y-objects each. (splitting the object list into x-partitions with y/x-size each)
public static <V> List<List<V>> partitions(List<V> list, int chunks) {
final ArrayList<List<V>> lists = new ArrayList<List<V>>();
final int size = Math.max(1, list.size() / chunks + 1);
final int listSize = list.size();
for (int i = 0; i <= chunks; i++) {
final List<V> vs = list.subList(Math.min(listSize, i * size), Math.min(listSize, i * size + size));
if(vs.size() == 0) break;
lists.add(vs);
}
return lists;
}
II. Creating x-workers which fetch objects from a queue.
Questions:
Is creating thousand of Runnables really expensive and to be avoided?
Is there a generic pattern/recommendation how to do it by solution II?
Are you aware of a different approach?
Creating thousands of Runnable (objects implementing Runnable) is not more expensive than creating a normal object.
Creating and running thousands of Threads can be very heavy, but you can use Executors with a pool of threads to solve this problem.
As for the different approach, you might be interested in java 8's parallel streams.
Combining various answers here :
Is creating thousand of Runnables really expensive and to be avoided?
No, it's not in and of itself. It's how you will make them execute that may prove costly (spawning a few thousand threads certainly has its cost).
So you would not want to do this :
List<Computation> computations = ...
List<Thread> threads = new ArrayList<>();
for (Computation computation : computations) {
Thread thread = new Thread(new Computation(computation));
threads.add(thread);
thread.start();
}
// If you need to wait for completion:
for (Thread t : threads) {
t.join();
}
Because it would 1) be unnecessarily costly in terms of OS ressource (native threads, each having a stack on the heap), 2) spam the OS scheduler with a vastly concurrent workload, most certainly leading to plenty of context switchs and associated cache invalidations at the CPU level 3) be a nightmare to catch and deal with exceptions (your threads should probably define an Uncaught exception handler, and you'd have to deal with it manually).
You'd probably prefer an approach where a finite Thread pool (of a few threads, "a few" being closely related to your number of CPU cores) handles many many Callables.
List<Computation> computations = ...
ExecutorService pool = Executors.newFixedSizeThreadPool(someNumber)
List<Future<Result>> results = new ArrayList<>();
for (Computation computation : computations) {
results.add(pool.submit(new ComputationCallable(computation));
}
for (Future<Result> result : results {
doSomething(result.get);
}
The fact that you reuse a limited number threads should yield a really nice improvement.
Is there a generic pattern/recommendation how to do it by solution II?
There are. First, your partition code (getting from a List to a List<List>) can be found inside collection tools such as Guava, with more generic and fail-proofed implementations.
But more than this, two patterns come to mind for what you are achieving :
Use the Fork/Join Pool with Fork/Join tasks (that is, spawn a task with your whole list of items, and each task will fork sub tasks with half of that list, up to the point where each task manages a small enough list of items). It's divide and conquer. See: http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ForkJoinTask.html
If your computation were to be "add integers from a list", it could look like (there might be a boundary bug in there, I did not really check) :
public static class Adder extends RecursiveTask<Integer> {
protected List<Integer> globalList;
protected int start;
protected int stop;
public Adder(List<Integer> globalList, int start, int stop) {
super();
this.globalList = globalList;
this.start = start;
this.stop = stop;
System.out.println("Creating for " + start + " => " + stop);
}
#Override
protected Integer compute() {
if (stop - start > 1000) {
// Too many arguments, we split the list
Adder subTask1 = new Adder(globalList, start, start + (stop-start)/2);
Adder subTask2 = new Adder(globalList, start + (stop-start)/2, stop);
subTask2.fork();
return subTask1.compute() + subTask2.join();
} else {
// Manageable size of arguments, we deal in place
int result = 0;
for(int i = start; i < stop; i++) {
result +=i;
}
return result;
}
}
}
public void doWork() throws Exception {
List<Integer> computation = new ArrayList<>();
for(int i = 0; i < 10000; i++) {
computation.add(i);
}
ForkJoinPool pool = new ForkJoinPool();
RecursiveTask<Integer> masterTask = new Adder(computation, 0, computation.size());
Future<Integer> future = pool.submit(masterTask);
System.out.println(future.get());
}
Use Java 8 parallel streams in order to launch multiple parallel computations easily (under the hood, Java parallel streams can fall back to the Fork/Join pool actually).
Others have shown how this might look like.
Are you aware of a different approach?
For a different take at concurrent programming (without explicit task / thread handling), have a look at the actor pattern. https://en.wikipedia.org/wiki/Actor_model
Akka comes to mind as a popular implementation of this pattern...
#Aaron is right, you should take a look into Java 8's parallel streams:
void processInParallel(List<V> list) {
list.parallelStream().forEach(item -> {
// do something
});
}
If you need to specify chunks, you could use a ForkJoinPool as described here:
void processInParallel(List<V> list, int chunks) {
ForkJoinPool forkJoinPool = new ForkJoinPool(chunks);
forkJoinPool.submit(() -> {
list.parallelStream().forEach(item -> {
// do something with each item
});
});
}
You could also have a functional interface as an argument:
void processInParallel(List<V> list, int chunks, Consumer<V> processor) {
ForkJoinPool forkJoinPool = new ForkJoinPool(chunks);
forkJoinPool.submit(() -> {
list.parallelStream().forEach(item -> processor.accept(item));
});
}
Or in shorthand notation:
void processInParallel(List<V> list, int chunks, Consumer<V> processor) {
new ForkJoinPool(chunks).submit(() -> list.parallelStream().forEach(processor::accept));
}
And then you would use it like:
processInParallel(myList, 2, item -> {
// do something with each item
});
Depending on your needs, the ForkJoinPool#submit() returns an instance of ForkJoinTask, which is a Future and you may use it to check for the status or wait for the end of your task.
You'd most probably want the ForkJoinPool instantiated only once (not instantiate it on every method call) and then reuse it to prevent CPU choking if the method is called multiple times.
Is creating thousand of Runnables really expensive and to be avoided?
Not at all, the runnable/callable interfaces have only one method to implement each, and the amount of "extra" code in each task depends on the code you are running. But certainly no fault of the Runnable/Callable interfaces.
Is there a generic pattern/recommendation how to do it by solution II?
Pattern 2 is more favorable than pattern 1. This is because pattern 1 assumes that each worker will finish at the exact same time. If some workers finish before other workers, they could just be sitting idle since they only are able to work on the y/x-size queues you assigned to each of them. In pattern 2 however, you will never have idle worker threads (unless the end of the work queue is reached and numWorkItems < numWorkers).
An easy way to use the preferred pattern, pattern 2, is to use the ExecutorService invokeAll(Collection<? extends Callable<T>> list) method.
Here is an example usage:
List<Callable<?>> workList = // a single list of all of your work
ExecutorService es = Executors.newCachedThreadPool();
es.invokeAll(workList);
Fairly readable and straightforward usage, and the ExecutorService implementation will automatically use solution 2 for you, so you know that each worker thread has their use time maximized.
Are you aware of a different approach?
Solution 1 and 2 are two common approaches for generic work. Now, there are many different implementation available for you choose from (such as java.util.Concurrent, Java 8 parallel streams, or Fork/Join pools), but the concept of each implementation is generally the same. The only exception is if you have specific tasks in mind with non-standard running behavior.
Related
I have requirement where need to process and map the DTOs with the values in for loop as below. Each of the mapping method here consumes nearly 10 minutes to complete its business logic and hence creating performance delay. I am working to refine the algorithms of business logic. However, please let me know if each of these mapping methods can be parallel processed to increase performance.
Since application is compatible only with Java 7 I cannot use streams of java 8.
for(Portfolio pf : portfolio) {
mapAddress(pf);
mapBusinessUnit(pf);
mapRelationShipDetails(pf)
--
--
--
}
You could split portfolios to different threads using either Runnable or Callable.
For example:
public class PortfolioService implements Callable<List<Portfolio>>
{
List<Portfolio> portfolios;
public PortfolioService(List<Portfolio> portfolios)
{
this.portfolios = portfolios;
}
public List<Portfolio> call()
{
for(Portfolio pf : portfolios) {
mapAddress(pf);
mapBusinessUnit(pf);
...
}
return portfolios;
}
}
However, this needs some modifications in your main class. I am using Callable here, since I don't know if you want to do something with all of these mapped Portfolios afterwards. However, if you want to let the threads do all of the work and don't need any return, use Runnable and modify the code.
1) You have to get your amount of cores:
int threads = Runtime.getRuntime().availableProcessors();
2) Now you split the workload per thread
// determine the average workload per thread
int blocksize = portfolios.size()/threads;
// doesn't always get all entries
int overlap = portfolios.size()%threads;
3) Start an ExecutorService, make a list of Future Elements, make reminder variable for old index of array slice
ExecutorService exs = Executors.newFixedThreadPool(threads);
List<Future<List<Portfoilio>>> futures = new ArrayList();
int oldIndex = 0;
4) Start threads
for(int i = 0; i<threads; i++)
{
int actualBlocksize = blocksize;
if(overlap != 0){
actualBlocksize++;
overlap--;
}
futures.add(exs.submit(new PortfolioService(portfolios.subList(oldIndex,actualBlocksize));
oldIndex = actualBlocksize;
}
5) Shutdown the ExecutorService and await it's termination
exs.shutdown();
try {exs.awaitTermination(6, TimeUnit.HOURS);}
catch (InterruptedException e) { }
6) do something with the future, if you want / have to.
I've implemented a java code to execute incoming tasks (as Runnable) with n Threads based on their hashCode module nThreads. The work should spread, ideally - uniformly, among those threads.
Specifically, we have a dispatchId as a string for each Task.
Here is this java code snippet:
int nThreads = Runtime.getRuntime().availableProcessors(); // Number of threads
Worker[] workers = new Worker[nThreads]; // Those threads, Worker is just a thread class that can run incoming tasks
...
Worker getWorker(String dispatchId) { // Get a thread for this Task
return workers[(dispatchId.hashCode() & Integer.MAX_VALUE) % nThreads];
}
Important: In most cases a dispatchId is:
String dispatchId = 'SomePrefix' + counter.next()
But, I have a concern that modulo division by nThreads is not a good choice, because nThreads should be a prime number for a more uniform distribution of dispatId keys.
Are there any other options on how to spread the work better?
Update 1:
Each Worker has a queue:
Queue<RunnableWrapper> tasks = new ConcurrentLinkedQueue();
The worker gets tasks from it and executes them. Tasks can be added to this queue from other threads.
Update 2:
Tasks with the same dispatchId can come in multiple times, therefore we need to find their thread by dispatchId.
Most importantly, each Worker thread must process its incoming tasks sequentially. Hence, there is data structure Queue in the update 1 above.
Update 3:
Also, some threads can be busy, while others are free. Thus, we need to somehow decouple Queues from Threads, but maintain the FIFO order for the same dispatchId for tasks execution.
Solution:
I've implemented Ben Manes' idea (his answer below), the code can be found here.
It sounds like you need FIFO ordering per dispatch id, so the ideal would be to have dispatch queues as the abstraction. That would explain your concern about hashing as not providing uniform distribution, as some dispatch queues may be more active than others and unfairly balanced among workers. By separating the queue from the worker, you retain FIFO semantics and evenly spread out the work.
An inactive library that provides this abstraction is HawtDispatch. It is Java 6 compatible.
A very simple Java 8 approach is to use CompletableFuture as a queuing mechanism, ConcurrentHashMap for registration, and an Executor (e.g. ForkJoinPool) for computing. See EventDispatcher for an implementation of this idea, where registration is explicit. If your dispatchers are more dynamic then you may need to periodically prune the map. The basic idea is as follows.
ConcurrentMap<String, CompletableFuture<Void>> dispatchQueues = ...
public CompletableFuture<Void> dispatch(String queueName, Runnable task) {
return dispatchQueues.compute(queueName, (k, queue) -> {
return (queue == null)
? CompletableFuture.runAsync(task)
: queue.thenRunAsync(task);
});
}
Update (JDK7)
A backport of the above idea would be translated with Guava into something like,
ListeningExecutorService executor = ...
Striped<Lock> locks = Striped.lock(256);
ConcurrentMap<String, ListenableFuture<?>> dispatchQueues = ...
public ListenableFuture<?> dispatch(String queueName, final Runnable task) {
Lock lock = locks.get(queueName);
lock.lock();
try {
ListenableFuture<?> future = dispatchQueues.get(queueName);
if (future == null) {
future = executor.submit(task);
} else {
final SettableFuture<Void> next = SettableFuture.create();
future.addListener(new Runnable() {
try {
task.run();
} finally {
next.set(null);
}
}, executor);
future = next;
}
dispatchQueues.put(queueName, future);
} finally {
lock.unlock();
}
}
I am trying to execute a simple calculation (it calls Math.random() 10000000 times). Surprisingly running it in simple method performs much faster than using ExecutorService.
I have read another thread at ExecutorService's surprising performance break-even point --- rules of thumb? and tried to follow the answer by executing the Callable using batches, but the performance is still bad
How do I improve the performance based on my current code?
import java.util.*;
import java.util.concurrent.*;
public class MainTest {
public static void main(String[]args) throws Exception {
new MainTest().start();;
}
final List<Worker> workermulti = new ArrayList<Worker>();
final List<Worker> workersingle = new ArrayList<Worker>();
final int count=10000000;
public void start() throws Exception {
int n=2;
workersingle.add(new Worker(1));
for (int i=0;i<n;i++) {
// worker will only do count/n job
workermulti.add(new Worker(n));
}
ExecutorService serviceSingle = Executors.newSingleThreadExecutor();
ExecutorService serviceMulti = Executors.newFixedThreadPool(n);
long s,e;
int tests=10;
List<Long> simple = new ArrayList<Long>();
List<Long> single = new ArrayList<Long>();
List<Long> multi = new ArrayList<Long>();
for (int i=0;i<tests;i++) {
// simple
s = System.currentTimeMillis();
simple();
e = System.currentTimeMillis();
simple.add(e-s);
// single thread
s = System.currentTimeMillis();
serviceSingle.invokeAll(workersingle); // single thread
e = System.currentTimeMillis();
single.add(e-s);
// multi thread
s = System.currentTimeMillis();
serviceMulti.invokeAll(workermulti);
e = System.currentTimeMillis();
multi.add(e-s);
}
long avgSimple=sum(simple)/tests;
long avgSingle=sum(single)/tests;
long avgMulti=sum(multi)/tests;
System.out.println("Average simple: "+avgSimple+" ms");
System.out.println("Average single thread: "+avgSingle+" ms");
System.out.println("Average multi thread: "+avgMulti+" ms");
serviceSingle.shutdown();
serviceMulti.shutdown();
}
long sum(List<Long> list) {
long sum=0;
for (long l : list) {
sum+=l;
}
return sum;
}
private void simple() {
for (int i=0;i<count;i++){
Math.random();
}
}
class Worker implements Callable<Void> {
int n;
public Worker(int n) {
this.n=n;
}
#Override
public Void call() throws Exception {
// divide count with n to perform batch execution
for (int i=0;i<(count/n);i++) {
Math.random();
}
return null;
}
}
}
The output for this code
Average simple: 920 ms
Average single thread: 1034 ms
Average multi thread: 1393 ms
EDIT: performance suffer due to Math.random() being a synchronised method.. after changing Math.random() with new Random object for each thread, the performance improved
The output for the new code (after replacing Math.random() with Random for each thread)
Average simple: 928 ms
Average single thread: 1046 ms
Average multi thread: 642 ms
Math.random() is synchronized. Kind of the whole point of synchronized is to slow things down so they don't collide. Use something that isn't synchronized and/or give each thread its own object to work with, like a new Random.
You'd do well to read the contents of the other thread. There's plenty of good tips in there.
Perhaps the most significant issue with your benchmark is that according to the Math.random() contract, "This method is properly synchronized to allow correct use by more than one thread. However, if many threads need to generate pseudorandom numbers at a great rate, it may reduce contention for each thread to have its own pseudorandom-number generator"
Read this as: the method is synchronized, so only one thread is likely to be able to usefully use it at the same time. So you do a bunch of overhead to distribute the tasks, only to force them again to run serially.
When you use multiple threads, you need to be aware of the overhead of using additional threads. You also need to determine if your algorithm has work which can be preformed in parallel or not. So you need to have work which can be run concurrently which is large enough that it will exceed the overhead of using multiple threads.
In this case, the simplest workaround is to use a separate Random in each thread. The problem you have is that as a micro-benchmark, your loop doesn't actually do anything and the JIT is very good at discarding code which doesn't do anything. A workaround for this is to sum the random results and return it from the call() as this is usually enough to prevent the JIT from discarding the code.
Lastly if you want to sum lots of numbers, you don't need to save them and sum them later. You can sum them as you go.
There are a huge amount of tasks.
Each task is belong to a single group. The requirement is each group of tasks should executed serially just like executed in a single thread and the throughput should be maximized in a multi-core (or multi-cpu) environment. Note: there are also a huge amount of groups that is proportional to the number of tasks.
The naive solution is using ThreadPoolExecutor and synchronize (or lock). However, threads would block each other and the throughput is not maximized.
Any better idea? Or is there exist a third party library satisfy the requirement?
A simple approach would be to "concatenate" all group tasks into one super task, thus making the sub-tasks run serially. But this will probably cause delay in other groups that will not start unless some other group completely finishes and makes some space in the thread pool.
As an alternative, consider chaining a group's tasks. The following code illustrates it:
public class MultiSerialExecutor {
private final ExecutorService executor;
public MultiSerialExecutor(int maxNumThreads) {
executor = Executors.newFixedThreadPool(maxNumThreads);
}
public void addTaskSequence(List<Runnable> tasks) {
executor.execute(new TaskChain(tasks));
}
private void shutdown() {
executor.shutdown();
}
private class TaskChain implements Runnable {
private List<Runnable> seq;
private int ind;
public TaskChain(List<Runnable> seq) {
this.seq = seq;
}
#Override
public void run() {
seq.get(ind++).run(); //NOTE: No special error handling
if (ind < seq.size())
executor.execute(this);
}
}
The advantage is that no extra resource (thread/queue) is being used, and that the granularity of tasks is better than the one in the naive approach. The disadvantage is that all group's tasks should be known in advance.
--edit--
To make this solution generic and complete, you may want to decide on error handling (i.e whether a chain continues even if an error occures), and also it would be a good idea to implement ExecutorService, and delegate all calls to the underlying executor.
I would suggest to use task queues:
For every group of tasks You have create a queue and insert all tasks from that group into it.
Now all Your queues can be executed in parallel while the tasks inside one queue are executed serially.
A quick google search suggests that the java api has no task / thread queues by itself. However there are many tutorials available on coding one. Everyone feel free to list good tutorials / implementations if You know some:
I mostly agree on Dave's answer, but if you need to slice CPU time across all "groups", i.e. all task groups should progress in parallel, you might find this kind of construct useful (using removal as "lock". This worked fine in my case although I imagine it tends to use more memory):
class TaskAllocator {
private final ConcurrentLinkedQueue<Queue<Runnable>> entireWork
= childQueuePerTaskGroup();
public Queue<Runnable> lockTaskGroup(){
return entireWork.poll();
}
public void release(Queue<Runnable> taskGroup){
entireWork.offer(taskGroup);
}
}
and
class DoWork implmements Runnable {
private final TaskAllocator allocator;
public DoWork(TaskAllocator allocator){
this.allocator = allocator;
}
pubic void run(){
for(;;){
Queue<Runnable> taskGroup = allocator.lockTaskGroup();
if(task==null){
//No more work
return;
}
Runnable work = taskGroup.poll();
if(work == null){
//This group is done
continue;
}
//Do work, but never forget to release the group to
// the allocator.
try {
work.run();
} finally {
allocator.release(taskGroup);
}
}//for
}
}
You can then use optimum number of threads to run the DoWork task. It's kind of a round robin load balance..
You can even do something more sophisticated, by using this instead of a simple queue in TaskAllocator (task groups with more task remaining tend to get executed)
ConcurrentSkipListSet<MyQueue<Runnable>> sophisticatedQueue =
new ConcurrentSkipListSet(new SophisticatedComparator());
where SophisticatedComparator is
class SophisticatedComparator implements Comparator<MyQueue<Runnable>> {
public int compare(MyQueue<Runnable> o1, MyQueue<Runnable> o2){
int diff = o2.size() - o1.size();
if(diff==0){
//This is crucial. You must assign unique ids to your
//Subqueue and break the equality if they happen to have same size.
//Otherwise your queues will disappear...
return o1.id - o2.id;
}
return diff;
}
}
Actor is also another solution for this specified type of issues.
Scala has actors and also Java, which provided by AKKA.
I had a problem similar to your, and I used an ExecutorCompletionService that works with an Executor to complete collections of tasks.
Here is an extract from java.util.concurrent API, since Java7:
Suppose you have a set of solvers for a certain problem, each returning a value of some type Result, and would like to run them concurrently, processing the results of each of them that return a non-null value, in some method use(Result r). You could write this as:
void solve(Executor e, Collection<Callable<Result>> solvers)
throws InterruptedException, ExecutionException {
CompletionService<Result> ecs = new ExecutorCompletionService<Result>(e);
for (Callable<Result> s : solvers)
ecs.submit(s);
int n = solvers.size();
for (int i = 0; i < n; ++i) {
Result r = ecs.take().get();
if (r != null)
use(r);
}
}
So, in your scenario, every task will be a single Callable<Result>, and tasks will be grouped in a Collection<Callable<Result>>.
Reference:
http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ExecutorCompletionService.html
I submitted 5 jobs to an ExecutorCompletionService, but it seems like the jobs are executed in sequence. The ExecutorService that is passed to the constructor of ExecutorCompletionService is created using newCacheThreadPool form. Am I doing anything wrong ?
UPDATE Each job is basically doing a database query & some calculation. The code for the ExecutorCompletionService is lifted as-is off the javadoc. I just replaced the Callables with my own custom Callable implementations.
The ExecutorCompletionService has nothing to do with how jobs are executed, it's simply a convenient way of retrieving the results.
Executors.newCachedThreadPool by default executes tasks in separate threads, which can be parallel, given that:
tasks are independent, and don't e.g. synchronize on the same object inside;
you have multiple hardware CPU threads.
The last point deserves an explanation. Although there are no guarantees, in practice the Sun JVM favours the currently executing thread so it's never swapped out in favour of another one. That means that your 5 tasks might end up being executed serially due to the JVM implementation and not having e.g. a multi-core machine.
I assume you meant Executors.newCachedThreadPool(). If so, execution should be parallelized as you expect.
Each job is basically doing a database query & some calculation. The code for the ExecutorCompletionService is lifted as-is off the javadoc. I just replaced the Callables with my own custom Callable implementations.
In that case, are you sure you're not mistaken in thinking they're executed sequentially because you're retrieving the results sequentially?
Throw in some debug logging lines in your callables to rule this out, and/or have a look at this limited usage scenario:
public static void main(String... args) throws InterruptedException, ExecutionException {
List<Callable<String>> list = new ArrayList<Callable<String>>();
list.add(new PowersOfX(2));
list.add(new PowersOfX(3));
list.add(new PowersOfX(5));
solve(Executors.newCachedThreadPool(), list);
}
static void solve(Executor e, Collection<Callable<String>> solvers) throws InterruptedException, ExecutionException {
CompletionService<String> ecs = new ExecutorCompletionService<String>(e);
for (Callable<String> s : solvers)
ecs.submit(s);
int n = solvers.size();
for (int i = 0; i < n; ++i) {
String r = ecs.take().get();
if (r != null)
System.out.println("Retrieved: " + r);
}
}
static class PowersOfX implements Callable<String> {
int x;
public PowersOfX(int x) {this.x = x;}
#Override
public String call() throws Exception {
StringBuilder sb = new StringBuilder();
for (int i = 0; i < 10; i++) {
sb.append(Math.pow(2, i)).append('\t');
System.out.println(Math.pow(x, i));
Thread.sleep(2000);
}
return sb.toString();
}
}
Executing this you'll see the numbers are generated intermixed (and thus executed concurrently), but retrieving the results alone wont show you this level detail..
The execution will depend on a number of things. For example:
the length of time it takes to complete a job
the number of threads in the thread pool (a cached thread pool will only create threads if it thinks they are needed)
Executing in sequence is not necessarily wrong.