In Loom, can I use virtual threads for Recursive[Action/Task]? - java

Is it possible to use RecursiveAction, for example, with a pool of virtual threads instead of the fork/join pool (before I attempt a poorly designed custom effort)?

RecursiveAction is a subclass of ForkJoinTask which is, as the name suggests and the documentation even says literally, an
Abstract base class for tasks that run within a ForkJoinPool.
While the ForkJoinPool can be customized with a thread factory, it’s not the standard thread factory, but a special factory for producing ForkJoinWorkerThread instances. Since these threads are subclasses of Thread, they can’t be created with the virtual thread factory.
So, you can’t use RecursiveAction with virtual threads. The same applies to RecursiveTask. But it’s worth rethinking what using these classes with virtual threads would gain you.
The main challenge, implementing the decomposition of your task into sub-tasks, is on you anyway. What these classes provide are features specifically for dealing with the Fork/Join pool and balancing the workload across the available platform threads. When you want to perform each sub-task on its own virtual thread, you don’t need this. So you can easily implement a recursive task with virtual threads without the built-in classes, e.g.
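import java.util.concurrent.*;
import java.util.concurrent.locks.LockSupport;
record PseudoTask(int from, int to) {
    public static CompletableFuture<Void> run(int from, int to) {
        return CompletableFuture.runAsync(
            new PseudoTask(from, to)::compute, Thread::startVirtualThread);
    }
    protected void compute() {
        int mid = (from + to) >>> 1;
        if (mid == from) {
            // simulate actual processing with potentially blocking operations (one-second park chosen for illustration)
            LockSupport.parkNanos(TimeUnit.SECONDS.toNanos(1));
        } else {
            CompletableFuture<Void> sub1 = run(from, mid), sub2 = run(mid, to);
            sub1.join(); // blocking here is cheap on a virtual thread
            sub2.join();
        }
    }
}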
This example doesn’t care about limiting the subdivision or avoiding blocking join() calls, and it still performs well when running, e.g., PseudoTask.run(0, 1_000).join(); You might notice that with larger ranges, the techniques known from other recursive task implementations can be useful here too, where the sub-task is rather cheap.
E.g., you may only submit one half of the range to another thread and process the other half locally, like
record PseudoTask(int from, int to) {
    public static CompletableFuture<Void> run(int from, int to) {
        return CompletableFuture.runAsync(
            new PseudoTask(from, to)::compute, Thread::startVirtualThread);
    }
    protected void compute() {
        CompletableFuture<Void> f = null;
        for (int from = this.from, mid; ; from = mid) {
            mid = (from + to) >>> 1;
            if (mid == from) {
                // simulate actual processing with potentially blocking operations (one-second park chosen for illustration)
                LockSupport.parkNanos(TimeUnit.SECONDS.toNanos(1));
                break; // the range is down to a single element, stop subdividing
            } else {
                // hand the lower half to another thread, keep the upper half locally
                CompletableFuture<Void> sub1 = run(from, mid);
                if (f == null) f = sub1; else f = CompletableFuture.allOf(f, sub1);
            }
        }
        if (f != null) f.join();
    }
}
which makes a notable difference when running, e.g., PseudoTask.run(0, 1_000_000).join(); which will use only 1 million threads in the second example rather than 2 million. But, of course, that’s a discussion on a different level than with platform threads, where neither approach would work reasonably.
Another upcoming option is the StructuredTaskScope, which allows spawning sub-tasks and waiting for their completion:
record PseudoTask(int from, int to) {
    public static void run(int from, int to) {
        try (var scope = new StructuredTaskScope.ShutdownOnFailure()) {
            new PseudoTask(from, to).compute(scope);
            scope.join(); // only the root task waits for all forked sub-tasks
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }
    protected Void compute(StructuredTaskScope<Object> scope) {
        for (int from = this.from, mid; ; from = mid) {
            mid = (from + to) >>> 1;
            if (mid == from) {
                // simulate actual processing with potentially blocking operations (one-second park chosen for illustration)
                LockSupport.parkNanos(TimeUnit.SECONDS.toNanos(1));
                break;
            } else {
                var sub = new PseudoTask(from, mid);
                scope.fork(() -> sub.compute(scope));
            }
        }
        return null;
    }
}
Here, the tasks do not wait for the completion of their sub-tasks; only the root task waits for the completion of all tasks. But this feature is in incubator state and hence may take even longer than the virtual threads feature to become production-ready.


Terribly slow synchronization

I'm trying to write the Game of Life with many threads, 1 cell = 1 thread. It requires synchronization between threads, so that no thread starts calculating its new state before the other threads have finished reading the previous state. Here is my code:
public class Cell extends Processor{
private static int count = 0;
private static Semaphore waitForAll = new Semaphore(0);
private static Semaphore waiter = new Semaphore(0);
private IntField isDead;
public Cell(int n)
count ++;
public void initialize()
this.algorithmName = Cell.class.getSimpleName();
isDead = new IntField(0);
this.addField(isDead, "state");
public synchronized void step()
int size = neighbours.size();
IntField[] states = new IntField[size];
int readElementValue = 0;
IntField readElement;
sendAll(new IntField(isDead.getDist()));
//here wait untill all other threads finish reading
while (Cell.waitForAll.availablePermits() != Cell.count) {
//here release semaphore neader lower
for (int i = 0; i < neighbours.size(); i++) {
readElement = (IntField) reciveMessage(neighbours.get(i));
states[i] = (IntField) reciveMessage(neighbours.get(i));
int alive = 0;
int dead = 0;
for(IntField ii: states)
if(ii.getDist() == 1)
if(isDead.getDist() == 0)
if(alive == 3)
if(alive == 3 || alive == 2)
try {
while(Cell.waiter.availablePermits() != Cell.count)
//if every thread finished reading we can acquire this semaphore
while(Cell.waitForAll.availablePermits() != 0)
//here we make sure every thread ends step in same moment
} catch (InterruptedException e) {
The class extends Thread, and in the run method, if I turn a switch on, it calls the step() method. It works nicely for a small number of cells, but when I run about 36 cells it starts to be very slow. How can I repair my synchronization so it would be faster?
Using large numbers of threads tends not to be very efficient, but 36 is not so many that I would expect that in itself to produce a difference that you would characterize as "very slow". I think more likely the problem is inherent in your strategy. In particular, I suspect this busy-wait is problematic:
//here wait untill all other threads finish reading
while (Cell.waitForAll.availablePermits() != Cell.count) {
Busy-waiting is always a performance problem because you are tying up the CPU with testing the condition over and over again. This busy-wait is worse than most, because it involves testing the state of a synchronization object, and this not only has extra overhead, but also introduces extra interference among threads.
Instead of busy-waiting, you want to use one of the various methods for making threads suspend execution until a condition is satisfied. It looks like what you've actually done is created a poor-man's version of a CyclicBarrier, so you might consider instead using CyclicBarrier itself. Alternatively, since this is a learning exercise you might benefit from learning how to use Object.wait(), Object.notify(), and Object.notifyAll() -- Java's built-in condition variable implementation.
If you insist on using semaphores, then I think you could do it without the busy-wait. The key to using semaphores is that it is being able to acquire the semaphore (at all) that indicates that the thread can proceed, not the number of available permits. If you maintain a separate variable with which to track how many threads are waiting on a given semaphore at a given point, then each thread reaching that point can determine whether to release all the other threads (and proceed itself) or whether to block by attempting to acquire the semaphore.
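To make the CyclicBarrier suggestion concrete, here is a minimal sketch, not the asker's Cell/Processor code. It assumes a simplified setup: a ring of cells held in a plain int array, one thread per cell, and an illustrative update rule (XOR of the two neighbours). The class name BarrierCells and the rule are made up for the example. Each generation has two barrier points, one after every thread has read the old state and one after every thread has written its new state, and the threads block in await() instead of spinning.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

class BarrierCells {
    // Runs `generations` steps on a ring of cells, one thread per cell.
    // Rule (illustrative only): next[i] = state[left] XOR state[right].
    static int[] run(int[] initial, int generations) {
        final int n = initial.length;
        final int[] state = initial.clone();
        final CyclicBarrier barrier = new CyclicBarrier(n);
        Thread[] threads = new Thread[n];
        for (int c = 0; c < n; c++) {
            final int i = c;
            threads[c] = new Thread(() -> {
                try {
                    for (int g = 0; g < generations; g++) {
                        // Phase 1: read the neighbours' previous state.
                        int next = state[(i + n - 1) % n] ^ state[(i + 1) % n];
                        barrier.await(); // blocks until every thread has finished reading
                        // Phase 2: publish the new state.
                        state[i] = next;
                        barrier.await(); // blocks until every thread has finished writing
                    }
                } catch (InterruptedException | BrokenBarrierException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads[c].start();
        }
        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return state;
    }
}
```

The barrier is reusable across generations, which is what distinguishes it from a CountDownLatch, and it also gives the memory-visibility guarantee that writes made before await() are seen by the other threads after their await() returns.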

Massive tasks alternative pattern for Runnable or Callable

For massively parallel computing I tend to use executors and callables. When I have thousands of objects to be computed, I am uneasy about instantiating thousands of Runnables, one for each object.
So I have two approaches to solve this:
I. Split the workload into a small amount of x-workers giving y-objects each. (splitting the object list into x-partitions with y/x-size each)
import java.util.ArrayList;
import java.util.List;
public static <V> List<List<V>> partitions(List<V> list, int chunks) {
    final ArrayList<List<V>> lists = new ArrayList<List<V>>();
    final int size = Math.max(1, list.size() / chunks + 1);
    final int listSize = list.size();
    for (int i = 0; i <= chunks; i++) {
        final List<V> vs = list.subList(Math.min(listSize, i * size), Math.min(listSize, i * size + size));
        if (vs.size() == 0) break;
        lists.add(vs); // each partition is a view on the original list
    }
    return lists;
}
II. Creating x-workers which fetch objects from a queue.
Is creating thousand of Runnables really expensive and to be avoided?
Is there a generic pattern/recommendation how to do it by solution II?
Are you aware of a different approach?
Creating thousands of Runnable (objects implementing Runnable) is not more expensive than creating a normal object.
Creating and running thousands of Threads can be very heavy, but you can use Executors with a pool of threads to solve this problem.
As for the different approach, you might be interested in java 8's parallel streams.
Combining various answers here :
Is creating thousand of Runnables really expensive and to be avoided?
No, it's not in and of itself. It's how you will make them execute that may prove costly (spawning a few thousand threads certainly has its cost).
So you would not want to do this :
List<Computation> computations = ...
List<Thread> threads = new ArrayList<>();
for (Computation computation : computations) {
    Thread thread = new Thread(new Computation(computation));
    threads.add(thread);
    thread.start(); // one OS thread per computation
}
// If you need to wait for completion:
for (Thread t : threads) {
    t.join();
}
Because it would 1) be unnecessarily costly in terms of OS resources (native threads, each with its own stack), 2) spam the OS scheduler with a vastly concurrent workload, most certainly leading to plenty of context switches and associated cache invalidations at the CPU level, and 3) be a nightmare to catch and deal with exceptions (your threads should probably define an uncaught exception handler, and you'd have to deal with it manually).
You'd probably prefer an approach where a finite Thread pool (of a few threads, "a few" being closely related to your number of CPU cores) handles many many Callables.
List<Computation> computations = ...
ExecutorService pool = Executors.newFixedThreadPool(someNumber);
List<Future<Result>> results = new ArrayList<>();
for (Computation computation : computations) {
    results.add(pool.submit(new ComputationCallable(computation)));
}
for (Future<Result> result : results) {
    Result value = result.get(); // blocks until this computation has finished
}
pool.shutdown();
The fact that you reuse a limited number of threads should yield a really nice improvement.
Is there a generic pattern/recommendation how to do it by solution II?
There are. First, your partition code (getting from a List to a List<List>) can be found inside collection tools such as Guava, with more generic and fail-proofed implementations.
But more than this, two patterns come to mind for what you are achieving :
Use the Fork/Join Pool with Fork/Join tasks (that is, spawn a task with your whole list of items, and each task will fork sub tasks with half of that list, up to the point where each task manages a small enough list of items). It's divide and conquer. See:
If your computation were to be "add integers from a list", it could look like (there might be a boundary bug in there, I did not really check) :
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.Future;
import java.util.concurrent.RecursiveTask;
public static class Adder extends RecursiveTask<Integer> {
    protected List<Integer> globalList;
    protected int start;
    protected int stop;
    public Adder(List<Integer> globalList, int start, int stop) {
        this.globalList = globalList;
        this.start = start;
        this.stop = stop;
        System.out.println("Creating for " + start + " => " + stop);
    }
    @Override
    protected Integer compute() {
        if (stop - start > 1000) {
            // Too many arguments, we split the list
            Adder subTask1 = new Adder(globalList, start, start + (stop-start)/2);
            Adder subTask2 = new Adder(globalList, start + (stop-start)/2, stop);
            subTask2.fork(); // schedule the second half asynchronously
            return subTask1.compute() + subTask2.join();
        } else {
            // Manageable size of arguments, we deal in place
            int result = 0;
            for (int i = start; i < stop; i++) {
                result += globalList.get(i); // add the element, not the index
            }
            return result;
        }
    }
}
public void doWork() throws Exception {
    List<Integer> computation = new ArrayList<>();
    for (int i = 0; i < 10000; i++) {
        computation.add(i);
    }
    ForkJoinPool pool = new ForkJoinPool();
    RecursiveTask<Integer> masterTask = new Adder(computation, 0, computation.size());
    Future<Integer> future = pool.submit(masterTask);
    System.out.println(future.get());
}
Use Java 8 parallel streams in order to launch multiple parallel computations easily (under the hood, Java parallel streams run on the common Fork/Join pool by default).
Others have shown what this might look like.
Are you aware of a different approach?
For a different take at concurrent programming (without explicit task / thread handling), have a look at the actor pattern.
Akka comes to mind as a popular implementation of this pattern...
@Aaron is right, you should take a look into Java 8's parallel streams:
<V> void processInParallel(List<V> list) {
    list.parallelStream().forEach(item -> {
        // do something
    });
}
If you need to specify chunks, you could use a ForkJoinPool as described here:
<V> void processInParallel(List<V> list, int chunks) {
    ForkJoinPool forkJoinPool = new ForkJoinPool(chunks);
    forkJoinPool.submit(() -> {
        list.parallelStream().forEach(item -> {
            // do something with each item
        });
    });
}
You could also have a functional interface as an argument:
<V> void processInParallel(List<V> list, int chunks, Consumer<V> processor) {
    ForkJoinPool forkJoinPool = new ForkJoinPool(chunks);
    forkJoinPool.submit(() -> {
        list.parallelStream().forEach(item -> processor.accept(item));
    });
}
Or in shorthand notation:
<V> void processInParallel(List<V> list, int chunks, Consumer<V> processor) {
    new ForkJoinPool(chunks).submit(() -> list.parallelStream().forEach(processor));
}
And then you would use it like:
processInParallel(myList, 2, item -> {
    // do something with each item
});
Depending on your needs, the ForkJoinPool#submit() returns an instance of ForkJoinTask, which is a Future and you may use it to check for the status or wait for the end of your task.
You'd most probably want the ForkJoinPool instantiated only once (not instantiate it on every method call) and then reuse it to prevent CPU choking if the method is called multiple times.
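A minimal sketch of those last two points, under some assumptions: the class name ParallelProcessor and the pool size of 4 are made up for the example, and the fact that a parallel stream started from inside a ForkJoinPool task runs on that pool is observed behaviour rather than a documented guarantee. The pool is created once, and the returned ForkJoinTask lets the caller wait for completion.

```java
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinTask;
import java.util.function.Consumer;

class ParallelProcessor {
    // One pool for the whole application instead of a new pool per call.
    private static final ForkJoinPool POOL = new ForkJoinPool(4);

    // Submits the parallel stream to the shared pool and returns a handle
    // the caller can use to check the status or wait for the end of the work.
    static <V> ForkJoinTask<?> processInParallel(List<V> list, Consumer<V> processor) {
        return POOL.submit(() -> list.parallelStream().forEach(processor));
    }
}
```

A caller that needs the work to be finished before continuing could write ParallelProcessor.processInParallel(myList, item -> handle(item)).join(); where handle is whatever per-item method applies.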
Is creating thousand of Runnables really expensive and to be avoided?
Not at all. The Runnable/Callable interfaces have only one method to implement each, and the amount of "extra" code in each task depends on the code you are running. Any overhead is certainly no fault of the Runnable/Callable interfaces.
Is there a generic pattern/recommendation how to do it by solution II?
Pattern 2 is more favorable than pattern 1. This is because pattern 1 assumes that each worker will finish at the exact same time. If some workers finish before other workers, they could just be sitting idle since they only are able to work on the y/x-size queues you assigned to each of them. In pattern 2 however, you will never have idle worker threads (unless the end of the work queue is reached and numWorkItems < numWorkers).
An easy way to use the preferred pattern, pattern 2, is to use the ExecutorService invokeAll(Collection<? extends Callable<T>> list) method.
Here is an example usage:
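List<Callable<Object>> workList = // a single list of all of your work
ExecutorService es = Executors.newCachedThreadPool();
List<Future<Object>> results = es.invokeAll(workList); // blocks until every task has completed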
Fairly readable and straightforward usage, and the ExecutorService implementation will automatically use solution 2 for you, so you know that each worker thread has their use time maximized.
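For a version that can be run as is, here is a small sketch. The class InvokeAllDemo and the squaring workload are made up for the example, and it uses a fixed pool rather than a cached one so that thousands of tasks do not turn into thousands of threads. The pool's internal queue plays the role of the shared work queue from solution II: each worker takes the next Callable as soon as it is free.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class InvokeAllDemo {
    // Squares every input on a small fixed pool and returns the sum of the squares.
    static long sumOfSquares(List<Integer> inputs, int workers) {
        List<Callable<Long>> workList = new ArrayList<>();
        for (Integer value : inputs) {
            final long v = value;
            workList.add(() -> v * v); // one cheap Callable object per item
        }
        ExecutorService es = Executors.newFixedThreadPool(workers);
        try {
            long total = 0;
            // invokeAll returns once all tasks are done; the futures are in input order.
            for (Future<Long> f : es.invokeAll(workList)) {
                total += f.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            es.shutdown();
        }
    }
}
```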
Are you aware of a different approach?
Solutions 1 and 2 are two common approaches for generic work. There are many different implementations available for you to choose from (such as java.util.concurrent, Java 8 parallel streams, or Fork/Join pools), but the concept behind each implementation is generally the same. The only exception is if you have specific tasks in mind with non-standard running behavior.

Producer-consumer problem with a twist

The producer is finite, as should be the consumer.
The problem is when to stop, not how to run.
Communication can happen over any type of BlockingQueue.
Can't rely on poisoning the queue(PriorityBlockingQueue)
Can't rely on locking the queue(SynchronousQueue)
Can't rely on offer/poll exclusively(SynchronousQueue)
Probably even more exotic queues in existence.
Creates a queued seq on another (presumably lazy) seq s. The queued
seq will produce a concrete seq in the background, and can get up to
n items ahead of the consumer. n-or-q can be an integer n buffer
size, or an instance of java.util.concurrent BlockingQueue. Note
that reading from a seque can block if the reader gets ahead of the
producer.
My attempts so far + some tests:
Solutions in Java or Clojure appreciated.
class Reader {
private final ExecutorService ex = Executors.newSingleThreadExecutor();
private final List<Object> completed = new ArrayList<Object>();
private final BlockingQueue<Object> doneQueue = new LinkedBlockingQueue<Object>();
private int pending = 0;
public synchronized Object take() {
Object rVal;
if(completed.isEmpty()) {
try {
rVal = doneQueue.take();
} catch (InterruptedException e) {
throw new RuntimeException(e);
} else {
rVal = completed.remove(0);
return rVal;
private void removeDone() {
Object current = doneQueue.poll();
while(current != null) {
current = doneQueue.poll();
private void queue() {
while(pending < 10) {
ex.submit(new Runnable() {
public void run() {
private Object compute() {
//do actual computation here
return new Object();
Not exactly an answer I'm afraid, but a few remarks and more questions. My first answer would be: use clojure.core/seque. The producer needs to communicate end-of-seq somehow for the consumer to know when to stop, and I assume the number of produced elements is not known in advance. Why can't you use an EOS marker (if that's what you mean by queue poisoning)?
If I understand your alternative seque implementation correctly, it will break when elements are taken off the queue outside your function, since channel and q will be out of step in that case: channel will hold more #(.take q) elements than there are elements in q, causing it to block. There might be ways to ensure channel and q are always in step, but that would probably require implementing your own Queue class, and it adds so much complexity that I doubt it's worth it.
Also, your implementation doesn't distinguish between normal EOS and abnormal queue termination due to thread interruption - depending on what you're using it for you might want to know which is which. Personally I don't like using exceptions in this way — use exceptions for exceptional situations, not for normal flow control.
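For the FIFO case, the EOS-marker idea could look roughly like the Java sketch below. The class name EosDemo and the buffer size of 2 are arbitrary choices for the example. It assumes a queue that preserves insertion order (a LinkedBlockingQueue here); as the question notes, a PriorityBlockingQueue would reorder the marker, so this does not cover that case. Normal termination is signalled by the marker, while interruption surfaces as an exception, which keeps the two cases apart.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class EosDemo {
    // Unique sentinel; compared by identity, so it cannot clash with a real item.
    private static final Object EOS = new Object();

    static List<Object> produceAndConsume(List<?> items) {
        BlockingQueue<Object> queue = new LinkedBlockingQueue<>(2); // producer runs at most 2 ahead
        Thread producer = new Thread(() -> {
            try {
                for (Object item : items) {
                    queue.put(item);
                }
                queue.put(EOS); // normal end of sequence
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        List<Object> consumed = new ArrayList<>();
        try {
            // The consumer stops when it sees the marker, not when the queue is empty.
            for (Object o = queue.take(); o != EOS; o = queue.take()) {
                consumed.add(o);
            }
            producer.join();
        } catch (InterruptedException e) {
            // Abnormal termination is reported separately from normal EOS.
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted before end of sequence", e);
        }
        return consumed;
    }
}
```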

Multithreading and recursion together

I have recursive code that processes a tree structure in a depth first manner. The code basically looks like this:
function(TreeNode curr) {
    if (curr.children != null && !curr.children.isEmpty()) {
        for (TreeNode n : curr.children) {
            //do some stuff
            function(n);
        }
    } else {
        //do some other processing
    }
}
I want to use threads to make this complete faster. Most of the time is spent traversing so I don't want to just create a thread to handle "the other processing" because it doesn't take that long. I think I want to fork threads at "do some stuff" but how would that work?
It's a good case for the Fork/Join framework, which has been part of java.util.concurrent since Java 7 (at the time of writing it was also available as a standalone library for use with Java 6).
Something like this:
public class TreeTask extends RecursiveAction {
    private final TreeNode node;
    private final int level;
    public TreeTask(TreeNode node, int level) {
        this.node = node;
        this.level = level;
    }
    @Override
    public void compute() {
        // It makes sense to switch to single-threaded execution after some threshold
        if (level > THRESHOLD) {
            function(node);
            return; // the subtree has been handled sequentially, do not fork it again
        }
        if (node.children != null && !node.children.isEmpty()) {
            List<TreeTask> subtasks = new ArrayList<TreeTask>(node.children.size());
            for (TreeNode n : node.children) {
                // do some stuff
                subtasks.add(new TreeTask(n, level + 1));
            }
            invokeAll(subtasks); // Invoke and wait for completion
        } else {
            //do some other processing
        }
    }
}
ForkJoinPool p = new ForkJoinPool(N_THREADS);
p.invoke(new TreeTask(root, 0));
The key point of the fork/join framework is work stealing: while waiting for the completion of subtasks, a thread executes other tasks. It allows you to write the algorithm in a straightforward way, while avoiding the thread exhaustion problems that a naive approach with ExecutorService would have.
In the // do some stuff code block where you work on the individual Node, what you could do instead is submit the Node to some sort of ExecutorService (in the form of a Runnable which will work on the Node).
You can configure the ExecutorService that you use to be backed by a pool of a certain number of threads, allowing you to decouple the "handling" logic (along with logic around creating threads, how many to create, etc) from your tree-parsing logic.
This solution assumes that the processing only happens at the leaf nodes and that the actual recursion of the tree doesn't take a long time.
I would have the caller thread do the recursion and then a BlockingQueue of workers that process the leaves via a thread pool. I'm not handling the InterruptedException in a couple of places here.
public void processTree(TreeNode top) {
    final LinkedBlockingQueue<Runnable> queue =
        new LinkedBlockingQueue<Runnable>(MAX_NUM_QUEUED);
    // create a pool that starts at 1 threads and grows to MAX_NUM_THREADS
    ExecutorService pool =
        new ThreadPoolExecutor(1, MAX_NUM_THREADS, 0L, TimeUnit.MILLISECONDS, queue,
            new RejectedExecutionHandler() {
                public void rejectedExecution(Runnable r, ThreadPoolExecutor e) {
                    queue.put(r); // block if we run out of space in the pool
                }
            });
    walkTree(top, pool);
    pool.shutdown();
    // i think this will join with all of the threads
    // (the timeout value is an assumption; pick one that suits your workload)
    pool.awaitTermination(Long.MAX_VALUE, TimeUnit.MILLISECONDS);
}
private void walkTree(final TreeNode curr, ExecutorService pool) {
    if (curr.children == null || curr.children.isEmpty()) {
        pool.submit(new Runnable() {
            public void run() {
                processLeaf(curr);
            }
        });
        return;
    }
    for (TreeNode child : curr.children) {
        walkTree(child, pool);
    }
}
private void processLeaf(TreeNode leaf) {
    // ...
}

Which ThreadPool in Java should I use?

There is a huge number of tasks.
Each task belongs to a single group. The requirement is that each group of tasks should be executed serially, just as if executed in a single thread, and the throughput should be maximized in a multi-core (or multi-CPU) environment. Note: there is also a huge number of groups, proportional to the number of tasks.
The naive solution is to use ThreadPoolExecutor and synchronize (or lock). However, threads would block each other and the throughput is not maximized.
Any better idea? Or is there a third-party library that satisfies the requirement?
A simple approach would be to "concatenate" all group tasks into one super task, thus making the sub-tasks run serially. But this will probably cause delay in other groups that will not start unless some other group completely finishes and makes some space in the thread pool.
As an alternative, consider chaining a group's tasks. The following code illustrates it:
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
public class MultiSerialExecutor {
    private final ExecutorService executor;
    public MultiSerialExecutor(int maxNumThreads) {
        executor = Executors.newFixedThreadPool(maxNumThreads);
    }
    public void addTaskSequence(List<Runnable> tasks) {
        executor.execute(new TaskChain(tasks));
    }
    public void shutdown() {
        executor.shutdown();
    }
    private class TaskChain implements Runnable {
        private List<Runnable> seq;
        private int ind;
        public TaskChain(List<Runnable> seq) {
            this.seq = seq;
        }
        public void run() {
            seq.get(ind++).run(); //NOTE: No special error handling
            if (ind < seq.size())
                executor.execute(this); // re-queue the chain so other groups get a turn
        }
    }
}
The advantage is that no extra resource (thread/queue) is being used, and that the granularity of tasks is better than the one in the naive approach. The disadvantage is that all group's tasks should be known in advance.
To make this solution generic and complete, you may want to decide on error handling (i.e. whether a chain continues even if an error occurs), and it would also be a good idea to implement ExecutorService and delegate all calls to the underlying executor.
I would suggest to use task queues:
For every group of tasks, create a queue and insert all tasks from that group into it.
Now all your queues can be executed in parallel, while the tasks inside one queue are executed serially.
A quick Google search suggests that the Java API has no task/thread queues by itself. However, there are many tutorials available on coding one. Everyone, feel free to list good tutorials/implementations if you know some:
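One way such per-group queues could be built on top of an ordinary thread pool is sketched below. It follows the idea of the SerialExecutor example in the java.util.concurrent.Executor documentation, extended with one queue per group; the class name GroupedExecutor and the use of a String as group key are assumptions made for the example. Tasks of one group never run concurrently, and a group gives its thread back to the pool after every task so that other groups can make progress.

```java
import java.util.ArrayDeque;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executor;

class GroupedExecutor {
    private final Executor pool;
    private final Map<String, SerialQueue> groups = new ConcurrentHashMap<>();

    GroupedExecutor(Executor pool) {
        this.pool = pool;
    }

    // Tasks of the same group run one after another; different groups may run in parallel.
    void execute(String group, Runnable task) {
        groups.computeIfAbsent(group, g -> new SerialQueue()).add(task);
    }

    private final class SerialQueue {
        private final Queue<Runnable> tasks = new ArrayDeque<>();
        private boolean running; // guarded by this; true while a drain step is scheduled or executing

        synchronized void add(Runnable task) {
            tasks.add(task);
            if (!running) {
                running = true;
                pool.execute(this::drainOne);
            }
        }

        private void drainOne() {
            Runnable next;
            synchronized (this) {
                next = tasks.poll(); // never null: a drain step is only scheduled for a non-empty queue
            }
            try {
                next.run();
            } finally {
                synchronized (this) {
                    if (tasks.isEmpty()) {
                        running = false;
                    } else {
                        pool.execute(this::drainOne); // hand the thread back between tasks
                    }
                }
            }
        }
    }
}
```

Note that this sketch never removes an idle group from the map; with a huge number of short-lived groups you would want to evict empty queues as well.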
I mostly agree with Dave's answer, but if you need to slice CPU time across all "groups", i.e. all task groups should progress in parallel, you might find this kind of construct useful (using removal as a "lock"). This worked fine in my case, although I imagine it tends to use more memory:
class TaskAllocator {
    private final ConcurrentLinkedQueue<Queue<Runnable>> entireWork
        = childQueuePerTaskGroup();
    public Queue<Runnable> lockTaskGroup(){
        // removing the group from the queue acts as the "lock"
        return entireWork.poll();
    }
    public void release(Queue<Runnable> taskGroup){
        entireWork.offer(taskGroup);
    }
}
class DoWork implements Runnable {
    private final TaskAllocator allocator;
    public DoWork(TaskAllocator allocator){
        this.allocator = allocator;
    }
    public void run(){
        for(;;){
            Queue<Runnable> taskGroup = allocator.lockTaskGroup();
            if(taskGroup == null){
                //No more work
                return;
            }
            Runnable work = taskGroup.poll();
            if(work == null){
                //This group is done
                continue;
            }
            //Do work, but never forget to release the group to
            // the allocator.
            try {
                work.run();
            } finally {
                allocator.release(taskGroup);
            }
        }
    }
}
You can then use an optimal number of threads to run the DoWork task. It's a kind of round-robin load balancing.
You can even do something more sophisticated by using this instead of a simple queue in TaskAllocator (task groups with more tasks remaining tend to get executed first):
ConcurrentSkipListSet<MyQueue<Runnable>> sophisticatedQueue =
    new ConcurrentSkipListSet<>(new SophisticatedComparator());
where SophisticatedComparator is
class SophisticatedComparator implements Comparator<MyQueue<Runnable>> {
    public int compare(MyQueue<Runnable> o1, MyQueue<Runnable> o2){
        int diff = o2.size() - o1.size();
        if (diff == 0) {
            //This is crucial. You must assign unique ids to your
            //Subqueue and break the equality if they happen to have same size.
            //Otherwise your queues will disappear...
            return o1.getId() - o2.getId(); // getId() is a hypothetical accessor for that unique id
        }
        return diff;
    }
}
Actors are another solution for this specific type of issue.
Scala has actors, and so does Java, where they are provided by Akka.
I had a problem similar to yours, and I used an ExecutorCompletionService, which works with an Executor to complete collections of tasks.
Here is an extract from java.util.concurrent API, since Java7:
Suppose you have a set of solvers for a certain problem, each returning a value of some type Result, and would like to run them concurrently, processing the results of each of them that return a non-null value, in some method use(Result r). You could write this as:
void solve(Executor e, Collection<Callable<Result>> solvers)
    throws InterruptedException, ExecutionException {
    CompletionService<Result> ecs = new ExecutorCompletionService<Result>(e);
    for (Callable<Result> s : solvers)
        ecs.submit(s);
    int n = solvers.size();
    for (int i = 0; i < n; ++i) {
        Result r = ecs.take().get();
        if (r != null)
            use(r);
    }
}
So, in your scenario, every task will be a single Callable<Result>, and tasks will be grouped in a Collection<Callable<Result>>.

