If I run the following code:
System.out.println("Common Pool Parallelism: "
        + ForkJoinPool.getCommonPoolParallelism());
long start = System.currentTimeMillis();
IntStream.range(0, ForkJoinPool.getCommonPoolParallelism() * 2)
        .parallel()
        .forEach(i -> {
            System.out.println(i + " " + (System.currentTimeMillis() - start) + "ms");
            LockSupport.parkNanos(TimeUnit.SECONDS.toNanos(1));
        });
I get (for example) the following output:
Common Pool Parallelism: 3
5 82ms
3 82ms
1 82ms
0 82ms
4 1087ms
2 1087ms
It appears to me as though the common ForkJoinPool is using 4 threads, then they're blocked for a second, then the last two jobs are run.
But as I understand it, the common ForkJoinPool defaults to using Runtime.getRuntime().availableProcessors() - 1 threads, which in my case is 3 so I'd expect to get three jobs printing at ~82ms not 4.
Is it my code or my understanding that's amiss?
ETA: Printing out Thread.currentThread().getId() within the forEach also shows 4 different thread IDs.
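A variation of the snippet above (a sketch; the class name is made up) prints thread names instead of ids. On most machines one of the names will be main, because the calling thread also participates in executing the stream's tasks, which would account for parallelism + 1 tasks starting together:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.LockSupport;
import java.util.stream.IntStream;

public class WhoRunsTheTasks {
    public static void main(String[] args) {
        int parallelism = ForkJoinPool.getCommonPoolParallelism();
        System.out.println("parallelism = " + parallelism);
        // Twice as many tasks as workers, each parking for a second,
        // so every participating thread shows up in the output.
        IntStream.range(0, parallelism * 2).parallel().forEach(i -> {
            System.out.println(i + " on " + Thread.currentThread().getName());
            LockSupport.parkNanos(TimeUnit.SECONDS.toNanos(1));
        });
    }
}
```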
I was investigating Mockito's behavior in one of our tests and I didn't understand it. The following code snippet shows the behavior:
@Test
public void test() {
    var mymock = Mockito.mock(Object.class);
    var l = List.of(1, 2, 3);
    l.parallelStream().forEach(val -> {
        int finalI = val;
        doAnswer(invocationOnMock -> {
            log.info("{}", finalI);
            return null;
        }).when(mymock).toString();
        mymock.toString();
    });
}
The output I was expecting is prints of 1, 2, 3 in some order (not necessarily sorted).
The output I received:
2
2
2
Why did I get this output?
This is a classic race condition: your parallel stream executes the forEach lambda three times in parallel. In this case, all three threads managed to execute the doAnswer() call, before any of them got to the mymock.toString() line. It also just so happens that the execution where val=2 happened to run last. Happily, Mockito is reasonably thread-safe, so you don't just get exceptions thrown.
One way your scenario might occur is if the threads happened to run like this:
Main thread: calls l.parallelStream().forEach(), which spawns 3 threads with val equal to 1, 2 and 3
val=1 thread: executes finalI=val, giving it a value of 1 in this thread
val=2 thread: executes finalI=val, giving it a value of 2 in this thread
val=3 thread: executes finalI=val, giving it a value of 3 in this thread
val=3 thread: calls doAnswer(...) which makes future calls to mymock.toString() print out 3
val=1 thread: calls doAnswer(...) which makes future calls to mymock.toString() print out 1
val=2 thread: calls doAnswer(...) which makes future calls to mymock.toString() print out 2
val=1 thread: calls mymock.toString(), which prints out 2
val=2 thread: calls mymock.toString(), which prints out 2
val=3 thread: calls mymock.toString(), which prints out 2
Note that this isn't necessarily happening in order, or even in this order:
some parts of this may literally be happening at the same time, in different cores of your processor.
while any steps within a single thread would always run in order, they might run before, after, or interleaved with any steps from any other thread.
Because the threads are running in parallel, and you have not put anything in place to manage this, there is no guarantee whatsoever about the order in which these calls occur: your output could just as easily have been 1 1 1, 1 2 3, 3 2 1 or 3 2 2 or 1 1 3 etc.
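The same last-writer-wins race can be reproduced without Mockito (a sketch; the AtomicReference here stands in for the mock's single shared answer, and the class name is made up):

```java
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicReference;

public class LastWriteWins {
    // Stand-in for the Mockito scenario: every thread overwrites a single
    // shared "stub" and then reads it back, so whichever write lands last
    // can be observed by all readers.
    static Queue<Integer> run() {
        AtomicReference<Integer> stub = new AtomicReference<>();
        Queue<Integer> printed = new ConcurrentLinkedQueue<>();
        List.of(1, 2, 3).parallelStream().forEach(val -> {
            stub.set(val);            // like doAnswer(): replaces the shared answer
            printed.add(stub.get());  // like toString(): reads whatever answer is current
        });
        return printed;
    }

    public static void main(String[] args) {
        // Timing dependent: might print [2, 2, 2], [1, 2, 3], [3, 3, 2], ...
        System.out.println(run());
    }
}
```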
I want to understand the ordering constraints between nested streams in Java.
Example 1:
public static void main(String[] args) {
    IntStream.range(0, 10).forEach(i -> {
        System.out.println(i);
        IntStream.range(0, 10).forEach(j -> {
            System.out.println(" " + i + " " + j);
        });
    });
}
This code executes deterministically, so the inner loop runs forEach on each j before the outer loop runs its own forEach on the next i:
0
0 0
0 1
0 2
0 3
0 4
0 5
0 6
0 7
0 8
0 9
1
1 0
1 1
1 2
1 3
1 4
1 5
1 6
1 7
1 8
1 9
2
2 0
2 1
2 2
2 3
...
Example 2:
public static void main(String[] args) {
    IntStream.range(0, 10).parallel().forEach(i -> {
        System.out.println(i);
        IntStream.range(0, 10).parallel().forEach(j -> {
            System.out.println(" " + i + " " + j);
        });
    });
}
If the streams are made parallel() as in this second example, I could imagine the inner workers blocking as they wait for threads to become available in the outer work queue, since the outer work queue threads have to block on the completion of the inner stream, and the default thread pool only has a limited number of threads. However, deadlock does not appear to occur:
6
5
8
8 6
0
1
6 2
7
1 6
8 5
7 6
8 8
2
0 6
0 2
0 8
5 2
5 4
5 6
0 5
2 6
7 2
7 5
7 8
6 4
8 9
1 5
...
Both streams share the same default thread pool, yet they generate different work units. Each outer work unit can only complete after all inner units for that outer work unit have completed, since there is a completion barrier at the end of each parallel stream.
How is the coordination between these inner and outer streams managed across the shared pool of worker threads, without any sort of deadlock?
The thread pool behind parallel streams is the common pool, which you can get with ForkJoinPool.commonPool(). It usually uses NumberOfProcessors - 1 workers. To resolve dependencies like you've described, it's able to dynamically create additional workers if (some) current workers are blocked and a deadlock becomes possible.
However, this is not the answer for your case.
Tasks in a ForkJoinPool have two important functionalities:
They can create subtasks and split the current task into smaller pieces (fork).
They can wait for the subtasks (join).
When a thread executes such a task A and joins a subtask B, it doesn't just block waiting for the subtask to finish its execution; it executes another task C in the meantime. When C is finished, the thread comes back to A and checks whether B is finished. Note that B and C can be (and most likely are) the same task. If B is finished, then A has successfully waited for/joined it (non-blocking!). Check out this guide if the previous explanation is not clear.
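The fork/join pattern described above can be sketched directly with a RecursiveTask (the class RangeSum is illustrative, not taken from the stream implementation):

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// A RecursiveTask that sums a range by forking halves until the chunks are
// small enough to run sequentially. Parallel streams use the same fork/join
// machinery internally.
public class RangeSum extends RecursiveTask<Long> {
    private static final int THRESHOLD = 1_000; // below this, just loop
    private final long from, to; // sums the half-open range [from, to)

    RangeSum(long from, long to) { this.from = from; this.to = to; }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {
            long sum = 0;
            for (long i = from; i < to; i++) sum += i;
            return sum;
        }
        long mid = (from + to) / 2;
        RangeSum left = new RangeSum(from, mid);
        RangeSum right = new RangeSum(mid, to);
        left.fork();                        // push the left half onto the work queue
        long rightResult = right.compute(); // work on the right half ourselves
        return left.join() + rightResult;   // join: help execute tasks until left is done
    }

    public static void main(String[] args) {
        long sum = ForkJoinPool.commonPool().invoke(new RangeSum(0, 100_000));
        System.out.println(sum); // 4999950000
    }
}
```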
Now when you use a parallel stream, the range of the stream is split into tasks recursively until the tasks become so small that they can be executed sequentially more efficiently. Those tasks are put into a work queue (there is one for each worker) in the common pool. So, what IntStream.range(0, 100).parallel().forEach does is splitting up the range recursively until it's not worth it anymore. Each final task, or rather bunch of iterations, can be executed sequentially with the provided code in forEach. At this point the workers in the common pool can just execute those tasks until all are done and the stream can return. Note that the calling thread helps out with the execution by joining subtasks!
Now each of those tasks uses a parallel stream itself in your case. The procedure is the same; split it up into smaller tasks and put those tasks into a work queue in the common pool. From the ForkJoinPool's perspective those are just additional tasks on top of the already present ones. The workers just keep executing/joining tasks until all are done and the outer stream can return.
This is what you see in the output: there is no deterministic behaviour, no fixed order. Also, no deadlock can occur, because in this use case no thread ever blocks.
You can check the explanation with the following code:
public static void main(String[] args) {
    IntStream.range(0, 10).parallel().forEach(i -> {
        IntStream.range(0, 10).parallel().forEach(j -> {
            for (int x = 0; x < 1e6; x++) { Math.sqrt(Math.log(x)); }
            System.out.printf("%d %d %s\n", i, j, Thread.currentThread().getName());
            for (int x = 0; x < 1e6; x++) { Math.sqrt(Math.log(x)); }
        });
    });
}
You should notice that the main thread is involved in the execution of the inner iterations, so it is not (!) blocked. The common pool workers just pick tasks one after another until all are finished.
The optimal number of threads in a pool is case-specific, though there is a rule of thumb that says #threads = #CPU + 1.
However, how does this work with threads spawning other threads and waiting (i.e. blocked until thread.join() succeeds) for these 'subthreads'?
Assume that I have code that requires the execution of a list of tasks (2), each of which has subtasks (2), each of which has subsubtasks (3), and so on. The total number of leaf tasks is 2*2*3 = 12, but 18 threads will be created (because each task spawns a thread per subtask, and a thread that spawns subthreads is blocked until they all finish). See below for pseudocode.
I am assuming that for a CPU with N cores there is a rule of thumb that everything can be parallelized if the highest number of active threads (12) is #CPU + 1. Is this correct?
PseudoCode
outputOfTask = []
for subtask in subTaskList:
    outputOfTask -> append(subtask.doCompute())
// wait until all output is finished

in subtask.java (each subtask implements the same interface, for example, but can be different):
outputOfSubtask = []
for subsubtask in subsubTaskList:
    // do some magic depending on the type of subtask
    outputOfSubtask -> append(subsubtask.doCompute())
return outputOfSubtask

in subsubtask.java:
outputOfSubsubtask = []
for subsubsubtask in subsubsubTaskList:
    // do some magic depending on the type of subsubtask
    outputOfSubsubtask -> append(subsubsubtask.doCompute())
return outputOfSubsubtask
EDIT:
Dummy Java code. I used this in my original question to check how many threads were active, but I assume the pseudocode above is clearer. Please note: I used Eclipse Collections, which provides the asParallel method and allows for a shorter notation of the code.
@Test
public void testasParallelthreads() {
    // ExecutorService executor = Executors.newWorkStealingPool();
    ExecutorService executor = Executors.newCachedThreadPool();
    MutableList<Double> myMainTask = Lists.mutable.with(1.0, 2.0);
    MutableList<Double> mySubTask = Lists.mutable.with(1.0, 2.0);
    MutableList<Double> mySubSubTask = Lists.mutable.with(1.0, 2.0);
    MutableList<Double> mySubSubSubTask = Lists.mutable.with(1.0, 2.0, 2.0);
    MutableList<Double> a = myMainTask.asParallel(executor, 1)
            .flatCollect(task -> mySubTask.asParallel(executor, 1)
                    .flatCollect(subTask -> mySubSubTask.asParallel(executor, 1)
                            .flatCollect(subSubTask -> mySubSubSubTask.asParallel(executor, 1)
                                    .flatCollect(subSubSubTask -> dummyFunction(task, subTask, subSubTask, subSubSubTask, executor))
                                    .toList()).toList()).toList()).toList();
    System.out.println("pool size: " + ((ThreadPoolExecutor) executor).getPoolSize());
    executor.shutdownNow();
}
private MutableList<Double> dummyFunction(double a, double b, double c, double d, ExecutorService ex) {
    System.out.println("ThreadId: " + Thread.currentThread().getId());
    System.out.println("Active threads size: " + ((ThreadPoolExecutor) ex).getActiveCount());
    return Lists.mutable.with(a, b, c, d);
}
I am assuming that for a CPU with N cores there is a rule of thumb that everything can be parallelized if the highest number of active threads (12) is #CPU + 1. Is this correct?
This topic is extremely hard to generalize about. Even with the actual code, the performance of your application is going to be very difficult to determine. Even if you could come up with an estimate, the actual performance may vary wildly between runs, especially considering that the threads interact with each other. The only time we can take the #CPU + 1 number at face value is when the jobs submitted to the thread pool are independent and completely CPU-bound.
I'd recommend trying a number of different thread-pool size values under simulated load to find the optimal values for your application. Examining the overall throughput numbers or system load stats should give you the feedback you need.
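A rough sketch of such a sizing experiment (the class name and the busy-work loop are placeholders; a real experiment would use your actual workload):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolSizeProbe {
    // Times the same batch of CPU-bound jobs with a given pool size.
    static long runBatch(int poolSize, int jobs) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        CountDownLatch done = new CountDownLatch(jobs);
        long start = System.nanoTime();
        for (int i = 0; i < jobs; i++) {
            pool.execute(() -> {
                double x = 0;
                for (int k = 1; k < 200_000; k++) x += Math.sqrt(k); // busy work
                if (x < 0) System.out.println(x); // never true; defeats dead-code elimination
                done.countDown();
            });
        }
        done.await();
        pool.shutdown();
        return (System.nanoTime() - start) / 1_000_000; // elapsed ms
    }

    public static void main(String[] args) throws InterruptedException {
        for (int size : new int[] {1, 2, 4, 8}) {
            System.out.println(size + " threads: " + runBatch(size, 64) + " ms");
        }
    }
}
```

Comparing the printed times across sizes shows where adding threads stops paying off on your machine.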
However, how does this work with threads spawning other threads and waiting (i.e. blocked until thread.join() succeeds) for these 'subthreads'?
Threads will block, and it is up to the OS/JVM to schedule another one if possible. If you have a single-thread executor and call join from one of your tasks, the other task won't even get started. With executors that use more threads, the blocking task ties up a single thread and the OS/JVM is free to schedule other threads.
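The single-thread case can be sketched as follows (illustrative names; a Future timeout is used so the program terminates instead of hanging forever):

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// With a single-thread executor, a task that waits on a subtask's Future
// deadlocks: the only worker is occupied waiting, so the subtask can never start.
public class SingleThreadDeadlock {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        Future<String> outer = pool.submit(() -> {
            Future<String> inner = pool.submit(() -> "done"); // queued behind us
            return inner.get(1, TimeUnit.SECONDS);            // we never release the worker
        });
        try {
            System.out.println(outer.get());
        } catch (ExecutionException e) {
            System.out.println("inner task never ran: " + e.getCause());
        } finally {
            pool.shutdownNow();
        }
    }
}
```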
These blocked threads should not consume CPU time, because they are blocked. So I am assuming that for a CPU with N cores there is a rule of thumb that everything can be parallelized if the highest number of active threads (24) is #CPU + 1. Is this correct?
Active threads can also be blocked threads. I think you're mixing terms here: #CPU, the number of cores, and the number of virtual cores. If you have N physical cores, then you can run N CPU-bound tasks in parallel. When you have other types of blocking or very short-lived tasks, you can have more parallel tasks.
I would like to clarify a behavior of Java parallel streams. If I were to use a parallel stream as shown below, can I take it as a guarantee that the line "Should happen at the end" shown in the code below will only print after all parallel tasks have executed?
public static void main(String[] args) {
    Stream.of(1, 2, 3, 5, 7, 8, 9, 10, 11, 23, 399, 939).parallel()
            .forEach(integer -> System.out.println(
                    Thread.currentThread().getName() + ", for number " + integer));
    System.out.println("Should happen at the end");
}
Repeated trials always print "Should happen at the end" last, as shown below. This may be happening because the main thread is also used to process some of the elements. Is it possible that in some scenario the main thread is not used, in which case "Should happen at the end" would print before all the ForkJoinPool.commonPool-worker threads have finished executing their tasks?
main, for number 7
main, for number 6
ForkJoinPool.commonPool-worker-9, for number 3
ForkJoinPool.commonPool-worker-9, for number 4
ForkJoinPool.commonPool-worker-4, for number 1
ForkJoinPool.commonPool-worker-4, for number 10
ForkJoinPool.commonPool-worker-2, for number 2
ForkJoinPool.commonPool-worker-11, for number 5
ForkJoinPool.commonPool-worker-9, for number 8
main, for number 9
Should happen at the end
Stream terminal operations are not asynchronous: Java gives control back to the calling thread only after forEach has terminated. Therefore, whatever you print after forEach is necessarily printed afterwards on the console.
Note, however, that if you were to use the java.util.logging API instead of System.out, the output ordering would not be as predictable, as the logging API itself cannot (mainly for performance reasons) guarantee the ordering of outputs.
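This guarantee can be checked with a small sketch (the class name is made up): by the time the terminal operation returns, every element has been processed, regardless of which threads did the work.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.IntStream;

public class TerminalOpBlocks {
    public static void main(String[] args) {
        Set<Integer> seen = ConcurrentHashMap.newKeySet();
        IntStream.range(0, 1_000).parallel().forEach(seen::add);
        // This line runs only after the terminal operation completes,
        // so the count is always the full 1000.
        System.out.println("processed: " + seen.size());
    }
}
```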
Further reading on stream operations: Oracle official documentation.
I have defined the following instance variable:
private final AtomicInteger tradeCounter = new AtomicInteger(0);
I have a method called onTrade defined as below being called by 6 threads:
public void onTrade(Trade trade) {
    System.out.println(tradeCounter.incrementAndGet());
}
Why is the output:
2
5
4
3
1
6
Instead of
1
2
3
4
5
6
?
I want to avoid using synchronization.
You can think of
tradeCounter.incrementAndGet();
And
System.out.println();
as two separate statements.
So here
System.out.println(tradeCounter.incrementAndGet());
there are basically two statements, and those two statements together are not atomic.
Imagine such example with 2 threads :
Thread 1 invokes tradeCounter.incrementAndGet()
Thread 2 invokes tradeCounter.incrementAndGet()
Thread 2 prints value 2
Thread 1 prints value 1
It all depends on the order in which the threads invoke the instructions in your method.
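The interleaving described above can be forced deterministically with latches (an illustrative sketch, not from the question):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Thread 1 increments first (getting 1), then thread 2 increments and prints
// before thread 1 gets to print: the output is 2 then 1, even though the
// increments happened in order.
public class ForcedInterleaving {
    public static void main(String[] args) throws InterruptedException {
        AtomicInteger counter = new AtomicInteger(0);
        CountDownLatch t1Incremented = new CountDownLatch(1);
        CountDownLatch t2Printed = new CountDownLatch(1);

        Thread t1 = new Thread(() -> {
            int v = counter.incrementAndGet(); // gets 1
            t1Incremented.countDown();
            try { t2Printed.await(); } catch (InterruptedException ignored) { }
            System.out.println(v);             // prints 1, but last
        });
        Thread t2 = new Thread(() -> {
            try { t1Incremented.await(); } catch (InterruptedException ignored) { }
            int v = counter.incrementAndGet(); // gets 2
            System.out.println(v);             // prints 2 first
            t2Printed.countDown();
        });
        t1.start();
        t2.start();
        t1.join();
        t2.join();
    }
}
```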
I have a method called onTrade defined as below being called by 6 threads:
public void onTrade(Trade trade) {
    System.out.println(tradeCounter.incrementAndGet());
}
Why is the output:
2 5 4 3 1 6
Instead of 1 2 3 4 5 6 ?
Why shouldn't that be the output? Or why not 3 1 4 6 5 2? Or any of the other permutations of 1 2 3 4 5 6?
Using an AtomicInteger and its incrementAndGet() method ensures that each thread gets a different value, and that the six values obtained are sequential, without synchronization. But that has nothing to do with the order in which the resulting values are printed afterward.
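That guarantee can be checked with a small sketch (illustrative names): six threads calling incrementAndGet() always obtain the distinct values 1 through 6, even though the order in which they obtain (or would print) them varies.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class DistinctValues {
    public static void main(String[] args) throws InterruptedException {
        AtomicInteger counter = new AtomicInteger(0);
        Set<Integer> values = ConcurrentHashMap.newKeySet();
        Thread[] threads = new Thread[6];
        for (int i = 0; i < 6; i++) {
            threads[i] = new Thread(() -> values.add(counter.incrementAndGet()));
            threads[i].start();
        }
        for (Thread t : threads) t.join();
        // The set always contains exactly 1..6; no value is lost or duplicated.
        System.out.println(values);
    }
}
```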
If you want the results to be printed in the same order that they are obtained, then synchronization is the easiest way to go. In that case, using an AtomicInteger does not gain you anything over using a plain int (for this particular purpose):
int tradeCounter = 0;

public synchronized void onTrade(Trade trade) {
    System.out.println(++tradeCounter);
}
Alternatively, don't worry about the order in which the output is printed. Consider: is the output order actually meaningful for any reason?
incrementAndGet() increments in the expected order: 1, 2, 3, etc.
But System.out.println() is not invoked atomically together with incrementAndGet(), so the random print ordering is expected.
I want to avoid using synchronization.
You cannot in this case.
System.out.println(...) and tradeCounter.incrementAndGet() are two separate operations, and most likely, when thread i gets a new value, some other thread can get its own value and print it before thread i prints. There is no way to avoid synchronization (direct or indirect) here.