CompletableFuture takes more time - Java 8


I have two snippets of code that are functionally the same, but the second one takes 1 s longer than the first. The first one executes in 6 s and the second in 7 s.
Double yearlyEarnings = employmentService.getYearlyEarningForUserWithEmployer(userId, emp.getId());
CompletableFuture<Double> earlyEarningsInHomeCountryCF = currencyConvCF.thenApplyAsync(currencyConv -> {
return currencyConv * yearlyEarnings;
});
The snippet above takes 6 s; the one below takes 7 s.
CompletableFuture<Double> earlyEarningsInHomeCountryCF = currencyConvCF.thenApplyAsync(currencyConv -> {
Double yearlyEarnings = employmentService.getYearlyEarningForUserWithEmployer(userId, emp.getId());
return currencyConv * yearlyEarnings;
});
Please explain why the second snippet consistently takes 1 s longer than the first one.
Below is the signature of the method getYearlyEarningForUserWithEmployer. I am sharing it for completeness, but it should not have any effect:
Double getYearlyEarningForUserWithEmployer(long userId, long employerId);
Here is the link to code

Your question is horribly incomplete, but from what we can guess, it’s entirely plausible that the second variant takes longer, if we assume that currencyConvCF represents an asynchronous operation which might be running concurrently while your code fragments are executed and you’re talking about the overall time it takes to complete all operations, including the one represented by the CompletableFuture returned by thenApplyAsync (earlyEarningsInHomeCountryCF).
In the first variant you are invoking getYearlyEarningForUserWithEmployer while the operation represented by currencyConvCF might still be running concurrently. The multiplication will happen once both operations have completed.
In the second variant, the getYearlyEarningForUserWithEmployer invocation is part of the operation passed to currencyConvCF.thenApplyAsync, thus it will not start before the operation represented by currencyConvCF has been completed, so no operation will run concurrently. If we assume that getYearlyEarningForUserWithEmployer takes a significant time, say one second, and has no internal dependencies to the other operation, it’s not surprising when the overall operation takes longer in that variant.
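The difference between the two variants can be reproduced with a small sketch. Everything here is hypothetical: slowConversionRate and slowYearlyEarnings are stand-ins that simply sleep for 200 ms each, and the class name is made up for illustration. Under those assumptions, the overlapping variant should take roughly as long as one operation, while the sequential variant takes about as long as both added together.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.DoubleSupplier;

class ChainingTimingSketch {
    // Hypothetical stand-ins for the two slow operations; each one just sleeps.
    static double slowConversionRate() { sleep(200); return 2.0; }
    static double slowYearlyEarnings() { sleep(200); return 100.0; }

    static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    // Variant 1: the earnings lookup runs in the calling thread while the
    // conversion is (possibly) still running in the background.
    static double overlapping() {
        CompletableFuture<Double> rateCF =
                CompletableFuture.supplyAsync(() -> slowConversionRate());
        double earnings = slowYearlyEarnings();
        return rateCF.thenApplyAsync(rate -> rate * earnings).join();
    }

    // Variant 2: the earnings lookup is part of the dependent stage, so it
    // cannot start before the conversion has completed.
    static double sequential() {
        CompletableFuture<Double> rateCF =
                CompletableFuture.supplyAsync(() -> slowConversionRate());
        return rateCF.thenApplyAsync(rate -> rate * slowYearlyEarnings()).join();
    }

    // Measures the elapsed wall-clock time of one run in milliseconds.
    static long millisFor(DoubleSupplier task) {
        long start = System.nanoTime();
        task.getAsDouble();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        System.out.println("overlapping: " + millisFor(ChainingTimingSketch::overlapping) + " ms");
        System.out.println("sequential:  " + millisFor(ChainingTimingSketch::sequential) + " ms");
    }
}
```

Both methods return the same product; only the elapsed time differs, which matches the one-second gap described in the question if the lookup itself takes about a second.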
It seems, what you actually want to do is something like:
CompletableFuture<Double> earlyEarningsInHomeCountryCF = currencyConvCF.thenCombineAsync(
CompletableFuture.supplyAsync(
() -> employmentService.getYearlyEarningForUserWithEmployer(userId, emp.getId())),
(currencyConv, yearlyEarnings) -> currencyConv * yearlyEarnings);
so getYearlyEarningForUserWithEmployer is not executed sequentially in the initiating thread but both source operations can run asynchronously before the final multiplication applies.
However, when you are invoking get right afterwards in the initiating thread, like in your linked code on github, that asynchronous processing of the second operation has no benefit. Instead of waiting for the completion, your initiating thread can just perform the independent operation as the second code variant of your question already does and you will likely be even faster when not spawning an asynchronous operation for something as simple as a single multiplication, i.e. use instead:
CompletableFuture<Double> currencyConvCF = /* a true asynchronous operation */
return employmentService.getYearlyEarningForUserWithEmployer(userId, emp.getId())
* currencyConvCF.join();

What Holger said makes sense, but it does not apply to the problem I posted. I do agree that the question is not written in the best way.
The problem was that the order in which the futures were written caused a consistent increase in time.
Ideally the order of the futures should not matter as long as the code is written in a correct reactive fashion.
The cause of the problem was the default ForkJoinPool, which Java uses by default to run all CompletableFutures. If I run all the CompletableFutures with a custom pool, I get almost the same time, irrespective of the order in which the future statements were written.
I still need to find out what the limitations of ForkJoinPool are and why my custom pool of 20 threads performs better.
I'll update my answer when I find the right reason.
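For reference, here is a minimal sketch of what running the futures on a custom pool can look like. The blockingLookup method and the class name are hypothetical. One possible explanation (an assumption, not something confirmed above) is that ForkJoinPool.commonPool() is sized for CPU-bound work, with a default parallelism of the number of available processors minus one, so tasks that block can end up queuing behind each other.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ForkJoinPool;

class CustomPoolSketch {
    // Hypothetical blocking call, standing in for a remote service lookup.
    static int blockingLookup(int id) {
        try {
            Thread.sleep(100);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return id * 2;
    }

    // Runs `tasks` blocking lookups on a dedicated pool of `threads` threads
    // and returns the sum of their results.
    static int sumWithCustomPool(int tasks, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<CompletableFuture<Integer>> futures = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                final int id = i;
                // The second argument replaces the default common pool.
                futures.add(CompletableFuture.supplyAsync(() -> blockingLookup(id), pool));
            }
            int sum = 0;
            for (CompletableFuture<Integer> future : futures) {
                sum += future.join();
            }
            return sum;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("common pool parallelism: "
                + ForkJoinPool.commonPool().getParallelism());
        System.out.println("sum: " + sumWithCustomPool(20, 20));
    }
}
```

With 20 threads for 20 blocking tasks, all lookups can wait at the same time, so the order in which the futures are created matters much less.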

Related

Race condition with Java executor service

Issue: Iterating through files in a directory and scanning them for findings. ExecutorService used to create a thread-pool with fixed # of threads and invoking submit method like this:
final List<Future<List<ObjectWithResults>>> futures = Files.walk(baseDirObj)
.map(baseDirObj::relativize)
.filter(pathMatcher::matches)
.filter(filePathObj -> isScannableFile(baseDirObj, filePathObj))
.map(filePathObj -> executorService.submit(() -> scanFileMethod(baseDirObj, filePathObj, resultMetricsObj, countDownLatchObj)))
.collect(ImmutableList.toImmutableList());
the scanFile method calls 3 concurrent scans that return a list of results. These results are added using:
resultsListObj.addAll(scanMethod1)
resultsListObj.addAll(scanMethod2)
resultsListObj.addAll(scanMethod3)
followed by:
countDownLatch.countDown()
In the method that calls executorService.submit() when iteratively walking through files, I call:
boolean completed = countDownLatch.await(200, TimeUnit.MILLISECONDS);
if (completed)
executorService.shutdown();
Made the static members that are used in unsynchronized contexts volatile, so that they are read from main memory and not from a cache. Initially there were 5 to 10% failures (for example 22 out of 473); this brought them down to less than 1%.
Changed to thread-safe data-structures, like ConcurrentHashMaps, CopyOnWriteArrayLists, etc.
The elements added to these thread-safe lists, maps, etc. are bound to variables declared final which means they should be thread-safe ideally.
I introduced a count down latch mechanism to decrement the thread-count, wait for a bit before calling executor service's shutdown method.
I also added a if (! future.isDone()) check which returns true meaning some future tasks are taking longer, in these cases I used the overloaded flavor of future.get with timeout to wait longer, still I get 2-5 failures in 1000 iterations.
I want to know if declaring variables holding elements added to thread-safe data-structures as final or volatile is better. I read a lot about them, but still unclear.
Result:
For test iterations greater than 500, I always see 0.4 to 0.7% failures.
Note: If I synchronize the main scanFile() method, it works without a single failure, but negates the multi-threaded asynchronous performance benefit and takes 3 times longer.
What I tried?
Added countdown latch mechanism.
Declared variables holding values added to thread-safe lists, maps volatile or final
Expected 0 failures after using thread-safe data-structure objects like ConcurrentHashMaps, CopyOnWriteArrayList, but still get 1-3 failures every 1000 runs.

Do "chained" CompletableFuture instances stay in memory?

Let's say that in Java I have a method doSomethingAsync(input) that schedules some work via an executor service and returns a CompletableFuture<FooBar>. Let's say that I have a billion (or some other huge number of) distinct inputs. And let's say I chain the CompletableFuture<FooBar> instances together using thenCombine(), but I don't keep a reference to the previous CompletableFuture<FooBar> instance. Something like this:
CompletableFuture<FooBar> future = doSomethingAsync(0);
for(int i = 1; i < 1_000_000_000; i++) {
future = future.thenCombine(doSomethingAsync(i), (foo, bar) -> bar);
}
future.join();
The interesting thing is that I can then do future.join() to wait until they all finish. And I can set a bound (e.g. 100) on the queue of the executor service inside doSomethingAsync() so that submission blocks when there are too many unfinished tasks in play. That would provide some back-pressure so that I don't run out of memory in the executor service with all 1,000,000,000 tasks being submitted at the same time.
At the end of the process, the logic only has a reference to a single CompletableFuture<>, the final one representing the outcome of the final submission to doSomethingAsync(). But there were one billion of them chained together. Here's the big question: will all those 1,000,000,000 CompletableFuture<> instances stick around in memory until the last one is finished because they were chained using thenCombine(), or will the initial CompletableFuture<> instances be garbage collected after they are completed and the executor service submits the "next" task via thenCombine()?

There's nothing in the public documentation about whether a CompletableFuture can be garbage collected after completion, assuming no other "external" strong references to it exist. However, if you look at the source code then you'll find this comment (see note 1):
/*
[...]
* Without precautions, CompletableFutures would be prone to
* garbage accumulation as chains of Completions build up, each
* pointing back to its sources. So we null out fields as soon as
* possible. The screening checks needed anyway harmlessly ignore
* null arguments that may have been obtained during races with
* threads nulling out fields. We also try to unlink non-isLive
* (fired or cancelled) Completions from stacks that might
* otherwise never be popped: Method cleanStack always unlinks non
* isLive completions from the head of stack; others may
* occasionally remain if racing with other cancellations or
* removals.
[...]
*/
And if you look through the implementation, you'll see lines like this one:
src = null; dep = null; fn = null;
So, at least as currently implemented, it looks like a CompletableFuture becomes eligible for garbage collection after it completes, assuming you don't maintain a separate strong reference to it yourself, regardless of the subsequent chain.
1. That link is to the source code tagged jdk-20+8. But it looks like that comment (and associated improvements) was added as part of this commit from September 2014. Perhaps as part of JDK-8056249, which was "fixed" for version 9, but looks to have been backported to Java 8
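For experimenting with this, a scaled-down version of the loop from the question might look as follows. This is only a sketch: doSomethingAsync is a hypothetical stand-in that returns its input, the chain length is kept small, and running it does not by itself prove anything about garbage collection. It only shows the chaining pattern with a single live reference to the latest stage.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class ChainedFuturesSketch {
    // Hypothetical asynchronous task: simply hands back its input.
    static CompletableFuture<Integer> doSomethingAsync(int input, ExecutorService executor) {
        return CompletableFuture.supplyAsync(() -> input, executor);
    }

    // Chains `count` tasks with thenCombine, keeping a reference only to the
    // most recent stage, and returns the value of the last task.
    static int chain(int count) {
        ExecutorService executor = Executors.newFixedThreadPool(4);
        try {
            CompletableFuture<Integer> future = doSomethingAsync(0, executor);
            for (int i = 1; i < count; i++) {
                future = future.thenCombine(
                        doSomethingAsync(i, executor), (previous, current) -> current);
            }
            return future.join();
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(chain(1_000));
    }
}
```

To observe memory behaviour, one could raise the count and watch the heap with a profiler or with -verbose:gc; the result depends on the JDK version, as the note above suggests.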

Does the ordering of calls to sequential() and parallel() matter when processing a Java 8 stream pipeline?

Does the placement of calls to sequential() and parallel() change how a Java 8 stream's pipeline is executed?
For example, suppose I have this code:
new ArrayList().stream().parallel().filter(...).count();
In this example, it's pretty clear that filter() will run in parallel. However, what if I have this code:
new ArrayList().stream().filter(...).parallel().count();
Does filter() still run in parallel or does it run sequentially? The reason it's not clear is because intermediate operations like filter() are lazy, i.e., they won't run until a terminal operation is invoked like count(). As such, by the time count() is invoked, we have a parallel stream pipeline but is filter() performed sequentially because it came before the call to parallel()?

Note the end of the Stream’s class documentation:
Stream pipelines may execute either sequentially or in parallel. This execution mode is a property of the stream. Streams are created with an initial choice of sequential or parallel execution. (For example, Collection.stream() creates a sequential stream, and Collection.parallelStream() creates a parallel one.) This choice of execution mode may be modified by the BaseStream.sequential() or BaseStream.parallel() methods, and may be queried with the BaseStream.isParallel() method.
In other words, calling sequential() or parallel() only changes a property of the stream and its state at the point when the terminal operation is commenced determines the execution mode of the entire pipeline.
This might not be documented that clearly everywhere, because it wasn’t always so. In early development there were prototypes with a different mode for each stage. This mail from March 2013 explains the change.
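A small sketch can make this visible by recording which threads execute the filter stage. The class and method names are hypothetical. The filter is declared on a stream that is parallel at that point, and the mode is optionally switched to sequential afterwards; if the last call wins for the whole pipeline, the filter should then run only in the calling thread.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.IntStream;

class StreamModeSketch {
    // Returns the names of the threads that executed the filter stage.
    static Set<String> filterThreads(boolean endSequential) {
        Set<String> threads = ConcurrentHashMap.newKeySet();
        IntStream stream = IntStream.range(0, 10_000)
                .parallel()
                .filter(i -> {
                    threads.add(Thread.currentThread().getName());
                    return i % 2 == 0;
                });
        if (endSequential) {
            // Switch the mode after filter() has already been declared.
            stream = stream.sequential();
        }
        stream.count(); // the terminal operation triggers the whole pipeline
        return threads;
    }

    // parallel() placed after filter() still marks the pipeline as parallel.
    static boolean parallelAfterFilter() {
        return IntStream.range(0, 10).filter(i -> true).parallel().isParallel();
    }

    public static void main(String[] args) {
        System.out.println("ending sequential: " + filterThreads(true));
        System.out.println("staying parallel:  " + filterThreads(false));
        System.out.println("parallel after filter: " + parallelAfterFilter());
    }
}
```

When the pipeline ends in sequential mode, the set contains just the calling thread, even though filter() was added while the stream was parallel. In the parallel case the set may contain one or several worker names, depending on the machine.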

It appears that at least in the standard Oracle Java 8 implementation, although the parallel() method is defined as an "intermediate operation", it is not exactly lazy. That is, it has an immediate effect, regardless of whether you have a terminal operation or not. Consider the following example:
import java.util.stream.Stream;

public class SimpleTest {
    public static void main(String[] args) {
        Stream<Integer> s = Stream.of(1,2,3,4,5,6,7,8,9,10);
        System.out.println(s.isParallel());
        Stream<Integer> s1 = s.parallel();
        System.out.println(s.isParallel());
        System.out.println(s == s1);
    }
}
The output on my machine is:
false
true
true
Which tells us that parallel() immediately changes the state of the underlying stream (and returns that stream).
However, the Javadoc is written in such a way that it allows this, but does not require this. Which means that other stream implementations are free to execute the operations before the parallel() operations in a different execution mode than those after it.
In short, it's not a behavior you can rely on, either way.

How to prevent heap space error when using large parallel Java 8 stream

How do I effectively parallelize my computation of pi (just as an example)?
This works (and takes about 15secs on my machine):
Stream.iterate(1d, d->-(d+2*(Math.abs(d)/d))).limit(999999999L).mapToDouble(d->4.0d/d).sum()
But all of the following parallel variants run into an OutOfMemoryError
DoubleStream.iterate(1d, d->-(d+2*(Math.abs(d)/d))).parallel().limit(999999999L).map(d->4.0d/d).sum();
DoubleStream.iterate(1d, d->-(d+2*(Math.abs(d)/d))).limit(999999999L).parallel().map(d->4.0d/d).sum();
DoubleStream.iterate(1d, d->-(d+2*(Math.abs(d)/d))).limit(999999999L).map(d->4.0d/d).parallel().sum();
So, what do I need to do to get parallel processing of this (large) stream?
I already checked if autoboxing is causing the memory consumption, but it is not. This works also:
DoubleStream.iterate(1, d->-(d+Math.abs(2*d)/d)).boxed().limit(999999999L).mapToDouble(d->4/d).sum()

The problem is that you are using constructs which are hard to parallelize.
First, Stream.iterate(…) creates a sequence of numbers where each calculation depends on the previous value, hence, it offers no room for parallel computation. Even worse, it creates an infinite stream which will be handled by the implementation like a stream with unknown size. For splitting the stream, the values have to be collected into arrays before they can be handed over to other computation threads.
Second, providing a limit(…) doesn’t improve the situation, it makes the situation even worse. Applying a limit removes the size information which the implementation just had gathered for the array fragments. The reason is that the stream is ordered, thus a thread processing an array fragment doesn’t know whether it can process all elements as that depends on the information how many previous elements other threads are processing. This is documented:
“… it can be quite expensive on ordered parallel pipelines, especially for large values of maxSize, since limit(n) is constrained to return not just any n elements, but the first n elements in the encounter order.”
That’s a pity as we perfectly know that the combination of an infinite sequence returned by iterate with a limit(…) actually has an exactly known size. But the implementation doesn’t know. And the API doesn’t provide a way to create an efficient combination of the two. But we can do it ourselves:
static DoubleStream iterate(double seed, DoubleUnaryOperator f, long limit) {
    return StreamSupport.doubleStream(new Spliterators.AbstractDoubleSpliterator(limit,
            Spliterator.ORDERED|Spliterator.SIZED|Spliterator.IMMUTABLE|Spliterator.NONNULL) {
        long remaining=limit;
        double value=seed;
        public boolean tryAdvance(DoubleConsumer action) {
            if(remaining==0) return false;
            double d=value;
            if(--remaining>0) value=f.applyAsDouble(d);
            action.accept(d);
            return true;
        }
    }, false);
}
Once we have such an iterate-with-limit method we can use it like
iterate(1d, d -> -(d+2*(Math.abs(d)/d)), 999999999L).parallel().map(d->4.0d/d).sum()
This still doesn’t benefit much from parallel execution due to the sequential nature of the source, but it works. On my four-core machine it managed to get roughly a 20% gain.
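If the goal is only the sum and not the iterate construct itself, a different technique avoids the sequential source altogether: compute each term directly from its index. This is a swapped-in approach, not the method above. The sketch assumes the series is the Leibniz series that the question's lambda generates (4/1 - 4/3 + 4/5 - ...), and the class name is hypothetical. LongStream.range is a sized source that splits evenly, so no buffering into arrays is needed.

```java
import java.util.stream.LongStream;

class LeibnizPiSketch {
    // Sums the first `terms` terms of 4/1 - 4/3 + 4/5 - ... in parallel.
    // Term k is computed from its index alone, so no term depends on the previous one.
    static double pi(long terms) {
        return LongStream.range(0, terms)
                .parallel()
                .mapToDouble(k -> (k % 2 == 0 ? 4.0 : -4.0) / (2 * k + 1))
                .sum();
    }

    public static void main(String[] args) {
        // The question uses 999999999L terms; a smaller count keeps this demo quick.
        System.out.println(pi(10_000_000L));
    }
}
```

Because the order of floating-point additions differs between parallel runs, the last digits of the result may vary slightly from run to run.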

This is because the default ForkJoinPool implementation used by the parallel() method does not limit the number of threads that get created. The solution is to provide a custom implementation of a ForkJoinPool that is limited to the number of threads that it executes in parallel. This can be achieved as mentioned below:
ForkJoinPool forkJoinPool = new ForkJoinPool(Runtime.getRuntime().availableProcessors());
forkJoinPool.submit(() -> DoubleStream.iterate(1d, d->-(d+2*(Math.abs(d)/d))).parallel().limit(999999999L).map(d->4.0d/d).sum());

Interpretation of "program order rule" in Java concurrency

Program order rule states "Each action in a thread happens-before every action in that thread that comes later in the program order"
1. I read in another thread that an action is
reads and writes to variables
locks and unlocks of monitors
starting and joining with threads
Does this mean that reads and writes can be changed in order, but reads and writes cannot change order with actions specified in 2nd or 3rd lines?
2. What does "program order" mean?
An explanation with examples would be really helpful.
Additional related question
Suppose I have the following code:
long tick = System.nanoTime(); //Line1: Note the time
//Block1: some code whose time I wish to measure goes here
long tock = System.nanoTime(); //Line2: Note the time
Firstly, it's a single threaded application to keep things simple. Compiler notices that it needs to check the time twice and also notices a block of code that has no dependency with surrounding time-noting lines, so it sees a potential to reorganize the code, which could result in Block1 not being surrounded by the timing calls during actual execution (for instance, consider this order Line1->Line2->Block1). But, I as a programmer can see the dependency between Line1,2 and Block1. Line1 should immediately precede Block1, Block1 takes a finite amount of time to complete, and immediately succeeded by Line2.
So my question is: Am I measuring the block correctly?
If yes, what is preventing the compiler from rearranging the order?
If no (which I think is correct after going through Enno's answer), what can I do to prevent it?
P.S.: I stole this code from another question I asked in SO recently.

It probably helps to explain why such a rule exists in the first place.
Java is a procedural language. I.e. you tell Java how to do something for you. If Java executes your instructions not in the order you wrote, it would obviously not work. E.g. in the below example, if Java would do 2 -> 1 -> 3 then the stew would be ruined.
1. Take lid off
2. Pour salt in
3. Cook for 3 hours
So, why does the rule not simply say "Java executes what you wrote in the order you wrote"? In a nutshell, because Java is clever. Take the following example:
1. Take eggs out of the freezer
2. Take lid off
3. Take milk out of the freezer
4. Pour egg and milk in
5. Cook for 3 hours
If Java was like me, it'll just execute it in order. However Java is clever enough to understand that it's more efficient AND that the end result would be the same should it do 1 -> 3 -> 2 -> 4 -> 5 (you don't have to walk to the freezer again, and that doesn't change the recipe).
So what the rule "Each action in a thread happens-before every action in that thread that comes later in the program order" is trying to say is: "In a single thread, your program will run as if it was executed in the exact order you wrote it. We might change the ordering behind the scenes, but we make sure that none of that changes the output."
So far so good. Why does it not do the same across multiple threads? In multi-thread programming, Java isn't clever enough to do it automatically. It will for some operations (e.g. joining threads, starting threads, when a lock (monitor) is used etc.) but for other stuff you need to explicitly tell it to not do reordering that would change the program output (e.g. volatile marker on fields, use of locks etc.).
Note:
Quick addendum about the "happens-before relationship". This is a fancy way of saying that no matter what reordering Java might do, stuff A will happen before stuff B. In our weird later stew example, steps 1 and 3 happen-before step 4 ("Pour egg and milk in"). By contrast, steps 1 and 3 do not need a happens-before relationship with each other, because they don't depend on each other in any way.
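As an illustration of explicitly telling Java not to reorder across threads, here is a minimal sketch with a volatile flag. The class and field names are hypothetical. The write to data comes before the volatile write to ready in program order, and the reader's volatile read of ready comes before its read of data, so once the reader sees ready == true it is also guaranteed to see data == 42.

```java
class VolatilePublicationSketch {
    static int data;               // plain, non-volatile field
    static volatile boolean ready; // the volatile write/read pair links the two threads

    static int publishAndRead() {
        data = 0;
        ready = false;
        int[] seen = new int[1];
        Thread reader = new Thread(() -> {
            while (!ready) {
                Thread.yield();    // spin until the volatile write becomes visible
            }
            seen[0] = data;        // ordered after the volatile read of ready
        });
        reader.start();
        data = 42;                 // program order: before the volatile write below
        ready = true;
        try {
            reader.join();         // join() makes the reader's write to seen[0] visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println(publishAndRead());
    }
}
```

Without volatile on ready, the reader would have no such guarantee: it could spin forever or, in principle, read a stale value of data.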
On the additional question & response to the comment
First, let us establish what "time" means in the programming world. In programming, we have the notion of "absolute time" (what's the time in the world now?) and the notion of "relative time" (how much time has passed since x?). In an ideal world, time is time, but unless we have an atomic clock built in, the absolute time has to be corrected from time to time. On the other hand, for relative time we don't want corrections, as we are only interested in the differences between events.
In Java, System.currentTimeMillis() deals with absolute time and System.nanoTime() deals with relative time. This is why the Javadoc of nanoTime states, "This method can only be used to measure elapsed time and is not related to any other notion of system or wall-clock time".
In practice, both currentTimeMillis and nanoTime are native calls and thus the compiler can't practically prove if a reordering won't affect the correctness, which means it will not reorder the execution.
But let us imagine we want to write a compiler implementation that actually looks into native code and reorders everything as long as it's legal. When we look at the JLS, all that it tells us is that "You can reorder anything as long as it cannot be detected". Now as the compiler writer, we have to decide if the reordering would violate the semantics. For relative time (nanoTime), it would clearly be useless (i.e. violates the semantics) if we'd reorder the execution. Now, would it violate the semantics if we'd reorder for absolute time (currentTimeMillis)? As long as we can limit the difference from the source of the world's time (let's say the system clock) to whatever we decide (like "50ms")*, I say no. For the below example:
long tick = System.currentTimeMillis();
result = compute();
long tock = System.currentTimeMillis();
print(result + ":" + (tock - tick));
If the compiler can prove that compute() takes less than whatever maximum divergence from the system clock we can permit, then it would be legal to reorder this as follows:
long tick = System.currentTimeMillis();
long tock = System.currentTimeMillis();
result = compute();
print(result + ":" + (tock - tick));
Since doing that won't violate the spec we defined, and thus won't violate the semantics.
You also asked why this is not included in the JLS. I think the answer would be "to keep the JLS short". But I don't know much about this realm so you might want to ask a separate question for that.
*: In actual implementations, this difference is platform dependent.

The program order rule guarantees that, within individual threads, reordering optimizations introduced by the compiler cannot produce different results from what would have happened if the program had been executed in serial fashion. It makes no guarantees about what order the thread's actions may appear to occur in to any other threads if its state is observed by those threads without synchronization.
Note that this rule speaks only to the ultimate results of the program, and not to the order of individual executions within that program. For instance, if we have a method which makes the following changes to some local variables:
x = 1;
z = z + 1;
y = 1;
The compiler remains free to reorder these operations however it sees best fit to improve performance. One way to think of this is: if you could reorder these ops in your source code and still obtain the same results, the compiler is free to do the same. (And in fact, it can go even further and completely discard operations which are shown to have no results, such as invocations of empty methods.)
With your second bullet point the monitor lock rule comes into play: "An unlock on a monitor lock happens-before every subsequent lock on that same monitor lock." (Java Concurrency in Practice p. 341) This means that a thread acquiring a given lock will have a consistent view of the actions which occurred in other threads before they released that lock. However, note that this guarantee only applies when the two threads release and acquire the same lock. If Thread A does a bunch of stuff before releasing Lock X, and then Thread B acquires Lock Y, Thread B is not assured to have a consistent view of A's pre-X actions.
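A minimal sketch of the monitor lock rule, with hypothetical names: several threads increment a counter, and every access goes through the same lock object. Each unlock happens-before the next lock of that same monitor, so no increment is lost and the final read sees all of them.

```java
class MonitorLockSketch {
    private final Object lock = new Object();
    private int counter; // guarded by lock

    void increment() {
        synchronized (lock) {
            counter++;
        }
    }

    int current() {
        synchronized (lock) {
            return counter;
        }
    }

    // Starts `threads` workers that each increment the shared counter
    // `incrementsPerThread` times, then returns the final count.
    static int run(int threads, int incrementsPerThread) {
        MonitorLockSketch shared = new MonitorLockSketch();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < incrementsPerThread; i++) {
                    shared.increment();
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return shared.current();
    }

    public static void main(String[] args) {
        System.out.println(run(4, 10_000));
    }
}
```

If increment() and current() synchronized on different objects, the guarantee described above would no longer apply and updates could be lost.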
It is possible for reads and writes to variables to be reordered with start and join if a.) doing so doesn't break within-thread program order, and b.) the variables have not had other "happens-before" thread synchronization semantics applied to them, say by storing them in volatile fields.
A simple example:
class ThreadStarter {
Object a = null;
Object b = null;
Thread thread;
ThreadStarter(Thread threadToStart) {
this.thread = threadToStart;
}
public void aMethod() {
a = new BeforeStartObject();
b = new BeforeStartObject();
thread.start();
a = new AfterStartObject();
b = new AfterStartObject();
a.doSomeStuff();
b.doSomeStuff();
}
}
Since the fields a and b and the method aMethod() are not synchronized in any way, and the action of starting thread does not change the results of the writes to the fields (or the doing of stuff with those fields), the compiler is free to reorder thread.start() to anywhere in the method. The only thing it could not do with the order of aMethod() would be to move the order of writing one of the BeforeStartObjects to a field after writing an AfterStartObject to that field, or to move one of the doSomeStuff() invocations on a field before the AfterStartObject is written to it. (That is, assuming that such reordering would change the results of the doSomeStuff() invocation in some way.)
The critical thing to bear in mind here is that, in the absence of synchronization, the thread started in aMethod() could theoretically observe either or both of the fields a and b in any of the states which they take on during the execution of aMethod() (including null).
Additional question answer
The assignments to tick and tock cannot be reordered with respect to the code in Block1 if they are to be actually used in any measurements, for example by calculating the difference between them and printing the result as output. Such reordering would clearly break Java's within-thread as-if-serial semantics. It changes the results from what would have been obtained by executing instructions in the specified program order. If the assignments aren't used for any measurements and have no side-effects of any kind on the program result, they'll likely be optimized away as no-ops by the compiler rather than being reordered.
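A sketch of a measurement in which both timestamps and the block's result are actually used, so that, following the reasoning above, the calls cannot be moved across the block without changing the observable output. sumTo is a hypothetical workload standing in for Block1, and the class name is made up for illustration.

```java
class ElapsedTimeSketch {
    // Hypothetical workload standing in for "Block1".
    static long sumTo(long n) {
        long sum = 0;
        for (long i = 1; i <= n; i++) {
            sum += i;
        }
        return sum;
    }

    // Returns { result, elapsedNanos }; both values are handed to the caller,
    // so neither the workload nor the two time readings are dead code.
    static long[] timedSum(long n) {
        long tick = System.nanoTime();
        long result = sumTo(n);
        long tock = System.nanoTime();
        return new long[] { result, tock - tick };
    }

    public static void main(String[] args) {
        long[] measured = timedSum(100_000_000L);
        System.out.println("result=" + measured[0] + ", elapsed ns=" + measured[1]);
    }
}
```

Note that a single measurement like this still includes JIT warm-up and may reflect an optimized version of the loop, so for serious benchmarking a harness such as JMH is the usual choice.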

Before I answer the question,
reads and writes to variables
Should be
volatile reads and volatile writes (of the same field)
Program order doesn't guarantee this happens before relationship, rather the happens-before relationship guarantees program order
To your questions:
Does this mean that reads and writes can be changed in order, but reads and writes cannot change order with actions specified in 2nd or 3rd lines?
The answer actually depends on what action happens first and what action happens second. Take a look at the JSR 133 Cookbook for Compiler Writers. There is a Can Reorder grid that lists the allowed compiler reordering that can occur.
For instance, a Volatile Store can be reordered above or below a Normal Store, but a Volatile Store cannot be reordered above or below a Volatile Load. This is all assuming intra-thread semantics still hold.
What does "program order" mean?
This is from the JLS
Among all the inter-thread actions performed by each thread t, the
program order of t is a total order that reflects the order in which
these actions would be performed according to the intra-thread
semantics of t.
In other words, if you can change the writes and loads of a variable in such a way that the code will perform exactly the same way as you wrote it, then it maintains program order.
For instance
public static Object getInstance(){
if(instance == null){
instance = new Object();
}
return instance;
}
Can be reordered to
public static Object getInstance(){
Object temp = instance;
if(instance == null){
temp = instance = new Object();
}
return temp;
}

It simply means that although the threads may be multiplexed, the internal order of each thread's actions (operations, instructions) remains constant relative to one another:
thread1: T1op1, T1op2, T1op3...
thread2: T2op1, T2op2, T2op3...
Although the order of operations (TnopM) across threads may vary, the operations T1op1, T1op2, T1op3 within a thread will always be in this order, and likewise T2op1, T2op2, T2op3.
For example:
T2op1, T1op1, T1op2, T2op2, T2op3, T1op3

Java tutorial http://docs.oracle.com/javase/tutorial/essential/concurrency/memconsist.html says that happens-before relationship is simply a guarantee that memory writes by one specific statement are visible to another specific statement. Here is an illustration
int x;
synchronized void x() {
x += 1;
}
synchronized void y() {
System.out.println(x);
}
synchronized creates a happens-before relationship. If we remove it, there is no guarantee that after thread A increments x, thread B will print 1; it may print 0.
