Task Executor vs Java 8 parallel streaming - java

I can't find a specific answer to the line of investigation that we've been asked to pursue.
I've read that parallel streams may not perform well with a small number of threads, and that they apparently behave poorly when the DB blocks the next request while processing the current one.
However, the overhead of implementing a Task Executor compared with parallel streams seems huge. We've implemented a POC that takes care of our concurrency needs with just this one line of code:
List<Map<String, String>> listWithAllMaps = mappedValues.entrySet().parallelStream().map(e -> callPlugins(e))
.collect(Collectors.toList());
Whereas with Task Executor, we'd need to implement the Runnable interface and write some cumbersome code just to get the runnables to return the values we're reading from the DB instead of void, costing us several hours, if not days, of coding and producing less maintainable, more bug-prone code.
However, our CTO is still reluctant to use parallel streams because of unforeseen issues that could come up down the road.
So the question is: in an environment where I need to make several concurrent read-only queries to a database, using different Java components/REST calls for each query, is it preferable in any way to use Task Executor instead of parallel streams, and if so, why?

Use the TaskExecutor as an Executor for a CompletableFuture.
List<CompletableFuture<Map<String, String>>> futures = mappedValues.entrySet().stream().map(e -> CompletableFuture.supplyAsync(() -> callPlugins(e), taskExecutor)).collect(Collectors.toList());
List<Map<String, String>> listWithAllMaps = futures.stream().map(CompletableFuture::join).collect(Collectors.toList());
Not sure how this is cumbersome. Yes, it is a bit more code, but with the advantage that you can easily configure the TaskExecutor: increase the number of threads, the queue size, and so on.
DISCLAIMER: Typed it from the top of my head, so some minor things might be off with the code snippet.
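For illustration, here is a self-contained sketch of the same idea. It makes a few assumptions that are not in the original: a plain fixed-size ExecutorService stands in for the Spring TaskExecutor, callPlugins is a hypothetical stub, and the class name PluginCalls is made up. The point is only that the pool size becomes an explicit, tunable parameter.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

final class PluginCalls {
    // Hypothetical stand-in for the question's callPlugins(e); the real one would query the DB.
    static Map<String, String> callPlugins(Map.Entry<String, String> e) {
        return Collections.singletonMap(e.getKey(), e.getValue().toUpperCase());
    }

    // Runs one callPlugins call per entry on a pool whose size the caller chooses.
    static List<Map<String, String>> callAll(Map<String, String> mappedValues, int threads) {
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        try {
            // Start every call first, so they run concurrently...
            List<CompletableFuture<Map<String, String>>> futures = mappedValues.entrySet().stream()
                    .map(e -> CompletableFuture.supplyAsync(() -> callPlugins(e), executor))
                    .collect(Collectors.toList());
            // ...then wait for each result, in the original entry order.
            return futures.stream().map(CompletableFuture::join).collect(Collectors.toList());
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        Map<String, String> input = new TreeMap<>();
        input.put("a", "x");
        input.put("b", "y");
        System.out.println(callAll(input, 4));
    }
}
```

Collecting the futures into a list before joining matters: joining inside the same stream pipeline that creates them would, because streams are lazy, run the calls one after another.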

Related

executorService.invokeAll(...) takes longer time to finish even when timeout is set

I've encountered a problem I'm not sure how to solve.
I'm trying to parallelize a part of the code that we've done sequentially up until now. To do so I've divided the task into several smaller orthogonal tasks.
I've created an executorService and I'm running:
executorService.invokeAll(callableList, timeBudget, TimeUnit.NANOSECONDS);
Each callable has several IO tasks within it (like going to a database and external services), and the overall time budget is roughly 200ms. The reason for using invokeAll is that I have one overall timeBudget for the whole request, so I need a way to limit all the futures with a single budget.
In order to test myself I've added different metrics that report back to some logging visualisation tool that we have. I've noticed that:
The median (and 75th percentile) latency of that part of the code has gotten faster.
The 95th+ percentiles have actually gotten worse.
After thorough investigation (Where I've benchmarked different parts of the code) I've noticed that invokeAll 99th percentile running time was actually 500ms and even more sometimes. This thing really screws up the optimization. Any ideas on what may cause this? Any other suggestions? Are there alternatives to invokeAll?
I don't have an answer to why invokeAll with a timeout sometimes takes much longer than the given budget, but I do have an answer to the question: how do I run a list of futures simultaneously with a given budget?
ListeningExecutorService executorService = <init>;
List<ListenableFuture<G>> futures = new ArrayList<>();
for (T chunk : chunks) {
futures.add(executorService.submit(() -> function(chunk, param1, param2, ...)));
}
Futures.allAsList(futures).get(budgetInNanos, TimeUnit.NANOSECONDS);
The code above uses the Guava library.
The problem with this approach is that I'm not getting the status of each future, because I'm getting a timeout exception if the time is up - but at least in terms of budgeting the behaviour is as expected.
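If a per-task status is needed, one option that stays within java.util.concurrent is to inspect the list that invokeAll itself returns: per its Javadoc, every future in that list is done when the call returns, and tasks that did not complete in time are cancelled. The sketch below is only an illustration under assumptions: the class name BudgetedCalls, the status strings, and the one-thread-per-task pool sizing are all made up for the example, and it does not explain the latency overrun described in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

final class BudgetedCalls {
    // Runs all tasks under one shared budget and reports "ok:<value>", "timeout" or "failed" per task.
    static List<String> runWithBudget(List<Callable<String>> tasks, long budgetMillis) {
        // Assumption for the sketch: one thread per task, so no task waits in the queue.
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, tasks.size()));
        List<String> statuses = new ArrayList<>();
        try {
            List<Future<String>> futures = pool.invokeAll(tasks, budgetMillis, TimeUnit.MILLISECONDS);
            for (Future<String> f : futures) {
                if (f.isCancelled()) {
                    // invokeAll cancelled this task because the budget ran out.
                    statuses.add("timeout");
                    continue;
                }
                try {
                    statuses.add("ok:" + f.get()); // already done, so get() does not block
                } catch (ExecutionException e) {
                    statuses.add("failed"); // the task itself threw
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
        return statuses;
    }

    public static void main(String[] args) {
        Callable<String> fast = () -> "a";
        Callable<String> slow = () -> { Thread.sleep(3000); return "b"; };
        System.out.println(runWithBudget(Arrays.asList(fast, slow), 300));
    }
}
```

Note that cancellation only interrupts the worker thread; a task that ignores interruption keeps occupying its thread after the budget has expired.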

Parallel streams vs CompletableFuture in Java 8

What is the difference between:
List<String> parra = list.parallelStream()
.map(heavyPrcessingFunction)
.collect(Collectors.toList());
and this (apart from the second being a bit complex):
List<CompletableFuture<Void>> com = list.stream()
.map(x-> CompletableFuture.runAsync(() -> heavyPrcessingFunction.apply(x)))
.collect(Collectors.toList());
CompletableFuture.allOf(com.toArray(new CompletableFuture[0])).join();
// get all of strings from com now
Semantically they are quite similar, it mostly is a matter of overhead.
For the 2nd approach you have to create a CF for each entry in the list and submit them individually to the common FJP.
Parallel streams, on the other hand, can be implemented by chunking the input list into a few large slices, submitting only those slices as tasks to the common pool and then having a thread essentially loop over the slice instead of having to pick up and unwrap future after future from its work queue.
Additionally the stream implementation means that not just the map operation but also the collect step is aware of parallel execution and can thus optimize it.
Fewer allocations, fewer expensive operations on concurrent data structures, simpler code.
The way that it was implemented above, there's no difference. The advantage of using CompletableFuture API is that you can pass a custom Executor if you want to have more control over the threads and/or implement some async semantic.
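One detail worth noting about the question's second snippet: runAsync yields CompletableFuture<Void>, so the mapped strings are discarded. A sketch of a variant that keeps the results, using supplyAsync, is shown next to the parallel-stream form. The class and method names are made up for the example, and the mapping function is passed in as a parameter rather than being the question's heavyPrcessingFunction.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;
import java.util.stream.Collectors;

final class StreamVsFutures {
    // Parallel stream: the stream framework splits the list and merges the results.
    static <T, R> List<R> withParallelStream(List<T> list, Function<T, R> fn) {
        return list.parallelStream().map(fn).collect(Collectors.toList());
    }

    // One CompletableFuture per element; supplyAsync (unlike runAsync) carries the result.
    static <T, R> List<R> withFutures(List<T> list, Function<T, R> fn) {
        List<CompletableFuture<R>> futures = list.stream()
                .map(x -> CompletableFuture.supplyAsync(() -> fn.apply(x)))
                .collect(Collectors.toList());
        return futures.stream().map(CompletableFuture::join).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("a", "b", "c");
        Function<String, String> upper = s -> s.toUpperCase();
        System.out.println(withParallelStream(input, upper));
        System.out.println(withFutures(input, upper));
    }
}
```

Both methods return the results in the order of the input list. Passing an Executor as the second argument to supplyAsync is the hook for the custom thread control mentioned above.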

Java parallelization using lambda functions

I have an array of some objects with the method process() that I want to run parallelized. And I wanted to try lambdas to achieve the parallelization. So I tried this:
Arrays.asList(myArrayOfItems).forEach(item->{
System.out.println("processing " + item.getId());
item.process();
});
Each process() call takes about 2 seconds. And I have noticed that there is still no speedup with the "parallelization" approach. It seems that everything is still running serialized. The ids are printed in series (ordered) and between every print there is a pause of 2 seconds.
Probably I have misunderstood something. What is needed to execute this in parallel using lambdas (hopefully in a very condensed way)?
Lambdas themselves don't execute anything in parallel. Streams are capable of doing this though.
Take a look at the method Collection#parallelStream (documentation):
Arrays.asList(myArrayOfItems).parallelStream().forEach(...);
However, note that there is no guarantee or control when it will actually go parallel. From its documentation:
Returns a possibly parallel Stream with this collection as its source. It is allowable for this method to return a sequential stream.
The reason is simple. You really need a lot of elements in your collection (like millions), or heavy work per element, for parallelization to actually pay off. The overhead introduced by parallelization is significant. Because of that, an implementation is allowed to return a sequential stream instead if that is expected to be faster.
Before you think about using parallelism, you should actually set up some benchmarks to test whether it improves anything. There are many examples where people just blindly used it without noticing that they had actually decreased the performance. Also see Should I always use a parallel stream when possible?.
You can check if a Stream is parallel by using Stream#isParallel (documentation).
If you use Stream#parallel (documentation) directly on a stream, you get a parallel version.
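A quick way to see these methods in action is the small sketch below. The class name ParallelCheck and the helper describe are made up for the example. No expected output is given for the parallelStream() line, since the documentation quoted above permits either answer.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

final class ParallelCheck {
    // Reports the mode of a stream without consuming it.
    static String describe(Stream<?> stream) {
        return stream.isParallel() ? "parallel" : "sequential";
    }

    public static void main(String[] args) {
        List<Integer> items = Arrays.asList(1, 2, 3);
        System.out.println(describe(items.stream()));            // sequential
        System.out.println(describe(items.stream().parallel())); // parallel
        System.out.println(describe(items.parallelStream()));    // "possibly parallel" per the Javadoc
    }
}
```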
The method Collection.forEach() is just an iteration through all the elements. It is called internal iteration because it leaves it up to the collection how to iterate, but it is still an iteration over all the elements.
If you want parallel processing, you have to:
Get a parallel stream from the collection.
Specify the operation(s) which will be done on the stream.
Do something with the result if you need to.
You may read first part of my explanation here: https://stackoverflow.com/a/22942829/2886891
To create a parallel stream, invoke the operation .parallelStream() on a Collection.
See https://docs.oracle.com/javase/tutorial/collections/streams/parallelism.html
Arrays.asList(myArrayOfItems).parallelStream().forEach(item->{
System.out.println("processing " + item.getId());
item.process();
});

Difference between Futures(Guava)/CompletableFuture and Observable(RxJava) [duplicate]

I would like to know the difference between
CompletableFuture,Future and Observable RxJava.
What I know is all are asynchronous but
Future.get() blocks the thread
CompletableFuture gives the callback methods
RxJava Observable --- similar to CompletableFuture with other benefits(not sure)
For example: if client needs to make multiple service calls and when we use Futures (Java) Future.get() will be executed sequentially...would like to know how its better in RxJava..
And the documentation http://reactivex.io/intro.html says
It is difficult to use Futures to optimally compose conditional asynchronous execution flows (or impossible, since latencies of each request vary at runtime). This can be done, of course, but it quickly becomes complicated (and thus error-prone) or it prematurely blocks on Future.get(), which eliminates the benefit of asynchronous execution.
Really interested to know how RxJava solves this problem. I found it difficult to understand from the documentation.
Futures
Futures were introduced in Java 5 (2004). They're basically placeholders for a result of an operation that hasn't finished yet. Once the operation finishes, the Future will contain that result. For example, an operation can be a Runnable or Callable instance that is submitted to an ExecutorService. The submitter of the operation can use the Future object to check whether the operation isDone(), or wait for it to finish using the blocking get() method.
Example:
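import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FutureExample {
    /**
     * A task that sleeps for a second, then returns 1
     **/
    public static class MyCallable implements Callable<Integer> {
        @Override
        public Integer call() throws Exception {
            Thread.sleep(1000);
            return 1;
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService exec = Executors.newSingleThreadExecutor();
        Future<Integer> f = exec.submit(new MyCallable());
        System.out.println(f.isDone()); // false: the task is still sleeping
        System.out.println(f.get()); // Waits until the task is done, then prints 1
        exec.shutdown(); // Lets the JVM exit once the worker thread is idle
    }
}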
CompletableFutures
CompletableFutures were introduced in Java 8 (2014). They are in fact an evolution of regular Futures, inspired by Google's Listenable Futures, part of the Guava library. They are Futures that also allow you to string tasks together in a chain. You can use them to tell some worker thread to "go do some task X, and when you're done, go do this other thing using the result of X". Using CompletableFutures, you can do something with the result of the operation without actually blocking a thread to wait for the result. Here's a simple example:
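import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;
import java.util.function.Supplier;

public class CompletableFutureExample {
    /**
     * A supplier that sleeps for a second, and then returns one
     **/
    public static class MySupplier implements Supplier<Integer> {
        @Override
        public Integer get() {
            try {
                Thread.sleep(1000);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and fall through to return the value
                Thread.currentThread().interrupt();
            }
            return 1;
        }
    }

    /**
     * A (pure) function that adds one to a given Integer
     **/
    public static class PlusOne implements Function<Integer, Integer> {
        @Override
        public Integer apply(Integer x) {
            return x + 1;
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService exec = Executors.newSingleThreadExecutor();
        CompletableFuture<Integer> f = CompletableFuture.supplyAsync(new MySupplier(), exec);
        System.out.println(f.isDone()); // false: the supplier is still sleeping
        CompletableFuture<Integer> f2 = f.thenApply(new PlusOne());
        System.out.println(f2.get()); // Waits until the "calculation" is done, then prints 2
        exec.shutdown(); // Lets the JVM exit once the worker thread is idle
    }
}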
RxJava
RxJava is a whole library for reactive programming created at Netflix. At a glance, it appears similar to Java 8's streams. It is, except it's much more powerful.
Similarly to Futures, RxJava can be used to string together a bunch of synchronous or asynchronous actions to create a processing pipeline. Unlike Futures, which are single-use, RxJava works on streams of zero or more items, including never-ending streams with an infinite number of items. It's also much more flexible and powerful thanks to an unbelievably rich set of operators.
Unlike Java 8's streams, RxJava also has a backpressure mechanism, which allows it to handle cases in which different parts of your processing pipeline operate in different threads, at different rates.
The downside of RxJava is that despite the solid documentation, it is a challenging library to learn due to the paradigm shift involved. Rx code can also be a nightmare to debug, especially if multiple threads are involved, and even worse - if backpressure is needed.
If you want to get into it, there's a whole page of various tutorials on the official website, plus the official documentation and Javadoc. You can also take a look at some of the videos such as this one which gives a brief intro into Rx and also talks about the differences between Rx and Futures.
Bonus: Java 9 Reactive Streams
Java 9's Reactive Streams, aka the Flow API, are a set of interfaces implemented by various reactive streams libraries such as RxJava 2, Akka Streams, and Vert.x. They allow these reactive libraries to interconnect, while preserving the all-important back-pressure.
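To give a feel for those interfaces, here is a minimal sketch that uses only the JDK (Java 9 or later): SubmissionPublisher is the JDK's own Flow.Publisher implementation, and the subscriber signals back-pressure by requesting one item at a time. The class names FlowDemo and OneAtATime, and the 5-second wait, are choices made for this example only.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Flow;
import java.util.concurrent.SubmissionPublisher;
import java.util.concurrent.TimeUnit;

final class FlowDemo {
    // Subscriber that never has more than one item outstanding.
    static final class OneAtATime implements Flow.Subscriber<Integer> {
        final List<Integer> received = new CopyOnWriteArrayList<>();
        final CountDownLatch done = new CountDownLatch(1);
        private Flow.Subscription subscription;

        @Override public void onSubscribe(Flow.Subscription s) {
            subscription = s;
            s.request(1); // ask for the first item only
        }
        @Override public void onNext(Integer item) {
            received.add(item);
            subscription.request(1); // ready for exactly one more
        }
        @Override public void onError(Throwable t) { done.countDown(); }
        @Override public void onComplete() { done.countDown(); }
    }

    // Publishes 1..count and returns what the subscriber saw.
    static List<Integer> publish(int count) {
        OneAtATime subscriber = new OneAtATime();
        try (SubmissionPublisher<Integer> publisher = new SubmissionPublisher<>()) {
            publisher.subscribe(subscriber);
            for (int i = 1; i <= count; i++) {
                publisher.submit(i); // blocks if the subscriber's buffer is full
            }
        } // close() signals onComplete once the buffered items are delivered
        try {
            subscriber.done.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return subscriber.received;
    }

    public static void main(String[] args) {
        System.out.println(publish(5));
    }
}
```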
I have been working with RxJava since 0.9, am now at 1.3.2, and will soon migrate to 2.x. I use it in a private project that I have been working on for 8 years.
I wouldn't program without this library at all anymore. In the beginning I was skeptical, but it requires a completely different state of mind. It is quite difficult in the beginning. I sometimes looked at the marble diagrams for hours.. lol
It is just a matter of practice and really getting to know the flow (aka the contract between observables and observers). Once you get there, you'll hate to do it any other way.
For me there is not really a downside to this library.
Use case:
I have a monitor view that contains 9 gauges (cpu, mem, network, etc...). When the view starts up, it subscribes itself to a system monitor class that returns an observable (interval) that contains all the data for the 9 meters.
It pushes a new result to the view each second (so no polling !!!).
That observable uses a flatMap to simultaneously (async!) fetch data from 9 different sources and zips the results into a new model that your view gets in onNext().
How are you going to do that with futures, completables, etc.? Good luck ! :)
RxJava solves many programming issues for me and in a way makes things a lot easier...
Advantages:
Stateless !!! (important thing to mention, most important maybe)
Thread management out of the box
Build sequences that have their own lifecycle
Everything is an observable, so chaining is easy
Less code to write
Single jar on classpath (very lightweight)
Highly concurrent
No callback hell anymore
Subscriber based (tight contract between consumer and producer)
Backpressure strategies (circuit-breaker-like)
Splendid error handling and recovering
Very nice documentation (marbles <3)
Complete control
Many more ...
Disadvantages:
- Hard to test
Java's Future is a placeholder for something that will be completed in the future, with a blocking API. You have to call its isDone() method periodically to check whether the task has finished. Certainly you can implement your own asynchronous code to manage the polling logic. However, that incurs more boilerplate code and debugging overhead.
Java's CompletableFuture is inspired by Scala's Future. It carries an internal callback method. Once it is finished, the callback method is triggered and tells the thread that the downstream operation should be executed. That's why it has the thenApply method for performing further operations on the object wrapped in the CompletableFuture.
RxJava's Observable is an enhanced version of CompletableFuture. It allows you to handle backpressure. With the thenApply method (and its sibling thenApplyAsync) mentioned above, this situation might happen: the downstream method wants to call an external service that is sometimes unavailable. In this case, the CompletableFuture will fail completely and you will have to handle the error yourself. However, Observable allows you to handle the backpressure and continue the execution once the external service becomes available.
In addition, there is an interface similar to Observable: Flowable. They are designed for different purposes. Usually Flowable is dedicated to handling cold and non-timed operations, while Observable is dedicated to handling executions that require instant responses. See the official documents here: https://github.com/ReactiveX/RxJava#backpressure
All three interfaces serve to transfer values from producer to consumer. Consumers can be of 2 kinds:
synchronous: consumer makes blocking call which returns when the value is ready
asynchronous: when the value is ready, a callback method of the consumer is called
Also, communication interfaces differ in other ways:
able to transfer a single value or multiple values
if multiple values, backpressure can be supported or not
As a result:
Future transfers a single value using a synchronous interface
CompletableFuture transfers a single value using both synchronous and asynchronous interfaces
Rx transfers multiple values using an asynchronous interface with backpressure
Also, all these communication facilities support transferring exceptions. This is not always the case. For example, BlockingQueue does not.
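The distinction between the two kinds of consumer, and the transfer of exceptions, can be sketched with CompletableFuture alone, since it offers both interfaces. The class and method names below are made up for the example.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.atomic.AtomicReference;

final class ConsumerKinds {
    // Synchronous consumer: the calling thread blocks until the value is ready.
    static int consumeBlocking(CompletableFuture<Integer> future) {
        return future.join();
    }

    // Asynchronous consumer: registers a callback and returns immediately.
    static CompletableFuture<Void> consumeWithCallback(CompletableFuture<Integer> future,
                                                       AtomicReference<Integer> sink) {
        return future.thenAccept(sink::set);
    }

    // Exceptions travel through the same channel as values.
    static String describeFailure(CompletableFuture<Integer> future) {
        try {
            future.join();
            return "no failure";
        } catch (CompletionException e) {
            return e.getCause().getMessage(); // the producer's original exception
        }
    }

    public static void main(String[] args) {
        System.out.println(consumeBlocking(CompletableFuture.supplyAsync(() -> 41 + 1)));

        AtomicReference<Integer> sink = new AtomicReference<>();
        consumeWithCallback(CompletableFuture.completedFuture(7), sink).join();
        System.out.println(sink.get());

        CompletableFuture<Integer> failed = new CompletableFuture<>();
        failed.completeExceptionally(new IllegalStateException("boom"));
        System.out.println(describeFailure(failed));
    }
}
```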
The main advantage of CompletableFuture over a normal Future is that CompletableFuture offers a fluent, stream-like API with callback handlers for chaining your tasks, which is entirely absent from a normal Future. That, along with its asynchronous design, makes CompletableFuture a good fit for computation-heavy map-reduce tasks where you don't want threads blocked waiting on intermediate results.

Difference between parallel stream and CompletableFuture

In the book "Java 8 in Action" (by Urma, Fusco and Mycroft) they highlight that parallel streams internally use the common fork-join pool and that, whilst this can be configured globally, e.g. using System.setProperty(...), it is not possible to specify a value for a single parallel stream.
I have since seen the workaround that involves running the parallel stream inside a custom made ForkJoinPool.
Later on in the book, they have an entire chapter dedicated to CompletableFuture, during which they have a case study where they compare the respective performance of using a parallelStream VS a CompletableFuture. It turns out their performance is very similar - they highlight the reason for this as being that they are both as default using the same common pool (and therefore the same amount of threads).
They go on to show a solution and argue that the CompletableFuture is better in this circumstance as it can be configured to use a custom Executor, with a thread pool size of the user's choice. When they update the solution to utilise this, the performance is significantly improved.
This made me think - if one were to do the same for the parallel stream version using the workaround highlighted above, would the performance benefits be similar, and would the two approaches therefore become similar again in terms of performance? In this case, why would one choose the CompletableFuture over the parallel stream when it clearly takes more work on the developer's part.
In this case, why would one choose the CompletableFuture over the parallel stream when it clearly takes more work on the developer's part.
IMHO This depends on the interface you are looking to support. If you are looking to support an asynchronous API e.g.
CompletableFuture<String> downloadHttp(URL url);
In this case, only a completable future makes sense because you may want to do something else unrelated while you wait for the data to come down.
On the other hand, parallelStream() is best for CPU-bound tasks where you want every task to perform a portion of some work, i.e. every thread is doing the same thing with different data. As you mention, it is also easier to use.
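For reference, the workaround mentioned in the question (running the parallel stream inside a custom ForkJoinPool) is usually written along the lines of the sketch below. Treat it as an assumption-laden illustration: it relies on the way current JDK implementations schedule parallel stream work from inside a fork-join task, which the stream API documentation does not guarantee, and the class and method names are made up.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ForkJoinPool;
import java.util.function.Function;
import java.util.stream.Collectors;

final class CustomPoolStream {
    // Runs the parallel stream from inside a dedicated pool instead of the common pool.
    static <T, R> List<R> mapInPool(List<T> items, Function<T, R> fn, int parallelism) {
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            Callable<List<R>> task =
                    () -> items.parallelStream().map(fn).collect(Collectors.toList());
            return pool.submit(task).get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        Function<Integer, Integer> twice = x -> x * 2;
        System.out.println(mapInPool(Arrays.asList(1, 2, 3), twice, 8));
    }
}
```

Whether this closes the performance gap described in the book would have to be measured for the workload at hand; the sketch only shows the mechanics.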
