What is the difference between:
List<String> parra = list.parallelStream()
        .map(heavyProcessingFunction)
        .collect(Collectors.toList());
and this (apart from the second being a bit more complex):
List<CompletableFuture<String>> com = list.stream()
        .map(x -> CompletableFuture.supplyAsync(() -> heavyProcessingFunction.apply(x)))
        .collect(Collectors.toList());
CompletableFuture.allOf(com.toArray(new CompletableFuture[0])).join();
// get all of the strings from com now, e.g. by calling join() on each future
Semantically they are quite similar; it is mostly a matter of overhead.
For the second approach you have to create a CompletableFuture for each entry in the list and submit each one individually to the common ForkJoinPool.
Parallel streams, on the other hand, can be implemented by chunking the input list into a few large slices, submitting only those slices as tasks to the common pool, and then having a thread essentially loop over its slice instead of picking up and unwrapping future after future from its work queue.
Additionally, the stream implementation means that not just the map operation but also the collect step is aware of parallel execution and can thus optimize it.
The result: fewer allocations, fewer expensive operations on concurrent data structures, simpler code.
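To make the comparison concrete, here is a minimal sketch of both variants side by side. The HEAVY function is a hypothetical stand-in for the expensive function from the question, and the class and method names are made up for illustration. Both methods return the same list; the difference is only in how the work is handed to the common pool.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;
import java.util.stream.Collectors;

public class StreamVsFutures {
    // Hypothetical stand-in for the expensive function from the question.
    static final Function<String, String> HEAVY = s -> s.toUpperCase();

    // One pipeline: the stream splits the source and merges partial results itself.
    static List<String> withParallelStream(List<String> input) {
        return input.parallelStream().map(HEAVY).collect(Collectors.toList());
    }

    // One future per element, each submitted individually to the common pool.
    static List<String> withFutures(List<String> input) {
        List<CompletableFuture<String>> futures = input.stream()
                .map(x -> CompletableFuture.supplyAsync(() -> HEAVY.apply(x)))
                .collect(Collectors.toList());
        // Join only after every task has been submitted.
        return futures.stream().map(CompletableFuture::join).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("a", "b", "c", "d");
        System.out.println(withParallelStream(input));
        System.out.println(withFutures(input));
    }
}
```

Note that the future-based variant needs supplyAsync rather than runAsync if the results are to be read back afterwards.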
As implemented above, there's no difference. The advantage of the CompletableFuture API is that you can pass a custom Executor if you want more control over the threads and/or want to implement some async semantics.
I can't find a specific answer to the line of investigation that we've been asked to take on.
I see that parallel streams may not perform well with a small number of threads, and that they apparently don't behave well when the DB blocks the next request while processing the current one.
However, I find that the overhead of implementing Task Executor compared to parallel streams is huge. We've implemented a POC that takes care of our concurrency needs with just this one line of code:
List<Map<String, String>> listWithAllMaps = mappedValues.entrySet().parallelStream().map(e -> callPlugins(e))
.collect(Collectors.toList());
Whereas with Task Executor we'd need to implement the Runnable interface and write some cumbersome code just to get the runnables to return the values we're reading from the DB instead of being void, costing us several hours, if not days, of coding and producing less maintainable, more bug-prone code.
However, our CTO is still reluctant to use parallel streams because of unforeseen issues that could come up down the road.
So the question is: in an environment where I need to make several concurrent read-only queries to a database, using different Java components/REST calls for each query, is it preferable in any way to use Task Executor instead of parallel streams, and if so, why?
Use the TaskExecutor as an Executor for a CompletableFuture.
List<CompletableFuture<Map<String, String>>> futures = mappedValues.entrySet().stream()
        .map(e -> CompletableFuture.supplyAsync(() -> callPlugins(e), taskExecutor))
        .collect(Collectors.toList());
List<Map<String, String>> listWithAllMaps = futures.stream().map(CompletableFuture::join).collect(Collectors.toList());
Not sure how this is cumbersome. Yes, it is a bit more code, but with the advantage that you can easily configure the TaskExecutor and increase the number of threads, the queue size, etc.
DISCLAIMER: Typed this off the top of my head, so some minor things might be off in the code snippet.
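A self-contained sketch of the same idea, under a few assumptions: a plain fixed thread pool stands in for the Spring TaskExecutor (which is also a java.util.concurrent.Executor, so it can be passed the same way), and callPlugins is a hypothetical stub rather than the real plugin call.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

public class ExecutorBackedFutures {
    // Hypothetical stub for the callPlugins method from the question.
    static Map<String, String> callPlugins(Map.Entry<String, String> e) {
        return Collections.singletonMap(e.getKey(), "result-for-" + e.getValue());
    }

    static List<Map<String, String>> runAll(Map<String, String> mappedValues, Executor executor) {
        // Submit one task per entry to the given executor.
        List<CompletableFuture<Map<String, String>>> futures = mappedValues.entrySet().stream()
                .map(e -> CompletableFuture.supplyAsync(() -> callPlugins(e), executor))
                .collect(Collectors.toList());
        // Wait for the results in submission order.
        return futures.stream().map(CompletableFuture::join).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Pool size is an arbitrary choice for this sketch.
        ExecutorService pool = Executors.newFixedThreadPool(8);
        try {
            Map<String, String> mappedValues = new LinkedHashMap<>();
            mappedValues.put("a", "1");
            mappedValues.put("b", "2");
            System.out.println(runAll(mappedValues, pool));
        } finally {
            pool.shutdown();
        }
    }
}
```

The tasks return values directly through supplyAsync, so there is no need to write Runnable implementations that smuggle results out.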
If I use a map operation in a stream pipeline with a forEach() terminal operation (which does not honor encounter order, irrespective of whether the stream is sequential or parallel) on a list as the source, will map respect the encounter order of the list for a sequential or a parallel stream?
List<Integer> someList = Arrays.asList(1, 2, 3, 4, 5);
someList.stream().map(i -> i * 2).forEach(System.out::println); // this is a sequential stream
someList.parallelStream().map(i -> i * 2).forEach(System.out::println); // this is a parallel stream
If yes, then in this post https://stackoverflow.com/a/47337690/5527839 it is mentioned that the map operation will be performed in parallel. If order is maintained, how will it make the performance better when using a parallel stream? What is the point of using a parallel stream?
If order is maintained, how will it make the performance better when using a parallel stream? What is the point of using a parallel stream? (Yes, you will still gain performance, though perhaps not as much as expected.)
Even if you use forEachOrdered() with a parallel stream, the intermediate map operation is executed by concurrent threads; the terminal forEachOrdered then processes the results in order. Try the code below and you will see the parallelism in the map operation:
List<Integer> someList = Arrays.asList(1,2,3,4,5);
someList.stream().map(i -> {
System.out.println(Thread.currentThread().getName()+" Normal Stream : "+i);
return i*2;
}).forEach(System.out::println); // this is sequential stream
System.out.println("this is parallel stream");
someList.parallelStream().map(i -> {
System.out.println(Thread.currentThread().getName()+" Parallel Stream : "+i);
return i*2;
}).forEachOrdered(System.out::println); // this is parallel stream
Will map honor encounter order? Is ordering in any way related to intermediate operations?
With a parallel stream, map does not process elements in encounter order; with a sequential stream it does. That depends entirely on the stream, not on the intermediate operation.
While many intermediate operations do preserve ordering without having to explicitly specify that desire, I always prefer to assume that data flowing through the Java Stream API isn't guaranteed to end up ordered in every scenario, even given the same code.
When the order of the elements must be preserved, it is enough to make the terminal operation an ordered one, and the data will be in order when it comes out. In your case I believe you'd be looking for
.forEachOrdered()
If order is maintained, how will it make the performance better when
using a parallel stream? What is the point of using a parallel stream?
I've heard many opinions on this. I believe you should only use parallel streams if you are doing a non-trivial amount of processing inside the pipeline; otherwise the overhead of managing the parallel stream will in most cases degrade performance compared to serial streams. If you are doing some intensive processing, parallel will generally still be faster than serial even when instructed to preserve the order because, after all, the data being processed is stored on the heap either way, and the references to that data are what get ordered and passed out of the end of the pipe. All a stream needs to do for ordering is hand the references out in the same order they were encountered, but it can work on the data in parallel and just wait if the data at the front of the queue isn't finished yet.
I'm sure this is a bit of an oversimplification, as there are cases where an ordered stream will require data to be shared from one element to the next (collectors, for instance). But the basic concept is valid, since even in this case a parallel stream is able to process at least two pieces of data at a time.
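As a small illustration of the point above (a sketch, with made-up class and method names), both methods below run map on a parallel stream yet return the elements in the encounter order of the source list, once via forEachOrdered and once via collect:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

public class ParallelOrderDemo {
    // map may run on several threads; forEachOrdered hands results out in encounter order.
    static List<Integer> doubledViaForEachOrdered(List<Integer> input) {
        List<Integer> out = Collections.synchronizedList(new ArrayList<>());
        input.parallelStream().map(i -> i * 2).forEachOrdered(out::add);
        return out;
    }

    // Collecting to a list also preserves the encounter order of an ordered source.
    static List<Integer> doubledViaCollect(List<Integer> input) {
        return input.parallelStream().map(i -> i * 2).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> someList = Arrays.asList(1, 2, 3, 4, 5);
        System.out.println(doubledViaForEachOrdered(someList)); // [2, 4, 6, 8, 10]
        System.out.println(doubledViaCollect(someList));        // [2, 4, 6, 8, 10]
    }
}
```

Replacing forEachOrdered with forEach in the first method would remove that ordering guarantee.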
I am writing a program to download historical quotes from a source. The source provides a file over HTTP for each day, which needs to be parsed and processed. The program downloads multiple files in parallel using CompletableFuture with several stages. The first stage is to make an HTTP call using HttpClient and get the response.
The getHttpResponse() method returns a CloseableHttpResponse object. I also want to return the URL for which this HTTP request was made. The simplest way is to have a wrapper object with these two fields, but I feel it is too much to have a class just to hold them. Is there a way to achieve this with CompletableFuture or Streams?
filesToDownload.stream()
.map(url -> CompletableFuture.supplyAsync(() -> this.getHttpResponse(url), this.executor) )
.map(httpResponseFuture -> httpResponseFuture.thenAccept(t -> processHttpResponse(t)))
.count();
It’s not clear why you want to bring in the Stream API at all costs. Splitting the CompletableFuture use into two map operations causes the problem which wouldn’t exist otherwise. Besides that, using map for side effects is an abuse of the Stream API. This may break completely in Java 9, if filesToDownload is a Stream source with a known size (like almost every Collection). Then, count() will simply return that known size, without processing the functions of the map operations…
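The count() pitfall can be observed with a short sketch like the following (class and method names are hypothetical). The number of times the map function actually runs depends on the Java version, which is exactly why relying on it is fragile; forEach, by contrast, is specified to run its action for every element.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class CountSideEffects {
    // Counts how often the map function is invoked when count() is the terminal operation.
    static int mapCallsBeforeCount(List<String> urls) {
        AtomicInteger calls = new AtomicInteger();
        urls.stream()
            .map(u -> { calls.incrementAndGet(); return u; })
            .count();
        return calls.get();
    }

    // forEach always performs its action once per element.
    static int callsViaForEach(List<String> urls) {
        AtomicInteger calls = new AtomicInteger();
        urls.forEach(u -> calls.incrementAndGet());
        return calls.get();
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList("a", "b", "c");
        // Version-dependent: the map function runs for every element on Java 8,
        // but may not run at all on Java 9 and later.
        System.out.println(mapCallsBeforeCount(urls));
        System.out.println(callsViaForEach(urls)); // 3
    }
}
```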
If you want to pass the URL and the CloseableHttpResponse to processHttpResponse, you can do it as easily as:
filesToDownload.forEach(url ->
CompletableFuture.supplyAsync(() -> this.getHttpResponse(url), this.executor)
.thenAccept( t -> processHttpResponse(t, url))
);
Even, if you use the Stream API to collect results, there is no reason to split the CompletableFuture into multiple map operations:
List<…> result = filesToDownload.stream()
.map(url -> CompletableFuture.supplyAsync(() -> this.getHttpResponse(url), this.executor)
.thenApply( t -> processHttpResponse(t, url)) )
.collect(Collectors.toList())
.stream()
.map(CompletableFuture::join)
.collect(Collectors.toList());
Note that this will collect the CompletableFutures into a List before waiting for any result in a second Stream operation. This is preferable to using a parallel Stream operation as it ensures that all asynchronous operations have been submitted, before starting to wait.
Using a single Stream pipeline would imply waiting for the completion of the first job before even submitting the second, and using a parallel Stream would only reduce that problem instead of solving it. It would depend on the execution strategy of the Stream implementation (the default Fork/Join pool), which interferes with the actual policy of your specified executor. E.g., if the specified executor is supposed to use more threads than CPU cores, the Stream would still submit only as many jobs at a time as there are cores, or even fewer if there are other jobs on the default Fork/Join pool.
In contrast, the behavior of the solution above will be entirely controlled by the execution strategy of the specified executor.
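The difference between the two shapes can be sketched as follows. Here slowTask is a hypothetical stand-in for the download-and-process step (it just sleeps for 100 ms), and the pool size and URL names are arbitrary. With four tasks on a four-thread pool, the two-phase version should take roughly one task duration, while the single pipeline should take roughly four.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

public class SubmitThenJoin {
    // Hypothetical stand-in for getHttpResponse plus processHttpResponse.
    static String slowTask(String url) {
        try {
            Thread.sleep(100);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return "done:" + url;
    }

    // Phase 1 submits everything; phase 2 waits. All jobs can run concurrently.
    static List<String> twoPhase(List<String> urls, Executor executor) {
        List<CompletableFuture<String>> futures = urls.stream()
                .map(u -> CompletableFuture.supplyAsync(() -> slowTask(u), executor))
                .collect(Collectors.toList());
        return futures.stream().map(CompletableFuture::join).collect(Collectors.toList());
    }

    // A lazy sequential stream pushes each element through submit and join
    // before touching the next element, so the jobs run one after another.
    static List<String> singlePipeline(List<String> urls, Executor executor) {
        return urls.stream()
                .map(u -> CompletableFuture.supplyAsync(() -> slowTask(u), executor))
                .map(CompletableFuture::join)
                .collect(Collectors.toList());
    }

    static long millis(Runnable r) {
        long start = System.nanoTime();
        r.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList("u1", "u2", "u3", "u4");
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            System.out.println("two-phase ms:       " + millis(() -> twoPhase(urls, pool)));
            System.out.println("single pipeline ms: " + millis(() -> singlePipeline(urls, pool)));
        } finally {
            pool.shutdown();
        }
    }
}
```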
What is the best approach to process a Collection of Java objects in parallel? I would like to have a thread pool of 100 threads, each working on a separate object from the Collection and performing some action on it. Any ideas? Java 8 is the targeted version.
Use a parallelStream.
yourCollection
.parallelStream()
.forEach(e -> doStuff(e));
You may also want to collect() the results afterwards.
In the book "Java 8 in Action" (by Urma, Fusco and Mycroft) they highlight that parallel streams internally use the common fork/join pool and that, whilst this can be configured globally, e.g. using System.setProperty(...), it is not possible to specify a value for a single parallel stream.
I have since seen the workaround that involves running the parallel stream inside a custom made ForkJoinPool.
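For reference, the workaround usually looks something like the sketch below (class and method names are made up). It relies on the observed behavior that a parallel stream started from inside a ForkJoinPool task runs its subtasks in that pool; as far as I know this is an implementation detail rather than a documented guarantee, so treat it as an assumption.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ForkJoinPool;
import java.util.stream.Collectors;

public class CustomPoolParallelStream {
    static List<Integer> squaresInPool(List<Integer> input, int parallelism) throws Exception {
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            // The whole pipeline is wrapped in one task submitted to the custom pool.
            Callable<List<Integer>> task = () -> input.parallelStream()
                    .map(i -> i * i)
                    .collect(Collectors.toList());
            return pool.submit(task).get();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(squaresInPool(Arrays.asList(1, 2, 3, 4), 16)); // [1, 4, 9, 16]
    }
}
```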
Later on in the book, they have an entire chapter dedicated to CompletableFuture, in which a case study compares the performance of a parallelStream with that of a CompletableFuture. It turns out their performance is very similar; the authors attribute this to both using the same common pool by default (and therefore the same number of threads).
They go on to show a solution and argue that CompletableFuture is better in this circumstance as it can be configured to use a custom Executor, with a thread pool size of the user's choice. When they update the solution to utilise this, the performance is significantly improved.
This made me think: if one were to do the same for the parallel stream version using the workaround highlighted above, would the performance benefits be similar, and would the two approaches therefore become similar again in terms of performance? In this case, why would one choose the CompletableFuture over the parallel stream when it clearly takes more work on the developer's part?
In this case, why would one choose the CompletableFuture over the parallel stream when it clearly takes more work on the developer's part?
IMHO this depends on the interface you are looking to support. Suppose you are looking to support an asynchronous API, e.g.
CompletableFuture<String> downloadHttp(URL url);
In this case, only a CompletableFuture makes sense because you may want to do something else unrelated while you wait for the data to come down.
On the other hand, parallelStream() is best for CPU-bound tasks where you want every task to perform a portion of some work, i.e. every thread is doing the same thing with different data. As you mention, it is also easier to use.
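A minimal sketch of what such an asynchronous API buys the caller. Everything here is hypothetical: downloadHttp is a stub that takes a String instead of a URL to keep the example short, and it fabricates a body instead of doing real I/O.

```java
import java.util.concurrent.CompletableFuture;

public class AsyncApiSketch {
    // Hypothetical stub in the shape of the downloadHttp signature above.
    static CompletableFuture<String> downloadHttp(String url) {
        return CompletableFuture.supplyAsync(() -> "body-of-" + url);
    }

    static String fetchAndSummarize(String url) {
        // Start the download and attach a follow-up stage without blocking.
        CompletableFuture<Integer> length = downloadHttp(url).thenApply(String::length);
        // The caller is free to do unrelated work while the download is in flight.
        String other = "unrelated work done";
        // Block only at the point where the result is actually needed.
        return other + ", length=" + length.join();
    }

    public static void main(String[] args) {
        System.out.println(fetchAndSummarize("u")); // unrelated work done, length=9
    }
}
```

A parallel stream has no equivalent of this: its terminal operation blocks the calling thread until the whole pipeline has finished.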