I am writing a program to download historical quotes from a source. The source provides files over HTTP for each day, which need to be parsed and processed. The program downloads multiple files in parallel using CompletableFuture with different stages. The first stage makes an HTTP call using HttpClient and gets the response.
The getHttpResponse() method returns a CloseableHttpResponse object. I also want to return the URL for which this HTTP request was made. The simplest way is to have a wrapper object holding these two fields, but I feel it is too much to have a class just to contain them. Is there a way with CompletableFuture or Streams that I can achieve this?
filesToDownload.stream()
    .map(url -> CompletableFuture.supplyAsync(() -> this.getHttpResponse(url), this.executor))
    .map(httpResponseFuture -> httpResponseFuture.thenAccept(t -> processHttpResponse(t)))
    .count();
It's not clear why you want to bring in the Stream API at all costs. Splitting the CompletableFuture use into two map operations causes a problem that wouldn't exist otherwise. Besides that, using map for side effects is an abuse of the Stream API. This may break completely in Java 9 if filesToDownload is a Stream source with a known size (like almost every Collection): count() will then simply return that known size without processing the functions of the map operations at all.
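A minimal sketch of that pitfall (the list content is made up for illustration):

List<String> urls = List.of("a", "b", "c"); // any sized Collection source

long n = urls.stream()
    .map(u -> { System.out.println("mapping " + u); return u; })
    .count();
// On Java 9+, n is 3, but nothing may be printed: since map cannot change
// the element count, the pipeline may compute the size without running it.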
If you want to pass the URL and the CloseableHttpResponse to processHttpResponse, you can do it as easily as:
filesToDownload.forEach(url ->
    CompletableFuture.supplyAsync(() -> this.getHttpResponse(url), this.executor)
        .thenAccept(t -> processHttpResponse(t, url))
);
Even if you use the Stream API to collect results, there is no reason to split the CompletableFuture into multiple map operations:
List<…> result = filesToDownload.stream()
    .map(url -> CompletableFuture.supplyAsync(() -> this.getHttpResponse(url), this.executor)
        .thenApply(t -> processHttpResponse(t, url)))
    .collect(Collectors.toList())
    .stream()
    .map(CompletableFuture::join)
    .collect(Collectors.toList());
Note that this will collect the CompletableFutures into a List before waiting for any result in a second Stream operation. This is preferable to using a parallel Stream operation, as it ensures that all asynchronous operations have been submitted before starting to wait.
Using a single Stream pipeline would imply waiting for the completion of the first job before even submitting the second, and using a parallel Stream would only reduce that problem instead of solving it. It would depend on the execution strategy of the Stream implementation (the default Fork/Join pool), which interferes with the actual policy of your specified executor. E.g., if the specified executor is supposed to use more threads than there are CPU cores, the Stream would still submit only as many jobs at a time as there are cores, or even fewer if there are other jobs on the default Fork/Join pool.
In contrast, the behavior of the solution above will be entirely controlled by the execution strategy of the specified executor.
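For comparison, a sketch of the single-pipeline variant described above, which serializes the work despite using futures:

List<…> result = filesToDownload.stream()
    .map(url -> CompletableFuture.supplyAsync(() -> this.getHttpResponse(url), this.executor)
        .thenApply(t -> processHttpResponse(t, url))
        .join()) // waits here, before the next url is even submitted
    .collect(Collectors.toList());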
I did research but didn't find an adequate answer to this question.
Why do we need more stages than one stage?
One Thread -> One Big Task(A,B,C,D)
vs.
CompletableFuture with the stages A, B, C, D
So my answer would be the following:
If I have more stages, I can split the task over different methods and classes.
If I have more stages, executing the whole task is fairer relative to other whole tasks. What do I mean by that? Let's say we have only one thread in our system. If I execute it as One Big Task(A,B,C,D), then my next big task (W,X,Y,Z) only gets the chance to be executed after the first big task is finished. With CompletionStages it is fairer, because A,W,B,C,X,Y,Z,D could be the execution order, as the sketch below shows.
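A minimal sketch of that interleaving (the stage names are made up), using a single-threaded executor:

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class FairnessDemo {
    static void stage(String name) { System.out.print(name + " "); }

    public static void main(String[] args) {
        ExecutorService single = Executors.newSingleThreadExecutor();

        // Big task 1, split into the stages A, B, C, D
        CompletableFuture<Void> task1 = CompletableFuture
                .runAsync(() -> stage("A"), single)
                .thenRunAsync(() -> stage("B"), single)
                .thenRunAsync(() -> stage("C"), single)
                .thenRunAsync(() -> stage("D"), single);

        // Big task 2, split into the stages W, X, Y, Z
        CompletableFuture<Void> task2 = CompletableFuture
                .runAsync(() -> stage("W"), single)
                .thenRunAsync(() -> stage("X"), single)
                .thenRunAsync(() -> stage("Y"), single)
                .thenRunAsync(() -> stage("Z"), single);

        CompletableFuture.allOf(task1, task2).join();
        single.shutdown();
        // Prints an interleaving like: A W B X C Y D Z
        // One big Runnable per task would print A B C D W X Y Z instead.
    }
}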
Are there any metrics/rules for my last point on how small I should split the big task into sub-tasks?
Is my last point a point for the stages in CompletableFutures?
Is my first point a point for the stages in CompletableFutures?
Are there other points for using the stages of CompletableFutures?
When you have the choice, like with
CompletableFuture.supplyAsync(() -> method1())
.thenApply(o1 -> method2(o1))
.thenApply(o2 -> method3(o2))
.thenAccept(o3 -> method4(o3));
and
CompletableFuture.runAsync(() -> {
    var o1 = method1();
    var o2 = method2(o1);
    var o3 = method3(o2);
    method4(o3);
});
or
CompletableFuture.runAsync(() -> method4(method3(method2(method1()))));
there is no advantage in using multiple stages. In fact, the first variant is much harder to debug than the alternatives.
Things are different when the chaining does not happen at the same place. Think of a library having a future-returning method, encapsulating something like supplyAsync(() -> method1()), another library calling that method, chaining another operation, and returning the composition to the application, which will chain yet another operation.
Expressing the same in a single stage would only be possible when the methods invoked in the functions are still provided by each library’s API and have a sequential nature, i.e. we’re not talking about thenCompose(…) kind of stages.
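A runnable sketch of that layering (the library names, types, and methods are made up for illustration):

import java.util.concurrent.CompletableFuture;

// Library A exposes only a future-returning method.
class LibraryA {
    CompletableFuture<String> fetchAsync() {      // encapsulates supplyAsync
        return CompletableFuture.supplyAsync(() -> "raw data");
    }
}

// Library B builds on Library A and chains its own stage.
class LibraryB {
    private final LibraryA libraryA = new LibraryA();
    CompletableFuture<Integer> loadModelAsync() {
        return libraryA.fetchAsync().thenApply(String::length);
    }
}

public class App {
    public static void main(String[] args) {
        // The application chains yet another stage onto Library B's result.
        new LibraryB().loadModelAsync()
                .thenAccept(len -> System.out.println("model size: " + len))
                .join();
    }
}

No single place sees the whole call sequence, so no single place could write it as one block.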
But such chains are still hard to debug, and Project Loom is trying to solve this. Then, you'd express the operation as a call sequence, exactly like in the second or third variant, even when the methods are potentially blocking, but run it in a virtual thread which will release the underlying native thread each time it would block.
Then, we have even less use for a linear chain of stages.
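A minimal sketch of that, assuming Java 21+ (where virtual threads have shipped); method1() through method4() are the placeholders from above:

try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
    executor.submit(() -> {
        var o1 = method1();   // may block; only the virtual thread parks
        var o2 = method2(o1); // the carrier (native) thread is released
        var o3 = method3(o2);
        method4(o3);
    });
} // close() waits for the submitted task to complete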
A remaining use case for creating a linear chain of dependent stages is to have different executors. For example
CompletableFuture.supplyAsync(() -> fetchFromDb(), MY_BACKGROUND_EXECUTOR)
.thenAcceptAsync(data -> updateSwingModel(data), EventQueue::invokeLater)
.whenCompleteAsync((x, thrown) ->
updateStatusBar(jobID, thrown), EventQueue::invokeLater);
Here, writing the operation as a single block is not an option.
What is the difference between:
List<String> parra = list.parallelStream()
    .map(heavyProcessingFunction)
    .collect(Collectors.toList());
and this (apart from the second being a bit more complex):
List<CompletableFuture<Void>> com = list.stream()
    .map(x -> CompletableFuture.runAsync(() -> heavyProcessingFunction.apply(x)))
    .collect(Collectors.toList());
CompletableFuture.allOf(com.toArray(new CompletableFuture[0])).join();
// get all of the strings from com now
Semantically they are quite similar; it is mostly a matter of overhead.
For the second approach, you have to create a CompletableFuture for each entry in the list and submit them individually to the common Fork/Join pool.
Parallel streams, on the other hand, can be implemented by chunking the input list into a few large slices, submitting only those slices as tasks to the common pool, and then having a thread essentially loop over a slice instead of having to pick up and unwrap one future after another from its work queue.
Additionally, the stream implementation means that not just the map operation but also the collect step is aware of parallel execution and can thus optimize it.
Fewer allocations, fewer expensive operations on concurrent data structures, simpler code.
The way it was implemented above, there's no difference. The advantage of using the CompletableFuture API is that you can pass a custom Executor if you want to have more control over the threads and/or implement some async semantics.
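For example, here is a sketch of the second approach adjusted so the results are actually retrievable (runAsync yields CompletableFuture<Void>, so supplyAsync is needed) and with a custom executor; the pool size and element types are assumptions:

ExecutorService executor = Executors.newFixedThreadPool(8); // size chosen arbitrarily

List<CompletableFuture<String>> futures = list.stream()
    .map(x -> CompletableFuture.supplyAsync(() -> heavyProcessingFunction.apply(x), executor))
    .collect(Collectors.toList());

List<String> results = futures.stream()
    .map(CompletableFuture::join) // all jobs were submitted above, so this only waits
    .collect(Collectors.toList());

executor.shutdown();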
If I am using a map operation in a stream pipeline with the forEach() terminal operation (which does not honor encounter order, irrespective of whether it is a sequential or parallel stream) on a list (as source), will map respect the encounter order of the list in the case of a sequential or parallel stream?
List<Integer> someList = Arrays.asList(1, 2, 3, 4, 5);
someList.stream().map(i -> i*2).forEach(System.out::println); // this is a sequential stream
someList.parallelStream().map(i -> i*2).forEach(System.out::println); // this is a parallel stream
If yes, then in this post https://stackoverflow.com/a/47337690/5527839, it is mentioned that the map operation will be performed in parallel. If order is maintained, how does it make the performance better when using a parallel stream? What is the point of using a parallel stream?
If order is maintained, how will it make the performance better when using a parallel stream? What is the point of using a parallel stream? (Yes, you will still gain performance, just not at the expected level.)
Even if you use forEachOrdered() with a parallelStream, the intermediate operation map will be executed by concurrent threads; the terminal operation forEachOrdered merely makes the results be processed in order. Try the code below and you will see the parallelism in the map operation:
List<Integer> someList = Arrays.asList(1, 2, 3, 4, 5);

someList.stream().map(i -> {
    System.out.println(Thread.currentThread().getName() + " Normal Stream : " + i);
    return i * 2;
}).forEach(System.out::println); // this is a sequential stream

System.out.println("this is parallel stream");

someList.parallelStream().map(i -> {
    System.out.println(Thread.currentThread().getName() + " Parallel Stream : " + i);
    return i * 2;
}).forEachOrdered(System.out::println); // this is a parallel stream
Will map honor encounter order? Is ordering in any way related to intermediate operations?
With a parallel stream, map will not process elements in encounter order; with a sequential stream, it will. It depends entirely on the stream, not on the intermediate operation.
While many intermediate operations do preserve ordering without having to explicitly specify that desire, I always prefer to assume that data flowing through the Java Stream API isn't guaranteed to end up ordered in every scenario, even given the same code.
When the order of the elements must be preserved, it is enough to specify the terminal operation as an ordered operation, and the data will be in order when it comes out. In your case, I believe you'd be looking for
.forEachOrdered()
If order is maintained, how will it make the performance better when using a parallel stream? What is the point of using a parallel stream?
I've heard many opinions on this. I believe you should only use parallel streams if you are doing a non-trivial amount of processing inside the pipeline; otherwise, the overhead of managing the parallel stream will in most cases degrade performance compared to serial streams. If you are doing some intensive processing, parallel will still work faster than serial even when instructed to preserve the order, because, after all, the data being processed is stored on the heap either way, and the pointers to that data are what get ordered and handed out of the end of the pipe. All a stream needs to do for ordering is hand the pointers out in the same order they were encountered, but it can work on the data in parallel and just wait if the data at the front of the queue isn't yet finished.
I'm sure this is a bit of an oversimplification, as there are cases where an ordered stream requires data to be shared from one element to the next (collectors, for instance). But the basic concept is valid, since even in this case a parallel stream is able to process at least two pieces of data at a time.
I have never used a ForkJoinPool and I came across this code snippet.
I have a Set<Document> docs. Document has a write method. If I do the following, do I need to call get or join to ensure that all the docs in the set have correctly finished their write method?
ForkJoinPool pool = new ForkJoinPool(concurrencyLevel);
pool.submit(() -> docs.parallelStream().forEach(doc -> doc.write()));
What happens if one of the docs is unable to complete its write? Say it throws an exception. Does the code given wait for all the docs to complete their write operation?
ForkJoinPool.submit(Runnable) returns a ForkJoinTask representing the pending completion of the task. If you want to wait for all documents to be processed, you need some form of synchronization with that task, like calling its get() method (from the Future interface).
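For example, a minimal sketch based on the snippet above:

ForkJoinPool pool = new ForkJoinPool(concurrencyLevel);
ForkJoinTask<?> task = pool.submit(() -> docs.parallelStream().forEach(doc -> doc.write()));
task.join(); // blocks until the whole stream pass is done; rethrows failures unchecked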
Concerning the exception handling: as usual, any exception during the stream processing will stop it. However, you have to refer to the documentation of Stream.forEach(Consumer):
The behavior of this operation is explicitly nondeterministic. For parallel stream pipelines, this operation does not guarantee to respect the encounter order of the stream, as doing so would sacrifice the benefit of parallelism. For any given element, the action may be performed at whatever time and in whatever thread the library chooses. […]
This means that you have no guarantee of which document will be written if an exception occurs. The processing will stop but you cannot control which document will still be processed.
If you want to make sure that the remaining documents are processed, I would suggest two solutions:
surround document.write() with a try/catch to make sure no exception propagates, but this makes it difficult to check which document succeeded or whether there was any failure at all; or
use another solution to manage your parallel processing, like the CompletableFuture API. As noted in the comments, your current solution is a hack that works thanks to implementation details, so it would be preferable to do something cleaner.
Using CompletableFuture, you could do it as follows:
List<CompletableFuture<Void>> futures = docs.stream()
.map(doc -> CompletableFuture.runAsync(doc::write, pool))
.collect(Collectors.toList());
This will make sure that all documents are processed; you can then inspect each future in the returned list for success or failure.
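For instance, a sketch of that inspection (mapping failed futures back to their documents is omitted for brevity):

// Wait for everything; swallow the aggregate failure so each future can be inspected.
CompletableFuture.allOf(futures.toArray(new CompletableFuture[0]))
        .exceptionally(t -> null)
        .join();

for (CompletableFuture<Void> future : futures) {
    if (future.isCompletedExceptionally()) {
        // this document's write failed; log it, retry it, etc.
    }
}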
What is the best approach to parallel-process a Collection of Java objects? I would like to have a thread pool of 100 threads, each working on a separate Collection object and performing some action on it. Any ideas? Java 8 is the targeted version.
Use a parallelStream.
yourCollection
.parallelStream()
.forEach(e -> doStuff(e));
You may also want to collect() the results afterwards.
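A sketch of that, assuming doStuff returns some Result type (both names are placeholders):

List<Result> results = yourCollection.parallelStream()
    .map(e -> doStuff(e))
    .collect(Collectors.toList());

Note that a parallel stream runs on the common Fork/Join pool, which is sized to the number of CPU cores rather than 100; if you need explicit control over the thread count, the CompletableFuture approach with a custom Executor discussed above is the cleaner fit.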