Parallel Processing Java Collection

What is the best approach to parallel-process a Collection of Java objects? I would like to have a thread pool of 100 threads, each working on a separate Collection element and performing some action on it. Any ideas? Java 8 is the targeted version.

Use a parallelStream.
yourCollection
    .parallelStream()
    .forEach(e -> doStuff(e));
You may also want to collect() the results afterwards.
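If doStuff returns a value, the mapping and the collecting fit in one pipeline. A minimal sketch (Result stands in for whatever doStuff returns):

List<Result> results = yourCollection
    .parallelStream()
    .map(e -> doStuff(e))           // executed concurrently on the common pool
    .collect(Collectors.toList());  // merged back into a single list

Note that a parallel stream runs on the common ForkJoinPool, which by default is sized to the number of available cores minus one, not the pool of 100 threads mentioned in the question.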

Related

Parallel streams vs CompletableFuture in Java 8

What is the difference between:
List<String> parra = list.parallelStream()
    .map(heavyProcessingFunction)
    .collect(Collectors.toList());
and this (apart from the second being a bit more complex):
List<CompletableFuture<Void>> com = list.stream()
    .map(x -> CompletableFuture.runAsync(() -> heavyProcessingFunction.apply(x)))
    .collect(Collectors.toList());
CompletableFuture.allOf(com.toArray(new CompletableFuture[0])).join();
// get all of the strings from com now
Semantically they are quite similar; it is mostly a matter of overhead.
With the second approach you have to create a CompletableFuture for each entry in the list and submit them individually to the common fork/join pool.
Parallel streams, on the other hand, can be implemented by chunking the input list into a few large slices, submitting only those slices as tasks to the common pool, and then having a thread essentially loop over a slice instead of having to pick up and unwrap one future after another from its work queue.
Additionally the stream implementation means that not just the map operation but also the collect step is aware of parallel execution and can thus optimize it.
Fewer allocations, fewer expensive operations on concurrent data structures, simpler code.
The way it was implemented above, there's no difference. The advantage of using the CompletableFuture API is that you can pass a custom Executor if you want more control over the threads and/or to implement some asynchronous semantics.
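For illustration, a minimal sketch of the second variant using supplyAsync with a custom pool, so the results are actually captured rather than discarded by runAsync (the pool size of 8 is an arbitrary placeholder):

ExecutorService pool = Executors.newFixedThreadPool(8);
List<CompletableFuture<String>> futures = list.stream()
    .map(x -> CompletableFuture.supplyAsync(() -> heavyProcessingFunction.apply(x), pool))
    .collect(Collectors.toList());
List<String> results = futures.stream()
    .map(CompletableFuture::join)   // block until each result is available
    .collect(Collectors.toList());
pool.shutdown();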

Task Executor vs Java 8 parallel streaming

I can't find a specific answer to the line of investigation that we've been requested to take on.
I see that parallel streams may not be so performant when using a small number of threads, and that apparently they don't behave so well when the DB blocks the next request while processing the current one.
However, I find that the overhead of implementing a Task Executor vs parallel streams is huge; we've implemented a POC that takes care of our concurrency needs with just this one line of code:
List<Map<String, String>> listWithAllMaps = mappedValues.entrySet().parallelStream().map(e -> callPlugins(e))
    .collect(Collectors.toList());
Whereas with a Task Executor, we'd need to implement the Runnable interface and write some cumbersome code just to get the runnables to return the values we're reading from the DB, leading to several hours, if not days, of coding and producing less maintainable, more bug-prone code.
However, our CTO is still reluctant to use parallel streams due to unforeseen issues that could come up down the road.
So the question is: in an environment where I need to make several concurrent read-only queries to a database, using different Java components/REST calls for each query, is it preferable in any way to use a Task Executor instead of parallel streaming, and if so, why?
Use the TaskExecutor as an Executor for a CompletableFuture.
List<CompletableFuture<Map<String, String>>> futures = mappedValues.entrySet().stream()
    .map(e -> CompletableFuture.supplyAsync(() -> callPlugins(e), taskExecutor))
    .collect(Collectors.toList());
List<Map<String, String>> listWithAllMaps = futures.stream()
    .map(CompletableFuture::join)
    .collect(Collectors.toList());
Not sure how this is cumbersome. Yes, it is a bit more code, but it comes with the advantage that you can easily configure the TaskExecutor and increase the number of threads, the queue size, etc.
DISCLAIMER: Typed it off the top of my head, so some minor things might be off with the code snippet.
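For reference, configuring such an executor might look like the sketch below, assuming Spring's ThreadPoolTaskExecutor (the sizes are placeholders to tune for your workload):

ThreadPoolTaskExecutor taskExecutor = new ThreadPoolTaskExecutor();
taskExecutor.setCorePoolSize(10);    // threads kept alive
taskExecutor.setMaxPoolSize(50);     // grown to this only once the queue is full
taskExecutor.setQueueCapacity(100);  // tasks buffered before the pool grows
taskExecutor.initialize();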

Do intermediate operations honor encounter order when the terminal operation in the same stream pipeline does not?

If I am using a map operation in a stream pipeline with the forEach() terminal operation (which does not honor encounter order, regardless of whether the stream is sequential or parallel) on a list as the source, will map respect the encounter order of the list in the case of a sequential or parallel stream?
List<Integer> someList = Arrays.asList(1,2,3,4,5);
someList.stream().map(i -> i*2).forEach(System.out::println); // this is a sequential stream
someList.parallelStream().map(i -> i*2).forEach(System.out::println); // this is a parallel stream
If yes, then in this post https://stackoverflow.com/a/47337690/5527839 it is mentioned that the map operation will be performed in parallel. If order is maintained, how does using a parallel stream improve performance? What is the point of using a parallel stream?
If order is maintained, how will it make the performance better when using a parallel stream? What is the point of using a parallel stream? (Yes, you will still gain performance, just not at the expected level.)
Even if you use forEachOrdered() with a parallel stream, the intermediate operation map will be executed by concurrent threads; the terminal operation forEachOrdered merely makes the results come out in order. Try the code below and you will see the parallelism in the map operation:
List<Integer> someList = Arrays.asList(1,2,3,4,5);
someList.stream().map(i -> {
    System.out.println(Thread.currentThread().getName()+" Normal Stream : "+i);
    return i*2;
}).forEach(System.out::println); // this is a sequential stream
System.out.println("this is parallel stream");
someList.parallelStream().map(i -> {
    System.out.println(Thread.currentThread().getName()+" Parallel Stream : "+i);
    return i*2;
}).forEachOrdered(System.out::println); // this is a parallel stream
Will map honor encounter order? Is ordering in any way related to intermediate operations?
If it is a parallel stream, map will not encounter the elements in any particular order; if it is a normal (sequential) stream, map will encounter them in order. It depends entirely on the stream, not on the intermediate operation.
While many intermediate operations do preserve ordering without your having to explicitly specify that desire, I always prefer to assume that data flowing through the Java Stream API isn't guaranteed to end up ordered in every scenario, even given the same code.
When the order of the elements must be preserved, it is enough to specify the terminal operation as an ordered operation, and the data will be in order when it comes out. In your case I believe you'd be looking for:
.forEachOrdered()
If order is maintained, how will it make the performance better when
using a parallel stream? What is the point of using a parallel stream?
I've heard many opinions on this. I believe you should only use parallel streams if you are doing a non-trivial amount of processing inside the pipeline; otherwise the overhead of managing the parallel stream will in most cases degrade performance compared to serial streams. If you are doing some intensive processing, parallel will still definitely work faster than serial, even when instructed to preserve the order, because, after all, the data being processed is stored on the heap either way, and it is the pointers to that data that get ordered and passed out of the end of the pipe. All a stream needs to do for ordering is hand the pointers out in the same order they were encountered; it can work on the data in parallel and simply wait if the data at the front of the queue isn't yet finished.
I'm sure this is a bit of an oversimplification, as there are cases where an ordered stream requires data to be shared from one element to the next (collectors, for instance), but the basic concept is valid, since even in this case a parallel stream is able to process at least two pieces of data at a time.
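As a small illustration of that point, collecting from a parallel stream still yields the results in encounter order, even though the map work is spread across threads (a minimal sketch using the question's someList):

List<Integer> doubled = someList.parallelStream()
    .map(i -> i * 2)                // runs concurrently
    .collect(Collectors.toList());  // always [2, 4, 6, 8, 10]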

Java parallelization using lambda functions

I have an array of objects with a process() method that I want to run in parallel, and I wanted to try lambdas to achieve the parallelization. So I tried this:
Arrays.asList(myArrayOfItems).forEach(item -> {
    System.out.println("processing " + item.getId());
    item.process();
});
Each process() call takes about 2 seconds, and I have noticed that there is still no speedup with this "parallelization" approach. Everything seems to run serialized: the ids are printed in series (in order), and between every print there is a pause of 2 seconds.
Probably I have misunderstood something. What is needed to execute this in parallel using lambdas (hopefully in a very condensed way)?
Lambdas themselves don't execute anything in parallel. Streams, however, are capable of doing so.
Take a look at the method Collection#parallelStream (documentation):
Arrays.asList(myArrayOfItems).parallelStream().forEach(...);
However, note that there is no guarantee of, or control over, when it will actually go parallel. From its documentation:
Returns a possibly parallel Stream with this collection as its source. It is allowable for this method to return a sequential stream.
The reason is simple. You really need a lot of elements in your collection (like millions), or expensive per-element work, for parallelization to actually pay off; the overhead introduced by parallelization is huge. Because of that, the method might choose to return a sequential stream instead, if it thinks that will be faster.
Before you think about using parallelism, you should actually set up some benchmarks to test whether it improves anything. There are many examples where people just blindly used it without noticing that they actually decreased the performance. Also see Should I always use a parallel stream when possible?.
You can check if a Stream is parallel by using Stream#isParallel (documentation).
If you use Stream#parallel (documentation) directly on a stream, you get a parallel version.
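Putting those two together, a quick sanity check could look like this (MyItem stands in for the element type of myArrayOfItems):

Stream<MyItem> stream = Arrays.asList(myArrayOfItems).stream().parallel();
System.out.println(stream.isParallel()); // prints true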
The method Collection.forEach() is just an iteration through all the elements. It is called internal iteration, as it leaves it up to the collection how to iterate, but it is still an iteration over all the elements.
If you want parallel processing, you have to:
Get a parallel stream from the collection.
Specify the operation(s) which will be done on the stream.
Do something with the result if you need to.
You may read the first part of my explanation here: https://stackoverflow.com/a/22942829/2886891
To create a parallel stream, invoke the .parallelStream() method on a Collection.
See https://docs.oracle.com/javase/tutorial/collections/streams/parallelism.html
Arrays.asList(myArrayOfItems).parallelStream().forEach(item -> {
    System.out.println("processing " + item.getId());
    item.process();
});
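If process() returned a value instead of being void, the same three steps could also produce a collected result; a hypothetical sketch (Result stands in for that return type):

List<Result> results = Arrays.asList(myArrayOfItems)
    .parallelStream()                 // 1. get a parallel stream
    .map(item -> item.process())      // 2. specify the operation(s)
    .collect(Collectors.toList());    // 3. do something with the result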

Returning multiple values with CompletableFuture.supplyAsync

I am writing a program to download historical quotes from a source. The source provides files over HTTP for each day, which need to be parsed and processed. The program downloads multiple files in parallel using CompletableFutures with different stages. The first stage is to make an HTTP call using HttpClient and get the response.
The getHttpResponse() method returns a CloseableHttpResponse object. I also want to return the URL for which this HTTP request was made. The simplest way is to have a wrapper object holding these 2 fields, but I feel it is too much to have a class just to contain these 2 fields. Is there a way with CompletableFuture or streams that I can achieve this?
filesToDownload.stream()
    .map(url -> CompletableFuture.supplyAsync(() -> this.getHttpResponse(url), this.executor))
    .map(httpResponseFuture -> httpResponseFuture.thenAccept(t -> processHttpResponse(t)))
    .count();
It's not clear why you want to bring in the Stream API at all costs. Splitting the CompletableFuture use into two map operations causes a problem that wouldn't exist otherwise. Besides that, using map for side effects is an abuse of the Stream API. This may break completely in Java 9, if filesToDownload is a Stream source with a known size (like almost every Collection): then, count() will simply return that known size, without ever processing the functions of the map operations…
If you want to pass the URL and the CloseableHttpResponse to processHttpResponse, you can do it as easily as:
filesToDownload.forEach(url ->
    CompletableFuture.supplyAsync(() -> this.getHttpResponse(url), this.executor)
        .thenAccept(t -> processHttpResponse(t, url))
);
Even if you use the Stream API to collect results, there is no reason to split the CompletableFuture use into multiple map operations:
List<…> result = filesToDownload.stream()
    .map(url -> CompletableFuture.supplyAsync(() -> this.getHttpResponse(url), this.executor)
        .thenApply(t -> processHttpResponse(t, url)))
    .collect(Collectors.toList())
    .stream()
    .map(CompletableFuture::join)
    .collect(Collectors.toList());
Note that this will collect the CompletableFutures into a List before waiting for any result in a second stream operation. This is preferable to using a parallel stream operation, as it ensures that all asynchronous operations have been submitted before starting to wait.
Using a single stream pipeline would imply waiting for the completion of the first job before even submitting the second, and using a parallel stream would only reduce that problem instead of solving it. It would depend on the execution strategy of the stream implementation (the default fork/join pool), which interferes with the actual policy of your specified executor. E.g., if the specified executor is supposed to use more threads than there are CPU cores, the stream would still submit only as many jobs at a time as there are cores, or even fewer if there are other jobs on the default fork/join pool.
In contrast, the behavior of the solution above will be entirely controlled by the execution strategy of the specified executor.
