How to fully parallelize a Java8 stream through buffering? - java

I have some code like this:
Stream<Item> stream = listPaths.parallelStream().flatMap(path -> { ... })
I have also added this:
System.setProperty(
"java.util.concurrent.ForkJoinPool.common.parallelism",
String.valueOf(Runtime.getRuntime().availableProcessors() * 4));
Later I call stream.forEach(...)
However, I have found that on a machine with 32 cores, only 5 to 8 cores are utilized.
I believe what is happening is that the code inside flatMap() and the code inside forEach() both suffer from I/O latency against different external resources and return data in "fits and starts" -- a bad combination with the "pull" nature of streams.
Is there a simple way (idiomatic, not "go write your own 200 lines of code") to wrap the stream in some sort of "stream buffer" that would keep the source stream fully utilized (pulling at max threads) while feeding the forEach()?

The best way, in my opinion, is to use reactive streams. There are several ways to go about this:
Using Flowable
Using RxJava
Or thirdly, using Spring's Reactor framework
They all have scheduling mechanisms. Personally, I would be happy using any of them rather than writing 200 lines of code myself.
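
For comparison, here is a minimal sketch that uses only the JDK instead of the reactive libraries listed above. It uses CompletableFuture with one dedicated thread pool per stage, so the fetch side and the consume side are each limited by their own pool rather than by the common ForkJoinPool. The class name StagedPipeline, the fetch and consume methods, the pool sizes, and the simulated 50 ms latency are all hypothetical stand-ins, not the asker's code.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

final class StagedPipeline {
    // Hypothetical stand-in for the I/O-bound work inside flatMap().
    static String fetch(String path) {
        sleep(50);
        return "item:" + path;
    }

    // Hypothetical stand-in for the I/O-bound work inside forEach().
    static void consume(String item, ConcurrentLinkedQueue<String> sink) {
        sleep(50);
        sink.add(item);
    }

    static List<String> run(List<String> paths, int fetchThreads, int consumeThreads) {
        ExecutorService fetchPool = Executors.newFixedThreadPool(fetchThreads);
        ExecutorService consumePool = Executors.newFixedThreadPool(consumeThreads);
        ConcurrentLinkedQueue<String> sink = new ConcurrentLinkedQueue<>();
        try {
            // Submit every fetch up front; each result is handed to the consume
            // pool as soon as it arrives, so neither stage waits on the other.
            List<CompletableFuture<Void>> pending = paths.stream()
                .map(p -> CompletableFuture.supplyAsync(() -> fetch(p), fetchPool)
                    .thenAcceptAsync(item -> consume(item, sink), consumePool))
                .collect(Collectors.toList());
            // Block until both stages have finished for every path.
            pending.forEach(CompletableFuture::join);
        } finally {
            fetchPool.shutdown();
            consumePool.shutdown();
        }
        // Sorted only to make the result order deterministic for the caller.
        return sink.stream().sorted().collect(Collectors.toList());
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Because both stages are I/O-bound, the pools can be sized well above the core count; the right numbers depend on the external resources and are worth measuring.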

Related

Task Executor vs Java 8 parallel streaming

I can't find a specific answer to the line of investigation that we've been requested to take on.
I see that parallel streams may not perform well with a small number of threads, and that they apparently don't behave well when the DB blocks the next request while processing the current one.
However, I find that the overhead of implementing Task Executor vs Parallel Streams is huge: we've implemented a POC that takes care of our concurrency needs with just this one line of code:
List<Map<String, String>> listWithAllMaps = mappedValues.entrySet().parallelStream().map(e -> callPlugins(e))
.collect(Collectors.toList());
Whereas with Task Executor, we'd need to implement the Runnable interface and write some cumbersome code just to get the runnables to return the values we're reading from the DB instead of being void, leading to several hours, if not days, of coding and producing less maintainable, more bug-prone code.
However, our CTO is still reluctant to use parallel streams due to unforeseen issues that could come up down the road.
So the question is, in an environment where I need to make several concurrent read-only queries to a database, using different java-components/REST calls for each query: Is it preferable in any way to use Task Executor instead of parallel streaming? If so, why?
Use the TaskExecutor as an Executor for a CompletableFuture.
List<CompletableFuture<Map<String, String>>> futures = mappedValues.entrySet().stream().map(e -> CompletableFuture.supplyAsync(() -> callPlugins(e), taskExecutor)).collect(Collectors.toList());
List<Map<String, String>> listWithAllMaps = futures.stream().map(CompletableFuture::join).collect(Collectors.toList());
Not sure how this is cumbersome. Yes, it is a bit more code, but with the advantage that you can easily configure the TaskExecutor and increase the number of threads, the queue size, etc.
DISCLAIMER: Typed off the top of my head, so some minor things might be off with the code snippet.

Java parallelization using lambda functions

I have an array of some objects with the method process() that I want to run parallelized. And I wanted to try lambdas to achieve the parallelization. So I tried this:
Arrays.asList(myArrayOfItems).forEach(item->{
System.out.println("processing " + item.getId());
item.process();
});
Each process() call takes about 2 seconds. And I have noticed that there is still no speedup with the "parallelization" approach. It seems that everything is still running serialized. The ids are printed in series (ordered) and between every print there is a pause of 2 seconds.
Probably I have misunderstood something. What is needed to execute this in parallel using lambdas (hopefully in a very condensed way)?
Lambdas themselves don't execute anything in parallel. Streams are capable of doing this, though.
Take a look at the method Collection#parallelStream (documentation):
Arrays.asList(myArrayOfItems).parallelStream().forEach(...);
However, note that there is no guarantee of, or control over, when it will actually go parallel. From its documentation:
Returns a possibly parallel Stream with this collection as its source. It is allowable for this method to return a sequential stream.
The reason is simple. Parallelization introduces considerable overhead, so it only pays off when the collection holds a lot of elements (think millions) or when the work per element is heavy. That is why the contract leaves an implementation free to return a sequential stream if that would be faster.
Before you think about using parallelism, you should set up some benchmarks to test whether it improves anything. There are many examples where people blindly used it without noticing that they had actually decreased performance. Also see Should I always use a parallel stream when possible?.
You can check if a Stream is parallel by using Stream#isParallel (documentation).
If you use Stream#parallel (documentation) directly on a stream, you get a parallel version.
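
A small sketch of how one might check this, assuming nothing beyond the JDK. The class name ParallelCheck and the helper workerThreads are made up for illustration; the set of thread names it returns will vary from run to run and from machine to machine.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class ParallelCheck {
    // Returns the names of the threads that actually processed the elements.
    static Set<String> workerThreads(List<Integer> items) {
        Set<String> names = ConcurrentHashMap.newKeySet();
        items.parallelStream().forEach(i -> names.add(Thread.currentThread().getName()));
        return names;
    }

    public static void main(String[] args) {
        List<Integer> items = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8);
        System.out.println(items.stream().isParallel());            // false
        System.out.println(items.stream().parallel().isParallel()); // true
        System.out.println(items.parallelStream().isParallel());    // true
        // Thread names depend on the machine and the run.
        System.out.println(workerThreads(items));
    }
}
```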
The Collection.forEach() method just iterates over all the elements. It is called internal iteration because it leaves it up to the collection how to iterate, but it is still an iteration over all the elements.
If you want parallel processing, you have to:
Get a parallel stream from the collection.
Specify the operation(s) which will be done on the stream.
Do something with the result if you need to.
You may read the first part of my explanation here: https://stackoverflow.com/a/22942829/2886891
To create a parallel stream, invoke the operation .parallelStream on a Collection.
See https://docs.oracle.com/javase/tutorial/collections/streams/parallelism.html
Arrays.asList(myArrayOfItems).parallelStream().forEach(item->{
System.out.println("processing " + item.getId());
item.process();
});

Difference between parallel stream and CompletableFuture

In the book "Java 8 in Action" (by Urma, Fusco and Mycroft) they highlight that parallel streams internally use the common fork join pool and that, whilst this can be configured globally, e.g. using System.setProperty(...), it is not possible to specify a value for a single parallel stream.
I have since seen the workaround that involves running the parallel stream inside a custom made ForkJoinPool.
Later on in the book, they have an entire chapter dedicated to CompletableFuture, during which they have a case study where they compare the respective performance of using a parallelStream vs a CompletableFuture. It turns out their performance is very similar. They highlight the reason for this as being that both use the same common pool by default (and therefore the same number of threads).
They go on to show a solution and argue that the CompletableFuture is better in this circumstance as it can be configured to use a custom Executor, with a thread pool size of the user's choice. When they update the solution to utilise this, the performance is significantly improved.
This made me think - if one were to do the same for the parallel stream version using the workaround highlighted above, would the performance benefits be similar, and would the two approaches therefore become similar again in terms of performance? In this case, why would one choose the CompletableFuture over the parallel stream when it clearly takes more work on the developer's part.
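
The custom-pool workaround mentioned in the question is usually sketched along these lines. The class and method names are hypothetical, and the approach depends on how current JDKs implement parallel streams (a stream started inside a ForkJoinPool task runs in that pool) rather than on anything the Stream documentation promises.

```java
import java.util.List;
import java.util.concurrent.ForkJoinPool;

final class CustomPoolStream {
    static int sumInPool(List<Integer> values, int parallelism) {
        // A dedicated pool whose size is chosen per call, not globally.
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            // Submitting the terminal operation as a task makes the parallel
            // stream run in this pool instead of the common pool. This is
            // observed behavior, not a documented guarantee.
            return pool.submit(
                () -> values.parallelStream().mapToInt(Integer::intValue).sum()
            ).join();
        } finally {
            pool.shutdown();
        }
    }
}
```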
"In this case, why would one choose the CompletableFuture over the parallel stream when it clearly takes more work on the developer's part?"
IMHO this depends on the interface you are looking to support. Suppose you are looking to support an asynchronous API, e.g.
CompletableFuture<String> downloadHttp(URL url);
In this case, only a completable future makes sense because you may want to do something else unrelated while you wait for the data to come down.
On the other hand, parallelStream() is best for CPU-bound tasks where you want every task to perform a portion of some work, i.e. every thread is doing the same thing with different data. As you mention, it is also easier to use.

How to convert an AbstractOnSubscribe to an Operator with backpressure support in RxJava?

I extended AbstractOnSubscribe to create my own OnSubscribe to be used with Observable.create(OnSubscribe<T>). I named it OnSubscribeInputStreamToLines; it basically reads an InputStream and calls onNext for each line.
The thing is, I also need to do that with the InputStream being part of another Observable.
The easy solution would be to do the following:
Observable<InputStream> isObservable = ...;
isObservable
.flatMap(is -> Observable.create(new OnSubscribeInputStreamToLines(is)));
The thing is that would not be really efficient as it would create an Observable for each inputStream. I was thinking I may be able to do this using Observable.lift.
Is there a way so I can easily convert my OnSubscribeInputStreamToLines to an Operator ?
I'm mostly worried about backpressure issues, as I would call onNext for each line of an InputStream, and although AbstractOnSubscribe supports backpressure, I couldn't find an AbstractOperator that does the same.
Thanks
The distinction here is that your OnSubscribeInputStreamToLines is an entry point into the Observable world whereas lift is an in-between operator for an existing sequence. Besides, the whole throughput might be dominated by the IO operation behind InputStream or the string processing in the operation so I wouldn't worry about that thin wrapper.
AbstractOnSubscribe captures the generator-aspect of operators which helps you build backpressure-aware value emitters (cold sources generally) where you can draft out how, when and what values are emitted.
AbstractOperator, on the other hand, can't be generalized this way because Operators have more freedom for interacting with upstream values and downstream requests. They are highly customized to a specific task and have few, if any, points in common. They can be built from a set of primitives but that's it (I've written hundreds of them).
So don't be afraid of flatMapping over things.
Don't be bothered about creating another Observable for each InputStream. The overhead is probably not as large as you might think especially compared to overhead associated with lift.
I don't know the nature of the InputStreams you are consuming but you should probably consider Observable.using() to close those resources safely (on termination or unsubscription).
You are absolutely right to have hesitations about writing a backpressure supporting Operator. It is very tricky ground to be stepping on unless you are composing existing Operators.

Use cases of PipedInputStream and PipedOutputStream

What are use cases of Piped streams? Why just not read data into buffer and then write them out?
BlockingQueue or similar collections may serve you better; they are thread-safe, robust, and scale better.
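
As a rough illustration of the BlockingQueue alternative, here is a minimal hand-off between one producer thread and one consumer. The class name QueueHandoff, the capacity of 16, and the END sentinel are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class QueueHandoff {
    // Sentinel marking the end of the data; compared by identity, not equals().
    private static final String END = new String("<end>");

    static List<String> transfer(List<String> items) {
        // Bounded queue: put() blocks when the consumer falls behind.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(16);

        Thread producer = new Thread(() -> {
            try {
                for (String item : items) {
                    queue.put(item);
                }
                queue.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();

        // Consumer (this thread): take() blocks until an element is available.
        List<String> received = new ArrayList<>();
        try {
            for (String item = queue.take(); item != END; item = queue.take()) {
                received.add(item);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return received;
    }
}
```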
Pipes in Java IO provide the ability for two threads running in the same JVM to communicate. As such, pipes are a common source or destination of data.
This is useful if you have two long-running threads and one is set up to produce data and the other to consume it.
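
A minimal sketch of that producer/consumer setup, assuming line-oriented text data. PipeDemo and roundTrip are hypothetical names; the producer runs in its own thread and the calling thread acts as the consumer.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class PipeDemo {
    static List<String> roundTrip(List<String> lines) {
        try {
            PipedInputStream in = new PipedInputStream();
            PipedOutputStream out = new PipedOutputStream(in); // connect both ends

            // Producer thread: writes the lines, then closes its end to signal EOF.
            Thread producer = new Thread(() -> {
                try (PrintWriter writer = new PrintWriter(
                        new OutputStreamWriter(out, StandardCharsets.UTF_8))) {
                    for (String line : lines) {
                        writer.println(line);
                    }
                }
            });
            producer.start();

            // Consumer (this thread): reads until the producer closes the pipe.
            List<String> received = new ArrayList<>();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(in, StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    received.add(line);
                }
            }
            return received;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Closing the writer is what signals end-of-stream to the reader; if the producer thread dies without closing its end, the reader can fail with an IOException instead.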
As the other answers have said, they are designed for use between threads. In practice they are best avoided. I've used them once in 13 years and I wish I hadn't.
They are usually used for simultaneously reading and writing, usually by two different threads.
(Their design is quite bad. You can't switch threads at one end and then have that thread exit without disrupting the pipe.)
One advantage of using Piped streams is that they provide stream functionality in our code without compelling us to build new specialized streams.
For example, we can use pipes to create a simple logging facility for our application. We can send messages to the logging facility through an ordinary PrintWriter, and it can then do whatever processing or buffering is required before sending the message off to its final destination.
For more details, see: http://docstore.mik.ua/orelly/java/exp/ch08_01.htm
