Using ForkJoinPool on a set of documents - java

I have never used a ForkJoinPool and I came across this code snippet.
I have a Set<Document> docs. Document has a write method. If I do the following, do I need to call get or join to ensure that all the docs in the set have correctly finished their write method?
ForkJoinPool pool = new ForkJoinPool(concurrencyLevel);
pool.submit(() -> docs.parallelStream().forEach(
    doc -> {
        doc.write();
    })
);
What happens if one of the docs is unable to complete its write? Say it throws an exception. Does the given code wait for all the docs to complete their write operation?

ForkJoinPool.submit(Runnable) returns a ForkJoinTask representing the pending completion of the task. If you want to wait for all documents to be processed, you need some form of synchronization with that task, like calling its get() method (from the Future interface).
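As a minimal sketch (assuming docs and concurrencyLevel are defined as in your snippet), that synchronization could look like this:
ForkJoinPool pool = new ForkJoinPool(concurrencyLevel);
ForkJoinTask<?> task = pool.submit(() -> docs.parallelStream().forEach(Document::write));
task.get(); // blocks until the whole parallel forEach has finished, or rethrows a failure wrapped in an ExecutionException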
Concerning the exception handling, as usual, any exception during the stream processing will stop it. However, you have to refer to the documentation of Stream.forEach(Consumer):
The behavior of this operation is explicitly nondeterministic. For parallel stream pipelines, this operation does not guarantee to respect the encounter order of the stream, as doing so would sacrifice the benefit of parallelism. For any given element, the action may be performed at whatever time and in whatever thread the library chooses. […]
This means that you have no guarantee of which documents will be written if an exception occurs. The processing will stop, but you cannot control which documents will still be processed.
If you want to make sure that the remaining documents are processed, I would suggest 2 solutions:
surround the document.write() call with a try/catch so that no exception propagates, but this makes it difficult to check which document succeeded or whether there was any failure at all (a sketch follows this list); or
use another facility to manage your parallel processing, such as the CompletableFuture API. As noted in the comments, your current solution is a hack that works thanks to implementation details, so it would be preferable to do something cleaner.
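For the first option, a minimal sketch (the failures map is purely illustrative, not part of your original code):
Map<Document, Exception> failures = new ConcurrentHashMap<>();
pool.submit(() -> docs.parallelStream().forEach(doc -> {
    try {
        doc.write();
    } catch (Exception e) {
        failures.put(doc, e); // record the failure instead of letting it stop the stream
    }
})).get(); // still wait for the submitted task to finish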
Using CompletableFuture, you could do it as follows:
List<CompletableFuture<Void>> futures = docs.stream()
    .map(doc -> CompletableFuture.runAsync(doc::write, pool))
    .collect(Collectors.toList());
This will make sure that all documents are processed, and you can then inspect each future in the returned list for success or failure.
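One possible way to wait for all of them and then check for failures (a sketch, not the only option):
CompletableFuture<Void> all = CompletableFuture.allOf(futures.toArray(new CompletableFuture[0]));
all.exceptionally(ex -> null).join(); // wait for every write, even if some of them failed
for (CompletableFuture<Void> future : futures) {
    if (future.isCompletedExceptionally()) {
        // this document's write() threw; log it or retry it here
    }
}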

Related

How does Apache NIO HttpAsyncClient perform non-blocking HTTP requests?

How is Apache NIO HttpAsyncClient able to wait for a remote response without blocking any thread? Does it have a way to set up a callback with the OS (I doubt it)? Otherwise, does it perform some sort of polling?
EDIT - THIS ANSWER IS WRONG. PLEASE IGNORE AS IT IS INCORRECT.
You did not specify a version, so I cannot point you to source code. But to answer your question, the way that Apache does it is by returning a Future<T>.
Take a look at this link -- https://hc.apache.org/httpcomponents-asyncclient-4.1.x/current/httpasyncclient/apidocs/org/apache/http/nio/client/HttpAsyncClient.html
Notice how the link says nio in the package. That stands for "non-blocking IO". And 9 times out of 10, that is done by doing some work with a new thread.
This operates almost exactly like a CompletableFuture<T> from your first question. Long story short, the library kicks off the process in a new thread (just like CompletableFuture<T>), stores that thread into the Future<T>, then allows you to use that Future<T> to manage that newly created thread containing your non-blocking task. By doing this, you get to decide exactly when and where the code blocks, potentially giving you the chance to make some significant performance optimizations.
To be more explicit, let's give a pseudocode example. Let's say I have a method attached to an endpoint. Whenever the endpoint is hit, the method is executed. The method takes in a single parameter --- userID. I then use that userID to perform 2 operations --- fetch the user's personal info, and fetch the user's suggested content. I need both pieces, and neither request needs to wait for the other to finish before starting. So, what I do is something like the following.
public StoreFrontPage visitStorePage(int userID)
{
    final Future<UserInfo> userInfoFuture = this.fetchUserInfo(userID);
    final Future<PageSuggestion> recommendedContentFuture = this.fetchRecommendedContent(userID);
    final UserInfo userInfo = userInfoFuture.get();
    final PageSuggestion recommendedContent = recommendedContentFuture.get();
    return new StoreFrontPage(userInfo, recommendedContent);
}
When I call this.fetchUserInfo(userID), my code creates a new thread, starts fetching user info on that new thread, but lets my main thread continue and kick off this.fetchRecommendedContent(userID) in the meantime. The two fetches occur in parallel.
However, I need both results in order to create my StoreFrontPage. So, when I decide that I cannot continue any further until I have the results from both fetches, I call Future::get on each of my fetches. What this method does is merge the new thread back into my original one. In short, it says "wait for that one thread you created to finish doing what it was doing, then output the result as a return value".
And to more explicitly answer your question, no, this tool does not require you to do anything involving callbacks or polling. All it does is give you a Future<T> and lets you decide when you need to block the thread to wait on that Future<T> to finish.
EDIT - THIS ANSWER IS WRONG. PLEASE IGNORE AS IT IS INCORRECT.

Difference between Futures(Guava)/CompletableFuture and Observable(RxJava) [duplicate]

I would like to know the difference between
CompletableFuture, Future and RxJava's Observable.
What I know is all are asynchronous but
Future.get() blocks the thread
CompletableFuture gives the callback methods
RxJava Observable --- similar to CompletableFuture with other benefits(not sure)
For example: if a client needs to make multiple service calls and we use (Java) Futures, Future.get() will be executed sequentially... I would like to know how it's better in RxJava.
And the documentation http://reactivex.io/intro.html says
It is difficult to use Futures to optimally compose conditional asynchronous execution flows (or impossible, since latencies of each request vary at runtime). This can be done, of course, but it quickly becomes complicated (and thus error-prone) or it prematurely blocks on Future.get(), which eliminates the benefit of asynchronous execution.
Really interested to know how RxJava solves this problem. I found it difficult to understand from the documentation.
Futures
Futures were introduced in Java 5 (2004). They're basically placeholders for a result of an operation that hasn't finished yet. Once the operation finishes, the Future will contain that result. For example, an operation can be a Runnable or Callable instance that is submitted to an ExecutorService. The submitter of the operation can use the Future object to check whether the operation isDone(), or wait for it to finish using the blocking get() method.
Example:
/**
 * A task that sleeps for a second, then returns 1
 **/
public static class MyCallable implements Callable<Integer> {
    @Override
    public Integer call() throws Exception {
        Thread.sleep(1000);
        return 1;
    }
}

public static void main(String[] args) throws Exception {
    ExecutorService exec = Executors.newSingleThreadExecutor();
    Future<Integer> f = exec.submit(new MyCallable());
    System.out.println(f.isDone()); // False
    System.out.println(f.get()); // Waits until the task is done, then prints 1
}
CompletableFutures
CompletableFutures were introduced in Java 8 (2014). They are in fact an evolution of regular Futures, inspired by Google's Listenable Futures, part of the Guava library. They are Futures that also allow you to string tasks together in a chain. You can use them to tell some worker thread to "go do some task X, and when you're done, go do this other thing using the result of X". Using CompletableFutures, you can do something with the result of the operation without actually blocking a thread to wait for the result. Here's a simple example:
/**
 * A supplier that sleeps for a second, and then returns one
 **/
public static class MySupplier implements Supplier<Integer> {
    @Override
    public Integer get() {
        try {
            Thread.sleep(1000);
        } catch (InterruptedException e) {
            // Do nothing
        }
        return 1;
    }
}

/**
 * A (pure) function that adds one to a given Integer
 **/
public static class PlusOne implements Function<Integer, Integer> {
    @Override
    public Integer apply(Integer x) {
        return x + 1;
    }
}

public static void main(String[] args) throws Exception {
    ExecutorService exec = Executors.newSingleThreadExecutor();
    CompletableFuture<Integer> f = CompletableFuture.supplyAsync(new MySupplier(), exec);
    System.out.println(f.isDone()); // False
    CompletableFuture<Integer> f2 = f.thenApply(new PlusOne());
    System.out.println(f2.get()); // Waits until the "calculation" is done, then prints 2
}
RxJava
RxJava is a whole library for reactive programming created at Netflix. At a glance, it will appear to be similar to Java 8's streams. It is, except it's much more powerful.
Similarly to Futures, RxJava can be used to string together a bunch of synchronous or asynchronous actions to create a processing pipeline. Unlike Futures, which are single-use, RxJava works on streams of zero or more items, including never-ending streams with an infinite number of items. It's also much more flexible and powerful thanks to an unbelievably rich set of operators.
Unlike Java 8's streams, RxJava also has a backpressure mechanism, which allows it to handle cases in which different parts of your processing pipeline operate in different threads, at different rates.
The downside of RxJava is that despite the solid documentation, it is a challenging library to learn due to the paradigm shift involved. Rx code can also be a nightmare to debug, especially if multiple threads are involved, and even worse - if backpressure is needed.
If you want to get into it, there's a whole page of various tutorials on the official website, plus the official documentation and Javadoc. You can also take a look at some of the videos such as this one which gives a brief intro into Rx and also talks about the differences between Rx and Futures.
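For comparison with the examples above, here is a minimal sketch of the same "sleep, return 1, add one" pipeline written with RxJava 2 (the class name is just for illustration):
import io.reactivex.Observable;
import io.reactivex.schedulers.Schedulers;

public class RxExample {
    public static void main(String[] args) throws InterruptedException {
        Observable.fromCallable(() -> {
                    Thread.sleep(1000); // pretend to do some work
                    return 1;
                })
                .subscribeOn(Schedulers.computation()) // run the work off the calling thread
                .map(x -> x + 1) // same role as thenApply(new PlusOne())
                .subscribe(result -> System.out.println(result)); // prints 2
        Thread.sleep(1500); // keep the JVM alive long enough for the async result
    }
}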
Bonus: Java 9 Reactive Streams
Java 9's Reactive Streams, aka the Flow API, are a set of interfaces implemented by various reactive-streams libraries such as RxJava 2, Akka Streams, and Vert.x. They allow these reactive libraries to interconnect, while preserving the all-important backpressure.
I have been working with RxJava since 0.9; it is now at 1.3.2 and I will soon migrate to 2.x. I use it in a private project I have been working on for 8 years.
I wouldn't program without this library at all anymore. In the beginning I was skeptical, but it is a completely different state of mind you need to develop. Quite difficult in the beginning; I sometimes stared at the marbles for hours.. lol
It is just a matter of practice and really getting to know the flow (aka the contract of observables and observers); once you get there, you'll hate doing it any other way.
For me there is not really a downside to that library.
Use case:
I have a monitor view that contains 9 gauges (CPU, memory, network, etc...). When starting up, the view subscribes itself to a system monitor class that returns an observable (interval) that contains all the data for the 9 meters.
It will push a new result to the view each second (so no polling !!!).
That observable uses flatMap to simultaneously (async!) fetch data from 9 different sources and zips the results into a new model your view will get in onNext().
How the hell are you going to do that with futures, completables, etc.? Good luck! :)
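A rough sketch of what that kind of pipeline can look like in RxJava 2 (the source methods stand in for the real system-monitor calls, and only three of the nine gauges are shown):
import io.reactivex.Observable;
import io.reactivex.schedulers.Schedulers;
import java.util.concurrent.TimeUnit;

public class MonitorSketch {
    // Stand-ins for the real asynchronous data sources
    static Observable<Double> cpu() { return Observable.fromCallable(Math::random).subscribeOn(Schedulers.io()); }
    static Observable<Double> mem() { return Observable.fromCallable(Math::random).subscribeOn(Schedulers.io()); }
    static Observable<Double> net() { return Observable.fromCallable(Math::random).subscribeOn(Schedulers.io()); }

    public static void main(String[] args) throws InterruptedException {
        Observable.interval(1, TimeUnit.SECONDS) // push a tick every second; the view never polls
                .flatMap(tick -> Observable.zip( // fetch all sources concurrently for each tick
                        cpu(), mem(), net(),
                        (c, m, n) -> String.format("cpu=%.2f mem=%.2f net=%.2f", c, m, n)))
                .subscribe(System.out::println); // the view's onNext
        Thread.sleep(5000); // keep the demo running for a few ticks
    }
}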
RxJava solves many issues in programming for me and in a way makes things a lot easier...
Advantages:
Stateless !!! (important thing to mention, maybe the most important)
Thread management out of the box
Build sequences that have their own lifecycle
Everything is an observable, so chaining is easy
Less code to write
Single jar on classpath (very lightweight)
Highly concurrent
No callback hell anymore
Subscriber based (tight contract between consumer and producer)
Backpressure strategies (akin to a circuit breaker)
Splendid error handling and recovering
Very nice documentation (marbles <3)
Complete control
Many more ...
Disadvantages:
- Hard to test
Java's Future is a placeholder to hold something that will be completed in the future, with a blocking API. You'll have to use its isDone() method to poll it periodically to check whether the task is finished. Certainly you can implement your own asynchronous code to manage the polling logic, but it incurs more boilerplate code and debugging overhead.
Java's CompletableFuture was inspired by Scala's Future. It carries an internal callback method. Once it is finished, the callback method is triggered and tells the thread that the downstream operation should be executed. That's why it has the thenApply method, to do further operations on the object wrapped in the CompletableFuture.
RxJava's Observable is an enhanced version of CompletableFuture. It allows you to handle backpressure. With the thenApply method (and even its siblings like thenApplyAsync) mentioned above, this situation might happen: the downstream method wants to call an external service that might become unavailable sometimes. In this case, the CompletableFuture will fail completely and you will have to handle the error by yourself. However, Observable allows you to handle the backpressure and continue the execution once the external service becomes available.
In addition, there is an interface similar to Observable: Flowable. They are designed for different purposes. Usually Flowable is dedicated to handling cold and non-timed operations, while Observable is dedicated to handling executions requiring instant responses. See the official documentation here: https://github.com/ReactiveX/RxJava#backpressure
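To illustrate what backpressure handling looks like with Flowable, here is a small sketch with an intentionally slow consumer (the timings are arbitrary):
import io.reactivex.Flowable;
import io.reactivex.schedulers.Schedulers;
import java.util.concurrent.TimeUnit;

public class BackpressureSketch {
    public static void main(String[] args) throws InterruptedException {
        // A fast producer (one tick per millisecond) feeding a slow consumer.
        // onBackpressureDrop() states what should happen when the consumer cannot keep up,
        // instead of the stream failing with a MissingBackpressureException.
        Flowable.interval(1, TimeUnit.MILLISECONDS)
                .onBackpressureDrop()
                .observeOn(Schedulers.computation())
                .subscribe(tick -> {
                    Thread.sleep(100); // simulate slow processing
                    System.out.println("processed tick " + tick);
                });
        Thread.sleep(2000);
    }
}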
All three interfaces serve to transfer values from producer to consumer. Consumers can be of 2 kinds:
synchronous: consumer makes blocking call which returns when the value is ready
asynchronous: when the value is ready, a callback method of the consumer is called
Also, communication interfaces differ in other ways:
able to transfer a single value or multiple values
if multiple values, backpressure can be supported or not
As a result:
Future transfers a single value using a synchronous interface
CompletableFuture transfers a single value using both synchronous and asynchronous interfaces
Rx transfers multiple values using an asynchronous interface with backpressure
Also, all these communication facilities support transferring exceptions. This is not always the case. For example, BlockingQueue does not.
The main advantage of CompletableFuture over a plain Future is that CompletableFuture takes advantage of the extremely powerful stream-like API and gives you callback handlers to chain your tasks, which are absolutely absent if you use a plain Future. That, along with providing an asynchronous architecture, makes CompletableFuture the way to go for handling computation-heavy map-reduce tasks without worrying much about application performance.

How to manage LocalConverter and when to invoke shutDown()?

I wrote some code using the documents4j library to convert some documents from .docx to .pdf.
I followed the examples in the documentation and the conversion works perfectly using MS-Word, but I noticed that after all conversions complete and the methods return, the Java application keeps running and does not seem to exit.
If I explicitly close the converter using the execute() and shutDown() methods instead of schedule(), the application exits, but I need this application to run in concurrent mode, so I can't explicitly invoke shutDown(), otherwise MS-Word exits and breaks some still-open documents.
What is the best way to use the converter to achieve these objectives? Does LocalConverter have a method to check whether there is a queue of documents waiting to be converted? With this information I could invoke shutDown() only when the queue is empty and instantiate a new LocalConverter on the next conversion request.
Thanks in advance for your replies!
Dan
I am the maintainer of documents4j.
You are right, the LocalConverter does not currently await termination of the running conversions when it is shut down. I added a grace period that corresponds to the timeout for running conversions to finish, which will be included in the next version of documents4j. I will release a new version once I have looked into a pending issue with escaping paths in folders containing spaces.
In the meantime, I recommend you implement something similar yourself. Every conversion emits a Future. Simply collect all the futures in a Set and then call get on each future in a thread. If all futures have returned (i.e. all conversions are complete), it is safe to shut down the local converter:
IConverter converter = ...;
Set<Future<?>> futures = new HashSet<>();
for ( ... ) {
    futures.add(converter.from(...).to(...).schedule());
}
for (Future<?> future : futures) {
    future.get();
}
converter.shutDown();
The above is safe because all conversions are done concurrently while the main thread blocks until all futures have completed. Future::get blocks until its conversion has completed, but returns immediately if the conversion is already complete. This way you make sure you do not reach shutDown() before all conversions are complete.

What is the purpose of java.util.concurrent.CompletableFuture#allOf?

If I have a Collection<CompletableFuture<MyResult>>, I expect to convert this into CompletableFuture<Collection<MyResult>>. So after conversion I have only one future and can easily write business logic on the MyResult collection using CompletableFuture methods like thenApply, thenAccept, etc. But CompletableFuture#allOf has result type Void, so after invoking it I get "no results". E.g. I cannot retrieve (as I understand it) any results from the returned future that correspond to the Collection<CompletableFuture<MyResult>>.
I suspect that CompletableFuture#allOf just returns a Future which is completed after all futures in the collection. So I can invoke CompletableFuture#allOf(...).isDone() and then manually (!) in a loop transform Collection<CompletableFuture> to CompletableFuture<Collection>. Is my assumption right?
Yes, the allOf method does not supply data, but it does signal that all futures have completed. This eliminates the need for the more cumbersome countdown-latch approach. The expectation is that you would then convert the completed futures back into a usable Collection to apply your business logic. See this question for implementation details. A great discussion of this topic is available at this blog post.
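The usual pattern (sketched here; this is not code from the linked posts) is to combine allOf with a second step that joins the already-completed futures:
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

public class AllOfSequence {
    // Turns a List<CompletableFuture<T>> into a CompletableFuture<List<T>>:
    // allOf only signals completion, so thenApply collects the values afterwards.
    static <T> CompletableFuture<List<T>> sequence(List<CompletableFuture<T>> futures) {
        return CompletableFuture
                .allOf(futures.toArray(new CompletableFuture[0]))
                .thenApply(ignored -> futures.stream()
                        .map(CompletableFuture::join) // join() does not block here; everything is already done
                        .collect(Collectors.toList()));
    }
}
Calling join() inside thenApply is safe because allOf guarantees every future has already completed; a future that completed exceptionally will make join() throw a CompletionException, which you may want to handle.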
If you need CompletableFuture<Collection<MyResult>> as a result, you can get it by using the allAsList method in https://github.com/spotify/completable-futures (Spotify's completable-futures library). CompletableFutures.allAsList(List<CompletableFuture<MyResult>>) will give you CompletableFuture<List<MyResult>>.

Java 8 Streams, perform part of the work as parallel stream and the other as sequential stream

I have a stream performing a series of operations in parallel; then I have to write the results into a file, so I need the writing operation to be sequential, but it needs to be performed as part of the stream. I have lots of data to write, so I cannot use an intermediate collection.
Is there a way to do that?
I thought of a solution that doesn't seem very clean, which is to make the writing method synchronized. Is this approach the only one possible? Is there some other way?
Thank you.
There is no need to turn a Stream to sequential in order to perform a sequential terminal operation. See, for example, the documentation of Stream.forEachOrdered:
This operation processes the elements one at a time, in encounter order if one exists. Performing the action for one element happens-before performing the action for subsequent elements, but for any given element, the action may be performed in whatever thread the library chooses.
In other words, the action may be called by different threads as a result of the parallel processing of the previous steps, but it is guaranteed to be thread safe, does not call the action concurrently and even maintains the order if the stream was ordered.
So there is no need to do any additional synchronization if you use forEachOrdered to perform the write to a file as the terminal operation.
Similar guarantees apply to other terminal operations regarding certain functions. It's critical to study the documentation of the specific operation to understand which guarantees are made and what is left unspecified.
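As a sketch of what that can look like (expensiveComputation and the file name are just stand-ins for the real work):
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.IntStream;

public class ParallelComputeSequentialWrite {
    public static void main(String[] args) throws IOException {
        try (Writer out = Files.newBufferedWriter(Paths.get("results.txt"))) {
            IntStream.rangeClosed(1, 1000)
                    .parallel() // the heavy per-element work runs in parallel
                    .mapToObj(ParallelComputeSequentialWrite::expensiveComputation)
                    .forEachOrdered(line -> { // writing happens one element at a time, in encounter order
                        try {
                            out.write(line);
                            out.write(System.lineSeparator());
                        } catch (IOException e) {
                            throw new UncheckedIOException(e);
                        }
                    });
        }
    }

    // Stand-in for the real per-element computation
    static String expensiveComputation(int i) {
        return "result " + i * i;
    }
}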
