I have 1000 big files to be processed in order as mentioned below:
1. First, those files need to be copied to a different directory in parallel. I am planning to use an ExecutorService with 10 threads to achieve this.
2. As soon as any file has been copied to the other location (#1), I will submit that file for further processing to an ExecutorService with 10 threads.
3. Finally, another action needs to be performed on these files in parallel, so #2 gets its input from #1 and #3 gets its input from #2.
Now, I can use CompletionService here, so I can process the thread results from #1 to #2 and #2 to #3 in the order they are getting completed. CompletableFuture says we can chain asynchronous tasks together which sounds like something I can use in this case.
I am not sure whether I should implement my solution with CompletableFuture (since it is relatively new and ought to be better) or whether CompletionService is sufficient. And why should I choose one over the other in this case?
It would probably be best to try both approaches and then choose the one you are more comfortable with. That said, CompletableFuture sounds better suited to this task because it makes chaining processing steps (stages) really easy. For example, in your case the code could look like this:
ExecutorService copyingExecutor = ...
// Not clear from the requirements, but let's assume you have
// a separate executor for this
ExecutorService processingExecutor = ...
public CompletableFuture<MyResult> process(Path file) {
    return CompletableFuture
        .supplyAsync(
            () -> {
                // Retrieve destination path where file should be copied to
                Path destination = ...
                try {
                    Files.copy(file, destination);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
                return destination;
            },
            copyingExecutor
        )
        .thenApplyAsync(
            copiedFile -> {
                // Process the copied file
                ...
            },
            processingExecutor
        )
        // This separate stage does not make much sense, so unless you have
        // yet another executor for this or this stage is applied at a different
        // location in your code, it should probably be merged with the
        // previous stage
        .thenApply(
            previousResult -> {
                // Process the previous result
                ...
            }
        );
}
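To apply such a pipeline to many files and wait for all of them, one option is to build one CompletableFuture per file and join them at the end. The following is only a minimal sketch under assumptions: the class name FilePipelineSketch, the helper methods, the pool sizes and the "processing" step (which merely reads the file size) are placeholders invented for illustration, not part of the original requirements.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

class FilePipelineSketch {

    // Stage 1: copy the file into targetDir and return the path of the copy.
    static Path copy(Path file, Path targetDir) {
        try {
            return Files.copy(file, targetDir.resolve(file.getFileName()));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Stage 2 (placeholder): "processing" here only reads the file size.
    static long process(Path copiedFile) {
        try {
            return Files.size(copiedFile);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Starts copy -> process for every file, then blocks until all pipelines finish.
    static List<Long> runAll(List<Path> files, Path targetDir) {
        ExecutorService copyingExecutor = Executors.newFixedThreadPool(10);
        ExecutorService processingExecutor = Executors.newFixedThreadPool(10);
        try {
            List<CompletableFuture<Long>> pipelines = files.stream()
                    .map(file -> CompletableFuture
                            .supplyAsync(() -> copy(file, targetDir), copyingExecutor)
                            .thenApplyAsync(FilePipelineSketch::process, processingExecutor))
                    .collect(Collectors.toList());
            // join() rethrows a failed stage wrapped in a CompletionException.
            return pipelines.stream()
                    .map(CompletableFuture::join)
                    .collect(Collectors.toList());
        } finally {
            copyingExecutor.shutdown();
            processingExecutor.shutdown();
        }
    }

    // Demo with two small temporary files (5 and 2 bytes long).
    static List<Long> demo() {
        try {
            Path source = Files.createTempDirectory("src");
            Path target = Files.createTempDirectory("dst");
            Path a = Files.write(source.resolve("a.txt"), "hello".getBytes());
            Path b = Files.write(source.resolve("b.txt"), "hi".getBytes());
            return runAll(List.of(a, b), target);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [5, 2]
    }
}
```

Each file moves to the processing stage as soon as its own copy finishes, without waiting for the other copies. The results are collected in submission order, not completion order.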
I'm looking for some help since I don't know how to optimize a process.
I have to invoke a service that returns a list with more than 500K elements (I don't know why; these services belong to the client). For each element of the list, I have to invoke 2 more services and then save some attributes in our database. This last step is not the problem, but the entire process takes between 1 and 2 seconds per element, so at this rate it is going to take more than 100 hours to complete.
My approach is the following: in my main method I get the large list, then I use a parallelStream to iterate over its elements, and for each one I use a CompletableFuture to call the method that invokes the 2 services mentioned above. I've tried changing the parallelStream to a stream and to a for-each loop, splitting the main list into smaller lists, and many other things, but I don't see better performance. I think the problem is the invocation of those 2 services, but I want to try my luck asking here.
I'm using java 11, spring, and for the invocation of the services I'm using RestTemplate, and this is my code:
public void updateDiscount() {
//List with 500k elements
var relationshipList = relationshipService.getLargeList();
//CompletableFuture to make the async calls to the method above
relationshipList.parallelStream().forEach(level1 -> {
CompletableFuture.runAsync(() -> relationshipService.asyncDiscountSave(level1));
});
}
//Second class
@Async("nameOfThePool")
public void asyncDiscountSave(ElementOfList element) {
//Logic to create request
//.........
var responseClients = anotherClass.getClients(element.getGroup1()); //get the first response with restTemplate
var responseProducts = anotherClass.getProducts(element.getGroup2()); //get the second response with restTemplate
for (var client : responseClients) {
for (var product : responseProducts) {
//Here we just save some attributes of these objects on our DB
}
}
}
Thanks for the help.
UPDATE:
For this particular case, the only improvement that I can do is to pass a thread pool to the completable future, the problem is the response time of the services that I need to invoke.
I decided to follow a second approach and it took like 5 hours to complete, compared with the first approach this is acceptable.
Since you haven't specified an executor, you are using the default pool (the common ForkJoinPool). Passing your own executor lets you create as many threads as you need, within what the server's resources can handle:
public void updateDiscount() {
Executor executor = Executors.newFixedThreadPool( 100 );//Define the number according to server resources performance
//List with 500k elements
var relationshipList = relationshipService.getLargeList();
//CompletableFuture to make the async calls to the method above
relationshipList.parallelStream().forEach(level1 -> {
CompletableFuture.runAsync(() -> relationshipService.asyncDiscountSave(level1), executor);
});
}
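Note that runAsync returns immediately, so a method written this way finishes before the work does. If the caller needs to know when every element has been handled, the futures can be collected and joined. The sketch below only illustrates that idea: BoundedAsyncSketch, slowSave and the 20 ms sleep are hypothetical stand-ins for the real service calls, and the pool size is an arbitrary assumption.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class BoundedAsyncSketch {

    // Hypothetical stand-in for asyncDiscountSave: simulates a slow remote call.
    static void slowSave(int element, AtomicInteger saved) {
        try {
            Thread.sleep(20);
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            return;
        }
        saved.incrementAndGet();
    }

    // Runs slowSave for every element on a fixed pool and waits for all of them.
    static int processAll(List<Integer> elements, int poolSize) {
        ExecutorService executor = Executors.newFixedThreadPool(poolSize);
        AtomicInteger saved = new AtomicInteger();
        try {
            CompletableFuture<?>[] futures = elements.stream()
                    .map(element -> CompletableFuture.runAsync(() -> slowSave(element, saved), executor))
                    .toArray(CompletableFuture[]::new);
            // Block until every task has completed (or one has failed).
            CompletableFuture.allOf(futures).join();
            return saved.get();
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Integer> elements = IntStream.range(0, 200).boxed().collect(Collectors.toList());
        System.out.println(processAll(elements, 50)); // prints 200
    }
}
```

Because the tasks spend their time waiting on I/O, the pool can reasonably be larger than the number of CPU cores. The right size depends on what the remote services and the server tolerate.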
SomeLibrary lib = new SomeLibrary();
lib.doSomethingAsync(); // some function from a library I got and what it does is print 1-5 asynchronously
System.out.println("Done");
// output
// Done
// 1
// 2
// 3
// 4
// 5
I want to be clear that I didn't make the doSomethingAsync() function and it's out of my ability to change it. I want to find a way to block this async function and print Done after the numbers 1 to 5 because as you see Done is being instantly printed. Is there a way to do this in Java?
You can use CountDownLatch as follows:
final CountDownLatch wait = new CountDownLatch(1);
SomeLibrary lib = new SomeLibrary(wait);
lib.doSomethingAsync(); // some function from a library I got and what it does is print 1-5 asynchronously
//NOTE in the doSomethingAsync, you must call wait.countDown() before return
wait.await(); //-> it wait in here until wait.countDown() is called.
System.out.println("Done");
In the SomeLibrary constructor:
private CountDownLatch wait;
public SomeLibrary(CountDownLatch _wait) {
this.wait = _wait;
}
In method doSomethingAsync():
public void doSomethingAsync(){
//TODO something
...
this.wait.countDown();
return;
}
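As a self-contained illustration of the latch mechanics, here is a minimal sketch in which a plain worker thread stands in for the library's asynchronous work. LatchSketch and its contents are made up for the example. It still assumes that the asynchronous code itself can call countDown(), which is the precondition of this approach.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;

class LatchSketch {

    // Records 1..5 on a background thread, then "Done" once the latch has opened.
    static List<String> run() {
        List<String> output = new CopyOnWriteArrayList<>();
        CountDownLatch done = new CountDownLatch(1);

        // Stand-in for the asynchronous library call.
        Thread worker = new Thread(() -> {
            for (int i = 1; i <= 5; i++) {
                output.add(String.valueOf(i));
            }
            done.countDown(); // signal completion as the very last step
        });
        worker.start();

        try {
            done.await(); // blocks until countDown() has been called
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        output.add("Done");
        return output;
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints [1, 2, 3, 4, 5, Done]
    }
}
```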
This is achieved in a couple of ways in standard libraries :-
Completion Callback
Clients can often provide a function to be invoked after the async task is complete. This function usually receives some information about the work done as its input.
Future.get()
Async functions return Future for client synchronization. You can read more about them here.
Do check whether any of these options is available (perhaps as an overloaded version?) in the method you wish to invoke. It is not uncommon for libraries to include both sync and async versions of some business logic, so you could search for that too.
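If the library does offer a completion callback, the two options can be combined: complete a CompletableFuture from inside the callback and block on it. The following is a sketch only. computeAsync is a hypothetical callback-style API invented for the example, not a method of any real library.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

class CallbackSketch {

    // Hypothetical async API: does its work on another thread, then invokes the callback.
    static void computeAsync(int input, Consumer<Integer> onComplete) {
        new Thread(() -> onComplete.accept(input * 2)).start();
    }

    // Turns the callback-style call into a blocking one.
    static int computeBlocking(int input) {
        CompletableFuture<Integer> result = new CompletableFuture<>();
        computeAsync(input, result::complete); // the callback completes the future
        return result.join();                  // wait here until the callback has run
    }

    public static void main(String[] args) {
        System.out.println(computeBlocking(21)); // prints 42
        System.out.println("Done");
    }
}
```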
Let's say I have this synchronous method:
public FruitBowl getFruitBowl() {
Apple apple = getApple(); // IO intensive
Banana banana = getBanana(); // CPU intensive
return new FruitBowl(apple, banana);
}
I can use the Java concurrency API to turn it into an async method, which would turn out somewhat like this:
public Future<FruitBowl> getFruitBowl() {
Future<Apple> appleFuture = getAppleAsync(); // IO intensive
Future<Banana> bananaFuture = getBananaAsync(); // CPU intensive
return createFruitBowlAsync(appleFuture, bananaFuture); // Awaits appleFuture and bananaFuture and then returns a new FruitBowl
}
What is the idiomatic Rx way of doing this while taking advantage of its schedulers (io and computation) and returning a Single?
You can use the zip operator, and schedule each of the async operations on a different thread. If you don't, the methods will be executed one after the other on the same thread.
I would create an Observable version of both methods, in order to return respectively Observable<Apple> and Observable<Banana> and use them in this way:
Observable.zip(getAppleObservable().subscribeOn(Schedulers.newThread()),
getBananaObservable().subscribeOn(Schedulers.newThread()),
(apple, banana) -> new FruitBowl(apple, banana))
.subscribe(/* do your work here with FruitBowl object */);
Here are more details about how to parallelize operations with the zip operator.
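For comparison, the same fan-in shape (two independent async sources combined by a function) can be written with the JDK's CompletableFuture.thenCombine. This is not RxJava and is offered only as a rough analogy. FruitBowlSketch, the two pools and the string values are assumptions made for the example.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class FruitBowlSketch {

    // Runs both suppliers concurrently on different pools and combines their results.
    static String buildBowl() {
        // Assumed split: a cached pool for blocking I/O, a core-sized pool for CPU work.
        ExecutorService ioPool = Executors.newCachedThreadPool();
        ExecutorService cpuPool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        try {
            CompletableFuture<String> apple =
                    CompletableFuture.supplyAsync(() -> "apple", ioPool);    // IO intensive
            CompletableFuture<String> banana =
                    CompletableFuture.supplyAsync(() -> "banana", cpuPool);  // CPU intensive
            // thenCombine plays the role of zip: it fires once both inputs are ready.
            return apple.thenCombine(banana, (a, b) -> a + "+" + b).join();
        } finally {
            ioPool.shutdown();
            cpuPool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(buildBowl()); // prints apple+banana
    }
}
```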
I am writing a parser for a website that has many pages (I call them IndexPages). Each page has a lot of links (about 300 to 400 links per IndexPage). I use Java's ExecutorService to invoke 12 Callables concurrently for one IndexPage. Each Callable just fires an HTTP request to one link and does some parsing and DB storing. When the first IndexPage is finished, the program progresses to the second IndexPage, and so on until no next IndexPage is found.
When running, it seems OK at first: I can observe the threads working and being scheduled well. Each link's parsing/storing takes just about 1 to 2 seconds.
But as time goes by, I observe that each Callable (parsing/storing) takes longer and longer. Take this picture for example: sometimes it takes 10 or more seconds to finish a Callable (the green bar is RUNNING, the purple bar is WAITING). And my PC bogs down; everything becomes sluggish.
This is my main algorithm :
ExecutorService executorService = Executors.newFixedThreadPool(12);
String indexUrl = // Set initial (1st page) IndexPage
while(true)
{
String nextPage = // parse next page in the indexUrl
Set<Callable<Void>> callables = new HashSet<>();
for(String url : getUrls(indexUrl))
{
Callable callable = new ParserCallable(url , … and some DAOs);
callables.add(callable);
}
try {
executorService.invokeAll(callables);
} catch (InterruptedException e) {
e.printStackTrace();
}
if (nextPage == null)
break;
indexUrl = nextPage;
} // true
executorService.shutdown();
The algorithm is simple and self-explanatory. I wonder what may cause this situation. Is there any way to prevent such performance degradation?
The CPU/Memory/Heap shows reasonable usage.
Environments , FYI.
==================== updated ====================
I've changed my implementation from ExecutorService to ForkJoinPool:
ForkJoinPool pool=new ForkJoinPool(12);
String indexUrl = // Set initial (1st page) IndexPage
while(true)
{
Set<Callable<Void>> callables = new HashSet<>();
for(String url : getUrls(indexUrl))
{
Callable callable = new ParserCallable(url , DAOs...);
callables.add(callable);
}
pool.invokeAll(callables);
String nextPage = // parse next page in this indexUrl
if (nextPage == null)
break;
indexUrl = nextPage;
} // true
It takes longer than the ExecutorService solution. ExecutorService takes about 2 hours to finish all pages, while ForkJoinPool takes 3 hours, and each Callable still takes longer and longer to complete (from 1 second to 5, 6 or even 10 seconds). I don't mind it taking longer; I just hope each job takes a constant time to finish (not longer and longer).
I am wondering whether creating a lot of (non-thread-safe) GregorianCalendar, Date and SimpleDateFormat objects in the parser causes some threading issue. But I don't reuse these objects or pass them between threads, so I still cannot find the reason.
Based on the heap you have a memory issue. ExecutorService.invokeAll collects all of the results of the Callable instances into a List and returns that List when they all complete. You may want to consider simply calling ExecutorService.submit since you don't seem to care about the results of each Callable.
I can't see why you need a Callable to parse your index pages, since your caller does not expect any result from ParserCallable. You would need a bit of exception handling, but that can still be managed with a Runnable.
Submitting a Callable gives you back a Future (a FutureTask under the hood), which is never used.
You should be able to improve the implementation by using a Runnable, which avoids this additional operation:
ExecutorService executor = Executors.newFixedThreadPool(12);
for(String url : getUrls(indexUrl)) {
Runnable worker = new ParserRunnable(url , … and some DAOs);
executor.execute(worker);
}
class ParserRunnable implements Runnable{
}
As I understand it, if you have 40 pages, each with ~300 URLs, you will create ~12,000 Callables? While that is probably not too many Callables, it is a lot of HTTP connections and database connections.
I think you should try using one Callable per page. You'll still gain a ton by running them in parallel. I don't know what you are using for the HTTP request, but you might be able to reuse system resources there instead of opening and closing 12,000 of them.
And especially for the DB. You'll have just 40 connections. You might even be able to be super efficient by collecting the ~300 records locally, then using a batch update.
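A rough sketch of the one-task-per-page idea follows. Everything in it is a placeholder: PerPageSketch, parse and saveBatch are hypothetical names, the real HTTP and database code is replaced by trivial stubs, and the pool size of 4 is arbitrary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PerPageSketch {

    // Hypothetical stub: fetch and parse one link into a record.
    static String parse(String url) {
        return "record:" + url;
    }

    // Hypothetical stub for one batched DB write; returns the number of rows written.
    static int saveBatch(List<String> records) {
        return records.size();
    }

    // One task per page: parse all of its links, then write them in a single batch.
    static int crawl(List<List<String>> linksPerPage) {
        ExecutorService executor = Executors.newFixedThreadPool(4);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (List<String> links : linksPerPage) {
                results.add(executor.submit(() -> {
                    List<String> records = new ArrayList<>();
                    for (String link : links) {
                        records.add(parse(link));
                    }
                    return saveBatch(records);
                }));
            }
            int total = 0;
            for (Future<Integer> result : results) {
                total += result.get(); // wait for each page task to finish
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(crawl(List.of(List.of("a", "b"), List.of("c")))); // prints 3
    }
}
```

This only works if the pages (or at least their URLs) are known up front. In the question each page is discovered from the previous one, so the loop that finds the next page would have to run ahead of the parsing.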
The only model that I can come up with for running multiple similar processes (SIMD) using
Java Futures (java.util.concurrent.Future<T>) is as follows:
class Job implements Callable<T> {
public T call() {
// ...
}
}
List<Job> jobs = // ...
List<Future<T>> futures = executorService.invokeAll(jobs);
for (Future<T> future : futures) {
T t = future.get();
// Do something with t ...
}
The problem with this model is that if job 0 takes a long time to complete, but jobs 1, 2, and 3 have already completed, the for loop will wait to get the return value from job 0.
Is there any model that allows me to get each Future result as it becomes available without just calling Future.isDone() and busy waiting (or calling Thread.sleep()) if none are ready yet?
You can try out the ExecutorCompletionService:
http://download.oracle.com/javase/1.5.0/docs/api/java/util/concurrent/ExecutorCompletionService.html
You would simply submit your tasks and call take until you've received all Futures.
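A minimal sketch of that pattern is shown below. The class name and the two jobs are invented for the example, and the latch exists only to make the "slow" job deterministically slower than the "fast" one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class CompletionOrderSketch {

    // Submits a slow job first and a fast job second, then collects results as they finish.
    static List<String> run() {
        ExecutorService executor = Executors.newFixedThreadPool(2);
        CompletionService<String> completion = new ExecutorCompletionService<>(executor);
        CountDownLatch releaseSlowJob = new CountDownLatch(1);
        try {
            // Job 0 is "slow": it cannot finish until the latch is released.
            completion.submit(() -> {
                releaseSlowJob.await();
                return "slow";
            });
            // Job 1 finishes right away.
            completion.submit(() -> "fast");

            List<String> order = new ArrayList<>();
            // take() blocks until some job is done, regardless of submission order.
            order.add(completion.take().get());
            releaseSlowJob.countDown(); // now let the slow job finish
            order.add(completion.take().get());
            return order;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints [fast, slow]
    }
}
```

The caller has to keep count of how many tasks were submitted, because take() would block forever once every result has been consumed.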
Consider using ListenableFuture from Guava. They let you basically add a continuation to execute when the future has completed.
Why don't you add what you want done to the job?
class Job implements Runnable {
public void run() {
// ...
T result = ....
// do something with the result.
}
}
That way it will process the result as soon as it is available, concurrently. ;)
A CompletionService can be polled for available results.
If all you want is the results as they become available however, we wrote an AsyncCompleter that abstracts away the detail of completion service usage. It lets you submit an Iterable<Callable<T>> of jobs and returns an Iterable<T> of results that blocks on next() and returns the results in completion order.