Optimizing method with list of 500k+ elements - java

I'm looking for some help since I don't know how to optimize a process.
I have to invoke a service that returns a list with more than 500k elements (I don't know why; these services belong to the client). For each element of the list, I have to invoke 2 more services and then save some attributes in our database. This last step is not the problem, but the entire process takes between 1 and 2 seconds per element, so at that rate it is going to take more than 100 hours to complete.
My approach is the following: in my main method I get the large list, then I use a parallelStream to iterate over the elements of the list, and then I use a CompletableFuture to call the method that invokes the 2 services mentioned above. I've tried changing the parallelStream to stream and for-each, tried splitting the main list into smaller lists, and many other things, but I don't see better performance. I think the problem is the invocation of those 2 services, but I want to try my luck asking here.
I'm using Java 11 and Spring, and for the invocation of the services I'm using RestTemplate. This is my code:
public void updateDiscount() {
    // List with 500k elements
    var relationshipList = relationshipService.getLargeList();
    // CompletableFuture to make the async calls to the method below
    relationshipList.parallelStream().forEach(level1 -> {
        CompletableFuture.runAsync(() -> relationshipService.asyncDiscountSave(level1));
    });
}
// Second class
@Async("nameOfThePool")
public void asyncDiscountSave(ElementOfList element) {
    // Logic to create the request
    // .........
    var responseClients = anotherClass.getClients(element.getGroup1());   // get the first response with RestTemplate
    var responseProducts = anotherClass.getProducts(element.getGroup2()); // get the second response with RestTemplate
    for (var client : responseClients) {
        for (var product : responseProducts) {
            // Here we just save some attributes of these objects in our DB
        }
    }
}
Thanks for the help.
UPDATE:
For this particular case, the only improvement I could make was to pass a thread pool to the CompletableFuture; the real problem is the response time of the services I need to invoke.
I decided to follow a second approach, and it took about 5 hours to complete; compared with the first approach, this is acceptable.

As you haven't defined an executor, you are using the default pool. Adding an executor allows you to create as many threads as you need and as the server's resources can manage:
public void updateDiscount() {
    // Size the pool according to the server's resources
    Executor executor = Executors.newFixedThreadPool(100);
    // List with 500k elements
    var relationshipList = relationshipService.getLargeList();
    // CompletableFuture to make the async calls to the method above
    relationshipList.parallelStream().forEach(level1 -> {
        CompletableFuture.runAsync(() -> relationshipService.asyncDiscountSave(level1), executor);
    });
}
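Note that runAsync returns a CompletableFuture that the snippet above never waits on, so updateDiscount can return while work is still in flight. A minimal sketch of one way to wait for everything, reusing the relationshipService from the question (an illustration, not the answer's exact code):

Executor executor = Executors.newFixedThreadPool(100);
// collect the futures so we can join on all of them before returning
var futures = relationshipService.getLargeList().stream()
        .map(level1 -> CompletableFuture.runAsync(
                () -> relationshipService.asyncDiscountSave(level1), executor))
        .toArray(CompletableFuture[]::new);
CompletableFuture.allOf(futures).join(); // blocks until every task has finished

With 500k elements this keeps every future in memory at once, so processing the list in chunks may be worth considering.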

Related

how to convert Flux<pojo> to ArrayList<String>

In my Spring Boot service class, I have created the following code, which is not working as desired:
Service class:
Flux<Workspace> mWorkspace = webClient.get().uri(WORKSPACEID)
        .retrieve().bodyToFlux(Workspace.class);
ArrayList<String> newmWorkspace = new ArrayList();
newmWorkspace = mWorkspace.blockLast();
return newmWorkspace;
Could someone please help me convert this list of JSON values into an ArrayList?
JSON:
[
    {
        "id": "123abc"
    },
    {
        "id": "123abc"
    }
]
Why is the code not working as desired
mWorkspace is a publisher of one or many items of type Workspace.
Calling mWorkspace.blockLast() will get a Workspace from that publisher: an object of type Workspace, not of type ArrayList<String>.
That's why you get: Type mismatch: cannot convert from Workspace to ArrayList<String>.
Converting from Flux to an ArrayList
First of all, in reactive programming, a Flux is not meant to be blocked; the blockXxx methods are meant for testing purposes. If you find yourself using them, you may not need reactive logic at all.
In your service, you could try this:

// initialize the list
ArrayList<String> newmWorkspace = new ArrayList<>();
Flux<Workspace> mWorkspace = webClient.get().uri(WORKSPACEID)
        .retrieve().bodyToFlux(Workspace.class)
        .map(workspace -> {
            // feed the list
            newmWorkspace.add(workspace.getId());
            return workspace;
        });
// this line will trigger the publication of items, hence feeding the list
mWorkspace.subscribe();
Just in case you want to convert a JSON string to a POJO:

String responseAsJsonString = "[{\"id\": \"123abc\"},{\"id\": \"123cba\"}]";
Workspace[] workspaces = new ObjectMapper().readValue(responseAsJsonString, Workspace[].class);
You would usually want to avoid blocking in a non-blocking application. However, if you are migrating from blocking to non-blocking step by step (and not mixing blocking and non-blocking in your production code), or if you are on a servlet-stack app that only uses the WebFlux client, it should be fine.
With that being said, a Flux is a Publisher that represents an asynchronous sequence of 1..n emitted items. When you do a blockLast you wait until the last signal completes, which resolves to a Workspace object.
You want to collect each resolved item into a list and return that. For this purpose, there is a useful method called collectList, which does this job without blocking the stream. You can then block on the Mono<List<Workspace>> returned by this method to retrieve the list.
So this should give you the result you want:
List<Workspace> workspaceList = mWorkspace.collectList().block();
If you must use a blocking call in the reactive stack, then to avoid blocking the event loop you should subscribe to it on a different scheduler. For I/O purposes, use the boundedElastic scheduler. You almost never want to call block in a reactive stack; subscribe instead. Better yet, let WebFlux handle the subscription by returning the publisher from your controller (or handler).
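Putting it together, a minimal sketch of what the original service method could return (assuming Workspace exposes a getId() getter, as used in the snippets above):

List<String> ids = webClient.get().uri(WORKSPACEID)
        .retrieve()
        .bodyToFlux(Workspace.class)
        .map(Workspace::getId)  // extract each id
        .collectList()          // Mono<List<String>>
        .block();               // acceptable on a servlet stack; avoid on a reactive one
return new ArrayList<>(ids);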

Is it possible to block/wait on an already existing asynchronous function?

SomeLibrary lib = new SomeLibrary();
lib.doSomethingAsync(); // some function from a library I got and what it does is print 1-5 asynchronously
System.out.println("Done");
// output
// Done
// 1
// 2
// 3
// 4
// 5
I want to be clear that I didn't write the doSomethingAsync() function and it's out of my ability to change it. I want to find a way to block on this async function and print Done after the numbers 1 to 5, because as you can see, Done is printed immediately. Is there a way to do this in Java?
You can use a CountDownLatch as follows:

final CountDownLatch wait = new CountDownLatch(1);
SomeLibrary lib = new SomeLibrary(wait);
lib.doSomethingAsync(); // the library function that prints 1-5 asynchronously
// NOTE: doSomethingAsync must call wait.countDown() before it returns
wait.await(); // waits here until wait.countDown() is called
System.out.println("Done");
In the constructor of SomeLibrary:

private CountDownLatch wait;

public SomeLibrary(CountDownLatch _wait) {
    this.wait = _wait;
}
In the method doSomethingAsync():

public void doSomethingAsync() {
    // TODO something
    ...
    this.wait.countDown();
    return;
}
This is achieved in a couple of ways in standard libraries:
Completion callback
Clients can often provide a function to be invoked after the async task is complete. This function usually receives some information regarding the work done as its input.
Future.get()
Async functions return a Future for client synchronization. You can read more about them here.
Do check if any of these options is available (perhaps in an overloaded version?) in the method you wish to invoke, as sketched below. It is not too uncommon for libraries to include both sync and async versions of some business logic, so you could search for that too.
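For illustration only, here is a minimal sketch of the callback idea, assuming the library offered a hypothetical overload of doSomethingAsync that accepts a completion callback (the original function may not):

CompletableFuture<Void> done = new CompletableFuture<>();
// hypothetical overload taking a Runnable completion callback
lib.doSomethingAsync(() -> done.complete(null));
done.join(); // blocks until the callback fires
System.out.println("Done");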

how to run multiple synchronous functions asynchronously?

I am writing in Java on the Vert.x framework, and I have an architecture question regarding blocking code.
I have a JsonObject which consists of 10 objects, like so:
{
"system":"CD0",
"system":"CD1",
"system":"CD2",
"system":"CD3",
"system":"CD4",
"system":"CD5",
"system":"CD6",
"system":"CD7",
"system":"CD8",
"system":"CD9"
}
I also have a synchronous function which gets an object from the JsonObject and consumes a SOAP web service, sending the object to it.
The SOAP web service receives the content (e.g. CD0) and, after a few seconds, returns an enum.
I then want to take that returned enum value and save it in some sort of data structure (like a hash table).
What I ultimately want is a function that will iterate over all the JsonObject's objects and, for each one, run the blocking code in parallel.
I want it to run in parallel so that even if one of the calls to the function needs to wait 20 seconds, it won't block the other calls.
How can I do such a thing in Vert.x?
P.S.: I'd appreciate it if you corrected any mistakes I made.
Why not use RxJava and zip the separate calls? Vert.x has great support for RxJava too. Assuming that you are calling the same method 10 times with different String arguments and returning another String, you could do something like this:
private Single<String> callWs(String arg) {
    return Single.fromCallable(() -> {
        // DO CALL WS
        return "yourResult";
    });
}
and then just use it with some array of arguments:

String[] array = new String[10]; // get your arguments
List<Single<String>> wsCalls = new ArrayList<>();
for (String s : array) {
    wsCalls.add(callWs(s));
}
Single.zip(wsCalls, r -> r).subscribe(allYourResults -> {
    // do whatever you like with the results
});
More about zip function and reactive programming in general: reactivex.io
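One caveat worth adding: Single.fromCallable on its own runs the work on whatever thread subscribes, so for the ten calls to actually execute in parallel you would typically add a subscribeOn. A sketch, assuming RxJava's Schedulers.io():

private Single<String> callWs(String arg) {
    return Single.fromCallable(() -> {
        // DO CALL WS
        return "yourResult";
    }).subscribeOn(Schedulers.io()); // each subscription runs on an I/O thread
}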

Ideas on concurrent datastructure

I am not sure if I can put my question in the clearest fashion, but I will try my best.
Let's say I am retrieving some information from a third-party API. The retrieved information will be huge in size. For a performance gain, instead of retrieving all the info in one go, I will be retrieving it in a paged fashion (the API gives me that facility; it is basically an iterator). The return type is basically a list of objects.
My aim here is to process the information I have in hand (that includes comparing it, storing it in the DB, and many other operations) while I get the paged responses to the request.
My question to the expert community is: what data structure do you prefer in such a case? Also, does a framework like Spring Batch help with performance gains in such cases?
I know the question is a bit vague, but I am looking for general ideas, tips, and pointers.
In these cases, the data structure for me is java.util.concurrent.CompletionService.
For purposes of example, I'm going to assume a couple of additional constraints:
You want only one outstanding request to the remote server at a time
You want to process the results in order.
Here goes:
// a class that knows how to update the DB given a page of results
class DatabaseUpdater implements Callable<Object> { ... }

// a background thread to do the work
final CompletionService<Object> exec = new ExecutorCompletionService<>(
        Executors.newSingleThreadExecutor());

// first call
List<Object> results = ThirdPartyAPI.getPage( ... );

// start loading those results into the DB on the background thread
exec.submit(new DatabaseUpdater(results));

while ( you need to ) { // pseudocode: loop while more pages remain
    // another call to the remote service
    results = ThirdPartyAPI.getPage( ... );
    // wait for the existing work to complete
    exec.take().get();
    // send more work to the background thread
    exec.submit(new DatabaseUpdater(results));
}

// wait for the last task to complete
exec.take().get();
This is just a simple two-thread design. The first thread is responsible for getting data from the remote service, and the second is responsible for writing to the database.
Any exceptions thrown by DatabaseUpdater will be propagated to the main thread when the result is retrieved (via exec.take().get()).
Good luck.
In terms of doing the actual parallelism, one very useful construct in Java is the ThreadPoolExecutor. A rough sketch of what that might look like is this:
public class YourApp {
    static class Processor implements Runnable {
        Widget toProcess;

        public Processor(Widget toProcess) {
            this.toProcess = toProcess;
        }

        public void run() {
            // commit the Widget to the DB, etc.
        }
    }

    public static void main(String[] args) {
        ThreadPoolExecutor executor =
                new ThreadPoolExecutor(1, 10, 30,
                        TimeUnit.SECONDS,
                        new LinkedBlockingDeque<>());
        while (thereAreStillWidgets()) {
            ArrayList<Widget> widgets = doExpensiveDatabaseCall();
            for (Widget widget : widgets) {
                Processor processor = new Processor(widget);
                executor.execute(processor);
            }
        }
        executor.shutdown(); // let queued tasks drain, then stop the pool
    }
}
But as I said in a comment: calls to an external API are expensive. It's very likely that the best strategy is to pull all the Widget objects down from the API in one call and then process them in parallel once you've got them. Doing more API calls gives you the overhead of sending the data all the way from the server to you every time; it's probably best to pay that cost the fewest number of times you can.
Also, keep in mind that if you're doing DB operations, it's possible that your DB doesn't allow parallel writes, so you might get a slowdown there.

ExecutorService slows down, bogs down my PC

I am writing a parser for a website; it has many pages (I call them IndexPages). Each page has a lot of links (about 300 to 400 links per IndexPage). I use Java's ExecutorService to invoke 12 Callables concurrently for one IndexPage. Each Callable just fires an HTTP request to one link and does some parsing and DB-storing actions. When the first IndexPage is finished, the program progresses to the second IndexPage, until no next IndexPage is found.
When running, it seems OK; I can observe the threads working and being scheduled well. Each link's parsing/storing takes just about 1 to 2 seconds.
But as time goes by, I observed that each Callable (parsing/storing) takes longer and longer. Take this profiler picture for example: sometimes it takes 10 or more seconds to finish a Callable (the green bar is RUNNING, the purple bar is WAITING). And my PC bogs down; everything becomes sluggish.
This is my main algorithm:
ExecutorService executorService = Executors.newFixedThreadPool(12);
String indexUrl = // Set initial (1st page) IndexPage
while (true)
{
    String nextPage = // parse next page in the indexUrl
    Set<Callable<Void>> callables = new HashSet<>();
    for (String url : getUrls(indexUrl))
    {
        Callable<Void> callable = new ParserCallable(url, … and some DAOs);
        callables.add(callable);
    }
    try {
        executorService.invokeAll(callables);
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
    if (nextPage == null)
        break;
    indexUrl = nextPage;
} // true
executorService.shutdown();
The algorithm is simple and self-explanatory. I wonder what may cause such a situation? Is there any way to prevent this performance degradation?
The CPU/memory/heap usage looks reasonable.
==================== updated ====================
I've changed my implementation from ExecutorService to ForkJoinPool:
ForkJoinPool pool = new ForkJoinPool(12);
String indexUrl = // Set initial (1st page) IndexPage
while (true)
{
    Set<Callable<Void>> callables = new HashSet<>();
    for (String url : getUrls(indexUrl))
    {
        Callable<Void> callable = new ParserCallable(url, DAOs...);
        callables.add(callable);
    }
    pool.invokeAll(callables);
    String nextPage = // parse next page in this indexUrl
    if (nextPage == null)
        break;
    indexUrl = nextPage;
} // true
It takes longer than the ExecutorService solution. The ExecutorService takes about 2 hours to finish all pages, while the ForkJoinPool takes 3 hours, and each Callable still takes longer and longer to complete (from 1 second to 5, 6, or even 10 seconds). I don't mind that it takes longer; I just hope it takes a constant time (not longer and longer) to finish a job.
I am wondering if I create a lot of (non-thread-safe) GregorianCalendar, Date, and SimpleDateFormat objects in the parser, causing some threading issue. But I didn't reuse these objects or pass them among threads, so I still cannot find the reason.
Based on the heap, you have a memory issue. ExecutorService.invokeAll collects all of the results of the Callable instances into a List and returns that List when they all complete. You may want to consider simply calling ExecutorService.submit, since you don't seem to care about the result of each Callable.
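A minimal sketch of that suggestion, reusing the names from the question (ParserCallable and getUrls are the asker's own code):

// submit each task individually so no List<Future<Void>> is built per page
for (String url : getUrls(indexUrl)) {
    executorService.submit(new ParserCallable(url /* , DAOs... */));
}

Note that unlike invokeAll, submit does not wait for the page's tasks to finish before the loop moves on to the next IndexPage.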
I can't see why there is a need for a Callable to parse your index pages, since your 'caller' method does not expect any result from ParserCallable. I can see you would need a bit of exception handling, but it can still be managed with a Runnable.
When you submit a Callable, a Future is returned, which is never used here.
You should be able to improve the implementation by using a Runnable, which avoids this additional operation:
ExecutorService executor = Executors.newFixedThreadPool(12);
for (String url : getUrls(indexUrl)) {
    Runnable worker = new ParserRunnable(url, … and some DAOs);
    executor.execute(worker);
}

class ParserRunnable implements Runnable {
}
As I understand it, if you have 40 pages, each with ~300 URLs, you will create ~12,000 Callables. While that is probably not too many Callables, it is a lot of HTTP connections and database connections.
I think you should try using one Callable per page. You'll still gain a ton by running them in parallel. I don't know what you are using for the HTTP requests, but you might be able to reuse system resources there instead of opening and closing 12,000 of them.
And especially for the DB: you'd have just 40 connections. You might even be able to be super efficient by collecting the ~300 records locally and then using a batch update, as sketched below.
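For illustration, a minimal sketch of that batch-update idea over plain JDBC; the dataSource, ParsedLink class, and links table are hypothetical stand-ins for whatever the parser actually produces, and error handling is elided:

// one connection and one statement per page; ~300 rows go over in one batch
try (Connection conn = dataSource.getConnection();
     PreparedStatement ps = conn.prepareStatement(
             "INSERT INTO links (url, title) VALUES (?, ?)")) {
    for (ParsedLink link : parsedLinks) { // the ~300 records parsed locally
        ps.setString(1, link.getUrl());
        ps.setString(2, link.getTitle());
        ps.addBatch();
    }
    ps.executeBatch(); // one round trip instead of ~300 separate INSERTs
}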
