I am writing a parser for a website that has many pages (I call them IndexPages). Each IndexPage has a lot of links (about 300 to 400). I use Java's ExecutorService to invoke 12 Callables concurrently for one IndexPage. Each Callable just fires an HTTP request to one link, then does some parsing and stores the result in the database. When the first IndexPage is finished, the program moves on to the second IndexPage, and so on until no next IndexPage is found.
At first it seems OK: I can observe the threads working and being scheduled well. Parsing and storing each link takes about 1 to 2 seconds.
But as time goes by, each Callable (parsing/storing) takes longer and longer. Take this picture for example: sometimes it takes 10 or more seconds to finish a Callable (the green bar is RUNNING, the purple bar is WAITING). My PC also bogs down and everything becomes sluggish.
This is my main algorithm:
ExecutorService executorService = Executors.newFixedThreadPool(12);
String indexUrl = // Set initial (1st page) IndexPage
while (true)
{
    String nextPage = // parse next page in the indexUrl
    Set<Callable<Void>> callables = new HashSet<>();
    for (String url : getUrls(indexUrl))
    {
        Callable<Void> callable = new ParserCallable(url , … and some DAOs);
        callables.add(callable);
    }
    try {
        executorService.invokeAll(callables);
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
    if (nextPage == null)
        break;
    indexUrl = nextPage;
} // true
executorService.shutdown();
The algorithm is simple and self-explanatory. I wonder what may cause this situation. Is there any way to prevent such performance degradation?
The CPU/Memory/Heap shows reasonable usage.
Environments , FYI.
==================== updated ====================
I've changed my implementation from ExecutorService to ForkJoinPool:
ForkJoinPool pool = new ForkJoinPool(12);
String indexUrl = // Set initial (1st page) IndexPage
while (true)
{
    Set<Callable<Void>> callables = new HashSet<>();
    for (String url : getUrls(indexUrl))
    {
        Callable<Void> callable = new ParserCallable(url , DAOs...);
        callables.add(callable);
    }
    pool.invokeAll(callables);
    String nextPage = // parse next page in this indexUrl
    if (nextPage == null)
        break;
    indexUrl = nextPage;
} // true
It takes longer than the ExecutorService solution. ExecutorService takes about 2 hours to finish all pages, while ForkJoinPool takes 3 hours, and each Callable still takes longer and longer to complete (from 1 second to 5, 6, or even 10 seconds). I don't mind it taking longer; I just hope each job takes constant time (not longer and longer) to finish.
I am wondering whether creating a lot of (non-thread-safe) GregorianCalendar, Date and SimpleDateFormat objects in the parser causes some threading issue. But I don't reuse these objects or pass them among threads, so I still cannot find the reason.
Based on the heap you have a memory issue. ExecutorService.invokeAll collects all of the results of the Callable instances into a List and returns that List when they all complete. You may want to consider simply calling ExecutorService.submit since you don't seem to care about the results of each Callable.
I can't see why a Callable is needed to parse your index pages, since your 'caller' method does not expect any result from ParserCallable. You would need a bit of exception handling, but that can still be managed with a Runnable.
When you submit a Callable, the executor wraps it in a FutureTask and hands back a Future, which is never used.
You should be able to improve the implementation by using a Runnable, which avoids this additional operation:
ExecutorService executor = Executors.newFixedThreadPool(12);
for (String url : getUrls(indexUrl)) {
    Runnable worker = new ParserRunnable(url , … and some DAOs);
    executor.execute(worker);
}
class ParserRunnable implements Runnable {
    @Override
    public void run() {
        // fetch the link, parse it, store the result; catch and log exceptions here
    }
}
As I understand it, if you have 40 pages, each with ~300 URLs, you will create ~12,000 Callables? While that is probably not too many Callables, it is a lot of HTTP connections and database connections.
I think you should try using one Callable per page. You'll still gain a ton by running them in parallel. I don't know what you are using for the HTTP request, but you might be able to reuse system resources there instead of opening and closing 12,000 of them.
And especially for the DB. You'll have just 40 connections. You might even be able to be super efficient by collecting the ~300 records locally, then using a batch update.
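A minimal sketch of the "one task per page, one batch per page" idea, assuming in-memory stand-ins: parseLink, pageTask and the sink list are hypothetical names, and no real HTTP or JDBC calls are made. In real code the sink.addAll line would be a single batch insert.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PagePerTaskSketch {

    /** Hypothetical stand-in for "fetch and parse one link"; no network access here. */
    static String parseLink(String url) {
        return "record-for-" + url;
    }

    /** One task per index page: parse every link, then hand back the whole batch. */
    static Callable<List<String>> pageTask(List<String> urlsOnPage) {
        return () -> {
            List<String> batch = new ArrayList<>();
            for (String url : urlsOnPage) {
                batch.add(parseLink(url));
            }
            return batch;
        };
    }

    /** Runs one task per page on a small pool; returns the number of batch writes. */
    static int run(List<List<String>> pages, List<String> sink) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (List<String> page : pages) {
                futures.add(pool.submit(pageTask(page)));
            }
            int batchWrites = 0;
            for (Future<List<String>> future : futures) {
                sink.addAll(future.get()); // stand-in for one batch update per page
                batchWrites++;
            }
            return batchWrites;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> sink = new ArrayList<>();
        int writes = run(List.of(List.of("a", "b"), List.of("c")), sink);
        System.out.println(writes + " batch writes, " + sink.size() + " records");
    }
}
```

With ~40 pages this gives ~40 tasks and ~40 writes instead of ~12,000 of each.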
I'm looking for some help since I don't know how to optimize a process.
I have to invoke a service that returns a list with more than 500K elements (I don't know why; these services belong to the client). For each element of the list, I have to invoke 2 more services and then save some attributes in our database. This last step is not the problem, but the entire process takes between 1 and 2 seconds per element, so at this rate it is going to take more than 100 hours to complete.
My approach is the following: in my main method I get the large list, then I use a parallelStream to iterate over its elements, and for each one I use a CompletableFuture to call the method that invokes the 2 services mentioned above. I've tried changing the parallelStream to a stream and to a for-each, splitting the main list into smaller lists, and many other things, but I don't see better performance. I think the problem is the invocation of those 2 services, but I want to try my luck asking here.
I'm using Java 11 and Spring, and I'm using RestTemplate to invoke the services. This is my code:
public void updateDiscount() {
    //List with 500k elements
    var relationshipList = relationshipService.getLargeList();
    //CompletableFuture to make the async calls to the method below
    relationshipList.parallelStream().forEach(level1 -> {
        CompletableFuture.runAsync(() -> relationshipService.asyncDiscountSave(level1));
    });
}
//Second class
@Async("nameOfThePool")
public void asyncDiscountSave(ElementOfList element) {
    //Logic to create request
    //.........
    var responseClients = anotherClass.getClients(element.getGroup1()); //get the first response with restTemplate
    var responseProducts = anotherClass.getProducts(element.getGroup2()); //get the second response with restTemplate
    for (var client : responseClients) {
        for (var product : responseProducts) {
            //Here we just save some attributes of these objects on our DB
        }
    }
}
Thanks for the help.
UPDATE:
For this particular case, the only improvement I can make is to pass a thread pool to the CompletableFuture; the real problem is the response time of the services that I need to invoke.
I decided to follow a second approach and it took about 5 hours to complete, which is acceptable compared with the first approach.
As you haven't defined an executor, you are using the default pool. Passing an executor allows you to create as many threads as you need and as the server resources can handle:
public void updateDiscount() {
    Executor executor = Executors.newFixedThreadPool(100); //Define the number according to server resources performance
    //List with 500k elements
    var relationshipList = relationshipService.getLargeList();
    //CompletableFuture to make the async calls to the method above
    relationshipList.parallelStream().forEach(level1 -> {
        CompletableFuture.runAsync(() -> relationshipService.asyncDiscountSave(level1), executor);
    });
}
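One thing the snippet above does not do is wait for the futures or shut the pool down. A minimal sketch of that, assuming a hypothetical save method in place of asyncDiscountSave and a plain stream (the executor already provides the parallelism):

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;

class BoundedAsyncSketch {

    /** Hypothetical stand-in for asyncDiscountSave: it only counts the call. */
    static void save(String element, AtomicInteger saved) {
        saved.incrementAndGet();
    }

    /** Runs save once per element on a fixed pool and blocks until all calls are done. */
    static int processAll(List<String> elements, int threads) {
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        AtomicInteger saved = new AtomicInteger();
        try {
            List<CompletableFuture<Void>> futures = elements.stream()
                    .map(e -> CompletableFuture.runAsync(() -> save(e, saved), executor))
                    .collect(Collectors.toList());
            // join() blocks until every future has completed
            CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
            return saved.get();
        } finally {
            executor.shutdown(); // release the worker threads
        }
    }

    public static void main(String[] args) {
        System.out.println(processAll(List.of("a", "b", "c"), 2)); // prints 3
    }
}
```

The pool size is the knob to tune; the right value depends on how many concurrent calls the remote services tolerate.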
I have a CopyOnWriteArrayList that holds 1000 URLs; each URL points to a file.
I want to use a thread pool to download those files in parallel.
I tried the code below:
CopyOnWriteArrayList<String> fileList = DataExtractor.getRefLinks();
ExecutorService threadPool = Executors.newFixedThreadPool(4);
CompletionService<String> pool = new ExecutorCompletionService<String>(
threadPool);
for (int i = 0; i < fileList.size() ; i++){
pool.submit(new StringTask(fileList));
}
This hits the same URL 4 times, so I must have done something wrong. Could you please suggest where it went wrong?
My requirement is to pick 4 URLs (one per thread) at a time and download them in parallel until all the URLs in the list have been downloaded.
Thanks.
I don't know what StringTask is, but you seem to be passing the full list of URLs to it. Make the appropriate changes to submit only a single URL from the list:
pool.submit(new StringTask(fileList.get(i)));
(Or use an iterator over the fileList, whichever is more appropriate for a CopyOnWriteArrayList.)
for (String url : fileList){
pool.submit(new StringTask(url));
}
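For completeness, a small sketch of the whole flow with a 4-thread pool: submit one task per URL, then collect each result as it finishes. The lambda stands in for StringTask and does no real download; both are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class CompletionServiceSketch {

    /** Submits one task per URL and returns the results in completion order. */
    static List<String> downloadAll(List<String> urls, int threads) {
        ExecutorService threadPool = Executors.newFixedThreadPool(threads);
        CompletionService<String> pool = new ExecutorCompletionService<>(threadPool);
        try {
            for (String url : urls) {
                pool.submit(() -> "downloaded:" + url); // hypothetical download of ONE url
            }
            List<String> done = new ArrayList<>();
            for (int i = 0; i < urls.size(); i++) {
                done.add(pool.take().get()); // blocks until the next task finishes
            }
            return done;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            threadPool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(downloadAll(List.of("u1", "u2", "u3", "u4", "u5"), 4).size()); // prints 5
    }
}
```

At most 4 tasks run at a time; the rest wait in the pool's queue, which matches the "4 at a time until the list is done" requirement.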
I am not sure I can put my question in the clearest fashion, but I will try my best.
Let's say I am retrieving some information from a third-party API. The retrieved information will be huge. To gain performance, instead of retrieving all the info in one go, I will retrieve it in a paged fashion (the API gives me that facility, basically an iterator). The return type is basically a list of objects.
My aim is to process the information I already have in hand (comparing, storing in the DB, and many other operations) while the next paged response is being fetched.
My question to the expert community is: what data structure do you prefer in such a case? Also, does a framework like Spring Batch help in getting performance gains in such cases?
I know the question is a bit vague, but I am looking for general ideas, tips and pointers.
In these cases, the data structure for me is java.util.concurrent.CompletionService.
For purposes of example, I'm going to assume a couple of additional constraints:
You want only one outstanding request to the remote server at a time
You want to process the results in order.
Here goes:
// a class that knows how to update the DB given a page of results
class DatabaseUpdater implements Callable<Object> { ... }
// a background thread to do the work
final CompletionService<Object> exec = new ExecutorCompletionService<>(
        Executors.newSingleThreadExecutor());
// first call
List<Object> results = ThirdPartyAPI.getPage( ... );
// Start loading those results to DB on background thread
exec.submit(new DatabaseUpdater(results));
while( you need to ) {
    // Another call to remote service
    results = ThirdPartyAPI.getPage( ... );
    // wait for existing work to complete; get() rethrows any failure
    exec.take().get();
    // send more work to background thread
    exec.submit(new DatabaseUpdater(results));
}
// wait for the last task to complete
exec.take().get();
This is just a simple two-thread design. The first thread is responsible for getting data from the remote service and the second is responsible for writing to the database.
Any exception thrown by DatabaseUpdater is propagated to the main thread when the result is retrieved: exec.take() only hands back the completed Future, and calling get() on it rethrows the failure wrapped in an ExecutionException.
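To illustrate how a failure in the background task reaches the caller, here is a small self-contained sketch. The failing Callable is a hypothetical stand-in for DatabaseUpdater. Note that take() itself returns normally; the exception only surfaces from Future.get().

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class TakeThenGetSketch {

    /** Runs a task that fails and returns the message the calling thread observes. */
    static String failureSeenByCaller() {
        ExecutorService background = Executors.newSingleThreadExecutor();
        CompletionService<Object> exec = new ExecutorCompletionService<>(background);
        try {
            Callable<Object> failingUpdate = () -> {
                throw new IllegalStateException("db write failed");
            };
            exec.submit(failingUpdate);
            Future<Object> finished = exec.take(); // returns normally, even for a failed task
            try {
                finished.get(); // the task's exception is rethrown here, wrapped
                return "no failure";
            } catch (ExecutionException e) {
                return e.getCause().getMessage();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        } finally {
            background.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(failureSeenByCaller()); // prints "db write failed"
    }
}
```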
Good luck.
In terms of doing the actual parallelism, one very useful construct in Java is the ThreadPoolExecutor. A rough sketch of what that might look like is this:
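import java.util.ArrayList;
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class YourApp {
    // static so that it can be instantiated from main()
    static class Processor implements Runnable {
        Widget toProcess;
        public Processor(Widget toProcess) {
            this.toProcess = toProcess;
        }
        public void run() {
            // commit the Widget to the DB, etc
        }
    }
    public static void main(String[] args) {
        // With an unbounded queue the pool never grows past its core size,
        // so core and maximum are both set to 10 here.
        ThreadPoolExecutor executor =
            new ThreadPoolExecutor(10, 10, 30,
                TimeUnit.SECONDS,
                new LinkedBlockingDeque<>());
        while (thereAreStillWidgets()) {
            ArrayList<Widget> widgets = doExpensiveDatabaseCall();
            for (Widget widget : widgets) {
                Processor processor = new Processor(widget);
                executor.execute(processor);
            }
        }
        executor.shutdown(); // let queued work finish, then release the threads
    }
}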
But as I said in a comment: calls to an external API are expensive. It's very likely that the best strategy is to pull all the Widget objects down from the API in one call, and then process them in parallel once you've got them. Doing more API calls gives you the overhead of sending the data all the way from the server to you, every time -- it's probably best to pay that cost the fewest number of times that you can.
Also, keep in mind that if you're doing DB operations, it's possible that your DB doesn't allow for parallel writes, so you might get a slowdown there.
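One ThreadPoolExecutor detail worth checking when sizing the pool: with an unbounded work queue, extra threads beyond the core size are never created, because the executor only adds them when the queue refuses a task. A minimal sketch that measures this (the class and method names are made up for illustration):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class PoolGrowthSketch {

    /** Submits blocked tasks and reports how many worker threads the pool created. */
    static int largestPoolSize(int core, int max, int tasks) {
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                core, max, 30, TimeUnit.SECONDS, new LinkedBlockingDeque<>());
        CountDownLatch release = new CountDownLatch(1);
        for (int i = 0; i < tasks; i++) {
            executor.execute(() -> {
                try {
                    release.await(); // keep the worker busy until we have measured
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        int largest = executor.getLargestPoolSize();
        release.countDown();
        executor.shutdown();
        return largest;
    }

    public static void main(String[] args) {
        System.out.println(largestPoolSize(1, 10, 20));  // prints 1
        System.out.println(largestPoolSize(10, 10, 20)); // prints 10
    }
}
```

So a pool declared as (1, 10) over an unbounded queue processes everything on a single thread; raising the core size, or bounding the queue, is what actually adds parallelism.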
I have an application on App Engine which is consuming some data. After parsing that data, it will know that it needs to execute something in a period of time - possibly not for a number of hours or weeks.
What is the best way to execute a piece of code after some arbitrary amount of time on App Engine?
I figured using Countdown Millis or EtaMillis from a TaskQueue would work, but haven't seen any evidence of anyone doing the same thing, especially for such long time frames.
Is that the best approach, or is there a better way?
If you are able to persist an object in the datastore with all of the relevant information for future processing (including when the processing for the object's data should begin), you could have a cron job periodically query the datastore with a date/time range filter and trigger processing any of the above objects at the appropriate time.
We successfully use TaskQueue's countdown parameter to send emails to customers 7 days after they register, and for a number of other needs.
Task queues are a core API/service and are pretty reliable. In my opinion, task queues with ETA/countdown are the best way to go unless you:
need the ability to programmatically see what is in the queue
need the ability to programmatically delete a certain task from the queue
I'm using the task queue as a scheduler. There is a 30 day max eta declared in QueueConstants and applied in QueueImpl.
//Returns the maximum time into the future that a task may be scheduled.
private static final long MAX_ETA_DELTA_MILLIS = 2592000000L;
1000ms * 60s * 60m * 24hr * 30days = 2592000000ms
private long determineEta(TaskOptions taskOptions) {
    Long etaMillis = taskOptions.getEtaMillis();
    Long countdownMillis = taskOptions.getCountdownMillis();
    if (etaMillis == null) {
        if (countdownMillis == null) {
            return currentTimeMillis();
        } else {
            if (countdownMillis > QueueConstants.getMaxEtaDeltaMillis()) {
                throw new IllegalArgumentException("ETA too far into the future");
            }
            if (countdownMillis < 0) {
                throw new IllegalArgumentException("Negative countdown is not allowed");
            }
            return currentTimeMillis() + countdownMillis;
        }
    } else {
        if (countdownMillis == null) {
            if (etaMillis - currentTimeMillis() > QueueConstants.getMaxEtaDeltaMillis()) {
                throw new IllegalArgumentException("ETA too far into the future");
            }
            if (etaMillis < 0) {
                throw new IllegalArgumentException("Negative ETA is invalid");
            }
            return etaMillis;
        } else {
            throw new IllegalArgumentException(
                "Only one or neither of EtaMillis and CountdownMillis may be specified");
        }
    }
}
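If you want to validate a delay on your side before enqueuing, the 30-day limit and the countdown checks above can be mirrored with plain JDK code. This is only a sketch: EtaLimitSketch and isValidCountdown are hypothetical names, not part of the App Engine SDK.

```java
import java.util.concurrent.TimeUnit;

class EtaLimitSketch {

    // Same value as the quoted MAX_ETA_DELTA_MILLIS, derived instead of hard-coded.
    static final long MAX_ETA_DELTA_MILLIS = TimeUnit.DAYS.toMillis(30);

    /** Mirrors the countdown checks: non-negative and at most 30 days. */
    static boolean isValidCountdown(long countdownMillis) {
        return countdownMillis >= 0 && countdownMillis <= MAX_ETA_DELTA_MILLIS;
    }

    public static void main(String[] args) {
        System.out.println(MAX_ETA_DELTA_MILLIS);                            // prints 2592000000
        System.out.println(isValidCountdown(TimeUnit.DAYS.toMillis(7)));     // prints true
        System.out.println(isValidCountdown(TimeUnit.DAYS.toMillis(31)));    // prints false
    }
}
```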
I do the following:
Enqueue a task with a delay configured as you mention. Have the task processing change datastore entries in a known way (for example: set a flag).
Have a low-frequency "stragglers" cron job perform any processing that has somehow been missed by an enqueued task (for example: an uncaught exception happened in the task).
For this to work, ensure that the processing called by the tasks and the cron job is idempotent.
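A rough sketch of that pattern, with in-memory maps standing in for datastore entities. All names here are hypothetical, and a real implementation would read and write the flag through the datastore, ideally inside a transaction.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class IdempotentJobSketch {

    // In-memory stand-ins for datastore entities: job id -> done flag, job id -> result.
    private final Map<String, Boolean> done = new ConcurrentHashMap<>();
    private final Map<String, String> results = new ConcurrentHashMap<>();

    void register(String jobId) {
        done.putIfAbsent(jobId, false);
    }

    /** Called by both the delayed task and the cron job; repeating it is harmless. */
    boolean process(String jobId) {
        if (Boolean.TRUE.equals(done.get(jobId))) {
            return false; // already handled, nothing to do
        }
        results.put(jobId, "processed:" + jobId); // same key and value on every run
        done.put(jobId, true);                    // the flag the cron job looks for
        return true;
    }

    /** Straggler sweep: process every job whose flag is still unset. */
    int sweep() {
        int handled = 0;
        for (String jobId : done.keySet()) {
            if (process(jobId)) {
                handled++;
            }
        }
        return handled;
    }

    int resultCount() {
        return results.size();
    }

    public static void main(String[] args) {
        IdempotentJobSketch jobs = new IdempotentJobSketch();
        jobs.register("a");
        jobs.register("b");
        System.out.println(jobs.process("a")); // prints true (the delayed task ran)
        System.out.println(jobs.sweep());      // prints 1 (only "b" was missed)
    }
}
```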
Enjoy?
I think taskQueue is a good strategy but has one big problem "If a push task is created successfully, it will eventually be deleted (at most seven days after the task successfully executes)." Source
I would instead use the datastore. Here is one strategy you can take:
Insert a record into the datastore once you have completed "parsing that data".
Check the current date against the create/insert date to see how much time has passed since your job was completed/started (clearly, you don't want to do this every minute; maybe once a day or every hour).
Execute the next task as soon as the check in step 2 shows that your "arbitrary amount of time" has passed.
Here is how you can add a record to the datastore, to get you started:
Entity parsDataHolder = new Entity("parsing_data_done", guestbookKey);
parsDataHolder.setProperty("date", date);
DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
datastore.put(parsDataHolder);
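The date comparison in step 2 could look roughly like this, using java.time. isDue is a hypothetical helper; the stored "date" property would supply the created value.

```java
import java.time.Duration;
import java.time.Instant;

class DueCheckSketch {

    /** True once at least `delay` has elapsed between `created` and `now`. */
    static boolean isDue(Instant created, Duration delay, Instant now) {
        return !now.isBefore(created.plus(delay));
    }

    public static void main(String[] args) {
        Instant created = Instant.parse("2024-01-01T00:00:00Z");
        Duration delay = Duration.ofDays(7);
        System.out.println(isDue(created, delay, Instant.parse("2024-01-05T00:00:00Z"))); // prints false
        System.out.println(isDue(created, delay, Instant.parse("2024-01-08T00:00:00Z"))); // prints true
    }
}
```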
I'm writing a server end program using Twitter Finagle. I do not use the full Twitter server stack, just the part that enables asynchronous processing (so Future, Function, etc). I want the Future objects to have timeouts, so I wrote this:
Future<String> future = Future.value(some_input).flatMap(time_consuming_function1);
future.get(Duration.apply(5, TimeUnit.SECONDS));
time_consuming_function1 runs for longer than 5 seconds. But future doesn't time out after 5 seconds and it waits till time_consuming_function1 has finished.
I think this is because future.get(timeout) only cares about how long the future took to create, not the whole operation chain. Is there a way to timeout the whole operation chain?
Basically if you call map/flatMap on a satisfied Future, the code is executed immediately.
In your example, you're satisfying your future immediately when you call Future.value(some_input), so flatMap executes the code immediately and the call to get doesn't need to wait for anything. Also, everything is happening in one thread. A more appropriate use would be like this:
import scala.concurrent.ops._
import com.twitter.conversions.time._
import com.twitter.util.{Future,Promise}
val p = new Promise[String]
val longOp = (s: String) => {
val p = new Promise[String]
spawn { Thread.sleep(5000); p.setValue("Received: " + s) }
p
}
val both = p flatMap longOp
both.get(1 second) // p is not complete, so longOp hasn't been called yet, so this will fail
p.setValue("test") // we set p, but we have to wait for longOp to complete
both.get(1 second) // this fails because longOp isn't done
both.get(5 seconds) // this will succeed
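The same effect can be reproduced with the JDK's CompletableFuture. This is offered purely as an analogy, since it is not Finagle's API: a transformation attached to an already-completed future runs synchronously on the calling thread, so a later timed get has nothing left to wait for, whereas a timed get on genuinely asynchronous work does time out.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicReference;

class CompletedFutureSketch {

    /** A function chained onto a completed future runs right away on the caller's thread. */
    static boolean runsOnCallerThread() {
        Thread caller = Thread.currentThread();
        AtomicReference<Thread> ranOn = new AtomicReference<>();
        CompletableFuture.completedFuture("input").thenApply(s -> {
            ranOn.set(Thread.currentThread());
            return s;
        });
        return ranOn.get() == caller;
    }

    /** A timed get on work that really runs on another thread does time out. */
    static boolean timesOutWhenAsync() {
        CompletableFuture<String> slow = CompletableFuture.supplyAsync(() -> {
            try {
                Thread.sleep(500);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "late";
        });
        try {
            slow.get(50, TimeUnit.MILLISECONDS);
            return false;
        } catch (TimeoutException e) {
            return true;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(runsOnCallerThread()); // prints true
        System.out.println(timesOutWhenAsync());  // prints true
    }
}
```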