Since upgrading to Ignite 2.8.1 I'm frequently seeing timeouts on Ignite compute calls for Callables. This was not the case with v2.7.6; I never even used the timeout option before, because I never saw any delay in compute.
Since it now often gets stuck in the compute().call() method, I added withTimeout(60000), but I still had to add a loop to retry several times before the function is actually called on any of the client nodes. This doesn't happen every time, but it happens sometimes.
When it happens, I see Ignite log messages like this:
[01:14:27,237][WARNING][grid-timeout-worker-#119][GridTaskWorker] Task has timed out: GridTaskSessionImpl [taskName=de.my.comp.net.ignite.CreateNewMedia, dep=LocalDeployment [super=GridDeployment [ts=1609753866994, depMode=SHARED, clsLdr=jdk.internal.loader.ClassLoaders$AppClassLoader@5c29bfd, clsLdrId=06faeccc671-65368413-cf29-4a85-9f51-9b124c6a3351, userVer=0, loc=true, sampleClsName=java.lang.String, pendingUndeploy=false, undeployed=false, usage=0]], taskClsName=de.my.comp.net.ignite.CreateNewMedia, sesId=7bccb2ec671-65368413-cf29-4a85-9f51-9b124c6a3351, startTime=1610586807229, endTime=1610586867229, taskNodeId=65368413-cf29-4a85-9f51-9b124c6a3351, clsLdr=jdk.internal.loader.ClassLoaders$AppClassLoader@5c29bfd, closed=false, cpSpi=null, failSpi=null, loadSpi=null, usage=1, fullSup=false, internal=false, topPred=null, subjId=65368413-cf29-4a85-9f51-9b124c6a3351, mapFut=IgniteFuture [orig=GridFutureAdapter [ignoreInterrupts=false, state=DONE, res=null, hash=154761191]], execName=null]
This is my current code (shortened for readability):
while (something)
{
    ...
    ...
    final long TASK_TIMEOUT_CREATENEWMEDIA = 60 * 1000;
    ClusterGroup servers = IgniteHelper.getComputeNodes(ignite);
    UUID thisMasterID = IgniteHelper.getLocalNodeID(ignite);
    IgniteSemaphore semaphore = ignite.semaphore("masterFindNodeSempahore", 1, true, true);
    try
    {
        semaphore.acquire();
        CreateNewMedia cnm = new CreateNewMedia(currentConfig, thisMasterID, currentJobID, msResubmit);
        CreateNewMediaResult result = ignite.compute(servers).withTimeout(TASK_TIMEOUT_CREATENEWMEDIA).call(cnm);
        ...
        ...
    }
    catch (IgniteException e)
    {
        if (e instanceof ComputeTaskTimeoutException)
        {
            // log task timeout, try again if not max tries reached
            ...
            ...
        }
        else
        {
            // if some other issue, we better leave
            throw new Exception(e);
        }
    }
    finally
    {
        semaphore.release();
    }
}
The goal of this code is to find a node that is free to perform a specific action. The function's return value indicates whether the node it was executed on is available. If it returns a negative result, I remove that node ID from the list and try the remaining servers (this logic is left out of the code above for simplicity).
Setting publicThreadPoolSize to 512 seems to have reduced the chances of the timeout happening, but it still occurs occasionally. When it does, the function is in most cases eventually called after a few minutes, i.e. after several tries of compute().call() in the loop.
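For reference, this is roughly how the pool size is set (a minimal sketch assuming programmatic configuration; the rest of the configuration is omitted):
IgniteConfiguration cfg = new IgniteConfiguration();
cfg.setPublicThreadPoolSize(512); // thread pool that executes compute jobs
Ignite ignite = Ignition.start(cfg);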
Is this a known issue in Ignite 2.8.1? This is running on a server farm with 25 Ignite compute clients.
Workload on the client nodes is high, but they are definitely not overloaded. In fact, the nodes mostly run an external command-line executable while the Java portion just waits for its console output.
Any help appreciated.
Additional note: I've seen the timeout with v2.8.1 on other Callables as well, so I don't think it's specifically related to the implementation of CreateNewMedia or to how it's called in my sample code above.
Related
Interesting. I would have thought that with 255 concurrent users, an async API would have better performance. Here are two of my endpoints in my Spring server:
#RequestMapping("/async")
public CompletableFuture<String> g(){
CompletableFuture<String> f = new CompletableFuture<>();
f.runAsync(() -> {
try {
Thread.sleep(500);
f.complete("Finished");
} catch (InterruptedException e) {
e.printStackTrace();
}
});
return f;
}
#RequestMapping("/sync")
public String h() throws InterruptedException {
Thread.sleep(500);
return "Finished";
}
The /async endpoint runs the work on a different thread. I am using Siege for load testing as follows:
siege http://localhost:8080/sync --concurrent=255 --time=10S > /dev/null
For the async endpoint, I got a transaction count of 27 hits.
For the sync endpoint, I got a transaction count of 1531 hits.
So why is this? Why isn't the async endpoint able to handle more transactions?
Because the async endpoint is using a shared thread pool (the small ForkJoinPool.commonPool()) to execute the sleeps, whereas the sync endpoint uses the larger thread pool of the application server. Since the common pool is so small, you're running maybe 4-8 operations (well, if you call sleeping an operation) at a time, while the others wait for their turn to even get into the pool. You can use a bigger pool with CompletableFuture.runAsync(Runnable, Executor). (You're also calling the method wrong: it's a static method that returns a CompletableFuture.)
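For illustration, a minimal sketch of the corrected endpoint (the fixed pool of 255 threads is an assumption sized to match the test's concurrency, not a recommendation; assumes the usual java.util.concurrent imports):
private static final Executor POOL = Executors.newFixedThreadPool(255);

@RequestMapping("/async")
public CompletableFuture<String> g() {
    // supplyAsync is static; the work runs on the supplied pool, not the common pool
    return CompletableFuture.supplyAsync(() -> {
        try {
            Thread.sleep(500);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "Finished";
    }, POOL);
}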
Async isn't a magical "make things faster" technique. Your example is flawed: all the requests take 500 ms, and the async version only adds overhead.
I have a route defined in Camel that goes something like this: a GET request comes in, and a file gets created in the file system. The file consumer picks it up, fetches data from external web services, and sends the resulting message by POST to other web services.
Simplified code below:
// Update request goes on queue:
from("restlet:http://localhost:9191/update?restletMethod=post")
    .routeId("Update via POST")
    [...some magic that defines a directory and file name based on request headers...]
    .to("file://cameldest/queue?allowNullBody=true&fileExist=Ignore");

// Update gets processed
from("file://cameldest/queue?delay=500&recursive=true&maxDepth=2&sortBy=file:parent;file:modified&preMove=inprogress&delete=true")
    .routeId("Update main route")
    .streamCaching() // otherwise stuff can't be sent to multiple endpoints
    [...enrich message from some web service using http4 component...]
    .multicast()
        .stopOnException()
        .to("direct:sendUpdate", "direct:dependencyCheck", "direct:saveXML")
    .end();
The three endpoints in the multicast are simply POSTing the resulting message to other web services.
This all works rather well when the queue (i.e. the file directory cameldest) is fairly empty. Files are being created in cameldest/<subdir>, picked up by the file consumer and moved into cameldest/<subdir>/inprogress, and stuff is being sent to the three outgoing POSTs no problem.
However, once the incoming requests pile up to about 300,000 files, progress slows down and eventually the pipeline fails with out-of-memory errors (GC overhead limit exceeded).
By increasing logging I can see that the file consumer's polling basically never runs again: it appears to take responsibility for all files it sees in each pass, waits for them to finish processing, and only then starts another poll round. Besides (I assume) causing the resource bottleneck, this also interferes with my sorting requirements: once the queue is jammed with thousands of messages waiting to be processed, new messages that would naively be sorted higher up are, if they even still get picked up, stuck waiting behind those that are already "started".
Now, I've tried the maxMessagesPerPoll and eagerMaxMessagesPerPoll options. They seem to alleviate the problem at first, but after a number of poll rounds I still end up with thousands of files in "started" limbo.
The only thing that sort of worked was making the bottleneck of delay and maxMessages... so narrow that the processing would, on average, finish faster than the file polling cycle.
Clearly, that is not what I want. I would like my pipeline to process files as fast as possible, but not faster. I was expecting the file consumer to wait when the route is busy.
Am I making an obvious mistake?
(I'm running a somewhat older Camel 2.14.0 on a Redhat 7 machine with XFS, if that is part of the problem.)
Try setting maxMessagesPerPoll to a low value on the from file endpoint to only pick up at most X files per poll, which also limits the total number of in-flight messages in your Camel application.
You can find more information about that option in the Camel documentation for the file component.
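For example (illustrative values; note that when sortBy is used, eagerMaxMessagesPerPoll=false makes Camel sort the full file list before applying the limit):
from("file://cameldest/queue?delay=500&recursive=true&maxDepth=2"
        + "&maxMessagesPerPoll=50&eagerMaxMessagesPerPoll=false"
        + "&sortBy=file:parent;file:modified&preMove=inprogress&delete=true")
    .routeId("Update main route")
    .to("direct:sendUpdate"); // rest of the route as before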
The short answer is that there is no answer: the sortBy option of Camel's file component is simply too memory-inefficient to accommodate my use case:
Uniqueness: I don't want to put a file on queue if it's already there.
Priority: Files flagged as high priority should be processed first.
Performance: Having a few hundred thousands of files, or maybe even a few million, should be no problem.
FIFO: (Bonus) Oldest files (by priority) should be picked up first.
The problem appears to be, if I read the source code and documentation correctly, that all file details are held in memory to perform the sorting, no matter whether the built-in language or a custom pluggable sorter is used. The file component always creates a list of objects containing all the details, and that apparently causes an insane amount of garbage-collection overhead when polling many files often.
I got my use case to work, mostly, without having to resort to using a database or writing a custom component, using the following steps:
Move from one file consumer on the parent directory cameldest/queue that recursively sorts the files in the subdirectories (cameldest/queue/high/ before cameldest/queue/low/) to two consumers, one for each directory, with no sorting at all.
Set up only the consumer from /cameldest/queue/high/ to process files through my actual business logic.
Set up the consumer from /cameldest/queue/low to simply promote files from "low" to "high" (copying them over, i.e. .to("file://cameldest/queue/high");)
Crucially, in order to promote from "low" to "high" only when "high" is not busy, attach a route policy to "high" that throttles the other route, i.e. suspends "low" whenever there are any messages in flight in "high".
Additionally, I added a ThrottlingInflightRoutePolicy to "high" to prevent it from inflighting too many exchanges at once.
Imagine this like at check-in at the airport, where tourist travellers are invited over into the business class lane if that is empty.
This worked like a charm under load, and even while hundreds of thousands of files were on queue in "low", new messages (files) dropped directly into "high" got processed within seconds.
The only requirement this solution doesn't cover is ordering: there is no guarantee that older files are picked up first; rather, they are picked up randomly. One could imagine a situation where a steady stream of incoming files results in one particular file X simply always being unlucky and never getting picked up. The chance of that happening, though, is very low.
Possible improvement: currently the threshold for allowing/suspending the promotion of files from "low" to "high" is set at 0 messages in flight in "high". On the one hand, this guarantees that files dropped into "high" are processed before another promotion from "low" is performed; on the other hand, it leads to a bit of a stop-and-go pattern, especially in a multi-threaded scenario. Not a real problem though; the performance as-is was impressive.
Source:
My route definitions:
ThrottlingInflightRoutePolicy trp = new ThrottlingInflightRoutePolicy();
trp.setMaxInflightExchanges(50);

SuspendOtherRoutePolicy sorp = new SuspendOtherRoutePolicy("lowPriority");

from("file://cameldest/queue/low?delay=500&maxMessagesPerPoll=25&preMove=inprogress&delete=true")
    .routeId("lowPriority")
    .log("Copying over to high priority: ${in.headers."+Exchange.FILE_PATH+"}")
    .to("file://cameldest/queue/high");

from("file://cameldest/queue/high?delay=500&maxMessagesPerPoll=25&preMove=inprogress&delete=true")
    .routeId("highPriority")
    .routePolicy(trp)
    .routePolicy(sorp)
    .threads(20)
    .log("Before: ${in.headers."+Exchange.FILE_PATH+"}")
    .delay(2000) // this is where business logic would happen
    .log("After: ${in.headers."+Exchange.FILE_PATH+"}")
    .stop();
My SuspendOtherRoutePolicy, loosely modeled on ThrottlingInflightRoutePolicy:
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

import org.apache.camel.CamelContext;
import org.apache.camel.CamelContextAware;
import org.apache.camel.Consumer;
import org.apache.camel.Exchange;
import org.apache.camel.Route;
import org.apache.camel.impl.RoutePolicySupport;

public class SuspendOtherRoutePolicy extends RoutePolicySupport implements CamelContextAware {

    private CamelContext camelContext;
    private final Lock lock = new ReentrantLock();
    private String otherRouteId;

    public SuspendOtherRoutePolicy(String otherRouteId) {
        super();
        this.otherRouteId = otherRouteId;
    }

    @Override
    public CamelContext getCamelContext() {
        return camelContext;
    }

    @Override
    public void onStart(Route route) {
        super.onStart(route);
        if (camelContext.getRoute(otherRouteId) == null) {
            throw new IllegalArgumentException("There is no route with the id '" + otherRouteId + "'");
        }
    }

    @Override
    public void setCamelContext(CamelContext context) {
        camelContext = context;
    }

    @Override
    public void onExchangeDone(Route route, Exchange exchange) {
        //log.info("Exchange done on route " + route);
        Route otherRoute = camelContext.getRoute(otherRouteId);
        //log.info("Other route: " + otherRoute);
        throttle(route, otherRoute, exchange);
    }

    protected void throttle(Route route, Route otherRoute, Exchange exchange) {
        // this works best when executed after the exchange is done
        Consumer consumer = otherRoute.getConsumer();
        int size = getSize(route, exchange);
        boolean stop = size > 0;
        if (stop) {
            try {
                lock.lock();
                stopConsumer(size, consumer);
            } catch (Exception e) {
                handleException(e);
            } finally {
                lock.unlock();
            }
        }
        // reload the size in case of a race condition with too many invocations at once;
        // we need to read the most current size and start the consumer if we are already at zero
        size = getSize(route, exchange);
        boolean start = size == 0;
        if (start) {
            try {
                lock.lock();
                startConsumer(size, consumer);
            } catch (Exception e) {
                handleException(e);
            } finally {
                lock.unlock();
            }
        }
    }

    private int getSize(Route route, Exchange exchange) {
        return exchange.getContext().getInflightRepository().size(route.getId());
    }

    private void startConsumer(int size, Consumer consumer) throws Exception {
        boolean started = super.startConsumer(consumer);
        if (started) {
            log.info("Resuming the other consumer " + consumer);
        }
    }

    private void stopConsumer(int size, Consumer consumer) throws Exception {
        boolean stopped = super.stopConsumer(consumer);
        if (stopped) {
            log.info("Suspending the other consumer " + consumer);
        }
    }
}
I would propose an alternative solution unless you really need to save the data as files.
From your restlet consumer, send each request to a message-queuing app such as ActiveMQ or RabbitMQ or something similar. You will quickly end up with lots of messages on that queue, but that is OK.
Then replace your file consumer with a queue consumer. It will take some time, but each message will be processed separately and sent wherever you want. I have tested RabbitMQ with about 500,000 messages and it worked fine. This should reduce the load on the consumer as well.
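A rough sketch of that idea using the ActiveMQ component (the queue name and consumer count are made up for illustration):
// Requests go straight onto a JMS queue instead of the file system
from("restlet:http://localhost:9191/update?restletMethod=post")
    .to("activemq:queue:updates");

// A pool of consumers works the queue at its own pace
from("activemq:queue:updates?concurrentConsumers=10")
    .to("direct:sendUpdate"); // existing processing goes here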
I am not a multithreading expert, but I am seeing some performance issues with my current code, which uses ExecutorService.
I am working on a project in which I need to make an HTTP call to my server and, if it takes too long to respond, time out the call. Currently it returns a simple JSON string.
My current requirement is 10 ms: within 10 ms the call should get the data back from the server. I assume that's possible, since it is just an HTTP call to a server within the same datacenter.
My client program and the actual servers are in the same datacenter, and the ping latency between them is 0.5 ms, so it should certainly be doable.
I am using RestTemplate to make the URL call.
Below is the code I have written, which uses ExecutorService and Callables:
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class URLTest {

    private ExecutorService executor = Executors.newFixedThreadPool(10);

    public String getData() {
        Future<String> future = executor.submit(new Task());
        String response = null;
        try {
            System.out.println("Started..");
            response = future.get(100, TimeUnit.MILLISECONDS);
            System.out.println("Finished!");
        } catch (TimeoutException e) {
            System.out.println("Terminated!");
        } catch (InterruptedException e) {
            e.printStackTrace();
        } catch (ExecutionException e) {
            e.printStackTrace();
        }
        return response;
    }
}
Below is my Task class, which implements the Callable interface:
class Task implements Callable<String> {
    private RestTemplate restTemplate = new RestTemplate();

    public String call() throws Exception {
        // TimerTest timer = TimerTest.getInstance(); // line 3
        String response = restTemplate.getForObject(url, String.class);
        // timer.getDuration(); // line 4
        return response;
    }
}
And below is my code in another class, DemoTest, which calls the getData method in URLTest 500 times and measures the 95th percentile end to end:
public class DemoTest {
    public static void main(String[] args) {
        URLTest bc = new URLTest();

        // a little bit of warmup
        for (int i = 0; i <= 500; i++) {
            bc.getData();
        }

        for (int i = 0; i <= 500; i++) {
            TimerTest timer = TimerTest.getInstance(); // line 1
            bc.getData();
            timer.getDuration(); // line 2
        }

        // this method prints out the 95th percentile
        logPercentileInfo();
    }
}
With the above code as it is, I am always seeing a 95th percentile of 14-15 ms (which is bad for my use case, since it is the end-to-end flow that I need to measure).
I am surprised: why? Is the executor framework adding all the latency here? Maybe each task is submitted, and the submitting thread waits (via future.get) until the task is finished.
My main goal is to reduce the latency here as much as possible. My use case is simple: make a URL call to one of my servers with a TIMEOUT feature enabled, meaning that if the server takes too long to respond, the whole call times out. Customers will call our code from their application, which can be multithreaded as well.
Is there anything I am missing, or some other flavor of ExecutorService I should use? How can I improve my performance here? Any suggestions would be of great help.
Any example would be greatly appreciated. I was reading about ExecutorCompletionService; I'm not sure whether I should use that or something else.
As for your observation that you are measuring 15 ms on the outside but only 3 ms on the inside: my bet is that the construction of the RestTemplate accounts for the difference. This can be fixed by refactoring.
Note that RestTemplate is a heavyweight, thread-safe object designed to be deployed as an application-wide singleton. Your current code is in critical violation of this intent.
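A minimal sketch of that refactoring, reusing the Task class from the question (url stands for the field in the original code):
class Task implements Callable<String> {
    // One shared, thread-safe instance instead of a new RestTemplate per Task
    private static final RestTemplate REST_TEMPLATE = new RestTemplate();

    public String call() throws Exception {
        return REST_TEMPLATE.getForObject(url, String.class); // url as in the original code
    }
}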
If you need asynchronous HTTP requests, you should really use an asynchronous HTTP library such as AsyncHttpClient, which is based on Netty, which in turn builds on Java NIO. That way you don't need to occupy a thread per outstanding HTTP request. AsyncHttpClient also works with Futures, so you'll have an API you are used to. It can also work with callbacks, which is the preferred style for the asynchronous approach.
However, even if you keep your current synchronous library, you should at the very least configure a timeout on the REST client instead of letting it run its course.
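For example, with a plain RestTemplate that could look like this (a sketch; the millisecond values are placeholders picked to fit the 10 ms budget, not tested recommendations):
SimpleClientHttpRequestFactory factory = new SimpleClientHttpRequestFactory();
factory.setConnectTimeout(5); // ms to establish the connection
factory.setReadTimeout(10);   // ms to wait for the response
RestTemplate restTemplate = new RestTemplate(factory);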
"...then run the program again, it will start giving me a 95th percentile of 3 ms. So I'm not sure why the end-to-end flow gives me a 95th percentile of 14-15 ms."
You are generating tasks faster than you can process them. This means the longer you run the test, the further behind it gets as tasks queue up. I would expect that with 2,000 requests you would see latencies up to 4x what you see now. The bottleneck could be on the client side (in which case more threads would help), but quite likely it is on the server side, in which case more threads could make it worse.
The default behaviour of HTTP is to establish a new TCP connection for each request. The connection time for a new TCP connection can easily be up to 20 ms, even with two machines side by side. I suggest looking at using HTTP/1.1 and maintaining a persistent connection.
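One way to get pooled, persistent connections with RestTemplate is to back it with Apache HttpClient (a sketch; assumes the httpclient dependency is available and the pool size is illustrative):
// Connections are pooled and reused across requests instead of re-established
PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
cm.setMaxTotal(20);
CloseableHttpClient httpClient = HttpClients.custom()
        .setConnectionManager(cm)
        .build();
RestTemplate restTemplate =
        new RestTemplate(new HttpComponentsClientHttpRequestFactory(httpClient));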
BTW, you can ping from one side of London to the other in 0.5 ms. However, reliably getting below 1 ms with HTTP is tricky, as the protocol is not designed for low latency; it is designed for use on high-latency networks.
Note: a user cannot perceive latencies below 25 ms, and 100 ms is plenty fast enough for most web requests. It is with that sort of assumption in mind that HTTP was designed.
I am using the JavaMail API, and the Folder class has a method called search that sometimes takes too long to execute. What I want is to run this method with an upper bound on execution time (say, 15 seconds at most), so that I can be sure it will not run for more than 15 seconds.
Pseudocode:
messages = maximumMethod(Folder.search(), 15);
Do I have to create a thread just to execute this method and use the wait method in the main thread?
The best way to do this is to create a single-threaded executor to which you can submit Callables. The return value is a Future<?> from which you can get the result, and you can also say how long to wait for that result. Here is sample code:
ExecutorService service = Executors.newSingleThreadExecutor();
Future<Message[]> future = service.submit(new Callable<Message[]>() {
    @Override
    public Message[] call() throws Exception {
        // "folder" is your javax.mail.Folder instance
        return folder.search(/*...*/);
    }
});

try {
    Message[] messages = future.get(15, TimeUnit.SECONDS);
}
catch (TimeoutException e) {
    // timeout
}
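If the timeout fires, you may also want to attempt to cancel the task, and shut the executor down once you no longer need it; for example, the catch block above could be extended like this (cancellation only helps if the underlying call responds to interruption):
catch (TimeoutException e) {
    future.cancel(true); // best effort: interrupts the worker thread
}
finally {
    service.shutdown(); // stop accepting new tasks once done
}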
You could
1. mark the current time;
2. launch a thread that will search the folder;
3. when you get the result (still in the thread), do nothing with it if the current time exceeds the time from step 1 plus 15 seconds. You won't be able to stop the connection if it is pending, but you can simply discard a late result.
Also, if you have access to the socket used to search the folder, you could set its timeout, but I fear it is fully encapsulated by JavaMail.
Regards,
Stéphane
This SO question shows how to send a timeout exception to the client code: How do I call some blocking method with a timeout in Java?
You might be able to interrupt the actual search using Thread.interrupt(), but that depends on the method's implementation. You may end up completing the action only to discard the results.
Edit
This question has gone through a few iterations by now, so feel free to look through the revisions to see some background information on the history and things tried.
I'm using a CompletionService together with an ExecutorService and Callables to concurrently call a number of functions on a few different web services through CXF-generated code. These services all contribute different information towards a single set of information I'm using for my project. The services, however, can fail to respond for a prolonged period of time without throwing an exception, prolonging the wait for the combined set of information.
To counter this I'm running all the service calls concurrently, and after a few minutes I would like to terminate any of the calls that have not yet finished, and preferably log which ones weren't done yet, either from within the Callable or by throwing a detailed Exception.
Here's some highly simplified code to illustrate what I'm doing already:
private Callable<List<Feature>> getXXXFeatures(final WiwsPortType port,
        final String accessionCode) {
    return new Callable<List<Feature>>() {
        @Override
        public List<Feature> call() throws Exception {
            List<Feature> features = new ArrayList<Feature>();
            // getXXXFeatures are methods of the WS proxy
            // that can take anywhere from a second to never to return
            for (RawFeature raw : port.getXXXFeatures(accessionCode)) {
                Feature ft = convertFeature(raw);
                features.add(ft);
            }
            if (Thread.currentThread().isInterrupted())
                log.error("XXX was interrupted");
            return features;
        }
    };
}
And the code that concurrently starts the WS calls:
WiwsPortType port = new Wiws().getWiws();

List<Future<List<Feature>>> ftList = new ArrayList<Future<List<Feature>>>();

// Counting wrapper around CompletionService,
// so I could implement ccs.hasRemaining()
CountingCompletionService<List<Feature>> ccs =
        new CountingCompletionService<List<Feature>>(threadpool);

ftList.add(ccs.submit(getXXXFeatures(port, accessionCode)));
ftList.add(ccs.submit(getYYYFeatures(port, accessionCode)));
ftList.add(ccs.submit(getZZZFeatures(port, accessionCode)));

List<Feature> allFeatures = new ArrayList<Feature>();
while (ccs.hasRemaining()) {
    // Low for testing, eventually a little more lenient
    Future<List<Feature>> polled = ccs.poll(5, TimeUnit.SECONDS);
    if (polled != null)
        allFeatures.addAll(polled.get());
    else {
        // Still jobs remaining, but unresponsive: cancel them all
        int jobsCanceled = 0;
        for (Future<List<Feature>> job : ftList)
            if (job.cancel(true))
                jobsCanceled++;
        log.error("Canceled {} feature jobs because they took too long",
                jobsCanceled);
        break;
    }
}
The problem I'm having with this code is that the Callables aren't actually canceled while waiting for port.getXXXFeatures(...) to return; somehow they keep running. As you can see from the if (Thread.currentThread().isInterrupted()) log.error("XXX was interrupted"); statement, the interrupted flag is only observed after port.getXXXFeatures returns, which only happens when the web service call completes normally, instead of the call being interrupted when I invoked cancel.
Can anyone tell me what I am doing wrong, and how I can stop a running CXF web service call after a given time period and register this information in my application?
Best regards, Tim
Edit 3: New answer.
I see these options:
Post your problem on the Apache CXF as feature request
Fix Apache CXF yourself and expose some features.
Look for options for asynchronous WS call support within the Apache CXF
Consider switching to a different WS provider (JAX-WS?)
Do your WS call yourself using RESTful API if the service supports it (e.g. plain HTTP request with parameters)
For über experts only: use true threads/thread group and kill the threads with unorthodox methods.
The CXF docs have some instructions for setting the read timeout on the HTTPURLConnection:
http://cwiki.apache.org/CXF20DOC/client-http-transport-including-ssl-support.html
That would probably meet your needs. If the server doesn't respond in time, an exception is raised and the Callable gets the exception. (Except there is a bug where it MAY hang instead; I cannot remember if that was fixed for 2.2.2 or if it's only in the SNAPSHOTs right now.)
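Programmatically, that configuration looks roughly like this (a sketch; the timeout values are placeholders):
// Configure connect/read timeouts on the HTTP conduit behind the CXF proxy
Client client = ClientProxy.getClient(port);
HTTPConduit conduit = (HTTPConduit) client.getConduit();
HTTPClientPolicy policy = new HTTPClientPolicy();
policy.setConnectionTimeout(30000); // ms to establish the connection
policy.setReceiveTimeout(120000);   // ms to wait for a response
conduit.setClient(policy);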