best practice for handling connection errors

What is the best workflow to follow when a connection error occurs?
Let's say we have a client that connects to the middle tier.
class MyClient {
    ...
    void callServer() {
        try {
            middletier.saveRecord(123);
        } catch (...) {
            // log the error
            // what to do next (notify the user, take the same step again)?
            // reinitialize connection?
        }
    }
}
What should I do if a connection error occurs (timeout, broken connection, ...)?
Should I notify the user that the connection has a problem and ask them to try again?
Can something be done automatically and transparently for the user?
What I would like is to avoid bothering the user with errors unless necessary, and otherwise to tell the user what to do next.
So what is the best workflow for handling such errors?

I can highly recommend Michael Nygard's "Release It!", which spends quite a bit of time explaining how to make your software more robust.
http://www.pragprog.com/titles/mnee/release-it

It depends. If the action was triggered by user interaction, inform the user, who can then decide how often to retry. The code may retry by itself, but if the error is a timeout the user may end up waiting several minutes (or abort the action for lack of feedback).
If it is a background task, try again after some delay (but not indefinitely; eventually abort the action). You may reinitialize the connection to be sure; whether that helps depends on the technology used and on whether you use connection pooling.
Of course, if you want to invest more time you can handle different errors differently. First of all, differentiate permanent errors (a retry in a few minutes wouldn't help) from intermittent errors (could be OK the next time). For instance, with a broken connection you could retry with a new one (maybe the firewall dropped an open connection because of inactivity). But you probably can't do anything about a timeout (maybe a network configuration problem) or "HTTP 404 Not Found" (assuming you can't change the URL you use for an HTTP call).
You could gather all this knowledge in a "diagnosis & repair" component.
I also recommend reading "Release It!".
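The background-task advice above (retry after a delay, but not forever, and only for intermittent errors) could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: RetryingCaller, callWithRetry and PermanentException are hypothetical names, and treating every IOException as intermittent is an assumption you would adapt to your own technology.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

class RetryingCaller {
    /** Hypothetical marker for errors a retry cannot fix (for example HTTP 404). */
    static class PermanentException extends Exception {
        PermanentException(String message) { super(message); }
    }

    /**
     * Runs the action up to maxAttempts times. Only IOException is retried;
     * every other exception, including PermanentException, propagates at once.
     */
    static <T> T callWithRetry(Callable<T> action, int maxAttempts, long initialDelayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delayMillis = initialDelayMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return action.call();
            } catch (IOException e) {
                // Assumption: an IOException signals an intermittent problem.
                if (attempt >= maxAttempts) {
                    throw e; // give up; the caller decides how to inform the user
                }
                Thread.sleep(delayMillis);
                delayMillis *= 2; // back off before the next attempt
            }
        }
    }
}
```

With the client from the question, a call might look like callWithRetry(() -> { middletier.saveRecord(123); return null; }, 3, 500), with the final exception being the point at which the user is notified.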

This depends entirely on the application requirements. Sometimes it is better to inform the user immediately, and sometimes it is better to retry the request several times before informing the user. You have to discuss this with your customer / analyst.

From the perspective of the caller, MyClient: generally speaking, a failed method invocation should leave MyClient in the state it was in prior to the invocation. That means you should take care to restore the state that existed before the call to middletier.saveRecord(123);

Related

What is the maximum connection timeout to any server?

I have a simple Spring Boot based web app that downloads data from several APIs. Some of them don't respond in time, since my connectionTimeout is set to about 4 seconds.
As soon as I remove the connectionTimeout setting, I get an exception after 20 or so seconds.
So, my question is: for how long am I able to try to connect to an API, and what does it depend on? Where do those 20 seconds come from? What if an API responds after 40 minutes and I am not able to catch that specific moment and simply lose the data? I don't want that to happen. What are my options?
Here's the code that sets up the connection. Nothing special.
HttpComponentsClientHttpRequestFactory clientHttpRequestFactory = new HttpComponentsClientHttpRequestFactory(HttpClientBuilder.create().build());
clientHttpRequestFactory.setConnectTimeout(4000);
RestTemplate restTemplate = new RestTemplate(clientHttpRequestFactory);
Then I retrieve the values via:
myObject.setJsonString(restTemplate.getForObject(url, String.class));

Try increasing your timeout. 4 seconds is too little.
It will need to connect, formulate data and return. So 4 seconds is just for connecting; by the time it attempts to return anything, your application has already disconnected.
Set it to 20 seconds to test it. You can set it to much longer to give the API enough time to complete. This does not mean your app will use up all of the connection timeout. It will finish as soon as a result is returned. Also, APIs are not designed to take long. They will perform the task and return the result as fast as possible.

Connection timeout means that your program couldn't connect to the server at all within the time specified.
The timeout can be configured, as, like you say, some systems may take a longer time to connect to, and if this is known in advance, it can be allowed for. Otherwise the timeout serves as a guard to prevent the application from waiting forever, which in most cases doesn't really give a good user experience.
A separate timeout can normally be configured for reading data (socket timeout). They are not inclusive of each other.
To solve your problem:
Check that the server is running and accepting incoming connections.
You might want to use curl or, depending on what it is, simply your browser to try to connect.
If one tool can connect, but the other can't, check your firewall settings and ensure that outgoing connections from your Java program are permitted. The easiest way to test whether this is a problem is to disable anti virus and firewall tools temporarily. If this allows the connection, you'll either need to leave the FW off, or better add a corresponding exception.
Leave the timeout on a higher setting (or try setting it to 0, which is interpreted as infinite) while testing. Once you have it working, you can consider tweaking it to reflect your server spec and usability requirements.
Edit:
I realised that this doesn't necessarily help, as you did ultimately connect. I'll leave the above standing as general info.
for how long am I able to try to connect to an API and what does it depend on?
Most likely the server that the API is hosted on. If it is overloaded, response time may lengthen.
Where do those 20 seconds come from?
Again this depends on the API server. It might be random, or it may be processing each request for a fixed period of time before finding itself in an error state. In this case that might take 20 seconds each time.
What if an API responds after 40 minutes of time and I won't be able to catch that specific moment and just gonna lose data. I don't want that to happen. What are my options?
Use a more reliable API, possibly paying for a service guarantee.
Tweak your connection and socket timeouts to allow for the capabilities of the server side, if known in advance.
If the response is really 40 minutes, it is a really poor service, but moving on with that assumption - if the dataset is that large, explore whether the API offers a streaming callback, whereby you pass in an OutputStream into the API's library methods, to which it will (asynchronously) write the response when it is ready.
Keep in mind that connection and socket timeout are separate things. Once you have connected, the connection timeout becomes irrelevant (socket is established). As long as you begin to receive and continue to receive data (packet to packet) within the socket timeout, the socket timeout won't be triggered either.
Use infinite timeouts (set to 0), but this could lead to poor usability within your applications, as well as resource leaks if a server is in fact offline and will never respond. In that case you will be left with dangling connections.
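To illustrate that connection timeout and socket (read) timeout are separate settings, here is a small sketch using plain java.net sockets rather than the RestTemplate setup from the question. TimeoutDemo and probeSilentServer are hypothetical names, and the "server" is just a local listening socket that completes the TCP handshake but never sends anything.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

class TimeoutDemo {
    /**
     * Connects to a local server that never sends data and reports
     * whether the read timeout fired after a successful connect.
     */
    static String probeSilentServer(int connectTimeoutMillis, int readTimeoutMillis) {
        try (ServerSocket silentServer = new ServerSocket(0); // port 0: any free port
             Socket client = new Socket()) {
            // Connection timeout: limits only the time spent establishing the socket.
            client.connect(
                    new InetSocketAddress("127.0.0.1", silentServer.getLocalPort()),
                    connectTimeoutMillis);
            // Socket (read) timeout: limits how long each blocking read may wait.
            client.setSoTimeout(readTimeoutMillis);
            try {
                client.getInputStream().read();
                return "data received";
            } catch (SocketTimeoutException e) {
                return "connected, but read timed out";
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this sketch the connect succeeds almost immediately, so the connection timeout never matters; only the read timeout decides how long the client waits for a reply.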

The default and maximum have nothing to do with the server. They depend on the client platform, but the value is around a minute. You can decrease it, but not increase it. Four seconds is far too short. It should be measured in tens of seconds in most circumstances.
And absent or longer connection timeouts do not cause server errors of any kind. You are barking up the wrong tree here.

Should I ever throw a retriable exception from my service

My service is DService and I am fourth link in the chain of services i.e. the call flow is
Online User -> AService -> BService -> CService -> DService -> EService.
When I invoke EService from DService, it can throw a retriable exception such as HttpTimeoutException. I typically retry 2-3 times and throw an exception back if the call still fails after those retries.
My question is, the exception which I am throwing to CService, should that be retriable or non-retriable? Please find below my evaluation of Pros & Cons of both options
Cons of Throwing Retriable exception from DService
- If DService throws a retriable exception, then by the same convention CService might also retry DService 2-3 times, and on each C-to-D call D will again try EService 2-3 times. The number of calls that ultimately reach EService therefore grows exponentially as we go up the call chain. So if the EService network really were down for a long time, we would be talking about a large number of unnecessary calls. This can be mitigated by having timeouts for each call in the chain, but I am not sure that is enough mitigation against the unnecessary calls.
Pros of Throwing Retriable exception from DService
- CService will retry after some time, and in the subsequent retries we might get a correct value (within the time limits).
- Especially if the clients are backend jobs, they can retry with exponential backoff for a long time before giving up. Throwing a non-retriable exception would rule out this option.
Please provide your views and suggestions on this.
Thanks,
Harish

Without knowing what the services do, whether or not DService should retry or CService should, I cannot say for sure. However my philosophy is that the service being called should not be the one to retry, ever. In this case, EService would throw an exception stupidly and without any handling whatsoever. The reason behind this is because the end of the chain should be stateless and should not make decisions on behalf of the caller.
The caller can dictate, to a certain extent and within the confines of what is acceptable, whether the call should be reattempted after an error. In other words, if EService attempts to connect to a database and DService is performing a lookup service, then it may be in the scope of DService to say that if certain information isn't found in one table, it should check in another table instead. However, a failure by EService to connect to the database goes over the head of DService, whose scope is simply to return the information requested by CService.
CService, having made the call to retrieve certain information, may then receive the database connection error and, depending on what it does, retry a number of times after a delay, because it is performing batch work on that data and will keep retrying until the database is back online. Or, if it is retrieving information to show to the user on a web page, it must fail fast and deal with the database connection error by presenting a nice error message to the user instead.
It entirely depends on what your services do and where their responsibilities lie. Likewise, whether an exception is retriable or not should again depend on the caller's necessity, not the service itself. It is perfectly viable to present a retriable exception to the caller that is only attempted once.
Hope that helps!

I think throwing retriable exceptions is a viable approach if you define exponentially increasing retry-periods up on the chain.

I'd say you shouldn't retry in DService in the first place because, as you say, if each service did that you could be facing trouble. Hence, let the exception bubble up the call stack and let it be handled at the outermost service possible; that could even be the user.
Rationale: Why would it be on DService to decide if CService, BService or AService would want to retry or not?
However, I think it also depends on the frequency of the exception and the success rate of retries. If the exception occurs frequently but the call usually succeeds upon the first or second retry it's another thing than an exception which happens once a day and/or retrying is futile most of the time.
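The "trouble" from every service retrying can be put into numbers. The sketch below assumes, purely for illustration, that each retrying hop makes the same fixed number of attempts; RetryAmplification and worstCaseCalls are hypothetical names.

```java
class RetryAmplification {
    /**
     * Worst-case number of calls that reach the last service in a chain when
     * each of retryingHops callers makes attemptsPerHop attempts.
     */
    static long worstCaseCalls(int attemptsPerHop, int retryingHops) {
        long calls = 1;
        for (int hop = 0; hop < retryingHops; hop++) {
            calls = Math.multiplyExact(calls, attemptsPerHop);
        }
        return calls;
    }

    public static void main(String[] args) {
        // AService, BService, CService and DService each make up to 3 attempts.
        System.out.println(worstCaseCalls(3, 4)); // prints 81
        // Only the outermost service makes up to 3 attempts.
        System.out.println(worstCaseCalls(3, 1)); // prints 3
    }
}
```

For the chain in the question, four retrying hops with 3 attempts each can produce up to 81 calls to EService for a single user request, compared with 3 when only one layer retries.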

What you throw at your invokers, and whether what you throw at them will also carry a suggestion "but you could retry this", should be determined by the intended semantics of your service exclusively.
(Besides, I have never heard of Java Exception objects formally carrying any such property, but that might be because I'm lagging behind a bit.)
EDIT.
Whether you "retry" an operation that failed, is for you (and you alone) to decide. However, if you do decide to retry, it is also your responsibility to decide after how many failures you are going to stop retrying and call it a day, and at that point it is most certainly unwise to throw an exception to your caller that suggests he can "retry" as well.

Working with google API on an unstable network

I have written a management application which has a function to put a bunch of events into multiple Google calendars.
On my computer everything works fine. But the main user of this application has a very bad network connection. More precisely, the ping to different servers varies between 23 ms and about 2000 ms, and packets get lost.
My approach was, besides increasing the timeout, to use a separate thread for each API call and to repeat the call in case of a connection error.
And at this point I got stuck. Now every event is created. Unfortunately, not exactly once but at least once, so some events were uploaded multiple times.
I have already tried to group them into batch requests, but Google doesn't accept events for multiple calendars in a single batch request.
I hope my situation is clear and someone has a solution for me.

I would first try to persuade the "main user" to get a better network connection.
If that is impossible, I would change the code to have the following logic:
// Current version
createEvent(parameters)

// New version
while (queryEvent(parameters) -> no event) {
    createEvent(parameters)
}
with appropriate timeouts and retry counters. The idea is to implement some extra logic to make the creation of an event in the calendar idempotent. (This may entail generating a unique identifier on the client side for each event so that you can query the events reliably.)
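A slightly fuller sketch of that query-then-create loop, with a retry counter, might look like this. The Calendar interface here is a hypothetical stand-in for whatever client wraps the remote service; it is not the Google Calendar API, and the event identifier is assumed to be generated on the client side.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

class IdempotentEventUploader {
    /** Hypothetical stand-in for a remote calendar client; not the Google API. */
    interface Calendar {
        boolean exists(String eventId) throws IOException;
        void create(String eventId, String summary) throws IOException;
    }

    /**
     * Creates the event unless a previous attempt already succeeded.
     * Checking first makes a retry safe when only the response was lost.
     */
    static void ensureCreated(Calendar calendar, String eventId, String summary, int maxAttempts) {
        IOException lastFailure = new IOException("no attempt was made");
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                if (!calendar.exists(eventId)) {
                    calendar.create(eventId, summary);
                }
                return; // the event is known to exist now
            } catch (IOException e) {
                lastFailure = e; // query or create failed: check again on the next pass
            }
        }
        throw new UncheckedIOException("gave up after " + maxAttempts + " attempts", lastFailure);
    }
}
```

The identifier has to be generated once per event (for example with UUID.randomUUID().toString()) and reused for every retry; generating a fresh one per attempt would bring the duplicates back.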

Using GAE Task Queues for handling persistence operations

Looking to get opinions on whether or not it would be a good idea to hand persistence operations off to a task queue. For example, a user submits a new 'order', I use bean validation to verify that everything is ok, and then hand over the processing/persisting of the order to a task queue, and respond back to the user faster.
My hesitation is that the persistence could fail, although once I've validated the bean the chances are low. Are task queues usually used to handle tasks that are relatively trivial? My main concern is what happens if a task in the task queue fails: since it runs asynchronously, how can I notify the user?

Tasks will retry automatically. If the failure is caused by the infrastructure, the task will be completed on a subsequent try. So you need to worry only about cases where a failure was caused by your code (code bug) or data (validation bug). If you iron out the bugs, you can use tasks with no hesitations and don't worry about the notifications.
In either case, if processing an order takes a couple of seconds, I probably wouldn't bother with task queues. From a user experience perspective, users want to feel that the app did some work with their order, so a 1-2 seconds delay with response is acceptable and even expected.

We have implemented a huge app for logistics flows, and some of our processes take 2-3 minutes to read a lot of data from BigQuery, do the work and send an e-mail with attachments.
To notify the user you can use the Channel API and/or send an e-mail.
You'll have to pass the user id, mail address or something similar in the task parameters, because the task is run by the system.
You can't ask App Engine for the currently logged-in user; it will be null every time.
As Andrei said:
you need to worry only about cases where a failure was caused by your code (code bug) or data (validation bug).
Don't let an exception escape from the task, otherwise the entire task will be run again.

HttpURLConnections ignore timeouts and never return

We are randomly getting some unexpected results from some servers when trying to open an InputStream from an HttpURLConnection. It seems that those servers accept the connection and reply with a "stay-alive" header, which keeps the socket open but never sends any data back on the stream.
That scenario makes a multi-threaded crawler a little "complicated": if a connection gets stuck, the thread running it never returns, so its pool never completes and the controller thinks that some threads are still working.
Is there some way to read the connection's response header to identify that "stay-alive" answer and avoid trying to open the stream?

I'm not sure what I'm missing here but it seems to me you simply need getHeaderField()?

Did you try setting "read time out", in addition to "connect time out"?
See http://java.sun.com/j2se/1.5.0/docs/api/java/net/URLConnection.html#setReadTimeout%28int%29
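As a rough sketch of that suggestion, a crawler thread could set both timeouts before touching the response, so that a server which accepts the connection and then stays silent cannot block it forever. CrawlerFetch and fetchStatus are hypothetical names, and returning -1 on a timeout is just one possible convention.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.SocketTimeoutException;
import java.net.URI;

class CrawlerFetch {
    /** Returns the HTTP status code, or -1 if a connect or read timeout fired. */
    static int fetchStatus(String url, int connectTimeoutMillis, int readTimeoutMillis) {
        HttpURLConnection connection = null;
        try {
            connection = (HttpURLConnection) URI.create(url).toURL().openConnection();
            connection.setConnectTimeout(connectTimeoutMillis); // time allowed to establish the connection
            connection.setReadTimeout(readTimeoutMillis);       // time allowed to wait for data on a read
            return connection.getResponseCode();
        } catch (SocketTimeoutException e) {
            return -1; // the thread is released instead of hanging
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (connection != null) {
                connection.disconnect();
            }
        }
    }
}
```

A timeout of 0 means "wait indefinitely" for both setters, which matches the hanging behaviour described in the question.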
