Cassandra Retry Policy for NoHostAvailableException [duplicate]

I am using the Datastax Cassandra driver and have a RetryPolicy setup to retry when a host is unavailable. However, I have noticed that it retries as fast as it can. I would like to change it to have an increasing delay between retries rather than hammer the cluster if it is struggling. This is particularly important for OVERLOADED request errors since I do want to retry in these scenarios, but with a substantial delay.
Where is the right place to put a delay and what is the right mechanism? Should I just throw a Thread.sleep(...) in my RetryPolicy?
I don't mind taking up a request on-the-wire slot (towards the maximum number of in-flight requests) but I am not okay with completely blocking other writes if we are not yet at the in-flight request limit.

You can implement your own retry policy that adds a delay. The simplest way is to copy the source code of the default retry policy and modify it to apply an exponential delay (or something similar) between retries.
For the exponential delay, look at the source code of http://docs.datastax.com/en/drivers/java/3.0/com/datastax/driver/core/policies/ExponentialReconnectionPolicy.html to see how it works.
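As a rough illustration of the delay calculation only, here is a minimal plain-JDK sketch. The class `BackoffDelay` and its methods are hypothetical names, not part of the DataStax driver, and how the delay gets wired into a custom RetryPolicy is left open. It assumes you would rather schedule the retry on a `ScheduledExecutorService` than call `Thread.sleep(...)`, since sleeping inside the policy may block a thread that other requests depend on.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not a driver class): exponential delay with an upper bound. */
final class BackoffDelay {
    private final long baseDelayMs;
    private final long maxDelayMs;

    BackoffDelay(long baseDelayMs, long maxDelayMs) {
        if (baseDelayMs <= 0 || maxDelayMs < baseDelayMs) {
            throw new IllegalArgumentException("need 0 < baseDelayMs <= maxDelayMs");
        }
        this.baseDelayMs = baseDelayMs;
        this.maxDelayMs = maxDelayMs;
    }

    /** Delay before retry number `attempt` (0-based): base * 2^attempt, capped at max. */
    long delayMs(int attempt) {
        if (attempt < 0) {
            throw new IllegalArgumentException("attempt must be >= 0");
        }
        int shift = Math.min(attempt, 62);
        // base * 2^shift exceeds max exactly when base > max / 2^shift;
        // comparing this way also avoids overflowing a long.
        if (baseDelayMs > (maxDelayMs >> shift)) {
            return maxDelayMs;
        }
        return baseDelayMs << shift;
    }

    /** Schedules the retry after the computed delay instead of sleeping on the calling thread. */
    ScheduledFuture<?> scheduleRetry(ScheduledExecutorService scheduler, Runnable retry, int attempt) {
        return scheduler.schedule(retry, delayMs(attempt), TimeUnit.MILLISECONDS);
    }
}
```

With a base of 100 ms and a cap of 10 s, the delays grow 100, 200, 400, ... and stay at 10000 from the eighth retry onward.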

Related

Performance of WebFlux with retry exponential backoff

I'm building a webhook service that sends events to client's URLs.
In case of failure or timeout, I need to retry the sending with exponential backoff. I have two ways to implement the logic:
Using WebClient's internal retry feature:
WebClient.create()
.post()
.uri(URL)
.exchange()
...
.retryWhen(Retry.backoff(4, Duration.ofSeconds(3)).jitter(0.7));
Another way is using a RabbitMQ dead letter exchange to re-queue the message with exponential backoff, as described here:
https://www.baeldung.com/spring-amqp-exponential-backoff
Applying the WebFlux internal retry feature is easy for development, but I have some concerns about using it because messages are held in application memory, which may affect performance when the number of messages is high.
I want to know other developers' thoughts on these options.
Also, are there any better options?

In my opinion these two approaches can be used together.
The retry strategy is the simplest way to manage very transient errors. BUT we cannot retry indefinitely.
An infinite retry may cause memory issues, but that is not the only problem. When you shut down a server, you probably want to stop incoming traffic first and expect that no computation or transaction is running during the shutdown ...
So we need another strategy for messages that still fail after some retries, and a Dead Letter Queue is a valid option.
In real life we may have several DLQs, with the default one used for the unexpected cases. Obviously we cannot hope to have a good algorithm for those cases.
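To make the combination concrete, here is a minimal plain-JDK sketch of "retry a bounded number of times, then dead-letter". The names `BoundedRetry` and `sendOrDeadLetter` are made up for illustration, the `deadLetter` callback stands in for whatever publishes to your DLQ, and the backoff delay between attempts is left out for brevity.

```java
import java.util.function.Consumer;

/** Hypothetical helper: bounded in-process retry with a dead-letter hand-off. */
final class BoundedRetry {
    /**
     * Calls `send` up to `maxAttempts` times. Returns true as soon as one
     * attempt succeeds; otherwise hands the message to `deadLetter` and
     * returns false.
     */
    static <M> boolean sendOrDeadLetter(M message, int maxAttempts,
                                        Consumer<M> send, Consumer<M> deadLetter) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                send.accept(message);
                return true;
            } catch (RuntimeException e) {
                // Treated as transient here: loop around and try again.
            }
        }
        // Retries exhausted: stop holding the message in memory.
        deadLetter.accept(message);
        return false;
    }
}
```

The point of the design is that the in-memory retry only covers short glitches, while anything that outlives the retry budget leaves the application and waits in the broker.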

Handle high number of traditional synchronous/blocking HTTP client requests with few threads Java?

I am working with Java. Another software developer has provided me his code performing synchronous HTTP calls and is responsible for maintaining it - he is using com.google.api.client.http. Updating his code to use an asynchronous HTTP client with a callback is not an available option, and I can't contact the developer to make changes to it. But I still want the efficient asynchronous behaviour of attaching a callback to an HTTP request.
(I am working in Spring Boot and my system is built using RabbitMQ AMQP if it has any effect.)
The simple HTTP GET (it is actually an API call) is performed as follows:
HttpResponse<String> response = httpClient.send(request, BodyHandlers.ofString());
The server I'm communicating with via HTTP takes some time to reply... say 3-4 seconds. So my thread of execution is blocked for this duration, waiting for a reply. This scales very poorly: my single thread isn't doing anything, it is just waiting for a reply to arrive - this is very heavy.
Sure, I can increase the number of threads performing this call if I want to send more HTTP requests concurrently, i.e. I can scale in that way, but this doesn't sound efficient or correct. If possible, I would really like to get a better ratio than 1 thread waiting for 1 HTTP request in this situation.
In other words, I want to send thousands of HTTP requests with 2-3 available threads and handle the response once it arrives; I don't want to incur any significant delay between the execution of each request.
I was wondering: how can I achieve a more scalable solution? How can I handle thousands of these HTTP calls per thread? What should I be looking at, or do I just have no options and am I asking for the impossible?
EDIT: I guess this is another way to phrase my problem. Assume I have 1000 requests to be sent right now, each will last 3-4 seconds, but only 4-5 available threads of execution on which to send them. I would like to send them all at the same time, but that's not possible; if I manage to send them ALL within the span of 0.5s or less and handle their requests via some callback or something like that, I would consider that a great solution. But I can't switch to an asynchronous HTTP client library.

Using an asynchronous HTTP client is not an available option - I can't change my HTTP client library.
In that case, I think you are stuck with non-scalable synchronous behavior on the client side.
The only work-around I can think of is to run your requests as tasks in an ExecutorService with a bounded thread pool. That will limit the number of threads that are used ... but will also limit the number of simultaneous HTTP requests in play. This is replacing one scaling problem with another one: you are effectively rate-limiting your HTTP requests.
But the flip-side is that launching too many simultaneous HTTP requests is liable to overwhelm the target service(s) and / or the client or server-side network links. From that perspective, client-side rate limiting could be a good thing.
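A minimal sketch of that work-around, using only the JDK: the blocking calls are wrapped as `Supplier`s, run on a fixed-size pool, and a callback fires as each one completes. `BoundedBlockingCalls` and `runAll` are hypothetical names, and the suppliers stand in for whatever blocking call the existing client makes. This gives callback-style handling, but each in-flight request still occupies one thread.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;
import java.util.function.Supplier;
import java.util.stream.Collectors;

/** Hypothetical helper: run blocking calls on a bounded pool and react via callbacks. */
final class BoundedBlockingCalls {
    /**
     * Runs every blocking call on a pool of at most `threads` workers and
     * invokes `onResponse` as each call completes. Returns once all calls
     * have finished. The callback runs on pool threads, so it must be
     * thread-safe. If a call throws, join() rethrows it wrapped in a
     * CompletionException.
     */
    static <T> void runAll(List<Supplier<T>> blockingCalls, int threads, Consumer<T> onResponse) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<CompletableFuture<Void>> futures = blockingCalls.stream()
                    .map(call -> CompletableFuture.supplyAsync(call, pool).thenAccept(onResponse))
                    .collect(Collectors.toList());
            // Wait for every request (and its callback) to complete.
            CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
        } finally {
            pool.shutdown();
        }
    }
}
```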
Assume I have 1000 requests to be sent right now, each will last 3-4 seconds, but only 4-5 available threads of execution on which to send them. I would like to send them all at the same time, but that's not possible; if I manage to send them ALL within the span of 0.5s or less and handle their requests via some callback or something like that, I would consider that a great solution. But I can't switch to an asynchronous HTTP client.
The only way you are going to be able to run > N requests at the same time with N threads is to use an asynchronous client. Period.
And "... callback or something like that ...". That's a feature you will only get with an asynchronous client. (Or more precisely, you can only get real asynchronous behavior via callbacks if there is a real asynchronous client library under the hood.)
So the solution is akin to sending the HTTP requests in a staggering manner i.e. some delay between one request and another, where each delay is limited by the number of available threads? If the delay between each request is not significant, I can find that acceptable, but I am assuming it would be a rather large delay between the time each thread is executed as each thread has to wait for each other to finish (3-4s)? In that case, it's not what I want.
With my proposed work-around, the delay between any two requests is difficult to quantify. However, if you are trying to submit a large number of requests at the same time and wait for all of the responses, then the delay between individual requests is not relevant. For that scenario, the relevant measure is the time taken to complete all of the requests. Assuming that nothing else is submitting to the executor, the time taken to complete the requests will be approximately:
nos_requests * average_request_time / nos_worker_threads
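Plugging in the numbers from the question gives a feel for the scale. The sketch below just evaluates the formula; `CompletionEstimate` is a made-up name, and the 3.5 s average is an assumption taken from the middle of the stated 3-4 s range.

```java
/** Hypothetical helper that evaluates the estimate above. */
final class CompletionEstimate {
    /** Approximate wall-clock seconds to finish all requests on a fixed pool. */
    static double estimateSeconds(int requests, double avgRequestSeconds, int workerThreads) {
        return requests * avgRequestSeconds / workerThreads;
    }
}
```

For 1000 requests at 3.5 s each on 5 threads, `estimateSeconds(1000, 3.5, 5)` is 700 seconds, i.e. a bit under 12 minutes.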
The other thing to note is that if you did manage to submit a huge number of requests simultaneously, the server delay of 3-4s per request is liable to increase. The server will only have the capacity to process a certain number of requests per second. If that capacity is exceeded, requests will either be delayed or dropped.
But if there are no other options.
I suppose, you could consider changing your server API so that you can submit multiple "requests" in a single HTTP request.
I think that the real problem here is there is a mismatch between what the server API was designed to support, and what you are trying to do with it.
And there is definitely a problem with this:
Another software developer has provided me his code performing synchronous HTTP calls and is responsible for maintaining it - he is using com.google.api.client.http. Updating his code to use an asynchronous HTTP client with a callback is not an available option, and I can't contact the developer to make changes to it.
Perhaps you need to "bite the bullet" and stop using his code. Work out what it is doing and replace it with your own implementation.
There is no magic pixie dust that will give scalable performance from a synchronous HTTP client. Period.

What is the maximum connection timeout to any server?

I have this simple Spring Boot based web app that downloads data from several APIs. Some of them don't respond in time, since my connectionTimeout is set to about 4 seconds.
As soon as I get rid of the connectionTimeout setting, I get an exception after 20 or so seconds.
So, my question is, for how long am I able to try to connect to an API and what does it depend on? Where do those 20 seconds come from? What if an API responds after 40 minutes of time and I won't be able to catch that specific moment and just gonna lose data. I don't want that to happen. What are my options?
Here's the code to set the connection. Nothing special.
HttpComponentsClientHttpRequestFactory clientHttpRequestFactory = new HttpComponentsClientHttpRequestFactory(HttpClientBuilder.create().build());
clientHttpRequestFactory.setConnectTimeout(4000);
RestTemplate restTemplate = new RestTemplate(clientHttpRequestFactory);
Then I retrieve the values via:
myObject.setJsonString(restTemplate.getForObject(url, String.class));

Try increasing your timeout. 4 seconds is too little.
It will need to connect, formulate data and return. So 4 seconds is just for connecting, by the time it attempts to return anything, your application has already disconnected.
Set it to 20 seconds to test it. You can set it to much longer to give the API enough time to complete. This does not mean your app will use up all of the connection timeout. It will finish as soon as a result is returned. Also, APIs are not designed to take long. They will perform the task and return the result as fast as possible.

Connection timeout means that your program couldn't connect to the server at all within the time specified.
The timeout can be configured, as, like you say, some systems may take a longer time to connect to, and if this is known in advance, it can be allowed for. Otherwise the timeout serves as a guard to prevent the application from waiting forever, which in most cases doesn't really give a good user experience.
A separate timeout can normally be configured for reading data (socket timeout). They are not inclusive of each other.
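To show the two timeouts as separate settings, here is a small sketch using the JDK's own `java.net.http` client (Java 11+) rather than the Spring/Apache setup from the question, purely so the example is self-contained. `TimeoutConfig` and its method names are hypothetical. Note that the per-request timeout in this client bounds the wait for the response rather than each individual socket read, so treat it as an analogue of a socket timeout, not an exact equivalent.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

/** Hypothetical helper: connect timeout and response timeout configured separately. */
final class TimeoutConfig {
    /** Client that gives up on establishing a connection after `connectTimeout`. */
    static HttpClient clientWithConnectTimeout(Duration connectTimeout) {
        return HttpClient.newBuilder()
                .connectTimeout(connectTimeout)
                .build();
    }

    /** GET request that fails with a timeout if no response arrives within `responseTimeout`. */
    static HttpRequest getWithResponseTimeout(String url, Duration responseTimeout) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(responseTimeout)
                .GET()
                .build();
    }
}
```

A short connect timeout combined with a much longer response timeout is a reasonable starting point when the server is reachable but slow to answer.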
To solve your problem:
Check that the server is running and accepting incoming connections.
You might want to use curl or, depending on what it is, simply your browser to try to connect.
If one tool can connect, but the other can't, check your firewall settings and ensure that outgoing connections from your Java program are permitted. The easiest way to test whether this is a problem is to disable anti virus and firewall tools temporarily. If this allows the connection, you'll either need to leave the FW off, or better add a corresponding exception.
Leave the timeout on a higher setting (or try setting it to 0, which is interpreted as infinite) while testing. Once you have it working, you can consider tweaking it to reflect your server spec and usability requirements.
Edit:
I realised that this doesn't necessarily help, as you did ultimately connect. I'll leave the above standing as general info.
for how long am I able to try to connect to an API and what does it depend on?
Most likely the server that the API is hosted on. If it is overloaded, response time may lengthen.
Where do those 20 seconds come from?
Again this depends on the API server. It might be random, or it may be processing each request for a fixed period of time before finding itself in an error state. In this case that might take 20 seconds each time.
What if an API responds after 40 minutes of time and I won't be able to catch that specific moment and just gonna lose data. I don't want that to happen. What are my options?
Use a more reliable API, possibly paying for a service guarantee.
Tweak your connection and socket timeouts to allow for the capabilities of the server side, if known in advance.
If the response is really 40 minutes, it is a really poor service, but moving on with that assumption - if the dataset is that large, explore whether the API offers a streaming callback, whereby you pass in an OutputStream into the API's library methods, to which it will (asynchronously) write the response when it is ready.
Keep in mind that connection and socket timeout are separate things. Once you have connected, the connection timeout becomes irrelevant (socket is established). As long as you begin to receive and continue to receive data (packet to packet) within the socket timeout, the socket timeout won't be triggered either.
Use infinite timeouts (set to 0), but this could lead to poor usability within your applications, as well as resource leaks if a server is in fact offline and will never respond. In that case you will be left with dangling connections.

The default and maximum have nothing to do with the server. They depend on the client platform, but the value is around a minute. You can decrease it, but not increase it. Four seconds is far too short. It should be measured in tens of seconds in most circumstances.
And absent or longer connection timeouts do not cause server errors of any kind. You are barking up the wrong tree here.

How to reconsume a rejected message later, RabbitMQ

Sometimes, due to some external problems, I need to requeue a message by basic.reject with requeue = true.
But I don't want to consume it immediately, because it will possibly fail again in a short time. If I continuously requeue it, this may result in an infinite requeue loop.
So I need to consume it later, say one minute later.
And I need to know how many times the message has been requeued, so that I can stop requeuing it and simply reject it to declare that it failed to be consumed.
PS: I am using Java client.

There are multiple solutions to point 1.
First one is the one chosen by Celery (a Python producer/consumer library that can use RabbitMQ as broker). Inside your message, add a timestamp at which the task should be executed. When your consumer gets the message, do not ack it and check its timestamp. As soon as the timestamp is reached, the worker can execute the task. (Note that the worker can continue working on other tasks instead of waiting)
This technique has some drawbacks. You have to increase the QoS per channel to an arbitrary value. And if your worker is already working on a long running task, the delayed task won't be executed until the first task has finished.
A second technique is RabbitMQ-only and is much more elegant. It takes advantage of dead-letter exchanges and message TTL. You create a new queue which isn't consumed by anybody. This queue has a dead-letter exchange that will forward the messages to the consumer queue. When you want to defer a message, ack it (or reject it without requeue) from the consumer queue and copy the message into the dead-lettered queue with a TTL equal to the delay you want (say one minute later). At (roughly) the end of the TTL, the deferred message will magically land in the consumer queue again, ready to be consumed. The RabbitMQ team has also made the Delayed Message Plugin. It is marked as experimental, yet fairly stable and potentially suitable for production use as long as the user is aware of its limitations. However, it has serious limitations in terms of scalability and reliability in case of failover, so you will have to decide whether you really want to use it in production or prefer to stick to the manual way, limited to one TTL per queue.
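As a sketch of the manual way, the delay queue is an ordinary queue declared with a few extra arguments. The helper below only builds that argument map with plain JDK types; `DelayQueueArgs` and `forDelay` are hypothetical names, while the `x-message-ttl`, `x-dead-letter-exchange` and `x-dead-letter-routing-key` keys are RabbitMQ's queue arguments. It assumes a queue-level TTL, so every message in this queue waits the same amount of time.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: arguments for a consumer-less queue that delays messages. */
final class DelayQueueArgs {
    /**
     * Messages published to a queue declared with these arguments expire
     * after `delayMs` and are then dead-lettered to `consumerExchange`
     * with `consumerRoutingKey`, i.e. back towards the consumer queue.
     */
    static Map<String, Object> forDelay(String consumerExchange, String consumerRoutingKey, int delayMs) {
        Map<String, Object> args = new HashMap<>();
        args.put("x-message-ttl", delayMs);
        args.put("x-dead-letter-exchange", consumerExchange);
        args.put("x-dead-letter-routing-key", consumerRoutingKey);
        return args;
    }
}
```

With the Java client, a map like this would be passed as the `arguments` parameter of `channel.queueDeclare(...)` when declaring the delay queue; for a one-minute delay, `delayMs` would be 60000.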
Point 2. just requires putting a counter in your message and handling this inside your app. You can choose to put this counter in a header or directly in the body.
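A minimal sketch of the header variant, again with plain JDK types only. The header name `x-retry-count` and the class `RetryCounter` are assumptions for illustration, not anything RabbitMQ defines. Since a requeued message is redelivered unchanged, the counter only advances if you republish the message with the updated headers, which fits the copy-to-delay-queue approach described above.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: tracks how often a message has been retried via a header. */
final class RetryCounter {
    static final String HEADER = "x-retry-count"; // assumed name, pick your own

    /** Current retry count, or 0 if the header is missing or not a number. */
    static int current(Map<String, Object> headers) {
        Object value = headers == null ? null : headers.get(HEADER);
        return value instanceof Number ? ((Number) value).intValue() : 0;
    }

    /** Copy of the headers with the counter incremented, for the republished message. */
    static Map<String, Object> incremented(Map<String, Object> headers) {
        Map<String, Object> copy = new HashMap<>();
        if (headers != null) {
            copy.putAll(headers);
        }
        copy.put(HEADER, current(headers) + 1);
        return copy;
    }

    /** True while the message may still be deferred; false means reject it for good. */
    static boolean shouldRetry(Map<String, Object> headers, int maxRetries) {
        return current(headers) < maxRetries;
    }
}
```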

Hystrix: Custom circuit breaker and recovery logic

I just read the Hystrix guide and am trying to wrap my head around how the default circuit breaker and recovery period operate, and then how to customize their behavior.
Obviously, if the circuit is tripped, Hystrix will automatically call the command's getFallback() method; this much I understand. But what criteria go into tripping the circuit in the first place? Ideally, I'd like to try hitting a backing service several times (say, a max of 3 attempts) before we consider the service to be offline/unhealthy and trip the circuit breaker. How could I implement this, and where?
But I imagine that if I override the default circuit breaker, I must also override whatever mechanism handles the default recovery period. If a backing service goes down, it could be for any one of several reasons:
There is a network outage between the client and server
The service was deployed with a bug that makes it incapable of returning valid responses to the client
The client was deployed with a bug that makes it incapable of sending valid requests to the server
Some weird, momentary service hiccup (perhaps the service is doing a major garbage collection, etc.)
etc.
In most of these cases, it is not sufficient to have a recovery period that merely waits N seconds and then tries again. If the service has a bug in it, or if someone pulled some network cables in the data center, we will always get failures from this service. Only in a small number of cases will the client-service automagically heal itself without any human interaction.
So I guess my next question is partially "How do I customize the default recovery period strategy?", but I guess it is mainly: "How do I use Hystrix to notify devops when a service is down and requires manual intervention?"

There are basically four reasons for Hystrix to call the fallback method: an exception, a timeout, too many parallel requests, or too many exceptions in the previous calls.
You might want to do a retry in your run() method if the return code or the exception you receive from your service indicates that a retry makes sense.
In the fallback method of the command you might retry when there was a timeout. When there were too many parallel requests or too many exceptions, it usually makes no sense to call the same service again.
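To illustrate the "too many exceptions in the previous calls" case, here is a deliberately simplified breaker in plain Java. It is not Hystrix's implementation: as far as I know, Hystrix decides based on an error percentage over a rolling window once a minimum request volume is reached, whereas this sketch simply counts consecutive failures, which is closer to the "3 attempts" idea in the question. `SimpleBreaker` and all its members are hypothetical names.

```java
import java.util.function.Supplier;

/** Hypothetical, simplified breaker: opens after N consecutive failures. */
final class SimpleBreaker {
    private final int failureThreshold;
    private final long openMillis;
    private int consecutiveFailures = 0;
    private boolean open = false;
    private long openedAt = 0L;

    SimpleBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    synchronized boolean isOpen() {
        return open;
    }

    /** Runs `primary` unless the circuit is open; uses `fallback` on failure or when open. */
    <T> T call(Supplier<T> primary, Supplier<T> fallback) {
        if (!allowRequest()) {
            return fallback.get(); // short-circuit: do not touch the service
        }
        try {
            T result = primary.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure();
            return fallback.get();
        }
    }

    private synchronized boolean allowRequest() {
        // While open, only let a trial call through once the wait has elapsed.
        return !open || System.currentTimeMillis() - openedAt >= openMillis;
    }

    private synchronized void onSuccess() {
        consecutiveFailures = 0;
        open = false;
    }

    private synchronized void onFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            open = true;
            openedAt = System.currentTimeMillis();
        }
    }
}
```

After the wait elapses this sketch may let several concurrent trial calls through, which a production breaker would normally restrict to one.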
You also asked how to notify devops: you should connect a monitoring system to Hystrix that polls the status of the circuit breaker and the ratio of successful to unsuccessful calls. You can use the metrics publishers provided, JMX, or write your own adapter using Hystrix's API. I've written two adapters for Riemann and Zabbix in a tutorial I prepared; you'll need very few lines of code for that.
The tutorial also has an example application and a load driver to try out some scenarios.
Br,
Alexander.
