Hystrix: Custom circuit breaker and recovery logic - java

I just read the Hystrix guide and am trying to wrap my head around how the default circuit breaker and recovery period operate, and then how to customize their behavior.
Obviously, if the circuit is tripped, Hystrix will automatically call the command's getFallback() method; this much I understand. But what criteria go into tripping the circuit in the first place? Ideally, I'd like to try hitting a backing service several times (say, a max of 3 attempts) before we consider the service to be offline/unhealthy and trip the circuit breaker. How could I implement this, and where?
But I imagine that if I override the default circuit breaker, I must also override whatever mechanism handles the default recovery period. If a backing service goes down, it could be for any one of several reasons:
There is a network outage between the client and server
The service was deployed with a bug that makes it incapable of returning valid responses to the client
The client was deployed with a bug that makes it incapable of sending valid requests to the server
Some weird, momentary service hiccup (perhaps the service is doing a major garbage collection, etc.)
etc.
In most of these cases, it is not sufficient to have a recovery period that merely waits N seconds and then tries again. If the service has a bug in it, or if someone pulled some network cables in the data center, we will always get failures from this service. Only in a small number of cases will the client-service automagically heal itself without any human interaction.
So I guess my next question is partially "How do I customize the default recovery period strategy?", but I guess it is mainly: "How do I use Hystrix to notify devops when a service is down and requires manual intervention?"

There are basically four reasons for Hystrix to call the fallback method: an exception, a timeout, too many parallel requests, or too many exceptions in the previous calls.
You might want to do a retry in your run() method if the return code or the exception you receive from your service indicates that a retry makes sense.
In the fallback method of the command you might retry when there was a timeout. When there were too many parallel requests or too many exceptions, it usually makes no sense to call the same service again.
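As a rough illustration of the "max of 3 attempts" idea from the question, here is a minimal plain-Java sketch of a bounded retry that could be called from inside run(). BoundedRetry is a hypothetical helper, not part of Hystrix. Keep in mind that all attempts together would still have to fit inside the command's timeout.

```java
import java.util.function.Supplier;

// Hypothetical helper, not part of Hystrix: retries a call a bounded number of times.
final class BoundedRetry {
    private BoundedRetry() {}

    // Calls the task up to maxAttempts times and returns the first successful result.
    // If every attempt throws, the last exception is rethrown, so the caller
    // (for example a command's run() method) fails and the fallback can take over.
    static <T> T call(Supplier<T> task, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.get();
            } catch (RuntimeException e) {
                last = e; // remember the failure and try again
            }
        }
        throw last;
    }
}
```

A real version would probably retry only on errors that are known to be transient and wait briefly between attempts.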
You also asked how to notify devops: you should connect a monitoring system to Hystrix that polls the status of the circuit breaker and the ratio of successful and unsuccessful calls. You can use the metrics publishers provided, JMX, or write your own adapter using Hystrix's API. I've written two adapters for Riemann and Zabbix in a tutorial I prepared; you'll need very few lines of code for that.
The tutorial also has an example application and a load driver to try out some scenarios.
Br,
Alexander.

Related

How to shutdown actors in specific order with ability to send message before death?

I have an actor hierarchy, with a WebsocketActor in the top of it.
It handles the connection to third-party service and has n SessionActor actors that handle complex logic.
Now, when the system shuts down, I want every SessionActor to send its last message to the WebsocketActor (like "ok, terminate my session").
But since the inbox is already stopped for the WebsocketActor, this just doesn't work.
I tried the shutdownGracefully() method, and I am aware of the Reaper pattern, but I am not sure it will fit my case.
The stack I use is Akka + Netty and javax.websocket.* to handle the third-party websocket service.
There is no standard mechanism in Akka that allows exactly what you require.
You have to implement such a "clean up" mechanism on your own.
This mechanism has to be started and, quite importantly, has to finish before actor system termination is initiated.
Though implementing such a mechanism is possible, you might want to reconsider using Akka in general.
If you require guaranteed delivery of messages in your system, Akka is not suited for you. It comes with an "at most once delivery" policy, which means that there is no built-in logic for message redelivery and acknowledgements. Check out message-delivery-reliability.

What is advantage of circuit breaker design pattern in API architecture?

Sorry if this question is not a fit for SO,
but I have looked a lot for an answer.
I was studying the Circuit Breaker design pattern. As I understand it, it's used to make your API fault tolerant. What confuses me is this:
Let's say I have an API which calls a payment API, and let's say I configured my circuit to open if 5 calls fail in a row.
Now, as per the circuit breaker design, after the circuit opens I will route subsequent calls to a fallback method, let's say the next 5 calls. On the 6th call I will call the payment API again, and if the API is online I will close the circuit.
But I don't see any advantage in this pattern. What is the difference between a catch block and a circuit breaker?
And what can we do in the fallback method? How can this help?
Our colleagues have already shown the advantages of a circuit breaker, so I'll concentrate on practical examples.
So, looking at your example, you have a flow that has to call a payment API. Let's assume it's an "external" component. Without this call, the whole flow probably can't be considered "successfully completed" (I understand you have some online process that has a payment as one of its essential steps).
In this case a circuit breaker indeed probably won't be that useful, unless as a fallback you store the payment command in some kind of intermediate storage and "re-apply" the payment logic later.
The whole point of a circuit breaker is to provide a reasonable fallback so that the flow won't be considered as failed if the fallback logic gets applied.
Here is another example where a circuit breaker makes much more sense.
Say you build a "Netflix-like" portal, and in the UI there is a widget that shows "recommended" movies. The recommendation engine takes into consideration the movies you've seen / liked before. Technically this is an "external system"/microservice.
Now, what if the flow that populates the data for the UI is not able to get the recommendations (for example, if the recommendation service is down)? Will you "fail" the whole flow?
Probably not; maybe it's better to show some "generic list" of recommended movies and let the user keep working with the application.
In this case, the Circuit breaker can be a good choice for implementing the call to external recommendation service.
Now, what's the difference between this flow and exception handling?
If the reason for the failure of the recommendation system is some temporary network outage or database slowness, it's probably best to give the service some time and not call it over and over again; we should give it a chance to "recuperate".
When we apply a circuit breaker, during the "open-circuit" period our code won't even try to call the server, directly routing to the fallback method.
A regular try/catch block, on the other hand, will always call the recommendation service.
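To make the difference concrete, here is a minimal plain-Java sketch of a breaker. It is an illustration only: SimpleCircuitBreaker, its threshold and its open period are assumptions made for this example, it is not thread-safe, and it is not how Hystrix or any other library implements the pattern. While the circuit is open, the remote supplier is never invoked; after the open period, one trial call decides whether the circuit closes again.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Minimal illustrative circuit breaker; hypothetical, single-threaded, not production-ready.
final class SimpleCircuitBreaker<T> {
    private final int failureThreshold; // consecutive failures that open the circuit
    private final long openMillis;      // how long the circuit stays open
    private final LongSupplier clock;   // injectable time source, in milliseconds
    private int consecutiveFailures = 0;
    private boolean open = false;
    private long openedAt = 0;

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    T call(Supplier<T> remote, Supplier<T> fallback) {
        if (open && clock.getAsLong() - openedAt < openMillis) {
            return fallback.get(); // open: skip the remote call entirely
        }
        try {
            T result = remote.get(); // closed, or open period elapsed: (trial) call
            consecutiveFailures = 0;
            open = false;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (consecutiveFailures >= failureThreshold) {
                open = true; // open (or re-open) the circuit
                openedAt = clock.getAsLong();
            }
            return fallback.get();
        }
    }
}
```

With a threshold of 3, ten calls in a row against a dead service reach that service only three times; a plain try/catch would hit it all ten times.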
So to wrap up, a circuit breaker is a pattern for fault tolerance, as was stated in the question; it's a tool that is applicable in some situations and irrelevant in others.
It is true that a great use of a circuit breaker is to make an API fault tolerant.
Like what is difference between catch block and circuit breaker.
The major difference between a catch block and circuit breaker is the handling of the failure case.
Suppose we are calling an api of an external system.
Let's say the API call failed.
If we use a catch block, we will catch the exception and call a fallback method.
Next time we will call the same API while the external system is still down, so the same cycle of events will occur again.
This will unnecessarily bombard the suffering external system, consume system resources and also increase your API response time.
If we use the circuit breaker pattern, then our first call fails and we open the circuit. Next time we won't even call the external system, and will directly follow the fallback path. Voila, everything is handled!
And what can we do in fall back method? How can this help?
One good example for a fallback method is as follows.
We have a payments system which routes payments to different payment gateways.
Let's say we get an error from a particular payment gateway; then in the fallback method we can route the payment to a different payment gateway.
You can also go through this article to understand more about this topic : https://martinfowler.com/bliki/CircuitBreaker.html
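The gateway rerouting described above could look roughly like the following sketch. GatewayFallback and the idea of modelling a gateway as a function from an amount to a receipt string are assumptions made for this example only.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch: tries payment gateways in order and falls back to the next on failure.
final class GatewayFallback {
    private GatewayFallback() {}

    // Each gateway is modelled as a function from an amount in cents to a receipt string.
    static String pay(long amountCents, List<Function<Long, String>> gateways) {
        RuntimeException last = new IllegalStateException("no gateways configured");
        for (Function<Long, String> gateway : gateways) {
            try {
                return gateway.apply(amountCents);
            } catch (RuntimeException e) {
                last = e; // this gateway failed, fall back to the next one
            }
        }
        throw last; // every gateway failed
    }
}
```

For real payments, a fallback like this is only safe when the failed gateway definitely did not charge the customer; an ambiguous failure such as a timeout needs extra care.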
The goal of this design pattern is to encapsulate the logic for handling unexpected, repeated errors.
The Wikipedia page for this pattern has helpful sections explaining the types of situations where you would want to implement this logic to avoid making requests that you expect to fail.
What is the advantage of this pattern
This pattern is advantageous when you run into a situation where you know a service will be unavailable, and you want custom behavior until the service comes back online. For example, during a database migration, it would make sense to circuit break requests into a queue until the migration has finished.
What is difference between catch block and circuit breaker
I think this difference is the difference between concept and implementation. Detecting the presence of a situation where you want to "open the circuit" might mean detecting errors in a catch block and counting them, as in your example. The handling is not limited to just errors, however.
In my example, the backend might inform the frontend that a migration is about to occur, causing an "open circuit" on the frontend until it receives a migration finished message.

Akka system from a QA perspective

I had been testing an Akka based application for more than a month now. But, if I reflect upon it, I have following conclusions:
Akka actors alone can achieve a lot of concurrency. I have reached more than 100,000 messages/sec. This is fine, and it is just message passing.
Now, if there is a Netty layer for connections at one end, or you end up with Akka actors eventually doing DB calls, REST calls, or writing to files, the whole system doesn't make sense anymore. The actors' mailboxes fill up and their throughput (here, the ability to receive msgs/sec) drops.
From a QA perspective, this is like having a huge pipe into which you can forcefully pump a lot of water, and it can handle it. But if the input hose is bad, or the endpoints cannot handle the pressure, this huge pipe is of no use.
I need answers for the following so that I can suggest or verify in the system:
Should blocking calls like DB calls and REST calls be handled by actors? Or are actors good only for message passing?
Let's say you need to persistently connect millions of Android/iOS devices to your Akka system. Instead of sockets (so unreliable) etc., can a remote actor be implemented as a persistent connection?
Is it OK to do any sort of computation in an actor's handleMessage()? Like DB calls etc.
I would ask the editors to let this post through. I cannot ask all of these separately.
1) Yes, they can. But this operation should be done in a separate (worker) actor that uses a fork-join pool in combination with scala.concurrent.blocking around the blocking code; this is needed to prevent thread starvation. If the target system (DB, REST and so on) supports several concurrent connections, you may use Akka's routers for that (creating one actor per connection in the pool). You can also create several actors for several different tables (resources, queues etc.), depending on your transaction isolation and your storage's consistency requirements.
Another way to handle this is to use asynchronous requests with acknowledgements instead of blocking. You may also put the blocking operation inside a separate future (thread, worker), which will send an acknowledgement message when the operation ends.
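The "separate future that acknowledges at the end" idea can be sketched in plain Java without any Akka API. BlockingWorkOffloader is a hypothetical name, and the onDone callback stands in for the acknowledgement message a worker would send back to the requesting actor.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Plain-Java illustration (no Akka): blocking work runs on its own small pool and
// reports back through a callback, the way a worker would send an acknowledgement.
final class BlockingWorkOffloader implements AutoCloseable {
    private final ExecutorService blockingPool;

    // The pool size caps how many blocking calls (e.g. DB connections) run at once.
    BlockingWorkOffloader(int maxConcurrentBlockingCalls) {
        this.blockingPool = Executors.newFixedThreadPool(maxConcurrentBlockingCalls);
    }

    // Runs the blocking call on the dedicated pool and hands the result to onDone.
    // The caller's thread returns immediately with a future it may ignore or join.
    <T> CompletableFuture<Void> submit(Supplier<T> blockingCall, Consumer<T> onDone) {
        return CompletableFuture.supplyAsync(blockingCall, blockingPool).thenAccept(onDone);
    }

    @Override
    public void close() {
        blockingPool.shutdown(); // lets already submitted work finish
    }
}
```

The point of the dedicated pool is that slow DB or REST calls can only exhaust their own threads, not the threads that keep message passing going.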
2) Yes, an actor may be implemented as a persistent connection. It will be just an actor that holds the connection's state (as actors are stateful). It may be made even more reliable using Akka Persistence, which can save the connection state to some storage.
3) You can do any non-blocking computation inside the actor's receive (there is no handleMessage method in Akka). Failures (like no connection to the DB) will be managed automatically by Akka supervision. For blocking code, see 1.
P.S. About the "huge pipe": the backend application itself is a pipe (which becomes huge with Akka), so nothing can help you improve performance if the environment can't handle it; there are no pumps in this world. But Akka is also a "water tank", which means that the outer pressure may be stronger than the inner one. By the way, this means the developer should be careful with mailboxes: "too much water" may cause an OutOfMemoryError, and the way to prevent that is to organize back pressure. This can be done by not acknowledging an incoming message (or simply blocking an endpoint's handler) until it has been processed by Akka.
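A minimal plain-Java picture of the back pressure idea: a bounded mailbox that refuses new messages when it is full instead of growing until memory runs out. BoundedMailbox is a hypothetical class for illustration, not an Akka mailbox type.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Plain-Java illustration (no Akka): a bounded mailbox that pushes back on senders.
final class BoundedMailbox<M> {
    private final BlockingQueue<M> queue;

    BoundedMailbox(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // Returns false instead of growing without bound; the sender must slow down or retry.
    boolean offer(M message) {
        return queue.offer(message);
    }

    // Returns the next message, or null if the mailbox is empty.
    M poll() {
        return queue.poll();
    }

    int size() {
        return queue.size();
    }
}
```

An endpoint handler that gets false back from offer can stop reading from its connection until there is room again, which is the "not acknowledging" behaviour described above.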
I'm not sure I can understand all of your question, but in general actors are good also for slow work:
1) Yes, they are perfectly fine. Just create/assign 1 actor per every request (maybe behind an akka router for load balancing), and once it's done it can either mark itself as "free for new work" or self-terminate. Remember to execute the slow code in a future. Personally, I like avoiding the ask/pipe pattern due to the implicit timeouts and exception swallowing, just use tells with request id's, but if your latencies and error rates are low, go for ask/pipe.
2) You could, but in that case I'd suggest having a pool of connections rather than spawning them per-request, as that takes longer. If you can provide more details, I can maybe improve this answer.
3) Yes, but think about this: actors are cheap. Create millions of them; every time there is a blocking part, it should be a different, specialized actor. Bring single-responsibility to the extreme. If you have only a few blocking actors, you lose all the benefits.

Synchronous, Asynchronous and Command Client Requests with GWT and GAE

In designing my GWT/GAE app, it has become evident to me that my client-side (GWT) will be generating three types of requests:
Synchronous - "answer me right now! I'm important and require a real-time response!!!"
Asynchronous - "answer me when you can; I need to know the answer at some point but it's really not all that urgent."
Command - "I don't need an answer. This isn't really a request, it's just a command to do something or process something on the server-side."
My game plan is to implement my GWT code so that I can specify, for each specific server-side request (note: I've decided to go with RequestFactory over traditional GWT-RPC for reasons outside the scope of this question), which type of request it is:
SynchronousRequest - Synchronous (from above); sends a command and eagerly awaits a response that it then uses to update the client's state somehow
AsynchronousRequest - Asynchronous (from above); makes an initial request and is somehow notified, either through polling or the GAE Channel API, when the response is finally received
CommandRequest - Command (from above); makes a server-side request and does not wait for a response (even if the server fails to, or refuses to, oblige the command)
I guess my intention with SynchronousRequest is not to produce a totally blocking request, however it may block the user's ability to interact with a specific Widget or portion of the screen.
The added kicker here is this: GAE strongly enforces a timeout on all of its frontend instances (60 seconds). Backend instances have much more relaxed constraints for timeouts, threading, etc. So it is obvious to me that AsynchronousRequests and CommandRequests should be routed to backend instances so that GAE timeouts do not become an issue with them.
However, if GAE is behaving badly, or if we're hitting peak traffic, or if my code just plain sucks, I have to account for the scenario where a SynchronousRequest is made (which would have to go through a timeout-regulated frontend instance) and will time out unless my GAE server code does something fancy. I know there is a method in the GAE API that I can call to see how many milliseconds a request has left before it times out; the name of it escapes me right now, but it's what this "fancy" code would be based on. Let's call it public static long GAE.timeLeftOnRequestInMillis() for the sake of this question.
In this scenario, I'd like to detect that a SynchronousRequest is about to time out, and somehow dynamically convert it into an AsynchronousRequest so that it doesn't time out. Perhaps this means sending an AboutToTimeoutResponse back to the client and forcing the client to decide whether to resend as an AsynchronousRequest or just fail. Or perhaps we can just transform the SynchronousRequest into an AsynchronousRequest and push it to a queue where a backend instance will consume it, process it and return a response. I don't have any preferences when it comes to implementation, so long as the request doesn't fail or time out because the server couldn't handle it fast enough (because of GAE-imposed regulations).
So then, here is what I'm actually asking here:
How can I wrap a RequestFactory call inside SynchronousRequest, AsynchronousRequest and CommandRequest in such a way that the RequestFactory call behaves the way each of them is intended? In other words, so that the call either partially-blocks (synchronous), can be notified/updated at some point down the road (asynchronous), or can just fire-and-forget (command)?
How can I implement my requirement to let a SynchronousRequest bypass GAE's 60-second timeout and still get processed without failing?
Please note: timeout issues are easily circumvented by re-routing things to backend instances, but backends don't/can't scale. I need scalability here as well (that's primarily why I'm on GAE in the first place!) - so I need a solution that deals with scalable frontend instances and their timeouts. Thanks in advance!
If the computation that you want GAE to do is going to take longer than 60 seconds, then don't wait for the results to be computed before sending a response. According to your problem definition, there is no way to get around this. Instead, clients should submit work orders, and wait for a notification from the server when the results are ready. Requests would consist of work orders, which might look something like this:
class ComputeDigitsOfPiWorkOrder {
    // Parameters for the computation.
    int numberOfDigitsToCompute;

    // Used by the GAE app to contact the requester when results are ready.
    ClientId clientId;
}
This way, your GAE app can respond as soon as the work order is saved (e.g. in Task Queue), and doesn't have to wait until it actually finishes calculating a billion digits of pi before responding. Your GWT client then waits for the result using the Channel API.
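To show the shape of this flow without depending on GAE, here is a small in-memory sketch. Everything in it is an assumption for illustration: WorkOrderService is a hypothetical class, the ArrayDeque stands in for Task Queue, the notifier callback stands in for the Channel API, and the "computation" just cuts digits from a constant.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.BiConsumer;

// Hypothetical in-memory stand-in for the Task Queue + Channel API flow; no GAE APIs are used.
final class WorkOrderService {
    // A saved work order: who asked, and what to compute.
    private static final class Order {
        final String clientId;
        final int numberOfDigits;

        Order(String clientId, int numberOfDigits) {
            this.clientId = clientId;
            this.numberOfDigits = numberOfDigits;
        }
    }

    private static final String PI = "3.14159265358979"; // 14 decimals, enough for a demo
    private final Queue<Order> queue = new ArrayDeque<>();
    private final BiConsumer<String, String> notifier; // (clientId, result)

    WorkOrderService(BiConsumer<String, String> notifier) {
        this.notifier = notifier;
    }

    // Request handler: only saves the order, so it returns long before any timeout.
    // The returned queue position serves as an immediate acknowledgement.
    int submit(String clientId, int numberOfDigits) {
        queue.add(new Order(clientId, numberOfDigits));
        return queue.size();
    }

    // Worker side: processes one queued order and pushes the result to the client.
    // Returns false when there was nothing to do.
    boolean processNext() {
        Order order = queue.poll();
        if (order == null) {
            return false;
        }
        int decimals = Math.max(0, Math.min(order.numberOfDigits, 14));
        notifier.accept(order.clientId, PI.substring(0, 2 + decimals));
        return true;
    }
}
```

The request handler and the worker never wait on each other, which is what keeps the frontend request well under the timeout.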
In order to give some work orders higher priority, you can use multiple task queues. If you want Task Queue work to scale automatically, you'll want to use push queues. Implementing priority using push queues is a little tricky, but you can configure high-priority queues to have a faster feed rate.
You could replace Channel API with some other notification solution, but that would probably be the most straightforward.

Is MQ publish/subscribe domain-specific interface generally faster than point-to-point?

I'm working on the existing application that uses transport layer with point-to-point MQ communication.
For each of the given list of accounts we need to retrieve some information.
Currently we have something like this to communicate with MQ:
responseObject getInfo(requestObject) {
    // code to send message to MQ
    // code to retrieve message from MQ
}
As you can see we wait until it finishes completely before proceeding to the next account.
Due to performance issues we need to rework it.
There are 2 possible scenarios that I can think of at the moment.
1) Within the application, create a bunch of threads that each execute the transport adapter for one account. Then get the data from each task. I prefer this method, but some team members argue that the transport layer is a better place for such a change and that we should place the extra load on MQ instead of on our application.
2) Rework the transport layer to use the publish/subscribe model.
Ideally I want something like this:
void send(requestObject) {
    // code to send message to MQ
}

responseObject receive() {
    // code to retrieve message from MQ
}
Then I will just send requests in the loop, and later retrieve data in the loop. The idea is that while first request is being processed by the back end system we don't have to wait for the response, but instead send next request.
My question: is it going to be a lot faster than the current sequential retrieval?
The question title frames this as a choice between P2P and pub/sub but the question body frames it as a choice between threaded and pipelined processing. These are two completely different things.
Either code snippet provided could just as easily use P2P or pub/sub to put and get messages. The decision should not be based on speed but rather whether the interface in question requires a single message to be delivered to multiple receivers. If the answer is no then you probably want to stick with point-to-point, regardless of your application's threading model.
And, incidentally, the answer to the question posed in the title is "no." When you use the point-to-point model your messages resolve immediately to a destination or transmit queue and WebSphere MQ routes them from there. With pub/sub your message is handed off to an internal broker process that resolves zero to many possible destinations. Only after this step does the published message get put on a queue where, for the remainder of its journey, it is then handled like any other point-to-point message. Although pub/sub is not normally noticeably slower than point-to-point, the code path is longer and therefore, all other things being equal, it will add a bit more latency.
The other part of the question is about parallelism. You proposed either spinning up many threads or breaking the app up so that requests and replies are handled separately. A third option is to have multiple application instances running. You can combine any or all of these in your design. For example, you can spin up multiple request threads and multiple reply threads and then have application instances processing against multiple queue managers.
The key to this question is whether the messages have affinity to each other, to order dependencies or to the application instance or thread which created them. For example, if I am responding to an HTTP request with a request/reply then the thread attached to the HTTP session probably needs to be the one to receive the reply. But if the reply is truly asynchronous and all I need to do is update a database with the response data then having separate request and reply threads is helpful.
In either case, the ability to dynamically spin up or down the number of instances is helpful in managing peak workloads. If this is accomplished with threading alone then your performance scalability is bound to the upper limit of a single server. If this is accomplished by spinning up new application instances on the same or different server/QMgr then you get both scalability and workload balancing.
Please see the following article for more thoughts on these subjects: Mission:Messaging: Migration, failover, and scaling in a WebSphere MQ cluster
Also, go to the WebSphere MQ SupportPacs page and look for the Performance SupportPac for your platform and WMQ version. These are the ones with names beginning with MP**. These will show you the performance characteristics as the number of connected application instances varies.
It doesn't sound like you're thinking about this the right way. Regardless of the model you use (point-to-point or publish/subscribe), if your performance is bounded by a slow back-end system, neither will help speed up the process. If, however, you could theoretically issue more than one request at a time against the back-end system and expect to see a speed up, then you still don't really care if you do point-to-point or publish/subscribe. What you really care about is synchronous vs. asynchronous.
Your current approach for retrieving the data is clearly synchronous: you send the request message, and wait for the corresponding response message. You could do your communication asynchronously if you simply sent all the request messages in a row (perhaps in a loop) in one method, and then had a separate method (preferably on a different thread) monitoring the incoming topic for responses. This would ensure that your code would no longer block on individual requests. (This roughly corresponds to option 2, though without pub/sub.)
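The send-everything-first, collect-later approach can be sketched with two in-memory queues standing in for the request and reply queues. This is an illustration under stated assumptions, not MQ code: PipelinedRequester and Message are hypothetical, and the account id is used as the correlation id so that replies can arrive in any order.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// In-memory stand-in for a request queue and a reply queue; no real MQ API is used.
final class PipelinedRequester {
    // A message carries the account it belongs to, which doubles as the correlation id.
    static final class Message {
        final String accountId;
        final String body;

        Message(String accountId, String body) {
            this.accountId = accountId;
            this.body = body;
        }
    }

    final BlockingQueue<Message> requests = new LinkedBlockingQueue<>();
    final BlockingQueue<Message> replies = new LinkedBlockingQueue<>();

    // Phase 1: send every request without waiting for any reply.
    void sendAll(List<String> accountIds) {
        for (String id : accountIds) {
            requests.add(new Message(id, "getInfo"));
        }
    }

    // Phase 2: collect replies in whatever order they arrive, keyed by account id.
    // Stops early if no reply shows up within the per-reply timeout.
    Map<String, String> receiveAll(int expected, long timeoutMillisPerReply) {
        Map<String, String> byAccount = new LinkedHashMap<>();
        try {
            while (byAccount.size() < expected) {
                Message reply = replies.poll(timeoutMillisPerReply, TimeUnit.MILLISECONDS);
                if (reply == null) {
                    break; // timed out: return what has arrived so far
                }
                byAccount.put(reply.accountId, reply.body);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        }
        return byAccount;
    }
}
```

Whether this is faster than the sequential version depends on the back end handling several requests concurrently; if it processes them one at a time anyway, the total time stays about the same.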
I think option 1 could get pretty unwieldy, depending on how many requests you actually have to make, though it, too, could be implemented without switching to a pub/sub channel.
The reworked approach will use fewer threads. Whether that makes the application faster depends on whether the overhead of managing a lot of threads is currently slowing you down. If you have fewer than 1000 threads (this is a very, very rough order-of-magnitude estimate!), I would guess it probably isn't. If you have more than that, it might well be.
