My team has a situation where an SNMP SET will fail once every two weeks or so. Since this set happens automatically, we don't necessarily notice it immediately when it fails, and this can result in an inconsistent configuration and associated wailing and gnashing of teeth. The plan is to fix this by having our software automatically retry the SET when it fails.
The problem is, we aren't sure why the failure is happening. My (extremely limited) knowledge of SNMP isn't particularly helpful in diagnosing this problem, so I thought I'd ask StackOverflow for some advice. We think that every so often a spike in network traffic will cause the SET to fail. Since SNMP uses UDP for communication, I would think it would be relatively easy for a command to be drowned out if traffic were high for a short period of time. However, I have no idea how common this is. We have a small network with a single Cisco router and fewer than a dozen SNMP-controlled devices on it. In addition to the SNMP traffic, there are some status web pages being loaded from the various devices. In case it makes a difference, I believe we are using the AdventNet SNMP API version 4.0.4 for Java.
Does it sound reasonable that there will be some SET commands dropped occasionally, or should we be looking for other causes?
SNMP was designed with unreliable delivery in mind: it uses UDP as its transport protocol, and routers will drop SNMP packets when they have higher-priority work to do. So yes, it sounds very reasonable that SET commands are dropped occasionally :)
First, upgrade to the newest version of the SNMP library if one is available.
Then you can set up a retry mechanism: verify each SET with a GET. If the verification fails, queue the SET for a later attempt. This requires a somewhat elaborate queuing mechanism: a later SET for the same setting should either replace the existing queued SET or be queued after it.
Another option is to synchronize the entire state every hour: GET each setting and, if it differs from the intended value, SET it. Changes that do not make it through for over 3 hours can be reported through an alerting system.
There are many more options, but if you have just one failure per week on average, I'd go with the simplest one: verify the SET with a GET, retry up to 5 times, and if it still fails, send an email.
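As a rough sketch of that last option (not AdventNet-specific; snmpSet, snmpGet and sendAlertEmail are hypothetical stand-ins for whatever your actual API calls look like):

```java
// Minimal sketch of "verify a SET with a GET, retry, then alert".
// snmpSet(), snmpGet() and sendAlertEmail() are hypothetical helpers
// standing in for your real SNMP and alerting plumbing.
public final class VerifiedSnmpSet {

    private static final int MAX_ATTEMPTS = 5;
    private static final long RETRY_DELAY_MS = 2000;

    public static boolean setWithVerify(String oid, String value) throws InterruptedException {
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            snmpSet(oid, value);                      // may be lost on the wire
            String readBack = snmpGet(oid);           // verify it actually took
            if (value.equals(readBack)) {
                return true;
            }
            if (attempt < MAX_ATTEMPTS) {
                Thread.sleep(RETRY_DELAY_MS);         // back off before retrying
            }
        }
        sendAlertEmail("SNMP SET of " + oid + " failed after " + MAX_ATTEMPTS + " attempts");
        return false;
    }

    // --- placeholders for the real SNMP/email calls ---
    private static void snmpSet(String oid, String value) { /* SET via your SNMP API */ }
    private static String snmpGet(String oid) { return null; /* GET via your SNMP API */ }
    private static void sendAlertEmail(String message) { /* alerting hook */ }
}
```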
I have this assignment in college where they ask us to run a Java app as a socket server with multiple clients. Client sends a string, server returns the string in upper case with a request counter. Quite simple.
Each request made by any given client is counted on the server side and stored in a static variable shared by the client connection threads, so each client request increments the counter globally on the server. That's working well.
Now, they ask us to run "backup" instances of that server on different machines on the network so that if the primary stops responding, the client connects to one of the backups. That, I got working. But the counter is obviously reset since it's a different server.
The challenge is to keep the request counter the same on the primary and the secondaries, so that if the primary responds to 10 requests and then goes down, and a client switches to a backup and makes a request, the backup server responds with 11.
Here is what I considered:
1. If everything ran on the same PC, I'd use threads, but we're over the network, so I believe this will not work.
2. The server sends that counter to the client with the response, which in turn returns it to the server at the next request, and so forth. Not very "clean" imo, but it could work.
3. Each server talks to the others to sync this counter. However, sockets don't seem to be very efficient for this, if it's even possible. RMI seems to be an alternative here, but I'd like confirmation before I start learning it.
Any leads or suggestions here? I'm not posting code because I don't need a solution here, but if necessary I can invite you to the GitHub repo.
EDIT: There are no latency, reliability or similar constraints for this project. There are X clients and Y servers (single master, multiple failovers). Additional third-party infrastructure like a DB isn't really an option, but third-party Java libraries are welcome. Basically I just run everything in Eclipse on multiple PCs. This is an introductory assignment on distributed systems, expected to be done in 2 weeks, so I believe "keep it simple" is the key here!
EDIT 2: The number and addresses of backup servers will be passed as arguments to the application so broadcast/discovery isn't necessary. We'll likely cover all those points in a later lab assignment in the semester :)
EDIT 3: From all your great suggestions, I'll try an implementation of some variation of #3 and let you know how it works. I think the issue I have here is to make sure all servers are aware of the others. But like I mentioned, they don't need to discover each other so I'll hard code it for now and revisit in the next assignment! Probably opt for some elected master... :)
If option #2 is allowed, then it is the easiest; however, I am not sure how it would work in the face of multiple clients (so it depends on the requirements here).
Is it possible to back the servers with a shared DB running on another computer? Ideally, perhaps, one clustered across multiple machines? Or can you use an event bus or third-party libraries / code (a shared cache, JMS, or even EJBs)?
If not, then having the servers talk to each other is your best bet. Sockets can work, as could UDP multicast (careful there, though: there is no way to know whether a message was missed, which is why TCP sockets are safer). If the nodes are going to talk to each other, there are generally a few accepted ways to handle the setup:
- Master / slaves: the current node is the master and all writes go to it. Slaves connect to the master and receive updates. When the master goes down, a new master needs to be elected (see leader election). MongoDB works like this.
- Everyone to everyone: every node connects to every other known node. This can get complicated and might not scale well to lots of nodes.
- Daisy chain: one node connects to the next node, which connects to the next, and so on. I don't believe this is widely used.
- Ring network: each node connects to two others to form a ring. This is generally superior to a daisy chain, but a little more complicated to implement.
See here for more examples: https://en.wikipedia.org/wiki/Network_topology
If this was in the real world (i.e. not school), you would use either a shared cache (e.g. ehcache), local caches backed by an event bus (JMS of some sort), or a shared clustered db.
EDIT:
After re-reading your question, it seems you only have a single backup server to worry about, and my guess of the course requirements is that they simply want your backup server to connect to your primary server and also receive the variable count updates. This is completely fine to implement with sockets (it isn't inefficient for a single backup server), and is perhaps the solution they are expecting you to use.
E.g. Backup server connects to primary server and either polls for updates across the held connection or simply listens for updates issued from the primary server.
Key notes:
- You might need keep-alives to ensure the connection does not get killed.
- Make sure to implement reconnection logic in case the connection from the backup to the primary dies (both points appear in the sketch below).
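As a rough illustration, here is a minimal sketch of such a backup listener, assuming the primary simply writes the current count as a line of text whenever it changes; the class name, the one-line text protocol and the reconnect delay are assumptions, not part of the assignment.

```java
// Sketch of a backup node that holds a connection to the primary, applies
// counter updates as they arrive, and reconnects if the link dies.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.Socket;
import java.util.concurrent.atomic.AtomicLong;

public class BackupCounterListener implements Runnable {

    private final String primaryHost;
    private final int primaryPort;
    private final AtomicLong counter;   // shared with the backup's request handler

    public BackupCounterListener(String primaryHost, int primaryPort, AtomicLong counter) {
        this.primaryHost = primaryHost;
        this.primaryPort = primaryPort;
        this.counter = counter;
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try (Socket socket = new Socket(primaryHost, primaryPort);
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(socket.getInputStream()))) {
                socket.setKeepAlive(true);            // keep-alive, as noted above
                String line;
                while ((line = in.readLine()) != null) {
                    counter.set(Long.parseLong(line.trim()));   // apply the update
                }
            } catch (IOException | NumberFormatException e) {
                // primary unreachable or stream garbled: fall through and reconnect
            }
            try {
                Thread.sleep(1000);                   // simple reconnection back-off
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}
```

When the backup takes over, its request-handling code just continues from the value held in the shared AtomicLong.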
If this is for a networking course they may be expecting UDP multicast, but that may depend a little bit on the server / network environment.
This is a classic distributed systems problem. The right solution is some variation of your option #3, where the different servers communicate with each other.
Where it gets complicated is when you start to introduce latency, downtime, and/or network partitioning between the various servers. Eventually you'll need to arrive at some kind of consensus algorithm. Paxos is a well-known approach to this problem, but there are others; Raft is popular these days as well.
In my opinion, the best solution is to keep a vector of counters, one counter per server. Each server increments its own counter and broadcasts the vector to all other servers. This data structure is a conflict-free replicated data type.
The total number of requests is the sum of all elements of the vector.
About consistency: if you need a strictly growing number on all servers, you have to synchronously replicate the new value before answering the client. The penalty is performance and availability.
About broadcasting: you can choose any broadcast algorithm you want. If the number of servers is not too large, you can use a full-mesh topology; if it grows large, you can use a ring or star topology to replicate the data.
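For illustration, here is a minimal sketch of such a counter vector (essentially a grow-only counter CRDT); the class and method names are made up, and the networking that carries the snapshots between servers is left out.

```java
// Sketch of a per-server counter vector (grow-only counter CRDT).
// Each server only ever increments its own slot; merging takes the
// element-wise maximum, so broadcasts may arrive late or out of order.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CounterVector {

    private final String selfId;
    private final Map<String, Long> counts = new ConcurrentHashMap<>();

    public CounterVector(String selfId) {
        this.selfId = selfId;
        counts.put(selfId, 0L);
    }

    /** Called for every client request handled by this server. */
    public long incrementLocal() {
        return counts.merge(selfId, 1L, Long::sum);
    }

    /** Called when another server's vector arrives over the network. */
    public void mergeFrom(Map<String, Long> remote) {
        remote.forEach((server, value) -> counts.merge(server, value, Math::max));
    }

    /** The global request count is the sum over all servers. */
    public long total() {
        return counts.values().stream().mapToLong(Long::longValue).sum();
    }

    /** Snapshot to broadcast to the other servers. */
    public Map<String, Long> snapshot() {
        return Map.copyOf(counts);
    }
}
```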
The most true-to-real-life option is your option 3; it happens all the time. Nodes talk to one another on a separate port, so they self-discover by UDP broadcast. Each server broadcasts its maximum on a UDP port; the other nodes listen and bump their value up to that + 1 if their current value is lower, otherwise they ignore it and broadcast their own, bigger value instead.
This will work best when there is a 200-300 ms gap between client calls. It also assumes that any server could be the primary (as decided by a load balancer).
UDP is stable within a LAN and widely used there.
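As a rough sketch of the UDP part, under the assumption that a node simply adopts the larger value it hears (the exact bump rule would follow whichever variant you settle on); the port and payload format are placeholders.

```java
// Rough sketch of the UDP scheme: periodically broadcast the local counter,
// and when a peer advertises a higher one, jump to it.
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicLong;

public class UdpCounterSync {

    private static final int PORT = 4446;                       // placeholder port
    private final AtomicLong counter = new AtomicLong();

    /** Local request handling uses the next value of the shared counter. */
    public long nextValueForRequest() {
        return counter.incrementAndGet();
    }

    /** Broadcast the current local value to the LAN. */
    public void broadcastLocalValue() throws Exception {
        byte[] payload = Long.toString(counter.get()).getBytes(StandardCharsets.UTF_8);
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.setBroadcast(true);
            socket.send(new DatagramPacket(payload, payload.length,
                    InetAddress.getByName("255.255.255.255"), PORT));
        }
    }

    /** Listen for peers and adopt any larger value that is heard. */
    public void listenForPeers() throws Exception {
        try (DatagramSocket socket = new DatagramSocket(PORT)) {
            byte[] buf = new byte[64];
            while (true) {
                DatagramPacket packet = new DatagramPacket(buf, buf.length);
                socket.receive(packet);
                long peerValue = Long.parseLong(
                        new String(packet.getData(), 0, packet.getLength(),
                                   StandardCharsets.UTF_8).trim());
                counter.accumulateAndGet(peerValue, Math::max);  // never go backwards
            }
        }
    }
}
```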
Solutions to this problem trade off speed against consistency.
If you value consistency over speed you could try a synchronous approach (assuming servers A, B and C):
1. A receives the initial request.
2. A opens connections to B and C to request the current count from each.
3. A calculates the max count (based on its own value and the values from B and C), adds one, and sends the new count to B and C.
4. A closes the connections to B and C.
5. A replies to the original request, including the new max count.
At this point, all servers are in sync with the new max count, ready for a new request to any server.
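A minimal sketch of that flow, as seen from the server that received the client request; fetchCountFrom and pushCountTo are hypothetical helpers that would make short-lived socket calls to the peers.

```java
// Sketch of the synchronous flow above. The peer addresses and the two
// helper methods are placeholders for your own socket protocol.
import java.util.List;

public class SynchronousCounter {

    private long localCount;
    private final List<String> peers;           // e.g. ["hostB:5000", "hostC:5000"]

    public SynchronousCounter(List<String> peers) {
        this.peers = peers;
    }

    public synchronized long handleRequest() {
        long max = localCount;
        for (String peer : peers) {             // step 2: ask B and C for their counts
            max = Math.max(max, fetchCountFrom(peer));
        }
        localCount = max + 1;                   // step 3: compute the new max
        for (String peer : peers) {
            pushCountTo(peer, localCount);      // step 3: tell B and C about it
        }
        return localCount;                      // step 5: reply to the client
    }

    private long fetchCountFrom(String peer) { return 0L; /* socket call goes here */ }
    private void pushCountTo(String peer, long count) { /* socket call goes here */ }
}
```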
Edit: Of course, if you are able to use a shared database, the problem becomes much simpler.
I can use the Java MQ API to put and get messages.
I can also disable gets and puts on a queue.
During a migration project, we'll have two versions of an app running in parallel: Old and New. Old and New will have their own separate queues. I regularly have messages from a client going to Old; occasionally I want the messages to flow to New instead.
I'm wondering if MQ supports a gate/switch concept, where via the API I can point a queue to deliver only to New, or only to Old, for a short time.
I'm trying to avoid going to message-based routing via WMB, since I don't have to do that today. The parallel mode is only needed for a few months.
You do not mention the version of MQ or whether there are message affinities or dependence on preserving the MQMD.MsgID. These are critical in devising a solution to this problem. I'll try to describe enough options so that at least one will be viable whatever version you are at.
Pub/Sub
The easiest thing to do is to have the messages arrive on an alias over a topic. Any message that arrives is published immediately on that topic. Then it is a simple matter to generate administrative subscriptions to direct messages to the queues on which the apps needing the messages are listening. This is entirely a configuration change and requires no external components, processes or code. It is available from v7.1 of MQ and higher, which is to say any of the currently supported versions of MQ.
The down side is that IBM MQ will change the MQMD.MsgID from the time the message is received on the topic to the time it is published on the application's input queue. This breaks the app's ability to use the MQMD.MsgID of the incoming message as a correlation ID when replying. If the requesting app pre-loads the correlation ID or doesn't rely on a correlation ID, this is not an issue.
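Roughly, the MQSC for this could look like the following; every object name and the topic string are placeholders, not a prescribed layout.

```
* Sketch only - object names and the topic string are placeholders.
DEFINE TOPIC(MIGRATION.TOPIC) TOPICSTR('app/inbound')
DEFINE QALIAS(APP.INBOUND) TARGET(MIGRATION.TOPIC) TARGTYPE(TOPIC)
* One administrative subscription per consuming application:
DEFINE SUB(TO.OLD) TOPICSTR('app/inbound') DEST(OLD.INPUT.QUEUE)
DEFINE SUB(TO.NEW) TOPICSTR('app/inbound') DEST(NEW.INPUT.QUEUE)
* To stop feeding one side, delete (or re-point) its subscription:
DELETE SUB(TO.NEW)
```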
Aliasing
But for apps where this is an issue, it gets a bit harder. You can define an alias over a queue and have inbound messages land on the alias. When you need to switch from one queue to the other, you change the alias. There are a couple of issues with this. The first is that it is never possible to deliver the message stream to more than one of the applications; in a parallel processing test it is often desirable to do exactly that and then compare summary or detail reports.
The second problem is more operational in nature. It isn't possible to change the alias while it is open. If the messages arrive over a RCVR, RQSTR or CLUSRCVR channel, no problem: stop the channel, switch the alias and restart the channel. In a series of MQSC script commands this can be done faster than it can be typed. However, if the applications putting the messages are connected in bindings mode, or via client directly to the alias, they must all be stopped in order to change the alias.
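For example, the switch-over script could be as small as this (the channel and object names are placeholders):

```
* Sketch only - channel and object names are placeholders.
STOP CHANNEL(CLIENT.TO.QM1)
ALTER QALIAS(APP.INBOUND) TARGET(NEW.INPUT.QUEUE)
START CHANNEL(CLIENT.TO.QM1)
```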
That said, aliasing works on all versions of MQ out of the box.
Physical copy
One solution that's been around for quite some time is to use the Q program (SupportPac MA01) to direct the messages. In this scenario, the queue on which messages land is a local queue. The Q program is either triggered or set to constantly listen on the queue. When a message arrives, Q then copies it to one or both of the destination queues.
Switching the behavior if Q is triggered involves pre-defining 2 or 3 processes where each defines a different behavior - move new messages to QUEUEA, to QUEUEB or to both. Changing the queue's PROCESS attribute to point to a different process results in an instantaneous change of the behavior.
Alternatively, if Q is configured to listen on the queue indefinitely, you prepare three different scripts to run it: one causes messages to be copied to QUEUEA, another to QUEUEB, and a third to both queues. Changing the behavior then means killing the running script and starting a different one.
The Q program works with all versions of MQ, regardless of whether it is triggered or scripted.
Downsides to this approach include the obvious - more moving parts. You have to trigger the queue or else make a transactional program act like a daemon. Not hard but if you are betting the business on it then perhaps some monitoring is in order to make sure the input queue doesn't start building.
Recommendation
Of all these methods, I really like the Pub/Sub version. It is extremely reliable, has the least moving parts, and if anything breaks it's under IBM support. When you need to change something, you can do that with minimal impact to the running applications. If at all possible, use that.
My Java web application pulls some data from external systems (JSON over HTTP), both live (whenever the users of my application request it) and in batch (nightly updates for cases where no user has requested it). The data changes, so caching options are likely exhausted.
The external systems have some throttling in place, the exact parameters of which I don't know and which likely change depending on system load (e.g., at peak times 10 requests per second from one IP address, at off-peak times 100 requests per second from one IP address). If the requests are too frequent, they time out or return HTTP 503.
Right now I attempt the request 5 times with a 2000 ms delay between attempts, giving up if an error is received each time. This is not optimal, as sometimes at peak times nearly all requests fail; I could avoid making these requests and perhaps get at least some of them to succeed instead.
My goals are to have a somewhat simple, reliable design, and enough flexibility so that I could both pull some metrics from the throttler to understand how well the external systems are responding (and thus adjust how often they are invoked), and to auto-adjust the interval with which I call them (individually per system) so that it is optimal both on off-peak and peak hours.
My infrastructure is Java with RabbitMQ over MongoDB over Linux.
I'm thinking of three main options:
Since I already have RabbitMQ in use for batch processing, I could introduce a queue to which the web processes would send the requests they have for external systems; worker processes would then read from that queue, throttle themselves as needed, and return the results. This would allow running multiple parallel worker processes on more servers if needed. My main concerns are that it isn't a very simple solution, and how to manage peak-hour throughput being low and thus the web processes waiting for a long while. It also turns RabbitMQ into a critical single point of failure; if it dies, the whole system stops (as opposed to the nightly batch processes just not running any more, which is less critical). I suppose RPC is the correct RabbitMQ usage pattern here, but I'm not sure. Edit - I've posted a related question, How to properly implement RabbitMQ RPC from Java servlet web container?, on how to implement this.
Introduce nginx (e.g. ngx_http_limit_req_module), HAProxy, or other proxy software into the mix (as reverse proxies?) and have them take care of the throttling through some configuration magic. The pro is that I don't have to make code changes. The con is that it adds more technology, and technology I've not used before, so the chances of misconfiguring something are quite high. It would also likely not be easy to do dynamic throttling depending on external server load, to prioritize live requests over batch requests, or to get statistics on how the throttling is doing. Also, most documentation and examples will likely be about throttling incoming requests, not outgoing ones.
Do a pure-Java solution (e.g., a leaky bucket implementation). It would be simple in the sense that it is "just code", but the devil is in the details; debugging all the deadlocks, starvation and race conditions isn't always fun.
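For what option 3 could look like, here is a minimal token-bucket sketch (a close relative of the leaky bucket), one instance per external system; the adjustable rate is the knob a feedback loop would turn when 503s or timeouts start appearing. All names and numbers are made up.

```java
// Minimal token-bucket throttle: acquire() blocks until a request may go out.
// The rate can be adjusted at runtime by whatever watches error responses.
public class TokenBucketThrottle {

    private final Object lock = new Object();
    private final double capacity;       // burst allowance
    private double tokens;
    private double ratePerSecond;        // adjustable at runtime
    private long lastRefillNanos;

    public TokenBucketThrottle(double ratePerSecond, double capacity) {
        this.ratePerSecond = ratePerSecond;
        this.capacity = capacity;
        this.tokens = capacity;
        this.lastRefillNanos = System.nanoTime();
    }

    /** Blocks until a request is allowed to go out to the external system. */
    public void acquire() throws InterruptedException {
        synchronized (lock) {
            refill();
            while (tokens < 1.0) {
                long waitMillis = (long) Math.ceil((1.0 - tokens) / ratePerSecond * 1000);
                lock.wait(Math.max(waitMillis, 1));
                refill();
            }
            tokens -= 1.0;
        }
    }

    /** Called by the feedback loop that watches 503s / timeouts. */
    public void setRatePerSecond(double newRate) {
        synchronized (lock) {
            refill();
            this.ratePerSecond = newRate;
            lock.notifyAll();
        }
    }

    private void refill() {
        long now = System.nanoTime();
        double elapsedSeconds = (now - lastRefillNanos) / 1_000_000_000.0;
        tokens = Math.min(capacity, tokens + elapsedSeconds * ratePerSecond);
        lastRefillNanos = now;
    }
}
```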
What am I missing here?
Which is the best solution in this case?
P.S. Somewhat related question - what's the proper approach to log all the external system invocations, so that statistics are collected as to how often I invoke them, and what the success rate is?
E.g., after every invocation I'd invoke something like .logExternalSystemInvocation(externalSystemName, wasSuccessful, elapsedTimeMills), and then get some aggregate data out of it whenever needed.
Is there a standard library/tool to use, or do I have to roll my own?
If I use option 1 with RabbitMQ, is there a way to organize the flow so that I get this out of the box from the RabbitMQ console? I wouldn't want to send all failed messages to a poison queue; it would fill up too quickly, and in most cases there is no need to re-process these failed requests, as the user has sadly already moved on.
Perhaps this open source system can help you a little: http://code.google.com/p/valogato/
I am reading here that:
"On connect, the JVM (Java Virtual Machine) tries to resolve the hostname to IP/port. Windows tries a netbios ns query on UDP (User Datagram Protocol) port 137 with a timeout of 1.5 seconds, ignores any ICMP (Internet Control Message Protocol) port unreachable packets and repeats this two more times, adding up to a value of 4.5 seconds. I suggest putting critical hostnames in your HOSTS file to make sure they are resolved quickly. Another possibility is turning off NETBIOS altogether and running pure TCP/IP on your LAN (Local Area Network)."
is this currently an issue still? Because I am working on a heartbeat-sensor and I was curious.
Your citation is not a normative reference, just another hobby site, and in this case it is dead wrong. None of this has anything to do with setSoTimeout(). He is totally confused between name resolution time, connect time, and read time. setSoTimeout() sets a read timeout, and is unaffected by the shenanigans he describes, whether accurately or otherwise, which wouldn't even happen at connect time as he states: they would happen at name-resolution time.
It's far from the only confusion to be found on that site, or even on that page, let me assure you. I told him about several errors on this page ten years ago, and about quite a lot of others, all of which remain uncorrected to this day, which gives you an idea of the site's accuracy, up-to-date-ness, and content review mechanisms. His only response was to add a rude remark about me. Unconvincing as a peer review mechanism.
Stick to authoritative sources.
I have a Java servlet that's getting overloaded by client requests during peak hours. Some clients spawn concurrent requests. Sometimes the number of requests per second is just too great.
Should I implement application logic to restrict the number of requests a client can send per second? Does this need to be done at the application level?
The two most common ways of handling this are to turn away requests when the server is too busy, or handle each request slower.
Turning away requests is easy; just run a fixed number of instances. The OS may or may not queue up a few connection requests, but in general the users will simply fail to connect. A more graceful way of doing it is to have the service return an error code indicating the client should try again later.
Handling requests more slowly is a bit more work, because it requires separating the servlet that handles the requests from the class doing the work, which runs in a different thread. You can have a larger number of servlets than worker bees. When a request comes in, the servlet accepts it, waits for a worker bee, grabs it and uses it, frees it, then returns the results.
The two can communicate through one of the classes in java.util.concurrent, like LinkedBlockingQueue or ThreadPoolExecutor. If you want to get really fancy, you can use something like a PriorityBlockingQueue to serve some customers before others.
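For illustration, a minimal sketch of that servlet/worker-bee split using java.util.concurrent; the pool size, queue length and timeout are placeholders to tune.

```java
// Sketch of "servlets hand work to a bounded pool of worker bees".
// When both the pool and its queue are full, the caller is told to back off.
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class WorkerBeePool {

    // At most 20 requests doing real work, at most 100 queued behind them.
    private final ThreadPoolExecutor workers = new ThreadPoolExecutor(
            20, 20, 60, TimeUnit.SECONDS,
            new ArrayBlockingQueue<>(100),
            new ThreadPoolExecutor.AbortPolicy());   // reject when the queue is full

    /**
     * Called from the servlet. Returns the result, or null if the server is
     * too busy - the servlet would then send back a "try again later" status.
     */
    public String handle(String input) throws Exception {
        try {
            Future<String> result = workers.submit(() -> doExpensiveWork(input));
            return result.get(10, TimeUnit.SECONDS); // cap how long a client waits
        } catch (RejectedExecutionException tooBusy) {
            return null;
        }
    }

    private String doExpensiveWork(String input) {
        return input.toUpperCase();                  // stand-in for the real work
    }
}
```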
Me, I would throw more hardware at it like Anon said ;)
Some solid answers here. I think more hardware is the way to go. Having too many clients or traffic is usually a good problem to have.
However, if you absolutely must throttle clients, there are some options.
The most scalable solutions that I've seen revolve around a distributed caching system, like Memcached, and using integers to keep counts.
Figure out a rate at which your system can handle traffic. Either overall, or per client. Then put a count into memcached that represents that rate. Each time you get a request, decrement the value. Periodically increment the counter to allow more traffic through.
For example, if you can handle 10 requests/second, put a count of 50 in every 5 seconds, up to a maximum of 50. That way you aren't refilling it all the time, but you can also handle a bit of bursting limited to a window. You will need to experiment to find a good refresh rate. The key for this counter can either be a global key, or based on user id if you need to restrict that way.
The nice thing about this system is that it works across an entire cluster AND the mechanism that refills the counters need not be in one of your current servers. You can dedicate a separate process for it. The loaded servers only need to check it and decrement it.
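As a rough sketch of that counter scheme (CacheClient here is a hypothetical thin wrapper, since the exact incr/decr semantics differ between memcached clients; the keys and numbers are placeholders):

```java
// Sketch of the shared-counter rate limit: any server decrements, one
// refill process periodically tops the counter back up.
public class ClusterRateLimiter {

    /** Hypothetical wrapper over your memcached client. Assumes decrement
     *  may go negative; adapt to your client's actual semantics. */
    public interface CacheClient {
        long decrement(String key, long by);        // returns the new value
        void set(String key, long value);           // (re)fill the counter
    }

    private final CacheClient cache;

    public ClusterRateLimiter(CacheClient cache) {
        this.cache = cache;
    }

    /** Any server in the cluster calls this before doing the work. */
    public boolean tryAcquire(String clientId) {
        long remaining = cache.decrement("ratelimit:" + clientId, 1);
        return remaining >= 0;                      // below zero means "throttled"
    }

    /** A single refill process runs this periodically (e.g. every 5 seconds). */
    public void refill(String clientId, long allowance) {
        cache.set("ratelimit:" + clientId, allowance);
    }
}
```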
All that being said, I'd investigate other options first. Throttling your customers is usually a good way to annoy them. Most probably NOT the best idea. :)
I'm assuming you're not in a position to increase capacity (either via hardware or software), and you really just need to limit the externally-imposed load on your server.
Dealing with this from within your application should be avoided unless you have very special needs that are not met by the existing solutions out there, which operate at HTTP server level. A lot of thought has gone into this problem, so it's worth looking at existing solutions rather than implementing one yourself.
If you're using Tomcat, you can configure the maximum number of simultaneous requests allowed via the maxThreads and acceptCount settings. Read the introduction at http://tomcat.apache.org/tomcat-6.0-doc/config/http.html for more info on these.
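For reference, those two settings live on the HTTP connector in server.xml; the values below are purely illustrative.

```xml
<!-- server.xml (Tomcat): illustrative values only.
     maxThreads  = maximum requests processed concurrently
     acceptCount = connections queued once all threads are busy;
                   beyond that, new connections are refused. -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           maxThreads="150"
           acceptCount="100" />
```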
For more advanced controls (like per-user restrictions), if you're proxying through Apache, you can use a variety of modules to help deal with the situation. A few modules to google for are limitipconn, mod_bw, and mod_cband. These are quite a bit harder to set up and understand than the basic controls that are probably offered by your appserver, so you may just want to stick with those.