I have a server application which, somewhat simplified, periodically takes measurements via a REST API from a not-beefy-enough server. The values should be cached locally (and are timestamped/immutable), maybe stored as a FloatBuffer where every position corresponds to a measurement sample. There's a web browser application which periodically makes AJAX requests to update some neat statistics on the web page.
Assuming that the server is up and running, there are still many places where errors could occur:
The REST measurement server could be unreachable (where the server just keeps storing measurements locally)
The network connection to the measurement server could be down
The storage could be full or somehow corrupt
The browser could lose contact with the server and try to re-establish it
My strategy for coping with errors in general should be the following:
If there are problems getting values from the measurement service via REST, there should be retries every minute. If the error persists for more than 30 minutes consecutively, the administrator should be notified. In case of disk problems the administrator should be notified at once, or preferably even before the disk fills up.
The end user experience should be as unaffected by the errors as possible, but the application should still function as sanely as possible, notifying the user that an error has occurred while also showing the latest data available.
How do I find out which errors to cope with regarding network problems (using clj-http via an agent triggered by a ScheduledThreadPoolExecutor job to make the REST requests) and regarding disk problems when trying to flush the FloatBuffer?
What is a sane way to implement the quite stateful yet algorithmic strategy mentioned above? Should I simply handle the error when the agent reports it and switch to some kind of a recovery-mode job?
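As a sketch of how the stateful retry/notification part of that strategy could look (written here in Java, mirroring the ScheduledThreadPoolExecutor approach mentioned above; fetchAndStoreMeasurements() and notifyAdministrator() are hypothetical placeholders for the REST call and the alerting hook):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class MeasurementPoller {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private volatile Instant firstFailure = null;   // null means "currently healthy"
    private volatile boolean adminNotified = false;

    public void start() {
        // Poll once a minute; the same schedule doubles as the retry interval.
        scheduler.scheduleAtFixedRate(this::poll, 0, 1, TimeUnit.MINUTES);
    }

    private void poll() {
        try {
            fetchAndStoreMeasurements();   // hypothetical: REST call + local cache write
            firstFailure = null;           // success resets the failure window
            adminNotified = false;
        } catch (Exception e) {
            if (firstFailure == null) {
                firstFailure = Instant.now();
            } else if (!adminNotified
                    && Duration.between(firstFailure, Instant.now()).toMinutes() >= 30) {
                notifyAdministrator(e);    // hypothetical alerting hook
                adminNotified = true;
            }
            // otherwise just wait for the next scheduled attempt
        }
    }

    private void fetchAndStoreMeasurements() throws Exception { /* ... */ }
    private void notifyAdministrator(Exception cause) { /* ... */ }
}
```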
In an interaction like this, involving several components across different systems, the end user should not be made to perform many synchronous operations. It's only the synchronous operations that are time-bound and require immediate error reporting.
Once the end user's interaction with the system is async, you have a lot of choice in the error-handling mechanism too. At the point where the end user interacts with the system, you can have an error mapper that translates all the errors coming from the various components into messages the user can understand.
The user should be given an API to query the status of the request they submitted. It should be able to tell whether the request is complete or whether there is an error. If the network connections are going to take more time, the status message can inform the user of that.
In any distributed system, every component will report errors at some point. Some APIs provide error-listener interfaces for this, which report errors to the user asynchronously. Have a look at APIs like JMS (http://docs.oracle.com/javaee/5/tutorial/doc/bnceh.html); they have proven themselves in complex systems and have good error-handling mechanisms.
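For example, JMS lets you register an ExceptionListener on the connection so that problems detected asynchronously (broken connections, provider failures) are reported back to your code instead of disappearing. A minimal sketch (the logging call is illustrative):

```java
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.ExceptionListener;
import javax.jms.JMSException;

public class MonitoredConnection {
    public Connection open(ConnectionFactory factory) throws JMSException {
        Connection connection = factory.createConnection();
        // Called asynchronously by the provider when a problem is detected,
        // e.g. the connection to the broker is lost.
        connection.setExceptionListener(new ExceptionListener() {
            @Override
            public void onException(JMSException e) {
                System.err.println("JMS problem reported asynchronously: " + e.getMessage());
                // map to a user-visible status / trigger a reconnect here
            }
        });
        connection.start();
        return connection;
    }
}
```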
I'm having a hard time debugging a problem with long response times. My web application sometimes takes a long time to respond, and I'm having a hard time nailing down the specific requests that are the culprits. The trouble is that Tomcat only logs to the access log once it has responded to the request. I'm wondering if there is a way to either log all incoming requests the moment they come in (and not wait for a response), or tell Tomcat to spew out all the requests it is currently handling (something analogous to jstack, but instead of showing me threads, show me the current/active request URLs)?
Once served, the slow requests can be identified by activating the Extended Log Valve with the time-taken token. That's not your real question, as you want to see them in real time, but it's not clear whether you have already identified the slow-responding requests.
VisualVM will show you the running threads, but that's not enough to understand what's going on. You will probably again need to activate the Extended Log Valve, this time with the x-threadname token, to compare with what you have seen in VisualVM. But for the debugging itself, the solution above (the time-taken token) is usually enough.
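For reference, a sketch of what the valve configuration in server.xml might look like with both tokens; the directory and prefix are illustrative and should be adjusted to your setup:

```xml
<Valve className="org.apache.catalina.valves.ExtendedAccessLogValve"
       directory="logs" prefix="extended_access_log." suffix=".txt"
       pattern="date time x-threadname cs-method cs-uri sc-status time-taken" />
```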
But it might not be enough to identify the bottleneck, particularly if the application itself has no problem at all (and the problem seems to be random). Does the application use a database? If so, activate the slow query log, for example with MySQL. Does it use any other tier (authentication, NFS/CIFS, ...)? If so, you need to monitor their availability, in case they block something when unavailable.
I'm working on a relatively simple workflow using Amazon's Flow framework for Java. I think I have a decent grasp of everything that's going on right now, but I have one area I'm still uncertain about: how should I go about handling timeouts?
The main timeout with my workflow is the executionStartToCloseTimeoutSeconds on the workflow itself, but I'd imagine the process is the same regardless of which timeout fires. It seems that most of the time, when the task times out, it just kind of disappears. I'd like to be able to know when this happens and do something (e.g. send an e-mail or log it somehow). I searched around and couldn't find any example of anything being notified that a timeout happened.
Activity timeout is delivered to the workflow code in the form of an Exception and can be easily handled.
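A minimal sketch of catching it in workflow code with the Flow framework's TryCatch; the workflow interface, the generated activities client, and the notification activity are placeholders:

```java
import com.amazonaws.services.simpleworkflow.flow.ActivityTaskTimedOutException;
import com.amazonaws.services.simpleworkflow.flow.core.TryCatch;

public class MyWorkflowImpl implements MyWorkflow {
    // Placeholder for the client the Flow framework generates from your activities interface.
    private final MyActivitiesClient activities = new MyActivitiesClientImpl();

    @Override
    public void run() {
        new TryCatch() {
            @Override
            protected void doTry() throws Throwable {
                activities.doWork();
            }

            @Override
            protected void doCatch(Throwable e) throws Throwable {
                if (e instanceof ActivityTaskTimedOutException) {
                    activities.notifyAdminOfTimeout();   // placeholder activity
                } else {
                    throw e;                             // rethrow anything else
                }
            }
        };
    }
}
```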
IMHO the workflow execution timeout is similar to kill -9 in Unix. It kills the workflow without giving it a chance to perform cleanup, so its main use is to ensure that broken workflow instances do not stay open forever.
For all business-level timeouts, do not rely on workflow timeouts; use timers instead. When a timer fires, your workflow code can execute a notification activity and terminate the workflow with an appropriate failure status.
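A sketch of the timer approach, assuming the standard Flow WorkflowClock; when the Promise returned by createTimer becomes ready, an @Asynchronous method runs the notification (the activity call is a placeholder, and a real workflow would cancel the timer, e.g. inside a TryCatchFinally, when the work completes normally):

```java
import com.amazonaws.services.simpleworkflow.flow.DecisionContextProviderImpl;
import com.amazonaws.services.simpleworkflow.flow.WorkflowClock;
import com.amazonaws.services.simpleworkflow.flow.annotations.Asynchronous;
import com.amazonaws.services.simpleworkflow.flow.core.Promise;

public class TimeoutAwareWorkflowImpl {
    private final WorkflowClock clock =
            new DecisionContextProviderImpl().getDecisionContext().getWorkflowClock();

    public void run() {
        // Business-level timeout: 30 minutes, expressed as a timer rather than
        // relying on executionStartToCloseTimeoutSeconds.
        Promise<Void> timerFired = clock.createTimer(30 * 60);
        onTimeout(timerFired);
        // ... kick off the real work here ...
    }

    @Asynchronous
    void onTimeout(Promise<Void> timerFired) {
        // Runs only after the timer fires; send the notification and fail the workflow.
        // notificationActivities.sendTimeoutEmail();  // placeholder activity call
    }
}
```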
http://docs.aws.amazon.com/amazonswf/latest/developerguide/swf-timeout-types.html
For activity related timeouts, the short answer is that your decider (i.e. workflow) logic should handle it. You should not have to worry about things timing out once you validate the logic and have retries in place.
For workflow timeouts you will need to inspect the workflow history / state to figure out that it timed out. You can definitely list workflow executions but you probably have to go through the SWF API directly (i.e. not through Flow). You want to do this anyway to catch failed workflows.
A pattern I've used and seen being used with SWF is to have an external way of keeping track of the work you've dispatched through SWF (think a DB) and use that to check in on work that was started and never completed. The workflow itself updates this when it completes (or as it's completing major pieces of work) so it's trivial to figure out which workflows are problematic.
I have C applications that will run on multiple machines at different sites.
Now I want to control and monitor these C applications. For that I am thinking about Java Web Application using Servlet/JSP.
I am thinking that the C applications will connect to the Java web application over TCP. In my web application, I am thinking of implementing a manager which communicates with the C applications over TCP. I will start the manager as a separate thread when the web application starts, and the manager will communicate with servlet requests via Context and Session. So whenever the user does something in the browser, I want to use the functionality of my manager on the server, with ServletContext and Session as the interface.
So this is what I am thinking. I want to know if there is a better approach, or if I am doing anything wrong. Can anyone please suggest a better solution?
EDIT
Current workflow: whenever I need to start/stop a C application, I have to SSH into the remote machine via a PuTTY terminal, type long commands, and start/stop it. Whenever there is some issue, I have to scroll through very long log files. There are a couple of other things, like the live status of what the application is doing/processing every second, that I can't always log to a log file.
So I find this workflow difficult, and things like live status I can't monitor at all.
Now I want to have a web application interface to it. I can modify my C applications and implement the web application from scratch.
New workflow to implement: I want to start/stop the C applications from a web page. I want to view logs and live status reports/live graphs on the web page (monitoring what each C application is doing). I also want to monitor machine status on the web page.
I am thinking of designing the web interface in Java using JSP/servlets.
So I will modify my C applications so they can communicate with the web application.
Question:
I just need guidelines/best practices for implementing the new workflow.
EDIT 2
Sorry for the confusion between controller and manager; both are the same thing.
My thoughts:
The system will consist of C applications running at different sites, a Java controller and a Java web app running in parallel in a Tomcat server, and a DB.
1) C applications will connect to the controller over TCP, so the controller here becomes the server and the C applications the clients.
2) C applications will be multithreaded; they will receive tasks from the controller and spawn a new thread to perform each task. When the controller says to stop a task, the C application will stop that task's thread. Additionally, C applications will send work progress (logs) to the controller every second.
3) The controller receives task commands from the web application (both run in parallel in the Tomcat server, in the same JVM instance), and the web application receives commands from the user over HTTP.
4) The controller will receive the work progress (logs) sent every second by the C applications and insert the logs into the DB for later analysis (I need to consider whether it is a good idea to insert logs into a MySQL RDBMS; there may be a lot of inserts, maybe 100 or 1000 every second, forever; see the batching sketch after this list). The web application may also request the last 5 minutes of logs from the controller and send them to the user over HTTP. If the user is monitoring logs, the web application will have to retrieve logs from the controller every second and send them to the user over HTTP.
5) A user monitoring C application tasks will see progress in a graph, updated every second, plus text lines of logs for the info/error events that may happen occasionally in the C applications.
6) There will be one C application per machine, which will execute any task the user sends from the web browser. Each C application will run as a service on the machine, start on machine startup, connect to the server, and stay connected forever. It can run idle if there are no tasks to perform.
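Regarding the insert rate in point 4, grouping each second's log lines into JDBC batches (and committing once per batch) usually keeps MySQL comfortable. A minimal sketch, with the table and column names made up for illustration:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.util.List;

public class LogBatchWriter {
    // Hypothetical table: app_log(machine_id, logged_at, level, message)
    private static final String INSERT_SQL =
            "INSERT INTO app_log (machine_id, logged_at, level, message) VALUES (?, ?, ?, ?)";

    public void writeBatch(Connection conn, List<LogLine> lines) throws SQLException {
        conn.setAutoCommit(false);
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            for (LogLine line : lines) {
                ps.setString(1, line.machineId);
                ps.setTimestamp(2, Timestamp.from(line.loggedAt));
                ps.setString(3, line.level);
                ps.setString(4, line.message);
                ps.addBatch();
            }
            ps.executeBatch();   // one round trip for the whole second's worth of logs
            conn.commit();
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        }
    }

    public static class LogLine {
        String machineId;
        java.time.Instant loggedAt;
        String level;
        String message;
    }
}
```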
It is a valid approach. I believe sockets are how most distributed systems communicate, and more often than not even different services on the same box communicate that way. I also believe what you are suggesting for the Java web service is very typical and will work well (it will probably grow in complexity beyond what you are currently thinking, but the architecture you describe is a good start).
If your C services are made to also run independently of the management system, then you might want to reverse it and have the management system connect to the services (unless your firewall prevents it).
You will certainly want a small, well-defined protocol. If you are sending lots of fields, you could even make everything you send JSON or XML, since both sides will already have parsers to validate the format.
Be careful about security! On the C side, ensure that you won't get any buffer overflows, and if you parse the information yourself, be strict about throwing away (and logging!) data that doesn't look right. On the Java side, buffer overruns aren't as much of a problem, but be sure to log packets that don't fit your protocol exactly, to detect both bugs and intrusions.
Another solution you might consider: your systems already share a database, so you could send commands and responses through the DB (assuming the commands/responses are not happening too often). We don't do exactly this, but we share a variable table in which we place name/value pairs indicating different aspects of our system's performance and configuration (it's two-way). This is probably not optimal, but it has been amazingly flexible, since it allows us to reconfigure our system at runtime (the values are cached locally in each service and re-read/updated every 30 seconds).
I might be able to give you more info if I knew more specifics about what you expect to do: for instance, how often will your browser update its fields, what kind of command signals or data requests will be sent, and what kind of data do you expect back? Although you certainly don't have to post that stuff here, you must consider it; I suggest mocking up your browser page to start.
edits based on comments:
Sounds good, just a couple comments:
2) Any good database should be able to handle that volume of data for logging but you may want to use a good cache on top of your DB.
5) You will probably want a web framework to render the graph and manage updates. There are a lot and most can do what you are saying pretty easily, but trying to do it all yourself without a framework of some sort might be tough. I only say this because you didn't mention it.
6) Be sure you can handle dropped connections and reconnecting. When you are testing, pull the plug on your server (at least the network cable) and leave it out for 10 minutes, then make sure when you plug it back in you get the results you expect (Should the client automatically reconnect? Should it hold onto the logs or throw them away? How long will it hold onto logs?)
You may want to build in a way to "reboot" your C services. Since they were started as a service, simply sending a command that tells them to terminate/exit will generally work, since the system will restart them. You may also want a little monitoring loop that restarts them under certain criteria (like not having received a command from the server for n minutes). This can come in handy when you're in California at 10am trying to work with a C service in Australia at 2am.
Also, consider that an attacker can insert himself between your client and server. If you are using an SSL socket you should be okay, but if it's a raw socket you must be VERY careful.
Correction:
You may have problems putting that many records into a MySQL database. If it is not indexed and you minimize queries against it, you may be okay. You can achieve this by keeping the last 5 minutes of all your logs in memory (so you don't have to index your database) and by grouping inserts or having a very well-tuned cache.
A better approach might be to forgo the database and just use flat log files, pre-filtered to what a single user might want to see. So if the user asks for the last 5 minutes of "WARN" and "DEBUG" messages from a machine, you could just read the log file from that machine into memory, skipping all but warn/debug messages, and display those. This has its own problems but should be more scalable than an indexed database. It would also allow you to zip up older data (that a user won't want to query any more) for a 70-90% savings in disk space.
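A sketch of that filtering, assuming one log file per machine with the level as a plain token on each line; the path layout is made up, and the last-5-minutes check is left as an exercise:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LogFileFilter {
    // Hypothetical layout: one file per machine, each line like
    // "2016-01-01T10:00:00 WARN ..." or similar.
    public List<String> tail(String machine, List<String> wantedLevels) throws IOException {
        Path logFile = Paths.get("/var/log/capps", machine + ".log");
        try (Stream<String> lines = Files.lines(logFile)) {
            return lines
                    .filter(line -> wantedLevels.stream().anyMatch(line::contains))
                    // a real implementation would also drop lines older than 5 minutes
                    .collect(Collectors.toList());
        }
    }
}
```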
Here are my recommendations on your current design, given that you haven't defined a specific scope for this project:
Define a protocol for communication between your C apps and your monitor app. You probably don't need the same info from all the C apps in the same format, or there may be metrics that matter more for some C apps than for others. I would recommend using plain JSON for this and defining a minimum schema to fulfill, so that C can produce the data and Java can consume and validate it.
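A sketch of such a minimal envelope being validated on the Java side with Jackson; the field names are made up, and each C app could put its own metrics under "payload":

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class MonitorMessageValidator {
    private final ObjectMapper mapper = new ObjectMapper();

    // Minimal schema: every message must carry appId, timestamp and type;
    // app-specific metrics live under "payload".
    public JsonNode parse(String json) throws Exception {
        JsonNode root = mapper.readTree(json);
        if (!root.hasNonNull("appId") || !root.hasNonNull("timestamp") || !root.hasNonNull("type")) {
            throw new IllegalArgumentException("Message missing required fields: " + json);
        }
        return root.path("payload");
    }
}
```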
Use a database to store the results of monitoring your C apps. The generic option would be an RDBMS, probably an open-source one like MySQL or PostgreSQL, or, if you (or your company) can get the licenses, SQL Server, Oracle, or another. This is for the case where you need to maintain a history of the results; you can clear the data periodically.
You probably want/need to have the latest monitoring results available in a sort of cache (because here performance is critical), so you may use an in-memory database like Hazelcast or Redis, or just a simple cache like EhCache or Infinispan. Storing the data in an external component is better than storing it in a plain ServletContext, because these technologies are aware of multithreading and support ACID, which is not the primary use case for ServletContext but seems necessary for the monitor.
Separate the monitor that receives the data from the C apps from the web app. In case the monitor fails or takes too much time to perform some operations, the web application will still be available, without the overhead of receiving and managing the data from the C apps. On the other hand, if the web app starts to get slower (due to problems in the implementation or something that should be discovered using a profiler), you can restart it, and your monitor will continue gathering the data from the C apps and storing it in your data source.
For the threads in the monitor app, since it seems it will be based on Java, use ExecutorService rather than creating and managing the threads manually.
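For instance, the monitor's accept loop could hand each incoming C-app connection to a pool instead of spawning raw threads; a minimal sketch (port number and pool size are arbitrary):

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class Monitor {
    public static void main(String[] args) throws IOException {
        ExecutorService pool = Executors.newFixedThreadPool(16);    // size to taste
        try (ServerSocket server = new ServerSocket(9090)) {        // port is arbitrary
            while (true) {
                Socket cApp = server.accept();
                pool.submit(() -> handleConnection(cApp));          // one task per C app
            }
        }
    }

    private static void handleConnection(Socket socket) {
        // read JSON messages, validate, store in DB/cache ...
    }
}
```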
For this part:
User monitoring C application tasks, will see progress in graph, updated every second. Additionally text lines of logs of info/error events that may happen occasionally in C applications
You may use RxJava to update your view (JSP, Facelets, plain HTML, or whatever you will use), or another reactive programming model like the Play Framework, to read the data continuously from the database (and cache, if you use one) and push updates directly to the users of the web app. If you don't want to use this programming model, then at least use a push technology like Comet or WebSockets. If this part is not that important, use a simple refresh timer as explained here: How to reload page every 5 second?
For this part:
C applications will be per machine, which will execute any task user sends from web browser
You could reuse the JSON protocol the C apps use to communicate with the monitor, and have another thread in each C app translate the action and execute it.
My Java web application pulls some data from external systems (JSON over HTTP) both live whenever the users of my application request it and batch (nightly updates for cases where no user has requested it). The data changes so caching options are likely exhausted.
The external systems have some throttling in place, the exact parameters of which I don't know, and which likely change depending on system load (e.g., at peak times 10 requests per second from one IP address, at off-peak times 100 requests per second from one IP address). If the requests are too frequent, they time out or return HTTP 503.
Right now I am attempting the request 5 times with 2000ms delay between each, giving up if an error is received each time. This is not optimal as sometimes at peak-times nearly all requests fail; I could avoid making these requests and perhaps get at least some to succeed instead.
My goals are to have a somewhat simple, reliable design, and enough flexibility so that I could both pull some metrics from the throttler to understand how well the external systems are responding (and thus adjust how often they are invoked), and to auto-adjust the interval with which I call them (individually per system) so that it is optimal both on off-peak and peak hours.
My infrastructure is Java with RabbitMQ over MongoDB over Linux.
I'm thinking of three main options:
Since I already use RabbitMQ for batch processing, I could just introduce a queue to which the web processes would send the requests they have for external systems; worker processes would then read from that queue, throttle themselves as needed, and return the results. This would allow running multiple parallel worker processes on more servers if needed. My main concerns are that it isn't a very simple solution, and how to manage the web processes waiting a long while when peak-hour throughput is low. Also, this turns my RabbitMQ into a critical single point of failure; if it dies the whole system stops (as opposed to the nightly batch processes just not running any more, which is less critical). I suppose RPC is the correct RabbitMQ usage pattern, but I'm not sure. Edit - I've posted a related question How to properly implement RabbitMQ RPC from Java servlet web container? on how to implement this.
Introduce nginx (e.g. ngx_http_limit_req_module), HAProxy (link) or other proxy software to the mix (as reverse proxies?), have them take care of the throttling through some configuration magic. The pro is that I don't have to make code changes. The con is that it is more technology used, and one I've not used before, so chances of misconfiguring something are quite high. It would also likely not be easy to do dynamic throttling depending on external server load, or prioritizing live requests over batch requests, or get statistics of how the throttling is doing. Also, most documentation and examples will likely be on throttling incoming requests, not outgoing.
Do a pure-Java solution (e.g., leaky bucket implementation). Would be simple in the sense that it is "just code", but the devil is in the details; debugging all the deadlocks, starvations and race conditions isn't always fun.
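If you go the pure-Java route, a library rate limiter saves you from hand-rolling most of the concurrency. For example, Guava's RateLimiter (a token-bucket style limiter, named here as a swap-in for the leaky bucket) can be shared per external system; a sketch, with the rates and timeout chosen arbitrarily:

```java
import com.google.common.util.concurrent.RateLimiter;
import java.util.concurrent.TimeUnit;

public class ThrottledClient {
    // Allow on average 5 requests per second to this particular external system.
    private final RateLimiter limiter = RateLimiter.create(5.0);

    public String callLive(String url) {
        // Live/user-facing request: wait briefly for a permit, otherwise fail fast.
        if (!limiter.tryAcquire(500, TimeUnit.MILLISECONDS)) {
            throw new IllegalStateException("External system is being throttled, try later");
        }
        return doHttpGet(url);
    }

    public String callBatch(String url) {
        limiter.acquire();           // batch work can simply block until a permit is free
        return doHttpGet(url);
    }

    private String doHttpGet(String url) { /* ... */ return ""; }
}
```

RateLimiter also has setRate, so a worker could lower the rate at runtime when it starts seeing 503s and raise it again later.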
What am I missing here?
Which is the best solution in this case?
P.S. Somewhat related question - what's the proper approach to log all the external system invocations, so that statistics are collected as to how often I invoke them, and what the success rate is?
E.g., after every invocation I'd invoke something like .logExternalSystemInvocation(externalSystemName, wasSuccessful, elapsedTimeMills), and then get some aggregate data out of it whenever needed.
Is there a standard library/tool to use, or do I have to roll my own?
If I use option 1 with RabbitMQ, is there a way to organize the flow so that I get this out of the box from the RabbitMQ console? I wouldn't want to send all failed messages to a poison queue, though; it would fill up too quickly, and in most cases there is no need to re-process these failed requests, as the user has already sadly moved on.
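Regarding the P.S.: a metrics library covers most of that out of the box; for example Dropwizard (Codahale) Metrics gives you counts, rates, and latency percentiles per external system. A sketch of the proposed hook built on it (the registry and metric names are illustrative):

```java
import com.codahale.metrics.MetricRegistry;
import java.util.concurrent.TimeUnit;

public class ExternalCallStats {
    private final MetricRegistry registry = new MetricRegistry();

    public void logExternalSystemInvocation(String system, boolean success, long elapsedMillis) {
        registry.timer(system + ".latency").update(elapsedMillis, TimeUnit.MILLISECONDS);
        registry.meter(system + (success ? ".success" : ".failure")).mark();
    }
}
```

A ConsoleReporter or JmxReporter attached to the registry then exposes the aggregate counts and rates without extra code.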
Perhaps this open source system can help you a little: http://code.google.com/p/valogato/
I've been asked to design and implement a system for receiving a high volume of automated sensor data from a large number of devices. This data will be produced at regular intervals and sent to the server as xml in an http post. The devices will keep resending the same data if they don't receive a specific acknowledgment from the server. Some potentially heavy duty processing of this data will need to occur before it's inserted to a number of tables in the main database via a transaction, and additionally some data points will need to be enqueued to be re-directed to other external urls.
I'm planning on using a Java application server (leaning towards GlassFish) with a servlet to receive the incoming data. I'd like to implement some kind of queuing mechanism to store the data temporarily so that the response back to the sensor isn't dependent on all the intermediate processing. Separate independent queues are also a requirement for the data re-direction piece. After doing some research the two main options seem to be:
1) Install a database on the app server and use tables for the various queues. The queues would be processed by a Java application, either running in the app server or standalone as it's own service.
2) Use a database backed JMS solution to implement the queuing.
I'm not that familiar with JMS but from what I've read it seems to be the better solution in this case. The primary requirement is that no sensor data ever be lost or dropped from the queue before being processed and that it be processed more or less sequentially. We'd also like to make it easy to halt the processing of some of the queues at certain times but still have them accumulate data and for these messages to never automatically expire.
With strategy 1 it's obvious to me how to meet these requirements but it may be less robust and scalable, and more complex to develop than strategy 2, since I'll need to write my own multi-threaded code to handle the various independent queues. I'm wondering what the potential pitfalls could be in using JMS queues for this purpose since I've never worked with them before.
Data integrity is a big issue so I need to make sure JMS can guarantee no data loss in the event of a server reboot, power outage, or if the queue gets very large for some reason. For instance could a problem completing transactions to the main database for a period of time potentially cause the JVM to run out of memory, crash, and lose all accumulated data? (This would be the nightmare scenario).
Also, I was wondering if there would be any way to pause the JMS queue processing via an app server admin tool or to easily see what's in the queue (I would be enqueuing an object which would be the message xml plus some other data, including timestamp received, etc.) I've read a few posts on here that deal with related issues but wanted to get some direct feedback. Basically I'd like to know of instances (if any) where JMS is not an appropriate queuing solution and if this is one of those cases. Any advice is greatly appreciated.
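For reference, a sketch of what the enqueue path on the receiving side could look like, assuming a plain JMS provider; the key points are PERSISTENT delivery and acknowledging the sensor only after send() returns. In practice you would pool the connection/session rather than creating one per message:

```java
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.DeliveryMode;
import javax.jms.JMSException;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;

public class SensorDataEnqueuer {
    private final ConnectionFactory factory;
    private final Queue sensorQueue;   // looked up via JNDI or created by the provider

    public SensorDataEnqueuer(ConnectionFactory factory, Queue sensorQueue) {
        this.factory = factory;
        this.sensorQueue = sensorQueue;
    }

    /** Called from the servlet's doPost() once the XML body has been read. */
    public void enqueue(String sensorXml) throws JMSException {
        Connection connection = factory.createConnection();
        try {
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageProducer producer = session.createProducer(sensorQueue);
            producer.setDeliveryMode(DeliveryMode.PERSISTENT);   // survive broker restarts
            TextMessage message = session.createTextMessage(sensorXml);
            producer.send(message);
            // only after send() returns should the servlet acknowledge the sensor
        } finally {
            connection.close();
        }
    }
}
```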
Kaleb's answer talks about the benefits of JMS quite eloquently, but since you're asking about pitfalls, here's what I can think of.
Not all JMS implementations are equal. In theory you can use whatever implementation suits your needs, but unless you're prepared to do some serious load testing and failure condition testing, you can't know that a particular implementation isn't going to fail under your particular use case.
Most JMS implementations use a transactional datastore, such as a relational database, as their back end. That means that rather than writing directly to whatever datastore you're familiar with, you have to rely on the JMS implementation's extra layer between you and the stored messages.
While swapping JMS implementations to find the one that perfectly fits your needs may seem like a simple endeavor because of the homogeneous JMS API, the critical features for failure handling, JMS server monitoring, and all the other cool stuff that exists above and beyond messaging is going to be a hassle to deal with if you do change your implementation.
That said, I think you'd be crazy to write to the DB yourself instead of going with JMS. On the first point, ActiveMQ is a venerable JMS server used in many enterprise environments. On the second point, the fact is you'd just end up writing that extra layer yourself in order to implement messaging, and your code won't have the benefit of thousands of eyes (or a set of paid developers whose sole job it is to respond to customers and make sure the JMS implementation is solid). On the third point, well, the same ends up being true of your backend datastore. Use JMS; you'll save yourself trouble in the long run.
If you want to go the JMS route, a standalone JMS-compatible message broker (separate from your app server) would be a good choice. Message brokers range from free open-source (like ActiveMQ at http://activemq.apache.org/ or OpenMQ at https://mq.dev.java.net/), to large-scale commercial solutions (IBM's WebSphere MQ at http://www-01.ibm.com/software/integration/wmq/ is one of the largest).
Message brokers offer guaranteed delivery (provided the server's up and listening), and you can do quite a bit to ensure that the system is fail-safe including integrated backup broker servers and instant power backup. Broker queues can eventually run out of room if your app server isn't picking up the messages, but you can assign huge queue depth (100's of GB) and have the server send alerts if the messages aren't getting processed and the queue reaches a certain percentage.
Your Java app would then run on a different server entirely, and would connect to the broker and pull messages off of the queue as fast as possible. If the app server crashes or stops picking up messages for any other reason, the broker would just keep all messages in that queue until the app server begins picking them up again.
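A sketch of that consumer side, using a transacted session so a message is only removed from the queue once the database work has committed; the queue and the processing call are placeholders:

```java
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;

public class SensorDataConsumer {
    public void run(ConnectionFactory factory, Queue sensorQueue) throws Exception {
        Connection connection = factory.createConnection();
        Session session = connection.createSession(true, Session.SESSION_TRANSACTED);
        MessageConsumer consumer = session.createConsumer(sensorQueue);
        connection.start();
        while (true) {
            Message message = consumer.receive();   // blocks until a message arrives
            try {
                processAndStore(((TextMessage) message).getText());  // heavy processing + DB insert
                session.commit();     // the message is removed from the queue only now
            } catch (Exception e) {
                session.rollback();   // the message stays on the queue and will be redelivered
            }
        }
    }

    private void processAndStore(String xml) { /* parse, transform, insert into DB ... */ }
}
```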
You will want to implement a poison message queue in your implementation: this is where messages that cannot be processed after some number of retries end up.
You will probably need to write some code that can examine the messages in that queue and re-send them to the appropriate destination after fixing whatever is causing them to fail.
If the sequence of message processing is important, a message ending up in the poison queue could mean all processing is halted until that message is corrected.
As far as fault tolerance goes, you can have multiple instances of the consuming services subscribe to the same queue or topic, providing an ability to continue processing even if one or more instances goes down.
Finally, have a watchdog process that pings the various consumers on your message queue, and if one doesn't respond, have it send a message that results in a new instance being started. In this way, your message processing environment can be somewhat self regulating.