We've been running into a bug that causes our executors to continuously fail. On the surface, the application still appears healthy, and it has taken upwards of 3 days for the application to crash after such incidents. We want to add monitoring that can be used to alert us in this situation.
I've decided to monitor batch completed events: if the executors are continuously failing, no batches should complete. This has the added benefit of tracking when batches are backed up. I achieved this by registering a SparkStreamingListener that marks a meter every time onBatchCompleted is called. While testing its efficacy, I killed the application while keeping the driver running. Much to my surprise, batches were still reported as completing.
Is this expected behaviour, or a bug? If this is expected behaviour, does anyone have any alternative methods for tracking when executors are continuously failing / the overall health of the spark application?
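For readers who want to see the meter idea in isolation, here is a minimal, library-free sketch. It is not the Spark API: `CompletionMeter` and its methods are hypothetical names, and in a real job `mark` would be called from the listener's onBatchCompleted callback. As the question itself notes, this signal alone did not catch the failure in my test, so treat it as an illustration of the approach rather than a fix.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;

/** Counts "batch completed" marks and reports when none arrived within a time window. */
class CompletionMeter {
    private final Duration window;
    private final Deque<Instant> marks = new ArrayDeque<>();

    CompletionMeter(Duration window) {
        this.window = window;
    }

    /** Hypothetical hook: call this from the batch-completed callback. */
    synchronized void mark(Instant when) {
        marks.addLast(when);
    }

    /** Drops marks older than the window ending at 'now' and returns how many remain. */
    synchronized int countInWindow(Instant now) {
        Instant cutoff = now.minus(window);
        while (!marks.isEmpty() && marks.peekFirst().isBefore(cutoff)) {
            marks.removeFirst();
        }
        return marks.size();
    }

    /** True when no batch completion was marked within the window. */
    synchronized boolean isStalled(Instant now) {
        return countInWindow(now) == 0;
    }

    public static void main(String[] args) {
        CompletionMeter meter = new CompletionMeter(Duration.ofMinutes(5));
        Instant start = Instant.parse("2020-01-01T00:00:00Z");
        meter.mark(start);
        System.out.println(meter.isStalled(start.plusSeconds(60)));   // false
        System.out.println(meter.isStalled(start.plusSeconds(600)));  // true
    }
}
```

An alerting thread would call `isStalled(Instant.now())` periodically. The five-minute window is an arbitrary value chosen for the example.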
This is regarding a Flink job that has a simple source to fetch data from a URL, filters the data, collects it in a process function for some time (after a keyBy), and finally processes the collected data in a map. For some reason the job stops functioning after some days, even though the Flink UI shows it as running. Is there any way to find out why this happens? Also, is there any way to know that a job has actually stopped even though the UI shows it as running?
P.S. How do I know the job has stopped? Answer: it no longer performs the task it was performing.
I checked the logs, but they didn't help me much in understanding the issue.
It sounds like the job manager and task manager(s) are still running, since at least the heartbeat messages are being delivered.
There are a number of metrics that might shed some light on what is happening:
If the job is using event time, perhaps an idle source is causing the watermarks to no longer advance. You should be able to see this in the metrics by looking at the numRecordsOutPerSecond from the source instances, and by looking at the current watermark.
If you are reading from Kafka (or Kinesis), look at records-lag-max (or millisBehindLatest).
If you have checkpoints enabled, look to see if they are still succeeding.
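To make the "is it still advancing?" check concrete, here is a small sketch of a watchdog that flags a stall when a sampled value (for example the current watermark, or a running record count) stops increasing. `ProgressWatchdog` is a hypothetical helper, not part of Flink. How the samples are obtained (a metrics reporter, the REST API, or similar) is deliberately left out, and the 60-second threshold is an arbitrary assumption.

```java
/** Flags a stall when a monotonically increasing metric stops advancing for too long. */
class ProgressWatchdog {
    private final long maxStallMillis;
    private long lastValue;
    private long lastAdvanceAtMillis = -1; // -1 means no sample has been seen yet

    ProgressWatchdog(long maxStallMillis) {
        this.maxStallMillis = maxStallMillis;
    }

    /**
     * Records one sample of the metric taken at nowMillis.
     * Returns true if the value has not advanced for longer than maxStallMillis.
     */
    boolean observe(long nowMillis, long value) {
        if (lastAdvanceAtMillis < 0 || value > lastValue) {
            // First sample, or the metric moved forward: reset the stall clock.
            lastValue = value;
            lastAdvanceAtMillis = nowMillis;
            return false;
        }
        return nowMillis - lastAdvanceAtMillis > maxStallMillis;
    }

    public static void main(String[] args) {
        ProgressWatchdog watchdog = new ProgressWatchdog(60_000L);
        System.out.println(watchdog.observe(0L, 1000L));        // false
        System.out.println(watchdog.observe(30_000L, 2000L));   // false
        System.out.println(watchdog.observe(60_000L, 2000L));   // false
        System.out.println(watchdog.observe(120_000L, 2000L));  // true
    }
}
```

The same class works for any metric that should only grow while the job is healthy. A job whose UI still says "running" but whose watermark sits still for longer than the threshold would trip it.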
I wrote an application that runs tests on network nodes, such as ping tests, retrieving disk space, and so on.
I use a scheduled batchlet to run the actions, but I wonder whether this is the right use of a batchlet.
Would an EJB timer be more appropriate? Also, when I run a batchlet, my GlassFish server keeps a log of the batch job, and I don't necessarily need it (especially with the number of batch jobs generated during a day).
If I need to run several jobs at the same scheduled time, I think a batchlet can do it, but can an EJB timer too?
Could you give me your input on the right way to achieve this?
Thanks,
Ersch
This isn't a question with a clear answer, but there is a bit of a cost in factoring your application as a batch job, and I would look at what I'm getting to see if it's worth doing so.
So you're thinking about a job consisting of a single Batchlet step. Well, there'd be nothing gained from the "restart" functions, neither restarting at the failing step within a job nor leveraging checkpoints within a chunk step. The batchlet programming model is quite simple... even if you really like @BatchProperty, you'd now have to deal with an XML file to use it.
This only starts to get more interesting if you want to start, view, and manage these executions along with the rest of your batch jobs. This might be because you're working with an implementation that offers some kind of implementation-specific add-on function. An example of this could be an integration with external scheduler software, allowing jobs to be scheduled by it. At the other extreme, if you found value in having a persisted record of all your batch job executions in one place (the job repository, usually a persistent DB), then that could also make this worthwhile for you.
But if you don't care for any of that, then an EJB timer could be the way to go instead.
Using an EJB timer is appropriate when your task executes in an eye blink (or thereabouts).
Otherwise use the batching mechanism.
Long-running tasks executed from EJB timers can be problematic because they execute in transactions, which normally time out after a short period. Increasing this transaction timeout also increases the chance of holding database and perhaps other resource locks, which can impact the normal operation of your application.
We have a cron job that runs every hour on a backend module and creates tasks. The cron job runs queries on the Cloud SQL database, and the tasks make HTTP calls to other servers and also update the database. Normally they run great, even when thousands of tasks are created, but sometimes it gets "stuck" and there is nothing in the logs that can shed some light on the situation.
For example, yesterday we monitored the cron job while it created a few tens of tasks and then it stopped, along with 8 of the tasks that also got stuck in the queue. When it was obvious that nothing was happening we ran the process a few more times and each time completed successfully.
After a day the original task was killed with a DeadlineExceededException and then the 8 other tasks, that were apparently running in the same instance, were killed with the following message:
A problem was encountered with the process that handled this request, causing it to exit. This is likely to cause a new process to be used for the next request to your application. If you see this message frequently, you may be throwing exceptions during the initialization of your application. (Error code 104)
Until the processes were killed we saw absolutely no record of them in the logs, and now that we see them there are no log records before the time of the DeadlineExceededException, so we have no idea at what point they got stuck.
We suspected that there is some lock in the database, but we see in the following link that there is a 10 minute limit for queries, so that would cause the process to fail much sooner than one day: https://cloud.google.com/appengine/docs/java/cloud-sql/#Java_Size_and_access_limits
Our module's class and scaling configuration is:
<instance-class>B4</instance-class>
<basic-scaling>
  <max-instances>11</max-instances>
  <idle-timeout>10m</idle-timeout>
</basic-scaling>
The configuration of the queue is:
<rate>5/s</rate>
<max-concurrent-requests>100</max-concurrent-requests>
<mode>push</mode>
<retry-parameters>
  <task-retry-limit>5</task-retry-limit>
  <min-backoff-seconds>10</min-backoff-seconds>
  <max-backoff-seconds>200</max-backoff-seconds>
</retry-parameters>
I uploaded some images of the trace data for the cron job:
http://imgur.com/a/H5wGG.
This includes the trace summary, and the beginning/ending of the timeline.
There is no trace data for the 8 terminated tasks.
What could be the cause of this and how can we investigate it further?
We eventually managed to solve the problem with the following steps:
We split the module into two: one module to run the cron job and one module to handle the generated tasks. This allowed us to see that the problem was with handling the tasks as that was the only module that kept getting stuck.
We limited the number of concurrent tasks to 2, which seems to be the maximum number of tasks that can be processed simultaneously without the system getting stuck.
Our app relies extensively on backend instances. There is some logic that has to run every few seconds. The execution of this code cannot be driven only by requests arriving on the frontend, because it needs to run regardless.
We only considered using task queues to solve this. But as far as we know, task queues only guarantee that tasks will be executed within 24 hours. I have not found a reference to back this up though.
Our app uses a fixed number of resident B1 backend instances. We assume that each instance stays alive 24/7 after the backend version is deployed and started.
Is this a valid assumption? If not, can our application be notified every time a backend instance is about to be shut down?
What is the SLA on the availability of a backend instance?
Are backend instances restarted automatically after they are terminated? E.g. is an instance automatically restarted after it runs out of memory?
How quickly will instances be brought up again if they are ever terminated?
We create a fixed size thread pool on each backend instance. Is there a maximum size for thread pools that we can have on a backend instance?
Are there any other conditions under which a backend instance might die?
Thanks!
UPDATES
It turns out a couple of the questions can be answered by reading the docs.
App Engine attempts to keep backends running indefinitely. However, at this time there is no guaranteed uptime for backends.
So what is the SLA for uptime? I am looking for a statement like: "The guaranteed uptime for backends is 99.99%"
The App Engine team will provide more guidance on expected backend uptime as statistics become available.
When will these statistics be available?
It's also important to recognize that the shutdown hook is not always able to run before a backend terminates. In rare cases, an outage can occur that prevents App Engine from providing 30 seconds of shutdown time.
When App Engine needs to turn down a backend instance, existing requests are given 30 seconds to complete, and new requests immediately return 404.
The following code sample demonstrates a basic shutdown hook:
LifecycleManager.getInstance().setShutdownHook(new ShutdownHook() {
  public void shutdown() {
    LifecycleManager.getInstance().interruptAllRequests();
  }
});
I am running only one instance of a resident (non-dynamic) backend, and my experience is that it is restarted at least once a day.
Your application must be able to store its state and resume after a restart.
I have a number of backend processes (Java applications) which run 24/7. To monitor these backends (i.e. to check whether a process is not responding and notify via SMS/email) I have written another application.
The old backends now log a heartbeat at a regular time interval, and this new application checks that they are doing so regularly and sends a notification if necessary.
Now we have two options:
either run it as a scheduled task, which will run every (let's say) 15 minutes and stop after doing its job, or
run it as another backend process with a 15-minute sleep time.
The issue we can foresee right now is: what if this monitor application itself goes into a non-responding state? So my question is: is there any difference between the two cases, or are they the same? Which option would suit my case better?
Please note this is a specific case and is not same as this or this
Environment: Java, hosted on a Linux server
By scheduled task, do you mean triggered by the system scheduler, or as a scheduled thread in the existing backend processes?
To capture unexpected termination or unresponsive states you would be best running a separate process rather than a thread. However, a scheduled thread would give you closer interaction with the owning process with less IPC overhead.
I would implement both. Maintain a record of the local state in each backend process, with a scheduled task in each process triggering a thread to update the current state of that node. This update could be fairly frequent, since it will be less expensive than communicating with a separate process.
Use your separate "monitoring app" process to routinely gather the information about all the backend processes. This should occur less frequently. Whether the process is running all the time or is scheduled by a cron job is immaterial, since the state is held in each backend process. If one of the backends becomes unresponsive, this monitoring app will be able to detect the lack of response and perform some meaningful probes to determine what the problem is. It will be this component that then notifies your SMS/email utility to send a report.
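As a rough illustration of the monitoring side, the sketch below keeps the last heartbeat time per backend and lists the ones that have been silent for too long. Everything here is an assumption made for the example: the class name `HeartbeatMonitor`, the backend names, and the 15-minute threshold. How heartbeats actually reach the monitor (log files, a socket, a shared table) and how the SMS/email is sent are left out.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Tracks the last heartbeat of each backend and reports those that went silent. */
class HeartbeatMonitor {
    private final Duration maxSilence;
    private final Map<String, Instant> lastSeen = new ConcurrentHashMap<>();

    HeartbeatMonitor(Duration maxSilence) {
        this.maxSilence = maxSilence;
    }

    /** Called whenever a heartbeat from the named backend is observed. */
    void recordHeartbeat(String backend, Instant when) {
        lastSeen.put(backend, when);
    }

    /** Returns, in alphabetical order, the backends silent for longer than maxSilence. */
    List<String> findUnresponsive(Instant now) {
        List<String> stale = new ArrayList<>();
        for (Map.Entry<String, Instant> entry : lastSeen.entrySet()) {
            Duration silence = Duration.between(entry.getValue(), now);
            if (silence.compareTo(maxSilence) > 0) {
                stale.add(entry.getKey());
            }
        }
        Collections.sort(stale);
        return stale;
    }

    public static void main(String[] args) {
        HeartbeatMonitor monitor = new HeartbeatMonitor(Duration.ofMinutes(15));
        Instant start = Instant.parse("2020-01-01T00:00:00Z");
        monitor.recordHeartbeat("billing", start);                             // hypothetical backend
        monitor.recordHeartbeat("orders", start.plus(Duration.ofMinutes(10))); // hypothetical backend
        // Twenty minutes in, only "billing" has been silent for more than 15 minutes.
        System.out.println(monitor.findUnresponsive(start.plus(Duration.ofMinutes(20)))); // [billing]
    }
}
```

Note that a backend that has never sent a heartbeat is not reported by this sketch. A real monitor would probably want to register the expected backends up front so that a process that never started is also caught.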
I would go for a backend process, as it can maintain state.
Have a look at the Quartz scheduler from Terracotta:
http://terracotta.org/products/quartz-scheduler
It will be resilient to transient conditions, and you only need to provide a simple wrapper, so the monitor app should be robust provided you get the threading settings right in the quartz.properties file.
You can use Nagios Core as the core and Naptor to monitor your application. It's easy to set up and embed in your application development.
You can check at this link:
https://github.com/agunghakase/Naptor/tree/ver1.0.0