We have a cron job that runs every hour on a backend module and creates tasks. The cron job runs queries on the Cloud SQL database, and the tasks make HTTP calls to other servers and also update the database. Normally they run fine, even when thousands of tasks are created, but sometimes the process gets "stuck" and there is nothing in the logs that sheds any light on the situation.
For example, yesterday we monitored the cron job while it created a few tens of tasks and then stopped, along with 8 of the tasks, which also got stuck in the queue. When it was obvious that nothing was happening, we ran the process a few more times, and each run completed successfully.
After a day the original task was killed with a DeadlineExceededException, and then the 8 other tasks, which were apparently running on the same instance, were killed with the following message:
A problem was encountered with the process that handled this request, causing it to exit. This is likely to cause a new process to be used for the next request to your application. If you see this message frequently, you may be throwing exceptions during the initialization of your application. (Error code 104)
Until the processes were killed we saw no record of them in the logs, and now that they do appear there are no log entries before the DeadlineExceededException, so we have no idea at what point they got stuck.
We suspected a lock in the database, but according to the following link there is a 10-minute limit for queries, so a stuck query should have caused the process to fail much sooner than one day: https://cloud.google.com/appengine/docs/java/cloud-sql/#Java_Size_and_access_limits
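For what it's worth, one way to rule out a silently blocked query is to set an explicit statement timeout, so a stuck query fails fast and leaves a log entry instead of hanging. A minimal sketch, assuming a plain JDBC connection to Cloud SQL with the appropriate driver on the classpath (the connection string, timeout and query are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class QueryWithTimeout {
    public static void runQuery() throws SQLException {
        // Placeholder Cloud SQL JDBC URL; use whichever driver/URL your setup requires.
        try (Connection conn = DriverManager.getConnection("jdbc:google:mysql://my-project:my-instance/mydb");
             Statement stmt = conn.createStatement()) {
            // Fail after 60 seconds instead of hanging until the request deadline.
            stmt.setQueryTimeout(60);
            try (ResultSet rs = stmt.executeQuery("SELECT id FROM pending_items")) {
                while (rs.next()) {
                    // enqueue one task per row (omitted)
                }
            }
        }
    }
}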
Our module's class and scaling configuration is:
<instance-class>B4</instance-class>
<basic-scaling>
  <max-instances>11</max-instances>
  <idle-timeout>10m</idle-timeout>
</basic-scaling>
The configuration of the queue is:
<rate>5/s</rate>
<max-concurrent-requests>100</max-concurrent-requests>
<mode>push</mode>
<retry-parameters>
  <task-retry-limit>5</task-retry-limit>
  <min-backoff-seconds>10</min-backoff-seconds>
  <max-backoff-seconds>200</max-backoff-seconds>
</retry-parameters>
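For reference, the cron handler enqueues the tasks with the standard Task Queue API, roughly like this (a simplified sketch; the queue name, handler URL and parameter are placeholders):

import com.google.appengine.api.taskqueue.Queue;
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskOptions;

public class TaskEnqueuer {
    public static void enqueue(String itemId) {
        Queue queue = QueueFactory.getQueue("my-queue");     // placeholder queue name
        queue.add(TaskOptions.Builder
                .withUrl("/tasks/process")                   // placeholder handler URL
                .param("itemId", itemId)
                .method(TaskOptions.Method.POST));
    }
}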
I uploaded some images of the trace data for the cron job:
http://imgur.com/a/H5wGG.
This includes the trace summary, and the beginning/ending of the timeline.
There is no trace data for the 8 terminated tasks.
What could be the cause of this and how can we investigate it further?
We eventually managed to solve the problem with the following steps:
1. We split the module into two: one module to run the cron job and one module to handle the generated tasks. This allowed us to see that the problem was with handling the tasks, as that was the only module that kept getting stuck.
2. We limited the number of concurrent tasks to 2, which seems to be the maximum number of tasks that can be processed simultaneously without the system getting stuck.
Related
This is regarding a Flink job that has a simple source that fetches data from a URL, filters the data, collects it in a process function for some time (keyBy), and finally processes the collected data in a map. For some reason the job stops functioning after a few days, even though the Flink UI still shows it as running. Is there any way to know why it behaves this way, and is there any way to tell that a job has actually stopped even though the UI shows it as running?
P.S. How do I know the job has stopped? Answer: it no longer performs the task it was performing.
I checked the logs, but they didn't help me much in understanding the issue.
It sounds like the job manager and task manager(s) are still running, since at least the heartbeat messages are being delivered.
There are a number of metrics that might shed some light on what is happening:
If the job is using event time, perhaps an idle source is causing the watermarks to no longer advance. You should be able to see this in the metrics by looking at the numRecordsOutPerSecond from the source instances, and by looking at the current watermark.
If you are reading from Kafka (or Kinesis), look at records-lag-max (or millisBehindLatest).
If you have checkpoints enabled, look to see if they are still succeeding.
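For the last point, one low-effort way to watch checkpoints from outside the job is to poll the JobManager's REST API for checkpoint statistics and alert when the latest completed checkpoint stops advancing. A rough sketch, assuming the endpoint is reachable at the usual /jobs/<jobid>/checkpoints path for your Flink version (host, port and job ID are placeholders):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class CheckpointProbe {
    // Fetches checkpoint statistics for one job from the JobManager REST API.
    public static String fetchCheckpointStats(String jobId) throws Exception {
        URL url = new URL("http://jobmanager-host:8081/jobs/" + jobId + "/checkpoints");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        StringBuilder body = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line);
            }
        }
        // The JSON response includes the latest completed checkpoint; parse it with
        // your JSON library of choice and alert if its timestamp has gone stale.
        return body.toString();
    }
}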
For the past month we have been facing a problem where the App Engine backend randomly stops processing any jobs at all.
The closest other questions I have seen are this one and this, but neither was of use.
My push queue configuration is
<queue>
  <name>MyFetcherQueue</name>
  <target>mybackend</target>
  <rate>30/m</rate>
  <max-concurrent-requests>1</max-concurrent-requests>
  <bucket-size>30</bucket-size>
  <retry-parameters>
    <task-retry-limit>10</task-retry-limit>
    <min-backoff-seconds>10</min-backoff-seconds>
    <max-backoff-seconds>200</max-backoff-seconds>
    <max-doublings>2</max-doublings>
  </retry-parameters>
</queue>
Standard Java App Engine with multiple modules, one of which is a backend with basic scaling, instance class B4:
<runtime>java8</runtime>
<threadsafe>true</threadsafe>
<instance-class>B4</instance-class>
<basic-scaling>
  <max-instances>2</max-instances>
</basic-scaling>
Note: just to fix this problem I have tried the following, all to no avail:
Updated Java from 7 to 8 - didn't work.
Changed the instance class from B1 to B4 (thinking it could be a memory issue, though nothing in the logs suggested it) - didn't work.
Changed the queue processing rate to as low as 15/m with a bucket size of 15.
Changed max-instances from 1 to 2, hoping that if at least one instance hangs, the other would still be able to process, but to no avail. I noticed (today) that when the issue occurred, the second instance had handled only 2 requests (the other had 1400+), yet it did not pick up the remaining tasks from the queue. The task ETA just kept increasing.
Updated the Java App Engine SDK from 1.9.57 to the latest 1.9.63.
Behavior: after a while (almost once per day now), the backend just stops responding and the tasks in this queue are left as-is. The only way to continue is to kill the backend instance from the console, after which the new instance starts processing the tasks. The tasks are simple HTTP calls to fetch data. The queue usually holds anywhere from 1 to 15-20 tasks at any time; I have noticed it sometimes stalls with as few as 3-4.
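For reference, each task handler boils down to a single HTTP fetch, roughly along these lines (a hypothetical sketch; the URL is a placeholder, and the explicit URLFetch deadline is my addition so that a hung remote server cannot tie up the instance indefinitely):

import com.google.appengine.api.urlfetch.FetchOptions;
import com.google.appengine.api.urlfetch.HTTPMethod;
import com.google.appengine.api.urlfetch.HTTPRequest;
import com.google.appengine.api.urlfetch.HTTPResponse;
import com.google.appengine.api.urlfetch.URLFetchService;
import com.google.appengine.api.urlfetch.URLFetchServiceFactory;
import java.io.IOException;
import java.net.URL;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class FetchTaskServlet extends HttpServlet {
    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        URLFetchService fetcher = URLFetchServiceFactory.getURLFetchService();
        // Explicit deadline (in seconds) so a hung remote server fails the task
        // instead of silently tying up the instance.
        HTTPRequest request = new HTTPRequest(
                new URL("https://example.com/data"),      // placeholder URL
                HTTPMethod.GET,
                FetchOptions.Builder.withDeadline(30.0));
        HTTPResponse response = fetcher.fetch(request);
        // ... process response.getContent() ...
        resp.setStatus(response.getResponseCode());
    }
}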
The logs show nothing at all; nothing scrolls. Only when the backend instance is deleted does the log show that the /task was terminated via the console. No out-of-memory errors, no crashes, no 404s.
Earlier logs and screenshots that I captured: I am unable to add more images, but I have the ones for utilization, memory usage, and the most recent memory usage, after which I deleted the instance and the tasks resumed.
This and other recent experiences have shaken my faith in Google App Engine. What am I doing wrong? (This exact backend setup with this queue config previously worked for many years on B1!)
We've been running into a bug that causes our executors to continuously fail. On the surface, the application still appears healthy, and it has taken upwards of 3 days for the application to crash after such incidents. We want to add monitoring that can be used to alert us in this situation.
I've decided to monitor batch-completed events: if the executors are continuously failing, no batches should complete. This has the added benefit of tracking when batches are backed up. I achieved this by registering a SparkStreamingListener that marks a meter every time onBatchCompleted is called. While testing its efficacy, I killed the application while keeping the driver running. Much to my surprise, batches are still reported as completing.
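The listener is roughly along these lines (a simplified sketch, assuming a Dropwizard Metrics registry; the metric name is a placeholder, and overriding only onBatchCompleted assumes a Spark build where the other StreamingListener callbacks are default methods, e.g. Scala 2.12):

import com.codahale.metrics.Meter;
import com.codahale.metrics.MetricRegistry;
import org.apache.spark.streaming.scheduler.StreamingListener;
import org.apache.spark.streaming.scheduler.StreamingListenerBatchCompleted;

// Marks a meter whenever a batch completes, so an external alert can fire
// when the completion rate drops to zero for too long.
public class BatchCompletionListener implements StreamingListener {
    private final Meter completedBatches;

    public BatchCompletionListener(MetricRegistry registry) {
        this.completedBatches = registry.meter("streaming.batchesCompleted"); // placeholder metric name
    }

    @Override
    public void onBatchCompleted(StreamingListenerBatchCompleted batchCompleted) {
        completedBatches.mark();
    }
}

It is registered with the streaming context's addStreamingListener(...) before the context is started.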
Is this expected behaviour, or a bug? If it is expected behaviour, does anyone have alternative methods for tracking when executors are continuously failing, or the overall health of the Spark application?
We just did a rolling restart of our servers, but now every few hours our cluster stops responding to API calls. Instead, when we make a call, we get a response like this:
curl -XGET 'http://localhost:9200/_cluster/health?pretty'
{
"error" : "OutOfMemoryError[unable to create new native thread]",
"status" : 500
}
I noticed that we can still index data fine, it seems, but we cannot search or call any API functions. This seems to happen every few hours, and the most recent time it happened there were no logs in any of the nodes' log files.
Our cluster has 8 nodes across 5 servers (3 servers run 2 Elasticsearch processes each, 2 run 1), running RHEL 6u5. We are running Elasticsearch 1.3.4.
This can be caused by the OS refusing to create more threads. Increasing the number of threads allowed per user can solve the problem: raise the ulimit -u value.
Although the above helps, an even better solution is to configure Elasticsearch to use a bounded thread pool. This is better because thread creation and destruction are expensive; in fact, some (or all?) JVMs can't clean up terminated threads except during a full GC.
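As a quick way to see how close a node's JVM is to the OS limit, something like the following can be run (or exposed via an endpoint) on each node; it only reads standard JMX thread counts, so it should work on any JVM (the alert threshold is an arbitrary placeholder to compare against your ulimit -u value):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class ThreadCountProbe {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        int live = threads.getThreadCount();              // currently live threads
        int peak = threads.getPeakThreadCount();          // peak since JVM start
        long started = threads.getTotalStartedThreadCount();
        System.out.printf("live=%d peak=%d totalStarted=%d%n", live, peak, started);
        // Placeholder threshold: alert well before the per-user limit is reached.
        if (live > 4000) {
            System.err.println("Thread count approaching the per-user limit");
        }
    }
}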
I have a number of backend processes (Java applications) that run 24/7. To monitor these backends (i.e. to check whether a process is not responding and to notify via SMS/email), I have written another application.
The backends now log a heartbeat at a regular interval, and this new application checks that they are doing so regularly and notifies if necessary.
Now, we have two options:
either run it as a scheduled task, which runs every (let's say) 15 minutes and stops after doing its job, or
run it as another backend process with a 15-minute sleep.
The issue we can foresee right now is: what if the monitor application itself becomes unresponsive? So my question is: is there any difference between the two cases, or are they the same? Which option would suit my case better?
Please note this is a specific case and is not the same as this or this.
Environment: Java, hosted on a Linux server.
By scheduled task, do you mean triggered by the system scheduler, or as a scheduled thread in the existing backend processes?
To capture unexpected termination or unresponsive states you would be best off running a separate process rather than a thread. However, a scheduled thread would give you closer interaction with the owning process, with less IPC overhead.
I would implement both. Maintain a record of the local state in each backend process, with a scheduled task in each process triggering a thread to update the current state of that node. This update could be fairly frequent, since it will be less expensive than communicating with a separate process.
Use your separate "monitoring app" process to routinely gather information about all the backend processes. This should occur less frequently; whether the process runs all the time or is scheduled by a cron job is immaterial, since the state is held in each backend process. If one of the backends becomes unresponsive, this monitoring app will be able to detect the lack of response and perform some meaningful probes to determine what the problem is. It will be this component that then notifies your SMS/email utility to send a report.
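A minimal sketch of this pattern, assuming each backend publishes its heartbeat as a timestamp file that the monitor can read (the path, interval and staleness threshold are placeholders):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class Heartbeat {
    // In each backend: overwrite the heartbeat file with the current time every minute.
    public static void startPublisher(Path heartbeatFile) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            try {
                Files.write(heartbeatFile, Instant.now().toString().getBytes(StandardCharsets.UTF_8));
            } catch (IOException e) {
                e.printStackTrace();
            }
        }, 0, 1, TimeUnit.MINUTES);
    }

    // In the monitor: flag any backend whose heartbeat is older than the threshold.
    public static boolean isStale(Path heartbeatFile, Duration threshold) throws IOException {
        Instant last = Instant.parse(new String(Files.readAllBytes(heartbeatFile), StandardCharsets.UTF_8).trim());
        return Duration.between(last, Instant.now()).compareTo(threshold) > 0;
    }

    public static void main(String[] args) throws Exception {
        Path file = Paths.get("/var/run/backend1.heartbeat"); // placeholder path
        startPublisher(file);                                 // runs inside the backend process
        Thread.sleep(5_000);                                  // give the first heartbeat time to land
        System.out.println("stale? " + isStale(file, Duration.ofMinutes(15))); // monitor-side check
    }
}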
I would go for a backend process, as it can maintain state.
Have a look at the Quartz Scheduler from Terracotta:
http://terracotta.org/products/quartz-scheduler
It is resilient to transient conditions, and you only need to provide a simple wrapper, so the monitor app should be robust provided you get the threading settings right in the quartz.properties file.
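For illustration, a 15-minute monitoring job with the Quartz 2.x API could look roughly like this (the job body and identities are placeholders):

import org.quartz.Job;
import org.quartz.JobBuilder;
import org.quartz.JobDetail;
import org.quartz.JobExecutionContext;
import org.quartz.Scheduler;
import org.quartz.SimpleScheduleBuilder;
import org.quartz.Trigger;
import org.quartz.TriggerBuilder;
import org.quartz.impl.StdSchedulerFactory;

public class MonitorMain {
    // Placeholder job: check backend heartbeats and notify via SMS/email if any are stale.
    public static class MonitorJob implements Job {
        @Override
        public void execute(JobExecutionContext context) {
            // ... check heartbeats, send notifications ...
        }
    }

    public static void main(String[] args) throws Exception {
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();
        scheduler.start();

        JobDetail job = JobBuilder.newJob(MonitorJob.class)
                .withIdentity("monitorJob")
                .build();

        // Fire immediately, then every 15 minutes, forever.
        Trigger trigger = TriggerBuilder.newTrigger()
                .withIdentity("monitorTrigger")
                .startNow()
                .withSchedule(SimpleScheduleBuilder.simpleSchedule()
                        .withIntervalInMinutes(15)
                        .repeatForever())
                .build();

        scheduler.scheduleJob(job, trigger);
    }
}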
You can use Nagios Core as the core and Naptor to monitor your application. It is easy to set up and embed in your application.
You can check at this link:
https://github.com/agunghakase/Naptor/tree/ver1.0.0