For the past month we have been facing a problem where the App Engine backend randomly stops processing any jobs at all.
The closest other questions I have seen are this one and this one, but neither was of any use.
My push queue configuration is
<queue>
<name>MyFetcherQueue</name>
<target>mybackend</target>
<rate>30/m</rate>
<max-concurrent-requests>1</max-concurrent-requests>
<bucket-size>30</bucket-size>
<retry-parameters>
<task-retry-limit>10</task-retry-limit>
<min-backoff-seconds>10</min-backoff-seconds>
<max-backoff-seconds>200</max-backoff-seconds>
<max-doublings>2</max-doublings>
</retry-parameters>
</queue>
This is a standard Java App Engine application with multiple modules, one of which is a backend module with basic scaling on a B4 instance class.
<runtime>java8</runtime>
<threadsafe>true</threadsafe>
<instance-class>B4</instance-class>
<basic-scaling>
<max-instances>2</max-instances>
</basic-scaling>
Note: just to fix this problem, I have already tried the following, but to no avail:
Updated Java from 7 to 8 - didn't work.
Changed the instance class from the earlier B1 to B4 (thinking it could be a memory issue, but there was nothing of the sort in the logs) - didn't work.
Changed queue processing rate to as low as 15/m with bucket size 15.
Changed max-instances from the earlier 1 to 2, hoping that if at least one instance hangs, the other should still be able to process, but to no avail. I noticed (today) that when this issue occurred, the second instance had served only 2 requests (the other had 1400+), yet the new instance did not pick up the remaining tasks from the queue. The task ETA just kept increasing.
Updated the Java App Engine SDK from the earlier 1.9.57 to the latest 1.9.63.
Behavior: after a while (almost once per day now), the backend just stops responding and the tasks in this queue are left as is. The only way to continue is to kill the backend instance from the console, after which the new instance starts processing the tasks. These tasks are simple HTTP calls to fetch data. The queue usually has anywhere from 1 task up to 15-20 at any time, and I have noticed that it sometimes stalls with as few as 3-4.
The logs don't show anything at all; nothing scrolls. Only when the backend instance is deleted does the log show that the /task request was terminated via the console. No out-of-memory errors, no crashes, no 404s.
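Since each task is just an outbound HTTP fetch, one thing worth ruling out (not something the logs above confirm) is a fetch without an explicit deadline keeping the single allowed concurrent request hung. A minimal sketch of setting a deadline with the URLFetch service; the helper name and the 30-second value are only illustrative:
import java.net.URL;
import com.google.appengine.api.urlfetch.FetchOptions;
import com.google.appengine.api.urlfetch.HTTPMethod;
import com.google.appengine.api.urlfetch.HTTPRequest;
import com.google.appengine.api.urlfetch.HTTPResponse;
import com.google.appengine.api.urlfetch.URLFetchServiceFactory;

// Hypothetical helper: fail fast rather than let a slow upstream server hold
// the only concurrent request this queue is allowed to run.
HTTPResponse fetchWithDeadline(URL targetUrl) throws Exception {
    FetchOptions options = FetchOptions.Builder.withDeadline(30.0); // seconds
    HTTPRequest request = new HTTPRequest(targetUrl, HTTPMethod.GET, options);
    return URLFetchServiceFactory.getURLFetchService().fetch(request);
}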
Earlier logs and screenshots that I captured:
I am unable to add more images, but here are the ones for utilization, memory usage, and the most recent memory usage, after which I deleted the instance and the tasks resumed.
This and other recent experiences have shaken my faith in Google App Engine. What am I doing wrong? (This exact backend setup with this queue configuration had previously worked for many years on a B1 instance!)
Related
I have a simple JMeter experiment with a single Thread Group with 16 threads, running for 500s, hitting the same URL every 2 seconds on each thread, generating 8 requests/second. I'm running in non-GUI (Command Line) mode. Here is the .jmx file:
https://www.dropbox.com/s/l66ksukyabovghk/TestPlan_025.jmx?dl=0
Here is a plot of the result, running on an AWS m5ad.2xlarge / 8 cores / 32GB RAM (I get the same behavior on VirtualBox Debian on my PC, on a very large Hetzner server, and on Neocortix Cloud Services instances):
https://www.dropbox.com/s/gtp6oqy0xtuybty/aws.png?dl=0
At the beginning of the Thread Group, all 16 threads report a long response time (0.33s), then settle into a normal short response time (<0.1s). I call this the "Start of Run" problem.
Then about 220s later, there is another burst of 16 long response times, and yet another burst at about 440s. I call those the "Start of Run Echo" problem, because they look like echoes of the "Start of Run" problem. The same problem occurs if I introduce another Thread Group with a delay, say 60s. That Thread Group gets its own "Start of Run" problem at t=60s, and then its own echoes at 280s and 500s.
These two previous posts seem related, but no conclusive cause was given for the "Start of Run" problem, and the "Start of Run Echo" problem was not mentioned.
Jmeter - The time taken by first iteration of http sampler is large
First HTTP Request taking a long time in JMeter
I can work around the "Start of Run" problem by hitting a non-existent page with the first HTTP request in each thread, getting a 404 error, and filtering out the 404s. But that is a hack, and it doesn't solve the "Start of Run Echo" problem, because the echoes are not guaranteed to land on the non-existent pages. And it introduces "holes" in the load delivered to the real target pages.
Update: after a suggestion from Dmitri T, I have installed JMeter 5.3. It has the default value httpclient4.time_to_live=60000 (60s), and its output matches that:
https://www.dropbox.com/s/gfcqhlfq2h5asnz/hetzner_60.png?dl=0
But if I increase httpclient4.time_to_live to 600000 (600s), it does not push all the "echoes" out past the end of the run. It still shows echoes at about 220s and 440s, i.e. the same original behavior that I am trying to eliminate.
https://www.dropbox.com/s/if3q652iyiyu69b/hetzner_600.png?dl=0
I am wondering if httpclient4.time_to_live has an effective maximum value of 220000 (220s) or so.
Thank you,
Lloyd
The first request will be slow due to the initial connection establishment and SSL handshake.
Going forward, JMeter will act according to its network properties, in particular:
httpclient4.time_to_live - TTL (in milliseconds) represents an absolute value. No matter what, the connection will not be re-used beyond its TTL.
httpclient.reset_state_on_thread_group_iteration - resets the HTTP state when starting a new Thread Group iteration, which means closing open connections and resetting the SSL state
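For illustration, overriding both properties in user.properties (or via -J on the command line) looks like this; the values shown are examples, not recommendations:
# user.properties
# Keep pooled connections alive for the whole run instead of the 60s default
httpclient4.time_to_live=600000
# Don't close connections / reset SSL state on each Thread Group iteration
httpclient.reset_state_on_thread_group_iteration=false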
Also, it seems that you're using a rather outdated JMeter version that is 5 years old. According to JMeter Best Practices you should always be using the latest version of JMeter, so consider upgrading to JMeter 5.3 (or whatever the latest stable version available from the JMeter Downloads page is), as you might be suffering from a JMeter bug which has already been resolved.
It might also be the case that you need to perform OS and JMeter tuning; see Concurrent, High Throughput Performance Testing with JMeter for example problems and solutions.
Update: we found that the situation below occurs when we encounter
"com.amazonaws.services.kinesis.clientlibrary.exceptions.ShutdownException: Can't update checkpoint - instance doesn't hold the lease for this shard"
https://github.com/awslabs/amazon-kinesis-client/issues/108
We use an S3 directory (and DynamoDB) to store checkpoints, but even when this occurs, blocks should not get stuck; they should continue to be evicted gracefully from memory. Obviously the Kinesis library race condition is a problem unto itself...
We ran into a problem (we also submitted the Spark JIRA ticket on the subject, linked below) where "block streams" (some, not all) keep persisting, eventually leading to an OOM.
The app is a standard Kinesis / Spark Streaming application written in Java (the Spark version is 2.0.2).
The run initially starts well and the automated Spark cleaner does its job nicely, recycling streaming jobs (verified by looking at the Storage tab in the admin UI).
Then, after some time, some blocks get stuck in memory, such as this block on one of the executor nodes:
input-0-1485362233945 1 ip-<>:34245 Memory Serialized 1442.5 KB
After more time, more blocks get stuck and are never freed up.
It is my understanding that the SparkContext cleaner will trigger removal of older blocks as well as trigger System.gc at a given interval, which is 30 minutes by default.
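For reference, that interval corresponds to the spark.cleaner.periodicGC.interval setting; a sketch of shortening it when the context is built (the 10-minute value and the app name are purely illustrative):
import org.apache.spark.SparkConf;

// Illustrative only: make the ContextCleaner's periodic System.gc run more
// often than the 30-minute default mentioned above.
SparkConf conf = new SparkConf()
        .setAppName("kinesis-streaming-app") // hypothetical app name
        .set("spark.cleaner.periodicGC.interval", "10min");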
Thanks for any feedback on this, as this issue prevents 100% uptime of the application.
In case it is of value: we use StorageLevel.MEMORY_AND_DISK_SER().
Spark Jira
You can try to perform the eviction manually:
get KinesisInputDStream.generatedRDDs (using reflection)
for each entry in generatedRDDs (it should contain BlockRDDs), perform something like DStream.clearMetadata.
I have already used a similar hack for mapWithState and for other Spark features that use memory for nothing.
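A very rough Java sketch of that reflection hack, assuming Spark 2.0.2 internals (clearMetadata is a private[streaming] method, so this may break across versions; the stream and batch time here are placeholders):
import java.lang.reflect.Method;
import org.apache.spark.streaming.Time;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.dstream.DStream;

// Rough sketch only: clearMetadata(time) asks the DStream to drop (and
// unpersist) the generated RDDs older than time minus the remember duration.
static void forceEviction(JavaDStream<byte[]> kinesisStream, long batchTimeMs) throws Exception {
    DStream<byte[]> dstream = kinesisStream.dstream();
    Method clearMetadata = DStream.class.getDeclaredMethod("clearMetadata", Time.class);
    clearMetadata.setAccessible(true);
    clearMetadata.invoke(dstream, new Time(batchTimeMs));
}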
I have been fighting with this issue for ages and I cannot for the life of me figure out what the problem is. Let me set the stage for the stack we are using:
Web-based Java 8 application
GWT
Hibernate 4.3.11
MySQL
MongoDB
Spring
Tomcat 8 (including Tomcat connection pooling instead of, for example, C3P0)
Hibernate Search / Lucene
Terracotta and EhCache
The problem is that every couple of days (sometimes every second day, sometimes once every 10 days, it varies), in the early hours of the morning, our application "locks up". To clarify, it does not crash; you just cannot log in or do anything for that matter. All background tasks - everything - just halt. If we attempt to log in when it is in this state, we can see in our log file that it is authenticating us as a valid user, but no response is ever sent, so the application just "spins".
The only pattern we have found to date regarding when these "lock-ups" occur is that it happens when our morning scheduled tasks or SAP imports are running. It is not always the same process that is running, though; sometimes the lock-up happens during one of our SAP imports and sometimes during internal scheduled task execution. All that these things have in common is that they run outside of business hours (between 1am and 6am) and that they are quite intensive processes.
We are using JavaMelody for monitoring, and what we see every time is that, starting at different times in this 1-6am window, the number of used JDBC connections just starts to spike (as per the attached image). Once that starts, it is just a matter of time before the lock-up occurs, and the only way to solve it is to bounce Tomcat, thereby restarting the application.
As far as I can tell, memory, CPU, etc. are all fine when the lock-up occurs; the only thing that looks like it has an issue is the constantly increasing number of used JDBC connections.
I have checked the code for our transaction management so many times to ensure that transactions are being closed off correctly (the transaction management code is quite old-fashioned: an explicit begin and commit in the try block, a rollback in the catch blocks, and an entity manager close in a finally block). It all seems correct to me, so I am really, really stumped. In addition to this, I have also recently explicitly configured the Hibernate connection release mode to after_transaction, but the issue still occurs.
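For reference, the pattern being checked is roughly the following (a simplified sketch; the entity manager factory and the entity are placeholders, not our actual code):
import javax.persistence.EntityManager;
import javax.persistence.EntityManagerFactory;

// Simplified sketch of the "old fashioned" transaction handling described above.
void save(EntityManagerFactory emf, Object entity) {
    EntityManager em = emf.createEntityManager();
    try {
        em.getTransaction().begin();
        em.persist(entity);
        em.getTransaction().commit();
    } catch (RuntimeException e) {
        if (em.getTransaction().isActive()) {
            em.getTransaction().rollback();
        }
        throw e;
    } finally {
        em.close(); // should release the JDBC connection back to the pool
    }
}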
The other weird thing is that we run several instances of the same application for different clients and this issue only happens regularly for one client. They are our client with by far the most data to be processed though and although all clients run these scheduled tasks, this big client is the only one with SAP imports. That is why I originally thought that the SAP imports were the issue, but it locked up just after 1am this morning and that was a couple hours before the imports even start running. In this case it locked up during an internal scheduled task executing.
Does anyone have any idea what could be causing this strange behavior? I have looked into everything I can think of but to no avail.
After some time and a lot of trial and error, my team and I managed to sort out this issue. It turns out that the spike in JDBC connections was not the cause of the lock-ups but was instead a consequence of them. Apache Terracotta was the culprit; it seems it was just becoming unresponsive. It might have been a resource allocation issue, but I don't think so, since this was happening on low-usage servers as well and they had more than enough resources available.
Fortunately we no longer actually needed Terracotta, so I removed it. As I said in the question, we were getting these lock-ups every couple of days - at least once per week, every week. Since removing it we have had no such lock-ups for 4 months and counting. So if anyone else experiences the same issue and you are using Terracotta, try dropping it and things might come right, as they did in my case.
As coladict said, you need to look at the "Opened jdbc connections" page in the JavaMelody monitoring report just before the server "locks up".
Sorry if you need to do that at 2 or 3 in the morning, but perhaps you can run a wget command automatically during the night.
We have a cron job that runs every hour on a backend module and creates tasks. The cron job runs queries on the Cloud SQL database, and the tasks make HTTP calls to other servers and also update the database. Normally they run great, even when thousands of tasks are created, but sometimes the job gets "stuck" and there is nothing in the logs that can shed any light on the situation.
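For context, the cron entry is a standard cron.xml configuration along these lines (a sketch; the URL and module name are hypothetical):
<cronentries>
<cron>
<url>/cron/create-tasks</url>
<schedule>every 1 hours</schedule>
<target>mybackendmodule</target>
</cron>
</cronentries>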
For example, yesterday we monitored the cron job while it created a few tens of tasks, and then it stopped, along with 8 of the tasks, which also got stuck in the queue. When it was obvious that nothing was happening we ran the process a few more times, and each time it completed successfully.
After a day the original task was killed with a DeadlineExceededException, and then the 8 other tasks, which were apparently running in the same instance, were killed with the following message:
A problem was encountered with the process that handled this request, causing it to exit. This is likely to cause a new process to be used for the next request to your application. If you see this message frequently, you may be throwing exceptions during the initialization of your application. (Error code 104)
Until the processes were killed we saw absolutely no record of them in the logs, and now that we see them there are no log records before the time of the DeadlineExceededException, so we have no idea at what point they got stuck.
We suspected that there is some lock in the database, but we see in the following link that there is a 10 minute limit for queries, so that would cause the process to fail much sooner than one day: https://cloud.google.com/appengine/docs/java/cloud-sql/#Java_Size_and_access_limits
Our module's class and scaling configuration is:
<instance-class>B4</instance-class>
<basic-scaling>
<max-instances>11</max-instances>
<idle-timeout>10m</idle-timeout>
</basic-scaling>
The configuration of the queue is:
<rate>5/s</rate>
<max-concurrent-requests>100</max-concurrent-requests>
<mode>push</mode>
<retry-parameters>
<task-retry-limit>5</task-retry-limit>
<min-backoff-seconds>10</min-backoff-seconds>
<max-backoff-seconds>200</max-backoff-seconds>
</retry-parameters>
I uploaded some images of the trace data for the cron job:
http://imgur.com/a/H5wGG.
This includes the trace summary, and the beginning/ending of the timeline.
There is no trace data for the 8 terminated tasks.
What could be the cause of this and how can we investigate it further?
We eventually managed to solve the problem with the following steps:
We split the module into two - one module to run the cron job and one module to handle the generated tasks. This allowed us to see that the problem was with handling the tasks, as that was the only module that kept getting stuck.
We limited the number of concurrent tasks to 2, which seems to be the maximum number of tasks that can be processed simultaneously without the system getting stuck.
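In queue.xml terms, that second step amounts to something like the following (based on the queue configuration shown above; the queue name and target module are hypothetical):
<queue>
<name>taskQueue</name>
<target>task-handler-module</target>
<rate>5/s</rate>
<max-concurrent-requests>2</max-concurrent-requests>
<mode>push</mode>
<retry-parameters>
<task-retry-limit>5</task-retry-limit>
<min-backoff-seconds>10</min-backoff-seconds>
<max-backoff-seconds>200</max-backoff-seconds>
</retry-parameters>
</queue>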
Our app relies extensively on backend instances. There is some logic that has to run every few seconds, and the execution of this code cannot be driven solely by requests arriving at the frontend, because it needs to run regardless.
The only alternative we have considered is task queues, but as far as we know, task queues only guarantee that tasks will be executed within 24 hours. I have not found a reference to back this up, though.
Our app uses a fixed number of resident B1 backend instances. We assume that each instance stays alive 24/7 after the backend version is deployed and started.
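For context, the resident backends are declared along these lines in backends.xml (a sketch; the backend name and instance count are illustrative):
<backends>
<backend name="worker">
<class>B1</class>
<instances>2</instances>
</backend>
</backends>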
Is the assumption that instances stay alive 24/7 valid? If not, can our application be notified every time a backend instance is about to be shut down?
What is the SLA on the availability of a backend instance?
Are backend instances restarted automatically after they are terminated? E.g. is an instance automatically restarted after it runs out of memory?
How quickly will instances be brought up again if they ever are terminated?
We create a fixed-size thread pool on each backend instance (a rough sketch follows below these questions). Is there a maximum size for the thread pools that we can have on a backend instance?
Are there any other conditions under which a backend instance might die?
Thanks!
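Regarding the thread pool question above, the pool is built roughly like this (an illustrative sketch, not our exact code; the size of 10 is just an example):
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import com.google.appengine.api.ThreadManager;

// Illustrative only: threads on a backend instance that must outlive a single
// request have to come from ThreadManager's background thread factory.
ExecutorService pool = Executors.newFixedThreadPool(
        10, ThreadManager.backgroundThreadFactory());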
UPDATES
Turns out a couple questions can be answered by reading the docs.
App Engine attempts to keep backends running indefinitely. However, at this time there is no guaranteed uptime for backends.
So what is the SLA for uptime? I am looking for a statement like: "The guaranteed uptime for backends is 99.99%"
The App Engine team will provide more guidance on expected backend uptime as statistics become available.
When will these statistics be available?
It's also important to recognize that the shutdown hook is not always able to run before a backend terminates. In rare cases, an outage can occur that prevents App Engine from providing 30 seconds of shutdown time.
When App Engine needs to turn down a backend instance, existing requests are given 30 seconds to complete, and new requests immediately return 404.
The following code sample demonstrates a basic shutdown hook:
import com.google.appengine.api.LifecycleManager;
import com.google.appengine.api.LifecycleManager.ShutdownHook;

LifecycleManager.getInstance().setShutdownHook(new ShutdownHook() {
    public void shutdown() {
        // Ask in-flight requests on this instance to stop so it can terminate
        // within the shutdown window.
        LifecycleManager.getInstance().interruptAllRequests();
    }
});
I am running only one instance of a resident (non-dynamic) backend, and my experience is that it is restarted at least once a day.
Your application must be able to store its state and resume after a restart.