One of our applications just suffered from some nasty deadlocks. I had quite a hard time recreating the problem because the deadlock (or the stack trace) did not show up immediately in my Java application logs.
To my surprise, the MarkLogic Java API retries failing requests (e.g. because of a deadlock). This might make sense if your request is not a multi-statement request, but otherwise I'm not sure it does.
So let's stick with this deadlock problem. I created a simple code snippet in which I create a deadlock on purpose. The snippet creates a document test.xml and then tries to read and write it from two different transactions, each on a new thread.
public static void main(String[] args) throws Exception {
    final Logger root = (Logger) LoggerFactory.getLogger(Logger.ROOT_LOGGER_NAME);
    final Logger ok = (Logger) LoggerFactory.getLogger(OkHttpServices.class);
    root.setLevel(Level.ALL);
    ok.setLevel(Level.ALL);

    final DatabaseClient client = DatabaseClientFactory.newClient("localhost", 8000,
            new DatabaseClientFactory.DigestAuthContext("username", "password"));

    // create the document both transactions will fight over
    final StringHandle handle = new StringHandle("<doc><name>Test</name></doc>")
            .withFormat(Format.XML);
    client.newTextDocumentManager().write("test.xml", handle);

    root.info("t1: opening");
    final Transaction t1 = client.openTransaction();
    root.info("t1: reading");
    client.newXMLDocumentManager()
            .read("test.xml", new StringHandle(), t1);

    root.info("t2: opening");
    final Transaction t2 = client.openTransaction();
    root.info("t2: reading");
    client.newXMLDocumentManager()
            .read("test.xml", new StringHandle(), t2);

    // each transaction now holds a read lock on test.xml; the cross writes below deadlock
    new Thread(() -> {
        root.info("t1: writing");
        client.newXMLDocumentManager().write("test.xml",
                new StringHandle("<doc><t>t1</t></doc>").withFormat(Format.XML), t1);
        t1.commit();
    }).start();

    new Thread(() -> {
        root.info("t2: writing");
        client.newXMLDocumentManager().write("test.xml",
                new StringHandle("<doc><t>t2</t></doc>").withFormat(Format.XML), t2);
        t2.commit();
    }).start();

    TimeUnit.MINUTES.sleep(5);
    client.release();
}
This code will produce the following log:
14:12:27.437 [main] DEBUG c.m.client.impl.OkHttpServices - Connecting to localhost at 8000 as admin
14:12:27.570 [main] DEBUG c.m.client.impl.OkHttpServices - Sending test.xml document in transaction null
14:12:27.608 [main] INFO ROOT - t1: opening
14:12:27.609 [main] DEBUG c.m.client.impl.OkHttpServices - Opening transaction
14:12:27.962 [main] INFO ROOT - t1: reading
14:12:27.963 [main] DEBUG c.m.client.impl.OkHttpServices - Getting test.xml in transaction 5298588351036278526
14:12:28.283 [main] INFO ROOT - t2: opening
14:12:28.283 [main] DEBUG c.m.client.impl.OkHttpServices - Opening transaction
14:12:28.286 [main] INFO ROOT - t2: reading
14:12:28.286 [main] DEBUG c.m.client.impl.OkHttpServices - Getting test.xml in transaction 8819382734425123844
14:12:28.289 [Thread-1] INFO ROOT - t1: writing
14:12:28.289 [Thread-1] DEBUG c.m.client.impl.OkHttpServices - Sending test.xml document in transaction 5298588351036278526
14:12:28.289 [Thread-2] INFO ROOT - t2: writing
14:12:28.290 [Thread-2] DEBUG c.m.client.impl.OkHttpServices - Sending test.xml document in transaction 8819382734425123844
Neither t1 nor t2 will be committed. The MarkLogic logs confirm that there actually is a deadlock:
==> /var/opt/MarkLogic/Logs/8000_AccessLog.txt <==
127.0.0.1 - admin [24/Nov/2018:14:12:30 +0000] "PUT /v1/documents?txid=5298588351036278526&category=content&uri=test.xml HTTP/1.1" 503 1034 - "okhttp/3.9.0"
==> /var/opt/MarkLogic/Logs/ErrorLog.txt <==
2018-11-24 14:12:30.719 Info: Deadlock detected locking Documents test.xml
This would not be a problem if one of the requests failed and threw an exception, but that is not the case. The MarkLogic Java API retries every failing request for up to 120 seconds, and one of the updates times out after roughly those 120 seconds:
Exception in thread "Thread-1" com.marklogic.client.FailedRequestException: Service unavailable and maximum retry period elapsed: 121 seconds after 65 retries
at com.marklogic.client.impl.OkHttpServices.putPostDocumentImpl(OkHttpServices.java:1422)
at com.marklogic.client.impl.OkHttpServices.putDocument(OkHttpServices.java:1256)
at com.marklogic.client.impl.DocumentManagerImpl.write(DocumentManagerImpl.java:920)
at com.marklogic.client.impl.DocumentManagerImpl.write(DocumentManagerImpl.java:758)
at com.marklogic.client.impl.DocumentManagerImpl.write(DocumentManagerImpl.java:717)
at Scratch.lambda$main$0(scratch.java:40)
at java.lang.Thread.run(Thread.java:748)
What are possible ways to overcome this problem? One way might be to set a maximum time to live for a transaction (say 5 seconds), but this feels hacky and unreliable. Any other ideas? Are there any other settings I should check out?
I'm on MarkLogic 9.0-7.2 and using marklogic-client-api:4.0.3.
Edit: One way to resolve the deadlock is to synchronize the calling function; this is actually how I solved it in my case (see comments). But I think the underlying problem still exists: a deadlock in a multi-statement transaction should not be hidden behind a 120-second timeout. I'd rather have an immediately failing request than a 120-second lock on one of my documents plus 64 failing retries per thread.
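For reference, a minimal sketch of that workaround, assuming every write-and-commit goes through one shared monitor (the helper class and lock object are mine, not part of the API):

import com.marklogic.client.DatabaseClient;
import com.marklogic.client.Transaction;
import com.marklogic.client.io.Format;
import com.marklogic.client.io.StringHandle;

// Hypothetical helper: serialize write-and-commit within this JVM so two of
// our own threads can never hold conflicting document locks at the same time.
final class SerializedWriter {
    private static final Object TXN_MONITOR = new Object();

    static void writeAndCommit(DatabaseClient client, String uri, String xml, Transaction txn) {
        synchronized (TXN_MONITOR) {
            client.newXMLDocumentManager()
                  .write(uri, new StringHandle(xml).withFormat(Format.XML), txn);
            txn.commit();
        }
    }
}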
Deadlocks are usually resolvable by retrying. Internally, the server does an inner retry loop because deadlocks are usually transient and incidental, lasting a very short time. In your case you have constructed a scenario that will never succeed with any timeout that's equal for both threads.
Deadlocks can be avoided at the application layer by avoiding multi-statement transactions when using the REST API (which is what the Java API uses).
Multi-statement transactions over REST cannot be implemented 100% safely due to the client's responsibility to manage the transaction ID and the server's inability to detect client-side errors or client-side identity. Very subtle problems can and do occur unless you are aggressively proactive with respect to error handling and multithreading. If you 'push' the logic to the server (XQuery or JavaScript), the server is able to manage things much better.
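For example, a read-modify-write like the one above can be pushed into a single server-side request through the Java API's eval support, so the server handles the locking inside one transaction. A minimal sketch (the helper's name and the XQuery are illustrative only):

import com.marklogic.client.DatabaseClient;

// Run the whole read-modify-write as one server-side request; MarkLogic
// acquires and releases the document locks within that single transaction.
static String touchDocument(DatabaseClient client, String uri, String marker) {
    String xquery =
        "declare variable $uri external; " +
        "declare variable $marker external; " +
        "let $old := fn:doc($uri) " +                                           // read the current version
        "return (xdmp:document-insert($uri, <doc><t>{$marker}</t></doc>), " +   // overwrite it
        "        fn:string($old))";                                             // return what was there before
    return client.newServerEval()
                 .xquery(xquery)
                 .addVariable("uri", uri)
                 .addVariable("marker", marker)
                 .evalAs(String.class);
}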
As for whether it's 'good' or not for the Java API to implement retries for this case, that's debatable either way. (The compromise for a seemingly easy-to-use interface is that many things that would otherwise be options are decided for you as a convention. There's generally no one-size-fits-all answer. In this case I presume the thought was that a deadlock is more likely caused by independent code/logic by 'accident' than by identical code running in tandem; a retry in that case would be a good choice. In your example it's not, but then an earlier error would still fail predictably until you changed your code to 'not do that'.)
If it doesn't already exist, a feature request for configurable timeout and retry behaviour does seem reasonable. I would recommend, however, trying to avoid any REST calls that result in an open transaction; that is inherently problematic, particularly if you don't notice the problem up front (then it's more likely to bite you in production). Unlike JDBC, which keeps a connection open so that the server can detect client disconnects, HTTP and the ML REST API do not, which leads to a different programming model than traditional database coding in Java.
Related
I'm using Alluxio 2.0 to accelerate the compute layer's performance.
When no query is running, I found verbose netty output being appended to $Alluxio_home/logs/master.log.
2019-11-25 10:26:32,141 DEBUG NettyServerHandler - {} {} HEADERS: streamId={} headers={} streamDependency={} weight={} exclusive={} padding={} endStream={}
2019-11-25 10:26:32,141 DEBUG NettyServerHandler - {} {} DATA: streamId={} padding={} endStream={} length={} bytes={}
Dozens of the above messages are appended to master.log each second.
Is this normal behavior? If so, what is it used for? Heartbeats among components?
I found the root cause; I'm leaving this answer here for anyone who runs into the same problem.
Alluxio uses gRPC as its RPC framework, and gRPC is in turn based on netty, so the verbose output actually comes from netty; check this thread for details.
To disable the verbose output from Alluxio side, add below statement to $Alluxio_home/conf/alluxio-site.properties:
log4j.logger.io.grpc.netty.NettyServerHandler=OFF
Note that modifying log4j.rootLogger in the Alluxio properties cannot disable this verbose output.
Below is a description of a problem we faced in production. Please note that I could not reproduce the issue in a test or local environment and therefore cannot provide you with test code.
We have a hazelcast cluster with two members M1, M2 and three clients C1,C2,C3. Hazelcast version is 3.9.
Clients use the IMap.tryLock() method with a timeout of 10 seconds. After getting the lock, critical and long-running operations are performed, and finally the lock is released using the IMap.unlock() method.
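For context, the access pattern is essentially the standard try/finally idiom; a simplified sketch (the map and key names are placeholders, not our real code):

import java.util.concurrent.TimeUnit;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

// Typical guarded-section idiom used by the clients (names are placeholders).
void processWithLock(HazelcastInstance hz, String key) throws InterruptedException {
    IMap<String, Object> map = hz.getMap("work-items");
    if (map.tryLock(key, 10, TimeUnit.SECONDS)) {   // wait up to 10s for the lock
        try {
            // critical, long-running work on the entry...
        } finally {
            map.unlock(key);                        // always release, even on failure
        }
    } else {
        // could not acquire the lock within 10s
    }
}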
The problem that occurred in production is as follows:
At some time instant t, we first saw a heartbeat failure to M2 at client C2. Afterwards there were errors in fetching the partition table caused by com.hazelcast.spi.exception.TargetDisconnectedException:
[hz.client_0.internal-2 ] WARN [] HeartbeatManager - hz.client_0 [mygroup] [3.9] HeartbeatManager failed to connection: .....
[hz.client_0.internal-3 ] WARN [] ClientPartitionService - hz.client_0 [mygroup] [3.9] Error while fetching cluster partition table!
java.util.concurrent.ExecutionException: com.hazelcast.spi.exception.TargetDisconnectedException: Heartbeat timed out to owner connection ClientConnection{alive=true, connectionId=1, ......
Around 250 ms after the initial heartbeat failure, the client gets disconnected and then reconnects within 20 ms.
[hz.client_0.cluster- ] INFO [] LifecycleService - hz.client_0 [mygroup] [3.9] HazelcastClient 3.9 (20171023 - b29f549) is CLIENT_DISCONNETED
[hz.client_0.cluster- ] INFO [] LifecycleService - hz.client_0 [mygroup] [3.9] HazelcastClient 3.9 (20171023 - b29f549) is CLIENT_CONNECTED
The problem we are having is that, for some keys previously locked by C2, C1 and C3 cannot acquire the lock even though it appears to have been released by C2. C2 can still get the lock, but this puts unacceptable delays
on the application. All clients should be able to acquire the lock once it has been released.
We were notified of the problem after receiving complaints, and then restarted the client application C2.
As documented in http://docs.hazelcast.org/docs/latest-development/manual/html/Distributed_Data_Structures/Lock.html, locks acquired by the restarted member (C2 in my case) seem to be removed after the restart.
Currently the issue seems to go away, but we are not sure if it will recur.
Do you have any suggestions about the probable cause and more importantly do you have any recommendations?
Would enabling redo-operation in client help for this problem case?
As I tried to explain, the client seems to recover from the problem, but the keys remain locked in the cluster, and this is fatal to my application.
Thanks
It looks like the client lost ownership of the lock because of its disconnection from the cluster. You can use the IMap#forceUnlock API in cases like the one you faced. It releases the lock regardless of the lock owner; it always unlocks successfully, never blocks, and returns immediately.
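A minimal sketch of that recovery path (the map name and the surrounding method are placeholders); keep in mind forceUnlock ignores ownership, so use it only when you are sure the original owner is gone:

import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

// Releases a stuck lock regardless of which client currently owns it.
void recoverStuckLock(HazelcastInstance hz, String key) {
    IMap<String, Object> map = hz.getMap("work-items");
    if (map.isLocked(key)) {        // check first, purely for logging/diagnostics
        map.forceUnlock(key);       // never blocks, returns immediately
    }
}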
I'm using Elastic Cloud (formerly Found) with Shield and the transport Java client. The app communicating with ES runs on Heroku. I'm running a stress test on a staging environment with one node:
{
  "cluster_name": ...,
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 1,
  "number_of_data_nodes": 1,
  "active_primary_shards": 19,
  "active_shards": 19,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 7,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0
}
At the beginning everything works perfectly, but after some time (3-4 minutes) I begin to get errors. I've set the log level to trace, and these are the errors I've been getting (I've replaced everything that is irrelevant with ...).
org.elasticsearch.client.transport.NoNodeAvailableException: None of the configured nodes were available: [[...][...][...][inet[...]]{logical_availability_zone=..., availability_zone=..., max_local_storage_nodes=1, region=..., master=true}]
at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:242)
at org.elasticsearch.action.TransportActionNodeProxy$1.handleException(TransportActionNodeProxy.java:78)
at org.elasticsearch.transport.TransportService$3.run(TransportService.java:290)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.transport.SendRequestTransportException: [...][inet[...]][indices:data/read/search]
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:286)
at org.elasticsearch.shield.transport.ShieldClientTransportService.sendRequest(ShieldClientTransportService.java:41)
at org.elasticsearch.action.TransportActionNodeProxy.execute(TransportActionNodeProxy.java:57)
at org.elasticsearch.client.transport.support.InternalTransportClient$1.doWithNode(InternalTransportClient.java:109)
at org.elasticsearch.client.transport.TransportClientNodesService.execute(TransportClientNodesService.java:205)
at org.elasticsearch.client.transport.support.InternalTransportClient.execute(InternalTransportClient.java:106)
at org.elasticsearch.client.support.AbstractClient.search(AbstractClient.java:334)
at org.elasticsearch.client.transport.TransportClient.search(TransportClient.java:416)
at org.elasticsearch.action.search.SearchRequestBuilder.doExecute(SearchRequestBuilder.java:1122)
at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:91)
at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:65)
...
Caused by: org.elasticsearch.transport.NodeNotConnectedException: [...][inet[...]] Node not connected
at org.elasticsearch.transport.netty.NettyTransport.nodeChannel(NettyTransport.java:936)
at org.elasticsearch.transport.netty.NettyTransport.sendRequest(NettyTransport.java:629)
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:276)
...
These are my properties
settings = ImmutableSettings.settingsBuilder()
        .put("client.transport.nodes_sampler_interval", "5s") //Tried it with 30s, same outcome
        .put("client.transport.ping_timeout", "30s")
        .put("cluster.name", clusterName)
        .put("action.bulk.compress", false)
        .put("shield.transport.ssl", true)
        .put("request.headers.X-Found-Cluster", clusterName)
        .put("shield.user", user + ":" + password)
        .put("transport.ping_schedule", "1s") //Tried with 5s, same outcome
        .build();
I've also set the following for every query I make:
max_query_response_size=100000
timeout_seconds=30
I'm using ElasticSearch 1.7.2 and Shield 1.3.2 with corresponding (same version) clients, Java 1.8.0_65 on my machine - Java 1.8.0_40 on the node.
I was getting the same errors without a stress test, but the errors happened very randomly, so I wanted to reproduce them; that's why I'm running this on a single node.
I spotted another error in my logs
2016-03-07 23:35:52,177 DEBUG [elasticsearch[Vermin][transport_client_worker][T#7]{New I/O worker #16}] ssl.SslHandler (NettyInternalESLogger.java:debug(63)) - Swallowing an exception raised while writing non-app data
java.nio.channels.ClosedChannelException
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.cleanUpWriteBuffer(AbstractNioWorker.java:433)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.close(AbstractNioWorker.java:373)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:93)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
Hot threads
0.0% (111.6micros out of 500ms) cpu usage by thread 'elasticsearch[...][transport_client_timer][T#1]{Hashed wheel timer #1}'
10/10 snapshots sharing following 5 elements
java.lang.Thread.sleep(Native Method)
org.elasticsearch.common.netty.util.HashedWheelTimer$Worker.waitForNextTick(HashedWheelTimer.java:445)
org.elasticsearch.common.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:364)
org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
java.lang.Thread.run(Thread.java:745)
After reading http://blog.trifork.com/2015/04/08/dealing-with-nodenotavailableexceptions-in-elasticsearch/ I came to understand a little better how the whole communication works. I haven't tested this yet, but I believe the problem lies there. The thing is, though, even if I confirm that the problem is closed query connections, how do I handle it? Keep the config as is and just reconnect? Disable keepAlive? If so, should I be worrying about something else?
Citing this link:
https://discuss.elastic.co/t/nonodeavailableexception-with-java-transport-client/37702 by Konrad Beiske
your application could be resolving the ip address at boot time. The ELB can change ip's at any point in time. For the best reliability your application should add all ip's of the ELB to the client and periodically check the DNS service for changes.
The connection timeout of our ELB's are 5 minutes.
The following should help you fix it:
Creating a new TransportClient for every request is not ideal as it will imply a new connection handshake for every request and this will hurt your response time. You could have a pool of TransportClients if you prefer, but it will most likely be an unnecessary overhead as the client is thread safe.
My suggestion is that you create a small singleton service that periodically checks for changes to the DNS service and adds any new ip's to your existing transport client. In theory it could be as naive as just adding all ip's discovered every time it checks as the transport client will discard duplicate addresses and also purges old addresses no longer reachable.
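A rough sketch of such a refresher, assuming ES 1.7's TransportClient API (the class name, schedule, and error handling here are mine, not from the linked post):

import java.net.InetAddress;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

// Periodically re-resolves the ELB hostname and feeds every address to the
// existing client; the client discards duplicates and purges dead nodes itself.
final class ElbDnsRefresher {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    void start(final TransportClient client, final String elbHost, final int port) {
        scheduler.scheduleAtFixedRate(new Runnable() {
            @Override
            public void run() {
                try {
                    for (InetAddress address : InetAddress.getAllByName(elbHost)) {
                        client.addTransportAddress(new InetSocketTransportAddress(address.getHostAddress(), port));
                    }
                } catch (Exception e) {
                    // keep going; a transient DNS failure should not kill the refresher
                }
            }
        }, 0, 60, TimeUnit.SECONDS);
    }
}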
I found out that when I connect the debugger to the application and start debugging,
the connection to the Terracotta server is lost (?) and the following messages appear in the Terracotta server logs:
2012-03-30 13:45:06,758 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Server - 127.0.0.1:55112 might be in Long GC. GC count since last ping reply : 1
2012-03-30 13:45:27,761 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Server - 127.0.0.1:55112 might be in Long GC. GC count since last ping reply : 1
2012-03-30 13:45:31,761 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Server - 127.0.0.1:55112 might be in Long GC. GC count since last ping reply : 2
...
2012-03-30 13:46:37,768 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] ERROR com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Server - 127.0.0.1:55112 might be in Long GC. GC count since last ping reply : 10. But its too long. No more retries
2012-03-30 13:46:38,768 [HealthChecker] INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Server - 127.0.0.1:55112 is DEAD
2012-03-30 13:46:38,768 [HealthChecker] ERROR com.tc.net.protocol.transport.ConnectionHealthCheckerImpl: DSO Server - Declared connection dead ConnectionID(1.0b1994ac80f14b7191080bdc3f38582a) idle time 45317ms
2012-03-30 13:46:38,768 [L2_L1:TCWorkerComm # 0_R] WARN com.tc.net.protocol.transport.ServerMessageTransport - ConnectionID(1.0b1994ac80f14b7191080bdc3f38582a): CLOSE EVENT : com.tc.net.core.TCConnectionJDK14#5158277: connected: false, closed: true local=127.0.0.1:9510 remote=127.0.0.1:55112 connect=[Fri Mar 30 13:34:22 BST 2012] idle=2001ms [207584 read, 229735 write]. STATUS : DISCONNECTED
...
2012-03-30 13:46:38,799 [L2_L1:TCWorkerComm # 0_R] INFO com.tc.objectserver.persistence.sleepycat.SleepycatPersistor - Deleted client state for ChannelID=[1]
2012-03-30 13:46:38,801 [WorkerThread(channel_life_cycle_stage, 0)] INFO com.tc.objectserver.handler.ChannelLifeCycleHandler - : Received transport disconnect. Shutting down client ClientID[1]
2012-03-30 13:46:38,801 [WorkerThread(channel_life_cycle_stage, 0)] INFO com.tc.objectserver.persistence.impl.TransactionStoreImpl - shutdownClient() : Removing txns from DB : 0
After this happens, any cache operation, such as getWithLoader, simply hangs until the Terracotta server is restarted.
Question: how can this be fixed or reconfigured? I assume it can also happen in production (and actually sometimes does) if the application hangs or stalls for any reason.
This is just to get you started.
TC connections between server and client are considered dead when the applicable HealthCheck fails. The default values for the HealthCheck assume a very stable and performant network. I recommend you familiarize yourself with the details and the calculations on
http://www.terracotta.org/documentation/3.5.2/terracotta-server-array/high-availability#85916
So typically you begin with
a) making sure your network doesn't hiccup occasionally
b) setting the TC HealthCheck values a bit higher
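For (b), the health check is tuned through tc.properties (or the tc-properties block in tc-config.xml). A sketch using the L2-to-L1 property names from the 3.5 HA documentation linked above; the values are only illustrative, so verify the names and defaults against your version:

# Give clients more headroom before the server declares them dead
l2.healthcheck.l1.ping.idletime = 10000
l2.healthcheck.l1.ping.interval = 2000
l2.healthcheck.l1.ping.probes = 5
l2.healthcheck.l1.socketConnectCount = 10
l2.healthcheck.l1.socketConnectTimeout = 5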
If the problem persists I'd recommend posting directly on the TC forums (they'll help you even if you only use the open-source edition; it may take a few days to get a reply, though).
This (see chart below) has been happening since 3/7.
Sure, this can be because of instances loading and unloading.
But do you know of other reasons for GAE behaving like this?
It's not a high-replication instance, and during testing we had up to 5 F2 instances running with only our test client's calls. There are basically DB calls, image processing, and memcache usage.
There are already two issues that look similar:
http://code.google.com/p/googleappengine/issues/detail?id=4180&sort=priority&colspec=ID%20Type%20Component%20Status%20Stars%20Summary%20Language%20Priority%20Owner%20Log
http://code.google.com/p/googleappengine/issues/detail?id=6309&sort=priority&colspec=ID%20Type%20Component%20Status%20Stars%20Summary%20Language%20Priority%20Owner%20Log
And there's an entry in the forum:
https://groups.google.com/forum/#!topic/google-appengine/js5CeRWLQZ0/discussion
Logging (which Shay requested) shows that the PersistenceManager seems to take 6 seconds to initialize:
2012-03-11 15:32:47.543 /api/yyy 200 16811ms 0kb xxx/1.1 CFNetwork/548.1.4 Darwin/11.0.0
78.53.230.114 - - [11/Mar/2012:07:32:47 -0700] "POST /api/yyy HTTP/1.1" 200 94 - "zzz/1.1 CFNetwork/548.1.4 Darwin/11.0.0" "zzz.appspot.com" ms=16812 cpu_ms=6040 api_cpu_ms=82 cpm_usd=0.167820 pending_ms=5765 instance=00c71b117ca3858c47bdc41d5b30a732dd76eaaf
I 2012-03-11 15:32:37.196
www.server.xxxServlet getvvv: 1
I 2012-03-11 15:32:37.202
www.server.xxxServlet getvvv: hash
I 2012-03-11 15:32:37.207
www.server.xxxServlet getvvv: get PM (PersistenceManager pm = PMF.get().getPersistenceManager();)
I 2012-03-11 15:32:43.606
www.server.xxxServlet getvvv: get data
I 2012-03-11 15:32:47.355
www.server.xxxServlet getvvv: got data
I 2012-03-11 15:32:47.388
www.server.xxxServlet getvvv: done
and PMF is implemented as:
public final class PMF {
    private static final PersistenceManagerFactory pmfInstance =
            JDOHelper.getPersistenceManagerFactory("transactions-optional");

    private PMF() {}

    public static PersistenceManagerFactory get() {
        return pmfInstance;
    }
}
The area where you don't see any stats is usually your own code running; the stats start when the request is entered. I don't think this has anything to do with instance loading.
I suggest adding logs to see the flow of your handler code.
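For example, a couple of timestamps around the persistence-manager call would show whether the six seconds go into PMF.get().getPersistenceManager() or into the query itself. A rough sketch (the method and logger names are placeholders):

import java.util.logging.Logger;
import javax.jdo.PersistenceManager;

// Rough sketch: time each phase of the handler so the slow step shows up in the logs.
void handleGetVvv() {
    Logger log = Logger.getLogger("xxxServlet");           // placeholder logger name
    long t0 = System.currentTimeMillis();
    PersistenceManager pm = PMF.get().getPersistenceManager();
    log.info("getPersistenceManager took " + (System.currentTimeMillis() - t0) + " ms");
    try {
        long t1 = System.currentTimeMillis();
        // ... run the query / fetch the data here ...
        log.info("data fetch took " + (System.currentTimeMillis() - t1) + " ms");
    } finally {
        pm.close();
    }
}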
Ikai Lan replied in the forum google-appengine.googlegroups.com:
"Given that the SLA applies to HRD and not master/slave applications, you are definitely going to get a better quality of service migrating to HRD. In fact, I strongly advise that you do so."
"With master/slave applications, we do what we can to address the short term symptoms as well as the underlying system issues without impacting serving,..."
"We may be announcing a maintenance in the very near future that will impact the serving of master/slave applications."
Link:
https://groups.google.com/d/msg/google-appengine/js5CeRWLQZ0/4mFqPWJQjSoJ
For me this means there currently are issues with GAE master/slave and that a maintenance affecting master/slave serving may be announced soon.