I found out that when I attach a debugger to the application and start debugging,
the connection to the Terracotta server is lost (?) and the following messages appear in the Terracotta server logs:
2012-03-30 13:45:06,758 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Server - 127.0.0.1:55112 might be in Long GC. GC count since last ping reply : 1
2012-03-30 13:45:27,761 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Server - 127.0.0.1:55112 might be in Long GC. GC count since last ping reply : 1
2012-03-30 13:45:31,761 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Server - 127.0.0.1:55112 might be in Long GC. GC count since last ping reply : 2
...
2012-03-30 13:46:37,768 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] ERROR com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Server - 127.0.0.1:55112 might be in Long GC. GC count since last ping reply : 10. But its too long. No more retries
2012-03-30 13:46:38,768 [HealthChecker] INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Server - 127.0.0.1:55112 is DEAD
2012-03-30 13:46:38,768 [HealthChecker] ERROR com.tc.net.protocol.transport.ConnectionHealthCheckerImpl: DSO Server - Declared connection dead ConnectionID(1.0b1994ac80f14b7191080bdc3f38582a) idle time 45317ms
2012-03-30 13:46:38,768 [L2_L1:TCWorkerComm # 0_R] WARN com.tc.net.protocol.transport.ServerMessageTransport - ConnectionID(1.0b1994ac80f14b7191080bdc3f38582a): CLOSE EVENT : com.tc.net.core.TCConnectionJDK14#5158277: connected: false, closed: true local=127.0.0.1:9510 remote=127.0.0.1:55112 connect=[Fri Mar 30 13:34:22 BST 2012] idle=2001ms [207584 read, 229735 write]. STATUS : DISCONNECTED
...
2012-03-30 13:46:38,799 [L2_L1:TCWorkerComm # 0_R] INFO com.tc.objectserver.persistence.sleepycat.SleepycatPersistor - Deleted client state for ChannelID=[1]
2012-03-30 13:46:38,801 [WorkerThread(channel_life_cycle_stage, 0)] INFO com.tc.objectserver.handler.ChannelLifeCycleHandler - : Received transport disconnect. Shutting down client ClientID[1]
2012-03-30 13:46:38,801 [WorkerThread(channel_life_cycle_stage, 0)] INFO com.tc.objectserver.persistence.impl.TransactionStoreImpl - shutdownClient() : Removing txns from DB : 0
After this happens, any cache operation, such as getWithLoader, simply never returns until the Terracotta server is restarted.
Question: how can this be fixed or reconfigured? I assume it can also happen in production (and in fact it sometimes does) if the application hangs or stalls for any reason.
This is just to get you started.
TC connections between server and client are considered dead when the applicable HealthCheck fails. The default values for the HealthCheck assume a very stable and performant network. I recommend you familiarize yourself with the details and the calculations on
http://www.terracotta.org/documentation/3.5.2/terracotta-server-array/high-availability#85916
So typically you begin with
a) making sure your network doesn't hiccup occasionally
b) setting the TC HealthCheck values a bit higher
If the problem persists I'd recommend posting directly on the TC forums (they'll help you even if you only use the open-source edition, though it may take a few days to get a reply).
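To get a feel for the calculation described on that page, here is a rough sketch in Java. The formula and the sample values are written down from memory of the linked HealthChecker documentation (server-to-client settings such as l2.healthcheck.l1.ping.idletime), so treat both as assumptions and check them against the page before relying on the numbers. The class and method names are made up for illustration.

```java
class HealthCheckWindow {
    /**
     * Approximate time in milliseconds before an unresponsive peer is declared dead.
     * Assumed formula (verify against the Terracotta documentation):
     * idletime + socketConnectCount * (interval * probes + socketConnectTimeout * interval)
     */
    static long maxDetectionMillis(long pingIdleTimeMs, long pingIntervalMs, int pingProbes,
                                   int socketConnectTimeout, int socketConnectCount) {
        long perCycleMs = pingIntervalMs * pingProbes + socketConnectTimeout * pingIntervalMs;
        return pingIdleTimeMs + socketConnectCount * perCycleMs;
    }

    public static void main(String[] args) {
        // Assumed default values for the server-to-client check.
        System.out.println(maxDetectionMillis(5000, 1000, 3, 5, 10) + " ms");
        // Same settings with a doubled idle time and ping interval.
        System.out.println(maxDetectionMillis(10000, 2000, 3, 5, 10) + " ms");
    }
}
```

If the formula holds, a client that is paused at a breakpoint for longer than this window gets disconnected, so for debugging sessions the window would need to exceed the longest pause you expect.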
1. Environment
OS version and kernel: CentOS Linux release 7.9.2009 3.10.0-1160.el7.x86_64
nginx version: nginx-1.14.2 (Community nginx)
upstream server(tomcat) version: apache tomcat 8.0.53
JDK version: Oracle jdk1.8.0_144
nginx keepalive: enabled (for both client and upstream server connections)
HTTP protocol: HTTP/1.1
2. nginx access log format and Tomcat access log pattern:
(1) nginx access log format
'$remote_addr - $remote_user [$time_iso8601] '
' "$request" $status $body_bytes_sent $bytes_sent '
' $request_trace_id '
' ["$upstream_addr" "$upstream_status" "$upstream_bytes_received" "$upstream_response_length" "$upstream_cache_status" '
' "$upstream_response_time" "$upstream_connect_time" "$upstream_header_time"] '
' "$request_time" "$http_referer" "$http_user_agent" "$http_x_forwarded_for" "$ssl_protocol"';
Custom variable $request_trace_id:
#trace.setting
set $request_trace_id $http_x_request_id;
if ( $request_trace_id = '' ) {
set $request_trace_id $pid-$connection-$bytes_sent-$msec;
}
(2) tomcat access log pattern:
"[%{Y-M-d H:m:s.S+z}t] real_ip:%{X-Real-IP}i remote:%h requestid:%{X-Request-ID}i first_line:"%r" status:%s bytes:%b cost:%Dms commit_time:%Fms Agent:"%{User-Agent}i" %{Connection}i %{Connection}o %{Keep-Alive}i %{Keep-Alive}o"
3. Problematic log entries
(1) nginx logs
192.168.26.73 - cgpadmin [2021-09-09T09:58:23+08:00] "POST /cgp2-oauth/oauth/check_token HTTP/1.1" 200 12983 13364 6462-1025729-0-1631152697.976 ["127.0.0.1:8801" "200" "13353" "12991" "-" "5.026" "0.000" "5.026"] "5.026" "-" "Java/1.8.0_144" "-" "-"
1631152697.976 timestamp:
2021-09-09 09:58:17.976
(2) tomcat logs
[2021-9-9 9:58:17.993+CST] real_ip:192.168.26.73 remote:127.0.0.1 requestid:6462-1025729-0-1631152697.976 first_line:"POST /cgp2-oauth/oauth/check_token HTTP/1.1" status:200 bytes:12991 cost:17ms commit_time:16ms Agent:"Java/1.8.0_144" - - - -
4. My judgment and analysis
nginx timings:
- variables
$request_time 5.026 seconds
$upstream_response_time 5.026 seconds
$upstream_header_time 5.026 seconds
$upstream_connect_time 0.000 seconds
- log timestamps
nginx starts handling proxy_pass: 2021-09-09 09:58:17.976
nginx finishes processing: 2021-09-09T09:58:23
Tomcat timings:
- attributes
%D 17 milliseconds
%F 16 milliseconds
- log timestamp
Tomcat finishes processing: 2021-9-9 9:58:17.993
Analysis of which stage is slow:
The total nginx processing time ($request_time) is dominated by the upstream processing time ($upstream_response_time).
During upstream processing, nginx first prepares the request message and then establishes a connection to Tomcat.
The connection time is very short (0 in the log excerpted here, probably because the connection is kept alive).
I have looked at other requests, and the connection time is also very short there (though not always 0),
so the delay can essentially be attributed to $upstream_header_time.
After nginx connects to Tomcat, the following steps happen: nginx starts sending the request message to Tomcat, Tomcat receives it (I understand there may be a queue here), Tomcat processes it, Tomcat returns the response message, and nginx receives the first byte of the response header.
Together these steps take a long time, so the question is which of them is slow.
nginx logged its completion time as 9:58:23 (without milliseconds, because I use the default variable),
and the total upstream time is 5.026 seconds.
Tomcat returned the response at 9:58:17.993. Subtracting 5.026 seconds from 9:58:23 gives about 9:58:18, which is close to the time
Tomcat returned the response, indicating that little time is spent between nginx and the client.
We can basically confirm that nginx established its connection to Tomcat at about 9:58:18.
$request_trace_id happens to record the moment nginx starts processing the reverse proxy request, 2021-09-09 09:58:17.976,
which basically confirms this conjecture.
Tomcat returned the response at 9:58:17.993. Subtracting Tomcat's internal processing time (%D, 17 ms)
gives 9:58:17.976 as the time Tomcat started processing the request, which matches the 09:58:17.976 in $request_trace_id.
This indicates that sending the request from nginx to Tomcat and picking a worker thread
from the pool took essentially no time.
Therefore the 5 seconds of upstream time are spent after Tomcat has prepared the response message,
between Tomcat sending the response to nginx and nginx receiving the first byte of the response header.
The traffic goes over 127.0.0.1, so there should be essentially no network problem,
and monitoring does not show one.
Is there a queue in nginx? Isn't upstream handling an event callback based on epoll?
The delay sometimes occurs before Tomcat processes the request and sometimes after (in this case, after).
Please give me some ideas on how to investigate this problem further. Thank you!
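The timestamp arithmetic in this analysis can be double-checked with a short Java sketch. The values are copied from the log entries above; the class and method names are hypothetical, and the +08:00 offset is taken from the nginx timestamp.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.time.ZoneOffset;
import java.time.temporal.ChronoUnit;

class TimelineCheck {
    /** $msec part of $request_trace_id (1631152697.976) converted to local time at +08:00. */
    static LocalDateTime nginxProxyStart() {
        return LocalDateTime.ofInstant(Instant.ofEpochMilli(1631152697976L), ZoneOffset.ofHours(8));
    }

    /** Tomcat completion time minus %D (17 ms) = time Tomcat started processing. */
    static LocalTime tomcatStart() {
        return LocalTime.parse("09:58:17.993").minus(17, ChronoUnit.MILLIS);
    }

    /** nginx completion time (second precision) minus $upstream_response_time (5.026 s). */
    static LocalTime nginxEndMinusUpstream() {
        return LocalTime.parse("09:58:23").minus(5026, ChronoUnit.MILLIS);
    }

    public static void main(String[] args) {
        System.out.println("nginx proxy start:        " + nginxProxyStart());
        System.out.println("Tomcat start:             " + tomcatStart());
        System.out.println("nginx end minus upstream: " + nginxEndMinusUpstream());
    }
}
```

Because the nginx completion time is only logged to the second, the last value is accurate to within one second at best.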
5. References
https://www.nginx.com/blog/using-nginx-logging-for-application-performance-monitoring/
Tomcat log: what's the difference between %D and %F
http://nginx.org/en/docs/http/ngx_http_upstream_module.html#variables
https://tomcat.apache.org/tomcat-8.0-doc/config/valve.html
https://juejin.cn/post/6844903887757901832
https://cloud.tencent.com/developer/article/1778734
I have deployed a Spring Boot application with a database-backed job queue on App Service.
Yesterday I performed a few scale-out and scale-in operations while the application was running to see how it would behave.
At some point (not necessarily related to the scaling operations) the application started to throw Hikari errors.
com.zaxxer.hikari.pool.PoolBase : HikariPool-1 - Failed to validate connection org.postgresql.jdbc.PgConnection#1ae66f34 (This connection has been closed.). Possibly consider using a shorter maxLifetime value.
com.zaxxer.hikari.pool.ProxyConnection : HikariPool-1 - Connection org.postgresql.jdbc.PgConnection#1ef85079 marked as broken because of SQLSTATE(08006), ErrorCode(0)
The following are stack traces from my scheduled job in spring and other information:
org.postgresql.util.PSQLException: An I/O error occurred while sending to the backend.
Caused by: javax.net.ssl.SSLException: Connection reset by peer (Write failed)
Suppressed: java.net.SocketException: Broken pipe (Write failed)
Caused by: java.net.SocketException: Connection reset by peer (Write failed)
These were followed by the following errors:
WARN 1 --- [ scheduling-1] com.zaxxer.hikari.pool.PoolBase : HikariPool-1 - Failed to validate connection org.postgresql.jdbc.PgConnection#48d0d6da (This connection has been closed.). Possibly consider using a shorter maxLifetime value.
org.springframework.jdbc.support.MetaDataAccessException: Error while extracting DatabaseMetaData; nested exception is java.sql.SQLException: Connection is closed
Caused by: java.sql.SQLException: Connection is closed
The code that is invoked periodically, every 500 milliseconds, is:
@Scheduled(fixedDelayString = "${worker.delay}")
@Transactional
public void execute() {
    jobManager.next(jobClass).ifPresent(this::handleJob);
}
Update.
The above code does nothing almost all the time, since there was no traffic on the website.
Update2. I've checked Postgres logs and found this:
2020-07-11 22:48:09 UTC-5f0866f0.f0-LOG: checkpoint starting: immediate force wait
2020-07-11 22:48:10 UTC-5f0866f0.f0-LOG: checkpoint complete (240): wrote 30 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.046 s, sync=0.046 s, total=0.437 s; sync files=13, longest=0.009 s, average=0.003 s; distance=163 kB, estimate=13180 kB
2020-07-11 22:48:10 UTC-5f0866ee.68-LOG: received immediate shutdown request
2020-07-11 22:48:10 UTC-5f0a3f41.8914-WARNING: terminating connection because of crash of another server process
2020-07-11 22:48:10 UTC-5f0a3f41.8914-DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
// Same text about 10 times
2020-07-11 22:48:10 UTC-5f0866f2.7c-HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-07-11 22:48:10 UTC-5f0866ee.68-LOG: src/port/kill.c(84): Process (272) exited OOB of pgkill.
2020-07-11 22:48:10 UTC-5f0866f1.fc-WARNING: terminating connection because of crash of another server process
2020-07-11 22:48:10 UTC-5f0866f1.fc-DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-07-11 22:48:10 UTC-5f0866f1.fc-HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-07-11 22:48:10 UTC-5f0866ee.68-LOG: archiver process (PID 256) exited with exit code 1
2020-07-11 22:48:11 UTC-5f0866ee.68-LOG: database system is shut down
It looks like the problem is with the Azure PostgreSQL server, which shut itself down. Am I reading this right?
As mentioned in your logs, have you tried setting the maxLifetime property for HikariCP? I think this issue should be resolved after setting that property.
Based on Hikari doc (https://github.com/brettwooldridge/HikariCP) --
maxLifetime
This property controls the maximum lifetime of a connection in the pool. An in-use connection will never be retired, only when it is closed will it then be removed. On a connection-by-connection basis, minor negative attenuation is applied to avoid mass-extinction in the pool. We strongly recommend setting this value, and it should be several seconds shorter than any database or infrastructure imposed connection time limit. A value of 0 indicates no maximum lifetime (infinite lifetime), subject of course to the idleTimeout setting. The minimum allowed value is 30000ms (30 seconds). Default: 1800000 (30 minutes)
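As a rough illustration for a Spring Boot application, the property can be set in application.properties. The 10-minute infrastructure limit below is a made-up figure for the example; substitute whatever limit your database or network actually imposes.

```properties
# application.properties (sketch; the values are illustrative assumptions)
# Hypothetical infrastructure limit: connections are dropped after 10 minutes (600000 ms).
# Keep maxLifetime several seconds shorter than that limit, as the HikariCP docs advise.
spring.datasource.hikari.max-lifetime=570000
```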
One of our applications just suffered from some nasty deadlocks. I had quite a hard time recreating the problem because the deadlock (or stack trace) did not show up immediately in my Java application logs.
To my surprise, the MarkLogic Java API retries failing requests (e.g. because of a deadlock). This might make sense if your request is not a multi-statement request, but otherwise I'm not sure it does.
So let's stick with this deadlock problem. I created a simple code snippet in which I create a deadlock on purpose. The snippet creates a document test.xml and then tries to read and write it from two different transactions, each write on a new thread.
public static void main(String[] args) throws Exception {
    final Logger root = (Logger) LoggerFactory.getLogger(Logger.ROOT_LOGGER_NAME);
    final Logger ok = (Logger) LoggerFactory.getLogger(OkHttpServices.class);
    root.setLevel(Level.ALL);
    ok.setLevel(Level.ALL);
    final DatabaseClient client = DatabaseClientFactory.newClient("localhost", 8000, new DatabaseClientFactory.DigestAuthContext("username", "password"));
    final StringHandle handle = new StringHandle("<doc><name>Test</name></doc>")
            .withFormat(Format.XML);
    client.newTextDocumentManager().write("test.xml", handle);
    root.info("t1: opening");
    final Transaction t1 = client.openTransaction();
    root.info("t1: reading");
    client.newXMLDocumentManager()
            .read("test.xml", new StringHandle(), t1);
    root.info("t2: opening");
    final Transaction t2 = client.openTransaction();
    root.info("t2: reading");
    client.newXMLDocumentManager()
            .read("test.xml", new StringHandle(), t2);
    new Thread(() -> {
        root.info("t1: writing");
        client.newXMLDocumentManager().write("test.xml", new StringHandle("<doc><t>t1</t></doc>").withFormat(Format.XML), t1);
        t1.commit();
    }).start();
    new Thread(() -> {
        root.info("t2: writing");
        client.newXMLDocumentManager().write("test.xml", new StringHandle("<doc><t>t2</t></doc>").withFormat(Format.XML), t2);
        t2.commit();
    }).start();
    TimeUnit.MINUTES.sleep(5);
    client.release();
}
This code will produce the following log:
14:12:27.437 [main] DEBUG c.m.client.impl.OkHttpServices - Connecting to localhost at 8000 as admin
14:12:27.570 [main] DEBUG c.m.client.impl.OkHttpServices - Sending test.xml document in transaction null
14:12:27.608 [main] INFO ROOT - t1: opening
14:12:27.609 [main] DEBUG c.m.client.impl.OkHttpServices - Opening transaction
14:12:27.962 [main] INFO ROOT - t1: reading
14:12:27.963 [main] DEBUG c.m.client.impl.OkHttpServices - Getting test.xml in transaction 5298588351036278526
14:12:28.283 [main] INFO ROOT - t2: opening
14:12:28.283 [main] DEBUG c.m.client.impl.OkHttpServices - Opening transaction
14:12:28.286 [main] INFO ROOT - t2: reading
14:12:28.286 [main] DEBUG c.m.client.impl.OkHttpServices - Getting test.xml in transaction 8819382734425123844
14:12:28.289 [Thread-1] INFO ROOT - t1: writing
14:12:28.289 [Thread-1] DEBUG c.m.client.impl.OkHttpServices - Sending test.xml document in transaction 5298588351036278526
14:12:28.289 [Thread-2] INFO ROOT - t2: writing
14:12:28.290 [Thread-2] DEBUG c.m.client.impl.OkHttpServices - Sending test.xml document in transaction 8819382734425123844
Neither t1 nor t2 gets committed. The MarkLogic logs confirm that there actually is a deadlock:
==> /var/opt/MarkLogic/Logs/8000_AccessLog.txt <==
127.0.0.1 - admin [24/Nov/2018:14:12:30 +0000] "PUT /v1/documents?txid=5298588351036278526&category=content&uri=test.xml HTTP/1.1" 503 1034 - "okhttp/3.9.0"
==> /var/opt/MarkLogic/Logs/ErrorLog.txt <==
2018-11-24 14:12:30.719 Info: Deadlock detected locking Documents test.xml
This would not be a problem if one of the requests failed and threw an exception, but that is not the case. The MarkLogic Java API retries every request for up to 120 seconds, and one of the updates times out after roughly 120 seconds:
Exception in thread "Thread-1" com.marklogic.client.FailedRequestException: Service unavailable and maximum retry period elapsed: 121 seconds after 65 retries
at com.marklogic.client.impl.OkHttpServices.putPostDocumentImpl(OkHttpServices.java:1422)
at com.marklogic.client.impl.OkHttpServices.putDocument(OkHttpServices.java:1256)
at com.marklogic.client.impl.DocumentManagerImpl.write(DocumentManagerImpl.java:920)
at com.marklogic.client.impl.DocumentManagerImpl.write(DocumentManagerImpl.java:758)
at com.marklogic.client.impl.DocumentManagerImpl.write(DocumentManagerImpl.java:717)
at Scratch.lambda$main$0(scratch.java:40)
at java.lang.Thread.run(Thread.java:748)
What are possible ways to overcome this problem? One way might be to set a maximum time to live for a transaction (like 5 seconds), but this feels hacky and unreliable. Any other ideas? Are there any other settings I should check out?
I'm on MarkLogic 9.0-7.2 and using marklogic-client-api:4.0.3.
Edit: One way to solve the deadlock would be to synchronize the calling function; this is actually how I solved it in my case (see comments). But I think the underlying problem still exists. A deadlock in a multi-statement transaction should not be hidden away behind a 120-second timeout. I would rather have an immediately failing request than a 120-second lock on one of my documents plus 64 failing retries per thread.
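The synchronization workaround mentioned in the edit can be sketched roughly as follows. This is not the code from the original application (which is not shown here); PerDocumentLock and withLock are hypothetical names, and a plain counter stands in for the MarkLogic open/read/write/commit sequence. It only serializes callers inside one JVM, so it does nothing for transactions opened by other processes.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

class PerDocumentLock {
    private final ConcurrentHashMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    /** Runs the action while holding the lock that belongs to the given document URI. */
    void withLock(String uri, Runnable action) {
        ReentrantLock lock = locks.computeIfAbsent(uri, key -> new ReentrantLock());
        lock.lock();
        try {
            action.run(); // placeholder for: open transaction, read, write, commit
        } finally {
            lock.unlock();
        }
    }

    /** Demo: two threads each update the same "document" 1000 times; returns the final count. */
    static int demo() {
        PerDocumentLock guard = new PerDocumentLock();
        int[] counter = {0};
        Runnable worker = () -> {
            for (int i = 0; i < 1000; i++) {
                guard.withLock("test.xml", () -> counter[0]++);
            }
        };
        Thread first = new Thread(worker);
        Thread second = new Thread(worker);
        first.start();
        second.start();
        try {
            first.join();
            second.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return counter[0];
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Keying the lock by URI rather than using one global lock lets updates to different documents proceed in parallel.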
Deadlocks are usually resolvable by retrying. Internally, the server runs an inner retry loop, because deadlocks are usually transient and incidental and last only a very short time. In your case you have constructed a scenario that will never succeed with any timeout that is equal for both threads.
Deadlocks can be avoided at the application layer by avoiding multi-statement transactions when using the REST API (which is what the Java API uses).
Multi-statement transactions over REST cannot be implemented 100% safely, due to the client's responsibility to manage the transaction ID and the server's inability to detect client-side errors or client-side identity. Very subtle problems can and do occur unless you are aggressively proactive with respect to handling errors and multithreading. If you 'push' the logic to the server (XQuery or JavaScript), the server is able to manage things much better.
As for whether it is 'good' or not for the Java API to implement retries for this case, that's debatable either way. (The compromise for a seemingly easy-to-use interface is that many things that would otherwise be options are decided for you as a convention. There's generally no one-size-fits-all answer. In this case I am presuming the thought was that a deadlock is more likely caused by independent code/logic by 'accident' as opposed to identical code running in tandem -- a retry in that case would be a good choice. In your example it's not, but then an earlier error would still fail predictably until you change your code to 'not do that'.)
If it doesn't already exist, a feature request for configurable timeout and retry behaviour does seem reasonable. I would recommend, however, trying to avoid any REST calls that result in an open transaction -- inherently that is problematic, particularly if you don't notice the problem upfront (then it's more likely to bite you in production). Unlike JDBC, which keeps a connection open so that the server can detect client disconnects, HTTP and the ML REST API do not -- which leads to a different programming model than traditional database coding in Java.
Below is a description of a problem we faced in production. Please note that I could not reproduce the issue in a test or local environment and therefore cannot provide you with test code.
We have a hazelcast cluster with two members M1, M2 and three clients C1,C2,C3. Hazelcast version is 3.9.
Clients use the IMap.tryLock() method with a timeout of 10 seconds. After getting the lock, critical and long-running operations are performed, and finally the lock is released using the IMap.unlock() method.
The problem that occurred in production is as follows:
At some time instant t, we first saw a heartbeat failure to M2 at client C2. Afterwards there were errors in fetching the partition table, caused by com.hazelcast.spi.exception.TargetDisconnectedException:
[hz.client_0.internal-2 ] WARN [] HeartbeatManager - hz.client_0 [mygroup] [3.9] HeartbeatManager failed to connection: .....
[hz.client_0.internal-3 ] WARN [] ClientPartitionService - hz.client_0 [mygroup] [3.9] Error while fetching cluster partition table!
java.util.concurrent.ExecutionException: com.hazelcast.spi.exception.TargetDisconnectedException: Heartbeat timed out to owner connection ClientConnection{alive=true, connectionId=1, ......
Around 250 ms after the initial heartbeat failure, the client gets disconnected and then reconnects within 20 ms.
[hz.client_0.cluster- ] INFO [] LifecycleService - hz.client_0 [mygroup] [3.9] HazelcastClient 3.9 (20171023 - b29f549) is CLIENT_DISCONNETED
[hz.client_0.cluster- ] INFO [] LifecycleService - hz.client_0 [mygroup] [3.9] HazelcastClient 3.9 (20171023 - b29f549) is CLIENT_CONNECTED
The problem we are having is that, for some keys that were previously locked by C2, C1 and C3 cannot acquire the lock even though it seems to have been released by C2. C2 can still get the lock, but this introduces delays that are unacceptable for the application. All clients should be able to get the lock, since it has been released.
We were notified of the problem after receiving complaints, and then restarted the client application C2.
As documented in http://docs.hazelcast.org/docs/latest-development/manual/html/Distributed_Data_Structures/Lock.html, the locks acquired by the restarted member (C2 in my case) appeared to be removed after the restart.
Currently the issue seems to have gone away, but we are not sure whether it will recur.
Do you have any suggestions about the probable cause and, more importantly, do you have any recommendations?
Would enabling redo-operation in the client help in this case?
As I tried to explain, the client seems to recover from the problem, but the keys remain locked in the cluster, and this is fatal to my application.
Thanks
It looks like the client lost ownership of the lock because of its disconnection from the cluster. You can use the IMap#forceUnlock API in cases like the one you faced. It releases the lock regardless of the lock owner; it always unlocks successfully, never blocks, and returns immediately.
I have set up a replica set using three machines (192.168.122.21, 192.168.122.147 and 192.168.122.148) and I am interacting with the MongoDB Cluster using the Java SDK:
ArrayList<ServerAddress> addrs = new ArrayList<ServerAddress>();
addrs.add(new ServerAddress("192.168.122.21", 27017));
addrs.add(new ServerAddress("192.168.122.147", 27017));
addrs.add(new ServerAddress("192.168.122.148", 27017));
this.mongoClient = new MongoClient(addrs);
this.db = this.mongoClient.getDB(this.db_name);
this.collection = this.db.getCollection(this.collection_name);
After the connection is established I do multiple inserts of a simple test document:
for (int i = 0; i < this.inserts; i++) {
    try {
        this.collection.insert(new BasicDBObject(String.valueOf(i), "test"));
    } catch (Exception e) {
        System.out.println("Error on inserting element: " + i);
        e.printStackTrace();
    }
}
When simulating a node crash of the master server (power-off), the MongoDB cluster does a successful failover:
19:08:03.907+0100 [rsHealthPoll] replSet info 192.168.122.21:27017 is down (or slow to respond):
19:08:03.907+0100 [rsHealthPoll] replSet member 192.168.122.21:27017 is now in state DOWN
19:08:04.153+0100 [rsMgr] replSet info electSelf 1
19:08:04.154+0100 [rsMgr] replSet couldn't elect self, only received -9999 votes
19:08:05.648+0100 [conn15] replSet info voting yea for 192.168.122.148:27017 (2)
19:08:10.681+0100 [rsMgr] replSet not trying to elect self as responded yea to someone else recently
19:08:10.910+0100 [rsHealthPoll] replset info 192.168.122.21:27017 heartbeat failed, retrying
19:08:16.394+0100 [rsMgr] replSet not trying to elect self as responded yea to someone else recently
19:08:22.876+.
19:08:22.912+0100 [rsHealthPoll] replset info 192.168.122.21:27017 heartbeat failed, retrying
19:08:23.623+0100 [SyncSourceFeedbackThread] replset setting syncSourceFeedback to 192.168.122.148:27017
19:08:23.917+0100 [rsHealthPoll] replSet member 192.168.122.148:27017 is now in state PRIMARY
This is also recognized by the MongoDB Driver on the Client Side:
Dec 01, 2014 7:08:16 PM com.mongodb.ConnectionStatus$UpdatableNode update
WARNING: Server seen down: /192.168.122.21:27017 - java.io.IOException - message: Read timed out
WARNING: Server seen down: /192.168.122.21:27017 - java.io.IOException - message: couldn't connect to [/192.168.122.21:27017] bc:java.net.SocketTimeoutException: connect timed out
Dec 01, 2014 7:08:36 PM com.mongodb.DBTCPConnector setMasterAddress
WARNING: Primary switching from /192.168.122.21:27017 to /192.168.122.148:27017
But it still keeps trying to connect to the old node (forever):
Dec 01, 2014 7:08:50 PM com.mongodb.ConnectionStatus$UpdatableNode update
WARNING: Server seen down: /192.168.122.21:27017 - java.io.IOException - message: couldn't connect to [/192.168.122.21:27017] bc:java.net.NoRouteToHostException: No route to host
.....
Dec 01, 2014 7:10:43 PM com.mongodb.ConnectionStatus$UpdatableNode update
WARNING: Server seen down: /192.168.122.21:27017 - java.io.IOException -message: couldn't connect to [/192.168.122.21:27017] bc:java.net.NoRouteToHostException: No route to host
The Document count on the Database stays the same from the moment the primary fails and a secondary becomes primary. Here is the Output from the same node during the process:
"rs0":SECONDARY> db.test_collection.find().count() 12260161
"rs0":PRIMARY> db.test_collection.find().count() 12260161
Update:
Using WriteConcern Unacknowledged, it works as designed. Insert operations are also performed on the new master, and all operations during the election process get lost.
With WriteConcern Acknowledged, it seems that the operation waits indefinitely for an ACK from the crashed master. This could explain why the program continues after the crashed server boots up again and joins the cluster as a secondary. But in my case I don't want the driver to wait forever; it should raise an error after a certain time.
Update:
WriteConcern Acknowledged is also working as expected when killing the mongod process on the primary. In this case the failover only takes ~3 Seconds. During this time no inserts are done, and after the new primary is elected the insert operations continue.
So I only get the problem when simulating a node failure (power off/network down). In this case the operation hangs until the failed node starts up again.
Does your app still work? Since that server is still in your seed list, the driver will try to connect to it as far as I know. Your app should still work so long as any of the other servers in your seed list can gain primary status.
Explicitly specifying a connection timeout value solved the error. See also: http://api.mongodb.org/java/2.7.0/com/mongodb/MongoOptions.html
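The difference between the two failure modes can be reproduced with plain sockets. When the mongod process is killed, the operating system closes its sockets and the client sees an error right away; when the machine is powered off, nothing is sent back, so a blocking call without a timeout waits indefinitely. The sketch below is not driver code: SilentPeerDemo is a hypothetical class that uses a local listener which never replies as a stand-in for the powered-off primary, and it demonstrates a read timeout rather than the connection timeout mentioned above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

class SilentPeerDemo {
    /** Returns true if a read from a peer that never answers is aborted by the socket timeout. */
    static boolean readTimesOut(int timeoutMillis) {
        // The listener never calls accept() and never writes, so the client connects
        // (through the OS backlog) but no data ever arrives.
        try (ServerSocket silentServer = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(silentServer.getInetAddress(), silentServer.getLocalPort())) {
            client.setSoTimeout(timeoutMillis); // without this, read() would block indefinitely
            try {
                client.getInputStream().read();
                return false; // unexpected: data or end-of-stream arrived
            } catch (SocketTimeoutException expected) {
                return true;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read timed out: " + readTimesOut(500));
    }
}
```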