I'm using Elastic Cloud (formerly Found) with Shield and the Java transport client. The app communicating with ES runs on Heroku. I'm running a stress test on a staging environment with one node:
{
"cluster_name": ...,
"status": "yellow",
"timed_out": false,
"number_of_nodes": 1,
"number_of_data_nodes": 1,
"active_primary_shards": 19,
"active_shards": 19,
"relocating_shards": 0,
"initializing_shards": 0,
"unassigned_shards": 7,
"delayed_unassigned_shards": 0,
"number_of_pending_tasks": 0,
"number_of_in_flight_fetch": 0
}
At the beginning everything works perfectly, but after some time (3-4 minutes) I begin to get errors. I've set the log level to trace and these are the errors I've been getting (I've replaced everything irrelevant with ...):
org.elasticsearch.client.transport.NoNodeAvailableException: None of the configured nodes were available: [[...][...][...][inet[...]]{logical_availability_zone=..., availability_zone=..., max_local_storage_nodes=1, region=..., master=true}]
at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:242)
at org.elasticsearch.action.TransportActionNodeProxy$1.handleException(TransportActionNodeProxy.java:78)
at org.elasticsearch.transport.TransportService$3.run(TransportService.java:290)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.transport.SendRequestTransportException: [...][inet[...]][indices:data/read/search]
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:286)
at org.elasticsearch.shield.transport.ShieldClientTransportService.sendRequest(ShieldClientTransportService.java:41)
at org.elasticsearch.action.TransportActionNodeProxy.execute(TransportActionNodeProxy.java:57)
at org.elasticsearch.client.transport.support.InternalTransportClient$1.doWithNode(InternalTransportClient.java:109)
at org.elasticsearch.client.transport.TransportClientNodesService.execute(TransportClientNodesService.java:205)
at org.elasticsearch.client.transport.support.InternalTransportClient.execute(InternalTransportClient.java:106)
at org.elasticsearch.client.support.AbstractClient.search(AbstractClient.java:334)
at org.elasticsearch.client.transport.TransportClient.search(TransportClient.java:416)
at org.elasticsearch.action.search.SearchRequestBuilder.doExecute(SearchRequestBuilder.java:1122)
at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:91)
at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:65)
...
Caused by: org.elasticsearch.transport.NodeNotConnectedException: [...][inet[...]] Node not connected
at org.elasticsearch.transport.netty.NettyTransport.nodeChannel(NettyTransport.java:936)
at org.elasticsearch.transport.netty.NettyTransport.sendRequest(NettyTransport.java:629)
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:276)
...
These are my properties
settings = ImmutableSettings.settingsBuilder()
    .put("client.transport.nodes_sampler_interval", "5s") // Tried it with 30s, same outcome
    .put("client.transport.ping_timeout", "30s")
    .put("cluster.name", clusterName)
    .put("action.bulk.compress", false)
    .put("shield.transport.ssl", true)
    .put("request.headers.X-Found-Cluster", clusterName)
    .put("shield.user", user + ":" + password)
    .put("transport.ping_schedule", "1s") // Tried with 5s, same outcome
    .build();
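For completeness, the client is constructed from these settings roughly like this (the host name and the port 9343 are placeholders for the Found endpoint, using the 1.x TransportClient API):

import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

// Sketch of the client construction; "my-cluster.foundcluster.com" and 9343 are placeholders.
TransportClient client = new TransportClient(settings);
client.addTransportAddress(new InetSocketTransportAddress("my-cluster.foundcluster.com", 9343));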
I've also set for every query I make:
max_query_response_size=100000
timeout_seconds=30
I'm using Elasticsearch 1.7.2 and Shield 1.3.2 with matching client versions, Java 1.8.0_65 on my machine and Java 1.8.0_40 on the node.
I was getting the same errors without a stress test, but they happened very randomly, so I wanted a way to reproduce them. That's why I'm running this against a single node.
I spotted another error in my logs
2016-03-07 23:35:52,177 DEBUG [elasticsearch[Vermin][transport_client_worker][T#7]{New I/O worker #16}] ssl.SslHandler (NettyInternalESLogger.java:debug(63)) - Swallowing an exception raised while writing non-app data
java.nio.channels.ClosedChannelException
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.cleanUpWriteBuffer(AbstractNioWorker.java:433)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.close(AbstractNioWorker.java:373)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:93)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
Hot threads
0.0% (111.6micros out of 500ms) cpu usage by thread 'elasticsearch[...][transport_client_timer][T#1]{Hashed wheel timer #1}'
10/10 snapshots sharing following 5 elements
java.lang.Thread.sleep(Native Method)
org.elasticsearch.common.netty.util.HashedWheelTimer$Worker.waitForNextTick(HashedWheelTimer.java:445)
org.elasticsearch.common.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:364)
org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
java.lang.Thread.run(Thread.java:745)
After reading http://blog.trifork.com/2015/04/08/dealing-with-nodenotavailableexceptions-in-elasticsearch/ I understand a little better how the whole communication works. I haven't tested this yet, but I believe the problem lies there. The thing is, though, even if I confirm that the problem is closed query connections, how do I handle it? Keep the config as is and just reconnect? Disable keepAlive? If so, should I be worrying about something else?
Citing this link:
https://discuss.elastic.co/t/nonodeavailableexception-with-java-transport-client/37702 by Konrad Beiske
your application could be resolving the ip address at boot time. The
ELB can change ip's at any point in time. For the best reliability
your application should add all ip's of the ELB to the client and
periodically check the DNS service for changes.
The connection timeout of our ELB's are 5 minutes.
Following should help you fix it:
Creating a new TransportClient for every request is not ideal as it
will imply a new connection handshake for every request and this will
hurt your response time. You could have a pool of TransportClients if
you prefer, but it will most likely be an unnecessary overhead as the
client is thread safe.
My suggestion is that you create a small singleton service that
periodically checks for changes to the DNS service and adds any new
ip's to your existing transport client. In theory it could be as naive
as just adding all ip's discovered every time it checks as the
transport client will discard duplicate addresses and also purges old
addresses no longer reachable.
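To illustrate that suggestion, here is a rough sketch of such a periodic re-resolution task (the interval, host name and port are placeholders; it assumes the 1.x TransportClient from the question, which discards duplicate addresses and purges unreachable ones):

import java.net.InetAddress;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

// Periodically re-resolve the cluster endpoint and (re-)add all of its IPs to the existing client.
// "client" is the TransportClient instance already in use.
ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
scheduler.scheduleAtFixedRate(() -> {
    try {
        for (InetAddress address : InetAddress.getAllByName("my-cluster.foundcluster.com")) {
            client.addTransportAddress(new InetSocketTransportAddress(address, 9343));
        }
    } catch (Exception e) {
        // log and retry on the next tick
    }
}, 0, 60, TimeUnit.SECONDS);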
Related
I recently tried to switch my subscriptions in GCP Pub/Sub to the "exactly-once" delivery strategy. However, I started seeing the following warnings ~10 times every 30 minutes in my application logs:
com.google.api.gax.rpc.InvalidArgumentException: io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Some acknowledgement ids in the request were invalid. This could be because the acknowledgement ids have expired or the acknowledgement ids were malformed.
at com.google.api.gax.rpc.ApiExceptionFactory.createException(ApiExceptionFactory.java:92)
at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:98)
at com.google.api.gax.grpc.GrpcApiExceptionFactory.create(GrpcApiExceptionFactory.java:66)
at com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97)
at com.google.api.core.ApiFutures$1.onFailure(ApiFutures.java:67)
at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1041)
at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1215)
at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:983)
at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:771)
at io.grpc.stub.ClientCalls$GrpcFuture.setException(ClientCalls.java:574)
at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:544)
at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
at com.google.api.gax.grpc.ChannelPool$ReleasingClientCall$1.onClose(ChannelPool.java:535)
at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:563)
at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:744)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:723)
at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
Caused by: io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Some acknowledgement ids in the request were invalid. This could be because the acknowledgement ids have expired or the acknowledgement ids were malformed.
at io.grpc.Status.asRuntimeException(Status.java:535)
... 17 more
They're immediately followed by the following INFO log messages in the same thread:
Permanent error invalid ack id message, will not resend
I didn't see any problems caused by these warnings, but it's a bit hard to tell because my application is processing a decent number of messages (~1000/hour). I initially thought that these warnings were just short-term "aftershocks" from switching to the "exactly-once" strategy. However, I waited for about 2 hours and they kept occurring with the same frequency, showing no sign of disappearing. I then disabled the "exactly-once" strategy and they went away immediately after. Can anyone tell me whether these warnings are dangerous, what side effects I can expect, and most importantly how I can get rid of them?
I'm using version 3.4.0 of spring-cloud-gcp-dependencies and spring-cloud-gcp-starter-pubsub. I'm also using Spring Cloud Stream to process the incoming messages and I rely on it to automatically acknowledge the messages.
I have the following configuration set in my application.yaml file:
spring:
  cloud:
    gcp:
      pubsub:
        subscriber:
          executor-threads: 15
          max-ack-extension-period: 23400 # 6 hours and 30 minutes
          acknowledgement-deadline: 600 # Maximum value
For context: The messages in my application represent jobs for execution, and they could take quite a while to finish - hence the 6h30m maximum acknowledgement extension period.
I also saw the following StackOverflow question:
How to handle errors during message acknowledgement using google pubsub java library?
From what I understand, the consequence of these warnings is that the messages will be redelivered to my application, but this is exactly what I want to avoid.
Thanks for the question, Alexander.
The errors you are seeing happen when a modifyAckDeadline or Acknowledgement request to the service fails because the acknowledgement id has already expired. In such cases, the service considers the expired acknowledgement id invalid, since a newer delivery might already be in flight. This is as per the guarantees of exactly-once delivery. There could be a few reasons for it:
The request was delayed due to network delays and, by the time it arrived at the server, the acknowledgement id lease had already expired.
The client issuing the modifyAckDeadline or Acknowledgement requests is overwhelmed (high CPU/memory/network usage), leading to delays in issuing those requests.
I suggest setting min-duration-per-ack-extension to a high value to reduce the issues mentioned above. This will help mitigate cases where the acknowledgement id lease expires. The highest value you can set for this field is 600 seconds.
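With the Spring Cloud GCP properties shown in the question, that would look roughly like the snippet below (assuming the subscriber property min-duration-per-ack-extension is available in your version and, like max-ack-extension-period, is given in seconds):

spring:
  cloud:
    gcp:
      pubsub:
        subscriber:
          min-duration-per-ack-extension: 600 # assumed to be in seconds; 600 is the maximum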
Additionally, as mentioned in the other Stack Overflow question, you should consider checking the response of your acknowledgement operations. This can tell your application whether it should expect a redelivery. Sample.
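For reference, with the plain google-cloud-pubsub client that check looks roughly like the sketch below; it uses the exactly-once MessageReceiverWithAckResponse API, the project and subscription names are placeholders, and with Spring Cloud Stream the equivalent would go through its acknowledgement callbacks instead:

import com.google.cloud.pubsub.v1.AckReplyConsumerWithResponse;
import com.google.cloud.pubsub.v1.AckResponse;
import com.google.cloud.pubsub.v1.MessageReceiverWithAckResponse;
import com.google.cloud.pubsub.v1.Subscriber;
import com.google.pubsub.v1.ProjectSubscriptionName;
import com.google.pubsub.v1.PubsubMessage;

// Sketch: acknowledge a message and inspect the ack response to learn whether the ack
// was accepted or whether a redelivery should be expected (e.g. AckResponse.INVALID).
MessageReceiverWithAckResponse receiver = (PubsubMessage message, AckReplyConsumerWithResponse consumer) -> {
    // ... process the message ...
    try {
        AckResponse response = consumer.ack().get();
        if (response != AckResponse.SUCCESSFUL) {
            // the ack was not accepted (expired/invalid ack id); a redelivery is likely
        }
    } catch (Exception e) {
        // handle interruption / execution failure
    }
};

Subscriber subscriber = Subscriber
    .newBuilder(ProjectSubscriptionName.of("my-project", "my-subscription"), receiver)
    .build();
subscriber.startAsync().awaitRunning();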
Below is a description of a problem we faced in production. Please note that I could not reproduce the issue in a test or local environment and therefore cannot provide test code.
We have a Hazelcast cluster with two members, M1 and M2, and three clients, C1, C2 and C3. The Hazelcast version is 3.9.
Clients use the IMap.tryLock() method with a timeout of 10 seconds. After getting the lock, critical and long-running operations are performed, and finally the lock is released with IMap.unlock().
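In code, the locking pattern is roughly the following sketch (map and key names are placeholders):

import java.util.concurrent.TimeUnit;
import com.hazelcast.core.IMap;

// hazelcastClient is the client HazelcastInstance; "jobs" and key are placeholders.
IMap<String, Object> map = hazelcastClient.getMap("jobs");
try {
    if (map.tryLock(key, 10, TimeUnit.SECONDS)) {
        try {
            // critical, long-running operations
        } finally {
            map.unlock(key);
        }
    }
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
}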
The problem that occurred in production is as follows:
At some time instant t, we first saw a heartbeat failure to M2 at client C2. Afterwards there were errors fetching the cluster partition table, caused by com.hazelcast.spi.exception.TargetDisconnectedException:
[hz.client_0.internal-2 ] WARN [] HeartbeatManager - hz.client_0 [mygroup] [3.9] HeartbeatManager failed to connection: .....
[hz.client_0.internal-3 ] WARN [] ClientPartitionService - hz.client_0 [mygroup] [3.9] Error while fetching cluster partition table!
java.util.concurrent.ExecutionException: com.hazelcast.spi.exception.TargetDisconnectedException: Heartbeat timed out to owner connection ClientConnection{alive=true, connectionId=1, ......
Around 250 ms after the initial heartbeat failure, the client gets disconnected and then reconnects within 20 ms.
[hz.client_0.cluster- ] INFO [] LifecycleService - hz.client_0 [mygroup] [3.9] HazelcastClient 3.9 (20171023 - b29f549) is CLIENT_DISCONNETED
[hz.client_0.cluster- ] INFO [] LifecycleService - hz.client_0 [mygroup] [3.9] HazelcastClient 3.9 (20171023 - b29f549) is CLIENT_CONNECTED
The problem we are having is that, for some keys previously locked by C2, C1 and C3 cannot acquire the lock even though it appears to have been released by C2. C2 can still get the lock, but this puts unacceptable delays on the application. All clients should be able to acquire the lock once it has been released.
We were notified of the problem after receiving complaints, and then restarted the client application C2.
As documented at http://docs.hazelcast.org/docs/latest-development/manual/html/Distributed_Data_Structures/Lock.html, locks acquired by a restarted member (C2 in my case) seem to be removed after the restart operation.
Currently the issue seems to go away, but we are not sure if it will recur.
Do you have any suggestions about the probable cause and more importantly do you have any recommendations?
Would enabling redo-operation on the client help in this case?
As I tried to explain, the client seems to recover from the problem, but the keys remain locked in the cluster, and this is fatal for my application.
Thanks
It looks like the client lost ownership of the lock because of its disconnection from the cluster. You can use the IMap#forceUnlock API in cases such as the one you faced. It releases the lock regardless of the lock owner; it always unlocks successfully, never blocks, and returns immediately.
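A minimal sketch of such a recovery path (map and key names are placeholders):

import com.hazelcast.core.IMap;

// If a disconnected client left the key locked, any other client (or an admin task)
// can force-release it; forceUnlock ignores who currently owns the lock.
IMap<String, Object> map = hazelcastClient.getMap("jobs");
if (map.isLocked(key)) {
    map.forceUnlock(key);
}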
Problem description: the MongoDB version is 3.4.
We are not doing anything unusual, just normal queries and writes; the system is in the testing phase, so QPS is small.
Questions:
1: How is this exception produced?
2: What configuration or adjustment needs to be done? Please help me.
02-01 15:11:47 WARN - Got socket exception on connection [connectionId{localValue:43}] to 172.16.199.96:22001. All connections to 172.16.199.96:22001 will be closed.
02-01 15:11:47 INFO - Closed connection [connectionId{localValue:43}] to 172.16.199.96:22001 because there was a socket exception raised by this connection.
org.springframework.data.mongodb.UncategorizedMongoDbException: Exception receiving message; nested exception is com.mongodb.MongoSocketReadException: Exception receiving message
at org.springframework.data.mongodb.core.MongoExceptionTranslator.translateExceptionIfPossible(MongoExceptionTranslator.java:107)
at org.springframework.data.mongodb.core.MongoTemplate.potentiallyConvertRuntimeException(MongoTemplate.java:2135)
at org.springframework.data.mongodb.core.MongoTemplate.executeFindMultiInternal(MongoTemplate.java:1978)
at org.springframework.data.mongodb.core.MongoTemplate.doFind(MongoTemplate.java:1784)
at org.springframework.data.mongodb.core.MongoTemplate.doFind(MongoTemplate.java:1767)
at org.springframework.data.mongodb.core.MongoTemplate.find(MongoTemplate.java:641)
at org.springframework.data.mongodb.core.MongoTemplate.findOne(MongoTemplate.java:606)
at org.springframework.data.mongodb.core.MongoTemplate.findOne(MongoTemplate.java:598)
at com.xxx.xxx.xxx.xxx(xxxService.java:46)
at com.xxx.xxx.xxx.xxx(xxxService.java:157)
at com.xxx.xxx.xxx.xxx(xxxService.java:142)
at com.xxx.xxx.xxx.xxx(xxxService.java:87)
at com.alibaba.dubbo.common.bytecode.Wrapper2.invokeMethod(Wrapper2.java)
at com.alibaba.dubbo.rpc.proxy.javassist.JavassistProxyFactory$1.doInvoke(JavassistProxyFactory.java:46)
at com.alibaba.dubbo.rpc.proxy.AbstractProxyInvoker.invoke(AbstractProxyInvoker.java:72)
at com.alibaba.dubbo.rpc.protocol.InvokerWrapper.invoke(InvokerWrapper.java:53)
at com.alibaba.dubbo.rpc.filter.ExceptionFilter.invoke(ExceptionFilter.java:64)
at com.alibaba.dubbo.rpc.protocol.ProtocolFilterWrapper$1.invoke(ProtocolFilterWrapper.java:69)
at com.alibaba.dubbo.monitor.support.MonitorFilter.invoke(MonitorFilter.java:75)
at com.alibaba.dubbo.rpc.protocol.ProtocolFilterWrapper$1.invoke(ProtocolFilterWrapper.java:69)
at com.alibaba.dubbo.rpc.filter.TimeoutFilter.invoke(TimeoutFilter.java:42)
at com.alibaba.dubbo.rpc.protocol.ProtocolFilterWrapper$1.invoke(ProtocolFilterWrapper.java:69)
at com.alibaba.dubbo.rpc.protocol.dubbo.filter.TraceFilter.invoke(TraceFilter.java:78)
at com.alibaba.dubbo.rpc.protocol.ProtocolFilterWrapper$1.invoke(ProtocolFilterWrapper.java:69)
at com.alibaba.dubbo.rpc.filter.ContextFilter.invoke(ContextFilter.java:61)
at com.alibaba.dubbo.rpc.protocol.ProtocolFilterWrapper$1.invoke(ProtocolFilterWrapper.java:69)
at com.alibaba.dubbo.rpc.filter.GenericFilter.invoke(GenericFilter.java:132)
at com.alibaba.dubbo.rpc.protocol.ProtocolFilterWrapper$1.invoke(ProtocolFilterWrapper.java:69)
at com.alibaba.dubbo.rpc.filter.ClassLoaderFilter.invoke(ClassLoaderFilter.java:38)
at com.alibaba.dubbo.rpc.protocol.ProtocolFilterWrapper$1.invoke(ProtocolFilterWrapper.java:69)
at com.alibaba.dubbo.rpc.filter.EchoFilter.invoke(EchoFilter.java:38)
at com.alibaba.dubbo.rpc.protocol.ProtocolFilterWrapper$1.invoke(ProtocolFilterWrapper.java:69)
at com.alibaba.dubbo.rpc.protocol.dubbo.DubboProtocol$1.reply(DubboProtocol.java:100)
at com.alibaba.dubbo.remoting.exchange.support.header.HeaderExchangeHandler.handleRequest(HeaderExchangeHandler.java:98)
at com.alibaba.dubbo.remoting.exchange.support.header.HeaderExchangeHandler.received(HeaderExchangeHandler.java:170)
at com.alibaba.dubbo.remoting.transport.DecodeHandler.received(DecodeHandler.java:52)
at com.alibaba.dubbo.remoting.transport.dispatcher.ChannelEventRunnable.run(ChannelEventRunnable.java:81)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.mongodb.MongoSocketReadException: Exception receiving message
at com.mongodb.connection.InternalStreamConnection.translateReadException(InternalStreamConnection.java:483)
at com.mongodb.connection.InternalStreamConnection.receiveMessage(InternalStreamConnection.java:228)
at com.mongodb.connection.UsageTrackingInternalConnection.receiveMessage(UsageTrackingInternalConnection.java:96)
at com.mongodb.connection.DefaultConnectionPool$PooledConnection.receiveMessage(DefaultConnectionPool.java:440)
at com.mongodb.connection.CommandProtocol.execute(CommandProtocol.java:112)
at com.mongodb.connection.DefaultServer$DefaultServerProtocolExecutor.execute(DefaultServer.java:168)
at com.mongodb.connection.DefaultServerConnection.executeProtocol(DefaultServerConnection.java:289)
at com.mongodb.connection.DefaultServerConnection.command(DefaultServerConnection.java:176)
at com.mongodb.operation.CommandOperationHelper.executeWrappedCommandProtocol(CommandOperationHelper.java:216)
at com.mongodb.operation.CommandOperationHelper.executeWrappedCommandProtocol(CommandOperationHelper.java:207)
at com.mongodb.operation.CommandOperationHelper.executeWrappedCommandProtocol(CommandOperationHelper.java:113)
at com.mongodb.operation.FindOperation$1.call(FindOperation.java:516)
at com.mongodb.operation.FindOperation$1.call(FindOperation.java:510)
at com.mongodb.operation.OperationHelper.withConnectionSource(OperationHelper.java:431)
at com.mongodb.operation.OperationHelper.withConnection(OperationHelper.java:404)
at com.mongodb.operation.FindOperation.execute(FindOperation.java:510)
at com.mongodb.operation.FindOperation.execute(FindOperation.java:81)
at com.mongodb.Mongo.execute(Mongo.java:836)
at com.mongodb.Mongo$2.execute(Mongo.java:823)
at com.mongodb.DBCursor.initializeCursor(DBCursor.java:870)
at com.mongodb.DBCursor.hasNext(DBCursor.java:142)
at org.springframework.data.mongodb.core.MongoTemplate.executeFindMultiInternal(MongoTemplate.java:1964)
... 37 common frames omitted
Caused by: java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:210)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at com.mongodb.connection.SocketStream.read(SocketStream.java:85)
at com.mongodb.connection.InternalStreamConnection.receiveResponseBuffers(InternalStreamConnection.java:494)
at com.mongodb.connection.InternalStreamConnection.receiveMessage(InternalStreamConnection.java:224)
... 57 common frames omitted
Java version 1.8.
Spring Boot version 1.5.3.
Deployed with Docker.
mongo.hosts=ip:port,ip:port,ip:port
mongo.database.name=dbname
mongo.username=username
mongo.password=pwd
mongo.connections.per.host=32
mongo.max.wait.time=2000
mongo.connect.timeout=2000
You can try autoConnectRetry:
autoConnectRetry simply means the driver will automatically attempt to reconnect to the server(s) after unexpected disconnects. In production environments you usually want this set to true.
This is from another post: How to configure MongoDB Java driver MongoOptions for production use?
For everybody experiencing the same random MongoSocketReadException: you may need the socketTimeoutMS or maxIdleTimeMS parameters instead. The autoConnectRetry parameter is no longer exposed in the MongoDB connection string.
Our situation: we switched to the MongoDB Atlas serverless solution for our development and testing environments, and ever since then we have been getting this MongoSocketReadException roughly every 15 minutes, or at random. We are also behind an enterprise firewall.
According to https://www.mongodb.com/docs/v6.0/tutorial/connection-pool-performance-tuning/:
a misconfigured firewall closes a socket connection incorrectly and the driver cannot detect that the connection closed improperly.
So you need to use socketTimeoutMS to ensure that sockets are always closed. Set socketTimeoutMS to two or three times the length of the slowest operation that the driver runs. This matters because socketTimeoutMS defaults to 0, which never times out.
The maxIdleTimeMS parameter may also affect the connection: if the socket is closed but the client side does not detect it, the connection keeps waiting idle and is never closed. By default it is also 0, meaning it waits forever with no upper bound.
So configuring it to a smaller value may help the driver close the problematic connection with its dead socket before it tries to reach the database over the same connection and presumes the connection is still there.
So our solution:
...mongodbUri...?socketTimeoutMS=150000&maxIdleTimeMS=150000
We changed socketTimeoutMS from 0 to 150 seconds (150000 ms) and did the same for maxIdleTimeMS.
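If you configure the driver programmatically rather than through the URI (as the properties in the question above suggest), the equivalent with the 3.x Java driver would be roughly the following sketch; the URI string is a placeholder and the values mirror the ones above:

import com.mongodb.MongoClient;
import com.mongodb.MongoClientOptions;
import com.mongodb.MongoClientURI;

// Same idea as the URI parameters: time out sockets and idle connections after 150 s,
// so the driver stops reusing a connection whose socket was silently dropped.
MongoClientOptions.Builder options = MongoClientOptions.builder()
    .socketTimeout(150000)            // socketTimeoutMS
    .maxConnectionIdleTime(150000);   // maxIdleTimeMS

MongoClient client = new MongoClient(new MongoClientURI("mongodb://ip:port,ip:port,ip:port/dbname", options));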
We have 10 Cassandra nodes in production running Cassandra 2.1.8. We recently upgraded to version 2.1.8; previously we were using only 3 nodes running Cassandra 2.1.2. First we upgraded the initial 3 nodes from 2.1.2 to 2.1.8 (following the procedure described in Upgrading Cassandra), then we added 7 more nodes running Cassandra 2.1.8 to the cluster and started our client programs. For the first few hours everything worked fine, but after a few hours we saw errors like the following in the client program logs:
Thread-0 [29/07/15 17:41:23.356] ERROR com.cleartrail.entityprofiling.engine.InterpretationWriter - Error:com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: [/172.50.33.161:9041, /172.50.33.162:9041, /172.50.33.95:9041, /172.50.33.96:9041, /172.50.33.165:9041, /172.50.33.166:9041, /172.50.33.163:9041, /172.50.33.164:9041, /172.50.33.42:9041, /172.50.33.167:9041] - use getErrors() for details)
at com.datastax.driver.core.exceptions.NoHostAvailableException.copy(NoHostAvailableException.java:65)
at com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:259)
at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:175)
at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:52)
at com.cleartrail.entityprofiling.engine.InterpretationWriter.WriteInterpretation(InterpretationWriter.java:430)
at com.cleartrail.entityprofiling.engine.Profiler.buildProfile(Profiler.java:1042)
at com.cleartrail.messageconsumer.consumer.KafkaConsumer.run(KafkaConsumer.java:336)
Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: [/172.50.33.161:9041, /172.50.33.162:9041, /172.50.33.95:9041, /172.50.33.96:9041, /172.50.33.165:9041, /172.50.33.166:9041, /172.50.33.163:9041, /172.50.33.164:9041, /172.50.33.42:9041, /172.50.33.167:9041] - use getErrors() for details)
at com.datastax.driver.core.RequestHandler.sendRequest(RequestHandler.java:102)
at com.datastax.driver.core.RequestHandler$1.run(RequestHandler.java:176)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Now, I have double-checked the firewall (as suggested in a few posts), the ports, and the timeouts on both the client and the nodes, and they are all correct.
I am also not closing the connection anywhere in between. I am using batch queries with a batch size of 1000; the queries are updates of counters in a table with three columns
entity, twfwv, cvalue
where entity and twfwv are text columns forming the primary key and cvalue is a counter column.
I even restarted all my nodes (because this trick helped me in my dev environment when I faced the same exception), but it's not helping. Please suggest what the probable problem could be here.
My issue was resolved by checking the errors collection of NoHostAvailableException, as advised by Olivier Michallat in the comments. For me it was the protocol version in the cluster configuration: mine was null, and setting it to 3 fixed the problem.
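For anyone who wants to do the same check, a small sketch of inspecting the per-host errors (assuming the 2.1-era DataStax Java driver; session and statement are placeholders):

import java.net.InetSocketAddress;
import java.util.Map;
import com.datastax.driver.core.exceptions.NoHostAvailableException;

try {
    session.execute(statement);
} catch (NoHostAvailableException e) {
    // "All host(s) tried for query failed" hides the real cause; getErrors() exposes it per host.
    for (Map.Entry<InetSocketAddress, Throwable> entry : e.getErrors().entrySet()) {
        System.err.println(entry.getKey() + " failed: " + entry.getValue());
    }
}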
My issue was resolved by adding a property to enable or disable the custom TokenAwarePolicy load balancing my connection was using, and relying on the default policy instead.
Specifically, I was trying to get a local Spring Boot app talking to a single dockerized Cassandra instance.
Cluster.Builder builder = Cluster.builder()
    .addContactPoints(cassandraProperties.getHosts())
    .withPort(cassandraProperties.getPort())
    .withProtocolVersion(ProtocolVersion.V4)
    .withRetryPolicy(new LoggingRetryPolicy(DefaultRetryPolicy.INSTANCE))
    .withCredentials(cassandraProperties.getUsername(), cassandraProperties.getPassword())
    .withCodecRegistry(codecRegistry);

if (loadBalanced) {
    builder.withLoadBalancingPolicy(
        new TokenAwarePolicy(DCAwareRoundRobinPolicy.builder().withLocalDc(localDc).build()));
}
I found out that when I attach a debugger to the application and start debugging, the connection to the Terracotta server is lost (?), and the following messages appear in the Terracotta server logs:
2012-03-30 13:45:06,758 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Server - 127.0.0.1:55112 might be in Long GC. GC count since last ping reply : 1
2012-03-30 13:45:27,761 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Server - 127.0.0.1:55112 might be in Long GC. GC count since last ping reply : 1
2012-03-30 13:45:31,761 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Server - 127.0.0.1:55112 might be in Long GC. GC count since last ping reply : 2
...
2012-03-30 13:46:37,768 [L2_L1:TCComm Main Selector Thread_R (listen 0.0.0.0:9510)] ERROR com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Server - 127.0.0.1:55112 might be in Long GC. GC count since last ping reply : 10. But its too long. No more retries
2012-03-30 13:46:38,768 [HealthChecker] INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Server - 127.0.0.1:55112 is DEAD
2012-03-30 13:46:38,768 [HealthChecker] ERROR com.tc.net.protocol.transport.ConnectionHealthCheckerImpl: DSO Server - Declared connection dead ConnectionID(1.0b1994ac80f14b7191080bdc3f38582a) idle time 45317ms
2012-03-30 13:46:38,768 [L2_L1:TCWorkerComm # 0_R] WARN com.tc.net.protocol.transport.ServerMessageTransport - ConnectionID(1.0b1994ac80f14b7191080bdc3f38582a): CLOSE EVENT : com.tc.net.core.TCConnectionJDK14#5158277: connected: false, closed: true local=127.0.0.1:9510 remote=127.0.0.1:55112 connect=[Fri Mar 30 13:34:22 BST 2012] idle=2001ms [207584 read, 229735 write]. STATUS : DISCONNECTED
...
2012-03-30 13:46:38,799 [L2_L1:TCWorkerComm # 0_R] INFO com.tc.objectserver.persistence.sleepycat.SleepycatPersistor - Deleted client state for ChannelID=[1]
2012-03-30 13:46:38,801 [WorkerThread(channel_life_cycle_stage, 0)] INFO com.tc.objectserver.handler.ChannelLifeCycleHandler - : Received transport disconnect. Shutting down client ClientID[1]
2012-03-30 13:46:38,801 [WorkerThread(channel_life_cycle_stage, 0)] INFO com.tc.objectserver.persistence.impl.TransactionStoreImpl - shutdownClient() : Removing txns from DB : 0
After this happens, any cache operation, such as getWithLoader, simply does not respond until the Terracotta server is restarted.
Question: how can this be fixed or reconfigured? I assume it can also happen in production (and actually sometimes does) if, for any reason, the application hangs or stalls.
This is just to get you started.
TC connections between server and client are considered dead when the applicable HealthCheck fails. The default values for the HealthCheck assume a very stable and performant network. I recommend you familiarize yourself with the details and the calculations at
http://www.terracotta.org/documentation/3.5.2/terracotta-server-array/high-availability#85916
So typically you begin with
a) making sure your network doesn't hiccup occasionally
b) setting the TC HealthCheck values a bit higher
If the problem persists, I'd recommend posting directly on the TC forums (they'll help you even if you only use the open-source edition, though it may take a few days to get a reply).