Hazelcast Operation Heartbeat Timeouts appearing sporadically - java

We have a Hazelcast client (3.7.4):
//Initializes Hazelcast client config
ClientConfig aHazelcastClientConfig = new ClientConfig();
String aHazelcastUrl = this.getHost()+":"+this.getPort().toString();
ClientNetworkConfig aHazelcastNetworkConfig=
aHazelcastClientConfig.getNetworkConfig();
aHazelcastNetworkConfig.addAddress(aHazelcastUrl);
GroupConfig group = new GroupConfig (getGroupName(),getGroupPassword());
aHazelcastClientConfig.setGroupConfig(group);
HazelcastInstance aHazelcastClient=
HazelcastClient.newHazelcastClient(aHazelcastClientConfig);
...
IMap aMonitoredMap = aHazelcastClient.getMap(getMonitoredMap());
that periodically checks one HZ Server (3.7.4), and we have observed sometimes next exceptions are appearing in the client side:
InitializeDistributedObjectOperation invocation failed to complete due to operation-heartbeat-timeout. Current time: 2017-02-07 18:07:30.329. Total elapsed time: 120189 ms. Last operation heartbeat: never. Last operation heartbeat from member: 2017-02-07 18:05:37.489. Invocation{op=com.hazelcast.spi.impl.proxyservice.impl.operations.InitializeDistributedObjectOperation{serviceName='hz:impl:mapService', identityHash=9759664, partitionId=-1, replicaIndex=0, callId=0, invocationTime=1486487130140 (2017-02-07 18:05:30.140), waitTimeout=-1, callTimeout=60000}, tryCount=1, tryPauseMillis=500, invokeCount=1, callTimeoutMillis=60000, firstInvocationTimeMs=1486487130140, firstInvocationTime='2017-02-07 18:05:30.140', lastHeartbeatMillis=0, lastHeartbeatTime='1970-01-01 01:00:00.000', target=[10.118.152.82]:5720, pendingResponse={VOID}, backupsAcksExpected=0, backupsAcksReceived=0, connection=Connection[id=7, /172.22.191.200:5720->/10.118.152.82:42563, endpoint=[10.118.152.82]:5720, alive=true, type=MEMBER]}
It seems the maximum call waiting timeout (by default 60000 msecs) is being reached. In the above example, the total elapsed time is more than 2 minutes (120189 ms)
This problem is appearing sporadically, without any regular appearance pattern.
It seems the network is working correctly when it has appeared, so we can discard some network connectivity issue.
Any hint or recommendation about which reasons could provoke it?
Thanks a lot.
Best Regards,
Jorge

Related

Redisson client ; RedisTimeoutException issue

I am using Google cloud managed redis cluster(v5) via redisson(3.12.5)
Following are my SingleServer configurations in yaml file
singleServerConfig:
idleConnectionTimeout: 10000
connectTimeout: 10000
timeout: 3000
retryAttempts: 3
retryInterval: 1500
password: null
subscriptionsPerConnection: 5
clientName: null
address: "redis://127.0.0.1:6379"
subscriptionConnectionMinimumIdleSize: 1
subscriptionConnectionPoolSize: 50
connectionMinimumIdleSize: 40
connectionPoolSize: 250
database: 0
dnsMonitoringInterval: 5000
threads: 0
nettyThreads: 0
codec: !<org.redisson.codec.JsonJacksonCodec> {}
I am getting following exceptions when I increase the load on my application
org.redisson.client.RedisTimeoutException: Unable to acquire connection! Increase connection pool size and/or retryInterval settings Node source: NodeSource
org.redisson.client.RedisTimeoutException: Command still hasn't been written into connection! Increase nettyThreads and/or retryInterval settings. Payload size in bytes: 34. Node source: NodeSource
It seems there is no issue on redis cluster and i think i need to make tweaking in my client side redis connection pooling confs(mentioned above) to make it work.
Please suggest me the changes i need to make in my confs
I am also curious if I should close the Redis connection after making get/set calls. I have tried finding this but found nothing conclusive on how to close Redis connections
One last thing that I want to ask is that is there any mechanism to get Redis connection pool stats(active connection, idle connection etc ) in Redisson
Edit1:
I have tried by changing values following values in 3 different iterations
Iteration 1:
idleConnectionTimeout: 30000
connectTimeout: 30000
timeout: 30000
Iteration 2:
nettyThreads: 0
Iteration 3:
connectionMinimumIdleSize: 100
connectionPoolSize: 750
I have tried these things but nothing has worked for me
Any help is appreciated.
Thanks in advance
Assuming you are getting low memory alerts on your cache JVM.
You may have to analyze the traffic and determine 2 things
Too many parallel cache persists.
Huge chunk of data being persisted.
Both can be determined by the traffic on your server.
For option 1 configuring pool-size would solve you issue, but for option 2 you may have to refactor your code to persist data in smaller chunks.
Try to set nettyThreads = 64 settings

Hazelcast - client mode topology / distributed map lock issue

Below is the description of problem we faced in production. Please note that I could not reproduce the issue in test or local environment and therfore can not provide you with test code.
We have a hazelcast cluster with two members M1, M2 and three clients C1,C2,C3. Hazelcast version is 3.9.
Clients use IMap.tryLock() method with timeout of 10 seconds. After getting the lock, critical and long running operations are performed and finally the lock is released using IMap.unlock() method.
The problem occured in production is as follows:
At some time instant t, we first saw heartbeat failure to M2 at client C2. Afterwards there are errors in fetching partition table casued by com.hazelcast.spi.exception.TargetDisconnectedException:
[hz.client_0.internal-2 ] WARN [] HeartbeatManager - hz.client_0 [mygroup] [3.9] HeartbeatManager failed to connection: .....
[hz.client_0.internal-3 ] WARN [] ClientPartitionService - hz.client_0 [mygroup] [3.9] Error while fetching cluster partition table!
java.util.concurrent.ExecutionException: com.hazelcast.spi.exception.TargetDisconnectedException: Heartbeat timed out to owner connection ClientConnection{alive=true, connectionId=1, ......
Around 250 ms after initial heartbeat failure, client gets disconnected and then reconnects in 20 ms.
[hz.client_0.cluster- ] INFO [] LifecycleService - hz.client_0 [mygroup] [3.9] HazelcastClient 3.9 (20171023 - b29f549) is CLIENT_DISCONNETED
[hz.client_0.cluster- ] INFO [] LifecycleService - hz.client_0 [mygroup] [3.9] HazelcastClient 3.9 (20171023 - b29f549) is CLIENT_CONNECTED
The problem we are having is, for some keys that are previously acquired by C2, C1 and C3 can not acquire the lock even if it seems to be released by C2. C2 can get the lock, but this puts unacceptable delays
to the application and is not acceptable.. All clients should get since lock is released...
We were notified of the problem after receiving complaints, and then restarted the client application C2.
As documented in http://docs.hazelcast.org/docs/latest-development/manual/html/Distributed_Data_Structures/Lock.html, locks acquired by restarted member (C2 in my case) seemed to be removed after restart operation.
Currently the issue seems to go away, but we are not sure if it will recur.
Do you have any suggestions about the probable cause and more importantly do you have any recommendations?
Would enabling redo-operation in client help for this problem case?
As I tried to explain client seems to recover the problem, but keys remain locked in cluster and this is fatal to my application.
Thanks
It looks like the client had lost the ownership of the lock because of its disconnection from the cluster. You can use IMap#forceUnlock API in cases such as you faced. It releases the lock regardless of the lock owner and it always successfully unlocks, never blocks, and returns immediately.

Jetty Websocket IdleTimeout

I've been working on annotated websockets lately, with the Jetty API (9.4.5 release) , and made a chat with it.
However i got an issue, after 5 minutes (which i believe is the default timer), the session is closed (it is not due to an error).
The only solution I've found yet, is to notify my socket On closing event and reopen the connection in a new socket.
However i've read on stackOverflow, that by setting IdleTimeOut in the WebsocketPolicy, i could avoid the issue:
I've tried setting to 3600000 for instance, but the behavior does not change at all
I also tried to set it to -1 but i get the following error: IdleTimeout [-1] must be a greater than or equal to 0
private ServletContextHandler setupWebsocketContext() {
ServletContextHandler websocketContext = new AmosContextHandler(ServletContextHandler.SESSIONS | ServletContextHandler.SECURITY);
WebSocketHandler socketCreator = new WebSocketHandler(){
#Override
public void configure(WebSocketServletFactory factory){
factory.getPolicy().setIdleTimeout(-1);
factory.getPolicy().setMaxTextMessageBufferSize(MAX_MESSAGE_SIZE);
factory.getPolicy().setMaxBinaryMessageBufferSize(MAX_MESSAGE_SIZE);
factory.getPolicy().setMaxTextMessageSize(MAX_MESSAGE_SIZE);
factory.getPolicy().setMaxBinaryMessageSize(MAX_MESSAGE_SIZE);
factory.setCreator(new UpgradedSocketCreator());
}
};
ServletHolder sh = new ServletHolder(new WebsocketChatServlet());
websocketContext.addServlet(sh, "/*");
websocketContext.setContextPath("/Chat");
websocketContext.setHandler(socketCreator);
websocketContext.getSessionHandler().setMaxInactiveInterval(0);
return websocketContext;
}
I've also tried to change the policy directly in the OnConnect event, by using the call session.getpolicy.setIdleTimeOut(), but I haven't noticed any results.
Is this an expected behavior or am I missing something? Thanks for your help.
EDIT:
Log on the closure:
Client Side:
2017-07-03T12:48:00.552 DEBUG HttpClient#179313750-scheduler Ignored idle endpoint SocketChannelEndPoint#2fb4b627{localhost/127.0.0.1:5080<->/127.0.0.1:53835,OPEN,fill=-,flush=-,to=1/300000}{io=0/0,kio=0,kro=1}->WebSocketClientConnection#e0198ece[ios=IOState#3ac0ec79[CLOSING,in,!out,close=CloseInfo[code=1000,reason=null],clean=false,closeSource=LOCAL],f=Flusher[queueSize=0,aggregateSize=0,failure=null],g=Generator[CLIENT,validating],p=Parser#65c4d838[ExtensionStack,s=START,c=0,len=187,f=null]]
Server side:
2017-07-03T12:48:00.595 DEBUG Idle pool thread onClose WebSocketServerConnection#e0033d54[ios=IOState#10d40dca[CLOSED,!in,!out,finalClose=CloseInfo[code=1000,reason=null],clean=true,closeSource=REMOTE],f=Flusher[queueSize=0,aggregateSize=0,failure=null],g=Generator[SERVER,validating],p=Parser#317213f3[ExtensionStack,s=START,c=0,len=2,f=CLOSE[len=2,fin=true,rsv=...,masked=true]]]<-SocketChannelEndPoint#690dfbfb'{'/127.0.0.1:53835<->/127.0.0.1:5080,CLOSED,fill=-,flush=-,to=1/360000000}'{'io=0/0,kio=-1,kro=-1}->WebSocketServerConnection#e0033d54[ios=IOState#10d40dca[CLOSED,!in,!out,finalClose=CloseInfo[code=1000,reason=null],clean=true,closeSource=REMOTE],f=Flusher[queueSize=0,aggregateSize=0,failure=null],g=Generator[SERVER,validating],p=Parser#317213f3[ExtensionStack,s=START,c=0,len=2,f=CLOSE[len=2,fin=true,rsv=...,masked=true]]]
2017-07-03T12:48:00.595 DEBUG Idle pool thread org.eclipse.jetty.util.thread.Invocable$InvocableExecutor#4f13dee2 invoked org.eclipse.jetty.io.ManagedSelector$$Lambda$193/682154970#551e133a
2017-07-03T12:48:00.595 DEBUG Idle pool thread EatWhatYouKill#6ba355e4/org.eclipse.jetty.io.ManagedSelector$SelectorProducer#7b1559f1/PRODUCING/0/1 produce exit
2017-07-03T12:48:00.595 DEBUG Idle pool thread ran EatWhatYouKill#6ba355e4/org.eclipse.jetty.io.ManagedSelector$SelectorProducer#7b1559f1/PRODUCING/0/1
2017-07-03T12:48:00.595 DEBUG Idle pool thread run EatWhatYouKill#6ba355e4/org.eclipse.jetty.io.ManagedSelector$SelectorProducer#7b1559f1/PRODUCING/0/1
2017-07-03T12:48:00.595 DEBUG Idle pool thread EatWhatYouKill#6ba355e4/org.eclipse.jetty.io.ManagedSelector$SelectorProducer#7b1559f1/PRODUCING/0/1 run
2017-07-03T12:48:00.597 DEBUG Idle pool thread 127.0.0.1 has disconnected !
2017-07-03T12:48:00.597 DEBUG Idle pool thread Disconnected: 127.0.0.1 (127.0.0.1) (statusCode= 1,000 , reason=null)
Annotated WebSockets have their own timeout settings in the annotation.
#WebSocket(maxIdleTime=30000)
The annotation #WebSocket has option:
int maxIdleTime() default -2;
In fact it's not clear what does it mean.
If you check implementation, you can find:
if (anno.maxIdleTime() > 0)
{
this.policy.setIdleTimeout(anno.maxIdleTime());
}
method implementation:
/**
* The time in ms (milliseconds) that a websocket may be idle before closing.
*
* #param ms
* the timeout in milliseconds
*/
public void setIdleTimeout(long ms)
{
assertGreaterThan("IdleTimeout",ms,0);
this.idleTimeout = ms;
}
and finally:
/**
* The time in ms (milliseconds) that a websocket may be idle before closing.
* <p>
* Default: 300000 (ms)
*/
private long idleTimeout = 300000;
Conclusion: negative value apply default behavior (300000 ms). You need to configure 'idleTimeout' according your business value.
PS: solved my case with:
#WebSocket(maxIdleTime = Integer.MAX_VALUE)

Qpid Benchmark: Have questions related to quid benchmark

I am trying to benchmark Qpid with the following use case:
Default Qpid configs are used(ex: 2GB is the max memory set), broker and
client are on the same machine
I have 1 connection and 256 sessions per connection, each sessions has a
producer and consumer. So, there are 256 producers and 256 consumers
All the producers/consumers are created before they start
producing/consuming messages. Each producer/consumer is a thread and they
run parallely
Consumers start consuming(they wait with .receive()). All consumer are durableSubscribers
producers start producing messages, each producer produces only 1
message, so there are 256 messages produced in total
A fanout exchange is used(topic.fanout=fanout://amq.fanout//fanOutTopic),
and there are 256 consumers, each consumer receives 256 messages and so
there are 256*256 messages received in total
Following are the response times(RT's) for the messages:
Response time is defined as the difference in the time when the
message is sent to the broker and the time at which the message is received
at the consumer
min: 144.0 ms
max: 350454.0 ms
average: 151933.02 ms
stddev: 113347.89 ms
95th percentile: 330559.0 ms
Is there any thing that I am doing wrong fundamentally?. I am worried about
the avg response times of "152 secs". Is this expected from qpid?. I see
a pattern here, as the test is running the RT's are increasing linearly
over time.
Thank you,
Siva.

Terracotta Ehcache: server disconnects during debug

I found out, that when I connect by debugger to the application, and starting to debug,
the connection to terracotta server is lost (?) and in the terracotta server logs next messages are appeared:
2012-03-30 13:45:06,758 [L2_L1:TCComm Main Selector Thread_R (listen
0.0.0.0:9510)] WARN com.tc.net.protocol.transport.ConnectionHealthChecker Impl. DSO Server
- 127.0.0.1:55112 might be in Long GC. GC count since last ping reply : 1 2012-03-30 13:45:27,761 [L2_L1:TCComm Main Selector Thread_R
(listen 0.0.0.0:9510)] WARN
com.tc.net.protocol.transport.ConnectionHealthChecker Impl. DSO Server
- 127.0.0.1:55112 might be in Long GC. GC count since last ping reply : 1 2012-03-30 13:45:31,761 [L2_L1:TCComm Main Selector Thread_R
(listen 0.0.0.0:9510)] WARN
com.tc.net.protocol.transport.ConnectionHealthChecker Impl. DSO Server
- 127.0.0.1:55112 might be in Long GC. GC count since last ping reply : 2
...
2012-03-30 13:46:37,768 [L2_L1:TCComm Main Selector Thread_R (listen
0.0.0.0:9510)] ERROR com.tc.net.protocol.transport.ConnectionHealthChecke rImpl. DSO Server
- 127.0.0.1:55112 might be in Long GC. GC count since last ping reply : 10. But its too long. No more retries 2012-03-30 13:46:38,768
[HealthChecker] INFO
com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Server
- 127.0.0.1:55112 is DEAD 2012-03-30 13:46:38,768 [HealthChecker] ERROR com.tc.net.protocol.transport.ConnectionHealthCheckerImpl: DSO
Server - Declared connection dead
ConnectionID(1.0b1994ac80f14b7191080bdc3f38582a) idle time 45317ms
2012-03-30 13:46:38,768 [L2_L1:TCWorkerComm # 0_R] WARN
com.tc.net.protocol.transport.ServerMessageTransport -
ConnectionID(1.0b1994ac80f14b71 91080bdc3f38582a): CLOSE EVENT :
com.tc.net.core.TCConnectionJDK14#5158277: connected: false, closed:
true local=127.0.0.1:9510 remote=127.0.0 .1:55112 connect=[Fri Mar 30
13:34:22 BST 2012] idle=2001ms [207584 read, 229735 write]. STATUS :
DISCONNECTED
...
2012-03-30 13:46:38,799 [L2_L1:TCWorkerComm # 0_R] INFO
com.tc.objectserver.persistence.sleepycat.SleepycatPersistor - Deleted
client state fo r ChannelID=[1] 2012-03-30 13:46:38,801
[WorkerThread(channel_life_cycle_stage, 0)] INFO
com.tc.objectserver.handler.ChannelLifeCycleHandler - : Received tran
sport disconnect. Shutting down client ClientID[1] 2012-03-30
13:46:38,801 [WorkerThread(channel_life_cycle_stage, 0)] INFO
com.tc.objectserver.persistence.impl.TransactionStoreImpl - shutdownC
lient() : Removing txns from DB : 0
After this is happened, any operation with cache, like getWithLoader just doesn't answer, until terracotta server won't be restarted again.
Question: how can it be fixed/reconfigured? I assume, it can happen in production also (and actually sometimes happens) if for some (any) reason application will hang/staled/etc.
This is just to get you started.
TC connections betwee server and client are considered dead when the applicable HealthCheck fails. The default values for the HealthCheck assume a very stable and performant network. I recommend you familiarize yourself with the details and the calculations on
http://www.terracotta.org/documentation/3.5.2/terracotta-server-array/high-availability#85916
So typically you begin with
a) making sure your network doesn't hiccup occasionally
b) setting the TC HealthCheck values a bit higher
If the problem persists I'd recommend posting directly on the TC forums (they'll help you even if you only use the open-source edition, may take a few days to reply though.

Categories

Resources