I have a Java application which uses RMI for client/server communication.
To secure this communication the traffic is tunneled through an ssh connection.
Everything works well, except that the connection keeps getting closed automatically after a few seconds.
I have set the keep alive property true of:
SSHD connection
SSH client connection
ServerSocket server side
ClientSocket client side
A common connection routine of connecting to the register (port 4000) and invoking a method on an object (port 4005) outputs the following log:
INFO org.apache.sshd.server.session.ServerSession - Authentication succeeded
INFO org.apache.sshd.server.session.ServerSession - Received SSH_MSG_CHANNEL_OPEN direct-tcpip
INFO org.apache.sshd.server.channel.ChannelDirectTcpip - Receiving request for direct tcpip: hostToConnect=ThinkPad, portToConnect=4000, originatorIpAddress=127.0.0.1, originatorPort=64539
INFO org.apache.sshd.server.session.ServerSession - Received SSH_MSG_CHANNEL_OPEN direct-tcpip
INFO org.apache.sshd.server.channel.ChannelDirectTcpip - Receiving request for direct tcpip: hostToConnect=ThinkPad, portToConnect=4005, originatorIpAddress=127.0.0.1, originatorPort=64540
INFO org.apache.sshd.server.channel.ChannelDirectTcpip - Received SSH_MSG_CHANNEL_EOF on channel 1
INFO org.apache.sshd.server.channel.ChannelDirectTcpip - Send SSH_MSG_CHANNEL_CLOSE on channel 1
INFO org.apache.sshd.server.channel.ChannelDirectTcpip - Received SSH_MSG_CHANNEL_CLOSE on channel 1
INFO org.apache.sshd.server.channel.ChannelDirectTcpip - Closing channel 1 immediately
INFO org.apache.sshd.server.channel.ChannelDirectTcpip - Closing channel 1 immediately
INFO org.apache.sshd.server.channel.ChannelDirectTcpip - Send SSH_MSG_CHANNEL_EOF on channel 1
INFO org.apache.sshd.server.session.ServerSession - Closing session
INFO org.apache.sshd.server.channel.ChannelDirectTcpip - Closing channel 0 immediately
INFO org.apache.sshd.server.channel.ChannelDirectTcpip - Closing channel 0 immediately
INFO org.apache.sshd.server.channel.ChannelDirectTcpip - Send SSH_MSG_CHANNEL_EOF on channel 0
The line ** Received SSH_MSG_CHANNEL_EOF on channel 1 ** suggests that the method invoked on the object has generated an EOF message. This then causes the session to close...
Possible solutions I can think of:
Intercept or prevent the EOF message (but where and how?)
Try to configure the server side sessionfactory to ignore the EOF messages (feels wrong...)
RMI connections are pooled at the client and closed if not reused within 15 seconds. You can adjust this behaviour via system properties: see the Sun system properties page linked from the RMI home page.
Related
"thread_name":"org.springframework.kafka.KafkaListenerEndpointContainer#0-0-C-1","message":"[Consumer clientId=consumer-groupId-1-1, groupId=groupId-1] Initiating connection to node node.kafka.us-west-2.amazonaws.com:9098 (id: -2 rack: null) using address node.kafka.us-west-2.amazonaws.com/10.x.x.x"}
"thread_name":"org.springframework.kafka.KafkaListenerEndpointContainer#0-0-C-1","message":"[Consumer clientId=consumer-groupId-1-1, groupId=groupId-1] Set SASL client state to SEND_APIVERSIONS_REQUEST"}
"thread_name":"org.springframework.kafka.KafkaListenerEndpointContainer#0-0-C-1","message":"[Consumer clientId=consumer-groupId-1-1, groupId=groupId-1] Creating SaslClient: client=null;service=kafka;serviceHostname=node.kafka.us-west-2.amazonaws.com;mechs=[AWS_MSK_IAM]"}
"thread_name":"org.springframework.kafka.KafkaListenerEndpointContainer#0-0-C-1","message":"Setting SASL/AWS_MSK_IAM client state to SEND_CLIENT_FIRST_MESSAGE"}
"thread_name":"org.springframework.kafka.KafkaListenerEndpointContainer#0-0-C-1","message":"[Consumer clientId=consumer-groupId-1-1, groupId=groupId-1] Created socket with SO_RCVBUF = 65562, SO_SNDBUF = 131124, SO_TIMEOUT = 0 to node -2"}
"thread_name":"org.springframework.kafka.KafkaListenerEndpointContainer#0-0-C-1","message":"[Consumer clientId=consumer-groupId-1-1, groupId=groupId-1] Completed connection to node -2. Fetching API versions."}
"thread_name":"org.springframework.kafka.KafkaListenerEndpointContainer#0-0-C-1","message":"[Consumer clientId=consumer-groupId-1-1, groupId=groupId-1] Connection with node.kafka.us-west-2.amazonaws.com/10.x.x.x disconnected","stack_trace":"java.io.IOException: Connection reset by peer\n\tat sun.nio.ch.FileDispatcherImpl.read0(FileDispatcherImpl.java)\n\tat sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)\n\tat sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:276)\n\tat sun.nio.ch.IOUtil.read(IOUtil.java:245)\n\tat sun.nio.ch.IOUtil.read(IOUtil.java:223)\n\tat sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:356)\n\tat org.apache.kafka.common.network.SslTransportLayer.readFromSocketChannel(SslTransportLayer.java:228)\n\tat org.apache.kafka.common.network.SslTransportLayer.handshake(SslTransportLayer.java:291)\n\tat org.apache.kafka.common.network.KafkaChannel.prepare(KafkaChannel.java:178)\n\tat org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:543)\n\tat org.apache.kafka.common.network.Selector.poll(Selector.java:481)\n\tat org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:560)\n\tat org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:265)\n\tat org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:236)\n\tat org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:215)\n\tat org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureCoordinatorReady(AbstractCoordinator.java:246)\n\tat org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.coordinatorUnknownAndUnready(ConsumerCoordinator.java:459)\n\tat org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:487)\n\tat org.apache.kafka.clients.consumer.KafkaConsumer.updateAssignmentMetadataIfNeeded(KafkaConsumer.java:1262)\n\tat org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1231)\n\tat org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1211)\n\tat org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.pollConsumer(KafkaMessageListenerContainer.java:1584)\n\tat org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.doPoll(KafkaMessageListenerContainer.java:1559)\n\tat org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.pollAndInvoke(KafkaMessageListenerContainer.java:1360)\n\tat org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.run(KafkaMessageListenerContainer.java:1274)\n\t... 4 frames truncated\n"}
"level":"DEBUG","logger_name":"o.a.k.common.network.SslTransportLayer","thread_name":"org.springframework.kafka.KafkaListenerEndpointContainer#0-0-C-1","message":"[SslTransportLayer channelId=-2 key=channel=java.nio.channels.SocketChannel[connection-pending remote=node.kafka.us-west-2.amazonaws.com/10.x.x.x:9098], selector=sun.nio.ch.KQueueSelectorImpl#690f136f, interestOps=8, readyOps=0] Failed to send SSL Close message","stack_trace":"java.io.IOException: Unexpected status returned by SSLEngine.wrap, expected CLOSED, received OK. Will not send close message to peer.\n\tat org.apache.kafka.common.network.SslTransportLayer.close(SslTransportLayer.java:194)\n\tat org.apache.kafka.common.utils.Utils.closeAll(Utils.java:974)\n\tat org.apache.kafka.common.network.KafkaChannel.close(KafkaChannel.java:155)\n\tat org.apache.kafka.common.network.Selector.doClose(Selector.java:955)\n\tat org.apache.kafka.common.network.Selector.close(Selector.java:939)\n\tat org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:625)\n\tat org.apache.kafka.common.network.Selector.poll(Selector.java:481)\n\tat org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:560)\n\tat org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:265)\n\tat org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:236)\n\tat org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:215)\n\tat org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureCoordinatorReady(AbstractCoordinator.java:246)\n\tat org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.coordinatorUnknownAndUnready(ConsumerCoordinator.java:459)\n\tat org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:487)\n\tat org.apache.kafka.clients.consumer.KafkaConsumer.updateAssignmentMetadataIfNeeded(KafkaConsumer.java:1262)\n\tat org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1231)\n\tat org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1211)\n\tat org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.pollConsumer(KafkaMessageListenerContainer.java:1584)\n\tat org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.doPoll(KafkaMessageListenerContainer.java:1559)\n\tat org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.pollAndInvoke(KafkaMessageListenerContainer.java:1360)\n\tat org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer.run(KafkaMessageListenerContainer.java:1274)\n\tat java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)\n\tat java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:264)\n\tat java.util.concurrent.FutureTask.run(FutureTask.java)\n\tat java.lang.Thread.run(Thread.java:829)\n"}
{"#timestamp":"2022-09-29 08:05:12.245-0700","level":"INFO","logger_name":"org.apache.kafka.clients.NetworkClient","thread_name":"org.springframework.kafka.KafkaListenerEndpointContainer#0-0-C-1","message":"[Consumer clientId=consumer-groupId-1-1, groupId=groupId-1] Node -2 disconnected."}
I have attempted to follow all the directions from here and here.
Ive tested my iam policy using aws kafka <various> .
Ive checked that port 9098 is open.
My policy has everything permitted. If I can get this working I'll deal with limiting permissions later.
Just trying to get a consumer to start at this point.
Suggestions on where to look for the issue?
Edit:
Added some ssl debug to the call. Am seeing CLIENTHELLO being sent but nothing back from the server. Just the auth failure.
Using openssl s_client -connect <host> -tls1_2 I was able to get a "Verify return code: 0 (ok)" from the server
More:
I think something is blocking the SSL request:
CONNECTED(00000005)
write:errno=54
---
no peer certificate available
---
No client certificate CA names sent
---
SSL handshake has read 0 bytes and written 0 bytes
I can view the open port via NMAP so Im not sure what's happening there either.
Just to close this off: I think MSK doesn't allow iam auth from outside a VPC. I was able to get my test working from EKS but not otherwise.
I am starting to study how can I implement an application supporting Failover/FaultTolerance on top of JMS, more precisely EMS
I configured two EMS servers working both with FaultTolerance enabled:
For EMS running on server on server1 I have
in tibemsd.conf
ft_active = tcp://server2:7232
in factories.conf
[GenericConnectionFactory]
type = generic
url = tcp://server1:7232
[FTTopicConnectionFactory]
type = topic
url = tcp://server1:7232,tcp://server2:7232
[FTQueueConnectionFactory]
type = queue
url = tcp://server1:7232,tcp://server2:7232
And for EMS running on server on server2 I have
in tibemsd.conf
ft_active = tcp://server1:7232
in factories.conf
[GenericConnectionFactory]
type = generic
url = tcp://server2:7232
[FTTopicConnectionFactory]
type = topic
url = tcp://server2:7232,tcp://server1:7232
[FTQueueConnectionFactory]
type = queue
url = tcp://server2:7232,tcp://server1:7232
I am not a TIBCO EMS expert but my config seems to be good: When I start EMS on server1 I get:
$ tibemsd -config tibemsd.conf
...
2022-07-20 23:04:58.566 Server is active.
2022-07-20 23:05:18.563 Standby server 'SERVERNAME#server1' has connected.
then if I start EMS on server2, I get
$ tibemsd -config tibemsd.conf
...
2022-07-20 23:05:18.564 Accepting connections on tcp://server2:7232.
2022-07-20 23:05:18.564 Server is in standby state for 'tcp://server1:7232'
Moreover, if I kill active EMS on server1, I immediately get the following message on server2:
2022-07-20 23:21:52.891 Connection to active server 'tcp://server1:7232' has been lost.
2022-07-20 23:21:52.891 Server activating on failure of 'tcp://server1:7232'.
...
2022-07-20 23:21:52.924 Server is now active.
Until here, everything looks OK, active/standby EMS servers seems to be correctly configured
Things get more complicated when I write a piece of code how is supposed to connect to these EMS servers and to periodically publish messages. Let's try with the following code sample:
#Test
public void testEmsFailover() throws JMSException, InterruptedException {
int NB = 1000;
TibjmsConnectionFactory factory = new TibjmsConnectionFactory();
factory.setServerUrl("tcp://server1:7232,tcp://server2:7232");
Connection connection = factory.createConnection();
Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
connection.start();
for (int i = 0; i < NB; i++) {
LOG.info("sending message");
Queue queue = session.createQueue(QUEUE__CLIENT_TO_FRONTDOOR__CONNECTION_REQUEST);
MessageProducer producer = session.createProducer(queue);
MapMessage mapMessage = session.createMapMessage();
mapMessage.setStringProperty(PROPERTY__CLIENT_KIND, USER.toString());
mapMessage.setStringProperty(PROPERTY__CLIENT_NAME, "name");
producer.send(mapMessage);
LOG.info("done!");
Thread.sleep(1000);
}
}
If I run this code while both active and standby servers are up, everything looks good
23:26:32.431 [main] INFO JmsEndpointTest - sending message
23:26:32.458 [main] INFO JmsEndpointTest - done!
23:26:33.458 [main] INFO JmsEndpointTest - sending message
23:26:33.482 [main] INFO JmsEndpointTest - done!
Now If I kill the active EMS server, I would expect that
the standby server would instantaneously become the active one
my code would continue to publish such as if nothing had happened
However, in my code I get the following error:
javax.jms.JMSException: Connection is closed
at com.tibco.tibjms.TibjmsxLink.sendRequest(TibjmsxLink.java:307)
at com.tibco.tibjms.TibjmsxLink.sendRequestMsg(TibjmsxLink.java:261)
at com.tibco.tibjms.TibjmsxSessionImp._createProducer(TibjmsxSessionImp.java:1004)
at com.tibco.tibjms.TibjmsxSessionImp.createProducer(TibjmsxSessionImp.java:4854)
at JmsEndpointTest.testEmsFailover(JmsEndpointTest.java:103)
...
and in the logs of the server (the previous standby server supposed to be now the active one) I get
2022-07-20 23:32:44.447 [anonymous#cersei]: connect failed: server not in active state
2022-07-20 23:33:02.969 Connection to active server 'tcp://server2:7232' has been lost.
2022-07-20 23:33:02.969 Server activating on failure of 'tcp://server2:7232'.
2022-07-20 23:33:02.969 Server rereading configuration.
2022-07-20 23:33:02.971 Recovering state, please wait.
2022-07-20 23:33:02.980 Recovered 46 messages.
2022-07-20 23:33:02.980 Server is now active.
2022-07-20 23:33:03.545 [anonymous#cersei]: reconnect failed: connection unknown for id=8
2022-07-20 23:33:04.187 [anonymous#cersei]: reconnect failed: connection unknown for id=8
2022-07-20 23:33:04.855 [anonymous#cersei]: reconnect failed: connection unknown for id=8
2022-07-20 23:33:05.531 [anonymous#cersei]: reconnect failed: connection unknown for id=8
I would appreciate any help to enhance my code
Thank you
I think I found the origin of my problem:
according to the page Tibco-Ems Failover Issue, the error message
reconnect failed: connection unknown for id=8
means: "the store (ems db) was'nt share between the active and the standby node, so when the active ems failed, the new active ems was'nt able to recover connections and messages."
I realized that it is painful to configure a shared store. To avoid it, I configured two tibems on the same host, by following the page Step By Step How to Setup TIBCO EMS In Fault Tolerant Mode:
two tibemsd.conf configuration files
configure a different listen port in each file
configure ft_active with url of other server
configure factories.conf
By doing so, I can replay my test and it works as expected
I have deployed Spring Boot application that has a Database based queue with jobs on App Service.
Yesterday I performed a few Scale out and Scale in operations while the application was working to see how it will behave.
At some point (not necessary related to scaling operations) application started to throw Hikari errors.
com.zaxxer.hikari.pool.PoolBase : HikariPool-1 - Failed to validate connection org.postgresql.jdbc.PgConnection#1ae66f34 (This connection has been closed.). Possibly consider using a shorter maxLifetime value.
com.zaxxer.hikari.pool.ProxyConnection : HikariPool-1 - Connection org.postgresql.jdbc.PgConnection#1ef85079 marked as broken because of SQLSTATE(08006), ErrorCode(0)
The following are stack traces from my scheduled job in spring and other information:
org.postgresql.util.PSQLException: An I/O error occurred while sending to the backend.
Caused by: javax.net.ssl.SSLException: Connection reset by peer (Write failed)
Suppressed: java.net.SocketException: Broken pipe (Write failed)
Caused by: java.net.SocketException: Connection reset by peer (Write failed)
Next the following stack of errors:
WARN 1 --- [ scheduling-1] com.zaxxer.hikari.pool.PoolBase : HikariPool-1 - Failed to validate connection org.postgresql.jdbc.PgConnection#48d0d6da (This connection has been closed.).
Possibly consider using a shorter maxLifetime value.
org.springframework.jdbc.support.MetaDataAccessException: Error while extracting DatabaseMetaData; nested exception is java.sql.SQLException: Connection is closed
Caused by: java.sql.SQLException: Connection is closed
The code which is invoked periodically - every 500 milliseconds is here:
#Scheduled(fixedDelayString = "${worker.delay}")
#Transactional
public void execute() {
jobManager.next(jobClass).ifPresent(this::handleJob);
}
Update.
The above code is almost all the time doing nothing, since there was no traffic on the website.
Update2. I've checked Postgres logs and found this:
2020-07-11 22:48:09 UTC-5f0866f0.f0-LOG: checkpoint starting: immediate force wait
2020-07-11 22:48:10 UTC-5f0866f0.f0-LOG: checkpoint complete (240): wrote 30 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.046 s, sync=0.046 s, total=0.437 s; sync files=13, longest=0.009 s, average=0.003 s; distance=163 kB, estimate=13180 kB
2020-07-11 22:48:10 UTC-5f0866ee.68-LOG: received immediate shutdown request
2020-07-11 22:48:10 UTC-5f0a3f41.8914-WARNING: terminating connection because of crash of another server process
2020-07-11 22:48:10 UTC-5f0a3f41.8914-DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
// Same text about 10 times
2020-07-11 22:48:10 UTC-5f0866f2.7c-HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-07-11 22:48:10 UTC-5f0866ee.68-LOG: src/port/kill.c(84): Process (272) exited OOB of pgkill.
2020-07-11 22:48:10 UTC-5f0866f1.fc-WARNING: terminating connection because of crash of another server process
2020-07-11 22:48:10 UTC-5f0866f1.fc-DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-07-11 22:48:10 UTC-5f0866f1.fc-HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-07-11 22:48:10 UTC-5f0866ee.68-LOG: archiver process (PID 256) exited with exit code 1
2020-07-11 22:48:11 UTC-5f0866ee.68-LOG: database system is shut down
It looks like it is a problem with Azure PostgresSQL server and it closed itself. Am I reading this right?
Like mentioned in your logs, have you tried setting maxLifetime property for the Hikari CP ? I think after setting that property this issue should be resolved.
Based on Hikari doc (https://github.com/brettwooldridge/HikariCP) --
maxLifetime
This property controls the maximum lifetime of a connection in the pool. An in-use connection will never be retired, only when it is closed will it then be removed. On a connection-by-connection basis, minor negative attenuation is applied to avoid mass-extinction in the pool. We strongly recommend setting this value, and it should be several seconds shorter than any database or infrastructure imposed connection time limit. A value of 0 indicates no maximum lifetime (infinite lifetime), subject of course to the idleTimeout setting. The minimum allowed value is 30000ms (30 seconds). Default: 1800000 (30 minutes)
I have the following code to start a netty server:
Application application = ApplicationTypeFactory.getApplication(type);
resteasyDeployment = new ResteasyDeployment();
// Explicitly setting the Application should prevent scanning
resteasyDeployment.setApplication(application);
// Need to set the provider factory to the default, otherwise the
// providers we need won't be registered, such as JSON mapping
resteasyDeployment.setProviderFactory(ResteasyProviderFactory.getInstance());
netty = new NettyJaxrsServer();
netty.setHostname(HOST);
netty.setPort(port);
netty.setDeployment(resteasyDeployment);
// Some optional extra configuration
netty.setKeepAlive(true);
netty.setRootResourcePath("/");
netty.setSecurityDomain(null);
netty.setIoWorkerCount(16);
netty.setExecutorThreadCount(16);
LOGGER.info("Starting REST server on: " + System.getenv("HOSTNAME") + ", port:" + port);
// Start the server
//("Starting REST server on " + System.getenv("HOSTNAME"));
netty.start();
LOGGER.info("Started!");
When I do:
netty.stop()
It doesn't appear to close the connection:
[john#dub-001948-VM01:~/workspace/utf/atf/src/test/resources ]$ netstat -anp | grep 8888
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
tcp 0 0 ::ffff:10.0.21.88:8888 ::ffff:10.0.21.88:60654 TIME_WAIT -
tcp 0 0 ::ffff:10.0.21.88:8888 ::ffff:10.0.21.88:60630 TIME_WAIT -
tcp 0 0 ::ffff:10.0.21.88:8888 ::ffff:10.0.21.88:60629 TIME_WAIT -
tcp 0 0 ::ffff:10.0.21.88:8888 ::ffff:10.0.21.88:60637 TIME_WAIT -
tcp 0 0 ::ffff:10.0.21.88:8888 ::ffff:10.0.21.88:60640 TIME_WAIT -
even after the program exits. In other posts I have read that netty does not close client connections on a stop. How do I shut it down cleanly?
Time_Wait is one of socket state. application can not do anything about it. more information here: http://www.isi.edu/touch/pubs/infocomm99/infocomm99-web/. For a busy server, you could tune some tcp/ip parameter to alleviate its impact.
I found out, that when I connect by debugger to the application, and starting to debug,
the connection to terracotta server is lost (?) and in the terracotta server logs next messages are appeared:
2012-03-30 13:45:06,758 [L2_L1:TCComm Main Selector Thread_R (listen
0.0.0.0:9510)] WARN com.tc.net.protocol.transport.ConnectionHealthChecker Impl. DSO Server
- 127.0.0.1:55112 might be in Long GC. GC count since last ping reply : 1 2012-03-30 13:45:27,761 [L2_L1:TCComm Main Selector Thread_R
(listen 0.0.0.0:9510)] WARN
com.tc.net.protocol.transport.ConnectionHealthChecker Impl. DSO Server
- 127.0.0.1:55112 might be in Long GC. GC count since last ping reply : 1 2012-03-30 13:45:31,761 [L2_L1:TCComm Main Selector Thread_R
(listen 0.0.0.0:9510)] WARN
com.tc.net.protocol.transport.ConnectionHealthChecker Impl. DSO Server
- 127.0.0.1:55112 might be in Long GC. GC count since last ping reply : 2
...
2012-03-30 13:46:37,768 [L2_L1:TCComm Main Selector Thread_R (listen
0.0.0.0:9510)] ERROR com.tc.net.protocol.transport.ConnectionHealthChecke rImpl. DSO Server
- 127.0.0.1:55112 might be in Long GC. GC count since last ping reply : 10. But its too long. No more retries 2012-03-30 13:46:38,768
[HealthChecker] INFO
com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Server
- 127.0.0.1:55112 is DEAD 2012-03-30 13:46:38,768 [HealthChecker] ERROR com.tc.net.protocol.transport.ConnectionHealthCheckerImpl: DSO
Server - Declared connection dead
ConnectionID(1.0b1994ac80f14b7191080bdc3f38582a) idle time 45317ms
2012-03-30 13:46:38,768 [L2_L1:TCWorkerComm # 0_R] WARN
com.tc.net.protocol.transport.ServerMessageTransport -
ConnectionID(1.0b1994ac80f14b71 91080bdc3f38582a): CLOSE EVENT :
com.tc.net.core.TCConnectionJDK14#5158277: connected: false, closed:
true local=127.0.0.1:9510 remote=127.0.0 .1:55112 connect=[Fri Mar 30
13:34:22 BST 2012] idle=2001ms [207584 read, 229735 write]. STATUS :
DISCONNECTED
...
2012-03-30 13:46:38,799 [L2_L1:TCWorkerComm # 0_R] INFO
com.tc.objectserver.persistence.sleepycat.SleepycatPersistor - Deleted
client state fo r ChannelID=[1] 2012-03-30 13:46:38,801
[WorkerThread(channel_life_cycle_stage, 0)] INFO
com.tc.objectserver.handler.ChannelLifeCycleHandler - : Received tran
sport disconnect. Shutting down client ClientID[1] 2012-03-30
13:46:38,801 [WorkerThread(channel_life_cycle_stage, 0)] INFO
com.tc.objectserver.persistence.impl.TransactionStoreImpl - shutdownC
lient() : Removing txns from DB : 0
After this is happened, any operation with cache, like getWithLoader just doesn't answer, until terracotta server won't be restarted again.
Question: how can it be fixed/reconfigured? I assume, it can happen in production also (and actually sometimes happens) if for some (any) reason application will hang/staled/etc.
This is just to get you started.
TC connections betwee server and client are considered dead when the applicable HealthCheck fails. The default values for the HealthCheck assume a very stable and performant network. I recommend you familiarize yourself with the details and the calculations on
http://www.terracotta.org/documentation/3.5.2/terracotta-server-array/high-availability#85916
So typically you begin with
a) making sure your network doesn't hiccup occasionally
b) setting the TC HealthCheck values a bit higher
If the problem persists I'd recommend posting directly on the TC forums (they'll help you even if you only use the open-source edition, may take a few days to reply though.