netty client takes very long before broken network is detected - java

I am using netty.io (4.0.4) in a Java application to implement a TCP client that communicates with an external hardware driver. One of the requirements of this hardware is that the client sends a KEEP_ALIVE (heartbeat) message every 30 seconds; the hardware, however, does not respond to this heartbeat.
My problem is that when the connection is abruptly broken (e.g. network cable unplugged), the client is completely unaware of this and keeps sending the KEEP_ALIVE message for quite a while (around 5-10 minutes) before it gets an operation timeout exception.
In other words, from the client side, there is no way to tell if it's still connected.
Below is a snippet of my bootstrap setup, if it helps.
// bootstrap setup
bootstrap = new Bootstrap().group(group)
        .channel(NioSocketChannel.class)
        .option(ChannelOption.SO_KEEPALIVE, true)
        .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, 3000)
        .remoteAddress(ip, port)
        .handler(tcpChannelInitializer);
// part of the pipeline responsible for keep alive messages
pipeline.addLast("idleStateHandler", new IdleStateHandler(0, 0, 30, TimeUnit.SECONDS));
pipeline.addLast("keepAliveHandler", keepAliveMessageHandler);
I would expect that since the client is sending keep-alive messages, and those messages are not received at the other end, the missing acknowledgement should indicate a problem with the connection much earlier?
EDIT
Code from the KeepAliveMessageHandler
public class KeepAliveMessageHandler extends ChannelDuplexHandler
{
    private static final Logger LOGGER = getLogger(KeepAliveMessageHandler.class);
    private static final String KEEP_ALIVE_MESSAGE = "";

    @Override
    public void userEventTriggered(ChannelHandlerContext ctx, Object evt) throws Exception
    {
        if (!(evt instanceof IdleStateEvent)) {
            return;
        }
        IdleStateEvent e = (IdleStateEvent) evt;
        Channel channel = ctx.channel();
        if (e.state() == IdleState.ALL_IDLE) {
            LOGGER.info("Sending KEEP_ALIVE_MESSAGE");
            channel.writeAndFlush(KEEP_ALIVE_MESSAGE);
        }
    }
}
EDIT 2
I tried to explicitly ensure the keep-alive message is delivered using the code below:
@Override
public void userEventTriggered(ChannelHandlerContext ctx, Object evt) throws Exception
{
    if (!(evt instanceof IdleStateEvent)) {
        return;
    }
    IdleStateEvent e = (IdleStateEvent) evt;
    Channel channel = ctx.channel();
    if (e.state() == IdleState.ALL_IDLE) {
        LOGGER.info("Sending KEEP_ALIVE_MESSAGE");
        channel.writeAndFlush(KEEP_ALIVE_MESSAGE).addListener(future -> {
            if (!future.isSuccess()) {
                LOGGER.error("KEEP_ALIVE message write error");
                channel.close();
            }
        });
    }
}
This also does not work. :( According to this answer, this behavior makes sense, but I am still hoping there is some way to figure out if the write was a "real" success. (Having the hardware ack the heartbeat is not possible.)

You have enabled TCP keepalive:
.option(ChannelOption.SO_KEEPALIVE, true)
But in your code I can't see any piece that ensures keepalives are sent at a 30-second rate.
If a connection has been terminated due to a TCP Keepalive time-out and the other host eventually sends a packet for the old connection, the host that terminated the connection will send a packet with the RST flag set to signal the other host that the old connection is no longer active. This will force the other host to terminate its end of the connection so a new connection can be established.
Typically TCP Keepalives are sent every 45 or 60 seconds on an idle TCP connection, and the connection is dropped after 3 sequential ACKs are missed. This varies by host, e.g. by default Windows PCs send the first TCP Keepalive packet after 7,200,000 ms (2 hours), then send 5 Keepalives at 1000 ms intervals, dropping the connection if there is no response to any of the Keepalive packets.
(taken from http://ltxfaq.custhelp.com/app/answers/detail/a_id/1512/~/tcp-keepalives-explained)
I do understand now that
pipeline.addLast("idleStateHandler", new IdleStateHandler(0, 0, 30, TimeUnit.SECONDS));
pipeline.addLast("keepAliveHandler", keepAliveMessageHandler);
will trigger an idle event every 30 seconds of mutual inactivity, and keepAliveMessageHandler will send a packet to the remote side in this case.
Unfortunately
ChannelFuture future = channel.writeAndFlush(KEEP_ALIVE_MESSAGE);
is considered a success as soon as the message is written to the OS buffers.
It seems that under your conditions you have only 2 options:
Sending a command that will get some response from the external device (something that will not cause disruption). But I would assume that this is impossible in your case.
Modifying the underlying TCP driver settings.
The default OS settings for TCP keepalive are geared towards conserving system resources to support a large number of applications and connections. Provided that you have a dedicated system, you may set a more aggressive TCP keepalive configuration.
Here is a link on how to make the adjustments to the Linux kernel: http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html
The solution should work on plain installations as well as in VMs and Docker containers.
General information on the topic: https://blog.stephencleary.com/2009/05/detection-of-half-open-dropped.html
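If changing the kernel-wide defaults is not an option, newer Netty 4.x releases (later than the 4.0.4 mentioned in the question) also let you set the per-socket keepalive parameters through the Linux-only native epoll transport. A rough sketch under that assumption, with hypothetical names for the initializer and target address:

import io.netty.bootstrap.Bootstrap;
import io.netty.channel.ChannelFuture;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.ChannelOption;
import io.netty.channel.epoll.EpollChannelOption;
import io.netty.channel.epoll.EpollEventLoopGroup;
import io.netty.channel.epoll.EpollSocketChannel;
import io.netty.channel.socket.SocketChannel;

public class AggressiveKeepAliveClient {

    // Sketch: per-socket TCP keepalive tuning (requires netty-transport-native-epoll),
    // so a dead peer is detected after ~30s idle + 3 probes * 10s instead of the
    // kernel default of 2 hours.
    public static ChannelFuture connect(String ip, int port,
                                        ChannelInitializer<SocketChannel> initializer) {
        EpollEventLoopGroup group = new EpollEventLoopGroup();
        Bootstrap bootstrap = new Bootstrap().group(group)
                .channel(EpollSocketChannel.class)
                .option(ChannelOption.SO_KEEPALIVE, true)
                .option(EpollChannelOption.TCP_KEEPIDLE, 30)   // seconds idle before the first probe
                .option(EpollChannelOption.TCP_KEEPINTVL, 10)  // seconds between probes
                .option(EpollChannelOption.TCP_KEEPCNT, 3)     // missed probes before the connection is dropped
                .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, 3000)
                .handler(initializer);
        return bootstrap.connect(ip, port);
    }
}

This keeps the aggressive settings scoped to this one client instead of affecting every TCP connection on the machine.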

Related

Is there a way to simulate Socket and Connection timeout?

I have a certain piece of code that integrates with a third party over an HTTP connection, and it handles socket timeouts and connection timeouts differently.
I have been trying to simulate and test all the scenarios which could arise from the third party. I was able to test the connection timeout by connecting to a port which is blocked by the server's firewall, e.g. port 81.
However, I'm unable to simulate a socket timeout. If my understanding is not wrong, a socket timeout is associated with an ongoing packet flow and the connection dropping in between. Is there a way I can simulate this?
So we are talking about two kinds of timeouts here: one is for connecting to the server (connect timeout), the other happens when no data is sent or received via the socket for a while (idle timeout).
Node sockets have a socket timeout that can be used to synthesize both the connect and the idle timeout. This can be done by setting the socket timeout to the connect timeout and then, once connected, setting it to the idle timeout.
example:
const request = http.request(url, {
    timeout: connectTimeout,
});
request.setTimeout(idleTimeout);
This works because the timeout in the options is set immediately when the socket is created, while the setTimeout function is run on the socket once it is connected.
Anyway, the question was about how to test the connect timeout. Ok, so let's first park the idle timeout. We can simply test that by not sending any data for some time; that would cause the timeout. Check!
The connect timeout is a bit harder to test, the first thing that comes to mind is that we need a place to connect to that will not error, but also not connect. This would cause a timeout. But how the hell do we simulate that, in node?
If we think a little bit outside the box then we might figure out that this timeout is about the time it takes to connect. It does not matter why the connection takes as long as it does. We simply need to delay the time it takes to connect. This is not necessarily a server thing, we could also do it on the client. After all this is the part connecting, if we can delay it there, we can test the timeout.
So how could we delay the connection on the client side? Well, we can use the DNS lookup for that. Before the connection is made, a DNS lookup is done. If we simply delay that by 5 seconds or so we can test for the connect timeout very easily.
This is what the code could look like:
import * as dns from "dns";
import * as http from "http";

const url = new URL("http://localhost:8080");
const request = http.request(url, {
    timeout: 3 * 1000, // connect timeout
    lookup(hostname, options, callback) {
        setTimeout(
            () => dns.lookup(hostname, options, callback),
            5 * 1000,
        );
    },
});
request.setTimeout(10 * 1000); // idle timeout
request.addListener("timeout", () => {
    const message = !request.socket || request.socket.connecting ?
        `connect timeout while connecting to ${url.href}` :
        `idle timeout while connected to ${url.href}`;
    request.destroy(new Error(message));
});
In my projects I usually use an agent that I inject. The agent then has the delayed lookup. Like this:
import * as dns from "dns";
import * as http from "http";

const url = new URL("http://localhost:8080");
const agent = new http.Agent({
    lookup(hostname, options, callback) {
        setTimeout(
            () => dns.lookup(hostname, options, callback),
            5 * 1000,
        );
    },
});
const request = http.request(url, {
    timeout: 3 * 1000, // connect timeout
    agent,
});
request.setTimeout(10 * 1000); // idle timeout
request.addListener("timeout", () => {
    const message = !request.socket || request.socket.connecting ?
        `connect timeout while connecting to ${url.href}` :
        `idle timeout while connected to ${url.href}`;
    request.destroy(new Error(message));
});
Happy coding!
"Connection timeout" determines how long it may take for a TCP connection to be established and this all happens before any HTTP related data is send over the line. By connecting to a blocked port, you have only partially tested the connection timeout since no connection was being made. Typically, a TCP connection on your local network is created (established) very fast. But when connecting to a server on the other side of the world, establishing a TCP connection can take seconds.
"Socket timeout" is a somewhat misleading name - it just determines how long you (the client) will wait for an answer (data) from the server. In other words, how long the Socket.read() function will block while waiting for data.
Properly testing these functions involves creating a server socket or a (HTTP) web-server that you can modify to be very slow. Describing how to create and use a server socket for connection timeout testing (if that is possible) is too much to answer here, but socket timeout testing is a common question - see for example here (I just googled "mock web server for testing timeouts") which leads to a tool like MockWebServer. "MockWebServer" might have an option for testing connection timeouts as well (I have not used "MockWebServer"), but if not, another tool might have.
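For a plain java.net.Socket client, both situations can also be simulated locally without extra tooling: a connect timeout by connecting to a non-routable address, and a socket (read) timeout with a local server socket that accepts the connection but never writes anything back. A rough sketch (10.255.255.1 is just a commonly used non-routable example address, and depending on your network the connect attempt may fail with a different exception than a timeout):

import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class TimeoutSimulation {

    public static void main(String[] args) throws Exception {
        // 1. Connect timeout: a non-routable address never answers the SYN,
        //    so connect() should fail with a SocketTimeoutException after ~2 seconds.
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress("10.255.255.1", 81), 2000);
        } catch (Exception e) {
            System.out.println("connect: " + e);
        }

        // 2. Socket (read) timeout: the local server accepts the TCP connection
        //    (via the listen backlog) but never sends data, so read() fails with
        //    a SocketTimeoutException.
        try (ServerSocket silentServer = new ServerSocket(0);
             Socket socket = new Socket("localhost", silentServer.getLocalPort())) {
            socket.setSoTimeout(2000);
            socket.getInputStream().read();
        } catch (Exception e) {
            System.out.println("read: " + e);
        }
    }
}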
On a final note: it is good you are testing your usage of the third-party HTTP library with respect to timeout settings, even if this takes some effort. The worst that can happen is that a socket timeout setting in your code is somehow not used by the library and the default socket timeout of "wait forever" is used. That can result in your application doing absolutely nothing ("hanging") for no apparent reason.

Play 2.5 WebSocket Connection Build

I have an AWS server (medium) running in EU West with roughly 250 devices connected, which are also constantly reconnecting due to internet connectivity issues. For some reason, the number of TCP connections to the server grows until it reaches around 4300, after which no new connections are accepted by the server. I have confirmed that the problem is isolated to WebSocket requests and not regular HTTP requests.
WebSocket connections are kept per device in a Map with the device UUID as key; it sometimes happens that a device sends a request for a new WS connection even though the server already has a connection to that device. In this case, the current connection is closed and an error is returned so that the device can retry the connection request.
Below is the code snippet from the Controller handling the connections using LegacyWebSocket. Connections are closed using out.close() as per https://www.playframework.com/documentation/2.5.x/JavaWebSockets#handling-websockets-using-callbacks
public LegacyWebSocket<String> create(String uuid) {
    logger.debug("NEW WebSocket request from {}, creating new socket...", uuid);
    if (webSocketMap.containsKey(uuid)) {
        logger.debug("WebSocket already exists for {}, closing existing connection", uuid);
        webSocketMap.get(uuid).close();
        logger.debug("Responding forbidden to force WS restart from device {}", uuid);
        return WebSocket.reject(forbidden());
    }
    LegacyWebSocket<String> ws = WebSocket.whenReady((in, out) -> {
        logger.debug("Adding downstream connection to webSocketMap -> {} webSocketMap.size() = {}", uuid, webSocketMap.size());
        webSocketMap.put(uuid, out);
        // For each event received on the socket,
        in.onMessage(message -> {
            if (message.equals("ping")) {
                logger.debug("PING received from {} {}", uuid, message);
                out.write("pong");
            }
        });
        // When the socket is closed.
        in.onClose(() -> {
            logger.debug("onClose, removing for {}", uuid);
            webSocketMap.remove(uuid);
        });
    });
    return ws;
}
How can I ensure that Play Framework closes the TCP connection for closed WS connections?
The command that I use to check the number of TCP connections is netstat -n -t | wc -l
Looks like a TCP keep-alive issue - i.e. that the TCP connections become stale because of connectivity issues on the client side and the server does not handle or clean up the stale connections in time before the limit is reached.
This link will help you configure the TCP keep-alive on your server to ensure that the stale connections are cleaned up in time.

Preventing RabbitMQ from blocking upstream services

I have a Spring application that consumes messages on a specific port (say 9001), restructures them and then forwards them to a RabbitMQ server. The code segment is:
private void send(String routingKey, String message) throws Exception {
    String exchange = applicationConfiguration.getAMQPExchange();
    String exchangeType = applicationConfiguration.getAMQPExchangeType();
    Connection connection = myConnection.getConnection();
    Channel channel = connection.createChannel();
    channel.exchangeDeclare(exchange, exchangeType);
    channel.basicPublish(exchange, routingKey, null, message.getBytes());
    log.debug(" [CORE: AMQP] Sent message with key {} : {}", routingKey, message);
}
If the RabbitMQ server fails (crashes, runs out of RAM, is turned off, etc.) the code above blocks, preventing the upstream service from receiving messages (a bad thing). I am looking for a way of preventing this behaviour while not losing messages, so that at some time in the future they can be resent.
I am not sure how best to address this. One option may be to queue the messages to a disk file and then use a separate thread to read them and forward them to the RabbitMQ server?
If I understand correctly, the issue you are describing is a known JDK socket behaviour when the connection is lost mid-write. See this mailing list thread: http://markmail.org/thread/3vw6qshxsmu7fv6n.
Note that if RabbitMQ is shut down, the TCP connection should be closed in a way that's quickly observable by the client. However, it is true that stale TCP connections can take a while to be detected, which is why RabbitMQ's core protocol has heartbeats. Set the heartbeat interval to a low value (say, 6-8 seconds) and the client itself will notice an unresponsive peer within that amount of time.
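With the RabbitMQ Java client that means requesting a heartbeat on the ConnectionFactory; a minimal sketch (the host name and exact values are placeholders):

import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class HeartbeatConnectionExample {

    // Sketch: request a low heartbeat interval so a dead broker connection is
    // noticed within a few heartbeat intervals instead of relying on TCP timeouts.
    public static Connection connect(String host) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost(host);
        factory.setRequestedHeartbeat(6);   // seconds
        factory.setConnectionTimeout(5000); // milliseconds, fail fast if the broker is unreachable
        return factory.newConnection();
    }
}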
You need to use Publisher confirms [1] but also account for the fact that the app itself can go down right before sending a message. As you rightly point out, having a disk-based WAL (write-ahead log) is a common solution for this problem. Note that it is both quite tricky to get right and still leaves some time window where your app process shutting down can result in an unpublished and unlogged message.
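A minimal sketch of the publisher-confirms part with the Java client, assuming a channel has already been created as in the send() method above; waitForConfirmsOrDie throws if the broker does not acknowledge the publish within the timeout, which is the point where you would fall back to a local log or retry:

import java.nio.charset.StandardCharsets;

import com.rabbitmq.client.Channel;

public class ConfirmedPublisher {

    // Sketch: publish with publisher confirms so an unacknowledged message can be
    // detected (and, for example, written to a local WAL for a later retry).
    public static void publishConfirmed(Channel channel, String exchange,
                                        String routingKey, String message) throws Exception {
        channel.confirmSelect(); // put the channel in confirm mode
        channel.basicPublish(exchange, routingKey, null,
                message.getBytes(StandardCharsets.UTF_8));
        // Blocks until the broker confirms; throws on nack or timeout.
        channel.waitForConfirmsOrDie(5000);
    }
}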
No promises on the time frame but the idea of adding WAL to the Java client has been discussed.
[1] http://www.rabbitmq.com/confirms.html

Netty: how do I reduce delay between consecutive messages from the server?

I'm on the dev team for a socket server which uses Netty. When a client sends a request, and the server sends a single response, the round trip time is quite fast. (GOOD) We recently noticed that if the request from the client triggers two messages from the server, even though the server writes both messages to the client at about the same time, there is a delay of more than 200ms between the first and second message arriving on the remote client. When using a local client the two messages arrive at the same time. If the remote client sends another request before the second message from the server arrives, that second message is sent immediately, but then the two messages from the new request are both sent with the delay of over 200ms.
Since it was noticed while using Netty 3.3.1, I tried upgrading to Netty 3.6.5 but I still see the same behavior. We are using NIO, not OIO, because we need to be able to support large numbers of concurrent clients.
Is there a setting that we need to configure that will reduce that 200+ ms delay?
Editing to add a code snippet. I hope these are the most relevant parts.
@Override
public boolean openListener(final Protocol protocol,
        InetSocketAddress inetSocketAddress) throws Exception {
    ChannelFactory factory = new NioServerSocketChannelFactory(
            Executors.newCachedThreadPool(),
            Executors.newCachedThreadPool(),
            threadingConfiguration.getProcessorThreadCount());
    ServerBootstrap bootstrap = new ServerBootstrap(factory);
    final ChannelGroup channelGroup = new DefaultChannelGroup();
    bootstrap.setPipelineFactory(new ChannelPipelineFactory() {
        .... lots of pipeline setup snipped ......
    });
    Channel channel = bootstrap.bind(inetSocketAddress);
    channelGroup.add(channel);
    channelGroups.add(channelGroup);
    bootstraps.add(bootstrap);
    return true;
}
The writer factory uses ChannelBuffers.dynamicBuffer(defaultMessageSize) for the buffer, and when we write a message it's Channels.write(channel, msg).
What else would be useful? The developer who migrated the code to Netty is not currently available, and I'm trying to fill in.
200ms strikes me as the magic number of Nagle's algorithm. Try setting tcpNoDelay to true on both sides.
This is how you set the option for the server side.
serverBootstrap.setOption("child.tcpNoDelay", true);
This is for the client side.
clientBootStrap.setOption("tcpNoDelay", true);
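For reference, if the code is later migrated to Netty 4.x, the same option is set through the typed ChannelOption constants rather than string keys; roughly:

// Netty 4.x equivalent (sketch), using io.netty.channel.ChannelOption
serverBootstrap.childOption(ChannelOption.TCP_NODELAY, true); // server side, applied to accepted child channels
clientBootstrap.option(ChannelOption.TCP_NODELAY, true);      // client side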
Further reading: http://www.stuartcheshire.org/papers/NagleDelayedAck/

ObjectInputStream.readObject() hangs forever during the process of socket communication

I have encountered a problem with socket communication on a Linux system. The communication process is as follows: the client sends a message to ask the server to do a compute task, and waits for the result message from the server after the task completes.
But the client hangs waiting for the result message if the task takes a long time, such as about 40 minutes, even though on the server side the result message has been written to the socket to respond to the client. It receives the result message normally if the task takes little time, such as one minute. Additionally, this problem only happens in the customer environment; the communication process behaves normally in our testing environment.
I suspected the cause of this problem was that the default socket timeout values differ between the customer environment and the testing environment, but the following values are identical in the two environments, on both client and server:
getSoTimeout:0
getReceiveBufferSize:43690
getSendBufferSize:8192
getSoLinger:-1
getTrafficClass:0
getKeepAlive:false
getTcpNoDelay:false
The code on the client is like:
Message msg = null;
ObjectInputStream in = client.getClient().getInputStream();
// if no message, readObject() will hang here
while (true) {
    try {
        Object recObject = in.readObject();
        System.out.println("Client received msg.");
        msg = (Message) recObject;
        return msg;
    } catch (Exception e) {
        e.printStackTrace();
        return null;
    }
}
The code on the server is like:
ObjectOutputStream socketOutStream = getSocketOutputStream();
try {
    MessageJobComplete msgJobComplete = new MessageJobComplete(reportFile, outputFile);
    socketOutStream.writeObject(msgJobComplete);
} catch (Exception e) {
    e.printStackTrace();
}
In order to solve this problem, I added the flush and reset methods, but the problem still exists:
ObjectOutputStream socketOutStream = getSocketOutputStream();
try {
    MessageJobComplete msgJobComplete = new MessageJobComplete(reportFile, outputFile);
    socketOutStream.flush();
    logger.debug("AbstractJob#reply to the socket");
    socketOutStream.writeObject(msgJobComplete);
    socketOutStream.reset();
    socketOutStream.flush();
    logger.debug("AbstractJob#after Flush Reply");
} catch (Exception e) {
    e.printStackTrace();
    logger.error("Exception when sending MessageJobComplete." + e.getMessage());
}
So does anyone know what steps I should take next to solve this problem?
I guess the cause is an environment setting, but I do not know which environment factors would affect the socket communication.
The sockets use TCP/IP to communicate, and the problem is related to long-running tasks, so which TCP values would affect the timeout of the socket communication?
After analysing the logs, I found that after the message was written to the socket, no exceptions were thrown or caught. But always after 15 minutes, there are exceptions in the objectInputStream.readObject() code snippet on the server side, which is used to accept requests from the client. However, the socket.getSoTimeout value is 0, so it is very strange that a timed out exception was thrown.
{2012-01-09 17:44:13,908} ERROR java.net.SocketException: Connection timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:146)
at sun.security.ssl.InputRecord.readFully(InputRecord.java:312)
at sun.security.ssl.InputRecord.read(InputRecord.java:350)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:809)
at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:766)
at sun.security.ssl.AppInputStream.read(AppInputStream.java:94)
at sun.security.ssl.AppInputStream.read(AppInputStream.java:69)
at java.io.ObjectInputStream$PeekInputStream.peek(ObjectInputStream.java:2265)
at java.io.ObjectInputStream$BlockDataInputStream.peek(ObjectInputStream.java:2558)
at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2568)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1314)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:368)
So why are the "Connection timed out" exceptions thrown?
This problem is solved. Using tcpdump to capture the message flows, I found that while at the application level the ObjectOutputStream.writeObject() method was invoked, at the TCP level many [TCP Retransmission] packets were seen.
So I concluded that the connection was possibly dead, even though according to the netstat -an command the TCP connection state was still ESTABLISHED.
So I wrote a testing application to periodically send test messages as heartbeat messages from the server. Then this problem disappeared.
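The heartbeat sender itself was not posted; a minimal sketch of the idea, assuming the server holds the ObjectOutputStream and using a hypothetical serializable HeartbeatMessage type, could look like this (the client's read loop then has to recognise and ignore these messages):

import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class HeartbeatSender {

    // Hypothetical heartbeat message; any small serializable object works.
    static class HeartbeatMessage implements Serializable {
        private static final long serialVersionUID = 1L;
    }

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    // Sketch: write a heartbeat every 60 seconds so a broken connection surfaces
    // as an IOException on the server instead of going unnoticed for hours.
    public void start(ObjectOutputStream out) {
        scheduler.scheduleAtFixedRate(() -> {
            try {
                synchronized (out) {          // avoid interleaving with result messages
                    out.writeObject(new HeartbeatMessage());
                    out.flush();
                }
            } catch (Exception e) {
                e.printStackTrace();          // a failure here means the connection is gone
                scheduler.shutdown();
            }
        }, 60, 60, TimeUnit.SECONDS);
    }
}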
The read() methods of java.io.InputStream are blocking calls, which means they wait "forever" if they are called when there is no data in the stream to read.
This is completely expected behaviour, as per the published contract in the Javadoc, if the server does not respond.
If you want a non-blocking read, use the java.nio.* classes.
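Not mentioned in this answer, but an alternative to switching to NIO is to put an upper bound on the blocking read with a socket read timeout; just a sketch:

import java.io.ObjectInputStream;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class BoundedRead {

    // Sketch: give the blocking read an upper bound instead of letting it wait forever.
    public static Object readWithTimeout(Socket socket, int timeoutMillis) throws Exception {
        socket.setSoTimeout(timeoutMillis); // applies to all subsequent reads on this socket
        ObjectInputStream in = new ObjectInputStream(socket.getInputStream());
        try {
            return in.readObject();
        } catch (SocketTimeoutException e) {
            // no data arrived within timeoutMillis; decide whether to retry or close
            return null;
        }
    }
}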
