Background
My application gathers data from the phone and sends it to a remote server.
The data is first stored in memory (or in a file once it grows big enough), and every X seconds or so the application flushes that data and sends it to the server.
It's mission critical that every single piece of data is sent successfully; I'd rather send the data twice than not at all.
Problem
As a test I set up the app to send data with a timestamp every 5 seconds, which means that every 5 seconds a new line appears on the server.
If I kill the server I expect the lines to stop; the data should now be written to memory instead.
When I enable the server again I should be able to confirm that no events are missing.
The problem, however, is that when I kill the server it takes about 20 seconds for IO operations to start failing. During those 20 seconds the app happily sends the events and removes them from memory, but they never reach the server and are lost forever.
I need a way to make certain that the data actually reaches the server.
This is possibly one of the more basic TCP questions, but nonetheless I haven't found any solution to it.
Stuff I've tried
Setting Socket.setTcpNoDelay(true)
Removing all buffered writers and just using OutputStream directly
Flushing the stream after every send
Additional info
I cannot change how the server responds, meaning I can't tell the server to acknowledge the data (beyond the mechanics of TCP, that is); the server will just silently accept the data without sending anything back.
Snippet of code
Initialization of the class:
socket = new Socket(host, port);
socket.setTcpNoDelay(true);
Where data is sent:
while (!dataList.isEmpty()) {
    String data = dataList.removeFirst();
    inMemoryCount -= data.length();
    try {
        OutputStream os = socket.getOutputStream();
        os.write(data.getBytes());
        os.flush();
    }
    catch (IOException e) {
        inMemoryCount += data.length();
        dataList.addFirst(data);
        socket = null;
        return false;
    }
}
return true;
Update 1
I'll say this again, I cannot change the way the server behaves.
It receives data over TCP and UDP and does not send any data back to confirm receipt. This is a fact, and sure, in a perfect world the server would acknowledge the data, but that will simply not happen.
Update 2
The solution posted by Fraggle works perfectly (closing the socket and waiting for the input stream to be closed).
This however comes with a new set of problems.
Since I'm on a phone I have to assume that the user cannot send an unlimited amount of data, and I would like to keep all data traffic to a minimum if possible.
I'm not worried about the overhead of opening a new socket; those few bytes will not make a difference. What I am worried about, however, is that every time I connect to the server I have to send a short string identifying who I am.
The string itself is not that long (around 30 characters) but that adds up if I close and open the socket too often.
One solution is to only "flush" the data every X bytes; the problem is that I have to choose X wisely: if it's too big, too much duplicate data will be resent when the socket goes down, and if it's too small, the overhead is too big.
Final update
My final solution is to "flush" the socket by closing it every X bytes; if anything went wrong, those X bytes will be sent again.
This will possibly create some duplicate events on the server but that can be filtered there.
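For reference, a minimal sketch of what that final approach could look like; the class name, method signature and the identity string are invented for illustration and are not the real code:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public final class ReliableSender {
    // Sends one batch of up to X bytes and reports success only after the
    // server has closed its side in response to our shutdown.
    static boolean sendBatch(String host, int port, String identity, byte[] batch) {
        try (Socket socket = new Socket(host, port)) {
            OutputStream os = socket.getOutputStream();
            os.write(identity.getBytes(StandardCharsets.UTF_8)); // per-connection id string
            os.write(batch);
            os.flush();

            socket.shutdownOutput();               // send FIN: "that's all from us"
            InputStream is = socket.getInputStream();
            while (is.read() != -1) {              // drain until the peer closes as well
                // ignore anything the server might send back
            }
            return true;                           // the peer saw our FIN, so it got the batch
        } catch (IOException e) {
            return false;                          // keep the batch in memory and resend later
        }
    }
}

If sendBatch returns false, the caller keeps the batch and retries, accepting the possible duplicates mentioned above.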
Dan's solution is the one I'd suggest right after reading your question; he's got my up-vote.
Now can I suggest working around the problem? I don't know if this is possible with your setup, but one way of dealing with badly designed software (this is your server, sorry) is to wrap it, or in fancy-design-pattern-talk provide a facade, or in plain talk put a proxy in front of your pain-in-the-behind server. Design a meaningful ack-based protocol, have the proxy keep enough data samples in memory to detect and tolerate broken connections, etc. In short, have the phone app connect to a proxy residing somewhere on a "server-grade" machine using the "good" protocol, then have the proxy connect to the server process using the "bad" protocol. The client is responsible for generating data. The proxy is responsible for dealing with the server.
Just another idea.
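To make the idea concrete, a rough sketch of such a proxy, assuming a simple line-based protocol in which the proxy acks each line back to the phone; the host names, ports and "ACK" token are invented, and a real proxy would also buffer samples and reconnect to the upstream server as described above:

import java.io.*;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public final class AckingProxy {
    public static void main(String[] args) throws IOException {
        try (ServerSocket listener = new ServerSocket(9000);
             Socket upstream = new Socket("real-server.example", 1234)) {
            OutputStream toServer = upstream.getOutputStream();
            while (true) {
                try (Socket phone = listener.accept()) {
                    BufferedReader in = new BufferedReader(
                            new InputStreamReader(phone.getInputStream(), StandardCharsets.UTF_8));
                    Writer ack = new OutputStreamWriter(phone.getOutputStream(), StandardCharsets.UTF_8);
                    String line;
                    while ((line = in.readLine()) != null) {
                        toServer.write((line + "\n").getBytes(StandardCharsets.UTF_8));
                        toServer.flush();          // hand the sample to the legacy server
                        ack.write("ACK\n");        // only now tell the phone it may discard it
                        ack.flush();
                    }
                }
            }
        }
    }
}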
Edit 0:
You might find this one entertaining: The ultimate SO_LINGER page, or: why is my tcp not reliable.
The bad news: You can't detect a failed connection except by trying to send or receive data on that connection.
The good news: As you say, it's OK if you send duplicate data. So your solution is not to worry about detecting failure in less than the 20 seconds it now takes. Instead, simply keep a circular buffer containing the last 30 or 60 seconds' worth of data. Each time you detect a failure and then reconnect, you can start the session by resending that saved data.
(This could get to be problematic if the server repeatedly cycles up and down in less than a minute; but if it's doing that, you have other problems to deal with.)
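A sketch of such a buffer, assuming the samples are strings and a 60-second window; names and sizes are illustrative only:

import java.util.ArrayDeque;
import java.util.Deque;

final class ReplayBuffer {
    private static final long WINDOW_MS = 60_000;   // keep roughly the last minute of data
    private final Deque<TimestampedSample> samples = new ArrayDeque<>();

    static final class TimestampedSample {
        final long sentAtMillis;
        final String payload;
        TimestampedSample(long sentAtMillis, String payload) {
            this.sentAtMillis = sentAtMillis;
            this.payload = payload;
        }
    }

    synchronized void remember(String payload) {
        long now = System.currentTimeMillis();
        samples.addLast(new TimestampedSample(now, payload));
        // Drop anything older than the window so memory stays bounded.
        while (!samples.isEmpty() && now - samples.peekFirst().sentAtMillis > WINDOW_MS) {
            samples.removeFirst();
        }
    }

    // After detecting a failure and reconnecting, replay everything we still hold.
    synchronized Iterable<TimestampedSample> snapshotForResend() {
        return new ArrayDeque<>(samples);
    }
}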
See the accepted answer here: Java Sockets and Dropped Connections
socket.shutdownOutput();
wait for inputStream.read() to return -1, indicating the peer has also shutdown its socket
Won't work: server cannot be modified
Can't your server acknowledge every message it receives with another packet? The client then wouldn't remove messages that the server has not yet acknowledged.
This will have performance implications. To avoid slowing down you can keep on sending messages before an acknowledgement is received, and acknowledge several messages in one return message.
If you send a message every 5 seconds, and disconnection is not detected by the network stack for 30 seconds, you'll have to store just 6 messages. If 6 sent messages are not acknowledged, you can consider the connection to be down. (I suppose that logic of reconnection and backlog sending is already implemented in your app.)
What about sending UDP datagrams on a separate UDP socket while making the remote host respond to each, and then when the remote host doesn't respond, you kill the TCP connection? It detects a link breakage quickly enough :)
Use http POST instead of socket connection, then you can send a response to each post. On the client side you only remove the data from memory if the response indicates success.
Sure, it's more overhead, but it gives you what you want 100% of the time.
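As a sketch of that approach (the URL and timeouts are placeholders), using the standard HttpURLConnection:

import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

final class HttpBatchSender {
    static boolean post(byte[] batch) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) new URL("http://example.com/ingest").openConnection();
            conn.setDoOutput(true);
            conn.setRequestMethod("POST");
            conn.setConnectTimeout(5_000);
            conn.setReadTimeout(5_000);
            try (OutputStream os = conn.getOutputStream()) {
                os.write(batch);
            }
            int status = conn.getResponseCode();   // blocks until the server answers
            return status >= 200 && status < 300;  // only then remove the batch from memory
        } catch (IOException e) {
            return false;                          // keep the batch and retry later
        }
    }
}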
Related
For an exercise, we are to implement a server that has a thread that listens for connections, accepts them and throws the socket into a BlockingQueue. A set of worker threads in a pool then goes through the queue and processes the requests coming in through the sockets.
Each client connects to the server, sends a large number of requests (waiting for the response before sending the next request) and eventually disconnects when done.
My current approach is to have each worker thread waiting on the queue, getting a socket, then processing one request, and finally putting the (still open) socket back into the queue before handling another request, potentially from a different client. There are many more clients than there are worker threads, so many connections queue up.
The problem with this approach: A thread will be blocked by a client even if the client doesn't send anything. Possible pseudo-solutions, all not satisfactory:
Call available() on the inputStream and put the connection back into the queue if it returns 0. The problem: It's impossible to detect if the client is still connected.
As above but use socket.isClosed() or socket.isConnected() to figure out if the client is still connected. The problem: Both methods don't detect a client hangup, as described nicely by EJP in Java socket API: How to tell if a connection has been closed?
Probe if the client is still there by reading from or writing to it. The problem: Reading blocks (i.e. back to the original situation where an inactive client blocks the queue) and writing actually sends something to the client, making the tests fail.
Is there a way to solve this problem? I.e. is it possible to distinguish a disconnected client from a passive client without blocking or sending something?
Short answer: no. For a longer answer, refer to the one by EJP.
Which is why you probably shouldn't put the socket back on the queue at all, but rather handle all the requests from the socket, then close it. Passing the connection to different worker threads to handle requests separately won't give you any advantage.
If you have badly behaving clients you can use a read timeout on the socket, so reading will block only until the timeout occurs. Then you can close that socket, because your server doesn't have time to cater to clients that don't behave nicely.
Is there a way to solve this problem? I.e. is it possible to distinguish a disconnected client from a passive client without blocking or sending something?
Not really when using blocking IO.
You could look into the non-blocking (NIO) package, which deals with things a little differently.
In essence you have a socket which can be registered with a "selector". If you register sockets for "is data ready to be read" you can then determine which sockets to read from without having to poll individually.
Same sort of thing for writing.
Here is a tutorial on writing NIO servers
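For illustration, a bare-bones selector loop might look like this; the port and buffer size are arbitrary, and real request handling is omitted:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

public final class NioServerSketch {
    public static void main(String[] args) throws IOException {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(9000));
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);

        ByteBuffer buffer = ByteBuffer.allocate(4096);
        while (true) {
            selector.select();                        // blocks until some channel is ready
            Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
            while (keys.hasNext()) {
                SelectionKey key = keys.next();
                keys.remove();
                if (key.isAcceptable()) {
                    SocketChannel client = server.accept();
                    client.configureBlocking(false);
                    client.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {
                    SocketChannel client = (SocketChannel) key.channel();
                    buffer.clear();
                    int read = client.read(buffer);   // never blocks on an idle client
                    if (read == -1) {
                        client.close();               // remote hangup is detected here
                    }
                    // ...otherwise flip the buffer and process the request bytes...
                }
            }
        }
    }
}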
Turns out the problem is solvable with a few tricks. After long discussions with several people, I combined their ideas to get the job done in reasonable time:
After creating the socket, configure it such that a blocking read will only block for a certain time, say 100ms: socket.setSoTimeout(100);
Additionally, record the timestamp of the last successful read of each connection, e.g. with System.currentTimeMillis()
In principle (see below for exception to this principle), run available() on the connection before reading. If this returns 0, put the connection back into the queue since there is nothing to read.
Exception to the above principle, in which case available() is not used: if the timestamp is too old (say, more than 1 second), use read() to actually block on the connection. This will not take longer than the SoTimeout that you set above for the socket. If you get a SocketTimeoutException, put the connection back into the queue. If you read -1, throw the connection away, since it was closed by the remote end.
With this strategy, most read attempts terminate immediately, either returning some data or nothing because they were skipped since there was nothing available(). If the other end closed its connection, we will detect this within one second, since the timestamp of the last successful read is too old. In this case, we perform an actual read that will return -1 and the socket's isClosed() is updated accordingly. And in the case where the socket is still open but the queue is so long that we have more than a second of delay, it takes us an additional 100ms to find out that the connection is still there but not ready.
EDIT: An enhancement of this is to change "last successful read" to "last blocking read" and also update the timestamp when getting a SocketTimeoutException.
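Putting the pieces together, a worker might use something like the following sketch; the class name, thresholds and buffer handling are made up for illustration:

import java.io.IOException;
import java.io.InputStream;
import java.net.Socket;
import java.net.SocketTimeoutException;

final class PolledConnection {
    private static final int SO_TIMEOUT_MS = 100;        // upper bound on a blocking read
    private static final long FORCE_READ_AFTER_MS = 1_000;

    private final Socket socket;
    private long lastBlockingReadMillis = System.currentTimeMillis();

    PolledConnection(Socket socket) throws IOException {
        this.socket = socket;
        socket.setSoTimeout(SO_TIMEOUT_MS);
    }

    // Returns true if the connection should go back into the queue, false if it is dead.
    boolean poll(byte[] buffer) throws IOException {
        InputStream in = socket.getInputStream();
        long now = System.currentTimeMillis();
        if (now - lastBlockingReadMillis < FORCE_READ_AFTER_MS && in.available() == 0) {
            return true;                                  // nothing to read, requeue cheaply
        }
        try {
            int read = in.read(buffer);                   // blocks for at most SO_TIMEOUT_MS
            lastBlockingReadMillis = System.currentTimeMillis();
            if (read == -1) {
                socket.close();                           // peer closed the connection
                return false;
            }
            // ...handle the 'read' bytes of request data...
            return true;
        } catch (SocketTimeoutException e) {
            lastBlockingReadMillis = System.currentTimeMillis();  // per the EDIT above
            return true;                                  // still connected, just quiet
        }
    }
}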
No, the only way to discern an inactive client from a client that didn't shut down their socket properly is to send a ping or something to check if they're still there.
Possible solutions I can see are:
Kick clients that haven't sent anything for a while. You would have to keep track of how long they've been quiet for, and once they reach a limit you assume they've disconnected.
Ping the client to see if they're still there. I know you asked for a way to do this without sending anything, but if this is really a problem, i.e. you can't use the above solution, this is probably the best way to do it, depending on the specifics (since it's an exercise you might have to imagine the specifics).
A mix of both; actually this is probably better. Keep track of how long they've been quiet for, and after a bit send them a ping to see if they still live (see the sketch below).
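A sketch of that mixed strategy, with invented thresholds, could look like this:

final class ClientLiveness {
    private static final long QUIET_BEFORE_PING_MS = 30_000;
    private static final long QUIET_BEFORE_KICK_MS = 60_000;

    private long lastHeardFromMillis = System.currentTimeMillis();
    private boolean pingSent = false;

    void onDataReceived() {
        lastHeardFromMillis = System.currentTimeMillis();
        pingSent = false;
    }

    // Called periodically; returns what the server should do with this client.
    Action check(long nowMillis) {
        long quietFor = nowMillis - lastHeardFromMillis;
        if (quietFor > QUIET_BEFORE_KICK_MS) {
            return Action.DISCONNECT;   // assume the client is gone
        }
        if (quietFor > QUIET_BEFORE_PING_MS && !pingSent) {
            pingSent = true;
            return Action.SEND_PING;    // ask "are you still there?"
        }
        return Action.NOTHING;
    }

    enum Action { NOTHING, SEND_PING, DISCONNECT }
}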
We have a system with two applications. One of these is a legacy application, to which we can't make any code changes. This application sends messages to the second application, which is written in Java. In our Java code, we have set the input stream buffer size to 1 MB as follows:
Socket eventSocket = new Socket();
eventSocket.setSendBufferSize(1024 * 1024);
Now the legacy application is sending messages of variable size. Most of the messages are smaller than 1 MB, but sometimes it sends messages as large as 8 MB. Many times these messages are read successfully by the Java application. But in some cases, the following read operation returns -1:
read = stream.read(b, off, len - off); (here stream is an InputStream object)
As per the Java API definition, the InputStream read method returns -1 if there is no more data because the end of the stream has been reached.
But this is erroneous behavior. We have done a snoop test using Wireshark to verify the exact messages exchanged between these two applications, and found that the Java application sent a zero window message a few seconds before the input stream read method returned -1. At the time this Java API method returned -1, the Java application was sending a ZeroWindowProbeAck message to the legacy application.
How should we handle this issue?
As per https://wiki.wireshark.org/TCP%20ZeroWindow, zero window has the following definition:
What does TCP Zero Window mean?
Zero Window is something to investigate.
TCP Zero Window is when the Window size in a machine remains at zero for a specified amount of time.
This means that a client is not able to receive further information at the moment, and the TCP transmission is halted until it can process the information in its receive buffer.
TCP Window size is the amount of information that a machine can receive during a TCP session and still be able to process the data. Think of it like a TCP receive buffer. When a machine initiates a TCP connection to a server, it will let the server know how much data it can receive by the Window Size.
In many Windows machines, this value is around 64512 bytes. As the TCP session is initiated and the server begins sending data, the client will decrement it's Window Size as this buffer fills. At the same time, the client is processing the data in the buffer, and is emptying it, making room for more data. Through TCP ACK frames, the client informs the server of how much room is in this buffer. If the TCP Window Size goes down to 0, the client will not be able to receive any more data until it processes and opens the buffer up again. In this case, Protocol Expert will alert a "Zero Window" in Expert View.
Troubleshooting a Zero Window
For one reason or another, the machine alerting the Zero Window will not receive any more data from the host. It could be that the machine is running too many processes at that moment, and its processor is maxed. Or it could be that there is an error in the TCP receiver, like a Windows registry misconfiguration. Try to determine what the client was doing when the TCP Zero Window happened.
Source: flukenetworks.com
Handling input-stream overflow (zero window) in Java
There is no such thing as 'input-stream overflow' in Java, and you can't handle zero window in Java either, except by reading from the network more quickly. Your title already doesn't make sense.
We have done a snoop test using Wireshark to verify the exact messages exchanged between these two applications, and found that the Java application sent a zero window message a few seconds before the input stream read method returned -1.
Neither Java nor the application sends those messages. The operating system does.
The input stream of a socket returns -1 if and only if a FIN has been received from the peer, and that in turn occurs if and only if the peer has closed the connection or exited (Unix). It doesn't have anything to do with TCP windowing.
At the time this Java API method returned -1, the Java application was sending a ZeroWindowProbeAck message to the legacy application.
No it wasn't. The operating system was, and it wasn't 'at the time', it was 'a few seconds before', according to your own words. At the time when this Java method returned -1, it had just received a FIN from the peer. Have a look at your sniff log. There is no problem here to explain.
As per [whatever], zero window has the following definition
Wireshark does not get to define TCP. TCP is defined in IETF RFCs. You can't cite non-normative sources as definitions.
TCP Zero Window is when the Window size in a machine remains at zero for a specified amount of time.
For any amount of time.
This means that a client is not able to receive further information at the moment, and the TCP transmission is halted until it can process the information in its receive buffer.
It means that the peer is not able to receive. It has nothing to do with the client or the server specifically.
TCP Window size is the amount of information that a machine can receive during a TCP session
No it isn't. It is the amount of data the receiver is currently able to receive. It is therefore also the amount of data the sender is presently allowed to send. It has nothing to do with the session whatsoever.
and still be able to process the data.
Irrelevant.
Think of it like a TCP receive buffer.
It is a TCP receive buffer.
When a machine initiates a TCP connection to a server, it will let the server know how much data it can receive by the Window Size.
Correct. And vice versa. Continuously, not just at the start of the session.
In many Windows machines, this value is around 64512 bytes. As the TCP session is initiated and the server begins sending data, the client will decrement it's Window Size as this buffer fills.
It has nothing to do with clients and servers. It operates in both directions.
At the same time, the client is processing the data in the buffer, and is emptying it, making room for more data. Through TCP ACK frames,
Segments
the client informs the server of how much room is in this buffer.
The receiver informs the sender.
If the TCP Window Size goes down to 0, the client
The peer
will not be able to receive any more data until it processes and opens the buffer up again. In this case, Protocol Expert will alert a "Zero Window" in Expert View.
For one reason or another, the machine alerting the Zero Window will not receive any more data from the host.
For one reason only. Its socket receive buffer is full. Period.
It could be that the machine is running too many processes at that moment
Rubbish.
Or it could be that there is an error in the TCP receiver, like a Windows registry misconfiguration.
Rubbish. The receiver is reading more slowly than the sender is sending. Period. It is a normal condition that arises frequently during any TCP session.
Try to determine what the client was doing when the TCP Zero Window happened.
That's easy. Not reading from the network.
Your source is drivel, and your problem is imaginary.
We have created a solution where we wait some time for the input stream to clear after this overflow problem occurs. We have made the following code changes:
int execRetries = 0;
while (true)
{
    read = stream.read(b, off, len - off);
    if (read == -1)
    {
        if (execRetries++ < MAX_EXEC_RETRIES_AFTER_IS_OVERFLOW) {
            try {
                Log.error("Inputstream buffer overflow occured. Retry no: " + execRetries);
                Thread.sleep(WAIT_TIME_AFTER_IS_OVERFLOW);
            } catch (InterruptedException e) {
                Log.error(e.getMessage(), e);
            }
        }
        else {
            throw new Exception("End of file on input stream");
        }
    }
    else if (execRetries != 0) {
        Log.info("Inputstream buffer overflow problem resolved after retry no: " + execRetries);
        execRetries = 0;
    }
    .....
}
The solution has been sent to the test server. We are waiting to verify whether it is actually working.
Netty Version: 4.0.10.Final
I've written a client and server using Netty. Here is what client and server do.
Server:
Wait for connection from client
Receive messages from client
If a message is bad, write error message (6 bytes), flush it,
close the socket and do not read any unread messages in the socket.
Otherwise continue reading messages. Do nothing with good messages.
Client:
Connect to server.
After writing N good messages, write one bad message and continue
writing M good messages. This process happens in a separate thread.
This thread is started after the channel is active.
If there is any response from server, log it and close the
socket. (Note that server responds only when there is an error)
I've straced both client and server. I've found that the server is closing the connection after writing the error message. The client began seeing broken pipe errors when writing good messages after the bad message. This is because the server detected the bad message, responded with the error message and closed the socket. The connection is closed only after the write operation is complete, using a listener. The client is not always reading the error message from the server. Earlier, step (2) in the client was performed in the I/O thread; this caused the percentage of error messages received over K experiments to be really low (<10%). After moving step (2) to a separate thread, the percentage went up to about 70%. In any case it is not accurate. Does Netty trigger a channel read if the write fails due to a broken pipe?
Update 1:
I'm clarifying and answering any questions asked here, so everybody can find the asked questions/clarifications at one place.
"You're writing a bad message that will cause a reset, followed by good messages that you already know won't get through, and trying to read a response that may have been thrown away. It doesn't make any sense to me whatsoever" - from EJP
-- In the real world the server could treat something as bad for whatever reason, which the client can't know in advance. For simplification, I said the client intentionally sends a bad message that causes a reset from the server. I would like to send all the good messages even if there are bad messages among the total messages.
What I'm doing is similar to the protocol implemented by Apple Push Notification Service.
If a message is bad, write error message (6 bytes), flush it, close the socket and do not read any unread messages in the socket. Otherwise continue reading messages.
That will cause a connection reset, which will be seen by the client as a broken pipe in Unix, Linux etc.
After writing N good messages, write one bad message and continue writing M good messages.
That will encounter the broken pipe error just mentioned.
This process happens in a separate thread.
Why? The whole point of NIO and therefore Netty is that you don't need extra threads.
I've found that server is closing connection after writing the error message.
Well that's what you said it does, so it does it.
Client began seeing broken pipe errors when writing good messages after the bad message.
As I said.
This is because server detected bad message and responded with error message and closed socket.
Correct.
Client is not reading error message from server always.
Due to the connection reset. The delivery of pending data ceases after a reset.
Does netty trigger channel read if the write fails due to broken pipe?
No, it triggers a read when data or an EOS arrives.
However your bizarre system design/protocol is making that unpredictable if not impossible. You're writing a bad message that will cause a reset, followed by good messages that you already know won't get through, and trying to read a response that may have been thrown away. It doesn't make any sense to me whatsoever. What are you trying to prove here?
Try a request-response protocol like everybody else.
The APN protocol appears to be quite awkward because it does not acknowledge successful receipt of a notification. Instead it just tells you which notifications it has successfully received when it encounters an error. The protocol is working on the assumption that you will generally send well formed notifications.
I would suggest that you need some sort of expiring cache (a LinkedHashMap might work here) and you need to use the opaque identifier field in the notification as a globally unique, ordered value. A sequence number will work (but you'll need to persist it if your client can be restarted).
Every time you generate an APN
set its identifier to the next sequence number
send it
place it in the LinkedHashMap with a string key of sequence number concatenated with the current time (eg String key = sequenceNumber + "-" + System.currentTimeMillis() )
If you receive an error you need to reopen the connection and resend all the APNs in the map with a sequence number higher than the identifier reported in the error. This is relatively easy: just iterate through the map, removing any APN with a sequence number lower than the one reported. Then resend the remaining APNs in order, replacing them in the map with the current time (i.e. you remove an APN when you resend it, then re-insert it into the map with the new current time).
You'll need to periodically purge the map of old entries. You need to determine what a reasonable length of time is, based on how long it takes the APN service to return an error if you send a malformed APN. I suspect it'll be a matter of seconds (if not much quicker). If, for example, you're sending 10 APNs / second, and you know that the APN server will definitely respond within 30 seconds, a 30-second expiry time, purging every second, might be appropriate. Just iterate along the map, removing any element whose key has a time section that is less than System.currentTimeMillis() - 30000 (for a 30-second expiry time). You'll need to synchronize threads appropriately.
I would catch any IOExceptions caused by writing and place the APN you were attempting to write in the map and resend.
What you cannot cope with is a genuine network error whereby you do not know if the APN service received the notification (or a bunch of notifications). You'll have to make a decision based on what your service is as to whether you resend the affected APNs immediately, or after some time period, or not at all. If you send after a time period you'll want to give them new sequence numbers at the point you send them. This will allow you to send new APNs in the meantime.
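As a rough illustration of the cache described above; the names, expiry time and key scheme are simplified (the sequence number alone is used as the key, with the timestamp stored in the value, rather than the combined string key suggested earlier):

import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

final class ApnResendCache {
    private static final long EXPIRY_MS = 30_000;

    private static final class Entry {
        final byte[] payload;
        final long sentAtMillis;
        Entry(byte[] payload, long sentAtMillis) {
            this.payload = payload;
            this.sentAtMillis = sentAtMillis;
        }
    }

    // Insertion order == sequence order, which is the order we resend in.
    private final LinkedHashMap<Integer, Entry> pending = new LinkedHashMap<>();

    synchronized void remember(int sequenceNumber, byte[] payload) {
        pending.put(sequenceNumber, new Entry(payload, System.currentTimeMillis()));
    }

    // The error names the last identifier processed; everything after it must be resent.
    synchronized LinkedHashMap<Integer, Entry> drainAfter(int lastGoodSequence) {
        LinkedHashMap<Integer, Entry> toResend = new LinkedHashMap<>();
        Iterator<Map.Entry<Integer, Entry>> it = pending.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Integer, Entry> e = it.next();
            it.remove();                              // everything is removed from the map
            if (e.getKey() > lastGoodSequence) {
                toResend.put(e.getKey(), e.getValue());  // caller re-remembers these on resend
            }
        }
        return toResend;
    }

    // Call periodically to stop the map growing without bound.
    synchronized void purgeOlderThan(long nowMillis) {
        pending.entrySet().removeIf(e -> nowMillis - e.getValue().sentAtMillis > EXPIRY_MS);
    }
}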
I have a simple Java program which acts as a server, listening for UDP packets. I then have a client which sends UDP packets over 3g.
Something I've noticed is occasionally the following appears to occur: I send one packet and seconds later it is still not received. I then send another packet and suddenly they both arrive.
I was wondering if it was possible that some sort of system is in place to wait for a certain amount of data instead of sending an undersized packet. In my application, I only send around 2-3 bytes of data per packet - although the UDP header and what not will bulk the message up a bit.
The aim of my application is to get these few bytes of data from A to B as fast as possible. Huge emphasis on speed. Is it all just coincidence? I suppose I could increase the packet size, but it just seems like the transfer time will increase, and 3g isn't exactly perfect.
Since the comments are getting rather lengthy, it might be better to turn them into an answer altogether.
If your app is not receiving data until a certain quantity is retrieved, then chances are, there is some sort of buffering going on behind the scenes. A good example (not saying this applies to you directly) is that if you or the underlying libraries are using InputStream.readLine() or InputStream.read(bytes), then it will block until it receives a newline or bytes number of bytes before returning. Judging by the fact that your program seems to retrieve all of the data when a certain threshold is reached, it sounds like this is the case.
A good way to debug this is to use Wireshark. Wireshark doesn't care about your program; it's analyzing the raw packets that are sent to and from your computer, and can tell you whether the issue is on the sender or the receiver.
If you use Wireshark and see that the data from the first send arrives on your physical machine well before the second, then the issue lies with your receiving end. If you see that the first packet arrives at the same time as the second packet, then the issue lies with the sender. Without seeing the code, it's hard to say what you're doing and what, specifically, is causing the data to only show up after receiving more than 2-3 bytes, but this behavior describes exactly what you're seeing.
There are several probable causes of this:
Cellular data networks are not "always-on". Depending on the underlying technology, there can be a substantial delay between when a first packet is sent and when IP connectivity is actually established. This will be most noticeable after IP networking has been idle for some time.
Your receiver may not be correctly checking the socket for readability. Regardless of what high-level APIs you may be using, underneath there needs to be a call to select() to check whether the socket is readable. When a datagram arrives, select() should unblock and signal that the socket descriptor is readable. Alternatively, but less efficiently, you could set the socket to non-blocking and poll it with a read. Polling wastes CPU time when there is no data and delays detection of arrival for up to the polling interval, but can be useful if for some reason you can't spare a thread to wait on select().
I said above that select() should signal readability on a watched socket when data arrives, but this behavior can be modified by the socket's "receive low-water mark". The default value is usually 1, meaning any data will signal readability. But if SO_RCVLOWAT is set higher (via setsockopt() or a higher-level equivalent), then readability will not be signaled until more than the specified amount of data has arrived. You can check the value with getsockopt() or whatever API is equivalent in your environment.
Item 1 would cause the first datagram to actually be delayed, but only when the IP network has been idle for a while and not once it comes up active. Items 2 and 3 would only make it appear to your program that the first datagram was delayed: a packet sniffer at the receiver would show the first datagram arriving on time.
Looking at the code:
private static void send(final Socket socket, final String data) throws IOException {
    final OutputStream os = socket.getOutputStream();
    final DataOutputStream dos = new DataOutputStream(os);
    dos.writeUTF(data);
    dos.flush();
}
can I be sure that calling this method either throws an IOException (which means I'd better close the socket), or, if no exception is thrown, the data I send is guaranteed to be fully sent? Are there any cases where, when I read the data on the other endpoint, the string I get is incomplete and no exception is thrown?
There is a big difference between sent and received. You can send data from the application successfully; however, it then passes to:
the OS on your machine
the network adapter
the switch(es) on the network
the network adapter on the remote machine
the OS on the remote machine
the application buffer on the remote machine
whatever the application does with it.
Any of these stages can fail and your sender will be none the wiser.
If you want to know the application has received and processed the data successfully, it must send you back a message saying this has happened. When you receive this, then you know it was received.
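A sketch of such an application-level acknowledgement, assuming the receiver replies with an invented "OK" token after processing each message:

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.net.Socket;

final class AckedSender {
    static boolean sendWithAck(Socket socket, String message) {
        try {
            DataOutputStream out = new DataOutputStream(socket.getOutputStream());
            DataInputStream in = new DataInputStream(socket.getInputStream());

            out.writeUTF(message);
            out.flush();                  // only guarantees it left our application

            String reply = in.readUTF();  // blocks until the receiver answers
            return "OK".equals(reply);    // delivered and processed by the peer
        } catch (IOException e) {
            return false;                 // unknown fate: caller should resend
        }
    }
}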
Yes, several things may happen. First of all, keep in mind that write returns really quickly, so don't assume much error checking (has all my data been ACKed?) is performed.
Door number 1
You write and flush your data. TCP tries as hard as it can to deliver it, which means it might perform retransmits and such. Of course, your send doesn't get stuck for such a long period (in some cases TCP tries for 5-10 minutes before it nukes the connection). Thus, you will never know from this call whether the other side actually got your message; you will get an error on the next operation on the socket.
Door number 2
You write and flush your data. Because of MTU nastiness and because the string is long, it is sent in multiple packets. So your peer reads some of it and presents it to the user before getting it all.
So imagine you send: "Hello darkness my old friend, I've come to talk with you again". The other side might get "Hello darkness m". However, if it performs subsequent reads, it will get the whole data. So the far-side TCP has actually received everything and has ACKed everything, but the user application has failed to read all the data in order to take it out of TCP's hands.
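For completeness, a sketch of the matching receive side: because writeUTF() length-prefixes the string, readUTF() keeps reading until the whole string has arrived (or throws EOFException if the connection dies mid-string), so a partial "Hello darkness m" never escapes this call.

import java.io.DataInputStream;
import java.io.IOException;
import java.net.Socket;

final class Receiver {
    static String receive(Socket socket) throws IOException {
        DataInputStream in = new DataInputStream(socket.getInputStream());
        return in.readUTF();   // blocks until every byte of one writeUTF() call is in
    }
}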