High lag with ZeroMQ - Java

I am facing a strange issue with ZMQ, which I'm just not able to debug. These are the components:
Java ZMQ Server - Almost an exact copy of this example. There are a hundred worker threads.
PHP Client - Simple request reply with a REQ socket. This is the request flow:
$zcontext = new ZMQContext();
$socket = new ZMQSocket($zcontext, ZMQ::SOCKET_REQ);
$socket->connect(<address>);
$startTime = microtime(true);
$socket->send(<request>);          // blocking send of the request
$result = $socket->recv();         // blocking wait for the reply
$totalTime = microtime(true) - $startTime;
The ZMQ sockets use TCP and both the server and client are on the same machine.
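(For reference, the server follows the zguide's multithreaded pattern; a rough Java sketch, assuming JeroMQ, with the port and reply payload as placeholders:)

import org.zeromq.SocketType;
import org.zeromq.ZContext;
import org.zeromq.ZMQ;

public class MtServer {
    public static void main(String[] args) {
        try (ZContext context = new ZContext()) {
            ZMQ.Socket frontend = context.createSocket(SocketType.ROUTER);
            frontend.bind("tcp://*:5555");        // clients (the PHP REQ sockets) connect here
            ZMQ.Socket backend = context.createSocket(SocketType.DEALER);
            backend.bind("inproc://workers");     // workers connect here
            for (int i = 0; i < 100; i++) {       // the hundred worker threads
                new Thread(() -> {
                    ZMQ.Socket worker = context.createSocket(SocketType.REP);
                    worker.connect("inproc://workers");
                    while (!Thread.currentThread().isInterrupted()) {
                        String request = worker.recvStr();
                        worker.send("reply to " + request);
                    }
                }).start();
            }
            ZMQ.proxy(frontend, backend, null);   // shuttle messages between clients and workers
        }
    }
}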
The PHP script is served by Apache and I am load testing using ApacheBench. I make 5000 requests with a concurrency of 200. On the PHP client I log the time the request-reply takes ($totalTime). In most cases this time is very low (sub-500 ms), but occasionally it takes a really long time, sometimes even 60 seconds (for send + receive).
I added some extra logging to find out where the issue is happening, and it turns out that whenever a request takes really long, almost all the time is spent between PHP's send and Java's receive, so packets are taking really long to reach the server.
I'm not setting any special ZMQ options or otherwise doing anything unusual, so I don't know what is causing the issue. It should also be noted that the issue persists even at lower concurrencies (I tested at 100 and 150 too), but the maximum request times are lower.
Sorry if the question seems vague - I'll provide any other details that may be needed.


What's the proper heartbeat/keep-alive technology/layer for Java REST? HTTP? TCP? Encoding: chunked?

The setup:
We have an https://Main.externaldomain/xmlservlet site, which is authenticating/validating/geo-locating and proxy-ing (slightly modified) requests to http://London04.internaldomain/xmlservlet for example.
There's no direct access to internaldomain exposed to end-users at all. The communication between the sites gets occasionally interrupted and sometimes the internaldomain nodes become unavailable/dead.
The Main site is using org.apache.http.impl.client.DefaultHttpClient (I know it's deprecated; we're gradually upgrading this legacy code) with readTimeout set to 10,000 milliseconds.
The request and response have xml payload/body of variable length and the Transfer-Encoding: chunked is used, also the Keep-Alive: timeout=15 is used.
The problem:
Sometimes London04 actually needs more than 10 seconds (let's say 2 minutes) to execute. Sometimes it crashes non-gracefully. Sometimes other (networking) issues happen.
Sometimes during those 2 minutes the portions of response XML data arrive so gradually that there are no 10-second gaps between them, and therefore the readTimeout is never exceeded;
sometimes there's a 10+ second gap and HttpClient times out...
We could try to increase the timeout on Main side, but that would easily bloat/overload the listener pool (just by regular traffic, not even being DDOSed yet).
We need a way to distinguish between an internal site that is still working on generating the response and the cases where it really crashed, lost the network, etc.
The best fit feels like some kind of heartbeat (every 5 seconds) during the communication.
We thought Keep-Alive would save us, but it seems to only cover the gaps between requests (not during a request), and it seems to do no heartbeating during the gap (just waiting for the timeout).
We thought chunked encoding might save us by sending some heartbeat (zero-byte chunks) to let the other side know we're alive, but there seems to be no default implementation supporting a heartbeat this way; moreover, a zero-byte chunk is itself the end-of-data indicator...
Question(s):
If we're correct in assuming that Keep-Alive/chunked encoding won't help us achieve the keep-alive/heartbeat/fast-detection-of-dead-backend, then:
1) at which layer should such a heartbeat be implemented? HTTP? TCP?
2) is there any standard framework/library/setting/etc. implementing it already? (if possible: Java, REST)
UPDATE
I've also looked into heartbeat implementers for WADL/WSDL, though found none for REST, and checked out WebSockets...
Also looked into TCP keepalives, which seem to be the right feature for the task:
https://en.wikipedia.org/wiki/Keepalive
http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html
Socket heartbeat vs keepalive
WebSockets ping/pong, why not TCP keepalive?
BUT according to those I'd have to set up something like:
tcp_keepalive_time=5
tcp_keepalive_intvl=1
tcp_keepalive_probes=3
which seems to go against the usual recommendations (2 hours is the recommended default, 10 minutes is already presented as an odd value; is going down to 5 seconds sane/safe?? if it is, that might be my solution upfront...)
Also, where should I configure this? On London04 alone, or on Main too? (If I set it up on Main, won't it flood the client-->Main frontend communication? Or might the NATs etc. between the sites easily ruin the keepalive intent/support?)
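(For completeness: Java itself only exposes the per-socket on/off switch; the probe timings above remain OS-level sysctls. A minimal sketch, with a hypothetical endpoint:)

Socket socket = new Socket("London04.internaldomain", 80);  // hypothetical host/port
socket.setKeepAlive(true);  // enables SO_KEEPALIVE; probe timing comes from the OS settings above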
P.S. any link to an RTFM is welcome - I might just be missing something obvious :)
My advice would be: don't use a heartbeat. Have your external-facing API return a 303 See Other with headers that indicate when and where the desired response might be available.
So you might call:
POST https://public.api/my/call
and get back
303 See Other
Location: "https://public.api/my/call/results"
Retry-After: 10
To the extent your server can guess how long a response will take to build, it should factor that into the Retry-After value. If a later GET call is made to the new location and the results are not yet done being built, return a response with an updated Retry-After value. So maybe you try 10, and if that doesn't work, you tell the client to wait another 110, which would be two minutes in total.
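To sketch the idea in JAX-RS terms (the paths, the in-memory job store, and the callSlowBackend helper are all hypothetical):

import java.net.URI;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import javax.ws.rs.GET;
import javax.ws.rs.POST;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.core.Response;

@Path("/my/call")
public class MyCallResource {
    // Hypothetical in-memory job store; a real service would use something durable.
    private static final Map<String, String> results = new ConcurrentHashMap<>();

    @POST
    public Response submit(String requestXml) {
        String id = UUID.randomUUID().toString();
        // Kick off the slow backend work asynchronously.
        new Thread(() -> results.put(id, callSlowBackend(requestXml))).start();
        return Response.seeOther(URI.create("/my/call/results/" + id))
                       .header("Retry-After", "10")   // first guess at readiness
                       .build();
    }

    @GET
    @Path("/results/{id}")
    public Response poll(@PathParam("id") String id) {
        String result = results.get(id);
        if (result == null) {
            // Still building: tell the client when to try again.
            return Response.status(Response.Status.ACCEPTED)
                           .header("Retry-After", "110")
                           .build();
        }
        return Response.ok(result).build();
    }

    private String callSlowBackend(String xml) {
        return "<done/>";  // placeholder for the 2-minute London04 call
    }
}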
Alternately, use a protocol that's designed to stay open for long periods of time, such as WebSockets.
Take a look at SSE (Server-Sent Events).
example code:
https://github.com/rsvoboda/resteasy-sse
or vertx event-bus:
https://vertx.io/docs/apidocs/io/vertx/core/eventbus/EventBus.html

What is the maximum connection timeout to any server?

I have a simple Spring Boot based web app that downloads data from several APIs. Some of them don't respond in time, since my connectionTimeout is set to around 4 seconds.
As soon as I get rid of the connectionTimeout setting, I get exceptions after 20 or so seconds.
So, my question is: for how long am I able to try to connect to an API, and what does it depend on? Where do those 20 seconds come from? What if an API responds after 40 minutes and I can't catch that specific moment and just lose the data? I don't want that to happen. What are my options?
Here's the code to set the connection. Nothing special.
HttpComponentsClientHttpRequestFactory clientHttpRequestFactory =
        new HttpComponentsClientHttpRequestFactory(HttpClientBuilder.create().build());
clientHttpRequestFactory.setConnectTimeout(4000);  // 4-second connection timeout
RestTemplate restTemplate = new RestTemplate(clientHttpRequestFactory);
Then I retrieve the values via:
myObject.setJsonString(restTemplate.getForObject(url, String.class));
Try increasing your timeout. 4 seconds is too little.
The request will need to connect, formulate data, and return. So 4 seconds is just for connecting; by the time the server attempts to return anything, your application has already disconnected.
Set it to 20 seconds to test it. You can set it much longer to give the API enough time to complete. This does not mean your app will use up all of the connection timeout; it will finish as soon as a result is returned. Also, APIs are not designed to take long; they will perform the task and return the result as fast as possible.
Connection timeout means that your program couldn't connect to the server at all within the time specified.
The timeout can be configured, as, like you say, some systems may take a longer time to connect to, and if this is known in advance, it can be allowed for. Otherwise the timeout serves as a guard to prevent the application from waiting forever, which in most cases doesn't really give a good user experience.
A separate timeout can normally be configured for reading data (socket timeout). They are not inclusive of each other.
To solve your problem:
Check that the server is running and accepting incoming connections.
You might want to use curl or, depending on the service, simply your browser to try to connect.
If one tool can connect but the other can't, check your firewall settings and ensure that outgoing connections from your Java program are permitted. The easiest way to test whether this is the problem is to disable antivirus and firewall tools temporarily. If this allows the connection, you'll either need to leave the firewall off or, better, add a corresponding exception.
Leave the timeout on a higher setting (or try setting it to 0, which is interpreted as infinite) while testing. Once you have it working, you can consider tweaking it to reflect your server spec and usability requirements.
Edit:
I realised that this doesn't necessarily help, as you did ultimately connect. I'll leave the above standing as general info.
for how long am I able to try to connect to an API, and what does it depend on?
Most likely the server that the API is hosted on. If it is overloaded, response time may lengthen.
Where do those 20 seconds come from?
Again this depends on the API server. It might be random, or it may be processing each request for a fixed period of time before finding itself in an error state. In this case that might take 20 seconds each time.
What if an API responds after 40 minutes and I can't catch that specific moment and just lose the data? I don't want that to happen. What are my options?
Use a more reliable API, possibly paying for a service guarantee.
Tweak your connection and socket timeouts to allow for the capabilities of the server side, if known in advance.
If the response really takes 40 minutes, it is a really poor service, but moving on with that assumption: if the dataset is that large, explore whether the API offers a streaming callback, whereby you pass an OutputStream into the API's library methods, to which it will (asynchronously) write the response when it is ready.
Keep in mind that connection and socket timeout are separate things. Once you have connected, the connection timeout becomes irrelevant (socket is established). As long as you begin to receive and continue to receive data (packet to packet) within the socket timeout, the socket timeout won't be triggered either.
Use infinite timeouts (set to 0), but this could lead to poor usability within your applications, as well as resource leaks if a server is in fact offline and will never respond. In that case you will be left with dangling connections.
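To make the connect-vs-socket distinction concrete, here is a sketch with Apache HttpClient 4.x matching the RestTemplate setup from the question (the values are illustrative):

RequestConfig config = RequestConfig.custom()
        .setConnectTimeout(4000)     // max time to establish the TCP connection
        .setSocketTimeout(30000)     // max silence between data packets once connected
        .build();
CloseableHttpClient httpClient = HttpClientBuilder.create()
        .setDefaultRequestConfig(config)
        .build();
RestTemplate restTemplate =
        new RestTemplate(new HttpComponentsClientHttpRequestFactory(httpClient));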
The default and maximum have nothing to do with the server. They depend on the client platform, but the maximum is around a minute. You can decrease it, but not increase it. Four seconds is far too short; it should be measured in tens of seconds in most circumstances.
And absent or longer connection timeouts do not cause server errors of any kind. You are barking up the wrong tree here.

Multiple Server Discovery - Java/Android

I have an application where multiple servers could exist. There are heaps of examples of how to use UDP to discover servers, but it seems this only works with a single server.
What happens if multiple responses exist? Are they queued, corrupted (with UDP pitfalls), or something else?
I would like to find out how to receive multiple responses from a UDP broadcast sent from an Android device. If this isn't viable, is there any other recommended approach for multiple-server discovery for Android clients?
Thank you
I would first send a packet to all servers you want to ask if they are there, then let all servers respond. Since you want to find out how to receive the packets, here is how I would do that:
long responseTimeout = 4000;
long start = System.currentTimeMillis();
while (true) {
    long now = System.currentTimeMillis();
    long remaining = responseTimeout - (now - start);
    if (remaining <= 0) {
        break;  // collection window has elapsed
    }
    datagramSocket.setSoTimeout((int) remaining);
    try {
        datagramSocket.receive(packet);
        addOnlineServer(packet.getAddress());  // remember each responding server
    } catch (SocketTimeoutException e) {
        break;  // no more responses within the window
    }
}
For a certain amount of time your Android client should wait for responses, adding the source IP of each received packet to a list of online servers.
Sure, some of the packets could get lost since you are using UDP, but that's what you get. If you want to be sure that no packets get lost, use TCP instead.
If you broadcast the message and the servers all reply, you should see all the responses as they come back.
However, be aware that UDP is a potentially lossy protocol and makes no guarantees at all. Over a non-wireless LAN with decent switches it is pretty safe, but as soon as it goes further than that (wireless, over multiple networks, etc.) you can expect to lose at least some packets, and any packet loss is a message loss on UDP.
The usual solution to this is to send each message a few times. So, for example, when you first start up you might broadcast at 1 second, 10 seconds, 30 seconds, and then every 10 minutes thereafter. This will find servers immediately, then sweep up any it missed fairly fast, and then finally detect any new ones that appear on the network.
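That schedule is straightforward with a ScheduledExecutorService (sendDiscoveryBroadcast() is a hypothetical method that sends the probe):

// Hypothetical re-broadcast schedule: 1 s, 10 s, 30 s, then every 10 minutes.
ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
scheduler.schedule(this::sendDiscoveryBroadcast, 1, TimeUnit.SECONDS);
scheduler.schedule(this::sendDiscoveryBroadcast, 10, TimeUnit.SECONDS);
scheduler.schedule(this::sendDiscoveryBroadcast, 30, TimeUnit.SECONDS);
scheduler.scheduleAtFixedRate(this::sendDiscoveryBroadcast, 600, 600, TimeUnit.SECONDS);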
I've not worked with this sort of system for quite a few years, but last time we did, there was a single server acting as the central point. Everything broadcast out on startup to find the central server (retrying at increasing intervals until it found it), and when the central server started up, it broadcast out to find everything, retrying 3 times.
All communication after that was done by registering with that central server and getting the list of apps etc. from there. The server essentially acted as a network directory, so anything could get a list of anything else on the network by querying it.
You should be doing the following to receive and probably also send broadcast packets (which is what you are asking for):
Make sure that network mask is correct
When you bind the socket, be sure to bind it to INADDR_ANY
Set the socket's option to BROADCAST via setsockopt
Call the function sendto(..) with sendaddr.sin_addr.s_addr = inet_addr("your_interface_broadcast_address"), or call sendto(..) several times for each interface with its broadcast IP address.
Call the function recvfrom(..), inside a while(..) loop, until you are certain "enough time has passed"; usually 1 second should suffice on a LAN.
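In Java terms, those steps might look roughly like this (the port and payload are assumptions; the timed receive loop from the first answer then collects the replies):

DatagramSocket socket = new DatagramSocket();    // bound to the wildcard address (INADDR_ANY)
socket.setBroadcast(true);                       // the SO_BROADCAST socket option
byte[] probe = "DISCOVER".getBytes();
socket.send(new DatagramPacket(probe, probe.length,
        InetAddress.getByName("255.255.255.255"), 8888));
// ...then loop on socket.receive(...) with a deadline, as in the first answer.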

Faster detection of a broken socket in Java/Android

Background
My application gathers data from the phone and sends it to a remote server.
The data is first stored in memory (or in a file when it's big enough), and every X seconds or so the application flushes that data and sends it to the server.
It's mission critical that every single piece of data is sent successfully; I'd rather send the data twice than not at all.
Problem
As a test I set up the app to send data with a timestamp every 5 seconds, meaning that every 5 seconds a new line appears on the server.
If I kill the server I expect the lines to stop; they should now be written to memory instead.
When I enable the server again I should be able to confirm that no events are missing.
The problem, however, is that when I kill the server it takes about 20 seconds for IO operations to start failing, meaning that during those 20 seconds the app happily sends the events and removes them from memory, but they never reach the server and are lost forever.
I need a way to make certain that the data actually reaches the server.
This is possibly one of the more basic TCP questions but, nonetheless, I haven't found any solution to it.
Stuff I've tried
Setting Socket.setTcpNoDelay(true)
Removing all buffered writers and just using OutputStream directly
Flushing the stream after every send
Additional info
I cannot change how the server responds, meaning I can't tell the server to acknowledge the data (beyond the mechanics of TCP, that is); the server will just silently accept the data without sending anything back.
Snippet of code
Initialization of the class:
socket = new Socket(host, port);
socket.setTcpNoDelay(true);
Where data is sent:
while (!dataList.isEmpty()) {
    String data = dataList.removeFirst();
    inMemoryCount -= data.length();
    try {
        OutputStream os = socket.getOutputStream();
        os.write(data.getBytes());
        os.flush();
    }
    catch (IOException e) {
        // Put the data back so it can be retried on the next flush.
        inMemoryCount += data.length();
        dataList.addFirst(data);
        socket = null;
        return false;
    }
}
return true;
Update 1
I'll say this again: I cannot change the way the server behaves.
It receives data over TCP and UDP and does not send any data back to confirm receipt. This is a fact, and sure, in a perfect world the server would acknowledge the data, but that will simply not happen.
Update 2
The solution posted by Fraggle works perfectly (closing the socket and waiting for the input stream to be closed).
This, however, comes with a new set of problems.
Since I'm on a phone I have to assume that the user cannot send an unlimited amount of data, and I would like to keep all data traffic to a minimum if possible.
I'm not worried by the overhead of opening a new socket; those few bytes will not make a difference. What I am worried about, however, is that every time I connect to the server I have to send a short string identifying who I am.
The string itself is not that long (around 30 characters), but that adds up if I close and open the socket too often.
One solution is to only "flush" the data every X bytes; the problem is I have to choose X wisely. If it's too big, too much duplicate data will be sent when the socket goes down, and if it's too small, the overhead is too big.
Final update
My final solution is to "flush" the socket by closing it every X bytes; if anything went wrong, those X bytes will be sent again.
This will possibly create some duplicate events on the server, but those can be filtered out there.
Dan's solution is the one I'd suggest right after reading your question; he's got my up-vote.
Now can I suggest working around the problem? I don't know if this is possible with your setup, but one way of dealing with badly designed software (this is your server, sorry) is to wrap it, or in fancy-design-pattern-talk provide a facade, or in plain talk put a proxy in front of your pain-in-the-behind server. Design a meaningful ack-based protocol, have the proxy keep enough data samples in memory to be able to detect and tolerate broken connections, etc. In short, have the phone app connect to a proxy residing somewhere on a "server-grade" machine using the "good" protocol, then have the proxy connect to the server process using the "bad" protocol. The client is responsible for generating data. The proxy is responsible for dealing with the server.
Just another idea.
Edit 0:
You might find this one entertaining: The ultimate SO_LINGER page, or: why is my tcp not reliable.
The bad news: You can't detect a failed connection except by trying to send or receive data on that connection.
The good news: As you say, it's OK if you send duplicate data. So your solution is not to worry about detecting failure in less than the 20 seconds it now takes. Instead, simply keep a circular buffer containing the last 30 or 60 seconds' worth of data. Each time you detect a failure and then reconnect, you can start the session by resending that saved data.
(This could get to be problematic if the server repeatedly cycles up and down in less than a minute; but if it's doing that, you have other problems to deal with.)
See the accepted answer here: Java Sockets and Dropped Connections
socket.shutdownOutput();
wait for inputStream.read() to return -1, indicating that the peer has also shut down its socket
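A rough sketch of that flush-by-close pattern (the method is hypothetical; note that read() returning -1 confirms the TCP close handshake, i.e. the remote stack received the bytes, not that the application processed them):

boolean sendBatch(Socket socket, byte[] batch) {
    try {
        socket.getOutputStream().write(batch);
        socket.getOutputStream().flush();
        socket.shutdownOutput();                // half-close: sends FIN, no more writes
        // Blocks until the peer closes its side; -1 means the close handshake
        // completed and our bytes reached the remote TCP stack.
        return socket.getInputStream().read() == -1;
    } catch (IOException e) {
        return false;                           // keep the batch and retry on a new socket
    } finally {
        try { socket.close(); } catch (IOException ignored) { }
    }
}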
Won't work: server cannot be modified
Can't your server acknowledge every message it receives with another packet? The client then wouldn't remove the messages that the server has not yet acknowledged.
This will have performance implications. To avoid slowing down, you can keep sending messages before an acknowledgement is received, and acknowledge several messages in one return message.
If you send a message every 5 seconds, and a disconnection is not detected by the network stack for 30 seconds, you'll have to store just 6 messages. If 6 sent messages are not acknowledged, you can consider the connection to be down. (I suppose the logic of reconnection and backlog sending is already implemented in your app.)
What about sending UDP datagrams on a separate UDP socket, making the remote host respond to each, and killing the TCP connection when the remote host stops responding? It detects a link breakage quickly enough :)
Use HTTP POST instead of a raw socket connection; then you can send a response to each post. On the client side you only remove the data from memory if the response indicates success.
Sure, it's more overhead, but it gives you what you want 100% of the time.

Apache HttpClient random delays under high requests/second

I'm using Apache HttpClient to query an HTTP/1.0 (without keep-alive) server on localhost with around 20 POST requests/second. I have TCP_NODELAY enabled like this:
val httpParams = new BasicHttpParams()
HttpConnectionParams.setTcpNoDelay(httpParams, true)
val client = new DefaultHttpClient(connectionManager, httpParams)
Despite that, several times per minute I see a random delay of 10-200 milliseconds when sending such a request (measuring on the server shows that the delay is in the sending). I checked that it's not a garbage collector pause. What can be the problem?
I tried querying the server with a C++ client at the same rate, and it doesn't have such random delays, so I think it's an HttpClient problem.
Update:
I checked the Jetty HttpClient implementation, and it has the same problem. Could this be a problem with the JVM on FreeBSD? I should test this on Linux, but I don't have a Linux server at hand.
I have the same problem here using Windows.
In my case, HttpClient was introducing around 1 second of delay, but only on the first attempt to execute a post (I do several in sequence). I tried a workaround, which was to create a "fake" post to the local host and execute it (expecting an IOException). By doing that, my delay when calling real services was reduced from around 1 second to around 100 ms.
I have not managed to improve on that yet.
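A sketch of that warm-up trick (the target port is arbitrary; anything that fails fast will do):

// Burn one request against a port where nothing listens, so class loading and
// client initialization happen before the first real call.
try {
    client.execute(new HttpPost("http://localhost:1"));
} catch (IOException expected) {
    // Connection refused is the point; the client is now warmed up.
}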
