Messaging latency in java (with zeromq)

Messaging latency in java (with zeromq) - java

I just ran the zeroMQ hello world example and timed the request-response latency. It averaged about 0.1ms running using the IPC protocol. This sounds quite slow to me....Does this sound about right?
long start=System.nanoTime();
socket.send(request, 0);
// Get the reply.
byte[] reply = socket.recv(0);
System.out.println((System.nanoTime()-start)/1000000.0);

I assume your average had a sample of more than one? I would run the test for at least 2-10 seconds before taking an average. The average latency in the same process/thread may be misleading.
I would create a second process which echo everything it gets if you are not doing this already. (And divide the latency in two unless you want the RTT latency)
Plain Sockets can get a RTT latency of 20 micro-seconds on a typical multi-core box and I would expect IPC to be faster. On a fast PC you can get a typical RTT latency of 9 micro-second using sockets.
If you want latency much lower than this, I would consider doing everything in one process or one thread if you can, in which case the cost of a method call is around 10 ns (if its not inlined ;)

Related

How to measure one-way latency?

I wanna measure the time the troughput takes from my client to my server. Currently i can only measure the full trip (from client to server and back to client) i can do this by measuring the time before we send a packet and measuring it after we receive it back from the server. Technically speaking, if i were to devide the full trip time i would get an avarage of each one-way throughput.
But what if some throughput actually took longer to arrive like this:
In the image i created the throughput from client to server is 30 ms and from server to client 90 ms. If the data would have such arrival rates then measuring the full round trip and dividing it by 2 would not give an accurate one-way arrival time. How can i accurately measure one-way arrival times?

How can i accurately measure one-way arrival times
TL;DR - YOU CAN'T
This is actually a very deep philosophical question that is unanswered at the very core of physics (the rock bottom "metal" of the universe). Physics does not unequivocally know the one-way speed of light, only the two-way speed. No experiment we have so far devised can answer that question. See The One-Way Speed of Light.
Although the average speed over a two-way path can be measured, the one-way speed in one direction or the other is undefined (and not simply unknown), unless one can define what is "the same time" in two different locations. To measure the time that the light has taken to travel from one place to another it is necessary to know the start and finish times as measured on the same time scale. This requires either two synchronized clocks, one at the start and one at the finish, or some means of sending a signal instantaneously from the start to the finish. No instantaneous means of transmitting information is known. Thus the measured value of the average one-way speed is dependent on the method used to synchronize the start and finish clocks. This is a matter of convention.
You can get arbitrarily close for non-relativistic situations by synchronizing clocks, but how do you know the clocks stay synchronized? For your case you'd have to agree to synchronize on the same time signal, but propagation delays can introduce tens to hundreds of milliseconds of delay and jitter.
So if you want to pin down one-way times to an accuracy less than clock jitter you're out of luck. Here's the output from ntpq peer on one of my Linux systems.
remote refid st t when poll reach delay offset jitter
==============================================================================
*unifi.versadns. 71.66.197.233 2 u 289 1024 377 2.298 -0.897 0.747
+eterna.binary.n 68.97.68.79 2 u 615 1024 377 42.258 -3.640 0.430
+homemail.org 139.78.97.128 2 u 160 1024 377 45.257 -0.209 0.391
-time.skylineser 130.207.244.240 2 u 418 1024 103 24.829 2.066 1.376
You might be able to pin down one-way time to within 5ms if both systems use the same master clock for synchronization and have had enough time for delay and jitter to stabilize.

OWAMP - owping
you need to have client and server running,
use NTP or chrony to Synchronize time.
There are expensive hardware to sync time between machines as well.

Scalable and High performance message channel

I am developing agents to collect data from different sources, the data should be posted to a channel at high frequency (say every 15 seconds). REST is definitely not a solution. The requirement is clearly fire and forget as status reply is not concerned.
Throughput is more important, message drops are acceptable upto 5%.
Possible solutions I come across are
Message Bus
Multicast
UDP
Any alternatives, please suggest.

IMHO High frequency is too fast too see and 15 seconds you can see. It takes about 0.5 seconds to send a message round the world and back again. You can just about see 15 milli-seconds. And if you are talking about 15 micro-seconds, that is definitely high frequency. I have a persisted messaging solution with a latency of around 0.1 micro-seconds which is 0.0000001 seconds, but I don't suggest you need that.
If all you need is a message every 15 seconds I would use the simplest solution which comes to mind. I would try ActiveMQ which I found to be one of the simplest to get working. You should be able to achieve message rates of up to 20,000 per seconds and decent latencies of about 0.01 seconds and you shouldn't lose any messages.

TCP/IP application Keepalive size, and bandwidth overhead

I'm writing a Java server (java.net.Socket, java.net.ServerSocket, java.io.ObjectOutputStream, java.io.ObjectInputStream) and I know I'm going to have limited bandwidth allocated for it.
I've written a decorator object for my output and input streams so I can count how many bytes go through it for profiling purposes. But this won't give me any indication of the amount of overhead I'm using for the connection.
I don't anticipate it will be much, but I'd like to prepare for it. I'm not going try to optimize it, I just want to know how much it will be for logistical reasons (how much bandwidth must I request, etc.)
I can't be the first person to try to get this information, but I can't seem to find good resources on the overhead of Java Sockets and TCP/IP in general. (Perhaps that's because there's nothing noteworthy to find... If we're on the order of kb per minute, it's really not much of a concern, but I'd still like to know!)
Thanks!

This question is challenging to answer with the information we have right now... for instance, what are you calling 'overhead'? Is it only TCP ACK packets, or all packet overhead (for instance ethernet, IP and tcp headers) for anything other than your data payload?
How many connections per minute? What is the average data transfer, per connection? If there are many very short-lived connections, your overhead requirements go up (due to 3-way handshake, and connection close requirements)... you could also have high overhead if the clients don't read much data, but many clients keep the connections open for days at a time.
Honestly, you're 50x better off modeling this in a lab and making some assumptions about hit rate per minute and concurrent clients... that will give you some ballpark numbers. Play around with limiting the bandwidth afforded to the application to the maximum your budget would allow... then start backing off... you can throttle bandwidth by using wanem on a dual-port linux machine.
Getting lab results like this is far better than theoretical calculations.
HTH,
\mike (who spends all day testing network gear)

TCP overhead varies based on a number of factors, but is typically around 5% at full capacity.
Basically each "packet" has 20 bytes of IP header (and 20 more if IPv6) plus 20-32 bytes of TCP header. Packet sizes vary based on the network devices and conditions, but are often in the neighborhood of 1500 bytes.
This page has some detail: http://sd.wareonearth.com/~phil/net/overhead/
In my opinion you can completely ignore keep-alives, as they are only used when the connection is idle anyway.

Benchmarking Apache Mina Total Bandwidth

I am developing a relatively fast paced game (Flash/Apache Mina Server back end) and I am having some difficulty getting an accurate benchmark of the type of bandwidth my current setup would use.
My question is: How do I get an accurate benchmark of the bandwidth required for my tests? What I am doing now wouldn't take into account any overhead?
On the message sent/received methods I am doing
[out/in]Bandwidth+= message.toString().getBytes().length;
I then print out the current values every 250 milliseconds (since that is how frequently "world" updates are done currently) .
With 10 "monsters" all randomly moving around and 1 player randomly moving around I am getting this output.. (1 second window here)
In bandwidth: 1647, Outgoing: 35378
In bandwidth: 1658, Outgoing: 35585
In bandwidth: 1669, Outgoing: 35792
In bandwidth: 1680, Outgoing: 35999
So acting strictly on the size of the messages (outgoing) being passed that works out to about 621 bytes/second or (621/10) 62.1 bytes per second per constantly moving item on screen per person. This seems a little low, a good high speed connection could handle 1000+ object updates per second at this "rate" no problem.

Something definitely smells fishy here. According to the performance testing provided by them: here mina is capable of 20K+ 405 byte requests per second on ~10 connections - way more than what you're seeing.
My guess is that there is some kind of theading\timing issue going on here that is causing the delay. I would enlist the help of a packet tracing application such as wireshark and see if your observations in code mesh with the raw network data. I would also try "flooding" the server side with more data if possible - this might provide some insight to where the issue lies.
I hope this helps, good luck.

fastest (low latency) method for Inter Process Communication between Java and C/C++

I have a Java app, connecting through TCP socket to a "server" developed in C/C++.
both app & server are running on the same machine, a Solaris box (but we're considering migrating to Linux eventually).
type of data exchanged is simple messages (login, login ACK, then client asks for something, server replies). each message is around 300 bytes long.
Currently we're using Sockets, and all is OK, however I'm looking for a faster way to exchange data (lower latency), using IPC methods.
I've been researching the net and came up with references to the following technologies:
shared memory
pipes
queues
as well as what's referred as DMA (Direct Memory Access)
but I couldn't find proper analysis of their respective performances, neither how to implement them in both JAVA and C/C++ (so that they can talk to each other), except maybe pipes that I could imagine how to do.
can anyone comment about performances & feasibility of each method in this context ?
any pointer / link to useful implementation information ?
EDIT / UPDATE
following the comment & answers I got here, I found info about Unix Domain Sockets, which seem to be built just over pipes, and would save me the whole TCP stack.
it's platform specific, so I plan on testing it with JNI or either juds or junixsocket.
next possible steps would be direct implementation of pipes, then shared memory, although I've been warned of the extra level of complexity...
thanks for your help

Just tested latency from Java on my Corei5 2.8GHz, only single byte send/received,
2 Java processes just spawned, without assigning specific CPU cores with taskset:
TCP - 25 microseconds
Named pipes - 15 microseconds
Now explicitly specifying core masks, like taskset 1 java Srv or taskset 2 java Cli:
TCP, same cores: 30 microseconds
TCP, explicit different cores: 22 microseconds
Named pipes, same core: 4-5 microseconds !!!!
Named pipes, taskset different cores: 7-8 microseconds !!!!
so
TCP overhead is visible
scheduling overhead (or core caches?) is also the culprit
At the same time Thread.sleep(0) (which as strace shows causes a single sched_yield() Linux kernel call to be executed) takes 0.3 microsecond - so named pipes scheduled to single core still have much overhead
Some shared memory measurement:
September 14, 2009 – Solace Systems announced today that its Unified Messaging Platform API can achieve an average latency of less than 700 nanoseconds using a shared memory transport.
http://solacesystems.com/news/fastest-ipc-messaging/
P.S. - tried shared memory next day in the form of memory mapped files,
if busy waiting is acceptable, we can reduce latency to 0.3 microsecond
for passing a single byte with code like this:
MappedByteBuffer mem =
new RandomAccessFile("/tmp/mapped.txt", "rw").getChannel()
.map(FileChannel.MapMode.READ_WRITE, 0, 1);
while(true){
while(mem.get(0)!=5) Thread.sleep(0); // waiting for client request
mem.put(0, (byte)10); // sending the reply
}
Notes: Thread.sleep(0) is needed so 2 processes can see each other's changes
(I don't know of another way yet). If 2 processes forced to same core with taskset,
the latency becomes 1.5 microseconds - that's a context switch delay
P.P.S - and 0.3 microsecond is a good number! The following code takes exactly 0.1 microsecond, while doing a primitive string concatenation only:
int j=123456789;
String ret = "my-record-key-" + j + "-in-db";
P.P.P.S - hope this is not too much off-topic, but finally I tried replacing Thread.sleep(0) with incrementing a static volatile int variable (JVM happens to flush CPU caches when doing so) and obtained - record! - 72 nanoseconds latency java-to-java process communication!
When forced to same CPU Core, however, volatile-incrementing JVMs never yield control to each other, thus producing exactly 10 millisecond latency - Linux time quantum seems to be 5ms... So this should be used only if there is a spare core - otherwise sleep(0) is safer.

DMA is a method by which hardware devices can access physical RAM without interrupting the CPU. E.g. a common example is a harddisk controller which can copy bytes straight from disk to RAM. As such it's not applicable to IPC.
Shared memory and pipes are both supported directly by modern OSes. As such, they're quite fast. Queues are typically abstractions, e.g. implemented on top of sockets, pipes and/or shared memory. This may look like a slower mechanism, but the alternative is that you create such an abstraction.

The question was asked some time ago, but you might be interested in https://github.com/peter-lawrey/Java-Chronicle which supports typical latencies of 200 ns and throughputs of 20 M messages/second. It uses memory mapped files shared between processes (it also persists the data which makes it fastest way to persist data)

Here's a project containing performance tests for various IPC transports:
http://github.com/rigtorp/ipc-bench

A late arrival, but wanted to point out an open source project dedicated to measuring ping latency using Java NIO.
Further explored/explained in this blog post. The results are(RTT in nanos):
Implementation, Min, 50%, 90%, 99%, 99.9%, 99.99%,Max
IPC busy-spin, 89, 127, 168, 3326, 6501, 11555, 25131
UDP busy-spin, 4597, 5224, 5391, 5958, 8466, 10918, 18396
TCP busy-spin, 6244, 6784, 7475, 8697, 11070, 16791, 27265
TCP select-now, 8858, 9617, 9845, 12173, 13845, 19417, 26171
TCP block, 10696, 13103, 13299, 14428, 15629, 20373, 32149
TCP select, 13425, 15426, 15743, 18035, 20719, 24793, 37877
This is along the lines of the accepted answer. System.nanotime() error (estimated by measuring nothing) is measured at around 40 nanos so for the IPC the actual result might be lower. Enjoy.

If you ever consider using native access (since both your application and the "server" are on the same machine), consider JNA, it has less boilerplate code for you to deal with.

I don't know much about native inter-process communication, but I would guess that you need to communicate using native code, which you can access using JNI mechanisms. So, from Java you would call a native function that talks to the other process.

In my former company we used to work with this project, http://remotetea.sourceforge.net/, very easy to understand and integrate.

Have you considered keeping the sockets open, so the connections can be reused?

Oracle bug report on JNI performance: http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4096069
JNI is a slow interface and so Java TCP sockets are the fastest method for notification between applications, however that doesn't mean you have to send the payload over a socket. Use LDMA to transfer the payload, but as previous questions have pointed out, Java support for memory mapping is not ideal and you so will want to implement a JNI library to run mmap.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Messaging latency in java (with zeromq) - java

Related

How to measure one-way latency?

Scalable and High performance message channel

TCP/IP application Keepalive size, and bandwidth overhead

Benchmarking Apache Mina Total Bandwidth

fastest (low latency) method for Inter Process Communication between Java and C/C++

Categories

Resources