Java TCP/IP Socket write performance optimization - java

Server Environment
Linux/RedHat
6 cores
Java 7/8
About the application:
We are working on developing a low-latency (7-8 ms) high-speed trading platform in Java. Multi-leg orders are sent after the algo conditions are met.
Problem
Orders are sent to the exchange over TCP/IP using the java.net.Socket API (via java.io.OutputStream.write(byte[])). The profiler records this call at 5-7 microseconds, which is very high for our low-latency requirements. We have not made use of the setPerformancePreferences() API suggested in another question posted on Stack Overflow.
Question
Are there any alternatives to java.net.Socket that reduce the socket transmission time?
Are there any optimization techniques to improve performance?
Is setPerformancePreferences() of any use?

We have not made use of the setPerformancePreferences() API
It doesn't do anything and never has. I wouldn't worry about it.
Any alternatives to java.net.Socket to reduce the socket transmission time?
The problem is almost certainly not a software one. You can get under 8 microseconds from Java to Java on different machines, but you need low-latency network cards like Solarflare or Mellanox.
If you want fast processing you should consider either a high-GHz Haswell processor, possibly overclocked to 4.2 or 4.5 GHz, or a dual-socket Haswell Xeon. The cost of these compared to the cost of trading is not high.
Any optimization techniques to improve performance
Use non-blocking NIO, i.e. ByteBuffers with busy-waiting on the socket connections. (I wouldn't use Selectors, as they add quite a bit of overhead.) I would also turn off Nagle's algorithm.
For some micro-tuning, use an affinity-bound thread on an isolated CPU.
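The busy-waiting, Nagle-off setup described above can be sketched roughly as follows. This is a loopback demo with a hypothetical payload, not a production order gateway; class and payload names are made up for illustration:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.StandardSocketOptions;
import java.nio.ByteBuffer;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;

public class BusyWaitSender {

    // Write the whole buffer, busy-spinning instead of blocking or using a Selector.
    static void busyWrite(SocketChannel ch, ByteBuffer buf) throws IOException {
        while (buf.hasRemaining()) {
            ch.write(buf); // returns 0 when the send buffer is full; we just spin
        }
    }

    public static void main(String[] args) throws IOException {
        // Loopback server, for demonstration purposes only.
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress("127.0.0.1", 0));

            SocketChannel client = SocketChannel.open(server.getLocalAddress());
            client.setOption(StandardSocketOptions.TCP_NODELAY, true); // turn off Nagle
            client.configureBlocking(false);                           // non-blocking writes

            SocketChannel peer = server.accept();

            ByteBuffer order = ByteBuffer.wrap("NEW ORDER".getBytes(StandardCharsets.US_ASCII));
            busyWrite(client, order);

            ByteBuffer in = ByteBuffer.allocate(64);
            while (in.position() < 9) {
                peer.read(in); // wait until all 9 bytes have arrived
            }
            in.flip();
            System.out.println(StandardCharsets.US_ASCII.decode(in)); // prints NEW ORDER

            client.close();
            peer.close();
        }
    }
}
```

In a real deployment the busy-spinning thread would be pinned to an isolated core, as suggested above.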
Is setPerformancePreferences() of any use?
Looking at the source... I will let you be the judge:
public void setPerformancePreferences(int connectionTime,
                                      int latency,
                                      int bandwidth) {
    /* Not implemented yet */
}
Java 7/8
In terms of which version to use, I would start with Java 8, as it has much improved escape analysis, which can reduce garbage from short-lived objects and thus help reduce latency between GCs and jitter from GCs.

A couple of things come to mind:
JNI: JNI lets you write C code that is run from your Java code. Critical parts of your Java code that are running too slow can be migrated to C/C++ for improved performance. Work would be needed to first identify what those critical points are and whether it's worth the effort to move them to C/C++.
Java Unsafe: Wanna get dangerous? Use Java Unsafe to bypass that pesky GC. Here is more info on it. On GitHub you may find some cool wrapper code to use Java Unsafe more safely. Here is one. More info.
LMAX Disruptor: Read more about it here. This company is also building a fast trading system in Java. Disruptor allows for faster inter-thread communication.
Bytecode scrutiny: Review your code by looking at the bytecode. I have done this for a video game I made and was able to streamline the code. You'll need a good tool for turning your class files into readable bytecode. This might be the tool I used.
Improved garbage collection: Have you tried the G1 garbage collector? Or experimented with the older GCs?
Highscalability: This site is full of good info on making code fast. Here is an example that might help.
New API: I don't know exactly how to use it, but it has come up in articles I have read. Here is another article on it. You might need to use it via JNI.
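To illustrate the LMAX Disruptor item above: the core idea is a pre-allocated ring buffer with sequence counters and busy-spinning instead of locks. Here is a toy single-producer/single-consumer sketch of that idea (not the real LMAX library, which is far more sophisticated and is what you should actually use):

```java
import java.util.concurrent.atomic.AtomicLong;

// Toy single-producer/single-consumer ring buffer illustrating the Disruptor idea:
// pre-allocated slots, sequence counters, and busy-spinning instead of locks.
public class SpscRing {
    private final long[] slots;
    private final int mask;
    private final AtomicLong head = new AtomicLong(); // next slot to write
    private final AtomicLong tail = new AtomicLong(); // next slot to read

    SpscRing(int sizePowerOfTwo) {
        slots = new long[sizePowerOfTwo];
        mask = sizePowerOfTwo - 1;
    }

    void publish(long value) {
        long seq = head.get();
        while (seq - tail.get() >= slots.length) { /* spin: ring full */ }
        slots[(int) (seq & mask)] = value;
        head.lazySet(seq + 1); // make the slot visible to the consumer
    }

    long take() {
        long seq = tail.get();
        while (head.get() <= seq) { /* spin: ring empty */ }
        long v = slots[(int) (seq & mask)];
        tail.lazySet(seq + 1);
        return v;
    }

    public static void main(String[] args) throws InterruptedException {
        SpscRing ring = new SpscRing(1024);
        Thread producer = new Thread(() -> {
            for (long i = 0; i < 100_000; i++) ring.publish(i);
        });
        producer.start();
        long sum = 0;
        for (long i = 0; i < 100_000; i++) sum += ring.take();
        producer.join();
        System.out.println(sum); // sum of 0 .. 99_999
    }
}
```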

Related

Kernel bypass with Java 7?

I read in a couple of comments on SO that Java 7 supports kernel bypass. However, when googling the topic I did not see any immediate examples of this.
Does anyone have an example of Java 7 performing kernel bypass? I'd be interested to see it.
The Answers to this related Question mention that SolarFlare has Java bindings: Networking with Kernel Bypass in Java.
As far as Java 7 is concerned, there is no support for this kind of thing in the core libraries. Kernel bypass is too system / vendor specific for inclusion in the standard APIs.
You can do other things to improve network throughput in Java that don't involve kernel bypass. For instance using the NIO Buffer and Channel APIs ... However, your typical Java "framework" tends to get in the way of this ... by only exposing Stream / Reader and other high level I/O abstractions to "application" code.
(I would also opine that if you have an application where network latency and throughput are critical enough for kernel bypass to be worthwhile, you should use a programming language that is "closer to the metal". Java is better for applications where the biggest problem is application complexity ... NOT moving lots of bits through the network fast.)
Take a look at the Onload Extensions API JNI Wrapper on github. The author seems to specialize in kernel bypass.
Kernel bypass is a method of avoiding the kernel when reading from or writing to external data sources, e.g. files or the network.
Instead, you access the data source directly without letting all the bytes run through the OS kernel. This is usually faster but also less safe, since the process is no longer supervised by the operating system.
Assumption:
In the context of Java, the "kernel" could also be taken to mean the JVM.
I have found a very good article on this.

Why is Erlang slower than Java on all these small math benchmarks?

While considering alternatives to Java for a distributed/concurrent/failover/scalable backend environment, I discovered Erlang. I've spent some time on books and articles, where nearly all the authors (even Java-devoted ones) say that Erlang is a better choice in such environments, as many useful things come out of the box in a less error-prone way.
I was sure that Erlang is faster in most cases, mainly because of a different garbage-collection strategy (per process), the absence of shared state (between threads and processes), and more compact data types. But I was very surprised when I found comparisons of Erlang vs Java on math samples where Erlang is slower by one to two orders of magnitude, i.e. from 10x to 100x.
Even on concurrent tasks, both on several cores and a single one.
What are the reasons for that? These answers came to mind:
Usage of Java primitives (=> no heap/gc) on most of the tasks
Same number of threads in Java code and Erlang processes so the actor model has no advantage here
Or just that Java is statically typed, while Erlang is not
Something else?
If that's because these are very specific math algorithms, can anybody show more real/practice performance tests?
UPDATE: The answers so far can be summarized as: Erlang is not the right tool for this specific "fast Java case". But what is still unclear to me is the main reason for Erlang's inefficiency here: dynamic typing, GC, or poor native compilation?
Erlang was not built for math. It was built with communication, parallel processing and scalability in mind, so testing it for math tasks is a bit like testing if your jackhammer gives you refreshing massage experience.
That said, let's go off-topic a little:
If you want Erlang-style programming in JVM, take a look at Scala Actors or Akka framework or Vert.x.
Benchmarks are never good for saying anything else than what they are really testing. If you feel that a benchmark is only testing primitives and a classic threading model, that is what you get knowledge about. You can now with some confidence say that Java is faster than Erlang on mathematics on primitives as well as the classic threading model for those types of problems. You don't know anything about the performance with large number of threads or for more involved problems because the benchmark didn't test that.
If you are doing the types of math that the benchmark tested, go with Java because it is obviously the right tool for that job. If you want to do something heavily scalable with little to no shared state, find a benchmark for that or at least re-evaluate Erlang.
If you really need to do heavy math in Erlang, consider using HiPE (consider it anyway for that matter).
I took an interest in this, as some of the benchmarks are a perfect fit for Erlang, such as gene sequencing. So on http://benchmarksgame.alioth.debian.org/ the first thing I did was look at the reverse-complement implementations, for both C and Erlang, as well as the testing details. I found that the test is biased, because it does not discount the time it takes Erlang to start the VM with its schedulers; natively compiled C starts much faster. The way those benchmarks measure is basically:
time erl -noshell -s revcomp5 main < revcomp-input.txt
Now the benchmark says Java took 1.4 seconds and Erlang with HiPE took 11. Running the (single-threaded) Erlang code took me 0.15 seconds, and if you discount the time it takes to start the VM, the actual workload took only 3000 microseconds (0.003 seconds).
So I have no idea how that is benchmarked. If it's run 100 times, the result makes no sense, as the cost of starting the Erlang VM is multiplied by 100. If the input is much longer than given, it would make sense, but I see no details about that on the web page. To make the benchmarks fairer for managed languages, have the code (Erlang/Java) send a Unix signal to the Python process (which does the benchmarking) when it hits the startup function.
Now, benchmarks aside, the Erlang VM essentially just executes machine code in the end, just as the Java VM does. So there is no reason a math operation should take longer in Erlang than in Java.
What Erlang is bad at is data that needs to mutate often, such as a chained block cipher. Say you have the chars "0123456789"; your encryption XORs the first 2 chars by 7, then XORs the next two chars by the sum of the first two, then XORs the previous 2 chars by the result of the current 2 subtracted, then XORs the next 4 chars, etc.
Because objects in Erlang are immutable, the entire char array needs to be copied each time you mutate it. That is why Erlang supports things called NIFs, C code you can call into to solve this exact problem. In fact, all the encryption (SSL, AES, Blowfish, ...) and compression (zlib, ...) that ship with Erlang are implemented in C, and there is near-zero cost associated with calling C from Erlang.
So using Erlang you get the best of both worlds: the speed of C with the parallelism of Erlang.
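The chained-XOR idea above can be sketched in Java, where in-place mutation of a byte array is cheap. The chaining rule here is simplified for illustration; it is a toy, not a real cipher:

```java
// Simplified chained XOR over a mutable byte array -- the kind of
// sequential, mutating workload the answer says suits C/Java better
// than Erlang's immutable binaries. (Illustrative only, not a real cipher.)
public class ChainedXor {
    static void encrypt(byte[] data, byte key) {
        byte prev = key;
        for (int i = 0; i < data.length; i++) {
            data[i] ^= prev;   // mutate in place: no copying of the array
            prev = data[i];    // chain: the next byte depends on this result
        }
    }

    static void decrypt(byte[] data, byte key) {
        byte prev = key;
        for (int i = 0; i < data.length; i++) {
            byte cur = data[i];
            data[i] ^= prev;
            prev = cur;
        }
    }

    public static void main(String[] args) {
        byte[] msg = "0123456789".getBytes(java.nio.charset.StandardCharsets.US_ASCII);
        encrypt(msg, (byte) 7);
        decrypt(msg, (byte) 7);
        System.out.println(new String(msg, java.nio.charset.StandardCharsets.US_ASCII)); // prints 0123456789
    }
}
```

In Erlang, each step of such a chain would copy the binary; in Java the same array is rewritten in place.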
If I were to implement the reverse-complement in the FASTEST way possible, I would write the mutating code in C but the parallel code in Erlang. Assuming infinite input, I would have Erlang split on the >
<<Line/binary, ">", Rest/binary>> = read_stream
Dispatch the block to the first available scheduler via round robin, consisting of infinite EC2 private networked hidden nodes, being added in real time to the cluster every millisecond.
Those nodes then call out to C via NIFs for processing (C was the fastest implementation of reverse-complement on the alioth website), then send the output back to the node master, which sends it on to the requester.
To implement all this in Erlang, I would write code as if I were writing a single-threaded program; it would take me under a day to create it.
To implement this in Java, I would have to write the single-threaded code, take the performance hit of calling from managed to unmanaged code (as we will obviously be using the C implementation for the grunt work), then rewrite it to support 64 cores. Then rewrite it to support multiple CPUs. Then rewrite it again to support clustering. Then rewrite it again to fix memory issues.
And that is Erlang in a nutshell.
As pointed out in other answers, Erlang is designed to solve real-life problems effectively, and those are quite different from benchmark problems.
But I'd like to highlight one more aspect: the pithiness of Erlang code (which in some cases means rapid development), which can easily be seen by comparing the benchmark implementations.
For example, k-nucleotide benchmark:
Erlang version: http://benchmarksgame.alioth.debian.org/u64q/program.php?test=knucleotide&lang=hipe&id=3
Java version: http://benchmarksgame.alioth.debian.org/u64q/program.php?test=knucleotide&lang=java&id=3
If you want more real-life benchmarks, I'd suggest reading Comparing C++ And Erlang For Motorola Telecoms Software.
The Erlang solution uses ETS, Erlang Term Storage, which is like an in-memory database running in a separate process. Because it is in a separate process, all messages to and from that process must be serialized/deserialized. This would account for a lot of the slowness, I should think. For example, if you look at the "regex-dna" benchmark, Erlang is only slightly slower than Java there, and it doesn't use ETS.
The fact that Erlang has to allocate memory for every value, whereas in Java you will typically reuse variables if you want the code to be fast, means Java will always be faster on "tight loop" benchmarks.
It would be interesting to benchmark a Java version using the -client flag and boxed primitives and compare that to Erlang.
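A rough sketch of that comparison (the class name, method names, and iteration count are arbitrary; run with -client to match the suggestion). Boxing allocates a new Long on every addition, which is closer to what Erlang pays for each value:

```java
// Rough sketch: summing with boxed Longs versus primitive longs.
// Boxing allocates on every iteration; the primitive loop allocates nothing.
public class BoxedVsPrimitive {
    static long timeMillis(Runnable r) {
        long t0 = System.nanoTime();
        r.run();
        return (System.nanoTime() - t0) / 1_000_000;
    }

    public static void main(String[] args) {
        final int n = 5_000_000;

        long boxedMs = timeMillis(() -> {
            Long sum = 0L;                        // boxed: a new Long per addition
            for (long i = 0; i < n; i++) sum += i;
            System.out.println("boxed sum = " + sum);
        });

        long primMs = timeMillis(() -> {
            long sum = 0L;                        // primitive: no allocation at all
            for (long i = 0; i < n; i++) sum += i;
            System.out.println("primitive sum = " + sum);
        });

        System.out.println("boxed " + boxedMs + " ms, primitive " + primMs + " ms");
    }
}
```

The timing numbers will vary by JVM and warm-up; a proper comparison would use a harness such as JMH.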
I believe using HiPE is unfair, since it is not an active project. I would be interested to know whether any mission-critical software runs on it.
I don't know anything about Erlang, but this seems to be an apples-to-oranges comparison anyway. You must be aware that considerable effort was spent over more than a decade to bring Java performance to the point where it is today.
It's not surprising (to me) that a language implementation done by volunteers or a small company cannot match that effort.

Is it possible to develop a real time data acquisition system in Windows XP with microsecond time precision?

For one of my projects I need to develop software that acquires 2000 samples in 100 milliseconds from a parallel port after receiving a trigger from the application. That means the parallel port needs to be read at 50-microsecond intervals. The data frequency is set to 10 kHz, so this acquisition process needs to be real-time with microsecond precision.
I am trying to program in Java. So far I have been able to acquire data from the parallel port but struggling hard to maintain the time interval.
My question is: is it really possible to do this under Windows XP with such (microsecond) precision? If yes, can you please point me to some guidelines/resources?
Any help would be greatly appreciated.
It depends on if your software has to work reliably or just in most cases.
With a normal Java VM, you cannot predict the garbage collector behaviour, so you basically have no means to prevent the VM from interrupting the execution of your software for any arbitrary period of time.
There are possibilities to implement real time software in Java using a VM with extensions for the "Real-Time Specification for Java" (JSR-1), but AFAIK there are no implementations for Windows, since Windows itself has no real time capabilities. The former reference implementation from Sun (now maintained by Oracle) runs on Solaris and RT enabled Linux versions and there are other implementations for embedded systems.
System.nanoTime() returns the time with nanosecond resolution, so you can use it for microsecond measurements as well.
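A minimal sketch of pacing the 50-microsecond sampling loop the question describes, using System.nanoTime() and a busy-wait. Note that nanosecond resolution is not the same as accuracy: on a non-realtime OS the scheduler can still preempt the thread, so individual intervals are best-effort only:

```java
// Busy-wait pacing with System.nanoTime(): aim for one "sample" every 50 us.
// nanoTime() has nanosecond resolution, but on a non-realtime OS the thread
// can still be preempted, so individual intervals are not guaranteed.
public class MicroPacer {
    public static void main(String[] args) {
        final long intervalNanos = 50_000; // 50 microseconds
        final int samples = 1_000;

        long start = System.nanoTime();
        long next = start;
        for (int i = 0; i < samples; i++) {
            next += intervalNanos;
            while (System.nanoTime() < next) { /* spin until the slot */ }
            // read the parallel port / acquire one sample here
        }
        long elapsedMicros = (System.nanoTime() - start) / 1_000;
        // 1000 samples * 50 us = at least 50,000 us in total
        System.out.println("elapsed us: " + elapsedMicros);
    }
}
```

Scheduling against an absolute deadline (next += interval) rather than sleeping a relative amount prevents drift from accumulating across iterations.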
If you are not committed to doing this exclusively with Windows, it might be effective to add some additional hardware. Although the hardware could be as elaborate as an FPGA, you might be able to do it with something as simple as 8-bit microcontroller, such as the Atmega328 used in the Arduino boards and IDE. With that, you could sample the input, buffer the data, and drive the parallel port to the PC, with the microcontroller acting as a FIFO between the real-time data source and the close-to-real-time PC data consumption.
The Atmega328 has only 2K of RAM, so you would need to determine if that is enough to cover the PC's dead times. If not, there are similar microcontrollers with larger RAM.
The Arduino UNO is available for
In case you are unfamiliar with microcontrollers: there is no OS under the skin to introduce uncertainty, and no system tasks being maintained "invisibly". Writing code is more like writing a kernel driver (in that you need to be aware of real-time considerations, interrupt processing, and maintaining the integrity of the CPU state), but the surrounding system is as simple as possible.
-- Carl
There are third party realtime extensions for Windows. Here's one: http://www.directinsight.co.uk/products/venturcom/rtx.html
They claim they deliver hard real time. Now, whether Java is appropriate for realtime programming to begin with is a big hard question. There's garbage collection, JIT compilation might get in the way... I'd say - stick to native code for the time critical parts (signal collection), build an interface to Java for the pretty GUI.
C++ is quite different from Java conceptually, but at least the syntax is close.
The RTX guys explicitly position themselves as an alternative to custom hardware, claim Intel CPUs on commodity motherboards can deliver realtime processing. Their stuff is not free though. You want free, go with an embedded flavor of Linux. Or with QNX.

Resource usage of google Go vs Python and Java on Appengine

Will Google Go use fewer resources than Python and Java on App Engine? Are the instance startup times for Go faster than Java's and Python's?
Is the Go program uploaded as a binary or as source code? If it is uploaded as source code, is it compiled once or at each instance startup?
In other words: Will I benefit from using Go in app engine from a cost perspective? (only taking to account the cost of the appengine resources not development time)
Will google Go use less resources than Python and Java on Appengine?
Are the instance startup times for go faster than Java's and Python's
startup times?
Yes, Go instances have a lower memory footprint than Python and Java (< 10 MB).
Yes, Go instances start faster than their Java and Python equivalents, because the runtime only needs to read a single executable file to start an application.
Also, even though the Go runtime is currently single-threaded, Go instances handle incoming requests concurrently using goroutines: if one goroutine is waiting on I/O, another can process an incoming request.
Is the go program uploaded as binaries or source code and if it is
uploaded as source code is it then compiled once or at each instance
startup?
The Go program is uploaded as source code and compiled (once) to a binary when you deploy a new version of your application using the SDK.
In other words: Will I benefit from using Go in app engine from a cost
perspective?
The Go runtime definitely has an edge when it comes to the performance/price ratio; however, it doesn't affect the pricing of the other API quotas described in Peter's answer.
The cost of instances is only part of the cost of your app. I only use the Java runtime right now, so I don't know how much more or less efficient things would be with Python or Go, but I don't imagine it will be orders of magnitude different. I do know that instances are not the only cost you need to consider. Depending on what your app does, you may find API or storage costs are more significant than any minor differences between runtimes. All of the API costs will be the same with whatever runtime you use.
Language "might" affect these costs:
On-demand Frontend Instances
Reserved Frontend Instances
Backend Instances
Language Independent Costs:
High Replication Datastore (per gig stored)
Outgoing Bandwidth (per gig)
Datastore API (per ops)
Blobstore API storage (per gig)
Email API (per email)
XMPP API (per stanza)
Channel API (per channel)
The question is mostly irrelevant.
The minimum memory footprint of a Go app is less than that of a Python app, which is less than that of a Java app. They all cost the same per instance, so unless your application performs better with extra heap space, this issue is irrelevant.
Go startup time is less than Python startup time which is less than Java startup time. Unless your application has a particular reason to churn through lots of instance startup/shutdown cycles, this is irrelevant from a cost perspective. On the other hand, if you have an app that is exceptionally bursty in very short time periods, the startup time may be an advantage.
As mentioned by other answers, many costs are identical among all platforms - in particular, datastore operations. To the extent that Go vs Python vs Java will have an effect on the instance-hours bill, it is related to:
Does your app generate a lot of garbage? For many applications, the biggest computational cost is the garbage collector. Java has by far the most mature GC and basic operations like serialization are dramatically faster than with Python. Go's garbage collector seems to be an ongoing subject of development, but from cursory web searches, doesn't seem to be a matter of pride (yet).
Is your app computationally intensive? Java (JIT-compiled) and Go are probably better than Python for mathematical operations.
All three languages have their virtues and curses. For the most part, you're better off letting other issues dominate - which language do you enjoy working with most?
It's probably more about how you allocate the resources than about your language choice. I read that GAE was built to be language-agnostic, so there is probably no built-in advantage for any language, but you can gain an advantage by choosing a language you are comfortable and motivated with. I use Python, and what made my deployment much more cost-effective was the upgrade to Python 2.7; you can only make that upgrade if you use the correct subset of 2.6, which is good. So if you choose a language you're comfortable with, any advantage will likely come from your skill with the language rather than from the language-plus-environment combination itself.
In short, I'd recommend Python, but it's the only App Engine language I've tried, and it's my choice even though I know Java rather well; the code for a project will be much more compact in my favorite language, Python.
My apps are small to medium-sized and they cost next to nothing.
I haven't used Go, but I would strongly suspect it would load and execute instances much faster, and use less memory purely because it is compiled. Anecdotally from the group, I believe that Python is more responsive than Java, at least in instance startup time.
Instance load/startup times are important because when your instance is hit by more requests than it can handle, App Engine spins up another instance. That request then takes much longer, possibly giving the impression that the site is generally slow. Both Java and Python have to start their virtual machine/interpreter, so I would expect Go to be an order of magnitude faster here.
There is one other issue: now that Python 2.7 is available, Go is the only option that is single-threaded (ironically, given that Go is designed as a modern concurrent language). So although Go requests should be handled faster, an instance can only handle requests serially. I'd be very surprised if this limitation lasts long, though.

I/O functional programming, and java programming

Hi: We are using Java for a multi-threaded application and have found a bottleneck in Java I/O. Does functional programming, Scala for example, offer better I/O throughput? We will have a many-core CPU, so the business logic can be handled very fast, but I/O would be a bottleneck. Are there any good solutions?
Since Scala runs on the Java Virtual Machine and (under the hood) uses the Java API for I/O, switching to Scala is unlikely to offer better performance than well-written Java code.
As for solutions, your description of the problem is far too sketchy to recommend particular solutions.
Have you used or tried Java NIO (non-blocking)? Developers report up to a 300% performance increase.
Java NIO FileChannel versus FileOutputstream performance / usefulness (please refer to this as well)
Usually when people complain that Java I/O is slow, it is what they are doing with the I/O that is slow, not the I/O itself. E.g. a BufferedReader reading lines of text (which is relatively slow) can still read 90 MB/s with a decent CPU/HDD. You can make it much faster with memory-mapped files, but unless your disk drive can keep up, it won't make much real difference.
There are things you can do to improve IO performance but you quickly find that the way to get faster IO is to improve the hardware.
If you are using a hard drive which can sustain a 100 MB/s read speed and 120 IOPS, you are going to be limited by those factors; replacing the drive with an SSD which does 500 MB/s and 80,000 IOPS is going to be faster.
Similarly, if you are using a 100 Mb/s network, you might only get 12 MB/s, on a 1 Gb/s network you might get 110 MB/s and on a 10 Gig-E network you might be lucky to get 1 GB/s.
If you are performing many tiny I/O operations, then coalescing them into one large I/O operation can greatly speed up your code. Functional programming techniques tend to make data collection and conversion operations easier to write (e.g. you can store items pending output in a list and use map to apply an item-to-text or item-to-binary converter to them). Otherwise, no, functional programming techniques don't overcome inherently slow channels. If raw I/O speed is the limit, in Java and elsewhere, and you have enough hardware threads available, you should have one top-priority thread for each independent I/O channel and have it perform only I/O (no data conversion, nothing else). That will maximize your I/O rate, and you can then use the other threads for conversions, business logic, and so on.
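The coalescing advice above can be sketched like this: many tiny logical writes are batched through a buffer so that only a handful of large writes reach the underlying channel. The counting stream here is a stand-in for a real file or socket, used only to make the effect observable:

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Sketch of coalescing: instead of issuing one OS-level write per item,
// batch items through a buffer and flush once.
public class CoalescedWrites {
    // Stand-in for a slow channel; counts how many bulk writes reach it.
    static class CountingStream extends ByteArrayOutputStream {
        int calls = 0;
        @Override public synchronized void write(byte[] b, int off, int len) {
            calls++;
            super.write(b, off, len);
        }
    }

    public static void main(String[] args) throws IOException {
        CountingStream sink = new CountingStream();
        OutputStream out = new BufferedOutputStream(sink, 64 * 1024);
        for (int i = 0; i < 10_000; i++) {
            out.write(("item " + i + "\n").getBytes()); // 10,000 tiny logical writes
        }
        out.flush(); // buffered: only a few large writes hit the sink
        System.out.println("underlying write calls: " + sink.calls);
    }
}
```

With an unbuffered stream, the sink would see 10,000 calls; buffered, it sees only a handful.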
One question is whether you have unlimited time to develop your application. If you have unlimited time, then Java and Scala programs will have the same performance, since you can write Scala programs that produce exactly the same bytecode as Java.
But, if you have unlimited time, why not develop in C (or assembler)? You'd get better performance.
Another is how sophisticated your I/O code is. If it is quite trivial, then Scala will probably not provide much benefit, as there is not enough "meat" to utilize its features.
I think that if you have limited time and a complex I/O codebase, a Scala-based solution may be faster. The reason is that Scala opens the door to many idioms that in Java are just too laborious to write, so people avoid them and pay the price later.
For example, executing a calculation over a collection of data in parallel is done in Java with a ForkJoinPool, which you have to create; then you create a class wrapping the calculation, break the work up per item, and submit it to the pool.
In Scala: collection.par.map(calculation). Writing this is much faster than in Java, so you just do it and have spare time to tackle other issues.
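For comparison, since Java 8 a similarly short form is available in Java via parallel streams, which run on a common ForkJoinPool under the hood (a minimal sketch; the calculation is a placeholder):

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.LongStream;

// The Scala one-liner collection.par.map(calculation) has a comparably
// short Java 8 equivalent using parallel streams.
public class ParallelMap {
    static long calculation(long x) {
        return x * x; // stand-in for a real per-item computation
    }

    public static void main(String[] args) {
        List<Long> squares = LongStream.rangeClosed(1, 10)
                .parallel()                 // split across the common ForkJoinPool
                .map(ParallelMap::calculation)
                .boxed()
                .collect(Collectors.toList());
        System.out.println(squares); // [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
    }
}
```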
From personal experience, I have a related story. I read in a blog article that BuildR, a Ruby-based build tool, was two times faster than Maven for a simple build. Considering that Ruby is about 20 times slower than Java, I was surprised, so I profiled Maven. It turned out it parsed the same XML file approximately 1000 times. Of course, with careful design that could have been reduced to once, but I guess the reason it wasn't is that the straightforward approach in Java led to a design too complex to change afterwards. With BuildR the design was simpler and the performance better. In Scala, you get the feeling of programming in a dynamic language while still being on par with Java in terms of performance.
UPDATE: Thinking about it more, there are some areas in Scala which will give greater performance than Java (again, assuming the IO bottleneck is because of the code that wraps the IO operations, not the reading/writing of bytes):
* Lazy arguments and values: can defer spending CPU cycles until the results are actually required.
* Specialization: lets you tell the compiler to create copies of generic data structures for the native types, thus avoiding boxing, unboxing and casting.
