I/O, functional programming, and Java programming

Hi: We are using Java for a multi-threaded application and have found a bottleneck at Java I/O. Does functional programming, Scala for example, offer better I/O throughput? We will have a many-core CPU, so the business logic can be handled very fast, but I/O would remain a bottleneck. Are there any good solutions?

Since Scala runs on the Java Virtual Machine and (under the hood) uses the Java API for I/O, switching to Scala is unlikely to offer better performance than well-written Java code.
As for solutions, your description of the problem is far too sketchy to recommend particular solutions.

Have you used, or tried, Java NIO (non-blocking I/O)? Developers report up to a 300% performance increase.
Java NIO FileChannel versus FileOutputStream performance/usefulness (please refer to this as well)
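For reference, here is a minimal, hedged sketch of the two styles that question compares (class and file names are invented): a plain FileOutputStream write versus an NIO FileChannel write with a direct buffer.

import java.io.FileOutputStream;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class ChannelVsStream {
    public static void main(String[] args) throws Exception {
        byte[] data = new byte[8 * 1024 * 1024];   // 8 MB of dummy payload

        // Classic stream-based write
        try (FileOutputStream out = new FileOutputStream("stream.bin")) {   // invented file name
            out.write(data);
        }

        // NIO channel-based write with a direct buffer
        ByteBuffer buf = ByteBuffer.allocateDirect(data.length);
        buf.put(data);
        buf.flip();
        try (FileChannel ch = FileChannel.open(Paths.get("channel.bin"),   // invented file name
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
        }
    }
}

Whether the channel version is actually faster depends heavily on write sizes and the underlying disk, which is exactly what the linked question discusses.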

Usually when people complain that Java IO is slow, it is what they are doing with the IO which is slow, not the IO itself. E.g. BufferedReader reading lines of text (which is relatively slow) can read 90 MB/s with a decent CPU/HDD. You can make it much faster with memory mapped files but unless your disk drive can handle it, it won't make much real difference.
There are things you can do to improve IO performance but you quickly find that the way to get faster IO is to improve the hardware.
If you are using a hard drive which can sustain a 100 MB/s read speed and 120 IOPS, you are going to be limited by these factors, and replacing the drive with an SSD which does 500 MB/s and 80,000 IOPS is going to be faster.
Similarly, if you are using a 100 Mb/s network, you might only get 12 MB/s, on a 1 Gb/s network you might get 110 MB/s and on a 10 Gig-E network you might be lucky to get 1 GB/s.
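For a rough feel of the point about BufferedReader versus memory-mapped files, here is a hedged sketch (the file name is invented, and the code assumes the file fits in a single 2 GB mapping) that scans the same file both ways; on most hardware the drive, not the code, ends up being the limit.

import java.io.BufferedReader;
import java.io.FileReader;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class ReadComparison {
    public static void main(String[] args) throws Exception {
        String file = "big.log";   // invented input file

        // Line-by-line text reading: convenient, but pays for charset decoding and per-line objects
        long lines = 0;
        try (BufferedReader r = new BufferedReader(new FileReader(file))) {
            while (r.readLine() != null) {
                lines++;
            }
        }

        // Memory-mapped scan: the OS pages the file in, no per-line objects are created
        long newlines = 0;
        try (FileChannel ch = FileChannel.open(Paths.get(file), StandardOpenOption.READ)) {
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());   // assumes file < 2 GB
            while (map.hasRemaining()) {
                if (map.get() == '\n') {
                    newlines++;
                }
            }
        }
        System.out.println(lines + " lines, " + newlines + " newlines");
    }
}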

If you are performing many tiny I/O operations, then coalescing them into one large I/O operation could greatly speed up your code. Functional programming techniques tend to make data collection and conversion operations easier to write (e.g. you can store items for pending output in a list, and use map to apply an item-to-text or item-to-binary converter to them). Otherwise, no, functional programming techniques don't overcome inherently slow channels. If raw I/O speed is limiting, in Java and elsewhere, and you have enough hardware threads available, you should have one top priority thread for each independent I/O channel, and have it perform only I/O (no data conversion, nothing). That will maximize your I/O rate, and then you can use the other threads to do conversions and business logic and such.
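As a sketch of the dedicated-I/O-thread idea (class and file names are invented): business-logic threads drop records onto a queue, and a single writer thread drains whatever has accumulated and writes it as one buffered batch, so many tiny writes are coalesced into a few large ones.

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class WriterThread implements Runnable {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    // Called by business-logic threads; never blocks on the disk
    public void submit(String record) {
        queue.add(record);
    }

    @Override
    public void run() {
        try (OutputStream out = new BufferedOutputStream(new FileOutputStream("out.dat"))) {   // invented file name
            List<String> batch = new ArrayList<>();
            while (!Thread.currentThread().isInterrupted()) {
                batch.add(queue.take());   // block until at least one record is available
                queue.drainTo(batch);      // then grab everything else that is already pending
                for (String s : batch) {
                    out.write(s.getBytes(StandardCharsets.UTF_8));
                }
                out.flush();               // one flush per batch instead of one per record
                batch.clear();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Run the writer on its own high-priority thread and it spends its time purely on I/O, as described above.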

One question is whether you have unlimited time to develop your application or not. If you have unlimited time, then the Java program and Scala programs will have the same performance since you can write Scala programs that will produce exactly the same bytecode as Java.
But, if you have unlimited time, why not develop in C (or assembler)? You'd get better performance.
Another question is how sophisticated your IO code is. If it is something quite trivial, then Scala will probably not provide much benefit, as there is not enough "meat" to make use of its features.
I think that if you have limited time and a complex IO codebase, a Scala-based solution may be faster to build. The reason is that Scala opens the door to many idioms that in Java are just too laborious to write, so people avoid them and pay the price later.
For example, executing a calculation over a collection of data in parallel is done in Java with a ForkJoinPool, which you have to create; then you create a class wrapping the calculation, break the work up per item and submit it to the pool.
In Scala: collection.par.map(calculation). Writing this is much faster than Java, so you just do it and have spare time to tackle other issues.
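For contrast, here is roughly what the Java side of that comparison can look like with a ForkJoinPool (the "square each element" calculation is a made-up stand-in); on Java 8+ a parallel stream is much shorter, but this is the kind of ceremony the answer refers to.

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Wrapper class around the calculation, split per slice of the data
class SquareTask extends RecursiveTask<long[]> {
    private final int[] data;
    private final int from, to;

    SquareTask(int[] data, int from, int to) {
        this.data = data;
        this.from = from;
        this.to = to;
    }

    @Override
    protected long[] compute() {
        if (to - from <= 1000) {                       // small enough: compute directly
            long[] out = new long[to - from];
            for (int i = from; i < to; i++) {
                out[i - from] = (long) data[i] * data[i];
            }
            return out;
        }
        int mid = (from + to) / 2;                     // otherwise split and fork
        SquareTask left = new SquareTask(data, from, mid);
        SquareTask right = new SquareTask(data, mid, to);
        left.fork();
        long[] r = right.compute();
        long[] l = left.join();
        long[] merged = new long[l.length + r.length];
        System.arraycopy(l, 0, merged, 0, l.length);
        System.arraycopy(r, 0, merged, l.length, r.length);
        return merged;
    }
}

public class ParallelCalc {
    public static void main(String[] args) {
        int[] data = new int[100_000];
        long[] result = new ForkJoinPool().invoke(new SquareTask(data, 0, data.length));
        System.out.println(result.length);
    }
}

The one-line Scala version above replaces essentially all of this.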
From personal experience, I have a related story. I read in a blog article that BuildR, a Ruby-based build tool, was two times faster than Maven for a simple build. Considering that Ruby is about 20 times slower than Java, I was surprised. So I profiled Maven. It turned out it parsed the same XML file approximately 1,000 times. Of course, with careful design they could have reduced that to just once, but I guess the reason they did not is that the straightforward approach in Java led to a design too complex to change afterwards. With BuildR, the design was simpler and the performance better. In Scala, you get the feeling of programming in a dynamic language while still being on par with Java in terms of performance.
UPDATE: Thinking about it more, there are some areas in Scala which will give greater performance than Java (again, assuming the IO bottleneck is because of the code that wraps the IO operations, not the reading/writing of bytes):
* Lazy arguments and values - can push spending CPU cycles to when they are actually required
* Specialization - lets you tell the compiler to create copies of generic data structures for the native types, thus avoiding boxing, unboxing and casting.

Related

Java TCP/IP Socket write performance optimization

Server Environment
Linux/RedHat
6 cores
Java 7/8
About the application:
We are working on developing a low-latency (7-8 ms) high-speed trading platform in Java. Multi-leg orders are sent after algo conditions are met.
Problem
Orders are sent to the exchange over TCP/IP using the java.net.Socket API (via java.io.OutputStream.write(byte[])). The profiler records this as 5-7 microseconds, which is very high given our low-latency requirements. We have not made use of the setPerformancePreferences() API, as suggested in one of the questions posted on Stack Overflow.
Question
Any alternatives to java.net.Socket to reduce the socket transmission time?
Any optimization techniques to improve performance?
Is setPerformancePreferences() of any use?
We have not made use of the setPerformancePreferences() API
It doesn't do anything and never has. I wouldn't worry about it.
Any alternatives to java.net.Socket to reduce the socket transmission time?
The problem is most certainly not a software one. You can get under 8 microseconds from Java to Java on different machines, but you need low-latency network cards like Solarflare or Mellanox.
If you want fast processing you should consider either a high-GHz Haswell processor, possibly overclocked to 4.2 or 4.5 GHz, or a dual-socket Haswell Xeon. The cost of these compared to the cost of trading is not high.
Any optimization techniques to improve performance
Use non-blocking NIO, i.e. ByteBuffers and busy-waiting on the socket connections (I wouldn't use Selectors, as they add quite a bit of overhead), and turn off Nagle's algorithm.
For some micro-tuning, use an affinity-bound thread on an isolated CPU.
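A minimal, hedged sketch of that combination (host, port and payload are invented, and a simple request/acknowledgement exchange is assumed): disable Nagle, switch the channel to non-blocking mode, and busy-spin on write and read rather than using a Selector.

import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

public class BusySpinSender {
    public static void main(String[] args) throws Exception {
        // Invented endpoint for illustration
        SocketChannel ch = SocketChannel.open(new InetSocketAddress("exchange-gw", 9001));
        ch.socket().setTcpNoDelay(true);   // turn off Nagle's algorithm
        ch.configureBlocking(false);       // non-blocking, no Selector

        ByteBuffer order = ByteBuffer.wrap("NEW ORDER ...".getBytes());   // dummy payload
        while (order.hasRemaining()) {
            ch.write(order);               // spin until the kernel buffer has accepted everything
        }

        ByteBuffer ack = ByteBuffer.allocateDirect(512);
        while (ch.read(ack) == 0) {
            // busy-wait for the acknowledgement; burns a core, but avoids wake-up latency
        }
        ch.close();
    }
}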
Is setPerformancePreferences() of any use?
Looking at the source... I will let you be the judge.
public void setPerformancePreferences(int connectionTime,
                                      int latency,
                                      int bandwidth)
{
    /* Not implemented yet */
}
Java 7/8
In terms of which version to use, I would start with Java 8, as it has much-improved escape analysis, which can reduce the garbage from short-lived objects and thus help reduce latency between GCs and GC-induced jitter.
A couple of things come to mind:
JNI: JNI lets you write C code that is run from your Java code. Critical parts of your Java code that are running too slowly can be migrated to C/C++ for improved performance. Work would be needed first to identify those critical points and to decide whether it's worth the effort to move them to C/C++ (a minimal sketch follows this list).
Java Unsafe: Want to get dangerous? Use Java Unsafe to bypass that pesky GC. Here is more info on it. On GitHub you may find some cool wrapper code to use Java Unsafe more safely. Here is one. More info.
LMAX Disruptor: Read more about it here. This company is also building a fast trading system in Java. Disruptor allows for faster inter-thread communication.
Bytecode scrutiny: Review your code by looking at the bytecode. I have done this for a video game I made and was able to streamline the code. You'll need a good tool for turning your class files into readable bytecode. This might be the tool I used.
Improved garbage collection: Have you tried using the G1 garbage collector? Or experimenting with the older GCs?
Highscalability: This site is full of good info on making code fast. Here is an example that might help.
New API: I don't know exactly how to use it, but it has come up in articles I have read. Here is another article on it. You might need to use it via JNI.
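As a rough illustration of the JNI item above (the class, method and library names are all invented), the Java side declares a native method and loads a library; the matching C function is compiled against the header that javah / javac -h generates.

public class FastMath {
    static {
        // Loads libfastmath.so / fastmath.dll; the library itself is hypothetical
        System.loadLibrary("fastmath");
    }

    // Implemented in C/C++; resolved by the JVM at run time
    public static native double hotLoop(double[] input);

    public static void main(String[] args) {
        double[] data = new double[1_000_000];
        System.out.println(hotLoop(data));
    }
}

/* Corresponding C side (sketch):
 *
 * JNIEXPORT jdouble JNICALL Java_FastMath_hotLoop(JNIEnv *env, jclass cls, jdoubleArray input) {
 *     // critical section written in C
 *     return 0.0;
 * }
 */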

Read/Write efficiency

I have Scala (+ Java) code which reads and writes at a certain rate. Profiling tells me how much time each method in the code takes to execute. How do I measure whether my program is reaching its maximum efficiency, i.e. how do I optimize my code so that it is reading at the maximum speed possible with the given configuration? I know this is hardware-specific and varies from machine to machine, but is there a short way to measure whether my program is reading and writing at the fastest rate the hardware allows? (I'm using FileWriter along with BufferedWriter.)
With the description given, your best option might be experimenting. Things you could measure:
Changing the buffer size (this didn't help me when I tried it)
Switching to NIO (might help for large files)
Caching the data you read (might help for small files), and caching directory contents if there are many files in a directory - opening a file slows down as the number of files in a folder grows.
One technique to make sure the problem is not in your code is to profile it, get the CPU-time distribution tree for your methods, and expand the execution paths that take most of the time. If all of these paths lead into the Java standard libraries, you're probably already close to your best performance.
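If you want a crude baseline to compare your profile against, one option (the class name, file name and sizes are invented) is to time a large sequential write yourself and compare the resulting MB/s with what the drive is rated to sustain; if the two numbers are close, your code is not the limiting factor.

import java.io.BufferedWriter;
import java.io.FileWriter;

public class ThroughputBaseline {
    public static void main(String[] args) throws Exception {
        char[] chunk = new char[64 * 1024];          // 64 KB of dummy characters
        long total = 1L * 1024 * 1024 * 1024;        // write roughly 1 GB in total

        long start = System.nanoTime();
        try (BufferedWriter w = new BufferedWriter(new FileWriter("baseline.tmp"), 1 << 20)) {   // invented file
            for (long written = 0; written < total; written += chunk.length) {
                w.write(chunk);
            }
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("%.1f MB/s sequential write%n", (total / 1e6) / seconds);
    }
}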
UPDATE
Some other things and techniques, based on the hprof output you provided.
With a profiler or some other technique (I prefer stopwatches, as they give more stable and realistic results) you need to find your bottleneck.
Most of the IO can be optimized to use a single buffer - this is less painful with Guava or Apache Commons IO (see the sketch after this list).
However, there is not much you can do if Jackson is in your serialization chain and it is the bottleneck. Change the algorithm?
There are slow parts (compared to native filesystem IO), e.g. formatters - String.format is very slow - Jackson, etc.
There are typical slow operations around IO, e.g. buffer allocations and string concatenation; too many allocated char[] buffers is a smell that the IO needs optimizing.
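To illustrate the single-buffer point above, a small, hedged example using Apache Commons IO (the file names are invented, and the commons-io dependency is assumed to be on the classpath); IOUtils.copy pushes the whole stream through one buffer allocated per call rather than one per chunk of your own code.

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.commons.io.IOUtils;

public class SingleBufferCopy {
    public static void main(String[] args) throws Exception {
        try (InputStream in = new FileInputStream("in.json");     // invented file names
             OutputStream out = new FileOutputStream("out.json")) {
            IOUtils.copy(in, out);   // a single internal buffer is reused for the whole stream
        }
    }
}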

Why is Erlang slower than Java on all these small math benchmarks?

While considering alternatives to Java for a distributed/concurrent/failover/scalable backend environment, I discovered Erlang. I've spent some time on books and articles, and nearly all of them (even Java-addicted folks) say that Erlang is a better choice in such environments, since many useful things are available out of the box in a less error-prone way.
I was sure that Erlang is faster in most cases, mainly because of a different garbage collection strategy (per process), the absence of shared state (between threads and processes) and more compact data types. But I was very surprised when I found comparisons of Erlang vs Java on math samples where Erlang is slower by one to two orders of magnitude, e.g. from 10x to 100x.
Even on concurrent tasks, both on several cores and a single one.
What are the reasons for that? These possible answers come to mind:
Usage of Java primitives (=> no heap/gc) on most of the tasks
Same number of threads in Java code and Erlang processes so the actor model has no advantage here
Or just that Java is statically typed, while Erlang is not
Something else?
If that's because these are very specific math algorithms, can anybody show more realistic/practical performance tests?
UPDATE: From the answers so far I gather that Erlang is not the right tool for this specific "fast Java" case, but what is still unclear to me is the main reason for Erlang's inefficiency here: dynamic typing, GC, or poor native compilation?
Erlang was not built for math. It was built with communication, parallel processing and scalability in mind, so testing it for math tasks is a bit like testing if your jackhammer gives you refreshing massage experience.
That said, let's go off-topic a little:
If you want Erlang-style programming in JVM, take a look at Scala Actors or Akka framework or Vert.x.
Benchmarks are never good for saying anything beyond what they actually test. If you feel that a benchmark only tests primitives and the classic threading model, then that is all it tells you: you can now say with some confidence that Java is faster than Erlang at mathematics on primitives, and under the classic threading model, for those types of problems. You don't know anything about performance with large numbers of threads or for more involved problems, because the benchmark didn't test that.
If you are doing the types of math that the benchmark tested, go with Java because it is obviously the right tool for that job. If you want to do something heavily scalable with little to no shared state, find a benchmark for that or at least re-evaluate Erlang.
If you really need to do heavy math in Erlang, consider using HiPE (consider it anyway for that matter).
I took an interest in this, as some of the benchmarks are a perfect fit for Erlang, such as gene sequencing. So on http://benchmarksgame.alioth.debian.org/ the first thing I did was look at the reverse-complement implementations, for both C and Erlang, as well as the testing details. I found that the test is biased because it does not discount the time it takes Erlang to start the VM with its schedulers; natively compiled C starts much faster. The way those benchmarks measure is basically:
time erl -noshell -s revcomp5 main < revcomp-input.txt
Now the benchmark says Java took 1.4 seconds and Erlang with HiPE took 11. Running the (single-threaded) Erlang code took me 0.15 seconds, and if you discount the time it took to start the VM, the actual workload took only 3000 microseconds (0.003 seconds).
So I have no idea how that is benchmarked. If it's run 100 times, it makes no sense, as the cost of starting the Erlang VM is multiplied by 100. If the input is a lot longer than the one given, it would make sense, but I see no details of that on the web page. To make the benchmarks fairer to managed languages, have the code (Erlang/Java) send a Unix signal to the Python script doing the benchmarking when it reaches its startup function.
Now, benchmarks aside, the Erlang VM essentially just executes machine code in the end, just as the Java VM does. So there is no reason a math operation should take longer in Erlang than in Java.
What Erlang is bad at is data that needs to mutate often, such as a chained block cipher. Say you have the chars "0123456789"; your encryption XORs the first 2 chars by 7, then XORs the next two chars by the sum of the first two, then XORs the previous 2 chars by the current 2 subtracted, then XORs the next 4 chars, and so on.
Because values in Erlang are immutable, this means the entire char array needs to be copied each time you mutate it. That is why Erlang has support for NIFs, which are C functions you can call into to solve exactly this problem. In fact, all the encryption (SSL, AES, Blowfish, ...) and compression (zlib, ...) that ship with Erlang are implemented in C, and there is near-zero cost associated with calling C from Erlang.
So using Erlang you get the best of both worlds: the speed of C with the parallelism of Erlang.
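To make the contrast concrete, this is roughly what such in-place mutation looks like in Java (the chaining scheme is an invented toy, not a real cipher); a mutable byte[] is updated step by step without ever copying it, which is exactly what Erlang's immutable values rule out.

public class ChainedXor {
    public static void main(String[] args) {
        byte[] data = "0123456789".getBytes();

        data[0] ^= 7;                    // xor the first byte by a constant
        data[1] ^= 7;
        int feed = data[0] + data[1];    // feed the result forward into the next step
        data[2] ^= feed;                 // each step mutates the array in place
        data[3] ^= feed;
        // ... and so on down the chain; no copy of the array is ever made

        System.out.println(new String(data));
    }
}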
If I were to implement the reverse-complement in the fastest way possible, I would write the mutating code in C but the parallel code in Erlang. Assuming infinite input, I would have Erlang split on the >:
<<Line/binary, ">", Rest/binary>> = read_stream
Dispatch each block to the first available scheduler via round robin, over an effectively unbounded pool of EC2 private-networked hidden nodes, added to the cluster in real time every millisecond.
Those nodes then call out to C via NIFs for processing (C was the fastest implementation of reverse-complement on the Alioth website), then send the output back to the master node, which sends it back to whoever supplied the input.
To implement all this in Erlang, I would write code as if I were writing a single-threaded program; it would take me under a day.
To implement this in Java, I would have to write the single-threaded code, take the performance hit of calling from managed to unmanaged code (as we would obviously use the C implementation for the grunt work), then rewrite it to support 64 cores, then rewrite it to support multiple CPUs, then rewrite it again to support clustering, and then rewrite it again to fix memory issues.
And that is Erlang in a nutshell.
As pointed out in other answers, Erlang is designed to solve real-life problems effectively, and those are rather the opposite of benchmark problems.
But I'd like to highlight one more aspect: the pithiness of Erlang code (which in some cases means speed of development), which is easy to see by comparing the benchmark implementations.
For example, k-nucleotide benchmark:
Erlang version: http://benchmarksgame.alioth.debian.org/u64q/program.php?test=knucleotide&lang=hipe&id=3
Java version: http://benchmarksgame.alioth.debian.org/u64q/program.php?test=knucleotide&lang=java&id=3
If you want more real-life benchmarks, I'd suggest Comparing C++ and Erlang for Motorola Telecoms Software.
The Erlang solution uses ETS, Erlang Term Storage, which is like an in-memory database running in a separate process. Consequent to it being in a separate process, all messages to and from that process must be serialized/deserialized. This would account for a lot of the slowness, I should think. For example, if you look at the "regex-dna" benchmark, Erlang is only slightly slower than Java there, and it doesn't use ETS.
The fact that Erlang has to allocate memory for every value, whereas in Java you will typically reuse variables if you want the code to be fast, means Java will always be faster in "tight loop" benchmarks.
It would be interesting to benchmark a java version using the -client flag and boxed primitives and compare that to erlang.
I believe using HiPE is unfair, since it is not an active project. I would be interested to know whether any mission-critical software is running on it.
I don't know anything about Erlang, but this seems to be an apples-to-oranges comparison anyway. You must be aware that considerable effort was spent over more than a decade to improve Java performance to the point where it is today.
It's not surprising (to me) that a language implementation done by volunteers or a small company cannot match that effort.

Why is Python not better than Java for multiprocessing or multithreading applications?

Since Python has some issues with the GIL, Java is better for developing multiprocessing applications. Could you please justify, in your own way, the exact reasons why Java processes more effectively than Python?
The biggest problem in multithreading in CPython is the Global Interpreter Lock (GIL) (note that other Python implementations don't necessarily share this problem!)
The GIL is an implementation detail that effectively prevents parallel (simultaneous) execution of separate threads in Python. The problem is that whenever Python byte code is to be executed, then the current thread must have acquired the GIL and only a single thread can have the GIL at any given moment.
So if 5 threads are trying to execute some Python byte code, then they will effectively run interleaved, because each one will have to wait for the GIL to become available. This is not usually a problem with single-core computers, as the physical constraints have the same effect: only a single thread can run at a time.
In multi-core/SMP computers, however, this becomes a bottleneck. These days almost everything runs on multiple cores, including effectively all smartphones and even many embedded systems.
Java has no such restrictions, so multiple threads can execute at the exact same time.
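As a small illustration (the class and thread names are invented): the two CPU-bound Java threads below really do run on two cores at once, which is exactly what the GIL prevents for equivalent pure-Python threads in CPython.

public class TwoCores {
    public static void main(String[] args) throws InterruptedException {
        Runnable spin = () -> {
            long x = 0;
            for (long i = 0; i < 2_000_000_000L; i++) {
                x += i;                  // pure CPU work, no I/O and no locks
            }
            System.out.println(Thread.currentThread().getName() + " done: " + x);
        };
        Thread a = new Thread(spin, "worker-1");
        Thread b = new Thread(spin, "worker-2");
        long start = System.nanoTime();
        a.start();
        b.start();
        a.join();
        b.join();
        // On a multi-core machine this takes about as long as one loop, not two
        System.out.println((System.nanoTime() - start) / 1_000_000 + " ms total");
    }
}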
I would disagree that Python is not better than Java for multiprocessing applications.
First, I am assuming that the OP is using 'better' to mean 'faster code execution' as far as I can tell.
I suffer from 'speed-freak' syndrome, probably from having come from a C/ASM background, so I have spent considerable time getting to the bottom of the "is Python slow?" issue.
The simple answer to that? "It can be." Here's some important points:
1) With a multi-threaded application, Python is going to be at a disadvantage compared to any language that doesn't have something similar to the GIL. The GIL is an artifact of the CPython VM, not the Python language itself. Some Python implementations, such as Jython and IronPython, do not have a GIL.
2) In a Multi-Process application, the GIL doesn't really apply, and thus you can now start to harness faster execution of your Python code unmolested for the most part by the GIL. I strongly suggest if you want to write large Python code that needs both speed and concurrency, that you learn about Multi-Processing, and possibly ZMQ/0MQ for message passing.
3) Regardless of the GIL, Java displays faster code execution than Python in many areas. This is due to native differences in how Python handles objects in memory:
A number of Python functions create copies of objects in memory rather than modifying them ( see http://www.skymind.com/~ocrow/python_string/ for examples)
Python uses Dict to store attributes for objects, etc. I don't want to distract and delve into these areas, but I can generally say that some of the 'neat' things that Python can do come at a speed cost. It's also important to know that there are ways around the default behaviour if that is causing too high of a speed penalty for you.
4) Some of Java's speed advantage is due to more optimization in the Java VM over Python as far as I can tell. Once you eliminate the differences in how much behind-the-scenes memory/object work is done, Java can often still beat Python. Is it because Java has had more attention than Python? I'm not sure, with enough funding I feel that CPython could be faster.
Check http://c2.com/cgi/wiki?PythonProblems for more discussion on some of these issues.
I will say that I have decided to embrace Python nearly 100% going forward with new code.
Don't fall into the premature optimization trap, and remember you can always call C code in a pinch. Make your code work well, make it maintainable, then start to optimize once the speed of the application isn't fast enough for your needs.
Interesting Benchmarks:
http://benchmarksgame.alioth.debian.org/u64/python.php
Further information about Python speed issues can be found here:
http://www.infoworld.com/d/application-development/van-rossum-python-not-too-slow-188715

Which programming language for compute-intensive trading portfolio simulation?

I am building a trading portfolio management system that is responsible for production, optimization, and simulation of non-high frequency trading portfolios (dealing with 1min or 3min bars of data, not tick data).
I plan on employing Amazon web services to take on the entire load of the application.
I have four choices that I am considering as language.
Java
C++
C#
Python
Here are the extremes of the project scope. This isn't how it will be, maybe ever, but it's within the scope of the requirements:
Weekly simulation of 10,000,000 trading systems.
(Each trading system is expected to have its own data-mining methods, including feature-selection algorithms which are extremely computationally expensive. Imagine 500-5,000 features using wrappers. These are not run often by any means, but they are still a consideration.)
Real-time production of portfolio w/ 100,000 trading strategies
Taking in 1 min or 3 min data from every stock/futures market around the globe (approx 100,000)
Portfolio optimization of portfolios with up to 100,000 strategies. (rather intensive algorithm)
Speed is a concern, but I believe that Java can handle the load.
I just want to make sure that Java CAN handle the above requirements comfortably. I don't want to do the project in C++, but I will if it's required.
The reason C# is on there is because I thought it was a good alternative to Java, even though I don't like Windows at all and would prefer Java if all things are the same.
Python - I've read some things on PyPy and Psyco that claim Python can be optimized with JIT compiling to run at near C-like speeds... That's pretty much the only reason it is on this list, besides the fact that Python is a great language and would probably be the most enjoyable language to code in, which is not a factor at all for this project, but a perk.
To sum up:
real time production
weekly simulations of a large number of systems
weekly/monthly optimizations of portfolios
large numbers of connections to collect data from
There is no dealing with millisecond or even second-based trades. The only consideration is whether Java can deal with this kind of load when spread over the necessary number of EC2 servers.
Thank you guys so much for your wisdom.
Pick the language you are most familiar with. If you know them all equally and speed is a real concern, pick C.
While I am a huge fan of Python and personally not a great lover of Java, in this case I have to concede that Java is the right way to go.
For many projects Python's performance just isn't a problem, but in your case even minor performance penalties will add up extremely quickly. I know this isn't a real-time simulation, but even for batch processing it's still a factor to take into consideration. If it turns out the load is too big for one virtual server, an implementation that's twice as fast will halve your virtual server costs.
For many projects I'd also argue that Python will allow you to develop a solution faster, but here I'm not sure that would be the case. Java has world-class development tools and top-drawer enterprise-grade frameworks for parallel processing and cross-server deployment, and while Python has solutions in this area, Java clearly has the edge. You also have architectural options with Java that Python can't match, such as JavaSpaces.
I would argue that C and C++ impose too much development overhead for a project like this. They're viable in that if you are very familiar with those languages I'm sure it would be doable, but other than the potential for higher performance, they bring nothing else to the table.
C# is just a rewrite of Java. That's not a bad thing if you're a Windows developer and if you prefer Windows I'd use C# rather than Java, but if you don't care about Windows there's no reason to care about C#.
I would pick Java for this task. In terms of RAM, the difference between Java and C++ is that in Java, each Object has an overhead of 8 Bytes (using the Sun 32-bit JVM or the Sun 64-bit JVM with compressed pointers). So if you have millions of objects flying around, this can make a difference. In terms of speed, Java and C++ are almost equal at that scale.
So the more important thing for me is the development time. If you make a mistake in C++, you get a segmentation fault (and sometimes you don't even get that), while in Java you get a nice Exception with a stack trace. I have always preferred this.
In C++ you can have collections of primitive types, which Java doesn't; you would have to use external libraries to get them.
If you have real-time requirements, the Java garbage collector may be a nuisance, since it takes some minutes to collect a 20 GB heap, even on machines with 24 cores. But if you don't create too many temporary objects at runtime, that should be fine too. It's just that the garbage-collection pause can hit whenever you don't expect it.
Why only one language for your system? If I were you, I would build the entire system in Python, with C or C++ used for the performance-critical components. That way you get a very flexible and extensible system with fast-enough performance. You can even find tools to generate the wrappers automatically (e.g. SWIG, Cython). Python and C/C++/Java/Fortran are not competing with each other; they are complementary.
Write it in your preferred language. To me that sounds like Python. When you start running the system you can profile it and see where the bottlenecks are. Once you have done some basic optimisations, if it's still not acceptable you can rewrite portions in C.
A consideration could be writing this in IronPython to take advantage of the CLR and DLR in .NET. Then you can leverage .NET 4 and the parallel extensions. If anything will give you performance increases, it'll be some flavour of threading, which .NET does extremely well.
Edit:
Just wanted to make this part clear. From the description, it sounds like parallel processing / multithreading is where the majority of the performance gains are going to come from.
It is useful to look at the inner loop of your numerical code. After all you will spend most of your CPU-time inside this loop.
If the inner loop is a matrix operation, then I suggest Python and SciPy; but if the inner loop is not a matrix operation, then I would worry about Python being slow. (Or maybe I would wrap C++ in Python using SWIG or Boost.Python.)
The benefit of python is that it is easy to debug, and you save a lot of time by not having to compile all the time. This is especially useful for a project where you spend a lot of time programming deep internals.
I would go with pypy. If not, http://lolcode.com/.
