Groovy: Poor multithreading performance and slow calculations comparing to Java?

Groovy: Poor multithreading performance and slow calculations comparing to Java? - java

According to an article Groovy has
Unfortunately at the same time Groovy is very slow at runtime. As a person, who did a lot to improve performance of Groovy I can probably speak about that very openly. Groovy is very slow. You can easily expect that some Groovy calculation or data transformation rewritten in Java will become 3-5 times faster. Usually this factor is 8-12 and sometimes even higher. Someone can say that Java is always at our service and nobody uses Groovy for calculations or data processing... But, hey, it is exactly my point - why should we limit ourselves for just scripting or handling of simple web pages?
What is even worse is the fact that
Groovy doesn't scale well for
multi-core computers meaning that
several threads executing code
compiled by Groovy really prevent each
other from being fast. It is not a
problem for many applications but for
many others it is simply show-stopper.
Could someone proof or refute that paragraphs?
I am particularly concerned about multi threading performance.

There are ongoing efforts to improve the speed of Groovy, but it should be said that 9 times in 10, the performance is not an issue.
However, where it is an issue you can either write that code in Java (and integrate that Java class easily into your Groovy code), or if you want to remain completely groovy, you can look into using Groovy++ which improves the speed of Groovy by making it more statically typed (with some heavy type inference to save you having to do it all yourself as with Java)
Groovy 1.8b4 (currently in beta), also comes with the GPars framework bundled with it.
The GPars project offers developers
new intuitive and safe ways to handle
Java or Groovy tasks concurrently,
asynchronously, and distributed by
utilizing the power of the Java
platform and the flexibility of the
Groovy language.
{edit July 2012}
Groovy 2.0 has a CompileStatic annotation which you may want to look into (as now Groovy++ has not been developed for quite a few months). This question here has some numbers...

Usually this factor is 8-12 and
sometimes even higher. Someone can say
that Java is always at our service and
nobody uses Groovy for calculations or
data processing... But, hey, it is
exactly my point - why should we limit
ourselves for just scripting or
handling of simple web pages?
We don't need to limit ourselves anymore! Groovy 1.8 is out! ;-)
"Groovy 1.8.x prototype for fib(42) takes about 3.8s
(only 12% slower than Java, over a hundred times
faster than Groovy 1.0) So we may no longer encourage people to write
such 'hot spots' in Java."
Source: http://www.wiki.jvmlangsummit.com/images/0/04/Theodorou-Faster-Groovy-1.8.pdf
Now it's Groovy vs. Scala in performance! Please check it out:
"I'm impressed on how much Groovy's performance has improved for
numerical computing. Groovy 1.8 in my project jlab (http://code.google.com/p/jlabgroovy/)
sometimes outperforms Scala's performance in my other project ScalaLab
(http://code.google.com/p/scalalab) !!"
Source: http://groovy.329449.n5.nabble.com/Great-improvements-in-Groovy-s-performance-for-numerical-computing-td4334768.html
For parallel processing, you can use the already bundled GPars!
Cheers

I can only provide anecdotal evidence of poor Groovy performance in calculations (which is, admittedly, about 2 years old):
I implemented an optimization algorithm in Groovy and found it intolerably slow. Profiling it showed that it spent about 60% of its time in BigDecimal.divide(). The culprit turned out to be one line with a very simple arithmetic calculation on float values that Groovy somehow insisted on transforming into BigDecimal. I tried to avoid having it do that for a bit but failed, so I rewrote that part of the application in Java and saw execution times go down 90%.

"Groovy 1.8.x prototype for fib(42) takes about 3.8s (only 12% slower than Java, over a hundred times faster than Groovy 1.0) So we may no longer encourage people to write such 'hot spots' in Java."
I have just tried it and it is extremely inaccurate. Groovy 1.8 for fib(42) is ~ 25 times slower than java. In my test java fib(42) takes 2 seconds where 1.8 and 1.7.8 take about 52 seconds to complete.
Groovy is still slow ....

Related

Why is Erlang slower than Java on all these small math benchmarks?

While considering alternatives for Java for a distributed/concurrent/failover/scalable backend environment I discovered Erlang. I've spent some time on books and articles where nearly all of them (even Java addicted guys) says that Erlang is a better choice in such environments, as many useful things are out of the box in a less error prone way.
I was sure that Erlang is faster in most cases mainly because of a different garbage collection strategy (per process), absence of shared state (b/w threads and processes) and more compact data types. But I was very surprised when I found comparisons of Erlang vs Java math samples where Erlang is slower by several orders, e.g. from x10 to x100.
Even on concurrent tasks, both on several cores and a single one.
What's the reasons for that? These answers came to mind:
Usage of Java primitives (=> no heap/gc) on most of the tasks
Same number of threads in Java code and Erlang processes so the actor model has no advantage here
Or just that Java is statically typed, while Erlang is not
Something else?
If that's because these are very specific math algorithms, can anybody show more real/practice performance tests?
UPDATE: I've got the answers so far summarizing that Erlang is not the right tool for such specific "fast Java case", but the thing that is unclear to me - what's the main reason for such Erlang inefficiency here: dynamic typing, GC or poor native compiling?

Erlang was not built for math. It was built with communication, parallel processing and scalability in mind, so testing it for math tasks is a bit like testing if your jackhammer gives you refreshing massage experience.
That said, let's offtop a little:
If you want Erlang-style programming in JVM, take a look at Scala Actors or Akka framework or Vert.x.

Benchmarks are never good for saying anything else than what they are really testing. If you feel that a benchmark is only testing primitives and a classic threading model, that is what you get knowledge about. You can now with some confidence say that Java is faster than Erlang on mathematics on primitives as well as the classic threading model for those types of problems. You don't know anything about the performance with large number of threads or for more involved problems because the benchmark didn't test that.
If you are doing the types of math that the benchmark tested, go with Java because it is obviously the right tool for that job. If you want to do something heavily scalable with little to no shared state, find a benchmark for that or at least re-evaluate Erlang.
If you really need to do heavy math in Erlang, consider using HiPE (consider it anyway for that matter).

I took interest to this as some of the benchmarks are a perfect fit for erlang, such as gene sequencing. So on http://benchmarksgame.alioth.debian.org/ the first thing I did was look at reverse-complement implementations, for both C and Erlang, as well as the testing details. I found that the test is biased because it does not discount the time it takes erlang to start the VM /w the schedulers, natively compiled C is started much faster. The way those benchmarks measure is basically:
time erl -noshell -s revcomp5 main < revcomp-input.txt
Now the benchmark says Java took 1.4 seconds and erlang /w HiPE took 11. Running the (Single threaded) Erlang code took me 0.15 seconds, and if you discount the time it took to start the vm, the actual workload took only 3000 microseconds (0.003 seconds).
So I have no idea how that is benchmarked, if its done 100 times then it makes no sense as the cost of starting the erlang VM will be x100. If the input is a lot longer than given, it would make sense, but I see no details on the webpage of that. To make the benchmarks more fair for Managed languages, have the code (Erlang/Java) send a Unix signal to the python (that is doing the benchmarking) that it hit the startup function.
Now benchmark aside, the erlang VM essentially just executes machine code at the end, as well as the Java VM. So there is no way a math operation would take longer in Erlang than in Java.
What Erlang is bad at is data that needs to mutate often. Such as a chained block cypher. Say you have the chars "0123456789", now your encryption xors the first 2 chars by 7, then xors the next two chars by the result of the first two added, then xors the previous 2 chars by the results of the current 2 subtracted, then xors the next 4 chars.. etc
Because objects in Erlang are immutable this means that the entire char array needs to be copied each time you mutate it. That is why erlang has support for things called NIFS which is C code you can call into to solve this exact problem. In fact all the encryption (ssl,aes,blowfish..) and compression (zlib,..) that ship with Erlang are implemented in C, also there is near 0 cost associated with calling C from Erlang.
So using Erlang you get the best of both worlds, you get the speed of C with the parallelism of Erlang.
If I were to implement the reverse-complement in the FASTEST way possible, I would write the mutating code using C but the parallel code using Erlang. Assuming infinite input, I would have Erlang split on the >
<<Line/binary, ">", Rest/binary>> = read_stream
Dispatch the block to the first available scheduler via round robin, consisting of infinite EC2 private networked hidden nodes, being added in real time to the cluster every millisecond.
Those nodes then call out to C via NIFS for processing (C was the fastest implementation for reverse-compliment on alioth website), then send the output back to the node master to send out to the inputer.
To implement all this in Erlang I would have to write code as if I was writing a single threaded program, it would take me under a day to create this code.
To implement this in Java, I would have to write the single threaded code, I would have to take the performance hit of calling from Managed to Unmanaged (as we will be using the C implementation for the grunt work obviously), then rewrite to support 64 cores. Then rewrite it to support multiple CPUS. Then rewrite it again to support clustering. Then rewrite it again to fix memory issues.
And that is Erlang in a nutshell.

As pointed in other answers - Erlang is designed to solve effectively real life problems, which are bit opposite to benchmark problems.
But I'd like to enlighten one more aspect - pithiness of erlang code (in some cases means rapidness of development), which could be easily concluded, after comparing benchmarks implementations.
For example, k-nucleotide benchmark:
Erlang version: http://benchmarksgame.alioth.debian.org/u64q/program.php?test=knucleotide&lang=hipe&id=3
Java version: http://benchmarksgame.alioth.debian.org/u64q/program.php?test=knucleotide&lang=java&id=3
If you want more real-life benchmarks, I'd suggest you Comparing C++ And Erlang For Motorola Telecoms Software

The Erlang solution uses ETS, Erlang Term Storage, which is like an in-memory database running in a separate process. Consequent to it being in a separate process, all messages to and from that process must be serialized/deserialized. This would account for a lot of the slowness, I should think. For example, if you look at the "regex-dna" benchmark, Erlang is only slightly slower than Java there, and it doesn't use ETS.

The fact that erlang has to allocate memory for every value whereas in java you will typically reuse variables if you want it to be fast, means it will always be faster for 'tight loop' bench marks.
It would be interesting to benchmark a java version using the -client flag and boxed primitives and compare that to erlang.
I believe using hipe is unfair since it is not an active project. I would be interested to know if any mission critical software is running on this.

I don't know anything about Erlang, but this seems to be a compare apples to oranges approach anyways. You must be aware that considerable effort was spent over more than a decade to improve java preformance to the point where it is today.
Its not surprising (to me) that a language implementation done by volunteers or a small company can not outmatch that effort.

Why Python is not better in multiprocessing or multithreading applications than Java?

Since Python has some issues with GIL, Java is better for developing multiprocessing applications. Could you please justify the exact reasoning of java's effective processing than python in your way?

The biggest problem in multithreading in CPython is the Global Interpreter Lock (GIL) (note that other Python implementations don't necessarily share this problem!)
The GIL is an implementation detail that effectively prevents parallel (simultaneous) execution of separate threads in Python. The problem is that whenever Python byte code is to be executed, then the current thread must have acquired the GIL and only a single thread can have the GIL at any given moment.
So if 5 threads are trying to execute some Python byte code, then they will effectively run interleaved, because each one will have to wait for the GIL to become available. This is not usually a problem with single-core computers, as the physical constraints have the same effect: only a single thread can run at a time.
In multi-core/SMP computers, however this becomes a bottleneck. These days almost everthing is running on multiple cores, including effectively all smartphones and even many embedded systems.
Java has no such restrictions, so multiple threads can execute at the exact same time.

I would disagree that Python is not better than Java for Multi-Processing application.
First, I am assuming that the OP is using 'better' to mean 'faster code execution' as far as I can tell.
I suffer from 'speed-freak' syndrome, probably from having come from a C/ASM background, so I have spent considerable time getting to the bottom of the "is Python slow?" issue.
The simple answer to that? "It can be." Here's some important points:
1) With a multi-threadded application, Python is going to have a disadvantage to any language that doesn't have something similar to the GIL. The GIL is an artifact of the Python VM in CPython, not the Python language itself. Some Python VM's like Jython, IronPython, etc do not have a GIL.
2) In a Multi-Process application, the GIL doesn't really apply, and thus you can now start to harness faster execution of your Python code unmolested for the most part by the GIL. I strongly suggest if you want to write large Python code that needs both speed and concurrency, that you learn about Multi-Processing, and possibly ZMQ/0MQ for message passing.
3) Regardless of the GIL, Java displays faster code execution than Python in many areas. This is due to native differences in how Python handles objects in memory:
A number of Python functions create copies of objects in memory rather than modifying them ( see http://www.skymind.com/~ocrow/python_string/ for examples)
Python uses Dict to store attributes for objects, etc. I don't want to distract and delve into these areas, but I can generally say that some of the 'neat' things that Python can do come at a speed cost. It's also important to know that there are ways around the default behaviour if that is causing too high of a speed penalty for you.
4) Some of Java's speed advantage is due to more optimization in the Java VM over Python as far as I can tell. Once you eliminate the differences in how much behind-the-scenes memory/object work is done, Java can often still beat Python. Is it because Java has had more attention than Python? I'm not sure, with enough funding I feel that CPython could be faster.
Check http://c2.com/cgi/wiki?PythonProblems for more discussion on some of these issues.
I will say that I have decided to embrace Python nearly 100% going forward with new code.
Don't fall into the premature optimization trap, and remember you can always call C code in a pinch. Make your code work well, make it maintainable, then start to optimize once the speed of the application isn't fast enough for your needs.
Interesting Benchmarks:
http://benchmarksgame.alioth.debian.org/u64/python.php
Further information about Python speed issues can be found here:
http://www.infoworld.com/d/application-development/van-rossum-python-not-too-slow-188715

Which programming language for compute-intensive trading portfolio simulation?

I am building a trading portfolio management system that is responsible for production, optimization, and simulation of non-high frequency trading portfolios (dealing with 1min or 3min bars of data, not tick data).
I plan on employing Amazon web services to take on the entire load of the application.
I have four choices that I am considering as language.
Java
C++
C#
Python
Here is the scope of the extremes of the project scope. This isn't how it will be, maybe ever, but it's within the scope of the requirements:
Weekly simulation of 10,000,000 trading systems.
(Each trading system is expected to have its own data mining methods, including feature selection algorithms which are extremely computationally-expensive. Imagine 500-5000 features using wrappers. These are not run often by any means, but it's still a consideration)
Real-time production of portfolio w/ 100,000 trading strategies
Taking in 1 min or 3 min data from every stock/futures market around the globe (approx 100,000)
Portfolio optimization of portfolios with up to 100,000 strategies. (rather intensive algorithm)
Speed is a concern, but I believe that Java can handle the load.
I just want to make sure that Java CAN handle the above requirements comfortably. I don't want to do the project in C++, but I will if it's required.
The reason C# is on there is because I thought it was a good alternative to Java, even though I don't like Windows at all and would prefer Java if all things are the same.
Python - I've read somethings on PyPy and pyscho that claim python can be optimized with JIT compiling to run at near C-like speeds... That's pretty much the only reason it is on this list, besides that fact that Python is a great language and would probably be the most enjoyable language to code in, which is not a factor at all for this project, but a perk.
To sum up:
real time production
weekly simulations of a large number of systems
weekly/monthly optimizations of portfolios
large numbers of connections to collect data from
There is no dealing with millisecond or even second based trades. The only consideration is if Java can possibly deal with this kind of load when spread out of a necessary amount of EC2 servers.
Thank you guys so much for your wisdom.

Pick the language you are most familiar with. If you know them all equally and speed is a real concern, pick C.

While I am a huge fan of Python and personaly I'm not a great lover of Java, in this case I have to concede that Java is the right way to go.
For many projects Python's performance just isn't a problem, but in your case even minor performance penalties will add up extremely quickly. I know this isn't a real-time simulation, but even for batch processing it's still a factor to take into consideration. If it turns out the load is too big for one virtual server, an implementation that's twice as fast will halve your virtual server costs.
For many projects I'd also argue that Python will allow you to develop a solution faster, but here I'm not sure that would be the case. Java has world-class development tools and top-drawer enterprise grade frameworks for parallell processing and cross-server deployment and while Python has solutions in this area, Java clearly has the edge. You also have architectural options with Java that Python can't match, such as Javaspaces.
I would argue that C and C++ impose too much of a development overhead for a project like this. They're viable inthat if you are very familiar with those languages I'm sure it would be doable, but other than the potential for higher performance, they have nothing else to bring to the table.
C# is just a rewrite of Java. That's not a bad thing if you're a Windows developer and if you prefer Windows I'd use C# rather than Java, but if you don't care about Windows there's no reason to care about C#.

I would pick Java for this task. In terms of RAM, the difference between Java and C++ is that in Java, each Object has an overhead of 8 Bytes (using the Sun 32-bit JVM or the Sun 64-bit JVM with compressed pointers). So if you have millions of objects flying around, this can make a difference. In terms of speed, Java and C++ are almost equal at that scale.
So the more important thing for me is the development time. If you make a mistake in C++, you get a segmentation fault (and sometimes you don't even get that), while in Java you get a nice Exception with a stack trace. I have always preferred this.
In C++ you can have collections of primitive types, which Java hasn't. You would have to use external libraries to get them.
If you have real-time requirements, the Java garbage collector may be a nuisance, since it takes some minutes to collect a 20 GB heap, even on machines with 24 cores. But if you don't create too many temporary objects during runtime, that should be fine, too. It's just that your program can make that garbage collection pause whenever you don't expect it.

Why only one language for your system? If I were you, I will build the entire system in Python, but C or C++ will be used for performance-critical components. In this way, you will have a very flexible and extendable system with fast-enough performance. You can find even tools to generate wrappers automatically (e.g. SWIG, Cython). Python and C/C++/Java/Fortran are not competing each other; they are complementing.

Write it in your preferred language. To me that sounds like python. When you start running the system you can profile it and see where the bottlenecks are. Once you do some basic optimisations if it's still not acceptable you can rewrite portions in C.
A consideration could be writing this in iron python to take advantage of the clr and dlr in .net. Then you can leverage .net 4 and parallel extensions. If anything will give you performance increases it'll be some flavour of threading which .net does extremely well.
Edit:
Just wanted to make this part clear. From the description, it sounds like parallel processing / multithreading is where the majority of the performance gains are going to come from.

It is useful to look at the inner loop of your numerical code. After all you will spend most of your CPU-time inside this loop.
If the inner loop is a matrix operation, then I suggest python and scipy, but of the inner loop if not a matrix operation, then I would worry about python being slow. (Or maybe I would wrap c++ in python using swig or boost::python)
The benefit of python is that it is easy to debug, and you save a lot of time by not having to compile all the time. This is especially useful for a project where you spend a lot of time programming deep internals.

I would go with pypy. If not, http://lolcode.com/.

Java performance in numerical algorithms

I am curious about performance of Java numerical algorithms, say for example matrix matrix double precision multiplication, using the latest JIT machines as compared for example to hand tuned SSE C++/assembler or Fortran counterparts.
I have looked on the web but most of the results come from almost 10 years ago and I understand Java progressed quite a lot since then.
If you have experience using Java for numerically intensive applications can you share your experience. Also how well does Java perform in kernels where the loops are relatively short and the memory access is not very uniform but still within the limits of L1 cache? If such kernel is executed multiple times in succession, can JVM optimize it during runtime?
Thanks

I have written some reasonably large and performance sensitive numerical code in Java (crunching large arrays of doubles usually).
I've found Java to be "good enough" for fast numerical calculations. Especially when you consider that you are usually not CPU-bound anyway - memory latency and cache awareness will probably be your biggest problem for large datasets.
However, you can still beat Java with hand-optimized C/C++ code that takes advantage of specific vectorised instructions etc. or highly customised memory layouts. So for the very fastest code, you could consider writing the core algorithm in C/C++ and calling it from Java using JNI.
Personally, I find that creating a native code dependency is usually more trouble than it is worth so I tend to stick to the pure Java approach.

This is coming from a .NET side of things, but I'm 90% sure that it's the case for Java too. While the JIT will make some use of SSE instructions where it can, it currently does not auto-vectorize your code when dealing with, for example, matrix multiplications. Hand vectorized C++ using compiler intrinsics/inline assembly will definitely be faster here.

One of the weakest points in java is (native) matrix operations. This is due to the nature of Java matrices:
You can not declare a matrix to be rectangular, ie. each row can have a different number of columns.
A matrix is technically not a "matrix of doubles (or ints, ...)", but an array of arrays of ... . The big difference is that since arrays are Java objects you can assign the same array object to more than 1 row.
These two properties make a lot of standard matrix optimizations impossible for the compiler.
You might get better performance by using a Java library which emulates matrices on a single long array. However you have the overhead of method calls for all access.

C++ will definitely be faster. You can even have some hand-optimized libraries for your purposes that contain assembly codes for each of the major CPUs out there. You can't get better than that.
Afterwards, you can use JNI to call to it from Java, if needed.
Java is not meant for high performance arithmetic calculations like this. If you are depending on these, I'd recommend picking a proper, low-level language to implement that. Or, alternatively, you can write the performance-specific part in a low level language, and then connect it to a Java front-end using JNI or some other IPC method.

This is a link to the programming language shootout page for java vs c++, which will give you a comparison of java's speed on several compute intensive algorithms. It will also show you what highest performance java code looks like. For the most part, for these few specific benchmarks, java took more time (but not more than 2 or 3 times) to run.

Seconding that your best bet is to test it for yourself, as performance will vary somewhat depending on what you're doing exactly. I find it difficult to believe Shane C. Mason's answer that Java performance will be the same as C++ or Fortran performance, as even C++ and Fortran are not really comparable for some scientific computing algorithms.
I have a computational fluid dynamics code that I wrote using C++ and the same code essentially translated into Fortran. I'm not really sure why yet, but the Fortran version is about twice as fast as the C++ version. I would guess that with features like bounds-checking and garbage collection, Java would be slower than both, but I would not know until I tested.

This can be so dependent on what you are doing in the C++ code.
For example, are you using the GPU? Edit I forgot about jogl, so Java can compete here.
Are you parallelized using STM or shared-memory, then Java can't compete.
For a link on analysis of parallel matrix multiplication: http://www.cs.utexas.edu/users/plapack/papers/ipps98/ipps98.html
Do you have enough memory to do the calculations in memory, so the garbage collector won't be needed, and have you fine-tuned the garbage collector for optimal performance? Then, Java can be competitive, perhaps.
Are you using multicores, and is the C++ optimized to take advantage of this architecture? Then Java won't be able to compete.
Are you using several computers tied together, then Java won't be able to compete.
Are you using any combination of these, then it will depend on the particular implementation.
Java is not designed to compete with a hand-tuned C++ program, but, the time it takes to do the tuning, are you doing enough calculations where it will matter? Java will be able to give some reasonable speed but with less work than hand-tuning, but not much of an improvement over just doing C++ code.
You may want to see if there is an improvement over Haskell or Erlang, for example, over your C++, as these languages are better designed for this type of work.

Are these kind of computations you're interested in - Fast Fourier Transform, Jacobi Successive Over Relaxation, Monte Carlo Integration, Sparse Matrix Mult, Dense LU Matrix Factorisation?
They make up the SciMark 2.0 composite benchmark which you can launch as an applet on your machine.
There are also ANSI C versions of the programs, and an Intel document (pdf) on optimizing and recompiling SciMark for C++.
Similarly you could use The Java Grande Forum Benchmark Suite and the comparison C programs.

Java uses a Just in Time (JIT) compiler to convert the bytecode to native machine language - so the first time it runs through a code block it will be slower but once the segment is 'warmed up' the performance will be equivalent. In short - the numerical performance is pretty good.

Performance of Java matrix math libraries? [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 9 years ago.
We are computing something whose runtime is bound by matrix operations. (Some details below if interested.) This experience prompted the following question:
Do folk have experience with the performance of Java libraries for matrix math (e.g., multiply, inverse, etc.)? For example:
JAMA
COLT
Apache commons math
I searched and found nothing.
Details of our speed comparison:
We are using Intel FORTRAN (ifort (IFORT) 10.1 20070913). We have reimplemented it in Java (1.6) using Apache commons math 1.2 matrix ops, and it agrees to all of its digits of accuracy. (We have reasons for wanting it in Java.) (Java doubles, Fortran real*8). Fortran: 6 minutes, Java 33 minutes, same machine. jvisualm profiling shows much time spent in RealMatrixImpl.{getEntry,isValidCoordinate} (which appear to be gone in unreleased Apache commons math 2.0, but 2.0 is no faster). Fortran is using Atlas BLAS routines (dpotrf, etc.).
Obviously this could depend on our code in each language, but we believe most of the time is in equivalent matrix operations.
In several other computations that do not involve libraries, Java has not been much slower, and sometimes much faster.

I'm the author of Java Matrix Benchmark (JMatBench) and I'll give my thoughts on this discussion.
There are significant difference between Java libraries and while there is no clear winner across the whole range of operations, there are a few clear leaders as can be seen in the latest performance results (October 2013).
If you are working with "large" matrices and can use native libraries, then the clear winner (about 3.5x faster) is MTJ with system optimised netlib. If you need a pure Java solution then MTJ, OjAlgo, EJML and Parallel Colt are good choices. For small matrices EJML is the clear winner.
The libraries I did not mention showed significant performance issues or were missing key features.

Just to add my 2 cents. I've compared some of these libraries. I attempted to matrix multiply a 3000 by 3000 matrix of doubles with itself. The results are as follows.
Using multithreaded ATLAS with C/C++, Octave, Python and R, the time taken was around 4 seconds.
Using Jama with Java, the time taken was 50 seconds.
Using Colt and Parallel Colt with Java, the time taken was 150 seconds!
Using JBLAS with Java, the time taken was again around 4 seconds as JBLAS uses multithreaded ATLAS.
So for me it was clear that the Java libraries didn't perform too well. However if someone has to code in Java, then the best option is JBLAS. Jama, Colt and Parallel Colt are not fast.

I'm the main author of jblas and wanted to point out that I've released Version 1.0 in late December 2009. I worked a lot on the packaging, meaning that you can now just download a "fat jar" with ATLAS and JNI libraries for Windows, Linux, Mac OS X, 32 and 64 bit (except for Windows). This way you will get the native performance just by adding the jar file to your classpath. Check it out at http://jblas.org!

I just compared Apache Commons Math with jlapack.
Test: singular value decomposition of a random 1024x1024 matrix.
Machine: Intel(R) Core(TM)2 Duo CPU E6750 # 2.66GHz, linux x64
Octave code: A=rand(1024); tic;[U,S,V]=svd(A);toc
results execution time
---------------------------------------------------------
Octave 36.34 sec
JDK 1.7u2 64bit
jlapack dgesvd 37.78 sec
apache commons math SVD 42.24 sec
JDK 1.6u30 64bit
jlapack dgesvd 48.68 sec
apache commons math SVD 50.59 sec
Native routines
Lapack* invoked from C: 37.64 sec
Intel MKL 6.89 sec(!)
My conclusion is that jlapack called from JDK 1.7 is very close to the native
binary performance of lapack. I used the lapack binary library coming with linux distro and invoked the dgesvd routine to get the U,S and VT matrices as well. All tests were done using double precision on exactly the same matrix each run (except Octave).
Disclaimer - I'm not an expert in linear algebra, not affiliated to any of the libraries above and this is not a rigorous benchmark.
It's a 'home-made' test, as I was interested comparing the performance increase of JDK 1.7 to 1.6 as well as commons math SVD to jlapack.

I can't really comment on specific libraries, but in principle there's little reason for such operations to be slower in Java. Hotspot generally does the kinds of things you'd expect a compiler to do: it compiles basic math operations on Java variables to corresponding machine instructions (it uses SSE instructions, but only one per operation); accesses to elements of an array are compiled to use "raw" MOV instructions as you'd expect; it makes decisions on how to allocate variables to registers when it can; it re-orders instructions to take advantage of processor architecture... A possible exception is that as I mentioned, Hotspot will only perform one operation per SSE instruction; in principle you could have a fantastically optimised matrix library that performed multiple operations per instruction, although I don't know if, say, your particular FORTRAN library does so or if such a library even exists. If it does, there's currently no way for Java (or at least, Hotspot) to compete with that (though you could of course write your own native library with those optimisations to call from Java).
So what does all this mean? Well:
in principle, it is worth hunting around for a better-performing library, though unfortunately I can't recomend one
if performance is really critical to you, I would consider just coding your own matrix operations, because you may then be able perform certain optimisations that a library generally can't, or that a particular library your using doesn't (if you have a multiprocessor machine, find out if the library is actually multithreaded)
A hindrance to matrix operations is often data locality issues that arise when you need to traverse both row by row and column by column, e.g. in matrix multiplication, since you have to store the data in an order that optimises one or the other. But if you hand-write the code, you can sometimes combine operations to optimise data locality (e.g. if you're multiplying a matrix by its transformation, you can turn a column traversal into a row traversal if you write a dedicated function instead of combining two library functions). As usual in life, a library will give you non-optimal performance in exchange for faster development; you need to decide just how important performance is to you.

Jeigen https://github.com/hughperkins/jeigen
wraps Eigen C++ library http://eigen.tuxfamily.org , which is one of the fastest free C++ libraries available
relatively terse syntax, eg 'mmul', 'sub'
handles both dense and sparse matrices
A quick test, by multiplying two dense matrices, ie:
import static jeigen.MatrixUtil.*;
int K = 100;
int N = 100000;
DenseMatrix A = rand(N, K);
DenseMatrix B = rand(K, N);
Timer timer = new Timer();
DenseMatrix C = B.mmul(A);
timer.printTimeCheckMilliseconds();
Results:
Jama: 4090 ms
Jblas: 1594 ms
Ojalgo: 2381 ms (using two threads)
Jeigen: 2514 ms
Compared to jama, everything is faster :-P
Compared to jblas, Jeigen is not quite as fast, but it handles sparse matrices.
Compared to ojalgo, Jeigen takes about the same amount of elapsed time, but only using one core, so Jeigen uses half the total cpu. Jeigen has a terser syntax, ie 'mmul' versus 'multiplyRight'

There's a benchmark of various matrix packages available in java up on
http://code.google.com/p/java-matrix-benchmark/ for a few different hardware configurations. But it's no substitute for doing your own benchmark.
Performance is going to vary with the type of hardware you've got (cpu, cores, memory, L1-3 cache, bus speed), the size of the matrices and the algorithms you intend to use. Different libraries have different takes on concurrency for different algorithms, so there's no single answer. You may also find that the overhead of translating to the form expected by a native library negates the performance advantage for your use case (some of the java libraries have more flexible options regarding matrix storage, which can be used for further performance optimizations).
Generally though, JAMA, Jampack and COLT are getting old, and do not represent the state of the current performance available in Java for linear algebra. More modern libraries make more effective use of multiple cores and cpu caches. JAMA was a reference implementation, and pretty much implements textbook algorithms with little regard to performance. COLT and IBM Ninja were the first java libraries to show that performance was possible in java, even if they lagged 50% behind native libraries.

I'm the author of la4j (Linear Algebra for Java) library and here is my point. I've been working on la4j for 3 years (the latest release is 0.4.0 [01 Jun 2013]) and only now I can start doing performace analysis and optimizations since I've just covered the minimal required functional. So, la4j isn't as fast as I wanted but I'm spending loads of my time to change it.
I'm currently in the middle of porting new version of la4j to JMatBench platform. I hope new version will show better performance then previous one since there are several improvement I made in la4j such as much faster internal matrix format, unsafe accessors and fast blocking algorithm for matrix multiplications.

Have you taken a look at the Intel Math Kernel Library? It claims to outperform even ATLAS. MKL can be used in Java through JNI wrappers.

Linalg code that relies heavily on Pentiums and later processors' vector computing capabilities (starting with the MMX extensions, like LAPACK and now Atlas BLAS) is not "fantastically optimized", but simply industry-standard. To replicate that perfomance in Java you are going to need native libraries. I have had the same performance problem as you describe (mainly, to be able to compute Choleski decompositions) and have found nothing really efficient: Jama is pure Java, since it is supposed to be just a template and reference kit for implementers to follow... which never happened. You know Apache math commons... As for COLT, I have still to test it but it seems to rely heavily on Ninja improvements, most of which were reached by building an ad-hoc Java compiler, so I doubt it's going to help.
At that point, I think we "just" need a collective effort to build a native Jama implementation...

Building on Varkhan's post that Pentium-specific native code would do better:
jBLAS: An alpha-stage project with JNI wrappers for Atlas: http://www.jblas.org.
Author's blog post: http://mikiobraun.blogspot.com/2008/10/matrices-jni-directbuffers-and-number.html.
MTJ: Another such project: http://code.google.com/p/matrix-toolkits-java/

We have used COLT for some pretty large serious financial calculations and have been very happy with it. In our heavily profiled code we have almost never had to replace a COLT implementation with one of our own.
In their own testing (obviously not independent) I think they claim within a factor of 2 of the Intel hand-optimised assembler routines. The trick to using it well is making sure that you understand their design philosophy, and avoid extraneous object allocation.

I have found that if you are creating a lot of high dimensional Matrices, you can make Jama about 20% faster if you change it to use a single dimensional array instead of a two dimensional array. This is because Java doesn't support multi-dimensional arrays as efficiently. ie. it creates an array of arrays.
Colt does this already, but I have found it is more complicated and more powerful than Jama which may explain why simple functions are slower with Colt.
The answer really depends on that you are doing. Jama doesn't support a fraction of the things Colt can do which make make more of a difference.

You may want to check out the jblas project. It's a relatively new Java library that uses BLAS, LAPACK and ATLAS for high-performance matrix operations.
The developer has posted some benchmarks in which jblas comes off favourably against MTJ and Colt.

For 3d graphics applications the lwjgl.util vector implementation out-performed above mentioned jblas by a factor of about 3.
I have done 1 million matrix multiplications of a vec4 with a 4x4 matrix.
lwjgl finished in about 18ms, jblas required about 60ms.
(I assume, that the JNI approach is not very suitable for fast successive application of relatively small multiplications. Since the translation/mapping may take more time than the actual execution of the multiplication.)

There's also UJMP

There are many different freely available java linear algebra libraries. http://www.ujmp.org/java-matrix/benchmark/
Unfortunately that benchmark only gives you info about matrix multiplication (with transposing the test does not allow the different libraries to exploit their respective design features).
What you should look at is how these linear algebra libraries perform when asked to compute various matrix decompositions.
http://ojalgo.org/matrix_compare.html

Matrix Tookits Java (MTJ) was already mentioned before, but perhaps it's worth mentioning again for anyone else stumbling onto this thread. For those interested, it seems like there's also talk about having MTJ replace the linalg library in the apache commons math 2.0, though I'm not sure how that's progressing lately.

You should add Apache Mahout to your shopping list.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.