Where to code this heuristic? - Java

I want to ask a complex question.
I have to code a heuristic for my thesis. I need the following:
Evaluate some integral functions
Minimize functions over an interval
Do this thousands and thousands of times.
So I need a fast programming language for these jobs. Which language do you suggest? I started with Java, but taking integrals became a problem, and I'm not sure about its speed.
Connecting Java to other software such as MATLAB may be a good idea. Since I'm not sure, I'd like your opinions.
Thanks!

C, Java, ... are all Turing-complete languages, so they can compute the same functions with the same precision.
If you have hard performance goals, use C: it is compiled and fast, and it can cut your computation time by avoiding the method-call overhead and high-level features of a managed language like Java.
Anyway, remember that your implementation may impact performance more than the language you choose, because as the input grows it is the computational complexity that dominates ( http://en.wikipedia.org/wiki/Computational_complexity_theory ).

It's probably not the programming language, it's your algorithm. Determine the big-O complexity of your algorithm. If you use loops within loops where a hash lookup in a Map would do, your algorithm can be made n times faster; a sketch follows below.
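For instance, a sketch of that replacement (the "id:payload" record format is invented for illustration):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LookupSketch {
    // O(n) per query: scans the whole list every time it is called.
    static String findByIdSlow(List<String> records, String id) {
        for (String record : records) {
            if (record.startsWith(id + ":")) {
                return record;
            }
        }
        return null;
    }

    // Build a HashMap index once in O(n); each get() is then O(1) on average.
    static Map<String, String> index(List<String> records) {
        Map<String, String> byId = new HashMap<>();
        for (String record : records) {
            String id = record.substring(0, record.indexOf(':'));
            byId.put(id, record);
        }
        return byId;
    }
}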
Note: modern JVMs (JDK 1.5 or 1.6) compile just-in-time to native code (as in: not interpreted) for the specific OS, OS version, and hardware architecture. You could try the -server flag to make the JIT even more aggressive (at the cost of an even longer initialization time).
Do this thousands and thousands of times.
Are you sure it's not more, something like 10^1000 times? Try accurately calculating how many times you need to run that loop; it might surprise you. The type of problems heuristics are used on tends to have a really big search space.

Before you start switching languages, I'd first try the following:
Find the best available algorithms.
Find available implementations of those algorithms usable from your language.
There are, for example, scientific libraries for Java; try to use these libraries (a sketch follows below).
If they are not fast enough, investigate whether anything can be done about it. Is your problem more specific than what the library assumes? Can you improve the algorithm based on that knowledge?
What is it that takes so much time/memory? Is this really related to your language? Take care that you are not measuring JVM start-up time instead of the time it spent calculating for you.
Then I'd consider switching languages. But don't expect it to be easy to beat optimized third-party Java libraries in C.
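For example, a sketch of the integrate-then-minimize step using Apache Commons Math 3 (this assumes commons-math3 is on the classpath; the function and interval are invented placeholders):

import org.apache.commons.math3.analysis.UnivariateFunction;
import org.apache.commons.math3.analysis.integration.SimpsonIntegrator;
import org.apache.commons.math3.optim.MaxEval;
import org.apache.commons.math3.optim.nonlinear.scalar.GoalType;
import org.apache.commons.math3.optim.univariate.*;

public class HeuristicStep {
    public static void main(String[] args) {
        UnivariateFunction f = x -> (x - 1.3) * (x - 1.3);   // invented objective

        // Numerically integrate f over [0, 2].
        double integral = new SimpsonIntegrator().integrate(10_000, f, 0.0, 2.0);

        // Minimize f over [0, 2] with Brent's method.
        UnivariatePointValuePair min = new BrentOptimizer(1e-10, 1e-12)
                .optimize(new MaxEval(200),
                          new UnivariateObjectiveFunction(f),
                          GoalType.MINIMIZE,
                          new SearchInterval(0.0, 2.0));

        System.out.printf("integral = %.6f, minimum at x = %.6f%n",
                integral, min.getPoint());
    }
}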

Order of the algorithm
Typically, switching languages only reduces the time required by a constant factor. Say C doubles your speed: if your algorithm is O(n^2), doubling the data still quadruples the processing time, no matter the language.
And the JVM can optimize a lot of things and get good results.
Some possible optimizations in Java
If you have functions that are called many times, make them final, and do the same for entire classes. The compiler then knows it can inline the method code, avoiding the creation of method-call stack frames for those calls.
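A minimal sketch of that suggestion (names invented; note that a later answer in this collection argues final no longer helps on modern HotSpot, which can devirtualize calls on its own):

final class Evaluator {                  // final class: nothing can override its methods
    private final double scale;

    Evaluator(double scale) { this.scale = scale; }

    final double eval(double x) {        // final method: a static inlining candidate
        return scale * x * x;
    }

    double sum(double[] xs) {
        double total = 0.0;
        for (double x : xs) {
            total += eval(x);            // no virtual dispatch needed here
        }
        return total;
    }
}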

Related

How can I write code to hint to the JVM to use vector operations?

Somewhat related question, and a year old: Do any JVM's JIT compilers generate code that uses vectorized floating point instructions?
Preface: I am trying to do this in pure java (no JNI to C++, no GPGPU work, etc...). I have profiled and the bulk of the processing time is coming from the math operations in this method (which is probably 95% floating point math and 5% integer math). I've already reduced all Math.xxx() calls to an approximation that's good enough so most of the math is now floating point multiplies with a few adds.
I have some code that deals with audio processing. I've been making tweaks and have already come across great gains. Now I'm looking into manual loop unrolling to see if there's any benefit (at least with a manual unroll of 2, I am seeing approximately a 25% improvement). While trying my hand at a manual unroll of 4 (which is starting to get very complicated since I am unrolling both loops of a nested loop) I am wondering if there's anything I can do to hint to the jvm that at runtime it can use vector operations (e.g. SSE2, AVX, etc...). Each sample of the audio can be calculated completely independently of other samples, which is why I've been able to see a 25% improvement already (reducing the amount of dependencies on floating point calculations).
For example, I have 4 floats, one for each of the 4 unrolls of the loop, to hold a partially computed value. Does how I declare and use these floats matter? If I make them a float[4], does that hint to the JVM that they are unrelated to each other, versus having float, float, float, float, or even a class with 4 public floats? Is there something I could do without meaning to that would kill my chances of the code being vectorized?
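For reference, the unrolled loop has roughly this shape (a simplified, invented kernel; the real one is more complicated):

// Hypothetical stand-in for the audio kernel: four independent partial
// sums, so the floating-point adds don't form one long dependency chain.
static float energyUnrolled4(float[] samples, float gain) {
    float s0 = 0f, s1 = 0f, s2 = 0f, s3 = 0f;
    int i = 0;
    int limit = samples.length & ~3;        // largest multiple of 4
    for (; i < limit; i += 4) {
        float a = samples[i]     * gain;
        float b = samples[i + 1] * gain;
        float c = samples[i + 2] * gain;
        float d = samples[i + 3] * gain;
        s0 += a * a;
        s1 += b * b;
        s2 += c * c;
        s3 += d * d;
    }
    float sum = (s0 + s1) + (s2 + s3);
    for (; i < samples.length; i++) {       // scalar tail for leftover samples
        float v = samples[i] * gain;
        sum += v * v;
    }
    return sum;
}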
I've come across articles online about writing code "normally" because the compiler/JVM knows the common patterns and how to optimize them, and deviating from those patterns can mean less optimization. In this case, however, I wouldn't have expected unrolling the loops by 2 to improve performance as much as it did, so I'm wondering if there's anything else I can do (or at least not do) to help my chances. I know the compiler/JVM will only get better, so I also want to be wary of doing things that will hurt me in the future.
Edit for the curious: unrolling by 4 increased performance by another ~25% over unrolling by 2, so I really think vector operations would help in my case if the JVM supported them (or perhaps it is already using them).
Thanks!
How can I … audio processing … pure Java (no JNI to C++, no GPGPU work, etc.) … use vector operations (e.g. SSE2, AVX, etc.)
Java is a high-level language (one instruction in Java generates many hardware instructions) that is by design (e.g. garbage-collected memory management) not well suited to tasks that manipulate large data volumes in real time.
There are usually special pieces of hardware optimized for a particular role (e.g. image processing or speech recognition) that often exploit parallelization through several simplified processing pipelines.
There are also special programming languages for this sort of task, mainly hardware description languages and assembly language.
Even C++ (considered a fast language) will not automagically use some super-optimized hardware operations for you. It may just inline one of several hand-crafted assembly-language methods at certain places.
So my answer is that there is "probably no way" to instruct the JVM to use a particular hardware optimization (e.g. SSE) for your code, and even if there were, the Java runtime would still have too many other factors slowing your code down.
Use a low-level language designed for this task and link it to Java for the high-level logic.
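If you go that route, the JNI seam itself can stay tiny. A minimal sketch (the library name audiokernel and the method signature are hypothetical):

// High-level logic stays in Java; the hot kernel lives in a native library.
public class NativeKernel {
    static {
        // Loads libaudiokernel.so on Linux, audiokernel.dll on Windows.
        System.loadLibrary("audiokernel");
    }

    // Implemented in C/C++ (e.g. with SSE intrinsics) against the JNI
    // header generated by javac -h.
    public static native void process(float[] samples, float gain);
}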
EDIT: adding some more info based on comments
If you are convinced that a high-level "write once, run anywhere" language runtime definitely should also do lots of low-level optimization for you and automagically turn your high-level code into optimized low-level code, then note that how the JIT compiler optimizes depends on the implementation of the Java Virtual Machine, and there are many of them.
In the case of the Oracle JVM (HotSpot), you can start looking for your answer by downloading the source code; the text SSE2 appears in the following files:
openjdk/hotspot/src/cpu/x86/vm/assembler_x86.cpp
openjdk/hotspot/src/cpu/x86/vm/assembler_x86.hpp
openjdk/hotspot/src/cpu/x86/vm/c1_LIRGenerator_x86.cpp
openjdk/hotspot/src/cpu/x86/vm/c1_Runtime1_x86.cpp
openjdk/hotspot/src/cpu/x86/vm/sharedRuntime_x86_32.cpp
openjdk/hotspot/src/cpu/x86/vm/vm_version_x86.cpp
openjdk/hotspot/src/cpu/x86/vm/vm_version_x86.hpp
openjdk/hotspot/src/cpu/x86/vm/x86_32.ad
openjdk/hotspot/src/os_cpu/linux_x86/vm/os_linux_x86.cpp
openjdk/hotspot/src/share/vm/c1/c1_GraphBuilder.cpp
openjdk/hotspot/src/share/vm/c1/c1_LinearScan.cpp
openjdk/hotspot/src/share/vm/runtime/globals.hpp
They're in C++ and assembly language, so you will have to learn some low-level languages to read them anyway.
I would not dig that deep even with a +500 bounty. IMHO the question is based on wrong assumptions.
SuperWord optimizations in HotSpot are limited and quite fragile: limited because they generally lag behind what a C/C++ compiler offers, and fragile because they depend on particular loop shapes (and are only supported for certain CPUs).
I understand you want to write once, run anywhere. It sounds like you already have a pure-Java solution; you might want to supplement it with an optional optimized implementation for known popular platforms, making it "fast in some places", which is probably already true.
It's hard to give more concrete feedback without some code. I suggest you take the loop in question and present it in a JMH benchmark; that makes it easy to analyze and discuss.
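For example, a minimal JMH skeleton (this assumes the JMH annotation classes are on the classpath; the kernel inside is a placeholder, not your code):

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class LoopBench {
    float[] samples;
    float gain = 0.5f;

    @Setup
    public void setup() {
        samples = new float[4096];
        for (int i = 0; i < samples.length; i++) {
            samples[i] = (float) Math.random();
        }
    }

    @Benchmark
    public float plainLoop() {
        float sum = 0f;
        for (float s : samples) {
            float v = s * gain;
            sum += v * v;
        }
        return sum;    // return the result so the JIT can't eliminate the loop
    }
}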

How to tell the efficiency of Java code

I just realized that I have no idea how to tell whether a piece of Java code is efficient from a computational point of view. Reading various source code, sometimes I feel the code I'm reading is highly inefficient, other times the opposite.
Could you list the basic one-line rules to respect, and why they are so important?
edit - My question is about things specific to Java implementations and the JVM: allocation issues, String management, exception handling, thread synchronization, and so on.
Thanks in advance
p.s. don't take the "one-line" literally, please
Basic one-line rule? Okay, here you go:
Avoid unnecessary computations.
How do you do it? Sorry, no one-line answer to that. :(
Well, people spend years in college learning about algorithms and data structures in computer science for a reason... you might want to take a course on algorithms and data structures sometime.
I'm not sure what you mean by "from a computational point of view" (it seems to imply algorithm issues), but assuming you mean tricks closer to profiling, try these:
Run the program, then pause it at a random moment and see where it stopped. Do this a few times; wherever it stops most often is a bottleneck, and how often it stops there indicates how bad the bottleneck is.
Avoid boxing/unboxing (converting between int and Integer, etc.); especially avoid Integer[], List<Integer>, and other things that internally store arrays of boxed primitives.
Factor out common code (sometimes a speed issue, sometimes readability).
Avoid looping with String concatenation; use StringBuilder/StringBuffer instead. In short, avoid creating and/or copying data when it isn't needed (see the sketch after this list).
I'll add to this if other things come to mind.
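A minimal sketch of the String tip above (the loop bound is arbitrary):

public class StringLoop {
    public static void main(String[] args) {
        // Each += copies everything accumulated so far: O(n^2) overall.
        String slow = "";
        for (int i = 0; i < 10_000; i++) {
            slow += i;                     // a new String is allocated every pass
        }

        // One StringBuilder reused across iterations: amortized O(n).
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10_000; i++) {
            sb.append(i);
        }
        String fast = sb.toString();

        System.out.println(slow.equals(fast));   // true, but built very differently
    }
}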
Use profiling. Look at JProfiler or any other profiler.
I'll second Mherdad's answer in that there definitely are no "basic one-line rules."
Regarding answers that suggest using profiling tools: profiling isn't really useful until you understand algorithmic time complexity and big-O notation. From Wikipedia's article on big-O notation:
In mathematics, computer science, and related fields, big-O notation describes the limiting behavior of a function when the argument tends towards a particular value or infinity, usually in terms of simpler functions. Big-O notation characterizes functions according to their growth rates: different functions with the same growth rate may be represented using the same O notation.
The idea behind big-O notation is that it gives you a feel for how input size affects execution time for a given algorithm. For instance, consider the following two methods:
void linearFoo(List<String> strings) {
    for (String s : strings) {
        doSomethingWithString(s);
    }
}

void quadraticFoo(List<String> strings) {
    for (String s : strings) {
        for (String s1 : strings) {
            doSomethingWithTwoStrings(s, s1);
        }
    }
}
linearFoo is said to be O(n), meaning that its time increases linearly with the input size n (i.e. strings.size()). quadraticFoo is said to be O(n^2), meaning that its execution time is a function of strings.size() squared.
Once you have a feel for the algorithmic time complexity of your program, profiling tools start to be useful. For instance, if profiling shows that a method typically takes 1 ms for a fixed input size and the method is O(n), doubling the input size will result in an execution time of about 2 ms. If it is O(n^2), however, doubling the input size will make the method take around 4 ms, since (2n)^2 = 4n^2.
Take a look at the book Effective Java by Joshua Bloch if you really need a list of rules to follow in Java. The book offers guidelines not just for performance but also for the proper way of programming in Java.
You can use jconsole to monitor your application's deadlocks, memory leaks, threads, and heap. In short, you can see your application's performance in graphs.

(Dis)Proving that one algorithm works faster than another due to language internals

For a project at university, we had to implement a few different algorithms to calculate the equivalence classes of a set of elements given a collection of relations between them.
We were instructed to implement, among others, the Union-Find algorithm and its optimizations (union by depth, union by size). By accident (doing something I thought was necessary for the correctness of the algorithm) I discovered another way to optimize the algorithm.
It isn't as fast as union by depth, but close. I couldn't for the life of me figure out why it was as fast as it was, so I consulted one of the teaching assistants, who couldn't figure it out either.
The project was in Java, and the data structures I used were based on simple arrays of Integers (the object, not the primitive int).
Later, at the project's evaluation, I was told that it probably had something to do with 'Java caching', but I can't find anything online about how caching would affect this.
What would be the best way, without calculating the complexity of the algorithm, to prove or disprove that my optimization is this fast because of the way Java does things? Implementing it in another (lower-level?) language? But who's to say that language won't do the same thing?
I hope I made myself clear,
thanks
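For reference, here is a weighted quick-union sketch on primitive int[] arrays (union by size with path halving; not necessarily the variant the poster discovered). The question's Integer[] arrays are logically identical, but every entry is a pointer to a boxed object, so each find() chases references across the heap instead of streaming through one contiguous block, which is one plausible reading of 'Java caching':

class UnionFind {
    private final int[] parent;
    private final int[] size;

    UnionFind(int n) {
        parent = new int[n];
        size = new int[n];
        for (int i = 0; i < n; i++) {
            parent[i] = i;               // every element starts as its own root
            size[i] = 1;
        }
    }

    int find(int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];   // path halving
            x = parent[x];
        }
        return x;
    }

    void union(int a, int b) {
        int ra = find(a), rb = find(b);
        if (ra == rb) return;
        if (size[ra] < size[rb]) { int t = ra; ra = rb; rb = t; }
        parent[rb] = ra;                 // attach the smaller tree under the larger
        size[ra] += size[rb];
    }
}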
The only way is to prove the worst-case (average-case, etc.) complexity of the algorithm.
If you don't, the speed might just be a consequence of a combination of:
The particular data
The size of the data
Some aspect of the hardware
Some aspect of the language implementation
etc.
It is generally very difficult to perform such a task given modern VMs! As you hint, they do all sorts of things behind your back: method calls get inlined, objects are reused, and so on. A prime example is how trivial loops get compiled away entirely if they obviously do nothing but count, or how a function call in functional programming gets inlined or tail-call optimized.
Further, you have the difficulty of proving your point in general on any data set. An O(n^2) algorithm can easily be much faster than a seemingly faster, say O(n log n), algorithm. Two examples:
Bubble sort sorts a nearly-sorted data collection faster than quicksort.
Quicksort in the general case, of course, is faster.
Generally, big-O notation purposely ignores constants, which in a practical situation can mean life or death to your implementation, and those constants may be what hit you. So in practice 0.00001 * n^2 (say, the running time of your algorithm) can be faster than 1000000 * n log n: at n = 10^6, for example, the former is about 10^7 steps while the latter is about 2 * 10^13.
So reasoning is hard given the limited information you provide.
It is likely that either the compiler or the JVM found an optimization for your code. You could try reading the bytecode output by the javac compiler (e.g. with javap -c), and disabling runtime JIT compilation with the -Djava.compiler=NONE option.
If you have access to the source code (and the JDK source code is available, I believe), then you can trawl through it to find the relevant implementation details.

Obsolete Java Optimization Tips

A number of performance tips have been made obsolete by the Java compiler and especially by profile-guided optimization. For example, these platform-provided optimizations can (according to sources) drastically reduce the cost of virtual function calls. The VM is also capable of method inlining, loop unrolling, etc.
What other performance optimization techniques have you come across that are still being applied but are actually made obsolete by the optimization mechanisms found in more modern JVMs?
The final modifier on methods and method parameters doesn't help performance at all.
Also, the Java HotSpot wiki gives a good overview of the optimizations used by HotSpot and how to efficiently use them in Java code.
People replace String a = "this" + var1 + " is " + var2; with multiple explicit calls to StringBuilder or StringBuffer. For a single expression like this, that is unnecessary: the compiler already uses StringBuilder behind the scenes.
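Roughly, older javac versions compile that single expression to the following (Java 9+ uses an invokedynamic-based strategy instead, but the conclusion is the same):

// What the compiler emits for the one-line concatenation, conceptually:
String a = new StringBuilder()
        .append("this")
        .append(var1)
        .append(" is ")
        .append(var2)
        .toString();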
It is necessary to define time/memory trade-offs before starting performance optimization. This is how I do it for my memory/time-critical applications (repeating some answers above, for completeness):
Rule #1: never do performance optimization at an early stage of development, and never do it if you don't really need it. If you have decided to do it, then:
use a profiler to find bottlenecks, and review the source code to find the reasons for them;
choose the appropriate data structure with the best fit for the defined time/memory trade-offs;
choose appropriate algorithms (e.g. iteration vs. recursion, etc.);
avoid using synchronized objects from the Java library if you don't really need them;
avoid explicit/implicit creation of unnecessary objects;
override/re-implement data types/algorithms shipped with Java if and only if you are sure they don't fit your requirements.
Use small, independent tests to measure the performance of the chosen algorithms/data structures.
In 2001 I made apps for a J2ME phone. It was the size of a brick, and had very nearly the computational power of a brick.
Making Java apps run acceptably on it required writing them in as procedural a fashion as possible. Furthermore, a very large performance improvement came from catching ArrayIndexOutOfBoundsException to exit for-loops over all the items in a vector. Think about that!
Even on Android there are 'fast' ways to loop through all the items in an array and 'slow' ways of writing the same thing, as mentioned in the Google I/O videos on Dalvik VM internals.
However, in answer to your question, I would say that it is most unusual to have to micro-optimize this kind of thing these days, and I'd further expect that on a JIT VM (even the new Android 2.2 VM, which adds a JIT) these optimizations are moot.
In 2001 the phone ran the KVM interpreter at 33 MHz. Now phones run Dalvik (a much faster VM than KVM) at 500 to 1500 MHz, on a much faster ARM architecture (a better processor even allowing for the clock-speed gains), with L1 caches etc., and JIT compilation has arrived.
We are not yet in the realm where I'd be comfortable doing direct pixel manipulation in Java, either on-phone or on the desktop with an i7, so there is still normal everyday code that Java isn't fast enough for. Here's an interesting blog that claims an expert has said that Java runs at 80% of C++ speed for some heavy CPU tasks; I am sceptical, since I write image-manipulation code and I see an order of magnitude between Java and native code in loops over pixels. Maybe I'm missing some trick...? :D
Don't manually call the garbage collector; it hurts performance on modern JVM implementations.
Using Integer instead of Long will not save much space, but will limit the range of the numbers.
Avoid hand-written enum-like classes and use the built-in enum instead. Java 1.5 introduced real enums; use them.
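For illustration, the pre-1.5 hand-rolled pattern versus the built-in construct (names invented):

// Pre-1.5 "type-safe enum" idiom: verbose, and not serialization-safe.
class DirectionOld {
    static final DirectionOld NORTH = new DirectionOld("NORTH");
    static final DirectionOld SOUTH = new DirectionOld("SOUTH");
    private final String name;
    private DirectionOld(String name) { this.name = name; }
    public String toString() { return name; }
}

// Since Java 1.5: values(), ordinal(), switch support, and safe
// serialization come for free.
enum Direction { NORTH, SOUTH }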
When using a 64-bit JVM with less than 32 GB of RAM:
A 64-bit JVM uses 30-50% more memory than a 32-bit JVM because of bigger ordinary object pointers. You can heavily reduce this factor by using JDK 6+.
From JDK6u6p to JDK6u22 it is optional and can be enabled by adding the JVM argument:
-XX:+UseCompressedOops
From JDK6u23 (and in JDK 7) it is enabled by default.
"Premature optimization is the root of all evil"(Donald Knuth)
It is useful to optimize only the bottlenecks.
You should analyze the code in each situation. Maybe you can replace the TreeSet by a fast HashSet because you don't need the sorting feature or maybe you can use float instead of double( look at the Android SDK).
If no technique helps you can try to rewrite a piece of code and call it via JNI, so that native code is working.
I found the links above outdated. Here is a newer one on Java optimization: http://www.appperfect.com/support/java-coding-rules/optimization.html

C/C++ versus Java/C# in high-performance applications

My question is regarding the performance of Java versus compiled code, for example C++/Fortran/assembly, in high-performance numerical applications.
I know this is a contentious topic, but I am looking for specific answers/examples. Also community wiki. I have asked similar questions before, but I think I put them too broadly and did not get the answers I was looking for.
Double-precision matrix-matrix multiplication, commonly known as dgemm in the BLAS library, is able to achieve nearly 100 percent of peak CPU performance (in terms of floating-point operations per second).
There are several factors which allow achieving that performance:
cache blocking, to achieve maximum memory locality
loop unrolling to minimize control overhead
vector instructions, such as SSE
memory prefetching
guarantee no memory aliasing
I have seen lots of benchmarks using assembly, C++, Fortran, ATLAS, and vendor BLAS (typical cases are matrices of dimension 512 and above).
On the other hand, I have heard that byte-compiled languages/implementations such as Java can, in principle, be fast or nearly as fast as machine-compiled languages. However, I have not seen definite benchmarks showing that it is so. On the contrary, it seems (from my own research) that byte-compiled languages are much slower.
Do you have good matrix-matrix multiplication benchmarks for Java/C#?
Is a just-in-time compiler (an actual implementation, not a hypothetical one) able to produce instructions that satisfy the points I have listed?
Thanks
With regards to performance:
Every CPU has a peak performance, determined by the number of instructions the processor can execute per second. For example, a modern 2 GHz Intel CPU can achieve 8 billion double-precision adds/multiplies per second, giving 8 GFLOPS of peak performance. Matrix-matrix multiplication is one of the algorithms able to achieve nearly full performance in terms of operations per second, the main reason being its higher ratio of compute to memory operations (N^3 operations over N^2 data). The numbers I am interested in are on the order of N > 500.
With regards to implementation: higher-level details such as blocking are done at the source-code level, while lower-level optimization is handled by the compiler, perhaps with compiler hints regarding alignment/aliasing. A byte-compiled implementation can be written using a blocked approach as well, so in principle the source-code details of a decent implementation will be very similar.
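To make "blocking at the source level" concrete, here is a sketch of a cache-blocked multiply in Java (row-major n x n matrices in flat arrays; the block size of 64 is a guess aimed at L1/L2 cache, not a measured optimum):

// Computes C += A * B one cache-sized tile at a time.
static void dgemmBlocked(double[] a, double[] b, double[] c, int n) {
    final int BLOCK = 64;
    for (int ii = 0; ii < n; ii += BLOCK) {
        for (int kk = 0; kk < n; kk += BLOCK) {
            for (int jj = 0; jj < n; jj += BLOCK) {
                int iMax = Math.min(ii + BLOCK, n);
                int kMax = Math.min(kk + BLOCK, n);
                int jMax = Math.min(jj + BLOCK, n);
                for (int i = ii; i < iMax; i++) {
                    for (int k = kk; k < kMax; k++) {
                        double aik = a[i * n + k];   // reused across the whole j loop
                        for (int j = jj; j < jMax; j++) {
                            c[i * n + j] += aik * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
}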
A comparison of VC++/.NET 3.5/Mono 2.2 in a pure matrix multiplication scenario:
Mono with Mono.Simd goes a long way towards closing the performance gap with the hand-optimized C++ here, but the C++ version is still clearly the fastest. Mono is at 2.6 now and might be closer, and I would expect that if .NET ever gets something like Mono.Simd, it could be very competitive, as there's not much difference between .NET and the sequential C++ here.
All the factors you specify are achieved by manual memory/code optimization for your specific task. But a JIT compiler doesn't have enough information about your domain to make the code as optimal as you would by hand, and can apply only general optimization rules. As a result, it will be slower than C/C++ matrix-manipulation code (but it can utilize 100% of the CPU, if you want it to :)
Addressing the SSE issue: Java has been using SSE instructions since J2SE 1.4.2.
In a pure math scenario (calculating the 3D coordinates of 25 types of algebraic surfaces), C++ beats Java by a ratio of 2.5.
Java cannot compete with C in matrix multiplication; one reason is that it checks on each array access whether the array bounds are exceeded. Furthermore, Java's math is slow: it does not use the processor's sin() and cos().
