A number of performance tips have been made obsolete by the Java compiler and especially by profile-guided optimization. For example, these platform-provided optimizations can (according to several sources) drastically reduce the cost of virtual function calls. The VM is also capable of method inlining, loop unrolling, and so on.
What other performance optimization techniques have you come across that are still being applied but are actually made obsolete by the optimization mechanisms found in more modern JVMs?
The final modifier on methods and method parameters doesn't help performance at all.
Also, the Java HotSpot wiki gives a good overview of the optimizations used by HotSpot and how to efficiently use them in Java code.
People replace String a = "this" + var1 + " is " + var2; with multiple calls to StringBuilder or StringBuffer. The compiler actually already uses StringBuilder behind the scenes.
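For illustration, here is roughly what javac has generated for that single expression since Java 5 (Java 9 and later use an invokedynamic-based scheme instead, which is at least as efficient):

    int var1 = 1, var2 = 2;  // placeholder values for the variables above
    // Equivalent of: String a = "this" + var1 + " is " + var2;
    String a = new StringBuilder()
            .append("this").append(var1)
            .append(" is ").append(var2)
            .toString();

The one case where a hand-written StringBuilder still pays off is concatenation inside a loop, where reusing a single builder avoids creating (and copying) a new one on every iteration.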
It is necessary to define the time/memory trade-offs before starting performance optimization. This is how I do it for my memory/time-critical application (repeating some answers above, to be complete):
Rule #1: Never do performance optimization at an early stage of development. Never do it if you don't really need it. If you decide to do it, then:
use a profiler to find bottlenecks, and review the source code to find the reasons for them;
choose the data structures that best fit the defined time/memory trade-offs;
choose appropriate algorithms (e.g. iteration vs. recursion, etc.);
avoid using synchronized objects from the Java library if you don't really need them;
avoid explicit/implicit creation of new objects;
override/re-implement the data types/algorithms that come with Java if and only if you are sure they don't fit your requirements.
Use small, independent tests to measure the performance of the chosen algorithms/data structures; a sketch of such a test follows.
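As a minimal sketch of such an isolated test (a naive System.nanoTime harness; the class and workload are made up for illustration, and a proper harness like JMH avoids the usual micro-benchmarking pitfalls):

    import java.util.ArrayList;
    import java.util.LinkedList;
    import java.util.List;

    public class ListAppendTest {
        static long timeAppends(List<Integer> list, int n) {
            long start = System.nanoTime();
            for (int i = 0; i < n; i++) list.add(i);
            return System.nanoTime() - start;
        }

        public static void main(String[] args) {
            // Warm up so the JIT compiles timeAppends before measuring.
            for (int i = 0; i < 10; i++) {
                timeAppends(new ArrayList<Integer>(), 100000);
                timeAppends(new LinkedList<Integer>(), 100000);
            }
            System.out.println("ArrayList:  " + timeAppends(new ArrayList<Integer>(), 1000000) + " ns");
            System.out.println("LinkedList: " + timeAppends(new LinkedList<Integer>(), 1000000) + " ns");
        }
    }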
In 2001 I made apps for a J2ME phone. It was the size of a brick. And very nearly the computational power of a brick.
Making Java apps run acceptably on it required writing them in as procedural a fashion as possible. Furthermore, a very large performance improvement came from catching the ArrayIndexOutOfBoundsException to exit for-loops over all the items in a vector. Think about that!
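For the curious, the trick looked roughly like this ('items' and 'process' are placeholder names, and this is an anti-pattern on any modern JIT, which hoists the bounds check out of such loops anyway):

    // J2ME-era hack: no explicit bounds test; let the VM's own
    // bounds check terminate the loop.
    try {
        for (int i = 0; ; i++) {
            process(items[i]);  // throws once i reaches items.length
        }
    } catch (ArrayIndexOutOfBoundsException e) {
        // expected: we've walked off the end of the array
    }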
Even on Android there are 'fast' ways to loop through all the items in an array and 'slow' ways of writing the same thing, as mentioned in the Google IO videos on Dalvik VM internals.
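The patterns contrasted in that talk look roughly like this ('values' and 'sum' are placeholder names):

    // Slower on early Dalvik: re-reads the field and its length on every pass.
    for (int i = 0; i < this.values.length; i++) {
        sum += this.values[i];
    }

    // Faster: hoist the array and its length into locals
    // (the enhanced for loop does this for you).
    int[] local = this.values;
    int n = local.length;
    for (int i = 0; i < n; i++) {
        sum += local[i];
    }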
However, in answer to your question, I would say that it is most unusual to have to micro-optimise this kind of thing these days, and I'd further expect that on a JIT VM (even the new Android 2.2 VM, which adds JIT) these optimisations are moot.
In 2001 the phone ran the KVM interpreter at 33 MHz. Now phones run Dalvik - a much faster VM than KVM - at 500 MHz to 1500 MHz, on a much faster ARM architecture (a better processor even allowing for the clock-speed gains), with an L1 cache etc., and a JIT has arrived.
We are not yet at the point where I'd be comfortable doing direct pixel manipulation in Java - either on a phone or on the desktop with an i7 - so there is still normal everyday code that Java isn't fast enough for. Here's an interesting blog that claims an expert has said that Java is 80% of C++ speed for some heavy CPU tasks; I am sceptical - I write image manipulation code and I see an order of magnitude between Java and native code for loops over pixels. Maybe I'm missing some trick...? :D
Don't manually call the garbage collector, it hurts performance on modern JVM implementations.
Using Integer instead of Long will not save much space, but it will limit the range of the numbers.
Avoid hand-rolled enum classes and use the built-in enum instead. Java 1.5 introduced real enums; use them.
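For instance, the pre-1.5 hand-rolled pattern and its modern replacement (a minimal sketch):

    // Pre-1.5 style: int constants are not type-safe, don't print
    // usefully, and can't be used in a type-checked switch.
    public static final int STATE_IDLE = 0;
    public static final int STATE_RUNNING = 1;

    // Java 5+: a real enum gives type safety, name(), values() and
    // switch support for free.
    public enum State { IDLE, RUNNING }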
When using an x64 JVM with less than 32 GB of RAM:
A 64-bit JVM uses 30%-50% more memory than a 32-bit JVM because of its bigger ordinary object pointers. You can heavily reduce this overhead by using JDK 6+.
From JDK6u6p to JDK6u22 it is optional and can be enabled by adding the JVM argument:
-XX:+UseCompressedOops
From JDK6u23 onwards (and in JDK 7) it is enabled by default. More info here.
"Premature optimization is the root of all evil"(Donald Knuth)
It is useful to optimize only the bottlenecks.
You should analyze the code in each situation. Maybe you can replace a TreeSet with a faster HashSet because you don't need the sorting feature, or maybe you can use float instead of double (look at the Android SDK).
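A minimal sketch of that substitution (the element type is arbitrary):

    import java.util.HashSet;
    import java.util.Set;
    import java.util.TreeSet;

    Set<String> sorted = new TreeSet<String>();    // O(log n) add/contains, keeps sorted order
    Set<String> unsorted = new HashSet<String>();  // O(1) on average, no ordering guarantee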
If no technique helps, you can try to rewrite a piece of code and call it via JNI, so that native code does the work.
I found the links above outdated. Here is a newer one on Java optimization: http://www.appperfect.com/support/java-coding-rules/optimization.html
Related
Somewhat related question, and a year old: Do any JVM's JIT compilers generate code that uses vectorized floating point instructions?
Preface: I am trying to do this in pure java (no JNI to C++, no GPGPU work, etc...). I have profiled and the bulk of the processing time is coming from the math operations in this method (which is probably 95% floating point math and 5% integer math). I've already reduced all Math.xxx() calls to an approximation that's good enough so most of the math is now floating point multiplies with a few adds.
I have some code that deals with audio processing. I've been making tweaks and have already come across great gains. Now I'm looking into manual loop unrolling to see if there's any benefit (at least with a manual unroll of 2, I am seeing approximately a 25% improvement). While trying my hand at a manual unroll of 4 (which is starting to get very complicated since I am unrolling both loops of a nested loop) I am wondering if there's anything I can do to hint to the jvm that at runtime it can use vector operations (e.g. SSE2, AVX, etc...). Each sample of the audio can be calculated completely independently of other samples, which is why I've been able to see a 25% improvement already (reducing the amount of dependencies on floating point calculations).
For example, I have 4 floats, one for each of the 4 unrolls of the loop, to hold a partially computed value. Does how I declare and use these floats matter? If I make it a float[4], does that hint to the JVM that they are unrelated to each other, versus having float, float, float, float, or even a class with 4 public floats? Is there something I could do without meaning to that would kill my chance of the code being vectorized?
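To make the shape of the question concrete, here is a sketch of an unroll-by-4 with four scalar accumulators ('samples' and 'coeff' are placeholder arrays, with length assumed to be a multiple of 4):

    // Four independent accumulators break the loop-carried dependency,
    // so the FPU (or, if the JIT manages it, SIMD lanes) can overlap work.
    float acc0 = 0f, acc1 = 0f, acc2 = 0f, acc3 = 0f;
    for (int i = 0; i < samples.length; i += 4) {
        acc0 += samples[i]     * coeff[i];
        acc1 += samples[i + 1] * coeff[i + 1];
        acc2 += samples[i + 2] * coeff[i + 2];
        acc3 += samples[i + 3] * coeff[i + 3];
    }
    float result = (acc0 + acc1) + (acc2 + acc3);

Scalars like these are more likely to stay in registers; whether a float[4] gets the same treatment depends on the JIT's escape analysis and scalar replacement, so it isn't guaranteed.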
I've come across articles online about writing code "normally" because the compiler/jvm knows the common patterns and how to optimize them and deviating from the patterns can mean less optimization. At least in this case however, I wouldn't have expected unrolling the loops by 2 to have improved performance by as much as it did so I'm wondering if there's anything else I can do (or at least not do) to help my chances. I know that the compiler/jvm are only going to get better so I also want to be wary of doing things that will hurt me in the future.
Edit for the curious: unrolling by 4 increased performance by another ~25% over unrolling by 2, so I really think vector operations would help in my case if the jvm supported it (or perhaps already is using them).
Thanks!
How can I .. audio processing .. pure java (no JNI to C++, no GPGPU work, etc...) .. use vector operations (e.g. SSE2, AVX, etc...)
Java is a high-level language (one instruction in Java generates many hardware instructions) which is by design (e.g. garbage-collected memory management) not suitable for tasks that manipulate high data volumes in real time.
There are usually special pieces of hardware optimized for a particular role (e.g. image processing or speech recognition) that often utilize parallelization through several simplified processing pipelines.
There are also special programming languages for this sort of tasks, mainly hardware description languages and assembly language.
Even C++ (considered a fast language) will not automagically use some super-optimized hardware operations for you. It may just inline one of several hand-crafted assembly-language methods in certain places.
So my answer is that there is "probably no way" to instruct the JVM to use some hardware optimization for your code (e.g. SSE), and even if there were, the Java language runtime would still have too many other factors that would slow down your code.
Use a low-level language designed for this task and link it to Java for the high-level logic.
EDIT: adding some more info based on comments
If you are convinced that a high-level "write once, run anywhere" language runtime definitely should also do lots of low-level optimizations for you and automagically turn your high-level code into optimized low-level code, then note that the way the JIT compiler optimizes depends on the implementation of the Java Virtual Machine. There are many of them.
In the case of the Oracle JVM (HotSpot), you can start looking for your answer by downloading the source code; the text SSE2 appears in the following files:
openjdk/hotspot/src/cpu/x86/vm/assembler_x86.cpp
openjdk/hotspot/src/cpu/x86/vm/assembler_x86.hpp
openjdk/hotspot/src/cpu/x86/vm/c1_LIRGenerator_x86.cpp
openjdk/hotspot/src/cpu/x86/vm/c1_Runtime1_x86.cpp
openjdk/hotspot/src/cpu/x86/vm/sharedRuntime_x86_32.cpp
openjdk/hotspot/src/cpu/x86/vm/vm_version_x86.cpp
openjdk/hotspot/src/cpu/x86/vm/vm_version_x86.hpp
openjdk/hotspot/src/cpu/x86/vm/x86_32.ad
openjdk/hotspot/src/os_cpu/linux_x86/vm/os_linux_x86.cpp
openjdk/hotspot/src/share/vm/c1/c1_GraphBuilder.cpp
openjdk/hotspot/src/share/vm/c1/c1_LinearScan.cpp
openjdk/hotspot/src/share/vm/runtime/globals.hpp
They're in C++ and assembly language so you will have to learn some low level languages to read them anyway.
I would not hunt that deep even for a +500 bounty. IMHO the question is wrong, being based on wrong assumptions.
SuperWord optimizations on Hotspot are limited and quite fragile. Limited since they are generally behind what a C/C++ compiler offers, and fragile since they depend on particular loop shapes (and are only supported for certain CPUs).
I understand you want to write once, run anywhere. It sounds like you already have a pure Java solution. You might want to consider optional implementations for known popular platforms to supplement that baseline - making it "fast in some places", which is probably already true.
It's hard to give you more concrete feedback without some code. I suggest you take the loop in question and present it in a JMH benchmark; this makes it easy to analyze and discuss.
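A minimal JMH skeleton to drop the loop into might look like this (the kernel body, class name, and sizes are placeholders):

    import java.util.concurrent.TimeUnit;
    import org.openjdk.jmh.annotations.*;

    @State(Scope.Thread)
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    public class AudioKernelBench {
        float[] samples;

        @Setup
        public void setup() {
            samples = new float[4096];
            for (int i = 0; i < samples.length; i++) samples[i] = i * 0.001f;
        }

        @Benchmark
        public float kernel() {
            // paste the loop under discussion here; returning the result
            // stops the JIT from dead-code-eliminating it
            float acc = 0f;
            for (int i = 0; i < samples.length; i++) acc += samples[i] * samples[i];
            return acc;
        }
    }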
Before I begin, I apologize for the lack of comments in my code. I am currently writing an OBJ file loader (in Java). Although my code works as expected for small files, when files become large (for example, I am currently attempting to load an OBJ file with 25,958 lines) my entire system crashes. I recently migrated the project over from C++, which could load this model quickly. I used a profiler alongside a debugger to determine where the process crashes my system. I noticed a few things: first, it was hanging during the initiation process; second, my heap was nearly completely used up (about 90% of it).
My code can be found here:
http://pastebin.com/VjN0pzyi
I was curious about methods I could employ to optimize this code.
When you're really low on memory, everything slows down a lot. I guess you should improve your coding; things like
startChar = line[i].toCharArray()[k];
probably don't get optimized to
startChar = line[i].charAt(k);
automagically. Maybe interning your strings could save a lot of memory; try String.intern or Guava's Interner.
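A sketch of what that interning might look like in the loader ('rawToken' stands for whatever substring the parser just produced):

    import com.google.common.collect.Interner;
    import com.google.common.collect.Interners;

    // OBJ files repeat tokens like "v", "vt", "f" and material names on
    // almost every line; keeping one canonical copy of each can shrink
    // the heap considerably.
    Interner<String> interner = Interners.newWeakInterner();
    String token = interner.intern(rawToken);

    // Plain-JDK alternative, using the JVM-wide string pool:
    String token2 = rawToken.intern();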
HotSpot loves short methods, so refactor. The code as it stands is hard to read, and I guess that given its size no optimizations get done at all!
I know this is an old question, but I wanted to throw in my two cents on your performance issues. You say that your code not only runs slowly but also takes up 90% of the heap. I think 90% is an egregious exaggeration, but it still lets me point out the biggest flaw in Java game development: Java does not support value types, such as structs. That means that in order to gain speed you're required to avoid OOP, because every time you instantiate a class for your loader it is allocated on the heap. You must then invariably wait for the GC to kick in to get rid of the clutter and leftover instances that your loader created. Now take a language like C# as an example of how to create a real language: C# fully supports structs. You could replace every class in your loader with them. Face, Group, Vertex, and Normal classes would then be treated as value types; they are deleted when the stack unwinds. No garbage is generated, or at least very little if you're required to use a class or two.
In my opinion, don't use Java for game development. I used it for years before discovering C#. Strictly my opinion here, but Java is a horrible language; I will never use it again.
I want to ask a complex question.
I have to code a heuristic for my thesis. I need the following:
Evaluate some integral functions
Minimize functions over an interval
Do this thousands and thousands of times.
So I need a fast programming language for these jobs. Which language do you suggest? I started with Java first, but taking the integrals became a problem, and I'm not sure about its speed.
Connecting Java to other software like MATLAB may be a good idea. Since I'm not sure, I'd like to hear your opinions.
Thanks!
C, Java, ... are all Turing-complete languages; they can calculate the same functions with the same precision.
If you want to achieve your performance goals, use C, which is a compiled, high-performance language. It can decrease your computation time by avoiding the method calls and high-level features present in an interpreted language like Java.
Anyway, remember that your implementation may impact performance more than your choice of language, because as the input size grows it is the computational complexity that matters ( http://en.wikipedia.org/wiki/Computational_complexity_theory ).
It's not the programming language; it's probably your algorithm. Determine the big-O complexity of your algorithm. If you use loops within loops where you could instead use a hash lookup in a Map, your algorithm can be made n times faster; a sketch follows after the note below.
Note: modern JVMs (JDK 1.5 or 1.6) compile just-in-time to native code (as in, not interpreted) for a specific OS, a specific OS version, and a specific hardware architecture. You could try the -server flag to JIT even more aggressively (at the cost of an even longer initialization time).
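To illustrate the Map point above with a sketch (Record, queries, records, and handle are placeholder names):

    import java.util.HashMap;
    import java.util.Map;

    // O(n*m): for each query, scan the whole records array.
    for (String q : queries) {
        for (Record r : records) {
            if (r.key.equals(q)) { handle(r); break; }
        }
    }

    // O(n + m): build the index once; each lookup is then O(1) on average.
    Map<String, Record> index = new HashMap<String, Record>();
    for (Record r : records) index.put(r.key, r);
    for (String q : queries) {
        Record r = index.get(q);
        if (r != null) handle(r);
    }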
Do this thousands and thousands of times.
Are you sure it's not more - something like 10^1000 instead? Try accurately calculating how many times you need to run that loop; it might surprise you. The type of problems heuristics are used on tend to have a really big search space.
Before you start switching languages, I'd first try to do the following things:
Find the best available algorithms.
Find available implementations of those algorithms usable from your language.
There are e.g. scientific libraries for Java. Try to use these libraries.
If they are not fast enough, investigate whether there is anything to be done about it. Is your problem more specific than what the library assumes? Are you able to improve the algorithm based on that knowledge?
What is it that takes so much time/memory? Is this really related to your language? Try to avoid measuring JVM start-up time instead of the time it spent performing your calculation.
Then, I'd consider switching languages. But don't expect it to be easy to beat optimized third-party Java libraries in C.
Order of the algorithm
Typically, switching languages only reduces the time required by a constant factor. Let's say you can double the speed using C; but if your algorithm is O(n^2), it will take four times as long when you double the data, no matter the language.
And the JVM can optimize a lot of things getting good results.
Some possible optimizations in Java
If you have methods that are called many times, make them final - and the same for entire classes. The compiler will then know it can inline the method code, avoiding the creation of method-call stack frames for those calls.
My question is regarding the performance of Java versus compiled code, for example C++/Fortran/assembly, in high-performance numerical applications.
I know this is a contentious topic, but I am looking for specific answers/examples. Also community wiki. I have asked similar questions before, but I think I put it broadly and did not get answers I was looking for.
Double-precision matrix-matrix multiplication, commonly known as dgemm in the BLAS library, is able to achieve nearly 100 percent of peak CPU performance (in terms of floating-point operations per second).
There are several factors which allow achieving that performance:
cache blocking, to achieve maximum memory locality
loop unrolling to minimize control overhead
vector instructions, such as SSE
memory prefetching
guarantee no memory aliasing
I have seen lots of benchmarks using assembly, C++, Fortran, ATLAS, and vendor BLAS (typical cases use a matrix of dimension 512 and above).
On the other hand, I have heard that in principle byte-compiled languages/implementations such as Java can be as fast, or nearly as fast, as machine-compiled languages. However, I have not seen definite benchmarks showing that this is so. On the contrary, it seems (from my own research) that byte-compiled languages are much slower.
Do you have good matrix-matrix multiplication benchmarks for Java/C#?
Is a just-in-time compiler (an actual implementation, not a hypothetical one) able to produce instructions which satisfy the points I have listed?
Thanks
With regard to performance:
Every CPU has a peak performance, depending on the number of instructions the processor can execute per second. For example, a modern 2 GHz Intel CPU can achieve 8 billion double-precision adds/multiplies a second, resulting in 8 GFLOPS peak performance. Matrix-matrix multiplication is one of the algorithms able to achieve nearly full performance in terms of operations per second, the main reason being its higher ratio of computation to memory operations (N^3/N^2). The numbers I am interested in are on the order of N > 500.
With regard to implementation: higher-level details such as blocking are handled at the source-code level. Lower-level optimization is handled by the compiler, perhaps with compiler hints regarding alignment/aliasing. A byte-compiled implementation can be written using a blocked approach as well, so in principle the source-code details of a decent implementation will be very similar.
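As a sketch of what source-level blocking looks like in plain Java (row-major flat arrays; BLOCK = 64 is only a placeholder tile size, and c is assumed zero-initialized):

    // Cache-blocked C += A * B for square n x n matrices. The three
    // outer loops walk BLOCK x BLOCK tiles so they stay resident in cache.
    static final int BLOCK = 64;

    static void dgemmBlocked(double[] a, double[] b, double[] c, int n) {
        for (int ii = 0; ii < n; ii += BLOCK)
            for (int kk = 0; kk < n; kk += BLOCK)
                for (int jj = 0; jj < n; jj += BLOCK)
                    for (int i = ii; i < Math.min(ii + BLOCK, n); i++)
                        for (int k = kk; k < Math.min(kk + BLOCK, n); k++) {
                            double aik = a[i * n + k];
                            for (int j = jj; j < Math.min(jj + BLOCK, n); j++)
                                c[i * n + j] += aik * b[k * n + j];
                        }
    }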
A comparison of VC++/.NET 3.5/Mono 2.2 in a pure matrix multiplication scenario:
Source
Mono with Mono.Simd goes a long way towards closing the performance gap with the hand-optimized C++ here, but the C++ version is still clearly the fastest. Mono is at 2.6 now, though, and might be closer; and I would expect that if .NET ever gets something like Mono.Simd, it could be very competitive, as there's not much difference between .NET and the sequential C++ here.
All the factors you specify are typically achieved through manual memory/code optimization for your specific task. But a JIT compiler doesn't have enough information about your domain to make the code as optimal as you would make it by hand, and can apply only general optimization rules. As a result it will be slower than C/C++ matrix-manipulation code (but it can utilize 100% of the CPU, if you want it to :)
Addressing the SSE issue: Java has been using SSE instructions since J2SE 1.4.2.
In a pure math scenario (calculating the 3D coordinates of 25 types of algebraic surfaces), C++ beats Java by a ratio of 2.5.
Java cannot compete with C in matrix multiplication; one reason is that it checks on each array access whether the array bounds are exceeded. Furthermore, Java's math is slow; it does not use the processor's sin() and cos().
Hmmm. Is there a primer anywhere on memory usage in Java? I would have thought Sun or IBM would have had a good article on the subject but I can't find anything that looks really solid. I'm interested in knowing two things:
at runtime, figuring out how much memory the classes in my package are using at a given time
at design time, estimating general memory overhead requirements for various things like:
how much memory overhead is required for an empty object (in addition to the space required by its fields)
how much memory overhead is required when creating closures
how much memory overhead is required for collections like ArrayList
I may have hundreds of thousands of objects created and I want to be a "good neighbor" to not be overly wasteful of RAM. I mean I don't really care whether I'm using 10% more memory than the "optimal case" (whatever that is), but if I'm implementing something that uses 5x as much memory as I could if I made a simple change, I'd want to use less memory (or be able to create more objects for a fixed amount of memory available).
I found a few articles (the Java Specialists' Newsletter and something from JavaWorld) and the built-in java.lang.instrument.Instrumentation.getObjectSize() method, which claims to measure an "approximation" (??) of memory use, but these all seem kind of vague...
(and yes I realize that a JVM running on two different OS's may be likely to use different amounts of memory for different objects)
I used JProfiler a number of years ago and it did a good job, and you could break down memory usage to a fairly granular level.
As of Java 5, on Hotspot and other VMs that support it, you can use the Instrumentation interface to ask the VM the memory usage of a given object. It's fiddly but you can do it.
In case you want to try this method, I've added a page to my web site on querying the memory size of a Java object using the Instrumentation framework.
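The rough shape of that approach, for reference (the class and jar names are made up; the jar's manifest must declare the Premain-Class, and you launch with -javaagent:sizeof.jar):

    import java.lang.instrument.Instrumentation;

    public class SizeOfAgent {
        private static volatile Instrumentation inst;

        // Called by the JVM before main() because of the -javaagent flag.
        public static void premain(String args, Instrumentation instrumentation) {
            inst = instrumentation;
        }

        // Shallow size only: objects referenced by o are not followed.
        public static long sizeOf(Object o) {
            return inst.getObjectSize(o);
        }
    }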
As a rough guide in Hotspot on 32-bit machines:
objects use 8 bytes for "housekeeping";
fields use what you'd expect given their bit length (though booleans tend to be allocated an entire byte);
object references use 4 bytes;
overall object size has a granularity of 8 bytes (i.e. if you have an object with 1 boolean field it will use 16 bytes; if you have an object with 8 booleans it will also use 16 bytes).
There's nothing special about collections in terms of how the VM treats them. Their memory usage is the total of their internal fields plus - if you're counting this - the usage of each object they contain. You need to factor in things like the default array size of an ArrayList, and the fact that that size increases by a factor of 1.5 whenever the list gets full. But whether you ask the VM or use the above metrics, looking at the source code of the collections and "working it through" will essentially get you to the answer.
If by "closure" you mean something like a Runnable or Callable, well again it's just a boring old object like any other. (N.B. They aren't really closures!!)
You can use JMP, but it's only caught up to Java 1.5.
I've used the profiler that comes with newer versions of Netbeans a couple of times and it works very well, supplying you with a ton of information about memory usage and runtime of your programs. Definitely a good place to start.
If you are using a pre-1.5 VM, you can get the approximate size of objects by using serialization. Be warned, though: this can require double the amount of memory for that object.
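A sketch of the serialization trick (treat the result as a rough upper bound, since the stream adds class metadata and follows the whole object graph):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;

    static long serializedSize(Serializable obj) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(bytes);
        out.writeObject(obj);
        out.close();
        return bytes.size();  // byte count of the serialized form
    }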
See if PerfAnal will give you what you are looking for.
This might not be the exact answer you are looking for, but the posts at the following link will give you very good pointers: Other Question about Memory
I believe the profiler included in NetBeans can monitor memory usage also; you can try that.