Hotspot JIT optimizations

In a lecture about JIT in Hotspot I want to give as many examples as possible of the specific optimizations that JIT performs.
I know just about "method inlining", but there should be much more. Give a vote for every example.

Well, you should scan Brian Goetz's articles for examples.
In brief, HotSpot can and will:
Inline methods
Join adjacent synchronized blocks on the same object
Eliminate locks if monitor is not reachable from other threads
Eliminate dead code (hence most micro-benchmarks are meaningless)
Drop memory writes for non-volatile variables
Replace interface calls with direct method calls for methods only implemented once
et cetera
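
A few items from this list can be made concrete with a small sketch. The class and method names below are made up for illustration. Whether HotSpot actually applies each optimization depends on the JVM version, flags and runtime profile, so the comments mark candidates, not guarantees.

```java
// Hypothetical example class; all names are illustrative only.
final class JitCandidates {

    private final int[] values;

    JitCandidates(int[] values) {
        this.values = values;
    }

    // Tiny accessor: a typical candidate for method inlining.
    int get(int i) {
        return values[i];
    }

    // Hot loop calling the accessor; once inlined, the call overhead can disappear.
    long sum() {
        long total = 0;
        for (int i = 0; i < values.length; i++) {
            total += get(i);
        }
        return total;
    }

    // StringBuffer methods are synchronized, but this buffer never leaves the
    // method, so its monitor is unreachable from other threads
    // (a candidate for lock elimination).
    static String join(String a, String b) {
        StringBuffer sb = new StringBuffer();
        sb.append(a);
        sb.append(b);
        return sb.toString();
    }

    // The loop result is never used, so a compiler may remove the loop
    // entirely (dead code elimination). Timing this method may measure nothing.
    static void deadLoop(int n) {
        long unused = 0;
        for (int i = 0; i < n; i++) {
            unused += i;
        }
    }

    public static void main(String[] args) {
        JitCandidates c = new JitCandidates(new int[] {1, 2, 3, 4});
        System.out.println(c.sum());             // prints 10
        System.out.println(join("hot", "spot")); // prints hotspot
        deadLoop(1_000_000);
    }
}
```

The `deadLoop` method is the reason naive micro-benchmarks mislead: if nothing consumes the result, the measured work may not exist in the compiled code.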

There is a great presentation on the optimizations used by modern JVMs on the Jikes RVM site:
ACACES’06 - Dynamic Compilation and Adaptive Optimization in Virtual Machines
It discusses architecture, tradeoffs, measurements and techniques. And names at least 20 things JVMs do to optimize the machine code.

I think the interesting optimizations are the ones a JIT can do and a conventional ahead-of-time compiler can't. Inlining methods, eliminating dead code, CSE, liveness analysis, etc. are all done by your average C++ compiler as well, so there is nothing "special" there.
But optimizing based on optimistic assumptions and then deoptimizing later if they turn out to be wrong (assuming a specific type, removing branches that have never been taken, ...)? Removing virtual calls when we can guarantee that only one implementing class is loaded at the moment (again something that only works reliably with deoptimization)? Adaptive optimization is, I think, the one thing that really distinguishes the JIT from your run-of-the-mill C++ compiler.
Maybe also mention the runtime profiling the JIT does to decide which optimizations to apply (not that unique anymore, given profile-guided optimization in static compilers).
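
To make the speculation point concrete, here is a minimal sketch of a call site that starts out with a single receiver type and later sees a second one. The types `Shape`, `Square` and `Rect` are hypothetical. The code cannot show the deoptimization itself; it only sets up the situation in which a JIT may compile `totalArea` under the "only `Square`" assumption and later has to give that assumption up.

```java
import java.util.Arrays;

// Hypothetical types for illustration; not taken from any library.
interface Shape {
    double area();
}

final class Square implements Shape {
    private final double side;
    Square(double side) { this.side = side; }
    public double area() { return side * side; }
}

final class Rect implements Shape {
    private final double w, h;
    Rect(double w, double h) { this.w = w; this.h = h; }
    public double area() { return w * h; }
}

final class SpeculationDemo {

    // In the bytecode, s.area() is an interface call. While only one receiver
    // type has been observed, a JIT may turn it into a direct (or inlined) call.
    static double totalArea(Shape[] shapes) {
        double total = 0;
        for (Shape s : shapes) {
            total += s.area();
        }
        return total;
    }

    static double run() {
        // Phase 1: the call site only ever sees Square.
        Shape[] squares = new Shape[1000];
        Arrays.fill(squares, new Square(2));
        double total = 0;
        for (int i = 0; i < 200; i++) {
            total += totalArea(squares);
        }
        // Phase 2: a second receiver type appears. Code compiled under the
        // single-type assumption may be discarded and recompiled here.
        Shape[] mixed = { new Square(2), new Rect(2, 3) };
        total += totalArea(mixed);
        return total;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The result is the same either way; only the machine code executed for `totalArea` may change between the two phases. A static compiler without whole-program knowledge has to keep the virtual call, because it cannot take the optimization back later.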

There's an old but likely still valid overview in this article.
The highlights seem to be performing classical optimizations based on available runtime profiling information:
JITting "hot spots" into native code
Adaptive inlining – inlining the most commonly called implementations for a given method dispatch to avoid a huge code size
And some minor ones like generational GC which makes allocating short lived objects cheaper, and various other smaller optimizations, plus whatever else was added since that article was published.
There's also a more detailed official whitepaper, and a fairly nitty-gritty HotSpot Internals wiki page that lists how to write fast Java code that should let you extrapolate what use cases were optimized.

Jumps to equivalent native machine code instead of JVM interpretation of the op-codes. The lack of a need to simulate a machine (the JVM) in machine code for a heavily used part of a Java application (which is the equivalent of an extension of the JVM) provides a good speed increase.
Of course, that's most of what HotSpot is.

Why is it important that Java (and other JVM languages) is highly portable? [duplicate]

I've been thinking about it lately, and it seems to me that most advantages given to JIT compilation should more or less be attributed to the intermediate format instead, and that jitting in itself is not much of a good way to generate code.
So these are the main pro-JIT compilation arguments I usually hear:
Just-in-time compilation allows for greater portability. Isn't that attributable to the intermediate format? I mean, nothing keeps you from compiling your virtual bytecode into native bytecode once you've got it on your machine. Portability is an issue in the 'distribution' phase, not during the 'running' phase.
Okay, then what about generating code at runtime? Well, the same applies. Nothing keeps you from integrating a just-in-time compiler for a real just-in-time need into your native program.
But the runtime compiles it to native code just once anyways, and stores the resulting executable in some sort of cache somewhere on your hard drive. Yeah, sure. But it's optimized your program under time constraints, and it's not making it better from there on. See the next paragraph.
It's not as if ahead-of-time compilation has no advantages either. Just-in-time compilation has time constraints: you can't keep the end user waiting forever while your program launches, so it has to make a tradeoff somewhere. Most of the time JIT compilers just optimize less. A friend of mine had profiling evidence that inlining functions and unrolling loops "manually" (obfuscating source code in the process) had a positive impact on the performance of his C# number-crunching program; doing the same on my side, with my C program performing the same task, yielded no positive results, and I believe this is due to the extensive transformations my compiler was allowed to make.
And yet we're surrounded by jitted programs. C# and Java are everywhere, Python scripts can compile to some sort of bytecode, and I'm sure a whole bunch of other programming languages do the same. There must be a good reason that I'm missing. So what makes just-in-time compilation so superior to ahead-of-time compilation?
EDIT To clear some confusion, maybe it would be important to state that I'm all for an intermediate representation of executables. This has a lot of advantages (and really, most arguments for just-in-time compilation are actually arguments for an intermediate representation). My question is about how they should be compiled to native code.
Most runtimes (or compilers for that matter) will prefer to either compile them just-in-time or ahead-of-time. As ahead-of-time compilation looks like a better alternative to me because the compiler has more time to perform optimizations, I'm wondering why Microsoft, Sun and all the others are going the other way around. I'm kind of dubious about profiling-related optimizations, as my experience with just-in-time compiled programs displayed poor basic optimizations.
I used an example with C code only because I needed an example of ahead-of-time compilation versus just-in-time compilation. The fact that C code wasn't emitted to an intermediate representation is irrelevant to the situation, as I just needed to show that ahead-of-time compilation can yield better immediate results.

Greater portability: The deliverable (byte-code) stays portable.
At the same time, more platform-specific: Because the JIT-compilation takes place on the same system that the code runs on, it can be very, very fine-tuned for that particular system. If you do ahead-of-time compilation (and still want to ship the same package to everyone), you have to compromise.
Improvements in compiler technology can have an impact on existing programs. A better C compiler does not help you at all with programs already deployed. A better JIT-compiler will improve the performance of existing programs. The Java code you wrote ten years ago will run faster today.
Adapting to run-time metrics. A JIT-compiler can not only look at the code and the target system, but also at how the code is used. It can instrument the running code, and make decisions about how to optimize according to, for example, what values the method parameters usually happen to have.
You are right that JIT adds to start-up cost, and so there is a time constraint on it, whereas ahead-of-time compilation can take all the time that it wants. This makes JIT compilation more appropriate for server-type applications, where start-up time is not so important and a "warm-up phase" before the code gets really fast is acceptable.
I suppose it would be possible to store the result of a JIT compilation somewhere, so that it could be re-used the next time. That would give you "ahead-of-time" compilation for the second program run. Maybe the clever folks at Sun and Microsoft are of the opinion that a fresh JIT is already good enough and the extra complexity is not worth the trouble.

The ngen tool page spilled the beans (or at least provided a good comparison of native images versus JIT-compiled images). Executables that are compiled ahead-of-time typically have the following benefits:
Native images load faster because they have less startup work to do, and they need a fixed amount less memory (the memory the JIT compiler would otherwise require);
Native images can share library code, while JIT-compiled images cannot.
Just-in-time compiled executables typically have the upper hand in these cases:
Native images are larger than their bytecode counterpart;
Native images must be regenerated whenever the original assembly or one of its dependencies is modified.
The need to regenerate an ahead-of-time compiled image every time one of its components changes is a huge disadvantage for native images. On the other hand, the fact that JIT-compiled images can't share library code can cause a serious memory hit. The operating system can load any native library at one physical location and share the immutable parts of it with every process that wants to use it, leading to significant memory savings, especially with system frameworks that virtually every program uses. (I imagine that this is somewhat offset by the fact that JIT-compiled programs only compile what they actually use.)
The general consideration of Microsoft on the matter is that large applications typically benefit from being compiled ahead-of-time, while small ones generally don't.

Simple logic tells us that compiling a huge, MS Office-sized program, even from byte-code, will simply take too much time. You'll end up with a huge start-up time, and that will scare anyone off your product. Sure, you can precompile during installation, but this also has consequences.
Another reason is that not all parts of the application will be used. A JIT will compile only those parts that the user cares about, leaving potentially 80% of the code untouched and saving time and memory.
And finally, JIT compilation can apply optimizations that normal compilers can't, like inlining virtual methods, or parts of methods with trace trees. In theory, this can make JIT-compiled programs faster.

Better reflection support. This could be done in principle in an ahead-of-time compiled program, but it almost never seems to happen in practice.
Optimizations that can often only be figured out by observing the program dynamically. For example, inlining virtual functions, escape analysis to turn heap allocations into stack allocations, and lock coarsening.
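
As a rough illustration of the escape analysis case, the sketch below allocates short-lived objects that never leave the method. `Vec` is a made-up value holder. Whether the allocations are actually removed depends on the JVM and on inlining decisions, so treat this as the shape of code that benefits, not as a guarantee.

```java
final class EscapeDemo {

    // Small immutable value holder; hypothetical, for illustration only.
    static final class Vec {
        final double x, y;

        Vec(double x, double y) {
            this.x = x;
            this.y = y;
        }

        Vec plus(Vec o) {
            return new Vec(x + o.x, y + o.y);
        }

        double lengthSquared() {
            return x * x + y * y;
        }
    }

    // Each iteration creates three Vec objects. None of them is stored in a
    // field, returned, or passed to code outside this class, so they do not
    // "escape". After inlining, a JIT may keep x and y in registers and skip
    // the heap allocations altogether.
    static double sumOfLengths(int n) {
        double total = 0;
        for (int i = 0; i < n; i++) {
            Vec v = new Vec(i, 1).plus(new Vec(1, i));
            total += v.lengthSquared();
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sumOfLengths(3)); // prints 28.0
    }
}
```

On HotSpot the effect can be compared by running the same program with `-XX:-DoEscapeAnalysis` and watching allocation rates or GC activity.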

Maybe it has to do with the modern approach to programming. You know, many years ago you would write your program on a sheet of paper, some other people would transform it into a stack of punched cards and feed into THE computer, and tomorrow morning you would get a crash dump on a roll of paper weighing half a pound. All that forced you to think a lot before writing the first line of code.
Those days are long gone. When using a scripting language such as PHP or JavaScript, you can test any change immediately. That's not the case with Java, though appservers give you hot deployment. So it is just very handy that Java programs can be compiled fast, as bytecode compilers are pretty straightforward.
But, there is no such thing as JIT-only languages. Ahead-of-time compilers have been available for Java for quite some time, and more recently Mono introduced it to CLR. In fact, MonoTouch is possible at all because of AOT compilation, as non-native apps are prohibited in Apple's app store.

I have been trying to understand this as well because I saw that Google is moving towards replacing their Dalvik Virtual Machine (essentially another Java Virtual Machine like HotSpot) with the Android Runtime (ART), which is an AOT compiler, whereas Java usually uses HotSpot, which is a JIT compiler. Apparently, ART is ~2x faster than Dalvik... so I thought to myself "why doesn't Java use AOT as well?".
Anyway, from what I can gather, the main difference is that a JIT uses adaptive optimization at run time, which (for example) allows ONLY those parts of the bytecode that are executed frequently to be compiled into native code, whereas AOT compiles the entire program into native code, and a smaller amount of native code tends to run faster than a larger amount.
I have to imagine that most Android apps are composed of a small amount of code, so on average it makes more sense to compile the entire program to native code AOT and avoid the overhead associated with interpretation / optimization.

It seems that this idea has been implemented in Dart language:
https://hackernoon.com/why-flutter-uses-dart-dd635a054ebf
JIT compilation is used during development, using a compiler that is especially fast. Then, when an app is ready for release, it is compiled AOT. Consequently, with the help of advanced tooling and compilers, Dart can deliver the best of both worlds: extremely fast development cycles, and fast execution and startup times.

One advantage of JIT which I don't see listed here is the ability to inline/optimize across separate assemblies/dlls/jars (for simplicity I'm just going to use "assemblies" from here on out).
If your application references assemblies which might change after install (e. g. pre-installed libraries, framework libraries, plugins), then a "compile-on-install" model must refrain from inlining methods across assembly boundaries. Otherwise, when the referenced assembly is updated we would have to find all such inlined bits of code in referencing assemblies on the system and replace them with the updated code.
In a JIT model, we can freely inline across assemblies because we only care about generating valid machine code for a single run during which the underlying code isn't changing.

The difference between platform-browser-dynamic and platform-browser is the way your Angular app will be compiled.
Using the dynamic platform makes Angular send the Just-in-Time compiler to the front-end along with your application, which means your application is compiled on the client side.
On the other hand, using platform-browser leads to an Ahead-of-Time pre-compiled version of your application being sent to the browser, which usually means a significantly smaller package.
The angular2-documentation for bootstrapping at https://angular.io/docs/ts/latest/guide/ngmodule.html#!#bootstrap explains it in more detail.

A concrete Example of the effect of the JIT in java

So I am aware that Java has just-in-time compilation (the JIT), which gives it an advantage over statically compiled languages like C++. Are there any examples illustrating the Java JIT? Possible examples could be outperforming C or C++ code for a given algorithm, or showing an algorithm's iterations getting faster over time (I am unsure if that would be an instance of the JIT), or any example that shows some sort of measurement of the JIT at work. I ask this question because I have only ever read about the JIT and wish to prove its existence as opposed to just believing in it like some sort of religious God.
Remark - If this question is too opinionated please comment and let me know why. I am just curious about the JIT and, after using Java for a few years, am still unaware of how I benefit from it and whether it lives up to the hype of outperforming its statically compiled counterparts.
Additional Information - I have read about when it kicks in, and I am not looking for more information that I just have to believe is true; I want to see something that shows it doing what it is supposed to do.
EDIT - Good that I have a lot of responses. What has been said is that comparing the speed of JIT-optimised Java against C++ is not a good approach, and that a pure Java comparison would be the least horrible. What about an example showing this with Java:
So a JIT-optimised and a non-JIT-optimised program doing the same work are executed. At the start the JIT has not kicked in, and the program begins getting quicker whilst the static one always has the same performance. Then the conditions change at 5.5 seconds or so and the application is used slightly differently. The JIT is able to adapt to these changes: first the time spikes, and then it begins optimising again and can even reach a better optimum because the application is being used slightly differently.
Would this be an acceptable example to show a JIT? (I will endeavour to achieve this and review everyone's links and videos.)

I do not think you can convincingly prove that java using JIT is faster than C/C++ statically compiled code.
You could find some code in Java that beats its C/C++ implementation. For that you need to search for keywords like (benchmark, Java, JIT, C, C++).
I have purposely not mentioned any code or links for the above because of my point below.
Most of the time people show Java code beating statically compiled C/C++ in the following ways:
Find a part where Java is fast compared to C/C++ (memory allocation) and write code only to highlight it.
Find weak points of C/C++ code and write Java code that beats the C/C++ code in achieving the result.
Run the code in an environment where you have an advantage, like fast hardware and a good amount of memory.
My point is that you are trying to find an exception where Java is faster than C/C++ and then generalizing it to the whole language. You could easily find more examples of C/C++ beating Java code just by using pointers in many algorithms.
Such benchmark testing is of no value in real-life application development.
Summarizing (in real-life application development):
Java was slow compared to C/C++ when it first came out. But in the past decade improvements made in the JVM, coupled with JIT, HotSpot etc., have made Java as good as C/C++. Java is not slow nowadays. But I would not call it faster than C/C++. Any difference in real-life application development is negligible because of language improvements as well as better hardware.
You cannot generalize that Java is faster than C/C++ by beating it one time in a particular environment with a particular algorithm or code.
You might find some interesting info in the following links
https://softwareengineering.stackexchange.com/questions/110634/why-would-it-ever-be-possible-for-java-to-be-faster-than-c
Is Java really slow?
Since the question has been edited to ask about the performance improvement from using the JIT, I am editing my answer to add a few more points.
My understanding of the JIT is that it compiles the most frequently executed code into a version that runs much faster. Most of the examples of JIT optimisation techniques I have come across show transformations that could also be done by the programmer but would then hurt the readability of the program or might not conform to the framework or coding style the programmer has to use.
So what I am trying to say is that if you write a program that can be improved by the JIT, it will do so and you will see an increase in performance. But if you understand the JVM and write Java code that is already optimized, the JIT may not give you much benefit.
So in effect, a performance improvement seen when running one program with the JIT is not guaranteed for all Java programs. It depends on the program.
These links below show some JIT improvements using code examples.
http://www.infoq.com/articles/Java-Application-Hostile-to-JIT-Compilation
https://plumbr.eu/blog/java/do-you-get-just-in-time-compilation
Anyway, if we need to measure the performance difference made by the JIT, we would run a Java program with the JIT enabled and run the same program again with the JIT disabled.
This link http://www.javacodegeeks.com/2013/07/java-just-in-time-compilation-more-than-just-a-buzzword.html has a case study on this topic and recommends the following:
Assessing the JIT benefits for your application
In order to understand the impact of not using JIT for your Java application, I recommend that you perform the following experiment:
Generate load to your application with JIT enabled and capture some baseline data such as CPU %, response time, # requests etc
Disable JIT
Redo the same testing and compare the results.
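
The experiment above can be tried on a toy scale with a sketch like the following. The class name and workload are made up; the only JVM-specific assumption is that on HotSpot `java -Xint` runs in interpreter-only mode, which is one way to "disable JIT". Timings vary by machine, so the sketch prints them instead of promising numbers.

```java
final class WarmupDemo {

    // Deterministic busy work. The result is returned and later printed,
    // so the compiler cannot discard the loop as dead code.
    static long work(int n) {
        long acc = 0;
        for (int i = 0; i < n; i++) {
            acc += (i ^ (i >>> 3)) % 7;
        }
        return acc;
    }

    // Runs the same work several times and records nanoseconds per round.
    static long[] timeRounds(int rounds, int n) {
        long[] nanos = new long[rounds];
        long checksum = 0;
        for (int r = 0; r < rounds; r++) {
            long start = System.nanoTime();
            checksum += work(n);
            nanos[r] = System.nanoTime() - start;
        }
        System.out.println("checksum = " + checksum);
        return nanos;
    }

    public static void main(String[] args) {
        long[] nanos = timeRounds(10, 5_000_000);
        for (int r = 0; r < nanos.length; r++) {
            System.out.printf("round %d: %d us%n", r, nanos[r] / 1_000);
        }
    }
}
```

Run it once normally and once with `-Xint`. With the JIT enabled, the first rounds are typically slower than the later ones; in interpreter-only mode all rounds tend to stay at roughly the same, slower, level. Adding `-XX:+PrintCompilation` to the normal run lists the methods as they get compiled.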
This link http://people.cse.iitd.ac.in/~sbansal/csl862-virt/readings/CompileJava97.pdf does benchmark JIT and shows speed improvements over basic JVM interpretations.
To understand what JIT does to your code , you could use the tool JITwatch.
https://github.com/AdoptOpenJDK/jitwatch
The links below explain its utility.
http://www.oracle.com/technetwork/articles/java/architect-evans-pt1-2266278.html
http://zeroturnaround.com/rebellabs/why-it-rocks-to-finally-understand-java-jit-with-jitwatch/

First, you want to watch this video. It gives you tools to see the JIT in action.
Where I believe your question goes wrong is that you are asking for an example of tailored code where you could potentially measure faster performance in some JVM-based language X vs some non-JVM-based language Y (where, for instance, X is Java and Y is C).
This is not the way to think about the JIT. You only need to delve that deep into the details if you are writing a compiler for a JVM language yourself, or have to debug a serious performance issue, and then only after you have tried refactoring your code and seen that fail.
Otherwise, the principle is simple: the JIT is your friend and it does things right; all you have to do is write code which just works; if there are ways that the JIT can make it faster at runtime, it will most certainly do so.

There are countless examples on Stack Overflow of questions like "why is my code running faster all of a sudden?" - usually when people try to benchmark their code. The answer is, invariably, because the JIT was able to make optimizations mid-benchmark.
See: How do I write a correct micro-benchmark in Java?, What is going on in this java benchmark?, and Java benchmarking - why is the second loop faster? for some examples.
I have only ever read about the JIT and wish to prove its existence as opposed to just believing in it like some sort of religious God.
This is an unnecessary line of thinking; there's a lot going on between your keyboard and your monitor that you've never noticed or don't understand. The JIT is documented behavior of the JVM, that's all you need to know. It's fine if you don't understand it and want to learn more, but it's not some mythical, ethereal construct.

JIT (Just-In-Time) compilation is a kind of compilation that is done on the bytecode shortly before it is executed. From the Oracle site:
"In theory, the JIT comes into use whenever a Java method is called, and it compiles the bytecode of that method into native machine code, thereby compiling it “just in time” to execute"
The most reliable way to see the effect of the JIT is to compare Java with itself, with and without the JIT.
JIT compilation was introduced in Java 1.2, so the best approach is to execute the same code with Java 1.1 and Java 1.2 and compare the performance.
Prior to Java 1.2, Java was considered a very slow language, and only after the introduction of the JIT did it come to be used extensively in every field.
It is difficult, on the other hand, to compare C++ or C with Java. Potentially C++ is faster than Java, because even with the JIT Java starts out as an interpreted language. JIT compilation helps because the code that is executed most often is translated to native code once instead of being interpreted each time it is executed.
Differences between Java and C++ can involve how libraries are designed, the presence or absence of certain primitive types, how code is compiled, the level of optimization, in the case of Java how the GC is configured, and so on.
Note that there can also be differences between two runs of Java itself, even with the same JDK and the same JVM, depending on compilation parameters and execution parameters.
It is not possible to say that Java is faster than C or vice versa; too many parameters are involved in this kind of comparison. Sometimes C++ is faster, sometimes Java is.
Here is a reference from Oracle on JIT compilation: http://docs.oracle.com/cd/E13150_01/jrockit_jvm/jrockit/geninfo/diagnos/underst_jit.html

What JVM synchronization practices can I ignore assuming I know I will run on x64 cpus?

I know that the JVM memory model is made for lowest common denominator of CPUs, so it has to assume the weakest possible model of a cpu on which the JVM can run (eg ARM).
Now, considering that x64 has a fairly strong memory model, what synchronization practices can I ignore assuming I know my program will only run on 64bit x86 CPUs? Also does this apply when my program is being run through virtualization?
Example:
It is known that the JVM's memory model requires synchronizing read/write access to longs and doubles, but one can assume that reads/writes of 32-bit primitives like int, float etc. are atomic.
However, if I know that I am running on a 64-bit x86 machine, can I skip using locks on longs/doubles, knowing that the CPU will atomically read/write 64-bit values, and just keep them volatile (like I would with ints/floats)?

I know that the JVM memory model is made for lowest common denominator of CPUs, so it has to assume the weakest possible model of a cpu on which the JVM can run (eg ARM).
That's not correct. The JMM resulted from a compromise among a variety of competing forces: the desire for a weaker memory model so that programs can go faster on hardware that have weak memory models; the desire of compiler writers who want certain optimizations to be allowed; and the desire for the result of parallel Java programs to be correct and predictable, and if possible(!) understandable to Java programmers. See Sarita Adve's CACM article for a general overview of memory model issues.
Considering that x64 has a fairly strong memory model, what synchronization practices can I ignore assuming I know my program will only run on [x64] CPUs?
None. The issue is that the memory model applies not only to the underlying hardware, but it also applies to the JVM that's executing your program, and mostly in practice, the JVM's JIT compiler. The compiler might decide to apply certain optimizations that are allowed within the memory model, but if your program is making unwarranted assumptions about the memory behavior based on the underlying hardware, your program will break.
You asked about x64 and atomic 64-bit writes. It may be that no word tearing will ever occur on an x64 machine. I doubt that any JIT compiler would tear a 64-bit value into 32-bit writes as an optimization, but you never know. However, it seems unlikely that you could use this feature to avoid synchronization or volatile fields in your program. Without these, writes to these variables might never become visible to other threads, or they could arbitrarily be re-ordered with respect to other writes, possibly leading to bugs in your program.
My advice is first to apply synchronization properly to get your program correct. You might be pleasantly surprised. The synchronization operations have been heavily optimized and can be very fast in the common case. If you find there are bottlenecks, consider using optimizations like lock splitting, the use of volatiles, or converting to non-blocking algorithms.
UPDATE
The OP has updated the question to be a bit more specific about using volatile instead of locks and synchronization.
It turns out that volatile not only has memory visibility semantics. It also makes long and double access atomic, which is not the case for non-volatile variables of those types. See the JLS section 17.7. You should be able to rely on volatile to provide atomicity on any hardware, not just x64.
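
A small sketch of what that guarantee looks like in code. The class name is made up. The writer alternates between two bit patterns whose upper and lower halves agree, so any other value seen by the reader would be a torn read. With `volatile` the JLS rules that out; the check is there to show the shape of such a test, not to prove the absence of tearing on a particular machine.

```java
final class VolatileLongDemo {

    // volatile makes reads and writes of this long atomic (JLS 17.7)
    // and also makes each write visible to other threads.
    private volatile long value = 0L;

    // One thread writes 0L and -1L alternately while the calling thread
    // reads. A torn read would show up as a mix of the two halves.
    static int countTornReads(int iterations) {
        VolatileLongDemo shared = new VolatileLongDemo();
        Thread writer = new Thread(() -> {
            for (int i = 0; i < iterations; i++) {
                shared.value = (i & 1) == 0 ? 0L : -1L;
            }
        });
        writer.start();

        int torn = 0;
        while (writer.isAlive()) {
            long v = shared.value;
            if (v != 0L && v != -1L) {
                torn++;
            }
        }
        try {
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return torn;
    }

    public static void main(String[] args) {
        System.out.println("torn reads: " + countTornReads(1_000_000));
    }
}
```

Removing `volatile` from the field does not make the program fail on typical 64-bit JVMs, which is exactly the trap: the code then works by accident of the platform and no longer by specification, and the reader also loses the visibility guarantee.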
While I'm at it, for additional information about the Java Memory Model, see Aleksey Shipilev's JMM Pragmatics talk transcript. (Aleksey is also the JMH guy.) There's lots of detail in this talk, and some interesting exercises to test one's understanding. One overall takeaway of the talk is that it's often a mistake to rely on one's intuition about how the memory model works, e.g. in terms of cache lines or write buffers. The JMM is a formalism about memory operations and various constraints (synchronizes-with, happens-before, etc.) that determine ordering of those operations. This can have quite counterintuitive results. It's unwise to try to outsmart the JMM by thinking about specific hardware properties. It'll come back to bite you.

You would still need to handle thread-safety, so volatility semantics and memory fences will still matter.
What I mean here is, e.g. in Oracle Java, most low-level sync operations end up in Unsafe (docjar.com/docs/api/sun/misc/Unsafe.html#getUnsafe), which in turn has a long list of native methods. So in the end, those synchronization practices and lots of other low-level operations are encapsulated by the JVM, where they belong. The x64 JVM is not the same as the x86 JVM.
After reading your edited question again: the atomicity of your load/store operations was the topic here. So no, you don't have to worry about atomic 64-bit loads/stores on x64. But since this is not the end of all sync issues, see the other answers.

Always include the memory barriers where the JVM memory model states that they are needed and then let the JVM optimize them when it can for different platforms.
Knowing that you run only on x86 CPUs does not mean that you can drop the memory barriers. Unless perhaps you know that you will only run on single-core x86 CPUs ;) Which, in today's multi-core world, nobody really knows.
Why? Because the Java memory model has two main concerns:
visibility of data between cores, and
happens-before guarantees, aka re-ordering.
Without a memory barrier in play, the order of operations that become visible to other cores could become very confusing, and that is even with the stronger guarantees offered by x86. x86 only ensures consistency once the data makes it to the CPU caches, and while its ordering guarantees are very strong, they only kick in once HotSpot has told the CPU to write out to the cache.
Without volatile/synchronized, it is up to the compilers (javac and HotSpot) when they perform those writes and in what order. It is perfectly valid for them to decide to keep data in registers for extended periods. Once a volatile or synchronized memory barrier is crossed, the JVM knows to tell the CPU to send the data out to the cache.
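
The register point is easiest to see with a stop flag. The sketch below uses a hypothetical `StopFlagDemo` class. With `volatile` on the flag, the spinning thread is guaranteed to see the write. Without it, a JIT is allowed to read the flag once and keep the loop running forever, on x86 as much as anywhere else.

```java
final class StopFlagDemo {

    // Without volatile, the compiled loop below may legally read this field
    // once, keep the value in a register, and never notice the update.
    private volatile boolean stop = false;

    private long spins = 0;

    void spinUntilStopped() {
        while (!stop) {
            spins++;
        }
    }

    // Starts a spinning worker, sets the flag from the calling thread, and
    // reports whether the worker terminated within the timeout.
    static boolean workerStopsWithin(long timeoutMillis) {
        StopFlagDemo demo = new StopFlagDemo();
        Thread worker = new Thread(demo::spinUntilStopped);
        worker.setDaemon(true); // do not keep the JVM alive if it never stops
        worker.start();
        try {
            Thread.sleep(50);   // give the loop time to become hot
            demo.stop = true;
            worker.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !worker.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("worker stopped: " + workerStopsWithin(5_000));
    }
}
```

This failure mode comes from the compiler, not from the CPU's memory model, which is why knowing the target hardware does not help.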
As Doug Lea documents in the JSR-133 Cookbook most of the x86 barriers are reduced to no-op instructions that guarantee the ordering. Thus the JVM will make the instructions as efficient as possible for us. Code to the Java Memory Model, and let Hotspot work its magic. If Hotspot can prove that synchronised is not required, it can drop it entirely.
Lastly, the double-checked locking pattern was proven to be broken on multi-core x86 too, despite its stronger memory guarantees. Some nice detail on this was written by Bartosz Milewski on his C++ blog, and again, this time specific to Java, here
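
For reference, a minimal sketch of the variant that is considered safe under the memory model introduced with Java 5 (JSR-133): the field is declared `volatile`. The class `LazyConfig` and its contents are made up for illustration; the broken form is the same code without `volatile`.

```java
final class LazyConfig {

    // volatile is what repairs double-checked locking: it orders the writes
    // done in the constructor before the publication of the reference.
    private static volatile LazyConfig instance;

    private final String name;

    private LazyConfig(String name) {
        this.name = name;
    }

    String name() {
        return name;
    }

    static LazyConfig getInstance() {
        LazyConfig local = instance;       // first check, no lock taken
        if (local == null) {
            synchronized (LazyConfig.class) {
                local = instance;          // second check, under the lock
                if (local == null) {
                    local = new LazyConfig("default");
                    instance = local;
                }
            }
        }
        return local;
    }

    public static void main(String[] args) {
        System.out.println(getInstance() == getInstance()); // prints true
    }
}
```

Reading the field into `local` once keeps the common path to a single volatile read. Where a plain static singleton is enough, a static holder class avoids the pattern altogether.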

Compiler writers have already taken care of what you want to do.
Many of the volatile read/write barriers will eventually be no-ops on x64.
Also keep in mind that reordering may be induced by compiler optimizations and need not depend on the hardware. One example is benign data races, such as the one in String hashCode.
See: http://jeremymanson.blogspot.com/2008/12/benign-data-races-in-java.html
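
A sketch of that idiom, written against a made-up `Name` class and not copied from the JDK. The cached hash is deliberately not volatile. What keeps the race benign is that the field is read exactly once into a local variable and the local is what gets returned; a version that reads the field a second time can be reordered by the compiler into returning 0.

```java
import java.util.Arrays;

final class Name {

    private final char[] chars;

    // Deliberately not volatile. 0 means "not computed yet".
    private int hash;

    Name(String s) {
        this.chars = s.toCharArray();
    }

    @Override
    public int hashCode() {
        int h = hash;            // read the racy field exactly once
        if (h == 0) {
            for (char c : chars) {
                h = 31 * h + c;
            }
            hash = h;            // racy write; every thread computes the same value
        }
        return h;                // return the local, never re-read the field
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof Name && Arrays.equals(chars, ((Name) o).chars);
    }

    public static void main(String[] args) {
        System.out.println(new Name("jit").hashCode() == "jit".hashCode()); // prints true
    }
}
```

The point for the x64 question is that the safety argument here is about what the compiler may do with two reads of the same field, and nothing in it depends on the CPU.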
Also, please refer to the following page for which instructions may be no-ops on x64.
See: http://gee.cs.oswego.edu/dl/jmm/cookbook.html (the Multiprocessors section).
I would advise against doing any hardware-specific optimizations. You may end up writing unmaintainable code. Compiler writers have already put in sufficient hard work.

It depends not only on the CPU, but also on the JVM, the operating system, etc.
One thing you can be sure of: don't assume anything when it comes to thread synchronization.

Performance in Java through code? [closed]

First of all I should mention that I'm aware of the fact that performance optimizations can be very project specific. I'm mostly not facing these special issues right now. I'm facing a bunch of performance issues with the JVM itself.
I wonder now:
Which code optimizations make sense from a compiler perspective? For example, to support the garbage collector I declared variables as final, very much following PMD's suggestions here from Eclipse.
What best practices are there for vmargs, heap and other stuff passed to the JVM for initialization? How do I get the right values here? Is there any formula or is it trial and error?
Java automates a lot and does many optimizations at the bytecode level. However, I think most of that must be planned by a developer in order to work.
So how do you speed up your programs in Java? :)

Which code-optimization make sense from a compiler perspective: for example to support the garbage collector I declared variables as final - very much following PMD's suggestions here from Eclipse.
Assuming you are talking about potential micro-optimizations you can make to your code, the answer is pretty much none. The best way to increase your application performance is to run a profiler to figure out where the performance bottlenecks are, then figure out if there is anything you can do to speed them up.
All of the classic tricks like declaring classes, variables and methods final, reorganizing loops, changing primitive types are pretty much a waste of effort in most cases. The JIT compiler can typically do a much better job than you can. For example, recent JIT compilers will analyse all loaded classes to figure out which method calls are not subject to overriding, without you declaring the classes or methods as final. It will then use a quicker call sequence, or even inline the method body.
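As an illustration of the kind of call site this is about, here is a small sketch with made-up types. Nothing is declared final, yet as long as Square is the only loaded implementation of Shape, the JIT can in principle treat s.area() as a direct call and inline it. Whether it does so in a given run depends on the JVM version and flags; the code only shows the shape of the situation, it does not prove the optimization happens.

```java
// Hypothetical types for illustration; none of them is declared final.
interface Shape {
    double area();
}

class Square implements Shape {
    private final double side;

    Square(double side) {
        this.side = side;
    }

    public double area() {
        return side * side;
    }
}

class AreaSum {
    // The interface call below is a candidate for devirtualization and inlining
    // while Square remains the only implementation that has been loaded.
    static double total(Shape[] shapes) {
        double sum = 0;
        for (Shape s : shapes) {
            sum += s.area();
        }
        return sum;
    }

    public static void main(String[] args) {
        Shape[] shapes = { new Square(2.0), new Square(2.0), new Square(2.0) };
        System.out.println(total(shapes));
    }
}
```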
Indeed, the Sun experts say that some programmer attempts at optimization fail because they actually make it harder for the JIT compiler to apply the optimizations it knows about.
On the other hand, higher level algorithmic optimizations are definitely worthwhile ... provided that your profiler tells you that your application is spending a significant amount of time in that area of the code.
Using arrays instead of collections can be a worthwhile optimization in unusual cases, and in rare cases using object pools might be too. But these optimizations 1) will make your code more complicated and bug prone and 2) can slow your application down if used inappropriately. These kinds of optimizations should only be tried as a last resort. For example, if your profiling says that such and such a HashMap<Integer,Integer> is a CPU bottleneck or a memory hog, then it is a better idea to look for an existing specialized Map or Map-like library class than to try and implement the map yourself using arrays. In other words, optimize at the high level.
If you spend long enough or your application is small enough, careful micro-optimization will probably give you a faster application (on a given JVM version / hardware platform) than just relying on the JIT compiler. If you are implementing a smallish application to do large-scale number crunching in Java, the pay-off of micro-optimization may well be considerable. But this is clearly not a typical case! For typical Java applications, the effort is large enough and the performance difference is small enough that micro-optimization is not worthwhile.
(Incidentally, I don't see how declaring a variable final can make any possible difference to GC performance. The GC has to trace a variable every time it is encountered whether or not it is final. Besides, it is an open secret that final variables can actually change under certain circumstances, so it would be unsafe for the GC to assume that they don't. Unsafe as in "creates a dangling pointer resulting in a JVM crash".)

I see this a lot. The sequence generally goes:
Thinking performance is about compiler optimizations, big-O, and so on.
Designing software using the recommended ideas, lots of classes, two-way linked lists, trees with pointers up, down, left, and right, hash sets, dictionaries, properties that invoke other properties, event handlers that invoke other event handlers, XML writing, parsing, zipping and unzipping, etc. etc.
Since all those data structures were like O(1) and the compiler's optimizing its guts out, the app should be "efficient", right? Well, then, what's that little voice telling one that the startup is slow, the shutdown is slow, the loading and unloading could be faster, and why is the UI so sluggish?
Hand it off to the "performance expert". With luck, that person finds out, all this stuff is done in the recommended way, but that's why it's cranking its heart out. It's doing all that stuff because it's the recommended way to do things, not because it's needed.
With luck, one has the chance to re-engineer some of that stuff, to make it simple, and gradually remove the "bottlenecks". I say, "with luck" because often it's just not possible, so development relies on the next generation of faster processors to take away the pain.
This happens in every language, but more so in Java, C#, C++, where abstraction has been carried to extremes. So by all means, be aware of best practices, but also understand what simple software is. Typically it consists of saving those best practices for the circumstances that really need them.

which code-optimization make sense from a compiler perspective?
All the ones that a compiler can't reason about, because a compiler is very dumb and Java doesn't have "design by contract" (which, hence, cannot help the dumb compiler reason about your code).
For example, if you're crunching data and using int[] or long[] arrays, you may know something about your data that is IMPOSSIBLE for the compiler to figure out, and you may use low-level bit-packing/compacting to improve the locality of reference in that part of your code.
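A minimal sketch of what such bit-packing can look like, assuming (purely for illustration) that every value is known to fit in 4 bits. Sixteen values then share one long, so far more of them fit in a cache line than with an int[]. The class name and the 4-bit assumption are mine, not something from a real code base.

```java
// Hypothetical packed array: stores values in the range 0..15, sixteen per long.
final class PackedNibbles {
    private final long[] words;

    PackedNibbles(int size) {
        words = new long[(size + 15) / 16];
    }

    void set(int index, int value) {
        if (value < 0 || value > 15) {
            throw new IllegalArgumentException("value must fit in 4 bits: " + value);
        }
        int word = index >>> 4;           // index / 16
        int shift = (index & 15) << 2;    // (index % 16) * 4 bits
        // Clear the old 4 bits, then OR in the new value.
        words[word] = (words[word] & ~(0xFL << shift)) | ((long) value << shift);
    }

    int get(int index) {
        int shift = (index & 15) << 2;
        return (int) ((words[index >>> 4] >>> shift) & 0xFL);
    }

    public static void main(String[] args) {
        PackedNibbles p = new PackedNibbles(32);
        p.set(17, 9);
        System.out.println(p.get(17));
    }
}
```

The trade-off is extra shifting and masking on every access, so it only pays when memory bandwidth or cache misses dominate. Measure before and after.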
Been there, done that, saw gigantic speedup. So much for the "super smart compiler".
This is just one example. There are a huge number of cases like this.
Remember that a compiler is really stupid: it cannot know that if ( Math.abs(42) > 0 ) will always be true.
This should give some food for thought to people who think that those compilers are "smart" (things would be different here if Java had DbC, but it doesn't).
what best practices there are for: vmargs, heap and other stuff passed to the JVM for initialization. How do I get the right values here? Is there any formula or is it trial and error?
The real answer is: there shouldn't be. Sadly the situation is so pathetic that such low-level hackery is needed, due to serious failure on Java's part. Oh, one more "tiny" detail: playing with VM fine-tuning only works for server-side apps. It doesn't work for desktop apps.
Anyone who has worked on Java desktop applications installed on hundreds or thousands of machines, on various OSes, knows all too well what the issue is: full GC pauses making your app look like it's broken. The Apple VM on OS X 10.4 comes to mind for being particularly awful, but ALL the JVMs are subject to that issue.
What is worse: it is impossible to "fine tune" the GC's parameters across different OSes / VMs / memory configurations when your application is going to be run on hundreds/thousands of different configurations.
Anyone disputing that: please tell me how you "fine tune" your app knowing that it is going to be run both on octo-core Macs loaded with 20 GB of RAM (I've got users with such setups) and old OS X 10.4 PowerBooks that have 768 MB of RAM. Please?
But it is not bad: you should not, in the first place, have to be concerned with super-low-level detail like GC "fine tuning". The very fact that this is hinted at is a testimony to one area where Java has a major issue.
Java fans will keep on saying "the GC is super fast, object creation is cheap" while this is blatantly wrong. There's a reason why Trove's TIntIntHashMap runs circles around a HashMap<Integer,Integer>.
There's also a reason why at every new JVM release you'll get countless release notes explaining why -XXGCHyperSteroidMultiTopNotch offers better performance than the last "big JVM param" that every cool Java programmer had to know: maybe the JVM wasn't that great at GC'ing after all.
So to answer your question: how do you speed up Java programs? Easy, do like what the Trove guys did: stop needlessly creating gigantic numbers of objects and stop needlessly auto(un)boxing primitives because they will kill your app's performance.
A TIntIntHashMap OWNS the default HashMap<Integer,Integer> for a reason: for the same reason my apps are now much faster than before.
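To show where the difference comes from, here is a minimal sketch of an int-to-int open-addressing map. To be clear, this is NOT Trove's implementation, just an illustration with made-up names, and it has no remove(). The point is that put and get never allocate an Integer or an entry object; the only allocations are the backing arrays.

```java
// Hypothetical primitive map: linear probing, power-of-two capacity, no removal.
final class IntIntMap {
    private int[] keys;
    private int[] values;
    private boolean[] used;
    private int size;

    IntIntMap() {
        allocate(16);
    }

    private void allocate(int capacity) { // capacity must be a power of two
        keys = new int[capacity];
        values = new int[capacity];
        used = new boolean[capacity];
    }

    // Returns the slot holding the key, or the first free slot on its probe path.
    private int slotFor(int key) {
        int mask = keys.length - 1;
        int h = key * 0x9E3779B9;             // cheap multiplicative scramble
        int i = (h ^ (h >>> 16)) & mask;
        while (used[i] && keys[i] != key) {
            i = (i + 1) & mask;               // linear probing
        }
        return i;
    }

    void put(int key, int value) {
        if ((size + 1) * 2 > keys.length) {
            grow();                           // keep the table at most half full
        }
        int i = slotFor(key);
        if (!used[i]) {
            used[i] = true;
            keys[i] = key;
            size++;
        }
        values[i] = value;
    }

    int get(int key, int missing) {
        int i = slotFor(key);
        return used[i] ? values[i] : missing;
    }

    int size() {
        return size;
    }

    private void grow() {
        int[] oldKeys = keys;
        int[] oldValues = values;
        boolean[] oldUsed = used;
        allocate(oldKeys.length * 2);
        size = 0;
        for (int j = 0; j < oldKeys.length; j++) {
            if (oldUsed[j]) {
                put(oldKeys[j], oldValues[j]);
            }
        }
    }
}
```

In real code you would reach for an existing primitive-collections library rather than maintain something like this yourself.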
I stopped believing in crap like "object creation costs nothing" and "the GC is super-optimized, don't worry about it".
I'm using Java to crunch data (I know, I'm a bit crazy) and the one thing that made my app faster was to stop believing all the propaganda surrounding the "cheap object creation" and "amazingly fast GC".
The truth is: INSTEAD OF TRYING TO FINE-TUNE YOUR GC SETTINGS, STOP CREATING THAT MUCH GARBAGE IN THE FIRST PLACE. This can be stated this way: if changing the GC settings radically changes the way your app runs, it may be time to wonder if all the needless junk objects you're creating are really needed.
Oh, you know what, I'm betting we'll see more and more release notes explaining why Java version x.y.z's GC is faster than version x.y.z-1's GC ;)

Generally there are two kinds of performance optimizations you need to do with Java:
Algorithmic optimization. Choose an algorithm which behaves like you need to. For instance, a simple algorithm may perform best for small datasets, but the overhead of preparing a smarter algorithm may first pay off for much larger datasets.
Bottleneck identification. Here you need to be familiar with a profiler that can tell you what the problem is (humans always guess wrong) - memory leak?, slow method? etc... A good one to start with is VisualVM which can attach to a running program, and is available in the latest Sun JDK. When you know the problem, you can fix it.

Today's JVMs are surprisingly robust when it comes to performance. Any micro-optimizations you can apply will, in practically all cases, have only very minor impact on performance. This is easy to understand if you take a look at how typical language constructs (e.g. FOR vs WHILE) translate to bytecode - they are almost indistinguishable.
Making methods/variables final has absolutely no impact on performance on a decent JIT'd JVM. The JIT will keep track of which methods are really polymorphic and optimize away the dynamic dispatch where possible. Static methods can still be faster, since they don't have a this-reference = one less local variable (which at the same time, limits their application). Most efficient micro optimizations are not so much Java specific, for example code with lots of conditional statements can become very slow due to branch mispredictions by the processor. Sometimes conditionals can be replaced by other, sequential code flow constructs (often at the cost of readability), reducing the number of mispredicted branches (and this applies to all languages that somehow compile to native code).
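Here is a small sketch of replacing a conditional with sequential code, as described above. It assumes (for illustration) that the values are in the range 0..255; the method names are made up. Note the readability cost, and note that the JIT may already turn the simple version into a conditional move on its own, so this needs measuring rather than assuming.

```java
// Hypothetical example: sum all values >= 128 in an array of values in 0..255.
final class BranchDemo {
    // Straightforward version: one data-dependent branch per element.
    static long sumWithBranch(int[] data) {
        long sum = 0;
        for (int v : data) {
            if (v >= 128) {
                sum += v;
            }
        }
        return sum;
    }

    // Branch-free version: build a mask from the sign bit of (v - 128).
    static long sumWithoutBranch(int[] data) {
        long sum = 0;
        for (int v : data) {
            int mask = ~((v - 128) >> 31); // all ones when v >= 128, zero otherwise
            sum += v & mask;
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] data = { 0, 127, 128, 255, 5, 200 };
        System.out.println(sumWithBranch(data) == sumWithoutBranch(data));
    }
}
```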
Note that profilers tend to inflate the time spent in short, frequently called methods. This is due to the fact that profilers need to instrument the code to keep track of invocations - this can interfere with the JIT's ability to inline those methods (and the instrumentation overhead becomes significantly larger than the time spent actually executing the method's body). Manual inlining, while apparently very performance boosting under a profiler, has in most cases no effect under "real world" conditions. Don't rely purely on the profiler's results; verify that optimizations you make have real impact under real runtime conditions, too.
Notable performance boosts can only be expected from changes that reduce the amount of work done, more cache friendly data layout or superior algorithms. Java partially limits your possibilities for cache friendly data layouts, since you have no control where the parts (arrays/objects) that form your data structure will be located in memory in relation to each other. Still, there are plenty of opportunities where choosing the right data structure for the job can make a huge difference (e.g. ArrayList vs LinkedList).
There is little you can do to aid the garbage collector. However, a point worth noting is, while object allocation in Java is very very fast, there is still the cost of object initialization (which is mostly under your control). Poor performance of applications that create lots of (short lived) objects is more likely to be attributed to poor cache utilization than to the garbage collector's work.
Different application types require different optimization strategies - so before asking about specific optimizations, find out where your application really spends its time.

If you are experiencing performance issues with your application, you should seriously consider trying some profiling (eg: hprof) to see whether the problem is algorithmic in nature, and also checking the GC performance logging (eg: -verbose:gc) to see if you could benefit from tuning your JVM GC options.
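Besides -verbose:gc, the same counters can be read from inside the application through the standard java.lang.management API. A minimal sketch (the GcStats class name is made up; the MXBean calls are standard JDK):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper that reports how often and how long each collector has run so far.
class GcStats {
    static String summary() {
        StringBuilder sb = new StringBuilder();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both values are cumulative since JVM start; -1 means "not available".
            sb.append(gc.getName())
              .append(": ")
              .append(gc.getCollectionCount())
              .append(" collections, ")
              .append(gc.getCollectionTime())
              .append(" ms\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(summary());
    }
}
```

If a large share of wall-clock time shows up here, that is a hint to look at allocation rates or GC options before touching the algorithms.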

It is worth noting that the compiler does next to no optimisations, and the JVM doesn't optimise at the byte code level either. Most of the optimisations are performed by the JIT in the JVM and it optimises how the code is converted to native machine code.
The best way to optimise your code is to use a profiler which measures how much time and resources your application is using when you give it a realistic data set. Without this information you are just guessing and you can change a lot of code where it really, really doesn't matter and find you have added bugs in the process.
Many come to the conclusion that it's never worth optimising your code, and that it is even counterproductive as it can waste time and introduce bugs, and I would say that is true for 95+% of your code. However, with a profiler you can measure the critical pieces of code and optimise the <5% worth optimising and, done carefully, you won't get too many issues from trying to optimise your code.

It's hard to answer this too thoroughly because you haven't even mentioned what sort of project you're talking about. Is it a desktop application? A server-side application?
Desktop applications favor application startup time, so the HotSpot client VM is a good start. Client applications don't necessarily need all of their heap space all the time, so a good balance between starting heap and max heap is useful. (Like, maybe -Xms128m -Xmx512m)
Server applications favor overall throughput, which is something the HotSpot server VM is tuned for. You should always allocate the min and max heap sizes the same on a server application. There is an added cost at the system level to it having to malloc() and free() during garbage collection. Use something like -Xms1024m -Xmx1024m.
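To check what a given -Xms/-Xmx combination actually resulted in, the running JVM can simply be asked. A minimal sketch with a made-up class name, using only java.lang.Runtime:

```java
// Hypothetical helper: reports the heap limits the running JVM actually ended up with.
class HeapReport {
    private static final long MIB = 1024L * 1024L;

    // Upper bound the heap may grow to (roughly what -Xmx controls).
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / MIB;
    }

    // Heap currently reserved from the OS (starts near -Xms and may grow).
    static long committedHeapMiB() {
        return Runtime.getRuntime().totalMemory() / MIB;
    }

    public static void main(String[] args) {
        System.out.println("max heap:       " + maxHeapMiB() + " MiB");
        System.out.println("committed heap: " + committedHeapMiB() + " MiB");
    }
}
```

Running it as java -Xms1024m -Xmx1024m HeapReport should show the two numbers close together, which is the situation described above for servers.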
There are several different garbage collectors also, which are tuned to different application types.
Take a read through the Java SE 6 Performance White Paper if you want more info on the garbage collector and other performance related items from Java 6.

C++/Java Performance for Neural Networks?

I was discussing neural networks (NN) with a friend over lunch the other day and he claimed that the performance of a NN written in Java would be similar to one written in C++. I know that with 'just in time' compiler techniques Java can do very well, but somehow I just don't buy it. Does anyone have any experience that would shed light on this issue? This page is the extent of my reading on the subject.

The Hotspot JIT can now produce code faster than C++. The reason is run-time empirical optimization.
For example, it can see that a certain loop takes the "false" branch 99% of the time and reorder the machine code instructions accordingly.
There are lots of articles about this. If you want all the details, read Sun's excellent whitepaper. For more informal info, try this one.

I'd be interested in a comparison between the Hotspot JIT and C++ built with profile-guided optimization.
The problem I see with the Hotspot JIT (and any runtime-profile-optimized JIT compiler) is that statistics must be kept and code modified. While there are isolated cases this will result in faster-running code, I doubt that profile-optimized JIT compilers will run faster than well optimized C or C++ code in most circumstances. (Of course I could be wrong.)
Anyway, usually you're going to be at the mercy of the larger project, using the same language it is written in. Or you'll be at the mercy of the knowledge base of your co-workers. Or you'll be at the mercy of the platform you are targeting (is a JVM available on the architecture you're targeting?). In the rare case you have complete freedom and you're familiar with both languages, do some comparisons with the tools you have at your disposal. That is really the only way to determine what's best.

The only possible answer is: make a prototype and measure for yourself. If my experience is of any interest, Java and C# were always much slower than C++ for the kind of work I was doing - I believe mostly because of the high memory consumption. Of course, you can come to a completely different conclusion.

This is not strictly about C++ vs Java performance but nonetheless interesting in that regard: A paper about the performance of programs running in a garbage collected environment.

If excessive garbage collection is a concern, you can always reuse unused high-churn objects.
Create a factory that keeps a queue of SoftReferences to recycled objects, using those before creating new objects. Then in code that uses these objects, explicitly return these objects to the factory for recycling.
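A minimal sketch of such a recycling factory, using StringBuilder as a stand-in for whatever high-churn object you have (the BuilderPool name is made up). Because the queue holds SoftReferences, the GC is still free to reclaim pooled objects under memory pressure, so acquire() has to cope with cleared references.

```java
import java.lang.ref.SoftReference;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical recycling factory; StringBuilder stands in for an expensive object.
final class BuilderPool {
    private final ConcurrentLinkedQueue<SoftReference<StringBuilder>> recycled =
            new ConcurrentLinkedQueue<>();

    // Prefer a recycled instance; fall back to creating a new one.
    StringBuilder acquire() {
        SoftReference<StringBuilder> ref;
        while ((ref = recycled.poll()) != null) {
            StringBuilder sb = ref.get();   // null if the GC has cleared the reference
            if (sb != null) {
                return sb;
            }
        }
        return new StringBuilder();
    }

    // Callers must hand objects back explicitly and must not use them afterwards.
    void release(StringBuilder sb) {
        sb.setLength(0);                    // reset state before reuse
        recycled.offer(new SoftReference<>(sb));
    }

    public static void main(String[] args) {
        BuilderPool pool = new BuilderPool();
        StringBuilder first = pool.acquire();
        first.append("scratch");
        pool.release(first);
        System.out.println(pool.acquire() == first);
    }
}
```

Keep in mind that every release() allocates a SoftReference itself, so this is only likely to help when the pooled object is considerably more expensive to create than that. Profile before and after.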

Probably C++, although I believe you'll hardly notice the difference besides a slow startup time. Java however makes development faster and maintenance easier.

In the grand scheme of things, you're debating maybe a 5% performance difference where you'd get several orders of magnitude increase by moving to CUDA or dedicated hardware.
