Overhead of a Java JNI call [duplicate] - java

Possible Duplicate:
What makes JNI calls slow?
First let me say that this question is born more out of curiosity than real necessity.
I'm curious to know what is the overhead in doing a JNI call from Java, with say, System.arraycopy versus allocating the array and copying the elements over with a for loop.
If the overhead is substantial, then there is probably a rough "magic number" of elements below which it pays to simply use a for loop instead of the System call. Also, what exactly is involved in the System call that causes this overhead?
I'm guessing the stack must be pushed to the context of the call, and that might take a while, but I can't find a good explanation for the whole process.
Let me clarify my question:
I know that using arraycopy is the fastest way to copy an array in Java.
That being said, let's say I'm using it to copy an array of only one element. Since I'm calling the underlying OS to do so, there has to be an overhead in this call. I'm interested in knowing what this overhead is and what happens in the process of the call.
I'm sorry if using arraycopy misled you from the purpose of my question. I'm interested to know the overhead of a JNI call, and what's involved in the actual call.

Since I'm calling the underlying OS to do so...
You are right that system calls are fairly expensive. However, the System in System.arraycopy() is a bit of a misnomer. There are no system calls involved.
...there has to be an overhead in this call. I'm interested in knowing what this overhead is and what happens in the process of the call.
When you look at the definition of System.arraycopy(), it's declared as native. This means that the method is implemented in C++. If you were so inclined, you could look at the JDK source code, and find the C++ function. In OpenJDK 7, it's called JVM_ArrayCopy() and lives in hotspot/src/share/vm/prims/jvm.cpp. The implementation is surprisingly complicated, but deep down it's essentially a memcpy().
If arraycopy() is being used as a normal native function, there's overhead to calling it. There's further overhead caused by argument checking etc.
However, it's very likely that the JIT compiler knows about System.arraycopy(). This means that, instead of calling the C++ function, the compiler knows how to generate specially-crafted machine code to carry out the array copy. I don't know about other JVMs, but HotSpot does have such "intrinsic" support for System.arraycopy().
Let's say I'm using it to copy an array of only one element
If your array is tiny, you may be able to beat System.arraycopy() with a hand-crafted loop. You can probably do even better if the size is known at compile time, since then you can also unroll the loop. However, all of that is not really relevant except in the narrowest of circumstances.

Take a look at java.util.Arrays.copyOf implementations, eg
public static byte[] copyOf(byte[] original, int newLength) {
    byte[] copy = new byte[newLength];
    System.arraycopy(original, 0, copy, 0,
                     Math.min(original.length, newLength));
    return copy;
}
they use System.arraycopy because this is the fastest way.
If you mean whether calling native methods in Java is expensive then take a look at http://www.javamex.com/tutorials/jni/overhead.shtml
UPDATE Question is really interesting, so I've done some testing
long t0 = System.currentTimeMillis();
byte[] a = new byte[100];
byte[] b = new byte[100];
for (int i = 0; i < 10000000; i++) {
    // for (int j = 0; j < a.length; j++) {
    //     a[j] = b[j];
    // }
    System.arraycopy(b, 0, a, 0, a.length);
}
System.out.println(System.currentTimeMillis() - t0);
It shows that on very short arrays (< 10) System.arraycopy may even be slower, most probably because it's native. On bigger arrays that no longer matters: System.arraycopy is much faster.
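A slightly more careful version of that measurement might look like the sketch below. The class name CopyTiming, the helper names and the warm-up count are all made up for illustration, and a proper harness such as JMH is the more reliable tool for real numbers. The idea is simply to run both variants once before timing, so the JIT compiler has a chance to compile them, and to use System.nanoTime() for the measurement.

```java
import java.util.Arrays;

final class CopyTiming {
    // Hypothetical helper: element-by-element copy with a plain loop.
    static void copyWithLoop(byte[] src, byte[] dst) {
        for (int j = 0; j < src.length; j++) {
            dst[j] = src[j];
        }
    }

    // Hypothetical helper: the same copy delegated to System.arraycopy.
    static void copyWithArraycopy(byte[] src, byte[] dst) {
        System.arraycopy(src, 0, dst, 0, src.length);
    }

    // Runs the chosen copy 'iterations' times and returns elapsed nanoseconds.
    static long time(boolean useArraycopy, byte[] src, byte[] dst, int iterations) {
        long t0 = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            if (useArraycopy) {
                copyWithArraycopy(src, dst);
            } else {
                copyWithLoop(src, dst);
            }
        }
        return System.nanoTime() - t0;
    }

    public static void main(String[] args) {
        // Array length can be passed on the command line; 100 matches the test above.
        int length = args.length > 0 ? Integer.parseInt(args[0]) : 100;
        byte[] src = new byte[length];
        byte[] dst = new byte[length];
        Arrays.fill(src, (byte) 1);

        // Warm-up runs (count chosen arbitrarily) so both paths get JIT-compiled.
        time(false, src, dst, 1_000_000);
        time(true, src, dst, 1_000_000);

        System.out.println("loop:      " + time(false, src, dst, 10_000_000) + " ns");
        System.out.println("arraycopy: " + time(true, src, dst, 10_000_000) + " ns");
    }
}
```

Running it with different lengths (for example 1, 10 and 1000) gives a rough feel for where the crossover lies on a particular JVM, though the JIT may still optimize parts of such a simple loop away.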

I'm interested to know the overhead of a JNI call, and what's involved in the actual call.
The System.arraycopy() method is rather complicated* and it is unlikely that the JIT compiler inlines it (as one of the other answers suggests).
On the other hand, it is likely the JIT compiler uses an optimized calling sequence, since this is an intrinsic native method. In other words, this is most likely not a normal JNI call.
* - System.arraycopy is not a simple memory copy. It has to test its arguments to avoid reading or writing beyond the array bounds, and so on. And when you are copying from one object array to another, it may need to check the actual type of each object copied. All of this adds up to far more code than it is sensible to inline.
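The two kinds of checks mentioned in the footnote can be observed from plain Java. The following is a small sketch (the class and method names are invented for the example): one call asks for more elements than the destination can hold, the other copies an Object[] containing a String into an Integer[].

```java
final class ArraycopyChecks {
    // Returns true if copying past the end of the destination is rejected.
    static boolean rejectsOutOfBounds() {
        int[] src = {1, 2, 3};
        int[] dst = new int[2];
        try {
            System.arraycopy(src, 0, dst, 0, 3); // dst only has room for 2
            return false;
        } catch (IndexOutOfBoundsException expected) {
            return true;
        }
    }

    // Returns true if an element of the wrong runtime type is rejected.
    static boolean rejectsWrongElementType() {
        Object[] src = {1, "two", 3};   // mixed Integer and String
        Integer[] dst = new Integer[3];
        try {
            System.arraycopy(src, 0, dst, 0, 3); // "two" cannot be stored in Integer[]
            return false;
        } catch (ArrayStoreException expected) {
            return true;
        }
    }
}
```

Both methods return true, which shows that the bounds check and the per-element type check happen on every call, whether or not the JIT uses an intrinsic.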

You have it the wrong way around. System.arraycopy() is a super fast native implementation provided by the JVM.
There is no "overhead"; there is only "advantage".

Related

Performance loss of continued call to array.length or list.size()

I have seen people say to cache the values of size for a list or length for an array when iterating, to save the time of checking the length/size over and over again.
So
for (int i = 0; i < someArr.length; i++) // do stuff
for (int i = 0; i < someList.size(); i++) // do stuff
Would be turned into
for (int i = 0, length = someArr.length; i < length; i++) // do stuff
for (int i = 0, size = someList.size(); i < size; i++) // do stuff
But since Array#length isn't a method, just a field, shouldn't it not have any difference? And if using an ArrayList, size() is just a getter so shouldn't that also be the same either way?
It is possible the JIT compiler will do some of those optimizations for itself. Hence, doing the optimizations by hand may be a complete waste of time.
It is also possible (indeed likely) that the performance benefit you are going to get from hand optimizing those loops is too small to be worth the effort. Think of it this way:
Most of the statements in a typical program are only executed rarely
Most loops will execute in a few microseconds or less.
Hand optimizing a program takes in the order of minutes or hours of developer time.
If you spend minutes to get an execution speedup that is measured in microseconds, you are probably wasting your time. Even thinking about it too long is wasting time.
The corollary is that:
You should benchmark your code to decide whether you need to optimize it.
You should profile your code to figure out which parts of your code are worth spending optimization effort on.
You should set (realistic) performance goals, and stop optimization when you reach those goals.
Having said all of that:
theArr.length is very fast, probably just a couple of machine instructions
theList.size() will probably also be very fast, though it depends on what List class you are using.
For an ArrayList the size() call is probably a method call + a field fetch versus a field fetch for length.
For an ArrayList the size() call is likely to be inlined by the JIT compiler ... assuming that the JIT compiler can figure that out.
The JIT compiler should be able to hoist the length fetch out of the loop. It can probably deduce that it doesn't change in the loop.
The JIT compiler might be able to hoist the size() call, but it will be harder for it to deduce that the size doesn't change.
What this means is that if you do hand optimize those two examples, you will most likely get negligible performance benefit.
In general the loss is negligible. Even a LinkedList.size() will use a stored count, and not iterate over all nodes.
For large sizes you may assume that the JIT compilation to machine code will catch up and perform this optimization itself.
If inside the loop the size is changed (delete/insert) the size variable must be changed too, which gives us even less solid code.
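The stale-size problem is easy to reproduce. Here is a minimal sketch (class and method names are made up for illustration) that removes even numbers from a list, once re-reading size() on each iteration and once with the size cached before the loop.

```java
import java.util.ArrayList;
import java.util.List;

final class CachedSizePitfall {
    // Removes even numbers while re-reading size() on every iteration.
    static List<Integer> removeEvens(List<Integer> input) {
        List<Integer> list = new ArrayList<>(input);
        for (int i = 0; i < list.size(); i++) {
            if (list.get(i) % 2 == 0) {
                list.remove(i); // remove(int index)
                i--;            // stay on the same index: elements shifted left
            }
        }
        return list;
    }

    // Same loop with the size cached up front; returns true if it fails.
    static boolean cachedSizeFails(List<Integer> input) {
        List<Integer> list = new ArrayList<>(input);
        try {
            for (int i = 0, size = list.size(); i < size; i++) {
                if (list.get(i) % 2 == 0) {
                    list.remove(i);
                    i--;
                }
            }
            return false;
        } catch (IndexOutOfBoundsException stale) {
            return true; // 'size' no longer matched the real list size
        }
    }
}
```

With the input 1, 2, 3, 4 the first method returns a list containing 1 and 3, while the cached variant runs past the end of the shrunken list and throws.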
The best would be to use a for-each
for (Bar bar: bars) { ... }
You might also use the somewhat more costly Stream:
barList.forEach(bar -> ...);
Stream.of(barArray).forEach(bar -> ...);
Streams can be executed in parallel.
barList.parallelStream().forEach(bar -> ...);
And last but not least you may use standard java code for simple loops:
Arrays.setAll(barArray, i -> ...);
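To make these idioms concrete, here is a small sketch on plain integers. The class and method names are invented for the example; none of the variants needs an explicit index bound or a cached size.

```java
import java.util.Arrays;
import java.util.List;

final class IterationStyles {
    // Sums a list with a for-each loop.
    static int sumForEach(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Same sum via a stream.
    static int sumStream(List<Integer> values) {
        return values.stream().mapToInt(Integer::intValue).sum();
    }

    // Fills an array with the squares of its indices using Arrays.setAll.
    static int[] squares(int n) {
        int[] result = new int[n];
        Arrays.setAll(result, i -> i * i);
        return result;
    }
}
```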
We are talking here about micro-optimisations. I would go for elegance.
Most often the problem is the algorithm and data structures used. List is notorious, as everything can be a List. However, Set or Map often provide much more power and expressiveness.
If a complex piece of software is slow, profile the application. Check the likely bottlenecks: Java collections versus database queries, file parsing.

Will the Java compiler optimize out String.length() in a for-loop's condition?

Consider the following Java code fragment:
String buffer = "...";
for (int i = 0; i < buffer.length(); i++)
{
    System.out.println(buffer.charAt(i));
}
Since String is immutable and buffer is not reassigned within the loop, will the Java compiler be smart enough to optimize away the buffer.length() call in the for loop's condition? For example, would it emit byte code equivalent to the following, where buffer.length() is assigned to a variable, and that variable is used in the loop condition? I have read that some languages like C# do this type of optimization.
String buffer = "...";
int length = buffer.length();
for (int i = 0; i < length; i++)
{
    System.out.println(buffer.charAt(i));
}
In Java (and in .NET), strings are length counted (the length is the number of UTF-16 code units), so finding the length is a simple operation.
The compiler (javac) may or may not perform hoisting, but the JVM JIT Compiler will almost certainly inline the call to .length(), making buffer.length() nothing more than a memory access.
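As a side note on what length() actually counts: it returns the number of UTF-16 code units, which differs from the number of code points for characters outside the Basic Multilingual Plane. A short sketch (the class name and sample string are chosen for illustration):

```java
final class StringLengthDemo {
    // "a" followed by U+1F600, which needs a surrogate pair in UTF-16.
    static final String SAMPLE = "a\uD83D\uDE00";

    // Number of UTF-16 code units: 3 for SAMPLE.
    static int codeUnits() {
        return SAMPLE.length();
    }

    // Number of Unicode code points: 2 for SAMPLE.
    static int codePoints() {
        return SAMPLE.codePointCount(0, SAMPLE.length());
    }
}
```

Either way the length is stored, not computed by scanning the string, so the cost of the call does not depend on the string's contents.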
The Java compiler (javac) performs no such optimization. The JIT compiler will likely inline the length() method, which at the very least would avoid the overhead of a method call.
Depending on which JDK you're running, the length() method itself likely returns a final length field, which is a cheap memory access, or the length of the string's internal char[] array. In the latter case, the array's length is constant, and the array reference is presumably final, so the JIT may be sophisticated enough to record the length once in a temporary as you suggest. However, that sort of thing is an implementation detail. Unless you control every machine that your code will run on, you shouldn't make too many assumptions about which JVM it will run on, or which optimizations it will perform.
As to how you should write your code, calling length() directly in the loop condition is a common code pattern, and benefits from readability. I'd keep things simple and let the JIT optimizer do its job, unless you're in a critical code path that has demonstrated performance issues, and you have likewise demonstrated that such a micro-optimization is worthwhile.
You can do several things to examine the two variations of your implementation.
(difficulty: easy) Make a test and measure the speed under similar conditions for each version of the code. Make sure your loop is significant enough to notice a difference; it is possible that there is none.
(difficulty: medium) Examine the bytecode with javap and see how the compiler has translated both versions. This might differ depending on the javac implementation, or it might not (when the behavior is specified in the spec and leaves no room for interpretation by the implementor).
(difficulty: hard) Examine the JIT output of both versions with JITWatch, you will need to have a very good understanding of bytecode and assembler.
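For the "medium" option, one way to set this up is to put both variants side by side in one class and disassemble it. The sketch below assumes a class named LengthInLoop (a made-up name) and counts spaces rather than printing, so the two methods are easy to compare. After compiling with javac, running javap -c LengthInLoop shows the bytecode of both methods.

```java
final class LengthInLoop {
    // Variant 1: length() is called in the loop condition.
    static int countSpacesDirect(String buffer) {
        int count = 0;
        for (int i = 0; i < buffer.length(); i++) {
            if (buffer.charAt(i) == ' ') {
                count++;
            }
        }
        return count;
    }

    // Variant 2: length() is hoisted into a local variable by hand.
    static int countSpacesHoisted(String buffer) {
        int count = 0;
        int length = buffer.length();
        for (int i = 0; i < length; i++) {
            if (buffer.charAt(i) == ' ') {
                count++;
            }
        }
        return count;
    }
}
```

The bytecode only shows what javac emitted. What the JIT compiler makes of it at run time is a separate question, which is where the "hard" option comes in.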

Is it faster to create a new object reference if it will only be used twice?

I have a question about instruction optimization. If an object is to be used in two statements, is it faster to create a new object reference or should I instead call the object directly in both statements?
For the purposes of my question, the object is part of a Vector of objects (this example is from a streamlined version of Java without ArrayLists). Here is an example:
AutoEvent ptr = ((AutoEvent)e_autoSequence.elementAt(currentEventIndex));
if(ptr.exitConditionMet()) {currentEventIndex++; return;}
ptr.registerSingleEvent();
AutoEvent is the class in question, and e_autoSequence is the Vector of AutoEvent objects. The AutoEvent contains two methods in question: exitConditionMet() and registerSingleEvent().
This code could, therefore, alternately be written as:
if(((AutoEvent)e_autoSequence.elementAt(currentEventIndex)).exitConditionMet())
{currentEventIndex++; return;}
((AutoEvent)e_autoSequence.elementAt(currentEventIndex)).registerSingleEvent();
Is this faster than the above?
I understand the casting process is slow, so this question is actually twofold: additionally, in the event that I am not casting the object, which would be more highly optimized?
Bear in mind this is solely for two uses of the object in question.
The first solution is better all round:
Only one call to the vector elementAt method. This is actually the most expensive operation here, so only doing it once is a decent performance win. Also doing it twice potentially opens you up to some race conditions.
Only one cast operation. Casts are very cheap on modern JVMs, but still have a slight cost.
It's more readable IMHO. You are getting an object then doing two things with it. If you get it twice, then the reader has to mentally figure out that you are getting the same object. Better to get it once, and assign it to a variable with a good name.
A single assignment of a local variable (like ptr in the first solution) is extremely cheap and often free - the Java JIT compiler is smart enough to produce highly optimised code here.
P.S. Vector is pretty outdated. Consider converting to an ArrayList<AutoEvent>. By using the generic ArrayList you won't need to explicitly cast, and it is much faster than a Vector (because it isn't synchronised and therefore has less locking overhead)
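A sketch of what the fetch-once version could look like with a generic list follows. It assumes a platform where generics and ArrayList are available (the question says the asker's platform lacks ArrayList), and AutoEvent is reduced to a hypothetical interface with just the two methods named in the question. The wrapper class AutoSequenceSketch and its other members are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

final class AutoSequenceSketch {
    // Hypothetical stand-in for the AutoEvent class from the question.
    interface AutoEvent {
        boolean exitConditionMet();
        void registerSingleEvent();
    }

    private final List<AutoEvent> autoSequence = new ArrayList<>();
    private int currentEventIndex = 0;

    void add(AutoEvent event) {
        autoSequence.add(event);
    }

    int currentEventIndex() {
        return currentEventIndex;
    }

    // Fetches the current event once, then uses the local twice; no cast needed.
    void step() {
        AutoEvent current = autoSequence.get(currentEventIndex);
        if (current.exitConditionMet()) {
            currentEventIndex++;
            return;
        }
        current.registerSingleEvent();
    }
}
```

If thread safety is actually required, this unsynchronized version would need external locking or a different collection.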
First solution will be faster.
The reason is that assignments work faster than method invocations.
In the second case you will have method elementAt() invoked twice, which will make it slower and JVM will probably not be able to optimize this code because it doesn't know what exactly is happening in the elementAt().
Also remember that Vector's methods are synchronized, which makes every method invocation even slower due to lock acquisition.
I don't know what you mean by "create a new object reference" here. The expression ((AutoEvent)e_autoSequence.elementAt(currentEventIndex)) will probably be translated into bytecode that obtains the sequence element, casts it to AutoEvent and stores the resulting reference on the stack. The local variable ptr, like other local variables, is stored on the stack too, so assigning the reference to it is just copying 4 bytes from one stack slot to another, nearby stack slot. This is a very, very fast operation. Modern JVMs do not do reference counting, so assigning references is probably as cheap as assigning int values.
Let's get some terminology straight first. Your code does not "create a new object reference". It is fetching an existing object reference (either once or twice) from a Vector.
To answer your question, it is (probably) a little bit faster to fetch once and put the reference into a temporary variable. But the difference is small, and unlikely to be significant unless you do it lots of times in a loop.
(The elementAt method on a Vector or ArrayList is O(1) and cheap. If the list was a linked list, which has an O(N) implementation for elementAt, then that call could be expensive, and the difference between making 1 or 2 calls could be significant ...)
Generally speaking, you should think about the complexity of your algorithms, but beyond that you shouldn't spend time optimizing ... until you have solid profiling evidence to tell you where to optimize.
I can't say whether ArrayList would be more appropriate. This could be a case where you need the thread-safety offered by Vector.

For arrays of up to 10 elements: for loop copy or System.arrayCopy?

Assuming a 10 element or less Object[] in Java, what would be the fastest way of copying the array?
for(int i = 0;i < a.length;i++)
for(int i = 0,l = a.length;i < l;i++) // i.e. is caching array len in local var faster?
System.arraycopy(a, 0, a2, 0, a.length);
The chances are that the difference between the three alternatives is relatively small.
The chances are that this is irrelevant to your application's overall performance.
The relative difference is likely to depend on the hardware platform and the implementation of Java that you use.
The relative difference will also vary depending on the declared and actual types of the arrays.
You are best off forgetting about this and just coding the way that seems most natural to you. If you find that your completed application is running too slowly, profile it and tune based on the profiling results. At that point it might be worthwhile to try out the three alternatives to see which is faster for your application's specific use-case. (Another approach might be to see if it is sensible to avoid the array copy in the first place.)
Caching the length isn't useful. You're accessing a field directly. And even if it were a method, the JIT would inline and optimize it.
If something had to be optimized, System.arraycopy would contain the optimization.
But the real answer is that it doesn't matter at all. You're not going to obtain a significant gain in performance by choosing the most appropriate way of copying an array of 10 elements or less. If you have a performance problem, then search where it comes from by measuring, and then optimize what must be optimized. For the rest, use what is the most readable and maintainable. What you're doing is premature optimization. And it's the root of all evil (says D. Knuth).
System.arraycopy() is the fastest way to copy an array, as it is designed and optimized exactly for this job. There were rumors that for small arrays a hand-coded loop may be faster, but that is no longer true. System.arraycopy is a JIT intrinsic, and the JIT chooses the best implementation for each case.
Do get yourself a book on JVM internals (for example, "Oracle JRockit, The Definitive Guide") and realize that what the JVM executes, after warming up, loop unrolling, method inlining, register re-allocation, loop invariant extraction and so on will not even closely resemble what you write in Java source code.
Sorry :-) Otherwise, you will enjoy reading http://www.javaspecialists.eu.

Java - calling static methods vs manual inlining - performance overhead

I am interested in whether I should manually inline small methods which are called 100k - 1 million times in some performance-sensitive algorithm.
First, I thought that, by not inlining, I am incurring some overhead since the JVM will have to determine whether or not to inline this method (or even fail to do so).
However, the other day, I replaced this manually inlined code with invocation of static methods and saw a performance boost. How is that possible? Does this suggest that there is actually no overhead and that letting the JVM inline at "its will" actually boosts performance? Or does this hugely depend on the platform/architecture?
(The example in which a performance boost occurred was replacing array swapping (int t = a[i]; a[i] = a[j]; a[j] = t;) with a static method call swap(int[] a, int i, int j). Another example in which there was no performance difference was when I inlined a 10-liner method which was called 1000000 times.)
I have seen something similar. "Manual inlining" isn't necessarily faster; the resulting program can be too complex for the optimizer to analyze.
In your example let's make some wild guesses. When you use the swap() method, JVM may be able to analyze the method body, and conclude that since i and j don't change, although there are 4 array accesses, only 2 range checks are needed instead of 4. Also the local variable t isn't necessary, JVM can use 2 registers to do the job, without involving r/w of t on stack.
Later, the body of swap() is inlined into the caller method. That is after the previous optimization, so the saves are still in place. It's even possible that caller method body has proved that i and j are always within range, so the 2 remaining range checks are also dropped.
Now in the manually inlined version, the optimizer has to analyze the whole program at once, there are too many variables and too many actions, it may fail to prove that it's safe to save range checks, or eliminate the local variable t. In the worst case this version may cost 6 more memory accesses to do the swap, which is a huge overhead. Even if there is only 1 extra memory read, it is still very noticeable.
Of course, we have no basis to believe that it's always better to do manual "outlining", i.e. extract small methods, wishfully thinking that it will help the optimizer.
--
What I've learned is that, forget manual micro optimizations. It's not that I don't care about micro performance improvements, it's not that I always trust JVM's optimization. It is that I have absolutely no idea what to do that does more good than bad. So I gave up.
The JVM can inline small methods very efficiently. The only benefit of inlining yourself is if you can remove code, i.e. simplify what it does by inlining it.
The JVM looks for certain structures and has some "hand coded" optimisations when it recognises those structures. By using a swap method, the JVM may recognise the structure and optimise it differently with a specific optimisation.
You might be interested to try the OpenJDK 7 debug version which has an option to print out the native code it generates.
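For anyone who wants to repeat the experiment, the swap helper from the question can be exercised with a small driver like the one below. This is only a sketch: the class SwapSketch, the reverse() method and the loop counts are assumptions made for the example. On HotSpot, running it with -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining is one way to see whether swap() gets inlined; the exact output format varies between JVM versions.

```java
final class SwapSketch {
    // The small static helper from the question; a natural inlining candidate.
    static void swap(int[] a, int i, int j) {
        int t = a[i];
        a[i] = a[j];
        a[j] = t;
    }

    // Reverses the array in place; used here only to call swap() in a hot loop.
    static void reverse(int[] a) {
        for (int i = 0, j = a.length - 1; i < j; i++, j--) {
            swap(a, i, j);
        }
    }

    public static void main(String[] args) {
        int[] data = new int[1_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i;
        }
        // Enough rounds (count chosen arbitrarily) for the JIT to compile reverse().
        for (int round = 0; round < 100_000; round++) {
            reverse(data);
        }
        // Print something derived from the data so the work is not dead code.
        System.out.println(data[0]);
    }
}
```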
Sorry for my late reply, but I just found this topic and it got my attention.
When developing in Java, try to write "simple and stupid" code. Reasons:
the optimization is made at runtime (since the compilation itself is made at runtime). The compiler will figure out anyway what optimization to make, since it compiles not the source code you write, but the internal representation it uses (several AST -> VM code -> VM code ... -> native binary code transformations are made at runtime by the JVM compiler and the JVM interpreter)
When optimizing, the compiler uses some common programming patterns in deciding what to optimize, so help it help you! Write a private static (maybe also final) method and it will figure out immediately that it can:
inline the method
compile it to native code
If the method is manually inlined, it's just part of another method which the compiler first tries to understand and see whether it's time to transform it into binary code or if it must wait a bit to understand the program flow. Also, depending on what the method does, several re-JIT'ings are possible during runtime => JVM produces optimum binary code only after a "warm up"... and maybe your program ended before the JVM warms itself up (because I expect that in the end the performance should be fairly similar).
Conclusion: it makes sense to optimize code in C/C++ (since the translation into binary is made statically), but the same optimizations usually don't make a difference in Java, where the compiler JITs byte code, not your source code. And btw, from what I've seen javac doesn't even bother to make optimizations :)
However, the other day, I replaced this manually inlined code with invocation of static methods and seen a performance boost. How is that possible?
Probably the JVM profiler sees the bottleneck more easily if it is in one place (a static method) than if it is implemented several times separately.
The Hotspot JIT compiler is capable of inlining a lot of things, especially in -server mode, although I don't know how you got an actual performance boost. (My guess would be that inlining is done by method invocation count and the method swapping the two values isn't called too often.)
By the way, if its performance really matters, you could try this for swapping two int values. (I'm not saying it will be faster, but it may be worth a punt.)
a[i] = a[i] ^ a[j];
a[j] = a[i] ^ a[j];
a[i] = a[i] ^ a[j];
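One caveat with this trick: when i and j are the same index, the first statement computes x ^ x, which is 0, and the element is lost. A sketch with and without a guard (class and method names are made up for the example):

```java
final class XorSwap {
    // XOR swap as written above; zeroes the element when i == j.
    static void xorSwapUnguarded(int[] a, int i, int j) {
        a[i] = a[i] ^ a[j];
        a[j] = a[i] ^ a[j];
        a[i] = a[i] ^ a[j];
    }

    // Same swap with a guard for the i == j case.
    static void xorSwap(int[] a, int i, int j) {
        if (i == j) {
            return; // x ^ x == 0 would otherwise wipe out the value
        }
        xorSwapUnguarded(a, i, j);
    }
}
```

The extra branch is one more reason the plain temporary-variable swap is usually the safer choice.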
