I'm using TPTP to profile some slow-running Java code and I came across something interesting. One of my private property getters has a large Base Time value in the Execution Time Analysis results. To be fair, this property is called many, many times, but I never would have guessed that a property like this would take very long:
public class MyClass{
private int m_myValue;
public int GetMyValue(){
return m_myValue;
}
}
OK, so there's obviously more stuff in the class, but as you can see there is nothing else happening when the getter is called (it just returns an int). Some numbers for you:
About 30% of the Calls of the run are on the getter (I'm working to reduce this)
About 25% of the base time of the run is spent in this getter
Average base time is 0.000175s
For comparison, I have another method in a different class that uses this getter:
private boolean FasterMethod(MyClass instance, int value){
return instance.GetMyValue() > m_localInt - value;
}
Which has a much lower average base time of 0.000018s (one order of magnitude lower).
What's the deal here? I assume there is something that I don't understand or something I'm missing:
Does returning a local primitive really take longer than returning a calculated value?
Should I look at a metric other than Base Time?
Are these results misleading, and do I need to consider some other profiling tool?
Edit 1: Based on some suggestions below, I marked the method as final and re-ran the test, but I got the same results.
Edit 2: I installed a demo version of YourKit to re-run my performance tests, and the YourKit results look much closer to what I was expecting. I will continue to test YourKit and report back what I find.
Edit 3: Changing to YourKit seems to have resolved my issue. I was able to use YourKit to determine the actual slow points in my code. There are some excellent comments and posts below (upvoted appropriately), but I'm accepting the first person to suggest YourKit as "correct." (I am not affiliated with YourKit in any way / YMMV)
If possible, try using another profiler (the NetBeans one works well). This may be hard to do depending on how your code is set up.
It is possible that, just like many other tools, a different profiler will give you different information.
Does returning a local primitive really take longer than returning a calculated value?
Returning an instance variable takes longer than returning a local variable (VM-dependent). The getter you have is simple, so it should be inlined; it then becomes as fast as accessing a public instance variable (which, again, is slower than accessing a local variable).
But you don't have a local value (local meaning in the method as opposed to in the class).
What exactly do you mean by "local"?
Should I look at a metric other than Base Time?
I have not used the Eclipse tools, so I am not sure how they work... it might make a difference whether it is a tracing or a sampling profiler (the two can give different results for things like this).
Are these results misleading, and do I need to consider some other profiling tool?
I would consider another tool, just to see if the result is the same.
Edit based on comments:
If it is a sampling profiler, what happens, essentially, is that every n time units the program is sampled to see where it is. If you are calling the one method far more than the other, it will show up as being in the method that is called more (it is simply more likely that that method is running).
A tracing profiler adds code to your program (a process known as instrumentation) to essentially log what is going on.
Tracing profilers are slower but more accurate. They also require that the program be changed (the instrumentation process), which could potentially introduce bugs (not that I have heard of it happening... but I am sure it does, at least while the profiler itself is being developed).
Sampling profilers are faster but less accurate (they just estimate how often a line of code is executed).
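To make the sampling idea concrete, here is a toy sketch (all names are invented for this illustration; real sampling profilers use JVMTI or similar machinery, not this): every 10 ms we look at the worker thread's stack, and methods that run longer show up in more samples.

public class ToySampler {
    public static void main(String[] args) throws InterruptedException {
        Thread target = new Thread(ToySampler::busyWork, "worker");
        target.setDaemon(true);
        target.start();
        for (int i = 0; i < 50; i++) {
            Thread.sleep(10); // the sampling interval
            StackTraceElement[] stack = target.getStackTrace();
            if (stack.length > 0) {
                // The topmost frame is where the thread "is" right now;
                // hot methods dominate these samples.
                System.out.println("sample " + i + ": " + stack[0]);
            }
        }
    }

    private static void busyWork() {
        long x = 0;
        while (true) {
            x = hotMethod(x);  // long-running: caught by most samples
            x = coldMethod(x); // quick: rarely caught
        }
    }

    private static long hotMethod(long x) {
        for (int i = 0; i < 1_000_000; i++) x += i;
        return x;
    }

    private static long coldMethod(long x) {
        return x + 1;
    }
}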
So, if Eclipse uses a sampling profiler, you could see what you consider to be strange behaviour. Changing to a tracing profiler would show more accurate results.
If Eclipse uses a tracing profiler, then changing profilers should show the same result (however, the new profiler may make it more obvious to you what is going on).
It does sound slightly misleading - perhaps the profiler is removing some optimizations?
Just for kicks, try making the method final, which will make it easier to inline. That may well be the difference between the property and FasterMethod. In real use, HotSpot will inline even virtual methods until the first time they're overridden (IIRC).
EDIT: Responding to Brian's comment: yes, it's usually the case that making something final won't help performance (although it may be a good thing in terms of design :) because HotSpot will normally work out whether it can inline or not based on whether the method is overridden or not. I was suggesting that this profiler may have interfered with that.
EDIT: I've now managed to reproduce the way that HotSpot "undoes" optimisation of classes which haven't been extended yet (or methods which haven't been overridden). This was harder to do for the server VM than the client, but I've done it :)
public class Test
{
public static void main(String[] args)
throws Exception
{
final long iterations = 1000000000L;
Base b = new Base();
// Warm up Hotspot
time(b, 1000);
// Before we load Derived
time(b, iterations);
// Load Derived and use it quickly
// (Just loading is enough to make the client VM
// undo its optimizations; the server VM needs more effort)
Base d = (Base) Class.forName("Derived").newInstance();
time(d, 1);
// Time it again with Base
time(b, iterations);
}
private static void time(Base b, long iterations)
{
long total = 0;
long start = System.currentTimeMillis();
for (long i = 0; i < iterations; i++)
{
total += b.getValue();
}
long end = System.currentTimeMillis();
System.out.println("Time: " + (end-start));
System.out.println("Total: " + total);
}
}
class Base
{
public int getValue() { return 1; }
}
class Derived extends Base
{
@Override
public int getValue() { return 2; }
}
That sounds very peculiar. You're not calling an overriding method by mistake, are you?
I would be tempted to download a demo version of YourKit. It's trivial to set up, and it should give an indication as to what's really occurring. If both TPTP and YourKit agree, then something peculiar is happening (and I know that's not a lot of help!)
Something that used to make a lot of difference to the performance of this sort of method (although this may be to some extent historical) is that the size of the calling method can be an issue. HotSpot (and serious rivals) will happily inline small methods (some may choke on synchronized/try-finally). However, if the calling method is large, it may not. This was particularly a problem with old versions of the HotSpot C1/client compiler, which had a really bad register allocation algorithm (it now has one that is both quite good and fast).
Related
Are there any performance or memory differences between the two snippets below? I tried to profile them using visualvm (is that even the right tool for the job?) but didn't notice a difference, probably due to the code not really doing anything.
Does the compiler optimize both snippets down to the same bytecode? Is one preferable over the other for style reasons?
boolean valid = loadConfig();
if (valid) {
// OK
} else {
// Problem
}
versus
if (loadConfig()) {
// OK
} else {
// Problem
}
The real answer here: it doesn't even matter much what javap tells you about how the corresponding bytecode looks!
If that piece of code is executed just once, the difference between those two options would be in the range of nanoseconds (if there is any at all).
If that piece of code is executed zillions of times (often enough to "matter"), then the JIT will kick in. The JIT will optimize that bytecode into machine code, very much dependent on a lot of information gathered by the JIT at runtime.
Long story short: you are spending time on a detail so subtle that it doesn't matter in practical reality.
What matters in practical reality: the quality of your source code. In that sense: pick the option that "reads" best, given your context.
Given the comment: I think in the end this is (almost) a pure style question. Using the first way, it might be easier to trace information (assuming the variable isn't a boolean, but something more complex; see the sketch below). In that sense there is no "inherently" better version. Of course, option 2 comes with one line less and uses one variable less; and typically, when one option is as readable as the other and one of the two is shorter... then I would prefer the shorter version.
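For example, with a richer return type than boolean, the named local gives you something to breakpoint, log, or inspect. A minimal sketch (ConfigResult, loadConfig and their members are hypothetical names invented here):

final class ConfigResult {
    final boolean valid;
    final String message;

    ConfigResult(boolean valid, String message) {
        this.valid = valid;
        this.message = message;
    }
}

class Startup {
    static ConfigResult loadConfig() {
        return new ConfigResult(true, "loaded 12 settings");
    }

    static void start() {
        ConfigResult config = loadConfig(); // easy to inspect in a debugger
        if (config.valid) {
            System.out.println("OK: " + config.message);
        } else {
            System.out.println("Problem: " + config.message);
        }
    }
}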
If you are going to use the variable only once, then the compiler/optimizer will resolve the explicit declaration anyway.
Another thing is code quality. There is a very similar rule in SonarQube that describes this case too:
Local Variables should not be declared and then immediately returned or thrown
Declaring a variable only to immediately return or throw it is a bad practice.
Some developers argue that the practice improves code readability, because it enables them to explicitly name what is being returned. However, this variable is an internal implementation detail that is not exposed to the callers of the method. The method name should be sufficient for callers to know exactly what will be returned.
https://jira.sonarsource.com/browse/RSPEC-1488
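A minimal illustration of what that rule flags (method names invented here):

class Timing {
    // Noncompliant under RSPEC-1488: the local is declared and immediately returned.
    static long durationMillis(long startMillis, long endMillis) {
        long duration = endMillis - startMillis;
        return duration;
    }

    // Compliant: the method name already tells callers what is returned.
    static long durationMillisCompliant(long startMillis, long endMillis) {
        return endMillis - startMillis;
    }
}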
I'm new to compiler design and have a few years of experience with Java.
Using this and the paper
It looks like class hierarchy analysis and rapid type analysis will give enough information to do devirtualisation. But where do I patch the information back: into the source code, or into the bytecode? And how do I check the results?
I'm trying to understand how this really happens, but I'm stuck here.
For example, we have this program, taken from the paper specified above:
public class MyProgram {
public static void main(String[] args) {
EUCitizen citizen = getCitizen();
citizen.hasRightToVote(); // Call site 1
Estonian estonian = getEstonian();
estonian.hasRightToVote(); // Call site 2
}
private static EUCitizen getCitizen() {
return new Estonian();
}
private static Estonian getEstonian() {
return new Estonian();
}
}
Using class hierarchy analysis we can conclude that, since none of the subclasses override hasRightToVote(), the dynamic method invocation can be replaced with a static procedure call to Estonian#hasRightToVote(). But where do I put this information, and how? How do I tell (feed) the JVM the information that we have gathered during the analysis?
Can't I change the source code and put this information there? Could anyone provide an example, so I can start trying new ways to do the analysis and still be able to patch that information back?
Thanks.
Class Hierarchy Analysis is an optimization done by the virtual machine itself at runtime, you do not have to tell the VM anything. It simply does the analysis by itself based on the information available in the class files.
What generally happens is that analysis results are stored in association with some program representation, or are used immediately to effect the optimization, so "nothing" needs to be stored.
You are right: there is generally no "good" way to annotate the source code with an analysis result (Java annotations are one way). But the compiler has already read the source code and isn't going to read it again.
In general, the program is parsed and a variety of compiler-like structures are built (ASTs, symbol tables, control flow graphs, data flow arcs, ...) by the compiler pretty much before any serious analysis/optimization begins. A low-level model of the program (data flow over the operators) is normally what gets analyzed, and the optimization analyzer will either decorate this structure with its opinions or directly modify it to achieve the effect of the optimization.
With Java, there are two opportunities to do this: in JavaC, and in the JITter. My understanding (probably wrong, probably varies across JavaC implementations) is that not much optimization occurs in JavaC at all; it just generates naive JVM bytecode, and that all the real work is done in the JITter. The JITter doesn't have source code, but it can do all the same kinds of analysis (control flow, dataflow, ...) on the byte code that one can do on classic compiler structures, and thus achieve the same effect.
I had some doubts about the same thing, and Rohan Padhey cleared them up.
In Java, I don't think there is a way to specify monomorphism of virtual method calls in bytecode. The devirtualization analysis usually happens in the JIT compiler, which compiles bytecode to native code, and it does so using dynamic analysis.
Why patching is a problem:
In Java bytecode, the only method call instructions are invokestatic, invokedynamic, invokevirtual, invokeinterface and invokespecial (the last is used for constructors, private methods and super calls). The only type of call that does not refer to a virtual method table lookup is the invokestatic call, since static methods cannot be overridden and used polymorphically on objects.
Hence, while there is no way to specify the target method at compile time, you can replace virtual calls with static calls. How? Consider an object "x" with a method "foo", and a call site:
x.foo(arg1, arg2, ...)
If you know for sure that "x" is of the class "A", then you can transform this to:
A.static_foo(x, arg1, arg2, ...)
where "static_foo" is a newly created static method in class A whose body contains exactly everything that the body of "foo()" in "A" would have done, except that references to "this" inside the body should now be replaced by the first parameter, whatever you may call it.
That is exactly what the Whole-Jimple-Optimization-Pack (WJOP) in Soot does.
As regards static analysis using Soot, there is an optimization pack that does devirtualization using a work-around: https://github.com/Sable/soot/wiki/Whole-program-Devirtualization-Optimizations
But that's just a hack.
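To make the rewrite concrete, here is what it could look like for the Estonian example from the question (a sketch: static_hasRightToVote is an invented name, and minimal class definitions are included so the snippet stands alone):

interface EUCitizen {
    boolean hasRightToVote();
}

class Estonian implements EUCitizen {
    @Override
    public boolean hasRightToVote() { return true; }

    // Devirtualized variant: 'this' becomes the explicit parameter 'self'.
    static boolean static_hasRightToVote(Estonian self) {
        return true; // same body as hasRightToVote(), with 'this' -> 'self'
    }
}

class CallSites {
    static void demo() {
        Estonian estonian = new Estonian();
        estonian.hasRightToVote();                // virtual call (invokevirtual)
        Estonian.static_hasRightToVote(estonian); // devirtualized (invokestatic)
    }
}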
Why the JIT does it better:
The JIT does this better because static analysis has to be sound: you need to be sure, when doing this transformation, that the target of the virtual call will be one class 100% of the time. With JIT compilation you can find more opportunities for optimization, because even if the target is a single class 90% of the time, but not the other 10%, you can just-in-time compile the code to use the most frequently taken route and fall back to the bytecode in the 10% of cases where the prediction was wrong, because you can check the mistake dynamically. While the fall-back is expensive, the common case of correct predictions 90% of the time leads to an overall benefit. With a static transformation you have to decide whether or not to optimize, and it had better be sound.
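Rendered as Java for illustration (a conceptual sketch reusing the EUCitizen/Estonian classes above; the JIT emits the equivalent of this in machine code, not in source):

class GuardedCall {
    static boolean hasRightToVoteGuarded(EUCitizen citizen) {
        if (citizen.getClass() == Estonian.class) {
            // Fast path: the receiver matches the profiled type, so the
            // inlined body of Estonian.hasRightToVote() runs directly.
            return true;
        }
        // Slow path: the speculation failed; deoptimize and fall back
        // to an ordinary virtual dispatch.
        return citizen.hasRightToVote();
    }
}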
I have the following piece of code:
Player player = (Player)Main.getInstance().getPlayer();
player.setSpeedModifier(keyMap[GLFW_KEY_LEFT_SHIFT] ? 1.8f : 1);
if (keyMap[GLFW_KEY_W]) {
player.moveForward();
}
if (keyMap[GLFW_KEY_S]) {
player.moveBackward();
}
player.rotateTowards(getMousePositionInWorld());
I was wondering whether using a local variable (for the player) to make the code more readable has any impact on performance, or whether it would be optimised during compilation to replace the uses of the variable, seeing as it is just a straight copy of another variable. While it is possible to keep the long version in place, I prefer the readability of the shorter version. I understand that any performance impact would be minuscule, but I was just interested in whether there would be any at all.
Thanks, -Slendy.
For any modern compiler, this will most likely be optimized away and it will not have any performance implications. The few additional bytes used for storage are well worth the added readability.
Consider these two pieces of code:
final Player player = (Player)Main.getInstance().getPlayer();
player.callmethod1();
player.callmethod2();
and:
((Player)Main.getInstance().getPlayer()).callmethod1();
((Player)Main.getInstance().getPlayer()).callmethod2();
There are reasons why the first variant is preferable:
The first one is more readable, if only because of line length.
The Java compiler cannot assume that the same object will be returned by Main.getInstance().getPlayer(), which is why the second variant will actually call getPlayer twice; that could be a performance penalty (see the sketch below).
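A sketch of why the two calls cannot be merged (hypothetical implementations; getPlayer() may have observable side effects or return a different object on each call):

class Player {
    void callmethod1() {}
    void callmethod2() {}
}

class Main {
    private static final Main INSTANCE = new Main();
    private int lookups = 0;

    static Main getInstance() { return INSTANCE; }

    Player getPlayer() {
        lookups++;           // observable side effect on every call
        return new Player(); // may even be a fresh object each time
    }
}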
Apart from the probably unneeded (Player) cast, I find your version superior to long chains of calls.
IMHO, if you need one particular object more than once or twice, it is worth saving in a local variable.
The local variable needs a few bytes on the stack, but on the other hand several calls are avoided, so your version clearly wins.
Your biggest performance hit will likely be the chain of lookups that fetches the object:
(Player)Main.getInstance().getPlayer();
Otherwise, you want to minimize these method calls where possible. In this case a local variable could save CPU, though if you already had the object in a field, it might be a hair faster to use that.
It really depends on how many times this is done in a loop, though. Quite likely you will see no difference either way in normal usage. :)
I tested the performance of a Java ray tracer I'm writing, using VisualVM 1.3.7 on my Linux netbook. I measured with the profiler.
For fun, I tested whether there's a difference between using getters and setters and accessing the fields directly. The getters and setters are standard code with no additions.
I didn't expect any difference, but the direct-access code was slower.
Here's the sample I tested in Vector3D:
public float dot(Vector3D other) {
return x * other.x + y * other.y + z * other.z;
}
Time: 1542 ms / 1,000,000 invocations
public float dot(Vector3D other) {
return getX() * other.getX() + getY() * other.getY() + getZ() * other.getZ();
}
Time: 1453 ms / 1,000,000 invocations
I didn't test it in a micro-benchmark, but in the ray tracer. The way I tested the code:
I started the program with the first code and set it up. The ray tracer isn't running yet.
I started the profiler and waited a while after initialization was done.
I started a ray tracer.
When VisualVM showed enough invocations, I stopped the profiler and waited a bit.
I closed the ray tracer program.
I replaced the first code with the second and repeated the steps above after compiling.
I ran at least 20,000,000 invocations for each version. I closed any programs I didn't need.
I set my CPU governor to performance, so the CPU clock was at its maximum the whole time.
How is it possible that the second code is 6% faster?
I did some micro-benchmarking with lots of JVM warm-up and found the two approaches take the exact same amount of execution time.
This happens because the JIT compiler inlines the getter method into a direct field access, so the two versions end up as identical native code.
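For reference, a minimal JMH harness of the kind that shows them equal might look like this (a sketch; Vector3D is redeclared here with assumed fields so the snippet is self-contained):

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
public class DotProductBench {

    public static class Vector3D {
        public float x = 1f, y = 2f, z = 3f;
        public float getX() { return x; }
        public float getY() { return y; }
        public float getZ() { return z; }
    }

    private final Vector3D a = new Vector3D();
    private final Vector3D b = new Vector3D();

    @Benchmark
    public float directAccess() {
        return a.x * b.x + a.y * b.y + a.z * b.z;
    }

    @Benchmark
    public float viaGetters() {
        return a.getX() * b.getX() + a.getY() * b.getY() + a.getZ() * b.getZ();
    }
}

Returning the result from each benchmark method keeps the JIT from eliminating the computation as dead code.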
Thank you all for helping me answering this question. In the end, I found the answer.
First, Bohemian is right: with PrintAssembly I checked the assumption that the generated assembly code is identical. And yes, although the bytecodes are different, the generated native code is identical.
So masterxilo is right: the profiler has to be the culprit. But masterxilo's guess about timing fences and extra instrumentation code can't be true; both codes end up identical.
So there's still the question: How is it possible that the second code seems to be 6% faster in the profiler?
The answer lies in the way VisualVM measures: before you start profiling, you need calibration data. This is used to remove the overhead time caused by the profiler.
Although the calibration data is correct, the final calculation of the measurement is not. VisualVM sees the method invocations in the bytecode, but it doesn't see that the JIT compiler removes these invocations while optimizing.
So it subtracts overhead time that no longer exists, and that's how the difference appears.
In case you have not taken a course in statistics: there is always variance in program performance, no matter how well the program is written. The reason these two methods seem to run at approximately the same rate is that the accessor does only one thing: return a particular field. Because nothing else happens in the accessor method, both tactics do pretty much the same thing. The broader issue is encapsulation, that is, how well a programmer hides data (fields or attributes) from the user: a major rule of encapsulation is not to reveal internal data. Making a field public means that any other class can access it, and that can be very dangerous. That is why I always recommend that Java programmers use accessor and mutator methods, so that fields do not get into the wrong hands.
In case you are curious about how to access a private field: you can use reflection, which accesses the data of a particular class so that you can read or mutate it if you really must. As a frivolous example, suppose you knew that the java.lang.String class contains a private field of type char[] (that is, a char array). It is hidden from the user, so you cannot access the field directly. (By the way, the method java.lang.String.toCharArray() accesses the field for you.) If you wanted to access each character consecutively and store each one into a collection (for the sake of simplicity, why not a java.util.List?), then here is how to use reflection in this case:
/**
This method iterates through each character in a <code>String</code> and places each of them into a <code>java.util.List</code> of type <code>Character</code>.
@param str The <code>String</code> to extract from.
@param list The list to store each character into. (The caller supplies the list, since this method cannot know which <code>List</code> implementation to use.)
*/
public static void extractStringData(String str, List<Character> list) throws IllegalAccessException, NoSuchFieldException
{
java.lang.reflect.Field value = String.class.getDeclaredField("value");
value.setAccessible(true);
char[] data = (char[]) value.get(str);
for(char ch : data) list.add(ch);
}
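A quick usage sketch (note that this relies on String's private char[] field, so it only works on Java 8 and earlier; later JDKs store a byte[] and their module access checks will refuse the setAccessible call):

import java.util.ArrayList;
import java.util.List;

public class ExtractDemo {
    public static void main(String[] args) throws Exception {
        List<Character> chars = new ArrayList<>();
        extractStringData("hello", chars); // the method defined above
        System.out.println(chars);         // prints [h, e, l, l, o]
    }

    // extractStringData(...) as declared above goes here
}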
As a side note, reflection costs your program a lot of performance, mainly because of the runtime access and type checks it performs and because the JIT cannot optimize reflective accesses well. If there is a field, method, or inner or nested class that you must access for whatever reason (which is highly unlikely anyway), then reflection is the tool to use, but use it sparingly. I am glad to have helped!
Using JProfiler, I've identified a hot spot in my Java code that I cannot make sense of. JProfiler reports that this method takes 150μs (674μs without warm-up) on average, not including the time it takes to call descendant methods. 150μs may not seem like much, but in this application it adds up (and is experienced by my users), and it also seems a lot compared to other methods that look more complex to me than this one. Hence it matters to me.
private boolean assertReadAuthorizationForFields(Object entity, Object[] state,
String[] propertyNames) {
boolean changed = false;
final List<Field> fields = FieldUtil.getAppropriatePropertyFields(entity, propertyNames);
// average of 14 fields to iterate over
for (final Field field : fields) {
// manager.getAuthorization returns an enum type
// manager is a field referencing another component
if (manager.getAuthorization(READ, field).isDenied()) {
FieldUtil.resetField(field.getName(), state, propertyNames);
changed = true;
}
}
return changed;
}
I have minimized this method myself in different directions, but it never teaches me much. I cannot stress enough that the JProfiler-reported duration (150μs) covers only the code in this method and does not include the time it takes to execute getAuthorization, isDenied, resetField and such. That is also why I started off by just posting this snippet without much context, since the issue seems to be with this code and not with its descendant method calls.
Maybe you can argue why – if you feel I'm seeing ghosts :) Anyhow, thanks for your time!
Candidate behaviours that could slow you down:
Major effect: obviously, iteration. If you have lots of fields... you say 14 on average, which is fairly significant.
Major effect: HotSpot inlining would mean called methods are included in your times, and this could be noticeable because your method calls use reflection. getAppropriatePropertyFields introspects class field metadata, and resetField dynamically invokes setter methods (possibly via Method.invoke()?). If you are desperate for performance, you could keep a cache in a HashMap (mapping element class to field metadata and MethodHandles of the setter methods, instead of using Method.invoke, which is slow); see the sketch after this list. Then you would only reflect during application start-up and would use the JVM's fast dynamic-invoke support.
Minor effect, but multiplied by the number of iterations: the state and propertyNames arrays are not copied when passed (Java passes parameters by value, but for arrays that value is just the reference), so the per-call cost is small; each element access does, however, pay for indirection and bounds checks.
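A hedged sketch of the caching idea from the second bullet (class and method names are invented here): reflect once per class at start-up, then reuse the MethodHandles.

import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.reflect.Field;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class FieldAccessCache {
    private static final Map<Class<?>, MethodHandle[]> CACHE = new ConcurrentHashMap<>();

    // Returns one setter MethodHandle per declared field, reflecting only
    // on the first request for each class.
    static MethodHandle[] settersFor(Class<?> type) {
        return CACHE.computeIfAbsent(type, t -> {
            Field[] fields = t.getDeclaredFields();
            MethodHandle[] handles = new MethodHandle[fields.length];
            MethodHandles.Lookup lookup = MethodHandles.lookup();
            try {
                for (int i = 0; i < fields.length; i++) {
                    fields[i].setAccessible(true);
                    handles[i] = lookup.unreflectSetter(fields[i]);
                }
            } catch (IllegalAccessException e) {
                throw new IllegalStateException(e);
            }
            return handles;
        });
    }
}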
I suggest you time the method yourself, as the profiler doesn't always give accurate timings.
Create a micro-benchmark with just this code and time it for at least 2 seconds. To work out how much difference the method calls make, comment them out and hard-code the values they return.
I think the issue is that FieldUtil is using reflection and doesn't cache the fields it's using.