Java: Getter and setter faster than direct access?

I tested the performance of a Java ray tracer I'm writing with VisualVM 1.3.7 on my Linux netbook, measuring with the profiler.
For fun I tested whether there's a difference between using getters and setters and accessing the fields directly. The getters and setters are standard code with no extra logic.
I didn't expect any difference, but the direct-access code was slower.
Here's the sample I tested in Vector3D:
public float dot(Vector3D other) {
    return x * other.x + y * other.y + z * other.z;
}
Time: 1542 ms / 1,000,000 invocations
public float dot(Vector3D other) {
    return getX() * other.getX() + getY() * other.getY() + getZ() * other.getZ();
}
Time: 1453 ms / 1,000,000 invocations
I didn't test it in a micro-benchmark, but in the ray tracer itself. This is how I tested the code:
I started the program with the first version and set it up. The ray tracer wasn't running yet.
I started the profiler and waited a while after initialization was done.
I started the ray tracer.
When VisualVM showed enough invocations, I stopped the profiler and waited a bit.
I closed the ray tracer program.
I replaced the first version with the second, recompiled, and repeated the steps above.
I ran at least 20,000,000 invocations of each version. I closed every program I didn't need.
I set my CPU governor to performance, so the CPU clock stayed at maximum the whole time.
How is it possible that the second code is 6% faster?

I did some micro-benchmarking with lots of JVM warm-up and found the two approaches take the exact same amount of execution time.
This happens because the JIT compiler inlines the getter, replacing each call with a direct field access, so the two versions compile to identical machine code.
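For illustration, here is a minimal warmed-up timing loop along the same lines. It is only a sketch: it assumes the question's Vector3D has a three-float constructor (which the question doesn't show), and JMH would be the more rigorous tool.
public class DotBench {
    public static void main(String[] args) {
        // Hypothetical constructor; the question only shows dot() and the getters.
        Vector3D a = new Vector3D(1f, 2f, 3f);
        Vector3D b = new Vector3D(4f, 5f, 6f);
        float sink = 0f;
        // Warm-up: give the JIT time to compile and inline dot().
        for (int i = 0; i < 5_000_000; i++) {
            sink += a.dot(b);
        }
        long start = System.nanoTime();
        for (int i = 0; i < 20_000_000; i++) {
            sink += a.dot(b);
        }
        long elapsed = System.nanoTime() - start;
        // Print the sink so the JIT cannot eliminate the loops as dead code.
        System.out.println(elapsed + " ns (sink " + sink + ")");
    }
}
After warm-up, the field-access and getter versions of dot() time identically, which is what you would expect once the getters are inlined.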

Thank you all for helping me answer this question. In the end, I found the answer.
First, Bohemian is right: with PrintAssembly (via -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly, which requires the hsdis disassembler plugin) I checked the assumption that the generated assembly is identical. And yes, although the bytecodes differ, the generated machine code is identical.
So masterxilo is right: the profiler has to be the culprit. But masterxilo's guess about timing fences and extra instrumentation code can't be true; both codes are identical in the end.
So there's still the question: How is it possible that the second code seems to be 6% faster in the profiler?
The answer lies in the way VisualVM measures: before you start profiling, it needs calibration data, which is used to subtract the overhead time caused by the profiler itself.
Although the calibration data is correct, the final calculation of the measurement is not. VisualVM sees the method invocations in the bytecode, but it doesn't see that the JIT compiler removes those invocations while optimizing.
So it subtracts overhead time that no longer exists. The getter version contains more bytecode-level invocations, so more phantom overhead is deducted from it, and that's how the difference appears.

Even if you have not taken a course in statistics: there is always some variance in program performance, no matter how well the program is written. The two approaches run at approximately the same rate because the accessors do exactly one thing: return a particular field. Since nothing else happens in the accessor method, both tactics do essentially the same work. However, encapsulation, that is, how well a programmer hides the data (fields or attributes) from the user, still matters: a major rule of encapsulation is not to reveal internal data. Making a field public means that any other class can read and write it, and that can be very dangerous. That is why I always recommend Java programmers use accessor and mutator methods, so that the fields do not fall into the wrong hands.
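For instance, a mutator can validate its input, which a public field never could. A toy sketch (the Temperature class is invented for illustration):
public class Temperature {
    private double kelvin; // hidden: callers cannot set an impossible value directly

    public double getKelvin() {
        return kelvin;
    }

    public void setKelvin(double kelvin) {
        // the mutator can enforce the class invariant
        if (kelvin < 0) {
            throw new IllegalArgumentException("below absolute zero: " + kelvin);
        }
        this.kelvin = kelvin;
    }
}
And as the benchmarks above show, this safety costs nothing once the JIT has inlined the accessors.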
If you are curious how to access a private field anyway, you can use reflection, which lets you read (and, if you really must, mutate) the data of a particular class. As a frivolous example, suppose you know that the java.lang.String class contains a private field of type char[] (that is, a char array; this holds on Java 8 and earlier). It is hidden from the user, so you cannot access the field directly. (By the way, the method java.lang.String.toCharArray() accesses it for you.) If you wanted to visit each character consecutively and store each one into a collection (for the sake of simplicity, why not a java.util.List?), then here is how to use reflection in this case:
/**
 * Iterates through each character in a <code>String</code> and places
 * each of them into a <code>java.util.List</code> of type <code>Character</code>.
 *
 * @param str  The <code>String</code> to extract from.
 * @param list The list to store each character into (supplied by the caller,
 *             who knows which <code>List</code> implementation to use).
 */
public static void extractStringData(String str, List<Character> list)
        throws IllegalAccessException, NoSuchFieldException {
    // "value" is the String's internal char[] on Java 8 and earlier
    java.lang.reflect.Field value = String.class.getDeclaredField("value");
    value.setAccessible(true); // bypass the private modifier
    char[] data = (char[]) value.get(str);
    for (char ch : data) {
        list.add(ch);
    }
}
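Hypothetically, you would call it like this, from a method that declares or handles the reflective exceptions (again, this relies on the field being a char[] named "value", i.e. Java 8 or earlier):
List<Character> chars = new java.util.ArrayList<>();
extractStringData("hello", chars);
// chars now holds ['h', 'e', 'l', 'l', 'o']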
As a side note, reflection costs your program a lot of performance. Only if there is a field, method, or inner or nested class that you absolutely must access this way (which is highly unlikely anyway) should you resort to reflection. The overhead comes mainly from the runtime access checks and from the fact that the JIT cannot optimize reflective calls as well as direct ones. I am glad to have helped!

Related

When does the JVM consider code for bytecode optimization?

I am trying to understand the JVM and the HotSpot optimizer's internals.
I am tackling the problem of initializing object tree structures with an awful lot of nodes as fast as possible.
Right now, for every tree structure given, we generate Java source code that initializes the tree as follows. In the end, we have thousands of these classes.
public class TypeATreeNodeInitializer {

    public TypeATreeNode initialize() {
        return getTypeATree();
    }

    private TypeATreeNode getTypeATree() {
        TypeATreeNode node = StaticTypeAFactory.create();
        TypeBTreeNode child1 = getTypeBTreeNode1();
        node.getChildren().add(child1);
        TypeBTreeNode child2 = getTypeBTreeNode2();
        node.getChildren().add(child2);
        //... may be many more children
        return node;
    }

    private TypeBTreeNode getTypeBTreeNode1() {
        TypeBTreeNode node = StaticTypeBFactory.create();
        TypeBTreeNode child1 = getTypeCTreeNode1();
        node.getChildren().add(child1);
        // store the value in a variable first
        String value1 = "Some value";
        // assign the value to the node
        node.setSomeValue(value1);
        boolean value2 = false;
        node.setSomeBooleanValue(value2);
        return node;
    }

    private TypeBTreeNode getTypeCTreeNode1() {
        // ...
        return null;
    }

    private TypeBTreeNode getTypeBTreeNode2() {
        // ...
        return null;
    }

    //... many more child node getters / initializers
}
As you can see, the values to be assigned to the tree nodes are stored in local variables first. Looking at the generated bytecode, this results in:
A load of the constant from the constant pool onto the stack // e.g. String "Some value"
A store of the value into the local variables
A load of the method target onto the stack // e.g. TypeBTreeNode
A load of the value from the local variables // "Some value"
The invocation of the setter
Yet this could be written more compactly by not storing into a local variable and passing the parameter directly. It then becomes just:
pushing the method target onto the stack // e.g. TypeBTreeNode
then loading the constant onto the stack // "Some value"
then invoking the setter
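To make that concrete, here are the two source forms with (roughly) the bytecode javac emits for them, assuming node sits in local slot 1 and the String in slot 2; the listing is illustrative, not exact:
// With a local variable (five instructions):
String value1 = "Some value";     // ldc "Some value"
                                  // astore_2
node.setSomeValue(value1);        // aload_1          (node)
                                  // aload_2          (value1)
                                  // invokevirtual setSomeValue

// Without the local (three instructions):
node.setSomeValue("Some value");  // aload_1          (node)
                                  // ldc "Some value"
                                  // invokevirtual setSomeValue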
I know that in other languages (e.g. C++) compilers are capable of such optimizations.
In Java, the HotSpot optimizer is responsible for such magic at runtime.
However, as far as I understand the docs, HotSpot only kicks in after roughly the 1,500th call of a method (client VM).
Questions:
Do I understand correctly that, if I initialize every tree only once but do that for a large number (say 10,000) of generated TreeInitializers, the first bytecode sequence is executed for every TreeInitializer, as they are different classes with different methods and every method is called just once?
I suspect a significant speed-up from rewriting the generator to use no locals, as that saves about a third of the bytecode instructions and possibly some expensive loads of the variables. I know this is hard to tell without measuring, but altering the generator's code is non-trivial, so would you think it is worth a try?
Removing temporary/stack variables like this is almost always premature optimization. Your processor can handle hundreds of millions of these instructions per second; meanwhile, if you're initializing tens of thousands of anything, your program is probably going to block at some point waiting on memory allocation.
My advice is always going to be to hold off on optimizations until you've profiled your code. In the meantime, write code that is as easy to read as possible, so that when you do need to come back and modify something, it's easy to find the places that need to be updated.
Before optimizing, the JVM interprets your bytecode and profiles its behavior. Based on this observation, it compiles your code to machine code. For this reason, it is difficult to give general advice. You should treat your bytecode as a general abstraction, not as a performance fundamental.
A few rules of thumb:
Avoid large methods, as they are often not inlined into other methods even when the JVM would consider it a good idea; this avoids the memory overhead of duplicating a lot of code.
Avoid polymorphism and unstable branches if you can. If the VM finds that a call site only ever hits a specific class, that is good news: the JVM will most likely devirtualize the call. Similarly, stable branches help with branch prediction.
Avoid keeping a lot of objects alive for a long time. If you create many objects, it is better to let them die young than to keep them around (see the sketch below).
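The last point is about generational GC: objects that die young are swept cheaply in a minor collection, while long-lived objects get promoted and cost more. So, perhaps counter-intuitively, code like this is usually fine (Point and its method are invented for illustration; xs and ys are plain double arrays):
long hits = 0;
for (int i = 0; i < xs.length; i++) {
    // The temporary lives for one iteration and dies in the young
    // generation, making its allocation and collection very cheap.
    Point p = new Point(xs[i], ys[i]);
    if (p.distanceSquared() < 1.0) hits++;
}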
The first rule of Optimize Club is "don't optimize." That said...
There is no point in assigning a value to a local (stack) variable only to reference it once. If I were reviewing this code, I would have the author remove the assignment and just pass the result of get...() to add() (see the sketch below).
This is not "premature optimization" but a code-simplification (code quality) issue. The fact that it eliminates some bytecode is usually not a consideration either, as the JIT compiler optimizes the code at run time. In this case, though, because these initializers sound like they will only run once, the compilation threshold will likely never be met, so there is value in eliminating the unnecessary stack store and load.
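Concretely, the generated getTypeBTreeNode1() from the question would shrink to something like this (a sketch against the question's hypothetical node API):
private TypeBTreeNode getTypeBTreeNode1() {
    TypeBTreeNode node = StaticTypeBFactory.create();
    // pass results and constants directly instead of staging them in locals
    node.getChildren().add(getTypeCTreeNode1());
    node.setSomeValue("Some value");
    node.setSomeBooleanValue(false);
    return node;
}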

Will C#-JIT implement "inline virtual method" optimizations with future versions in inspiration of Java?

Or should I consider refactoring my virtual indexing method (and its class) into a code-duplicated but faster one?
The issue I'm stuck at: I had some duplicated code, then refactored it into a single class with just a single virtual method in the child classes, to minimize future code duplication. Now it's 50% slower than before at accomplishing this:
arr[i] = 3.14f; // arr is derived from a base class with `[]` override
(so the derived class implementation is used),
but it became 500% easier to add new types.
How many if-else checks in a non-virtual method make it as fast as a virtual one without if-else checks inside (for today's 20-30-stage pipelined CPUs)? float + char + double + some other structs means more than 15 different types in my library, so 15x code duplication would make the code 1500% harder to implement and refactor without virtual methods.
Example of my issue:
// implemented IList<T> because then C# arrays, instead of this,
// can be used in the same wrapper property too!
// Reduced even more code duplication.
public class Foo<T> : IList<T>
{
    public virtual T this[int i]
    { ... }
}

public unsafe class Bar : Foo<byte>
{
    public override byte this[int i]
    {
        get
        {
            return *(pByte + i);
        }
        set
        {
            *(pByte + i) = value;
        }
    }
}
Bar b = new Bar(); // Can't use Foo<byte> directly,
// because I made its constructor `internal`:
// its misuse would cause undefined behaviour (worse than an out-of-bounds access)
// at a random time, in a random place.
b[400] = 50;
The reason I have to duplicate code without virtual is that pointers are not allowed for generic type parameters T.
The reason I have to use pointers is that I have unmanaged, fast GPGPU C++ arrays that should work from the outside just like pure C# arrays.
The reason I had to use unmanaged arrays for GPGPU is that they work at top speed when aligned to values like 4096 (which the managed heap doesn't guarantee), need to be pinned, and also reduce the C#-to-C++ transition overhead.
Note: maybe it is not only `virtual` but also the IList<T> interface contributing to the slowness. Many answers say it comes with a cost, but if Java can work around it, why can't C#?
Here is the environment:
.NET 3.5
MSVS 2015 Community edition, all optimizations enabled
Windows 10, 64-bit
64-bit release build
c3060 CPU with single-channel DDR3 RAM
For benchmarking, a warm-up phase is included; timings are taken after many iterations and used in real data visualizations.

java garbage collection and temporary objects

I'm a C++ developer by trade, but I've been doing a bit of Java lately. The project I'm working on was written by a developer long since gone, and I keep finding places where he worked around the garbage collector by doing weird things.
Case in point: he implemented his own string class to avoid slowdowns from GC.
This section of the app takes a large binary file format and exports it to CSV. This means building up a string for each line in the file (millions of lines). To avoid those temporary String objects, he made a string class that just has a large array of bytes it reuses.
/**
 * HACK
 * A quick and dirty string builder implementation optimized for GC.
 * Using String.format causes the application to grind to a halt when
 * more than a couple of string operations are performed, due to the
 * number of temporary objects allocated while formatting strings for
 * drawing or logging.
 */
Does this actually help? Is it really needed? Is it better than just declaring a String object outside the loop and setting it inside the loop?
The app also has a hash map containing doubles for the values. The keys in the map are fairly static, but the values change often. Afraid of GC on the doubles, he made a MyDouble class to use as the value type for the hash map:
/**
 * This is a mutable Double wrapper class created to avoid GC issues.
 */
public class MyDouble implements Serializable {

    private static final long serialVersionUID = C.SERIAL_VERSION_UID;

    public double d;

    public MyDouble(double d) {
        this.d = d;
    }
}
This is crazy and completely unnecessary... right?
It's true that string concatenation can be a bottleneck in Java because Strings are immutable: each concatenation creates a new String. (The string pool, see string interning, only helps for compile-time constants and explicitly interned strings; runtime concatenation always allocates.) Either way, it can certainly lead to problems.
However, your predecessor is not the first person to encounter this, and the standard way to deal with the need to concatenate many Strings in Java is to use a StringBuilder.
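For the CSV case, a single reused StringBuilder gets you the predecessor's zero-garbage goal with standard library code. A sketch (Record and its getters are invented; any Writer works):
static void export(List<Record> records, Writer writer) throws IOException {
    StringBuilder line = new StringBuilder(256);
    for (Record r : records) {
        line.setLength(0);                     // reuse one buffer for every row
        line.append(r.getId()).append(',')
            .append(r.getName()).append(',')
            .append(r.getValue()).append('\n');
        writer.append(line);                   // Appendable takes a CharSequence, no toString()
    }
}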
When a double (or any primitive, for that matter) is a local variable, it lives on the stack, and the memory it occupies is released along with the stack frame; stack slots are never subject to GC at all. If, however, the double is a field of an object, it is stored on the heap and is collected when the object containing it is collected.
Without seeing how the double values are used it's hard to say for sure, but it's more than likely that the use of the Map increased the GC load, since every value stored in a Map must be an object on the heap.
In summary: yes, imho this is certainly, as you say, 'crazy and completely unnecessary'. These sorts of premature optimizations only serve to complicate the code base, making it more prone to bugs and future maintenance more difficult. The golden rule should practically always be: build the simplest thing that works, profile it, and then optimize.

Java method takes seemingly lot of time that I cannot account for

Using JProfiler, I've identified a hot spot in my Java code that I cannot make sense of. JProfiler reports that this method takes 150 μs (674 μs without warm-up) on average, not including the time it takes to call descendant methods. 150 μs may not seem like much, but in this application it adds up (and is experienced by my users), and it also seems a lot compared to other methods that look more complex to me than this one. Hence it matters to me.
private boolean assertReadAuthorizationForFields(Object entity, Object[] state,
        String[] propertyNames) {
    boolean changed = false;
    final List<Field> fields = FieldUtil.getAppropriatePropertyFields(entity, propertyNames);
    // average of 14 fields to iterate over
    for (final Field field : fields) {
        // manager.getAuthorization returns an enum type;
        // manager is a field referencing another component
        if (manager.getAuthorization(READ, field).isDenied()) {
            FieldUtil.resetField(field.getName(), state, propertyNames);
            changed = true;
        }
    }
    return changed;
}
I have minimized this method in different directions myself, but it never teaches me much. I cannot stress enough that the JProfiler-reported duration (150 μs) covers only the code in this method and does not include the time it takes to execute getAuthorization, isDenied, resetField and such. That is also why I started off by posting just this snippet, without much context, since the issue seems to be with this code and not its descendant method calls.
Maybe you can argue why, if you feel I'm seeing ghosts :) Anyhow, thanks for your time!
Candidate behaviours that could slow you down:
Major effect: obviously, iteration. If you have lots of fields... you say 14 on average, which is fairly significant.
Major effect: HotSpot inlining would mean called methods are included in your times, and this could be noticeable because your method calls use reflection. getAppropriatePropertyFields introspects on class field definition metadata; resetField presumably invokes setter methods dynamically (possibly using Method.invoke()?). If you are desperate for performance, you could cache the field metadata and MethodHandles of the setter methods in a HashMap keyed by element class (instead of using Method.invoke, which is slow). Then you would only reflect during application startup and would use the JVM's much faster MethodHandle invocation support; a sketch follows the last point below.
Minor effect, multiplied by the number of iterations: parameter passing. Note, though, that Java passes object references and primitives by value, so for the state and propertyNames arrays only the references are copied per call, not the contents; this should be cheap even for large arrays.
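Here is a sketch of such a cache. The class shape and helper names are assumptions; the point is that the reflective lookup and unreflectSetter happen once per class, not once per authorization check:
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class FieldHandleCache {
    private static final Map<Class<?>, Map<String, MethodHandle>> CACHE =
            new ConcurrentHashMap<>();

    // Resolve a class's setter handles once; later lookups are plain map gets.
    static Map<String, MethodHandle> settersFor(Class<?> type) {
        return CACHE.computeIfAbsent(type, t -> {
            Map<String, MethodHandle> handles = new HashMap<>();
            MethodHandles.Lookup lookup = MethodHandles.lookup();
            for (Field f : t.getDeclaredFields()) {
                try {
                    f.setAccessible(true);
                    handles.put(f.getName(), lookup.unreflectSetter(f));
                } catch (IllegalAccessException e) {
                    // skip fields we are not allowed to touch
                }
            }
            return handles;
        });
    }
}
A handle obtained this way is invoked as handle.invoke(entity, newValue), which HotSpot optimizes far better than repeated Method.invoke calls.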
I suggest you time the method yourself, as the profiler doesn't always give accurate timings.
Create a micro-benchmark with just this code and time it for at least 2 seconds. To work out how much difference the method calls make, comment them out and hard-code the values they return.
I think the issue is that FieldUtil is using reflection and doesn't cache the fields it looks up.

Java Profiling: Private Property Getter has Large Base Time

I'm using TPTP to profile some slow-running Java code, and I came across something interesting. One of my private property getters has a large Base Time value in the Execution Time Analysis results. To be fair, this property is called many, many times, but I never would have guessed that a property like this would take very long:
public class MyClass {

    private int m_myValue;

    public int GetMyValue() {
        return m_myValue;
    }
}
Ok so there's obviously more stuff in the class, but as you can see there is nothing else happening when the getter is called (just return an int). Some numbers for you:
About 30% of the calls of the run are on the getter (I'm working to reduce this)
About 25% of the base time of the run is spent in this getter
Average base time is 0.000175 s
For comparison, I have another method in a different class that uses this getter:
private boolean FasterMethod(MyClass instance, int value) {
    return instance.GetMyValue() > m_localInt - value;
}
Which has a much lower average base time of 0.000018s (one order of magnitude lower).
What's the deal here? I assume there is something that I don't understand or something I'm missing:
Does returning a local primitive really take longer than returning a calculated value?
Should I look at metric other than Base Time?
Are these results misleading and I need to consider some other profiling tool?
Edit 1: Based on some suggestions below, I marked the method as final and re-ran the test, but I got the same results.
Edit 2: I installed a demo version of YourKit to re-run my performance tests, and the YourKit results look much closer to what I was expecting. I will continue to test YourKit and report back what I find.
Edit 3: Changing to YourKit seems to have resolved my issue. I was able to use YourKit to determine the actual slow points in my code. There are some excellent comments and posts below (upvoted appropriately), but I'm accepting the first person to suggest YourKit as "correct." (I am not affiliated with YourKit in any way / YMMV)
If possible, try using another profiler (the NetBeans one works well). This may be hard to do depending on how your code is set up.
It is possible that, just like many other tools, a different profiler will result in different information.
Does returning a local primitive really take longer than returning a calculated value?
Returning an instance variable takes longer than returning a local variable (VM dependent). But the getter you have is simple, so it should be inlined, making it as fast as accessing a public instance variable (which, again, is slower than accessing a local variable).
But you don't have a local value (local meaning in the method as opposed to in the class).
What exactly do you mean by "local"?
Should I look at metric other than Base Time?
I have not used the Eclipse tools, so I am not sure how it works... it might make a difference if it is a tracing or a sampling profiler (the two can give different results for things like this).
Are these results misleading and I need to consider some other profiling tool?
I would consider another tool, just to see if the result is the same.
Edit based on comments:
If it is a sampling profiler, what happens, essentially, is that every n time units the program is sampled to see where it is. If you call one method far more than another, it will show up as being in the method that is called more (it is simply more likely that that method is running).
A tracing profiler adds code to your program (a process known as instrumentation) to essentially log what is going on.
Tracing profilers are slower but more accurate; they also require that the program be changed (the instrumentation process), which could potentially introduce bugs (not that I have heard of it happening... but I am sure it does, at least while the profiler itself is being developed).
Sampling profilers are faster but less accurate (they just estimate how often a line of code is executed).
So, if Eclipse uses a sampling profiler, you could see what you consider to be strange behaviour. Changing to a tracing profiler would show more accurate results.
If Eclipse uses a tracing profiler, then changing profilers should show the same result (however, the new profiler may make it more obvious to you what is going on).
It does sound slightly misleading; perhaps the profiler is suppressing some optimizations?
Just for kicks, try making the method final, which will make it easier to inline. That may well be the difference between the property and FasterMethod. In real use, HotSpot will inline even virtual methods until the first time they're overridden (IIRC).
EDIT: Responding to Brian's comment: Yes, it's usually the case that making something final won't help performance (although it may be a good thing in terms of design :) because Hotspot will normally work out whether it can inline or not based on whether it's overridden or not. I was suggesting this profiler may have messed with that.
EDIT: I've now managed to reproduce the way that HotSpot "undoes" optimisation of classes which haven't been extended yet (or methods which haven't been overridden). This was harder to do for the server VM than the client, but I've done it :)
public class Test
{
    public static void main(String[] args)
        throws Exception
    {
        final long iterations = 1000000000L;
        Base b = new Base();
        // Warm up HotSpot
        time(b, 1000);
        // Before we load Derived
        time(b, iterations);
        // Load Derived and use it quickly
        // (Just loading is enough to make the client VM
        // undo its optimizations; the server VM needs more effort)
        Base d = (Base) Class.forName("Derived").newInstance();
        time(d, 1);
        // Time it again with Base
        time(b, iterations);
    }

    private static void time(Base b, long iterations)
    {
        long total = 0;
        long start = System.currentTimeMillis();
        for (long i = 0; i < iterations; i++)
        {
            total += b.getValue();
        }
        long end = System.currentTimeMillis();
        System.out.println("Time: " + (end - start));
        System.out.println("Total: " + total);
    }
}
class Base
{
    public int getValue() { return 1; }
}

class Derived extends Base
{
    @Override
    public int getValue() { return 2; }
}
That sounds very peculiar. You're not calling an overriding method by mistake, are you?
I would be tempted to download a demo version of YourKit. It's trivial to set up, and it should give an indication as to what's really occurring. If both TPTP and YourKit agree, then something peculiar is happening (and I know that's not a lot of help!)
Something that used to make a lot of difference to the performance of this sort of method (although this may be to some extent historical) is that the size of the calling method can be an issue. HotSpot (and its serious rivals) will happily inline small methods (some may choke on synchronized/try-finally). However, if the calling method is large, it may not. This was particularly a problem with old versions of the HotSpot C1/client compiler, which had a really bad register-allocation algorithm (it now has one that is both quite good and fast).
