Which one is faster in Java?
a) for(int i = 100000; i > 0; i--) {}
b) for(int i = 1; i < 100001; i++) {}
I have been looking for an explanation for the answer, which is option (a). Anyone? Any help is appreciated.
There are situations in which a reverse loop can be slightly faster in Java, and a benchmark can demonstrate that for a concrete example. Typically, the difference is explained by implementation details of the increment/decrement instructions or of the loop-termination comparison, in the context of the underlying processor architecture. In more complex examples, reversing the loop can help eliminate dependencies and thus enable other optimizations, or can improve memory locality and caching, and even garbage collection behavior.
One cannot assume that either kind of loop will always be faster in all cases - a benchmark would be needed to determine which one performs better on a given platform, for a concrete case. And I'm not even considering what the JIT compiler has to do with this.
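For illustration, a minimal JMH sketch of such a benchmark could look like the following (this assumes JMH is on the classpath; the class and method names are made up, and a little work is done and returned so the JIT cannot simply discard the loops as dead code):

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
public class LoopDirectionBenchmark {

    @Benchmark
    public int countingDown() {
        int sum = 0;
        for (int i = 100000; i > 0; i--) {
            sum += i;          // some work, so the loop is not eliminated
        }
        return sum;            // JMH consumes the return value
    }

    @Benchmark
    public int countingUp() {
        int sum = 0;
        for (int i = 1; i < 100001; i++) {
            sum += i;
        }
        return sum;
    }
}

On most modern hardware and JVMs the difference, if any, is tiny - which is exactly why measuring beats guessing.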
Anyway, this is the sort of micro-optimization that can make code more difficult to read without providing a noticeable performance boost, so it's better to avoid it unless strictly necessary. Remember: "premature optimization is the root of all evil".
Just talking through my hat here, but I know assembly languages have specific comparisons against zero that take fewer cycles than comparisons between register values.
Generally, Oracle HotSpot puts its emphasis on optimising real-world code, which means that forward-loop optimisations are more likely to be implemented than backward-loop ones. From a machine-code point of view, the decrementing loop may save an instruction, but it is unlikely to have a significant impact on performance, particularly when there is much memory access going on. I understand modern CPUs are more or less as happy going backwards as forwards (historically there was a time when they were better optimised for forward access). They'll even optimise certain stride access patterns.
(Also HotSpot (at least the Server/C2 flavour) is capable of removing empty loops.)
You said that the answer is (a), so I guess an explanation would be: the Java virtual machine can "translate" the comparison with zero into a faster form.
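For what it's worth, here is a rough sketch of what javac typically emits for the two loop conditions (the exact bytecode can vary between compiler versions, and the JIT may well compile both to very similar machine code anyway):

// for (int i = 100000; i > 0; i--)
//     iload_1
//     ifle  <exit>           // single-operand branch: compares against zero implicitly
//
// for (int i = 1; i < 100001; i++)
//     iload_1
//     ldc   100001           // the bound has to be pushed first
//     if_icmpge <exit>       // two-operand compare-and-branch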
I take the following example to illustrate the point, but note that it could be any other task.
for (int i = 0; i < 1000; i++) {
    String a = "world";
    Log.d("hello", a);
}
Versus
String a="";
a="world";
Log.d("hello",a);
String a="";
a="world";
Log.d("hello",a);
String a="";
a="world";
Log.d("hello",a);
String a="";
a="world";
Log.d("hello",a);
String a="";
a="world";
Log.d("hello",a);
...
//1000 times
Let's ignore readability and code quality and consider only the performance of the compiled program.
So which one is better?
The one and only answer: this doesn't matter at all.
What I mean is: aspects such as readability and a good design are much more worth spending your time on.
When, and only when, you have managed to design a superb application, and you are down to the point where you have a performance problem and would have to look into this specific question - well, then there would be merit in looking into it. But I am 99.9% sure you are not at that point. And in that sense, you are wasting your time with such thoughts!
In other words: especially within the "JVM stack" there are tons and tons of things that can have subtle or not-so-subtle effects on performance. For example, there are many options that influence the inner workings of garbage collection and of the just-in-time compiler. Those will affect the "performance" of your running application in much more significant ways than manually unrolling loop code.
And for the record: many benchmarking experiments have shown that the JIT does its best job when you give it "normal" code. Meaning: the Oracle JIT is designed to give you the best results on "normal" input. As soon as you start to fiddle around with your Java code in order to somehow create "optimized" code, chances are that your changes make it harder for the JIT to do a good job.
Thus: please forget about such micro-optimizations. If you spend the same amount of time learning about good OO design and general Java performance topics, you will gain much more from that!
So which one is better?
The first one, since it actually compiles (the second redeclares a over and over, so it won't). Non-compiling code is absolutely the least performant code.
But also the first one, because, despite you saying "ignore readability and code quality", readability is almost always more important than performance. Write code to be readable first, and worry about making it performant afterwards.
And the first one, because the JIT will unroll the loop for you if it determines that doing so gives better performance on the specific JVM it is running on, for the specific data the code is run on (I'm assuming you don't simply like filling your logs with the same message over and over).
Don't try to second guess the optimizations the JIT will apply. Applying the optimizations yourself might make performance worse, because it makes it harder for the JIT to optimize the code. Write clear, maintainable code, and let the JVM work its magic.
It depends on the programming language. The compiler is responsible for compacting your code in such a way that it can avoid repeated calls at the low level. Object-oriented languages like JavaScript sit at a relatively high level, and therefore the compilation may take some time to run.
So, to conclude briefly, based largely on assumptions from your question: the only impacts are the overall file size and the time taken to compile.
What you are doing is called "Loop Unrolling".
Check: https://en.wikipedia.org/wiki/Loop_unrolling
This is one of the optimizations that compilers will apply if the code is compiled with an appropriate optimization level.
In my view, your second option should run faster than the first one, because in the first option the code has to increment the loop counter, check it against the loop condition, and take the appropriate branch decision on every iteration. Branching itself can be expensive if there is a branch misprediction, because on a misprediction the CPU has to flush the whole pipeline and restart from the correct instruction. Check: https://en.wikipedia.org/wiki/Branch_misprediction
When I was learning C, I was taught that if I wanted to loop through something strlen(string) times, I should save that value in an 'auxiliary' variable, say count, and put that in the for condition instead of the strlen call, so the length wouldn't be computed on every iteration.
Now that I've started learning Java, I've noticed this is not quite the norm. I've seen lots of code and programmers doing exactly what I was told not to do in C.
What's the reason for this? Are they trading efficiency for 'readability'? Or does the compiler manage to fix that?
Edit: This is NOT a duplicate of the linked question. I'm not simply asking about string length; mine is a more general question.
In the old times, every function call was expensive, compilers were dumb, usable profilers were yet to come, and computers were slow. That is how C macros and other terrible things were born. Java is not that old.
Efficiency is important, but the impact of most program parts on efficiency is very small. Reading code, however, still takes programmers' time, and that is much more costly than CPU time. So we'd better optimize for readability most of the time and care about speed only in the most important places.
A local variable can make the code simpler when it avoids repeating a complicated expression - this happens sometimes. It can make the code faster when it avoids an expensive computation that the compiler can't eliminate on its own - this happens rather rarely. When neither condition is met, it's just a wasted line, so why bother?
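To make the trade-off concrete, here is a hypothetical Java example of the two styles (the names are made up; for something like ArrayList.size(), which is a trivial field read, the JIT will usually inline the call, so the measured difference is rarely noticeable):

import java.util.List;

public class HoistingExample {

    // Style the C teacher recommended: hoist the bound into a local once.
    static int countNonEmptyHoisted(List<String> items) {
        int count = 0;
        int size = items.size();           // evaluated once
        for (int i = 0; i < size; i++) {
            if (!items.get(i).isEmpty()) {
                count++;
            }
        }
        return count;
    }

    // Style commonly seen in Java code: call size() in the loop condition.
    static int countNonEmptyInline(List<String> items) {
        int count = 0;
        for (int i = 0; i < items.size(); i++) {   // re-evaluated, but usually inlined by the JIT
            if (!items.get(i).isEmpty()) {
                count++;
            }
        }
        return count;
    }
}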
Which is faster:
if (this.foo != 1234)
    this.foo = 1234;
or
this.foo = 1234;
Is the penalty of a write high enough that one should check the value before writing, or is it faster to just write?
Wouldn't having a branch cause possible mispredictions and screw up the CPU pipeline? And what if the field is volatile, with writes having a higher cost than reads?
Yes, it is easy to say that in isolation these operations are 'free', or to benchmark them, but that is not an answer.
There is a nice example illustrating this dilemma in the very recent talk by Sergey Kuksenko about hardware counters (slides 45-49), where the right answer for the same code depends on the data size! The idea is that the "compare and set" approach causes more branch misses and loads, but fewer stores and L1 store misses. The difference is subtle, and I can't even rationalize why one set of factors outweighs the other on small data sizes yet becomes less significant on large data sizes.
So, measure, don't guess.
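As a sketch of what such a measurement could look like (hypothetical JMH code; the array is re-filled before every invocation so that roughly half the slots already hold the target value, which is where branch prediction starts to matter - note that per-invocation setup adds its own overhead, so treat the numbers with care):

import java.util.Random;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Level;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
public class ConditionalWriteBenchmark {

    private int[] data;

    @Setup(Level.Invocation)
    public void fill() {
        Random random = new Random(42);
        data = new int[1_000_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = random.nextBoolean() ? 1234 : 0;   // ~50% already hold the target value
        }
    }

    @Benchmark
    public int[] checkThenWrite() {
        for (int i = 0; i < data.length; i++) {
            if (data[i] != 1234) {
                data[i] = 1234;
            }
        }
        return data;
    }

    @Benchmark
    public int[] alwaysWrite() {
        for (int i = 0; i < data.length; i++) {
            data[i] = 1234;
        }
        return data;
    }
}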
Both those operations are free: they really take almost no time!
Now if this code is in a loop, you should definitely favor the second option as it will minimize branch mispredictions.
Otherwise, what matters here is what makes the code the more readable. And again in my opinion, the second option is clearer.
Also, as mentioned in the comments, assigning is an atomic operation, which makes it thread-safe - another advantage of the second option.
They are not free. They cost time and space. And branching in a tight loop can actually be very costly because of branch prediction (kids these days and their modern CPUs). See Mysticial's answer.
But mostly it's pointless. Either just set to what it should be or throw when it's not what you expect.
Any code you make me read had better have a good reason to exist.
What I think you are trying to do is express what you expect its value to be and assert that it should be that. Without context I can't tell you whether you should throw when your expectations are violated or simply assign to assert what it should be. But making your expectations clear and enforcing them one way or another is certainly worth a few CPU cycles. I'd rather you were a little slower than quickly giving me garbage in and garbage out.
I believe this is actually a general question rather than a Java-specific one, because of the low level of these operations (CPU level, not JVM level).
First of all, let's see what the choice is. On one hand we have a read from memory + a comparison + (optionally) a write to memory; on the other hand, just a write to memory.
Memory access is much more expensive than register operations (operations on data already loaded into the CPU). Therefore, the choice is read + (sometimes) write vs. write.
What is more expensive, a read or a write? Short answer: a write. Long answer: a write, but the difference is probably small and depends on the system's caching strategy. It is not easy to explain in a few words; you can learn more about caching in the beautiful book "Operating Systems" by William Stallings.
I think in practice you can ignore the distinction between read and write operations and just write without a test. That is because (returning to Java) your object, with all its fields, will be in the cache at that moment.
Another thing to consider is branch prediction - others have already mentioned that this, too, is a reason to just write the value without a test.
It depends on what you're really interested in.
If this is a plain old vanilla program, not only does the fetch/compare/branch of the first scheme take extra time, but it's extra code and complexity, and even if the first scheme did actually save a minuscule amount of time (instead of costing time) it wouldn't be worth doing.
However, there are scenarios where it would be different. In an intensely multi-threaded environment with multiple processors modifying shared storage can be expensive, since changes to that storage need to be propagated to other processors. In such an environment it could be well worth it to spend a few extra instructions to avoid "dirtying" cache.
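A hypothetical sketch of that multi-threaded case: many threads may try to set the same shared flag, and reading before writing keeps the cache line in a shared state on most calls instead of forcing an invalidation in the other cores' caches. (The check-then-write here is only safe because the transition is one-way and idempotent.)

public class SharedFlag {

    private volatile boolean shutdownRequested;

    public void requestShutdown() {
        // Only the first caller actually dirties the cache line;
        // later callers see 'true' and skip the write entirely.
        if (!shutdownRequested) {
            shutdownRequested = true;
        }
    }

    public boolean isShutdownRequested() {
        return shutdownRequested;
    }
}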
First of all, I would like to know what the fundamental difference is between loop optimization and loop transformation. Also: a simple loop in C follows:
for (i = 0; i < N; i++)
{
    a[i] = b[i] * c[i];
}
but we can unroll it to:
for (i = 0; i < N/2; i++)
{
    a[i*2] = b[i*2] * c[i*2];
    a[i*2 + 1] = b[i*2 + 1] * c[i*2 + 1];
}
But we can unroll it even further. What is the limit up to which we can unroll it, and how do we find that?
There are many more techniques, like loop tiling, loop distribution, etc. How do we determine when to use the appropriate one?
I will assume that the OP has already profiled his/her code and has discovered that this piece of code is actually important, and actually answer the question :-) :
The compiler will try to make the loop unrolling decision based on what it knows about your code and the processor architecture.
In terms of making things faster:
As someone pointed out, unrolling does reduce the number of loop termination condition compares and jumps.
Depending on the architecture, the hardware may also support an efficient way to index nearby memory locations (e.g., mov eax, [ebx + 4]) without adding additional instructions (this may expand to more micro-ops, though - not sure).
Most modern processors use out of order execution, to find instruction level parallelism. This is hard to do, when the next N instructions are after multiple conditional jumps (i.e., the hardware would need to be able to discard variable levels of speculation).
There is more opportunity to reorder memory operations earlier so that the data fetch latency is hidden.
Code vectorization (e.g., converting to SSE/AVX), may also occur which allows parallel execution of the code in some cases. This is also a form of unrolling.
In terms of deciding when to stop unrolling:
Unrolling increases code size. The compiler knows that there are penalties for exceeding the instruction cache size (all modern processors), trace cache (P4), loop buffer cache (Core2/Nehalem/SandyBridge), micro-op cache (SandyBridge), etc. Ideally it uses static cost-benefit heuristics (a function of the specific code and architecture) to determine which level of unrolling will result in the best overall net performance. Depending on the compiler, the heuristics may vary (often I find that it would be nice to tweak this oneself).
Generally, if the loop contains a large amount of code it is less likely to be unrolled because the loop cost is already amortized, there is plenty of ILP available, and the code bloat cost of unrolling is excessive. For smaller pieces of code, the loop is likely to be unrolled, since the cost is likely to be low. The actual number of unrolls will depend on the specifics of the architecture, compiler heuristics and code, and will be what the compiler decides is optimal (it may not be :-) ).
In terms of when YOU should be doing these optimizations:
When you don't think the compiler did the correct thing. The compiler may not be sophisticated (or sufficiently up to date) enough to use the knowledge of the architecture you are working on optimally.
Possibly, the heuristics just failed (they are just heuristics, after all). In general, if you know the piece of code is very important, try unrolling it; if that improves performance, keep it, otherwise throw it out. Also, only do this when you have roughly the whole system in place, since what may be beneficial when your code's working set is 20k may not be beneficial when it is 31k.
This may seem rather off topic to your question but I cannot but stress the importance of this.
The key is to write correct code and get it working as per the requirements without worrying about micro-optimization.
If later you find your program lacking in performance, then profile your application to find the problem areas and try to optimize them.
Remember, as one of the wise guys said: it is only 10% of your code that runs 90% of the total run time of your application. The trick is to identify that code through profiling and then try to optimize it.
Well, considering that your first attempt at optimizing is already wrong in 50% of all cases (try any odd N), I really wouldn't try anything more complex.
Also, instead of multiplying your indices, just add 2 to i and loop up to N again (see the sketch below) - that avoids the unnecessary shifting (a minor effect as long as we stay with powers of 2, but still).
To summarize: you created incorrect, slower code than what a compiler could produce - that's the perfect example of why you shouldn't do this stuff, I assume.
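For reference, a sketch of what a correct unroll-by-2 would look like, stepping the index by 2 and handling an odd N with a tail iteration (written in Java here; a decent optimizing compiler will produce something equivalent on its own):

static void multiplyUnrolled(int[] a, int[] b, int[] c) {
    int n = a.length;
    int i = 0;
    for (; i + 1 < n; i += 2) {       // two elements per iteration, no index multiplication
        a[i]     = b[i]     * c[i];
        a[i + 1] = b[i + 1] * c[i + 1];
    }
    if (i < n) {                      // tail: handles an odd n
        a[i] = b[i] * c[i];
    }
}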
Is there a way to tell how expensive an operation is for the processor, in milliseconds or FLOPs?
I would be interested in instanceof and casts (I heard they are very "expensive").
Are there some studies about that?
It will depend on which JVM you're using, and the cost of many operations can vary even within the same JVM, depending on the exact situation and how much optimization the JIT has performed.
For example, a virtual method call can still be inlined by the Hotspot JIT - so long as it hasn't been overridden by anything else. In some cases with the server JIT it can still be inlined with a quick type test, for up to a couple of types.
Basically, JITs are complex enough that there's unlikely to be a meaningful general-purpose answer to the question. You should benchmark your own specific situation in as real-world a way as possible. You should usually write code with the primary goals of simplicity and readability - but measure the performance regularly.
The times when counting instructions or cycles could give you a good idea about the performance of some code are long gone, thanks to the many, many optimizations happening at all levels of software execution.
This is especially true for VM-based languages, where the JVM can simply skip some steps because it knows that it's not necessary.
For example, I've read some time ago in an article (I'll try to find and link it eventually) that these two methods are pretty much equivalent in cost (on the HotSpot JVM, that is):
public void frobnicate1(Object o) {
    if (!(o instanceof SomeClass)) {
        throw new IllegalArgumentException("Oh Noes!");
    }
    frobnicateSomeClass((SomeClass) o);
}

public void frobnicate2(Object o) {
    frobnicateSomeClass((SomeClass) o);
}
Obviously the first method does more work, but the JVM knows that the type of o has already been checked in the if and can actually skip the type-check on the cast later on and make it a no-op.
This and many other optimizations make counting "flops" or cycles pretty much useless.
Generally speaking an instanceof check is relatively cheap. On the HotSpot JVM it boils down to a numeric check of the type id in the object header.
This classic article describes why you should "Write Dumb Code".
There's also an article from 2002 that describes how instanceof is optimized in the HotSpot JVM.
Once the JVM has warmed up, most operations can be counted in nanoseconds (millionths of a millisecond). When talking about something being expensive, you usually have to say it's expensive relative to an alternative; it's next to impossible to describe something as expensive in all cases.
Usually, the most important expense is your time (and that of the other developers on your team). Using instanceof can be expensive in development and code-support time because it often indicates a poor design; using proper OOP techniques is usually a better idea. The 10 nanoseconds an instanceof check might take are usually relatively trivial.
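As an illustration of what "proper OOP techniques" means here (the shape classes are purely hypothetical), a chain of instanceof checks can usually be replaced by a virtual method:

interface Shape {
    double area();
}

class Circle implements Shape {
    private final double radius;
    Circle(double radius) { this.radius = radius; }
    @Override public double area() { return Math.PI * radius * radius; }
}

class Square implements Shape {
    private final double side;
    Square(double side) { this.side = side; }
    @Override public double area() { return side * side; }
}

class AreaCalculator {
    // Instead of: if (s instanceof Circle) { ... } else if (s instanceof Square) { ... }
    static double totalArea(Iterable<Shape> shapes) {
        double total = 0;
        for (Shape s : shapes) {
            total += s.area();    // virtual dispatch replaces the instanceof chain
        }
        return total;
    }
}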
The cost of specific operations performed inside the CPU is almost never relevant for performance. If performance is bad, it's almost always because of IO (network, disk) or inefficient code. Writing efficient code is much more about finding a way to reduce the overall number of operations than about avoiding "costly" operations (except those that are orders of magnitude more costly, like IO).