When I was learning C, I was taught that if I wanted to loop through something strlen(string) times, I should save that value in an 'auxiliary' variable, say count, and use that in the for condition instead of the strlen call, so it wouldn't be evaluated over and over.
Now that I've started learning Java, I've noticed this is not the norm. I've seen lots of code and programmers doing exactly what I was told not to do in C.
What's the reason for this? Are they trading efficiency for readability? Or does the compiler manage to fix that?
Edit: This is NOT a duplicate of the linked question. I'm not asking simply about string length; mine is a more general question.
In the old days, every function call was expensive, compilers were dumb, usable profilers were yet to come, and computers were slow. That is how C macros and other terrible things were born. Java is not that old.
Efficiency is important, but the impact of most program parts on efficiency is very small. Reading code, however, still takes programmer time, and that is much more costly than CPU time. So we'd better optimize for readability most of the time and worry about speed only in the most important places.
A local variable can make the code simpler when it avoids repeating a complicated expression - this happens sometimes. It can make the code faster when it avoids an expensive computation that the compiler can't eliminate - this happens rather rarely. When neither condition is met, it's just a wasted line, so why bother?
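To make the trade-off concrete, here is a minimal Java sketch of the two styles from the question (the string and loop body are just placeholders):
public class LengthInLoop {
    public static void main(String[] args) {
        String s = "hello world";
        // Style 1: call length() in the loop condition. For String this is a cheap,
        // constant-time call that the JIT can inline, so it rarely matters.
        for (int i = 0; i < s.length(); i++) {
            System.out.println(s.charAt(i));
        }
        // Style 2: cache the value in a local variable. Worth it mainly when the
        // expression is genuinely expensive or when naming it improves readability.
        int len = s.length();
        for (int i = 0; i < len; i++) {
            System.out.println(s.charAt(i));
        }
    }
}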
Related
After reading online, I have found that Java does not optimize tail recursion.
So is there any point in using it, if head and tail recursion would yield the same result?
Moreover, do loops always perform better than recursion (tail and head)? It is sometimes easier to use recursion without thinking about the iterations.
Please do tell if I should use loops.
Please correct me if I am wrong as I have just started with recursion.
Yes, the performance of a recursive algorithm in Java is almost always significantly worse than rewriting the same thing using a loop. And using a loop is generally just as easy:
Make a stack or deque object.
Make a class that represents all relevant state.
Write a loop that grabs stuff off the stack or deque and operates on it.
As part of 'operating on it', you are free to pile on new jobs - analogous to calling yourself.
That 'formula' should work for any recursive algorithm.
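A minimal sketch of that 'formula' in Java - the Node class and the tree-summing task are just illustrative stand-ins for whatever your recursive algorithm does:
import java.util.ArrayDeque;
import java.util.Deque;

public class IterativeSum {

    // Illustrative recursive data structure.
    static class Node {
        int value;
        Node left, right;
        Node(int value, Node left, Node right) {
            this.value = value;
            this.left = left;
            this.right = right;
        }
    }

    // Iterative replacement for: sum(node) = node.value + sum(node.left) + sum(node.right).
    static int sumIterative(Node root) {
        Deque<Node> stack = new ArrayDeque<>(); // the explicit "call stack"
        if (root != null) {
            stack.push(root);
        }
        int total = 0;
        while (!stack.isEmpty()) {
            Node current = stack.pop();                            // grab a pending job
            total += current.value;
            if (current.left != null) stack.push(current.left);   // pile on new jobs,
            if (current.right != null) stack.push(current.right); // analogous to recursing
        }
        return total;
    }

    public static void main(String[] args) {
        Node tree = new Node(1, new Node(2, null, null), new Node(3, null, null));
        System.out.println(sumIterative(tree)); // prints 6
    }
}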
However, the vast majority of code you write just doesn't matter, performance-wise. Quite literally, if your app is having any measurable effect on the CPU at all, it's almost always the case that 99% of the CPU resources your app uses are consumed by 0.1% of your entire codebase.
The job then is obviously to [A] find that 0.1% and [B] make it more efficient.
The remaining 99.9% just does not matter. This is not a 'death by a thousand cuts' situation - it really doesn't matter. You can write code that is 10 to 100x less efficient than it could be, and even if you make such a mistake tons of times, as long as it isn't in that 0.1% crucial path, you'd never notice, and neither will your users.
So, in that sense, if you think it's easier to write and read your code if you use recursion, knock yourself out. Just know that if your profiler tells you that this recursive algorithm is in that 0.1% (the crucial path), then yes, step 1: rewrite it away from recursion.
Sidenote: as long as you don't recurse too deeply, JVMs can optimize quite a bit. Some VMs, such as Azul's, go so far as to eliminate a bunch of your stack trace if it is repetitive (recursive algorithms have repetitive stack traces). Thus, even a recursive algorithm in Java can be fine, performance-wise. It's just a lot harder to get this result reliably, as you're now relying on the optimizations made in custom VM implementations.
I use the following example to illustrate the point, but note that it could be any other task.
for (int i = 0; i < 1000; i++) {
    String a = "world";
    Log.d("hello", a);
}
Versus
String a="";
a="world";
Log.d("hello",a);
String a="";
a="world";
Log.d("hello",a);
String a="";
a="world";
Log.d("hello",a);
String a="";
a="world";
Log.d("hello",a);
String a="";
a="world";
Log.d("hello",a);
...
//1000 times
Let's ignore readability and code quality and look only at the performance of the compiled program.
So which one is better?
The one and only answer: this doesn't matter at all.
What I mean is: aspects such as readability and a good design are much more worth spending your time on.
When, and only when, you have managed to design a superb application, and you are down to the point where you have a performance problem that forces you to look into this specific question - then there would be merit in looking into it. But I am 99.9% sure you are not at that point. And in that sense, you are wasting your time with such thoughts!
In other words: especially within the "JVM stack" there are tons and tons of things that can have subtle or not-so-subtle effects on performance. For example, there are many, many options that influence the inner workings of garbage collection and of the just-in-time compiler. Those will affect the "performance" of your running application in much more significant ways than manually unrolling loop code.
And for the record: many benchmarking experiments have shown that the JIT does its best job when you give it "normal" code. Meaning: the Oracle JIT is designed to give you the "best results" on "normal" input. But as soon as you start to fiddle around with your Java code in order to somehow create "optimized" code, chances are that your changes make it harder for the JIT to do a good job.
Thus: please forget about such micro-optimizations. If you spend the same amount of time learning about good OO design and maybe general Java performance topics, you will gain much more from that!
So which one is better?
The first one, since it actually compiles (there's a typo in the second). Non-compiling code is absolutely the least performant code.
But also the first one because, despite you saying "ignore readability and code quality", readability is almost always more important than performance. Write code to be readable first, and make it performant only afterwards.
And the first one because the JIT will unroll the loop for you if it determines that doing so gives better performance on the specific JVM the code is running on, for the specific data the code is run on (I'm assuming you don't simply like filling your logs with the same message over and over).
Don't try to second guess the optimizations the JIT will apply. Applying the optimizations yourself might make performance worse, because it makes it harder for the JIT to optimize the code. Write clear, maintainable code, and let the JVM work its magic.
It depends on the programming language. The compiler is responsible for compressing your code in such a way that repeated calls can be avoided at the low level. Object-oriented languages like JavaScript sit at a relatively high level, and therefore compilation may take some time to run.
So, to conclude briefly based largely on assumptions from your question: the only impacts are the overall file size and the time taken to compile.
What you are doing is called "Loop Unrolling".
Check: https://en.wikipedia.org/wiki/Loop_unrolling
This is one of the optimizations that compilers will perform if the code is compiled with an appropriate optimization level.
In my view, your second option should run faster than the first one, because in the first option the code has to increment the loop counter, check it against the loop condition, and take the appropriate branch decision. Branching itself can be expensive if there is a branch misprediction, since on a misprediction the CPU has to flush the pipeline and restart the instructions. Check: https://en.wikipedia.org/wiki/Branch_misprediction
Which is faster:
if (this.foo != 1234)
    this.foo = 1234;
or
this.foo = 1234;
Is the penalty of a write high enough that one should check the value before writing, or is it faster to just write?
Wouldn't having a branch cause possible mispredictions and screw up the CPU pipeline? And what if the field is volatile, with writes having a higher cost than reads?
Yes, it is easy to say that in isolation these operations are 'free', or to benchmark them, but that is not an answer.
There is a nice example illustrating this dilemma in the very recent talk by Sergey Kuksenko about hardware counters (slides 45-49), where the right answer for the same code depends on the data size! The idea is that the "compare and set" approach causes more branch misses and loads, but fewer stores and L1 store misses. The difference is subtle, and I can't even rationalize why one factor outweighs the other on small data sizes but becomes less significant on large data sizes.
So, measure, don't guess.
Both those operations are free: they really take almost no time!
Now if this code is in a loop, you should definitely favor the second option as it will minimize branch mispredictions.
Otherwise, what matters here is what makes the code more readable. And again, in my opinion, the second option is clearer.
Also, as mentioned in the comments, assignment is an atomic operation, which makes it thread-safe - another advantage of the second option.
They are not free. They cost time and space. And branching in a tight loop can actually be very costly because of branch prediction (kids these days and their modern CPUs). See Mysticial's answer.
But mostly it's pointless. Either just set to what it should be or throw when it's not what you expect.
Any code you make me read had better have a good reason to exist.
What I think you are trying to do is express what you expect its value to be and assert that it should be that. Without context I can't tell you whether you should throw when your expectations are violated or simply assign to assert what it should be. But making your expectations clear and enforcing them one way or another is certainly worth a few CPU cycles. I'd rather you were a little slower than quickly giving me garbage in and garbage out.
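For illustration, a minimal sketch of those two options - the field name and the expected value are just the ones from the question:
class Widget {
    private int foo = 1234;

    // Option 1: enforce the expectation by failing loudly when it is violated.
    void checkFoo() {
        if (this.foo != 1234) {
            throw new IllegalStateException("expected foo == 1234, was " + this.foo);
        }
    }

    // Option 2: simply assert the desired state by assigning it.
    void resetFoo() {
        this.foo = 1234;
    }
}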
I believe this is actually a general question rather than a Java-specific one, because of the low level of these operations (CPU level, not JVM level).
First of all, let's see what the choice is. On one hand we have reading from memory + a comparison + (optionally) writing to memory; on the other hand, just writing to memory.
Memory access is much more expensive than register operations (operations on data already loaded into the CPU). Therefore, the choice is read + (sometimes) write vs. write.
What is more expensive, a read or a write? Short answer: a write. Long answer: a write, but the difference is probably small and depends on the system's caching strategy. It is not easy to explain in a few words; you can learn more about caching in the beautiful book "Operating Systems" by William Stallings.
I think in practice you can ignore the distinction between read and write operations and just write without the test. That is because (returning to Java) your object with all its fields will be in the cache at that moment.
Another thing to consider is branch prediction - others have already mentioned that this is another reason to just write the value without the test.
It depends on what you're really interested in.
If this is a plain old vanilla program, not only does the fetch/compare/branch of the first scheme take extra time, but it's extra code and complexity, and even if the first scheme did actually save a minuscule amount of time (instead of costing time) it wouldn't be worth doing.
However, there are scenarios where it would be different. In an intensely multi-threaded environment with multiple processors, modifying shared storage can be expensive, since changes to that storage need to be propagated to other processors. In such an environment it can be well worth spending a few extra instructions to avoid "dirtying" cache lines.
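A minimal sketch of that idea, assuming a shared volatile flag (the names are illustrative); the check-then-write is harmless here because every writer stores the same value:
class SharedFlag {
    private volatile boolean done;

    // Many threads may call this; checking first avoids redundant writes that
    // would invalidate the cache line on other cores when the flag is already set.
    void markDone() {
        if (!done) {
            done = true;
        }
    }

    boolean isDone() {
        return done;
    }
}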
First of all, I would like to know what the fundamental difference is between loop optimization and loop transformation. Also, a simple loop in C follows:
for (i = 0; i < N; i++)
{
    a[i] = b[i] * c[i];
}
but we can unroll it to:
for (i = 0; i < N/2; i++)
{
    a[i*2] = b[i*2] * c[i*2];
    a[i*2 + 1] = b[i*2 + 1] * c[i*2 + 1];
}
But we can unroll it even further. What is the limit up to which we can unroll it, and how do we find that?
There are many more techniques like loop tiling, loop distribution, etc. How do we determine when to use the appropriate one?
I will assume that the OP has already profiled his/her code and has discovered that this piece of code is actually important, and actually answer the question :-) :
The compiler will try to make the loop unrolling decision based on what it knows about your code and the processor architecture.
In terms of making things faster:
As someone pointed out, unrolling does reduce the number of loop termination condition compares and jumps.
Depending on the architecture, the hardware may also support an efficient way to index nearby memory locations (e.g., mov eax, [ebx + 4]) without adding additional instructions (this may expand to more micro-ops, though - not sure).
Most modern processors use out of order execution, to find instruction level parallelism. This is hard to do, when the next N instructions are after multiple conditional jumps (i.e., the hardware would need to be able to discard variable levels of speculation).
There is more opportunity to reorder memory operations earlier so that the data fetch latency is hidden.
Code vectorization (e.g., converting to SSE/AVX), may also occur which allows parallel execution of the code in some cases. This is also a form of unrolling.
In terms of deciding when to stop unrolling:
Unrolling increases code size. The compiler knows that there are penalties for exceeding the instruction cache size (all modern processors), trace cache (P4), loop buffer cache (Core2/Nehalem/SandyBridge), micro-op cache (SandyBridge), etc. Ideally it uses static cost-benefit heuristics (a function of the specific code and architecture) to determine which level of unrolling will result in the best overall net performance. Depending on the compiler, the heuristics may vary (often I find that it would be nice to tweak this oneself).
Generally, if the loop contains a large amount of code it is less likely to be unrolled because the loop cost is already amortized, there is plenty of ILP available, and the code bloat cost of unrolling is excessive. For smaller pieces of code, the loop is likely to be unrolled, since the cost is likely to be low. The actual number of unrolls will depend on the specifics of the architecture, compiler heuristics and code, and will be what the compiler decides is optimal (it may not be :-) ).
In terms of when YOU should be doing these optimizations:
When you don't think the compiler did the correct thing. The compiler may not be sophisticated (or sufficiently up to date) enough to use the knowledge of the architecture you are working on optimally.
Possibly, the heuristics just failed (they are only heuristics, after all). In general, if you know the piece of code is very important, try unrolling it; if that improves performance, keep it, otherwise throw it out. Also, only do this when you have roughly the whole system in place, since what may be beneficial when your code's working set is 20k may not be beneficial when it is 31k.
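A very crude way to check whether such a tweak actually helped - a proper harness such as JMH is preferable, and the workload here is just a placeholder for the code under test:
public class CrudeTimer {

    // Placeholder for the code whose variants you want to compare.
    static long workload(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += (long) i * i;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Warm up so the JIT has compiled the hot method before we measure.
        for (int i = 0; i < 10_000; i++) {
            workload(1_000);
        }
        long sink = 0;
        long start = System.nanoTime();
        for (int i = 0; i < 10_000; i++) {
            sink += workload(1_000); // keep the result alive so the work isn't optimized away
        }
        long elapsed = System.nanoTime() - start;
        System.out.println("avg ns per call: " + elapsed / 10_000 + " (sink=" + sink + ")");
    }
}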
This may seem rather off topic to your question but I cannot but stress the importance of this.
The key is to write correct code and get it working as per the requirements without worrying about micro-optimization.
If you later find your program lacking in performance, then profile(!) your application to find the problem areas and then try to optimize them.
Remember, as one of the wise guys said, it is only 10% of your code that runs for 90% of your application's total run time. The trick is to identify that code through profiling and then try to optimize it.
Well, considering that your first attempt at optimizing is already wrong in 50% of all cases (try any odd N), I really wouldn't try anything more complex.
Also, instead of multiplying your indices, just add 2 to i and loop up to N again - that avoids the unnecessary shifting (a minor effect as long as we stay with powers of 2, but still).
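For what it's worth, a minimal sketch of that stride-2 form (written in Java like the rest of this page; the arrays are placeholders), including a cleanup iteration so an odd N is handled correctly:
class Unroll2 {
    static void multiplyPairs(double[] a, double[] b, double[] c) {
        int n = a.length;
        int i = 0;
        // Step i by 2 instead of multiplying indices.
        for (; i + 1 < n; i += 2) {
            a[i] = b[i] * c[i];
            a[i + 1] = b[i + 1] * c[i + 1];
        }
        // Cleanup iteration for the leftover element when n is odd.
        for (; i < n; i++) {
            a[i] = b[i] * c[i];
        }
    }
}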
To summarize: you created incorrect code that is slower than what a compiler could produce - which is, I assume, the perfect example of why you shouldn't do this stuff.
I have endeavored to implement Dixon's algorithm concurrently, with poor results. For small numbers (< ~40 bits) it runs in about twice the time of the other implementations in my class, and beyond about 40 bits it takes far longer.
I've done everything I can, but I fear it has some fatal issue that I can't find.
My code (fairly lengthy) is located here. Ideally the algorithm would work faster than non-concurrent implementations.
Why would you think it would be faster? Spinning up a thread and adding synchronized calls are HUGE time sinks. If you can't avoid the synchronized keyword, I highly recommend a single-threaded solution.
You may be able to avoid them in various ways - for instance, by ensuring that a given variable is only written by one thread even if read by others, or by acting like a functional language: making all your variables final and using recursion for variable storage (iffy - it's hard to imagine this would speed anything up).
If you really need to be fast, however, I did find some very counter-intuitive things out recently from my own attempt at finding a speedy solution...
Static methods didn't help over actual class instances.
Breaking the code down into smaller classes and methods actually INCREASED speed.
Final methods helped more than I would have thought they would.
At one point I noticed that adding a method call actually helped speed things along.
Don't stress over one-time class allocations or data allocations but avoid allocating objects in loops (This one is obvious but I think it's the most critical)
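To illustrate that last point, a minimal sketch - string concatenation is just one common example of allocating inside a loop:
class Concat {
    // Typically allocates a new String (and a temporary StringBuilder) on every iteration.
    static String joinSlow(String[] parts) {
        String result = "";
        for (String p : parts) {
            result += p;
        }
        return result;
    }

    // One builder allocated up front and reused across the loop.
    static String joinFast(String[] parts) {
        StringBuilder sb = new StringBuilder();
        for (String p : parts) {
            sb.append(p);
        }
        return sb.toString();
    }
}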
What I've been able to intuit is that the compiler is extremely smart at optimizing and is tuned to optimize "ideal" Java code. Static methods are nowhere near ideal - they are something of an anti-pattern.
I suggest you write the clearest, best OO code you can that actually runs correctly as a reference--then time it and start attempting tweaks to speed it up.