Optimizing the creation of objects inside loops

Which of the following would be more optimal on a Java 6 HotSpot VM?
final Map<Foo,Bar> map = new HashMap<Foo,Bar>(someNotSoLargeNumber);
for (int i = 0; i < someLargeNumber; i++)
{
doSomethingWithMap(map);
map.clear();
}
or
final int someNotSoLargeNumber = ...;
for (int i = 0; i < someLargeNumber; i++)
{
final Map<Foo,Bar> map = new HashMap<Foo,Bar>(someNotSoLargeNumber);
doSomethingWithMap(map);
}
I think they're both as clear to the intent, so I don't think style/added complexity is an issue here.
Intuitively it looks like the first one would be better as there's only one 'new'. However, given that no reference to the map is held onto, would HotSpot be able to determine that a map of the same size (Entry[someNotSoLargeNumber] internally) is being created for each loop and then use the same block of memory (i.e. not do a lot of memory allocation, just zeroing that might be quicker than calling clear() for each loop)?
An acceptable answer would be a link to a document describing the different types of optimisations the HotSpot VM can actually do, and how to write code to assist HotSpot (rather than naive attempts at optimising the code by hand).

Don't spend your time on such micro-optimizations unless your profiler says you should. In particular, Sun claims that modern garbage collectors do very well with short-lived objects and that new becomes cheaper and cheaper:
Garbage collection and performance on DeveloperWorks

That's a pretty tight loop over a "fairly large number", so generally I would say move the instantiation outside of the loop. But, overall, my guess is you aren't going to notice much of a difference as I am willing to bet that your doSomethingWithMap will take up the majority of time to allow the GC to catch up.
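To check the two variants on a particular JVM, a rough comparison could look like the following. This is a minimal sketch, not a rigorous benchmark: the class name MapReuseSketch, the fill helper (a stand-in for doSomethingWithMap) and the loop sizes are assumptions made up for illustration, and a proper harness or a profiler would give more trustworthy numbers.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: compares reusing one map (clear() per iteration)
// with allocating a fresh map per iteration. All names are illustrative.
class MapReuseSketch {

    // Stand-in for doSomethingWithMap(map): fills the map and reads it back.
    static long fill(Map<Integer, Integer> map, int entries) {
        long sum = 0;
        for (int k = 0; k < entries; k++) {
            map.put(k, k * 2);
        }
        for (int v : map.values()) {
            sum += v;
        }
        return sum;
    }

    // Variant 1: one map, cleared after every iteration.
    static long reuse(int iterations, int entries) {
        final Map<Integer, Integer> map = new HashMap<Integer, Integer>(entries);
        long total = 0;
        for (int i = 0; i < iterations; i++) {
            total += fill(map, entries);
            map.clear();
        }
        return total;
    }

    // Variant 2: a new map per iteration, left to the garbage collector.
    static long fresh(int iterations, int entries) {
        long total = 0;
        for (int i = 0; i < iterations; i++) {
            final Map<Integer, Integer> map = new HashMap<Integer, Integer>(entries);
            total += fill(map, entries);
        }
        return total;
    }

    public static void main(String[] args) {
        int iterations = 200000;
        int entries = 16;
        for (int round = 0; round < 5; round++) { // repeat so the JIT can warm up
            long t0 = System.nanoTime();
            long a = reuse(iterations, entries);
            long t1 = System.nanoTime();
            long b = fresh(iterations, entries);
            long t2 = System.nanoTime();
            System.out.println("reuse=" + (t1 - t0) / 1000 + " us, fresh="
                    + (t2 - t1) / 1000 + " us (checksums " + a + " / " + b + ")");
        }
    }
}
```

Both variants compute the same checksum, so any timing difference comes from clear() versus allocation plus garbage collection. Expect the first rounds to be dominated by JIT warm-up.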

Related

Are object initializations in Java "Foo f = new Foo() " essentially the same as using malloc for a pointer in C?

I am trying to understand the actual process behind object creations in Java - and I suppose other programming languages.
Would it be wrong to assume that object initialization in Java is the same as when you use malloc for a structure in C?
Example:
Foo f = new Foo(10);
typedef struct foo Foo;
Foo *f = malloc(sizeof(Foo));
Is this why objects are said to be on the heap rather than the stack? Because they are essentially just pointers to data?

In C, malloc() allocates a region of memory in the heap and returns a pointer to it. That's all you get. Memory is uninitialized and you have no guarantee that it's all zeros or anything else.
In Java, calling new does a heap based allocation just like malloc(), but you also get a ton of additional convenience (or overhead, if you prefer). For example, you don't have to explicitly specify the number of bytes to be allocated. The compiler figures it out for you based on the type of object you're trying to allocate. Additionally, object constructors are called (which you can pass arguments to if you'd like to control how initialization occurs). When new returns, you're guaranteed to have an object that's initialized.
But yes, at the end of the call both the result of malloc() and new are simply pointers to some chunk of heap-based data.
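The initialization guarantee described above can be seen in a small Java example. This is a minimal sketch; the class NewVsMallocSketch and the fields of Foo are made up for illustration and are not from the question.

```java
// Hypothetical class for illustration; not from the question.
class NewVsMallocSketch {

    static class Foo {
        int value;    // set by the constructor
        int counter;  // never assigned: Java guarantees it starts at 0
        int[] data;   // never assigned: Java guarantees it starts as null

        Foo(int value) {
            this.value = value;
        }
    }

    public static void main(String[] args) {
        // new allocates the object, zeroes its fields, then runs the constructor.
        Foo f = new Foo(10);
        System.out.println(f.value + " " + f.counter + " " + f.data); // prints "10 0 null"
    }
}
```

With malloc() in C the equivalent of counter and data would hold whatever happened to be in that memory until the program wrote to them.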
The second part of your question asks about the differences between a stack and a heap. Far more comprehensive answers can be found by taking a course on (or reading a book about) compiler design. A course on operating systems would also be helpful. There are also numerous questions and answers on SO about the stacks and heaps.
Having said that, I'll give a general overview I hope isn't too verbose and aims to explain the differences at a fairly high level.
Fundamentally, the main reason to have two memory management systems, i.e. a heap and a stack, is for efficiency. A secondary reason is that each is better at certain types of problems than the other.
Stacks are somewhat easier for me to understand as a concept, so I'll start with stacks. Let's consider this function in C...
int add(int lhs, int rhs) {
int result = lhs + rhs;
return result;
}
The above seems fairly straightforward. We define a function named add() and pass in the left and right addends. The function adds them and returns a result. Please ignore all edge-case stuff such as overflows that might occur, at this point it isn't germane to the discussion.
The add() function's purpose seems pretty straightforward, but what can we tell about its lifecycle? Especially its memory utilization needs?
Most importantly, the compiler knows a priori (i.e. at compile time) how large the data types are and how many will be used. The lhs and rhs arguments are each sizeof(int) bytes, typically 4 bytes. The variable result is also sizeof(int). Assuming 4-byte ints, the compiler can tell that the add() function uses 4 bytes * 3 ints, or a total of 12 bytes of memory.
When the add() function is called, a hardware register called the stack pointer will have an address in it that points to the top of the stack. In order to allocate the memory the add() function needs to run, all the function-entry code needs to do is issue one single assembly language instruction to decrement the stack pointer register value by 12. In doing so, it creates storage on the stack for three ints, one each for lhs, rhs, and result. Getting the memory space you need by executing a single instruction is a massive win in terms of speed because single instructions tend to execute in one clock tick (one billionth of a second on a 1 GHz CPU).
Also, from the compiler's view, it can create a map to the variables that looks an awful lot like indexing an array:
lhs: ((int *)stack_pointer_register)[0]
rhs: ((int *)stack_pointer_register)[1]
result: ((int *)stack_pointer_register)[2]
Again, all of this is very fast.
When the add() function exits it has to clean up. It does this by adding 12 bytes back to the stack pointer register, undoing the earlier decrement. It's similar to a call to free() but it only uses one CPU instruction and it only takes one tick. It's very, very fast.
Now consider a heap-based allocation. This comes into play when we don't know a priori how much memory we're going to need (i.e. we'll only learn about it at runtime).
Consider this function:
#include <stdlib.h>

int addRandom(int count) {
    /* The size is only known at runtime, so the array comes from the heap. */
    size_t numberOfBytesToAllocate = sizeof(int) * (size_t) count;
    int *array = malloc(numberOfBytesToAllocate);
    int result = 0;
    if (array != NULL) {
        for (int i = 0; i < count; ++i) {
            array[i] = (int) random();
            result += array[i];
        }
        free(array);
    }
    return result;
}
Notice that the addRandom() function doesn't know at compile time what the value of the count argument will be. Because of this, it doesn't make sense to try to define array like we would if we were putting it on the stack, like this:
int array[count];
If count is huge it could cause our stack to grow too large and overwrite other program segments. When this stack overflow happens your program crashes (or worse).
So, in cases where we don't know how much memory we'll need until runtime, we use malloc(). Then we can just ask for the number of bytes we need when we need it, and malloc() will go check if it can vend that many bytes. If it can, great, we get it back, if not, we get a NULL pointer which tells us the call to malloc() failed. Notably though, the program doesn't crash! Of course you as the programmer can decide that your program isn't allowed to run if resource allocation fails, but programmer-initiated termination is different than a spurious crash.
So now we have to come back to look at efficiency. The stack allocator is super fast - one instruction to allocate, one instruction to deallocate, and it's done by the compiler, but remember the stack is meant for things like local variables of a known size so it tends to be fairly small.
The heap allocator on the other hand is several orders of magnitude slower. It has to do a lookup in tables to see if it has enough free memory to be able to vend the amount of memory the user wants. It has to update those tables after it vends the memory to make sure no one else can use that block (this bookkeeping may require the allocator to reserve memory for itself in addition to what it plans to vend). The allocator has to employ locking strategies to make sure it vends memory in a thread-safe way. And when memory is finally free()d, which happens at different times and in no predictable order typically, the allocator has to find contiguous blocks and stitch them back together to repair heap fragmentation. If that sounds like it's going to take more than a single CPU instruction to accomplish all of that, you're right! It's very complicated and it takes a while.
But heaps are big. Much larger than stacks. We can get lots of memory from them and they're great when we don't know at compile time how much memory we'll need. So we trade off speed for a managed memory system that declines us politely instead of crashing when we try to allocate something too large.
I hope that helps answer some of your questions. Please let me know if you'd like clarification on any of the above.

Clean code vs performance

Some principles of clean code are:
functions should do one thing at one abstraction level
functions should be at most 20 lines long
functions should never have more than 2 input parameters
How many cpu cycles are "lost" by adding an extra function call in Java?
Are there compiler options available that transform many small functions into one big function in order to optimize performance?
E.g.
void foo() {
bar1();
bar2();
}
void bar1() {
a();
b();
}
void bar2() {
c();
d();
}
Would become
void foo() {
a();
b();
c();
d();
}

How many cpu cycles are "lost" by adding an extra function call in Java?
This depends on whether it is inlined or not. If it is inlined, the cost will be nothing (or a notional amount).
If it is not compiled at runtime, it hardly matters, because the cost of interpreting outweighs any micro-optimisation, and the code is probably not called often enough to matter (which is why it wasn't optimised).
The only time it really matters is when the code is called often but for some reason is prevented from being optimised. I would only assume this is the case because you have a profiler telling you this is a performance issue, and in this case manual inlining might be the answer.
I design, develop and optimise latency-sensitive code in Java, and I choose to manually inline methods much less than 1% of the time, and only after a profiler (e.g. Flight Recorder) suggests there is a significant performance problem.
In the rare event it matters, how much difference does it make?
I would estimate between 0.03 and 0.1 microseconds in real applications for each extra call; in a micro-benchmark it would be far less.
Are there compiler options available that transform many small functions into one big function in order to optimize performance?
Yes. In fact, not only can all these methods be inlined, but the methods which call them can be inlined as well, so that none of them matter at runtime, but only if the code is called often enough to be optimised. That is, not only are a, b, c and d inlined, but foo is inlined into its caller as well.
By default the Oracle JVM can inline to a depth of 9 levels (until the code gets larger than 325 bytes of bytecode).
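To see what the JIT actually inlines, one option is to run a small program with HotSpot's diagnostic flags -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining. The following is a minimal sketch of the question's foo/bar1/bar2 shape; the class name InlineSketch and the arithmetic inside a, b, c and d are assumptions added so the methods have something to compute.

```java
// Hypothetical sketch of the question's foo/bar1/bar2 shape; the bodies of
// a, b, c and d are invented so that there is real work to inline.
class InlineSketch {
    static int a(int x) { return x + 1; }
    static int b(int x) { return x * 2; }
    static int c(int x) { return x - 3; }
    static int d(int x) { return x ^ 5; }

    static int bar1(int x) { return b(a(x)); }
    static int bar2(int x) { return d(c(x)); }

    // "Clean" version: two levels of small methods.
    static int foo(int x) { return bar2(bar1(x)); }

    // Hand-inlined version of the same computation.
    static int fooInlined(int x) { return (((x + 1) * 2) - 3) ^ 5; }

    public static void main(String[] args) {
        long sum = 0;
        for (int i = 0; i < 1000000; i++) { // enough calls for foo to become hot
            sum += foo(i);
        }
        System.out.println(sum);
    }
}
```

Running java -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining InlineSketch prints the inlining decisions the JIT makes for each call site, which shows whether hand-inlining would gain anything. The exact output format varies between JVM versions.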
Will clean code help performance
The JVM runtime optimiser has common patterns it optimises for. Clean, simple code is generally easier to optimise, and when you try something tricky or not obvious, you can end up being much slower. If it is harder for a human to understand, there is a good chance it is also hard for the optimiser to understand and optimise.

Runtime behavior and cleanliness of code (a compile time or life time property of code) belong to different requirement categories. There might be cases where optimizing for one category is detrimental to the other.
The question is: which category really needs your attention?
In my view cleanliness of code (or malleability of software) suffers from a huge lack of attention. You should focus on that first. Only if other requirements start to fall behind (e.g. performance) should you inquire as to whether that's due to how clean the code is. That means you need to really compare: you need to measure the difference it makes. With regard to performance, use a profiler of your choice: run the "dirty" code variant and the clean variant and check the difference. Is it marked? Only if the "dirty" variant is significantly faster should you lower the cleanliness.

Consider the following piece of code, which compares a code that does 3 things in one for loop to another that has 3 different for loops for each task.
@Test
public void singleLoopVsMultiple() {
for (int j = 0; j < 5; j++) {
//single loop
int x = 0, y = 0, z = 0;
long l = System.currentTimeMillis();
for (int i = 0; i < 100000000; i++) {
x++;
y++;
z++;
}
l = System.currentTimeMillis() - l;
//multiple loops doing the same thing
int a = 0, b = 0, c = 0;
long m = System.currentTimeMillis();
for (int i = 0; i < 100000000; i++) {
a++;
}
for (int i = 0; i < 100000000; i++) {
b++;
}
for (int i = 0; i < 100000000; i++) {
c++;
}
m = System.currentTimeMillis() - m;
System.out.println(String.format("%d,%d", l, m));
}
}
When I run it, here is the output I get for time in milliseconds.
6,5
8,0
0,0
0,0
0,0
After a few runs, JVM is able to identify hotspots of intensive code and optimises parts of the code to make them significantly faster. In our previous example, after 2 runs, the JVM had already optimised the code so much that the discussion around for-loops became redundant.
Unless we know what's happening inside, we cannot predict the performance implications of changes like introduction of for-loops. The only way to actually improve the performance of a system is by measuring it and focusing only on fixing the actual bottlenecks.
There is a chance that cleaning your code may make it faster for the JVM. But even if that is not the case, every performance optimisation comes with added code complexity. Ask yourself whether the added complexity is worth the future maintenance effort. After all, the most expensive resource on any team is the developer, not the servers, and any additional complexity slows the developer, adding to the project cost.
The way to deal with it is to figure out your benchmarks, what kind of application you're making, and what the bottlenecks are. If you're making a web app, perhaps the DB is taking most of the time, and reducing the number of functions will not make a difference. On the other hand, if it's an app running on a system where performance is everything, every small thing counts.
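One plausible reason for 0 ms readings in a test like the one above is that the counters are never read, so the JIT is free to drop the loops altogether. The following minimal sketch consumes the results and uses System.nanoTime() for finer resolution. The class and method names are assumptions, and even in this form the JIT may collapse such trivial counting loops, so the numbers remain an illustration of measuring rather than a reliable benchmark.

```java
// Hypothetical rework of the timing test; class and method names are illustrative.
class LoopTimingSketch {

    // One loop doing three increments per iteration.
    static long singleLoop(int n) {
        int x = 0, y = 0, z = 0;
        for (int i = 0; i < n; i++) {
            x++;
            y++;
            z++;
        }
        return (long) x + y + z; // returning the counters keeps the work observable
    }

    // Three loops doing one increment each.
    static long multipleLoops(int n) {
        int a = 0, b = 0, c = 0;
        for (int i = 0; i < n; i++) { a++; }
        for (int i = 0; i < n; i++) { b++; }
        for (int i = 0; i < n; i++) { c++; }
        return (long) a + b + c;
    }

    public static void main(String[] args) {
        final int n = 100000000;
        long checksum = 0;
        for (int j = 0; j < 5; j++) {
            long t0 = System.nanoTime();
            checksum += singleLoop(n);
            long t1 = System.nanoTime();
            checksum += multipleLoops(n);
            long t2 = System.nanoTime();
            System.out.println((t1 - t0) / 1000 + " us, " + (t2 - t1) / 1000 + " us");
        }
        System.out.println("checksum: " + checksum); // uses the results
    }
}
```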

When can Hotspot allocate objects on the stack? [duplicate]

This question already has answers here:
Eligibility for escape analysis / stack allocation with Java 7
(3 answers)
Closed 5 years ago.
Since somewhere around Java 6, the Hotspot JVM can do escape analysis and allocate non-escaping objects on the stack instead of on the garbage collected heap. This results in a speedup of the generated code and reduces pressure on the garbage collector.
What are the rules for when Hotspot is able to stack allocate objects? In other words when can I rely on it to do stack allocation?
edit: This question is a duplicate, but (IMO) the answer below is a better answer than what is available at the original question.

I have done some experimentation in order to see when Hotspot is able to stack allocate. It turns out that its stack allocation is quite a bit more limited than what you might expect based on the available documentation. The referenced paper by Choi "Escape Analysis for Java" suggests that an object that is only ever assigned to local variables can always be stack allocated. But that is not true.
All of these are implementation details of the current Hotspot implementation, so they could change in future versions. This refers to my OpenJDK install, which is version 1.8.0_121 for x86-64.
The short summary, based on quite a bit of experimentation, seems to be:
Hotspot can stack-allocate an object instance if
all its uses are inlined
it is never assigned to any static or object fields, only to local variables
at each point in the program, which local variables contain references to the object must be JIT-time determinable, and not depend on any unpredictable conditional control flow.
If the object is an array, its size must be known at JIT time and indexing into it must use JIT-time constants.
To know when these conditions hold you need to know quite a bit about how Hotspot works. Relying on Hotspot to definitely do stack allocation in a certain situation can be risky, as a lot of non-local factors are involved. In particular, whether everything is inlined can be difficult to predict.
Practically speaking, simple iterators will usually be stack allocatable if you just use them to iterate. For composite objects only the outer object can ever be stack allocated, so lists and other collections always cause heap allocation.
If you have a HashMap<Integer,Something> and you use it in myHashMap.get(42), the 42 may stack allocate in a test program, but it will not in a full application because you can be sure that there will be more than two types of key objects in HashMaps in the entire program, and therefore the hashCode and equals methods on the key won't inline.
Beyond that I don't see any generally applicable rules, and it will depend on the specifics of the code.
Hotspot internals
The first important thing to know is that escape analysis is performed after inlining. This means that Hotspot's escape analysis is in this respect more powerful than the description in the Choi paper, since an object returned from a method but local to the caller method can still be stack allocated. Because of this iterators can nearly always be stack allocated if you do e.g. for(Foo item : myList) {...} (and the implementation of myList.iterator() is simple enough, which they usually are.)
Hotspot only compiles optimized versions of methods once it determines the method is 'hot', so code that is not run a lot of times does not get optimized at all, in which case there is no stack allocation or inlining whatsoever. But for those methods you usually don't care.
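The iterator case described in this section can be tried with a small program. This is a minimal sketch; the class name IteratorEscapeSketch and the loop counts are assumptions, and whether the iterator is actually eliminated depends on the JVM version and its inlining decisions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: an iterator that never leaves the method that uses it.
class IteratorEscapeSketch {

    static long sum(List<Integer> list) {
        long total = 0;
        // The for-each loop calls list.iterator(). The Iterator object is only
        // held in a local, so it is a candidate for elimination once sum() is
        // hot and iterator(), hasNext() and next() have been inlined.
        for (int item : list) {
            total += item;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Integer> list = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            list.add(i);
        }
        long result = 0;
        for (int round = 0; round < 100000; round++) { // make sum() hot
            result += sum(list);
        }
        System.out.println(result);
    }
}
```

Running it once with java -verbose:gc IteratorEscapeSketch and once with the additional flag -XX:-DoEscapeAnalysis gives a rough indication: if the iterator allocations are being removed, the first run should log noticeably fewer collections than the second.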
Inlining
Inlining decisions are based on profiling data that Hotspot collects first. The declared types do not matter so much, even if a method is virtual Hotspot can inline it based on the types of the objects it sees during profiling. Something similar holds for branches (i.e. if-statements and other control flow constructs): If during profiling Hotspot never sees a certain branch being taken, it will compile and optimize the code based on the assumption that the branch is never taken. In both cases, if Hotspot cannot prove that its assumptions will always be true, it will insert checks in the compiled code known as 'uncommon traps', and if such a trap is hit Hotspot will de-optimize and possibly re-optimize taking the new information into account.
Hotspot will profile which object types occur as receivers at which call sites. If Hotspot only sees a single type or only two distinct types occurring at a call site, it is able to inline the called method. If there are only one or two very common types and other types occur much less often, Hotspot should still be able to inline the methods of the common types, including a check for which code it needs to take. (I'm not entirely sure about this last case with one or two common types and more uncommon types though.) If there are more than two common types, Hotspot will not inline the call at all but instead generate machine code for an indirect call.
'Type' here refers to the exact type of an object. Implemented interfaces or shared superclasses are not taken into account. Even if different receiver types occur at a call site but they all inherit the same implementation of a method (e.g. multiple classes that all inherit hashCode from Object), Hotspot will still generate an indirect call and not inline. (So i.m.o. hotspot is quite stupid in such cases. I hope future versions improve this.)
Hotspot will also only inline methods that are not too big. 'Not too big' is determined by the -XX:MaxInlineSize=n and -XX:FreqInlineSize=n options. Inlinable methods with a JVM bytecode size below MaxInlineSize are always inlined, methods with a JVM bytecode size below FreqInlineSize are inlined if the call is 'hot'. Larger methods are never inlined. By default MaxInlineSize is 35 and FreqInlineSize is platform dependent but for me it is 325. So make sure your methods are not too big if you want them inlined. It can sometimes help to split out the common path from a large method, so that it can be inlined into its callers.
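Splitting out the common path, as suggested above, might look like the following. This is a minimal sketch with invented names (FastPathSketch, square, squareSlow); the cache and the overflow check are only stand-ins for a cheap common case and a bulky rare case.

```java
// Hypothetical sketch: a small fast-path method that delegates the rare,
// bulky case to a separate method so the fast path stays small in bytecode.
class FastPathSketch {
    private static final int[] CACHE = new int[128];
    static {
        for (int i = 0; i < CACHE.length; i++) {
            CACHE[i] = i * i;
        }
    }

    // Small method: a range check plus either an array read or a call.
    static int square(int n) {
        if (n >= 0 && n < CACHE.length) {
            return CACHE[n];   // common path
        }
        return squareSlow(n);  // rare path, kept out of line
    }

    // Stand-in for a large, rarely needed method body.
    static int squareSlow(int n) {
        long wide = (long) n * n;
        if (wide > Integer.MAX_VALUE) {
            throw new ArithmeticException("overflow for " + n);
        }
        return (int) wide;
    }

    public static void main(String[] args) {
        System.out.println(square(5) + " " + square(1000)); // prints "25 1000000"
    }
}
```

The values of MaxInlineSize and FreqInlineSize on a given JVM can be listed with java -XX:+PrintFlagsFinal -version.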
Profiling
One important thing to know about profiling is that profiling sites are based on the JVM bytecode, which itself is not inlined in any way. So if you have e.g. a static method
static <T,U> List<U> map(List<T> list, Function<T,U> func) {
List<U> result = new ArrayList<>();
for(T item : list) { result.add(func.call(item)); }
return result;
}
that maps a SAM Function callable over a list and returns the transformed list, Hotspot will treat the call to func.call as a single program-wide call site. You might call this map function at several spots in your program, passing a different func in at each call site (but the same one for one call site). In that case you might expect that Hotspot is able to inline map, and then also the call to func.call since at every use of map there is only a single func type. If this were so, Hotspot would be able to optimize the loop down very tightly. Unfortunately Hotspot is not smart enough for that. It only keeps a single profile for the func.call call site, lumping all the func types that you pass to map together. You will probably use more than two different implementations of func, so Hotspot will not be able to inline the call to func.call. Link for more details, and archived link as the original appears to be gone.
(As an aside, in Kotlin the equivalent loop can be fully inlined as the Kotlin compiler can do inlining of calls at the bytecode level. So for some uses it could be significantly faster than Java.)
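The situation described in this section can be reproduced with a few lines. This is a minimal sketch: the class name ProfilePollutionSketch is made up, and it uses the standard java.util.function.Function with its apply method in place of the func.call shown above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch: one shared call site inside map(), several callers.
class ProfilePollutionSketch {

    static <T, U> List<U> map(List<T> list, Function<T, U> func) {
        List<U> result = new ArrayList<>();
        for (T item : list) {
            // A single bytecode call site: its type profile is shared by
            // every caller of map(), whatever function each one passes in.
            result.add(func.apply(item));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 2, 3);
        // Three callers, three different Function implementations.
        List<Integer> doubled = map(input, x -> x * 2);
        List<String> asText = map(input, x -> "n" + x);
        List<Boolean> isOdd = map(input, x -> x % 2 == 1);
        System.out.println(doubled + " " + asText + " " + isOdd);
    }
}
```

Each lambda is a distinct receiver type at the func.apply call site, so with three of them the profile for that site already contains more than two types.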
Scalar Replacement
Another important thing to know is that Hotspot does not actually implement stack allocation of objects. Instead it implements scalar replacement, which means that an object is deconstructed into its constituent fields and those fields are stack allocated like normal local variables. This means that there is no object left at all. Scalar replacement only works if there is never a need to create a pointer to the stack-allocated object. Some forms of stack allocation in e.g. C++ or Go would be able to allocate full objects on the stack and then pass references or pointers to them to called functions, but in Hotspot this does not work. Therefore if there is ever a need to pass an object reference to a non-inlined method, even if the reference would not escape the called method, Hotspot will always heap-allocate such an object.
In principle Hotspot could be smarter about this, but right now it is not.
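Conceptually, scalar replacement turns the first method below into something like the second. This is a minimal sketch with invented names (ScalarReplacementSketch, Point, withObjects, scalarized); the second method is written by hand to illustrate the idea and is not actual JIT output.

```java
// Hypothetical sketch: what scalar replacement does, written out by hand.
class ScalarReplacementSketch {

    static final class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
        int dot(Point o) { return x * o.x + y * o.y; }
    }

    // As written in source: two Point objects per call, neither escapes.
    static int withObjects(int a, int b) {
        Point p = new Point(a, b);
        Point q = new Point(b, a);
        return p.dot(q);
    }

    // Roughly what is left after inlining and scalar replacement:
    // the fields have become plain locals and no object exists.
    static int scalarized(int a, int b) {
        int px = a, py = b;
        int qx = b, qy = a;
        return px * qx + py * qy;
    }

    public static void main(String[] args) {
        System.out.println(withObjects(3, 4) + " " + scalarized(3, 4)); // prints "24 24"
    }
}
```

If dot() were not inlined, p and q would have to be passed as references to a real method, and both would be heap-allocated.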
Test program
I used the following program and variations to see when Hotspot will do scalar replacement.
// Minimal example for which the JVM does not scalarize the allocation. If field is final, or the second allocation is unconditional, it will.
class Scalarization {
int field = 0xbd;
long foo(long i) { return i * field; }
public static void main(String[] args) {
long result = 0;
for(long i=0; i<100; i++) {
result += test();
}
System.out.println("Result: "+result);
}
static long test() {
long ctr = 0x5;
for(long i=0; i<0x10000; i++) {
Scalarization s = new Scalarization();
ctr = s.foo(ctr);
if(i == 0) s = new Scalarization();
ctr = s.foo(ctr);
}
return ctr;
}
}
If you compile and run this program with javac Scalarization.java; java -verbose:gc Scalarization you can see whether scalar replacement worked by the number of garbage collections. If scalar replacement works, no garbage collection happens on my system; if it does not work, I see a few garbage collections.
Variants that Hotspot is able to scalarize run significantly faster than versions where it does not. I verified the generated machine code (instructions) to make sure Hotspot was not doing any unexpected optimizations. If Hotspot is able to scalar replace the allocations, it can then also do some additional optimizations on the loop, unrolling it a few iterations and then combining those iterations together. So in the scalarized versions the effective loop count is lower, with each iteration doing the work of multiple source code level iterations. So the speed difference is not only due to allocation and garbage collection overhead.
Observations
I tried a number of variations on the above program. One condition for scalar replacement is that the object must never be assigned to an object (or static) field, and presumably also not into an array. So in code like
Foo f = new Foo();
bar.field = f;
the Foo object cannot be scalar replaced. This holds even if bar itself is scalar replaced, and also if you never again use bar.field. So an object can only ever be assigned to local variables.
That alone is not enough, Hotspot must also be able to determine statically at JIT-time which object instance will be the target of a call. For example, using the following implementations of foo and test and removing field causes heap allocation:
long foo(long i) { return i * 0xbb; }
static long test() {
long ctr = 0x5;
for(long i=0; i<0x10000; i++) {
Scalarization s = new Scalarization();
ctr = s.foo(ctr);
if(i == 50) s = new Scalarization();
ctr = s.foo(ctr);
}
return ctr;
}
While if you then remove the conditional for the second assignment no more heap allocation occurs:
static long test() {
long ctr = 0x5;
for(long i=0; i<0x10000; i++) {
Scalarization s = new Scalarization();
ctr = s.foo(ctr);
s = new Scalarization();
ctr = s.foo(ctr);
}
return ctr;
}
In this case Hotspot can determine statically which instance is the target for each call to s.foo.
On the other hand, even if the second assignment to s is a subclass of Scalarization with a completely different implementation, as long as the assignment is unconditional Hotspot will still scalarize the allocations.
Hotspot does not appear to be able to move an object to the heap that was previously scalar replaced (at least not without deoptimizing). Scalar replacement is an all-or-nothing affair. So in the original test method both allocations of Scalarization always happen on the heap.
Conditionals
One important detail is that Hotspot will predict conditionals based on its profiling data. If a conditional assignment is never executed, Hotspot will compile code under that assumption, and then might be able to do scalar replacement. If at a later point in time the condition does get taken, Hotspot will need to recompile the code with this new assumption. The new code will not do scalar replacement since Hotspot can no longer determine the receiver instance of following calls statically.
For instance in this variant of test:
static long limit = 0;
static long test() {
long ctr = 0x5;
long i = limit;
limit += 0x10000;
for(; i<limit; i++) { // In this form if scalarization happens is nondeterministic: if the condition is hit before profiling starts scalarization happens, else not.
Scalarization s = new Scalarization();
ctr = s.foo(ctr);
if(i == 0xf9a0) s = new Scalarization();
ctr = s.foo(ctr);
}
return ctr;
}
the conditional assignment is only executed once during the lifetime of the program. If this assignment occurs early enough, before Hotspot starts full profiling of the test method, Hotspot never notices the conditional being taken and compiles code that does scalar replacement. If profiling has already started when the conditional is taken, Hotspot will not do scalar replacement. With the test value of 0xf9a0, whether scalar replacement happens is nondeterministic on my computer, since exactly when profiling starts can vary (e.g. because profiling and optimized code is compiled on background threads). So if I run the above variant it sometimes does a few garbage collections, and sometimes does not.
Hotspot's static code analysis is much more limited than what C/C++ and other static compilers can do, so Hotspot is not as smart in following the control flow in a method through several conditionals and other control structures to determine the instance that a variable refers to, even if it would be statically determinable for the programmer or a smarter compiler. In many cases the profiling information will make up for that, but it is something to be aware of.
Arrays
Arrays can be stack allocated if their size is known at JIT time. However indexing into an array is not supported unless Hotspot can also statically determine the index value at JIT-time. So stack allocated arrays are pretty useless. Since most programs don't use arrays directly but use the standard collections this is not very relevant, as embedded objects such as the array containing the data within an ArrayList already need to be heap-allocated due to their embedded-ness. I suppose the reasoning for this restriction is that there exists no indexing operation on local variables so this would require additional code generation functionality for a pretty rare use case.

Overhead of short-lived references to existing objects in Java / Android

Recently I have come across an article about memory optimization in Android, but I think my question is more of a general Java type. I couldn't find any information on this, so I will be grateful if you could point me to a good resource to read.
The article I'm talking about can be found here.
My question relates to the following two snippets:
Non-optimal version:
List<Chunk> mTempChunks = new ArrayList<Chunk>();
for (int i = 0; i<10000; i++){
mTempChunks.add(new Chunk(i));
}
for (int i = 0; i<mTempChunks.size(); i++){
Chunk c = mTempChunks.get(i);
Log.d(TAG,"Chunk data: " + c.getValue());
}
Optimized version:
Chunk c;
int length = mTempChunks.size();
for (int i = 0; i<length; i++){
c = mTempChunks.get(i);
Log.d(TAG,"Chunk data: " + c.getValue());
}
The article also contains the following lines (related to the first snippet):
In the second loop of the code snippet above, we are creating a new chunk object for each iteration of the loop. So it will essentially create 10,000 objects of type ‘Chunk’ and occupy a lot of memory.
What I'm striving to understand is why a new object creation is mentioned, since I can only see creation of a reference to an already existing object on the heap. I know that a reference itself costs 4-8 bytes depending on a system, but they go out of scope very quickly in this case, and apart from this I don't see any additional overhead.
Maybe it's the creation of a reference to an existing object that is considered expensive when numerous?
Please tell me what I'm missing out here, and what is the real difference between the two snippets in terms of memory consumption.
Thank you.

There are two differences:
Non-optimal:
i < mTempChunks.size()
Chunk c = mTempChunks.get(i);
Optimal:
i < length
c = mTempChunks.get(i);
In the non-optimal code, the size() method is called for each iteration of the loop, and a new reference to a Chunk object is created. In the optimal code, the overhead of repeatedly calling size() is avoided, and the same reference is recycled.
However, the author of that article seems to be wrong in suggesting that 10000 temporary objects are created in the second non-optimal loop. Certainly, 10000 temp objects are created, but in the first loop, not the second, and there's no way to avoid that. In the second non-optimal loop, 10000 references are created. So in a way it is less than optimal, although the author misses the forest for the trees.
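The two loop styles can be put side by side in plain Java. This is a minimal sketch: the class ChunkLoopSketch and its Chunk are stand-ins for the article's types, and summing the values replaces the Android Log.d call so the code runs on any JVM.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: Chunk objects are allocated only in build();
// the read loops just copy references to objects that already exist.
class ChunkLoopSketch {

    static final class Chunk {
        private final int value;
        Chunk(int value) { this.value = value; }
        int getValue() { return value; }
    }

    static List<Chunk> build(int n) {
        List<Chunk> chunks = new ArrayList<Chunk>();
        for (int i = 0; i < n; i++) {
            chunks.add(new Chunk(i)); // the only place a Chunk is created
        }
        return chunks;
    }

    // "Non-optimal" style: size() per iteration, reference declared in the loop.
    static long sumIndexed(List<Chunk> chunks) {
        long sum = 0;
        for (int i = 0; i < chunks.size(); i++) {
            Chunk c = chunks.get(i); // copies a reference, allocates nothing
            sum += c.getValue();
        }
        return sum;
    }

    // "Optimized" style: size() hoisted, reference declared once.
    static long sumHoisted(List<Chunk> chunks) {
        long sum = 0;
        Chunk c;
        int length = chunks.size();
        for (int i = 0; i < length; i++) {
            c = chunks.get(i);
            sum += c.getValue();
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Chunk> chunks = build(10000);
        System.out.println(sumIndexed(chunks) + " " + sumHoisted(chunks));
    }
}
```

Neither read loop contains a new expression, so the number of Chunk objects is the same for both.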
Further References:
1. Avoid Creating Unnecessary Objects.
2. Use Enhanced For Loop Syntax.
EDIT:
I have been accused of being a charlatan. For those who say that calling size() has no overhead, I can do no better than quoting the official docs:
3. Avoid Internal Getters/Setters.
EDIT 2:
In my answer, I initially made the mistake of saying that memory for references is allocated at compile-time on the stack. I realize now that that statement is wrong; that's actually the way things work in C++, not Java. The world of Java is C++ upside down: while memory for references is indeed allocated on the stack, in Java even that happens at runtime. Mind blown!
References:
1. Runtime vs compile time memory allocation in java.
2. Where is allocated variable reference, in stack or in the heap?.
3. The Structure of the Java Virtual Machine - Frames.

Which code is more CPU/memory efficient when used with a Garbage Collected language?

I have these two dummy pieces of code (let's consider they are written in either Java or C#, all variables are local):
Code 1:
int a;
int b = 0;
for (int i = 1; i < 10 ; i++)
{
a = 10;
b += i;
// a lot of more code that doesn't involve assigning new values to "a"
}
Code 2:
int b = 0;
for (int i = 1; i < 10 ; i++)
{
int a = 10;
b += i;
// a lot of more code that doesn't involve assigning new values to "a"
}
At first glance I would say both snippets consume the same amount of memory, but Code 1 is more CPU efficient because it creates and allocates variable a just once.
Then I read that Garbage Collectors are extremely efficient, to the point that Code 2 would be the more memory (and CPU?) efficient: keeping variable a inside the loop makes it belong to Gen0, so it would be garbage collected before variable b.
So, when used with a Garbage Collected language, Code 2 is the more efficient. Am I right?

A few points:
Local ints (and other local primitives) are never allocated on the heap. They live directly on the thread stack; "allocation" and "deallocation" are simple moves of a pointer and happen once (when the function is entered, and immediately after it returns), regardless of scope.
Primitives that are accessed often are usually kept in a register for speed, again regardless of scope.
In your case a (and possibly b as well, together with the whole loop) will be "optimized away": the optimizer is smart enough to detect that a variable's value changes but is never read, and to skip the redundant operations. Or, if there is code that reads a but does not modify it, the optimizer will likely replace it with the constant value 10, inlined everywhere a is referenced.
New objects (if you did something like String a = new String("foo") instead of int, for example) are normally allocated in the young generation, and only get transferred into the old gen after they survive a few minor collections. This means that, in most cases, when an object is allocated inside a function and never referenced from outside, it will never make it to the old gen regardless of its exact scope, unless your heap structure desperately needs tuning.
As pointed out in the comment, sometimes the VM might decide to allocate a large object directly in the old gen (this is true for Java too, not just .NET), so the point above only applies in most cases, not always. However, in relation to this question, this does not make any difference: if the decision is made to allocate an object in the old gen, it is made without regard to the scope of its initial reference anyway.
From a performance and memory standpoint your two snippets are identical. From the readability perspective, though, it is always a good idea to declare all variables in the narrowest possible scope.
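As a minimal sketch of the two snippets from the question, wrapped in hypothetical methods named declaredOutside and declaredInside so they can be compiled and compared: both compute the same result, and neither one allocates anything on the heap. Anyone curious about what the compiler actually emits for each method could inspect the class with javap -c.

```java
class ScopeDemo {
    // Code 1: "a" is declared once, outside the loop.
    static int declaredOutside() {
        int a;
        int b = 0;
        for (int i = 1; i < 10; i++) {
            a = 10; // assigned but never read
            b += i;
        }
        return b;
    }

    // Code 2: "a" is declared inside the loop body.
    static int declaredInside() {
        int b = 0;
        for (int i = 1; i < 10; i++) {
            int a = 10; // a local slot in the same stack frame, not a heap object
            b += i;
        }
        return b;
    }

    public static void main(String[] args) {
        System.out.println(declaredOutside()); // prints 45
        System.out.println(declaredInside());  // prints 45
    }
}
```

Since a is a primitive local in both versions, the garbage collector is never involved; the choice between the two is about readability rather than memory.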

Before the code in snippet 2 is actually executed, it's going to end up being transformed to look like the code in snippet 1 behind the scenes (whether by the compiler or by the runtime). As a result, the performance of the two snippets is going to be identical, as they'll compile into functionally the same code at some point.
Note that for very short lived variables it's actually quite possible for them to not have memory allocated for them at all. They may well be stored entirely in a register, involving 0 memory allocation.
