Why does the following line
Object[] objects = new Object[10000000];
result in a lot of memory (~40M) being used by the JVM? Is there any way to know the internal workings of the VM when allocating arrays?
Well, that allocates enough space for 10000000 references, as well as a small amount of overhead for the array object itself.
The actual size will depend on the VM - but it's surely not surprising that it's taking up a fair amount of memory... I'd expect at least 40MB, and probably 80MB on a 64-bit VM, unless it's using compressed oops for arrays.
Of course, if you populate the array with that many distinct objects, that will take much, much more memory... but the array itself still needs space just for the references.
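If you want to see the number on your own VM, a rough sketch like the following will show it (the class name is mine, and the figure is approximate because GC activity adds noise):

public class ArrayCost {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        System.gc();  // encourage a collection so the baseline is cleaner
        long before = rt.totalMemory() - rt.freeMemory();

        Object[] objects = new Object[10000000];

        long after = rt.totalMemory() - rt.freeMemory();
        System.out.println((after - before) + " bytes"); // ~40M on 32-bit, ~80M on 64-bit without compressed oops
        System.out.println(objects.length);              // use the array so it isn't optimized away
    }
}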
What do you mean by "a lot of memory"? You are allocating 10,000,000 references, each taking 4 bytes (on a 32-bit machine) - that is about 40 MB of memory.
You are creating ten million references to an object. A reference takes 4 bytes on a 32-bit JVM and 8 bytes on a 64-bit JVM (unless compressed references are in use).
So with that one line you're creating 40 or 80 megabytes of data.
You are reserving space for ten million references. That is quite a bit.
It results in a lot of memory being used because it needs to allocate heap space for 10 million object references, plus the small overhead of the array object itself.
To look into the internal workings of the JVM, you can check out its source code, as it is open source.
Your array has to hold 10 million object references, which on modern platforms are 64 bit (8 byte) pointers. Since it is allocated as a contiguous chunk of storage, it should take 80 million bytes. That's big in one sense, small compared to the likely amount of memory you have. Why does it bother you?
It creates an array of 10,000,000 reference pointers, all initialized to null.
What were you expecting when you call this "a lot"?
Further reading
Size of object references in Java
One of the principal reasons arrays are used so widely is that their elements can be accessed in constant time. This means that the time taken to access a[i] is the same for each index i. This is because the address of a[i] can be determined arithmetically by adding a suitable offset to the address of the head of the array. The reason is that space for the contents of an array is allocated as a contiguous block of memory.
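If you're curious, you can see both numbers involved (the array header size and the bytes per element) on a HotSpot JDK via the unsupported sun.misc.Unsafe API. This is just a sketch, not portable code:

import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class ArrayLayout {
    public static void main(String[] args) throws Exception {
        // Grab the Unsafe singleton reflectively (it is not meant for application use).
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Unsafe unsafe = (Unsafe) f.get(null);

        int base  = unsafe.arrayBaseOffset(Object[].class); // header size, e.g. 16
        int scale = unsafe.arrayIndexScale(Object[].class); // bytes per reference, e.g. 4 or 8

        // a[i] lives at (address of a) + base + i * scale, hence constant-time access.
        System.out.println("header = " + base + " bytes, per element = " + scale + " bytes");
    }
}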
According to this site, the memory usage for arrays is a 12-byte header + 4 bytes per element. If you declare an empty array of Object holding 10M elements, you have about 40MB of memory used from the start. If you start actually filling that array with 10M objects, the size increases quite rapidly.
From this site, and I just tested it on my 64-bit machine, the size of a plain Object is about 31 bytes, so an array of 10M Objects comes to about 12 bytes + (4 + 31 bytes) * 10M = 350,000,012 bytes (or about 333.79 MB).
If your array holds objects of another type, the size will be even larger.
I would suggest you use some kind of random access file(s) to hold your data if you have to keep that much data inside your program. Or even use a database such as Apache Derby, which would also let you sort and filter your data, etc.
I may be behind the times but I understood from the book Practical Java that Vectors are more efficient and faster than Arrays. Is it possible to use a Vector instead of an array?
Related
I was thinking about the following situation: I want to count the occurrence of characters in a string (for example for a permutation check).
One way to do it would be to allocate an array with 256 integers (I assume that the characters are UTF-8), to fill it with zeros and then to go through the string and increment the integers on the array positions corresponding to the int value of the chars.
However, with this approach you would have to allocate a 256-element array each time, even when the analyzed string is very short (and consequently uses only a small part of the array).
Another approach would be to use a Character-to-Integer HashTable and store a number for each encountered char. That way, you would only have keys for chars that actually occur in the string.
As my understanding of the HashTable is rather theoretical and I do not really know how it is implemented in Java, my question is: which of the two approaches would be more memory efficient?
Edit:
During the discussion of this question (thank you for your answers, everyone) I realized that I had a very fuzzy understanding of the nature of UTF-8. After some searching, I found this great video that I want to share, in case someone has the same problem.
I wonder why you chose 256 as the length of your array when you assume that your String is UTF-8. In UTF-8 a character can be composed of up to 4 bytes, which allows far more than just 256 distinct characters.
Anyway: using a HashTable/HashMap incurs a huge memory overhead. First, all your characters and integers need to be wrapped in objects (Integer/Character), and an Integer consumes about 3x as much memory as an int. For arrays the difference can be even larger, due to the optimizations Java performs on arrays (e.g. the Java stack works only in multiples of 4 bytes, while in an array Java allows smaller types such as char to consume only 2 bytes).
Then the HashTable itself creates a memory overhead, because it needs to maintain an array (which is usually not fully used) and linked lists holding all objects that produce the same hash.
Additionally, access times will be dramatically faster for arrays. You save multiple method invocations (add, hashCode, iterator, ...) and there are a number of opcodes in Java bytecode that make working with arrays more efficient.
Anyway, your question was:
Which of the two approaches would be more memory efficient?
And it is safe to say that arrays will be more memory efficient.
However, you should make absolutely sure what your requirements are. Do you need more memory efficiency? (That could be true if you process large amounts of data or run on a constrained device, e.g. a mobile device.) How important is readability of the code? What about code size? Reusability?
And is 256 really the correct size?
Without looking in the code I know that a HashMap requires, at minimum, a base object, a hashtable array, and individual objects for each hash entry. Generally an int value would have to be stored as an Integer object so that's more objects. Let's assume you have 30 unique characters:
32 bytes for the base object
256 bytes for a minimum-size hashtable array
32 bytes for each of the 30 table entries
16 bytes (if highly optimized) for each of 30 Integers
32 + 256 + 960 + 480 = 1728 bytes. That's for a minimal, non-fancy implementation.
The array of 256 ints would be about 1056 bytes.
I would use the array. From a performance standpoint, you get guaranteed constant-time access - better than what a hash table can give you.
As it also uses only a constant amount of memory, I see no downside. The HashMap will most likely need more memory, even if you only store a few elements.
By the way, the memory footprint should not be a concern anyway: you only need the data structure while counting, and afterwards it will be garbage collected.
Well here are the facts.
HashMap uses an array for its table behind the scenes.
So if you were actually limited by finding a contiguous space in memory, HashMap's benefit is only that the array may be smaller.
HashMap is generic and therefore uses objects.
Objects take up extra space. As I remember, it's typically 8 or 16 bytes minimum depending on whether it's a 32- or 64-bit system. This means the HashMap may very well not be smaller, even if the number of characters in the String is small. HashMap will require 3 extra objects for each entry: an Entry, a Character and an Integer. HashMap also needs to store the int for the index locally whereas the array does not.
And that's before the extra computation involved in using the HashMap.
I would also say space optimization is not something you should worry about here. Either way, the memory footprint is actually very small.
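To make the comparison concrete, here is a minimal sketch of the HashMap-based counting the question describes (the class and method names are mine). Note the boxing of both key and value on every update:

import java.util.HashMap;
import java.util.Map;

public class MapCounter {
    public static Map<Character, Integer> countChars(String s) {
        Map<Character, Integer> counts = new HashMap<Character, Integer>();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);                      // autoboxed to Character below
            Integer old = counts.get(c);
            counts.put(c, old == null ? 1 : old + 1);  // an Entry, a Character and an Integer per distinct char
        }
        return counts;
    }
}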
Initialize an array of integers indexed by the int value of each char; for example, the int value of 'f' is 102, which is its ASCII value:
http://www.asciitable.com/
char c = 'f';
int x = (int)c;
If you know the range of chars you're dealing with, it is easier.
For each occurrence of a char, increment that char's index in the array by one, as sketched below. This approach would be slow if you have to iterate, and complicated if you need to sort, but it won't be memory intensive.
Just be aware that when you sort, you lose the indexes.
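A minimal sketch of this array-based counting, assuming plain ASCII-range input (hence the fixed size of 256; the class name is mine):

public class CharCounter {
    public static int[] countChars(String s) {
        int[] counts = new int[256];          // one slot per possible value, zero-initialized
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 256) {                    // ignore anything outside the assumed range
                counts[c]++;                  // the char's numeric value is the index
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = countChars("coffee");
        System.out.println("'f' occurs " + counts['f'] + " times"); // 2
    }
}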
I see that the maximum size of an array can only be the maximum size of an int. Why does Java not allow an array of size Long.MAX_VALUE?
long no = 10000000000L;
int[] nums = new int[no]; // error here
You'll have to address the "why" question to the Java designers. Anyone else can only speculate. My speculation is that they felt that a two-billion-element array ought to be enough for anybody (which, in fairness, it probably is).
An int-sized length allows arrays of 2^31 - 1 ("~2 billion") elements. In the gigantically overwhelming majority of arrays' uses, that's plenty.
An array of that many elements will take between 2 gigabytes and 16 gigabytes of memory, depending on the element type. When Java appeared in 1995, new PCs had only around 8 megabytes of RAM. And those 32-bit operating systems, even if they used virtual memory on disk, had a practical limit on the size of a contiguous chunk of memory they could allocate which was quite a bit less than 2 gigabytes, because other allocated things are scattered around in a process's address space. Thus the limits of an int-sized array length were untestable, unforeseeable, and just very far away.
On 32-bit CPUs, arithmetic with ints is much faster than with longs.
Arrays are a basic internal type and they are used everywhere. A long-sized length would take an extra 4 bytes per array to store, which in turn could affect packing of arrays in memory, potentially wasting more bytes between them. (Even though the longer length would almost never be useful.)
If you ever do need in-RAM storage for more than ~2 billion items, you can use an array of arrays.
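A rough sketch of that workaround (the class name and chunk size are mine): split a long index into an outer and an inner int index.

public class BigIntArray {
    private static final int CHUNK = 1 << 27;   // 128M ints (512 MB) per inner array
    private final int[][] chunks;

    public BigIntArray(long size) {
        int n = (int) ((size + CHUNK - 1) / CHUNK);
        chunks = new int[n][];
        long remaining = size;
        for (int i = 0; i < n; i++) {
            chunks[i] = new int[(int) Math.min(CHUNK, remaining)];
            remaining -= CHUNK;
        }
    }

    public int get(long index) {
        return chunks[(int) (index / CHUNK)][(int) (index % CHUNK)];
    }

    public void set(long index, int value) {
        chunks[(int) (index / CHUNK)][(int) (index % CHUNK)] = value;
    }
}

For example, new BigIntArray(3000000000L) holds 3 billion ints, though it needs a ~12 GB heap.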
Unfortunately Java does not support arrays with more than 2^31 - 1 elements -
i.e. at most 16 GiB of space for a long[] array.
Try creating this:
Object[] array = new Object[Integer.MAX_VALUE - 4];
You should get an OutOfMemoryError, so the maximum size would be Integer.MAX_VALUE - 5 on that VM (the exact limit is VM-dependent).
I am working on a small task where I am required to store around 1 billion integers in an Array. However, I am running into a heap space problem. Could you please help me with this?
Machine details: Core 2 Duo processor with 4 GB RAM. I have even tried -Xmx3072m. Is there any workaround for this?
The same thing works in C++ , so there should definitely be a way to store this many numbers in memory.
Below is the code and the exception I am getting :
public class test {
    private static int[] C = new int[10000 * 10000];

    public static void main(String[] args) {
        System.out.println(java.lang.Runtime.getRuntime().maxMemory());
    }
}
Exception :
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at test.<clinit>(test.java:3)
Use an associative array. The key is an integer, and the value is the count (the number of times the integer has been added to the list).
This should get you some decent space savings if the distribution is relatively random, much more so if it's not.
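A minimal sketch of that idea (the class name is mine): store value -> count rather than every occurrence.

import java.util.HashMap;
import java.util.Map;

public class CountingStore {
    private final Map<Integer, Integer> counts = new HashMap<Integer, Integer>();

    public void add(int value) {
        Integer old = counts.get(value);
        counts.put(value, old == null ? 1 : old + 1); // one entry per distinct value
    }

    public int countOf(int value) {
        Integer c = counts.get(value);
        return c == null ? 0 : c;
    }
}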
If you need to store 1 billion completely random integers then I am afraid you really do need the corresponding space, i.e. about 4 GB of memory for 32-bit int numbers. You can try increasing the JVM heap space, but you need a 64-bit OS and at least as much physical memory - and there is only so far that you can go.
On the other hand, you might be able to store those number more efficiently if you can make use of specific constraints within your application.
E.g. if you only need to know if a specific int is contained in a set, you could get away with a bit set - i.e. a single bit for each value in the int range. That is about 4 billion bits, i.e. 512 MB - a far more reasonable space requirement. For example, a handful of BitSet objects could cover the whole 32-bit integer range without you having to write any bit-handling code...
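A sketch of that bit-set idea (the class name is mine): since BitSet indices are non-negative ints, two BitSets together can cover the full 32-bit range.

import java.util.BitSet;

public class IntSet {
    private final BitSet nonNegative = new BitSet(); // one bit per value >= 0
    private final BitSet negative = new BitSet();    // one bit per value < 0

    public void add(int value) {
        if (value >= 0) nonNegative.set(value);
        else negative.set(~value);                   // maps -1, -2, ... to 0, 1, ...
    }

    public boolean contains(int value) {
        return value >= 0 ? nonNegative.get(value) : negative.get(~value);
    }
}

Fully populated, the two bit sets together top out around 512 MB, no matter how many values you add.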
Maybe memory-mapped files will help? They are not allocated from the heap.
Here is an article on how to create a matrix that way. An array should be easier.
Using a memory mapped file for a huge matrix - Peter Lawrey
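For illustration, a hedged sketch of backing an int "array" with a memory-mapped file (the file name and element count are arbitrary; a single mapping is limited to 2 GB, so truly huge data would need several mappings):

import java.io.RandomAccessFile;
import java.nio.IntBuffer;
import java.nio.channels.FileChannel;

public class MappedInts {
    public static void main(String[] args) throws Exception {
        long count = 100000000L; // 100M ints = 400 MB, outside the Java heap
        try (RandomAccessFile raf = new RandomAccessFile("ints.bin", "rw");
             FileChannel ch = raf.getChannel()) {
            IntBuffer ints = ch.map(FileChannel.MapMode.READ_WRITE, 0, count * 4)
                               .asIntBuffer();
            ints.put(12345, 42);                 // write element 12345
            System.out.println(ints.get(12345)); // 42
        }
    }
}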
You can't get anywhere near a 4 GB heap on a 32-bit system; in practice the limit is roughly 1.5-3 GB, depending on the OS. If you're on a 64-bit system you can go higher.
Set the maximum heap size on the command line, e.g.:
java -Xmx4g programname
As an array this big may not fit into your RAM, you need to configure sufficient HDD swap space; 4-16 GB of swap does not look unrealistic these days.
Java only allows an int, not a long, as an array index. Hence the largest possible array has 2147483647 values - enough here.
Use -Xmx to raise the memory ceiling, which by default will probably be insufficient. 3072m is not enough, as one billion ints require about 4 GB, and since space is also needed for the operating system and the like, a machine with 4 GB of RAM cannot hold the whole 4 GB data structure in memory.
The JRE or OS may also refuse to grant such a big piece of memory in one go, requiring allocation in smaller chunks (maybe an array of arrays).
So I was reading Peter Norvig's IAQ (infrequently asked questions - link) and stumbled upon this:
You might be surprised to find that an Object takes 16 bytes, or 4 words, in the Sun JDK VM. This breaks down as follows: There is a two-word header, where one word is a pointer to the object's class, and the other points to the instance variables. Even though Object has no instance variables, Java still allocates one word for the variables. Finally, there is a "handle", which is another pointer to the two-word header. Sun says that this extra level of indirection makes garbage collection simpler. (There have been high performance Lisp and Smalltalk garbage collectors that do not use the extra level for at least 15 years. I have heard but have not confirmed that the Microsoft JVM does not have the extra level of indirection.)
An empty new String() takes 40 bytes, or 10 words: 3 words of pointer overhead, 3 words for the instance variables (the start index, end index, and character array), and 4 words for the empty char array. Creating a substring of an existing string takes "only" 6 words, because the char array is shared. Putting an Integer key and Integer value into a Hashtable takes 64 bytes (in addition to the four bytes that were pre-allocated in the Hashtable array): I'll let you work out why.
So well I obviously tried, but I can't figure it out. In the following I only count words:
A Hashtable put creates one Hashtable$Entry: 3 (overhead) + 4 variables (3 references, which I assume are 1 word each, + 1 int). I further assume that he means the Integers are newly allocated (so not cached by the Integer class or already existing), which comes to 2 * (3 [overhead] + 1 [int value]).
So in the end we end up with... 15 words, or 60 bytes. My first thought was that the Entry, as an inner class, needs a reference to its outer object, but alas it's static, so that doesn't make much sense (sure, we have to store a pointer to the parent class, but I'd think that information is stored in the class header by the VM).
Just idle curiosity and I'm well aware that all this depends to a good bit on the actual JVM implementation (and on a 64bit version the results would be different), but still I don't like questions I can't answer :)
Edit: Just to make this a bit clearer: While I'm well aware that more compact structures can get us some performance benefits, I agree that in general worrying about a few bytes here or there is a waste of time. I surely wouldn't stop using a Hashtable just because of a few bytes overhead here or there just like I wouldn't use plain char arrays instead of Strings (or start using C). This is purely of academic interest to learn a bit more about the insides of Java/the JVM :)
The author appears to assume there are 3 objects with 16 bytes of overhead each, plus 2 32-bit references in the Map.Entry and 2 x 1 32-bit int values: 3 * 16 + 2 * 4 + 2 * 4 = 64 bytes.
This is flawed in that Sun/Oracle's JVM only allocates on 8-byte boundaries, so while technically an Integer occupies 20 bytes of memory, 24 bytes are used (the next multiple of 8).
Additionally many JVMs now use 64-bit references so the Map.Entry would use another 16 bytes.
This is all very inefficient, which is why you might use a class like TIntIntHashMap instead, which uses primitives.
However, usually it doesn't matter, as memory is surprisingly cheap when you compare it to the cost of your time. If you work on server applications and you cost your company about $40/hour, you need to be saving about 10 MB every minute to save as much memory as you are costing. (Ideally you would need to be saving much more than this.) Saving 10 MB each and every minute is hard.
Memory is reusable, but your time isn't.
How can I tell how much space a pre-sized HashMap takes up before any elements are added? For example, how do I determine how much memory the following takes up?
HashMap<String, Object> map = new HashMap<String, Object>(1000000);
In principle, you can:
calculate it by theory:
look at the implementation of HashMap to figure out what this constructor does,
look at the implementation of the VM to know how much space the individual created objects take;
or measure it somehow.
Most of the other answers are about the second way, so I'll look at the first one (in OpenJDK source, 1.6.0_20).
The constructor uses a capacity that is the next power of two >= your initialCapacity parameter, thus 1048576 = 2^20 in our case.
It then creates a new Entry[capacity] and assigns it to the table variable (additionally it assigns some primitive variables).
So, we now have one quite small HashMap object (it contains only 3 ints, one float, and one reference variable), and one quite big Entry[] object. This array needs space for its array elements (which are normal reference variables) plus some metadata (size, class).
So, it comes down to how big a reference variable is. This depends on VM implementation - usually in 32-bit VMs it is 32 bit (= 4 bytes), in 64-bit VMs 64 bit (= 8 bytes).
So, basically on 32-bit VMs your array takes 4 MB, on 64-bit VMs it takes 8 MB, plus some tiny administration data.
If you then fill your HashMap with mappings, each mapping corresponds to an Entry object. This Entry object consists of one int and three references, taking about 24 bytes on 32-bit VMs, maybe double that on 64-bit VMs. Thus your 1000000-mapping HashMap (assuming a load factor > 1) would take ~28 MB on 32-bit VMs and ~56 MB on 64-bit VMs.
Additionally to the key and value objects themselves, of course.
You could check memory usage before and after creation of the variable. For example:
long preMemUsage = Runtime.getRuntime().totalMemory() -
                   Runtime.getRuntime().freeMemory();
HashMap<String, Object> map = new HashMap<String, Object>(1000000);
long postMemUsage = Runtime.getRuntime().totalMemory() -
                    Runtime.getRuntime().freeMemory();
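(A caveat: to make this measurement less noisy, call System.gc() a couple of times before each reading, run with a fixed heap size, and make sure nothing else is allocated in between. Even then the result is an estimate rather than an exact size.)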
The exact answer will depend on the version of Java you are using, the JVM vendor and the target platform, and is best determined by direct measurement, as described in other answers.
But as a simple estimate, the size is likely to be either ~4 * 2^20 or ~8 * 2^20 bytes, for a 32-bit or 64-bit JVM respectively.
Reasoning:
The Sun Java 1.6 implementation of HashMap has a fixed-size top-level object and a table field that points to the array of references to hash chains.
In a newly created (empty) HashMap the references are all null and the array size is the next power of two larger than the supplied initialCapacity. (Yes ... I checked the source code.)
A reference occupies 4 bytes on a typical 32bit JVM and 8 bytes on a typical 64 bit JVM. Some 64 bit JVMs support compact references ("compressed oops"), but you need to set JVM options to enable this.
The top object has 5 fields including the table array reference, but this is a relatively small constant overhead.
The top object and the array have object header overheads, but these are constant and relatively small.
Thus the size of the table array dominates, and it is 2^20 (the next power of 2 greater than 1,000,000) multiplied by the size of a reference.
So, this tells you that setting a large initial capacity really does use a lot of memory. On the other hand, if the initial capacity is a good estimate of the map's capacity when fully populated, you will save significant amounts of time by setting it. (This avoids a number of cycles of reallocating the array and rebuilding of the hash chains.)
You could probably use a profiler like VisualVM and track memory use.
Have a look at this too: http://www.velocityreviews.com/forums/t148009-java-hashmap-size.html
I'd have a look at this article: http://www.javaworld.com/javaworld/javatips/jw-javatip130.html
In short, Java does not have a C-style sizeof operator. You could use profiling tools, but IMO the above link gives the simplest solution.
Another piece of info that may be helpful: an empty Java String consumes 40 bytes, so one million of them would probably be at least 40 MB...
I agree that a profiler is really the only way to tell. The other bit of relevant information is whether you're using a 32-bit or 64-bit JVM. The amount of overhead due to memory references (pointers) varies depending on that and whether you have compressed oops turned on. I've found that for smaller data sets the overhead of objects and pointers is significant.
In recent versions of Java 7 (I'm looking at 1.7.0_55), HashMap actually lazily instantiates its internal table. It's only instantiated when put() is called - see the private method inflateTable(). So your HashMap, before you add anything to it at least, will occupy only the handful of bytes of object overhead and instance fields.
You should be able to use VisualVM (comes with JDK 6 or can be downloaded) to create a memory snapshot and inspect the allocated objects for their size.