I was thinking about the following situation: I want to count the occurrence of characters in a string (for example for a permutation check).
One way to do it would be to allocate an array of 256 integers (I assume the characters are UTF-8), fill it with zeros, and then go through the string, incrementing the integer at the array position corresponding to each char's int value.
However, with this approach you would have to allocate a 256-element array every time, even when the analyzed string is very short (and consequently uses only a small part of the array).
Another approach would be to use a Character-to-Integer HashTable and store a count for each encountered char. This way, you would only have keys for chars that actually occur in the string.
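In code, the two approaches I have in mind look roughly like this (a sketch only; the class and method names are made up, and the array variant assumes every char value is below 256):

import java.util.HashMap;
import java.util.Map;

class CharCounts {
    // Array approach: assumes every char value is < 256, otherwise the index is out of bounds.
    static int[] countWithArray(String s) {
        int[] counts = new int[256];
        for (int i = 0; i < s.length(); i++) {
            counts[s.charAt(i)]++;              // the char itself is the array index
        }
        return counts;
    }

    // Map approach: only characters that actually occur get an entry (boxed as Character/Integer).
    static Map<Character, Integer> countWithMap(String s) {
        Map<Character, Integer> counts = new HashMap<>();
        for (int i = 0; i < s.length(); i++) {
            counts.merge(s.charAt(i), 1, Integer::sum);
        }
        return counts;
    }
}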
As my understanding of the HashTable is rather theoretical and I do not really know how it is implemented in Java, my question is: Which of the two approaches would be more memory efficient?
Edit:
During the discussion of this question (thank you for your answers, everyone) I realized that I had a very fuzzy understanding of the nature of UTF-8. After some searching, I found this great video that I want to share, in case someone else has the same problem.
I wonder why you chose 256 as the length of your array when you assume that your String is UTF-8. In UTF-8 a character can be composed of up to 4 bytes, which allows for far more characters than just 256.
Anyway: using a HashTable/HashMap carries a huge memory overhead. First, all your characters and integers need to be wrapped in objects (Character/Integer), and an Integer consumes about 3x as much memory as an int. For arrays the difference can be even larger, due to the optimizations Java performs on arrays (e.g. the Java stack works only in multiples of 4 bytes, while in an array Java allows smaller types such as char to consume only 2 bytes).
On top of that, the HashTable itself creates memory overhead, because it needs to maintain an internal array (which is usually not fully used) and linked lists to hold all entries whose keys produce the same hash.
Additionally, access times will be dramatically faster for arrays. You save multiple method invocations (add, hashCode, iterator, ...), and there are dedicated opcodes in Java bytecode that make working with arrays more efficient.
Anyway, your question was:
Which of the two approaches would be more memory efficient?
And it is safe to say that arrays will be more memory efficient.
However, you should make absolutely sure what your requirements are. Do you need more memory efficiency? (Could be true if you process large amounts of data or run on a slow device, such as a mobile phone.) How important is readability of the code? How about size of code? Reusability?
And is 256 really the correct size?
Without looking at the code, I know that a HashMap requires, at minimum, a base object, a hash table array, and an individual object for each hash entry. Generally an int value would have to be stored as an Integer object, so that's even more objects. Let's assume you have 30 unique characters:
32 bytes for the base object
256 bytes for a minimum-size hashtable array
32 bytes for each of the 30 table entries
16 bytes (if highly optimized) for each of 30 Integers
32 + 256 + 960 + 480 = 1728 bytes. That's for a minimal, non-fancy implementation.
The array of 256 ints would be about 1056 bytes.
I would use the array. From a performance standpoint, you have guaranteed constant-time access, which is better than what a hash table can give you.
As it also only uses a constant amount of memory, I see no downside. The HashMap will most likely need more memory, even if you only store a few elements.
By the way, the memory footprint should not be a concern, as you only need the data structure for as long as the counting takes. After that it will be garbage collected anyway.
Well here are the facts.
HashMap uses an array for its table behind the scenes.
So if you were actually limited by finding a contiguous space in memory, HashMap's benefit is only that the array may be smaller.
HashMap is generic and therefore uses objects.
Objects take up extra space. As I remember, it's typically 8 or 16 bytes minimum depending on whether it's a 32- or 64-bit system. This means the HashMap may very well not be smaller, even if the number of characters in the String is small. HashMap will require 3 extra objects for each entry: an Entry, a Character and an Integer. HashMap also needs to store the int for the index locally whereas the array does not.
That is beyond the fact that there will be some extra computation when using the HashMap.
I would also say space optimization is not something you should worry about here. Either way, the memory footprint is actually very small.
Initialize an array of integers indexed by the int value of a char; for example, the int value of f is 102, which is its ASCII value:
http://www.asciitable.com/
char c = 'f';
int x = (int) c;   // x is 102, the ASCII/Unicode code point of 'f'
If you know the range of chars you're dealing with, then it is easier.
For each occurrence of a char, increment the value at that char's index in the array by one. This approach would be slow if you have to iterate, and complicated if you have to sort, but it won't be memory intensive; a rough sketch follows below.
Just be aware that when you sort, you lose the indexes.
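A minimal sketch of that counting loop, assuming the input contains only lowercase ASCII letters so a 26-slot array (offset by 'a') is enough:

String s = "banana";
int[] counts = new int[26];                 // one slot per lowercase letter
for (int i = 0; i < s.length(); i++) {
    counts[s.charAt(i) - 'a']++;            // assumes 'a'..'z' only, otherwise the index is out of range
}
// counts['n' - 'a'] is now 2; note that sorting the counts array loses which letter each slot belonged to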
Related
This is a question to better understand the JVM.
Let's say we have a class with an array of 16 bytes. It could be more or less than 16; 16 is just an example.
Each byte is used as a collection of 8 flags. Lots of reads. Few writes.
What is the performance impact of creating 4 int fields by fusing 4 bytes into each int field? This removes the need for array bounds checks when reading those flags. However, you then need to shift to address the correct bits within the int.
Flag writes are done both on the byte array and the int cache.
The reason might be that the byte array is used elsewhere. We don't know. We just know we have those bytes and we have to access flags on them many, many times.
Is it worth it to introduce the int cache in the class implementation?
EDIT: I realized that writes by others will not be pushed to the int cache, so the cache would report stale flags. But let's imagine there is a system-wide event for cache refresh.
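To make the trade-off concrete, here is a rough sketch of what I mean (names invented; the bytes are packed into the ints low-byte-first, and writes update both representations):

class Flags {
    private final byte[] bytes = new byte[16];   // original storage, 8 flags per byte
    private int w0, w1, w2, w3;                  // the proposed cache: 4 bytes fused into each int

    // Read flag i (0..127) from the byte array: one bounds-checked array access.
    boolean flagFromBytes(int i) {
        return (bytes[i >>> 3] & (1 << (i & 7))) != 0;
    }

    // Read the same flag from the fused ints: no array access, but extra shifting
    // to pick the right int and the right bit inside it.
    boolean flagFromInts(int i) {
        int word;
        switch (i >>> 5) {                       // which of the 4 ints (32 flags each)
            case 0: word = w0; break;
            case 1: word = w1; break;
            case 2: word = w2; break;
            default: word = w3; break;
        }
        return (word & (1 << (i & 31))) != 0;
    }

    // Writes keep both representations in sync, as described above.
    void setFlag(int i) {
        bytes[i >>> 3] |= (1 << (i & 7));
        switch (i >>> 5) {
            case 0: w0 |= 1 << (i & 31); break;
            case 1: w1 |= 1 << (i & 31); break;
            case 2: w2 |= 1 << (i & 31); break;
            default: w3 |= 1 << (i & 31); break;
        }
    }
}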
Other than the difference in methods available, why would someone use a BitSet as opposed to an array of booleans? Is performance better for some operations?
You would do it to save space: a boolean occupies a whole byte, so an array of N booleans would occupy eight times the space of a BitSet with the equivalent number of entries.
Execution speed is another closely related concern: you can produce a union or an intersection of several BitSet objects faster, because these operations can be performed by the CPU as bitwise ANDs and ORs on whole machine words at a time.
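For instance, a minimal sketch of those bulk operations with java.util.BitSet:

import java.util.BitSet;

public class BitSetDemo {
    public static void main(String[] args) {
        BitSet a = new BitSet();
        BitSet b = new BitSet();
        a.set(3); a.set(64); a.set(1000);
        b.set(64); b.set(1000);

        BitSet union = (BitSet) a.clone();
        union.or(b);                       // ORs whole words of bits at a time
        BitSet intersection = (BitSet) a.clone();
        intersection.and(b);               // ANDs whole words of bits at a time

        System.out.println(union.cardinality());        // 3
        System.out.println(intersection.cardinality()); // 2
    }
}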
In addition to the space savings noted by @dasblinkenlight, a BitSet has the advantage that it will grow as needed. If you do not know beforehand how many bits will be needed, or the high-numbered bits are sparse and rarely used (e.g. you are detecting which Unicode characters are present in a document and you want to allow for the unusual "foreign" ones > 128, but you know they will be rare), a BitSet will save even more memory.
I am caching a list of Long indexes in my Java program and it is causing the memory to overflow.
So I decided to cache only the start and end indexes of all continuous runs of indexes and rewrite the required ArrayList APIs. Now, what data structure would be best here to implement the start-end index cache? Is it better to go for a TreeMap and keep the start index as key and the end index as value?
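A minimal sketch of the TreeMap variant I have in mind (assuming the cached ranges never overlap; the class name is just for illustration):

import java.util.Map;
import java.util.TreeMap;

class RangeCache {
    // key = start index of a continuous run, value = end index (inclusive)
    private final TreeMap<Long, Long> ranges = new TreeMap<>();

    void addRange(long start, long end) {
        ranges.put(start, end);            // merging adjacent or overlapping runs is omitted here
    }

    boolean contains(long index) {
        Map.Entry<Long, Long> run = ranges.floorEntry(index);  // run starting at or before index
        return run != null && index <= run.getValue();
    }
}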
If I were you, I would use some variation of bit string storage.
In Java bit strings are implemented by BitSet.
For example, to represent an arbitrary list of unique 32-bit integers, you could store it as a single bit string 4 billion bits long, which would take 4 billion / 8 bits = 512 MB of memory. That is a lot, but it is the worst possible case.
But you can be a lot smarter than that. For example, you could store it as a list or binary tree of smaller fixed-size (or dynamically sized) bit strings, say 65536 bits (8 KB) or less each. In other words, each leaf object in this tree would have a small header holding its start offset and length (probably a power of 2 for simplicity, but it does not have to be), and a bit string storing the actual array members. For efficiency, you could optionally compress each bit string using gzip or a similar algorithm - it makes access slower, but can improve memory efficiency by a factor of 10 or more.
If your 20 million index elements are almost consecutive (not very sparse), it should take only around 20 million bits / 8 ≈ 2.5 million bytes = 2.5 MB to represent them in memory. If you gzip it, it will probably be under 1 MB overall.
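A sketch of the plain (uncompressed) BitSet variant of this idea, assuming each index fits in an int:

import java.util.BitSet;

public class IndexBits {
    public static void main(String[] args) {
        long[] indexes = {100, 101, 102, 5_000_000};   // stand-in for your cached indexes
        BitSet present = new BitSet();
        for (long index : indexes) {
            present.set((int) index);                  // assumes every index fits in an int
        }
        System.out.println(present.get(101));          // true
        System.out.println(present.size() / 8);        // rough bytes used by the backing words
    }
}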
The most compact representation will depend greatly on the distribution of indices in your specific application.
If your indices are densely clustered, the range-based representation suggested by mvp will probably work well (you might look at implementations of run-length encoding for raster graphics, since they're similar problems).
If your indices aren't clustered in dense runs, that encoding will actually increase memory consumption. For sparsely-populated lists, you might look into primitive data structures such as LongArrayList or LongOpenHashSet in FastUtil, or similar structures in Gnu Trove or Colt. In most VMs, each Long object in your ArrayList consumes 20+ bytes, whereas a primitive long consumes only 8. So you can often realize a significant memory savings with type-specific primitive collections instead of the standard Collections framework.
I've been very pleased with FastUtil, but you might find another solution suits you better. A little simulation and memory profiling should help you determine the most effective representation for your own data.
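If you try the primitive-collection route, usage looks much like the standard collections. A sketch with FastUtil's LongOpenHashSet (assuming the fastutil artifact is on your classpath; check the package name against the version you use):

import it.unimi.dsi.fastutil.longs.LongOpenHashSet;

public class PrimitiveSetDemo {
    public static void main(String[] args) {
        LongOpenHashSet indexes = new LongOpenHashSet();
        indexes.add(42L);                            // stored as a primitive long, no Long boxing
        indexes.add(1_000_000_007L);
        System.out.println(indexes.contains(42L));   // true
    }
}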
Most BitSet (compressed or uncompressed) implementations are for integers. Here's one for longs: http://www.censhare.com/en/aktuelles/censhare-labs/yet-another-compressed-bitset which works like an ordered primitive long hash set or long to long hash map.
If I create 10 integers and an integer array of 10, will there be any difference in the total space occupied?
I have to create a boolean array with millions of entries, so I want to understand how much space the array itself will take.
An array of ints is represented as a block of memory to hold the integers, plus an object header. The object header typically takes 3 32-bit words on a 32-bit JVM, but this is platform dependent. (The header contains some flag bits, a reference to a class descriptor, space for primitive lock information, and the length of the actual array, plus padding.)
So an array of 10 ints probably takes in the region of 13 * 4 = 52 bytes.
In the case of an Integer[], each Integer object has a 2-word header and a 1-word field containing the actual value. You also need to add in padding, and 1 word (or 1 to 2 words on a 64-bit JVM) for the reference. That is typically 5 words, or 20 bytes, per element of the array ... unless some Integer objects appear in multiple places in the array.
Notes:
The number of words actually used for a reference on a 64 bit JVM depends on whether "compressed oops" are used.
On some JVMs, heap nodes are allocated in multiples of 16 bytes ... which inflates space usage (e.g. the padding mentioned above).
If you take the identity hashcode of an object and it survives the next garbage collection, its size gets inflated by at least 4 bytes to cache the hashcode value.
These numbers are all version and vendor specific, in addition to the sources of variability enumerated above.
Some rough lower bounds calculations:
Each int takes up four bytes. = 40 bytes for ten
An int array takes up four bytes for each component plus four bytes to store the length plus another four bytes to store the reference to it. = 48 bytes (+ maybe some padding to align all objects at 8 byte boundaries)
An Integer takes up at least 8 bytes, plus another four bytes to store the reference to it. = at least 120 bytes for ten
An Integer array takes up at least the 120 bytes for the ten Integers, plus four bytes for the length, and then maybe some padding for alignment. Plus four bytes to store the reference to it. (@Marko reports that he even measured about 28 bytes per slot, which would be 280 bytes for an array of ten.)
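If you want exact numbers for your particular JVM rather than rough lower bounds, a measurement tool such as JOL (Java Object Layout) can print them. A sketch, assuming the org.openjdk.jol:jol-core dependency is available:

import org.openjdk.jol.info.ClassLayout;
import org.openjdk.jol.info.GraphLayout;

public class SizeDemo {
    public static void main(String[] args) {
        int[] primitives = new int[10];
        Integer[] boxed = new Integer[10];
        for (int i = 0; i < 10; i++) {
            boxed[i] = 128 + i;    // values outside the Integer cache, so each element is its own object
        }
        // Layout of the int[] itself: header + 10 * 4 bytes (+ padding)
        System.out.println(ClassLayout.parseInstance(primitives).toPrintable());
        // Total size of the Integer[] plus all Integer objects it references
        System.out.println(GraphLayout.parseInstance((Object) boxed).totalSize());
    }
}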
In Java you have both Integer and int. Supposing you are referring to int: an array of ints is itself an object, and objects have metadata, so an array of 10 ints will occupy more space than 10 int variables.
What you can do is measure:
public static void main(String[] args) {
    final long startMem = measure();
    final boolean[] bs = new boolean[1000000];
    System.out.println(measure() - startMem);
    bs.hashCode();   // keep the array reachable so it cannot be collected before the measurement
}

private static long measure() {
    final Runtime rt = Runtime.getRuntime();
    rt.gc();
    try { Thread.sleep(20); } catch (InterruptedException e) {}
    rt.gc();
    return rt.totalMemory() - rt.freeMemory();
}
Of course, this comes with the standard disclaimer: gc() makes no particular guarantees, so repeat the measurement several times to see whether you get consistent results. On my machine the answer is one byte per boolean.
In light of your comment, it will not make much difference if you use an array. The array itself will use a negligible amount of memory for its own bookkeeping; all other memory will be used by the stored objects.
EDIT: What you need to understand is the difference between the Boolean wrapper and the boolean primitive type. Wrapper types will usually take up more space than the primitives, so for millions of records try to go with the primitives.
Another thing to keep in mind when dealing with millions of records, as you said, is Java autoboxing. The performance hit can be significant if you unintentionally box and unbox in a function that traverses the whole array.
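To illustrate the wrapper-versus-primitive point (exact sizes are JVM specific, but the boxing itself is visible in the code):

boolean[] primitives = new boolean[1_000_000];   // roughly 1 byte per element plus the array header
Boolean[] wrappers = new Boolean[1_000_000];     // one reference per element (4 or 8 bytes), all null initially

wrappers[0] = true;            // autoboxing: the compiler inserts Boolean.valueOf(true)
boolean first = wrappers[0];   // auto-unboxing: booleanValue() is called behind the scenes

Boolean.valueOf reuses the cached TRUE/FALSE instances, so with Boolean the extra cost is mainly the per-element references; with wrappers such as Integer, boxing in a tight loop can also allocate an object per element.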
It needn't reflect poorly on the teacher / interviewer.
How much you care about the size and alignment of variables in memory depends on how performant you need your code to be. It matters a lot if your software processes transactions (EFT / stock market) for example.
The size, alignment, and packing of your variables in memory can influence CPU cache hits/misses, which can influence the performance of your code by up to a factor of 100.
It's not a bad thing to know what's happening at a low level, as long as you use performance boosting tricks responsibly.
For example, I came to this thread because I needed the answer to exactly this question: I want to size my arrays of primitives to fill an integer multiple of CPU cache lines, because the code performing calculations over those arrays has a finite window in which its results must be ready for the consumer.
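As a rough sketch of that sizing calculation (assuming 64-byte cache lines and 4-byte int elements, and ignoring the array object header):

// Round an element count up so the array body fills whole 64-byte cache lines.
static int roundUpToCacheLines(int minElements) {
    int perLine = 64 / Integer.BYTES;                       // 16 ints per 64-byte line
    return ((minElements + perLine - 1) / perLine) * perLine;
}
// e.g. roundUpToCacheLines(1000) == 1008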
In terms of RAM space, there is no real difference
If you use an array you have 11 objects (the 10 integers plus the array itself), and arrays also carry extra metadata internally, so using an array will take more memory.
Now for real: this kind of question actually comes up in job interviews and exams, and that says something about the kind of interviewer or teacher you have... with so many layers of abstraction working down there in the VM and in the OS itself, what is the point of thinking about this stuff? Micro-optimizing memory...!
I mean if I create 10 integers and an integer array of 10, will there be any difference in total space occupied?
(integer array of 10) = (10 integers) + 1 integer
The last "+1 integer" is for the length of the array (arrays can hold at most 2,147,483,647 elements, which is the maximum value of an int). That means when you declare an array, say:
int[] nums = new int[10];
you actually reserve space for 11 ints in memory: 10 for the array elements and 1 for the array's length.
So I was reading Peter Norvig's IAQ (infrequently asked questions - link) and stumbled upon this:
You might be surprised to find that an Object takes 16 bytes, or 4 words, in the Sun JDK VM. This breaks down as follows: There is a two-word header, where one word is a pointer to the object's class, and the other points to the instance variables. Even though Object has no instance variables, Java still allocates one word for the variables. Finally, there is a "handle", which is another pointer to the two-word header. Sun says that this extra level of indirection makes garbage collection simpler. (There have been high performance Lisp and Smalltalk garbage collectors that do not use the extra level for at least 15 years. I have heard but have not confirmed that the Microsoft JVM does not have the extra level of indirection.)
An empty new String() takes 40 bytes, or 10 words: 3 words of pointer overhead, 3 words for the instance variables (the start index, end index, and character array), and 4 words for the empty char array. Creating a substring of an existing string takes "only" 6 words, because the char array is shared. Putting an Integer key and Integer value into a Hashtable takes 64 bytes (in addition to the four bytes that were pre-allocated in the Hashtable array): I'll let you work out why.
So, well, I obviously tried, but I can't figure it out. In the following I only count words:
A Hashtable put creates one Hashtable$Entry: 3 (overhead) + 4 variables (3 references, which I assume are 1 word each, + 1 int). I further assume that the Integers are newly allocated (so not cached by the Integer class or already existing), which comes to 2 * (3 [overhead] + 1 [int value]).
So in the end we end up with... 15 words, or 60 bytes. What I first thought was that the Entry, as an inner class, needs a reference to its outer object, but alas it's static, so that doesn't make much sense (sure, we have to store a pointer to the enclosing class, but I'd think that information is stored in the class descriptor by the VM).
This is just idle curiosity, and I'm well aware that all this depends quite a bit on the actual JVM implementation (and the results would differ on a 64-bit version), but still, I don't like questions I can't answer :)
Edit: Just to make this a bit clearer: while I'm well aware that more compact structures can bring some performance benefits, I agree that in general worrying about a few bytes here or there is a waste of time. I surely wouldn't stop using a Hashtable just because of a few bytes of overhead, just as I wouldn't use plain char arrays instead of Strings (or start using C). This is purely of academic interest, to learn a bit more about the internals of Java and the JVM :)
The author appears to assume there are 3 objects with 16 bytes of overhead each, plus two 32-bit references in the Map.Entry and two 32-bit int values. That would total 64 bytes.
This is flawed in that Sun/Oracle's JVM only allocates on 8-byte boundaries, so while an Integer technically occupies 20 bytes of memory, 24 bytes are used (the next multiple of 8).
Additionally many JVMs now use 64-bit references so the Map.Entry would use another 16 bytes.
This is all fairly inefficient, which is why you might use a class such as TIntIntHashMap instead, which uses primitives (sketched below).
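A sketch of that primitive-map alternative with Trove's TIntIntHashMap (package and method names as in Trove 3; adjust if your version differs):

import gnu.trove.map.hash.TIntIntHashMap;

public class PrimitiveMapDemo {
    public static void main(String[] args) {
        TIntIntHashMap counts = new TIntIntHashMap();
        for (char c : "permutation".toCharArray()) {
            counts.adjustOrPutValue(c, 1, 1);   // increment if present, otherwise insert 1 - no boxing
        }
        System.out.println(counts.get('t'));    // 2, keys and values stored as primitive ints
    }
}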
However, it usually doesn't matter, as memory is surprisingly cheap when you compare it to the cost of your time. If you work on server applications and you cost your company about $40/hour, you need to be saving about 10 MB every minute to save as much memory as you are costing. (Ideally you should be saving much more than this.) Saving 10 MB each and every minute is hard.
Memory is reusable, but your time isn't.