Understanding treeifying in Java HashMap

As per the following document, Java HashMap Implementation, I'm confused about the implementation of HashMap (or rather, an enhancement in HashMap). My queries are:
Firstly
static final int TREEIFY_THRESHOLD = 8;
static final int UNTREEIFY_THRESHOLD = 6;
static final int MIN_TREEIFY_CAPACITY = 64;
Why and how are these constants used? I want some clear examples for this.
How they are achieving a performance gain with this?
Secondly
If you see the source code of HashMap in JDK, you will find the following static inner class:
static final class TreeNode<K, V> extends java.util.LinkedHashMap.Entry<K, V> {
    HashMap.TreeNode<K, V> parent;
    HashMap.TreeNode<K, V> left;
    HashMap.TreeNode<K, V> right;
    HashMap.TreeNode<K, V> prev;
    boolean red;

    TreeNode(int arg0, K arg1, V arg2, HashMap.Node<K, V> arg3) {
        super(arg0, arg1, arg2, arg3);
    }

    final HashMap.TreeNode<K, V> root() {
        HashMap.TreeNode arg0 = this;
        while (true) {
            HashMap.TreeNode arg1 = arg0.parent;
            if (arg0.parent == null) {
                return arg0;
            }
            arg0 = arg1;
        }
    }
    //...
}
How is it used? I just want an explanation of the algorithm.

HashMap contains a certain number of buckets. It uses the key's hashCode to determine which bucket to put an entry into. For simplicity's sake, imagine it as a modulus.
If our hashcode is 123456 and we have 4 buckets, 123456 % 4 = 0 so the item goes in the first bucket, Bucket 1.
If our hashCode function is good, it should provide an even distribution so that all the buckets will be used somewhat equally. In this case, the bucket uses a linked list to store the values.
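As a tiny, self-contained illustration of that modulus idea (the class name and values below are my own; the real HashMap uses a bit mask instead, as a later answer explains):
public class BucketDemo {
    public static void main(String[] args) {
        int hashCode = 123456;
        int buckets = 4;
        int index = hashCode % buckets; // 123456 % 4 == 0 -> the first bucket
        System.out.println("hashCode " + hashCode + " goes into bucket " + index);
    }
}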
But you can't rely on people to implement good hash functions. People will often write poor hash functions which will result in a non-even distribution. It's also possible that we could just get unlucky with our inputs.
The less even this distribution is, the further we're moving from O(1) operations and the closer we're moving towards O(n) operations.
The implementation of HashMap tries to mitigate this by organising some buckets into trees rather than linked lists if the buckets become too large. This is what TREEIFY_THRESHOLD = 8 is for. If a bucket contains more than eight items, it should become a tree.
This tree is a Red-Black tree, presumably chosen because it offers some worst-case guarantees. It is first sorted by hash code. If the hash codes are the same, it uses the compareTo method of Comparable if the objects implement that interface, else the identity hash code.
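As a rough sketch of that ordering (illustrative only, not the actual JDK code; the method name is mine), the tree bin effectively compares two keys like this:
public class TreeOrderSketch {
    // Order by hash first; fall back to compareTo for equal hashes; as a last
    // resort use identity hash codes (the real JDK code also compares class names).
    @SuppressWarnings("unchecked")
    static int treeOrder(Object a, Object b, int hashA, int hashB) {
        if (hashA != hashB) {
            return Integer.compare(hashA, hashB);
        }
        if (a instanceof Comparable && a.getClass() == b.getClass()) {
            int cmp = ((Comparable<Object>) a).compareTo(b);
            if (cmp != 0) {
                return cmp;
            }
        }
        return Integer.compare(System.identityHashCode(a), System.identityHashCode(b));
    }
}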
If entries are removed from the map, the number of entries in the bucket might drop so far that this tree structure is no longer necessary. That's what the UNTREEIFY_THRESHOLD = 6 is for. If the number of elements in a bucket drops to six or fewer (this is checked when the map resizes), it goes back to using a linked list.
Finally, there is the MIN_TREEIFY_CAPACITY = 64.
When a hash map grows in size, it automatically resizes itself to have more buckets. If we have a small HashMap, the likelihood of us getting very full buckets is quite high, because we don't have that many different buckets to put stuff into. It's much better to have a bigger HashMap, with more buckets that are less full. This constant basically says not to start making buckets into trees if our HashMap is very small - it should resize to be larger first instead.
To answer your question about the performance gain, these optimisations were added to improve the worst case. You would probably only see a noticeable performance improvement because of these optimisations if your hashCode function was not very good.
It is designed to protect against bad hashCode implementations and also provides basic protection against collision attacks, where a bad actor may attempt to slow down a system by deliberately selecting inputs which occupy the same buckets.

To put it more simply (as simply as I can), plus some more details.
These constants depend on a lot of internal things that would be very cool to understand before moving to them directly.
TREEIFY_THRESHOLD -> when a single bucket reaches this size (and the total number of buckets is at least MIN_TREEIFY_CAPACITY), it is transformed into a balanced red-black tree. Why? Because of search speed. Think about it in a different way:
it would take at most 32 steps to search for an Entry within a bucket/bin with Integer.MAX_VALUE entries.
Some intro for the next topic. Why is the number of bins/buckets always a power of two? There are at least two reasons: a bit mask is faster than the modulo operation, and modulo on negative numbers is negative. You can't put an Entry into a "negative" bucket:
int arrayIndex = hashCode % buckets; // will be negative
buckets[arrayIndex] = Entry; // obviously will fail
There is a nice trick used instead of modulo:
(n - 1) & hash // n is the number of bins, hash is the hash of the key
That is semantically the same as a modulo operation (as long as n is a power of two). It keeps only the lower bits of the hash. This has an interesting consequence when you do:
Map<String, String> map = new HashMap<>();
In the case above (a new HashMap starts with 16 buckets), the decision of where an entry goes is based only on the last 4 bits of your hashcode.
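A small demonstration of that equivalence (class name is mine): for a power-of-two table size, masking with (n - 1) gives the same result as a non-negative modulo, even for negative hashes.
public class IndexDemo {
    public static void main(String[] args) {
        int n = 16; // default HashMap table size
        int[] hashes = { 17, -42, 123456, 5 };
        for (int h : hashes) {
            int masked = (n - 1) & h;          // what HashMap effectively does
            int modulo = Math.floorMod(h, n);  // non-negative modulo
            System.out.println(h + " -> mask = " + masked + ", floorMod = " + modulo);
        }
        // Both columns match because n is a power of two: masking with (n - 1)
        // keeps only the lowest log2(n) bits (here, the last 4 bits).
    }
}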
This is where multiplying the buckets comes into play. Under certain conditions (would take a lot of time to explain in exact details), buckets are doubled in size. Why? When buckets are doubled in size, there is one more bit coming into play.
So you have 16 buckets - last 4 bits of the hashcode decide where an entry goes. You double the buckets: 32 buckets - 5 last bits decide where entry will go.
This process is called re-hashing, and it might get slow. That is why (for people who care) HashMap is jokingly described as: fast, fast, fast, slooow. There are other implementations - search for "pauseless hashmap"...
Now UNTREEIFY_THRESHOLD comes into play after re-hashing. At that point, some entries might move from these bins to others (they add one more bit to the (n - 1) & hash computation and as such might move to other buckets), and a bin's size might drop to UNTREEIFY_THRESHOLD. At this point it no longer pays off to keep the bin as a red-black tree; it becomes a LinkedList again, like
entry.next.next....
MIN_TREEIFY_CAPACITY is the minimum number of buckets the table must have before any single bucket is transformed into a tree.

TreeNode is an alternative way to store the entries that belong to a single bin of the HashMap. In older implementations the entries of a bin were stored in a linked list. In Java 8, if the number of entries in a bin passed a threshold (TREEIFY_THRESHOLD), they are stored in a tree structure instead of the original linked list. This is an optimization.
From the implementation:
/*
* Implementation notes.
*
* This map usually acts as a binned (bucketed) hash table, but
* when bins get too large, they are transformed into bins of
* TreeNodes, each structured similarly to those in
* java.util.TreeMap. Most methods try to use normal bins, but
* relay to TreeNode methods when applicable (simply by checking
* instanceof a node). Bins of TreeNodes may be traversed and
* used like any others, but additionally support faster lookup
* when overpopulated. However, since the vast majority of bins in
* normal use are not overpopulated, checking for existence of
* tree bins may be delayed in the course of table methods.
*/

You would need to visualize it: say there is a class Key with only the hashCode() function overridden to always return the same value.
public class Key implements Comparable<Key> {
    private String name;

    public Key(String name) {
        this.name = name;
    }

    @Override
    public int hashCode() {
        return 1;
    }

    public String keyName() {
        return this.name;
    }

    @Override
    public int compareTo(Key key) {
        // returns a +ve, -ve or zero integer
        return this.name.compareTo(key.name);
    }
}
and then somewhere else, I am inserting 9 entries into a HashMap with all keys being instances of this class. e.g.
Map<Key, String> map = new HashMap<>();
Key key1 = new Key("key1");
map.put(key1, "one");
Key key2 = new Key("key2");
map.put(key2, "two");
Key key3 = new Key("key3");
map.put(key3, "three");
Key key4 = new Key("key4");
map.put(key4, "four");
Key key5 = new Key("key5");
map.put(key5, "five");
Key key6 = new Key("key6");
map.put(key6, "six");
Key key7 = new Key("key7");
map.put(key7, "seven");
Key key8 = new Key("key8");
map.put(key8, "eight");
// Since the hashCode is the same, all entries will land in the same bucket, let's call it bucket 1. Up to here, all entries in bucket 1 are arranged in a LinkedList structure, e.g. key1 -> key2 -> key3 -> ... and so on. But when I insert one more entry:
Key key9 = new Key("key9");
map.put(key9, "nine");
the threshold value of 8 will be reached and it will rearrange the bucket 1 entries into a tree (red-black) structure, replacing the old linked list, e.g.
         key1
        /    \
     key2    key3
    /  \     /  \
Tree traversal is faster {O(log n)} than LinkedList {O(n)} and as n grows, the difference becomes more significant.

The change in the HashMap implementation was added with JEP-180. The purpose was to:
Improve the performance of java.util.HashMap under high hash-collision conditions by using balanced trees rather than linked lists to store map entries. Implement the same improvement in the LinkedHashMap class
However, pure performance is not the only gain. It also prevents the HashDoS attack, in case a hash map is used to store user input, because the red-black tree that is used to store data in the bucket has a worst-case insertion complexity of O(log n). The tree is used after certain criteria are met - see Eugene's answer.
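To see the kind of input this defends against, here is a small illustration (class name and sizes are mine): "Aa" and "BB" have the same hashCode, so every concatenation of such blocks collides too, forcing all keys into a single bucket. Because String implements Comparable, the Java 8+ tree bin keeps these insertions at O(log n) instead of O(n).
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CollisionDemo {
    public static void main(String[] args) {
        // Build 2^12 = 4096 distinct keys that all share one hashCode.
        List<String> keys = new ArrayList<>();
        keys.add("");
        for (int i = 0; i < 12; i++) {
            List<String> next = new ArrayList<>();
            for (String k : keys) {
                next.add(k + "Aa");
                next.add(k + "BB");
            }
            keys = next;
        }
        Map<String, Integer> map = new HashMap<>();
        for (String k : keys) {
            map.put(k, k.length());
        }
        System.out.println(map.size() + " colliding keys, e.g. "
                + keys.get(0).hashCode() + " == " + keys.get(1).hashCode());
    }
}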

To understand the internal implementation of HashMap, you need to understand hashing.
Hashing, in its simplest form, is a way of assigning a code to any variable/object after applying a formula/algorithm to its properties.
A true hash function must follow this rule –
“Hash function should return the same hash code each and every time when the function is applied on same or equal objects. In other words, two equal objects must produce the same hash code consistently.”
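A minimal sketch of a key class that honours this rule (the Point class here is hypothetical, just for illustration): its hashCode() is derived from exactly the fields that equals() compares, so two equal points always produce the same hash code.
import java.util.Objects;

public final class Point {
    private final int x;
    private final int y;

    public Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Point)) return false;
        Point p = (Point) o;
        return x == p.x && y == p.y;
    }

    @Override
    public int hashCode() {
        return Objects.hash(x, y); // same fields as equals()
    }
}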

Related

What is the Complexity of this method that makes use of a HashMap

I know that insertion into a HashMap takes O(1) time complexity, so for inserting n elements the complexity should be O(n). I have a little doubt about the method below.
Code:
private static Map<Character, Character> mapCharacters(String message) {
    Map<Character, Character> map = new HashMap<>();
    char idx = 'a';
    for (char ch : message.toCharArray()) { // a loop - O(n)
        if (!map.containsKey(ch)) { // containsKey - take O(n) to check the whole map
            map.put(ch, idx);
            ++idx;
        }
    }
    return map;
}
What I am doing is mapping the decrypted message to the sequential "abcd...".
So, my confusion is whether the complexity of the above method is O(n^2) or O(n). Please help me clear my doubt.
It's amortised O(n). containsKey and put are both amortised O(1); see What is the time complexity of HashMap.containsKey() in java? in particular.
The time complexity of both containsKey() and put() will depend on the implementation of the equals/hashCode contract of the object that is used as a key.
Under the hood, HashMap maintains an array of buckets. And each bucket corresponds to a range of hashes. Each non-empty bucket contains nodes (each node contains information related to a particular map-entry) that form either a linked list or a tree (since Java 8).
The process of determining the right bucket based on the hash of the provided key is almost instant, unless the process of computing the hash is heavy, but anyway it's considered to have a constant time complexity O(1). Accessing the bucket (i.e. accessing array element) is also O(1), but then things can become tricky. Assuming that bucket contains only a few elements, checking them will not significantly increase the total cost (while performing both containsKey() and put() we should check whether provided key exists in the map) and overall time-complexity will be O(1).
But in the case when all the map entries for some reason end up in the same bucket, iterating over the linked list will cost O(n), and if the nodes are stored as a red-black tree the worst-case cost of both containsKey() and put() will be O(log n).
In the case of Character as a key, we have a proper implementation of the equals/hashCode contract. A proper hash function is important to ensure that objects are spread evenly among the buckets. But if you were instead using a custom object whose hashCode() is broken, for instance always returning the same number, then all entries would end up in the same bucket and performance would be poor.
Collisions - situations when different objects produce hashes that map to the same bucket - are inevitable, and we need a good hash function to keep the number of collisions as low as possible, so that the buckets are "evenly" occupied; the map gets expanded when the number of entries exceeds the load-factor threshold.
The time complexity of this particular method is linear O(n). But it might not be the case if you would change the type of the key.
Given your original code below:
private static Map<Character, Character> mapCharacters(String message) {
    Map<Character, Character> map = new HashMap<>();
    char idx = 'a';
    for (char ch : message.toCharArray()) {
        if (!map.containsKey(ch)) {
            map.put(ch, idx);
            ++idx;
        }
    }
    return map;
}
Here are some observations about the time complexity:
Iterating each character in message will be O(n) – pretty clearly if there are 10 characters it takes 10 steps, 10,000 characters will take 10,000 steps. The time complexity is directly proportional to n.
Inside the loop, you're checking if the map contains that character already – using containsKey(), this is a constant-time operation which is defined as O(1) time complexity. The idea is that the actual time to evaluate the result of containsKey() should be the same whether your map has 10 keys or 10,000 keys.
Inside the if, you have map.put() which is another constant-time operation, another O(1) operation.
Combined together: for every nth character, you spend time iterating it in the loop, checking if the map has it already, and adding it; all of this would be 3n (three times n) which is just O(n).
Separately, you could make a small improvement to your code by replacing this block:
if (!map.containsKey(ch)) {
    map.put(ch, idx);
    ++idx;
}
with something like the following, which retains the same behavior of running ++idx only if you've put something - specifically if the previous mapping was either "not present" or had been set to null (though the "set to null" case doesn't apply to your code sample):
Character previousValue = map.putIfAbsent(ch, idx);
if (previousValue == null) {
    ++idx;
}
Though the time complexity wouldn't change, using putIfAbsent() makes the code clearer, and also takes advantage of well-tested, performant code that's available for free.
Under the covers, Map.putIfAbsent() has the default implementation below (as of OpenJDK 17), showing an initial call to get() followed by put() if the key isn't present (note that HashMap overrides this default with a more direct implementation):
default V putIfAbsent(K key, V value) {
    V v = get(key);
    if (v == null) {
        v = put(key, value);
    }
    return v;
}

How to iterate from a given key in Java TreeMap

I have a TreeMap, and want to get a set of K smallest keys (entries) larger than a given value.
I know we can use higherKey(givenValue) get the one key, but then how should I iterate from there?
One possible way is to get a tailMap from the smallest key larger than the given value, but that seems like overkill if K is small compared with the map size.
Is there a better way with O(logn + K) time complexity?
The error here is thinking that tailMap makes a new map. It doesn't. It gives you a lightweight object that basically just contains a field pointing back to the original map, and the pivot point, that's all it has. Any changes you make to a tailMap map will therefore also affect the underlying map, and any changes made to the underlying map will also affect your tailMap.
for (KeyType key : treeMap.tailMap(pivot).keySet()) {
}
The above IS O(log n) complexity for the tailMap operation, and given that you loop, that adds K to the lot, for a grand total of O(log n + K) time complexity and O(1) space complexity, where n is the size of the map and K is the number of keys that end up being on the selected side of the pivot point (so, worst case, O(n)).
If you want an actual copied-over map, you'd do something like:
TreeMap<KeyType, ValueType> tm = new TreeMap<>(original.tailMap(pivot));
Note that this also copies over the used comparator, in case you specified a custom one. That one really is O(logn + K) in time complexity (and O(K) in space complexity). Of course, if you loop through this code, it's.. still O(logn + K) (because 2K just boils down to K when talking about big-O notation).
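For completeness, here is a small sketch of the "iterate the view and stop after K keys" approach described above (class and method names are mine): it stays O(log n + K) and never copies the map.
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class TailDemo {
    static List<Integer> kSmallestAbove(TreeMap<Integer, String> map, int pivot, int k) {
        List<Integer> result = new ArrayList<>();
        // tailMap(pivot, false) is a view of the keys strictly greater than pivot
        for (Integer key : map.tailMap(pivot, false).keySet()) {
            if (result.size() == k) {
                break;
            }
            result.add(key);
        }
        return result;
    }

    public static void main(String[] args) {
        TreeMap<Integer, String> map = new TreeMap<>();
        for (int i = 0; i < 100; i += 10) {
            map.put(i, "v" + i);
        }
        System.out.println(kSmallestAbove(map, 25, 3)); // [30, 40, 50]
    }
}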

Internal implementation of hashMap

I am studying HashMap's internal implementation and I need some help. HashMap stores data using a linked list; since Java 8 it can also use a tree data structure.
Below is the Node class constructor. Could you please help me understand why a node stores the hash value of the key?
Node(int hash, K key, V value, Node<K,V> next) {
    this.hash = hash;
    this.key = key;
    this.value = value;
    this.next = next;
}
You use the hashes to compute the placement of an element within the HashMap. It would be supremely inefficient if adding a key to your HashMap required re-computation of the hashes every time a collision needed to be resolved. Not only that, but some objects may have expensive hash functions (for example, String hashCode goes through every character in the String). It would always be desirable to compute such functions once and cache them for later (since you should not change hashCode/equals for an object placed within a Map).
Let's consider a HashMap that is half full (n/2 entries), with independent elements being placed into the entry set. There is a 1/2 probability of collision (at minimum) for any given element being added. The number of entries possible to fill the HashMap is n, but the default load factor is 0.75, which means we have 3n/4 - n/2 = n/4 entries left to fill. All of these entries would have to be free of hash collisions, since we save time on each collision (by caching). Assuming the maximum possible probability of 1/2 for not having a collision, we see that there is a probability of (1/2)^(n/4) of no collisions occurring before the HashMap expands. This means that for any sizeable HashMap (n = 29+, out of which only 0.5 * 29 ≈ 15 keys have to be filled), there is a greater than 99% chance that you get time savings from caching hash values.
tl;dr It saves time, especially when collisions become frequent for addition/lookups.
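A simplified sketch of how the cached hash pays off on lookup (this Node is a minimal stand-in of my own, not the JDK class): the stored hash is compared first, and the potentially expensive equals() only runs when the hashes already match.
public class BucketLookup {
    static final class Node<K, V> {
        final int hash;     // the cached hash of the key
        final K key;
        V value;
        Node<K, V> next;

        Node(int hash, K key, V value, Node<K, V> next) {
            this.hash = hash;
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    static <K, V> V getFromBucket(Node<K, V> head, int hash, K key) {
        for (Node<K, V> e = head; e != null; e = e.next) {
            // cheap int comparison first, equals() only on a hash match
            if (e.hash == hash && (e.key == key || (key != null && key.equals(e.key)))) {
                return e.value;
            }
        }
        return null;
    }
}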

SparseArray vs HashMap

I can think of several reasons why HashMaps with integer keys are much better than SparseArrays:
The Android documentation for a SparseArray says "It is generally slower than a traditional HashMap".
If you write code using HashMaps rather than SparseArrays your code will work with other implementations of Map and you will be able to use all of the Java APIs designed for Maps.
If you write code using HashMaps rather than SparseArrays your code will work in non-android projects.
Map overrides equals() and hashCode() whereas SparseArray doesn't.
Yet whenever I try to use a HashMap with integer keys in an Android project, IntelliJ tells me I should use a SparseArray instead. I find this really difficult to understand. Does anyone know any compelling reasons for using SparseArrays?
SparseArray can be used to replace HashMap when the key is a primitive type.
There are some variants for different key/value types, even though not all of them are publicly available.
Benefits are:
Allocation-free
No boxing
Drawbacks:
Generally slower, not indicated for large collections
They won't work in a non-Android project
HashMap<K, V> can be replaced by the following (Android class on the left, the HashMap key/value types it replaces on the right):
SparseArray          <Integer, Object>
SparseBooleanArray   <Integer, Boolean>
SparseIntArray       <Integer, Integer>
SparseLongArray      <Integer, Long>
LongSparseArray      <Long, Object>
LongSparseLongArray  <Long, Long>     // not a public class, but can be copied from the Android source code
In terms of memory, here is an example of SparseIntArray vs HashMap<Integer, Integer> for 1000 elements:
SparseIntArray:
class SparseIntArray {
    int[] keys;
    int[] values;
    int size;
}
Class = 12 + 3 * 4 = 24 bytes
Array = 20 + 1000 * 4 = 4024 bytes
Total = 8,072 bytes
HashMap:
class HashMap<K, V> {
    Entry<K, V>[] table;
    Entry<K, V> forNull;
    int size;
    int modCount;
    int threshold;
    Set<K> keys;
    Set<Entry<K, V>> entries;
    Collection<V> values;
}
Class = 12 + 8 * 4 = 48 bytes
Entry = 32 + 16 + 16 = 64 bytes
Array = 20 + 1000 * 64 = 64024 bytes
Total = 64,136 bytes
Source: Android Memories by Romain Guy from slide 90.
The numbers above are the amount of memory (in bytes) allocated on heap by JVM.
They may vary depending on the specific JVM used.
The java.lang.instrument package contains some helpful methods for advanced operations like checking the size of an object with getObjectSize(Object objectToSize).
Extra info is available from the official Oracle documentation.
Class = 12 bytes + (n instance variables) * 4 bytes
Array = 20 bytes + (n elements) * (element size)
Entry = 32 bytes + (1st element size) + (2nd element size)
I came here just wanting an example of how to use SparseArray. This is a supplemental answer for that.
Create a SparseArray
SparseArray<String> sparseArray = new SparseArray<>();
A SparseArray maps integers to some Object, so you could replace String in the example above with any other Object. If you are mapping integers to integers then use SparseIntArray.
Add or update items
Use put (or append) to add elements to the array.
sparseArray.put(10, "horse");
sparseArray.put(3, "cow");
sparseArray.put(1, "camel");
sparseArray.put(99, "sheep");
sparseArray.put(30, "goat");
sparseArray.put(17, "pig");
Note that the int keys do not need to be in order. This can also be used to change the value at a particular int key.
Remove items
Use remove (or delete) to remove elements from the array.
sparseArray.remove(17); // "pig" removed
The int parameter is the integer key.
Lookup values for an int key
Use get to get the value for some integer key.
String someAnimal = sparseArray.get(99); // "sheep"
String anotherAnimal = sparseArray.get(200); // null
You can use get(int key, E valueIfKeyNotFound) if you want to avoid getting null for missing keys.
Iterate over the items
You can use keyAt and valueAt with an index to loop through the collection, because the SparseArray maintains a separate index distinct from the int keys.
int size = sparseArray.size();
for (int i = 0; i < size; i++) {
    int key = sparseArray.keyAt(i);
    String value = sparseArray.valueAt(i);
    Log.i("TAG", "key: " + key + " value: " + value);
}
// key: 1 value: camel
// key: 3 value: cow
// key: 10 value: horse
// key: 30 value: goat
// key: 99 value: sheep
Note that the keys are ordered in ascending value, not in the order that they were added.
Yet whenever I try to use a HashMap with integer keys in an android
project, intelliJ tells me I should use a SparseArray instead.
It is only a warning, coming from the SparseArray documentation:
It is intended to be more memory efficient than using a HashMap to
map Integers to Objects
SparseArray is designed to be more memory efficient than the regular HashMap: unlike HashMap, it does not leave multiple gaps in its backing array. There is nothing to worry about; you can use the traditional HashMap if you prefer and don't care about the memory allocated on the device.
After some googling, I'll try to add some information to the already posted answers:
Isaac Taylor made a performance comparison of SparseArrays and HashMaps. He states that
the Hashmap and the SparseArray are very similar for data structure
sizes under 1,000
and
when the size has been increased to the 10,000 mark [...] the Hashmap
has greater performance with adding objects, while the SparseArray has
greater performance when retrieving objects. [...] At a size of 100,000 [...] the Hashmap loses performance very quickly
A comparison on Edgblog shows that a SparseArray needs much less memory than a HashMap because of the smaller key (int vs Integer) and the fact that
a HashMap.Entry instance must keep track of the references for the
key, the value and the next entry. Plus it also needs to store the
hash of the entry as an int.
As a conclusion I would say that the difference could matter if you are going to store a lot of data in your Map. Otherwise, just ignore the warning.
A sparse array in Java is a data structure which maps keys to values. Same idea as a Map, but different implementation:
A Map is represented internally as an array of lists, where each element in these lists is a key,value pair. Both the key and value are object instances.
A sparse array is simply made of two arrays: an array of (primitive) keys and an array of (object) values. There can be gaps in these arrays' indices, hence the term “sparse” array.
The main interest of the SparseArray is that it saves memory by using primitives instead of objects as the key.
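For illustration only, here is a stripped-down sketch of that two-array idea (class and method names are mine, not Android's): primitive int keys kept sorted, a parallel object array for values, and binary search on lookup.
import java.util.Arrays;

public class TinySparseArray<E> {
    private int[] keys = new int[8];
    private Object[] values = new Object[8];
    private int size;

    public void put(int key, E value) {
        int i = Arrays.binarySearch(keys, 0, size, key);
        if (i >= 0) {               // key already present: overwrite the value
            values[i] = value;
            return;
        }
        i = ~i;                     // binarySearch encodes the insertion point as ~i
        if (size == keys.length) {  // grow both arrays when full
            keys = Arrays.copyOf(keys, size * 2);
            values = Arrays.copyOf(values, size * 2);
        }
        System.arraycopy(keys, i, keys, i + 1, size - i);
        System.arraycopy(values, i, values, i + 1, size - i);
        keys[i] = key;
        values[i] = value;
        size++;
    }

    @SuppressWarnings("unchecked")
    public E get(int key) {         // O(log n) binary search instead of hashing
        int i = Arrays.binarySearch(keys, 0, size, key);
        return i >= 0 ? (E) values[i] : null;
    }
}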
The android documentation for a SparseArray says "It is generally
slower than a traditional HashMap".
Yes, that's right. But when you have only 10 or 20 items, the performance difference should be insignificant.
If you write code using HashMaps rather than SparseArrays your code
will work with other implementations of Map and you will be able to
use all of the java APIs designed for Maps
I think most often we only use a HashMap to search for a value associated with a key, and SparseArray is really good at that.
If you write code using HashMaps rather than SparseArrays your code
will work in non-android projects.
The source code of SparseArray is fairly simple and easy to understand, so it takes only a little effort to move it to other platforms (through a simple copy & paste).
Map overrides equals() and hashCode() whereas SparseArray doesn't
All I can say is: (to most developers) who cares?
Another important aspect of SparseArray is that it only uses an array to store all elements, while HashMap uses Entry objects, so SparseArray costs significantly less memory than a HashMap; see this.
It's unfortunate that the compiler issues a warning. I guess HashMap has been way overused for storing items.
SparseArrays have their place. Given that they use a binary search algorithm to find a value in an array, you have to consider what you are doing. Binary search is O(log n) while hash lookup is O(1). This doesn't necessarily mean that binary search is slower for a given set of data; however, as the number of entries grows, the power of the hash table takes over. Hence the comments above that a low number of entries can equal, and possibly beat, a HashMap.
A HashMap is only as good as the hash, and it can also be impacted by the load factor (I think in later versions they ignore the load factor so it can be better optimized). They also added a secondary hash to make sure the hash is good. This is also the reason SparseArray works really well for relatively few entries (<100).
I would suggest that if you need a hash table and want better memory usage for primitive integers (no auto-boxing), etc., try out Trove (http://trove.starlight-systems.com - LGPL license). (No affiliation with Trove, I just like their library.)
With the simplified multi-dex building we have you don't even need to repackage trove for what you need. (trove has a lot of classes)

Memory overhead of Java HashMap compared to ArrayList

I am wondering what is the memory overhead of java HashMap compared to ArrayList?
Update:
I would like to improve the speed for searching for specific values of a big pack (6 Millions+) of identical objects.
Thus, I am thinking about using one or several HashMap instead of using ArrayList. But I am wondering what is the overhead of HashMap.
As far as I understand, the key is not stored, only the hash of the key, so it should be something like the size of the hash of the object + one pointer.
But what hash function is used? Is it the one offered by Object or another one?
If you're comparing HashMap with ArrayList, I presume you're doing some sort of searching/indexing of the ArrayList, such as binary search or a custom hash table...? Because a .get(key) through 6 million entries would be infeasible using a linear search.
Using that assumption, I've done some empirical tests and come up with the conclusion that "You can store 2.5 times as many small objects in the same amount of RAM if you use ArrayList with binary search or custom hash map implementation, versus HashMap". My test was based on small objects containing only 3 fields, of which one is the key, and the key is an integer. I used a 32bit jdk 1.6. See below for caveats on this figure of "2.5".
The key things to note are:
(a) it's not the space required for references or "load factor" that kills you, but rather the overhead required for object creation. If the key is a primitive type, or a combination of 2 or more primitive or reference values, then each key will require its own object, which carries an overhead of 8 bytes.
(b) In my experience you usually need the key as part of the value, (e.g. to store customer records, indexed by customer id, you still want the customer id as part of the Customer object). This means it is IMO somewhat wasteful that a HashMap separately stores references to keys and values.
Caveats:
The most common type used for HashMap keys is String. The object creation overhead doesn't apply here so the difference would be less.
I got a figure of 2.8, being 8880502 entries inserted into the ArrayList compared with 3148004 into the HashMap on -Xmx256M JVM, but my ArrayList load factor was 80% and my objects were quite small - 12 bytes plus 8 byte object overhead.
My figure, and my implementation, requires that the key is contained within the value, otherwise I'd have the same problem with object creation overhead and it would be just another implementation of HashMap.
My code:
public class Payload {
    int key, b, c;
    Payload(int _key) { key = _key; }
}
import org.junit.Test;

import java.util.HashMap;
import java.util.Map;

public class Overhead {
    @Test
    public void useHashMap() {
        int i = 0;
        try {
            Map<Integer, Payload> map = new HashMap<Integer, Payload>();
            for (i = 0; i < 4000000; i++) {
                int key = (int) (Math.random() * Integer.MAX_VALUE);
                map.put(key, new Payload(key));
            }
        } catch (OutOfMemoryError e) {
            System.out.println("Got up to: " + i);
        }
    }

    @Test
    public void useArrayList() {
        int i = 0;
        try {
            ArrayListMap map = new ArrayListMap();
            for (i = 0; i < 9000000; i++) {
                int key = (int) (Math.random() * Integer.MAX_VALUE);
                map.put(key, new Payload(key));
            }
        } catch (OutOfMemoryError e) {
            System.out.println("Got up to: " + i);
        }
    }
}
import java.util.ArrayList;

public class ArrayListMap {
    private ArrayList<Payload> map = new ArrayList<Payload>();
    private int[] primes = new int[128];

    static boolean isPrime(int n) {
        for (int i = (int) Math.sqrt(n); i >= 2; i--) {
            if (n % i == 0)
                return false;
        }
        return true;
    }

    ArrayListMap() {
        for (int i = 0; i < 11000000; i++) // this is clumsy, I admit
            map.add(null);
        int n = 31;
        for (int i = 0; i < 128; i++) {
            while (!isPrime(n))
                n += 2;
            primes[i] = n;
            n += 2;
        }
        System.out.println("Capacity = " + map.size());
    }

    public void put(int key, Payload value) {
        int hash = key % map.size();
        int hash2 = primes[key % primes.length];
        if (hash < 0)
            hash += map.size();
        do {
            if (map.get(hash) == null) {
                map.set(hash, value);
                return;
            }
            hash += hash2;
            if (hash >= map.size())
                hash -= map.size();
        } while (true);
    }

    public Payload get(int key) {
        int hash = key % map.size();
        int hash2 = primes[key % primes.length];
        if (hash < 0)
            hash += map.size();
        do {
            Payload payload = map.get(hash);
            if (payload == null)
                return null;
            if (payload.key == key)
                return payload;
            hash += hash2;
            if (hash >= map.size())
                hash -= map.size();
        } while (true);
    }
}
The simplest thing would be to look at the source and work it out that way. However, you're really comparing apples and oranges - lists and maps are conceptually quite distinct. It's rare that you would choose between them on the basis of memory usage.
What's the background behind this question?
All that is stored in either is pointers. Depending on your architecture a pointer should be 32 or 64 bits (or more or less)
An array list of 10 tends to allocate 10 "Pointers" at a minimum (and also some one-time overhead stuff).
A map has to allocate twice that (20 pointers) because it stores two values at a time. Then, on top of that, it has to store the hash table, which should be bigger than the map; at a load factor of 75% it SHOULD be around 13 32-bit values (hashes).
so if you want an offhand answer, the ratio should be about 1:3.25 or so, but you are only talking pointer storage--very small unless you are storing a massive number of objects--and if so, the utility of being able to reference instantly (HashMap) vs iterate (array) should be MUCH more significant than the memory size.
Oh, also:
Arrays can be fit to the exact size of your collection. HashMaps can as well if you specify the size, but if it "Grows" beyond that size, it will re-allocate a larger array and not use some of it, so there can be a little waste there as well.
I don't have an answer for you either, but a quick google search turned up a function in Java that might help.
Runtime.getRuntime().freeMemory();
So I propose that you populate a HashMap and an ArrayList with the same data. Record the free memory, delete the first object, record memory, delete the second object, record the memory, compute the differences,..., profit!!!
You should probably do this with magnitudes of data. ie Start with 1000, then 10000, 100000, 1000000.
EDIT: Corrected, thanks to amischiefr.
EDIT:
Sorry for editing your post, but this is pretty important if you are going to use this (and it's a little much for a comment).
freeMemory does not work like you think it would. First, its value is changed by garbage collection. Secondly, its value is changed when Java allocates more memory. Just using the freeMemory call alone doesn't provide useful data.
Try this:
public static void displayMemory() {
    Runtime r = Runtime.getRuntime();
    r.gc();
    r.gc(); // YES, you NEED 2!
    System.out.println("Memory Used=" + (r.totalMemory() - r.freeMemory()));
}
Or you can return the memory used and store it, then compare it to a later value. Either way, remember the two gc() calls and to subtract freeMemory() from totalMemory().
Again, sorry to edit your post!
HashMaps try to maintain a load factor (usually 75% full); you can think of a HashMap as a sparsely filled ArrayList. The problem with a straight-up size comparison is that this load factor means the map's capacity grows ahead of the size of the data. ArrayList, on the other hand, grows to meet its need by enlarging its internal array (by roughly 50% at a time in current JDK implementations).
In either case I recommend priming the expected size of the data before you start adding. This will give the implementations a better initial setting and will likely consume less over all in both cases.
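As an illustration of that priming advice (the sizes and names below are just an example, assuming HashMap's default 0.75 load factor):
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Presize {
    public static void main(String[] args) {
        int expected = 6_000_000;

        // ArrayList: pass the expected element count directly.
        List<Integer> list = new ArrayList<>(expected);

        // HashMap: keep the table below the 0.75 load factor so it never resizes;
        // the "+ 1" guards against rounding.
        Map<Integer, Integer> map = new HashMap<>((int) (expected / 0.75f) + 1);

        System.out.println("created with capacity for " + expected + " elements: "
                + list.isEmpty() + ", " + map.isEmpty());
    }
}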
Update:
based on your updated problem check out Glazed lists. This is a neat little tool written by some of the Google people for doing operations similar to the one you describe. It's also very quick. Allows clustering, filtering, searching, etc.
HashMap holds a reference to the value and a reference to the key.
ArrayList just holds a reference to the value.
So, assuming that the key uses the same amount of memory as the value, HashMap uses 50% more memory (although strictly speaking, it is not the HashMap itself that uses that memory, because it just keeps references to it).
On the other hand, HashMap provides constant-time performance for the basic operations (get and put). So, although it may use more memory, getting an element may be much faster using a HashMap than an ArrayList.
So, the next thing you should do is not care about which uses more memory, but about what they are good for.
Using the correct data structure for your program saves more CPU/memory than how the library is implemented underneath.
EDIT
After Grant Welch's answer, I decided to measure for 2,000,000 integers.
Here's the source code
This is the output
$
$javac MemoryUsage.java
Note: MemoryUsage.java uses unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.
$java -Xms128m -Xmx128m MemoryUsage
Using ArrayListMemoryUsage#8558d2 size: 0
Total memory: 133.234.688
Initial free: 132.718.608
Final free: 77.965.488
Used: 54.753.120
Memory Used 41.364.824
ArrayListMemoryUsage#8558d2 size: 2000000
$
$java -Xms128m -Xmx128m MemoryUsage H
Using HashMapMemoryUsage#8558d2 size: 0
Total memory: 133.234.688
Initial free: 124.329.984
Final free: 4.109.600
Used: 120.220.384
Memory Used 129.108.608
HashMapMemoryUsage#8558d2 size: 2000000
Basically, you should be using the "right tool for the job". Since there are different instances where you'll need a key/value pair (where you may use a HashMap) and different instances where you'll just need a list of values (where you may use a ArrayList) then the question of "which one uses more memory", in my opinion, is moot, since it is not a consideration of choosing one over the other.
But to answer the question, since HashMap stores key/value pairs while ArrayList stores just values, I would assume that the addition of keys alone to the HashMap would mean that it takes up more memory, assuming, of course, we are comparing them by the same value type (e.g. where the values in both are Strings).
I think the wrong question is being asked here.
If you would like to improve the speed at which you can search for an object in a List containing six million entries, then you should look into how fast these datatype's retrieval operations perform.
As usual, the Javadocs for these classes state pretty plainly what type of performance they offer:
HashMap:
This implementation provides constant-time performance for the basic operations (get and put), assuming the hash function disperses the elements properly among the buckets.
This means that HashMap.get(key) is O(1).
ArrayList:
The size, isEmpty, get, set, iterator, and listIterator operations run in constant time. The add operation runs in amortized constant time, that is, adding n elements requires O(n) time. All of the other operations run in linear time (roughly speaking).
This means that most of ArrayList's operations are O(1), but likely not the ones that you would be using to find objects that match a certain value.
If you are iterating over every element in the ArrayList and testing for equality, or using contains(), then this means that your operation is running at O(n) time (or worse).
If you are unfamiliar with O(1) or O(n) notation, this is referring to how long an operation will take. In this case, if you can get constant-time performance, you want to take it. If HashMap.get() is O(1) this means that retrieval operations take roughly the same amount of time regardless of how many entries are in the Map.
The fact that something like ArrayList.contains() is O(n) means that the amount of time it takes grows as the size of the list grows; so iterating through an ArrayList with six million entries will not be very efficient at all.
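To make that concrete, here is a small, hypothetical sketch (the Customer type and field names are mine) of replacing a linear ArrayList scan with a one-off HashMap index:
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class IndexExample {
    // Hypothetical record type standing in for the "big pack of identical objects".
    static final class Customer {
        final int id;
        final String name;
        Customer(int id, String name) { this.id = id; this.name = name; }
    }

    public static void main(String[] args) {
        List<Customer> customers = new ArrayList<>();
        customers.add(new Customer(42, "Ada"));
        customers.add(new Customer(7, "Linus"));

        // Linear search: O(n) per lookup.
        Customer found = null;
        for (Customer c : customers) {
            if (c.id == 42) { found = c; break; }
        }

        // One-off O(n) pass to build an index, then O(1) per lookup.
        Map<Integer, Customer> byId = new HashMap<>();
        for (Customer c : customers) {
            byId.put(c.id, c);
        }
        System.out.println(found == byId.get(42)); // true
    }
}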
I don't know the exact numbers, but HashMaps are much heavier. Comparing the two, ArrayList's internal representation is self-evident, but HashMaps retain Entry objects, which can balloon your memory consumption.
It's not that much larger, but it's larger. A great way to visualize this would be with a dynamic profiler such as YourKit which allows you to see all heap allocations. It's pretty nice.
This post gives a lot of information about object sizes in Java.
If you're considering two ArrayLists vs one Hashmap, it's indeterminate; both are partially-full data structures. If you were comparing Vector vs Hashtable, Vector is probably more memory efficient, because it only allocates the space it uses, whereas Hashtables allocate more space.
If you need a key-value pair and aren't doing incredibly memory-hungry work, just use the Hashmap.
As Jon Skeet noted, these are completely different structures. A map (such as HashMap) is a mapping from one value to another - i.e. you have a key that maps to a value, in a Key->Value kind of relationship. The key is hashed, and is placed in an array for quick lookup.
A List, on the other hand, is a collection of elements with order - ArrayList happens to use an array as the back end storage mechanism, but that is irrelevant. Each indexed element is a single element in the list.
edit: based on your comment, I have added the following information:
The key is stored in a hashmap. This is because a hash is not guaranteed to be unique for any two different elements. Thus, the key has to be stored in the case of hashing collisions. If you simply want to see if an element exists in a set of elements, use a Set (the standard implementation of this being HashSet). If the order matters, but you need a quick lookup, use a LinkedHashSet, as it keeps the order the elements were inserted. The lookup time is O(1) on both, but the insertion time is slightly longer on a LinkedHashSet. Use a Map only if you are actually mapping from one value to another - if you simply have a set of unique objects, use a Set, if you have ordered objects, use a List.
This site lists the memory consumption for several commonly (and not so commonly) used data structures. From there one can see that the HashMap takes roughly 5 times the space of an ArrayList. The map will also allocate one additional object per entry.
If you need a predictable iteration order and use a LinkedHashMap, the memory consumption will be even higher.
You can do your own memory measurements with Memory Measurer.
There are two important facts to note however:
A lot of data structures (including ArrayList and HashMap) allocate more space than they currently need, because otherwise they would have to frequently execute a costly resize operation. Thus the memory consumption per element depends on how many elements are in the collection. For example, an ArrayList with the default settings uses the same memory for 0 to 10 elements.
As others have said, the keys of the map are stored, too. So if they are not in memory anyway, you will have to add this memory cost, too. An additional object will usually take 8 bytes of overhead alone, plus the memory for its fields, and possibly some padding. So this will also be a lot of memory.
