ArrayList and HashSet insert performance test results confuse me - java

I wrote a class to compare the insert performance of ArrayList and HashSet. As I expected (maybe the book deceived me), the HashSet insert performance should be much better than the ArrayList's, but the test results leave me confused:
HashSet<String> hashSet = new HashSet<String>();
long start = System.currentTimeMillis();
for (int i = 0; i < 900000; i++) {
    hashSet.add(String.valueOf(i));
}
System.out.println("Insert HashSet Time: " + (System.currentTimeMillis() - start));

ArrayList<String> arrayList = new ArrayList<String>();
start = System.currentTimeMillis();
for (int i = 0; i < 900000; i++) {
    arrayList.add(String.valueOf(i));
}
System.out.println("Insert ArrayList Time: " + (System.currentTimeMillis() - start));
result:
Insert HashSet Time: 978
Insert ArrayList Time: 287
I have run this main method many times and the results never differ much from these: the ArrayList insert time is consistently much shorter than the HashSet insert time.
Can anybody explain this weird result?

A HashSet and a List are different types of data structures, so you should think about what you want to do with them before choosing one.
HashSet
Slower insert time
Fast lookup of elements (contains)
List
Fast append time
Slow lookup when searching for an element by value
The list is faster because it can just append the element at the end; the HashSet has to work out where to insert the element and then make it accessible there, which is more work (time) than adding it to the end of a list.

Exact performance characteristics of data structures and algorithms are highly machine- and implementation-specific. However, it doesn't seem surprising to me that ArrayList inserts would be faster than HashSet inserts by a constant factor. To insert into an ArrayList, you just need to set a value at a particular index in an array. To insert into a hash set, you need to compute a hashcode for the inserted item and map that to an array index, check that index and possibly perform some action based on what you find, and finally insert into the array. Furthermore, the HashSet will have worse memory locality, so you'll get cache misses more often.
There's also the question of array resizing, which both data structures will need to do, but both data structures will need to resize at about the same rate (and hash table resizing is probably more expensive by a constant factor, too, due to rehashing).
Both algorithms are constant (expected) time, but there's a lot more stuff to do with a hash table than an array list. So it's not surprising that it would be slower by a constant factor. (Again, the exact difference is highly dependent on machine and implementation.)
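If you want to repeat the measurement with a bit less noise, here is a minimal sketch of a benchmark that warms up both code paths before timing them, so the JIT compiler does not penalize whichever collection happens to run first. The class and method names are my own, and this is still far from a rigorous benchmark:

import java.util.ArrayList;
import java.util.HashSet;

public class InsertBenchmark {
    static long timeHashSetInsert(int n) {
        HashSet<String> set = new HashSet<>();
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            set.add(String.valueOf(i));
        }
        return System.nanoTime() - start;
    }

    static long timeArrayListInsert(int n) {
        ArrayList<String> list = new ArrayList<>();
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            list.add(String.valueOf(i));
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int n = 900000;
        // warm-up passes so the JIT compiles both paths before the timed runs
        for (int i = 0; i < 5; i++) {
            timeHashSetInsert(n);
            timeArrayListInsert(n);
        }
        System.out.println("HashSet insert (ms):   " + timeHashSetInsert(n) / 1_000_000);
        System.out.println("ArrayList insert (ms): " + timeArrayListInsert(n) / 1_000_000);
    }
}

Even with the warm-up, the HashSet run will normally stay slower, for exactly the reasons described above; the sketch only makes the constant factor you measure a little more trustworthy.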

the HashSet insert performance will be much better than ArrayList
Where did you get that idea?
HashSet will outperform ArrayList on search, i.e. contains().
But on insert they have comparable performance. Actually, ArrayList can even be faster if you stay within the backing array's capacity (no resize needed) and the hash function is not good.
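As an illustration of the "no resize needed" case, both collections can be presized so that neither has to grow during the 900,000 inserts from the question. A small sketch (0.75f is HashSet's default load factor):

// 900000 / 0.75 = 1200000 buckets, so the HashSet never needs to rehash
ArrayList<String> arrayList = new ArrayList<String>(900000);
HashSet<String> hashSet = new HashSet<String>(1200000, 0.75f);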

Actually, you are getting the right results. Also, as pointed out in the answer above, these are different types of data structures; comparing them is like comparing the speed of a bike with that of a car. I think the time for inserting into a HashSet has to be more than that for inserting into an ArrayList, because a HashSet doesn't allow duplicate elements: before each insertion there must be some sort of check for duplicates and some handling of them, which makes it somewhat slower than an ArrayList.

HashSet is backed by a hash table (in fact by a HashMap). If you know about hash tables, you know that a hash function has to be computed for every element you add, and that collisions (two elements landing in the same bucket) have to be handled; if an equal element is already present, the HashSet simply keeps the existing one. On top of that, when the capacity is reached the table needs to be resized and the elements possibly re-hashed, which is very slow.
ArrayList just appends the object to the end of the list; when its capacity is reached, it only has to resize the backing array.

Related

How can I create a find method and an insert method with O(1) time complexity in a custom HashMap implementation?

I am making a custom implementation of a HashMap, without using the HashMap data structure, as part of an assignment. Currently I have the choice of working with two 1D arrays or using a 2D array to store my keys and values. I want to be able to check if a key exists and return the corresponding value in O(1) time complexity (an assignment requirement), but I am assuming it has to be done without using containsKey().
Also, when inserting key/value pairs into my arrays, I am confused because it should not be O(1) logically: there would occasionally be cases where there is a collision and I have to recalculate the index, so why is the assignment requirement for insertion O(1)?
A lot of questions in there, let me give it a try.
I want to be able to check if a key exists and return the corresponding
value in O(1) time complexity (an assignment requirement), but I am
assuming it has to be done without using containsKey().
That actually doesn't make a difference. O(1) means the execution time is independent of the input, it does not mean a single operation is used. If your containsKey() and put() implementations are both O(1), then so is your solution that uses both of them exactly once.
Also, when inserting key/value pairs into my arrays, I am confused
because it should not be O(1) logically: there would occasionally be
cases where there is a collision and I have to recalculate the index,
so why is the assignment requirement for insertion O(1)?
O(1) is the best case, which assumes that there are no hash collisions. The worst case is O(n) if every key generates the same hash code. So when a hash map's lookup or insertion performance is calculated as O(1), that assumes a perfect hashCode implementation.
Finally, when it comes to the data structure, the usual approach is to use a single array whose items are linked-list nodes. The array offset corresponds to hashCode() % arraySize (there are much more advanced formulas than this, but it is a good starting point). In case of a hash collision, you have to walk the linked-list nodes until you find the correct entry.
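A minimal sketch of that single-array, chained layout might look like the following. The class and field names are mine, and the fixed capacity with no resizing is a deliberate simplification:

public class SimpleHashMap<K, V> {
    private static class Node<K, V> {
        final K key;
        V value;
        Node<K, V> next;   // next entry in the same bucket (the collision chain)
        Node(K key, V value, Node<K, V> next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    private final Node<K, V>[] buckets;

    @SuppressWarnings("unchecked")
    public SimpleHashMap(int capacity) {
        buckets = (Node<K, V>[]) new Node[capacity];
    }

    // hashCode() can be negative, so clear the sign bit before taking the modulus
    private int indexFor(K key) {
        return (key.hashCode() & 0x7fffffff) % buckets.length;
    }

    public void put(K key, V value) {                       // expected O(1)
        int i = indexFor(key);
        for (Node<K, V> n = buckets[i]; n != null; n = n.next) {
            if (n.key.equals(key)) {                        // key already present
                n.value = value;
                return;
            }
        }
        buckets[i] = new Node<>(key, value, buckets[i]);    // prepend to the chain
    }

    public V get(K key) {                                   // expected O(1)
        for (Node<K, V> n = buckets[indexFor(key)]; n != null; n = n.next) {
            if (n.key.equals(key)) {
                return n.value;
            }
        }
        return null;                                        // key not found
    }
}

As long as the chains stay short (i.e. the hash function spreads keys well and the table is large enough), both put and get do a constant amount of work on average, which is the O(1) the assignment is asking for.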
You're correct in that hash table insert is not guaranteed to be O(1) due to collisions. If you use the open addressing strategy to deal with collisions, the process to insert an item is going to take time proportional to 1/(1-a) where a is the proportion of how much of the table capacity has been used. As the table fills up, a goes to 1, and the time to insert grows without bound.
The secret to keeping the time complexity at O(1) is making sure that there's always room in the table. That way a never grows too big. That's why you have to resize the table when it starts to run out of capacity.
Problem: resizing a table with N items takes O(N) time.
Solution: increase the capacity exponentially, for example double it every time you need to resize. This way the table has to be resized very rarely. The cost of the occasional resize operations is "amortized" over a large number of insertions, and that's why people say that hash table insertion has "amortized O(1)" time complexity.
TLDR: Make sure you increase the table capacity when it's getting full, maybe 70-80% utilization. When you increase the capacity, make sure that it's by a constant factor, for example doubling it.
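A quick back-of-the-envelope sketch shows why the occasional O(N) resize does not hurt the amortized bound. It counts how many elements get copied if we double a table every time it passes 75% full (the starting capacity of 16 and the 75% threshold are just illustrative assumptions):

int capacity = 16;
long copies = 0;
for (int inserted = 1; inserted <= 900000; inserted++) {
    if (inserted > capacity * 3 / 4) {   // table would be more than 75% full
        copies += inserted - 1;          // rehashing moves every element stored so far
        capacity *= 2;                   // double the capacity
    }
}
System.out.println("elements copied across all resizes: " + copies);
// prints a number below 2 * 900000, so the copying cost per insert is a small constant

Because each doubling copies at most as many elements as have been inserted so far, the total copying work stays proportional to the number of inserts, which is exactly what "amortized O(1)" means.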

How to quickly find the indexes, in a massive ArrayList, of a very large number of strings taken from that ArrayList in Java?

Suppose that I have a collection of 50 million different strings in a Java ArrayList. Let foo be a set of 40 million arbitrarily chosen (but fixed) strings from the previous collection. I want to know the index of every string in foo in the ArrayList.
An obvious way to do this would be to iterate through the whole ArrayList until we find a match for the first string in foo, then for the second one, and so on. However, this solution would take an extremely long time (also considering that 50 million is an arbitrarily large number I picked for the example; the collection could be in the order of hundreds of millions or even billions, but it is given from the beginning and remains constant).
I thought then of using a Hashtable of fixed size 50 million in order to determine the index of a given string in foo using someStringInFoo.hashCode(). However, from my understanding of Java's Hashtable, it seems that this will fail if there are collisions as calling hashCode() will produce the same index for two different strings.
Lastly, I thought about first sorting the ArrayList with the sort(List<T> list) in Java's Collections and then using binarySearch(List<? extends T> list,T key,Comparator<? super T> c) to obtain the index of the term. Is there a more efficient solution than this or is this as good as it gets?
You need an additional data structure that is optimized for searching strings. It will map each string to its index. The idea is that you iterate over your original list populating that data structure, and then iterate over your set, performing searches in it.
What structure should you choose?
There are three options worth considering:
Java's HashMap
TRIE
Java's IdentityHashMap
The first option is simple to implement (a minimal sketch follows the three options below) but does not provide the best possible performance. Still, its population time of O(N * R) is better than sorting the list, which is O(R * N * log N), and its searching time is better than in a sorted String list (amortized O(R) compared to O(R log N)).
Where R is the average length of your strings.
The second option is always good for maps of strings, providing a guaranteed population time for your case of O(R * N) and a guaranteed worst-case searching time of O(R). Its only disadvantage is that there is no out-of-the-box implementation in the Java standard libraries.
The third option is a bit tricky and suitable only for your case. In order to make it work you need to ensure that the strings from the first list are literally reused in the second list (are the same objects). Using IdentityHashMap eliminates String's equals cost (the R above), as IdentityHashMap compares strings by address, taking only O(1). The population cost will be amortized O(N) and the search cost amortized O(1). So this solution provides the best performance and an out-of-the-box implementation. However, please note that this solution will work only if there are no duplicates in the original list.
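Here is the minimal sketch of the first option mentioned above: build a HashMap from each string to its index, then resolve every element of foo with one expected O(1) lookup. The method name is mine, bigList stands for the original collection, foo is the name from the question, and the sketch assumes (as the question states) that every string in foo really occurs in the list:

// assumes the usual java.util imports (List, Collection, Map, HashMap)
static int[] indexesOf(List<String> bigList, Collection<String> foo) {
    Map<String, Integer> indexOf = new HashMap<>(bigList.size() * 2);
    for (int i = 0; i < bigList.size(); i++) {
        indexOf.putIfAbsent(bigList.get(i), i);   // keeps the first index if a string repeated
    }
    int[] result = new int[foo.size()];
    int j = 0;
    for (String s : foo) {
        result[j++] = indexOf.get(s);             // one expected O(1) lookup per string in foo
    }
    return result;
}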
If you have any questions please let me know.
You can use a Java Hashtable with no problems. According to the Java Documentation "in the case of a "hash collision", a single bucket stores multiple entries, which must be searched sequentially."
I think you have a misconception on how hash tables work. Hash collisions do NOT ruin the implementation. A hash table is simply an array of linked-lists. Each key goes through a hash function to determine the index in the array which the element will be placed. If a hash collision occurs, the element will be placed at the end of the linked-list at the index in the hash-table array. See link below for diagram.

Which data structure in Java can add a key/value pair in constant time and stay sorted by value?

Basically I'm looking for the best data structure in Java in which I can store pairs and retrieve the top N elements by value. I'd like to do this in O(n) time, where n is the number of entries in the data structure.
Example input would be:
<"john", 32>
<"dave", 3>
<"brian", 15>
<"jenna", 23>
<"rachael", 41>
and if N=3, I should be able to return rachael, john, jenna if I wanted descending order.
If I use some kind of HashMap, insertion is fast, but retrieving the entries in order gets expensive.
If I use some data structure that keeps things ordered, then insertion becomes expensive while retrieval is cheaper. I was not able to find a data structure that can do both very well and very fast.
Any input is appreciated. Thanks.
[updated]
Let me ask the question another way, if that makes it clearer.
I know I can insert into a HashMap in constant O(1) time.
Now, how can I retrieve the elements sorted by value in O(n) time, where n = the number of entries in the data structure? I hope that makes sense.
If you want to sort, you have to give up constant O(1) time.
That is because unlike inserting an unsorted key / value pair, sorting will minimally require you to compare the new entry to something, and odds are to a number of somethings. Once you have an algorithm that will require more time with more entries (due to more comparisons) you have overshot "constant" time.
If you can do better, then by all means, do so! There is a Dijkstra prize awaiting for you, if not a Fields Medal to boot.
Don't despair, you can still do the key part with a HashMap and the sorting part with a tree-like implementation; that will give you O(log n). TreeMap is probably what you desire.
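A minimal sketch of that combination, using the pairs from the question: a HashMap for name-to-value lookups plus a TreeMap keyed by value (largest first) for the ordered part. The class and field names are mine, and updating an existing name's value is left out (it would also require removing the name from its old tree bucket):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TopNByValue {
    private final Map<String, Integer> byName = new HashMap<>();
    private final TreeMap<Integer, List<String>> byValue = new TreeMap<>(Comparator.reverseOrder());

    public void put(String name, int value) {                 // O(log n) because of the tree insert
        byName.put(name, value);
        byValue.computeIfAbsent(value, v -> new ArrayList<>()).add(name);
    }

    public List<String> topN(int n) {
        List<String> result = new ArrayList<>(n);
        for (List<String> names : byValue.values()) {         // values come out largest-value first
            for (String name : names) {
                if (result.size() == n) {
                    return result;
                }
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        TopNByValue scores = new TopNByValue();
        scores.put("john", 32);
        scores.put("dave", 3);
        scores.put("brian", 15);
        scores.put("jenna", 23);
        scores.put("rachael", 41);
        System.out.println(scores.topN(3));   // prints [rachael, john, jenna]
    }
}

Insertion is no longer O(1), which is exactly the trade-off described above.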
--- Update to match your update ---
No, you cannot iterate over a HashMap in sorted order in O(n) time. To do so would assume that you had a list, but that list would have to already be sorted. With a raw HashMap, you would have to search the entire map for the next "lower" value. Searching only part of the map would not do, because the one element you didn't check might be the correct value.
Now, there are some data structures that make a lot of trade offs which might get you closer. If you want to roll your own, perhaps a custom Fibonacci heap can give you an amortized performance close to what you wish, but it cannot guarantee a worst-case performance. In any case, some operations (like extract-min) will still require O(log n) performance.

Time complexity of set in Java

Can someone tell me the time complexity of the below code?
a is an array of int.
Set<Integer> set = new HashSet<Integer>();
for (int i = 0; i < a.length; i++) {
    if (set.contains(a[i])) {
        System.out.println("Hello");
    }
    set.add(a[i]);
}
I think that it is O(n), but I'm not sure, since it is using a Set and its contains method as well. It is also calling the add method of the set.
Can anyone confirm and explain what the time complexity of the entire above code is? Also, how much space would it take?
I believe it's O(n), because you loop over the array, and contains and add should be constant time since it's a hash-based set. If it were not hash based and required iterating over the entire set to do lookups, the upper bound would be O(n^2).
Integers are immutable, so the space complexity would be 2n, which I believe simplifies to just n, since constants don't matter.
If you had objects in the array and set, then you would have 2n references and n objects, so you are at 3n, which is still linear (times a constant) space constraints.
EDIT-- yep "This class offers constant time performance for the basic operations (add, remove, contains and size), assuming the hash function disperses the elements properly among the buckets."
see here.
Understanding HashSet is the key to this question.
According to the HashSet Javadoc:
This class implements the Set interface, backed by a hash table
(actually a HashMap instance)...This class offers constant time
performance for the basic operations (add, remove, contains and size)
A more complete explanation of HashSet: https://www.geeksforgeeks.org/hashset-contains-method-in-java/?ref=rp
So HashSet insert and contains are O(1). (HashSet is based on HashMap, and its memory complexity is O(n).)
The rest is simple: the main array you are looping over has n elements, so the total complexity of the function will be O(n).
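As a side note, the check-then-add pattern from the question can also be written using the boolean that Set.add returns; add returns false when the element was already present, so the behaviour (and the O(n) time, O(n) space bound) is unchanged. A small sketch, assuming the int[] a from the question:

Set<Integer> set = new HashSet<Integer>();
for (int i = 0; i < a.length; i++) {
    if (!set.add(a[i])) {              // false means a[i] was already in the set
        System.out.println("Hello");
    }
}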

Memory overhead of Java HashMap compared to ArrayList

I am wondering what the memory overhead of Java's HashMap is compared to ArrayList?
Update:
I would like to improve the speed for searching for specific values of a big pack (6 Millions+) of identical objects.
Thus, I am thinking about using one or several HashMap instead of using ArrayList. But I am wondering what is the overhead of HashMap.
As far as I understand, the key is not stored, only the hash of the key, so it should be something like the size of the hash of the object plus one pointer.
But what hash function is used? Is it the one offered by Object or another one?
If you're comparing HashMap with ArrayList, I presume you're doing some sort of searching/indexing of the ArrayList, such as binary search or custom hash table...? Because a .get(key) thru 6 million entries would be infeasible using a linear search.
Using that assumption, I've done some empirical tests and come up with the conclusion that "You can store 2.5 times as many small objects in the same amount of RAM if you use ArrayList with binary search or custom hash map implementation, versus HashMap". My test was based on small objects containing only 3 fields, of which one is the key, and the key is an integer. I used a 32bit jdk 1.6. See below for caveats on this figure of "2.5".
The key things to note are:
(a) it's not the space required for references or "load factor" that kills you, but rather the overhead required for object creation. If the key is a primitive type, or a combination of 2 or more primitive or reference values, then each key will require its own object, which carries an overhead of 8 bytes.
(b) In my experience you usually need the key as part of the value, (e.g. to store customer records, indexed by customer id, you still want the customer id as part of the Customer object). This means it is IMO somewhat wasteful that a HashMap separately stores references to keys and values.
Caveats:
The most common type used for HashMap keys is String. The object creation overhead doesn't apply here so the difference would be less.
I got a figure of 2.8, being 8880502 entries inserted into the ArrayList compared with 3148004 into the HashMap on -Xmx256M JVM, but my ArrayList load factor was 80% and my objects were quite small - 12 bytes plus 8 byte object overhead.
My figure, and my implementation, requires that the key is contained within the value, otherwise I'd have the same problem with object creation overhead and it would be just another implementation of HashMap.
My code:
public class Payload {
    int key, b, c;
    Payload(int _key) { key = _key; }
}
import org.junit.Test;
import java.util.HashMap;
import java.util.Map;

public class Overhead {
    @Test
    public void useHashMap()
    {
        int i = 0;
        try {
            Map<Integer, Payload> map = new HashMap<Integer, Payload>();
            for (i = 0; i < 4000000; i++) {
                int key = (int) (Math.random() * Integer.MAX_VALUE);
                map.put(key, new Payload(key));
            }
        }
        catch (OutOfMemoryError e) {
            System.out.println("Got up to: " + i);
        }
    }

    @Test
    public void useArrayList()
    {
        int i = 0;
        try {
            ArrayListMap map = new ArrayListMap();
            for (i = 0; i < 9000000; i++) {
                int key = (int) (Math.random() * Integer.MAX_VALUE);
                map.put(key, new Payload(key));
            }
        }
        catch (OutOfMemoryError e) {
            System.out.println("Got up to: " + i);
        }
    }
}
import java.util.ArrayList;

public class ArrayListMap {
    private ArrayList<Payload> map = new ArrayList<Payload>();
    private int[] primes = new int[128];

    static boolean isPrime(int n)
    {
        for (int i = (int) Math.sqrt(n); i >= 2; i--) {
            if (n % i == 0)
                return false;
        }
        return true;
    }

    ArrayListMap()
    {
        for (int i = 0; i < 11000000; i++) // this is clumsy, I admit
            map.add(null);
        int n = 31;
        for (int i = 0; i < 128; i++) {
            while (!isPrime(n))
                n += 2;
            primes[i] = n;
            n += 2;
        }
        System.out.println("Capacity = " + map.size());
    }

    public void put(int key, Payload value)
    {
        int hash = key % map.size();
        int hash2 = primes[key % primes.length];
        if (hash < 0)
            hash += map.size();
        do {
            if (map.get(hash) == null) {
                map.set(hash, value);
                return;
            }
            hash += hash2;
            if (hash >= map.size())
                hash -= map.size();
        } while (true);
    }

    public Payload get(int key)
    {
        int hash = key % map.size();
        int hash2 = primes[key % primes.length];
        if (hash < 0)
            hash += map.size();
        do {
            Payload payload = map.get(hash);
            if (payload == null)
                return null;
            if (payload.key == key)
                return payload;
            hash += hash2;
            if (hash >= map.size())
                hash -= map.size();
        } while (true);
    }
}
The simplest thing would be to look at the source and work it out that way. However, you're really comparing apples and oranges - lists and maps are conceptually quite distinct. It's rare that you would choose between them on the basis of memory usage.
What's the background behind this question?
All that is stored in either is pointers. Depending on your architecture a pointer should be 32 or 64 bits (or more or less)
An array list of 10 tends to allocate 10 "Pointers" at a minimum (and also some one-time overhead stuff).
A map has to allocate twice that (20 pointers) because it stores two values at a time. On top of that, it has to store the hash table itself, which should be bigger than the map: at a load factor of 75% it should be around 13 32-bit values (hashes).
so if you want an offhand answer, the ratio should be about 1:3.25 or so, but you are only talking pointer storage--very small unless you are storing a massive number of objects--and if so, the utility of being able to reference instantly (HashMap) vs iterate (array) should be MUCH more significant than the memory size.
Oh, also:
Arrays can be fit to the exact size of your collection. HashMaps can as well if you specify the size, but if it "Grows" beyond that size, it will re-allocate a larger array and not use some of it, so there can be a little waste there as well.
I don't have an answer for you either, but a quick google search turned up a function in Java that might help.
Runtime.getRuntime().freeMemory();
So I propose that you populate a HashMap and an ArrayList with the same data. Record the free memory, delete the first object, record memory, delete the second object, record the memory, compute the differences,..., profit!!!
You should probably do this with magnitudes of data. ie Start with 1000, then 10000, 100000, 1000000.
EDIT: Corrected, thanks to amischiefr.
EDIT:
Sorry for editing your post, but this is pretty important if you are going to use this (and it's a little much for a comment).
freeMemory does not work like you think it would. First, its value is changed by garbage collection. Secondly, its value is changed when Java allocates more memory. Using the freeMemory call alone doesn't provide useful data.
Try this:
public static void displayMemory() {
    Runtime r = Runtime.getRuntime();
    r.gc();
    r.gc(); // YES, you NEED 2!
    System.out.println("Memory Used=" + (r.totalMemory() - r.freeMemory()));
}
Or you can return the memory used and store it, then compare it to a later value. Either way, remember the two gc() calls and subtract freeMemory() from totalMemory().
Again, sorry to edit your post!
HashMaps try to maintain a load factor (usually 75% full); you can think of a HashMap as a sparsely filled ArrayList. The problem with a straight-up size comparison is that the map's capacity grows to keep this load factor as the data grows, while an ArrayList grows to meet its needs by doubling its internal array size. For relatively small sizes they are comparable, but as you pack more and more data into the map, it requires a lot of empty references in order to maintain the hash performance.
In either case I recommend priming the structures with the expected size of the data before you start adding. This will give the implementations a better initial setting and will likely consume less overall in both cases.
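A small sketch of that priming, using the 6 million figure from the question (the division by the default 0.75f load factor plus 1 keeps the HashMap from rehashing before it actually holds that many entries):

int expected = 6000000;
ArrayList<String> list = new ArrayList<String>(expected);
HashMap<Integer, String> map = new HashMap<Integer, String>((int) (expected / 0.75f) + 1, 0.75f);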
Update:
Based on your updated problem, check out Glazed Lists. This is a neat little tool written by some of the Google people for doing operations similar to the one you describe. It's also very quick. It allows clustering, filtering, searching, etc.
HashMap holds a reference to the value and a reference to the key.
ArrayList just holds a reference to the value.
So, assuming that the key uses the same amount of memory as the value, HashMap uses 50% more memory (although strictly speaking it is not the HashMap itself that uses that memory, because it just keeps references).
On the other hand, HashMap provides constant-time performance for the basic operations (get and put). So, although it may use more memory, getting an element may be much faster using a HashMap than an ArrayList.
So, the next thing you should do is not worry about which uses more memory, but about what each is good for.
Using the correct data structure for your program saves more CPU/memory than how the library is implemented underneath.
EDIT
After Grant Welch's answer, I decided to measure for 2,000,000 integers.
Here's the source code
This is the output
$
$javac MemoryUsage.java
Note: MemoryUsage.java uses unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.
$java -Xms128m -Xmx128m MemoryUsage
Using ArrayListMemoryUsage#8558d2 size: 0
Total memory: 133.234.688
Initial free: 132.718.608
Final free: 77.965.488
Used: 54.753.120
Memory Used 41.364.824
ArrayListMemoryUsage#8558d2 size: 2000000
$
$java -Xms128m -Xmx128m MemoryUsage H
Using HashMapMemoryUsage#8558d2 size: 0
Total memory: 133.234.688
Initial free: 124.329.984
Final free: 4.109.600
Used: 120.220.384
Memory Used 129.108.608
HashMapMemoryUsage#8558d2 size: 2000000
Basically, you should be using the "right tool for the job". Since there are different instances where you'll need a key/value pair (where you may use a HashMap) and different instances where you'll just need a list of values (where you may use a ArrayList) then the question of "which one uses more memory", in my opinion, is moot, since it is not a consideration of choosing one over the other.
But to answer the question, since HashMap stores key/value pairs while ArrayList stores just values, I would assume that the addition of keys alone to the HashMap would mean that it takes up more memory, assuming, of course, we are comparing them by the same value type (e.g. where the values in both are Strings).
I think the wrong question is being asked here.
If you would like to improve the speed at which you can search for an object in a List containing six million entries, then you should look into how fast these datatype's retrieval operations perform.
As usual, the Javadocs for these classes state pretty plainly what type of performance they offer:
HashMap:
This implementation provides constant-time performance for the basic operations (get and put), assuming the hash function disperses the elements properly among the buckets.
This means that HashMap.get(key) is O(1).
ArrayList:
The size, isEmpty, get, set, iterator, and listIterator operations run in constant time. The add operation runs in amortized constant time, that is, adding n elements requires O(n) time. All of the other operations run in linear time (roughly speaking).
This means that most of ArrayList's operations are O(1), but likely not the ones that you would be using to find objects that match a certain value.
If you are iterating over every element in the ArrayList and testing for equality, or using contains(), then this means that your operation is running at O(n) time (or worse).
If you are unfamiliar with O(1) or O(n) notation, this is referring to how long an operation will take. In this case, if you can get constant-time performance, you want to take it. If HashMap.get() is O(1) this means that retrieval operations take roughly the same amount of time regardless of how many entries are in the Map.
The fact that something like ArrayList.contains() is O(n) means that the amount of time it takes grows as the size of the list grows; so iterating thru an ArrayList with six million entries will not be very effective at all.
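To make the contrast concrete, here is a sketch of the two retrieval styles being compared, assuming a hypothetical Customer class with a getId() accessor (neither is taken from the question): the list scan is O(n), while the map lookup is roughly constant time.

// O(n): may have to inspect all six million entries
static Customer findInList(List<Customer> customers, int wantedId) {
    for (Customer c : customers) {
        if (c.getId() == wantedId) {
            return c;
        }
    }
    return null;
}

// O(1) expected: one hash computation and one bucket lookup
static Customer findInMap(Map<Integer, Customer> customersById, int wantedId) {
    return customersById.get(wantedId);
}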
I don't know the exact number, but HashMaps are much heavier. Comparing the two, ArrayList's internal representation is self-evident, but HashMaps retain Entry objects, which can balloon your memory consumption.
It's not that much larger, but it's larger. A great way to visualize this would be with a dynamic profiler such as YourKit which allows you to see all heap allocations. It's pretty nice.
This post is giving a lot of information about objects sizes in Java.
If you're considering two ArrayLists vs one Hashmap, it's indeterminate; both are partially-full data structures. If you were comparing Vector vs Hashtable, Vector is probably more memory efficient, because it only allocates the space it uses, whereas Hashtables allocate more space.
If you need a key-value pair and aren't doing incredibly memory-hungry work, just use the Hashmap.
As Jon Skeet noted, these are completely different structures. A map (such as HashMap) is a mapping from one value to another - i.e. you have a key that maps to a value, in a Key->Value kind of relationship. The key is hashed, and is placed in an array for quick lookup.
A List, on the other hand, is a collection of elements with order - ArrayList happens to use an array as the back end storage mechanism, but that is irrelevant. Each indexed element is a single element in the list.
edit: based on your comment, I have added the following information:
The key is stored in a hashmap. This is because a hash is not guaranteed to be unique for any two different elements. Thus, the key has to be stored in the case of hashing collisions. If you simply want to see if an element exists in a set of elements, use a Set (the standard implementation of this being HashSet). If the order matters, but you need a quick lookup, use a LinkedHashSet, as it keeps the order the elements were inserted. The lookup time is O(1) on both, but the insertion time is slightly longer on a LinkedHashSet. Use a Map only if you are actually mapping from one value to another - if you simply have a set of unique objects, use a Set, if you have ordered objects, use a List.
This site lists the memory consumption for several commonly (and not so commonly) used data structures. From there one can see that the HashMap takes roughly 5 times the space of an ArrayList. The map will also allocate one additional object per entry.
If you need a predictable iteration order and use a LinkedHashMap, the memory consumption will be even higher.
You can do your own memory measurements with Memory Measurer.
There are two important facts to note however:
A lot of data structures (including ArrayList and HashMap) allocate more space than they currently need, because otherwise they would have to execute a costly resize operation frequently. Thus the memory consumption per element depends on how many elements are in the collection. For example, an ArrayList with the default settings uses the same memory for 0 to 10 elements.
As others have said, the keys of the map are stored, too. So if they are not in memory anyway, you will have to add this memory cost, too. An additional object will usually take 8 bytes of overhead alone, plus the memory for its fields, and possibly some padding. So this will also be a lot of memory.
