What is the time complexity of java.util.HashMap class' keySet() method? - java

I am trying to implement a plane sweep algorithm and for this I need to know the time complexity of java.util.HashMap class' keySet() method. I suspect that it is O(n log n). Am I correct?
Point of clarification: I am talking about the time complexity of the keySet() method; iterating through the returned Set will obviously take O(n) time.

Getting the key set is O(1) and cheap. This is because HashMap.keySet() returns the actual KeySet object associated with the HashMap.
The returned Set is not a copy of the keys, but a wrapper for the actual HashMap's state. Indeed, if you update the set you can actually change the HashMap's state; e.g. calling clear() on the set will clear the HashMap!
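As a quick illustration of that view behaviour, here is a minimal sketch (the class name is mine) showing that mutating the returned set mutates the map itself:

import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class KeySetViewDemo {
    public static void main(String[] args) {
        Map<String, Integer> map = new HashMap<>();
        map.put("a", 1);
        map.put("b", 2);

        // keySet() returns a live view of the map's keys, not a copy.
        Set<String> keys = map.keySet();

        // Mutating the view mutates the underlying map.
        keys.remove("a");
        System.out.println(map);           // {b=2}

        keys.clear();
        System.out.println(map.isEmpty()); // true
    }
}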
... iterating through the returned Set will obviously take O(n) time.
Actually that is not always true:
It is true for a HashMap created using new HashMap<>(). The worst case is to have all N keys land in the same hash chain. However, if the map has grown naturally, there will still be N entries and O(N) slots in the hash array. Thus iterating the entry set will involve O(N) operations.
It is false if the HashMap is created with new HashMap<>(capacity) and a singularly bad (too large) capacity estimate. Then it will take O(Cap) + O(N) operations to iterate the entry set. If we treat Cap as a variable, that is O(max(Cap, N)), which could be worse than O(N).
There is an escape clause though. Since capacity is an int in the current HashMap API, the upper bound for Cap is 2^31. So for really large values of Cap and N, the complexity is O(N).
On the other hand, N is limited by the amount of memory available, and in practice you need a heap on the order of 2^38 bytes (256 GBytes) for N to exceed the largest possible Cap value. And for a map that size, you would be better off using a hashtable implementation tuned for huge maps. Or not using an excessively large capacity estimate!

Surely it would be O(1). All it is doing is returning a wrapper object over the HashMap.
If you are talking about walking over the keyset, then this is O(n), since each next() call is O(1), and this needs to be performed n times.

This should be doable in O(n) time... A hash map is usually implemented as a large bucket array whose size is (usually) directly proportional to the size of the hash map. In order to retrieve the key set, the bucket array must be iterated through, and for each occupied slot the key must be retrieved (either through an intermediate collection or an iterator with direct access to the buckets)...
EDIT: As others have pointed out, the actual keySet() method itself runs in O(1) time; however, iterating over the key set or transferring it to a dedicated collection will be an O(n) operation. Not quite sure which one you are looking for.

Java collections generally spend space to save time, so this doesn't take much time at all. The keySet() method is, I believe, O(1): the collection is just sitting there, ready to be returned.

To address the "iterating through the returned Set will obviously take O(n) time" comment, this is not actually correct according to the doc comments of HashMap:
Iteration over collection views requires time proportional to the "capacity" of the HashMap instance (the number of buckets) plus its size (the number of key-value mappings). Thus, it's very important not to set the initial capacity too high (or the load factor too low) if iteration performance is important.
So in other words, iterating over the returned Set will take O(n + c) where n is the size of the map and c is its capacity, not O(n). If an inappropriately sized initial capacity or load factor were chosen, the value of c could outweigh the actual size of the map in terms of iteration time.
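To see that effect, here is a rough micro-benchmark sketch, not a rigorous benchmark; the class and method names are mine. Both maps hold the same 1,000 entries, but the one created with an oversized initial capacity is noticeably slower to iterate because every bucket has to be visited:

import java.util.HashMap;
import java.util.Map;

public class IterationCostSketch {
    public static void main(String[] args) {
        Map<Integer, Integer> defaultCap = new HashMap<>();
        Map<Integer, Integer> hugeCap = new HashMap<>(1 << 20); // ~1M buckets
        for (int i = 0; i < 1_000; i++) {
            defaultCap.put(i, i);
            hugeCap.put(i, i);
        }
        System.out.println("default capacity:   " + timeIteration(defaultCap) + " ns");
        System.out.println("oversized capacity: " + timeIteration(hugeCap) + " ns");
    }

    // Walks every entry once and returns the elapsed wall-clock time.
    private static long timeIteration(Map<Integer, Integer> map) {
        long start = System.nanoTime();
        long sum = 0;
        for (Map.Entry<Integer, Integer> e : map.entrySet()) {
            sum += e.getValue();
        }
        long elapsed = System.nanoTime() - start;
        if (sum == Long.MIN_VALUE) System.out.println(sum); // defeat dead-code elimination
        return elapsed;
    }
}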

Related

How can I create a find method and an insert method with O(1) time complexity in a custom HashMap implementation?

I am making a custom implementation of a HashMap, without using the HashMap data structure, as part of an assignment. Currently I have the choice of working with two 1D arrays or using a 2D array to store my keys and values. I want to be able to check if a key exists and return the corresponding value in O(1) time complexity (assignment requirement), but I am assuming it is without the use of containsKey().
Also, when inserting key and value pairs into my arrays, I am confused, because it should not be O(1) logically, since there would occasionally be cases where there is a collision and I have to recalculate the index. So why is the assignment requirement for insertion O(1)?
A lot of questions in there, let me give it a try.
I want to be able to check if a key exists and return the corresponding value in O(1) time complexity (assignment requirement), but I am assuming it is without the use of containsKey().
That actually doesn't make a difference. O(1) means the execution time is independent of the input size; it does not mean a single operation is used. If your containsKey() and put() implementations are both O(1), then so is your solution that uses both of them exactly once.
Also, when inserting key and value pairs into my arrays, I am confused, because it should not be O(1) logically, since there would occasionally be cases where there is a collision and I have to recalculate the index. So why is the assignment requirement for insertion O(1)?
O(1) is the best case, which assumes that there are no hash collisions. The worst case is O(n) if every key generates the same hash code. So when a hash map's lookup or insertion performance is calculated as O(1), that assumes a perfect hashCode implementation.
Finally, when it comes to data structures, the usual approach is to use a single array, where the array items are linked list nodes. The array offsets correspond to hashCode() % array size (there are much more advanced formulas than this, but it is a good starting point). In case of a hash collision, you walk the linked list nodes until you find the correct entry.
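A minimal sketch of that single-array-plus-chains layout, assuming separate chaining and using java.util.LinkedList buckets rather than hand-rolled nodes; the class and method names are mine, and resizing and null handling are omitted:

import java.util.LinkedList;

// Separate-chaining map sketch: one array of buckets,
// each bucket a linked list of (key, value) nodes.
public class SimpleChainedMap<K, V> {
    private static class Node<K, V> {
        final K key;
        V value;
        Node(K key, V value) { this.key = key; this.value = value; }
    }

    private final LinkedList<Node<K, V>>[] buckets;

    @SuppressWarnings("unchecked")
    public SimpleChainedMap(int capacity) {
        buckets = new LinkedList[capacity];
        for (int i = 0; i < capacity; i++) {
            buckets[i] = new LinkedList<>();
        }
    }

    private int indexFor(K key) {
        // Simple bucket selection: hashCode() modulo the array size.
        return Math.floorMod(key.hashCode(), buckets.length);
    }

    public void put(K key, V value) {
        for (Node<K, V> node : buckets[indexFor(key)]) {
            if (node.key.equals(key)) { node.value = value; return; }
        }
        buckets[indexFor(key)].add(new Node<>(key, value));
    }

    public V get(K key) {
        for (Node<K, V> node : buckets[indexFor(key)]) {
            if (node.key.equals(key)) { return node.value; }
        }
        return null; // key not present
    }
}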
You're correct that hash table insertion is not guaranteed to be O(1), due to collisions. If you use the open addressing strategy to deal with collisions, inserting an item takes time proportional to 1/(1-a), where a is the fraction of the table capacity already in use. As the table fills up, a approaches 1, and the time to insert grows without bound.
The secret to keeping the time complexity at O(1) is making sure that there is always room in the table, so that a never grows too big. That's why you have to resize the table when it starts to run out of capacity.
Problem: resizing a table with N item takes O(N) time.
Solution: increase the capacity exponentially, for example double it every time you need to resize. This way the table has to be resized very rarely. The cost of the occasional resize operations is "amortized" over a large number of insertions, and that's why people say that hash table insertion has "amortized O(1)" time complexity.
TLDR: Make sure you increase the table capacity when it's getting full, maybe 70-80% utilization. When you increase the capacity, make sure that it's by a constant factor, for example doubling it.
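Putting those pieces together, here is a rough sketch of the open-addressing-plus-doubling idea; the class and method names are mine, it handles only int keys, and removal is omitted for brevity:

// Open-addressing (linear probing) sketch for int keys,
// doubling the table when it reaches ~75% utilization.
public class ProbingIntSet {
    private int[] table = new int[8];
    private boolean[] used = new boolean[8];
    private int size = 0;

    public void add(int key) {
        if (size + 1 > table.length * 3 / 4) {
            resize();                        // keep the load factor below ~0.75
        }
        if (insert(table, used, key)) {
            size++;
        }
    }

    public boolean contains(int key) {
        int i = Math.floorMod(Integer.hashCode(key), table.length);
        while (used[i]) {
            if (table[i] == key) return true;
            i = (i + 1) % table.length;      // linear probing
        }
        return false;
    }

    private void resize() {
        // Doubling keeps the amortized cost of insertion constant.
        int[] newTable = new int[table.length * 2];
        boolean[] newUsed = new boolean[newTable.length];
        for (int i = 0; i < table.length; i++) {
            if (used[i]) insert(newTable, newUsed, table[i]);
        }
        table = newTable;
        used = newUsed;
    }

    // Returns true if the key was newly inserted, false if already present.
    private static boolean insert(int[] table, boolean[] used, int key) {
        int i = Math.floorMod(Integer.hashCode(key), table.length);
        while (used[i]) {
            if (table[i] == key) return false;
            i = (i + 1) % table.length;
        }
        table[i] = key;
        used[i] = true;
        return true;
    }
}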

Map/ArrayList: which one is faster to search for an element

I have a gigantic data set which I have to store in a collection, and I need to find whether there are any duplicates in there or not.
The data size could be more than 1 million. I know I can store more elements in an ArrayList compared to a Map.
My questions are:
Is searching for a key in a Map faster than searching in a sorted ArrayList?
Is searching for a key in a HashMap faster than in a TreeMap?
Purely in terms of the space required to store n elements, which would be more efficient, a TreeMap or a HashMap implementation?
1) Yes. Searching an ArrayList is O(n) on average. The performance of key lookups in a Map depends on the specific implementation. You could write an implementation of Map that is O(n) or worse if you really wanted to, but all the implementations in the standard library are faster than O(n).
2) Yes. HashMap is O(1) on average for simple key lookups. TreeMap is O(log(n)).
Class HashMap<K,V>
This implementation provides constant-time performance for the basic operations (get and put), assuming the hash function disperses the elements properly among the buckets.
Class TreeMap<K,V>
This implementation provides guaranteed log(n) time cost for the containsKey, get, put and remove operations. Algorithms are adaptations of those in Cormen, Leiserson, and Rivest's Introduction to Algorithms.
3) The space requirements will be O(n) in both cases. I'd guess the TreeMap requires slightly more space, but only by a constant factor.
It depends on the type of Map you're using.
A HashMap has a constant-time average lookup (O(1)), while a TreeMap's average lookup time is based on the depth of the tree (O(log(n))), so a HashMap is faster.
The difference is probably moot. Both data structures require some amount of constant overhead in space complexity by design (both exhibit O(n) space complexity).
I just did some benchmark testing of lookup performance between a HashMap and a sorted ArrayList. The answer is that the HashMap is much faster as the size increases. I am talking about 10x, 20x, 30x faster. I ran some tests with 1 million entries using a sorted ArrayList and a HashMap; the ArrayList get and add operations took seconds to complete, whereas the HashMap get and put operations only took around 50 ms.
Here are some things I found or observed:
For the sorted ArrayList, you have to keep it sorted to be able to search efficiently (with binarySearch, for example). In practice you don't just have a static list (the list will change via add or remove). With that in mind you need to change the add and get methods to do a "binary" operation to make them efficient (like binarySearch). So even with binary operations, the add and get methods get slower and slower as the list grows.
The HashMap, on the other hand, does not show much change in the time of the put and get operations. The problem with the HashMap is memory overhead. If you can live with that, then go with the HashMap.
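For reference, a rough sketch of that kind of comparison; the class name is mine, the absolute timings will vary by machine, and this is not a rigorous benchmark:

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LookupComparisonSketch {
    public static void main(String[] args) {
        int n = 1_000_000;
        List<Integer> sortedList = new ArrayList<>(n);
        Map<Integer, Integer> map = new HashMap<>();
        for (int i = 0; i < n; i++) {
            sortedList.add(i);          // added in sorted order
            map.put(i, i);
        }

        long hits = 0;
        long t1 = System.nanoTime();
        for (int i = 0; i < n; i++) {
            if (Collections.binarySearch(sortedList, i) >= 0) hits++;  // O(log n) each
        }
        long t2 = System.nanoTime();
        for (int i = 0; i < n; i++) {
            if (map.containsKey(i)) hits++;                            // O(1) expected each
        }
        long t3 = System.nanoTime();

        System.out.println("hits: " + hits);
        System.out.println("binarySearch: " + (t2 - t1) / 1_000_000 + " ms");
        System.out.println("containsKey:  " + (t3 - t2) / 1_000_000 + " ms");
    }
}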

Time complexity of set in Java

Can someone tell me the time complexity of the below code?
a is an array of int.
Set<Integer> set = new HashSet<Integer>();
for (int i = 0; i < a.length; i++) {
    if (set.contains(a[i])) {
        System.out.println("Hello");
    }
    set.add(a[i]);
}
I think that it is O(n), but I'm not sure, since it uses a Set and its contains method as well. It also calls the add method of the set.
Can anyone confirm and explain what the time complexity of the entire above code is? Also, how much space would it take?
I believe it's O(n), because you loop over the array once, and contains and add should be constant time because it's a hash-based set. If it were not hash-based and required iteration over the entire set to do lookups, the upper bound would be n^2.
Integers are immutable, so the space complexity would be 2n, which I believe simplifies to just O(n), since constant factors don't matter.
If you had objects in the array and the set, then you would have 2n references and n objects, so you are at 3n, which is still linear space.
EDIT -- yep, per the HashSet Javadoc: "This class offers constant time performance for the basic operations (add, remove, contains and size), assuming the hash function disperses the elements properly among the buckets."
Understanding HashSet is the key to this question.
According to the HashSet Javadoc:
This class implements the Set interface, backed by a hash table (actually a HashMap instance)... This class offers constant time performance for the basic operations (add, remove, contains and size).
A more complete explanation of HashSet: https://www.geeksforgeeks.org/hashset-contains-method-in-java/?ref=rp
So HashSet insert and contains are O(1). (HashSet is based on HashMap, and its memory complexity is O(n).)
The rest is simple: the loop over the main array is O(n), so the total complexity of the function is O(n).
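As a side note, Set.add() returns false when the element is already present, so the contains() call can be folded into the add(). A small sketch of that variant; the class and helper names are mine:

import java.util.HashSet;
import java.util.Set;

public class DuplicateCheck {
    // Returns true if the array contains at least one duplicate value.
    static boolean hasDuplicates(int[] a) {
        Set<Integer> seen = new HashSet<>();
        for (int value : a) {
            // add() returns false when the element was already in the set,
            // so contains() and add() collapse into a single O(1) call.
            if (!seen.add(value)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasDuplicates(new int[]{1, 2, 3}));    // false
        System.out.println(hasDuplicates(new int[]{1, 2, 2, 3})); // true
    }
}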

Which is the appropriate data structure?

I need a Java data structure that has:
fast (O(1)) insertion
fast removal
fast (O(1)) max() function
What's the best data structure to use?
HashMap would almost work, but using java.util.Collections.max() is at least O(n) in the size of the map. TreeMap's insertion and removal are too slow.
Any thoughts?
O(1) insertion and O(1) max() are mutually exclusive once you add the fast-removal requirement.
An O(1)-insertion collection won't have an O(1) max, because the collection is unsorted; an O(1)-max collection has to keep its elements ordered, so insertion is O(n). You'll have to bite the bullet and choose between the two. In both cases, however, removal should be equally fast.
If you can live with slow removal, you could keep a variable holding the current highest element and compare against it on insert; max and insert would then be O(1). Removal would be O(n), though, because you have to find a new highest element whenever the removed element was the highest.
If you can have O(log n) insertion and removal, you can have O(1) max value with a TreeSet or a PriorityQueue. O(log n) is pretty good for most applications.
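For reference, a small sketch of those two options; the class name is mine and the expected outputs are shown in comments:

import java.util.Comparator;
import java.util.PriorityQueue;
import java.util.TreeSet;

public class MaxTrackingSketch {
    public static void main(String[] args) {
        // TreeSet: O(log n) add/remove, O(log n) last() for the maximum.
        TreeSet<Integer> treeSet = new TreeSet<>();
        treeSet.add(5);
        treeSet.add(1);
        treeSet.add(9);
        System.out.println(treeSet.last());  // 9
        treeSet.remove(9);
        System.out.println(treeSet.last());  // 5

        // PriorityQueue as a max-heap: O(log n) offer/poll, O(1) peek.
        PriorityQueue<Integer> maxHeap = new PriorityQueue<>(Comparator.reverseOrder());
        maxHeap.offer(5);
        maxHeap.offer(1);
        maxHeap.offer(9);
        System.out.println(maxHeap.peek()); // 9
        maxHeap.poll();                     // removes the maximum
        System.out.println(maxHeap.peek()); // 5
    }
}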
If you accept that O(log n) is still "fast", even though it isn't "fast (O(1))", then some kind of heap-based priority queue will do it. See the comparison table for the different heaps you might use.
Note that Java's library PriorityQueue isn't very exciting; it only guarantees O(n) remove(Object).
For heap-based queues "remove" can be implemented as "decreaseKey" followed by "removeMin", provided that you reserve a "negative infinity" value for the purpose. And since it's the max you want, invert all mentions of "min" to "max" and "decrease" to "increase" when reading the article...
You cannot have O(1) removal + insertion + max.
Proof:
Assume you could, and call this data structure D.
Given an array A:
1. Insert all elements of A into D.
2. Create an empty linked list L.
3. While D is not empty:
3.1. x <- D.max(); D.delete(x); -- all O(1) by assumption
3.2. L.insert_first(x) -- O(1)
4. Return L.
Here we have created a sorting algorithm that runs in O(n), but that is proven to be impossible: comparison-based sorting is known to be Ω(n log n). Contradiction! Thus, D cannot exist.
I'm very skeptical that TreeMap's log(n) insertion and deletion are too slow; log(n) time is practically constant with respect to most real applications. Even with 1,000,000,000 elements in your tree, if it is well balanced you will only perform log2(1,000,000,000) ≈ 30 comparisons per insertion or removal, which is comparable to what a hash function would cost.
Such a data structure would be awesome and, as far as I know, doesn't exist. Others have pointed this out.
But you can go further, if you don't mind making all of this a bit more complex.
If you can "waste" some memory and some programming efforts, you can use, at the same time, different data structures, combining the pro's of each one.
For example, I needed a sorted data structure but wanted to have O(1) lookups ("is element X in the collection?"), not O(log n). I combined a TreeMap with a HashMap (which is not really O(1), but it is almost O(1) when it is not too full and the hashing function is good) and I got really good results.
For your specific case, I would go for a dynamic combination between a HashMap and a custom helper data structure. I have in mind something rather complex (hash map + variable-length priority queue), but I'll go for a simple example. Just keep all the elements in the HashMap, and then use a special field (currentMax) that holds only the maximum element in the map. When you insert() into your combined data structure, if the element you're going to insert is greater than the current max, you do currentMax <- elementGoingToInsert (and you insert it into the HashMap).
When you remove an element from your combined data structure, you check if it is equal to the currentMax and if it is, you remove it from the map (that's normal) and you have to find the new max (in O(n)). So you do currentMax <- findMaxInCollection().
If the max doesn't change very frequently, that's damn good, believe me.
However, don't take anything for granted. You have to struggle a bit to find the best combination between different data structures. Do your tests, learn how frequently max changes. Data structures aren't easy, and you can make a difference if you really work combining them instead of finding a magic one, that doesn't exist. :)
Cheers
Here's a degenerate answer. I noted that you hadn't specified what you consider "fast" for deletion; if O(n) is fast then the following will work. Make a class that wraps a HashSet; maintain a reference to the maximum element upon insertion. This gives the two constant time operations. For deletion, if the element you deleted is the maximum, you have to iterate through the set to find the maximum of the remaining elements.
This may sound like it's a silly answer, but in some practical situations (a generalization of) this idea could actually be useful. For example, you can still maintain the five highest values in constant time upon insertion, and whenever you delete an element that happens to occur in that set you remove it from your list-of-five, turning it into a list-of-four etcetera; when you add an element that falls in that range, you can extend it back to five. If you typically add elements much more frequently than you delete them, then it may be very rare that you need to provide a maximum when your list-of-maxima is empty, and you can restore the list of five highest elements in linear time in that case.
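A minimal sketch of that wrapper idea; the class and method names are mine, it stores ints, and it caches a single maximum rather than a list of the top five:

import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// O(1) insert and max, O(n) removal when the removed element
// happens to be the current maximum.
public class MaxCachingSet {
    private final Set<Integer> elements = new HashSet<>();
    private Integer max = null;   // cached maximum, null when the set is empty

    public void insert(int x) {                       // O(1) expected
        elements.add(x);
        if (max == null || x > max) {
            max = x;
        }
    }

    public Integer max() {                            // O(1)
        return max;
    }

    public void remove(int x) {
        if (elements.remove(x) && max != null && x == max) {
            // The maximum was removed: rescan to find the new one. O(n)
            max = elements.isEmpty() ? null : Collections.max(elements);
        }
    }
}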
As already explained: for the general case, no. However, if your range of values is limited, you can use a counting-sort-like algorithm to get O(1) insertion, and on top of that a linked list for moving the max pointer, thus achieving O(1) max and removal.

When should I use a TreeMap over a PriorityQueue and vice versa?

Seems they both let you retrieve the minimum, which is what I need for Prim's algorithm, and force me to remove and reinsert a key to update its value. Is there any advantage of using one over the other, not just for this example, but generally speaking?
Generally speaking, it is less work to track only the minimum element, using a heap.
A tree is more organized, and it requires more computation to maintain that organization. But if you need to access any key, and not just the minimum, a heap will not suffice, and the extra overhead of the tree is justified.
There are 2 differences I would like to point out (and this may be more relevant to Difference between PriorityQueue and TreeSet in Java? as that question is deemed a dup of this question).
(1) A PriorityQueue can contain duplicates, whereas a TreeSet can NOT contain duplicates. So in a TreeSet, if your comparator deems 2 elements equal, the TreeSet will keep only one of those 2 elements and throw away the other one.
(2) A TreeSet iterator traverses the collection in sorted order, whereas a PriorityQueue iterator does NOT traverse in sorted order. For a PriorityQueue, if you want to get the items in sorted order, you have to destroy the queue by calling remove() repeatedly.
I am assuming that the discussion is limited to Java's implementation of these data structures.
Totally agree with erickson that a priority queue only gives you the minimum/maximum element.
In addition, because the priority queue is less powerful in maintaining the total order of the data, it has an advantage in some special cases. If you want to track the biggest M elements in an array of N, the time complexity is O(N log M) and the space complexity is O(M). But if you do it with a map, the time complexity is O(N log N) and the space is O(N). This is quite fundamental, and we must use a priority queue in some cases, for example when M is just a constant like 10 (see the sketch after the rule of thumb below).
Rule of thumb about it is:
TreeMap maintains all elements in sorted order. (So, intuitively, it takes time to construct and maintain it.)
PriorityQueue only guarantees min or max. It's less expensive but less powerful.
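The top-M idea mentioned above, sketched with a size-bounded min-heap; the class and method names and the example values are mine:

import java.util.PriorityQueue;

public class TopM {
    // Returns the M largest values of the input using a size-bounded min-heap:
    // O(N log M) time, O(M) extra space.
    static PriorityQueue<Integer> topM(int[] values, int m) {
        PriorityQueue<Integer> minHeap = new PriorityQueue<>(); // natural order = min-heap
        for (int v : values) {
            minHeap.offer(v);
            if (minHeap.size() > m) {
                minHeap.poll();  // evict the smallest of the current candidates
            }
        }
        return minHeap;          // holds the M largest values, smallest on top
    }

    public static void main(String[] args) {
        // Prints the three largest values {7, 8, 9}, in some heap order.
        System.out.println(topM(new int[]{7, 3, 9, 1, 8, 5}, 3));
    }
}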
It all depends what you want to achieve. Here are the main points to consider before you choose one over other.
A PriorityQueue allows duplicates (i.e., elements with the same priority), while a TreeMap doesn't.
Insertion into a PriorityQueue is O(n) when it has to increase its size (grow its backing array), while insertion into a TreeMap is O(log n) (it is based on a red-black tree).
A PriorityQueue is backed by an array, while TreeMap nodes are linked to each other, so the contains method of PriorityQueue takes O(n) time while TreeMap's takes O(log n) time.
One of the differences is that remove(Object) and contains(Object) are linear O(N) in a normal heap based PriorityQueue (like Oracle's), but O(log(N)) for a TreeSet/Map.
So if you have a large number of elements and do a lot of remove(Object) or contains(Object), then a TreeSet/Map may be faster.
I may be late to this answer but still.
They have their own use-cases, in which either one of them is a clear winner.
For example:
1: https://leetcode.com/problems/my-calendar-i - here a TreeMap is what you are looking for.
2: https://leetcode.com/problems/top-k-frequent-words - here you don't need the overhead of keys and values.
So my answer would be: look at the use case and see whether it can be done without keys and values. If yes, go for a PriorityQueue; otherwise move to a TreeMap.
It depends on how you implement your priority queue. According to Cormen's book (2nd ed.), the fastest result is with a Fibonacci heap.
I find TreeMap to be useful when there is a need to do something like:
find the minimal/least key which is greater than or equal to some value, using ceilingKey()
find the maximal/greatest key which is less than or equal to some value, using floorKey()
If the above is not required, and it's mostly about having a quick option to retrieve the min/max - PriorityQueue might be preferred.
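A short sketch of those navigation calls; the class name and example keys are mine:

import java.util.TreeMap;

public class NavigationSketch {
    public static void main(String[] args) {
        TreeMap<Integer, String> bookings = new TreeMap<>();
        bookings.put(10, "a");
        bookings.put(20, "b");
        bookings.put(30, "c");

        // Smallest key >= 15 and largest key <= 25, both in O(log n).
        System.out.println(bookings.ceilingKey(15)); // 20
        System.out.println(bookings.floorKey(25));   // 20
    }
}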
Their difference in time complexity is stated clearly in erickson's answer.
On space complexity, although a heap and a TreeMap both take O(n) space, building them in actual programs takes different amounts of space and effort.
Say you have an array of numbers: you can build a heap in place in O(n) time with constant extra space. If you build a TreeMap from the given array, you need O(n log n) time and O(n) extra space to accomplish that.
One more thing to take into consideration: PriorityQueue offers an API (peek()) that returns the min/max value without removing it; the time complexity is O(1), while for a TreeMap this will still cost you O(log n).
This can be a clear advantage in read-only cases where you are only interested in the top-end value.
