why and where do we use hashing? - java

Why do we use hashing for search? What are the advantages of using hashing over a binary search tree?

Hashing is generally a constant-time operation, whereas a binary tree lookup has logarithmic time complexity.
Because a hash is calculated from the item being searched for, not from the number of items in the collection, the size of the collection has no bearing on the time it takes to find an item. However, most hashing algorithms will have collisions, which increase the time complexity, so a perfectly constant-time lookup is very unlikely in practice.
With a binary tree, you have to do up to log2(N) comparisons before the item can be found.

Wikipedia explains it well:
http://en.wikipedia.org/wiki/Hash_table#Features
Summary: inserts are generally slower, but reads are faster than with trees.
As for Java: any time you have key/value pairs that you read a lot and write rarely, and everything fits easily into RAM, use a HashMap for quick read access and incredible ease of code maintenance.
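As a minimal sketch of that read-mostly key/value case (the map contents below are invented purely for illustration):

```java
import java.util.HashMap;
import java.util.Map;

public class CountryCodes {
    public static void main(String[] args) {
        // Hypothetical read-heavy lookup table: written once, read many times.
        Map<String, String> dialingCodes = new HashMap<>();
        dialingCodes.put("US", "+1");
        dialingCodes.put("DE", "+49");
        dialingCodes.put("JP", "+81");

        // Each get() is an expected O(1) hash lookup, independent of map size.
        System.out.println(dialingCodes.get("DE")); // prints +49
    }
}
```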

Hashing means using some function or algorithm to map object data to some representative integer value. This so-called hash code (or simply hash) can then be used as a way to narrow down our search when looking for the item in the map.
If you need an algorithm that is fast for looking up information, then a hash table is the most suitable structure to use: it simply generates a hash of your key object and uses that to access the target data, which is O(1). The alternatives are O(N) (a linked list of size N, where you have to iterate through the list one element at a time, an average of N/2 times) and O(log N) (a binary tree, where you halve the search space with each iteration, but only if the tree is balanced, so this depends on your implementation; an unbalanced tree can have significantly worse performance).
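A small sketch of those three cost profiles using the standard collections (the element count and key format are arbitrary choices for the example):

```java
import java.util.HashSet;
import java.util.LinkedList;
import java.util.Set;
import java.util.TreeSet;

public class LookupComparison {
    public static void main(String[] args) {
        Set<String> hashed = new HashSet<>();         // hash lookup: expected O(1)
        Set<String> tree = new TreeSet<>();           // balanced tree: O(log N) comparisons
        LinkedList<String> list = new LinkedList<>(); // linked list: O(N) scan

        for (int i = 0; i < 100_000; i++) {
            String s = "item-" + i;
            hashed.add(s);
            tree.add(s);
            list.add(s);
        }

        String needle = "item-50000";
        System.out.println(hashed.contains(needle)); // hash narrows the search to one bucket
        System.out.println(tree.contains(needle));   // halves the search space each step
        System.out.println(list.contains(needle));   // walks the list element by element
    }
}
```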

Hash tables are best for equality searches (=) if inserts are relatively infrequent and keys are distributed uniformly across the slots. With chaining, the expected cost of a single lookup is O(1 + n/k), where k is the number of buckets, which is effectively constant for a bounded load factor.
They are not a good idea if you want to do comparison or range operations (<, >).
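To illustrate the comparison-operation point, here is a sketch contrasting TreeMap (ordered keys, natural range views) with HashMap (unordered, so range queries require scanning every entry); the keys and values are made up:

```java
import java.util.HashMap;
import java.util.TreeMap;

public class RangeQueries {
    public static void main(String[] args) {
        // A TreeMap keeps keys ordered, so range queries (<, >) are natural.
        TreeMap<Integer, String> ordered = new TreeMap<>();
        ordered.put(10, "ten");
        ordered.put(20, "twenty");
        ordered.put(30, "thirty");
        System.out.println(ordered.headMap(25)); // {10=ten, 20=twenty}

        // A HashMap scatters keys by hash; answering "all keys < 25"
        // means inspecting every entry.
        HashMap<Integer, String> hashed = new HashMap<>(ordered);
        hashed.entrySet().stream()
              .filter(e -> e.getKey() < 25)
              .forEach(System.out::println);
    }
}
```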

Related

Why is time complexity of get() in HashMap O(1)?

I've seen some interesting claims on SO regarding Java HashMaps and their O(1) lookup time. Can someone explain why this is so? Unless these hash maps are vastly different from any of the hashing algorithms I was brought up on, there must always exist a dataset that contains collisions.
In which case, the lookup would be O(n) rather than O(1).
Can someone explain whether they are O(1) and, if so, how they achieve this?
A particular feature of a HashMap is that, unlike, say, balanced trees, its behavior is probabilistic. In these cases it's usually most helpful to talk about complexity in terms of the probability of a worst-case event occurring. For a hash map, that of course is a collision, with respect to how full the map happens to be. A collision is pretty easy to estimate:
p(collision) = n / capacity
So a hash map with even a modest number of elements is pretty likely to experience at least one collision. Big O notation allows us to do something more compelling. Observe that for any arbitrary, fixed constant k:
O(n) = O(k * n)
We can use this feature to improve the performance of the hash map. We could instead think about the probability of at most 2 collisions:
p(2 collisions) = (n / capacity)^2
This is much lower. Since the cost of handling one extra collision is irrelevant to Big O performance, we've found a way to improve performance without actually changing the algorithm! We can generalize this to:
p(k collisions) = (n / capacity)^k
And now we can disregard some arbitrary number of collisions and end up with a vanishingly tiny likelihood of more collisions than we are accounting for. You could get the probability down to an arbitrarily small level by choosing the correct k, all without altering the actual implementation of the algorithm.
We talk about this by saying that the hash map has O(1) access with high probability.
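A quick back-of-the-envelope sketch of the simplified collision model used in this answer (the n and capacity values are arbitrary, and this is not an exact birthday-problem calculation):

```java
public class CollisionOdds {
    public static void main(String[] args) {
        // Rough model from the answer above: p(collision) ~ n / capacity,
        // and the chance of k stacked collisions falls off as that ratio
        // raised to the k-th power.
        double n = 75;          // entries
        double capacity = 100;  // buckets
        double p = n / capacity;

        for (int k = 1; k <= 10; k++) {
            System.out.printf("p of %2d collisions ~ %.10f%n", k, Math.pow(p, k));
        }
    }
}
```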
You seem to be mixing up worst-case behaviour with average-case (expected) runtime. The former is indeed O(n) for hash tables in general (i.e. not using perfect hashing), but it is rarely relevant in practice.
Any dependable hash table implementation, coupled with a half decent hash, has a retrieval performance of O(1) with a very small factor (2, in fact) in the expected case, within a very narrow margin of variance.
In Java, how does a HashMap work?
It uses hashCode to locate the corresponding bucket inside the bucket array.
Each bucket is a linked list (or, under some conditions starting from Java 8, a balanced red-black tree) of the items residing in that bucket.
The items are scanned one by one, using equals for comparison.
When adding more items, the HashMap is resized (doubling its size) once a certain load percentage is reached.
So, sometimes it will have to compare against a few items, but generally, it's much closer to O(1) than O(n) / O(log n).
For practical purposes, that's all you should need to know.
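For illustration only, here is a deliberately simplified toy version of that lookup path (fixed bucket count, no resizing, no treeification). It is not how java.util.HashMap is actually written; it just puts the hash-to-bucket-then-equals idea into code:

```java
import java.util.LinkedList;
import java.util.Objects;

// Toy sketch: hash the key, pick a bucket, scan that bucket with equals().
class TinyHashMap<K, V> {
    private static class Entry<K, V> {
        final K key;
        V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }

    @SuppressWarnings("unchecked")
    private final LinkedList<Entry<K, V>>[] buckets = new LinkedList[16];

    private int indexFor(Object key) {
        // Map the key's hash code onto one of the 16 buckets.
        return (Objects.hashCode(key) & 0x7fffffff) % buckets.length;
    }

    public void put(K key, V value) {
        int i = indexFor(key);
        if (buckets[i] == null) buckets[i] = new LinkedList<>();
        for (Entry<K, V> e : buckets[i]) {
            if (Objects.equals(e.key, key)) { e.value = value; return; }
        }
        buckets[i].add(new Entry<>(key, value));
    }

    public V get(Object key) {
        int i = indexFor(key);
        if (buckets[i] == null) return null;
        for (Entry<K, V> e : buckets[i]) {     // scan one bucket, not the whole map
            if (Objects.equals(e.key, key)) return e.value;
        }
        return null;
    }
}
```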
Remember that O(1) does not mean that each lookup only examines a single item; it means that the average number of items checked remains constant with respect to the number of items in the container. So if it takes on average 4 comparisons to find an item in a container with 100 items, it should also take an average of 4 comparisons to find an item in a container with 10,000 items, and for any other number of items (there's always a bit of variance, especially around the points at which the hash table rehashes, and when there's a very small number of items).
So collisions don't prevent the container from having O(1) operations, as long as the average number of keys per bucket remains within a fixed bound.
I know this is an old question, but there's actually a new answer to it.
You're right that a hash map isn't really O(1), strictly speaking, because as the number of elements gets arbitrarily large, eventually you will not be able to search in constant time (and O-notation is defined in terms of numbers that can get arbitrarily large).
But it doesn't follow that the real time complexity is O(n)--because there's no rule that says that the buckets have to be implemented as a linear list.
In fact, Java 8 implements the buckets as balanced red-black trees once they exceed a threshold, which makes the actual worst-case time O(log n).
Lookup is O(1 + n/k), where k is the number of buckets.
If the implementation keeps k = n/alpha, then it is O(1 + alpha) = O(1), since alpha (the load factor) is a constant.
If the number of buckets (call it b) is held constant (the usual case), then lookup is actually O(n).
As n gets large, the number of elements in each bucket averages n/b. If collision resolution is done in one of the usual ways (linked list for example), then lookup is O(n/b) = O(n).
The O notation is about what happens when n gets larger and larger. It can be misleading when applied to certain algorithms, and hash tables are a case in point. We choose the number of buckets based on how many elements we're expecting to deal with. When n is about the same size as b, then lookup is roughly constant-time, but we can't call it O(1) because O is defined in terms of a limit as n → ∞.
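As a small illustration of how the bucket count is kept in step with n, HashMap exposes the initial capacity and load factor directly (the values below are simply the documented defaults):

```java
import java.util.HashMap;
import java.util.Map;

public class LoadFactorDemo {
    public static void main(String[] args) {
        // With the default load factor of 0.75, the table is resized (doubled)
        // once size exceeds capacity * 0.75, which keeps the average bucket
        // length (n / b) bounded by a constant.
        Map<String, Integer> map = new HashMap<>(16, 0.75f);
        for (int i = 0; i < 1_000; i++) {
            map.put("key-" + i, i);   // triggers several internal resizes
        }
        System.out.println(map.size());
    }
}
```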
Elements inside the HashMap are stored as an array of linked lists (nodes); each linked list in the array represents a bucket for the unique hash value of one or more keys.
While adding an entry to the HashMap, the hashcode of the key is used to determine the location of the bucket in the array, something like:
location = (arraylength - 1) & keyhashcode
Here & is the bitwise AND operator.
For example: 100 & "ABC".hashCode() = 64 (location of the bucket for the key "ABC")
During the get operation it uses the same calculation to determine the bucket location for the key. In the best case each key has a unique hashcode and ends up in its own bucket; in this case the get method spends time only on determining the bucket location and retrieving the value, which is constant, O(1).
In the worst case, all the keys have the same hashcode and are stored in the same bucket; this results in traversing the entire list, which leads to O(n).
In the case of Java 8, the linked-list bucket is replaced with a balanced tree if it grows to more than 8 entries, which improves the worst-case search cost to O(log n).
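The index computation above can be reproduced directly. The second half of this sketch also shows the extra hash-spreading step that the Java 8 HashMap applies before masking (the table length of 128 is an arbitrary power of two chosen for the example):

```java
public class BucketIndex {
    public static void main(String[] args) {
        // Reproduces the example above: mask the key's hash code with
        // (array length - 1) to pick a bucket index.
        int hash = "ABC".hashCode();   // 64578
        int index = 100 & hash;        // 64, as in the example above
        System.out.println(hash + " -> bucket " + index);

        // The real HashMap uses a power-of-two table and, since Java 8,
        // additionally "spreads" the hash (h ^ (h >>> 16)) before masking
        // with (length - 1).
        int tableLength = 128;
        int spread = hash ^ (hash >>> 16);
        System.out.println("spread index: " + (spread & (tableLength - 1)));
    }
}
```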
We've established that the standard description of hash table lookups being O(1) refers to the average-case expected time, not the strict worst-case performance. For a hash table resolving collisions with chaining (like Java's HashMap) this is technically O(1 + α) with a good hash function, where α is the table's load factor. Still constant as long as the number of objects you're storing is no more than a constant factor larger than the table size.
It's also been explained that, strictly speaking, it's possible to construct input that requires O(n) lookups for any deterministic hash function. But it's also interesting to consider the worst-case expected time, which is different from the average search time. Using chaining this is O(1 + the length of the longest chain), for example Θ(log n / log log n) when α = 1.
If you're interested in theoretical ways to achieve constant time expected worst-case lookups, you can read about dynamic perfect hashing which resolves collisions recursively with another hash table!
It is O(1) only if your hashing function is very good. The Java hash table implementation does not protect against bad hash functions.
Whether you need to grow the table when you add items or not is not relevant to the question because it is about lookup time.
This basically goes for most hash table implementations in most programming languages, as the algorithm itself doesn't really change.
If there are no collisions present in the table, you only have to do a single look-up, therefore the running time is O(1). If there are collisions present, you have to do more than one look-up, which drives down the performance towards O(n).
It depends on the algorithm you choose to avoid collisions. If your implementation uses separate chaining then the worst case scenario happens where every data element is hashed to the same value (poor choice of the hash function for example). In that case, data lookup is no different from a linear search on a linked list i.e. O(n). However, the probability of that happening is negligible and lookups best and average cases remain constant i.e. O(1).
Only in the theoretical case where hashcodes are always different and every hash code gets its own bucket does a strict O(1) hold. Otherwise, lookup is still of constant order, i.e. as the hashmap grows, its expected search cost remains roughly constant.
Academics aside, from a practical perspective, HashMaps should be accepted as having an inconsequential performance impact (unless your profiler tells you otherwise.)
Of course the performance of the HashMap will depend on the quality of the hashCode() function for the given object. However, if the function is implemented such that the possibility of collisions is very low, it will have very good performance (this is not strictly O(1) in every possible case, but it is in most cases).
For example, the default implementation in the Oracle JRE is to use a random number (which is stored in the object instance so that it doesn't change, though it also disables biased locking, but that's another discussion), so the chance of collisions is very low.

Is hashmap increasing O(N+1) for every same hashcode object put in it?

Is hashmap increasing O(N+1) for every same hashcode object put in it?
The best case complexity for put is O(1) in time and space.
The average case complexity for put is O(1) time and space when amortized over N put operations.
The amortization averages the cost of growing the hash array and rebuilding the hash buckets when the map is resized.
If you don't amortize, then the worst-case performance of a single put operation (which triggers a resize) will be O(N) in time and space.
There is another worst-case scenario, which occurs when a large proportion of the keys hash to the same hash code. In that case the worst-case time complexity of put will be either O(N) or O(log N).
Let us define M to be the number of entries in the hash bucket with the most entries. Let us assume that we are inserting into that bucket, and that M is O(N).
Prior to Java 8, the hash chains were unordered linked lists and searching an O(N) element chain is O(N). The worst-case put operation would therefore be O(N).
With Java 8, the implementation was changed to use balanced binary trees when 1) the list exceeds a threshold, and 2) the key type K implements Comparable<K>.
For large enough N we can assume that the threshold is exceeded. So the worst-case time complexity of put will be:
O(log N) in the case where the keys can be ordered using Comparable<K>
O(N) in the case where the keys cannot be ordered
Note that the javadocs (in Java 11) mention that Comparable may be used:
"To ameliorate impact, when keys are Comparable, this class may use comparison order among keys to help break ties."
but it doesn't explicitly state the complexity. There are more details in the non-javadoc comments in the source code, but these are implementation specific.
The above statements are only valid for extant implementations of HashMap at the time of writing (i.e. up to Java 12). You can always check for yourself by finding and reading the source code.
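As an illustration of the worst case discussed above, here is a hypothetical key class with a deliberately constant hashCode. Because it implements Comparable, a Java 8+ HashMap can treeify the single overloaded bin and keep operations in it around O(log N); a key with the same pathology but no usable ordering would degrade towards O(N), as discussed above:

```java
import java.util.HashMap;
import java.util.Map;

public class CollidingKeys {
    // Hypothetical key whose hashCode is deliberately constant, forcing every
    // entry into the same bucket.
    static final class BadKey implements Comparable<BadKey> {
        final int id;
        BadKey(int id) { this.id = id; }
        @Override public int hashCode() { return 42; }   // every key collides
        @Override public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).id == id;
        }
        @Override public int compareTo(BadKey other) { return Integer.compare(id, other.id); }
    }

    public static void main(String[] args) {
        Map<BadKey, Integer> map = new HashMap<>();
        for (int i = 0; i < 10_000; i++) {
            map.put(new BadKey(i), i);   // all entries land in one (treeified) bin
        }
        System.out.println(map.get(new BadKey(9_999)));
    }
}
```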
For HashMap's get method:
No collision - O(1)
In case of collision - O(log n) [Java 8], O(n) [before Java 8]
Java 8 started using a balanced tree instead of a linked list for storing collided entries. This also means that in the worst case you get a performance boost from O(n) to O(log n).

Is there any reason to have 8 on TREEIFY_THRESHOLD in Java Hashmap?

Since Java 8, HashMap was modified slightly to use a balanced tree instead of a linked list when more than 8 (TREEIFY_THRESHOLD = 8) items land in the same bucket. Is there any reason for choosing 8?
Would it impact the performance if it were 9?
The use of a balanced tree instead of a linked-list is a tradeoff. In the case of a list, a linear scan must be performed to perform a lookup in a bucket, while the tree allows for log-time access. When the list is small, the lookup is fast and using a tree doesn't actually provide a benefit while around 8 or so elements the cost of a lookup in the list becomes significant enough that the tree provides a speed-up.
I suspect that the use of a tree is intended for the exceptional case where the key hash is catastrophically broken (e.g. many keys collide); while a linear lookup would cause performance to degrade severely, the use of a tree mitigates this performance loss somewhat, if the keys are directly comparable.
Therefore, the exact threshold of 8 entries may not be terribly significant: the chance of a tree bin is 0.00000006 assuming good key distribution, so tree bins are obviously used very rarely in such a case. When the hash algorithm is failing catastrophically, then the number of keys in the bucket is far greater than 8 anyway.
This comes at a space penalty since the tree-node must include additional references: four references to tree nodes and a boolean in addition to the fields of a LinkedHashMap.Entry (see its source).
From the comments in the HashMap class source:
Because TreeNodes are about twice the size of regular nodes, we use them only when bins contain enough nodes to warrant use (see TREEIFY_THRESHOLD). And when they become too small (due to removal or resizing) they are converted back to plain bins. In usages with well-distributed user hashCodes, tree bins are rarely used. Ideally, under random hashCodes, the frequency of nodes in bins follows a Poisson distribution (http://en.wikipedia.org/wiki/Poisson_distribution) with a parameter of about 0.5 on average for the default resizing threshold of 0.75, although with a large variance because of resizing granularity. Ignoring variance, the expected occurrences of list size k are (exp(-0.5) * pow(0.5, k) / factorial(k)).
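You can check the quoted figure yourself by evaluating that Poisson formula; the following small sketch prints the expected frequency of bins of size k for the parameter 0.5 mentioned in the comment:

```java
public class PoissonBinSizes {
    public static void main(String[] args) {
        // Evaluates the formula quoted above: with parameter 0.5, the expected
        // frequency of bins holding k entries is exp(-0.5) * 0.5^k / k!.
        double lambda = 0.5;
        double p = Math.exp(-lambda);   // k = 0 term
        for (int k = 0; k <= 8; k++) {
            System.out.printf("k = %d : %.10f%n", k, p);
            p = p * lambda / (k + 1);   // next term of the series
        }
        // k = 8 comes out around 0.00000006, matching the figure cited above.
    }
}
```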

checking the complexity of splay tree operation in java

I have implemented a splay tree (insert, search, delete operations) in Java. Now I want to check whether the complexity of the algorithm is O(log n) or not. Is there any way to check this by varying the input size (number of nodes) and measuring the run time in seconds? Say, by using input sizes like 1,000 and 100,000 and checking the run time, or is there any other way?
Strictly speaking, you cannot find the time complexity of an algorithm by running it for some values of n. Let's assume that you've run it for values n_1, n_2, ..., n_k. If the algorithm makes n^2 operations for any n <= max(n_1, ..., n_k) and exactly 10^100 operations for any larger value of n, it has constant time complexity, even though it would look quadratic from the points you have.
However, you can assess the number of operations it takes to complete on an input of size n (I wouldn't even call it time complexity here, as the latter has a strict formal definition) by running it on several values of n and looking at the ratios T(n1) / T(n2) and n1 / n2. But even in the case of a "real" algorithm (in the sense that it is not the pathological case described in the first paragraph), you should be careful with the "structure" of the input. For example, a quicksort that takes the first element as the pivot runs in O(n log n) on average for random input, so it would look like O(n log n) if you generate random arrays of different sizes; however, it runs in O(n^2) time for a reverse-sorted array.
To sum it up, if you need to figure out whether it's fast enough from a practical point of view and you have an idea of what a typical input to your algorithm looks like, you can try generating inputs of different sizes and see how the execution time grows.
However, if you need a bound on the runtime in a mathematical sense, you need to prove some properties and bounds of your algorithm mathematically.
In your case, I would say that testing on random inputs is a reasonable idea (because there is a mathematical proof that the amortized time complexity of one operation is O(log n) for a splay tree), so you just need to check that the tree you have implemented is indeed a correct splay tree. One note: I'd recommend trying different patterns of queries (such as inserting elements in sorted/reverse order and so on), as even unbalanced trees can work pretty fast as long as the input is random.
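If you do go the empirical route, a doubling experiment along these lines is one way to eyeball the growth rate. TreeMap is used here only as a self-contained stand-in so the sketch compiles; you would substitute your own splay tree's insert and search calls on the marked lines:

```java
import java.util.Random;
import java.util.TreeMap;

public class DoublingExperiment {
    public static void main(String[] args) {
        // Doubling experiment: if one operation costs roughly O(log n) amortized,
        // the total time for n inserts + n lookups grows like n log n, so the
        // ratio T(2n)/T(n) should hover slightly above 2.
        Random rnd = new Random(1);
        for (int n = 100_000; n <= 1_600_000; n *= 2) {
            TreeMap<Integer, Integer> tree = new TreeMap<>();
            long start = System.nanoTime();
            for (int i = 0; i < n; i++) tree.put(rnd.nextInt(n), i);      // <- your insert
            for (int i = 0; i < n; i++) tree.containsKey(rnd.nextInt(n)); // <- your search
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.printf("n = %9d : %6d ms%n", n, elapsedMs);
        }
        // As noted above, also try non-random patterns (sorted inserts,
        // repeated queries) before trusting the numbers.
    }
}
```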

Clarification of statement of performance of collection's binary search from javadoc

I am confused about the performance analysis of binarySearch from the Collections class.
It says:
If the specified list does not implement the RandomAccess interface
and is large, this method will do an iterator-based binary search that
performs O(n) link traversals and O(log n) element comparisons.
I am not sure how to interpret this O(n) + O(log n).
I mean, isn't it worse than simply traversing the linked list and comparing? We still get only O(n).
So what does this statement mean about performance? As phrased, I cannot see the difference from a plain linear search in the linked list.
What am I misunderstanding here?
First of all, you must understand that without the RandomAccess interface, binarySearch cannot simply access an arbitrary element of the list; instead it has to use an iterator. That introduces an O(n) cost. When the collection implements RandomAccess, the cost of each element access is O(1) and can be ignored as far as asymptotic complexity is concerned.
Because O(n) is greater than O(log n), it dominates the overall complexity. In this case binarySearch has the same asymptotic complexity as a simple linear search. So what is the advantage?
A linear search performs O(n) comparisons, as opposed to the O(log n) comparisons of binarySearch without random access. This is especially important when the constant in front of O(log n) is high; in plain English, when a single comparison has a very high cost compared to advancing the iterator. This can be quite a common scenario, so limiting the number of comparisons is beneficial. Profit!
Binary search is not suited to linked lists. The algorithm is supposed to benefit from a sorted collection with random access (like a plain array), where it can quickly jump from one element to another, splitting the remaining search space in two on each iteration (hence the O(log N) time complexity).
For a linked list, there is a modified version which iterates through the elements (and needs to traverse up to about 2n links in the worst case), but instead of comparing every element, it "probes" the list at specific positions only (hence doing far fewer comparisons than a linear search).
Since comparisons are usually slightly more costly than plain pointer iteration, the total time should be lower. That is why the log N part is stated separately.
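To see the comparison count directly, you can pass a counting Comparator to Collections.binarySearch on a LinkedList; the exact numbers will vary a little, but the comparator should be invoked only about log2(n) times:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class BinarySearchComparisons {
    public static void main(String[] args) {
        // Counts comparator calls to show that binarySearch on a LinkedList
        // performs only O(log n) comparisons, even though it traverses O(n)
        // links with an iterator to reach the probe positions.
        List<Integer> linked = new LinkedList<>();
        for (int i = 0; i < 100_000; i++) linked.add(i);

        AtomicInteger comparisons = new AtomicInteger();
        Comparator<Integer> counting = (a, b) -> {
            comparisons.incrementAndGet();
            return Integer.compare(a, b);
        };

        int index = Collections.binarySearch(linked, 73_421, counting);
        System.out.println("found at " + index + " using "
                + comparisons.get() + " comparisons");   // roughly 17 for 100,000 elements

        // A plain linear scan would perform up to n equality checks instead.
        System.out.println(new ArrayList<>(linked).indexOf(73_421));
    }
}
```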
