If millions of words are added to a HashSet, will there be any performance issues on average?
In my opinion, since a HashSet has a best-case complexity of O(1), on average it should stay below O(log N), resulting in no performance issues. I would still like to hear answers from others. This is one of the questions I was asked in an interview.
HashSet provides good add performance, so it should work fine. It is also important to implement the hashCode() method correctly (together with equals()).
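To make that concrete, here is a minimal Java sketch (the Word class and the sizes are illustrative assumptions, not from the question) showing a HashSet pre-sized for a few million entries and a key type with consistent hashCode()/equals():
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Illustrative key type: equals() and hashCode() must be consistent,
// otherwise add()/contains() on the HashSet misbehave.
final class Word {
    private final String text;

    Word(String text) { this.text = text; }

    @Override
    public boolean equals(Object o) {
        return o instanceof Word && ((Word) o).text.equals(text);
    }

    @Override
    public int hashCode() {
        return Objects.hashCode(text);
    }
}

class HashSetDemo {
    public static void main(String[] args) {
        // Pre-sizing avoids repeated rehashing while millions of elements are added.
        Set<Word> words = new HashSet<>(4_000_000);
        for (int i = 0; i < 3_000_000; i++) {
            words.add(new Word("word" + i)); // amortized O(1) per add on average
        }
        System.out.println(words.size());
    }
}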
Related
If I am trying to do a lookup or insert, how do I ensure that it has an O(log n) worst case instead of O(n)?
I have tried to look it up, and apparently the solution is to sort the values, but it seems like you cannot sort a hash table without increasing the complexity, which is not what I am hoping for.
Language is Java.
The questions below were recently asked in an interview:
You are given an array of integers with all elements repeated twice except one element, which occurs only once; you need to find the unique element with O(n log n) time complexity. Suppose the array is {2,47,2,36,3,47,36}; the output should be 3. I said we could perform merge sort (as it takes O(n log n)) and then check adjacent elements, but he said it would take O(n log n) + O(n). I also said we could use a HashMap to keep a count of the elements, but again he said no, as we would have to iterate over the HashMap again to get the result. After some research, I found that using the XOR operation gives the output in O(n). Is there any better solution, other than sorting, that can give the answer in O(n log n) time?
On a smartphone we can open many apps at a time. When we look at which apps are currently open, we see a list where the most recently opened app is at the front, and we can remove or close an app from anywhere in the list. There is some Collection in Java that can perform all these tasks efficiently. I said we could use LinkedList or LinkedHashMap, but he was not convinced. What would be the best Collection to use?
Firstly, if the interviewer used Big-O notation and expected an O(n log n) solution, there's nothing wrong with your answer. We know that O(x + y) = O(max(x, y)); therefore, although your algorithm is O(n log n + n), it's fine to just call it O(n log n). However, finding the element that appears once in a sorted array can actually be done in O(log n) using binary search. As a hint, exploit odd and even indices while searching. Also, if the interviewer expected an O(n log n) solution, the objection to the extra traversal makes no sense: the hash map solution is already O(n), and if there's a problem with it, it's the extra space requirement. For this reason, the best approach is the XOR you mentioned. There are a few more O(n) solutions, but none of them is better than the XOR solution.
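For reference, a minimal sketch of the XOR approach mentioned above (the class and method names are illustrative):
class UniqueElement {
    // XOR of all elements: paired values cancel out, the lone element remains.
    static int findUnique(int[] a) {
        int result = 0;
        for (int x : a) {
            result ^= x;
        }
        return result;
    }

    public static void main(String[] args) {
        int[] a = {2, 47, 2, 36, 3, 47, 36};
        System.out.println(findUnique(a)); // prints 3
    }
}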
To me, a LinkedList is a reasonable fit for this task as well. We want to remove from any position and also use some stack operations (push, pop, peek). A custom stack can be built on top of a LinkedList.
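A small sketch of that idea, assuming the recent-apps list is modeled as a LinkedList used as a deque (class and method names are made up for illustration):
import java.util.LinkedList;

class RecentApps {
    private final LinkedList<String> apps = new LinkedList<>();

    void open(String app) {
        apps.remove(app);   // if already open, move it to the front
        apps.addFirst(app); // most recently opened app sits at the front
    }

    void close(String app) {
        apps.remove(app);   // remove from anywhere in the list
    }

    String mostRecent() {
        return apps.peekFirst();
    }

    public static void main(String[] args) {
        RecentApps r = new RecentApps();
        r.open("Mail");
        r.open("Maps");
        r.open("Camera");
        r.close("Maps");
        System.out.println(r.mostRecent()); // Camera
    }
}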
This is about identifying the time complexity of a Java program. If I have explicit iterations such as for or while loops, we can identify the complexity. But if I use a Java API method to do some task and it iterates internally, I think we should include that as well. If so, how do we do that?
Example:
int someLength = 1000;
String someString = "some reasonably long text";
for (int i = 0; i < someLength; i++) {
    someString.contains("something"); // internal iteration happens inside contains(); how do I account for its complexity?
}
Thanks,
Aditya
Internal operations in the Java APIs have their own time complexity, determined by their implementation. For example, the contains method of String runs in roughly linear time with respect to the length of your someString variable.
In short: you should check how the inner operations work and take them into consideration when calculating complexity.
In particular, for your code the time complexity is roughly O(N*K), where N is the number of loop iterations (someLength) and K is the length of your someString variable.
You are correct in that the internal iterations will add to your complexity. However, except in a fairly small number of cases, the complexity of API methods is not well documented. Many collection operations come with an upper bound requirement for all implementations, but even in such cases there is no guarantee that the actual code doesn't have lower complexity than required. For cases like String.contains() an educated guess is almost certain to be correct, but again there is no guarantee.
Your best bet for a consistent metric is to look at the source code for the particular API implementation you are using and attempt to figure out the complexity from that. Another good approach would be to run benchmarks on the methods you care about with a wide range of input sizes and types and simply estimate the complexity from the shape of the resulting graph. The latter approach will probably yield better results for cases where the code is too complex to analyze directly.
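As a rough illustration of the benchmarking approach (not a rigorous harness such as JMH; it also assumes Java 11+ for String.repeat), you could time String.contains() over growing inputs and watch how the numbers scale:
class ContainsBenchmark {
    public static void main(String[] args) {
        for (int len = 1_000; len <= 1_000_000; len *= 10) {
            String s = "a".repeat(len) + "something"; // needle sits at the very end
            long start = System.nanoTime();
            boolean sink = false;
            for (int i = 0; i < 100; i++) {
                sink |= s.contains("something"); // keep a result so the call is not optimized away
            }
            long elapsed = System.nanoTime() - start;
            System.out.printf("length=%d, avg=%d ns, found=%b%n", len, elapsed / 100, sink);
        }
    }
}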
In addition to this quite old post, I need something that will use primitives and give a speedup for an application that contains lots of HashSets of Integers:
Set<Integer> set = new HashSet<Integer>();
People mention libraries like Guava, Javolution, and Trove, but there is no thorough comparison of them in terms of benchmarks and performance results, or at least a good answer coming from real experience. From what I see, many recommend Trove's TIntHashSet, but others say it is not that good; some say Guava is super cool and manageable, but I do not need beauty and maintainability, only execution time, so the Python-style Guava goes home :) Javolution? I've visited the website; it seems too old to me and thus wacky.
The library should provide the best achievable execution time; memory does not matter.
Looking at "Thinking in Java", there is an idea of creating a custom HashMap with int[] as keys. So I would like to see something similar with a HashSet, or simply download and use an amazing library.
EDIT (in response to the comments below)
In my project I start with about 50 HashSet<Integer> collections, then I call a function about 1000 times, and each call internally creates up to 10 HashSet<Integer> collections. If I change the initial parameters, these numbers may grow exponentially. I only use the add(), contains() and clear() methods on those collections, which is why they were chosen.
Now I want to find a library that implements a HashSet or something similar, but does it faster by avoiding the Integer autoboxing overhead and perhaps other costs I am not aware of. In fact, my data comes in as ints, and I store them in those HashSets.
Trove is an excellent choice.
The reason why it is much faster than generic collections is memory use.
A java.util.HashSet<Integer> uses a java.util.HashMap<Integer, Integer> internally. In a HashMap, each object is contained in an Entry<Integer, Integer>. These objects take an estimated 24 bytes for the Entry + 16 bytes for the actual Integer + 4 bytes in the actual hash table. This yields 44 bytes, as opposed to 4 bytes in Trove, an up to 11x memory overhead (note that unoccupied entries in the main table will yield a smaller difference in practice).
See also these experiments:
http://www.takipiblog.com/2014/01/23/java-scala-guava-and-trove-collections-how-much-can-they-hold/
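A small usage sketch, assuming Trove 3.x where the primitive set is gnu.trove.set.hash.TIntHashSet; it only exercises the add()/contains()/clear() operations the question mentions:
import gnu.trove.set.hash.TIntHashSet;

class TroveDemo {
    public static void main(String[] args) {
        TIntHashSet set = new TIntHashSet(2_000_000); // pre-size to avoid rehashing
        for (int i = 0; i < 1_000_000; i++) {
            set.add(i);                               // no Integer boxing
        }
        System.out.println(set.contains(123_456));    // true
        set.clear();
    }
}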
Take a look at the High Performance Primitive Collections for Java (HPPC). It is an alternative to Trove, mature and carefully designed for efficiency. See the JavaDoc for IntOpenHashSet.
Have you tried working with the initial capacity and load factor parameters while creating your HashSet?
HashSet doc
Initial capacity, as you might guess, refers to how big the empty HashSet will be when created, and the load factor is a threshold that determines when to grow the hash table. Normally you would like to keep the ratio between used buckets and total buckets below two thirds, which is regarded as the best ratio for good, stable performance in a hash table.
Dynamic resizing of a hash table
So basically, try to set an initial capacity that fits your needs (to avoid re-creating and rehashing the table as it grows), and fiddle with the load factor until you find a sweet spot.
It might be that for your particular data distribution and pattern of setting/getting values, a lower load factor could help (a higher one hardly will, but your mileage may vary).
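A minimal sketch of that advice, assuming you know roughly how many elements to expect (the 0.66 load factor is just the two-thirds rule of thumb from above, not a value from the docs):
import java.util.HashSet;
import java.util.Set;

class PresizedSet {
    public static void main(String[] args) {
        int expected = 1_000_000;
        float loadFactor = 0.66f;
        // capacity * loadFactor should stay above the expected element count,
        // so the table never needs to resize while filling up
        int capacity = (int) Math.ceil(expected / loadFactor);
        Set<Integer> set = new HashSet<>(capacity, loadFactor);
        for (int i = 0; i < expected; i++) {
            set.add(i);
        }
        System.out.println(set.size());
    }
}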
I understand the purpose of using hash functions to securely store passwords. I have used arrays and ArrayLists in class projects for sorting and searching data. What I am having trouble understanding is the practical value of hashtables for something like sorting and searching.
I got a lecture on hashtables, but we never had to use them in school, so it hasn't clicked. Can someone give me a practical example of a task a hashtable is useful for that couldn't be done with a numerical array or ArrayList? Also, a very simple, low-level example of a hash function would be helpful.
There are all sorts of collections out there. Collections are used for storing and retrieving things, so one of the most important properties of a collection is how fast these operations are. To estimate this "fastness", people in computer science use big-O notation, which roughly says how many individual operations a certain method (get or set, for example) has to perform. For example, to get an element of an ArrayList by index you need exactly 1 operation, which is O(1); if you have a LinkedList of length n and need to get something from the middle, you have to traverse from the start of the list to the middle, taking n/2 operations, so get has a complexity of O(n). The same applies to key-value stores such as hashtables. There are implementations that give you O(log n) complexity to get a value by its key, whereas a hashtable does it in O(1) on average. Basically, this means that getting a value from a hashtable by its key is really cheap.
Basically, hashtables have performance characteristics similar to arrays with numerical indices: cheap lookup and cheap appending (hashtables are unordered, and adding to them is cheap partly because of this), but they are much more flexible in terms of what the key may be. Given a contiguous chunk of memory and a fixed size per item, you can get the address of the nth item very easily and cheaply. That's thanks to the indices being integers; you can't do that with, say, strings, at least not directly. Hashes allow reducing any object (that implements hashing) to a number, and then you're back to arrays. You still need to check for hash collisions and resolve them (which incurs mostly a memory overhead, since you need to store the original value), but with a halfway decent implementation this is not much of an issue.
So you can now associate any (hashable) object with any value. This has countless uses (although I have to admit I can't think of one that applies to sorting or searching). You can build caches with small overhead (because checking whether the cache can help in a given case is O(1)), implement a reasonably performant object system (several dynamic languages do this), go through a list of (id, value) pairs and accumulate the values for identical ids in any way you like, and many other things.
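As one concrete example of the cache idea, here is a tiny sketch using a HashMap as a memoization table (the class name and the squared-value computation are made up for illustration):
import java.util.HashMap;
import java.util.Map;

class SquareCache {
    private final Map<Integer, Long> cache = new HashMap<>();

    long squareOf(int n) {
        // computeIfAbsent runs the (pretend-expensive) computation at most once per key;
        // the "have I seen this before?" check is O(1) on average
        return cache.computeIfAbsent(n, k -> (long) k * k);
    }

    public static void main(String[] args) {
        SquareCache c = new SquareCache();
        System.out.println(c.squareOf(12)); // computed
        System.out.println(c.squareOf(12)); // served from the cache
    }
}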
Very simple. Hashtables are often called "associative arrays." Arrays let you access your data by index. Hash tables let you access your data by any other identifier, e.g. a name. For example:
"one" is associated with 1
"two" is associated with 2
So, when you have the word "one", you can look up its value 1 using a hashtable where the key is "one" and the value is 1. An array only allows the opposite mapping (index to value).
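Written out in Java, that example might look like this (the class name is illustrative):
import java.util.HashMap;
import java.util.Map;

class WordToNumber {
    public static void main(String[] args) {
        Map<String, Integer> numbers = new HashMap<>();
        numbers.put("one", 1);
        numbers.put("two", 2);
        // looked up by name rather than by numeric index
        System.out.println(numbers.get("one")); // 1
    }
}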
For n data elements:
Hashtables allow O(k) searches (where k usually depends only on the hashing function). This is better than the O(log n) of binary search (which requires an O(n log n) sort first; if the data is not sorted, you are worse off).
However, on the flip side, hashtables tend to take roughly 3n space.
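A tiny sketch contrasting the two lookups discussed above, a hash-based contains() versus binary search on a pre-sorted array (the sample data is made up):
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class LookupComparison {
    public static void main(String[] args) {
        int[] data = {36, 47, 2, 3, 99, 15};

        Set<Integer> set = new HashSet<>();
        for (int x : data) set.add(x);
        System.out.println(set.contains(99));                      // O(1) on average, extra space

        int[] sorted = data.clone();
        Arrays.sort(sorted);                                       // O(n log n) up front
        System.out.println(Arrays.binarySearch(sorted, 99) >= 0);  // O(log n) per lookup
    }
}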