How are Trove collections more efficient than the standard Java collections?

In an interview recently, I was asked about how HashMap works in Java and I was able to explain it well and explain that in the worst case the HashMap may degenerate into a list due to chaining. I was asked to figure out a way to improve this performance but I was unable to do that during the interview. The interviewer asked me to look up "Trove".
I believe he was pointing to this page. I have read the description provided on that page but still can't figure out how it overcomes the limitations of the java.util.HashMap.
Even a hint would be appreciated. Thanks!!

The key phrase there is open addressing. Instead of hashing to an array of buckets, all the entries are in one big array. When you add an element, if the space for it is already in use you just move down the array to find a free space.
As long as the array is kept sufficiently bigger than the number of entries and the hash function is well distributed it's possible to keep average lookup times small. And by having one array you can get better performance - it's more cache friendly.
However it still has worst-case linear behaviour if (say) every key hashes to the same value, so it doesn't avoid that issue.
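To make the open-addressing idea concrete, here is a minimal linear-probing sketch of an int-to-int map. This is illustrative only, not Trove's actual implementation; it omits the resizing, removal, and load-factor management a real table needs.

```java
// Minimal linear-probing (open addressing) int->int map; illustrative sketch only.
// No resizing or removal: assumes the table is created big enough for its entries.
final class ProbingIntIntMap {
    private final int[] keys;
    private final int[] values;
    private final boolean[] used;   // marks which slots are occupied

    ProbingIntIntMap(int capacity) {
        keys = new int[capacity];
        values = new int[capacity];
        used = new boolean[capacity];
    }

    void put(int key, int value) {
        int i = index(key);
        while (used[i] && keys[i] != key) {
            i = (i + 1) % keys.length;   // probe the next slot in the same flat array
        }
        keys[i] = key;
        values[i] = value;
        used[i] = true;
    }

    Integer get(int key) {
        int i = index(key);
        while (used[i]) {
            if (keys[i] == key) {
                return values[i];
            }
            i = (i + 1) % keys.length;   // keep probing until a free slot is hit
        }
        return null;                     // not present
    }

    private int index(int key) {
        return (key & 0x7fffffff) % keys.length;
    }
}
```

Because everything lives in a few flat arrays, probing walks contiguous memory, which is where the cache-friendliness comes from.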

It seems to me from the Trove page that there are two main differences that improve performance.
The first is the use of open addressing (http://en.wikipedia.org/wiki/Hash_table#Open_addressing). This doesn't avoid the collision issue, but it does mean that there's no need to create "Entry" objects for every item that goes in the map.
The second important difference is being able to provide your own hash function, which differs from the one provided by the class of the keys. So you could provide a much faster hash function if it made sense to do so.
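To illustrate that second point, here is a sketch of what a pluggable hashing strategy can look like. The HashingStrategy interface below is a hypothetical stand-in for the kind of strategy object Trove lets you supply; check Trove's documentation for its exact class and method names.

```java
// Hypothetical strategy interface: lets a map use a cheaper hash than key.hashCode().
interface HashingStrategy<K> {
    int computeHashCode(K key);
    boolean keysEqual(K a, K b);
}

// Example strategy: hash long strings by their first few characters only,
// which can be much faster if those characters are selective enough for your data.
final class PrefixStringStrategy implements HashingStrategy<String> {
    @Override
    public int computeHashCode(String key) {
        int h = 0;
        for (int i = 0; i < Math.min(4, key.length()); i++) {
            h = 31 * h + key.charAt(i);
        }
        return h;
    }

    @Override
    public boolean keysEqual(String a, String b) {
        return a.equals(b);
    }
}
```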

One advantage of Trove is that it avoids object creation, especially for primitives.
For big hash tables on an embedded Java device this can be advantageous due to lower memory consumption.
The other advantage I saw is the use of custom hash codes/functions without the need to override hashCode(). For a specific data set, an expert in writing hash functions can turn this into a real advantage.
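For example, a primitive-keyed map avoids one boxed Integer per key and per value. A rough sketch, assuming Trove 3's gnu.trove.map.hash.TIntIntHashMap is on the classpath (the package name is an assumption; check it against the Trove version you use):

```java
import java.util.HashMap;
import java.util.Map;

import gnu.trove.map.hash.TIntIntHashMap;   // assumed Trove 3 class/package

public class PrimitiveMapDemo {
    public static void main(String[] args) {
        // Standard collections: every key and value is boxed into an Integer object,
        // plus one Entry object per mapping.
        Map<Integer, Integer> boxed = new HashMap<>();
        boxed.put(42, 1);

        // Trove: keys and values stay as int, so no Integer or Entry objects are created.
        TIntIntHashMap primitive = new TIntIntHashMap();
        primitive.put(42, 1);
        int count = primitive.get(42);   // returns a primitive int
        System.out.println(count);
    }
}
```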

Is 'hashing' more efficient than 'linear' search?

I decided to revise the Java collections framework, so I started with the internal implementations. One question came to my mind which I can't solve. I hope someone can give a clear explanation of the following.
ArrayList uses linear or binary search (both have pros/cons), and we can do anything with them! My question is: why do all the 'hashing' classes (like HashMap, for example) use the hashing principle? Couldn't they settle for linear or binary search instead? Why not just store key/value pairs inside an array? And conversely, why isn't an ArrayList, for example, stored in a hash table?
The intention of the collections framework is that the programmer will choose the data structure appropriate to the use case. Depending on what you're using it for, different data structures are appropriate.
Hashing classes use the hashing principle, as you put it, because if you choose them, then that's what you want to use. (Hashing is generally the best choice for simple, straightforward lookups.) A screwdriver uses the screwing principle because if you pick up a screwdriver, you want to screw something in; if you had a nail you needed to put in, you would have picked up the hammer instead.
But if you're not going to be performing lookups, or if linear search is good enough for you, then an ArrayList is what you want. It's not worth adding a hash table to a collection that's never going to use it, and it costs CPU and memory to do things you aren't going to need.
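To make the trade-off concrete, here is a minimal sketch of the same lookup against both structures:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LookupDemo {
    public static void main(String[] args) {
        // Hash-based lookup: roughly constant time, regardless of how many entries there are.
        Map<String, Integer> ages = new HashMap<>();
        ages.put("alice", 30);
        ages.put("bob", 25);
        Integer bobsAge = ages.get("bob");

        // Linear search: walks the list until it finds a match, O(n) in the size of the list.
        List<String> names = new ArrayList<>(List.of("alice", "bob"));
        boolean hasBob = false;
        for (String name : names) {
            if (name.equals("bob")) {
                hasBob = true;
                break;
            }
        }
        System.out.println(bobsAge + " " + hasBob);
    }
}
```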
I had a large hash of values (about 1,500). The nature of the code was that once the hashmap was loaded it would never be altered. The hashmap was accessed many times per web page, and I had wondered if it could be sped up for faster page loading.
One day I had some time, so I did a series of time tests (using the nano time function). I then reworked the hashmap use over to an array. Not an ArrayList, but an actual array[]. I stored the index with the key class used to get the hash value.
There was a difference: the array lookup was faster. I calculated that over a day's worth of activity I would have saved almost a full second!
So yes, using an array is faster than using a hash, YMMV :-)
And I reverted my code back to using a hashmap, as it was easier to maintain...

Performance tuning for searching

I am fairly new to DS and Algorithms and recently at a job interview I was asked a question on performance tuning along with code. We have a Data Structure which contains multi-billion entries and we need to search a particular word in that data structure. So which Java feature/library can we use to do the searching in the quickest time possible?
On the spot I could not think of an exact answer, so I wrote that:
We can store the values in a map and search words in the map (but got stuck how to decide key-value pair in the map).
How can I understand the exact answer to this question and what can be the optimal solution(s) ?
After reading the question and getting clarification in the comments, I think what has become apparent to me is that: you needed to ask follow-up questions.
I'll try to break it down and provide comments that I hope will be helpful, because I also know what it's like to be "in the moment" and how nerves can stab you in the back when you least need them to.
We have a Data Structure which contains multi-billion entries and we need to search a particular word in that data structure.
I think a good follow-up question here would've been:
Q: What specific data structure is being used to contain all this data?
I would press until they gave me an actual name, and explain that without one it is not possible to pick a specific Java algorithm/library. For all you know, the data structure could've been a String[], a Set<String>, or even a fancy name for a file on disk (if they're trying to throw you off). They could've also clarified and said the DS was not relevant and that you could pick whichever DS you thought was best.
The wording also implies that they implemented the structure and that it's already populated in a system with, presumably, enough memory to hold all of it. Asking to confirm that this is really the case could've given you helpful information.
For example: "Based on the wording, it seems this mystery data structure is already implemented and fully populated in memory in a system with enough memory to hold it. Can you confirm my understanding here is correct? If not, could you clarify further?"
Given the suggested wording, and the fact that we don't have additional clarifications to go from, I will assume, for the purposes of this answer, that my suppositions are indeed correct.
Note that if you had been asked to design the data structure to hold all of this info, you would've had to ask very different questions, take memory constraints into account, and perhaps even ask about character sets/encodings (e.g. ASCII vs multi-byte Unicode).
Also, if you had been asked to design the search algorithm, then knowing the DS is a pre-requisite, and not knowing this could've made the task impossible. For example, the binary search algorithm implementation will look very different if you're working on an array vs a binary search tree, even though both would offer O(lg n) time complexity.
So which java feature/library can we use to do the searching in the quickest time possible?
Consistent with the 1st part, this question only asks what pre-existing/built-in Java code you would choose to perform the search for you. The "quickest time possible" here should make you think about solutions that are in O(1), i.e. are constant time. However, the data structure may open/close doors for you.
Some search algorithms in Java work on generics and others work on other types like arrays. Some algorithms work on Maps while others work on Lists, Sets, and so on. The follow-up question from the first part could've helped in answering this question.
That said, even if you knew the DS but couldn't think of a specific method name or such at the time, I also think it should be considered reasonable to mention the interface or at least a relevant package and say that further details can be checked in the Java documentation if you're pressed for more specificity, given that's what it's there for in the first place.
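For instance, if the mystery structure turned out to be a sorted array or a sorted List, the built-in binary searches in java.util would be reasonable things to name; a minimal sketch:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class BuiltInSearchDemo {
    public static void main(String[] args) {
        // Binary search over a sorted array: O(log n).
        String[] words = {"apple", "banana", "cherry"};   // must already be sorted
        int arrayIndex = Arrays.binarySearch(words, "banana");

        // Binary search over a sorted List: also O(log n).
        List<String> wordList = List.of("apple", "banana", "cherry");
        int listIndex = Collections.binarySearch(wordList, "banana");

        System.out.println(arrayIndex + " " + listIndex);
    }
}
```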
We can store the values in a map and search words in the map (but got stuck how to decide key-value pair in the map).
Given the wording, my interpretation of their question was not "which data structure would you use?", but rather, "which pre-existing search algorithm would you choose?". It seems to me like it was them who needed to answer the question regarding DS.
That said, if you had indeed been asked "which data structure would you use?", then a Map would've still worked against you, since you didn't really need to map a key to a value. You only needed to store a value (i.e. the words). Therefore, a Set, specifically a HashSet, would've been a better candidate, since it also avoids duplicates and should consume less memory in the process because it stores singular values, rather than key/value pairs.
Of course, that's still under the assumption(s) I made earlier. If memory constraints are said to be an issue, then scaling horizontally to multiple servers and so on would've likely been necessary.
How can I understand the exact answer to this question and what can be the optimal solution(s)?
It is probably the case that they wanted to see if you would follow up with questions, given the lack of information they gave you.
There are a couple data structures that allow for efficient searching, assuming that memory requirements aren't an issue and the data structure is already populated.
Regarding time complexity, Set#contains and Map#containsKey are both O(1), assuming that the hash function isn't expensive and that there aren't many collisions.
Because the data structure stores words (assuming you're referring to Strings), it could also be relatively efficient to use a trie (radix tree, prefix tree, etc.), which lets you search character by character in O(k) time, where k is the length of the word. If the hash function is expensive or there are many collisions, this could be a good alternative!
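A minimal sketch of the constant-time membership check mentioned above, using a HashSet of words:

```java
import java.util.HashSet;
import java.util.Set;

public class MembershipDemo {
    public static void main(String[] args) {
        Set<String> words = new HashSet<>();
        words.add("interview");
        words.add("algorithm");

        // contains() hashes the query once and inspects a single bucket: O(1) on average.
        System.out.println(words.contains("algorithm"));   // true
        System.out.println(words.contains("missing"));      // false
    }
}
```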
The answer that you gave to the interviewer should suffice since hashing is an effective searching method, even for billions of entries.
You did not mention whether the entries are words or documents (multiple words). In both cases a search index could be suitable.
Search indexes extract words from the billion document entries and manage a map of these words to the documents they are used in. Frameworks like Lucene (e.g. as part of SOLR or ElasticSearch) manage memory and persistence for you.
If there were only a few thousand entries, a simple HashMap would be sufficient, because there would be no need for memory management then. If all of the billion entries are single words, a database could be a slightly better choice.
The hashmap solution is reasonable as stated by others but there are doubts with respect to scalability.
Here is a possible solution for the problem, as discussed in the post referenced below.
Sub-string match: your entry blob is a single string or word (without any white space) and you need to search for an arbitrary sub-string within it. In such cases you need to scan every entry to find the entries that best match, using an algorithm like Boyer-Moore. See this and this for details. This is also equivalent to grep, because grep uses similar machinery inside. (A minimal sketch of this naive scanning approach appears after the reference below.)
Indexed search: here you assume that each entry contains a set of words and the search is limited to fixed words. In this case, entries are indexed over all the possible occurrences of words. This is often called "full-text search". There are a number of algorithms to do this and a number of open-source projects that can be used directly. Many of them also support wildcard search, approximate search, etc., as below:
a. Apache Lucene : http://lucene.apache.org/java/docs/index.html
b. OpenFTS : http://openfts.sourceforge.net/
c. Sphinx http://sphinxsearch.com/
Most likely, if you need "fixed words" as queries, the second approach will be very fast and effective.
Reference - https://softwareengineering.stackexchange.com/questions/118759/how-to-quickly-search-through-a-very-large-list-of-strings-records-on-a-databa
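As a rough sketch of the first, grep-like approach: a naive scan with String.contains over every entry. A production version would use Boyer-Moore or one of the indexed solutions listed above; this is only meant to show the shape of the approach.

```java
import java.util.ArrayList;
import java.util.List;

public class SubstringScanDemo {
    // Naive scan: checks every entry for the query substring.
    // Cost grows with the total amount of text, like grep over a file.
    static List<String> search(List<String> entries, String query) {
        List<String> matches = new ArrayList<>();
        for (String entry : entries) {
            if (entry.contains(query)) {
                matches.add(entry);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> entries = List.of("foobarbaz", "hello", "barricade");
        System.out.println(search(entries, "bar"));   // [foobarbaz, barricade]
    }
}
```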
Multi-billion entries lie at the edge of what might conceivably be stored in main memory (for instance, storing 10 billion entries at 100 bytes per entry will take 1000 GB main memory).
While storing the data in main memory offers very high throughput (thousands to millions of requests per second), you'd likely need special hardware (typical blade servers only offer 16 GB, but there are commodity servers that permit installation of up to 3000 GB of main memory). Also, keeping this much data in the Java heap will likely cause garbage collector pauses of seconds or minutes unless special care is taken.
Therefore, unless the structure of your data admits a very compact representation in main memory (say, you only need membership checking among ints, which is possible with a 512 MB Bitset), you'll not want to store it in main memory, but on disk.
Therefore, you'll need persistence. Any relational or NoSQL database permits efficient searching by key and can handle such amounts of data with ease. To talk to a relational database, use JPA or JDBC. To talk to a non-relational database, you can use their proprietary Java API or an abstraction layer such as Spring Data.
You could also implement persistence from scratch if you wanted to (i.e. if the interviewer asks for that). A data structure optimized for efficient lookup in external memory is the B-tree; that's what many databases use internally :-)
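To make the "very compact representation" aside concrete, here is a minimal sketch of membership checking over non-negative int keys with java.util.BitSet. Covering the full signed int range would need a second BitSet, roughly 512 MB in total, and the JVM needs a correspondingly large heap.

```java
import java.util.BitSet;

public class IntMembershipDemo {
    public static void main(String[] args) {
        // One bit per possible non-negative int: about 256 MB of longs,
        // so run with a heap of at least ~512 MB (e.g. -Xmx512m).
        BitSet present = new BitSet(Integer.MAX_VALUE);

        present.set(42);
        present.set(1_000_000_007);

        // Membership check is a single bit lookup: O(1), no objects allocated.
        System.out.println(present.get(42));            // true
        System.out.println(present.get(123_456_789));   // false
    }
}
```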

No definite answer: Which Java Map is the cheapest?

It probably has been asked before, but I come across this situation time and time again: I want to store a very small amount of properties that I am absolutely certain will never ever exceed, say, 20 keys. It seems a complete waste of CPU and memory to use a HashMap, with all its overhead to begin with, plus the bad performance of calculating an advanced hash value for each key lookup, when there are only <20 keys (probably more like 5 most of the time). I am absolutely certain that calculating a hash value takes a hundred times more time than just iterating and comparing... no?
There is this talk about premature optimization, but I don't totally agree here. I am on Android mostly, and any CPU/memory saved means more juice for other stuff. Not necessarily talking about the consumer market here. The use case here is very well-defined and doesn't change much; furthermore, it would be trivial to replace a very cheap map with a HashMap in case (something that will never happen) there were suddenly a very large number of new keys.
So, my question is; which is the very cheapest, most basic Map I can use in Java?
To everything in your first paragraph: no! There won't be a dramatic memory overhead, since as far as I know a HashMap is initialized with 16 buckets and then doubles its size each time it rehashes, so in the worst case you would have about 12 excess buckets for your map, which is no big deal.
Concerning the lookup time, it is constant and equivalent to the time of accessing an element of an array, which is always better than looping over O(n) elements (even if n < 20). The only drawback of HashMap is that it is unsorted, but as far as I am concerned, I consider it the default Map implementation in Java when I have no particular requirement about the order.
To conclude : use HashMap !
If you worry about hashCode() computation time on your keys, consider caching computed values, as, for example, java.lang.String does. See the question "How does caching hashCode work in Java, as suggested by Joshua Bloch in Effective Java?" for more on that.
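A minimal sketch of that caching pattern for an immutable key class, analogous to what java.lang.String does internally:

```java
// Immutable key that computes its hash at most once and then reuses it.
final class CachedKey {
    private final String first;
    private final String second;
    private int hash;   // 0 means "not computed yet", the same convention String uses

    CachedKey(String first, String second) {
        this.first = first;
        this.second = second;
    }

    @Override
    public int hashCode() {
        int h = hash;
        if (h == 0) {
            h = 31 * first.hashCode() + second.hashCode();
            hash = h;   // cache for subsequent lookups
        }
        return h;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof CachedKey)) return false;
        CachedKey other = (CachedKey) o;
        return first.equals(other.first) && second.equals(other.second);
    }
}
```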
Caveat: I suggest you take seriously cautions about premature optimization. For most programmers in most apps, I seriously doubt you need to worry about the performance of your Map. More important is to consider needs of concurrency, iteration-order, and nulls. But since you asked, here is my specific answer.
EnumMap
If your keys are enums, then your very fastest Map implementation will be EnumMap.
Backed by a simple array indexed by the enum constants' ordinals, an EnumMap is very fast to execute while using very little memory.
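A minimal EnumMap sketch:

```java
import java.util.EnumMap;
import java.util.Map;

public class EnumMapDemo {
    enum Day { MON, TUE, WED }

    public static void main(String[] args) {
        // Backed by an array indexed by the enum's ordinal: very fast and very compact.
        Map<Day, String> schedule = new EnumMap<>(Day.class);
        schedule.put(Day.MON, "standup");
        schedule.put(Day.WED, "review");
        System.out.println(schedule.get(Day.MON));   // standup
    }
}
```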
IdentityHashMap
If you are really so concerned about performance, then consider using IdentityHashMap.
This implementation of Map uses reference-equality rather than object-equality. While there is still a hash value involved, it is the identity hash code (System.identityHashCode), typically derived from the object's memory address (so to speak; we do not have direct memory access in Java). So the possibly lengthy call to each key object's own hashCode method is avoided entirely, and performance may be better than a HashMap. You will see constant-time performance for the basic operations (get and put).
Study the documentation carefully to see if you want to take this route. Note the discussion about linear-probe versus chaining for better performance. Be aware that this class partially breaks the Map contract which mandates the use of the equals method when comparing objects. And this map does not provide concurrency.
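A minimal sketch showing the reference-equality behaviour, which is exactly the part of the Map contract it breaks:

```java
import java.util.IdentityHashMap;
import java.util.Map;

public class IdentityMapDemo {
    public static void main(String[] args) {
        Map<String, Integer> map = new IdentityHashMap<>();

        String key1 = new String("config");
        String key2 = new String("config");   // equal by equals(), but a different object

        map.put(key1, 1);
        System.out.println(map.get(key1));   // 1    (same reference)
        System.out.println(map.get(key2));   // null (different reference, despite equals())
    }
}
```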
Here is a table I made to help compare the various Map implementations bundled with Java 11.

Java hashtable or hashmap? [duplicate]

This question already has answers here: What are the differences between a HashMap and a Hashtable in Java?
I've been researching to find a faster alternative to a list. In an algorithms book, a hash table seems to be the fastest, using separate chaining. Then I found that Java has a Hashtable implementation, and from what I read it seems that it uses separate chaining. However, there is the overhead of synchronization, so HashMap is suggested as a faster alternative to Hashtable.
My questions are:
1. Is Java's HashMap the fastest data structure implemented in Java for insert/delete/search?
2. While reading, a few posts had concerns about the memory usage of HashMap. One post mentioned that an empty HashMap occupies 300 bytes. Is Hashtable more memory efficient than HashMap?
3. Also, is the hash function in each the most efficient for strings?
There is too much context missing to be able to answer the question which suggests to me that you should use the simplest option and not worry about performance until you have measured that you have a problem.
Is java hashmap the fastest data structure implemented in java to insert/delete/search?
ArrayList is significantly faster than HashMap, depending on what you need it for. I have seen people use Maps when they should have used objects. In this case a custom class instance can be 10x faster and smaller.
While reading, a few posts had concerns about the memory usage of hashmap. One post mentioned that an empty hashmap occupy 300 bytes.
Unless you know that 300 bytes (which costs less than what you would be paid on minimum wage to blink) matters, I would assume it doesn't.
Is hashtable more memory efficient than hashmap?
It can be but not enough to matter. Hashtable starts with a smaller size by default. If you make a HashMap with a smaller capacity it will be smaller.
Also, is the hash function in each the most efficient for strings?
In the general case it is efficient enough. In rare cases you may want to change the strategy eg to prevent denial of service attacks. If you really care about memory efficiency and performance perhaps you shouldn't be using String in the first place.
HashMap (or, more likely, HashSet) is probably a good place to start, at your point. It's not perfect, and it does consume more memory than e.g. a list, but it should be your default when you need fast add, remove, and contains operations. The String.hashCode() implementation is not the best hash function, though it is fast, and good enough for most purposes.
The access time of HashMap (and Hashtable as well, I believe) is O(1), since the internal bucket placement of a given value during put() is determined by computing (hash of the value's key) % (total number of buckets). That O(1) is the average access time; if, however, many keys hash to the same bucket, then the access time tends towards O(n), because all of those values land in the same bucket and grow in linked-list fashion.
As you said, considering the overhead of synchronization inside Hashtable, I would probably opt for HashMap. Besides, you can fine-tune HashMap by setting its various parameters, like the load factor, which offers a means of memory optimization. I vote for HashMap...
As you've pointed out, Hashtable is fully synchronized, so it depends on your environment. If you have many threads then ConcurrentHashMap will be a better solution. However, you can also look at Trove4J; maybe it will better suit your needs. Note that Trove uses open-addressing hashing rather than the separate chaining used by Hashtable and HashMap.
1. HashMap is only one of the fastest data structures implemented in Java for insert/delete/search. HashSet is as fast as HashMap for insert/delete/search, and ArrayList is as fast as HashMap when inserting an element at the end.
2. Hashtable is not more memory efficient than HashMap; they are both implemented with separate chaining.
3. The hash function of the two data structures is the same, but you can write a subclass extending them and override the hash function to make it best fit your application.
As others pointed out, a set would be a good replacement for a list but don't forget that lists allow duplicate elements, while sets do not, so while certain operations are faster, e.g., exists, sets and lists represent solutions to different problems.
As a start I recommend HashSet or TreeSet (in case ordering is important). A HashMap maps keys to values which is different. Refer to this discussion to understand the differences between the HashMap and Hashtable. I personally haven't used a Hashtable since 2007.
Finally, if you don't mind using a third-party library, I highly recommend taking a look at the Guava immutable collections. Immutability automatically provides thread safety and easier-to-understand programs.
EDIT: Regarding efficiency concerns, this is a moot point. As a guideline, use the data structure (as in the abstract concept of a data structure) that best fits your problem and choose the vanilla implementation available. If you can prove you have a performance problem in your code, you might start thinking about using something 'more efficient'. That's in quotes because it's a very loose definition: are we talking about memory efficiency, computing-time efficiency, garbage-collection efficiency, etc.? Never forget the rules of code optimization.
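As a small illustration of the Guava immutable collections mentioned above (assuming com.google.common.collect from a recent Guava release is on the classpath):

```java
import com.google.common.collect.ImmutableMap;
import com.google.common.collect.ImmutableSet;

public class GuavaImmutableDemo {
    public static void main(String[] args) {
        // Fixed once built: safe to share between threads without synchronization.
        ImmutableMap<String, Integer> limits = ImmutableMap.of("small", 10, "large", 100);
        ImmutableSet<String> flags = ImmutableSet.of("verbose", "dry-run");

        System.out.println(limits.get("small"));         // 10
        System.out.println(flags.contains("verbose"));   // true
        // limits.put("huge", 1000);   // would throw UnsupportedOperationException
    }
}
```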

Is it considered bad form to convert between collection types?

I hope that this question is specific enough to be deemed fit for StackOverflow. I checked the FAQ and I think this qualifies, since it is specific and related to programming.
I'm implementing a complex data mining algorithm (FP-growth) in Java. Some of the initial phases of the algorithm require me to scan a large database and keep a running count of each item type found. This seems perfectly suited to a Hashbag interface. I found one in Apache Commons which seems to work for me.
So now, my HashBag is filled with [itemType, count] entries (pairs). Later on in the algorithm, I'm required to do a lot of list-like operations on these pairs. In some cases, I must sort the collection by itemType. In others, I must sort by count. This seems perfectly suited to a List interface.
I'm left with the conclusion that I must convert my HashBag to a List. Yet it feels dirty somehow, like a waste of space and time. Is there a smarter way to do this, or is it a common situation to have a programming problem where you must treat your collection differently at different times, and conversions are a necessary evil?
One alternative is to make my own interface which is truly a list, but allows "bag-style" adds. I'd have to keep the list sorted and perform binary searches with a custom comparator every time I wanted to add something. Building that collection would probably take longer than building a Hashbag, but I'd save on the conversion step at the end. Any thoughts as to which is preferable?
Thanks!
If you used Guava's Multiset instead of Apache's Bag -- roughly analogous, but in a different style -- you can do most of this without converting. Multiset.entrySet() returns a Set<Entry<E>>, with Entry<E> effectively representing a pair of an element and a count -- that sounds like it's probably the best way to address your need to operate on the element-count pairs, maybe? You can iterate over that like you'd iterate over a Map.entrySet().
You can use Multisets.copyHighestCountFirst(Multiset) to get a multiset reordered in highest-frequency-first order, and use TreeMultiset to order by the elements directly.
(Disclosure: I contribute to Guava.)
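A small sketch of those Multiset operations, assuming Guava is on the classpath:

```java
import com.google.common.collect.HashMultiset;
import com.google.common.collect.Multiset;
import com.google.common.collect.Multisets;
import com.google.common.collect.TreeMultiset;

public class MultisetDemo {
    public static void main(String[] args) {
        Multiset<String> counts = HashMultiset.create();
        counts.add("apple");
        counts.add("apple");
        counts.add("banana");

        // Iterate over element/count pairs without converting to another collection.
        for (Multiset.Entry<String> entry : counts.entrySet()) {
            System.out.println(entry.getElement() + " -> " + entry.getCount());
        }

        // Highest-frequency-first view, and a copy ordered by the elements themselves.
        Multiset<String> byFrequency = Multisets.copyHighestCountFirst(counts);
        Multiset<String> byElement = TreeMultiset.create(counts);
        System.out.println(byFrequency + " " + byElement);
    }
}
```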
I assume you're using the Apache Commons Collections HashBag class. Have you considered using TreeBag instead? It implements the same Bag interface but efficiently keeps the data sorted according to a comparator you provide.
That said, when you need to change sort order, there isn't usually any better answer than to copy the collection to a new one with a different comparator.
Yet it feels dirty somehow, like a waste of space and time. Is there a smarter way to do this, or is it a common situation to have a programming problem where you must treat your collection differently at different times, and conversions are a necessary evil?
Sometimes it is necessary to convert between collection types. If it is necessary "dirty" or "inelegant" or "dumb" are not really relevant.
It can also be a mistake to over-think these things up front. The actual computational trade-offs are often difficult to grasp. For instance, if you changed the HashBag to a TreeBag, insertion goes from O(1) to O(logN) but you then avoid the overheads of sorting and copying. "Big Oh" analysis / thinking is not going to give you a clear answer. Indeed, the real performance is going to depend on the scaling factors, the values of N, the ratio of hits and misses in the bag and so on.
I would advise to try implementing things the obvious way, and see if it performs well enough ... and if not, profile it to see if the data structures are the main bottleneck. Then based on the profiling, and other measurements of the input datasets, figure out the best way to improve performance from your baseline implementation.
Answering my own question!
I did some experimenting with the different types of Multiset provided by the Guava library mentioned above by Louis Wasserman. In my particular test case, I'm parsing a 1 GB XML file (a database of books and authors) and creating a very large Multiset (keeping a count of how many times each author shows up in the DB). Once I reach the end of the parsing, I need to get a new Multiset which only contains the authors who showed up more than x times, where x is some threshold value. I also want my final set to be sorted by author name.
Here are two of the different ways I tried it (among others):
1) collect the original counts in a TreeMultiset and then remove any which don't meet the threshold
2) collect the original counts in a HashMultiset, and then create a new TreeMultiset to which I add each item from the hash multiset whose count meets the threshold
The second way proved to be significantly faster (roughly 25%), despite the conversion and extra memory usage. Obviously a big part of this is that it is pretty inefficient to delete from binary trees.
So the clear conclusion here is that in this situation, conversion is a good move (unless you have memory constraints that won't allow it).
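For reference, the faster second approach looked roughly like this sketch (the names here are illustrative, not my actual parsing code):

```java
import com.google.common.collect.HashMultiset;
import com.google.common.collect.Multiset;
import com.google.common.collect.TreeMultiset;

public class ThresholdFilterDemo {
    // Approach 2: count into a HashMultiset, then copy only the frequent authors
    // into a TreeMultiset so the final result is sorted by author name.
    static Multiset<String> frequentAuthors(Iterable<String> authorOccurrences, int threshold) {
        Multiset<String> counts = HashMultiset.create();
        for (String author : authorOccurrences) {
            counts.add(author);
        }

        TreeMultiset<String> result = TreeMultiset.create();
        for (Multiset.Entry<String> entry : counts.entrySet()) {
            if (entry.getCount() >= threshold) {
                result.add(entry.getElement(), entry.getCount());
            }
        }
        return result;
    }
}
```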
Thanks again for turning me onto the Guava library, Louis!
