No definite answer: Which Java Map is the cheapest?

It has probably been asked before, but I come across this situation time and time again: I want to store a very small amount of properties that I am absolutely certain will never exceed, say, 20 keys (probably more like 5 most of the time). It seems a complete waste of CPU and memory to use a HashMap, with all its overhead to begin with, plus the bad performance of calculating an advanced hash value for each key lookup. I am fairly certain that calculating a hash value takes a hundred times more time than just iterating and comparing ...no?
There is this talk about premature optimization, but I don't totally agree here. I am on Android mostly, and any CPU/memory saved means more juice for other stuff. I'm not necessarily talking about the consumer market here. The use case is very well defined and doesn't change much; furthermore, it would be trivial to replace a very cheap map with a HashMap in the (never going to happen) case that a very large number of new keys suddenly appears.
So, my question is: which is the cheapest, most basic Map implementation I can use in Java?

To your first paragraph: no! There won't be a dramatic memory overhead since, as far as I know, a HashMap is initialized with 16 buckets and doubles its size each time it rehashes, so in the worst case you would have 12 unused buckets for your map. No big deal.
Concerning lookup time, it is constant and equivalent to accessing an element of an array, which is always better than looping over O(n) elements (even if n < 20). The only drawback of HashMap is that it is unsorted, but as far as I am concerned, I consider it the default Map implementation in Java when I have no particular requirement about ordering.
To conclude: use HashMap!
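For what it's worth, a minimal sketch (names are mine) of pre-sizing a HashMap for a small, fixed set of keys so it never rehashes:

import java.util.HashMap;
import java.util.Map;

public class SmallMapDemo {
    public static void main(String[] args) {
        // 32 buckets at the default 0.75 load factor hold 24 entries
        // before any resize, comfortably above the ~20-key ceiling.
        Map<String, String> props = new HashMap<>(32);
        props.put("host", "localhost");
        props.put("port", "8080");
        System.out.println(props.get("port")); // prints 8080
    }
}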

If you worry about hashCode() computation time on your keys, consider caching the computed values, as java.lang.String does. See the question "How does the caching of hashCode work in Java, as suggested by Joshua Bloch in Effective Java?" for more on that.
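A minimal sketch of that caching pattern, assuming an immutable key class (CachedKey and its fields are made up for illustration):

public final class CachedKey {
    private final String a;
    private final String b;
    private int hash; // 0 means "not computed yet", the same trick java.lang.String uses

    public CachedKey(String a, String b) {
        this.a = a;
        this.b = b;
    }

    @Override
    public int hashCode() {
        int h = hash;
        if (h == 0) { // compute lazily on first use, then reuse
            h = 31 * a.hashCode() + b.hashCode();
            hash = h; // benign data race: every thread computes the same value
        }
        return h;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof CachedKey)) return false;
        CachedKey k = (CachedKey) o;
        return a.equals(k.a) && b.equals(k.b);
    }
}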

Caveat: I suggest you take seriously the cautions about premature optimization. For most programmers in most apps, I seriously doubt you need to worry about the performance of your Map. More important is to consider the needs of concurrency, iteration order, and nulls. But since you asked, here is my specific answer.
EnumMap
If your keys are enums, then your very fastest Map implementation will be EnumMap.
Backed by a simple array indexed by the ordinal of each enum constant, an EnumMap is very fast to execute while using very little memory.
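A short sketch of EnumMap usage (the keys and values here are just an illustration):

import java.time.DayOfWeek;
import java.util.EnumMap;
import java.util.Map;

public class EnumMapDemo {
    public static void main(String[] args) {
        // Values live in a plain array indexed by each constant's ordinal,
        // so get/put involve no hashing at all.
        Map<DayOfWeek, String> meetings = new EnumMap<>(DayOfWeek.class);
        meetings.put(DayOfWeek.MONDAY, "standup");
        meetings.put(DayOfWeek.FRIDAY, "retrospective");
        System.out.println(meetings.get(DayOfWeek.FRIDAY)); // retrospective
    }
}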
IdentityHashMap
If you are really so concerned about performance, then consider using IdentityHashMap.
This implementation of Map uses reference-equality rather than object-equality. While there is still a hash value involved, it is a hash of the object's address in memory (so to speak; we do not have direct memory access in Java). So the possibly lengthy call to each key object's own hashCode method is avoided entirely, and performance may be better than with a HashMap. You will see constant-time performance for the basic operations (get and put).
Study the documentation carefully to see if you want to take this route. Note the discussion about linear-probe versus chaining for better performance. Be aware that this class partially breaks the Map contract which mandates the use of the equals method when comparing objects. And this map does not provide concurrency.
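A small sketch showing the reference-equality behaviour that breaks the usual Map contract:

import java.util.IdentityHashMap;
import java.util.Map;

public class IdentityHashMapDemo {
    public static void main(String[] args) {
        Map<String, Integer> byRef = new IdentityHashMap<>();
        String k1 = new String("key");
        String k2 = new String("key"); // equals(k1), but a distinct object
        byRef.put(k1, 1);
        byRef.put(k2, 2);
        System.out.println(byRef.size());  // 2 -- both stay, keys compared by reference
        System.out.println(byRef.get(k1)); // 1 -- found via ==, not equals()
    }
}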
Here is a table I made to help compare the various Map implementations bundled with Java 11.

Related

Is ImmutableMap a sub-optimal choice for a large volume of keys/objects?

I was doing some tests with a colleague, and we were pulling data in from a database (about 350,000 records), converting each record into an object and a key object, and then populating them into an ImmutableMap.Builder.
When we called the build() method it took forever, probably due to all the data-integrity checks that come with ImmutableMap (duplicate keys, nulls, etc.). To be fair we tried a HashMap as well, and that took a while, but not as long as the ImmutableMap. We finally settled on a ConcurrentHashMap, which we populated from 9 threads as the records were iterated, and wrapped in an unmodifiable map. The performance was good.
I noticed the documentation says ImmutableMap is not optimized for equals() operations. As a die-hard immutable-ist, I'd like ImmutableMap to work for large data volumes, but I'm getting the sense it is not meant for that. Is that assumption right? Is it optimized only for small/medium-sized data sets? Is there a hidden implementation I need to invoke via copyOf() or something?
I guess your key.equals() is a time-consuming method.
key.equals() is called many more times in ImmutableMap.build() than in HashMap.put() (in a loop), while key.hashCode() is called the same number of times in both. As a result, if key.equals() takes a long time, the total durations can differ greatly.
key.equals() is called only a few times during HashMap.put() (a good hash algorithm leads to few collisions).
In ImmutableMap.build(), by contrast, key.equals() is called many times in checkNoConflictInBucket(): O(n) calls to key.equals().
Once the map is built, the two types of Map should not differ much on access, as both are hash-based.
Sample: with 10,000 random Strings to put as keys, HashMap.put() calls String.equals() 2 times, while ImmutableMap.build() calls it 3,000 times.
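To make the comparison concrete, here is a hedged sketch of the two build paths (Guava on the classpath is assumed; the keys are arbitrary):

import com.google.common.collect.ImmutableMap;
import java.util.HashMap;
import java.util.Map;

public class BuildComparison {
    public static void main(String[] args) {
        Map<String, Integer> mutable = new HashMap<>();
        ImmutableMap.Builder<String, Integer> builder = ImmutableMap.builder();
        for (int i = 0; i < 10_000; i++) {
            String key = "key-" + i;
            mutable.put(key, i);  // equals() runs only on hash collisions
            builder.put(key, i);  // duplicate/conflict checks are deferred...
        }
        Map<String, Integer> immutable = builder.build(); // ...and paid for here
        System.out.println(mutable.size() + " / " + immutable.size());
    }
}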
My experience is that none of Java's built-in Collection classes are really optimised for performance at huge volumes. For example, once hashCode has been used as an index into the array, HashMap simply iterates and compares the key via equals to each item with the same hash. If you are going to store several million items in the map then you need a very well-designed hash and a large capacity. These classes are designed to be as generic and safe as possible.
So performance optimisations to try if you wish to stick with the standard Java HashMap:
Make sure your hash function provides as close to an even distribution as possible. Many domains have skewed values, and your hash needs to take account of this.
When you have a lot of data, HashMap will be expanded many times. Ideally, set the initial capacity as close as possible to the final value (see the sketch at the end of this answer).
Make sure your equals implementation is as efficient as possible.
There are massive performance optimisations you can apply if you know, for example, that your key is an integer: say, using some form of b-tree after the hash has been applied, and using == rather than equals.
So the simple answer is that I believe you will need to write your own collection to get the performance you want or use one of the more optimised implementations available.
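As a concrete illustration of the initial-capacity point above, a minimal sketch (the sizes are arbitrary):

import java.util.HashMap;
import java.util.Map;

public class PresizedMapDemo {
    public static void main(String[] args) {
        int expected = 1_000_000;
        // Capacity chosen so that `expected` entries stay under the default
        // 0.75 load factor, so the map never has to double and rehash.
        Map<Long, String> map = new HashMap<>((int) (expected / 0.75f) + 1);
        for (long i = 0; i < expected; i++) {
            map.put(i, "value");
        }
        System.out.println(map.size()); // 1000000, with zero resizes along the way
    }
}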

Java hashtable or hashmap? [duplicate]

Duplicate of: What are the differences between a HashMap and a Hashtable in Java?
I've been researching to find a faster alternative to a list. In an algorithm book, a hash table using separate chaining seems to be the fastest. Then I found that Java has a Hashtable implementation, and from what I read it seems to use separate chaining. However, there is the overhead of synchronization, so HashMap is suggested as a faster alternative to Hashtable.
My questions are:
1. Is Java's HashMap the fastest data structure implemented in Java for insert/delete/search?
2. While reading, a few posts had concerns about the memory usage of HashMap. One post mentioned that an empty HashMap occupies 300 bytes. Is Hashtable more memory-efficient than HashMap?
3. Also, is the hash function in each the most efficient for strings?
There is too much context missing to be able to answer the question which suggests to me that you should use the simplest option and not worry about performance until you have measured that you have a problem.
Is Java's HashMap the fastest data structure implemented in Java for insert/delete/search?
ArrayList is significantly faster than HashMap, depending on what you need it for. I have seen people use Maps when they should have used objects; in such cases a custom class instance can be 10× faster and smaller.
While reading, a few posts had concerns about the memory usage of HashMap. One post mentioned that an empty HashMap occupies 300 bytes.
Unless you know that 300 bytes (which costs less than what you would be paid on minimum wage to blink) matters, I would assume it doesn't.
Is Hashtable more memory-efficient than HashMap?
It can be, but not by enough to matter. Hashtable starts with a smaller size by default. If you make a HashMap with a smaller capacity, it will be smaller too.
Also, is the hash function in each the most efficient for strings?
In the general case it is efficient enough. In rare cases you may want to change the strategy, e.g. to prevent denial-of-service attacks. If you really care about memory efficiency and performance, perhaps you shouldn't be using String in the first place.
HashMap (or, more likely, HashSet) is probably a good place to start for your purposes. It's not perfect, and it does consume more memory than e.g. a list, but it should be your default when you need fast add, remove, and contains operations. The String.hashCode() implementation is not the best hash function, though it is fast and good enough for most purposes.
The access time of HashMap (and Hashtable as well, I believe) is O(1), since the internal bucket placement of a given value during put() is determined by computing (hash of the value's key) % (total number of buckets). This O(1) is the average access time; if, however, many keys hash to the same bucket, the access time tends towards O(n), because all the values placed in that bucket grow in linked-list fashion.
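A toy illustration of that placement rule (the real HashMap additionally spreads the hash bits and masks with capacity − 1, since its capacity is a power of two):

public class BucketIndexDemo {
    // Reduce a key's hash code to a bucket index; floorMod avoids the
    // negative results a plain % would give for negative hash codes.
    static int bucketIndexOf(Object key, int bucketCount) {
        return Math.floorMod(key.hashCode(), bucketCount);
    }

    public static void main(String[] args) {
        // The same key always lands in the same bucket: an O(1) placement.
        System.out.println(bucketIndexOf("port", 16));
        System.out.println(bucketIndexOf("host", 16));
    }
}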
Considering the overhead of synchronization inside Hashtable, as you said, I would opt for HashMap. Besides, you can fine-tune HashMap by setting its various parameters, like the load factor, which offers a means of memory optimization. I vote for HashMap...
As you've pointed out, Hashtable is fully synchronized, so it depends on your environment. If you have many threads then ConcurrentHashMap will be the better solution. You could also look at Trove4J; maybe it will better suit your needs. Trove uses open addressing rather than the chaining used by Hashtable and HashMap.
1. HashMap is only one of the fastest data structures implemented in Java for insert/delete/search; HashSet is as fast as HashMap for insert/delete/search, and ArrayList is as fast as HashMap when inserting an element at the end.
2. Hashtable is not more memory-efficient than HashMap; they are both implemented with separate chaining.
3. The hash function of the two data structures is the same, but you can write a subclass that extends them and override the hash function to make it fit your application best.
As others pointed out, a set would be a good replacement for a list, but don't forget that lists allow duplicate elements while sets do not. So while certain operations are faster, e.g. exists, sets and lists are solutions to different problems.
As a start I recommend HashSet or TreeSet (in case ordering is important). A HashMap maps keys to values which is different. Refer to this discussion to understand the differences between the HashMap and Hashtable. I personally haven't used a Hashtable since 2007.
Finally, if you don't mind using a third-party library, I highly recommend taking a look at the Guava immutable collections. Immutability automatically provides thread safety and easier-to-understand programs.
EDIT: Regarding efficiency concerns, this is a moot point. As a guideline, use the data structure (as in the abstract concept of a data structure) that best fits your problem and choose the vanilla implementation available. If you can prove you have a performance problem in your code, you might start thinking about using something 'more efficient'. That's in quotes because it's a very loose definition: are we talking about memory efficiency, computing-time efficiency, garbage-collection efficiency, etc.? Never forget the rules for code optimization.

How are Trove collections more efficient than the standard Java collections?

In an interview recently, I was asked about how HashMap works in Java and I was able to explain it well and explain that in the worst case the HashMap may degenerate into a list due to chaining. I was asked to figure out a way to improve this performance but I was unable to do that during the interview. The interviewer asked me to look up "Trove".
I believe he was pointing to this page. I have read the description provided on that page but still can't figure out how it overcomes the limitations of the java.util.HashMap.
Even a hint would be appreciated. Thanks!!
The key phrase there is open addressing. Instead of hashing to an array of buckets, all the entries are in one big array. When you add an element, if the space for it is already in use you just move down the array to find a free space.
As long as the array is kept sufficiently bigger than the number of entries and the hash function is well distributed it's possible to keep average lookup times small. And by having one array you can get better performance - it's more cache friendly.
However it still has worst-case linear behaviour if (say) every key hashes to the same value, so it doesn't avoid that issue.
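A toy open-addressing table to make the idea concrete: one flat array, step forward on collision. No resizing or deletion; purely illustrative:

public class LinearProbeDemo {
    static final int CAPACITY = 16; // kept well above the number of entries
    static final Object[] keys = new Object[CAPACITY];
    static final Object[] vals = new Object[CAPACITY];

    static void put(Object key, Object val) {
        int i = Math.floorMod(key.hashCode(), CAPACITY);
        while (keys[i] != null && !keys[i].equals(key)) {
            i = (i + 1) % CAPACITY; // slot taken by another key: probe onward
        }
        keys[i] = key;
        vals[i] = val;
    }

    static Object get(Object key) {
        int i = Math.floorMod(key.hashCode(), CAPACITY);
        while (keys[i] != null) {
            if (keys[i].equals(key)) return vals[i];
            i = (i + 1) % CAPACITY;
        }
        return null; // reached an empty slot: key absent
    }

    public static void main(String[] args) {
        put("a", 1);
        put("b", 2);
        System.out.println(get("a") + ", " + get("b")); // 1, 2
    }
}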
It seems to me from the Trove page that there are two main differences that improve performance.
The first is the use of open addressing (http://en.wikipedia.org/wiki/Hash_table#Open_addressing). This doesn't avoid the collision issue, but it does mean that there's no need to create "Entry" objects for every item that goes in the map.
The second important difference is being able to provide your own hash function, which differs from the one provided by the class of the keys. So you could provide a much faster hash function if it made sense to do so.
One advantage of Trove is that it avoids object creation, especially for primitives. For big hash tables on an embedded Java device this can be advantageous, due to lower memory consumption.
The other advantage I saw is the use of custom hash codes/functions without the need to override hashCode(). For a specific data set, and an expert in writing hash functions, this can be an advantage.

Is there a parallel processing implementation of HashMap available to Java? Is it even possible?

Searching for the magic ParallelHashMap class
More succinctly, can you use multiple threads to speed up HashMap lookups? Are there any implementations out there that do this already?
In my project, we need to maintain a large map of objects in memory. We never modify the map after it is created, so the map is strictly read-only. However, read and look-up performance on this map is absolutely critical for the success of the application. The systems that the application will be installed on typically have many hardware threads available. Yet, our look-ups only utilize a single thread to retrieve values from the HashMap. Could a divide and conquer approach using multiple threads (probably in a pool) help improve look-up speed?
Most of my google searches have been fruitless - returning lots of results about concurrency problems rather than solutions. Any advice would be appreciated, but if you know of an out of the box solution, you are awesome.
Also of note, all keys and values are immutable. Hash code values are precomputed and stored in the objects themselves on instantiation.
As for the details of the implementation, the Map has about 35,000 items in it. Both keys and values are objects. Keys are custom look-up keys and values are strings. Currently, we can process about 5,000 look-ups per second at most (this includes a bit of overhead from some other logic, but the main bottleneck is the map implementation itself). However, in order to keep up with our future performance needs, I want to get this number up to around 10,000 look-ups per second. By most normal standards our current implementation is fast; it's just that we need it faster.
In our Map of 35,000 values we have about one hash code collision on average, so I'm guessing that the hash codes are reasonably well-distributed.
So your hash codes are precomputed and the equals function is fast - your hashmap gets should be very fast in this case.
Have you profiled your application to prove that the hashmap gets are indeed the bottleneck?
If you have multiple application threads, they should all be able to perform their own gets from the hashmap at the same time - since you aren't modifying the map, you don't need to externally synchronize the gets. Is the application that uses the hashmap sufficiently threaded to be able to make use of all your hardware threads?
Since the contents of the hash table are immutable, it might be worth looking into perfect hashing: with a perfect hash function, you should never have collisions or need chaining in the hash table, which may improve performance. I don't know of a Java implementation offhand, but I know in C/C++ there is gperf.
Sounds like you should profile. You could have a high collision rate. You could also try using a lower loadFactor in the HashMap - to reduce collision probability.
Also, if the hashCodes are precomputed, then there is not much work for get() to do except mod and a few equals(). How fast is equals() on your key objects?
To answer your question: yes, absolutely. AS LONG AS YOU AREN'T WRITING TO IT.
You're going to have to make it by hand, and it's going to be a little tricky. Before trying that, have you optimized as much as possible?
In C++, check out Google's dense hash map class in their sparsehash package.
In Java, if you're mapping with a primitive key, use Trove or Colt maps.
That said, here's a start for your parallel hash map: if you choose n hash functions and spawn n threads to search down each path (probing/chaining at each of the n insertion points) you'll get a decent speedup. Be careful because there's a high cost to creating threads, so spawn the threads on construction and then block them until they're needed.
Hopefully the cost of locking won't be higher than the cost of lookup, but that's up to you to experiment with.
From the HashMap documentation (I've changed the emphasis):
Note that this implementation is not synchronized. If multiple threads access this map concurrently, and at least one of the threads modifies the map structurally, it must be synchronized externally.
Since your HashMap is never modified you can safely let multiple threads read from it. Implementing locking is not necessary. (The same is true for any case where threads share access to immutable data; in general the simplest way to achieve thread safety is not to share any writable memory)
To ensure that your code doesn't modify the map by accident, I would wrap the map with Collections.unmodifiableMap immediately after its construction. Don't let any references to the original modifiable map linger.
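A minimal sketch of that construction pattern:

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class ReadOnlyMapDemo {
    public static void main(String[] args) {
        Map<String, String> building = new HashMap<>();
        building.put("k", "v");
        // Publish only the unmodifiable view; let the builder reference die.
        final Map<String, String> shared = Collections.unmodifiableMap(building);
        building = null;
        System.out.println(shared.get("k")); // reads are safe from any thread
        // shared.put("x", "y");             // would throw UnsupportedOperationException
    }
}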
You mentioned this in a comment:
I'm doing equals checks between 5 numbers referenced
From this I infer that your hash computation is also doing some calculations with these 5 numbers. For good HashMap performance, the results of this computation should be randomly dispersed over all possible int values. From the HashMap documentation:
This implementation provides constant-time performance for the basic operations (get and put), assuming the hash function disperses the elements properly among the buckets.
In other words, look-up times should remain constant regardless of the element count, if you have a good hash function. Example of a good hashCode() function for a class that stores three numbers (using a prime multiplier to reduce the chance of the XOR yielding zero, as suggested in a comment):
return this.a.hashCode() ^ (31 * (this.b.hashCode() ^ (31 * this.c.hashCode())));
Example of a bad hashCode function:
return (this.a + this.b + this.c);
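And since the asker precomputes hash codes at construction time, a hedged sketch of such a key class (the five ints stand in for whatever the real fields are):

public final class LookupKey {
    private final int a, b, c, d, e;
    private final int hash; // computed once at construction, as the asker describes

    public LookupKey(int a, int b, int c, int d, int e) {
        this.a = a; this.b = b; this.c = c; this.d = d; this.e = e;
        int h = a;
        h = 31 * h + b;
        h = 31 * h + c;
        h = 31 * h + d;
        h = 31 * h + e; // the prime multiplier spreads values better than plain + or ^
        this.hash = h;
    }

    @Override public int hashCode() { return hash; }

    @Override public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof LookupKey)) return false;
        LookupKey k = (LookupKey) o;
        return a == k.a && b == k.b && c == k.c && d == k.d && e == k.e;
    }
}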
HashMaps have constant lookup times. I'm not sure how you could really speed that up, since having multiple threads execute the hashing function would only slow it down.
I think you need evidence that the get() method on the HashMap is where your delay is. I think this is highly unlikely. Put a loop around your get() method to make it repeat 1,000 times and your application probably won't slow down at all. Then you'll know that the delay is elsewhere.

Rule of thumb for choosing an implementation of a Java Collection?

Anyone have a good rule of thumb for choosing between different implementations of Java Collection interfaces like List, Map, or Set?
For example, generally why or in what cases would I prefer to use a Vector or an ArrayList, a Hashtable or a HashMap?
I really like this cheat sheet from Sergiy Kovalchuk's blog entry, but unfortunately it is offline. However, the Wayback Machine has a historical copy:
More detailed was Alexander Zagniotov's flowchart, also offline; therefore, here is a historical copy from the blog:
Excerpt from the blog on concerns raised in comments:
"This cheat sheet doesn't include rarely used classes like WeakHashMap, LinkedList, etc. because they are designed for very specific or exotic tasks and shouldn't be chosen in 99% cases."
I'll assume you know the difference between a List, Set and Map from the above answers. Why you would choose between their implementing classes is another thing. For example:
List:
ArrayList is quick at retrieving but slow at inserting. It's good for an implementation that reads a lot but doesn't insert/remove a lot. It keeps its data in one contiguous block of memory, so every time it needs to expand, it copies the whole array.
LinkedList is slow at retrieving but quick at inserting. It's good for an implementation that inserts/removes a lot but doesn't read a lot. It doesn't keep its data in one contiguous block of memory.
Set:
HashSet doesn't guarantee the order of iteration, and is therefore the fastest of the sets. It has high overhead and is slower than ArrayList, so you shouldn't use it except for a large amount of data, when its hashing speed becomes a factor.
TreeSet keeps the data ordered, therefore is slower than HashSet.
Map: The performance and behavior of HashMap and TreeMap are parallel to the Set implementations.
Vector and Hashtable should not be used. They are synchronized implementations from before the release of the new Collections hierarchy, and thus slow. If synchronization is needed, use Collections.synchronizedCollection().
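For instance, a sketch of the wrapper approach:

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SyncWrapperDemo {
    public static void main(String[] args) {
        // The modern replacement for Vector/Hashtable: start unsynchronized,
        // and add synchronization only where it is actually needed.
        List<String> syncList = Collections.synchronizedList(new ArrayList<>());
        Map<String, Integer> syncMap = Collections.synchronizedMap(new HashMap<>());
        syncList.add("a");
        syncMap.put("a", 1);
        System.out.println(syncList.size() + ", " + syncMap.get("a"));
    }
}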
I've always made those decisions on a case by case basis, depending on the use case, such as:
Do I need the ordering to remain?
Will I have null key/values? Dups?
Will it be accessed by multiple threads?
Do I need a key/value pair?
Will I need random access?
And then I break out my handy 5th edition Java in a Nutshell and compare the ~20 or so options. It has nice little tables in Chapter five to help one figure out what is appropriate.
OK, maybe if I know off the cuff that a simple ArrayList or HashSet will do the trick, I won't look it all up. ;) But if there is anything remotely complex about my intended use, you bet I'm in the book. BTW, I thought Vector was supposed to be 'old hat'; I've not used one in years.
Theoretically there are useful Big-Oh tradeoffs, but in practice these almost never matter.
In real-world benchmarks, ArrayList out-performs LinkedList even with big lists and with operations like "lots of insertions near the front." Academics ignore the fact that real algorithms have constant factors that can overwhelm the asymptotic curve. For example, linked lists require an additional object allocation for every node, which means slower node creation and vastly worse memory-access characteristics.
My rule is:
Always start with ArrayList and HashSet and HashMap (i.e. not LinkedList or TreeMap).
Type declarations should always be an interface (i.e. List, Set, Map), so if a profiler or a code review proves otherwise, you can change the implementation without breaking anything.
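In code, that rule looks something like this:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DefaultChoices {
    // Declared against the interfaces; only the right-hand side names an
    // implementation, so swapping in LinkedList or TreeMap later is one edit.
    private final List<String> names = new ArrayList<>();
    private final Set<Integer> seen = new HashSet<>();
    private final Map<String, Integer> counts = new HashMap<>();
}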
About your first question...
List, Map and Set serve different purposes. I suggest reading about the Java Collections Framework at http://java.sun.com/docs/books/tutorial/collections/interfaces/index.html.
To be a bit more concrete:
use List if you need an array-like data structure and you need to iterate over the elements
use Map if you need something like a dictionary
use a Set if you only need to decide if something belongs to the set or not.
About your second question...
The main difference between Vector and ArrayList is that the former is synchronized, the latter is not synchronized. You can read more about synchronization in Java Concurrency in Practice.
The difference between Hashtable (note that the t is not a capital letter) and HashMap is similar: the former is synchronized, the latter is not.
I would say that there are no rule of thumb for preferring one implementation or another, it really depends on your needs.
For non-sorted data the best choice, more than nine times out of ten, will be: ArrayList, HashMap, HashSet.
Vector and Hashtable are synchronised and therefore might be a bit slower. It's rare that you would want synchronised implementations, and when you do, their interfaces are not sufficiently rich for their synchronisation to be useful. In the case of Map, ConcurrentMap adds extra operations to make the interface useful. ConcurrentHashMap is a good implementation of ConcurrentMap.
LinkedList is almost never a good idea. Even if you are doing a lot of insertions and removal, if you are using an index to indicate position then that requires iterating through the list to find the correct node. ArrayList is almost always faster.
For Map and Set, the hash variants will be faster than the tree/sorted ones. Hash algorithms tend to have O(1) performance, whereas trees will be O(log n).
Lists allow duplicate items, while Sets allow only one instance.
I'll use a Map whenever I need to perform a lookup.
For the specific implementations, there are order-preserving variations of Maps and Sets but largely it comes down to speed. I'll tend to use ArrayList for reasonably small Lists and HashSet for reasonably small sets, but there are many implementations (including any that you write yourself). HashMap is pretty common for Maps. Anything more than 'reasonably small' and you have to start worrying about memory so that'll be way more specific algorithmically.
This page has lots of animated images along with sample code testing LinkedList vs. ArrayList if you're interested in hard numbers.
EDIT: I hope the following links demonstrate how these things are really just items in a toolbox, you just have to think about what your needs are: See Commons-Collections versions of Map, List and Set.
Well, it depends on what you need. The general guidelines are:
List is a collection where data is kept in insertion order and each element has an index.
Set is a bag of elements without duplication (if you reinsert the same element, it won't be added). Data has no notion of order.
Map: you access and write your data elements by their key, which can be any possible object.
Attribution: https://stackoverflow.com/a/21974362/2811258
For more information about Java Collections, check out this article.
As suggested in other answers, there are different scenarios for choosing the correct collection depending on the use case. I am listing a few points:
ArrayList:
Most cases where you just need to store a "bunch of things" and later iterate through them. Iterating is fast, as it is index-based.
Whenever you create an ArrayList, a fixed amount of memory is allocated to it; once that is exceeded, it copies the whole array.
LinkedList:
It uses a doubly linked list, so insertion and deletion operations are fast, as they only add or remove a node.
Retrieving is slow, as it has to iterate through the nodes.
HashSet:
Making yes-no decisions about an item, e.g. "is the item a word of English?", "is the item in the database?", "is the item in this category?", etc.
Remembering "which items you've already processed", e.g. when doing a web crawl;
HashMap:
Used in cases where you need to ask "for a given X, what is the Y?" It is often useful for implementing in-memory caches or indexes, i.e. key-value pairs. For example:
For a given user ID, what is their cached name/User object?
Always go with HashMap to perform a lookup.
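A small illustrative cache along those lines (the user-ID example is hypothetical):

import java.util.HashMap;
import java.util.Map;

public class UserNameCache {
    // "For a given user ID, what is their cached name?"
    private final Map<Integer, String> nameById = new HashMap<>();

    void remember(int userId, String name) {
        nameById.put(userId, name);
    }

    String lookup(int userId) {
        return nameById.get(userId); // average O(1)
    }

    public static void main(String[] args) {
        UserNameCache cache = new UserNameCache();
        cache.remember(42, "alice");
        System.out.println(cache.lookup(42)); // alice
    }
}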
Vector and Hashtable are synchronized and therefore a bit slower; if synchronization is needed, use Collections.synchronizedCollection().
Check this for sorted collections.
Hope this helped.
I found Bruce Eckel's Thinking in Java to be very helpful. He compares the different collections very well. I used to keep a diagram he published showing the inheritance hierarchy on my cube wall as a quick reference. One thing I suggest you do is keep thread safety in mind. Performance usually means not thread-safe.
Use Map for key-value pairing
For key-value tracking, use Map implementation.
For example, tracking which person is covering which day of the weekend. So we want to map a DayOfWeek object to an Employee object.
Map<DayOfWeek, Employee> weekendWorker =
        Map.of(
                DayOfWeek.SATURDAY, alice,
                DayOfWeek.SUNDAY, bob
        );
When choosing one of the Map implementations, there are several aspects to consider: concurrency, tolerance for NULL values in keys and/or values, order when iterating keys, tracking by reference versus content, and convenience of literal syntax.
Here is a chart I made showing the various aspects of each of the ten Map implementations bundled with Java 11.
