This is a very generic question; I am just using a Java HashMap as an example.
I have a HashMap:
Map<Integer,Integer> idPriceMap=new HashMap<Integer,Integer>();
idPriceMap.put(10,20);
idPriceMap.put(11,25);
idPriceMap.put(12,0);
idPriceMap.put(13,100);
idPriceMap.put(14,20);
idPriceMap.put(15,40);
idPriceMap.put(16,90);
Requirements might differ, e.g.:
UseCase1: I want the value for a particular key, assuming that I know the key (PS: I know that in this scenario HashMap is the best structure).
UseCase2: I want to get all the values.
For now consider UseCase2 only. The question is: is using a HashMap for that good practice?
In another scenario I have both UseCase1 and UseCase2 at the same time. What would you suggest?
I tried to Google it, but all I got was the best ways to iterate a HashMap. :(
UseCase1 (value for a specific key): yes, a HashMap is the best structure for this.
UseCase2 (all values): because you want all values, it does not matter whether it is a hash map, a list or a tree; visiting every element is O(n) in each of them.
So I'm not sure what your question is, but you can easily find the time complexity of the different data structures if you Google for it:
http://en.wikipedia.org/wiki/Hash_table (see the right side big O notation)
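To make the two use cases concrete, here is a minimal sketch built on the idPriceMap from the question. The class and method names (IdPriceLookup, priceFor, totalPrice) are made up for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

class IdPriceLookup {
    // Same contents as the idPriceMap in the question.
    static Map<Integer, Integer> buildMap() {
        Map<Integer, Integer> idPriceMap = new HashMap<>();
        idPriceMap.put(10, 20);
        idPriceMap.put(11, 25);
        idPriceMap.put(12, 0);
        idPriceMap.put(13, 100);
        idPriceMap.put(14, 20);
        idPriceMap.put(15, 40);
        idPriceMap.put(16, 90);
        return idPriceMap;
    }

    // UseCase1: one value for a known key, expected O(1); null if the key is absent.
    static Integer priceFor(Map<Integer, Integer> idPriceMap, int id) {
        return idPriceMap.get(id);
    }

    // UseCase2: visit every value, which is O(n) whatever the container is.
    static int totalPrice(Map<Integer, Integer> idPriceMap) {
        int total = 0;
        for (int price : idPriceMap.values()) {
            total += price;
        }
        return total;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> idPriceMap = buildMap();
        System.out.println(priceFor(idPriceMap, 13));
        System.out.println(totalPrice(idPriceMap));
    }
}
```

values() returns a view of the map, so getting all the values does not copy anything by itself.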
If you have a mixed usage scenario, then there is no generic answer to your question. It depends on the frequency and distribution (scenario 1 or 2) of your requests, and the volatility of your map contents.
Recommendation: Just use the standard hash map. Then profile your application. Most of the time the bottleneck is not where you first expect it, and you can achieve performance gains with cheap changes in other places.
If the hash map really is the bottleneck:
Ask again, with the specific frequencies of your usages ;)
If you want to do "premature optimization" (see the good old c2 wiki: http://c2.com/cgi/wiki?PrematureOptimization), then just put your values in a separate array to speed up requests for the complete set of values a little. But if you have lots of modifications, your overall performance will degrade. It is the same trade-off as having a database index or not.
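The "separate array" idea could look roughly like the sketch below. It is only one possible reading of the suggestion: CachedValuesMap is a hypothetical class, and the cache is simply thrown away on every put, which is exactly why many modifications make it a bad deal.

```java
import java.util.HashMap;
import java.util.Map;

class CachedValuesMap {
    private final Map<Integer, Integer> map = new HashMap<>();
    private int[] cachedValues; // null means "stale, rebuild on next request"

    void put(int key, int value) {
        map.put(key, value);
        cachedValues = null; // every modification invalidates the cache
    }

    Integer get(int key) {
        return map.get(key);
    }

    // Returns all values; the array is rebuilt only after a modification.
    // Callers must treat the returned array as read-only.
    int[] values() {
        if (cachedValues == null) {
            cachedValues = new int[map.size()];
            int i = 0;
            for (int value : map.values()) {
                cachedValues[i++] = value;
            }
        }
        return cachedValues;
    }
}
```

With a read-mostly map the rebuild happens rarely; with frequent puts the array is rebuilt over and over, much like an index on a write-heavy table.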
Hope that was useful.
Related
I decided to revise the Java collections framework, so I started with the internal implementation. One question came to mind that I can't answer. I hope someone can give a clear explanation of the following.
ArrayList uses linear or binary search (both have pros and cons), and we can do anything with them! My question is: why do all the 'hashing' classes (HashMap, for example) use the hashing principle? Couldn't they settle for linear or binary search? Why not just store the key/value pairs inside an array? And conversely, why isn't ArrayList, for example, stored in a hash table?
The intention of the collections framework is that the programmer will choose the data structure appropriate to the use case. Depending on what you're using it for, different data structures are appropriate.
Hashing classes use the hashing principle, as you put it, because if you choose them, then that's what you want to use. (Hashing is generally the best choice for simple, straightforward lookups.) A screwdriver uses the screwing principle because if you pick up a screwdriver, you want to screw something in; if you had a nail you needed to put in, you would have picked up the hammer instead.
But if you're not going to be performing lookups, or if linear search is good enough for you, then an ArrayList is what you want. It's not worth adding a hash table to a collection that's never going to use it, and it costs CPU and memory to do things you aren't going to need.
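As a rough illustration of the difference, the sketch below does the same lookup both ways: a linear scan over key/value pairs kept in a list, and a hash lookup. The names (LookupStyles, findInList, findInMap) are invented for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LookupStyles {
    // Linear search over key/value pairs stored in a list: O(n) per lookup.
    static Integer findInList(List<Map.Entry<String, Integer>> pairs, String key) {
        for (Map.Entry<String, Integer> pair : pairs) {
            if (pair.getKey().equals(key)) {
                return pair.getValue();
            }
        }
        return null;
    }

    // Hash lookup: expected O(1) per lookup, at the cost of the hash table itself.
    static Integer findInMap(Map<String, Integer> map, String key) {
        return map.get(key);
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        pairs.add(Map.entry("apple", 3));
        pairs.add(Map.entry("pear", 5));

        Map<String, Integer> map = new HashMap<>();
        map.put("apple", 3);
        map.put("pear", 5);

        System.out.println(findInList(pairs, "pear"));
        System.out.println(findInMap(map, "pear"));
    }
}
```

For a handful of entries the list version is perfectly fine; the hash table only starts to pay for itself when lookups are frequent or the collection is large.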
I had a large hash map of values (about 1,500 entries). The nature of the code was that once the hash map was loaded it would never be altered. The hash map was accessed many times per web page, and I wondered if it could be sped up for faster page loading.
One day I had some time, so I did a series of timing tests (using the nano time function). I then reworked the code to use an array instead of the hash map. Not an ArrayList, but an actual array[]. I stored the index in the key class that had been used to get the hash value.
There was a difference: the array lookup was faster. I calculated that over a day's worth of activity I would have saved almost a full second!
So yes, using an array is faster than using a hash, YMMV :-)
And I reverted my code back to using a hashmap, as it was easier to maintain...
I'm building tree pagination in JSF 1.2 and RichFaces 3.3.2, because I have a lot of tree nodes (something like 80k) and it's slow.
So, as a first attempt, I created a HashMap from each page to the list of nodes on that page.
But the performance isn't good enough...
So I was wondering if there is something faster than a HashMap, maybe a List of Lists or something.
Does anyone have experience with this? What can I do?
Thanks in advance.
EDIT.
The big problem is that I have to validate user permissions on the child nodes of the tree. I know that this is the big problem: this validation is slow because I have to go inside the nodes, and I don't have a good way to know whether the user has permission on a 10th-level node without iterating over all of them. On top of this, the same tree is used in more places...
The basic reason I am doing this pagination is that the client side would otherwise be very slow: because of the structure generated by RichFaces, with a lot of tr's and td's, the browser just goes crazy.
So, unfortunately, I have to load all the nodes and paginate only for the client side, and I need to know which structure is faster to iterate...
Sorry for my bad English.
A hash map is the fastest data structure if you want to get all nodes for a page. The list of nodes can be fetched in constant time (O(1)), while searching a list for the page takes O(n) (n = number of pages; faster on sorted lists, but never getting near O(1)).
Which operations on your data structure are too slow? That's what you have to analyse before you start optimizing.
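For reference, a page-to-nodes map like the one described in the question could be built as in this sketch. NodePager and paginate are hypothetical names, and nodes are plain strings here rather than real tree nodes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class NodePager {
    // Splits the nodes into pages of pageSize, keyed by page number starting at 0.
    static Map<Integer, List<String>> paginate(List<String> nodes, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        Map<Integer, List<String>> pages = new HashMap<>();
        for (int start = 0; start < nodes.size(); start += pageSize) {
            int end = Math.min(start + pageSize, nodes.size());
            pages.put(start / pageSize, new ArrayList<>(nodes.subList(start, end)));
        }
        return pages;
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> pages = paginate(List.of("a", "b", "c", "d", "e"), 2);
        System.out.println(pages.get(2)); // the last, partially filled page
    }
}
```

Fetching a page is a single get(pageNumber). When page numbers are dense integers starting at 0, an ArrayList indexed by page number gives constant-time access as well, so the choice between the two is unlikely to be the bottleneck.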
It's probably more due to the fact that JSF is a performance pig than a data structure choice. The one attempt I've seen to create a JSF app could be timed with a sundial.
You're making a mistake by guessing about solutions without more knowledge about the root cause. I'd recommend that you profile your app to see where the time is being spent.
The data structure to use always depends on how you need to store the data and how you need to access it. HashMap<K, V> is supposed to have constant time complexity for accessing a value, given its key. When you call get(key), the hashCode() of key is computed and used to retrieve the related value. Unless you've got many different keys with the same hash code (in which case something may be wrong: although it is not mandatory, different objects should have different hash codes, at least in the majority of cases), this is usually fast.
Searching for an element in a plain list requires scanning the list, which will (almost) always be slower than computing a hash code.
If you need to associate values with keys, a Map is the way to go. And HashMap should be fast enough.
I don't know too much about JSF, but I think - if the data structure and access pattern is the one that a Map is designed for - the problem is not the HashMap itself.
I would solve this with a JavaScript/Ajax call that fetches the child nodes.
In an interview recently, I was asked about how HashMap works in Java and I was able to explain it well and explain that in the worst case the HashMap may degenerate into a list due to chaining. I was asked to figure out a way to improve this performance but I was unable to do that during the interview. The interviewer asked me to look up "Trove".
I believe he was pointing to this page. I have read the description provided on that page but still can't figure out how it overcomes the limitations of the java.util.HashMap.
Even a hint would be appreciated. Thanks!!
The key phrase there is open addressing. Instead of hashing to an array of buckets, all the entries are in one big array. When you add an element, if the space for it is already in use you just move down the array to find a free space.
As long as the array is kept sufficiently bigger than the number of entries and the hash function is well distributed it's possible to keep average lookup times small. And by having one array you can get better performance - it's more cache friendly.
However it still has worst-case linear behaviour if (say) every key hashes to the same value, so it doesn't avoid that issue.
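To show what "move down the array to find a free space" means, here is a toy open-addressing map with linear probing. It is not Trove's implementation, only a minimal sketch: int keys and values, a fixed capacity, no resizing and no removal. LinearProbingMap is an invented name.

```java
class LinearProbingMap {
    private final int[] keys;
    private final int[] values;
    private final boolean[] used;
    private int size;

    LinearProbingMap(int capacity) {
        keys = new int[capacity];
        values = new int[capacity];
        used = new boolean[capacity];
    }

    // Home slot for a key; floorMod keeps the index non-negative.
    private int indexFor(int key) {
        return Math.floorMod(Integer.hashCode(key), keys.length);
    }

    void put(int key, int value) {
        int i = indexFor(key);
        for (int probes = 0; probes < keys.length; probes++) {
            if (!used[i]) { // free slot: store the new entry here
                used[i] = true;
                keys[i] = key;
                values[i] = value;
                size++;
                return;
            }
            if (keys[i] == key) { // key already present: overwrite
                values[i] = value;
                return;
            }
            i = (i + 1) % keys.length; // slot taken by another key: try the next one
        }
        throw new IllegalStateException("table is full");
    }

    // Returns the value for key, or null if the key is absent.
    Integer get(int key) {
        int i = indexFor(key);
        for (int probes = 0; probes < keys.length; probes++) {
            if (!used[i]) {
                return null; // reached a free slot, so the key was never stored
            }
            if (keys[i] == key) {
                return values[i];
            }
            i = (i + 1) % keys.length;
        }
        return null;
    }

    int size() {
        return size;
    }
}
```

Note that there are no per-entry objects at all, just three parallel arrays. If every key lands on the same home slot, both loops degrade to a linear scan, which is the worst case mentioned above.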
It seems to me from the Trove page that there are two main differences that improve performance.
The first is the use of open addressing (http://en.wikipedia.org/wiki/Hash_table#Open_addressing). This doesn't avoid the collision issue, but it does mean that there's no need to create "Entry" objects for every item that goes in the map.
The second important difference is being able to provide your own hash function, which differs from the one provided by the class of the keys. So you could provide a much faster hash function if it made sense to do so.
One advantage of Trove is that it avoids object creation, especially for primitives.
For big hash tables on an embedded Java device this can be advantageous due to lower memory consumption.
The other advantage I saw is the use of custom hash codes / functions without the need to override hashCode(). For a specific data set, and for an expert in writing hash functions, this can be an advantage.
I hope that this question is specific enough to be deemed fit for StackOverflow. I checked the FAQ and I think this qualifies, since it is specific and related to programming.
I'm implementing a complex data mining algorithm (FP-growth) in Java. Some of the initial phases of the algorithm require me to scan a large database and keep a running count of each item type found. This seems perfectly suited to a HashBag. I found one in Apache Commons which seems to work for me.
So now, my HashBag is filled with [itemType, count] entries (pairs). Later on in the algorithm, I'm required to do a lot of list-like operations on these pairs. In some cases, I must sort the collection by itemType. In others, I must sort by count. This seems perfectly suited to a List interface.
I'm left with the conclusion that I must convert my HashBag to a List. Yet it feels dirty somehow, like a waste of space and time. Is there a smarter way to do this, or is it a common situation to have a programming problem where you must treat your collection differently at different times, and conversions are a necessary evil?
One alternative is to make my own interface which is truly a list, but allows "bag-style" adds. I'd have to keep the list sorted and perform binary searches with a custom comparator every time I wanted to add something. Building that collection would probably take longer than building a HashBag, but I'd save on the conversion step at the end. Any thoughts as to which is preferable?
Thanks!
If you used Guava's Multiset instead of Apache's Bag -- roughly analogous, but in a different style -- you can do most of this without converting. Multiset.entrySet() returns a Set<Entry<E>>, with Entry<E> effectively representing a pair of an element and a count -- that sounds like it's probably the best way to address your need to operate on the element-count pairs, maybe? You can iterate over that like you'd iterate over a Map.entrySet().
You can use Multisets.copyHighestCountFirst(Multiset) to get a multiset reordered in highest-frequency-first order, and use TreeMultiset to order by the elements directly.
(Disclosure: I contribute to Guava.)
I assume you're using the Apache Commons Collections HashBag class. Have you considered using TreeBag instead? It implements the same Bag interface but efficiently keeps the data sorted according to a comparator you provide.
That said, when you need to change sort order, there isn't usually any better answer than to copy the collection to a new one with a different comparator.
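Copying to a new collection with a different order is cheap to write. The sketch below uses only the JDK (a HashMap of counts instead of a Commons Bag or a Guava Multiset, which is a substitution made here to keep the example self-contained); ItemCounts and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ItemCounts {
    // Running count per item type: expected O(1) per item scanned.
    static Map<String, Integer> count(List<String> items) {
        Map<String, Integer> counts = new HashMap<>();
        for (String item : items) {
            counts.merge(item, 1, Integer::sum);
        }
        return counts;
    }

    // Copy of the [itemType, count] pairs, ordered by item type.
    static List<Map.Entry<String, Integer>> sortedByItem(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.comparingByKey());
        return entries;
    }

    // Copy of the [itemType, count] pairs, highest count first.
    static List<Map.Entry<String, Integer>> sortedByCountDescending(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return entries;
    }
}
```

Each sorted copy costs O(n log n) once, while the counting phase keeps its cheap hash inserts.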
"Yet it feels dirty somehow, like a waste of space and time. Is there a smarter way to do this, or is it a common situation to have a programming problem where you must treat your collection differently at different times, and conversions are a necessary evil?"
Sometimes it is necessary to convert between collection types. If it is necessary, then whether it is "dirty" or "inelegant" or "dumb" is not really relevant.
It can also be a mistake to over-think these things up front. The actual computational trade-offs are often difficult to grasp. For instance, if you changed the HashBag to a TreeBag, insertion goes from O(1) to O(logN) but you then avoid the overheads of sorting and copying. "Big Oh" analysis / thinking is not going to give you a clear answer. Indeed, the real performance is going to depend on the scaling factors, the values of N, the ratio of hits and misses in the bag and so on.
I would advise to try implementing things the obvious way, and see if it performs well enough ... and if not, profile it to see if the data structures are the main bottleneck. Then based on the profiling, and other measurements of the input datasets, figure out the best way to improve performance from your baseline implementation.
Answering my own question!
I did some experimenting with the different types of Multiset provided by the Guava library mentioned above by Louis Wasserman. In my particular test case, I'm parsing a 1GB XML file (database of books and authors) and creating a very large Multiset (keeping a count of how many times each author shows up in the DB). Once I reach the end of the parsing, I need to get a new Multiset which only contains the authors who showed up more than x times, where x is some threshold value. I also want my final set to be sorted by author name.
Here are two of the different ways I tried it (among others):
1) collect the original counts in a TreeMultiset and then remove any which don't meet the threshold
2) collect the original counts in a HashMultiset, and then create a new TreeMultiset to which I add each item from the hash set with a count that meets the threshold
The second way proved to be significantly faster (roughly 25%), despite the conversion and extra memory usage. Obviously a big part of this is that it is pretty inefficient to delete from binary trees.
So the clear conclusion here is that in this situation, conversion is a good move (unless you have memory constraints that won't allow it).
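The second approach, expressed with plain JDK maps rather than Guava multisets (a substitution for the sake of a self-contained example, so it says nothing about the 25% figure), might look like this. FrequentItems and moreFrequentThan are made-up names.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

class FrequentItems {
    // Counts in a hash map first, then copies only the entries that occur
    // more than threshold times into a map sorted by name.
    static SortedMap<String, Integer> moreFrequentThan(List<String> names, int threshold) {
        Map<String, Integer> counts = new HashMap<>();
        for (String name : names) {
            counts.merge(name, 1, Integer::sum);
        }
        SortedMap<String, Integer> frequent = new TreeMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > threshold) {
                frequent.put(entry.getKey(), entry.getValue());
            }
        }
        return frequent;
    }

    public static void main(String[] args) {
        List<String> names = List.of("carol", "alice", "carol", "bob", "alice", "carol");
        System.out.println(moreFrequentThan(names, 1));
    }
}
```

The sorted structure only ever receives the entries that survive the filter, so the tree stays small and nothing has to be deleted from it.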
Thanks again for turning me onto the Guava library, Louis!
Anyone have a good rule of thumb for choosing between different implementations of Java Collection interfaces like List, Map, or Set?
For example, generally why or in what cases would I prefer to use a Vector or an ArrayList, a Hashtable or a HashMap?
I really like this cheat sheet from Sergiy Kovalchuk's blog entry, but unfortunately it is offline. However, the Wayback Machine has a historical copy:
More detailed was Alexander Zagniotov's flowchart, which is also offline, so here too is a historical copy of the blog:
Excerpt from the blog on concerns raised in comments:
"This cheat sheet doesn't include rarely used classes like WeakHashMap, LinkedList, etc. because they are designed for very specific or exotic tasks and shouldn't be chosen in 99% cases."
I'll assume you know the difference between a List, Set and Map from the above answers. Why you would choose between their implementing classes is another thing. For example:
List:
ArrayList is quick on retrieving, but slow on inserting. It's good for an implementation that reads a lot but doesn't insert/remove a lot. It keeps its data in one contiguous block of memory, so every time it needs to expand, it copies the whole array.
LinkedList is slow on retrieving, but quick on inserting. It's good for an implementation that inserts/removes a lot but doesn't read a lot. It doesn't keep its elements in one contiguous block of memory.
Set:
HashSet doesn't guarantee any iteration order, and because it doesn't have to maintain one it is the fastest of the sets. It has more overhead than an ArrayList and is slower to iterate, so it only pays off for a large amount of data, when its hashing speed becomes a factor.
TreeSet keeps the data ordered, therefore is slower than HashSet.
Map: The performance and behavior of HashMap and TreeMap are parallel to the Set implementations.
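Vector and Hashtable should not be used. They are synchronized implementations that predate the new Collection hierarchy, and are thus slow. If synchronization is needed, use one of the wrappers in Collections, such as Collections.synchronizedCollection(), Collections.synchronizedList() or Collections.synchronizedMap().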
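A minimal sketch of the wrapper approach, using Collections.synchronizedList() around an ArrayList; the class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class SynchronizedWrapperDemo {
    // Several threads add to one wrapped ArrayList; returns the final size.
    static int addFromThreads(int threadCount, int addsPerThread) {
        List<Integer> shared = Collections.synchronizedList(new ArrayList<>());
        List<Thread> threads = new ArrayList<>();
        for (int t = 0; t < threadCount; t++) {
            Thread thread = new Thread(() -> {
                for (int i = 0; i < addsPerThread; i++) {
                    shared.add(i); // each add is synchronized by the wrapper
                }
            });
            threads.add(thread);
            thread.start();
        }
        for (Thread thread : threads) {
            try {
                thread.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return shared.size();
    }

    public static void main(String[] args) {
        System.out.println(addFromThreads(4, 1000));
    }
}
```

Individual calls are thread-safe, but iterating over a wrapped collection still has to be done inside a synchronized block on the wrapper, as its Javadoc points out.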
I've always made those decisions on a case by case basis, depending on the use case, such as:
Do I need the ordering to remain?
Will I have null key/values? Dups?
Will it be accessed by multiple threads?
Do I need a key/value pair?
Will I need random access?
And then I break out my handy 5th edition Java in a Nutshell and compare the ~20 or so options. It has nice little tables in Chapter five to help one figure out what is appropriate.
OK, maybe if I know off the cuff that a simple ArrayList or HashSet will do the trick I won't look it all up. ;) But if there is anything remotely complex about my intended use, you bet I'm in the book. BTW, I thought Vector was supposed to be 'old hat'; I've not used one in years.
Theoretically there are useful Big-Oh tradeoffs, but in practice these almost never matter.
In real-world benchmarks, ArrayList out-performs LinkedList even with big lists and with operations like "lots of insertions near the front." Academics ignore the fact that real algorithms have constant factors that can overwhelm the asymptotic curve. For example, linked-lists require an additional object allocation for every node, meaning slower to create a node and vastly worse memory-access characteristics.
My rule is:
Always start with ArrayList and HashSet and HashMap (i.e. not LinkedList or TreeMap).
Type declarations should always be an interface (i.e. List, Set, Map) so if a profiler or code review proves otherwise you can change the implementation without breaking anything.
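The second rule in practice, as a small sketch (InterfaceTyped and sum are hypothetical names): code written against List keeps working when the implementation behind it is swapped.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

class InterfaceTyped {
    // Depends only on the List interface, not on a concrete class.
    static int sum(List<Integer> numbers) {
        int total = 0;
        for (int n : numbers) {
            total += n;
        }
        return total;
    }

    public static void main(String[] args) {
        // Start with ArrayList; the declared type stays List.
        List<Integer> numbers = new ArrayList<>(List.of(1, 2, 3));
        System.out.println(sum(numbers));

        // If profiling ever says otherwise, only the constructor call changes.
        numbers = new LinkedList<>(numbers);
        System.out.println(sum(numbers));
    }
}
```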
About your first question...
List, Map and Set serve different purposes. I suggest reading about the Java Collections Framework at http://java.sun.com/docs/books/tutorial/collections/interfaces/index.html.
To be a bit more concrete:
use List if you need an array-like data structure and you need to iterate over the elements
use Map if you need something like a dictionary
use a Set if you only need to decide if something belongs to the set or not.
About your second question...
The main difference between Vector and ArrayList is that the former is synchronized, the latter is not synchronized. You can read more about synchronization in Java Concurrency in Practice.
The difference between Hashtable (note that the T is not a capital letter) and HashMap is similar: the former is synchronized, the latter is not.
I would say that there is no rule of thumb for preferring one implementation over another; it really depends on your needs.
For non-sorted collections the best choice, more than nine times out of ten, will be: ArrayList, HashMap, HashSet.
Vector and Hashtable are synchronised and therefore might be a bit slower. It's rare that you would want synchronised implementations, and when you do, their interfaces are not sufficiently rich for their synchronisation to be useful. In the case of Map, ConcurrentMap adds extra operations to make the interface useful. ConcurrentHashMap is a good implementation of ConcurrentMap.
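Why a richer interface matters: "read the count, add one, write it back" is a compound action, so synchronising get and put separately is not enough. The sketch below (HitCounter is an invented name) uses merge, which ConcurrentHashMap performs atomically.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class HitCounter {
    private final ConcurrentMap<String, Integer> hits = new ConcurrentHashMap<>();

    // Atomic "insert 1 or add 1": no external locking needed.
    void record(String page) {
        hits.merge(page, 1, Integer::sum);
    }

    int count(String page) {
        return hits.getOrDefault(page, 0);
    }

    // Hammers one key from several threads and returns the final count.
    static int countAfterConcurrentHits(int threadCount, int hitsPerThread) {
        HitCounter counter = new HitCounter();
        List<Thread> threads = new ArrayList<>();
        for (int t = 0; t < threadCount; t++) {
            Thread thread = new Thread(() -> {
                for (int i = 0; i < hitsPerThread; i++) {
                    counter.record("/index");
                }
            });
            threads.add(thread);
            thread.start();
        }
        for (Thread thread : threads) {
            try {
                thread.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return counter.count("/index");
    }
}
```

With a Hashtable the same counter would need an explicit synchronized block around the get and the put to avoid lost updates.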
LinkedList is almost never a good idea. Even if you are doing a lot of insertions and removal, if you are using an index to indicate position then that requires iterating through the list to find the correct node. ArrayList is almost always faster.
For Map and Set, the hash variants will be faster than tree/sorted. Hash algorithms tend to have O(1) performance, whereas trees will be O(log n).
Lists allow duplicate items, while Sets allow only one instance.
I'll use a Map whenever I need to perform a lookup.
For the specific implementations, there are order-preserving variations of Maps and Sets but largely it comes down to speed. I'll tend to use ArrayList for reasonably small Lists and HashSet for reasonably small sets, but there are many implementations (including any that you write yourself). HashMap is pretty common for Maps. Anything more than 'reasonably small' and you have to start worrying about memory so that'll be way more specific algorithmically.
This page has lots of animated images along with sample code testing LinkedList vs. ArrayList if you're interested in hard numbers.
EDIT: I hope the following links demonstrate how these things are really just items in a toolbox, you just have to think about what your needs are: See Commons-Collections versions of Map, List and Set.
Well, it depends on what you need. The general guidelines are:
List is a collection where data is kept in order of insertion and each element has an index.
Set is a bag of elements without duplication (if you reinsert the same element, it won't be added). The data has no notion of order.
Map: you access and write your data elements by their key, which can be any object.
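The three guidelines side by side, as a small sketch with hypothetical helper names; the same input words go into a List, a Set and a Map.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class CollectionKinds {
    // List: keeps insertion order and duplicates; elements are reachable by index.
    static List<String> toList(String... words) {
        return new ArrayList<>(Arrays.asList(words));
    }

    // Set: a repeated element is stored only once; no index, no order.
    static Set<String> toSet(String... words) {
        return new HashSet<>(Arrays.asList(words));
    }

    // Map: each word is the key under which its length is stored.
    static Map<String, Integer> lengths(String... words) {
        Map<String, Integer> lengths = new HashMap<>();
        for (String word : words) {
            lengths.put(word, word.length());
        }
        return lengths;
    }

    public static void main(String[] args) {
        System.out.println(toList("a", "bb", "a").get(2));
        System.out.println(toSet("a", "bb", "a").size());
        System.out.println(lengths("a", "bb", "a").get("bb"));
    }
}
```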
Attribution: https://stackoverflow.com/a/21974362/2811258
For more information about Java Collections, check out this article.
As suggested in other answers, the right collection depends on the use case. I am listing a few points:
ArrayList:
Most cases where you just need to store a "bunch of things" and later iterate through them. Iterating is fast because it is index-based.
An ArrayList is backed by an array with a fixed capacity; once that capacity is exceeded, it allocates a larger array and copies all the elements over.
LinkedList:
It uses a doubly linked list, so insertion and deletion are fast once the position is known, as they only add or remove a node.
Retrieving is slow as it will have to iterate through the nodes.
HashSet:
Making other yes-no decisions about an item, e.g. "is the item a word of English", "is the item in the database?" , "is the item in this category?" etc.
Remembering "which items you've already processed", e.g. when doing a web crawl;
HashMap:
Used in cases where you need to say "for a given X, what is the Y?" It is often useful for implementing in-memory caches or indexes, i.e. key-value pairs. For example:
For a given user ID, what is their cached name/User object?
Always go with HashMap to perform a lookup.
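Vector and Hashtable are synchronized and therefore a bit slower. If synchronization is needed, use one of the wrappers in Collections, such as Collections.synchronizedCollection(), Collections.synchronizedList() or Collections.synchronizedMap().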
Check This for sorted collections.
Hope this helped.
I found Bruce Eckel's Thinking in Java to be very helpful. He compares the different collections very well. I used to keep a diagram he published showing the inheritance hierarchy on my cube wall as a quick reference. One thing I suggest you do is keep thread safety in mind. Performance usually means not thread safe.
Use Map for key-value pairing
For key-value tracking, use a Map implementation.
For example, tracking which person is covering which day of the weekend. So we want to map a DayOfWeek object to an Employee object.
Map < DayOfWeek , Employee > weekendWorker =
Map.of(
DayOfWeek.SATURDAY , alice ,
DayOfWeek.SUNDAY , bob
)
;
When choosing one of the Map implementations, there are several aspects to consider. These include: concurrency, tolerance for NULL values in key and/or value, order when iterating keys, tracking by reference versus content, and convenience of literals syntax.
Here is a chart I made showing the various aspects of each of the ten Map implementations bundled with Java 11.
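Two of those aspects, iteration order and tolerance for a null key, can be probed directly. This is only a small sketch with invented helper names, covering a handful of the bundled implementations.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;

class MapAspects {
    // Inserts three keys into the given (empty) map and reports the iteration order.
    static List<String> keyOrder(Map<String, Integer> map) {
        map.put("pear", 1);
        map.put("apple", 2);
        map.put("fig", 3);
        return new ArrayList<>(map.keySet());
    }

    // Reports whether the given map accepts a null key.
    static boolean acceptsNullKey(Map<String, Integer> map) {
        try {
            map.put(null, 0);
            return true;
        } catch (NullPointerException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(keyOrder(new LinkedHashMap<>())); // insertion order
        System.out.println(keyOrder(new TreeMap<>()));       // sorted by key
        System.out.println(acceptsNullKey(new HashMap<>()));
        System.out.println(acceptsNullKey(new TreeMap<>()));
        System.out.println(acceptsNullKey(new ConcurrentHashMap<>()));
    }
}
```

HashMap is left out of the ordering check on purpose: its iteration order is unspecified, so no particular output should be relied on.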