There are several open-source cache implementations available in Java, such as Guava Cache and Caffeine. On the other hand, we can also use a plain Java HashMap for simple caching.
I found an article comparing benchmarks of different caches and concurrent hash maps, and HashMap seems to be the clear winner in those benchmarks.
My use case is to store around 10-20 thousand String key-value pairs. The cache will be updated (only updates and additions are allowed, no eviction) at regular intervals, say every 5 minutes.
Does it make any sense to use an open-source cache implementation, which provides more features, rather than sticking with a HashMap?
Edit 1: Answering @JayC667's questions
Will there be any evictions/removals from the HashMap?
There will be no evictions at all. However, some values can be updated.
What is the concurrency requirement?
A single thread is allowed to perform write operations, multiple threads are allowed to read from the cache, and stale reads are something I can live with.
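For context, here is a minimal sketch of the setup described above, assuming a ConcurrentHashMap with a single writer that refreshes entries on a schedule; the class and method names are illustrative, not from any library.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SimpleRefreshableCache {
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    // called by the single writer thread on its refresh schedule (e.g. every 5 minutes)
    public void refresh(Map<String, String> latest) {
        cache.putAll(latest); // only additions and updates, no eviction
    }

    // called by any number of reader threads; slightly stale reads are acceptable
    public String get(String key) {
        return cache.get(key);
    }
}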
Related
I'm writing a small system in Java in which I extract n-gram features from text files and later need to perform a Feature Selection process in order to select the most discriminative features.
The Feature Extraction process for a single file returns a Map which contains, for each unique feature, its occurrences in the file. I merge all the per-file Maps into one Map that contains the Document Frequency (DF) of all unique features extracted from all the files. The unified Map can contain over 10,000,000 entries.
Currently the Feature Extraction process is working great, and I want to perform Feature Selection, for which I need to implement Information Gain or Gain Ratio. I will have to sort the Map first, perform the computations and save the results, in order to finally get, for each feature, its Feature Selection score.
My question is:
What is the best practice and the best data structure to hold this large amount of data (~10M) and perform computations?
This is a very broad question, so the answer is going to be broad too. The solution depends on (at least) these three things:
The size of your entries
Storing 10,000,000 integers will require about 40MiB of memory, while storing 10,000,000 x 1KiB records will require more than 9GiB. These are two different problems. Ten million integers are trivial to store in memory in any stock Java collection, while keeping 9GiB in memory will force you to tweak and tune the Java Heap and garbage collector. If the entries are even larger, say 1MiB, then you can forget about in-memory storage entirely. Instead, you'll need to focus on finding a good disk backed data structure, maybe a database.
The hardware you're using
Storing ten million 1KiB records on a machine with 8 GiB of RAM is not the same as storing them on a server with 128 GiB. Things that are pretty much impossible with the former machine are trivial with the latter.
The type of computation(s) you want to do
You've mentioned sorting, so things like TreeMap or maybe PriorityQueue come to mind. But is that the most intensive computation? And what is the key you're using to sort them? Do you plan on locating (getting) entities based on other properties that aren't the key? If so, that requires separate planning. Otherwise you'd need to iterate over all ten million entries.
Do your computations run in a single thread or multiple threads? If you might have concurrent modifications of your data, that requires a separate solution. Data structures such as TreeMap and PriorityQueue would have to be either locked or replaced with concurrent structures such as ConcurrentLinkedHashMap or ConcurrentSkipListMap.
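If the sort is a one-shot ranking by score rather than an order that must be maintained continuously, a plain map plus a single sort of the entry list may be enough. A minimal sketch, with illustrative types (feature name to score):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class FeatureRanking {
    // sorts the features by score once, highest score first
    static List<Map.Entry<String, Double>> rank(Map<String, Double> scores) {
        List<Map.Entry<String, Double>> ranked = new ArrayList<>(scores.entrySet());
        ranked.sort(Map.Entry.<String, Double>comparingByValue(Comparator.reverseOrder()));
        return ranked;
    }
}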
You can use a caching system; check out MapDB. It's very efficient and has a tree map implementation (so you can have your data ordered without any effort). It also provides data stores to save your data to disk when it cannot be held in memory.
// a sample that uses off-heap (direct) memory to back the map; requires org.mapdb.DBMaker
Map<String, String> map = DBMaker.newMemoryDirectDB().make().getTreeMap("words");
// put some entries into the map
map.put("aa", "bb");
map.put("cc", "dd");
My intuition is that you could take inspiration from the initial MapReduce paradigm and partition your problem into several smaller but similar ones and then aggregate these partial results in order to reach the complete solution.
If you solve one smaller problem instance at a time (i.e. one file chunk), the space consumption penalty is bounded by the space requirements of that single instance.
This approach of processing the files lazily will work regardless of the data structure you choose.
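A minimal sketch of that idea, assuming a hypothetical extractFeatures(file) step standing in for the existing per-file extraction:

import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DocumentFrequencyBuilder {
    // merges per-file feature maps into document-frequency counts, one file at a time
    static Map<String, Integer> buildDf(List<Path> files) {
        Map<String, Integer> df = new HashMap<>();
        for (Path file : files) {
            Map<String, Integer> perFile = extractFeatures(file); // hypothetical per-file extraction
            // DF counts the number of files containing the feature, not total occurrences
            for (String feature : perFile.keySet()) {
                df.merge(feature, 1, Integer::sum);
            }
        }
        return df;
    }

    // placeholder standing in for the existing extraction step
    static Map<String, Integer> extractFeatures(Path file) {
        return new HashMap<>();
    }
}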
I was doing some tests with a colleague, and we were pulling data in from a database (about 350,000 records), converting each record into an object and a key object, and then populating them into an ImmutableMap.Builder.
When we called the build() method it took forever, probably due to all the data integrity checks that come with ImmutableMap (duplicate keys, nulls, etc.). To be fair, we tried to use a HashMap as well, and that took a while, but not as long as the ImmutableMap. We finally settled on just using a ConcurrentHashMap, which we populated with 9 threads as the records were iterated, and wrapped that in an unmodifiable map. The performance was good.
I noticed the documentation says ImmutableMap is not optimized for equals() operations. As a die-hard immutable-ist, I'd like ImmutableMap to work for large data volumes, but I'm getting the sense it is not meant for that. Is that assumption right? Is it optimized only for small/medium-sized data sets? Is there a hidden implementation I need to invoke via copyOf() or something?
I guess your key.equals() is a time-consuming method.
key.equals() will be called many more times in ImmutableMap.build() than in HashMap.put() (called in a loop), while key.hashCode() is called the same number of times in both HashMap.put() and ImmutableMap.build(). As a result, if key.equals() takes a long time, the overall durations can differ a lot.
key.equals() is called only a few times during HashMap.put() (a good hash algorithm leads to few collisions).
In the case of ImmutableMap.build(), however, key.equals() is called many times inside checkNoConflictInBucket(): O(n) calls to key.equals().
Once the map is built, the two types of Map should not differ much on access, as both are hash-based.
Sample:
There are 10,000 random Strings to put as keys: HashMap.put() calls String.equals() 2 times, while ImmutableMap.build() calls it 3,000 times.
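To see the effect on your own keys, a rough timing sketch (not a rigorous benchmark; use JMH for real measurements) that builds both map types from the same 10,000 random String keys might look like this:

import com.google.common.collect.ImmutableMap;
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

public class BuildTimingSketch {
    public static void main(String[] args) {
        int n = 10_000;
        String[] keys = new String[n];
        for (int i = 0; i < n; i++) keys[i] = UUID.randomUUID().toString();

        long t0 = System.nanoTime();
        Map<String, Integer> hashMap = new HashMap<>();
        for (int i = 0; i < n; i++) hashMap.put(keys[i], i);
        long hashMapNanos = System.nanoTime() - t0;

        long t1 = System.nanoTime();
        ImmutableMap.Builder<String, Integer> builder = ImmutableMap.builder();
        for (int i = 0; i < n; i++) builder.put(keys[i], i);
        Map<String, Integer> immutableMap = builder.build(); // duplicate-key checks happen here
        long immutableNanos = System.nanoTime() - t1;

        System.out.printf("HashMap: %d ms, ImmutableMap: %d ms%n",
                hashMapNanos / 1_000_000, immutableNanos / 1_000_000);
    }
}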
My experience is that none of Java's built-in Collection classes are really optimised for performance at huge volumes. For example, HashMap uses simple iteration once hashCode has been used as an index into the array, comparing the key via equals to each item with the same hash. If you are going to store several million items in the map, then you need a very well-designed hash and a large capacity. These classes are designed to be as generic and safe as possible.
So performance optimisations to try if you wish to stick with the standard Java HashMap:
Make sure your hash function provides as close to an even distribution as possible. Many domains have skewed values, and your hash needs to take account of this.
When you have a lot of data, the HashMap will be resized many times. Ideally, set the initial capacity as close as possible to the final size (see the sketch after these tips).
Make sure your equals implementation is as efficient as possible.
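For instance, a minimal sketch of the second tip; the entry count and load factor here are illustrative:

import java.util.HashMap;
import java.util.Map;

public class PreSizedMap {
    public static void main(String[] args) {
        int expectedEntries = 10_000_000;   // illustrative volume
        float loadFactor = 0.75f;           // HashMap's default
        // choose a capacity large enough that the table never resizes while loading
        int initialCapacity = (int) (expectedEntries / loadFactor) + 1;

        Map<String, Integer> counts = new HashMap<>(initialCapacity, loadFactor);
        counts.put("example-feature", 42);  // illustrative entry
    }
}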
There are massive performance optimisations you can apply if you know, for example, that your key is an integer: using some form of B-tree after the hash has been applied, or using == rather than equals.
So the simple answer is that I believe you will need to write your own collection to get the performance you want or use one of the more optimised implementations available.
It probably has been asked before, but I come across this situation time and time again: I want to store a very small amount of properties that I am absolutely certain will never ever exceed, say, 20 keys. It seems a complete waste of CPU and memory to use a HashMap, with all its overhead, and with the bad performance of calculating an advanced hash value for each key lookup, when there are only < 20 keys (probably more like 5 most of the time). I am absolutely certain that calculating a hash value takes a hundred times more time than just iterating and comparing... no?
There is this talk about premature optimization, but I don't totally agree here. I am on Android mostly, and any CPU/memory saved means more juice for other stuff. I'm not necessarily talking about the consumer market here. The use case here is very well-defined and doesn't change much; furthermore, it would be trivial to replace a very cheap map with a HashMap in case (something that will never happen) there is suddenly a very large number of new keys.
So, my question is: which is the very cheapest, most basic Map I can use in Java?
To your whole first paragraph: no! There won't be a dramatic memory overhead since, as far as I know, a HashMap is initialized with 16 buckets and then doubles its size each time it rehashes, so in the worst case you would have 12 unused buckets in your map, which is no big deal.
Concerning the lookup time, it is constant and equivalent to the time of accessing an element of an array, which is generally better than looping over O(n) elements (even if n < 20). The only drawback of HashMap is that it is unsorted, but as far as I am concerned, I consider it the default Map implementation in Java when I have no particular requirement about the order.
To conclude: use HashMap!
If you worry about hashCode() computation time on your keys, consider caching the computed values, as java.lang.String does, for example. See the question "How does caching hashCode work in Java, as suggested by Joshua Bloch in Effective Java?" for more on that.
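A minimal sketch of that caching idea for an immutable key class, mirroring roughly what java.lang.String does (the class itself is illustrative):

public final class CachedKey {
    private final String value;
    private int hash; // 0 means "not computed yet", the same convention String uses

    public CachedKey(String value) {
        this.value = value;
    }

    @Override
    public int hashCode() {
        int h = hash;
        if (h == 0) {
            h = value.hashCode();
            hash = h; // benign data race: recomputation always yields the same value
        }
        return h;
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof CachedKey && value.equals(((CachedKey) other).value);
    }
}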
Caveat: I suggest you take the cautions about premature optimization seriously. For most programmers in most apps, I seriously doubt you need to worry about the performance of your Map. More important is to consider your needs for concurrency, iteration order, and nulls. But since you asked, here is my specific answer.
EnumMap
If your keys are enums, then your very fastest Map implementation will be EnumMap.
Backed internally by an array indexed by the enum constants' ordinals, an EnumMap is very fast to execute while using very little memory.
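For example, a small illustrative sketch (the enum is made up):

import java.util.EnumMap;
import java.util.Map;

public class EnumMapExample {
    enum Status { NEW, ACTIVE, CLOSED }  // illustrative enum

    public static void main(String[] args) {
        Map<Status, String> labels = new EnumMap<>(Status.class);
        labels.put(Status.NEW, "just created");
        labels.put(Status.ACTIVE, "in progress");
        System.out.println(labels.get(Status.ACTIVE)); // array lookup by ordinal, no hashing of keys
    }
}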
IdentityHashMap
If you are really so concerned about performance, then consider using IdentityHashMap.
This implementation of Map uses reference-equality rather than object-equality. While there is still a hash value involved, it is the key's identity hash code, roughly a hash of the object's address in memory (so to speak; we do not have direct memory access in Java). So the possibly lengthy call to each key object's own hashCode method is avoided entirely, and performance may be better than with a HashMap. You will see constant-time performance for the basic operations (get and put).
Study the documentation carefully to see if you want to take this route. Note the discussion about linear-probe versus chaining for better performance. Be aware that this class partially breaks the Map contract which mandates the use of the equals method when comparing objects. And this map does not provide concurrency.
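A tiny sketch of the reference-equality behaviour that breaks the usual equals-based contract:

import java.util.IdentityHashMap;
import java.util.Map;

public class IdentityHashMapExample {
    public static void main(String[] args) {
        Map<String, Integer> map = new IdentityHashMap<>();
        String a = new String("key");
        String b = new String("key"); // equal via equals(), but a distinct reference
        map.put(a, 1);
        map.put(b, 2);
        System.out.println(map.size()); // 2 -- a HashMap would report 1
    }
}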
Here is a table I made to help compare the various Map implementations bundled with Java 11.
In an application we are using an LRU (Least Recently Used) cache (a ConcurrentHashMap-based implementation) with a max-size constraint. I'm wondering whether I could improve the performance of this cache. The following are a few alternatives I found on the net.
Using the Google Guava library (since my implementation uses LRU, I don't see any benefit from the Guava library).
If I wrap the objects in soft references and store them as values in an LRU map (without any size constraint), will I see any benefit? (This is not an ideal way of caching; after a major GC run, all the soft references may be cleared. A rough sketch of this idea appears after these alternatives.)
How about using a hybrid pool which is a combination of an LRU map and a soft-reference map? (The idea is that whenever an object is pruned from the LRU map, it is stored in a soft-reference map. With this approach we can keep more objects in the cache, but it might be time-consuming.)
Are there any other methods to improve the performance of cache?
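For reference, a minimal sketch of the soft-reference wrapping idea from the second alternative above (names are illustrative; see the answer below before adopting it):

import java.lang.ref.SoftReference;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class SoftValueCache<K, V> {
    private final ConcurrentMap<K, SoftReference<V>> map = new ConcurrentHashMap<>();

    public void put(K key, V value) {
        map.put(key, new SoftReference<>(value));
    }

    public V get(K key) {
        SoftReference<V> ref = map.get(key);
        if (ref == null) return null;
        V value = ref.get();
        if (value == null) map.remove(key, ref); // drop the entry once the GC has cleared it
        return value;
    }
}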
First of all, welcome to the club of cache implementers and improvers!
Don't use LRU. There are a lot of algorithms that are better than LRU and that are by now more than 10 years old. As a start, read these relevant research papers:
Wikipedia: LIRS Caching Algorithm
Wikipedia: Adaptive Replacement Cache
Within these papers you will also find more basic papers about the idea of adaptive caching (e.g. 2Q, LRFU, LRU-K).
Wrapping objects: it depends on what you want to achieve. Actually, you have at least three additional objects per cache entry: the hash table entry, the reference object, and the cache entry object. With this approach you increase the memory footprint, and if you have low efficiency, e.g. because of short expiry, you get a lot of GC thrashing.
Adapt to available memory: if you want to adapt to the available memory, it is better to evict entries when memory becomes low. This way you evict entries that are used very seldom, instead of evicting at random. However, this approach requires more coding. EHCache with Auto Resource Control has implemented something like this.
The reference wrappers are a nice and easy way to use more memory for the cache while avoiding low-heap conditions, but they are not high performance in terms of overall memory efficiency and access times.
Measure it! It depends heavily on the usage scenario whether you get a "performance improvement" or not. Ideally you need to know and consider the access pattern, the cached object sizes, expiry constraints and the expected parallelism. I put together a benchmark suite, which you can find on GitHub as the cache2k benchmarks.
However, for now, these benchmarks just focus on replacement-policy efficiency and access times. Comparison of memory overhead and possible parallelism is missing; this will be added within roughly half a year. The benchmark results are available on the cache2k benchmark results page.
If you are generally interested in the topic and do research in the field, consider contributing to cache2k. I am especially happy about more benchmarking code, or descriptions of usage scenarios and traces of access patterns, to optimize and improve the replacement algorithm.
I've been researching to find a faster alternative to a list. In an algorithms book, a hash table seems to be the fastest, using separate chaining. Then I found that Java has a Hashtable implementation, and from what I read it seems it uses separate chaining. However, there is the overhead of synchronization, so HashMap is suggested as a faster alternative to Hashtable.
My questions are:
1. Is Java HashMap the fastest data structure implemented in Java for insert/delete/search?
2. While reading, a few posts had concerns about the memory usage of HashMap. One post mentioned that an empty HashMap occupies 300 bytes. Is Hashtable more memory efficient than HashMap?
3. Also, is the hash function in each the most efficient for Strings?
There is too much context missing to be able to answer the question, which suggests to me that you should use the simplest option and not worry about performance until you have measured that you have a problem.
Is Java HashMap the fastest data structure implemented in Java for insert/delete/search?
ArrayList is significantly faster than HashMap, depending on what you need it for. I have seen people use Maps when they should have used objects; in such cases a custom class instance can be 10 times faster and smaller.
While reading, a few posts had concerns about the memory usage of HashMap. One post mentioned that an empty HashMap occupies 300 bytes.
Unless you know that 300 bytes (which costs less than what you would be paid on minimum wage to blink) matters, I would assume it doesn't.
Is Hashtable more memory efficient than HashMap?
It can be, but not enough to matter. Hashtable starts with a smaller size by default; if you create a HashMap with a smaller capacity, it will be smaller too.
Also, is the hash function in each the most efficient for strings?
In the general case it is efficient enough. In rare cases you may want to change the strategy, e.g. to prevent denial-of-service attacks. If you really care about memory efficiency and performance, perhaps you shouldn't be using String in the first place.
HashMap (or, more likely, HashSet) is probably a good place to start at this point. It's not perfect, and it does consume more memory than e.g. a list, but it should be your default when you need fast add, remove, and contains operations. The String.hashCode() implementation is not the best hash function, though it is fast and good enough for most purposes.
The access time of HashMap (and Hashtable as well, I believe) is O(1), since the internal bucket placement of a given value during put() is determined by computing (hash of the value's key) % (total number of buckets). This O(1) is the average access time; if, however, many keys hash to the same bucket, the access time tends towards O(n), because all those values are placed into the same bucket and it grows in a linked-list fashion.
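A minimal sketch of that bucket-placement idea; note that the real HashMap additionally spreads the hash bits and uses (n - 1) & hash because its table length is a power of two:

public class BucketIndexSketch {
    // conceptual bucket index: hash of the key modulo the number of buckets
    static int bucketIndexOf(Object key, int numberOfBuckets) {
        int hash = key.hashCode();
        return (hash & 0x7fffffff) % numberOfBuckets; // mask the sign bit to avoid a negative index
    }

    public static void main(String[] args) {
        System.out.println(bucketIndexOf("some key", 16)); // a value in [0, 16)
    }
}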
As you said, considering the overhead of synchronization inside Hashtable, I would probably opt for HashMap. Besides, you can fine-tune HashMap by setting its various parameters, like the load factor, which offers a means of memory optimization. I vote for HashMap...
As you've pointed out, Hashtable is fully synchronized, so it depends on your environment. If you have many threads, then ConcurrentHashMap will be a better solution. However, you can also look at Trove4J; maybe it will better suit your needs. Trove uses chained hashing similar to hash tables.
1. HashMap is only one of the fastest data structures in Java for insert/delete/search; HashSet is as fast as HashMap for insert/delete/search, and ArrayList is as fast as HashMap when inserting an element at the end.
2. Hashtable is not more memory efficient than HashMap; both are implemented with separate chaining.
3. The hash functions of the two data structures are essentially the same, as both depend on the key's hashCode(); if you need a different one, override hashCode() on your key class so that it best fits your application.
As others pointed out, a set would be a good replacement for a list, but don't forget that lists allow duplicate elements while sets do not; so, although certain operations (e.g. checking existence) are faster, sets and lists represent solutions to different problems.
As a start I recommend HashSet, or TreeSet in case ordering is important. A HashMap maps keys to values, which is different. Refer to this discussion to understand the differences between HashMap and Hashtable. I personally haven't used a Hashtable since 2007.
Finally, if you don't mind using a third-party library, I highly recommend taking a look at the Guava immutable collections. Immutability automatically provides thread safety and easier-to-understand programs.
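For example, a small sketch using Guava's immutable collections (the values are illustrative):

import com.google.common.collect.ImmutableSet;
import java.util.Arrays;
import java.util.List;

public class GuavaImmutableExample {
    public static void main(String[] args) {
        List<String> raw = Arrays.asList("a", "b", "b", "c");     // duplicates collapse in a set
        ImmutableSet<String> features = ImmutableSet.copyOf(raw); // safe to share across threads
        System.out.println(features);                             // [a, b, c]
    }
}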
EDIT: Regarding efficiency concerns, this is a moot point. As a guideline, use the data structure (as in the abstract concept of a data structure) that best fits your problem and choose the vanilla implementation available. If you can prove you have a performance problem in your code, you might start thinking about using something 'more efficient'. That's in quotes because it's a very loose definition: are we talking about memory efficiency, computing-time efficiency, garbage-collection efficiency, etc.? Never forget the rules of code optimization.