I swear that in the past I had read about a hashmap implementation that used some kind of caching, but when I read up today on how hashmaps are implemented in Java, it was simply a table of linked lists. Let me explain what I mean.
From what I read today, a HashMap in Java essentially works like this:
There is an array simply called "table", indexed by hash code. Each slot of the array holds the first element of the linked list for that hash code.
When you retrieve an object from the HashMap by key, the key is turned into a hash code, which is used as the index into "table"; the linked list at that slot is then iterated to find the entry corresponding to the key.
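To make that concrete, here is a heavily simplified sketch of the structure just described (illustrative only; the class and field names are made up and this is not the actual java.util.HashMap source):

import java.util.Objects;

class TinyHashMap<K, V> {
    static class Entry<K, V> {
        final K key;
        V value;
        Entry<K, V> next;
        Entry(K key, V value, Entry<K, V> next) { this.key = key; this.value = value; this.next = next; }
    }

    @SuppressWarnings("unchecked")
    Entry<K, V>[] table = (Entry<K, V>[]) new Entry[16];

    V get(Object key) {
        int hash = Objects.hashCode(key);
        int index = Math.floorMod(hash, table.length);   // reduce the hash code to a valid table slot
        for (Entry<K, V> e = table[index]; e != null; e = e.next) {
            if (Objects.equals(e.key, key)) {
                return e.value;                          // walk the bucket's list until the key matches
            }
        }
        return null;                                     // key not present in this bucket
    }
}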
But what I had read before was something different. What I remembered was that when you retrieve an object from the hashmap by key, the corresponding bucket is cached, so that when you retrieve another object from the same bucket you are served from that cache; when you retrieve an object from a different bucket, the other bucket is cached instead.
Did I just completely misunderstand something in the past and invent something in my head, or is there something like this that might have confused me?
Never heard of that.
First of all: the "table" array only has a certain size, so the hash code is not used directly as the index. From the HashMap source: tab[i = (n - 1) & hash]
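For illustration, a tiny made-up example of that masking (it ignores the extra bit-spreading HashMap applies to the raw hashCode() before indexing):

public class IndexMaskDemo {
    public static void main(String[] args) {
        int hash = "someKey".hashCode();
        int n = 16;                         // table length (always a power of two in HashMap)
        int index = (n - 1) & hash;         // keeps only the low-order bits, so 0 <= index < n
        System.out.println(hash + " -> bucket " + index + " of " + n);
    }
}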
Maybe you are mixing it up with LinkedHashMap, which keeps track of element accesses:
void afterNodeAccess(Node<K,V> e) { // move node to last
    LinkedHashMap.Entry<K,V> last;
    if (accessOrder && (last = tail) != e) {
        LinkedHashMap.Entry<K,V> p =
            (LinkedHashMap.Entry<K,V>)e, b = p.before, a = p.after;
        p.after = null;
        if (b == null)
            head = a;
        else
            b.after = a;
        if (a != null)
            a.before = b;
        else
            last = b;
        if (last == null)
            head = p;
        else {
            p.before = last;
            last.after = p;
        }
        tail = p;
        ++modCount;
    }
}
This behavior is useful if you are implementing an LRU (least recently used) cache. An LRU cache removes the elements that have not been requested for the longest time once the cache has reached its maximum size.
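For reference, this access-ordered behavior is exactly what lets you build an LRU cache on top of LinkedHashMap with almost no code; the class name and capacity below are just example values:

import java.util.LinkedHashMap;
import java.util.Map;

public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public LruCache(int maxEntries) {
        super(16, 0.75f, true);             // accessOrder = true: get() moves an entry to the end
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;         // evict the least recently accessed entry on overflow
    }
}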
Related
I'm trying to implement a separate-chaining hash map in Java. Inside the put() method I want to re-hash the map if the load factor (nr-of-elements / size-of-array) gets too large. For this I have written another method rehash() that rehashes the map by doubling the size of the array/capacity and then adding all the entries again (at least this is what I want it to do). The problem is that when I test it I get a "java.lang.OutOfMemoryError: Java heap space", and I'm guessing this is because I'm calling the put() method inside the rehash() method as well. The problem is that I don't really know how to fix this. I wonder if someone can check my code and give me feedback or give me a hint on how to proceed.
The Entry<K,V> in the code below is a nested private class in the hash map class.
Thanks in advance!
The put()-method:
public V put(K key, V value) {
    int idx = key.hashCode() % capacity;  // Calculate index based on hash code.
    if (idx < 0) {
        idx += this.capacity;             // If index is less than 0, add the length of the array table.
    }
    if (table[idx] == null) {             // If the list at idx is empty, just add the Entry node.
        table[idx] = new Entry<K,V>(key, value);
        nr_of_keys += 1;
        if (this.load() >= this.load_factor) { // Check if load factor is greater than maximum load. If so, rehash.
            rehash();
        }
        return null;
    } else {
        Entry<K,V> p = table[idx];        // dummy pointer
        while (p.next != null) {          // While the next node isn't null, move the pointer forward.
            if (p.getKey().equals(key)) {            // If the key matches:
                if (!p.getValue().equals(value)) {   // If the values don't match, replace the old value.
                    V oldVal = p.getValue();
                    p.setValue(value);
                    return oldVal;
                }
            } else {
                p = p.next;
            }
        }
        if (p.getKey().equals(key)) {     // If the key of the last node matches the given key:
            if (!p.getValue().equals(value)) {
                V oldVal = p.getValue();
                p.setValue(value);
                return oldVal;
            } else {
                return null;
            }
        }
        p.next = new Entry<K,V>(key, value); // Key doesn't exist, so add (key, value) at the end of the list.
        nr_of_keys += 1;
        if (this.load() >= this.load_factor) { // If the load is too large, rehash.
            rehash();
        }
        return null;
    }
}
The rehash()-method:
public void rehash() {
    Entry<K,V>[] tmp = table;            // create temporary table
    int old_capacity = this.capacity;    // store old capacity/length of array
    this.capacity = 2 * capacity;        // new capacity is twice as large
    this.nr_of_keys = 0;                 // reset nr. of keys to zero
    table = (Entry<K, V>[]) new Entry[capacity]; // make this.table twice as large
    for (int i = 0; i < old_capacity; i++) {     // go through the array
        Entry<K,V> p = tmp[i];           // points to first element of list at position i
        while (p != null) {
            put(p.getKey(), p.getValue());
            p = p.next;
        }
    }
}
The load()-method:
public double load() {
    return ((double) this.size()) / ((double) this.capacity);
}
where size() returns the number of (key,value) pairs in the map and capacity is the size of the array table (where the linked lists are stored).
Once you rehash your map, nothing will be the same: the buckets, the entry chains, etc.
So:
Create your temporary table.
Get the values normally, using your current get methods.
Then create new buckets based on rehashing to the new bucket size, with the new capacity, and add them to the table (DO NOT USE PUT).
Then replace the existing table with the one you just created. Make certain that everything pertinent to the new table size is also changed, such as bucket selection methods based on thresholds, capacity, etc.
Finally, use print statements to track the new buckets and the movement of items between buckets.
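As a rough sketch of those steps (it reuses the Entry fields and the index calculation from the code in the question, so treat it as a starting point rather than a drop-in fix):

@SuppressWarnings("unchecked")
public void rehash() {
    Entry<K, V>[] old = table;
    this.capacity = 2 * capacity;                  // double the capacity
    table = (Entry<K, V>[]) new Entry[capacity];   // nr_of_keys stays the same: nothing is added or lost
    for (Entry<K, V> first : old) {
        Entry<K, V> p = first;
        while (p != null) {
            Entry<K, V> next = p.next;             // remember the rest of the old chain
            int idx = p.getKey().hashCode() % capacity;
            if (idx < 0) {
                idx += capacity;                   // same index fix-up as in put()
            }
            p.next = table[idx];                   // relink the existing node into its new bucket
            table[idx] = p;
            p = next;
        }
    }
}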
You have added rehash(), but the load() implementation is still missing (or, inside load(), the size()).
The pattern looks clear though, and allows a guess while waiting for this additional info.
You tell us that when the load factor reaches a certain point inside a put, you rehash. That rehash doubles the internal array and calls put again. And in the end you run out of memory.
My bet would be that there is some subtle or not-so-subtle recursion taking place: you put, it rehashes by doubling the array, then re-puts, which somehow triggers another rehash...
A first possibility would be that some internal variables tracking the array's state are not properly reset (e.g. the number of occupied entries). Confusing the "old" array data with that of the new one being built would be a likely culprit.
Another possibility is in your put implementation, but that would require a step-by-step debug, which I'd advise you to perform.
I am trying to implement my own LRU cache. Yes, I know that Java provides a LinkedHashMap for this purpose, but I am trying to implement it using basic data structures.
From reading about this topic, I understand that I need a HashMap for O(1) lookup of a key and a linked list for management of the "least recently used" eviction policy. I found these references that all use a standard library hashmap but implement their own linked list:
"What data structures are commonly used for LRU caches and quickly
locating objects?" (stackoverflow.com)
"What is the best way to Implement a LRU Cache?" (quora.com)
"Implement a LRU Cache in C++" (uml.edu)
"LRU Cache (Java)" (programcreek.com)
The hash table is supposed to store a linked list node directly as its value. My cache should store Integer keys and String values.
However, in Java the LinkedList collection does not expose its internal nodes, so I can't store them inside the HashMap. I could instead have the HashMap store indices into the LinkedList, but then getting to an item would require O(N) time. So I tried to store a ListIterator instead.
import java.util.Map;
import java.util.HashMap;
import java.util.List;
import java.util.LinkedList;
import java.util.ListIterator;

public class LRUCache {

    private static final int DEFAULT_MAX_CAPACITY = 10;

    protected Map<Integer, ListIterator> _map = new HashMap<Integer, ListIterator>();
    protected LinkedList<String> _list = new LinkedList<String>();
    protected int _size = 0;
    protected int _maxCapacity = 0;

    public LRUCache(int maxCapacity) {
        _maxCapacity = maxCapacity;
    }

    // Put the key, value pair into the LRU cache.
    // The value is placed at the head of the linked list.
    public void put(int key, String value) {
        // Check to see if the key is already in the cache.
        ListIterator iter = _map.get(key);
        if (iter != null) {
            // Key already exists, so remove it from the list.
            iter.remove(); // Problem 1: ConcurrentModificationException!
        }

        // Add the new value to the front of the list.
        _list.addFirst(value);
        _map.put(key, _list.listIterator(0));
        _size++;

        // Check if we have exceeded the capacity.
        if (_size > _maxCapacity) {
            // Remove the least recently used item from the tail of the list.
            _list.removeLast();
        }
    }

    // Get the value associated with the key.
    // Move value to the head of the linked list.
    public String get(int key) {
        String result = null;
        ListIterator iter = _map.get(key);
        if (iter != null) {
            //result = iter
            // Problem 2: HOW DO I GET THE STRING FROM THE ITERATOR?
        }
        return result;
    }

    public static void main(String argv[]) throws Exception {
        LRUCache lruCache = new LRUCache(10);
        lruCache.put(10, "This");
        lruCache.put(20, "is");
        lruCache.put(30, "a");
        lruCache.put(40, "test");
        lruCache.put(30, "some"); // Causes ConcurrentModificationException
    }
}
So this leads to three problems:
Problem 1: I am getting a ConcurrentModificationException when I update the LinkedList using the iterator that I store in the HashMap.
Exception in thread "main" java.util.ConcurrentModificationException
at java.util.LinkedList$ListItr.checkForComodification(LinkedList.java:953)
at java.util.LinkedList$ListItr.remove(LinkedList.java:919)
at LRUCache.put(LRUCache.java:31)
at LRUCache.main(LRUCache.java:71)
Problem 2. How do I retrieve the value pointed to by the ListIterator? It seems I can only retrieve the next() value.
Problem 3. Is there any way to implement this LRU cache using the Java collections LinkedList, or do I really have to implement my own linked list?
1) This isn't really what Iterators are for.
By contract, if you modify the list without using the iterator -- as you do here
_list.addFirst(value);
then ALL OPEN ITERATORS on that list should throw ConcurrentModificationException. They were open to a version of the list that no longer exists.
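A minimal reproduction of that contract (standalone demo, hypothetical class name):

import java.util.LinkedList;
import java.util.ListIterator;

public class ComodificationDemo {
    public static void main(String[] args) {
        LinkedList<String> list = new LinkedList<>();
        list.add("a");
        ListIterator<String> iter = list.listIterator();
        iter.next();            // position the iterator on "a"
        list.addFirst("b");     // structural change made without going through the iterator
        iter.remove();          // throws ConcurrentModificationException
    }
}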
2) A LinkedList is not, exactly, a linked list of nodes. It's a java.util.List, whose backing implementation is a doubly linked list of nodes. That List contract is why it does not expose references to the backing implementation -- so operations like "remove this node, as a node, and move it to the head" are no good. This encapsulation is for your own protection (same as the concurrent mod exception) -- it allows your code to rely on the List semantics of a LinkedList (iterability, for example) without worry that some joker two cubes away was hacking away at its innards and broke the contract.
3) What you really need here is NOT a LinkedList. What you need is a stack that allows you to move any arbitrary entry to the head and dump the tail. You are implying that you want fast seek time to an arbitrary entry, a fast remove, a fast add, AND you want to be able to find the tail at any moment in case you need to remove it.
Fast seek time == HashSomething
Fast add/remove of arbitrary elements == LinkedSomething
Fast addressing of the final element == SomekindaList
4) You're going to need to build your own linking structure...or use a LinkedHashMap.
PS LinkedHashSet is cheating, it's implemented using a LinkedHashMap.
I'll deal with problem 3 first:
As you point out in your question, LinkedList (like all well designed generic collections) hides the details of the implementation, such as the nodes containing the links. In your case you need your hash map to reference these links directly as the values of the map. To do otherwise (e.g. adding indirection through a third class) would defeat the purpose of an LRU cache, which is to allow very low overhead on value access. But this is not possible with the standard Java collections: they don't (and shouldn't) provide direct access to internal structures.
So the logical conclusion of this is that, yes, you need to implement your own way of storing the order in which items in the cache have been used. That doesn't have to be a double-linked list. Those have traditionally been used for LRU caches because the most common operation is to move a node to the top of the list when it is accessed. That is an incredibly cheap operation in a double-linked list requiring just four nodes to be relinked with no memory allocation or free.
Problem 1 & 2:
Essentially the root cause here is that you are trying to use iterators as a cursor. They are designed to be created, stepped through to perform some operation, and then disposed of. Even if you get over the problems you are having, I expect there will be further problems behind them. You're putting a square peg in a round hole.
So my conclusion is that you need to implement your own way to hold values in a class that keeps track of order of access. However it can be incredibly simple: only three operations are required: create, get value and remove from tail. Both create and get value must move the node to the head of the list. No inserting or deleting from the middle of the list. No deleting the head. No searching. Honestly dead simple.
Hopefully this will get you started :-)
public class LRU_Map<K,V> implements Map<K,V> {
    private static final int MAX_SIZE = 100; // maximum number of cached entries (example value)

    private class Node {
        private final V value;
        private Node prev = null;
        private Node next = null;

        public Node(V value) {
            this.value = value;
            touch();
            if (tail == null)
                tail = this;
        }

        public V getValue() {
            touch();
            return value;
        }

        private void touch() {
            if (head != this) {
                unlink();
                moveToHead();
            }
        }

        private void unlink() {
            if (tail == this)
                tail = prev;
            if (prev != null)
                prev.next = next;
            if (next != null)
                next.prev = prev;
        }

        private void moveToHead() {
            prev = null;
            next = head;
            if (next != null)
                next.prev = this; // keep the old head's back-link consistent
            head = this;
        }

        public void remove() {
            assert this == tail;
            assert this != head;
            assert next == null;
            if (prev != null)
                prev.next = null;
            tail = prev;
        }
    }

    private final Map<K,Node> map = new HashMap<>();
    private Node head = null;
    private Node tail = null;

    public void put(K key, V value) {
        if (map.size() >= MAX_SIZE) {
            assert tail != null;
            tail.remove();
        }
        map.put(key, new Node(value));
    }

    public V get(K key) {
        if (map.containsKey(key))
            return map.get(key).getValue();
        else
            return null;
    }

    // and so on for other Map methods
}
Another way to skin this cat would be to implement a very simple class that extends LinkedList, but runs any modifications to the list (e.g. add, remove, etc.) inside a "synchronized" block. You'll need to run your HashMap pseudo-pointer through get() every time, but it should work just fine. E.g.
...
private final Object lock = new Object(); // semaphore

// override LinkedList's implementations...
@Override
public String remove(int index) { // assumes the list holds String values, as in the question
    synchronized (lock) {
        return super.remove(index);
    }
}
...
If you have Eclipse or IntelliJ IDEA, then you should be able to auto-generate the method stubs you need almost instantly, and you can evaluate which ones need to be locked.
If a Hashtable originally has size 8 and we hit the load factor, it grows to double the size. How is get() still able to retrieve the original values? Say our hash function turns the key 8 into the hash value 12345, which we mod by 8 to get an index. Now, when the hash table grows to 16, the key 8 still hashes to 12345, but if we mod it by 16 we get a different index! So how do I still retrieve the value originally stored for the key 8?
This isn't Java specific - when a hash table grows (in most implementations I know of), it has to reassess the keys of all hashed objects, and place them into their new, correct bucket based on the number of buckets now available.
This is also why resizing a hashtable is generally considered to be an "expensive" operation (compared to many others) - because it has to visit all of the stored items within it.
The hash value used to look up the value comes from the key object itself, not the container.
That's why objects used as keys in a Map must be immutable. If the hashCode() changes, you won't be able to find your key or value again.
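A small illustration of what goes wrong with a mutable key (the Key class here is made up for the demo):

import java.util.HashMap;
import java.util.Map;

public class MutableKeyDemo {
    static class Key {
        int id;
        Key(int id) { this.id = id; }
        @Override public int hashCode() { return id; }
        @Override public boolean equals(Object o) { return o instanceof Key && ((Key) o).id == id; }
    }

    public static void main(String[] args) {
        Map<Key, String> map = new HashMap<>();
        Key k = new Key(1);
        map.put(k, "value");            // stored in the bucket chosen for hash code 1
        k.id = 2;                       // mutating the key changes its hash code
        System.out.println(map.get(k)); // prints null: the lookup goes to the bucket for hash code 2
    }
}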
It is all implementation dependent, but a rehash will occur when it is necessary.
Take a look at the source for the HashMap class, in the transfer() method, which is called by the resize() method.
/**
 * Transfers all entries from current table to newTable.
 */
void transfer(Entry[] newTable) {
    Entry[] src = table;
    int newCapacity = newTable.length;
    for (int j = 0; j < src.length; j++) {
        Entry<K,V> e = src[j];
        if (e != null) {
            src[j] = null;
            do {
                Entry<K,V> next = e.next;
                int i = indexFor(e.hash, newCapacity);
                e.next = newTable[i];
                newTable[i] = e;
                e = next;
            } while (e != null);
        }
    }
}
In this HashMap implementation you can follow exactly how each entry is stored in the new (twice as big) storage array. The capacity of the new array is used to determine which slot each item will be stored in. The hash code of the keys does not change (it is in fact not even recomputed, but retrieved from the hash field of each Entry object, where it is stored); what changes is the result of the indexFor() call:
/**
 * Returns index for hash code h.
 */
static int indexFor(int h, int length) {
    return h & (length-1);
}
which takes the hash code and the new storage array's length and returns the index in the new array.
So a client's new call to get() will go through the same indexFor() call, which will also use the new storage array's length, and all will be well.
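A worked example using the hash value 12345 from the question (ignoring the supplemental hash HashMap applies before masking):

public class ResizeIndexDemo {
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        int hash = 12345;
        System.out.println(indexFor(hash, 8));  // 1 -> bucket while the table has 8 slots
        System.out.println(indexFor(hash, 16)); // 9 -> bucket after the table grows to 16
        // transfer() moves the entry from bucket 1 to bucket 9 during the resize,
        // and a later get() recomputes indexFor(hash, 16) == 9, so the entry is still found.
    }
}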
I am writing a function that will take in the head of a linked list, remove all duplicates, and return the new head. I've tested it but I want to see if you can catch any bugs or improve on it.
Node removeDuplicates(Node head) {
    if (head == null) throw new RuntimeException("Invalid linked list");
    Node cur = head.next;
    while (cur != null) {
        if (head.data == cur.data) {
            head = head.next;
        } else {
            Node runner = head;
            while (runner.next != cur) {
                if (runner.next.data == cur.data) {
                    runner.next = runner.next.next;
                    break;
                }
                runner = runner.next;
            }
            cur = cur.next;
        }
    }
    return head;
}
If you are willing to spend a little more RAM on the process, you can make it go much faster without changing the structure.
For desktop apps, I normally favor using more RAM and winning some speed. So I would do something like this.
void removeDuplicates(Node head) {
    if (head == null) {
        throw new RuntimeException("Invalid List");
    }
    Node current = head;
    Node prev = null;
    Set<T> data = new HashSet<T>(); // where T is the type of your data, assuming it implements hashCode()/equals() so it can be added to a Set properly
    while (current != null) {
        if (!data.contains(current.data)) {
            data.add(current.data);
            prev = current;
            current = current.next;
        } else {
            if (prev != null) {
                prev.next = current.next;
                current = current.next;
            }
        }
    }
}
This should run in O(n) time.
EDIT
I hope I was correct in assuming that this is some kind of project / homework where you are being forced to use a linked list, otherwise, as noted, you would be better off using a different data structure.
I didn't check your code for bugs, but I do have a suggestion for improving it. Allocate a Hashtable or HashMap that maps element values to Boolean. As you process each element, if its value is not a key in the hash, add it (with Boolean.TRUE as the value). If it does exist as a key, then it already appeared in the list and you can simply remove the element.
This is faster than your method because hash lookups work in roughly constant time, while you have an inner loop that has to go down the entire remainder of the list for each list element.
Also, you might consider whether using an equals() test instead of == makes better sense for your application.
To efficiently remove duplicates you should stay away from a linked list: use java.util.PriorityQueue instead; it is a sorted collection for which you can define the sorting criteria. If you always insert into a sorted collection, removing duplicates can either be done directly upon insertion or on demand with a single O(n) pass.
Aside from using the elements of the list to create a hash map and testing each element by using it as a key (which would only be desirable for a large number of elements, where "large" depends on the resources required to create the hash map), sequentially scanning the list is a practical option, but there are others which will be faster. See user138170's answer here: an in-place merge sort is an O(n log(n)) operation which does not use extra space, whereas a solution using a separately-allocated array would work in O(n) time. Practically, you may want to profile the code and settle for a reasonable value of n (the number of elements in the list), after which a solution allocating memory will be used instead of one which does not.
Edit: If efficiency is really important, I would suggest not using a linked list (largely to preserve cache-coherency) and perhaps using JNI and implementing the function natively; the data would also have to be supplied in a natively-allocated buffer.
All,
Can anyone please let me know exactly what the performance differences are between the two? The CodeRanch site provides a brief overview of the internal calls needed when using keySet() and get(), but it would be great if anyone could provide exact details about the flow when the keySet() and get() methods are used. This would help me understand the performance issues better.
The most common case where using entrySet is preferable over keySet is when you are iterating through all of the key/value pairs in a Map.
This is more efficient:
for (Map.Entry entry : map.entrySet()) {
    Object key = entry.getKey();
    Object value = entry.getValue();
}
than:
for (Object key : map.keySet()) {
    Object value = map.get(key);
}
Because in the second case, for every key in the keySet the map.get() method is called, which - in the case of a HashMap - requires that the hashCode() and equals() methods of the key object be evaluated in order to find the associated value*. In the first case that extra work is eliminated.
Edit: This is even worse if you consider a TreeMap, where a call to get is O(log(n)), i.e. the comparator may need to run log2(n) times (n = size of the Map) before finding the associated value.
*Some Map implementations have internal optimisations that check the objects' identity before the hashCode() and equals() are called.
First of all, this depends entirely on which type of Map you're using. But since the JavaRanch thread talks about HashMap, I'll assume that that's the implementation you're referring to. And let's assume also that you're talking about the standard API implementation from Sun/Oracle.
Secondly, if you're concerned about performance when iterating through your hash map, I suggest you have a look at LinkedHashMap. From the docs:
Iteration over the collection-views of a LinkedHashMap requires time proportional to the size of the map, regardless of its capacity. Iteration over a HashMap is likely to be more expensive, requiring time proportional to its capacity.
HashMap.entrySet()
The source-code for this implementation is available. The implementation basically just returns a new HashMap.EntrySet. A class which looks like this:
private final class EntrySet extends AbstractSet<Map.Entry<K,V>> {
    public Iterator<Map.Entry<K,V>> iterator() {
        return newEntryIterator(); // returns a HashIterator...
    }
    // ...
}
and a HashIterator looks like
private abstract class HashIterator<E> implements Iterator<E> {
    Entry<K,V> next;        // next entry to return
    int expectedModCount;   // For fast-fail
    int index;              // current slot
    Entry<K,V> current;     // current entry

    HashIterator() {
        expectedModCount = modCount;
        if (size > 0) { // advance to first entry
            Entry[] t = table;
            while (index < t.length && (next = t[index++]) == null);
        }
    }

    final Entry<K,V> nextEntry() {
        if (modCount != expectedModCount)
            throw new ConcurrentModificationException();
        Entry<K,V> e = next;
        if (e == null)
            throw new NoSuchElementException();
        if ((next = e.next) == null) {
            Entry[] t = table;
            while (index < t.length && (next = t[index++]) == null);
        }
        current = e;
        return e;
    }

    // ...
}
So there you have it... That's the code dictating what will happen when you iterate through an entrySet. It walks through the entire array, which is as long as the map's capacity.
HashMap.keySet() and .get()
Here you first need to get hold of the set of keys; iterating it takes time proportional to the capacity of the map (as opposed to the size, for LinkedHashMap). After this is done, you call get() once for each key. Sure, in the average case, with a good hashCode() implementation this takes constant time. However, it will inevitably require lots of hashCode() and equals() calls, which will obviously take more time than just doing an entry.getValue() call.
Here is the link to an article comparing the performance of entrySet(), keySet() and values(), and advice regarding when to use each approach.
Apparently the use of keySet() is faster (besides being more convenient) than entrySet() as long as you don't need to Map.get() the values.