I have read about hash tables and open addressing. If you want to insert the keys 18, 32, 44 into a hash table of size 13:
18 gets index 5 (18 modulus 13 = 5)
32 gets index 6 (32 modulus 13 = 6)
44 gets index 5 (44 modulus 13 = 5)
You'll get a collision because there is already something at index 5.
If you use linear probing you compute (key + i) mod N, where i = 0, 1, 2, ..., until you find an empty place in the hash table. Then 44 will be inserted at index 7.
What if you delete 32, and then you want to delete 44? You start by looking at index 5 (the hash of 44) - that is not 44 - then at index 6 - that is empty. Then you might think that 44 is gone. How do you mark a place in the hash table to say that it is not really empty, it no longer contains a key, but you should keep looking for 44 at the next index?
If you then need to insert another key at index 6, that key simply overwrites the "mark" in the hash table.
What could you use to mark an index as "a key was here but has been deleted", so that a search continues at the next index? You can't just write null or 0, because then you would either think the slot is simply empty (null) or that a key with value 0 has overwritten 44.
One way to handle hash tables that use open addressing is to give each slot a state mark: EMPTY, OCCUPIED or DELETED. Note that there's an important distinction between EMPTY, which means the position has never been used, and DELETED, which means it was used but its entry got deleted.
When a value gets removed, the slot is marked as DELETED, not EMPTY. When you try to retrieve a value, you probe until you either find it or hit a slot marked EMPTY; i.e. you treat DELETED slots the same as OCCUPIED ones. Insertion, on the other hand, can ignore the distinction: you can insert into either a DELETED or an EMPTY slot.
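A minimal sketch of what this can look like for an int-keyed table with linear probing (the class and names are illustrative, and it assumes non-negative keys and a table that never becomes completely full):

class ProbingTable {
    private static final int EMPTY = 0, OCCUPIED = 1, DELETED = 2;
    private final int[] keys;
    private final int[] states; // one mark per slot, initially all EMPTY

    ProbingTable(int capacity) {
        keys = new int[capacity];
        states = new int[capacity];
    }

    void insert(int key) {
        int i = key % keys.length;
        while (states[i] == OCCUPIED) {      // insertion may reuse DELETED slots
            i = (i + 1) % keys.length;
        }
        keys[i] = key;
        states[i] = OCCUPIED;
    }

    boolean contains(int key) {
        int i = key % keys.length;
        while (states[i] != EMPTY) {         // DELETED is treated like OCCUPIED here
            if (states[i] == OCCUPIED && keys[i] == key) return true;
            i = (i + 1) % keys.length;
        }
        return false;                        // an EMPTY slot ends the probe sequence
    }

    void remove(int key) {
        int i = key % keys.length;
        while (states[i] != EMPTY) {
            if (states[i] == OCCUPIED && keys[i] == key) {
                states[i] = DELETED;         // tombstone: not EMPTY, so searches keep probing
                return;
            }
            i = (i + 1) % keys.length;
        }
    }
}

With this, deleting 32 leaves index 6 marked DELETED, so a later search for 44 keeps probing past it and still finds 44 at index 7.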
The question is tagged Java, which is a bit misleading because Java (or at least Oracle's implementation of it) does not use open addressing for its standard collections. Open addressing gets especially problematic when the load factor gets high, because hash collisions occur much more often.
Plots of the expected probe count against load factor show a dramatic performance drop near the 0.7 mark. Most hash tables therefore get resized once their load factor passes a certain threshold; Java, for example, doubles the size of its HashMap once the load factor exceeds 0.75.
It seems like you are trying to implement your own hash table (as opposed to using the Hashtable or HashMap included in Java), so it's more a data-structure question than a Java question.
That being said, implementing a hash table with open addressing (such as linear probing) is not very efficient when it comes to removing elements. The usual solution is to "pull up" all elements that are in the wrong slot so there are no gaps left in the probe sequence.
There is some pseudocode describing this quite well on Wikipedia:
http://en.wikipedia.org/wiki/Open_addressing
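A rough Java transcription of that removal-by-shifting idea, assuming a parallel occupied[] flag array and a findSlot probe helper that are not shown here (a sketch of the Wikipedia pseudocode, not production code):

// Remove without tombstones: after clearing the slot, shift later entries of the
// probe run back so that no hole interrupts a future probe sequence.
void removeByShifting(int key, int[] keys, boolean[] occupied) {
    int n = keys.length;
    int i = findSlot(key, keys, occupied); // hypothetical helper: probes to the key's slot
    if (!occupied[i]) return;              // key is not in the table
    occupied[i] = false;
    int j = i;
    while (true) {
        j = (j + 1) % n;
        if (!occupied[j]) break;
        int k = Math.floorMod(keys[j], n); // home slot of the entry at j
        // leave the entry alone if its home slot lies cyclically in (i, j]
        boolean leaveInPlace = (i <= j) ? (i < k && k <= j) : (k <= j || i < k);
        if (leaveInPlace) continue;
        keys[i] = keys[j];                 // otherwise pull it back into the hole
        occupied[i] = true;
        occupied[j] = false;
        i = j;
    }
}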
The hash table buckets aren't limited to storing a single value. So if two objects hash to the same location in the table, they will both be stored. The collision only means that lookup will be slightly slower, because when looking for a value whose key hashes to a particular location, each entry in that bucket has to be checked for a match.
It sounds like you are describing a hash table where you only store a single entry at each index. The only way I can think to do that is to add a field to the structure storing the value that indicates whether that position had a collision. Then when doing a lookup you'd check the key; if it matches, you have the value. If not, you'd check whether there was a collision and, if so, check the next position. On removal you'd have to leave the collision marker but delete the value and key.
If you use a hash table with this approach (which none of the built-in hash collections do), you need to traverse all the later keys to see whether they need to be moved up (to avoid holes). Some might be there for the same hash value and some might be collisions from unrelated hash codes. If you do this you are not left with any holes. For a hash map that is not too full, this shouldn't create much overhead.
Related
I am looking for some advice on storing all possible permutations for the fringe pattern database.
The fifteen-puzzle has 16! possible permutations; however, the fringe only tracks the tiles 0 (the blank), 3, 7, 11, 12, 13, 14 and 15, which gives 16!/(16-8)! = 518,918,400 permutations.
I am looking to store all of these permutations in a data structure along with the value of the heuristic function (which is just incremented on each iteration of the breadth-first search). So far I am doing so, but very slowly: it took me 5 minutes to store 60,000, which is time I don't have!
At the moment I have a structure which looks like this.
Value Pos0 Pos3 Pos7 Pos11 Pos12 Pos13 Pos14 Pos15
where I store the position of each of those tiles. I want to use these positions as the ID so that, when calculating the heuristic value, I can quickly look up the given configuration and retrieve its value.
I am pretty unsure about this. The state of the puzzle is represented by an array, for example:
int[] goalState = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}
My question is: what would be the best data structure to store these values in, and the best way to retrieve them?
(This question was originally about storing them in a database, but now I want to store them in some form of local data structure, as retrieving from a database is slow.)
I can't really grasp what special meaning 0, 3, 7, 11, 12, 13, 14, 15 have in your case. Are their positions unchangeable? Are their positions enough to identify the whole puzzle state?
Anyway, here is a general approach, you can narrow it down anytime:
As you have at most 16 possible tile values, I would try to use hexadecimal numbers to represent your permutations. So the state {1,2,3,6,5,4,7,8,9,10,11,12,13,14,15,0} would look like 0x123654789ABCDEF0 = 1312329218393956080. The biggest possible number would be 0xFEDCBA9876543210, which can still be stored in a long treated as unsigned (the helper methods for that exist only since Java 8) or alternatively in a BigInteger (there are plenty of examples; I would prefer this). Such a number would be unique for each permutation and could be used as the primary key, and if you have the whole state, retrieving it from the database would be pretty fast.
//saving your permutation
String state = "FEDCBA9876543210"; //note: no "0x" prefix, BigInteger would reject it
BigInteger permutationForDatabase = new BigInteger(state, 16);
//and then you can insert it into the database as a number
//reading your permutation
char searchedCharacter = 'a'; //let's say you look for tile 10; toString(16) uses lowercase digits
BigInteger permutation = ...; //here you read the number from the database
String hex = permutation.toString(16); //caution: leading zeros are dropped, pad back to 16 chars if needed
int tilePosition = hex.indexOf(searchedCharacter);
There might be a more elegant/performant solution to get the tile position (maybe some bit operation magic).
Each number 0-15 is a 4-bit number. With seven such numbers you need 28 bits, which fits well within the 31 value bits of a signed int (with all 8 positions from the question you would need 32 bits, which still fits comfortably in a long). Thus all permutations may be assigned, and derived from, a single integer key.
To calculate this number, given variables a through g:
int key = a | (b << 4) | (c << 8) | (d << 12) | (e << 16) | (f << 20) | (g << 24);
To decode (if you need to):
int a = key & 0xF;
int b = (key >> 4) & 0xF;
int c = (key >> 8) & 0xF; // and so on for d through g
Storing ints in a database is very efficient and will use minimal disk space:
create table heuristics (
key_value int not null,
heuristic varchar(32) not null -- as small as you can, char(n) if all the same length
);
After inserting all the rows, create a covering index for super fast lookup:
create unique index heuristics_covering on heuristics(key_value, heuristic);
If you create this index before insertion, insertions will be very, very slow.
Creating the data and inserting it is relatively straightforward coding.
So is my understanding correct that you're calculating a heuristic value for each possible puzzle state, and you want to be able to look it up later based on a given puzzle state? So that you don't have to calculate it on the fly? Presumably because of the time it takes to calculate the heuristic value.
So you're iterating over all the possible puzzle states, calculating the heuristic, and then storing that result. And it's taking a long time to do that. It seems like your assumption is that it's taking a long time to store the value - but what if the time lag you're seeing isn't the time it takes to store the values in the data store, but rather the time it takes to generate the heuristic values? That seems far more likely to me.
In that case, if you want to speed up the process of generating and storing the values, I might suggest splitting up the task into sections, and using several threads at once.
The fastest data structure, I believe, is going to be an in-memory hash table, with the hash key being your puzzle state and the value being your heuristic value. Others have already suggested reasonable ways of generating puzzle-state hash keys. The same hash table structure could be accessed by each of the threads that are generating and storing heuristic values for sections of the puzzle-state domain.
Once you've populated the hash table, you can simply serialize it and store it in a binary file in the filesystem. Then have your heuristic value server load that into memory (and deserialize it into the in-memory hash table) when it starts up.
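A sketch of that save/load step, assuming the table maps an encoded puzzle state (a Long, as suggested in other answers) to its heuristic value; the file name and types are placeholders:

import java.io.*;
import java.util.HashMap;

class HeuristicStore {
    static void save(HashMap<Long, Integer> table, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(table); // HashMap is Serializable, so the whole table is written in one call
        }
    }

    @SuppressWarnings("unchecked")
    static HashMap<Long, Integer> load(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return (HashMap<Long, Integer>) in.readObject();
        }
    }
}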
If my premise is incorrect that it's taking a long time to generate the heuristic values, then it seems like you're doing something grossly sub-optimal when you go to store them. For example reconnecting to a remote database each time you store a value. That could potentially explain the 5 minutes. And if you're reconnecting every time you go to look up a value, that could explain why that is taking too long, too.
Depending on how big your heuristic values are, an in memory hash table might not be practical. A random-access binary file of records (with each record simply containing the heuristic value) could accomplish the same thing, potentially, but you'd need some way of mathematically mapping the hash key domain to the record index domain (which consists of sequential integers). If you're iterating over all the possible puzzle states, it seems like you already have a way of mapping puzzle states to sequential integers; you just have to figure out the math.
Using a local database table with each row simply having a key and a value is not unreasonable. You should definitely be able to insert 518 million rows in the space of a few minutes - you just need to maintain a connection during the data loading process, and build your index after your data load is finished. Once you've built the index on your key, a look up using the (clustered primary key integer) index should be pretty quick as long as you don't have to reconnect for every look up.
Also if you're committing rows into a database, you don't want to commit after each row, you'll want to commit every 1,000 or 10,000 rows. If you're committing after each row is inserted, that will substantially degrade your data loading performance.
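For example, a batched load along those lines might look like this (a sketch; the JDBC URL is a placeholder and the table is the heuristics example from above):

import java.sql.*;
import java.util.Map;

class HeuristicLoader {
    static void insertAll(Map<Integer, String> rows) throws SQLException {
        try (Connection conn = DriverManager.getConnection("jdbc:yourdb://...")) {
            conn.setAutoCommit(false);
            try (PreparedStatement ps = conn.prepareStatement(
                    "insert into heuristics (key_value, heuristic) values (?, ?)")) {
                int count = 0;
                for (Map.Entry<Integer, String> row : rows.entrySet()) {
                    ps.setInt(1, row.getKey());
                    ps.setString(2, row.getValue());
                    ps.addBatch();
                    if (++count % 10_000 == 0) { // send and commit every 10,000 rows
                        ps.executeBatch();
                        conn.commit();
                    }
                }
                ps.executeBatch();               // flush the remaining rows
                conn.commit();
            }
        }
    }
}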
I am building a simple versioned key-value store, where I need the ability to access records by specifying a (key, version) pair. More recent versions store a pointer to their previous version (i.e., they store the index into the hashmap).
To keep the records small, I hash the (key, version) pair to a Long, which I then store and use as an index.
My current implementation appends the key and the version (keys are restricted to alphabetical letters) and uses the native hashCode() function. Before each put, I test for collisions (i.e., does an entry already exist, and if so, does it have the same (key, version) pair), and I observe collisions frequently. Is this realistic?
My space is on the order of a million entries. My initial assumption was that this would make collisions very rare, but I was wrong.
Do you have an alternative solution, one that keeps the hash to 64 bits? 2^64 is a lot greater than 1M. I would like to avoid the size overhead of SHA/MD5 as much as possible.
Keys are randomly generated strings of length 16 chars
Versions are longs and will span the range 0 to 100000
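For reference, the scheme described above boils down to something like this (names are illustrative); note that it can only ever produce 2^32 distinct values, even though the result is stored in a long:

// Concatenate key and version, then take the built-in 32-bit hashCode.
// Storing the result in a long does not add any entropy.
static long recordId(String key, long version) {
    return (key + version).hashCode();
}

With about a million entries drawn from only 2^32 possible values, the birthday bound makes occasional collisions expected rather than surprising, which matches what is being observed.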
I have read the Android documentation for the "SparseBooleanArray" class, but I still don't understand its purpose. What is this class for, and when should it be used?
Here is the Doc Link
http://developer.android.com/reference/android/util/SparseBooleanArray.html
From what I can tell from the documentation, it is for mapping integer keys to boolean values.
That is, if you want to record whether a widget should be shown for a certain userID, and some userIDs have already been deleted, you would have gaps in your mapping.
Meaning, with a normal array you would create an array of size maxID and put a Boolean value at index userID. When iterating over the array you would then have to visit maxID elements in the worst case and check for null wherever there is no value for that index (i.e. the ID does not exist). That is really inefficient.
When using a HashMap you could map the ID to the boolean, but with the added overhead of computing the hash value for the key (that is why it is called a *hash*map) and of boxing the Integer keys, which ultimately costs CPU cycles as well as RAM.
So SparseBooleanArray seems like a good middle ground for dealing with such a situation.
NOTE: Even though my example is really contrived, I hope it illustrates the situation.
Like the Javadoc says, a SparseBooleanArray maps integers to booleans, which basically means that it's like a map with Integer as the key and Boolean as the value (Map<Integer, Boolean>).
However, it's more efficient in this particular case: it is intended to be more efficient than using a HashMap to map Integers to Booleans.
Hope this clears up any issues you had with the description.
I found a very specific and wonderful use for the sparse boolean array.
You can put a true or false value to be associated with a position in a list.
For example: list item #7 was clicked, so you put 7 as the key and true as the value.
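In code that might look like this (a minimal usage sketch with android.util.SparseBooleanArray):

SparseBooleanArray checked = new SparseBooleanArray();
checked.put(7, true);              // list item #7 was clicked
boolean item7 = checked.get(7);    // true
boolean item3 = checked.get(3);    // false: absent keys default to false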
There are three ways to store resource IDs:
1 Array
A boolean array with the IDs as indexes: if an ID has been used, set that index to true, otherwise false.
All the operations are fast, but this implementation requires a huge amount of space, so it can't be used.
High space complexity
2 HashMap
Key: ID
Value: boolean true/false
With this we need to run each ID through the hashing function, which consumes time and memory. There may also be many empty locations where no ID is stored, and we need to deal with collisions. So, due to the usage complexity and medium space complexity, it is not used.
Medium space complexity
3 SparseBooleanArray
Key: ID
Value: boolean true/false
It is a middle way: it uses a mapping on top of an array implementation.
Internally it is an array that stores the IDs in increasing order, so minimal space is used: it only contains the IDs that are actually in use. To search for an ID, binary search is used.
Binary search, O(log n), is slower than hashing, O(1), or direct array indexing, O(1), so insertion, deletion and searching all take more time, but there is the least memory wastage. So, to save memory, we prefer SparseBooleanArray.
Least space complexity
I've come across an interesting problem which I would love to get some input on.
I have a program that generates a set of numbers (based on some predefined conditions). Each set contains up to 6 numbers that do not have to be unique, with integers ranging from 1 to 100.
I would like to somehow store every set that is created so that I can quickly check if a certain set with the exact same numbers (order doesn't matter) has previously been generated.
Speed is a priority in this case as there might be up to 100k sets stored before the program stops (maybe more, but most of the time probably less)! Would anyone have any recommendations as to what data structures I should use and how I should approach this problem?
What I have currently is this:
Sort each set before storing it into a HashSet of Strings. The string is simply each number in the sorted set, joined with some separator.
For example, the set {4, 23, 67, 67, 71} would get encoded as the string "4-23-67-67-71" and stored into the HashSet. Then for every new set generated, sort it, encode it and check if it exists in the HashSet.
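That approach could look roughly like this (a sketch of what is described above, with illustrative names):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.StringJoiner;

class SetTracker {
    private final Set<String> seen = new HashSet<>();

    // Returns true if this combination (order ignored) has not been generated before.
    boolean addIfNew(int[] numbers) {
        int[] sorted = numbers.clone();
        Arrays.sort(sorted);
        StringJoiner key = new StringJoiner("-");
        for (int n : sorted) key.add(Integer.toString(n));
        return seen.add(key.toString()); // add() returns false if the string is already present
    }
}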
Thanks!
if you break it into pieces it seems to me that
creating a set (generate 6 numbers, sort, stringify) runs in O(1)
checking if this string exists in the hashset is O(1)
inserting into the hashset is O(1)
you do this n times, which gives you O(n).
this is already optimal as you have to touch every element once anyways :)
you might run into problems depending on the range of your random numbers.
e.g. assume you generate only numbers between one and one, then there's obviously only one possible outcome ("1-1-1-1-1-1") and you'll have only collisions from there on. however, as long as the number of possible sequences is much larger than the number of elements you generate i don't see a problem.
one tip: if you know the number of generated elements beforehand it would be wise to initialize the hashset with the correct number of elements (i.e. new HashSet<String>(100000)).
p.s. now with other answers popping up i'd like to note that while there may be room for improvement on a microscopic level (i.e. using language-specific tricks), your overall approach can't be improved.
Create a class SetOfIntegers
Implement a hashCode() method that will generate reasonably unique hash values
Use HashMap to store your elements like put(hashValue,instance)
Use containsKey(hashValue) to check if the same hashValue is already present
This way you will avoid sorting and conversion/formatting of your sets.
Just use a java.util.BitSet for each set, adding integers to it with the set(int bitIndex) method; you don't have to sort anything. Check a HashSet for an already existing BitSet before adding a new one, and it will be really fast. Don't sort and convert values to strings for this purpose if speed is important.
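A sketch of that idea (note that a BitSet only records which numbers occur, so this assumes duplicates within one set don't need to be distinguished):

import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

class BitSetTracker {
    private final Set<BitSet> seen = new HashSet<>();

    // Returns true if this combination of numbers has not been seen before.
    boolean addIfNew(int[] numbers) {
        BitSet bits = new BitSet(101);   // values range from 1 to 100
        for (int n : numbers) bits.set(n);
        return seen.add(bits);           // BitSet implements equals() and hashCode()
    }
}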
Say I have a population of key-value pairs which I plan to store in a hash table. The population is fixed and will never change. What optimizations are available to me to make the hash table as fast as possible? Which optimizations should I concentrate on? This is assuming I have a lot of space. There will be a reasonable number of pairs (say no more than 100,000).
EDIT: I want to optimize look up. I don't care how long it takes to build.
I would make sure that your keys hash to unique values. This will ensure that every lookup takes constant time and is thus as fast as possible.
Since you can never have more than 100,000 keys, it is entirely possible to have 100,000 hash values.
Also, make sure that you use the constructor that takes an int to specify the initial capacity (set it to 100,000) and a float to set the load factor (use 1). Doing this requires a perfect hash function for your keys, but it will result in the fastest possible lookup in the least amount of memory.
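For example (a minimal sketch, with String keys standing in for whatever the real key type is):

Map<String, String> table = new HashMap<>(100_000, 1.0f); // initial capacity 100,000, load factor 1

With those arguments the internal table is sized up to the next power of two (131,072 slots), so it never has to rehash while the 100,000 entries are inserted.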
In general, to optimize a hash table, you want to minimize collisions in the determination of your hash, so your buckets won't contain more than one item and the hash-search will return immediately.
Most of the time, that means that you should measure the output of your hash function on the problem space. So I guess I'd recommend looking into that.
Ensure there are no collisions. If there are no collisions, you are guaranteed O(1) constant look-up time. The next optimization would then be the look-up.
Use a profiler to optimize piece by piece. It's hard to without that.
If it's possible to make a hash table large enough that there are no collisions at all, that will be ideal, since your insertions and lookups will be done in constant time.
But if that is not possible, try to choose a hash function such that your keys get distributed uniformly across the hash table.
Perfect hashing algorithms deal with the problem, but may not scale to 100k objects. I found a Java MPH package, but haven't tried it.
If the population is known at compile time, then the optimal solution is to use a minimal perfect hash function (MPH). The Wikipedia page on this subject links to several Java tools that can generate these.
The optimization must be done in the hashCode method of the key class. The thing to keep in mind is to implement this method so as to avoid collisions.
Getting a perfect hashing algorithm to give totally unique values to 100K objects is likely to be close to impossible. Consider the birthday paradox: the date on which people are born can be considered a perfect hashing algorithm, yet if you have more than 23 people you are more likely than not to have a collision, and that is in a table of 365 dates.
So how big a table will you need to have no collisions in 100K?
If your keys are strings, your optimal strategy is a tree, not binary but n-branch at each character. If the keys are lower-case only it is easier still, as you need just 26 branches whenever you create a node.
We start with 26 branches at the root. Follow the first character, say f.
f might have a value associated with it, and it may have sub-trees. Look up the sub-tree for o. This leads to more sub-trees; then look up the next o. (You knew where that was leading!) If the final node doesn't have a value associated with it, or we hit a null sub-tree on the way, we know the value is not found.
You can optimise the space in the tree where you hit a point of uniqueness. Say you have the key january and it becomes unique at the 4th character. At the point where you assign the value, you also store the actual string associated with it. In our example there may be one value associated with foo, but the key it relates to may be food, not foo.
I think google search engines use a technique similar to this.
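A minimal sketch of that 26-way character tree (lowercase keys only, and without the space optimisation of storing the rest of the key at the point of uniqueness):

class TrieMap {
    private static class Node {
        final Node[] children = new Node[26]; // one branch per lower-case letter
        Object value;                         // null if no value ends at this node
    }

    private final Node root = new Node();

    void put(String key, Object value) {
        Node node = root;
        for (int i = 0; i < key.length(); i++) {
            int c = key.charAt(i) - 'a';
            if (node.children[c] == null) node.children[c] = new Node();
            node = node.children[c];
        }
        node.value = value;
    }

    Object get(String key) {
        Node node = root;
        for (int i = 0; i < key.length(); i++) {
            node = node.children[key.charAt(i) - 'a'];
            if (node == null) return null;    // hit a null sub-tree: not found
        }
        return node.value;                    // null if no value is associated with this key
    }
}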
The key question is what your key is. (No pun intended.) As others have pointed out, the goal is to minimize the number of hash collisions. If you can get the number of hash collisions to zero, i.e. your hash function generates a unique value for every key that is actually passed to it, you will have a perfect hash.
Note that in Java, a hash function really has two steps: first the key is run through the hashCode function for its class, then we calculate an index into the hash table by taking this value modulo the size of the hash table.
I think that people discussing the perfect hash function tend to forget that second step. Even if you wrote a hashCode function that generated a unique value for every key passed to it, you could still get an absolutely terrible hash if this value modulo the hash table size is not unique. For example, say you have 100 keys and your hashCode function returns the values 1, 1001, 2001, 3001, 4001, 5001, ... 99001. If your hash table has 100,000 slots, this would be a perfect hash. Every key gets its own slot. But if it has 1000 slots, they all hash to the same slot. It would be the worst possible hash.
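The same example in code:

int hashCode = 99001;
int goodIndex = hashCode % 100_000; // 99001: its own slot in a 100,000-slot table
int badIndex  = hashCode % 1_000;   // 1: the same slot as 1, 1001, 2001, ... in a 1,000-slot table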
So consider constructing a good hash function. Take the extreme cases. Suppose that your key is a date. You know that the dates will all be in January of the same year. Then using the day of the month as the hash value should be as good as it's going to get: everything will hash to a unique integer in a small range. On the other hand, if your dates were all the first of the month for many years and many months, taking the day of the month would be a terrible hash, as every actual key would map to "1".
My point being that if you really want to optimize your hash, you need to know the nature of your data. What is the actual range of values that you will get?