Too many hashing function collisions - java

I'm trying to make a hashing function using the polynomial accumulation method (which is supposed to give you 5 collisions per 55k words or something) but when I run it with 1,000 words, I get ~190 collisions. Am I doing something wrong?
public int hashCode(String str) {
    double hash_value = 0; // used for float
    for (int i = 0; i < str.length(); i++){
        hash_value = 33*hash_value + str.charAt(i);
    }
    return (int) (hash_value % array_size);
}

Generally, prime numbers are favoured for hash-code generation; I suggest trying 109 or 251. 33 is a multiple of 3, which means you are more likely to have issues depending on your inputs.
Also, you should use an int for the calculation and call Math.abs on the result.
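For example, a minimal sketch with those changes applied (109 is just one of the primes suggested above, and array_size is assumed to be the length of your bucket array):
public int hashCode(String str) {
    int hash_value = 0;
    for (int i = 0; i < str.length(); i++) {
        hash_value = 109 * hash_value + str.charAt(i);  // int overflow just wraps around
    }
    return Math.abs(hash_value % array_size);
}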

Either your data set is extremely "unlucky", or (more probably) your array_size is too small; hash-function collision figures are usually quoted without taking the finite bucket-array size into account.
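As a rough sanity check (assuming a uniform hash and counting colliding pairs; the bucket count below is a guess, since the question doesn't state array_size):
static double expectedCollidingPairs(int n, int m) {
    // birthday bound: expected number of colliding pairs for n keys thrown into m buckets
    return n * (n - 1) / (2.0 * m);
}
// expectedCollidingPairs(1000, 1000) is roughly 500, so ~190 collisions says more
// about a small array_size than about the hash function itself.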

You are generating a large number which differs from word to word, but there is still a chance of collisions. For example, with the code above:
"bA" = 33×98 + 65 = 3299
"ab" = 33×97 + 98 = 3299
If you go for a larger multiplier, greater than 57, there will be less chance of collision; 109 or 251 would be a good choice.


Word frequency hash table

Ok, I have a project that requires me to have a dynamic hash table that counts the frequency of words in a file. I must use Java; however, we are not allowed to use any built-in data types or built-in classes at all except standard arrays. Also, I am not allowed to use any hash functions off the internet that are known to be fast; I have to make my own hash functions. Lastly, my instructor also wants my table to start at size 1 and double in size every time a new key is added.
My first idea was to sum the ASCII values of the letters composing a word and use that to make a hash function, but different words with the same letters would produce the same value.
How can I get started? Is the ASCII idea on the right track?
A hash table isn't expected to have in general a one-to-one mapping between a value and a hash. A hash table is expected to have collisions. That is, the domain of the hash-function is expected to be larger than the range (i.e., the hash value). However, the general idea is that you come up with a hash function where the probability of collision is drastically small. If your hash-function is uniform, i.e., if you have it designed such that each possible hash-value has the same probability of being generated, then you can minimize collisions this way.
Getting a collision isn't the end of the world. That just means that you have to search the list of values for that hash. If your hashing function is good, overall your performance for lookup should still be O(1).
Generating hashing functions is a subject of its own, and there is no one answer. But a good place for you to start could be to work with the bitwise representations of the characters in the string, and perform some sort of convolution operations on them (rotate, shift, XOR) in series. You could perform these in some way based on some initial seed-value, and then use the output of the first step of hashing as a seed for the next step. This way you can end up magnifying the effects of your convolution.
For example, let's say you get the character A, which is 41 in hex, or 0100 0001 in binary. You could designate each bit to mean some operation (maybe bit 0 is a ROR when it is 0, and a ROL when it is 1; bit 1 is an OR when it is 0, and a XOR when it is 1, etc.). You could even decide how much convolution you want to do based on the value itself. For example, you could say that the lower nibble specifies how much right-rotation you will do, and the upper nibble specifies how much left rotation you will do. Then once you have the final value, you will use that as the seed for the next character. These are just some ideas. Use your imagination and see what you get!
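For instance, a minimal sketch of that idea (the seed value and the way the nibbles choose the rotation amounts are arbitrary choices for illustration, not a fixed recipe):
static int convolutionHash(String word, int tableSize) {
    int h = 0x12345678;                                // arbitrary non-zero seed (an assumption)
    for (int i = 0; i < word.length(); i++) {
        int c = word.charAt(i);
        h = Integer.rotateLeft(h, c & 0x0F) ^ c;       // low nibble picks the left-rotation amount
        h = Integer.rotateRight(h, (c >>> 4) & 0x0F);  // high nibble picks the right-rotation amount
    }
    return (h & 0x7FFFFFFF) % tableSize;               // clear the sign bit and fit the table
}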
No matter how good your hash function is, you will always have collisions that you need to resolve.
If you want to keep your approach of using the ASCII values of the characters, you shouldn't just add them up; that would lead to a lot of collisions. Instead, weight each value by a power of 256. For the word "Help" that would be: 'H' + 'e' * 256 + 'l' * 256² + 'p' * 256³. As a function:
int hash(String word, int hashSize) {
    int res = 0;
    int power = 1;                        // 256^count
    for (int i = 0; i < word.length(); i++) {
        res += word.charAt(i) * power;
        power *= 256;
        if (power == 0) power = 1;        // 256^4 overflows int to exactly 0, so restart the cycle
    }
    return Math.abs(res % hashSize);
}
Now you just have to write your own hash table:
class WordCounterMap {
    Entry[] entrys = new Entry[1];

    // uses the hash(String, int) function from above (e.g. as a method of this class)
    void add(String s) {
        int hash = hash(s, entrys.length);
        if (entrys[hash] == null || !entrys[hash].word.equals(s)) {
            // s is not sitting in its home slot, so it is (probably) a new key:
            // double the table and rehash all existing entries
            Entry[] temp = new Entry[entrys.length * 2];
            for (Entry e : entrys) {
                if (e != null) {
                    int h = hash(e.word, temp.length);
                    while (temp[h] != null) h = (h + 1) % temp.length;  // linear probing
                    temp[h] = e;
                }
            }
            entrys = temp;
            hash = hash(s, entrys.length);
        }
        while (true) {                        // probe until we find s or an empty slot
            if (entrys[hash] != null) {
                if (entrys[hash].word.equals(s)) {
                    entrys[hash].count++;
                    break;
                }
            } else {
                entrys[hash] = new Entry(s);
                break;
            }
            hash = (hash + 1) % entrys.length;
        }
    }

    int getCount(String s) {
        int hash = hash(s, entrys.length);
        while (entrys[hash] != null) {
            if (entrys[hash].word.equals(s)) {
                return entrys[hash].count;
            }
            hash = (hash + 1) % entrys.length;
        }
        return 0;
    }

    static class Entry {
        int count;
        String word;

        Entry(String s) {
            this.word = s;
            count = 1;
        }
    }
}

Printing PowerSet with help of bit position

Googling around for a while to find subsets of a String, I read Wikipedia, and it mentions:
For the whole power set of S we get:
{ } = 000 (Binary) = 0 (Decimal)
{x} = 100 = 4
{y} = 010 = 2
{z} = 001 = 1
{x, y} = 110 = 6
{x, z} = 101 = 5
{y, z} = 011 = 3
{x, y, z} = 111 = 7
Is there a way to implement this in a program and avoid a recursive algorithm over the string length?
What I understood so far is that, for a String of length n, we can loop from 0 to 2^n - 1 and print the characters corresponding to the set bits.
What I couldn't get is how to map those set bits to the corresponding characters in the most efficient way.
PS: I checked the thread "Power set generated by bits" but couldn't understand it, and it's C++.
The idea is that a power set of a set of size n has exactly 2^n elements, exactly the same number as there are different binary numbers of length at most n.
Now all you have to do is create a mapping between the two, and you don't need a recursive algorithm. Fortunately, binary numbers give you an intuitive and natural mapping: you add the character at position j of the string to a subset if your loop variable has bit j set, which you can easily check with the getBit() helper in the code below (you can inline it, but I made it a separate function for readability).
P.S. As requested, more detailed explanation on the mapping:
If you have a recursive algorithm, your flow is given by how you traverse your data structure in the recursive calls. It is as such a very intuitive and natural way of solving many problems.
If you want to solve such a problem without recursion for whatever reason, for instance to use less time and memory, you have the difficult task of making this traversal explicit.
As we use a loop whose variable takes on a certain set of values, we need to map each value of the loop variable, e.g. 42, to one element, in our case a subset of s, in such a way that the mapping is bijective, that is, each subset is reached exactly once. Because we have a set, the order does not matter, so any mapping that satisfies these requirements will do.
Now we look at a binary number, e.g. 42 = 32 + 8 + 2, written in binary with the bit positions above it:
543210
101010
We can thus map 42 to a subset as follows using the positions:
order the elements of the set s in any way you like, but consistently (always the same within one program execution); in our case we can use the order of the characters in the string
add an element e_j if and only if the bit at position j is set (equal to 1).
As each number has at least one digit different from any other, we always get different subsets, and thus our mapping is injective (different input -> different output).
Our mapping is also valid, as the binary numbers we chose have length at most equal to the size of our set, so every bit position can be assigned to an element of the set. Combined with the fact that our set of inputs is chosen to have the same size (2^n) as the power set, it follows that the mapping is in fact bijective.
import java.util.HashSet;
import java.util.Set;

public class PowerSet
{
    static boolean getBit(int i, int pos) { return (i & (1 << pos)) > 0; }

    static Set<Set<Character>> powerSet(String s)
    {
        Set<Set<Character>> pow = new HashSet<>();
        for (int i = 0; i < (1 << s.length()); i++)   // 1 << n == 2^n subsets
        {
            Set<Character> subSet = new HashSet<>();
            for (int j = 0; j < s.length(); j++)
            {
                if (getBit(i, j)) { subSet.add(s.charAt(j)); }
            }
            pow.add(subSet);
        }
        return pow;
    }

    public static void main(String[] args)
    { System.out.println(powerSet("xyz")); }
}
Here is an easy way to do it:
// string is the input; n = string.length()
for (int i = 0; i < (1 << n); i++) {          // 1 << n == 2^n
    StringBuilder subset = new StringBuilder();
    int k = i;
    int c = 0;
    while (k > 0) {
        if (k % 2 == 1) {                     // bit c of i is set
            subset.append(string.charAt(c));
        }
        k = k / 2;
        c++;
    }
    System.out.println(subset);
}
Explanation: the code repeatedly divides the number by 2 and takes the remainder, which walks through the binary digits of i from least significant to most significant. As described above, it then selects only those indices of the string whose bit is 1.

Stable mapping of an integer to a random number

I need a stable and fast one way mapping function of an integer to a random number.
By "stable" I mean that the same integer should always map to the same random number.
And by "random number" I actually mean "some number which behaves like random".
e.g.
1 -> 329423
2 -> -12398791234
3 -> -984
4 -> 42342435
...
If I had enough memory (and time) I would ideally use:
for( int i=Integer.MIN_VALUE; i<Integer.MAX_VALUE; i++ ){
    map[i]=i;
}
shuffle( map );
I could use a secure hash function like MD5 or SHA, but these are too slow for my purposes and I don't need any crypto/security properties.
I only need this in one way. So I will never have to translate the random number back to its integer.
Background: (For those who want to know more)
I'm planning to use this to invalidate a complete cache over a given amount of time. The invalidation is done "randomly" on access of a cache member, with an increasing chance as time passes. I need this to be stable so that isValid( entry ) does not "flicker", and for consistent testing.
The input to this function will be the Java hash of the entry's key, which is typically in the range "1000"-"15000" (but can contain some other stuff, too) and comes in batches.
The invalidation is done on the condition of:
elapsedTime / timeout * Integer.MAX_VALUE > abs( random( key.hashCode() ) )
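Roughly, in code (CacheEntry, elapsedTime, timeout and randomOf(...) are placeholders here, not real definitions):
boolean isValid(CacheEntry entry, long elapsedTime, long timeout) {
    // entries become invalid with increasing probability as elapsedTime approaches timeout
    double threshold = (double) elapsedTime / timeout * Integer.MAX_VALUE;
    return threshold <= Math.abs((long) randomOf(entry.key().hashCode()));
}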
EDIT: (this is too long for a comment so I put it here)
I tried gexicide's answer and it turns out this isn't random enough. Here is what I tried:
for( int i=0; i<12000; i++ ){
    int hash = (""+i).hashCode();
    Random rng = new Random( hash );
    int random = rng.nextInt();
    System.out.printf( "%05d, %08x, %08x\n", i, hash, random );
}
The output starts with:
00000, 00000030, bac2c591
00001, 00000031, babce6a4
00002, 00000032, bace836b
00003, 00000033, bac8a47e
00004, 00000034, baab49de
00005, 00000035, baa56af1
00006, 00000036, bab707b7
00007, 00000037, bab128ca
00008, 00000038, ba93ce2a
00009, 00000039, ba8def3d
00010, 0000061f, 98048199
and it goes on in this way.
I could use SecureRandom instead:
for( int i=0; i<12000; i++ ){
    SecureRandom rng = new SecureRandom( (""+i).getBytes() );
    int random = rng.nextInt();
    System.out.printf( "%05d, %08x\n", i, random );
}
which indeed looks pretty random, but it is not stable anymore and is 10 times slower than the method above.
Although you never specified it as a requirement, you'll probably want a full 1:1 mapping. This is because the number of possible input values is small. Any output that can occur for more than one input implies another output which can never happen at all, and if you have output values which are impossible then you have a skewed distribution.
Of course, if your input is skewed then your output will be skewed anyway, and there's not much you can do about that.
Anyway, this makes it a unique int-to-int hash.
Simply apply a couple of trivial, independent 1:1 mapping functions until things are suitably distributed. You've already isolated one transform from the Random class, but I suggest mixing it with some other transforms, like shifts and XORs, to avoid the individual weaknesses of different algorithms.
For example:
public static int mapInteger( int value ){
    value *= 1664525;
    value += 1013904223;
    value ^= value >>> 12;
    value ^= value << 25;
    value ^= value >>> 27;
    value *= 1103515245;
    value += 12345;
    return value;
}
If that's good enough then you can make it faster by deleting lines at random (I suggest you keep at least one multiply) until it's not good enough anymore, and then add the last deleted line back in.
Use a Random and seed it with your number:
Random generator = new Random(i);
return generator.nextInt();
As your testing exposes, the problem with this method is that such a seed creates a very poor random number in the first iteration. To increase the quality of the result, we need to run the random generator a few times; this will fill up the state of the random generator with pseudo-random values and will increase the quality of the following values.
To make sure that the random generator spreads the values enough, use it a few times before outputting the number. This should make the resulting number more pseudo-random:
Random generator = new Random(i);
for (int n = 0; n < 5; n++) generator.nextInt();
return generator.nextInt();
Try different values, maybe 5 is enough.
gexicide's answer is the correct (and simplest) one. Just one note:
Running this 1,000,000 times on my system takes about 70 ms, which is pretty fast. But it involves at least two object creations and feeds the GC. It would be better if this could be done on the stack, without any object creation at all.
Looking at the source of the Random class shows that there is some code to make it callable multiple times and to make it thread-safe, which can be removed.
So I ended up with a reimplementation in one method:
public static int mapInteger( int value ){
    // initial scramble
    long seed = (value ^ multiplier) & mask;
    // shuffle three times. This is like calling rng.nextInt() 3 times
    seed = (seed * multiplier + addend) & mask;
    seed = (seed * multiplier + addend) & mask;
    seed = (seed * multiplier + addend) & mask;
    // fit size
    return (int)(seed >>> 16);
}
(multiplier, addend and mask are some constants used by Random)
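For reference, these are the values in question, as they appear in the java.util.Random source:
static final long multiplier = 0x5DEECE66DL;   // LCG multiplier
static final long addend     = 0xBL;           // LCG increment
static final long mask       = (1L << 48) - 1; // keep the seed to 48 bits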
Running this 1,000,000 times gives the same result but takes only 5 ms, and is therefore more than 10 times faster.
BTW: This happens to be another piece of code from The Old Man - again. See Donald Knuth, The Art of Computer Programming, Volume 2, Section 3.2.1.

Java convert hash to random string

I'm trying to develop a reduction function for use within a rainbow table generator.
The basic principle behind a reduction function is that it takes in a hash, performs some calculations, and returns a string of a certain length.
At the moment I'm using SHA-1 hashes, and I need to return a string with a length of three. I need the string to be made up of any three random characters from:
abcdefghijklmnopqrstuvwxyz0123456789
The major problem I'm facing is that any reduction function I write always returns strings that have already been generated, and a good reduction function should only rarely return duplicate strings.
Could anyone suggest any ideas on a way of accomplishing this? Or any suggestions at all on hash to string manipulation would be great.
Thanks in advance
Josh
So it sounds like you've got 20 bytes, i.e. 20 digits of base 256 (the length of a SHA-1 hash), that you need to map into three digits of base 36. I would simply make a BigInteger from the hash bytes, take it modulo 36³, and return the string in base 36.
public static final BigInteger N36POW3 = new BigInteger("" + (36 * 36 * 36));

public static String threeDigitBase36(byte[] bs) {
    return new BigInteger(bs).mod(N36POW3).toString(36);
}
// ...
threeDigitBase36(sha1("foo")); // => "96b"
threeDigitBase36(sha1("bar")); // => "y4t"
threeDigitBase36(sha1("bas")); // => "p55"
threeDigitBase36(sha1("zip")); // => "ej8"
Of course there will be collisions, as when you map any space into a smaller one, but the entropy should be better than something even sillier than the above solution.
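The sha1(...) helper used above isn't shown in the answer; presumably it is something along these lines (an assumption, built on the standard MessageDigest API):
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public static byte[] sha1(String s) {
    try {
        return MessageDigest.getInstance("SHA-1").digest(s.getBytes(StandardCharsets.UTF_8));
    } catch (NoSuchAlgorithmException e) {
        throw new IllegalStateException(e); // SHA-1 is a required algorithm on every JVM
    }
}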
Applying the KISS principle:
An SHA is just a String
The JDK hashcode for String is "random enough"
Integer can render in any base
This single line of code does it:
public static String shortHash(String sha) {
    return Integer.toString(sha.hashCode() & 0x7FFFFFFF, 36).substring(0, 3);
}
Note: The & 0x7FFFFFFF is to zero the sign bit (hash codes can be negative numbers, which would otherwise render with a leading minus sign).
Edit - Guaranteeing hash length
My original solution was naive - it didn't deal with the case when the int hash is less than 100 (base 36) - meaning it would print less than 3 chars. This code fixes that, while still keeping the value "random". It also avoids the substring() call, so performance should be better.
static int min = Integer.parseInt("100", 36);
static int range = Integer.parseInt("zzz", 36) - min;

public static String shortHash(String sha) {
    return Integer.toString(min + (sha.hashCode() & 0x7FFFFFFF) % range, 36);
}
This code guarantees the final hash has 3 characters by forcing it to be between 100 and zzz - the lowest and highest 3-char hash in base 36, while still making it "random".

Checking if value of int[] can be long

I have an array of ints, e.g. [1,2,3,4,5]. Each position corresponds to a decimal place, so 5 is the 1's, 4 is the 10's, 3 is the 100's, which gives a value of 12345 that I calculate and store as a long.
This is the function :
public long valueOf(int[]x) {
    int multiplier = 1;
    value = 0;
    for (int i=x.length-1; i >=0; i--) {
        value += x[i]*multiplier;
        multiplier *= 10;
    }
    return value;
}
Now I would like to check whether the value of another int[] will exceed a long before I calculate its value with valueOf(). How can I check this?
Should I use the array's length, or maybe convert it to a String and pass it to Long(String s)?
Or maybe just throw an exception in the valueOf() function?
I hope you know that this is a horrible way to store large integers: just use BigInteger.
But if you really want to check whether some value is exceeded, just make sure the length of the array is less than or equal to 19; then you can compare each cell individually with the corresponding digit of Long.MAX_VALUE. Or you could just use BigInteger.
Short answer: every number of up to 18 decimal digits fits in a long. So if you know that there are no leading zeros, just check x.length <= 18. If you might have leading zeros, you'll have to loop through the array to count them and adjust accordingly.
A flaw in this is that some 19-digit numbers are valid longs, namely those up to Long.MAX_VALUE = 9223372036854775807. So if you wanted to be truly precise, you'd have to say length > 19 is bad, length < 19 is good, and for length == 19 you'd have to check digit by digit. Depending on what you're up to, rejecting a subset of numbers that would actually work might be acceptable.
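If you do want the precise check, a small sketch (assuming non-negative digits 0-9 and no leading zeros) could look like this:
// Digits of Long.MAX_VALUE = 9223372036854775807, most significant first
static final int[] MAX_LONG_DIGITS = {9,2,2,3,3,7,2,0,3,6,8,5,4,7,7,5,8,0,7};

static boolean fitsInLong(int[] digits) {
    if (digits.length < 19) return true;   // 18 digits or fewer always fits
    if (digits.length > 19) return false;  // 20 digits or more never fits
    for (int i = 0; i < 19; i++) {         // exactly 19: compare digit by digit
        if (digits[i] != MAX_LONG_DIGITS[i]) return digits[i] < MAX_LONG_DIGITS[i];
    }
    return true;                           // exactly Long.MAX_VALUE
}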
As others have implied, the bigger question is: Why are you doing this? If this is some sort of data conversion where you're getting numbers as a string of digits from some external source and need to convert this to a long, cool. If you're trying to create a class to handle numbers bigger than will fit in a long, what you're doing is both inefficient and unnecessary. Inefficient because you could pack much more than one decimal digit into an int, and doing so would give all sorts of storage and performance improvements. Unnecessary because BigInteger already does this. Why not just use BigInteger?
Of course if it's a homework problem, that's a different story.
Are you guaranteed that every value of x will be nonnegative?
If so, you could do this:
public long valueOf(int[] x) {
    long multiplier = 1;                  // long, so x[i] * multiplier is computed without int overflow
    long value = 0;                       // Note that you need the type here, which you did not have
    for (int i = x.length - 1; i >= 0; i--) {
        long next_val = x[i] * multiplier;
        if (Long.MAX_VALUE - next_val < value) {
            // Error-handling code here, however you
            // want to handle this case.
        } else {
            value += next_val;
        }
        multiplier *= 10;
    }
    return value;
}
Of course, BigInteger would make this much simpler. But I don't know what your problem specs are.
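For reference, a minimal sketch of the BigInteger route (the method name mirrors the question's valueOf; everything else is standard BigInteger API):
import java.math.BigInteger;

public static BigInteger valueOf(int[] x) {
    BigInteger value = BigInteger.ZERO;
    for (int digit : x) {
        // shift left one decimal place, then add the next digit; no overflow possible
        value = value.multiply(BigInteger.TEN).add(BigInteger.valueOf(digit));
    }
    return value;
}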
