Seemingly easy FNV1 hashing implementation results in a lot of collisions - java

I'm playing with hash tables and using a corpus of ~350,000 English words which I'd like to try to evenly distribute. Thus, I try to fit them into an array of length 810,049 (the closest prime larger than two times the input size) and I was baffled to see that a straightforward FNV1 implementation like this:
public int getHash(String s, int mod) {
    final BigInteger MOD = new BigInteger(Integer.toString(mod));
    final BigInteger FNV_offset_basis = new BigInteger("14695981039346656037");
    final BigInteger FNV_prime = new BigInteger("1099511628211");
    BigInteger hash = new BigInteger(FNV_offset_basis.toString());
    for (int i = 0; i < s.length(); i++) {
        int charValue = s.charAt(i);
        hash = hash.multiply(FNV_prime).mod(MOD);
        hash = hash.xor(BigInteger.valueOf((int) charValue & 0xffff)).mod(MOD);
    }
    return hash.mod(MOD).intValue();
}
results in 64,000 collisions, which is a lot: basically 20% of the input. What's wrong with my implementation? Is the approach somehow flawed?
EDIT: To add to that, I've also tried and implemented other hashing algorithms like sdbm and djb2, and they all perform just the same, equally poorly. All have these ~65k collisions on this corpus. When I changed the corpus to just 350,000 integers represented as strings, a bit of variance starts to occur (like one algorithm has 20,000 collisions and another has 40,000), but the number of collisions is still astoundingly high. Why?
EDIT2: I've just tested it, and Java's built-in .hashCode() results in just as many collisions. Even if you do something ridiculously naive, like a hash being the product of the char codes of all the characters modulo 810,049, it performs only about half worse than all those well-known algorithms (60k collisions vs. 90k with the naive approach).

Since mod is a parameter to your hash function I presume it is the range into which you want the hash normalized, i.e. for your specific use case you are expecting it to be 810,049. I assume this because:
The algorithm calls for the calculations to be done modulo 2^n, where n is the number of bits in the desired hash.
Given that the offset basis and FNV prime are constants within the module, and are equal to the parameters for a 64-bit hash, the value of mod should also be fixed at 2^64.
Since it is not, I assume it is the desired final output range.
In other words, given a fixed offset basis and FNV Prime, there is no reason to pass in the mod parameter -- it is dictated by the other two FNV parameters.
If all the above is correct then the implementation is wrong. You should be doing the calculations mod 2^64 and applying a final remainder operation with 810,049.
Also (but this may not be important), the algorithm calls for xoring the lower 8 bits with an ASCII character, whereas you are xoring with 16 bits. I am not sure this will make a difference since for ASCII the high-order byte will be zero anyway and it will behave exactly as if you were xoring only 8 bits.
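For illustration, a minimal sketch of that fix (not the asker's exact code): let a Java long wrap naturally, which is exactly arithmetic mod 2^64, and only reduce by the table size at the very end.
public int getHash(String s, int mod) {
    final long FNV_OFFSET_BASIS = 0xcbf29ce484222325L; // 14695981039346656037
    final long FNV_PRIME = 1099511628211L;
    long hash = FNV_OFFSET_BASIS;
    for (int i = 0; i < s.length(); i++) {
        hash *= FNV_PRIME;          // long overflow == arithmetic mod 2^64
        hash ^= s.charAt(i) & 0xff; // FNV-1 xors in one byte at a time
    }
    // Interpret the wrapped value as unsigned before taking the remainder.
    return (int) Long.remainderUnsigned(hash, mod);
}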


Is there a way to adjust for integer overflow?

I'm noodling through an anagram hash function, already solved several different ways, but I'm looking for extreme performance as an exercise. I already submitted a solution that passed all the given tests (beating out 100% of all competitors by at least 1ms), but I believe that although it "won", it has a weakness that just wasn't triggered. It is subject to integer overflow in a way that could affect the results.
The gist of the solution was to combine multiple commutative operations, each taking some number of bits, and concatenate them into one long variable. I chose xor, sum, and product. The xor operation cleanly fits within a fixed number of bits. The sum operation might overflow, but because Java integer overflow simply wraps around, it would still arrive at the same result if letters and their corresponding values are rearranged. I wouldn't worry, for example, about whether this function would overflow.
private short sumHash(String s) {
    short hash = 0;
    for (char c : s.toCharArray()) {
        hash += c;
    }
    return hash;
}
Where I run into trouble is in the inclusion of products. If I make a function that returns the product of a list of values (such as character values in a String), then, at the very least, the result could be rendered inaccurate if the product overflowed to exactly zero.
private short productHash(String s) {
    short hash = 1;
    for (char c : s.toCharArray()) {
        hash *= c;
    }
    return hash;
}
Is there any safe and performant way to avoid this weakness so that the function gains the benefit of the commutative property of multiplication to produce the same value for anagrams, but can't ever encounter a product that overflows to zero?
Sure, if you're willing to go to some lengths to do it. The simplest solution that occurs to me is to write
hash *= primes[c];
where primes is an array that maps each possible character to a distinct odd prime. Overflowing to zero can only happen if the "true" product in infinite-precision arithmetic is a multiple of 2^32, and if you're multiplying by odd primes, that's impossible.
(You do run into the problem that the hash itself will always be odd, but you could shift it right one bit to obtain a more fully mixed hash.)
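A rough sketch of that idea, assuming the input is restricted to lowercase a-z (the table and the widening to long are my own additions):
// Hypothetical lookup table: each lowercase letter maps to a distinct odd
// prime, so the running product is always odd and can never wrap to zero.
private static final int[] PRIMES = {
    3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43,
    47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103
};

private long productHash(String s) {
    long hash = 1L;
    for (char c : s.toCharArray()) {
        hash *= PRIMES[c - 'a']; // assumes 'a' <= c <= 'z'
    }
    return hash;
}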
You will only hit zero if
a * b = 0 mod 2^64
which is equivalent to there being an integer k such that
a * b = k * 2^64
That is, we get in trouble if factors divide 2^64, i.e. if factors are even. Therefore, the easiest solution is ensuring that all factors are odd, for instance like this:
for (char ch : chars) {
    hash *= (ch << 1) | 1;
}
This allows you to keep 63 bits of information.
Note however that this technique will only avoid collisions caused by overflow, but not collisions caused by multipliers that share a common factor. If you wish to avoid that, too, you'll need coprime multipliers, which is easiest achieved if they are prime.
The naive way to avoid overflow is to use a larger type such as int or long. However, for your purposes, modular arithmetic might make more sense. You can do (a * b) % p for a prime p to maintain commutativity. (There is some deep mathematics here called group theory, if you are interested in learning more.) You will need to limit p to be small enough that each a * b does not overflow. The easiest way to do this is to pick a p so that (p - 1)^2 can still be represented in a short or whatever data type you are using.
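For example, a minimal sketch of that modular variant, where p = 46337 is an assumed prime small enough that (p - 1)^2 still fits in an int:
// Commutative product hash reduced modulo a prime; both factors stay below p,
// and (p - 1)^2 < Integer.MAX_VALUE, so the multiplication never overflows.
private int productHashMod(String s) {
    final int p = 46337; // assumed prime with (p - 1)^2 < 2^31
    int hash = 1;
    for (char c : s.toCharArray()) {
        hash = (hash * (c % p)) % p;
    }
    return hash;
}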

Secure random number of approximately given size

I'm implementing a secret sharing algorithm which encrypts a message. To do that I need a prime bigger than the message and some random numbers of approximately the same size as the message.
I can do the first with BigInteger.probablePrime(MsgSize + 8), but I do not know how to do the latter.
I was using Random and later SecureRandom, but they don't generate numbers of a given length. My solution was to do randomInt ^ randomInt and convert to a BigInteger, but that is obviously a bad solution.
Some ideas?
Is it Shamir's Secret Sharing that you're implementing? If so, note that you don't actually need a prime bigger than the entire message — it's perfectly fine to break the message into chunks of some manageable size and to share each chunk separately using a fixed prime.
Also, Shamir's Secret Sharing doesn't need a prime-sized field; it's possible to use any finite field GF(p^n), including in particular the binary fields GF(2^n). Such fields are particularly convenient for computer implementation, since both the secret and the share chunks will then simply be n-bit bitstrings.
The only complications are that, in non-prime fields, you'll have to implement finite field arithmetic (or find an existing implementation) and that you'll need to choose a particular reducing polynomial and agree upon it. However, the former isn't really as complicated as it might seem, and the latter isn't really any harder than choosing and agreeing on a prime. (In particular, a reducing polynomial for GF(2^n) can be naturally represented as an n-bit bitstring, dropping the high bit which is always 1.)
Have you tried using the same probablePrime method with a smaller size, then using a large random integer as an offset from that number? That might do the trick, just an idea.
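A rough sketch of that idea (the bit sizes here are just illustrative assumptions):
import java.math.BigInteger;
import java.security.SecureRandom;

// Start from a probable prime somewhat smaller than the message, then add a
// uniformly random offset; new BigInteger(numBits, rnd) is uniform in [0, 2^numBits).
SecureRandom rnd = new SecureRandom();
int msgBits = 256; // assumed message size in bits
BigInteger base = BigInteger.probablePrime(msgBits - 8, rnd);
BigInteger offset = new BigInteger(msgBits - 8, rnd);
BigInteger candidate = base.add(offset);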
I had the same problem (that's why I found this post).
It is a little late, but maybe someone else will find this method useful:
public static BigDecimal getBigRandom(int d)
{
    BigDecimal rnd = new BigDecimal(Math.random());
    BigDecimal rndtmp;
    for (int i = 0; i <= d; i++)
    {
        rndtmp = new BigDecimal(Math.random());
        rndtmp = rndtmp.movePointLeft(rnd.precision());
        rnd = rnd.add(rndtmp);
    }
    return rnd;
}
Usage:
BigDecimal x = getBigRandom(y);
Every y will give you approximately 50 digits.
If you need more than (2^31 - 1) * 50 digits, simply change int to long ;-)
I don't know if it is good, but it works for me.

How to assign the largest n bit unsigned integer to a BigInteger in Java

I have a scenario where I'm working with large integers (e.g. 160 bit), and am trying to create the biggest possible unsigned integer that can be represented with an n bit number at run time. The exact value of n isn't known until the program has begun executing and read the value from a configuration file. So for example, n might be 160, or 128, or 192, etcetera...
Initially what I was thinking was something like:
BigInteger.valueOf((long)Math.pow(2, n));
but then I realized the conversion to long that takes place sort of defeats the purpose, given that a long does not have enough bits in the first place to store the result. Any suggestions?
On the largest n-bit unsigned number
Let's first take a look at what this number is, mathematically.
In an unsigned binary representation, the largest n-bit number would have all bits set to 1. Let's take a look at some examples:
1(2) = 1 = 2^1 - 1
11(2) = 3 = 2^2 - 1
111(2) = 7 = 2^3 - 1
...
1…1(2) (n ones) = 2^n - 1
Note that this is analogous in decimal too. The largest 3-digit number is:
10^3 - 1 = 1000 - 1 = 999
Thus, a subproblem of finding the largest n-bit unsigned number is computing 2^n.
On computing powers of 2
Modern digital computers can compute powers of two efficiently, due to the following pattern:
2^0 = 1(2)
2^1 = 10(2)
2^2 = 100(2)
2^3 = 1000(2)
...
2^n = 10…0(2) (n zeros)
That is, 2^n is simply a number having bit n set to 1 and everything else set to 0 (remember that bits are numbered with zero-based indexing).
Solution
Putting the above together, we get this simple solution using BigInteger for our problem:
final int N = 5;
BigInteger twoToN = BigInteger.ZERO.setBit(N);
BigInteger maxNbits = twoToN.subtract(BigInteger.ONE);
System.out.println(maxNbits); // 31
If we were using long instead, then we can write something like this:
// for 64-bit signed long version, N < 64
System.out.println(
    (1L << N) - 1
); // 31
There is no "set bit n" operation defined for long, so traditionally bit shifting is used instead. In fact, a BigInteger analog of this shifting technique is also possible:
System.out.println(
    BigInteger.ONE.shiftLeft(N).subtract(BigInteger.ONE)
); // 31
See also
Wikipedia/Binary numeral system
Bit Twiddling Hacks
Additional BigInteger tips
BigInteger does have a pow method to compute a non-negative power of any arbitrary number. If you're working in a modular ring, there are also modPow and modInverse.
You can individually setBit, flipBit or just testBit. You can get the overall bitCount, perform a bitwise and with another BigInteger, and shiftLeft/shiftRight, etc.
As bonus, you can also compute the gcd or check if the number isProbablePrime.
ALWAYS remember that BigInteger, like String, is immutable. You can't invoke a method on an instance, and expect that instance to be modified. Instead, always assign the result returned by the method to your variables.
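A tiny illustrative example of that last point (the values are arbitrary):
BigInteger x = BigInteger.valueOf(10);
x.setBit(5);           // result is discarded; x is still 10
x = x.setBit(5);       // correct: reassign the returned instance
System.out.println(x); // 42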
Just to clarify: you want the largest n-bit number (i.e., the one with all n bits set). If so, the following will do that for you:
BigInteger largestNBitInteger = BigInteger.ZERO.setBit(n).subtract(BigInteger.ONE);
This is mathematically equivalent to 2^n - 1. Your question asks how to compute 2^n, which is actually the smallest (n+1)-bit number. You can of course do that with:
BigInteger smallestNPlusOneBitInteger = BigInteger.ZERO.setBit(n);
I think there is a pow method directly in BigInteger. You can use it for your purpose.
The quickest way I can think of doing this is by using the constructor for BigInteger that takes a byte[].
BigInteger(byte[] val) constructs a BigInteger object from an array of bytes. You are, however, dealing with bits, so creating a byte[] that might consist of {127, 255, 255, 255, 255} for the 39-bit integer 2^39 - 1 might be a little tedious.
You could also use the constructor BigInteger(String val, int radix), which might make it more readily apparent what's going on in your code if you don't mind a performance hit for parsing a String. Then you could generate a string like val = "111111111111111111111111111111111111111" and then call BigInteger myInt = new BigInteger(val, 2); - resulting in the same 39-bit integer.
The first option will require some thinking about how to represent your number. That particular constructor expects a two's-complement, big-endian representation of the number. The second will likely be marginally slower, but much clearer.
EDIT: Corrected numbers. I thought you meant represent 2^n, and didn't correctly read the largest value n bits could store.
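To make the byte[] route concrete, here is a hedged sketch (the loop and names are my own; note that in Java source a byte value like 255 must be written with a cast, e.g. (byte) 0xFF, since byte is signed):
import java.math.BigInteger;

// Build a big-endian, two's-complement byte array whose n low-order bits
// are all 1; the extra leading byte keeps the sign bit clear.
int n = 39;
int numBytes = n / 8 + 1;
byte[] bytes = new byte[numBytes];
int remaining = n;
for (int i = numBytes - 1; i >= 0 && remaining > 0; i--) {
    int bitsHere = Math.min(8, remaining);
    bytes[i] = (byte) ((1 << bitsHere) - 1); // 0xFF, or 0x7F for the top byte here
    remaining -= bitsHere;
}
BigInteger largest = new BigInteger(bytes); // 2^39 - 1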

understanding of hash code

A hash function is important in implementing a hash table. I know that in Java,
Object has its own hash code, which might be generated from a weak hash function.
The following is one snippet that is a "supplemental hash function":
static int hash(Object x) {
    int h = x.hashCode();
    h += ~(h << 9);
    h ^= (h >>> 14);
    h += (h << 4);
    h ^= (h >>> 10);
    return h;
}
Can anybody help explain the fundamental idea of a hash algorithm? Is it to generate non-duplicate integers? If so, how do these bitwise operations achieve that?
A hash function is any well-defined procedure or mathematical function that converts a large, possibly variable-sized amount of data into a small datum, usually a single integer that may serve as an index to an array. The values returned by a hash function are called hash values, hash codes, hash sums, checksums or simply hashes. (wikipedia)
Using more "human" language object hash is a short and compact value based on object's properties. That is if you have two objects that vary somehow - you can expect their hash values to be different. Good hash algorithm produces different values for different objects.
What you are usually trying to do with a hash algorithm is convert a large search key into a small nonnegative number, so you can look up an associated record in a table somewhere, and do it more quickly than M log2 N (where M is the cost of a "comparison" and N is the number of items in the "table") typical of a binary search (or tree search).
If you are lucky enough to have a perfect hash, you know that any element of your (known!) key set will be hashed to a unique, different value. Perfect hashes are primarily of interest for things like compilers that need to look up language keywords.
In the real world, you have imperfect hashes, where several keys all hash to the same value. That's OK: you now only have to compare the key to a small set of candidate matches (the ones that hash to that value), rather than a large set (the full table). The small sets are traditionally called "buckets". You use the hash algorithm to select a bucket, then you use some other searchable data structure for the buckets themselves. (If the number of elements in a bucket is known, or safely expected, to be really small, linear search is not unreasonable. Binary search trees are also reasonable.)
The bitwise operations in your example look a lot like a signature analysis shift register, that try to compress a long unique pattern of bits into a short, still-unique pattern.
Basically, the thing you're trying to achieve with a hash function is to give all bits in the hash code a roughly 50% chance of being off or on given a particular item to be hashed. That way, it doesn't matter how many "buckets" your hash table has (or put another way, how many of the bottom bits you take in order to determine the bucket number)-- if every bit is as random as possible, then an item will always be assigned to an essentially random bucket.
Now, in real life, many people use hash functions that aren't that good. They have some randomness in some of the bits, but not all of them. For example, imagine if you have a hash function whose bits 6-7 are biased-- let's say in the typical hash code of an object, they have a 75% chance of being set. In this made up example, if our hash table has 256 buckets (i.e. the bucket number comes from bits 0-7 of the hash code), then we're throwing away the randomness that does exist in bits 8-31, and a smaller portion of the buckets will tend to get filled (i.e. those whose numbers have bits 6 and 7 set).
The supplementary hash function basically tries to spread whatever randomness there is in the hash codes over a larger number of bits. So in our hypothetical example, the idea would be that some of the randomness from bits 8-31 will get mixed in with the lower bits, and dilute the bias of bits 6-7. It still won't be perfect, but better than before.
If you're generating a hash table, then the main thing you want to get across when writing your hash function is to ensure uniformity, not necessarily to create completely unique values.
For example, if you have a hash table of size 10, you don't want a hash function that returns a hash of 3 over and over. Otherwise, that specific bucket will force a search time of O(n). You want a hash function such that it will return, for example: 1, 9, 4, 6, 8... and ensure that none of your buckets are much heavier than the others.
For your projects, I'd recommend that you use a well-known hashing algorithm such as MD5 or even better, SHA and use the first k bits that you need and discard the rest. These are time-tested functions and as a programmer, you'd be smart to use them.
That code is attempting to improve the quality of the hash value by mashing the bits around.
The overall effect is that for a given x.hashCode() you hopefully get a better distribution of hash values across the full range of integers. The performance of certain algorithms will improve if you started with a poor hashcode implementation but then improve hash codes in this way.
For example, hashCode() for a humble Integer in Java just returns the integer value. While this is fine for many purposes, in some cases you want a much better hash code, so putting the hashCode through this kind of function would improve it significantly.
It could be anything you want, as long as you adhere to the general contract described in the doc, which in my own words is:
If you call hashCode on an object 100 (N) times, it must return the same value every time, at least during that program execution (a subsequent program execution may return a different one).
If o1.equals(o2) is true, then o1.hashCode() == o2.hashCode() must be true also.
If o1.equals(o2) is false, then o1.hashCode() == o2.hashCode() may still be true, but it helps if it is not.
And that's it.
Depending on the nature of your class, the hashCode() may be very complex or very simple. For instance, the String class, which may have millions of instances, needs a very good hashCode implementation and uses prime numbers to reduce the possibility of collisions.
If it makes sense for your class to use a consecutive number, that's OK too; there is no reason why you should complicate it every time.
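As a minimal sketch of that contract (the class and fields are hypothetical):
final class Point {
    private final int x, y;

    Point(int x, int y) { this.x = x; this.y = y; }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Point)) return false;
        Point p = (Point) o;
        return x == p.x && y == p.y;
    }

    @Override
    public int hashCode() {
        return 31 * x + y; // built from the same fields as equals(), so equal objects hash equally
    }
}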

What is a good 64bit hash function in Java for textual strings?

I'm looking for a hash function that:
Hashes textual strings well (e.g. few collisions)
Is written in Java, and widely used
Bonus: works on several fields (instead of me concatenating them and applying the hash on the concatenated string)
Bonus: Has a 128-bit variant.
Bonus: Not CPU intensive.
Why don't you use a long variant of the default String.hashCode() (where some really smart guys certainly put effort into making it efficient, not to mention the thousands of developer eyes that have already looked at this code)?
// adapted from String.hashCode()
public static long hash(String string) {
    long h = 1125899906842597L; // prime
    int len = string.length();
    for (int i = 0; i < len; i++) {
        h = 31 * h + string.charAt(i);
    }
    return h;
}
If you're looking for even more bits, you could probably use a BigInteger.
Edit:
As I mentioned in a comment on @brianegge's answer, there are not many use cases for hashes with more than 32 bits and most likely not a single one for hashes with more than 64 bits:
I could imagine a huge hash table distributed across dozens of servers, maybe storing tens of billions of mappings. For such a scenario, @brianegge still has a valid point here: 32 bits allow for 2^32 (ca. 4.3 billion) different hash keys. Assuming a strong algorithm, you should still have quite few collisions. With 64 bits (18,446,744,073 billion different keys) you're certainly safe, regardless of whatever crazy scenario you need it for. Thinking of use cases for 128-bit keys (340,282,366,920,938,463,463,374,607,431 billion possible keys) is pretty much impossible though.
To combine the hashes of several fields, multiply one by a prime and add them:
long hash = MyHash.hash(string1) * 31 + MyHash.hash(string2);
The small prime is in there to avoid equal hash code for switched values, i.e. {'foo','bar'} and {'bar','foo'} aren't equal and should have a different hash code. XOR is bad as it returns 0 if both values are equal. Therefore, {'foo','foo'} and {'bar','bar'} would have the same hash code.
An answer for today (2018). SipHash.
It will be much faster than most of the answers here, and significantly higher quality than all of them.
The Guava library has one: https://google.github.io/guava/releases/23.0/api/docs/com/google/common/hash/Hashing.html#sipHash24--
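A minimal usage sketch, assuming Guava is on the classpath:
import com.google.common.hash.Hashing;
import java.nio.charset.StandardCharsets;

long h = Hashing.sipHash24()
                .hashString("hello world", StandardCharsets.UTF_8)
                .asLong(); // 64-bit SipHash-2-4 of the string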
Create an SHA-1 hash and then mask out the lowest 64 bits.
long hash = string.hashCode();
Yes, the top 32 bits will be 0, but you will probably run out of hardware resources before you run into problems with hash collisions. The hashCode in String is quite efficient and well tested.
Update
I think the above satisfies the simplest thing which could possibly work; however, I agree with @sfussenegger's idea of extending the existing String hashCode.
In addition to having a good hashCode for your String, you may want to consider rehashing the hash code in your implementation. If your storage is used by other developers, or used with other types, this can help distribute your keys. For example, Java's HashMap is based on power-of-two length hash tables, so it adds this function to ensure the lower bits are sufficiently distributed.
h ^= (h >>> 20) ^ (h >>> 12);
return h ^ (h >>> 7) ^ (h >>> 4);
Why not use a CRC64 polynomial? These are reasonably efficient and optimized to make sure all bits are counted and spread over the result space.
There are plenty of implementations available on the net if you google "CRC64 Java"
Do something like this:
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Test {
    public static void main(String[] args) throws NoSuchAlgorithmException,
            IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(baos);
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            SomeObject testObject = new SomeObject();
            dos.writeInt(testObject.count);
            dos.writeLong(testObject.product);
            dos.writeDouble(testObject.stdDev);
            dos.writeUTF(testObject.name);
            dos.writeChar(testObject.delimiter);
            dos.flush();
            byte[] hashBytes = md.digest(baos.toByteArray());
            BigInteger testObjectHash = new BigInteger(hashBytes);
            System.out.println("Hash " + testObjectHash);
        } finally {
            dos.close();
        }
    }

    private static class SomeObject {
        private int count = 200;
        private long product = 1235134123L;
        private double stdDev = 12343521.456d;
        private String name = "Test Name";
        private char delimiter = '\n';
    }
}
DataOutputStream lets you write primitives and Strings and have them output as bytes. Wrapping a ByteArrayOutputStream in it will let you write to a byte array, which integrates nicely with MessageDigest. You can pick from any algorithm listed here.
Finally, BigInteger will let you turn the output bytes into an easier-to-use number. MD5 produces 128-bit hashes and SHA-1 produces 160-bit hashes, so if you need 64 you can just truncate.
SHA-1 should hash almost anything well, and with infrequent collisions (it's 160-bit). This works from Java, but I'm not sure how it's implemented. It may actually be fairly fast. It works on several fields in my implementation: just push them all onto the DataOutputStream and you're good to go. You could even do it with reflection and annotations (maybe @HashComponent(order=1) to show which fields go into a hash and in what order). It's got a 128-bit variant and I think you'll find it doesn't use as much CPU as you think it will.
I've used code like this to get hashes for huge data sets (by now probably billions of objects) to be able to shard them across many backend stores. It should work for whatever you need it for. Note that I think you may want to only call MessageDigest.getInstance() once and then clone() from then on: IIRC the cloning is a lot faster.
Reverse the string to get another 32-bit hashcode and then combine the two:
String s = "astring";
long upper = ( (long) s.hashCode() ) << 32;
long lower = ( (long) s.reverse().hashCode() ) - ( (long) Integer.MIN_VALUE );
long hash64 = upper + lower;
This is pseudocode; the String.reverse() method doesn't exist and will need to be implemented some other way.
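A runnable version of that sketch, using StringBuilder for the reversal and masking the lower hash to avoid sign extension:
String s = "astring";
String reversed = new StringBuilder(s).reverse().toString();
long hash64 = (((long) s.hashCode()) << 32)
            | (reversed.hashCode() & 0xFFFFFFFFL);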
Did you look at Apache Commons Lang?
But for 64 bits (and 128) you need some tricks: the rules laid out in the book Effective Java by Joshua Bloch help you create a 64-bit hash easily (just use long instead of int). For 128 bits you need additional hacks...
DISCLAIMER: This solution is applicable if you wish to efficiently hash individual natural language words. It is inefficient for hashing longer text, or text containing non-alphabetic characters.
I'm not aware of a function but here's an idea that might help:
Dedicate 52 of the 64 bits to representing which letters are present in the String. For example, if 'a' were present you'd set bit 0, for 'b' set bit 1, for 'A' set bit 26. That way, only text containing exactly the same set of letters would have the same "signature".
You could then use the remaining 12 bits to encode the string length (or a modulo value of it) to further reduce collisions, or generate a 12 bit hashCode using a traditional hashing function.
Assuming your input is text-only, I can imagine this would result in very few collisions and would be inexpensive to compute (O(n)). Unlike the other solutions so far, this approach takes the problem domain into account to reduce collisions; it is based on the anagram detector described in Programming Pearls (see here).
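A hypothetical sketch of that scheme (the exact bit layout is my own choice): 52 bits record which letters occur, and the low 12 bits hold the length modulo 4096.
static long signatureHash(String s) {
    long letters = 0L;
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (c >= 'a' && c <= 'z') {
            letters |= 1L << (c - 'a');      // bits 0-25: lowercase letters
        } else if (c >= 'A' && c <= 'Z') {
            letters |= 1L << (26 + c - 'A'); // bits 26-51: uppercase letters
        }
    }
    return (letters << 12) | (s.length() & 0xFFF); // low 12 bits: length mod 4096
}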
