How do I correct my hash function? - java

This question is a result of the responses submitted to my post at CodeReview.
I have a class called Point, which is basically "intended to encapsulate a point represented in 2D space." I have overrided the hashcode() function which is as follows:
...
#Override
public int hashCode() {
int hash = 0;
hash += (int) (Double.doubleToLongBits(this.getX())
^ (Double.doubleToLongBits(this.getX()) >>> 32));
hash += (int) (Double.doubleToLongBits(this.getY())
^ (Double.doubleToLongBits(this.getY()) >>> 32));
return hash;
}
...
Let me clarify (for those who didn't check the above link) that my Point uses the two doubles: x and y to represent its coordinates.
Problem:
My Problem is evident when I run this method:
public static void main(String[] args) {
Point p1 = Point.getCartesianPoint(12, 0);
Point p2 = Point.getCartesianPoint(0, 12);
System.out.println(p1.hashCode());
System.out.println(p2.hashCode());
}
I get the Output:
1076363264
1076363264
This is clearly a problem. Basically I intend my hashcode() to return equal hashcodes for equal Points. If I reverse the order in one of the parameter declarations (i.e. swap 12 with 1 in one of them to get equal Points), I get the correct (same) result. How can I correct my approach while maintaining the quality or uniqueness of the hash?

You cannot get an integer hash code for two doubles that is unique, without some further information about the numbers being made available about the nature of the numbers in the doubles.
Why?
int is stored as a 32bit representation, double as a 64 bit representation (see the Java tutorial).
So you are trying to store 128 bits of information in a 32 bit space, so it can never give an unique hash.
However
This really isn't the purpose of a hash code, hash codes
just need to have fairly uncommon collisions to be useful.
If you
know something about the double numbers, that reduces their
entropy/information content then you could use this to compress the
number of bits they use. This will be dependant on the application
of this class that you have not discussed yet.
This is why equals
normally does not make use of the hashcode to check for equality,
use getX and getY of each Point to do the comparison instead.

Try this
public int hashCode() {
long bits = Double.doubleToLongBits(x);
int hash = 31 + (int) (bits ^ (bits >>> 32));
bits = Double.doubleToLongBits(y);
hash = 31 * hash + (int) (bits ^ (bits >>> 32));
return hash;
}
this implementation follows Arrays.hashCode(double a[]) pattern.
It produces these hash codes:
-992476223
1076364225
You can find suggestions how to write good hashCode in Effective Java Item. 9

Can't you use the code that is present in Arrays.hashCode already?
Arrays.hashCode(new double[]{x,y});
This is what guava for example is using in Objects.hashCode.
If you have Java 7, simply:
Objects.hash(x,y)

It might be a silly idea, bt you are using + which is a symetric operation and you are getting symetric problems. What if you ue a non-symmetric operation such as division (check for denominator == 0 though) or minus? Or any other that you cna find in literature or invent yourself.

Related

Very fast universal hash function for 128 bit keys

I need a very fast universal hash function for a 128-bit key. The returned value needs to be about 32 bit (well, 16 bit would be sufficient; in most cases I only need 1-4 bits actually).
Universal hash means, there are two parameters: key (128 bit) and index (64 bit). For two keys, the universal hash function needs to return different result eventually, if called with different indexes. So with a different index, the universal hash should behave like a different hash function. For x = universalHash(k, i) and y = universalHash(k, i + 1), it would be best if on average 50% of all bits are different between x and y (randomly). The same for the case if the method is called with different keys. In practise, 5% off is OK for me.
It needs to be very fast (one or two multiplications at most). It is called millions of times. Please don't say: no, you won't need it to be fast. It also needs to return different values eventually.
What I have so far (Java code, but C is (due to the lack of a 128 bit data type, the key is the composite of a and b, which are 64 bit each):
int universalHash(long a, long b, long index) {
long x = a ^ Long.rotateLeft(b, (int) index) ^ index;
int y = (int) ((x >>> 32) ^ x);
y = ((y >>> 16) ^ y) * 0x45d9f3b;
y = ((y >>> 16) ^ y) * 0x45d9f3b;
y = (y >>> 16) ^ y;
return y;
}
int universalHash2(long a, long b, long index) {
long x = Long.rotateLeft(a, (int) index) ^
Long.rotateRight(b, (int) index) ^ index;
x = (x ^ (x >>> 32)) * 0xbf58476d1ce4e5b9L;
return (int) ((x >>> 32) ^ x);
}
(The second method is actually broken for some values.)
I would like to have a hash function that is faster than those above, and is guaranteed to work in all cases (if possible provably correct, even thought that's not a strict requirement; it doesn't need to be cryptographically secure however).
I will call the universalHash method with incrementing index (first index 0, then index 1, and so on) for the same keys. It would be best if the next result could be calculated faster (e.g. without multiplication) from the previous result. But I also need to have a fast "direct access" if the index is some value (as in the example code).
Background
The problem I'm trying to solve is finding a MPHF (minimal perfect hash function) for a relatively small set of keys (up to 16 keys by directly mapping, and up to about 1024 keys by splitting into smaller subsets). For details on the algorithm, see my MinPerf project, specially the RecSplit algorithm. To support set of size 10^12 (like BBHash), I'm trying to internally use 128 bit signatures, which would simplify the algorithm.
You need a hash function that outputs 32 bits for 128 bits of inputs.
A simple way would be to just return "some" 32 bits out of the original 128 bits. There are many ways of choosing 32 bits and every choice will have collisions. But the index can decide which 32 bits to choose.
128/32 = 4, so 4 indices are enough to find at least one different bit.
For key 0 you choose the lower most 32 bits
For key 1 you choose the next 32 bits
and so on ..
The C implementation would be
uint32_t universal_hash(uint64_t key_higher, uint64_t key_lower, int index) {
// For a lack of portable 128 bit datatype we take the key in parts.
return 0xFFFFFFFF & ( index >=2 ? key_higher >> ((index - 2)*32) : key_lower >> (index*32));
}

Generating hashCode() for a custom class

I have a class named Dish and I handle it inside ArrayLists
So I had to override default hashCode() method.
#Override
public int hashCode() {
int hash =7;
hash = 3*hash + this.getId();
hash = 3*hash + this.getQuantity();
return hash;
}
When I get two dishes with id=4,quan=3 and id=5,quan=0, hashCode() for both is same;
((7*3)+4)*3+3 = 78
((7*3)+5)*3+0 = 78
What am I doing wrong? Or the magic numbers 7 and 3 I have chosen is wrong?
How do I properly override hashCode() so that it generates unique hashes?
PS: From what I searched from google and SO, people use different numbers but the same method. If the problem is with the numbers, how do I wisely choose numbers that doesn't actually increase the cost for multiplication and at the same time works well even for more number of attributes.
Say I had 7 int attributes and my second magic no. is 31, the final hash will be first magic no. * 27512614111 even if all my attributes are 0. So how do I do it without having my hashed value in billions so as keep my processor burden-free?
You can use something like this
public int hashCode() {
int result = 17;
result = 31 * result + getId();
result = 31 * result + getQuantity();
return result;
}
One more thing if your id is unique for each object then no need of using quantity while calculating hashcode.
Here is extract from Effective Java by Joshua bloch telling how implement hashcode method
Store some constant nonzero value, say, 17, in an int variable called result.
For each significant field f in your object (each field taken into account by the equals method, that is), do the following:
a. Compute an int hash code c for the field:
i. If the field is a boolean, compute (f ? 1 : 0).
ii. If the field is a byte , char, short, or int, compute (int) f .
iii. If the field is a long , compute (int) (f ^ (f >>> 32)) .
iv. If the field is a float , compute Float.floatToIntBits(f) .
v. If the field is a double, compute Double.doubleToLongBits(f) , and then hash the resulting long as in step 2.a.iii.
vi. If the field is an object reference and this class’s equals method compares the field by recursively invoking equals, recursively invoke hashCode on the field. If a more complex comparison is required, compute a “canonical representation” for this field and invoke hashCode on the canonical representation. If the value of the field is null, return 0 (or some other constant, but 0 is traditional).
vii. If the field is an array, treat it as if each element were a separate field. That is, compute a hash code for each **significant element** by applying these rules recursively, and combine these values per step 2.b. If every element in an array field is significant, you can use one of the Arrays.hashCode methods added in release 1.5.
b. Combine the hash code c computed in step 2.a into result as follows:
result = 31 * result + c;
Return result .
When you are finished writing the hashCode method, ask yourself whether equal instances have equal hash codes. Write unit tests to verify your intuition!If equal instances have unequal hash codes, figure out why and fix the problem.
Source: Effective Java by Joshua Bloch
This is perfectly OK. The hashing function is not supposed to be universally unique - it just gives a quick hint about which elements might be equal and should be checked in more depth by a call to equals().
From the name of class looks like quantity is the number of dish. So, There is chance that many times it will be zero. I would say in case getquantity() is zero use a variable say x in the hash fucntion.like this:
#Override
public int hashCode() {
int hash =7;int x =0;
if(getQuantity==0)
{
x = getQuantity+getId();
}
else
{
x = getquantity;
}
hash = 3*hash + this.getId();
hash = 3*hash + x;
return hash;
}
I believe this should reduce the collision of hash.since the getId() you have is a unique number.it makes the x a unique number too.

HashCode calculation from double

In Effective Java there is an example of Complex class. That class has overridden hashCode which uses hashDouble method I have a question about.
private int hashDouble(double val)
{
long longBits = Double.doubleToLongBits(re);
return (int) (longBits ^ (longBits >>> 32));
}
For what purpose it does (int) (longBits ^ (longBits >>> 32))?
The double value is 64 bits wide but the int returned by hash method has only 32 bit.
In order to achieve a better distribution of hash values (compared to simply striping the upper 32 bits).
The code uses XOR to incorporate the upper 32 bits (containing sign, exponent and some bits of the mantissa) aligned by right-shifting of the IEEE 754 value in the calculation.
Image source Wikipedia

Good hashCode() Implementation

The accepted answer in Best implementation for hashCode method gives a seemingly good method for finding Hash Codes. But I'm new to Hash Codes, so I don't quite know what to do.
For 1), does it matter what nonzero value I choose? Is 1 just as good as other numbers such as the prime 31?
For 2), do I add each value to c? What if I have two fields that are both a long, int, double, etc?
Did I interpret it right in this class:
public MyClass{
long a, b, c; // these are the only fields
//some code and methods
public int hashCode(){
return 37 * (37 * ((int) (a ^ (a >>> 32))) + (int) (b ^ (b >>> 32)))
+ (int) (c ^ (c >>> 32));
}
}
The value is not important, it can be whatever you want. Prime numbers will result in a better distribution of the hashCode values therefore they are preferred.
You do not necessary have to add them, you are free to implement whatever algorithm you want, as long as it fulfills the hashCode contract:
Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified. This integer need not remain consistent from one execution of an application to another execution of the same application.
If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result.
It is not required that if two objects are unequal according to the equals(java.lang.Object) method, then calling the hashCode method on each of the two objects must produce distinct integer results. However, the programmer should be aware that producing distinct integer results for unequal objects may improve the performance of hash tables.
There are some algorithms which can be considered as not good hashCode implementations, simple adding of the attributes values being one of them. The reason for that is, if you have a class which has two fields, Integer a, Integer b and your hashCode() just sums up these values then the distribution of the hashCode values is highly depended on the values your instances store. For example, if most of the values of a are between 0-10 and b are between 0-10 then the hashCode values are be between 0-20. This implies that if you store the instance of this class in e.g. HashMap numerous instances will be stored in the same bucket (because numerous instances with different a and b values but with the same sum will be put inside the same bucket). This will have bad impact on the performance of the operations on the map, because when doing a lookup all the elements from the bucket will be compared using equals().
Regarding the algorithm, it looks fine, it is very similar to the one that Eclipse generates, but it is using a different prime number, 31 not 37:
#Override
public int hashCode() {
final int prime = 31;
int result = 1;
result = prime * result + (int) (a ^ (a >>> 32));
result = prime * result + (int) (b ^ (b >>> 32));
result = prime * result + (int) (c ^ (c >>> 32));
return result;
}
A well-behaved hashcode method already exists for long values - don't reinvent the wheel:
int hashCode = Long.hashCode((a * 31 + b) * 31 + c); // Java 8+
int hashCode = Long.valueOf((a * 31 + b) * 31 + c).hashCode() // Java <8
Multiplying by a prime number (usually 31 in JDK classes) and cumulating the sum is a common method of creating a "unique" number from several numbers.
The hashCode() method of Long keeps the result properly distributed across the int range, making the hash "well behaved" (basically pseudo random).

Java convert hash to random string

I'm trying to develop a reduction function for use within a rainbow table generator.
The basic principle behind a reduction function is that it takes in a hash, performs some calculations, and returns a string of a certain length.
At the moment I'm using SHA1 hashes, and I need to return a string with a length of three. I need the string to be made up on any three random characters from:
abcdefghijklmnopqrstuvwxyz0123456789
The major problem I'm facing is that any reduction function I write, always returns strings that have already been generated. And a good reduction function will only return duplicate strings rarely.
Could anyone suggest any ideas on a way of accomplishing this? Or any suggestions at all on hash to string manipulation would be great.
Thanks in advance
Josh
So it sounds like you've got 20 digits of base 255 (the length of a SHA1 hash) that you need to map into three digits of base 36. I would simply make a BigInteger from the hash bytes, modulus 36^3, and return the string in base 36.
public static final BigInteger N36POW3 = new BigInteger(""+36*36*36));
public static String threeDigitBase36(byte[] bs) {
return new BigInteger(bs).mod(N36POW3).toString(36);
}
// ...
threeDigitBase36(sha1("foo")); // => "96b"
threeDigitBase36(sha1("bar")); // => "y4t"
threeDigitBase36(sha1("bas")); // => "p55"
threeDigitBase36(sha1("zip")); // => "ej8"
Of course there will be collisions, as when you map any space into a smaller one, but the entropy should be better than something even sillier than the above solution.
Applying the KISS principle:
An SHA is just a String
The JDK hashcode for String is "random enough"
Integer can render in any base
This single line of code does it:
public static String shortHash(String sha) {
return Integer.toString(sha.hashCode() & 0x7FFFFFFF, 36).substring(0, 3);
}
Note: The & 0x7FFFFFFF is to zero the sign bit (hash codes can be negative numbers, which would otherwise render with a leading minus sign).
Edit - Guaranteeing hash length
My original solution was naive - it didn't deal with the case when the int hash is less than 100 (base 36) - meaning it would print less than 3 chars. This code fixes that, while still keeping the value "random". It also avoids the substring() call, so performance should be better.
static int min = Integer.parseInt("100", 36);
static int range = Integer.parseInt("zzz", 36) - min;
public static String shortHash(String sha) {
return Integer.toString(min + (sha.hashCode() & 0x7FFFFFFF) % range, 36);
}
This code guarantees the final hash has 3 characters by forcing it to be between 100 and zzz - the lowest and highest 3-char hash in base 36, while still making it "random".

Categories

Resources