signed to positive near-perfect hash - java

I have an integer type, say long, whose values are between Long.MIN_VALUE = 0x80...0 (-2^63) and Long.MAX_VALUE = 0x7f...f (2^63 - 1). I want to hash it with ~50% collision to a positive integer of the same type (i.e. between 1 and Long.MAX_VALUE) in a clean and efficient manner.
My first attempts were something like:
Math.abs(x) + 1
(x & Long.MAX_VALUE) + 1
but those and similar approaches always have problems with certain values, i.e. when x is 0 / Long.MIN_VALUE / Long.MAX_VALUE. Of course, the naive solution is to use 2 if statements, but I'm looking for something cleaner / shorter / faster. Any ideas?
Note: Assume that I'm working in Java, where there is no implicit conversion to boolean and shift semantics are defined.

The simplest approach is to zero the sign bit and then map zero to some other value:
long y = x & Long.MAX_VALUE;
return (y == 0) ? 42 : y;
This is simple, uses only one if/ternary operator, and gives a ~50% collision rate on average. There is one disadvantage: it maps 4 different values (0, 42, MIN_VALUE, MIN_VALUE+42) to one value (42). So for this value we have 75% collisions, while for other values exactly 50%.
It may be preferable to distribute collisions more evenly:
return (x == 0)? 42: (x == Long.MIN_VALUE) ? 142: x & Long.MAX_VALUE;
This code gives 67% collisions for 2 values and 50% for other values. You cannot distribute collisions more evenly, but you can choose which 2 values take the most collisions. The disadvantage is that this code uses two ifs/ternary operators.
It is possible to avoid 75% collisions on single value while using only one if/ternary operator:
long y = x & Long.MAX_VALUE;
return (y == 0) ? 42 - (x >> 7) : y;
This code gives 67% collisions for 2 values and 50% collisions for other values. There is less freedom in choosing these most-colliding values: 0 maps to 42 (and you can choose almost any value instead); MIN_VALUE maps to 42 - (MIN_VALUE >> 7) (and you can shift MIN_VALUE by any amount from 1 to 63, just make sure that A - (MIN_VALUE >> B) does not overflow).
It is possible to get the same result (67% collisions for 2 values and 50% collisions for other values) without conditional operators (but with more complicated code):
long y = x - 1 - ((x >> 63) << 1);
long z = y + 1 + (y >> 63);
return z & Long.MAX_VALUE;
This gives 67% collisions for values '1' and 'MAX_VALUE'. If it is more convenient to get most collisions for some other values, just apply this algorithm to x + A, where 'A' is any number.
An improved variant of this solution:
long y = x + 1 + ((x >> 63) << 1);
long z = y - (y >> 63);
return z & Long.MAX_VALUE;
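Wrapped up as a complete method (a sketch; the method name is mine), the improved variant reads:
public static long hashToPositive(long x) {
    long y = x + 1 + ((x >> 63) << 1); // shift non-negative x up by 1 and negative x down by 1 (with wrap-around)
    long z = y - (y >> 63);            // y >> 63 is -1 for negative y, so negative intermediates are bumped up by 1
    return z & Long.MAX_VALUE;         // clear the sign bit; the result always lands in [1, Long.MAX_VALUE]
}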

Assuming you want to collapse all values into the positive space, why not just zero the sign bit?
You can do this with a single bitwise op by taking advantage of the fact that MAX_VALUE is just a zero sign bit followed by ones e.g.
int positive = value & Integer.MAX_VALUE;
Or for longs:
long positive = value & Long.MAX_VALUE;
If you want a "better" hash with pseudo-random qualities, you probably want to pass the value through another hash function first. My favourite fast hashes are the XORshift family by George Marsaglia. These have the nice property that they map the entire int / long number space bijectively onto itself, so you will still get exactly 50% collisions after zeroing the sign bit.
Here's a quick XORshift implementation in Java:
public static final long xorShift64(long a) {
    a ^= (a << 21);
    a ^= (a >>> 35);
    a ^= (a << 4);
    return a;
}

public static final int xorShift32(int a) {
    a ^= (a << 13);
    a ^= (a >>> 17);
    a ^= (a << 5);
    return a;
}
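Combining the two steps described above (a sketch; the method name and the zero guard are mine, since xorShift64 maps 0 to 0 and the target range starts at 1):
public static long xorShiftPositive(long value) {
    long mixed = xorShift64(value) & Long.MAX_VALUE; // scramble, then clear the sign bit
    return (mixed == 0) ? 1 : mixed;                 // guard the rare inputs where the masked result is 0
}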

I would opt for the most simple, yet not totally time wasting version:
public static long positiveHash(final long hash) {
    final long result = hash & Long.MAX_VALUE;
    return (result != 0) ? result : (hash == 0 ? 1 : 2);
}
This implementation pays one conditional operation for all but two possible inputs: 0 and MIN_VALUE. Those two are assigned different value mappings with the second condition. I doubt you get a better combination of (code) simplicity and (computational) complexity.
Of course if you can live with a worse distribution, it gets a lot simpler. By restricting the space to 1/4 instead of to 1/2 -1 you can get:
public static long badDistribution(final long hash) {
    return (hash >>> 2) + 1; // unsigned shift: results lie in [1, 2^62], and every result has exactly 4 preimages
}

You can do it without any conditionals and in a single expression by using the unsigned shift operator:
public static int makePositive(int x) {
    return (x >>> 1) + (~x >>> 31);
}
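The question asks about long; the same trick carries over to 64 bits (a sketch, not part of the original answer):
public static long makePositive(long x) {
    // x >>> 1 puts non-negative x into [0, 2^62 - 1] and negative x into [2^62, 2^63 - 1];
    // ~x >>> 63 adds 1 exactly when x >= 0, so every result lands in [1, Long.MAX_VALUE]
    return (x >>> 1) + (~x >>> 63);
}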

If the value is non-negative, it can probably be used directly; otherwise, flip the sign bit:
hash = (x >= 0) ? x : x ^ Long.MIN_VALUE;
However, you should scramble this value a bit more if the values of x are correlated (meaning: similar objects produce similar values for x), maybe with
hash = Math.floorMod(a * (hash + b), Long.MAX_VALUE) + 1
for some positive constants a and b, where a should be quite large and b prevents 0 from always being mapped to 1 (Math.floorMod is used because the multiplication can overflow into a negative value, and Java's % would then go negative too). This also maps the whole thing to [1, Long.MAX_VALUE] instead of [0, Long.MAX_VALUE]. By altering the values of a and b you can also implement more complex hashing schemes such as cuckoo hashing, which needs two different hash functions.
Such a solution should definitely be preferred over one that delivers a skewed collision distribution for the same values each time it is used.
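Putting both steps together, with arbitrary illustrative constants for a and b (a sketch; the constants are not values prescribed by this answer):
public static long scrambledPositive(long x) {
    long hash = (x >= 0) ? x : x ^ Long.MIN_VALUE;            // fold negatives into [0, Long.MAX_VALUE]
    long a = 0x5DEECE66DL;                                    // arbitrary large multiplier, for illustration only
    long b = 11;                                              // arbitrary offset so 0 is not always mapped to 1
    return Math.floorMod(a * (hash + b), Long.MAX_VALUE) + 1; // result lies in [1, Long.MAX_VALUE]
}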

From the information-theoretic view, you have 2^64 values to map into 2^63 - 1 values.
As such, a mapping is trivial with a modulus operation; Math.floorMod is needed here because Java's % takes the sign of the dividend and would go negative for negative x:
y = 1 + Math.floorMod(x, 0x7fffffffffffffffL); // the constant is 2^63 - 1
This could be pretty expensive, so what else is possible?
The simple math 2^64 = 2 * (2^63 - 1) + 2 says we will have two source values mapping to each target value, except in two special cases where three go to one. Think of these as two special 64-bit values, call them x1 and x2, that each share a target with two other source values. In the floorMod expression above this occurs by "wrapping": the targets y = 1 and y = 2^63 - 1 each receive three source values, and all others receive two. Since we have to use something more complex than a plain remainder anyway, let's seek a way to map the special values wherever we like at low cost.
For illustration let's work with mapping a 4-bit signed int x in [-8..7] to y in [1..7], rather than the 64-bit space.
An easy course is to have x values in [1..7] map to themselves, then the problem reduces to mapping x in [-8..0] to y in [1..7]. Note there are 9 source values here and only 7 targets as discussed above.
There are obviously many strategies. At this point you can probably see a gazillion. I'll describe only one that's particularly simple.
Let y = 1 - x for all values except special cases x1 == -8 and x2 == -7. The whole hash function thus becomes
y = x <= -7 ? S(x) : x <= 0 ? 1 - x : x;
Here S(x) is a simple function that says where x1 and x2 are mapped. Choose S based on what you know about the data. For example if you think high target values are unlikely, map them to 6 and 7 with S(x) = -1 - x.
The final mapping is:
 x: -8 -7 -6 -5 -4 -3 -2 -1  0  1  2  3  4  5  6  7
 y:  7  6  7  6  5  4  3  2  1  1  2  3  4  5  6  7
Taking this logic up to the 64-bit space, you'd have
y = (x <= Long.MIN_VALUE + 1) ? -1 - x : x <= 0 ? 1 - x : x;
Many other kinds of tuning are possible within this framework.
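As a method, the 64-bit mapping above might be written like this (a sketch; the method name is mine):
public static long toPositive(long x) {
    if (x <= Long.MIN_VALUE + 1) return -1 - x; // the two special values map to MAX_VALUE and MAX_VALUE - 1
    return (x <= 0) ? 1 - x : x;                // [MIN_VALUE+2 .. 0] maps onto [1 .. MAX_VALUE]; positives map to themselves
}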

Just to make sure, you have a long and want to hash it to an int?
You could do...
(int) x // This results in a meaningless number, but it works
(int) (x & 0xffffffffL) // This will give you just the low order bits
(int) (x >> 32) // This will give you just the high order bits
((Long) x).hashCode() // This is the high and low order bits XORed together
If you want to keep a long you could do...
x & 0x7fffffffffffffffL // This will just ignore the sign, Long.MIN_VALUE -> 0
x & Long.MAX_VALUE // Should be the same I think
If getting a 0 is no good...
(x & 0x7ffffffffffffffeL) + 1 // This has a 75% collision rate.
Just thinking out loud...
(x & Long.MAX_VALUE) | 1 // I think this is also 75%
I think you're going to need to either be ok with 75% or get a little ugly:
(x > 0) ? x : (x == 0 || x == Long.MIN_VALUE) ? 7 : x & Long.MAX_VALUE

This seems the simplest of all:
Math.floorMod(x, Long.MAX_VALUE) + 1
(Plain % would give a non-positive result for negative x, hence floorMod.)
I would be interested in speed comparisons of all the methods given.

Just AND your input value with Long.MAX_VALUE and OR it with 1. Nothing else needed.
Ex:
long hash = (input & Long.MAX_VALUE) | 1;

Related

Understanding Java Random next Integer with bound algorithm

I'm looking at the way the Java Random library generates an integer given an upper bound, but I don't quite understand the algorithm. In the docs it says:
The algorithm is slightly tricky. It rejects values that would result
in an uneven distribution (due to the fact that 2^31 is not divisible
by n). The probability of a value being rejected depends on n. The
worst case is n=2^30+1, for which the probability of a reject is 1/2,
and the expected number of iterations before the loop terminates is 2.
But I really don't see how this implementation takes this into account, specifically the while condition in the code. To me it seems that this would (almost) always succeed with 50% success rate. Especially when looking at very low values for bound (which I think is used a lot when imposing a bound). It seems to me like the condition in the while is just checking the sign of bits, so why bother with the line they use?
public int nextInt(int bound) {
    if (bound <= 0)
        throw new IllegalArgumentException("bound must be positive");

    if ((bound & -bound) == bound)  // i.e., bound is a power of 2
        return (int)((bound * (long)next(31)) >> 31);

    int bits, val;
    do {
        bits = next(31);
        val = bits % bound;
    } while (bits - val + (bound-1) < 0);
    return val;
}
Note that bits - val + (bound-1) < 0 is actually checking whether bits - val + (bound-1) overflows. bits is always greater than or equal to val, and bound is always positive, so there is no way for the LHS to be negative under normal circumstances; it can only go negative through overflow.
We can therefore think of the < 0 check as asking whether the true value of bits - val + (bound-1) is greater than Integer.MAX_VALUE.
Let's look at bits - val + (bound - 1) as a function of bits (I plotted it on Desmos). Say bound is 100 (a small bound): the x axis is bits, the y axis is bits - val + (bound - 1), and bits itself is capped at Integer.MAX_VALUE.
At that scale, bits - val + (bound - 1) seems never to overflow. But if you zoom in a lot, you'll see that there is a tiny range of values of bits for which bits < Integer.MAX_VALUE, yet bits - val + (bound - 1) > Integer.MAX_VALUE.
For any bound greater than 1 << 30, every bits value of at least bound overflows, i.e. the entire incomplete last block is rejected; for the worst case bound = (1 << 30) + 1 that is just under half of all draws, hence the 1/2 chance of rejection the documentation mentions.
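To make the rejection concrete, here is a small sketch for bound = 100 (the specific numbers are mine, chosen to land in the incomplete last block):
int bound = 100;
int bits = Integer.MAX_VALUE - 40; // 2147483607; the last complete block of 100 values ends at 2147483599
int val = bits % bound;            // 7, because 2147483600 % 100 == 0
// bits - val + (bound - 1) is mathematically 2147483600 + 99 = 2147483699, which exceeds
// Integer.MAX_VALUE (2147483647), so the int addition wraps around to a negative number:
System.out.println(bits - val + (bound - 1) < 0); // prints true, i.e. this draw is rejected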

Why doesn't the calculation of hash in HashMap (JDK 1.8) need to consider a negative hashCode the way ConcurrentHashMap does?

In HashMap: (h = key.hashCode()) ^ (h >>> 16);
In ConcurrentHashMap: (h ^ (h >>> 16)) & HASH_BITS;
where HASH_BITS is 0x7fffffff; ANDing with HASH_BITS means the result is always non-negative.
Why doesn't the calculation of hash in HashMap (JDK 1.8) need to consider a negative hashCode the way ConcurrentHashMap does?
Ultimately, the case where the hash is negative (after spreading) does need to be considered in the HashMap case as well. It is just that this happens later in the code.
For example, in getNode (Java 8) you find this:
Node<K,V>[] tab; Node<K,V> first, e; int n; K k;
if ((tab = table) != null && (n = tab.length) > 0 &&
    (first = tab[(n - 1) & hash]) != null) {
Since tab.length is a power of 2, tab.length - 1 is a suitable bitmask for reducing hash to a subscript for the array.
You can rest assured that in every implementation of HashMap or ConcurrentHashMap there is some code that reduces the hash code to a number that is suitable for use as a subscript. It will be there ... somewhere.
But also ... don't expect the code of these classes to be easy to read. All of the collection classes have been reworked / tuned multiple times to get the best possible (average) performance over a wide range of test cases.
Actually it does handle negative index calculations. It's not evident at first look, but the calculation is there wherever the elements (key or value) are accessed:
int index = (n - 1) & hash // n is the length of the table
This simply handles negative indexing.
AFAIK, HashMap always uses arrays sized to a power of 2 (e.g. 16, 32, 64, etc.).
Let's assume we have a capacity of 256 (0x100), which is 2^8.
Subtracting 1 yields exactly the bit mask needed to bitwise-AND with the hash, producing a bucket index between 0 and length - 1:
256 - 1 = 255
0x100 - 0x1 = 0xFF
A hash of 260 (0x104) gets bitwise-anded with 0xFF to yield a bucket number of 4.
A hash of 257 (0x101) gets bitwise-anded with 0xFF to yield a bucket number of 1.
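A quick standalone check (my own snippet, not JDK code) that the mask also handles a negative hash:
int n = 16;                 // table length, always a power of two
int hash = -17;             // a negative hash value (0xFFFFFFEF in two's complement)
int index = (n - 1) & hash; // 15: only the low 4 bits survive, so the index is never negative
System.out.println(index);  // prints 15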

I don't understand what 0x7fffffff means. Is there any other way to code the getHashValue method?

public int getHashValue(K key) {
    return (key.hashCode() & 0x7fffffff) % size;
}
I don't understand what 0x7fffffff means. Is there any other way to code the getHashValue method?
The constant 0x7FFFFFFF is a 32-bit integer in hexadecimal with all but the highest bit set.
Despite the name, this method isn't getting the hashCode, rather looking for which bucket the key should appear in for a hash set or map.
When you use % on a negative value, you can get a negative result. There are no negative buckets, so to avoid this you can remove the sign bit (the highest bit); one way of doing this is to use a mask, e.g. x & 0x7FFFFFFF, which keeps all the bits except the top one. Another way is to shift the value right with x >>> 1, however this is slower.
A slightly better approach is to take the modulus first and then apply Math.abs. This uses all the bits of the hashCode, which might be better.
e.g.
public int getBucket(K key) {
return Math.abs(key.hashCode() % size);
}
Even this is not ideal, as some hashCode() implementations have a poor distribution, resulting in a higher collision rate. You might want to agitate the hash code before taking the modulus, e.g.
public int getBucket(K key) {
    return Math.abs(hash(key) % size);
}
HashMap in Java 8 uses this:
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}
This function can be kept simple because Java 8's HashMap handles heavy collisions more efficiently (large bins are converted to trees). In Java 7 it used this function:
static int hash(int h) {
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
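Putting the pieces together, a bucket computation that both agitates the hash code and avoids a negative index could look like this (a sketch; K and size are as in the question's code):
public int getBucket(K key) {
    int h = (key == null) ? 0 : key.hashCode();
    h ^= (h >>> 16);                // spread the high bits into the low bits (Java 8 style)
    return (h & 0x7fffffff) % size; // clear the sign bit, then reduce to a bucket index
}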
That's the hexadecimal representation of Integer.MAX_VALUE.
ANDing with 0x7fffffff simply clears the sign bit, leaving the lower 31 bits of the number unchanged.
Using a Java REPL you can see the results of these operations:
> -7 & 0x7fffffff
java.lang.Integer res1 = 2147483641
> 2147483641 & 0x7fffffff
java.lang.Integer res2 = 2147483641
> 2147483641 + 6
java.lang.Integer res3 = 2147483647
> 7 + 2147483641
java.lang.Integer res4 = -2147483648
In the binary representation, the first bit is the sign bit. If you set it to zero, a negative number becomes a (large) positive one, and a positive number stays the same.

Very fast universal hash function for 128 bit keys

I need a very fast universal hash function for a 128-bit key. The returned value needs to be about 32 bit (well, 16 bit would be sufficient; in most cases I only need 1-4 bits actually).
Universal hash means, there are two parameters: key (128 bit) and index (64 bit). For two keys, the universal hash function needs to return different results eventually, if called with different indexes. So with a different index, the universal hash should behave like a different hash function. For x = universalHash(k, i) and y = universalHash(k, i + 1), it would be best if on average 50% of all bits are different between x and y (randomly). The same for the case if the method is called with different keys. In practice, 5% off is OK for me.
It needs to be very fast (one or two multiplications at most). It is called millions of times. Please don't say: no, you won't need it to be fast. It also needs to return different values eventually.
What I have so far (Java code, but C is similar; due to the lack of a 128-bit data type, the key is the composite of a and b, which are 64 bits each):
int universalHash(long a, long b, long index) {
    long x = a ^ Long.rotateLeft(b, (int) index) ^ index;
    int y = (int) ((x >>> 32) ^ x);
    y = ((y >>> 16) ^ y) * 0x45d9f3b;
    y = ((y >>> 16) ^ y) * 0x45d9f3b;
    y = (y >>> 16) ^ y;
    return y;
}

int universalHash2(long a, long b, long index) {
    long x = Long.rotateLeft(a, (int) index) ^
             Long.rotateRight(b, (int) index) ^ index;
    x = (x ^ (x >>> 32)) * 0xbf58476d1ce4e5b9L;
    return (int) ((x >>> 32) ^ x);
}
(The second method is actually broken for some values.)
I would like to have a hash function that is faster than those above and is guaranteed to work in all cases (if possible provably correct, even though that's not a strict requirement; it doesn't need to be cryptographically secure, however).
I will call the universalHash method with incrementing index (first index 0, then index 1, and so on) for the same keys. It would be best if the next result could be calculated faster (e.g. without multiplication) from the previous result. But I also need to have a fast "direct access" if the index is some value (as in the example code).
Background
The problem I'm trying to solve is finding an MPHF (minimal perfect hash function) for a relatively small set of keys (up to 16 keys by direct mapping, and up to about 1024 keys by splitting into smaller subsets). For details on the algorithm, see my MinPerf project, especially the RecSplit algorithm. To support sets of size 10^12 (like BBHash), I'm trying to internally use 128 bit signatures, which would simplify the algorithm.
You need a hash function that outputs 32 bits for 128 bits of inputs.
A simple way would be to just return "some" 32 bits out of the original 128 bits. There are many ways of choosing 32 bits and every choice will have collisions. But the index can decide which 32 bits to choose.
128/32 = 4, so 4 indices are enough: two different keys must differ in at least one of the four 32-bit slices.
For index 0 you choose the lowest 32 bits,
for index 1 you choose the next 32 bits,
and so on.
The C implementation would be
uint32_t universal_hash(uint64_t key_higher, uint64_t key_lower, int index) {
    // For lack of a portable 128 bit datatype we take the key in two parts.
    return 0xFFFFFFFF & (index >= 2 ? key_higher >> ((index - 2) * 32)
                                    : key_lower >> (index * 32));
}
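Since the question is in Java, the same slice-selection idea might be transcribed as follows (my sketch, not part of the original answer):
static int universalHash(long keyHigher, long keyLower, int index) {
    // index 0..3 selects one of the four 32-bit slices of the 128-bit key
    long slice = (index >= 2) ? (keyHigher >>> ((index - 2) * 32))
                              : (keyLower >>> (index * 32));
    return (int) slice; // keep the low 32 bits of the selected slice
}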

Ignoring the first bit using bitwise compare with permission model

I have the bit patterns 101 and 110. I want to compare them using some bitwise operator while ignoring the first bit, i.e. as 01 and 10.
Example:
I have:
101
110
===
01 & 10 <- how I want them to be treated
x00 <- The result I want
or
10110
11011
=====
0110 & 1011 <- how I want them to be treated
x0010 <- The result I want
How could I achieve this using bitwise operators in Java?
Details:
The first bit will always be 1.
The other bits are variable. Both sides of the comparison will have the same number of bits.
I just want to know how to make the comparison using the other bits while ignoring the first.
Use case:
I have 2 permission values. The first is 5/101 (The permission required) and the second is 6/110 (The permission the user has).
Excluding the first block, which will always be 1, I want to compare the third block that represents a certain permission rule in the system (using bitwise).
"The permission required" bitmask means:
1 - An always fixed value I use to be able to consider the left padding zeroes (unless there is another way to achieve this);
0 - Another permission rule useless for this comparison (let's call permission 1);
1 - The needed permission for the current permission rule (let's call permission 2).
"The permission the user has" means:
1 - A fixed value to be striped out;
1 - Represents the value of the user for the permission 1;
0 - Represents the value of the user for the permission 2. The permission 2 has the value 1 but the user has 0 then he is NOT allowed to the required action. The opposite would be ALLOWED to execute the action.
Any better solution for this case will be considered a correct answer also.
If you know the number of useful bits (e.g numofbits = 5) then the bitmask for the expression is:
bitmask = (1 << numofbits) - 1
If you don't know numofbits, just loop with num = num >> 1 and count the iterations until num == 0.
For the use case:
result = (req_roles & user_roles) & (bitmask >> 1)
This simply ANDs the role bits and cuts off the upper bit (which is always 1).
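With the question's example values, the use case works out like this (a small sketch; the variable names are mine):
int numofbits = 3;                  // both values are 3 bits wide, leading bit always 1
int bitmask = (1 << numofbits) - 1; // 0b111
int reqRoles = 0b101;               // the permission required
int userRoles = 0b110;              // the permission the user has
int result = (reqRoles & userRoles) & (bitmask >> 1); // 0b100 & 0b011 == 0, so permission 2 is not granted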
Previous answer for previous question :) :
If you know the bitmask for the highest number (e.g. bitmask = 0x1f (11111 in bits)) then you want the result of the following expression:
result = (a ^ b) ^ (bitmask >> 1)
What does it do?
It compares all bits; equal bits become 0.
It then inverts all the lower bits, so equal bits become 1 (the high bit is left out of the mask, so it remains 0).
Just AND the arguments with a mask that has the first bit off, e.g. 011 & arg, before you compare them.
Edit: after restated question.
The alternative is to use role-based permissions; these are far more flexible and easier to understand than boolean permission strings, and they are self-documenting. Bit-string-based permissions are rarely used except where memory or disk space is at a premium, as when Unix was developed back in the early '70s, or in embedded systems.
Try this:
// tester 1
int x, y, z, mask;
x = 0x05; // 101
y = 0x06; // 110
mask = getMask(x, y);
z = (mask & (x & y));
System.out.println(String.format("mask: %x result: %x", mask, z));

// tester 2
int x, y, z, mask;
x = 0x16; // 10110
y = 0x1B; // 11011
mask = getMask(x, y);
z = (mask & (x & y));
System.out.println(String.format("mask: %x result: %x", mask, z));

private int getMask(final int x, final int y) {
    int mask = findHighOrderOnBit(x, 0);
    mask = findHighOrderOnBit(y, mask) - 1;
    return mask;
}

private int findHighOrderOnBit(final int target, final int otherMask) {
    int result = 0x8000;
    for (int x = 0; x != 16; x++) {
        if ((result & target) > 0)
            break;
        result >>= 1;
    }
    if (otherMask > result)
        result = otherMask;
    return result;
}
