I have been reading about hashCode functions for the past couple of hours and have accumulated a few questions regarding the use of prime numbers as multipliers in custom hashCode implementations. I would be grateful for some insight regarding the following questions:
In a comment on @mattb's answer here, @hstoerr advocates using larger primes (such as 524287) instead of the common prime 31. My question is, given the following implementation of a hashCode function for a pair of elements:
@Override
public int hashCode() {
    final int prime = 31;
    int hash1 = (pg1 == null) ? 0 : pg1.hashCode();
    int hash2 = (pg2 == null) ? 0 : pg2.hashCode();
    return prime * (hash1 ^ hash2);
}
doesn't this lead to an overflow on the returned int if prime is a large number?
Assuming that the overflow is not a problem (JVM doing an automatic cast) is it better to do a bitshift instead of a cast?
I imagine the performance of the hashcode function varies significantly based on the complexity of the hashcode. Does the size of the prime multiplier not affect the performance?
Is it better/smarter/faster to use multiple primes in a custom hashcode function instead of a single multiplier? If not, is there some other advantage? See the example below from @jinguy's answer to a relevant question:
public int hashCode() {
    return a * 13 + b.hashCode() * 23 + (c ? 31 : 7);
}
where a is an int, b is a String, and c is a boolean.
How about something like long lhash = prime * (hash1 ^ hash2); and then using (int)((lhash >> 32) ^ lhash)? That's something I saw on another question here on SO, but it wasn't really explained why it was a good idea to do it like that.
Apologies in advance for the novel. Feel free to make suggestions or edit directly. --Chet
There is an overflow, but not an exception.
The danger doesn't come from losing accuracy, but losing range. Let's use a ridiculous example, where "prime" is a large power of 2, and 8-bit unsigned numbers for brevity. And assume that (hash1 ^ hash2) is 255:
"prime": 1000 0000
(hash1 ^ hash2): 1111 1111
Showing the truncated digits in brackets, our result is:
product: [0111 1111] 1000 0000
But multiplying by 128 is the same as shifting left by 7 places. So we know that whatever the value of (hash1 ^ hash2), the least-significant places of the product will have seven zeros. So if (hash1 ^ hash2) is odd (least significant bit = 1), then the result of multiplying by 128 will always be 128 (after truncating the higher digits). And if (hash1 ^ hash2) is even (LSB is 0), then the product will always be zero.
This extends to larger bit sizes. The general point is that if the lower bits of "prime" are zeros, you're doing a shift (or multiple shift + sum) operation that will give you zeros in the lower bits. And the range of the product of multiplication will suffer.
But let's try making "prime" odd, so that the least significant bit will always be 1. Think about decomposing this into shift / add operations. The unshifted value of (hash1 ^ hash2) will always be one of the summands. The least significant bits that were shifted into guaranteed uselessness by an even "prime" multiplier will now be set based on, at minimum, the bits from the original (hash1 ^ hash2) value.
Now, let's consider a value of prime which is actually prime. If it's more than 2, then we know it's odd. So the lower bits haven't been shifted into uselessness. And by choosing a sufficiently large prime, you get better distribution across the range of output values than you'd get with a smaller prime.
Try some exercises with 16-bit multiplication using 8443 (0010 0000 1111 1011) and 59 (0000 0000 0011 1011). They're both prime, and the lower bits of 59 match the lower bits of 8443. For example, if hash1 and hash2 are both ASCII character values (0 .. 255), then all of the results of (hash1 ^ hash2) * 59 will be <= 15045. This means that roughly 3/4 of the range of hash values (0..65535) for a 16-bit number goes unused.
But (hash1 ^ hash2) * 8443 is all over the map. It overflows if (hash1 ^ hash2) is as low as 8. It uses all 16 bits even for very small input numbers. There's much less clustering of hash values across the overall range, even if the input numbers are in a relatively small range.
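If you want to see that clustering for yourself, here's a throwaway sketch (mine, not from the answer) that runs the suggested 16-bit exercise over every XOR of two byte-range inputs and reports the largest truncated product for each prime:

public class PrimeSpread {
    public static void main(String[] args) {
        for (int prime : new int[] {59, 8443}) {
            int max = 0;
            for (int x = 0; x <= 255; x++) {
                for (int y = 0; y <= 255; y++) {
                    // multiply, then keep only the low 16 bits (simulating truncation)
                    max = Math.max(max, ((x ^ y) * prime) & 0xFFFF);
                }
            }
            System.out.println("prime " + prime + ": largest 16-bit hash = " + max);
        }
    }
}

With 59 the maximum stays at 15045, while with 8443 it climbs above 65000, i.e. nearly the full 16-bit range.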
Assuming that the overflow is not a problem (JVM doing an automatic cast) is it better to do a bitshift instead of a cast?
Most likely not. The JVM should translate into an efficient implementation on the host processor anyway. Integer multiplication should be implemented in hardware. And if not, the JVM is responsible for translating the operation into something reasonable for the CPU. It's very likely that the case of integer multiplication is highly optimized already. If integer multiplication is done more quickly on a given CPU as shift-and-add, the JVM should implement it that way. But it's less likely that the folks writing the JVM would care to watch for cases where multiple shift-and-add operations could have been combined into a single integer multiply.
I imagine the performance of the hashcode function varies significantly based on the complexity of the hashcode. Does the size of the prime multiplier not affect the performance?
No. The operations are the same when done in hardware regardless of the size, number of bits set, etc. It's probably a couple of clock cycles. It would vary depending on the specific CPU, but should be a constant-time operation regardless of the input values.
Is it better/smarter/faster to use multiple primes in a custom hashcode function instead of a single multiplier? If not, is there some other advantage?
Only if it reduces the possibility of collisions, and this depends on the numbers you're using. If your hash code depends on A and B and they're in the same range, you might consider using different primes or shifting one of the input values to reduce overlap between the bits. Since you're depending on their individual hash codes, and not their values directly, it's reasonable to assume that their hash codes provide good distribution, etc.
One factor that comes to mind is whether you want the hash code for (x, y) to be different from (y, x). If your hash function treats A and B in the same way, then hash(x, y) = hash(y, x). If that's what you want, then by all means use the same multiplier. If not, using a different multiplier would make sense.
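A small sketch of both styles (the multipliers 31 and 17 here are arbitrary choices, not from the answer):

static int symmetricHash(int h1, int h2) {
    return 31 * (h1 ^ h2);    // hash(x, y) == hash(y, x), always
}

static int asymmetricHash(int h1, int h2) {
    return 31 * h1 + 17 * h2; // order matters: hash(3, 7) != hash(7, 3)
}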
How about something like long lhash = prime * (hash1 ^ hash2); and then using (int)((lhash >> 32) ^ lhash)? That's something I saw on another question here on SO, but it wasn't really explained why it was a good idea to do it like that.
Interesting question. In Java, longs are 64-bit and ints are 32-bit. So this generates a hash using twice as many bits as desired, and then derives the result from the high and low bits combined.
If multiplying a number n by a prime p, and the lowermost k bits of n are all zeros, then the lowermost k bits of the product n * p will also be all zeros. This is fairly easy to see -- if you're multiplying, say, n = 0011 0000 and p = 0011 1011, then the product can be expressed as the sum of two shift operations. Or,
00110000 * p = 00100000 * p + 00010000 * p
= p << 5 + p << 4
Taking p = 59 and using unsigned 8-bit ints and 16-bit longs, here are some examples.
64: 0011 1011 * 0100 0000 = [ 0000 1110 ] 1100 0000 (192)
128: 0011 1011 * 1000 0000 = [ 0001 1101 ] 1000 0000 (128)
192: 0011 1011 * 1100 0000 = [ 0010 1100 ] 0100 0000 (64)
By just dropping the high bits of the result, the range of the resulting hash value is limited when the low bits of the non-prime multiplicand are all zeros. Whether that's an issue in a specific context is, well, context-specific. But for a general hash function it's a good idea to avoid limiting the range of output values even when there are patterns in the input numbers. And in security applications, it's even more critical to avoid anything that would let someone make inferences about the original value based on patterns in the output. Just taking the low bits reveals the exact values of some of the original bits. If we make the assumption that the operation involved multiplying an input number with a large prime, then we know that the original number had as many zeros at the right as the hash output (because the prime's rightmost bit was 1).
By XORing the high bits with the low bits, there's less consistency in the output. And more importantly, it's much harder to make guesses about the input values based on this information. Based on how XOR works, it could mean the original low bit was 0 and the high bit was 1, or the original low bit was 1 and the high bit was 0.
64: 0011 1011 * 0100 0000 = 0000 1110 1100 0000 => 1100 1110 (206)
128: 0011 1011 * 1000 0000 = 0001 1101 1000 0000 => 1001 1101 (157)
192: 0011 1011 * 1100 0000 = 0010 1100 0100 0000 => 0110 1100 (108)
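Translated back into Java, the whole technique is a sketch like this (using 31 as the prime; note that once the result is cast to int, it makes no difference whether >> or >>> is used, since only the low 32 bits survive):

static int foldedHash(int hash1, int hash2) {
    long lhash = 31L * (hash1 ^ hash2);    // 64-bit product, nothing truncated yet
    return (int) ((lhash >>> 32) ^ lhash); // fold the high half into the low half
}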
Overflow is not a problem. Hashes are constrained to a narrow value set anyway.
The first hash function you posted isn't very good. Doing return (prime * hash1) ^ hash2; instead would reduce the number of collisions in most cases.
Multiplying by a single-word int is generally very fast, and the difference between multiplying by different numbers is negligible. Plus, the execution time is dwarfed by everything else in the function anyway.
Using different prime multipliers for each part may reduce the risk of collisions.
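Applied to the pair hashCode from the top of the question, those suggestions combine into something like this sketch:

@Override
public int hashCode() {
    final int prime = 31;
    int hash1 = (pg1 == null) ? 0 : pg1.hashCode();
    int hash2 = (pg2 == null) ? 0 : pg2.hashCode();
    // multiplying hash1 before the XOR means swapping pg1 and pg2
    // usually changes the result, unlike prime * (hash1 ^ hash2)
    return (prime * hash1) ^ hash2;
}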
For example, if this is my input:
byte x=(byte) 200;
This will be the output:
-56
if this is my input:
short x=(short) 250000;
This will be the output:
-12144
I realize that the output is off because the number does not fit into the datatype, but how can I predict what this output will be in this case? In my computer science exam this may be one of the questions, and I do not understand why exactly 200 changes to -56 and so on.
The relevant aspects are what overflow looks like, and how the bits that represent the underlying data are treated.
Computers are all bits, grouped together in groups of 8; a group of 8 bits is called a byte.
byte b = 5; for example, is stored in memory as 0000 0101.
Bits can be 0. Or 1. That's it. That's where it ends. And everything is, in the end, bits. This means: That - is not a thing. Computers do not know what - is and cannot store it. We need to write code and agree on some sort of meaning to represent negative numbers.
2's complement
So what's -5 in bits? It's 1111 1011. Which seems bizarre. But it's how it works. If you write: byte b = -5;, then b will contain 1111 1011. It is because javac made that happen. Similarly, if you then call System.out.println(b), then the println method gets the bit sequence 1111 1011. Why does the println method decide to print a - symbol and then a 5 symbol? Because it's programmed that way: We all are in agreement that 1111 1011 is -5. So why is that?
Because of a really cool property - signed/unsigned irrelevancy.
The rule is 2's complement: To switch the sign (i.e. turn 5, which is 0000 0101 into -5 which is 1111 1011), you flip every bit, and then add 1 to the end result. Try it with 0000 0101 - and you'll see it's 1111 1011. This algorithm is reversible - apply the same algorithm (flip every bit, then add 1) and you can turn -5 into 5.
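The algorithm is one line in Java (a quick demo sketch):

byte five = 5;
byte minusFive = (byte) (~five + 1);         // flip every bit, then add 1
System.out.println(minusFive);               // prints -5
System.out.println((byte) (~minusFive + 1)); // reversible: prints 5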
This 2's complement thing has 2 great advantages:
There is only one 0 value. If we just flipped all bits, we'd have both 1111 1111 and 0000 0000 representing some form of 0. In basic math, there's no such thing as 'negative 0' - it's the same as positive 0. Similarly, if we just decided the first bit is the sign and the remaining 7 bits are the number, then we'd have both 1000 0000 and 0000 0000 being 0, which is annoying and inefficient: why waste 2 different bit sequences on the same number?
plus and minus are sign-mode independent. The computer doesn't have to KNOW whether we are doing the 2's complement thing or not. Take the bit sequence 1111 1011. If we treat that as unsigned bits, then that is 251 (it's 128 + 64 + 32 + 16 + 8 + 2 + 1). If we treat that as a signed number, then the first bit is 1, so the thing is negative: We apply 2's complement and figure out that it is -5. So, is it -5 or 251? It's both, at once! Depends on the human/code that interprets this bit sequence which one it is. So how could the computer possibly do a + b given this? The weird answer is: It doesn't matter - because the math works out the same way. 251 - 10 is 241. -5 - 10 is -15. -15 and 241 are the exact same bit sequence.
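You can watch Java read the same 8 bits both ways (sketch):

byte b = (byte) 0b1111_1011;  // the bit sequence from the text
System.out.println(b);        // -5  (signed interpretation)
System.out.println(b & 0xFF); // 251 (unsigned interpretation of the same bits)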
Overflow
A byte is 8 bits, and there are 256 different sequences of bits, and then you have listed each and every possible variant. (2^8 = 256. Hence, a 16-bit number can be used to convey 65536 different things, because 2^16 is 65536, and so on). So, given that bytes are 8 bits and we decreed they are signed, and 2's complement signed, that means that the smallest number you can send with it is -128, which in bits is 1000 0000 (use 2's complement to check my work), and +127, which in bits is 0111 1111. So what happens if you add 1 to 127? That'd seemingly be +128 except that's not storable in 8 bits if we decree that we interpret these bits as 2's complement signed (which java does). What happens? The bits 'roll over'. We just add 1 as normal, which turns 0111 1111 into 1000 0000 which is -128:
byte b = 127;
b = (byte)(b + 1);
System.out.println(b); // prints -128
Imagine the number line - stretching out into infinity on both ends, from -infinite to +infinite. That's the usual way math works. Computers (or rather, int, long, etc) do not work like that. Instead of a line, it is a circle. Take your infinite number line and take some scissors, and snip that number line at -128 (because a 2's comp signed byte cannot represent -129 or anything else below -128), and at +127 (because our byte cannot represent 128 or anything above it).
And now tape the 2 cut ends together.
That's the number line. What's 'to the right' of 125? 126 - that's what +1 means: Move one to the right on the number line.
What's 'to the right' of +127? Why, -128. Because we taped it together.
Similarly, -127 - 5 is +124. '-5' is 'move 5 places to the left on the number line (or rather, number circle)'. Going in 1 decrements:
-127 (we start here)
-128 (-127 -1)
+127 (-127 -2)
+126 (-127 -3)
+125 (-127 -4)
+124 (-127 -5)
Hence, 124.
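Checking that walk around the circle in code (sketch):

byte b = -127;
b = (byte) (b - 5);    // move 5 places to the left
System.out.println(b); // prints 124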
Same math applies to short (-32768 to +32767), char (which is really a 16-bit unsigned number - so 0 to 65535), int (-2147483648 to +2147483647), and even long (-2^63 to +2^63-1 - those get a little large).
short x = 32765;
x += 5;
System.out.println(x); // prints -32766.
I just started using Bluej to learn more about how computers store integers. I have a small program that I put into Bluej that sets the value of an integer called x to MAX_VALUE - 3 then adds 1 to x six times, printing out a new value each time.
One addition is incorrect, and I need help understanding which value I received is incorrect and why the results I got are "strange".
Please keep in mind I am VERY new to computer languages and am literally reading from a book about storing integers. The book I have is Computer Science 11th edition by J. Glenn Brookshear.
Here is the program I put into BlueJ:
public class Add
{
    public Add()
    {
        int i, x;
        x = java.lang.Integer.MAX_VALUE - 3;
        i = 0;
        while (i < 6) {
            x = x + 1;
            i = i + 1;
            System.out.print(x + "\n");
        }
    }
}
The values I receive are:
2147483645
2147483646
2147483647
-2147483648
-2147483647
-2147483646
My teacher says there is a problem with any integer math but I do not know what he means. I would just really like to understand why this happens.
I might also note that these numbers are very much larger than 1 and I do not know why.
Thank you all in advance for any responses!
Integers that you store with the int data type are only allocated a limited amount of space in your computer's memory. It's not possible to store every possible integer in this amount of space. So your computer will deal correctly with integers between -2147483648 and 2147483647, because those are enough for most purposes. If you want to store numbers that are outside this range, you need to use a different data type from int. For example, there's long (which has a much bigger range) and BigInteger (which is really limited only by the amount of space allocated to Java itself).
When you add 1 to the largest possible int, the "correct" answer can't fit in an int variable. This is a bit like having an abacus with only one line of beads (which can represent numbers from 0 to 9), and trying to work out 9 + 1. Your computer will roll the number over to the smallest possible int instead. So when you work with int values, the effect is that 2147483647 + 1 = -2147483648, even though mathematically this makes no sense.
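If you'd rather have the overflow fail loudly than roll over silently, the JDK has had checked arithmetic since Java 8 (sketch):

int max = Integer.MAX_VALUE;
System.out.println(max + 1);               // wraps silently to -2147483648
System.out.println(Math.addExact(max, 1)); // throws ArithmeticException instead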
There is a limit value for an integer in Java, in this case MAX_VALUE. For example, when you try to surpass that value it becomes the opposite (-2,147,483,648, MIN_VALUE). It's like completing the circle and going back to the beginning. So there is no higher value than 2,147,483,647; when you add 1 to that value you get MIN_VALUE instead. Think of it like a snake eating its own tail ;)
If your Windows calculator has a Programmer View, switch to it, click Dword, enter 2147483645, add 1 six times, and watch the bits.
An integer in Java is 32-bits and signed. This means there is one sign bit and 31 bits for the value.
Integer.MAX_VALUE = 2147483647 (base 10)
= 0111 1111 1111 1111 1111 1111 1111 1111 (base 2)
Adding 1 yields
2147483647 + 1 = 2147483648
= 1000 0000 0000 0000 0000 0000 0000 0000
Counting up, this is what you'd expect if you weren't a computer (or your number wasn't bounded by representation space). But with the int data type, we only get 32 bits, and the "first" bit (not technically correct, but it will aid understanding) tells you whether or not the value is negative.
Now Java translates this value back to base 10, and because this is a signed integer data type...
2147483647 + 1 = 2147483648
= 1000 0000 0000 0000 0000 0000 0000 0000
We read the first bit as 1, so this is a negative number and we need to take its two's complement to calculate the value.
= 1000 0000 0000 0000 0000 0000 0000 0000
negated = 0111 1111 1111 1111 1111 1111 1111 1111
+ 1 = 1000 0000 0000 0000 0000 0000 0000 0000
= 2147483648 (base 10)
so when we display this value, it's the negative value of the two's complement,
= -2147483648
The problem with "integer math" your teacher mentions is that, when your data type (Java's int, in this case) is bounded by size, your operations must make sense within its range.
Looking at Java (but probably similar or the same in other languages), a long and a double both use 8 bytes to store a value.
A long uses 8 bytes to store long integers from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807
A double uses 8 bytes to store double-precision, floating-point numbers from -1.7E308 to 1.7E308 with up to 16 significant digits.
My question is, if both use the same number of bytes (8 bytes, i.e. 64 bits), how can a double store a much larger number? 1.7E308 is a hell of a lot larger than 9,223,372,036,854,775,807.
The absolute quantity of information that you can store in 64 bit is of course the same.
What changes is the meaning you assign to the bits.
In an int or long variable, the encoding is the same positional notation you use for decimal numbers in everyday life, just in base 2, except that two's complement is used. That doesn't change much, since it's only a trick to gain one additional value (by storing a single zero instead of a positive and a negative one).
In a float or double variable, the bits are split into two parts: the mantissa and the exponent. This means that every double is shaped like XXXX YYYY, where its numerical value is something like XXXX * 2^YYYY. You simply encode the numbers in a different way: you get the same number of distinct values, but they are distributed differently over the set of real numbers.
The fact that the largest/smallest value of a floating-point number is larger/smaller than the largest/smallest value of an integer doesn't imply anything about the amount of data effectively stored.
A double can store a larger number by having larger intervals between the numbers it can store, essentially. Not every integer in the range of a double is representable by that double.
More specifically, a double has one bit (S) to store sign, 11 bits to store an exponent E, and 52 bits of precision, in what is called the mantissa (M).
For most numbers (there are some special cases), a double stores the number (-1)^S * (1 + M * 2^-52) * 2^(E - 1023), and as such, when E is large, changing M by one changes the resulting number by much more than one. These large gaps are what give doubles a larger range than longs.
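You can inspect those gaps with Math.ulp, which returns the distance from a double to the next representable double (sketch):

double d = 9_000_000_000_000_000_000.0; // near the top of the long range
System.out.println(Math.ulp(d));        // 1024.0: integers this size are spaced 1024 apart as doubles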
Long or int are signed entities which can be positive or negative but never have any decimal part.
Float or double types are used in computers to represent numbers having decimal parts.
Both long and double are 64 bits.
A long has 1 sign bit (to determine positive or negative), and the remaining 63 bits make up the number, so the range is -2^63 to 2^63 - 1.
Doubles are represented in a different way, specified by the IEEE Standard for Binary Floating-Point Arithmetic (IEEE 754), which was devised to store very large numbers in computers.
The 64 bits of a double are laid out as: [1 sign bit][11 exponent bits][52 mantissa bits]
Let's see this with an example: converting 100.25 into the binary form stored as a double.
Decimal 100.25 converted into binary is 1100100.01
Binary 1100100.01 is then normalized as 1.10010001 * 2^6
6 is the exponent component. We add a bias (offset) of 1023 to it so that both negative and positive exponents can be represented properly. So 6 + 1023 = 1029 is the biased exponent component. 100 0000 0101 is the binary representation of the exponent.
To calculate the mantissa from 1.10010001, we drop the leading 1 to the left of the binary point and take everything to the right of it (i.e. 10010001), then right-pad with zeros up to 52 bits.
So, now mantissa will be 1001 0001 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
So, finally, the 64 bits are represented as
sign bit   exponent        mantissa
0          100 0000 0101   1001 0001 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
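You can verify this breakdown with Double.doubleToLongBits, which exposes the raw 64 bits (a verification sketch):

long bits = Double.doubleToLongBits(100.25);
System.out.println(bits >>> 63);           // sign: 0
System.out.println((bits >>> 52) & 0x7FF); // biased exponent: 1029
// mantissa: 10010001 followed by 44 zeros (leading zeros not printed)
System.out.println(Long.toBinaryString(bits & 0xFFFFFFFFFFFFFL));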
Reading Delimanolis' statement above, I tested the loss of precision in long-to-double conversion: an integral value as big as 500 is ignored (see below).
long L;
double D;
L = 922_3372_0368_5477_5807L; // Long.MAX_VALUE
L -= 500;
D = L;        // long-to-double conversion rounds to the nearest double
L = (long) D; // converting back does not recover the 500 we subtracted
System.out.println("D and L: " + D + " " + L);
output
D and L: 9.223372036854776E18 9223372036854775807
Consider this snippet from the Java language specification.
class Test {
    public static void main(String[] args) {
        int i = 1000000;
        System.out.println(i * i);
        long l = i;
        System.out.println(l * l);
    }
}
The output is
-727379968
1000000000000
Why is the result -727379968 for (i*i)? Ideally it should be 1000000000000.
I know the range of an Integer is from -2147483648 to 2147483647, so obviously 1000000000000 is not in the given range.
Why does the result become -727379968?
Java (like most computer architectures these days) uses something called two's complement arithmetic, which uses the most significant bit of an integer to signify that a number is negative. If you multiply two big numbers, you end up with a number that's so big it sets that highest bit, and the result ends up negative.
Let's look at the binary:
1000000 is 1111 0100 0010 0100 0000.
1000000000000 is 1110 1000 1101 0100 1010 0101 0001 0000 0000 0000
However, the first two sections of 4 bits won't fit in an int (since an int is 32 bits wide in Java), and so they are dropped, leaving only 1101 0100 1010 0101 0001 0000 0000 0000, which is -727379968.
In other words, the result overflows for int, and you get what's left.
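You can reproduce exactly that truncation by doing the multiply in 64 bits and then casting down (sketch):

long product = 1000000L * 1000000L; // 1000000000000, computed without overflow
System.out.println((int) product);  // keeps only the low 32 bits: -727379968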
You might want to check Integer overflow as a general concept.
Overflow and underflow are handled differently depending on the language, too. Here is an article on Integer overflow and underflow in Java.
As for the reason why this is so in the Java language, as always, it's a tradeoff between simplicity in the language design and performance. But in Java Puzzlers (Puzzle 3), the authors criticize the fact that overflows are silent in Java:
The lesson for language designers is that it may be worth reducing the likelihood of silent overflow. This could be done by providing support for arithmetic that does not overflow silently. Programs could throw an exception instead of overflowing, as does Ada, or they could switch to a larger internal representation automatically as required to avoid overflow, as does Lisp. Both of these approaches may have performance penalties associated with them. Another way to reduce the likelihood of silent overflow is to support target typing, but this adds significant complexity to the type system.
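For what it's worth, Java 8 later added exactly this opt-in behavior to the JDK via the Math.*Exact methods (sketch):

int i = 1000000;
// throws java.lang.ArithmeticException: integer overflow
System.out.println(Math.multiplyExact(i, i));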
Some of the other answers correctly explain why this is happening (i.e. signed two's complement binary logic).
The actual solution to the problem and how to get the correct answer in Java when using really big numbers is to use the BigInteger class, which also works for long values.
package com.craigsdickson.scratchpad;

import java.math.BigInteger;

public class BigIntegerExample {

    public static void main(String[] args) {
        int bigInt = Integer.MAX_VALUE;
        // prints incorrect answer
        System.out.println(bigInt * bigInt);
        BigInteger bi = BigInteger.valueOf(bigInt);
        // prints correct answer
        System.out.println(bi.multiply(bi));

        long bigLong = Long.MAX_VALUE;
        // prints incorrect answer
        System.out.println(bigLong * bigLong);
        BigInteger bl = BigInteger.valueOf(bigLong);
        // prints correct answer
        System.out.println(bl.multiply(bl));
    }
}
The reasons why integer overflow occurs have already been explained in other answers.
A practical way to ensure long arithmetic in calculations is to use numeric literals with the l (or L) suffix, which declares the literal as a long.
Ordinary integer multiplication that overflows:
jshell> 1000000 * 1000000
$1 ==> -727379968
Multiplication where one of the multiplicands has l suffix that does not overflow:
jshell> 1000000 * 1000000l
$1 ==> 1000000000000
Note that longs are also prone to overflow, but the range is much greater, from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807.
I am after code, a library routine, or an algorithm that scores how close two different bit or boolean patterns are. Naturally, if they are equal then the score should be 1, while if one is all true and the other all false then the score should be 0.
Bit pattern example
The bit patterns that I will be testing are usually not exactly equal, but sometimes they are very similar.
0001 1111 0000
0000 1111 1100
0000 1110 0000
1110 0000 1111
In the above examples, 1 & 2 or 1 & 3 are pretty close; if I were to score them, perhaps the scores would be something like 96% and 95%. On the other hand, 1 & 4 would definitely get a much lower score, maybe 25%.
Note that bit patterns may be of different lengths but scoring should still be possible.
001100
000011110000
The above two patterns would be considered identical.
001100
00110000
The above two patterns would be considered close but not identical, because once "scaled" #2 is different from #1.
If the bit patterns are all the same length, just use the exclusive-or (^) operator and count how many zeroes remain.
(xor produces a zero if the two corresponding bits are the same, and a one otherwise).
If they're of different lengths, treat the bit pattern as if it were a string and use something like the Levenshtein distance algorithm.
I've been playing around with fast ways to count the number of set bits in the bitwise XOR of the two patterns (i.e. the number of differing bits). Here's what I think is the fastest way:
int num1 = 0b0001_1111_0000; // pattern 1 from the question
int num2 = 0b0000_1111_1100; // pattern 2 from the question
int diff = num1 ^ num2;
int score;
// use diff != 0 (not diff > 0) so inputs with the sign bit set still count correctly
for (score = 0; diff != 0; diff >>>= 1)
    score += diff & 1;
A score of zero means exact match (assuming results of the same length).
public static int bitwiseEditDistance(int a, int b) {
    return Integer.bitCount(a ^ b); // number of differing bits
}
Integer.bitCount is an obscure little gem of the core libraries. From its Javadoc:
Returns the number of one-bits in the two's complement binary representation of the specified int value. This function is sometimes referred to as the population count.
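Combining bitCount with the question's 1-to-0 scoring requirement gives a one-liner (sketch; assumes equal-length 32-bit patterns):

static double similarity(int a, int b) {
    // 0 differing bits -> 1.0; all 32 bits differing -> 0.0
    return 1.0 - Integer.bitCount(a ^ b) / 32.0;
}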