I'm noodling through an anagram hash function, which I've already solved several different ways; now I'm looking for extreme performance as an exercise. I already submitted a solution that passed all the given tests (beating out 100% of all competitors by at least 1ms), but I believe that although it "won", it has a weakness that just wasn't triggered: it is subject to integer overflow in a way that could affect the results.
The gist of the solution was to combine multiple commutative operations, each taking some number of bits, and concatenate them into one long variable. I chose xor, sum, and product. The xor operation cleanly fits within a fixed number of bits. The sum operation might overflow, but because overflow simply wraps around, it still arrives at the same result no matter how the letters and their corresponding values are rearranged. For example, I wouldn't worry about whether this function overflows:
private short sumHash(String s) {
    short hash = 0;
    for (char c : s.toCharArray()) {
        hash += c;
    }
    return hash;
}
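(For illustration, the packing into one long looks roughly like this; xorHash is the analogous xor loop, and the 16-bit lanes are an arbitrary choice for this sketch.)

private long combinedHash(String s) {
    // Concatenate the three commutative sub-hashes into separate 16-bit lanes of one long.
    return ((long) (xorHash(s) & 0xFFFF) << 32)
         | ((long) (sumHash(s) & 0xFFFF) << 16)
         |  (long) (productHash(s) & 0xFFFF);
}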
Where I run into trouble is in the inclusion of products. If I make a function that returns the product of a list of values (such as the character values in a String), then, at the very least, the result is ruined if the product ever overflows to exactly zero, because from then on every further multiplication leaves it at zero.
private short productHash(String s) {
    short hash = 1;
    for (char c : s.toCharArray()) {
        hash *= c;
    }
    return hash;
}
Is there any safe and performant way to avoid this weakness so that the function gains the benefit of the commutative property of multiplication to produce the same value for anagrams, but can't ever encounter a product that overflows to zero?
Sure, if you're willing to go to some lengths to do it. The simplest solution that occurs to me is to write
hash *= primes[c];
where primes is an array that maps each possible character to a distinct odd prime. Overflowing to zero can only happen if the "true" product in infinite-precision arithmetic is a multiple of 2^32 (or 2^16, for your short-sized hash), and if you're multiplying by odd primes, that's impossible.
(You do run into the problem that the hash itself will always be odd, but you could shift it right one bit to obtain a more fully mixed hash.)
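A minimal sketch of what that could look like, assuming lowercase ASCII input and a 64-bit hash (the table and the method name are illustrative, not prescriptive):

// Maps 'a'..'z' to distinct odd primes; a real table would cover every possible input character.
private static final long[] PRIMES = {
    3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43,
    47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103
};

private long primeProductHash(String s) {
    long hash = 1;
    for (char c : s.toCharArray()) {
        hash *= PRIMES[c - 'a']; // every factor is odd, so the product can never wrap to exactly zero
    }
    return hash >>> 1; // the product is always odd, so discard the constant low bit
}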
You will only hit zero if
a * b = 0 mod 2^64
which is equivalent to there being an integer k such that
a * b = k * 2^64
That is, we only get into trouble if the factors together contribute at least 64 factors of two, which requires at least some of them to be even. Therefore, the easiest solution is to ensure that all factors are odd, for instance like this:
for (char ch : chars) {
    hash *= (ch << 1) | 1;
}
This allows you to keep 63 bits of information.
Note, however, that this technique will only avoid collisions caused by overflow, not collisions caused by multipliers that share a common factor. If you wish to avoid those too, you'll need coprime multipliers, which is most easily achieved by making them prime.
The naive way to avoid overflow is to use a larger type such as int or long. However, for your purposes, modular arithmetic might make more sense. You can do (a * b) % p for a prime p to maintain commutativity. (There is some deep mathematics here called group theory, if you are interested in learning more.) You will need to limit p to be small enough that each a * b does not overflow. The easiest way to do this is to pick p so that (p - 1)^2 can still be represented in a short or whatever data type you are using.
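A minimal sketch of that idea in Java, where the prime P and the per-character mapping are illustrative choices of mine rather than part of the recipe above:

private static final int P = 32749; // a prime chosen so that (P - 1)^2 still fits in an int

private int modProductHash(String s) {
    int hash = 1;
    for (char c : s.toCharArray()) {
        int factor = (c % (P - 1)) + 1; // map each char into [1, P-1] so no factor is 0 mod P
        hash = (hash * factor) % P;     // stays in [1, P-1]: no overflow, and never collapses to zero
    }
    return hash;
}

Because each factor depends only on the character, the product modulo P is still the same for any anagram.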
I have the following piece of code:
long[] blocks = new long[(someClass.getMemberArray().length - 1) / 64 + 1];
Basically, someClass.getMemberArray() can return an array that could be much larger than 64, and the code tries to determine how many blocks of length 64 are needed for subsequent processing.
I am confused about the logic and how this works. It seems to me that just doing:
long[] blocks = new long[(int) Math.ceil(someClass.getMemberArray().length / 64.0)];
should work too, and it looks simpler.
Can someone help me understanding the -1 and +1 reasoning in the original snippet, how it works and if the ceil would fail in some cases?
As you correctly commented, the -1/+1 is required to get the correct number of blocks, including a final, only partially filled one. It effectively rounds up.
(But it has something that could be considered a bug: if the array has length 0, which would require 0 blocks, it returns 1. This is because integer division truncates towards zero on most systems, i.e. rounds up for negative numbers, so (0 - 1)/64 yields 0. However, this may be a feature if zero blocks are for some reason not allowed. It definitely requires a comment, though.)
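For instance, a quick throwaway check (not part of the original snippet) makes the rounding and the length-0 edge case visible:

// How many 64-bit blocks does (n - 1) / 64 + 1 allocate for a few array lengths?
for (int n : new int[] {0, 1, 63, 64, 65, 128, 129}) {
    System.out.println(n + " -> " + ((n - 1) / 64 + 1));
    // prints: 0 -> 1 (the edge case), 1 -> 1, 63 -> 1, 64 -> 1, 65 -> 2, 128 -> 2, 129 -> 3
}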
The reasoning for the first, original line is that it only uses integer arithmetic, which should translate into just a few basic and fast machine instructions on most computers.
The second solution involves floating-point arithmetic and a cast. Traditionally, floating-point arithmetic was MUCH slower on most processors, which probably was the reasoning behind the first solution. However, on modern CPUs with integrated floating-point support, the performance depends more on other things such as cache lines and pipelining.
Personally, I don't really like either solution, as it's not really obvious what they do. So I would suggest the following solution:
int arrayLength = someClass.getMemberArray().length;
int blockCount = ceilDiv(arrayLength, 64);
long[] blocks = new long[blockCount];
//...

/**
 * Integer division, rounding up.
 * @return the quotient a/b, rounded up.
 */
static int ceilDiv(int a, int b) {
    assert b > 0 : b; // Doesn't work for a zero or negative divisor.
    int quotient = a / b; // integer division truncates towards zero
    // If a is positive and not a multiple of b, the truncation rounded down, so round up.
    if (a % b != 0 && a > 0) {
        quotient++;
    }
    return quotient;
}
It's wordy, but at least it's clear what should happen, and it provides a general solution which works for all integers (as long as the divisor is positive). Unfortunately, most languages don't provide an elegant "round up integer division" solution.
I don't see why the -1 would be necessary, but the +1 is likely there to compensate for the division being rounded down to the nearest whole number (which happens in every case except when the result has no fractional part).
I'm playing with hash tables and using a corpus of ~350,000 English words which I'd like to try to evenly distribute. Thus, I try to fit them into an array of length 810,049 (the closest prime larger than two times the input size) and I was baffled to see that a straightforward FNV1 implementation like this:
public int getHash(String s, int mod) {
    final BigInteger MOD = new BigInteger(Integer.toString(mod));
    final BigInteger FNV_offset_basis = new BigInteger("14695981039346656037");
    final BigInteger FNV_prime = new BigInteger("1099511628211");
    BigInteger hash = new BigInteger(FNV_offset_basis.toString());
    for (int i = 0; i < s.length(); i++) {
        int charValue = s.charAt(i);
        hash = hash.multiply(FNV_prime).mod(MOD);
        hash = hash.xor(BigInteger.valueOf((int) charValue & 0xffff)).mod(MOD);
    }
    return hash.mod(MOD).intValue();
}
results in 64,000 collisions, which is a lot: basically 20% of the input. What's wrong with my implementation? Is the approach somehow flawed?
EDIT: to add to that, I've also tried and implemented other hashing algorithms like sdbm and djb2, and they all perform just the same, equally poorly. All have these ~65k collisions on this corpus. When I changed the corpus to just 350,000 integers represented as strings, a bit of variance starts to occur (like one algorithm has 20,000 collisions and another has 40,000), but the number of collisions is still astoundingly high. Why?
EDIT2: I've just tested it, and Java's built-in .hashCode() results in just as many collisions, and even if you do something ridiculously naive, like taking the hash to be the product of the char codes of all the characters modulo 810,049, it performs only about 50% worse than all those well-known algorithms (90k collisions with the naive approach vs. 60k).
Since mod is a parameter to your hash function I presume it is the range into which you want the hash normalized, i.e. for your specific use case you are expecting it to be 810,049. I assume this because:
The algorithm calls for the calculations to be done modulo 2^n, where n is the number of bits in the desired hash.
Given that the offset basis and FNV Prime are constants within the module, and are equal to the parameters for a 64-bit hash, the value of mod should also be fixed at 2^64.
Since it is not, I assume it is the desired final output range.
In other words, given a fixed offset basis and FNV Prime, there is no reason to pass in the mod parameter -- it is dictated by the other two FNV parameters.
If all the above is correct then the implementation is wrong. You should be doing the calculations mod 2^64 and applying a final remainder operation with 810,049.
Also (but this may not be important), the algorithm calls for xoring the lower 8 bits with an ASCII character, whereas you are xoring with 16 bits. I am not sure this will make a difference since for ASCII the high-order byte will be zero anyway and it will behave exactly as if you were xoring only 8 bits.
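To make that concrete, here is a sketch of FNV-1 done over Java's long, where the natural wrap-around of 64-bit arithmetic supplies the mod 2^64 for free and the table size is applied only once at the end. The constants are the standard 64-bit FNV parameters; the use of remainderUnsigned and the per-char (rather than per-octet) xor are my choices for this sketch:

private static final long FNV64_OFFSET_BASIS = 0xcbf29ce484222325L; // 14695981039346656037
private static final long FNV64_PRIME = 0x100000001b3L;             // 1099511628211

public int getHash(String s, int mod) {
    long hash = FNV64_OFFSET_BASIS;
    for (int i = 0; i < s.length(); i++) {
        hash *= FNV64_PRIME;  // overflow just wraps, which is exactly arithmetic mod 2^64
        hash ^= s.charAt(i);  // FNV-1 strictly works octet by octet; xoring the whole char is a simplification
    }
    // Reduce to the table size at the very end, treating the 64-bit hash as unsigned.
    return (int) Long.remainderUnsigned(hash, mod);
}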
I know that in Java (and probably other languages), Math.pow is defined on doubles and returns a double. I'm wondering why on earth the folks who wrote Java didn't also write an int-returning pow(int, int) method, which seems to this mathematician-turned-novice-programmer like a forehead-slapping (though obviously easily fixable) omission. I can't help but think that there's some behind-the-scenes reason based on the intricacies of CS that I just don't know, because otherwise... huh?
On a similar topic, ceil and floor by definition return integers, so how come they don't return ints?
Thanks to all for helping me understand this. It's totally minor, but has been bugging me for years.
java.lang.Math is just a port of what the C math library does.
For C, I think it comes down to the fact that CPUs have special instructions to do Math.pow for floating-point numbers (but not for integers).
Of course, the language could still add an int implementation. BigInteger has one, in fact. It makes sense there, too, because pow tends to result in rather big numbers.
ceil and floor by definition return integers, so how come they don't return ints
Floating point numbers can represent integers outside of the range of int. So if you take a double argument that is too big to fit into an int, there is no good way for floor to deal with it.
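For example (a quick illustration of mine, not from the original answer):

double big = 1e18;                    // an integral value, exactly representable as a double
System.out.println(Math.floor(big));  // 1.0E18: far outside the int range
System.out.println((int) big);        // 2147483647: the narrowing cast just clamps to Integer.MAX_VALUE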
From a mathematical perspective, you're going to overflow your int if the result is larger than 2^31 - 1, and overflow your long if it's larger than 2^63 - 1. It doesn't take much to overflow them, either.
Doubles are nice in that they can represent numbers from ~10^-308 to ~10^308 with 53 bits of precision. There may be some fringe conversion issues (for instance, above 2^53 not every integer is exactly representable as a double), but by and large you're going to get a much larger range of numbers than you would if you strictly dealt with ints or longs.
On a similar topic, ceil and floor by definition return integers, so how come they don't return ints?
For the same reason outlined above - overflow. If I have an integral value that's larger than what I can represent in a long, I'd have to use something that could represent it. A similar thing occurs when I have an integral value that's smaller than what I can represent in a long.
Optimal implementations of integer pow() and floating-point pow() are very different, and C's math library was probably developed around the time when floating-point coprocessors were a consideration. The optimal implementation of the floating-point operation is to shift the number closer to 1 (to force quicker convergence of the power series) and then shift the result back. For integer powers, a more accurate result can be had in O(log(p)) time by doing something like this:
// p is a positive integer power set somewhere above; n is the number to raise to the power p
int result = 1;
while (p != 0) {
    if ((p & 1) != 0) { // the current low bit of the exponent is set
        result *= n;
    }
    n = n * n;  // square the base
    p = p >> 1; // move on to the next bit of the exponent
}
Because all ints can be upcast to a double without loss, and the pow function on a double is no less efficient than it would be on an int.
The reason lies in the implementation of Math.pow() (a JNI call in the default implementation). The CPU has an exponentiation module which works with doubles as input and output. Why should Java convert that for you when you have much better control over this yourself?
For floor and ceil the reasons are the same, but note that:
(int) Math.floor(d) == (int) d; // d > 0
(int) Math.ceil(d) == -(int)(-d); // d < 0
This holds for most cases (no guarantees around or beyond Integer.MAX_VALUE or Integer.MIN_VALUE).
Java leaves you with
(int) Math.pow(a,b)
because the result of Math.pow may even be NaN or Infinity depending on input.
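For example, the narrowing cast has to invent something for those special values (a quick illustration of mine):

System.out.println(Math.pow(0, -1));         // Infinity
System.out.println((int) Math.pow(0, -1));   // 2147483647: Infinity clamps to Integer.MAX_VALUE
System.out.println(Math.pow(-1, 0.5));       // NaN
System.out.println((int) Math.pow(-1, 0.5)); // 0: NaN casts to zero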
Here's the sample code from Item 9:
public final class PhoneNumber {
    private final short areaCode;
    private final short prefix;
    private final short lineNumber;

    @Override
    public int hashCode() {
        int result = 17;
        result = 31 * result + areaCode;
        result = 31 * result + prefix;
        result = 31 * result + lineNumber;
        return result;
    }
}
Pg 48 states: "the value 31 was chosen because it is an odd prime. If it were even and the multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting."
I understand the concept of multiplication by 2 being equivalent to bit shifting. I also know that we'll still get an overflow (hence information loss) when we multiply a large number by a large odd prime number. What I don't get is why information loss arising from multiplication by large odd primes is preferable to information loss arising from multiplication by large even numbers.
With an even multiplier the least significant bit, after multiplication, is always zero. With an odd multiplier the least significant bit is either one or zero depending on what the previous value of result was. Hence the even multiplier is losing uncertainty about the low bit, while the odd multiplier is preserving it.
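A tiny illustration of that point (mine, not from the book):

int result = 12345;                          // some running hash value (odd in this example)
System.out.println((result * 32) & 1);       // 0: an even multiplier always clears the low bit
System.out.println((result * 31) & 1);       // 1: an odd multiplier preserves the low bit of result
System.out.println(((result + 1) * 31) & 1); // 0: so with 31 the low bit still depends on the input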
There is no such thing as a large even prime - the only even prime is 2.
That aside - the general point of using a medium-sized prime number rather than a small one like 3 or 5 is to minimize the chance that two objects will end up with the same hash value, overflow or not.
The risk of overflow is not the issue per se; the real issue is how well distributed the hashcode values will be for the set of objects being hashed. Because hashcodes are used in data structures like HashSet, HashMap etc., you want to minimize the number of objects that could potentially share the same hash code, to optimize lookup times on those collections.
I have cells whose numeric value can be anything between 0 and Integer.MAX_VALUE. I would like to color code these cells correspondingly.
If the value = 0, then r = 0. If the value is Integer.MAX_VALUE, then r = 255. But what about the values in between?
I'm thinking I need a function whose value approaches 255 as x approaches Integer.MAX_VALUE. What is this function? Or is there a better way to do this?
I could just do (value / (Integer.MAX_VALUE / 255)) but that will cause many low values to be zero. So perhaps I should do it with a log function.
Most of my values will be in the range [0, 10,000]. So I want to highlight the differences there.
The "fairest" linear scaling is actually done like this:
floor(256 * value / (Integer.MAX_VALUE + 1))
Note that this is just pseudocode and assumes floating-point calculations.
If we assume that Integer.MAX_VALUE + 1 is 2^31, and that / will give us integer division, then it simplifies to
value / 8388608
Why other answers are wrong
Some answers (as well as the question itself) suggested a variation of (255 * value / Integer.MAX_VALUE). Presumably this has to be converted to an integer, either using round() or floor().
If using floor(), the only value that produces 255 is Integer.MAX_VALUE itself. This distribution is uneven.
If using round(), 0 and 255 will each get hit half as many times as 1-254. Also uneven.
Using the scaling method I mention above, no such problem occurs.
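In Java that might look like this (a direct translation of the formula above; the method name is mine):

// Maps 0..Integer.MAX_VALUE evenly onto 0..255: 2^31 / 256 = 8388608.
static int scaleLinear(int value) {
    return value / 8388608; // same as floor(256 * value / 2^31)
}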
Non-linear methods
If you want to use logs, try this:
255 * log(value + 1) / log(Integer.MAX_VALUE + 1)
You could also just take the square root of the value (this wouldn't go all the way to 255, but you could scale it up if you wanted to).
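And a quick sketch of the log variant (again, the method name is mine):

// Logarithmic mapping: spreads low values out and compresses high ones.
static int scaleLog(int value) {
    return (int) (255.0 * Math.log(value + 1.0) / Math.log(Integer.MAX_VALUE + 1.0));
}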
I figured a log fit would be good for this, but looking at the results, I'm not so sure.
However, Wolfram|Alpha is great for experimenting with this sort of thing.
I started with that, and ended up with:
r(x) = floor(((11.5553 * log(14.4266 * (x + 1.0))) - 30.8419) / 0.9687)
Interestingly, it turns out that this gives nearly identical results to Artelius's answer of:
r(x) = floor(255 * log(x + 1) / log(2^31 + 1))
IMHO, you'd be best served with a split function for 0-10000 and 10000-2^31.
For a linear mapping of the range 0-2^32 to 0-255, just take the high-order byte. Here is how that would look using binary & and bit-shifting:
r = (value & 0xff000000) >>> 24
Using mod 256 will certainly return a value 0-255, but you won't be able to draw any grouping sense from the results - 1, 257, 513, 1025 will all map to the scaled value 1, even though they are far from each other.
If you want to be more discriminating among low values, and merge many more large values together, then a log expression will work:
r = log(value)/log(pow(2,32))*256
EDIT: Yikes, my high school algebra teacher Mrs. Buckenmeyer would faint! log(pow(2,32)) is the same as 32*log(2), and much cheaper to evaluate. And now we can also factor this better, since 256/32 is a nice even 8:
r = 8 * log(value)/log(2)
log(value)/log(2) is actually log-base-2 of value, which log does for us very neatly:
r = 8 * log(value,2)
There, Mrs. Buckenmeyer - your efforts weren't entirely wasted!
In general (since it's not clear to me if this is a Java or Language-Agnostic question) you would divide the value you have by Integer.MAX_VALUE, multiply by 255 and convert to an integer.
This works! r = value / 8421504;
8421504 is actually the 'magic' number, which equals MAX_VALUE / 255. Thus, MAX_VALUE / 8421504 = 255 (and some change, but small enough that integer math will get rid of it).
If you want a version that doesn't have magic numbers in it, this should work (with equal performance, since any good compiler will replace the expression with the actual value):
r = value / (Integer.MAX_VALUE / 255);
The nice part is, this will not require any floating-point values.
The value you're looking for is r = 255 * (value / Integer.MAX_VALUE), but you'd have to do the division in double (plain integer division gives 0 for everything below Integer.MAX_VALUE) and then cast the result back to an int.
Note that if you want the cells to appear progressively brighter, luminosity is not linear, so a straight mapping from value to color will not give a good result.
The Color class has a method to make a brighter color. Have a look at that.
The linear implementation is discussed in most of these answers, and Artelius' answer seems to be the best. But the best formula would depend on what you are trying to achieve and the distribution of your values. Without knowing that it is difficult to give an ideal answer.
But just to illustrate, any of these might be the best for you:
Linear distribution, each colour mapping onto a range which is 1/256th of the overall range.
Logarithmic distribution (skewed towards low values) which will highlight the differences in the lower magnitudes and diminish differences in the higher magnitudes
Reverse logarithmic distribution (skewed towards high values) which will highlight differences in the higher magnitudes and diminish differences in the lower magnitudes.
Normal distribution of incidence of colours, where each colour appears the same number of times as every other colour.
Again, you need to determine what you are trying to achieve & what the data will be used for. If you have been tasked to build this then I would strongly recommend you get this clarified to ensure that it is as useful as possible - and to avoid having to redevelop it later on.
Ask yourself the question, "What value should map to 128?"
If the answer is about a billion (I doubt that it is) then use linear.
If the answer is in the range of 10-100 thousand, then consider square root or log.
Another answer suggested this (I can't comment or vote yet). I agree.
r = log(value)/log(pow(2,32))*256
Here are a bunch of algorithms for scaling, normalizing, ranking, etc. numbers by using Extension Methods in C#, although you can adapt them to other languages:
http://www.redowlconsulting.com/Blog/post/2011/07/28/StatisticalTricksForLists.aspx
There are explanations and graphics that explain when you might want to use one method or another.
The best answer really depends on the behavior you want.
If you want each cell just to generally have a color different than the neighbor, go with what akf said in the second paragraph and use a modulo (x % 256).
If you want the color to have some bearing on the actual value (like "blue means smaller values" all the way to "red means huge values"), you would have to post something about your expected distribution of values. Since you worry about many low values being zero I might guess that you have lots of them, but that would only be a guess.
In this second scenario, you really want to distribute your likely responses into 256 "percentiles" and assign a color to each one (where an equal number of likely responses fall into each percentile).
If you are complaining that the low numbers are becoming zero, then you might want to normalize the values to 255 rather than the entire range of the values.
The formula would become:
currentValue / (max value of the set)
I could just do (value / (Integer.MAX_VALUE / 255)) but that will cause many low values to be zero.
One approach you could take is to use the modulo operator (r = value%256;). Although this wouldn't ensure that Integer.MAX_VALUE turns out as 255, it would guarantee a number between 0 and 255. It would also allow for low numbers to be distributed across the 0-255 range.
EDIT:
Funnily enough, as I test this, Integer.MAX_VALUE % 256 does result in 255 (I had originally mistakenly tested against % 255, which yielded the wrong result). This seems like a pretty straightforward solution.