Is String.hashCode() inefficient? [closed]


Closed. This question is opinion-based. It is not currently accepting answers.
Closed 9 years ago.
When looking at the source code of java.lang.String in openjdk-1.6, I saw that String.hashCode() uses 31 as its prime multiplier and computes
s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
The reason I looked at this was a question I had in mind: would comparing hashCodes in String.equals make String.equals significantly faster? But looking at hashCode now, the following questions come to mind:
Wouldn't a bigger prime help avoid collisions better, at least for short strings, seeing that for example "BC" has the same hash as "Ab" (since the ascii letters live in the region 65-122, wouldn't a prime higher than that work better)?
Is it a conscious decision to use 31 as prime, or just some random one that is used because it is common?
How likely is a hash collision, given a fixed String length? Where this is heading is the original question of how well comparing hashCodes and String lengths could already tell strings apart, to avoid comparing the actual contents.
a little off-topic, maybe: Is there a good reason String.equals does not compare hashCodes as additional shortcut?
a little more off-topic: assuming we have two Strings with the same content, but different instances: is there any way to assert equality without actually comparing the contents? I would guess not, since someway into String lengths, the space explodes into sizes where we will inevitably have collisions, but what about some restrictions - only a certain character set, a maximum string length... and how much do we need to restrict the string space to be able to have such a hash function?
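As a rough illustration of the formula in the question, the sketch below recomputes the polynomial by hand (using Horner's rule) and compares it with String.hashCode(). The class and method names are made up for this example; only String.hashCode() itself comes from the JDK.

```java
final class PolyHashDemo {
    // Horner's rule form of s[0]*31^(n-1) + ... + s[n-1].
    // int arithmetic wraps silently on overflow, just like String.hashCode().
    static int polyHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(polyHash("BC") == "BC".hashCode()); // true
        System.out.println("BC".hashCode()); // 2113 = 66*31 + 67
        System.out.println("Ab".hashCode()); // 2113 = 65*31 + 98
    }
}
```

The last two lines show the collision mentioned in the question: both two-character strings land on 2113.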

Wouldn't a bigger prime help avoid collisions better, at least for short strings, seeing that for example "BC" has the same hash as "Ab" (since the ascii letters live in the region 65-122, wouldn't a prime higher than that work better)?
Each character in a String can take 65536 values (2^16). The set of Strings of 1 or 2 characters is therefore larger than the number of int values (2^16 + 2^32 > 2^32), and any hashcode calculation will produce collisions among strings that are 1 or 2 characters long (which qualify as short strings, I suppose).
If you restrict your character set, you can find hash functions that reduce the number of collisions (see below).
Note that a good hash must also provide a good distribution of output. A comment buried in this code advocates using 33 and gives the following reasons (emphasis mine):
If one compares the chi^2 values [...] of the variants the number 33 not even has the best value. But the number 33 and a few other equally good numbers like 17, 31, 63, 127 and 129 have nevertheless a great advantage to the remaining numbers in the large set of possible multipliers: their multiply operation can be replaced by a faster operation based on just one shift plus either a single addition or subtraction operation. And because a hash function has to both distribute good and has to be very fast to compute, those few numbers should be preferred.
Now these formulae were designed a while ago. Even if it appeared now that they are not ideal, it would be impossible to change the implementation because it is documented in the contract of the String class.
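To make the "one shift plus one addition or subtraction" remark in the quoted comment concrete, here is a minimal sketch. The helper names times31 and times33 are invented for illustration; modern JIT compilers may well perform this rewrite on their own.

```java
final class ShiftMultiplyDemo {
    // 31 * h rewritten as (32 * h) - h: one shift and one subtraction.
    static int times31(int h) {
        return (h << 5) - h;
    }

    // 33 * h rewritten as (32 * h) + h: one shift and one addition.
    static int times33(int h) {
        return (h << 5) + h;
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 12345, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int h : samples) {
            // Both identities hold even when the multiplication overflows,
            // because int arithmetic is performed modulo 2^32.
            System.out.println((times31(h) == 31 * h) + " " + (times33(h) == 33 * h));
        }
    }
}
```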
Is it a conscious decision to use 31 as prime, or just some random one that is used because it is common?
Why does Java's hashCode() in String use 31 as a multiplier?
How likely is a hash collision, given a fixed String length?
Assuming each possible int value is equally likely to be the result of the hashcode function, the probability that two given distinct strings collide is 1 in 2^32.
Is there a good reason String.equals does not compare hashCodes as additional shortcut?
Why does the equals method in String not use hash?
assuming we have two Strings with the same content, but different instances: is there any way to assert equality without actually comparing the contents?
Without any constraint on the string, there isn't. You could intern the strings then check for reference equality (==), but if many strings are involved, that can be inefficient.
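A small sketch of the interning approach, for illustration only (the class and method names are hypothetical). Note that intern() itself has to look the string up in the pool, so the content is still examined once per string; the saving only appears when the same strings are compared many times.

```java
final class InternDemo {
    // After interning, equal contents are represented by one pooled instance,
    // so reference equality is enough.
    static boolean sameAfterIntern(String a, String b) {
        return a.intern() == b.intern();
    }

    public static void main(String[] args) {
        String a = new String("hello");
        String b = new String("hello");
        System.out.println(a == b);                // false: two distinct instances
        System.out.println(a.equals(b));           // true: same contents
        System.out.println(sameAfterIntern(a, b)); // true: both map to the pooled instance
    }
}
```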
how much do we need to restrict the string space to be able to have such a hash function?
If you only allow lowercase letters (26 characters), you could design a hash function that generates unique hashes for any string of length 0 to 6 characters (inclusive), since sum(i=0..6) (26^i) is about 3.2 * 10^8, which is less than 2^32.
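One possible way to build such a collision-free function is sketched below, under the assumption that only the letters a to z are allowed. It numbers the strings in length-then-alphabetical order; the class and method names are invented for this example.

```java
final class LowercasePerfectHash {
    // Maps every string over a-z with length 0..6 to a distinct int:
    // first count all shorter strings, then add the string read as a base-26 number.
    static int index(String s) {
        if (s.length() > 6) {
            throw new IllegalArgumentException("at most 6 characters supported");
        }
        int offset = 0; // number of strings shorter than s: 26^0 + ... + 26^(len-1)
        int power = 1;  // 26^i
        for (int i = 0; i < s.length(); i++) {
            offset += power;
            power *= 26;
        }
        int value = 0;  // s interpreted as a base-26 number, 'a' = 0 ... 'z' = 25
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 'a' || c > 'z') {
                throw new IllegalArgumentException("only a-z supported: " + c);
            }
            value = value * 26 + (c - 'a');
        }
        return offset + value;
    }

    public static void main(String[] args) {
        System.out.println(index(""));   // 0
        System.out.println(index("a"));  // 1
        System.out.println(index("z"));  // 26
        System.out.println(index("aa")); // 27
    }
}
```

The largest value, for "zzzzzz", stays well below Integer.MAX_VALUE, so nothing overflows.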

Related

hashCode() for string returning negative value [duplicate]

This question already has answers here:
HashCode giving negative values
(3 answers)
Closed 8 years ago.
"random".hashCode() returns a value -938285885. Are negative values expected for hashCode()?
According to the following question, there's a way the hashCode() for a string is computed, but using that, won't the value keep increasing as the length of the string increases and eventually exceed Integer.MAX_VALUE?

Are negative values expected for hashCode()?
They're entirely valid, yes.
won't the value keep increasing as the length of string increase and eventually be greater than Integer.MAX_VALUE?
What makes you think that hash codes increase as the length of the string increases?
Basically, you should think of hash codes as fingerprints - collections of bits rather than numbers with a meaningful magnitude. Hash code calculations very often overflow or underflow, and that's absolutely fine. "More" or "less" are irrelevant comparisons between hash codes - all that's relevant is "equal" or "not equal", where the rules are that the hash codes for two equal values must be equal, but the hash codes for two non-equal values may still be equal. The numeric values are relevant in terms of bucketing, but that's usually an implementation detail of whatever's using them.
A hash code is just a quick way of finding values which are definitely not equal. So consider a situation where I have a set of strings with hash codes { 1, -15, 20, 5, 100 }. If I'm given a string with a hash code of 14, I know that string definitely isn't in the set. If I'm given a string with a hash code of 20, I need to check it with equals against the string in my set with a hash code of 20, to see whether or not the candidate string is in the set.
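The following sketch shows both points: the hash code of "random" is negative because the int arithmetic wrapped around, and a consumer that needs a bucket index can still derive a non-negative one. The bucketFor helper is hypothetical; real collections such as HashMap use their own internal scheme.

```java
final class BucketDemo {
    // Math.floorMod keeps the result in [0, buckets) even for negative hash codes,
    // whereas the % operator would return a negative remainder.
    static int bucketFor(String key, int buckets) {
        return Math.floorMod(key.hashCode(), buckets);
    }

    public static void main(String[] args) {
        System.out.println("random".hashCode()); // -938285885
        System.out.println(bucketFor("random", 16));
    }
}
```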

Unique strings in java

Is there a function in Java that takes in two strings and generates one 16-character string which is unique to the given combination? I don't expect the string to be 100% unique, as long as the probability of having 2 conflicting strings is very small (1 in 100,000 for example). Thanks.

You can concatenate both strings and hash them.
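A minimal sketch of the concatenate-and-hash idea, under a few assumptions that are not in the answer: SHA-256 as the hash, hexadecimal output truncated to 16 characters (64 bits), and a length prefix so that ("ab", "c") and ("a", "bc") do not produce the same input. The names PairKey and key16 are made up.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class PairKey {
    // Returns a 16-character hex string derived from both inputs.
    // Collisions remain possible, just very unlikely.
    static String key16(String a, String b) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            // The length prefix makes the concatenation unambiguous.
            String joined = a.length() + ":" + a + b;
            byte[] digest = md.digest(joined.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(16);
            for (int i = 0; i < 8; i++) {
                // Mask to 0..255 so each byte prints as exactly two hex digits.
                sb.append(String.format("%02x", digest[i] & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(key16("ab", "c"));
        System.out.println(key16("a", "bc"));
    }
}
```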

If it needs to be truly unique, then a concatenation of the strings which is 16 characters or less is your answer.
Else you have to rely on hashing. But that comes with no guarantee of a probability of clashing.
Your best bet is to use a GUID.

Java algorithm for evenly distributing ranges of strings into buckets [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 9 years ago.
Short version - I'm looking for a Java algorithm that given a String and an integer representing a number of buckets returns which bucket to place the String into.
Long version - I need to distribute a large number of objects into bins, evenly (or approximately evenly). The number of bins/buckets will vary, so the algorithm can't assume a particular number of bins. It may be 1, 30, or 200. The key for these objects will be a String.
The String has some predictable qualities that are important. The first 2 characters of the string actually appear to be a hex representation of a byte. i.e. 00-ff , and the strings themselves are quite evenly distributed within that range. There are a couple of outliers that start differently though, so this can't be relied on 100% (though easily 99.999%). This just means that edge cases do need to be handled.
It's critical that once all the strings have been distributed that there is zero overlap in range between what values appear in any 2 bins. So, that if I know what range of values appear in a bin, I don't have to look in any other bins to find the object. So for example, if I had 2 bins, it could be that bin 0 has Strings starting with letters a-m and bin 1 starting with n-z. However, that wouldn't satisfy the need for even distribution given what we know about the Strings.
Lastly, the implementation can have no knowledge of the current state of the bins. The method signature should literally be:
public int determineBucketIndex(String key, int numBuckets);
I believe that the foreknowledge about the distribution of the Strings should be sufficient.
EDIT: Clarifying for some questions
Number of buckets can exceed 256. The strings do contain additional characters after the first 2, so this can be leveraged.
The buckets should hold a range of Strings to enable fast lookup later. In fact, that's why they're being binned to begin with. With only the knowledge of ranges, I should be able to look in exactly 1 bucket to see if the value is there or not. I shouldn't have to look in others.
Hashcodes won't work. I need the buckets to contain only String within a certain range of the String value (not the hash). Hashing would lose that.
EDIT 2: Apparently not communicating well.
After bins have been chosen, these values are written out to files. 1 file per bin. The system that uses these files after binning is NOT Java. It's already implemented, and it needs values in the bins that fit within a range. I repeat, hashcode will not work. I explicitly said the ranges for strings cannot overlap between two bins, using hashcode cannot work.

I have read through your question twice and I still don't understand the constraints. Therefore, I am making a suggestion here and you can give feedback on it. If this won't work, please explain why.
First, do some math on the number of bins to determine how many bits you need for a unique bin number: take the base-2 logarithm of the number of bins and round up. Then divide that number of bits by 8 and round up again. The result is the number of bytes of data you need, numBytes.
Take the first two letters and convert them to a byte. Then grab numBytes - 1 characters and convert them to bytes. Take the ordinal value of the character ('A' becomes 65, and so on). If the next characters could be Unicode, pick some rule to convert them to bytes... probably grab the least significant byte (modulus by 256). Get numBytes bytes total, including the byte made from the first two letters, and convert to an integer. Make the byte from the first two letters the least significant 8 bits of the integer, the next byte the next 8 significant bits, and so on. Now simply take the modulus of this value by the number of bins, and you have an integer bin number.
If the string is too short and there are no more characters to turn into byte values, use 0 for each missing character.
If there are any predictable characters (for example, the third character is always a space) then don't use those characters; skip past them.
Now, if this doesn't work for you, please explain why, and then maybe we will understand the question well enough to answer it.

answer edited after 2 updates to original post
It would have been an excellent idea to include all the information in your question from the start - with your new edits, your description already gives you the answer: stick your objects into a Balanced Tree (giving you the homogenous distribution you say you need) based on the hashCode for your string's substring(0,2) or something similarly head-based. Then write each leaf (being a set of strings) in the BTree to file.

I seriously doubt that the problem, as described, can be done perfectly. How about this:
Create 257 bins.
Put all normal Strings into bins 0-255.
Put all the outliers into bin 256.
Other than the "even distribution", doesn't this meet all your requirements?
At this point, if you really want more even distribution, you could reorganize bins 0-255 into a smaller number of more evenly distributed bins. But I think you may just have to lessen the requirements there.
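A hedged sketch of how this suggestion could be fitted to the method signature from the question. The assumptions are mine, not the answer's: the hex prefix is consistently lowercase (so string order matches numeric order), the last bucket is reserved for outliers, and the 256 prefix values are scaled onto the remaining buckets. With more than 257 buckets some would stay empty unless further characters were used.

```java
final class RangeBuckets {
    // Maps keys to buckets so that each regular bucket covers a contiguous
    // range of two-character hex prefixes; outliers go to the last bucket.
    static int determineBucketIndex(String key, int numBuckets) {
        if (numBuckets < 1) {
            throw new IllegalArgumentException("numBuckets must be >= 1");
        }
        if (numBuckets == 1) {
            return 0; // nothing to distribute
        }
        int outlierBucket = numBuckets - 1;
        if (key.length() < 2) {
            return outlierBucket;
        }
        int hi = Character.digit(key.charAt(0), 16); // -1 if not a hex digit
        int lo = Character.digit(key.charAt(1), 16);
        if (hi < 0 || lo < 0) {
            return outlierBucket;
        }
        int prefix = hi * 16 + lo; // 0..255
        // Monotonic in prefix, so bucket ranges never overlap.
        return prefix * (numBuckets - 1) / 256;
    }

    public static void main(String[] args) {
        System.out.println(determineBucketIndex("00abc", 3)); // 0
        System.out.println(determineBucketIndex("ffabc", 3)); // 1
        System.out.println(determineBucketIndex("zzabc", 3)); // 2 (outlier)
    }
}
```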

how can i generate a unique int from a unique string?

I have an object with a String that holds a unique id .
(such as "ocx7gf" or "67hfs8")
I need to supply it with an implementation of int hashCode() which will be unique, obviously.
How do I cast a string to a unique int in the easiest/fastest way?
10x.
Edit - OK. I already know that String.hashCode() is possible. But it is not recommended anywhere. Actually, if no other method is recommended, should I use it or not if I have my object in a collection and I need the hashcode? Should I concatenate it with another string to make it more successful?

No, you don't need to have an implementation that returns a unique value, "obviously", as obviously the majority of implementations would be broken.
What you want to do, is to have a good spread across bits, especially for common values (if any values are more common than others). Barring special knowledge of your format, then just using the hashcode of the string itself would be best.
With special knowledge of the limits of your id format, it may be possible to customise and result in better performance, though false assumptions are more likely to make things worse than better.
Edit: On good spread of bits.
As stated here and in other answers, being completely unique is impossible and hash collisions are possible. Hash-using methods know this and can deal with it, but it does impact upon performance, so we want collisions to be rare.
Further, hashes are generally re-hashed, so our 32-bit number may end up being reduced to e.g. one in the range 0 to 22, and we want as good a distribution within that as possible too.
We also want to balance this with not taking so long to compute our hash, that it becomes a bottleneck in itself. An imperfect balancing act.
A classic example of a bad hash method is one for a co-ordinate pair of X, Y ints that does:
return X ^ Y;
While this does a perfectly good job of returning 2^32 possible values out of the 4^32 possible inputs, in real world use it's quite common to have sets of coordinates where X and Y are equal ({0, 0}, {1, 1}, {2, 2} and so on) which all hash to zero, or matching pairs ({2,3} and {3, 2}) which will hash to the same number. We are likely better served by:
return ((X << 16) | (X >>> 16)) ^ Y;
Now, there are just as many possible values for which this is dreadful as for the former, but it tends to serve better in real-world cases.
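The difference between the two coordinate hashes can be seen in a short sketch. It uses Integer.rotateLeft(x, 16), which swaps the upper and lower 16 bits of x; the class and method names are invented for this example.

```java
final class PointHash {
    // Plain XOR: every pair with x == y hashes to 0, and (x, y) collides with (y, x).
    static int xorHash(int x, int y) {
        return x ^ y;
    }

    // Swap the halves of x before XOR-ing, so small equal or mirrored
    // coordinates no longer cancel each other out.
    static int rotatedHash(int x, int y) {
        return Integer.rotateLeft(x, 16) ^ y;
    }

    public static void main(String[] args) {
        System.out.println(xorHash(1, 1) + " " + xorHash(2, 2)); // 0 0
        System.out.println(xorHash(2, 3) == xorHash(3, 2));      // true
        System.out.println(rotatedHash(1, 1));                   // 65537
        System.out.println(rotatedHash(2, 3) == rotatedHash(3, 2)); // false
    }
}
```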
Of course, there is a different job if you are writing a general-purpose class (no idea what possible inputs there are) or have a better idea of the purpose at hand. For example, if I was using Date objects but knew that they would all be dates only (time part always midnight) and only within a few years of each other, then I might prefer a custom hash code that used only the day, month and lower-digits of the years, over the standard one. The writer of Date though can't work on such knowledge and has to try to cater for everyone.
Hence, If I for instance knew that a given string is always going to consist of 6 case-insensitive characters in the range [a-z] or [0-9] (which yours seem to, but it isn't clear from your question that it does) then I might use an algorithm that assigned a value from 0 to 35 (the 36 possible values for each character) to each character, and then walk through the string, each time multiplying the current value by 36 and adding the value of the next char.
Assuming a good spread in the ids, this would be the way to go, especially if I made the order such that the lower-significant digits in my hash matched the most frequently changing char in the id (if such a call could be made), hence surviving re-hashing to a smaller range well.
However, lacking such knowledge of the format for sure, I can't make that call with certainty, and I could well be making things worse (slower algorithm for little or even negative gain in hash quality).
One advantage you have is that since it's an ID in itself, then presumably no other non-equal object has the same ID, and hence no other properties need be examined. This doesn't always hold.
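The base-36 walk described above might look like the sketch below. It assumes the IDs really are at most six characters from [a-z0-9], treated case-insensitively; under that assumption distinct IDs give distinct ints, because 36^6 is smaller than 2^32. The class name is hypothetical.

```java
final class Base36IdHash {
    // Multiply the running value by 36 and add the next character's value (0..35).
    static int hash(String id) {
        int h = 0;
        for (int i = 0; i < id.length(); i++) {
            // Character.digit maps 0-9 to 0..9 and a-z / A-Z to 10..35.
            int digit = Character.digit(id.charAt(i), 36);
            if (digit < 0) {
                throw new IllegalArgumentException("unexpected character: " + id.charAt(i));
            }
            // May wrap past Integer.MAX_VALUE for six-character IDs,
            // which is harmless for a hash code.
            h = h * 36 + digit;
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(hash("ocx7gf"));
        System.out.println(hash("67hfs8"));
    }
}
```

Because the last character ends up in the least significant position, this layout fits the case where the trailing characters of the ID change most often.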

You can't get a unique integer from a String of unlimited length. There are about 4 billion (2^32) unique integers, but an unbounded number of unique strings.
String.hashCode() will not give you unique integers, but it will do its best to give you differing results based on the input string.
EDIT
Your edited question says that String.hashCode() is not recommended. This is not true, it is recommended, unless you have some special reason not to use it. If you do have a special reason, please provide details.

Looks like you've got a base-36 number there (a-z + 0-9). Why not convert it to an int using Integer.parseInt(s, 36)? Obviously, if there are too many unique IDs, it won't fit into an int, but in that case you're out of luck with unique integers and will need to get by using String.hashCode(), which does its best to be close to unique.
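One caveat worth illustrating: 36^6 = 2,176,782,336 is slightly larger than Integer.MAX_VALUE (2,147,483,647), so Integer.parseInt throws a NumberFormatException for the highest six-character IDs such as "zzzzzz". A possible workaround, sketched here with a made-up helper name, is to parse as a long and keep the low 32 bits, which stays unique for six-character IDs.

```java
final class ParseBase36 {
    // Parses the ID as a base-36 number. The cast keeps the low 32 bits,
    // so large values wrap to negative ints instead of throwing.
    static int toInt(String id) {
        return (int) Long.parseLong(id, 36);
    }

    public static void main(String[] args) {
        System.out.println(Integer.parseInt("ocx7gf", 36) == toInt("ocx7gf")); // true
        System.out.println(toInt("zzzzzz")); // negative: the value wrapped around
    }
}
```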

Unless your strings are limited in some way or your integers hold more bits than the strings you're trying to convert, you cannot guarantee the uniqueness.
Let's say you have a 32 bit integer and a 64-character character set for your strings. That means six bits per character. That will allow you to store five characters into an integer. More than that and it won't fit.

Represent each character by a five-bit binary number, e.g. a by 00001, b by 00010, and so on, so that 32 combinations are possible. For example, cat would be written as 00011 00001 10100. Then convert this binary number to decimal, which gives 3124, so cat becomes 3124. Similarly, you can get cat back from 3124 by converting it to binary first and mapping each five-bit group back to its character.
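A sketch of this five-bit packing, assuming the mapping a = 1 through z = 26 and words of at most six letters (6 * 5 = 30 bits, which fits in a non-negative int). The class and method names are hypothetical.

```java
final class FiveBitCodec {
    // Packs each letter into five bits, first letter in the most significant group.
    static int encode(String word) {
        if (word.length() > 6) {
            throw new IllegalArgumentException("at most 6 letters fit into an int");
        }
        int packed = 0;
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            if (c < 'a' || c > 'z') {
                throw new IllegalArgumentException("only a-z supported: " + c);
            }
            packed = (packed << 5) | (c - 'a' + 1); // a = 1 ... z = 26
        }
        return packed;
    }

    // Reverses encode by peeling off five bits at a time from the low end.
    static String decode(int packed) {
        StringBuilder sb = new StringBuilder();
        while (packed != 0) {
            int code = packed & 31;
            sb.append((char) ('a' + code - 1));
            packed >>>= 5;
        }
        return sb.reverse().toString();
    }

    public static void main(String[] args) {
        int packed = encode("dog");
        System.out.println(packed);
        System.out.println(decode(packed)); // dog
    }
}
```

Because no letter maps to 00000, the decoder can stop as soon as the remaining value is zero.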

One way to do it is to assign each letter a value, and each place of the string its own multiple, i.e. a = 1, b = 2, and so on. Then everything in the first digit (read left to right) would be multiplied by a prime number, the next by the next prime number, and so on, such that the final digit is multiplied by a prime larger than the number of possible subsets in that digit (26+1 for a space, or 52+1 with capitals, and so on for other supported characters). If the number is mapped back to the first digits (leftmost character), any number you generate from a unique string mapping back to 1 or 6, whatever the first letter will be, gives a unique value.
Dog might be 30,3(15),101(7) or 782, while God 33,3(15),101(4) or 482. More importantly than unique strings being generated they can be useful in generation if the original digit is kept, like 30(782) would be unique to some 12(782) for the purposes of differentiating like strings if you ever managed to go over the unique possibilities. Dog would always be Dog, but it would never be Cat or Mouse.

Why does the default Object.toString() return a hex representation of the hashCode?

I'm curious why Object.toString() returns this:
return getClass().getName() + "@" + Integer.toHexString(hashCode());
as opposed to this:
return getClass().getName() + "@" + hashCode();
What benefits does displaying the hash code as a hex rather than a decimal buy you?

The Short Answer:
Hash Codes are usually displayed in hexadecimal because this way it is easier for us to retain them in our short-term memory, since hexadecimal numbers are shorter and have a larger character variety than the same numbers expressed in decimal.
Also (as supercat states in a comment), hexadecimal representation tends to prevent folks from trying to assign some meaning to the numbers, because they don't have any. (To use supercat's example, Fnord@194 is absolutely not the 194th Fnord; it is just Fnord with some unique number next to it.)
The Long Answer:
Decimal is convenient for two things:
Doing arithmetic
Estimating magnitude
However, these operations are inapplicable to hashcodes. You are certainly not going to be adding hashcodes together in your head, nor would you ever care how big a hashcode is compared to another hashcode.
What you are likely to be doing with hashcodes is the one and only thing that they were intended for: to tell whether two hash codes possibly refer to the same object, or definitely refer to different objects.
In other words, you will be using them as unique identifiers or mnemonics for objects. Thus, the fact that a hashcode is a number is in fact entirely irrelevant; you might as well think of it as a hash string.
Well, it just so happens that our brains find it a lot easier to retain in short-term memory (for the purpose of comparison) short strings consisting of 16 different characters, than longer strings consisting of only 10 different characters.
To further illustrate the analogy by taking it to absurdity, imagine if hash codes were represented in binary, where each number is far longer than in decimal, and has a much smaller character variety. If you saw the hash code 010001011011100010100100101011 now, and again 10 seconds later, would you stand the slightest chance of being able to tell that you are looking at the same hash code? (I can't, even if I am looking at the two numbers simultaneously. I have to compare them digit by digit.)
On the opposite end lies the tetrasexagesimal numbering system, which means base 64. Numbers in this system consist of:
the digits 0-9, plus:
the uppercase letters A-Z, plus:
the lowercase letters a-z, plus:
a couple of symbols like '+' and '/' to reach 64.
Tetrasexagesimal obviously has a much greater character variety than lower-base systems, and it should come as no surprise that numbers expressed in it are admirably terse. (I am not sure why the JVM is not using this system for hashcodes; perhaps some prude feared that chance might lead to certain inconvenient four-letter words being formed?)
So, on a hypothetical JVM with 32-bit object hash codes, the hash code of your "Foo" object could look like any of the following:
Binary: com.acme.Foo@11000001110101010110101100100011
Decimal: com.acme.Foo@3251989283
Hexadecimal: com.acme.Foo@C1D56B23
Tetrasexagesimal: com.acme.Foo@31rMiZ
Which one would you prefer?
I would definitely prefer the tetrasexagesimal, and in lack of that, I would settle for the hexadecimal one. Most people would agree.
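For comparison, the binary, decimal and hexadecimal renderings of the example value can be reproduced with standard Integer methods. The sketch treats the hash as an unsigned 32-bit quantity; the render helper is made up, and base 64 is left out because Integer only supports radixes up to 36.

```java
final class HashFormats {
    // Renders the 32 bits of hash as an unsigned number in the given radix (2..36).
    static String render(int hash, int radix) {
        return Integer.toUnsignedString(hash, radix);
    }

    public static void main(String[] args) {
        int hash = 0xC1D56B23;
        System.out.println(render(hash, 2));  // 11000001110101010110101100100011
        System.out.println(render(hash, 10)); // 3251989283
        System.out.println(render(hash, 16)); // c1d56b23
        System.out.println(Integer.toHexString(hash)); // c1d56b23, as used by Object.toString()
    }
}
```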
One web site where you can play with conversions is here:
https://www.mobilefish.com/services/big_number/big_number.php

Object.hashCode used to be computed based on a memory location where the object is located. Memory locations are almost universally displayed as hexadecimal.
The default return value of toString isn’t so much interested in the hash code but rather in a way to uniquely identify the object for the purpose of debugging, and the hash code serves well for the purpose of identification (in fact, the combination of class name + memory address is truly unique; and while a hash code isn’t guaranteed to be unique, it often comes close).
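As a small illustration, the default string can be rebuilt by hand. In the JDK the separator is "@", and System.identityHashCode returns the value Object.hashCode() would give even when a class overrides hashCode(). The describe helper is a made-up name.

```java
final class DefaultToStringDemo {
    // Rebuilds what Object.toString() prints for an object that does not override it.
    static String describe(Object o) {
        return o.getClass().getName() + "@" + Integer.toHexString(System.identityHashCode(o));
    }

    public static void main(String[] args) {
        Object o = new Object();
        System.out.println(o);                               // e.g. java.lang.Object@<hex digits>
        System.out.println(describe(o).equals(o.toString())); // true
    }
}
```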
