I have short Strings (fewer than 10 characters). I will convert them into ints and use them as primary keys. (I can't use a String primary key because of small problems.) I know that hash codes of Strings of arbitrary length can collide, but can short Strings collide too?
Absolutely yes. For example, Ea and FB are colliding strings, each only two characters in length! Example:
public static void main(String[] args) {
    System.out.println("Ea".hashCode() + " " + "FB".hashCode());
}
Prints 2236 2236.
The Java String#hashCode function isn't really even close to random. It's really easy to generate collisions for short strings, and it doesn't fare much better on long strings.
In general, even if you stuck to just 6 bits per character (ASCII letters and numbers, and a couple of symbols), you'd exceed the number of possible 32-bit hash codes with strings of only 6 characters -- that is, collisions are absolutely guaranteed among the 2^36 possible 6-character strings drawn from such a 6-bit alphabet.
A hash code is 32 bits in size.
A char in Java is 16 bits in size.
So in theory, all 2-character strings could have different hash codes, although some of those hash codes would have to collide with the hash code of the empty string and single-character strings. So even taking "all strings of two characters or shorter" there will be collisions. By the time you've got 10 characters, there are way more possible strings than there are hash codes available.
Collisions are still likely to be rare, but you should always assume that they can happen.
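For a sense of how common such collisions are, here is a minimal brute-force sketch (the class and variable names are mine, not from the answer above) that finds two-character strings sharing a hash code:

import java.util.HashMap;
import java.util.Map;

public class ShortStringCollisions {
    public static void main(String[] args) {
        // Maps each hash code to the first two-character string that produced it.
        Map<Integer, String> seen = new HashMap<>();
        int collisions = 0;
        // Try every two-character string built from the ASCII characters 'A'..'z'.
        for (char a = 'A'; a <= 'z'; a++) {
            for (char b = 'A'; b <= 'z'; b++) {
                String s = "" + a + b;
                String previous = seen.putIfAbsent(s.hashCode(), s);
                if (previous != null) {
                    collisions++;
                    if (collisions <= 5) {
                        System.out.println(previous + " and " + s + " both hash to " + s.hashCode());
                    }
                }
            }
        }
        System.out.println("Strings colliding with an earlier string: " + collisions);
    }
}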
Related
From a given string I am generating a 32-digit unique hash code using MD5:
MessageDigest.getInstance("MD5")
    .digest("SOME-BIG-STRING".getBytes("UTF-8")).map("%02x".format(_)).mkString
//output: 47a8899bdd7213fb1baab6cd493474b4
Is it possible to generate a 30-digit hash instead of a 32-digit one, and what problems would arise if I do so?
Is there another hash algorithm that produces a 30-character output with negligible collision probability across 1 trillion unique strings?
Security is not important, uniqueness is required.
For generating unique IDs from strings, hash functions are never the correct answer.
What you would need is to define a one-to-one mapping of text strings (such as "v1.0.0") onto 30-character-long strings (such as "123123..."). This is also known as a bijection, although in your case an injection (a simple one-to-one mapping from inputs to outputs, not necessarily onto) may be enough. As the other answer at the time of this writing notes, hash functions don't necessarily ensure this mapping, but there are other possibilities, such as full-period linear congruential generators (if they take a seed that you can map one-to-one onto input string values), or other reversible functions.
However, if the set of possible input strings is larger than the set of possible output strings, then you can't map all input strings one-to-one with all output strings (without creating duplicates), due to the pigeonhole principle.
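As an illustration of such an injection (a sketch of mine, not something from the question): hex-encoding the UTF-8 bytes maps every input string one-to-one onto a string of hex characters. The catch is that the output grows with the input instead of staying within 30 characters, which is exactly the pigeonhole limitation just described.

import java.nio.charset.StandardCharsets;

public class ReversibleMapping {
    /** Maps any string one-to-one onto a hex string (two hex characters per byte). */
    static String encode(String input) {
        StringBuilder sb = new StringBuilder();
        for (byte b : input.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    /** Reverses encode() exactly, so no two inputs can share an output. */
    static String decode(String hex) {
        byte[] bytes = new byte[hex.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String id = encode("v1.0.0");
        System.out.println(id);          // 76312e302e30
        System.out.println(decode(id));  // v1.0.0
    }
}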
See also this question: How to generate a GUID with a custom alphabet, that behaves similar to an MD5 hash (in JavaScript)?.
Indeed, if you use hash functions, the chance of collision will be close to zero but never exactly zero (meaning that the risk of duplicates will always be there). If we take MD5 as an example (which produces any of 2^128 hash codes), then roughly speaking, the chance of accidental collision becomes non-negligible only after 2^64 IDs are generated, which is well over 1 trillion.
But MD5 and other hash functions are not the right way to do what you want to do. This is discussed next.
If you can't restrict the format of your input strings to 30 digits and can't compress them to 30 digits or less and can't tolerate the risk of duplicates, then the next best thing is to create a database table mapping your input strings to randomly generated IDs.
This database table should have two columns: one column holds your input strings (e.g., "<UUID>-NAME-<UUID>"), and the other column holds randomly generated IDs associated with those strings. Since random numbers don't ensure uniqueness, every time you create a new random ID you will need to check whether the random ID already exists in the database, and if it does exist, try a new random ID (but the chance that a duplicate is found will shrink as the size of the ID grows).
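As an illustration only (the class name and the in-memory map standing in for the database are mine), here is a sketch of that insert-and-retry loop; a real database would enforce uniqueness with a UNIQUE index rather than a linear scan:

import java.security.SecureRandom;
import java.util.HashMap;
import java.util.Map;

public class RandomIdRegistry {
    private final SecureRandom random = new SecureRandom();
    // Stand-in for the two-column table: input string -> assigned random ID.
    private final Map<String, String> table = new HashMap<>();

    /** Returns the ID already assigned to this input, or assigns a fresh random one. */
    public synchronized String idFor(String input) {
        String existing = table.get(input);
        if (existing != null) {
            return existing;
        }
        String candidate;
        do {
            candidate = randomDigits(30);
        } while (table.containsValue(candidate)); // retry on the (rare) duplicate
        table.put(input, candidate);
        return candidate;
    }

    private String randomDigits(int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append((char) ('0' + random.nextInt(10)));
        }
        return sb.toString();
    }
}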
Is it possible to generate a 30-digit hash instead of a 32-digit one, and what problems would arise if I do so?
Yes.
You increase the probability of a collision by a factor of 2^8 (256).
Is there another hash algorithm that produces a 30-character output with negligible collision probability across 1 trillion unique strings?
Probably. Taking the first 30 hex digits of a hash produced by any crypto-strength hash algorithm has roughly equivalent uniqueness properties.
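For illustration (not part of the original answer), here is a sketch that keeps the first 30 hex digits of a SHA-256 hash; SHA-256 is used here only as one convenient crypto-strength choice:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class TruncatedHash {
    /** Returns the first 30 hex digits (120 bits) of the SHA-256 hash of the input. */
    public static String hash30(String input) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(input.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.substring(0, 30);
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        System.out.println(hash30("SOME-BIG-STRING"));
    }
}

At 120 bits, the chance of a collision among 1 trillion (roughly 2^40) strings is still negligible.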
Security is not important, uniqueness is required?
In that case, the fact that MD5 is no longer considered secure is moot. (Note that the reason MD5 is no longer considered secure is that it is computationally feasible to engineer collisions; i.e. to construct two different inputs that produce the same MD5 hash.)
However, uniqueness of hashes cannot be guaranteed. Even with a "perfect" crypto-strength hash function that generates N-bit hashes, the probability of a collision for any 2 arbitrary (different) inputs is 1 in 2^N. For large enough values of N, the probability is very small. But it is never zero.
I am looking for ways to compute a unique hash for a given String in Java. Looks like I cannot use MD5 or SHA1 because folks claim that they are broken and do not always guarantee uniqueness.
I should get the same hash (preferably a 32 character string like the MD5 Sum) for two String objects which are equal by the equals() method. And no other String should generate this hash - that's the tricky part.
Is there a way to achieve this in Java?
If a guaranteed-unique hash code is required, then it is not possible (except in the theoretical case where the set of possible inputs is no larger than the set of hash values, which is not the practical situation here). Hashes and hash codes are non-unique.
A Java String of length N has 65536 ^ N possible states, and requires an integer with 16 * N bits to represent all possible values. If you write a hash function that produces an integer with a smaller range (e.g. fewer than 16 * N bits), you will eventually find cases where more than one String hashes to the same integer; i.e. the hash codes cannot be unique. This is called the Pigeonhole Principle, and there is a straightforward mathematical proof. (You can't fight math and win!)
But if "probably unique" with a very small chance of non-uniqueness is acceptable, then crypto hashes are a good answer. The math will tell you how big (i.e. how many bits) the hash has to be to achieve a given (low enough) probability of non-uniqueness.
Update: also check out this other good answer: What is a good 64bit hash function in Java for textual strings?
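Along the same lines (a sketch of mine, not the answer behind that link), one way to get a 64-bit "probably unique" value in Java is to keep the first 8 bytes of a crypto-strength hash:

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class LongHash {
    /** A 64-bit "probably unique" hash: the first 8 bytes of SHA-256, packed into a long. */
    public static long hash64(String input) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(input.getBytes(StandardCharsets.UTF_8));
            return ByteBuffer.wrap(digest).getLong(); // reads the first 8 bytes
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 must be supported by every Java platform", e);
        }
    }
}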
Is there a function in Java that takes in two strings and generates one 16-character string which is unique to the given combination? I don't expect the string to be 100% unique, as long as the probability of having 2 conflicting strings is very small (1 in 100,000, for example). Thanks.
You can concatenate both strings and hash them.
If it needs to be truly unique, then a concatenation of the strings which is 16 characters or fewer is your answer.
Otherwise you have to rely on hashing, which always carries some probability of a clash and offers no guarantee of uniqueness.
Your best bet is to use a GUID.
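If you do go the hashing route, here is a sketch (the class name, separator choice, and use of SHA-256 are mine): hash the two strings with a separator, so that for example ("ab", "c") and ("a", "bc") don't produce the same input, and keep 16 hex characters (64 bits, far more margin than a 1-in-100,000 collision target needs at modest volumes):

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class PairKey {
    /** A 16-hex-character key derived from the SHA-256 hash of both inputs. */
    public static String keyFor(String first, String second) throws NoSuchAlgorithmException {
        // The separator keeps ("ab", "c") and ("a", "bc") from producing the same key.
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest((first + "\u0000" + second).getBytes(StandardCharsets.UTF_8));
        long top64bits = ByteBuffer.wrap(digest).getLong(); // first 8 bytes of the digest
        return String.format("%016x", top64bits);           // exactly 16 hex characters
    }
}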
I have an object with a String that holds a unique id (such as "ocx7gf" or "67hfs8").
I need to supply an implementation of int hashCode() for it, which obviously needs to be unique.
How do I convert a string to a unique int in the easiest/fastest way?
10x.
Edit - OK. I already know that String.hashCode() is possible, but it is not recommended anywhere. Actually, if every other method is also not recommended, should I use it anyway when my object is in a collection and I need the hash code? And should I concatenate it with another string to make it less likely to collide?
No, you don't need an implementation that returns a unique value, "obviously"; if uniqueness were required, the majority of hashCode implementations out there would be broken.
What you want is a good spread across the bits, especially for common values (if any values are more common than others). Barring special knowledge of your format, just using the hash code of the string itself would be best.
With special knowledge of the limits of your id format, it may be possible to customise the hash and get better performance, though false assumptions are more likely to make things worse than better.
Edit: On good spread of bits.
As stated here and in other answers, being completely unique is impossible and hash collisions are possible. Hash-using methods know this and can deal with it, but it does impact upon performance, so we want collisions to be rare.
Further, hashes are generally re-hashed, so our 32-bit number may end up being reduced to, e.g., one in the range 0 to 22, and we want as good a distribution within that as possible too.
We also want to balance this with not taking so long to compute our hash, that it becomes a bottleneck in itself. An imperfect balancing act.
A classic example of a bad hash method is one for a co-ordinate pair of X, Y ints that does:
return X ^ Y;
While this does a perfectly good job of returning 2^32 possible values out of the 4^32 possible inputs, in real world use it's quite common to have sets of coordinates where X and Y are equal ({0, 0}, {1, 1}, {2, 2} and so on) which all hash to zero, or matching pairs ({2,3} and {3, 2}) which will hash to the same number. We are likely better served by:
return ((X << 16) | (X >>> 16)) ^ Y;
Now, there are just as many possible values for which this is dreadful as for the former, but it tends to serve better in real-world cases.
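To make the difference concrete, here is a small sketch (mine) that counts distinct hash values over the diagonal pairs mentioned above:

import java.util.HashSet;
import java.util.Set;

public class CoordinateHashDemo {
    static int xorHash(int x, int y) {
        return x ^ y;                          // the "bad" hash
    }

    static int rotatedHash(int x, int y) {
        return ((x << 16) | (x >>> 16)) ^ y;   // rotate x by 16 bits first
    }

    public static void main(String[] args) {
        Set<Integer> xor = new HashSet<>();
        Set<Integer> rotated = new HashSet<>();
        for (int i = 0; i < 1000; i++) {       // diagonal pairs {i, i}
            xor.add(xorHash(i, i));
            rotated.add(rotatedHash(i, i));
        }
        System.out.println("Distinct XOR hashes for 1000 diagonal pairs:     " + xor.size());     // 1
        System.out.println("Distinct rotated hashes for 1000 diagonal pairs: " + rotated.size()); // 1000
    }
}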
Of course, it is a different job if you are writing a general-purpose class (with no idea what the possible inputs are) than if you have a better idea of the purpose at hand. For example, if I was using Date objects but knew that they would all be dates only (time part always midnight) and only within a few years of each other, then I might prefer a custom hash code that used only the day, month and lower digits of the year, over the standard one. The writer of Date, though, can't rely on such knowledge and has to try to cater for everyone.
Hence, if I for instance knew that a given string is always going to consist of 6 case-insensitive characters in the range [a-z] or [0-9] (which yours seem to, though it isn't clear from your question that it does), then I might use an algorithm that assigned a value from 0 to 35 (the 36 possible values for each character) to each character, and then walk through the string, each time multiplying the current value by 36 and adding the value of the next char.
Assuming a good spread in the ids, this would be the way to go, especially if I made the order such that the less-significant digits in my hash matched the most frequently changing char in the id (if such a call could be made), hence surviving re-hashing to a smaller range well.
However, lacking such knowledge of the format for sure, I can't make that call with certainty, and I could well be making things worse (slower algorithm for little or even negative gain in hash quality).
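A sketch of that base-36 walk, under the stated (and unconfirmed) assumption of 6 case-insensitive characters from [a-z] or [0-9]; the class and method names are mine:

public class Base36IdHash {
    /** Hash for an id made of [a-z0-9], treating it as a base-36 number. */
    static int base36Hash(String id) {
        int hash = 0;
        for (int i = 0; i < id.length(); i++) {
            int value = Character.digit(id.charAt(i), 36); // 0..35; upper and lower case map to the same value
            hash = hash * 36 + value;                      // overflow simply wraps, which is fine for a hash
        }
        return hash;
    }

    public static void main(String[] args) {
        System.out.println(base36Hash("ocx7gf"));
        System.out.println(base36Hash("67hfs8"));
    }
}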
One advantage you have is that since it's an ID in itself, then presumably no other non-equal object has the same ID, and hence no other properties need be examined. This doesn't always hold.
You can't get a unique integer from a String of unlimited length. There are 4 billionish (2^32) unique integers, but an almost infinite number of unique strings.
String.hashCode() will not give you unique integers, but it will do its best to give you differing results based on the input string.
EDIT
Your edited question says that String.hashCode() is not recommended. This is not true, it is recommended, unless you have some special reason not to use it. If you do have a special reason, please provide details.
Looks like you've got a base-36 number there (a-z + 0-9). Why not convert it to an int using Integer.parseInt(s, 36)? Obviously, if there are too many unique IDs, it won't fit into an int, but in that case you're out of luck with unique integers and will need to get by using String.hashCode(), which does its best to be close to unique.
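For example (the ids are the ones from the question; the overflow caveat is the one just mentioned):

public class Base36Ids {
    public static void main(String[] args) {
        // These 6-character ids fit in an int...
        System.out.println(Integer.parseInt("ocx7gf", 36));
        System.out.println(Integer.parseInt("67hfs8", 36));

        // ...but 6 base-36 digits can reach 36^6 - 1, which is larger than
        // Integer.MAX_VALUE, so the "highest" ids would throw NumberFormatException.
        // Parsing into a long always works for ids this short:
        System.out.println(Long.parseLong("zzzzzz", 36)); // 2176782335
    }
}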
Unless your strings are limited in some way or your integers hold more bits than the strings you're trying to convert, you cannot guarantee the uniqueness.
Let's say you have a 32-bit integer and a 64-character character set for your strings. That means six bits per character. That will allow you to store five characters in an integer. More than that and it won't fit.
Represent each character of the string by a five-digit binary number, e.g. a by 00001, b by 00010, and so on, which gives 32 possible codes. For example, cat would be written as 00011 00001 10100; converting this binary to decimal gives 3124, so cat becomes 3124. Similarly, you can get cat back from 3124 by converting it to binary first and mapping each five-digit group back to its character.
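A sketch of that packing and its inverse (limited to lowercase a-z, since 5 bits only give 32 codes; names are mine):

public class FiveBitCodec {
    /** Packs a lowercase a-z string into a number, 5 bits per character (a=1 ... z=26). */
    static long encode(String s) {
        long value = 0;
        for (char c : s.toCharArray()) {
            value = (value << 5) | (c - 'a' + 1);
        }
        return value;
    }

    /** Reverses encode() by peeling off 5 bits at a time. */
    static String decode(long value) {
        StringBuilder sb = new StringBuilder();
        while (value != 0) {
            sb.append((char) ('a' + (value & 0x1F) - 1));
            value >>>= 5;
        }
        return sb.reverse().toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("cat")); // 3124
        System.out.println(decode(3124));  // cat
    }
}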
One way to do it is to assign each letter a value (a = 1, b = 2, and so on, with 26+1 values if you allow a space, 52+1 with capitals, and so on for other supported characters), and to give each position of the string (read left to right) its own prime multiplier, such that the final position is multiplied by a prime larger than the number of possible values in that position. If the number is mapped back to the first position (the leftmost character), then any number you generate from a unique string, mapped back to whatever the first letter is, gives a unique value.
Dog might be 30 + 3(15) + 101(7), or 782, while God is 33 + 3(15) + 101(4), or 482. More important than the values being unique, they can be useful in generation if the original first digit is kept: 30(782) would be distinct from some 12(782), which helps differentiate similar strings if you ever run out of unique possibilities. Dog would always be Dog, but it would never be Cat or Mouse.
I'm curious why Object.toString() returns this:
return getClass().getName() + "#" + Integer.toHexString(hashCode());
as opposed to this:
return getClass().getName() + "#" + hashCode();
What benefits does displaying the hash code as a hex rather than a decimal buy you?
The Short Answer:
Hash Codes are usually displayed in hexadecimal because this way it is easier for us to retain them in our short-term memory, since hexadecimal numbers are shorter and have a larger character variety than the same numbers expressed in decimal.
Also, (as supercat states in a comment,) hexadecimal representation tends to prevent folks from trying to assign some meaning to the numbers, because they don't have any. (To use supercat's example, Fnord#194 is absolutely not the 194th Fnord; it is just Fnord with some unique number next to it.)
The Long Answer:
Decimal is convenient for two things:
Doing arithmetic
Estimating magnitude
However, these operations are inapplicable to hashcodes. You are certainly not going to be adding hashcodes together in your head, nor would you ever care how big a hashcode is compared to another hashcode.
What you are likely to be doing with hashcodes is the one and only thing that they were intended for: to tell whether two hash codes possibly refer to the same object, or definitely refer to different objects.
In other words, you will be using them as unique identifiers or mnemonics for objects. Thus, the fact that a hashcode is a number is in fact entirely irrelevant; you might as well think of it as a hash string.
Well, it just so happens that our brains find it a lot easier to retain in short-term memory (for the purpose of comparison) short strings consisting of 16 different characters, than longer strings consisting of only 10 different characters.
To further illustrate the analogy by taking it to absurdity, imagine if hash codes were represented in binary, where each number is far longer than in decimal, and has a much smaller character variety. If you saw the hash code 010001011011100010100100101011 now, and again 10 seconds later, would you stand the slightest chance of being able to tell that you are looking at the same hash code? (I can't, even if I am looking at the two numbers simultaneously. I have to compare them digit by digit.)
On the opposite end lies the tetrasexagesimal numbering system, which means base 64. Numbers in this system consist of:
the digits 0-9, plus:
the uppercase letters A-Z, plus:
the lowercase letters a-z, plus:
a couple of symbols like '+' and '/' to reach 64.
Tetrasexagesimal obviously has a much greater character variety than lower-base systems, and it should come as no surprise that numbers expressed in it are admirably terse. (I am not sure why the JVM is not using this system for hashcodes; perhaps some prude feared that chance might lead to certain inconvenient four-letter words being formed?)
So, on a hypothetical JVM with 32-bit object hash codes, the hash code of your "Foo" object could look like any of the following:
Binary: com.acme.Foo#11000001110101010110101100100011
Decimal: com.acme.Foo#3251989283
Hexadecimal: com.acme.Foo#C1D56B23
Tetrasexagesimal: com.acme.Foo#31rMiZ
Which one would you prefer?
I would definitely prefer the tetrasexagesimal, and in lack of that, I would settle for the hexadecimal one. Most people would agree.
One web site where you can play with conversions is here:
https://www.mobilefish.com/services/big_number/big_number.php
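Or, in code: a sketch (mine) that prints one hash code in all four representations. Base 64 is handled by hand using the digit ordering described above, since Integer.toUnsignedString only supports radixes up to 36:

public class HashCodeBases {
    // 0-9, then A-Z, then a-z, then '+' and '/', as listed above.
    private static final char[] BASE64_DIGITS =
            ("0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
           + "abcdefghijklmnopqrstuvwxyz+/").toCharArray();

    static String toBase64(int value) {
        long v = Integer.toUnsignedLong(value); // treat the hash code as unsigned
        StringBuilder sb = new StringBuilder();
        do {
            sb.append(BASE64_DIGITS[(int) (v % 64)]);
            v /= 64;
        } while (v != 0);
        return sb.reverse().toString();
    }

    public static void main(String[] args) {
        int hash = "Foo".hashCode();
        System.out.println("Binary:           " + Integer.toBinaryString(hash));
        System.out.println("Decimal:          " + Integer.toUnsignedString(hash));
        System.out.println("Hexadecimal:      " + Integer.toHexString(hash));
        System.out.println("Tetrasexagesimal: " + toBase64(hash));
    }
}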
Object.hashCode used to be computed based on the memory location where the object is located, and memory locations are almost universally displayed in hexadecimal.
The default return value of toString isn't so much about the hash code itself as about a way to identify the object for debugging purposes, and the hash code serves that identification purpose well (in fact, the combination of class name + memory address is truly unique; and while a hash code isn't guaranteed to be unique, it often comes close).