I need to compare two large strings. Rather than using an equals method, is there something like a hashCode that generates a unique ID for a String?
I ask because my strings are very large. Also, I need an ID that is unique to the content. Is it possible to use String's hashCode for this purpose?
The purpose of hashCode is to provide a quick means of identifying most of the circumstances where two objects would compare unequal. A hash function which has a 1% false positive rate would for most purposes be considered superior to one that has a 0% false positive rate but takes twice as long to compute.
There are some hashing functions which are designed for use as "digests", such that two different strings of arbitrary length would be very unlikely to have the same digest. In order to be very effective, however, digests need to be much larger than a 32-bit hashcode value. A well-designed 64-byte (512 bit) digest would generally be adequate to guard strings of any length well enough that one would be more likely to get struck by lightning twice on the same weekend as one wins five state lotteries than to find two different strings that yield the same digest. The cost of computing a good digest function for a string would be much greater than that of comparing the string to another string, but if each string will be compared against many other strings, computing each digest function once and comparing it to the digests of every other string may offer a major performance win.
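As a rough sketch of that idea (not anything prescribed by the answer above; the class and method names are purely illustrative), here is how one might compute a 64-byte SHA-512 digest once per string in Java and then compare digests instead of full strings:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class DigestedString {
    final String value;
    final byte[] digest;   // 64-byte SHA-512 digest, computed once per string

    DigestedString(String value) throws NoSuchAlgorithmException {
        this.value = value;
        this.digest = MessageDigest.getInstance("SHA-512")
                .digest(value.getBytes(StandardCharsets.UTF_8));
    }

    boolean probablyEquals(DigestedString other) {
        // Comparing two 64-byte digests is cheap; a mismatch proves the strings
        // differ, while a match makes equality overwhelmingly likely.
        return MessageDigest.isEqual(this.digest, other.digest);
    }
}

If each string is compared against many others, the one-time cost of computing the digest is amortized across all of those comparisons.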
Related
From a given string I am generating a 32-digit unique hash code using MD5:
MessageDigest.getInstance("MD5")
.digest("SOME-BIG-STRING").map("%02x".format(_)).mkString
//output: 47a8899bdd7213fb1baab6cd493474b4
Is it possible to generate a 30-digit hash instead of a 32-digit one, and what problems would that cause?
Is there another hash algorithm that produces 30-character output with a negligible collision probability across 1 trillion unique strings?
Security is not important, uniqueness is required.
For generating unique IDs from strings, hash functions are never the correct answer.
What you would need is to define a one-to-one mapping of text strings (such as "v1.0.0") onto 30-character-long strings (such as "123123..."). This is also known as a bijection, although in your case an injection (a simple one-to-one mapping from inputs to outputs, not necessarily onto) may be enough. As the other answer at the time of this writing notes, hash functions don't necessarily ensure this mapping, but there are other possibilities, such as full-period linear congruential generators (if they take a seed that you can map one-to-one onto input string values), or other reversible functions.
However, if the set of possible input strings is larger than the set of possible output strings, then you can't map all input strings one-to-one with all output strings (without creating duplicates), due to the pigeonhole principle.
See also this question: How to generate a GUID with a custom alphabet, that behaves similar to an MD5 hash (in JavaScript)?.
Indeed, if you use hash functions, the chance of collision will be close to zero but never exactly zero (meaning that the risk of duplicates will always be there). If we take MD5 as an example (which produces any of 2^128 hash codes), then roughly speaking, the chance of accidental collision becomes non-negligible only after 2^64 IDs are generated, which is well over 1 trillion.
But MD5 and other hash functions are not the right way to do what you want to do. This is discussed next.
If you can't restrict the format of your input strings to 30 digits and can't compress them to 30 digits or less and can't tolerate the risk of duplicates, then the next best thing is to create a database table mapping your input strings to randomly generated IDs.
This database table should have two columns: one column holds your input strings (e.g., "<UUID>-NAME-<UUID>"), and the other column holds randomly generated IDs associated with those strings. Since random numbers don't ensure uniqueness, every time you create a new random ID you will need to check whether the random ID already exists in the database, and if it does exist, try a new random ID (but the chance that a duplicate is found will shrink as the size of the ID grows).
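A minimal in-memory sketch of that retry loop follows (the 30-digit decimal format and all names here are assumptions; a real implementation would use a database table with a uniqueness constraint on the ID column rather than in-memory maps):

import java.security.SecureRandom;
import java.util.HashMap;
import java.util.Map;

final class IdRegistry {
    private final Map<String, String> idsByInput = new HashMap<>();
    private final Map<String, String> inputsById = new HashMap<>();
    private final SecureRandom random = new SecureRandom();

    /** Returns the existing ID for the input, or assigns a fresh random one. */
    synchronized String idFor(String input) {
        String existing = idsByInput.get(input);
        if (existing != null) {
            return existing;
        }
        String id;
        do {
            id = randomDigits(30);            // 30 random decimal digits
        } while (inputsById.containsKey(id)); // retry on the (rare) duplicate
        idsByInput.put(input, id);
        inputsById.put(id, input);
        return id;
    }

    private String randomDigits(int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(random.nextInt(10));
        }
        return sb.toString();
    }
}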
Is it possible to generate a 30-digit hash instead of a 32-digit one, and what problems would that cause?
Yes.
You increase the probability of a collision by a factor of 2^8 (i.e. 256), since dropping two hex digits discards 8 bits of the hash.
Is there another hash algorithm to use that supports 30-character output with an acceptable collision probability for 1 trillion unique strings?
Probably. Taking the first 30 hex digits of a hash produced by any crypto-strength hash algorithm has roughly equivalent uniqueness properties.
Security is not important, uniqueness is required?
In that case, the fact that MD5 is no longer considered secure is moot. (Note that the reason MD5 is no longer considered secure is that it is computationally feasible to engineer a collision, i.e. to construct two different inputs that produce the same MD5 hash.)
However, uniqueness of hashes cannot be guaranteed. Even with a "perfect" crypto-strength hash function that generates N-bit hashes, the probability of a collision for any 2 arbitrary (different) inputs is one in 2^N. For large enough values of N, the probability is very small. But it is never zero.
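As a hedged illustration of that suggestion (the choice of SHA-256 here is an assumption; any crypto-strength algorithm would do), a Java sketch that keeps the first 30 hex digits, i.e. 120 bits, of the hash:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class TruncatedHash {
    /** Returns the first 30 hex digits (120 bits) of the SHA-256 hash of the input. */
    static String first30HexDigits(String input) throws NoSuchAlgorithmException {
        byte[] hash = MessageDigest.getInstance("SHA-256")
                .digest(input.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : hash) {
            hex.append(String.format("%02x", b));
        }
        return hex.substring(0, 30);  // truncation keeps 120 of the 256 bits
    }
}

Roughly speaking, by the birthday bound, the collision risk for 10^12 (one trillion) distinct inputs with 120 retained bits is on the order of (10^12)^2 / 2^121, i.e. a few times 10^-13, which is still negligible.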
Is there a function in Java that takes two strings and generates one 16-character string which is unique to the given combination? I don't expect the string to be 100% unique, as long as the probability of having 2 conflicting strings is very small (1 in 100,000, for example). Thanks.
You can concatenate both strings and hash them.
If it needs to be truly unique, then a concatenation of the strings which is 16 characters or less is your answer.
Otherwise you have to rely on hashing, but that gives you only a probability of avoiding clashes, not a guarantee.
Your best bet is to use a GUID.
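A hedged sketch of the concatenate-and-hash approach mentioned above, truncating a SHA-256 hex digest to 16 characters (the separator, the algorithm, and the class name are all assumptions; 16 hex characters carry 64 bits, which keeps the collision probability far below 1 in 100,000 for a modest number of strings):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class PairId {
    /** Derives a 16-character hex ID from two strings by hashing their concatenation. */
    static String idFor(String first, String second) throws NoSuchAlgorithmException {
        // A separator avoids ("ab","c") and ("a","bc") mapping to the same input.
        String combined = first + "\u0000" + second;
        byte[] hash = MessageDigest.getInstance("SHA-256")
                .digest(combined.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (int i = 0; i < 8; i++) {            // 8 bytes -> 16 hex characters
            hex.append(String.format("%02x", hash[i]));
        }
        return hex.toString();
    }
}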
My question is not about the double hashing technique http://en.wikipedia.org/wiki/Double_hashing , which is a way to resolve collisions. It is about handling existing collisions in a hash table of strings. Say we have a collision: several strings in the same bucket, so now we must go through the bucket checking the strings. It seems it would make sense to calculate another hash function for fast string comparison (compare hash values for quick rejection). The hash key could be lazily computed and saved with the string. Have you used such a technique? Could you provide a reference? If not, do you think it's not worth doing since the performance gain is questionable? Some notes:
I put tag "Java" since I did measurements in Java: String.hashCode() in most cases outperforms String.equals() (and BTW greatly outperforms manual hash code calculation: hashCode = 31 * hashCode + strInTable.charAt(i));
Of course, the same could be asked about any string comparison, not necessarily strings in a hash table. But I am considering a specific situation with huge amount of strings which are kept in hash table.
This probably makes sense if the strings in the bucket are somewhat similar (as in the Rabin-Karp algorithm). I am looking for your opinion on the general situation.
Many hash-based collections store the hash value of each item in the collection, on the premise that since every item's hash will have been computed when it was added to the collection, and code which is looking for an item in a hashed collection will have to know its hash, comparing hash values will be a quick and easy way of reducing the cost of false hits. For example, if one has a 16-bucket hash table that contains four strings of 1,000 characters each, and will be searching for a lot of 1,000-character strings which match one of the table entries in all but the last few characters, more than 6% of searches will hit on a bucket that contains a near-match string, but a much smaller fraction will hit a bucket that contains a string whose 32-bit hashCode matches that of the string being sought. Since comparisons of nearly-identical strings are expensive, comparing full 32-bit hash codes first is helpful.
If one has large immutable collections which may need to be stored in hash tables and matched against other such collections, there may be some value in having such collections compute and cache longer hash functions, and having their equals methods compare the results of those longer hash functions before proceeding further. In such cases, computing a longer hash function will often be almost as fast as computing a shorter one. Further, not only will comparisons on the longer hash code greatly reduce the risks that false positives will cause needless "deep" comparisons, but computing longer hash functions and combining them into the reported hashCode() may greatly reduce the dangers of strongly-correlated hash collisions.
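As a hedged illustration of this idea (the wrapper class below is hypothetical, not anything in the JDK), here is a key type that caches its hash and consults it before doing the deep comparison:

final class HashedKey {
    private final String value;
    private final int hash;           // computed once, cached with the string

    HashedKey(String value) {
        this.value = value;
        this.hash = value.hashCode();
    }

    @Override
    public int hashCode() {
        return hash;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) return true;
        if (!(obj instanceof HashedKey)) return false;
        HashedKey other = (HashedKey) obj;
        // Cheap rejection: different cached hashes mean the strings cannot be equal.
        if (this.hash != other.hash) return false;
        // Only nearly-colliding keys pay for the full character-by-character comparison.
        return this.value.equals(other.value);
    }
}

Note that java.lang.String already caches its hashCode after the first call, and HashMap stores each entry's hash and compares it before calling equals, so part of this benefit comes for free with plain string keys.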
Comparing a hash only makes sense if the number of comparisons (lookups) is large compared to the number of entries. You would need a large hash (32 bits are not enough; you'd want at least 128 bits), and that would be expensive to calculate. You would want to amortize the cost of hashing each string over a large number of probes into the buckets.
As to whether it's worth it or not, it's highly context dependent. The only way to find out is to actually do it with your data and compare the performance of both methods.
We have a view into a system that uses a value for the unique ID that another company we want to share information with will not accept. I was thinking of using a one-way cryptographic hash, similar to what is done with passwords. The concern is: can the hash output values be guaranteed unique if the inputs are guaranteed unique and the salt is constant?
The answer is yes: the same ID input with the same salt will always produce the same output.
But, if your question is about guaranteeing that outputs will always be unique, the answer is no. There is a very small statistical probability that the hashing will create the same output twice even if the inputs are different and the salt constant.
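To make the determinism concrete, here is a sketch assuming SHA-256 with the salt simply prepended to the input (how the salt is combined, and the names used here, are assumptions, not part of the question):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class SaltedId {
    static String hash(String id, String constantSalt) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest((constantSalt + id).getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        // The same id with the same salt always yields the same output...
        System.out.println(hash("customer-42", "fixed-salt"));
        System.out.println(hash("customer-42", "fixed-salt"));
        // ...but distinct ids are only overwhelmingly likely, not guaranteed, to differ.
        System.out.println(hash("customer-43", "fixed-salt"));
    }
}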
In principle, there is no hashing algorithm without collisions if the input size is larger than the output size. (In your case, the relevant input size would be the size of this part which changes from one input to the next.)
Whether there are collisions also for shorter inputs is a property of the hashing algorithm, but the idea is that the probability of these should be quite small (about 1/2^(output size in bits) for each pair of inputs, for a good algorithm).
Is your question whether two different values can hash to the same thing, or whether hashes are deterministic?
If it's the former, then yes, you can have hash collisions. A well-designed, cryptographically strong hash should make it difficult to find two values hashing to the same output, or to find an input that matches a given hash, but it cannot guarantee uniqueness.
By the pigeonhole principle:
If your hash is a constant size, say 64 bits (without loss of generality), you will have at most 2^64 unique output hash values. Since there are more than 2^64 potential inputs if you're using strings, a collision is guaranteed once you hash at most 2^64 + 1 distinct items.
Yes the same hash will be produced when the input and salt are the same. Note that different inputs may produce the same hash.
In short, no. The longer answer is that only a perfect oracle would be able to solve the question you posed, and since no one has ever proven the existence of a perfect oracle, it is currently believed to be impossible. The other side of it isn't that it is impossible, just that we as a collective are not intelligent enough to figure it out. Similar to P != NP.
I'm curious why Object.toString() returns this:
return getClass().getName() + "@" + Integer.toHexString(hashCode());
as opposed to this:
return getClass().getName() + "@" + hashCode();
What benefits does displaying the hash code as a hex rather than a decimal buy you?
The Short Answer:
Hash Codes are usually displayed in hexadecimal because this way it is easier for us to retain them in our short-term memory, since hexadecimal numbers are shorter and have a larger character variety than the same numbers expressed in decimal.
Also, (as supercat states in a comment,) hexadecimal representation tends to prevent folks from trying to assign some meaning to the numbers, because they don't have any. (To use supercat's example, Fnord@194 is absolutely not the 194th Fnord; it is just Fnord with some unique number next to it.)
The Long Answer:
Decimal is convenient for two things:
Doing arithmetic
Estimating magnitude
However, these operations are inapplicable to hashcodes. You are certainly not going to be adding hashcodes together in your head, nor would you ever care how big a hashcode is compared to another hashcode.
What you are likely to be doing with hashcodes is the one and only thing that they were intended for: to tell whether two hash codes possibly refer to the same object, or definitely refer to different objects.
In other words, you will be using them as unique identifiers or mnemonics for objects. Thus, the fact that a hashcode is a number is in fact entirely irrelevant; you might as well think of it as a hash string.
Well, it just so happens that our brains find it a lot easier to retain in short-term memory (for the purpose of comparison) short strings consisting of 16 different characters, than longer strings consisting of only 10 different characters.
To further illustrate the analogy by taking it to absurdity, imagine if hash codes were represented in binary, where each number is far longer than in decimal, and has a much smaller character variety. If you saw the hash code 010001011011100010100100101011 now, and again 10 seconds later, would you stand the slightest chance of being able to tell that you are looking at the same hash code? (I can't, even if I am looking at the two numbers simultaneously. I have to compare them digit by digit.)
On the opposite end lies the tetrasexagesimal numbering system, which means base 64. Numbers in this system consist of:
the digits 0-9, plus:
the uppercase letters A-Z, plus:
the lowercase letters a-z, plus:
a couple of symbols like '+' and '/' to reach 64.
Tetrasexagesimal obviously has a much greater character variety than lower-base systems, and it should come as no surprise that numbers expressed in it are admirably terse. (I am not sure why the JVM is not using this system for hashcodes; perhaps some prude feared that chance might lead to certain inconvenient four-letter words being formed?)
So, on a hypothetical JVM with 32-bit object hash codes, the hash code of your "Foo" object could look like any of the following:
Binary: com.acme.Foo@11000001110101010110101100100011
Decimal: com.acme.Foo@3251989283
Hexadecimal: com.acme.Foo@C1D56B23
Tetrasexagesimal: com.acme.Foo@31rMiZ
Which one would you prefer?
I would definitely prefer the tetrasexagesimal, and in lack of that, I would settle for the hexadecimal one. Most people would agree.
One web site where you can play with conversions is here:
https://www.mobilefish.com/services/big_number/big_number.php
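For a hedged sketch of how those four representations could be produced in Java: the binary, decimal, and hexadecimal forms come straight from the Integer class, while the tetrasexagesimal conversion is hand-rolled here (an assumption of mine, since Integer.toString only supports radixes up to 36), using the alphabet ordering described above:

final class HashCodeFormats {
    private static final String BASE64_ALPHABET =
            "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz+/";

    static String toTetrasexagesimal(long value) {
        if (value == 0) return "0";
        StringBuilder sb = new StringBuilder();
        while (value > 0) {
            sb.insert(0, BASE64_ALPHABET.charAt((int) (value % 64)));
            value /= 64;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int hash = 0xC1D56B23;  // the example hash code from above
        long unsigned = Integer.toUnsignedLong(hash);
        System.out.println("Binary:           " + Integer.toBinaryString(hash));
        System.out.println("Decimal:          " + unsigned);
        System.out.println("Hexadecimal:      " + Integer.toHexString(hash).toUpperCase());
        System.out.println("Tetrasexagesimal: " + toTetrasexagesimal(unsigned));
    }
}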
Object.hashCode used to be computed based on the memory location where the object is located. Memory locations are almost universally displayed in hexadecimal.
The default return value of toString isn't so much interested in the hash code but rather in a way to uniquely identify the object for the purpose of debugging, and the hash code serves well for the purpose of identification (in fact, the combination of class name + memory address is truly unique; and while a hash code isn't guaranteed to be unique, it often comes close).