Unique strings in java

Is there a function in Java that takes in two strings and generates one 16-character string which is unique to the given combination? I don't expect the string to be 100% unique, as long as the probability of having two conflicting strings is very small (1 in 100,000, for example). Thanks.

You can concatenate both strings and hash them.
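A minimal sketch of that suggestion, assuming SHA-256 as the hash and hex as the output alphabet (neither is specified in the answer). The class and method names are hypothetical. The length prefix is an added precaution so that ("ab", "c") and ("a", "bc") do not produce the same concatenation:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of the JDK.
final class PairKey {
    // Derives a 16-character hex key (64 bits of the digest) from two strings.
    static String keyFor(String a, String b) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            // Prefix the length of the first string so the split point is unambiguous.
            String joined = a.length() + ":" + a + b;
            byte[] digest = md.digest(joined.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(16);
            for (int i = 0; i < 8; i++) {
                sb.append(String.format("%02x", digest[i])); // two hex digits per byte
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }
}
```

The same pair of inputs always yields the same key. Different pairs can still collide, just with very low probability.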

If it needs to be truly unique, then a concatenation of the strings that is 16 characters or fewer is your answer.
Otherwise you have to rely on hashing, which makes a clash unlikely but comes with no guarantee against one.
Your best bet is to use a GUID.

Related

Hash Function MD5 30 length

From a given string I am generating a 32-digit hash code using MD5:
import java.security.MessageDigest
MessageDigest.getInstance("MD5")
.digest("SOME-BIG-STRING".getBytes("UTF-8")).map("%02x".format(_)).mkString
//output: 47a8899bdd7213fb1baab6cd493474b4
Is it possible to generate a 30-digit hash instead of a 32-digit one, and what problems would that cause?
Is there another hash algorithm that supports 30-character output with a low collision probability across 1 trillion unique strings?
Security is not important; uniqueness is required.

For generating unique IDs from strings, hash functions are never the correct answer.
What you need is to define a one-to-one mapping of text strings (such as "v1.0.0") onto 30-character-long strings (such as "123123..."). This is also known as a bijection, although in your case an injection (a one-to-one mapping from inputs to outputs, not necessarily onto) may be enough. As the other answer at the time of this writing notes, hash functions don't ensure this mapping, but there are other possibilities, such as full-period linear congruential generators (if they take a seed that you can map one-to-one onto input string values), or other reversible functions.
However, if the set of possible input strings is larger than the set of possible output strings, then you can't map all input strings one-to-one with all output strings (without creating duplicates), due to the pigeonhole principle.
See also this question: How to generate a GUID with a custom alphabet, that behaves similar to an MD5 hash (in JavaScript)?.
Indeed, if you use hash functions, the chance of collision will be close to zero but never exactly zero (meaning that the risk of duplicates will always be there). If we take MD5 as an example (which produces any of 2^128 hash codes), then roughly speaking, the chance of accidental collision becomes non-negligible only after 2^64 IDs are generated, which is well over 1 trillion.
But MD5 and other hash functions are not the right way to do what you want to do. This is discussed next.
If you can't restrict the format of your input strings to 30 digits and can't compress them to 30 digits or less and can't tolerate the risk of duplicates, then the next best thing is to create a database table mapping your input strings to randomly generated IDs.
This database table should have two columns: one column holds your input strings (e.g., "<UUID>-NAME-<UUID>"), and the other column holds randomly generated IDs associated with those strings. Since random numbers don't ensure uniqueness, every time you create a new random ID you will need to check whether the random ID already exists in the database, and if it does exist, try a new random ID (but the chance that a duplicate is found will shrink as the size of the ID grows).
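The check-and-retry loop described above can be sketched as follows. This is only an illustration: an in-memory map stands in for the database table, the IDs are assumed to be 30 decimal digits, and the class name IdRegistry is hypothetical.

```java
import java.security.SecureRandom;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory stand-in for the two-column table.
final class IdRegistry {
    private final Map<String, String> idByInput = new HashMap<>(); // input string -> ID
    private final Set<String> usedIds = new HashSet<>();           // IDs already handed out
    private final SecureRandom random = new SecureRandom();

    // Returns the existing ID for the input, or assigns a fresh random one.
    synchronized String idFor(String input) {
        String existing = idByInput.get(input);
        if (existing != null) {
            return existing;
        }
        String candidate;
        do {
            candidate = randomDigits(30);
        } while (!usedIds.add(candidate)); // add() returns false on a duplicate: retry
        idByInput.put(input, candidate);
        return candidate;
    }

    private String randomDigits(int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append((char) ('0' + random.nextInt(10)));
        }
        return sb.toString();
    }
}
```

With a real database, a unique constraint on the ID column would presumably take the place of the usedIds set, with the insert retried when the constraint is violated.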

Is it possible to generate a 30-digit hash instead of a 32-digit one, and what problems would that cause?
Yes.
You increase the probability of a collision by a factor of 2^8 (256), because dropping two hex digits discards 8 bits of the hash.
Is there another hash algorithm that supports 30-character output with a low collision probability across 1 trillion unique strings?
Probably. Taking the first 30 hex digits of a hash produced by any crypto-strength hash algorithm has roughly equivalent uniqueness properties.
Security is not important; uniqueness is required?
In that case, the fact that MD5 is no longer considered secure is moot. (Note that the reason that MD5 is no longer considered secure is that it is computationally feasible to engineer a collision; i.e. to construct two different inputs that have the same MD5 hash.)
However, uniqueness of hashes cannot be guaranteed. Even with a "perfect" crypto-strength hash function that generates N-bit hashes, the probability of a collision for any 2 arbitrary (different) inputs is one in 2^N. For large enough values of N, the probability is very small. But it is never zero.
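For illustration, truncating an MD5 digest to its first 30 hex digits might look like this in Java. The class and method names are hypothetical, and UTF-8 is an assumed input encoding:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of the JDK.
final class TruncatedHash {
    // Returns the first 30 hex digits (120 bits) of the MD5 digest of the input.
    static String hex30(String input) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(32);
            for (byte b : digest) {
                sb.append(String.format("%02x", b)); // two hex digits per byte
            }
            return sb.substring(0, 30); // drop the last two hex digits (8 bits)
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The result is still a 120-bit value, so collisions between arbitrary inputs remain very unlikely, but they are not impossible.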

Is it possible to use hashCode as a content unique ID?

I need to compare two large strings. Rather than using an equals method, is there something like hashCode that generates a unique ID for a String?
I ask because my strings are very large. I also need a distinct ID for each distinct content. Is it possible to use String's hashCode for this purpose?

The purpose of hashCode is to provide a quick means of identifying most of the circumstances where two objects would compare unequal. A hash function which has a 1% false positive rate would for most purposes be considered superior to one that has a 0% false positive rate, but takes twice as long.
There are some hashing functions which are designed for use as "digests", such that two different strings of arbitrary length would be very unlikely to have the same digest. In order to be very effective, however, digests need to be much larger than a 32-bit hashcode value. A well-designed 64-byte (512 bit) digest would generally be adequate to guard strings of any length well enough that one would be more likely to get struck by lightning twice on the same weekend as one wins five state lotteries than to find two different strings that yield the same digest. The cost of computing a good digest function for a string would be much greater than that of comparing the string to another string, but if each string will be compared against many other strings, computing each digest function once and comparing it to the digests of every other string may offer a major performance win.
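A rough sketch of the "compute each digest once, then compare digests" idea. It uses SHA-256 (a 32-byte digest) because every Java platform is required to provide it; a 64-byte algorithm such as SHA-512 could be substituted where the runtime supports it. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical helper, not part of the JDK.
final class DigestCompare {
    // Computes the digest once; the caller keeps it alongside the string.
    static byte[] digestOf(String s) {
        try {
            return MessageDigest.getInstance("SHA-256")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Comparing two 32-byte digests is cheap regardless of how long the strings were.
    static boolean sameContent(byte[] digestA, byte[] digestB) {
        return Arrays.equals(digestA, digestB);
    }
}
```

Equal digests indicate equal content only with overwhelming probability, not with certainty. A final equals check on matching pairs removes the remaining doubt.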

Represent a set of Strings in unique numbers which can be sorted based on the number as well.

I am looking for a mechanism in which I can represent a set of strings using unique numbers, so that when I want to sort them I can use the numbers to sort these values.
For example, this is what I have in mind:
I am keeping a fixed-length number of 20 digits
Each letter is represented by two digits giving its ASCII value or its position in the alphabet
cat - (03)(01)(20)(00)(00)(00)(00)(00)(00)(00) - 03012000000000000000
cataract - (03)(01)(20)(01)(18)(01)(03)(20)(00)(00) - 03012001180103200000
capital - (03)(01)(16)(09)(20)(01)(12)(00)(00)(00) - 03011609200112000000
So if I sort it based on the numbers, it will sort and say
capital, cat, cataract
Is this a good way of doing this?
Is there any other way for doing this so that I have more accuracy?
Thanks,
Sen

If your string length is fixed and your character set is fixed to, say, 100 different characters, you could treat each character of your string as a digit in a base-100 number to turn the string into a number. (A double holds such a value exactly only up to 2^53, so longer strings need a wider or arbitrary-precision type.)
If your set of strings is much smaller than the set of possible strings, you could hash them and, for collisions, define the sort order arbitrarily but consistently.
In a specific case I probably wouldn't recommend either of those, but as a SUPER GENERAL solution to what you stated it works. But if you ask what seems to be a theoretical question, a theoretical answer seems appropriate.
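As a sketch of the base-100 idea applied to the scheme in the question, the following assumes lowercase a-z only, a fixed width of 10 characters, and 00 as padding. The class name SortableCode is hypothetical. BigInteger is used because a 20-digit value does not fit in a long.

```java
import java.math.BigInteger;

// Hypothetical helper, not part of the JDK.
final class SortableCode {
    static final int WIDTH = 10; // fixed number of character slots (20 decimal digits)

    // Encodes a lowercase word as a base-100 number: a=01 ... z=26, padding=00.
    static BigInteger encode(String word) {
        if (word.length() > WIDTH) {
            throw new IllegalArgumentException("word longer than " + WIDTH + " characters");
        }
        BigInteger base = BigInteger.valueOf(100);
        BigInteger value = BigInteger.ZERO;
        for (int i = 0; i < WIDTH; i++) {
            int digit = 0; // padding for positions past the end of the word
            if (i < word.length()) {
                char c = word.charAt(i);
                if (c < 'a' || c > 'z') {
                    throw new IllegalArgumentException("only a-z supported: " + c);
                }
                digit = c - 'a' + 1;
            }
            value = value.multiply(base).add(BigInteger.valueOf(digit));
        }
        return value;
    }
}
```

Because padding is smaller than every letter code, numeric order of the encoded values matches alphabetical order of the words, and distinct words within the width limit get distinct numbers.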

Can hashcodes of short string be same?

I have short Strings (less than 10 characters). I will convert it into int and use it as primary key. (I can't use String primary key because of small problems.) I know the hashcodes of Strings of infinite length can collide, but can short Strings collide too?

Absolutely yes. For example, Ea and FB are colliding strings, each only two characters in length! Example:
public class HashCollision {
    public static void main(String[] args) {
        System.out.println("Ea".hashCode() + " " + "FB".hashCode());
    }
}
Prints 2236 2236.
The Java String#hashCode function isn't really even close to random. It's really easy to generate collisions for short strings, and it doesn't fare much better on long strings.
In general, even if you stuck to just 6 bits per character (ASCII letters and numbers, and a couple of symbols), you'd exceed the possible values of the 32-bit hash code with only a 6-character string -- that is, you would absolutely guarantee collisions among the 2^36 6-character 6-bit strings.
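To see how common such collisions are among short strings, a small experiment can count the distinct hash codes over all two-letter strings. This is only a sketch; the class name and the choice of A-Z plus a-z as the alphabet are assumptions for illustration.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical experiment, not part of the JDK.
final class ShortCollisions {
    static final String LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

    // Number of two-letter strings over the alphabet above.
    static int totalStrings() {
        return LETTERS.length() * LETTERS.length();
    }

    // Number of distinct String.hashCode() values among those strings.
    static int distinctHashes() {
        Set<Integer> seen = new HashSet<>();
        for (char a : LETTERS.toCharArray()) {
            for (char b : LETTERS.toCharArray()) {
                seen.add(("" + a + b).hashCode());
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println(distinctHashes() + " distinct hash codes for "
                + totalStrings() + " strings");
    }
}
```

Since "Ea" and "FB" are both in this set, the number of distinct hash codes is necessarily smaller than the number of strings.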

A hash code is 32 bits in size.
A char in Java is 16 bits in size.
So in theory, all 2-character strings could have different hash codes, although some of those hash codes would have to collide with the hash code of the empty string and single-character strings. So even taking "all strings of two characters or shorter" there will be collisions. By the time you've got 10 characters, there are way more possible strings than there are hash codes available.
Collisions are still likely to be rare, but you should always assume that they can happen.

Is String.hashCode() inefficient? [closed]

When looking at the source code of java.lang.String in openjdk-1.6, I saw that String.hashCode() uses 31 as its prime and computes
s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
Now, the reason I looked at this was that I wondered whether comparing hashCodes in String.equals would make String.equals significantly faster. But looking at hashCode now, the following questions come to mind:
Wouldn't a bigger prime help avoid collisions better, at least for short strings, seeing that for example "BC" has the same hash as "Ab" (since the ascii letters live in the region 65-122, wouldn't a prime higher than that work better)?
Is it a conscious decision to use 31 as prime, or just some random one that is used because it is common?
How likely is a hash collision, given a fixed String length? Where this question is heading is the original question of how well comparing hashCodes and String lengths could already discern strings, to avoid comparing the actual contents.
A little off-topic, maybe: Is there a good reason String.equals does not compare hashCodes as an additional shortcut?
A little more off-topic: assuming we have two Strings with the same content, but different instances: is there any way to assert equality without actually comparing the contents? I would guess not, since at some point as String lengths grow, the space explodes into sizes where we will inevitably have collisions, but what about some restrictions - only a certain character set, a maximum string length... and how much do we need to restrict the string space to be able to have such a hash function?

Wouldn't a bigger prime help avoid collisions better, at least for short strings, seeing that for example "BC" has the same hash as "Ab" (since the ascii letters live in the region 65-122, wouldn't a prime higher than that work better)?
Each character in a String can take 65536 values (2^16). The set of Strings of 1 or 2 characters is therefore larger than the number of int values, and any hashcode calculation method will produce collisions for strings that are 1 or 2 characters long (which qualify as short strings, I suppose).
If you restrict your character set, you can find a hash function that reduces the number of collisions (see below).
Note that a good hash must also provide a good distribution of output. A comment buried in this code advocates using 33 and gives the following reasons (emphasis mine):
If one compares the chi^2 values [...] of the variants the number 33 not even has the best value. But the number 33 and a few other equally good numbers like 17, 31, 63, 127 and 129 have nevertheless a great advantage to the remaining numbers in the large set of possible multipliers: their multiply operation can be replaced by a faster operation based on just one shift plus either a single addition or subtraction operation. And because a hash function has to both distribute good and has to be very fast to compute, those few numbers should be preferred.
Now these formulae were designed a while ago. Even if it appeared now that they are not ideal, it would be impossible to change the implementation because it is documented in the contract of the String class.
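The "one shift plus a subtraction" remark can be made concrete for the multiplier 31: in int arithmetic, 31 * h equals (h << 5) - h. The sketch below re-implements the documented formula that way. The class name PolyHash is hypothetical, and this is not claimed to be how the JDK source is written.

```java
// Hypothetical re-implementation for illustration, not the JDK source.
final class PolyHash {
    // Computes s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1] with int overflow,
    // replacing the multiplication by 31 with a shift and a subtraction.
    static int hash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = (h << 5) - h + s.charAt(i); // same value as 31 * h + s.charAt(i)
        }
        return h;
    }

    public static void main(String[] args) {
        // Compare against the built-in implementation.
        System.out.println(hash("BC") == "BC".hashCode());
        System.out.println(hash("BC") == hash("Ab")); // the collision from the question
    }
}
```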
Is it a conscious decision to use 31 as prime, or just some random one that is used because it is common?
Why does Java's hashCode() in String use 31 as a multiplier?
How likely is a hash collision, given a fixed String length?
Assuming each possible int value is equally likely to be the result of the hashcode function, the probability that two given distinct strings collide is 1 in 2^32.
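That figure is for a single pair. For a whole collection of n strings, the standard birthday approximation, 1 - exp(-n(n-1) / (2 * 2^bits)), gives a rough estimate of the chance that at least one pair collides. The sketch below assumes uniformly distributed hashes, which String.hashCode only approximates, and the class name is hypothetical.

```java
// Hypothetical helper, not part of the JDK.
final class Birthday {
    // Approximate probability that n uniformly random values of the given
    // bit width contain at least one duplicate.
    static double collisionProbability(double n, int bits) {
        double space = Math.pow(2.0, bits);
        // -expm1(-x) computes 1 - exp(-x) without losing precision for tiny x.
        return -Math.expm1(-n * (n - 1) / (2.0 * space));
    }

    public static void main(String[] args) {
        System.out.println(collisionProbability(2, 32));      // one pair, 32-bit hash
        System.out.println(collisionProbability(77_163, 32)); // about 77 thousand strings
    }
}
```

Under these assumptions, a 32-bit hash reaches roughly even odds of a collision at around 77,000 strings.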
Is there a good reason String.equals does not compare hashCodes as additional shortcut?
Why does the equals method in String not use hash?
assuming we have two Strings with the same content, but different instances: is there any way to assert equality without actually comparing the contents?
Without any constraint on the string, there isn't. You could intern the strings then check for reference equality (==), but if many strings are involved, that can be inefficient.
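A small sketch of the interning approach, with a hypothetical class name. Note that intern() itself has to hash and compare contents when it looks the string up in the pool, so the saving only comes from repeated comparisons afterwards.

```java
// Hypothetical helper, not part of the JDK.
final class InternDemo {
    // After interning, equal contents share one canonical instance,
    // so reference equality is sufficient.
    static boolean sameAfterIntern(String a, String b) {
        return a.intern() == b.intern();
    }

    public static void main(String[] args) {
        String a = new String("hello"); // two distinct instances
        String b = new String("hello"); // with the same content
        System.out.println(a == b);                // false: different objects
        System.out.println(sameAfterIntern(a, b)); // true: same canonical instance
    }
}
```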
how much do we need to restrict the string space to be able to have such a hash function?
If you only allow lowercase letters (26 characters), you could design a hash function that generates unique hashes for all strings of length 0 to 6 characters (inclusive), because there are only sum(i=0..6) (26^i) = 321,272,407 (about 3.2 * 10^8) such strings, fewer than the 2^32 possible int values.
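One possible construction of such a collision-free function, as a sketch: number the strings by length first, then by base-26 value within each length. The class name LowercaseId is hypothetical.

```java
// Hypothetical helper, not part of the JDK.
final class LowercaseId {
    // Maps each string of 0 to 6 lowercase letters to a distinct int
    // in the range 0 .. 321,272,406.
    static int idOf(String s) {
        if (s.length() > 6) {
            throw new IllegalArgumentException("at most 6 characters supported");
        }
        int offset = 0; // how many strings are shorter than s
        int power = 1;  // 26^i for the current position
        int value = 0;  // s read as a base-26 number
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 'a' || c > 'z') {
                throw new IllegalArgumentException("only a-z supported: " + c);
            }
            offset += power;
            power *= 26;
            value = value * 26 + (c - 'a');
        }
        return offset + value;
    }
}
```

The empty string maps to 0, "a" to "z" map to 1 through 26, "aa" maps to 27, and so on, so no two valid inputs share an ID.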
