Is it possible to get String value back from its hash code? - java

Java doc for method String#hashCode() says:
Returns a hash code for this string. The hash code for a String object is computed as
s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
using int arithmetic, where s[i] is the ith character of the string, n is the length of the string, and ^ indicates exponentiation. (The hash value of the empty string is zero.)
Questions:
Is it possible to have same hash code for two string objects having different values? If yes then please share some examples.
Is it possible to get String value back from its hash code?
I am not using it anywhere in code. I have just asked this question to learn more about the Java String class.

Is it possible to have same hash code for two string objects having different values? If yes then please share some examples.
Here is a small sample of randomly generated examples of short strings with identical hash codes:
String 1 String 2 Common hash code
-------- -------- ----------------
VTBHKIGV - FLXCLLII -1242944431
FPESRBAH - GNFWMYVA 1778061647
UYDHRTXL - HGCNRCBE 1509241566
VXQMFMDE - YMYXDWKK -1553987354
VGWBSYRX - JZNQSUXK 700334696
Since multiple strings can share the same hash code, restoring the original from the hash is not possible.
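Pairs like the ones in the table can be found by a simple random search. The sketch below is only an illustration, not the method used to produce the table: the class name RandomCollisionSearch, the fixed seed and the choice of 8 uppercase letters are all assumptions. It remembers the hash code of every string generated so far and stops at the first repeat, which by the birthday paradox tends to show up long before 2^32 strings have been tried.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Hypothetical helper class, named for this example only.
class RandomCollisionSearch {
    // Returns two different 8-letter strings that have the same hashCode().
    static String[] findCollision(long seed) {
        Random rnd = new Random(seed);
        Map<Integer, String> seen = new HashMap<>();   // hash code -> last string seen with it
        while (true) {
            char[] chars = new char[8];
            for (int i = 0; i < chars.length; i++) {
                chars[i] = (char) ('A' + rnd.nextInt(26));
            }
            String s = new String(chars);
            String earlier = seen.put(s.hashCode(), s);
            if (earlier != null && !earlier.equals(s)) {
                return new String[] { earlier, s };
            }
        }
    }

    public static void main(String[] args) {
        String[] pair = findCollision(42L);
        System.out.println(pair[0] + " - " + pair[1] + " " + pair[0].hashCode());
    }
}
```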

Is it possible to have same hash code for two string objects having different values?
Yes. There are infinitely many possible strings but only 2^32 possible int values, so some strings have to share a hash code.
Is it possible to get String value back from its hash code?
No, for the same reason as in the first answer: many different strings map to each hash code.

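It's absolutely possible to have two different strings (or objects) with the same hash code. That's why hash-based collections have collision handling. So in general it's not possible to get the string value back from the hash code. The reason is that an int has only 2^32 possible values while there are far more possible strings, so many strings necessarily share each hash code. For strings longer than a few characters the int arithmetic also overflows and wraps around, which discards even more information.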

Assume your string is 2 characters long:
c1, c2
Your hash is 31*c1 + c2.
Can you think of different values that will map to the same hash?
It is worse for longer strings.
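To answer the question above with a concrete pair: raising c1 by 1 and lowering c2 by 31 leaves 31*c1 + c2 unchanged. With 'A' = 65, 'a' = 97 and 'B' = 66 that gives "Aa" and "BB". A small sketch (the class and method names are made up for this example):

```java
// Hypothetical class name; only String.hashCode() comes from the JDK.
class TwoCharCollision {
    // Hash of a two-character string per the documented formula: 31*c1 + c2.
    static int hash2(char c1, char c2) {
        return 31 * c1 + c2;
    }

    public static void main(String[] args) {
        System.out.println(hash2('A', 'a'));   // 31*65 + 97
        System.out.println(hash2('B', 'B'));   // 31*66 + 66
        System.out.println("Aa".hashCode() == "BB".hashCode());
    }
}
```

Both calls give 2112, and the comparison of the real hashCode() values prints true.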

Related

How to get the string from its String.hashCode() value? [duplicate]

This question already has answers here:
how can I get the String from hashCode
(4 answers)
Closed 3 years ago.
I need to somehow get the text from its hash in java.
I have this code:
String myString = new String("creashaks organzine");
int hashCode = myString.hashCode();
System.out.println("Hash:" + hashCode);
The result of this code will be 0.
But the hash of "pollinating sandboxes" string will also be 0.
There might be collisions, for example with "creashaks organzine" and "pollinating sandboxes" and I want to find collisions like in this case.
Since I don't have enough reputation to add a comment, I will quote the solution from another question.
You know that several objects can have the same hashCode(), as mentioned in the Javadoc for Object.hashCode():
It is not required that if two objects are unequal
according to the {@link java.lang.Object#equals(java.lang.Object)}
method, then calling the {@code hashCode} method on each of the
two objects must produce distinct integer results.
Since different objects can share the same hash code, you obviously can't tell from the hash code alone which object produced it, so restoring the original is impossible in general.
how can I get the String from hashCode
This is a very interesting question. The specification at https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/String.html#hashCode() says that the hash code is calculated from the string content, but at first glance the example seems to show that this is not true for the first string:
class Main
{
    public static void main(String[] args)
    {
        String myString1 = "creashaks organzine";
        String myString2 = "crsomething else";
        String myString3 = "crsomething else";
        System.out.println("Hash1:" + myString1.hashCode());
        System.out.println("Hash2:" + myString2.hashCode());
        System.out.println("Hash3:" + myString3.hashCode());
    }
}
Outputs:
Hash1:0
Hash2:444616526
Hash3:444616526
But when I modify the string, then I get a different output:
String myString1 = "creashaks organzine...";
System.out.println("Hash1:" + myString1.hashCode());
Outputs:
Hash1:45678
So it seems that somebody tricked us by giving a very rare example string whose hash code is exactly 0. Here you see that hash codes are not unique, so you cannot safely use them on their own to decide whether two strings are equal.
Coming back to your initial question: The hashCode is a number with reduced details, so you cannot calculate it back to the original string. This applies to all hash codes.
This is also why hash values (cryptographic ones, not String.hashCode()) are so often stored in server-side databases instead of the real password strings. They can be compared but not reconstructed.
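If the goal is to find collisions rather than to recover "the" original, a brute-force search is one option for short strings. The following is a rough sketch under assumptions: the class name PreimageSearch, the letters-only alphabet and the length limit are chosen for illustration, and the approach does not scale to long strings such as the two examples from the question.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, named for this example only.
class PreimageSearch {
    static final String ALPHABET =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

    // Collects every string over ALPHABET with length <= maxLen
    // whose hashCode() equals target.
    static List<String> find(int target, int maxLen) {
        List<String> hits = new ArrayList<>();
        search("", target, maxLen, hits);
        return hits;
    }

    private static void search(String prefix, int target, int maxLen, List<String> hits) {
        if (prefix.hashCode() == target) {
            hits.add(prefix);
        }
        if (prefix.length() == maxLen) {
            return;
        }
        for (int i = 0; i < ALPHABET.length(); i++) {
            search(prefix + ALPHABET.charAt(i), target, maxLen, hits);
        }
    }

    public static void main(String[] args) {
        // Every letters-only string of up to 3 characters that hashes like "Aa".
        System.out.println(find("Aa".hashCode(), 3));
    }
}
```

Every result is a valid answer to "which string has this hash code?", which is exactly why a single original cannot be reconstructed.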

java - is there a way to confirm that a string is a sha256 hash?

I'd like to validate that a String is a sha256 representation of another without having to decrypt it. Is this possible?
Yes and no.
You can test that a string is hex very easily. You can then test that it contains a statistically sensible number of digits and letters. That will rule out some common non sha256 strings.
But if someone creates a random string designed to look like a sha256, I don't think it's possible to distinguish it from the real thing by any mathematical test. The algorithm is designed to be robust to that.
A SHA-256 value is just a 256-bit (32-byte) value, which you usually represent as a String or as a byte[] in Java.
As a value per se it's pointless: if you want to tell whether a specific String is a hash, then any 32-byte number is the hash of infinitely many unknown plain texts. It's like asking "how do I know that a 32-byte number is a number?"; you see that you are going nowhere.
It's useful only when it's paired to a plain text so that you can compare it with the hash computed from the plain text to verify they match.
I think what you could do is to hash the other string and then compare these two strings with each other.
No idea if this would help you, but I read that it is common practice when creating rainbow tables for cracking passwords.
EDIT: Oh forgot this is also the way to compare passwords in php when you login to a webpage iirc. At least I had to do it like this for university.
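Putting the answers above together, a check could look roughly like this: first test that the string has the shape of a hex-encoded SHA-256 value (64 hex digits), then hash the other string and compare. This is a sketch, assuming the hash is hex-encoded and the plain text was hashed as UTF-8; the class and method names are invented for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class, named for this example only.
class Sha256Check {
    // True if s has the shape of a hex-encoded SHA-256 value: exactly 64 hex digits.
    // This cannot prove that s really came out of SHA-256.
    static boolean looksLikeSha256(String s) {
        return s != null && s.matches("[0-9a-fA-F]{64}");
    }

    // SHA-256 of the UTF-8 bytes of text, as 64 lowercase hex digits.
    static String sha256Hex(String text) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    // True if candidateHex is the SHA-256 of plainText.
    static boolean isSha256Of(String candidateHex, String plainText) {
        return looksLikeSha256(candidateHex)
                && sha256Hex(plainText).equalsIgnoreCase(candidateHex);
    }

    public static void main(String[] args) {
        String hash = sha256Hex("abc");
        System.out.println(looksLikeSha256(hash));
        System.out.println(isSha256Of(hash, "abc"));
        System.out.println(isSha256Of(hash, "abd"));
    }
}
```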

What does exact value returned by their hashCode mean?

I'm reading Effective Java 2nd Edition by Joshua Bloch.
In this paragraph, he mentions that:
Many classes in the Java platform libraries, such as String, Integer, and Date, include in their specifications the exact value returned by their hashCode method as a function of the instance value. This is generally not a good idea, as it severely limits your ability to improve the hash function in future releases. If you leave the details of a hash function unspecified and a flaw is found or a better hash function discovered, you can change the hash function in a subsequent release, confident that no clients depend on the exact values returned by the hash function.
Can anyone please share some insights what he means by 'exact' values. I had a look at the String implementation class but still unable to understand what he means...
Thanks in advance!
From String.hashCode():
Returns a hash code for this string. The hash code for a String object
is computed as
s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
By giving out the definition in the javadoc, people may write code that depends on exactly that hashing algorithm. Changing the hash algorithm in a future release would then break that code.
It means that the value returned by hashCode() is prescribed in the Javadoc.
e.g.
String.hashCode():
Returns a hash code for this string. The hash code for a String object is computed as
s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
Integer.hashCode():
[Returns] a hash code value for this object, equal to the primitive int value represented by this Integer object.
Date.hashCode():
Returns a hash code value for this object. The result is the exclusive OR of the two halves of the primitive long value returned by the getTime() method. That is, the hash code is the value of the expression:
(int)(this.getTime()^(this.getTime() >>> 32))
If you look at the API documentation of java.lang.String.hashCode(), it describes exactly how the method is implemented:
Returns a hash code for this string. The hash code for a String object is computed as
s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
using int arithmetic, where s[i] is the ith character of the string, n is the length of the string, and ^ indicates exponentiation. (The hash value of the empty string is zero.)
What Bloch says, is that it is a mistake that classes such as String describe the implementation details in the API documentation, because this means that programmers can count on the hashCode method being implemented this way. If, in a future Java release, Oracle wants to implement a different, maybe more efficient algorithm to calculate a hash code for a string, then that would be a backward compatibility problem - the behaviour might change compared to previous Java versions.
By describing the implementation in detail in the API documentation, the way it is implemented has become part of the official specification of the Java API.
In general, API documentation should just describe what the purpose is of the method, and not exactly how it is implemented.
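To make "exact value" concrete: because the formulas are part of the Javadoc, anyone can recompute the hash codes outside the JDK, and such code would break if the algorithm ever changed. A small sketch (the class name and the documentedHash method are made up for this example) that evaluates the documented String formula with Horner's rule and also checks the Integer and Date definitions quoted above:

```java
import java.util.Date;

// Hypothetical class name; documentedHash is not a JDK method.
class DocumentedHashes {
    // s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1] in int arithmetic,
    // evaluated with Horner's rule so no explicit powers are needed.
    static int documentedHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        String s = "creashaks organzine...";
        System.out.println(documentedHash(s) == s.hashCode());

        // Integer: the hash code is the int value itself.
        System.out.println(Integer.valueOf(42).hashCode() == 42);

        // Date: XOR of the two halves of getTime().
        long t = 1234567890123L;
        System.out.println(new Date(t).hashCode() == (int) (t ^ (t >>> 32)));
    }
}
```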

I need to assign a random but unique ID to each row of a MySQL table. The ID should be the same if the row contains the same values

I need to assign a random but unique ID to each row in a MySQL table. The ID should be the same if the row contains the same values.
I.e., if the 1st row contains [hi,hello,bye], the 2nd row contains [gg,hello,bye] and the 3rd row contains [hi,hello,bye], then the 1st and 3rd rows should generate the same ID and the 2nd row should generate a different ID.
Thanks in advance.
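An MD5 hash could work. Below is quick and dirty code that would need to be cleaned up, but it proves the concept.
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class RowHash {
    public static void main(String[] args) {
        // Each row is represented here as its comma-joined column values.
        String test1 = "hi,hello,bye";
        String test2 = "gg,hello,bye";
        String test3 = "hi,hello,bye";
        System.out.println("row1=" + test1 + ":" + getHash(test1));
        System.out.println("row2=" + test2 + ":" + getHash(test2));
        System.out.println("row3=" + test3 + ":" + getHash(test3));
    }

    // Returns the MD5 digest of inputStr as 32 lowercase hex characters.
    private static String getHash(String inputStr) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] byteData = md.digest(inputStr.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : byteData) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}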
row1=hi,hello,bye:cfe40e96aa052a484208c2aefb6f39bb
row2=gg,hello,bye:f652785f0e214507e6aea44ecd3ffb7a
row3=hi,hello,bye:cfe40e96aa052a484208c2aefb6f39bb
SELECT CRC32(CONCAT(column1, column2, column3)) FROM MyTable.
Technically CRC32 is not random (but what is?) -- and it has a small chance of generating collisions (different values mapping to the same integer). But it's a start.
If you really want proof that you don't get collisions, everything boils down to concatenating all fields with a separator not contained in the fields. Of course this normally will be really long and cumbersome to work with.
What everybody normally does is feed that string into a hash function. While theoretically not unique, given a suitable hash function with a large enough result, you should be able to find one that is unlikely to produce a collision during the lifetime of the human race. For example, git uses such a hash (SHA-1), and Linus Torvalds writes about the chance of an accidental collision:
First off, let me remind people that the inadvertent kind
of collision is really really really damn unlikely, so we'll quite
likely never ever see it in the full history of the universe.
A different thing is a not so accidental collision. At the very first you should make sure that the string you start with isn't the same for different columns. This means:
Make sure all columns are contained
Make sure columns are separated by something not contained in the columns themselves. Use escaping if necessary. For example, if you just concatenate two columns, the values 'abc' + 'def' will give you the same result as 'a' + 'bcdef'
If you have to worry about targeted attacks, i.e. someone actually trying to create entries with the same hash, your best bet is to use a cryptographic hash, possibly one used for password hashing which are often designed to be slow, in order to prevent brute force attacks. Of course this might collide with the requirement for most applications to be as fast as possible.
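The two rules above (include all columns, keep them unambiguously separated) can be sketched as follows. This is only an illustration under assumptions: the class name RowId, the '|' separator, the backslash escaping and the choice of SHA-256 are not from the answer, and column values are assumed to be non-null.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class, named for this example only.
class RowId {
    // Joins the columns with '|' after each value. Backslashes and '|' inside
    // a value are escaped, so different column lists never produce the same text.
    static String canonical(String... columns) {
        StringBuilder sb = new StringBuilder();
        for (String c : columns) {
            sb.append(c.replace("\\", "\\\\").replace("|", "\\|")).append('|');
        }
        return sb.toString();
    }

    // SHA-256 of the canonical text, as 64 lowercase hex digits.
    static String rowId(String... columns) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(canonical(columns).getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rowId("hi", "hello", "bye"));
        System.out.println(rowId("gg", "hello", "bye"));
        System.out.println(rowId("hi", "hello", "bye"));   // same as the first line
        // Plain concatenation would make these two rows look identical:
        System.out.println(canonical("abc", "def") + " vs " + canonical("a", "bcdef"));
    }
}
```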
What you need is a hash function of all the values that you care about in a row. It can't be random because, by definition, it must be deterministic -- given the same values, you always get the same ID. If, by "random" you mean "not sequential" most hash functions should satisfy this need.
Theoretically, you cannot guarantee uniqueness as there is always the probability of collisions. That is, different IDs definitely mean that the row values are different but the converse is not always true. Depending on your needs, you might want to implement explicit matching on actual row values whenever matching IDs are encountered. You might also consider using a cryptographic hash function such as SHA-256 and rely on the probabilities being on your side. Note that deliberately constructed collisions are known for the older MD5 and SHA-1, although accidental collisions remain extremely unlikely even with those.

Adler32 Repeating Very Quickly

I'm using the adler32 checksum algorithm to generate a number from a database id. So, when I insert a row into the database, I take the identity of that row and use it to create the checksum. The problem that I'm running into is that I just generated a repeat checksum after only 207 inserts into the database. This is much much faster than I expected. Here is my code:
String dbIdStr = Long.toString(dbId);
byte[] bytes = dbIdStr.getBytes();
Checksum checksum = new Adler32();
checksum.update(bytes, 0, bytes.length);
result = checksum.getValue();
Is there something wrong with what/how I'm doing? Should I be using a different method to create unique strings? I'm doing this because I don't want to use the db id in a url... a change to the structure of the db will break all the links out there in the world.
Thanks!
You should not be using Adler-32 as a hash code generator. That's not what it's for. You should use an algorithm that has good hash properties, which, among other things minimizes the probability of collisions.
You can simply use Java's hashCode method (on any object). For the String object, the hash code is the sum of the char values of the string, each multiplied by a power of 31 (the first character gets the highest power). There can be collisions with very short strings, but it's not a horrible algorithm. It's definitely a lot better than Adler-32 as a hash algorithm.
The suggestions to use a cryptographically secure hash function (like SHA-256) are certainly overkill for your application, both in terms of execution time and hash code size. You should try Java's hashCode and see how many collisions you get. If they seem much more frequent than you'd expect for a 2^-n probability (where n is the number of bits in the hash code), then you can replace it with a better one. You can find a link here for decent Java hash functions.
Try and use a secure hash function like SHA-256. If you ever find a collision for any data that is not binary equal, you'll get $1000 on your bank account, with compliments. Offer ends if/when SHA-2 is cracked and you enter a collision deliberately. That said, the output is 32 bytes instead of 32 bits.
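The early repeat from the question is easy to reproduce. Adler-32 is built from two running sums of the input bytes, so for short decimal strings of the same length it is enough that those sums agree, and "126" and "207" are one such pair. The sketch below follows the question's code; the class name AdlerCollision and the use of US-ASCII are assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Adler32;

// Hypothetical helper class, named for this example only.
class AdlerCollision {
    // Adler-32 checksum of the decimal representation of a database id.
    static long adler(long dbId) {
        byte[] bytes = Long.toString(dbId).getBytes(StandardCharsets.US_ASCII);
        Adler32 checksum = new Adler32();
        checksum.update(bytes, 0, bytes.length);
        return checksum.getValue();
    }

    public static void main(String[] args) {
        // Same Adler-32 value for two different ids below 208.
        System.out.println(adler(126) + " " + adler(207));
        // String.hashCode() keeps these two ids apart.
        System.out.println("126".hashCode() + " " + "207".hashCode());
    }
}
```

Ids 120 and 201 collide in the same way, so with consecutive ids a repeat after roughly 200 inserts is what Adler-32 will give you, not a bug in the code.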
