I am looking for ways to compute a unique hash for a given String in Java. It looks like I cannot use MD5 or SHA-1, because folks claim they are broken and do not always guarantee uniqueness.
I should get the same hash (preferably a 32-character string, like an MD5 sum) for two String objects which are equal by the equals() method. And no other String should generate this hash - that's the tricky part.
Is there a way to achieve this in Java?
If a guaranteed unique hash code is required, then it is not practically possible (it is theoretically possible, but only if the hash is allowed to be at least as large as the input itself). Hashes and hash codes are non-unique.
A Java String of length N has 65536^N possible states, and requires an integer with 16*N bits to represent all possible values. If you write a hash function that produces an integer with a smaller range (e.g. fewer than 16*N bits), you will eventually find cases where more than one String hashes to the same integer; i.e. the hash codes cannot be unique. This is called the pigeonhole principle, and there is a straightforward mathematical proof. (You can't fight math and win!)
But if "probably unique" with a very small chance of non-uniqueness is acceptable, then crypto hashes are a good answer. The math will tell you how big (i.e. how many bits) the hash has to be to achieve a given (low enough) probability of non-uniqueness.
Update: also check this other good answer: What is a good 64bit hash function in Java for textual strings?
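To make the crypto-hash option concrete, here is a minimal sketch using the standard MessageDigest API (SHA-256 is chosen only as an example, and the class name is made up for illustration):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class StringDigest {
    // returns the SHA-256 digest of the input as a 64-character hex string
    public static String sha256Hex(String input) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
        StringBuilder sb = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}

Equal Strings always produce the same digest; different Strings can in principle collide, but with 256 bits the probability is negligible for any practical purpose.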
I'm implementing a Secret Sharing algorithm which encrypts a message. To do that I need a prime bigger than the message, and some random numbers of approximately the same size as the message.
I can do the first with BigInteger.probablePrime(MsgSize + 8), but I do not know how to do the latter.
I was using Random, and later SecureRandom, but they don't generate numbers of a given length. My workaround was to compute randomInt ^ randomInt and turn that into a BigInteger, but it is obviously a bad solution.
Some ideas?
Is it Shamir's Secret Sharing that you're implementing? If so, note that you don't actually need a prime bigger than the entire message — it's perfectly fine to break the message into chunks of some manageable size and to share each chunk separately using a fixed prime.
Also, Shamir's Secret Sharing doesn't need a prime-sized field; it's possible to use any finite field GF(p^n), including in particular the binary fields GF(2^n). Such fields are particularly convenient for computer implementation, since both the secret and the share chunks will then simply be n-bit bitstrings.
The only complications are that, in non-prime fields, you'll have to implement finite field arithmetic (or find an existing implementation) and that you'll need to choose a particular reducing polynomial and agree upon it. However, the former isn't really as complicated as it might seem (see the sketch below), and the latter isn't really any harder than choosing and agreeing on a prime. (In particular, a reducing polynomial for GF(2^n) can be naturally represented as an n-bit bitstring, dropping the high bit, which is always 1.)
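To give an idea of how little code the field arithmetic needs, here is a minimal sketch of GF(2^8): addition is plain XOR, and multiplication reduces by a fixed polynomial (x^8 + x^4 + x^3 + x + 1, i.e. 0x11B, is used here purely as an example; the class and method names are made up):

public class GF256 {
    // addition (and subtraction) in GF(2^n) is plain XOR
    static int add(int a, int b) {
        return a ^ b;
    }

    // multiply two elements of GF(2^8), reducing modulo x^8 + x^4 + x^3 + x + 1 (0x11B)
    static int mul(int a, int b) {
        int result = 0;
        while (b != 0) {
            if ((b & 1) != 0) {
                result ^= a;      // add a when the low bit of b is set
            }
            a <<= 1;              // multiply a by x
            if ((a & 0x100) != 0) {
                a ^= 0x11B;       // reduce when the degree reaches 8
            }
            b >>= 1;
        }
        return result;
    }
}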
Have you tried using the same probablePrime method with a smaller size, then using a large random integer as an offset from that number? That might do the trick, just an idea.
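A rough sketch of that idea (the bit sizes are arbitrary examples, and the sum is not uniformly distributed, so treat this as an illustration only):

import java.math.BigInteger;
import java.security.SecureRandom;

public class RandomNearPrime {
    public static void main(String[] args) {
        SecureRandom rnd = new SecureRandom();
        int msgSize = 512;  // example message size in bits
        // a probable prime somewhat smaller than the target size...
        BigInteger base = BigInteger.probablePrime(msgSize - 8, rnd);
        // ...plus a uniformly random offset of up to msgSize - 8 bits
        BigInteger offset = new BigInteger(msgSize - 8, rnd);
        System.out.println(base.add(offset));
    }
}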
I had the same problem (that's why I found this post).
It is a little late, but maybe someone else finds this method useful:
public static BigDecimal getBigRandom(int d)
{
    // start with the exact decimal expansion of one random double in [0, 1)
    BigDecimal rnd = new BigDecimal(Math.random());
    BigDecimal rndtmp;
    // each pass appends roughly 50 more random decimal digits after the current ones
    for (int i = 0; i <= d; i++)
    {
        rndtmp = new BigDecimal(Math.random());
        rndtmp = rndtmp.movePointLeft(rnd.precision());
        rnd = rnd.add(rndtmp);
    }
    return rnd;
}
Usage:
BigDecimal x = getBigRandom(y);
Every increment of y gives you approximately 50 more digits.
If you need more than (2^31 - 1) * 50 digits, simply change int to long ;-)
I don't know if it is good, but it works for me.
I am storing certain entities in my database with integer Ids of size 32 bits thus using the range of -2.14 billion to +2.14 billion.
I have tried giving some meaning to my ids, because of which the positive range has been used up rather quickly. I am now looking to use the negative integer range of -2.14 billion to 0.
I wanted to know if you see any downsides to using negative integers as ids; personally, I don't see any.
There is an old saying in database design that goes like this: "Intelligent keys are not". You should never design for special meaning in an id when a descriptive attribute is more appropriate.
Given that dumb keys are only compared for equality, the sign (or lack thereof) has no impact.
Please give me sample code to generate a UUID of long type in Java without using a timestamp.
Thanks
A real UUID is 128 bits. A long is 64 bits.
This is not just pedantry. UUID stands for Universal Unique IDentifier.
The "universal uniqueness" of the established UUID schemes is based on:
encoding a MAC address and a timestamp,
encoding a hash of a DNS name and a timestamp, or
using a 122 bit random number ... which is large enough that the probability of a collision is very very small.
With 64 bits, there are simply not enough bits for "universal uniqueness". For instance, the birthday paradox means that if we had a number of computers generating random 64 bit numbers, the probability of a potentially detectable collision would be large enough to be of concern.
Now if you just want a UID (not a UUID), then any 64-bit sequence generator will do the job, provided that you take steps to guard against the sequence repeating. (If the sequence repeats, then the IDs are not unique in time; i.e. over time a given ID may denote different entities.)
Have you looked at java.util.UUID?
If you just want a simple unique long, you can use an AtomicLong and its incrementAndGet() method. This doesn't use a timestamp, but the counter resets to 0 every time you start the JVM, and it is not unique across JVMs.
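A minimal sketch of that approach (the class name is made up; note the counter restarts at 0 on every JVM start):

import java.util.concurrent.atomic.AtomicLong;

public class IdGenerator {
    // a single shared counter; resets to 0 whenever the JVM restarts
    private static final AtomicLong COUNTER = new AtomicLong();

    // returns 1, 2, 3, ... - unique within this JVM only
    public static long nextId() {
        return COUNTER.incrementAndGet();
    }
}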
What is the requirement not to use timestamps all about? A UUID uses a timestamp (amongst other things).
I'm looking for a hash function that:
Hashes textual strings well (e.g. few collisions)
Is written in Java, and widely used
Bonus: works on several fields (instead of me concatenating them and applying the hash on the concatenated string)
Bonus: Has a 128-bit variant.
Bonus: Not CPU intensive.
Why don't you use a long variant of the default String.hashCode() (where some really smart guys certainly put effort into making it efficient - not to mention the thousands of developer eyes that have already looked at this code)?
// adapted from String.hashCode()
public static long hash(String string) {
    long h = 1125899906842597L; // prime
    int len = string.length();

    for (int i = 0; i < len; i++) {
        h = 31 * h + string.charAt(i);
    }
    return h;
}
If you're looking for even more bits, you could probably use a BigInteger
Edit:
As I mentioned in a comment on @brianegge's answer, there are not many use cases for hashes with more than 32 bits, and most likely not a single one for hashes with more than 64 bits:
I could imagine a huge hash table distributed across dozens of servers, maybe storing tens of billions of mappings. For such a scenario, @brianegge still has a valid point here: 32 bits allow for 2^32 (ca. 4.3 billion) different hash keys. Assuming a strong algorithm, you should still have very few collisions. With 64 bits (18,446,744,073 billion different keys) you're certainly safe, regardless of whatever crazy scenario you need it for. Thinking of use cases for 128-bit keys (340,282,366,920,938,463,463,374,607,431 billion possible keys) is pretty much impossible, though.
To combine the hashes of several fields, don't XOR them; multiply one by a prime and add them:
long hash = MyHash.hash(string1) * 31 + MyHash.hash(string2);
The small prime is in there to avoid equal hash codes for switched values, i.e. {'foo','bar'} and {'bar','foo'} aren't equal and should have different hash codes. XOR is bad, as it returns 0 if both values are equal; therefore {'foo','foo'} and {'bar','bar'} would have the same hash code.
An answer for today (2018). SipHash.
It will be much faster than most of the answers here, and significantly higher quality than all of them.
The Guava library has one: https://google.github.io/guava/releases/23.0/api/docs/com/google/common/hash/Hashing.html#sipHash24--
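A minimal usage sketch, assuming Guava is on the classpath (the class name is made up):

import java.nio.charset.StandardCharsets;
import com.google.common.hash.Hashing;

public class SipHashExample {
    public static void main(String[] args) {
        // SipHash-2-4 with Guava's default key; yields a 64-bit hash
        long hash = Hashing.sipHash24()
                .hashString("hello world", StandardCharsets.UTF_8)
                .asLong();
        System.out.println(hash);
    }
}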
Create a SHA-1 hash and then keep just 64 of its bits.
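A rough sketch of that suggestion, taking the first 8 bytes of the digest as the 64-bit value (which 64 bits you keep doesn't really matter; the class name is made up):

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Sha1To64 {
    public static long hash64(String input) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8)); // 20 bytes
        // keep the first 8 bytes of the 160-bit digest as a long
        return ByteBuffer.wrap(digest).getLong();
    }
}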
long hash = string.hashCode();
Yes, the top 32 bits will be 0, but you will probably run out of hardware resources before you run into problems with hash collisions. The hashCode in String is quite efficient and well tested.
Update
I think the above satisfies the simplest thing which could possibly work; however, I agree with @sfussenegger's idea of extending the existing String hashCode.
In addition to having a good hashCode for your String, you may want to consider rehashing the hash code in your implementation. If your storage is used by other developers, or used with other types, this can help distribute your keys. For example, Java's HashMap is based on power-of-two-length hash tables, so it adds this function to ensure the lower bits are sufficiently distributed.
h ^= (h >>> 20) ^ (h >>> 12);
return h ^ (h >>> 7) ^ (h >>> 4);
Why not use a CRC64 polynomial? These are reasonably efficient and optimized to make sure all bits are counted and spread over the result space.
There are plenty of implementations available on the net if you google "CRC64 Java".
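For reference, a minimal bit-by-bit sketch (no lookup table) using the ECMA-182 polynomial in reflected form; real implementations normally precompute a table for speed, and the class name here is made up:

public class Crc64 {
    private static final long POLY = 0xC96C5795D7870F42L; // ECMA-182 polynomial, reflected

    public static long crc64(byte[] data) {
        long crc = ~0L;                                    // initial value: all ones
        for (byte b : data) {
            crc ^= (b & 0xFF);
            for (int i = 0; i < 8; i++) {
                // shift right one bit, folding in the polynomial when a bit falls off
                crc = ((crc & 1) != 0) ? (crc >>> 1) ^ POLY : crc >>> 1;
            }
        }
        return ~crc;                                       // final XOR: all ones
    }
}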
Do something like this:
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
public class Test {

    public static void main(String[] args) throws NoSuchAlgorithmException, IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(baos);
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            SomeObject testObject = new SomeObject();
            dos.writeInt(testObject.count);
            dos.writeLong(testObject.product);
            dos.writeDouble(testObject.stdDev);
            dos.writeUTF(testObject.name);
            dos.writeChar(testObject.delimiter);
            dos.flush();
            byte[] hashBytes = md.digest(baos.toByteArray());
            BigInteger testObjectHash = new BigInteger(hashBytes);
            System.out.println("Hash " + testObjectHash);
        } finally {
            dos.close();
        }
    }

    private static class SomeObject {
        private int count = 200;
        private long product = 1235134123L;
        private double stdDev = 12343521.456d;
        private String name = "Test Name";
        private char delimiter = '\n';
    }
}
DataOutputStream lets you write primitives and Strings and have them output as bytes. Wrapping a ByteArrayOutputStream in it will let you write to a byte array, which integrates nicely with MessageDigest. You can pick from any algorithm listed here.
Finally, BigInteger will let you turn the output bytes into an easier-to-use number. MD5 produces a 128-bit hash and SHA-1 a 160-bit hash, so if you need 64 bits you can just truncate.
SHA-1 should hash almost anything well, and with infrequent collisions (it's 160-bit). This works from Java, but I'm not sure how it's implemented. It may actually be fairly fast. It works on several fields in my implementation: just push them all onto the DataOutputStream and you're good to go. You could even do it with reflection and annotations (maybe @HashComponent(order=1) to show which fields go into a hash and in what order). It's got a 128-bit variant and I think you'll find it doesn't use as much CPU as you think it will.
I've used code like this to get hashes for huge data sets (by now probably billions of objects) to be able to shard them across many backend stores. It should work for whatever you need it for. Note that I think you may want to only call MessageDigest.getInstance() once and then clone() from then on: IIRC the cloning is a lot faster.
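A small sketch of that last tip (it assumes the provider's MessageDigest supports clone(), which the standard JDK MD5 and SHA-1 implementations do; the class name is made up):

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DigestFactory {
    // create the MessageDigest once...
    private static final MessageDigest PROTOTYPE;
    static {
        try {
            PROTOTYPE = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // ...and clone it per use instead of calling getInstance() each time
    public static MessageDigest newDigest() {
        try {
            return (MessageDigest) PROTOTYPE.clone();
        } catch (CloneNotSupportedException e) {
            throw new IllegalStateException("MD5 digest not cloneable", e);
        }
    }
}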
Reverse the string to get another 32-bit hashcode and then combine the two:
String s = "astring";
long upper = ((long) s.hashCode()) << 32;
// String has no reverse() method, so reverse through a StringBuilder
long lower = ((long) new StringBuilder(s).reverse().toString().hashCode()) - ((long) Integer.MIN_VALUE);
long hash64 = upper + lower;
Subtracting Integer.MIN_VALUE shifts the reversed hash into the range 0 to 2^32 - 1, so the lower 32 bits never interfere with the upper half.
Did you look at Apache Commons Lang?
But for 64 bits (and 128) you need some tricks: the rules laid out in the book Effective Java by Joshua Bloch make it easy to create a 64-bit hash (just use long instead of int), as sketched below. For 128 bits you need additional hacks...
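A minimal sketch of that recipe widened to 64 bits (the class and its fields are made-up examples):

public class Person {
    private final String firstName;
    private final String lastName;
    private final int age;

    public Person(String firstName, String lastName, int age) {
        this.firstName = firstName;
        this.lastName = lastName;
        this.age = age;
    }

    // the Effective Java hashCode recipe, using long instead of int throughout
    public long hash64() {
        long result = 17;
        result = 31 * result + (firstName == null ? 0 : firstName.hashCode());
        result = 31 * result + (lastName == null ? 0 : lastName.hashCode());
        result = 31 * result + age;
        return result;
    }
}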
DISCLAIMER: This solution is applicable if you wish to efficiently hash individual natural language words. It is inefficient for hashing longer text, or text containing non-alphabetic characters.
I'm not aware of a function but here's an idea that might help:
Dedicate 52 of the 64 bits to representing which letters are present in the String. For example, if 'a' were present you'd set bit[0], for 'b' bit[1], and for 'A' bit[26]. That way, only text containing exactly the same set of letters would have the same "signature".
You could then use the remaining 12 bits to encode the string length (or a modulo value of it) to further reduce collisions, or generate a 12-bit hashCode using a traditional hashing function.
Assuming your input is text-only, I can imagine this would result in very few collisions and would be inexpensive to compute (O(n)). Unlike the other solutions so far, this approach takes the problem domain into account to reduce collisions - it is based on the anagram detector described in Programming Pearls (see here).
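A rough sketch of the idea, assuming plain ASCII letters (anything else is simply ignored here; the class name is made up):

public class LetterSignature {
    // bits 0-51 mark which letters occur; the top 12 bits hold the length modulo 4096
    public static long hash64(String s) {
        long sig = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 'a' && c <= 'z') {
                sig |= 1L << (c - 'a');            // bits 0..25: lowercase letters
            } else if (c >= 'A' && c <= 'Z') {
                sig |= 1L << (26 + (c - 'A'));     // bits 26..51: uppercase letters
            }
        }
        sig |= ((long) (s.length() & 0xFFF)) << 52; // bits 52..63: length mod 4096
        return sig;
    }
}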