I want to represent short texts (i.e. a word or a couple of words) as a 64-bit hash (I want to store them as longs).
MessageDigest.getInstance("MD5") returns 128 bits.
Is there anything else I could use, or could I just peel off half of it? I am not worried about someone trying to forge a hash; I just want to minimize the number of clashes (two different strings having the same hash).
MD5 (and SHA) hashes "smear" the data uniformly across the hashed value, so any 64 bits you choose out of the final value will be as sensitive to a change in the input as any other 64 bits. Your only concern will be the increased probability of collisions.
You can just use any part of the MD5 hash.
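For example, a minimal sketch of truncating an MD5 digest to a long (class and method names here are illustrative):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5To64 {
    // Hash a string to 64 bits by keeping the first 8 of MD5's 16 digest bytes.
    public static long hash64(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            return ByteBuffer.wrap(digest).getLong(); // reads the first 8 bytes, big-endian
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError("every JRE is required to provide MD5", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(Long.toHexString(hash64("hello"))); // 5d41402abc4b2a76
    }
}
```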
We tried to fold 128-bit into 64-bit with various algorithms but the folding action didn't make any noticeable difference in hash distribution.
Why don't you just use hashCode() of String? We hashed 8 million email addresses into 32-bit integers and there were actually more collisions with MD5 than with String's hashCode. You can run hashCode twice (forward and backward) and make it a 64-bit long.
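One way to read the forward/backward suggestion, sketched as a hypothetical helper (the packing scheme below is an assumption):

```java
public class TwoWayHash {
    // Pack the 32-bit hashCode of the text (high half) and of its
    // reverse (low half) into one 64-bit long. Distribution is only as
    // good as String.hashCode itself; this just widens the key space.
    public static long hash64(String s) {
        long forward = s.hashCode() & 0xFFFFFFFFL;
        long backward = new StringBuilder(s).reverse().toString().hashCode() & 0xFFFFFFFFL;
        return (forward << 32) | backward;
    }

    public static void main(String[] args) {
        System.out.println(Long.toHexString(hash64("user@example.com")));
    }
}
```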
You can take a sampling of 64 bits from the 128-bit hash. You cannot guarantee there will be no clashes (only a perfect hash will give you that, and there is no perfect hash for arbitrary-length strings), but the chances of a clash will be very small.
As well as sampling, you could derive the hash using a more complex function, such as XORing consecutive pairs of bits.
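One simple folding of this kind XORs the two 64-bit halves of the digest together; a sketch (the helper name is illustrative, and note that another answer here found folding made no noticeable difference in distribution versus plain truncation):

```java
public class FoldHash {
    // Fold a 16-byte digest into 64 bits by XORing its two 8-byte halves.
    public static long fold(byte[] digest16) {
        long result = 0;
        for (int i = 0; i < 8; i++) {
            result = (result << 8) | ((digest16[i] ^ digest16[i + 8]) & 0xFFL);
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        byte[] md5 = java.security.MessageDigest.getInstance("MD5")
                .digest("example".getBytes(java.nio.charset.StandardCharsets.UTF_8));
        System.out.println(Long.toHexString(fold(md5)));
    }
}
```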
As a cryptographic hash (even one nowadays considered broken), MD5 has no significant correlation between input and output bits. That means, simply taking the first or last half will give you a perfectly well-distributed hash function. Anything else would never have been seriously considered as a cryptographic hash.
What about using a block cipher with a 64-bit block size?
I am looking for ways to compute a unique hash for a given String in Java. Looks like I cannot use MD5 or SHA1 because folks claim that they are broken and do not always guarantee uniqueness.
I should get the same hash (preferably a 32 character string like the MD5 Sum) for two String objects which are equal by the equals() method. And no other String should generate this hash - that's the tricky part.
Is there a way to achieve this in Java?
If a guaranteed-unique hash code is required, then it is not possible (it is conceivable only if the hash is allowed to grow with the input, which is useless in practice). Hashes and hash codes are non-unique.
A Java String of length N has 65536 ^ N possible states, and requires an integer with 16 * N bits to represent all possible values. If you write a hash function that produces an integer with a smaller range (e.g. less than 16 * N bits), you will eventually find cases where more than one String hashes to the same integer; i.e. the hash codes cannot be unique. This is called the Pigeonhole Principle, and there is a straightforward mathematical proof. (You can't fight math and win!)
But if "probably unique" with a very small chance of non-uniqueness is acceptable, then crypto hashes are a good answer. The math will tell you how big (i.e. how many bits) the hash has to be to achieve a given (low enough) probability of non-uniqueness.
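The pigeonhole effect is easy to see with Java's own 32-bit String.hashCode(): distinct strings collide even at tiny lengths.

```java
public class Pigeonhole {
    public static void main(String[] args) {
        // Distinct strings, same 32-bit hash code:
        // 'A'*31 + 'a' = 65*31 + 97 = 2112, and 'B'*31 + 'B' = 66*31 + 66 = 2112.
        System.out.println("Aa".hashCode());   // 2112
        System.out.println("BB".hashCode());   // 2112
        System.out.println("Aa".equals("BB")); // false
    }
}
```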
Update: also see this other good answer: What is a good 64bit hash function in Java for textual strings?
I am trying to generate a hash key using SHA-256, which produces a 64-character string, but I need a key of size <= 32. What is the best algorithm recommended for maintaining uniqueness? Please advise.
As already indicated, you lose collision resistance for each bit you drop. Hashes, however, are considered to be indistinguishable from random: because of the avalanche effect, each bit of the input is "represented" by each of the bits in the hash. So you can indeed simply use the first 128 bits / 16 bytes of the output of the hash. That would still leave you with some 64 bits of collision resistance. The more or less standard way to do this is to take the leftmost bytes.
Additional hints:
For some additional security, use the result of an HMAC with a static, randomly generated 128-bit key instead of the plain hash output;
Of course you could also encode the hash in base64 and use that, if you can store any string instead of only hexadecimals. In that case you can fit 32 * 6 = 192 bits into the value, which would give higher security than a SHA-1 hash (which is considered insecure nowadays; keep to SHA-2).
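A sketch of both hints combined, truncating SHA-256 to its leftmost 16 bytes and base64-encoding the result into a short key (class and variable names are illustrative):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.Base64;

public class ShortKey {
    // Derive a key of at most 32 characters from SHA-256 output.
    public static String key(String input) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(input.getBytes(StandardCharsets.UTF_8));
        byte[] truncated = Arrays.copyOf(digest, 16); // keep the leftmost 128 bits
        // 16 bytes encode to 22 base64 characters without padding.
        return Base64.getEncoder().withoutPadding().encodeToString(truncated);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(key("hello"));
    }
}
```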
I'm writing a disk cache where filenames are the keys. The keys can be longer than the max filename length, so they need to be hashed. What are some fast hash functions with extremely low probability of collisions (so that I can ignore it)?
Basically, I'm looking for a faster alternative to MD5 with no security requirements.
(Platform = Android, language = Java.)
If your hash is uniformly distributed then you can calculate the size of the hash (in bits) that you need from the approximate number of files you expect to handle before a collision. Basically, because of the birthday paradox, you need twice as many bits as the log2 of the file count.
So, for example, if you are happy with a collision after a million files then you need a hash that is about 40 bits long (2 * log2(1e6)).
Conversely, if a hash is N bits, then it's good for about 2^(N/2) files without collision (more or less).
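That rule of thumb is easy to compute directly (the helper below is illustrative):

```java
public class BirthdayBits {
    // Approximate hash width in bits needed before expecting a collision
    // among `files` items: about twice the log2 of the count.
    public static long bitsNeeded(double files) {
        return Math.round(2 * (Math.log(files) / Math.log(2)));
    }

    public static void main(String[] args) {
        System.out.println(bitsNeeded(1e6)); // ~40 bits for a million files
        System.out.println(bitsNeeded(4e9)); // ~64 bits for ~4 billion files
    }
}
```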
There are many fast hashes. For example, xxHash is a 64-bit hash, so it is good for about 4,000,000,000 files. Google's fast-hash is another.
If you want more than 64 bits (more than ~4 billion files before a collision) then you can either use a hash with a larger output or join two 64-bit hashes together (one hash of the original file and one of it modified in some way, e.g. prefixed with a space).
The Google Guava library has several fast hash implementations:
http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/hash/Hashing.html#murmur3_128%28%29
I'm looking at using a hashing algorithm that accepts a string and returns a 64-bit signed integer value.
It doesn't have to be cryptographically sound, just provide a decent collision rate to be used as a key for distributed storage.
I'm looking at MurmurHash, which seems to fit the bill.
I'm curious how its properties compare to taking the first 64 bits of something like an MD5 hash.
Secure hashes - even theoretically 'broken' ones like MD5 - exhibit distribution that's indistinguishable from randomness (or else they wouldn't be secure). Thus, they're as close to perfect as it's possible to be.
Like all general purpose hash functions, murmurhash trades off correctness for speed. While it shows very good distribution characteristics for most inputs, it has its own pathological cases, such as the one documented here, where repeated 4-byte sequences lead to collisions more often than desired.
In short: Using a secure hash function will never be worse, and will sometimes be better than using a general purpose hash. It will also be substantially slower, however.
I have an xml file, where I need to determine if it is a duplicate or not.
I will either hash the entire xml file, or specific xml nodes in the xml file will be used to then generate some kind of hash.
Is MD5 suitable for this?
Or something else? Speed of hash generation is also fairly important, but the guarantee of producing a unique hash for unique data is of higher importance.
MD5 is broken (in the sense that it's possible to intentionally generate a hash collision); you should probably use the SHA-2 family (e.g. SHA-256) if you are concerned about someone maliciously creating a file with the same hash as another file.
Note that hash functions, by their nature, cannot guarantee a unique hash for every possible input. Hash functions have a limited length (e.g. MD5 is 128 bits in length, so there are 2^128 possible hashes). You can't map a potentially infinite domain to a finite co-domain; this is mathematically impossible.
However, as per the birthday paradox, the chance of a collision in a good hash function is about 1 in 2^(n/2), where n is the length in bits (e.g. with 128-bit MD5 that would be 2^64). This is so statistically insignificant that you don't have to worry about a collision happening by accident.
MD5 is suitable and fast. Note though that a single difference in one character will produce a completely different MD5.
There is a slight chance that MD5 will produce the same hash for different inputs, but this will be extremely rare. So, depending on your input (are you expecting many similar XMLs or many different ones?), when MD5 gives you a positive match you can compare the plain String contents.
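That hash-then-confirm strategy can be sketched as follows (class and method names are illustrative):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class XmlDedup {
    // Treat two documents as duplicates only if their MD5 digests match
    // AND a full comparison confirms it, as suggested above.
    public static boolean isDuplicate(String a, String b) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] ha = md.digest(a.getBytes(StandardCharsets.UTF_8));
        byte[] hb = md.digest(b.getBytes(StandardCharsets.UTF_8));
        // A digest mismatch proves the inputs differ, so the cheap check
        // short-circuits the (potentially costly) full comparison.
        return MessageDigest.isEqual(ha, hb) && a.equals(b);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isDuplicate("<a/>", "<a/>")); // true
        System.out.println(isDuplicate("<a/>", "<b/>")); // false
    }
}
```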
If someone can alter at least partially the contents of some of the XML files, and that someone has an advantage in making you declare two XML files (or XML excerpts) identical while in fact they are not, then you need a cryptographically secure hash function, namely one which is resistant to collisions. A collision is a pair of distinct messages (sequences of bytes) which yield the same hash output -- exactly what you would like to avoid. Since a hash function accepts inputs longer than its output, collisions necessarily exist; a hash function is deemed cryptographically secure when nobody can actually produce such a collision.
If a hash function outputs n bits, then one can expect to find a collision after hashing about 2^(n/2) distinct messages. A secure hash function is a hash function such that no method is known to get a collision faster than that.
If there is no security issue (i.e. nobody will actively try to find a collision, you just fear a collision out of bad luck), then cryptographically weak hash functions are an option, provided that they have a large enough output, so that 2^(n/2) remains way bigger than the expected number of XML files you will compare. For n = 128 (i.e. 2^(n/2) close to eighteen billion billions), MD5 is fine, fast and widely supported. You may want to investigate MD4, which is even weaker, but a bit faster too. If you want a larger n, try SHA-1, which offers 160-bit outputs (also, SHA-1 weaknesses are still theoretical at the moment, so SHA-1 is much less "cryptographically broken" than MD5).
If you have, even potentially, security issues, then go for SHA-256. No cryptographic weakness with regards to collisions is currently known for that function. If you run into performance issues (which is rather improbable: on a basic PC, SHA-256 can process more than 100 megabytes of data per second, so chances are that XML parsing will be widely more expensive than hashing), consider SHA-512, which is somewhat faster on platforms which offer 64-bit integer types (but quite slower on platforms which do not).
Note that all these hash functions operate on sequences of bytes. A single flipped bit changes the output. In the XML world, a given document can be encoded in various ways which are semantically identical but distinct as far as bits on the wire are concerned (e.g. the numeric character reference &#233; and the literal character é both denote the same character). It is up to you to define which notion of equality you want to use; see canonical XML.