Which hash algorithm can be used for duplicate content verification? - java

I have an XML file, and I need to determine whether it is a duplicate or not.
I will either hash the entire XML file, or use specific XML nodes in the file to generate some kind of hash.
Is MD5 suitable for this?
Or something else? Speed of hash generation is also fairly important, but the guarantee of producing a unique hash for unique data is of higher importance.

MD5 is broken (in the sense that it is possible to intentionally generate a hash collision); if you are concerned about someone maliciously creating a file with the same hash as another file, you should use the SHA-2 family instead (e.g. SHA-256 or SHA-512).
Note that hash functions, by their nature, cannot guarantee a unique hash for every possible input. Hash functions have a fixed output length (e.g. MD5 is 128 bits, so there are 2^128 possible hashes). You can't map a potentially infinite domain onto a finite co-domain without collisions; it is mathematically impossible.
However, per the birthday paradox, with a good hash function you only expect a collision after hashing about 2^(n/2) distinct inputs, where n is the output length in bits (for 128-bit MD5 that is about 2^64 inputs). This is so statistically insignificant that you don't have to worry about a collision happening by accident.

MD5 is suitable and fast. Note though that a single difference in one character will produce a completely different MD5.
There is a slight chance that MD5 will produce the same hash for different inputs. This will be pretty rare. So, depending on your input (are you expecting many similar XMLs or many different ones?) when MD5 gives you a positive match you can compare the plain String contents.
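As a rough illustration of that hash-then-verify approach, here is a minimal Java sketch (the file paths and the idea of re-reading both files for the final comparison are just assumptions for the example):

import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.Arrays;

public class DuplicateCheck {

    // Hex-encoded MD5 of a file's bytes.
    static String md5Hex(Path file) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5").digest(Files.readAllBytes(file));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // On an MD5 match, fall back to comparing the raw contents to rule out
    // the (very unlikely) accidental collision.
    static boolean isDuplicate(Path a, Path b) throws Exception {
        return md5Hex(a).equals(md5Hex(b))
                && Arrays.equals(Files.readAllBytes(a), Files.readAllBytes(b));
    }
}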

If someone can alter at least partially the contents of some of the XML files, and that someone has an advantage in making you declare two XML files (or XML excerpts) identical while in fact they are not, then you need a cryptographically secure hash function, namely one which is resistant to collisions. A collision is a pair of distinct messages (sequences of bytes) which yield the same hash output -- exactly what you would like to avoid. Since a hash function accepts inputs longer than its output, collisions necessarily exist; a hash function is deemed cryptographically secure when nobody can actually produce such a collision.
If a hash function outputs n bits, then one can expect to find a collision after hashing about 2^(n/2) distinct messages. A secure hash function is a hash function such that no method is known to get a collision faster than that.
If there is no security issue (i.e. nobody will actively try to find a collision, you just fear a collision out of bad luck), then cryptographically weak hash functions are an option, provided that they have a large enough output, so that 2^(n/2) remains way bigger than the expected number of XML files you will compare. For n = 128 (i.e. 2^(n/2) close to eighteen billion billions), MD5 is fine, fast and widely supported. You may want to investigate MD4, which is even weaker, but a bit faster too. If you want a larger n, try SHA-1, which offers 160-bit outputs (also, SHA-1 weaknesses are still theoretical at the moment, so SHA-1 is much less "cryptographically broken" than MD5).
If you have, even potentially, security issues, then go for SHA-256. No cryptographic weakness with regards to collisions is currently known for that function. If you run into performance issues (which is rather improbable: on a basic PC, SHA-256 can process more than 100 megabytes of data per second, so chances are that XML parsing will be widely more expensive than hashing), consider SHA-512, which is somewhat faster on platforms which offer 64-bit integer types (but quite slower on platforms which do not).
Note that all these hash functions operate on sequences of bytes. A single flipped bit changes the output. In the XML world, a given document can be encoded in various ways which are semantically identical but distinct as far as bits on the wire are concerned (e.g. é and &#233; both represent the same character é). It is up to you to define which notion of equality you want to use; see Canonical XML.
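If you go the "hash only specific nodes" route from the question, a minimal Java sketch might look like the following; the element name "orderId" in the usage note is a made-up placeholder, and proper Canonical XML handling is deliberately left out:

import java.nio.file.Path;
import java.security.MessageDigest;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class XmlNodeHash {

    // SHA-256 over the text content of the selected elements only, so that
    // formatting differences elsewhere in the document do not affect the hash.
    static byte[] hashSelectedNodes(Path xmlFile, String elementName) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(xmlFile.toFile());
        NodeList nodes = doc.getElementsByTagName(elementName);

        MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
        for (int i = 0; i < nodes.getLength(); i++) {
            sha256.update(nodes.item(i).getTextContent().getBytes("UTF-8"));
            sha256.update((byte) 0); // separator so adjacent node texts can't blur together
        }
        return sha256.digest();
    }
}

For example, hashSelectedNodes(Path.of("order.xml"), "orderId") would hash only the orderId values, ignoring the rest of the document.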

Related

Hash Function MD5 30 length

From a given string I am generating a 32-digit hash code using MD5:
MessageDigest.getInstance("MD5")
  .digest("SOME-BIG-STRING".getBytes("UTF-8")).map("%02x".format(_)).mkString
//output: 47a8899bdd7213fb1baab6cd493474b4
Is it possible to generate a 30-digit hash instead of a 32-digit one, and what would the problem be if I did so?
Is there another hash algorithm I could use that produces a 30-character output and keeps the collision probability negligible across 1 trillion unique strings?
Security is not important, uniqueness is required.
For generating unique IDs from strings, hash functions are never the correct answer.
What you would need is to define a one-to-one mapping of text strings (such as "v1.0.0") onto 30-character-long strings (such as "123123..."). This is also known as a bijection, although in your case an injection (a simple one-to-one mapping from inputs to outputs, not necessarily onto) may be enough. As the other answer at the time of this writing notes, hash functions don't necessarily ensure this mapping, but there are other possibilities, such as full-period linear congruential generators (if they take a seed that you can map one-to-one onto input string values), or other reversible functions.
However, if the set of possible input strings is larger than the set of possible output strings, then you can't map all input strings one-to-one with all output strings (without creating duplicates), due to the pigeonhole principle.
See also this question: How to generate a GUID with a custom alphabet, that behaves similar to an MD5 hash (in JavaScript)?.
Indeed, if you use hash functions, the chance of collision will be close to zero but never exactly zero (meaning that the risk of duplicates will always be there). If we take MD5 as an example (which produces any of 2^128 hash codes), then roughly speaking, the chance of accidental collision becomes non-negligible only after 2^64 IDs are generated, which is well over 1 trillion.
But MD5 and other hash functions are not the right way to do what you want to do. This is discussed next.
If you can't restrict the format of your input strings to 30 digits and can't compress them to 30 digits or less and can't tolerate the risk of duplicates, then the next best thing is to create a database table mapping your input strings to randomly generated IDs.
This database table should have two columns: one column holds your input strings (e.g., "<UUID>-NAME-<UUID>"), and the other column holds randomly generated IDs associated with those strings. Since random numbers don't ensure uniqueness, every time you create a new random ID you will need to check whether the random ID already exists in the database, and if it does exist, try a new random ID (but the chance that a duplicate is found will shrink as the size of the ID grows).
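A minimal in-memory sketch of that idea, assuming a 30-digit decimal ID; a real system would use a database table with a uniqueness constraint rather than a HashMap:

import java.security.SecureRandom;
import java.util.HashMap;
import java.util.Map;

public class IdRegistry {

    private final Map<String, String> idsByInput = new HashMap<>(); // stand-in for the DB table
    private final SecureRandom random = new SecureRandom();

    // Returns the existing ID for this input, or assigns a fresh random
    // 30-digit ID, retrying in the (unlikely) event of a duplicate ID.
    String idFor(String input) {
        String existing = idsByInput.get(input);
        if (existing != null) {
            return existing;
        }
        String id;
        do {
            StringBuilder sb = new StringBuilder(30);
            for (int i = 0; i < 30; i++) {
                sb.append(random.nextInt(10)); // one decimal digit
            }
            id = sb.toString();
        } while (idsByInput.containsValue(id)); // retry if the ID is already taken
        idsByInput.put(input, id);
        return id;
    }
}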
Is it possible to generate a 30-digit hash instead of a 32-digit one, and what would the problem be if I did so?
Yes.
You increase the probability of a collision by a factor of 2^8 = 256, since you are throwing away 8 of the 128 bits.
Is there another hash algorithm I could use that produces a 30-character output and keeps the collision probability negligible across 1 trillion unique strings?
Probably. Taking the first 30 hex digits of a hash produced by any crypto-strength hash algorithm has roughly equivalent uniqueness properties.
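For illustration, a small Java sketch of that truncation (the input string and UTF-8 encoding are assumptions):

import java.security.MessageDigest;

public class TruncatedHash {

    // Full 32-hex-digit MD5, then keep only the first 30 hex digits (120 bits).
    static String md5First30Hex(String input) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5").digest(input.getBytes("UTF-8"));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.substring(0, 30);
    }
}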
Security is not important, uniqueness is required?
In that case, the fact that MD5 is no longer considered secure is moot. (Note that the reason MD5 is no longer considered secure is that it is computationally feasible to engineer a collision; i.e. to construct two different inputs that produce the same MD5 hash.)
However, uniqueness of hashes cannot be guaranteed. Even with a "perfect" crypto-strength hash function that generates N-bit hashes, the probability of a collision for any 2 arbitrary (different) inputs is one in 2^N. For large enough values of N, the probability is very small. But it is never zero.

Fast hash function with unique hashes

I'm writing a disk cache where filenames are the keys. The keys can be longer than the max filename length, so they need to be hashed. What are some fast hash functions with extremely low probability of collisions (so that I can ignore it)?
Basically, I'm looking for a faster alternative to MD5 with no security requirements.
(Platform = Android, language = Java.)
If your hash is uniformly distributed, then you can calculate the size of hash (in bits) that you need from the approximate number of files you expect to handle before a collision. Basically, because of the birthday paradox, you need about twice as many bits as it takes to count the files.
So, for example, if you are happy with a collision after a million files, then you need a hash that is about 40 bits long (2 * log2(1e6)); see the small calculation below.
Conversely, if a hash is N bits, then it's good for about 2^(N/2) files without collision (more or less).
There are many fast hashes. For example, xxHash is a 64-bit hash, so it's good for about 4,000,000,000 files. Google's fast-hash is another.
If you want more than 64 bits (more than ~4 billion files before a collision), then you can either use a hash with a larger output or join two 64-bit hashes together (one hash from the original file and one from it modified in some way, e.g. prefixed with a space).
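As a quick sanity check of that sizing rule, a tiny Java calculation (the one-million-file figure is just the example above):

public class HashSizing {
    public static void main(String[] args) {
        // Birthday-bound sizing: bits needed so that a collision stays unlikely
        // until roughly `expectedFiles` keys have been hashed.
        long expectedFiles = 1_000_000L;
        double bitsNeeded = 2 * (Math.log(expectedFiles) / Math.log(2));
        System.out.printf("~%.0f bits%n", bitsNeeded); // prints ~40 bits
    }
}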
The Google Guava library has several fast hash implementations:
http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/hash/Hashing.html#murmur3_128%28%29
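For example, Guava's 128-bit MurmurHash3 can be used roughly like this (the cache key string is a placeholder):

import com.google.common.hash.HashCode;
import com.google.common.hash.Hashing;
import java.nio.charset.StandardCharsets;

public class CacheKeys {

    // 128-bit MurmurHash3 of the cache key, hex-encoded so it is safe to use
    // as a filename regardless of what characters the original key contains.
    static String fileNameFor(String cacheKey) {
        HashCode hash = Hashing.murmur3_128()
                .hashString(cacheKey, StandardCharsets.UTF_8);
        return hash.toString(); // 32 lowercase hex characters
    }
}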

Would there be fewer collisions from MurmurHash or from taking 64 bits of an MD5 hash, if you want a 64-bit int?

Looking at using a hashing algorithm that accepts a string and returns a 64-bit signed integer value.
It doesn't have to be cryptographically sound, just provide a decent collision rate to be used as a key for distributed storage.
I'm looking at MurmurHash, which seems to fit the bill.
Curious how the properties of this compare to taking the first 64 bits of something like an MD5 hash.
Secure hashes - even theoretically 'broken' ones like MD5 - exhibit distribution that's indistinguishable from randomness (or else they wouldn't be secure). Thus, they're as close to perfect as it's possible to be.
Like all general purpose hash functions, murmurhash trades off correctness for speed. While it shows very good distribution characteristics for most inputs, it has its own pathological cases, such as the one documented here, where repeated 4-byte sequences lead to collisions more often than desired.
In short: Using a secure hash function will never be worse, and will sometimes be better than using a general purpose hash. It will also be substantially slower, however.

Do hashing algorithms guarantee unique outputs if the same salt is used for unique inputs?

We have a view into a system that uses a value for the unique id which another company we want to share information with will not accept. I was thinking of using a one-way cryptographic hash, similar to what is done with passwords. The concern is: can the hash output values be guaranteed unique if the inputs are guaranteed unique and the salt is constant?
The answer is yes: the same id input with the same salt will always produce the same output.
But, if your question is about guaranteeing that outputs will always be unique, the answer is no. There is a very small statistical probability that the hashing will create the same output twice even if the inputs are different and the salt constant.
In principle, there is no hashing algorithm without collisions if the input size is larger than the output size. (In your case, the relevant input size would be the size of this part which changes from one input to the next.)
Whether there are collisions also for shorter inputs is a property of the hashing algorithm, but the idea is that the probability of these should be quite small (about 1/2^(output size) for each pair of inputs, for a good algorithm).
Is your question whether two different values can hash to the same thing, or whether hashes are deterministic?
If it's the former, then yes, you can have hash collisions. A well-designed, cryptographically strong hash should make it difficult to find two values that hash to the same output, or to find an input that matches a given hash, but it cannot guarantee uniqueness.
By the pigeonhole principle:
if your hash has a constant size, say 64 bits (without loss of generality), you will have at most 2^64 distinct output hash values. Since there are more than 2^64 potential inputs if you're using strings, a collision is guaranteed once you have hashed at most 2^64 + 1 distinct items.
Yes the same hash will be produced when the input and salt are the same. Note that different inputs may produce the same hash.
In short, no. The longer answer is that a perfect oracle would be needed to solve the question you posed, and since no one has ever proven the existence of a perfect oracle, this is currently believed to be impossible. The other side of it isn't that it is impossible, just that we as a collective are not intelligent enough to figure it out. Similar to P != NP.

64bit MessageDigest - store short texts as long

I want to represent short texts (i.e. a word, or a couple of words) as a 64-bit hash (I want to store them as longs).
MessageDigest.getInstance("MD5") returns 128 bits.
Is there anything else I could use, or could I just peel off half of it? I am not worried about someone trying to duplicate a hash; I would like to minimize the number of clashes (two different strings having the same hash).
MD5 (and SHA) hashes "smear" the data in a uniform way across the hashed value, so any 64 bits you choose out of the final value will be as sensitive to a change as any other 64 bits. Your only concern will be the increased probability of collisions.
You can just use any part of the MD5 hash.
We tried folding the 128 bits down to 64 bits with various algorithms, but the folding didn't make any noticeable difference in hash distribution.
Why not just use String's hashCode()? We hashed 8 million email addresses into 32-bit integers and there were actually more collisions with MD5 than with String hashCode. You can run hashCode twice (forward and backward) and combine the results into a 64-bit long, as sketched below.
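A minimal sketch of that forward/backward idea, combining two 32-bit hashCode values into one long (reversing the string is just one way to obtain a second, roughly independent 32 bits):

// Combine String.hashCode() of the text and of its reverse into one 64-bit long.
static long twoWayHashCode(String text) {
    int forward = text.hashCode();
    int backward = new StringBuilder(text).reverse().toString().hashCode();
    return ((long) forward << 32) | (backward & 0xFFFFFFFFL);
}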
You can take a sampling of 64 bits from the 128-bit hash. You cannot guarantee there will be no clashes (only a perfect hash will give you that, and there is no perfect hash for arbitrary-length strings), but the chances of a clash will be very small.
As well as a sampling, you could derive the hash using a more complex function, such as XORing consecutive pairs of bits.
As a cryptographic hash (even one nowadays considered broken), MD5 has no significant correlation between input and output bits. That means, simply taking the first or last half will give you a perfectly well-distributed hash function. Anything else would never have been seriously considered as a cryptographic hash.
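A small Java sketch of taking the first 64 bits of the MD5 digest and packing them into a long (the UTF-8 encoding is an assumption):

import java.nio.ByteBuffer;
import java.security.MessageDigest;

public class Md5ToLong {

    // First 8 bytes of the 16-byte MD5 digest, interpreted as a signed long.
    static long md5AsLong(String text) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5").digest(text.getBytes("UTF-8"));
        return ByteBuffer.wrap(digest, 0, 8).getLong();
    }
}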
What about using a block cipher with a 64-bit block size?
