How do I avoid brute-force searching in hash search - Java

I have a program in Java which takes an input string and generates a hash value using the MD5 algorithm. The program searches for a particular pattern (e.g. 118855) in the generated hash string on each iteration, varying the last part of the input string by appending an integer that is incremented by one on each pass.
For example, if the input string is xyz then it will first find the MD5 hash of xyz0, then of xyz1, then of xyz2, and so on. On each pass it searches for a pattern, e.g. 12345, in the hash value. The program does not stop until this pattern is found.
Now my question is: how can I avoid the brute-force approach when searching for this pattern in these generated hash strings? In other words, how can I jump the integer by a dynamic value instead of one each time?
Note: All the above hashes are generated using MD5. I am not asking for a replacement for MD5. Also, I am not trying to find a collision between two hash values. My concern is to find a given substring pattern in these generated hash values.

If it were possible to tell in advance what to append to your "xyz" string (instead of brute-force searching) so that the MD5 hash contains a given pattern, then the algorithm would be useless.
Message digest algorithms are meant to make cheating near-impossible, so constructing a manipulated document that still gives the same hash value as the original should be computationally very hard.
MD5 isn't the cryptographically strongest hashing algorithm available, but you certainly can't just "construct" a plain text that gives some specified MD5 hash (or hash pattern). If that were possible, people would have thrown MD5 away long ago.
Unless you are a cryptography guru, I'd recommend staying with the brute-force approach.
[EDIT]
The number of tries to find an N-digit pattern should roughly be 16^N / (33 - N) (not corrected for double matches), e.g. 2500 tries for a 4-digit pattern or 40000 tries for a 5-digit pattern. So, depending on the pattern length, that looks doable to me.
[EDIT]
To explain the "calculation":
MD5 is written as 32 hex digits.
So if you want to find a specific 5-digit pattern at the front of the hash, there are 16^5 different possibilities, so the probability of getting the correct one with a single attempt is 1/16^5, and you'd need roughly 16^5 attempts to succeed.
But we don't care where in the hash we find our pattern, and a 5-digit pattern can start at any of 33 - 5 = 28 positions in a 32-digit string. This roughly multiplies the match probability by 28 (this isn't exact, as this calculation counts a double match twice when the pattern is contained at two different positions). So that factor divides the expected number of attempts.
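For illustration, here is a minimal Java sketch of the brute-force loop the question describes (the class and variable names are my own invention):

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class PatternSearch {
    // Hex-encode the MD5 digest of the input string (32 hex digits).
    static String md5Hex(String input) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(input.getBytes(StandardCharsets.UTF_8));
        // Pad with leading zeros to keep the full 32 digits.
        return String.format("%032x", new BigInteger(1, digest));
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        String prefix = "xyz";     // the fixed part of the input
        String pattern = "12345";  // the substring to search for
        for (long i = 0; ; i++) {  // brute force: no known shortcut exists
            String hash = md5Hex(prefix + i);
            if (hash.contains(pattern)) {
                System.out.println(prefix + i + " -> " + hash);
                break;
            }
        }
    }
}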

Related

Hash Function MD5 30 length

From a given string I am generating a 32-digit unique hash code using MD5:
MessageDigest.getInstance("MD5")
.digest("SOME-BIG-STRING".getBytes("UTF-8")).map("%02x".format(_)).mkString
//output: 47a8899bdd7213fb1baab6cd493474b4
Is it possible to generate a 30-digit hash instead of a 32-digit one, and what problems would there be if I did so?
Is there another hash algorithm that supports 30-character-long output with a low collision probability across 1 trillion unique strings?
Security is not important; uniqueness is required.
For generating unique IDs from strings, hash functions are never the correct answer.
What you would need is to define a one-to-one mapping of text strings (such as "v1.0.0") onto 30-character-long strings (such as "123123..."). This is also known as a bijection, although in your case an injection (a simple one-to-one mapping from inputs to outputs, not necessarily onto) may be enough. As the other answer at the time of this writing notes, hash functions don't necessarily ensure this mapping, but there are other possibilities, such as full-period linear congruential generators (if they take a seed that you can map one-to-one onto input string values), or other reversible functions.
However, if the set of possible input strings is larger than the set of possible output strings, then you can't map all input strings one-to-one with all output strings (without creating duplicates), due to the pigeonhole principle.
See also this question: How to generate a GUID with a custom alphabet, that behaves similar to an MD5 hash (in JavaScript)?.
Indeed, if you use hash functions, the chance of collision will be close to zero but never exactly zero (meaning that the risk of duplicates will always be there). If we take MD5 as an example (which produces any of 2^128 hash codes), then roughly speaking, the chance of accidental collision becomes non-negligible only after 2^64 IDs are generated, which is well over 1 trillion.
But MD5 and other hash functions are not the right way to do what you want to do. This is discussed next.
If you can't restrict the format of your input strings to 30 digits and can't compress them to 30 digits or less and can't tolerate the risk of duplicates, then the next best thing is to create a database table mapping your input strings to randomly generated IDs.
This database table should have two columns: one column holds your input strings (e.g., "<UUID>-NAME-<UUID>"), and the other column holds randomly generated IDs associated with those strings. Since random numbers don't ensure uniqueness, every time you create a new random ID you will need to check whether the random ID already exists in the database, and if it does exist, try a new random ID (but the chance that a duplicate is found will shrink as the size of the ID grows).
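As a minimal sketch of that approach (using an in-memory map in place of the real database table; all names here are illustrative):

import java.security.SecureRandom;
import java.util.HashMap;
import java.util.Map;

public class IdRegistry {
    private final SecureRandom random = new SecureRandom();
    // Stand-in for the two-column table: input string -> random 30-digit ID.
    private final Map<String, String> inputToId = new HashMap<>();

    // Returns the existing ID for this input, or registers a new random one.
    public String idFor(String input) {
        return inputToId.computeIfAbsent(input, k -> {
            String id;
            do {
                id = randomDigits(30);
            } while (inputToId.containsValue(id)); // retry on the (rare) duplicate
            return id;
        });
    }

    private String randomDigits(int n) {
        StringBuilder sb = new StringBuilder(n);
        for (int i = 0; i < n; i++) {
            sb.append((char) ('0' + random.nextInt(10)));
        }
        return sb.toString();
    }
}

A real implementation would enforce uniqueness with a UNIQUE constraint on the ID column rather than the linear containsValue scan used here for brevity.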
Is it possible to generate a 30-digit hash instead of a 32-digit one, and what problems would there be if I did so?
Yes.
You increase the probability of a collision by a factor of 2^8 = 256, since dropping 2 hex digits discards 8 bits of the hash.
Is there another hash algorithm that supports 30-character-long output with a low collision probability across 1 trillion unique strings?
Probably. Taking the first 30 hex digits of a hash produced by any crypto-strength hash algorithm has roughly equivalent uniqueness properties.
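For instance, a sketch of taking the first 30 hex digits of a SHA-256 hash (the class and method names are my own):

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class Hash30 {
    // 30 hex digits = 120 bits taken from the front of a SHA-256 digest.
    static String hash30(String input) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(input.getBytes(StandardCharsets.UTF_8));
        return String.format("%064x", new BigInteger(1, digest)).substring(0, 30);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(hash30("SOME-BIG-STRING"));
    }
}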
Security is not important, uniqueness is required?
In that case, the fact that MD5 is no longer considered secure is moot. (Note that the reason MD5 is no longer considered secure is that it is computationally feasible to engineer a collision; i.e. to construct two different inputs that produce the same hash.)
However, uniqueness of hashes cannot be guaranteed. Even with a "perfect" crypto-strength hash function that generates N-bit hashes, the probability of a collision for any 2 arbitrary (different) inputs is 1 in 2^N. For large enough values of N, the probability is very small. But it is never zero.

Best way to derive / compute a unique hash from a given String in Java

I am looking for ways to compute a unique hash for a given String in Java. It looks like I cannot use MD5 or SHA-1, because folks claim they are broken and do not always guarantee uniqueness.
I should get the same hash (preferably a 32 character string like the MD5 Sum) for two String objects which are equal by the equals() method. And no other String should generate this hash - that's the tricky part.
Is there a way to achieve this in Java?
If a guaranteed-unique hash code is required, then it is not practically possible. Hashes and hash codes are non-unique.
A Java String of length N has 65536^N possible states, and requires an integer with 16 * N bits to represent all possible values. If you write a hash function that produces an integer with a smaller range (e.g. fewer than 16 * N bits), you will eventually find cases where more than one String hashes to the same integer; i.e. the hash codes cannot be unique. This is called the pigeonhole principle, and there is a straightforward mathematical proof. (You can't fight math and win!)
But if "probably unique" with a very small chance of non-uniqueness is acceptable, then crypto hashes are a good answer. The math will tell you how big (i.e. how many bits) the hash has to be to achieve a given (low enough) probability of non-uniqueness.
Update: check this other good answer: What is a good 64bit hash function in Java for textual strings?
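Following up on that link, one simple way to get a 64-bit "probably unique" value in Java is to take the first 8 bytes of a crypto hash. This is a sketch under that assumption, not the linked answer's exact code:

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class Hash64 {
    // First 8 bytes of a SHA-256 digest, viewed as a long.
    // Still not guaranteed unique, just very unlikely to collide.
    static long hash64(String s) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(s.getBytes(StandardCharsets.UTF_8));
        return ByteBuffer.wrap(digest).getLong();
    }

    public static void main(String[] args) throws Exception {
        System.out.printf("%016x%n", hash64("hello"));
    }
}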

Which hash algorithm can be used for duplicate content verification?

I have an XML file where I need to determine whether it is a duplicate or not.
I will either hash the entire XML file, or use specific XML nodes in the file to generate some kind of hash.
Is MD5 suitable for this?
Or something else? Speed of hash generation is fairly important, but the guarantee of producing a unique hash for unique data is of higher importance.
MD5 is broken (in the sense that it's possible to intentionally generate a hash collision), so you should probably use the SHA-2 family (e.g. SHA-256 or SHA-512) if you are concerned about someone maliciously creating a file with the same hash as another file.
Note that hash functions, by their nature, cannot guarantee a unique hash for every possible input. Hash functions have a limited output length (e.g. MD5 is 128 bits long, so there are 2^128 possible hashes). You can't map a potentially infinite domain one-to-one onto a finite co-domain; this is mathematically impossible.
However, per the birthday paradox, a collision in a good n-bit hash function is only expected after about 2^(n/2) inputs (e.g. with 128-bit MD5 that would be 2^64). This is so statistically insignificant that you don't have to worry about a collision happening by accident.
MD5 is suitable and fast. Note though that a single difference of one character will produce a completely different MD5 hash.
There is a slight chance that MD5 will produce the same hash for different inputs, but this will be extremely rare. So, depending on your input (are you expecting many similar XMLs or many different ones?), when MD5 gives you a positive match you can compare the plain String contents.
If someone can alter at least partially the contents of some of the XML files, and that someone has an advantage in making you declare two XML files (or XML excerpts) identical while in fact they are not, then you need a cryptographically secure hash function, namely one which is resistant to collisions. A collision is a pair of distinct messages (sequences of bytes) which yield the same hash output -- exactly what you would like to avoid. Since a hash function accepts inputs longer than its output, collisions necessarily exist; a hash function is deemed cryptographically secure when nobody can actually produce such a collision.
If a hash function outputs n bits, then one can expect to find a collision after hashing about 2^(n/2) distinct messages. A secure hash function is one for which no method is known to find a collision faster than that.
If there is no security issue (i.e. nobody will actively try to find a collision; you just fear a collision out of bad luck), then cryptographically weak hash functions are an option, provided that they have a large enough output, so that 2^(n/2) remains way bigger than the expected number of XML files you will compare. For n = 128 (i.e. 2^(n/2) close to eighteen billion billion), MD5 is fine, fast and widely supported. You may want to investigate MD4, which is even weaker, but a bit faster too. If you want a larger n, try SHA-1, which offers 160-bit outputs (also, SHA-1's weaknesses are still theoretical at the moment, so SHA-1 is much less "cryptographically broken" than MD5).
If you have, even potentially, security issues, then go for SHA-256. No cryptographic weakness with regards to collisions is currently known for that function. If you run into performance issues (which is rather improbable: on a basic PC, SHA-256 can process more than 100 megabytes of data per second, so chances are that XML parsing will be widely more expensive than hashing), consider SHA-512, which is somewhat faster on platforms which offer 64-bit integer types (but quite slower on platforms which do not).
Note that all these hash functions operate on sequences of bytes. A single flipped bit changes the output. In the XML world, a given document can be encoded in various ways which are semantically identical but distinct as far as bytes on the wire are concerned (e.g. é and &#233; both represent the same character é). It is up to you to define which notion of equality you want to use; see canonical XML.
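As a concrete sketch of the byte-level hashing discussed above (SHA-256 over a file's raw bytes; canonicalize the XML first if encodings may differ):

import java.io.InputStream;
import java.math.BigInteger;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;

public class XmlDigest {
    // Streams the file through SHA-256 so large files need not fit in memory.
    static String sha256Hex(Path file) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        return String.format("%064x", new BigInteger(1, md.digest()));
    }
}

Two files are then (almost certainly) duplicates when their hex digests are equal; remember this compares bytes, not XML semantics.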

Do hashing algorithms guarantee unique outputs if the same salt is used for unique inputs?

We have a view into a system that uses a value for the unique ID that another company we want to share information with will not accept. I was thinking of using a one-way hash, similar to what is done with passwords. The concern is: can the hash output values be guaranteed unique if the inputs are guaranteed unique and the salt is constant?
The answer is yes: the same ID input with the same salt will always produce the same output.
But if your question is about guaranteeing that outputs will always be unique, the answer is no. There is a very small statistical probability that the hashing will create the same output twice, even if the inputs are different and the salt is constant.
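A small sketch demonstrating that determinism (the salt-prepending scheme here is just one illustrative choice, not the only way to apply a salt):

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class SaltedHash {
    static String hash(String id, String salt) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest((salt + id).getBytes(StandardCharsets.UTF_8));
        return String.format("%064x", new BigInteger(1, digest));
    }

    public static void main(String[] args) throws Exception {
        // Same input and same salt: the two printed lines are identical.
        System.out.println(hash("customer-42", "fixed-salt"));
        System.out.println(hash("customer-42", "fixed-salt"));
    }
}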
In principle, there is no hashing algorithm without collisions if the input size is larger than the output size. (In your case, the relevant input size is the size of the part which changes from one input to the next.)
Whether there are collisions for shorter inputs too is a property of the hashing algorithm, but the idea is that the probability of these should be quite small (about 1/2^(output size) for each pair of inputs, for a good algorithm).
Is your question whether two different values can hash to the same thing, or whether hashes are deterministic?
If it's the former, then yes, you can have hash collisions. A well-designed, cryptographically strong hash should make it difficult to find two values hashing to the same value, or to find an input that matches a given hash, but it can't guarantee uniqueness.
By the pigeonhole principle:
If your hash has a constant size, say 64 bits (without loss of generality), you have at most 2^64 unique output hash values. Since there are more than 2^64 potential inputs if you're using strings, a collision is guaranteed after you hash at most 2^64 + 1 items.
Yes, the same hash will be produced when the input and salt are the same. Note, though, that different inputs may produce the same hash.
In short, no. The longer answer is that a perfect oracle would be able to solve the question you posed, and since no one has ever proven the existence of a perfect oracle, it is currently believed to be impossible. The other side of it isn't that it is impossible, just that we as a collective are not intelligent enough to figure it out, similar to P != NP.

How can I generate a unique int from a unique String?

I have an object with a String that holds a unique ID
(such as "ocx7gf" or "67hfs8").
I need to supply it with an implementation of int hashCode() which will be unique, obviously.
How do I convert a String to a unique int in the easiest/fastest way?
Thanks.
Edit: OK, I already know that String.hashCode() is possible, but it is not recommended anywhere. Actually, if no other method is recommended either, should I use it or not when my object is in a collection and I need the hash code? Should I concatenate it with another string to make it more successful?
No, you don't need an implementation that returns a unique value, "obviously", since the majority of implementations would obviously be broken.
What you want is a good spread across bits, especially for common values (if any values are more common than others). Barring special knowledge of your format, just using the hash code of the string itself would be best.
With special knowledge of the limits of your ID format, it may be possible to customise and get better performance, though false assumptions are more likely to make things worse than better.
Edit: on a good spread of bits.
As stated here and in other answers, being completely unique is impossible, and hash collisions are possible. Hash-using methods know this and can deal with it, but it does impact performance, so we want collisions to be rare.
Further, hashes are generally re-hashed, so our 32-bit number may end up being reduced to, e.g., one in the range 0 to 22, and we want as good a distribution within that as possible too.
We also want to balance this against not taking so long to compute our hash that it becomes a bottleneck in itself. An imperfect balancing act.
A classic example of a bad hash method is one for a co-ordinate pair of X, Y ints that does:
return X ^ Y;
While this does a perfectly good job of returning 2^32 possible values out of the 4^32 possible inputs, in real-world use it's quite common to have sets of coordinates where X and Y are equal ({0, 0}, {1, 1}, {2, 2} and so on), which all hash to zero, or matching pairs ({2, 3} and {3, 2}), which hash to the same number. We are likely better served by:
return ((X << 16) | (X >>> 16)) ^ Y;
Now, there are just as many possible values for which this is dreadful as for the former, but it tends to serve better in real-world cases.
Of course, it is a different job if you are writing a general-purpose class (no idea what possible inputs there are) or have a better idea of the purpose at hand. For example, if I were using Date objects but knew that they would all be dates only (time part always midnight) and only within a few years of each other, then I might prefer a custom hash code that used only the day, month and lower digits of the year over the standard one. The writer of Date, though, can't work on such knowledge and has to try to cater for everyone.
Hence, if I knew for instance that a given string will always consist of 6 case-insensitive characters in the range [a-z] or [0-9] (which yours seem to, but it isn't clear from your question that they do), then I might use an algorithm that assigns a value from 0 to 35 (the 36 possible values for each character) to each character, and then walks through the string, each time multiplying the current value by 36 and adding the value of the next char (see the sketch below).
Assuming a good spread in the IDs, this would be the way to go, especially if I ordered it so that the less significant digits of my hash matched the most frequently changing char in the ID (if such a call could be made), so the hash would survive re-hashing to a smaller range well.
However, lacking sure knowledge of the format, I can't make that call with certainty, and I could well be making things worse (a slower algorithm for little or even negative gain in hash quality).
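A minimal sketch of that base-36 walk, assuming the 6-character [a-z0-9] format holds (the method name is mine):

// Rolling base-36 hash for 6-char [a-z0-9] IDs.
// 36^6 is below 2^32, so distinct 6-char IDs map to distinct int bit
// patterns (some wrap to negative values, which is fine for hashCode()).
static int base36Hash(String id) {
    int h = 0;
    for (char c : id.toCharArray()) {
        int v = Character.digit(c, 36); // '0'-'9' -> 0..9, 'a'-'z' -> 10..35
        if (v < 0) throw new IllegalArgumentException("not base-36: " + c);
        h = h * 36 + v;
    }
    return h;
}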
One advantage you have is that, since it's an ID in itself, presumably no other non-equal object has the same ID, and hence no other properties need be examined. This doesn't always hold.
You can't get a unique integer from a String of unlimited length. There are 4 billionish (2^32) unique integers, but an almost infinite number of unique strings.
String.hashCode() will not give you unique integers, but it will do its best to give you differing results based on the input string.
EDIT
Your edited question says that String.hashCode() is not recommended. This is not true; it is recommended, unless you have some special reason not to use it. If you do have a special reason, please provide details.
Looks like you've got a base-36 number there (a-z plus 0-9). Why not convert it to an int using Integer.parseInt(s, 36)? Obviously, if there are too many unique IDs, it won't fit into an int, but in that case you're out of luck with unique integers and will need to get by with String.hashCode(), which does its best to be close to unique.
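For example (the printed values assume the 6-character base-36 format):

public class Base36Demo {
    public static void main(String[] args) {
        // "ocx7gf" is about 1.47 billion, which fits in an int.
        System.out.println(Integer.parseInt("ocx7gf", 36)); // 1472892927
        // But 6 base-36 chars can reach 36^6 - 1 = 2176782335, which
        // overflows int; parse as long when that is possible.
        System.out.println(Long.parseLong("zzzzzz", 36));   // 2176782335
    }
}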
Unless your strings are limited in some way, or your integers hold more bits than the strings you're trying to convert, you cannot guarantee uniqueness.
Let's say you have a 32-bit integer and a 64-character alphabet for your strings. That means six bits per character, which allows you to store five characters in an integer. More than that and it won't fit.
Represent each string character by a five-digit binary number, e.g. a by 00001, b by 00010, etc., so that 32 combinations are possible. For example, cat would be written as 00011 00001 10100 (c = 3, a = 1, t = 20); converting this binary to decimal gives 3124, so cat would be 3124. Similarly, you can get cat back from 3124 by converting it to binary first and mapping each five-digit group back to a character.
One way to do it is to assign each letter a value (a = 1, b = 2, and so on) and give each position in the string its own multiplier: everything in the first position (reading left to right) is multiplied by one prime, the next position by the next prime, and so on, such that the final position is multiplied by a prime larger than the number of possible values in a position (26 + 1 for a space, or 52 + 1 with capitals, and so on for other supported characters). Any number you generate from a unique string then maps back, via the first position, to a unique value.
Dog might be 30, 3(15), 101(7), or 782, while God is 33, 3(15), 101(4), or 482. More important than the unique strings being generated, they can be useful in generation if the original digit is kept: 30(782) would be distinct from some 12(782), for the purpose of differentiating similar strings, if you ever managed to exhaust the unique possibilities. Dog would always be Dog, but it would never be Cat or Mouse.
