Hash attack: find strings of length 2^N with same hashCode() - java

Reading Algorithms, Fourth Edition by Robert Sedgewick and Kevin Wayne, I found the following question:
Hash attack: find 2^N strings, each of length 2^N, that have the same hashCode() value, supposing that the hashCode() implementation for String is the following:
public int hashCode() {
    int hash = 0;
    for (int i = 0; i < length(); i++)
        hash = (hash * 31) + charAt(i);
    return hash;
}
Strong hint: Aa and BB have the same value.
What comes to mind is generating all possible strings of length 2^N and comparing their hash codes. This, however, is very expensive for large N and I doubt it's the correct solution.
Can you give me a hint about what I'm missing in the whole picture?

Andreas' and Glains' answers are both correct, but they aren't quite what you need if your goal is to produce 2^N distinct strings of length 2^N.
Rather, a simpler approach is to build strings consisting solely of concatenated sequences of Aa and BB. For length 2×1 you have { Aa, BB }; for length 2×2 you have { AaAa, AaBB, BBAa, BBBB }; for length 2×3 you have { AaAaAa, AaAaBB, AaBBAa, AaBBBB, BBAaAa, BBAaBB, BBBBAa, BBBBBB }; and so on.
(Note: you've quoted the text as saying the strings should have length 2^N. I'm guessing that you misquoted, and it's actually asking for length 2N; but if it is indeed asking for length 2^N, then you can simply drop elements as you proceed.)
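As a minimal sketch of that construction (the class and method names here are made up for illustration), each round appends either Aa or BB to every string built so far, so N rounds give 2^N strings of length 2N:

```java
import java.util.ArrayList;
import java.util.List;

class HashAttack {
    // Builds all 2^n strings of length 2n that consist of the blocks "Aa" and "BB".
    static List<String> collisions(int n) {
        List<String> result = new ArrayList<>();
        result.add("");
        for (int i = 0; i < n; i++) {
            List<String> next = new ArrayList<>(result.size() * 2);
            for (String prefix : result) {
                next.add(prefix + "Aa");
                next.add(prefix + "BB");
            }
            result = next;
        }
        return result;
    }

    public static void main(String[] args) {
        // Every line should show the same hash code.
        for (String s : collisions(3)) {
            System.out.println(s + " -> " + s.hashCode());
        }
    }
}
```

The list doubles every round, so this is only practical for small N.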

"Strong hint" explained.
Strong hint: Aa and BB have the same value.
In ASCII / Unicode, B has a value 1 higher than A. Since those are the second last characters, the value is multiplied by 31, so hash code is increased by 31 when you change xxxxAa to xxxxBa.
To offset that, you need the last character to change by -31. Since lowercase letters are 32 higher than uppercase letters, changing a to A is -32, and moving one letter up to B then makes it -31.
So, to get the same hash code, change the second-last letter to the next letter (e.g. A to B), and change the last letter from lowercase to the next uppercase letter (e.g. a to B).
You can now use that hint to generate up to 26 strings with the same hash code.

Let's take a look at the hashCode() implementation and the given hint:
public int hashCode() {
    int hash = 0;
    for (int i = 0; i < length(); i++)
        hash = (hash * 31) + charAt(i);
    return hash;
}
We know that Aa and BB produce the same hash and we can easily verify that:
(65 * 31) + 97 = 2112
(66 * 31) + 66 = 2112
From here on, the hash is the same for both inputs. That said, we can append any number of characters to both strings and will always get the same value for both.
One example could be:
hashCode("AaTest") = 1953079538
hashCode("BBTest") = 1953079538
So, you can generate enough hash values by just appending the same sequence of characters to both strings, more formally:
hashCode("Aa" + x) = hashCode("BB" + x)
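A quick way to check this in Java is sketched below; SuffixDemo and collides are hypothetical names, and the suffixes are arbitrary:

```java
class SuffixDemo {
    // True when appending the same suffix to "Aa" and "BB" yields equal hash codes.
    static boolean collides(String suffix) {
        return ("Aa" + suffix).hashCode() == ("BB" + suffix).hashCode();
    }

    public static void main(String[] args) {
        for (String x : new String[] {"", "Test", "hello world"}) {
            System.out.println("\"" + x + "\" collides: " + collides(x));
        }
    }
}
```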
Another note on your idea to generate all possible strings and search for duplicates: have a look at the birthday paradox and you will see that it takes far fewer attempts than you might expect to find duplicate hash values for different inputs.
It will be very difficult to find the original hashed value (indeed, you would have to try out all possible inputs if the hash algorithm is good).
Duplicate hash values are rare (there have to be duplicates since the hash has a fixed length). If a duplicate is found, the duplicate should be meaningless (random characters), so it cannot be abused by an attacker.

Taking a closer look at the hash function, it works like a positional number system (e.g. hexadecimal) in which the weight of the digits is 31. That is, think of the string as a number written in base 31, which makes the final hash code hashCode = 31^(n-1) * first-char + 31^(n-2) * second-char + ... + 31^0 * last-char, where n is the length of the string.
The second observation is that the ASCII distance between a capital letter and the corresponding small letter is 32. In terms of the hash function, since 32 = 31 + 1, replacing a capital letter by its small counterpart is the same as adding 1 to the next higher digit and 1 to the current digit. For example:
BB = 31 * B + 31^0 * B, which also equals 31 * (B - 1) + 31^0 * (31 + B). Notice that I have just taken one unit from the higher digit and added it to the lower digit (where it is worth 31) without changing the overall value. Since B - 1 = A and 31 + B = a, the last expression equals 31 * A + a, which is Aa.
So, to generate the possible strings for a given hash code, start with the initial string and work from right to left, raising a character by 31 (e.g. B to a) while decreasing the character at the next higher position by one, or doing the reverse, where applicable. Each such replacement runs in O(1).
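The "borrow" step can be sketched like this; Base31Demo and borrow are names invented for the example, and the caller is assumed to pick an index where both resulting characters are still acceptable:

```java
class Base31Demo {
    // Moves one unit from position i to position i + 1: the char at i drops by 1,
    // and since that unit is worth 31 one position lower, the char at i + 1 rises by 31.
    static String borrow(String s, int i) {
        char[] c = s.toCharArray();
        c[i] -= 1;
        c[i + 1] += 31;
        return new String(c);
    }

    public static void main(String[] args) {
        String t = borrow("BB", 0);
        System.out.println(t + " " + (t.hashCode() == "BB".hashCode()));
    }
}
```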
Hope this helps.

Related

Making ASCII values more usable as part of hash key

I'm to implement a hash function, and here is my hash function (the first draft version that is)
public int hashCode() {
    String fixedISBN = getIsbn().toString().replace("-", "");
    fixedISBN = fixedISBN.substring(fixedISBN.length() - 4, fixedISBN.length());
    int ISBN = Integer.parseInt(fixedISBN);
    int ASCII = 0;
    for (int i = 0; i < getTitle().toString().length(); i++) {
        ASCII += getTitle().toString().charAt(i);
    }
    int hashValue = (ISBN * 37 + ASCII * 23);
    return hashValue;
}
I am meant to hash books, and to do so I initially thought to use the ISBN of a book, which serves as a unique identifier for every book. Then I looked at the list of ISBNs and saw that using the entire ISBN would not help much, since there isn't a lot of variation between the ISBNs. As such, I use only the last four digits of the ISBN, since those tend to be the ones that vary. I also plan to use the ASCII values of the title's chars for my hashValue, but I believe a problem arises since an ASCII value is at most 127, which means there would be a problem if the title is short, say 8 chars or fewer, which would produce a maximum sum of 1016. If the table size is very large, say 10 007, it wouldn't produce a very even spread. Is there any way I could make ASCII values more suitable for producing a hash value for a large table?

Is there an approach to finding the ASCII distance between two strings of 5 characters

I am trying to find a way to calculate and print the ASCII distance between a string from user input
Scanner scan = new Scanner(System.in);
System.out.print("Please enter a string of 5 uppercase characters:");
String userString = scan.nextLine();
and a randomly generated string
int leftLimit = 65; // Upper-case 'A'
int rightLimit = 90; // Upper-case 'Z'
int stringLength = 5;
Random random = new Random();
String randString = random.ints(leftLimit, rightLimit + 1)
        .filter(i -> (i <= 57 || i >= 65) && (i <= 90 || i >= 97))
        .limit(stringLength)
        .collect(StringBuilder::new, StringBuilder::appendCodePoint, StringBuilder::append)
        .toString();
Is there a way to calculate the distance without having to separate each individual character from the two strings, comparing them and adding them back together?
Use edit distance (Levenshtein distance).
You can:
implement your own edit distance based on the algorithm on Wikipedia,
use existing source code, for example from Rosetta Code,
use an existing library such as Apache's LevenshteinDistance.
You can also check the Levenshtein distance questions on Stack Overflow.
Streams are, well, as the name says, streams. They don't work very well unless you can define an operation strictly on the basis of one input: One element from a stream, without knowing its index or referring to the entire collection.
Here, that is a problem; after all, to operate on, say, the 'H' in your input, you need the matching character from your random code.
I'm not sure why you find 'separate each individual character, compare them, and add them back together' so distasteful. Isn't that a pretty clean mapping from the problem description to instructions for your computer to run?
The alternative is more convoluted: You could attempt to create a mixed object that contains both the letter as well as its index, stream over this, and use the index to look up the character in the second string. Alternatively, you could attempt to create a mixed object containing both characters (so, for inputs ABCDE and HELLO, an object containing both A and H), but you'd be writing far more code to get that set up than with the simple, no-streams way.
So, let's start with the simple way:
int difference = 0;
for (int i = 0; i < stringLength; i++) {
    char a = inString.charAt(i);
    char b = randomString.charAt(i);
    difference += difference(a, b);
}
You'd have to write the difference method yourself - but it'd be a very very simple one-liner.
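For what it's worth, here is a minimal sketch of that one-liner together with the loop. It assumes "ASCII distance" means the sum of the absolute differences between the character codes at each position (the question doesn't define it), and the names AsciiDistance and distance are made up:

```java
class AsciiDistance {
    // Absolute difference between the code values of two characters.
    static int difference(char a, char b) {
        return Math.abs(a - b);
    }

    // Sums the per-position differences; assumes both strings have the same length.
    static int distance(String s, String t) {
        int total = 0;
        for (int i = 0; i < s.length(); i++) {
            total += difference(s.charAt(i), t.charAt(i));
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(distance("ABCDE", "HELLO"));
    }
}
```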
Trying to take two collections of some sort, and from them create a single stream where each element in the stream is matching elements from each collection (so, a stream of ["HA", "EB", "LC", "LD", "OE"]) is generally called 'zipping' (no relation to the popular file compression algorithm and product), and java doesn't really support it (yet?). There are some third party libraries that can do it, but given that the above is so simple I don't think zipping is what you're looking for here.
If you absolutely must, I guess it'd look something like:
// a stream of 0,1,2,3,4
IntStream.range(0, stringLength)
        // map 0 to "HA", 1 to "EB", etcetera
        .mapToObj(idx -> "" + inString.charAt(idx) + randomString.charAt(idx))
        // map "HA" to the difference score
        .mapToInt(x -> difference(x))
        // and sum it.
        .sum();

public int difference(String a) {
    // exercise for the reader
}
Create a 2D array and fill it with the distances between characters; you can then index directly into the 2D array to pull out the distance between any two characters.
The total is then one expression that sums up a set of array accesses.
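A rough sketch of the lookup-table idea, assuming only the uppercase letters A to Z occur (as in the question) and that the distance between two characters is the absolute difference of their codes; DistanceTable and its members are hypothetical names:

```java
class DistanceTable {
    // TABLE[i][j] holds the distance between the letters 'A' + i and 'A' + j.
    static final int[][] TABLE = new int[26][26];

    static {
        for (int i = 0; i < 26; i++) {
            for (int j = 0; j < 26; j++) {
                TABLE[i][j] = Math.abs(i - j);
            }
        }
    }

    // Sums one table lookup per position; assumes equal-length, uppercase-only strings.
    static int distance(String s, String t) {
        int total = 0;
        for (int i = 0; i < s.length(); i++) {
            total += TABLE[s.charAt(i) - 'A'][t.charAt(i) - 'A'];
        }
        return total;
    }
}
```

For a plain absolute difference the table buys little; it pays off mainly when the per-character distance is something more expensive to compute.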
Here is my code for this (ASCII distance) in MATLAB
function z = asciidistance(input0)
    if nargin ~= 1
        error('please enter a string');
    end
    size0 = size(input0);
    if size0(1) ~= 1
        error('please enter a string');
    end
    length0 = size0(2);
    rng('shuffle');
    a = 32;
    b = 127;
    % row vector, so that the subtraction below is element-wise
    string0 = (b-a).*rand(1,length0) + a;
    x = char(floor(string0));
    z = (input0 - x);
    ascii0 = sum(abs(z),'all');
    ascii1 = abs(sum(z,'all'));
    disp(ascii0);
    disp(ascii1);
    disp(ascii0/ascii1/length0);
end
This script also differentiates between the absolute ASCII distance on a per-character basis vs that on a per-string basis, thus resulting in two integers returned for the ASCII distance.
I have also included the limit of these two values, the value of which approaches the inverse of the length of strings being compared. This actually approximates the entropy, E, of every random string generation event when run.
After standard error checking, the script first finds the length of the input string. The rng function seeds the random number generator. The a and b variables define the range of the ASCII table minus the non-printable characters, which ends at 126, inclusive; 127 is used as the upper bound so that the next line of code can generate a random string of the input length. The following line of code turns the random values into characters from the ASCII table. The line after that subtracts the two strings element-wise and stores the result. The next two lines of code sum up the ASCII distances in the two ways mentioned in the first paragraph. Finally, the values are printed out, along with the entropy, E, of the random string generation event.

Too many hashing function collisions

I'm trying to make a hashing function using the polynomial accumulation method (which is supposed to give you 5 collisions per 55k words or something) but when I run it with 1,000 words, I get ~190 collisions. Am I doing something wrong?
public int hashCode(String str) {
    double hash_value = 0; // used for float
    for (int i = 0; i < str.length(); i++) {
        hash_value = 33 * hash_value + str.charAt(i);
    }
    return (int) (hash_value % array_size);
}
Generally, prime numbers are favoured for hash code generation. I suggest trying 109 or 251. 33 is a multiple of 3 which means you are more likely to have issues based on your inputs.
Also you should use an int for the calculations and call Math.abs on the result.
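A sketch of what the int-based version could look like. Note that it swaps Math.abs for Math.floorMod, because Math.abs(Integer.MIN_VALUE) is still negative; the class name and the multiplier and arraySize parameters are assumptions made for the example:

```java
class PolyHash {
    // Polynomial accumulation in int arithmetic; overflow simply wraps around.
    static int hash(String str, int multiplier, int arraySize) {
        int h = 0;
        for (int i = 0; i < str.length(); i++) {
            h = multiplier * h + str.charAt(i);
        }
        // floorMod keeps the index in [0, arraySize) even when h is negative.
        return Math.floorMod(h, arraySize);
    }

    public static void main(String[] args) {
        System.out.println(hash("example", 109, 10007));
    }
}
```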
Either your data set is extremely "unlucky", or (which is more probable) the array_size is too small (hash function params are usually quoted without consideration of finite bucket array size).
You are generating a large number which is different for different words in the input. But there is still a chance of collisions, as for example
"Ab" = (33x65)+98=2243
"BA" = (33x66)+65=2243
If you go for a multiplier greater than 57, there will be less chance of collision. 109 or 251 would be a good choice.
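To see where 57 comes from: it is the largest gap between two letter codes ('z' - 'A'), so with a multiplier above 57 two different two-letter words can never cancel each other out. A small sketch that counts distinct values over all two-letter words (MultiplierDemo and its members are hypothetical names):

```java
import java.util.HashSet;
import java.util.Set;

class MultiplierDemo {
    static final String LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

    // Counts the distinct values of m * first + second over all two-letter words.
    static int distinctTwoLetterHashes(int m) {
        Set<Integer> seen = new HashSet<>();
        for (char a : LETTERS.toCharArray()) {
            for (char b : LETTERS.toCharArray()) {
                seen.add(m * a + b);
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println("33:  " + distinctTwoLetterHashes(33) + " of " + 52 * 52);
        System.out.println("109: " + distinctTwoLetterHashes(109) + " of " + 52 * 52);
    }
}
```

This only says something about very short words; longer words can still collide, and reducing the value modulo the table size adds collisions of its own.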

Word frequency hash table

Ok, I have a project that requires me to have a dynamic hash table that counts the frequency of words in a file. I must use java, however, we are not allowed to use any built in data types or built in classes at all except standard arrays. Also, I am not allowed to use any hash functions off the internet that are known to be fast. I have to make my own hash functions. Lastly, my instructor also wants my table to start as size "1" and double in size every time a new key is added.
My first idea was to sum the ASCII values of the letters composing a word and use that to make a hash function, but different words with the same letters will equal the same value.
How can I get started? Is the ASCII idea on the right track?
A hash table isn't expected to have in general a one-to-one mapping between a value and a hash. A hash table is expected to have collisions. That is, the domain of the hash-function is expected to be larger than the range (i.e., the hash value). However, the general idea is that you come up with a hash function where the probability of collision is drastically small. If your hash-function is uniform, i.e., if you have it designed such that each possible hash-value has the same probability of being generated, then you can minimize collisions this way.
Getting a collision isn't the end of the world. That just means that you have to search the list of values for that hash. If your hashing function is good, overall your performance for lookup should still be O(1).
Generating hashing functions is a subject of its own, and there is no one answer. But a good place for you to start could be to work with the bitwise representations of the characters in the string, and perform some sort of convolution operations on them (rotate, shift, XOR) in series. You could perform these in some way based on some initial seed-value, and then use the output of the first step of hashing as a seed for the next step. This way you can end up magnifying the effects of your convolution.
For example, let's say you get the character A, which is 41 in hex, or 0100 0001 in binary. You could designate each bit to mean some operation (maybe bit 0 is a ROR when it is 0, and a ROL when it is 1; bit 1 is an OR when it is 0, and a XOR when it is 1, etc.). You could even decide how much convolution you want to do based on the value itself. For example, you could say that the lower nibble specifies how much right-rotation you will do, and the upper nibble specifies how much left rotation you will do. Then once you have the final value, you will use that as the seed for the next character. These are just some ideas. Use your imagination and see what you get!
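One very simple instance of this idea, as a sketch only: rotate the running value left by a fixed 5 bits and XOR in the next character. The fixed rotation amount and the names are arbitrary choices for illustration, not a recommendation, and the rotation is written out by hand since library helpers are off limits in the assignment:

```java
class RotateXorHash {
    // Each step rotates the running value left by 5 bits, then XORs in the next
    // character; the result of one step is the seed for the next.
    static int hash(String word, int seed) {
        int h = seed;
        for (int i = 0; i < word.length(); i++) {
            h = (h << 5) | (h >>> 27); // manual 32-bit rotate left by 5
            h ^= word.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        // Unlike a plain sum of character codes, the order of the letters matters.
        System.out.println(hash("ab", 0) + " vs " + hash("ba", 0));
    }
}
```

The result can be negative, so clear the sign bit (or otherwise normalize it) before reducing it to a table index.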
It does not matter how good your hash function is, you will always have collisions you need to resolve.
If you want to keep your approach of using the ASCII values of the letters, you shouldn't just add the values; this would lead to a lot of collisions. You should weight the values by powers of a base. For example, for the word "Help" you compute 'H' * 256^0 + 'e' * 256^1 + 'l' * 256^2 + 'p' * 256^3. Or, as a sketch in code:
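// Weights each character by a power of 256. The exponent wraps after 4
// characters because 256^4 no longer fits in an int.
static int hash(String word, int hashSize) {
    int res = 0;
    int count = 0;
    for (int i = 0; i < word.length(); i++) {
        res += word.charAt(i) << (8 * count); // c * 256^count
        count = (count + 1) % 4;
    }
    return (res & 0x7FFFFFFF) % hashSize; // clear the sign bit, then reduce
}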
Now you just have to write your own Hashtable:
// Assumes the hash(String, int) function from above is a static method of this class.
class WordCounterMap {
    private Entry[] entries = new Entry[1];

    void add(String s) {
        int idx = find(entries, s);
        if (entries[idx] != null) { // word already stored: just count it
            entries[idx].count++;
            return;
        }
        // New key: double the table, as the assignment requires, and re-insert everything.
        Entry[] temp = new Entry[entries.length * 2];
        for (Entry e : entries) {
            if (e != null) {
                temp[find(temp, e.word)] = e;
            }
        }
        entries = temp;
        entries[find(entries, s)] = new Entry(s);
    }

    int getCount(String s) {
        Entry e = entries[find(entries, s)];
        return e == null ? 0 : e.count;
    }

    // Linear probing: index of the slot holding s, or of the first empty slot if s is absent.
    // Terminates because doubling on every new key always leaves empty slots.
    private static int find(Entry[] table, String s) {
        int idx = hash(s, table.length);
        while (table[idx] != null && !table[idx].word.equals(s)) {
            idx = (idx + 1) % table.length;
        }
        return idx;
    }
}

class Entry {
    int count = 1;
    String word;

    Entry(String s) {
        this.word = s;
    }
}

Java convert hash to random string

I'm trying to develop a reduction function for use within a rainbow table generator.
The basic principle behind a reduction function is that it takes in a hash, performs some calculations, and returns a string of a certain length.
At the moment I'm using SHA1 hashes, and I need to return a string with a length of three. I need the string to be made up of any three random characters from:
abcdefghijklmnopqrstuvwxyz0123456789
The major problem I'm facing is that any reduction function I write, always returns strings that have already been generated. And a good reduction function will only return duplicate strings rarely.
Could anyone suggest any ideas on a way of accomplishing this? Or any suggestions at all on hash to string manipulation would be great.
Thanks in advance
Josh
So it sounds like you've got 20 digits of base 256 (the length of a SHA1 hash in bytes) that you need to map into three digits of base 36. I would simply make a BigInteger from the hash bytes, take it modulo 36^3, and return the string in base 36.
import java.math.BigInteger;

public static final BigInteger N36POW3 = new BigInteger("" + 36 * 36 * 36);

public static String threeDigitBase36(byte[] bs) {
    return new BigInteger(bs).mod(N36POW3).toString(36);
}
// ...
threeDigitBase36(sha1("foo")); // => "96b"
threeDigitBase36(sha1("bar")); // => "y4t"
threeDigitBase36(sha1("bas")); // => "p55"
threeDigitBase36(sha1("zip")); // => "ej8"
Of course there will be collisions, as when you map any space into a smaller one, but the entropy should be better than something even sillier than the above solution.
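The snippet above leaves sha1 undefined and can return fewer than three characters when the reduced value is small. A self-contained sketch that fills both gaps might look as follows; the sha1 helper (UTF-8 input, MessageDigest) and the zero-padding are assumptions added here, not part of the original answer:

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class Reduction {
    static final BigInteger N36POW3 = BigInteger.valueOf(36 * 36 * 36);

    // Hypothetical helper: SHA-1 digest of the UTF-8 bytes of s.
    static byte[] sha1(String s) {
        try {
            return MessageDigest.getInstance("SHA-1").digest(s.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    // Reduces a digest to exactly three base-36 characters, left-padding with '0'.
    static String threeDigitBase36(byte[] digest) {
        String s = new BigInteger(digest).mod(N36POW3).toString(36);
        while (s.length() < 3) {
            s = "0" + s;
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(threeDigitBase36(sha1("foo")));
    }
}
```

For a rainbow table you would normally also mix the chain position into the reduction so that different columns use different reduction functions.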
Applying the KISS principle:
An SHA is just a String
The JDK hashcode for String is "random enough"
Integer can render in any base
This single line of code does it:
public static String shortHash(String sha) {
    return Integer.toString(sha.hashCode() & 0x7FFFFFFF, 36).substring(0, 3);
}
Note: The & 0x7FFFFFFF is to zero the sign bit (hash codes can be negative numbers, which would otherwise render with a leading minus sign).
Edit - Guaranteeing hash length
My original solution was naive - it didn't deal with the case when the int hash is less than 100 (base 36) - meaning it would print less than 3 chars. This code fixes that, while still keeping the value "random". It also avoids the substring() call, so performance should be better.
static int min = Integer.parseInt("100", 36);
// number of values from "100" to "zzz", inclusive
static int range = Integer.parseInt("zzz", 36) - min + 1;

public static String shortHash(String sha) {
    return Integer.toString(min + (sha.hashCode() & 0x7FFFFFFF) % range, 36);
}
This code guarantees the final hash has 3 characters by forcing it to be between 100 and zzz - the lowest and highest 3-char hash in base 36, while still making it "random".
