I have to implement a hash function, and here is my hash function (the first draft version, that is):
public int hashCode() {
    // use only the last four digits of the ISBN, since those are the ones that vary
    String fixedISBN = getIsbn().toString().replace("-", "");
    fixedISBN = fixedISBN.substring(fixedISBN.length() - 4);
    int ISBN = Integer.parseInt(fixedISBN);

    // sum the character values of the title
    int ASCII = 0;
    for (int i = 0; i < getTitle().toString().length(); i++) {
        ASCII += getTitle().toString().charAt(i);
    }

    int hashValue = (ISBN * 37 + ASCII * 23);
    return hashValue;
}
I am meant to hash books, and to do so I initially thought to use the ISBN of a book, which serves as a wholly unique identifier for every book. Then I looked at the list of ISBNs and saw that using the entire ISBN wouldn't help much, since there isn't a lot of variation between the ISBN numbers. So I use only the last four digits of the ISBN, since those are the ones that tend to vary. I also plan to use the ASCII values of the title's characters in my hashValue, but I believe a problem arises there: ASCII values only go up to 127, so a short title is a problem. A title of 8 characters or fewer, say, produces a sum of at most 1016. If the table size is very large, say 10,007, that won't produce a very even spread. Is there any way I could make the ASCII values more suitable for producing a hash value for a large table?
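One common way to spread character values over a larger range is to weight each character by its position rather than just summing them, which is the same idea Java's own String.hashCode uses. A minimal sketch, with an illustrative method name and the 10,007-bucket table mentioned above:

int titleHash(String title, int tableSize) {
    int h = 0;
    for (int i = 0; i < title.length(); i++) {
        h = 31 * h + title.charAt(i);     // positional weighting instead of a plain sum
    }
    return (h & 0x7fffffff) % tableSize;  // clear the sign bit so the index is non-negative
}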
Reading Algorithms, fourth edition, by Robert Sedgewick and Kevin Wayne, I found the following question:
Hash attack: find 2^N strings, each of length 2^N, that have the same hashCode() value, supposing that the hashCode() implementation for String is the following:
public int hashCode() {
int hash = 0;
for (int i = 0; i < length(); i++)
hash = (hash * 31) + charAt(i);
return hash;
}
Strong hint: Aa and BB have the same value.
What comes to my mind is generating all possible strings of length 2^N and comparing their hashCodes. This, however, is very expensive for large N and I doubt it's the intended solution.
Can you give me a hint about what I am missing in the whole picture?
Andreas' and Glains' answers are both correct, but they aren't quite what you need if your goal is to produce 2^N distinct strings of length 2N.
Rather, a simpler approach is to build strings consisting solely of concatenated sequences of Aa and BB. For length 2×1 you have { Aa, BB }; for length 2×2 you have { AaAa, AaBB, BBAa, BBBB }; for length 2×3 you have { AaAaAa, AaAaBB, AaBBAa, AaBBBB, BBAaAa, BBAaBB, BBBBAa, BBBBBB }; and so on.
(Note: you've quoted the text as saying the strings should have length 2^N. I'm guessing that you misquoted, and it's actually asking for length 2N; but if it is indeed asking for length 2^N, then you can simply drop elements as you proceed.)
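A hedged sketch of that construction (the method name is illustrative): the i-th string is built from the bits of i, appending "Aa" for a 0 bit and "BB" for a 1 bit, which yields 2^n strings of length 2n that all share the same hashCode().

static String[] collidingStrings(int n) {
    String[] result = new String[1 << n];          // 2^n strings
    for (int i = 0; i < result.length; i++) {
        StringBuilder sb = new StringBuilder();
        for (int bit = 0; bit < n; bit++) {
            sb.append(((i >> bit) & 1) == 0 ? "Aa" : "BB");
        }
        result[i] = sb.toString();                 // length 2n
    }
    return result;
}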
"Strong hint" explained.
Strong hint: Aa and BB have the same value.
In ASCII / Unicode, B has a value 1 higher than A. Since these are the second-to-last characters, that value is multiplied by 31, so the hash code increases by 31 when you change xxxxAa to xxxxBa.
To offset that, you need the last character to change by −31. Lowercase letters are 32 higher than uppercase letters, so changing a to A is −32, and then moving one letter up to B makes it −31 overall.
So, to get the same hash code, change the second-to-last letter to the next letter (e.g. A to B), and change the last letter from lowercase to the next uppercase letter (e.g. a to B).
You can now use that hint to generate up to 26 strings with the same hash code.
Let's take a look at the hashCode() implementation and the given hint:
public int hashCode() {
int hash = 0;
for (int i = 0; i < length(); i++)
hash = (hash * 31) + charAt(i);
return hash;
}
We know that Aa and BB produce the same hash and we can easily verify that:
(65 * 31) + 97 = 2112
(66 * 31) + 66 = 2112
From here on, the hash is the same for both inputs. Because of that, we can append any sequence of characters to both strings and they will always produce the same value.
One example could be:
hashCode("AaTest") = 1953079538
hashCode("BBTest") = 1953079538
So you can generate as many colliding strings as you need by appending the same sequence of characters to both prefixes; more formally:
hashCode("Aa" + x) = hashCode("BB" + x)
Another note on your idea of generating all possible strings and searching for duplicates: have a look at the birthday paradox and you will see that it takes far fewer attempts than you might expect to find duplicate hash values for different inputs.
In general, a good hash also makes it very difficult to recover the original input from the hash value (indeed, you would have to try all possible inputs if the hash algorithm is good).
Duplicate hash values are rare (there have to be duplicates, since the hash has a fixed length), and when a duplicate is found it should be meaningless (random-looking characters), so it cannot be abused by an attacker.
Taking a closer look at the hash function, it works like a positional number system (e.g. hexadecimal) where the weight of each digit is a power of 31. That is, think of the string as a number written in base 31, so the final hash code is something like hashCode = 31^(n-1) * first-char + 31^(n-2) * second-char + ... + 31^0 * last-char.
The second observation is that the ASCII distance between a capital letter and the corresponding small letter is 32 = 31 + 1. In terms of this base-31 view, replacing a capital letter by its small letter adds one unit to the next-higher digit (a carry of 31) plus one to the current digit. For example:
BB = 31*(B) + 31^0*(B), which also equals 31*(B − 1) + 31^0*(31 + B); notice that I have just taken one unit from the higher digit and added 31 to the lower digit without changing the overall value. That last expression equals 31*(A) + (a), i.e. Aa.
So, to generate all of the possible strings with a given hash code, start with the initial string and sweep from right to left, swapping adjacent "digits" by moving 31 from one to the other where applicable: turn ...Aa... into ...BB... (add one to the left digit, subtract 31 from the right one) or vice versa. Each such swap is a constant-time change and leaves the hash code unchanged.
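A hedged sketch of that swap (the method name and the choice of adjacent positions are illustrative):

// Move one "unit" between adjacent base-31 digits: +1 at position i-1 is worth
// +31 relative to position i, so -31 at position i keeps hashCode() unchanged.
static String shiftPair(String s, int i) {
    char[] c = s.toCharArray();
    c[i - 1] += 1;    // e.g. 'A' -> 'B'
    c[i] -= 31;       // e.g. 'a' -> 'B'
    return new String(c);
}

For example, shiftPair("AaTest", 1) returns "BBTest", and both strings have the same hashCode().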
Hope this helps.
What is the maximum Unicode value of a char in Java (in particular in the Netbeans IDE, if that makes any difference)? I've been trying to write a program that, as part of the program, multiplies a char by a random number. According to what I've heard about the maximum Unicode value, I should be able to multiply the highest-value char I'm using (the tilde) by at least 8000 without causing overflow; however, overflow does occur in my program. Is there a difference between the maximum Unicode char value and the maximum that is available in Netbeans? In case that isn't the case, I have included my code below:
EDIT: What I want to do with this portion of the program is "encrypt" the password by multiplying each char by a random number. I also included a separate section meant to "decrypt" that output, and when testing with smaller numbers I found that that part worked.
public static void main(String[] args) {
    String pass = "Password";
    String pwE = "";
    int key[] = new int[pass.length()];
    // generate a random multiplier for each character
    for (int i = 0; i < pass.length(); i++)
    {
        key[i] = (int)(Math.random()*8000+1); /*EDIT changed the placeholder to the actual function I'm using */
        System.out.println(key[i]);
    }
    // "encrypt": multiply each character by its key and cast back to char
    for (int i = 0; i < pass.length(); i++)
    {
        pwE += (char)(pass.charAt(i)*key[i]);
    }
    System.out.println(pwE);
    pass = "";
    // "decrypt": divide each character by its key
    for (int i = 0; i < pwE.length(); i++)
    {
        pass += (char)(pwE.charAt(i)/key[i]);
    }
    System.out.println(pass);
}
"Is there a difference between the maximum Unicode char value and the maximum that is available in Netbeans [sic]?"
No, of course not. NetBeans doesn't have its own private, non-compliant version of Java. The maximum value of a char is always Character.MAX_VALUE, as documented.
http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#MAX_VALUE
Your problem is very likely caused by your use of String to drive "encryption" and "decryption". You don't bother to control the string encoding, and that could conceivably create strangeness with respect to surrogate pairs and the like. You're mixing the numeric nature of char with String's use of the type to represent characters.
Since you didn't bother to share inputs, expected outputs, and actual outputs with us, we can only guess. Perhaps if you were to share sufficient information ...
A char is a 16 bit unsigned type in Java.
Its maximum value is 65535.
Your multiplication of a char by an element of key looks suspect to me. Your casting this result (which will be an int type) back to char causes wraparound modulo 65536.
Your suspecting Netbeans is a red herring.
Very crudely, if your string only uses ASCII characters, then a maximum multiplier of 512 would work.
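A small demonstration of the wraparound, using the tilde (the highest character in the password above):

public class WrapDemo {
    public static void main(String[] args) {
        char c = '~';                       // 126
        int product = c * 8000;             // 1008000, fits fine in an int
        char wrapped = (char) product;      // a char only holds 0..65535, so this wraps
        System.out.println(product);        // 1008000
        System.out.println((int) wrapped);  // 24960 = 1008000 % 65536; the original is gone
    }
}

Keeping the products in an int[] instead of casting them back to char would avoid the wraparound entirely.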
Ok, I have a project that requires me to have a dynamic hash table that counts the frequency of words in a file. I must use Java; however, we are not allowed to use any built-in data types or built-in classes at all, except standard arrays. Also, I am not allowed to use any hash functions off the internet that are known to be fast; I have to write my own hash function. Lastly, my instructor wants my table to start at size 1 and double in size every time a new key is added.
My first idea was to sum the ASCII values of the letters composing a word and use that as the hash, but different words made up of the same letters would then hash to the same value.
How can I get started? Is the ASCII idea on the right track?
In general, a hash table isn't expected to have a one-to-one mapping between a value and a hash; a hash table is expected to have collisions. That is, the domain of the hash function is expected to be larger than its range (the set of hash values). However, the general idea is to come up with a hash function where the probability of collision is drastically small. If your hash function is uniform, i.e., designed so that each possible hash value has the same probability of being generated, then you minimize collisions this way.
Getting a collision isn't the end of the world. That just means that you have to search the list of values for that hash. If your hashing function is good, overall your performance for lookup should still be O(1).
Generating hashing functions is a subject of its own, and there is no one answer. But a good place for you to start could be to work with the bitwise representations of the characters in the string, and perform some sort of convolution operations on them (rotate, shift, XOR) in series. You could perform these in some way based on some initial seed-value, and then use the output of the first step of hashing as a seed for the next step. This way you can end up magnifying the effects of your convolution.
For example, say you get the character A, which is 41 in hex, or 0100 0001 in binary. You could designate each bit to mean some operation (maybe bit 0 is a ROR when it is 0 and a ROL when it is 1; bit 1 is an OR when it is 0 and an XOR when it is 1, etc.). You could even decide how much convolution to do based on the value itself: for example, the lower nibble could specify how much right-rotation you do, and the upper nibble how much left-rotation. Then once you have the final value, use it as the seed for the next character. These are just some ideas; use your imagination and see what you get!
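A hedged sketch along those lines (the seed and the exact mixing steps are arbitrary choices, and the rotations are written out by hand since the assignment forbids built-in classes):

// Toy mixing hash: each character's nibbles decide how far the running value
// is rotated left and right, and the character itself is then folded in.
static int mixHash(char[] word) {
    int h = 0x12345678;                           // arbitrary non-zero seed
    for (int i = 0; i < word.length; i++) {
        char c = word[i];
        int left = (c >> 4) & 0xF;                // upper nibble: left-rotation amount
        int right = c & 0xF;                      // lower nibble: right-rotation amount
        h = (h << left) | (h >>> (32 - left));    // rotate left by hand
        h = (h >>> right) | (h << (32 - right));  // rotate right by hand
        h ^= c;                                   // fold the character in
    }
    return h;
}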
No matter how good your hash function is, you will always have collisions that you need to resolve.
If you want to keep your approach of using the ASCII values of the characters, you shouldn't just add the values; that would lead to a lot of collisions. Instead, weight the values by powers of a base. For example, for the word "Help" you would compute 'H' * 256^0 + 'e' * 256^1 + 'l' * 256^2 + 'p' * 256^3. Or, written as a method:
int hash(String word, int hashSize) {
    int res = 0;
    int count = 0;
    for (int i = 0; i < word.length(); i++) {
        // weight each character by a power of 256; the exponent cycles 0..3
        // because 256^4 no longer fits in an int
        res += word.charAt(i) * (1 << (8 * count));
        count = (count + 1) % 4;
    }
    // clear the sign bit so the bucket index is never negative
    return (res & 0x7fffffff) % hashSize;
}
Now you just have to write your own hash table. This sketch doubles the array whenever a new key is added and resolves collisions with linear probing:

class WordCounterMap {
    Entry[] entrys = new Entry[1];

    // hash(String, int) is the method defined above; put it in this class
    void add(String s) {
        int hash = hash(s, entrys.length);
        // if the word is already in the table, just bump its count
        while (entrys[hash] != null) {
            if (entrys[hash].word.equals(s)) {
                entrys[hash].count++;
                return;
            }
            hash = (hash + 1) % entrys.length;   // linear probing
        }
        // new key: double the table and rehash the existing entries
        Entry[] temp = new Entry[entrys.length * 2];
        for (Entry e : entrys) {
            if (e != null) {
                int h = hash(e.word, temp.length);
                while (temp[h] != null) {
                    h = (h + 1) % temp.length;
                }
                temp[h] = e;
            }
        }
        entrys = temp;
        // insert the new word into the enlarged table
        hash = hash(s, entrys.length);
        while (entrys[hash] != null) {
            hash = (hash + 1) % entrys.length;
        }
        entrys[hash] = new Entry(s);
    }

    int getCount(String s) {
        int hash = hash(s, entrys.length);
        while (entrys[hash] != null) {
            if (entrys[hash].word.equals(s)) {
                return entrys[hash].count;
            }
            hash = (hash + 1) % entrys.length;   // keep probing until an empty slot
        }
        return 0;                                // the word was never added
    }
}

class Entry {
    int count;
    String word;

    Entry(String s) {
        this.word = s;
        count = 1;
    }
}
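For illustration, assuming the sketch above, usage would look like this:

public class Demo {
    public static void main(String[] args) {
        WordCounterMap counts = new WordCounterMap();
        counts.add("the");
        counts.add("quick");
        counts.add("the");
        System.out.println(counts.getCount("the"));   // prints 2
        System.out.println(counts.getCount("fox"));   // prints 0
    }
}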
I was going through HashMap and read the following analysis ..
An instance of HashMap has two parameters that affect its performance: initial capacity and load factor.
The capacity is the number of buckets in the hash table, and the initial capacity is simply the capacity at the time the hash table is created.
The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased.
When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, internal data structures are rebuilt) so that the hash table has approximately twice the number of buckets.
The default initial capacity is 16, the default load factor is 0.75. You can supply other values in the map's constructor.
Now suppose I have a map..
HashMap map = new HashMap(); // HashMap key random order.
System.out.println("Amit".hashCode());
map.put("Amit", "Java");
map.put("mAit", "J2EE");
map.put("Saral", "J2rrrEE");
I want a collision to occur; please advise me on how a collision could occur here.
I believe the exact HashMap behavior is implementation dependent. Just look at how your class library does the hashing and construct a collision accordingly. It's pretty simple.
If you want collisions on arbitrary objects instead of strings, it's a lot easier. Just create a class with a custom hashCode() that always returns 0.
If you really want a collision to occur, it's better to write your own custom hash code. For example, if you want a collision for Amit and mAit, just use the sum of the ASCII values of the characters as the hash code; you will then get collisions for different keys.
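A hedged sketch of that suggestion (the class name is illustrative): a wrapper key whose hashCode() is just the sum of the characters, so the anagrams "Amit" and "mAit" are guaranteed to land in the same bucket.

final class SumKey {
    final String s;
    SumKey(String s) { this.s = s; }

    @Override
    public int hashCode() {
        int sum = 0;
        for (int i = 0; i < s.length(); i++) {
            sum += s.charAt(i);        // "Amit" and "mAit" both sum to 395
        }
        return sum;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof SumKey && ((SumKey) o).s.equals(s);
    }
}

Putting new SumKey("Amit") and new SumKey("mAit") into a HashMap would then produce the collision you are after.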
A collision will happen when two keys have the same hash. I didn't calculate the hashes of your keys, but I don't think they are the same, so no collision will occur unless they share a hash value.
If you put the very same string in as a key, you will of course get the same hash value.
Collision here is definitely possible and not tied to hash table implementation.
HashMap works internally by using Object.hashCode to map objects to buckets, and then uses a collision resolution mechanism (the OpenJDK implementation uses separate-chaining) with Object.equals.
To answer your question, String.hashCode is well-defined for compatibility...
Returns a hash code for this string. The hash code for a String object is computed as
s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
using int arithmetic, where s[i] is the i-th character of the string, n is the length of the string, and ^ indicates exponentiation. (The hash value of the empty string is zero.)
Or, in code (from OpenJDK)
public int hashCode() {
int h = hash;
if (h == 0 && count > 0) {
int off = offset;
char val[] = value;
int len = count;
for (int i = 0; i < len; i++) {
h = 31*h + val[off++];
}
hash = h;
}
return h;
}
As with any hash function, collisions are possible. The Wikipedia article notes, for example, that "FB" and "Ea" result in the same value.
If you want more, it is a trivial brute-force problem to find further collisions; a sketch follows.
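For example (class and variable names are illustrative), this brute force over two-character strings prints pairs such as Aa / BB and Ea / FB:

import java.util.HashMap;
import java.util.Map;

public class TwoCharCollisions {
    public static void main(String[] args) {
        Map<Integer, String> seen = new HashMap<>();
        for (char a = 'A'; a <= 'z'; a++) {
            for (char b = 'A'; b <= 'z'; b++) {
                String s = "" + a + b;
                String previous = seen.put(s.hashCode(), s);   // remember the last string per hash
                if (previous != null) {
                    System.out.println(previous + " collides with " + s);
                }
            }
        }
    }
}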
As a side note, I thought I'd point out how similar this is to the hash function in the second edition of The C Programming Language:
#define HASHSIZE 100
unsigned hash(char *s)
{
unsigned hashval;
for(hashval = 0; *s != '\0'; s++)
hashval = *s + 31 * hashval;
return hashval % HASHSIZE;
}
I need to generate a unique integer id for a string.
Reason:
I have a database application that can run on different databases. These databases contain parameters with parameter types that are generated from external XML data.
The current situation is that I use the ordinal number of the enum, but when a parameter type is inserted or removed, the ordinals get mixed up:
(FOOD = 0 , TOYS = 1) <--> (FOOD = 0, NONFOOD = 1, TOYS = 2)
The number of parameter types is between 200 and 2000, so I am a bit wary of using hashCode() on a String.
P.S.: I am using Java.
Thanks a lot
I would use a mapping table in the database to map these strings to an auto-increment value. These mappings should then be cached in the application.
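A hedged sketch of that idea, with the database sequence stood in for by a simple counter (class and method names are illustrative; in a real application nextId would come from the auto-increment column of the mapping table):

import java.util.HashMap;
import java.util.Map;

class ParameterTypeIds {
    private final Map<String, Integer> cache = new HashMap<>();
    private int nextId = 0;    // stand-in for the database sequence

    synchronized int idFor(String parameterType) {
        return cache.computeIfAbsent(parameterType, k -> nextId++);
    }
}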
Use a cryptographic hash. MD5 would probably be sufficient and relatively fast, and it will be unique enough for your set of inputs.
How can I generate an MD5 hash?
The only problem is that the hash is 128 bits, so a standard 64-bit integer won't hold it.
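If you go that route, one option is to keep only part of the digest; a hedged sketch (note that truncating to 64 bits reintroduces a theoretical collision risk, though it is negligible for a couple of thousand strings):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

static long md5ToLong(String s) throws NoSuchAlgorithmException {
    byte[] digest = MessageDigest.getInstance("MD5")
                                 .digest(s.getBytes(StandardCharsets.UTF_8));
    long id = 0;
    for (int i = 0; i < 8; i++) {
        id = (id << 8) | (digest[i] & 0xFF);   // fold the first 64 bits of the digest into a long
    }
    return id;
}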
If you need to be absolutely certain that the ids are unique (no collisions), your strings are up to 32 characters, and your number must have no more than 10 digits (roughly 32 bits), then you obviously cannot do it with a one-way function id = F(string).
The natural way is to keep a mapping of the strings to unique numbers (typically a sequence), either in the DB or in the application.
If you know the structure of the string values (length, letter patterns), you can enumerate all strings in that set; if the total count fits within 32 bits, the position of a string in that enumeration is your integer value.
Otherwise, the string itself is your integer value (an integer in the mathematical sense, not a Java int).
By Enum do you mean a Java enum? Then you could give each enum value a unique int yourself, instead of relying on its ordinal number:

public enum MyEnum {
    FOOD(0),
    TOYS(1);

    private final int id;

    private MyEnum(int id) {
        this.id = id;
    }

    public int getId() {
        return id;
    }
}
I came across this post that's sensible: How to convert string to unique identifier in Java
In it the author describes his implementation:
public static long longHash(String string) {
    long h = 98764321261L;              // large odd seed instead of 0
    int l = string.length();
    char[] chars = string.toCharArray();
    for (int i = 0; i < l; i++) {
        h = 31 * h + chars[i];          // same 31-polynomial as String.hashCode, but in 64 bits
    }
    return h;
}