How to implement own hashing function for strings?

So this is the default algorithm that generates the hashcode for Strings:
s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
However, I wanna use something different and much more simple like adding the ASCII values of each character and then adding them all up.
How do I make it so that it uses the algorithm I created, instead of using the default one when I use the put() method for hashtables?
As of now I don't know what to do other than implementing a hash table from scratch.

Create a new class, and use String type field in it. For example:
import java.util.Objects;

public class MyString {
    private final String value;

    public MyString(String value) {
        this.value = value;
    }

    public String getValue() {
        return value;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        MyString myString = (MyString) o;
        return Objects.equals(value, myString.value);
    }

    @Override
    public int hashCode() {
        // Use your own implementation; here, the sum of the string's code points.
        return value.codePoints().sum();
    }
}
Add the equals() and hashCode() methods with the @Override annotation.
Note: here hashCode() sums the string's Unicode code points, which equal the ASCII values when every character is ASCII.
After that, you will be able to use new class objects in the desired data structure. Here you can find a detailed explanation of these methods and a contract between equals() and hashCode().
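As a rough sketch of how such a wrapper could then be used as a key, the example below puts a MyString-style key into a java.util.Hashtable. The class name MyStringDemo and the sample key and value are made up for illustration; the wrapper is repeated in compact form so that the example is self-contained.

```java
import java.util.Hashtable;
import java.util.Map;
import java.util.Objects;

class MyStringDemo {
    // Compact copy of the wrapper: hashCode() is the sum of the code points.
    static final class MyString {
        private final String value;

        MyString(String value) {
            this.value = value;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (o == null || getClass() != o.getClass()) return false;
            return Objects.equals(value, ((MyString) o).value);
        }

        @Override
        public int hashCode() {
            return value.codePoints().sum();
        }
    }

    public static void main(String[] args) {
        Map<MyString, Integer> table = new Hashtable<>();
        // put() calls MyString.hashCode(), not String.hashCode()
        table.put(new MyString("abc"), 1);

        System.out.println(new MyString("abc").hashCode()); // 97 + 98 + 99 = 294
        System.out.println(table.get(new MyString("abc"))); // 1
    }
}
```

The table never sees the wrapped String's own hashCode(); it only calls the methods of the key type, which is why wrapping is enough and no custom hash table is needed.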

However, I wanna use something different and much more simple like adding the ASCII values of each character and then adding them all up.
This is an extremely bad idea if you care at all about hash table efficiency. What you're thinking of as an overly-complicated hashing function is actually designed to give a uniform distribution of hash values throughout the entire 32-bit (or whatever) range. That gives the best possibility of uniformly distributing the hash keys (after you mod by the hash table size) in your buckets.
Your simple method of adding up the ASCII values of the individual characters has multiple flaws. First, you're limited in the range of values you can reasonably expect to generate. If each character is an 8-bit value, the highest sum you can create is 255*n, where n is the length of the key. If your key is 10 characters long, the sum lies between 0 and 2,550, so you can't possibly generate more than 2,551 distinct hash values. But there are 256^10 possible 10-character strings. Your collision rate will be very high.
The second problem is that anagrams generate the same hash value. "stop," "spot," and "tops" all generate the same hash value and will hash to the same bucket. Again, this will greatly affect your collision rate.
It's unclear to me why you want to replace the hashing function. If you're thinking it will result in better performance, you should think again. Sure, it will make generating the hash value faster, but it will result in very skewed key distribution, and correspondingly terrible hash table performance.
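The anagram problem is easy to check with a small sketch. The helper charSum below is a hypothetical stand-in for the "add up the character values" idea from the question; it is compared against the built-in String.hashCode().

```java
class AnagramCollisionDemo {
    // Hypothetical helper: adds up the char values of the string.
    static int charSum(String s) {
        int sum = 0;
        for (int i = 0; i < s.length(); i++) {
            sum += s.charAt(i);
        }
        return sum;
    }

    public static void main(String[] args) {
        for (String word : new String[] {"stop", "spot", "tops"}) {
            // The sum is identical for all three words (454), because addition
            // ignores character order; String.hashCode() differs for each.
            System.out.println(word + ": sum=" + charSum(word)
                    + ", String.hashCode()=" + word.hashCode());
        }
    }
}
```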

Related

Is there any chance for the hash codes of two different objects of being same? [duplicate]

In Java, obj.hashCode() returns some value. What is the use of this hash code in programming?

hashCode() is used for bucketing in hash-based implementations like HashMap, Hashtable, HashSet, etc.
The value received from hashCode() is used to compute the bucket number for storing elements of the set/map. This bucket number acts as the address of the element inside the set/map.
When you do contains() it will take the hash code of the element, then look for the bucket where hash code points to. If more than 1 element is found in the same bucket (multiple objects can have the same hash code), then it uses the equals() method to evaluate if the objects are equal, and then decide if contains() is true or false, or decide if element could be added in the set or not.

From the Javadoc:
Returns a hash code value for the object. This method is supported for the benefit of hashtables such as those provided by java.util.Hashtable.
The general contract of hashCode is:
Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified. This integer need not remain consistent from one execution of an application to another execution of the same application.
If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result.
It is not required that if two objects are unequal according to the equals(java.lang.Object) method, then calling the hashCode method on each of the two objects must produce distinct integer results. However, the programmer should be aware that producing distinct integer results for unequal objects may improve the performance of hashtables.
As much as is reasonably practical, the hashCode method defined by class Object does return distinct integers for distinct objects. (This is typically implemented by converting the internal address of the object into an integer, but this implementation technique is not required by the Java programming language.)

hashCode() is a function that takes an object and outputs a numeric value. The hashcode for an object is always the same if the object doesn't change.
Classes like HashMap, Hashtable, HashSet, etc. that need to store objects will use the hashCode modulo the size of their internal array to choose in what "memory position" (i.e. array position) to store the object.
There are some cases where collisions may occur (two objects end up with the same hashcode), and that, of course, needs to be solved carefully.
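The "hash code modulo array size" idea can be sketched as follows. This is a simplified illustration, not how any particular JDK class is implemented; the method name bucketIndex and the capacity of 16 are assumptions made for the example.

```java
class BucketIndexDemo {
    // Simplified bucket selection: hash code modulo the number of buckets.
    // Math.floorMod keeps the result non-negative even for negative hash codes.
    static int bucketIndex(Object key, int capacity) {
        return Math.floorMod(key.hashCode(), capacity);
    }

    public static void main(String[] args) {
        // "Aa" and "BB" are different strings with the same hash code (2112),
        // so they always land in the same bucket: a collision.
        System.out.println(bucketIndex("Aa", 16)); // 0
        System.out.println(bucketIndex("BB", 16)); // 0
    }
}
```

When two keys land in the same bucket, the collection falls back on equals() to tell them apart.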

The value returned by hashCode() is the object's hash code, which is the object's memory address in hexadecimal.
By definition, if two objects are equal, their hash code must also be equal. If you override the equals() method, you change the way two objects are equated and Object's implementation of hashCode() is no longer valid. Therefore, if you override the equals() method, you must also override the hashCode() method as well.
This answer is from the official Java SE 8 tutorial documentation.

A hashcode is a number generated from any object.
This is what allows objects to be stored/retrieved quickly in a Hashtable.
Imagine the following simple example:
On the table in front of you, you have nine boxes, each marked with a number 1 to 9. You also have a pile of wildly different objects to store in these boxes, but once they are in there you need to be able to find them as quickly as possible.
What you need is a way of instantly deciding which box you have put each object in. It works like an index. You decide to find the cabbage, so you look up which box the cabbage is in, then go straight to that box to get it.
Now imagine that you don't want to bother with the index, you want to be able to find out immediately from the object which box it lives in.
In the example, let's use a really simple way of doing this - the number of letters in the name of the object. So the cabbage goes in box 7, the pea goes in box 3, the rocket in box 6, the banjo in box 5 and so on.
What about the rhinoceros, though? It has 10 characters, so we'll change our algorithm a little and "wrap around" so that 10-letter objects go in box 1, 11 letters in box 2 and so on. That should cover any object.
Sometimes a box will have more than one object in it, but if you are looking for a rocket, it's still much quicker to compare a peanut and a rocket, than to check a whole pile of cabbages, peas, banjos, and rhinoceroses.
That's a hash code: a way of getting a number from an object so it can be stored in a Hashtable. In Java, a hash code can be any integer, and each object type is responsible for generating its own. Look up the "hashCode" method of Object.
Source - here
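The box analogy above translates almost directly into code. The sketch below is only an illustration of that analogy; the method name boxFor is made up, and it assumes a non-empty name.

```java
class BoxDemo {
    // Box number = number of letters, wrapping around after 9
    // so that 10 letters go to box 1, 11 letters to box 2, and so on.
    static int boxFor(String name) {
        return (name.length() - 1) % 9 + 1;
    }

    public static void main(String[] args) {
        for (String name : new String[] {"cabbage", "pea", "rocket", "banjo", "rhinoceros"}) {
            System.out.println(name + " -> box " + boxFor(name));
        }
        // cabbage -> box 7, pea -> box 3, rocket -> box 6,
        // banjo -> box 5, rhinoceros -> box 1
    }
}
```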

Although the hash code has nothing to do with your business logic, we have to take care of it in most cases, because when your object is put into a hash-based container (HashSet, HashMap...), the container uses the element's hash code to store and retrieve it.

hashCode() returns an integer code for an object; the default implementation in Object is provided by the JVM, and the code is not guaranteed to be unique.
We use hashCode() in hashing-based data structures like Hashtable, HashMap, etc.
The advantage of hashCode() is that it makes searching easy: when we search for an object, its hash code narrows down where to look for it.
But we can't say hashCode() is the address of an object. By default, it is a code generated by the JVM for the object.
That is why hashing is such a popular lookup technique nowadays.

One of the uses of hashCode() is building a caching mechanism.
Look at this example:
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

class Point
{
    public int x, y;

    public Point(int x, int y)
    {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object o)
    {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        Point point = (Point) o;
        if (x != point.x) return false;
        return y == point.y;
    }

    @Override
    public int hashCode()
    {
        int result = x;
        result = 31 * result + y;
        return result;
    }
}

class Line
{
    public Point start, end;

    public Line(Point start, Point end)
    {
        this.start = start;
        this.end = end;
    }

    @Override
    public boolean equals(Object o)
    {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        Line line = (Line) o;
        if (!start.equals(line.start)) return false;
        return end.equals(line.end);
    }

    @Override
    public int hashCode()
    {
        int result = start.hashCode();
        result = 31 * result + end.hashCode();
        return result;
    }
}

class LineToPointAdapter implements Iterable<Point>
{
    private static int count = 0;
    // Cache keyed by the line's hash code
    private static Map<Integer, List<Point>> cache = new HashMap<>();
    private int hash;

    public LineToPointAdapter(Line line)
    {
        hash = line.hashCode();
        if (cache.get(hash) != null) return; // we already have it
        System.out.println(
            String.format("%d: Generating points for line [%d,%d]-[%d,%d] (no caching)",
                ++count, line.start.x, line.start.y, line.end.x, line.end.y));
        // The original example is cut off here. As a placeholder assumption,
        // an empty list is stored so that later lookups for this line hit the
        // cache; the actual point generation is not shown in the source.
        cache.put(hash, new ArrayList<>());
    }

    @Override
    public Iterator<Point> iterator()
    {
        return cache.get(hash).iterator();
    }
}

data members to consider while overriding hashcode and equals

I know that, per the contract, we need to override hashCode when equals is overridden.
Why should I use the same fields for computing the hash code as for the equals comparison?
Is it to improve performance, by avoiding too many objects mapping to the same bucket, as in the case below?
That is, all objects created on the same "date" would map to the same bucket, and the linear comparison using equals() to check whether an object exists would take time?
If my statement above is true, what other potential issues will come with the code below, other than the performance issue? Is that the only reason we should use the same fields / members used in equals to compute the hash code? Please share. Thanks.
class MyClass {
    int date;
    int pay;
    int id;

    @Override
    public boolean equals(Object o) {
        // null and same-class check
        if (o == null || getClass() != o.getClass()) return false;
        MyClass obj = (MyClass) o;
        return (date == obj.date && pay == obj.pay && id == obj.id);
    }

    @Override
    public int hashCode() {
        int hash = 7;
        return (31 * hash + date);
    }
}
//please pardon syntax errors, I typed without using ide.
***My intention is to use all fields in equals, to know why the same fields should be used in hashCode, and what happens if only a few of them are used.
Clarification:
With only "date" used to compute the hash code, the lookup still goes to the right bucket (do you agree?). Furthermore, I get the list of items in that bucket, and the collection will iterate over it to check whether a particular object exists using equals. And my definition of equals is "all fields must be the same". With this, I believe my code works fine, and I only see a performance issue. Please point out where I am wrong. Thank you

For your example, I suggest you use just id for equality and annotate the methods to show that they're overrides. Also, I like to override toString():
@Override
public boolean equals(Object o) {
    if (o instanceof MyClass) {
        return (id == ((MyClass) o).id);
    }
    return false;
}

@Override
public int hashCode() {
    return id;
}

@Override
public String toString() {
    return String.format("MyClass (id=%d, date=%d, pay=%d)", id, date, pay);
}
That way you can update the date and/or the pay without having to recreate the hash structure. Also, that's what appears to be unique about instances.

I found the answer in Effective Java, by Joshua Bloch, 2nd edition, page 49: "Do not be tempted to exclude significant parts of an object from the hash code computation to improve performance". The resulting poor-quality hash function may degrade hash tables' performance.
So my guess was right: multiple objects will map to the same bucket.
Additional information:
http://www.javaranch.com/journal/2002/10/equalhash.html
Since the class members/variables num and data do participate in the equals method comparison, they should also be involved in the calculation of the hash code. Though, this is not mandatory. You can use subset of the variables that participate in the equals method comparison to improve performance of the hashCode method. Performance of the hashCode method indeed is very important.
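To make the trade-off concrete, here is a small sketch in which equals() compares all three fields but hashCode() uses only date, as in the question. The constructor, the class name PartialHashDemo, and the sample values are assumptions added for illustration. The set still behaves correctly; the cost is that every element shares one hash code, so lookups degrade towards comparing against many elements.

```java
import java.util.HashSet;
import java.util.Set;

class PartialHashDemo {
    static final class MyClass {
        final int date;
        final int pay;
        final int id;

        // Constructor added for the example; the question's class has none.
        MyClass(int date, int pay, int id) {
            this.date = date;
            this.pay = pay;
            this.id = id;
        }

        @Override
        public boolean equals(Object o) {
            if (o == null || getClass() != o.getClass()) return false;
            MyClass obj = (MyClass) o;
            return date == obj.date && pay == obj.pay && id == obj.id;
        }

        @Override
        public int hashCode() {
            // Only date participates, as in the question.
            return 31 * 7 + date;
        }
    }

    public static void main(String[] args) {
        Set<MyClass> set = new HashSet<>();
        for (int id = 0; id < 1000; id++) {
            set.add(new MyClass(20240101, 100, id)); // same date, different id
        }
        // Correctness is preserved: all 1000 distinct objects are stored
        // and can be found again.
        System.out.println(set.size());                                  // 1000
        System.out.println(set.contains(new MyClass(20240101, 100, 42))); // true
        // But all of them share a single hash code, hence a single bucket.
        System.out.println(new MyClass(20240101, 100, 1).hashCode()
                == new MyClass(20240101, 100, 2).hashCode());            // true
    }
}
```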

Hash table Java insert

I am new to Java and I am trying to learn about hash tables. I want to insert objects into my hash table and then be able to print all the objects from the hash table at the end. I am not sure I am doing this right, because I have read that I need to override the get() method or the hashCode() method, but I am not sure why.
I am passing in String objects of student names. When I run the debugger after my inserts, it shows the key as "null" and the indexes of my inserts are at random places in the hash table. Ex. 1, 6, 10
This is how I have been adding. Can anyone tell me if this is correct and do I actually need to override things?
Thanks in advance!
CODE
Hashtable<String, String> hashTable = new Hashtable<String, String>();
hashTable.put("Donald", "Trump");
hashTable.put("Mike", "Myers");
hashTable.put("Jimmer", "Markus");

You are doing things correctly. Remember, a Hashtable is not a direct-access structure. You can't "get the third item from a Hashtable", for example. There is no real meaning to the term "index" when you're talking about a Hashtable: numerical indexes of items mean nothing.
A Hashtable guarantees that it will hold key-value pairs for you in a way that makes it very fast to look up a value based on its key (for example: given Donald, you will get Trump very quickly). Of course, certain conditions have to be fulfilled for this to work right, but for your simple String-to-String example, that works.
You should read more about hash tables in general, to see how they really work behind the scenes.
EDIT (as per OP's request): you are asking about storing Student instances in your Hashtable. As I mentioned above, certain conditions have to be addressed for a Hashtable to work correctly. Those conditions are concerning the key part, not the value part.
If your Student instance is the value, and a simple String is the key, then there's nothing special for you to do, because the String class already satisfies all of the conditions required for a proper Hashtable key.
If your Student instance is the key, then the following conditions must be met:
Inside Student, you must override the hashCode method in such a way that subsequent invocations of hashCode will return exactly the same value. In other words, the expression x.hashCode() == x.hashCode() must always be true.
Inside Student, you must override the equals method in such a way that it will only return true for two equal instances of Student, and return false otherwise. Any two instances that are equal must also return the same hashCode value.
These conditions are enough for Student to function as a proper Hashtable key. You can further optimize things by writing a better hashCode implementation (read about it... it's quite long to type in here), but as long as you satisfy the aforementioned two, you're good to go.
Example:
class Student {
    private String name;
    private String address;

    @Override
    public int hashCode() {
        // Assuming 'name' and 'address' are not null, for simplification here.
        return name.hashCode() + address.hashCode();
    }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof Student)) {
            return false;
        }
        if (other == this) {
            return true;
        }
        Student otherStudent = (Student) other;
        return name.equals(otherStudent.name) && address.equals(otherStudent.address);
    }
}
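A short usage sketch may help here: with hashCode and equals overridden as described, a second Student instance with the same field values finds the entry stored under the first. The constructor, the class name StudentKeyDemo, and the sample names, addresses and grades are invented for the example.

```java
import java.util.Hashtable;
import java.util.Map;

class StudentKeyDemo {
    static final class Student {
        private final String name;
        private final String address;

        // Constructor added for the example; fields are assumed non-null.
        Student(String name, String address) {
            this.name = name;
            this.address = address;
        }

        @Override
        public int hashCode() {
            return name.hashCode() + address.hashCode();
        }

        @Override
        public boolean equals(Object other) {
            if (!(other instanceof Student)) {
                return false;
            }
            Student otherStudent = (Student) other;
            return name.equals(otherStudent.name) && address.equals(otherStudent.address);
        }
    }

    public static void main(String[] args) {
        Map<Student, String> grades = new Hashtable<>();
        grades.put(new Student("Donald", "1 Main St"), "A");

        // A different instance with equal fields finds the same entry.
        System.out.println(grades.get(new Student("Donald", "1 Main St"))); // A
        // A student with a different address is a different key.
        System.out.println(grades.get(new Student("Donald", "2 Main St"))); // null
    }
}
```

Without the two overrides, the first lookup would return null as well, because Object's default equals() only matches the very same instance.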

Try this code:
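import java.util.Enumeration;
import java.util.Hashtable;

public class HashtableDemo {
    public static void main(String[] args) {
        Hashtable<String, String> hashTable = new Hashtable<String, String>();
        hashTable.put("Donald", "16 years old");
        hashTable.put("Mike", "20 years old");
        hashTable.put("Jimmer", "18 years old");

        // 'txt' was not declared in the original snippet; a StringBuilder is assumed here.
        StringBuilder txt = new StringBuilder();

        // Show all students in hash table.
        Enumeration<String> studentsNames = hashTable.keys();
        while (studentsNames.hasMoreElements()) {
            String str = studentsNames.nextElement();
            txt.append("\n" + str + ": " + hashTable.get(str));
        }
        System.out.println(txt);
    }
}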

HashSet Collisions in Java

I have a program for my Java class where I want to use HashSets to compare a directory of text documents. Essentially, my plan is to create a HashSet of strings for each paper, and then add two of the papers' HashSets together into one HashSet and find the number of identical 6-word sequences.
My question is, do I have to manually check for, and handle, collisions, or does Java do that for me?

Java's hash maps/sets automatically handle hash collisions; this is why it is important to override both the equals and the hashCode methods, as both of them are used by sets to tell duplicate entries from unique ones.
It is also important to note that hash collisions have a performance impact, since multiple objects then share the same hash.
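import java.util.Objects;

public class MyObject {
    private String name;

    // getters and setters
    public String getName() {
        return name;
    }

    @Override
    public int hashCode() {
        // Do some object-specific stuff to generate the hash code;
        // here it is derived from name, the field that equals() compares.
        return Objects.hashCode(name);
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) return true;
        if (obj instanceof MyObject) {
            return Objects.equals(this.name, ((MyObject) obj).getName());
        }
        return false;
    }
}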
Note: Standard Java Objects such as String have already implemented hashCode and equals so you only have to do that for your own kind of Data Objects.

I think you did not ask for hash collisions, right? The question is what happens when HashSet a and HashSet b are added into a single set e.g. by a.addAll(b).
The answer is that a will contain all elements and no duplicates. In the case of Strings, this means you can count the number of Strings common to both sets as (a.size() before the add) + b.size() - (a.size() after the add).
It does not even matter if some of the Strings have the same hash code but are not equal.
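The counting trick can be sketched like this. The method name commonCount and the sample strings are made up for the example; "Aa" and "BB" are included on purpose because they are unequal strings that share a hash code, which shows that such collisions do not affect the count.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class CommonCountDemo {
    // Number of elements present in both sets:
    // |a| + |b| - |a union b|
    static int commonCount(Set<String> a, Set<String> b) {
        Set<String> union = new HashSet<>(a); // copy, so that a is left unchanged
        union.addAll(b);
        return a.size() + b.size() - union.size();
    }

    public static void main(String[] args) {
        Set<String> a = new HashSet<>(Arrays.asList("Aa", "x", "y"));
        Set<String> b = new HashSet<>(Arrays.asList("BB", "y", "z"));

        // "Aa" and "BB" collide on hashCode() but are not equal,
        // so only "y" is counted as common.
        System.out.println(commonCount(a, b)); // 1
    }
}
```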
