How to chain related strings together in a hash table? - java

For example, if I had two strings
s1 = "stack",
s2 = "stacks",
how would I implement the program to allow the strings to be placed under the same bucket (in this case s1 and s2 would be under the same bucket)?
Does this implementation have to do with the hash function?
I am creating a puzzle solving program and the whole idea is to eliminate the need to search for "stacks" if I know that "stack" does not exist in the puzzle.

You would have to create a hash function that returns the same value (hashCode) for both "stack" and "stacks". By the way, this is not a good idea: hashCode generation should not be based on what you want to end up in the same chain (i.e., the linked list used when a collision occurs in the hash table). It should return a well-distributed value for each object being added and minimize collisions.

This would be a little tough, since String is a final class: you won't be able to extend it and override its methods. What you can probably do is create a wrapper class that holds the string to be searched as a member variable, and store that object in the hash table. You can then override equals and hashCode in this class so that the related strings chain together (the matching rule below is just an example).
public class MyString {
    public String string;

    public MyString(String string) { this.string = string; }

    // Example rule (an assumption): strings are "the same" if they share a root,
    // here obtained by stripping a trailing "s" -- adapt this to your puzzle's logic.
    private String root() {
        return string.endsWith("s") ? string.substring(0, string.length() - 1) : string;
    }

    @Override
    public boolean equals(Object s) {
        return s instanceof MyString && root().equals(((MyString) s).root());
    }

    @Override
    public int hashCode() {
        // must return the same value for any two objects that are equal by equals()
        return root().hashCode();
    }
}
Even with this I'm not sure you will be able to achieve what you want. I would suggest using a Trie data structure instead; it will make your task much easier (a sketch follows below).
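A minimal prefix-trie sketch (the class and method names are my own, not from the original answer): inserting every word of the puzzle lets you ask both "is this a word?" and "is this a prefix of any word?", so you can skip "stacks" as soon as "stack" is not even a prefix.
import java.util.HashMap;
import java.util.Map;

public class Trie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    public void insert(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isWord = true;
    }

    // Walk the trie; returns null if the sequence of characters is not present.
    private Node find(String s) {
        Node node = root;
        for (char c : s.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return null;
        }
        return node;
    }

    public boolean contains(String word)    { Node n = find(word); return n != null && n.isWord; }
    public boolean hasPrefix(String prefix) { return find(prefix) != null; }

    public static void main(String[] args) {
        Trie trie = new Trie();
        trie.insert("stack");
        // If "stack" were not even a prefix, there would be no point trying "stacks".
        System.out.println(trie.hasPrefix("stack"));  // true
        System.out.println(trie.contains("stacks"));  // false
    }
}
With this, the puzzle solver can prune a whole branch of candidate words as soon as the current prefix is absent from the trie.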

Related

Overriding equals method doesn't work

I've been browsing a lot of similar questions here and on other sites, but I still can't seem to wrap my head around this problem.
I have a class:
public class Event {
    public String Item;
    public String Title;
    public String Desc;

    @Override
    public boolean equals(Object o) {
        return true;
    }
}
I'm trying to use this class in an ArrayList<Event> events, but I can't find a way to get events.contains("item") to work. I have tried debugging and found that it doesn't even enter the overridden method.
What am I doing wrong?
That's because you're breaking the symmetry required by the contract of equals(): if an Event is equal to "item" (which is a String), then "item" should also be equal to that Event.
What Java actually does is call indexOf("item") on your list and check that the result is non-negative.
Now, indexOf() works roughly like this in an ArrayList (see the complete source code here):
for (int i = 0; i < size; i++)
    if ("item".equals(elementData[i]))
        return i;
So it is String's equals() method that gets called here, not yours, and String.equals() of course returns false for an Event.
Solve this by passing an Event to the method instead, like:
events.contains( new Event("item", "title", "desc") )
Note that you'll have to create a proper constructor for your class to initialize the members.
You should also override public int hashCode(). The two methods are closely related.
Read more about this: http://www.javapractices.com/topic/TopicAction.do?Id=17
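Putting that together, a sketch of what the class could look like (the constructor, the Objects helper calls, and the choice to base equality on Item alone are assumptions; pick whichever fields define equality for your domain):
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

public class Event {
    public String Item;
    public String Title;
    public String Desc;

    public Event(String item, String title, String desc) {
        this.Item = item;
        this.Title = title;
        this.Desc = desc;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Event)) return false;
        // Assumption: two events are equal when their Item matches.
        return Objects.equals(Item, ((Event) o).Item);
    }

    @Override
    public int hashCode() {
        return Objects.hashCode(Item);   // consistent with equals()
    }

    public static void main(String[] args) {
        List<Event> events = new ArrayList<>();
        events.add(new Event("item", "title", "desc"));
        System.out.println(events.contains(new Event("item", "x", "y"))); // true
    }
}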
When you override the equals() method, you also have to override the hashCode() method, because they go hand in hand. If two objects are equal, they must have the same hash code; hash maps use it to determine storage locations. If two objects are not equal, they may or may not have the same hash code.
In this case, you only need to override the equals method, not the hashCode method.
The hashCode and equals methods should both be overridden when you want to use objects of your class as keys in a HashMap. A HashMap uses a structure of an array plus linked lists. When adding a key-value pair, it first does a calculation based on the key's hashCode to get an index into the array; then it walks the linked list at that index to see whether the key is already there. If it is, it overwrites the record with the new value; otherwise it appends the key-value pair to the end of that linked list. Locating a key follows the same process. So if hashCode is not overridden, you will fail the first-round search in the array. That is the real reason you need to override both methods: the "contract" between them is not arbitrary, it simply reflects how the lookup works.
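To see why lookups fail when only equals() is overridden, here is a small, self-contained demonstration (the class and field names are invented for the example):
import java.util.HashMap;
import java.util.Map;

// Key overrides equals() but NOT hashCode(), so two "equal" keys
// almost always land in different buckets and the lookup misses.
class Key {
    final String value;
    Key(String value) { this.value = value; }

    @Override
    public boolean equals(Object o) {
        return o instanceof Key && value.equals(((Key) o).value);
    }
    // hashCode() deliberately not overridden: the identity hash is used.
}

public class MissingHashCodeDemo {
    public static void main(String[] args) {
        Map<Key, String> map = new HashMap<>();
        map.put(new Key("x"), "found");
        // Equal by equals(), but (almost certainly) a different hash code:
        System.out.println(map.get(new Key("x"))); // prints null
    }
}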

Putting objects into a hash-based collection

Suppose I have the below class.
import java.util.HashMap;
import java.util.Map;

class S {
    String txt = null;

    S(String i) {
        txt = i;
    }

    public static void main(String[] args) {
        S s1 = new S("a");
        S s2 = new S("b");
        S s3 = new S("a");
        Map<S, String> m = new HashMap<>();
        m.put(s1, "v11");
        m.put(s2, "v22");
        m.put(s3, "v33");
        System.out.println(m.size());
    }

    // just a plain implementation
    public boolean equals(Object o) {
        S cc = (S) o;
        return this.txt.equals(cc.txt);
    }

    public int hashCode() {
        return 222;
    }
}
Running the above prints 2 as the size, which is totally fine. If we comment out hashCode() it returns 3, which is also correct. But if we comment out equals and keep hashCode it should return 2, right? Instead it returns 3. When putting objects into a HashMap, the map will check the hash code of the key and, if it is the same, replace the previous value with the new one, right?
Thank You.
But if we comment out equals and keep hashCode it should return 2, right? Instead it returns 3.
3 items is the correct behaviour. All 3 objects are hashed to the same bucket, but because all 3 are different (with equals commented out, object identity is used), that bucket contains a chain of entries (a linked list, for Java's HashMap) that share the same hash code but are not equal to each other.
When putting objects into a HashMap, the map will check the hash code of the key and, if it is the same, replace the previous value with the new one, right?
Being hashed to the same bucket doesn't mean that one value will replace another. The keys are then compared for equality: if they are equal, the old value is replaced; if they are not, the new entry is added to the tail of the linked list for that bucket.
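A toy, self-contained sketch of that bucket logic (the class and method names are my own; this is an illustration, not HashMap's actual implementation):
import java.util.ArrayList;
import java.util.List;

// Toy illustration of the chaining logic described above -- NOT HashMap's real code.
class TinyBucketMap {
    private static final class Entry {
        final Object key;
        Object value;
        Entry(Object key, Object value) { this.key = key; this.value = value; }
    }

    // 16 buckets, each holding a chain (list) of entries.
    private final List<List<Entry>> buckets = new ArrayList<>();

    TinyBucketMap() {
        for (int i = 0; i < 16; i++) buckets.add(new ArrayList<>());
    }

    void put(Object key, Object value) {
        int index = (key.hashCode() & 0x7fffffff) % buckets.size(); // pick the bucket
        for (Entry e : buckets.get(index)) {
            if (e.key.equals(key)) {   // an equal key is already chained here: overwrite
                e.value = value;
                return;
            }
        }
        buckets.get(index).add(new Entry(key, value)); // otherwise append to the chain
    }

    int size() {
        int n = 0;
        for (List<Entry> chain : buckets) n += chain.size();
        return n;
    }
}
With hashCode() always returning 222, every S key lands in the same bucket, so only the equals() check decides whether a put overwrites an existing entry or appends a new one.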
The hash code is simply used to determine the bucket in which to place the object, and each bucket can contain more than one object. hashCode must therefore be implemented so that equal objects go into the same bucket; in other words, equal objects must have the same hash code, but objects with the same hash code aren't necessarily equal.
When you override only hashCode, nothing really changes: with return 222 you are just putting every object into the same bucket. The HashMap becomes less efficient, but its contract doesn't change.
The hash code is the first, quick check for whether two objects might be equal. Hash containers use it to decide which "slot" (bucket) an object goes into, so they can retrieve it without checking every object in every slot.
If your hash code is always the same, then all objects are directed to the same slot; this is called a collision. Insertions become slower, because after the collision the container has to check whether any object already in that slot matches the new one (equals). Retrieval is also slower, because it has to check them sequentially until it finds the right one (equals again). Finally, a lot of memory is wasted on slots that will never be used.
In essence, by not implementing a sensible hashCode you are turning the hash container into a list (and an inefficient one).
If we comment out hashCode() it returns 3, which is also correct.
This is not correct! There are only 2 distinct objects by value: "a" and "b". The equals method defines what is equal and what is not, so the expected size is 2. But because the equals/hashCode contract is broken, the returned size is 3.

Convert string to hash and then reform the string later

I need to hash some strings so I can pass them into some libraries; this is straightforward using the String.hashCode() call.
However, once everything is processed, I'd like to convert the integer generated by hashCode back into the String value. I could obviously track the string and hash code values somewhere else and do the conversion there, but I'm wondering whether there is anything in Java that will do this automatically.
I think you misunderstand the concept of a hash. A hash is a one-way function. Worse, two strings might generate the same hash.
So no, it's not possible.
hashCode() is not generally going to be a bijection, because it's not generally going to be an injective map.
hashCode() has ints as its range. There are only 2^32 distinct int values, so for any type of which there can be more than 2^32 distinct instances (e.g., think about Long), you are guaranteed (by the pigeonhole principle) that at least two distinct objects will have the same hash code.
The only guarantee that hashCode() gives you is that if a.equals(b), then a.hashCode() == b.hashCode(). Every object having the same hash code is consistent with this.
You can use hashCode() to uniquely identify objects only in some very limited circumstances: you must have a particular class of which there are no more than 2^32 possible different instances (i.e., at most 2^32 objects of your class that are pairwise !a.equals(b)). In that case, so long as you ensure that whenever !a.equals(b) for two objects a and b of your class, a.hashCode() != b.hashCode(), you have a bijection between (equivalence classes of) objects and hash codes. (It could be done like this for the Integer class, for example.)
However, unless you're in this very special case, you should create a unique id some other way.
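Integer is an example of such a class: its hashCode() is simply the wrapped int value, so distinct Integer values always have distinct hash codes.
public class IntegerHashDemo {
    public static void main(String[] args) {
        // Integer.hashCode() returns the wrapped int value itself.
        System.out.println(Integer.valueOf(42).hashCode());   // 42
        System.out.println(Integer.valueOf(-7).hashCode());   // -7
    }
}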
That is not possible in general; hashCode is what one would call a one-way function.
Besides, there are more strings than integers, so the mapping from integers to strings is one-to-many. The strings "0-42L" and "0-43-", for instance, have the same hash code. (Demonstration on ideone.com.)
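You can check this yourself; both pairs below are well-known String.hashCode() collisions:
public class HashCollisionDemo {
    public static void main(String[] args) {
        // Both pairs collide under String.hashCode().
        System.out.println("0-42L".hashCode() == "0-43-".hashCode()); // true
        System.out.println("Aa".hashCode() == "BB".hashCode());       // true
    }
}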
What you could do, however, as a workaround, would be to store the strings you pass into the API and remember their hash codes, like this:
import java.util.*;

public class Main {
    public static void main(String[] args) {
        // Keep track of the corresponding strings
        Map<Integer, String> hashedStrings = new HashMap<Integer, String>();

        String str1 = "hello";
        String str2 = "world";

        // Compute the hash code and remember which string gave rise to it.
        int hc = str1.hashCode();
        hashedStrings.put(hc, str1);
        apiMethod(hc);

        // Get back the string that corresponded to the hc hash code.
        String str = hashedStrings.get(hc);
    }
}
It is not possible to convert the hashCode() output back to the original form; it's a one-way process.
You can use a Base64 encoding scheme instead, where you encode the data, use it wherever you want to, and then decode it back to the original form.
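If what you actually need is a reversible token rather than a hash, java.util.Base64 (standard since Java 8) can do that; a minimal round-trip sketch:
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Base64RoundTrip {
    public static void main(String[] args) {
        String original = "stacks";

        // Encode: reversible, unlike a hash code.
        String token = Base64.getEncoder()
                             .encodeToString(original.getBytes(StandardCharsets.UTF_8));

        // Decode back to the original string.
        String decoded = new String(Base64.getDecoder().decode(token),
                                    StandardCharsets.UTF_8);

        System.out.println(token);    // c3RhY2tz
        System.out.println(decoded);  // stacks
    }
}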

How to ensure hashCode() is consistent with equals()?

When overriding the equals() function of java.lang.Object, the javadocs suggest that,
it is generally necessary to override the hashCode method whenever this method is overridden, so as to maintain the general contract for the hashCode method, which states that equal objects must have equal hash codes.
The hashCode() method must return a unique integer for each object (this is easy to do when comparing objects based on memory location: simply return the object's unique integer address).
How should a hashCode() method be overridden so that it returns a unique integer for each object based only on that object's properties?
public class People {
    public String name;
    public int age;

    public int hashCode() {
        // How to get a unique integer based on name and age?
    }
}

/*******************************/

public class App {
    public static void main(String[] args) {
        People mike = new People();
        People melissa = new People();
        mike.name = "mike";
        mike.age = 23;
        melissa.name = "melissa";
        melissa.age = 24;
        System.out.println(mike.hashCode());    // output?
        System.out.println(melissa.hashCode()); // output?
    }
}
It doesn't say the hashcode for an object has to be completely unique, only that the hashcode for two equal objects returns the same hashcode. It's entirely legal to have two non-equal objects return the same hashcode. However, the more unique a hashcode distribution is over a set of objects, the better performance you'll get out of HashMaps and other operations that use the hashCode.
IDEs such as IntelliJ IDEA have built-in generators for equals and hashCode that generally do a pretty good job of coming up with "good enough" code for most objects (and probably better than some hand-crafted, overly clever hash functions).
For example, here's the hashCode function that IDEA generates for your People class:
public int hashCode() {
    int result = name != null ? name.hashCode() : 0;
    result = 31 * result + age;
    return result;
}
I won't go into the details of hashCode uniqueness, as Marc has already addressed it. For your People class, you first need to decide what equality of a person means. Maybe equality is based solely on their name, maybe it's based on name and age; it will be domain-specific. Let's say equality is based on name and age. Your overridden equals would look like:
public boolean equals(Object obj) {
    if (this == obj) return true;
    if (obj == null) return false;
    if (!getClass().equals(obj.getClass())) return false;
    People other = (People) obj;
    return (name == null ? other.name == null : name.equals(other.name))
            && age == other.age;
}
Any time you override equals you must override hashCode. Furthermore, hashCode can't use any more fields in its computation than equals did. Most of the time you combine the hash codes of the various fields by adding or exclusive-or-ing them (hashCode should be fast to compute). So a valid hashCode method might look like:
public int hashCode() {
    return (name == null ? 17 : name.hashCode()) ^ age;
}
Note that the following is not valid as it uses a field that equals didn't (height). In this case two "equals" objects could have a different hash code.
public int hashCode() {
    return (name == null ? 17 : name.hashCode()) ^ age ^ height;
}
Also, it's perfectly valid for two non-equals objects to have the same hash code:
public int hashCode() {
    return age;
}
In this case Jane age 30 is not equal to Bob age 30, yet both their hash codes are 30. While valid this is undesirable for performance in hash-based collections.
Another question asks if there are some basic low-level things that all programmers should know, and I think hash lookups are one of those. So here goes.
A hash table (note that I'm not using an actual classname) is basically an array of linked lists. To find something in the table, you first compute the hashcode of that something, then mod it by the size of the table. This is an index into the array, and you get a linked list at that index. You then traverse the list until you find your object.
Since array retrieval is O(1), and linked list traversal is O(n), you want a hash function that creates as random a distribution as possible, so that objects will be hashed to different lists. Every object could return the value 0 as its hashcode, and a hash table would still work, but it would essentially be a long linked-list at element 0 of the array.
You also generally want the array to be large, which increases the chances that the object will be in a list of length 1. The Java HashMap, for example, increases the size of the array when the number of entries in the map is > 75% of the size of the array. There's a tradeoff here: you can have a huge array with very few entries and waste memory, or a smaller array where each element in the array is a list with > 1 entries, and waste time traversing. A perfect hash would assign each object to a unique location in the array, with no wasted space.
The term "perfect hash" is a real term, and in some cases you can create a hash function that provides a unique number for each object. This is only possible when you know the set of all possible values. In the general case, you can't achieve this, and there will be some values that return the same hashcode. This is simple mathematics: if you have a string that's more than 4 bytes long, you can't create a unique 4-byte hashcode.
One interesting tidbit: hash arrays are generally sized based on prime numbers, to give the best chance for random allocation when you mod the results, regardless of how random the hashcodes really are.
Edit based on comments:
1) A linked list is not the only way to represent the objects that have the same hashcode, although that is the method used by the JDK 1.5 HashMap. Although less memory-efficient than a simple array, it does arguably create less churn when rehashing (because the entries can be unlinked from one bucket and relinked to another).
2) As of JDK 1.4, the HashMap class uses an array sized as a power of 2; prior to that it used 2^N+1, which I believe is prime for N <= 32. This does not speed up array indexing per se, but does allow the array index to be computed with a bitwise AND rather than a division, as noted by Neil Coffey. Personally, I'd question this as premature optimization, but given the list of authors on HashMap, I'll assume there is some real benefit.
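As a quick sketch of that indexing trick (the capacity of 16 is just an example; the masking only matches a plain modulo for non-negative hash codes):
public class IndexForDemo {
    public static void main(String[] args) {
        int hash = "stack".hashCode();          // happens to be non-negative
        int capacity = 16;                      // must be a power of two

        int byModulo = hash % capacity;         // needs a division
        int byMask   = hash & (capacity - 1);   // bitwise AND, no division

        System.out.println(byModulo + " == " + byMask);   // same bucket index
    }
}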
In general the hash code cannot be unique, as there are more values than possible hash codes (integers).
A good hash code distributes the values well over the integers.
A bad one could always give the same value and still be logically correct; it would just lead to unacceptably inefficient hash tables.
Equal values must have the same hash value for hash tables to work correctly.
Otherwise you could add a key to a hash table, then try to look it up via an equal value with a different hash code and not find it.
Or you could put an equal value with a different hash code and have two equal values at different places in the hash table.
In practice you usually select a subset of the fields to be taken into account in both the hashCode() and the equals() method.
I think you misunderstood it. The hash code does not have to be unique to each object (after all, it is a hash code), though you obviously don't want it to be identical for all objects. You do, however, need it to be identical for all objects that are equal; otherwise things like the standard collections would not work (e.g., you'd look up something in a hash set and not find it).
For straightforward attributes, some IDEs have hashCode function builders.
If you don't use an IDE, consider using Apache Commons Lang and its HashCodeBuilder class.
The only contractual obligation for hashCode is for it to be consistent. The fields used in creating the hashCode value must be the same or a subset of the fields used in the equals method. This means returning 0 for all values is valid, although not efficient.
One can check whether hashCode is consistent via a unit test. I have written an abstract class called EqualityTestCase, which does a handful of hashCode checks; one simply has to extend the test case and implement two or three factory methods. The test does a very crude job of testing whether the hashCode is efficient.
This is what the documentation tells us about the hashCode method:
# javadoc
Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified. This integer need not remain consistent from one execution of an application to another execution of the same application.
There is the notion of a business key, which determines the uniqueness of separate instances of the same type. Each specific type (class) that models a separate entity from the target domain (e.g. a vehicle in a fleet system) should have a business key, represented by one or more class fields. The equals() and hashCode() methods should both be implemented using the fields that make up the business key; this ensures that the two methods are consistent with each other.
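A sketch of that idea (the Vehicle class and the choice of a VIN as the business key are invented for illustration):
import java.util.Objects;

// Business key: the VIN uniquely identifies a vehicle in the domain,
// so equals() and hashCode() are both based on it and nothing else.
public class Vehicle {
    private final String vin;     // business key
    private String ownerName;     // mutable, NOT part of the key

    public Vehicle(String vin, String ownerName) {
        this.vin = vin;
        this.ownerName = ownerName;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Vehicle)) return false;
        return vin.equals(((Vehicle) o).vin);
    }

    @Override
    public int hashCode() {
        return Objects.hashCode(vin);
    }
}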

How to compute the hashCode() from the object's address?

In Java, I have a subclass Vertex of the Java3D class Point3f. Now Point3f computes equals() based on the values of its coordinates, but for my Vertex class I want to be stricter: two vertices are only equal if they are the same object. So far, so good:
class Vertex extends Point3f {
    // ...
    public boolean equals(Object other) {
        return this == other;
    }
}
I know this violates the contract of equals(), but since I'll only compare vertices to other vertices this is not a problem.
Now, to be able to put vertices into a HashMap, the hashCode() method must return results consistent with equals(). It currently does that, but probably bases its return value on the fields of the Point3f, and therefore will give hash collisions for different Vertex objects with the same coordinates.
Therefore I would like to base the hashCode() on the object's address, instead of computing it from the Vertex's fields. I know that the Object class does this, but I cannot call its hashCode() method because Point3f overrides it.
So, actually my question is twofold:
Should I even want such a shallow equals()?
If yes, then, how do I get the object's address to compute the hash code from?
Edit: I just thought of something... I could generate a random int value on object creation, and use that for the hash code. Is that a good idea? Why (not)?
Either use System.identityHashCode() or use an IdentityHashMap.
System.identityHashCode() returns the same hash code for the given object as would be returned by the default method hashCode(), whether or not the given object's class overrides hashCode().
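So, assuming Point3f is the javax.vecmath class used by Java3D, one way to wire that up is:
import javax.vecmath.Point3f;

class Vertex extends Point3f {

    @Override
    public boolean equals(Object other) {
        return this == other;                     // identity-based equality
    }

    @Override
    public int hashCode() {
        // Ignore Point3f's value-based hashCode and use the identity hash,
        // which is consistent with the identity-based equals() above.
        return System.identityHashCode(this);
    }
}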
You could use a delegate, even though the answer above is probably better.
class Vertex extends Point3f {
    private final Object equalsDelegate = new Object();

    public boolean equals(Object vertex) {
        if (vertex instanceof Vertex) {
            return this.equalsDelegate.equals(((Vertex) vertex).equalsDelegate);
        } else {
            return super.equals(vertex);
        }
    }

    public int hashCode() {
        return this.equalsDelegate.hashCode();
    }
}
Just FYI, your equals method does NOT violate the equals contract (for the base Object's contract that is)... that is basically the equals method for the base Object method, so if you want identity equals instead of the Vertex equals, that is fine.
As for the hash code, you really don't need to change it, though the accepted answer is a good option and will be a lot more efficient if your hash table contains a lot of vertex keys that have the same values.
The reason you don't need to change it is that it is completely fine for the hash code to return the same value for objects that equals reports as not equal... it is even valid to just return 0 all the time for EVERY instance. Whether this is efficient for hash tables is a completely different issue... you will get a lot more collisions if many of your objects have the same hash code (which may be the case if you left hashCode alone and had a lot of vertices with the same values).
Please don't accept this as the answer though of course (what you chose is much more practical), I just wanted to give you a little more background info about hash codes and equals ;-)
Why do you want to override hashCode() in the first place? You'd want to do it if you want to work with some other definition of equality. For example
public class A {
    int id;
    public A(int id) { this.id = id; }
    @Override
    public boolean equals(Object other) { return other instanceof A && ((A) other).id == id; }
    @Override
    public int hashCode() { return id; }
}
where you want to be clear that if the id's are the same then the objects are the same, and you override hashcode so that you can't do this:
HashSet<A> hash = new HashSet<>();
hash.add(new A(1));
hash.add(new A(1));
and get two identical (from the point of view of your definition of equality) A's in it.
The correct behavior is that you end up with only one object in the set; the second add is rejected as a duplicate.
Since you are not using equals as a logical comparison but a physical one (i.e. it is the same object), the only way to guarantee that the hash code returns a unique value is to implement a variation of your own suggestion. Instead of generating a random number, use a UUID to generate an actual unique value for each object.
System.identityHashCode() will work most of the time, but it is not guaranteed, because Object.hashCode() is not guaranteed to return a unique value for every object. I have seen that marginal case happen, and it will probably depend on the VM implementation, which is not something you want your code to depend on.
Excerpt from the javadocs for Object.hashCode():
As much as is reasonably practical, the hashCode method defined by class Object does return distinct integers for distinct objects. (This is typically implemented by converting the internal address of the object into an integer, but this implementation technique is not required by the Java™ programming language.)
The problem this addresses is the case of two separate point objects overwriting each other when inserted into the hash map because they both have the same hash. Since there is no logical equals with an accompanying override of hashCode(), the identityHashCode approach can actually allow this scenario to occur: where the logical case would only replace hash entries for the same logical point, using the system-based hash can cause it to happen with any two objects, since equality (and even class) is no longer a factor.
The function hashCode() is inherited from Object and works exactly as you intend (on object level, not coordinate-level). There should be no need to change it.
As for your equals method, there is no reason to even override it: you can just do obj1 == obj2 in your code instead of calling equals, which is meant for sorting and similar uses where comparing coordinates makes a lot more sense.
