I've used an array as a hash table for a hashing algorithm, with values:
int[] arr={4 , 5 , 64 ,432 };
and keys as consecutive integers in an array:
int keys[]={ 1 , 2 , 3 ,4};
Could anyone please tell me what would be a good approach for mapping those integer keys to array locations? Is the following a short and good approach with few or no collisions (also for larger values)?
keys[i] % arr.length // where i indexes the elements of keys
Thanks in advance.
I assume you're trying to implement some kind of hash table as an exercise. Otherwise, you should just use a java.util.HashMap or java.util.TreeMap or similar.
For a small set of values, as you have given above, your solution is fine. The real question will come when your data grows much bigger.
You have identified that collisions are undesirable - that is true. Sometimes, some knowledge of the likely keys can help you design a good hash function. Sometimes, you can assume that the key class will have a good hashCode() method. Since hashCode() is a method defined by Object, every class has it. It would be neatest for you to be able to utilise the hashCode() method of your key, rather than have to build a new algorithm specially for your map.
If all integer keys are equally likely, then a mod function will spread them out evenly amongst the different buckets, minimising collisions. However, if you know that the keys are going to be numbered consecutively, it might be better to use a List than a HashMap - this will guarantee no collisions.
Any reason not to use the built-in HashMap? You will have to use Integer though, not int.
java.util.Map<Integer, Integer> myMap = new java.util.HashMap<Integer, Integer>();
Since you want to implement your own, then first brush-up on hash tables by reading the Wikipedia article. After that, you could study the HashMap source code.
This StackOverflow question contains interesting links for implementing fast hashmaps (for C++ though), as does this one (for Java).
Get yourself a book about algorithms and data structures and read the chapter about hash tables (the Wikipedia article would also be a good entry point). It's a complex topic and far beyond the scope of a Q&A site like this.
For starters, using the array-size modulo is in general a horrible hash function, because it results in massive collisions when the values are multiples of the array size or one of its divisors. How bad that is depends on the array size: the more divisors it has, the more likely are collisions; when it's a prime number, it's not too bad (but not really good either).
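To make the point about divisors concrete, here is a minimal sketch that counts how many keys land in an already occupied slot under key % tableSize. The class and method names are hypothetical, and the key set (multiples of 4) is chosen only for illustration.

```java
// Hypothetical demo; names and key values are illustrative only.
class ModuloCollisions {
    // Counts keys that map to a slot some earlier key already occupies.
    static int countCollisions(int[] keys, int tableSize) {
        boolean[] used = new boolean[tableSize];
        int collisions = 0;
        for (int key : keys) {
            int slot = Math.floorMod(key, tableSize); // floorMod keeps the slot non-negative
            if (used[slot]) {
                collisions++;
            } else {
                used[slot] = true;
            }
        }
        return collisions;
    }

    public static void main(String[] args) {
        int[] multiplesOfFour = {4, 8, 12, 16, 20, 24, 28};
        // Table size 8 shares the divisor 4 with every key: only slots 0 and 4 are used.
        System.out.println(countCollisions(multiplesOfFour, 8));
        // Table size 7 is prime: the same keys spread over all seven slots.
        System.out.println(countCollisions(multiplesOfFour, 7));
    }
}
```

With the consecutive keys 1 to 4 from the question and a table of size 4, the same method reports no collisions, which is why the simple modulo looks fine on that data.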
I read in CLRS that a good hash function is: h(x) = ((a*x + b) mod p) mod m
where a,b random, p prime, m table size
I cannot find this hash function in any libraries in any programming language, especially in Java. Why?
I'm talking about implementations for hashCode.
Ok, both answers missed the mark. jwodder talks about salting hashes with random salts, and Juan talks about something else entirely.
The hash function you talk about does have random a and b, but it only gets the random a,b once. Then uses the same values for those through the lifetime of your data. Otherwise you could end up with a scenario where the same value has different hashes and no salt to distinguish why they're different.
The hash you listed is good for randomly distributing values into buckets, as needed by a hash table. The hash functions you see in Java, however, would be cryptographic, used for integrity and authentication. Alternatively, if you're actually investigating the code used for hash tables, you may see values for a and b, but they would just be constants at that point: they were randomly generated before being added to the library.
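A minimal sketch of the formula from the question, with a and b drawn once in the constructor and then reused for the lifetime of the table. The class name, the choice p = 2^31 - 1, and the restriction to non-negative int keys below p are assumptions made for this example; they do not come from CLRS or from any library.

```java
import java.util.Random;

// Hypothetical sketch of h(x) = ((a*x + b) mod p) mod m; names are illustrative.
class UniversalHash {
    // Assumption: p = 2^31 - 1 (a prime), so keys must lie in [0, p).
    private static final long P = 2147483647L;
    private final long a; // drawn once from [1, p - 1]
    private final long b; // drawn once from [0, p - 1]
    private final int m;  // table size

    UniversalHash(int tableSize, Random rng) {
        if (tableSize <= 0) {
            throw new IllegalArgumentException("tableSize must be positive");
        }
        this.m = tableSize;
        this.a = 1L + rng.nextInt((int) (P - 1));
        this.b = rng.nextInt((int) P);
    }

    // The same key always maps to the same bucket, because a and b never change.
    int bucket(int key) {
        if (key < 0 || key >= P) {
            throw new IllegalArgumentException("key must be in [0, p)");
        }
        // a * key stays below 2^62, so the long arithmetic cannot overflow.
        return (int) (((a * key + b) % P) % m);
    }
}
```

Usage would look like new UniversalHash(16, new Random()).bucket(42); a new instance picks new a and b, so bucket numbers are only comparable within one instance.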
(Assuming you mean "hash" as in "hash table/hash map" and not in the sense of cryptographic hashes like SHA1)
Modern best practice in hash design actually does involve randomness, in order to prevent denial-of-service attacks caused by people feeding in inputs that all hash to the same value. Python, for example, has done this since version 3.3, and I believe recent versions of Perl do as well; see also: http://www.ocert.org/advisories/ocert-2011-003.html
This might be a good hash function in some cases, but it'd be a terrible hashCode implementation. Java hashCode is supposed to return an arbitrary int, so constraining it via mod p and mod m makes no sense.
In a hash table like java.util.HashMap, the table size is always a power of two, which allows using a bitwise AND instead of the slower modulus computation. In practice, tables whose size is not a power of two are rarely used, because of speed.
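As a small illustration of the bitwise AND trick, the sketch below assumes the table size is a power of two; the class and method names are hypothetical and this is not the actual HashMap source.

```java
// Hypothetical helper; not taken from java.util.HashMap.
class PowerOfTwoIndex {
    // For a power-of-two tableSize, (tableSize - 1) is a mask of the low bits,
    // so the AND yields the same result as a non-negative modulo.
    static int indexFor(int hash, int tableSize) {
        if (tableSize <= 0 || (tableSize & (tableSize - 1)) != 0) {
            throw new IllegalArgumentException("tableSize must be a power of two");
        }
        return hash & (tableSize - 1);
    }
}
```

For any int hash, including negative ones, indexFor(hash, 16) gives the same value as Math.floorMod(hash, 16), but only the low four bits of the hash take part in the result.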
Using arbitrary random a and b in your formula is surely wrong. Using a==0 is a full scale disaster and so is a==p or a==m. Using a==0x80000000 is only slightly better and there are many other really bad values of a.
Note that using b is just a waste of time; in particular, it adds no hash DoS protection. While it makes the hash value unpredictable, it preserves all collisions.
A hash doesn't need to be random, its target would be to provide an even distribution amongst a limited range of possible results based on random input.
As part of my programming course I was given an exercise to implement my own String collection. I was planning on using an ArrayList or a similar collection, but one of the constraints is that we are not allowed to use any Java API to implement it, so only arrays are allowed. I could have implemented this using plain arrays, but efficiency is very important, as is the amount of data that this code will be tested with. It was suggested that I use hash tables or ordered trees, as they are more efficient than arrays. After doing some research I decided to go with hash tables because they seemed easy to understand and implement, but once I started writing code I realised it is not as straightforward as I thought.
So here are the problems I have come up with and would like some advice on what is the best approach to solve them again with efficiency in mind:
ACTUAL SIZE: If I understood it correctly hash tables are not ordered (indexed) so that means that there are going to be gaps in between items because hash function gives different indices. So how do I know when array is full and I need to resize it?
RESIZE: One of the difficulties is that I need to create a dynamic data structure using arrays. So if I have an array String[100], once it gets full I will need to resize it by some factor. I decided to increase it by 100 each time. Once I do that, I would need to change the positions of all existing values, since their hash keys will be different, as the key is calculated as:
int position = "orange".hashCode() % currentArraySize;
So if I try to find a certain value its hash key will be different from what it was when array was smaller.
HASH FUNCTION: I was also wondering if built-in hashCode() method in String class is efficient and suitable for what I am trying to implement or is it better to create my own one.
DEALING WITH MULTIPLE OCCURRENCES: one of the requirements is to be able to add multiple words that are the same, because I need to be able to count how many times the word is stored in my collection. Since they are going to have the same hash code I was planning to add the next occurrence at the next index hoping that there will be a gap. I don't know if it is the best solution but here how I implemented it:
public int count(String word) {
    int count = 0;
    while (collection[(word.hashCode() % size) + count] != null && collection[(word.hashCode() % size) + count].equals(word))
        count++;
    return count;
}
Thank you in advance for your advice. Please ask if anything needs to be clarified.
P.S. The length of words is not fixed and varies greatly.
UPDATE Thank you for your advice. I know I made a few stupid mistakes there; I will try to do better. I took all your suggestions and quickly came up with the following structure. It is not elegant, but I hope it is roughly what you meant. I did have to make a few judgements, such as the bucket size: for now I use half the number of elements, but is there a way to calculate it, or some general value? Another uncertainty was by what factor to increase my array: should I multiply by some number n, or is adding a fixed number also applicable? I was also wondering about general efficiency, because I am actually creating instances of classes, but String is a class too, so I am guessing the difference in performance should not be too big?
ACTUAL SIZE: The built-in Java HashMap just resizes when the total number of elements exceeds the number of buckets multiplied by a number called the load factor, which is by default 0.75. It does not take into account how many buckets are actually full. You don't have to, either.
RESIZE: Yes, you'll have to rehash everything when the table is resized, which means recomputing each entry's position from its hash.
So if I try to find a certain value its hash key will be different from what it was when array was smaller.
Yup.
HASH FUNCTION: Yes, you should use the built in hashCode() function. It's good enough for basic purposes.
DEALING WITH MULTIPLE OCCURRENCES: This is complicated. One simple solution would just be to have the hash entry for a given string also keep count of how many occurrences of that string are present. That is, instead of keeping multiple copies of the same string in your hash table, keep an int along with each String counting its occurrences.
So how do I know when array is full and I need to resize it?
You keep track of the size, as HashMap does. When the size used > capacity * load factor, you grow the underlying array, either as a whole or in part.
int position = "orange".hashCode() % currentArraySize;
Some things to consider.
The % of a negative value is a negative value.
Math.abs can return a negative value.
Using & with a bit mask is faster however you need a size which is a power of 2.
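The three points above can be checked with a short sketch. The class and method names are hypothetical; Math.floorMod and the sign-bit mask are two common ways to get a non-negative index, shown here as options rather than as the one right answer.

```java
// Hypothetical helpers comparing ways to turn a hashCode into an array index.
class SafeIndex {
    // Plain remainder: negative for negative hashes, so unusable as an index.
    static int naive(int hash, int size) {
        return hash % size;
    }

    // Math.abs(Integer.MIN_VALUE) is still Integer.MIN_VALUE, so this can be negative too.
    static int withAbs(int hash, int size) {
        return Math.abs(hash) % size;
    }

    // floorMod always returns a value in [0, size) for positive size.
    static int withFloorMod(int hash, int size) {
        return Math.floorMod(hash, size);
    }

    // Clearing the sign bit first also guarantees a non-negative result.
    static int withSignMask(int hash, int size) {
        return (hash & 0x7fffffff) % size;
    }
}
```

The last two variants may pick different slots for the same negative hash; either is fine as long as one of them is used consistently throughout the table.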
I was also wondering if built-in hashCode() method in String class is efficient and suitable for what I am trying to implement or is it better to create my own one.
The built-in hashCode is cached, so it is fast. However, it is not a great hashCode: it has poor randomness in the lower bits, and in the higher bits for short strings. You might want to implement your own hashing strategy, possibly a 64-bit one.
DEALING WITH MULTIPLE OCCURRENCES:
This is usually done with a counter for each key. This way you can have say 32767 duplicates (if you use short) or 2 billion (if you use int) duplicates of the same key/element.
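Putting the advice from these answers together, here is a minimal sketch of a counting string collection built only on arrays. It assumes separate chaining, a power-of-two capacity, doubling on resize, and a load factor of 0.75; all of these are choices made for the example, and the class and method names are hypothetical.

```java
// Hypothetical sketch: a word multiset using arrays and a small node class only.
class WordCounter {
    // One chain entry: a word, how often it was added, and the next entry in the bucket.
    private static final class Node {
        final String word;
        int count = 1;
        Node next;

        Node(String word, Node next) {
            this.word = word;
            this.next = next;
        }
    }

    private static final double LOAD_FACTOR = 0.75; // same default as java.util.HashMap
    private Node[] buckets = new Node[16];          // capacity stays a power of two
    private int distinctWords = 0;

    // The mask keeps the index non-negative even for negative hash codes.
    private static int indexFor(String word, int capacity) {
        return word.hashCode() & (capacity - 1);
    }

    void add(String word) {
        int i = indexFor(word, buckets.length);
        for (Node n = buckets[i]; n != null; n = n.next) {
            if (n.word.equals(word)) {
                n.count++; // duplicate: bump the counter instead of storing a copy
                return;
            }
        }
        buckets[i] = new Node(word, buckets[i]);
        if (++distinctWords > buckets.length * LOAD_FACTOR) {
            resize();
        }
    }

    int count(String word) {
        for (Node n = buckets[indexFor(word, buckets.length)]; n != null; n = n.next) {
            if (n.word.equals(word)) {
                return n.count;
            }
        }
        return 0;
    }

    // Doubles the array and re-links every node at its new index.
    private void resize() {
        Node[] old = buckets;
        buckets = new Node[old.length * 2];
        for (Node head : old) {
            while (head != null) {
                Node next = head.next;
                int i = indexFor(head.word, buckets.length);
                head.next = buckets[i];
                buckets[i] = head;
                head = next;
            }
        }
    }
}
```

Doubling rather than adding a fixed 100 slots keeps the number of resizes logarithmic in the final size, which is one reason growth by a factor is the usual choice.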
I'm looking for the most effective way of creating hashcodes for a very specific case of strings.
I have strings that can be converted to integer, they vary from 1 to 10,000, and they are very concentrated on the 1-600 range.
My question is: in terms of performance when retrieving the items from a collection, what is the most effective way to implement the hashcode for them?
What I'm thinking is:
I can have the strings converted to integers and use a direct access table (an array of 10,000 rows) - this will be very fast for retrieval but not very smart in terms of memory allocation;
I can use the strings as strings and get a hashcode for them (I won't have to convert them to integers, but I don't know how effective the hashcode for the strings will be in terms of collisions)
Any other ideas are greatly appreciated.
thanks a lot
Thanks everyone for your prompt replies...
There is another piece of information that I forgot to add. I think it will make things clearer if I let you know my final goal with this. I might not even need a hash table!
I just want to validate a stream against a dictionary that is immutable. I want to check whether or not a given tag is present in my message.
I will receive a string with several pairs tag=value. I want to verify if the tag must or must not be treated by my app.
You might want to consider a trie (http://en.wikipedia.org/wiki/Trie) or radix tree (http://en.wikipedia.org/wiki/Radix_tree). No need to parse the string into an integer, or compute a hash code. You're walking a tree as you walk the string.
Edit:
Both computing a hash code on a string and parsing an integer out of a string involve walking the entire string, and THEN using that value as a look-up into a specific data structure. Other techniques might involve simultaneously inspecting the string WHILE traversing a data structure. This MIGHT be of value to the poster who asked for "other ideas".
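A minimal sketch of the trie idea for numeric tags, assuming the tags consist only of the decimal digits 0 to 9. The class and method names are hypothetical; the lookup walks the tree one character at a time and can stop as soon as no matching branch exists.

```java
// Hypothetical sketch of a trie over decimal digits; names are illustrative.
class DigitTrie {
    private static final class Node {
        final Node[] children = new Node[10]; // one branch per digit
        boolean terminal;                     // true if a stored tag ends here
    }

    private final Node root = new Node();

    void add(String tag) {
        Node n = root;
        for (int i = 0; i < tag.length(); i++) {
            int d = tag.charAt(i) - '0';
            if (d < 0 || d > 9) {
                throw new IllegalArgumentException("not a decimal tag: " + tag);
            }
            if (n.children[d] == null) {
                n.children[d] = new Node();
            }
            n = n.children[d];
        }
        n.terminal = true;
    }

    // Checks the characters text[start, end) without building a substring or an int.
    boolean contains(CharSequence text, int start, int end) {
        Node n = root;
        for (int i = start; i < end; i++) {
            int d = text.charAt(i) - '0';
            if (d < 0 || d > 9 || n.children[d] == null) {
                return false; // early exit: no stored tag has this prefix
            }
            n = n.children[d];
        }
        return n.terminal;
    }
}
```

For a message such as "35=abc", calling contains(message, 0, message.indexOf('=')) checks the tag in place, without copying it out of the message first.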
Many collections (e.g. HashMap) already apply a supplemental "rehash" method to help with poor hashcode algorithms; e.g., browse the source code for HashMap.hash(). Strings are very common keys, so you can be sure that String.hashCode() is highly optimized. So, unless you notice a lot of collisions between your hashCodes, I'd go with the standard code.
I tried putting the Strings for 0..600 into a HashSet to see what happened, but it's then pretty tedious to see how many entries had collisions. Look for yourself! If you really really care, copy the source code from HashMap into your own class, edit it so you can get access to the entries (in the Java 6 source code I'm looking at, that would be transient Entry[] table, YMMV), and add methods to count collisions.
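A less tedious way to look at this, without copying HashMap's source, is to compute the hash codes directly. The sketch below uses hypothetical names and, as a simplifying assumption, masks the raw hashCode() to a power-of-two table size; it does not reproduce the supplemental hashing that HashMap applies internally, so real bucket occupancy may differ.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for inspecting String.hashCode() on decimal strings.
class HashCodeCollisions {
    // Number of distinct hashCode() values among the strings from..to (inclusive).
    static int distinctHashCodes(int from, int to) {
        Set<Integer> seen = new HashSet<>();
        for (int i = from; i <= to; i++) {
            seen.add(Integer.toString(i).hashCode());
        }
        return seen.size();
    }

    // Number of buckets used when the raw hash is masked to a power-of-two table size.
    static int occupiedBuckets(int from, int to, int tableSize) {
        boolean[] used = new boolean[tableSize];
        int occupied = 0;
        for (int i = from; i <= to; i++) {
            int bucket = Integer.toString(i).hashCode() & (tableSize - 1);
            if (!used[bucket]) {
                used[bucket] = true;
                occupied++;
            }
        }
        return occupied;
    }
}
```

The number of bucket collisions for a range is then the number of strings minus occupiedBuckets(from, to, tableSize).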
If there are only a limited valid range of values, why not represent the collection as a int[10000] as you suggested? The value at array[x] is the number of times that x occurs.
If your strings are represented as decimal integers, then parsing them to integers is a 5-iteration loop (up to 5 digits) and a couple of additions and subtractions. That is, it is incredibly fast. Inserting the elements is effectively O(1), retrieval is O(1). Memory required is around 40 KB (4 bytes per int).
One problem is that insertion order is not preserved. Maybe you don't care.
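A minimal sketch of the direct-access table described in this answer. The names are hypothetical, and the array is sized 10001 rather than 10000 on the assumption that the value 10,000 itself must be addressable.

```java
// Hypothetical sketch of a direct-access count table for values 1..10000.
class DirectCountTable {
    private static final int MAX_VALUE = 10000;
    private final int[] counts = new int[MAX_VALUE + 1]; // index 0 is unused

    // Parses up to five decimal digits by hand; returns -1 if s is not a value in 1..MAX_VALUE.
    static int parse(CharSequence s) {
        if (s.length() == 0 || s.length() > 5) {
            return -1;
        }
        int value = 0;
        for (int i = 0; i < s.length(); i++) {
            int digit = s.charAt(i) - '0';
            if (digit < 0 || digit > 9) {
                return -1;
            }
            value = value * 10 + digit;
        }
        return (value >= 1 && value <= MAX_VALUE) ? value : -1;
    }

    // Records one occurrence; returns false if s is not a valid value.
    boolean add(CharSequence s) {
        int v = parse(s);
        if (v < 0) {
            return false;
        }
        counts[v]++;
        return true;
    }

    // Number of times s has been added; 0 for invalid or unseen values.
    int count(CharSequence s) {
        int v = parse(s);
        return v < 0 ? 0 : counts[v];
    }
}
```

If only membership matters, as in validating tags against a fixed dictionary, a boolean[] of the same length would do and would take roughly a quarter of the memory.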
Maybe you could think about caching the hashcode and only updating it if your collection has changed since the last time hashcode() was called. See Caching hashes in Java collections?
«Insert disclaimer about only doing this when it's a hot spot in your application and you can prove it»
Well the integer value itself will be a perfect hash function, you will not get any collisions. However there are two problems with this approach:
HashMap doesn't allow you to specify a custom hash function. So either you'll have to implement your own HashMap or you use a wrapper object.
HashMap uses a bitwise and instead of a modulo operation to find the bucket. This obviously throws bits away since it's just a mask. java.util.HashMap.hash(int) tries to compensate for this but I have seen claims that this is not very successful. Again we're back to implementing your own HashMap.
At this point, since you're using the integer value as a hash function, why not use the integer value as the key in the HashMap instead of the string? If you really want to optimize this, you can write a hash map that uses int instead of Integer keys, or use TIntObjectHashMap from Trove.
If you're really interested in finding good hash functions I can recommend Hashing in Smalltalk, just ignore the half dozen pages where the author rants about Java (disclaimer: I know the author).
I understand the purpose of using hash functions to securely store passwords. I have used arrays and arraylists for class projects for sorting and searching data. What I am having trouble understanding is the practical value of hashtables for something like sorting and searching.
I got a lecture on hashtables but we never had to use them in school, so it hasn't clicked. Can someone give me a practical example of a task a hashtable is useful for that couldn't be done with a numerical array or arraylist? Also, a very simple low level example of a hash function would be helpful.
There are all sorts of collections out there. Collections are used for storing and retrieving things, so one of the most important properties of a collection is how fast these operations are. To estimate this, people in computer science use big-O notation, which roughly describes how the number of individual operations needed to invoke a certain method (be it get or set, for example) grows with the size of the collection. For example, to get an element of an ArrayList by index you need exactly 1 operation; this is O(1). If you have a LinkedList of length n and you need to get something from the middle, you'll have to traverse from the start of the list to the middle, taking n/2 operations; in this case get has a complexity of O(n). The same applies to key-value stores such as a hashtable. There are implementations that give you a complexity of O(log n) to get a value by its key, whereas a hashtable copes in O(1). Basically, it means that getting a value from a hashtable by its key is really cheap.
Basically, hashtables have similar performance characteristics to arrays with numerical indices (cheap lookup, and cheap appending in the case of arrays; hashtables are unordered, and adding to them is cheap partly because of this), but are much more flexible in terms of what the key may be. Given a contiguous chunk of memory and a fixed size per item, you can get the address of the nth item very easily and cheaply. That's thanks to the indices being integers - you can't do that with, say, strings. At least not directly. Hashing allows reducing any object (that implements it) to a number, and you're back to arrays. You still need to add checks for hash collisions and resolve them (which incurs mostly a memory overhead, since you need to store the original key), but with a halfway decent implementation, this is not much of an issue.
So you can now associate any (hashable) object with any (really any) value. This has countless uses (although I have to admit, I can't think of one that's applicable to sorting or searching). You can build caches with small overhead (because checking if the cache can help in a given case is O(1)), implement a relatively performant object system (several dynamic languages do this), you can go through a list of (id, value) pairs and accumulate the values for identical ids in any way you like, and many other things.
Very simple. Hashtables are often called "associative arrays." Arrays allow you to access your data by index. Hash tables allow you to access your data by any other identifier, e.g. a name. For example
one is associated with 1
two is associated with 2
So, when you get the word "one" you can find its value 1 using a hashtable where the key is one and the value is 1. An array allows only the opposite mapping.
For n data elements:
Hashtables allow O(k) searches (the cost usually depends only on the hashing function). This is better than the O(log n) of binary search (which requires an O(n log n) sort first; if the data is not sorted you are worse off).
However, on the flip side, the hashtables tend to take roughly 3n amount of space.
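Since the question asks for a very simple low-level hash function and a practical task, here is a minimal sketch of both. simpleHash uses the multiply-by-31 polynomial that the String.hashCode() documentation describes, and wordCounts tallies word frequencies with a HashMap, a task where a numerically indexed array does not help because the keys are strings. The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a simple string hash and a typical hashtable task.
class SimpleHashing {
    // Polynomial hash: s[0]*31^(n-1) + ... + s[n-1], computed with wrapping int arithmetic.
    static int simpleHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i); // overflow simply wraps around, which is fine for hashing
        }
        return h;
    }

    // Counts how often each whitespace-separated word occurs in the text.
    static Map<String, Integer> wordCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum); // insert 1 or add 1 to the existing count
            }
        }
        return counts;
    }
}
```

Each lookup or update in wordCounts costs roughly the time to hash one word, regardless of how many distinct words have been seen so far.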
What is the easiest way in Java to map strings (Java String) to (positive) integers (Java int), so that
equal strings map to equal integers, and
different strings map to different integers?
So, similar to hashCode(), but different strings are required to produce different integers. In a sense, it would be a hashCode() without the possibility of collisions.
An obvious solution would maintain a mapping table from strings to integers,
and a counter to guarantee that new strings are assigned a new integer. I'm just wondering
how is this problem usually solved.
Would also be interesting to extend it to other objects than strings.
Have a look at perfect hashing.
This is impossible to achieve without any restrictions, simply because there are more possible Strings than there are integers, so eventually you will run out of numbers.
A solution is only possible when you limit the number of usable Strings. Then you can use a simple counter. Here is a simple implementation in which up to 2^32 = 4294967296 different strings can be used. Never mind that it uses lots of memory.
import java.util.HashMap;
import java.util.Map;

public class StringToInt {
    private Map<String, Integer> map;
    private int counter = Integer.MIN_VALUE;

    public StringToInt() {
        map = new HashMap<String, Integer>();
    }

    public int toInt(String s) {
        Integer i = map.get(s);
        if (i == null) {
            map.put(s, counter);
            i = counter;
            ++counter;
        }
        return i;
    }
}
There's not going to be an easy or complete solution. We use hashes because there are way more possible Strings than there are ints. Collisions are just a limitation of using a finite number of bits to represent integers.
In most hashcode() type implementations, collisions are accepted as inevitable and tested for.
If you absolutely must have no collisions, guaranteed, the solution you outline will work.
Aside from this, there are cryptographic hash functions such as MD5 and SHA, where collisions are extremely unlikely (though with a lot of effort can be forced). The Java Cryptography Architecture has implementations of these. Those methods may perhaps be faster than a good implementation of your solution for very large sets. They will also execute in constant time and give the same code for the same string, no matter which order the strings are added in. Also, it doesn't require storing each string. Crypto hash results could be considered as integers but they won't fit in a java int - you could use a BigInteger to hold them as suggested in another answer.
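A minimal sketch of the cryptographic-hash route, using SHA-256 from the standard MessageDigest API and holding the result in a BigInteger. The class and method names are hypothetical, and the choice of SHA-256 and UTF-8 is an assumption made for the example.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: maps a string to a 256-bit non-negative integer.
class DigestAsInteger {
    static BigInteger sha256AsInteger(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, digest); // signum 1: treat the bytes as a non-negative number
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a standard algorithm name, so this is not expected in practice.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```

The result depends only on the string's content, not on the order in which strings are seen, but it needs up to 256 bits and therefore cannot be stored in an int.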
Incidentally, if you're put off by the idea of a collision being 'extremely unlikely', it's probably similar likelihood that a bit would randomly flip in your computer memory or hard disk and cause any program to behave differently than you expect :-)
Note, there are also some theoretical weaknesses in some hash functions (e.g. MD5) but for your purposes that probably doesn't matter and you could just use the most efficient such function - those weaknesses are only relevant if someone is maliciously trying to come up with strings that have the same code as another string.
edit: I just noticed in the title of your question, it seems you want bidirectional mapping, though you don't actually state this in the question. It is (by design) not possible to go from a Crypto hash to the original string. If you really need that, you'd have to store a map keying hashes back to strings.
I'd try to do it by introducing an object holding a Map<String, Integer> and a Map<Integer, String>. Adding Strings to that object (or maybe having them created from said object) will assign them an Integer value. Requesting an Integer value for a String that is already registered will return the same value.
Drawbacks: Different launches will yield different Integers for the same String, depending on order unless you somehow persist the whole thing. Also, it's not very object oriented and requires a special object to create/register a String.
Plus side: It's quite similar to internalizing Strings and easily understandable. (Also, you asked for an easy, not elegant way.)
For the more general case, you might create a high level subclass of Object, introduce a "integerize" method there and extend every single class from that. I think, however, that road leads to tears.
Since Strings in java are unbounded in length, and each character has 16 bits, and ints have 32 bits, you could only produce a unique mapping of Strings to ints if the Strings were up to two characters. But you could use BigInteger to produce a unique mapping, with something like:
String s = "my string";
BigInteger bi = new BigInteger(s.getBytes());
Reverse mapping:
String str = new String(bi.toByteArray());
Can you use a Map<String, Integer> to indicate which Strings you have already assigned integers to? That's kind of the "database-y" solution, where you assign each String a "primary key" from a sequence as it comes up. Then you put the String and Integer pair into the Map so you can look it up again. And if you need the String for a given Integer, you can also put the same pair into a Map<Integer, String>.
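A minimal sketch of this two-map approach, assuming ids are handed out from a counter starting at 1 so that they stay positive. The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a bidirectional String <-> int registry.
class StringRegistry {
    private final Map<String, Integer> idsByString = new HashMap<>();
    private final Map<Integer, String> stringsById = new HashMap<>();
    private int nextId = 1; // ids are positive and assigned in order of first appearance

    // Returns the id for s, assigning a fresh one the first time s is seen.
    int idOf(String s) {
        Integer id = idsByString.get(s);
        if (id == null) {
            if (nextId == Integer.MAX_VALUE) {
                throw new IllegalStateException("out of ids");
            }
            id = nextId++;
            idsByString.put(s, id);
            stringsById.put(id, s);
        }
        return id;
    }

    // Reverse lookup; returns null for an id that was never assigned.
    String stringOf(int id) {
        return stringsById.get(id);
    }
}
```

As noted in the answers, the ids depend on insertion order, so two runs agree only if the strings arrive in the same order or the registry is persisted.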
As you outline, a hash table that resolves collisions is a standard solution. You could also use a Bentley/Sedgewick style search trie, which in many applications is faster than hashing.
If you substitute 'unique pointer' for 'unique integer' you can see Dave Hanson's solution to this problem in C. This is quite a nice abstraction because
The pointers can still be used as C strings.
Equal strings hash to equal pointers, so strcmp can be dispensed with in favor of pointer equality, and the pointers can be used as keys in other hash tables.
If Java offers a test for object identity on String objects then you can play the same game there.
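Java does offer such a test: == compares object identity, and String.intern() returns one canonical instance per distinct content. A minimal sketch, with hypothetical class and method names:

```java
// Hypothetical demo of identity comparison on interned strings.
class InternDemo {
    // Returns the canonical pooled instance for the given content.
    static String canonical(String s) {
        return s.intern();
    }

    public static void main(String[] args) {
        String a = new String("hello"); // two distinct objects with equal content
        String b = new String("hello");
        System.out.println(a == b);                       // identity differs
        System.out.println(canonical(a) == canonical(b)); // same pooled object
    }
}
```

This gives the "unique pointer" behaviour, but not a unique int: the reference itself is the key, so it can be compared with == but not used as a number.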
If by integer you mean the data type, then, as other posters have explained, this is quite impossible, due to the fact that the integer data type is of fixed size and strings are unbounded.
However if you simply mean a positive number, then theoretically you should be able to interpret the string as if it were an "integer" simply by regarding it as a byte array (in a consistent encoding). You could also treat it as an array of integers of arbitrary length, but if you can do that why not just use a string? :)
Implementation-wise, this is usually "solved" by using a hash code and simply double-checking any collisions, since there are likely to be none anyway, and on the off chance that there is a collision, it still works out to be constant time. However, if this isn't applicable, I'm not sure what the best solution would be.
Interesting question.
I don't know if this is practical, but if we take only the lowercase letters of the alphabet, then every word can be viewed as a number in a base-26 positional system. For example, if a is 0 and z is 25, then boom is 1*26^3 + 14*26^2 + 14*26^1 + 12*26^0 = 27416
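A minimal sketch of this base-26 reading, with hypothetical names. It uses a long and the exact-arithmetic helpers from java.lang.Math, so words too long to fit raise an ArithmeticException instead of silently wrapping.

```java
// Hypothetical sketch: read a lowercase word as a base-26 number with a = 0 ... z = 25.
class Base26 {
    static long encode(String word) {
        long value = 0;
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            if (c < 'a' || c > 'z') {
                throw new IllegalArgumentException("only a-z allowed: " + word);
            }
            // multiplyExact/addExact throw ArithmeticException on overflow.
            value = Math.addExact(Math.multiplyExact(value, 26L), c - 'a');
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(encode("boom")); // 27416, matching the worked example
    }
}
```

One caveat with a = 0 is that leading a's behave like leading zeros, so "a" and "aa" both encode to 0. Numbering the letters 1 to 26 instead would avoid that, at the cost of no longer being a plain base-26 numeral.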