I'm looking for the most effective way of creating hashcodes for a very specific case of strings.
I have strings that can be converted to integer, they vary from 1 to 10,000, and they are very concentrated on the 1-600 range.
My question is what is the most effective way, in terms of performance for retrieving the items from a collection to implement the hashcode for it.
What I'm thinking is:
I can have the strings converted to integer and use a direct acess table (an array of 10.000 rows) - this will be very fast for retrieving but not very smart in terms of memory allocation;
I can use the strings as strings and get a hashcode for it (i wont have to convert it to integer, but i dont know how effective will be the hashcode for the strings in terms of collisions)
Any other ideas are greatly appreciated.
thanks a lot
Thanks everyone for your promptly replies...
There is another information Tha i've forget to add on this. I tink it Will Make this clear if I let you know my final goal with this-I migh not even need a hash table!!!
I just want to validate a stream against a dictiory that is immutable. I want to check if a given tag might or might not be present on my message.
I will receive a string with several pairs tag=value. I want to verify if the tag must or must not be treated by my app.
You might want to consider a trie (http://en.wikipedia.org/wiki/Trie) or radix tree (http://en.wikipedia.org/wiki/Radix_tree). No need to parse the string into an integer, or compute a hash code. You're walking a tree as you walk the string.
Edit:
Both computing a hash code on a string and parsing an integer out of a string involve walking the entire the string, and THEN using that value as a look-up into a specific data structure. Other techniques might involve simultaneously inspecting the string WHILE traversing a data structure. This MIGHT be of value to the poster who asked for "other ideas".
Many collections (e.g. HashMap) already apply a supplemental "rehash" method to help with poor hashcode algorithms. e.g. browse the cource code for HashMap.hash(). And Strings are very common keys, so you can be sure that String.hashCode() is highly optimized. SO, unless you notice a lot of collisions between your hashCodes, I'd go with the standard code.
I tried putting the Strings for 0..600 into a HashSet to see what happened, but it's then pretty tedious to see how many entries had collisions. Look for yourself! If you really really care, copy the source code from HashMap into your own class, edit it so you can get access to the entries (in the Java 6 source code I'm looking at, that would be transient Entry[] table, YMMV), and add methods to count collisions.
If there are only a limited valid range of values, why not represent the collection as a int[10000] as you suggested? The value at array[x] is the number of times that x occurs.
If your strings are represented as decimal integers, then parsing them to strings is a 5-iteration loop (up to 5 digits) and a couple of additions and subtractions. That is, it is incredibly fast. Inserting the elements is effectively O(1), retrieval is O(1). Memory required is around 40kb (4 bytes per int).
One problem is that insertion order is not preserved. Maybe you don't care.
Maybe you could think about caching the hashcode and only updating it if your collection has changed since the last time hashcode() was called. See Caching hashes in Java collections?
«Insert disclaimer about only doing this when it's a hot spot in your application and you can prove it»
Well the integer value itself will be a perfect hash function, you will not get any collisions. However there are two problems with this approach:
HashMap doesn't allow you to specify a custom hash function. So either you'll have to implement you own HashMap or you use a wrapper object.
HashMap uses a bitwise and instead of a modulo operation to find the bucket. This obviously throws bits away since it's just a mask. java.util.HashMap.hash(int) tries to compensate for this but I have seen claims that this is not very successful. Again we're back to implementing your own HashMap.
Now that this point since you're using the integer value as a hash function why not use the integer value as a key in the HashMap instead of the string? If you really want optimize this you can write a hash map that uses int instead of Integer keys or use TIntObjectHashMap from trove.
If you're really interested in finding good hash functions I can recommend Hashing in Smalltalk, just ignore the half dozen pages where the author rants about Java (disclaimer: I know the author).
Related
To avoid any confusions I am re framing my question based on my research on hashing algorithms
Problem statement
I have multiple text files containing variable length data records. I need find if there are duplicate records in the input. Each of the text files could have data records in millions.
Since I cannot load all the data in memory at once, I plan to create a hash of the key fields in the record when it is processed. Processing a record would mean validating, filtering and transforming it. After processing all the records in all the text files, they are merged to create one view of the whole input (either a text file or a database table).
Finding duplicates would be much faster if a hash value is generated for all the records. If there are collisions of hash values, only those records could be checked for equality (assuming the hashing algorithm is deterministic)
Question - What hash algorithms should I consider for such input i.e. variable length data?
Short Answer
Don't do it. Use the Java map. You can find details here:
http://docs.oracle.com/javase/6/docs/api/java/util/Map.html
Long Answer
You can create a perfect hashing function by treating your string as a number in base-N where N is all of the possible values any character can take on. The problem here is memory. Hashing functions are meant to be used with arrays, which means you'll need an array large enough to handle the results of your hash, and that is impractical.
For instance, take a modest example of a 10 character key. Let's be even more modest and assume they are guaranteed to consist solely of lower-case letters. That gives you 26 possibilities for each character, and 10 characters. This means the possible combinations are:
26 ^ 10 = 141,167,095,653,376
If you look up hashing algorithms, one of the first things they include is collision detection because they acknowledge that collisions are a fact of life.
Now you say you are not loading keys in memory, yet why are you using a hash then? The point of a hash is to give you a mapping onto an array index. Perhaps you're better off using another mechanism.
Possible Solutions
If you are concerned about memory, get some statistics on the duplicates in your file. If you only store a flag to indicate the occurrence of a particular key in the hash, and you have many duplicates, you may be able to just use Java's map. Java's map handles collisions, so that won't keep you from detecting unique keys. You can rest assured that if A[x] is found, that means x is in A, even if x's hash collided with a previous hash.
Next, you could try some utilities to pull out duplicates. Since they would have been written specifically for the purpose, they should be able to handle a large amount of data.
Finally, you could try putting your entries into a database and using that to handle duplicates. This may seem like overkill, but databases are optimized for dealing with very large numbers of records.
This is an extension to the Map idea. Before resorting to this I would check that it cannot be done by simply building a HashSet representing all the strings at once. Remember you can use a 64-bit JVM and set a large heap size.
Define a class StringLocation that contains the data you would need to do a random access to a string on disk - for example, a reference to a RandomAcessFile and an offset within file. If you cannot keep all the files open at once, open and close as needed, caching the RandomAcessFile for the most used files.
Create a HashMap<Integer,List<StringLocation>>.
Start reading the strings. For each string, convert to lower case and obtain its hashCode(), hash, in Integer form. If there is an entry in the Map with hash as key, compare the new string to each string represented in the existing value, doing random file access to get to the already processed strings. Use the String equalsIgnoreCase. If there is a match, you have a duplicate. If there is no match, append a new StringLocation, representing the current string, to the List.
This requires at most two strings to be in memory at a time, the one you are currently processing and a previously processed string with the same hashCode() result to which you are comparing it.
You can further reduce the number of times you have to re-read a string for an equals check by using MessageDigest to generate, for the lower case string, a wide checksum with low risk of collisions, and saving it in the StringLocation object. During a comparison, return false if the checksums do not match, without re-reading the strings.
This question already has answers here:
Closed 12 years ago.
Possible Duplicate:
What exactly are hashtables?
I understand the purpose of using hash functions to securely store passwords. I have used arrays and arraylists for class projects for sorting and searching data. What I am having trouble understanding is the practical value of hashtables for something like sorting and searching.
I got a lecture on hashtables but we never had to use them in school, so it hasn't clicked. Can someone give me a practical example of a task a hashtable is useful for that couldn't be done with a numerical array or arraylist? Also, a very simple low level example of a hash function would be helpful.
There are all sorts of collections out there. Collections are used for storing and retrieving things, so one of the most important properties of a collection is how fast these operations are. To estimate "fastness" people in computer science use big-O notation which sort of means how many individual operations you have to accomplish to invoke a certain method (be it get or set for example). So for example to get an element of an ArrayList by an index you need exactly 1 operation, this is O(1), if you have a LinkedList of length n and you need to get something from the middle, you'll have to traverse from the start of the list to the middle, taking n/2 operations, in this case get has complexity of O(n). The same comes to key-value stores as hastable. There are implementations that give you complexity of O(log n) to get a value by its key whereas hastable copes in O(1). Basically it means that getting a value from hashtable by its key is really cheap.
Basically, hashtables have similar performance characteristics (cheap lookup, cheap appending (for arrays - hashtables are unordered, adding to them is cheap partly because of this) as arrays with numerical indices, but are much more flexible in terms of what the key may be. Given a continuous chunck of memory and a fixed size per item, you can get the adress of the nth item very easily and cheaply. That's thanks to the indices being integers - you can't do that with, say, strings. At least not directly. Hashes allows reducing any object (that implements it) to a number and you're back to arrays. You still need to add checks for hash collisions and resolve them (which incurs mostly a memory overhead, since you need to store the original value), but with a halfway decent implementation, this is not much of an issue.
So you can now associate any (hashable) object with any (really any) value. This has countless uses (although I have to admit, I can't think of one that's applyable to sorting or searching). You can build caches with small overhead (because checking if the cache can help in a given case is O(1)), implement a relatively performant object system (several dynamic languages do this), you can go through a list of (id, value) pairs and accumulate the values for identical ids in any way you like, and many other things.
Very simple. Hashtables are often called "associated arrays." Arrays allow access your data by index. Hash tables allow access your data by any other identifier, e.g. name. For example
one is associated with 1
two is associated with 2
So, when you got word "one" you can find its value 1 using hastable where key is one and value is 1. Array allows only opposite mapping.
For n data elements:
Hashtables allows O(k) (usually dependent only on the hashing function) searches. This is better than O(log n) for binary searches (which follow an n log n sorting, if data is not sorted you are worse off)
However, on the flip side, the hashtables tend to take roughly 3n amount of space.
Let's say that I need to make a mapping from String to an integer. The integers are unique and form a continuous range starting from 0. That is:
Hello -> 0
World -> 1
Foo -> 2
Bar -> 3
Spam -> 4
Eggs -> 5
etc.
There are at least two straightforward ways to do it. With a hashmap:
HashMap<String, Integer> map = ...
int integer = map.get(string); // Plus maybe null check to avoid NPE in unboxing.
Or with a list:
List<String> list = ...
int integer = list.indexOf(string); // Plus maybe check for -1.
Which approach should I use, and why? Arguably the relative performance depends on the size of the list/map, since List#indexOf() is a linear search using String#equals() -> O(n) efficiency, while HashMap#get() uses hash to narrow down the search -> certainly more efficient when the map is big, but maybe inferior when there are just few elements (there must be some overhead in calculating the hash, right?).
Since benchmarking Java code properly is notoriously hard, I would like to get some educated guesses. Is my reasoning above correct (list is better for small, map is better for large)? What is the threshold size approximately? What difference do various List and HashMap implementations make?
A third option and possibly my favorite would be to use a trie:
I bet it beats the HashMap in performance (no collisions + the fact that computing the hash-code is O(length of string) anyway), and possibly also the List approach in some cases (such as if your strings have long common prefixes, as the indexOf would waste lot of time in the equals methods).
When choosing between List and Map I would go for a Map (such as HashMap). Here is my reasoning:
Readability
The Map interface simply provides a more intuitive interface for this use case.
Optimization in the right place
I'd say if you're using a List you would be optimizing for the small cases anyway. That's probably not where the bottle neck is.
A fourth option would be to use a LinkedHashMap, iterate through it if the size is small, and get the associated number if the size is large.
A fifth option is to encapsulate the decision in a separate class all together. In this case you could even implement it to change strategy in runtime as the list grows.
You're right: a List would be O(n), a HashMap would be O(1), so a HashMap would be faster for n large enough so that the time to calculate the hash didn't swamp the List linear search.
I don't know the threshold size; that's a matter for experimentation or better analytics than I can muster right now.
Your question is totally correct on all points:
HashMaps are better (they use a hash)
Benchmarking Java code is hard
But at the end of the day, you're just going to have to benchmark your particular application. I don't see why HashMaps would be slower for small cases but the benchmarking will give you the answer if it is or not.
One more option, a TreeMap is another map data structure which uses a tree as opposed to a hash to access the entries. If you are doing benchmarking, you might as well benchmark that as well.
Regarding benchmarking, one of the main problems is the garbage collector. However if you do a test which doesn't allocate any objects, that shouldn't be a problem. Fill up your map/list, then just write a loop to get N random elements, and then time it, that should be reasonably reproducable and therefore informative.
Unfortunately, you are going to have to benchmark this yourself, because the relative performance will depend critically on the actual String values, and also on the relative probability that you will test a string that is not in your mapping. And of course, it depends on how String.equals() and String.hashCode() are implemented, as well as the details of the HashMap and List classes used.
In the case of a HashMap, a lookup will typically involve calculating the hash of the key String, and then comparing the key String with one or more entry key Strings. The hashcode calculation looks at all characters of the String, and is therefore dependent on the key String. The equals operations typically will typically examine all of the characters when equals returns true and considerably less when it returns false. The actual number of times that equals is called for a given key string depends on how the hashed key strings are distributed. Normally, you'd expect an average of 1 or 2 calls to equal for a "hit" and maybe up to 3 for a "miss".
In the case of a List, a lookup will call equals for an average of half the entry key Strings in the case of a "hit" and all of them in the case of a "miss". If you know the relative distribution of the keys that you are looking up, you can improve the performance in the "hit" case by ordering the list. But the "miss" case cannot be optimized.
In addition to the trie alternative suggested by #aioobe, you could also implement a specialized String to integer hashmap using a so-called perfect hash function. This maps each of the actual key strings to a unique hash within a small range. The hash can then be used to index an array of key/value pairs. This reduces a lookup to exactly one call to hash function and one call to String.equals. (And if you can assume that supplied key will always be one of the mapped strings, you can dispense with the call to equals.)
The difficulty of the perfect hash approach is in finding a function that works for the set of keys in the mapping and is not too expensive to compute. AFAIK, this has to be done by trial and error.
But the reality is that simply using a HashMap is a safe option, because it gives O(1) performance with a relatively small constant of proportionality (unless the entry keys are pathological).
(FWIW, my guess is that the break-even point where HashMap.get() becomes better than List.contains() is less than 10 entries, assuming that the strings have an average length of 5 to 10.)
From what I can remember, the list method will be O(n),but would be quick to add items, as no computation occurs. You could get this lower O(log n) if you implemented a b-search or other searching algorithms. The hash is O(1), but its slower to insert, since the hash needs to be computed every time you add an element.
I know in .net, theres a special collection called a HybridDictionary, that does exactly this. Uses a list to a point, then a hash. I think the crossover is around 10, so this may be a good line in the sand.
I would say you're correct in your above statement, though I'm not 100% sure if a list would be faster for small sets, and where the crossover point is.
I think a HashMap will always be better. If you have n strings each of length at most l, then String#hashCode and String#equals are both O(l) (in Java's default implementation, anyway).
When you do List#indexOf it iterates through the list (O(n)) and performs a comparison on each element (O(l)), to give O(nl) performance.
Java's HashMap has (let's say) r buckets, and each bucket contains a linked list. Each of these lists is of length O(n/r) (assuming the String's hashCode method distributes the Strings uniformly between the buckets). To look up a String, you need to calculate the hashCode (O(l)), look up the bucket (O(1) - one, not l), and iterate through that bucket's linked list (O(n/r) elements) doing an O(l) comparison on each one. This gives a total lookup time of O(l + (nl)/r).
As the List implementation is O(nl) and the HashMap implementation is O(nl/r) (I'm dropping the first l as it's relatively insignificant), lookup performance should be equivalent when r=1 and the HashMap will be faster for all greater values of r.
Note that you can set r when you construct the HashMap using this constructor (set the initialCapacity to r and the loadFactor argument to n/r for your given n and chosen r).
I've used an array as hash table for hashing alogrithm with values:
int[] arr={4 , 5 , 64 ,432 };
and keys with consective integers in array as:
int keys[]={ 1 , 2 , 3 ,4};
Could anyone please tell me, what would be the good approach in mapping those integers keys with those arrays location? Is the following a short and better approach with little or no collision (or something larger values)?
keys[i] % arrlength // where i is for different element of an array
Thanks in advance.
I assume you're trying to implement some kind of hash table as an exercise. Otherwise, you should just use a java.util.HashMap or java.util.HashTree or similar.
For a small set of values, as you have given above, your solution is fine. The real question will come when your data grows much bigger.
You have identified that collisions are undesirable - that is true. Sometimes, some knowledge of the likely keys can help you design a good hash function. Sometimes, you can assume that the key class will have a good hash() method. Since hash() is a method defined by Object, every class implements it. It would be neatest for you to be able to utilise the hash() method of your key, rather than have to build a new algorithm specially for your map.
If all integer keys are equally likely, then a mod function will spread them out evenly amongst the different buckets, minimising collisions. However, if you know that the keys are going to be numbered consecutively, it might be better to use a List than a HashMap - this will guarantee no collisions.
Any reason not to use the built-in HashMap ? You will have to use Integer though, not int.
java.util.Map myMap = new java.util.HashMap<Integer, Integer>();
Since you want to implement your own, then first brush-up on hash tables by reading the Wikipedia article. After that, you could study the HashMap source code.
This StackOverflow question contains interesting links for implementing fast hashmaps (for C++ though), as does this one (for Java).
Get yourself an book about algorithms and data structures and read the chapter about hash tables (The Wikipedia article would also be a good entry point). It's a complex topic and far beyond the scope of a Q&A site like this.
For starters, using the array-size modulo is in general a horrible hash function, because it results in massive collisions when the values are multiples of the array size or one of its divisors. How bad that is depends on the array size: the more divisors it has, the more likely are collisions; when it's a prime number, it's not too bad (but not really good either).
What is the easiest way in Java to map strings (Java String) to (positive) integers (Java int), so that
equal strings map to equal integers, and
different strings map to different integers?
So, similar to hashCode() but different strings are required to produce different integers. So, in a sense, it would be a hasCode() without the collision possibility.
An obvious solution would maintain a mapping table from strings to integers,
and a counter to guarantee that new strings are assigned a new integer. I'm just wondering
how is this problem usually solved.
Would also be interesting to extend it to other objects than strings.
Have a look at perfect hashing.
This is impossible to achieve without any restrictions, simply because there are more possible Strings than there are integers, so eventually you will run out of numbers.
A solution is only possible when you limit the number of usable Strings. Then you can use a simple counter. Here is a simple implementation where all (2^32 = 4294967296 different strings) can be used. Never mind that it uses lots of memory.
import java.util.HashMap;
import java.util.Map;
public class StringToInt {
private Map<String, Integer> map;
private int counter = Integer.MIN_VALUE;
public StringToInt() {
map = new HashMap<String, Integer>();
}
public int toInt(String s) {
Integer i = map.get(s);
if (i == null) {
map.put(s, counter);
i = counter;
++counter;
}
return i;
}
}
There's not going to be an easy or complete solution. We use hashes because there are way more possible Strings than there are ints. Collisions are just a limitation of using a finite number of bits to represent integers.
In most hashcode() type implementations, collisions are accepted as inevitable and tested for.
If you absolutely must have no collisions, guaranteed, the solution you outline will work.
Aside from this, there are cryptographic hash functions such as MD5 and SHA, where collisions are extremely unlikely (though with a lot of effort can be forced). The Java Cryptography Architecture has implementations of these. Those methods may perhaps be faster than a good implementation of your solution for very large sets. They will also execute in constant time and give the same code for the same string, no matter which order the strings are added in. Also, it doesn't require storing each string. Crypto hash results could be considered as integers but they won't fit in a java int - you could use a BigInteger to hold them as suggested in another answer.
Incidentally, if you're put off by the idea of a collision being 'extremely unlikely', it's probably similar likelihood that a bit would randomly flip in your computer memory or hard disk and cause any program to behave differently than you expect :-)
Note, there are also some theoretical weaknesses in some hash functions (e.g. MD5) but for your purposes that probably doesn't matter and you could just use the most efficient such function - those weaknesses are only relevant if someone is maliciously trying to come up with strings that have the same code as another string.
edit: I just noticed in the title of your question, it seems you want bidirectional mapping, though you don't actually state this in the question. It is (by design) not possible to go from a Crypto hash to the original string. If you really need that, you'd have to store a map keying hashes back to strings.
I'd try to do by introducing an object holding Map and Map. Adding Strings to that object (or maybe having them created from said object) will assign them an Integer value. Requesting a Integer value for a String already registered will return the same value.
Drawbacks: Different launches will yield different Integers for the same String, depending on order unless you somehow persist the whole thing. Also, it's not very object oriented and requires a special object to create/register a String.
Plus side: It's quite similar to internalizing Strings and easily understandable. (Also, you asked for an easy, not elegant way.)
For the more general case, you might create a high level subclass of Object, introduce a "integerize" method there and extend every single class from that. I think, however, that road leads to tears.
Since Strings in java are unbounded in length, and each character has 16 bits, and ints have 32 bits, you could only produce a unique mapping of Strings to ints if the Strings were up to two characters. But you could use BigInteger to produce a unique mapping, with something like:
String s = "my string";
BigInteger bi = new BigInteger(s.getBytes());
Reverse mapping:
String str = new String(bi.toByteArray());
Can you use a Map to indicate which Strings you already have assigned integers to? That's kind of the "database-y" solution, where you assign each String a "primary key" from a sequence as it comes up. Then you put the String and Integer pair into a Map so you can look it up again. And if you need the String for a given Integer, you can also put the same pair into a Map.
As you outline, a hash table that resolves collisions is a standard solution. You could also use a Bentley/Sedgewick style search trie, which in many applications is faster than hashing.
If you substitute 'unique pointer' for 'unique integer' you can see Dave Hanson's solution to this problem in C. This is quite a nice abstraction because
The pointers can still be used as C strings.
Equal strings hash to equal pointers, so strcmp can be dispensed with in favor of pointer equality, and the pointers can be used as keys in other hash tables.
If Java offers a test for object identity on String objects then you can play the same game there.
If by integer you mean the data type, then as other posters have explained this is quite impossible, due to the fact that the integer data type is of fixed size, and strings are unbound.
However if you simply mean a positive number, then theoretically you should be able to interpret the string as if it were an "integer" simply by regarding it as a byte array (in a consistent encoding). You could also treat it as an array of integers of arbitrary length, but if you can do that why not just use a string? :)
Implementation speaking, this is usually "solved" by using a hash code and simply double-checking any collisions, since there are likely to be none anyway and on the off chance there is a collision, it still works out to be constant time. However if this isn't applicable, I'm not sure what the best solution would be.
Interesting question.
I don't know if this is practical, but if we take only lowercase letter alphabet, than every word can be viewed as a number in 26-base positional system. For example, if a is 0 and z is 25 than boom is 1*26^3 + 14*26^2 + 14*26^1 + 12*26^0 = 27416