How do I hash strings to RandomAccessFile - java

I'm having trouble understand how my teacher wants us to do the second part of this project. The first part was easy getting the employees and putting it in a r.a.f, but I dont understand how i can hash ssn which is a string. Can someone please explain it. I put the directions just in case.Thanks
ssn: String(9 characters),
fullname : string of 50 characters,
salary: float, and
age: int
1.The program writes at least 20 Employees and stores them sequentially on the random access, and then sequentially reads and prints all Employees in a readable format.
2.Redo the previous part assuming the employees are stored based on hashing the ssn( non-sequentially)

You can create a hash of pretty much any type. If I understand correctly, we want to use a distinct data value, an SSN, to make a hash that is used for keying into some random-access store.
My first thought is, if you need a strong hash based on more than just the String itself, to wrap the SSN in a Class, and use that to both store and fetch the SSN value, as well as provide a hashCode() implementation based on one or more fields along with a seed. This is described a bit in Block's "Effective Java", where a HashCodeUtil class is introduced.
A working example is discussed here: http://www.javapractices.com/topic/TopicAction.do?Id=28

It looks to me like you are trying to implement your own hash table using RandomAccessFile. If that is true what you need to do is take a hashcode and turn it into a slot number. To do this modulo divide the hashcode() of the SSN string by some prime number close to double the number of entries you expect in the table. In your case maybe 41. Each slot in the RandomAccessFile is the size of an entry. In you case 9+50+4+4 = 67. Multiply the slot number by the entry size to give you the offset of the location in the RandomAccessFile to read and write.
Note that this, as in all hash tables, may lead to collisions. A real world implementation would then handle some kind of entry chaining.

Related

What hashing algorithms to consider for variable length data

To avoid any confusions I am re framing my question based on my research on hashing algorithms
Problem statement
I have multiple text files containing variable length data records. I need find if there are duplicate records in the input. Each of the text files could have data records in millions.
Since I cannot load all the data in memory at once, I plan to create a hash of the key fields in the record when it is processed. Processing a record would mean validating, filtering and transforming it. After processing all the records in all the text files, they are merged to create one view of the whole input (either a text file or a database table).
Finding duplicates would be much faster if a hash value is generated for all the records. If there are collisions of hash values, only those records could be checked for equality (assuming the hashing algorithm is deterministic)
Question - What hash algorithms should I consider for such input i.e. variable length data?
Short Answer
Don't do it. Use the Java map. You can find details here:
http://docs.oracle.com/javase/6/docs/api/java/util/Map.html
Long Answer
You can create a perfect hashing function by treating your string as a number in base-N where N is all of the possible values any character can take on. The problem here is memory. Hashing functions are meant to be used with arrays, which means you'll need an array large enough to handle the results of your hash, and that is impractical.
For instance, take a modest example of a 10 character key. Let's be even more modest and assume they are guaranteed to consist solely of lower-case letters. That gives you 26 possibilities for each character, and 10 characters. This means the possible combinations are:
26 ^ 10 = 141,167,095,653,376
If you look up hashing algorithms, one of the first things they include is collision detection because they acknowledge that collisions are a fact of life.
Now you say you are not loading keys in memory, yet why are you using a hash then? The point of a hash is to give you a mapping onto an array index. Perhaps you're better off using another mechanism.
Possible Solutions
If you are concerned about memory, get some statistics on the duplicates in your file. If you only store a flag to indicate the occurrence of a particular key in the hash, and you have many duplicates, you may be able to just use Java's map. Java's map handles collisions, so that won't keep you from detecting unique keys. You can rest assured that if A[x] is found, that means x is in A, even if x's hash collided with a previous hash.
Next, you could try some utilities to pull out duplicates. Since they would have been written specifically for the purpose, they should be able to handle a large amount of data.
Finally, you could try putting your entries into a database and using that to handle duplicates. This may seem like overkill, but databases are optimized for dealing with very large numbers of records.
This is an extension to the Map idea. Before resorting to this I would check that it cannot be done by simply building a HashSet representing all the strings at once. Remember you can use a 64-bit JVM and set a large heap size.
Define a class StringLocation that contains the data you would need to do a random access to a string on disk - for example, a reference to a RandomAcessFile and an offset within file. If you cannot keep all the files open at once, open and close as needed, caching the RandomAcessFile for the most used files.
Create a HashMap<Integer,List<StringLocation>>.
Start reading the strings. For each string, convert to lower case and obtain its hashCode(), hash, in Integer form. If there is an entry in the Map with hash as key, compare the new string to each string represented in the existing value, doing random file access to get to the already processed strings. Use the String equalsIgnoreCase. If there is a match, you have a duplicate. If there is no match, append a new StringLocation, representing the current string, to the List.
This requires at most two strings to be in memory at a time, the one you are currently processing and a previously processed string with the same hashCode() result to which you are comparing it.
You can further reduce the number of times you have to re-read a string for an equals check by using MessageDigest to generate, for the lower case string, a wide checksum with low risk of collisions, and saving it in the StringLocation object. During a comparison, return false if the checksums do not match, without re-reading the strings.

Generating unique reference number

Tech Stack: Java 1.6, JPA (Hibernate 3), Spring 3, Oracle 11g
I am working on a project where one of the requirement is to give back the customers a ‘ReferenceNumber’.
One option is to return the row ID, but for that to work, it must not be sequential. Otherwise, you can guess the next number etc.
I can generate a number in Java and store it in a separate column, but then I’ll have make sure there are no collisions.
There are ways to generate such number in database, but not sure if it will guarantee uniqueness.
Is there a best practice for such a requirement from the database point of view?
UPDATE 1
Current I am using the following in Java to generrate the number:
private static SecureRandom random = new SecureRandom();
public static BigInteger getNew() {
return new BigInteger(60, random);
}
public static BigInteger getNew(int numBits) {
return new BigInteger(numBits, random);
}
UPDATE 2: Requirement
Allowing sequential number would allow:
customer to guess the next number.
Find out how many numbers (orders) were there between two number. etc
It is preferable for this reference to be a number, but say a three letter prefix follwed by number is also fine.
If your table has a sequence generated primary key (e.g. the customer_id) then you could reverse the digits and then convert that to an octal representation. Thus it still looks like a decimal number, but it is definitely no longer consecutive and hard to guess any ranges.
The process is even reversable if you can find a way how to deal with trailing zeros in the original value (because they'd become a leading zero in the reversed number and thus will be "dropped" during the conversion).
How about prefixing the number with a customer abbreviation or name abbreviation or something (or 3 letters assigned when the customer is created, checked for uniqueness) and then just have a value stored that you increment sequentially for just that customer? That way they can't tell what the order numbers are in the rest of the system, but they can for themselves, which shouldn't really matter as they know how many orders they have placed anyway.
Why not take a SHA1 or MD5 Hash of a couple of pertinent fields (say the user's name and the time of record creation, etc.)? In many respects, if the strategy is well known, your services external to the Database will be able to recreate the reference number without having to query the database.

storing sets of integers to check if a certain set has already been mentioned

I've come across an interesting problem which I would love to get some input on.
I have a program that generates a set of numbers (based on some predefined conditions). Each set contains up to 6 numbers that do not have to be unique with integers that ranges from 1 to 100).
I would like to somehow store every set that is created so that I can quickly check if a certain set with the exact same numbers (order doesn't matter) has previously been generated.
Speed is a priority in this case as there might be up to 100k sets stored before the program stops (maybe more, but most the time probably less)! Would anyone have any recommendations as to what data structures I should use and how I should approach this problem?
What I have currently is this:
Sort each set before storing it into a HashSet of Strings. The string is simply each number in the sorted set with some separator.
For example, the set {4, 23, 67, 67, 71} would get encoded as the string "4-23-67-67-71" and stored into the HashSet. Then for every new set generated, sort it, encode it and check if it exists in the HashSet.
Thanks!
if you break it into pieces it seems to me that
creating a set (generate 6 numbers, sort, stringify) runs in O(1)
checking if this string exists in the hashset is O(1)
inserting into the hashset is O(1)
you do this n times, which gives you O(n).
this is already optimal as you have to touch every element once anyways :)
you might run into problems depending on the range of your random numbers.
e.g. assume you generate only numbers between one and one, then there's obviously only one possible outcome ("1-1-1-1-1-1") and you'll have only collisions from there on. however, as long as the number of possible sequences is much larger than the number of elements you generate i don't see a problem.
one tip: if you know the number of generated elements beforehand it would be wise to initialize the hashset with the correct number of elements (i.e. new HashSet<String>( 100000 ) );
p.s. now with other answers popping up i'd like to note that while there may be room for improvement on a microscopic level (i.e. using language specific tricks), your overal approach can't be improved.
Create a class SetOfIntegers
Implement a hashCode() method that will generate reasonably unique hash values
Use HashMap to store your elements like put(hashValue,instance)
Use containsKey(hashValue) to check if the same hashValue already present
This way you will avoid sorting and conversion/formatting of your sets.
Just use a java.util.BitSet for each set, adding integers to the set with the set(int bitIndex) method, you don't have to sort anything, and check a HashMap for already existing BitSet before adding a new BitSet to it, it will be really very fast. Don't use sorting of value and toString for that purpose ever if speed is important.

Hashcode for strings that can be converted to integer

I'm looking for the most effective way of creating hashcodes for a very specific case of strings.
I have strings that can be converted to integer, they vary from 1 to 10,000, and they are very concentrated on the 1-600 range.
My question is what is the most effective way, in terms of performance for retrieving the items from a collection to implement the hashcode for it.
What I'm thinking is:
I can have the strings converted to integer and use a direct acess table (an array of 10.000 rows) - this will be very fast for retrieving but not very smart in terms of memory allocation;
I can use the strings as strings and get a hashcode for it (i wont have to convert it to integer, but i dont know how effective will be the hashcode for the strings in terms of collisions)
Any other ideas are greatly appreciated.
thanks a lot
Thanks everyone for your promptly replies...
There is another information Tha i've forget to add on this. I tink it Will Make this clear if I let you know my final goal with this-I migh not even need a hash table!!!
I just want to validate a stream against a dictiory that is immutable. I want to check if a given tag might or might not be present on my message.
I will receive a string with several pairs tag=value. I want to verify if the tag must or must not be treated by my app.
You might want to consider a trie (http://en.wikipedia.org/wiki/Trie) or radix tree (http://en.wikipedia.org/wiki/Radix_tree). No need to parse the string into an integer, or compute a hash code. You're walking a tree as you walk the string.
Edit:
Both computing a hash code on a string and parsing an integer out of a string involve walking the entire the string, and THEN using that value as a look-up into a specific data structure. Other techniques might involve simultaneously inspecting the string WHILE traversing a data structure. This MIGHT be of value to the poster who asked for "other ideas".
Many collections (e.g. HashMap) already apply a supplemental "rehash" method to help with poor hashcode algorithms. e.g. browse the cource code for HashMap.hash(). And Strings are very common keys, so you can be sure that String.hashCode() is highly optimized. SO, unless you notice a lot of collisions between your hashCodes, I'd go with the standard code.
I tried putting the Strings for 0..600 into a HashSet to see what happened, but it's then pretty tedious to see how many entries had collisions. Look for yourself! If you really really care, copy the source code from HashMap into your own class, edit it so you can get access to the entries (in the Java 6 source code I'm looking at, that would be transient Entry[] table, YMMV), and add methods to count collisions.
If there are only a limited valid range of values, why not represent the collection as a int[10000] as you suggested? The value at array[x] is the number of times that x occurs.
If your strings are represented as decimal integers, then parsing them to strings is a 5-iteration loop (up to 5 digits) and a couple of additions and subtractions. That is, it is incredibly fast. Inserting the elements is effectively O(1), retrieval is O(1). Memory required is around 40kb (4 bytes per int).
One problem is that insertion order is not preserved. Maybe you don't care.
Maybe you could think about caching the hashcode and only updating it if your collection has changed since the last time hashcode() was called. See Caching hashes in Java collections?
«Insert disclaimer about only doing this when it's a hot spot in your application and you can prove it»
Well the integer value itself will be a perfect hash function, you will not get any collisions. However there are two problems with this approach:
HashMap doesn't allow you to specify a custom hash function. So either you'll have to implement you own HashMap or you use a wrapper object.
HashMap uses a bitwise and instead of a modulo operation to find the bucket. This obviously throws bits away since it's just a mask. java.util.HashMap.hash(int) tries to compensate for this but I have seen claims that this is not very successful. Again we're back to implementing your own HashMap.
Now that this point since you're using the integer value as a hash function why not use the integer value as a key in the HashMap instead of the string? If you really want optimize this you can write a hash map that uses int instead of Integer keys or use TIntObjectHashMap from trove.
If you're really interested in finding good hash functions I can recommend Hashing in Smalltalk, just ignore the half dozen pages where the author rants about Java (disclaimer: I know the author).

How should I go about optimizing a hash table for a given population?

Say I have a population of key-value pairs which I plan to store in a hash table. The population is fixed and will never change. What optimizations are available to me to make the hash table as fast as possible? Which optimizations should I concentrate on? This is assuming I have a lot of space. There will be a reasonable number of pairs (say no more than 100,000).
EDIT: I want to optimize look up. I don't care how long it takes to build.
I would make sure that your key's hash to unique values. This will ensure that every lookup will be constant time, and thus, as fast as possible.
Since you can never have more than 100,000 keys, it is entirely possible to have 100,000 hash values.
Also, make sure that you use the constructor that takes an int to specify the initial capacity (Set it to 100,000), and a float to set the load factor. (Use 1) Also, doing this requires that you have a perfect hash function for your keys. But, this will result in the fastest possible lookup, in the least amount of memory.
In general, to optimize a hash table, you want to minimize collisions in the determination of your hash, so your buckets won't contain more than one item and the hash-search will return immediately.
Most of the time, that means that you should measure the output of your hash function on the problem space. So i guess i'd recommend looking into that
Ensure there are no collisions. If there are no collisions, you are guaranteed O(1) constant look-up time. The next optimization would then be the look-up.
Use a profiler to optimize piece by piece. It's hard to without that.
If it's possible to make a large hash table such that there are no collisions at all, it will be ideal. Since your insertions and lookups will done in constant time.
But if that is not possible, try to choose a hash function such that your keys get distributed uniformly across the hash table.
Perfect hashing algorithms deal with the problem, but may not scale to 100k objects. I found a Java MPH package, but haven't tried it.
If the population is known at compile time, then the optimal solution is to use a minimal perfect hash function (MPH). The Wikipedia page on this subject links to several Java tools that can generate these.
The optimization must be done int the hashCode method of the key class. The thing to have in mind is to implement this method to avoid collisions.
Getting the perfect hashing algorithm to give totally unique values to 100K objects is likely to be close to impossible. Consider the birthday paradox. The date on which people are born can be considered a perfect hashing algorithm yet if you have more than 23 people you are more than likely to have a collision, and that is in a table of 365 dates.
So how big a table will you need to have no collisions in 100K?
If your keys are strings, your optimal strategy is a tree, not binary but n-branch at each character. If the keys are lower-case only it is easier still as you need just 26 whenever you create a branch.
We start with 26 keys. Follow the first character, say f
f might have a value associated with it. And it may have sub-trees. Look up a subtree of o. This leads to more subtrees then look up the next o. (You knew where that was leading!). If this doesn't have a value associated with it, or we hit a null sub-tree on the way, we know the value is not found.
You can optimise the space on the tree where you hit a point of uniqueness. Say you have a key january and it becomes unique at the 4th character. At this point where you assign the value you also store the actual string associated with it. In our example there may be one value associated with foo but the key it relates to may be food, not foo.
I think google search engines use a technique similar to this.
The key question is what your key is. (No pun intended.) As others have pointed out, the goal is to minimize the number of hash collisions. If you can get the number of hash collisions to zero, i.e. your hash function generates a unique value for every key that is actually passed to it, you will have a perfect hash.
Note that in Java, a hash function really has two steps: First the key is run through the hashCode function for it's class. Then we calculate an index value into the hash table by taking this value modulo the size of the hash table.
I think that people discussing the perfect hash function tend to forget that second step. Even if you wrote a hashCode function that generated a unique value for every key passed to it, you could still get an absolutely terrible hash if this value modulo the hash table size is not unique. For example, say you have 100 keys and your hashCode function returns the values 1, 1001, 2001, 3001, 4001, 5001, ... 99001. If your hash table has 100,000 slots, this would be a perfect hash. Every key gets its own slot. But if it has 1000 slots, they all hash to the same slot. It would be the worst possible hash.
So consider constructing a good hash function. Take the extreme cases. Suppose that your key is a date. You know that the dates will all be in January of the same year. Then using the day of the month as the hash value should be as good as it's going to get: everything will hash to a unique integer in a small range. On the other hand, if your dates were all the first of the month for many years and many months, taking the day of the month would be a terrible hash, as every actual key would map to "1".
My point being that if you really want to optimize your hash, you need to know the nature of your data. What is the actual range of values that you will get?

Categories

Resources