I am looking for some advice on storing all possible permutations for the fringe pattern database.
So the fifteen-tile problem has 16! possible permutations, however for the fringe pattern database only the positions of the tiles 0 (blank tile), 3, 7, 11, 12, 13, 14 and 15 matter, which gives 16!/(16-8)! = 518,918,400 permutations.
I am looking to store all of these permutations in a data structure along with the value of the heuristic function (which is simply incremented on each iteration of the breadth-first search). So far I am managing to do this, but very slowly: it took me 5 minutes to store 60,000, which is time I don't have!
At the moment I have a structure which looks like this.
Value Pos0 Pos3 Pos7 Pos11 Pos12 Pos13 Pos14 Pos15
Where I store the position of each of the given tiles. I use these positions as the ID so that, when calculating the heuristic value, I can quickly look up the given composition and retrieve its value.
I am pretty unsure about this. The state of the puzzle is represented by an array, for example:
int[] goalState = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
My question is: what would be the best data structure to store these values in, and what is the best way to retrieve them?
(This question was originally about storing them in a database, but now I want to store them in some form of local data structure, as retrieving from a database is slow.)
I can't really grasp what special meaning 0, 3, 7, 11, 12, 13, 14, 15 have in your case. Is their position unchangeable? Is their position enough to identify the whole puzzle state?
Anyway, here is a general approach, you can narrow it down anytime:
As each tile can have one of at most 16 values, I would try to use hexadecimal digits to represent your permutations. So the state {1,2,3,6,5,4,7,8,9,10,11,12,13,14,15,0} would look like 0x123654789ABCDEF0 = 1312329218393956080. The biggest number possible would be 0xFEDCBA9876543210, which no longer fits in a signed long's positive range but can still be stored in a long's 64 bits (Java 8 added methods for treating a long as unsigned) or alternatively in a BigInteger (there are many examples; I would prefer this). Such a number would be unique for each permutation and could be used as a primary key, and if you have the whole state, retrieving it from the database would be pretty fast.
//saving your permutation (note: BigInteger(String, 16) does not accept a "0x" prefix)
String state = "FEDCBA9876543210";
BigInteger permutationForDatabase = new BigInteger(state, 16);
//and then you can insert it into the database as a number

//reading your permutation
char searchedCharacter = 'a';//let's say you look for tile 10; toString(16) produces lower-case digits
BigInteger permutation = ...;//here you read the number from the database
int tilePosition = permutation.toString(16).indexOf(searchedCharacter);
//careful: toString(16) drops a leading zero nibble, which would shift the indices by one
There might be a more elegant/performant solution to get the tile position (maybe some bit operation magic).
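For instance, if the whole state fits in a 64-bit long, one such variant (just a sketch, not part of the answer above) is to scan the 16 nibbles of the packed value directly instead of converting it to a string:

// hypothetical helper: find the board position of a tile by scanning the nibbles,
// most significant nibble first (matching the hex-string encoding above)
static int tilePosition(long packedState, int tile) {
    for (int pos = 0; pos < 16; pos++) {
        int nibble = (int) ((packedState >>> (60 - 4 * pos)) & 0xF);
        if (nibble == tile) return pos;
    }
    return -1; // tile not present
}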
Each number 0-15 is a 4-bit number. You must represent 7 such numbers, making a minimum requirement of 28 bits, which is well within the 31-bit positive range of an int. (If you also need the position of the blank, that is 8 numbers and exactly 32 bits, which still fits in an int's raw bits, or use a long.) Thus all permutations may be assigned, and derived from, an int.
To calculate this number, given variables a through g:
int key = a | (b << 4) | (c << 8) | (d << 12) | (e << 16) | (f << 20) | (g << 24);
To decode (if you need to):
int a = key & 0xF;
int b = (key >> 4) & 0xF;
int c = (key >> 8) & 0xF; // etc
Storing ints in a database is very efficient and will use minimal disk space:
create table heuristics (
key_value int not null,
heuristic varchar(32) not null -- as small as you can, char(n) if all the same length
);
After inserting all the rows, create a covering index for super fast lookup:
create unique index heuristics_covering on heuristics(key_value, heuristic);
If you create this index before insertion, insertions will be very, very slow.
Creating the data and inserting it is relatively straightforward coding.
So is my understanding correct that you're calculating a heuristic value for each possible puzzle state, and you want to be able to look it up later based on a given puzzle state? So that you don't have to calculate it on the fly? Presumably because of the time it takes to calculate the heuristic value.
So you're iterating over all the possible puzzle states, calculating the heuristic, and then storing that result. And it's taking a long time to do that. It seems like your assumption is that it's taking a long time to store the value - but what if the time lag you're seeing isn't the time it takes to store the values in the data store, but rather the time it takes to generate the heuristic values? That seems far more likely to me.
In that case, if you want to speed up the process of generating and storing the values, I might suggest splitting up the task into sections, and using several threads at once.
The fastest data structure, I believe, is going to be an in-memory hash table, with the hash key being your puzzle state and the value being your heuristic value. Others have already suggested reasonable ways of generating puzzle-state hash keys. The same hash table structure could be accessed by each of the threads which are generating and storing heuristic values for sections of the puzzle state domain.
Once you've populated the hash table, you can simply serialize it and store it in a binary file in the filesystem. Then have your heuristic value server load that into memory (and deserialize it into the in-memory hash table) when it starts up.
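For example, a minimal sketch of that idea (the class and file names are made up, and it assumes the packed-int keys suggested elsewhere with one small heuristic value per state):

import java.io.*;
import java.util.HashMap;

public class HeuristicStore {
    public static void main(String[] args) throws IOException, ClassNotFoundException {
        HashMap<Integer, Byte> heuristics = new HashMap<>();
        // ... populate from the breadth-first search, possibly merging per-thread maps ...

        // write the whole table to a binary file once it is complete
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream("heuristics.ser"))) {
            out.writeObject(heuristics);
        }

        // later, the heuristic value server loads it back into memory at start-up
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream("heuristics.ser"))) {
            @SuppressWarnings("unchecked")
            HashMap<Integer, Byte> loaded = (HashMap<Integer, Byte>) in.readObject();
            System.out.println("entries loaded: " + loaded.size());
        }
    }
}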
If my premise is incorrect that it's taking a long time to generate the heuristic values, then it seems like you're doing something grossly sub-optimal when you go to store them. For example reconnecting to a remote database each time you store a value. That could potentially explain the 5 minutes. And if you're reconnecting every time you go to look up a value, that could explain why that is taking too long, too.
Depending on how big your heuristic values are, an in memory hash table might not be practical. A random-access binary file of records (with each record simply containing the heuristic value) could accomplish the same thing, potentially, but you'd need some way of mathematically mapping the hash key domain to the record index domain (which consists of sequential integers). If you're iterating over all the possible puzzle states, it seems like you already have a way of mapping puzzle states to sequential integers; you just have to figure out the math.
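A sketch of that alternative, assuming one byte per heuristic value and a (hypothetical) mapping from puzzle state to a sequential record index; the file and class names are made up:

import java.io.IOException;
import java.io.RandomAccessFile;

public class HeuristicFile {
    public static void main(String[] args) throws IOException {
        try (RandomAccessFile f = new RandomAccessFile("heuristics.bin", "rw")) {
            long recordIndex = 123_456L;          // hypothetical index derived from the puzzle state
            f.seek(recordIndex);                  // one byte per record, so offset == index
            f.writeByte(42);                      // store the heuristic value

            f.seek(recordIndex);
            int heuristic = f.readUnsignedByte(); // read it back later
            System.out.println(heuristic);
        }
    }
}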
Using a local database table with each row simply having a key and a value is not unreasonable. You should definitely be able to insert 518 million rows in the space of a few minutes - you just need to maintain a connection during the data loading process, and build your index after your data load is finished. Once you've built the index on your key, a look up using the (clustered primary key integer) index should be pretty quick as long as you don't have to reconnect for every look up.
Also, if you're committing rows into a database, you don't want to commit after each row; commit every 1,000 or 10,000 rows instead. Committing after each inserted row will substantially degrade your data-loading performance.
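A rough JDBC sketch of that loading pattern (the connection URL, credentials and the enumeration of keys are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class HeuristicLoader {
    public static void main(String[] args) throws SQLException {
        // keep one connection open for the whole load
        try (Connection conn = DriverManager.getConnection("jdbc:yourdb://localhost/puzzle", "user", "pass")) {
            conn.setAutoCommit(false);
            try (PreparedStatement ps =
                     conn.prepareStatement("insert into heuristics (key_value, heuristic) values (?, ?)")) {
                int pending = 0;
                for (int key = 0; key < 1_000; key++) {   // stand-in for enumerating your real keys
                    ps.setInt(1, key);
                    ps.setString(2, "h");                 // stand-in heuristic value
                    ps.addBatch();
                    if (++pending == 10_000) {            // batch and commit every 10,000 rows
                        ps.executeBatch();
                        conn.commit();
                        pending = 0;
                    }
                }
                ps.executeBatch();                        // flush and commit the remainder
                conn.commit();
            }
        }
    }
}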
Related
As part of my programming course I was given an exercise to implement my own String collection. I was planning on using the ArrayList collection or similar, but one of the constraints is that we are not allowed to use any Java API to implement it, so only arrays are allowed. I could have implemented this using plain arrays, however efficiency is very important, as is the amount of data that this code will be tested with. I was advised to use hash tables or ordered trees as they are more efficient than arrays. After doing some research I decided to go with hash tables because they seemed easy to understand and implement, but once I started writing code I realised it is not as straightforward as I thought.
So here are the problems I have come up with and would like some advice on what is the best approach to solve them again with efficiency in mind:
ACTUAL SIZE: If I understood it correctly, hash tables are not ordered (indexed), so there are going to be gaps between items because the hash function gives different indices. So how do I know when the array is full and I need to resize it?
RESIZE: One of the difficulties is that I need to create a dynamic data structure using arrays. So if I have a String[100], once it gets full I will need to resize it by some factor (I decided to increase it by 100 each time). Once I do that, I need to change the positions of all existing values, since their hash keys will be different, because the key is calculated as:
int position = "orange".hashCode() % currentArraySize;
So if I try to find a certain value, its hash key will be different from what it was when the array was smaller.
HASH FUNCTION: I was also wondering whether the built-in hashCode() method in the String class is efficient and suitable for what I am trying to implement, or whether it is better to create my own.
DEALING WITH MULTIPLE OCCURRENCES: one of the requirements is to be able to add multiple words that are the same, because I need to be able to count how many times a word is stored in my collection. Since they are going to have the same hash code, I was planning to add the next occurrence at the next index, hoping that there will be a gap. I don't know if it is the best solution, but here is how I implemented it:
public int count(String word) {
    int count = 0;
    while (collection[(word.hashCode() % size) + count] != null
            && collection[(word.hashCode() % size) + count].equals(word))
        count++;
    return count;
}
Thank you in advance for your advice. Please ask if anything needs to be clarified.
P.S. The length of words is not fixed and varies greatly.
UPDATE Thank you for your advice, I know I made a few stupid mistakes there and I will try to do better. I took all your suggestions and quickly came up with the following structure; it is not elegant, but I hope it is roughly what you meant. I did have to make a few judgement calls, such as the bucket size; for now I use half the number of elements, but is there a way to calculate it, or some generally used value? Another uncertainty was by what factor to increase my array: should I multiply by some number n, or is adding a fixed amount also acceptable? I was also wondering about general efficiency, because I am actually creating instances of classes, but String is a class too, so I am guessing the difference in performance should not be too big?
ACTUAL SIZE: The built-in Java HashMap just resizes when the total number of elements exceeds the number of buckets multiplied by a number called the load factor, which is by default 0.75. It does not take into account how many buckets are actually full. You don't have to, either.
RESIZE: Yes, you'll have to rehash everything when the table is resized, which does include recomputing its hash.
So if I try to find a certain value, its hash key will be different from what it was when the array was smaller.
Yup.
HASH FUNCTION: Yes, you should use the built in hashCode() function. It's good enough for basic purposes.
DEALING WITH MULTIPLE OCCURRENCES: This is complicated. One simple solution would just be to have the hash entry for a given string also keep count of how many occurrences of that string are present. That is, instead of keeping multiple copies of the same string in your hash table, keep an int along with each String counting its occurrences.
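A minimal sketch of that entry-with-a-counter idea, using open addressing with linear probing (all names here are illustrative, not from your code):

class WordTable {
    private static class Entry {
        final String word;
        int count;
        Entry(String word) { this.word = word; this.count = 1; }
    }

    private Entry[] table = new Entry[128];   // capacity kept a power of two
    private int used = 0;

    private int indexFor(String word, Entry[] t) {
        int i = word.hashCode() & (t.length - 1);     // mask instead of %, avoids negative indices
        while (t[i] != null && !t[i].word.equals(word))
            i = (i + 1) & (t.length - 1);             // linear probing
        return i;
    }

    public void add(String word) {
        if (used + 1 > table.length * 3 / 4) resize();
        int i = indexFor(word, table);
        if (table[i] == null) { table[i] = new Entry(word); used++; }
        else table[i].count++;                        // duplicate word: just bump the counter
    }

    public int count(String word) {
        Entry e = table[indexFor(word, table)];
        return e == null ? 0 : e.count;
    }

    private void resize() {
        Entry[] bigger = new Entry[table.length * 2];
        for (Entry e : table)
            if (e != null) bigger[indexFor(e.word, bigger)] = e;  // rehash every entry
        table = bigger;
    }
}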
So how do I know when array is full and I need to resize it?
You keep track of the size, just as HashMap does. When the size used > capacity * load factor, you grow the underlying array, either as a whole or in part.
int position = "orange".hashCode() % currentArraySize;
Some things to consider.
The % of a negative value is a negative value.
Math.abs can still return a negative value (Math.abs(Integer.MIN_VALUE) is negative).
Using & with a bit mask is faster, however you need a size which is a power of 2 (see the sketch below).
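For example, both options in code (hash, currentArraySize and capacity stand for whatever your table uses):

int index1 = (hash & 0x7fffffff) % currentArraySize; // clear the sign bit before taking %
int index2 = hash & (capacity - 1);                  // faster, but only valid if capacity is a power of 2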
I was also wondering if built-in hashCode() method in String class is efficient and suitable for what I am trying to implement or is it better to create my own one.
The built-in hashCode is cached, so it is fast. However it is not a great hashCode: it has poor randomness in the lower bits, and in the higher bits for short strings. You might want to implement your own hashing strategy, possibly a 64-bit one.
DEALING WITH MULTIPLE OCCURRENCES:
This is usually done with a counter for each key. This way you can have, say, 32,767 duplicates (if you use a short) or 2 billion (if you use an int) of the same key/element.
Let's say we have a stream of data coming in (of unknown range and distribution) and we want to store the last X values in a hash table which provides O(1) access; how would we do this?
For simplicity, let's say the data is a stream of numbers.
In order to map these numbers to an element of an array we would need a hash function that takes account of the data range and distribution.
I guess we'd either estimate this range upfront or maintain some statistics on the data coming in and adjust the hash function accordingly.
Also we'd need a way of rejigging the array once the X threshold is met.
Any thoughts or ideas for doing this as fast as possible?
Dynamic perfect hashing seems to be pretty close; combine this with an array-growing strategy and you're good to go: http://en.wikipedia.org/wiki/Dynamic_perfect_hashing
To avoid any confusion I am reframing my question based on my research on hashing algorithms.
Problem statement
I have multiple text files containing variable-length data records. I need to find out whether there are duplicate records in the input. Each of the text files could have millions of data records.
Since I cannot load all the data in memory at once, I plan to create a hash of the key fields in the record when it is processed. Processing a record would mean validating, filtering and transforming it. After processing all the records in all the text files, they are merged to create one view of the whole input (either a text file or a database table).
Finding duplicates would be much faster if a hash value is generated for all the records. If there are collisions of hash values, only those records could be checked for equality (assuming the hashing algorithm is deterministic)
Question - What hash algorithms should I consider for such input i.e. variable length data?
Short Answer
Don't do it. Use the Java map. You can find details here:
http://docs.oracle.com/javase/6/docs/api/java/util/Map.html
Long Answer
You can create a perfect hashing function by treating your string as a number in base-N where N is all of the possible values any character can take on. The problem here is memory. Hashing functions are meant to be used with arrays, which means you'll need an array large enough to handle the results of your hash, and that is impractical.
For instance, take a modest example of a 10 character key. Let's be even more modest and assume they are guaranteed to consist solely of lower-case letters. That gives you 26 possibilities for each character, and 10 characters. This means the possible combinations are:
26 ^ 10 = 141,167,095,653,376
If you look up hashing algorithms, one of the first things they include is collision detection because they acknowledge that collisions are a fact of life.
Now, you say you are not loading the keys into memory, so why are you using a hash at all? The point of a hash is to give you a mapping onto an array index. Perhaps you're better off using another mechanism.
Possible Solutions
If you are concerned about memory, get some statistics on the duplicates in your file. If you only store a flag to indicate the occurrence of a particular key in the hash, and you have many duplicates, you may be able to just use Java's map. Java's map handles collisions, so that won't keep you from detecting unique keys. You can rest assured that if A[x] is found, that means x is in A, even if x's hash collided with a previous hash.
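A minimal sketch of that flag-per-key idea (assuming the extracted key fields fit in memory even if the full records do not; the class and method names are illustrative):

import java.util.HashSet;
import java.util.Set;

class DuplicateFlag {
    private final Set<String> seenKeys = new HashSet<>();

    // returns true if this key was seen before, i.e. a potential duplicate worth verifying
    boolean seenBefore(String keyFields) {
        return !seenKeys.add(keyFields);   // add() returns false if the key is already present
    }
}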
Next, you could try some utilities to pull out duplicates. Since they would have been written specifically for the purpose, they should be able to handle a large amount of data.
Finally, you could try putting your entries into a database and using that to handle duplicates. This may seem like overkill, but databases are optimized for dealing with very large numbers of records.
This is an extension to the Map idea. Before resorting to this I would check that it cannot be done by simply building a HashSet representing all the strings at once. Remember you can use a 64-bit JVM and set a large heap size.
Define a class StringLocation that contains the data you would need to do a random access to a string on disk - for example, a reference to a RandomAccessFile and an offset within the file. If you cannot keep all the files open at once, open and close as needed, caching the RandomAccessFile for the most used files.
Create a HashMap<Integer,List<StringLocation>>.
Start reading the strings. For each string, convert to lower case and obtain its hashCode(), hash, in Integer form. If there is an entry in the Map with hash as key, compare the new string to each string represented in the existing value, doing random file access to get to the already processed strings. Use the String equalsIgnoreCase. If there is a match, you have a duplicate. If there is no match, append a new StringLocation, representing the current string, to the List.
This requires at most two strings to be in memory at a time, the one you are currently processing and a previously processed string with the same hashCode() result to which you are comparing it.
You can further reduce the number of times you have to re-read a string for an equals check by using MessageDigest to generate, for the lower case string, a wide checksum with low risk of collisions, and saving it in the StringLocation object. During a comparison, return false if the checksums do not match, without re-reading the strings.
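A rough sketch of the approach described above (field and method names are assumptions, not code from the answer, and the checksum refinement is omitted):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class StringLocation {
    final RandomAccessFile file;   // file the string lives in
    final long offset;             // where it starts
    final int length;              // how many bytes it occupies

    StringLocation(RandomAccessFile file, long offset, int length) {
        this.file = file;
        this.offset = offset;
        this.length = length;
    }

    String read() throws IOException {
        byte[] buf = new byte[length];
        file.seek(offset);
        file.readFully(buf);
        return new String(buf, StandardCharsets.UTF_8);
    }
}

class DuplicateCheck {
    private final Map<Integer, List<StringLocation>> index = new HashMap<>();

    // returns true if 'current' duplicates an already processed string; otherwise records it
    boolean isDuplicate(String current, StringLocation where) throws IOException {
        int hash = current.toLowerCase().hashCode();
        List<StringLocation> candidates = index.computeIfAbsent(hash, h -> new ArrayList<>());
        for (StringLocation loc : candidates) {
            if (loc.read().equalsIgnoreCase(current)) {
                return true;                 // same hashCode and equal ignoring case
            }
        }
        candidates.add(where);               // first occurrence: remember where it is on disk
        return false;
    }
}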
In Java, is there any way to store time ranges as keys in a HashMap? I have a HashMap in which I store time ranges. For example:
I enter the range 0-50 as a key, and for that key I store some object as the value. Now when I ask for 10, I should be able to get the corresponding value for that key.
Any value between 0-50 should get that object.
Map map = new HashMap();
map.put(0-50,"some object")
map.put(51-100,"some other object")
Now when I call map.get(10) it should return "some object". Please suggest how to do this.
I wouldn't use a map; instead I would try an R-tree. An R-tree is a tree structure created for indexing spatial data. It stores rectangles. It is often used to test whether a point (coordinate) lies within another geometry. Those geometries are approximated by rectangles, and those rectangles are stored in the tree.
To store a rectangle (or the information about it) you only need to save the lower left and upper right corner coordinates. In your case this would be the lower and upper bound of the time span. You can think of it, as if all y values of the coordinates were 0. Then you can query the tree with your time value.
And of course you would save the value at each leaf (time span/rectangle)
A simple Google search for r-tree java brought up some promising results. Implementing your own R-tree isn't trivial, but it is not too complicated if you understand the principle of re-arranging the tree upon insertion/deletion. In your one-dimensional case it might get even simpler.
Assumptions: Non-overlapping ranges.
You can store the starting point and ending point of the ranges in a TreeSet. The starting point and ending point are objects that store the starting time and ending time respectively, plus (a reference to) the object. You have to define a comparison function so that the objects are ordered by time.
You can obtain the object by using floor() or ceiling() function of TreeSet.
Note that the ranges should NOT overlap, even at the endpoints (e.g. 3-6 and 6-10)
This will give you log complexity for range insertion and query.
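A small sketch of the same idea, using a TreeMap keyed by the range start (floorEntry plays the role of TreeSet.floor(); it assumes the ranges do not overlap):

import java.util.Map;
import java.util.TreeMap;

class RangeLookup<V> {
    private static class Range<V> {
        final int end;
        final V value;
        Range(int end, V value) { this.end = end; this.value = value; }
    }

    private final TreeMap<Integer, Range<V>> ranges = new TreeMap<>();

    void put(int start, int end, V value) {
        ranges.put(start, new Range<>(end, value));
    }

    V get(int point) {
        Map.Entry<Integer, Range<V>> e = ranges.floorEntry(point); // greatest start <= point
        return (e != null && point <= e.getValue().end) ? e.getValue().value : null;
    }
}

// usage:
// RangeLookup<String> map = new RangeLookup<>();
// map.put(0, 50, "some object");
// map.put(51, 100, "some other object");
// map.get(10);   // returns "some object"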
If these are non-overlapping, equidistant ranges, i.e. ranges split by say 50, you can solve this problem by maintaining a hash keyed by the maximum of each range, like
50 - 'some object', 100 - 'some other object', etc..
If the input is 10, derive the next multiple of 50 above it and get the value for that key.
You can arrive at the next multiple of 50 as follows (see the sketch after these steps):
take the modulo of the input, say for the input 90: 90 % 50 = 40
compute the difference between 50 and the step 1 result, i.e. 50 - 40 = 10
add the step 2 result to the input, i.e. 90 + 10 = 100
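A small sketch of this lookup (assuming fixed-width, non-overlapping ranges 1-50, 51-100, ...):

import java.util.HashMap;
import java.util.Map;

public class BucketLookup {
    public static void main(String[] args) {
        Map<Integer, String> buckets = new HashMap<>();
        buckets.put(50, "some object");
        buckets.put(100, "some other object");

        int input = 90;
        int remainder = input % 50;                                  // 90 % 50 = 40
        int key = remainder == 0 ? input : input + 50 - remainder;   // 90 + 10 = 100
        System.out.println(buckets.get(key));                        // "some other object"
    }
}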
You need to map the ranges to a single key. Why don't you use something like a range-manager object which, for any value between min and max, returns the key 1, for example? Alternatively you can put the same object as the value for all keys between 1 and 50 using a for loop, but that would be a waste in my eyes.
Say I have a population of key-value pairs which I plan to store in a hash table. The population is fixed and will never change. What optimizations are available to me to make the hash table as fast as possible? Which optimizations should I concentrate on? This is assuming I have a lot of space. There will be a reasonable number of pairs (say no more than 100,000).
EDIT: I want to optimize look up. I don't care how long it takes to build.
I would make sure that your keys hash to unique values. This will ensure that every lookup will be constant time, and thus as fast as possible.
Since you can never have more than 100,000 keys, it is entirely possible to have 100,000 hash values.
Also, make sure that you use the constructor that takes an int to specify the initial capacity (set it to 100,000) and a float to set the load factor (use 1). Doing this requires that you have a perfect hash function for your keys, but it will result in the fastest possible lookup in the least amount of memory.
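For example (the key and value types here are just placeholders):

import java.util.HashMap;
import java.util.Map;

// initial capacity 100,000 and load factor 1: the table will not be resized
// as long as you stay at or below 100,000 entries
Map<String, Integer> lookup = new HashMap<>(100_000, 1.0f);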
In general, to optimize a hash table, you want to minimize collisions in the determination of your hash, so your buckets won't contain more than one item and the hash-search will return immediately.
Most of the time, that means that you should measure the output of your hash function on the problem space. So I guess I'd recommend looking into that.
Ensure there are no collisions. If there are no collisions, you are guaranteed O(1) constant look-up time. The next optimization would then be the look-up.
Use a profiler to optimize piece by piece. It's hard to do without one.
If it's possible to make a large hash table such that there are no collisions at all, that would be ideal, since your insertions and lookups will be done in constant time.
But if that is not possible, try to choose a hash function such that your keys get distributed uniformly across the hash table.
Perfect hashing algorithms deal with the problem, but may not scale to 100k objects. I found a Java MPH package, but haven't tried it.
If the population is known at compile time, then the optimal solution is to use a minimal perfect hash function (MPH). The Wikipedia page on this subject links to several Java tools that can generate these.
The optimization must be done in the hashCode method of the key class. The thing to keep in mind is to implement this method so as to avoid collisions.
Getting the perfect hashing algorithm to give totally unique values to 100K objects is likely to be close to impossible. Consider the birthday paradox. The date on which people are born can be considered a perfect hashing algorithm yet if you have more than 23 people you are more than likely to have a collision, and that is in a table of 365 dates.
So how big a table will you need to have no collisions in 100K?
If your keys are strings, your optimal strategy is a tree, not binary but n-branching at each character. If the keys are lower-case only it is easier still, as you need just 26 branches whenever you create a node.
We start with 26 entries at the root. Follow the first character, say f.
f might have a value associated with it, and it may have sub-trees. Look up the sub-tree for o. This leads to more sub-trees; then look up the next o. (You knew where that was leading!) If this doesn't have a value associated with it, or we hit a null sub-tree on the way, we know the value is not found.
You can optimise the space of the tree where you hit a point of uniqueness. Say you have a key january and it becomes unique at the 4th character. At this point, where you assign the value, you also store the actual string associated with it. In our example there may be one value associated with foo, but the key it relates to may be food, not foo.
I think google search engines use a technique similar to this.
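A rough sketch of such a 26-way character tree, without the point-of-uniqueness space optimisation and assuming lower-case letters only:

class Trie<V> {
    private static class Node<V> {
        @SuppressWarnings("unchecked")
        final Node<V>[] children = (Node<V>[]) new Node[26];  // one branch per letter
        String key;   // the full key, stored at the node where its value lives
        V value;
    }

    private final Node<V> root = new Node<>();

    void put(String key, V value) {
        Node<V> n = root;
        for (int i = 0; i < key.length(); i++) {
            int c = key.charAt(i) - 'a';
            if (n.children[c] == null) n.children[c] = new Node<>();
            n = n.children[c];
        }
        n.key = key;
        n.value = value;
    }

    V get(String key) {
        Node<V> n = root;
        for (int i = 0; i < key.length() && n != null; i++) {
            n = n.children[key.charAt(i) - 'a'];  // a null sub-tree means "not found"
        }
        return (n != null && key.equals(n.key)) ? n.value : null;
    }
}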
The key question is what your key is. (No pun intended.) As others have pointed out, the goal is to minimize the number of hash collisions. If you can get the number of hash collisions to zero, i.e. your hash function generates a unique value for every key that is actually passed to it, you will have a perfect hash.
Note that in Java, a hash function really has two steps: first the key is run through the hashCode function for its class, then we calculate an index into the hash table by taking this value modulo the size of the hash table.
I think that people discussing the perfect hash function tend to forget that second step. Even if you wrote a hashCode function that generated a unique value for every key passed to it, you could still get an absolutely terrible hash if this value modulo the hash table size is not unique. For example, say you have 100 keys and your hashCode function returns the values 1, 1001, 2001, 3001, 4001, 5001, ... 99001. If your hash table has 100,000 slots, this would be a perfect hash. Every key gets its own slot. But if it has 1000 slots, they all hash to the same slot. It would be the worst possible hash.
So consider constructing a good hash function. Take the extreme cases. Suppose that your key is a date. You know that the dates will all be in January of the same year. Then using the day of the month as the hash value should be as good as it's going to get: everything will hash to a unique integer in a small range. On the other hand, if your dates were all the first of the month for many years and many months, taking the day of the month would be a terrible hash, as every actual key would map to "1".
My point being that if you really want to optimize your hash, you need to know the nature of your data. What is the actual range of values that you will get?