In Java, is there any way to store time ranges as keys in a HashMap? I have a HashMap in which I want to store time ranges as keys. For example:
I enter the range 0-50 as a key, and for that key I store some other object as the value. Now when I ask for 10, I should be able to get the corresponding value for that key.
Any value between 0 and 50 should return that object.
Map map = new HashMap();
map.put(0-50, "some object");
map.put(51-100, "some other object");
Now when I call map.get(10), it should return "some object". Please suggest how to do this.
I wouldn't use a map; instead I would try an R-tree. An R-tree is a tree structure created for indexing spatial data. It stores rectangles. It is often used to test whether a point (coordinate) lies within another geometry. Those geometries are approximated by rectangles, and the rectangles are stored in the tree.
To store a rectangle (or the information about it) you only need to save the lower-left and upper-right corner coordinates. In your case these would be the lower and upper bounds of the time span; you can think of it as if all y values of the coordinates were 0. Then you can query the tree with your time value.
And of course you would save the value at each leaf (time span/rectangle).
A simple Google search for r-tree java brings up some promising results. Implementing your own R-tree isn't trivial, but it is not too complicated once you understand the principle of re-arranging the tree upon insertion/deletion. In your one-dimensional case it might be even simpler.
Assumptions: Non-overlapping ranges.
You can store the starting point and ending point of each range in a TreeSet. The starting point and ending point are objects that store the starting time and ending time respectively, plus (a reference to) the value object. You have to define a comparison function so that the entries are ordered by time.
You can then obtain the object by using the floor() or ceiling() methods of TreeSet.
Note that the ranges should NOT overlap, even at the endpoints (e.g. 3-6 and 6-10).
This gives you logarithmic complexity for range insertion and query.
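A minimal sketch of this idea, assuming non-overlapping ranges and using a TreeMap keyed by range start (a close relative of the TreeSet-of-endpoints approach described above; class and method names are illustrative):

import java.util.Map;
import java.util.TreeMap;

public class RangeLookup<V> {

    private static final class Range<V> {
        final int end;
        final V value;
        Range(int end, V value) { this.end = end; this.value = value; }
    }

    // keyed by the start of each (non-overlapping) range
    private final TreeMap<Integer, Range<V>> ranges = new TreeMap<>();

    public void put(int start, int end, V value) {
        ranges.put(start, new Range<>(end, value));
    }

    public V get(int time) {
        Map.Entry<Integer, Range<V>> e = ranges.floorEntry(time); // greatest start <= time
        if (e == null || time > e.getValue().end) {
            return null;                                          // outside every stored range
        }
        return e.getValue().value;
    }
}

For the question's example, put(0, 50, "some object"), put(51, 100, "some other object") and then get(10) returns "some object" in logarithmic time.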
If the ranges are non-overlapping and equidistant, i.e. split by a fixed step of say 50, you can solve this problem by maintaining a hash keyed on the upper bound of each range, like
50 -> 'some object', 100 -> 'some other object', etc.
If the input is 10, derive the next multiple of 50 and get the value for that key.
You can arrive at the next multiple of 50 like this (see the sketch below):
take the input modulo 50, e.g. for the input 90: 90 % 50 = 40
compute the difference between 50 and the step 1 result, i.e. 50 - 40 = 10
add the step 2 result to the input, i.e. 90 + 10 = 100
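Expressed in code, the same arithmetic can be written as a ceiling division, which also covers inputs that are already exact multiples of 50; the step size and map contents below are just the example values from the question:

import java.util.HashMap;
import java.util.Map;

public class SteppedRangeLookup {
    private static final int STEP = 50;
    private static final Map<Integer, String> BUCKETS = new HashMap<>();
    static {
        BUCKETS.put(50, "some object");        // covers 0-50
        BUCKETS.put(100, "some other object"); // covers 51-100
    }

    static String lookup(int input) {
        // round up to the nearest multiple of STEP, e.g. 10 -> 50, 90 -> 100, 100 -> 100
        int key = ((input + STEP - 1) / STEP) * STEP;
        if (input == 0) key = STEP;            // 0 falls into the first bucket
        return BUCKETS.get(key);
    }

    public static void main(String[] args) {
        System.out.println(lookup(10)); // some object
        System.out.println(lookup(90)); // some other object
    }
}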
You need to map the ranges to a single key. Why don't you use something like a RangeManager object which, for any value between min and max, returns the key 1, for example? Alternatively, you can put the same object as the value for all keys between 1 and 50 using a for loop, but in my eyes this would be a waste.
In my COMP class last night we learned about hashing and how it generally works when trying to find an element x in a hash table.
Our scenario was that we have a dataset of 1000 elements inside our table and we want to know if x is contained within that table.
Our professor drew up a Java array of size 100 and said that to store these 1000 elements, each position of the array would contain a pointer to a linked list where we would keep our elements.
Assuming the hashing function perfectly mapped each of the 1000 elements to a value between 0 and 99 and stored the element at the position in the array, there would be 1000/100 = 10 elements contained within each linked list.
Now to know whether x is in the table, we simply hash x, find its hash value, look up the array at that slot and iterate over our linked list to check whether x is in the table.
My professor concluded by saying that the expected complexity of finding whether x is in the table is O(10) which is really just O(1). I cannot understand how this is the case. In my mind, if the dataset is N and the array size is n then it takes on average N/n steps to find x in the table. Isn't this by definition not constant time, because if we scale up the data set the time will still increase?
I've looked through Stack Overflow and online and everyone says hashing is expected time complexity of O(1) with some caveats. I have read people discuss chaining to reduce these caveats. Maybe I am missing something fundamental about determining time complexity.
TLDR: Why does it take O(1) time to lookup a value in a hash table when it seems to still be determined by how large your dataset is (therefore a function of N, therefore not constant).
In my mind, if the dataset is N and the array size is n then it takes on average N/n steps to find x in the table.
This is a misconception, as hashing simply requires you to calculate the correct bucket (in this case, array index) that the object should be stored in. This calculation will not become any more complex if the size of the data set changes.
These caveats you speak of are most likely hash collisions, where multiple objects end up in the same bucket; collisions can be reduced (though never entirely prevented) with a better hash function.
The complexity of a hashed collection for lookups is O(1) because the size of lists (or in Java's case, red-black trees) for each bucket is not dependent on N. Worst-case performance for HashMap if you have a very bad hash function is O(log N), but as the Javadocs point out, you get O(1) performance "assuming the hash function disperses the elements properly among the buckets". With proper dispersion the size of each bucket's collection is more-or-less fixed, and also small enough that constant factors generally overwhelm the polynomial factors.
There are multiple issues here, so I will address them one by one:
Worst case analysis vs average case analysis:
Worst case analysis refers to the absolute worst input your algorithm can be given with respect to running time. As an example, if I am given an array of unordered elements and I am told to find an element in it, my best case is when the element is at index [0]; the worst possible input is when the element is at the end of the array, in which case, if my data set has size n, I perform n steps before finding the element. In the average case the element is somewhere in between, so I will perform n - k steps (where k is the number of elements after the element I am looking for in the array).
Worst case analysis of Hashtables:
There exists only one kind of hash table that has guaranteed constant-time O(1) access to its elements: a plain array (and even that is not strictly true, due to paging and the way operating systems handle memory). The worst possible input I could give you for a hash table is a data set where every element hashes to the same index. For example, if every single element hashes to index 1, then due to collisions the worst-case running time for accessing a value is O(n). This is unavoidable; hash tables always have this worst-case behaviour.
Average and best case scenario of hashtables:
You will rarely be given a data set that produces the worst-case scenario. In general you can expect objects to be hashed to different indexes in your hash table; ideally the hash function spreads things out, so that objects end up distributed evenly across the indexes of the table.
In the specific example your teacher gave you, if 2 things get hashed to the same index, they get put in a linked list. So this is more or less how the table got constructed:
get element E
use the hashing function hash(E) to find the index i in the hash table
add E to the linked list at hashTable[i].
repeat for all the elements in the data set
So now, let's say I want to find out whether element E is in the table. Then:
do hash(E) to find the index i where E is potentially hashed
go to hashTable[i] and iterate through the linked list (up to 10 iterations)
If E is found, then E is in the hash table; if E is not found, then E is not in the table.
The reason we can guarantee that E is not in the table when we can't find it there is that, if E were in the table, it would have been hashed to hashTable[i], so it would have to be in that list.
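A bare-bones sketch of the table described above (a fixed array of linked lists, chaining on collision); all names here are illustrative:

import java.util.LinkedList;

public class ChainedHashTable<E> {

    private final LinkedList<E>[] buckets;

    @SuppressWarnings("unchecked")
    public ChainedHashTable(int capacity) {
        buckets = new LinkedList[capacity];
        for (int i = 0; i < capacity; i++) {
            buckets[i] = new LinkedList<>();
        }
    }

    // hash(E): map the element to an index in [0, capacity)
    private int indexOf(E element) {
        return (element.hashCode() & 0x7FFFFFFF) % buckets.length;
    }

    public void add(E element) {
        buckets[indexOf(element)].add(element);   // append to the chain at that slot
    }

    public boolean contains(E element) {
        // one hash computation, then a walk over a single (short) chain
        return buckets[indexOf(element)].contains(element);
    }
}

With 1000 elements spread evenly over 100 buckets, contains() does one hash computation and then walks a chain of roughly 10 entries, which is where the O(10) = O(1) figure comes from.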
I am looking for some advice on storing all possible permutations for the fringe pattern database.
The fifteen-tile problem has 16! possible permutations; however, for the fringe tiles, 0 (the blank tile), 3, 7, 11, 12, 13, 14, 15, there are 16!/(16-8)! = 518,918,400 permutations to store.
I am looking to store all of these permutations in a data structure along with the value of the heuristic function (which is simply incremented on each iteration of the breadth-first search). So far I am doing this, but very slowly: it took me 5 minutes to store 60,000, which is time I don't have!
At the moment I have a structure which looks like this.
Value Pos0 Pos3 Pos7 Pos11 Pos12 Pos13 Pos14 Pos15
This is where I store the position of each of the given numbers. I have to use these positions as the ID so that, when calculating the heuristic value, I can quickly get to the given composition and retrieve the value.
I am pretty unsure about this. The state of the puzzle is represented by an array, for example:
int[] goalState = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
My question is: what would be the best data structure to store these values in, and what is the best way to retrieve them?
(This question was originally about storing them in a database, but now I want to store them in some form of local data structure, as retrieving from a database is slow.)
I can't really grasp what special meaning 0,3,7,11,12,13,14,15 have in your case. Is their position unchangeable? Is their position enough to identify the whole puzzle state?
Anyway, here is a general approach; you can narrow it down anytime:
As you have at most 16 possible tile values, I would try to use hexadecimal digits to represent your permutations. The state {1,2,3,6,5,4,7,8,9,10,11,12,13,14,15,0} would then look like 0x123654789ABCDEF0 = 1312329218393956080. The biggest number possible would be 0xFEDCBA9876543210, which can still be handled as an unsigned long (only since Java 8) or alternatively as a BigInteger (there are many examples; I would prefer this). Such a number is unique for each permutation and could be used as the primary key, and if you have the whole state, retrieving it from the database would be pretty fast.
//saving your permutation (note: no "0x" prefix - BigInteger(String, radix) expects bare hex digits)
String state = "FEDCBA9876543210";
BigInteger permutationForDatabase = new BigInteger(state, 16);
//and then you can insert it into the database as a number
//reading your permutation
char searchedCharacter = 'a';//let's say you look for tile 10; toString(16) produces lowercase digits
BigInteger permutation = ...;//here you read the number from the database
int tilePosition = permutation.toString(16).indexOf(searchedCharacter);
//careful: if the state starts with tile 0, toString(16) drops the leading zero and every position shifts by one
There might be a more elegant/performant solution to get the tile position (maybe some bit operation magic).
Each number 0-15 is a 4-bit number. You must represent 7 such numbers, making a minimum requirement of 28 bits, which is well within the 31 signed bit space of an int. Thus all permutations may be assigned, and derived from, an int.
To calculate this number, given variables a through g:
int key = a | (b << 4) | (c << 8) | (d << 12) | (e << 16) | (f << 20) | (g << 24);
To decode (if you need to):
int a = key & 0xF;
int b = (key >> 4) & 0xF;
int c = (key >> 8) & 0xF; // etc. - shift first, then mask, to recover the original 0-15 value
Storing ints in a database is very efficient and will use minimal disk space:
create table heuristics (
key_value int not null,
heuristic varchar(32) not null -- as small as you can, char(n) if all the same length
);
After inserting all the rows, create a covering index for super fast lookup:
create unique index heuristics_covering on heuristics(key_value, heuristic);
If you create this index before insertion, insertions will be very, very slow.
Creating the data and inserting it is relatively straightforward coding.
So is my understanding correct that you're calculating a heuristic value for each possible puzzle state, and you want to be able to look it up later based on a given puzzle state? So that you don't have to calculate it on the fly? Presumably because of the time it takes to calculate the heuristic value.
So you're iterating over all the possible puzzle states, calculating the heuristic, and then storing that result. And it's taking a long time to do that. It seems like your assumption is that it's taking a long time to store the value - but what if the time lag you're seeing isn't the time it's taking to store the values in the data store, but rather the time it's taking to generate the heuristic values? That seems far more likely to me.
In that case, if you want to speed up the process of generating and storing the values, I might suggest splitting up the task into sections, and using several threads at once.
The fastest data structure, I believe, is going to be an in-memory hash table, with the hash key being your puzzle state and the value being your heuristic value. Others have already suggested reasonable ways of generating puzzle-state hash keys. The same hash table structure could be accessed by each of the threads which are generating and storing heuristic values for sections of the puzzle-state domain.
Once you've populated the hash table, you can simply serialize it and store it in a binary file in the filesystem. Then have your heuristic value server load that into memory (and deserialize it into the in-memory hash table) when it starts up.
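A minimal sketch of that serialize/deserialize round trip; the file and the Long-key/Integer-value types are placeholders (the key could, for instance, be a packed puzzle-state number as suggested in the other answers):

import java.io.*;
import java.util.HashMap;

public class HeuristicStore {

    static void save(HashMap<Long, Integer> table, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(new FileOutputStream(file)))) {
            out.writeObject(table); // HashMap, Long and Integer are all Serializable
        }
    }

    @SuppressWarnings("unchecked")
    static HashMap<Long, Integer> load(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            return (HashMap<Long, Integer>) in.readObject();
        }
    }
}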
If my premise is incorrect that it's taking a long time to generate the heuristic values, then it seems like you're doing something grossly sub-optimal when you go to store them. For example reconnecting to a remote database each time you store a value. That could potentially explain the 5 minutes. And if you're reconnecting every time you go to look up a value, that could explain why that is taking too long, too.
Depending on how big your heuristic values are, an in memory hash table might not be practical. A random-access binary file of records (with each record simply containing the heuristic value) could accomplish the same thing, potentially, but you'd need some way of mathematically mapping the hash key domain to the record index domain (which consists of sequential integers). If you're iterating over all the possible puzzle states, it seems like you already have a way of mapping puzzle states to sequential integers; you just have to figure out the math.
Using a local database table with each row simply having a key and a value is not unreasonable. You should definitely be able to insert 518 million rows in the space of a few minutes - you just need to maintain a connection during the data loading process, and build your index after your data load is finished. Once you've built the index on your key, a look up using the (clustered primary key integer) index should be pretty quick as long as you don't have to reconnect for every look up.
Also if you're committing rows into a database, you don't want to commit after each row, you'll want to commit every 1,000 or 10,000 rows. If you're committing after each row is inserted, that will substantially degrade your data loading performance.
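As a sketch of that batching advice using plain JDBC - the table and column names follow the create table statement in the other answer, and the connection URL is only a placeholder for whatever database you actually use:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Map;

public class HeuristicLoader {

    static void load(Map<Integer, String> heuristics) throws SQLException {
        try (Connection conn = DriverManager.getConnection("jdbc:h2:./heuristics"); // placeholder URL
             PreparedStatement ps = conn.prepareStatement(
                     "insert into heuristics (key_value, heuristic) values (?, ?)")) {
            conn.setAutoCommit(false);          // commit in batches, not per row
            int pending = 0;
            for (Map.Entry<Integer, String> e : heuristics.entrySet()) {
                ps.setInt(1, e.getKey());
                ps.setString(2, e.getValue());
                ps.addBatch();
                if (++pending % 10_000 == 0) {  // flush and commit every 10,000 rows
                    ps.executeBatch();
                    conn.commit();
                }
            }
            ps.executeBatch();                  // remaining rows
            conn.commit();
        }
    }
}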
I am able to do a few preprocessing steps in data mining using Hadoop MapReduce.
One such step is normalization.
For example, turning
100,1:2:3
101,2:3:4
into
100 1
100 2
100 3
101 2
101 3
101 4
Likewise, am I able to do binning for numerical data, say iris.csv?
I worked out the maths behind it:
Iris DataSet: http://archive.ics.uci.edu/ml/datasets/Iris
Find out the minimum and maximum values of each attribute in the data set.
    | Sepal Length | Sepal Width | Petal Length | Petal Width
Min | 4.3          | 2.0         | 1.0          | 0.1
Max | 7.9          | 4.4         | 6.9          | 2.5
Then, we should divide the data values of each attribute into 'n' buckets.
Say, n = 5.
Bucket width = (Max - Min) / n
Sepal Length: (7.9 - 4.3) / 5 = 0.72
So, the intervals will be as follows:
4.3 - 5.02
5.02 - 5.74
5.74 - 6.46
6.46 - 7.18
7.18 - 7.9
Likewise, continue for all attributes.
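In plain Java (outside MapReduce) the bucketing arithmetic itself is only a few lines; a sketch using the example values above:

public class BucketDemo {
    public static void main(String[] args) {
        // example values from above: sepal length, min = 4.3, max = 7.9, n = 5 buckets
        double min = 4.3, max = 7.9;
        int n = 5;
        double width = (max - min) / n;                 // 0.72

        double value = 5.1;                             // a sepal length from the data set
        int bucket = (int) ((value - min) / width);     // 0-based bucket index
        if (bucket == n) bucket = n - 1;                // the maximum itself goes into the last bucket
        System.out.println("value " + value + " falls into bucket " + bucket); // bucket 1 (5.02 - 5.74)
    }
}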
Are we able to do the same in MapReduce?
Please suggest.
I am not sure if I understood your question, but what you want to do is get the maximum and minimum for each of the attributes of that data set and then divide the values into buckets, all in the same job, right? OK: in order to divide the attributes, you need to feed the reducer the max and min values instead of relying on the reducer to do that work for you. And I am guessing this is where your trouble starts.
However, there is one thing you could do: a MapReduce design pattern called the in-mapper combiner. When each mapper has finished its job, it calls a method called cleanup. You can implement the cleanup method so that it computes the max and min values of each of the attributes on each map node. This way, you give the (single) reducer only a collection with X values, X being the number of mappers in your cluster.
Then the reducer extracts the overall max and min values for each of the attributes; since it is a very short collection, there won't be any problems. Finally, you divide each of the attributes into the 'n' buckets.
There is plenty of information about this pattern on the web; one example is this tutorial. Hope it helps.
EDIT: you need to create an instance variable in the mapper where you store each of the values seen in the map method, so that they are available in the cleanup method, since it is only called once. A HashMap, for example, will do. Remember that you cannot write the values to the context in the map method; you need to do this in the cleanup method, after iterating through the HashMap and finding the max and min value for each column. As for the key, I don't think it really matters in this case, so yes, you could use the CSV header; and as for the value, you are correct, you need to store the whole column.
Once the reducer receives the output from the mappers, you can't calculate the buckets just yet. Bear in mind that you will receive one "column" for each mapper, so if you have 20 mappers, you will receive 20 max values and 20 min values for each attribute. Therefore you need to calculate the max and min again, just like you did in the cleanup method of the mappers, and once this is done, then you can finally calculate the buckets.
You may be wondering "if I still need to find the max and min values in the reducer, then I could omit the cleanup method and do everything in the reducer, after all the code would be more or less the same". However, to do what you are asking, you can only work with one reducer, so if you omit the cleanup method and leave all the work to the reducer, the throughput would be the same as if working in one machine without Hadoop.
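For illustration, here is a hedged sketch of such a mapper (class, field and key names are my own; it simplifies the EDIT above by keeping running min/max per column instead of buffering whole columns in a HashMap, which amounts to the same cleanup-time emission):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MinMaxMapper extends Mapper<LongWritable, Text, Text, Text> {

    private static final int NUM_ATTRIBUTES = 4;       // sepal/petal length and width
    private final double[] min = new double[NUM_ATTRIBUTES];
    private final double[] max = new double[NUM_ATTRIBUTES];

    @Override
    protected void setup(Context context) {
        for (int i = 0; i < NUM_ATTRIBUTES; i++) {
            min[i] = Double.MAX_VALUE;
            max[i] = -Double.MAX_VALUE;
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) {
        String[] fields = value.toString().split(",");
        if (fields.length < NUM_ATTRIBUTES + 1) {
            return;                                     // skip blank or malformed lines
        }
        for (int i = 0; i < NUM_ATTRIBUTES; i++) {      // last column is the class label
            double v = Double.parseDouble(fields[i]);
            if (v < min[i]) min[i] = v;
            if (v > max[i]) max[i] = v;
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // emit one (column, "min,max") record per attribute, once per mapper
        for (int i = 0; i < NUM_ATTRIBUTES; i++) {
            context.write(new Text("col" + i), new Text(min[i] + "," + max[i]));
        }
    }
}

The single reducer then takes the overall min and max of the values it receives per column key and computes the bucket boundaries, as described above.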
I've come across an interesting problem which I would love to get some input on.
I have a program that generates sets of numbers (based on some predefined conditions). Each set contains up to 6 integers, which do not have to be unique and which range from 1 to 100.
I would like to somehow store every set that is created so that I can quickly check if a certain set with the exact same numbers (order doesn't matter) has previously been generated.
Speed is a priority in this case, as there might be up to 100k sets stored before the program stops (maybe more, but most of the time probably less)! Would anyone have any recommendations as to what data structures I should use and how I should approach this problem?
What I have currently is this:
Sort each set before storing it into a HashSet of Strings. The string is simply each number in the sorted set with some separator.
For example, the set {4, 23, 67, 67, 71} would get encoded as the string "4-23-67-67-71" and stored into the HashSet. Then for every new set generated, sort it, encode it and check if it exists in the HashSet.
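For reference, that approach in code form might look roughly like this (class and method names are illustrative):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.StringJoiner;

public class SeenSets {

    private final Set<String> seen = new HashSet<>();

    // returns true if this combination (order-insensitive) has not been seen before
    public boolean addIfNew(int[] numbers) {
        int[] sorted = numbers.clone();
        Arrays.sort(sorted);
        StringJoiner key = new StringJoiner("-");
        for (int n : sorted) {
            key.add(Integer.toString(n));      // e.g. {4, 23, 67, 67, 71} -> "4-23-67-67-71"
        }
        return seen.add(key.toString());       // Set.add() returns false if already present
    }
}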
Thanks!
If you break it into pieces, it seems to me that:
creating a set (generate 6 numbers, sort, stringify) runs in O(1)
checking if this string exists in the hashset is O(1)
inserting into the hashset is O(1)
you do this n times, which gives you O(n).
This is already optimal, as you have to touch every element once anyway :)
You might run into problems depending on the range of your random numbers.
E.g. assume you generate only numbers between one and one; then there's obviously only one possible outcome ("1-1-1-1-1-1") and you'll have nothing but collisions from there on. However, as long as the number of possible sequences is much larger than the number of sets you generate, I don't see a problem.
One tip: if you know the number of generated elements beforehand, it would be wise to initialize the HashSet with the expected capacity, i.e. new HashSet<String>(100000).
P.S. Now that other answers are popping up, I'd like to note that while there may be room for improvement on a microscopic level (i.e. using language-specific tricks), your overall approach can't be improved.
Create a class SetOfIntegers
Implement a hashCode() method that will generate reasonably unique hash values
Use a HashMap to store your elements, like put(hashValue, instance)
Use containsKey(hashValue) to check whether the same hashValue is already present
This way you will avoid sorting and conversion/formatting of your sets.
Just use a java.util.BitSet for each set, adding integers to it with the set(int bitIndex) method; you don't have to sort anything. Then check a HashMap (or HashSet) for an already existing BitSet before adding a new BitSet to it - it will be really very fast. Don't use sorting of values and toString for that purpose if speed is important.
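A sketch of the BitSet idea, using a HashSet for the membership check since only presence matters. One caveat: a BitSet only records which values occur, so duplicates inside one set (like the two 67s in the earlier example) collapse into a single bit:

import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

public class SeenBitSets {

    private final Set<BitSet> seen = new HashSet<>();  // BitSet implements equals/hashCode

    // returns true if this combination of values has not been seen before
    // (duplicates within one set, e.g. the two 67s, collapse to a single bit)
    public boolean addIfNew(int[] numbers) {
        BitSet bits = new BitSet(101);   // values range from 1 to 100
        for (int n : numbers) {
            bits.set(n);
        }
        return seen.add(bits);
    }
}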
I have read about hash tables and open addressing. Say you want to insert the keys 18, 32, 44 into a hash table of size 13:
18 gets index 5 (18 modulus 13 = 5)
32 gets index 6 (32 modulus 13 = 6)
44 gets index 5 (44 modulus 13 = 5)
You'll get a collision because there is already something at index 5.
If you use linear probing you'll compute hashfunction = (key + i) modulus N, where i = 0, 1, 2, ... until you find an empty place in the hash table. Then 44 will be inserted at index 7.
What if you delete 32, and then you want to delete 44? You start by looking at (44 + 0) modulus 13 = 5 - that is not 44 - then at (44 + 1) modulus 13 = 6 - that is empty. Then you might think that 44 is not in the table at all. How do you mark a place in the hash table to say that it is not really empty, that it no longer contains a key, but that you should keep looking for 44 at the next index?
If you then need to insert another key at index 6, that key can simply overwrite the "mark" in the hash table.
What could you use to mark an index - saying a key used to be here but has since been deleted - so that you continue looking at the next index? You can't just write null or 0, because then you would either think the slot was never used (null) or that a key with value 0 had overwritten 44.
One way to handle hash tables that use open addressing is to use state marks: EMPTY, OCCUPIED and DELETED. Note that there's an important distinction between EMPTY, which means the position has never been used, and DELETED, which means it was used but the key got deleted.
When a value gets removed, the slot is marked DELETED, not EMPTY. When you try to retrieve a value, you probe until you find a slot that is marked EMPTY; in other words, during lookup you treat DELETED slots the same as OCCUPIED ones. Note that insertion can ignore this distinction - you can insert into either a DELETED or an EMPTY slot.
The question is tagged Java, which is a bit misleading, because Java (or at least Oracle's implementation of it) does not use open addressing. Open addressing gets especially problematic when the load factor gets high, because hash collisions occur much more often: performance drops dramatically once the load factor approaches the 0.7 mark. Most hash tables are therefore resized once their load factor passes a certain constant; Java, for example, doubles the size of its HashMap when the load factor passes 0.75.
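A compact sketch of linear probing with those three states (the names and the int-key simplification are mine; a real table would also resize once the load factor climbs too high):

public class LinearProbingTable {

    private enum State { EMPTY, OCCUPIED, DELETED }

    private final int[] keys;
    private final State[] states;

    public LinearProbingTable(int capacity) {
        keys = new int[capacity];
        states = new State[capacity];
        java.util.Arrays.fill(states, State.EMPTY);
    }

    private int hash(int key) {
        return key % keys.length;                       // e.g. 44 % 13 = 5
    }

    public void insert(int key) {
        int i = hash(key);
        while (states[i] == State.OCCUPIED) {           // insertion may reuse DELETED slots
            i = (i + 1) % keys.length;
        }
        keys[i] = key;
        states[i] = State.OCCUPIED;
    }

    public boolean contains(int key) {
        int i = hash(key);
        // lookup probes past DELETED slots and only stops at EMPTY
        // (assumes the table never becomes completely full of OCCUPIED/DELETED slots)
        while (states[i] != State.EMPTY) {
            if (states[i] == State.OCCUPIED && keys[i] == key) {
                return true;
            }
            i = (i + 1) % keys.length;
        }
        return false;
    }

    public void remove(int key) {
        int i = hash(key);
        while (states[i] != State.EMPTY) {
            if (states[i] == State.OCCUPIED && keys[i] == key) {
                states[i] = State.DELETED;              // tombstone, not EMPTY
                return;
            }
            i = (i + 1) % keys.length;
        }
    }
}

With the question's example (table size 13), removing 32 marks slot 6 as DELETED, so a later contains(44) probes past slot 6 and still finds 44 in slot 7.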
It seems like you are trying to implement your own hash table (as opposed to using the Hashtable or HashMap included in Java), so it's more a data-structure question than a Java question.
That being said, implementing a hash table with open addressing (such as linear probing) is not very efficient when it comes to removing elements. The normal solution is to "pull up" all elements that are in the wrong slot, so that there are no gaps left in the probe sequence.
There is some pseudocode describing this quite well at Wikipedia:
http://en.wikipedia.org/wiki/Open_addressing
The hash table buckets aren't necessarily limited to storing a single value, so if two objects hash to the same location in the table they can both be stored. The collision only means that lookup will be slightly slower, because when looking for a value whose key hashes to a particular location, the table needs to check each entry there to see if it matches.
It sounds like you are describing a hash table where you only store a single entry at each index. The only way I can think of to do that is to add a field to the structure storing the value that indicates whether that position had a collision. Then, when doing a lookup, you'd check the key; if it matches, you have the value. If not, you would check whether there was a collision and then check the next position. On removal you'd have to leave the collision marker but delete the value and key.
If you use a hash table which uses this approach (which none of the built-in hash collections do), you need to traverse all the subsequent keys to see whether they need to be moved up (to avoid holes). Some might be for the same hash value and some might be collisions from unrelated hash codes. If you do this you are not left with any holes. For a hash map which is not too full, this shouldn't create much overhead.