I have c.1,000,000 objects that need to be stored in some form of data structure. They must be unique by a key (ID), but sorted according to their date. I'm therefore trying to work out the best way of storing them.
Performance (in terms of time taken to execute) is the primary goal, and then memory usage. My idea was to put the objects into a Tree, so they are sorted by date as they enter the data structure and I can then return them in order. However, I think this is going to be horrendously slow for finding a single object based on its ID. One thought that did occur to me was to have a secondary structure which links IDs to dates so I can reduce the time taken to find a single object, or to just store everything by ID anyway (perhaps in a HashTable) and then sort all 1,000,000 objects when I want to return them (although this seems to take a very long time).
Key Points:
Objects may be added afterwards so the c.1,000,000 objects ARE NOT fixed. They WILL NOT be updated or removed.
I MAY NOT use Java's built in Comparator.
I am optimising for efficiency of returning the data, whether this be the complete set in order (by date) or a single object obtained from its ID.
If performance is your chief concern before memory usage, I'd go with two data structures:
ArrayList<YourClass> instancesByDate;
and
HashMap<SomeId,YourClass> instancesById;
This gives you the fastest traversal by date (provided the list is kept sorted by date as objects are added) and O(1) lookup by ID (depending on hashCode(), obviously).
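A minimal sketch of the two-structure idea, under some assumptions that are not in the question: the names `Item` and `ItemStore` are hypothetical, the ID is a String, and the date is held as epoch milliseconds in a `long`. Sort order is maintained with a hand-written binary search on insertion, so no Comparator is involved.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical record type: an ID plus a date stored as epoch milliseconds.
final class Item {
    final String id;
    final long dateMillis;

    Item(String id, long dateMillis) {
        this.id = id;
        this.dateMillis = dateMillis;
    }
}

// Hypothetical store combining a date-sorted list with an ID index.
final class ItemStore {
    private final List<Item> instancesByDate = new ArrayList<>();
    private final Map<String, Item> instancesById = new HashMap<>();

    /** Adds the item unless its ID is already present; returns true if it was added. */
    boolean add(Item item) {
        if (instancesById.containsKey(item.id)) {
            return false; // IDs must be unique
        }
        instancesById.put(item.id, item);

        // Binary search for the first position whose date is later than the new item's date.
        int lo = 0;
        int hi = instancesByDate.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (instancesByDate.get(mid).dateMillis <= item.dateMillis) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        instancesByDate.add(lo, item); // shifts later elements; cheap when appending at the end
        return true;
    }

    /** Average O(1) lookup by ID; returns null when the ID is unknown. */
    Item findById(String id) {
        return instancesById.get(id);
    }

    /** Read-only view of all items, already in ascending date order. */
    List<Item> inDateOrder() {
        return Collections.unmodifiableList(instancesByDate);
    }
}
```

If new objects mostly arrive with later dates than the existing ones, the insert lands at the end of the list and costs amortised O(1). Inserting into the middle shifts elements and is O(n), so a workload with many out-of-order dates may be better served by a tree-based structure for the date index.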
How about using a hashtable of ID => yourobject for the ID lookups, and a secondary hashtable of date (at some level of granularity) => Vector<yourobject>? You could choose the 'granularity' of the date to ensure you've a moderate number of objects in each vector - and sort each by date.
I have a String that I need to search for in a collection of Strings. I'll need to search for multiple representations of the required String (original representation, trimmed, UTF-8 encoded, non-ASCII characters encoded). The collection size will be in the order of thousands.
I'm trying to figure out what's the best representation to use for the collection in order to have the best performance:
ArrayList - iterate over the list and check whether any element matches any of the String's representations
HashMap - check whether the map contains any of the String's representations
Any other?
Generally speaking, a HashMap (or any other hashtable-based data structure) is strongly preferred for lookups. The reason is simple: those data structures support lookup in constant time on average (independent of collection size).
But in your scenario (a single query against the collection), you probably will not gain any performance improvement from using a HashMap instead of an ArrayList. Reasons:
Putting the data into a HashMap takes some time. Not a significant amount, but comparable to one full pass over the initial list.
Your collection is pretty small: iterating over 5000 elements is a matter of a couple of milliseconds (or less). Since you need to search only once, you will not save much time.
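To make the trade-off concrete, here is a small sketch of both options. The class and method names are hypothetical, and producing the alternative representations (trimmed, encoded, and so on) is assumed to happen elsewhere.

```java
import java.util.List;
import java.util.Set;

// Hypothetical helper comparing a linear scan with a hash-based probe.
final class RepresentationSearch {

    /** Linear scan: no setup cost, one pass over the collection per query. */
    static boolean containsAnyLinear(List<String> collection, List<String> representations) {
        for (String candidate : collection) {
            if (representations.contains(candidate)) {
                return true;
            }
        }
        return false;
    }

    /** Hash probe: assumes the set was built beforehand, then each probe is O(1) on average. */
    static boolean containsAnyHashed(Set<String> index, List<String> representations) {
        for (String representation : representations) {
            if (index.contains(representation)) {
                return true;
            }
        }
        return false;
    }
}
```

Building the index with something like `new HashSet<>(collection)` costs roughly one pass over the data, which is why it only pays off when the same collection is queried more than once.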
I have a Java class which contains two Strings, for example the name of a person and the name of the group.
I also have a list of groups (about 10) and a list of persons (about 100). The list of my data objects is larger; it can exceed 10,000 items.
Now I would like to search through my data objects such that I find all objects having a person from the person list and a group from the group list.
My question is: what is the best data structure for the person and group list?
I could use an ArrayList and simply iterate until I find a match, but that is obviously inefficient. A HashSet or HashMap would be much better.
Are there even more efficient ways to solve this? Please advise.
Every data structure has pros and cons.
A Map is used to retrieve data in O(1) if you have an access key.
A List is used to maintain an order between elements, but accessing an element by key is not possible: you need to loop over the whole list, which takes O(n).
A good data structure for storing and looking up strings is a Trie:
It's essentially a tree structure which uses characters or substrings to denote paths to follow.
Advantages over hash-maps (quote from Wikipedia):
Looking up data in a trie is faster in the worst case, O(m) time (where m is the length of a search string), compared to an imperfect hash table. An imperfect hash table can have key collisions. A key collision is the hash function mapping of different keys to the same position in a hash table. The worst-case lookup speed in an imperfect hash table is O(N) time, but far more typically is O(1), with O(m) time spent evaluating the hash.
There are no collisions of different keys in a trie.
Buckets in a trie, which are analogous to hash table buckets that store key collisions, are necessary only if a single key is associated with more than one value.
There is no need to provide a hash function or to change hash functions as more keys are added to a trie.
A trie can provide an alphabetical ordering of the entries by key.
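The properties listed above can be illustrated with a compact sketch. This is a bare-bones trie written for illustration, not a tuned implementation; the class name `Trie` and its methods are assumptions. Children are kept in a `TreeMap`, so traversal order follows `char` values rather than locale-aware alphabetical order, and each step costs a lookup in that small map.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical minimal trie over the characters of a String.
final class Trie {
    private static final class Node {
        // Sorted children make an in-order walk return keys in ascending char order.
        final Map<Character, Node> children = new TreeMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    /** Inserts a word in O(m) steps, where m is the word length. */
    void insert(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.isWord = true;
    }

    /** Looks up a word in O(m) steps; no hash function and no collisions involved. */
    boolean contains(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.get(word.charAt(i));
            if (node == null) {
                return false;
            }
        }
        return node.isWord;
    }

    /** Returns all stored words in ascending order by walking the tree depth-first. */
    List<String> sortedWords() {
        List<String> out = new ArrayList<>();
        collect(root, new StringBuilder(), out);
        return out;
    }

    private void collect(Node node, StringBuilder prefix, List<String> out) {
        if (node.isWord) {
            out.add(prefix.toString());
        }
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            prefix.append(e.getKey());
            collect(e.getValue(), prefix, out);
            prefix.deleteCharAt(prefix.length() - 1);
        }
    }
}
```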
I agree with Davide's answer. If we want fast lookup and also want to maintain order, we can use the LinkedHashMap implementation of Map.
By using it, we get both things:
Data retrieval by key, if we have the access key.
Insertion order is maintained, so while iterating we get the data in the same order in which it was inserted.
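A short sketch of both behaviours; the class name and the sample keys are made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo of keyed lookup plus insertion-ordered iteration.
final class InsertionOrderExample {

    /** Builds a map whose keys are deliberately not in alphabetical order. */
    static Map<String, Integer> build() {
        Map<String, Integer> ages = new LinkedHashMap<>();
        ages.put("zoe", 30);
        ages.put("adam", 25);
        ages.put("mia", 41);
        return ages;
    }

    /** Copies the keys in the order the map iterates them. */
    static List<String> keysInIterationOrder(Map<String, Integer> map) {
        return new ArrayList<>(map.keySet());
    }
}
```

Here `build().get("adam")` is a hash lookup, while iterating the map visits zoe, adam, mia in that order. Note that insertion order is not the same thing as sorted order.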
Depending on the scenario (if you have the data before receiving the lists of groups/people), preprocessing the data would save you time.
Comparing the data to the groups/people lists will require at least 10,000 lookups. Comparing the groups/people lists to the data will require at most 10*100 = 1,000 lookups, or fewer if you compare against each group one at a time (10+100 = 110 lookups).
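For the simpler, non-preprocessed variant of this question, one pass over the data with two hash sets is usually enough. The sketch below assumes a hypothetical `Membership` class for the two-String data object; it is not the preprocessing approach described above, just the HashSet baseline the question mentions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical data object holding the two strings from the question.
final class Membership {
    final String person;
    final String group;

    Membership(String person, String group) {
        this.person = person;
        this.group = group;
    }
}

// Hypothetical filter: one pass over the data, two average O(1) probes per object.
final class MembershipFilter {

    static List<Membership> matching(List<Membership> data, Set<String> persons, Set<String> groups) {
        List<Membership> result = new ArrayList<>();
        for (Membership m : data) {
            // Both sets are expected to be HashSets, so each contains() is O(1) on average.
            if (groups.contains(m.group) && persons.contains(m.person)) {
                result.add(m);
            }
        }
        return result;
    }
}
```

With about 10,000 data objects this is roughly 20,000 hash probes at most, regardless of how long the person and group lists are.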
I read the Android documentation for the "SparseBooleanArray" class, but I still don't understand what the class is for. For what purpose would we need to use it?
Here is the Doc Link
http://developer.android.com/reference/android/util/SparseBooleanArray.html
From what I get from the documentation it is for mapping Integer values to booleans.
That is, suppose you want to record, for each userID, whether a widget should be shown, and some userIDs have already been deleted. You would then have gaps in your mapping.
Meaning, with a normal array, you would create an array of size=maxID and add a boolean value to element at index=userID. Then when iterating over the array, you would have to iterate over maxID elements in the worst case and have to check for null if there is no boolean for that index (eg. the ID does not exist). That is really inefficient.
When using a hashmap you could map the ID to the boolean, but with the added overhead of generating the hash value for the key (that is why it is called a *hash*map), which would ultimately hurt performance in both CPU cycles and RAM usage.
So that SparseBooleanArray seems like a good middleway of dealing with such a situation.
NOTE: Even though my example is really contrived, I hope it illustrates the situation.
Like the javadoc says, SparseBooleanArrays map integers to booleans, which basically means that it's like a map with an Integer as key and a boolean as value (Map<Integer, Boolean>).
However, it is intended to be more efficient than using a HashMap to map Integers to Booleans.
Hope this clears up any issues you had with the description.
I found a very specific and wonderful use for the sparse boolean array.
You can put a true or false value to be associated with a position in a list.
For example: List item #7 was clicked, so putting 7 as the key and true as the value.
There are three ways to store resource IDs:
1 Array
A boolean array with the IDs as indexes. If an ID has been used, set its element to true, else false.
Although all operations are fast, this implementation requires a huge amount of space, so it can't be used.
High Space Complexity
2 HashMap
Key-ID
Value-Boolean True/False
With this approach, each ID has to be run through the hashing function, which consumes memory. There may also be empty locations where no ID is stored, and we need to deal with collisions. So due to its usage complexity and medium space complexity, it is not used.
Medium Space Complexity
3 SparseBooleanArray
It is a middle way. It combines a mapping with an array implementation.
Key - ID
Value - Boolean True/False
It is backed by arrays that store the IDs in increasing order, so minimal space is used because only the IDs actually in use are stored. Binary search is used to find an ID.
Binary search, at O(log n), is slower than hashing at O(1) or array indexing at O(1), so all operations (insertion, deletion, searching) take more time, but there is the least memory wastage. So to save memory we prefer SparseBooleanArray.
Least Space Complexity
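The third option can be sketched in plain Java as follows. This is not Android's source code: the class `SimpleSparseBooleanArray` is a hypothetical stand-in that only illustrates the sorted-keys-plus-binary-search idea described above, using primitive arrays to avoid boxing.

```java
import java.util.Arrays;

// Hypothetical plain-Java illustration of a sparse int -> boolean mapping.
final class SimpleSparseBooleanArray {
    private int[] keys = new int[8];          // kept sorted in ascending order
    private boolean[] values = new boolean[8]; // values[i] belongs to keys[i]
    private int size = 0;

    /** Inserts or overwrites the value for a key; O(log n) search plus O(n) shift on insert. */
    void put(int key, boolean value) {
        int i = Arrays.binarySearch(keys, 0, size, key);
        if (i >= 0) {
            values[i] = value; // key already present: overwrite
            return;
        }
        int insertAt = -(i + 1); // binarySearch encodes the insertion point as -(point) - 1
        if (size == keys.length) {
            keys = Arrays.copyOf(keys, size * 2);
            values = Arrays.copyOf(values, size * 2);
        }
        // Shift the tail one slot to the right to make room for the new key.
        System.arraycopy(keys, insertAt, keys, insertAt + 1, size - insertAt);
        System.arraycopy(values, insertAt, values, insertAt + 1, size - insertAt);
        keys[insertAt] = key;
        values[insertAt] = value;
        size++;
    }

    /** Returns the stored value, or the supplied default when the key is absent. */
    boolean get(int key, boolean valueIfMissing) {
        int i = Arrays.binarySearch(keys, 0, size, key);
        return i >= 0 ? values[i] : valueIfMissing;
    }

    /** Number of keys actually stored, independent of how large the keys are. */
    int size() {
        return size;
    }
}
```

Storing keys 7 and 1,000,000 uses two slots here, whereas a plain boolean array indexed by ID would need over a million elements.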
Basically I'm looking for the best data structure in Java in which I can store pairs and retrieve the top N elements by value. I'd like to do this in O(n) time, where n is the number of entries in the data structure.
example input would be,
<"john", 32>
<"dave", 3>
<"brian", 15>
<"jenna", 23>
<"rachael", 41>
and if N=3, i should be able to return rachael, john, jenna if i wanted descending order.
if i use some kind of hashMap, insertion is fast, but retrieving them by order gets expensive.
if i use some data structure that keeps things ordered, then insertion becomes expensive while retrieving is cheaper. i was not able to find the best data structure that can do both very well and very fast.
any input is appreciated. thanks.
[updated]
let me ask the question in other way if that make it clearer.
i know i can insert at constant time O(1) into hashMap.
now, how can i retrieve elements in sorted order by value in O(n) time, where n = number of entries in the data structure? hope it makes sense.
If you want to sort, you have to give up constant O(1) time.
That is because unlike inserting an unsorted key / value pair, sorting will minimally require you to compare the new entry to something, and odds are to a number of somethings. Once you have an algorithm that will require more time with more entries (due to more comparisons) you have overshot "constant" time.
If you can do better, then by all means, do so! There is a Dijkstra Prize waiting for you, if not a Fields Medal to boot.
Don't despair: you can still do the key part with a HashMap, and the sorting part with a tree-like implementation, which will give you O(log n). TreeMap is probably what you want.
--- Update to match your update ---
No, you cannot iterate over a HashMap in sorted order in O(n) time. To do so would assume that you had a list, and that list would have to be sorted already. With a raw HashMap, you would have to search the entire map for the next "lower" value. Searching part of the map would not do, because the one element you didn't check could be the correct value.
Now, there are some data structures that make a lot of trade offs which might get you closer. If you want to roll your own, perhaps a custom Fibonacci heap can give you an amortized performance close to what you wish, but it cannot guarantee a worst-case performance. In any case, some operations (like extract-min) will still require O(log n) performance.
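If only the top N entries are needed rather than a full sort, one commonly used compromise is a size-bounded min-heap. The sketch below is a swapped-in technique, not something proposed in the answer above; the class and method names are hypothetical. It runs in O(n log N), which behaves like a single linear pass when N is small and fixed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical helper: top-N keys by value using a min-heap capped at N entries.
final class TopN {

    static List<String> topKeysByValue(Map<String, Integer> scores, int n) {
        List<String> result = new ArrayList<>();
        if (n <= 0) {
            return result;
        }
        // Min-heap ordered by value: the smallest of the current top N sits at the head.
        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>(Map.Entry.<String, Integer>comparingByValue());
        for (Map.Entry<String, Integer> entry : scores.entrySet()) {
            heap.offer(entry);
            if (heap.size() > n) {
                heap.poll(); // evict the smallest so only the N largest remain
            }
        }
        // Polling yields ascending order, so reverse to get descending order.
        while (!heap.isEmpty()) {
            result.add(heap.poll().getKey());
        }
        Collections.reverse(result);
        return result;
    }
}
```

Insertion into the backing HashMap stays O(1) on average, and the heap only ever holds N + 1 entries, so memory overhead is small.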
I understand the purpose of using hash functions to securely store passwords. I have used arrays and arraylists for class projects for sorting and searching data. What I am having trouble understanding is the practical value of hashtables for something like sorting and searching.
I got a lecture on hashtables but we never had to use them in school, so it hasn't clicked. Can someone give me a practical example of a task a hashtable is useful for that couldn't be done with a numerical array or arraylist? Also, a very simple low level example of a hash function would be helpful.
There are all sorts of collections out there. Collections are used for storing and retrieving things, so one of the most important properties of a collection is how fast these operations are. To estimate "fastness", people in computer science use big-O notation, which roughly describes how many individual operations it takes to carry out a certain method (be it get or set, for example). For example, to get an element of an ArrayList by index you need exactly 1 operation; this is O(1). If you have a LinkedList of length n and you need to get something from the middle, you have to traverse from the start of the list to the middle, taking n/2 operations; in this case get has a complexity of O(n). The same applies to key-value stores such as a hashtable. There are implementations that give you a complexity of O(log n) to get a value by its key, whereas a hashtable manages it in O(1). Basically it means that getting a value from a hashtable by its key is really cheap.
Basically, hashtables have similar performance characteristics to arrays with numerical indices (cheap lookup, and cheap appending for arrays; hashtables are unordered, and adding to them is cheap partly because of this), but are much more flexible in terms of what the key may be. Given a contiguous chunk of memory and a fixed size per item, you can get the address of the nth item very easily and cheaply. That's thanks to the indices being integers; you can't do that with, say, strings. At least not directly. Hashing allows reducing any object (that implements it) to a number, and you're back to arrays. You still need to check for hash collisions and resolve them (which incurs mostly a memory overhead, since you need to store the original key), but with a halfway decent implementation, this is not much of an issue.
So you can now associate any (hashable) object with any (really any) value. This has countless uses (although I have to admit, I can't think of one that's applicable to sorting or searching). You can build caches with small overhead (because checking if the cache can help in a given case is O(1)), implement a relatively performant object system (several dynamic languages do this), you can go through a list of (id, value) pairs and accumulate the values for identical ids in any way you like, and many other things.
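Since the question asks for a very simple low-level example, here is a sketch of a hash function and a tiny table built on it. Everything in it is illustrative: `TinyHashTable` is a hypothetical class with a fixed 16 buckets and no resizing, and `simpleHash` uses the same multiply-by-31 polynomial that `String.hashCode()` is documented to use.

```java
// Hypothetical fixed-size hash table mapping String keys to int values.
final class TinyHashTable {
    private static final class Entry {
        final String key;
        int value;
        Entry next; // next entry in the same bucket (collision chain)

        Entry(String key, int value, Entry next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    private final Entry[] buckets = new Entry[16];

    /** Reduces a string to a number: h = 31 * h + char, for each character. */
    static int simpleHash(String key) {
        int h = 0;
        for (int i = 0; i < key.length(); i++) {
            h = 31 * h + key.charAt(i);
        }
        return h;
    }

    /** Turns the hash into an array index; the mask clears the sign bit first. */
    private int indexFor(String key) {
        return (simpleHash(key) & 0x7fffffff) % buckets.length;
    }

    void put(String key, int value) {
        int i = indexFor(key);
        for (Entry e = buckets[i]; e != null; e = e.next) {
            if (e.key.equals(key)) { // the stored key is compared to resolve collisions
                e.value = value;
                return;
            }
        }
        buckets[i] = new Entry(key, value, buckets[i]);
    }

    /** Returns the value for the key, or null when the key is absent. */
    Integer get(String key) {
        for (Entry e = buckets[indexFor(key)]; e != null; e = e.next) {
            if (e.key.equals(key)) {
                return e.value;
            }
        }
        return null;
    }
}
```

The strings "Aa" and "BB" hash to the same number under this function, so they land in the same bucket; the chain and the `equals` check are what keep them apart.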
Very simple. Hashtables are often called "associative arrays." Arrays allow you to access your data by index. Hash tables allow you to access your data by any other identifier, e.g. a name. For example
one is associated with 1
two is associated with 2
So, when you get the word "one" you can find its value 1 using a hashtable where the key is "one" and the value is 1. An array allows only the opposite mapping.
For n data elements:
Hashtables allow O(k) searches, where k usually depends only on the cost of the hashing function. This is better than the O(log n) of binary search (which also requires an O(n log n) sort first; if the data is not sorted you are worse off).
However, on the flip side, the hashtables tend to take roughly 3n amount of space.