I'm wondering if a max- or min-heap is allowed to have duplicate values? I've been unsuccessful in trying to find information about this with online resources alone.
Yes, they can. You can read about this in 'Introduction to Algorithms' (by Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein). According to the definition of binary heaps in Wikipedia:
All nodes are either greater than or equal to (max heaps) or
less than or equal to (min heaps) each of its children, according
to a comparison predicate defined for the heap.
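For what it's worth, Java's own PriorityQueue (a binary heap) demonstrates this directly; here is a tiny example of my own showing duplicates going in and coming out in heap order:

```java
import java.util.Collections;
import java.util.PriorityQueue;

public class DuplicateHeap {
    public static void main(String[] args) {
        // PriorityQueue is a binary heap; reverseOrder() makes it a max-heap.
        PriorityQueue<Integer> maxHeap =
                new PriorityQueue<>(Collections.reverseOrder());
        maxHeap.add(5);
        maxHeap.add(3);
        maxHeap.add(5); // duplicate value, accepted without complaint
        while (!maxHeap.isEmpty()) {
            System.out.print(maxHeap.poll() + " "); // prints: 5 5 3
        }
    }
}
```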
Yes, they can have duplicates. From the Wikipedia definition of a heap:
Either the keys of parent nodes are always greater than or equal
to those of the children and the highest key is in the root node (this
kind of heap is called max heap) or the keys of parent nodes are
less than or equal to those of the children and the lowest key is in the root node (min heap)
So since parent and child nodes are allowed to be equal, the heap can contain duplicate values.
Yes, but I would say no. Duplicates are allowed, yet for efficiency you shouldn't have separate nodes with duplicate values, or the structure loses its purpose a bit (you would have to search child nodes to find all the copies). Instead, you could design each node to contain a variable that records how many copies of that value exist in your data.
Again, this is my opinion. If this is a bad way of doing it, I would love for someone to explain why; I might just have to do some efficiency testing. For simple data types like ints I'd expect it to be less efficient, but for larger object nodes that have ids it has worked nicely, it seems.
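Here is a minimal sketch of the counting idea (my illustration, not tested for efficiency): keep each distinct value in the heap once and track multiplicity on the side.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Each distinct value appears once in the heap; the map stores its count.
class CountedMaxHeap {
    private final PriorityQueue<Integer> heap =
            new PriorityQueue<>((a, b) -> b.compareTo(a)); // max-heap order
    private final Map<Integer, Integer> counts = new HashMap<>();

    void add(int value) {
        // Insert a heap node only the first time we see this value.
        if (counts.merge(value, 1, Integer::sum) == 1) {
            heap.add(value);
        }
    }

    int removeMax() { // assumes the heap is not empty
        int max = heap.peek();
        // Drop the heap node only when the last copy is removed.
        if (counts.merge(max, -1, Integer::sum) == 0) {
            counts.remove(max);
            heap.poll();
        }
        return max;
    }
}
```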
I have a need for a data structure that will be able to give preceding and following neighbors for a given int that is part of the structure.
Some criteria I've set for myself:
write once, read many times
contain 100 to 1000 ints
be efficient: order of magnitude O(1)
be memory efficient (size of the ints + some housekeeping bits ideally)
implemented in pure Java (no libraries for this, as I want to learn)
items are unique
no concurrency requirements
ints are ordered externally, that order will most likely not be a natural ordering, and that order must be preserved (i.e. there is no contract whatsoever regarding the difference in value between two neighboring ints - any int may be greater or smaller than the int it precedes in the order).
This is in Java, and is mostly theoretical, as I've started using the solution described below.
Things I've considered:
LinkedHashSet: very quick to find an item, order of O(1), and very quick to retrieve next neighbor. No apparent way to get previous neighbor without reverse sorting the set. Boxed Integer objects only.
int[]: very easy on memory because no boxing required, very quick to get previous and next neighbor, retrieval of an item is O(n) though because index is not known and array traversal is required, and that is not acceptable.
What I'm using now is a combination of int[] and HashMap (sketched below):
HashMap for retrieving index of a specific int in the int[]
int[] for retrieving the neighbors of that int
What I like:
neighbor lookup is two O(1) operations (one hash lookup, one array access)
int[] does not do boxing
performance is theoretically very good
What I dislike:
HashMap does boxing twice (key and value)
the ints are stored twice (in both the map and the array)
theoretical memory use could be improved quite a bit
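Here is roughly what that combination looks like (a minimal sketch of my own description; names are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

class NeighborLookup {
    private final int[] order;                       // ints in external order
    private final Map<Integer, Integer> index = new HashMap<>();

    NeighborLookup(int[] externallyOrdered) {
        this.order = externallyOrdered.clone();
        for (int i = 0; i < order.length; i++) {
            index.put(order[i], i);                  // value -> position
        }
    }

    // Preceding neighbor; throws if the value is absent or first.
    int previous(int value) { return order[index.get(value) - 1]; }

    // Following neighbor; throws if the value is absent or last.
    int next(int value) { return order[index.get(value) + 1]; }
}
```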
I'd be curious to hear of better solutions.
One solution is to sort the array when you add elements. That way, the previous element is always i-1 and to locate a value, you can use a binary search which is O(log(N)).
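For instance (a minimal sketch with made-up values; bounds checks omitted), noting this assumes sorting by value is acceptable:

```java
import java.util.Arrays;

public class SortedNeighbors {
    public static void main(String[] args) {
        int[] a = {42, 7, 19, 3};
        Arrays.sort(a);                         // neighbors are now at i-1 and i+1
        int i = Arrays.binarySearch(a, 19);     // O(log N) lookup
        System.out.println(a[i - 1] + " < 19 < " + a[i + 1]); // 7 < 19 < 42
    }
}
```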
The next obvious candidate is a balanced binary tree. For this structure, insert is somewhat expensive but lookup is again O(log(N)).
If the values don't span the full 32-bit range, you can make the lookup faster with a second array in which the index is the value you're looking for and each entry is that value's index in the first array.
More options: you could look at bit sets, but that again depends on the range the values can have.
Commons Lang has a hash map which uses primitive int as keys: http://grepcode.com/file/repo1.maven.org/maven2/commons-lang/commons-lang/2.6/org/apache/commons/lang/IntHashMap.java
but the type is internal, so you'd have to copy the code to use it.
That means you don't need to autobox anything (unboxing is cheap).
Related:
http://java-performance.info/implementing-world-fastest-java-int-to-int-hash-map/
HashMap and int as key
ints are ordered externally, that order will most likely not be a natural ordering, and that order must be preserved (ie. there is no contract whatsoever regarding the difference in value between two neighboring ints).
This says "Tree" to me. Like Aaron said, expensive insert but efficient lookup, which is what you want if you have write once, read many.
EDIT: Thinking about this a bit more, if a value can only ever have one child and one parent, and given all your other requirements, I think ArrayList will work just fine. It's simple and very fast, even though it's O(n). But if the data set grows, you'll probably be better off using a Map-List combo.
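Something like this (a minimal sketch with made-up values; indexOf is the O(n) scan):

```java
import java.util.ArrayList;
import java.util.List;

public class ListNeighbors {
    public static void main(String[] args) {
        List<Integer> order = new ArrayList<>(List.of(42, 7, 19, 3));
        int i = order.indexOf(19);                  // O(n), fine for ~1000 ints
        Integer previous = i > 0 ? order.get(i - 1) : null;
        Integer following = i < order.size() - 1 ? order.get(i + 1) : null;
        System.out.println(previous + " .. " + following); // 7 .. 3
    }
}
```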
Keep in mind when working with these structures that the theoretical O() performance doesn't always correspond to real-world performance. You need to take into account your dataset size and overall environment. One example: ArrayList and HashMap. In theory, a List is O(n) for unsorted lookup while a Map is O(1), but there's a lot of overhead in creating and managing entries for a map, which can actually give worse performance on smaller sets than a List.
Since you say you don't have to worry about memory, I'd stay away from arrays. The complexity of managing the size isn't worth it at your specified data set size.
I have two array lists and I would like to link elements of the first list to elements of the second list. Elements have a property, say A.
The condition is: an element of the first list with a high value of element.getA() prefers to link with an element of the second list with a low value of A.
I understand that for selecting an element according to a biased probability I can calculate the cumulative probabilities and then do something like in this question: Selecting nodes with probability proportional to trust
Let's see if this is clearer: think about the preferential attachment mechanism. In that case, a node links to another node with a probability that increases with the degree of the chosen node. I simply would like to tweak preferential attachment and bias the probability for a node to link to another node not only on a property of the second node, but also on a property of the first node. And I want this to be inverse: small nodes prefer to link to big nodes, and big nodes prefer to link to small nodes.
[edited]
For each pair, calculate the difference (or absolute difference, or difference squared). Then use that difference as a weight to select one pair.
Remove pairs that are no longer valid and repeat.
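A minimal sketch of that weighting step (my illustration; the Pair type and values are made up):

```java
import java.util.List;
import java.util.Random;

public class WeightedPairing {
    record Pair(double a, double b) {}

    // Picks a pair with probability proportional to |a - b|,
    // using the cumulative-weight technique mentioned in the question.
    static Pair pickWeighted(List<Pair> pairs, Random rnd) {
        double total = 0;
        for (Pair p : pairs) total += Math.abs(p.a() - p.b());
        double target = rnd.nextDouble() * total;   // point on the cumulative scale
        double cumulative = 0;
        for (Pair p : pairs) {
            cumulative += Math.abs(p.a() - p.b());
            if (target < cumulative) return p;
        }
        return pairs.get(pairs.size() - 1);         // guard against rounding error
    }

    public static void main(String[] args) {
        List<Pair> pairs = List.of(new Pair(10, 1), new Pair(5, 4), new Pair(9, 2));
        System.out.println(pickWeighted(pairs, new Random()));
    }
}
```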
The more I read about tries the more confused I get for some reason.
What confuses me now is the following:
I have read about 2 types of implementation.
Using arrays to represent the characters (not storing the characters
themselves), where each node also stores the index of the actual word
(if a word ends at that node).
Using a collection of nodes that store characters, with a boolean at
the end of each node to determine whether we have reached a word going
down this path.
In the first case it is not mentioned, but it seems that we must actually keep all the dictionary words (since we indirectly reference them). So we have roughly array_size * numberOfNodes * lengthOfWord, plus the size of the processed dictionary.
In the latter case we don't need the dictionary, since the chars are stored directly in the tree. So it seems to me that the second implementation is more space-efficient, but I am not sure by how much.
Is my understanding correct on the implementations and is there specific reasons to choose one over the other? Also how could we calculate the space requirements for the second case?
Tries do not store the original words anywhere; instead, they store them implicitly. The basic structure of a trie is the following: each node in the trie stores
A single bit determining whether or not the path that arrives at the node forms a word, and
A collection of pointers to child nodes labeled by characters.
To determine whether a word is in the trie, you start at the root, then follow the appropriately-labeled pointers one at a time. If you arrive at a node marked as a word, then the word exists in the trie. If you arrive at a node that isn't marked, or you fall off the trie, the word is not present.
The difference between the two structures you have listed above is how the child pointers are stored. In the first version, the child pointers are stored as an array of one pointer per symbol in the alphabet, which makes following child pointers extremely fast but can be extremely space-inefficient. In the second version, you explicitly store some type of collection holding just the labeled pointers you need. This is slower, but is more space efficient for sparse tries.
The space usage of a trie depends on the number of nodes (call it n), size of the alphabet (call it k), and the way in which child pointers are represented. If you store a fixed-sized array of pointers, then the space usage is about kn pointers (n nodes with k pointers each), plus n bits for the markers at each node. If you have, say, a dynamic array of pointers stored in sorted order, the overhead will be n total child pointers, plus n bits, plus n times the amount of space necessary to store a single collection.
The advantage of the first approach is speed and simplicity, with very good performance on dense tries. The second is slower but more memory efficient for sparse tries.
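In Java, the two layouts might look like this (a minimal sketch of my own, assuming a lowercase 'a'..'z' alphabet for the array version):

```java
import java.util.HashMap;
import java.util.Map;

// Version 1: fixed array of children, one slot per letter.
// Fast to follow, but 26 pointers per node even in a sparse trie.
class ArrayTrieNode {
    boolean isWord;                                   // marks the end of a word
    ArrayTrieNode[] children = new ArrayTrieNode[26]; // index = letter - 'a'
}

// Version 2: a map holding only the children that actually exist.
// Slower per step, but space proportional to the real fan-out.
class MapTrieNode {
    boolean isWord;
    Map<Character, MapTrieNode> children = new HashMap<>();
}
```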
These are not the only space optimizations possible. Patricia tries compress nodes with just one child together and are very space-efficient. DAWGs try to merge as many nodes as possible together, but do not support efficient insertions.
Hope this helps!
I'm faced with a problem which requires a Queue data structure supporting fast k-th largest element finding.
The requirements of this data structure are as follows:
The elements in the queue are not necessarily integers, but they must be comparable to each other, i.e. we can tell which one is greater when we compare two elements (they can be equal as well).
The data structure must support enqueue (adds the element at the tail) and dequeue (removes the element at the head).
It can quickly find the k-th largest element in the queue; please note k is not a constant.
You can assume that the operations enqueue, dequeue, and k-th largest element finding all occur with the same frequency.
My idea is to use a modified balanced binary search tree. The tree is the same as an ordinary balanced binary search tree, except that every node_i is augmented with another field n_i, where n_i denotes the number of nodes contained in the subtree rooted at node_i. The aforementioned operations are supported as follows:
For simplicity assume that all elements are distinct.
Enqueue(x): x is first inserted into the tree; suppose the corresponding node is node_t. We append the pair (x, pointer to node_t) to the queue.
Dequeue: suppose (e_1, node_1) is the element at the head, where node_1 is the pointer into the tree corresponding to e_1. We delete node_1 from the tree and remove (e_1, node_1) from the queue.
K-th largest element finding: suppose the root is node_root, with children node_left and node_right (suppose both exist), and let n_left and n_right be their subtree sizes. We compare K with n_right; three cases may happen:
if K <= n_right, we find the K-th largest element in the right subtree of node_root;
if K > n_right + 1, we find the (K - n_right - 1)-th largest element in the left subtree of node_root;
otherwise (K = n_right + 1), node_root is the node we want.
The time complexity of all three operations is O(log N), where N is the number of elements currently in the queue.
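A minimal sketch of the selection step in Java (my illustration; balancing and the queue bookkeeping are omitted):

```java
class Node {
    int key;
    int size = 1;          // number of nodes in this subtree (the n_i field)
    Node left, right;

    Node(int key) { this.key = key; }

    static int sizeOf(Node n) { return n == null ? 0 : n.size; }

    // Returns the K-th largest key (1-based) in the subtree rooted at n.
    static int kthLargest(Node n, int k) {
        int rightSize = sizeOf(n.right);
        if (k <= rightSize) {
            return kthLargest(n.right, k);            // answer is among larger keys
        } else if (k == rightSize + 1) {
            return n.key;                             // the root itself
        } else {
            return kthLargest(n.left, k - rightSize - 1);
        }
    }

    public static void main(String[] args) {
        Node root = new Node(5);
        root.left = new Node(3);
        root.right = new Node(8);
        root.size = 3;
        System.out.println(kthLargest(root, 2)); // prints 5
    }
}
```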
How can I speed up the operations mentioned above? With what data structures and how?
Note - you cannot achieve better than O(log n) for all three; at best you need to "choose" which op you care about the most. (Otherwise, you could sort in O(n) by feeding the array to the DS and querying the 1st, 2nd, 3rd, ..., nth elements.)
Using a skip list instead of a Balanced BST as the sorted structure
can reduce dequeue complexity to O(1) average case. It does
not affect complexity of any other op.
To remove from a skip list - all you need to do is to get to the element using the pointer from the head of the queue, and follow the links up and remove each. The expected number of nodes needed to be deleted is 1 + 1/2 + 1/4 + ... = 2.
Finding the K-th can be achieved in O(log K) by starting from the leftmost node (and not the root) and making your way up until you find you have "more children than needed", and then treating the just-found node as the root, just like the algorithm in the question. Though it is better in asymptotic complexity, the constant factor is doubled.
I found an interesting paper:
Sliding-Window Top-k Queries on Uncertain Streams published in VLDB 2008 and cited by 71.
https://www.cse.ust.hk/~yike/wtopk.pdf
VLDB is one of the top conferences in the database research area, and the citation count suggests the data structure actually works.
The paper looks pretty difficult, but if you really need to improve your data structure, I suggest you read this paper or the papers in its reference list.
You can also use a finger tree.
For example, a priority queue can be implemented by labeling the internal nodes with the minimum priority of their children, and an indexed list/array can be implemented by labeling nodes with the count of the leaves beneath them. Finger trees can provide amortized O(1) cons, reversing, and cdr, and O(log n) append and split; and they can be adapted to be indexed or ordered sequences.
Also note that being a purely functional structure makes this a good choice for concurrent usage.
I am confused as to how the trie implementation saves space and stores data in the most compact form!
If you look at the tree below: when you store a character at any node, you also need to store a reference to it, and thus for each character of the string you need to store a reference.
OK, we saved some space when a common character arrived, but we lost more space in storing references to those character nodes.
So isn't there a lot of structural overhead to maintain this tree itself? If a TreeMap were used instead, let's say to implement a dictionary, it could have saved a lot more space, since each string would be kept in one piece and no space would be wasted in storing references, isn't it?
To save space when using a trie, one can use a compressed trie (also known as a patricia trie or radix tree), for which one node can represent multiple characters:
In computer science, a radix tree (also patricia trie or radix trie)
is a space-optimized trie data structure where each node with only one
child is merged with its child. The result is that every internal node
has at least two children. Unlike in regular tries, edges can be
labeled with sequences of characters as well as single characters.
This makes them much more efficient for small sets (especially if the
strings are long) and for sets of strings that share long prefixes.
Example of a radix tree:
Note that a trie is usually used as an efficient data structure for prefix matching on a set of strings. A trie can also be used as an associative array (like a hash table) where the key is a string.
Space is saved when you have lots of words to be represented by the tree, because many words share the same path in the tree; the more words you have, the more space you save.
But there is a better data structure if you want to save space. A trie doesn't save as much space as a directed acyclic word graph (DAWG) does, because a DAWG shares common nodes throughout the structure, whereas a trie doesn't share nodes. The wiki entry explains this in much detail, so have a look at it.
Here is the difference (graphically) between Trie and DAWG:
The strings "tap", "taps", "top", and "tops" stored in a Trie (left) and a DAWG (right), EOW stands for End-of-word.
The tree on the left side is the trie, and the tree on the right is the DAWG. Compare them and see how the DAWG saves space efficiently. A trie has duplicate nodes that represent the same letter/subword, while a DAWG has exactly one node for each letter/subword.
It's not about cheap space in memory, it's about precious space in a file or on a communications link. With an algorithm that builds that trie, we can send 'ten' in three bits, left-right-right. Compared to the 24 bits 'ten' would take up uncompressed, that's a huge savings of valuable disk space or transfer bandwidth.
You might deduce that it saves space only on an ideal machine where every byte is allocated efficiently. However, real machines allocate aligned blocks of memory (8 bytes in Java, 16 bytes in some C++ allocators), and so it may not save any space.
Java Strings and collections add a relatively high amount of overhead, so the percentage difference can be very small.
Unless your structure is very large, the value of your time outweighs the memory cost, and using the simplest, most standard, and easiest-to-maintain collection is far more important. E.g., your time can very easily be worth 1000x or more the value of the memory you are trying to save.
E.g., say you have 10,000 names and you can save 16 bytes each by using a trie (assuming this can be proven without taking more time). That equates to 160 KB, which at today's prices is worth a fraction of a cent. If your time costs your company $30 per hour, the cost of writing one line of tested code might be $1.
If you have to think about it even a blink of an eye longer to save 160 KB, it's unlikely to be worth it for a PC. (Mobile devices are a different story, but the same argument applies IMHO.)
EDIT: You have inspired me to add an update http://vanillajava.blogspot.com/2011/11/ever-decreasing-cost-of-main-memory.html
Guava may indeed store the key at each level, but the point to realize is that the key does not really need to be stored, because the path to the node completely defines the key for that node. All that actually needs to be stored at each node is a single boolean indicating whether this node marks the end of a word.
Tries, like any other structure, excel at storing certain types of data. Specifically, tries are best at storing strings that share a common root. Think of storing full-path directory listings for example.