The more I read about tries the more confused I get for some reason.
What confuses me now is the following:
I have read about 2 types of implementation.
1) Using arrays to represent the characters (not storing the characters themselves), where each node also stores the index of the actual word (if the path to that node completes a word).
2) Using a collection of nodes that store characters, with a boolean at each node to mark whether the path down to it forms a word.
In the first case it is not mentioned, but it seems that we must actually keep all the dictionary words (since we only reference them indirectly). So we have array_size * numberOfNodes * lengthOfWord, plus the size of the processed dictionary.
In the latter case we don't need the dictionary, since the characters are stored directly in the tree. So it seems to me that the second implementation is more space efficient, but I am not sure by how much.
Is my understanding correct on the implementations and is there specific reasons to choose one over the other? Also how could we calculate the space requirements for the second case?
Tries do not store the original words anywhere; they store them implicitly. The basic structure of a trie is the following: each node in the trie stores
A single bit determining whether or not the path that arrives at the node forms a word, and
A collection of pointers to child nodes labeled by characters.
To determine whether a word is in the trie, you start at the root, then follow the appropriately-labeled pointers one character at a time. If you arrive at a node marked as a word, the word exists in the trie. If you arrive at a node that isn't marked, or you fall off the trie, the word is not present.
The difference between the two structures you have listed above is how the child pointers are stored. In the first version, the child pointers are stored as an array of one pointer per symbol in the alphabet, which makes following child pointers extremely fast but can be extremely space-inefficient. In the second version, you explicitly store some type of collection holding just the labeled pointers you need. This is slower, but is more space efficient for sparse tries.
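For concreteness, here is a minimal Java sketch of both node layouts (the class names are illustrative, not from any standard library):

    import java.util.HashMap;
    import java.util.Map;

    // Version 1: a fixed array of children, one slot per letter 'a'..'z'.
    // Following a child is a single array index, but empty slots waste space.
    class ArrayTrieNode {
        static final int ALPHABET_SIZE = 26;
        ArrayTrieNode[] children = new ArrayTrieNode[ALPHABET_SIZE];
        boolean isWord; // true if the path from the root to here spells a word
    }

    // Version 2: only the children that actually exist are stored, keyed by
    // character. Following a child costs a map lookup, but sparse nodes stay small.
    class MapTrieNode {
        Map<Character, MapTrieNode> children = new HashMap<>();
        boolean isWord;
    }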
The space usage of a trie depends on the number of nodes (call it n), size of the alphabet (call it k), and the way in which child pointers are represented. If you store a fixed-sized array of pointers, then the space usage is about kn pointers (n nodes with k pointers each), plus n bits for the markers at each node. If you have, say, a dynamic array of pointers stored in sorted order, the overhead will be n total child pointers, plus n bits, plus n times the amount of space necessary to store a single collection.
The advantage of the first approach is speed and simplicity, with very good performance on dense tries. The second is slower but more memory efficient for sparse tries.
These are not the only space optimizations possible. Patricia tries compress nodes with just one child together and are very space-efficient. DAWGs try to merge as many nodes as possible together, but do not support efficient insertions.
Hope this helps!
Related
Given a string of words separated by single spaces, I need to transfer each word in the string to a node in a linked list, and keep the list sorted lexically (like in a dictionary).
The first step I did is to move through the string and put every word in a separate node. Now I'm having a hard time sorting the list - it has to be done in the most efficient way.
Merge sort is O(n log n). Would merge sort be the best choice here?
Generally, if you had a list and wanted to sort it, merge sort is a good solution. But in your case you can do better.
You have a string separated by spaces; you break it up and put the words into the list's nodes, and then you want to sort the list.
You can do better by combining both steps:
1) Have a linked list with head and tail pointers, and in each node a pointer to the previous node.
2) As you extract each word from the sentence, store it in the list in sorted order: start from the head or the tail depending on whether the word is smaller or larger than the elements there, walk until you reach an element larger/smaller than the current one, and insert it at that location by updating the pointers (see the sketch below).
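A minimal sketch of that combined split-and-insert approach, simplified to a singly linked list with a dummy head node (all names here are illustrative):

    class SortedWordList {
        static class Node {
            String word;
            Node next;
            Node(String word) { this.word = word; }
        }

        private final Node head = new Node(null); // dummy head simplifies insertion

        // Walk forward until the next node is lexically >= the new word,
        // then splice the new node in at that position.
        void insertSorted(String word) {
            Node prev = head;
            while (prev.next != null && prev.next.word.compareTo(word) < 0) {
                prev = prev.next;
            }
            Node node = new Node(word);
            node.next = prev.next;
            prev.next = node;
        }

        // Split the sentence and insert each word at its sorted position.
        static SortedWordList fromSentence(String sentence) {
            SortedWordList list = new SortedWordList();
            for (String word : sentence.split(" ")) {
                list.insertSorted(word);
            }
            return list;
        }

        public static void main(String[] args) {
            SortedWordList list = fromSentence("banana apple cherry");
            for (Node n = list.head.next; n != null; n = n.next) {
                System.out.print(n.word + " "); // prints: apple banana cherry
            }
        }
    }

Note that this is effectively an insertion sort, O(n^2) in the worst case, so for long inputs an O(n log n) merge sort still wins; the benefit here is skipping the separate sorting pass on small or nearly-ordered input.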
Just use the built-in Collections.sort, which is a mergesort implementation. More specifically:
This implementation is a stable, adaptive, iterative mergesort that requires far fewer than n lg(n) comparisons when the input array is partially sorted, while offering the performance of a traditional mergesort when the input array is randomly ordered. If the input array is nearly sorted, the implementation requires approximately n comparisons. Temporary storage requirements vary from a small constant for nearly sorted input arrays to n/2 object references for randomly ordered input arrays.
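For example (a minimal sketch):

    import java.util.Arrays;
    import java.util.Collections;
    import java.util.LinkedList;
    import java.util.List;

    public class SortWords {
        public static void main(String[] args) {
            // Split the sentence, load the words into a linked list, and let
            // the library's mergesort do the work.
            List<String> words = new LinkedList<>(Arrays.asList("banana apple cherry".split(" ")));
            Collections.sort(words); // natural (lexicographic) String ordering
            System.out.println(words); // [apple, banana, cherry]
        }
    }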
I have a need for a data structure that will be able to give preceding and following neighbors for a given int that is part of the structure.
Some criteria I've set for myself:
write once, read many times
contain 100 to 1000 int
be efficient: order of magnitude O(1)
be memory efficient (size of the ints + some housekeeping bits ideally)
implemented in pure Java (no libraries for this, as I want to learn)
items are unique
no concurrency requirements
ints are ordered externally; that order will most likely not be a natural ordering, and that order must be preserved (i.e. there is no contract whatsoever regarding the difference in value between two neighboring ints - any int may be greater or smaller than the int it precedes in the order).
This is in Java, and is mostly theoretical, as I've started using the solution described below.
Things I've considered:
LinkedHashSet: very quick to find an item, O(1), and very quick to retrieve the next neighbor. No apparent way to get the previous neighbor without reverse-sorting the set. Stores boxed Integer objects only.
int[]: very easy on memory because no boxing is required, and very quick to get the previous and next neighbor, but retrieval of an item is O(n) because the index is not known and array traversal is required, and that is not acceptable.
What I'm using now is a combination of int[] and HashMap (sketched after the lists below):
HashMap for retrieving index of a specific int in the int[]
int[] for retrieving the neighbors of that int
What I like:
neighbor lookup is two O(1) operations
int[] does not do boxing
performance is theoretically very good
What I dislike:
HashMap does boxing twice (key and value)
the ints are stored twice (in both the map and the array)
theoretical memory use could be improved quite a bit
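For reference, a minimal sketch of that combination (the class name is hypothetical):

    import java.util.HashMap;
    import java.util.Map;

    // Build once, then look up neighbors in two O(1) steps: a map lookup
    // for the index, then an array access for the neighbor.
    class NeighborLookup {
        private final int[] order;                 // the externally-given order
        private final Map<Integer, Integer> index; // value -> position in order

        NeighborLookup(int[] order) {
            this.order = order;
            this.index = new HashMap<>(order.length * 2);
            for (int i = 0; i < order.length; i++) {
                index.put(order[i], i); // boxes key and value, as noted above
            }
        }

        // Error handling kept minimal: these throw if v is absent or at an end.
        int previous(int v) { return order[index.get(v) - 1]; }
        int next(int v)     { return order[index.get(v) + 1]; }
    }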
I'd be curious to hear of better solutions.
One solution is to sort the array when you add elements. That way, the previous element is always i-1 and to locate a value, you can use a binary search which is O(log(N)).
The next obvious candidate is a balanced binary tree. For this structure, insert is somewhat expensive but lookup is again O(log(N)).
If the values don't span the full 32-bit range, you can make the lookup faster by keeping a second array, indexed by value, where each slot stores that value's index in the first array.
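A small sketch of that inverse-index idea, assuming the values are unique and fall in a known small range:

    public class InverseIndex {
        public static void main(String[] args) {
            // values[] holds the ints in their external order; indexOf[v] gives
            // the position of value v in values[]. Assumes the values are unique
            // and fall in a known small range [0, maxValue].
            int[] values = {42, 7, 19, 3};
            int maxValue = 42;
            int[] indexOf = new int[maxValue + 1];
            for (int i = 0; i < values.length; i++) {
                indexOf[values[i]] = i;
            }
            System.out.println(values[indexOf[19] - 1]); // 7  (previous neighbor)
            System.out.println(values[indexOf[19] + 1]); // 3  (next neighbor)
        }
    }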
More options: You could look at bit sets but that again depends on the range which the values can have.
Commons Lang has a hash map which uses primitive int as keys: http://grepcode.com/file/repo1.maven.org/maven2/commons-lang/commons-lang/2.6/org/apache/commons/lang/IntHashMap.java
but the type is internal, so you'd have to copy the code to use it.
That means you don't need to autobox anything (unboxing is cheap).
Related:
http://java-performance.info/implementing-world-fastest-java-int-to-int-hash-map/
HashMap and int as key
ints are ordered externally, that order will most likely not be a natural ordering, and that order must be preserved (ie. there is no contract whatsoever regarding the difference in value between two neighboring ints).
This says "Tree" to me. Like Aaron said, expensive insert but efficient lookup, which is what you want if you have write once, read many.
EDIT: Thinking about this a bit more, if a value can only ever have one child and one parent, and given all your other requirements, I think ArrayList will work just fine. It's simple and very fast, even though it's O(n). But if the data set grows, you'll probably be better off using a Map-List combo.
Keep in mind when working with these structures that the theoretical performance in terms of O() doesn't always correspond to real-world performance. You need to take into account your dataset size and overall environment. One example: ArrayList and HashMap. In theory, a List is O(n) for unsorted lookup, while a Map is O(1). However, there's a lot of overhead in creating and managing entries for a map, which can actually give worse performance on smaller sets than a List.
Since you say you don't have to worry about memory, I'd stay away from array. The complexity of managing the size isn't worth it on your specified data set size.
I'm wondering if a max or min heap tree is allowed to have duplicate values? I've been unsuccessful in trying to find information regarding this with online resources alone.
Yes, they can. You can read about this in 'Introduction to Algorithms' (by Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein). According to the definition of binary heaps on Wikipedia:
All nodes are either greater than or equal to (max heaps) or less than or equal to (min heaps) each of its children, according to a comparison predicate defined for the heap.
Yes, they can have duplicates. From the Wikipedia definition of a heap:
Either the keys of parent nodes are always greater than or equal to those of the children and the highest key is in the root node (this kind of heap is called a max heap), or the keys of parent nodes are less than or equal to those of the children and the lowest key is in the root node (min heap).
So since a parent's key may equal a child's key, a heap can contain duplicates.
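You can see this directly with Java's PriorityQueue, which is a binary min-heap:

    import java.util.PriorityQueue;

    public class HeapDuplicates {
        public static void main(String[] args) {
            // PriorityQueue happily accepts duplicates: the heap property
            // only requires parent <= child.
            PriorityQueue<Integer> heap = new PriorityQueue<>();
            heap.add(5);
            heap.add(5);
            heap.add(3);
            System.out.println(heap.poll()); // 3
            System.out.println(heap.poll()); // 5
            System.out.println(heap.poll()); // 5
        }
    }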
Yes, but I would say no. For efficiency they shouldn't have distinct nodes with duplicate values, or it loses its purpose a bit (you would have to search child nodes and such). However, you could design each node to contain a variable that records how many copies of that value exist in your data.
Again, this is my opinion. If this is a bad way of doing it, I would love it if someone could explain why. I might just have to do some efficiency testing. If you are storing simple data types like ints, I suspect it is less efficient, but for larger object nodes that have ids it has been nice, it seems.
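For what it's worth, a minimal sketch of the count-per-node idea (the class name is illustrative):

    // Rather than storing duplicate nodes, each node records how many copies
    // of its value exist, at the cost of a lookup before every insert.
    class CountingHeapNode {
        final int value;
        int count = 1; // number of copies of `value` represented by this node
        CountingHeapNode left, right;

        CountingHeapNode(int value) { this.value = value; }
    }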
I am confused as to how the trie implementation saves space and stores data in the most compact form!
If you look at the tree below: when you store a character at any node, you also need to store a reference to it, and thus for each character of the string you need to store a reference.
OK, we save some space when a common character arrives, but we lose more space storing the references to those character nodes.
So isn't there a lot of structural overhead to maintain the tree itself? If a TreeMap were used instead, let's say to implement a dictionary, it could have saved a lot more space, as each string would be kept in one piece and no space would be wasted on references, wouldn't it?
To save space when using a trie, one can use a compressed trie (also known as a patricia trie or radix tree), for which one node can represent multiple characters:
In computer science, a radix tree (also patricia trie or radix trie) is a space-optimized trie data structure where each node with only one child is merged with its child. The result is that every internal node has at least two children. Unlike in regular tries, edges can be labeled with sequences of characters as well as single characters. This makes them much more efficient for small sets (especially if the strings are long) and for sets of strings that share long prefixes.
[Figure: example of a radix tree]
Note that a trie is usually used as an efficient data structure for prefix matching on a set of strings. A trie can also be used as an associative array (like a hash table) where the key is a string.
Space is saved when you have lots of words to represent in the tree, because many words share the same path prefix; the more words you have, the more space you save.
But there is a better data structure if you want to save space: a trie doesn't save as much space as a directed acyclic word graph (DAWG) does, because a DAWG shares common nodes throughout the structure (common suffixes as well as prefixes), whereas a trie shares only common prefixes. The wiki entry explains this in much more detail, so have a look at it.
Here is the difference (graphically) between a trie and a DAWG:
[Figure: the strings "tap", "taps", "top", and "tops" stored in a trie (left) and a DAWG (right); EOW stands for end-of-word.]
The tree on the left is the trie, and the graph on the right is the DAWG. Compare them and see how the DAWG saves space efficiently: the trie has duplicate nodes that represent the same letter/subword, while the DAWG has exactly one node for each letter/subword.
It's not about cheap space in memory, it's about precious space in a file or on a communications link. With an algorithm that builds that trie, we can send 'ten' in three bits, left-right-right. Compared to the 24 bits 'ten' would take up uncompressed, that's a huge savings of valuable disk space or transfer bandwidth.
You might deduce that it saves space on an ideal machine where every byte is allocated efficiently. However, real machines allocate aligned blocks of memory (8 bytes in Java and 16 bytes in some C++ implementations), so it may not save any space.
Java Strings and collections add a relatively high amount of overhead, so the percentage difference can be very small.
Unless your structure is very large, the value of your time outweighs the memory cost; using the simplest, most standard, and easiest-to-maintain collection is far more important. E.g. your time can very easily be worth 1000x or more the value of the memory you are trying to save.
E.g. say you have 10,000 names and you can save 16 bytes each by using a trie (assuming this can be proven without taking more time). This equates to 160 KB, which at today's prices is worth a fraction of a cent. If your time costs your company $30 per hour, the cost of writing one line of tested code might be $1.
If you have to think about it even a blink of an eye longer to save 160 KB, it's unlikely to be worth it for a PC. (Mobile devices are a different story, but the same argument applies, IMHO.)
EDIT: You have inspired me to add an update http://vanillajava.blogspot.com/2011/11/ever-decreasing-cost-of-main-memory.html
Guava may indeed store the key at each level but the point to realize is that the key does not really need to be stored because the path to the node completely defines the key for that node. All that actually needs to be stored at each node is a single boolean indicating whether this is a leaf node or not.
Tries, like any other structure, excel at storing certain types of data. Specifically, tries are best at storing strings that share a common root. Think of storing full-path directory listings for example.
I understand the purpose of using hash functions to securely store passwords. I have used arrays and arraylists for class projects for sorting and searching data. What I am having trouble understanding is the practical value of hashtables for something like sorting and searching.
I got a lecture on hashtables but we never had to use them in school, so it hasn't clicked. Can someone give me a practical example of a task a hashtable is useful for that couldn't be done with a numerical array or arraylist? Also, a very simple low level example of a hash function would be helpful.
There are all sorts of collections out there. Collections are used for storing and retrieving things, so one of the most important properties of a collection is how fast these operations are. To estimate "fastness", people in computer science use big-O notation, which roughly means how many individual operations it takes to invoke a certain method (be it get or set, for example). For example, to get an element of an ArrayList by index you need exactly 1 operation; this is O(1). If you have a LinkedList of length n and you need to get something from the middle, you'll have to traverse from the start of the list to the middle, taking n/2 operations; in this case get has complexity O(n). The same goes for key-value stores such as hashtables. There are implementations that give you complexity of O(log n) to get a value by its key, whereas a hashtable copes in O(1). Basically it means that getting a value from a hashtable by its key is really cheap.
Basically, hashtables have similar performance characteristics to arrays with numerical indices (cheap lookup, cheap appending - hashtables are unordered, and adding to them is cheap partly because of this), but are much more flexible in terms of what the key may be. Given a continuous chunk of memory and a fixed size per item, you can get the address of the nth item very easily and cheaply. That's thanks to the indices being integers - you can't do that with, say, strings, at least not directly. Hashing allows reducing any object (that implements it) to a number, and then you're back to arrays. You still need to check for hash collisions and resolve them (which incurs mostly a memory overhead, since you need to store the original value), but with a halfway decent implementation this is not much of an issue.
So you can now associate any (hashable) object with any (really any) value. This has countless uses (although I have to admit, I can't think of one that applies to sorting or searching). You can build caches with small overhead (because checking whether the cache can help in a given case is O(1)), implement a relatively performant object system (several dynamic languages do this), go through a list of (id, value) pairs and accumulate the values for identical ids in any way you like, and many other things.
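For instance, a cache can be as small as this sketch (the class and method names are hypothetical; the "expensive" computation is a stand-in):

    import java.util.HashMap;
    import java.util.Map;

    // A tiny memoization cache built on HashMap. The O(1) get/put is what
    // makes caching with a hashtable cheap.
    class SquareCache {
        private final Map<Integer, Long> cache = new HashMap<>();

        long square(int n) {
            Long cached = cache.get(n);   // O(1) lookup
            if (cached != null) return cached;
            long result = (long) n * n;   // stand-in for an expensive computation
            cache.put(n, result);         // O(1) insert
            return result;
        }
    }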
Very simple. Hashtables are often called "associative arrays." Arrays let you access your data by index; hash tables let you access your data by any other identifier, e.g. a name. For example:
"one" is associated with 1
"two" is associated with 2
So, when you get the word "one", you can find its value 1 using a hashtable where the key is "one" and the value is 1. An array only allows the opposite mapping.
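In Java that mapping might look like this minimal sketch:

    import java.util.HashMap;
    import java.util.Map;

    public class WordNumbers {
        public static void main(String[] args) {
            // The hashtable maps the word to its value...
            Map<String, Integer> numbers = new HashMap<>();
            numbers.put("one", 1);
            numbers.put("two", 2);
            System.out.println(numbers.get("one")); // 1
            // ...whereas an array can only map the other way: values[1] = "one"
        }
    }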
For n data elements:
Hashtables allow O(k) searches, where k usually depends only on the hashing function (effectively constant). This is better than the O(log n) of binary search (which also requires an O(n log n) sort first; if the data is not sorted you are worse off).
However, on the flip side, hashtables tend to take roughly 3n space.