How to implement immutable collection with constant append and random access time? - java

I'm looking for a solution like this one proposed by Eric Lippert. It is a great implementation, as it is immutable plus the append time is O(1), but it downside is the O(i) random access time.
On the other side there is a great implementation of collection with O(1) on both append and random access. The only problem is that it strongly relies on mutability.
My questions is how to implement a collection which combines the benefits of both solutions? That is:
immutability
O(1) append time
O(1) random access time
Memory complexity is not that big issue for me.

I do not know of a way to implement a list which has all your requirements -- immutability, persistence, O(1) insertion, O(1) removal, O(1) random access.
My suggestion to you is that (1) if you are interested in this topic, read Chris Okasaki's book. (Or, get a copy of his thesis, which was the basis of the book.) And (2) Chris Okasaki suggests the data structure described here for your purposes:
http://www.codeproject.com/Articles/9680/Persistent-Data-Structures#RandomAccessLists
This list is O(1) insert and O(1) removal to the head and O(lg) for random access.

I'm not sure how you would get O(1) append and O(1) random access unless you include another data structure.
Typically, if you want to be able to append elements, you can either copy the source collection, which keeps O(1) random access but gives you O(n) append; or you can do what Eric did, and retain the old list segment(s), which gives you O(1) append time but O(n) random access. Assuming the constant append time is critical, that leaves you with the option of incorporating a second data structure to provide constant-time random access.
The Scala documentation claims "effectively constant" add and lookup times for its immutable HashMap. If true, I suggest looking at their implementation. You could take a solution like Eric's and add an efficient immutable map of indexes to the elements themselves. This would add a bit of memory overhead, though, and while append operations would be efficient, insertions would not.
I am, however, a bit skeptical of the Scala HashMap performance claims. Other immutable map implementations claim log32(n) complexity, which presumably applies to both add and lookup operations. My gut tells me that you're not going to get better than logarithmic complexity, though log32(n) is pretty reasonable.

var bag = new HashBag<int>
{
1,
2,
3
};
var g = new GuardedCollection<int>(bag);
bag.Add(4);
g.Add(5);
The HashBag remains mutable, but you can still pass it to another consumer as an immutable GuardedCollection.

Related

Map.of vs EnumMap for immutable map with enum keys

An EnumMap uses the restriction that all keys of a map will be from the same enum to gain performance benefits:
Enum maps are represented internally as arrays. This representation is extremely compact and efficient.
In this case, the keys and values are stored in separate arrays and the values are ordinal-ordered. Iteration is done with the internal EnumMapIterator class.
An immutable map created by the various Map.of methods use the restriction that the map will not change structurally to gain performance benefits. If the map is not of size 0 or 1, it uses the MapN internal class that also stores its entries in an array. In this case, the value is stored 1 index after its key. Iteration is done with the internal MapNIterator.
For an immutable map of enum keys of size 2 or more, which answers both of the above's requirements, which map performs better? (Criteria could be space efficiency, time efficiency for containsKey, containsValue, get, and iteration efficiency of entrySet, keySet and values.)
which map gives better space efficiency, and time efficiency for its operations and iteration: containsKey, containsValue, get, entrySet, keySet and values?
You're raising 1 + 6 (or 2 * 6, depending on how it's understood) questions, that's a bit too much. If you want a definite answer, you have to concentrate on a single thing and profile it (nobody's gonna do it for you unless you find a very interesting problem).
The space efficiency for an EnumMap simply must be better. There's no need to store the keys as a shared enum array can be used. There's no need for a holes-containing hash lookup array.
There may be exceptions like a small map based on a huge enum.
The most important operation is get. With an EnumMap, it involves no lookup, just a trivial class comparison and an array access. With Map.of(...), there's a loop, which for enums, usually terminates after the first iteration as Enum.hashCode is IMHO stupid, but usually well distributed.
As containsKey is based on the same lookup, it's clear.
I doubt, I've ever used containsValue, but it doesn't do anything smarter than linear search. I'd expect a tiny win for EnumMap because of the holes (needing trivial null test, but causing branch mispredictions).
The remaining three operations are not worth looking up as they return a collection containing no data and simply pointing to the map, i.e., a constant time operation. For example, map.keySet().contains(x) simply delegates to map.containsKey().
The efficiency of the iteration would be a more interesting question, but you didn't ask it.

Java - Most efficient matching method

Assuming one needs to store a list of items, but it can be stored in any variable type; what would be the most efficient type, if used mostly for matching?
To clarify, a list of items needs to be contained, but the form it's contained in doesn't matter (enum, list, hashmap, Arraylist, etc..)
This list of items would be matched against on a regular basis, but not edited. What would the most efficient storage method be, assuming you only need to write to the list once, but could be matching multiple times per second?
Note: No multi-threading
A HashSet (and HashMap) offers O(1) complexity. Also note that you should create a large enough HashSet with small loadfactor which means that after a hashcode check the elements in the result bucket will also be found very quickly (in a bucket there is a sequential search). Optimally each bucket should contain 1 element at the most.
You can read more about the concept of capacity and load factor in the Javadoc of HashMap.
An even faster solution would be if the number of items is no more than 64 is to create an Enum for them and use EnumSet or EnumMap which stores the elements in a long and uses simple and very fast bit operations to test if an element is in the set or map (a contains operation is just a simple bitmask test).
If you choose to go with the HashSet and not with the Enum approach, know that HashSet uses the hashCode() and equals() methods of the elements. You might consider overriding them to provide a faster implementation knowing the internals of the items you wish to store.
A trivial optimization of overriding the hashCode() can be for example to cache a once computed hash code in the item itself if it doesn't change (and subsequent calls to hashCode() should just return the cached value).
From your description it seems that order doesn't matter. If this is so, use a Set. Java's standard implementation is the HashSet.
Most efficient for repeated lookup would almost certainly be an EnumSet
... Enum sets are represented internally as bit vectors. This representation is extremely compact and efficient. The space and time performance of this class should be good enough to allow its use as a high-quality, typesafe alternative to traditional int-based "bit flags." Even bulk operations (such as containsAll and retainAll) should run very quickly if their argument is also an enum set.
...
Implementation note: All basic operations execute in constant time. They are likely (though not guaranteed) to be much faster than their HashSet counterparts. Even bulk operations execute in constant time if their argument is also an enum set.

Retrieve Least Element, Elements are Dynamically-Ordered

I have collection of elements from which I need to retrieve the least/minimum element.
Normally I would use a PriorityQueue as they are designed specifically for this purpose, and offer O(log(n)) time for dequeing methods.
However, the elements in my array have a dynamic order, ie there natural order changes unpredictably over time. I assume PriorityQueue and other such Sorted collections sort an element when inserted, and then leave it. If this is so PriorityQueue wouldn't work for dynamically-ordered elements. Am I correct in my assumption? Or would PriorityQueue still be appropriate in this situation?
If I can't use PriorityQueue, Collections.min would be my next instinct. However this iterates over the entire collection, which presumably gives O(n) time. Is this the next best solution?
What is the best collection/method to use to retrieve the least element from a collection, given that the natural order of the elements may change unpredictably over time?
Edit:
The order of several elements changes per retrieval operation
Edit 2:
The compare algorithm remains constant, however the values of the fields which it assesses vary unpredictably between retrievals.
I think if the change is truly "unpredictable" you may be stuck with Collections.min(). However, maybe for some other collections like PriorityQueue you could try, before calling for the min.
Add something that you KNOW is the min.
Remove that
Then ask again for the "real" min and hope that your little kludge resorted things...
Alternatively, do you know if the order has changed over time? e.g. some OrderChangedEvent can be fired? If so, recreate the sorted whatever as needed.
A possible way to do this would be to extend PriorityQueue that contains a list as one of the fields. This list will store the java.lang.Object.hashCode() of each object. Whenever an add, peek, poll, offer, etc. is called on the PriorityQueue, the queue will check the hash codes of each element and make see if any element changed. If they have, it will re-order the elements that have changed. Then, it will replace the hashcodes of the changed elements in the list. I don't know how fast this will be, but I suspect it will be faster than O(n).
Without any further assumption on the operations you are going to do, you can't achieve better performance than with a PriorityQueue or another O(log(n))-insert collection (TreeSet , for example, but you lose the O(1)-peek).
As you correctly assumed Collections.min(Collection, Comparator) is a linear operation.
But it depends on how often you need to change the ordering: for example if you only need to change it once in a while and still keep a "standard" ordering, min() is a viable option, but if you need to switch ordering completely then you will probably be better off with reordering the queue/set (that is, traversing and adding all the elements in a new one), tough at a O(nlog(n)) cost. Using Collections.sort(List, Comparator) may be effective if you need a lot of reordering compared to inserts, but requires you to use a List.
Of course if you can make somewhat strong assumptions on the types of sorting you will need (for example, if it can be restricted to a part of the data) you could write your own collection.
Edit:
So you have a (more or less) finite number of orderings (never mind that it's the same type of comparison over different fields, it's different Comparators and that's what matters)? If that's the case, you can probably achieve best performance by using m queues that reference the same objects, each using a different comparator (the simplest method, really). This way you have:
constant time access
O(m*logn(n)) inserts (to insert in every queue)
O(m*n) removals (to remove from every queue)
no ordering costs (as it's handled by the inserts)
slightly larger memory cost (probably negligible)
additional O(n*log(n)) cost the first time a particolar ordering is requested
Supposing a value of m orders of magnitude smaller than n, this is comparable to optimal (single-ordering PriorityQueue) performance. For convenience, you can wrap this into a custom collection that takes a Comparator parameter on retrieval operations, and use it as a key for an HashMap of all the PriorityQueues.
Edit #2:
In that case, there is no better solution than running min() on every retrieval (unless you can make assumptions on the changes of the data); this also means that it's better to just use an ArrayList as the collection, since it has basically the lowest possible cost on every operation and you will not benefit from PriorityQueue's natural ordering anyway. You will end up with linear cost on retrieval (for min) and constant on insertion and deletion: this is optimal as there is no sorting algorithm that has less than Ω(n) and Θ(nlog n) anyway.
As a side note, ordered collections work on the assumption that values will not change after insertion; this is because there is no cost-effective way to monitor the changes nor to reorder them "in place".
Can't you use a java TreeSet which keeps the collection sorted at all times. You need to implement the Comparable interface on your objects to do so. Checkout http://docs.oracle.com/javase/1.4.2/docs/api/java/util/TreeSet.html

default Collection type

Assume you need to store/retrieve items in a Collection, don't care about ordering, and duplicates are allowed, what type of Collection do you use?
By default, I've always used ArrayList, but I remember reading/hearing somewhere that a Queue implementation may be a better choice. A List allows items to be added/retrieved/removed at arbitrary positions, which incurs a performance penalty. As a Queue does not provide this facility it should in theory be faster when this facility is not required.
I realise that all discussions about performance are somewhat meaningless, the only thing that really matters is measurement. Nevertheless, I'm interested to know what others use for a Collection, when they don't care about ordering, and duplicates are allowed, and why?
"It depends". The question you really need to answer first is "What do I want to use the collection for?"
If you often insert / remove items on one of the ends (beginning, end) a Queue will be better than a ArrayList. However in many cases you create a Collection in order to just read from it. In this case a ArrayList is far more efficient: As it is implemented as an array, you can iterate over it quite efficient (same applies for a LinkedList). However a LinkedList uses references to link single items together. So if you do not need random removals of items (in the middle), a ArrayList is better: An ArrayList will use less memory as the items don't need the storage for the reference to the next/prev item.
To sum it up:
ArrayList = good if you insert once and read often (random access or sequential)
LinkedList = good if you insert/remove often at random positions and read only sequential
ArrayDeque (java6 only) = good if you insert/remove at start/end and read random or sequential
As a default, I tend to prefer LinkedList to ArrayList. Obviously, I use them not through the List interface, but rather through the Collection interface.
Over the time, I've indeed found out that when I need a generic collection, it's more or less to put some things in, then iterate over it. If I need more evolved behaviour (say random access, sorting or unicity checks), I will then maybe change the used implementation, but before that I will change the used interface to the most appropriated. This way, I can ensure feature is provided before to concentrate on optimization and implementation.
ArrayList basicly contains an array inside (that's why it is called ArrayList). And operations like addd/remove at arbitrary positions are done in a straightforward way, so if you don't use them - there is no harm to performance.
If ordering and duplicates is not a problem and case is only for storing,
I use ArrayList, As it implements all the list operations. Never felt any performance issues with these operations (Never impacted my projects either). Actually using these operations have simple usage & I don't need to care how its managed internally.
Only if multiple threads will be accessing this list I use Vector because its methods are synchronized.
Also ArrayList and Vector are collections which you learn first :).
It depends on what you know about it.
If I have no clue, I tend to go for a linked list, since the penalty for adding/removing at the end is constant. If I have a rough idea of the maximum size of it, I go for an arraylist with the capacity specified, because it is faster if the estimation is good. If I really know the exact size I tend to go for a normal array; although that isn't really a collection type.
I realise that all discussions about performance are somewhat meaningless, the only thing that really matters is measurement.
That's not necessarily true.
If your knowledge of how the application is going to work tells you that certain collections are going to be very large, then it is a good idea to pick the right collection type. But the right collection type depends crucially on how the collections are going to be used; i.e. on the algorithms.
For example, if your application is likely to be dominated by testing if a collection holds a given object, the fact that Collection.contains(Object) is O(N) for both LinkedList<T> and ArrayList<T> might mean that neither is an appropriate collection type. Instead, maybe you should represent the collection as a HashMap<T, Integer>, where the Integer represents the number of occurrences of a T in the "collection". That will give you O(1) testing and removal, at the cost of more space overheads and slower (though still O(1)) insertion.
But the thing to stress is that if you are likely to be dealing with really large collections, there should be no such thing as a "default" collection type. You need to think about the collection in the context of the algorithms. (And the flip side is that if the collections are always going to be small, it probably makes little difference which collection type you pick.)

Which list<Object> implementation will be the fastest for one pass write, read, then destroy?

What is the fastest list implementation (in java) in a scenario where the list will be created one element at a time then at a later point be read one element at a time? The reads will be done with an iterator and then the list will then be destroyed.
I know that the Big O notation for get is O(1) and add is O(1) for an ArrayList, while LinkedList is O(n) for get and O(1) for add. Does the iterator behave with the same Big O notation?
It depends largely on whether you know the maximum size of each list up front.
If you do, use ArrayList; it will certainly be faster.
Otherwise, you'll probably have to profile. While access to the ArrayList is O(1), creating it is not as simple, because of dynamic resizing.
Another point to consider is that the space-time trade-off is not clear cut. Each Java object has quite a bit of overhead. While an ArrayList may waste some space on surplus slots, each slot is only 4 bytes (or 8 on a 64-bit JVM). Each element of a LinkedList is probably about 50 bytes (perhaps 100 in a 64-bit JVM). So you have to have quite a few wasted slots in an ArrayList before a LinkedList actually wins its presumed space advantage. Locality of reference is also a factor, and ArrayList is preferable there too.
In practice, I almost always use ArrayList.
First Thoughts:
Refactor your code to not need the list.
Simplify the data down to a scalar data type, then use: int[]
Or even just use an array of whatever object you have: Object[] - John Gardner
Initialize the list to the full size: new ArrayList(123);
Of course, as everyone else is mentioning, do performance testing, prove your new solution is an improvement.
Iterating through a linked list is O(1) per element.
The Big O runtime for each option is the same. Probably the ArrayList will be faster because of better memory locality, but you'd have to measure it to know for sure. Pick whatever makes the code clearest.
Note that iterating through an instance of LinkedList can be O(n^2) if done naively. Specifically:
List<Object> list = new LinkedList<Object>();
for (int i = 0; i < list.size(); i++) {
list.get(i);
}
This is absolutely horrible in terms of efficiency due to the fact that the list must be traversed up to i twice for each iteration. If you do use LinkedList, be sure to use either an Iterator or Java 5's enhanced for-loop:
for (Object o : list) {
// ...
}
The above code is O(n), since the list is traversed statefully in-place.
To avoid all of the above hassle, just use ArrayList. It's not always the best choice (particularly for space efficiency), but it's usually a safe bet.
There is a new List implementation called GlueList which is faster than all classic List implementations.
Disclaimer: I am the author of this library
You almost certainly want an ArrayList. Both adding and reading are "amortized constant time" (i.e. O(1)) as specified in the documentation (note that this is true even if the list has to increase it's size - it's designed like that see http://java.sun.com/j2se/1.5.0/docs/api/java/util/ArrayList.html ). If you know roughly the number of objects you will be storing then even the ArrayList size increase is eliminated.
Adding to the end of a linked list is O(1), but the constant multiplier is larger than ArrayList (since you are usually creating a node object every time). Reading is virtually identical to the ArrayList if you are using an iterator.
It's a good rule to always use the simplest structure you can, unless there is a good reason not to. Here there is no such reason.
The exact quote from the documentation for ArrayList is: "The add operation runs in amortized constant time, that is, adding n elements requires O(n) time. All of the other operations run in linear time (roughly speaking). The constant factor is low compared to that for the LinkedList implementation."
I suggest benchmarking it. It's one thing reading the API, but until you try it for yourself, it'd academic.
Should be fair easy to test, just make sure you do meaningful operations, or hotspot will out-smart you and optimise it all to a NO-OP :)
I have actually begun to think that any use of data structures with non-deterministic behavior, such as ArrayList or HashMap, should be avoided, so I would say only use ArrayList if you can bound its size; any unbounded list use LinkedList. That is because I mainly code systems with near real time requirements though.
The main problem is that any memory allocation (which could happen randomly with any add operation) could also cause a garbage collection, and any garbage collection can cause you to miss a target. The larger the allocation, the more likely this is to occur, and this is also compounded if you are using CMS collector. CMS is non-compacting, so finding space for a new linked list node is generally going to be easier than finding space for a new 10,000 element array.
The more rigorous your approach to coding, the closer you can come to real time with a stock JVM. But choosing only data structures with deterministic behavior is one of the first steps you would have to take.

Categories

Resources