If all that you're doing is a simple one-pass iteration (i.e. only hasNext() and next(), no remove()), are you guaranteed linear time performance and/or amortized constant cost per operation?
Is this specified in the Iterator contract anywhere?
Are there data structures/Java Collection which cannot be iterated in linear time?
java.util.Scanner implements Iterator<String>. A Scanner is hardly a data structure (e.g. remove() makes absolutely no sense). Is this considered a design blunder?
Is something like PrimeGenerator implements Iterator<Integer> considered bad design, or is this exactly what Iterator is for? (hasNext() always returns true, next() computes the next number on demand, remove() makes no sense).
Similarly, would it have made sense for java.util.Random implements Iterator<Double>?
Should a type really implement Iterator if it's effectively only using one-third of its API? (i.e. no remove(), always hasNext())
There is no such guarantee. As you point out, anyone can model anything as Iterator. Individual producers of iterators would have to specify their individual performance.
Nothing in the Iterator documentation mentions any kind of performance guarantee, so there is no guarantee.
It also wouldn't make sense to require this constraint on such a universal tool.
A much more useful convention is for each iterator() method to document the time constraints that its Iterator instances fulfill (for example, an Iterator over a general-purpose Collection will most likely be able to guarantee linear-time iteration).
Similarly, nothing in the documentation requires hasNext() to ever return false, so an endless Iterator would be perfectly valid.
However, there is a general assumption that all Iterator instances behave like "normal" Iterator instances as returned by Collection.iterator() in that they return some number of values and end at some point. This is not required by the documentation and, strictly speaking, any code depending on that fact would be subtly broken.
All of your proposals sound reasonable for Iterator. The API docs explicitly say remove need not be supported, and suggest that one not use the older Enumeration, which works just like Iterator except without remove.
Also, infinite-length streams are a very useful concept in functional programming, and can be implemented with an Iterator whose hasNext() always returns true. It's a feature that it can handle either case.
It sounds like you're thinking of iterators in the sense of a list or set traversal. I think a more useful mental model is a discrete object stream: anything that you want to handle one at a time, streamed from a source as discrete instances.
In that sense, a stream of prime numbers or of list objects both make sense, and the model doesn't imply anything about the finiteness of the data source.
I can imagine a use case for this, and it seems intuitive enough. Personally, I think it's fine:
for (long prime : new PrimeGenerator()) {
    // do stuff
    if (condition) {
        break;
    }
}
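Note that for this loop to compile, PrimeGenerator must implement Iterable, not just Iterator, since the enhanced for loop accepts only Iterables and arrays. A minimal sketch, assuming naive trial division (the implementation details here are illustrative, not from the question):

import java.util.Iterator;

// An infinite iterator: hasNext() always returns true, next() computes on demand.
class PrimeGenerator implements Iterable<Long>, Iterator<Long> {
    private long current = 1;

    @Override
    public Iterator<Long> iterator() {
        return this; // lets the enhanced for loop accept it
    }

    @Override
    public boolean hasNext() {
        return true; // there is always a next prime
    }

    @Override
    public Long next() {
        do {
            current++;
        } while (!isPrime(current));
        return current;
    }

    private static boolean isPrime(long n) {
        for (long i = 2; i * i <= n; i++) {
            if (n % i == 0) {
                return false;
            }
        }
        return true; // n >= 2 is guaranteed here, since current starts at 1
    }
}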
I've got an ArrayList that can be anywhere from 0 to 5000 items long (pretty big objects, too).
At one point I compare it against another ArrayList, to find their intersection. I know this is O(n^2).
Is creating a HashMap alongside this ArrayList, to achieve constant-time lookup, a valid strategy here, in order to reduce the complexity to O(n)? Or is the overhead of another data structure simply not worth it? I believe it would take up no additional space (besides the references).
(I know, I'm sure 'it depends on what I'm doing', but I'm seriously wondering if there's any drawback that makes it pointless, or if it's actually a common strategy to use. And yes, I'm aware of the quote about prematurely optimizing. I'm just curious from a theoretical standpoint).
First of all, a short side note:
And yes, I'm aware of the quote about prematurely optimizing.
What you are asking about here is not "premature optimization"!
You are not talking about replacing a multiplication with some odd bitwise operations "because they are faster (on a 90's PC, in a C-program)". You are thinking about the right data structure for your application pattern. You are considering the application cases (though you did not tell us many details about them). And you are considering the implications that the choice of a certain data structure will have on the asymptotic running time of your algorithms. This is planning, or maybe engineering, but not "premature optimization".
That being said, and to tell you what you already know: It depends.
To elaborate a bit: it depends on the actual operations (methods) that you perform on these collections, how frequently you perform them, how time-critical they are, and how memory-sensitive the application is.
(For 5000 elements, the latter should not be a problem, as only references are stored - see the discussion in the comments)
In general, I'd also be hesitant to really store the Set alongside the List, if they are always supposed to contain the same elements. This wording is intentional: You should always be aware of the differences between both collections. Primarily: A Set can contain each element only once, whereas a List may contain the same element multiple times.
For all hints, recommendations and considerations, this should be kept in mind.
But even if it is taken for granted that the lists will always contain each element only once in your case, you still have to make sure that both collections are maintained properly. If you really just stored them, you could easily cause subtle bugs:
private Set<T> set = new HashSet<T>();
private List<T> list = new ArrayList<T>();

// Fine
void add(T element)
{
    set.add(element);
    list.add(element);
}

// Fine
void remove(T element)
{
    set.remove(element);
    list.remove(element); // May be expensive, but ... well
}

// Added later, 100 lines below the other methods:
void removeAll(Collection<T> elements)
{
    set.removeAll(elements);
    // Ooops - something's missing here: list.removeAll(elements)
    // The Set and the List are now out of sync.
}
To avoid this, one could even consider creating a dedicated collection class - something like a FastContainsList that combines a Set and a List, and forwards the contains call to the Set. But you'll quickly notice that it will be hard (or maybe impossible) not to violate the contracts of the Collection and List interfaces with such a collection, unless the clause that "You may not add elements twice" becomes part of the contract...
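To illustrate, a minimal sketch of such a class (hypothetical, and deliberately incomplete): contains becomes O(1), but add has to silently reject duplicates, which already violates the List contract.

import java.util.AbstractList;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch only: forwards contains() to a backing Set for O(1) lookup.
class FastContainsList<T> extends AbstractList<T> {
    private final List<T> list = new ArrayList<T>();
    private final Set<T> set = new HashSet<T>();

    @Override
    public boolean add(T element) {
        if (set.add(element)) {
            list.add(element);
            return true;
        }
        return false; // duplicate rejected - violates the List contract
    }

    @Override
    public T get(int index) {
        return list.get(index);
    }

    @Override
    public int size() {
        return list.size();
    }

    @Override
    public boolean contains(Object object) {
        return set.contains(object); // O(1) instead of O(n)
    }
}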
So again, all this depends on what you want to do with these methods, and which interface you really need. If you don't need the indexed access of List, then it's easy. Otherwise, referring to your example:
At one point I compare it against another ArrayList, to find their intersection. I know this is O(n^2).
You can avoid this by creating the sets locally:
static <T> List<T> computeIntersection(List<T> list0, List<T> list1)
{
    Set<T> set0 = new LinkedHashSet<T>(list0);
    Set<T> set1 = new LinkedHashSet<T>(list1);
    set0.retainAll(set1);
    return new ArrayList<T>(set0);
}
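A quick usage example (the values are only for illustration, assuming java.util.Arrays is imported):

List<String> listA = Arrays.asList("a", "b", "c", "d");
List<String> listB = Arrays.asList("c", "d", "e");
List<String> common = computeIntersection(listA, listB); // [c, d]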
This method runs in O(n), assuming O(1) hashing. Of course, if you do this frequently but rarely change the contents of the lists, there may be options to avoid the copies, but for the reason mentioned above, maintaining the required data structures may become tricky.
I'm trying to understand how Spliterator works, and how spliterators are designed. I recognize that trySplit() is likely one of the more important methods of Spliterator, but when I see some third-party Spliterator implementations, sometimes I see that their spliterators return null for trySplit() unconditionally.
The questions:
Is there a difference between an ordinary iterator and a Spliterator that returns null unconditionally? It seems like such a spliterator defeats the point of, well, splitting.
Of course, there are legitimate use cases of spliterators that conditionally return null on trySplit(), but is there a legitimate use case of a spliterator that unconditionally returns null?
While the main advantage of Spliterator over Iterator is, as you said, its trySplit() method which allows it to be parallelized, there are other significant advantages:
http://docs.oracle.com/javase/8/docs/api/java/util/Spliterator.html
The Spliterator API was designed to support efficient parallel traversal in addition to sequential traversal, by supporting decomposition as well as single-element iteration. In addition, the protocol for accessing elements via a Spliterator is designed to impose smaller per-element overhead than Iterator, and to avoid the inherent race involved in having separate methods for hasNext() and next().
Furthermore, Spliterators can be directly converted to Streams using StreamSupport.stream to make use of Java 8's streams.
One of the purposes of a Spliterator is to be able to split, but that's not the only purpose. The other main purpose is as a support class for creating your own Stream source. One way to create a Stream source is to implement your own Spliterator and pass it to StreamSupport.stream. The simplest thing to do is often to write a Spliterator that can't split. Doing so forces the stream to execute sequentially, but that might be acceptable for whatever you're trying to do.
There are other cases where writing a non-splittable Spliterator makes sense. For example, in OpenJDK, there are implementations such as EmptySpliterator that contain no elements. Of course it can't be split. A similar case is a singleton spliterator that contains exactly one element. It can't be split either. Both implementations return null unconditionally from trySplit.
Another case is where writing a non-splittable Spliterator is easy and effective, and the amount of code necessary to implement a splittable one is prohibitive. (At least, not worth the effort of writing one into a Stack Overflow answer.) For example, see the example Spliterator from this answer. The case here is that the Spliterator implementation wants to wrap another Spliterator and do something special, in this case check to see if it's not empty. Otherwise it just delegates everything to the wrapped Spliterator. Doing this with a non-splittable Spliterator is pretty easy.
Notice that there's discussion in that answer, the comment on that answer, in my answer to the same question, and the comment thread on my answer, about how one would make a splittable (i.e., parallel-ready) Spliterator. But nobody actually wrote out the code to do the splitting. :-) Depending upon how much laziness you want to preserve from the original stream, and how much parallel efficiency you want, writing a splittable Spliterator can get pretty complicated.
In my estimation it's somewhat easier to do this sort of stuff by writing an Iterator instead of a Spliterator (as in my answer noted above). It turns out that Spliterators.spliteratorUnknownSize can provide a limited amount of parallelism, even from an Iterator, which is apparently a purely sequential construct. It does so within IteratorSpliterator, which pulls multiple elements from the Iterator and processes them in batches. Unfortunately the batch size is hardcoded, but at least this gives the opportunity for processing elements pulled from an Iterator in parallel in certain cases.
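That conversion is short enough to show; a minimal sketch (the streamOf helper is my name, not a library method):

import java.util.Iterator;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// Wraps a plain Iterator as a Stream. With parallel = true, the underlying
// IteratorSpliterator batches elements, giving the limited parallelism
// described above.
static <T> Stream<T> streamOf(Iterator<T> iterator, boolean parallel) {
    Spliterator<T> spliterator =
            Spliterators.spliteratorUnknownSize(iterator, Spliterator.ORDERED);
    return StreamSupport.stream(spliterator, parallel);
}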
There are more advantages than just splitting support:
The iteration logic is contained in a single tryAdvance method rather than being spread over two methods like hasNext and next. Splitting the logic over two methods complicates a lot of Iterator implementations, as it often implies that the hasNext method has to perform an actual query attempt that might yield a value, which then has to be remembered for the follow-up next call. And the fact that this query has been made must be remembered as well, either explicitly or implicitly.
It would be easier if there were a guarantee that hasNext/next are always called in the typical alternating fashion; however, there is no such guarantee.
One example is BufferedReader.readLine(), which maps to a simple tryAdvance logic (a sketch follows below). A wrapping Iterator has to call that method within the hasNext implementation and remember the line for the next call. (Ironically, the current BufferedReader.lines() implementation does implement such a complicated Iterator, which then gets wrapped into a Spliterator, instead of implementing the much simpler Spliterator directly. It seems that the "I'm not familiar with that" problem should not be underestimated.)
estimateSize(): a Spliterator may return an estimate (or even an exact number) of the remaining items, which can be used to pre-allocate resources. This can raise efficiency.
characteristics(): Spliterators can provide additional information about their content or behavior. Besides telling whether the estimated size is an exact size, you can learn whether you may see null values, whether there is a defined encounter order, or whether all values are distinct. A particular algorithm may take advantage of this. Clearly, the Stream API is built up of such algorithms, so when you plan to create (or support the creation of) streams and have a choice, implementing a Spliterator that reports as much meta-information as possible is preferable to implementing an Iterator that will be wrapped later.
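To make the first point concrete, here is a minimal sketch of a non-splittable Spliterator over the lines of a BufferedReader (the class name is mine): all iteration logic lives in tryAdvance, with no look-ahead state to remember.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Spliterator;
import java.util.function.Consumer;

class LineSpliterator implements Spliterator<String> {
    private final BufferedReader reader;

    LineSpliterator(BufferedReader reader) {
        this.reader = reader;
    }

    @Override
    public boolean tryAdvance(Consumer<? super String> action) {
        try {
            String line = reader.readLine();
            if (line == null) {
                return false; // end of input
            }
            action.accept(line);
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public Spliterator<String> trySplit() {
        return null; // deliberately not splittable
    }

    @Override
    public long estimateSize() {
        return Long.MAX_VALUE; // unknown size
    }

    @Override
    public int characteristics() {
        return ORDERED | NONNULL;
    }
}

It can be turned into a sequential Stream with StreamSupport.stream(new LineSpliterator(reader), false).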
I'd just like to know your opinion on changing all collection-returning method signatures to an Iterable return type.
This seems to me probably the most common kind of code in Java nowadays: everybody returns a List/Set/Map 99% of the time. But shouldn't the standard be to return something like
public final Iterable<String> myMethod() {
    return new Iterable<String>() {
        @Override
        public Iterator<String> iterator() {
            return myVar.getColl().iterator();
        }
    };
}
Is this bad at all? You know, all the DAO classes and that kind of stuff would be like
Iterable<String> getName(){}
Iterable<Integer> getNums(){}
Iterable<String> getStuff(){}
instead of
List<String> getName(){}
List<Integer> getNums(){}
Set<String> getStuff(){}
After all, 99% of the time you will use it in a for loop...
What do you think?
This would be a really bad plan.
I wouldn't say that 90% of the time you just use it in a for loop. Maybe 40-50%. The rest of the time, you need more information: size, contains, or get(int).
Additionally, the return type is a sort of documentation by itself. Returning a Set guarantees that the elements will be unique. Returning a List documents that the elements will be in a consistent order.
I wouldn't recommend returning specific collection implementations like HashSet or ArrayList, but I would usually prefer to return a Set or a List rather than a Collection or an Iterable, if the option is available.
List, Set and Map are interfaces, so they're not tied to a particular implementation. That makes them good candidates for return types.
The difference between List etc. and Iterable/Iterator is the kind of access: a List gives random access with all the data directly available, whereas an Iterable avoids the need to have everything available at once. That is ideal when there is a lot of data and it is not efficient to hold it all in memory, for example when iterating over a large database result set.
So it depends on what you are accessing. If your data can be huge and must be consumed by iteration to avoid performance degradation, then enforce that by returning an iterator. In other cases, List is fine.
Edit: returning an iterator means the only thing the caller can do is loop through the items, nothing else. If you need this trade-off to ensure performance, fine, but as said, only use it when needed.
Well, what you coded is only partially right:
callers will often still need some methods of the returned collection, like:
size()
contains()
get(index)
exists()
So you should rethink the new architecture, or extend it with these methods so you can always get what you need.
Assume you need to store/retrieve items in a Collection, don't care about ordering, and allow duplicates: what type of Collection do you use?
By default, I've always used ArrayList, but I remember reading/hearing somewhere that a Queue implementation may be a better choice. A List allows items to be added/retrieved/removed at arbitrary positions, which incurs a performance penalty. As a Queue does not provide this facility it should in theory be faster when this facility is not required.
I realise that all discussions about performance are somewhat meaningless; the only thing that really matters is measurement. Nevertheless, I'm interested to know what others use for a Collection when they don't care about ordering and duplicates are allowed, and why.
"It depends". The question you really need to answer first is "What do I want to use the collection for?"
If you often insert/remove items at one of the ends (beginning, end), a Queue will be better than an ArrayList. However, in many cases you create a Collection just to read from it. In this case an ArrayList is far more efficient: as it is implemented as an array, you can iterate over it quite efficiently (the same applies to a LinkedList). However, a LinkedList uses references to link the single items together, so if you do not need random removal of items (in the middle), an ArrayList is better: an ArrayList will use less memory, as the items don't need storage for references to the next/previous items.
To sum it up:
ArrayList = good if you insert once and read often (random access or sequential)
LinkedList = good if you insert/remove often at random positions and read only sequentially
ArrayDeque (Java 6+) = good if you insert/remove at the start/end and read randomly or sequentially
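For illustration, a small sketch of the ArrayDeque case from the summary above (the values are made up):

import java.util.ArrayDeque;
import java.util.Deque;

class DequeDemo {
    public static void main(String[] args) {
        // O(1) insertion/removal at both ends, backed by a resizable array.
        Deque<Integer> deque = new ArrayDeque<Integer>();
        deque.addLast(1);                       // insert at the end
        deque.addFirst(0);                      // insert at the start
        System.out.println(deque.pollFirst());  // removes and prints 0
        System.out.println(deque.pollLast());   // removes and prints 1
    }
}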
As a default, I tend to prefer LinkedList to ArrayList. Obviously, I use them not through the List interface, but rather through the Collection interface.
Over time, I've indeed found that when I need a generic collection, it's more or less to put some things in and then iterate over it. If I need more evolved behaviour (say random access, sorting or uniqueness checks), I will then maybe change the implementation used, but before that I will change the interface used to the most appropriate one. This way, I can ensure the feature is provided before concentrating on optimization and implementation.
ArrayList basically contains an array inside (that's why it is called ArrayList), and operations like add/remove at arbitrary positions are implemented in a straightforward way, so if you don't use them, there is no harm to performance.
If ordering and duplicates are not a problem and the use case is just storing,
I use ArrayList, as it implements all the List operations. I've never felt any performance issues with these operations (they never impacted my projects either). These operations are simple to use, and I don't need to care how they are managed internally.
Only if multiple threads will be accessing this list do I use Vector, because its methods are synchronized.
Also, ArrayList and Vector are the collections you learn first :).
It depends on what you know about it.
If I have no clue, I tend to go for a linked list, since the penalty for adding/removing at the end is constant. If I have a rough idea of the maximum size, I go for an ArrayList with that capacity specified, because it is faster if the estimate is good. If I really know the exact size, I tend to go for a plain array, although that isn't really a collection type.
I realise that all discussions about performance are somewhat meaningless, the only thing that really matters is measurement.
That's not necessarily true.
If your knowledge of how the application is going to work tells you that certain collections are going to be very large, then it is a good idea to pick the right collection type. But the right collection type depends crucially on how the collections are going to be used; i.e. on the algorithms.
For example, if your application is likely to be dominated by testing if a collection holds a given object, the fact that Collection.contains(Object) is O(N) for both LinkedList<T> and ArrayList<T> might mean that neither is an appropriate collection type. Instead, maybe you should represent the collection as a HashMap<T, Integer>, where the Integer represents the number of occurrences of a T in the "collection". That will give you O(1) testing and removal, at the cost of more space overheads and slower (though still O(1)) insertion.
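A minimal sketch of that representation (the CountingBag name is mine, for illustration):

import java.util.HashMap;
import java.util.Map;

// A bag/multiset backed by a HashMap: O(1) contains and remove,
// at the cost of extra space for the occurrence counts.
class CountingBag<T> {
    private final Map<T, Integer> counts = new HashMap<T, Integer>();

    void add(T item) {
        counts.merge(item, 1, Integer::sum); // increment occurrence count
    }

    boolean contains(T item) {
        return counts.containsKey(item); // O(1)
    }

    boolean remove(T item) {
        Integer n = counts.get(item);
        if (n == null) {
            return false; // not present
        }
        if (n == 1) {
            counts.remove(item);
        } else {
            counts.put(item, n - 1);
        }
        return true;
    }
}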
But the thing to stress is that if you are likely to be dealing with really large collections, there should be no such thing as a "default" collection type. You need to think about the collection in the context of the algorithms. (And the flip side is that if the collections are always going to be small, it probably makes little difference which collection type you pick.)
How expensive is calling size() on a List or Map in Java? Or is it better to save size()'s value in a variable if it is accessed frequently?
The answer is that it depends on the actual implementation class. For some Map and Collection classes, size() is a cheap constant-time operation. For others, it may entail counting the members.
The Java Collections Cheatsheet (V2) is normally a good source for this kind of information, but the host server is currently a bit sick.
The "coderfriendly.com" domain is no more, but I tracked down a copy of the cheat-sheet on scribd.com.
The cost of size() will also be obvious from looking at the source code. (And this is an "implementation detail" that is pretty much guaranteed to not change ... for the standard collection classes.)
FOLLOWUP
Unfortunately, the cheatsheet only documents the complexity of size for queue implementations. I think that's because it is O(1) for all other collections; see @seanizer's answer.
List and Map are interfaces, so it's impossible to say. For the implementations in the Java Standard API, the size is generally kept in a field and thus not performance-relevant.
For most Collections, calling size() is a constant-time operation. There are however some exceptions. One is ConcurrentLinkedQueue. From the Javadoc of the size() method:
Beware that, unlike in most collections, this method is NOT a constant-time operation. Because of the asynchronous nature of these queues, determining the current number of elements requires an O(n) traversal.
So I'm afraid there's no generic answer, you have to check the documentation of the individual collection you are using.
For ArrayList, the implementation looks like:

public int size() {
    return lastIndex - firstIndex;
}

So there is no overhead.
You can check the source code of the implementation you are using for the details.
Note: the variant above appears to come from Apache Harmony; OpenJDK's ArrayList simply returns a stored size field. Either way, size() is O(1).
Implement it, then test it. If it is slow, take a closer look.
"Premature optimisation is the root of all evil." - D. Knuth
Also: you should not depend on specific implementation characteristics, especially if they are black-boxed. What happens if you replace that list with a concurrent list at a later date? What happens if Oracle decides to rewrite List? Will it still be fast? You just don't know.
You don't have to worry much about that. The List implementations keep track of their size, so the cost of the call is just O(1). If you are very curious, you can read the source code of the concrete Collection implementations and look at their size() methods.
The implementation gets it from a private, pre-computed field, so it's not expensive.
No need to store it. It's not at all expensive. Check the source of ArrayList and HashMap.
I think some implementations of LinkedList count the total on each call. The method call itself can be a little taxing, but that would only really matter for very large iterations or hardware driver code.
In either case, if you save it to a local variable, there won't be any problems.
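Putting the advice together, a small sketch of caching size() in a local variable (the totalLength helper is illustrative):

import java.util.List;

// Calls size() once instead of on every iteration. For ArrayList this is
// O(1) either way; for collections like ConcurrentLinkedQueue, where size()
// is O(n), hoisting it out of the loop avoids repeated traversals.
static long totalLength(List<String> strings) {
    long total = 0;
    int n = strings.size(); // evaluated once
    for (int i = 0; i < n; i++) {
        total += strings.get(i).length();
    }
    return total;
}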