I'm trying to understand how Spliterator works and how spliterators are designed. I recognize that trySplit() is likely one of the more important methods of Spliterator, but when I look at third-party Spliterator implementations, I sometimes see that they return null from trySplit() unconditionally.
The questions:
Is there a difference between an ordinary iterator and a Spliterator that returns null unconditionally? It seems like such a spliterator defeats the point of, well, splitting.
Of course, there are legitimate use cases of spliterators that conditionally return null on trySplit(), but is there a legitimate use case of a spliterator that unconditionally returns null?
While the main advantage of Spliterator over Iterator is, as you said, its trySplit() method which allows it to be parallelized, there are other significant advantages:
http://docs.oracle.com/javase/8/docs/api/java/util/Spliterator.html
The Spliterator API was designed to support efficient parallel traversal in addition to sequential traversal, by supporting decomposition as well as single-element iteration. In addition, the protocol for accessing elements via a Spliterator is designed to impose smaller per-element overhead than Iterator, and to avoid the inherent race involved in having separate methods for hasNext() and next().
Furthermore, Spliterators can be converted directly to Streams using StreamSupport.stream to make use of Java 8's streams.
One of the purposes of a Spliterator is to be able to split, but that's not the only purpose. The other main purpose is as a support class for creating your own Stream source. One way to create a Stream source is to implement your own Spliterator and pass it to StreamSupport.stream. The simplest thing to do is often to write a Spliterator that can't split. Doing so forces the stream to execute sequentially, but that might be acceptable for whatever you're trying to do.
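For illustration, here is a minimal sketch of that pattern (the class and its queue-backed source are hypothetical): a Spliterator that implements tryAdvance but unconditionally returns null from trySplit, so any stream built on it executes sequentially.

import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Queue;
import java.util.Spliterator;
import java.util.function.Consumer;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

class QueueDrainingSpliterator implements Spliterator<String> {
    private final Queue<String> source;

    QueueDrainingSpliterator(Queue<String> source) {
        this.source = source;
    }

    @Override
    public boolean tryAdvance(Consumer<? super String> action) {
        String element = source.poll();
        if (element == null) return false; // a null poll() result means the source is drained
        action.accept(element);
        return true;
    }

    @Override
    public Spliterator<String> trySplit() {
        return null; // never split: the stream always executes sequentially
    }

    @Override
    public long estimateSize() {
        return source.size(); // best-effort estimate of the remaining elements
    }

    @Override
    public int characteristics() {
        return ORDERED | NONNULL; // the action is never handed a null element
    }
}

// Usage: even calling .parallel() on this stream cannot parallelize it.
Queue<String> queue = new ArrayDeque<>(Arrays.asList("a", "b", "c"));
Stream<String> stream = StreamSupport.stream(new QueueDrainingSpliterator(queue), false);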
There are other cases where writing a non-splittable Spliterator makes sense. For example, in OpenJDK, there are implementations such as EmptySpliterator that contain no elements. Of course it can't be split. A similar case is a singleton spliterator that contains exactly one element. It can't be split either. Both implementations return null unconditionally from trySplit.
Another case is where writing a non-splittable Spliterator is easy and effective, and the amount of code necessary to implement a splittable one is prohibitive. (At least, not worth the effort of writing one into a Stack Overflow answer.) For example, see the example Spliterator from this answer. The case here is that the Spliterator implementation wants to wrap another Spliterator and do something special, in this case check to see if it's not empty. Otherwise it just delegates everything to the wrapped Spliterator. Doing this with a non-splittable Spliterator is pretty easy.
Notice that there's discussion in that answer, the comment on that answer, in my answer to the same question, and the comment thread on my answer, about how one would make a splittable (i.e., parallel-ready) Spliterator. But nobody actually wrote out the code to do the splitting. :-) Depending upon how much laziness you want to preserve from the original stream, and how much parallel efficiency you want, writing a splittable Spliterator can get pretty complicated.
In my estimation it's somewhat easier to do this sort of stuff by writing an Iterator instead of a Spliterator (as in my answer noted above). It turns out that Spliterators.spliteratorUnknownSize can provide a limited amount of parallelism, even from an Iterator, which is apparently a purely sequential construct. It does so within IteratorSpliterator, which pulls multiple elements from the Iterator and processes them in batches. Unfortunately the batch size is hardcoded, but at least this gives the opportunity for processing elements pulled from an Iterator in parallel in certain cases.
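As a small sketch of that effect (the list source is arbitrary): wrap an existing Iterator and let the JDK's batching spliterator extract what parallelism it can.

import java.util.Arrays;
import java.util.Iterator;
import java.util.Spliterators;
import java.util.stream.StreamSupport;

Iterator<String> iterator = Arrays.asList("a", "b", "c").iterator();

// The wrapping spliterator splits by handing off fixed-size batches of
// elements pulled from the Iterator, so a parallel stream gets some
// opportunity to distribute work despite the sequential source.
StreamSupport.stream(
        Spliterators.spliteratorUnknownSize(iterator, 0), // 0 = no characteristics
        true)                                             // request a parallel stream
    .forEach(System.out::println);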
There are more advantages than just splitting support:
The iteration logic is contained in a single tryAdvance method rather than being spread over two methods, hasNext and next. Splitting the logic over two methods complicates a lot of Iterator implementations, as it often implies that the hasNext method has to perform an actual query attempt that might yield a value, which then has to be remembered for the follow-up next call. And the fact that this query has been made must be remembered as well, either explicitly or implicitly.
It would be easier if there were a guarantee that hasNext/next are always called in the typical alternating fashion; however, there is no such guarantee.
One example is BufferedReader.readLine(), which has simple tryAdvance logic. A wrapping Iterator has to call that method within the hasNext implementation and remember the line for the next call (a sketch of such a wrapping Iterator follows after this list). (Ironically, the current BufferedReader.lines() implementation does implement such a complicated Iterator, which is then wrapped into a Spliterator, instead of implementing the much simpler Spliterator directly. It seems that the “I’m not familiar with that” problem should not be underestimated.)
estimateSize(); a Spliterator may return an estimate (or even an exact number) of the remaining items, which can be used to pre-allocate resources. This can improve efficiency.
characteristics(); Spliterators can provide additional information about their content or behavior. Besides telling whether the estimated size is an exact size, you can learn whether you may see null values, whether there is a defined encounter order, or whether all values are distinct. A particular algorithm may take advantage of this. The Stream API is clearly built up of such algorithms, so when planning to create (or support the creation of) streams and you have a choice, implementing a Spliterator that reports as much meta-information as possible is preferable to implementing an Iterator that will be wrapped later.
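To make the first point concrete, here is a sketch (my own illustration, with minimal error handling) of the wrapping Iterator described above; the hasNext/next split forces a look-ahead buffer and the bookkeeping that goes with it:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

class LineIterator implements Iterator<String> {
    private final BufferedReader reader;
    private String nextLine; // look-ahead buffer forced by the hasNext/next split

    LineIterator(BufferedReader reader) {
        this.reader = reader;
    }

    @Override
    public boolean hasNext() {
        if (nextLine == null) { // only query the source if no line is buffered
            try {
                nextLine = reader.readLine(); // null signals end of input
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return nextLine != null;
    }

    @Override
    public String next() {
        if (!hasNext()) throw new NoSuchElementException();
        String line = nextLine;
        nextLine = null; // mark the buffered line as consumed
        return line;
    }
}

The equivalent tryAdvance is a single method: read a line, pass it to the consumer if it is non-null, and report whether one was read; no look-ahead state is needed.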
Related
I read some questions about how to create a finite Stream (Finite generated Stream in Java - how to create one?, How do streams stop?).
The answers suggested implementing a Spliterator. The Spliterator would implement the logic of how, and which, element to provide next (tryAdvance). But there are two other non-default methods, trySplit() and estimateSize(), which I would have to implement.
The JavaDoc of Spliterator says:
An object for traversing and partitioning elements of a source. The source of elements covered by a Spliterator could be, for example, an array, a Collection, an IO channel, or a generator function. ... The Spliterator API was designed to support efficient parallel traversal in addition to sequential traversal, by supporting decomposition as well as single-element iteration. ...
On the other hand, I could implement the logic of how to advance to the next element around a Stream.Builder and bypass a Spliterator. On every advance I would call accept or add, and at the end build. So it looks quite simple.
What does the JavaDoc say?
A mutable builder for a Stream. This allows the creation of a Stream by generating elements individually and adding them to the Builder (without the copying overhead that comes from using an ArrayList as a temporary buffer).
Using StreamSupport.stream, I can use a Spliterator to obtain a Stream. And a Builder will also provide a Stream.
When should / could I use a Stream.Builder?
Only if a Spliterator wouldn't be more efficient (for instance because the source cannot be partitioned and its size cannot be estimated)?
Note that you can extend Spliterators.AbstractSpliterator. Then, there is only tryAdvance to implement.
So the complexity of implementing a Spliterator is not higher.
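A minimal sketch of that approach (the BufferedReader source is just an example): extend Spliterators.AbstractSpliterator, implement tryAdvance, and inherit a batch-wise trySplit and the size bookkeeping for free.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

class LineSpliterator extends Spliterators.AbstractSpliterator<String> {
    private final BufferedReader reader;

    LineSpliterator(BufferedReader reader) {
        super(Long.MAX_VALUE, ORDERED | NONNULL); // Long.MAX_VALUE = size unknown
        this.reader = reader;
    }

    @Override
    public boolean tryAdvance(Consumer<? super String> action) {
        try {
            String line = reader.readLine();
            if (line == null) return false; // end of input
            action.accept(line);
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

// Usage, given some BufferedReader reader:
Stream<String> lines = StreamSupport.stream(new LineSpliterator(reader), false);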
The fundamental difference is that a Spliterator’s tryAdvance method is only invoked when a new element is needed. In contrast, the Stream.Builder has storage that will be filled with all stream elements before you can acquire a Stream.
So a Spliterator is the first choice for all kinds of lazy evaluations, as well as when you have an existing storage you want to traverse, to avoid copying the data.
The builder is the first choice when the creation of the elements is non-uniform, so you can’t express the creation of an element on demand. Think of situations where you would otherwise use Stream.of(…), but it turns out to be too inflexible.
E.g. you have Stream.of(a, b, c, d, e), but now it turns out that c and d are optional. So the solution is:
Stream.Builder<MyType> builder = Stream.builder();
builder.add(a).add(b);
if (someCondition) builder.add(c).add(d);
Stream<MyType> stream = builder.add(e).build();
/* stream operations */
Other use cases are this answer, where a Consumer was needed to query an existing spliterator and push the value back to a Stream afterwards, or this answer, where a structure without random access (a class hierarchy) should be streamed in the opposite order.
On the other hand, I could implement the logic of how to advance to the next element around a Stream.Builder and bypass a Spliterator. On every advance I would call accept or add, and at the end build. So it looks quite simple.
Yes and no. It is simple, but I don't think you understand the usage model:
A stream builder has a lifecycle, which starts in a building phase, during which elements can be added, and then transitions to a built phase, after which elements may not be added. The built phase begins when the build() method is called, which creates an ordered Stream whose elements are the elements that were added to the stream builder, in the order they were added.
(Javadocs)
In particular no, you would not invoke a Stream.Builder's accept or add method on any stream advance. You need to provide all the objects for the stream in advance. Then you build() to get a stream that will provide all the objects you previously added. This is analogous to adding all the objects to a List, and then invoking that List's stream() method.
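A short sketch of that lifecycle:

import java.util.stream.Stream;

Stream.Builder<String> builder = Stream.builder();
builder.add("a").add("b");               // building phase: elements are buffered
Stream<String> stream = builder.build(); // transition to the built phase
// builder.add("c");                     // would now throw IllegalStateException
stream.forEach(System.out::println);     // prints a, b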
If that serves your purposes and you can in fact do it efficiently then great! But if you need to generate elements on an as-needed basis, whether with or without limit, then Stream.Builder cannot help you. Spliterator can.
Stream.Builder is a misnomer, as streams can't really be built. Things that can be built are value objects: DTOs, arrays, collections.
So if Stream.Builder is instead thought of as a buffer, it might be easier to understand, e.g.:
buffer.add(a)
buffer.add(b)
buffer.build()   // returns the Stream
This shows how similar it is to an ArrayList:
list.add(a)
list.add(b)
list.stream()
On the other hand, Spliterator is the basis of a stream and allows for efficient navigation over data sets (an improved version of Iterator).
So the answer is they should not be compared. Comparing Stream.Builder to Spliterator is the same as comparing ArrayList to Spliterator.
The javadoc of Spliterator (which, if I understand things correctly, is basically what is really behind a Stream) defines many characteristics which make sense, such as SIZED, CONCURRENT, IMMUTABLE, etc.
But it also defines NONNULL; why?
I'd have thought that it would be the user's responsibility to ensure that, and that if, for instance, a developer tried to .sort() a non-SORTED stream containing null elements, he/she would rightfully be greeted with an NPE...
But then this characteristic exists. Why? The javadoc of Spliterator itself doesn't mention any real usage of it, and neither does the package-info.java of the java.util.stream package...
From the documentation of Spliterator:
A Spliterator also reports a set of characteristics() of its structure, source, and elements from among ORDERED, DISTINCT, SORTED, SIZED, NONNULL, IMMUTABLE, CONCURRENT, and SUBSIZED. These may be employed by Spliterator clients to control, specialize or simplify computation.
Note that it does not mention the prevention of NullPointerExceptions. If you sort a Stream which might contain null values it is your responsibility to provide a Comparator which can handle nulls.
The second sentence also makes it clear that using these flags is only an option, not a requirement for “Spliterator clients”, which is not limited to usage by Streams.
So regardless of whether it is used by the current implementation of the Stream API, are there possibilities to gain an advantage from the knowledge about a NONNULL characteristic?
I think so. An implementation may branch to specialized code for a non-null Spliterator and may then use null internally to represent certain states, e.g. an absent value or the initial value before processing the first element. In fact, the actual implementation code for dealing with Streams that may contain null is complicated. But of course, you always have to weigh up whether the simplification of one case justifies the code duplication.
But sometimes the simplification is as simple as this: knowing that there are no null values implies that you can internally use one of the Concurrent… collections, which don’t allow nulls.
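As a hedged sketch of that idea (the helper and its use are hypothetical): a client can inspect the characteristics and choose a null-rejecting concurrent collection only when NONNULL is reported.

import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.Spliterator;
import java.util.concurrent.ConcurrentLinkedQueue;

static <T> Collection<T> chooseBuffer(Spliterator<T> spliterator) {
    // ConcurrentLinkedQueue rejects null elements, so it is only safe as
    // an internal buffer when the spliterator guarantees NONNULL.
    return spliterator.hasCharacteristics(Spliterator.NONNULL)
            ? new ConcurrentLinkedQueue<>()
            : Collections.synchronizedList(new ArrayList<>());
}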
I found the following comments in the code for the enum StreamOpFlag.
// The following Spliterator characteristics are not currently used but a
// gap in the bit set is deliberately retained to enable corresponding
// stream flags if//when required without modification to other flag values.
//
// 4, 0x00000100 NONNULL(4, ...
// 5, 0x00000400 IMMUTABLE(5, ...
// 6, 0x00001000 CONCURRENT(6, ...
// 7, 0x00004000 SUBSIZED(7, ...
This is a question about API design. When extension methods were added in C#, IEnumerable got all the methods that enabled using lambda expressions directly on all collections.
With the advent of lambdas and default methods in Java, I would expect that Collection would implement Stream and provide default implementations for all its methods. This way, we would not need to call stream() in order to leverage the power it provides.
What is the reason the library architects opted for the less convenient approach?
From Maurice Naftalin's Lambda FAQ:
Why are Stream operations not defined directly on Collection?
Early drafts of the API exposed methods like filter, map, and reduce on Collection or Iterable. However, user experience with this design led to a more formal separation of the “stream” methods into their own abstraction. Reasons included:
Methods on Collection such as removeAll make in-place modifications, in contrast to the new methods which are more functional in nature. Mixing two different kinds of methods on the same abstraction forces the user to keep track of which are which. For example, given the declaration
Collection<String> strings;
the two very similar-looking method calls
strings.removeAll(s -> s.length() == 0);
strings.filter(s -> s.length() == 0); // not supported in the current API
would have surprisingly different results; the first would remove all empty String objects from the collection, whereas the second would return a stream containing all the empty Strings, while having no effect on the collection.
Instead, the current design ensures that only an explicitly-obtained stream can be filtered:
strings.stream().filter(s -> s.length() == 0)...;
where the ellipsis represents further stream operations, ending with a terminating operation. This gives the reader a much clearer intuition about the action of filter;
With lazy methods added to Collection, users were confused by a perceived—but erroneous—need to reason about whether the collection was in “lazy mode” or “eager mode”. Rather than burdening Collection with new and different functionality, it is cleaner to provide a Stream view with the new functionality;
The more methods added to Collection, the greater the chance of name collisions with existing third-party implementations. By only adding a few methods (stream, parallel) the chance for conflict is greatly reduced;
A view transformation is still needed to access a parallel view; the asymmetry between the sequential and the parallel stream views was unnatural. Compare, for example
coll.filter(...).map(...).reduce(...);
with
coll.parallel().filter(...).map(...).reduce(...);
This asymmetry would be particularly obvious in the API documentation, where Collection would have many new methods to produce sequential streams, but only one to produce parallel streams, which would then have all the same methods as Collection. Factoring these into a separate interface, StreamOps say, would not help; that would still, counterintuitively, need to be implemented by both Stream and Collection;
A uniform treatment of views also leaves room for other additional views in the future.
A Collection is an object model
A Stream is a subject model
Collection definition in the docs:
A collection represents a group of objects, known as its elements.
Stream definition in the docs:
A sequence of elements supporting sequential and parallel aggregate operations
Seen this way, a stream is a specific collection, not the other way around. Thus Collection should not implement Stream, regardless of backward compatibility.
So why doesn't Stream<T> implement Collection<T>? Because it is another way of looking at a bunch of objects: not as a group of elements, but through the operations you can perform on it. This is why I say a Collection is an object model while a Stream is a subject model.
First, from the documentation of Stream:
Collections and streams, while bearing some superficial similarities, have different goals. Collections are primarily concerned with the efficient management of, and access to, their elements. By contrast, streams do not provide a means to directly access or manipulate their elements, and are instead concerned with declaratively describing their source and the computational operations which will be performed in aggregate on that source.
So you want to keep the concepts of stream and collection apart. If Collection implemented Stream, every collection would be a stream, which it conceptually is not. The way it is done now, every collection can give you a stream which works on that collection, which is something different if you think about it.
Another factor that comes to mind is cohesion/coupling as well as encapsulation. If every class that implements Collection had to implement the operations of Stream as well, it would have two (kind of) different purposes and might become too large.
My guess would be that it was made that way to avoid breakage with existing code that implements Collection. It would be hard to provide a default implementation that worked correctly with all existing implementations.
How expensive is calling size() on a List or Map in Java? Or is it better to save size()'s value in a variable if it is accessed frequently?
The answer is that it depends on the actual implementation class. For some Map and Collection classes, size() is a cheap constant-time operation. For others, it may entail counting the members.
The Java Collections Cheatsheet (V2) is normally a good source for this kind of information, but the host server is currently a bit sick.
The "coderfriendly.com" domain is no more, but I tracked down a copy of the cheat-sheet on scribd.com.
The cost of size() will also be obvious from looking at the source code. (And this is an "implementation detail" that is pretty much guaranteed to not change ... for the standard collection classes.)
FOLLOWUP
Unfortunately, the cheatsheet only documents the complexity of size for queue implementations. I think that's because it is O(1) for all other collections; see #seanizer's answer.
List and Map are interfaces, so it's impossible to say. For the implementations in the Java Standard API, the size is generally kept in a field and thus not performance-relevant.
For most Collections, calling size() is a constant-time operation. There are however some exceptions. One is ConcurrentLinkedQueue. From the Javadoc of the size() method:
Beware that, unlike in most collections, this method is NOT a constant-time operation. Because of the asynchronous nature of these queues, determining the current number of elements requires an O(n) traversal.
So I'm afraid there's no generic answer, you have to check the documentation of the individual collection you are using.
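When you do hit one of those O(n) cases, the usual remedy is to hoist the call out of the loop; a sketch:

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

Queue<String> queue = new ConcurrentLinkedQueue<>();
// ConcurrentLinkedQueue.size() traverses the whole queue, so evaluate it
// once instead of re-traversing the queue on every loop iteration.
int size = queue.size();
for (int i = 0; i < size; i++) {
    // ... work that does not change the queue's length ...
}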
For ArrayList, the implementation looks like this:
public int size() {
    return size; // the size is kept in a field
}
So there is no overhead.
You can check the source code for detailed info on your required implementation.
Note: the source given is from OpenJDK.
Implement it, then test it. If it is slow, take a closer look.
"Premature optimisation is the root of all evil." - D. Knuth
Also: You should not require certain implementation features, especially if they are black-boxed. What happens if you replace that list with a concurrent list at a later date? What happens if Oracle decides to rewrite List? Will it still be fast? You just don't know.
You don't have to worry much about that. The list implementations keep track of size. The cost of the call is just O(1). If you are very curious, you can read the source code for the implementations of Collection's concrete classes and see the size() method there.
The implementation gets it from a private, pre-computed field, so it's not expensive.
No need to store it. It's not at all expensive. Check the source of ArrayList and HashMap.
I think some linked-list implementations count the total on each call. The call to a method itself can be a little taxing, but that would only really be an issue in large iterations or hardware-driver code.
In either case, if you save it to a local variable, there won't be any problems.
If all that you're doing is a simple one-pass iteration (i.e. only hasNext() and next(), no remove()), are you guaranteed linear time performance and/or amortized constant cost per operation?
Is this specified in the Iterator contract anywhere?
Are there data structures/Java Collection which cannot be iterated in linear time?
java.util.Scanner implements Iterator<String>. A Scanner is hardly a data structure (e.g. remove() makes absolutely no sense). Is this considered a design blunder?
Is something like PrimeGenerator implements Iterator<Integer> considered bad design, or is this exactly what Iterator is for? (hasNext() always returns true, next() computes the next number on demand, remove() makes no sense).
Similarly, would it have made sense for java.util.Random implements Iterator<Double>?
Should a type really implement Iterator if it's effectively only using one-third of its API? (i.e. no remove(), always hasNext())
There is no such guarantee. As you point out, anyone can model anything as Iterator. Individual producers of iterators would have to specify their individual performance.
Nothing in the Iterator documentation mentions any kind of performance guarantee, so there is no guarantee.
It also wouldn't make sense to require this constraint on such a universal tool.
A much more useful convention would be for a particular iterator() method to document the time constraints that the returned Iterator fulfills (for example, an Iterator over a general-purpose Collection will most likely be able to guarantee linear-time operation).
Similarly, nothing in the documentation requires hasNext() to ever return false, so an endless Iterator would be perfectly valid.
However, there is a general assumption that all Iterator instances behave like "normal" Iterator instances as returned by Collection.iterator() in that they return some number of values and end at some point. This is not required by the documentation and, strictly speaking, any code depending on that fact would be subtly broken.
All of your proposals sound reasonable for Iterator. The API docs explicitly say remove() need not be supported, and suggest that one not use the older Enumeration, which works just like Iterator except without remove.
Also, infinite-length streams are a very useful concept in functional programming, and can be implemented with an Iterator that always hasNext. It's a feature that it can handle either case.
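For instance, here is a minimal sketch of such an endless Iterator, a naive trial-division prime generator (purely illustrative):

import java.util.Iterator;

class PrimeGenerator implements Iterator<Long> {
    private long current = 1;

    @Override
    public boolean hasNext() {
        return true; // endless: there is always a next prime
    }

    @Override
    public Long next() {
        long candidate = current;
        do {
            candidate++;
        } while (!isPrime(candidate));
        current = candidate;
        return candidate;
    }

    private static boolean isPrime(long n) {
        for (long i = 2; i * i <= n; i++) {
            if (n % i == 0) return false;
        }
        return n >= 2;
    }
}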
It sounds like you're thinking of iterators in the sense of a list or set traversal. I think a more useful mental model is a discrete object stream: anything that you want to handle one at a time and that can be streamed from a source in terms of discrete instances.
In that sense, a stream of prime numbers or of list objects both make sense, and the model doesn't imply anything about the finiteness of the data source.
I can imagine a use case for this, and it seems intuitive enough. Personally, I think it's fine.
// assumes PrimeGenerator also implements Iterable<Long>, e.g. with
// iterator() returning the generator itself, so for-each works
for (long prime : new PrimeGenerator()) {
    // do stuff
    if (condition) {
        break;
    }
}