I read some questions about how to create a finite Stream (Finite generated Stream in Java - how to create one?, How do streams stop?).
The answers suggested implementing a Spliterator. The Spliterator would implement the logic of how to advance and which element to provide next (tryAdvance). But there are two other non-default methods, trySplit and estimateSize(), which I would have to implement.
The JavaDoc of Spliterator says:
An object for traversing and partitioning elements of a source. The source of elements covered by a Spliterator could be, for example, an array, a Collection, an IO channel, or a generator function. ... The Spliterator API was designed to support efficient parallel traversal in addition to sequential traversal, by supporting decomposition as well as single-element iteration. ...
On the other hand, I could implement the logic for advancing to the next element around a Stream.Builder and bypass a Spliterator. On every advance I would call accept or add, and at the end build. So it looks quite simple.
What does the JavaDoc say?
A mutable builder for a Stream. This allows the creation of a Stream by generating elements individually and adding them to the Builder (without the copying overhead that comes from using an ArrayList as a temporary buffer).
Using StreamSupport.stream, I can obtain a Stream from a Spliterator. And a Builder will also provide a Stream.
When should / could I use a Stream.Builder?
Only if a Spliterator wouldn't be more efficient (for instance because the source cannot be partitioned and its size cannot be estimated)?
Note that you can extend Spliterators.AbstractSpliterator. Then there is only tryAdvance to implement, so the complexity of implementing a Spliterator is not higher.
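For illustration, here is a minimal sketch of a finite stream built this way; the class name FibSpliterator and the choice of Fibonacci numbers are made up for the example:

import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// A finite stream source: the first n Fibonacci numbers. Only tryAdvance
// is implemented; trySplit and estimateSize are inherited from the base class.
class FibSpliterator extends Spliterators.AbstractSpliterator<Long> {
    private long previous = 0, current = 1;
    private long remaining;

    FibSpliterator(long n) {
        // Report the size estimate and characteristics to the superclass.
        super(n, Spliterator.ORDERED | Spliterator.IMMUTABLE);
        this.remaining = n;
    }

    @Override
    public boolean tryAdvance(Consumer<? super Long> action) {
        if (remaining == 0) return false; // signals the end of the stream
        remaining--;
        action.accept(previous);
        long next = previous + current;
        previous = current;
        current = next;
        return true;
    }
}

Obtaining the stream then works via StreamSupport:

Stream<Long> fib = StreamSupport.stream(new FibSpliterator(10), false);
fib.forEach(System.out::println); // 0 1 1 2 3 5 8 13 21 34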
The fundamental difference is that a Spliterator’s tryAdvance method is only invoked when a new element is needed. In contrast, the Stream.Builder has a storage that will be filled with all stream elements before you can acquire a Stream.
So a Spliterator is the first choice for all kinds of lazy evaluations, as well as when you have an existing storage you want to traverse, to avoid copying the data.
The builder is the first choice when the creation of the elements is non-uniform, so you can’t express the creation of an element on demand. Think of situations where you would otherwise use Stream.of(…), but it turns out to be too inflexible.
E.g. you have Stream.of(a, b, c, d, e), but now it turns out that c and d are optional. So the solution is:
Stream.Builder<MyType> builder = Stream.builder();
builder.add(a).add(b);
if(someCondition) builder.add(c).add(d);
builder.add(e).build()
/* stream operations */
Other use cases are this answer, where a Consumer was needed to query an existing spliterator and push the value back to a Stream afterwards, or this answer, where a structure without random access (a class hierarchy) should be streamed in the opposite order.
On the other hand, I could implement the logic for advancing to the next element around a Stream.Builder and bypass a Spliterator. On every advance I would call accept or add, and at the end build. So it looks quite simple.
Yes and no. It is simple, but I don't think you understand the usage model:
A stream builder has a lifecycle, which starts in a building phase, during which elements can be added, and then transitions to a built phase, after which elements may not be added. The built phase begins when the build() method is called, which creates an ordered Stream whose elements are the elements that were added to the stream builder, in the order they were added.
(Javadocs)
In particular, no: you would not invoke a Stream.Builder's accept or add method on any stream advance. You need to provide all the objects for the stream in advance; then you build() to get a stream that will provide all the objects you previously added. This is analogous to adding all the objects to a List and then invoking that List's stream() method.
If that serves your purposes and you can in fact do it efficiently then great! But if you need to generate elements on an as-needed basis, whether with or without limit, then Stream.Builder cannot help you. Spliterator can.
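A minimal sketch of that lifecycle (the class name and element values are made up for illustration):

import java.util.stream.Stream;

public class BuilderLifecycle {
    public static void main(String[] args) {
        Stream.Builder<String> builder = Stream.builder();
        builder.add("a");                    // building phase: elements may be added
        builder.add("b");
        Stream<String> s = builder.build();  // built phase begins; no more adds
        // builder.add("c");                 // would throw IllegalStateException
        s.forEach(System.out::println);      // prints a, b
    }
}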
The Stream.Builder is a misnomer, as streams can't really be built. Things that can be built are value objects: a DTO, an array, a collection.
So if Stream.Builder is instead thought of as a buffer, it might be easier to understand, e.g.:
buffer.add(a)
buffer.add(b)
buffer.stream()
This shows how similar it is to an ArrayList:
list.add(a)
list.add(b)
list.stream()
On the other hand, Spliterator is the basis of a stream and allows for efficient navigation over data sets (an improved version of Iterator).
So the answer is they should not be compared. Comparing Stream.Builder to Spliterator is the same as comparing ArrayList to Spliterator.
Related
I have a concern about the statement that there can be only one terminal operation and that terminal operations cannot be chained. We can write something like this, right?
stream1.map(...).collect(...).forEach(...)
Isn't this chaining collect and forEach, which are both terminal operations? I don't get that part.
The above works fine. Assuming you meant collect(Collectors.toList()), the forEach here is a List operation, not a Stream operation. Perhaps the source of your confusion: there is a forEach method on Stream as well, but that's not what you're using.
Even if it weren't, nothing stops you from creating another stream from something that can be streamed; you just can't reuse the same stream you created in the first place.
Stream has forEach, and List has forEach (by extending Iterable). Different methods, but with the same name and purpose. Only the former is a terminal operation.
One practical difference is that the Stream version can be called on a parallel stream, in which case the order is not guaranteed and might appear "random". The version from Iterable always runs on the calling thread, in an order guaranteed to match that of its Iterator.
Your example is terminally collecting the stream's items into a List, then calling forEach on that List.
That example is bad style because the intermediate List is useless: it creates a List for something you could have done directly on the Stream.
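To make the distinction concrete, here is a sketch of both variants (the element values are arbitrary):

import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class TerminalDemo {
    public static void main(String[] args) {
        // collect is the single terminal operation of this stream pipeline.
        List<String> result = Stream.of("a", "b", "c")
                .map(String::toUpperCase)
                .collect(Collectors.toList());

        // This forEach is List.forEach (inherited from Iterable),
        // not the Stream terminal operation of the same name.
        result.forEach(System.out::println);

        // Better style: one pipeline with a single terminal operation.
        Stream.of("a", "b", "c")
                .map(String::toUpperCase)
                .forEach(System.out::println); // Stream.forEach, terminal
    }
}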
I am unable to iterate (a second time) over a stream created from a spliterator. I could not find documentation about this.
Here is what I am doing: I got an Iterable as a function argument, and I am iterating over it via a stream, like the following code:
StreamSupport.stream(values.spliterator(), false)
Following that, I do it again, but the second stream does not iterate at all. I spent a lot of time debugging it and finally converted the Iterable to a list at the very beginning.
Does anyone know the reason?
Edit: sorry if I was not clear. I was not using the stream multiple times; I was generating the stream in the above way from the same Iterable. The Iterable is the one coming from reduce in a MapReduce job.
Thanks,
Hareendra
A Stream is a one-shot object: you can consume it only once, not multiple times. If you want to use the contents multiple times, you have to do as you did, converting the stream to a list, an array, or anything non-streamy, and then creating two new streams out of it for the two things you want to do.
Quote from JavaDoc of Stream class:
A stream should be operated on (invoking an intermediate or terminal stream operation) only once. This rules out, for example, "forked" streams, where the same source feeds two or more pipelines, or multiple traversals of the same stream.
Be sure that the Iterable instance you are using to create your Spliterator truly honors the Iterable contract. Some people make the mistake of thinking that anything that implements iterator() will serve as an Iterable, and that is not the case. To comply with the Iterable contract, one must be able to call iterator() multiple times and be able to iterate with it each time.
Because it is very easy to create an Iterable from anything that has an iterator() function, I have seen several cases of a manufactured Iterable exhibiting the behavior you mention. For example, one can do this:
Stream<String> stream = ...
Iterable<String> falseIterable = stream::iterator;
falseIterable does not follow the required Iterable semantics, because falseIterable.iterator(), being a wrapper around stream.iterator(), will not return a usable Iterator a second time once it has been iterated over.
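A short sketch demonstrating both the broken and the working behavior (the element values are made up):

import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class IterableContractDemo {
    public static void main(String[] args) {
        Stream<String> stream = Stream.of("a", "b", "c");

        // Compiles, but violates the Iterable contract: the second call to
        // iterator() fails because the underlying stream is already consumed.
        Iterable<String> falseIterable = stream::iterator;
        falseIterable.forEach(System.out::println);    // works once
        // falseIterable.forEach(System.out::println); // IllegalStateException

        // Materializing into a List yields a true Iterable that can be
        // iterated any number of times.
        List<String> list = Stream.of("a", "b", "c").collect(Collectors.toList());
        list.forEach(System.out::println);
        list.forEach(System.out::println);             // fine the second time
    }
}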
Let's say you have a collection with some strings and you want to return the first two characters of each string (or some other manipulation...).
In Java 8, for this case, you can use either the map or the forEach method on the stream() you get from the collection (maybe something else too, but that is not important right now).
Personally I would use map, primarily because I associate forEach with mutating the collection, and I want to avoid that. I also wrote a really small performance test but could not see any improvement when using forEach (I perfectly understand that small tests cannot give reliable results, but still).
So what are the use-cases where one should choose forEach?
map is the better choice for this, because you're not trying to do anything with the strings yet, just map them to different strings.
forEach is designed to be the "final operation." As such, it doesn't return anything, and is all about mutating some state -- though not necessarily that of the original collection. For instance, you might use it to write elements to a file, having used other constructs (including map) to get those elements.
forEach terminates the stream and is executed for the side effect of the supplied Consumer. It does not necessarily mutate the stream members.
map maps each stream element to a different value/object using a provided Function. A Stream<R> is returned, on which more steps can act.
The forEach terminal operation might be useful in several cases: when you want to collect into some older class for which you don't have a proper collector, or when you don't want to collect at all but send your data somewhere outside (write into a database, print into an OutputStream, etc.). There are many cases where the best way is to use both map (as an intermediate operation) and forEach (as the terminal operation).
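A small sketch of the two roles, using the first-two-characters example from the question (the input values are made up):

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class MapVsForEach {
    public static void main(String[] args) {
        List<String> words = Arrays.asList("stream", "builder", "spliterator");

        // map: a lazy intermediate operation producing transformed elements;
        // nothing runs until the terminal collect.
        List<String> prefixes = words.stream()
                .map(s -> s.substring(0, 2))
                .collect(Collectors.toList()); // [st, bu, sp]

        // forEach: a terminal operation used purely for its side effect,
        // here sending the transformed elements somewhere outside.
        words.stream()
                .map(s -> s.substring(0, 2))
                .forEach(System.out::println);
    }
}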
I'm trying to understand how Spliterator works and how spliterators are designed. I recognize that trySplit() is likely one of the more important methods of Spliterator, but when I look at third-party Spliterator implementations, I sometimes see that their spliterators return null from trySplit() unconditionally.
The questions:
Is there a difference between an ordinary iterator and a Spliterator that returns null unconditionally? It seems like such a spliterator defeats the point of, well, splitting.
Of course, there are legitimate use cases of spliterators that conditionally return null on trySplit(), but is there a legitimate use case of a spliterator that unconditionally returns null?
While the main advantage of Spliterator over Iterator is, as you said, its trySplit() method which allows it to be parallelized, there are other significant advantages:
http://docs.oracle.com/javase/8/docs/api/java/util/Spliterator.html
The Spliterator API was designed to support efficient parallel traversal in addition to sequential traversal, by supporting decomposition as well as single-element iteration. In addition, the protocol for accessing elements via a Spliterator is designed to impose smaller per-element overhead than Iterator, and to avoid the inherent race involved in having separate methods for hasNext() and next().
Furthermore, Spliterators can be converted directly to Streams using StreamSupport.stream to make use of Java 8's streams.
One of the purposes of a Spliterator is to be able to split, but that's not the only purpose. The other main purpose is as a support class for creating your own Stream source. One way to create a Stream source is to implement your own Spliterator and pass it to StreamSupport.stream. The simplest thing to do is often to write a Spliterator that can't split. Doing so forces the stream to execute sequentially, but that might be acceptable for whatever you're trying to do.
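For example, a deliberately non-splittable Spliterator might look like this sketch (the class name and the counter source are hypothetical):

import java.util.Spliterator;
import java.util.function.Consumer;

// A sequential-only source: trySplit() returns null unconditionally, so a
// stream built on it runs sequentially even if parallel() is requested.
class CounterSpliterator implements Spliterator<Long> {
    private long next = 0;

    @Override
    public boolean tryAdvance(Consumer<? super Long> action) {
        action.accept(next++);
        return true; // infinite source; combine with limit(...)
    }

    @Override
    public Spliterator<Long> trySplit() {
        return null; // refuse to split
    }

    @Override
    public long estimateSize() {
        return Long.MAX_VALUE; // i.e., unknown, per the Spliterator contract
    }

    @Override
    public int characteristics() {
        return ORDERED | NONNULL;
    }
}

A sequential stream over it is then, e.g., StreamSupport.stream(new CounterSpliterator(), false).limit(5).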
There are other cases where writing a non-splittable Spliterator makes sense. For example, in OpenJDK, there are implementations such as EmptySpliterator that contain no elements. Of course it can't be split. A similar case is a singleton spliterator that contains exactly one element. It can't be split either. Both implementations return null unconditionally from trySplit.
Another case is where writing a non-splittable Spliterator is easy and effective, and the amount of code necessary to implement a splittable one is prohibitive. (At least, not worth the effort of writing one into a Stack Overflow answer.) For example, see the example Spliterator from this answer. The case here is that the Spliterator implementation wants to wrap another Spliterator and do something special, in this case check to see if it's not empty. Otherwise it just delegates everything to the wrapped Spliterator. Doing this with a non-splittable Spliterator is pretty easy.
Notice that there's discussion in that answer, the comment on that answer, in my answer to the same question, and the comment thread on my answer, about how one would make a splittable (i.e., parallel-ready) Spliterator. But nobody actually wrote out the code to do the splitting. :-) Depending upon how much laziness you want to preserve from the original stream, and how much parallel efficiency you want, writing a splittable Spliterator can get pretty complicated.
In my estimation it's somewhat easier to do this sort of stuff by writing an Iterator instead of a Spliterator (as in my answer noted above). It turns out that Spliterators.spliteratorUnknownSize can provide a limited amount of parallelism, even from an Iterator, which is apparently a purely sequential construct. It does so within IteratorSpliterator, which pulls multiple elements from the Iterator and processes them in batches. Unfortunately the batch size is hardcoded, but at least this gives the opportunity for processing elements pulled from an Iterator in parallel in certain cases.
There are more advantages than just splitting support:
The iteration logic is contained in a single tryAdvance method, rather than being spread over two methods like hasNext and next. Splitting the logic over two methods complicates a lot of Iterator implementations, as it often implies that the hasNext method has to perform an actual query attempt that might yield a value, which then has to be remembered for the follow-up next call. And the fact that this query has been made must be remembered as well, either explicitly or implicitly.
It would be easier if there were a guarantee that hasNext/next are always called in the typical alternating fashion; however, there is no such guarantee.
One example is BufferedReader.readLine(), which has simple tryAdvance logic (see the sketch after this list). A wrapping Iterator has to call that method within the hasNext implementation and remember the line for the next call. (Ironically, the current BufferedReader.lines() implementation does implement such a complicated Iterator that will be wrapped into a Spliterator, instead of implementing the much simpler Spliterator directly. It seems the “I’m not familiar with that” problem should not be underestimated.)
estimateSize(); a Spliterator may return an estimate (or even an exact number) of the remaining items, which can be used to pre-allocate resources. This can improve efficiency.
characteristics(); Spliterators can provide additional information about their content or behavior. Besides telling whether the estimated size is an exact size, you can learn whether you may see null values, whether there is a defined encounter order, or whether all values are distinct. A particular algorithm may take advantage of this. Clearly, the Stream API is built of such algorithms, so when planning to create (or support the creation of) streams and you have a choice, implementing a Spliterator that reports as much meta-information as possible is preferable to implementing an Iterator that will be wrapped later.
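To illustrate the first point, here is a sketch of a Spliterator implemented directly around BufferedReader.readLine() (the class name LineSpliterator is made up; the real JDK implementation differs):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;

// tryAdvance maps one-to-one onto readLine(); no look-ahead state is
// needed, unlike with a hasNext/next Iterator pair.
class LineSpliterator extends Spliterators.AbstractSpliterator<String> {
    private final BufferedReader reader;

    LineSpliterator(BufferedReader reader) {
        super(Long.MAX_VALUE, Spliterator.ORDERED | Spliterator.NONNULL);
        this.reader = reader;
    }

    @Override
    public boolean tryAdvance(Consumer<? super String> action) {
        try {
            String line = reader.readLine();
            if (line == null) return false; // end of input
            action.accept(line);
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

A stream of lines is then StreamSupport.stream(new LineSpliterator(reader), false).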
This is a question about API design. When extension methods were added in C#, IEnumerable gained all the methods that enable using lambda expressions directly on all collections.
With the advent of lambdas and default methods in Java, I would expect that Collection would implement Stream and provide default implementations for all its methods. This way, we would not need to call stream() in order to leverage the power it provides.
What is the reason the library architects opted for the less convenient approach?
From Maurice Naftalin's Lambda FAQ:
Why are Stream operations not defined directly on Collection?
Early drafts of the API exposed methods like filter, map, and reduce on Collection or Iterable. However, user experience with this design led to a more formal separation of the “stream” methods into their own abstraction. Reasons included:
Methods on Collection such as removeAll make in-place modifications, in contrast to the new methods which are more functional in nature. Mixing two different kinds of methods on the same abstraction forces the user to keep track of which are which. For example, given the declaration
Collection<String> strings;
the two very similar-looking method calls
strings.removeAll(s -> s.length() == 0);
strings.filter(s -> s.length() == 0); // not supported in the current API
would have surprisingly different results; the first would remove all empty String objects from the collection, whereas the second would return a stream containing all the empty Strings, while having no effect on the collection.
Instead, the current design ensures that only an explicitly-obtained stream can be filtered:
strings.stream().filter(s -> s.length() == 0)...;
where the ellipsis represents further stream operations, ending with a terminating operation. This gives the reader a much clearer intuition about the action of filter;
With lazy methods added to Collection, users were confused by a perceived—but erroneous—need to reason about whether the collection was in “lazy mode” or “eager mode”. Rather than burdening Collection with new and different functionality, it is cleaner to provide a Stream view with the new functionality;
The more methods added to Collection, the greater the chance of name collisions with existing third-party implementations. By only adding a few methods (stream, parallelStream), the chance for conflict is greatly reduced;
A view transformation is still needed to access a parallel view; the asymmetry between the sequential and the parallel stream views was unnatural. Compare, for example
coll.filter(...).map(...).reduce(...);
with
coll.parallel().filter(...).map(...).reduce(...);
This asymmetry would be particularly obvious in the API documentation, where Collection would have many new methods to produce sequential streams, but only one to produce parallel streams, which would then have all the same methods as Collection. Factoring these into a separate interface, StreamOps say, would not help; that would still, counterintuitively, need to be implemented by both Stream and Collection;
A uniform treatment of views also leaves room for other additional views in the future.
A Collection is an object model
A Stream is a subject model
Collection definition in the doc:
A collection represents a group of objects, known as its elements.
Stream definition in the doc:
A sequence of elements supporting sequential and parallel aggregate operations
Seen this way, a stream is a specific kind of collection, not the other way around. Thus Collection should not implement Stream, regardless of backward compatibility.
So why doesn't Stream<T> implement Collection<T>? Because it is another way of looking at a bunch of objects: not as a group of elements, but through the operations you can perform on it. That is why I say a Collection is an object model while a Stream is a subject model.
First, from the documentation of Stream:
Collections and streams, while bearing some superficial similarities, have different goals. Collections are primarily concerned with the efficient management of, and access to, their elements. By contrast, streams do not provide a means to directly access or manipulate their elements, and are instead concerned with declaratively describing their source and the computational operations which will be performed in aggregate on that source.
So you want to keep the concepts of stream and collection apart. If Collection implemented Stream, every collection would be a stream, which it conceptually is not. The way it is done now, every collection can give you a stream that works on that collection, which is something different if you think about it.
Another factor that comes to mind is cohesion/coupling, as well as encapsulation. If every class that implements Collection had to implement the operations of Stream as well, it would have two (somewhat) different purposes and might become too large.
My guess would be that it was made that way to avoid breakage with existing code that implements Collection. It would be hard to provide a default implementation that worked correctly with all existing implementations.