Why doesn't Collection<T> Implement Stream<T>? [duplicate] - java

This is a question about API design. When extension methods were added in C#, IEnumerable got all the methods that enabled using lambda expressions directly on all collections.
With the advent of lambdas and default methods in Java, I would expect that Collection would implement Stream and provide default implementations for all its methods. This way, we would not need to call stream() in order to leverage the power it provides.
What is the reason the library architects opted for the less convenient approach?

From Maurice Naftalin's Lambda FAQ:
Why are Stream operations not defined directly on Collection?
Early drafts of the API exposed methods like filter, map, and reduce on Collection or Iterable. However, user experience with this design led to a more formal separation of the “stream” methods into their own abstraction. Reasons included:
Methods on Collection such as removeAll make in-place modifications, in contrast to the new methods which are more functional in nature. Mixing two different kinds of methods on the same abstraction forces the user to keep track of which are which. For example, given the declaration
Collection<String> strings;
the two very similar-looking method calls
strings.removeAll(s -> s.length() == 0);
strings.filter(s -> s.length() == 0); // not supported in the current API
would have surprisingly different results; the first would remove all empty String objects from the collection, whereas the second would return a stream containing all the non-empty Strings, while having no effect on the collection.
Instead, the current design ensures that only an explicitly-obtained stream can be filtered:
strings.stream().filter(s -> s.length() == 0)...;
where the ellipsis represents further stream operations, ending with a terminating operation. This gives the reader a much clearer intuition about the action of filter;
With lazy methods added to Collection, users were confused by a perceived—but erroneous—need to reason about whether the collection was in “lazy mode” or “eager mode”. Rather than burdening Collection with new and different functionality, it is cleaner to provide a Stream view with the new functionality;
The more methods added to Collection, the greater the chance of name collisions with existing third-party implementations. By only adding a few methods (stream, parallel) the chance for conflict is greatly reduced;
A view transformation is still needed to access a parallel view; the asymmetry between the sequential and the parallel stream views was unnatural. Compare, for example
coll.filter(...).map(...).reduce(...);
with
coll.parallel().filter(...).map(...).reduce(...);
This asymmetry would be particularly obvious in the API documentation, where Collection would have many new methods to produce sequential streams, but only one to produce parallel streams, which would then have all the same methods as Collection. Factoring these into a separate interface, StreamOps say, would not help; that would still, counterintuitively, need to be implemented by both Stream and Collection;
A uniform treatment of views also leaves room for other additional views in the future.
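The contrast between in-place methods and stream methods described above can be sketched as follows. This is only an illustration: the class and method names are made up, and it uses `removeIf`, which is the predicate-accepting in-place method on `Collection` in the released Java 8 API (the quoted FAQ text writes `removeAll`).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class InPlaceVersusStream {

    // Stream-based filtering: returns a new list and leaves the source untouched.
    static List<String> nonEmptyCopy(List<String> source) {
        return source.stream()
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    // In-place removal: Collection.removeIf mutates the collection itself.
    static void removeEmptyInPlace(List<String> source) {
        source.removeIf(String::isEmpty);
    }

    public static void main(String[] args) {
        List<String> strings = new ArrayList<>(Arrays.asList("a", "", "b", ""));

        List<String> copy = nonEmptyCopy(strings);
        System.out.println(copy.size());    // 2
        System.out.println(strings.size()); // 4, the source is unchanged

        removeEmptyInPlace(strings);
        System.out.println(strings.size()); // 2, the source was modified
    }
}
```

Having to write `stream()` explicitly makes it visible at the call site which of the two behaviours is being requested.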

A Collection is an object model
A Stream is a subject model
Collection definition in the docs:
A collection represents a group of objects, known as its elements.
Stream definition in the docs:
A sequence of elements supporting sequential and parallel aggregate operations
Seen this way, a stream is a specific kind of collection, not the other way around. Thus Collection should not implement Stream, regardless of backward compatibility.
So why doesn't Stream<T> implement Collection<T>? Because it is another way of looking at a bunch of objects: not as a group of elements, but in terms of the operations you can perform on them. This is why I say a Collection is an object model while a Stream is a subject model.

First, from the documentation of Stream:
Collections and streams, while bearing some superficial similarities, have different goals. Collections are primarily concerned with the efficient management of, and access to, their elements. By contrast, streams do not provide a means to directly access or manipulate their elements, and are instead concerned with declaratively describing their source and the computational operations which will be performed in aggregate on that source.
So you want to keep the concepts of stream and collection apart. If Collection implemented Stream, every collection would be a stream, which conceptually it is not. The way it is done now, every collection can give you a stream that works on that collection, which is something different if you think about it.
Another factor that comes to mind is cohesion/coupling as well as encapsulation. If every class that implements Collection had to implement the operations of Stream as well, it would have two (kind of) different purposes and might become too long.
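One concrete way to see that a collection and a stream over it are different things: a stream supports a single traversal, while the collection can hand out a fresh stream whenever one is needed. A minimal sketch, with a hypothetical class and method name:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

class CollectionVersusStreamView {

    // Returns true if a second terminal operation on the same stream is rejected.
    static boolean secondTraversalFails(Stream<String> stream) {
        stream.count(); // the first terminal operation consumes the stream
        try {
            stream.count(); // a consumed stream cannot be reused
            return false;
        } catch (IllegalStateException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("x", "y", "z");

        // The collection can produce a new stream as often as needed.
        System.out.println(names.stream().count());
        System.out.println(names.stream().count());

        // A single stream, by contrast, supports only one traversal.
        System.out.println(secondTraversalFails(names.stream()));
    }
}
```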

My guess would be that it was made that way to avoid breakage with existing code that implements Collection. It would be hard to provide a default implementation that worked correctly with all existing implementations.

Related

Spliterator vs Stream.Builder

I read some questions on how to create a finite Stream (Finite generated Stream in Java - how to create one?, How do streams stop?).
The answers suggested implementing a Spliterator. The Spliterator would implement the logic of how to provide the next element, and which one (tryAdvance). But there are two other non-default methods, trySplit and estimateSize(), which I would have to implement.
The JavaDoc of Spliterator says:
An object for traversing and partitioning elements of a source. The source of elements covered by a Spliterator could be, for example, an array, a Collection, an IO channel, or a generator function. ... The Spliterator API was designed to support efficient parallel
traversal in addition to sequential traversal, by supporting
decomposition as well as single-element iteration. ...
On the other hand I could implement the logic how to advance to the next element around a Stream.Builder and bypass a Spliterator. On every advance I would call accept or add and at the end build. So it looks quite simple.
What does the JavaDoc say?
A mutable builder for a Stream. This allows the creation of a Stream
by generating elements individually and adding them to the Builder
(without the copying overhead that comes from using an ArrayList as a
temporary buffer.)
Using StreamSupport.stream I can use a Spliterator to obtain a Stream. And also a Builder will provide a Stream.
When should / could I use a Stream.Builder?
Only if a Spliterator wouldn't be more efficient (for instance because the source cannot be partitioned and its size cannot be estimated)?
Note that you can extend Spliterators.AbstractSpliterator. Then, there is only tryAdvance to implement.
So the complexity of implementing a Spliterator is not higher.
The fundamental difference is that a Spliterator’s tryAdvance method is only invoked when a new element is needed. In contrast, the Stream.Builder has a storage which will be filled with all stream elements, before you can acquire a Stream.
So a Spliterator is the first choice for all kinds of lazy evaluations, as well as when you have an existing storage you want to traverse, to avoid copying the data.
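As a rough sketch of the AbstractSpliterator approach, the following produces a finite, lazily generated stream with only tryAdvance implemented. The countdown source and all names are invented for illustration; the size is reported as unknown via Long.MAX_VALUE.

```java
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

class CountdownSpliterator extends Spliterators.AbstractSpliterator<Integer> {

    private int next;

    CountdownSpliterator(int start) {
        // Long.MAX_VALUE means "size unknown"; elements are ordered and never null.
        super(Long.MAX_VALUE, Spliterator.ORDERED | Spliterator.NONNULL);
        this.next = start;
    }

    @Override
    public boolean tryAdvance(Consumer<? super Integer> action) {
        if (next <= 0) {
            return false; // signals the end of the stream
        }
        action.accept(next--); // invoked only when the pipeline asks for an element
        return true;
    }

    // Wraps the spliterator in a sequential stream.
    static Stream<Integer> countdown(int start) {
        return StreamSupport.stream(new CountdownSpliterator(start), false);
    }

    public static void main(String[] args) {
        List<Integer> values = countdown(3).collect(Collectors.toList());
        System.out.println(values); // [3, 2, 1]
    }
}
```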
The builder is the first choice when the creation of the elements is non-uniform, so you can’t express the creation of an element on demand. Think of situations where you would otherwise use Stream.of(…), but it turns out to be too inflexible.
E.g. you have Stream.of(a, b, c, d, e), but now it turns out, c and d are optional. So the solution is
Stream.Builder<MyType> builder = Stream.builder();
builder.add(a).add(b);
if(someCondition) builder.add(c).add(d);
builder.add(e).build()
/* stream operations */
Other use cases are this answer, where a Consumer was needed to query an existing spliterator and push the value back to a Stream afterwards, or this answer, where a structure without random access (a class hierarchy) should be streamed in the opposite order.
On the other hand I could implement the logic how to advance to the next element around a Stream.Builder and bypass a Spliterator. On every advance I would call accept or add and at the end build. So it looks quite simple.
Yes and no. It is simple, but I don't think you understand the usage model:
A stream builder has a lifecycle, which starts in a building phase, during which elements can be added, and then transitions to a built phase, after which elements may not be added. The built phase begins when the build() method is called, which creates an ordered Stream whose elements are the elements that were added to the stream builder, in the order they were added.
(Javadocs)
In particular no, you would not invoke a Stream.Builder's accept or add method on any stream advance. You need to provide all the objects for the stream in advance. Then you build() to get a stream that will provide all the objects you previously added. This is analogous to adding all the objects to a List, and then invoking that List's stream() method.
If that serves your purposes and you can in fact do it efficiently then great! But if you need to generate elements on an as-needed basis, whether with or without limit, then Stream.Builder cannot help you. Spliterator can.
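The lifecycle can be illustrated with a small sketch (class and method names are hypothetical). All elements are added up front, and an attempt to add after build() fails with an IllegalStateException, as documented for Stream.Builder.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class BuilderLifecycle {

    // Building phase: every element must be known before build() is called.
    static List<String> buildAll(boolean includeOptional) {
        Stream.Builder<String> builder = Stream.builder();
        builder.add("a").add("b");
        if (includeOptional) {
            builder.add("c");
        }
        return builder.build().collect(Collectors.toList());
    }

    // Built phase: returns true if adding after build() is rejected.
    static boolean addAfterBuildFails() {
        Stream.Builder<String> builder = Stream.builder();
        builder.add("a");
        builder.build();
        try {
            builder.add("late");
            return false;
        } catch (IllegalStateException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(buildAll(true));       // [a, b, c]
        System.out.println(addAfterBuildFails()); // true
    }
}
```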
The Stream.Builder is a misnomer as streams can't really be built. Things that can be built are value objects - dto, array, collection.
So if Stream.Builder is instead thought of as a buffer, it might help understand it better, eg:
buffer.add(a)
buffer.add(b)
buffer.stream()
This shows how similar it is to an ArrayList:
list.add(a)
list.add(b)
list.stream()
On the other hand, Spliterator is the basis of a stream and allows for efficient navigation over data sets (improved version of the Iterator).
So the answer is they should not be compared. Comparing Stream.Builder to Spliterator is the same as comparing ArrayList to Spliterator.

Why are Stream operations duplicated in Collectors?

Please allow me a few complaints; they may be boring, but I want to explain why this question came up.
Last night I answered questions here, here and here, and my answers differed from the others.
After digging into it, I found a lot of logic duplicated between Stream and Collector, which violates the Don't Repeat Yourself principle, e.g. Stream#map & Collectors#mapping, Stream#filter & Collectors#filtering (in JDK 9), etc.
But it seems reasonable, since Stream abides by the Tell, Don't Ask principle/Law of Demeter and Collector abides by the Composition over Inheritance principle.
I can think of only a few reasons why Stream operations are duplicated in Collectors:
We don't care how the Stream was created in a larger context. In this case a Stream operation is more effective and faster than a Collector, since it simply maps one Stream to another. For example:
consuming(stream.map(...));
consuming(stream.collect(mapping(...,toList())).stream());
void consuming(Stream<?> stream){...}
A Collector is more powerful because Collectors can be composed to collect the elements of a stream, whereas Stream provides only some useful, frequently used operations. For example:
stream.collect(groupingBy(
..., mapping(
..., collectingAndThen(reducing(...), ...)
)
));
Stream operations are more expressive than Collectors for simpler work, but they are slower than Collectors, since each operation creates a new stream and a Stream is heavier and more abstract than a Collector. For example:
stream.map(...).collect(collector);
stream.collect(mapping(..., collector));
A Collector can't short-circuit the way a Stream terminal operation can. For example:
stream.filter(...).findFirst();
Can anyone come up with other advantages or disadvantages that explain why Stream operations are duplicated in Collectors? I'd like to re-understand them. Thanks in advance.
Chaining a dedicated terminal stream operation might be considered more expressive by those being used to chained method calls rather than the “LISP style” of composed collector factory calls. But it also allows optimized execution strategies for the stream implementation, as it knows the actual operation instead of just seeing a Collector abstraction.
On the other hand, as you named it yourself, Collectors can be composed, allowing these operations to be performed embedded in another collector, at places where stream operations are not possible anymore. I suppose this mirroring became apparent only at a late stage of the Java 8 development, which is the reason why some operations lacked their counterpart, like filtering or flatMapping, which will be there only in Java 9. So, having two different APIs doing similar things was not a design decision made at the start of the development.
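To illustrate why an embedded collector is not simply a duplicate of the stream operation, here is a small sketch comparing Stream.filter before grouping with Collectors.filtering inside groupingBy. It assumes Java 9 or later (for Collectors.filtering); the class, method names, and sample words are made up.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class FilterPlacement {

    // Filtering the stream first: groups with no surviving element never appear.
    static Map<Integer, List<String>> filterThenGroup(List<String> words) {
        return words.stream()
                .filter(w -> w.startsWith("a"))
                .collect(Collectors.groupingBy(String::length, TreeMap::new,
                        Collectors.toList()));
    }

    // Filtering inside the collector: every key is kept, possibly with an empty list.
    static Map<Integer, List<String>> groupThenFilter(List<String> words) {
        return words.stream()
                .collect(Collectors.groupingBy(String::length, TreeMap::new,
                        Collectors.filtering(w -> w.startsWith("a"), Collectors.toList())));
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("apple", "avocado", "banana", "ant", "bee");
        System.out.println(filterThenGroup(words)); // {3=[ant], 5=[apple], 7=[avocado]}
        System.out.println(groupThenFilter(words)); // {3=[ant], 5=[apple], 6=[], 7=[avocado]}
    }
}
```

The two results differ only in the key 6, which has no matching element: the stream-level filter drops it entirely, while the downstream collector still sees the group.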
The Collectors methods that seem to duplicate Stream methods are offering additional functionality. They make sense when used in combination with other Collectors.
For example, if we consider Collectors.mapping(), the most common use is to pass it to a Collectors.groupingBy Collector.
Consider this example (taken from the Javadoc):
List<Person> people = ...
Map<City, Set<String>> namesByCity
= people.stream().collect(groupingBy(Person::getCity, TreeMap::new,
mapping(Person::getLastName, toSet())));
mapping is used here to transform the element type of the value Collection of each group from Person to String.
Without it (and the toSet() Collector) the output would be Map<City, List<Person>>.
Now, you can certainly map a Stream<Person> to a Stream<String> using people.stream().map(Person::getLastName), but then you will lose the ability to group these last names by some other property of Person (Person::getCity in this example).

Choosing between Stream and Collections API

Consider the following example that prints the maximum element in a List :
List<Integer> list = Arrays.asList(1,4,3,9,7,4,8);
list.stream().max(Comparator.naturalOrder()).ifPresent(System.out::println);
The same objective can also be achieved using the Collections.max method :
System.out.println(Collections.max(list));
The above code is not only shorter but also cleaner to read (in my opinion). There are similar examples that come to mind such as the use of binarySearch vs filter used in conjunction with findAny.
I understand that a Stream can be an infinite pipeline, as opposed to a Collection, which is limited by the memory available to the JVM. This would be my criterion for deciding whether to use a Stream or the Collections API. Are there any other reasons for choosing Stream over the Collections API (such as performance)? More generally, is this the only reason to choose Stream over an older API that can do the job in a cleaner and shorter way?
Stream API is like a Swiss Army knife: it allows you to do quite complex operations by combining the tools effectively. On the other hand if you just need a screwdriver, probably the standalone screwdriver would be more convenient. Stream API includes many things (like distinct, sorted, primitive operations etc.) which otherwise would require you to write several lines and introduce intermediate variables/data structures and boring loops drawing the programmer attention from the actual algorithm. Sometimes using the Stream API can improve the performance even for sequential code. For example, consider some old API:
class Group {
private Map<String, User> users;
public List<User> getUsers() {
return new ArrayList<>(users.values());
}
}
Here we want to return all the users of the group. The API designer decided to return a List. But it can be used outside in various ways:
List<User> users = group.getUsers();
Collections.sort(users);
someOtherMethod(users.toArray(new User[users.size()]));
Here it's sorted and converted to an array to pass to some other method which happens to accept an array. In another place getUsers() may be used like this:
List<User> users = group.getUsers();
for(User user : users) {
if(user.getAge() < 18) {
throw new IllegalStateException("Underage user in selected group!");
}
}
Here we just want to check whether any user matches some criterion. In both cases, copying to an intermediate ArrayList was actually unnecessary. When we move to Java 8, we can replace the getUsers() method with users():
public Stream<User> users() {
return users.values().stream();
}
And modify the caller code. The first one:
someOtherMethod(group.users().sorted().toArray(User[]::new));
The second one:
if(group.users().anyMatch(user -> user.getAge() < 18)) {
throw new IllegalStateException("Underage user in selected group!");
}
This way it's not only shorter, but may work faster as well, because we skip the intermediate copying.
The other conceptual point in Stream API is that any stream code written according to the guidelines can be parallelized simply by adding the parallel() step. Of course this will not always boost the performance, but it helps more often than I expected. Usually if the operation executed sequentially for 0.1ms or longer, it can benefit from the parallelization. Anyways we haven't seen such simple way to do the parallel programming in Java before.
Of course, it always depends on the circumstances. Take your initial example:
List<Integer> list = Arrays.asList(1,4,3,9,7,4,8);
list.stream().max(Comparator.naturalOrder()).ifPresent(System.out::println);
If you want to do the same thing efficiently, you would use
IntStream.of(1,4,3,9,7,4,8).max().ifPresent(System.out::println);
which doesn’t involve any auto-boxing. But if your assumption is to have a List<Integer> beforehand, that might not be an option, so if you are just interested in the max value, Collections.max might be the simpler choice.
But this would lead to the question why you have a List<Integer> beforehand. Maybe, it’s the result of old code (or new code written using old thinking), which had no other choice than using boxing and Collections as there was no alternative in the past?
So maybe you should think about the source producing the collection before bothering with how to consume it (or, well, think about both at the same time).
If all you have is a Collection and all you need is a single terminal operation for which a simple Collection-based implementation exists, you may use it directly without bothering with the Stream API. The API designers acknowledged this idea, as they added methods like forEach(…) to the Collection API instead of insisting on everyone using stream().forEach(…). And Collection.forEach(…) is not a simple shorthand for Collection.stream().forEach(…); in fact, it is already defined on the more abstract Iterable interface, which doesn't even have a stream() method.
Btw., you should understand the difference between Collections.binarySearch and Stream.filter/findAny. The former requires the collection to be sorted and if that prerequisite is met, might be the better choice. But if the collection isn’t sorted, a simple linear search is more efficient than sorting just for a single use of binary search, not to speak of the fact, that binary search works with Lists only while filter/findAny works with any stream supporting every kind of source collection.
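The difference between the two search styles might be sketched like this (class and method names are hypothetical). The binary search variant assumes the list is already sorted in ascending natural order; the linear variant makes no such assumption and accepts any Collection.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

class SearchChoices {

    // Requires a List sorted in ascending natural order; a result >= 0 means "found".
    static boolean containsSorted(List<Integer> sorted, int key) {
        return Collections.binarySearch(sorted, key) >= 0;
    }

    // Linear search: works on any collection, sorted or not.
    static Optional<Integer> findLinear(Collection<Integer> values, int key) {
        return values.stream().filter(v -> v == key).findAny();
    }

    public static void main(String[] args) {
        List<Integer> unsorted = Arrays.asList(1, 4, 3, 9, 7, 4, 8);
        List<Integer> sorted = Arrays.asList(1, 3, 4, 4, 7, 8, 9);

        System.out.println(containsSorted(sorted, 7));           // true
        System.out.println(findLinear(unsorted, 7).isPresent()); // true
        System.out.println(findLinear(unsorted, 5).isPresent()); // false
    }
}
```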

Is a Collection better than a LinkedList?

Collection list = new LinkedList(); // Good?
LinkedList list = new LinkedList(); // Bad?
First variant gives more flexibility, but is that all? Are there any other reasons to prefer it? What about performance?
These are design decisions, and one size usually doesn't fit all. Also the choice of what is used internally for the member variable can (and usually should be) different from what is exposed to the outside world.
At its heart, Java's collections framework does not provide a complete set of interfaces that describe the performance characteristics without exposing the implementation details. The one interface that describes performance, RandomAccess is a marker interface, and doesn't even extend Collection or re-expose the get(index) API. So I don't think there is a good answer.
As a rule of thumb, I keep the type as unspecific as possible until I recognize (and document) some characteristic that is important. For example, as soon as I want methods to know that insertion order is retained, I would change from Collection to List, and document why that restriction is important. Similarly, move from List to LinkedList if say efficient removal from front becomes important.
When it comes to exposing the collection in public APIs, I always try to start exposing just the few APIs that are expected to get used; for example add(...) and iterator().
Collection list = new LinkedList(); //bad
This is bad because you don't want this reference to refer to, say, a HashSet (HashSet also implements Collection, as do many other classes in the collections framework).
LinkedList list = new LinkedList(); //bad?
This is bad because good practice is to always code to the interface.
List list = new LinkedList();//good
This is good because point 2 says so (always program to an interface).
Use the most specific type information on non-public objects. They are implementation details, and we want our implementation details as specific and precise as possible.
Sure. If, for example, Java introduces a more efficient implementation of List, but you already have an API that accepts only LinkedList, you won't be able to swap the implementation once clients depend on that API. If you use the interface, you can easily replace the implementation without breaking the API.
They're absolutely equivalent. The only reason to use one over the other is that if you later want to use a function of list that only exists in the class LinkedList, you need to use the second.
My general rule is to only be as specific as you need to be at the time (or will need to be in the near future, within reason). Granted, this is somewhat subjective.
In your example I would usually declare it as a List, just because the methods available on Collection aren't very powerful, and the distinction between a List and another Collection (Set, Queue, etc.) is often logically significant.
Also, in Java 1.5+ don't use raw types -- if you don't know the type that your list will contain, at least use List<?>.
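A short sketch of these points, with invented names: the method depends only on the List contract, so the caller can switch implementations freely, and List<?> stands in where the element type is unknown.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

class DeclaredTypeDemo {

    // Depends only on the List interface, so any implementation can be passed in.
    static String firstOrDefault(List<String> items, String fallback) {
        return items.isEmpty() ? fallback : items.get(0);
    }

    // A wildcard instead of a raw type when the element type is unknown.
    static int sizeOfUnknown(List<?> items) {
        return items.size();
    }

    public static void main(String[] args) {
        List<String> names = new LinkedList<>();
        names.add("x");
        System.out.println(firstOrDefault(names, "none")); // x

        // Swapping the implementation requires no change to the method above.
        names = new ArrayList<>(names);
        System.out.println(firstOrDefault(names, "none")); // x
        System.out.println(sizeOfUnknown(names));          // 1
    }
}
```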

How to create a Scala parallel collection from a Java collection

The easiest way to convert a Java Collection to a Scala equivalent is to use JavaConversions, available since Scala 2.8. These implicit defs return wrappers for the contained Java Collection.
Scala 2.9 introduced parallel collections, where operations on a collection can be executed in parallel and the result collected later. This is easily implemented, converting an existing collection into a parallel one is as simple as:
myCollection.par
But there's a problem with using 'par' on collections converted from Java collections using JavaConversions. As described in Parallel Collection Conversions, inherently sequential collections are 'forced' into a new parallel collection by evaluating all of the values and adding them to the new parallel collection:
Other collections, such as lists, queues or streams, are inherently
sequential in the sense that the elements must be accessed one after
the other. These collections are converted to their parallel variants
by copying the elements into a similar parallel collection. For
example, a functional list is converted into a standard immutable
parallel sequence, which is a parallel vector.
This causes problems when the original Java collection is intended to be lazily evaluated. For instance, if only a Java Iterable is returned, later converted to a Scala Iterable, there's no guarantee that the contents of the Iterable are intended to be accessed eagerly or not. So how should a parallel collection be created from a Java collection without sustaining the cost of evaluating each element? It is this cost I am trying to avoid by using a parallel collection to execute them in parallel and hopefully 'take' the first n results that are offered.
According to Parallel Collection Conversions there are a series of collection types that cost constant time, but there doesn't appear to be a way of getting a guarantee that these types can be created by JavaConversions (e.g. 'Set' can be created, but is that a 'HashSet'?).
First, every collection obtained via JavaConversions from a Java collection is not a by-default parallelizable Scala collection - this means that it will always be reevaluated into its corresponding parallel collection implementation. The reason for this is that parallel execution relies on the concepts of Splitters at least - it has to be splittable into smaller subsets that different processors can then work on.
I don't know how your Java collection looks in the data-structure sense, but if it's a tree-like thing or an array underneath whose elements are evaluated lazily, chances are that you can easily implement a Splitter.
If you do not want to eagerly force a lazy collection that implements a Java collection API, then your only option is to implement a new type of a parallel collection for that particular lazy Java collection. In this new implementation you have to provide means of splitting the iterator (that is, a Splitter).
Once you implement this new parallel collection which knows how to split your data-structure, you should create a custom Scala wrapper for your specific Java collection (at this point it's just a little bit of extra boilerplate, see how it's done in JavaConversions) and override its par to return your specific parallel collection.
You might even be able to do this generically for indexed sequences. Given that your Java collection is a sequence (in Java, a List) with a particularly efficient get method, you could implement the Splitter as an iterator that calls get within the initial range from 0 to size - 1, and is split by subdividing this range.
If you do, patches to the Standard library are always welcome.
Parallel requires random access and java.lang.Iterable doesn't provide it. This is a fundamental mismatch that no amount of conversions will comfortably get you past.
To use a non-programming analogy, you cannot get a person from Australia to England by sending one person from Singapore to England and another from Australia to Singapore at the same time.
Or in programming if you're processing a live stream of data you cannot parallelise it by processing the data from now at the same time as the data from five minutes ago without adding latency.
You would need something that provides at least some random access, like java.util.List.listIterator(int) instead of Iterable.
