Why are Stream operations duplicated by Collectors? - java

Please bear with me; this may be tedious, but I want to explain why this question came up.
Last night I answered several questions here, here and here in a way that differed from the other answers.
After digging into it, I found a lot of duplicated logic between Stream and Collector that violates the Don't Repeat Yourself principle, e.g. Stream#map and Collectors#mapping, Stream#filter and Collectors#filtering (in JDK 9), etc.
But it seems reasonable, since Stream abides by the Tell, Don't Ask principle/Law of Demeter and Collector abides by the Composition over Inheritance principle.
I can only think of a few reasons why Stream operations are duplicated by Collectors:
We don't care how the Stream was created in a larger context. In this case a Stream operation is more efficient and faster than a Collector, since it can simply map one Stream to another, for example:
consuming(stream.map(...));
consuming(stream.collect(mapping(...,toList())).stream());
void consuming(Stream<?> stream){...}
A Collector is more powerful because Collectors can be composed to collect the elements of a stream, whereas Stream only provides some useful, frequently used operations. For example:
stream.collect(groupingBy(
..., mapping(
..., collectingAndThen(reducing(...), ...)
)
));
Stream operations are more expressive than Collectors for simpler tasks, but they are slower than Collectors, since each operation creates a new stream and a Stream is heavier and more abstract than a Collector. For example:
stream.map(...).collect(collector);
stream.collect(mapping(..., collector));
A Collector can't apply a short-circuiting terminal operation the way a Stream can. For example:
stream.filter(...).findFirst();
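To make the comparisons above concrete, here is a minimal sketch. The class and method names (MapVersusMapping, viaStreamMap, and so on) are invented for illustration. It shows that Stream#map followed by collect and Collectors#mapping yield the same result, and that a short-circuiting search is only expressible as a stream operation.

```java
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

class MapVersusMapping {
    // Intermediate operation: transform first, then collect.
    static List<Integer> viaStreamMap(List<String> words) {
        return words.stream().map(String::length).collect(Collectors.toList());
    }

    // Collector adapter: the transformation happens inside the collector.
    static List<Integer> viaMappingCollector(List<String> words) {
        return words.stream().collect(Collectors.mapping(String::length, Collectors.toList()));
    }

    // Short-circuiting: findFirst() can stop as soon as a match is found.
    static Optional<String> firstLongerThan(List<String> words, int minLength) {
        return words.stream().filter(w -> w.length() > minLength).findFirst();
    }

    public static void main(String[] args) {
        List<String> words = List.of("a", "bb", "ccc");
        System.out.println(viaStreamMap(words));
        System.out.println(viaMappingCollector(words));
        System.out.println(firstLongerThan(words, 1));
    }
}
```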
Can anyone come up with other advantages or disadvantages that explain why Stream operations are duplicated by Collectors? I'd like to understand them better. Thanks in advance.

Chaining a dedicated terminal stream operation might be considered more expressive by those who are used to chained method calls rather than the “LISP style” of composed collector factory calls. But it also allows optimized execution strategies for the stream implementation, as it knows the actual operation instead of just seeing a Collector abstraction.
On the other hand, as you noted yourself, Collectors can be composed, which allows these operations to be performed inside another collector, at places where stream operations are not possible anymore. I suppose this mirroring became apparent only at a late stage of the Java 8 development, which is the reason why some operations lacked their counterpart, like filtering or flatMapping, which will be there only in Java 9. So having two different APIs doing similar things was not a design decision made at the start of the development.
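One place where the collector form is not merely a duplicate is inside groupingBy. The sketch below assumes Java 9 or later for Collectors.filtering; the class and method names are made up for illustration. Filtering before grouping drops groups that end up empty, while filtering as a downstream collector keeps the key with an empty list.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class FilteringDemo {
    // filter() before grouping: a key whose elements are all rejected never appears.
    static Map<Integer, List<String>> filterThenGroup(List<String> words) {
        return words.stream()
                .filter(w -> w.startsWith("a"))
                .collect(Collectors.groupingBy(String::length));
    }

    // filtering() as downstream collector: every key is kept, possibly with an empty list.
    static Map<Integer, List<String>> groupThenFilter(List<String> words) {
        return words.stream()
                .collect(Collectors.groupingBy(String::length,
                        Collectors.filtering(w -> w.startsWith("a"), Collectors.toList())));
    }

    public static void main(String[] args) {
        List<String> words = List.of("ant", "bee", "bison");
        System.out.println(filterThenGroup(words)); // only length 3 is present
        System.out.println(groupThenFilter(words)); // length 5 is present with an empty list
    }
}
```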

The Collectors methods that seem to duplicate Stream methods are offering additional functionality. They make sense when used in combination with other Collectors.
For example, if we consider Collectors.mapping(), the most common use is to pass it to a Collectors.groupingBy Collector.
Consider this example (taken from the Javadoc):
List<Person> people = ...
Map<City, Set<String>> namesByCity
= people.stream().collect(groupingBy(Person::getCity, TreeMap::new,
mapping(Person::getLastName, toSet())));
mapping is used here to transform the element type of the value Collection of each group from Person to String.
Without it (and the toSet() Collector) the output would be Map<City, List<Person>>.
Now, you can certainly map a Stream<Person> to a Stream<String> using people.stream().map(Person::getLastName), but then you will lose the ability to group these last names by some other property of Person (Person::getCity in this example).
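A self-contained version of the Javadoc example might look like the following sketch. The Person class here is a hypothetical stand-in, and City is simplified to a String so the snippet compiles on its own.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.stream.Collectors;

class NamesByCity {
    // Hypothetical stand-in for the Javadoc's Person type; City is simplified to a String.
    static final class Person {
        private final String lastName;
        private final String city;

        Person(String lastName, String city) {
            this.lastName = lastName;
            this.city = city;
        }

        String getLastName() { return lastName; }
        String getCity() { return city; }
    }

    // Group by city first, then map each group's Person elements to last names.
    static Map<String, Set<String>> namesByCity(List<Person> people) {
        return people.stream().collect(Collectors.groupingBy(
                Person::getCity, TreeMap::new,
                Collectors.mapping(Person::getLastName, Collectors.toSet())));
    }

    public static void main(String[] args) {
        List<Person> people = List.of(
                new Person("Smith", "Oslo"),
                new Person("Jones", "Oslo"),
                new Person("Lee", "Rome"));
        System.out.println(namesByCity(people));
    }
}
```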

Related

Java Stream API - is it possible to create Collector that is collecting result to a new stream?

So, the question is pretty self-explanatory: is there a way to create a Collector that collects the stream passed to it into a new stream?
I am aware that it can be done with some tricks like:
Collectors.collectingAndThen(Collectors.toList(), List::stream);
But this code allocates redundant memory.
Explanation of why I need this in the first place: sometimes I want to pass something to Collectors.groupingBy and then perform a stream operation on the downstream result, without collecting it an additional time.
Is there a simple way to do it (without writing my own class implementing the collector interface)?
EDIT: This question has been marked as a duplicate of this question, but that is not what I'm looking for. I do not want to duplicate the original stream, I want to close it and produce a new stream consisting of the same elements in the same order, by the means of Collector and without allocating memory in between.
The collect method will accept a collector.
A collector consists of the following parts:
Supplier - For a stream, it could be Stream.empty() for example
Accumulator - For streams, we can use Stream.concat()
Combiner - This is a BinaryOperator that merges two partial results, with the first argument serving as the reference that receives the second. For lists, for example, you get l1 and l2 and usually do l1.addAll(l2)
So, besides being very complicated, the overhead in terms of memory allocation will be bigger than just collecting everything in a collection first.
Everything is possible; however, you'll probably need to write your own collector that will suit your needs.
You can even rewrite the same exact collector as groupingBy, but is it really worth it?
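For illustration only, here is one way such a collector could be sketched with Collector.of and Stream.Builder instead of repeated Stream.concat calls. The names StreamCollector, toStream and upperCased are hypothetical. Note that Stream.Builder still buffers the elements internally, so this does not avoid the intermediate allocation the question hoped to eliminate.

```java
import java.util.List;
import java.util.stream.Collector;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class StreamCollector {
    // Supplier, accumulator, combiner and finisher expressed via Stream.Builder.
    static <T> Collector<T, Stream.Builder<T>, Stream<T>> toStream() {
        return Collector.of(
                () -> Stream.<T>builder(),
                (builder, element) -> builder.accept(element),
                (left, right) -> {
                    // Drain the right builder into the left one (Builder is a Consumer).
                    right.build().forEach(left);
                    return left;
                },
                builder -> builder.build());
    }

    // Usage: collect into a new stream, then keep working with stream operations.
    static List<String> upperCased(List<String> input) {
        return input.stream()
                .collect(StreamCollector.<String>toStream())
                .map(String::toUpperCase)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(upperCased(List.of("a", "b", "c")));
    }
}
```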

Choosing between Stream and Collections API

Consider the following example that prints the maximum element in a List :
List<Integer> list = Arrays.asList(1,4,3,9,7,4,8);
list.stream().max(Comparator.naturalOrder()).ifPresent(System.out::println);
The same objective can also be achieved using the Collections.max method :
System.out.println(Collections.max(list));
The above code is not only shorter but also cleaner to read (in my opinion). There are similar examples that come to mind such as the use of binarySearch vs filter used in conjunction with findAny.
I understand that Stream can be an infinite pipeline as opposed to a Collection that is limited by the memory available to the JVM. This would be my criterion for deciding whether to use a Stream or the Collections API. Are there any other reasons for choosing Stream over the Collections API (such as performance)? More generally, is this the only reason to choose Stream over an older API that can do the job in a cleaner and shorter way?
Stream API is like a Swiss Army knife: it allows you to do quite complex operations by combining the tools effectively. On the other hand, if you just need a screwdriver, the standalone screwdriver would probably be more convenient. Stream API includes many things (like distinct, sorted, primitive operations etc.) which otherwise would require you to write several lines and introduce intermediate variables/data structures and boring loops, drawing the programmer's attention away from the actual algorithm. Sometimes using the Stream API can improve the performance even for sequential code. For example, consider some old API:
class Group {
private Map<String, User> users;
public List<User> getUsers() {
return new ArrayList<>(users.values());
}
}
Here we want to return all the users of the group. The API designer decided to return a List. But it can be used outside in various ways:
List<User> users = group.getUsers();
Collections.sort(users);
someOtherMethod(users.toArray(new User[users.size()]));
Here it's sorted and converted to an array to pass to some other method which happens to accept an array. In another place getUsers() may be used like this:
List<User> users = group.getUsers();
for(User user : users) {
if(user.getAge() < 18) {
throw new IllegalStateException("Underage user in selected group!");
}
}
Here we just want to check whether any user matches some criterion. In both cases copying to an intermediate ArrayList was actually unnecessary. When we move to Java 8, we can replace the getUsers() method with users():
public Stream<User> users() {
return users.values().stream();
}
And modify the caller code. The first one:
someOtherMethod(group.users().sorted().toArray(User[]::new));
The second one:
if(group.users().anyMatch(user -> user.getAge() < 18)) {
throw new IllegalStateException("Underage user in selected group!");
}
This way it's not only shorter, but may work faster as well, because we skip the intermediate copying.
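Put together, the refactoring could be sketched as follows. The User class is hypothetical: the original only implies getAge() and a natural ordering (needed for Collections.sort and sorted()), so ordering by name is an assumption made for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Stream;

class GroupDemo {
    // Hypothetical User; natural ordering by name is an assumption.
    static final class User implements Comparable<User> {
        final String name;
        final int age;

        User(String name, int age) {
            this.name = name;
            this.age = age;
        }

        int getAge() { return age; }

        @Override
        public int compareTo(User other) { return name.compareTo(other.name); }
    }

    static final class Group {
        private final Map<String, User> users = new LinkedHashMap<>();

        void add(User user) { users.put(user.name, user); }

        // Exposes the values as a stream; no intermediate ArrayList copy is made.
        Stream<User> users() { return users.values().stream(); }
    }

    // First caller: sort and convert to an array in one pipeline.
    static User[] sortedUsers(Group group) {
        return group.users().sorted().toArray(User[]::new);
    }

    // Second caller: anyMatch may stop at the first underage user it encounters.
    static boolean hasUnderageUser(Group group) {
        return group.users().anyMatch(user -> user.getAge() < 18);
    }
}
```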
The other conceptual point in Stream API is that any stream code written according to the guidelines can be parallelized simply by adding the parallel() step. Of course this will not always boost the performance, but it helps more often than I expected. Usually, if the operation takes 0.1 ms or longer when executed sequentially, it can benefit from parallelization. Anyway, we haven't seen such a simple way to do parallel programming in Java before.
Of course, it always depends on the circumstances. Take your initial example:
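As a small illustration of the parallel() step, the sketch below (class and method names are invented) runs the same pipeline sequentially and in parallel. Both produce the same sum because the operations are stateless and the reduction is associative; whether the parallel variant is actually faster depends on the workload and is not measured here.

```java
import java.util.stream.LongStream;

class ParallelSum {
    // Same pipeline in both modes; only the parallel() step differs.
    static long sumOfSquares(long n, boolean parallel) {
        LongStream numbers = LongStream.rangeClosed(1, n);
        if (parallel) {
            numbers = numbers.parallel();
        }
        return numbers.map(x -> x * x).sum();
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(1_000, false));
        System.out.println(sumOfSquares(1_000, true));
    }
}
```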
List<Integer> list = Arrays.asList(1,4,3,9,7,4,8);
list.stream().max(Comparator.naturalOrder()).ifPresent(System.out::println);
If you want to do the same thing efficiently, you would use
IntStream.of(1,4,3,9,7,4,8).max().ifPresent(System.out::println);
which doesn’t involve any auto-boxing. But if your assumption is to have a List<Integer> beforehand, that might not be an option, so if you are just interested in the max value, Collections.max might be the simpler choice.
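The options can be compared side by side in a short sketch (PrimitiveMax and its methods are illustrative names). If the values already live in a List<Integer>, mapToInt converts the pipeline to an IntStream so that the maximum itself is computed on primitives.

```java
import java.util.List;
import java.util.OptionalInt;
import java.util.stream.IntStream;

class PrimitiveMax {
    // No boxing at all: the values start out as primitives.
    static OptionalInt maxOfInts(int... values) {
        return IntStream.of(values).max();
    }

    // The list is already boxed; each element is unboxed once and then compared as an int.
    static OptionalInt maxOfBoxed(List<Integer> values) {
        return values.stream().mapToInt(Integer::intValue).max();
    }

    public static void main(String[] args) {
        maxOfInts(1, 4, 3, 9, 7, 4, 8).ifPresent(System.out::println);
        maxOfBoxed(List.of(1, 4, 3, 9, 7, 4, 8)).ifPresent(System.out::println);
    }
}
```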
But this would lead to the question why you have a List<Integer> beforehand. Maybe, it’s the result of old code (or new code written using old thinking), which had no other choice than using boxing and Collections as there was no alternative in the past?
So maybe you should think about the source producing the collection before bothering with how to consume it (or, well, think about both at the same time).
If all you have is a Collection and all you need is a single terminal operation for which a simple Collection based implementation exists, you may use it directly without bothering with the Stream API. The API designers acknowledged this idea as they added methods like forEach(…) to the Collection API instead of insisting on everyone using stream().forEach(…). And Collection.forEach(…) is not a simple short-hand for Collection.stream().forEach(…); in fact, it’s already defined on the more abstract Iterable interface, which doesn’t even have a stream() method.
Btw., you should understand the difference between Collections.binarySearch and Stream.filter/findAny. The former requires the collection to be sorted and, if that prerequisite is met, might be the better choice. But if the collection isn’t sorted, a simple linear search is more efficient than sorting just for a single use of binary search, not to mention that binary search works with Lists only, while filter/findAny works with any stream, supporting every kind of source collection.
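The difference can be sketched like this (the class and method names are made up for the example). The binary search variant is only correct when the list is sorted in ascending order, whereas the stream variant performs a linear search over any collection.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Set;

class SearchDemo {
    // Requires a List sorted in ascending natural order; O(log n) for random-access lists.
    static boolean containsSorted(List<Integer> sorted, int key) {
        return Collections.binarySearch(sorted, key) >= 0;
    }

    // Works for any collection, sorted or not; linear search that may stop at the first match.
    static boolean containsAnywhere(Collection<Integer> values, int key) {
        return values.stream().filter(v -> v == key).findAny().isPresent();
    }

    public static void main(String[] args) {
        System.out.println(containsSorted(List.of(1, 3, 4, 7, 8, 9), 7));
        System.out.println(containsAnywhere(Set.of(8, 1, 9), 9));
    }
}
```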

Java 8 forEach use cases

Let's say you have a collection with some strings and you want to return the first two characters of each string (or some other manipulation...).
In Java 8 for this case you can use either the map or the forEach methods on the stream() which you get from the collection (maybe something else but that is not important right now).
Personally I would use the map primarily because I associate forEach with mutating the collection and I want to avoid this. I also created a really small test regarding the performance but could not see any improvements when using forEach (I perfectly understand that small tests cannot give reliable results but still).
So what are the use-cases where one should choose forEach?
map is the better choice for this, because you're not trying to do anything with the strings yet, just map them to different strings.
forEach is designed to be the "final operation." As such, it doesn't return anything, and is all about mutating some state -- though not necessarily that of the original collection. For instance, you might use it to write elements to a file, having used other constructs (including map) to get those elements.
forEach terminates the stream and is executed for the side effect of the supplied Consumer. It does not necessarily mutate the stream members.
map maps each stream element to a different value/object using a provided Function. A Stream<R> is returned on which more steps can act.
The forEach terminal operation might be useful in several cases: when you want to collect into some older class for which you don't have a proper collector, or when you don't want to collect at all but send your data somewhere outside (write into the database, print into an OutputStream, etc.). There are many cases when the best way is to use both map (as intermediate operation) and forEach (as terminal operation).
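A minimal sketch of that combination, using the "first two characters" task from the question (Prefixes and its methods are illustrative names, and the StringBuilder stands in for an external sink such as a file or database):

```java
import java.util.List;
import java.util.stream.Collectors;

class Prefixes {
    // Pure transformation: at most the first two characters of a string.
    static String prefix(String word) {
        return word.substring(0, Math.min(2, word.length()));
    }

    // map as intermediate operation, then collect: the source list is not modified.
    static List<String> prefixes(List<String> words) {
        return words.stream().map(Prefixes::prefix).collect(Collectors.toList());
    }

    // map as intermediate operation, forEach as terminal operation with a side effect.
    static void writePrefixes(List<String> words, StringBuilder sink) {
        words.stream().map(Prefixes::prefix).forEach(p -> sink.append(p).append('\n'));
    }

    public static void main(String[] args) {
        List<String> words = List.of("hello", "a", "stream");
        System.out.println(prefixes(words));
        StringBuilder sink = new StringBuilder();
        writePrefixes(words, sink);
        System.out.print(sink);
    }
}
```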

Why doesn't Collection<T> Implement Stream<T>? [duplicate]

This is a question about API design. When extension methods were added in C#, IEnumerable got all the methods that enabled using lambda expressions directly on all collections.
With the advent of lambdas and default methods in Java, I would expect that Collection would implement Stream and provide default implementations for all its methods. This way, we would not need to call stream() in order to leverage the power it provides.
What is the reason the library architects opted for the less convenient approach?
From Maurice Naftalin's Lambda FAQ:
Why are Stream operations not defined directly on Collection?
Early drafts of the API exposed methods like filter, map, and reduce on Collection or Iterable. However, user experience with this design led to a more formal separation of the “stream” methods into their own abstraction. Reasons included:
Methods on Collection such as removeAll make in-place modifications, in contrast to the new methods which are more functional in nature. Mixing two different kinds of methods on the same abstraction forces the user to keep track of which are which. For example, given the declaration
Collection<String> strings;
the two very similar-looking method calls
strings.removeAll(s -> s.length() == 0);
strings.filter(s -> s.length() == 0); // not supported in the current API
would have surprisingly different results; the first would remove all empty String objects from the collection, whereas the second would return a stream containing all the non-empty Strings, while having no effect on the collection.
Instead, the current design ensures that only an explicitly-obtained stream can be filtered:
strings.stream().filter(s -> s.length() == 0)...;
where the ellipsis represents further stream operations, ending with a terminating operation. This gives the reader a much clearer intuition about the action of filter;
With lazy methods added to Collection, users were confused by a perceived—but erroneous—need to reason about whether the collection was in “lazy mode” or “eager mode”. Rather than burdening Collection with new and different functionality, it is cleaner to provide a Stream view with the new functionality;
The more methods added to Collection, the greater the chance of name collisions with existing third-party implementations. By only adding a few methods (stream, parallel) the chance for conflict is greatly reduced;
A view transformation is still needed to access a parallel view; the asymmetry between the sequential and the parallel stream views was unnatural. Compare, for example
coll.filter(...).map(...).reduce(...);
with
coll.parallel().filter(...).map(...).reduce(...);
This asymmetry would be particularly obvious in the API documentation, where Collection would have many new methods to produce sequential streams, but only one to produce parallel streams, which would then have all the same methods as Collection. Factoring these into a separate interface, StreamOps say, would not help; that would still, counterintuitively, need to be implemented by both Stream and Collection;
A uniform treatment of views also leaves room for other additional views in the future.
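The first reason in the list above can be observed with the API as it was eventually released. This sketch assumes Collection.removeIf as the predicate-based in-place removal method in Java 8; the class and method names are invented. The in-place call changes the collection, while the stream pipeline leaves it untouched and produces a new result.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.stream.Collectors;

class InPlaceVersusView {
    // Mutates the collection itself: empty strings are removed from it.
    static void removeEmptyInPlace(Collection<String> strings) {
        strings.removeIf(String::isEmpty);
    }

    // Leaves the source untouched; filter keeps the elements that match the predicate.
    static List<String> nonEmptyCopy(Collection<String> strings) {
        return strings.stream().filter(s -> !s.isEmpty()).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> strings = new ArrayList<>(List.of("a", "", "b"));
        List<String> copy = nonEmptyCopy(strings);
        System.out.println(copy + " / source size " + strings.size());
        removeEmptyInPlace(strings);
        System.out.println(strings);
    }
}
```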
A Collection is an object model
A Stream is a subject model
Collection definition in doc :
A collection represents a group of objects, known as its elements.
Stream definition in doc :
A sequence of elements supporting sequential and parallel aggregate operations
Seen this way, a stream is a specific collection, not the other way around. Thus Collection should not implement Stream, regardless of backward compatibility.
So why doesn't Stream<T> implement Collection<T>? Because it is another way of looking at a bunch of objects: not as a group of elements, but by the operations you can perform on it. This is why I say a Collection is an object model while a Stream is a subject model.
First, from the documentation of Stream:
Collections and streams, while bearing some superficial similarities, have different goals. Collections are primarily concerned with the efficient management of, and access to, their elements. By contrast, streams do not provide a means to directly access or manipulate their elements, and are instead concerned with declaratively describing their source and the computational operations which will be performed in aggregate on that source.
So you want to keep the concepts of stream and collection apart. If Collection implemented Stream, every collection would be a stream, which it conceptually is not. The way it is done now, every collection can give you a stream which works on that collection, which is something different if you think about it.
Another factor that comes to mind is cohesion/coupling as well as encapsulation. If every class that implements Collection had to implement the operations of Stream as well, it would have two (kind of) different purposes and might become too long.
My guess would be that it was made that way to avoid breakage with existing code that implements Collection. It would be hard to provide a default implementation that worked correctly with all existing implementations.

Does a sequential stream in Java 8 use the combiner parameter on calling collect?

If I call collect on a sequential stream (eg. from calling Collection.stream()) then will it use the combiner parameter I pass to collect? I presume not but I see nothing in the documentation. If I'm correct, then it seems unfortunate to have to supply something that I know will not be used (if I know it is a sequential stream).
Keep in mind to develop against interface specifications -- not against the implementation. The implementation might change with the next Java version, whereas the specification should remain stable.
The specification does not differentiate between sequential and parallel streams. For that reason, you should assume that the combiner might be used. Actually, there are good examples showing that combiners for sequential streams can improve the performance. For example, the following reduce operation concatenates a list of strings. Executing the code without a combiner has quadratic complexity. A smart execution with a combiner can reduce the runtime by orders of magnitude.
List<String> tokens = ...;
String result = tokens.stream().reduce("", String::concat, String::concat);
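Since the question was about collect, here is a hedged sketch of the same idea with the three-argument collect (CombinerDemo and concat are illustrative names). Supplying a correct combiner means the code gives the same result whether or not a particular execution chooses to use it, so it does not depend on how a given implementation handles sequential streams.

```java
import java.util.List;
import java.util.stream.Stream;

class CombinerDemo {
    // Supplier, accumulator and combiner; the combiner merges two partial StringBuilders.
    static String concat(Stream<String> tokens) {
        return tokens.collect(StringBuilder::new, StringBuilder::append, StringBuilder::append)
                .toString();
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("a", "b", "c", "d");
        System.out.println(concat(tokens.stream()));
        System.out.println(concat(tokens.parallelStream()));
    }
}
```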
