Consider the following example that prints the maximum element in a List:
List<Integer> list = Arrays.asList(1,4,3,9,7,4,8);
list.stream().max(Comparator.naturalOrder()).ifPresent(System.out::println);
The same objective can also be achieved using the Collections.max method:
System.out.println(Collections.max(list));
The above code is not only shorter but also cleaner to read (in my opinion). There are similar examples that come to mind such as the use of binarySearch vs filter used in conjunction with findAny.
I understand that a Stream can be an infinite pipeline, as opposed to a Collection, which is limited by the memory available to the JVM. This would be my criterion for deciding whether to use a Stream or the Collections API. Are there any other reasons for choosing Stream over the Collections API (such as performance)? More generally, is this the only reason to choose Stream over an older API that can do the job in a cleaner and shorter way?
The Stream API is like a Swiss Army knife: it lets you do quite complex operations by combining the tools effectively. On the other hand, if you just need a screwdriver, a standalone screwdriver is probably more convenient. The Stream API includes many things (like distinct, sorted, primitive operations etc.) which would otherwise require you to write several lines and introduce intermediate variables/data structures and boring loops, drawing the programmer's attention away from the actual algorithm. Sometimes using the Stream API can improve performance even for sequential code. For example, consider some old API:
class Group {
private Map<String, User> users;
public List<User> getUsers() {
return new ArrayList<>(users.values());
}
}
Here we want to return all the users of the group. The API designer decided to return a List. But it can be used outside in various ways:
List<User> users = group.getUsers();
Collections.sort(users);
someOtherMethod(users.toArray(new User[users.size()]));
Here it's sorted and converted to an array to pass to some other method which happens to accept an array. In another place getUsers() may be used like this:
List<User> users = group.getUsers();
for(User user : users) {
if(user.getAge() < 18) {
throw new IllegalStateException("Underage user in selected group!");
}
}
Here we just want to find out whether any user matches some criterion. In both cases copying to an intermediate ArrayList was actually unnecessary. When we move to Java 8, we can replace the getUsers() method with users():
public Stream<User> users() {
return users.values().stream();
}
And modify the caller code. The first one:
someOtherMethod(group.users().sorted().toArray(User[]::new));
The second one:
if(group.users().anyMatch(user -> user.getAge() < 18)) {
throw new IllegalStateException("Underage user in selected group!");
}
This way it's not only shorter, but may work faster as well, because we skip the intermediate copying.
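To try the two call sites end to end, here is a minimal self-contained sketch. The User class, its natural ordering by name, and the sample data are assumptions of mine; the snippets above don't show them.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Stream;

class GroupDemo {
    // Hypothetical User type; the snippets above do not show its definition.
    static final class User implements Comparable<User> {
        private final String name;
        private final int age;

        User(String name, int age) {
            this.name = name;
            this.age = age;
        }

        String getName() { return name; }
        int getAge() { return age; }

        // Natural order by name, so sorted() works without an explicit comparator.
        @Override
        public int compareTo(User other) {
            return name.compareTo(other.name);
        }
    }

    static final class Group {
        private final Map<String, User> users = new LinkedHashMap<>();

        void add(User user) {
            users.put(user.getName(), user);
        }

        // Streams straight over the map's values; no intermediate ArrayList.
        Stream<User> users() {
            return users.values().stream();
        }
    }

    // Sample data, made up for the demo.
    static Group sampleGroup() {
        Group group = new Group();
        group.add(new User("carol", 34));
        group.add(new User("alice", 17));
        group.add(new User("bob", 25));
        return group;
    }

    // First call site: sort and convert to an array.
    static String firstNameSorted(Group group) {
        User[] sorted = group.users().sorted().toArray(User[]::new);
        return sorted[0].getName();
    }

    // Second call site: short-circuits on the first match.
    static boolean hasUnderageUser(Group group) {
        return group.users().anyMatch(user -> user.getAge() < 18);
    }

    public static void main(String[] args) {
        Group group = sampleGroup();
        System.out.println(firstNameSorted(group)); // alice
        System.out.println(hasUnderageUser(group)); // true
    }
}
```

Note that each call to users() creates a fresh stream, which matters because a stream can only be consumed once.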
Another conceptual point of the Stream API is that any stream code written according to the guidelines can be parallelized simply by adding the parallel() step. Of course this will not always boost performance, but it helps more often than I expected. Usually, if the operation takes 0.1 ms or longer when executed sequentially, it can benefit from parallelization. In any case, we haven't seen such a simple way to do parallel programming in Java before.
Of course, it always depends on the circumstances. Take your initial example:
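As a rough sketch of what "adding the parallel() step" looks like, the following counts primes with a deliberately naive test. The class name, method names and workload are made up for illustration; whether the parallel variant is actually faster depends on the machine and the input size, so measure before relying on it.

```java
import java.util.stream.LongStream;

class ParallelDemo {
    // Deliberately naive trial division, so that each element costs some CPU time.
    static boolean isPrime(long n) {
        if (n < 2) {
            return false;
        }
        for (long d = 2; d * d <= n; d++) {
            if (n % d == 0) {
                return false;
            }
        }
        return true;
    }

    // The pipeline is identical in both modes; only the parallel() step differs.
    static long countPrimes(long limit, boolean parallel) {
        LongStream candidates = LongStream.range(0, limit);
        if (parallel) {
            candidates = candidates.parallel();
        }
        return candidates.filter(ParallelDemo::isPrime).count();
    }

    public static void main(String[] args) {
        // Both calls produce the same count; only the execution strategy changes.
        System.out.println(countPrimes(100_000, false));
        System.out.println(countPrimes(100_000, true));
    }
}
```

This works because the filter is stateless and has no side effects, which is what "written according to the guidelines" boils down to here.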
List<Integer> list = Arrays.asList(1,4,3,9,7,4,8);
list.stream().max(Comparator.naturalOrder()).ifPresent(System.out::println);
If you want to do the same thing efficiently, you would use
IntStream.of(1,4,3,9,7,4,8).max().ifPresent(System.out::println);
which doesn’t involve any auto-boxing. But if your assumption is to have a List<Integer> beforehand, that might not be an option, so if you are just interested in the max value, Collections.max might be the simpler choice.
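For comparison, here are the three variants side by side as a small sketch. The class and method names are mine; the calls themselves are the ones discussed above. The main behavioural difference is how an empty input is reported.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.stream.IntStream;

class MaxDemo {
    // Stream over boxed Integers: max() returns an Optional, so the empty case is explicit.
    static int maxViaStream(List<Integer> list) {
        return list.stream()
                .max(Comparator.naturalOrder())
                .orElseThrow(IllegalArgumentException::new);
    }

    // Collections.max throws NoSuchElementException for an empty collection.
    static int maxViaCollections(List<Integer> list) {
        return Collections.max(list);
    }

    // Primitive stream: works on int values directly, without boxing.
    static int maxViaIntStream(int... values) {
        return IntStream.of(values)
                .max()
                .orElseThrow(IllegalArgumentException::new);
    }

    public static void main(String[] args) {
        List<Integer> list = Arrays.asList(1, 4, 3, 9, 7, 4, 8);
        System.out.println(maxViaStream(list));                     // 9
        System.out.println(maxViaCollections(list));                // 9
        System.out.println(maxViaIntStream(1, 4, 3, 9, 7, 4, 8));   // 9
    }
}
```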
But this leads to the question of why you have a List<Integer> in the first place. Maybe it’s the result of old code (or new code written using old thinking), which had no choice but to use boxing and Collections because there was no alternative in the past?
So maybe you should think about the source producing the collection before bothering with how to consume it (or, well, think about both at the same time).
If all you have is a Collection and all you need is a single terminal operation for which a simple Collection-based implementation exists, you may use it directly without bothering with the Stream API. The API designers acknowledged this idea, as they added methods like forEach(…) to the Collection API instead of insisting on everyone using stream().forEach(…). And Collection.forEach(…) is not a simple shorthand for Collection.stream().forEach(…); in fact, it’s already defined on the more abstract Iterable interface, which doesn’t even have a stream() method.
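To illustrate the last point, here is a sketch with a hand-written Iterable that is not a Collection. The countdown source is a made-up example; forEach(…) is available on it as a default method even though there is no stream() to call.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

class ForEachDemo {
    // Hypothetical Iterable that is not a Collection: it yields from, from-1, ..., 1.
    static Iterable<Integer> countdown(int from) {
        return () -> new Iterator<Integer>() {
            private int next = from;

            @Override
            public boolean hasNext() {
                return next > 0;
            }

            @Override
            public Integer next() {
                if (next <= 0) {
                    throw new NoSuchElementException();
                }
                return next--;
            }
        };
    }

    // Iterable.forEach is a default method, so it works without any stream.
    static List<Integer> collect(Iterable<Integer> source) {
        List<Integer> seen = new ArrayList<>();
        source.forEach(seen::add);
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(collect(countdown(3))); // [3, 2, 1]
    }
}
```

If a stream over such a source is really needed, StreamSupport.stream(iterable.spliterator(), false) is the usual bridge.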
By the way, you should understand the difference between Collections.binarySearch and Stream.filter/findAny. The former requires the collection to be sorted and, if that prerequisite is met, might be the better choice. But if the collection isn’t sorted, a simple linear search is more efficient than sorting just for a single use of binary search, not to mention that binary search works with Lists only, while filter/findAny works with any stream and therefore with every kind of source collection.
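A small sketch of the two search styles, with names and sample data of my own choosing: the binary search needs a sorted List, while the linear filter/findAny search accepts any Collection, here an unordered HashSet.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;

class SearchDemo {
    // Requires a List sorted in natural order; the result is undefined otherwise.
    static boolean containsSorted(List<Integer> sortedList, int key) {
        return Collections.binarySearch(sortedList, key) >= 0;
    }

    // Linear search; works on any Collection, sorted or not.
    static Optional<Integer> findAnyEqual(Collection<Integer> source, int key) {
        return source.stream().filter(x -> x == key).findAny();
    }

    public static void main(String[] args) {
        List<Integer> sorted = Arrays.asList(1, 3, 4, 7, 8, 9);
        System.out.println(containsSorted(sorted, 7));              // true
        System.out.println(containsSorted(sorted, 5));              // false

        Collection<Integer> unsorted = new HashSet<>(Arrays.asList(9, 1, 7, 4));
        System.out.println(findAnyEqual(unsorted, 7).isPresent());  // true
        System.out.println(findAnyEqual(unsorted, 5).isPresent());  // false
    }
}
```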
When writing applications in Java there are many use cases for java.util.Collection.
Since java.util.stream.Stream was introduced with Java 8, I have come across some use cases where it is difficult to decide what to use.
For example:
You are going to write some util-methods.
public static List<?> filterHashToList(int hash, Collection<?> toFilter) {
return toFilter.stream()
.filter((Object o) -> hash == o.hashCode())
.collect(Collectors.toCollection(LinkedList::new));
}
What about writing it like this:
public static List<?> filterHashToList(int hash, Collection<?> toFilter) {
List<Object> result = new LinkedList<>();
for(Object o : toFilter) {
if(hash == o.hashCode()) {
result.add(o);
}
}
return result;
}
Both methods would produce the same result. java.util.stream.Stream and java.util.stream.Collector are interfaces, so the implementation can vary as well if I use custom streams and collectors.
I think there are loads of implementations out there using the old-fashioned loop way.
So, is it possible to answer what to use, stream or loop, by use-case?
And if so, do all implementations have to be updated where appropriate?
Or should I even provide both ways by implementing util-methods?
Or should I also provide a method returning the stream after the filtering process so you can work with that one too if required?
In the absence of a posted answer I will quote Brian Goetz who echoes my sentiment and I suspect many others'.
There's nothing magic about either streams or loops. You should write the code that is most readable, clear, and maintainable. Either of these are acceptable.
Note that in your implementation you stick with a specific resulting collection type, LinkedList, which usually has very poor performance compared to ArrayList. What if the user of your method wants to use the resulting list in a random-access manner? Perhaps the user of this method needs an array, because it has to be passed to another API method which accepts an array. Sometimes the user just needs to know how many objects with the given hashCode are present in the input collection, so there's no need to create a resulting list at all. The Java 8 way is to return streams from methods, not collections, and let the caller decide how to collect them:
public static <T> Stream<T> filterHashToList(int hash, Collection<T> toFilter) {
return toFilter.stream()
.filter(o -> hash == o.hashCode());
}
And use it:
filterHashToList(42, input).count();
Or
filterHashToList(42, input).collect(toCollection(ArrayList::new));
Or
filterHashToList(42, input).toArray();
This way the method becomes very simple, so you probably don't need it at all, but if you want to do more sophisticated filtering/transformation, it's OK.
So if you don't want to change the API and still return the LinkedList, there's no need to change the implementation. But if you want to take advantage of the Stream API, it's better to change the return type to Stream.
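Here is a runnable sketch of the stream-returning variant with all three callers. The method is renamed to filterByHash since it no longer returns a List, and the sample input is mine: it relies on "Aa" and "BB" having the same String hash code (2112).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class FilterDemo {
    // Returns a lazy stream; the caller picks the terminal operation.
    static <T> Stream<T> filterByHash(int hash, Collection<T> toFilter) {
        return toFilter.stream().filter(o -> hash == o.hashCode());
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("Aa", "BB", "C");
        int hash = "Aa".hashCode(); // 2112, the same as "BB".hashCode()

        // Only the number of matches is needed: no list is ever built.
        System.out.println(filterByHash(hash, input).count()); // 2

        // The caller wants random access, so it collects into an ArrayList.
        List<String> asList = filterByHash(hash, input)
                .collect(Collectors.toCollection(ArrayList::new));
        System.out.println(asList); // [Aa, BB]

        // Another API wants an array.
        Object[] asArray = filterByHash(hash, input).toArray();
        System.out.println(asArray.length); // 2
    }
}
```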
I think this is largely dependent on the anticipated size of the Collection and the practical application of the program. The former is slightly more efficient, but I find it less readable, and there are still systems that run on Java 7 (e.g. Google App Engine). Additionally, the latter would likely be easier to add more complex filters to going forward. However, if efficiency is the primary concern, I would go with the former.
Let's say you have a collection with some strings and you want to return the first two characters of each string (or some other manipulation...).
In Java 8 for this case you can use either the map or the forEach methods on the stream() which you get from the collection (maybe something else but that is not important right now).
Personally I would use the map primarily because I associate forEach with mutating the collection and I want to avoid this. I also created a really small test regarding the performance but could not see any improvements when using forEach (I perfectly understand that small tests cannot give reliable results but still).
So what are the use-cases where one should choose forEach?
map is the better choice for this, because you're not trying to do anything with the strings yet, just map them to different strings.
forEach is designed to be the "final operation." As such, it doesn't return anything, and is all about mutating some state -- though not necessarily that of the original collection. For instance, you might use it to write elements to a file, having used other constructs (including map) to get those elements.
forEach terminates the stream and is executed for the side effect of the called Consumer. It does not necessarily mutate the stream members.
map maps each stream element to a different value/object using a provided Function. A Stream<R> is returned on which more steps can act.
The forEach terminal operation might be useful in several cases: when you want to collect into some older class for which you don't have a proper collector, or when you don't want to collect at all but send your data somewhere outside (write into the database, print into an OutputStream, etc.). There are many cases when the best way is to use both map (as an intermediate operation) and forEach (as a terminal operation).
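As a sketch of that combination, using the "first two characters" task from the question: map does the transformation, and the terminal step is either a collector or a forEach that pushes the results into some external sink. The StringBuilder here is only a stand-in for a file or database, and all names are made up.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.stream.Collectors;

class MapForEachDemo {
    // Pure helper: the first two characters, or the whole string if it is shorter.
    static String prefix(String s) {
        return s.length() <= 2 ? s : s.substring(0, 2);
    }

    // map + collect: a pure transformation into a new list.
    static List<String> prefixes(Collection<String> words) {
        return words.stream()
                .map(MapForEachDemo::prefix)
                .collect(Collectors.toList());
    }

    // map + forEach: the terminal step exists only for its side effect on the sink.
    static String writePrefixes(Collection<String> words) {
        StringBuilder sink = new StringBuilder(); // stand-in for a file, database, etc.
        words.stream()
                .map(MapForEachDemo::prefix)
                .forEach(p -> sink.append(p).append(';'));
        return sink.toString();
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("stream", "map", "a");
        System.out.println(prefixes(words)); // [st, ma, a]
        System.out.println(writePrefixes(words));
    }
}
```

Neither variant touches the original collection; the mutation that forEach performs is on the sink, not on the source.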
This is a question about API design. When extension methods were added in C#, IEnumerable got all the methods that enabled using lambda expressions directly on all collections.
With the advent of lambdas and default methods in Java, I would expect that Collection would implement Stream and provide default implementations for all its methods. This way, we would not need to call stream() in order to leverage the power it provides.
What is the reason the library architects opted for the less convenient approach?
From Maurice Naftalin's Lambda FAQ:
Why are Stream operations not defined directly on Collection?
Early drafts of the API exposed methods like filter, map, and reduce on Collection or Iterable. However, user experience with this design led to a more formal separation of the “stream” methods into their own abstraction. Reasons included:
Methods on Collection such as removeAll make in-place modifications, in contrast to the new methods which are more functional in nature. Mixing two different kinds of methods on the same abstraction forces the user to keep track of which are which. For example, given the declaration
Collection<String> strings;
the two very similar-looking method calls
strings.removeAll(s -> s.length() == 0);
strings.filter(s -> s.length() == 0); // not supported in the current API
would have surprisingly different results; the first would remove all empty String objects from the collection, whereas the second would return a stream containing all the non-empty Strings, while having no effect on the collection.
Instead, the current design ensures that only an explicitly-obtained stream can be filtered:
strings.stream().filter(s -> s.length() == 0)...;
where the ellipsis represents further stream operations, ending with a terminating operation. This gives the reader a much clearer intuition about the action of filter;
With lazy methods added to Collection, users were confused by a perceived—but erroneous—need to reason about whether the collection was in “lazy mode” or “eager mode”. Rather than burdening Collection with new and different functionality, it is cleaner to provide a Stream view with the new functionality;
The more methods added to Collection, the greater the chance of name collisions with existing third-party implementations. By only adding a few methods (stream, parallel) the chance for conflict is greatly reduced;
A view transformation is still needed to access a parallel view; the asymmetry between the sequential and the parallel stream views was unnatural. Compare, for example
coll.filter(...).map(...).reduce(...);
with
coll.parallel().filter(...).map(...).reduce(...);
This asymmetry would be particularly obvious in the API documentation, where Collection would have many new methods to produce sequential streams, but only one to produce parallel streams, which would then have all the same methods as Collection. Factoring these into a separate interface, StreamOps say, would not help; that would still, counterintuitively, need to be implemented by both Stream and Collection;
A uniform treatment of views also leaves room for other additional views in the future.
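The first point can be tried out with the API as it finally shipped in Java 8, where the predicate-taking in-place method on Collection is removeIf. This is only a sketch with made-up names and data: the stream pipeline leaves the source untouched, while removeIf mutates it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class InPlaceVsStreamDemo {
    static List<String> sample() {
        return new ArrayList<>(Arrays.asList("a", "", "b", ""));
    }

    // Stream view: returns a new list and leaves the source collection untouched.
    static List<String> nonEmptyCopy(List<String> strings) {
        return strings.stream()
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    // In-place method on Collection: mutates the list that is passed in.
    static void dropEmptyInPlace(List<String> strings) {
        strings.removeIf(String::isEmpty);
    }

    public static void main(String[] args) {
        List<String> strings = sample();

        List<String> copy = nonEmptyCopy(strings);
        System.out.println(copy);           // [a, b]
        System.out.println(strings.size()); // 4, the source is unchanged

        dropEmptyInPlace(strings);
        System.out.println(strings);        // [a, b]
    }
}
```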
A Collection is an object model
A Stream is a subject model
Collection definition in doc :
A collection represents a group of objects, known as its elements.
Stream definition in doc :
A sequence of elements supporting sequential and parallel aggregate operations
Seen this way, a stream is a specific kind of collection, not the other way around. Thus Collection should not implement Stream, regardless of backward compatibility.
So why doesn't Stream<T> implement Collection<T>? Because it is another way of looking at a bunch of objects: not as a group of elements, but in terms of the operations you can perform on it. This is why I say a Collection is an object model while a Stream is a subject model.
First, from the documentation of Stream:
Collections and streams, while bearing some superficial similarities, have different goals. Collections are primarily concerned with the efficient management of, and access to, their elements. By contrast, streams do not provide a means to directly access or manipulate their elements, and are instead concerned with declaratively describing their source and the computational operations which will be performed in aggregate on that source.
So you want to keep the concepts of stream and collection apart. If Collection implemented Stream, every collection would be a stream, which conceptually it is not. The way it is done now, every collection can give you a stream which works on that collection, which is something different if you think about it.
Another factor that comes to mind is cohesion/coupling as well as encapsulation. If every class that implements Collection had to implement the operations of Stream as well, it would have two (kind of) different purposes and might become too long.
My guess would be that it was made that way to avoid breakage with existing code that implements Collection. It would be hard to provide a default implementation that worked correctly with all existing implementations.
By persistent collections I mean collections like those in clojure.
For example, I have a list with the elements (a,b,c).
With a normal list, if I add d, my original list will have (a,b,c,d) as its elements.
With a persistent list, when I call list.add(d), I get back a new list, holding (a,b,c,d).
However, the implementation attempts to share elements between the list wherever possible, so it's much more memory efficient than simply returning a copy of the original list.
It also has the advantage of being immutable (if I hold a reference to the original list, then it will always return the original 3 elements).
This is all explained much better elsewhere (e.g. http://en.wikipedia.org/wiki/Persistent_data_structure).
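To make the sharing idea concrete, here is a minimal sketch of an immutable singly linked list. PList is a hypothetical toy class, not taken from any library. It adds at the front rather than the end, because that is the one operation a plain linked list can do while sharing the entire old list; efficient append at the end needs tree-based structures like Clojure's vectors.

```java
import java.util.ArrayList;
import java.util.List;

class PersistentListDemo {
    // Toy persistent list: every node is immutable, so tails can be shared safely.
    static final class PList<T> {
        final T head;        // first element (null for the empty list)
        final PList<T> tail; // rest of the list; shared, never copied
        final int size;

        private PList(T head, PList<T> tail, int size) {
            this.head = head;
            this.tail = tail;
            this.size = size;
        }

        static <T> PList<T> empty() {
            return new PList<>(null, null, 0);
        }

        // Returns a new list; the receiver becomes the tail of the result.
        PList<T> prepend(T value) {
            return new PList<>(value, this, size + 1);
        }

        // Copies the elements into a java.util.List for printing and comparison.
        List<T> toList() {
            List<T> out = new ArrayList<>(size);
            for (PList<T> node = this; node.size > 0; node = node.tail) {
                out.add(node.head);
            }
            return out;
        }
    }

    public static void main(String[] args) {
        PList<String> abc = PList.<String>empty().prepend("c").prepend("b").prepend("a");
        PList<String> dabc = abc.prepend("d");

        System.out.println(abc.toList());     // [a, b, c], the original is unchanged
        System.out.println(dabc.toList());    // [d, a, b, c]
        System.out.println(dabc.tail == abc); // true: the old list is shared, not copied
    }
}
```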
Anyway, my question is... what's the best library for providing this functionality for use in Java? Can I use the Clojure collections somehow (other than by directly using Clojure)?
Just use the ones in Clojure directly. While obviously you might not want to use the language itself, you can still use the persistent collections directly, as they are all just Java classes.
import clojure.lang.PersistentHashMap;
import clojure.lang.IPersistentMap;
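IPersistentMap map = PersistentHashMap.create("key1", "value1");
// IPersistentMap has no get(); lookups go through valAt() (inherited from ILookup)
assert map.valAt("key1").equals("value1");
// assoc returns a new map; the original is left untouched
IPersistentMap map2 = map.assoc("key2", "value2");
assert map2 != map;
assert map.valAt("key2") == null;
assert map2.valAt("key1").equals("value1");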
(disclaimer: I haven't actually compiled that code :)
The downside is that the collections aren't typed, i.e. there are no generics with them.
What about pcollections?
You can also check out Clojure's implementation of persistent collections (PersistentHashMap, for instance).
I was looking for a slim, Java "friendly" persistent collection framework and took TotallyLazy and PCollections mentioned in this thread for a testdrive, because they sounded most promising to me.
Both provide reasonably simple interfaces to manipulate persistent lists:
// TotallyLazy
PersistentList<String> original = PersistentList.constructors.empty(String.class);
PersistentList<String> modified = original.append("Mars").append("Raider").delete("Raider");
// PCollections
PVector<String> original = TreePVector.<String>empty();
PVector<String> modified = original.plus("Mars").plus("Raider").minus("Raider");
Both PersistentList and PVector extend java.util.List, so both libraries should integrate well into an existing environment.
It turns out, however, that TotallyLazy runs into performance problems when dealing with larger lists (as already mentioned in a comment above by @levantpied). On my MacBook Pro (Late 2013), inserting 100,000 elements and returning the immutable list took TotallyLazy ~2000 ms, whereas PCollections finished within ~120 ms.
My (simple) test cases are available on Bitbucket, if someone wants to take a more thorough look.
[UPDATE]: I recently had a look at Cyclops X, which is a high performing and more complete lib targeted for functional programming. Cyclops also contains a module for persistent collections.
https://github.com/andrewoma/dexx is a port of Scala's persistent collections to Java. It includes:
Set, SortedSet, Map, SortedMap and Vector
Adapters to view the persistent collections as java.util equivalents
Helpers for easy construction
Paguro provides type-safe versions of the actual Clojure collections for use in Java 8+. It includes: List (Vector), HashMap, TreeMap, HashSet, and TreeSet. They behave exactly the way you specify in your question and have been painstakingly fit into the existing java.util collections interfaces for maximum type-safe Java compatibility. They are also a little faster than PCollections.
Coding your example in Paguro looks like this:
// List with the elements (a,b,c)
ImList<T> list = vec(a,b,c);
// With a persistent list, when I call list.add(d),
// I get back a new list, holding (a,b,c,d)
ImList<T> newList = list.append(d);
list.size(); // still returns 3
newList.size(); // returns 4
You said,
The implementation attempts to share elements between the list
wherever possible, so it's much more memory efficient and fast than
simply returning a copy of the original list. It also has the
advantage of being immutable (if I hold a reference to the original
list, then it will always return the original 3 elements).
Yes, that's exactly how it behaves. Daniel Spiewak explains the speed and efficiency of these collections much better than I could.
You may want to check out clj-ds. I haven't used it, but it seems promising. According to the project's README, it extracts the data structures from Clojure 1.2.0.
Functional Java implements a persistent List, lazy List, Set, Map, and Tree. There may be others, but I'm just going by the information on the front page of the site.
I am also interested to know what the best persistent data structure library for Java is. My attention was directed to Functional Java because it is mentioned in the book, Functional Programming for Java Developers.
There's pcollections (Persistent Collections) library you can use:
http://code.google.com/p/pcollections/
The top-voted answer suggests using the Clojure collections directly, which I think is a very good idea. Unfortunately, the fact that Clojure is a dynamically typed language and Java is not makes the Clojure libraries very uncomfortable to use in Java.
Because of this and the lack of light-weight, easy-to-use wrappers for the clojure collections types I have written my own library of Java wrappers using generics for the clojure collection types with a focus on ease of use and clarity when it comes to interfaces.
https://github.com/cornim/ClojureCollections
Maybe this will be of use to somebody.
P.S.: At the moment only PersistentVector, PersistentMap and PersistentList have been implemented.
In the same vein as Cornelius Mund, Pure4J ports the Clojure collections into Java and adds Generics support.
However, Pure4J is aimed at introducing pure programming semantics to the JVM through compile time code checking, so it goes further to introduce immutability constraints to your classes, so that the elements of the collection cannot be mutated while the collection exists.
This may or may not be what you want to achieve: if you are just after using the Clojure collections on the JVM I would go with Cornelius' approach, otherwise, if you are interested in pursuing a pure programming approach within Java then you could give Pure4J a try.
Disclosure: I am the developer of this
totallylazy is a very good FP library which has implementations of:
PersistentList<T>: the concrete implementations are LinkedList<T> and TreeList<T> (for random access)
PersistentMap<K, V>: the concrete implementations are HashTreeMap<K, V> and ListMap<K, V>
PersistentSortedMap<K, V>
PersistentSet<T>: the concrete implementation is TreeSet<T>
Example of usage:
import static com.googlecode.totallylazy.collections.PersistentList.constructors.*;
import com.googlecode.totallylazy.collections.PersistentList;
import com.googlecode.totallylazy.numbers.Numbers;
...
PersistentList<Integer> list = list(1, 2, 3);
// Create a new list with 0 prepended
list = list.cons(0);
// Prints 0::1::2::3
System.out.println(list);
// Do some actions on this list (e.g. remove all even numbers)
list = list.filter(Numbers.odd);
// Prints 1::3
System.out.println(list);
totallylazy is constantly being maintained. The main disadvantage is the total absence of Javadoc.
I'm surprised nobody mentioned vavr. I have been using it for a long time now.
http://www.vavr.io
Description from their site:
Vavr core is a functional library for Java. It helps to reduce the amount of code and to increase the robustness. A first step towards functional programming is to start thinking in immutable values. Vavr provides immutable collections and the necessary functions and control structures to operate on these values. The results are beautiful and just work.
https://github.com/arnohaase/a-foundation is another port of Scala's libraries.
It is also available from Maven Central: com.ajjpj.a-foundation:a-foundation
Summary pretty much says it all. Here's the relevant snippet of code in ImmutableList.createFromIterable():
if (element == null) {
throw new NullPointerException("at index " + index);
}
I've run into this several times and can't see why a general-purpose library function should impose this limitation.
Edit 1: by "general-purpose", I'd be happy with 95% of cases. But I don't think I've written 100 calls to ImmutableList.of() yet, and have been bitten by this more than once. Maybe I'm an outlier, though. :)
Edit 2: I guess my big complaint is that this creates a "hiccup" when interacting with standard java.util collections. As you pointed out in your talk, problems with nulls in collections can show up far away from where those nulls were inserted. But if I have a long chain of code that puts nulls in a standard collection at one end and handles them properly at the other, then I'm unable to substitute a google collections class at any point along the way, because it'll immediately throw a NullPointerException.
I explained this at the 25-minute point of this video:
https://youtu.be/ZeO_J2OcHYM?t=1495
Sorry for the lazy answer, but this is after all only a "why" question (arguably not appropriate to StackOverflow?).
EDIT: Here's another point I'm not sure I made clear in the video: the total (across all of the world's Java code), amount of extra code that has to be written for those null-friendly cases to use the old standbys Collections.unmodifiableList(Arrays.asList(...)) etc. is overwhelmed by the total (across all of the world's Java code) amount of extra checkArgument(!foos.contains(null)) calls everyone would need to add if our collections didn't take care of that for you. Most, by FAR, usages of a collection do not expect any nulls to be present, and really should fail fast if any are.
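The trade-off can be sketched with the JDK alone. This is not Guava's code: immutableCopy is a hypothetical helper that only mirrors the check quoted in the question, placed next to the null-tolerant standby mentioned above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class NullHostileDemo {
    // Hypothetical null-hostile copy: fails fast and reports where the null was found.
    static <T> List<T> immutableCopy(Iterable<? extends T> elements) {
        List<T> copy = new ArrayList<>();
        int index = 0;
        for (T element : elements) {
            if (element == null) {
                throw new NullPointerException("at index " + index);
            }
            copy.add(element);
            index++;
        }
        return Collections.unmodifiableList(copy);
    }

    public static void main(String[] args) {
        // The null-friendly standby accepts the null and passes it on to later code.
        List<String> tolerant = Collections.unmodifiableList(Arrays.asList("a", null, "c"));
        System.out.println(tolerant.contains(null)); // true

        // The null-hostile variant rejects it at construction time.
        try {
            immutableCopy(Arrays.asList("a", null, "c"));
        } catch (NullPointerException e) {
            System.out.println(e.getMessage()); // at index 1
        }
    }
}
```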
In general in Google Collections the developers are of the group that does not believe that nulls should be an expected general purpose parameter.
From Guava's Github Page
Careless use of null can cause a staggering variety of bugs. Studying the Google code base, we found that something like 95% of collections weren't supposed to have any null values in them, and having those fail fast rather than silently accept null would have been helpful to developers.
The Guava position is largely that there are other ways to avoid nulls in collections. For example, fetching a batch of items with a specific key:
// If a widget for the given id does not exist, return `null` in the list
private List<Widget> getWidgets(List<String> widgetIds);
// Could be restructured to use a Map type like this and avoids nulls completely.
// If a widget for the given id does not exist, there is no entry in the map
private Map<String, Widget> getWidgets(List<String> widgetIds);
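A possible shape for the Map-based variant, as a sketch only: the Widget class and the in-memory STORE are assumptions made to keep the example self-contained.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class WidgetLookupDemo {
    // Hypothetical Widget type and backing store, for illustration only.
    static final class Widget {
        final String id;

        Widget(String id) {
            this.id = id;
        }
    }

    private static final Map<String, Widget> STORE = new HashMap<>();
    static {
        STORE.put("w1", new Widget("w1"));
        STORE.put("w2", new Widget("w2"));
    }

    // Ids without a widget simply get no entry, so the result never contains null.
    static Map<String, Widget> getWidgets(List<String> widgetIds) {
        Map<String, Widget> found = new LinkedHashMap<>();
        for (String id : widgetIds) {
            Widget widget = STORE.get(id);
            if (widget != null) {
                found.put(id, widget);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Map<String, Widget> found = getWidgets(Arrays.asList("w1", "nope", "w2"));
        System.out.println(found.keySet());            // [w1, w2]
        System.out.println(found.containsKey("nope")); // false
    }
}
```

Callers that care about a missing id check containsKey instead of testing list elements for null.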
One reason is that it allows functions that work on the list not to have to check every element for null, significantly improving performance.