When writing applications in Java, there are many use cases for java.util.Collection.
Since java.util.stream.Stream was introduced in Java 8, I have come across some use cases where it is difficult to decide which to use.
For example:
You are going to write some utility methods:
public static List<?> filterHashToList(int hash, Collection<?> toFilter) {
    return toFilter.stream()
        .filter((Object o) -> hash == o.hashCode())
        .collect(Collectors.toCollection(LinkedList::new));
}
What about writing it like this:
public static List<?> filterHashToList(int hash, Collection<?> toFilter) {
    List<Object> result = new LinkedList<>();
    for (Object o : toFilter) {
        if (hash == o.hashCode()) {
            result.add(o);
        }
    }
    return result;
}
Both methods would produce the same result. java.util.stream.Stream and java.util.stream.Collector are interfaces, so the implementation can vary as well if I use custom streams and collectors.
I think there are loads of implementations out there using the old-fashioned loop approach.
So, is it possible to answer what to use, stream or loop, by use-case?
And if so, do all implementations have to be updated where appropriate?
Or should I even provide both ways by implementing util-methods?
Or should I also provide a method that returns the stream after the filtering step, so that callers can work with it too if required?
In the absence of a posted answer I will quote Brian Goetz who echoes my sentiment and I suspect many others'.
There's nothing magic about either streams or loops. You should write the code that is most readable, clear, and maintainable. Either of these are acceptable.
Note that in your implementation you stick with a specific resulting collection type, LinkedList, which usually performs very poorly compared to ArrayList. What if the user of your method wants to use the resulting list in a random-access manner? Perhaps the user of this method needs an array, because it should be passed to another API method which accepts an array. Sometimes the user just needs to know how many objects with the given hashCode are present in the input collection, so there's no need to create a resulting list at all. The Java 8 way is to return streams from methods rather than collections, and let the caller decide how to collect the result:
public static <T> Stream<T> filterHashToList(int hash, Collection<T> toFilter) {
    return toFilter.stream()
        .filter(o -> hash == o.hashCode());
}
And use it:
filterHashToList(42, input).count();
Or
filterHashToList(42, input).collect(toCollection(ArrayList::new));
Or
filterHashToList(42, input).toArray();
This way the method becomes very simple, so you probably don't need it at all, but if you want to do more sophisticated filtering or transformation, it's OK.
So if you don't want to change the API and still return the LinkedList, there's no need to change the implementation. But if you want to take advantage of the Stream API, it's better to change the return type to Stream.
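To make the three call sites above concrete, here is a small self-contained sketch. The class name HashFilterDemo and the method name filterByHash are made up for illustration, and the input relies on the fact that Integer.hashCode() returns the int value itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class HashFilterDemo {
    // Returns a lazy stream; nothing is filtered until a terminal operation runs.
    static <T> Stream<T> filterByHash(int hash, Collection<T> toFilter) {
        return toFilter.stream().filter(o -> hash == o.hashCode());
    }

    public static void main(String[] args) {
        // Integer.hashCode() is the int value, so 42 matches twice here.
        List<Integer> input = Arrays.asList(42, 7, 42, 13);

        // The caller picks the terminal operation that fits its need.
        long count = filterByHash(42, input).count();
        List<Integer> asList = filterByHash(42, input)
                .collect(Collectors.toCollection(ArrayList::new));
        Object[] asArray = filterByHash(42, input).toArray();

        System.out.println(count + " " + asList + " " + asArray.length);
    }
}
```

Each call builds a fresh stream, since a stream can only be consumed once.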
I think this is largely dependent on the anticipated size of the Collection and the practical application of the program. The former is slightly more efficient, but I find it less readable, and there are still systems that run on Java 7 (e.g. Google App Engine). Additionally, the latter would likely be easier to extend with more complex filters going forward. However, if efficiency is the primary concern, I would go with the former.
Consider the following example that prints the maximum element in a List:
List<Integer> list = Arrays.asList(1,4,3,9,7,4,8);
list.stream().max(Comparator.naturalOrder()).ifPresent(System.out::println);
The same objective can also be achieved using the Collections.max method:
System.out.println(Collections.max(list));
The above code is not only shorter but also cleaner to read (in my opinion). There are similar examples that come to mind such as the use of binarySearch vs filter used in conjunction with findAny.
I understand that a Stream can be an infinite pipeline, as opposed to a Collection, which is limited by the memory available to the JVM. This would be my criterion for deciding whether to use a Stream or the Collections API. Are there any other reasons for choosing Stream over the Collections API (such as performance)? More generally, is this the only reason to choose Stream over an older API that can do the job in a cleaner and shorter way?
The Stream API is like a Swiss Army knife: it allows you to do quite complex operations by combining the tools effectively. On the other hand, if you just need a screwdriver, a standalone screwdriver would probably be more convenient. The Stream API includes many things (like distinct, sorted, primitive operations etc.) which would otherwise require you to write several lines and introduce intermediate variables/data structures and boring loops, drawing the programmer's attention away from the actual algorithm. Sometimes using the Stream API can improve performance even for sequential code. For example, consider some old API:
class Group {
    private Map<String, User> users;
    public List<User> getUsers() {
        return new ArrayList<>(users.values());
    }
}
Here we want to return all the users of the group. The API designer decided to return a List. But it can be used outside in various ways:
List<User> users = group.getUsers();
Collections.sort(users);
someOtherMethod(users.toArray(new User[users.size()]));
Here it's sorted and converted to an array to pass to some other method which happens to accept an array. Elsewhere, getUsers() may be used like this:
List<User> users = group.getUsers();
for (User user : users) {
    if (user.getAge() < 18) {
        throw new IllegalStateException("Underage user in selected group!");
    }
}
Here we just want to check whether any user matches some criterion. In both cases, copying to an intermediate ArrayList was actually unnecessary. When we move to Java 8, we can replace the getUsers() method with users():
public Stream<User> users() {
    return users.values().stream();
}
And modify the caller code. The first one:
someOtherMethod(group.users().sorted().toArray(User[]::new));
The second one:
if (group.users().anyMatch(user -> user.getAge() < 18)) {
    throw new IllegalStateException("Underage user in selected group!");
}
This way it's not only shorter, but may work faster as well, because we skip the intermediate copying.
Another conceptual point of the Stream API is that any stream code written according to the guidelines can be parallelized simply by adding a parallel() step. Of course this will not always boost performance, but it helps more often than I expected. Usually, if the operation takes 0.1 ms or longer when executed sequentially, it can benefit from parallelization. In any case, we haven't seen such a simple way to do parallel programming in Java before.
Of course, it always depends on the circumstances. Take your initial example:
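As a rough illustration of the parallel() point, the sketch below runs the same stateless pipeline both ways and checks that the results agree. The class and method names are made up, and whether the parallel run is actually faster depends on the workload and the machine, so no timing is claimed here.

```java
import java.util.stream.LongStream;

final class ParallelSumDemo {
    // Sums the squares of 1..n. The pipeline is stateless and has no side
    // effects, so it can safely run sequentially or in parallel.
    static long sumOfSquares(long n, boolean parallel) {
        LongStream range = LongStream.rangeClosed(1, n);
        if (parallel) {
            range = range.parallel(); // the only change needed to go parallel
        }
        return range.map(x -> x * x).sum();
    }

    public static void main(String[] args) {
        long sequential = sumOfSquares(1_000_000, false);
        long parallel = sumOfSquares(1_000_000, true);
        System.out.println(sequential == parallel); // both modes give the same sum
    }
}
```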
List<Integer> list = Arrays.asList(1,4,3,9,7,4,8);
list.stream().max(Comparator.naturalOrder()).ifPresent(System.out::println);
If you want to do the same thing efficiently, you would use
IntStream.of(1,4,3,9,7,4,8).max().ifPresent(System.out::println);
which doesn’t involve any auto-boxing. But if your assumption is to have a List<Integer> beforehand, that might not be an option, so if you are just interested in the max value, Collections.max might be the simpler choice.
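If the List<Integer> is a given, a middle ground is to unbox once with mapToInt and continue on an IntStream. This is only a sketch; the class name MaxDemo is made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.OptionalInt;

final class MaxDemo {
    // Unboxes each element once, then finds the maximum on primitive ints.
    // Returns an empty OptionalInt for an empty list.
    static OptionalInt max(List<Integer> values) {
        return values.stream().mapToInt(Integer::intValue).max();
    }

    public static void main(String[] args) {
        List<Integer> list = Arrays.asList(1, 4, 3, 9, 7, 4, 8);
        max(list).ifPresent(System.out::println); // prints 9
    }
}
```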
But this would lead to the question why you have a List<Integer> beforehand. Maybe, it’s the result of old code (or new code written using old thinking), which had no other choice than using boxing and Collections as there was no alternative in the past?
So maybe you should think about the source producing the collection before bothering with how to consume it (or, well, think about both at the same time).
If all you have is a Collection and all you need is a single terminal operation for which a simple Collection-based implementation exists, you may use it directly without bothering with the Stream API. The API designers acknowledged this idea, as they added methods like forEach(…) to the Collection API instead of insisting on everyone using stream().forEach(…). And Collection.forEach(…) is not a simple shorthand for Collection.stream().forEach(…); in fact, it’s already defined on the more abstract Iterable interface, which doesn’t even have a stream() method.
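To illustrate that forEach lives on Iterable, here is a sketch of a type that implements only Iterable (so it has no stream() method) and still supports forEach. The Countdown class is hypothetical.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// A plain Iterable: it is not a Collection and has no stream() method.
final class Countdown implements Iterable<Integer> {
    private final int start;

    Countdown(int start) {
        this.start = start;
    }

    @Override
    public Iterator<Integer> iterator() {
        return new Iterator<Integer>() {
            private int remaining = start;

            @Override
            public boolean hasNext() {
                return remaining > 0;
            }

            @Override
            public Integer next() {
                if (remaining <= 0) {
                    throw new NoSuchElementException();
                }
                return remaining--;
            }
        };
    }

    // Uses the default forEach method inherited from Iterable.
    static List<Integer> toList(Iterable<Integer> source) {
        List<Integer> out = new ArrayList<>();
        source.forEach(out::add);
        return out;
    }

    public static void main(String[] args) {
        new Countdown(3).forEach(System.out::println); // prints 3, 2, 1 on separate lines
    }
}
```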
By the way, you should understand the difference between Collections.binarySearch and Stream.filter/findAny. The former requires the collection to be sorted and, if that prerequisite is met, might be the better choice. But if the collection isn’t sorted, a simple linear search is more efficient than sorting just for a single use of binary search, not to mention that binary search works with Lists only, while filter/findAny works with any stream and therefore supports every kind of source collection.
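The difference can be sketched as follows. The helper names containsSorted and containsLinear are made up; the first is only correct if the caller really passes a list sorted in natural order.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;

final class SearchDemo {
    // Binary search: the list MUST already be sorted in ascending natural order,
    // otherwise the result is undefined.
    static boolean containsSorted(List<Integer> sorted, int key) {
        return Collections.binarySearch(sorted, key) >= 0;
    }

    // Linear search: works on any collection, sorted or not.
    static boolean containsLinear(Collection<Integer> values, int key) {
        return values.stream().filter(v -> v == key).findAny().isPresent();
    }

    public static void main(String[] args) {
        List<Integer> sorted = Arrays.asList(1, 4, 7, 9);
        Collection<Integer> unsorted = new HashSet<>(Arrays.asList(7, 1, 9, 4));

        System.out.println(containsSorted(sorted, 7));   // true
        System.out.println(containsLinear(unsorted, 9)); // true
        System.out.println(containsLinear(unsorted, 5)); // false
    }
}
```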
Let's say you have a collection with some strings and you want to return the first two characters of each string (or some other manipulation...).
In Java 8 for this case you can use either the map or the forEach methods on the stream() which you get from the collection (maybe something else but that is not important right now).
Personally I would use the map primarily because I associate forEach with mutating the collection and I want to avoid this. I also created a really small test regarding the performance but could not see any improvements when using forEach (I perfectly understand that small tests cannot give reliable results but still).
So what are the use-cases where one should choose forEach?
map is the better choice for this, because you're not trying to do anything with the strings yet, just map them to different strings.
forEach is designed to be the "final operation." As such, it doesn't return anything, and is all about mutating some state -- though not necessarily that of the original collection. For instance, you might use it to write elements to a file, having used other constructs (including map) to get those elements.
forEach terminates the stream and is executed for the side effect of the called Consumer. It does not necessarily mutate the stream members.
map maps each stream element to a different value/object using a provided Function. A Stream<R> is returned, on which further steps can act.
The forEach terminal operation might be useful in several cases: when you want to collect into some older class for which you don't have a proper collector, or when you don't want to collect at all but want to send your data somewhere outside (write into a database, print to an OutputStream, etc.). There are many cases where the best way is to use both map (as an intermediate operation) and forEach (as a terminal operation).
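For the "first two characters" case from the question, a sketch combining both operations might look like this. PrefixDemo and its methods are illustrative names, and strings shorter than two characters are returned unchanged as an assumption.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class PrefixDemo {
    // First two characters, or the whole string if it is shorter than that.
    static String prefix(String s) {
        return s.substring(0, Math.min(2, s.length()));
    }

    // map transforms each element; collect gathers the results into a new list.
    static List<String> prefixes(List<String> words) {
        return words.stream().map(PrefixDemo::prefix).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("stream", "loop", "x");

        // map is the intermediate step, forEach the terminal side effect (printing).
        words.stream().map(PrefixDemo::prefix).forEach(System.out::println);

        System.out.println(prefixes(words)); // [st, lo, x]
    }
}
```

The original list is left untouched in both cases.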
When using Guava's ImmutableCollection as a parameter for a function, is it better to require an ImmutableCollection as the parameter type:
<T> void foo(ImmutableCollection<T> l)
or should the function take a Collection<T> and create an immutable collection itself as in
<T> void foo(Collection<T> l) {
    ImmutableCollection<T> l2 = ImmutableCollection.copyOf(l);
    // ...
}
The first version seems preferable because the caller can be sure that the collection passed to the function is not modified by it. But the first version requires client code that holds a plain collection to call copyOf(), that is:
Collection collection = map.values();
foo(ImmutableCollection.copyOf(collection));
// instead of simply
foo(collection);
PS: This is not completely true, since ImmutableCollection does not have copyOf() but ImmutableList and ImmutableSet do.
I think that it depends on what the foo function is supposed to do with the collection argument.
If foo is going to read the collection elements, then <T> void foo(Collection<T> l) is preferable, because it leaves the decision to the caller.
If foo is going to incorporate the collection into the state of some object, then an immutable collection may be preferable. However, we then need to ask ourselves whether it should be the foo method's responsibility to deal with this, or the caller's responsibility.
There isn't a single right (or "best practice") answer to this. However, using ImmutableCollection as the parameter's formal type could result in complexity and/or unnecessary copying in some cases.
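The two situations can be sketched with plain JDK classes. Note that this swaps in Collections.unmodifiableList over a private copy instead of Guava's immutable collections, purely to keep the example free of dependencies; the Roster class is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

final class Roster {
    private final List<String> names;

    // The collection becomes part of this object's state, so take a private,
    // unmodifiable copy; later changes by the caller cannot leak in.
    Roster(Collection<String> names) {
        this.names = Collections.unmodifiableList(new ArrayList<>(names));
    }

    List<String> names() {
        return names;
    }

    // Read-only use: accept any Collection and make no copy at all.
    static int totalLength(Collection<String> names) {
        int total = 0;
        for (String name : names) {
            total += name.length();
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> source = new ArrayList<>(Arrays.asList("a", "bb"));
        Roster roster = new Roster(source);
        source.add("ccc"); // does not affect the roster's copy

        System.out.println(roster.names());      // [a, bb]
        System.out.println(totalLength(source)); // 6
    }
}
```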
Look at the guava docs: "copyOf is smarter than you think."
So you can use the generic Collection interface with no regrets for performance.
Whether the copy is necessary (rather than a function comment) depends, in my view, on how long you're holding on to the data.
Use the more generic Collection interface; it's much better for you to write the call once than to require all of the clients to do so on every call. If you're really concerned about performance, and profiling shows it to be an issue, you could do a class check on the incoming collection to see whether you can avoid the copy.
It depends on what foo is doing. Most probably it simply reads the collection values in which case it does not need to make a copy of it, especially immutable one.
The one advantage of using ImmutableCollection is that the method guarantees that it will not modify the collection. But then that guarantee is to the user only, the platform does not understand it, so you might as well express it in comments or a custom annotation.
I'd just like to know your opinion on changing the return type of all collection-returning functions to Iterable.
This seems to me to be probably the most common code in Java nowadays, and everybody returns a List/Set/Map 99% of the time, but shouldn't the standard be to return something like
public final Iterable<String> myMethod() {
    return new Iterable<String>() {
        @Override
        public Iterator<String> iterator() {
            return myVar.getColl().iterator();
        }
    };
}
Is this bad at all? You know all the DAO classes and this stuff would be like
Iterable<String> getName(){}
Iterable<Integer> getNums(){}
Iterable<String> getStuff(){}
instead of
List<String> getName(){}
List<Integer> getNums(){}
Set<String> getStuff(){}
After all, 99% of the time you will use it in a for loop...
What do you think?
This would be a really bad plan.
I wouldn't say that 90% of the time you just use it in a for loop. Maybe 40-50%. The rest of the time, you need more information: size, contains, or get(int).
Additionally, the return type is a sort of documentation by itself. Returning a Set guarantees that the elements will be unique. Returning a List documents that the elements will be in a consistent order.
I wouldn't recommend returning specific collection implementations like HashSet or ArrayList, but I would usually prefer to return a Set or a List rather than a Collection or an Iterable, if the option is available.
List, Set & Map are interfaces, so they're not tied to a particular implementation. So they are good candidates for return types.
The difference between List/etc. and Iterable/Iterator is the kind of access. The former is for random access: you have direct access to all the data. Iterable avoids the need to have all of it available at once, which is ideal in cases where you have a lot of data and it's not efficient to hold it all in place. Example: iterating over a large database result set.
So it depends on what you are accessing. If your data can be huge and needs to be iterated to avoid performance degradation, then enforce that by using iterators. In other cases a List is OK.
Edit: returning an iterator means the only thing you can do is loop through the items, with no other possibilities. If you need this trade-off to ensure performance, OK, but as said, only use it when needed.
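As a sketch of data that cannot sensibly be returned as a List, consider an unbounded sequence produced one element at a time. LazyNumbers and its methods are made-up names, and an in-memory counter stands in for something like a large result set.

```java
import java.util.Iterator;

final class LazyNumbers {
    // An unbounded sequence 1, 2, 3, ... Each element is created on demand,
    // so the whole sequence never has to exist in memory.
    static Iterable<Long> naturals() {
        return () -> new Iterator<Long>() {
            private long next = 1;

            @Override
            public boolean hasNext() {
                return true; // never exhausted
            }

            @Override
            public Long next() {
                return next++;
            }
        };
    }

    // The consumer decides when to stop; only the elements it visits are produced.
    static long smallestWithSquareAbove(long limit) {
        for (long n : naturals()) {
            if (n * n > limit) {
                return n;
            }
        }
        throw new IllegalStateException("unreachable: the sequence never ends");
    }

    public static void main(String[] args) {
        System.out.println(smallestWithSquareAbove(50)); // 8, because 7*7 = 49 and 8*8 = 64
    }
}
```

The flip side is exactly what the answers above point out: there is no size(), contains() or get(int) on such a return value.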
Well, what you coded is partially right:
you still need access to some methods of the collection, such as:
size
contains()
get(index)
exists()
So you should rethink your new architecture, or extend it with these methods so that you always have what you need.
I usually find it sufficient to use the concrete classes for the interfaces listed in the title. Usually when I use other types (such as LinkedList or TreeSet), the reason is functionality and not performance, for example a LinkedList for a queue.
I do sometimes construct an ArrayList with an initial capacity larger than the default of 10 and a HashMap with more than the default 16 buckets, but I usually (especially for business CRUD) never find myself thinking "hmmm... should I use a LinkedList instead of an ArrayList if I am just going to insert and iterate through the whole List?"
I am just wondering what everyone else here uses (and why) and what type of applications they develop.
Those are definitely my default, although often a LinkedList would in fact be the better choice for lists, as the vast majority of lists seem to just iterate in order, or get converted to an array via toArray anyway.
But in terms of keeping consistent maintainable code, it makes sense to standardize on those and use alternatives for a reason, that way when someone reads the code and sees an alternative, they immediately start thinking that the code is doing something special.
I always type the parameters and variables as Collection, Map and List unless I have a special reason to refer to the sub type, that way switching is one line of code when you need it.
I could see explicitly requiring an ArrayList sometimes if you need the random access, but in practice that really doesn't happen.
For some kind of lists (e.g. listeners) it makes sense to use a CopyOnWriteArrayList instead of a normal ArrayList. For almost everything else the basic implementations you mentioned are sufficient.
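A sketch of the listener case: because CopyOnWriteArrayList iterates over a snapshot, a listener may unregister itself while being notified. With a plain ArrayList the same pattern would typically fail with a ConcurrentModificationException. EventSource is a hypothetical class.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

final class EventSource {
    private final List<Runnable> listeners = new CopyOnWriteArrayList<>();

    void addListener(Runnable listener) {
        listeners.add(listener);
    }

    void removeListener(Runnable listener) {
        listeners.remove(listener);
    }

    int listenerCount() {
        return listeners.size();
    }

    // Iterates over a snapshot of the list, so listeners can be added or
    // removed (even by a listener itself) while notification is in progress.
    void fire() {
        for (Runnable listener : listeners) {
            listener.run();
        }
    }

    public static void main(String[] args) {
        EventSource source = new EventSource();
        int[] calls = {0};
        Runnable[] oneShot = new Runnable[1];
        // A listener that unregisters itself the first time it is called.
        oneShot[0] = () -> {
            calls[0]++;
            source.removeListener(oneShot[0]);
        };
        source.addListener(oneShot[0]);

        source.fire();
        source.fire(); // no listeners left, nothing happens

        System.out.println(calls[0] + " call(s), " + source.listenerCount() + " listener(s) left");
    }
}
```

Every write copies the backing array, so this fits lists that are read far more often than they are changed.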
Yep, I use those as defaults. I generally have a rule that on public class methods, I always return the interface type (ie. Map, Set, List, etc.), since other classes (usually) don't need to know what the specific concrete class is. Inside class methods, I'll use the concrete type only if I need access to any extra methods it may have (or if it makes understanding the code easier), otherwise the interface is used.
It's good to be pretty flexible with any rules you do use, though, as a dependency on concrete class visibility is something that can change over time (especially as your code gets more complex).
Indeed, always use the base interfaces Collection, List and Map instead of their implementations. To make things even more flexible, you could hide your implementations behind static factory methods, which allow you to switch to a different implementation in case you find something better (I doubt there will be big changes in this field, but you never know). Another benefit is that the syntax is shorter thanks to generics.
Map<String, LongObjectClasName> map = CollectionUtils.newMap();
instead of
Map<String, LongObjectClasName> map = new HashMap<String, LongObjectClasName>();
public class CollectionUtils {
    .....
    // Factory methods are static so they can be called as CollectionUtils.newList().
    public static <T> List<T> newList() {
        return new ArrayList<T>();
    }
    public static <T> List<T> newList(int initialCapacity) {
        return new ArrayList<T>(initialCapacity);
    }
    public static <T> List<T> newSynchronizedList() {
        return new Vector<T>();
    }
    public static <T> List<T> newConcurrentList() {
        return new CopyOnWriteArrayList<T>();
    }
    public static <T> List<T> newSynchronizedList(int initialCapacity) {
        return new Vector<T>(initialCapacity);
    }
    ...
}
Having just come out of a class about data structure performance, I'll usually look at the kind of algorithm I'm developing or the purpose of the structure before I choose an implementation.
For example, if I'm building a list that has a lot of random accesses into it, I'll use an ArrayList because its random access performance is good, but if I'm inserting things into the list a lot, I might choose a LinkedList instead. (I know modern implementations remove a lot of performance barriers, but this was the first example that came to mind.)
You might want to look at some of the Wikipedia pages for data structures (especially those dealing with sorting algorithms, where performance is especially important) for more information about performance, and the article about Big O notation for a general discussion of measuring the performance of various functions on data structures.
I don't really have a "default", though I suppose I use the implementations listed in the question more often than not. I think about what would be appropriate for whatever particular problem I'm working on, and use it. I don't just blindly default to using ArrayList, I put in 30 seconds of thought along the lines of "well, I'm going to be doing a lot of iterating and removing elements in the middle of this list so I should use a LinkedList".
And I almost always use the interface type for my reference, rather than the implementation. Remember that List is not the only interface that LinkedList implements. I see this a lot:
LinkedList<Item> queue = new LinkedList<Item>();
when what the programmer meant was:
Queue<Item> queue = new LinkedList<Item>();
I also use the Iterable interface a fair amount.
If you are using LinkedList for a queue, you might consider using the Deque interface and ArrayDeque implementing class (introduced in Java 6) instead. To quote the Javadoc for ArrayDeque:
This class is likely to be faster than Stack when used as a stack, and faster than LinkedList when used as a queue.
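A short sketch of both uses, with the variable typed as the Deque interface. The class and method names are made up for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class DequeDemo {
    // FIFO: elements are added at the tail and removed from the head.
    static String firstOutOfQueue(String... items) {
        Deque<String> queue = new ArrayDeque<>();
        for (String item : items) {
            queue.offer(item);
        }
        return queue.poll(); // null if the queue is empty
    }

    // LIFO: elements are pushed onto and popped from the head.
    static String firstOutOfStack(String... items) {
        Deque<String> stack = new ArrayDeque<>();
        for (String item : items) {
            stack.push(item);
        }
        return stack.pop(); // throws NoSuchElementException if the stack is empty
    }

    public static void main(String[] args) {
        System.out.println(firstOutOfQueue("a", "b", "c")); // a
        System.out.println(firstOutOfStack("a", "b", "c")); // c
    }
}
```

Note that ArrayDeque does not accept null elements.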
I tend to use one of the *Queue classes for queues. However, LinkedList is a good choice if you don't need thread safety.
Using the interface type (List, Map) instead of the implementation type (ArrayList, HashMap) is irrelevant within methods; it's mainly important in public APIs, i.e. method signatures (and "public" doesn't necessarily mean "intended to be published outside your team").
When a method takes an ArrayList as a parameter, and you have something else, you're screwed and have to copy your data pointlessly. If the parameter type is List, callers are much more flexible and can, e.g. use Collections.EMPTY_LIST or Collections.singletonList().
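A sketch of that flexibility: one method declared against List accepts the empty and singleton lists from Collections as well as ordinary implementations. Report.describe is a hypothetical method.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedList;
import java.util.List;

final class Report {
    // Declared against the interface, so callers can pass any List implementation.
    static String describe(List<String> items) {
        return items.size() + " item(s): " + String.join(", ", items);
    }

    public static void main(String[] args) {
        System.out.println(describe(Collections.emptyList()));                    // 0 item(s):
        System.out.println(describe(Collections.singletonList("a")));             // 1 item(s): a
        System.out.println(describe(Arrays.asList("a", "b")));                    // 2 item(s): a, b
        System.out.println(describe(new LinkedList<>(Arrays.asList("x", "y"))));  // 2 item(s): x, y
    }
}
```

Had the parameter been declared as ArrayList<String>, the first three calls would not compile without copying the data into a new ArrayList first.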
I too typically use ArrayList, but I will use TreeSet or HashSet depending on the circumstances. When writing tests, however, Arrays.asList and Collections.singletonList are also frequently used. I've mostly been writing thread-local code, but I could also see using the various concurrent classes as well.
Also, there were times I used ArrayList when what I really wanted was a LinkedHashSet (before it was available).