Summary pretty much says it all. Here's the relevant snippet of code in ImmutableList.createFromIterable():
if (element == null) {
throw new NullPointerException("at index " + index);
}
I've run into this several times and can't see why a general-purpose library function should impose this limitation.
Edit 1: by "general-purpose", I'd be happy with 95% of cases. But I don't think I've written 100 calls to ImmutableList.of() yet, and have been bitten by this more than once. Maybe I'm an outlier, though. :)
Edit 2: I guess my big complaint is that this creates a "hiccup" when interacting with standard java.util collections. As you pointed out in your talk, problems with nulls in collections can show up far away from where those nulls were inserted. But if I have a long chain of code that puts nulls in a standard collection at one end and handles them properly at the other, then I'm unable to substitute a google collections class at any point along the way, because it'll immediately throw a NullPointerException.
I explained this at the 25-minute point of this video:
https://youtu.be/ZeO_J2OcHYM?t=1495
Sorry for the lazy answer, but this is after all only a "why" question (arguably not appropriate to StackOverflow?).
EDIT: Here's another point I'm not sure I made clear in the video: the total amount (across all of the world's Java code) of extra code that has to be written for those null-friendly cases to use the old standbys Collections.unmodifiableList(Arrays.asList(...)) etc. is overwhelmed by the total number (across all of the world's Java code) of extra checkArgument(!foos.contains(null)) calls everyone would need to add if our collections didn't take care of that for you. By FAR most usages of a collection do not expect any nulls to be present, and really should fail fast if any are.
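To make the trade-off concrete, here is a minimal JDK-only sketch. `copyRejectingNulls` and `nullFriendlyList` are hypothetical helpers written for illustration, not Guava code: the first mimics the fail-fast check quoted in the question, the second is the null-tolerant fallback built from the "old standbys".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class NullPolicyDemo {
    // Hypothetical helper: copies the elements and fails fast on the first null,
    // reporting its index in the same style as the snippet quoted in the question.
    static <T> List<T> copyRejectingNulls(Iterable<? extends T> elements) {
        List<T> copy = new ArrayList<>();
        int index = 0;
        for (T element : elements) {
            if (element == null) {
                throw new NullPointerException("at index " + index);
            }
            copy.add(element);
            index++;
        }
        return Collections.unmodifiableList(copy);
    }

    // Hypothetical helper: null-tolerant alternative built from JDK classes only.
    @SafeVarargs
    static <T> List<T> nullFriendlyList(T... elements) {
        return Collections.unmodifiableList(Arrays.asList(elements));
    }

    public static void main(String[] args) {
        List<String> tolerant = nullFriendlyList("a", null, "c");
        System.out.println(tolerant.contains(null)); // the null is stored as-is

        try {
            copyRejectingNulls(tolerant);
        } catch (NullPointerException e) {
            System.out.println(e.getMessage()); // the message names the offending index
        }
    }
}
```

With the fail-fast variant the error surfaces where the null enters the collection, rather than wherever it is eventually dereferenced.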
In general, the Google Collections developers belong to the group that does not believe nulls should be an expected general-purpose parameter.
From Guava's GitHub page:
Careless use of null can cause a staggering variety of bugs. Studying the Google code base, we found that something like 95% of collections weren't supposed to have any null values in them, and having those fail fast rather than silently accept null would have been helpful to developers.
The Guava position is largely that there are other ways to avoid nulls in collections. Take, for example, fetching a batch of items by key:
// If a widget for the given id does not exist, return `null` in the list
private List<Widget> getWidgets(List<String> widgetIds);
// Could be restructured to return a Map like this, which avoids nulls completely.
// If a widget for the given id does not exist, there is no entry in the map
private Map<String, Widget> getWidgets(List<String> widgetIds);
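A minimal sketch of the Map-returning variant might look as follows. The `Widget` class, the in-memory `store` and the `register` method are assumptions made up for this example; only the shape of `getWidgets` comes from the signatures above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class WidgetLookupDemo {
    // Hypothetical value type, standing in for the Widget in the signatures above.
    static final class Widget {
        final String id;

        Widget(String id) {
            this.id = id;
        }
    }

    // Hypothetical in-memory backing store.
    private final Map<String, Widget> store = new HashMap<>();

    void register(Widget widget) {
        store.put(widget.id, widget);
    }

    // Unknown ids are simply left out of the result, so callers never see a null value.
    Map<String, Widget> getWidgets(List<String> widgetIds) {
        Map<String, Widget> found = new LinkedHashMap<>();
        for (String id : widgetIds) {
            Widget widget = store.get(id);
            if (widget != null) {
                found.put(id, widget);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        WidgetLookupDemo demo = new WidgetLookupDemo();
        demo.register(new Widget("w1"));
        Map<String, Widget> result = demo.getWidgets(Arrays.asList("w1", "missing"));
        System.out.println(result.keySet()); // only the ids that resolved to a widget
    }
}
```

Callers that need to know which ids were missing can compare the requested ids with `result.keySet()`.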
One reason is that it allows functions that work on the list to skip checking every element for null, which can improve performance.
I've got an ArrayList that can be anywhere from 0 to 5000 items long (pretty big objects, too).
At one point I compare it against another ArrayList, to find their intersection. I know this is O(n^2).
Is creating a HashMap alongside this ArrayList, to achieve constant-time lookup, a valid strategy here, in order to reduce the complexity to O(n)? Or is the overhead of another data structure simply not worth it? I believe it would take up no additional space (besides for the references).
(I know, I'm sure 'it depends on what I'm doing', but I'm seriously wondering if there's any drawback that makes it pointless, or if it's actually a common strategy to use. And yes, I'm aware of the quote about prematurely optimizing. I'm just curious from a theoretical standpoint).
First of all, a short side note:
And yes, I'm aware of the quote about prematurely optimizing.
What you are asking about here is not "premature optimization"!
You are not talking about replacing a multiplication with some odd bitwise operations "because they are faster (on a 90's PC, in a C-program)". You are thinking about the right data structure for your application pattern. You are considering the application cases (though you did not tell us many details about them). And you are considering the implications that the choice of a certain data structure will have on the asymptotic running time of your algorithms. This is planning, or maybe engineering, but not "premature optimization".
That being said, and to tell you what you already know: It depends.
To elaborate on this a bit: It depends on the actual operations (methods) that you perform on these collections, how frequently you perform them, how time-critical they are, and how memory-sensitive the application is.
(For 5000 elements, the latter should not be a problem, as only references are stored - see the discussion in the comments)
In general, I'd also be hesitant to really store the Set alongside the List, if they are always supposed to contain the same elements. This wording is intentional: You should always be aware of the differences between both collections. Primarily: A Set can contain each element only once, whereas a List may contain the same element multiple times.
For all hints, recommendations and considerations, this should be kept in mind.
But even if it is given for granted that the lists will always contain elements only once in your case, then you still have to make sure that both collections are maintained properly. If you really just stored them, you could easily cause subtle bugs:
private Set<T> set = new HashSet<T>();
private List<T> list = new ArrayList<T>();
// Fine
void add(T element)
{
set.add(element);
list.add(element);
}
// Fine
void remove(T element)
{
set.remove(element);
list.remove(element); // May be expensive, but ... well
}
// Added later, 100 lines below the other methods:
void removeAll(Collection<T> elements)
{
set.removeAll(elements);
// Ooops - something's missing here...
}
To avoid this, one could even consider creating a dedicated collection class - something like a FastContainsList that combines a Set and a List, and forwards the contains call to the Set. But you'll quickly notice that it will be hard (or maybe impossible) not to violate the contracts of the Collection and List interfaces with such a collection, unless the clause that "You may not add elements twice" becomes part of the contract...
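One way to sidestep the contract problem is to not implement List at all. The following is only a sketch under that assumption: `UniqueSequence` is a hypothetical class name, and it exposes a read-only List view instead of being a List itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical container: deliberately NOT a java.util.List, so it is free to
// reject duplicates without violating the List contract.
final class UniqueSequence<T> {
    private final Set<T> set = new HashSet<>();
    private final List<T> list = new ArrayList<>();

    // Returns false (and changes nothing) if the element is already present.
    boolean add(T element) {
        if (!set.add(element)) {
            return false;
        }
        list.add(element);
        return true;
    }

    boolean remove(T element) {
        if (!set.remove(element)) {
            return false;
        }
        list.remove(element); // linear scan, as in the original remove()
        return true;
    }

    // Both structures are updated in one place, so they cannot drift apart.
    void removeAll(Collection<? extends T> elements) {
        set.removeAll(elements);
        list.removeAll(elements);
    }

    boolean contains(T element) {
        return set.contains(element); // expected constant time
    }

    List<T> asList() {
        return Collections.unmodifiableList(list); // read-only view in insertion order
    }

    public static void main(String[] args) {
        UniqueSequence<String> seq = new UniqueSequence<>();
        seq.add("a");
        seq.add("b");
        seq.add("a"); // ignored: already present
        seq.removeAll(Arrays.asList("b"));
        System.out.println(seq.asList()); // [a]
    }
}
```

Because every mutation goes through this one class, a method added later cannot update the Set and forget the List.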
So again, all this depends on what you want to do with these methods, and which interface you really need. If you don't need the indexed access of List, then it's easy. Otherwise, referring to your example:
At one point I compare it against another ArrayList, to find their intersection. I know this is O(n^2).
You can avoid this by creating the sets locally:
static <T> List<T> computeIntersection(List<T> list0, List<T> list1)
{
Set<T> set0 = new LinkedHashSet<T>(list0);
Set<T> set1 = new LinkedHashSet<T>(list1);
set0.retainAll(set1);
return new ArrayList<T>(set0);
}
This will have an expected running time of O(n), assuming reasonably distributed hash codes. Of course, if you do this frequently, but rarely change the contents of the lists, there may be options to avoid the copies, but for the reason mentioned above, maintaining the required data structures may become tricky.
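If duplicates and the order of the first list matter, a variation with a single temporary set is possible. This is a sketch, not a drop-in replacement: unlike computeIntersection above, it keeps repeated elements of the first list, and it relies on the elements having consistent equals() and hashCode() implementations.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class IntersectionDemo {
    // Keeps the elements of list0 (in order, duplicates included) that also occur in list1.
    static <T> List<T> intersection(List<T> list0, List<T> list1) {
        Set<T> lookup = new HashSet<>(list1); // built once
        List<T> result = new ArrayList<>();
        for (T element : list0) {
            if (lookup.contains(element)) { // expected O(1) per check
                result.add(element);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> a = Arrays.asList(5, 1, 3, 3, 8);
        List<Integer> b = Arrays.asList(3, 5, 9);
        System.out.println(intersection(a, b)); // [5, 3, 3]
    }
}
```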
Consider the following example that prints the maximum element in a List :
List<Integer> list = Arrays.asList(1,4,3,9,7,4,8);
list.stream().max(Comparator.naturalOrder()).ifPresent(System.out::println);
The same objective can also be achieved using the Collections.max method :
System.out.println(Collections.max(list));
The above code is not only shorter but also cleaner to read (in my opinion). There are similar examples that come to mind such as the use of binarySearch vs filter used in conjunction with findAny.
I understand that a Stream can be an infinite pipeline, as opposed to a Collection, which is limited by the memory available to the JVM. This would be my criterion for deciding whether to use a Stream or the Collections API. Are there any other reasons for choosing Stream over the Collections API (such as performance)? More generally, is this the only reason to choose Stream over an older API that can do the job in a cleaner and shorter way?
Stream API is like a Swiss Army knife: it allows you to do quite complex operations by combining the tools effectively. On the other hand if you just need a screwdriver, probably the standalone screwdriver would be more convenient. Stream API includes many things (like distinct, sorted, primitive operations etc.) which otherwise would require you to write several lines and introduce intermediate variables/data structures and boring loops drawing the programmer attention from the actual algorithm. Sometimes using the Stream API can improve the performance even for sequential code. For example, consider some old API:
class Group {
private Map<String, User> users;
public List<User> getUsers() {
return new ArrayList<>(users.values());
}
}
Here we want to return all the users of the group. The API designer decided to return a List. But it can be used outside in various ways:
List<User> users = group.getUsers();
Collections.sort(users);
someOtherMethod(users.toArray(new User[users.size()]));
Here it's sorted and converted to an array to pass to some other method which happens to accept an array. In another place, getUsers() may be used like this:
List<User> users = group.getUsers();
for(User user : users) {
if(user.getAge() < 18) {
throw new IllegalStateException("Underage user in selected group!");
}
}
Here we just want to check whether any user matches some criterion. In both cases, copying to an intermediate ArrayList was actually unnecessary. When we move to Java 8, we can replace the getUsers() method with users():
public Stream<User> users() {
return users.values().stream();
}
And modify the caller code. The first one:
someOtherMethod(group.users().sorted().toArray(User[]::new));
The second one:
if(group.users().anyMatch(user -> user.getAge() < 18)) {
throw new IllegalStateException("Underage user in selected group!");
}
This way it's not only shorter, but may work faster as well, because we skip the intermediate copying.
The other conceptual point of the Stream API is that any stream code written according to the guidelines can be parallelized simply by adding the parallel() step. Of course this will not always boost performance, but it helps more often than I expected. Usually, if the operation takes 0.1ms or longer when executed sequentially, it can benefit from parallelization. In any case, we haven't had such a simple way to do parallel programming in Java before.
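As a small illustration of that point, the sketch below runs the same pipeline sequentially and in parallel. The class and method names are made up for the example; the lambda is stateless, which is what makes adding parallel() safe here.

```java
import java.util.stream.LongStream;

final class ParallelDemo {
    // Sums the squares of 1..n; the only difference between the two modes is parallel().
    static long sumOfSquares(long n, boolean parallel) {
        LongStream range = LongStream.rangeClosed(1, n);
        if (parallel) {
            range = range.parallel();
        }
        return range.map(x -> x * x).sum();
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(1_000, false)); // 333833500
        System.out.println(sumOfSquares(1_000, true));  // same result, computed in parallel
    }
}
```

For a workload this small the parallel version is unlikely to be faster; it only shows that the result is unchanged.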
Of course, it always depends on the circumstances. Take your initial example:
List<Integer> list = Arrays.asList(1,4,3,9,7,4,8);
list.stream().max(Comparator.naturalOrder()).ifPresent(System.out::println);
If you want to do the same thing efficiently, you would use
IntStream.of(1,4,3,9,7,4,8).max().ifPresent(System.out::println);
which doesn’t involve any auto-boxing. But if your assumption is to have a List<Integer> beforehand, that might not be an option, so if you are just interested in the max value, Collections.max might be the simpler choice.
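If the List<Integer> is a given, a middle ground is to unbox once with mapToInt and let IntStream do the rest. A minimal sketch (the class and method names are invented for the example):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.OptionalInt;

final class MaxDemo {
    // Unboxes each element once and lets IntStream find the maximum.
    static OptionalInt maxViaIntStream(List<Integer> values) {
        return values.stream().mapToInt(Integer::intValue).max();
    }

    public static void main(String[] args) {
        List<Integer> list = Arrays.asList(1, 4, 3, 9, 7, 4, 8);
        System.out.println(maxViaIntStream(list).getAsInt()); // 9
        System.out.println(Collections.max(list));            // 9

        // Unlike Collections.max, the stream version does not throw on an empty list.
        System.out.println(maxViaIntStream(Collections.<Integer>emptyList()).isPresent()); // false
    }
}
```

The OptionalInt return type also forces the caller to decide what an empty list means, whereas Collections.max throws NoSuchElementException.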
But this would lead to the question why you have a List<Integer> beforehand. Maybe, it’s the result of old code (or new code written using old thinking), which had no other choice than using boxing and Collections as there was no alternative in the past?
So maybe you should think about the source producing the collection before bothering with how to consume it (or, well, think about both at the same time).
If all you have is a Collection and all you need is a single terminal operation for which a simple Collection-based implementation exists, you may use it directly without bothering with the Stream API. The API designers acknowledged this idea, as they added methods like forEach(…) to the Collection API instead of insisting on everyone using stream().forEach(…). And Collection.forEach(…) is not a simple shorthand for Collection.stream().forEach(…); in fact, it’s already defined on the more abstract Iterable interface, which doesn’t even have a stream() method.
Btw., you should understand the difference between Collections.binarySearch and Stream.filter/findAny. The former requires the collection to be sorted and, if that prerequisite is met, might be the better choice. But if the collection isn’t sorted, a simple linear search is more efficient than sorting just for a single use of binary search, not to mention that binary search works with Lists only, while filter/findAny works with any stream, supporting every kind of source collection.
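The difference in prerequisites can be sketched like this (class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

final class SearchDemo {
    // Linear search: works on any stream source, sorted or not.
    static Optional<Integer> findLinear(List<Integer> values, int target) {
        return values.stream().filter(v -> v == target).findAny();
    }

    // Binary search: the list MUST already be sorted in ascending order,
    // otherwise the result is undefined.
    static boolean containsSorted(List<Integer> sortedValues, int target) {
        return Collections.binarySearch(sortedValues, target) >= 0;
    }

    public static void main(String[] args) {
        List<Integer> unsorted = Arrays.asList(7, 1, 9, 3);
        System.out.println(findLinear(unsorted, 9).isPresent()); // true

        List<Integer> sorted = new ArrayList<>(unsorted);
        Collections.sort(sorted); // only worth paying for if many searches follow
        System.out.println(containsSorted(sorted, 9)); // true
        System.out.println(containsSorted(sorted, 4)); // false
    }
}
```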
By persistent collections I mean collections like those in clojure.
For example, I have a list with the elements (a,b,c).
With a normal list, if I add d, my original list will have (a,b,c,d) as its elements.
With a persistent list, when I call list.add(d), I get back a new list, holding (a,b,c,d).
However, the implementation attempts to share elements between the list wherever possible, so it's much more memory efficient than simply returning a copy of the original list.
It also has the advantage of being immutable (if I hold a reference to the original list, then it will always return the original 3 elements).
This is all explained much better elsewhere (e.g. http://en.wikipedia.org/wiki/Persistent_data_structure).
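To illustrate the sharing idea in its simplest form, here is a hypothetical immutable singly linked list. It is not how Clojure implements its vectors, and it adds at the front rather than at the end, but it shows how a "modified" list can reuse the old one instead of copying it.

```java
import java.util.NoSuchElementException;

// Hypothetical minimal persistent list: every prepend allocates exactly one node
// and shares the entire existing list as its tail.
final class PList<T> {
    private static final PList<Object> EMPTY = new PList<>(null, null, 0);

    private final T head;
    private final PList<T> tail;
    private final int size;

    private PList(T head, PList<T> tail, int size) {
        this.head = head;
        this.tail = tail;
        this.size = size;
    }

    @SuppressWarnings("unchecked")
    static <T> PList<T> empty() {
        return (PList<T>) EMPTY; // safe: the empty list holds no elements
    }

    // O(1): the receiver is reused, not copied, and is never modified.
    PList<T> prepend(T element) {
        return new PList<>(element, this, size + 1);
    }

    T head() {
        if (size == 0) {
            throw new NoSuchElementException("empty list");
        }
        return head;
    }

    PList<T> tail() {
        if (size == 0) {
            throw new NoSuchElementException("empty list");
        }
        return tail;
    }

    int size() {
        return size;
    }

    public static void main(String[] args) {
        PList<String> abc = PList.<String>empty().prepend("c").prepend("b").prepend("a");
        PList<String> dabc = abc.prepend("d");
        System.out.println(abc.size());         // 3: the original is unchanged
        System.out.println(dabc.size());        // 4
        System.out.println(dabc.tail() == abc); // true: the old list is shared, not copied
    }
}
```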
Anyway, my question is... what's the best library for providing this functionality for use in java? Can I use the clojure collections somehow (other that by directly using clojure)?
Just use the ones in Clojure directly. While obviously you might not want to use the language itself, you can still use the persistent collections directly, as they are all just Java classes.
import clojure.lang.PersistentHashMap;
import clojure.lang.IPersistentMap;
IPersistentMap map = PersistentHashMap.create("key1", "value1");
// IPersistentMap exposes lookups through valAt()
assert map.valAt("key1").equals("value1");
// assoc returns a new map; the original is left unchanged
IPersistentMap map2 = map.assoc("key2", "value2");
assert map2 != map;
assert map.valAt("key2") == null;
assert map2.valAt("key2").equals("value2");
(disclaimer: I haven't actually compiled that code :)
The downside is that the collections aren't typed, i.e. they come without generics.
What about pcollections?
You can also check out Clojure's implementation of persistent collections (PersistentHashMap, for instance).
I was looking for a slim, Java "friendly" persistent collection framework and took TotallyLazy and PCollections mentioned in this thread for a testdrive, because they sounded most promising to me.
Both provide reasonably simple interfaces to manipulate persistent lists:
// TotallyLazy
PersistentList<String> original = PersistentList.constructors.empty(String.class);
PersistentList<String> modified = original.append("Mars").append("Raider").delete("Raider");
// PCollections
PVector<String> original = TreePVector.<String>empty();
PVector<String> modified = original.plus("Mars").plus("Raider").minus("Raider");
Both PersistentList and PVector extend java.util.List, so both libraries should integrate well into an existing environment.
It turns out, however, that TotallyLazy runs into performance problems when dealing with larger lists (as already mentioned in a comment above by #levantpied). On my MacBook Pro (Late 2013), inserting 100,000 elements and returning the immutable list took TotallyLazy ~2000ms, whereas PCollections finished within ~120ms.
My (simple) test cases are available on Bitbucket, if someone wants to take a more thorough look.
[UPDATE]: I recently had a look at Cyclops X, which is a high performing and more complete lib targeted for functional programming. Cyclops also contains a module for persistent collections.
https://github.com/andrewoma/dexx is a port of Scala's persistent collections to Java. It includes:
Set, SortedSet, Map, SortedMap and Vector
Adapters to view the persistent collections as java.util equivalents
Helpers for easy construction
Paguro provides type-safe versions of the actual Clojure collections for use in Java 8+. It includes: List (Vector), HashMap, TreeMap, HashSet, and TreeSet. They behave exactly the way you specify in your question and have been painstakingly fit into the existing java.util collections interfaces for maximum type-safe Java compatibility. They are also a little faster than PCollections.
Coding your example in Paguro looks like this:
// List with the elements (a,b,c)
ImList<T> list = vec(a,b,c);
// With a persistent list, when I call list.add(d),
// I get back a new list, holding (a,b,c,d)
ImList<T> newList = list.append(d);
list.size(); // still returns 3
newList.size(); // returns 4
You said,
The implementation attempts to share elements between the list
wherever possible, so it's much more memory efficient and fast than
simply returning a copy of the original list. It also has the
advantage of being immutable (if I hold a reference to the original
list, then it will always return the original 3 elements).
Yes, that's exactly how it behaves. Daniel Spiewak explains the speed and efficiency of these collections much better than I could.
You may want to check out clj-ds. I haven't used it, but it seems promising. According to the project's README, it extracts the data structures out of Clojure 1.2.0.
Functional Java implements a persistent List, lazy List, Set, Map, and Tree. There may be others, but I'm just going by the information on the front page of the site.
I am also interested to know what the best persistent data structure library for Java is. My attention was directed to Functional Java because it is mentioned in the book, Functional Programming for Java Developers.
There's pcollections (Persistent Collections) library you can use:
http://code.google.com/p/pcollections/
The top-voted answer suggests using the Clojure collections directly, which I think is a very good idea. Unfortunately, the fact that Clojure is a dynamically typed language and Java is not makes the Clojure libraries very uncomfortable to use in Java.
Because of this, and the lack of light-weight, easy-to-use wrappers for the Clojure collection types, I have written my own library of Java wrappers using generics for the Clojure collection types, with a focus on ease of use and clarity when it comes to interfaces.
https://github.com/cornim/ClojureCollections
Maybe this will be of use to somebody.
P.S.: At the moment only PersistentVector, PersistentMap and PersistentList have been implemented.
In the same vein as Cornelius Mund, Pure4J ports the Clojure collections into Java and adds Generics support.
However, Pure4J is aimed at introducing pure programming semantics to the JVM through compile time code checking, so it goes further to introduce immutability constraints to your classes, so that the elements of the collection cannot be mutated while the collection exists.
This may or may not be what you want to achieve: if you are just after using the Clojure collections on the JVM I would go with Cornelius' approach, otherwise, if you are interested in pursuing a pure programming approach within Java then you could give Pure4J a try.
Disclosure: I am the developer of this
totallylazy is a very good FP library which has implementations of:
PersistentList<T>: the concrete implementations are LinkedList<T> and TreeList<T> (for random access)
PersistentMap<K, V>: the concrete implementations are HashTreeMap<K, V> and ListMap<K, V>
PersistentSortedMap<K, V>
PersistentSet<T>: the concrete implementation is TreeSet<T>
Example of usage:
import static com.googlecode.totallylazy.collections.PersistentList.constructors.*;
import com.googlecode.totallylazy.collections.PersistentList;
import com.googlecode.totallylazy.numbers.Numbers;
...
PersistentList<Integer> list = list(1, 2, 3);
// Create a new list with 0 prepended
list = list.cons(0);
// Prints 0::1::2::3
System.out.println(list);
// Do some actions on this list (e.g. remove all even numbers)
list = list.filter(Numbers.odd);
// Prints 1::3
System.out.println(list);
totallylazy is constantly being maintained. The main disadvantage is the total absence of Javadoc.
I'm surprised nobody mentioned vavr. I have been using it for a long time now.
http://www.vavr.io
Description from their site:
Vavr core is a functional library for Java. It helps to reduce the amount of code and to increase the robustness. A first step towards functional programming is to start thinking in immutable values. Vavr provides immutable collections and the necessary functions and control structures to operate on these values. The results are beautiful and just work.
https://github.com/arnohaase/a-foundation is another port of Scala's libraries.
It is also available from Maven Central: com.ajjpj.a-foundation:a-foundation
I want to have an object that allows other objects of a specific type to register themselves with it. Ideally it would store the references to them in some sort of set collection and have .equals() compare by reference rather than value. It shouldn't have to maintain a sort at all times, but it should be able to be sorted before the collection is iterated over.
Looking through the Java Collection Library, I've seen the various features I'm looking for on different collection types, but I am not sure about how I should go about using them to build the kind of collection I'm looking for.
This is Java in the context of Android if that is significant.
Java's built-in tree-based collections won't work.
To illustrate, consider a tree containing weak references to nodes 'B', 'C', and 'D':
C
B D
Now let the weak reference 'C' get collected, leaving null behind:
-
B D
Now insert an element into the tree. The TreeMap/TreeSet doesn't have sufficient information to select the left or right subtree. If your comparator says null is a small value, then it will be incorrect when inserting 'A'. If it says null is a large value, it will be incorrect when inserting 'E'.
Sort on demand is a good choice.
A more robust solution is to use an ArrayList<WeakReference<T>> and to implement a Comparator<WeakReference<T>> that delegates to a Comparator<T>. Then call Collections.sort() prior to iteration.
Android's Collections.sort uses TimSort behind-the-scenes and so it runs quite efficiently if the input is already partially sorted.
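A sketch of the sort-on-demand idea follows, with one variation that should be named plainly: instead of sorting the WeakReference objects through a delegating comparator, it first copies the surviving referents into strong references, so nothing can be collected while the sort is running. `WeakRegistry` and its method names are hypothetical.

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;

// Hypothetical registry: holds its members weakly and sorts only when asked.
final class WeakRegistry<T> {
    private final List<WeakReference<T>> refs = new ArrayList<>();
    private final Comparator<? super T> order;

    WeakRegistry(Comparator<? super T> order) {
        this.order = order;
    }

    void register(T item) {
        refs.add(new WeakReference<>(item));
    }

    // Returns strong references to the surviving items, sorted on demand.
    List<T> sortedSnapshot() {
        List<T> live = new ArrayList<>();
        for (Iterator<WeakReference<T>> it = refs.iterator(); it.hasNext(); ) {
            T item = it.next().get();
            if (item == null) {
                it.remove(); // the referent was garbage collected: drop the stale entry
            } else {
                live.add(item);
            }
        }
        Collections.sort(live, order);
        return live;
    }

    public static void main(String[] args) {
        WeakRegistry<String> registry = new WeakRegistry<>(Comparator.<String>naturalOrder());
        registry.register("c");
        registry.register("a");
        registry.register("b");
        System.out.println(registry.sortedSnapshot()); // string literals stay reachable: [a, b, c]
    }
}
```

Note that this sketch does not prevent the same object from being registered twice, and it is not thread-safe.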
Perhaps the collections classes are a level of abstraction below what you're looking for? It sounds like the end product you want is a cache with the ability to iterate in a user-defined sort order. If so, perhaps the cache interface in the Google Guava library is close enough to what you want:
http://code.google.com/p/guava-libraries/source/browse/trunk/guava/src/com/google/common/cache/Cache.java
At a glance, it looks like CacheBuilder in that package doesn't allow you to build an implementation with user-defined iteration order. However, it does provide a Map view that might be good enough for your needs:
List<Thing> cachedThings = Lists.newArrayList(cache.asMap().values());
Collections.sort(cachedThings, YOUR_THING_COMPARATOR);
for (Thing thing : cachedThings) { ... }
Even if this isn't exactly what you want, the classes in that package might give you some useful insights re: using References with Collections.
DISCLAIMER: This was a comment but it got kinda big, sorry if it doesn't solve your problem:
References in Java
Just to clarify what I mean when I say reference, since it isn't really a term commonly used in Java: Java does not really use references or pointers. It uses a kind of pseudo-reference that can be (and is by default) assigned to the special null instance. That's one way to explain it anyway. In Java, these pseudo-references are the only way that an Object can be handled. When I say reference, I mean these pseudo-references.
Sets
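No Set implementation allows the same object to be included twice: a reference is always equal to itself, and accepting it again would violate the mathematical concept of a set. Java Sets simply ignore any attempt to add a duplicate. Note, though, that the standard implementations (HashSet, TreeSet) detect duplicates with equals() or compareTo() rather than by reference identity, so two distinct but equal objects also count as duplicates.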
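For the compare-by-reference behaviour the question asks for, the JDK can build an identity-based Set from an IdentityHashMap. A minimal sketch (the class and helper names are invented for the example):

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.IdentityHashMap;
import java.util.Set;

final class IdentitySetDemo {
    // A Set that compares elements with == instead of equals().
    static <T> Set<T> newIdentitySet() {
        return Collections.newSetFromMap(new IdentityHashMap<T, Boolean>());
    }

    public static void main(String[] args) {
        String first = new String("listener");
        String second = new String("listener"); // equal by value, but a distinct object

        Set<String> byValue = new HashSet<>();
        byValue.add(first);
        byValue.add(second);
        System.out.println(byValue.size()); // 1

        Set<String> byReference = newIdentitySet();
        byReference.add(first);
        byReference.add(second);
        byReference.add(first); // same reference: ignored
        System.out.println(byReference.size()); // 2
    }
}
```

Such a set holds strong references and has no defined iteration order, so it would still need to be copied into a list and sorted before iterating.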
You mention a Map in your comment though... Could you clarify what kind of collection you are after? And why you need that kind of equality checking within it? Are you thinking in C++ terms? I'll try to edit my answer to be more helpful then :)
EDIT: I thought that might have been your goal ;) So a TreeSet should do the trick then! I would not get concerned about performance until there is a performance issue. Simplicity is fantastic for readability, maintenance and preventing bugs. If performance does become a problem, ideally you should profile your code and only optimize the areas that are proven to be the problem.
Does anyone know of any resources or books I can read to help me understand the various Java collection classes?
For example: When would one use a Collection<T> instead of a List<T>,
and when would you use a Map<T, V> instead of a List<V>, where V has a member getId as shown below, allowing you to search the list for the element matching a given key:
class V {
T getId();
}
thanks
You use a Map if you want a map-like behaviour. You use a List if you want a list-like behaviour. You use a Collection if you don't care.
Generics have nothing to do with this.
See the Collections tutorial.
You can take a look at sun tutorial. It explains everything in detail.
http://java.sun.com/docs/books/tutorial/collections/index.html (Implementation section explain the difference between them)
This book is very good and covers both the collection framework and generics.
You can check the documentation of the java collection API.
Anyway a basic rule is : be as generic as possible for the type of your parameters. Be as generic as possible for the return type of your interfaces. Be as specific as possible for the return type of your final class.
A good place to start would be the Java API. Each collection has a good description associated with it. After that, you can search for any variety of articles and books on Java Collections on Google.
The decision depends on your data and your needs to use the data.
You should use a Map if you have data where you can identify each element with a specific key and want to access or find it by this key.
You take a List if you don't have a key but you're interested in the order of the elements, like a bunch of Strings you want to store in the order the user entered them.
You take a Set if you don't want to store the same element twice.
Also relevant to your decision is whether you're working in a multithreaded environment. If many threads are accessing the same list at the same time, you would rather take a Vector instead of an ArrayList.
Btw., for some collections it is useful if your data class implements an interface like Comparable or at least overrides the equals method.
here you will find more information.
Most Java books will have a good explanation of the Collections Framework. I find that Object-Oriented-Software-Development-Using has a good chapter that explains the reasons why one Collection is selected over another.
Head First Java also has a good introduction but may not tackle the problem of which to select.
The answer to your question is how are you going to be using the data structure? And to get a better idea of the possibilities, it is good to look at the whole collections interfaces hierarchy. For simplicity sake, I am restricting this discussion only to the classic interfaces, and am ignoring all of the concurrent interfaces.
Collection
+- List
+- Set
+- SortedSet
Map
+- SortedMap
So, we can see from the above, a Map and a Collection are different things.
A Collection is somewhat analogous to a bag, it contains a number of values, but makes no further guarantees about them. A list is simply an ordered set of values, where the order is defined externally, not implicitly from the values themselves. A Set on the other hand is a group of values, no two of which are the same, however they are not ordered, neither explicitly, nor implicitly. A SortedSet is a set of unique values that are implicitly sorted, that is, if you iterate over the values, then they will always be returned in the same order.
A Map is mapping from a Set of keys to values. A SortedMap is a mapping from a SortedSet of keys to values.
As to why you would want to use a Map instead of a List: this depends largely on how you need to look up your data. If you need to do (effectively) random lookups using a key, then you should be using a Map, since its implementations give you either O(1) or O(log n) lookups. Searching through a List is O(n). If, however, you are performing some kind of "batch" process, that is, you are processing each and every item, then a List, or a Set if you need the uniqueness constraint, is more appropriate.
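The lookup difference can be sketched as follows. `Item` is a hypothetical stand-in for the question's class V with its getId member; the other names are invented as well.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class LookupDemo {
    // Hypothetical element type with an id, standing in for the question's class V.
    static final class Item {
        private final int id;
        private final String name;

        Item(int id, String name) {
            this.id = id;
            this.name = name;
        }

        int getId() {
            return id;
        }

        String getName() {
            return name;
        }
    }

    // O(n): scans the list until an id matches; returns null if none does.
    static Item findInList(List<Item> items, int id) {
        for (Item item : items) {
            if (item.getId() == id) {
                return item;
            }
        }
        return null;
    }

    // Builds the index once; every later lookup by id is expected O(1).
    static Map<Integer, Item> indexById(List<Item> items) {
        Map<Integer, Item> index = new HashMap<>();
        for (Item item : items) {
            index.put(item.getId(), item);
        }
        return index;
    }

    public static void main(String[] args) {
        List<Item> items = Arrays.asList(new Item(1, "one"), new Item(2, "two"));
        System.out.println(findInList(items, 2).getName());   // two
        System.out.println(indexById(items).get(2).getName()); // two
    }
}
```

Building the index costs one pass over the list, so it pays off only when more than a handful of lookups follow.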
The other answers already covered an overview of what the collections are, so I'd add one rule of thumb that applies to how you might use collections in your programming:
Be strict in what you send, but generous in what you receive
This is a little controversial (some engineers believe that you should always be as strict as possible) but it's a rule of thumb that, when applied to collections, teaches us to pick the collection that limits your users the least when taking arguments but gives as much information as possible when returning results.
In other words a method signature like:
LinkedList< A > doSomething(Collection< A > col);
Might be preferred to:
Collection< A > doSomething(LinkedList< A > list);
In version 1, your users don't have to massage their data to use your method. They can pass you an ArrayList< A >, a LinkedHashSet< A > or any other Collection< A >, and you will deal with it. On receiving the data from the method, they have a lot more information about what they can do with it (list-specific iterators, for example) than they would in version 2.
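A small sketch of the rule in practice. The method is invented for illustration, and it returns the List interface rather than LinkedList, which is a slightly weaker promise than the signature above but still gives callers ordering and indexed access.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;

final class SignatureDemo {
    // Generous in what it receives (any Collection), specific in what it returns (a List).
    static List<String> upperCased(Collection<String> input) {
        List<String> result = new ArrayList<>(input.size());
        for (String s : input) {
            result.add(s.toUpperCase(Locale.ROOT));
        }
        return result;
    }

    public static void main(String[] args) {
        // Callers can pass a List or a Set without converting first.
        System.out.println(upperCased(Arrays.asList("a", "b")));                   // [A, B]
        System.out.println(upperCased(new LinkedHashSet<>(Arrays.asList("x"))));   // [X]
    }
}
```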