The javadoc of Spliterator (which, if I understand things correctly, is what really backs a Stream) defines many characteristics that make obvious sense, such as SIZED, CONCURRENT, IMMUTABLE etc.
But it also defines NONNULL; why?
I'd have thought that it would be the user's responsibility to ensure that, and that if, for instance, a developer tried to .sort() a non-SORTED stream containing null elements, he/she would rightfully be greeted with an NPE...
But then this characteristic exists. Why? The javadoc of Spliterator itself doesn't mention any real usage of it, and neither does the package-info.java of the java.util.stream package...
From the documentation of Spliterator:
A Spliterator also reports a set of characteristics() of its structure, source, and elements from among ORDERED, DISTINCT, SORTED, SIZED, NONNULL, IMMUTABLE, CONCURRENT, and SUBSIZED. These may be employed by Spliterator clients to control, specialize or simplify computation.
Note that it does not mention the prevention of NullPointerExceptions. If you sort a Stream which might contain null values it is your responsibility to provide a Comparator which can handle nulls.
The second sentence also makes it clear that using these flags is only an option, not a requirement for “Spliterator clients”, which is not limited to usage by Streams.
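As an aside, the null-tolerant sorting mentioned above is straightforward; here is a minimal, self-contained example (the sample values are made up):

import java.util.Comparator;
import java.util.stream.Stream;

public class NullTolerantSort {
    public static void main(String[] args) {
        // A null-handling Comparator keeps sorted() from throwing an NPE.
        Stream.of("banana", null, "apple")
              .sorted(Comparator.nullsFirst(Comparator.naturalOrder()))
              .forEach(System.out::println); // prints: null, apple, banana
    }
}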
So regardless of whether it is used by the current implementation of the Stream API, are there possibilities to take advantage of the knowledge about a NONNULL characteristic?
I think so. An implementation may branch to specialized code for a non-null Spliterator and use null to represent certain states, e.g. absent values or the initial value before processing the first element. In fact, the actual implementation code for dealing with Streams which may contain null is complicated. But of course, you always have to weigh up whether the simplification of one case justifies the code duplication.
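As a hedged illustration of such a branch (reduceOrNull is a made-up helper, not JDK code): when the source reports NONNULL, a reduction can skip Optional-style boxing and use null itself as the "nothing seen yet" marker.

import java.util.Spliterator;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.BinaryOperator;

public class NonNullReduce {
    // Only valid when the spliterator reports NONNULL: null then
    // unambiguously means "no element consumed yet" / "source was empty".
    static <T> T reduceOrNull(Spliterator<T> sp, BinaryOperator<T> op) {
        if (!sp.hasCharacteristics(Spliterator.NONNULL))
            throw new IllegalArgumentException("source may contain nulls");
        AtomicReference<T> acc = new AtomicReference<>(); // starts at null
        sp.forEachRemaining(t ->
            acc.set(acc.get() == null ? t : op.apply(acc.get(), t)));
        return acc.get();
    }
}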
But sometimes the simplification is as simple as this: knowing that there are no null values means you can internally use one of the Concurrent… collections, which don't allow nulls.
I found the following comments in the code for the enum StreamOpFlag.
// The following Spliterator characteristics are not currently used but a
// gap in the bit set is deliberately retained to enable corresponding
// stream flags if/when required without modification to other flag values.
//
// 4, 0x00000100 NONNULL(4, ...
// 5, 0x00000400 IMMUTABLE(5, ...
// 6, 0x00001000 CONCURRENT(6, ...
// 7, 0x00004000 SUBSIZED(7, ...
Related
I'm trying to understand how Spliterator works, and how spliterators are designed. I recognize that trySplit() is likely one of the more important methods of Spliterator, but when I see some third-party Spliterator implementations, sometimes I see that their spliterators return null for trySplit() unconditionally.
The questions:
Is there a difference between an ordinary iterator and a Spliterator that returns null unconditionally? It seems like such a spliterator defeats the point of, well, splitting.
Of course, there are legitimate use cases of spliterators that conditionally return null on trySplit(), but is there a legitimate use case of a spliterator that unconditionally returns null?
While the main advantage of Spliterator over Iterator is, as you said, its trySplit() method which allows it to be parallelized, there are other significant advantages:
http://docs.oracle.com/javase/8/docs/api/java/util/Spliterator.html
The Spliterator API was designed to support efficient parallel traversal in addition to sequential traversal, by supporting decomposition as well as single-element iteration. In addition, the protocol for accessing elements via a Spliterator is designed to impose smaller per-element overhead than Iterator, and to avoid the inherent race involved in having separate methods for hasNext() and next().
Furthermore, Spliterators can be directly converted to Streams using StreamSupport.stream to make use of Java 8's streams.
One of the purposes of a Spliterator is to be able to split, but that's not the only purpose. The other main purpose is as a support class for creating your own Stream source. One way to create a Stream source is to implement your own Spliterator and pass it to StreamSupport.stream. The simplest thing to do is often to write a Spliterator that can't split. Doing so forces the stream to execute sequentially, but that might be acceptable for whatever you're trying to do.
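For instance, a minimal sketch of such a sequential-only source (CountingSpliterator is a made-up class, not part of the JDK):

import java.util.Spliterator;
import java.util.function.Consumer;
import java.util.stream.StreamSupport;

class CountingSpliterator implements Spliterator<Integer> {
    private int next;
    private final int end;

    CountingSpliterator(int start, int end) { this.next = start; this.end = end; }

    @Override public boolean tryAdvance(Consumer<? super Integer> action) {
        if (next >= end) return false;
        action.accept(next++);
        return true;
    }

    @Override public Spliterator<Integer> trySplit() { return null; } // never splits

    @Override public long estimateSize() { return end - next; }

    @Override public int characteristics() {
        return ORDERED | SIZED | NONNULL | IMMUTABLE;
    }

    public static void main(String[] args) {
        // The stream runs sequentially even if .parallel() is requested later.
        StreamSupport.stream(new CountingSpliterator(0, 5), false)
                     .forEach(System.out::println); // 0 1 2 3 4
    }
}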
There are other cases where writing a non-splittable Spliterator makes sense. For example, in OpenJDK, there are implementations such as EmptySpliterator that contain no elements. Of course it can't be split. A similar case is a singleton spliterator that contains exactly one element. It can't be split either. Both implementations return null unconditionally from trySplit.
Another case is where writing a non-splittable Spliterator is easy and effective, and the amount of code necessary to implement a splittable one is prohibitive. (At least, not worth the effort of writing one into a Stack Overflow answer.) For example, see the example Spliterator from this answer. The case here is that the Spliterator implementation wants to wrap another Spliterator and do something special, in this case check to see if it's not empty. Otherwise it just delegates everything to the wrapped Spliterator. Doing this with a non-splittable Spliterator is pretty easy.
Notice that there's discussion in that answer, the comment on that answer, in my answer to the same question, and the comment thread on my answer, about how one would make a splittable (i.e., parallel-ready) Spliterator. But nobody actually wrote out the code to do the splitting. :-) Depending upon how much laziness you want to preserve from the original stream, and how much parallel efficiency you want, writing a splittable Spliterator can get pretty complicated.
In my estimation it's somewhat easier to do this sort of stuff by writing an Iterator instead of a Spliterator (as in my answer noted above). It turns out that Spliterators.spliteratorUnknownSize can provide a limited amount of parallelism, even from an Iterator, which is apparently a purely sequential construct. It does so within IteratorSpliterator, which pulls multiple elements from the Iterator and processes them in batches. Unfortunately the batch size is hardcoded, but at least this gives the opportunity for processing elements pulled from an Iterator in parallel in certain cases.
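A small sketch of that wrapping, using only the JDK classes named above (the data is made up):

import java.util.Arrays;
import java.util.Iterator;
import java.util.Spliterators;
import java.util.stream.StreamSupport;

public class IteratorToParallelStream {
    public static void main(String[] args) {
        Iterator<String> it = Arrays.asList("a", "b", "c", "d").iterator();
        long count = StreamSupport.stream(
                Spliterators.spliteratorUnknownSize(it, 0), // no characteristics known
                true)                                       // request a parallel stream
            .filter(s -> s.compareTo("b") >= 0)
            .count();
        System.out.println(count); // 3
    }
}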
There are more advantages than just splitting support:
The iteration logic is contained in a single tryAdvance method rather than being spread over two methods like hasNext and next. Splitting the logic over two methods complicates a lot of Iterator implementations, as it often implies that the hasNext method has to perform an actual query attempt that might yield a value, which then has to be remembered for the follow-up next call. And the fact that this query has been made must be remembered as well, either explicitly or implicitly.
It would be easier if there were a guarantee that hasNext/next are always called in the typical alternating fashion; however, there is no such guarantee.
One example is BufferedReader.readLine(), which has a simple tryAdvance-style logic. A wrapping Iterator has to call that method within the hasNext implementation and remember the line for the next call. (Ironically, the current BufferedReader.lines() implementation does implement such a complicated Iterator, which then gets wrapped into a Spliterator, instead of implementing the much simpler Spliterator directly; see the sketch after this list. It seems the “I’m not familiar with that” problem should not be underestimated.)
estimateSize(): a Spliterator may return an estimate (or even an exact number) of the remaining items, which can be used to pre-allocate resources. This can improve efficiency.
characteristics(): Spliterators can provide additional information about their content or behavior. Besides telling whether the estimated size is exact, you can learn whether you may see null values, whether there is a defined encounter order, or whether all values are distinct. A particular algorithm may take advantage of this. The Stream API is clearly built from such algorithms, so when planning to create (or support the creation of) streams and you have a choice, implementing a Spliterator that reports as much meta-information as possible is preferable to implementing an Iterator that will be wrapped later.
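To illustrate the tryAdvance point from the list above, here is a hedged sketch of a directly implemented line Spliterator (LineSpliterator is a made-up name; the JDK's actual implementation differs):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;

// One readLine() call per element; no look-ahead state to remember,
// unlike a hasNext()/next() pair wrapping the same reader.
class LineSpliterator extends Spliterators.AbstractSpliterator<String> {
    private final BufferedReader reader;

    LineSpliterator(BufferedReader reader) {
        super(Long.MAX_VALUE, Spliterator.ORDERED | Spliterator.NONNULL);
        this.reader = reader;
    }

    @Override public boolean tryAdvance(Consumer<? super String> action) {
        try {
            String line = reader.readLine();
            if (line == null) return false; // end of input
            action.accept(line);
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}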
Why doesn't java.util.Collection implement the new Stream interface?
This is a question about API design. When extension methods were added in C#, IEnumerable got all the methods that enabled using lambda expressions directly on all collections.
With the advent of lambdas and default methods in Java, I would expect that Collection would implement Stream and provide default implementations for all its methods. This way, we would not need to call stream() in order to leverage the power it provides.
What is the reason the library architects opted for the less convenient approach?
From Maurice Naftalin's Lambda FAQ:
Why are Stream operations not defined directly on Collection?
Early drafts of the API exposed methods like filter, map, and reduce on Collection or Iterable. However, user experience with this design led to a more formal separation of the “stream” methods into their own abstraction. Reasons included:
Methods on Collection such as removeAll make in-place modifications, in contrast to the new methods which are more functional in nature. Mixing two different kinds of methods on the same abstraction forces the user to keep track of which are which. For example, given the declaration
Collection<String> strings;
the two very similar-looking method calls
strings.removeAll(s -> s.length() == 0);
strings.filter(s -> s.length() == 0); // not supported in the current API
would have surprisingly different results; the first would remove all empty String objects from the collection, whereas the second would return a stream containing all the empty Strings, while having no effect on the collection.
Instead, the current design ensures that only an explicitly-obtained stream can be filtered:
strings.stream().filter(s -> s.length() == 0)...;
where the ellipsis represents further stream operations, ending with a terminating operation. This gives the reader a much clearer intuition about the action of filter;
With lazy methods added to Collection, users were confused by a perceived—but erroneous—need to reason about whether the collection was in “lazy mode” or “eager mode”. Rather than burdening Collection with new and different functionality, it is cleaner to provide a Stream view with the new functionality;
The more methods added to Collection, the greater the chance of name collisions with existing third-party implementations. By only adding a few methods (stream, parallel) the chance for conflict is greatly reduced;
A view transformation is still needed to access a parallel view; the asymmetry between the sequential and the parallel stream views was unnatural. Compare, for example
coll.filter(...).map(...).reduce(...);
with
coll.parallel().filter(...).map(...).reduce(...);
This asymmetry would be particularly obvious in the API documentation, where Collection would have many new methods to produce sequential streams, but only one to produce parallel streams, which would then have all the same methods as Collection. Factoring these into a separate interface, StreamOps say, would not help; that would still, counterintuitively, need to be implemented by both Stream and Collection;
A uniform treatment of views also leaves room for other additional views in the future.
A Collection is an object model
A Stream is a subject model
Collection definition in the docs:
A collection represents a group of objects, known as its elements.
Stream definition in the docs:
A sequence of elements supporting sequential and parallel aggregate operations
Seen this way, a stream is a specific collection, not the other way around. Thus Collection should not implement Stream, regardless of backward compatibility.
So why doesn't Stream<T> implement Collection<T>? Because it is another way of looking at a bunch of objects: not as a group of elements, but by the operations you can perform on it. That is why I say a Collection is an object model while a Stream is a subject model.
First, from the documentation of Stream:
Collections and streams, while bearing some superficial similarities, have different goals. Collections are primarily concerned with the efficient management of, and access to, their elements. By contrast, streams do not provide a means to directly access or manipulate their elements, and are instead concerned with declaratively describing their source and the computational operations which will be performed in aggregate on that source.
So you want to keep the concepts of stream and collection apart. If Collection implemented Stream, every collection would be a stream, which it conceptually is not. The way it is done now, every collection can give you a stream which works on that collection, which is something different if you think about it.
Another factor that comes to mind is cohesion/coupling as well as encapsulation. If every class that implemented Collection also had to implement the operations of Stream, it would have two (kinds of) different purposes and might become too long.
My guess would be that it was made that way to avoid breakage with existing code that implements Collection. It would be hard to provide a default implementation that worked correctly with all existing implementations.
So, I know that coding to an interface (using an interface as a variable's declared type instead of its concrete type) is a good practice in OO code, for a bunch of reasons. This is seen a lot, for example, with Java collections. Well, is referring to an interface in your program still a good thing to do when only certain implementations of that interface provide correct behavior?
For example, I have a Java program. In that program, I have multiple sets of objects. I chose to use a Set, because I didn't want duplicate elements. However, I wanted a list's ordering property (i.e. maintain insertion order). Therefore, I am using a LinkedHashSet as the concrete Set type. One thing these sets are used for is computing a dot product involving the primitive fields of the objects contained in the sets, such as in (simplifying a bit):
double dot(LinkedHashSet<E> set, double[] array) {
    double sum = 0.0;
    int i = 0;
    for (E element : set) {
        sum += element.getValue() * array[i++]; // i advances in step with iteration order
    }
    return sum;
}
This method's result is dependent on the set's iteration order, and so certain Set implementations, mainly HashSet, will give incorrect/unexpected results. Currently, I am using LinkedHashSet throughout my program as the declared type, instead of Set, to ensure correct behavior. However, that feels bad stylistically. What's the right thing to do here? Is it okay to use the concrete type in this case? Or maybe should I use Set as the type, but then state in the documentation which implementations will/won't produce correct behavior? I'm looking more for general input than anything specific to the scenario above. In particular, this should apply to really any scenario where you're using the ordering properties of a LinkedHashSet or TreeSet. How do you prevent unintended implementations from being used? Do you force it in the code (by ditching the interface), or do you specify it in the documentation? Or perhaps some other approach?
It is true that you should code to interfaces, but only if the assurances they make fit your needs. In your case, if you would only use Set then you are saying: I don't want duplicates, but I don't care about the order. You could also use a List and mean: I care about insertion order, but not about duplicates. There even is a SortedSet but it does not have the ordering you want. So in your case you can't replace LinkedHashSet by one of its interfaces without violating the Liskov substitution principle.
So I would argue that in your case you should stick to the implementation until you really need to switch to another one. With modern IDEs refactoring is not that hard anymore, so I would refrain from any premature optimization -- YAGNI and KISS.
Very good question. One solution: make another interface! Say, one that extends SortedMap but has a getInsertionOrderIterator() method, or an interface that extends Map and has both getOrderIterator() and getInsertionOrderIterator() methods.
You can write a quick adapter class that contains a LinkedHashMap & TreeMap as the backend data structures.
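A rough sketch of such an adapter (DualOrderMap and its method names are hypothetical, not a standard API):

import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.TreeMap;

// Two backing maps kept in sync: one remembers insertion order,
// the other maintains sorted order.
class DualOrderMap<K extends Comparable<K>, V> {
    private final LinkedHashMap<K, V> byInsertion = new LinkedHashMap<>();
    private final TreeMap<K, V> bySort = new TreeMap<>();

    void put(K key, V value) {
        byInsertion.put(key, value);
        bySort.put(key, value);
    }

    V get(K key) { return bySort.get(key); }

    Iterator<K> getInsertionOrderIterator() { return byInsertion.keySet().iterator(); }

    Iterator<K> getOrderIterator() { return bySort.keySet().iterator(); }
}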
You can make arguments for either way. As long as you and others maintaining this code know that particular implementations of Set might break the rest of the app or library, then coding to the interface is fine. However, if that is not true, then you should use the specific implementation.
The purpose of coding to an interface is to give you flexibility that will not break your app. Take JDBC for instance. If you use the wrong driver it will break your program, similar to what you are describing here. However, if, say, Oracle decided to put behavior in their JDBC driver that subtly broke code written to the JDBC spec rather than to the specific Oracle driver, then you'd have to choose.
There is no cut-and-dried, "this is always right" type of answer.
How come Java provides several different implementations of the Set type, including HashSet and TreeSet and not ArraySet?
A set based solely on an array of elements in no particular order would always have O(n) time for a containment check. It wouldn't be terribly useful, IMO. When would you want to use that instead of HashSet or TreeSet?
The most useful aspect of an array is that you can get to an element with a particular index extremely quickly. That's not terribly relevant when it comes to sets.
There is CopyOnWriteArraySet which is a set backed by an array.
This is not particularly useful as its performance is not great for large collections.
Android has android.util.ArraySet (introduced in API level 23) and android.util.ArrayMap (introduced in API level 19).
Actually, an array-based concrete implementation of Set would not make much sense: any set just stores elements and guarantees their uniqueness.
I cannot be sure, but it sounds like you want a Set implementation that preserves the insertion order of elements. If I am right, use LinkedHashSet.
Java provides multiple implementations of its collection interfaces to allow for the best performance. ArrayList performs well on many List operations.
For Set operations, which always require uniqueness, different implementations offer better performance. If a Set were implemented using an array, any modification operation would have to run through all the array elements to check whether the element is already in the set. HashSet and TreeSet simplify this check greatly.
The Set interface has no get-by-index method, such as List.get(int), so there's no use suggesting Set can have array-like properties.
Ultimately, many "grouping" classes use arrays under the hood to store their elements, but that doesn't mean they have to expose methods for accessing the array.
You can always implement it yourself. Granted, there is probably only one extremely limited case where it would be useful (and even there you could use better data structures anyway): a very large set that almost never changes. An array-backed set would take up slightly less memory (no extra pointers) and give ever so slightly faster enumeration of the whole set. If you keep the array sorted, you can still get O(lg n) search time.
However, those differences are purely academic. In the real world you would never really want such a beast.
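For completeness, here is a toy sketch of that academic case (FrozenArraySet is a made-up name, not a java.util class): immutable, memory-lean, with O(lg n) containment via binary search.

import java.util.Arrays;
import java.util.Collection;
import java.util.TreeSet;

final class FrozenArraySet<E extends Comparable<E>> {
    private final Object[] elements; // sorted, duplicate-free, no per-node pointers

    FrozenArraySet(Collection<? extends E> source) {
        this.elements = new TreeSet<E>(source).toArray(); // sort + deduplicate once
    }

    boolean contains(E e) {
        return Arrays.binarySearch(elements, e) >= 0; // O(lg n)
    }

    int size() { return elements.length; }
}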
Consider indexed-tree-map; you will be able to access elements by index and get the index of elements while keeping the sort order. Duplicates can be put into arrays as values under the same key.
I am using a ConcurrentSkipListSet that is, obviously, accessed through multiple threads. Now, the values used by the compareTo method of the underlying objects change over time. Because of this, I want to 'update' the ordering of the list (by re-sorting it, or something similar).
However, java.util.Collections.sort(list) doesn't work, and just rebuilding the list is probably too slow (and would mess up the whole concurrency-proofness). Is there any other solution I should look at?
It does not have to lead to an optimal sort (which is near impossible with concurrency and changing values anyway). Near-optimal would suffice, as long as any remove/add calls remain thread-safe (this would be a real issue when rebuilding the list while sorting).
Every time you edit an item in a way that may change its sort order, you have to remove it from the set, then change the key, and then re-insert it.
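A minimal sketch of that pattern (Task and its priority field are hypothetical stand-ins for your element type):

import java.util.concurrent.ConcurrentSkipListSet;

class Task implements Comparable<Task> {
    final String name;
    volatile int priority;

    Task(String name, int priority) { this.name = name; this.priority = priority; }

    @Override public int compareTo(Task o) {
        int c = Integer.compare(priority, o.priority);
        return c != 0 ? c : name.compareTo(o.name); // tie-break keeps ordering consistent
    }

    static void reschedule(ConcurrentSkipListSet<Task> queue, Task t, int newPriority) {
        if (queue.remove(t)) {        // remove while it is still locatable under the old key
            t.priority = newPriority; // mutate the field compareTo uses
            queue.add(t);             // re-insert under the new ordering
        }
    }
}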
Dr Cliff Click at Azul Systems has a very nice presentation of how they do lock-free hash-tables using tombstones and such. If you go towards writing your own skip-list/tree to make the reordering of an item into a single - and hopefully faster - op, then you might also go this lock-free route too. And be sure to share your results :)
These types of collections in the Java API do not support mutable elements (i.e. elements whose compareTo result changes). As such, the only way to do it is to re-assemble a new list in an atomic way, or, as Will suggests, perform a remove, mutate, and re-insert of the element.
HashSet has the same problem: the hash bucket is calculated when an object is inserted, so you won't be able to do set.contains(...) if you mutate the object's hash code afterwards.
To be exact, collections like ConcurrentSkipListSet and HashSet perform their comparisons/hashing on insertion and removal. The only collections that 'support' mutable elements are those that do not perform special insertion logic based on the state of the elements (e.g. an ArrayList).
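To make the hashing problem concrete, a small demonstration (MutableKey is a made-up class):

import java.util.HashSet;
import java.util.Set;

class MutableKey {
    int id;

    MutableKey(int id) { this.id = id; }

    @Override public int hashCode() { return id; } // depends on mutable state
    @Override public boolean equals(Object o) {
        return o instanceof MutableKey && ((MutableKey) o).id == id;
    }

    public static void main(String[] args) {
        Set<MutableKey> set = new HashSet<>();
        MutableKey k = new MutableKey(1);
        set.add(k);
        k.id = 2; // mutated after insertion: k now hashes to a different bucket
        System.out.println(set.contains(k)); // false, even though k is in the set
    }
}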
The documentation for the Set interface states:
Note: Great care must be exercised if mutable objects are used as set elements. The behavior of a set is not specified if the value of an object is changed in a manner that affects equals comparisons while the object is an element in the set. A special case of this prohibition is that it is not permissible for a set to contain itself as an element.
and the documentation for the SortedSet interface states:
Note that the ordering maintained by a sorted set (whether or not an explicit comparator is provided) must be consistent with equals if the sorted set is to correctly implement the Set interface. (See the Comparable interface or Comparator interface for a precise definition of consistent with equals.) This is so because the Set interface is defined in terms of the equals operation, but a sorted set performs all element comparisons using its compareTo (or compare) method, so two elements that are deemed equal by this method are, from the standpoint of the sorted set, equal. The behavior of a sorted set is well-defined even if its ordering is inconsistent with equals; it just fails to obey the general contract of the Set interface.