Is the following statement true?
The sorted() operation is a “stateful intermediate operation”, which means that subsequent operations no longer operate on the backing collection, but on an internal state.
(Source and source - they seem to copy from each other or come from the same source.)
Disclaimer: I am aware the following snippets are not legitimate usages of the Java Stream API. Don't use them in production code.
I have tested Stream::sorted with the snippet from the sources above:
final List<Integer> list = IntStream.range(0, 10).boxed().collect(Collectors.toList());
list.stream()
.filter(i -> i > 5)
.sorted()
.forEach(list::remove);
System.out.println(list); // Prints [0, 1, 2, 3, 4, 5]
It works. I replaced Stream::sorted with Stream::distinct, Stream::limit and Stream::skip:
final List<Integer> list = IntStream.range(0, 10).boxed().collect(Collectors.toList());
list.stream()
.filter(i -> i > 5)
.distinct()
.forEach(list::remove); // Throws NullPointerException
To my surprise, a NullPointerException is thrown.
All the tested methods share the characteristics of stateful intermediate operations. Yet this unique behavior of Stream::sorted is not documented, nor does the Stream operations and pipelines section explain whether stateful intermediate operations really guarantee a new source collection.
Where does my confusion come from, and what is the explanation of the behavior above?
The API documentation makes no such guarantee “that subsequent operations no longer operate on the backing collection”, hence, you should never rely on such a behavior of a particular implementation.
Your example happens to do the desired thing by accident; there’s not even a guarantee that the List created by collect(Collectors.toList()) supports the remove operation.
To show a counter-example, the following
Set<Integer> set = IntStream.range(0, 10).boxed()
.collect(Collectors.toCollection(TreeSet::new));
set.stream()
.filter(i -> i > 5)
.sorted()
.forEach(set::remove);
throws a ConcurrentModificationException. The reason is that the implementation optimizes this scenario, as the source is already sorted. In principle, it could do the same optimization to your original example, as forEach is explicitly performing the action in no specified order, hence, the sorting is unnecessary.
There are other optimizations imaginable, e.g. sorted().findFirst() could get converted to a “find the minimum” operation, without the need to copy the elements into new storage for sorting.
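As a minimal sketch of why such a rewrite would be legal (not something the current implementation is documented to do), note that both pipelines below must produce the same result, so an implementation could substitute the cheaper second form:
final List<Integer> list = IntStream.range(0, 10).boxed().collect(Collectors.toList());
Optional<Integer> viaSort = list.stream().sorted().findFirst();           // sort, take the head
Optional<Integer> viaMin  = list.stream().min(Comparator.naturalOrder()); // one pass, no buffering
// both yield Optional[0]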
So the bottom line is, when relying on unspecified behavior, what may happen to work today, may break tomorrow, when new optimizations are added.
Well, sorted has to be a full copying barrier for the stream pipeline; after all, your source might not be sorted. But this is not documented as such, thus do not rely on it.
This is not just about sorted per se, but about what other optimizations can be done to the stream pipeline so that sorted can be entirely skipped. For example:
List<Integer> sortedList = IntStream.range(0, 10)
.boxed()
.collect(Collectors.toList());
StreamSupport.stream(() -> sortedList.spliterator(), Spliterator.SORTED, false)
.sorted()
.forEach(sortedList::remove); // fails with CME, thus no copying occurred
Of course, sorted needs to be a full barrier and perform the entire sort, unless it can be skipped; thus the documentation makes no such promises, so that we don't run into weird surprises.
distinct, on the other hand, does not have to be a full barrier: all distinct does is check one element at a time for uniqueness; once a single element is checked (and found unique), it is passed to the next stage, without buffering the entire stream. Either way, this is not documented either...
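You can observe that pass-through behavior with peek; this is what OpenJDK's sequential implementation happens to do today, not a documented guarantee:
Stream.of(1, 1, 2, 3)
      .peek(i -> System.out.println("entering distinct: " + i))
      .distinct()
      .forEach(i -> System.out.println("past distinct: " + i));
// the printouts interleave: each unique element reaches forEach
// before distinct has consumed the rest of the source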
You shouldn't have brought up the cases with a terminal operation forEach(list::remove) because list::remove is an interfering function and it violates the "non-interference" principle for terminal actions.
It's vital to follow the rules before wondering why an incorrect code snippet causes unexpected (or undocumented) behaviour.
I believe that list::remove is the root of the problem here. You wouldn't have noticed the difference between the operations for this scenario if you'd written a proper action for forEach.
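For instance, a non-interfering version of the first snippet (a sketch: collect the elements to remove, then mutate the list only after the stream has finished) produces the same result without touching the source during traversal:
final List<Integer> list = IntStream.range(0, 10).boxed()
        .collect(Collectors.toCollection(ArrayList::new)); // explicitly mutable
final List<Integer> toRemove = list.stream()
        .filter(i -> i > 5)
        .collect(Collectors.toList());
list.removeAll(toRemove); // no stream is traversing list at this point
System.out.println(list); // [0, 1, 2, 3, 4, 5]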
Related
How do you create an Unmodifiable List/Set/Map with Collectors.toList/toSet/toMap, since toList (and the like) are documented as:
There are no guarantees on the type, mutability, serializability, or thread-safety of the List returned
Before Java 10, you have to provide a finisher Function via Collectors.collectingAndThen, for example:
List<Integer> result = Arrays.asList(1, 2, 3, 4)
.stream()
.collect(Collectors.collectingAndThen(
Collectors.toList(),
x -> Collections.unmodifiableList(x)));
With Java 10, this is much easier and a lot more readable:
List<Integer> result = Arrays.asList(1, 2, 3, 4)
.stream()
.collect(Collectors.toUnmodifiableList());
Internally, it's the same thing as Collectors.collectingAndThen, but it returns an instance of the unmodifiable List type that was added in Java 9.
Additionally, to point out a documented difference between the two implementations (collectingAndThen vs. toUnmodifiableList):
Collectors.toUnmodifiableList returns a Collector that disallows null
values and will throw NullPointerException if it is presented with a
null value.
static void additionsToCollector() {
// this works fine unless you try and operate on the null element
var previous = Stream.of(1, 2, 3, 4, null)
.collect(Collectors.collectingAndThen(Collectors.toList(), Collections::unmodifiableList));
// next up ready to face an NPE
var current = Stream.of(1, 2, 3, 4, null).collect(Collectors.toUnmodifiableList());
}
Furthermore, that's owing to the fact that the former constructs an instance of Collections.UnmodifiableRandomAccessList while the latter constructs an instance of ImmutableCollections.ListN, the same kind of List returned by static factory methods like List.of, with all the attributes those bring to the table.
Stream#toList
Java 16 adds a method on the Stream interface: toList(). To quote the Javadoc:
The returned List is unmodifiable; calls to any mutator method will always cause UnsupportedOperationException to be thrown.
Not just more convenient than the Collectors, this method also has some goodies like better performance on parallel streams, in particular with parallel(), as it avoids result copying. (The benchmark behind that claim ran a few simple operations on a 100K-element stream of Long.)
For further reading, see: http://marxsoftware.blogspot.com/2020/12/jdk16-stream-to-list.html
In summary, the link states the following.
Gotcha: It may be tempting to go into one's code base and use
stream.toList() as a drop-in replacement for
stream.collect(Collectors.toList()), but there may be differences in
behavior if the code has a direct or indirect dependency on the
implementation of stream.collect(Collectors.toList()) returning an
ArrayList. Some of the key differences between the List returned by
stream.collect(Collectors.toList()) and stream.toList() are spelled
out in the remainder of this post.
The Javadoc-based documentation for Collectors.toList() states
(emphasis added), "Returns a Collector that accumulates the input
elements into a new List. There are no guarantees on the type,
mutability, serializability, or thread-safety of the List returned..."
Although there are no guarantees regarding the "type, mutability,
serializability, or thread-safety" on the List provided by
Collectors.toList(), it is expected that some may have realized it's
currently an ArrayList and have used it in ways that depend on the
characteristics of an ArrayList
What I understand is that Stream.toList() results in an immutable List.
Stream.toList() provides a List implementation that is immutable (type
ImmutableCollections.ListN that cannot be added to or sorted) similar
to that provided by List.of() and in contrast to the mutable (can be
changed and sorted) ArrayList provided by
Stream.collect(Collectors.toList()). Any existing code depending on
the ability to mutate the ArrayList returned by
Stream.collect(Collectors.toList()) will not work with Stream.toList()
and an UnsupportedOperationException will be thrown.
Although the implementation nature of the Lists returned by
Stream.collect(Collectors.toList()) and Stream.toList() are very
different, they still both implement the List interface and so they
are considered equal when compared using List.equals(Object)
And this method allows nulls, so starting from Java 16 we have:
mutable / null-friendly ------> Collectors.toList()
immutable / null-friendly ----> Stream.toList()
immutable / null-hostile -----> Collectors.toUnmodifiableList() // naughty
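A compact sketch of all three behaviors side by side (each line is independent; the last one throws):
List<Integer> a = Stream.of(1, null).collect(Collectors.toList());             // ok, mutable
List<Integer> b = Stream.of(1, null).toList();                                 // ok, immutable (Java 16+)
List<Integer> c = Stream.of(1, null).collect(Collectors.toUnmodifiableList()); // NullPointerException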
It's great.
List/Set/Map.copyOf
You asked:
How do you create an Unmodifiable List/Set/Map
As of Java 10, simply pass your existing list/set/map to:
List.copyOf
Set.copyOf
Map.copyOf
These static methods return an unmodifiable List, unmodifiable Set, or unmodifiable Map, respectively. Read the details on those linked Javadoc pages.
No nulls allowed.
If the passed collection is already unmodifiable, that passed collection is simply returned, no further work, no new collection.
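A short sketch of the copy semantics (variable names are mine):
List<String> mutable = new ArrayList<>(List.of("a", "b"));
List<String> frozen  = List.copyOf(mutable);  // independent, unmodifiable snapshot
mutable.add("c");                             // does not affect frozen
// frozen.add("d");                           // would throw UnsupportedOperationException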
Note: If using the convenient Stream#toList method in Java 16+ as described in this other Answer, there is no point to this solution here, no need to call List.copyOf. The result of toList is already unmodifiable.
I'm creating a list and filling it from another list using a parallel stream; unexpectedly, the destination list sometimes contains nulls. It happens seldom and inconsistently. Has anyone else had the same issue?
Here is the piece of code:
Collection<DestinationObj> DestinationObjList = Lists.newArrayList();
SourceObjList.parallelStream().forEach(portalRule -> DestinationObjList.add(new DestinationObj(portalRule)));
return DestinationObjList;
You should collect in parallel in a slightly different way, letting the stream build the list for you:
Collection<DestinationObj> destinationObjList = SourceObjList.parallelStream()
        .map(DestinationObj::new)
        .collect(Collectors.toCollection(ArrayList::new));
The problem you are having is that ArrayList is not thread-safe, and as such the result of mutating it from multiple threads is really undefined.
Notice that collecting with a parallel stream does not require a thread-safe collection, because each thread accumulates into its own intermediate container before the results are merged; the list created by Lists::newArrayList is not thread-safe, and it doesn't need to be when used through a collector.
Using a collector to synchronize access to the destination list gives you a performance penalty in the synchronization. In fact, you can do the same thing without synchronization, since you know the size of the source list and can therefore create a destination list of the required size from the start.
DestinationObj[] dest = new DestinationObj[sourceObjList.size()];
IntStream.range(0, sourceObjList.size())
.parallel()
.forEach(i -> dest[i] = new DestinationObj(sourceObjList.get(i)));
List<DestinationObj> destinationObjList = Arrays.asList(dest);
EDIT: Just putting Holger's improvement here for clarity:
List<DestinationObj> destinationObjList = Arrays.asList(
sourceObjList
.parallelStream()
.map(DestinationObj::new)
.toArray(DestinationObj[]::new));
There is a concurrent list implementation in java.util.concurrent. CopyOnWriteArrayList in particular.
-- Jarrod Roberson
Look here: Is there a concurrent List in Java's JDK?
Suppose we have a standard method chain of stream operations:
Arrays.asList("a", "bc", "def").stream()
.filter(e -> e.length() != 2)
.map(e -> e.length())
.forEach(e -> System.out.println(e));
Are there any guarantees in the JLS regarding the order in which stream operations are applied to the list elements?
For example, is it guaranteed that:
Applying the filter predicate to "bc" is not going to happen before applying the filter predicate to "a"?
Applying the mapping function to "def" is not going to happen before applying the mapping function to "a"?
1 will be printed before 3?
Note: I am talking here specifically about stream(), not parallelStream() where it is expected that operations like mapping and filtering are done in parallel.
Everything you want to know can be found within the java.util.stream JavaDoc.
Ordering
Streams may or may not have a defined encounter order. Whether or not
a stream has an encounter order depends on the source and the
intermediate operations. Certain stream sources (such as List or
arrays) are intrinsically ordered, whereas others (such as HashSet)
are not. Some intermediate operations, such as sorted(), may impose an
encounter order on an otherwise unordered stream, and others may
render an ordered stream unordered, such as BaseStream.unordered().
Further, some terminal operations may ignore encounter order, such as
forEach().
If a stream is ordered, most operations are constrained to operate on
the elements in their encounter order; if the source of a stream is a
List containing [1, 2, 3], then the result of executing map(x -> x*2)
must be [2, 4, 6]. However, if the source has no defined encounter
order, then any permutation of the values [2, 4, 6] would be a valid
result.
For sequential streams, the presence or absence of an encounter order
does not affect performance, only determinism. If a stream is ordered,
repeated execution of identical stream pipelines on an identical
source will produce an identical result; if it is not ordered,
repeated execution might produce different results.
For parallel streams, relaxing the ordering constraint can sometimes
enable more efficient execution. Certain aggregate operations, such as
filtering duplicates (distinct()) or grouped reductions
(Collectors.groupingBy()) can be implemented more efficiently if
ordering of elements is not relevant. Similarly, operations that are
intrinsically tied to encounter order, such as limit(), may require
buffering to ensure proper ordering, undermining the benefit of
parallelism. In cases where the stream has an encounter order, but the
user does not particularly care about that encounter order, explicitly
de-ordering the stream with unordered() may improve parallel
performance for some stateful or terminal operations. However, most
stream pipelines, such as the "sum of weight of blocks" example above,
still parallelize efficiently even under ordering constraints.
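Applied to your pipeline: the content and encounter order of the result are constrained, but forEach itself may ignore that order. If you need the action applied in encounter order, forEachOrdered makes it explicit, even on a parallel stream; a small sketch:
Arrays.asList("a", "bc", "def").parallelStream()
    .filter(e -> e.length() != 2)
    .map(String::length)
    .forEachOrdered(System.out::println); // prints 1 then 3, even in parallel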
Are there any guarantees in the JLS regarding the order in which stream operations are applied to the list elements?
The Streams library is not covered by the JLS. You would need to read the Javadoc for the library.
Streams also support parallel processing, and the order in which things are processed then depends on the implementation.
Applying the filter predicate to "bc" is not going to happen before applying the filter predicate to "a"?
It would be reasonable to assume that it would, but you can't guarantee it, nor should you write code which requires this guarantee; otherwise you wouldn't be able to parallelise it later.
applying the mapping function to "def" is not going to happen before applying the mapping function to "a"?
It is safe to assume this does happen, but you shouldn't write code which requires it.
There is no guarantee of the order in which list items are passed to predicate lambdas. Stream documentation makes guarantees regarding the output of streams, including the order of encounter; it does not make guarantees about implementation details, such as the order in which filter predicates are applied.
Therefore, the documentation does not prevent filter from, say, reading several elements, running the predicate on them in reverse order, and then sending the elements passing the predicate to the output of the stream in the order in which they came in. I don't know why filter() would do something like that, but doing so wouldn't break any guarantee made in the documentation.
You can make a pretty strong inference from the documentation that filter() will call the predicate on the elements in the order in which the collection supplies them, because you are passing the result of calling stream() on a list, which calls Collection.stream(), and the Java documentation guarantees that a Stream<T> produced in this way is sequential:
Returns a sequential Stream with this collection as its source.
Further, filter() is stateless:
Stateless operations, such as filter and map, retain no state from previously seen element when processing a new element - each element can be processed independently of operations on other elements.
Therefore it is rather likely that filter would call the predicate on elements in the order they are supplied by the collection.
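If you want to see the likely (but unspecified) call order for yourself, a probe such as the following shows it; this reflects observed OpenJDK behavior for a sequential stream, not a documented guarantee:
Arrays.asList("a", "bc", "def").stream()
    .filter(e -> { System.out.println("filter:  " + e); return e.length() != 2; })
    .forEach(e -> System.out.println("forEach: " + e));
// typically: filter: a, forEach: a, filter: bc, filter: def, forEach: def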
I am talking here specifically about stream(), not parallelStream()
Note that a Stream<T> may be unordered without being parallel. For example, after calling unordered() on a stream(), the result becomes unordered, but not parallel.
Are there any guarantees in the JLS regarding the order in which
stream operations are applied to the list elements?
Quoting from the Ordering section in the Stream javadocs:
Streams may or may not have a defined encounter order. Whether or not
a stream has an encounter order depends on the source and the
intermediate operations.
Applying the filter predicate to "bc" is not going to happen before
applying the filter predicate to "a"?
As quoted above, streams may or may not have a defined order. But since your example uses a List, the same Ordering section in the Stream javadocs goes on to say that
If a stream is ordered, most operations are constrained to operate on
the elements in their encounter order; if the source of a stream is a
List containing [1, 2, 3], then the result of executing map(x -> x*2)
must be [2, 4, 6].
Applying the above statement to your example: I believe the filter predicate will receive the elements in the order defined by the List.
Or, applying the mapping function to "def" is not going to happen before applying the mapping function to "a"?
For this I would refer to the Stream operations section of the Stream javadocs, which says:
Stateless operations, such as filter and map, retain no state from
previously seen element when processing a new element
Since map() doesn't retain state, I believe it is safe to assume "def" is not going to be processed before "a" in your example.
1 will be printed before 3?
Although it may be unlikely to differ with sequential streams (such as one from a List), it is not guaranteed, as the Ordering section in the Stream javadocs does indicate that
some terminal operations may ignore encounter order, such as
forEach().
If the stream is created from a list, it is guaranteed that the collected result will be ordered the same way the original list was, as the documentation states:
Ordering
If a stream is ordered, most operations are constrained to operate on the elements in their encounter order; if the source of a stream is a List containing [1, 2, 3], then the result of executing map(x -> x*2) must be [2, 4, 6]. However, if the source has no defined encounter order, then any permutation of the values [2, 4, 6] would be a valid result.
However, there is no guarantee regarding the order in which the map function itself is executed.
From the same documentation page (in the Side-effects paragraph):
Side-effects
If the behavioral parameters do have side-effects, unless explicitly stated, there are no guarantees as to the visibility of those side-effects to other threads, nor are there any guarantees that different operations on the "same" element within the same stream pipeline are executed in the same thread. Further, the ordering of those effects may be surprising. Even when a pipeline is constrained to produce a result that is consistent with the encounter order of the stream source (for example, IntStream.range(0,5).parallel().map(x -> x*2).toArray() must produce [0, 2, 4, 6, 8]), no guarantees are made as to the order in which the mapper function is applied to individual elements, or in what thread any behavioral parameter is executed for a given element.
In practice, for an ordered sequential stream, chances are that the stream operations will be executed in order, but there is no guarantee.
I am having trouble understanding the Stream interface in Java 8, especially where it has to do with the Spliterator and Collector interfaces. My problem is that I simply can't understand Spliterator and the Collector interfaces yet, and as a result, the Stream interface is still somewhat obscure to me.
What exactly is a Spliterator and a Collector, and how can I use them? If I am willing to write my own Spliterator or Collector (and probably my own Stream in that process), what should I do and not do?
I read some examples scattered around the web, but since everything here is still new and subject to changes, examples and tutorials are still very sparse.
You should almost certainly never have to deal with Spliterator as a user; it should only be necessary if you're writing Collection types yourself and also intending to optimize parallelized operations on them.
For what it's worth, a Spliterator is a way of operating over the elements of a collection in a way that it's easy to split off part of the collection, e.g. because you're parallelizing and want one thread to work on one part of the collection, one thread to work on another part, etc.
You should essentially never be saving values of type Stream to a variable, either. Stream is sort of like an Iterator, in that it's a one-time-use object that you'll almost always use in a fluent chain, as in the Javadoc example:
int sum = widgets.stream()
.filter(w -> w.getColor() == RED)
.mapToInt(w -> w.getWeight())
.sum();
Collector is the most generalized, abstract possible version of a "reduce" operation a la map/reduce; in particular, it needs to support parallelization and finalization steps. Examples of Collectors include:
summing, e.g. Collectors.reducing(0, (x, y) -> x + y)
StringBuilder appending, e.g. Collector.of(StringBuilder::new, StringBuilder::append, StringBuilder::append, StringBuilder::toString)
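For what it's worth, a usage sketch of that StringBuilder collector, with the four functions labeled:
String s = Stream.of("a", "b", "c")
    .collect(Collector.of(StringBuilder::new,        // supplier: a new, empty container
                          StringBuilder::append,     // accumulator: fold in one element
                          StringBuilder::append,     // combiner: merge two containers (parallel case)
                          StringBuilder::toString)); // finisher: produce the final result
// s is "abc"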
Spliterator basically means "splittable Iterator".
A single thread can traverse/process the entire Spliterator itself, but the Spliterator also has a method trySplit() which will "split off" a section for someone else (typically, another thread) to process, leaving the current spliterator with less work.
Collector combines the specification of a reduce function (of map-reduce fame) with an initial value and a function to combine two results, thus enabling results from spliterated streams of work to be combined.
For example, the most basic Collector would have an initial value of 0, add each integer onto the existing result, and 'combine' two results by adding them, thus summing a spliterated stream of integers.
See:
Spliterator.trySplit()
Collector<T,A,R>
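A minimal sketch of trySplit in action, assuming an array-backed list whose spliterator splits roughly in half:
Spliterator<Integer> right = Arrays.asList(1, 2, 3, 4).spliterator();
Spliterator<Integer> left  = right.trySplit();  // null if the source cannot split
left.forEachRemaining(System.out::println);     // the first half: 1, 2
right.forEachRemaining(System.out::println);    // the remainder:  3, 4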
The following are examples of using the predefined collectors to perform common mutable reduction tasks:
// Accumulate names into a List
List<String> list = people.stream().map(Person::getName).collect(Collectors.toList());
// Accumulate names into a TreeSet
Set<String> set = people.stream().map(Person::getName).collect(Collectors.toCollection(TreeSet::new));
// Convert elements to strings and concatenate them, separated by commas
String joined = things.stream()
.map(Object::toString)
.collect(Collectors.joining(", "));
// Compute sum of salaries of employees
int total = employees.stream()
.collect(Collectors.summingInt(Employee::getSalary));
// Group employees by department
Map<Department, List<Employee>> byDept
= employees.stream()
.collect(Collectors.groupingBy(Employee::getDepartment));
// Compute sum of salaries by department
Map<Department, Integer> totalByDept
= employees.stream()
.collect(Collectors.groupingBy(Employee::getDepartment,
Collectors.summingInt(Employee::getSalary)));
// Partition students into passing and failing
Map<Boolean, List<Student>> passingFailing =
students.stream()
.collect(Collectors.partitioningBy(s -> s.getGrade() >= PASS_THRESHOLD));
The Spliterator interface is a core feature of Streams.
The stream() and parallelStream() default methods are provided by the Collection interface. These methods use the Spliterator through the call to spliterator():
...
default Stream<E> stream() {
return StreamSupport.stream(spliterator(), false);
}
default Stream<E> parallelStream() {
return StreamSupport.stream(spliterator(), true);
}
...
Spliterator is an internal iterator that breaks the stream into smaller parts. These smaller parts can be processed in parallel.
Among its methods, two are the most important for understanding the Spliterator:
boolean tryAdvance(Consumer<? super T> action)
Unlike the Iterator, it tries to perform the given action on the next element.
If the operation executes successfully, the method returns true. Otherwise, it returns false, meaning that no element remains or the end of the stream was reached.
Spliterator<T> trySplit()
This method allows splitting a data set into many smaller sets according to one criterion or another (file size, number of lines, etc.).
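A tiny sketch of tryAdvance with a trivial source:
Spliterator<String> sp = Arrays.asList("a", "b").spliterator();
// tryAdvance consumes at most one element and reports whether one was consumed
while (sp.tryAdvance(System.out::println)) { }  // prints "a", then "b"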
How can I add objects from one stream to two different lists simultaneously
Currently I am doing
body.getSurroundings().parallelStream()
.filter(o -> o.getClass().equals(ResourcePoint.class))
.map(o -> (ResourcePoint)o)
.filter(o -> !resourceMemory.contains(o))
.forEach(resourceMemory::add);
to add objects from my stream into a LinkedList resourceMemory, but I also want to add the same objects to another list simultaneously, and I can't find the syntax for it. Is it possible, or do I need to duplicate this code for each list?
There are several fundamental errors you should understand first, before trying to expand your code.
First of all, forEach does not guarantee a particular order of element processing, so it's likely the wrong tool for adding to a List even for sequential streams. However, it is completely wrong to use it with a parallel stream to add to a collection like LinkedList, which is not thread-safe, as the action will be performed concurrently.
But even if resourceMemory were a thread-safe collection, your code would still be broken, as there is interference between your filter condition and the terminal action: .filter(o -> !resourceMemory.contains(o)) queries the same list which you are modifying in the terminal action, and it shouldn't be hard to understand how this can break even with thread-safe collections:
Two or more threads may process the filter and find that the element is not contained in the list, then all of them will add the element, contradicting your obvious intention of not having duplicates.
You could resort to forEachOrdered which will perform the action in order and non-concurrently:
body.getSurroundings().parallelStream()
.filter(o -> o instanceof ResourcePoint)
.map(o -> (ResourcePoint)o)
.forEachOrdered(o -> {// not recommended, just for explanation
if(!resourceMemory.contains(o))
resourceMemory.add(o);
});
This will work, and it's obvious how you could add to another list within that action, but it's far from recommended coding style. Also, the fact that this terminal action synchronizes with all processing threads will destroy any potential benefit of parallel processing, especially as the most expensive operation of this stream pipeline is invoking contains on a LinkedList, which will (must) happen single-threaded.
The correct way to collect stream elements into a list is via, as the name suggests, collect:
List<ResourcePoint> resourceMemory
=body.getSurroundings().parallelStream()
.filter(o -> o instanceof ResourcePoint)
.map(o -> (ResourcePoint)o)
.distinct() // no duplicates
.collect(Collectors.toList()); // collect into a list
This doesn’t return a LinkedList, but you should rethink carefully whether you really need a LinkedList. In 99% of all cases, you don’t. If you really need a LinkedList, you can replace Collectors.toList() with Collectors.toCollection(LinkedList::new).
Now if you really must add to an existing list created outside of your control, which might already contain elements, you should consider the fact mentioned above, that you have to ensure single-threaded access to a non-thread-safe list anyway, so there’s no benefit from doing it from within the parallel stream at all. In most cases, it’s more efficient to let the stream work independently from that list and add the result in a single threaded step afterwards:
Set<ResourcePoint> newElements=
body.getSurroundings().parallelStream()
.filter(o -> o instanceof ResourcePoint)
.map(o -> (ResourcePoint)o)
.collect(Collectors.toCollection(LinkedHashSet::new));
newElements.removeAll(resourceMemory);
resourceMemory.addAll(newElements);
Here, we collect into a LinkedHashSet which implies maintenance of the encounter order and sorting out duplicates within the new elements, then use removeAll on the new elements to remove existing elements of the target list (here we benefit from the hash set nature of the temporary collection), finally, the new elements are added to the target list, which, as explained, must happen single-threaded anyway for a target collection which isn’t thread safe.
It’s easy to add the newElements to another target collection with this solution, much easier than writing a custom collector for producing two lists during the stream processing. But note that the stream operations as written above are way too cheep to assume any benefit from parallel processing. You would need a very large number of elements to compensate the initial multi-threading overhead. It’s even possible that there is no number for which it ever pays off.
Instead of
.forEach(resourceMemory::add)
You could invoke
.forEach(o -> {
resourceMemory.add(o);
otherResource.add(o);
})
or put the add operations in a separate method so you could provide a method reference
.forEach(this::add)
void add(ResourcePoint p) {
resourceMemory.add(p);
otherResource.add(p);
}
But bear in mind that the order of insertion may be different with each run, as you use a parallel stream.