ArrayList filled by parallel stream contains nulls - java

I'm creating a list and filling it from another list using a parallel stream. Unexpectedly, the destination list contains nulls. It happens rarely and inconsistently. Has anyone seen the same issue?
Here is the piece of code:
Collection<DestinationObj> DestinationObjList = Lists.newArrayList();
SourceObjList.parallelStream().forEach(portalRule -> DestinationObjList.add(new DestinationObj(portalRule)));
return DestinationObjList;

You should collect in parallel in a bit different way:
SourceObjList.parallelStream()
.map(DestinationObj::new)
.collect(Collectors.toCollection(ArrayList::new));
The problem you are having is that ArrayList is not thread-safe, so calling add on it concurrently from a parallel forEach gives an undefined result.
Note that collect on a parallel stream does not require a thread-safe collection; the list returned by Lists.newArrayList() is not thread-safe either.
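A minimal, self-contained sketch of the collect approach, in case it helps to try it out. The class name, the convert method and the "item-" strings are made up for illustration; they stand in for DestinationObj and its constructor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

final class ParallelCollectDemo {

    // Maps every element and gathers the results with collect().
    // No thread-safe list is needed: the stream fills separate containers
    // and merges them, keeping the encounter order of the source list.
    static List<String> convert(List<Integer> source) {
        return source.parallelStream()
                .map(i -> "item-" + i) // stand-in for DestinationObj::new
                .collect(Collectors.toCollection(ArrayList::new));
    }

    public static void main(String[] args) {
        List<Integer> source = IntStream.range(0, 10_000)
                .boxed()
                .collect(Collectors.toList());
        List<String> result = convert(source);
        System.out.println(result.size());         // same size as the source
        System.out.println(result.contains(null)); // false
    }
}
```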

Collecting in parallel carries a cost: each thread fills its own list and the partial lists have to be merged afterwards. In fact, you can avoid that, since you know the size of the source list and can therefore create a destination array of the required size from the start.
DestinationObj[] dest = new DestinationObj[sourceObjList.size()];
IntStream.range(0, sourceObjList.size())
.parallel()
.forEach(i -> dest[i] = new DestinationObj(sourceObjList.get(i)));
List<DestinationObj> destinationObjList = Arrays.asList(dest);
EDIT: Just putting Holger's improvement here for clarity:
List<DestinationObj> destinationObjList = Arrays.asList(
sourceObjList
.parallelStream()
.map(DestinationObj::new)
.toArray(DestinationObj[]::new));
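Here is a small runnable sketch of the toArray variant, under the assumption that a fixed-size list is acceptable. The class and method names are hypothetical, and strings stand in for DestinationObj.

```java
import java.util.Arrays;
import java.util.List;

final class ParallelToArrayDemo {

    // toArray() on a parallel stream writes each result to its own slot,
    // so the array keeps the encounter order of the source list.
    static List<String> convert(List<Integer> source) {
        return Arrays.asList(
                source.parallelStream()
                        .map(i -> "item-" + i) // stand-in for DestinationObj::new
                        .toArray(String[]::new));
    }
}
```

Keep in mind that the list returned by Arrays.asList is a fixed-size view of the array: set works, but add and remove throw UnsupportedOperationException.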

There is a concurrent list implementation in java.util.concurrent. CopyOnWriteArrayList in particular.
-- Jarrod Roberson
Look here: Is there a concurrent List in Java's JDK?
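For completeness, a rough sketch of what the CopyOnWriteArrayList route could look like (class and method names are invented for the example). It avoids the nulls, but every add copies the backing array, and the insertion order is whatever order the threads happen to run in, so it is probably only reasonable for small lists.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

final class CopyOnWriteDemo {

    // Adds every element from a parallel forEach into a thread-safe list.
    // All elements arrive, but their order in the target is not guaranteed.
    static List<Integer> copyAll(List<Integer> source) {
        List<Integer> target = new CopyOnWriteArrayList<>();
        source.parallelStream().forEach(target::add);
        return target;
    }
}
```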

Related

Should I use shared mutable variable update in Java 8 Streams

I'm iterating over the list below and adding its elements to another shared mutable list via Java 8 streams.
List<String> list1 = Arrays.asList("A1","A2","A3","A4","A5","A6","A7","A8","B1","B2","B3");
List<String> list2 = new ArrayList<>();
Consumer<String> c = t -> list2.add(t.startsWith("A") ? t : "EMPTY");
list1.stream().forEach(c);
list1.parallelStream().forEach(c);
list1.forEach(c);
What is the difference between the three iterations above, and which one should we use? Are there any considerations?
Regardless of whether you use parallel or sequential Stream, you shouldn't use forEach when your goal is to generate a List. Use map with collect:
List<String> list2 =
list1.stream()
.map(item -> item.startsWith("A") ? item : "EMPTY")
.collect(Collectors.toList());
Functionally speaking, for the simple cases they are almost the same, but generally speaking, there are some hidden differences:
Let's start by quoting the Javadoc of forEach for Iterable use cases, which states that it:
performs the given action for each element of the Iterable until all
elements have been processed or the action throws an exception.
and also we can iterate over a collection and perform a given action on each element – by just passing a class that implements the Consumer interface
void forEach(Consumer<? super T> action)
https://docs.oracle.com/javase/8/docs/api/java/lang/Iterable.html#forEach-java.util.function.Consumer-
The order of Stream.forEach is unspecified (and nondeterministic for parallel streams), while Iterable.forEach is always executed in the iteration order of the Iterable.
If Iterable.forEach is iterating over a synchronized collection, Iterable.forEach takes the collection's lock once and holds it across all the calls to the action method. The Stream.forEach call uses the collection's spliterator, which does not lock.
The action specified in Stream.forEach is required to be non-interfering while Iterable.forEach is allowed to set values in the underlying ArrayList without problems.
In Java, Iterators returned by Collection classes, e.g. ArrayList, HashSet, Vector, etc., are fail fast. This means that if you try to add() or remove() from the underlying data structure while iterating it, you get a ConcurrentModificationException.
https://docs.oracle.com/javase/8/docs/api/java/util/ArrayList.html#fail-fast
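To make the fail-fast point concrete, here is a tiny sketch (names are made up) that adds to an ArrayList while a for-each loop is iterating over it. The iterator notices the structural modification on its next step and throws.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.ConcurrentModificationException;
import java.util.List;

final class FailFastDemo {

    // Returns true if modifying the list during iteration was detected.
    static boolean addWhileIterating() {
        List<String> list = new ArrayList<>(Arrays.asList("a", "b", "c"));
        try {
            for (String s : list) {
                list.add(s + "!"); // structural modification during iteration
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true; // the fail-fast iterator detected the change
        }
    }

    public static void main(String[] args) {
        System.out.println(addWhileIterating()); // true
    }
}
```

As the ArrayList documentation says, this detection is best-effort, so it should be used to find bugs, not as program logic.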
More Info:
What is the difference between .foreach and .stream().foreach?
What is difference between Collection.stream().forEach() and Collection.forEach()?
When working with streams, you should write your code in a way that if you switch to parallel streams, it does not produce the wrong results.
Imagine if in your code you were doing reading and writing on the same shared memory (list2) and you distribute your process into several threads (using parallel streams). Then you are DOOMED. Therefore you have several options.
Make your shared memory (list2) thread-safe, for example by using an AtomicReference:
List<String> list2 = new ArrayList<>();
// The reference must be initialized, otherwise the update function receives null.
// Note: this guards only the reference; the ArrayList itself is still not thread-safe.
AtomicReference<List<String>> listSafe = new AtomicReference<>(list2);
listSafe.getAndUpdate(strings -> {strings.add("newvalue"); return strings;});
Or you can go with the purely functional approach (code with no side effects),
like @Eran's solution.

Does every stateful intermediate Stream API operation guarantee a new source collection?

Is the following statement true?
The sorted() operation is a “stateful intermediate operation”, which means that subsequent operations no longer operate on the backing collection, but on an internal state.
(Source and source - they seem to copy from each other or come from the same source.)
Disclaimer: I am aware the following snippets are not legit usages of Java Stream API. Don't use in the production code.
I have tested Stream::sorted as a snippet from sources above:
final List<Integer> list = IntStream.range(0, 10).boxed().collect(Collectors.toList());
list.stream()
.filter(i -> i > 5)
.sorted()
.forEach(list::remove);
System.out.println(list); // Prints [0, 1, 2, 3, 4, 5]
It works. I replaced Stream::sorted with Stream::distinct, Stream::limit and Stream::skip:
final List<Integer> list = IntStream.range(0, 10).boxed().collect(Collectors.toList());
list.stream()
.filter(i -> i > 5)
.distinct()
.forEach(list::remove); // Throws NullPointerException
To my surprise, the NullPointerException is thrown.
All the tested methods have the characteristics of a stateful intermediate operation. Yet this unique behavior of Stream::sorted is not documented, and the "Stream operations and pipelines" part of the documentation does not explain whether stateful intermediate operations really guarantee a new source collection.
Where my confusion comes from and what is the explanation of the behavior above?
The API documentation makes no such guarantee “that subsequent operations no longer operate on the backing collection”, hence, you should never rely on such a behavior of a particular implementation.
Your example happens to do the desired thing by accident; there’s not even a guarantee that the List created by collect(Collectors.toList()) supports the remove operation.
To show a counter-example
Set<Integer> set = IntStream.range(0, 10).boxed()
.collect(Collectors.toCollection(TreeSet::new));
set.stream()
.filter(i -> i > 5)
.sorted()
.forEach(set::remove);
throws a ConcurrentModificationException. The reason is that the implementation optimizes this scenario, as the source is already sorted. In principle, it could do the same optimization to your original example, as forEach is explicitly performing the action in no specified order, hence, the sorting is unnecessary.
There are other optimizations imaginable, e.g. sorted().findFirst() could get converted to a “find the minimum” operation, without the need to copy the element into a new storage for sorting.
So the bottom line is, when relying on unspecified behavior, what may happen to work today, may break tomorrow, when new optimizations are added.
Well sorted has to be a full copying barrier for the stream pipeline, after all your source could be not sorted; but this is not documented as such, thus do not rely on it.
This is not just about sorted per-se, but what other optimization can be done to the stream pipeline, so that sorted could be entirely skipped. For example:
List<Integer> sortedList = IntStream.range(0, 10)
.boxed()
.collect(Collectors.toList());
StreamSupport.stream(() -> sortedList.spliterator(), Spliterator.SORTED, false)
.sorted()
.forEach(sortedList::remove); // fails with CME, thus no copying occurred
Of course, sorted needs to be a full barrier and stop to do an entire sort, unless it can be skipped; thus the documentation makes no such promises, so that we don't run into weird surprises.
distinct, on the other hand, does not have to be a full barrier. All distinct does is check one element at a time to see whether it is unique; once a single element has been checked (and it is unique), it is passed to the next stage, so there is no full barrier. Either way, this is not documented either...
You shouldn't have brought up the cases with a terminal operation forEach(list::remove) because list::remove is an interfering function and it violates the "non-interference" principle for terminal actions.
It's vital to follow the rules before wondering why an incorrect code snippet causes unexpected (or undocumented) behaviour.
I believe that list::remove is the root of the problem here. You wouldn't have noticed the difference between the operations for this scenario if you'd written a proper action for forEach.
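For what it's worth, here is a sketch of two non-interfering ways to get the result the question was after (the class and method names are mine, not from the question). One mutates the list through Collection.removeIf, the other leaves the source untouched and builds a new list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

final class RemoveWithoutInterference {

    // Mutates the list in place; no stream is open over it while it changes.
    static List<Integer> dropGreaterThan(List<Integer> list, int limit) {
        list.removeIf(i -> i > limit);
        return list;
    }

    // Leaves the source alone and returns a filtered copy.
    static List<Integer> keepUpTo(List<Integer> list, int limit) {
        return list.stream()
                .filter(i -> i <= limit)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> list = IntStream.range(0, 10)
                .boxed()
                .collect(Collectors.toCollection(ArrayList::new));
        System.out.println(dropGreaterThan(list, 5)); // [0, 1, 2, 3, 4, 5]
    }
}
```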

Java 8 Streams - Map Multiple Object of same type to a list using streams

Is it possible to do the steps below in a better way using streams?
Set<Long> memberIds = new HashSet<>();
marksDistribution.parallelStream().forEach(marksDistribution -> {
memberIds.add(marksDistribution.getStudentId());
memberIds.add(marksDistribution.getTeacherId());
});
instanceDistribution.getStudentId() and instanceDistribution.getTeacherId() are both of type Long.
This kind of question may have been asked before, but I was not able to understand the answers. A simple yes or no would help, followed by how to do it and a brief explanation. If possible, please also discuss the efficiency.
Yes, you can use flatMap to map a single element of your Stream into a Stream of multiple elements, and then flatten them into a single Stream:
Set<Long> memberIds =
marksDistribution.stream()
.flatMap(m -> Stream.of(m.getStudentId(), m.getTeacherId()))
.collect(Collectors.toSet());
You can use the 3-args version of collect:
Set<Long> memberIds =
marksDistribution.parallelStream()
.collect(HashSet::new,
(s, m) -> {
s.add(m.getStudentId());
s.add(m.getTeacherId());
}, Set::addAll);
Your current version may produce wrong results, since you are adding elements in parallel to a non-thread-safe collection. So the set may end up containing the same value multiple times.
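A self-contained sketch of the flatMap version, in case someone wants to run it. MarksDistribution here is a hypothetical stand-in with just the two getters from the question; the real class will look different.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class MemberIdsDemo {

    // Hypothetical stand-in for the class in the question.
    static final class MarksDistribution {
        private final Long studentId;
        private final Long teacherId;

        MarksDistribution(Long studentId, Long teacherId) {
            this.studentId = studentId;
            this.teacherId = teacherId;
        }

        Long getStudentId() { return studentId; }
        Long getTeacherId() { return teacherId; }
    }

    // Each element contributes two ids; collect() builds the set safely,
    // even when the stream is parallel.
    static Set<Long> memberIds(List<MarksDistribution> distributions) {
        return distributions.parallelStream()
                .flatMap(m -> Stream.of(m.getStudentId(), m.getTeacherId()))
                .collect(Collectors.toSet());
    }
}
```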

Collection to stream to a new collection

I'm looking for the most pain free way to filter a collection. I'm thinking something like
Collection<?> foo = existingCollection.stream().filter( ... ). ...
But I'm not sure how best to go from the filter to returning or populating another collection. Most examples seem to be like "and here you can print". Possibly there's a constructor or output method that I'm missing.
There’s a reason why most examples avoid storing the result into a Collection. It’s not the recommended way of programming. You already have a Collection, the one providing the source data, and collections are of no use on their own. You want to perform certain operations on it, so the ideal case is to perform the operation using the stream and skip storing the data in an intermediate Collection. This is what most examples try to suggest.
Of course, there are a lot of existing APIs working with Collections and there always will be. So the Stream API offers different ways to handle the demand for a Collection.
Get an unmodifiable List implementation containing all elements (JDK 16):
List<T> results = l.stream().filter(…).toList();
Get an arbitrary List implementation holding the result:
List<T> results = l.stream().filter(…).collect(Collectors.toList());
Get an unmodifiable List forbidding null like List.of(…) (JDK 10):
List<T> results = l.stream().filter(…).collect(Collectors.toUnmodifiableList());
Get an arbitrary Set implementation holding the result:
Set<T> results = l.stream().filter(…).collect(Collectors.toSet());
Get a specific Collection:
ArrayList<T> results =
l.stream().filter(…).collect(Collectors.toCollection(ArrayList::new));
Add to an existing Collection:
l.stream().filter(…).forEach(existing::add);
Create an array:
String[] array=l.stream().filter(…).toArray(String[]::new);
Use the array to create a list with a specific behavior (mutable, fixed size):
List<String> al=Arrays.asList(l.stream().filter(…).toArray(String[]::new));
Allow a parallel capable stream to add to temporary local lists and join them afterward:
List<T> results
= l.stream().filter(…).collect(ArrayList::new, List::add, List::addAll);
(Note: this is closely related to how Collectors.toList() is currently implemented, but that’s an implementation detail, i.e. there is no guarantee that future implementations of the toList() collectors will still return an ArrayList)
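A short sketch of the three-argument collect on a parallel stream, using the form shown in the Stream.collect Javadoc (ArrayList::new, ArrayList::add, ArrayList::addAll). The class name, method name and the "A" filter are only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class ThreeArgCollectDemo {

    // supplier:    creates an empty ArrayList for each chunk of work
    // accumulator: adds one element to the chunk-local list
    // combiner:    appends one partial list to another, in encounter order
    static List<String> startingWithA(List<String> words) {
        ArrayList<String> result = words.parallelStream()
                .filter(w -> w.startsWith("A"))
                .collect(ArrayList::new, ArrayList::add, ArrayList::addAll);
        return result;
    }
}
```

Because every partial list is touched by only one thread at a time, no synchronization is needed on the ArrayList itself.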
An example from java.util.stream's documentation:
List<String> results =
stream.filter(s -> pattern.matcher(s).matches())
.collect(Collectors.toList());
Collectors has a toCollection() method, I'd suggest looking this way.
As an example that is more in line with Java 8 style of functional programming:
Collection<String> a = Collections.emptyList();
List<String> result = a.stream().
filter(s -> s.length() > 0).
collect(Collectors.toList());
You would possibly want to use toList or toSet or toMap methods from Collectors class.
However to get more control the toCollection method can be used. Here is a simple example:
Collection<String> c1 = new ArrayList<>();
c1.add("aa");
c1.add("ab");
c1.add("ca");
Collection<String> c2 = c1.stream().filter(s -> s.startsWith("a")).collect(Collectors.toCollection(ArrayList::new));
Collection<String> c3 = c1.stream().filter(s -> s.startsWith("a")).collect(Collectors.toList());
c2.forEach(System.out::println); // prints-> aa ab
c3.forEach(System.out::println); // prints-> aa ab

Add objects from stream to two different lists simultaneously

How can I add objects from one stream to two different lists simultaneously
Currently I am doing
body.getSurroundings().parallelStream()
.filter(o -> o.getClass().equals(ResourcePoint.class))
.map(o -> (ResourcePoint)o)
.filter(o -> !resourceMemory.contains(o))
.forEach(resourceMemory::add);
to add objects from my stream into a linkedlist "resourceMemory", but I also want to add the same objects to another list simultaneously, but I can't find the syntax for it. Is it possible or do I need to have two copies of this code for each list?
There are several fundamental errors you should understand first, before trying to expand your code.
First of all, forEach does not guarantee a particular order of element processing, so it’s likely the wrong tool for adding to a List, even for sequential streams. However, it is completely wrong to use it with a parallel stream to add to a collection like LinkedList, which is not thread-safe, as the action will be performed concurrently.
But even if resourceMemory were a thread-safe collection, your code would still be broken, as there is interference between your filter condition and the terminal action. .filter(o -> !resourceMemory.contains(o)) queries the same list that you are modifying in the terminal action, and it shouldn’t be hard to understand how this can break even with thread-safe collections:
Two or more threads may process the filter and find that the element is not contained in the list, then all of them will add the element, contradicting your obvious intention of not having duplicates.
You could resort to forEachOrdered which will perform the action in order and non-concurrently:
body.getSurroundings().parallelStream()
.filter(o -> o instanceof ResourcePoint)
.map(o -> (ResourcePoint)o)
.forEachOrdered(o -> {// not recommended, just for explanation
if(!resourceMemory.contains(o))
resourceMemory.add(o);
});
This will work and it’s obvious how you could add to another list within that action, but it’s far away from recommended coding style. Also, the fact that this terminal action synchronizes with all processing threads will destroy any potential benefit of parallel processing, especially as the most expensive operation of this stream pipeline is invoking contains on a LinkedList which will (must) happen single-threaded.
The correct way to collect stream elements into a list is via, as the name suggests, collect:
List<ResourcePoint> resourceMemory
=body.getSurroundings().parallelStream()
.filter(o -> o instanceof ResourcePoint)
.map(o -> (ResourcePoint)o)
.distinct() // no duplicates
.collect(Collectors.toList()); // collect into a list
This doesn’t return a LinkedList, but you should rethink carefully whether you really need a LinkedList. In 99% of all cases, you don’t. If you really need a LinkedList, you can replace Collectors.toList() with Collectors.toCollection(LinkedList::new).
Now if you really must add to an existing list created outside of your control, which might already contain elements, you should consider the fact mentioned above, that you have to ensure single-threaded access to a non-thread-safe list anyway, so there’s no benefit from doing it from within the parallel stream at all. In most cases, it’s more efficient to let the stream work independently from that list and add the result in a single threaded step afterwards:
Set<ResourcePoint> newElements=
body.getSurroundings().parallelStream()
.filter(o -> o instanceof ResourcePoint)
.map(o -> (ResourcePoint)o)
.collect(Collectors.toCollection(LinkedHashSet::new));
newElements.removeAll(resourceMemory);
resourceMemory.addAll(newElements);
Here, we collect into a LinkedHashSet which implies maintenance of the encounter order and sorting out duplicates within the new elements, then use removeAll on the new elements to remove existing elements of the target list (here we benefit from the hash set nature of the temporary collection), finally, the new elements are added to the target list, which, as explained, must happen single-threaded anyway for a target collection which isn’t thread safe.
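The collect-then-merge steps described above can be sketched as a small helper. I'm using strings instead of ResourcePoint, and the class and method names are invented for the example.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

final class MergeNewElementsDemo {

    // Appends to 'target' every candidate it does not contain yet,
    // without duplicates and in the candidates' encounter order.
    static void addNew(List<String> target, Collection<String> candidates) {
        // Parallel part: works independently of the target list.
        Set<String> fresh = candidates.parallelStream()
                .collect(Collectors.toCollection(LinkedHashSet::new));
        // Single-threaded part: the only code that touches the target.
        fresh.removeAll(target);
        target.addAll(fresh);
    }
}
```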
It’s easy to add the newElements to another target collection with this solution, much easier than writing a custom collector for producing two lists during the stream processing. But note that the stream operations as written above are way too cheap to assume any benefit from parallel processing. You would need a very large number of elements to compensate for the initial multi-threading overhead. It’s even possible that there is no number for which it ever pays off.
Instead of
.forEach(resourceMemory::add)
You could invoke
.forEach(o -> {
resourceMemory.add(o);
otherResource.add(o);
})
or put the add operations in a separate method so you could provide a method reference
.forEach(this::add)
void add(ResourcePoint p) {
resourceMemory.add(p);
otherResource.add(p);
}
But bear in mind that the order of insertion may be different with each run, as you use a parallel stream.
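If the target lists are not thread-safe, one possible alternative is to let the stream only collect and then add the result to both lists in plain single-threaded code. A rough sketch, with String standing in for ResourcePoint and all names made up:

```java
import java.util.List;
import java.util.stream.Collectors;

final class AddToTwoListsDemo {

    // Filters and de-duplicates in the stream, then adds the same
    // elements to both target lists outside of the stream.
    static void addToBoth(List<Object> surroundings,
                          List<String> first,
                          List<String> second) {
        List<String> matches = surroundings.parallelStream()
                .filter(o -> o instanceof String)
                .map(o -> (String) o)
                .distinct()
                .collect(Collectors.toList());
        first.addAll(matches);  // single-threaded, order is deterministic
        second.addAll(matches);
    }
}
```

This keeps the insertion order the same on every run, since collect preserves the encounter order of the source.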
