Understanding Spliterator, Collector and Stream in Java 8 - java

I am having trouble understanding the Stream interface in Java 8, especially where it has to do with the Spliterator and Collector interfaces. My problem is that I simply can't understand Spliterator and the Collector interfaces yet, and as a result, the Stream interface is still somewhat obscure to me.
What exactly is a Spliterator and a Collector, and how can I use them? If I am willing to write my own Spliterator or Collector (and probably my own Stream in that process), what should I do and not do?
I read some examples scattered around the web, but since everything here is still new and subject to changes, examples and tutorials are still very sparse.

You should almost certainly never have to deal with Spliterator as a user; it should only be necessary if you're writing Collection types yourself and also intending to optimize parallelized operations on them.
For what it's worth, a Spliterator is a way of operating over the elements of a collection in a way that it's easy to split off part of the collection, e.g. because you're parallelizing and want one thread to work on one part of the collection, one thread to work on another part, etc.
You should essentially never be saving values of type Stream to a variable, either. Stream is sort of like an Iterator, in that it's a one-time-use object that you'll almost always use in a fluent chain, as in the Javadoc example:
int sum = widgets.stream()
.filter(w -> w.getColor() == RED)
.mapToInt(w -> w.getWeight())
.sum();
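To make the "one-time-use" point concrete, here is a minimal sketch (the class and method names are made up for illustration) showing that a second terminal operation on the same Stream object is rejected:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class StreamReuseDemo {
    // Returns true if reusing an already consumed stream throws IllegalStateException.
    static boolean secondUseFails() {
        Stream<String> words = Stream.of("a", "b", "c");
        List<String> first = words.collect(Collectors.toList()); // consumes the stream
        try {
            words.collect(Collectors.toList()); // second terminal operation on the same object
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("second use fails: " + secondUseFails());
    }
}
```

This is why a Stream is normally built and consumed in one fluent chain rather than stored in a field or variable.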
Collector is the most generalized, abstract possible version of a "reduce" operation a la map/reduce; in particular, it needs to support parallelization and finalization steps. Examples of Collectors include:
summing, e.g. Collectors.reducing(0, (x, y) -> x + y)
StringBuilder appending, e.g. Collector.of(StringBuilder::new, StringBuilder::append, StringBuilder::append, StringBuilder::toString)
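As a rough illustration of the four parts a Collector is made of, the sketch below builds a summing collector by hand with Collector.of. The class name and the int[1] container are assumptions chosen for brevity; in real code Collectors.summingInt already does this.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collector;

class SumCollectorDemo {
    // A hand-written summing collector: the int[1] array is the mutable container.
    static Collector<Integer, int[], Integer> summing() {
        return Collector.of(
                () -> new int[1],                  // supplier: fresh accumulator
                (acc, x) -> acc[0] += x,           // accumulator: fold one element in
                (left, right) -> {                 // combiner: merge two partial results
                    left[0] += right[0];
                    return left;
                },
                acc -> acc[0]);                    // finisher: unwrap the result
    }

    static int sum(List<Integer> values, boolean parallel) {
        return (parallel ? values.parallelStream() : values.stream()).collect(summing());
    }

    public static void main(String[] args) {
        List<Integer> values = Arrays.asList(1, 2, 3, 4);
        System.out.println(sum(values, false) + " " + sum(values, true));
    }
}
```

The combiner is only used when the stream is processed in parts (typically in parallel); the finisher is the "finalization step" mentioned above.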

Spliterator basically means "splittable Iterator".
A single thread can traverse and process the entire Spliterator by itself, but the Spliterator also has a method trySplit() which will "split off" a section for someone else (typically another thread) to process, leaving the current Spliterator with less work.
Collector combines the specification of a reduce function (of map-reduce fame) with an initial value and a function to combine two results (thus enabling results from split-off streams of work to be combined).
For example, the most basic Collector would have an initial value of 0, add an integer onto an existing result, and 'combine' two results by adding them, thus summing a split stream of integers.
See:
Spliterator.trySplit()
Collector<T,A,R>
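The splitting idea can be tried directly on a list's own Spliterator. The following is only a sketch: the class and method names are invented, and both parts are drained on the calling thread for simplicity, whereas a parallel stream would hand them to different threads.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Spliterator;

class TrySplitDemo {
    // Splits the list's spliterator once and drains both parts.
    // Returns {sum of the split-off part, sum of what was left}.
    static int[] sumInTwoParts(List<Integer> numbers) {
        Spliterator<Integer> rest = numbers.spliterator();
        Spliterator<Integer> splitOff = rest.trySplit(); // may be null if the source refuses to split
        int[] sums = new int[2];
        if (splitOff != null) {
            splitOff.forEachRemaining(n -> sums[0] += n);
        }
        rest.forEachRemaining(n -> sums[1] += n);
        return sums;
    }

    public static void main(String[] args) {
        int[] sums = sumInTwoParts(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8));
        System.out.println(sums[0] + " + " + sums[1] + " = " + (sums[0] + sums[1]));
    }
}
```

Adding the two partial sums afterwards is exactly the job of a Collector's combine function.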

The following are examples of using the predefined collectors to perform common mutable reduction tasks:
// Accumulate names into a List
List<String> list = people.stream().map(Person::getName).collect(Collectors.toList());
// Accumulate names into a TreeSet
Set<String> set = people.stream().map(Person::getName).collect(Collectors.toCollection(TreeSet::new));
// Convert elements to strings and concatenate them, separated by commas
String joined = things.stream()
.map(Object::toString)
.collect(Collectors.joining(", "));
// Compute sum of salaries of employees
int total = employees.stream()
.collect(Collectors.summingInt(Employee::getSalary));
// Group employees by department
Map<Department, List<Employee>> byDept
= employees.stream()
.collect(Collectors.groupingBy(Employee::getDepartment));
// Compute sum of salaries by department
Map<Department, Integer> totalByDept
= employees.stream()
.collect(Collectors.groupingBy(Employee::getDepartment,
Collectors.summingInt(Employee::getSalary)));
// Partition students into passing and failing
Map<Boolean, List<Student>> passingFailing =
students.stream()
.collect(Collectors.partitioningBy(s -> s.getGrade() >= PASS_THRESHOLD));

The Spliterator interface is a core feature of Streams.
The stream() and parallelStream() default methods are declared in the Collection interface. Both obtain a Spliterator by calling spliterator():
...
default Stream<E> stream() {
return StreamSupport.stream(spliterator(), false);
}
default Stream<E> parallelStream() {
return StreamSupport.stream(spliterator(), true);
}
...
A Spliterator is an iterator that can also break its source into smaller parts, which can then be processed in parallel.
Two of its methods are the most important for understanding the Spliterator:
boolean tryAdvance(Consumer<? super T> action)
Unlike Iterator, which has separate hasNext() and next() calls, it tries to perform the given action on the next element in a single call.
If a remaining element existed and the action was performed, the method returns true. Otherwise it returns false, which means there are no more elements.
Spliterator<T> trySplit()
This method splits off part of the remaining elements into a new Spliterator, or returns null if the source cannot be split further. How the data is divided is up to the implementation (by file size, number of lines, etc.).
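For anyone who does want to write their own, a minimal sketch of a hand-written Spliterator over an integer range might look like the following. RangeSpliterator is a made-up class for illustration (the JDK already provides IntStream.range), and it assumes 0 <= lo <= hi.

```java
import java.util.Spliterator;
import java.util.function.Consumer;
import java.util.stream.StreamSupport;

class RangeSpliterator implements Spliterator<Integer> {
    private int lo;        // next value to hand out
    private final int hi;  // exclusive upper bound

    RangeSpliterator(int lo, int hi) {
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    public boolean tryAdvance(Consumer<? super Integer> action) {
        if (lo >= hi) {
            return false;          // nothing left
        }
        action.accept(lo++);       // process exactly one element
        return true;
    }

    @Override
    public Spliterator<Integer> trySplit() {
        int mid = (lo + hi) >>> 1;
        if (mid <= lo) {
            return null;           // too small to split further
        }
        RangeSpliterator prefix = new RangeSpliterator(lo, mid);
        lo = mid;                  // this spliterator keeps the second half
        return prefix;
    }

    @Override
    public long estimateSize() {
        return hi - lo;
    }

    @Override
    public int characteristics() {
        return ORDERED | SIZED | SUBSIZED | IMMUTABLE | NONNULL;
    }

    // Feeds the spliterator to a parallel stream, just as Collection.parallelStream() does.
    static int parallelSum(int from, int to) {
        return StreamSupport.stream(new RangeSpliterator(from, to), true)
                .mapToInt(Integer::intValue)
                .sum();
    }

    public static void main(String[] args) {
        System.out.println(parallelSum(0, 100));
    }
}
```

Reporting SIZED and SUBSIZED is only valid here because estimateSize() is exact before and after every split; a source that cannot know its size should not claim them.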

Related

What's the difference between list interface sort method and stream interface sorted method?

I'm interested in sorting a list of objects based on a date attribute of each object. I can either use the List sort method:
list.sort( (a, b) -> a.getDate().compareTo(b.getDate()) );
Or I can use the Stream sorted method:
List<E> l = list.stream()
.sorted( (a, b) -> a.getDate().compareTo(b.getDate()))
.collect(Collectors.toList());
Of the two options above, which should we use and why?
I know the former will update my original list and the latter will not update the original but instead give me a fresh new list object.
I don't care whether my original list gets updated or not. So which one is the better option and why?
If you only need to sort your List, and don't need any other stream operations (such as filtering, mapping, etc...), there's no point in adding the overhead of creating a Stream and then creating a new List. It would be more efficient to just sort the original List.
If you wish to know which is best, your best option is to benchmark it: you may reuse the JMH test from my answer.
It should be noted that:
The default List::sort uses Arrays::sort. It copies the list into an array before sorting (ArrayList overrides it to sort its backing array directly). It does not exist for other Collection types.
Stream::sorted is a stateful intermediate operation. This means the Stream needs to buffer its elements before it can pass any of them on.
Without benchmarking, I'd say that:
You should use collection.sort(). It is easier to read: collection.stream().sorted().collect(toList()) is way too long to read and, unless you format your code well, you might get a headache (I exaggerate) before understanding that this line is simply sorting.
sorted() on a Stream should be called:
if you filter out many elements, making the Stream effectively smaller than the collection (sorting N items then filtering N items is not the same as filtering N items then sorting K items with K <= N).
if you have a map transformation after the sort and you would lose the ability to sort by the original key.
If you use your stream with other intermediate operations, then sorted might be required / useful:
collection.stream() // Stream<U> #0
.filter(...) // Stream<U> #1
.sorted() // Stream<U> #2
.map(...) // Stream<V> #3
.collect(toList()) // List<V> sorted by U.
;
In that example, the filter is applied before the sort: stream #1 is smaller than #0, so the cost of sorting with the stream might be less than Collections.sort().
If all that you do is filter, you may also use a TreeSet (note that a TreeSet also drops duplicates) or a collectingAndThen operation:
collection.stream() // Stream<U> #0
.filter(...) // Stream<U> #1
.collect(toCollection(TreeSet::new))
;
Or:
collection.stream() // Stream<U>
.filter(...) // Stream<U>
.collect(collectingAndThen(toList(), list -> {
list.sort(null); // a null comparator means natural ordering
return list;
})); // List<U>
Streams have some overhead because they create many new objects, such as a concrete Stream, a Collector, and a new List. So if you just want to sort a list and don't care whether the original gets changed, use List.sort.
There is also Collections.sort, which is an older API. The difference between it and List.sort can be found here.
Stream.sorted is useful when you are doing other stream operations alongside sorting.
Your code can also be rewritten with Comparator:
list.sort(Comparator.comparing(YourClass::getDate));
The first one would be better in terms of performance. In the first one, the sort method just compares the elements of the list and orders them. The second one will create a stream from your list, sort it and create a new list from that stream.
In your case, since you can update the first list, the first approach is better, both in terms of performance and memory consumption. The second one is convenient if you need to end with a stream, or if you have a stream and want to end up with a sorted list.
Use the first method:
list.sort((a, b) -> a.getDate().compareTo(b.getDate()));
It is generally faster than the second one and does not create a new intermediate object. You could use the second method when you want to do some additional stream operations (e.g. filter, map).
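The behavioural difference the answers describe (in-place sort versus a new sorted list) can be seen in a small sketch. Event is a hypothetical element type standing in for the question's objects, and the class and method names are likewise invented.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

class SortDemo {
    // Hypothetical element type with a date attribute.
    static final class Event {
        private final LocalDate date;
        Event(LocalDate date) { this.date = date; }
        LocalDate getDate() { return date; }
    }

    static final Comparator<Event> BY_DATE = Comparator.comparing(Event::getDate);

    // Sorts the given list itself; no new list is created.
    static void sortInPlace(List<Event> events) {
        events.sort(BY_DATE);
    }

    // Leaves the input untouched and returns a new sorted list.
    static List<Event> sortedCopy(List<Event> events) {
        return events.stream().sorted(BY_DATE).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Event> events = new ArrayList<>(Arrays.asList(
                new Event(LocalDate.of(2020, 5, 1)),
                new Event(LocalDate.of(2019, 1, 1))));
        List<Event> copy = sortedCopy(events);
        System.out.println("original first: " + events.get(0).getDate());
        System.out.println("copy first:     " + copy.get(0).getDate());
        sortInPlace(events);
        System.out.println("after in-place: " + events.get(0).getDate());
    }
}
```

Which variant is faster for a given list size is still best settled with a benchmark, as suggested above.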

Should I use shared mutable variable update in Java 8 Streams

I am just iterating over the list below and adding to another shared mutable list via Java 8 streams.
List<String> list1 = Arrays.asList("A1","A2","A3","A4","A5","A6","A7","A8","B1","B2","B3");
List<String> list2 = new ArrayList<>();
Consumer<String> c = t -> list2.add(t.startsWith("A") ? t : "EMPTY");
list1.stream().forEach(c);
list1.parallelStream().forEach(c);
list1.forEach(c);
What is the difference between the above three iterations, and which one should we use? Are there any considerations?
Regardless of whether you use parallel or sequential Stream, you shouldn't use forEach when your goal is to generate a List. Use map with collect:
List<String> list2 =
list1.stream()
.map(item -> item.startsWith("A") ? item : "EMPTY")
.collect(Collectors.toList());
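A self-contained version of this suggestion, written as a sketch with invented class and method names, shows that the same pipeline gives the same list whether it runs sequentially or in parallel, because nothing outside the stream is mutated:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class CollectInsteadOfForEach {
    // Builds the result list without touching any shared mutable state.
    static List<String> relabel(List<String> input, boolean parallel) {
        Stream<String> stream = parallel ? input.parallelStream() : input.stream();
        return stream
                .map(item -> item.startsWith("A") ? item : "EMPTY")
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> list1 = Arrays.asList("A1", "A2", "B1", "A3", "B2");
        System.out.println(relabel(list1, false));
        System.out.println(relabel(list1, true));
    }
}
```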
Functionally speaking, for simple cases they are almost the same, but generally speaking, there are some hidden differences:
Let's start by quoting the Javadoc of Iterable.forEach, which states that it:
performs the given action for each element of the Iterable until all
elements have been processed or the action throws an exception.
and also we can iterate over a collection and perform a given action on each element – by just passing a class that implements the Consumer interface
void forEach(Consumer<? super T> action)
https://docs.oracle.com/javase/8/docs/api/java/lang/Iterable.html#forEach-java.util.function.Consumer-
The order of Stream.forEach is not guaranteed (for parallel streams it is explicitly nondeterministic), while Iterable.forEach is always executed in the iteration order of the Iterable.
If Iterable.forEach is iterating over a synchronized collection, Iterable.forEach takes the collection's lock once and holds it across all the calls to the action method. The Stream.forEach call uses the collection's spliterator, which does not lock.
The action specified in Stream.forEach is required to be non-interfering while Iterable.forEach is allowed to set values in the underlying ArrayList without problems.
In Java, Iterators returned by Collection classes, e.g. ArrayList, HashSet, Vector, etc., are fail fast. This means that if you try to add() or remove() from the underlying data structure while iterating it, you get a ConcurrentModificationException.
https://docs.oracle.com/javase/8/docs/api/java/util/ArrayList.html#fail-fast
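The fail-fast behaviour can be reproduced with a few lines. This is only a sketch with invented names; the Javadoc describes fail-fast detection as best-effort, so code should not rely on the exception for correctness.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.ConcurrentModificationException;
import java.util.List;

class FailFastDemo {
    // Returns true if removing inside a for-each loop triggers ConcurrentModificationException.
    static boolean removalDuringIterationFails() {
        List<String> items = new ArrayList<>(Arrays.asList("a", "b", "c"));
        try {
            for (String item : items) {
                if (item.equals("a")) {
                    items.remove(item); // structural change behind the iterator's back
                }
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    // A supported alternative: let the collection perform the removal itself.
    static List<String> removeSafely() {
        List<String> items = new ArrayList<>(Arrays.asList("a", "b", "c"));
        items.removeIf(item -> item.equals("a"));
        return items;
    }

    public static void main(String[] args) {
        System.out.println("throws: " + removalDuringIterationFails());
        System.out.println("removeIf result: " + removeSafely());
    }
}
```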
More Info:
What is the difference between .foreach and .stream().foreach?
What is difference between Collection.stream().forEach() and Collection.forEach()?
When working with streams, you should write your code in a way that if you switch to parallel streams, it does not produce the wrong results.
Imagine if in your code you were doing reading and writing on the same shared memory (list2) and you distribute your process into several threads (using parallel streams). Then you are DOOMED. Therefore you have several options.
make your shared memory (list2) thread-safe, for example by wrapping it with Collections.synchronizedList. (Putting the list in an AtomicReference is not enough: the reference becomes atomic, but the ArrayList behind it is still not thread-safe.)
List<String> list2 = Collections.synchronizedList(new ArrayList<>());
list2.add("newvalue");
or you can go with the purely functional approach (code with no side effects)
like @Eran's solution.

Java Parallel Stream Produce HashMap

I have the following test: it tests the integers ranging from 0 to max and, if an integer is validated, constructs the pair (vals[i], i). Finally, I want to produce a HashMap which uses vals[i] as the key, with a list of integers as the value. The code looks like this:
IntStream.range(0, max)
.parallel()
.filter(i-> sometest(i))
.mapToObj(i -> new Pair<>(vals[i],i))
.collect(groupingBy(Pair::getFirst, mapping(Pair::getSecond, toList())));
My question is, is it possible to use parallel streams to speed up the construction of that map?
Thanks.
If you just wanted to know how to better take advantage of parallelism, you could do something like:
ConcurrentMap<Integer, List<Integer>> map = IntStream.range(0, Integer.MAX_VALUE)
.parallel()
.filter(i -> i % 2 == 0)
.boxed()
.collect(Collectors.groupingByConcurrent(
i -> i / 3,
Collectors.mapping(i -> i, Collectors.toList())));
The intermediate creation of Pairs is unnecessary, and groupingByConcurrent accumulates to the new ConcurrentMap in parallel.
Keep in mind that with parallel stream you are stuck with the common ForkJoinPool. For parallelization it's preferable to use something more flexible like an ExecutorService instead of Java Streams.
These are the conditions you must satisfy so that you can perform a Concurrent Reduction, as stated in the Java Documentation about Parallelism:
The Java runtime performs a concurrent reduction if all of the
following are true for a particular pipeline that contains the collect
operation:
The stream is parallel.
The parameter of the collect operation, the collector, has the characteristic Collector.Characteristics.CONCURRENT. To determine the
characteristics of a collector, invoke the Collector.characteristics
method.
Either the stream is unordered, or the collector has the characteristic Collector.Characteristics.UNORDERED. To ensure that the
stream is unordered, invoke the BaseStream.unordered operation.
But whether it will speed up the construction of your Map will depend on other aspects, as mentioned by @Jigar Joshi, including (but not only):
How many elements you have to process
How many threads are already being used by your application
Sometimes the overhead of using parallelism (creating and stopping threads, making them communicate and synchronize, ...) is bigger than the gains.
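The conditions quoted above can be checked in code by asking a collector for its characteristics. The sketch below uses invented class and method names and a trivial classifier (i % 3) purely for illustration; it compares groupingBy with groupingByConcurrent and then runs a small concurrent grouping.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentMap;
import java.util.stream.Collector;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class ConcurrentReductionDemo {
    // Characteristics reported by the concurrent grouping collector.
    static Set<Collector.Characteristics> concurrentFlavor() {
        Collector<Integer, ?, ConcurrentMap<Integer, Long>> c =
                Collectors.groupingByConcurrent(i -> i % 3, Collectors.counting());
        return c.characteristics();
    }

    // Characteristics reported by the ordinary grouping collector.
    static Set<Collector.Characteristics> plainFlavor() {
        Collector<Integer, ?, Map<Integer, Long>> c =
                Collectors.groupingBy(i -> i % 3, Collectors.counting());
        return c.characteristics();
    }

    // Parallel stream + CONCURRENT and UNORDERED collector: eligible for a concurrent reduction.
    static ConcurrentMap<Integer, Long> countByRemainder(int max) {
        return IntStream.range(0, max)
                .parallel()
                .boxed()
                .collect(Collectors.groupingByConcurrent(i -> i % 3, Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println("groupingByConcurrent: " + concurrentFlavor());
        System.out.println("groupingBy:           " + plainFlavor());
        System.out.println(countByRemainder(1000));
    }
}
```

Because the collector is UNORDERED, the order in which values reach each group is not defined, which matters if the downstream collector is something like toList().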

How to do a parallel unique word count with Java 8 streams and lambdas?

What's the best way to do a parallel unique word count with Java 8 streams and lambdas?
I've come up with a couple, but I'm not convinced they are optimal.
I know the map reduce solution on Hadoop, and wonder if these give the same kind of parallelism.
// Map Reduce Word Count
Map<String, Integer> wordCount = Stream.of("dog","cat","dog","dog","cow","house","house").parallel().collect( Collectors.groupingBy(e->e,Collectors.summingInt(e -> 1)));
System.out.println("number of dogs = " + wordCount.get("dog"));
Map<Object, Object> wordCount2 = Stream.of("dog","cat","dog","dog","cow","house","house").parallel().collect(Collectors.toConcurrentMap(keyWord->keyWord, keyWord->1, (oldVal,newVal)->(int)oldVal+(int)newVal));
System.out.println("number of dogs = " + wordCount2.get("dog"));
Assume the real list would be much longer, possibly coming from a file or generated stream, and that I want to know the counts for all words, not just dog.
Have a look at the javadocs of Collectors.groupingBy
@implNote The returned Collector is not concurrent. For parallel stream
pipelines, the combiner function operates by merging the keys from one
map into another, which can be an expensive operation. If preservation
of the order in which elements are presented to the downstream
collector is not required, using groupingByConcurrent(Function,
Supplier, Collector) may offer better parallel performance.
Now, looking at Collectors.groupingByConcurrent, you'll see that it is more or less equivalent to your second approach:
Returns a concurrent Collector implementing a cascaded "group by"
operation on input elements of type T, grouping elements according to
a classification function, and then performing a reduction operation
on the values associated with a given key using the specified
downstream Collector. The ConcurrentMap produced by the Collector is
created with the supplied factory function.
The groupingBy and toMap might work slower on big datasets compared to groupingByConcurrent and toConcurrentMap. The best way to check whether groupingByConcurrent or toConcurrentMap is faster is to benchmark them by yourself on your own data sets. I think the results would be pretty much the same.
Note however that if you use a file as the source, you will probably get less speedup from parallelism, as in Java 8 Files.lines() and BufferedReader.lines() read files sequentially and parallelism is achieved by prebuffering blocks of lines into arrays and spawning new tasks. This does not always work efficiently, so the bottleneck would probably be in this procedure. In JDK 9 Files.lines() is optimized (for regular files less than 2 GB long), so you may get much better performance there.
As for generated sources, it depends on how you generate them. It is better to supply a good splitting strategy for your source. If you use Stream.iterate or Spliterators.spliterator(iterator, ...) or extend the AbstractSpliterator class, the default splitting strategy is the same: prebuffer some elements into an array to spawn a subtask.
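Putting the groupingByConcurrent advice together, a parallel word count could be sketched as follows. The class and method names are invented, and splitting on whitespace is an assumption about what counts as a word.

```java
import java.util.Arrays;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class ParallelWordCount {
    // Counts every word across all lines, accumulating into one shared ConcurrentMap.
    static ConcurrentMap<String, Long> count(Stream<String> lines) {
        return lines.parallel()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .filter(word -> !word.isEmpty()) // a leading space yields an empty first token
                .collect(Collectors.groupingByConcurrent(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        ConcurrentMap<String, Long> counts =
                count(Stream.of("dog cat dog", "dog cow house house"));
        System.out.println("number of dogs = " + counts.get("dog"));
    }
}
```

Whether this beats groupingBy or toConcurrentMap on a given data set is, as noted above, something to measure rather than assume.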
explaining Lee's code:
public static Map<String, Integer> wordCount(Stream<String> stream) {
return stream
.flatMap(s -> Stream.of(s.split("\\s+")))
.collect(Collectors.toMap(s -> s, s -> 1, Integer::sum));
}
s -> s: key mapper
s -> 1: value mapper
Integer::sum: merger function

Add objects from stream to two different lists simultaneously

How can I add objects from one stream to two different lists simultaneously
Currently I am doing
body.getSurroundings().parallelStream()
.filter(o -> o.getClass().equals(ResourcePoint.class))
.map(o -> (ResourcePoint)o)
.filter(o -> !resourceMemory.contains(o))
.forEach(resourceMemory::add);
to add objects from my stream into a linkedlist "resourceMemory", but I also want to add the same objects to another list simultaneously, but I can't find the syntax for it. Is it possible or do I need to have two copies of this code for each list?
There are several fundamental errors you should understand first, before trying to expand your code.
First of all, forEach does not guarantee a particular order of element processing, so it’s likely the wrong tool for adding to a List, even for sequential streams. Worse, it is completely wrong to use it with a parallel stream to add to a collection like LinkedList which is not thread safe, as the action will be performed concurrently.
But even if resourceMemory were a thread safe collection, your code would still be broken, as there is interference between your filter condition and the terminal action. .filter(o -> !resourceMemory.contains(o)) queries the same list which you are modifying in the terminal action, and it shouldn’t be hard to understand how this can break even with thread-safe collections:
Two or more threads may process the filter and find that the element is not contained in the list, then all of them will add the element, contradicting your obvious intention of not having duplicates.
You could resort to forEachOrdered which will perform the action in order and non-concurrently:
body.getSurroundings().parallelStream()
.filter(o -> o instanceof ResourcePoint)
.map(o -> (ResourcePoint)o)
.forEachOrdered(o -> {// not recommended, just for explanation
if(!resourceMemory.contains(o))
resourceMemory.add(o);
});
This will work and it’s obvious how you could add to another list within that action, but it’s far away from recommended coding style. Also, the fact that this terminal action synchronizes with all processing threads will destroy any potential benefit of parallel processing, especially as the most expensive operation of this stream pipeline is invoking contains on a LinkedList which will (must) happen single-threaded.
The correct way to collect stream elements into a list is via, as the name suggests, collect:
List<ResourcePoint> resourceMemory
=body.getSurroundings().parallelStream()
.filter(o -> o instanceof ResourcePoint)
.map(o -> (ResourcePoint)o)
.distinct() // no duplicates
.collect(Collectors.toList()); // collect into a list
This doesn’t return a LinkedList, but you should rethink carefully whether you really need a LinkedList. In 99% of all cases, you don’t. If you really need a LinkedList, you can replace Collectors.toList() with Collectors.toCollection(LinkedList::new).
Now if you really must add to an existing list created outside of your control, which might already contain elements, you should consider the fact mentioned above, that you have to ensure single-threaded access to a non-thread-safe list anyway, so there’s no benefit from doing it from within the parallel stream at all. In most cases, it’s more efficient to let the stream work independently from that list and add the result in a single threaded step afterwards:
Set<ResourcePoint> newElements=
body.getSurroundings().parallelStream()
.filter(o -> o instanceof ResourcePoint)
.map(o -> (ResourcePoint)o)
.collect(Collectors.toCollection(LinkedHashSet::new));
newElements.removeAll(resourceMemory);
resourceMemory.addAll(newElements);
Here, we collect into a LinkedHashSet which implies maintenance of the encounter order and sorting out duplicates within the new elements, then use removeAll on the new elements to remove existing elements of the target list (here we benefit from the hash set nature of the temporary collection), finally, the new elements are added to the target list, which, as explained, must happen single-threaded anyway for a target collection which isn’t thread safe.
It’s easy to add the newElements to another target collection with this solution, much easier than writing a custom collector for producing two lists during the stream processing. But note that the stream operations as written above are way too cheap to assume any benefit from parallel processing. You would need a very large number of elements to compensate for the initial multi-threading overhead. It’s even possible that there is no number for which it ever pays off.
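A runnable sketch of that last approach, feeding two target lists from one collected set, might look like this. String stands in for the question's ResourcePoint type, the class and method names are invented, and a sequential stream is used since the pipeline is too cheap to benefit from parallelism.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class AddToTwoLists {
    // Collects the distinct String elements in encounter order, minus those already known.
    static Set<String> newElements(List<Object> surroundings, List<String> existing) {
        Set<String> fresh = surroundings.stream()
                .filter(o -> o instanceof String)
                .map(o -> (String) o)
                .collect(Collectors.toCollection(LinkedHashSet::new));
        fresh.removeAll(existing);
        return fresh;
    }

    public static void main(String[] args) {
        List<Object> surroundings = Arrays.asList("b", 1, "a", "b", "c");
        List<String> resourceMemory = new ArrayList<>(Arrays.asList("a"));
        List<String> otherResource = new ArrayList<>();

        Set<String> fresh = newElements(surroundings, resourceMemory);
        // Both targets are filled afterwards, on a single thread.
        resourceMemory.addAll(fresh);
        otherResource.addAll(fresh);

        System.out.println(resourceMemory + " " + otherResource);
    }
}
```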
Instead of
.forEach(resourceMemory::add)
You could invoke
.forEach(o -> {
resourceMemory.add(o);
otherResource.add(o);
})
or put the add operations in a separate method so you could provide a method reference
.forEach(this::add)
void add(ResourcePoint p) {
resourceMemory.add(p);
otherResource.add(p);
}
But bear in mind that the order of insertion may be different with each run, as you use a parallel stream.
