I have the following task: test the integers ranging from 0 to max, and for each i that passes the test, construct the pair (vals[i], i). Finally, I want to produce a HashMap that uses vals[i] as the key and the list of matching integers i as the value. The code looks like:
IntStream.range(0, max)
    .parallel()
    .filter(i -> sometest(i))
    .mapToObj(i -> new Pair<>(vals[i], i))
    .collect(groupingBy(Pair::getFirst, mapping(Pair::getSecond, toList())));
My question is, is it possible to use parallel streams to speed up the construction of that map?
Thanks.
If you just wanted to know how to better take advantage of parallelism, you could do something like:
ConcurrentMap<Integer, List<Integer>> map = IntStream.range(0, Integer.MAX_VALUE)
.parallel()
.filter(i -> i % 2 == 0)
.boxed()
.collect(Collectors.groupingByConcurrent(
i -> i / 3,
Collectors.mapping(i -> i, Collectors.toList())));
The intermediate creation of Pairs is unnecessary, and groupingByConcurrent accumulates to the new ConcurrentMap in parallel.
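Applied to your pipeline, the same idea drops the Pair step entirely. A sketch, assuming vals is an int[] and sometest is the predicate from your question:
ConcurrentMap<Integer, List<Integer>> byVal = IntStream.range(0, max)
        .parallel()
        .filter(i -> sometest(i))
        .boxed()
        .collect(Collectors.groupingByConcurrent(
                i -> vals[i],            // key: vals[i]
                Collectors.toList()));   // value: the indices i that passed the test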
Keep in mind that with parallel streams you are stuck with the common ForkJoinPool. For parallelization it's often preferable to use something more flexible, like an ExecutorService, instead of Java streams.
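If you want to keep the stream code but stay off the common pool, one commonly cited workaround is to run the terminal operation from inside your own ForkJoinPool. This relies on an undocumented implementation detail of the fork/join framework, so treat it as a sketch:
ForkJoinPool pool = new ForkJoinPool(4);   // dedicated pool with 4 workers
try {
    ConcurrentMap<Integer, List<Integer>> map = pool.submit(() ->
            IntStream.range(0, 1_000_000)
                     .parallel()
                     .filter(i -> i % 2 == 0)
                     .boxed()
                     .collect(Collectors.groupingByConcurrent(i -> i / 3))
    ).get();   // the parallel tasks run in 'pool', not in the common pool
} catch (InterruptedException | ExecutionException e) {
    throw new RuntimeException(e);
} finally {
    pool.shutdown();
}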
These are the conditions you must satisfy so that you can perform a Concurrent Reduction, as stated in the Java Documentation about Parallelism:
The Java runtime performs a concurrent reduction if all of the following are true for a particular pipeline that contains the collect operation:
The stream is parallel.
The parameter of the collect operation, the collector, has the characteristic Collector.Characteristics.CONCURRENT. To determine the characteristics of a collector, invoke the Collector.characteristics method.
Either the stream is unordered, or the collector has the characteristic Collector.Characteristics.UNORDERED. To ensure that the stream is unordered, invoke the BaseStream.unordered operation.
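A small illustration of those three points (a sketch with a toy classifier, not taken from the question):
// groupingByConcurrent reports both CONCURRENT and UNORDERED (and IDENTITY_FINISH).
Collector<Integer, ?, ConcurrentMap<Integer, List<Integer>>> collector =
        Collectors.groupingByConcurrent(i -> i % 3);
System.out.println(collector.characteristics());

// parallel() satisfies the first condition; unordered() would satisfy the third
// even for a collector that is CONCURRENT but not UNORDERED.
ConcurrentMap<Integer, List<Integer>> map = IntStream.range(0, 100)
        .boxed()
        .parallel()
        .unordered()
        .collect(collector);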
But whether it will actually speed up the construction of your Map depends on other aspects, as mentioned by Jigar Joshi, including (but not only):
How many elements you have to process
How many threads are already being used by your application
Sometimes the overhead of using parallelism (creating and stopping threads, making them communicate and synchronize, ...) is bigger than the gains.
Related
Is the following statement true?
The sorted() operation is a “stateful intermediate operation”, which means that subsequent operations no longer operate on the backing collection, but on an internal state.
(Source and source - they seem to copy from each other or come from the same source.)
Disclaimer: I am aware the following snippets are not legitimate usages of the Java Stream API. Don't use them in production code.
I tested Stream::sorted with a snippet from the sources above:
final List<Integer> list = IntStream.range(0, 10).boxed().collect(Collectors.toList());
list.stream()
.filter(i -> i > 5)
.sorted()
.forEach(list::remove);
System.out.println(list); // Prints [0, 1, 2, 3, 4, 5]
It works. I replaced Stream::sorted with Stream::distinct, Stream::limit and Stream::skip:
final List<Integer> list = IntStream.range(0, 10).boxed().collect(Collectors.toList());
list.stream()
.filter(i -> i > 5)
.distinct()
.forEach(list::remove); // Throws NullPointerException
To my surprise, the NullPointerException is thrown.
All the tested methods match the characteristics of a stateful intermediate operation. Yet this unique behavior of Stream::sorted is not documented, nor does the Stream operations and pipelines section explain whether stateful intermediate operations really guarantee a new source collection.
Where my confusion comes from and what is the explanation of the behavior above?
The API documentation makes no such guarantee “that subsequent operations no longer operate on the backing collection”, hence, you should never rely on such a behavior of a particular implementation.
Your example happens to do the desired thing by accident; there’s not even a guarantee that the List created by collect(Collectors.toList()) supports the remove operation.
To show a counter-example:
Set<Integer> set = IntStream.range(0, 10).boxed()
.collect(Collectors.toCollection(TreeSet::new));
set.stream()
.filter(i -> i > 5)
.sorted()
.forEach(set::remove);
throws a ConcurrentModificationException. The reason is that the implementation optimizes this scenario, as the source is already sorted. In principle, it could do the same optimization to your original example, as forEach is explicitly performing the action in no specified order, hence, the sorting is unnecessary.
There are other optimizations imaginable, e.g. sorted().findFirst() could get converted to a “find the minimum” operation, without the need to copy the element into a new storage for sorting.
So the bottom line is, when relying on unspecified behavior, what may happen to work today, may break tomorrow, when new optimizations are added.
Well, sorted has to be a full copying barrier for the stream pipeline; after all, your source might not be sorted. But this is not documented as such, so do not rely on it.
This is not just about sorted per se, but about what other optimizations can be applied to the stream pipeline so that sorted can be skipped entirely. For example:
List<Integer> sortedList = IntStream.range(0, 10)
.boxed()
.collect(Collectors.toList());
StreamSupport.stream(() -> sortedList.spliterator(), Spliterator.SORTED, false)
.sorted()
.forEach(sortedList::remove); // fails with CME, thus no copying occurred
Of course, sorted needs to be a full barrier and stop to perform the entire sort, unless it can be skipped; that is why the documentation makes no such promises, so that we don't run into weird surprises.
distinct, on the other hand, does not have to be a full barrier; all distinct does is check one element at a time for uniqueness, so as soon as a single element has been checked (and found to be unique) it is passed to the next stage, without acting as a full barrier. Either way, this is not documented either...
You shouldn't have brought up the cases with a terminal operation forEach(list::remove) because list::remove is an interfering function and it violates the "non-interference" principle for terminal actions.
It's vital to follow the rules before wondering why an incorrect code snippet causes unexpected (or undocumented) behaviour.
I believe that list::remove is the root of the problem here. You wouldn't have noticed the difference between the operations for this scenario if you'd written a proper action for forEach.
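For instance, a non-interfering version of the same clean-up (a sketch) either collects the elements to remove first and mutates the list outside the stream, or skips the stream entirely:
List<Integer> list = IntStream.range(0, 10).boxed()
        .collect(Collectors.toCollection(ArrayList::new)); // explicitly mutable

// Collect first, mutate afterwards - the stream never touches its own source.
List<Integer> toRemove = list.stream()
        .filter(i -> i > 5)
        .collect(Collectors.toList());
list.removeAll(toRemove);

// Or simply: list.removeIf(i -> i > 5);

System.out.println(list); // [0, 1, 2, 3, 4, 5]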
I have two lists, one of strings (counties) and one of objects (txcArray). I need to iterate through both lists and compare an instance of counties with an instance of txcArray; if they match, increment a counter, and if they don't, move on. I need to do this using Java 8 lambda expressions, and this is what I have so far.
counties.stream().forEach(a -> {
    txcArray.stream()
        .filter(b -> b.getCounty().equals(a))
        .map(Map<String,Integer>)
});
Your mistake is using forEach.
List<Long> counts = counties.stream()
.map(a -> txcArray.stream().filter(b -> b.getCounty().equals(a)).count())
.collect(Collectors.toList());
However, this is not very efficient, performing counties.size() × txcArray.size() operations. This can get out of hand easily when the lists are larger.
It’s better to use
Map<String, Long> map = txcArray.stream()
.collect(Collectors.groupingBy(b -> b.getCounty(), Collectors.counting()));
List<Long> counts = counties.stream()
.map(a -> map.getOrDefault(a, 0L))
.collect(Collectors.toList());
This will perform counties.size() + txcArray.size() operations, which will be more efficient for larger lists, therefore, preferable, even if it’s not a single stream operation but using an intermediate storage.
I am trying to understand java-8 streams in detail.
From oracle documentation page on streams:
Streams differ from collections in several ways:
No storage. A stream is not a data structure that stores elements; instead, it conveys elements from a source such as a data structure, an array, a generator function, or an I/O channel, through a pipeline of computational operations.
Stream operations and pipelines
Stream operations are divided into intermediate and terminal operations, and are combined to form stream pipelines.
A stream pipeline consists of a source (such as a Collection, an array, a generator function, or an I/O channel); followed by zero or more intermediate operations such as Stream.filter or Stream.map; and a terminal operation such as Stream.forEach or Stream.reduce.
Intermediate operations return a new stream
Apart from documentation, I have gone through related SE question:
How does streams in Java affect memory consumption?
Everywhere it was stated that additional memory is not consumed, due to the pipelining of stream operations: the original stream is passed through a pipeline.
One working example from Benjamin's blog:
List<String> myList =
Arrays.asList("a1", "a2", "b1", "c2", "c1");
myList
.stream()
.filter(s -> s.startsWith("c"))
.map(String::toUpperCase)
.sorted()
.forEach(System.out::println);
But when intermediate operations like filter, map and sorted return a new stream, how come memory consumption does not increase? Am I missing something here?
I think you interpreted "no storage" section of the documentation too literally, as "no memory increase." This interpretation is incorrect: "no storage" means "no storage for stream elements". Stream object itself represents a fixed overhead, in the same way as an empty collection has some overhead, so the size of the stream itself does not count.
But when intermediate operations like filter, map and sorted returns new stream, how come it does not increase memory consumption?
It does. However, the increase in size is fixed, i.e. an O(1) increase. This is in contrast with collections, where the increase for making a copy of a collection of n elements is O(n).
Try reading http://www.oracle.com/technetwork/articles/java/ma14-java-se-8-streams-2177646.html and/or http://winterbe.com/posts/2014/07/31/java8-stream-tutorial-examples/; the concepts you're trying to figure out are explained fairly well there, I think.
Basically, for most intermediate operations the work does not happen all at once per operation. Elements are processed one at a time through all of the intermediate operations and are then either discarded, or put into a collection, added to a sum, printed, etc., depending on the terminal operation. If the terminal operation is a collect, there will of course be some memory overhead when building the new collection, but nothing is stored in the stream itself. This is also (partly) why you cannot iterate over a stream twice.
There are, however, some operations, such as stream.sorted(comparator), that may need to keep some state during processing.
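You can see both behaviours with a small trace (a sketch using peek for logging): the stateless stages pull one element at a time, while sorted has to buffer everything it receives before anything reaches the later stages.
Stream.of("d", "a", "c", "b")
      .peek(s -> System.out.println("entering pipeline: " + s))
      .filter(s -> !s.equals("d"))
      .sorted()                                        // buffers all filtered elements
      .peek(s -> System.out.println("after sorted: " + s))
      .forEach(s -> System.out.println("consumed: " + s));
// All "entering pipeline" lines print first; only then do the
// "after sorted"/"consumed" lines appear, element by element, in sorted order.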
What's the best way to do a parallel unique word count with Java 8 streams and lambdas?
I've come up with a couple, but I'm not convinced they are optimal.
I know the map reduce solution on Hadoop, and wonder if these give the same kind of parallelism.
// Map Reduce Word Count
Map<String, Integer> wordCount = Stream.of("dog", "cat", "dog", "dog", "cow", "house", "house")
        .parallel()
        .collect(Collectors.groupingBy(e -> e, Collectors.summingInt(e -> 1)));
System.out.println("number of dogs = " + wordCount.get("dog"));

Map<String, Integer> wordCount2 = Stream.of("dog", "cat", "dog", "dog", "cow", "house", "house")
        .parallel()
        .collect(Collectors.toConcurrentMap(keyWord -> keyWord, keyWord -> 1, (oldVal, newVal) -> oldVal + newVal));
System.out.println("number of dogs = " + wordCount2.get("dog"));
Assume the real list would be much longer, possibly coming from a file or generated stream, and that I want to know the counts for all words, not just dog.
Have a look at the javadocs of Collectors.groupingBy
@implNote The returned Collector is not concurrent. For parallel stream pipelines, the combiner function operates by merging the keys from one map into another, which can be an expensive operation. If preservation of the order in which elements are presented to the downstream collector is not required, using groupingByConcurrent(Function, Supplier, Collector) may offer better parallel performance.
Now, looking at Collectors.groupingByConcurrent, you'll see that it is more or less equivalent to your second approach:
Returns a concurrent Collector implementing a cascaded "group by" operation on input elements of type T, grouping elements according to a classification function, and then performing a reduction operation on the values associated with a given key using the specified downstream Collector. The ConcurrentMap produced by the Collector is created with the supplied factory function.
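So your first snippet rewritten with the concurrent collector would look roughly like this (a sketch; Collectors.counting() is used here in place of summingInt(e -> 1), so the values come out as Long):
ConcurrentMap<String, Long> wordCount = Stream
        .of("dog", "cat", "dog", "dog", "cow", "house", "house")
        .parallel()
        .collect(Collectors.groupingByConcurrent(w -> w, Collectors.counting()));
System.out.println("number of dogs = " + wordCount.get("dog")); // 3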
groupingBy and toMap might be slower on big datasets compared to groupingByConcurrent and toConcurrentMap. The best way to check whether groupingByConcurrent or toConcurrentMap is faster is to benchmark them yourself on your own data sets. I think the results would be pretty much the same.
Note however that if you use a file as the source, you may get less speedup from parallelism, as in Java 8 Files.lines() and BufferedReader.lines() read files sequentially, and parallelism is achieved by prebuffering blocks of lines into arrays and spawning new tasks. This does not always work efficiently, so the bottleneck may well be in this procedure. In JDK 9 Files.lines() is optimized (for regular files less than 2 GB long), so you may get much better performance there.
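For example, a file-based variant might look like this (a sketch; the file name and the whitespace splitting are assumptions, not from the question):
try (Stream<String> lines = Files.lines(Paths.get("words.txt"))) {
    ConcurrentMap<String, Long> wordCount = lines
            .parallel()
            .flatMap(line -> Stream.of(line.split("\\s+")))
            .filter(word -> !word.isEmpty())
            .collect(Collectors.groupingByConcurrent(w -> w, Collectors.counting()));
    System.out.println("number of dogs = " + wordCount.getOrDefault("dog", 0L));
} catch (IOException e) {
    e.printStackTrace();
}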
As for generated sources, it depends on how you generate them. It is better if you supply a good splitting strategy for your source. If you use Stream.iterate, or Spliterators.spliterator(iterator, ...), or extend the AbstractSpliterator class, the default splitting strategy is the same: prebuffer some elements into an array to spawn a subtask.
Explaining Lee's code:
public static Map<String, Integer> wordCount(Stream<String> stream) {
return stream
.flatMap(s -> Stream.of(s.split("\\s+")))
.collect(Collectors.toMap(s -> s, s -> 1, Integer::sum));
}
s -> s: key mapper
s -> 1: value mapper
Integer::sum: merger function
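A quick usage sketch (the input strings are made up; the iteration order of the resulting HashMap is not guaranteed):
Map<String, Integer> counts = wordCount(Stream.of("dog cat", "dog dog cow"));
System.out.println(counts); // e.g. {cat=1, cow=1, dog=3}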
I am having trouble understanding the Stream interface in Java 8, especially where it has to do with the Spliterator and Collector interfaces. My problem is that I simply can't understand Spliterator and the Collector interfaces yet, and as a result, the Stream interface is still somewhat obscure to me.
What exactly is a Spliterator and a Collector, and how can I use them? If I am willing to write my own Spliterator or Collector (and probably my own Stream in that process), what should I do and not do?
I read some examples scattered around the web, but since everything here is still new and subject to changes, examples and tutorials are still very sparse.
You should almost certainly never have to deal with Spliterator as a user; it should only be necessary if you're writing Collection types yourself and also intending to optimize parallelized operations on them.
For what it's worth, a Spliterator is a way of operating over the elements of a collection in a way that it's easy to split off part of the collection, e.g. because you're parallelizing and want one thread to work on one part of the collection, one thread to work on another part, etc.
You should essentially never be saving values of type Stream to a variable, either. Stream is sort of like an Iterator, in that it's a one-time-use object that you'll almost always use in a fluent chain, as in the Javadoc example:
int sum = widgets.stream()
.filter(w -> w.getColor() == RED)
.mapToInt(w -> w.getWeight())
.sum();
Collector is the most generalized, abstract possible version of a "reduce" operation a la map/reduce; in particular, it needs to support parallelization and finalization steps. Examples of Collectors include:
summing, e.g. Collectors.reducing(0, (x, y) -> x + y)
StringBuilder appending, e.g. Collector.of(StringBuilder::new, StringBuilder::append, StringBuilder::append, StringBuilder::toString)
Spliterator basically means "splittable Iterator".
Single thread can traverse/process the entire Spliterator itself, but the Spliterator also has a method trySplit() which will "split off" a section for someone else (typically, another thread) to process -- leaving the current spliterator with less work.
Collector combines the specification of a reduce function (of map-reduce fame) with an initial value and a function to combine two results (thus enabling results from spliterated streams of work to be combined).
For example, the most basic Collector would have an initial value of 0, add an integer onto an existing result, and would 'combine' two results by adding them, thus summing a spliterated stream of integers.
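That description maps onto Collector.of roughly like this (a sketch of a hand-rolled summing collector, not the JDK's own implementation):
Collector<Integer, int[], Integer> summing = Collector.of(
        () -> new int[1],                                        // supplier: the initial value 0
        (box, i) -> box[0] += i,                                 // accumulator: fold one element in
        (left, right) -> { left[0] += right[0]; return left; },  // combiner: merge two partial sums
        box -> box[0]);                                          // finisher: unwrap the result

int sum = Stream.of(1, 2, 3, 4).parallel().collect(summing);     // 10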
See:
Spliterator.trySplit()
Collector<T,A,R>
The following are examples of using the predefined collectors to perform common mutable reduction tasks:
// Accumulate names into a List
List<String> list = people.stream().map(Person::getName).collect(Collectors.toList());
// Accumulate names into a TreeSet
Set<String> set = people.stream().map(Person::getName).collect(Collectors.toCollection(TreeSet::new));
// Convert elements to strings and concatenate them, separated by commas
String joined = things.stream()
.map(Object::toString)
.collect(Collectors.joining(", "));
// Compute sum of salaries of employees
int total = employees.stream()
.collect(Collectors.summingInt(Employee::getSalary));
// Group employees by department
Map<Department, List<Employee>> byDept
= employees.stream()
.collect(Collectors.groupingBy(Employee::getDepartment));
// Compute sum of salaries by department
Map<Department, Integer> totalByDept
= employees.stream()
.collect(Collectors.groupingBy(Employee::getDepartment,
Collectors.summingInt(Employee::getSalary)));
// Partition students into passing and failing
Map<Boolean, List<Student>> passingFailing =
students.stream()
.collect(Collectors.partitioningBy(s -> s.getGrade() >= PASS_THRESHOLD));
The Spliterator interface is a core feature of streams.
The stream() and parallelStream() default methods are provided by the Collection interface. These methods use a Spliterator through the call to spliterator():
...
default Stream<E> stream() {
return StreamSupport.stream(spliterator(), false);
}
default Stream<E> parallelStream() {
return StreamSupport.stream(spliterator(), true);
}
...
A Spliterator is an internal iterator that breaks the stream into smaller parts. These smaller parts can be processed in parallel.
Among its other methods, two are the most important for understanding the Spliterator:
boolean tryAdvance(Consumer<? super T> action)
Unlike Iterator, it tries to perform the given action on the next element.
If the action was performed successfully, the method returns true. Otherwise it returns false, which means there is no element left and the end of the stream has been reached.
Spliterator<T> trySplit()
This method allows splitting a set of data into many smaller sets according to some criterion (file size, number of lines, etc.).
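A minimal sketch of both methods in action (the element values are made up):
List<String> names = Arrays.asList("Ann", "Bob", "Cid", "Dee");
Spliterator<String> first = names.spliterator();

// trySplit() hands roughly half of the remaining elements to a new Spliterator.
Spliterator<String> second = first.trySplit();

// tryAdvance() processes one element at a time and returns false when exhausted.
while (second != null && second.tryAdvance(n -> System.out.println("split part: " + n))) { }
while (first.tryAdvance(n -> System.out.println("remaining: " + n))) { }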