I am a beginner in Java 8.
Non-interference is important to have consistent Java stream behaviour.
Imagine we are processing a large stream of data and during the process
the source is changed. The result will be unpredictable. This is
irrespective of the processing mode of the stream parallel or
sequential.
The source can be modified till the statement terminal operation is
invoked. Beyond that the source should not be modified till the stream
execution completes. So handling the concurrent modification in stream
source is critical to have a consistent stream performance.
The above quotations are taken from here.
Can someone give a simple example that sheds light on why mutating the stream source causes such big problems?
Well, the Oracle example is self-explanatory here. The first one is this:
List<String> l = new ArrayList<>(Arrays.asList("one", "two"));
Stream<String> sl = l.stream();
l.add("three");
String s = sl.collect(Collectors.joining(" "));
If you change l by adding one more element to it before you call the terminal operation (collect with Collectors.joining), you are fine; but notice that the Stream then consists of three elements, not the two the list held at the time you created the Stream via l.stream().
On the other hand doing this:
List<String> list = new ArrayList<>();
list.add("test");
list.forEach(x -> list.add(x));
will throw a ConcurrentModificationException since you can't change the source.
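To make the two cases above easy to compare, here is a minimal, self-contained sketch. The class and method names (InterferenceDemo, joinAfterAdd, addDuringForEach) are made up for illustration, and the second method uses list.stream().forEach rather than list.forEach so that the interference happens inside a stream pipeline. ArrayList's detection of the modification is documented as best-effort, so treat the exception as what you would normally observe rather than a guarantee.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.ConcurrentModificationException;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class InterferenceDemo {

    // Modifying the source BEFORE the terminal operation is allowed:
    // the stream only binds to the list's contents when collect() starts.
    static String joinAfterAdd() {
        List<String> l = new ArrayList<>(Arrays.asList("one", "two"));
        Stream<String> sl = l.stream();
        l.add("three");
        return sl.collect(Collectors.joining(" "));
    }

    // Modifying the source DURING the terminal operation is interference.
    // Returns true if the ArrayList noticed the change and threw.
    static boolean addDuringForEach() {
        List<String> list = new ArrayList<>(Arrays.asList("test"));
        try {
            list.stream().forEach(x -> list.add(x));
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(joinAfterAdd());     // one two three
        System.out.println(addDuringForEach()); // true
    }
}
```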
And now suppose you have an underlying source that can handle concurrent adds:
ConcurrentHashMap<String, Integer> cMap = new ConcurrentHashMap<>();
cMap.put("one", 1);
cMap.forEach((key, value) -> cMap.put(key + key, value + value));
System.out.println(cMap);
What should the output be here? When I run this it is:
{oneoneoneoneoneoneoneone=8, one=1, oneone=2, oneoneoneone=4}
Changing the key to zx (cMap.put("zx", 1)), the result is now:
{zxzx=2, zx=1}
The result is not consistent.
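One way to get a predictable result, sketched under the assumption that you only want to derive new entries from the entries that exist when you start, is to separate reading from writing: compute the additions in a pipeline that does not touch the map, then apply them once the pipeline has finished. The names SnapshotThenUpdate and doubleEntries are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

class SnapshotThenUpdate {

    static Map<String, Integer> doubleEntries(String key, int value) {
        ConcurrentHashMap<String, Integer> cMap = new ConcurrentHashMap<>();
        cMap.put(key, value);

        // Step 1: read only. The new entries go into a separate map,
        // so the stream source is never modified while it is traversed.
        Map<String, Integer> additions = cMap.entrySet().stream()
                .collect(Collectors.toMap(
                        e -> e.getKey() + e.getKey(),
                        e -> e.getValue() + e.getValue()));

        // Step 2: the pipeline is finished, so mutating the map is safe now.
        cMap.putAll(additions);
        return cMap;
    }

    public static void main(String[] args) {
        // Both maps end up with exactly two entries, whatever the key is.
        System.out.println(doubleEntries("one", 1));
        System.out.println(doubleEntries("zx", 1));
    }
}
```

With this split the number of entries no longer depends on how the keys happen to hash.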
List<String> lines = FileUtils.readLines(dataFile,"UTF-8");
lines.parallelStream().forEachOrdered(line->processRecord(line, dataFileName));
Here processRecord returns a String value. How can I store these results in a collection or a file?
Could someone help me with this? Thanks in advance.
Instead of using forEachOrdered you could use map to execute and store the result for each of the inputs. You can then collect these into a new list to store the results.
Example:
List<String> result = lines.parallelStream()
.map(line->line.toUpperCase())
.collect(Collectors.toCollection(ArrayList::new));
[Answer is based on comments]
Assuming you want to process a list of Strings and store the result of each call together with its input, you can follow the approach below.
List<String> lines = FileUtils.readLines(dataFile,"UTF-8");
Map<String, String> result = lines.parallelStream()
.collect(Collectors.toMap(line -> line, line -> processRecord(line, dataFileName)));
Here the result (a simple Map<String, String>) contains both: each key is a single input line, and each value is the response that processRecord returned for that line.
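One caveat with the two-argument Collectors.toMap is that it throws an IllegalStateException when two lines are identical, because they produce the same key. The sketch below uses the four-argument overload to keep the first result for a repeated line and to preserve the input order. The processRecord body here is only a stand-in for the real method from the question, and CollectResults and processAll are made-up names.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class CollectResults {

    // Hypothetical stand-in for the question's processRecord method.
    static String processRecord(String line, String dataFileName) {
        return dataFileName + ":" + line.toUpperCase();
    }

    static Map<String, String> processAll(List<String> lines, String dataFileName) {
        return lines.parallelStream()
                .collect(Collectors.toMap(
                        line -> line,
                        line -> processRecord(line, dataFileName),
                        (first, second) -> first, // duplicate line: keep the first result
                        LinkedHashMap::new));     // keep the lines in input order
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("alpha", "beta", "alpha");
        // prints {alpha=data.txt:ALPHA, beta=data.txt:BETA}
        System.out.println(processAll(lines, "data.txt"));
    }
}
```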
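How many threads will execute this method in parallel? (from comments)
The method parallelStream() uses threads from the common ForkJoinPool (ForkJoinPool.commonPool()). By default the parallelism of this pool is one less than the number of cores on your system, because the thread that invokes the terminal operation also takes part in the work.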
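If you want to check this on your own machine, a small sketch like the following can help. ParallelismCheck and its methods are made-up names; ForkJoinPool.getCommonPoolParallelism() and Runtime.availableProcessors() are standard JDK calls. The set of thread names will vary from run to run, so no fixed output is shown.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ForkJoinPool;
import java.util.stream.IntStream;

class ParallelismCheck {

    // Target parallelism of the pool that parallel streams use by default.
    static int commonPoolParallelism() {
        return ForkJoinPool.getCommonPoolParallelism();
    }

    // Names of the threads that actually processed elements of a parallel stream.
    static Set<String> threadsUsed(int elements) {
        Set<String> names = ConcurrentHashMap.newKeySet(); // thread-safe set
        IntStream.range(0, elements)
                .parallel()
                .forEach(i -> names.add(Thread.currentThread().getName()));
        return names;
    }

    public static void main(String[] args) {
        System.out.println("cores:       " + Runtime.getRuntime().availableProcessors());
        System.out.println("parallelism: " + commonPoolParallelism());
        System.out.println("threads:     " + threadsUsed(10_000));
    }
}
```

The default can be changed with the system property java.util.concurrent.ForkJoinPool.common.parallelism.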
I'm trying to use Java 8 lambda expressions and streams to parse some logs. I have one giant log file that contains run after run. I want to split it into separate collections, one for each run. I do not know in advance how many runs the log has. And to exercise my very weak lambda muscles, I'd like to do it in one pass through the list.
Here is my current implementation:
List<String> lines = readLines(fileDirectory);
Pattern runStartPattern = Pattern.compile("INFO: \\d\\d:\\d\\d:\\d\\d: Starting");
LinkedList<List<String>> testRuns = new LinkedList<>();
List<String> currentTestRun = new LinkedList<>(); // In case log starts in middle of run
testRuns.add(currentTestRun);
for(String line:lines){
if(runStartPattern.matcher(line).find()){
currentTestRun = new ArrayList<>();
testRuns.add(currentTestRun);
}
currentTestRun.add(line);
}
if(testRuns.getFirst().size()==0){ // In case log starts at a run
testRuns.removeFirst();
}
Basically something like TomekRekawek's solution here but with an unknown partition size to begin with.
There's no standard way to easily achieve this in Stream API, but my StreamEx library has a groupRuns method which can solve this pretty easily:
List<List<String>> testLines = StreamEx.of(lines)
.groupRuns((a, b) -> !runStartPattern.matcher(b).find())
.toList();
It groups the input elements based on a predicate that is applied to each pair of adjacent elements. Here we don't want to group two lines together if the second line matches the runStartPattern. This works correctly regardless of whether the log starts in the middle of a run or not. This feature also works nicely with parallel streams.
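If adding a library is not an option, a plain-JDK alternative is sketched below. It is a different technique from groupRuns, not a reimplementation of it: it first finds the index of every line that starts a run, then cuts the list at those indices with subList. It assumes the lines are in a random-access list such as an ArrayList; RunSplitter, RUN_START and splitRuns are names invented for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class RunSplitter {

    static final Pattern RUN_START =
            Pattern.compile("INFO: \\d\\d:\\d\\d:\\d\\d: Starting");

    static List<List<String>> splitRuns(List<String> lines, Pattern runStart) {
        // Boundaries: index 0, every line that starts a run, and the end of the list.
        int[] starts = IntStream.rangeClosed(0, lines.size())
                .filter(i -> i == 0 || i == lines.size()
                        || runStart.matcher(lines.get(i)).find())
                .toArray();
        // Each pair of neighbouring boundaries delimits one run (a view, not a copy).
        return IntStream.range(0, starts.length - 1)
                .mapToObj(i -> lines.subList(starts[i], starts[i + 1]))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "tail of an earlier run",
                "INFO: 10:00:00: Starting",
                "step 1",
                "INFO: 11:30:05: Starting",
                "step 2");
        splitRuns(lines, RUN_START).forEach(System.out::println);
    }
}
```

Because subList returns views, copy each run (for example with new ArrayList<>(run)) if the original list may change later.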
Context
I've stumbled upon a rather annoying problem: I have a program with many data sources that can each stream the same type of element, and I want to "map" every available element in the program (element order doesn't matter).
Therefore I've tried to reduce my Stream<Stream<T>> streamOfStreamOfT; into a simple Stream<T> streamOfT; using streamOfT = streamOfStreamOfT.reduce(Stream.empty(), Stream::concat);
Since element order is not important to me, I've tried to parallelize the reduce operation with .parallel(): streamOfT = streamOfStreamOfT.parallel().reduce(Stream.empty(), Stream::concat); But this triggers a java.lang.IllegalStateException: stream has already been operated upon or closed
Example
To experience it yourself just play with the following main (java 1.8u20) by commenting / uncommenting the .parallel()
public static void main(String[] args) {
// GIVEN
List<Stream<Integer>> listOfStreamOfInts = new ArrayList<>();
for (int j = 0; j < 10; j++) {
IntStream intStreamOf10Ints = IntStream.iterate(0, i -> i + 1)
.limit(10);
Stream<Integer> genericStreamOf10Ints = StreamSupport.stream(
intStreamOf10Ints.spliterator(), true);
listOfStreamOfInts.add(genericStreamOf10Ints);
}
Stream<Stream<Integer>> streamOfStreamOfInts = listOfStreamOfInts
.stream();
// WHEN
Stream<Integer> streamOfInts = streamOfStreamOfInts
// ////////////////
// PROBLEM
// |
// V
.parallel()
.reduce(Stream.empty(), Stream::concat);
// THEN
System.out.println(streamOfInts.map(String::valueOf).collect(
joining(", ")));
}
Question
Can someone explain this limitation, or suggest a better way of handling parallel reduction of a stream of streams?
Edit 1
Following the comments from @Smutje and @LouisWasserman, it seems that .flatMap(Function.identity()) is a better option that tolerates .parallel() streams.
The form of reduce you are using takes an identity value and an associative combining function. But Stream.empty() is not a value; it has state. Streams are not data structures like arrays or collections; they are carriers for pushing data through possibly-parallel aggregate operations, and they have some state (like whether the stream has been consumed or not.) Think about how this works; you're going to build a tree where the same "empty" stream appears in more than one leaf. When you try to use this stateful not-an-identity twice (which won't happen sequentially, but will happen in parallel), the second time you try and traverse through that empty stream, it will quite correctly be seen to be already used.
So the problem is, you're simply using this reduce method incorrectly. The problem is not with the parallelism; it is simply that the parallelism exposed the underlying problem.
Secondly, even if this "worked" the way you think it should, you would only get parallelism building the tree that represents the flattened stream-of-streams; when you go to do the joining, that's a sequential stream pipeline there. Ooops.
Thirdly, even if this "worked" the way you think it should, you're going to add a lot of element-access overhead by building up concatenated streams, and you're not going to get the benefit of parallelism that you are seeking.
The simple answer is to flatten the streams:
String joined = streamOfStreams.parallel()
.flatMap(s -> s)
.collect(joining(", "));
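For completeness, here is a self-contained version of the flatMap approach that can be run as is. It is only a sketch: FlattenStreams and flattenAndJoin are hypothetical names, and the inner streams are built from small lists instead of the ten IntStreams used in the question.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class FlattenStreams {

    static String flattenAndJoin(List<List<Integer>> groups) {
        return groups.stream()
                .map(List::stream)   // Stream<Stream<Integer>>, as in the question
                .parallel()          // safe here: no shared stream is reused
                .flatMap(s -> s)     // flatten to Stream<Integer>
                .map(String::valueOf)
                .collect(Collectors.joining(", "));
    }

    public static void main(String[] args) {
        List<List<Integer>> groups = Arrays.asList(
                Arrays.asList(0, 1, 2),
                Arrays.asList(3, 4),
                Arrays.asList(5));
        System.out.println(flattenAndJoin(groups)); // 0, 1, 2, 3, 4, 5
    }
}
```

Each inner stream is consumed exactly once, so the IllegalStateException from the reduce version cannot occur.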
I am currently reading the O'Reilly book Java 8 Lambdas, which is really good. I came across an example like this.
I have the following:
private final BiFunction<StringBuilder, String, StringBuilder> accumulator =
    (builder, name) -> {
        if (builder.length() > 0) builder.append(",");
        builder.append("Mister:").append(name);
        return builder;
    };
final Stream<String> stringStream = Stream.of(
    "John Lennon", "Paul Mccartney", "George Harrison", "Ringo Starr");
final StringBuilder reduce = stringStream
    .filter(a -> a != null)
    .reduce(new StringBuilder(), accumulator, (left, right) -> left.append(right));
System.out.println(reduce);
System.out.println(reduce.length());
This produces the right output:
Mister:John Lennon,Mister:Paul Mccartney,Mister:George Harrison,Mister:Ringo Starr
My question is about the last parameter of the reduce method, which is a BinaryOperator.
What is this parameter used for? If I change it to
.reduce(new StringBuilder(),accumulator,(left,right)->new StringBuilder());
the output is the same; if I pass null, a NullPointerException is thrown.
What is this parameter used for?
Update
Why do I get different results when I run it on a parallelStream?
First run:
returned StringBuilder length = 420
Second run:
returned StringBuilder length = 546
Third run:
returned StringBuilder length = 348
and so on. Why is this? Should it not return all the values on every run?
The method reduce in the interface Stream is overloaded. The parameters for the method with three arguments are:
identity
accumulator
combiner
The combiner supports parallel execution. Apparently it is not used for sequential streams, although there is no such guarantee. If you change your stream into a parallel stream, I guess you will see a difference:
Stream<String>stringStream = Stream.of(
"John Lennon", "Paul Mccartney", "George Harrison", "Ringo Starr")
.parallel();
Here is an example of how the combiner can be used to transform a sequential reduction into a reduction that supports parallel execution. Take a stream with four Strings, and let acc be an abbreviation for accumulator.apply. Then the result of the reduction can be computed as follows:
acc(acc(acc(acc(identity, "one"), "two"), "three"), "four");
With a compatible combiner, the above expression can be transformed into the following expression. Now it is possible to execute the two sub-expressions in different threads.
combiner.apply(
acc(acc(identity, "one"), "two"),
acc(acc(identity, "three"), "four"));
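The same split can be tried out with a reduction whose identity is an immutable value. The sketch below is not the code from the question: it sums the lengths of the four strings, with 0 as the identity, an accumulator that adds one word's length, and Integer::sum as the combiner that merges two partial sums. ReduceWithCombiner and totalLength are made-up names.

```java
import java.util.stream.Stream;

class ReduceWithCombiner {

    static int totalLength(Stream<String> words) {
        return words.reduce(
                0,                             // identity: an immutable value
                (sum, w) -> sum + w.length(),  // accumulator: fold one element into a partial result
                Integer::sum);                 // combiner: merge two partial results
    }

    public static void main(String[] args) {
        // Sequential and parallel runs give the same answer.
        System.out.println(totalLength(Stream.of("one", "two", "three", "four")));            // 15
        System.out.println(totalLength(Stream.of("one", "two", "three", "four").parallel())); // 15
    }
}
```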
Regarding your second question, I use a simplified accumulator to explain the problem:
BiFunction<StringBuilder,String,StringBuilder> accumulator =
(builder,name) -> builder.append(name);
According to the Javadoc for Stream::reduce, the accumulator has to be associative. In this case, that would imply that the following two expressions return the same result:
acc(acc(acc(identity, "one"), "two"), "three")
acc(acc(identity, "one"), acc(acc(identity, "two"), "three"))
That's not true for the above accumulator. The problem is that you are mutating the object referenced by identity. That's a bad idea for the reduce operation. Here are two alternative implementations which should work:
// identity = ""
BiFunction<String,String,String> accumulator = String::concat;
// identity = null
BiFunction<StringBuilder,String,StringBuilder> accumulator =
(builder,name) -> builder == null
? new StringBuilder(name) : builder.append(name);
nosid's answer got it mostly right (+1) but I wanted to amplify a particular point.
The identity parameter to reduce must be an identity value. It's ok if it's an object, but if it is, it should be immutable. If the "identity" object is mutated, it's no longer an identity! For more discussion of this point, see my answer to a related question.
It looks like this example originated from Example 5-19 of Richard Warburton, Java 8 Lambdas, O'Reilly 2014. If so, I shall have to have a word about this with the good Dr. Warburton.
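If the goal is really to build up a mutable container such as a StringBuilder, the Stream API has a separate operation for that: collect, which creates a fresh container per thread instead of sharing one identity object. The sketch below shows two variants under that assumption. It is not the book's code; MisterJoiner, withJoining and withCollect are hypothetical names.

```java
import java.util.Objects;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class MisterJoiner {

    // Simplest variant: let Collectors.joining handle the separators.
    static String withJoining(Stream<String> names) {
        return names.filter(Objects::nonNull)
                .map(name -> "Mister:" + name)
                .collect(Collectors.joining(","));
    }

    // Mutable reduction: the supplier creates a NEW StringBuilder for every
    // partial result, so nothing is shared between threads.
    static String withCollect(Stream<String> names) {
        return names.filter(Objects::nonNull)
                .collect(StringBuilder::new,
                        (sb, name) -> {
                            if (sb.length() > 0) sb.append(",");
                            sb.append("Mister:").append(name);
                        },
                        (left, right) -> {
                            if (left.length() > 0 && right.length() > 0) left.append(",");
                            left.append(right);
                        })
                .toString();
    }

    public static void main(String[] args) {
        System.out.println(withJoining(Stream.of("John Lennon", "Paul Mccartney").parallel()));
        System.out.println(withCollect(Stream.of("John Lennon", "Paul Mccartney").parallel()));
        // both print: Mister:John Lennon,Mister:Paul Mccartney
    }
}
```

Both variants return the same string for sequential and parallel streams, because the combiner merges partial results in encounter order.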