List<String> lines = FileUtils.readLines(dataFile,"UTF-8");
lines.parallelStream().forEachOrdered(line->processRecord(line, dataFileName));
Here processRecord returns a String value. How can I store those results in a collection or a file?
Could you help me with this? Thanks in advance.
Instead of using forEachOrdered you could use map to execute and store the result for each of the inputs. You can then collect these into a new list to store the results.
Example:
List<String> result = lines.parallelStream()
    .map(line -> line.toUpperCase())
    .collect(Collectors.toCollection(ArrayList::new));
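Applied to the code in the question (a sketch, assuming processRecord and dataFileName exist as declared there), this becomes:
List<String> results = lines.parallelStream()
    .map(line -> processRecord(line, dataFileName))
    .collect(Collectors.toList());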
[Answer is based on comments]
Assuming you want to process a list of Strings and also want to store the response of each processing step together with its input, you can follow the approach below.
List<String> lines = FileUtils.readLines(dataFile, "UTF-8");
Map<String, String> result = lines.parallelStream()
    .collect(Collectors.toMap(line -> line, line -> processRecord(line, dataFileName)));
Here, the result (a simple Map<Key, Value>) contains both: each single line as the key, and the response returned by the processRecord method for THAT line as the value.
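Note that Collectors.toMap throws an IllegalStateException if the file contains duplicate lines; passing a merge function as a third argument avoids that. If you also need the responses in a file rather than a collection (as the question asks), a minimal sketch, assuming an output path of your choosing:
List<String> responses = lines.parallelStream()
    .map(line -> processRecord(line, dataFileName))
    .collect(Collectors.toList());
// writes one response per line; "results.txt" is just a placeholder path,
// and Files.write throws IOException, so declare or handle it in the caller
Files.write(Paths.get("results.txt"), responses, StandardCharsets.UTF_8);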
How many threads will execute this method in parallel? (from the comments)
The parallelStream() method uses threads from the common ForkJoinPool, whose default size is based on the number of cores on your system.
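If you want to verify or tune that, a quick check (ForkJoinPool here is java.util.concurrent.ForkJoinPool; the system property is a standard JDK switch, not anything specific to this code):
// parallelism of the common pool that parallel streams use by default
System.out.println(ForkJoinPool.getCommonPoolParallelism());
System.out.println(Runtime.getRuntime().availableProcessors());
// it can be overridden at JVM startup, e.g.
// -Djava.util.concurrent.ForkJoinPool.common.parallelism=4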
Related
The question is straightforward: why can't we use StringBuilder(...) as the identity in the reduce(...) operation of Java 8 streams, while string1.concat(string2) can be used as the identity function?
string1.concat(string2) can be seen as similar to builder.append(string) (though it is understood that there are a few differences between these operations), but I am not able to understand the difference in the reduce operation. Consider the following example:
List<String> list = Arrays.asList("1", "2", "3");
// Example using the string concatenation operation
System.out.println(list.stream().parallel()
    .reduce("", (s1, s2) -> s1 + s2, (s1, s2) -> s1 + s2));

// The same example, using the StringBuilder
System.out.println(list.stream().parallel()
    .reduce(new StringBuilder(""),
        (builder, s) -> builder.append(s),
        (builder1, builder2) -> builder1.append(builder2)));

// Using the actual concat(...) method
System.out.println(list.stream().parallel()
    .reduce("", (s1, s2) -> s1.concat(s2), (s1, s2) -> s1.concat(s2)));
Here is the output after executing above lines:
123
321321321321 // output when StringBuilder is used as the identity (varies from run to run)
123
builder.append(string) is an associative operation, just as str1.concat(str2) is. So why does concat work while append doesn't?
Yes, append is indeed associative, but that is not the only requirement for the function passed as the accumulator and combiner. According to the docs, they have to be:
Associative
Non-interfering
Stateless
append is not stateless. It is stateful. When you do sb.append("Hello"), not only does it return a StringBuilder with Hello appended to the end, it also changes the contents (i.e. the state) of sb.
Also from the docs:
Stream pipeline results may be nondeterministic or incorrect if the behavioral parameters to the stream operations are stateful. A stateful lambda (or other object implementing the appropriate functional interface) is one whose result depends on any state which might change during the execution of the stream pipeline.
Also because of this, new StringBuilder() is not a valid identity: once the accumulator or the combiner has been applied, something has been appended to the originally empty string builder, and the following equation, which every identity must satisfy, no longer holds:
combiner.apply(u, accumulator.apply(identity, t)) == accumulator.apply(u, t)
It is possible that the parallel stream makes use of the old string builders after calling the accumulators and/or combiners, and expects their contents to not be changed. However, the accumulators and combiners mutate the string builders, causing the stream to produce incorrect results.
On the other hand, concat satisfies all three of the above. It is stateless because it does not change the string on which it is called; it just returns a new, concatenated string. (String is immutable anyway and can't be changed :D)
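A tiny illustration of that difference (plain Java, just to show the mutation):
String s = "foo";
String t = s.concat("bar");
System.out.println(s);        // foo    - s is unchanged, concat is stateless
System.out.println(t);        // foobar

StringBuilder sb = new StringBuilder("foo");
StringBuilder tb = sb.append("bar");
System.out.println(sb);       // foobar - sb itself was mutated by append
System.out.println(sb == tb); // true   - append returns the very same instance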
Anyway, this is a use case for mutable reduction with collect, where the supplier gives each thread its own fresh StringBuilder, so mutating it is safe:
System.out.println(list.stream().parallel()
    .collect(
        StringBuilder::new,
        StringBuilder::append,
        StringBuilder::append
    )
);
After reading the docs and doing many tests, I think reduce works roughly like the following steps:
there will be multiple threads doing the reduction, and every thread does a partial reduce;
for the identity, there is only one instance, and every accumulator call uses this same identity instance;
each accumulation first combines the identity instance with a string element to get a StringBuilder;
then all these StringBuilders are combined.
So the problem is that every accumulation of the identity instance with a string element changes the identity; after the first accumulation, the identity is not an identity anymore.
For example, consider a list with 2 elements, {"1", "2"}.
There will be 2 threads; each thread does one accumulation, and one of them does the final combine.
Thread A accumulates the identity with element "1"; the result is a StringBuilder whose content is "1" (it is still the identity instance, because the object returned by StringBuilder.append is itself), but the identity's content has also changed to "1". Then thread B accumulates the identity with element "2"; the result is "12", not "2" anymore.
The combine of these two accumulation results, which are both the identity instance itself, therefore yields "1212".
It is like the following code snippet:
StringBuilder identity = new StringBuilder();
StringBuilder accumulate1 = identity.append("1");
StringBuilder accumulate2 = identity.append("2");
StringBuilder combine = accumulate1.append(accumulate2);
// combine, accumulate1 and accumulate2 are all the same identity instance, and the result is "1212"
return combine;
For more elements, because the threads run in a nondeterministic order, the result will be different every time.
After we know the reason, if we fix the accumulator to the following
new StringBuilder(builder).append(s)
then the full line of code will look like:
System.out.println(list.stream().parallel().reduce(new StringBuilder(),
    (builder, s) -> new StringBuilder(builder).append(s),
    (builder1, builder2) -> new StringBuilder(builder1).append(builder2)));
Then there is no issue anymore, because the accumulator no longer changes the identity instance and returns a new StringBuilder every time. But it is not worth doing this, as there is no benefit compared with the String concat approach.
Edit: Thanks to @Holger's example; it seems that if there is a filter operation, some accumulator calls may be skipped, so the combiner function also needs to be changed to
new StringBuilder(builder1).append(builder2)
Don't use .reduce() when there is already an implementation (or write your own .collect(), like in Sweeper's answer).
List<String> list = Arrays.asList("1", "2", "3");
// Example using the string concatenation operation
System.out.println(list.stream()
.parallel()
.collect(Collectors.joining())
);
// prints "123"
Edit (this will not work for parallel streams)
It depends on the implementation of .joining():
final List<String> list = Arrays.asList("1", "2", "3");
System.out.println(list.stream().reduce(new StringBuilder(),
        StringBuilder::append,
        StringBuilder::append)
    .toString()
);
// prints "123"
I came across follow code snippet of Apache Spark:
JavaRDD<String> lines = new JavaSparkContext(sparkSession.sparkContext()).textFile("src\\main\\resources\\data.txt");
JavaPairRDD<String, Integer> pairs = lines.mapToPair(s -> new Tuple2(s, 1));
System.out.println(pairs.collect());
JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b);
System.out.println("Reduced data: " + counts.collect());
My data.txt is as follows:
Mahesh
Mahesh
Ganesh
Ashok
Abnave
Ganesh
Mahesh
The output is:
[(Mahesh,1), (Mahesh,1), (Ganesh,1), (Ashok,1), (Abnave,1), (Ganesh,1), (Mahesh,1)]
Reduced data: [(Ganesh,2), (Abnave,1), (Mahesh,3), (Ashok,1)]
While I understand how the first line of output is obtained, I don't understand how the second line is obtained, that is, how the JavaPairRDD<String, Integer> counts is formed by reduceByKey.
I found that the signature of reduceByKey() is as follows:
public JavaPairRDD<K,V> reduceByKey(Function2<V,V,V> func)
The [signature](http://spark.apache.org/docs/1.2.0/api/java/org/apache/spark/api/java/function/Function2.html#call(T1, T2)) of Function2.call() is as follows:
R call(T1 v1, T2 v2) throws Exception
The explanation of reduceByKey() reads as follows:
Merge the values for each key using an associative reduce function. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/ parallelism level.
Now this explanation sounds somewhat confusing to me. Maybe there is something more to the functionality of reduceByKey(). By looking at the input and output of reduceByKey() and Function2.call(), I feel that reduceByKey() somehow sends values with the same key to call() in pairs, but that simply is not clear. Can anyone explain precisely how reduceByKey() and Function2.call() work together?
As its name implies, reduceByKey() reduces data based on the lambda function you pass to it.
In your example, this function is a simple adder: for a and b, return a + b.
The best way to understand how the result is formed is to imagine what happens internally. The ByKey() part groups your records based on their key values. In your example, you'll have 4 different sets of pairs:
Set 1: ((Mahesh, 1), (Mahesh, 1), (Mahesh, 1))
Set 2: ((Ganesh, 1), (Ganesh, 1))
Set 3: ((Ashok, 1))
Set 4: ((Abnave, 1))
Now, the reduce part will try to reduce the previous 4 sets using the lambda function (the adder):
For Set 1: (Mahesh, 1 + 1 + 1) -> (Mahesh, 3)
For Set 2: (Ganesh, 1 + 1) -> (Ganesh, 2)
For Set 3: (Ashok, 1) -> (Ashok, 1) (nothing to add)
For Set 4: (Abnave, 1) -> (Abnave, 1) (nothing to add)
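For intuition only, here is a plain-Java (non-Spark) sketch of the same idea, using the word list from the question: map each word to a value of 1 and merge values of equal keys with the adder (a, b) -> a + b.
Map<String, Integer> counts = Stream
    .of("Mahesh", "Mahesh", "Ganesh", "Ashok", "Abnave", "Ganesh", "Mahesh")
    .collect(Collectors.toMap(word -> word, word -> 1, (a, b) -> a + b));
System.out.println(counts); // {Abnave=1, Ashok=1, Ganesh=2, Mahesh=3} (order may differ)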
Function signatures can sometimes be confusing, as they tend to be more generic.
I'm thinking that you probably understand groupByKey? groupByKey groups all values for a certain key into a list (or iterable) so that you can do something with that - like, say, sum (or count) the values. Basically, what sum does is to reduce a list of many values into a single value. It does so by iteratively adding two values to yield one value and that is what Function2 needs to do when you write your own. It needs to take in two values and return one value.
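For illustration, the same adder written as an explicit Function2 (org.apache.spark.api.java.function.Function2) instead of a lambda; the lambda in the question is equivalent:
Function2<Integer, Integer, Integer> adder = new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer a, Integer b) {
        return a + b; // take in two values, return one value
    }
};
JavaPairRDD<String, Integer> counts = pairs.reduceByKey(adder);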
reduceByKey does the same as groupByKey, BUT it does what is called a "map-side reduce" before shuffling data around. Because Spark distributes data across many different machines to allow for parallel processing, there is no guarantee that data with the same key is placed on the same machine. Spark thus has to shuffle data around, and the more data that needs to be shuffled the longer our computations take, so it's a good idea to shuffle as little data as possible.
In a map-side reduce, Spark will first sum all the values for a given key locally on the executors before it sends (shuffles) the result around for the final sum to be computed. This means that much less data - a single value instead of a list of values - needs to be sent between the different machines in the cluster, and for this reason reduceByKey is most often preferable to groupByKey.
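For comparison, a hedged sketch of the groupByKey equivalent for the pairs RDD from the question; it produces the same counts, but every (word, 1) pair has to be shuffled before any summing happens:
JavaPairRDD<String, Integer> countsViaGroup = pairs
    .groupByKey()          // shuffles all (word, 1) pairs across the cluster
    .mapValues(values -> { // then sums the grouped values per key
        int sum = 0;
        for (Integer v : values) {
            sum += v;
        }
        return sum;
    });
System.out.println(countsViaGroup.collect());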
For a more detailed description, I can recommend this article :)
I am a beginner in Java 8.
Non-interference is important to have consistent Java stream behaviour.
Imagine we are processing a large stream of data and during the process the source is changed. The result will be unpredictable. This is irrespective of the processing mode of the stream, parallel or sequential.
The source can be modified until the terminal operation is invoked. Beyond that, the source should not be modified until the stream execution completes. So handling concurrent modification of the stream source is critical to have consistent stream performance.
The above quotations are taken from here.
Can someone give a simple example that would shed light on why mutating the stream source causes such big problems?
Well, the Oracle example is self-explanatory here. The first one is this:
List<String> l = new ArrayList<>(Arrays.asList("one", "two"));
Stream<String> sl = l.stream();
l.add("three");
String s = sl.collect(Collectors.joining(" "));
If you change l by adding one more element to it before you call the terminal operation (Collectors.joining), you are fine; but notice that the Stream then consists of three elements, not two, even though you created the Stream via l.stream() before adding the third element.
On the other hand doing this:
List<String> list = new ArrayList<>();
list.add("test");
list.forEach(x -> list.add(x));
will throw a ConcurrentModificationException since you can't change the source.
And now suppose you have an underlying source that can handle concurrent adds:
ConcurrentHashMap<String, Integer> cMap = new ConcurrentHashMap<>();
cMap.put("one", 1);
cMap.forEach((key, value) -> cMap.put(key + key, value + value));
System.out.println(cMap);
What should the output be here? When I run this it is:
{oneoneoneoneoneoneoneone=8, one=1, oneone=2, oneoneoneone=4}
Changing the key to zx (cMap.put("zx", 1)), the result is now:
{zxzx=2, zx=1}
The result is not consistent.
I'm trying to use Java 8 lambda expressions and streams to parse some logs. I have one giant log file that contains run after run. I want to split it into separate collections, one for each run. I do not know how many runs the log has in advance. And to exercise my very weak lambda expression muscles I'd like to do it in one pass through the list.
Here is my current implementation:
List<String> lines = readLines(fileDirectory);
Pattern runStartPattern = Pattern.compile("INFO: \\d\\d:\\d\\d:\\d\\d: Starting");
LinkedList<List<String>> testRuns = new LinkedList<>();
List<String> currentTestRun = new LinkedList<>(); // In case log starts in middle of run
testRuns.add(currentTestRun);
for (String line : lines) {
    if (runStartPattern.matcher(line).find()) {
        currentTestRun = new ArrayList<>();
        testRuns.add(currentTestRun);
    }
    currentTestRun.add(line);
}
if (testRuns.getFirst().size() == 0) { // In case log starts at a run
    testRuns.removeFirst();
}
Basically something like TomekRekawek's solution here but with an unknown partition size to begin with.
There's no standard way to easily achieve this in the Stream API, but my StreamEx library has a groupRuns method which can solve this pretty easily:
List<List<String>> testLines = StreamEx.of(lines)
    .groupRuns((a, b) -> !runStartPattern.matcher(b).find())
    .toList();
It groups the input elements based on a predicate which is applied to pairs of adjacent elements. Here we don't want to group two lines together if the second line matches the runStartPattern. This works correctly regardless of whether the log starts in the middle of a run or not. This feature works nicely with parallel streams as well.
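For a quick way to see what groupRuns produces, a small sketch (assuming the StreamEx library, one.util.streamex, is on the classpath; the sample lines are made up):
Pattern runStartPattern = Pattern.compile("INFO: \\d\\d:\\d\\d:\\d\\d: Starting");
List<String> lines = Arrays.asList(
    "tail of a previous run",      // log starts in the middle of a run
    "INFO: 10:00:00: Starting",
    "step one",
    "INFO: 10:05:00: Starting",
    "step one",
    "step two");

List<List<String>> testRuns = StreamEx.of(lines)
    .groupRuns((a, b) -> !runStartPattern.matcher(b).find())
    .toList();

testRuns.forEach(System.out::println);
// [tail of a previous run]
// [INFO: 10:00:00: Starting, step one]
// [INFO: 10:05:00: Starting, step one, step two]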