Bug in parallelStream in Java

Can someone tell me why this is happening and whether it's expected behaviour or a bug?
List<Integer> a = Arrays.asList(1, 1, 3, 3);
a.parallelStream().filter(Objects::nonNull)
    .filter(value -> value > 2)
    .reduce(1, Integer::sum);
Answer: 10
But if I use stream() instead of parallelStream() I get the right and expected answer, 7.

The first argument to reduce is called "identity" and not "initialValue".
1 is not an identity element for addition; 1 is the identity for multiplication.
So you need to provide 0 if you want to sum the elements.
Java uses "identity" instead of "initialValue" because this little trick allows reduce to be parallelized easily.
In parallel execution, each thread will run the reduce on a part of the stream, and when the threads are done, their partial results will be combined using the very same reduce function.
Roughly, it will look something like this:
mainThread:
start thread1;
start thread2;
wait till both are finished;
thread1:
return sum(1, 3); // your reduce function applied to a part of the stream
thread2:
return sum(1, 3);
// when thread1 and thread2 are finished:
mainThread:
return sum(sum(1, resultOfThread1), sum(1, resultOfThread2));
= sum(sum(1, 4), sum(1, 4))
= sum(5, 5)
= 10
I hope you can see what happens and why the result is not what you expected.
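To make the fix concrete, here is a minimal sketch using the same data as above (class name is just for illustration): with the additive identity 0, sequential and parallel reduction agree, and any extra offset can be applied after the reduction.
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

public class ReduceIdentityDemo {
    public static void main(String[] args) {
        List<Integer> a = Arrays.asList(1, 1, 3, 3);

        // With the additive identity 0, both pipelines sum the surviving elements [3, 3] to 6.
        int sequential = a.stream()
                .filter(Objects::nonNull)
                .filter(value -> value > 2)
                .reduce(0, Integer::sum);
        int parallel = a.parallelStream()
                .filter(Objects::nonNull)
                .filter(value -> value > 2)
                .reduce(0, Integer::sum);
        System.out.println(sequential + " " + parallel); // 6 6

        // If the intent really was "sum plus an initial 1", apply the offset after reducing.
        System.out.println(1 + parallel); // 7
    }
}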

Related

How to handle concurrency in API?

Context:
I have a storage/DB where I store a <String, List<Integer>> mapping, i.e. <uniqueName, numbersPresentForTheName>:
I am writing an API that does the following:
How can we make sure this does not happen?
The usual answer is that you use a compare-and-update operation, rather than just an update operation.
t = 1  -> call readStorage
t = 10 -> get response from readStorage: <ABCD, [1,2,3]>
t = 11 -> compareAndUpdateStorage ABCD, [1,2,3] -> [1,2,3,4]
t = 5  -> Request 1 to the API: <ABCD, [5]>
t = 6  -> call readStorage
t = 11 -> get response from readStorage: <ABCD, [1,2,3]> (note that we didn't get 4, because the first call hasn't written it yet)
t = 13 -> compareAndUpdateStorage ABCD, [1,2,3] -> [1,2,3,5] (this call will fail, because the current value is [1,2,3,4])
In other words, we're trying to address the lost edit problem by ensuring that the first edit always wins.
That's the first piece; the rest of the work is choosing an appropriate retry strategy.
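A minimal in-memory sketch of that compare-and-update-with-retry idea, using ConcurrentMap.replace(key, expected, updated) as a stand-in for the storage layer (a real DB would typically use a version column or a conditional UPDATE instead); the class and method names are just for illustration:
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class CompareAndUpdateDemo {
    private final ConcurrentMap<String, List<Integer>> storage = new ConcurrentHashMap<>();

    public void append(String name, int number) {
        while (true) {
            List<Integer> current = storage.get(name);
            if (current == null) {
                // putIfAbsent only wins if no other writer created the entry first.
                if (storage.putIfAbsent(name, List.of(number)) == null) {
                    return;
                }
                continue; // lost the race to create the entry; retry
            }
            List<Integer> updated = new ArrayList<>(current);
            updated.add(number);
            // replace() succeeds only if the stored value still equals `current`,
            // i.e. no concurrent writer changed it since we read it.
            if (storage.replace(name, current, updated)) {
                return;
            }
            // otherwise: someone else updated first; re-read and retry
        }
    }
}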
If your two calls are on separate threads, and the steps of the two calls are interleaved over time, then you are seeing correct behavior.
If both threads retrieved a set of three values, and both threads replaced the existing set in storage with a new set of four values, then whichever thread saves their set last wins.
The key idea to understand here is atomicity. The retrieval of the data and the update of the data in your system are separate actions. To achieve your aim, you must combine those two actions into one. This combining is the purpose of transactions and locks in databases.
See also the correct answer by VoiceOfUnreason along the same lines.

Java Reactor StepVerifier.withVirtualTime loop: repeatedly check with "expectNoEvent()", "expectNext()" and "thenAwait()"

I am learning Java 11 reactor. I have seen this example:
StepVerifier.withVirtualTime(() -> Flux.interval(Duration.ofSeconds(1)).take(3600))
.expectSubscription()
.expectNextCount(3600);
This example just checks that, for a Flux<Long> that emits an incrementing value every second for an hour, the final count is 3600.
But, is there any way to check the counter repeatedly after every second?
I know this:
.expectNoEvent(Duration.ofSeconds(1))
.expectNext(0L)
.thenAwait(Duration.ofSeconds(1))
But I have seen no way to repeatedly check this after every second, like:
.expectNoEvent(Duration.ofSeconds(1))
.expectNext(i)
.thenAwait(Duration.ofSeconds(1))
where i increments up to 3600. Is there?
PS:
I tried adding verifyComplete() at the end, but in a long-running test it never ends. Do I have to add it? Or can I just ignore it?
You can achieve what you want by using expectNextSequence. You have to pass an Iterable that includes every element you expect to arrive. See my example below:
var longRange = LongStream.range(0, 3600)
.boxed()
.collect(Collectors.toList());
StepVerifier
.withVirtualTime(() -> Flux.interval(Duration.ofSeconds(1)).take(3600))
.expectSubscription()
.thenAwait(Duration.ofHours(1))
.expectNextSequence(longRange)
.expectComplete().verify();
If you don't add verifyComplete() or expectComplete().verify(), then the JUnit test won't wait until elements arrive from the flux and will just terminate.
For further reference see the JavaDoc of verify():
this method will block until the stream has been terminated
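For completeness, the per-second check the question asks about can also be written by chaining the expectations in a plain loop, since each expectation returns a StepVerifier.Step. A hedged sketch (far more verbose than expectNextSequence, and assuming the usual imports: reactor.test.StepVerifier, reactor.core.publisher.Flux, java.time.Duration):
StepVerifier.Step<Long> step = StepVerifier
        .withVirtualTime(() -> Flux.interval(Duration.ofSeconds(1)).take(3600))
        .expectSubscription();
for (long i = 0; i < 3600; i++) {
    // no event during the second, then the next counter value arrives
    step = step.expectNoEvent(Duration.ofSeconds(1))
            .expectNext(i);
}
step.verifyComplete();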

Understanding JavaPairRDD.reduceByKey function

I came across the following code snippet using Apache Spark:
JavaRDD<String> lines = new JavaSparkContext(sparkSession.sparkContext()).textFile("src\\main\\resources\\data.txt");
JavaPairRDD<String, Integer> pairs = lines.mapToPair(s -> new Tuple2<>(s, 1));
System.out.println(pairs.collect());
JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b);
System.out.println("Reduced data: " + counts.collect());
My data.txt is as follows:
Mahesh
Mahesh
Ganesh
Ashok
Abnave
Ganesh
Mahesh
The output is:
[(Mahesh,1), (Mahesh,1), (Ganesh,1), (Ashok,1), (Abnave,1), (Ganesh,1), (Mahesh,1)]
Reduced data: [(Ganesh,2), (Abnave,1), (Mahesh,3), (Ashok,1)]
While I understand how the first line of output is obtained, I don't understand how the second line is obtained, that is, how the JavaPairRDD<String, Integer> counts is formed by reduceByKey.
I found that the signature of reduceByKey() is as follows:
public JavaPairRDD<K,V> reduceByKey(Function2<V,V,V> func)
The [signature](http://spark.apache.org/docs/1.2.0/api/java/org/apache/spark/api/java/function/Function2.html#call(T1, T2)) of Function2.call() is as follows:
R call(T1 v1, T2 v2) throws Exception
The explanation of reduceByKey() reads as follows:
Merge the values for each key using an associative reduce function. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/ parallelism level.
This explanation sounds somewhat confusing to me. Maybe there is something more to the functionality of reduceByKey(). Looking at the input and output of reduceByKey() and Function2.call(), I feel that reduceByKey() somehow sends values of the same key to call() in pairs, but that is not entirely clear to me. Can anyone explain precisely how reduceByKey() and Function2.call() work together?
As its name implies, reduceByKey() reduces data based on the lambda function you pass to it.
In your example, this function is a simple adder: for a and b, return a + b.
The best way to understand how the result is formed is to imagine what happens internally. The ByKey() part groups your records based on their key values. In your example, you'll have 4 different sets of pairs:
Set 1: ((Mahesh, 1), (Mahesh, 1), (Mahesh, 1))
Set 2: ((Ganesh, 1), (Ganesh, 1))
Set 3: ((Ashok, 1))
Set 4: ((Abnave, 1))
Now, the reduce part will try to reduce the previous 4 sets using the lambda function (the adder):
For Set 1: (Mahesh, 1 + 1 + 1) -> (Mahesh, 3)
For Set 2: (Ganesh, 1 + 1) -> (Ganesh, 2)
For Set 3: (Ashok, 1) -> (Ashok, 1) (nothing to add)
For Set 4: (Abnave, 1) -> (Abnave, 1) (nothing to add)
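If it helps, here is a plain-Java illustration (not Spark internals, just the same idea) of how the two-argument function is applied pairwise to the values grouped under one key:
import java.util.List;
import java.util.function.BiFunction;

public class PairwiseReduceDemo {
    public static void main(String[] args) {
        BiFunction<Integer, Integer, Integer> func = (a, b) -> a + b; // same lambda as passed to reduceByKey
        List<Integer> maheshValues = List.of(1, 1, 1);                // the values grouped under "Mahesh"

        int acc = maheshValues.get(0);
        for (int i = 1; i < maheshValues.size(); i++) {
            acc = func.apply(acc, maheshValues.get(i)); // call(v1, v2): (1, 1) -> 2, then (2, 1) -> 3
        }
        System.out.println(acc); // 3
    }
}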
Function signatures can sometimes be confusing, as they tend to be quite generic.
I'm thinking that you probably understand groupByKey? groupByKey groups all values for a certain key into a list (or iterable) so that you can do something with that - like, say, sum (or count) the values. Basically, what sum does is to reduce a list of many values into a single value. It does so by iteratively adding two values to yield one value and that is what Function2 needs to do when you write your own. It needs to take in two values and return one value.
reduceByKey does the same as groupByKey, BUT it does what is called a "map-side reduce" before shuffling data around. Because Spark distributes data across many different machines to allow for parallel processing, there is no guarantee that data with the same key is placed on the same machine. Spark thus has to shuffle data around, and the more data that needs to be shuffled the longer our computations will take, so it's a good idea to shuffle as little data as possible.
In a map-side reduce, Spark will first sum all the values for a given key locally on the executors before it sends (shuffles) the result around for the final sum to be computed. This means that much less data - a single value instead of a list of values - needs to be sent between the different machines in the cluster, and for this reason reduceByKey is most often preferable to groupByKey.
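For comparison, here is a hedged sketch of the groupByKey route on the same pairs RDD from the question: it yields the same counts, but ships whole value lists across the network before summing, which is why reduceByKey is usually preferred.
// Same counts as reduceByKey, but all the 1s are shuffled before being summed.
JavaPairRDD<String, Integer> countsViaGroup = pairs
        .groupByKey()                 // JavaPairRDD<String, Iterable<Integer>>
        .mapValues(values -> {
            int sum = 0;
            for (Integer v : values) {
                sum += v;
            }
            return sum;
        });
System.out.println("Grouped then summed: " + countsViaGroup.collect());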
For a more detailed description, I can recommend this article :)

Difference between traditional imperative style of programming and functional style of programming

I have a problem statement here:
What I need to do is iterate over a list, find the first integer which is greater than 3 and is even, then double it and return it.
These are some methods to check how many operations get performed:
public static boolean isGreaterThan3(int number){
System.out.println("WhyFunctional.isGreaterThan3 " + number);
return number > 3;
}
public static boolean isEven(int number){
System.out.println("WhyFunctional.isEven " + number);
return number % 2 == 0;
}
public static int doubleIt(int number){
System.out.println("WhyFunctional.doubleIt " + number);
return number << 1;
}
With Java 8 streams I could do it like this:
List<Integer> integerList = Arrays.asList(1, 2, 3, 5, 4, 6, 7, 8, 9, 10);
integerList.stream()
.filter(WhyFunctional::isGreaterThan3)
.filter(WhyFunctional::isEven)
.map(WhyFunctional::doubleIt)
.findFirst();
and the output is
WhyFunctional.isGreaterThan3 1
WhyFunctional.isGreaterThan3 2
WhyFunctional.isGreaterThan3 3
WhyFunctional.isGreaterThan3 5
WhyFunctional.isEven 5
WhyFunctional.isGreaterThan3 4
WhyFunctional.isEven 4
WhyFunctional.doubleIt 4
Optional[8]
so total 8 operations.
And in imperative style, before Java 8, I could code it like this:
for (Integer integer : integerList) {
if(isGreaterThan3(integer)){
if(isEven(integer)){
System.out.println(doubleIt(integer));
break;
}
}
}
and the output is
WhyFunctional.isGreaterThan3 1
WhyFunctional.isGreaterThan3 2
WhyFunctional.isGreaterThan3 3
WhyFunctional.isGreaterThan3 5
WhyFunctional.isEven 5
WhyFunctional.isGreaterThan3 4
WhyFunctional.isEven 4
WhyFunctional.doubleIt 4
8
and the operations are the same. So my question is: what difference does it make if I use streams rather than a traditional for loop?
The Stream API introduces the idea of streams, which allows you to decompose the task in a new way. For example, based on your task it's possible that you want to do different things with the doubled even numbers greater than three. In one place you want to find the first one, in another place you need 10 such numbers, in a third place you want to apply more filtering. You can encapsulate the algorithm of finding such numbers like this:
static IntStream numbers() {
return IntStream.range(1, Integer.MAX_VALUE)
.filter(WhyFunctional::isGreaterThan3)
.filter(WhyFunctional::isEven)
.map(WhyFunctional::doubleIt);
}
Here it is. You've just created an algorithm to generate such numbers (without generating them) and you don't care how they will be used. One user might call:
int num = numbers().findFirst().get();
Other user might need to get 10 such numbers:
int[] tenNumbers = numbers().limit(10).toArray();
Third user might want to find the first matching number which is also divisible by 7:
int result = numbers().filter(n -> n % 7 == 0).findFirst().get();
It would be more difficult to encapsulate the algorithm in traditional imperative style.
In general the Stream API is not about the performance (though parallel streams may work faster than traditional solution). It's about the expressive power of your code.
The imperative style complects the computational logic with the mechanism used to achieve it (iteration). The functional style, on the other hand, decomplects the two. You code against an API to which you supply your logic and the API has the freedom to choose how and when to apply it.
In particular, the Streams API has two ways how to apply the logic: either sequentially or in parallel. The latter is actually the driving force behind the introduction of both lambdas and the Streams API itself into Java.
The freedom to choose when to perform computation gives rise to laziness: whereas in the imperative style you have a concrete collection of data, in the functional style you have a collection paired with the logic to transform it. That logic can be applied "just in time", when you actually consume the data. This further allows you to spread out the building up of a computation: each method can receive a stream and apply a further step of computation to it, or it can consume it in different ways (by collecting into a list, by finding just the first item and never applying the computation to the rest, by calculating an aggregate value, etc.).
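A tiny sketch of that "just in time" behaviour (the printlns are only there to show when the mapping function actually runs; assuming java.util.Optional and java.util.stream.Stream are imported):
// Nothing is computed when the pipeline is built...
Stream<Integer> pipeline = Stream.of(1, 2, 3, 4)
        .map(n -> { System.out.println("mapping " + n); return n * 2; });
System.out.println("pipeline built, nothing mapped yet");

// ...and findFirst() only pulls as much as it needs: "mapping 1" is printed, 2, 3 and 4 are never touched.
Optional<Integer> first = pipeline.findFirst();
System.out.println(first.get()); // 2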
As a particular example of the new opportunities offered by laziness, I was able to write a Spring MVC controller which returned a Stream whose data source was a database—and at the time I return the stream, the data is still in the database. Only the View layer will pull the data, implicitly applying the transformation logic it has no knowledge of, never having to retain more than a single stream element in memory. This converted a solution which classically had O(n) space complexity into O(1), thus becoming insensitive to the size of the result set.
Using the Stream API you are describing an operation instead of implementing it. One commonly known advantage of letting the Stream API implement the operation is the option of using different execution strategies like parallel execution (as already said by others).
Another feature which seems to be a bit underestimated is the possibility to alter the operation itself in a way that is impossible to do in an imperative programming style as that would imply modifying the code:
IntStream is=IntStream.rangeClosed(1, 10).filter(i -> i > 4);
if(evenOnly) is=is.filter(i -> (i&1)==0);
if(doubleIt) is=is.map(i -> i<<1);
is.findFirst().ifPresent(System.out::println);
Here, the decision whether to filter out odd numbers or double the result is made before the terminal operation is commenced. In an imperative programming you either have to recheck the flags within the loop or code multiple alternative loops. It should be mentioned that checking such conditions within a loop isn’t that bad on today’s JVM as the optimizer is capable of moving them out of the loop at runtime, so coding multiple loops is usually unnecessary.
But consider the following example:
Stream<String> s = Stream.of("java8 streams", "are cool");
if(singleWords) s=s.flatMap(Pattern.compile("\\s")::splitAsStream);
s.collect(Collectors.groupingBy(str->str.charAt(0)))
.forEach((k,v)->System.out.println(k+" => "+v));
Since flatMap is the equivalent of a nested loop, coding the same in an imperative style isn't that simple any more, as we have either a simple loop or a nested loop based on a runtime value. Usually, you have to resort to splitting the code into multiple methods if you want to share it between both kinds of loops.
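To illustrate, here is a hedged imperative sketch of that last example (one way of several; the names mirror the stream version, and the usual java.util imports are assumed): the grouping logic has to be spelled out twice, once per loop shape.
Map<Character, List<String>> groups = new HashMap<>();
List<String> input = List.of("java8 streams", "are cool");
if (singleWords) {
    for (String s : input) {
        for (String word : s.split("\\s")) {          // nested loop replaces flatMap
            groups.computeIfAbsent(word.charAt(0), k -> new ArrayList<>()).add(word);
        }
    }
} else {
    for (String s : input) {                          // simple loop when no splitting is wanted
        groups.computeIfAbsent(s.charAt(0), k -> new ArrayList<>()).add(s);
    }
}
groups.forEach((k, v) -> System.out.println(k + " => " + v));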
I already encountered a real-life example where the composition of a complex operation had multiple conditional flatMap steps. The equivalent imperative code is insane…
1) The functional approach allows a more declarative way of programming: you just provide a list of functions to apply and don't need to write the iteration manually, so your code is sometimes more concise.
2) If you switch to a parallel stream (https://docs.oracle.com/javase/tutorial/collections/streams/parallelism.html) it becomes possible to run your program in parallel automatically and execute it faster. This is possible because you don't explicitly code the iteration, you just list the functions to apply, so the compiler/runtime can parallelize it.
In this simple example, there is little difference, and the JVM will try to do the same amount of work in each case.
Where you start to see a difference is in more complicated examples like
integerList.parallelStream()
which makes the code concurrent; doing the same with a for loop is much harder. Note: you wouldn't actually do this here, as the overhead would be too high and you only want the first element.
BTW, the first example returns the result and the second prints it.

RxJava groupBy and subsequent blocking operations (onComplete missing?)

I have boiled my problem down into the following snippet:
Observable<Integer> numbers = Observable.just(1, 2, 3);
Observable<GroupedObservable<Integer,Integer>> outer = numbers.groupBy(i->i%3);
System.out.println(outer.count().toBlocking().single());
which blocks indefinitely. I've been reading several posts and believe I understand the problem: GroupedObservables will not call onComplete until their inner Observables have also been completed. Unfortunately, though, I still can't get the above snippet to print!
For example, the following:
Observable<Integer> just = Observable.just(1, 2, 3);
Observable<GroupedObservable<Integer,Integer>> groupBy = just.groupBy(i->i%3);
groupBy.subscribe(inner -> inner.ignoreElements());
System.out.println(groupBy.count().toBlocking().single());
still does nothing. Have I misunderstood the problem? Is there another problem? In short, how can I get the above snippets to work?
Many thanks in advance,
Dan.
Yes, you have to consume the groups in some fashion. Your second example doesn't work because you have two independent subscriptions to the grouping operation.
Usually the solution is flatMap, but not with ignoreElements, because that will just complete and count won't get any elements. Instead, you can use takeLast(1):
Observable.just(1, 2, 3)
.groupBy(k -> k % 3)
.flatMap(g -> g.takeLast(1))
.count()
.toBlocking()
.forEach(System.out::println);
