I have this function in Haskell, and I am wondering how it can be converted to Java, especially using streams:
build = [(w,m,n,g) | w <- [240..1280], m <- [2,4..20], n <- [2..100], g <- [240..1280], ((w - 2*m - n*g) `mod` (n+1) == 0), n*g+2*m <= w, n*g <= w]
(I'm not a Haskell expert, but I know enough to be dangerous.)
The example code given has several Haskell constructs that map reasonably
well into Java constructs:
A Haskell list is lazy, so it corresponds to a Java Stream.
The ranges used are of integers, so they correspond to IntStream.
For example, [240..1280] corresponds to IntStream.rangeClosed(240, 1280).
A range with a step has no direct correspondence in Java, but it can easily
be computed; you just have to do a bit of arithmetic and then map the values
from the sequential range to the one with steps. For example, [2, 4..20]
can be written as
IntStream.rangeClosed(1, 10).map(i -> 2 * i)
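More generally, a Haskell range of the form [a, a+d .. b] (with a positive step d) can be sketched as follows; a, d, and b here are placeholders, not names from the original code:
// General pattern (sketch): the range has (b - a) / d + 1 values
IntStream.rangeClosed(0, (b - a) / d).map(i -> a + i * d)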
A condition on a list comprehension corresponds to filtering a stream through
a predicate.
A comprehension with multiple generators corresponds to flatmapping
of nested streams.
There is no general way to represent tuples in Java. Various third party
libraries provide tuple implementations with varying tradeoffs regarding
generics and boxing. Or, you can just write your own class with the fields
you want. (This can be quite tedious if you use lots of different kinds of
tuples, though.) In this case, the tuple is simply four ints, so it's easily
represented using an int array with four elements.
Putting it all together, we get the following.
static Stream<int[]> build() {
    return IntStream.rangeClosed(240, 1280).boxed()
        .flatMap(w -> IntStream.rangeClosed(1, 10).map(m -> 2 * m).boxed()
            .flatMap(m -> IntStream.rangeClosed(2, 100).boxed()
                .flatMap(n -> IntStream.rangeClosed(240, 1280)
                    .filter(g -> ((w - 2*m - n*g) % (n+1) == 0))
                    .filter(g -> n*g + 2*m <= w)
                    .filter(g -> n*g <= w)
                    .mapToObj(g -> new int[] { w, m, n, g }))));
}
This is clearly quite verbose compared to the original Haskell, but you can easily see where the Haskell constructs have ended up in the Java code. I believe this is correct, as it seems to generate the same output as the Haskell code.
Note that we are generating values using IntStream, but we want the flatmap to give a stream of arrays (which are objects), whereas IntStream.flatMap returns an IntStream. Ideally there would be a flatMapToObj operation, but there isn't, so we must box the int values into Integer objects and then call Stream.flatMap.
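If this comes up often, the box-then-flatMap step could be wrapped in a small helper of your own; this is not a JDK method, and the name flatMapToObj is made up:
// Hypothetical helper (not in the JDK): flat-map an IntStream to a Stream of objects.
// Requires java.util.function.IntFunction, java.util.stream.IntStream, java.util.stream.Stream.
static <R> Stream<R> flatMapToObj(IntStream source, IntFunction<Stream<R>> mapper) {
    return source.boxed().flatMap(mapper::apply);
}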
One could assign the stream pipeline to a variable of type Stream, but this wouldn't be very convenient, as Java streams can be used at most once. Since constructing such a stream is cheap (compared to evaluating it) it's reasonable to write a function build() that returns a freshly created stream, ready to be evaluated by the caller.
When the following Java code is run,
System.out.println(build().count());
System.out.println(build().findFirst().map(Arrays::toString).orElse("not found"));
System.out.println(build().reduce((a, b) -> b).map(Arrays::toString).orElse("not found"));
the result is:
654559
[484, 2, 2, 240]
[1280, 20, 5, 248]
Running the following Haskell code (the definition of build is copied from the question)
build = [(w,m,n,g) | w <- [240..1280], m <- [2,4..20], n <- [2..100], g <- [240..1280],
                     ((w - 2*m - n*g) `mod` (n+1) == 0), n*g+2*m <= w, n*g <= w]
main = do
  print (length build)
  print (head build)
  print (last build)
gives the following output:
654559
(484,2,2,240)
(1280,20,5,248)
So the transliteration appears correct to my eye.
Times for the head (in Java, findFirst) and last (in Java, reduce((a, b) -> b)) operations were as follows: (updated using GHC 7.6.3 -O2)
        head    last
GHC      8s     36s
JDK      3s      9s
This at least shows that both systems provide laziness, as the computation is short-circuited after the first element is found, whereas finding the last element requires all to be computed.
Interestingly, in Haskell, calling all three of length, head, and last doesn't take any more time than just calling last (around 36s) presumably because of memoization. There's no memoization in Java, but of course you could explicitly store the results in an array or List and process that multiple times.
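For instance, a sketch of evaluating the stream once and reusing the results (assuming the build() method above; requires java.util.List, java.util.Arrays, and java.util.stream.Collectors):
// Evaluate the pipeline once, then query the stored results as often as needed.
List<int[]> results = build().collect(Collectors.toList());
System.out.println(results.size());
System.out.println(Arrays.toString(results.get(0)));
System.out.println(Arrays.toString(results.get(results.size() - 1)));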
Overall, though, I was startled at how much faster the Java implementation is. I don't really understand Haskell performance, so I'll leave it to Haskell experts to comment on that. It's quite possible I've done something wrong, though mostly I just copied the function from the question into a file and compiled it using GHC.
My environment:
JDK 9, GHC 7.6.3 -O2, MacBook Pro mid 2014 2-core 3GHz Intel Core i7
Related
I am running Spark-1.4.0 pre-built for Hadoop-2.4 (in local mode) to calculate the sum of squares of a DoubleRDD. My Scala code looks like
sc.parallelize(Array(2., 3.)).fold(0.0)((p, v) => p+v*v)
And it gave a surprising result 97.0.
This is quite counter-intuitive compared to the Scala version of fold
Array(2., 3.).fold(0.0)((p, v) => p+v*v)
which gives the expected answer 13.0.
It seems quite likely that I have made some tricky mistake due to a lack of understanding. I have read that the function used in RDD.fold() should be commutative, otherwise the result may depend on the partitioning, etc. For example, if I change the number of partitions to 1,
sc.parallelize(Array(2., 3.), 1).fold(0.0)((p, v) => p+v*v)
the code will give me 169.0 on my machine!
Can someone explain what exactly is happening here?
Well, it is actually pretty well explained by the official documentation:
Aggregate the elements of each partition, and then the results for all the partitions, using a given associative and commutative function and a neutral "zero value". The function op(t1, t2) is allowed to modify t1 and return it as its result value to avoid object allocation; however, it should not modify t2.
This behaves somewhat differently from fold operations implemented for non-distributed collections in functional languages like Scala. This fold operation may be applied to partitions individually, and then fold those results into the final result, rather than apply the fold to each element sequentially in some defined ordering. For functions that are not commutative, the result may differ from that of a fold applied to a non-distributed collection.
To illustrate, let's simulate what is going on step by step:
val rdd = sc.parallelize(Array(2., 3.))
val byPartition = rdd.mapPartitions(
iter => Array(iter.fold(0.0)((p, v) => (p + v * v))).toIterator).collect()
It gives us something similar to this Array[Double] = Array(0.0, 0.0, 0.0, 4.0, 0.0, 0.0, 0.0, 9.0) and
byPartition.reduce((p, v) => (p + v * v))
returns 97
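To see where the 97 comes from, trace that reduce over the array: the first element becomes the accumulator, and every later element is squared before being added: 0 + 0² = 0, 0 + 0² = 0, 0 + 4² = 16, then three more zeros leave it at 16, and finally 16 + 9² = 97. The single-partition case from the question works the same way: the partition folds to 0 + 2² = 4 and then 4 + 3² = 13, and the driver then combines the zero value with that partition result using the same function, giving 0 + 13² = 169.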
An important thing to note is that the results can differ from run to run, depending on the order in which the partitions are combined.
I came across the following Apache Spark code snippet:
JavaRDD<String> lines = new JavaSparkContext(sparkSession.sparkContext()).textFile("src\\main\\resources\\data.txt");
JavaPairRDD<String, Integer> pairs = lines.mapToPair(s -> new Tuple2(s, 1));
System.out.println(pairs.collect());
JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b);
System.out.println("Reduced data: " + counts.collect());
My data.txt is as follows:
Mahesh
Mahesh
Ganesh
Ashok
Abnave
Ganesh
Mahesh
The output is:
[(Mahesh,1), (Mahesh,1), (Ganesh,1), (Ashok,1), (Abnave,1), (Ganesh,1), (Mahesh,1)]
Reduced data: [(Ganesh,2), (Abnave,1), (Mahesh,3), (Ashok,1)]
While I understand how the first line of output is obtained, I don't understand how the second line is obtained, that is, how JavaPairRDD<String, Integer> counts is formed by reduceByKey.
I found that the signature of reduceByKey() is as follows:
public JavaPairRDD<K,V> reduceByKey(Function2<V,V,V> func)
The [signature](http://spark.apache.org/docs/1.2.0/api/java/org/apache/spark/api/java/function/Function2.html#call(T1, T2)) of Function2.call() is as follows:
R call(T1 v1, T2 v2) throws Exception
The explanation of reduceByKey() reads as follows:
Merge the values for each key using an associative reduce function. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/ parallelism level.
This explanation sounds somewhat confusing to me. Maybe there is something more to the functionality of reduceByKey(). By looking at the input and output of reduceByKey() and Function2.call(), I feel that reduceByKey() somehow sends values of the same key to call() in pairs, but that is not entirely clear to me. Can anyone explain precisely how reduceByKey() and Function2.call() work together?
As its name implies, reduceByKey() reduces data based on the lambda function you pass to it.
In your example, this function is a simple adder: for a and b, return a + b.
The best way to understand how the result is formed is to imagine what happens internally. The ByKey() part groups your records based on their key values. In your example, you'll have 4 different sets of pairs:
Set 1: ((Mahesh, 1), (Mahesh, 1), (Mahesh, 1))
Set 2: ((Ganesh, 1), (Ganesh, 1))
Set 3: ((Ashok, 1))
Set 4: ((Abnave, 1))
Now, the reduce part will try to reduce the previous 4 sets using the lambda function (the adder):
For Set 1: (Mahesh, 1 + 1 + 1) -> (Mahesh, 3)
For Set 2: (Ganesh, 1 + 1) -> (Ganesh, 2)
For Set 3: (Ashok , 1) -> (Ashok, 1) (nothing to add)
For Set 4: (Abnave, 1) -> (Abnave, 1) (nothing to add)
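A rough local analogue of this reduce-per-key behaviour, sketched in plain Java without Spark (the input list repeats the names from data.txt):
// Local sketch of "reduce by key" using Map.merge; the merge function is the same adder.
Map<String, Integer> counts = new HashMap<>();
for (String name : Arrays.asList("Mahesh", "Mahesh", "Ganesh", "Ashok", "Abnave", "Ganesh", "Mahesh")) {
    counts.merge(name, 1, (a, b) -> a + b);
}
System.out.println(counts); // {Abnave=1, Ashok=1, Ganesh=2, Mahesh=3} (order may vary)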
Function signatures can sometimes be confusing, as they tend to be quite generic.
I'm thinking that you probably understand groupByKey? groupByKey groups all values for a certain key into a list (or iterable) so that you can do something with that - like, say, sum (or count) the values. Basically, what sum does is to reduce a list of many values into a single value. It does so by iteratively adding two values to yield one value and that is what Function2 needs to do when you write your own. It needs to take in two values and return one value.
reduceByKey does the same as groupByKey, BUT it performs what is called a "map-side reduce" before shuffling data around. Because Spark distributes data across many different machines to allow for parallel processing, there is no guarantee that data with the same key is placed on the same machine. Spark thus has to shuffle data around, and the more data that needs to be shuffled the longer our computations take, so it's a good idea to shuffle as little data as possible.
In a map-side reduce, Spark will first sum all the values for a given key locally on the executors before it sends (shuffles) the result around for the final sum to be computed. This means that much less data - a single value instead of a list of values - needs to be sent between the different machines in the cluster, and for this reason reduceByKey is most often preferable to groupByKey.
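As a sketch of the difference, both of the following compute the same counts for the pairs RDD from the question, but the reduceByKey version ships far less data across the cluster (this is illustrative, not tuned code):
// groupByKey shuffles every (key, 1) pair, then sums on the receiving side.
JavaPairRDD<String, Integer> viaGroup = pairs.groupByKey()
        .mapValues(values -> {
            int sum = 0;
            for (Integer v : values) sum += v;
            return sum;
        });
// reduceByKey pre-sums within each partition first, so only one value per key
// per partition is shuffled.
JavaPairRDD<String, Integer> viaReduce = pairs.reduceByKey((a, b) -> a + b);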
For a more detailed description, I can recommend this article :)
This one line of functional-programming code performs the operation 2*3 + 4*3 + 6*3 + 8*3 + 10*3:
int sum = IntStream.rangeClosed(1, 10) /* closed range */
        .filter(x -> x % 2 == 0)       /* filter to even numbers in range */
        .map(x -> x * 3)               /* map */
        .sum();                        /* actual sum operation happens */
System.out.println(sum);               /* prints 90 */
I understand what it is doing. I would like to know what is happening under the hood in terms of memory allocation. The same operation can be written in the traditional way, as below. That version is very easy to understand, but the lambda-based code above is more expressive.
int sum = 0;
for (int i = 1; i <= 10; i++) {
    if (i % 2 == 0) {
        sum = sum + i * 3;
    }
}
System.out.println(sum); /* prints 90 */
First the lambda expressions will be de-sugared to static methods inside your class file (use javap to see that).
For the Predicate there will be a .class file generated (which you can see by setting the -Djdk.internal.lambda.dumpProxyClasses=/Your/Path parameter when you invoke your class).
The same thing goes for the Function for the map operation.
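For example (the class name and path are placeholders), you could inspect both artefacts roughly like this, using the flag mentioned above and javap:
# Dump the generated lambda proxy classes to a directory while running your program.
java -Djdk.internal.lambda.dumpProxyClasses=/tmp/lambdas YourClass
# Show the de-sugared lambda bodies (private synthetic methods) in your own class file.
javap -p -c YourClass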
Since your lambdas are stateless, a single instance of the Predicate and the Function will be created and re-used for each operation. If it had been a stateful (capturing) lambda, a new instance would be created each time the lambda expression is evaluated.
And, regarding your question title: map and reduce do not by themselves increase performance (unless there are tons of elements and you can parallelize the process with a benefit). Your simple loop will be faster, but not that much faster than the stream. You have also chosen a pretty simple example; for an example that does some heavy grouping and then a custom collection, etc., the stream pipeline would be far less verbose than the equivalent hand-written code.
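As a small illustration of that last point (the numbers and the grouping key are made up), a grouping that is a single expression with streams would need an explicit map, loop, and per-key bookkeeping by hand:
// Group 1..10 by remainder modulo 3: {0=[3, 6, 9], 1=[1, 4, 7, 10], 2=[2, 5, 8]} (map order may vary).
Map<Integer, List<Integer>> byRemainder = IntStream.rangeClosed(1, 10)
        .boxed()
        .collect(Collectors.groupingBy(x -> x % 3));
System.out.println(byRemainder);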
I have a problem statement here:
I need to iterate over a list, find the first integer which is greater than 3 and is even, then double it and return it.
These are some methods to check how many operations are getting performed
public static boolean isGreaterThan3(int number) {
    System.out.println("WhyFunctional.isGreaterThan3 " + number);
    return number > 3;
}
public static boolean isEven(int number) {
    System.out.println("WhyFunctional.isEven " + number);
    return number % 2 == 0;
}
public static int doubleIt(int number) {
    System.out.println("WhyFunctional.doubleIt " + number);
    return number << 1;
}
With Java 8 streams I could do it like
List<Integer> integerList = Arrays.asList(1, 2, 3, 5, 4, 6, 7, 8, 9, 10);
integerList.stream()
        .filter(WhyFunctional::isGreaterThan3)
        .filter(WhyFunctional::isEven)
        .map(WhyFunctional::doubleIt)
        .findFirst();
and the output is
WhyFunctional.isGreaterThan3 1
WhyFunctional.isGreaterThan3 2
WhyFunctional.isGreaterThan3 3
WhyFunctional.isGreaterThan3 5
WhyFunctional.isEven 5
WhyFunctional.isGreaterThan3 4
WhyFunctional.isEven 4
WhyFunctional.doubleIt 4
Optional[8]
so total 8 operations.
And in the imperative style, or before Java 8, I could code it like
for (Integer integer : integerList) {
    if (isGreaterThan3(integer)) {
        if (isEven(integer)) {
            System.out.println(doubleIt(integer));
            break;
        }
    }
}
and the output is
WhyFunctional.isGreaterThan3 1
WhyFunctional.isGreaterThan3 2
WhyFunctional.isGreaterThan3 3
WhyFunctional.isGreaterThan3 5
WhyFunctional.isEven 5
WhyFunctional.isGreaterThan3 4
WhyFunctional.isEven 4
WhyFunctional.doubleIt 4
8
and the operations are the same. So my question is: what difference does it make if I use streams rather than a traditional for loop?
The Stream API introduces the idea of streams, which allows you to decouple the task in a new way. For example, based on your task, it's possible that you want to do different things with the doubled even numbers greater than three. In one place you want to find the first one, in another you need 10 such numbers, and in a third you want to apply more filtering. You can encapsulate the algorithm of finding such numbers like this:
static IntStream numbers() {
    return IntStream.range(1, Integer.MAX_VALUE)
            .filter(WhyFunctional::isGreaterThan3)
            .filter(WhyFunctional::isEven)
            .map(WhyFunctional::doubleIt);
}
Here it is. You've just created an algorithm to generate such numbers (without generating them) and you don't care how they will be used. One user might call:
int num = numbers().findFirst().get();
Another user might need to get 10 such numbers:
int[] tenNumbers = numbers().limit(10).toArray();
A third user might want to find the first matching number which is also divisible by 7:
int result = numbers().filter(n -> n % 7 == 0).findFirst().get();
It would be more difficult to encapsulate the algorithm in traditional imperative style.
In general the Stream API is not about the performance (though parallel streams may work faster than traditional solution). It's about the expressive power of your code.
The imperative style complects the computational logic with the mechanism used to achieve it (iteration). The functional style, on the other hand, decomplects the two. You code against an API to which you supply your logic and the API has the freedom to choose how and when to apply it.
In particular, the Streams API has two ways how to apply the logic: either sequentially or in parallel. The latter is actually the driving force behind the introduction of both lambdas and the Streams API itself into Java.
The freedom to choose when to perform computation gives rise to laziness: whereas in the imperative style you have a concrete collection of data, in the functional style you can have a collection paired with logic to transform it. The logic can be applied "just in time", when you actually consume the data. This further allows you to spread the building up of computation: each method can receive a stream and apply a further step of computation on it, or it can consume it in different ways (by collecting into a list, by finding just the first item and never applying computation to the rest, but calculating an aggregate value, etc.).
As a particular example of the new opportunities offered by laziness, I was able to write a Spring MVC controller which returned a Stream whose data source was a database—and at the time I return the stream, the data is still in the database. Only the View layer will pull the data, implicitly applying the transformation logic it has no knowledge of, never having to retain more than a single stream element in memory. This converted a solution which classically had O(n) space complexity into O(1), thus becoming insensitive to the size of the result set.
Using the Stream API you are describing an operation instead of implementing it. One commonly known advantage of letting the Stream API implement the operation is the option of using different execution strategies like parallel execution (as already said by others).
Another feature which seems to be a bit underestimated is the possibility to alter the operation itself in a way that is impossible to do in an imperative programming style as that would imply modifying the code:
IntStream is=IntStream.rangeClosed(1, 10).filter(i -> i > 4);
if(evenOnly) is=is.filter(i -> (i&1)==0);
if(doubleIt) is=is.map(i -> i<<1);
is.findFirst().ifPresent(System.out::println);
Here, the decision whether to filter out odd numbers or double the result is made before the terminal operation is commenced. In imperative programming you either have to recheck the flags within the loop or code multiple alternative loops. It should be mentioned that checking such conditions within a loop isn't that bad on today's JVM, as the optimizer is capable of moving them out of the loop at runtime, so coding multiple loops is usually unnecessary.
But consider the following example:
Stream<String> s = Stream.of("java8 streams", "are cool");
if(singleWords) s=s.flatMap(Pattern.compile("\\s")::splitAsStream);
s.collect(Collectors.groupingBy(str->str.charAt(0)))
.forEach((k,v)->System.out.println(k+" => "+v));
Since flatMap is the equivalent of a nested loop, coding the same in an imperative style isn't that simple any more, as we have either a simple loop or a nested loop depending on a runtime value. Usually you have to resort to splitting the code into multiple methods if you want to share it between both kinds of loops.
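To make that concrete, here is a hedged sketch of one possible imperative counterpart of the conditional flatMap above; it wraps the non-split case in a one-element array, which is exactly the kind of contortion the stream version avoids (singleWords is the same runtime flag as before):
// Imperative counterpart of the conditional flatMap/groupingBy pipeline.
Map<Character, List<String>> grouped = new HashMap<>();
for (String s : Arrays.asList("java8 streams", "are cool")) {
    String[] elements = singleWords ? s.split("\\s") : new String[] { s };
    for (String str : elements) {
        grouped.computeIfAbsent(str.charAt(0), k -> new ArrayList<>()).add(str);
    }
}
grouped.forEach((k, v) -> System.out.println(k + " => " + v));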
I already encountered a real-life example where the composition of a complex operation had multiple conditional flatMap steps. The equivalent imperative code is insane…
1) The functional approach allows a more declarative way of programming: you just provide a list of functions to apply and don't need to write the iteration manually, so your code is sometimes more concise.
2) If you switch to a parallel stream (https://docs.oracle.com/javase/tutorial/collections/streams/parallelism.html), it becomes possible to run your program in parallel automatically and execute it faster. This is possible because you don't explicitly code the iteration, you just list which functions to apply, so the compiler/runtime may parallelize it.
In this simple example, there is little difference, and the JVM will try to do the same amount of work in each case.
Where you start to see a difference is in more complicated examples like
integerList.parallelStream()
Making the code concurrent for a loop is much harder. Note: you wouldn't actually do this here, as the overhead would be too high and you only want the first element.
BTW, the first example returns the result and the second prints it.
The order seems odd because in regular Java the return type is always specified first. As in:
public static double sum(Iterable<Number> nums) { ... }
Why then, in the Function and BiFunction classes has the choice been made to specify them the other way around? As in:
interface Function<T,R>
interface BiFunction<T,U,R>
I'm not asking here for opinions as to which is better, but specifically:
a) Is there any technical or other (non-stylistic) benefit in preferring one order over the other? Or is it an arbitrary choice?
b) Is anyone aware of any documented explanation, or any stated reason from an authoritative source, why one was chosen over the other?
Aside: the order seems even more odd if extended to higher arities. For example, a hypothetical QuadFunction:
interface QuadFunction<A,B,C,D,R> { ... }
(At the time of writing the highest arity in the library is 2 - i.e. BiFunction.)
See: http://download.java.net/jdk8/docs/api/java/util/function/package-summary.html
It is to be consistent with prior existing notation.
The mathematical integer division function extended into the rational numbers:
(\): I x I -> Q
Functional programming version of the above (like Haskell, Ocaml)
division :: Integer -> (Integer -> Rational)
or
division :: Integer -> Integer -> Rational
All three say "the division function takes two integers and returns a rational number". It is backwards, in a functional paradigm, to state your return type first. C has taught us to say "we return a rational number in the division function, which takes two integers" (e.g. float division(int a, int b){}).
In Java, your return type is on the left of methods because Java wants to look like C. The designers of C thought "int main(int argc, char *argv[])" looked better than "main(int argc, char *argv[]) int". When writing code, at least for me, I more often than not know what a method will return before I know what it will need. (edit 1: and we write lines like String s = removeSpaces(textLine), so the return on the left matches the variable on the left)
In C#, Func looks the same way as the Java 8 Function.
My guess is that it's more intuitive for method chaining which might be a typical use case for lambdas, i.e.
IntStream.range(1, 10).map(Ints::random).filter(x -> x % 2 == 0)
So, the method sequence here reads left to right, and the lambdas go left to right. So why not have the type params go left to right as well?
Escalating this a bit further - the reason might be that the English language reads left to right. :-)
UPDATE
I was very surprised to find out that this actually happens in modern Arabic mathematical notation:
[Image: a complex-number expression written in Latin notation]
[Image: the same expression written in Arabic notation]
In this example the Arabic notation mirrors the Latin one character by character. One can track this by the angle sign and the i (imaginary unit) character - in both cases it has a dot. In the linked wiki article there is also an example of a reversed lim arrow (compared to the direction of the Java 8 lambda arrow). This could mean that an Arabic Java, if it were ever developed, would look a bit different. :-)
Disclaimer: I have a background in maths, but I had no idea about the Arabic notation when I was answering this question.
In ordinary procedural and OO programming, functions/methods generally take a list of parameters and return some result:
int max(int num1, int num2)
When rewriting function signatures as callback-based (such as for parallel or asynchronous processing), it has been a longstanding practice to convert the signature by appending the return callback as the last parameter:
void maxAsync(int num1, int num2, Callback<int> callback) // pseudo-Java
A current example of this pattern can be found in GWT RPC processing.
This style originated in the Lisp style of languages with the so-called continuation-passing style, where functions are chained by passing a function to a function as a parameter. Since in Lisp arguments are evaluated left-to-right, the function that's consuming the values needs to be at the end of the list. This arrangement has been adopted by imperative languages for continuity and because it's been traditional to tack on additional optional parameters (boolean flags and the like) at the end of the parameter list.
It is the explicit intent to make it more convenient to program in a functional style in Java. Now, in mathematics, a function is generally written like
f: A -> B
(i.e., a function from As to Bs). This corresponds also to the notation in functional languages, Scala and already existing functional libraries for Java.
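A small Java illustration of how that ordering reads (the examples themselves are arbitrary):
// The type parameters read like the mathematical arrow: String -> Integer, (Integer, Integer) -> Integer.
Function<String, Integer> length = String::length;
BiFunction<Integer, Integer, Integer> add = Integer::sum;
// Composition also reads left to right: first take the length, then double it.
Function<String, Integer> doubledLength = length.andThen(n -> n * 2);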
In other words: it is just the right thing.
Note that a functional interface is not a method and a method is not a functional interface, hence it is not clear what the syntax of the former has to do with the latter.
Just my opinion: to look the same way as Function in Guava does. Having the order the other way around would cause a lot of confusion, I guess.
http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/base/Function.html