Infinite loop JavaRDD<String> spark 1.6 - java

I'm trying to iterate over a JavaRDD, find the minimum element by applying a method that uses this same RDD, and then filter that element out.
Here is my code:
items = input.map(x -> {
    min = getMin(input);
    return min;
})
.filter(x -> !Domine(x, min));
But there is no result; it seems to run in an infinite loop.
How can I fix it?
Thanks

When it comes to implementations like this one (same as Java 8 streams or Kotlin sequences), RDDs are evaluated lazily, so you need to perform a terminal operation (an action); only then will the work be done.
So if you do a filter and end there, nothing will happen, since you didn't perform any terminal operation. Call, for example, first(), take(1), foreach(...) or any other action; you can find them here.
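For example, a minimal sketch (assuming input is the JavaRDD<String> from the question):
JavaRDD<String> filtered = input.filter(x -> !x.isEmpty()); // transformation: nothing runs yet
long count = filtered.count();            // action: this is what actually triggers the computation
List<String> firstFew = filtered.take(3); // another action, returning results to the driver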

From the very vague description, I believe what you require is something like the following, assuming that input is of type JavaRDD<Row>:
final Row min = input.min((row1, row2) -> {
    // TODO: replace by some real comparator implementation
    Integer row1value = row1.getInt(row1.fieldIndex("fieldName"));
    Integer row2value = row2.getInt(row2.fieldIndex("fieldName"));
    return row1value.compareTo(row2value);
});
items = input.filter(row -> !Domine(row, min));
Since Apache Spark transformations like filter are inherently lazy, to actually retrieve the values you would then have to write List<Row> collectedValues = items.collect();. I would, however, strongly recommend that .collect() never actually goes into production, since it pulls the entire RDD onto the driver and can be very dangerous indeed.
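If you only need a bounded result, a safer sketch (same assumed items RDD as above):
List<Row> sample = items.take(100); // bounded: at most 100 rows reach the driver
long total = items.count();         // or just count, if the rows themselves aren't needed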

Related

Move an element in the list to the first position using Stream

I have written this piece of code and I would like to convert it to a stream while keeping the same behavior.
List<String> str1 = new ArrayList<String>();
str1.add("A");
str1.add("B");
str1.add("C");
str1.add("D");
int index = str1.indexOf("C");
if ((str1.contains("C")) && (index != 0)) {
    str1.remove("C");
    str1.add(0, "C");
}
You could simplify:
int index = str1.indexOf("C");
if ((str1.contains("C")) && (index != 0)) {
    str1.remove("C");
    str1.add(0, "C");
}
as
if (str1.remove("C")) {
    str1.add(0, "C");
}
(Check the javadoc for remove(Object). It removes the first instance of the object that it finds, and returns true if an object was removed. There is no need to find and test the object's index.)
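To illustrate that contract on the example data (a standalone sketch, java.util imports assumed):
List<String> list = new ArrayList<>(Arrays.asList("A", "B", "C", "D"));
if (list.remove("C")) {   // true: the first "C" was found and removed
    list.add(0, "C");
}
// list is now [C, A, B, D]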
At this point, you should probably stop.
A small gotcha is that if "C" is the first element, you are removing it and adding it again ... unnecessarily. You could deal with this as follows:
int index = str1.indexOf("C");
if (index > 0) {
    str1.add(0, str1.remove(index));
}
You could optimize further to avoid the double copying of a remove and an add. For example (thanks to @Holger):
Collections.rotate(str1.subList(0, str1.indexOf("C") + 1), 1);
(But if you say that is simple, a regular Java programmer will probably throw something at you. It requires careful reading of the javadocs to understand that this is actually efficient.)
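To see why it is efficient, here is a short trace of that one-liner on the example list:
List<String> str1 = new ArrayList<>(Arrays.asList("A", "B", "C", "D"));
// subList(0, indexOf("C") + 1) is the live view [A, B, C];
// rotating that view by 1 wraps "C" around to the front, in place
Collections.rotate(str1.subList(0, str1.indexOf("C") + 1), 1);
// str1 is now [C, A, B, D]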
You could rewrite this as 2 stream operations, but there is no point in doing it. Both operations (removing the first match and inserting at the front of a stream) are inefficient and tricky to implement, and tricky to implement means hard to read.
Note: Using streams to solve every problem is not a sensible goal. The main point of streams is to express complex transformations in a way that is easier to read and maintain than conventional loops. If the only stream-based solutions to a problem are actually harder to read and maintain than the non-stream solutions, that is clear evidence that you shouldn't be trying to use streams to solve that problem!
Note that the above solutions are for the problem as stated. However, if object identity of the elements actually matters, the only one of the solutions above that actually moves an object to the start of the list is the one that uses rotate.

Understanding JavaPairRDD.reduceByKey function

I came across the following code snippet using Apache Spark:
JavaRDD<String> lines = new JavaSparkContext(sparkSession.sparkContext()).textFile("src\\main\\resources\\data.txt");
JavaPairRDD<String, Integer> pairs = lines.mapToPair(s -> new Tuple2(s, 1));
System.out.println(pairs.collect());
JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b);
System.out.println("Reduced data: " + counts.collect());
My data.txt is as follows:
Mahesh
Mahesh
Ganesh
Ashok
Abnave
Ganesh
Mahesh
The output is:
[(Mahesh,1), (Mahesh,1), (Ganesh,1), (Ashok,1), (Abnave,1), (Ganesh,1), (Mahesh,1)]
Reduced data: [(Ganesh,2), (Abnave,1), (Mahesh,3), (Ashok,1)]
While I understand how the first line of output is obtained, I don't understand how the second line is obtained, that is, how the JavaPairRDD<String, Integer> counts is formed by reduceByKey.
I found that the signature of reduceByKey() is as follows:
public JavaPairRDD<K,V> reduceByKey(Function2<V,V,V> func)
The signature of Function2.call() (see http://spark.apache.org/docs/1.2.0/api/java/org/apache/spark/api/java/function/Function2.html) is as follows:
R call(T1 v1, T2 v2) throws Exception
The explanation of reduceByKey() reads as follows:
Merge the values for each key using an associative reduce function. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/ parallelism level.
Now this explanation sounds somewhat confusing to me. Maybe there is something more to the functionality of reduceByKey(). By looking at the input and output of reduceByKey() and Function2.call(), I feel that reduceByKey() somehow sends values of the same key to call() in pairs. But that simply is not clear. Can anyone explain precisely how reduceByKey() and Function2.call() work together?
As its name implies, reduceByKey() reduces data based on the lambda function you pass to it.
In your example, this function is a simple adder: for a and b, return a + b.
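Spelled out without the lambda shorthand, that adder is just an implementation of Function2.call() (a sketch, reusing the pairs RDD from the question):
JavaPairRDD<String, Integer> counts = pairs.reduceByKey(
        new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer a, Integer b) {
                return a + b; // merge two values of the same key into one
            }
        });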
The best way to understand how the result is formed is to imagine what happens internally. The ByKey() part groups your records based on their key values. In your example, you'll have 4 different sets of pairs:
Set 1: ((Mahesh, 1), (Mahesh, 1), (Mahesh, 1))
Set 2: ((Ganesh, 1), (Ganesh, 1))
Set 3: ((Ashok, 1))
Set 4: ((Abnave, 1))
Now, the reduce part will try to reduce the previous 4 sets using the lambda function (the adder):
For Set 1: (Mahesh, 1 + 1 + 1) -> (Mahesh, 3)
For Set 2: (Ganesh, 1 + 1) -> (Ganesh, 2)
For Set 3: (Ashok , 1) -> (Ashok, 1) (nothing to add)
For Set 4: (Abnave, 1) -> (Abnave, 1) (nothing to add)
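You can simulate the same reduction locally with a plain HashMap (a driver-side sketch, not how Spark implements it internally; java.util imports assumed):
Map<String, Integer> counts = new HashMap<>();
for (String name : Arrays.asList("Mahesh", "Mahesh", "Ganesh", "Ashok", "Abnave", "Ganesh", "Mahesh")) {
    counts.merge(name, 1, (a, b) -> a + b); // the same (a, b) -> a + b as in reduceByKey
}
// counts is now {Ganesh=2, Abnave=1, Mahesh=3, Ashok=1} (iteration order may differ)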
Function signatures can sometimes be confusing, as they tend to be more generic.
I'm thinking that you probably understand groupByKey? groupByKey groups all values for a certain key into a list (or iterable) so that you can do something with them, like, say, sum (or count) the values. Basically, what sum does is reduce a list of many values into a single value. It does so by iteratively adding two values to yield one value, and that is what Function2 needs to do when you write your own: it needs to take in two values and return one value.
reduceByKey does the same as groupByKey, BUT it does what is called a "map-side reduce" before shuffling data around. Because Spark distributes data across many different machines to allow for parallel processing, there is no guarantee that data with the same key is placed on the same machine. Spark thus has to shuffle data around, and the more data needs to be shuffled, the longer our computations take, so it's a good idea to shuffle as little data as possible.
In a map-side reduce, Spark will first sum all the values for a given key locally on the executors before it sends (shuffles) the result around for the final sum to be computed. This means that much less data, a single value instead of a list of values, needs to be sent between the different machines in the cluster, and for this reason reduceByKey is most often preferable to groupByKey.
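For comparison, a sketch of both approaches on the pairs RDD from the question (groupByKey ships every single 1 across the network; reduceByKey ships one partial sum per key and partition):
JavaPairRDD<String, Integer> viaReduce = pairs.reduceByKey((a, b) -> a + b);
JavaPairRDD<String, Integer> viaGroup = pairs.groupByKey() // shuffles all the values
        .mapValues(values -> {
            int sum = 0;
            for (int v : values) sum += v;
            return sum;
        });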
For a more detailed description, I can recommend this article :)

Is it possible to replace all loop constructs in Java with stream-based constructs?

I am exploring the possibilities of the Java Stream API in order to find out if one could possibly replace all loop-based constructs with stream-based constructs.
As an example that hints that this might actually be possible, consider this:
Is it possible to use the Stream API to split a string containing words delimited by spaces into an array of strings, like the following call of String.split(String) would do?
String[] tokens = "Peter Paul Mary".split(" ");
I am aware of the fact that a stream-based solution could make use of the String.split(String) method like so:
Stream.of("Peter Paul Mary".split(" "))
      .collect(Collectors.toList());
or make use of Pattern.splitAsStream(CharSequence) (the implementation of which certainly uses a loop-based approach), but I am looking for a Stream-only solution, meaning something along the lines of this Haskell snippet:
words :: String -> [String]
words s = case dropWhile Char.isSpace s of
              "" -> []
              s' -> w : words s''
                  where (w, s'') = break Char.isSpace s'
I am asking this because I am still wondering whether the introduction of the Stream API will lead to a profound change in the way we handle object collections in Java, or just add another option to it, thus making it more challenging to maintain a larger codebase rather than simplifying this task in the long run.
EDIT: Although there is an accepted answer (see below), it only shows that this is possible in this special case. I am still interested in any hints towards the general case as required in the question.
A distinct non-answer here: you are asking the wrong question!
It doesn't matter if all "loop related" lines of Java code can be converted into something streamish.
Because: good programming is the ability to write code that humans can easily read and understand.
So when somebody puts up a rule that says "from here on we only use streams for everything we do", that rule significantly reduces your options when writing code. Instead of being able to carefully decide "should I use streams" versus "should I go with a simple old-school loop construct", you are down to "how do I get this working with streams"!
From that point of view, you should focus on coming up with "rules" that work for all people in your development team. That could mean emphasizing the use of stream constructs. But you definitely want to avoid absolutism, and leave it to each developer to write the code that implements a given requirement in the "most readable" way. If that is possible with streams, fine. If not, don't force people to use them.
And beyond that: depending on what exactly you are doing, using streams also comes with a performance cost. So even when you can find a stream solution for a problem, you have to understand its runtime cost. You surely want to avoid using streams in places where they cost too much (especially when that code is already on your "critical path" regarding performance, like code called zillions of times per second).
Finally: to a certain degree, this is a question of skills. Meaning: when you are trained in using streams, it is much easier for you to A) read "streamish" code that others wrote and B) come up with "streamish" solutions that are in fact easy to read. In other words: this again depends on the context you are working in. The other week I was educating another team on "clean code", and my last slide was "Clean Code, Java 8 streams/lambdas". One guy asked me "what are streams?". There is no point in forcing such a community to do anything with streams tomorrow.
Just for fun (this is one horrible way to do it), and I don't know whether it fits your needs:
List<String> result = ",,,abc,def".codePoints()
        .boxed()
        // .parallel()
        .collect(Collector.of(
                () -> {
                    // supplier: start with one empty StringBuilder
                    List<StringBuilder> inner = new ArrayList<>();
                    inner.add(new StringBuilder());
                    return inner;
                },
                (List<StringBuilder> list, Integer character) -> {
                    // accumulator: a comma starts a new token,
                    // any other code point extends the current one
                    StringBuilder last = list.get(list.size() - 1);
                    if (character == ',') {
                        list.add(new StringBuilder());
                    } else {
                        last.appendCodePoint(character);
                    }
                },
                (left, right) -> {
                    // combiner (for parallel runs): glue the two halves together
                    left.get(left.size() - 1).append(right.remove(0));
                    left.addAll(right);
                    return left;
                },
                list -> list.stream()
                        .map(StringBuilder::toString)
                        .filter(x -> !x.equals("")) // finisher: drop the empty tokens
                        .collect(Collectors.toList())
        ));
// result is [abc, def]

How would I use slick 3.0 to return one row at a time?

How would I build a Scala query to return one row of my table at a time?
My tables are defined in the question at the following location, if that helps in answering this one:
Slick 3.0 (scala) queries don't return data till they are run multiple times (I think)
val q5 = for {
  c <- dat.patientsss
} yield (c.PID, c.Gender, c.Age, c.Ethnicity)
Await.result(db.stream(q5.result).foreach(println), Duration.Inf)
but instead of printing, I need to return each row.
Answer
Use a materialized result instead:
val result = Await.result(db.run(q5.result), Duration.Inf)
result is a Seq that contains all your patient data. Use foreach to iterate over the result set:
result.foreach(r => yourFancyAlgorithm(r)) // r is a single patients data row
Sidenote
Await blocks the current thread, making one of Slick's best features obsolete. Blocking threads is something you should not do. I highly recommend reading about Future and Promise in Scala.
The above example can be simply written as:
val result = db.run(q5.result)
result in this case will be of type Future[Seq[(yourPatientsData)]]. To access the data, use map on the result:
result.map(d => whatever) // d is of type Seq[(yourPatientsData)]
While waiting for the result, the rest of your application will continue to do its calculations and other work. Finally, when the result is ready, the callback (d => whatever) will run.

Why can't a stream of streams be reduced in parallel? / stream has already been operated upon or closed

Context
I've stumbled upon a rather annoying problem: I have a program with a lot of data sources that are able to stream the same type of elements, and I want to "map" each available element in the program (element order doesn't matter).
Therefore I've tried to reduce my Stream<Stream<T>> streamOfStreamOfT; into a simple Stream<T> streamOfT; using streamOfT = streamOfStreamOfT.reduce(Stream.empty(), Stream::concat);
Since element order is not important to me, I've tried to parallelize the reduce operation with a .parallel(): streamOfT = streamOfStreamOfT.parallel().reduce(Stream.empty(), Stream::concat); But this triggers a java.lang.IllegalStateException: stream has already been operated upon or closed
Example
To experience it yourself, just play with the following main (Java 1.8u20) by commenting/uncommenting the .parallel()
public static void main(String[] args) {
    // GIVEN
    List<Stream<Integer>> listOfStreamOfInts = new ArrayList<>();
    for (int j = 0; j < 10; j++) {
        IntStream intStreamOf10Ints = IntStream.iterate(0, i -> i + 1)
                                               .limit(10);
        Stream<Integer> genericStreamOf10Ints = StreamSupport.stream(
                intStreamOf10Ints.spliterator(), true);
        listOfStreamOfInts.add(genericStreamOf10Ints);
    }
    Stream<Stream<Integer>> streamOfStreamOfInts = listOfStreamOfInts.stream();
    // WHEN
    Stream<Integer> streamOfInts = streamOfStreamOfInts
            // ////////////////
            // PROBLEM
            // |
            // V
            .parallel()
            .reduce(Stream.empty(), Stream::concat);
    // THEN
    System.out.println(streamOfInts.map(String::valueOf).collect(joining(", ")));
}
Question
Can someone explain this limitation, or find a better way of handling parallel reduction of a stream of streams?
Edit 1
Following @Smutje's and @LouisWasserman's comments, it seems that .flatMap(Function.identity()) is a better option that tolerates .parallel() streams.
The form of reduce you are using takes an identity value and an associative combining function. But Stream.empty() is not a value; it has state. Streams are not data structures like arrays or collections; they are carriers for pushing data through possibly-parallel aggregate operations, and they have some state (like whether the stream has been consumed or not). Think about how this works; you're going to build a tree where the same "empty" stream appears in more than one leaf. When you try to use this stateful not-an-identity twice (which won't happen sequentially, but will happen in parallel), the second time you try to traverse through that empty stream, it will quite correctly be seen to be already used.
So the problem is, you're simply using this reduce method incorrectly. The problem is not with the parallelism; it is simply that the parallelism exposed the underlying problem.
Secondly, even if this "worked" the way you think it should, you would only get parallelism while building the tree that represents the flattened stream-of-streams; when you go to do the joining, that's a sequential stream pipeline. Oops.
Thirdly, even if this "worked" the way you think it should, you're going to add a lot of element-access overhead by building up concatenated streams, and you're not going to get the benefit of parallelism that you are seeking.
The simple answer is to flatten the streams:
String joined = streamOfStreams.parallel()
                               .flatMap(s -> s)
                               .collect(joining(", "));
