Basically I want something like this,
int count = 100;
JavaRDD<String> myRandomRDD = generate(count, new Function<String, String>() {
    @Override
    public String call(String arg0) throws Exception {
        return RandomStringUtils.randomAlphabetic(42);
    }
});
Theoretically I could use Spark RandomRDD, but I can't get it working right. I'm overwhelmed by the choices. Should I use RandomRDDs::randomRDD or RandomRDDs::randomRDDVector? Or should I use RandomVectorRDD?
I have tried the following, but I can't even get the syntax to be correct.
RandomRDDs.randomRDD(jsc, new RandomDataGenerator<String>() {
    @Override
    public void setSeed(long arg0) {
        // TODO Auto-generated method stub
    }

    @Override
    public org.apache.spark.mllib.random.RandomDataGenerator<String> copy() {
        // TODO Auto-generated method stub
        return null;
    }

    @Override
    public String nextValue() {
        return RandomStringUtils.randomAlphabetic(42);
    }
}, count, ??);
The documentation is sparse, I'm confused, and I would appreciate any help.
Thanks!
The simplest solution I can think of is:
JavaRDD<String> randomStringRDD = RandomRDDs.uniformJavaRDD(jsc, numRows).map((Double d) -> RandomStringUtils.randomAlphabetic(42));
Here is a more complete example to test locally:
SparkConf conf = new SparkConf().setAppName("Test random").setMaster("local");
JavaSparkContext jsc = new JavaSparkContext(conf);
int numRows = 10; // put here how many rows you want
JavaRDD<String> randomStringRDD = RandomRDDs.uniformJavaRDD(jsc, numRows).map((Double d) -> RandomStringUtils.randomAlphabetic(42));
// display (use only on a small dataset)
for (String row : randomStringRDD.collect()) {
    System.out.println(row);
}
There is a small CPU overhead, because the initial set of random doubles is generated only to be thrown away, but it takes care of creating the partitions etc. for you.
If avoiding that small overhead is important to you, and you want to generate, say, 1 million rows in 10 partitions, you could try the following (a sketch is shown after the list):
Create an empty RDD via jsc.emptyRDD()
Repartition it via repartition to create 10 partitions
Use a mapPartitions function to create 1,000,000 / 10 partitions = 100,000 rows per partition. Your RDD is ready.
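A minimal sketch of those three steps, assuming Spark 2.x (where mapPartitions returns an Iterator) and that repartitioning the empty RDD yields the 10 empty partitions you then fill; the counts are illustrative and the java.util.List/ArrayList imports are omitted:
int numPartitions = 10;
int rowsPerPartition = 1000000 / numPartitions;
JavaRDD<String> randomStrings = jsc.<String>emptyRDD()
        .repartition(numPartitions)
        .mapPartitions(ignored -> {
            List<String> rows = new ArrayList<>(rowsPerPartition);
            for (int i = 0; i < rowsPerPartition; i++) {
                rows.add(RandomStringUtils.randomAlphabetic(42));
            }
            return rows.iterator(); // each of the 10 empty partitions emits 100,000 rows
        });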
Side notes:
Having RandomRDDs.randomRDD() exposed in the Java API would make this simpler, but unfortunately it is not.
However, RandomRDDs.randomVectorRDD() is exposed, so you could use that one if you need to generate random vectors (but you asked for Strings here, so that does not apply).
The RandomRDD class is private to Spark, but we can access the RandomRDDs class and use it to create these. There are some examples in JavaRandomRDDsSuite.java (see https://github.com/apache/spark/blob/master/mllib/src/test/java/org/apache/spark/mllib/random/JavaRandomRDDsSuite.java ). The Java examples there all produce Doubles and the like, but we can take that output and turn it into strings like so:
import static org.apache.spark.mllib.random.RandomRDDs.*;
...
JavaDoubleRDD rdd1 = normalJavaRDD(sc, size, numPartitions);
JavaRDD<String> rdd = rdd1.map(e -> Double.toString(e));
That being said, we could use the randomRDD function, but it uses class tags, which are a bit frustrating to use from Java. (I've created a JIRA, https://issues.apache.org/jira/browse/SPARK-10626, to add an easy Java API for accessing this.)
I am trying to enhance data in a pipeline by querying Datastore in a DoFn step.
A field from an object of the class CustomClass is used to query a Datastore table, and the returned values are used to enhance the object.
The code looks like this:
public class EnhanceWithDataStore extends DoFn<CustomClass, CustomClass> {

    private static Datastore datastore = DatastoreOptions.defaultInstance().service();
    private static KeyFactory articleKeyFactory = datastore.newKeyFactory().kind("article");

    @Override
    public void processElement(ProcessContext c) throws Exception {
        CustomClass event = c.element();
        Entity article = datastore.get(articleKeyFactory.newKey(event.getArticleId()));
        String articleName = "";
        try {
            articleName = article.getString("articleName");
        } catch (Exception e) {}
        CustomClass enhanced = new CustomClass(event);
        enhanced.setArticleName(articleName);
        c.output(enhanced);
    }
}
When it is run locally, this is fast, but when it is run in the cloud, this step slows down the pipeline significantly. What's causing this? Is there any workaround or better way to do this?
A picture of the pipeline can be found here (the last step is the enhancing step):
pipeline architecture
What you are doing here is a join between your input PCollection<CustomClass> and the enhancements in Datastore.
For each partition of your PCollection, the calls to Datastore are going to be single-threaded, hence incur a lot of latency. I would expect this to be slow in the DirectPipelineRunner and InProcessPipelineRunner as well. With autoscaling and dynamic work rebalancing, you should see parallelism when running on the Dataflow service unless something about the structure of your pipeline causes us to optimize it poorly, so you can try increasing --maxNumWorkers. But you still won't benefit from bulk operations.
It is probably better to express this join within your pipeline, using DatastoreIO.readFrom(...) followed by a CoGroupByKey transform. In this way, Dataflow will do a bulk parallel read of all the enhancements and use the efficient GroupByKey machinery to line them up with the events.
// Here are the two collections you want to join
PCollection<CustomClass> events = ...;
PCollection<Entity> articles = DatastoreIO.readFrom(...);

// Key them both by the common id
PCollection<KV<Long, CustomClass>> keyedEvents =
    events.apply(WithKeys.of(event -> event.getArticleId()));
PCollection<KV<Long, Entity>> keyedArticles =
    articles.apply(WithKeys.of(article -> article.getKey().getId()));

// Set up the join by giving tags to each collection
TupleTag<CustomClass> eventTag = new TupleTag<CustomClass>() {};
TupleTag<Entity> articleTag = new TupleTag<Entity>() {};
KeyedPCollectionTuple<Long> coGbkInput =
    KeyedPCollectionTuple
        .of(eventTag, keyedEvents)
        .and(articleTag, keyedArticles);

PCollection<CustomClass> enhancedEvents = coGbkInput
    .apply(CoGroupByKey.create())
    .apply(MapElements.via((CoGbkResult joinResult) -> {
        for (CustomClass event : joinResult.getAll(eventTag)) {
            String articleName;
            try {
                articleName = joinResult.getOnly(articleTag).getString("articleName");
            } catch (Exception e) {
                articleName = "";
            }
            CustomClass enhanced = new CustomClass(event);
            enhanced.setArticleName(articleName);
            return enhanced;
        }
    }));
Another possibility, if there are few enough articles to store the lookup in memory, is to use DatastoreIO.readFrom(...), read them all as a map side input via View.asMap(), and look them up in a local table.
// Here are the two collections you want to join
PCollection<CustomClass> events = ...;
PCollection<Entity> articles = DatastoreIO.readFrom(...);

// Key the articles and create a map view
PCollectionView<Map<Long, Entity>> articleView = articles
    .apply(WithKeys.of(article -> article.getKey().getId()))
    .apply(View.asMap());

// Do a lookup join by side input to a ParDo
PCollection<CustomClass> enhanced = events
    .apply(ParDo.withSideInputs(articleView).of(new DoFn<CustomClass, CustomClass>() {
        @Override
        public void processElement(ProcessContext c) {
            CustomClass event = c.element();
            Map<Long, Entity> articleLookup = c.sideInput(articleView);
            String articleName;
            try {
                articleName =
                    articleLookup.get(event.getArticleId()).getString("articleName");
            } catch (Exception e) {
                articleName = "";
            }
            CustomClass enhanced = new CustomClass(event);
            enhanced.setArticleName(articleName);
            c.output(enhanced);
        }
    }));
Depending on your data, either of these may be a better choice.
After some checking I managed to pinpoint the problem: the project is located in the EU (and as such, the Datastore is located in the EU zone, same as the App Engine zone), while the Dataflow jobs themselves (and thus the workers) are hosted in the US by default (when the zone option is not overridden).
The difference in performance is 25-30 fold: ~40 elements/s compared to ~1200 elements/s for 15 workers.
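For reference, a hedged sketch of pinning the workers to an EU zone through the pipeline options; this assumes the 1.x Dataflow SDK's DataflowPipelineOptions exposes the worker zone setter, and the zone name is illustrative:
DataflowPipelineOptions options =
    PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
options.setZone("europe-west1-b"); // assumption: keeps workers in the same region as the Datastore / App Engine app
Pipeline p = Pipeline.create(options);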
I'm using Spark to calculate the PageRank of user reviews, but I keep getting a Spark java.lang.StackOverflowError when I run my code on a big dataset (40k entries). When running the code on a small number of entries it works fine, though.
Entry Example :
product/productId: B00004CK40 review/userId: A39IIHQF18YGZA review/profileName: C. A. M. Salas review/helpfulness: 0/0 review/score: 4.0 review/time: 1175817600 review/summary: Reliable comedy review/text: Nice script, well acted comedy, and a young Nicolette Sheridan. Cusak is in top form.
The Code:
public void calculatePageRank() {
    sc.clearCallSite();
    sc.clearJobGroup();

    JavaRDD<String> rddFileData = sc.textFile(inputFileName).cache();
    sc.setCheckpointDir("pagerankCheckpoint/");

    JavaRDD<String> rddMovieData = rddFileData.map(new Function<String, String>() {
        @Override
        public String call(String arg0) throws Exception {
            String[] data = arg0.split("\t");
            String movieId = data[0].split(":")[1].trim();
            String userId = data[1].split(":")[1].trim();
            return movieId + "\t" + userId;
        }
    });

    JavaPairRDD<String, Iterable<String>> rddPairReviewData = rddMovieData.mapToPair(new PairFunction<String, String, String>() {
        @Override
        public Tuple2<String, String> call(String arg0) throws Exception {
            String[] data = arg0.split("\t");
            return new Tuple2<String, String>(data[0], data[1]);
        }
    }).groupByKey().cache();

    JavaRDD<Iterable<String>> cartUsers = rddPairReviewData.map(f -> f._2());
    List<Iterable<String>> cartUsersList = cartUsers.collect();
    JavaPairRDD<String, String> finalCartesian = null;
    int iterCounter = 0;
    for (Iterable<String> out : cartUsersList) {
        JavaRDD<String> currentUsersRDD = sc.parallelize(Lists.newArrayList(out));
        if (finalCartesian == null) {
            finalCartesian = currentUsersRDD.cartesian(currentUsersRDD);
        } else {
            finalCartesian = currentUsersRDD.cartesian(currentUsersRDD).union(finalCartesian);
            if (iterCounter % 20 == 0) {
                finalCartesian.checkpoint();
            }
        }
    }

    JavaRDD<Tuple2<String, String>> finalCartesianToTuple = finalCartesian.map(m -> new Tuple2<String, String>(m._1(), m._2()));
    finalCartesianToTuple = finalCartesianToTuple.filter(x -> x._1().compareTo(x._2()) != 0);
    JavaPairRDD<String, String> userIdPairs = finalCartesianToTuple.mapToPair(m -> new Tuple2<String, String>(m._1(), m._2()));

    JavaRDD<String> userIdPairsString = userIdPairs.map(new Function<Tuple2<String, String>, String>() {
        // Tuple2<Tuple2<MovieId, userId>, Tuple2<movieId, userId>>
        @Override
        public String call(Tuple2<String, String> t) throws Exception {
            return t._1 + " " + t._2;
        }
    });

    try {
        // calculate pagerank using this https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/JavaPageRank.java
        JavaPageRank.calculatePageRank(userIdPairsString, 100);
    } catch (Exception e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }

    sc.close();
}
I have multiple suggestions which will help you to greatly improve the performance of the code in your question.
Caching: Caching should be used on data sets that you need to refer to again and again for the same or different operations (iterative algorithms).
An example is RDD.count — to tell you the number of lines in the
file, the file needs to be read. So if you write RDD.count, at
this point the file will be read, the lines will be counted, and the
count will be returned.
What if you call RDD.count again? The same thing: the file will be
read and counted again. So what does RDD.cache do? Now, if you run
RDD.count the first time, the file will be loaded, cached, and
counted. If you call RDD.count a second time, the operation will use
the cache. It will just take the data from the cache and count the
lines, no recomputing.
Read more about caching here.
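As a tiny illustration of that point (the file name is just an example):
JavaRDD<String> lines = sc.textFile("reviews.txt");
long a = lines.count();   // the file is read and the lines are counted
long b = lines.count();   // the file is read and counted again
lines.cache();
long c = lines.count();   // the file is read once more and the RDD is cached
long d = lines.count();   // served from the cache, no recomputation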
In your code sample you are not reusing anything that you've cached. So you may remove the .cache from there.
Parallelization: In the code sample, you've parallelized every individual element of your RDD, which is already a distributed collection. I suggest you merge the rddFileData, rddMovieData and rddPairReviewData steps so that everything happens in one go, as in the sketch below.
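A sketch of what that merged chain could look like (same tab-separated parsing as in the question):
JavaPairRDD<String, Iterable<String>> rddPairReviewData = sc.textFile(inputFileName)
    .mapToPair(line -> {
        String[] data = line.split("\t");
        String movieId = data[0].split(":")[1].trim();
        String userId = data[1].split(":")[1].trim();
        return new Tuple2<>(movieId, userId);
    })
    .groupByKey();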
Get rid of .collect, since that brings the results back to the driver and may be the actual reason for your error.
This problem occurs when your DAG grows big and too many levels of transformations happen in your code. The JVM cannot hold the long chain of operations built up for lazy execution when an action is finally performed.
Checkpointing is one option. I would suggest implementing Spark SQL for this kind of aggregation: if your data is structured, try loading it into DataFrames and perform the grouping and other SQL functions to achieve this. A sketch of that route follows.
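A hedged sketch of that route, assuming Spark 2.x's SparkSession and reusing the already parsed movieId/userId strings (rddMovieData) from the question; names are illustrative:
import static org.apache.spark.sql.functions.collect_list;
...
SparkSession spark = SparkSession.builder().appName("reviews").getOrCreate();
StructType schema = new StructType()
        .add("movieId", DataTypes.StringType)
        .add("userId", DataTypes.StringType);
// turn each "movieId<TAB>userId" line into a Row
JavaRDD<Row> rows = rddMovieData.map(line -> {
    String[] parts = line.split("\t");
    return RowFactory.create(parts[0], parts[1]);
});
Dataset<Row> pairs = spark.createDataFrame(rows, schema);
// group without hand-building a long RDD lineage
Dataset<Row> usersPerMovie = pairs.groupBy("movieId")
        .agg(collect_list("userId").alias("userIds"));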
When your for loop grows really large, Spark can no longer keep track of the lineage. Enable checkpointing in your for loop to checkpoint your RDD every 10 iterations or so. Checkpointing will fix the problem. Don't forget to clean up the checkpoint directory afterwards.
http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing
The things below fixed the StackOverflowError; as others pointed out, it's caused by the lineage that Spark keeps building, especially when you have a loop/iteration in your code.
Set checkpoint directory
spark.sparkContext.setCheckpointDir("./checkpoint")
Checkpoint the DataFrame/RDD you are modifying/operating on in the iteration
modifyingDf.checkpoint()
Cache any DataFrame that is reused in each iteration
reusedDf.cache()
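Putting those three together in Java, a minimal sketch of the iterative pattern (the names, the step function and the checkpoint interval are illustrative):
jsc.setCheckpointDir("./checkpoint");
JavaPairRDD<String, String> current = initialRdd;  // whatever you start the loop with
for (int i = 0; i < numIterations; i++) {
    current = nextStep(current);   // hypothetical per-iteration transformation
    if (i % 10 == 0) {
        current.cache();           // keep it in memory so checkpointing does not recompute it
        current.checkpoint();      // truncate the lineage
        current.count();           // force an action so the checkpoint is actually written
    }
}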
I want to iterate through the values of a KV PCollection on a per-key basis. I used the code below to combine them using a custom class,
PCollection<KV<String, String>> combinesAttributes =
valExtract.get(extAttUsers).apply(Combine.<String, String>perKey(
new CombineAttributes()));
And below is my custom combine class,
public static class CombineAttributes implements SerializableFunction<Iterable<String>, String> {
    @Override
    public String apply(Iterable<String> input) { ... }
}
This was working fine for small inputs, but for large inputs the combine was not as expected. The output had combined only a few values for a key; others were missing. I was assuming that the output contained combined data from only one node.
The documentation in https://cloud.google.com/dataflow/model/combine mentions to use CombineFn in order to combine full collection-of-values per key in all nodes.
But when I changed the custom combine function as below, I am getting the following error,
incompatible types: CombineAttributes cannot be converted to com.google.cloud.dataflow.sdk.transforms.SerializableFunction<java.lang.Iterable<java.lang.String>,java.lang.String>
Combine function
public static class CombineAttributes extends CombineFn<Iterable<String>, CombineAttributes.Accum, String> {
public static class Accum {
List<String> inputList = new ArrayList<String>();
}
public Accum createAccumulator() { return new Accum(); }
public Accum addInput(Accum accum, Iterable<String> input) {
for (String item : input) {
accum.inputList.add(item);
}
return accum;
}
public Accum mergeAccumulators(Iterable<Accum> accums) {
Accum merged = createAccumulator();
for (Accum accum : accums) {
for (String item : accum.inputList) {
merged.inputList.add(item);
}
}
return merged;
}
public String extractOutput(Accum accum) {
return "";
}
}
I couldn't find any sample code for Combine.perKey with a custom CombineFn. Please let me know what is wrong with the code above.
If you just want to iterate through all the values, you can use GroupByKey to turn a PCollection<KV<K, V>> into a PCollection<KV<K, Iterable<V>>>. Then you can write a DoFn that processes each element of that and, inside it, iterate over the Iterable<V>; a minimal sketch is shown after the next paragraph.
Note that you'll only receive all values associated with a key in the same window. If you're using the default global window, that will be all values.
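A minimal sketch of that approach, reusing the KV<String, String> element type from the question; the semicolon-joining logic is just a placeholder for whatever combination you actually need:
PCollection<KV<String, Iterable<String>>> grouped =
    valExtract.get(extAttUsers).apply(GroupByKey.<String, String>create());
PCollection<KV<String, String>> combined = grouped.apply(ParDo.of(
    new DoFn<KV<String, Iterable<String>>, KV<String, String>>() {
        @Override
        public void processElement(ProcessContext c) {
            StringBuilder sb = new StringBuilder();
            for (String value : c.element().getValue()) {
                sb.append(value).append(";");
            }
            c.output(KV.of(c.element().getKey(), sb.toString()));
        }
    }));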
Combine and CombineFn are most useful when you want to combine all the values into a smaller output. For instance, if you want to take the sum or mean of all the values it will be more efficient to do so using Sum.perKey() or Mean.perKey(). The efficiency comes from being able to pass around (and merge) accumulators. In the case of Sum, this corresponds to a partial sum.
As an example, say the pipeline runs on two machines. The first machine processes KV<user1, attr1a>, KV<user1, attr1b>, KV<user2, attr2a> and the second machine processes KV<user1, attr1c>, KV<user2, attr2b>.
The CombineAttributes (either way it was implemented) would first be invoked on each machine. So it could combine [attr1a, attr1b] into a single string or accumulator (say attr1a+attr1b). Then it would run on the other machine to combine [attr1c] to attr1c. Then it would merge all of these partial results to get a final accumulator -- attr1a+attr1b+attr1c. In the case of the original implementation, that would be the final answer. In the latter, extractOutput would be called on this accumulator.
I have real time streaming data coming into spark and I would like to do a moving average forecasting on that time-series data. Is there any way to implement this using spark in Java?
I've already referred to : https://gist.github.com/samklr/27411098f04fc46dcd05/revisions
and
Apache Spark Moving Average
but both these codes are written in Scala. Since I'm not familiar with Scala, I'm not able to judge if I'll find it useful or even convert the code to Java.
Is there any direct implementation of forecasting in Spark Java?
I took the question you were referring to and struggled for a couple of hours to translate the Scala code into Java:
// Read a file containing the Stock Quotations
// You can also parallelize a collection of objects to create an RDD
JavaRDD<String> linesRDD = sc.textFile("some sample file containing stock prices");
// Convert the lines into our business objects
JavaRDD<StockQuotation> quotationsRDD = linesRDD.flatMap(new ConvertLineToStockQuotation());
// We need these two objects in order to use the MLLib RDDFunctions object
ClassTag<StockQuotation> classTag = scala.reflect.ClassManifestFactory.fromClass(StockQuotation.class);
RDD<StockQuotation> rdd = JavaRDD.toRDD(quotationsRDD);
// Instantiate a RDDFunctions object to work with
RDDFunctions<StockQuotation> rddFs = RDDFunctions.fromRDD(rdd, classTag);
// This applies the sliding function and return the (DATE,SMA) tuple
JavaPairRDD<Date, Double> smaPerDate = rddFs.sliding(slidingWindow).toJavaRDD().mapToPair(new MovingAvgByDateFunction());
List<Tuple2<Date, Double>> smaPerDateList = smaPerDate.collect();
Then you have to use a new Function Class to do the actual calculation of each data window:
public class MovingAvgByDateFunction implements PairFunction<Object,Date,Double> {
/**
*
*/
private static final long serialVersionUID = 9220435667459839141L;
@Override
public Tuple2<Date, Double> call(Object t) throws Exception {
StockQuotation[] stocks = (StockQuotation[]) t;
List<StockQuotation> stockList = Arrays.asList(stocks);
Double result = stockList.stream().collect(Collectors.summingDouble(new ToDoubleFunction<StockQuotation>() {
@Override
public double applyAsDouble(StockQuotation value) {
return value.getValue();
}
}));
result = result / stockList.size();
return new Tuple2<Date, Double>(stockList.get(0).getTimestamp(),result);
}
}
If you want more detail on this, I wrote about Simple Moving Averages here:
https://t.co/gmWltdANd3
What is the simplest way to implement a parallel computation (e.g. on a multi-core processor) using Java?
I.e. the Java equivalent of this Scala code:
val list = aLargeList
list.par.map(_*2)
There is this library, but it seems overwhelming.
http://gee.cs.oswego.edu/dl/jsr166/dist/extra166ydocs/
Don't give up so fast, snappy! ))
From the javadocs (with changes to map to your f), the essential matter is really just this:
ParallelLongArray a = ... // you provide
a.replaceWithMapping(new LongOp() { public long op(long a) { return a * 2L; } });
is pretty much this, right?
val list = aLargeList
list.par.map(_*2)
And if you are willing to live with a bit less terseness, the above can be a reasonably clean and clear 3-liner (and of course, if you reuse functions, then it's the same exact thing as Scala: inline functions):
ParallelLongArray a = ... // you provide
LongOp f = new LongOp() { public long op(long a){return a*2L;}};
a.replaceWithMapping (f);
[edited above to show concise complete form ala OP's Scala variant]
And here it is in maximally verbose form, where we start from scratch for the demo:
import java.util.Random;
import jsr166y.ForkJoinPool;
import extra166y.Ops.LongGenerator;
import extra166y.Ops.LongOp;
import extra166y.ParallelLongArray;
public class ListParUnaryFunc {
public static void main(String[] args) {
int n = Integer.parseInt(args[0]);
// create a parallel long array
// with random long values
ParallelLongArray a = ParallelLongArray.create(n-1, new ForkJoinPool());
a.replaceWithGeneratedValue(generator);
// use it: apply unaryLongFuncOp in parallel
// to all values in array
a.replaceWithMapping(unaryLongFuncOp);
// examine it
for(Long v : a.asList()){
System.out.format("%d\n", v);
}
}
static final Random rand = new Random(System.nanoTime());
static LongGenerator generator = new LongGenerator() {
@Override final
public long op() { return rand.nextLong(); }
};
static LongOp unaryLongFuncOp = new LongOp() {
@Override final public long op(long a) { return a * 2L; }
};
}
Final edit and notes:
Also note that a simple class such as the following (which you can reuse across your projects):
/**
* The very basic form w/ TODOs on checks, concurrency issues, init, etc.
*/
final public static class ParArray {
private ParallelLongArray parr;
private final long[] arr;
public ParArray (long[] arr){
this.arr = arr;
}
public final ParArray par() {
if(parr == null)
parr = ParallelLongArray.createFromCopy(arr, new ForkJoinPool()) ;
return this;
}
public final ParallelLongArray map(LongOp op) {
return parr.replaceWithMapping(op);
}
public final long[] values() { return parr.getArray(); }
}
and something like that will allow you to write more fluid Java code (if terseness matters to you):
long[] arr = ... // you provide
LongOp f = ... // you provide
ParArray list = new ParArray(arr);
list.par().map(f);
And the above approach can certainly be pushed to make it even cleaner.
Doing that on one machine is pretty easy, but not as easy as Scala makes it. The concurrency utilities that library builds on (java.util.concurrent) have been part of Java since Java 5. Probably the simplest thing to use is an ExecutorService. That represents a pool of threads that can run on any core; you send it tasks and get back the results.
http://download.oracle.com/javase/1.5.0/docs/api/java/util/concurrent/ThreadPoolExecutor.html
http://www.fromdev.com/2009/06/how-can-i-leverage-javautilconcurrent.html
I'd suggest using ExecutorService.invokeAll(), which will return a list of Futures. Then you can check them to see if they're done; a minimal sketch follows.
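A self-contained sketch of that approach (the doubling task mirrors the Scala _*2 example; the pool size and values are illustrative):
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class InvokeAllExample {
    public static void main(String[] args) throws Exception {
        List<Long> input = Arrays.asList(1L, 2L, 3L, 4L, 5L);

        // one Callable per element; each task doubles its value
        List<Callable<Long>> tasks = new ArrayList<Callable<Long>>();
        for (final Long v : input) {
            tasks.add(new Callable<Long>() {
                public Long call() { return v * 2L; }
            });
        }

        // fixed pool sized to the number of cores
        ExecutorService pool =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

        // invokeAll blocks until every task has completed
        List<Future<Long>> results = pool.invokeAll(tasks);
        for (Future<Long> f : results) {
            System.out.println(f.get());
        }
        pool.shutdown();
    }
}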
If you're using Java 7 then you could use the fork/join framework, which might save you some work. With all of these you can build something very similar to Scala's parallel arrays, so using it is fairly concise; a fork/join sketch is shown below.
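A small fork/join sketch (Java 7+) for the same "double every element" job; the threshold value is illustrative:
import java.util.Arrays;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

public class DoubleInPlace extends RecursiveAction {
    private static final int THRESHOLD = 1000;
    private final long[] data;
    private final int from;
    private final int to;

    DoubleInPlace(long[] data, int from, int to) {
        this.data = data;
        this.from = from;
        this.to = to;
    }

    @Override
    protected void compute() {
        if (to - from <= THRESHOLD) {
            for (int i = from; i < to; i++) {
                data[i] *= 2L; // the unary function applied to each element
            }
        } else {
            int mid = (from + to) >>> 1;
            // fork both halves and wait for them
            invokeAll(new DoubleInPlace(data, from, mid),
                      new DoubleInPlace(data, mid, to));
        }
    }

    public static void main(String[] args) {
        long[] arr = {1, 2, 3, 4, 5};
        new ForkJoinPool().invoke(new DoubleInPlace(arr, 0, arr.length));
        System.out.println(Arrays.toString(arr)); // prints [2, 4, 6, 8, 10]
    }
}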
Use threads; Java doesn't have this sort of thing built in.
There will be an equivalent in Java 8: http://www.infoq.com/articles/java-8-vs-scala
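For reference, once Java 8 is available the Scala one-liner maps onto a parallel stream along these lines:
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
...
List<Long> list = Arrays.asList(1L, 2L, 3L, 4L, 5L);
List<Long> doubled = list.parallelStream()   // splits the work across cores
        .map(x -> x * 2L)
        .collect(Collectors.toList());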