SparkContext parallelize invocation example in java


I am getting started with Spark and ran into an issue while trying to implement a simple example of the map function. The issue is with the definition of 'parallelize' in the new version of Spark. Can someone share an example of how to use it? The call below fails with an error about insufficient arguments.
Spark Version : 2.3.2
Java : 1.8
SparkSession session = SparkSession.builder().appName("Compute Square of Numbers").config("spark.master","local").getOrCreate();
SparkContext context = session.sparkContext();
List<Integer> seqNumList = IntStream.rangeClosed(10, 20).boxed().collect(Collectors.toList());
JavaRDD<Integer> numRDD = context.parallelize(seqNumList, 2);
Compile-time error message: The method expects 3 arguments
I do not understand what the third argument should be. According to the documentation, it is supposed to be
scala.reflect.ClassTag<T>
But how do I define or use it?
Please do not suggest using JavaSparkContext; I want to know how to get this approach to work with the generic SparkContext.
Ref : https://spark.apache.org/docs/2.2.1/api/java/org/apache/spark/SparkContext.html#parallelize-scala.collection.Seq-int-scala.reflect.ClassTag-

Here is the code that finally worked for me. It is not the best way to achieve the result, but it was a way for me to explore the API.
SparkSession session = SparkSession.builder().appName("Compute Square of Numbers")
.config("spark.master", "local").getOrCreate();
SparkContext context = session.sparkContext();
List<Integer> seqNumList = IntStream.rangeClosed(10, 20).boxed().collect(Collectors.toList());
RDD<Integer> numRDD = context
.parallelize(JavaConverters.asScalaIteratorConverter(seqNumList.iterator()).asScala()
.toSeq(), 2, scala.reflect.ClassTag$.MODULE$.apply(Integer.class));
numRDD.toJavaRDD().foreach(x -> System.out.println(x));
session.stop();

If you don't want to deal with providing the two extra parameters that SparkContext requires, you can also use JavaSparkContext.parallelize(), which only needs an input list:
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.rdd.RDD;
JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
final RDD<Integer> rdd = jsc.parallelize(seqNumList).map(num -> {
// your implementation
}).rdd();

Related

Partition not working in mongodb spark read in java connector

I am trying to read data using the MongoDB Spark connector and want to partition the dataset on a key, reading from a standalone mongod instance. The MongoDB Spark documentation mentions various partitioner classes. I tried the MongoSamplePartitioner class, but it reads into just 1 partition. The MongoPaginateByCountPartitioner class always produces a fixed 66 partitions. This happens even though I configure "samplesPerPartition" and "numberOfPartitions" for these two partitioners respectively. I need to use a ReadConfig created from a map. My code:
SparkSession sparkSession = SparkSession.builder().appName("sampleRecords")
.config("spark.driver.host", "2g")
.config("spark.driver.host", "127.0.0.1")
.master("local[4]").getOrCreate();
Map<String, String> readOverrides = new HashMap<>();
readOverrides.put("uri", "mongodb://mongo-root:password@127.0.0.1:27017/importedDb.myNewCollection?authSource=admin");
readOverrides.put("numberOfPartitions", "16");
readOverrides.put("partitioner", "MongoPaginateByCountPartitioner");
ReadConfig readConfig = ReadConfig.create(readOverrides);
JavaSparkContext jsc = JavaSparkContext.fromSparkContext(sparkSession.sparkContext());
Dataset<Row> dataset = MongoSpark.load(jsc, readConfig).toDF();
System.out.println(dataset.count()); //24576
System.out.println(dataset.rdd().getNumPartitions()); //66
Using the sample partitioner returns 1 partition every time.
Am I missing something here? Please help.
PS - I am reading 24576 records; mongod version v4.0.10, Mongo Spark connector 2.3.1, Java 8
Edit:
I got it to work: the properties need to be given in the map in the form partitionerOptions.samplesPerPartition. But I am still facing an issue: with partitionerOptions.samplesPerPartition set to "1000", MongoSamplePartitioner still returns only 1 partition. Any suggestions?

The number of partitions can be configured for MongoPaginateByCountPartitioner.
Suppose we need to set the target number of partitions to 16.
Add partitionerOptions.numberOfPartitions -> 16 to the properties, rather than just numberOfPartitions -> 16.
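As a rough sketch of the prefixed key described above, the override map could be built like this. The class name MongoReadOverrides and the helper build are hypothetical, and the URI is a placeholder; only the "partitionerOptions.numberOfPartitions" key and the partitioner name come from this thread. The sketch uses only the JDK, so it stops short of calling the connector.

```java
import java.util.HashMap;
import java.util.Map;

class MongoReadOverrides {
    // Hypothetical helper: collects the read overrides in one place.
    // Partitioner-specific settings carry the "partitionerOptions." prefix.
    static Map<String, String> build(String uri, String partitioner, int numberOfPartitions) {
        Map<String, String> overrides = new HashMap<>();
        overrides.put("uri", uri);
        overrides.put("partitioner", partitioner);
        overrides.put("partitionerOptions.numberOfPartitions",
                Integer.toString(numberOfPartitions));
        return overrides;
    }

    public static void main(String[] args) {
        Map<String, String> overrides = build(
                "mongodb://127.0.0.1:27017/importedDb.myNewCollection", // placeholder URI
                "MongoPaginateByCountPartitioner",
                16);
        // In the Spark job, this map would be handed to ReadConfig.create(overrides).
        overrides.forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

The same prefix would apply to partitionerOptions.samplesPerPartition when using MongoSamplePartitioner, as noted in the question's edit.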

Cannot resolve method 'flatMap(<lambdaexpression>)' error

I am new to Apache Spark and trying to run the word count example, but the IntelliJ editor gives me the error "Cannot resolve method 'flatMap()'" at line 47.
Edit :
This is the line where I am getting the error
JavaRDD<String> words = lines.flatMap(s -> Arrays.asList(SPACE.split(s)).iterator());

It looks like you're using an older version of Spark that expects Iterable rather than Iterator from the flatMap() function. Try this:
JavaRDD<String> words = lines.flatMap(s -> Arrays.asList(SPACE.split(s)));
See also Spark 2.0.0 Arrays.asList not working - incompatible types

Stream#flatMap combines multiple streams into one, so the function you pass to it must return a Stream.
You can try something like this:
lines.stream().flatMap(line -> Stream.of(SPACE.split(line)))
.map(word -> // map to JavaRDD)
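For reference, here is a minimal JDK-only sketch of Stream#flatMap splitting lines into words. It assumes the lines live in a java.util.List rather than a JavaRDD, and the class name StreamWordSplit is made up for illustration; SPACE mirrors the pattern used in the question.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class StreamWordSplit {
    private static final Pattern SPACE = Pattern.compile(" ");

    // Splits every line on single spaces and flattens the pieces into one list.
    static List<String> words(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(SPACE.split(line)))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be or", "not to be");
        System.out.println(words(lines)); // prints [to, be, or, not, to, be]
    }
}
```

Note that this operates on a Java collection on the driver; it does not distribute any work the way JavaRDD.flatMap does.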

The flatMap method takes a FlatMapFunction as its parameter, which is not annotated with @FunctionalInterface, so you cannot use it as a lambda.
Build a real FlatMapFunction object and pass it as the parameter to be sure.
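For what it's worth, the Java language itself does not require the @FunctionalInterface annotation for a lambda to be accepted: any interface with exactly one abstract method qualifies, and the annotation only asks the compiler to enforce that. The sketch below illustrates this with a hypothetical stand-in interface, SplitFunction, which is not Spark's FlatMapFunction; it uses only the JDK.

```java
import java.util.Arrays;
import java.util.Iterator;

class LambdaWithoutAnnotation {
    // Hypothetical stand-in for a single-method interface such as FlatMapFunction.
    // It is deliberately NOT annotated with @FunctionalInterface.
    interface SplitFunction<T, R> {
        Iterator<R> call(T input);
    }

    // Applies a lambda-backed SplitFunction and counts the elements it yields.
    static int countWords(String line) {
        SplitFunction<String, String> split = s -> Arrays.asList(s.split(" ")).iterator();
        int count = 0;
        Iterator<String> it = split.call(line);
        while (it.hasNext()) {
            it.next();
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countWords("to be or not")); // prints 4
    }
}
```

If the lambda still fails to resolve in the IDE, a mismatch between the lambda's return type and what the interface method expects (for example Iterable versus Iterator) is a likelier cause than the missing annotation.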

flatMap() is part of the Java 8 Stream API. I think you should check the Java language level that IDEA compiles with.

Error using Spark's Kryo serializer with java protocol buffers that have arrays of strings

I am hitting a bug when using Java protocol buffer classes as the object model for RDDs in Spark jobs.
In my application, my .proto file has properties that are repeated strings. For example:
message OntologyHumanName
{
repeated string family = 1;
}
From this, the 2.5.0 protoc compiler generates Java code like
private com.google.protobuf.LazyStringList family_ = com.google.protobuf.LazyStringArrayList.EMPTY;
If I run a Scala Spark job that uses the Kryo serializer I get the following error
Caused by: java.lang.NullPointerException
at com.google.protobuf.UnmodifiableLazyStringList.size(UnmodifiableLazyStringList.java:61)
at java.util.AbstractList.add(AbstractList.java:108)
at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:134)
at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:40)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:708)
at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
... 40 more
The same code works fine with spark.serializer=org.apache.spark.serializer.JavaSerializer.
My environment is CDH QuickStart 5.5 with JDK 1.8.0_60

Try to register the lazy class with:
Kryo kryo = new Kryo();
kryo.register(com.google.protobuf.LazyStringArrayList.class);
Also, for custom Protobuf messages, take a look at the solution in this answer for registering custom/nested classes generated by protoc.

I think your RDD's type contains the class OntologyHumanName, for example RDD[(String, OntologyHumanName)], and an RDD of this type happens to go through a shuffle stage. See https://github.com/EsotericSoftware/kryo#kryoserializable; Kryo can't serialize an abstract class.
Read the Spark documentation: http://spark.apache.org/docs/latest/tuning.html#data-serialization
val conf = new SparkConf().setMaster(...).setAppName(...)
conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
val sc = new SparkContext(conf)
From the Kryo documentation:
public class SomeClass implements KryoSerializable {
// ...
public void write (Kryo kryo, Output output) {
// ...
}
public void read (Kryo kryo, Input input) {
// ...
}
}
But the class OntologyHumanName is generated automatically by protobuf, so I don't think this is a good way to do it.
Try using a case class in place of OntologyHumanName to avoid serializing OntologyHumanName directly. I haven't tried this, so it may not work.
case class OntologyHumanNameScalaCaseClass(val humanNames: OntologyHumanName)
An ugly way: I just converted the protobuf class to Scala types. This approach can't fail. For example:
import scala.collection.JavaConverters._
val humanNameObj: OntologyHumanName = ...
val families: List[String] = humanNameObj.getFamilyList.asScala.toList // use this in place of humanNameObj
Hope this resolves your problem.

Scala Seq for Spark in Java?

I need to use SparkContext instead of JavaSparkContext for the accumulableCollection (if you don't agree check out the linked question and answer it please!)
Clarified Question: SparkContext is available in Java but wants a Scala sequence. How do I make it happy -- in Java?
I have this code to do a simple jsc.parallelize I was using with JavaSparkContext, but SparkContext wants a Scala collection. I thought here I was building a Scala Range and converting it to a Java list, not sure how to get that core Range to be a Scala Seq, which is what the parallelize from SparkContext is asking for.
// The JavaSparkContext way, was trying to get around MAXINT limit, not the issue here
// setup bogus Lists of size M and N for parallelize
//List<Integer> rangeM = rangeClosed(startM, endM).boxed().collect(Collectors.toList());
//List<Integer> rangeN = rangeClosed(startN, endN).boxed().collect(Collectors.toList());
The money line is next, how can I create a Scala Seq in Java to give to parallelize?
// these lists above need to be scala objects now that we switched to SparkContext
scala.collection.Seq<Integer> rangeMscala = scala.collection.immutable.List(startM to endM);
// setup sparkConf and create SparkContext
... SparkConf setup
SparkContext jsc = new SparkContext(sparkConf);
RDD<Integer> dataSetMscala = jsc.parallelize(rangeMscala);

You should use it this way:
scala.collection.immutable.Range rangeMscala =
scala.collection.immutable.Range$.MODULE$.apply(1, 10);
SparkContext sc = new SparkContext();
RDD dataSetMscala =
sc.parallelize(rangeMscala, 3, scala.reflect.ClassTag$.MODULE$.Object());
Hope it helps! Regards

Reuse results of first computation in second computation

I'm trying to write a computation in Flink which requires two phases.
In the first phase I start from a text file, and perform some parameter estimation, obtaining as a result a Java object representing a statistical model of the data.
In the second phase, I'd like to use this object to generate data for a simulation.
I'm unsure how to do this. I tried with a LocalCollectionOutputFormat, and it works locally, but when I deploy the job on a cluster, I get a NullPointerException - which is not really surprising.
What is the Flink way of doing this?
Here is my code:
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
GlobalConfiguration.includeConfiguration(configuration);
// Phase 1: read file and estimate model
DataSource<Tuple4<String, String, String, String>> source = env
.readCsvFile(args[0])
.types(String.class, String.class, String.class, String.class);
List<Tuple4<Bayes, Bayes, Bayes, Bayes>> bayesResult = new ArrayList<>();
// Processing here...
....output(new LocalCollectionOutputFormat<>(bayesResult));
env.execute("Bayes");
DataSet<BTP> btp = env
.createInput(new BayesInputFormat(bayesResult.get(0)))
// Phase 2: BayesInputFormat generates data for further calculations
// ....
This is the exception I get:
Error: The program execution failed: java.lang.NullPointerException
at org.apache.flink.api.java.io.LocalCollectionOutputFormat.close(LocalCollectionOutputFormat.java:86)
at org.apache.flink.runtime.operators.DataSinkTask.invoke(DataSinkTask.java:176)
at org.apache.flink.runtime.execution.RuntimeEnvironment.run(RuntimeEnvironment.java:257)
at java.lang.Thread.run(Thread.java:745)
org.apache.flink.client.program.ProgramInvocationException: The program execution failed: java.lang.NullPointerException
at org.apache.flink.api.java.io.LocalCollectionOutputFormat.close(LocalCollectionOutputFormat.java:86)
at org.apache.flink.runtime.operators.DataSinkTask.invoke(DataSinkTask.java:176)
at org.apache.flink.runtime.execution.RuntimeEnvironment.run(RuntimeEnvironment.java:257)
at java.lang.Thread.run(Thread.java:745)
at org.apache.flink.client.program.Client.run(Client.java:328)
at org.apache.flink.client.program.Client.run(Client.java:294)
at org.apache.flink.client.program.Client.run(Client.java:288)
at org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:55)
at it.list.flink.test.Test01.main(Test01.java:62)
...

With the latest release (0.9-milestone-1) a collect() method was added to Flink
public List<T> collect()
which fetches a DataSet<T> as a List<T> to the driver program. collect() also triggers an immediate execution of the program (you don't need to call ExecutionEnvironment.execute()). Right now, there is a size limitation of about 10 MB for data sets.
If you do not evaluate the models in the driver program, you can also chain both programs together and emit the model to the side by attaching a data sink. This will be more efficient, because the data won't do the round-trip over the client machine.

If you're using Flink prior to 0.9 you may use the following snippet to collect your dataset to a local collection:
val dataJavaList = new ArrayList[K]
val outputFormat = new LocalCollectionOutputFormat[K](dataJavaList)
dataset.output(outputFormat)
env.execute("collect()")
where K is the type of object you want to collect.
