I need to use SparkContext instead of JavaSparkContext for the accumulableCollection (if you don't agree check out the linked question and answer it please!)
Clarified Question: SparkContext is available in Java but wants a Scala sequence. How do I make it happy -- in Java?
I have this code doing a simple jsc.parallelize that I was using with JavaSparkContext, but SparkContext wants a Scala collection. I thought I was building a Scala Range here and converting it to a Java list; I'm not sure how to turn that Range into the Scala Seq that SparkContext's parallelize is asking for.
// The JavaSparkContext way, was trying to get around MAXINT limit, not the issue here
// setup bogus Lists of size M and N for parallelize
//List<Integer> rangeM = rangeClosed(startM, endM).boxed().collect(Collectors.toList());
//List<Integer> rangeN = rangeClosed(startN, endN).boxed().collect(Collectors.toList());
The money line is next: how can I create a Scala Seq in Java to give to parallelize?
// these lists above need to be scala objects now that we switched to SparkContext
scala.collection.Seq<Integer> rangeMscala = scala.collection.immutable.List(startM to endM);
// setup sparkConf and create SparkContext
... SparkConf setup
SparkContext jsc = new SparkContext(sparkConf);
RDD<Integer> dataSetMscala = jsc.parallelize(rangeMscala);
You should use it this way:
// Build a Scala Range (which is a Seq) directly from Java via the companion object
scala.collection.immutable.Range rangeMscala =
        scala.collection.immutable.Range$.MODULE$.apply(1, 10);

SparkContext sc = new SparkContext();

// parallelize needs the Seq, a slice count, and a ClassTag for the element type
RDD<Object> dataSetMscala =
        sc.parallelize(rangeMscala, 3, scala.reflect.ClassTag$.MODULE$.Object());
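Alternatively, if you would rather build the numbers on the Java side, here is a minimal sketch (assuming Spark 2.x with its bundled Scala 2.11/2.12; the class name and range bounds are only illustrative) that converts a java.util.List into the scala.collection.Seq that SparkContext.parallelize expects:

import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.rdd.RDD;

import scala.collection.JavaConverters;
import scala.reflect.ClassTag;
import scala.reflect.ClassTag$;

public class ScalaSeqFromJava {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setAppName("scala-seq-from-java").setMaster("local[*]");
        SparkContext sc = new SparkContext(sparkConf);

        // Build the range as a plain Java list, then view it as a Scala Seq.
        List<Integer> rangeM = IntStream.rangeClosed(1, 10).boxed().collect(Collectors.toList());
        scala.collection.Seq<Integer> rangeMscala =
                JavaConverters.asScalaIteratorConverter(rangeM.iterator()).asScala().toSeq();

        // SparkContext.parallelize also needs a ClassTag for the element type.
        ClassTag<Integer> tag = ClassTag$.MODULE$.apply(Integer.class);
        RDD<Integer> dataSetMscala = sc.parallelize(rangeMscala, 3, tag);

        System.out.println(dataSetMscala.count()); // 10
        sc.stop();
    }
}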
Hope it helps! Regards
Related
I was trying to read data using the MongoDB Spark connector and want to partition the dataset on a key, reading from a standalone mongod instance. I was looking at the MongoDB Spark docs, and they mention various partitioner classes. I tried the MongoSamplePartitioner class, but it reads everything into just 1 partition. The MongoPaginateByCountPartitioner class likewise partitions into a fixed 66 partitions. This happens even when I am configuring "samplesPerPartition" and "numberOfPartitions" in these two cases respectively. I need the ReadConfig created via a map. My code:
SparkSession sparkSession = SparkSession.builder().appName("sampleRecords")
        .config("spark.driver.memory", "2g")
        .config("spark.driver.host", "127.0.0.1")
        .master("local[4]").getOrCreate();
Map<String, String> readOverrides = new HashMap<>();
readOverrides.put("uri", "mongodb://mongo-root:password#127.0.0.1:27017/importedDb.myNewCollection?authSource=admin");
readOverrides.put("numberOfPartitions", "16");
readOverrides.put("partitioner", "MongoPaginateByCountPartitioner");
ReadConfig readConfig = ReadConfig.create(readOverrides);
JavaSparkContext jsc = JavaSparkContext.fromSparkContext(sparkSession.sparkContext());
Dataset<Row> dataset = MongoSpark.load(jsc, readConfig).toDF();
System.out.println(dataset.count()); //24576
System.out.println(dataset.rdd().getNumPartitions()); //66
Using the sample partitioner returns 1 partition every time.
Am I missing something here? Please help.
PS: I am reading 24576 records; mongod version v4.0.10, MongoDB Spark connector 2.3.1, Java 8.
Edit:
I got it to work; the properties need to be given like partitionerOptions.samplesPerPartition in the map. But I am still facing an issue: with partitionerOptions.samplesPerPartition : "1000", MongoSamplePartitioner still returns only 1 partition. Any suggestions?
The number of partitions can be configured for MongoPaginateByCountPartitioner.
Supposing that we need to configure the target number of partitions to 16,
add partitionerOptions.numberOfPartitions -> 16 to the properties rather than only numberOfPartitions -> 16.
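For example, a sketch based on the code from the question (only the option keys change):

import java.util.HashMap;
import java.util.Map;

import com.mongodb.spark.config.ReadConfig;

Map<String, String> readOverrides = new HashMap<>();
readOverrides.put("uri", "mongodb://mongo-root:password@127.0.0.1:27017/importedDb.myNewCollection?authSource=admin");
readOverrides.put("partitioner", "MongoPaginateByCountPartitioner");
// Prefix partitioner-specific settings with "partitionerOptions." so they reach the partitioner.
readOverrides.put("partitionerOptions.numberOfPartitions", "16");

ReadConfig readConfig = ReadConfig.create(readOverrides);

The same prefix applies to the other partitioners, e.g. partitionerOptions.samplesPerPartition for MongoSamplePartitioner, as noted in the question's edit.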
So I have a Java application with Spark Maven dependencies that, when run, launches Spark on the host where it is run. The machine has 36 cores. I am building a SparkSession where I set the number of cores and other config properties, but when I check the stats using htop it doesn't seem to use all the cores, just 1.
SparkSession spark = SparkSession
        .builder()
        .master("local")
        .appName("my-spark")
        .config("spark.driver.memory", "50g")
        .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
        .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
        .config("spark.sql.shuffle.partitions", "400")
        .config("spark.eventLog.enabled", "true")
        .config("spark.eventLog.dir", "/dir1/dir2/logs")
        .config("spark.history.fs.logDirectory", "/dir1/dir2/logs")
        .config("spark.executor.cores", "36")
        .getOrCreate();
I also set the following on the JavaSparkContext as well:
JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());
sc.hadoopConfiguration().set("fs.s3a.access.key", AWS_KEY);
sc.hadoopConfiguration().set("fs.s3a.secret.key", AWS_SECRET_KEY);
sc.hadoopConfiguration().set("spark.driver.memory","50g");
sc.hadoopConfiguration().set("spark.eventLog.enabled", "true");
sc.hadoopConfiguration().set("spark.eventLog.dir", "/dir1/dir2/logs");
sc.hadoopConfiguration().set("spark.executor.cores", "36");
My task reads data from AWS S3 into a DataFrame and writes the data to another bucket.
Dataset<Row> df = spark.read().format("csv").option("header", "true").load("s3a://bucket/file.csv.gz");
//df = df.repartition(200);
df.withColumn("col_name", df.col("col_name")).sort("col_name", "_id").write().format("iceberg").mode("append").save(location);
.gz files are "unsplittable": to decompress them you have to start at byte 0 and read forward. As a result, Spark, Hive, MapReduce, etc. give the whole file to a single worker. If you want parallel processing, use a different compression format (e.g. snappy).
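If the input has to stay gzipped, one hedged workaround (just a sketch, reusing the spark session and path from the question) is to repartition right after the load, so only the initial read runs as a single task:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// The gzip read is still one task, but the sort and write that follow
// can run across 200 partitions instead of 1.
Dataset<Row> df = spark.read()
        .format("csv")
        .option("header", "true")
        .load("s3a://bucket/file.csv.gz")
        .repartition(200);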
You are running Spark in local mode, so spark.executor.cores will not take effect; consider changing .master("local") to .master("local[*]") so the local master uses all available cores.
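For example, a sketch of the builder from the question with only the master changed (the remaining config options from the question are unchanged and omitted here for brevity):

import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession
        .builder()
        .master("local[*]")   // use every core on the box instead of a single one
        .appName("my-spark")
        .config("spark.driver.memory", "50g")
        .getOrCreate();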
Hope this helps
I am getting started with Spark and ran into an issue trying to implement the simple map-function example. The issue is with the definition of 'parallelize' in the newer version of Spark. Can someone share an example of how to use it, since the following way gives an error about insufficient arguments?
Spark Version : 2.3.2
Java : 1.8
SparkSession session = SparkSession.builder().appName("Compute Square of Numbers").config("spark.master","local").getOrCreate();
SparkContext context = session.sparkContext();
List<Integer> seqNumList = IntStream.rangeClosed(10, 20).boxed().collect(Collectors.toList());
JavaRDD<Integer> numRDD = context.parallelize(seqNumList, 2);
Compile-time error message: The method expects 3 arguments
I do not get what the 3rd argument should look like. As per the documentation, it's supposed to be
scala.reflect.ClassTag<T>
But how do I even define or use it?
Please do not suggest using JavaSparkContext; I want to know how to make this approach work with the generic SparkContext.
Ref : https://spark.apache.org/docs/2.2.1/api/java/org/apache/spark/SparkContext.html#parallelize-scala.collection.Seq-int-scala.reflect.ClassTag-
Here is the code which finally worked for me. It is not the best way to achieve the result, but it was a way for me to explore the API.
SparkSession session = SparkSession.builder().appName("Compute Square of Numbers")
        .config("spark.master", "local").getOrCreate();
SparkContext context = session.sparkContext();
List<Integer> seqNumList = IntStream.rangeClosed(10, 20).boxed().collect(Collectors.toList());

RDD<Integer> numRDD = context
        .parallelize(JavaConverters.asScalaIteratorConverter(seqNumList.iterator()).asScala()
                .toSeq(), 2, scala.reflect.ClassTag$.MODULE$.apply(Integer.class));

numRDD.toJavaRDD().foreach(x -> System.out.println(x));
session.stop();
If you don't want to deal with providing the extra two parameters to SparkContext.parallelize(), you can also use JavaSparkContext.parallelize(), which only needs an input list:
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.rdd.RDD;
JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
final RDD<Integer> rdd = jsc.parallelize(seqNumList).map(num -> {
// your implementation
}).rdd();
What is the difference between SparkContext, JavaSparkContext, SQLContext and SparkSession?
Is there any method to convert or create a Context using a SparkSession?
Can I completely replace all the Contexts using one single entry SparkSession?
Are all the functions in SQLContext, SparkContext, and JavaSparkContext also in SparkSession?
Some functions like parallelize have different behaviors in SparkContext and JavaSparkContext. How do they behave in SparkSession?
How can I create the following using a SparkSession?
RDD
JavaRDD
JavaPairRDD
Dataset
Is there a method to transform a JavaPairRDD into a Dataset or a Dataset into a JavaPairRDD?
SparkContext is the Scala implementation entry point and JavaSparkContext is a Java wrapper around it.
SQLContext is the entry point of Spark SQL, and it can be obtained from a SparkContext. Prior to 2.x, RDD, DataFrame and Dataset were three different data abstractions. Since Spark 2.x, all three data abstractions are unified and SparkSession is the unified entry point of Spark.
An additional note: RDDs are meant for unstructured, strongly typed data, while DataFrames are for structured, loosely typed data.
Is there any method to convert or create a Context using SparkSession?
Yes: it is sparkSession.sparkContext(), and for SQL, sparkSession.sqlContext().
Can I completely replace all the Contexts using one single entry SparkSession?
Yes, you can get the respective contexts from the SparkSession.
Are all the functions in SQLContext, SparkContext, JavaSparkContext, etc. also added to SparkSession?
Not directly. You have to get the respective context and use it; think of it as backward compatibility.
How do I use such functions in SparkSession?
Get the respective context and make use of it.
How do I create the following using SparkSession? (a sketch follows this list)
RDD: can be created from sparkSession.sparkContext.parallelize(???)
JavaRDD: the same applies, but with the Java implementation
JavaPairRDD: sparkSession.sparkContext.parallelize(???).map(...) (making your data a key-value pair is one way)
Dataset: what SparkSession returns is a Dataset if it is structured data.
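Here is a minimal Java sketch covering those four from one SparkSession (class and variable names are illustrative, and the data is just a toy list):

import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.rdd.RDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

import scala.Tuple2;

public class EntryPointSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[*]")
                .appName("entry-point-sketch")
                .getOrCreate();

        // The older contexts are still reachable from the session.
        JavaSparkContext jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());

        List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5);

        // JavaRDD via the Java-friendly wrapper ...
        JavaRDD<Integer> javaRdd = jsc.parallelize(numbers);
        // ... and the underlying Scala RDD if you really need it.
        RDD<Integer> rdd = javaRdd.rdd();

        // JavaPairRDD by mapping each element to a key-value pair.
        JavaPairRDD<Integer, Integer> pairRdd = javaRdd.mapToPair(n -> new Tuple2<>(n, n * n));

        // Dataset straight from the SparkSession.
        Dataset<Integer> ds = spark.createDataset(numbers, Encoders.INT());

        System.out.println(pairRdd.count()); // 5
        ds.show();
        spark.stop();
    }
}

For the JavaPairRDD/Dataset conversion asked about above, one route (again just a sketch) is spark.createDataset(pairRdd.rdd(), Encoders.tuple(Encoders.INT(), Encoders.INT())), and in the other direction dataset.toJavaRDD().mapToPair(...).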
Explanation from the Spark source code under branch-2.1:
SparkContext:
Main entry point for Spark functionality. A SparkContext represents the connection to a Spark
cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.
Only one SparkContext may be active per JVM. You must stop() the active SparkContext before
creating a new one. This limitation may eventually be removed; see SPARK-2243 for more details.
JavaSparkContext:
A Java-friendly version of [[org.apache.spark.SparkContext]] that returns
[[org.apache.spark.api.java.JavaRDD]]s and works with Java collections instead of Scala ones.
Only one SparkContext may be active per JVM. You must stop() the active SparkContext before
creating a new one. This limitation may eventually be removed; see SPARK-2243 for more details.
SQLContext:
The entry point for working with structured data (rows and columns) in Spark 1.x.
As of Spark 2.0, this is replaced by [[SparkSession]]. However, we are keeping the class
here for backward compatibility.
SparkSession:
The entry point to programming Spark with the Dataset and DataFrame API.
I will talk about Spark version 2.x only.
SparkSession: It's the main entry point of your Spark application. To run any code on Spark, this is the first thing you should create.
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local").appName("Word Count")\
.config("spark.some.config.option", "some-value")\
.getOrCreate()
SparkContext: It's an inner object (property) of SparkSession. It's used to interact with the low-level API; through SparkContext you can create RDDs, accumulators and broadcast variables.
In most cases you won't need SparkContext. You can get it from the SparkSession:
val sc = spark.sparkContext
SparkContext is the class in the Spark API that is the first stage of building a Spark application. The functionality of the SparkContext is to allocate the driver memory and the number of executors and cores; in short, it is all about cluster management. SparkContext can be used to create RDDs and shared variables. To access it we need to create an object of it.
This is how we can create a SparkContext: val sc = new SparkContext()
SparkSession is a new object added in Spark 2.x as a replacement for SQLContext and HiveContext.
Earlier we had two options: SQLContext, which was the way to do SQL operations on a DataFrame, and HiveContext, which managed the Hive connectivity and fetched/inserted data from/to Hive tables.
Since 2.x we can create a SparkSession for SQL operations on a DataFrame, and if you have any Hive-related work, just call the method enableHiveSupport(); then you can use the SparkSession for both DataFrame and Hive-related SQL operations.
This is how we can create a SparkSession for SQL operations on a DataFrame:
val sparkSession = SparkSession.builder().getOrCreate()
The second way is to create a SparkSession for SQL operations on a DataFrame as well as Hive operations:
val sparkSession = SparkSession.builder().enableHiveSupport().getOrCreate()
SparkContext:
Since Spark 1.x, SparkContext has been the entry point to Spark; it is defined in the org.apache.spark package and is used to programmatically create RDDs, accumulators, and broadcast variables on the cluster. Its instance sc is the default variable available in spark-shell, and it can be created programmatically using the SparkContext class.
SparkContext is a client of Spark's execution environment.
SparkContext is the entry point of a Spark execution job.
SparkContext acts as the master of the Spark application.
SparkSession:
Since Spark 2.0, SparkSession has become the entry point to Spark for working with RDDs, DataFrames, and Datasets. Prior to 2.0, SparkContext was the entry point. Here, I will mainly focus on explaining what SparkSession is, how to create one, and how to use the default 'spark' variable from spark-shell.
From Apache Spark 2.0 onwards, SparkSession is the new entry point for Spark applications.
All the functionality provided by SparkContext is available through the SparkSession.
SparkSession provides APIs to work on Datasets and DataFrames.
Prior to Spark 2.0:
SparkContext was the entry point for Spark jobs.
RDD was one of the main APIs then, and it was created and manipulated using SparkContext.
For every other API a different context was required; for SQL, SQLContext was required.
SQLContext:
In Spark 1.x, SQLContext (org.apache.spark.sql.SQLContext) is the entry point to SQL for working with structured data (rows and columns); with 2.0, SQLContext has been replaced by SparkSession.
Apache Spark SQLContext is the entry point to Spark SQL, the Spark module for structured data (rows and columns) processing in Spark 1.x, and it can be obtained from the SparkContext.
JavaSparkContext:
JavaSparkContext does the same as above for JavaRDDs, but with a Java implementation.
JavaSparkContext is a Java-friendly version of [[org.apache.spark.SparkContext]] that returns [[org.apache.spark.api.java.JavaRDD]]s and works with Java collections instead of Scala ones.
I'm trying to write a computation in Flink which requires two phases.
In the first phase I start from a text file, and perform some parameter estimation, obtaining as a result a Java object representing a statistical model of the data.
In the second phase, I'd like to use this object to generate data for a simulation.
I'm unsure how to do this. I tried with a LocalCollectionOutputFormat, and it works locally, but when I deploy the job on a cluster, I get a NullPointerException - which is not really surprising.
What is the Flink way of doing this?
Here is my code:
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
GlobalConfiguration.includeConfiguration(configuration);
// Phase 1: read file and estimate model
DataSource<Tuple4<String, String, String, String>> source = env
.readCsvFile(args[0])
.types(String.class, String.class, String.class, String.class);
List<Tuple4<Bayes, Bayes, Bayes, Bayes>> bayesResult = new ArrayList<>();
// Processing here...
....output(new LocalCollectionOutputFormat<>(bayesResult));
env.execute("Bayes");
DataSet<BTP> btp = env
.createInput(new BayesInputFormat(bayesResult.get(0)))
// Phase 2: BayesInputFormat generates data for further calculations
// ....
This is the exception I get:
Error: The program execution failed: java.lang.NullPointerException
at org.apache.flink.api.java.io.LocalCollectionOutputFormat.close(LocalCollectionOutputFormat.java:86)
at org.apache.flink.runtime.operators.DataSinkTask.invoke(DataSinkTask.java:176)
at org.apache.flink.runtime.execution.RuntimeEnvironment.run(RuntimeEnvironment.java:257)
at java.lang.Thread.run(Thread.java:745)
org.apache.flink.client.program.ProgramInvocationException: The program execution failed: java.lang.NullPointerException
at org.apache.flink.api.java.io.LocalCollectionOutputFormat.close(LocalCollectionOutputFormat.java:86)
at org.apache.flink.runtime.operators.DataSinkTask.invoke(DataSinkTask.java:176)
at org.apache.flink.runtime.execution.RuntimeEnvironment.run(RuntimeEnvironment.java:257)
at java.lang.Thread.run(Thread.java:745)
at org.apache.flink.client.program.Client.run(Client.java:328)
at org.apache.flink.client.program.Client.run(Client.java:294)
at org.apache.flink.client.program.Client.run(Client.java:288)
at org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:55)
at it.list.flink.test.Test01.main(Test01.java:62)
...
With the latest release (0.9-milestone-1), a collect() method was added to Flink:
public List<T> collect()
It fetches a DataSet<T> as a List<T> to the driver program. collect() also triggers an immediate execution of the program (there is no need to call ExecutionEnvironment.execute()). Right now, there is a size limitation of about 10 MB for collected data sets.
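Applied to the program in the question, a rough sketch (estimatedModel is only an illustrative name for the DataSet<Tuple4<Bayes, Bayes, Bayes, Bayes>> that the elided processing produces) would be:

// Phase 1: run the estimation and pull the model into the driver.
// collect() triggers execution itself (and declares throws Exception),
// so no env.execute() is needed for this phase.
List<Tuple4<Bayes, Bayes, Bayes, Bayes>> bayesResult = estimatedModel.collect();

// Phase 2: feed the collected model back in through the custom input format.
DataSet<BTP> btp = env.createInput(new BayesInputFormat(bayesResult.get(0)));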
If you do not evaluate the models in the driver program, you can also chain both programs together and emit the model to the side by attaching a data sink. This will be more efficient, because the data won't do the round-trip over the client machine.
If you're using Flink prior to 0.9 you may use the following snippet to collect your dataset to a local collection:
val dataJavaList = new ArrayList[K]
val outputFormat = new LocalCollectionOutputFormat[K](dataJavaList)
dataset.output(outputFormat)
env.execute("collect()")
Where K is the type of object you want to collect