I have a Java application with Spark Maven dependencies; when I run it, it starts Spark on the host it is launched on. The server instance has 36 cores. I am configuring the SparkSession with the number of cores and other config properties, but when I check the stats with htop, the application does not seem to use all the cores, just one.
SparkSession spark = SparkSession
.builder()
.master("local")
.appName("my-spark")
.config("spark.driver.memory","50g")
.config("spark.hadoop.fs.s3a.impl","org.apache.hadoop.fs.s3a.S3AFileSystem")
.config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
.config("spark.sql.shuffle.partitions", "400")
.config("spark.eventLog.enabled", "true")
.config("spark.eventLog.dir", "/dir1/dir2/logs")
.config("spark.history.fs.logDirectory", "/dir1/dir2/logs")
.config("spark.executor.cores", "36")
I also set the following on the JavaSparkContext:
JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());
sc.hadoopConfiguration().set("fs.s3a.access.key", AWS_KEY);
sc.hadoopConfiguration().set("fs.s3a.secret.key", AWS_SECRET_KEY);
sc.hadoopConfiguration().set("spark.driver.memory","50g");
sc.hadoopConfiguration().set("spark.eventLog.enabled", "true");
sc.hadoopConfiguration().set("spark.eventLog.dir", "/dir1/dir2/logs");
sc.hadoopConfiguration().set("spark.executor.cores", "36");
My task reads data from AWS S3 into a DataFrame and writes it to another bucket.
Dataset<Row> df = spark.read().format("csv").option("header", "true").load("s3a://bucket/file.csv.gz");
//df = df.repartition(200);
df.withColumn("col_name", df.col("col_name")).sort("col_name", "_id").write().format("iceberg").mode("append").save(location);
.gz files are "unsplittable": to decompress them you have to start at byte 0 and read forward. As a result, Spark, Hive, MapReduce, etc. hand the whole file to a single worker. If you want parallel processing, use a different compression format (e.g. Snappy).
You are also running Spark in local mode, so spark.executor.cores will not take effect; consider changing .master("local") to .master("local[*]").
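A minimal sketch of the corrected setup, reusing the paths and partition count from the question (illustrative, not your complete config):

SparkSession spark = SparkSession
    .builder()
    .master("local[*]")   // "local" runs a single worker thread; local[*] uses all available cores
    .appName("my-spark")
    .config("spark.sql.shuffle.partitions", "400")
    .getOrCreate();

// A gzipped CSV is read by a single task; repartitioning right after the load
// lets the subsequent sort and write run in parallel.
Dataset<Row> df = spark.read().format("csv")
    .option("header", "true")
    .load("s3a://bucket/file.csv.gz")
    .repartition(200);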
Hope this helps
Currently I am running into an issue and do not understand why it is happening. I have implemented a Java function that uses the Databricks Autoloader to readStream all Parquet files from an Azure Blob Storage container and "write" them into a DataFrame (a Dataset<Row>, since it is written in Java). The code is packaged as a JAR that I build in Java and run as a job on a shared cluster.
Code:
Dataset<Row> newdata= spark.readStream().format("cloudFiles")
.option("cloudFiles.subscriptionId", storagesubscriptionid)
.option("cloudFiles.format", "parquet")
.option("cloudFiles.tenantId", sptenantid)
.option("cloudFiles.clientId", spappid)
.option("cloudFiles.clientSecret", spsecret)
.option("cloudFiles.resourceGroup", storageresourcegroup)
.option("cloudFiles.connectionString", storagesasconnectionstring)
// .option("cloudFiles.useNotifications", "true")
.schema(dfsample.schema())
.option("cloudFiles.includeExistingFiles", "true")
.load(filePath);
newdata.show();
But unfortunately I get the following exception:
WARN SQLExecution: Error executing delta metering
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();
cloudFiles
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.throwError(UnsupportedOperationChecker.scala:447)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.$anonfun$checkForBatch$1(UnsupportedOperationChecker.scala:38)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.$anonfun$checkForBatch$1$adapted(UnsupportedOperationChecker.scala:36)
What puzzles me is that exactly the same code runs fine inside a Databricks notebook written in Scala:
val df1 = spark.readStream.format("cloudFiles").option("cloudFiles.useNotifications", "true").option("cloudFiles.subscriptionId", storagesubscriptionid)
.option("cloudFiles.format", "parquet")
.option("cloudFiles.tenantId", sptenantid)
.option("cloudFiles.clientId", spappid)
.option("cloudFiles.clientSecret", spsecret)
.option("cloudFiles.resourceGroup", storageresourcegroup)
.option("cloudFiles.connectionString", storagesasconnectionstring)
.option("cloudFiles.useNotifications", "true")
.option("cloudFiles.subscriptionId", storagesubscriptionid).schema(df_schema).option("cloudFiles.includeExistingFiles", "false").load(filePath);
display(df1);
I expect a Dataset object containing all the new data from the Blob Storage Parquet files, with the schema id1:int, id2:int, content:binary.
So finally, I have found a way to get the Autoloader working inside my Java JAR.
As Vincent already commented, you have to combine readStream with a writeStream.
So I simply write the files detected by the Autoloader to an Azure Data Lake.
spark.readStream().format("cloudFiles")
.option("cloudFiles.subscriptionId", STORAGE_SUBSCRIPTION_ID)
.option("cloudFiles.format", "parquet")
.option("cloudFiles.tenantId", SP_TENANT_ID)
.option("cloudFiles.clientId", SP_APPLICATION_ID)
.option("cloudFiles.clientSecret", SP_CLIENT_SECRET)
.option("cloudFiles.resourceGroup", STORAGE_RESOURCE_GROUP)
.option("cloudFiles.connectionString", STORAGE_SAS_CONNECTION_STRING)
.option("cloudFiles.includeExistingFiles", "true")
.option("cloudFiles.useNotifications", "true")
.schema(DF_SCHEMA)
.load(BLOB_STORAGE_LANDING_ZONE_PATH)
.writeStream()
.format("delta")
.option("checkpointLocation", DELTA_TABLE_RAW_DATA_CHECKPOINT_PATH)
.option("mergeSchema", "true")
.trigger(Trigger.Once())
.outputMode("append")
.start(DELTA_TABLE_RAW_DATA_PATH).awaitTermination();
This works fine in Java when you need to run a JAR as a Databricks job.
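If you also need the ingested rows back as a regular Dataset inside the same job (for example for further batch transformations), one option, sketched here under the assumption that the same path constant is used, is to read the Delta output back in batch mode once the Trigger.Once() stream has finished:

// Read back what the Autoloader stream just appended, as a plain batch Dataset
Dataset<Row> ingested = spark.read()
    .format("delta")
    .load(DELTA_TABLE_RAW_DATA_PATH);
ingested.show();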
But to be honest, I am still wondering why I don't have to use writeStream in Scala inside a notebook to receive new files from the Autoloader.
I am trying to read data using the MongoDB Spark connector and want to partition the dataset on a key, reading from a standalone mongod instance. The MongoDB Spark documentation mentions various partitioner classes. I tried the MongoSamplePartitioner class, but it reads into just 1 partition. The MongoPaginateByCountPartitioner class likewise always produces a fixed 66 partitions. This happens even though I am configuring "samplesPerPartition" and "numberOfPartitions" in the two cases respectively. I need to create the ReadConfig from a map. My code:
SparkSession sparkSession = SparkSession.builder().appName("sampleRecords")
.config("spark.driver.host", "2g")
.config("spark.driver.host", "127.0.0.1")
.master("local[4]").getOrCreate();
Map<String, String> readOverrides = new HashMap<>();
readOverrides.put("uri", "mongodb://mongo-root:password#127.0.0.1:27017/importedDb.myNewCollection?authSource=admin");
readOverrides.put("numberOfPartitions", "16");
readOverrides.put("partitioner", "MongoPaginateByCountPartitioner");
ReadConfig readConfig = ReadConfig.create(readOverrides);
JavaSparkContext jsc = JavaSparkContext.fromSparkContext(sparkSession.sparkContext());
Dataset<Row> dataset = MongoSpark.load(jsc, readConfig).toDF();
System.out.println(dataset.count()); //24576
System.out.println(dataset.rdd().getNumPartitions()); //66
Using the sample partitioner returns 1 partition every time.
Am I missing something here? Please help.
PS - I am reading 24576 records; mongod version 4.0.10, Mongo Spark connector 2.3.1, Java 8.
Edit:
I got it to work; the properties need to be given in the map as partitionerOptions.samplesPerPartition and so on. But I am still facing an issue: with partitionerOptions.samplesPerPartition set to "1000", MongoSamplePartitioner still returns only 1 partition. Any suggestions?
The number of partitions can be configured for MongoPaginateByCountPartitioner.
Suppose we need to configure the target number of partitions to 16...
Add partitionerOptions.numberOfPartitions -> 16 to the properties rather than only numberOfPartitions -> 16.
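For example, a sketch of the read overrides with the prefixed keys (connection details are placeholders):

Map<String, String> readOverrides = new HashMap<>();
readOverrides.put("uri", "mongodb://user:password@127.0.0.1:27017/importedDb.myNewCollection?authSource=admin");
readOverrides.put("partitioner", "MongoPaginateByCountPartitioner");
// partitioner-specific settings need the "partitionerOptions." prefix
readOverrides.put("partitionerOptions.numberOfPartitions", "16");
ReadConfig readConfig = ReadConfig.create(readOverrides);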
What is the difference between SparkContext, JavaSparkContext, SQLContext and SparkSession?
Is there any method to convert or create a Context using a SparkSession?
Can I completely replace all the Contexts using one single entry SparkSession?
Are all the functions in SQLContext, SparkContext, and JavaSparkContext also in SparkSession?
Some functions like parallelize have different behaviors in SparkContext and JavaSparkContext. How do they behave in SparkSession?
How can I create the following using a SparkSession?
RDD
JavaRDD
JavaPairRDD
Dataset
Is there a method to transform a JavaPairRDD into a Dataset or a Dataset into a JavaPairRDD?
sparkContext is the Scala entry point, and JavaSparkContext is a Java wrapper around sparkContext.
SQLContext is the entry point of Spark SQL and can be obtained from sparkContext. Prior to 2.x, RDD, DataFrame, and Dataset were three different data abstractions. Since Spark 2.x, all three data abstractions are unified, and SparkSession is the unified entry point of Spark.
An additional note: RDDs are meant for unstructured or strongly typed data, while DataFrames are for structured, loosely typed data.
Is there any method to convert or create a context using SparkSession?
Yes: sparkSession.sparkContext(), and for SQL, sparkSession.sqlContext().
Can I completely replace all the contexts using one single entry point, SparkSession?
Yes, you can get the respective contexts from sparkSession.
Are all the functions in SQLContext, SparkContext, JavaSparkContext, etc. added to SparkSession?
Not directly. You have to get the respective context and make use of it, something like backward compatibility.
How do I use such a function in SparkSession?
Get the respective context and make use of it.
How do I create the following using SparkSession?
RDD: can be created with sparkSession.sparkContext.parallelize(???)
JavaRDD: the same applies, but with the Java implementation
JavaPairRDD: sparkSession.sparkContext.parallelize(???).map(...) (turning your data into key-value pairs is one way)
Dataset: what sparkSession returns is a Dataset if it is structured data.
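Since the question also asks about the Java classes, here is a small self-contained sketch of all four creations from a single SparkSession (class name and values are illustrative):

import java.util.Arrays;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;

public class SessionToRdds {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate();

        // JavaSparkContext is just a wrapper around the session's SparkContext
        JavaSparkContext jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());

        // RDD / JavaRDD via parallelize
        JavaRDD<Integer> javaRdd = jsc.parallelize(Arrays.asList(1, 2, 3, 4));

        // JavaPairRDD by mapping each element to a key-value pair
        JavaPairRDD<String, Integer> pairRdd = javaRdd.mapToPair(i -> new Tuple2<>("key-" + i, i));

        // Dataset created directly by the session from the underlying RDD
        Dataset<Integer> ds = spark.createDataset(javaRdd.rdd(), Encoders.INT());

        System.out.println(pairRdd.count() + " pairs, " + ds.count() + " rows");
        spark.stop();
    }
}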
Explanation from the Spark source code under branch-2.1:
SparkContext:
Main entry point for Spark functionality. A SparkContext represents the connection to a Spark
cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.
Only one SparkContext may be active per JVM. You must stop() the active SparkContext before
creating a new one. This limitation may eventually be removed; see SPARK-2243 for more details.
JavaSparkContext:
A Java-friendly version of [[org.apache.spark.SparkContext]] that returns
[[org.apache.spark.api.java.JavaRDD]]s and works with Java collections instead of Scala ones.
Only one SparkContext may be active per JVM. You must stop() the active SparkContext before
creating a new one. This limitation may eventually be removed; see SPARK-2243 for more details.
SQLContext:
The entry point for working with structured data (rows and columns) in Spark 1.x.
As of Spark 2.0, this is replaced by [[SparkSession]]. However, we are keeping the class
here for backward compatibility.
SparkSession:
The entry point to programming Spark with the Dataset and DataFrame API.
I will talk about Spark version 2.x only.
SparkSession: It's the main entry point of your Spark application. To run any code on Spark, this is the first thing you should create.
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local").appName("Word Count")\
.config("spark.some.config.option", "some-value")\
.getOrCreate()
SparkContext: It's an inner object (property) of SparkSession. It's used to interact with the low-level API; through SparkContext you can create RDDs, accumulators, and broadcast variables.
In most cases you won't need SparkContext. You can get it from SparkSession:
val sc = spark.sparkContext
SparkContext is the class in the Spark API that is the first stage of building a Spark application. Its functionality covers allocating driver memory and the number of executors and cores; in short, it is all about cluster management. SparkContext can be used to create RDDs and shared variables. To access it, we need to create an object of it.
This is how we can create a SparkContext: val sc = new SparkContext()
SparkSession is a new object added in Spark 2.x that replaces SQLContext and HiveContext.
Earlier we had two options: SQLContext, which was the way to do SQL operations on a DataFrame, and HiveContext, which managed Hive connectivity and fetched/inserted data from/to Hive tables.
Since 2.x, we can create a SparkSession for SQL operations on a DataFrame, and if you have any Hive-related work, just call enableHiveSupport(); then you can use the same SparkSession for both DataFrame and Hive-related SQL operations.
This is how we can create a SparkSession for SQL operations on a DataFrame:
val sparksession=SparkSession.builder().getOrCreate();
The second way is to create a SparkSession for SQL operations on a DataFrame as well as Hive operations:
val sparkSession=SparkSession.builder().enableHiveSupport().getOrCreate()
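A rough Java equivalent of the same two-in-one session (assuming the Hive dependencies and a hive-site.xml are on the classpath; the query is just an illustration):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class HiveEnabledSession {
    public static void main(String[] args) {
        // One SparkSession for both DataFrame work and Hive SQL,
        // replacing the old SQLContext + HiveContext pair
        SparkSession spark = SparkSession.builder()
                .appName("hive-enabled-session")
                .enableHiveSupport()
                .getOrCreate();

        Dataset<Row> tables = spark.sql("SHOW TABLES");
        tables.show();
        spark.stop();
    }
}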
Spark Context:
Since Spark 1.x, SparkContext has been the entry point to Spark. It is defined in the org.apache.spark package and is used to programmatically create Spark RDDs, accumulators, and broadcast variables on the cluster. Its object sc is the default variable available in spark-shell, and it can be created programmatically using the SparkContext class.
SparkContext is a client of Spark's execution environment.
SparkContext is the entry point of a Spark execution job.
SparkContext acts as the master of the Spark application.
Hope you will find this Apache SparkContext Examples site useful.
SparkSession:
Since Spark 2.0, SparkSession has been the entry point to Spark for working with RDDs, DataFrames, and Datasets. Prior to 2.0, SparkContext was the entry point. Here I will mainly focus on what SparkSession is, how to create one, and how to use the default 'spark' variable from spark-shell.
From Apache Spark 2.0 onwards, SparkSession is the new entry point for Spark applications.
All the functionality provided by SparkContext is available in SparkSession.
SparkSession provides APIs to work on Datasets and DataFrames.
Prior to Spark 2.0:
SparkContext was the entry point for Spark jobs.
RDD was one of the main APIs then, and it was created and manipulated using SparkContext.
For every other API a different context was required: for SQL, an SQLContext was required.
You can find more real-time examples on Apache SparkSession.
SQLContext:
In Spark 1.x, SQLContext (org.apache.spark.sql.SQLContext) was the entry point to SQL for working with structured data (rows and columns); with 2.0, SQLContext has been replaced by SparkSession.
Apache Spark SQLContext is the entry point to Spark SQL, the Spark module for structured data (rows and columns) processing in Spark 1.x, and it can be obtained from a SparkContext.
JavaSparkContext:
JavaSparkContext does for JavaRDDs what is described above, but in a Java implementation.
JavaSparkContext is the Java-friendly version of [[org.apache.spark.SparkContext]] that returns [[org.apache.spark.api.java.JavaRDD]]s and works with Java collections instead of Scala ones.
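A short Java sketch of the classic, pre-SparkSession way to get one (values are illustrative):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class JavaSparkContextExample {
    public static void main(String[] args) {
        // Build the Java-friendly context directly from a SparkConf
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("jsc-example");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        // parallelize works on plain Java collections and returns a JavaRDD
        JavaRDD<String> lines = jsc.parallelize(Arrays.asList("a", "b", "c"));
        System.out.println(lines.count()); // 3

        jsc.stop();
    }
}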
This is my Java code, in which I am querying data from Hive using Apache Spark SQL.
JavaSparkContext ctx = new JavaSparkContext(new SparkConf().setAppName("LoadData").setMaster("MasterUrl"));
HiveContext sqlContext = new HiveContext(ctx.sc());
List<Row> result = sqlContext.sql("Select * from Tablename").collectAsList();
When I run this code it throws java.lang.OutOfMemoryError: GC overhead limit exceeded. How can I solve this, or how do I increase the memory in the Spark configuration?
If you are using spark-shell to run it, then you can use --driver-memory to bump the memory limit:
spark-shell --driver-memory Xg [other options]
If the executors are having problems, then you can adjust their memory limit with --executor-memory XG.
You can find more info on how exactly to set them in the guides: the submission guide for executor memory, the configuration guide for driver memory.
#Edit: since you are running it from NetBeans, you should be able to pass them as JVM arguments -Dspark.driver.memory=XG and -Dspark.executor.memory=XG. I think it was in Project Properties under Run.
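If you prefer to keep it in code, a hedged sketch of setting those options programmatically before the context is created (values are illustrative; note that when the application is launched from an IDE the driver runs inside that JVM, so the driver heap is ultimately bounded by the -Xmx the JVM was started with):

SparkConf conf = new SparkConf()
        .setAppName("LoadData")
        .setMaster("MasterUrl")
        .set("spark.executor.memory", "4g")
        // only effective if applied before the driver JVM starts
        .set("spark.driver.memory", "4g");
JavaSparkContext ctx = new JavaSparkContext(conf);
HiveContext sqlContext = new HiveContext(ctx.sc());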
Have you found any solution for your issue yet?
Please share it if you have :D
Here is my idea: RDD (and also JavaRDD) has a method toLocalIterator(); the Spark documentation says that
The iterator will consume as much memory as the largest partition in
this RDD.
This means the iterator will consume less memory than a List if the RDD is divided into many partitions. You can try it like this:
Iterator<Row> iter = sqlContext.sql("Select * from Tablename").javaRDD().toLocalIterator();
while (iter.hasNext()){
Row row = iter.next();
//your code here
}
PS: it's just an idea and I haven't tested it yet.
I'm quite new to Spark and was looking for some guidance :-)
What's the typical way in which a Java MVC application communicates with Spark? To simplify things, let's say I want to count the words in a certain file whose name is provided via a GET request to my server.
My initial approach was to open the context and implement the transformations/computations in a class inside my MVC application. That means that at runtime I would have to come up with an uber jar of spark-core. The problems with that are:
The uber jar weighs 80 MB.
I am facing the same problem (akka.version) as in: apache spark: akka version error by build jar with all dependencies
I could have a go with shade to solve it, but I have the feeling this is not the way to go.
Maybe the "provided" scope in Maven would help me, but I'm using Ant.
Should my application, as suggested on that page, instead have one jar with the implementation (devoid of any Spark libraries) and use spark-submit every time I receive a request? I guess it would leave the results somewhere.
Am I missing any middle-of-the-road approach?
Using spark-submit each time is kind of heavyweight; I'd recommend using a long-running SparkContext of some sort. I think the "middle of the road" option that you might be looking for is to have your job use something like the IBM Spark Kernel, Zeppelin, or the Spark Job Server from Ooyala.
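If you don't want to adopt one of those servers right away, here is a minimal sketch of the "long running context" idea inside your own application: build the session once at startup and reuse it for every request (class and method names below are made up for illustration):

import org.apache.spark.sql.SparkSession;

public final class SharedSpark {
    private static volatile SparkSession session;

    private SharedSpark() {}

    // Lazily create one SparkSession for the whole web application and reuse it
    public static SparkSession get() {
        if (session == null) {
            synchronized (SharedSpark.class) {
                if (session == null) {
                    session = SparkSession.builder()
                            .master("local[*]")        // or your cluster master URL
                            .appName("mvc-word-count")
                            .getOrCreate();
                }
            }
        }
        return session;
    }
}

// A request handler could then do something like:
// long words = SharedSpark.get().read().textFile(fileName)
//         .flatMap((String line) -> java.util.Arrays.asList(line.split(" ")).iterator(),
//                  org.apache.spark.sql.Encoders.STRING())
//         .count();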
It is good practice to use a middleware service deployed on top of Spark which manages its contexts, job failures, Spark versions, and a lot of other things to consider.
I would recommend Mist. It implements Spark as a Service and creates a unified API layer for building enterprise solutions and services on top of a Big Data lake.
Mist supports Scala and Python job execution.
The quick start is as follows:
Add the Mist wrapper to your Spark job:
Scala example:
object SimpleContext extends MistJob {
  override def doStuff(context: SparkContext, parameters: Map[String, Any]): Map[String, Any] = {
    val numbers: List[BigInt] = parameters("digits").asInstanceOf[List[BigInt]]
    val rdd = context.parallelize(numbers)
    Map("result" -> rdd.map(x => x * 2).collect())
  }
}
Python example:
import mist
class MyJob:
    def __init__(self, job):
        job.sendResult(self.doStuff(job))

    def doStuff(self, job):
        val = job.parameters.values()
        list = val.head()
        size = list.size()
        pylist = []
        count = 0
        while count < size:
            pylist.append(list.head())
            count = count + 1
            list = list.tail()
        rdd = job.sc.parallelize(pylist)
        result = rdd.map(lambda s: 2 * s).collect()
        return result

if __name__ == "__main__":
    job = MyJob(mist.Job())
Run Mist service:
Build Mist:
git clone https://github.com/hydrospheredata/mist.git
cd mist
./sbt/sbt -DsparkVersion=1.5.2 assembly # change version according to your installed spark
Create a configuration file:
mist.spark.master = "local[*]"
mist.settings.threadNumber = 16
mist.http.on = true
mist.http.host = "0.0.0.0"
mist.http.port = 2003
mist.mqtt.on = false
mist.recovery.on = false
mist.contextDefaults.timeout = 100 days
mist.contextDefaults.disposable = false
mist.contextDefaults.sparkConf = {
spark.default.parallelism = 128
spark.driver.memory = "10g"
spark.scheduler.mode = "FAIR"
}
Run
spark-submit --class io.hydrosphere.mist.Mist \
--driver-java-options "-Dconfig.file=/path/to/application.conf" \
target/scala-2.10/mist-assembly-0.2.0.jar
Try curl from a terminal:
curl --header "Content-Type: application/json" -X POST http://192.168.10.33:2003/jobs --data '{"jarPath":"/vagrant/examples/target/scala-2.10/mist_examples_2.10-0.2.0.jar", "className":"SimpleContext$","parameters":{"digits":[1,2,3,4,5,6,7,8,9,0]}, "external_id":"12345678","name":"foo"}'