Difference between SparkContext, JavaSparkContext, SQLContext, and SparkSession? - java

What is the difference between SparkContext, JavaSparkContext, SQLContext and SparkSession?
Is there any method to convert or create a Context using a SparkSession?
Can I completely replace all the Contexts using one single entry SparkSession?
Are all the functions in SQLContext, SparkContext, and JavaSparkContext also in SparkSession?
Some functions like parallelize have different behaviors in SparkContext and JavaSparkContext. How do they behave in SparkSession?
How can I create the following using a SparkSession?
RDD
JavaRDD
JavaPairRDD
Dataset
Is there a method to transform a JavaPairRDD into a Dataset or a Dataset into a JavaPairRDD?

SparkContext is the Scala implementation entry point and JavaSparkContext is a Java wrapper around SparkContext.
SQLContext is the entry point of Spark SQL and can be obtained from a SparkContext. Prior to 2.x, RDD, DataFrame, and Dataset were three different data abstractions. Since Spark 2.x, all three data abstractions are unified and SparkSession is the unified entry point of Spark.
An additional note: RDDs are meant for unstructured, strongly typed data, while DataFrames are for structured, loosely typed data.
Is there any method to convert or create a Context using SparkSession?
Yes. It is sparkSession.sparkContext(), and for SQL, sparkSession.sqlContext().
Can I completely replace all the Contexts using one single entry SparkSession?
Yes. You can get the respective contexts from the sparkSession.
Are all the functions in SQLContext, SparkContext, JavaSparkContext, etc. added in SparkSession?
Not directly. You have to get the respective context and make use of it, something like backward compatibility.
How to use such functions in SparkSession?
Get the respective context and make use of it.
How to create the following using SparkSession? (See the Java sketch after this list.)
RDD: can be created with sparkSession.sparkContext.parallelize(???)
JavaRDD: the same applies, but through the Java implementation
JavaPairRDD: sparkSession.sparkContext.parallelize(???).map(...) (making your data into key-value pairs here is one way)
Dataset: what sparkSession returns is a Dataset if it is structured data.
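For the Java side of the question, here is a minimal sketch of those four creations starting from a single SparkSession. The app name and sample data are made up for illustration:
import java.util.Arrays;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.rdd.RDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;

SparkSession spark = SparkSession.builder()
        .master("local[*]")
        .appName("session-to-rdds")   // hypothetical app name
        .getOrCreate();

// Java-friendly wrapper around the session's Scala SparkContext
JavaSparkContext jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());

// JavaRDD from a Java collection
JavaRDD<String> javaRdd = jsc.parallelize(Arrays.asList("a", "bb", "ccc"));

// The underlying Scala RDD that backs every JavaRDD
RDD<String> rdd = javaRdd.rdd();

// JavaPairRDD by mapping each element to a key-value pair
JavaPairRDD<String, Integer> pairRdd = javaRdd.mapToPair(s -> new Tuple2<>(s, s.length()));

// Dataset from the same data via an Encoder (spark.read() covers structured sources)
Dataset<String> ds = spark.createDataset(javaRdd.rdd(), Encoders.STRING());
ds.show();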

Explanation from the Spark source code under branch-2.1:
SparkContext:
Main entry point for Spark functionality. A SparkContext represents the connection to a Spark
cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.
Only one SparkContext may be active per JVM. You must stop() the active SparkContext before
creating a new one. This limitation may eventually be removed; see SPARK-2243 for more details.
JavaSparkContext:
A Java-friendly version of [[org.apache.spark.SparkContext]] that returns
[[org.apache.spark.api.java.JavaRDD]]s and works with Java collections instead of Scala ones.
Only one SparkContext may be active per JVM. You must stop() the active SparkContext before
creating a new one. This limitation may eventually be removed; see SPARK-2243 for more details.
SQLContext:
The entry point for working with structured data (rows and columns) in Spark 1.x.
As of Spark 2.0, this is replaced by [[SparkSession]]. However, we are keeping the class
here for backward compatibility.
SparkSession:
The entry point to programming Spark with the Dataset and DataFrame API.

I will talk about Spark version 2.x only.
SparkSession: It's the main entry point of your Spark application. To run any code with Spark, this is the first thing you should create.
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local").appName("Word Count")\
.config("spark.some.config.option", "some-value")\
.getOrCreate()
SparkContext: It's an inner object (property) of SparkSession. It's used to interact with the low-level API; through SparkContext you can create RDDs, accumulators, and broadcast variables.
In most cases you won't need SparkContext directly. You can get SparkContext from SparkSession:
val sc = spark.sparkContext
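Since the question is tagged Java, a rough equivalent of the two snippets above in Java (the config key and value are placeholders carried over from the example) might look like this:
import org.apache.spark.SparkContext;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .master("local")
        .appName("Word Count")
        .config("spark.some.config.option", "some-value")
        .getOrCreate();

// The session exposes its SparkContext as a property
SparkContext sc = spark.sparkContext();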

SparkContext is a class in the Spark API and is the first step in building a Spark application. The SparkContext's job is to allocate memory in RAM (we call this the driver memory) and to allocate the number of executors and cores; in short, it is all about cluster management. SparkContext can be used to create RDDs and shared variables. To access it we need to create an object of it.
This way we can create a SparkContext: val sc = new SparkContext()
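In Java, the same legacy path usually goes through a SparkConf first; a minimal sketch (app name and master are placeholders):
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf().setAppName("my-app").setMaster("local[*]");
SparkContext sc = new SparkContext(conf);
// or, equivalently, the Java-friendly wrapper (only one active context per JVM):
JavaSparkContext jsc = new JavaSparkContext(conf);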
SparkSession is a new object added since Spark 2.x that replaces SQLContext and HiveContext.
Earlier we had two options: SQLContext, which is the way to do SQL operations on a DataFrame, and HiveContext, which manages the Hive connectivity and fetches/inserts data from/to Hive tables.
Since 2.x, we can create a SparkSession for SQL operations on a DataFrame, and if you have any Hive-related work, just call the method enableHiveSupport(); then you can use the SparkSession for both DataFrame and Hive-related SQL operations.
This way we can create a SparkSession for SQL operations on a DataFrame:
val sparkSession = SparkSession.builder().getOrCreate()
The second way is to create a SparkSession for SQL operations on a DataFrame as well as Hive operations:
val sparkSession = SparkSession.builder().enableHiveSupport().getOrCreate()

Spark Context:
Since Spark 1.x, SparkContext has been the entry point to Spark. It is defined in the org.apache.spark package and is used to programmatically create RDDs, accumulators, and broadcast variables on the cluster. Its object sc is the default variable available in spark-shell, and it can be created programmatically using the SparkContext class.
SparkContext is a client of Spark's execution environment.
SparkContext is the entry point of a Spark execution job.
SparkContext acts as the master of the Spark application.
Hope you will find this Apache SparkContext Examples site useful.
SparkSession:
Since Spark 2.0, SparkSession has become the entry point to Spark for working with RDDs, DataFrames, and Datasets. Prior to 2.0, SparkContext was the entry point. Here, I will mainly focus on explaining what SparkSession is, how to create a SparkSession, and how to use the default SparkSession 'spark' variable from spark-shell.
From Apache Spark 2.0 onwards, SparkSession is the new entry point for Spark applications.
All the functionality provided by SparkContext is available through the SparkSession.
SparkSession provides APIs to work with Datasets and DataFrames.
Prior to Spark 2.0:
SparkContext was the entry point for Spark jobs.
RDD was one of the main APIs then, and it was created and manipulated using SparkContext.
For every other API a different context was required; for SQL, SQLContext was required.
You can find more real-time examples on Apache SparkSession.
SQLContext:
In Spark 1.x, SQLContext (org.apache.spark.sql.SQLContext) is the entry point to SQL for working with structured data (rows and columns); with 2.0, SQLContext has been replaced by SparkSession.
Apache Spark SQLContext is the entry point to Spark SQL, a Spark module for structured data (rows and columns) processing in Spark 1.x.
The SQLContext is initialized from a SparkContext; it is the entry point of Spark SQL and can be obtained from the SparkContext.
JavaSparkContext:
JavaSparkContext: for JavaRDDs, the same is done as above, but through the Java implementation.
JavaSparkContext is a Java-friendly version of [[org.apache.spark.SparkContext]] that returns [[org.apache.spark.api.java.JavaRDD]]s and works with Java collections instead of Scala ones.

Related

Azure Databricks running Autoloader Implementation from java Jar throws org.apache.spark.sql.AnalysisException

Currently I am running into an issue but do not understand why this is happening. I have implemented a Java function which uses the Databricks Autoloader to read-stream all Parquet files from an Azure Blob Storage and "write" them into a DataFrame (a Dataset, because it is written in Java). The code is executed from a JAR which I build in Java and run as a job on a shared cluster.
Code:
Dataset<Row> newdata= spark.readStream().format("cloudFiles")
.option("cloudFiles.subscriptionId", storagesubscriptionid)
.option("cloudFiles.format", "parquet")
.option("cloudFiles.tenantId", sptenantid)
.option("cloudFiles.clientId", spappid)
.option("cloudFiles.clientSecret", spsecret)
.option("cloudFiles.resourceGroup", storageresourcegroup)
.option("cloudFiles.connectionString", storagesasconnectionstring)
// .option("cloudFiles.useNotifications", "true")
.schema(dfsample.schema()).option("cloudFiles.includeExistingFiles", "true").load(filePath);
newdata.show();
But unfortunately I get the following exception:
WARN SQLExecution: Error executing delta metering
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();
cloudFiles
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.throwError(UnsupportedOperationChecker.scala:447)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.$anonfun$checkForBatch$1(UnsupportedOperationChecker.scala:38)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.$anonfun$checkForBatch$1$adapted(UnsupportedOperationChecker.scala:36)
What makes me wonder is that exactly the same code runs fine inside a Databricks notebook written in Scala:
val df1 = spark.readStream.format("cloudFiles").option("cloudFiles.useNotifications", "true").option("cloudFiles.subscriptionId", storagesubscriptionid)
.option("cloudFiles.format", "parquet")
.option("cloudFiles.tenantId", sptenantid)
.option("cloudFiles.clientId", spappid)
.option("cloudFiles.clientSecret", spsecret)
.option("cloudFiles.resourceGroup", storageresourcegroup)
.option("cloudFiles.connectionString", storagesasconnectionstring)
.option("cloudFiles.useNotifications", "true")
.option("cloudFiles.subscriptionId", storagesubscriptionid).schema(df_schema).option("cloudFiles.includeExistingFiles", "false").load(filePath);
display(df1);
I expect a Dataset object containing all the new data from the Blob Storage Parquet files with the schema: id1:int, id2:int, content:binary
So finally, I have found a way to get the Autoloader working inside my Java JAR.
As Vincent already commented, you have to combine readStream with a writeStream.
So I am simply writing the files which have been detected by the Autoloader to an Azure Data Lake.
spark.readStream().format("cloudFiles")
.option("cloudFiles.subscriptionId", STORAGE_SUBSCRIPTION_ID)
.option("cloudFiles.format", "parquet")
.option("cloudFiles.tenantId", SP_TENANT_ID)
.option("cloudFiles.clientId", SP_APPLICATION_ID)
.option("cloudFiles.clientSecret", SP_CLIENT_SECRET)
.option("cloudFiles.resourceGroup", STORAGE_RESOURCE_GROUP)
.option("cloudFiles.connectionString", STORAGE_SAS_CONNECTION_STRING)
.option("cloudFiles.includeExistingFiles", "true")
.option("cloudFiles.useNotifications", "true")
.schema(DF_SCHEMA)
.load(BLOB_STORAGE_LANDING_ZONE_PATH)
.writeStream()
.format("delta")
.option("checkpointLocation", DELTA_TABLE_RAW_DATA_CHECKPOINT_PATH)
.option("mergeSchema", "true")
.trigger(Trigger.Once())
.outputMode("append")
.start(DELTA_TABLE_RAW_DATA_PATH).awaitTermination();
This works fine with Java when you need to run a JAR as a Databricks job.
But to be honest, I am still wondering why, from inside a notebook, I don't have to use writeStream in Scala to receive new files from the Autoloader.

Partition not working in mongodb spark read in java connector

I was trying to read data using the MongoDB Spark connector and want to partition the dataset on a key, reading from a standalone mongod instance. I was looking at the MongoDB Spark docs, which mention various partitioner classes. I was trying to use the MongoSamplePartitioner class, but it reads into just 1 partition. The MongoPaginateByCountPartitioner class likewise partitions into a fixed 66 partitions. This happens even when I configure "samplesPerPartition" and "numberOfPartitions" respectively in both cases. I need to use a ReadConfig created via a map. My code:
SparkSession sparkSession = SparkSession.builder().appName("sampleRecords")
        .config("spark.driver.memory", "2g")
        .config("spark.driver.host", "127.0.0.1")
        .master("local[4]").getOrCreate();
Map<String, String> readOverrides = new HashMap<>();
readOverrides.put("uri", "mongodb://mongo-root:password@127.0.0.1:27017/importedDb.myNewCollection?authSource=admin");
readOverrides.put("numberOfPartitions", "16");
readOverrides.put("partitioner", "MongoPaginateByCountPartitioner");
ReadConfig readConfig = ReadConfig.create(readOverrides);
JavaSparkContext jsc = JavaSparkContext.fromSparkContext(sparkSession.sparkContext());
Dataset<Row> dataset = MongoSpark.load(jsc, readConfig).toDF();
System.out.println(dataset.count()); //24576
System.out.println(dataset.rdd().getNumPartitions()); //66
Using the sample partitioner returns 1 partition every time.
Am I missing something here? Please help.
PS: I am reading 24576 records; mongod version v4.0.10, Mongo Spark connector 2.3.1, Java 8.
Edit:
I got it to work; I needed to give the properties like partitionerOptions.samplesPerPartition in the map. But I am still facing an issue: with partitionerOptions.samplesPerPartition: "1000", MongoSamplePartitioner still returns only 1 partition. Any suggestions?
The number of partitions can be configured for MongoPaginateByCountPartitioner.
Supposing that we need to configure the target number of partitions to 16...
Please add partitionerOptions.numberOfPartitions -> 16 to the properties rather than only numberOfPartitions -> 16, as in the sketch below.
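A minimal sketch of the corrected map from the question (the URI and credentials are placeholders):
import java.util.HashMap;
import java.util.Map;
import com.mongodb.spark.config.ReadConfig;

Map<String, String> readOverrides = new HashMap<>();
readOverrides.put("uri", "mongodb://user:password@127.0.0.1:27017/importedDb.myNewCollection?authSource=admin");
readOverrides.put("partitioner", "MongoPaginateByCountPartitioner");
// partitioner-specific settings need the "partitionerOptions." prefix
readOverrides.put("partitionerOptions.numberOfPartitions", "16");
ReadConfig readConfig = ReadConfig.create(readOverrides);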

How to set AvroCoder with KafkaIO and Apache Beam with Java

I'm trying to create a pipeline that streams data from a Kafka topic to Google's BigQuery. The data in the topic is in Avro.
I call the apply function 3 times: once to read from Kafka, once to extract the record, and once to write to BigQuery. Here is the main part of the code:
pipeline
    .apply("Read from Kafka",
        KafkaIO
            .<byte[], GenericRecord>read()
            .withBootstrapServers(options.getKafkaBrokers().get())
            .withTopics(Utils.getListFromString(options.getKafkaTopics()))
            .withKeyDeserializer(
                ConfluentSchemaRegistryDeserializerProvider.of(
                    options.getSchemaRegistryUrl().get(),
                    options.getSubject().get())
            )
            .withValueDeserializer(
                ConfluentSchemaRegistryDeserializerProvider.of(
                    options.getSchemaRegistryUrl().get(),
                    options.getSubject().get()))
            .withoutMetadata()
    )
    .apply("Extract GenericRecord",
        MapElements.into(TypeDescriptor.of(GenericRecord.class)).via(KV::getValue)
    )
    .apply(
        "Write data to BQ",
        BigQueryIO
            .<GenericRecord>write()
            .optimizedWrites()
            .useBeamSchema()
            .useAvroLogicalTypes()
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
            .withSchemaUpdateOptions(ImmutableSet.of(BigQueryIO.Write.SchemaUpdateOption.ALLOW_FIELD_ADDITION))
            // Temporary location to save files in GCS before loading to BQ
            .withCustomGcsTempLocation(options.getGcsTempLocation())
            .withNumFileShards(options.getNumShards().get())
            .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors())
            .withMethod(FILE_LOADS)
            .withTriggeringFrequency(Utils.parseDuration(options.getWindowDuration().get()))
            .to(new TableReference()
                .setProjectId(options.getGcpProjectId().get())
                .setDatasetId(options.getGcpDatasetId().get())
                .setTableId(options.getGcpTableId().get()))
    );
When running, I get the following error:
Exception in thread "main" java.lang.IllegalStateException: Unable to return a default Coder for Extract GenericRecord/Map/ParMultiDo(Anonymous).output [PCollection]. Correct one of the following root causes: No Coder has been manually specified; you may do so using .setCoder().
Inferring a Coder from the CoderRegistry failed: Unable to provide a Coder for org.apache.avro.generic.GenericRecord.
Building a Coder using a registered CoderProvider failed.
How do I set the coder to properly read Avro?
There are at least three approaches to this:
Set the coder inline:
pipeline.apply("Read from Kafka", ....)
.apply("Dropping key", Values.create())
.setCoder(AvroCoder.of(Schema schemaOfGenericRecord))
.apply("Write data to BQ", ....);
Note that the key is dropped because it's unused; with this you won't need MapElements any more.
Register the coder in the pipeline's instance of CoderRegistry:
pipeline.getCoderRegistry().registerCoderForClass(GenericRecord.class, AvroCoder.of(Schema genericSchema));
Get the coder from the schema registry via:
ConfluentSchemaRegistryDeserializerProvider.getCoder(CoderRegistry registry)
https://beam.apache.org/releases/javadoc/2.22.0/org/apache/beam/sdk/io/kafka/ConfluentSchemaRegistryDeserializerProvider.html#getCoder-org.apache.beam.sdk.coders.CoderRegistry-
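A rough sketch of that third option, assuming `records` is the PCollection<GenericRecord> produced by the "Extract GenericRecord" step and `options` carries the same registry URL and subject as above:
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.io.kafka.ConfluentSchemaRegistryDeserializerProvider;

// Reuse the deserializer provider already configured for the Kafka value
ConfluentSchemaRegistryDeserializerProvider<GenericRecord> valueProvider =
        ConfluentSchemaRegistryDeserializerProvider.of(
                options.getSchemaRegistryUrl().get(),
                options.getSubject().get());

// Ask the provider for a Coder and attach it to the extracted records
// ('records' is the assumed PCollection<GenericRecord> from the MapElements step)
Coder<GenericRecord> coder = valueProvider.getCoder(pipeline.getCoderRegistry());
records.setCoder(coder);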

Combiner function in Apache Hadoop with Gora

I have a simple Hadoop, Nutch 2.x, HBase cluster. I have to write an MR job that will compute some statistics. It is a two-step job, i.e., I think I also need a combiner function. In simple Hadoop jobs this is not a big problem, as plenty of guides exist, e.g., this one. But I could not find any option to use a combiner with Gora. My statistics will be added to pages in HBase; that's why I cannot avoid Gora (I think). Following is the code snippet where I expect to add the combiner:
GoraMapper.initMapperJob(job, query, pageStore, Text.class, WebPage.class,
My_Mapper.class, null, true);
job.setNumReduceTasks(1);
// === Reduce ===
DataStore<String, WebPage> hostStore = StorageUtils.createWebStore(
job.getConfiguration(), String.class, WebPage.class);
GoraReducer.initReducerJob(job, hostStore, My_Reducer.class);
I have never used a combiner with Gora, but does this work (or what error does it show)?
GoraReducer.initReducerJob(job, hostStore, My_Reducer.class);
job.setCombinerClass(My_Reducer.class);
Edit: Created an issue at Apache's Jira about the Combiner.

How to run a spark program in Java in parallel

So I have a Java application that has Spark Maven dependencies, and on running it, it launches Spark on the host where it is run. The server instance has 36 cores. I am configuring a SparkSession instance where I specify the number of cores and other config properties for parallelism, but when I look at the stats using htop, it doesn't seem to use all the cores, just 1.
SparkSession spark = SparkSession
        .builder()
        .master("local")
        .appName("my-spark")
        .config("spark.driver.memory", "50g")
        .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
        .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
        .config("spark.sql.shuffle.partitions", "400")
        .config("spark.eventLog.enabled", "true")
        .config("spark.eventLog.dir", "/dir1/dir2/logs")
        .config("spark.history.fs.logDirectory", "/dir1/dir2/logs")
        .config("spark.executor.cores", "36")
        .getOrCreate();
I also set the following on the JavaSparkContext:
JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());
sc.hadoopConfiguration().set("fs.s3a.access.key", AWS_KEY);
sc.hadoopConfiguration().set("fs.s3a.secret.key", AWS_SECRET_KEY);
sc.hadoopConfiguration().set("spark.driver.memory","50g");
sc.hadoopConfiguration().set("spark.eventLog.enabled", "true");
sc.hadoopConfiguration().set("spark.eventLog.dir", "/dir1/dir2/logs");
sc.hadoopConfiguration().set("spark.executor.cores", "36");
My task is reading data from AWS S3 into a DataFrame and writing the data to another bucket.
Dataset<Row> df = spark.read().format("csv").option("header", "true").load("s3a://bucket/file.csv.gz");
//df = df.repartition(200);
df.withColumn("col_name", df.col("col_name")).sort("col_name", "_id").write().format("iceberg").mode("append").save(location);
.gz files are "unsplittable": to decompress them you have to start at byte 0 and read forwards. As a result, Spark, Hive, MapReduce, etc., give the whole file to a single worker. If you want parallel processing, use a different compression format (e.g. snappy).
You are running Spark in local mode, so spark.executor.cores will not take effect; consider changing .master("local") to .master("local[*]").
Hope this helps
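Putting both suggestions together, a minimal sketch (the partition count of 200 is taken from the commented-out line in the question):
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .master("local[*]")   // use all local cores instead of a single one
        .appName("my-spark")
        .getOrCreate();

// The gzipped CSV still lands in one partition because .gz is unsplittable,
// so repartition right after the read to spread the downstream sort/write work.
Dataset<Row> df = spark.read().format("csv").option("header", "true")
        .load("s3a://bucket/file.csv.gz")
        .repartition(200);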
