Partitioning not working in MongoDB Spark read with the Java connector

I was trying to read data using the MongoDB Spark connector and want to partition the dataset on a key, reading from a standalone mongod instance. The MongoDB Spark documentation mentions various partitioner classes. I tried the MongoSamplePartitioner class, but it reads into just 1 partition. The MongoPaginateByCountPartitioner class likewise partitions into a fixed 66 partitions. This happens even though I am configuring "samplesPerPartition" and "numberOfPartitions" respectively in those two cases. I need to use a ReadConfig created via a map. My code:
SparkSession sparkSession = SparkSession.builder().appName("sampleRecords")
        .config("spark.driver.memory", "2g")
        .config("spark.driver.host", "127.0.0.1")
        .master("local[4]").getOrCreate();

Map<String, String> readOverrides = new HashMap<>();
readOverrides.put("uri", "mongodb://mongo-root:password@127.0.0.1:27017/importedDb.myNewCollection?authSource=admin");
readOverrides.put("numberOfPartitions", "16");
readOverrides.put("partitioner", "MongoPaginateByCountPartitioner");
ReadConfig readConfig = ReadConfig.create(readOverrides);

JavaSparkContext jsc = JavaSparkContext.fromSparkContext(sparkSession.sparkContext());
Dataset<Row> dataset = MongoSpark.load(jsc, readConfig).toDF();
System.out.println(dataset.count());                  // 24576
System.out.println(dataset.rdd().getNumPartitions()); // 66
Using the sample partitioner returns 1 partition every time.
Am I missing something here? Please help.
PS - I am reading 24576 records, mongod version v4.0.10, Mongo Spark connector 2.3.1, Java 8.
Edit:
I got it to work; the properties need to be given with the partitionerOptions. prefix in the map, e.g. partitionerOptions.samplesPerPartition. But I am still facing an issue: with partitionerOptions.samplesPerPartition : "1000", MongoSamplePartitioner still returns only 1 partition. Any suggestions?

The number of partitions can be configured for MongoPaginateByCountPartitioner.
Supposing that we need to configure the target number of partitions to 16, add partitionerOptions.numberOfPartitions -> 16 to the properties rather than only numberOfPartitions -> 16.
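For instance, the overrides map from the question would become something like the sketch below. The MongoSamplePartitioner lines are a hedged suggestion for the edit above: assuming the connector 2.x option names, a collection smaller than the default 64 MB partition size will still come back as a single partition unless partitionerOptions.partitionSizeMB is lowered.

import java.util.HashMap;
import java.util.Map;

import com.mongodb.spark.config.ReadConfig;

Map<String, String> readOverrides = new HashMap<>();
readOverrides.put("uri", "mongodb://mongo-root:password@127.0.0.1:27017/importedDb.myNewCollection?authSource=admin");
readOverrides.put("partitioner", "MongoPaginateByCountPartitioner");
// The partitioner-specific key must carry the "partitionerOptions." prefix.
readOverrides.put("partitionerOptions.numberOfPartitions", "16");

// For MongoSamplePartitioner the equivalent settings would be, e.g.:
// readOverrides.put("partitioner", "MongoSamplePartitioner");
// readOverrides.put("partitionerOptions.samplesPerPartition", "1000");
// readOverrides.put("partitionerOptions.partitionSizeMB", "8"); // lower than the 64 MB default

ReadConfig readConfig = ReadConfig.create(readOverrides);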

Related

Quarkus - KStream and KTable join does not output messages

I am building a project modeled on this project. The key difference is that I want to conditionally output a message using the messages from the joined topics, as opposed to the example project, where an aggregation is performed. I was struggling to use a Serde for the JSON messages, so I have simplified the message structure as follows.
t1 (KStream) - a plain text value.
t2 (KTable) - a plain text value separated by a ;.
t3 (KStream) - a CSV string.
I am publishing messages using kafkacat with the -k option to set a key e.g. k1. The problem I am facing is: I don't see any output in t3.
This is my TopologyProducer.java.
@Produces
public Topology buildTopology() {
    StreamsBuilder builder = new StreamsBuilder();

    ObjectMapperSerde<stream1> stream1Serde = new ObjectMapperSerde<>(stream1.class);
    ObjectMapperSerde<topic1> topic1Serde = new ObjectMapperSerde<>(topic1.class);
    ObjectMapperSerde<output1> output1Serde = new ObjectMapperSerde<>(output1.class);

    // GlobalKTable built from topic t2.
    GlobalKTable<String, topic1> paramTable = builder.globalTable(
            t2,
            Consumed.with(Serdes.String(), topic1Serde));

    builder.stream(t1, Consumed.with(Serdes.String(), stream1Serde))
            // Join against the GlobalKTable, not the raw topic name.
            .join(paramTable,
                    (paramName, paramValue) -> paramName,
                    (paramValue, paramLimits) -> {
                        // Add some logic to return conditionally
                        return new output1("paramName", 0.0, 0.0, true);
                    })
            .to(t3, Produced.with(Serdes.String(), output1Serde));

    return builder.build();
}
}
The Java version I had in my Dockerfile was wrong.
When I inspected the container logs, I saw an error about the Java version used to compile (higher) versus the one used to run (lower). I chose the simpler of the two fixes, i.e. used a more recent version of Java to run the application rather than adjusting the Java version for the local mvn build. The error can be traced to the following instruction, as documented here.
The Dockerfile created by Quarkus by default needs one adjustment for the aggregator application in order to run the Kafka Streams pipeline. To do so, edit the file aggregator/src/main/docker/Dockerfile.jvm and replace the line FROM fabric8/java-alpine-openjdk8-jre with FROM fabric8/java-centos-openjdk8-jdk.
I edited my Dockerfile to use FROM registry.access.redhat.com/ubi8/openjdk-17:1.11 and have the application running.

How to set AvroCoder with KafkaIO and Apache Beam with Java

I'm trying to create a pipeline that streams data from a Kafka topic to Google's BigQuery. Data in the topic is in Avro.
I call the apply function 3 times: once to read from Kafka, once to extract the record, and once to write to BigQuery. Here is the main part of the code:
pipeline
    .apply("Read from Kafka",
        KafkaIO.<byte[], GenericRecord>read()
            .withBootstrapServers(options.getKafkaBrokers().get())
            .withTopics(Utils.getListFromString(options.getKafkaTopics()))
            .withKeyDeserializer(
                ConfluentSchemaRegistryDeserializerProvider.of(
                    options.getSchemaRegistryUrl().get(),
                    options.getSubject().get()))
            .withValueDeserializer(
                ConfluentSchemaRegistryDeserializerProvider.of(
                    options.getSchemaRegistryUrl().get(),
                    options.getSubject().get()))
            .withoutMetadata())
    .apply("Extract GenericRecord",
        MapElements.into(TypeDescriptor.of(GenericRecord.class)).via(KV::getValue))
    .apply("Write data to BQ",
        BigQueryIO.<GenericRecord>write()
            .optimizedWrites()
            .useBeamSchema()
            .useAvroLogicalTypes()
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
            .withSchemaUpdateOptions(ImmutableSet.of(BigQueryIO.Write.SchemaUpdateOption.ALLOW_FIELD_ADDITION))
            // Temporary location to save files in GCS before loading to BQ
            .withCustomGcsTempLocation(options.getGcsTempLocation())
            .withNumFileShards(options.getNumShards().get())
            .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors())
            .withMethod(FILE_LOADS)
            .withTriggeringFrequency(Utils.parseDuration(options.getWindowDuration().get()))
            .to(new TableReference()
                .setProjectId(options.getGcpProjectId().get())
                .setDatasetId(options.getGcpDatasetId().get())
                .setTableId(options.getGcpTableId().get())));
When running, I get the following error:
Exception in thread "main" java.lang.IllegalStateException: Unable to return a default Coder for Extract GenericRecord/Map/ParMultiDo(Anonymous).output [PCollection]. Correct one of the following root causes: No Coder has been manually specified; you may do so using .setCoder().
Inferring a Coder from the CoderRegistry failed: Unable to provide a Coder for org.apache.avro.generic.GenericRecord.
Building a Coder using a registered CoderProvider failed.
How do I set the coder to properly read Avro?
There are at least three approaches to this:
Set the coder inline:
pipeline.apply("Read from Kafka", ....)
.apply("Dropping key", Values.create())
.setCoder(AvroCoder.of(Schema schemaOfGenericRecord))
.apply("Write data to BQ", ....);
Note that the key is dropped because it's unused; with this you won't need MapElements any more.
Register the coder in the pipeline's instance of CoderRegistry:
pipeline.getCoderRegistry().registerCoderForClass(GenericRecord.class, AvroCoder.of(Schema genericSchema));
Get the coder from the schema registry via:
ConfluentSchemaRegistryDeserializerProvider.getCoder(CoderRegistry registry)
https://beam.apache.org/releases/javadoc/2.22.0/org/apache/beam/sdk/io/kafka/ConfluentSchemaRegistryDeserializerProvider.html#getCoder-org.apache.beam.sdk.coders.CoderRegistry-
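As a minimal sketch of the first two approaches: the schema string, the kafkaRead variable, and the records name below are placeholders rather than part of the original code, and kafkaRead stands for the KafkaIO.read()...withoutMetadata() transform from the question.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.transforms.Values;
import org.apache.beam.sdk.values.PCollection;

// Placeholder schema; in practice it would come from the schema registry or an .avsc file.
Schema schema = new Schema.Parser().parse(
    "{\"type\":\"record\",\"name\":\"MyRecord\",\"fields\":[{\"name\":\"id\",\"type\":\"string\"}]}");

// Approach 1: drop the key and set the coder explicitly on the resulting PCollection.
PCollection<GenericRecord> records = pipeline
    .apply("Read from Kafka", kafkaRead)
    .apply("Dropping key", Values.<GenericRecord>create())
    .setCoder(AvroCoder.of(schema));

// Approach 2: register the coder once so inference succeeds for GenericRecord.
// pipeline.getCoderRegistry().registerCoderForClass(GenericRecord.class, AvroCoder.of(schema));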

How to run a spark program in Java in parallel

So I have a Java application that has Spark Maven dependencies, and running it launches a Spark instance on the host where it is run. The server has 36 cores. I am specifying the number of cores and other config properties on the SparkSession instance, but when I look at the stats using htop, it doesn't seem to use all the cores, just 1.
SparkSession spark = SparkSession
        .builder()
        .master("local")
        .appName("my-spark")
        .config("spark.driver.memory", "50g")
        .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
        .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
        .config("spark.sql.shuffle.partitions", "400")
        .config("spark.eventLog.enabled", "true")
        .config("spark.eventLog.dir", "/dir1/dir2/logs")
        .config("spark.history.fs.logDirectory", "/dir1/dir2/logs")
        .config("spark.executor.cores", "36")
        .getOrCreate();
I also added the following on the JavaSparkContext:
JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());
sc.hadoopConfiguration().set("fs.s3a.access.key", AWS_KEY);
sc.hadoopConfiguration().set("fs.s3a.secret.key", AWS_SECRET_KEY);
sc.hadoopConfiguration().set("spark.driver.memory","50g");
sc.hadoopConfiguration().set("spark.eventLog.enabled", "true");
sc.hadoopConfiguration().set("spark.eventLog.dir", "/dir1/dir2/logs");
sc.hadoopConfiguration().set("spark.executor.cores", "36");
My task is reading data from AWS S3 into a DataFrame and writing the data to another bucket.
Dataset<Row> df = spark.read().format("csv").option("header", "true").load("s3a://bucket/file.csv.gz");
//df = df.repartition(200);
df.withColumn("col_name", df.col("col_name")).sort("col_name", "_id").write().format("iceberg").mode("append").save(location);
.gz files are "unsplittable": to decompress them you have to start at byte 0 and read forwards. As a result, Spark, Hive, MapReduce, etc. give the whole file to a single worker. If you want parallel processing, use a different compression format (e.g. snappy).
You are running Spark in local mode, so spark.executor.cores will not take effect; consider changing .master("local") to .master("local[*]"), as in the sketch below.
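A minimal sketch combining both points (local[*] plus an explicit repartition right after reading the unsplittable .gz file), reusing the path, column names, and the location variable from the question; the repartition count of 200 is simply the value that was commented out there, and the S3A credential settings are omitted for brevity.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .master("local[*]")                   // use all cores of the machine in local mode
        .appName("my-spark")
        .config("spark.driver.memory", "50g") // carried over from the question
        .getOrCreate();

// A gzipped CSV arrives as a single partition; repartition right after the read
// so the sort and write below can run on all cores.
Dataset<Row> df = spark.read()
        .format("csv")
        .option("header", "true")
        .load("s3a://bucket/file.csv.gz")
        .repartition(200);

df.sort("col_name", "_id")
        .write()
        .format("iceberg")
        .mode("append")
        .save(location);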
Hope this helps

SparkContext parallelize invocation example in java

I am getting started with Spark and ran into an issue trying to implement the simple example for the map function. The issue is with the definition of 'parallelize' in the new version of Spark. Can someone share an example of how to use it, since the following way gives an error about insufficient arguments?
Spark Version : 2.3.2
Java : 1.8
SparkSession session = SparkSession.builder().appName("Compute Square of Numbers").config("spark.master","local").getOrCreate();
SparkContext context = session.sparkContext();
List<Integer> seqNumList = IntStream.rangeClosed(10, 20).boxed().collect(Collectors.toList());
JavaRDD<Integer> numRDD = context.parallelize(seqNumList, 2);
Compile-time error message: The method expects 3 arguments
I do not get what the 3rd argument should look like. As per the documentation, it's supposed to be
scala.reflect.ClassTag<T>
But how do I even define or use it?
Please do not suggest using JavaSparkContext, as I want to know how to get this approach to work using the generic SparkContext.
Ref : https://spark.apache.org/docs/2.2.1/api/java/org/apache/spark/SparkContext.html#parallelize-scala.collection.Seq-int-scala.reflect.ClassTag-
Here is the code which finally worked for me. It is not the best way to achieve the result, but it was a way for me to explore the API.
SparkSession session = SparkSession.builder().appName("Compute Square of Numbers")
.config("spark.master", "local").getOrCreate();
SparkContext context = session.sparkContext();
List<Integer> seqNumList = IntStream.rangeClosed(10, 20).boxed().collect(Collectors.toList());
RDD<Integer> numRDD = context
.parallelize(JavaConverters.asScalaIteratorConverter(seqNumList.iterator()).asScala()
.toSeq(), 2, scala.reflect.ClassTag$.MODULE$.apply(Integer.class));
numRDD.toJavaRDD().foreach(x -> System.out.println(x));
session.stop();
If you don't want to deal with providing the extra two parameters to SparkContext, you can also use JavaSparkContext.parallelize(), which only needs an input list:
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.rdd.RDD;

JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
final RDD<Integer> rdd = jsc.parallelize(seqNumList).map(num -> {
    // your implementation, e.g. squaring the number
    return num * num;
}).rdd();

Difference between SparkContext, JavaSparkContext, SQLContext, and SparkSession?

What is the difference between SparkContext, JavaSparkContext, SQLContext and SparkSession?
Is there any method to convert or create a Context using a SparkSession?
Can I completely replace all the Contexts using one single entry SparkSession?
Are all the functions in SQLContext, SparkContext, and JavaSparkContext also in SparkSession?
Some functions like parallelize have different behaviors in SparkContext and JavaSparkContext. How do they behave in SparkSession?
How can I create the following using a SparkSession?
RDD
JavaRDD
JavaPairRDD
Dataset
Is there a method to transform a JavaPairRDD into a Dataset or a Dataset into a JavaPairRDD?
SparkContext is the Scala implementation entry point and JavaSparkContext is a Java wrapper around SparkContext.
SQLContext is the entry point of Spark SQL, which can be obtained from a SparkContext. Prior to 2.x.x, RDD, DataFrame, and Dataset were three different data abstractions. Since Spark 2.x.x, all three data abstractions are unified and SparkSession is the unified entry point of Spark.
An additional note: RDDs are meant for unstructured, strongly typed data, while DataFrames are for structured, loosely typed data.
Is there any method to convert or create a Context using a SparkSession?
Yes: sparkSession.sparkContext() and, for SQL, sparkSession.sqlContext().
Can I completely replace all the Contexts using one single entry SparkSession?
Yes, you can get the respective contexts from the SparkSession.
Are all the functions in SQLContext, SparkContext, and JavaSparkContext also added to SparkSession?
Not directly. You have to get the respective context and use it, something like backward compatibility.
How do you use such functions in SparkSession?
Get the respective context and use it.
How do you create the following using a SparkSession? (See the sketch below.)
RDD: can be created from sparkSession.sparkContext.parallelize(???).
JavaRDD: the same applies, but with the Java implementation.
JavaPairRDD: sparkSession.sparkContext.parallelize(???).map(...) (making your data a key-value pair is one way).
Dataset: what sparkSession returns is a Dataset if it is structured data.
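As a hedged sketch of those four points in Java (the sample data, app name, and variable names are placeholders, not taken from the question):

import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;

SparkSession spark = SparkSession.builder()
        .master("local[*]")
        .appName("contexts-demo")   // placeholder app name
        .getOrCreate();

// JavaSparkContext wraps the session's underlying SparkContext.
JavaSparkContext jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());

List<Integer> data = Arrays.asList(1, 2, 3, 4);

// JavaRDD via the Java-friendly context.
JavaRDD<Integer> javaRdd = jsc.parallelize(data);

// The plain Scala RDD, if it is ever needed.
org.apache.spark.rdd.RDD<Integer> rdd = javaRdd.rdd();

// JavaPairRDD by mapping the data to key-value pairs.
JavaPairRDD<Integer, Integer> pairRdd = javaRdd.mapToPair(n -> new Tuple2<>(n, n * n));

// Dataset directly from the session.
Dataset<Integer> ds = spark.createDataset(data, Encoders.INT());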
Explanation from the Spark source code under branch-2.1:
SparkContext:
Main entry point for Spark functionality. A SparkContext represents the connection to a Spark
cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.
Only one SparkContext may be active per JVM. You must stop() the active SparkContext before
creating a new one. This limitation may eventually be removed; see SPARK-2243 for more details.
JavaSparkContext:
A Java-friendly version of [[org.apache.spark.SparkContext]] that returns
[[org.apache.spark.api.java.JavaRDD]]s and works with Java collections instead of Scala ones.
Only one SparkContext may be active per JVM. You must stop() the active SparkContext before
creating a new one. This limitation may eventually be removed; see SPARK-2243 for more details.
SQLContext:
The entry point for working with structured data (rows and columns) in Spark 1.x.
As of Spark 2.0, this is replaced by [[SparkSession]]. However, we are keeping the class
here for backward compatibility.
SparkSession:
The entry point to programming Spark with the Dataset and DataFrame API.
I will talk about Spark version 2.x only.
SparkSession: It's the main entry point of your Spark application. To run any code on Spark, this is the first thing you should create.
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local").appName("Word Count")\
.config("spark.some.config.option", "some-value")\
.getOrCreate()
SparkContext: It's an inner object (property) of SparkSession. It's used to interact with the low-level API; through SparkContext you can create RDDs, accumulators, and broadcast variables.
For most cases you won't need SparkContext. You can get the SparkContext from the SparkSession:
val sc = spark.sparkContext
Spark Context is the class in the Spark API that is the first stage of building a Spark application. Its functionality is to allocate memory in RAM (we call this driver memory) and to allocate the number of executors and cores; in short, it is all about cluster management. A Spark Context can be used to create RDDs and shared variables. To access it, we need to create an object of it.
This is how we can create a Spark Context: var sc = new SparkContext()
Spark Session is a new object added since Spark 2.x which is a replacement for SQL Context and Hive Context.
Earlier we had two options: SQL Context, which is a way to do SQL operations on a DataFrame, and Hive Context, which manages the Hive connectivity and fetches/inserts data from/to Hive tables.
Since 2.x, we can create a SparkSession for SQL operations on a DataFrame, and if you have any Hive-related work, just call the method enableHiveSupport(); then you can use the SparkSession for both DataFrame and Hive-related SQL operations.
This is how we can create a SparkSession for SQL operations on a DataFrame:
val sparkSession = SparkSession.builder().getOrCreate()
The second way is to create a SparkSession for SQL operations on a DataFrame as well as Hive operations:
val sparkSession = SparkSession.builder().enableHiveSupport().getOrCreate()
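Since the surrounding questions are in Java, a rough Java equivalent of the two builders above (app names and master are placeholders) might look like this:

import org.apache.spark.sql.SparkSession;

// Plain session, sufficient for DataFrame / SQL work:
SparkSession session = SparkSession.builder()
        .appName("sql-only")        // placeholder app name
        .master("local[*]")
        .getOrCreate();

// Or, if Hive tables are involved, build the session with Hive support
// enabled instead (enableHiveSupport() must be set before getOrCreate()):
// SparkSession hiveSession = SparkSession.builder()
//         .appName("sql-and-hive")
//         .master("local[*]")
//         .enableHiveSupport()
//         .getOrCreate();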
Spark Context:
Since Spark 1.x, SparkContext has been the entry point to Spark. It is defined in the org.apache.spark package and is used to programmatically create RDDs, accumulators, and broadcast variables on the cluster. Its object sc is the default variable available in spark-shell, and it can also be created programmatically using the SparkContext class.
SparkContext is a client of Spark's execution environment.
SparkContext is the entry point of a Spark execution job.
SparkContext acts as the master of the Spark application.
Hope you will find this Apache SparkContext Examples site useful.
SparkSession:
Since Spark 2.0, SparkSession has been the entry point to Spark for working with RDDs, DataFrames, and Datasets. Prior to 2.0, SparkContext was the entry point. Here, I will mainly focus on explaining what SparkSession is, how to create one, and how to use the default SparkSession 'spark' variable from spark-shell.
From Apache Spark 2.0 onwards, SparkSession is the new entry point for Spark applications.
All the functionality provided by SparkContext is available in the SparkSession.
SparkSession provides APIs to work with Datasets and DataFrames.
Prior to Spark 2.0:
Spark Context was the entry point for Spark jobs.
RDD was one of the main APIs then, and it was created and manipulated using the Spark Context.
For other APIs, different contexts were required; for SQL, a SQL Context was required.
You can find more real-time examples on Apache SparkSession.
SQLContext:
In Spark 1.x, SQLContext (org.apache.spark.sql.SQLContext) is the entry point to SQL for working with structured data (rows and columns); with 2.0, SQLContext has been replaced by SparkSession.
Apache Spark SQLContext is the entry point to Spark SQL, the Spark module for structured data (rows and columns) processing in Spark 1.x.
A SQLContext is the entry point of Spark SQL and can be obtained from a Spark Context.
JavaSparkContext:
JavaSparkContext does the same as above for JavaRDD, but with a Java implementation.
JavaSparkContext is the Java-friendly version of [[org.apache.spark.SparkContext]] that returns [[org.apache.spark.api.java.JavaRDD]]s and works with Java collections instead of Scala ones.
