Failing to read an HBase table with Java Spark 3.2.3

I have a Java Spark application in which I need to read all the row keys from an HBase table.
Until now I worked with Spark 2.4.7, and we recently migrated to Spark 3.2.3. I use newAPIHadoopRDD, but since the Spark version upgrade HBase returns an empty result, with no errors or warnings.
My function looks as follows:
// Scan only the row keys: FirstKeyOnlyFilter + KeyOnlyFilter avoid fetching cell values
Configuration config = dataContext.getConfig();
Scan scan = new Scan();
FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL,
        new FirstKeyOnlyFilter(), new KeyOnlyFilter());
scan.setFilter(filters);
config.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan));

// Target schema: a single string column holding the row key
StructField field = DataTypes.createStructField("id", DataTypes.StringType, true);
StructType schema = DataTypes.createStructType(Arrays.asList(field));

// Read (rowKey, Result) pairs through the Hadoop input format and keep only the keys
JavaRDD<Tuple2<ImmutableBytesWritable, Result>> keyToEmptyValues =
        sparkSession.sparkContext().newAPIHadoopRDD(config, TableInputFormat.class,
                ImmutableBytesWritable.class, Result.class).toJavaRDD();
JavaRDD<ImmutableBytesWritable> keys = JavaPairRDD.fromJavaRDD(keyToEmptyValues).keys();
JavaRDD<Row> nonPrefixKeys = removeKeyPrefixFromRDD(keys);
Dataset<Row> keysDataset = sparkSession.createDataFrame(nonPrefixKeys, schema);
I read the following post: Spark 3.2.1 fetch HBase data not working with NewAPIHadoopRDD
I added the flag mentioned there, --conf "spark.hadoopRDD.ignoreEmptySplits=false", but the results are still empty.
I'm using Spark 3.2.3 and HBase 2.2.4.
My spark-submit command looks as follows:
spark-submit --name "enrich" --master local[*] --class ingest.main.Main \
--jars /opt/bitnami/spark/externalJar/enrich-1.0-SNAPSHOT.jar \
--conf "spark.driver.extraClassPath=/opt/bitnami/spark/externalJar/enrich-1.0-SNAPSHOT.jar" \
--conf "spark.hadoopRDD.ignoreEmptySplits=false" \
--conf "spark.executor.extraClassPath=/opt/bitnami/spark/externalJar/enrich-1.0-SNAPSHOT.jar" \
--driver-java-options "-Dlog4j.configuration=file:/opt/bitnami/spark/log4j.properties" \
--jars /opt/bitnami/spark/externalJar/enrich-1.0-SNAPSHOT.jar \
--conf 'spark.driver.extraJavaOptions=-Dexecute=Enrich -Denrich.files.path="/hfiles/10/enrichments" -Dhfiles.path="/hfiles/10/enrichments"' \
ingest-1.jar
Why do I get nothing from HBase? (The table is not empty; I ran a scan command in the HBase shell.)
Is there any better way to retrieve the row keys from an HBase table with Spark 3.2.3?
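For reference, one variant I am considering (a sketch only; "my_table" below is a placeholder, not the real table name) builds the configuration explicitly from HBaseConfiguration and sets the input table by hand:
// Start from HBaseConfiguration so the hbase-site.xml defaults are loaded,
// then overlay the existing job configuration and name the input table.
Configuration config = HBaseConfiguration.create(dataContext.getConfig());
config.set(TableInputFormat.INPUT_TABLE, "my_table"); // placeholder table name
config.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan));

JavaRDD<Tuple2<ImmutableBytesWritable, Result>> rows = sparkSession.sparkContext()
        .newAPIHadoopRDD(config, TableInputFormat.class,
                ImmutableBytesWritable.class, Result.class)
        .toJavaRDD();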

Related

What is --archives for SparkLauncher in Java?

I am going to submit a PySpark task and ship a Python environment along with the task. I need --archives to submit the zip package that contains the full environment.
The working spark-submit command is this:
/my/spark/home/spark-submit \
--master yarn \
--deploy-mode cluster \
--driver-memory 10G \
--executor-memory 8G \
--executor-cores 4 \
--queue rnd \
--num-executors 8 \
--archives /data/me/ld_env.zip#prediction_env \
--conf spark.pyspark.python=./prediction_env/ld_env/bin/python \
--conf spark.pyspark.driver.python=./prediction_env/ld_env/bin/python \
--conf spark.executor.memoryOverhead=4096 \
--py-files dist/mylib-0.1.0-py3-none-any.whl my_task.py
I am trying to start the Spark app programmatically with SparkLauncher:
String pyPath = "my_task.py";
String archives = "/data/me/ld_env.zip#prediction_env";
SparkAppHandle handle = new SparkLauncher()
        .setSparkHome(sparkHome)
        .setAppResource(jarPath)
        .setMaster("yarn")
        .setDeployMode("cluster")
        .setConf(SparkLauncher.EXECUTOR_MEMORY, "8G")
        .setConf(SparkLauncher.EXECUTOR_CORES, "2")
        .setConf("spark.executor.instances", "8")
        .setConf("spark.yarn.queue", "rnd")
        .setConf("spark.pyspark.python", "./prediction_env/ld_env/bin/python")
        .setConf("spark.pyspark.driver.python", "./prediction_env/ld_env/bin/python")
        .setConf("spark.executor.memoryOverhead", "4096")
        .addPyFile(pyPath)
        // .addPyFile(archives)
        // .addFile(archives)
        .addAppArgs("--inputPath",
                inputPath,
                "--outputPath",
                outputPath,
                "--option",
                option)
        .startApplication(taskListener);
I need some way to ship my zip file so that it is unpacked on YARN, but I don't see any method for archives on SparkLauncher.
Use the config spark.yarn.dist.archives, as documented in Running on YARN and the tutorial:
String pyPath = "my_task.py";
String archives = "/data/me/ld_env.zip#prediction_env";
SparkAppHandle handle = new SparkLauncher()
        .setSparkHome(sparkHome)
        .setAppResource(jarPath)
        .setMaster("yarn")
        .setDeployMode("cluster")
        .setConf(SparkLauncher.EXECUTOR_MEMORY, "8G")
        .setConf(SparkLauncher.EXECUTOR_CORES, "2")
        .setConf("spark.executor.instances", "8")
        .setConf("spark.yarn.queue", "rnd")
        .setConf("spark.pyspark.python", "./prediction_env/ld_env/bin/python")
        .setConf("spark.pyspark.driver.python", "./prediction_env/ld_env/bin/python")
        .setConf("spark.executor.memoryOverhead", "4096")
        .setConf("spark.yarn.dist.archives", archives) // ship the zip and unpack it as ./prediction_env
        .addPyFile(pyPath)
        .addAppArgs("--inputPath",
                inputPath,
                "--outputPath",
                outputPath,
                "--option",
                option)
        .startApplication(taskListener);
So, adding .setConf("spark.yarn.dist.archives", archives) fixes the problem.

Partitioning not working in MongoDB Spark read in Java connector

I am trying to read data using the MongoDB Spark connector from a standalone mongod instance, and I want to partition the dataset on a key. The MongoDB Spark documentation mentions various partitioner classes. When I use the MongoSamplePartitioner class it reads into just 1 partition, and the MongoPaginateByCountPartitioner class always partitions into a fixed 66 partitions. This happens even when I configure "samplesPerPartition" and "numberOfPartitions" respectively in the two cases. I need to use a ReadConfig created via a map. My code:
SparkSession sparkSession = SparkSession.builder().appName("sampleRecords")
        .config("spark.driver.memory", "2g")
        .config("spark.driver.host", "127.0.0.1")
        .master("local[4]").getOrCreate();
Map<String, String> readOverrides = new HashMap<>();
readOverrides.put("uri", "mongodb://mongo-root:password@127.0.0.1:27017/importedDb.myNewCollection?authSource=admin");
readOverrides.put("numberOfPartitions", "16");
readOverrides.put("partitioner", "MongoPaginateByCountPartitioner");
ReadConfig readConfig = ReadConfig.create(readOverrides);
JavaSparkContext jsc = JavaSparkContext.fromSparkContext(sparkSession.sparkContext());
Dataset<Row> dataset = MongoSpark.load(jsc, readConfig).toDF();
System.out.println(dataset.count()); // 24576
System.out.println(dataset.rdd().getNumPartitions()); // 66
Using the sample partitioner returns 1 partition every time.
Am I missing something here? Please help.
PS: I am reading 24576 records; mongod version v4.0.10, MongoDB Spark connector 2.3.1, Java 8.
Edit:
I got it to work; the properties need to be given as partitionerOptions.samplesPerPartition in the map. But I am still facing an issue: with partitionerOptions.samplesPerPartition set to "1000", MongoSamplePartitioner still returns only 1 partition. Any suggestions?
The number of partitions can be configured for MongoPaginateByCountPartitioner.
Supposing that we need to configure the target number of partitions to 16...
Please add partitionerOptions.numberOfPartitions -> 16 in the properties rather than only numberOfPartitions -> 16.
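For example, a minimal sketch of the overrides map with the prefixed key, reusing the URI and partitioner from the question (16 is just the target from the example above):
Map<String, String> readOverrides = new HashMap<>();
readOverrides.put("uri", "mongodb://mongo-root:password@127.0.0.1:27017/importedDb.myNewCollection?authSource=admin");
readOverrides.put("partitioner", "MongoPaginateByCountPartitioner");
// Partitioner-specific settings must carry the "partitionerOptions." prefix
readOverrides.put("partitionerOptions.numberOfPartitions", "16");
ReadConfig readConfig = ReadConfig.create(readOverrides);
Dataset<Row> dataset = MongoSpark.load(jsc, readConfig).toDF();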

How to run a spark program in Java in parallel

I have a Java application with Spark Maven dependencies; when it runs, it launches Spark on the host where it is started. The server instance has 36 cores. I am configuring the SparkSession instance with the number of cores and other config properties for parallelism, but when I look at the stats using htop, it doesn't seem to use all the cores, just 1.
SparkSession spark = SparkSession
        .builder()
        .master("local")
        .appName("my-spark")
        .config("spark.driver.memory", "50g")
        .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
        .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
        .config("spark.sql.shuffle.partitions", "400")
        .config("spark.eventLog.enabled", "true")
        .config("spark.eventLog.dir", "/dir1/dir2/logs")
        .config("spark.history.fs.logDirectory", "/dir1/dir2/logs")
        .config("spark.executor.cores", "36")
        .getOrCreate();
I also added the following on the JavaSparkContext:
JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());
sc.hadoopConfiguration().set("fs.s3a.access.key", AWS_KEY);
sc.hadoopConfiguration().set("fs.s3a.secret.key", AWS_SECRET_KEY);
sc.hadoopConfiguration().set("spark.driver.memory","50g");
sc.hadoopConfiguration().set("spark.eventLog.enabled", "true");
sc.hadoopConfiguration().set("spark.eventLog.dir", "/dir1/dir2/logs");
sc.hadoopConfiguration().set("spark.executor.cores", "36");
My task reads data from AWS S3 into a DataFrame and writes the data to another bucket.
Dataset<Row> df = spark.read().format("csv").option("header", "true").load("s3a://bucket/file.csv.gz");
//df = df.repartition(200);
df.withColumn("col_name", df.col("col_name")).sort("col_name", "_id").write().format("iceberg").mode("append").save(location);
.gz files are "unsplittable": to decompress them you have to start at byte 0 and read forwards. As a result, Spark, Hive, MapReduce, etc. give the whole file to a single worker. If you want parallel processing, use a different compression format (e.g. Snappy).
You are running Spark in local mode, so spark.executor.cores will not take effect; consider changing .master("local") to .master("local[*]").
Hope this helps.
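Putting both suggestions together, a minimal sketch (the bucket path and the repartition factor of 200 are placeholders taken from the question; the right factor depends on the data size):
SparkSession spark = SparkSession
        .builder()
        .master("local[*]")   // use all local cores instead of a single one
        .appName("my-spark")
        .getOrCreate();

// The .gz file is unsplittable, so it is loaded into a single partition;
// repartition after the load so downstream work is spread across cores.
Dataset<Row> df = spark.read().format("csv").option("header", "true")
        .load("s3a://bucket/file.csv.gz");
df = df.repartition(200);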

Can't access sqlite db from Spark

I have the following code:
val conf = new SparkConf().setAppName("Spark Test")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val data = sqlContext.read.format("jdbc").options(
  Map(
    "url" -> "jdbc:sqlite:/nv/pricing/ix_tri_pi.sqlite3",
    "dbtable" -> "SELECT security_id FROM ix_tri_pi")).load()
data.foreach {
  row => println(row.getInt(1))
}
And I try to submit it with:
spark-submit \
--class "com.novus.analytics.spark.SparkTest" \
--master "local[4]" \
/Users/smabie/workspace/analytics/analytics-spark/target/scala-2.10/analytics-spark.jar \
--conf spark.executer.extraClassPath=sqlite-jdbc-3.8.7.jar \
--conf spark.driver.extraClassPath=sqlite-jdbc-3.8.7.jar \
--driver-class-path sqlite-jdbc-3.8.7.jar \
--jars sqlite-jdbc-3.8.7.jar
But I get the following exception:
Exception in thread "main" java.sql.SQLException: No suitable driver
I am using Spark version 1.6.1, if that helps.
Thanks!
Try defining your jar as the last parameter of spark-submit.
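A sketch of the reordered command under that suggestion, keeping the paths from the question (spark.executer is also corrected to spark.executor):
spark-submit \
--class "com.novus.analytics.spark.SparkTest" \
--master "local[4]" \
--conf spark.executor.extraClassPath=sqlite-jdbc-3.8.7.jar \
--conf spark.driver.extraClassPath=sqlite-jdbc-3.8.7.jar \
--driver-class-path sqlite-jdbc-3.8.7.jar \
--jars sqlite-jdbc-3.8.7.jar \
/Users/smabie/workspace/analytics/analytics-spark/target/scala-2.10/analytics-spark.jar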
Did you try to specify the driver class explicitly in the options?
options(
  Map(
    "url" -> "jdbc:sqlite:/nv/pricing/ix_tri_pi.sqlite3",
    "driver" -> "org.sqlite.JDBC",
    "dbtable" -> "SELECT security_id FROM ix_tri_pi"))
I had a similar issue trying to load a PostgreSQL table.
Also, a possible cause may be classloading:
The JDBC driver class must be visible to the primordial class loader
on the client session and on all executors. This is because Java’s
DriverManager class does a security check that results in it ignoring
all drivers not visible to the primordial class loader when one goes
to open a connection. One convenient way to do this is to modify
compute_classpath.sh on all worker nodes to include your driver JARs.
http://spark.apache.org/docs/latest/sql-programming-guide.html#troubleshooting

Java - Apache Spark communication

I'm quite new to Spark and was looking for some guidance :-)
What's the typical way in which a Java MVC application communicates with Spark? To simplify things, let's say I want to count the words in a certain file whose name is provided via GET request to my server.
My initial approach was to open the context and implement the transformations/computations in a class inside my MVC application. That means that at runtime I would have to come up with an uber jar of spark-core. The problem is that:
The uber jar weighs 80 MB.
I am facing the same problem (akka.version) as in: apache spark: akka version error by build jar with all dependencies
I could have a go with shade to solve it, but I have the feeling this is not the way to go.
Maybe the "provided" scope in Maven would help me, but I'm using Ant.
Should my application, as suggested in that page, instead consist of one jar with the implementation (devoid of any Spark libraries) and use spark-submit every time I receive a request? I guess it would leave the results somewhere.
Am I missing any middle-of-the-road approach?
Using spark-submit each time is kind of heavyweight, so I'd recommend using a long-running Spark context of some sort. I think the "middle of the road" option that you might be looking for is to have your job use something like the IBM Spark Kernel, Zeppelin, or the Spark Job Server from Ooyala.
It is good practice to use a middleware service deployed on top of Spark which manages its contexts, job failures, Spark versions, and a lot of the other things you have to consider.
I would recommend Mist. It implements Spark as a Service and creates a unified API layer for building enterprise solutions and services on top of a Big Data lake.
Mist supports Scala and Python jobs execution.
The quick start is as follows:
Add the Mist wrapper to your Spark job:
Scala example:
object SimpleContext extends MistJob {
  override def doStuff(context: SparkContext, parameters: Map[String, Any]): Map[String, Any] = {
    val numbers: List[BigInt] = parameters("digits").asInstanceOf[List[BigInt]]
    val rdd = context.parallelize(numbers)
    Map("result" -> rdd.map(x => x * 2).collect())
  }
}
Python example:
import mist

class MyJob:
    def __init__(self, job):
        job.sendResult(self.doStuff(job))

    def doStuff(self, job):
        val = job.parameters.values()
        list = val.head()
        size = list.size()
        pylist = []
        count = 0
        while count < size:
            pylist.append(list.head())
            count = count + 1
            list = list.tail()
        rdd = job.sc.parallelize(pylist)
        result = rdd.map(lambda s: 2 * s).collect()
        return result

if __name__ == "__main__":
    job = MyJob(mist.Job())
Run Mist service:
Build Mist:
git clone https://github.com/hydrospheredata/mist.git
cd mist
./sbt/sbt -DsparkVersion=1.5.2 assembly # change version according to your installed spark
Create a configuration file:
mist.spark.master = "local[*]"
mist.settings.threadNumber = 16
mist.http.on = true
mist.http.host = "0.0.0.0"
mist.http.port = 2003
mist.mqtt.on = false
mist.recovery.on = false
mist.contextDefaults.timeout = 100 days
mist.contextDefaults.disposable = false
mist.contextDefaults.sparkConf = {
spark.default.parallelism = 128
spark.driver.memory = "10g"
spark.scheduler.mode = "FAIR"
}
Run:
spark-submit --class io.hydrosphere.mist.Mist \
--driver-java-options "-Dconfig.file=/path/to/application.conf" \
target/scala-2.10/mist-assembly-0.2.0.jar
Try curl from terminal:
curl --header "Content-Type: application/json" -X POST http://192.168.10.33:2003/jobs --data '{"jarPath":"/vagrant/examples/target/scala-2.10/mist_examples_2.10-0.2.0.jar", "className":"SimpleContext$","parameters":{"digits":[1,2,3,4,5,6,7,8,9,0]}, "external_id":"12345678","name":"foo"}'
