I am going to submit a PySpark task and ship an environment along with it. I need --archives to submit the zip package that contains the full Python environment. The working spark-submit command is this:
/my/spark/home/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 10G \
  --executor-memory 8G \
  --executor-cores 4 \
  --queue rnd \
  --num-executors 8 \
  --archives /data/me/ld_env.zip#prediction_env \
  --conf spark.pyspark.python=./prediction_env/ld_env/bin/python \
  --conf spark.pyspark.driver.python=./prediction_env/ld_env/bin/python \
  --conf spark.executor.memoryOverhead=4096 \
  --py-files dist/mylib-0.1.0-py3-none-any.whl my_task.py
I am trying to start the Spark app programmatically with SparkLauncher:
String pyPath = "my_task.py";
String archives = "/data/me/ld_env.zip#prediction_env";
SparkAppHandle handle = new SparkLauncher()
.setSparkHome(sparkHome)
.setAppResource(jarPath)
.setMaster("yarn")
.setDeployMode("cluster")
.setConf(SparkLauncher.EXECUTOR_MEMORY, "8G")
.setConf(SparkLauncher.EXECUTOR_CORES, "2")
.setConf("spark.executor.instances", "8")
.setConf("spark.yarn.queue", "rnd")
.setConf("spark.pyspark.python", "./prediction_env/ld_env/bin/python")
.setConf("spark.pyspark.driver.python", "./prediction_env/ld_env/bin/python")
.setConf("spark.executor.memoryOverhead", "4096")
.addPyFile(pyPath)
// .addPyFile(archives)
// .addFile(archives)
.addAppArgs("--inputPath",
inputPath,
"--outputPath",
outputPath,
"--option",
option)
.startApplication(taskListener);
I need somewhere to put my zip file so that it gets unpacked on YARN, but I don't see any archives-related method on SparkLauncher.
Use the config spark.yarn.dist.archives, as documented in Running on YARN and the tutorial:
String pyPath = "my_task.py";
String archives = "/data/me/ld_env.zip#prediction_env";
SparkAppHandle handle = new SparkLauncher()
.setSparkHome(sparkHome)
.setAppResource(jarPath)
.setMaster("yarn")
.setDeployMode("cluster")
.setConf(SparkLauncher.EXECUTOR_MEMORY, "8G")
.setConf(SparkLauncher.EXECUTOR_CORES, "2")
.setConf("spark.executor.instances", "8")
.setConf("spark.yarn.queue", "rnd")
.setConf("spark.pyspark.python", "./prediction_env/ld_env/bin/python")
.setConf("spark.pyspark.driver.python", "./prediction_env/ld_env/bin/python")
.setConf("spark.executor.memoryOverhead", "4096")
.setConf("spark.yarn.dist.archives", archives)
.addPyFile(pyPath)
.addAppArgs("--inputPath",
inputPath,
"--outputPath",
outputPath,
"--option",
option)
.startApplication(taskListener);
So adding .setConf("spark.yarn.dist.archives", archives) fixes the problem.
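As a side note, and as an assumption drawn from the Running on YARN documentation rather than from the question above, spark.yarn.dist.archives accepts a comma-separated list, and each entry may carry a #alias that becomes the directory the archive is unpacked into inside the YARN containers. A minimal sketch, where the second archive is purely hypothetical:
// Sketch only: ld_env.zip#prediction_env comes from the question; models.tar.gz#models is a made-up example.
SparkLauncher launcher = new SparkLauncher()
    .setConf("spark.yarn.dist.archives",
             "/data/me/ld_env.zip#prediction_env,/data/me/models.tar.gz#models");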
Related
I have a Java Spark application, in which I need to read all the row keys from an HBase table.
Until now I worked with Spark 2.4.7, and we migrated to Spark 3.2.3. I use newAPIHadoopRDD, but after the Spark version upgrade HBase returns an empty result, with no errors or warnings.
My function looks as follows:
Configuration config = dataContext.getConfig();
Scan scan = new Scan();
FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL, new FirstKeyOnlyFilter(), new KeyOnlyFilter());
scan.setFilter(filters);
config.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan));
StructField field = DataTypes.createStructField("id", DataTypes.StringType, true);
StructType schema = DataTypes.createStructType(Arrays.asList(field));
JavaRDD<Tuple2<ImmutableBytesWritable, Result>> keyToEmptyValues = sparkSession.sparkContext().newAPIHadoopRDD(config,
TableInputFormat.class, ImmutableBytesWritable.class, Result.class).toJavaRDD();
JavaRDD<ImmutableBytesWritable> keys = JavaPairRDD.fromJavaRDD(keyToEmptyValues).keys();
JavaRDD<Row> nonPrefixKeys = removeKeyPrefixFromRDD(keys);
Dataset<Row> keysDataset = sparkSession.createDataFrame(nonPrefixKeys, schema);
I read the following post: Spark 3.2.1 fetch HBase data not working with NewAPIHadoopRDD
I added the flag mentioned there, --conf "spark.hadoopRDD.ignoreEmptySplits=false", but the results are still empty.
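For reference, the same flag can also be set programmatically when the session is created; this is only a sketch, under the assumption that having it in place before the RDD is built might matter, and it has not been verified to change the outcome here:
// Assumption: setting spark.hadoopRDD.ignoreEmptySplits on the builder instead of (or in addition to) spark-submit.
SparkSession sparkSession = SparkSession.builder()
        .appName("read-hbase-row-keys")  // hypothetical app name
        .config("spark.hadoopRDD.ignoreEmptySplits", "false")
        .getOrCreate();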
I'm using Spark 3.2.3 and HBase 2.2.4.
My spark-submit command looks as follows:
spark-submit --name "enrich" --master local[*] --class ingest.main.Main --jars /opt/bitnami/spark/externalJar/enrich-1.0-SNAPSHOT.jar --conf "spark.driver.extraClassPath=/opt/bitnami/spark/externalJar/enrich-1.0-SNAPSHOT.jar" --conf "spark.hadoopRDD.ignoreEmptySplits=false" --conf "spark.executor.extraClassPath=/opt/bitnami/spark/externalJar/enrich-1.0-SNAPSHOT.jar" --driver-java-options "-Dlog4j.configuration=file:/opt/bitnami/spark/log4j.properties" --jars /opt/bitnami/spark/externalJar/enrich-1.0-SNAPSHOT.jar --conf 'spark.driver.extraJavaOptions=-Dexecute=Enrich -Denrich.files.path="/hfiles/10/enrichments" -Dhfiles.path="/hfiles/10/enrichments"' ingest-1.jar
Why do I get nothing from HBase? (HBase is not empty; I ran a scan command in the HBase shell to verify.)
Is there any better way to retrieve the row keys from an HBase table with Spark 3.2.3?
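For reference, one alternative is a plain HBase client scan of the row keys on the driver, followed by a parallelize; this is only a sketch, it is not a fix for the newAPIHadoopRDD issue, and it is only practical when the key set fits comfortably in driver memory:
// Sketch with the plain HBase client API; "my_table" is a hypothetical table name
// (in the original code the configuration comes from dataContext.getConfig()).
try (Connection connection = ConnectionFactory.createConnection(config);
     Table table = connection.getTable(TableName.valueOf("my_table"))) {
    Scan keyOnlyScan = new Scan().setFilter(new FilterList(FilterList.Operator.MUST_PASS_ALL,
            new FirstKeyOnlyFilter(), new KeyOnlyFilter()));
    List<String> rowKeys = new ArrayList<>();
    for (Result result : table.getScanner(keyOnlyScan)) {
        rowKeys.add(Bytes.toString(result.getRow()));
    }
    // rowKeys can then be parallelized and turned into a Dataset<Row> with the existing schema.
}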
The error occurs while executing
airflow#41166b660d82:~$ spark-submit --master yarn --deploy-mode cluster --keytab keytab_name.keytab --principal --jars keytab_name#REALM --jars /path/to/spark-hive_2.11-2.3.0.jar sranje.py
from an Airflow docker container that is not in the CDH env (not managed by CDH CM). sranje.py is a simple select * from a Hive table.
The app is accepted on CDH YARN and executed twice, each time failing with this error:
...
2020-12-31 10:11:43 INFO StateStoreCoordinatorRef:54 - Registered StateStoreCoordinator endpoint
Traceback (most recent call last):
File "sranje.py", line 21, in <module>
source_df = hiveContext.table(hive_source).na.fill("")
File "/dfs/dn4/yarn/nm/usercache/etladmin/appcache/application_1608187067076_0150/container_e29_1608187067076_0150_02_000001/pyspark.zip/pyspark/sql/context.py", line 366, in table
File "/dfs/dn4/yarn/nm/usercache/etladmin/appcache/application_1608187067076_0150/container_e29_1608187067076_0150_02_000001/pyspark.zip/pyspark/sql/session.py", line 721, in table
File "/dfs/dn4/yarn/nm/usercache/etladmin/appcache/application_1608187067076_0150/container_e29_1608187067076_0150_02_000001/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
File "/dfs/dn4/yarn/nm/usercache/etladmin/appcache/application_1608187067076_0150/container_e29_1608187067076_0150_02_000001/pyspark.zip/pyspark/sql/utils.py", line 79, in deco
pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveExternalCatalog':"
2020-12-31 10:11:43 ERROR ApplicationMaster:70 - User application exited with status 1
2020-12-31 10:11:43 INFO ApplicationMaster:54 - Final app status: FAILED, exitCode: 1, (reason: User application exited with status 1)
...
We assume that some .jars and Java dependencies are missing. Any ideas?
Details
there is a valid Kerberos ticket before executing the spark-submit command
if we omit --jars /path/to/spark-hive_2.11-2.3.0.jar, the Python error is different:
...
pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':"
...
versions of Spark (2.3.0), Hadoop (2.6.0) and Java are the same as on CDH
hive-site.xml, yarn-site.xml, etc. are also provided and valid
this same spark-submit app executes OK from a node inside the CDH cluster
we tried adding additional --jars spark-hive_2.11-2.3.0.jar,spark-core_2.11-2.3.0.jar,spark-sql_2.11-2.3.0.jar,hive-hcatalog-core-2.3.0.jar,spark-hive-thriftserver_2.11-2.3.0.jar
The developers use this code as an example:
# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
from pyspark.context import SparkContext
from pyspark.sql import SparkSession, SQLContext, HiveContext, functions as F
from pyspark.sql.utils import AnalysisException
from datetime import datetime
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)
sqlContext = SQLContext(sc)
hiveContext = HiveContext(sc)
current_date = str(datetime.now().strftime('%Y-%m-%d'))
hive_source = "lnz_ch.lnz_cfg_codebook"
source_df = hiveContext.table(hive_source).na.fill("")
print("Number of records: {}".format(source_df.count()))
print("First 20 rows of the table:")
source_df.show(20)
A different script, same error:
# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.sql import SparkSession
if __name__ == "__main__":
    spark = SparkSession.builder.appName("ZekoTest").enableHiveSupport().getOrCreate()
    data = spark.sql("SELECT * FROM lnz_ch.lnz_cfg_codebook")
    data.show(20)
    spark.stop()
Thank you.
Hive dependencies are resolved with:
downloading hive.tar.gz with the exact version of CDH Hive
creating symlinks from hive/ to spark/
ln -s apache-hive-1.1.0-bin/lib/*.jar spark-2.3.0-bin-without-hadoop/jars/
additional jars downloaded from the Maven repo to spark/jars/
hive-hcatalog-core-2.3.0.jar
slf4j-api-1.7.26.jar
spark-hive_2.11-2.3.0.jar
spark-hive-thriftserver_2.11-2.3.0.jar
refreshing the env vars
HADOOP_CLASSPATH=$(find $HADOOP_HOME -name '*.jar' | xargs echo | tr ' ' ':')
SPARK_DIST_CLASSPATH=$(hadoop classpath)
beeline works, but pyspark throws an error:
2021-01-07 15:02:20 INFO StateStoreCoordinatorRef:54 - Registered StateStoreCoordinator endpoint
Traceback (most recent call last):
File "sranje.py", line 21, in <module>
source_df = hiveContext.table(hive_source).na.fill("")
File "/dfs/dn12/yarn/nm/usercache/etladmin/appcache/application_1608187067076_0207/container_e29_1608187067076_0207_01_000001/pyspark.zip/pyspark/sql/context.py", line 366, in table
File "/dfs/dn12/yarn/nm/usercache/etladmin/appcache/application_1608187067076_0207/container_e29_1608187067076_0207_01_000001/pyspark.zip/pyspark/sql/session.py", line 721, in table
File "/dfs/dn12/yarn/nm/usercache/etladmin/appcache/application_1608187067076_0207/container_e29_1608187067076_0207_01_000001/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
File "/dfs/dn12/yarn/nm/usercache/etladmin/appcache/application_1608187067076_0207/container_e29_1608187067076_0207_01_000001/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/dfs/dn12/yarn/nm/usercache/etladmin/appcache/application_1608187067076_0207/container_e29_1608187067076_0207_01_000001/py4j-0.10.6-src.zip/py4j/protocol.py", line 320, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o31.table.
: java.lang.NoSuchFieldError: METASTORE_CLIENT_SOCKET_LIFETIME
But that's another question. Thank you all.
So I have a Java application with Spark Maven dependencies; running it launches Spark on the host where it is run. The server instance has 36 cores. I specify the number of cores and other config properties on the SparkSession instance, but when I look at the stats using htop, it doesn't seem to use all the cores, just one.
SparkSession spark = SparkSession
        .builder()
        .master("local")
        .appName("my-spark")
        .config("spark.driver.memory", "50g")
        .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
        .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
        .config("spark.sql.shuffle.partitions", "400")
        .config("spark.eventLog.enabled", "true")
        .config("spark.eventLog.dir", "/dir1/dir2/logs")
        .config("spark.history.fs.logDirectory", "/dir1/dir2/logs")
        .config("spark.executor.cores", "36")
        .getOrCreate();
I also added settings via JavaSparkContext:
JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());
sc.hadoopConfiguration().set("fs.s3a.access.key", AWS_KEY);
sc.hadoopConfiguration().set("fs.s3a.secret.key", AWS_SECRET_KEY);
sc.hadoopConfiguration().set("spark.driver.memory","50g");
sc.hadoopConfiguration().set("spark.eventLog.enabled", "true");
sc.hadoopConfiguration().set("spark.eventLog.dir", "/dir1/dir2/logs");
sc.hadoopConfiguration().set("spark.executor.cores", "36");
My task reads data from AWS S3 into a DataFrame and writes data to another bucket.
Dataset<Row> df = spark.read().format("csv").option("header", "true").load("s3a://bucket/file.csv.gz");
//df = df.repartition(200);
df.withColumn("col_name", df.col("col_name")).sort("col_name", "_id").write().format("iceberg").mode("append").save(location);
.gz files are "unsplittable": to decompress them you have to start at byte 0 and read forward. As a result, Spark, Hive, MapReduce, etc. give the whole file to a single worker. If you want parallel processing, use a different compression format (e.g. Snappy).
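If the input has to stay gzipped, one option, echoing the commented-out repartition line in the question (a sketch, not a performance guarantee), is to accept the single-task read and repartition immediately afterwards so the downstream sort and write run in parallel:
// The read itself is still one task, but everything after repartition() can use many.
Dataset<Row> df = spark.read().format("csv").option("header", "true")
        .load("s3a://bucket/file.csv.gz")
        .repartition(200);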
You are running Spark in local mode, so spark.executor.cores will not take effect; consider changing .master("local") to .master("local[*]").
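A minimal sketch of just that change; the rest of the asker's builder configuration is unchanged and omitted here:
SparkSession spark = SparkSession
        .builder()
        .master("local[*]")  // "local" runs on a single core; "local[*]" uses all available cores
        .appName("my-spark")
        .getOrCreate();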
Hope this helps
I have the following code:
val conf = new SparkConf().setAppName("Spark Test")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val data = sqlContext.read.format("jdbc").options(
  Map(
    "url" -> "jdbc:sqlite:/nv/pricing/ix_tri_pi.sqlite3",
    "dbtable" -> "SELECT security_id FROM ix_tri_pi")).load()
data.foreach {
  row => println(row.getInt(1))
}
And I try to submit it with:
spark-submit \
--class "com.novus.analytics.spark.SparkTest" \
--master "local[4]" \
/Users/smabie/workspace/analytics/analytics-spark/target/scala-2.10/analytics-spark.jar \
--conf spark.executer.extraClassPath=sqlite-jdbc-3.8.7.jar \
--conf spark.driver.extraClassPath=sqlite-jdbc-3.8.7.jar \
--driver-class-path sqlite-jdbc-3.8.7.jar \
--jars sqlite-jdbc-3.8.7.jar
But I get the following exception:
Exception in thread "main" java.sql.SQLException: No suitable driver
I am using Spark version 1.6.1, if that helps.
Thanks!
Try defining your jar as the last parameter of spark-submit.
Did you try to specify the driver class explicitly in the options?
options(
  Map(
    "url" -> "jdbc:sqlite:/nv/pricing/ix_tri_pi.sqlite3",
    "driver" -> "org.sqlite.JDBC",
    "dbtable" -> "SELECT security_id FROM ix_tri_pi"))
I had a similar issue trying to load a PostgreSQL table.
Also, a possible cause may be class loading:
The JDBC driver class must be visible to the primordial class loader
on the client session and on all executors. This is because Java’s
DriverManager class does a security check that results in it ignoring
all drivers not visible to the primordial class loader when one goes
to open a connection. One convenient way to do this is to modify
compute_classpath.sh on all worker nodes to include your driver JARs.
http://spark.apache.org/docs/latest/sql-programming-guide.html#troubleshooting
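One quick way to check whether the driver class is actually visible to the JVM running the driver program (a diagnostic sketch of my own, not something from the linked guide) is to load it explicitly before creating the SQLContext:
// Throws ClassNotFoundException if sqlite-jdbc is not on the driver classpath.
Class.forName("org.sqlite.JDBC")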
I'm quite new to Spark and was looking for some guidance :-)
What's the typical way in which a Java MVC application communicates with Spark? To simplify things, let's say I want to count the words in a certain file whose name is provided via GET request to my server.
My initial approach was to open the context and implement the transformations/computations in a class inside my MVC application. That means that at runtime I would have to come up with an uber jar of spark-core. The problems are:
The uber jar weighs 80 MB
I am facing the same problem (akka.version) as in: apache spark: akka version error by build jar with all dependencies
I could have a go with shade to solve it, but I have the feeling this is not the way to go.
Maybe the "provided" scope in Maven would help me, but I'm using Ant.
Should my application, as suggested in the page, already have one jar with the implementation (devoid of any Spark libraries) and use spark-submit every time I receive a request? I guess it would leave the results somewhere.
Am I missing any middle-of-the-road approach?
Using spark-submit each time is kind of heavyweight; I'd recommend using a long-running Spark context of some sort. I think the "middle of the road" option that you might be looking for is to have your job use something like the IBM Spark Kernel, Zeppelin, or the Spark Job Server from Ooyala.
It is good practice to use a middleware service deployed on top of Spark which manages its contexts, job failures, Spark versions and a lot of other concerns.
I would recommend Mist. It implements Spark as a Service and creates a unified API layer for building enterprise solutions and services on top of a Big Data lake.
Mist supports Scala and Python job execution.
The quick start is as follows:
Add the Mist wrapper to your Spark job:
Scala example:
object SimpleContext extends MistJob {
  override def doStuff(context: SparkContext, parameters: Map[String, Any]): Map[String, Any] = {
    val numbers: List[BigInt] = parameters("digits").asInstanceOf[List[BigInt]]
    val rdd = context.parallelize(numbers)
    Map("result" -> rdd.map(x => x * 2).collect())
  }
}
Python example:
import mist
class MyJob:
    def __init__(self, job):
        job.sendResult(self.doStuff(job))

    def doStuff(self, job):
        val = job.parameters.values()
        list = val.head()
        size = list.size()
        pylist = []
        count = 0
        while count < size:
            pylist.append(list.head())
            count = count + 1
            list = list.tail()
        rdd = job.sc.parallelize(pylist)
        result = rdd.map(lambda s: 2 * s).collect()
        return result

if __name__ == "__main__":
    job = MyJob(mist.Job())
Run Mist service:
Build the Mist
git clone https://github.com/hydrospheredata/mist.git
cd mist
./sbt/sbt -DsparkVersion=1.5.2 assembly # change version according to your installed spark
Create a configuration file
mist.spark.master = "local[*]"
mist.settings.threadNumber = 16
mist.http.on = true
mist.http.host = "0.0.0.0"
mist.http.port = 2003
mist.mqtt.on = false
mist.recovery.on = false
mist.contextDefaults.timeout = 100 days
mist.contextDefaults.disposable = false
mist.contextDefaults.sparkConf = {
  spark.default.parallelism = 128
  spark.driver.memory = "10g"
  spark.scheduler.mode = "FAIR"
}
Run
spark-submit --class io.hydrosphere.mist.Mist \
  --driver-java-options "-Dconfig.file=/path/to/application.conf" \
  target/scala-2.10/mist-assembly-0.2.0.jar
Try curl from terminal:
curl --header "Content-Type: application/json" -X POST http://192.168.10.33:2003/jobs --data '{"jarPath":"/vagrant/examples/target/scala-2.10/mist_examples_2.10-0.2.0.jar", "className":"SimpleContext$","parameters":{"digits":[1,2,3,4,5,6,7,8,9,0]}, "external_id":"12345678","name":"foo"}'