I have a trivially small Spark application written in Java that I am trying to run in a K8s cluster using spark-submit. I built an image with Spark binaries, my uber-JAR file with all necessary dependencies (in /opt/spark/jars/my.jar), and a config file (in /opt/spark/conf/some.json).
In my code, I start with
SparkSession session = SparkSession.builder()
.appName("myapp")
.config("spark.logConf", "true")
.getOrCreate();
Path someFilePath = FileSystems.getDefault().getPath("/opt/spark/conf/some.json");
String someString = new String(Files.readAllBytes(someFilePath));
and get this exception at readAllBytes from the Spark driver:
java.nio.file.NoSuchFileException: /opt/spark/conf/some.json
If I run my Docker image manually, I can definitely see the file /opt/spark/conf/some.json, as I expect. My Spark job runs as root, so file permissions should not be a problem.
I have been assuming that, since the same Docker image, with the file indeed present, will be used to start the driver (and executors, but I don't even get to that point), the file should be available to my application. Is that not so? Why wouldn't it see the file?
You seem to get this exception from one of your worker nodes, not from the container.
Make sure that you've specified all needed files via the --files option of spark-submit.
spark-submit --master yarn --deploy-mode cluster --files <local file dependencies> ...
https://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management
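If you go the --files route, here is a minimal sketch of how the driver code could then locate the file, assuming some.json is passed via --files at submit time and the SparkSession from the question has already been created (SparkFiles.get resolves the local copy on whichever node the code runs):
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import org.apache.spark.SparkFiles;

// SparkFiles.get returns the local path of a file shipped with --files
// (or SparkContext.addFile) on the node where this code executes.
Path someFilePath = Paths.get(SparkFiles.get("some.json"));
String someString = new String(Files.readAllBytes(someFilePath));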
Affects Version/s: 2.3.2
Component/s: PySpark, Spark Core, Spark Shell
Labels: JSON external-tables hive spark
Environment: hdp 3.1.4
hive-hcatalog-core-3.1.0.3.1.4.0-315.jar & hive-hcatalog-core-3.1.2 (I've tried both)
Description
I created an external Hive table over an HDFS file formatted as JSON strings.
In the Hive shell I can read the data fields of this table using org.apache.hive.hcatalog.data.JsonSerDe, which is packaged in hive-hcatalog-core.jar.
But when I try to read it from Spark (pyspark, spark-shell, or anything else), I just can't.
It gives me the error: Table: Unable to get field from serde: org.apache.hive.hcatalog.data.JsonSerDe
I've copied the jar (hive-hcatalog-core.jar) to $SPARK_HOME/jars and the YARN libs and rerun the job, with no effect, even when using --jars $jar_path/hive-hcatalog-core.jar. Yet when I browse the Spark application's web UI, I can actually see the jar in the environment list.
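For context, a minimal sketch of the kind of read being attempted, assuming Hive support is enabled and hive-hcatalog-core.jar is loadable on both the driver and executor classpath (the table name is a placeholder):
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// enableHiveSupport() makes Spark use the Hive metastore, but the serde class
// still has to be resolvable from hive-hcatalog-core.jar at runtime.
SparkSession spark = SparkSession.builder()
        .appName("read-json-serde-table")
        .enableHiveSupport()
        .getOrCreate();

Dataset<Row> df = spark.sql("SELECT * FROM my_json_table");  // placeholder table name
df.show();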
We have a Spark Java application which reads from a database and publishes messages to Kafka. When we execute the job locally on the Windows command line with the following arguments, it works as expected:
bin/spark-submit --class com.data.ingestion.DataIngestion --jars local:///opt/spark/jars/spark-sql-kafka-0-10_2.11-2.3.0.jar local:///opt/spark/jars/data-ingestion-1.0-SNAPSHOT.jar
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0 --class com.data.ingestion.DataIngestion data-ingestion-1.0-SNAPSHOT.jar
Similarly, when we try to run the command using the k8s master:
bin/spark-submit --master k8s://https://172.16.3.105:8443 --deploy-mode cluster --conf spark.kubernetes.container.image=localhost:5000/spark-example:0.2 --class com.data.ingestion.DataIngestion --jars local:///opt/spark/jars/spark-sql-kafka-0-10_2.11-2.3.0.jar local:///opt/spark/jars/data-ingestion-1.0-SNAPSHOT.jar
it gives the following error:
Exception in thread "main" java.util.ServiceConfigurationError:
org.apache.spark.sql.sources.DataSourceRegister: Provider
org.apache.spark.sql.kafka010.KafkaSourceProvider could not be instantiated
Based on the error, it would indicate that at least one node in the cluster does not have /opt/spark/jars/spark-sql-kafka-0-10_2.11-2.3.0.jar.
I suggest you create an uber jar that includes this Kafka Structured Streaming package, or use --packages rather than local files; in addition, consider setting up a solution like Rook or MinIO to have a shared filesystem within k8s/Spark.
It turned out the Scala version and the Spark Kafka connector version were not aligned.
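A hedged sketch of what an aligned submission could look like on k8s, reusing the coordinates from above: the _2.11 suffix in spark-sql-kafka-0-10_2.11:2.3.0 has to match the Scala build of the Spark image, and the 2.3.0 part has to match the Spark version (adjust both to your actual build):
bin/spark-submit --master k8s://https://172.16.3.105:8443 --deploy-mode cluster --conf spark.kubernetes.container.image=localhost:5000/spark-example:0.2 --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0 --class com.data.ingestion.DataIngestion local:///opt/spark/jars/data-ingestion-1.0-SNAPSHOT.jar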
I have created a Spark Cluster with 3 workers on Kubernetes and a JupyterHub deployment to attach to it so I can run huge queries.
My parquet files are stored into IBM Cloud Object Storage (COS) and when I run a simple code to read from COS, I'm getting the following error:
Could not read footer: java.io.IOException: Could not read footer for file FileStatus{path=file:/path/myfile.parquet/_common_metadata; isDirectory=false; length=413; replication=0; blocksize=0; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false} at parquet.hadoop.ParquetFileReader.readAllFootersInParallel
I have added all the required libraries to the jars directory under SPARK_HOME on the driver.
This is the code I'm using to connect:
# Initial Setup - Once
import os
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

# Create the SparkContext (implicit in the original notebook) and wrap it in a SparkSession
sc = SparkContext.getOrCreate(SparkConf())
spark_session = SparkSession(sc)

credentials_staging_parquet = {
    'bucket_dm': 'mybucket1',
    'bucket_eid': 'bucket2',
    'secret_key': 'XXXXXXXX',
    'iam_url': 'https://iam.ng.bluemix.net/oidc/token',
    'api_key': 'XXXXXXXX',
    'resource_instance_id': 'crn:v1:bluemix:public:cloud-object-storage:global:a/XXXXX:XXXXX::',
    'access_key': 'XXXXX',
    'url': 'https://s3-api.us-geo.objectstorage.softlayer.net'
}

# Stocator/COS connector settings for the 'service' alias used in the cos:// URL below
conf = {
    'fs.cos.service.access.key': credentials_staging_parquet.get('access_key'),
    'fs.cos.service.endpoint': credentials_staging_parquet.get('url'),
    'fs.cos.service.secret.key': credentials_staging_parquet.get('secret_key'),
    'fs.cos.service.iam.endpoint': credentials_staging_parquet.get('iam_url'),
    'fs.cos.service.iam.service.id': credentials_staging_parquet.get('resource_instance_id'),
    'fs.stocator.scheme.list': 'cos',
    'fs.cos.impl': 'com.ibm.stocator.fs.ObjectStoreFileSystem',
    'fs.stocator.cos.impl': 'com.ibm.stocator.fs.cos.COSAPIClient',
    'fs.stocator.cos.scheme': 'cos',
    'fs.cos.client.execution.timeout': '18000000',
    'fs.stocator.glob.bracket.support': 'true'
}

# Push the settings into the underlying Hadoop configuration
hadoop_conf = sc._jsc.hadoopConfiguration()
for key in conf:
    hadoop_conf.set(key, conf.get(key))

parquet_path = 'store/MY_FILE/*'
cos_url = 'cos://{bucket}.service/{parquet_path}'.format(
    bucket=credentials_staging_parquet.get('bucket_eid'),
    parquet_path=parquet_path)

df2 = spark_session.read.parquet(cos_url)
I got a similar error and found this post via Google. I then realized I had a file format mismatch: the saved file was Avro but the reader was ORC. So check that the format the file was saved in and the format your reader expects are aligned.
Found the cause of my issue: the required libraries were not available on all workers in the cluster.
There are 2 ways to fix that:
Make sure you add the dependencies on the spark-submit command so they are distributed to the whole cluster; in this case it should be done in the kernel.json file for JupyterHub, located in /usr/local/share/jupyter/kernels/pyspark/kernel.json (assuming you created that); see the sketch after this list.
OR
Add the dependencies to the jars directory under SPARK_HOME on each worker in the cluster and on the driver (if you didn't do so already).
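For the first option, a rough sketch of what the kernel.json could look like, assuming the PySpark kernel passes its dependencies through PYSPARK_SUBMIT_ARGS (the paths, jar name, and version here are illustrative placeholders):
{
  "display_name": "PySpark",
  "language": "python",
  "argv": ["python", "-m", "ipykernel_launcher", "-f", "{connection_file}"],
  "env": {
    "SPARK_HOME": "/opt/spark",
    "PYSPARK_SUBMIT_ARGS": "--jars /opt/spark/jars/stocator-1.0.24.jar pyspark-shell"
  }
}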
I used the second approach: during my Docker image build I added the libraries, so when I start my cluster all containers already have the required libraries.
Try restarting your system or server; it should work after that.
I faced the same problem. It generally happens when you upgrade your Java version but the Spark libraries still point to the old Java version. Rebooting your server/system resolves the problem.
I am trying to run a Spark job with masterURL=yarn-client, using SparkLauncher 2.10. The Java code is wrapped in a NiFi processor, and NiFi is currently running as root. When I do yarn application -list, I see the Spark job started with USER = root. I want to run it with USER = hive.
Following is my SparkLauncher code.
Process spark = new SparkLauncher()
.setSparkHome(cp.fetchProperty(GlobalConstant.spark_submit_work_dir).toString())
.setAppResource(cp.fetchProperty(GlobalConstant.spark_app_resource))
.setMainClass(cp.fetchProperty(GlobalConstant.spark_main_class))
.addAppArgs(ps.getName())
// .setConf(SparkLauncher.DRIVER_EXTRA_JAVA_OPTIONS,"-Duser.name=hive")
.setConf(SparkLauncher.DRIVER_EXTRA_JAVA_OPTIONS, "-Dlog4j.configuration=file:///opt/eim/log4j_submitgnrfromhdfs.properties")
.setVerbose(true)
.launch();
Do I need to pass the user as a driver extra option? The environment is non-Kerberos.
I read somewhere that I need to pass the user name as a driver extra Java option, but I can't find that post now.
export HADOOP_USER_NAME=hive worked. SparkLauncher has an overload that accepts a Map of environment variables. As for spark.yarn.principal, the environment is non-Kerberos, and as per my reading the principal setting works only with Kerberos. I did the following:
Process spark = new SparkLauncher(getEnvironmentVar(ps.getRunAs()))
.setSparkHome(cp.fetchProperty(GlobalConstant.spark_submit_work_dir).toString())
.setAppResource(cp.fetchProperty(GlobalConstant.spark_app_resource))
.setMainClass(cp.fetchProperty(GlobalConstant.spark_main_class))
.addAppArgs(ps.getName())
// .setConf(SparkLauncher.DRIVER_EXTRA_JAVA_OPTIONS,"-Duser.name=hive")
.setConf(SparkLauncher.DRIVER_EXTRA_JAVA_OPTIONS, "-Dlog4j.configuration=file:///opt/eim/log4j_submitgnrfromhdfs.properties")
.setVerbose(true)
.launch();
Instead of new SparkLauncher(), I used SparkLauncher(java.util.Map<String,String> env) and added or replaced HADOOP_USER_NAME=hive.
Checked with yarn application -list; it now launches as intended with USER=hive.
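A minimal sketch of the same idea with the environment map built inline (the Spark home, app resource, and main class below are illustrative placeholders):
import java.util.HashMap;
import java.util.Map;
import org.apache.spark.launcher.SparkLauncher;

// Environment passed to the spark-submit child process; in a non-Kerberos
// setup HADOOP_USER_NAME controls which user YARN reports.
Map<String, String> env = new HashMap<>();
env.put("HADOOP_USER_NAME", "hive");

Process spark = new SparkLauncher(env)
        .setSparkHome("/opt/spark")                 // illustrative path
        .setAppResource("/opt/eim/my-app.jar")      // illustrative path
        .setMainClass("com.example.DataJob")        // illustrative class
        .setVerbose(true)
        .launch();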
Has anyone run into problems running the HelloWorld Twill example? My Application gets accepted but then transitions to the "FAILED" state.
Yarn application HelloWorldRunnable application_1406337868863_0013 completed with status FAILED
The YARN Web UI shows this as the error:
Application application_1406337868863_0013 failed 2 times due to AM Container for appattempt_1406337868863_0013_000002 exited with exitCode: -1000 due to: File file:/twill/HelloWorldRunnable/2ba08d9f-ca23-4363-a7be-426b93c88de2/appMaster.775a1137-6134-46e2-b270-fc466ce7fe91.jar does not exist
.Failing this attempt.. Failing the application.
Does YARN expect to find this jar on HDFS at the location above? It seems like the jar gets copied to my local FS at the location specified above but not to HDFS.
Looks like you don't have the Hadoop conf directory (e.g. /etc/hadoop/conf) in the classpath, so the local file system (file:/twill/...) is used instead of HDFS.
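A hedged sketch of that kind of fix, assuming the example is launched with java and an explicit classpath (the jar and class names are placeholders):
# Put the Hadoop configuration directory on the classpath so that fs.defaultFS
# from core-site.xml resolves to HDFS rather than the local file system.
export CLASSPATH="/etc/hadoop/conf:$CLASSPATH"
java -cp "$CLASSPATH:helloworld-example.jar" HelloWorldMain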