Spark Job with Kafka on Kubernetes - java

We have a Spark Java application which reads from a database and publishes messages to Kafka. When we execute the job locally on the Windows command line with the following arguments, it works as expected:
bin/spark-submit --class com.data.ingestion.DataIngestion --jars local:///opt/spark/jars/spark-sql-kafka-0-10_2.11-2.3.0.jar local:///opt/spark/jars/data-ingestion-1.0-SNAPSHOT.jar
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0 --class com.data.ingestion.DataIngestion data-ingestion-1.0-SNAPSHOT.jar
Similarly, when we try to run the command using the k8s master:
bin/spark-submit --master k8s://https://172.16.3.105:8443 --deploy-mode cluster --conf spark.kubernetes.container.image=localhost:5000/spark-example:0.2 --class com.data.ingestion.DataIngestion --jars local:///opt/spark/jars/spark-sql-kafka-0-10_2.11-2.3.0.jar local:///opt/spark/jars/data-ingestion-1.0-SNAPSHOT.jar
it gives the following error:
Exception in thread "main" java.util.ServiceConfigurationError:
org.apache.spark.sql.sources.DataSourceRegister: Provider
org.apache.spark.sql.kafka010.KafkaSourceProvider could not be instantiated

Based on the error, it would indicate that at least one node in the cluster does not have /opt/spark/jars/spark-sql-kafka-0-10_2.11-2.3.0.jar.
I suggest you create an uber jar that includes this Kafka Structured Streaming package, or use --packages rather than local files. In addition, set up a solution like Rook or MinIO to provide a shared filesystem within k8s/Spark.
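If it is not obvious which container is missing the jar, a quick probe run with the same image and submit options can confirm whether the Kafka source classes are resolvable. This is only a diagnostic sketch: the first class name is taken from the error above, and the second assumes the transitive kafka-clients dependency that an uber jar or --packages would also bring in.

// Hypothetical diagnostic: submit this with the same image/--jars settings to check
// whether the Kafka data source and its client dependency are on the classpath.
public class KafkaClasspathCheck {
    public static void main(String[] args) {
        String[] probes = {
            "org.apache.spark.sql.kafka010.KafkaSourceProvider", // from spark-sql-kafka-0-10
            "org.apache.kafka.clients.producer.KafkaProducer"    // from kafka-clients (transitive)
        };
        for (String className : probes) {
            try {
                Class.forName(className);
                System.out.println("OK: " + className);
            } catch (Throwable t) {
                System.out.println("MISSING: " + className + " (" + t + ")");
            }
        }
    }
}

If the first class loads but the second does not, the thin spark-sql-kafka jar made it into the image without its transitive dependencies, which is exactly what an uber jar or --packages avoids.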

Seems the Scala version and the Spark Kafka package version were not aligned: the _2.11 suffix in spark-sql-kafka-0-10_2.11 must match the Scala build of Spark, and 2.3.0 must match the Spark version.

Related

Spark in Kubernetes container does not see local file

I have a trivially small Spark application written in Java that I am trying to run in a K8s cluster using spark-submit. I built an image with Spark binaries, my uber-JAR file with all necessary dependencies (in /opt/spark/jars/my.jar), and a config file (in /opt/spark/conf/some.json).
In my code, I start with
SparkSession session = SparkSession.builder()
.appName("myapp")
.config("spark.logConf", "true")
.getOrCreate();
Path someFilePath = FileSystems.getDefault().getPath("/opt/spark/conf/some.json");
String someString = new String(Files.readAllBytes(someFilePath));
and get this exception at readAllBytes from the Spark driver:
java.nio.file.NoSuchFileException: /opt/spark/conf/some.json
If I run my Docker image manually I can definitely see the file /opt/spark/conf/some.json as I expect. My Spark job runs as root so file permissions should not be a problem.
I have been assuming that, since the same Docker image, with the file indeed present, will be used to start the driver (and executors, but I don't even get to that point), the file should be available to my application. Is that not so? Why wouldn't it see the file?
You seem to get this exception from one of your worker nodes, not from the container.
Make sure that you've specified all needed files via the --files option of spark-submit.
spark-submit --master yarn --deploy-mode cluster --files <local file dependencies> ...
https://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management
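Once the file is shipped with --files, it is safer to resolve it through Spark rather than through an absolute path baked into the image. A minimal sketch, assuming the JSON was submitted as --files /local/path/some.json (SparkFiles.get is the standard way to locate files distributed this way, and it must be called after the SparkSession/SparkContext exists):

import org.apache.spark.SparkFiles;
import java.nio.file.Files;
import java.nio.file.Paths;

// Assumes spark-submit was called with: --files /local/path/some.json
// SparkFiles.get resolves the bare file name to wherever Spark copied it
// for this driver or executor, so no image-specific path is needed.
String resolvedPath = SparkFiles.get("some.json");
String someString = new String(Files.readAllBytes(Paths.get(resolvedPath)));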

java.lang.IllegalAccessError Error while reading AWS S3 configuration from Java

I am getting the below error when trying to access S3 from Java with the configuration shown further down.
Exception in thread "main" java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:164)
at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:186)
at org.apache.hadoop.fs.s3a.S3AInstrumentation.<init>(S3AInstrumentation.java:113)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:199)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at MyProgram.GetHiveTableData(MyProgram.java:710)
at MyProgram$1.run(MyProgram.java:674)
at MyProgram$1.run(MyProgram.java:670)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at MyProgram.GetHiveTableDetails(MyProgram.java:670)
at MyProgram.main(MyProgram.java:398)
The failing code line is:
FileSystem hdfs = FileSystem.get(new URI(uriStr), configuration);
uriStr=s3a://sBucketName
Configurations are set as below for S3A:
fs.default.name=fs.defaultFS
fs.defaultFS=s3a://bucketName
sPath: XXXXXX
fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
fs.s3a.access.key=XXXXXX
fs.s3a.secret.key=XXXXXXX
fs.s3a.endpoint=XXXXXXX
hadoop.rpc.protection=privacy
dfs.data.transfer.protection=privacy
hadoop.security.authentication=Kerberos
dfs.namenode.kerberos.principal=hdfs/XXXX#XXXX.XXX.XXXXXX.XXX
yarn.resourcemanager.principal=yarn/XXXX#XXXX.XXX.XXXXXX.XXX
Am I missing anything in the configuration setup?
Please advise.
This problem might occur if the aws-sdk version and the hadoop version are not compatible; you may get more help from "Spark job reading from S3 on Spark cluster gives IllegalAccessError: tried to access method MutableCounterLong" and "java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StorageStatistics".
When I rolled back the hadoop-aws version from 2.8.0 to 2.7.3, the problem was solved.
spark-submit --master local \
--packages org.apache.hadoop:hadoop-aws:2.7.3,\
com.amazonaws:aws-java-sdk-pom:1.10.6,\
org.apache.hadoop:hadoop-common:2.7.3 \
test_s3.py
According to the discussion at https://stackoverflow.com/a/52828978/8025086, it seems proper to use aws-java-sdk 1.7.4. I just tested this simple example with pyspark, and it also works. I am not a Java guy; maybe someone could have a better explanation.
# this one also works, notice that the version of aws-java-sdk is different
spark-submit --master local \
--packages org.apache.hadoop:hadoop-aws:2.7.3,\
com.amazonaws:aws-java-sdk:1.7.4,\
org.apache.hadoop:hadoop-common:2.7.3 \
test_s3.py
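For reference, the call path that fails in the question boils down to something like the following minimal Java sketch (property names are taken from the configuration listed in the question; the bucket, keys, and endpoint are placeholders):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

Configuration configuration = new Configuration();
// Property names as listed in the question; values are placeholders.
configuration.set("fs.defaultFS", "s3a://bucketName");
configuration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
configuration.set("fs.s3a.access.key", "XXXXXX");
configuration.set("fs.s3a.secret.key", "XXXXXX");
configuration.set("fs.s3a.endpoint", "XXXXXX");

// S3AInstrumentation is created inside this call (see the stack trace above),
// so this is where the IllegalAccessError surfaces when hadoop-aws and
// hadoop-common come from different Hadoop releases.
FileSystem fs = FileSystem.get(new URI("s3a://bucketName"), configuration);

Aligning hadoop-aws, hadoop-common, and the AWS SDK on compatible versions, as in the --packages examples above, is what makes this call succeed.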

Apache Beam job hangs up when submitted via spark-submit

I am just trying to execute the Apache Beam example code in a local Spark setup. I generated the source and built the package as mentioned on this page, and submitted the jar using spark-submit as below:
$ ~/spark/bin/spark-submit --class org.apache.beam.examples.WordCount --master local target/word-count-beam-0.1.jar --runner=SparkRunner --inputFile=pom.xml --output=counts
The code gets submitted and starts to execute, but it gets stuck at the step Evaluating ParMultiDo(ExtractWords). Below is the log after submitting the job.
I am not able to find any error message. Can someone please help in finding what's wrong?
Edit: I also tried using the below command:
~/spark/bin/spark-submit --class org.apache.beam.examples.WordCount --master spark://Quartics-MacBook-Pro.local:7077 target/word-count-beam-0.1.jar --runner=SparkRunner --inputFile=pom.xml --output=counts
The job is now stuck at INFO BlockManagerMasterEndpoint: Registering block manager 192.168.0.2:59049 with 366.3 MB RAM, BlockManagerId(0, 192.168.0.2, 59049, None). I attached screenshots of the Spark History and Dashboard below. The dashboard shows the job is running, but there is no progress at all.
This was just a version issue. I was able to run the job on Spark 1.6.3. Thanks to all the people who just downvoted this question without explanations.

spark submit java.lang.NullPointerException error

I am trying to submit my spark-mongo code jar through Spark on Windows. I am using Spark in standalone mode. I have configured a Spark master and two workers on the same machine, and I want to execute my jar with one master and two workers. I am trying to execute the following command:
spark-submit --master spark://localhost:7077 --deploy-mode cluster --executor-memory 5G --class spark.mongohadoop.testing3 G:\sparkmon1.jar
I am facing the following error:
Running Spark using the REST application submission protocol.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/02/28 17:09:13 INFO RestSubmissionClient: Submitting a request to launch an application in spark://192.168.242.1:7077.
17/02/28 17:09:24 WARN RestSubmissionClient: Unable to connect to server spark://192.168.242.1:7077.
Warning: Master endpoint spark://192.168.242.1:7077 was not a REST server. Falling back to legacy submission gateway instead.
17/02/28 17:09:25 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/02/28 17:09:32 ERROR ClientEndpoint: Exception from cluster was: java.lang.NullPointerException
java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:873)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:853)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:474)
at org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:154)
at org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:83
I have already set the winutils path in the environment variables.
Why am I getting this error, and what is the solution?
I encountered the same error on Linux, but in my case it occurred only when the driver was initiated from one particular machine in my cluster; if the request to launch the driver went to any other machine in the cluster, it worked fine. So in my case it seemed to be an environmental issue.
I then checked the code of the org.apache.hadoop.util.Shell$ShellCommandExecutor class and saw that it tries to run a command, but before that it tries to run "bash" on that machine. I observed that my bash was responding slowly, made some changes in .bashrc, and restarted my cluster.
Now it's working fine.

java.lang.ClassNotFoundException: org.openx.data.jsonserde.JsonSerDe

I am trying to query a table from Hive using IPython. Below is what my code looks like.
sqlc = HiveContext(sc)
sqlc.sql("ADD JAR s3://x/y/z/jsonserde.jar")
I first create a new Hive context and then try to add the jar above. Below is the error message I get.
Py4JJavaError: An error occurred while calling o63.sql:
java.lang.ClassNotFoundException: org.openx.data.jsonserde.JsonSerDe
How else do I add this jar to the Spark classpath?
You get that error because you haven't added your library to your SparkContext when you started IPython.
To do so, you'll need to launch your shell as follows:
PYSPARK_DRIVER_PYTHON=ipython bin/pyspark --master local[1] --jars [path/to/jar].jar --driver-class-path [path/to/jar].jar
NB: Specifying --jars alone won't be enough for now, considering SPARK-5185.
