java.lang.ClassNotFoundException: org.openx.data.jsonserde.JsonSerDe - java

I am trying to query a table from Hive using IPython. Below is what my code looks like.
sqlc = HiveContext(sc)
sqlc.sql("ADD JAR s3://x/y/z/jsonserde.jar")
I first create a new Hive context and then try to add the jar above. Below is the error message I get.
Py4JJavaError: An error occurred while calling o63.sql:
java.lang.ClassNotFoundException: org.openx.data.jsonserde.JsonSerDe
How else do I add this jar to Spark classpath?

You get that error because you haven't added your library to the SparkContext when you started IPython.
To do so, you need to start your shell as follows:
PYSPARK_DRIVER_PYTHON=ipython bin/pyspark --master local[1] --jars [path/to/jar].jar --driver-class-path [path/to/jar].jar
NB: Specifying --jars alone isn't enough for now, because of SPARK-5185.
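Since the jar in the question lives on S3, one approach is to copy it to the local filesystem first and then point both flags at the local copy. A rough sketch, assuming the AWS CLI is available and /tmp/jsonserde.jar as the local path:
# copy the SerDe jar locally, then start pyspark with it on both classpaths
aws s3 cp s3://x/y/z/jsonserde.jar /tmp/jsonserde.jar
PYSPARK_DRIVER_PYTHON=ipython bin/pyspark --master local[1] --jars /tmp/jsonserde.jar --driver-class-path /tmp/jsonserde.jar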

Related

Spark Job with Kafka on Kubernetes

We have a Spark Java application which reads from a database and publishes messages to Kafka. When we execute the job locally on the Windows command line with the following arguments, it works as expected:
bin/spark-submit --class com.data.ingestion.DataIngestion --jars local:///opt/spark/jars/spark-sql-kafka-0-10_2.11-2.3.0.jar local:///opt/spark/jars/data-ingestion-1.0-SNAPSHOT.jar
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0 --class com.data.ingestion.DataIngestion data-ingestion-1.0-SNAPSHOT.jar
Similarly, when we try to run the command using the k8s master,
bin/spark-submit --master k8s://https://172.16.3.105:8443 --deploy-mode cluster --conf spark.kubernetes.container.image=localhost:5000/spark-example:0.2 --class com.data.ingestion.DataIngestion --jars local:///opt/spark/jars/spark-sql-kafka-0-10_2.11-2.3.0.jar local:///opt/spark/jars/data-ingestion-1.0-SNAPSHOT.jar
it gives the following error:
Exception in thread "main" java.util.ServiceConfigurationError:
org.apache.spark.sql.sources.DataSourceRegister: Provider
org.apache.spark.sql.kafka010.KafkaSourceProvider could not be instantiated
Based on the error, it would indicate that at least one node in the cluster does not have /opt/spark/jars/spark-sql-kafka-0-10_2.11-2.3.0.jar.
I suggest you create an uber jar that includes this Kafka Structured Streaming package, or use --packages rather than local files; alternatively, set up a solution like Rook or MinIO to provide a shared filesystem within k8s/Spark.
It turned out the Scala version and the Spark Kafka package version were not aligned.
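For reference, a sketch of the --packages route against the k8s master, reusing the values from the question; it assumes the application jar is already baked into the container image at the path shown and that the package's Scala/Spark versions (2.11/2.3.0) match the Spark build inside the image:
bin/spark-submit --master k8s://https://172.16.3.105:8443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=localhost:5000/spark-example:0.2 \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0 \
  --class com.data.ingestion.DataIngestion \
  local:///opt/spark/jars/data-ingestion-1.0-SNAPSHOT.jar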

Java: Class not found for PageRank algorithm in Apache Hadoop

I am trying to run the PageRank algorithm on an Apache Hadoop (2.6.5) cluster (1 master, 2 slaves). I am using the program in this repository - https://github.com/danielepantaleone/hadoop-pagerank.git. I was able to compile all the sources using this command:
sudo javac -classpath ${HADOOP_CLASSPATH} -d ./build src/it/uniroma1/hadoop/pagerank/PageRank.java src/it/uniroma1/hadoop/pagerank/job1/PageRankJob1Mapper.java src/it/uniroma1/hadoop/pagerank/job1/PageRankJob1Reducer.java src/it/uniroma1/hadoop/pagerank/job2/PageRankJob2Mapper.java src/it/uniroma1/hadoop/pagerank/job2/PageRankJob2Reducer.java src/it/uniroma1/hadoop/pagerank/job3/PageRankJob3Mapper.java
I created the jar file using this command: sudo jar -cf build/pagerank.jar build/.
I am trying to run the program just like the wordcount example, like this:
sudo bin/hadoop jar hadoop-pagerank/build/pagerank.jar PageRank --input /usr/local/hdfs/web-Google.txt --output /usr/local/hdfs-out-PR
Sometimes I get an error like this -
Exception in thread "main" java.lang.NoClassDefFoundError: PageRank (wrong name: it/uniroma1/hadoop
/pagerank/PageRank)
and sometimes I get an error like this - Exception in thread "main" java.lang.ClassNotFoundException: PageRank for different types of compilation.
I am not sure what I am doing wrong. Can anyone please walk me through the proper steps to compile and run the program in Hadoop? I don't have a pom.xml file, and I am able to run the provided wordcount example jar.
You have to use the package name before the name of the class; that is, use
it.uniroma1.hadoop.pagerank.PageRank
rather than PageRank
in your command, like this:
hadoop jar hadoop-pagerank/build/pagerank.jar it.uniroma1.hadoop.pagerank.PageRank --input /usr/local/hdfs/web-Google.txt --output /usr/local/hdfs-out-PR
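If the fully qualified name still cannot be loaded, the way the jar was built is also worth checking: jar -cf build/pagerank.jar build/. stores every entry under a build/ prefix, so the classes no longer sit at their package paths inside the jar. A sketch of a build that keeps the package layout, assuming the compiled classes live under build/:
# -C build . adds entries relative to build/, preserving the it/uniroma1/... paths
jar -cf pagerank.jar -C build .
hadoop jar pagerank.jar it.uniroma1.hadoop.pagerank.PageRank --input /usr/local/hdfs/web-Google.txt --output /usr/local/hdfs-out-PR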

java.lang.IllegalAccessError Error while reading AWS S3 configuration from Java

I am getting the error below when trying to read the AWS S3 configuration from Java.
Exception in thread "main" java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:164)
at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:186)
at org.apache.hadoop.fs.s3a.S3AInstrumentation.<init>(S3AInstrumentation.java:113)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:199)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at MyProgram.GetHiveTableData(MyProgram.java:710)
at MyProgram$1.run(MyProgram.java:674)
at MyProgram$1.run(MyProgram.java:670)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at MyProgram.GetHiveTableDetails(MyProgram.java:670)
at MyProgram.main(MyProgram.java:398)
The code line is:
FileSystem hdfs = FileSystem.get(new URI(uriStr), configuration);
uriStr=s3a://sBucketName
Configurations are set as below for S3A:
fs.default.name=fs.defaultFS
fs.defaultFS=s3a://bucketName
sPath: XXXXXX
fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
fs.s3a.access.key=XXXXXX
fs.s3a.secret.key=XXXXXXX
fs.s3a.endpoint=XXXXXXX
hadoop.rpc.protection=privacy
dfs.data.transfer.protection=privacy
hadoop.security.authentication=Kerberos
dfs.namenode.kerberos.principal=hdfs/XXXX#XXXX.XXX.XXXXXX.XXX
yarn.resourcemanager.principal=yarn/XXXX#XXXX.XXX.XXXXXX.XXX
Am I missing anything in configuration setup?
Please advise.
This problem might occur if the aws-sdk version and the Hadoop version are not compatible; you may get more help from Spark job reading from S3 on Spark cluster gives IllegalAccessError: tried to access method MutableCounterLong and java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StorageStatistics.
When I rolled back the hadoop-aws version from 2.8.0 to 2.7.3, the problem was solved.
spark-submit --master local \
--packages org.apache.hadoop:hadoop-aws:2.7.3,\
com.amazonaws:aws-java-sdk-pom:1.10.6,\
org.apache.hadoop:hadoop-common:2.7.3 \
test_s3.py
According to the discussion here, https://stackoverflow.com/a/52828978/8025086, it seems it is proper to use aws-java-sdk 1.7.4. I just tested this simple example with pyspark and it also works. I am not a Java guy; maybe someone has a better explanation.
# this one also works, notice that the version of aws-java-sdk is different
spark-submit --master local \
--packages org.apache.hadoop:hadoop-aws:2.7.3,\
com.amazonaws:aws-java-sdk:1.7.4,\
org.apache.hadoop:hadoop-common:2.7.3 \
test_s3.py
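Since the original question runs plain Java rather than pyspark, the same alignment applies to the application classpath. A minimal sketch, assuming the jars were fetched from Maven Central, that MyProgram (from the stack trace) is the entry point, and that myprogram.jar is a hypothetical name for the application jar; the remaining Hadoop transitive dependencies are omitted:
# hadoop-aws 2.7.3 was built against aws-java-sdk 1.7.4, so keep that pair together
java -cp "myprogram.jar:hadoop-common-2.7.3.jar:hadoop-aws-2.7.3.jar:aws-java-sdk-1.7.4.jar:<other Hadoop dependencies>" MyProgram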

java.lang.NoClassDefFoundError with HBase Scan

I am trying to run a MapReduce job to scan an HBase table. Currently I am using HBase 0.94.6, the version that comes with Cloudera 4.4. At some point in my program I use Scan(), and I properly import it with:
import org.apache.hadoop.hbase.client.Scan;
It compiles fine and I am able to create a jar file too; I do it by passing the HBase classpath as the value for the -cp option. When running the program, I get the following message:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/client/Scan
I run the code using:
hadoop jar my_program.jar MyJobClass -libjars <list_of_jars>
where list_of_jars contains /opt/cloudera/parcels/CDH/lib/hbase/hbase.jar. Just to double-check, I confirmed that hbase.jar contains Scan. I did so with:
jar tf /opt/cloudera/parcels/CDH/lib/hbase/hbase.jar
And I can see the line:
org/apache/hadoop/hbase/client/Scan.class
in the output. All looks OK to me. I don't understand why it is saying that Scan is not defined. I pass the correct jar, and it contains the class.
Any help is appreciated.
Setting the HADOOP_CLASSPATH variable fixed the issue:
export HADOOP_CLASSPATH=`/usr/bin/hbase classpath`
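Putting the two pieces together, the full invocation looks roughly like this (the jar name, job class, and -libjars value come from the question; hbase classpath supplies the HBase client jars to the driver at runtime):
export HADOOP_CLASSPATH=`/usr/bin/hbase classpath`
hadoop jar my_program.jar MyJobClass -libjars /opt/cloudera/parcels/CDH/lib/hbase/hbase.jar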

Debug information with UDFs in Hive

I'm trying to get GeoIP working with Hive. I found this: http://www.jointhegrid.com/hive-udf-geo-ip-jtg/index.jsp, which seems to be exactly what I want.
I built the jars (I have no Java experience, so I only hope I did this part right), added them before my query, and got this:
hive> ADD jar hive-udf-geo-ip-jtg.jar;
Added hive-udf-geo-ip-jtg.jar to class path
Added resource: hive-udf-geo-ip-jtg.jar
hive> ADD jar geo-ip-java.jar;
Added geo-ip-java.jar to class path
Added resource: geo-ip-java.jar
hive> ADD file GeoIPCity.dat;
Added resource: GeoIPCity.dat
hive> create temporary function geoip as 'com.jointhegrid.hive.udf.GenericUDFGeoIP';
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.FunctionTask
Is there a way to find out what exactly is going wrong? Return code 1 doesn't tell me much... Is there a log file somewhere?
If you want to see Hive's log output, you can start Hive with $HIVE_HOME/bin/hive -hiveconf hive.root.logger=INFO,console. You can also change the level (DEBUG, INFO, WARN, ERROR or FATAL) to see if you can get enough information.
Try executing the Hive UDF with the command below:
hive --hiveconf hive.root.logger=DRFA --hiveconf hive.log.dir=./logs --hiveconf hive.log.level=DEBUG -e "query"
or
hive --hiveconf hive.root.logger=DRFA --hiveconf hive.log.dir=./logs --hiveconf hive.log.level=DEBUG -f queryscript.hql
The logs will be captured in a file under the logs folder (in the current directory). Please make sure that the logs folder exists.
Try adjusting the log level to get the right amount of detail.
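If the console logging still doesn't surface the cause, the log file itself can be inspected; with the default hive-log4j settings the log usually ends up under /tmp/<user>/hive.log, so something like the following is worth trying (the path is an assumption about the default configuration):
tail -n 100 /tmp/$USER/hive.log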
