What I'm doing:
Trying to connect Spark and Cassandra so I can retrieve data stored in Cassandra tables from Spark.
What steps have I followed:
Downloaded Cassandra 2.1.12 and Spark 1.4.1.
Built Spark with sudo build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package and sbt/sbt clean assembly.
Stored some data in Cassandra.
Downloaded these jars into spark/lib:
cassandra-driver-core-2.1.1.jar and spark-cassandra-connector_2.11-1.4.1.jar
Added the jar file paths to conf/spark-defaults.conf like this:
spark.driver.extraClassPath \
~/path/to/spark-cassandra-connector_2.11-1.4.1.jar:\
~/path/to/cassandra-driver-core-2.1.1.jar
How am I running the shell:
After running ./bin/cassandra, I start Spark like this:
sudo ./bin/pyspark
and also tried with sudo ./bin/spark-shell
What query am I making:
sqlContext.read.format("org.apache.spark.sql.cassandra")\
.options(table="users", keyspace="test")\
.load()\
.show()
The problem:
java.lang.NoSuchMethodError: scala.Predef$.$conforms()Lscala/Predef$$less$colon$less;
But org.apache.spark.sql.cassandra is present in the spark-cassandra-connector jar that I downloaded.
Here is the full Log Trace
What have I tried:
I tried running with the --packages, --driver-class-path, and --jars options, adding the two jars.
I tried downgrading Scala to 2.1 and using the Scala shell, but I still get the same error.
Questions I've been thinking about:
Are the versions of Cassandra, Spark, and Scala that I'm using compatible with each other?
Am I using the correct version of the jar files?
Did I compile spark in the wrong way?
Am I missing something or doing something wrong?
I'm really new to Spark and Cassandra, so I could really use some advice. I've been spending hours on this, and it's probably something trivial.
A few notes:
First, you are building Spark for Scala 2.10 but using Spark Cassandra Connector libraries built for Scala 2.11. To build Spark for Scala 2.11 you need to use the -Dscala-2.11 flag. This is most likely the main cause of your errors.
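For example, your existing Maven command with that flag added would look roughly like this (a sketch based on the command you posted; check the Spark 1.4 build docs, since switching Scala versions may also require running a version-change script first):
build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Dscala-2.11 -DskipTests clean package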
Next, to actually include the connector in your project, adding just the core libs without their dependencies will not be enough. If you got past the first error, you would most likely see other class-not-found errors from the missing dependencies.
This is why it's recommended to use the Spark Packages website and the --packages flag. This will include a "fat jar" which has all the required dependencies. See
http://spark-packages.org/package/datastax/spark-cassandra-connector
For Spark 1.4.1 and pyspark this would be:
# Scala 2.10
$SPARK_HOME/bin/pyspark --packages datastax:spark-cassandra-connector:1.4.1-s_2.10
# Scala 2.11
$SPARK_HOME/bin/pyspark --packages datastax:spark-cassandra-connector:1.4.1-s_2.11
With the --packages method you should never have to manually download jars.
Do not use spark.driver.extraClassPath; it only adds the dependencies to the driver, so remote code (the executors) will not be able to use them.
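Putting it together, a minimal sketch of a session launched with --packages (using the Scala 2.10 coordinates from above and the table/keyspace names from your example):
$SPARK_HOME/bin/pyspark --packages datastax:spark-cassandra-connector:1.4.1-s_2.10
# inside the pyspark 1.x shell, sqlContext is already created for you
df = sqlContext.read.format("org.apache.spark.sql.cassandra") \
    .options(table="users", keyspace="test") \
    .load()
df.show()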
Related
I am getting the following error while executing a MapReduce job in my Hadoop cluster (a distributed cluster).
I found the error below in the application logs in YARN, where the mapper fails.
java.lang.NoSuchMethodError: org/apache/hadoop/mapreduce/util/MRJobConfUtil.setTaskLogProgressDeltaThresholds(Lorg/apache/hadoop/conf/Configuration;)V (loaded from file:/data/hadoop/yarn/usercache/hdfs-user/appcache/application_1671477750397_2609/filecache/11/job.jar/job.jar by sun.misc.Launcher$AppClassLoader#8bf41861) called from class org.apache.hadoop.mapred.TaskAttemptListenerImpl
The Hadoop version is Hadoop 3.3.0.
Okay, that method exists in that version, but not in 3.0.0.
Therefore, you need to use the hadoop-client dependency of that version, not 3.0.0-cdh6.... Also, use compileOnly with it, not implementation. This way, it does not conflict with what YARN already has on its classpath.
Similarly, spark-core would have the same problem, and if you have Spark in your app anyway, then use it rather than the MapReduce APIs.
Run gradle dependencies, then search for the Hadoop libraries and ensure they resolve to 3.3.0.
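A quick way to check, assuming you use the Gradle wrapper (the configuration name may differ in your build):
./gradlew dependencies --configuration runtimeClasspath | grep -i hadoop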
I am trying to upgrade the Java and Scala versions in the Bitnami Spark image by updating the Dockerfile:
https://github.com/bitnami/bitnami-docker-spark/blob/master/3.3/debian-11/Dockerfile
Going through the rest of the code, it looks like it uses the repository URL below to download the installation tar files:
https://downloads.bitnami.com/files/stacksmith
I need to find the correct package name for Java 17 and Spark 3.3.0 with Scala 2.13.
How can I view the available packages? The above URL just redirects back to bitnami.com!
I am trying to set up an Apache Samza and Kafka environment and am experiencing some problems when trying to run the modules.
I have Kafka working correctly, but I cannot make Samza work. I have installed two Debian Jessie AMD64 boxes and followed the instructions in the Samza documentation:
apt-get install openjdk-7-jdk openjdk-7-jre git maven
git clone http://git-wip-us.apache.org/repos/asf/samza.git
cd samza
./gradlew clean build
When I try to launch the YARN ApplicationMaster with the script provided with Samza:
/opt/samza/samza-shell/src/main/bash/run-am.sh
I get this error:
Error: Main class org.apache.samza.job.yarn.SamzaAppMaster has not been found or loaded
If I try to run a test job with the run-job.sh script
./run-job.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=file://$PWD/deploy/samza/config/wikipedia-feed.properties
I get a similar error referencing the org.apache.samza.job.JobRunner class.
I am thinking that I have a Java configuration issue, but I have not been able to find much help or reference material.
Does anyone know what I am doing wrong?
Still not working, but I have gotten one step further. The scripts provided with Samza expect to be located in a /bin/ folder, with a sibling /lib/ folder containing all the Samza .jar files.
I am still having some dependency issues, but different ones.
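Roughly, the layout they expect looks like this (illustrative only; exact jar locations depend on your build):
# adjust paths to your checkout
mkdir -p deploy/samza/bin deploy/samza/lib
cp samza-shell/src/main/bash/*.sh deploy/samza/bin/
# then place all the Samza jars (and their dependencies) in deploy/samza/lib/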
I have a jar that uses the Hadoop API to launch various remote MapReduce jobs (i.e., I'm not using the command line to initiate the jobs). The service jar that executes the various jobs is built with Maven's "jar-with-dependencies".
My jobs all run fine except one that uses commons-codec 1.7, for which I get:
FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.NoSuchMethodError: org.apache.commons.codec.binary.Base64.encodeAsString([B)Ljava/lang/String;
I think this is because my jar includes commons-codec 1.7, whereas my Hadoop install's lib has commons-codec 1.4...
Is there any way to instruct Hadoop to use the distributed commons-codec 1.7 (I assume this is distributed as a job dependency) rather than the commons-codec 1.4 in the Hadoop 1.0.3 core lib?
Many thanks!
Note: Removing commons-codec-1.4.jar from my Hadoop library folder does solve the problem, but that doesn't seem too sane. Hopefully there is a better alternative.
Two approaches:
You should be able to exclude commons-codec from within the hadoop dependency and add an explicit dependency on commons-codec 1.7.
Try setting the scope to provided so that none of the Hadoop jars get included. This assumes that those jars will be on the runtime classpath.
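Either way, you can confirm which commons-codec version your fat jar will pull in with something like this (run it in the module that builds the jar-with-dependencies):
mvn dependency:tree -Dincludes=commons-codec:commons-codec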
I was installing Apache Pig's piggybank from this tutorial.
While I was building the source with Ant, I observed it installing Apache Hive and HBase.
Can anyone tell me why it does so?
Does Pig use Hive and HBase?
Pig has HBase and Hive as dependencies because it has an HBase loader and a Hive loader that come with the standard distribution.
I wouldn't worry about them getting installed. They are just building the jars, not deploying anything.