I am getting the following error while executing a MapReduce job on my Hadoop cluster (a distributed cluster).
I found the error below in the YARN application logs, where the mapper fails.
java.lang.NoSuchMethodError: org/apache/hadoop/mapreduce/util/MRJobConfUtil.setTaskLogProgressDeltaThresholds(Lorg/apache/hadoop/conf/Configuration;)V (loaded from file:/data/hadoop/yarn/usercache/hdfs-user/appcache/application_1671477750397_2609/filecache/11/job.jar/job.jar by sun.misc.Launcher$AppClassLoader#8bf41861) called from class org.apache.hadoop.mapred.TaskAttemptListenerImpl
The Hadoop version is 3.3.0.
Okay, that method exists in 3.3.0, but not in 3.0.0.
Therefore, you need to use the hadoop-client dependency of that version, not 3.0.0-cdh6.... Also, use compileOnly with it, not implementation. This way it does not conflict with what YARN already has on its classpath.
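A minimal build.gradle sketch of that setup (a sketch only; adjust the version to whatever your cluster actually runs):
// compile against the cluster's Hadoop version, but don't bundle it into the job jar
dependencies {
    compileOnly 'org.apache.hadoop:hadoop-client:3.3.0'
}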
Similarly, spark-core would have the same problem, and if you have Spark in your app anyway, then use it, not MapReduce functions.
Run gradle dependencies, then search for hadoop libraries and ensure they are 3.3.0
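For example (a rough check; pick whichever configuration your build actually resolves):
./gradlew dependencies --configuration compileClasspath | grep org.apache.hadoop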
I'm using the Maven Shade plugin, and the shaded jar is used to submit my ETL job. I want to use the 1.7.7 and 1.9.1 versions of Apache Avro as transitive dependencies, but I'm getting an error:
java.lang.NoClassDefFoundError: org/apache/avro/message/BinaryMessageEncoder. In the logs, we can see the lower version is getting used.
It was all fine before I set HADOOP_USER_CLASSPATH_FIRST=true when running the jar.
What are ways I can have both versions on the classpath? Is it possible using shade plugin?
Found that org.apache.avro 1.8.2 is compatible with both (1.7.7 and 1.9.1); that worked for me.
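If no single compatible version had worked, the usual way to keep two Avro copies apart with the Shade plugin is to relocate the bundled one into a private package; a rough pom.xml sketch under that assumption (the shaded package prefix is a placeholder):
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <!-- rewrite the bundled Avro classes, and all references to them, into a private package -->
          <relocation>
            <pattern>org.apache.avro</pattern>
            <shadedPattern>myapp.shaded.org.apache.avro</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
The other Avro version is then taken untouched from the cluster classpath.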
I have a running storm topology started using a packaged jar. I'm trying to find the version of the jar the topology is running. As far as I can tell, Storm will only show the version of Storm that is running, not the version of the topology running.
Running the "storm version" command only gives the version of storm running and I don't see anything in the topology section of the Storm UI to indicate topology version.
Is there any way to have Storm report this or is my best bet setting a properties file? Ideally, this would be done automatically with either the pom.xml version or a git commit hash. Another solution I'd be happy with would be to have Storm report on the jar file name used to start the topology.
One way could be to list the directory
$STORM_HOME/storm-local/supervisor/stormdist/name-of-your-running-topology
while the topology is running and look at the stormjar.jar file. This is the uber jar Storm uses when submitting the topology. Comparing the size of this jar with the uber jar generated by your Java project's build command should tell you whether they are identical and give you a hint about the jar version in use.
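For example (the topology name and the local jar path below are placeholders):
# on the supervisor node, while the topology is running
ls -l $STORM_HOME/storm-local/supervisor/stormdist/name-of-your-running-topology/stormjar.jar
# compare the size with the uber jar produced by your build
ls -l target/my-topology-1.0-jar-with-dependencies.jar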
There is a distribution of Spark ("Hadoop free") that doesn't bundle the Hadoop libraries. It requires setting the SPARK_DIST_CLASSPATH variable so that it points to the Hadoop libraries you provide.
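For reference, the "Hadoop free" builds are wired up through conf/spark-env.sh, roughly like this (assuming the hadoop command of the target installation is on the PATH):
# conf/spark-env.sh
export SPARK_DIST_CLASSPATH=$(hadoop classpath)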
Apart from this, the "Building Spark" documentation also notes an incompatibility between different versions of HDFS:
Because HDFS is not protocol-compatible across versions, if you want
to read from HDFS, you’ll need to build Spark against the specific
HDFS version in your environment. You can do this through the
hadoop.version property. If unset, Spark will build against Hadoop
2.2.0 by default. Note that certain build profiles are required for particular Hadoop versions
Do I understand correctly that this only applies to Spark distributions that bundle a specific version of Hadoop? And can a "Hadoop free" build run on any Hadoop version, as long as the jars available at runtime contain the classes and methods that Spark uses in its source code? So can I safely compile Spark against hadoop-client 2.6 and run it on Hadoop 2.7+?
What I'm doing:
Trying to connect Spark and Cassandra to retrieve data stored in Cassandra tables from Spark.
What steps have I followed:
Downloaded Cassandra 2.1.12 and Spark 1.4.1.
Built Spark with sudo build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package and sbt/sbt clean assembly.
Stored some data into cassandra.
Downloaded these jars into spark/lib:
cassandra-driver-core-2.1.1.jar and spark-cassandra-connector_2.11-1.4.1.jar
Added the jar file paths to conf/spark-defaults.conf like this:
spark.driver.extraClassPath \
~/path/to/spark-cassandra-connector_2.11-1.4.1.jar:\
~/path/to/cassandra-driver-core-2.1.1.jar
How am I running the shell:
After running ./bin/cassandra, I run Spark like this:
sudo ./bin/pyspark
and also tried with sudo ./bin/spark-shell
What query am I making:
sqlContext.read.format("org.apache.spark.sql.cassandra")\
.options(table="users", keyspace="test")\
.load()\
.show()
The problem:
java.lang.NoSuchMethodError: scala.Predef$.$conforms()Lscala/Predef$$less$colon$less;
But org.apache.spark.sql.cassandra is present in the spark-cassandra-connector jar that I downloaded.
Here is the full Log Trace
What have I tried:
I tried running with the --packages, --driver-class-path, and --jars options, adding the two jars.
Tried downgrading Scala to 2.1 and tried with the Scala shell, but still got the same error.
Questions I've been thinking about:
Are the versions of cassandra, spark and scala that I'm using compatible with each other?
Am I using the correct version of the jar files?
Did I compile spark in the wrong way?
Am I missing something or doing something wrong?
I'm really new to Spark and Cassandra, so I really need some advice! I've been spending hours on this and it's probably something trivial.
A few notes:
One: you are building Spark for Scala 2.10 and using Spark Cassandra Connector libraries built for 2.11. To build Spark for 2.11 you need to use the -Dscala-2.11 flag. This is most likely the main cause of your errors.
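Combining the build command from the question with that flag would look roughly like this (a sketch; Spark 1.4 may also require running the project's Scala 2.11 version-switch script before the Maven build):
build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Dscala-2.11 -DskipTests clean package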
Next, to actually include the connector in your project, just including the core libs without their dependencies will not be enough. If you got past the first error you would most likely see other class-not-found errors from the missing deps.
This is why it's recommended to use the Spark Packages website and the --packages flag. This will include a "fat jar" which has all the required dependencies. See
http://spark-packages.org/package/datastax/spark-cassandra-connector
For Spark 1.4.1 and pyspark this would be
# Scala 2.10
$SPARK_HOME/bin/pyspark --packages datastax:spark-cassandra-connector:1.4.1-s_2.10
# Scala 2.11
$SPARK_HOME/bin/pyspark --packages datastax:spark-cassandra-connector:1.4.1-s_2.11
When using the --packages method you should never have to manually download jars.
Do not use spark.driver.extraClassPath; it only adds the dependencies on the driver, so remote (executor) code will not be able to use them.
I have a jar that uses the Hadoop API to launch various remote MapReduce jobs (i.e., I'm not using the command line to initiate the jobs). The service jar that executes the various jobs is built with Maven's "jar-with-dependencies".
My jobs all run fine except one that uses commons-codec 1.7, where I get:
FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.NoSuchMethodError: org.apache.commons.codec.binary.Base64.encodeAsString([B)Ljava/lang/String;
I think this is because my jar includes commons-codec 1.7, whereas my Hadoop install's lib has commons-codec 1.4 ...
Is there any way to instruct Hadoop to use the distributed commons-codec 1.7 (I assume this is distributed as a job dependency) rather than the commons-codec 1.4 in the Hadoop 1.0.3 core lib?
Many thanks!
Note: Removing commons-codec-1.4.jar from my Hadoop library folder does solve the problem, but doesn't seem too sane. Hopefully there is a better alternative.
Two approaches (a sketch of both follows after the list):
You should be able to exclude commons-codec from within the hadoop dependency and add another explicit dependency for commons-codec.
Try setting the hadoop dependency's scope to provided so that none of the Hadoop jars get included in your fat jar. This assumes that those jars are on the runtime classpath anyway.
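A rough pom.xml sketch of both ideas combined (the hadoop-core coordinates are an assumption for Hadoop 1.0.3; adjust to whichever Hadoop artifact you actually depend on):
<!-- Hadoop provided by the cluster, and without its bundled commons-codec -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-core</artifactId>
  <version>1.0.3</version>
  <scope>provided</scope>
  <exclusions>
    <exclusion>
      <groupId>commons-codec</groupId>
      <artifactId>commons-codec</artifactId>
    </exclusion>
  </exclusions>
</dependency>
<!-- the commons-codec version your job needs, packaged into the jar-with-dependencies -->
<dependency>
  <groupId>commons-codec</groupId>
  <artifactId>commons-codec</artifactId>
  <version>1.7</version>
</dependency>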