I was installing Apache Pig's piggybank from this tutorial.
While I was building the source with ant, I observed it was installing Apache Hive and HBase.
Can anyone tell me why it does so?
Does Pig use Hive and HBase?
Pig has HBase and Hive as dependencies because it has an HBase loader and a Hive loader that come with the standard distribution.
I wouldn't worry about them getting installed. They are just building the jars, not deploying anything.
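If it helps to see what that build actually produces, here is a rough sketch of the usual piggybank build from a Pig source checkout (paths follow the standard Pig source layout; adjust to wherever your tutorial unpacked the sources):
# build only the piggybank contrib jar from the Pig source tree
cd $PIG_HOME/contrib/piggybank/java
ant
# the result is piggybank.jar in this directory; the Hive/HBase jars that
# ant/ivy downloaded are build-time dependencies only and are not deployed
ls piggybank.jar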
I am getting the following error while executing a MapReduce job in my Hadoop cluster (a distributed cluster).
I found the error below in the YARN application logs, where the mapper fails.
java.lang.NoSuchMethodError: org/apache/hadoop/mapreduce/util/MRJobConfUtil.setTaskLogProgressDeltaThresholds(Lorg/apache/hadoop/conf/Configuration;)V (loaded from file:/data/hadoop/yarn/usercache/hdfs-user/appcache/application_1671477750397_2609/filecache/11/job.jar/job.jar by sun.misc.Launcher$AppClassLoader#8bf41861) called from class org.apache.hadoop.mapred.TaskAttemptListenerImpl
The Hadoop version is 3.3.0.
Okay, that method exists in that version, but not in 3.0.0.
Therefore, you need to use the hadoop-client dependency of that version, not 3.0.0-cdh6.... Also, use compileOnly with it, not implementation. That way it does not conflict with what YARN already has on its classpath.
Similarly, spark-core would have the same problem, and if you have Spark in your app anyway, then use it rather than MapReduce functions.
Run gradle dependencies, then search for the Hadoop libraries and ensure they are 3.3.0.
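For example, a rough build.gradle fragment along those lines (Groovy DSL; hadoop-client and the 3.3.0 version come from the question, the rest is an assumption about your project):
// declare Hadoop as compileOnly so it is not bundled into the job jar
// and cannot conflict with the 3.3.0 jars YARN already provides
dependencies {
    compileOnly 'org.apache.hadoop:hadoop-client:3.3.0'
}
After changing it, run gradle dependencies again and make sure no 3.0.0-cdh6 artifacts are left in the resolved tree.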
I have a running storm topology started using a packaged jar. I'm trying to find the version of the jar the topology is running. As far as I can tell, Storm will only show the version of Storm that is running, not the version of the topology running.
Running the "storm version" command only gives the version of storm running and I don't see anything in the topology section of the Storm UI to indicate topology version.
Is there any way to have Storm report this or is my best bet setting a properties file? Ideally, this would be done automatically with either the pom.xml version or a git commit hash. Another solution I'd be happy with would be to have Storm report on the jar file name used to start the topology.
One way could be to list the contents of
$STORM_HOME/storm-local/supervisor/stormdist/name-of-your-running-topology
while the topology is running and look at the stormjar.jar file. This is the uber jar Storm used when the topology was submitted. If you compare this jar with the uber jar produced by your Java project's build command, they should be identical, which gives you a hint about which jar version is running.
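For example, a checksum comparison is more reliable than comparing sizes (the topology name and the local jar path below are placeholders; substitute your own):
# jar Storm is actually running
md5sum $STORM_HOME/storm-local/supervisor/stormdist/name-of-your-running-topology/stormjar.jar
# uber jar produced by your local build (hypothetical path)
md5sum target/my-topology-1.2.3-jar-with-dependencies.jar
# identical checksums mean the running topology was built from that jar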
There is a distribution of Spark that doesn't bundle the Hadoop libraries inside (the "Hadoop free" build). It requires setting the SPARK_DIST_CLASSPATH variable so that it points at the Hadoop libraries you provide.
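For example, the usual pattern from the "Hadoop Free" build docs (it assumes the hadoop launcher of your target cluster is on the PATH):
# in conf/spark-env.sh: point Spark at the Hadoop jars already installed on the node
export SPARK_DIST_CLASSPATH=$(hadoop classpath)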
Apart from this, the "Building Spark" page also mentions an incompatibility between different versions of HDFS:
Because HDFS is not protocol-compatible across versions, if you want to read from HDFS, you'll need to build Spark against the specific HDFS version in your environment. You can do this through the hadoop.version property. If unset, Spark will build against Hadoop 2.2.0 by default. Note that certain build profiles are required for particular Hadoop versions.
Do I understand it right that this refers only to Spark distributions that bundle a specific version of Hadoop? And the "Hadoop Free" build can run on any Hadoop version as long as the jars available at runtime have the classes and methods that Spark uses in its source code? So I can safely compile Spark against hadoop-client 2.6 and run it on Hadoop 2.7+?
What I'm doing:
Trying to connect Spark and Cassandra to retrieve data stored in Cassandra tables from Spark.
What steps have I followed:
Downloaded Cassandra 2.1.12 and Spark 1.4.1.
Built Spark with sudo build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package and sbt/sbt clean assembly.
Stored some data into Cassandra.
Downloaded these jars into spark/lib:
cassandra-driver-core-2.1.1.jar and spark-cassandra-connector_2.11-1.4.1.jar
Added the jar file paths to conf/spark-defaults.conf like
spark.driver.extraClassPath \
~/path/to/spark-cassandra-connector_2.11-1.4.1.jar:\
~/path/to/cassandra-driver-core-2.1.1.jar
How am I running the shell:
After running ./bin/cassandra, I run Spark like this:
sudo ./bin/pyspark
and also tried with sudo ./bin/spark-shell
What query am I making:
sqlContext.read.format("org.apache.spark.sql.cassandra")\
.options(table="users", keyspace="test")\
.load()\
.show()
The problem:
java.lang.NoSuchMethodError:\
scala.Predef$.$conforms()Lscala/Predef$$less$colon$less;
But org.apache.spark.sql.cassandra is present in the spark-cassandra-connector.jar that I downloaded.
Here is the full Log Trace
What have I tried:
I tried running with the --packages, --driver-class-path, and --jars options, adding the two jars.
Tried downgrading Scala to 2.1 and tried with the Scala shell, but still the same error.
Questions I've been thinking about:
Are the versions of cassandra, spark and scala that I'm using compatible with each other?
Am I using the correct version of the jar files?
Did I compile spark in the wrong way?
Am I missing something or doing something wrong?
I'm really new to Spark and Cassandra, so I really need some advice! I've been spending hours on this and it's probably something trivial.
A few notes
One, you are building Spark for Scala 2.10 but using Spark Cassandra Connector libraries built for 2.11. To build Spark for 2.11 you need to use the -Dscala-2.11 flag. This is most likely the main cause of your errors.
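A rough sketch of that rebuild, reusing the flags from your own build command and adding the Scala switch (check the "Building Spark" instructions for 1.4.1; that release may also require running a Scala-version-switch script under dev/ before invoking Maven):
# rebuild Spark against Scala 2.11 instead of the default 2.10
build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Dscala-2.11 -DskipTests clean package
Alternatively, keep your 2.10 build of Spark and switch to the _2.10 connector artifact shown below.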
Next, to actually include the connector in your project, just including the core libs without their dependencies will not be enough. If you got past the first error you would most likely see other class-not-found errors from the missing deps.
This is why it's recommended to use the Spark Packages website and the --packages flag. This will include a "fat jar" which has all the required dependencies. See
http://spark-packages.org/package/datastax/spark-cassandra-connector
For Spark 1.4.1 and pyspark this would be
# Scala 2.10
$SPARK_HOME/bin/pyspark --packages datastax:spark-cassandra-connector:1.4.1-s_2.10
# Scala 2.11
$SPARK_HOME/bin/pyspark --packages datastax:spark-cassandra-connector:1.4.1-s_2.11
You should never have to manually download jars using the --packages method.
Do not use spark.driver.extraClassPath; it will only add the dependencies to the driver, so remote code (the executors) will not be able to use them.
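If you do stay with manually downloaded jars instead of --packages, you would need the executor side of the classpath as well; a sketch (the paths are illustrative and must exist on every worker node):
# conf/spark-defaults.conf -- manual alternative; --packages is still the simpler route
spark.driver.extraClassPath    /path/to/spark-cassandra-connector_2.10-1.4.1.jar:/path/to/cassandra-driver-core-2.1.1.jar
spark.executor.extraClassPath  /path/to/spark-cassandra-connector_2.10-1.4.1.jar:/path/to/cassandra-driver-core-2.1.1.jar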
I am on Windows 8 32 bit, running Eclipse Juno.
I have just started working on Amazon EMR. So far, I have been able to connect to EMR remotely from my local machine using SSH and from inside Eclipse. I could run my custom JAR on EMR remotely by creating an AWS project in Eclipse and using the Custom JAR execution on EMR commands.
I am now trying to run Apache Nutch 1.9 from inside Eclipse. I did an Ant build to create the Nutch Eclipse project and was able to bring it into my Eclipse workspace successfully. Now, when I run the Injector I get the following error:
Injector: starting at 2015-04-20 00:56:08
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: java.io.IOException: Failed to set permissions of path: \tmp\hadoop-Kajari_G\mapred\staging\Kajari_G881485826\.staging to 0700
I found out this is due to permission issues with Hadoop. After lots of searching online I realized this is a common issue on Windows. I ran it via Cygwin as Admin and still couldn't fix it.
So now I still want to run the Injector code, but I want to run it on my remote EMR cluster instead of locally.
Can you please guide me on how to tell my Apache Nutch Eclipse project to run on Amazon EMR and not locally? I don't want to create a JAR and run it; I want to run it as a usual Run As --> in Eclipse.
Is this possible to do at all? I did search this online, but couldn't find any working solution.
Thanks!
As far as I know, you cannot run Nutch in distributed mode from Eclipse. In order to run Nutch on a Hadoop cluster you have to follow these steps:
Apply your required configuration in nutch-site.xml and other config files (according to the selected plugins)
Build Nutch using ant runtime
Look in the runtime/deploy directory to find the Nutch Hadoop job file.
Run the following command:
hadoop jar nutch-${version}.job ${your_main_class} ${class_parameters}
For example, suppose your main crawler class is org.apache.nutch.crawl.crawler; in that case the command would be:
hadoop jar nutch-${version}.job org.apache.nutch.crawl.crawler urls -dir crawl -depth 2 -topN 1000
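Since the original question was specifically about the Injector, the same pattern applied to that step would look roughly like this (the job file name and the Injector's <crawldb> <url_dir> arguments are assumptions based on a standard Nutch 1.9 ant runtime build; check the exact file name in runtime/deploy):
# run from a machine that can submit jobs to the EMR cluster (e.g. the master node)
hadoop jar runtime/deploy/apache-nutch-1.9.job org.apache.nutch.crawl.Injector crawl/crawldb urls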