Installing, Configuring, and Running Hadoop 2.2.0 on Mac OS X - java

I've installed hadoop 2.2.0, and set up everything (for a single node) based on this tutorial here: Hadoop YARN Installation. However, I can't get hadoop to run.
I think my problem is that I can't connect to my localhost, but I'm not really sure why. I've spent upwards of about 10 hours installing, googling, and hating open-source software installation guides, so I've now turned to the one place that has never failed me.
Since a picture is worth a thousand words, I give you my setup ... in many, many words' worth of pictures:
Basic profile/setup
I'm running Mac OS X (Mavericks 10.9.5)
For whatever it's worth, here's my /etc/hosts file:
My bash profile:
Hadoop file configurations
The setup for core-site.xml and hdfs-site.xml:
note: I have created folders in the locations you see above
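In case the screenshots don't come through, the general shape of a minimal single-node setup is roughly this (the paths here are just placeholders, not my exact ones; the port matches the 8020 in the error further down):

<!-- core-site.xml (illustrative) -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:8020</value>
  </property>
</configuration>

<!-- hdfs-site.xml (illustrative; the paths stand in for the folders I created) -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///usr/local/hadoop-2.2.0/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///usr/local/hadoop-2.2.0/hdfs/datanode</value>
  </property>
</configuration>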
The setup for my yarn-site.xml:
Setup for my hadoop-env.sh file:
Side Note
Before I show the results of running start-dfs.sh, start-yarn.sh, and checking what's running with jps, keep in mind that I have a hadoop symlink pointing to hadoop-2.2.0.
Starting up Hadoop
Now, here are the results when I start the daemons up:
For those of you who don't have a microscope (it looks super small in the preview of this post), here's a code chunk of what's shown above:
mrp:~ mrp$ start-dfs.sh
2014-11-08 13:06:05.695 java[17730:1003] Unable to load realm info from SCDynamicStore
14/11/08 13:06:05 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [localhost]
localhost: starting namenode, logging to /usr/local/hadoop-2.2.0/logs/hadoop-mrp-namenode-mrp.local.out
localhost: starting datanode, logging to /usr/local/hadoop-2.2.0/logs/hadoop-mrp-datanode-mrp.local.out
localhost: 2014-11-08 13:06:10.954 java[17867:1403] Unable to load realm info from SCDynamicStore
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop-2.2.0/logs/hadoop-mrp-secondarynamenode-mrp.local.out
0.0.0.0: 2014-11-08 13:06:16.065 java[17953:1403] Unable to load realm info from SCDynamicStore
2014-11-08 13:06:20.982 java[17993:1003] Unable to load realm info from SCDynamicStore
14/11/08 13:06:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
mrp:~ mrp$ start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn-mrp-resourcemanager-mrp.local.out
2014-11-08 13:06:43.765 java[18053:20b] Unable to load realm info from SCDynamicStore
localhost: starting nodemanager, logging to /usr/local/hadoop-2.2.0/logs/yarn-mrp-nodemanager-mrp.local.out
Check to see what's running:
Time Out
OK. So far, I think, so good. At least this looks good based on all the other tutorials and posts. I think.
Before I try to do anything fancy, I just want to see if it's working properly, so I run a simple command like hadoop fs -ls.
Failure
When I run hadoop fs -ls, here's what I get:
Again, in case you can't see that pic, it says:
2014-11-08 13:23:45.772 java[18326:1003] Unable to load realm info from SCDynamicStore
14/11/08 13:23:45 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
ls: Call From mrp.local/127.0.0.1 to localhost:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
I've tried to run other commands, and I get the same basic error in the beginning of everything:
Call From mrp.local/127.0.0.1 to localhost:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
Now, I've gone to that website mentioned, but honestly, everything in that link means nothing to me. I don't get what I should do.
I would very much appreciate any assistance with this. You'll make me the happiest hadooper, ever.
...this should go without saying, but obviously I'd be happy to edit/update with more info if needed. Thanks!

Add these to your .bashrc:
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"

Had a very similar problem and found this question while googling for a solution.
Here is how I resolved it (on Mac OS 10.10 with Hadoop 2.5.1). Not sure if the question is exactly the same problem: I checked the log files generated by the datanode (/usr/local/hadoop-2.2.0/logs/hadoop-mrp-datanode-mrp.local.out) and found the following entry:
2014-11-09 17:44:35,238 FATAL org.apache.hadoop.hdfs.server.namenode.NameNode:
Exception in namenode join org.apache.hadoop.hdfs.server.common.InconsistentFSStateException:
Directory /private/tmp/hadoop-kthul/dfs/name is in an inconsistent state: storage
directory does not exist or is not accessible.
Based on this, I concluded that something is wrong with the HDFS data on the datanode.
I deleted the directory with the HDFS data and reformatted HDFS:
rm -rf /private/tmp/hadoop-kthul
hdfs namenode -format
Now, I am up and running again. Still wondering if /private/tmp is a good place to keep the HDFS data - looking for options to change this.
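One option, as far as I understand, is to point hadoop.tmp.dir (or the name/data directories explicitly) at a more permanent location and then reformat HDFS; a sketch with illustrative paths:

<!-- core-site.xml -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/usr/local/hadoop_data/tmp</value>
</property>

<!-- or, more explicitly, in hdfs-site.xml -->
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///usr/local/hadoop_data/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///usr/local/hadoop_data/datanode</value>
</property>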

So I've got Hadoop up and running. I had two problems (I think).
When starting up the NameNode and DataNode, I received the following error: Unable to load realm info from SCDynamicStore.
To fix this, I added the following two lines to my hadoop-env.sh file:
HADOOP_OPTS="${HADOOP_OPTS} -Djava.security.krb5.realm= -Djava.security.krb5.kdc="
HADOOP_OPTS="${HADOOP_OPTS} -Djava.security.krb5.conf=/dev/null"
I found those two lines in the solution to this post, Hadoop on OSX "Unable to load realm info from SCDynamicStore". The Answer was posted by Matthew L Daniel.
I had formatted the NameNode folder more than once, which apparently screws things up?
I can't verify that this screws things up, because I don't have any errors in any of my log files; however, once I followed Workaround 1 (deleting and recreating the NameNode/DataNode folders, then reformatting - sketched below) from this post, No data nodes are started, I was able to start the DataNode and get everything working.
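Roughly, Workaround 1 boiled down to this (the paths are placeholders for wherever your NameNode/DataNode folders actually live):

stop-dfs.sh
rm -rf /path/to/namenode_dir /path/to/datanode_dir    # placeholders for your configured dirs
mkdir -p /path/to/namenode_dir /path/to/datanode_dir
hdfs namenode -format
start-dfs.sh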

Since the native library isn't supported on Mac, if you want to suppress this warning:
WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Add this to the log4j.properties in ${HADOOP_HOME}/libexec/etc/hadoop:
# Turn off native library warning
log4j.logger.org.apache.hadoop.util.NativeCodeLoader=ERROR

Related

Unable to create directory path [/User/Desktop/db2/logs] for Neo4j store

I am trying to use a tool that, in two steps, analyzes code smells for Android.
In the first step, the tool parses an APK and generates, within a directory, .db files that should then be converted to CSV files in the next step; however, whenever I try to run the second step, the console returns the following error:
java.io.IOException: Unable to create directory path [/User/Desktop/db2/logs] for Neo4j store.
I think it is a Neo4j configuration problem.
I am currently running the tool with the following Java configuration:
echo $JAVA_HOME
/home/User/openlogic-openjdk-11.0.15
update-alternatives --config java
* 0 /usr/lib/jvm/java-11-openjdk-amd64/bin/java 1111 auto mode
To be safe, I also started Neo4J, which returned the following output
sudo systemctl status neo4j.service
neo4j.service - Neo4j Graph Database
Loaded: loaded (/lib/systemd/system/neo4j.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2022-07-06 20:11:04 CEST; 16min ago
Main PID: 1040 (java)
Tasks: 57 (limit: 18901)
Memory: 705.4M
CPU: 16.639s
CGroup: /system.slice/neo4j.service
└─1040 /usr/bin/java -cp "/var/lib/neo4j/plugins:/etc/neo4j:/usr/share/neo4j/lib/*:/var/lib/neo4j/plugins/*" -XX:+UseG1GC -XX:-OmitStackTraceInFastThrow -XX:+AlwaysPreTouch -XX:+UnlockExper>.
How can I solve this?
You posted this error:
java.io.IOException: Unable to create directory path [/User/Desktop/db2/logs] for Neo4j store.
From that error, it looks like:
Neo4j was installed at "/User/Desktop/db2"
The permissions for that directory do not have "write" permission
I tried to reproduce this locally using Neo4j Community 4.4.5, following the steps below.
I do see an IOException related to "logs", but it's slightly different from what you posted. Perhaps we're on different versions of Neo4j.
Open terminal into install directory: cd neo4j
Verify "neo4j" is stopped: ./bin/neo4j stop
Rename existing "logs" directory: mv logs logs.save
Remove write permission for the Neo4j install: chmod u-w .
Start neo4j in console mode: ./bin/neo4j console
Observe errors in console output
2022-07-08 03:28:38.081+0000 INFO Starting...
ERROR StatusLogger Unable to create file [****************************]/neo4j/logs/debug.log
java.io.IOException: Could not create directory [****************************]/neo4j/logs
...
To fix things, try:
Get a terminal into your Neo4j directory:
cd /User/Desktop/db2
Set write permissions for the entire directory tree:
chmod u+w -R .
Start neo4j in console mode:
./bin/neo4j console
If this works and you're able to run neo4j fine, it points to an issue with user permissions when running neo4j as a system service.
The best steps from there depend on the system, your access, how comfortable you are making changes, and probably other things. An easy, brute-force hammer would be to manually create each directory you discover (such as "/User/Desktop/db2/logs") and grant permissions to all users (chmod ugo+w .), then try re-running the service and see what errors pop up; a sketch follows. Repeat that until you're able to run the service without errors.
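A sketch of that loop, using the directory from your error (adjust paths as needed):

sudo mkdir -p /User/Desktop/db2/logs
sudo chmod ugo+w /User/Desktop/db2/logs
sudo systemctl restart neo4j.service
sudo journalctl -u neo4j.service -e    # look for the next "Unable to create ..." error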

What is a simple, effective way to debug custom Kafka connectors?

I'm working on a couple of Kafka connectors and I don't see any errors in their creation/deployment in the console output; however, I am not getting the result that I'm looking for (no results whatsoever, for that matter, desired or otherwise). I made these connectors based on Kafka's example FileStream connectors, so my debug technique was based on the use of the SLF4J Logger that is used in the example. I've searched for the log messages that I thought would be produced in the console output, but to no avail. Am I looking in the wrong place for these messages? Or perhaps is there a better way of going about debugging these connectors?
Example uses of the SLF4J Logger that I referenced for my implementation:
Kafka FileStreamSinkTask
Kafka FileStreamSourceTask
I will try to reply to your question in a broad way. A simple way to do Connector development could be as follows:
Structure and build your connector source code by looking at one of the many Kafka Connectors available publicly (you'll find an extensive list available here: https://www.confluent.io/product/connectors/ )
Download the latest Confluent Open Source edition (>= 3.3.0) from https://www.confluent.io/download/
Make your connector package available to Kafka Connect in one of the following ways:
Store all your connector jar files (connector jar plus dependency jars excluding Connect API jars) to a location in your filesystem and enable plugin isolation by adding this location to the
plugin.path property in the Connect worker properties. For instance, if your connector jars are stored in /opt/connectors/my-first-connector, you will set plugin.path=/opt/connectors in your worker's properties (see below).
Store all your connector jar files in a folder under ${CONFLUENT_HOME}/share/java. For example: ${CONFLUENT_HOME}/share/java/kafka-connect-my-first-connector. (Needs to start with kafka-connect- prefix to be picked up by the startup scripts). $CONFLUENT_HOME is where you've installed Confluent Platform.
Optionally, increase your logging by changing the log level for Connect in ${CONFLUENT_HOME}/etc/kafka/connect-log4j.properties to DEBUG or even TRACE.
Use Confluent CLI to start all the services, including Kafka Connect. Details here: http://docs.confluent.io/current/connect/quickstart.html
Briefly: confluent start
Note: The Connect worker's properties file currently loaded by the CLI is ${CONFLUENT_HOME}/etc/schema-registry/connect-avro-distributed.properties. That's the file you should edit if you choose to enable classloading isolation but also if you need to change your Connect worker's properties.
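For example, enabling classloading isolation in that worker properties file might look like this (location illustrative, matching the example above):

# in ${CONFLUENT_HOME}/etc/schema-registry/connect-avro-distributed.properties
plugin.path=/opt/connectors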
Once you have Connect worker running, start your connector by running:
confluent load <connector_name> -d <connector_config.properties>
or
confluent load <connector_name> -d <connector_config.json>
The connector configuration can be either in java properties or JSON format.
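As a concrete illustration, a minimal properties file for a hypothetical connector based on the FileStream example might look like this (the name, file, and topic are placeholders):

name=my-first-connector
connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
tasks.max=1
file=/tmp/input.txt
topic=my-topic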
Run
confluent log connect to open the Connect worker's log file, or navigate directly to where your logs and data are stored by running
cd "$( confluent current )"
Note: change where your logs and data are stored during a session of the Confluent CLI by setting the environment variable CONFLUENT_CURRENT appropriately. E.g. given that /opt/confluent exists and is where you want to store your data, run:
export CONFLUENT_CURRENT=/opt/confluent
confluent current
Finally, to interactively debug your connector, a possible way is to apply the following before starting Connect with the Confluent CLI:
confluent stop connect
export CONNECT_DEBUG=y; export DEBUG_SUSPEND_FLAG=y;
confluent start connect
and then connect with your debugger (for instance, remotely to the Connect worker; default port: 5005). To stop running Connect in debug mode, just run unset CONNECT_DEBUG; unset DEBUG_SUSPEND_FLAG; when you are done.
I hope the above will make your connector development easier and ... more fun!
I love the accepted answer. One thing: the environment variables didn't work for me... I'm using Confluent Community Edition 5.3.1...
Here's what I did that worked...
I installed the Confluent CLI from here:
https://docs.confluent.io/current/cli/installing.html#tarball-installation
I ran Confluent using the command confluent local start
I got the Connect app details using the command ps -ef | grep connect
I copied the resulting command to an editor and added the arg (right after java):
-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005
Then I stopped Connect using the command confluent local stop connect
Then I ran the connect command with the arg.
Brief intermission ---
VS Code development is led by Erich Gamma - of Gang of Four fame, who also led the development of Eclipse's Java tooling. VS Code is becoming a first-class Java IDE; see https://en.wikipedia.org/wiki/Erich_Gamma
Intermission over ---
Next I launched VS Code and opened the Debezium Oracle connector folder (cloned from here): https://github.com/debezium/debezium-incubator
Then I chose Debug - Open Configurations
and entered a debugging configuration like the sketch below,
and then ran the debugger - it will hit your breakpoints!!
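For reference, the attach entry I added under "configurations" in launch.json looks roughly like this (a sketch; the name is arbitrary):

{
    "type": "java",
    "name": "Attach to Kafka Connect (port 5005)",
    "request": "attach",
    "hostName": "localhost",
    "port": 5005
}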
The connect command should look something like this:
/Library/Java/JavaVirtualMachines/jdk1.8.0_221.jdk/Contents/Home/bin/java -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005 -Xms256M -Xmx2G -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -Djava.awt.headless=true -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dkafka.logs.dir=/var/folders/yn/4k6t1qzn5kg3zwgbnf9qq_v40000gn/T/confluent.CYZjfRLm/connect/logs -Dlog4j.configuration=file:/Users/myuserid/confluent-5.3.1/bin/../etc/kafka/connect-log4j.properties -cp /Users/myuserid/confluent-5.3.1/share/java/kafka/*:/Users/myuserid/confluent-5.3.1/share/java/confluent-common/*:/Users/myuserid/confluent-5.3.1/share/java/kafka-serde-tools/*:/Users/myuserid/confluent-5.3.1/bin/../share/java/kafka/*:/Users/myuserid/confluent-5.3.1/bin/../support-metrics-client/build/dependant-libs-2.12.8/*:/Users/myuserid/confluent-5.3.1/bin/../support-metrics-client/build/libs/*:/usr/share/java/support-metrics-client/* org.apache.kafka.connect.cli.ConnectDistributed /var/folders/yn/4k6t1qzn5kg3zwgbnf9qq_v40000gn/T/confluent.CYZjfRLm/connect/connect.properties
The connector module is executed by the Kafka Connect framework. For debugging, we can use standalone mode: configure the IDE to use the ConnectStandalone main function as the entry point.
Create a debug configuration as follows. Remember to tick "Include dependencies with 'Provided' scope" if it is a Maven project.
The connector properties file needs to specify the connector class name via "connector.class" for debugging.
The worker properties file can be copied from the Kafka folder, e.g. /usr/local/etc/kafka/connect-standalone.properties.
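A sketch of the equivalent setup (file names illustrative): in the IDE, set the main class to org.apache.kafka.connect.cli.ConnectStandalone and pass the worker and connector properties files as program arguments. From the command line this corresponds to (the script is connect-standalone or connect-standalone.sh depending on your installation):

connect-standalone /usr/local/etc/kafka/connect-standalone.properties my-connector.properties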

Setting up Hadoop YARN on Ubuntu (single node)

I set up Hadoop YARN (2.5.1) on Ubuntu 13 as a single-node cluster. When I run start-dfs.sh, it gives the following output and the processes do not start (I confirmed using the jps and ps commands). My bashrc setup is also copied below. Any thoughts on what I need to reconfigure?
bashrc additions:
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export HADOOP_INSTALL=/opt/hadoop/hadoop-2.5.1
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
start-dfs.sh output:
14/09/22 12:24:13 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [localhost]
localhost: starting namenode, logging to /opt/hadoop/hadoop-2.5.1/logs/hadoop-hduser-namenode-zkserver1.fidelus.com.out
localhost: nice: $HADOOP_INSTALL/bin/hdfs: No such file or directory
localhost: starting datanode, logging to /opt/hadoop/hadoop-2.5.1/logs/hadoop-hduser-datanode-zkserver1.fidelus.com.out
localhost: nice: $HADOOP_INSTALL/bin/hdfs: No such file or directory
Starting secondary namenodes [0.0.0.0]
The authenticity of host '0.0.0.0 (0.0.0.0)' can't be established.
ECDSA key fingerprint is cf:e1:ea:86:a4:0c:cd:ec:9d:b9:bc:90:9d:2b:db:d5.
Are you sure you want to continue connecting (yes/no)? yes
0.0.0.0: Warning: Permanently added '0.0.0.0' (ECDSA) to the list of known hosts.
0.0.0.0: starting secondarynamenode, logging to /opt/hadoop/hadoop-2.5.1/logs/hadoop-hduser-secondarynamenode-zkserver1.fidelus.com.out
0.0.0.0: nice: $HADOOP_INSTALL/bin/hdfs: No such file or directory
14/09/22 12:24:58 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
The bin directory has the hdfs file and its owner is hduser (I am running the process as hduser). My $HADOOP_INSTALL setting points to the Hadoop directory (/opt/hadoop/hadoop-2.5.1). Should I change anything with the permissions or configuration, or simply move the directory out of /opt and perhaps to /usr/local?
Update:
When I run start-yarn.sh, I get the following message:
localhost: Error: Could not find or load main class org.apache.hadoop.yarn.server.nodemanager.NodeManager
Update
I moved the directory to /usr/local but I get the same warning message.
Update
I have ResourceManager running as per the jps command. However, when I try to start YARN, it fails with the error given above. I can access the ResourceManager UI on port 8088. Any ideas?
Try starting the daemons individually with the following commands (as opposed to using start-dfs.sh) and see if that works.
hadoop-daemon.sh start namenode
hadoop-daemon.sh start secondarynamenode
hadoop-daemon.sh start datanode
yarn-daemon.sh start nodemanager
mr-jobhistory-daemon.sh start historyserver
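If the daemons come up, jps should list them; roughly something like this (the process IDs are illustrative):

jps
12345 NameNode
12346 DataNode
12347 SecondaryNameNode
12348 NodeManager
12349 JobHistoryServer
12350 Jps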

Errors when configuring Apache Nutch crawler

I am having some trouble running Nutch on a Linux server. I am trying to crawl URLs configured in seed.txt, but I am seeing the following errors. The crawler is triggered as follows
nohup java -classpath "./common-conf/*:*:./plugins/*:" -jar crawler-jar-2.0-SNAPSHOT.jar &
In this configuration, all the configuration properties are present in the common-conf directory. We have some custom configuration that we have set up in our crawler binary. As a result, we have built a custom binary and don't use the standard Apache Nutch crawler. I see the following issues:
Our custom nutch-default.xml and nutch-site.xml are not picked up from the common-conf classpath directory. They are being picked up from the Nutch jar file instead. When I print out the URL paths for both XMLs, I see something like this:
nutch default =
jar:file:/home/nbsxlwa/crawler/lib/nutch-2.2.1.jar!/nutch-default.xml
nutch site =
jar:file:/home/nbsxlwa/crawler/lib/nutch-2.2.1.jar!/nutch-site.xml
I want the files to be picked up from classpath. I can verify that the files exist.
Our custom gora.properties is not being picked up. I see the following log trace
14/08/22 07:18:24 WARN store.DataStoreFactory: gora.properties not found, properties will be empty.
14/08/22 07:18:24 INFO crawl.InjectorJob: InjectorJob: Using class org.apache.gora.memory.store.MemStore as the Gora storage class.
gora.properties exists in the classpath and I am not sure why it is not being picked up.
/home/nbsxlwa/crawler/ find . -name "gora.properties"
./common-conf/gora.properties
The http.agent.name configuration property is not being picked up. I can confirm that the configuration exists in nutch-site.xml.
The stack trace is given below.
14/08/22 07:18:36 ERROR fetcher.FetcherJob: Fetcher: No agents listed in 'http.agent.name' property.
14/08/22 07:18:36 WARN crawl.Crawler: Error running crawler job for configuration. Tool run command raises an exception
java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property.
at org.apache.nutch.fetcher.FetcherJob.checkConfiguration(FetcherJob.java:252)
at org.apache.nutch.fetcher.FetcherJob.run(FetcherJob.java:160)
at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:78)
at org.apache.nutch.crawl.Crawler.run(Crawler.java:176)
at org.apache.nutch.crawl.Crawler.run(Crawler.java:266)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawler.main(Crawler.java:356)
regex-normalize.xml and regex-urlfilter.txt are not being picked up from the classpath. I can confirm that the files exist in my classpath. The stack trace is given below
/home/nbsxlwa/crawler : find . -name "regex-normalize.xml"
./common-conf/regex-normalize.xml
/home/nbsxlwa/crawler : find . -name "regex-urlfilter.txt"
./common-conf/regex-urlfilter.txt
14/08/22 07:18:29 INFO conf.Configuration: regex-normalize.xml not found
14/08/22 07:18:29 WARN regex.RegexURLNormalizer: Can't load the default rules!
14/08/22 07:18:29 INFO conf.Configuration: regex-urlfilter.txt not found
14/08/22 07:18:29 INFO conf.Configuration: regex-normalize.xml not found
14/08/22 07:18:29 WARN regex.RegexURLNormalizer: Can't load the default rules!
I have gone through the following links to see where I am going wrong. How do I set up Nutch configuration settings here?
http://mail-archives.apache.org/mod_mbox/nutch-user/201202.mbox/%3CCAGaRif3rtJHokgG5FHSbnJLLUAVGiDnfx7JaW-7kiBjx_ivwSg#mail.gmail.com%3E and
http://osdir.com/ml/user.nutch.apache/2012-02/msg00127.html
I know this is old, but thought it may help someone in the future:
Have you tried to run:
ant runtime
Run it from the Nutch folder after you have changed the config values; a sketch follows.
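Something like this (the directory name depends on where you extracted the Nutch source):

cd apache-nutch-2.2.1
ant runtime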

Hadoop with Eclipse is not connecting

I am using Ubuntu 12.04. I am trying to connect Hadoop in Eclipse. I successfully installed the plugin for 1.04, and I am using Java 1.7 for this.
My configuration data are:
username: hduser, location name: test, map/reduce host port: localhost:9101, and M/R master host: localhost:9100.
My temp directory is /app/hduser/temp.
As per this location I set the advanced parameters. But I was not able to set fs.s3.buffer.dir, as no such directory like /app/hadoop/tmp/s3 had been created, and I was unable to set the map/reduce master directory; I only found the local directory. I did not find mapred.jobtracker.persist.job.dir, nor the map/reduce temp dir.
When I ran Hadoop in pseudo-distributed mode, I did not find any DataNode running with jps either.
I am not sure what the problem is here. In Eclipse I got an error while setting up the DFS server. I got a message like...
An internal error occurred during: "Connecting to DFS test".
org/apache/commons/configuration/Configuration
Thanks all
I was facing the same issue and later found this:
Hadoop eclipse mapreduce is not working?
The main blog post is this. Hope this helps someone who is looking for a solution.
