Errors when configuring Apache Nutch crawler - java

I am having some trouble running Nutch on a Linux server. I am trying to crawl the URLs configured in seed.txt, but I am seeing the errors below. The crawler is triggered as follows:
nohup java -classpath "./common-conf/*:*:./plugins/*:" -jar crawler-jar-2.0-SNAPSHOT.jar &
In this setup, all of the configuration properties live in the common-conf directory. We have some custom configuration baked into our crawler, so we have built a custom binary rather than using the standard Apache Nutch crawler. I see the following issues:
Our custom nutch-default.xml and nutch-site.xml are not picked up from the common-conf classpath directory; they are being picked up from the Nutch jar file instead. When I print out the URL each of the two XML files is loaded from, I see something like this:
nutch default =
jar:file:/home/nbsxlwa/crawler/lib/nutch-2.2.1.jar!/nutch-default.xml
nutch site =
jar:file:/home/nbsxlwa/crawler/lib/nutch-2.2.1.jar!/nutch-site.xml
I want the files to be picked up from the classpath; I can verify that the files exist.
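For reference, this is roughly how I print those URLs (a minimal sketch; the real code lives in our custom binary):

import java.net.URL;

public class ConfProbe {
    public static void main(String[] args) {
        // a file: URL means the classpath directory wins; a jar: URL
        // means the copy inside the Nutch jar wins
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        URL nutchDefault = cl.getResource("nutch-default.xml");
        URL nutchSite = cl.getResource("nutch-site.xml");
        System.out.println("nutch default = " + nutchDefault);
        System.out.println("nutch site = " + nutchSite);
    }
}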
Our custom gora.properties is not being picked up. I see the following log trace
14/08/22 07:18:24 WARN store.DataStoreFactory: gora.properties not found, properties will be empty.
14/08/22 07:18:24 INFO crawl.InjectorJob: InjectorJob: Using class org.apache.gora.memory.store.MemStore as the Gora storage class.
gora.properties exists in the classpath and I am not sure why it is not being picked up.
/home/nbsxlwa/crawler $ find . -name "gora.properties"
./common-conf/gora.properties
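For context, a Nutch 2.x gora.properties usually just selects the default datastore, along these lines (HBase is shown purely as an example backend; the MemStore line in the log above is the fallback Gora uses when it finds no file at all):

# gora.properties (sketch): pick the default Gora datastore
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore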
The http.agent.name configuration property is not being picked up. I can confirm that the property exists in nutch-site.xml.
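For reference, the entry in nutch-site.xml follows the standard Hadoop configuration format (the value below is a placeholder, not our real agent name):

<configuration>
  <property>
    <name>http.agent.name</name>
    <!-- placeholder value -->
    <value>MyCrawler</value>
  </property>
</configuration>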
The stack trace is given below.
14/08/22 07:18:36 ERROR fetcher.FetcherJob: Fetcher: No agents listed in 'http.agent.name' property.
14/08/22 07:18:36 WARN crawl.Crawler: Error running crawler job for configuration. Tool run command raises an exception
java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property.
at org.apache.nutch.fetcher.FetcherJob.checkConfiguration(FetcherJob.java:252)
at org.apache.nutch.fetcher.FetcherJob.run(FetcherJob.java:160)
at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:78)
at org.apache.nutch.crawl.Crawler.run(Crawler.java:176)
at org.apache.nutch.crawl.Crawler.run(Crawler.java:266)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawler.main(Crawler.java:356)
regex-normalize.xml and regex-urlfilter.txt are not being picked up from the classpath. I can confirm that the files exist in my classpath. The log output is given below:
/home/nbsxlwa/crawler $ find . -name "regex-normalize.xml"
./common-conf/regex-normalize.xml
/home/nbsxlwa/crawler $ find . -name "regex-urlfilter.txt"
./common-conf/regex-urlfilter.txt
14/08/22 07:18:29 INFO conf.Configuration: regex-normalize.xml not found
14/08/22 07:18:29 WARN regex.RegexURLNormalizer: Can't load the default rules!
14/08/22 07:18:29 INFO conf.Configuration: regex-urlfilter.txt not found
14/08/22 07:18:29 INFO conf.Configuration: regex-normalize.xml not found
14/08/22 07:18:29 WARN regex.RegexURLNormalizer: Can't load the default rules!
I have gone through the following links to see where I am going wrong. How do I set up the Nutch configuration correctly here?
http://mail-archives.apache.org/mod_mbox/nutch-user/201202.mbox/%3CCAGaRif3rtJHokgG5FHSbnJLLUAVGiDnfx7JaW-7kiBjx_ivwSg#mail.gmail.com%3E and
http://osdir.com/ml/user.nutch.apache/2012-02/msg00127.html

I know this is old, but I thought it may help someone in the future:
Have you tried running
ant runtime
from the Nutch folder after you changed the config values?
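Also worth checking, since every symptom above points at the conf directory never making it onto the effective classpath: when the JVM is launched with -jar, the -classpath option and the CLASSPATH environment variable are silently ignored; only the Class-Path entries in the jar's manifest are used. Separately, a classpath wildcard such as ./common-conf/* expands to .jar files only, so loose .xml and .properties files in that directory are invisible even without -jar. A sketch of an invocation that avoids both problems (the main class here is a guess based on the stack trace in the question):

# put the conf directory itself (not dir/*) first, then the jars;
# name the main class explicitly instead of using -jar
nohup java -classpath "./common-conf:./plugins/*:./crawler-jar-2.0-SNAPSHOT.jar" \
    org.apache.nutch.crawl.Crawler &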

Related

dependencies issue in apache nutch

I am trying to integrate Apache Nutch with Hadoop. After building the apache-nutch-1.15.job file (which resides in the runtime folder) using ant, I tried to run the script bin/crawl but got some dependency errors.
I can see all the required dependencies embedded in the .job file when I extract it, and there are no version issues with the embedded dependencies.
sh crawl -s <seed_file_directory_on_hdfs> <crawl_directory_on_hdfs> <num_rounds>
19/03/22 01:41:22 INFO mapreduce.Job: Running job: job_1547155431533_115992
19/03/22 01:41:34 INFO mapreduce.Job: Job job_1547155431533_115992 running in uber mode : false
19/03/22 01:41:34 INFO mapreduce.Job: map 0% reduce 0%
19/03/22 01:41:45 INFO mapreduce.Job: Task Id : attempt_1547155431533_115992_r_000001_0, Status : FAILED
Error: java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not found.
at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:146)
at org.apache.nutch.crawl.Generator$SelectorReducer.setup(Generator.java:378)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:168)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1924)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
I tried putting the extracted jars on the classpath by changing the script, but it didn't help and the issue persists.
The Apache Nutch documentation has not been updated since 2014.
To crawl web pages using Apache Nutch, build the project using ant and execute the commands from the tutorial (https://wiki.apache.org/nutch/NutchTutorial), replacing all local paths with HDFS paths if you want to crawl content and dump it onto HDFS.
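A minimal sketch of that flow, assuming the seed list is uploaded to HDFS first (all paths are placeholders):

# upload the seed list where the job can reach it
hdfs dfs -mkdir -p /user/nutch/urls
hdfs dfs -put urls/seed.txt /user/nutch/urls
# run the crawl script from runtime/deploy with HDFS paths, 2 rounds
bin/crawl -s /user/nutch/urls /user/nutch/crawl 2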

Stackify Prefix for Java

I am trying out Stackify Prefix v3.0.18 to profile a Spring Boot application in WebLogic 12c. The JVM is started with the stackify-java-apm agent as per the instructions:
-javaagent:"C:\Program Files (x86)\StackifyPrefix\java\lib\stackify-java-apm.jar"
On accessing the Spring Boot Actuator's /health endpoint, I do not get anything reported in the Prefix dashboard at http://localhost:2012. Is anything amiss here?
A couple of observations were made; the Prefix agent was trying:
To load a properties file from a Linux/Unix path, failing to find it:
16:16:24.826 [main] WARN com.stackify.apm.config.a - Unable to find properties file /usr/local/stackify/stackify-java-apm/stackify.properties
To write a file into the non-existent directory C:\Program Files (x86)\Stackify\stackify-java-apm\log\
I was unable to find an end-to-end demo or tutorial on setting up and using Prefix to profile a Java application.
I was looking on their support site and it seems that WebLogic 12c is not supported according to this link:
https://support.stackify.com/prefix-enable-java-profiling/
Have you tried submitting a ticket with them?
https://support.stackify.com/submit-a-ticket/

Installing, Configuring, and running Hadoop 2.2.0 on Mac OS X

I've installed Hadoop 2.2.0 and set everything up (for a single node) based on this tutorial: Hadoop YARN Installation. However, I can't get Hadoop to run.
I think my problem is that I can't connect to my localhost, but I'm not really sure why. I've spent upwards of 10 hours installing, googling, and hating open-source software installation guides, so I've now turned to the one place that has never failed me.
Since a picture is worth a thousand words, I give you my setup in many, many pictures:
Basic profile/setup
I'm running Mac OS X (Mavericks 10.9.5)
For whatever it's worth, here's my /etc/hosts file:
My bash profile:
Hadoop file configurations
The setup for core-site.xml and hdfs-site.xml:
note: I have created folders in the locations you see above
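As a sketch (illustrative, not a verbatim copy of my file): a single-node core-site.xml from that tutorial boils down to one entry along these lines, with the port matching the localhost:8020 error further down.

<configuration>
  <!-- where HDFS clients connect; must match the port in the error below -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>
  </property>
</configuration>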
The setup for my yarn-site.xml:
Setup for my hadoop-env.sh file:
Side Note
Before I show the results of running start-dfs.sh, start-yarn.sh, and checking what's running with jps, keep in mind that I have a hadoop symlink pointing to hadoop-2.2.0.
Starting up Hadoop
Now, here are the results of starting the daemons up:
For those of you who don't have a microscope (it looks super small in the preview of this post), here's a code chunk of what's shown above:
mrp:~ mrp$ start-dfs.sh
2014-11-08 13:06:05.695 java[17730:1003] Unable to load realm info from SCDynamicStore
14/11/08 13:06:05 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [localhost]
localhost: starting namenode, logging to /usr/local/hadoop-2.2.0/logs/hadoop-mrp-namenode-mrp.local.out
localhost: starting datanode, logging to /usr/local/hadoop-2.2.0/logs/hadoop-mrp-datanode-mrp.local.out
localhost: 2014-11-08 13:06:10.954 java[17867:1403] Unable to load realm info from SCDynamicStore
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop-2.2.0/logs/hadoop-mrp-secondarynamenode-mrp.local.out
0.0.0.0: 2014-11-08 13:06:16.065 java[17953:1403] Unable to load realm info from SCDynamicStore
2014-11-08 13:06:20.982 java[17993:1003] Unable to load realm info from SCDynamicStore
14/11/08 13:06:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
mrp:~ mrp$ start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn-mrp-resourcemanager-mrp.local.out
2014-11-08 13:06:43.765 java[18053:20b] Unable to load realm info from SCDynamicStore
localhost: starting nodemanager, logging to /usr/local/hadoop-2.2.0/logs/yarn-mrp-nodemanager-mrp.local.out
Check to see what's running:
Time Out
OK. So far, I think, so good. At least this looks good based on all the other tutorials and posts. I think.
Before I try to do anything fancy, I just want to see if it's working properly, so I'll run a simple command like hadoop fs -ls.
Failure
When I run hadoop fs -ls, here's what I get:
Again, in case you can't see that pic, it says:
2014-11-08 13:23:45.772 java[18326:1003] Unable to load realm info from SCDynamicStore
14/11/08 13:23:45 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
ls: Call From mrp.local/127.0.0.1 to localhost:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
I've tried to run other commands, and I get the same basic error in the beginning of everything:
Call From mrp.local/127.0.0.1 to localhost:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
Now, I've gone to that website mentioned, but honestly, everything in that link means nothing to me. I don't get what I should do.
I would very much appreciate any assistance with this. You'll make me the happiest hadooper, ever.
...this should go without saying, but obviously I'd be happy to edit/update with more info if needed. Thanks!
Add these to .bashrc:
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
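Then reload the shell so the new variables take effect in the current session:

source ~/.bashrc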
I had a very similar problem and found this question while googling for a solution.
Here is how I resolved it (on Mac OS 10.10 with Hadoop 2.5.1); I'm not sure if the question is exactly the same problem. I checked the log files generated by the datanode (/usr/local/hadoop-2.2.0/logs/hadoop-mrp-datanode-mrp.local.out) and found the following entry:
2014-11-09 17:44:35,238 FATAL org.apache.hadoop.hdfs.server.namenode.NameNode:
Exception in namenode join org.apache.hadoop.hdfs.server.common.InconsistentFSStateException:
Directory /private/tmp/hadoop-kthul/dfs/name is in an inconsistent state: storage
directory does not exist or is not accessible.
Based on this, I concluded that something is wrong with the HDFS data on the datanode.
I deleted the directory with the HDFS data and reformatted HDFS:
rm -rf /private/tmp/hadoop-kthul
hdfs namenode -format
Now I am up and running again. Still wondering if /private/tmp is a good place to keep the HDFS data; I'm looking for options to change this.
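One option (a sketch I haven't verified on this setup; the path is a placeholder) is to point hadoop.tmp.dir at a permanent location in core-site.xml, since the default NameNode and DataNode directories are derived from it:

<configuration>
  <!-- base for the dfs/name and dfs/data dirs; defaults to a /tmp path -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/hadoop_data</value>
  </property>
</configuration>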
So I've got Hadoop up and running. I had two problems (I think).
When starting up the NameNode and DataNode, I received the following error: Unable to load realm info from SCDynamicStore.
To fix this, I added the following two lines to my hadoop-env.sh file:
HADOOP_OPTS="${HADOOP_OPTS} -Djava.security.krb5.realm= -Djava.security.krb5.kdc="
HADOOP_OPTS="${HADOOP_OPTS} -Djava.security.krb5.conf=/dev/null"
I found those two lines in the solution to this post, Hadoop on OSX "Unable to load realm info from SCDynamicStore". The answer was posted by Matthew L Daniel.
I had formatted the NameNode folder more than once, which apparently screws things up?
I can't verify that this screws things up, because I don't have any errors in any of my log files; however, once I followed Workaround 1 (deleting & recreating the NameNode/DataNode folders, then reformatting) from this post, No data nodes are started, I was able to load up the DataNode and get everything working.
Since the native library isn't supported on Mac, if you want to suppress this warning:
WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Add this to the log4j.properties in ${HADOOP_HOME}/libexec/etc/hadoop:
# Turn off native library warning
log4j.logger.org.apache.hadoop.util.NativeCodeLoader=ERROR

Starting tomcat Error code 4 : Failed

I'm creating multiple instances of Tomcat using an Opscode Chef cookbook. I see that tomcat.conf was not written into my instance of Tomcat; it exists only in the base instance. I created a soft link to the base instance's tomcat.conf file. When I try to start the server, I get the following error with no logs. There are no logs in /var/log or the Tomcat folder. Please provide hints on how to debug.
[root@centosclient2 ~]# service tomcat6-obi_sandbox_tomcat start
Starting tomcat6-obi_sandbox_tomcat: Error code 4 [FAILED]
I saw below in /var/log/tomcat6-obi_sandbox_tomcat-initd.log
-sh: /usr/sbin/tomcat6-obi_sandbox_tomcat: No such file or directory
Apparently there is no such file or directory.
I have run into error code 4 a few times, and the problem was that the disk was full.
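A quick way to check (standard commands, nothing Tomcat-specific):

# a full filesystem, or exhausted inodes, can stop the init script
# before it manages to write any logs
df -h
df -i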

UnableToCompleteException with no log

Trying to run my GWT app, I get
EntryPoint initialization exception
Exception while loading module ch.swisstph.mortqual.mqui.client.MqInput. See Development Mode for details.
com.google.gwt.core.ext.UnableToCompleteException: (see previous log entries)
at com.google.gwt.dev.shell.ModuleSpace.rebindAndCreate(ModuleSpace.java:513)
at com.google.gwt.dev.shell.ModuleSpace.onLoad(ModuleSpace.java:385)
at com.google.gwt.dev.shell.OophmSessionHandler.loadModule(OophmSessionHandler.java:200)
at com.google.gwt.dev.shell.BrowserChannelServer.processConnection(BrowserChannelServer.java:526)
at com.google.gwt.dev.shell.BrowserChannelServer.run(BrowserChannelServer.java:364)
at java.lang.Thread.run(Thread.java:722)
in the web browser, and just
/usr/lib/jvm/java-6-openjdk/bin/java -Xmx256m -Didea.launcher.port=7537 -Didea.launcher.bin.path=/home/dhardy/code/download/idea-IU-129.239/bin -Dfile.encoding=UTF-8 -classpath /home/install/gwt/gwt-dev.jar:/home/dhardy/p/mortqual/mqui/src:/usr/lib/jvm/java-6-openjdk/jre/lib/compilefontconfig.jar:/usr/lib/jvm/java-6-openjdk/jre/lib/jsse.jar:/usr/lib/jvm/java-6-openjdk/jre/lib/rhino.jar:/usr/lib/jvm/java-6-openjdk/jre/lib/jce.jar:/usr/lib/jvm/java-6-openjdk/jre/lib/management-agent.jar:/usr/lib/jvm/java-6-openjdk/jre/lib/rt.jar:/usr/lib/jvm/java-6-openjdk/jre/lib/javazic.jar:/usr/lib/jvm/java-6-openjdk/jre/lib/resources.jar:/usr/lib/jvm/java-6-openjdk/jre/lib/charsets.jar:/usr/lib/jvm/java-6-openjdk/jre/lib/ext/java-atk-wrapper.jar:/usr/lib/jvm/java-6-openjdk/jre/lib/ext/sunpkcs11.jar:/usr/lib/jvm/java-6-openjdk/jre/lib/ext/sunjce_provider.jar:/usr/lib/jvm/java-6-openjdk/jre/lib/ext/pulse-java.jar:/usr/lib/jvm/java-6-openjdk/jre/lib/ext/localedata.jar:/usr/lib/jvm/java-6-openjdk/jre/lib/ext/zipfs.jar:/usr/lib/jvm/java-6-openjdk/jre/lib/ext/dnsns.jar:/home/dhardy/p/mortqual/mqui/out/test/mqui:/home/dhardy/p/mortqual/mqui/out/production/mqui:/home/install/appengine-java-sdk/lib/shared/jsp-api.jar:/home/install/appengine-java-sdk/lib/shared/appengine-local-runtime-shared.jar:/home/install/appengine-java-sdk/lib/shared/el-api.jar:/home/install/appengine-java-sdk/lib/shared/servlet-api.jar:/home/install/appengine-java-sdk/lib/user/appengine-api-1.0-sdk-1.7.5.jar:/home/install/gwt/gwt-user.jar:/home/dhardy/p/mortqual/anacod/target/test-classes:/home/dhardy/p/mortqual/anacod/target/classes:/home/dhardy/.m2/repository/net/sourceforge/jexcelapi/jxl/2.6.12/jxl-2.6.12.jar:/home/dhardy/.m2/repository/log4j/log4j/1.2.14/log4j-1.2.14.jar:/home/dhardy/.m2/repository/com/beust/jcommander/1.30/jcommander-1.30.jar:/home/dhardy/.m2/repository/org/apache/poi/poi/3.9-20130311/poi-3.9-20130311.jar:/home/dhardy/.m2/repository/org/apache/poi/poi-ooxml/3.9/poi-ooxml-3.9.jar:/home/dhardy/.m2/repository/org/apache/poi/poi-ooxml-schemas/3.9/poi-ooxml-schemas-3.9.jar:/home/dhardy/.m2/repository/org/apache/xmlbeans/xmlbeans/2.3.0/xmlbeans-2.3.0.jar:/home/dhardy/.m2/repository/stax/stax-api/1.0.1/stax-api-1.0.1.jar:/home/dhardy/.m2/repository/dom4j/dom4j/1.6.1/dom4j-1.6.1.jar:/home/dhardy/.m2/repository/xml-apis/xml-apis/1.0.b2/xml-apis-1.0.b2.jar:/home/install/gwt/validation-api-1.0.0.GA-sources.jar:/home/install/gwt/validation-api-1.0.0.GA.jar:/home/dhardy/code/download/idea-IU-129.239/lib/idea_rt.jar com.intellij.rt.execution.application.AppMain com.google.gwt.dev.DevMode -war /home/dhardy/.IntelliJIdea12/system/gwt/mqui.5589a251/mqui.33ff3210/run/www -remoteUI 7901:IntelliJIdea -startupUrl mqInput.html ch.swisstph.mortqual.mqui.mqui
log4j:WARN No appenders could be found for logger (org.apache.jasper.compiler.JspRuntimeContext).
log4j:WARN Please initialize the log4j system properly.
Dev Mode initialized. Startup URL:
http://127.0.0.1:8888/mqInput.html?gwt.codesvr=127.0.0.1:9997
on the command line. There are no logs I can find (I tried configuring log4j via a .properties file, which removed its warnings but still didn't give me any logs).
So how do I solve this?
The two most likely causes are renaming of my start-up page and pushing some code out to a library.
Try putting -logLevel SPAM in your command-line arguments; this will print detailed logs. You can also put explicit GWT.log("message") calls in your entry point code, which will tell you how far it gets (normal logging doesn't work directly with GWT).
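A minimal sketch of that in the entry point named in the error (the body is invented for illustration):

import com.google.gwt.core.client.EntryPoint;
import com.google.gwt.core.client.GWT;

public class MqInput implements EntryPoint {
    @Override
    public void onModuleLoad() {
        // shows up in the Dev Mode window; add one after each risky step
        GWT.log("MqInput.onModuleLoad reached");
    }
}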
This page describes how to debug GWT in general; however, my suggestion would be to debug this directly in Eclipse with the GWT plugin. GWT support in Eclipse is amazing.
