Dependency issue in Apache Nutch

I am trying to integrate Apache Nutch with Hadoop. I built the apache-nutch-1.15.job file (which resides in the runtime folder) using ant and then tried to run the bin/crawl script, but got dependency errors.
When I extract the .job file, I can see that all the required dependencies are embedded in it, and there are no version conflicts among the embedded dependencies.
sh crawl -s <seed_file_directory_on_hdfs> <crawl_directory_on_hdfs> <num_rounds>
19/03/22 01:41:22 INFO mapreduce.Job: Running job: job_1547155431533_115992
19/03/22 01:41:34 INFO mapreduce.Job: Job job_1547155431533_115992 running in uber mode : false
19/03/22 01:41:34 INFO mapreduce.Job: map 0% reduce 0%
19/03/22 01:41:45 INFO mapreduce.Job: Task Id : attempt_1547155431533_115992_r_000001_0, Status : FAILED
Error: java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not found.
at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:146)
at org.apache.nutch.crawl.Generator$SelectorReducer.setup(Generator.java:378)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:168)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1924)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
I tried adding the extracted jars to the classpath by changing the script, but that did not help and the issue persists.
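Since a .job file is an ordinary zip archive, one way to double-check what is embedded is to search it for the plugin descriptor that declares the missing extension point. The following is a minimal Python sketch, not part of Nutch: the function name find_extension_point is hypothetical, and it assumes that plugin descriptors inside the archive are files named plugin.xml.

```python
import zipfile


def find_extension_point(job_path, point_id):
    """Return the plugin.xml entries in the archive whose text mentions point_id.

    Assumption: plugin descriptors are stored as files named plugin.xml
    somewhere inside the .job archive.
    """
    hits = []
    with zipfile.ZipFile(job_path) as archive:
        for name in archive.namelist():
            if name.endswith("plugin.xml"):
                text = archive.read(name).decode("utf-8", errors="replace")
                if point_id in text:
                    hits.append(name)
    return sorted(hits)


# Example call (the path is a placeholder for your own build output):
# find_extension_point("apache-nutch-1.15.job",
#                      "org.apache.nutch.net.URLNormalizer")
```

An empty result would suggest that the descriptor declaring the extension point never made it into the archive. A non-empty result would point away from packaging and towards how the plugins are located at run time.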

The Apache Nutch documentation has not been updated since 2014.
To crawl a web page using Apache Nutch, build the project using ant and run the commands given in the tutorial for crawling to the local file system (https://wiki.apache.org/nutch/NutchTutorial), replacing all local paths with HDFS paths if you want to crawl content and dump it onto HDFS.

Related

Apache Livy : Could not find or load main class org.apache.livy.server.LivyServer

I am trying to start the Apache Livy 0.8.0 server on my Windows 10 machine for Spark 3.1.2 and Hadoop 3.2.1. I built Apache Livy successfully using Maven, but I am not able to run the Livy server. When I run it, I get the following error:
> starting C:/AmazonJDK/jdk1.8.0_332/bin/java -cp /d/ApacheLivy/incubator-livy-master/incubator-livy-master/server/target/jars/*:/d/ApacheLivy/incubator-livy-master/incubator-livy-master/conf:D:/Program_files/spark/conf:D:/ApacheHadoop/hadoop-3.2.1/etc/hadoop: org.apache.livy.server.LivyServer, logging to D:/ApacheLivy/incubator-livy-master/incubator-livy-master/logs/livy--server.out
ps: unknown option -- o
Try `ps --help' for more information.
failed to launch C:/AmazonJDK/jdk1.8.0_332/bin/java -cp /d/ApacheLivy/incubator-livy-master/incubator-livy-master/server/target/jars/*:/d/ApacheLivy/incubator-livy-master/incubator-livy-master/conf:D:/Program_files/spark/conf:D:/ApacheHadoop/hadoop-3.2.1/etc/hadoop: org.apache.livy.server.LivyServer:
Error: Could not find or load main class org.apache.livy.server.LivyServer
full log in D:/ApacheLivy/incubator-livy-master/incubator-livy-master/logs/livy--server.out
I am using Git Bash. I can provide more information if needed.

The error got resolved when I used Windows Subsystem for Linux (WSL).

Indexer: java.io.IOException: Job failed

I am using Solr 5.4.1 and Apache Nutch 1.12. I am able to crawl data, but in the final step, indexing into Solr, I get the following errors.
SOLRIndexWriter
solr.server.url : URL of the SOLR instance
solr.zookeeper.hosts : URL of the Zookeeper quorum
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : username for authentication
solr.auth.password : password for authentication
SolrIndexer: deleting 1/1 documents
SolrIndexer: deleting 1/1 documents
Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:228)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:237)

Make sure your index name is all lower case.
Check in your hbase-site.xml that hbase.rootdir, hbase.data, hbase.zookeeper-data and the log file paths are correct.
Finally, make sure to copy hbase-site.xml into your nutch/conf directory. If you are running Nutch 2.x, do this before compiling with ant, or recompile (ant clean, ant runtime) once the file is there.

HDFS write resulting in "CreateSymbolicLink error (1314): A required privilege is not held by the client."

I tried to run a sample MapReduce program from Apache Hadoop and got the exception below while the MapReduce job was running. I tried hdfs dfs -chmod 777 / but that didn't fix the issue.
15/03/10 13:13:10 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
15/03/10 13:13:10 WARN mapreduce.JobSubmitter: No job jar file set. User classes may not be found. See Job or Job#setJar(String).
15/03/10 13:13:10 INFO input.FileInputFormat: Total input paths to process : 2
15/03/10 13:13:11 INFO mapreduce.JobSubmitter: number of splits:2
15/03/10 13:13:11 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1425973278169_0001
15/03/10 13:13:12 INFO mapred.YARNRunner: Job jar is not present. Not adding any jar to the list of resources.
15/03/10 13:13:12 INFO impl.YarnClientImpl: Submitted application application_1425973278169_0001
15/03/10 13:13:12 INFO mapreduce.Job: The url to track the job: http://B2ML10803:8088/proxy/application_1425973278169_0001/
15/03/10 13:13:12 INFO mapreduce.Job: Running job: job_1425973278169_0001
15/03/10 13:13:18 INFO mapreduce.Job: Job job_1425973278169_0001 running in uber mode : false
15/03/10 13:13:18 INFO mapreduce.Job: map 0% reduce 0%
15/03/10 13:13:18 INFO mapreduce.Job: Job job_1425973278169_0001 failed with state FAILED due to: Application application_1425973278169_0001 failed 2 times due to AM Container for appattempt_1425973278169_0001_000002 exited with exitCode: 1
For more detailed output, check application tracking page:http://B2ML10803:8088/proxy/application_1425973278169_0001/Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1425973278169_0001_02_000001
Exit code: 1
Exception message: CreateSymbolicLink error (1314): A required privilege is not held by the client.
Stack trace:
ExitCodeException exitCode=1: CreateSymbolicLink error (1314): A required privilege is not held by the client.
at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Shell output:
1 file(s) moved.
Container exited with a non-zero exit code 1
Failing this attempt. Failing the application.
15/03/10 13:13:18 INFO mapreduce.Job: Counters: 0

Windows 8.1 + Hadoop 2.7.0 (built from source):
Run Command Prompt in admin mode.
Execute etc\hadoop\hadoop-env.cmd
Run sbin\start-dfs.cmd
Run sbin\start-yarn.cmd
Now try to run your job.

I recently ran into exactly the same problem. I tried reformatting the NameNode, but it did not work, and I doubt it could solve the problem permanently. Following the pointer from @aoetalks, I solved it on Windows Server 2012 R2 through Local Group Policy.
In short, try the following steps:
Open Local Group Policy (press Win+R to open "Run...", then type gpedit.msc).
Expand "Computer Configuration" - "Windows Settings" - "Security Settings" - "Local Policies" - "User Rights Assignment".
Find "Create symbolic links" on the right and check whether your user is included. If not, add your user to it.
The change takes effect at the next login, so log out and log back in.
If this still doesn't work, it may be because you are using an Administrator account. In that case you will have to disable "User Account Control: Run all administrators in Admin Approval Mode", which is listed under "Local Policies" - "Security Options" in the same Group Policy tree, and then restart the computer for the change to take effect.
Reference: https://superuser.com/questions/104845/permission-to-make-symbolic-links-in-windows-7
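To confirm whether the policy change took effect, you can simply try to create a symbolic link as the account in question. A minimal Python sketch (the function name can_create_symlink is made up for illustration; it only tests the account that runs the script, so run it as the same user that starts the YARN NodeManager):

```python
import os
import tempfile


def can_create_symlink():
    """Try to create a symbolic link in a temp directory; True if it worked."""
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "target.txt")
        link = os.path.join(tmp, "link.txt")
        with open(target, "w") as handle:
            handle.write("probe")
        try:
            os.symlink(target, link)
        except (OSError, NotImplementedError):
            # On Windows, a missing "Create symbolic links" privilege
            # surfaces here as an OSError (WinError 1314).
            return False
        return os.path.islink(link)


print("symlink creation allowed:", can_create_symlink())
```

If this reports False from a normal prompt but True from an elevated one, the account has the privilege only when elevated, which matches the UAC behaviour described in this answer.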

I encountered the same problem. We solved it by checking the Java environment:
Check the java and javac versions.
Ensure that every computer in the cluster has the same Java environment.

I don't know the cause of the error, but reformatting the NameNode helped me solve it on Windows 8.
Delete all old logs: clean the folders C:\hadoop\logs and C:\hadoop\logs\userlogs.
Clean the folders C:\hadoop\data\dfs\datanode and C:\hadoop\data\dfs\namenode. Then reformat the NameNode by running this command in administrator mode: c:\hadoop\bin>hdfs namenode -format

See this for a solution and this for an explanation. Basically, symbolic links can be a security risk and the design of UAC prevents users (even users who are part of the Administrators group) from creating symlinks unless they are running in elevated mode.
Long story short, try reformatting your name node and starting Hadoop and all Hadoop jobs from an elevated command prompt.

On Windows, change the configuration in hdfs-site.xml as follows:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///C:/hadoop-2.7.2/data/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///C:/hadoop-2.7.2/data/datanode</value>
  </property>
</configuration>
Open cmd in admin mode and run these commands:
stop-all.cmd
hdfs namenode -format
start-all.cmd
Then run the final jar in admin mode:
hadoop jar C:\Hadoop_Demo\wordCount\target\wordCount-0.0.1-SNAPSHOT.jar file:///C:/Hadoop/input.txt file:///C:/Hadoop/output

I solved the same problem by choosing "Run as administrator" when starting Command Prompt.

Errors when configuring Apache Nutch crawler

I am having some trouble running Nutch on a Linux server. I am trying to crawl URLs configured in seed.txt, but I am seeing the following errors. The crawler is triggered as follows:
nohup java -classpath "./common-conf/*:*:./plugins/*:" -jar crawler-jar-2.0-SNAPSHOT.jar &
In this setup, all the configuration properties are present in the common-conf directory. We have some custom configuration set up in our crawler binary, so we build a custom binary and don't use the standard Apache Nutch crawler. I see the following issues:
Our custom nutch-default.xml and nutch-site.xml are not picked up from the common-conf classpath directory. They are being picked up from the Nutch jar file instead. When I print out the URL path for both XML files, I see something like this:
nutch default = jar:file:/home/nbsxlwa/crawler/lib/nutch-2.2.1.jar!/nutch-default.xml
nutch site = jar:file:/home/nbsxlwa/crawler/lib/nutch-2.2.1.jar!/nutch-site.xml
I want the files to be picked up from the classpath. I can verify that the files exist.
Our custom gora.properties is not being picked up. I see the following log trace
14/08/22 07:18:24 WARN store.DataStoreFactory: gora.properties not found, properties will be empty.
14/08/22 07:18:24 INFO crawl.InjectorJob: InjectorJob: Using class org.apache.gora.memory.store.MemStore as the Gora storage class.
gora.properties exists in the classpath and I am not sure why it is not being picked up.
/home/nbsxlwa/crawler/ find . -name "gora.properties"
./common-conf/gora.properties
The http.agent.name configuration property is not being picked up, although I can confirm that it is set in nutch-site.xml.
The stack trace is given below.
14/08/22 07:18:36 ERROR fetcher.FetcherJob: Fetcher: No agents listed in 'http.agent.name' property.
14/08/22 07:18:36 WARN crawl.Crawler: Error running crawler job for configuration. Tool run command raises an exception
java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property.
at org.apache.nutch.fetcher.FetcherJob.checkConfiguration(FetcherJob.java:252)
at org.apache.nutch.fetcher.FetcherJob.run(FetcherJob.java:160)
at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:78)
at org.apache.nutch.crawl.Crawler.run(Crawler.java:176)
at org.apache.nutch.crawl.Crawler.run(Crawler.java:266)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawler.main(Crawler.java:356)
regex-normalize.xml and regex-urlfilter.txt are not being picked up from the classpath. I can confirm that the files exist in my classpath. The stack trace is given below
/home/nbsxlwa/crawler : find . -name "regex-normalize.xml"
./common-conf/regex-normalize.xml
/home/nbsxlwa/crawler : find . -name "regex-urlfilter.txt"
./common-conf/regex-urlfilter.txt
14/08/22 07:18:29 INFO conf.Configuration: regex-normalize.xml not found
14/08/22 07:18:29 WARN regex.RegexURLNormalizer: Can't load the default rules!
14/08/22 07:18:29 INFO conf.Configuration: regex-urlfilter.txt not found
14/08/22 07:18:29 INFO conf.Configuration: regex-normalize.xml not found
14/08/22 07:18:29 WARN regex.RegexURLNormalizer: Can't load the default rules!
I have gone through the following links to see where I am going wrong. How do I set up Nutch configuration settings here?
http://mail-archives.apache.org/mod_mbox/nutch-user/201202.mbox/%3CCAGaRif3rtJHokgG5FHSbnJLLUAVGiDnfx7JaW-7kiBjx_ivwSg#mail.gmail.com%3E and
http://osdir.com/ml/user.nutch.apache/2012-02/msg00127.html

I know this is old, but I thought it might help someone in the future.
Have you tried running:
ant runtime
Run it from the Nutch folder after you change the config values.
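One more thing that may be worth checking for the command in the question: when the java launcher is given -jar, the JAR file is the source of all user classes and the -classpath setting is ignored, which would be consistent with the configuration files being read from inside the jar instead of common-conf. To see which location would win for a given search order, here is a minimal Python sketch. The function name first_provider is hypothetical, and the lookup is a simplification that ignores wildcard expansion and manifest Class-Path entries.

```python
import os
import zipfile


def first_provider(classpath_entries, resource):
    """Return the first entry (directory or jar) that contains resource.

    Simplified first-match lookup over an ordered list of entries.
    Returns None when no entry provides the resource.
    """
    for entry in classpath_entries:
        if os.path.isdir(entry):
            if os.path.isfile(os.path.join(entry, resource)):
                return entry
        elif zipfile.is_zipfile(entry):
            with zipfile.ZipFile(entry) as jar:
                if resource in jar.namelist():
                    return entry
    return None


# Example call, using the paths quoted in the question:
# first_provider(["./common-conf", "lib/nutch-2.2.1.jar"], "nutch-site.xml")
```

If the directory is listed first and still loses at run time, the search order you intended is probably not the one the JVM is actually using.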

Apache Twill HelloWorld application fails - jar not found

Has anyone run into problems running the HelloWorld Twill example? My Application gets accepted but then transitions to the "FAILED" state.
Yarn application HelloWorldRunnable application_1406337868863_0013 completed with status FAILED
The YARN Web UI shows this as the error:
Application application_1406337868863_0013 failed 2 times due to AM Container for appattempt_1406337868863_0013_000002 exited with exitCode: -1000 due to: File file:/twill/HelloWorldRunnable/2ba08d9f-ca23-4363-a7be-426b93c88de2/appMaster.775a1137-6134-46e2-b270-fc466ce7fe91.jar does not exist
.Failing this attempt.. Failing the application.
Does YARN expect to find this jar on HDFS at the location above? It seems like the jar gets copied to my local FS at the location specified above but not to HDFS.

It looks like you don't have the Hadoop conf directory (e.g. /etc/hadoop/conf) in the classpath, so the local file system (file:/twill/...) is used instead of HDFS.
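A quick way to check this diagnosis is to look at what fs.defaultFS resolves to for a given conf directory. Hadoop's built-in default is file:///, and that default applies whenever no core-site.xml on the classpath overrides it. A minimal Python sketch, where the function name default_fs is hypothetical and only the current key fs.defaultFS is checked, not the deprecated fs.default.name:

```python
import os
import xml.etree.ElementTree as ET


def default_fs(conf_dir):
    """Return fs.defaultFS from conf_dir/core-site.xml, or 'file:///' if unset."""
    path = os.path.join(conf_dir, "core-site.xml")
    if not os.path.isfile(path):
        # No site file found: Hadoop's built-in default applies.
        return "file:///"
    for prop in ET.parse(path).getroot().iter("property"):
        if prop.findtext("name", "").strip() == "fs.defaultFS":
            return prop.findtext("value", "").strip()
    return "file:///"


# Example call, using the directory mentioned in the answer:
# default_fs("/etc/hadoop/conf")
```

If this returns an hdfs:// URI for the conf directory but the application still writes to file:/twill/..., the directory is most likely missing from the classpath of the process that launches Twill.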
