I am using Solr 5.4.1 and Apache Nutch 1.12. I am able to crawl data, but in the final step of indexing into Solr I get the following errors.
SOLRIndexWriter
solr.server.url : URL of the SOLR instance
solr.zookeeper.hosts : URL of the Zookeeper quorum
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : username for authentication
solr.auth.password : password for authentication
SolrIndexer: deleting 1/1 documents
SolrIndexer: deleting 1/1 documents
Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:228)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:237)
Make sure your index name is all lower case.
Check in your hbase-site.xml that hbase.rootdir, hbase.data, hbase.zookeeper-data and the log file paths are correct.
Finally, make sure to copy hbase-site.xml into your nutch/conf directory. If you are running Nutch 2.x, do this before compiling with ant, or re-compile (ant clean, ant runtime) once it is there.
I am trying to execute a Pentaho job and transformation using Spring Boot, and I have been able to execute both of them successfully. The problem arises when I try to execute a Pentaho job that has transformations linked within it, which I have connected using the ${Internal.Job.Filename.Directory} parameter. It works successfully in Pentaho PDI, but when I try to execute it using my Spring Boot code, I am faced with the following error:
2022/10/10 10:51:04 - data-fetch - Starting entry [Check S3 DB Connections]
2022-10-10T10:51:04.632+0530
(org.pentaho.di.job.Job) [http-nio-8085-exec-10] INFO - [src/main/resources/pentaho/data-fetch.kjb] Starting entry [Check S3 DB Connections]
2022/10/10 10:51:14 - data-fetch - Starting entry [S3-Transformation]
2022-10-10T10:51:14.828+0530
(org.pentaho.di.job.Job) [http-nio-8085-exec-10] INFO - [src/main/resources/pentaho/data-fetch.kjb] Starting entry [S3-Transformation]
2022/10/10 10:51:14 - S3-Transformation - ERROR (version 9.0.0.1-497, build 9.0.0.1-497 from 2020-03-19 08.25.00 by buildguy) : Unable to run job data-fetch. The S3-Transformation has an error. The transformation path ${Internal.Job.Filename.Directory}/S3-fetch.ktr is invalid, and will not run successfully.
2022/10/10 10:51:14 - S3-Transformation - ERROR (version 9.0.0.1-497, build 9.0.0.1-497 from 2020-03-19 08.25.00 by buildguy) : org.pentaho.di.core.exception.KettleXMLException:
2022/10/10 10:51:14 - S3-Transformation - The transformation path ${Internal.Job.Filename.Directory}/S3-fetch.ktr is invalid, and will not run successfully.
Is there a different parameter that I should be using?
Pentaho doesn't automatically populate these runtime variables, so we need to provide them explicitly during execution. I added the following line of code and the job executed successfully.
job.setVariable("Internal.Job.Filename.Directory", pentahoDir);
Where pentahoDir is a variable that points to the absolute path of the directory and needs to be set by the user.
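For completeness, here is a minimal standalone sketch of how the whole launch can look once that variable is set before the job starts. The .kjb path and the pentahoDir value below are placeholders based on the log above, not the asker's actual code:

import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.job.Job;
import org.pentaho.di.job.JobMeta;
import org.pentaho.di.repository.Repository;

public class DataFetchRunner {
    public static void main(String[] args) throws Exception {
        // Initialise the Kettle/PDI engine once per JVM.
        KettleEnvironment.init();

        // Placeholder paths; adjust to your project layout.
        String jobPath = "src/main/resources/pentaho/data-fetch.kjb";
        String pentahoDir = "/absolute/path/to/src/main/resources/pentaho";

        JobMeta jobMeta = new JobMeta(jobPath, (Repository) null);
        Job job = new Job(null, jobMeta);

        // The internal variable is not populated automatically in this scenario,
        // so set it explicitly before starting the job.
        job.setVariable("Internal.Job.Filename.Directory", pentahoDir);

        job.start();
        job.waitUntilFinished();

        if (job.getResult().getNrErrors() > 0) {
            throw new RuntimeException("data-fetch job finished with errors");
        }
    }
}

KettleEnvironment.init() only needs to run once per JVM; in a Spring Boot service it is commonly called from an initialization hook before any job is launched.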
I am trying to integrate Apache Nutch with Hadoop. After building the apache-nutch-1.15.job file (which resides in the runtime folder) with ant, I tried to run the bin/crawl script, but got some dependency errors.
I can see all the required dependencies embedded in the .job file when I extract it, and there are no issues with the versions of the embedded dependencies.
sh crawl -s <seed_file_directory_on_hdfs> <crawl_directory_on_hdfs> <num_rounds>
19/03/22 01:41:22 INFO mapreduce.Job: Running job: job_1547155431533_115992
19/03/22 01:41:34 INFO mapreduce.Job: Job job_1547155431533_115992 running in uber mode : false
19/03/22 01:41:34 INFO mapreduce.Job: map 0% reduce 0%
19/03/22 01:41:45 INFO mapreduce.Job: Task Id : attempt_1547155431533_115992_r_000001_0, Status : FAILED
Error: java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not found.
at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:146)
at org.apache.nutch.crawl.Generator$SelectorReducer.setup(Generator.java:378)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:168)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1924)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
I tried putting those extracted jars on the classpath by changing the script, but it didn't help and the issue persists.
The Apache Nutch documentation has not been updated since 2014.
To crawl a web page using Apache Nutch, build the project with ant and execute the commands described in the tutorial for crawling to the local file system (https://wiki.apache.org/nutch/NutchTutorial), replacing all local paths with HDFS paths if you want to crawl content and dump it onto HDFS.
I am trying out Stackify Prefix v3.0.18 to profile a Spring Boot application in WebLogic 12c. The JVM is started with the stackify-java-apm agent as per the instructions:
-javaagent:"C:\Program Files (x86)\StackifyPrefix\java\lib\stackify-java-apm.jar"
On accessing the Spring Boot Actuator's /health endpoint, I do not get anything reported in the Prefix dashboard at http://localhost:2012. Is anything amiss here?
A couple of observations were made; the Prefix agent was trying:
To load a properties file from a Linux/Unix path and failed to do so
16:16:24.826 [main] WARN com.stackify.apm.config.a - Unable to find properties file /usr/local/stackify/stackify-java-apm/stackify.properties
To write a file into a non-existent directory C:\Program Files (x86)\Stackify\stackify-java-apm\log\
I was unable to find an end-to-end demo or tutorial on setting up and using Prefix to profile a Java application.
I was looking on their support site and it seems that WebLogic 12c is not supported according to this link:
https://support.stackify.com/prefix-enable-java-profiling/
Have you tried submitting a ticket with them?
https://support.stackify.com/submit-a-ticket/
I'm using VFS2 to fetch and import files into folders over the SFTP protocol.
But I'm getting an error. My code is shown in the picture below:
In the log, I'm seeing this error for every file:
The error sequence is:
1) cannot delete file
2) Could not determine if file
3) Caused by: com.jcraft.jsch.JSchException: Could not get the groups id of the current user (error code: -1)
Properties folder:
Could it depend on the owner/groups?
This is a known issue - see https://issues.apache.org/jira/browse/VFS-617 (also applies to isReadable).
The root cause of the issue is one of two things - either the SFTP server doesn't allow commands to be executed ("exec") by the client; or the SFTP server is missing the "id" command (e.g. it's a Windows server).
A git pull request has been raised here: https://github.com/apache/commons-vfs/pull/27 but it requires unit tests before it will be included in VFS.
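If upgrading the library is an option: if I remember correctly, later commons-vfs2 releases (around 2.5.0) added a builder option to skip the exec-channel/group-id detection entirely. Below is a minimal sketch assuming that option is available in your version; the host, credentials and remote path are placeholders, not values from the question.

import org.apache.commons.vfs2.FileObject;
import org.apache.commons.vfs2.FileSystemManager;
import org.apache.commons.vfs2.FileSystemOptions;
import org.apache.commons.vfs2.VFS;
import org.apache.commons.vfs2.provider.sftp.SftpFileSystemConfigBuilder;

public class SftpWithoutExecDetection {
    public static void main(String[] args) throws Exception {
        FileSystemManager manager = VFS.getManager();

        FileSystemOptions opts = new FileSystemOptions();
        SftpFileSystemConfigBuilder builder = SftpFileSystemConfigBuilder.getInstance();
        builder.setUserDirIsRoot(opts, true);
        builder.setStrictHostKeyChecking(opts, "no");
        // Skip the "exec" channel JSch uses to run `id`, so the
        // "Could not get the groups id of the current user" check never runs.
        builder.setDisableDetectExecChannel(opts, true);

        // Placeholder URI; replace host, credentials and path with your own.
        FileObject remote = manager.resolveFile(
                "sftp://user:secret@sftp.example.com/inbox/data.csv", opts);
        System.out.println("readable = " + remote.isReadable());
        System.out.println("deleted  = " + remote.delete());
    }
}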
I am having some trouble running Nutch on a Linux server. I am trying to crawl URLs configured in seed.txt, but I am seeing the following errors. The crawler is triggered as follows
nohup java -classpath "./common-conf/*:*:./plugins/*:" -jar crawler-jar-2.0-SNAPSHOT.jar &
In this configuration, all the configuration properties are present in the common-conf directory. We have some custom configuration set up in our crawler binary; as a result, we have built a custom binary and don't use the standard Apache Nutch crawler. I see the following issues:
Our custom nutch-default.xml and nutch-site.xml are not picked up from the common-conf classpath directory. They are being picked up from the Nutch jar file instead. When I print out the URL path for both XMLs, I see something like this:
nutch default = jar:file:/home/nbsxlwa/crawler/lib/nutch-2.2.1.jar!/nutch-default.xml
nutch site = jar:file:/home/nbsxlwa/crawler/lib/nutch-2.2.1.jar!/nutch-site.xml
I want the files to be picked up from classpath. I can verify that the files exist.
Our custom gora.properties is not being picked up. I see the following log trace
14/08/22 07:18:24 WARN store.DataStoreFactory: gora.properties not found, properties will be empty.
14/08/22 07:18:24 INFO crawl.InjectorJob: InjectorJob: Using class org.apache.gora.memory.store.MemStore as the Gora storage class.
gora.properties exists in the classpath and I am not sure why it is not being picked up.
/home/nbsxlwa/crawler/ find . -name "gora.properties"
./common-conf/gora.properties
The http.agent.name configuration property is not being picked up. I can confirm that the property exists in nutch-site.xml.
The stack trace is given below.
14/08/22 07:18:36 ERROR fetcher.FetcherJob: Fetcher: No agents listed in 'http.agent.name' property.
14/08/22 07:18:36 WARN crawl.Crawler: Error running crawler job for configuration. Tool run command raises an exception
java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property.
at org.apache.nutch.fetcher.FetcherJob.checkConfiguration(FetcherJob.java:252)
at org.apache.nutch.fetcher.FetcherJob.run(FetcherJob.java:160)
at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:78)
at org.apache.nutch.crawl.Crawler.run(Crawler.java:176)
at org.apache.nutch.crawl.Crawler.run(Crawler.java:266)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawler.main(Crawler.java:356)
regex-normalize.xml and regex-urlfilter.txt are not being picked up from the classpath. I can confirm that the files exist in my classpath. The stack trace is given below
/home/nbsxlwa/crawler : find . -name "regex-normalize.xml"
./common-conf/regex-normalize.xml
/home/nbsxlwa/crawler : find . -name "regex-urlfilter.txt"
./common-conf/regex-urlfilter.txt
14/08/22 07:18:29 INFO conf.Configuration: regex-normalize.xml not found
14/08/22 07:18:29 WARN regex.RegexURLNormalizer: Can't load the default rules!
14/08/22 07:18:29 INFO conf.Configuration: regex-urlfilter.txt not found
14/08/22 07:18:29 INFO conf.Configuration: regex-normalize.xml not found
14/08/22 07:18:29 WARN regex.RegexURLNormalizer: Can't load the default rules!
I have gone through the following links to see where I am going wrong. How do I set up Nutch configuration settings here?
http://mail-archives.apache.org/mod_mbox/nutch-user/201202.mbox/%3CCAGaRif3rtJHokgG5FHSbnJLLUAVGiDnfx7JaW-7kiBjx_ivwSg#mail.gmail.com%3E and
http://osdir.com/ml/user.nutch.apache/2012-02/msg00127.html
I know this is old, but thought it may help someone in the future:
Have you tried to run:
ant runtime
Run it from the Nutch folder after you have changed the config values.
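If it is still unclear which copy of nutch-site.xml wins at runtime (the one in common-conf or the one packaged inside the jar, as in the question above), a small diagnostic run with the exact same classpath as the crawler can show it. This is only a sketch; it assumes the standard org.apache.nutch.util.NutchConfiguration helper is on the classpath:

import java.net.URL;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;

public class WhichConfWins {
    public static void main(String[] args) {
        // NutchConfiguration.create() registers nutch-default.xml and nutch-site.xml
        // as default resources; the first copy found on the classpath is used.
        Configuration conf = NutchConfiguration.create();

        URL site = conf.getResource("nutch-site.xml");
        System.out.println("nutch-site.xml loaded from: " + site);
        System.out.println("http.agent.name = " + conf.get("http.agent.name"));
    }
}

Two things are worth double-checking in the original command line: starting the crawler with java -jar makes the JVM ignore the -classpath option entirely (only the jar's manifest Class-Path is used), and a wildcard such as ./common-conf/* only matches jar files, so loose .xml and .properties files are only visible if the directory itself (./common-conf) is on the effective classpath.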