Nutch on EMR problem reading from S3

Hi, I am trying to run Apache Nutch 1.2 on Amazon's EMR.
To do this I specify an input directory on S3. I get the following error:
Fetcher: java.lang.IllegalArgumentException:
This file system object (hdfs://ip-11-202-55-144.ec2.internal:9000)
does not support access to the request path
's3n://crawlResults2/segments/20110823155002/crawl_fetch'
You possibly called FileSystem.get(conf) when you should have called
FileSystem.get(uri, conf) to obtain a file system supporting your path.
I understand the difference between FileSystem.get(uri, conf) and FileSystem.get(conf). If I were writing this myself I would call FileSystem.get(uri, conf); however, I am trying to use existing Nutch code.
I asked this question, and someone told me that I needed to modify hadoop-site.xml to include the following properties: fs.default.name, fs.s3.awsAccessKeyId, fs.s3.awsSecretAccessKey. I updated these properties in core-site.xml (hadoop-site.xml does not exist), but that didn't make a difference. Does anyone have any other ideas?
Thanks for the help.

Try specifying the following in hadoop-site.xml:
<property>
<name>fs.default.name</name>
<value>org.apache.hadoop.fs.s3.S3FileSystem</value>
</property>
This tells Nutch that S3 should be used by default.
The properties fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey are needed only when your S3 objects require authentication (an S3 object can be readable by all users, or only by authenticated ones).
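The error in the question comes from a mismatch between the scheme of the file system object (hdfs) and the scheme of the requested path (s3n). The following is only a simplified model of that kind of check, written with java.net.URI so it runs without Hadoop; the class and method names are hypothetical and this is not Hadoop's actual code.

```java
import java.net.URI;

public class SchemeCheckSketch {
    // Simplified stand-in for the path check a file system object performs.
    // Assumption: a path is accepted when it has no scheme, or when its scheme
    // (and authority, if present) match those of the file system's own URI.
    static boolean supports(URI fileSystemUri, URI path) {
        String scheme = path.getScheme();
        if (scheme == null) {
            return true; // scheme-less paths resolve against the file system itself
        }
        if (!scheme.equalsIgnoreCase(fileSystemUri.getScheme())) {
            return false;
        }
        String authority = path.getAuthority();
        return authority == null || authority.equalsIgnoreCase(fileSystemUri.getAuthority());
    }

    public static void main(String[] args) {
        URI defaultFs = URI.create("hdfs://ip-11-202-55-144.ec2.internal:9000");
        URI segment = URI.create("s3n://crawlResults2/segments/20110823155002/crawl_fetch");

        // The default (HDFS) file system is asked for an s3n path: rejected.
        System.out.println(supports(defaultFs, segment)); // false

        // A file system obtained for the path's own URI accepts it.
        System.out.println(supports(URI.create("s3n://crawlResults2"), segment)); // true
    }
}
```

This is why FileSystem.get(uri, conf) avoids the error: the file system is chosen from the path's URI rather than from the configured default.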

Related

How to provide ignite-config xml for Driver Manager?

See this URL -> https://apacheignite.readme.io/docs/jdbc-driver#section-streaming-mode
It says that streaming mode can be used with a cfg connection, whose configuration has to be provided in an ignite-jdbc.xml file.
But what are the contents of that file? How do we configure it?

As that same page notes, it's "a complete Spring XML configuration." Have a look in the "examples" directory of the Ignite download for some samples, but the important thing is how to find the rest of the cluster.

Please note that the preferred option for streaming with JDBC is the thin client driver (which neither needs an XML file nor starts a client node) together with SET STREAMING ON.

How to set s3a configuration correctly in hadoop configuration?

I get strange errors such as "can't get AWS credentials" or "Unable to load credentials from ..."
Is there any way to set the s3a credentials explicitly in the Hadoop configuration?

As s3a is a relatively new implementation (it works correctly from Hadoop 2.7), you need to set two sets of properties in the Hadoop configuration:
conf.set("fs.s3a.access.key", access_key);
conf.set("fs.s3a.secret.key", secret_key);
conf.set("fs.s3a.awsAccessKeyId", access_key);
conf.set("fs.s3a.awsSecretAccessKey", secret_key);
(conf is the Hadoop configuration)
The reason is that the naming convention changed between versions, so to be on the safe side, set both.

Could not load cache configuration resource file://coherence-cache-config.xml

I am using WebLogic 12.1.2 with 1 admin server and 3 managed servers (in 1 cluster) on the same machine. I want to store some data in a distributed cache which must be available to all the managed servers inside the cluster.
So I am using the Oracle Coherence feature for this.
Whenever I start coherence.sh it always gives this error:
"Could not load cache configuration resource file://coherence-cache-config.xml".
I did some analysis and found that it always takes its configuration from the coherence.jar which comes with WebLogic. Even after changing PRE_CLASSPATH to my custom coherence.jar, it still points to the WebLogic jar. Because of this I am not able to override "coherence-cache-config.xml" and "tangosol-coherence-override.xml".
Can you please suggest something? How can I override the resources in WebLogic's default coherence.jar with my custom ones?

According to the Coherence documentation, by default Coherence uses the first coherence-cache-config.xml file found on the classpath. But in your case it tries to load it from the file://coherence-cache-config.xml location. This means that the location of this file is overridden somewhere (either in the tangosol-coherence-override.xml file or through the tangosol.coherence.cacheconfig system property).
What's more, file://coherence-cache-config.xml does not seem to be a valid file URI. When I try to do:
new File(new URI("file://coherence-cache-config.xml"))
it results in the exception
java.lang.IllegalArgumentException: URI has an authority component
So, make sure you properly set the coherence-cache-config.xml file location in the tangosol-coherence-override.xml file or through the tangosol.coherence.cacheconfig system property (the documentation explains in detail how to do it).
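The difference between the failing URI and a well-formed one can be reproduced with plain Java. In file://coherence-cache-config.xml the text after the two slashes is parsed as an authority (a host name), not as a path. The sketch below contrasts it with a three-slash form; the /opt/config directory is a made-up example path and the class name is hypothetical.

```java
import java.io.File;
import java.net.URI;

public class FileUriSketch {
    // Returns null when the URI converts to a File, otherwise the failure message.
    static String tryConvert(String uriText) {
        try {
            new File(URI.create(uriText));
            return null;
        } catch (IllegalArgumentException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Two slashes: "coherence-cache-config.xml" becomes the authority, the path is empty.
        System.out.println(URI.create("file://coherence-cache-config.xml").getAuthority());
        System.out.println(tryConvert("file://coherence-cache-config.xml"));

        // Three slashes: empty authority, absolute path (hypothetical location).
        System.out.println(tryConvert("file:///opt/config/coherence-cache-config.xml"));
    }
}
```

The first conversion fails with "URI has an authority component", while the three-slash form converts without an exception.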

Where does Jersey store the "application.wadl" file?

I am trying to implement code in our application which needs to monitor the existing registered resources in Jersey and then make decisions accordingly. I know that Jersey provides an application.wadl file which lists all its resources at runtime.
My question: is there a way to look at this file physically? Does Jersey create this file and store it somewhere in the app directory, or is it in memory?
Is it possible to call a Jersey API internally on the server to get the same info if this file is not available in that form?
We are using the application class or runtimeconfig. Jersey auto-discovers our resources marked with the @Path annotation, and we are running on WebLogic and Jersey 2.6.
Any help would be appreciated.
Thanks

No WADL file is created on disk. It is created dynamically upon URL request.
http://[host]:[port]/[context_root]/[resource]/application.wadl
E.g.:
http://localhost:8080/rest-jersey/rest/application.wadl
Also, I've found it very useful to inspect the WADL with the SoapUI tool, which can read your WADL from a URL and parse the resources into a tree format.

Custom configuration for JBoss applications?

I've built a simple alert monitor to display the health of various applications. This is configured by XML, as each monitor instance needs to show different metrics. An example configuration may be:
<machine>
<monitors>
<check type="connectivity" name="Production Server">
<property key="host" value="ops01.corp" />
<alarm />
</check>
</monitors>
</machine>
Currently, I'm storing this in the root of the C:\ drive of the server. What would be nice is if I could put it in the deploy directory of the JBoss server, and could somehow get a reference to it. Is this possible? I looked at MBeans but it didn't seem to support complex XML structures.
Robert

Try JOPR - http://www.jboss.org/jopr .
For custom metrics, you can write your own plug-in.

You can get an input stream for any file in the classpath by using the ClassLoader#getResourceAsStream(String name) method. Just pass the location of the file relative to the classpath.
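A minimal sketch of that approach, assuming the monitor XML is packaged on the application's classpath. The class name MonitorConfigLoader and the file name monitor-config.xml are hypothetical; only ClassLoader#getResourceAsStream and the standard JAXP parser are used.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

public class MonitorConfigLoader {
    // Parses an XML resource found on the classpath.
    // resourceName is relative to the classpath root, without a leading slash.
    static Document load(String resourceName)
            throws IOException, SAXException, ParserConfigurationException {
        ClassLoader loader = MonitorConfigLoader.class.getClassLoader();
        try (InputStream in = loader.getResourceAsStream(resourceName)) {
            if (in == null) {
                // getResourceAsStream returns null when the resource is not found.
                throw new FileNotFoundException("Not on classpath: " + resourceName);
            }
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
        }
    }

    public static void main(String[] args) throws Exception {
        Document doc = load("monitor-config.xml"); // hypothetical file name
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

Checking for null matters because a missing resource does not raise an exception by itself.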
