How to copy files from Hadoop cluster to local file system - java

Setup:
I have a map-reduce job. In the mapper class (which obviously runs on the cluster), I have code something like this:
try {
    // ...
} catch (<some exception>) {
    // Do some stuff
}
What I want to change:
In the catch block, I want to copy the logs from the cluster to the local file system.
Problem:
I can see the log file in the directory on the node if I check from the command line. But when I try to copy it using org.apache.hadoop.fs.FileSystem.copyToLocalFile(boolean delSrc, Path src, Path dst), it says the file does not exist.
Can anyone tell me what I am doing wrong? I am very new to Hadoop, so maybe I am missing something obvious. Please ask me any clarifying questions, if needed, as I am not sure I have given all the necessary information.
Thanks
EDIT 1: Since I am trying to copy files from the cluster to local and the Java code is also running on the cluster, can I even use copyToLocalFile()? Or do I need to do a simple scp?

The MapReduce log files are usually located on the data node's local file system under HADOOP_LOG_DIR/userlogs/mapOrReduceTask, on whichever node the map/reduce task runs. Each MapReduce task generates syslog/stdout/stderr files in that directory.
It would be easier to use the TaskTracker's web UI to view the local log files, or you can ssh to the machine and look at the logs in the directory mentioned above.
By default, the TaskTracker web UI URL is http://machineName:50060/
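If you do want to copy the log from inside the task, note that these logs live on the node's local disk, not in HDFS, so an HDFS-backed FileSystem will report that the file does not exist. A rough sketch using the local file system instead (the log path below is hypothetical; adjust it to your HADOOP_LOG_DIR layout):
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyLocalLog {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // FileSystem.getLocal() resolves paths against the node's local disk (file://),
        // unlike FileSystem.get(conf), which typically resolves against hdfs://
        FileSystem localFs = FileSystem.getLocal(conf);
        Path src = new Path("/var/log/hadoop/userlogs/attempt_XXXX/syslog"); // hypothetical path
        Path dst = new Path("/tmp/copied-syslog");
        localFs.copyToLocalFile(false, src, dst);
    }
}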

How to set HADOOP_CLASSPATH for using the local filesystem with a local job runner?

How to set the input and output path from local directories?
A ClassNotFoundException arises for the mapper and reducer classes when I try to run with the following command:
hadoop WordCount input/sample.txt output
The current output of hadoop classpath is:
/usr/local/hadoop/hadoop-3.2.1/etc/hadoop:/usr/local/hadoop/hadoop-3.2.1/share/hadoop/common/lib/*:/usr/local/hadoop/hadoop-3.2.1/share/hadoop/common/*:/usr/local/hadoop/hadoop-3.2.1/share/hadoop/hdfs:/usr/local/hadoop/hadoop-3.2.1/share/hadoop/hdfs/lib/*:/usr/local/hadoop/hadoop-3.2.1/share/hadoop/hdfs/*:/usr/local/hadoop/hadoop-3.2.1/share/hadoop/mapreduce/lib/*:/usr/local/hadoop/hadoop-3.2.1/share/hadoop/mapreduce/*:/usr/local/hadoop/hadoop-3.2.1/share/hadoop/yarn:/usr/local/hadoop/hadoop-3.2.1/share/hadoop/yarn/lib/*:/usr/local/hadoop/hadoop-3.2.1/share/hadoop/yarn/*:/usr/local/hadoop/hadoop-3.2.1/etc/hadoop/usr/local/hadoop/hadoop-3.2.1/share/hadoop/common/*.jar
versions:
Apache Hadoop 3.2.1,
OpenJDK 11.0.5.
Please help; any pointers would be useful for debugging. Thanks in advance.
I could believe it if you had a FileNotFoundException, but your classpath seems fine, so I have a hard time seeing how you would get a ClassNotFoundException.
That said, this path seems wrong: /usr/local/hadoop/hadoop-3.2.1/etc/hadoop/usr/local/hadoop/hadoop-3.2.1/share/hadoop/common/*.jar.
I would suggest moving all files under hadoop-3.2.1 up into /usr/local/hadoop, or at the very least renaming the hadoop-3.2.1 directory to just /usr/local/hadoop/3.2.1/
By default, Hadoop jobs use file:// paths as your fs.defaultFS (defined in core-site.xml)
Otherwise, if you have changed that to use hdfs://, then you can still use local files like so
hadoop fs -ls file://
To run jobs, I would suggest using yarn jar rather than hadoop <name>. You also need to shade your Java application into an uber-jar, or use the existing hadoop-examples JAR to run WordCount.
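For the local-filesystem/local-runner part of the question, a rough driver sketch (the class name is made up; the input/output paths are the ones from the question) that forces the local runner and file:// for a single run might look like:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LocalWordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Override core-site.xml / mapred-site.xml for this run only
        conf.set("fs.defaultFS", "file:///");            // resolve paths on the local filesystem
        conf.set("mapreduce.framework.name", "local");   // use the local job runner

        Job job = Job.getInstance(conf, "wordcount-local");
        job.setJarByClass(LocalWordCountDriver.class);
        // job.setMapperClass(...); job.setReducerClass(...); etc.

        FileInputFormat.addInputPath(job, new Path("input/sample.txt"));
        FileOutputFormat.setOutputPath(job, new Path("output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}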

Create ZNodes without cmd in Zookeeper

I am trying to implement configuration management through ZooKeeper. I have created a few ZNodes from the command line as:
create /config ""
create /config/mypocapp ""
create /config/mypocapp/name "John Doe"
Here, name is one of the properties that I want to access in my app called mypocapp.
Since we will have a lot of properties in our application, we can't use the command line to create each and every property like this.
Is there a way to create the properties in ZooKeeper through some UI, or directly in a file (and import it into ZooKeeper)?
I am completely new to ZooKeeper and am not finding any help in this direction. Please help.
Exhibitor is one option for inserting, modifying, or deleting properties in ZNodes.
One can follow the steps given below:
Download the Exhibitor UI pom file from GitHub.
Build it with Maven, which will generate a jar file.
Run the jar file as: java -jar <jar-file-name>.jar -c file
Go to your browser and type in localhost:8080 to access the Exhibitor UI.
Here, you can configure your ZooKeeper ensemble and edit the properties.
Please note that each instance of ZooKeeper will have a corresponding Exhibitor UI.
To run Exhibitor on a different port, you can run:
java -jar <jar-file-name>.jar -c file --port <port-of-your-choice>
There are now also VS Code extensions that allow viewing and editing the Zookeeper node hierarchy and data, like this one:
https://marketplace.visualstudio.com/items?itemName=gaoliang.visual-zookeeper
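If no UI is needed at all, the properties can also be kept in a plain file and imported programmatically with the ZooKeeper Java client. A rough sketch, assuming an ensemble at localhost:2181 and a config.properties file whose keys become children of /config/mypocapp (both names are illustrative):
import java.io.FileInputStream;
import java.util.Properties;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ConfigImporter {
    public static void main(String[] args) throws Exception {
        // In real code, wait for the connection event before issuing requests
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });

        Properties props = new Properties();
        props.load(new FileInputStream("config.properties"));

        createIfMissing(zk, "/config", "");
        createIfMissing(zk, "/config/mypocapp", "");
        for (String key : props.stringPropertyNames()) {
            createIfMissing(zk, "/config/mypocapp/" + key, props.getProperty(key));
        }
        zk.close();
    }

    private static void createIfMissing(ZooKeeper zk, String path, String value) throws Exception {
        if (zk.exists(path, false) == null) {
            zk.create(path, value.getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
    }
}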

copy the contents of a local directory to a directory in hdfs

I have got a requirement that I should be able to copy the contents of a directory from local system to a directory on HDFS.
The condition is that only the directory contents should be copied to the location I have specified, not the source directory itself. Using the copyFromLocal command I can achieve this, but I need to use Java. There is the method copyFromLocalFile, which should be used for copying from the local file system; the problem is that it copies the directory itself. I also tried the FileUtils.copy method, which gives the same result as copyFromLocalFile.
As a test I tried to copy the contents of a test directory to another directory, both on the local file system, using FileUtils.copyDirectory. This works, but I cannot use it for HDFS. I have seen many links related to this same question but could not find a way.
Could you please let me know whether this is possible or whether it is some design flaw? If it is possible, how can I proceed?
Yes, it's really hard to get things done with the FileSystem API exactly the way you want. Insufficient documentation makes things even harder (too many methods, little explanation). I faced the same problem a year ago.
The only solution I found is iterative:
import java.io.File;
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Copies every file directly under srcDirPath into destDirPath on fs,
// without recreating the source directory itself.
void copyFilesToDir(String srcDirPath, FileSystem fs, String destDirPath)
        throws IOException {
    File dir = new File(srcDirPath);
    for (File file : dir.listFiles()) {
        fs.copyFromLocalFile(new Path(file.getPath()),
                new Path(destDirPath, file.getName()));
    }
}
I think the code needs little explanation.
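For example, a usage sketch (the paths here are made up):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);   // HDFS, when fs.defaultFS points at hdfs://
copyFilesToDir("/home/user/localdir", fs, "/user/hadoop/destdir");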

store snapshots on server

WebDriver driver = new FirefoxDriver();
driver.get("http://www.google.com/");
File scrFile = ((TakesScreenshot)driver).getScreenshotAs(OutputType.FILE);
FileUtils.copyFile(scrFile, new File("c:\\tmp\\screenshot.png"));
I am using this code to take snapshots with the Selenium WebDriver. It only stores snapshots locally on my PC. If I want to run it automatically from Jenkins, is there any way to store the snapshots somewhere else, so that if anyone runs it either through Jenkins or locally from their PC, they don't have to change the path ("c:\tmp\screenshot.png") every time?
You could make the location of the output file something that is controlled by a setting: either a command-line argument to the tool running this code (if you can modify that), or an environment variable that you can read from your section of code above. You could also have a default location that always exists and should be writable, like the user's home directory, rather than an absolute path to c:\tmp.
In Jenkins, I would have a step (in an Ant script, shell script, or whatnot) create a folder called "screenshots" below $WORKSPACE, and then tell the tool that's going to run your code about that location by one of the methods suggested above. This will also be handy if you want to make the screenshots part of the job's output.
Also, unless you really only ever need the latest file (or have downstream code consuming the screenshot and expecting a specific name), I would introduce a timestamp or some other variable file-naming for the png in the code above, e.g. screenshot-2014-05-16_12-15-37.png, so that if you run the tool twice it doesn't just overwrite the file that was there before.
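Putting those suggestions together, a rough sketch of what that could look like (SCREENSHOT_DIR is a name invented for this example, not a Jenkins or Selenium convention):
import java.io.File;
import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.commons.io.FileUtils;
import org.openqa.selenium.OutputType;
import org.openqa.selenium.TakesScreenshot;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

public class ScreenshotExample {
    public static void main(String[] args) throws Exception {
        WebDriver driver = new FirefoxDriver();
        driver.get("http://www.google.com/");

        // Output directory from an environment variable, falling back to the user's home
        String dirName = System.getenv("SCREENSHOT_DIR");
        if (dirName == null) {
            dirName = System.getProperty("user.home");
        }

        // Timestamped file name so repeated runs don't overwrite each other
        String stamp = new SimpleDateFormat("yyyy-MM-dd_HH-mm-ss").format(new Date());
        File target = new File(dirName, "screenshot-" + stamp + ".png");

        File scrFile = ((TakesScreenshot) driver).getScreenshotAs(OutputType.FILE);
        FileUtils.copyFile(scrFile, target);
        driver.quit();
    }
}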
hth

If I have a constructor that requires a path to a file, how can I "fake" that if it is packaged into a jar?

The context of this question is that I am trying to use the maxmind java api in a pig script that I have written... I do not think that knowing about either is necessary to answer the question, however.
The MaxMind API has a constructor that requires a path to a file called GeoIP.dat, a comma-delimited file containing the information it needs.
I have a jar file which contains the API, as well as a wrapping class which instantiates the class and uses it. My idea is to package the GeoIP.dat file into the jar, and then access it as a resource in the jar file. The issue is that I do not know how to construct a path that the constructor can use.
Looking at the API, this is how they load the file:
public LookupService(String databaseFile) throws IOException {
    this(new File(databaseFile));
}

public LookupService(File databaseFile) throws IOException {
    this.databaseFile = databaseFile;
    this.file = new RandomAccessFile(databaseFile, "r");
    init();
}
I only paste that because I am not averse to editing the API itself to make this work, if necessary, but I do not know how I could replicate that functionality otherwise. Ideally I'd like to get it into File form, though, or else editing the API will be quite a chore.
Is this possible?
Try:
new File(MyWrappingClass.class.getResource(<resource>).toURI())
Dump your data to a temp file, and feed the temp file to it:
// createTempFile requires a prefix of at least three characters
File tmpFile = File.createTempFile("XX_", ".dat");
tmpFile.deleteOnExit();
try (InputStream is = MyClass.class.getResourceAsStream("/path/in/jar/XX.dat");
     OutputStream os = new FileOutputStream(tmpFile)) {
    byte[] buf = new byte[8192];
    int n;
    while ((n = is.read(buf)) != -1) os.write(buf, 0, n);  // read from is, write to os
}
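Then hand the temp file to the constructor shown above, e.g. LookupService lookup = new LookupService(tmpFile.getAbsolutePath());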
One recommended way is to use the Distributed Cache rather than trying to bundle it into a jar.
Zip GeoIP.dat and copy it to hdfs://host:port/path/GeoIP.dat.zip, then add these options to the Pig command:
pig ...
-Dmapred.cache.archives=hdfs://host:port/path/GeoIP.dat.zip#GeoIP.dat
-Dmapred.create.symlink=yes
...
And LookupService lookupService = new LookupService("./GeoIP.dat"); should work in your UDF as the file will be present locally to the tasks on each node.
This works for me.
Assuming you have a package org.foo.bar.util that contains GeoLiteCity.dat
URL fileURL = this.getClass().getResource("/org/foo/bar/util/GeoLiteCity.dat");  // leading slash: absolute classpath path
File geoIPData = new File(fileURL.toURI());
LookupService cl = new LookupService(geoIPData, LookupService.GEOIP_MEMORY_CACHE);
Use the classloader.getResource(...) method to do the file lookup in the classpath, which will pull it from the JAR file.
This means you will have to alter the existing code to override the loading. The details on how to do that depend heavily on your existing code and environment. In some cases subclassing and registering the subclass with the framework might work. In other cases, you might have to determine the ordering of class loading along the classpath and place an identically signed class "earlier" in the classpath.
Here's how we use the MaxMind GeoIP:
We put the GeoIPCity.dat file into the cloud and use the cloud location as an argument when we launch the process.
The code where we get the GeoIPCity.dat file and create a new LookupService is:
if (DistributedCache.getLocalCacheFiles(context.getConfiguration()) != null) {
    List<Path> localFiles = Utility.arrayToList(DistributedCache.getLocalCacheFiles(context.getConfiguration()));
    for (Path localFile : localFiles) {
        if ((localFile.getName() != null) && (localFile.getName().equalsIgnoreCase("GeoIPCity.dat"))) {
            m_geoipLookupService = new LookupService(new File(localFile.toUri().getPath()));
        }
    }
}
Here is an abbreviated version of command we use to run our process
$HADOOP_HOME/bin/hadoop jar /usr/lib/COMPANY/analytics/libjars/MyJar.jar -files hdfs://PDHadoop1.corp.COMPANY.com:54310/data/geoip/GeoIPCity.dat -libjars /usr/lib/COMPANY/analytics/libjars/geoiplookup.jar
The critical components of this for running the MaxMind component are the -files and -libjars arguments. These are generic options handled by the GenericOptionsParser.
-files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars> specify comma separated jar files to include in the classpath.
I'm assuming that Hadoop uses the GenericOptionsParser because I can find no reference to it anywhere in my project. :)
If you put the GeoIPCity.dat in the cloud (HDFS) and specify it using the -files argument, it will be put into the local cache, which the mapper can then get in the setup function. It doesn't have to be in setup, but that only needs to be done once per mapper, so it is an excellent place to put it.
Then use the -libjars argument to specify the geoiplookup.jar (or whatever you've called yours) and it will be able to use it. We don't put the geoiplookup.jar on the cloud; I'm rolling with the assumption that Hadoop will distribute the jar as it needs to.
I hope that all makes sense. I am getting fairly familiar with Hadoop/MapReduce, but I didn't write the pieces that use the MaxMind GeoIP component in the project, so I've had to do a little digging to understand it well enough to write the explanation I have here.
EDIT: Additional description for the -files and -libjars
-files The -files argument is used to distribute files through the Hadoop Distributed Cache. In the example above, we are distributing the MaxMind geo-IP data file through the Hadoop Distributed Cache. We need access to the MaxMind geo-IP data file to map the user's IP address to the appropriate country, region, city, and timezone. The API requires that the data file be present locally, which is not feasible in a distributed processing environment (we are not guaranteed which nodes in the cluster will process the data). To distribute the appropriate data to the processing node, we use the Hadoop Distributed Cache infrastructure. The GenericOptionsParser and the ToolRunner automatically facilitate this via the -files argument. Please note that the file we distribute should be available in the cloud (HDFS).
-libjars The -libjars argument is used to distribute any additional dependencies required by the map-reduce jobs. Like the data file, we also need to copy the dependent libraries to the nodes in the cluster where the job will be run. The GenericOptionsParser and the ToolRunner automatically facilitate this via the -libjars argument.
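For reference, a rough sketch of the programmatic equivalent of -files, using the same (older) DistributedCache API as the mapper code above; the HDFS path is the one from the example command:
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;

Configuration conf = new Configuration();
// The file must already be on HDFS; this registers it for the distributed cache
DistributedCache.addCacheFile(
        new URI("hdfs://PDHadoop1.corp.COMPANY.com:54310/data/geoip/GeoIPCity.dat"),
        conf);
// Allow tasks to reach the cached copy via a symlink in their working directory
DistributedCache.createSymlink(conf);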
