I am running Hadoop on my local system, in an Eclipse environment.
I tried to put a local file from my workspace into the distributed cache in the driver function:
DistributedCache.addCacheFile(new Path("/home/hduser/workspace/myDir/myFile").toUri(), conf);
but when I try to access it from the Mapper, it returns null.
Inside the mapper, I checked whether the file was cached:
System.out.println("Cache: " + context.getConfiguration().get("mapred.cache.files"));
It prints "null", and
Path[] cacheFilesLocal = DistributedCache.getLocalCacheFiles(context.getConfiguration());
returns null.
What's going wrong?
It's because you can only add files to the Distributed Cache from HDFS, not from the local file system, so that Path doesn't exist. Put the file on HDFS and use the HDFS path to refer to it when adding it to the DistributedCache.
See DistributedCache for more information.
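For example, a minimal sketch of that approach in the driver (the HDFS target path below is an assumption for illustration, not from the question):
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);

// Copy the workspace file into HDFS first...
Path localFile = new Path("/home/hduser/workspace/myDir/myFile");
Path hdfsFile = new Path("/user/hduser/cache/myFile"); // hypothetical HDFS location
fs.copyFromLocalFile(localFile, hdfsFile);

// ...then register the HDFS copy with the distributed cache.
DistributedCache.addCacheFile(hdfsFile.toUri(), conf);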
Add the file:// scheme to the path when you add the cache file (and convert it to a URI, which is what addCacheFile expects):
DistributedCache.addCacheFile(new Path("file:///home/hduser/workspace/myDir/myFile").toUri(), conf);
Try this
Driver class
FileSystem fs = FileSystem.get(conf);
Path p = new Path("your/file/path");
FileStatus[] list = fs.globStatus(p);
for (FileStatus status : list) {
    /* Store each matching file in the distributed cache */
    DistributedCache.addCacheFile(status.getPath().toUri(), conf);
}
Mapper class
public void setup(Context context) throws IOException {
    /* Access the cached file */
    Configuration conf = context.getConfiguration();
    FileSystem fs = FileSystem.get(conf);
    URI[] cacheFiles = DistributedCache.getCacheFiles(conf);

    /* Access the 0th cached file */
    Path getPath = new Path(cacheFiles[0].getPath());

    /* Read the data */
    BufferedReader bf = new BufferedReader(new InputStreamReader(fs.open(getPath)));
    String setupData = null;
    while ((setupData = bf.readLine()) != null) {
        /* Print the file content */
        System.out.println("Setup Line " + setupData);
    }
    bf.close();
}

public void map() {
}
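Note that DistributedCache.getCacheFiles(conf) returns the original URIs that were added in the driver, whereas DistributedCache.getLocalCacheFiles(conf) returns the paths of the localized copies on the task node; in a fully distributed setup the latter is usually what you read from inside a task.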
Related
I wrote a method for saving files from an HDFS directory to S3, but the files are getting saved to the wrong directory in S3. I've inspected the logs and have confirmed that the value of s3TargetPath is s3://bucketName/data and that hdfsSource is also resolved correctly.
However, instead of being saved to s3TargetPath, they are saved to s3://bucketName//data/.
The s3://bucketName//data/ directory also contains a file named data with content type binary/octet-stream and fs-type Hadoop block.
What needs to be changed in my code to save files to the right S3 path?
private String hdfsPath = "hdfs://localhost:9010/user/usr1/data";
private String s3Path = "s3://bucketName/data";

copyFromHdfstoS3(hdfsPath, s3Path);
//
void copyFromHdfstoS3(String hdfsDir, String s3sDir) throws IOException, URISyntaxException {
    Configuration conf = new Configuration();
    FileSystem hdfs = FileSystem.get(new URI(hdfsDir), conf);
    FileSystem s3Fs = FileSystem.get(new URI(s3sDir), conf);

    Path hdfsSource = new Path(hdfsDir);
    Path s3TargetPath = new Path(s3sDir);

    RemoteIterator<LocatedFileStatus> sourceFiles = hdfs.listFiles(hdfsSource, false);

    if (!s3Fs.exists(s3TargetPath)) {
        s3Fs.mkdirs(s3TargetPath);
    }

    if (sourceFiles != null) {
        while (sourceFiles.hasNext()) {
            Path srcFilePath = sourceFiles.next().getPath();
            if (FileUtil.copy(hdfs, srcFilePath, s3Fs, s3TargetPath, false, true, new Configuration())) {
                LOG.info("Copied Successfully");
            } else {
                LOG.info("Copy Failed");
            }
        }
    }
}
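One thing worth checking (an assumption on my part, not a confirmed fix): the legacy s3:// scheme is Hadoop's S3 block filesystem, which stores data as opaque blocks rather than plain objects; that would be consistent with the binary/octet-stream, fs-type Hadoop block entry. A minimal sketch of the copy loop that targets a native S3 filesystem (assumes an s3a:// or s3n:// URI is configured) and builds an explicit destination path per file:
// Hedged sketch: copy each file to an explicitly named destination key.
// Assumes s3sDir uses a native S3 scheme, e.g. "s3a://bucketName/data".
while (sourceFiles.hasNext()) {
    Path srcFilePath = sourceFiles.next().getPath();
    Path dstFilePath = new Path(s3TargetPath, srcFilePath.getName());
    FileUtil.copy(hdfs, srcFilePath, s3Fs, dstFilePath, false, true, conf);
}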
I want to rewrite the function that copies a file from HDFS to the local file system.
String src = args[0]; // hdfs
String dst = args[1]; // local path
/*
* Prepare the input and output filesystems
*/
Configuration conf = new Configuration();
FileSystem inFS = FileSystem.get(URI.create(src), conf);
FileSystem outFS = FileSystem.get(URI.create(dst), conf);
/*
* Prepare the input and output streams
*/
FSDataInputStream in = null;
FSDataOutputStream out = null;
// TODO: Your implementation goes here...
in = inFS.open(new Path(src));
out = outFS.create(new Path(dst),
        new Progressable() {
            /*
             * Print a dot whenever 64 KB of data has been written to
             * the datanode pipeline.
             */
            public void progress() {
                System.out.print(".");
            }
        });
(1) When I set the local path to my current directory
--> Java I/O error: mkdirs failed to create file
(2) When I set the path to a non-existent folder
--> a new folder is created and my copied file ends up inside it.
What should I do?
I believe I should not be using FileSystem.create(), should I?
EDIT
A link to the relevant FileSystem API documentation:
https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#create(org.apache.hadoop.fs.Path)
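For what it's worth, a minimal sketch (not the original poster's code) that copies an HDFS file to a local path without opening the streams by hand, using FileSystem.copyToLocalFile:
// Minimal sketch: let the FileSystem API handle the local copy.
Configuration conf = new Configuration();
Path srcPath = new Path(src); // e.g. an hdfs:// path
Path dstPath = new Path(dst); // e.g. a local file path
FileSystem inFS = FileSystem.get(URI.create(src), conf);
inFS.copyToLocalFile(false, srcPath, dstPath); // false = keep the source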
I am getting a FileNotFoundException on the line marked below. In Hadoop 1 this was functional, but now it throws a FileNotFoundException.
Path localManifestFolder;
Path localManifestPath = new Path("hdfs:///WordCount/write/manifest");
PrintWriter pw = null;
FileSystem fs = null;
try {
    URI localHDFSManifestUri = new URI("hdfs:///WordCount/write");
    fs = FileSystem.get(localHDFSManifestUri, conf);
    localManifestFolder = new Path("hdfs:///WordCount/write");
    FileStatus[] listOfFiles = fs.listStatus(localManifestFolder); // Getting the error on this line
} catch (FileNotFoundException ex) {
    throw ex;
}
Exception :
java.io.FileNotFoundException: File hdfs:/WordCount/write does not exist.
Please tell me why this is happening.
If you do not have your core-site.xml on the classpath, then you need to specify the HDFS location in the URI (otherwise it defaults to the local filesystem).
For example
hdfs://namenode.fqdn:8020/WordCount
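Alternatively (a minimal sketch, reusing the placeholder NameNode address above), the default filesystem can be set on the Configuration before obtaining the FileSystem:
Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://namenode.fqdn:8020"); // placeholder address

FileSystem fs = FileSystem.get(conf);
Path localManifestFolder = new Path("/WordCount/write");
FileStatus[] listOfFiles = fs.listStatus(localManifestFolder);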
Is it possible to run a Mahout k-means Java program locally, so that it reads the data from the local file system and saves it back there instead of to HDFS?
All examples on the internet work on HDFS.
https://github.com/tdunning/MiA/blob/master/src/main/java/mia/clustering/ch07/SimpleKMeansClustering.java
Yes, it is possible - check out SequenceFile.Writer. See the following code example, which writes clustered data points to a file; there is also a blog post that describes this in great detail:
public static void writePointsToFile(List<Vector> points,
                                     String fileName,
                                     FileSystem fs,
                                     Configuration conf) throws IOException {
    Path path = new Path(fileName);
    SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf,
            path, LongWritable.class, VectorWritable.class);
    long recNum = 0;
    VectorWritable vec = new VectorWritable();
    for (Vector point : points) {
        vec.set(point);
        writer.append(new LongWritable(recNum++), vec);
    }
    writer.close();
}
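To keep the input and output on the local file system, pass in the local FileSystem instead of an HDFS handle. A minimal sketch (the output path is just an example):
// Minimal sketch: obtain the local file system instead of HDFS.
Configuration conf = new Configuration();
FileSystem fs = FileSystem.getLocal(conf);

// 'points' is prepared as in the linked example; the path is a local directory.
writePointsToFile(points, "file:///tmp/kmeans/points/file1", fs, conf);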
I'm pretty new to the Hadoop environment. Recently, I ran a basic MapReduce program; it was easy to run.
Now I have an input file with the following contents inside the input path directory:
fileName1
fileName2
fileName3
...
I need to read the lines of this file one by one and create a new file with each of those names (i.e. fileName1, fileName2, and so on) in a specified output directory.
I wrote the map implementation below, but it didn't work:
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
    String fileName = value.toString();
    String path = outputFilePath + File.separator + fileName;
    File newFile = new File(path);
    newFile.mkdirs();
    newFile.createNewFile();
}
Can somebody explain what I've missed?
Thanks
I think you should start by studying the FileSystem class; as far as I know, you can only create files in the distributed filesystem through it. Here's a code example where I opened a file for reading - you probably just need an FSDataOutputStream instead. In your mapper you can get the configuration from the Context class.
Configuration conf = job.getConfiguration();
Path inFile = new Path(file);
try {
    FileSystem fs = FileSystem.get(conf);
    if (!fs.exists(inFile))
        System.out.println("Unable to open settings file: " + file);
    FSDataInputStream in = fs.open(inFile);
    ...
}
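Applied to the question, a minimal sketch of a mapper that creates one empty HDFS file per input line (the output directory is assumed to be passed through the job configuration under a made-up key, output.dir):
public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    FileSystem fs = FileSystem.get(conf);

    // "output.dir" is a hypothetical configuration key set in the driver.
    String outputDir = conf.get("output.dir");
    String fileName = value.toString().trim();

    // Create an empty file named after the current line, under the output directory.
    FSDataOutputStream out = fs.create(new Path(outputDir, fileName));
    out.close();
}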
First, get the path of the input directory inside your mapper with the help of FileSplit. Then append to it the name of the file that contains all these lines, and read that file's lines using an FSDataInputStream. Something like this:
public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    FileSplit fileSplit = (FileSplit) context.getInputSplit();
    FileSystem fs = FileSystem.get(context.getConfiguration());

    /* Open the file that lists the file names, located next to the current input split */
    FSDataInputStream in = fs.open(new Path(fileSplit.getPath().getParent() + "/file.txt"));
    while (in.available() > 0) {
        /* Create one empty file per listed name */
        FSDataOutputStream out = fs.create(new Path(in.readLine()));
        out.close();
    }
    in.close();
    // Proceed further....
}
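Note that if the created files should end up in a specific output directory, build the Path against that directory (for example new Path(outputDir, in.readLine())); a bare relative name is resolved against the running user's HDFS working directory.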