ListFiles from HDFS Cluster - java

I am new to Hadoop. I am trying to access a Hadoop cluster (HDFS) and retrieve the list of files from an Eclipse client. After setting up the required configurations on the Hadoop Java client, I can perform copyFromLocalFile and copyToLocalFile operations on HDFS from the client.
Here is the problem: when I call the listFiles() method, I get output like
org.apache.hadoop.fs.LocatedFileStatus@d0085360
org.apache.hadoop.fs.LocatedFileStatus@b7aa29bf
Main method:
Properties props = new Properties();
props.setProperty("fs.defaultFS", "hdfs://<IPOFCLUSTER>:8020");
props.setProperty("mapreduce.jobtracker.address", "<IPOFCLUSTER>:8032");
props.setProperty("yarn.resourcemanager.address", "<IPOFCLUSTER>:8032");
props.setProperty("mapreduce.framework.name", "yarn");
FileSystem fs = FileSystem.get(toConfiguration(props)); // setting up the required configuration
Path p4 = new Path("/user/myusername/inputjson1/");
RemoteIterator<LocatedFileStatus> ritr = fs.listFiles(p4, true);
while (ritr.hasNext()) {
    System.out.println(ritr.next().toString());
}
I have also tried FileContext and ended up only getting the FileStatus object's string representation. Is there a way to get just the file names when I iterate over the remote HDFS directory? There is a method called getPath(); is that the only way to retrieve the full path of the files using the Hadoop API, or is there another method that returns only the names of the files in a specified directory path? Please help me with this, thanks.

You can indeed use getPath(); it returns a Path object that lets you query the name of the file.
Path p = ritr.next().getPath();
// returns the filename or directory name if directory
String name = p.getName();
The FileStatus object you get can tell you if this is a file or directory.
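Putting that together with the question's loop (a minimal sketch reusing fs and p4 from the main method above):
RemoteIterator<LocatedFileStatus> ritr = fs.listFiles(p4, true);
while (ritr.hasNext()) {
    LocatedFileStatus status = ritr.next();
    // getName() returns just the last path component, e.g. "data1.json"
    System.out.println(status.getPath().getName());
    // status.getPath().toUri().getPath() would give the full path instead
}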
Here is more API documentation:
http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/fs/Path.html
http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/fs/FileStatus.html

Related

Access file without knowing absolute path, only knowing file name

I'm trying to use a file in my code, but I don't want to have to specify its absolute path, only the file name, for example "fileName.txt".
I want to do this so that I can use the code on different laptops, where the file may be stored in different folders.
The code below is what I'm using at the moment, but I receive a NoSuchFileException when I run it.
FileSystem fs = FileSystems.getDefault();
Path fileIn = Paths.get("fileName.txt");
Any ideas how to overcome this problem so I can find the file without knowing its absolute path?
Ideas on how to find the file without knowing its absolute path:
Instruct the user of the app to place the file in the working directory.
Instruct the user of the app to give the path to the file as a program argument, then change the program to use that argument (see the sketch after this list).
Have the program read a configuration file, found using options 1 or 2, then instruct the user of the app to give the path to the file in the configuration file.
Prompt the user for the file name.
(Not recommended) Scan the entire file system for the file, making sure there is only one file with the given name. Optional: If more than one file is found, prompt the user for which file to use.
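As a minimal sketch of option 2 above, assuming the path arrives as the first program argument (the class name and fallback file name are illustrative):
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ReadNamedFile {
    public static void main(String[] args) throws Exception {
        // option 2: take the path from the command line;
        // option 1 as a fallback: look in the working directory
        Path fileIn = Paths.get(args.length > 0 ? args[0] : "fileName.txt");
        if (!Files.exists(fileIn)) {
            System.err.println("File not found: " + fileIn.toAbsolutePath());
            return;
        }
        Files.readAllLines(fileIn).forEach(System.out::println);
    }
}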
If you don't ask the user for the complete path, and you don't have a specific folder the file must be in, then your only choice is to search for it.
Start with a rootmost path and learn to use the File class, then search all the children. This implementation returns the first file found with that name:
public File findFile(File folder, String fileName) {
    File fullPath = new File(folder, fileName);
    if (fullPath.exists()) {
        return fullPath;
    }
    File[] children = folder.listFiles();
    if (children == null) { // listFiles() returns null for unreadable directories
        return null;
    }
    for (File child : children) {
        if (child.isDirectory()) {
            File possible = findFile(child, fileName);
            if (possible != null) {
                return possible;
            }
        }
    }
    return null;
}
Then start the search by passing either the root of the file system or the configured rootmost path you want to search:
File userFile = findFile( new File("/"), fileName );
The best option, however, is to have the user input the entire path. There are nice file-system browsing tools for most environments that will do this for the user.
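For a desktop (Swing) application, a minimal sketch of such a browsing tool using JFileChooser:
import java.io.File;
import javax.swing.JFileChooser;

JFileChooser chooser = new JFileChooser();
if (chooser.showOpenDialog(null) == JFileChooser.APPROVE_OPTION) {
    File userFile = chooser.getSelectedFile(); // full path picked by the user
    System.out.println(userFile.getAbsolutePath());
}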

Overwriting HDFS file/directory through Spark

Problem
I have a file saved in HDFS, and all I want to do is run my Spark application, compute a result JavaRDD, and use saveAsTextFile() to store the new "file" in HDFS.
However, Spark's saveAsTextFile() does not work if the file already exists; it does not overwrite it.
What I tried
So I searched for a solution and found that a possible way to make it work could be to delete the file through the HDFS API before trying to save the new one.
I added this code:
FileSystem hdfs = FileSystem.get(new Configuration());
Path newFolderPath = new Path("hdfs://node1:50050/hdfs/" + filename);
if (hdfs.exists(newFolderPath)) {
    System.out.println("EXISTS");
    hdfs.delete(newFolderPath, true);
}
filerdd.saveAsTextFile("/hdfs/" + filename);
When I run my Spark application, the file is deleted, but I get a FileNotFoundException.
Given that this exception occurs when something tries to read a file from a path that does not exist, this makes no sense, because after deleting the file there is no code that tries to read it.
Part of my code
JavaRDD<String> filerdd = sc.textFile("/hdfs/" + filename); // load the file here
...
// Transformations here
filerdd = filerdd.map(....);
...
// Delete old file here
FileSystem hdfs = FileSystem.get(new Configuration());
Path newFolderPath = new Path("hdfs://node1:50050/hdfs/" + filename);
if (hdfs.exists(newFolderPath)) {
    System.out.println("EXISTS");
    hdfs.delete(newFolderPath, true);
}
// Write new file here
filerdd.saveAsTextFile("/hdfs/" + filename);
I am trying to do the simplest thing here, but I have no idea why it does not work. Maybe the filerdd is somehow connected to the path?
The problem is that you use the same path for input and output. Spark's RDDs are evaluated lazily: nothing actually runs until you call saveAsTextFile. By that point you have already deleted newFolderPath, so when Spark finally tries to read the input, the file is gone and filerdd fails with the FileNotFoundException.
In any case, you should not use the same path for input and output.
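A minimal sketch of one way to follow that advice, reusing the question's sc and filename: write to a temporary output path first, then swap it into place once the job has finished (the _tmp suffix and the map function are illustrative):
JavaRDD<String> filerdd = sc.textFile("/hdfs/" + filename);
filerdd = filerdd.map(line -> line.toUpperCase()); // your transformations here

String tmpOut = "/hdfs/" + filename + "_tmp";
filerdd.saveAsTextFile(tmpOut); // the job actually runs here; the input is still intact

FileSystem hdfs = FileSystem.get(sc.hadoopConfiguration());
Path oldPath = new Path("/hdfs/" + filename);
hdfs.delete(oldPath, true);             // safe now: the new output is already written
hdfs.rename(new Path(tmpOut), oldPath); // move the new output into place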

Converting a Jar-URI into a nio.Path

I'm having trouble converting a URI into a nio.Path in the general case. Given a URI with multiple schemes, I wish to create a single nio.Path instance to reflect it.
//setup
String jarEmbeddedFilePathString = "jar:file:/C:/Program%20Files%20(x86)/OurSoftware/OurJar_x86_1.0.68.220.jar!/com/our_company/javaFXViewCode.fxml";
URI uri = URI.create(jarEmbeddedFilePathString);
//act
Path nioPath = Paths.get(uri);
//assert --any of these are acceptable
assertThat(nioPath).isEqualTo("C:/Program Files (x86)/OurSoftware/OurJar_x86_1.0.68.220.jar/com/our_company/javaFXViewCode.fxml");
//--or assertThat(nioPath).isEqualTo("/com/our_company/javaFXViewCode.fxml");
//--or assertThat(nioPath).isEqualTo("OurJar_x86_1.0.68.220.jar!/com/our_company/javaFXViewCode.fxml")
//or pretty well any other interpretation of jar'd-uri-to-path any reasonable person would have.
This code currently throws FileSystemNotFoundException on the Paths.get() call.
The actual reason for this conversion is to ask the resulting path about things like its package location and file name, so as long as the resulting path object preserves the ...com/our_company/javaFXViewCode.fxml portion, it is still very convenient for us to use the NIO Path object.
Most of this information is actually used for debugging, so it would not be impossible to retrofit our code to avoid Paths in this particular instance and use URIs or plain strings instead, but that would mean a bunch of retooling around methods already conveniently provided by nio.Path.
I've started digging into the file system provider API and have been confronted with more complexity than I want to deal with for such a small thing. Is there a simple way to convert a class-loader-provided URI into a path object corresponding to OS-understandable traversal when the URI points to a non-jar file, and to not-OS-understandable-but-still-useful traversal when the path points to a resource inside a jar (or, for that matter, a zip or tarball)?
Thanks for any help.
A Java Path belongs to a FileSystem. A file system is implemented by a FileSystemProvider.
Java comes with two file system providers: One for the operating system (e.g. WindowsFileSystemProvider), and one for zip files (ZipFileSystemProvider). These are internal and should not be accessed directly.
To get a Path to a file inside a Jar file, you need to get (create) a FileSystem for the content of the Jar file. You can then get a Path to a file in that file system.
First, you'll need to parse the Jar URL, which is best done using the JarURLConnection:
URL jarEntryURL = new URL("jar:file:/C:/Program%20Files%20(x86)/OurSoftware/OurJar_x86_1.0.68.220.jar!/com/our_company/javaFXViewCode.fxml");
JarURLConnection jarEntryConn = (JarURLConnection) jarEntryURL.openConnection();
URL jarFileURL = jarEntryConn.getJarFileURL(); // file:/C:/Program%20Files%20(x86)/OurSoftware/OurJar_x86_1.0.68.220.jar
String entryName = jarEntryConn.getEntryName(); // com/our_company/javaFXViewCode.fxml
Once you have those, you can create a FileSystem and get a Path to the jar'd file. Remember that FileSystem is an open resource and needs to be closed when you are done with it:
Path jarPath = Paths.get(jarFileURL.toURI()); // the jar file itself, as a path in the default filesystem
try (FileSystem jarFileSystem = FileSystems.newFileSystem(jarPath, (ClassLoader) null)) {
    Path entryPath = jarFileSystem.getPath(entryName);
    System.out.println("entryPath: " + entryPath); // com/our_company/javaFXViewCode.fxml
    System.out.println("parent: " + entryPath.getParent()); // com/our_company
}
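As a variant (a sketch using the standard FileSystems API): the original jar: URI can be handed to FileSystems.newFileSystem directly, skipping JarURLConnection, as long as the entry part after "!/" is dropped:
URI jarUri = URI.create("jar:file:/C:/Program%20Files%20(x86)/OurSoftware/OurJar_x86_1.0.68.220.jar");
try (FileSystem jarFileSystem = FileSystems.newFileSystem(jarUri, Collections.emptyMap())) {
    Path entryPath = jarFileSystem.getPath("/com/our_company/javaFXViewCode.fxml");
    System.out.println("parent: " + entryPath.getParent()); // /com/our_company
}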

creating a virtual file system with JIMFS

I'd like to use Google's JIMFS for creating a virtual file system for testing purposes. I have trouble getting started, though.
I looked at this tutorial: http://www.hascode.com/2015/03/creating-in-memory-file-systems-with-googles-jimfs/
However, when I create the file system, it actually gets created in the existing file system, i.e., I cannot do:
Files.createDirectory("/virtualfolder");
because I am denied access.
Am I missing something?
Currently, my code looks something like this:
Test Class:
FileSystem fs = Jimfs.newFileSystem(Configuration.unix());
Path vTargetFolder = fs.getPath("/Store/homes/linux/abc/virtual");
TestedClass test = new TestedClass(vTargetFolder.toAbsolutePath().toString());
Java class somewhere:
targetPath = Paths.get(targetName);
Files.createDirectory(targetPath);
// etc., creating files and writing them to the target directory
However, I created a separate class just to test JIMFS, and there the creation of the directory doesn't fail, but I cannot create a new file like this:
FileSystem fs = Jimfs.newFileSystem(Configuration.unix());
Path data = fs.getPath("/virtual");
Path dir = Files.createDirectory(data);
Path file = Files.createFile(Paths.get(dir + "/abc.txt")); // throws NoSuchFileException
What am I doing wrong?
The problem is a mix of the default FileSystem and the new FileSystem.
Problem 1:
Files.createDirectory("/virtualfolder");
This will actually not compile, so I suspect you meant:
Files.createDirectory( Paths.get("/virtualfolder"));
This attempts to create a directory in the root directory of the default filesystem. You need privileges to do that, and you probably should not do it in a test. I suspect you tried to work around this problem by using strings and ran into
Problem 2:
Let's look at your code, with comments:
FileSystem fs = Jimfs.newFileSystem(Configuration.unix());
// now get a path in the new FileSystem
Path data = fs.getPath("/virtual");
// create a directory in the new FileSystem
Path dir = Files.createDirectory(data);
// create a file in the default FileSystem,
// with a parent that was never created there
Path file = Files.createFile(Paths.get(dir + "/abc.txt")); // throws NoSuchFileException
Let's look at the last line:
dir + "/abc.txt" is the string "/virtual/abc.txt";
Paths.get(dir + "/abc.txt") is that string as a path in the default filesystem.
Remember that the virtual filesystem lives parallel to the default filesystem.
Paths belong to a filesystem and cannot be used in another filesystem; they are not just names.
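A minimal corrected sketch, assuming the same Jimfs setup as above: every path is derived from the Jimfs FileSystem, so the directory and the file end up in the same in-memory filesystem:
FileSystem fs = Jimfs.newFileSystem(Configuration.unix());
Path dir = Files.createDirectory(fs.getPath("/virtual"));
// resolve against dir so the new path stays in the Jimfs filesystem
Path file = Files.createFile(dir.resolve("abc.txt"));
Files.write(file, "hello".getBytes()); // works entirely in memory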
Notes:
When working with virtual filesystems, avoid the Paths class; it always works in the default filesystem. Files is fine, as long as you have created the path in the correct filesystem first.
If your original plan was to work with a virtual filesystem mounted into the default filesystem, you need a bit more. I have a project where I create a WebDAV server based on a virtual filesystem and then use the OS's built-in methods to mount that as a volume.
In your shell, try ls /
The output should contain the "/virtual" directory.
If it does not, which I suspect is the case, then the program is masking a
java.nio.file.AccessDeniedException: /virtual/abc.txt
In reality the code should be failing at Path dir = Files.createDirectory(data);
but for some reason this exception is silent, and the program continues without creating the directory (while thinking it has), then attempts to write to a directory that doesn't exist,
leaving a misleading java.nio.file.NoSuchFileException.
I suggest you use memoryfilesystem instead. It has a much more complete implementation than Jimfs; in particular, it supports POSIX attributes when creating a "Linux" filesystem etc.
Using it, your code will actually work:
try (
    final FileSystem fs = MemoryFileSystemBuilder.newLinux().build("testfs");
) {
    // create a directory, a file within this directory, etc.
}
It seems that instead of
Path file = Files.createFile(Paths.get(dir + "/abc.txt"));
you should be doing
Path file = Files.createFile(dir.resolve("abc.txt"));
This way, the context of dir (its filesystem) is not lost. Note that the argument to resolve must not start with "/": a leading slash would produce the absolute path "/abc.txt" instead of a child of dir.

file path in hdfs

I want to read the file from the Hadoop File System.
To build the correct path to the file, I need the host name and port address of HDFS.
So finally my path to the file will look something like:
Path path = new Path("hdfs://123.23.12.4344:9000/user/filename.txt")
Now I want to know how to extract the host name ("123.23.12.4344") and the port (9000).
Basically, I want to access the FileSystem on Amazon EMR, but when I use FileSystem fs = FileSystem.get(getConf()); I get
You possibly called FileSystem.get(conf) when you should have called FileSystem.get(uri, conf) to obtain a file system supporting your path
So I decided to use a URI (I have to use a URI), but I am not sure how to access it.
You can use either of the following two ways to resolve the error.
1.
String infile = "file.txt";
Path ofile = new Path(infile);
FileSystem fs = ofile.getFileSystem(getConf());
2.
Configuration conf = getConf();
System.out.println("fs.default.name : - " + conf.get("fs.default.name"));
// prints the URI, e.g. hdfs://10.214.15.165:9000
String uri = conf.get("fs.default.name");
FileSystem fs = FileSystem.get(URI.create(uri), getConf());
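Since the question also asks how to extract the host name and port, a minimal sketch building on option 2 (java.net.URI does the parsing):
URI uri = URI.create(conf.get("fs.default.name")); // e.g. hdfs://10.214.15.165:9000
String host = uri.getHost(); // "10.214.15.165"
int port = uri.getPort();    // 9000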
