Load HDFS partition files list - Java

I am writing a small program to load HDFS files using Java. When I run the code, I get the list of all files from HDFS, but I want to get only the partition files, e.g. part-00000 files.
Below is the sample code:
Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://localhost");
FileSystem hdfs = FileSystem.get(new URI("hdfs://localhost"), conf);
RemoteIterator<LocatedFileStatus> fsStatus = hdfs.listFiles(new Path("/hdfs/path"), true);
while (fsStatus.hasNext()) {
    String path = fsStatus.next().getPath().toString();
    System.out.println(path.matches("part-"));
}

I assume you want to print the path, not the fact that it matches. Note also that getPath().toString() returns the full URI, so check the file name rather than the start of the whole string:

while (fsStatus.hasNext()) {
    Path p = fsStatus.next().getPath();
    if (p.getName().startsWith("part-")) {
        System.out.println(p);
    }
}
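
If you only need the part files from a single directory (non-recursive), a minimal alternative sketch using listStatus with a PathFilter; the hdfs variable and the /hdfs/path directory are assumed to be the ones from the code above:

// Filter the directory listing down to part-* files (non-recursive)
FileStatus[] parts = hdfs.listStatus(new Path("/hdfs/path"),
        path -> path.getName().startsWith("part-"));
for (FileStatus status : parts) {
    System.out.println(status.getPath());
}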

Related

File copying in java (cannot create file) by Hadoop

I currently want to copy a file from HDFS to the local computer. I have finished most of the work with a FileInputStream and FileOutputStream, but then I encountered the following issue:
Java I/O exception: Mkdirs failed to create file
I have done some research and figured out that, since I am using
FileSystem.create() (a Hadoop function)
https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#create(org.apache.hadoop.fs.Path,%20org.apache.hadoop.util.Progressable)
the reason is as follows:
If I set my path to a non-existing folder, the folder is created and the downloaded file ends up inside it.
If I set my path to an existing folder (say, the current directory), the above I/O exception occurs.
Assuming I already have the path and the input stream right, what should I use (preferably from the FileSystem library) to work around this problem?
My code:

// src and dst are the input and output paths
Configuration conf = new Configuration();
FileSystem inFS = FileSystem.get(URI.create(src), conf);
FileSystem outFS = FileSystem.get(URI.create(dst), conf);
FSDataInputStream in = null;
FSDataOutputStream out = null;
in = inFS.open(new Path(src));
out = outFS.create(new Path(dst),
        new Progressable() {
            /*
             * Print a dot whenever 64 KB of data has been written to
             * the datanode pipeline.
             */
            public void progress() {
                System.out.print(".");
            }
        });
In the "File" class there is a method called
createNewFile() that will create a new file only if one doent exist.
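
A minimal sketch of that idea, assuming src and dst are the HDFS source and local destination from the question and inFS is the FileSystem created above; the directory check and the IOUtils.copyBytes call are illustrative additions, not part of the original answer:

// If dst is an existing directory, write into it using the source file name
File target = new File(dst);
if (target.isDirectory()) {
    target = new File(target, new Path(src).getName());
}
if (!target.exists()) {
    target.createNewFile(); // only creates the file if it does not already exist
}
try (FSDataInputStream in = inFS.open(new Path(src));
     OutputStream out = new FileOutputStream(target)) {
    IOUtils.copyBytes(in, out, 4096, false); // org.apache.hadoop.io.IOUtils
}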

Overwriting HDFS file/directory through Spark

Problem
I have a file saved in HDFS, and all I want to do is run my Spark application, compute a result JavaRDD, and use saveAsTextFile() to store the new "file" in HDFS.
However, Spark's saveAsTextFile() does not work if the file already exists; it does not overwrite it.
What I tried
So I searched for a solution to this and I found that a possible way to make it work could be deleting the file through the HDFS API before trying to save the new one.
I added the code:

FileSystem hdfs = FileSystem.get(new Configuration());
Path newFolderPath = new Path("hdfs://node1:50050/hdfs/" + filename);
if (hdfs.exists(newFolderPath)) {
    System.out.println("EXISTS");
    hdfs.delete(newFolderPath, true);
}
filerdd.saveAsTextFile("/hdfs/" + filename);
When I tried to run my Spark application, the file was deleted, but I got a FileNotFoundException.
Considering that this exception occurs when someone tries to read a file from a path that does not exist, this makes no sense: after deleting the file, there is no code that tries to read it.
Part of my code
JavaRDD<String> filerdd = sc.textFile("/hdfs/" + filename); // load the file here
...
...
// Transformations here
filerdd = filerdd.map(....);
...
...
// Delete old file here
FileSystem hdfs = FileSystem.get(new Configuration());
Path newFolderPath = new Path("hdfs://node1:50050/hdfs/" + filename);
if (hdfs.exists(newFolderPath)) {
    System.out.println("EXISTS");
    hdfs.delete(newFolderPath, true);
}

// Write new file here
filerdd.saveAsTextFile("/hdfs/" + filename);
I am trying to do the simplest thing here, but I have no idea why it does not work. Maybe filerdd is somehow connected to the path?
The problem is that you use the same path for input and output. Spark RDDs are evaluated lazily: the read only actually runs when you call saveAsTextFile, and by that point you have already deleted newFolderPath, so filerdd fails with the FileNotFoundException.
In any case, you should not use the same path for input and output.
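
One way around this, as a hedged sketch: write the result to a temporary directory first, and only delete the old data and rename the new output into place once the job has finished. The _tmp suffix below is made up for illustration:

// Write to a temporary location so the input is still readable while the job runs
String finalPath = "/hdfs/" + filename;
String tmpPath = finalPath + "_tmp";
filerdd.saveAsTextFile(tmpPath);

// The job has finished reading, so the old data can now be replaced
FileSystem hdfs = FileSystem.get(new Configuration());
Path oldDir = new Path(finalPath);
if (hdfs.exists(oldDir)) {
    hdfs.delete(oldDir, true);
}
hdfs.rename(new Path(tmpPath), oldDir);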

file path in hdfs

I want to read the file from the Hadoop File System.
In order to build the correct path to the file, I need the host name and port of HDFS.
So finally the path of my file will look something like:
Path path = new Path("hdfs://123.23.12.4344:9000/user/filename.txt")
Now I want to know how to extract the host name ("123.23.12.4344") and the port (9000).
Basically, I want to access the FileSystem on Amazon EMR, but when I use FileSystem fs = FileSystem.get(getConf()); I get:
You possibly called FileSystem.get(conf) when you should have called FileSystem.get(uri, conf) to obtain a file system supporting your path
So I decided to use a URI (I have to use a URI), but I am not sure how to obtain it.
You can use either of the following two ways to resolve your error.

1.
String infile = "file.txt";
Path ofile = new Path(infile);
FileSystem fs = ofile.getFileSystem(getConf());

2.
Configuration conf = getConf();
System.out.println("fs.default.name : - " + conf.get("fs.default.name"));
// It prints the URI, e.g. hdfs://10.214.15.165:9000
String uri = conf.get("fs.default.name");
FileSystem fs = FileSystem.get(URI.create(uri), getConf());
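
If you actually need the host and port values themselves, a small sketch using java.net.URI on the configured default file system; it assumes fs.default.name (or fs.defaultFS) is set to something like hdfs://10.214.15.165:9000:

// Parse host and port out of the configured default file system URI
URI nnUri = URI.create(getConf().get("fs.default.name"));
String host = nnUri.getHost(); // e.g. "10.214.15.165"
int port = nnUri.getPort();    // e.g. 9000
System.out.println("host: " + host + ", port: " + port);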

Why Hadoop FileSystem.get method needs to know full URI and not just scheme

Is it possible for an instance of Hadoop FileSystem, created from one valid HDFS URL, to be reused for reading and writing different HDFS URLs? I have tried the following:
String url1 = "hdfs://localhost:54310/file1.txt";
String url2 = "hdfs://localhost:54310/file2.txt";
String url3 = "hdfs://localhost:54310/file3.txt";
//Creating filesystem using url1
FileSystem fileSystem = FileSystem.get(URI.create(url1), conf);
//Using same filesystem with url2 and url3
InputStream in = fileSystem.open(new Path(url2));
OutputStream out = fileSystem.create(new Path(url3));
This works, but will it cause any other issues?
You can certainly configure a single FileSystem with your scheme and address and then obtain it via FileSystem.get(conf):
Configuration conf = new Configuration();
conf.set("fs.default.name","hdfs://localhost:54310");
FileSystem fs = FileSystem.get(conf);
InputStream is = fs.open(new Path("/file1.txt"));
For paths on a different DFS, the create/open methods will fail. Look at the org.apache.hadoop.fs.FileSystem#checkPath method.
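
To illustrate that check, a small sketch (the second host name, otherhost, is made up): opening a path whose authority does not match the FileSystem instance typically fails in checkPath with an IllegalArgumentException ("Wrong FS: ..."):

// fs was created for hdfs://localhost:54310
FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:54310"), new Configuration());
try {
    // This path points at a different namenode, so checkPath rejects it
    fs.open(new Path("hdfs://otherhost:54310/file1.txt"));
} catch (IllegalArgumentException e) {
    System.out.println(e.getMessage()); // typically "Wrong FS: ..., expected: hdfs://localhost:54310"
}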

ListFiles from HDFS Cluster

I am an amateur with Hadoop. I am trying to access the Hadoop cluster (HDFS) and retrieve the list of files from an Eclipse client. After setting up the required configurations on the Hadoop Java client, I can do the following operations.
I can perform copyFromLocalFile and copyToLocalFile operations on HDFS from the client.
Here is what I am facing: when I call the listFiles() method, I get
org.apache.hadoop.fs.LocatedFileStatus#d0085360
org.apache.hadoop.fs.LocatedFileStatus#b7aa29bf
Main method:

Properties props = new Properties();
props.setProperty("fs.defaultFS", "hdfs://<IPOFCLUSTER>:8020");
props.setProperty("mapreduce.jobtracker.address", "<IPOFCLUSTER>:8032");
props.setProperty("yarn.resourcemanager.address", "<IPOFCLUSTER>:8032");
props.setProperty("mapreduce.framework.name", "yarn");
FileSystem fs = FileSystem.get(toConfiguration(props)); // Setting up the required configurations
Path p4 = new Path("/user/myusername/inputjson1/");
RemoteIterator<LocatedFileStatus> ritr = fs.listFiles(p4, true);
while (ritr.hasNext()) {
    System.out.println(ritr.next().toString());
}
I have also tried FileContext and ended up only getting the FileStatus object's string representation. Is there a way to get the file names when I iterate over the remote HDFS directory? There is a method called getPath(); is that the only way to retrieve the full path of the files using the Hadoop API, or is there another method that retrieves only the names of the files in a specified directory path? Please help me with this, thanks.
You can indeed use getPath(); it returns a Path object that lets you query the name of the file:
Path p = ritr.next().getPath();
// returns the filename or directory name if directory
String name = p.getName();
The FileStatus object you get can tell you if this is a file or directory.
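
Putting that together with the loop from the question (ritr is the iterator created above), a short sketch:

while (ritr.hasNext()) {
    LocatedFileStatus status = ritr.next();
    // Print just the file name; use getPath().toString() if you need the full URI
    System.out.println(status.getPath().getName());
}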
Here is more API documentation:
http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/fs/Path.html
http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/fs/FileStatus.html
