Check if file exists on remote HDFS from local spark-submit - java

I'm working on a Java program that works with Spark on an HDFS filesystem (located at HDFS_IP).
One of my goals is to check whether a file exists on the HDFS at path hdfs://HDFS_IP:HDFS_PORT/path/to/file.json. While debugging my program locally, I found that I can't access this remote file using the following code:
private boolean existsOnHDFS(String path) {
    Configuration conf = new Configuration();
    FileSystem fs;
    Boolean fileDoesExist = false;
    try {
        fs = FileSystem.get(conf);
        fileDoesExist = fs.exists(new Path(path));
    } catch (IOException e) {
        e.printStackTrace();
    }
    return fileDoesExist;
}
In fact, fs.exists looks for the file hdfs://HDFS_IP:HDFS_PORT/path/to/file.json on my local FS and not on HDFS. Keeping the hdfs://HDFS_IP:HDFS_PORT prefix makes fs.exists crash, and removing it returns false because /path/to/file.json does not exist locally.
What would be the appropriate configuration of fs so that this works properly both locally and when the Java program is executed from a Hadoop cluster?
EDIT: I finally gave up and handed the bugfix off to someone else on my team. Thanks to the people who tried to help me, though!

The problem is that you are passing an empty Configuration to FileSystem.
You should create your FileSystem like this:
FileSystem.get(spark.sparkContext().hadoopConfiguration());
where spark is the SparkSession object.
As you can see in the code of FileSystem:
/**
 * Returns the configured filesystem implementation.
 * @param conf the configuration to use
 */
public static FileSystem get(Configuration conf) throws IOException {
    return get(getDefaultUri(conf), conf);
}

/** Get the default filesystem URI from a configuration.
 * @param conf the configuration to use
 * @return the uri of the default filesystem
 */
public static URI getDefaultUri(Configuration conf) {
    return URI.create(fixName(conf.get(FS_DEFAULT_NAME_KEY, DEFAULT_FS)));
}
It creates the URI based on the configuration passed as a parameter: it looks up the key FS_DEFAULT_NAME_KEY (fs.defaultFS), falling back to DEFAULT_FS, which is:
public static final String FS_DEFAULT_NAME_DEFAULT = "file:///";
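So with an empty Configuration the default filesystem is the local one. A minimal sketch (mine, not the original poster's code) of how existsOnHDFS could reuse the SparkSession's Hadoop configuration, assuming a SparkSession named spark is available:
private boolean existsOnHDFS(SparkSession spark, String path) {
    try {
        // The Spark configuration carries the cluster's fs.defaultFS (hdfs://HDFS_IP:HDFS_PORT)
        FileSystem fs = FileSystem.get(spark.sparkContext().hadoopConfiguration());
        return fs.exists(new Path(path));
    } catch (IOException e) {
        e.printStackTrace();
        return false;
    }
}
Alternatively, the target filesystem can be derived from the path itself with FileSystem.get(URI.create(path), conf), which also works for fully qualified hdfs:// paths.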

Related

How to find if the file exists in hdfs using Java?

I am trying to find if the trigger files exist in the hdfs directory.
Code:
private static final int index = 23;

@SuppressWarnings("serial")
private static HashMap<String, Boolean> files = new HashMap<String, Boolean>() {{
    put("/user/ct_troy/allfiles/_TRIG1", false);
    put("/user/ct_troy/allfiles/_TRIG2", false);
    put("/user/ct_troy/allfiles/_TRIG3", false);
    put("/user/ct_troy/allfiles/_TRIG4", false);
    put("/user/ct_troy/allfiles/_TRIG5", false);
}};

private static boolean availableFiles(String file_name) {
    Configuration config = new Configuration();
    config.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
    config.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName());
    try {
        FileSystem hdfs = FileSystem.get(config);
        // Hadoop DFS Path - Input file
        Path path = new Path(file_name); // file_name - complete path and file name.
        // Check if input is valid
        if (hdfs.exists(path) == false) {
            System.out.println(file_name + " not found.");
            throw new FileNotFoundException(file_name.substring(index));
        } else {
            System.out.println(file_name + " File Present.");
            return true;
        }
    } catch (IOException e) {
    }
    return false;
}
I am passing the keys of the HashMap files as the file_name argument to availableFiles.
I built a jar and ran it on the node; it gave me the following output:
_TRIG2 not found.
_TRIG3 not found.
_TRIG1 not found.
_TRIG4 not found.
_TRIG5 not found.
I'm not sure why this is happening: _TRIG1, _TRIG2 and _TRIG3 exist, whereas _TRIG4 and _TRIG5 don't, yet it gives me the same result for all the trigger files. Help.
From the official documentation, you can check whether the files you need exist with a direct shell call; see https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html#test
test
Usage: hadoop fs -test -[defswrz] URI
Options:
-d: if the path is a directory, return 0.
-e: if the path exists, return 0.
-f: if the path is a file, return 0.
-s: if the path is not empty, return 0.
-w: if the path exists and write permission is granted, return 0.
-r: if the path exists and read permission is granted, return 0.
-z: if the file is zero length, return 0.
Example:
hadoop fs -test -e filename
Your Java code looks fine. Maybe your if test is not well written:
if (!hdfs.exists(path)) { // <=================== see the change and test it.

Can't copy from HDFS to S3A

I have a class to copy directory content from one location to another using Apache FileUtil:
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

class Folder {
    private final FileSystem fs;
    private final Path pth;
    // ... constructors and other methods

    /**
     * Copy contents (files and files in subfolders) to another folder.
     * Merges overlapping folders.
     * Overwrites already existing files.
     * @param destination Folder where content will be moved to
     * @throws IOException If fails
     */
    public void copyFilesTo(final Folder destination) throws IOException {
        final RemoteIterator<LocatedFileStatus> iter = this.fs.listFiles(
            this.pth,
            true
        );
        final URI root = this.pth.toUri();
        while (iter.hasNext()) {
            final Path source = iter.next().getPath();
            FileUtil.copy(
                this.fs,
                source,
                destination.fs,
                new Path(
                    destination.pth,
                    root.relativize(source.toUri()).toString()
                ),
                false,
                true,
                this.fs.getConf()
            );
        }
    }
}
This class works fine with local (file:///) directories in a unit test, but when I try to use it on a Hadoop cluster to copy files from HDFS (hdfs:///tmp/result) to Amazon S3 (s3a://mybucket/out), it doesn't copy anything and doesn't throw an error; it just silently skips the copy.
When I use the same class (with either the HDFS or S3A filesystem) for another purpose it works fine, so the configuration and fs reference should be OK here.
What am I doing wrong? How do I copy files from HDFS to S3A correctly?
I'm using Hadoop 2.7.3.
UPDATE
I've added more logging to the copyFilesTo method to log the root, source and target variables (and extracted a rebase() method without changing the logic):
/**
 * Copy contents (files and files in subfolders) to another folder.
 * Merges overlapping folders.
 * Overwrites already existing files.
 * @param dst Folder where content will be moved to
 * @throws IOException If fails
 */
public void copyFilesTo(final Folder dst) throws IOException {
    Logger.info(
        this, "copyFilesTo(%s): from %s fs=%s",
        dst, this, this.hdfs
    );
    final RemoteIterator<LocatedFileStatus> iter = this.hdfs.listFiles(
        this.pth,
        true
    );
    final URI root = this.pth.toUri();
    Logger.info(this, "copyFilesTo(%s): root=%s", dst, root);
    while (iter.hasNext()) {
        final Path source = iter.next().getPath();
        final Path target = Folder.rebase(dst.path(), this.path(), source);
        Logger.info(
            this, "copyFilesTo(%s): src=%s target=%s",
            dst, source, target
        );
        FileUtil.copy(
            this.hdfs,
            source,
            dst.hdfs,
            target,
            false,
            true,
            this.hdfs.getConf()
        );
    }
}

/**
 * Change the base of target URI to new base, using root
 * as common path.
 * @param base New base
 * @param root Common root
 * @param target Target to rebase
 * @return Path with new base
 */
static Path rebase(final Path base, final Path root, final Path target) {
    return new Path(
        base, root.toUri().relativize(target.toUri()).toString()
    );
}
After running it on the cluster, I got these logs:
io.Folder: copyFilesTo(hdfs:///tmp/_dst): from hdfs:///tmp/_src fs=DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_182008924_1, ugi=hadoop (auth:SIMPLE)]]
io.Folder: copyFilesTo(hdfs:///tmp/_dst): root=hdfs:///tmp/_src
INFO io.Folder: copyFilesTo(hdfs:///tmp/_dst): src=hdfs://ip-172-31-2-12.us-east-2.compute.internal:8020/tmp/_src/one.file target=hdfs://ip-172-31-2-12.us-east-2.compute.internal:8020/tmp/_src/one.file
I've localized the problem to the rebase() method. It doesn't work correctly when running on the EMR cluster because RemoteIterator returns URIs in the fully qualified remote format, hdfs://ip-172-31-2-12.us-east-2.compute.internal:8020/tmp/_src/one.file, while this method expects the format hdfs:///tmp/_src/one.file; that is also why it works locally with the file:/// FS.
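To illustrate (this snippet is mine, not from the original post): java.net.URI.relativize only strips the common prefix when the scheme and authority of both URIs are identical; otherwise it returns the target unchanged, which is exactly the fully qualified "relative" path seen in the log above.
// Illustration only: the authorities differ, so relativize() returns source as-is.
URI root = URI.create("hdfs:///tmp/_src");
URI source = URI.create("hdfs://ip-172-31-2-12.us-east-2.compute.internal:8020/tmp/_src/one.file");
System.out.println(root.relativize(source));
// prints hdfs://ip-172-31-2-12.us-east-2.compute.internal:8020/tmp/_src/one.file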
I don't see anything obviously wrong.
Does it do hdfs-hdfs or s3a-s3a?
Upgrade your Hadoop version; 2.7.x is woefully out of date, especially with the S3A code. It's unlikely to make whatever this problem is go away, but it will avoid other issues. Once you've upgraded, switch to the fast upload and it will do incremental uploads of large files; currently your code will be saving each file to /tmp somewhere and then uploading it in the close() call.
Turn on logging for the org.apache.hadoop.fs.s3a module and see what it says.
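One hypothetical way to do that from code, assuming the log4j 1.x bundled with Hadoop 2.7 is the active logging backend (the conventional alternative is a log4j.properties entry, log4j.logger.org.apache.hadoop.fs.s3a=DEBUG):
// Assumption: log4j 1.x is on the classpath, as it is with stock Hadoop 2.7.
org.apache.log4j.Logger
    .getLogger("org.apache.hadoop.fs.s3a")
    .setLevel(org.apache.log4j.Level.DEBUG);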
I'm not sure this is the best or fully correct solution, but it works for me. The idea is to fix the host and port of the local paths before rebasing; the working rebase method is:
/**
 * Change the base of target URI to new base, using root
 * as common path.
 * @param base New base
 * @param root Common root
 * @param target Target to rebase
 * @return Path with new base
 * @throws IOException If fails
 */
@SuppressWarnings("PMD.DefaultPackage")
static Path rebase(final Path base, final Path root, final Path target)
    throws IOException {
    final URI uri = target.toUri();
    try {
        return new Path(
            new Path(
                new URIBuilder(base.toUri())
                    .setHost(uri.getHost())
                    .setPort(uri.getPort())
                    .build()
            ),
            new Path(
                new URIBuilder(root.toUri())
                    .setHost(uri.getHost())
                    .setPort(uri.getPort())
                    .build()
                    .relativize(uri)
            )
        );
    } catch (final URISyntaxException err) {
        throw new IOException("Failed to rebase", err);
    }
}

hadoop copy file to current directory from hdfs

I want to rewrite this function that copies a file from HDFS to a local path:
String src = args[0]; // hdfs
String dst = args[1]; // local path

/*
 * Prepare the input and output filesystems
 */
Configuration conf = new Configuration();
FileSystem inFS = FileSystem.get(URI.create(src), conf);
FileSystem outFS = FileSystem.get(URI.create(dst), conf);

/*
 * Prepare the input and output streams
 */
FSDataInputStream in = null;
FSDataOutputStream out = null;

// TODO: Your implementation goes here...
in = inFS.open(new Path(src));
out = outFS.create(new Path(dst),
    new Progressable() {
        /*
         * Print a dot whenever 64 KB of data has been written to
         * the datanode pipeline.
         */
        public void progress() {
            System.out.print(".");
        }
    });
(1) When I set the local path to my current directory
--> Java I/O error: mkdirs failed to create file
(2) When I set the path to a non-existent folder
--> a new folder is created and my copied file is inside it.
What should I do?
I believe I should not use FileSystem.create(), should I?
EDIT
a related filesystem library link:
https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#create(org.apache.hadoop.fs.Path)
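For reference, a minimal sketch (not from the thread) of how the stream copy is usually finished once both streams are open; it assumes dst names a file rather than a directory and uses the stock Hadoop IOUtils helper:
// Hypothetical completion of the TODO above: copy the bytes, then close both
// streams (the final 'true' closes in and out when the copy finishes).
org.apache.hadoop.io.IOUtils.copyBytes(in, out, 4096, true);
Alternatively, FileSystem.copyToLocalFile(new Path(src), new Path(dst)) performs the whole HDFS-to-local copy in one call.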

file not adding to DistributedCache

I am running Hadoop on my local system, in the Eclipse environment.
I tried to put a local file from the workspace into the distributed cache in the driver function like this:
DistributedCache.addCacheFile(new Path(
"/home/hduser/workspace/myDir/myFile").toUri(), conf);
But when I try to access it from the Mapper, it returns null.
Inside the mapper, I checked whether the file was cached:
System.out.println("Cache: " + context.getConfiguration().get("mapred.cache.files"));
It prints "null", and also
Path[] cacheFilesLocal = DistributedCache.getLocalCacheFiles(context.getConfiguration());
returns null.
What's going wrong?
That's because you can only add files to the DistributedCache from HDFS, not the local file system, so the Path doesn't exist. Put the file on HDFS and use the HDFS path to refer to it when adding it to the DistributedCache.
See the DistributedCache documentation for more information.
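A minimal sketch of that suggestion (the HDFS target path /user/hduser/cache/myFile is only an example):
// Hypothetical: copy the workspace file to HDFS first, then cache the HDFS copy.
FileSystem fs = FileSystem.get(conf);
Path hdfsCopy = new Path("/user/hduser/cache/myFile"); // example location
fs.copyFromLocalFile(new Path("/home/hduser/workspace/myDir/myFile"), hdfsCopy);
DistributedCache.addCacheFile(hdfsCopy.toUri(), conf);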
Add file:// to the path when you add the cache file:
DistributedCache.addCacheFile(new Path("file:///home/hduser/workspace/myDir/myFile").toUri(), conf);
Try this
DRIVER Class
Path p = new Path(your/file/path);
FileStatus[] list = fs.globStatus(p);
for (FileStatus status : list) {
    /* Storing file to distributed cache */
    DistributedCache.addCacheFile(status.getPath().toUri(), conf);
}
Mapper class
public void setup(Context context) throws IOException {
    /* Accessing data in file */
    Configuration conf = context.getConfiguration();
    FileSystem fs = FileSystem.get(conf);
    URI[] cacheFiles = DistributedCache.getCacheFiles(conf);
    /* Accessing 0th cached file */
    Path getPath = new Path(cacheFiles[0].getPath());
    /* Read data */
    BufferedReader bf = new BufferedReader(new InputStreamReader(fs.open(getPath)));
    String setupData = null;
    while ((setupData = bf.readLine()) != null) {
        /* Print file content */
        System.out.println("Setup Line " + setupData);
    }
    bf.close();
}

public void map() {
}

how to run mahout kmeans algorithm in local mode

Is it possible to run a Mahout k-means Java program locally, so that it reads the data from the local file system and saves it back locally instead of to HDFS?
All the examples on the internet work with HDFS.
https://github.com/tdunning/MiA/blob/master/src/main/java/mia/clustering/ch07/SimpleKMeansClustering.java
Yes, it is possible; check out SequenceFile.Writer. See the following code example, which writes clustered data points to a file. (There is also a blog post that describes this in more detail.)
public static void writePointsToFile(List<Vector> points,
                                     String fileName,
                                     FileSystem fs,
                                     Configuration conf) throws IOException {
    Path path = new Path(fileName);
    SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf,
        path, LongWritable.class, VectorWritable.class);
    long recNum = 0;
    VectorWritable vec = new VectorWritable();
    for (Vector point : points) {
        vec.set(point);
        writer.append(new LongWritable(recNum++), vec);
    }
    writer.close();
}
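A short usage sketch (mine, not from the answer): for a purely local run, hand the method a LocalFileSystem obtained via FileSystem.getLocal, and the sequence file is written to the local disk instead of HDFS. The output path below is just an example, and points is assumed to be a List<Vector> built elsewhere.
// Hypothetical local-mode usage: no HDFS involved.
Configuration conf = new Configuration();
FileSystem fs = FileSystem.getLocal(conf); // local filesystem, not HDFS
writePointsToFile(points, "/tmp/mahout-input/points/part-00000", fs, conf);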
