hadoop copy file to current directory from hdfs - java

I want to rewrite a function that copies a file from HDFS to a local path:
String src = args[0]; // hdfs
String dst = args[1]; // local path
/*
 * Prepare the input and output filesystems
 */
Configuration conf = new Configuration();
FileSystem inFS = FileSystem.get(URI.create(src), conf);
FileSystem outFS = FileSystem.get(URI.create(dst), conf);
/*
 * Prepare the input and output streams
 */
FSDataInputStream in = null;
FSDataOutputStream out = null;
// TODO: Your implementation goes here...
in = inFS.open(new Path(src));
out = outFS.create(new Path(dst),
    new Progressable() {
        /*
         * Print a dot whenever 64 KB of data has been written to
         * the datanode pipeline.
         */
        public void progress() {
            System.out.print(".");
        }
    });
// Copy the bytes and close both streams when done
IOUtils.copyBytes(in, out, 4096, true);
(1) When I set the local path to my current directory
--> Java I/O error: mkdirs failed to create file
(2) When I set the path to a non-existent folder
--> a new folder is created and my copied file is inside it.
What should I do?
I believe I should not use FileSystem.create() -- am I right?
EDIT
a link to the relevant FileSystem API documentation:
https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#create(org.apache.hadoop.fs.Path)
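
For what it's worth, a minimal sketch of an alternative (not from the original thread) that avoids managing the output stream yourself: FileSystem.copyToLocalFile resolves the destination against the local filesystem, and if the destination is an existing directory the file is copied into it under its own name. The class name HdfsToLocal is made up for illustration.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsToLocal {
    public static void main(String[] args) throws Exception {
        String src = args[0]; // hdfs path
        String dst = args[1]; // local path (existing dir or target file name)
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(URI.create(src), conf);
        // false = don't delete the source after copying
        hdfs.copyToLocalFile(false, new Path(src), new Path(dst));
    }
}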

Related

Check if file exists on remote HDFS from local spark-submit

I'm working on a Java program meant to work with Spark on an HDFS filesystem (located at HDFS_IP).
One of my goals is to check whether a file exists on HDFS at the path hdfs://HDFS_IP:HDFS_PORT/path/to/file.json. While debugging my program locally, I found that I can't access this remote file using the following code:
private boolean existsOnHDFS(String path) {
    Configuration conf = new Configuration();
    FileSystem fs;
    boolean fileDoesExist = false;
    try {
        fs = FileSystem.get(conf);
        fileDoesExist = fs.exists(new Path(path));
    } catch (IOException e) {
        e.printStackTrace();
    }
    return fileDoesExist;
}
Actually, fs.exists looks for the file hdfs://HDFS_IP:HDFS_PORT/path/to/file.json on my local FS and not on HDFS. By the way, keeping the hdfs://HDFS_IP:HDFS_PORT prefix makes fs.exists crash, and removing it returns false because /path/to/file.json does not exist locally.
What would be the appropriate configuration of fs to get things working properly both locally and when executing the Java program from a Hadoop cluster?
EDIT: I finally gave up and passed the bugfix to someone else in my team. Thanks to the people who tried to help me though !
The problem is that you are passing an empty Configuration to FileSystem.
You should create your FileSystem like this:
FileSystem.get(spark.sparkContext().hadoopConfiguration());
where spark is the SparkSession object.
As you can see in the code of FileSystem:
/**
 * Returns the configured filesystem implementation.
 * @param conf the configuration to use
 */
public static FileSystem get(Configuration conf) throws IOException {
    return get(getDefaultUri(conf), conf);
}

/** Get the default filesystem URI from a configuration.
 * @param conf the configuration to use
 * @return the uri of the default filesystem
 */
public static URI getDefaultUri(Configuration conf) {
    return URI.create(fixName(conf.get(FS_DEFAULT_NAME_KEY, DEFAULT_FS)));
}
It creates the URI based on the configuration passed as a parameter: it looks up the key FS_DEFAULT_NAME_KEY (fs.defaultFS) and falls back to DEFAULT_FS when that key is not set:
public static final String FS_DEFAULT_NAME_DEFAULT = "file:///";
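
Outside of Spark, the equivalent fix is to make sure fs.defaultFS points at the cluster before calling FileSystem.get. A minimal sketch, assuming the NameNode address from the question:

Configuration conf = new Configuration();
// Point the default filesystem at the remote NameNode instead of file:///
conf.set("fs.defaultFS", "hdfs://HDFS_IP:HDFS_PORT");
FileSystem fs = FileSystem.get(conf);
boolean fileDoesExist = fs.exists(new Path("/path/to/file.json"));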

Java - a file is saved to the wrong directory in S3 when copied from HDFS

I wrote a method for saving files from an HDFS directory to S3. But the files are getting saved to the wrong directory in S3. I've inspected the logs and have confirmed that the value of s3TargetPath is s3://bucketName/data and that hdfsSource is also resolved correctly.
However, instead of being saved to s3TargetPath, they are saved to s3://bucketName//data/.
The s3://bucketName//data/ directory also contains a file data with content type binary/octet-stream and fs-type: Hadoop block.
What needs to be changed in my code to save files to the right S3 path?
private String hdfsPath = "hdfs://localhost:9010/user/usr1/data";
private String s3Path = "s3://bucketName/data";

copyFromHdfstoS3(hdfsPath, s3Path);

//
void copyFromHdfstoS3(String hdfsDir, String s3sDir) throws Exception {
    Configuration conf = new Configuration();
    FileSystem hdfs = FileSystem.get(new URI(hdfsDir), conf);
    FileSystem s3Fs = FileSystem.get(new URI(s3sDir), conf);
    Path hdfsSource = new Path(hdfsDir);
    Path s3TargetPath = new Path(s3sDir);
    RemoteIterator<LocatedFileStatus> sourceFiles = hdfs.listFiles(hdfsSource, false);
    if (!s3Fs.exists(s3TargetPath)) {
        s3Fs.mkdirs(s3TargetPath);
    }
    if (sourceFiles != null) {
        while (sourceFiles.hasNext()) {
            Path srcFilePath = sourceFiles.next().getPath();
            if (FileUtil.copy(hdfs, srcFilePath, s3Fs, s3TargetPath, false, true, new Configuration())) {
                LOG.info("Copied Successfully");
            } else {
                LOG.info("Copy Failed");
            }
        }
    }
}
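
The thread includes no accepted fix, but one plausible cause (my assumption, not confirmed by the source) is the URI scheme: in classic Hadoop, s3:// selects the block-based S3 filesystem, which stores Hadoop blocks rather than plain objects, which would match the fs-type: Hadoop block metadata observed above. A sketch of the change, using the s3a:// native connector (requires hadoop-aws on the classpath):

// Assumption: s3a writes ordinary S3 objects instead of Hadoop blocks
private String s3Path = "s3a://bucketName/data";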

Not able to copy one HDFS data to another HDFS location using distcp

I am trying to copy data from one HDFS location to another.
I am able to achieve this using the "distcp" shell command:
hadoop distcp hdfs://mySrcip:8020/copyDev/* hdfs://myDestip:8020/copyTest
But I want to do the same using the Java API.
After a long search I found some code and executed it, but it didn't copy my source file to the destination.
public class TouchFile {

    /**
     * @param args
     * @throws Exception
     */
    public static void main(String[] args) throws Exception {
        // create configuration object
        Configuration config = new Configuration();
        config.set("fs.defaultFS", "hdfs://mySrcip:8020/");
        config.set("hadoop.job.ugi", "hdfs");
        /*
         * Distcp
         */
        String sourceNameNode = "hdfs://mySrcip:8020/copyDev";
        String destNameNode = "hdfs://myDestip:8020/copyTest";
        String fileList = "myfile.txt";
        distFileCopy(config, sourceNameNode, destNameNode, fileList);
    }

    /**
     * Copies files from one cluster to another using Hadoop's distributed copy features.
     * Uses input to build DistCp configuration settings.
     *
     * @param config Hadoop configuration
     * @param sourceNameNode full HDFS path to parent source directory
     * @param destNameNode full HDFS path to parent destination directory
     * @param fileList comma-separated string of file names in sourceNameNode to be copied to destNameNode
     * @return elapsed time in milliseconds to copy files
     */
    public static long distFileCopy(Configuration config, String sourceNameNode, String destNameNode, String fileList) throws Exception {
        System.out.println("In dist copy");
        StringTokenizer tokenizer = new StringTokenizer(fileList, ",");
        ArrayList<String> list = new ArrayList<>();
        while (tokenizer.hasMoreTokens()) {
            String file = sourceNameNode + "/" + tokenizer.nextToken();
            list.add(file);
        }
        String[] args = new String[list.size() + 1];
        int count = 0;
        for (String filename : list) {
            args[count++] = filename;
        }
        args[count] = destNameNode;
        System.out.println("args------>" + Arrays.toString(args));
        long st = System.currentTimeMillis();
        DistCp distCp = new DistCp(config, null);
        distCp.run(args);
        return System.currentTimeMillis() - st;
    }
}
Am I doing anything wrong?
Please suggest.
Yes, it is solved.
It was a permission issue.
The destination cluster has to grant write permission to the user performing the copy.
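
For reference, a hypothetical sketch of granting that permission programmatically (the "hdfs" superuser and the 775 mode are assumptions, not from the thread; running hdfs dfs -chmod on the destination cluster achieves the same):

// Connect to the destination cluster as its superuser and open up the target dir
FileSystem destFs = FileSystem.get(new URI("hdfs://myDestip:8020/"), config, "hdfs");
destFs.setPermission(new Path("/copyTest"), new FsPermission("775"));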

file not adding to DistributedCache

I am running Hadoop on my local system, in the Eclipse environment.
I tried to put a local file from my workspace into the distributed cache in the driver as:
DistributedCache.addCacheFile(
    new Path("/home/hduser/workspace/myDir/myFile").toUri(), conf);
but when I try to access it from the Mapper, it returns null.
Inside the mapper, I checked whether the file was cached:
System.out.println("Cache: " + context.getConfiguration().get("mapred.cache.files"));
it prints null, and
Path[] cacheFilesLocal = DistributedCache.getLocalCacheFiles(context.getConfiguration());
also returns null.
What's going wrong?
It's because you can only add files to the DistributedCache from HDFS, not the local file system, so the Path doesn't exist. Put the file on HDFS and use the HDFS path to refer to it when adding it to the DistributedCache.
See DistributedCache for more information.
Add file:// to the path when you add the cache file (note that addCacheFile takes a URI, so toUri() is still needed):
DistributedCache.addCacheFile(new Path("file:///home/hduser/workspace/myDir/myFile").toUri(), conf);
Try this:
DRIVER Class
FileSystem fs = FileSystem.get(conf); // needed for globStatus below
Path p = new Path("your/file/path"); // placeholder: your input path
FileStatus[] list = fs.globStatus(p);
for (FileStatus status : list) {
    /* Storing file to distributed cache */
    DistributedCache.addCacheFile(status.getPath().toUri(), conf);
}
Mapper class
public void setup(Context context) throws IOException {
    /* Accessing data in file */
    Configuration conf = context.getConfiguration();
    FileSystem fs = FileSystem.get(conf);
    URI[] cacheFiles = DistributedCache.getCacheFiles(conf);
    /* Accessing the 0th cached file */
    Path getPath = new Path(cacheFiles[0].getPath());
    /* Read data */
    BufferedReader bf = new BufferedReader(new InputStreamReader(fs.open(getPath)));
    String setupData = null;
    while ((setupData = bf.readLine()) != null) {
        /* Print file content */
        System.out.println("Setup Line " + setupData);
    }
    bf.close();
}

public void map() {
}
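
Note that DistributedCache is deprecated in Hadoop 2.x; a short sketch of the replacement Job API, in case it helps (the HDFS path below is a placeholder):

// Driver: register a file that is already on HDFS
Job job = Job.getInstance(conf);
job.addCacheFile(new Path("/user/hduser/myDir/myFile").toUri());

// Mapper.setup: retrieve the registered URIs
URI[] cacheFiles = context.getCacheFiles();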

Copying a Directory and All of Its Contents - Java - Netbeans

My goal is to write a Java program in NetBeans that copies a directory and all of its contents, including subdirectories and their contents. To do so, I first ask the user for the source directory and the destination where it will be copied. From there, my program should make a new directory in the new location with the same name as the source directory. Then it should create an array of File objects for each item in the source directory. Next, it should iterate over the array, and for each item: if it is a file, copy it to the new directory; if it is a directory, recursively call the method to copy that directory and all of its contents.
An extremely useful program if I could just get it to work correctly. It is difficult for me right now to understand the entire logic needed to make this program run efficiently.
When I run the program, it reports that the file cannot be found, but this is just not true. So my code has to be wrong somewhere. Any help would be greatly appreciated, guys.
Thank you.
package copydirectories;

/**
 * CSCI 112, Ch. 18 Ex. 5
 * @author zhughes3
 * Last edited Tuesday, March 11th, 2014 @ 9pm
 */
import java.io.*;
import java.util.Scanner;

public class CopyDirectories {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) throws Exception {
        //Create a new instance of scanner to get user input
        Scanner scanner = new Scanner(System.in);
        //Ask user to input the directory to be copied
        System.out.print("Input directory to be copied.");
        //Save input as String
        String dirName = scanner.nextLine();
        //Ask user to input destination where directory will be copied
        System.out.print("Input destination directory will be moved to.");
        //Save input as String
        String destName = scanner.nextLine();
        //Run method to determine if it is a directory or file
        isDirFile(dirName, destName);
    }//end main

    public static void isDirFile(String source, String dest) throws Exception {
        //Create a File object for new directory in new location with same name
        //as source directory
        File dirFile = new File(dest + source);
        //Make the new directory
        dirFile.mkdir();
        //Create an array of File class objects for each item in the source
        //directory
        File[] entries;
        //If source directory exists
        if (dirFile.exists()) {
            //If the source directory is a directory
            if (dirFile.isDirectory()) {
                //Get the data and load the array
                entries = dirFile.listFiles();
                //Iterate the array using alternate for statement
                for (File entry : entries) {
                    if (entry.isFile()) {
                        copyFile(entry.getAbsolutePath(), dest);
                    } //end if
                    else {
                        isDirFile(entry.getAbsolutePath(), dest);
                    } //end else
                }//end for
            }//end if
        }//end if
        else {
            System.out.println("File does not exist.");
        } //end else
    }

    public static void copyFile(String source, String dest) throws Exception {
        //declare Files
        File sourceFile = null;
        File destFile = null;
        //declare stream variables
        FileInputStream sourceStream = null;
        FileOutputStream destStream = null;
        //declare buffering variables
        BufferedInputStream bufferedSource = null;
        BufferedOutputStream bufferedDest = null;
        try {
            //Create File objects for source and destination files
            sourceFile = new File(source);
            destFile = new File(dest);
            //Create file streams for the source and destination
            sourceStream = new FileInputStream(sourceFile);
            destStream = new FileOutputStream(destFile);
            //Buffer the file streams with a buffer size of 8k
            bufferedSource = new BufferedInputStream(sourceStream, 8182);
            bufferedDest = new BufferedOutputStream(destStream, 8182);
            //Use an integer to transfer data between files
            int transfer;
            //Alert user as to what is happening
            System.out.println("Beginning file copy:");
            System.out.println("\tCopying " + source);
            System.out.println("\tTo " + dest);
            //Read a byte while checking for End of File (EOF)
            while ((transfer = bufferedSource.read()) != -1) {
                //Write a byte
                bufferedDest.write(transfer);
            }//end while
        }//end try
        catch (IOException e) {
            e.printStackTrace();
            System.out.println("An unexpected I/O error occurred.");
        }//end catch
        finally {
            //close file streams
            if (bufferedSource != null)
                bufferedSource.close();
            if (bufferedDest != null)
                bufferedDest.close();
            System.out.println("Your files have been copied correctly and "
                    + "closed.");
        }//end finally
    }//end copyFile
}//end class
If you read through the JavaDocs, you will see that mkdir...
Creates the directory named by this abstract pathname.
This is a little obscure, but basically it will only create the last level of the directory. For example, given the path C:\this\is\a\long\path, mkdir would only attempt to create the path directory at the end; if C:\this\is\a\long doesn't exist, it will fail.
Instead, if you use mkdirs
Creates the directory named by this abstract pathname, including any
necessary but nonexistent parent directories. Note that if this
operation fails it may have succeeded in creating some of the
necessary parent directories.
I'd also check the results of these methods, as they will indicate whether the operation was successful.
I think entries = dirFile.listFiles(); looks wrong, you should be listing the files from the source directory, not the destination...
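
Putting both answers together, a sketch of a corrected isDirFile (a sketch, not the asker's final code): it lists the source directory rather than the destination, builds destination paths with File(parent, child) instead of string concatenation, and uses mkdirs so missing parents are created:

public static void isDirFile(String source, String dest) throws Exception {
    File sourceFile = new File(source);
    if (!sourceFile.exists()) {
        System.out.println("File does not exist.");
        return;
    }
    //Create the matching directory on the destination side, parents included
    File destDir = new File(dest, sourceFile.getName());
    destDir.mkdirs();
    //List the SOURCE directory, not the (empty) destination one
    for (File entry : sourceFile.listFiles()) {
        if (entry.isFile()) {
            //Pass a full destination file path so copyFile writes a file, not a dir
            copyFile(entry.getAbsolutePath(),
                     new File(destDir, entry.getName()).getAbsolutePath());
        } else {
            isDirFile(entry.getAbsolutePath(), destDir.getAbsolutePath());
        }
    }
}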
