I have a class that copies directory contents from one location to another using Apache Hadoop's FileUtil:
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;
class Folder {
    private final FileSystem fs;
    private final Path pth;

    // ... constructors and other methods

    /**
     * Copy contents (files and files in subfolders) to another folder.
     * Merges overlapping folders.
     * Overwrites already existing files.
     * @param destination Folder where content will be moved to
     * @throws IOException If fails
     */
    public void copyFilesTo(final Folder destination) throws IOException {
        final RemoteIterator<LocatedFileStatus> iter = this.fs.listFiles(
            this.pth,
            true
        );
        final URI root = this.pth.toUri();
        while (iter.hasNext()) {
            final Path source = iter.next().getPath();
            FileUtil.copy(
                this.fs,
                source,
                destination.fs,
                new Path(
                    destination.pth,
                    root.relativize(source.toUri()).toString()
                ),
                false,
                true,
                this.fs.getConf()
            );
        }
    }
}
This class works fine with local (file:///) directories in a unit test,
but when I try to use it on a Hadoop cluster to copy files from HDFS (hdfs:///tmp/result) to Amazon S3 (s3a://mybucket/out), it doesn't copy anything and doesn't throw an error; it just silently skips the copying.
When I use the same class (with either the HDFS or the S3A filesystem) for other purposes it works fine, so the configuration and the fs references should be OK here.
What am I doing wrong? How do I copy files from HDFS to S3A correctly?
I'm using Hadoop 2.7.3.
UPDATE
I've added more logging to the copyFilesTo method to log the root, source and target variables (and extracted a rebase() method without changing the logic):
/**
 * Copy contents (files and files in subfolders) to another folder.
 * Merges overlapping folders.
 * Overwrites already existing files.
 * @param dst Folder where content will be moved to
 * @throws IOException If fails
 */
public void copyFilesTo(final Folder dst) throws IOException {
    Logger.info(
        this, "copyFilesTo(%s): from %s fs=%s",
        dst, this, this.hdfs
    );
    final RemoteIterator<LocatedFileStatus> iter = this.hdfs.listFiles(
        this.pth,
        true
    );
    final URI root = this.pth.toUri();
    Logger.info(this, "copyFilesTo(%s): root=%s", dst, root);
    while (iter.hasNext()) {
        final Path source = iter.next().getPath();
        final Path target = Folder.rebase(dst.path(), this.path(), source);
        Logger.info(
            this, "copyFilesTo(%s): src=%s target=%s",
            dst, source, target
        );
        FileUtil.copy(
            this.hdfs,
            source,
            dst.hdfs,
            target,
            false,
            true,
            this.hdfs.getConf()
        );
    }
}

/**
 * Change the base of target URI to new base, using root
 * as common path.
 * @param base New base
 * @param root Common root
 * @param target Target to rebase
 * @return Path with new base
 */
static Path rebase(final Path base, final Path root, final Path target) {
    return new Path(
        base, root.toUri().relativize(target.toUri()).toString()
    );
}
After running it in the cluster I got these logs:
io.Folder: copyFilesTo(hdfs:///tmp/_dst): from hdfs:///tmp/_src fs=DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_182008924_1, ugi=hadoop (auth:SIMPLE)]]
io.Folder: copyFilesTo(hdfs:///tmp/_dst): root=hdfs:///tmp/_src
INFO io.Folder: copyFilesTo(hdfs:///tmp/_dst): src=hdfs://ip-172-31-2-12.us-east-2.compute.internal:8020/tmp/_src/one.file target=hdfs://ip-172-31-2-12.us-east-2.compute.internal:8020/tmp/_src/one.file
I've narrowed the problem down to the rebase() method: it doesn't work correctly when running in the EMR cluster, because RemoteIterator returns URIs in fully qualified form (hdfs://ip-172-31-2-12.us-east-2.compute.internal:8020/tmp/_src/one.file) while this method expects the short form (hdfs:///tmp/_src/one.file). That's also why it works locally with the file:/// FS.
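A minimal sketch of one way to make the two forms comparable (a hedged alternative using FileSystem.makeQualified, not the fix shown further down):
// Qualify the root path against the source FileSystem so its URI carries the same
// scheme and authority that RemoteIterator reports for each file.
final URI root = this.hdfs.makeQualified(this.pth).toUri();
// root becomes hdfs://ip-172-31-2-12...:8020/tmp/_src, so
// root.relativize(source.toUri()) now yields just "one.file".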
I don't see anything obviously wrong.
Does it do hdfs-hdfs or s3a-s3a?
Upgrade your Hadoop version; 2.7.x is woefully out of date, especially with the S3A code. It's unlikely to make this particular problem go away, but it will avoid other issues. Once you've upgraded, switch to the fast upload and it will do incremental uploads of large files; currently your code will be saving each file to /tmp somewhere and then uploading it in the close() call.
turn on the logging for the org.apache.hadoop.fs.s3a module and see what it says
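If changing log4j.properties on the cluster is awkward, the level can also be raised programmatically before the copy runs; a minimal sketch, assuming the log4j 1.x API that ships with Hadoop 2.7 is on the classpath:
import org.apache.log4j.Level;
import org.apache.log4j.Logger;

// Make the S3A connector verbose so upload/commit activity shows up in the task logs.
Logger.getLogger("org.apache.hadoop.fs.s3a").setLevel(Level.DEBUG);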
I'm not sure it's the best or fully correct solution, but it works for me. The idea is to fix the host and port of local paths before rebasing (URIBuilder here appears to be Apache HttpClient's org.apache.http.client.utils.URIBuilder); the working rebase method becomes:
/**
 * Change the base of target URI to new base, using root
 * as common path.
 * @param base New base
 * @param root Common root
 * @param target Target to rebase
 * @return Path with new base
 * @throws IOException If fails
 */
@SuppressWarnings("PMD.DefaultPackage")
static Path rebase(final Path base, final Path root, final Path target)
    throws IOException {
    final URI uri = target.toUri();
    try {
        return new Path(
            new Path(
                new URIBuilder(base.toUri())
                    .setHost(uri.getHost())
                    .setPort(uri.getPort())
                    .build()
            ),
            new Path(
                new URIBuilder(root.toUri())
                    .setHost(uri.getHost())
                    .setPort(uri.getPort())
                    .build()
                    .relativize(uri)
            )
        );
    } catch (final URISyntaxException err) {
        throw new IOException("Failed to rebase", err);
    }
}
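For illustration, with the paths from the log above the fixed method should now produce a target under the destination folder; a rough check (a hypothetical harness, and the caller has to handle the IOException the method now declares):
// Paths taken from the question's log output.
final Path base = new Path("hdfs:///tmp/_dst");
final Path root = new Path("hdfs:///tmp/_src");
final Path src = new Path(
    "hdfs://ip-172-31-2-12.us-east-2.compute.internal:8020/tmp/_src/one.file"
);
// Expected: hdfs://ip-172-31-2-12.us-east-2.compute.internal:8020/tmp/_dst/one.file
System.out.println(Folder.rebase(base, root, src));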
Related
How can I get a file from the resource folder of a built JAR? I can easily do it from the IDE with the following, however this does not work with JAR files. The code below just creates a copy of a file from my resource folder and saves it on the user's machine.
final ClassLoader classloader = Thread.currentThread().getContextClassLoader();
final InputStream configIs = classloader.getResourceAsStream("config.yml");
if (configIs != null) {
    final File configFile = new File(chosenPath + "/config.yml");
    Files.copy(configIs, configFile.toPath(), StandardCopyOption.REPLACE_EXISTING);
    configIs.close();
}
I have been trying to figure out how to do the same from within a JAR without much success, even after reading many other articles. Based on my research, the following was suggested. The code below loops through all of the entries in the JAR and prints each one next to the resource path. This produces a 500+ KB text file, so I am not exactly sure which entry is correct, but even if I find the correct one, what do I do with it? Since the last if statement checks whether the name starts with resourcePath, I assume this is the correct entry:
Entry:config.yml path:resources/problems.json
But how do I go from that to an InputStream? If there is a better way of doing this let me know, but so far I have not found any other resources on this topic.
final String resourcePath = "resources/problems.json";
final File jarFile = new File(getClass().getProtectionDomain().getCodeSource().getLocation().getPath());
if (jarFile.exists() && jarFile.isFile()) { // Run with JAR file
    System.out.println("Run as jar");
    try {
        final JarFile jar = new JarFile(jarFile);
        final Enumeration<JarEntry> entries = jar.entries(); // gives ALL entries in jar
        while (entries.hasMoreElements()) {
            final String name = entries.nextElement().getName();
            System.out.println("Entry:" + name + " path:" + resourcePath);
            if (name.startsWith(resourcePath)) { // filter according to the path
                System.out.println(name);
            }
        }
        jar.close();
    } catch (IOException exception) {
        exception.printStackTrace();
    }
}
Perhaps this will help a bit:
/**
 * Text files loaded with this method should work either within the IDE or your
 * distributive JAR file.<br><br>
 *
 * <b>Example Usage:</b><pre>
 * {@code
 * try {
 *     List<String> list = loadFileFromResources("/resources/textfiles/data_2.txt");
 *     for (String str : list) {
 *         System.out.println(str);
 *     }
 * }
 * catch (IOException ex) {
 *     System.err.println(ex);
 * } }</pre><br>
 *
 * As shown in the example usage above the file path requires the Resource
 * folder to be named "resources" which also contains a sub-folder named
 * "textfiles". This resource folder is to be located directly within the
 * Project "src" folder, for example:<pre>
 *
 * ProjectDirectoryName
 *     bin
 *     build
 *     lib
 *     dist
 *     src
 *         resources
 *             images
 *                 myImage_1.png
 *                 myImage_2.jpg
 *                 myImage_3.gif
 *             textfiles
 *                 data_1.txt
 *                 data_2.txt
 *                 data_3.txt
 *     test</pre><br>
 *
 * Upon creating the resources/images and resources/textfiles folders within
 * the src folder, your IDE should have created two packages within the Project
 * named resources.images and resources.textfiles. Images would of course be
 * related to the resources.images package and Text Files would be related to
 * the resources.textfiles package.<br>
 *
 * @param filePath (String) Based on the example folder structure above this
 * would be (as an example):<pre>
 *
 *     <b> "/resources/textfiles/data_2.txt" </b></pre>
 *
 * @return ({@code List<String>}) A List of String containing the file's content.
 */
public List<String> loadFileFromResources(String filePath) throws java.io.IOException {
    List<String> lines = new ArrayList<>();
    try (java.io.InputStream in = getClass().getResourceAsStream(filePath);
         java.io.BufferedReader reader = new java.io.BufferedReader(new java.io.InputStreamReader(in))) {
        String line;
        while ((line = reader.readLine()) != null) {
            lines.add(line);
        }
    }
    return lines;
}
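Back on the original question of turning a matching entry into a stream: java.util.jar.JarFile exposes getInputStream(entry), so once the scan finds the entry it can be opened and copied out directly. A minimal sketch reusing the jarFile and chosenPath variables from the question (the entry name is the hypothetical one printed above):
// Open one entry of the running JAR as an InputStream and copy it to disk.
try (JarFile jar = new JarFile(jarFile)) {
    final JarEntry entry = jar.getJarEntry("resources/problems.json");
    if (entry != null) {
        try (InputStream in = jar.getInputStream(entry)) {
            Files.copy(in, new File(chosenPath, "problems.json").toPath(),
                StandardCopyOption.REPLACE_EXISTING);
        }
    }
}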
I'm working on a Java program dedicated to working with Spark on an HDFS filesystem (located at HDFS_IP).
One of my goals is to check whether a file exists on HDFS at path hdfs://HDFS_IP:HDFS_PORT/path/to/file.json. While debugging my program locally, I found that I can't access this remote file using the following code:
private boolean existsOnHDFS(String path) {
    Configuration conf = new Configuration();
    FileSystem fs;
    Boolean fileDoesExist = false;
    try {
        fs = FileSystem.get(conf);
        fileDoesExist = fs.exists(new Path(path));
    } catch (IOException e) {
        e.printStackTrace();
    }
    return fileDoesExist;
}
Actually, fs.exists tries to look for the file hdfs://HDFS_IP:HDFS_PORT/path/to/file.json in my local FS and not on HDFS. By the way, keeping the hdfs://HDFS_IP:HDFS_PORT prefix makes fs.exists crash, and removing it answers false because /path/to/file.json does not exist locally.
What would be the appropriate configuration of fs to make this work properly both locally and when executing the Java program from a Hadoop cluster?
EDIT: I finally gave up and passed the bugfix to someone else in my team. Thanks to the people who tried to help me though !
The problem is that you are passing an empty Configuration to FileSystem.
You should create your FileSystem like this:
FileSystem.get(spark.sparkContext().hadoopConfiguration());
where spark is the SparkSession object.
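For illustration, the exists check from the question could then look roughly like this (a sketch, assuming a SparkSession is available wherever the check runs):
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.sql.SparkSession;

// Resolve the path against whatever fs.defaultFS the Spark session was configured with.
private boolean existsOnHDFS(final SparkSession spark, final String path) throws IOException {
    final FileSystem fs = FileSystem.get(spark.sparkContext().hadoopConfiguration());
    return fs.exists(new Path(path));
}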
As you can see in the code of FileSystem:
/**
 * Returns the configured filesystem implementation.
 * @param conf the configuration to use
 */
public static FileSystem get(Configuration conf) throws IOException {
    return get(getDefaultUri(conf), conf);
}

/** Get the default filesystem URI from a configuration.
 * @param conf the configuration to use
 * @return the uri of the default filesystem
 */
public static URI getDefaultUri(Configuration conf) {
    return URI.create(fixName(conf.get(FS_DEFAULT_NAME_KEY, DEFAULT_FS)));
}
It creates the URI based on the configuration passed as a parameter: it looks for the key FS_DEFAULT_NAME_KEY (fs.defaultFS), where the default DEFAULT_FS is:
public static final String FS_DEFAULT_NAME_DEFAULT = "file:///";
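So with an empty Configuration the path resolves against the local file system. If the code also has to run outside Spark, one option is to set the default filesystem explicitly; a small sketch (HDFS_IP and HDFS_PORT are the placeholders from the question):
Configuration conf = new Configuration();
// Point the default filesystem at the cluster's NameNode instead of file:///
conf.set("fs.defaultFS", "hdfs://HDFS_IP:HDFS_PORT");
FileSystem fs = FileSystem.get(conf);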
I am trying to get all folders and subfolders (directories, not files) whose names contain a specific character. I am using a custom IOFileFilter, but it seems to be ignored.
Collection<File> myFolders = FileUtils.listFilesAndDirs(new File(myFilesPath), new NotFileFilter(TrueFileFilter.INSTANCE), new CustomDirectoryFilter());
My custom filter is:
public class CustomDirectoryFilter extends AbstractFileFilter {
    /**
     * Checks to see if the File should be accepted by this filter.
     *
     * @param file the File to check
     * @return true if this file matches the test
     */
    @Override
    public boolean accept(File file) {
        return file.isDirectory() && file.getName().contains("#");
    }
}
I get only the root folder.
Try using the file-walker API (Files.walk):
Files.walk(Paths.get("/my/path/here")).filter(x->!Files.isRegularFile(x)).filter(x->x.toString().contains("#")).forEach(System.out::println);
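One caveat: Files.walk returns a lazily populated Stream that holds open directory handles, so it is best closed via try-with-resources; a slightly expanded sketch of the same idea (using Files.isDirectory since only directories are wanted):
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Same traversal, but the stream (and its directory handles) is closed afterwards.
try (Stream<Path> paths = Files.walk(Paths.get("/my/path/here"))) {
    paths.filter(Files::isDirectory)
        .filter(p -> p.toString().contains("#"))
        .forEach(System.out::println);
} catch (IOException e) {
    e.printStackTrace();
}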
Using commons-io FileUtils:
List<File> ff = (List<File>) FileUtils.listFilesAndDirs(new File("/Users/barath/elasticsearch-6.2.4"), new NotFileFilter(TrueFileFilter.INSTANCE), DirectoryFileFilter.DIRECTORY);
for (File f : ff) {
    if (f.toString().contains("#")) {
        System.out.println(f.toString());
    }
}
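As for why the original call returns only the root folder: listFilesAndDirs uses the directory filter both to decide which directories appear in the result and which ones to descend into, so a dirFilter that only accepts names containing "#" stops the walk at the first non-matching level. Traversing with a permissive directory filter and applying the name check afterwards avoids that; a small sketch reusing the question's myFilesPath variable (one possible approach, not the only one):
import java.io.File;
import java.util.Collection;
import java.util.stream.Collectors;
import org.apache.commons.io.FileUtils;
import org.apache.commons.io.filefilter.NotFileFilter;
import org.apache.commons.io.filefilter.TrueFileFilter;

// Walk every directory, then keep only those whose name contains "#".
Collection<File> allDirs = FileUtils.listFilesAndDirs(
    new File(myFilesPath),
    new NotFileFilter(TrueFileFilter.INSTANCE), // exclude regular files
    TrueFileFilter.INSTANCE                     // descend into (and list) every directory
);
Collection<File> matching = allDirs.stream()
    .filter(dir -> dir.getName().contains("#"))
    .collect(Collectors.toList());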
What is the Scala way to accomplish the same thing as Apache Commons Lang3's Validate? That is, validation aimed at user input, as opposed to coding errors caught via assertions, where failing the condition leads to an IllegalArgumentException, e.g.:
/**
 * Returns the newly created file only if the user entered a valid path.
 * @param path input path where to store the new file
 * @param fileName name of the file to be created in directory path
 * @throws IllegalArgumentException when the input path doesn't exist.
 */
public File createFile(File path, String fileName) {
    // make sure that the path exists before creating the file
    // TODO: is there a way to do this in Scala without the need for 3rd party libraries
    org.apache.commons.lang3.Validate.isTrue(path.exists(), "Illegal input path '" + path.getAbsolutePath() + "', it doesn't exist");
    // now it is safe to create the file ...
    File result = new File(path, fileName);
    // ...
    return result;
}
By coincidence I just found out that require is the method of choice in Scala; it throws an IllegalArgumentException when the condition fails, e.g.:
/**
 * Returns the newly created file only if the user entered a valid path.
 * @param path input path where to store the new file
 * @param fileName name of the file to be created in directory path
 * @throws IllegalArgumentException when the input path doesn't exist.
 */
def createFile(path: File, fileName: String): File = {
  require(path.exists, s"""Illegal input path "${path.getAbsolutePath()}", it doesn't exist""")
  // now it is safe to create the file ...
  val result = new File(path, fileName)
  // ...
  result
}
Is there a way in the java.io package to convert a relative path containing "../" to an absolute path?
My goal is to remove the "../" part of the path because
java.awt.Desktop.getDesktop().open()
does not seem to support "../" in file paths under Windows.
--- Edited when comment was made that the ../ was still in the path ---
import java.io.File;

public class Test {
    public static void main(String[] args) throws Exception {
        File file = new File("../../home");
        System.out.println(file.getCanonicalPath());
        System.out.println(file.getAbsolutePath());
    }
}
will run with the output
/home/ebuck/home
/home/ebuck/workspace/State/../../home
based on my current working directory of /home/ebuck/workspace/State
Note that you asked for the complete real file path; technically an absolute path is a complete, real file path, it's just not the shortest complete real file path. So, if you want to do it fast and dirty, you can just add "../../home" to the current working directory and obtain a full, complete file path (albeit a wordy one that contains unnecessary information).
If you want the shortest full, complete file path, that's what getCanonicalPath() is used for. It throws an exception, because some joker out there will probably ask for "../../home" when in the root directory.
--- Original post follows, with edits ---
new File("../../dir/file.ext").getCanonoicalPath();
Will do so collapsing (following) the relative path links (. and ..).
new File("../../dir/file.ext").getAbsolutePath();
Will do so without collapsing (following) the relative path links.
String path = new File("../../dir/file.ext").getCanonicalPath();
File f = new File("..");
String path = f.getAbsolutePath();
Sometimes File.getCanonicalPath() may not be desired since it may resolve things like symlinks, so if you want to maintain the "logical" path that File.getAbsolutePath() provides, you cannot use getCanonicalPath(). Also, IIRC, getCanonicalPath() can throw an exception while getAbsolutePath() does not, and getAbsolutePath() can refer to a path that does not exist.
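If symlink resolution is the concern, java.nio.file offers a middle ground: Path.normalize() collapses "." and ".." purely lexically, without touching the file system. A small sketch (the output assumes the same working directory as the earlier example):
import java.nio.file.Path;
import java.nio.file.Paths;

// Build an absolute path and strip the "../" segments without resolving symlinks.
Path normalized = Paths.get("../../dir/file.ext").toAbsolutePath().normalize();
// e.g. /home/ebuck/dir/file.ext when run from /home/ebuck/workspace/State
System.out.println(normalized);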
I encountered the '..' problem a long time ago. Here is a utility method I wrote to remove the '..' components in a path:
/**
 * Retrieve "clean" absolute path of a file.
 * <p>This method retrieves the absolute pathname of file,
 * with relative components (e.g. <tt>..</tt>) removed.
 * Java's <tt>File.getAbsolutePath()</tt> does not remove
 * relative components. For example, if given the pathname:
 * </p>
 * <pre>
 * dir/subdir/subsubdir/../../file.txt
 * </pre>
 * <p>{@link File#getAbsolutePath()} will return something like:
 * </p>
 * <pre>
 * /home/whomever/dir/subdir/subsubdir/../../file.txt
 * </pre>
 * <p>This method will return:
 * </p>
 * <pre>
 * /home/whomever/dir/file.txt
 * </pre>
 *
 * @param f File to get clean absolute path of.
 * @return Clean absolute pathname of <i>f</i>.
 */
public static String cleanAbsolutePath(File f) {
    String abs = f.getAbsolutePath();
    if (!relDirPattern.matcher(abs).find()) {
        // Nothing to do, so just return what Java provided
        return abs;
    }
    String[] parts = abs.split(fileSepRex);
    ArrayList<String> newPath = new ArrayList<String>(parts.length);
    int capacity = 0;
    for (String p : parts) {
        if (p.equals(".")) continue;
        if (p.equals("..")) {
            if (newPath.size() == 0) continue;
            String removed = newPath.remove(newPath.size() - 1);
            capacity -= removed.length();
            continue;
        }
        newPath.add(p);
        capacity += p.length();
    }
    int size = newPath.size();
    if (size == 0) {
        return File.separator;
    }
    StringBuilder result = new StringBuilder(capacity);
    int i = 0;
    for (String p : newPath) {
        ++i;
        result.append(p);
        if (i < size) {
            result.append(File.separatorChar);
        }
    }
    return result.toString();
}

/** Regex string representing file name separator. */
private static String fileSepRex = "\\" + File.separator;

/** Pattern for checking if pathname has relative components. */
private static Pattern relDirPattern = Pattern.compile(
    "(?:\\A|" + fileSepRex + ")\\.{1,2}(?:" + fileSepRex + "|\\z)");