Given two absolute paths, e.g.
/var/data/stuff/xyz.dat
/var/data
How can one create a relative path that uses the second path as its base? In the example above, the result should be: ./stuff/xyz.dat
It's a little roundabout, but why not use URI? It has a relativize method which does all the necessary checks for you.
String path = "/var/data/stuff/xyz.dat";
String base = "/var/data";
String relative = new File(base).toURI().relativize(new File(path).toURI()).getPath();
// relative == "stuff/xyz.dat"
Please note that for file path there's java.nio.file.Path#relativize since Java 1.7, as pointed out by #Jirka Meluzin in the other answer.
Since Java 7 you can use the relativize method:
import java.nio.file.Path;
import java.nio.file.Paths;
public class Test {
public static void main(String[] args) {
Path pathAbsolute = Paths.get("/var/data/stuff/xyz.dat");
Path pathBase = Paths.get("/var/data");
Path pathRelative = pathBase.relativize(pathAbsolute);
System.out.println(pathRelative);
}
}
Output:
stuff/xyz.dat
At the time of writing (June 2010), this was the only solution that passed my test cases. I can't guarantee that this solution is bug-free, but it does pass the included test cases. The method and tests I've written depend on the FilenameUtils class from Apache commons IO.
The solution was tested with Java 1.4. If you're using Java 1.5 (or higher) you should consider replacing StringBuffer with StringBuilder (if you're still using Java 1.4 you should consider a change of employer instead).
import java.io.File;
import java.util.regex.Pattern;
import org.apache.commons.io.FilenameUtils;
public class ResourceUtils {
/**
* Get the relative path from one file to another, specifying the directory separator.
* If one of the provided resources does not exist, it is assumed to be a file unless it ends with '/' or
* '\'.
*
* #param targetPath targetPath is calculated to this file
* #param basePath basePath is calculated from this file
* #param pathSeparator directory separator. The platform default is not assumed so that we can test Unix behaviour when running on Windows (for example)
* #return
*/
public static String getRelativePath(String targetPath, String basePath, String pathSeparator) {
// Normalize the paths
String normalizedTargetPath = FilenameUtils.normalizeNoEndSeparator(targetPath);
String normalizedBasePath = FilenameUtils.normalizeNoEndSeparator(basePath);
// Undo the changes to the separators made by normalization
if (pathSeparator.equals("/")) {
normalizedTargetPath = FilenameUtils.separatorsToUnix(normalizedTargetPath);
normalizedBasePath = FilenameUtils.separatorsToUnix(normalizedBasePath);
} else if (pathSeparator.equals("\\")) {
normalizedTargetPath = FilenameUtils.separatorsToWindows(normalizedTargetPath);
normalizedBasePath = FilenameUtils.separatorsToWindows(normalizedBasePath);
} else {
throw new IllegalArgumentException("Unrecognised dir separator '" + pathSeparator + "'");
}
String[] base = normalizedBasePath.split(Pattern.quote(pathSeparator));
String[] target = normalizedTargetPath.split(Pattern.quote(pathSeparator));
// First get all the common elements. Store them as a string,
// and also count how many of them there are.
StringBuffer common = new StringBuffer();
int commonIndex = 0;
while (commonIndex < target.length && commonIndex < base.length
&& target[commonIndex].equals(base[commonIndex])) {
common.append(target[commonIndex] + pathSeparator);
commonIndex++;
}
if (commonIndex == 0) {
// No single common path element. This most
// likely indicates differing drive letters, like C: and D:.
// These paths cannot be relativized.
throw new PathResolutionException("No common path element found for '" + normalizedTargetPath + "' and '" + normalizedBasePath
+ "'");
}
// The number of directories we have to backtrack depends on whether the base is a file or a dir
// For example, the relative path from
//
// /foo/bar/baz/gg/ff to /foo/bar/baz
//
// ".." if ff is a file
// "../.." if ff is a directory
//
// The following is a heuristic to figure out if the base refers to a file or dir. It's not perfect, because
// the resource referred to by this path may not actually exist, but it's the best I can do
boolean baseIsFile = true;
File baseResource = new File(normalizedBasePath);
if (baseResource.exists()) {
baseIsFile = baseResource.isFile();
} else if (basePath.endsWith(pathSeparator)) {
baseIsFile = false;
}
StringBuffer relative = new StringBuffer();
if (base.length != commonIndex) {
int numDirsUp = baseIsFile ? base.length - commonIndex - 1 : base.length - commonIndex;
for (int i = 0; i < numDirsUp; i++) {
relative.append(".." + pathSeparator);
}
}
relative.append(normalizedTargetPath.substring(common.length()));
return relative.toString();
}
static class PathResolutionException extends RuntimeException {
PathResolutionException(String msg) {
super(msg);
}
}
}
The test cases that this passes are
public void testGetRelativePathsUnix() {
assertEquals("stuff/xyz.dat", ResourceUtils.getRelativePath("/var/data/stuff/xyz.dat", "/var/data/", "/"));
assertEquals("../../b/c", ResourceUtils.getRelativePath("/a/b/c", "/a/x/y/", "/"));
assertEquals("../../b/c", ResourceUtils.getRelativePath("/m/n/o/a/b/c", "/m/n/o/a/x/y/", "/"));
}
public void testGetRelativePathFileToFile() {
String target = "C:\\Windows\\Boot\\Fonts\\chs_boot.ttf";
String base = "C:\\Windows\\Speech\\Common\\sapisvr.exe";
String relPath = ResourceUtils.getRelativePath(target, base, "\\");
assertEquals("..\\..\\Boot\\Fonts\\chs_boot.ttf", relPath);
}
public void testGetRelativePathDirectoryToFile() {
String target = "C:\\Windows\\Boot\\Fonts\\chs_boot.ttf";
String base = "C:\\Windows\\Speech\\Common\\";
String relPath = ResourceUtils.getRelativePath(target, base, "\\");
assertEquals("..\\..\\Boot\\Fonts\\chs_boot.ttf", relPath);
}
public void testGetRelativePathFileToDirectory() {
String target = "C:\\Windows\\Boot\\Fonts";
String base = "C:\\Windows\\Speech\\Common\\foo.txt";
String relPath = ResourceUtils.getRelativePath(target, base, "\\");
assertEquals("..\\..\\Boot\\Fonts", relPath);
}
public void testGetRelativePathDirectoryToDirectory() {
String target = "C:\\Windows\\Boot\\";
String base = "C:\\Windows\\Speech\\Common\\";
String expected = "..\\..\\Boot";
String relPath = ResourceUtils.getRelativePath(target, base, "\\");
assertEquals(expected, relPath);
}
public void testGetRelativePathDifferentDriveLetters() {
String target = "D:\\sources\\recovery\\RecEnv.exe";
String base = "C:\\Java\\workspace\\AcceptanceTests\\Standard test data\\geo\\";
try {
ResourceUtils.getRelativePath(target, base, "\\");
fail();
} catch (PathResolutionException ex) {
// expected exception
}
}
When using java.net.URI.relativize you should be aware of Java bug:
JDK-6226081 (URI should be able to relativize paths with partial roots)
At the moment, the relativize() method of URI will only relativize URIs when one is a prefix of the other.
Which essentially means java.net.URI.relativize will not create ".."'s for you.
In Java 7 and later you can simply use (and in contrast to URI, it is bug free):
Path#relativize(Path)
The bug referred to in another answer is addressed by URIUtils in Apache HttpComponents
public static URI resolve(URI baseURI,
String reference)
Resolves a URI reference against a
base URI. Work-around for bug in
java.net.URI ()
If you know the second string is part of the first:
String s1 = "/var/data/stuff/xyz.dat";
String s2 = "/var/data";
String s3 = s1.substring(s2.length());
or if you really want the period at the beginning as in your example:
String s3 = ".".concat(s1.substring(s2.length()));
Recursion produces a smaller solution. This throws an exception if the result is impossible (e.g. different Windows disk) or impractical (root is only common directory.)
/**
* Computes the path for a file relative to a given base, or fails if the only shared
* directory is the root and the absolute form is better.
*
* #param base File that is the base for the result
* #param name File to be "relativized"
* #return the relative name
* #throws IOException if files have no common sub-directories, i.e. at best share the
* root prefix "/" or "C:\"
*/
public static String getRelativePath(File base, File name) throws IOException {
File parent = base.getParentFile();
if (parent == null) {
throw new IOException("No common directory");
}
String bpath = base.getCanonicalPath();
String fpath = name.getCanonicalPath();
if (fpath.startsWith(bpath)) {
return fpath.substring(bpath.length() + 1);
} else {
return (".." + File.separator + getRelativePath(parent, name));
}
}
Here is a solution other library free:
Path sourceFile = Paths.get("some/common/path/example/a/b/c/f1.txt");
Path targetFile = Paths.get("some/common/path/example/d/e/f2.txt");
Path relativePath = sourceFile.relativize(targetFile);
System.out.println(relativePath);
Outputs
..\..\..\..\d\e\f2.txt
[EDIT] actually it outputs on more ..\ because of the source is file not a directory. Correct solution for my case is:
Path sourceFile = Paths.get(new File("some/common/path/example/a/b/c/f1.txt").parent());
Path targetFile = Paths.get("some/common/path/example/d/e/f2.txt");
Path relativePath = sourceFile.relativize(targetFile);
System.out.println(relativePath);
My version is loosely based on Matt and Steve's versions:
/**
* Returns the path of one File relative to another.
*
* #param target the target directory
* #param base the base directory
* #return target's path relative to the base directory
* #throws IOException if an error occurs while resolving the files' canonical names
*/
public static File getRelativeFile(File target, File base) throws IOException
{
String[] baseComponents = base.getCanonicalPath().split(Pattern.quote(File.separator));
String[] targetComponents = target.getCanonicalPath().split(Pattern.quote(File.separator));
// skip common components
int index = 0;
for (; index < targetComponents.length && index < baseComponents.length; ++index)
{
if (!targetComponents[index].equals(baseComponents[index]))
break;
}
StringBuilder result = new StringBuilder();
if (index != baseComponents.length)
{
// backtrack to base directory
for (int i = index; i < baseComponents.length; ++i)
result.append(".." + File.separator);
}
for (; index < targetComponents.length; ++index)
result.append(targetComponents[index] + File.separator);
if (!target.getPath().endsWith("/") && !target.getPath().endsWith("\\"))
{
// remove final path separator
result.delete(result.length() - File.separator.length(), result.length());
}
return new File(result.toString());
}
Matt B's solution gets the number of directories to backtrack wrong -- it should be the length of the base path minus the number of common path elements, minus one (for the last path element, which is either a filename or a trailing "" generated by split). It happens to work with /a/b/c/ and /a/x/y/, but replace the arguments with /m/n/o/a/b/c/ and /m/n/o/a/x/y/ and you will see the problem.
Also, it needs an else break inside the first for loop, or it will mishandle paths that happen to have matching directory names, such as /a/b/c/d/ and /x/y/c/z -- the c is in the same slot in both arrays, but is not an actual match.
All these solutions lack the ability to handle paths that cannot be relativized to one another because they have incompatible roots, such as C:\foo\bar and D:\baz\quux. Probably only an issue on Windows, but worth noting.
I spent far longer on this than I intended, but that's okay. I actually needed this for work, so thank you to everyone who has chimed in, and I'm sure there will be corrections to this version too!
public static String getRelativePath(String targetPath, String basePath,
String pathSeparator) {
// We need the -1 argument to split to make sure we get a trailing
// "" token if the base ends in the path separator and is therefore
// a directory. We require directory paths to end in the path
// separator -- otherwise they are indistinguishable from files.
String[] base = basePath.split(Pattern.quote(pathSeparator), -1);
String[] target = targetPath.split(Pattern.quote(pathSeparator), 0);
// First get all the common elements. Store them as a string,
// and also count how many of them there are.
String common = "";
int commonIndex = 0;
for (int i = 0; i < target.length && i < base.length; i++) {
if (target[i].equals(base[i])) {
common += target[i] + pathSeparator;
commonIndex++;
}
else break;
}
if (commonIndex == 0)
{
// Whoops -- not even a single common path element. This most
// likely indicates differing drive letters, like C: and D:.
// These paths cannot be relativized. Return the target path.
return targetPath;
// This should never happen when all absolute paths
// begin with / as in *nix.
}
String relative = "";
if (base.length == commonIndex) {
// Comment this out if you prefer that a relative path not start with ./
//relative = "." + pathSeparator;
}
else {
int numDirsUp = base.length - commonIndex - 1;
// The number of directories we have to backtrack is the length of
// the base path MINUS the number of common path elements, minus
// one because the last element in the path isn't a directory.
for (int i = 1; i <= (numDirsUp); i++) {
relative += ".." + pathSeparator;
}
}
relative += targetPath.substring(common.length());
return relative;
}
And here are tests to cover several cases:
public void testGetRelativePathsUnixy()
{
assertEquals("stuff/xyz.dat", FileUtils.getRelativePath(
"/var/data/stuff/xyz.dat", "/var/data/", "/"));
assertEquals("../../b/c", FileUtils.getRelativePath(
"/a/b/c", "/a/x/y/", "/"));
assertEquals("../../b/c", FileUtils.getRelativePath(
"/m/n/o/a/b/c", "/m/n/o/a/x/y/", "/"));
}
public void testGetRelativePathFileToFile()
{
String target = "C:\\Windows\\Boot\\Fonts\\chs_boot.ttf";
String base = "C:\\Windows\\Speech\\Common\\sapisvr.exe";
String relPath = FileUtils.getRelativePath(target, base, "\\");
assertEquals("..\\..\\..\\Boot\\Fonts\\chs_boot.ttf", relPath);
}
public void testGetRelativePathDirectoryToFile()
{
String target = "C:\\Windows\\Boot\\Fonts\\chs_boot.ttf";
String base = "C:\\Windows\\Speech\\Common";
String relPath = FileUtils.getRelativePath(target, base, "\\");
assertEquals("..\\..\\Boot\\Fonts\\chs_boot.ttf", relPath);
}
public void testGetRelativePathDifferentDriveLetters()
{
String target = "D:\\sources\\recovery\\RecEnv.exe";
String base = "C:\\Java\\workspace\\AcceptanceTests\\Standard test data\\geo\\";
// Should just return the target path because of the incompatible roots.
String relPath = FileUtils.getRelativePath(target, base, "\\");
assertEquals(target, relPath);
}
Actually my other answer didn't work if the target path wasn't a child of the base path.
This should work.
public class RelativePathFinder {
public static String getRelativePath(String targetPath, String basePath,
String pathSeparator) {
// find common path
String[] target = targetPath.split(pathSeparator);
String[] base = basePath.split(pathSeparator);
String common = "";
int commonIndex = 0;
for (int i = 0; i < target.length && i < base.length; i++) {
if (target[i].equals(base[i])) {
common += target[i] + pathSeparator;
commonIndex++;
}
}
String relative = "";
// is the target a child directory of the base directory?
// i.e., target = /a/b/c/d, base = /a/b/
if (commonIndex == base.length) {
relative = "." + pathSeparator + targetPath.substring(common.length());
}
else {
// determine how many directories we have to backtrack
for (int i = 1; i <= commonIndex; i++) {
relative += ".." + pathSeparator;
}
relative += targetPath.substring(common.length());
}
return relative;
}
public static String getRelativePath(String targetPath, String basePath) {
return getRelativePath(targetPath, basePath, File.pathSeparator);
}
}
public class RelativePathFinderTest extends TestCase {
public void testGetRelativePath() {
assertEquals("./stuff/xyz.dat", RelativePathFinder.getRelativePath(
"/var/data/stuff/xyz.dat", "/var/data/", "/"));
assertEquals("../../b/c", RelativePathFinder.getRelativePath("/a/b/c",
"/a/x/y/", "/"));
}
}
Cool!! I need a bit of code like this but for comparing directory paths on Linux machines. I found that this wasn't working in situations where a parent directory was the target.
Here is a directory friendly version of the method:
public static String getRelativePath(String targetPath, String basePath,
String pathSeparator) {
boolean isDir = false;
{
File f = new File(targetPath);
isDir = f.isDirectory();
}
// We need the -1 argument to split to make sure we get a trailing
// "" token if the base ends in the path separator and is therefore
// a directory. We require directory paths to end in the path
// separator -- otherwise they are indistinguishable from files.
String[] base = basePath.split(Pattern.quote(pathSeparator), -1);
String[] target = targetPath.split(Pattern.quote(pathSeparator), 0);
// First get all the common elements. Store them as a string,
// and also count how many of them there are.
String common = "";
int commonIndex = 0;
for (int i = 0; i < target.length && i < base.length; i++) {
if (target[i].equals(base[i])) {
common += target[i] + pathSeparator;
commonIndex++;
}
else break;
}
if (commonIndex == 0)
{
// Whoops -- not even a single common path element. This most
// likely indicates differing drive letters, like C: and D:.
// These paths cannot be relativized. Return the target path.
return targetPath;
// This should never happen when all absolute paths
// begin with / as in *nix.
}
String relative = "";
if (base.length == commonIndex) {
// Comment this out if you prefer that a relative path not start with ./
relative = "." + pathSeparator;
}
else {
int numDirsUp = base.length - commonIndex - (isDir?0:1); /* only subtract 1 if it is a file. */
// The number of directories we have to backtrack is the length of
// the base path MINUS the number of common path elements, minus
// one because the last element in the path isn't a directory.
for (int i = 1; i <= (numDirsUp); i++) {
relative += ".." + pathSeparator;
}
}
//if we are comparing directories then we
if (targetPath.length() > common.length()) {
//it's OK, it isn't a directory
relative += targetPath.substring(common.length());
}
return relative;
}
I'm assuming you have fromPath (an absolute path for a folder), and toPath (an absolute path for a folder/file), and your're looking for a path that with represent the file/folder in toPath as a relative path from fromPath (your current working directory is fromPath) then something like this should work:
public static String getRelativePath(String fromPath, String toPath) {
// This weirdness is because a separator of '/' messes with String.split()
String regexCharacter = File.separator;
if (File.separatorChar == '\\') {
regexCharacter = "\\\\";
}
String[] fromSplit = fromPath.split(regexCharacter);
String[] toSplit = toPath.split(regexCharacter);
// Find the common path
int common = 0;
while (fromSplit[common].equals(toSplit[common])) {
common++;
}
StringBuffer result = new StringBuffer(".");
// Work your way up the FROM path to common ground
for (int i = common; i < fromSplit.length; i++) {
result.append(File.separatorChar).append("..");
}
// Work your way down the TO path
for (int i = common; i < toSplit.length; i++) {
result.append(File.separatorChar).append(toSplit[i]);
}
return result.toString();
}
Lots of answers already here, but I found they didn't handle all cases, such as the base and target being the same. This function takes a base directory and a target path and returns the relative path. If no relative path exists, the target path is returned. File.separator is unnecessary.
public static String getRelativePath (String baseDir, String targetPath) {
String[] base = baseDir.replace('\\', '/').split("\\/");
targetPath = targetPath.replace('\\', '/');
String[] target = targetPath.split("\\/");
// Count common elements and their length.
int commonCount = 0, commonLength = 0, maxCount = Math.min(target.length, base.length);
while (commonCount < maxCount) {
String targetElement = target[commonCount];
if (!targetElement.equals(base[commonCount])) break;
commonCount++;
commonLength += targetElement.length() + 1; // Directory name length plus slash.
}
if (commonCount == 0) return targetPath; // No common path element.
int targetLength = targetPath.length();
int dirsUp = base.length - commonCount;
StringBuffer relative = new StringBuffer(dirsUp * 3 + targetLength - commonLength + 1);
for (int i = 0; i < dirsUp; i++)
relative.append("../");
if (commonLength < targetLength) relative.append(targetPath.substring(commonLength));
return relative.toString();
}
Here a method that resolves a relative path from a base path regardless they are in the same or in a different root:
public static String GetRelativePath(String path, String base){
final String SEP = "/";
// if base is not a directory -> return empty
if (!base.endsWith(SEP)){
return "";
}
// check if path is a file -> remove last "/" at the end of the method
boolean isfile = !path.endsWith(SEP);
// get URIs and split them by using the separator
String a = "";
String b = "";
try {
a = new File(base).getCanonicalFile().toURI().getPath();
b = new File(path).getCanonicalFile().toURI().getPath();
} catch (IOException e) {
e.printStackTrace();
}
String[] basePaths = a.split(SEP);
String[] otherPaths = b.split(SEP);
// check common part
int n = 0;
for(; n < basePaths.length && n < otherPaths.length; n ++)
{
if( basePaths[n].equals(otherPaths[n]) == false )
break;
}
// compose the new path
StringBuffer tmp = new StringBuffer("");
for(int m = n; m < basePaths.length; m ++)
tmp.append(".."+SEP);
for(int m = n; m < otherPaths.length; m ++)
{
tmp.append(otherPaths[m]);
tmp.append(SEP);
}
// get path string
String result = tmp.toString();
// remove last "/" if path is a file
if (isfile && result.endsWith(SEP)){
result = result.substring(0,result.length()-1);
}
return result;
}
Passes Dónal's tests, the only change - if no common root it returns target path (it could be already relative)
import static java.util.Arrays.asList;
import static java.util.Collections.nCopies;
import static org.apache.commons.io.FilenameUtils.normalizeNoEndSeparator;
import static org.apache.commons.io.FilenameUtils.separatorsToUnix;
import static org.apache.commons.lang3.StringUtils.getCommonPrefix;
import static org.apache.commons.lang3.StringUtils.isBlank;
import static org.apache.commons.lang3.StringUtils.isNotEmpty;
import static org.apache.commons.lang3.StringUtils.join;
import java.io.File;
import java.util.ArrayList;
import java.util.List;
public class ResourceUtils {
public static String getRelativePath(String targetPath, String basePath, String pathSeparator) {
File baseFile = new File(basePath);
if (baseFile.isFile() || !baseFile.exists() && !basePath.endsWith("/") && !basePath.endsWith("\\"))
basePath = baseFile.getParent();
String target = separatorsToUnix(normalizeNoEndSeparator(targetPath));
String base = separatorsToUnix(normalizeNoEndSeparator(basePath));
String commonPrefix = getCommonPrefix(target, base);
if (isBlank(commonPrefix))
return targetPath.replaceAll("/", pathSeparator);
target = target.replaceFirst(commonPrefix, "");
base = base.replaceFirst(commonPrefix, "");
List<String> result = new ArrayList<>();
if (isNotEmpty(base))
result.addAll(nCopies(base.split("/").length, ".."));
result.addAll(asList(target.replaceFirst("^/", "").split("/")));
return join(result, pathSeparator);
}
}
If you're writing a Maven plugin, you can use Plexus' PathTool:
import org.codehaus.plexus.util.PathTool;
String relativeFilePath = PathTool.getRelativeFilePath(file1, file2);
If Paths is not available for JRE 1.5 runtime or maven plugin
package org.afc.util;
import java.io.File;
import java.util.LinkedList;
import java.util.List;
public class FileUtil {
public static String getRelativePath(String basePath, String filePath) {
return getRelativePath(new File(basePath), new File(filePath));
}
public static String getRelativePath(File base, File file) {
List<String> bases = new LinkedList<String>();
bases.add(0, base.getName());
for (File parent = base.getParentFile(); parent != null; parent = parent.getParentFile()) {
bases.add(0, parent.getName());
}
List<String> files = new LinkedList<String>();
files.add(0, file.getName());
for (File parent = file.getParentFile(); parent != null; parent = parent.getParentFile()) {
files.add(0, parent.getName());
}
int overlapIndex = 0;
while (overlapIndex < bases.size() && overlapIndex < files.size() && bases.get(overlapIndex).equals(files.get(overlapIndex))) {
overlapIndex++;
}
StringBuilder relativePath = new StringBuilder();
for (int i = overlapIndex; i < bases.size(); i++) {
relativePath.append("..").append(File.separatorChar);
}
for (int i = overlapIndex; i < files.size(); i++) {
relativePath.append(files.get(i)).append(File.separatorChar);
}
relativePath.deleteCharAt(relativePath.length() - 1);
return relativePath.toString();
}
}
I know this is a bit late but, I created a solution that works with any java version.
public static String getRealtivePath(File root, File file)
{
String path = file.getPath();
String rootPath = root.getPath();
boolean plus1 = path.contains(File.separator);
return path.substring(path.indexOf(rootPath) + rootPath.length() + (plus1 ? 1 : 0));
}
org.apache.ant has a FileUtils class with a getRelativePath method. Haven't tried it myself yet, but could be worthwhile to check it out.
http://javadoc.haefelinger.it/org.apache.ant/1.7.1/org/apache/tools/ant/util/FileUtils.html#getRelativePath(java.io.File, java.io.File)
private String relative(String left, String right){
String[] lefts = left.split("/");
String[] rights = right.split("/");
int min = Math.min(lefts.length, rights.length);
int commonIdx = -1;
for(int i = 0; i < min; i++){
if(commonIdx < 0 && !lefts[i].equals(rights[i])){
commonIdx = i - 1;
break;
}
}
if(commonIdx < 0){
return null;
}
StringBuilder sb = new StringBuilder(Math.max(left.length(), right.length()));
sb.append(left).append("/");
for(int i = commonIdx + 1; i < lefts.length;i++){
sb.append("../");
}
for(int i = commonIdx + 1; i < rights.length;i++){
sb.append(rights[i]).append("/");
}
return sb.deleteCharAt(sb.length() -1).toString();
}
Psuedo-code:
Split the strings by the path seperator ("/")
Find the greatest common path by iterating thru the result of the split string (so you'd end up with "/var/data" or "/a" in your two examples)
return "." + whicheverPathIsLonger.substring(commonPath.length);
There is this sample record,
100,1:2:3
Which I want to normalize as,
100,1
100,2
100,3
A colleague of mine wrote a pig script to achieve this and my MapReduce code took more time. I was using the default TextInputformat before. But to improve performance, I decided to write a custom Input format class, with a custom RecordReader. Taking the LineRecordReader class as reference, I tried to write the following code.
import java.io.IOException;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;
import com.normalize.util.Splitter;
public class NormalRecordReader extends RecordReader<Text, Text> {
private long start;
private long pos;
private long end;
private LineReader in;
private int maxLineLength;
private Text key = null;
private Text value = null;
private Text line = null;
public void initialize(InputSplit genericSplit, TaskAttemptContext context) throws IOException {
FileSplit split = (FileSplit) genericSplit;
Configuration job = context.getConfiguration();
this.maxLineLength = job.getInt("mapred.linerecordreader.maxlength", Integer.MAX_VALUE);
start = split.getStart();
end = start + split.getLength();
final Path file = split.getPath();
FileSystem fs = file.getFileSystem(job);
FSDataInputStream fileIn = fs.open(split.getPath());
in = new LineReader(fileIn, job);
this.pos = start;
}
public boolean nextKeyValue() throws IOException {
int newSize = 0;
if (line == null) {
line = new Text();
}
while (pos < end) {
newSize = in.readLine(line);
if (newSize == 0) {
break;
}
pos += newSize;
if (newSize < maxLineLength) {
break;
}
// line too long. try again
System.out.println("Skipped line of size " + newSize + " at pos " + (pos - newSize));
}
Splitter splitter = new Splitter(line.toString(), ",");
List<String> split = splitter.split();
if (key == null) {
key = new Text();
}
key.set(split.get(0));
if (value == null) {
value = new Text();
}
value.set(split.get(1));
if (newSize == 0) {
key = null;
value = null;
return false;
} else {
return true;
}
}
#Override
public Text getCurrentKey() {
return key;
}
#Override
public Text getCurrentValue() {
return value;
}
/**
* Get the progress within the split
*/
public float getProgress() {
if (start == end) {
return 0.0f;
} else {
return Math.min(1.0f, (pos - start) / (float)(end - start));
}
}
public synchronized void close() throws IOException {
if (in != null) {
in.close();
}
}
}
Though this works, but I haven't seen any performance improvement. Here I am breaking the record at "," and setting the 100 as key and 1,2,3 as value. I only call the mapper which does the following:
public void map(Text key, Text value, Context context)
throws IOException, InterruptedException {
try {
Splitter splitter = new Splitter(value.toString(), ":");
List<String> splits = splitter.split();
for (String split : splits) {
context.write(key, new Text(split));
}
} catch (IndexOutOfBoundsException ibe) {
System.err.println(value + " is malformed.");
}
}
The splitter class is used to split the data, as I found String's splitter to be slower. The method is:
public List<String> split() {
List<String> splitData = new ArrayList<String>();
int beginIndex = 0, endIndex = 0;
while(true) {
endIndex = dataToSplit.indexOf(delim, beginIndex);
if(endIndex == -1) {
splitData.add(dataToSplit.substring(beginIndex));
break;
}
splitData.add(dataToSplit.substring(beginIndex, endIndex));
beginIndex = endIndex + delimLength;
}
return splitData;
}
Can the code be improved in any way?
Let me summarize here what I think you can improve instead of in the comments:
As explained, currently you are creating a Text object several times per record (number of times will be equal to your number of tokens). While it may not matter too much for small input, this can be a big deal for decently sized jobs. To fix that, do the following:
private final Text text = new Text();
public void map(Text key, Text value, Context context) {
....
for (String split : splits) {
text.set(split);
context.write(key, text);
}
}
For your splitting, what you're doing right now is for every record allocating a new array, populating this array, and then iterating over this array to write your output. Effectively you don't really need an array in this case since you're not maintaining any state. Using the implementation of the split method you provided, you only need to make one pass on the data:
public void map(Text key, Text value, Context context) {
String dataToSplit = value.toString();
String delim = ":";
int beginIndex = 0;
int endIndex = 0;
while(true) {
endIndex = dataToSplit.indexOf(delim, beginIndex);
if(endIndex == -1) {
text.set(dataToSplit.substring(beginIndex));
context.write(key, text);
break;
}
text.set(dataToSplit.substring(beginIndex, endIndex));
context.write(key, text);
beginIndex = endIndex + delim.length();
}
}
I don't really see why you write your own InputFormat, it seems that KeyValueTextInputFormat is exactly what you need and has probably been already optimized. Here is how you use it:
conf.set("key.value.separator.in.input.line", ",");
job.setInputFormatClass(KeyValueTextInputFormat.class);
Based on your example, the key for each record seems to be an integer. If that's always the case, then using a Text as your mapper input key is not optimal and it should be an IntWritable or maybe even a ByteWritable depending on what's in your data.
Similarly, you want want to use an IntWritable or ByteWritable as your mapper output key and output value.
Also, if you want some meaningful benchmark, you should test on a bigger dataset, like a few Gbs if possible. 1 minute tests are not really meaningful, especially in the context of distributed systems. 1 job may run quicker than another one on a small input, but the trend may be reverted for bigger inputs.
That being said, you should also know that Pig does a lot of optimizations behind the hood when translating to Map/Reduce, so I'm not too surprised that it runs faster than your Java Map/Reduce code and I've seen that in the past. Try the optimizations I suggested, if it's still not fast enough here is a link on profiling your Map/Reduce jobs with a few more useful tricks (especially tip 7 on profiling is something I've found useful).