Java Method to Check if URL Fits Pattern - java

I need to do some primitive URL matching in Java. I need a method that returns true to indicate that
/users/5/roles
matches
/users/*/roles
Here is what I am looking for and what I tried.
public Boolean fitsTemplate(String path, String template) {
    Boolean matches = false;
    // My broken code: it returns false, but I need true
    matches = path.matches(template);
    return matches;
}

One option is to replace the * with some kind of regex equivalent such as [^/]+, but the kind of pattern being used here is actually called a "glob" pattern. Starting in Java 7, you can use FileSystem.getPathMatcher to match file paths against glob patterns. For a complete explanation of the glob syntax, see the documentation for getPathMatcher.
public boolean fitsTemplate(String path, String template) {
    return FileSystems.getDefault()
        .getPathMatcher("glob:" + template)
        .matches(Paths.get(path));
}
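For comparison, the regex option mentioned above (replacing each * with [^/]+) could look roughly like the following. This is only a sketch: the class name TemplateRegexMatcher and the helper toPattern are made up for illustration, and it assumes that * should stand for exactly one non-empty path segment.

```java
import java.util.regex.Pattern;

class TemplateRegexMatcher {
    // Converts a template such as /users/*/roles into a regex in which
    // each * stands for one non-empty path segment ([^/]+).
    static Pattern toPattern(String template) {
        StringBuilder regex = new StringBuilder();
        // limit -1 keeps trailing empty strings, so a trailing * is not lost
        String[] literals = template.split("\\*", -1);
        for (int i = 0; i < literals.length; i++) {
            if (i > 0) {
                regex.append("[^/]+");
            }
            // Quote literal text so characters like '.' are not read as regex syntax
            regex.append(Pattern.quote(literals[i]));
        }
        return Pattern.compile(regex.toString());
    }

    static boolean fitsTemplate(String path, String template) {
        return toPattern(template).matcher(path).matches();
    }

    public static void main(String[] args) {
        System.out.println(fitsTemplate("/users/5/roles", "/users/*/roles"));   // true
        System.out.println(fitsTemplate("/users/5/6/roles", "/users/*/roles")); // false
    }
}
```

Unlike the glob approach, this works on plain strings and does not depend on the default file system's path syntax, which may matter if the "paths" are really URL paths.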

Related

java.nio.file.Path.contains(subPath)?

I need to check whether a given path is a subpath anywhere within another path and was wondering whether there exists such a method already before writing my own.
Here is some code that may help to understand the problem:
Path someRealPath = Paths.get("/tmp/some/path/to/somewhere");
Path subPathToCheck = Paths.get("some/path");
// I am expecting something like the following:
someRealPath.contains(subPathToCheck) // should return true in this case
someRealPath.contains(Paths.get("some/to")) // should return false
I already saw relativize, but I don't think that's the easiest way to solve the problem. The simplest I came up with was normalize().toString().contains(/* other normalized string path */). But maybe there is an easier way? Lots of methods inside Paths look as if this functionality must already be in there. Maybe I am just not seeing it.
What I came up with is the following:
boolean containsSubPath(Path someRealPath, Path subPathToCheck) {
    return someRealPath.normalize()
        .toString()
        .contains(subPathToCheck.normalize()
            .toString());
}
This way I am able to just call:
containsSubPath(Paths.get("/tmp/some/path/to/somewhere"), Paths.get("some/path"));
As Thomas Kläger pointed out, this solution also matches paths that are only substrings of a real path (which would also be OK for my use case).
Here is another solution that is probably more correct (it matches only complete subpaths), but still not as short as I would like it to be (now corrected thanks to Kabhal's input):
static boolean containsSubPath(Path realPath, Path subPath) {
    Path real = realPath.normalize();
    Path sub = subPath.normalize();
    // Try each name element of the real path as a possible start of the subpath
    for (int start = 0; start < real.getNameCount(); start++) {
        Path tail = real.subpath(start, real.getNameCount());
        if (containsSubPath(tail.iterator(), sub.iterator())) {
            return true;
        }
    }
    return false;
}
private static boolean containsSubPath(Iterator<Path> realPathIterator, Iterator<Path> subPathIterator) {
    var hasEntries = realPathIterator.hasNext() && subPathIterator.hasNext();
    while (realPathIterator.hasNext() && subPathIterator.hasNext()) {
        Path realPathSegment = realPathIterator.next();
        Path subPathSegment = subPathIterator.next();
        if (!Objects.equals(realPathSegment, subPathSegment))
            return false;
    }
    // Only a match if every segment of the subpath was consumed
    return hasEntries && !subPathIterator.hasNext();
}
Example calls with expected output:
containsSubPath(Paths.get("/tmp/some/path/to/somewhere"), Paths.get("some/path")) // true
containsSubPath(Paths.get("/tmp/some/path/to/somewhere"), Paths.get("me/pa")) // false
If you need to use it as a BiPredicate<Path, Path>, just use the method reference instead, i.e. YourClass::containsSubPath.
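A shorter variant of the segment-wise check is possible with Path.subpath, which returns a relative path that can be compared with equals. The following is a sketch, not a drop-in replacement: the class name SubPathCheck is made up, and it assumes the subpath to look for is relative (has no root component).

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.function.BiPredicate;

class SubPathCheck {
    // Returns true if the name elements of subPath occur as a contiguous run
    // inside realPath. Assumes subPath is relative (no root component).
    static boolean containsSubPath(Path realPath, Path subPath) {
        Path real = realPath.normalize();
        Path sub = subPath.normalize();
        int subCount = sub.getNameCount();
        if (subCount == 0) {
            return false; // e.g. a bare root; subpath() would reject an empty range
        }
        // Slide a window of subCount name elements over the real path
        for (int start = 0; start + subCount <= real.getNameCount(); start++) {
            if (real.subpath(start, start + subCount).equals(sub)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The method reference fits a BiPredicate<Path, Path>
        BiPredicate<Path, Path> contains = SubPathCheck::containsSubPath;
        Path real = Paths.get("/tmp/some/path/to/somewhere");
        System.out.println(contains.test(real, Paths.get("some/path"))); // true
        System.out.println(contains.test(real, Paths.get("some/to")));   // false
        System.out.println(contains.test(real, Paths.get("me/pa")));     // false
    }
}
```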

How to exclude a specific file with Apache FileFilterUtils?

Please consider the following folder structure:
src
|_text1.txt
|_text2.txt
|_content
  |_text1.txt
  |_text2.txt
How do I design an org.apache.commons.io.filefilter.IOFileFilter that excludes src/text1.txt and src/text2.txt but keeps src/content/text1.txt and src/content/text2.txt?
Currently my filter looks like this:
IOFileFilter filter = FileFilterUtils.and(
    FileFilterUtils.notFileFilter(FileFilterUtils.nameFileFilter("text1.txt", IOCase.SENSITIVE)),
    FileFilterUtils.notFileFilter(FileFilterUtils.nameFileFilter("text2.txt", IOCase.SENSITIVE))
);
FileUtils.copyDirectory(new File("src"), new File("dst"), filter);
But the code snippet above obviously doesn't copy the two text files within the src/content/ folder either (which I want to have copied)... Btw. the names of the text files are not changeable.
Any ideas?
AFAIK Commons IO doesn't provide something like a PathFileFilter, so you'd have to add your own filter here.
NameFileFilter, as the name implies, only checks the file name, i.e. the path is not relevant.
Providing your own filter should not be that hard. I'd suggest subclassing AbstractFileFilter or NameFileFilter. Subclassing NameFileFilter might be considered a somewhat dirty approach, since you're no longer checking only the names, but it would just require you to override the accept() methods:
public boolean accept(File file) {
    return accept( file.getPath() );
}
public boolean accept(File dir, String name) {
    //use normalize to account for possible double separators or windows paths which use \
    return accept( FilenameUtils.normalize( dir.getPath() + "/" + name ) );
}
protected boolean accept( String path ) {
    for (String nameSuffix: names) {
        if (caseSensitivity.checkEndsWith( path, nameSuffix )) {
            return true;
        }
    }
    return false;
}
Then you'd use it like FileFilterUtils.notFileFilter(new PathFileFilter("/text1.txt")) etc.
Alternatively you could provide a set of patterns and check those:
private Set<Pattern> pathPatterns = new HashSet<>();
PathFileFilter(String... patterns) {
    for( String p : patterns ) {
        pathPatterns.add( Pattern.compile(p) );
    }
}
protected boolean accept( String path ) {
    for (Pattern pattern : pathPatterns) {
        //separatorsToUnix is used to convert \ to /
        //Pattern has no instance matches(String), so go through a Matcher
        if ( pattern.matcher( FilenameUtils.separatorsToUnix( path ) ).matches() ) {
            return true;
        }
    }
    return false;
}
Usage: new PathFileFilter("(?i)(.*/)?test[12]\\.txt"); or new PathFileFilter("(?i)(.*/)?test1\\.txt", "(?i)(.*/)?anothertest2\\.txt");
Short breakdown of the regex:
(?i) makes the expression case-insensitive, leave it out for case-sensitive matches
(.*/)? means that if the filename is preceded by anything, that prefix must end with a slash, i.e. this would match some/path/test1.txt but not someothertest1.txt.
test[12]\\.txt would be the file name, here meaning test followed by 1 or 2 and finally .txt
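The breakdown above can be checked with plain java.util.regex, independent of Commons IO. A small sketch (the class name PathRegexDemo and the sample paths are only illustrative):

```java
import java.util.regex.Pattern;

class PathRegexDemo {
    // Same expression as in the usage example, written as a Java string literal
    static final Pattern TEST_FILES = Pattern.compile("(?i)(.*/)?test[12]\\.txt");

    static boolean matchesPath(String path) {
        // matches() requires the whole path to match, not just a part of it
        return TEST_FILES.matcher(path).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesPath("some/path/test1.txt")); // true
        System.out.println(matchesPath("TEST2.TXT"));           // true, because of (?i)
        System.out.println(matchesPath("someothertest1.txt"));  // false, no slash before the name
        System.out.println(matchesPath("some/path/test3.txt")); // false, 3 is not in [12]
    }
}
```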

tar extension match using regex

I have a directory with some image files. I want to move all those files to a different place as long as they do not have a .tar extension. What is the regex in Java to filter out tar files?
This is my code:
String regex = "^[[a-z]\\.[^tar]$]*";
There are several ways to do this.
Use this regex:
^(?!.*\.tar$).*$
Use endsWith:
if(!filename.endsWith(".tar"))
Use a FileFilter.
And there are probably a few more. I think endsWith is the fastest way; a regex is a comparatively heavy operation for this.
Try this:
// implement the FileFilter interface and override the accept method
public class ImageFileFilter implements FileFilter
{
    private final String[] filterExtensions =
        new String[] {"tar"};
    public boolean accept(File file)
    {
        for (String extension : filterExtensions)
        {
            // if the file name ends with a filtered extension, reject it
            if (file.getName().toLowerCase().endsWith("." + extension))
            {
                return false;
            }
        }
        // no filtered extension matched, so accept the file
        return true;
    }
}
Then you can get a list of files with this filter
File dir = new File("path/to/my/images");
// listFiles accepts a FileFilter; list() only accepts a FilenameFilter
File[] filesWithoutTars = dir.listFiles(new ImageFileFilter());
// do stuff here
EDIT:
Since the OP says he can't modify the Java code, the following regex should do what you want: ^.*(?<!\.tar)$
It matches anything from the beginning of the string, and the negative lookbehind asserts that the string does not end with ".tar".
Use the String.matches() method to test a string against a regex; the (?i) flag makes the match ignore case.
sample code:
String regex = "(?i).*\\.tar";
String fileName = "xyz.taR";
System.out.println(fileName.matches(regex)); // true
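If the regex itself has to express "not a tar file", one option is a negative lookahead that rejects names ending in .tar. The sketch below is an assumption about what the asker needs (case-insensitive, whole-name match); the class name NotTarDemo and the file names are made up.

```java
import java.util.regex.Pattern;

class NotTarDemo {
    // (?i) ignores case; (?!.*\.tar$) rejects any name that ends in ".tar"
    static final Pattern NOT_TAR = Pattern.compile("(?i)^(?!.*\\.tar$).*$");

    static boolean isNotTar(String fileName) {
        return NOT_TAR.matcher(fileName).matches();
    }

    public static void main(String[] args) {
        System.out.println(isNotTar("photo.png"));      // true
        System.out.println(isNotTar("archive.tar"));    // false
        System.out.println(isNotTar("archive.TAR"));    // false
        System.out.println(isNotTar("archive.tar.gz")); // true, ends in .gz
        System.out.println(isNotTar("guitar"));         // true, no ".tar" extension
    }
}
```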

Filtering input files using globStatus in MapReduce

I have a lot of input files and I want to process only selected ones, based on the date that is appended at the end of the file name. I am confused about where to use the globStatus method to filter the files.
I have a custom RecordReader class and I was trying to use globStatus in its next method but it didn't work out.
public boolean next(Text key, Text value) throws IOException {
    Path filePath = fileSplit.getPath();
    if (!processed) {
        key.set(filePath.getName());
        byte[] contents = new byte[(int) fileSplit.getLength()];
        value.clear();
        FileSystem fs = filePath.getFileSystem(conf);
        fs.globStatus(new Path("/*" + date));
        FSDataInputStream in = null;
        try {
            in = fs.open(filePath);
            IOUtils.readFully(in, contents, 0, contents.length);
            value.set(contents, 0, contents.length);
        } finally {
            IOUtils.closeStream(in);
        }
        processed = true;
        return true;
    }
    return false;
}
I know it returns a FileStatus array, but how do I use it to filter the files? Can someone please shed some light?
The globStatus method takes 2 complementary arguments which allow you to filter your files. The first one is the glob pattern, but sometimes glob patterns are not powerful enough to filter specific files, in which case you can define a PathFilter.
Regarding the glob pattern, the following are supported:
Glob | Matches
-------------------------------------------------------------------------------------------------------------------
* | Matches zero or more characters
? | Matches a single character
[ab] | Matches a single character in the set {a, b}
[^ab] | Matches a single character not in the set {a, b}
[a-b] | Matches a single character in the range [a, b] where a is lexicographically less than or equal to b
[^a-b] | Matches a single character not in the range [a, b] where a is lexicographically less than or equal to b
{a,b} | Matches either expression a or b
\c | Matches character c when it is a metacharacter
PathFilter is simply an interface like this:
public interface PathFilter {
    boolean accept(Path path);
}
So you can implement this interface and implement the accept method where you can put your logic to filter files.
An example taken from Tom White's excellent book which allows you to define a PathFilter to filter files that match a certain regular expression:
public class RegexExcludePathFilter implements PathFilter {
    private final String regex;
    public RegexExcludePathFilter(String regex) {
        this.regex = regex;
    }
    public boolean accept(Path path) {
        return !path.toString().matches(regex);
    }
}
You can directly filter your input with a PathFilter implementation by calling FileInputFormat.setInputPathFilter(JobConf, RegexExcludePathFilter.class) when initializing your job.
EDIT: Since you have to pass the class in setInputPathFilter, you can't directly pass arguments, but you should be able to do something similar by playing with the Configuration. If you make your RegexExcludePathFilter also extend from Configured, you can get back a Configuration object which you will have initialized before with the desired values, so you can get back these values inside your filter and process them in the accept.
For example if you initialize like this:
conf.set("date", "2013-01-15");
Then you can define your filter like this:
public class RegexIncludePathFilter extends Configured implements PathFilter {
    private String date;
    private FileSystem fs;
    public boolean accept(Path path) {
        try {
            if (fs.isDirectory(path)) {
                return true;
            }
        } catch (IOException e) {}
        return path.toString().endsWith(date);
    }
    public void setConf(Configuration conf) {
        if (null != conf) {
            this.date = conf.get("date");
            try {
                this.fs = FileSystem.get(conf);
            } catch (IOException e) {}
        }
    }
}
EDIT 2: There were a few issues with the original code, please see the updated class. You also need to remove the constructor since it's not used anymore, and check if that's a directory in which case you should return true so the content of the directory can be filtered too.
For anyone reading this, can I say "please don't do anything more complex in the filters than validating the paths". Specifically: don't do checks for the files being a directory, getting their sizes, etc. Wait until the list/glob operation has returned and then do a filtering there, using the information now in the populated FileStatus entries.
Why? All those calls to getFileStatus(), directly or via isDirectory() are doing needless calls to the filesystem, calls which add needless namenode load on an HDFS cluster. More critically, against S3 and other object stores, each operation is potentially making multiple HTTPS requests —and those really do take measurable time. Even better, S3 will throttle you if it thinks you are making too many requests across your entire cluster of machines. You don't want that.
Wait until after the call: the file status entries you get back are those from the object store's list commands, which usually return thousands of file entries per HTTPS request, and so are far more efficient.
For further details, inspect the source of org.apache.hadoop.fs.s3a.S3AFileSystem.

GWT java URL Validator

Does anyone know a function that validates whether a URL is valid, purely in GWT Java without using any JSNI?
I am using this one (making use of regular expressions):
private RegExp urlValidator;
private RegExp urlPlusTldValidator;
public boolean isValidUrl(String url, boolean topLevelDomainRequired) {
    if (urlValidator == null || urlPlusTldValidator == null) {
        urlValidator = RegExp.compile("^((ftp|http|https)://[\\w#.\\-\\_]+(:\\d{1,5})?(/[\\w#!:.?+=&%#!\\_\\-/]+)*){1}$");
        urlPlusTldValidator = RegExp.compile("^((ftp|http|https)://[\\w#.\\-\\_]+\\.[a-zA-Z]{2,}(:\\d{1,5})?(/[\\w#!:.?+=&%#!\\_\\-/]+)*){1}$");
    }
    return (topLevelDomainRequired ? urlPlusTldValidator : urlValidator).exec(url) != null;
}
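To try these two expressions outside a GWT build (for example in a plain JVM unit test), they can be compiled with java.util.regex. This is a sketch under the assumption that the patterns behave the same there as under GWT's RegExp for this syntax; the class name UrlPatternCheck and the sample URLs are made up.

```java
import java.util.regex.Pattern;

class UrlPatternCheck {
    // Same expressions as in the GWT snippet, compiled with java.util.regex
    static final Pattern URL = Pattern.compile(
        "^((ftp|http|https)://[\\w#.\\-\\_]+(:\\d{1,5})?(/[\\w#!:.?+=&%#!\\_\\-/]+)*){1}$");
    static final Pattern URL_PLUS_TLD = Pattern.compile(
        "^((ftp|http|https)://[\\w#.\\-\\_]+\\.[a-zA-Z]{2,}(:\\d{1,5})?(/[\\w#!:.?+=&%#!\\_\\-/]+)*){1}$");

    static boolean isValidUrl(String url, boolean topLevelDomainRequired) {
        // find() plays the role of RegExp.exec(); the ^ and $ anchors still force a full match
        return (topLevelDomainRequired ? URL_PLUS_TLD : URL).matcher(url).find();
    }

    public static void main(String[] args) {
        System.out.println(isValidUrl("https://example.com/a/b?x=1", true)); // true
        System.out.println(isValidUrl("http://localhost:8080/app", false));  // true
        System.out.println(isValidUrl("http://localhost:8080/app", true));   // false, no top-level domain
        System.out.println(isValidUrl("example.com", false));                // false, no scheme
    }
}
```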
org.apache.commons.validator.UrlValidator and its isValid(String url) method might be of help here.
You should use regular expressions in GWT. There are similar topics: "Regex in GWT to match URLs" and "Regular Expressions and GWT".
