tar extension match using regex - java

I have a directory with some image files. I want to move all those files to a different place as long as they are not tar extensions. What is the regex in Java to filter tar files?
This is my code:
String regex = "^[[a-z]\\.[^tar]$]*";

There are several ways to do this.
Use this regex:
^(?!.*\.tar$).*$
Use endsWith:
if(!filename.endsWith(".tar"))
Use a FileFilter.
There are probably a few more. I think endsWith is the fastest way; a regex is a comparatively heavy operation for a check this simple.
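To make the comparison between the regex and endsWith options concrete, here is a minimal sketch. The class name NonTarCheck and the exact pattern are assumptions for illustration, not something from the question; the pattern puts a negative lookahead at the start of the string so that only names ending in .tar (in any letter case) are rejected.

```java
import java.util.Locale;
import java.util.regex.Pattern;

public class NonTarCheck {
    // Assumed pattern: reject any name that ends in ".tar", ignoring case.
    private static final Pattern NOT_TAR = Pattern.compile("(?i)^(?!.*\\.tar$).*$");

    // Regex variant: true when the name does NOT end in ".tar".
    public static boolean acceptByRegex(String filename) {
        return NOT_TAR.matcher(filename).matches();
    }

    // endsWith variant: same decision without involving the regex engine.
    public static boolean acceptByEndsWith(String filename) {
        return !filename.toLowerCase(Locale.ROOT).endsWith(".tar");
    }
}
```

Both methods should agree on ordinary file names, so the choice is mostly about readability and whether the pattern has to come from configuration.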

Try this:
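import java.io.File;
import java.io.FileFilter;

// Implement the FileFilter interface and override the accept method.
public class ImageFileFilter implements FileFilter
{
    // Extensions to exclude; the leading dot keeps names like "guitar" from being rejected.
    private final String[] filterExtensions = new String[] {".tar"};

    @Override
    public boolean accept(File file)
    {
        String name = file.getName().toLowerCase();
        for (String extension : filterExtensions)
        {
            // Reject the file as soon as its name ends with an excluded extension.
            if (name.endsWith(extension))
            {
                return false;
            }
        }
        // No excluded extension matched, so the file is accepted.
        return true;
    }
}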
Then you can get a list of files with this filter:
File dir = new File("path/to/my/images");
File[] filesWithoutTars = dir.listFiles(new ImageFileFilter());
// do stuff here
EDIT:
Since the OP says he can't modify the Java code, the following regex should do what you want: ^.*(?<!\.tar)$
It matches anything from the beginning of the string, and the negative lookbehind asserts that the text immediately before the end of the string is not ".tar".

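Use the String.matches() method with the (?i) flag to test a string for a case-insensitive match. The regex below matches tar files, so negate the result to keep everything else.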
sample code:
String regex = "(?i).*\\.tar";
String fileName = "xyz.taR";
System.out.println(fileName.matches(regex)); // true
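Putting the check to work on the original task could look like the sketch below. It is only an illustration under assumptions: the class and method names (MoveNonTarFiles, moveNonTar) are made up, only regular files directly inside the source directory are considered, and an existing file with the same name in the target is overwritten.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

public class MoveNonTarFiles {
    // Moves every regular file in sourceDir whose name does not end in ".tar"
    // (case-insensitive) into targetDir and returns the names that were moved.
    public static List<String> moveNonTar(Path sourceDir, Path targetDir) throws IOException {
        // Collect the candidates first, so nothing is moved while the directory is being listed.
        List<Path> toMove = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(sourceDir)) {
            for (Path entry : entries) {
                String name = entry.getFileName().toString();
                if (Files.isRegularFile(entry) && !name.matches("(?i).*\\.tar")) {
                    toMove.add(entry);
                }
            }
        }
        Files.createDirectories(targetDir);
        List<String> moved = new ArrayList<>();
        for (Path file : toMove) {
            String name = file.getFileName().toString();
            Files.move(file, targetDir.resolve(name), StandardCopyOption.REPLACE_EXISTING);
            moved.add(name);
        }
        return moved;
    }
}
```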

Related

Regex filter for file search Java

I'm quite new to using regex so I'm having problems with my current code. I created an Abstract File Search that returns a List of Files. I would like this searcher to be filtered by a regex (have ex. the extension it looks for based on a regex filter).
The code of my Abstract Searcher:
public abstract class AbstractFileDiscoverer implements IDiscoverer {
private final Path rootPath;
AbstractFileDiscoverer(final Path rootPath) {
super();
this.rootPath = rootPath;
}
protected List<File> findFiles() throws IOException {
if (!Files.isDirectory(this.rootPath)) {
throw new IllegalArgumentException("Path must be a directory");
}
List<File> result;
try (Stream<Path> walk = Files.walk(this.rootPath)) {
result = walk.filter(p -> !Files.isDirectory(p)).map(p -> p.toFile())
.filter(f -> f.toString().toLowerCase().endsWith("")).collect(Collectors.toList());
}
return result;
}
@Override
public String getName() {
// TODO Auto-generated method stub
return null;
}
}
I would like the following part to be filtered by the regex, so that only the files for which the regex returns true (.bat and .sql files) are collected.
result = walk.filter(p -> !Files.isDirectory(p)).map(p -> p.toFile())
.filter(f -> f.toString().toLowerCase().endsWith("")).collect(Collectors.toList());
Could anyone help me achieving it?
FIRST EDIT:
I'm aware that toString().toLowerCase().endsWith("") always returns true. I actually need the regex there instead of a String with the extension. I forgot to mention that.
Try this website: https://regexr.com/ and paste the regex .+\.(?:sql|bat)$ for an explanation. The dot before the extension has to be escaped; otherwise it matches any character.
In code it'd look like this:
Stream.of("file1.json", "init.bat", "init.sql", "file2.txt")
.filter(filename -> filename.matches(".+\\.(?:sql|bat)$"))
.forEach(System.out::println);
There is a famous quote from Jamie Zawinski about using regular expressions when simpler non-regex code will do.
In your case, I would avoid using a regular expression and would just write a private method:
private static boolean hasMatchingExtension(Path path) {
String filename = path.toString().toLowerCase();
return filename.endsWith(".bat") || filename.endsWith(".sql");
}
Then you can use it in your stream:
result = walk.filter(p -> !Files.isDirectory(p))
.filter(p -> hasMatchingExtension(p))
.map(p -> p.toFile())
.collect(Collectors.toList());
(Consider returning List<Path> instead. The Path class is the modern replacement for the File class, some of whose methods that actually operate on files have design issues.)
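If the extension list really has to be a regex, the two answers can be combined. The following is a rough sketch that returns List<Path> as suggested; the class name RegexFileFinder, the nameRegex parameter, and the choice to match only the file name (not the whole path) case-insensitively are my assumptions.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class RegexFileFinder {
    // Returns all regular files below root whose file name matches nameRegex.
    public static List<Path> findFiles(Path root, String nameRegex) throws IOException {
        if (!Files.isDirectory(root)) {
            throw new IllegalArgumentException("Path must be a directory");
        }
        // Compile once instead of calling String.matches() for every file.
        Pattern pattern = Pattern.compile(nameRegex, Pattern.CASE_INSENSITIVE);
        try (Stream<Path> walk = Files.walk(root)) {
            return walk.filter(Files::isRegularFile)
                       .filter(p -> pattern.matcher(p.getFileName().toString()).matches())
                       .sorted()
                       .collect(Collectors.toList());
        }
    }
}
```

A call such as RegexFileFinder.findFiles(rootPath, ".+\\.(?:sql|bat)") would then collect the .bat and .sql files.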

Java Method to Check if URL Fits Pattern

I have the need to do some primitive url matching in java. I need a method that will return true, saying
/users/5/roles
matches
/users/*/roles
Here is what I am looking for and what I tried.
public Boolean fitsTemplate(String path, String template) {
Boolean matches = false;
//My broken code, since it returns false and I need true
matches = path.matches(template);
return matches;
}
One option is to replace the * with some kind of regex equivalent such as [^/]+, but the kind of pattern being used here is actually called a "glob" pattern. Starting in Java 7, you can use FileSystem.getPathMatcher to match file paths against glob patterns. For a complete explanation of the glob syntax, see the documentation for getPathMatcher.
public boolean fitsTemplate(String path, String template) {
return FileSystems.getDefault()
.getPathMatcher("glob:" + template)
.matches(Paths.get(path));
}
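The first option mentioned above, replacing * with [^/]+, can be sketched as follows. This is an illustration rather than a full glob implementation: the class name UrlTemplateMatcher is made up, and it assumes that * is the only wildcard in the template and that it should match exactly one non-empty path segment.

```java
import java.util.regex.Pattern;

public class UrlTemplateMatcher {
    // Converts a template such as "/users/*/roles" into a regex in which every
    // "*" matches one non-empty path segment, then tests the whole path against it.
    public static boolean fitsTemplate(String path, String template) {
        // Limit -1 keeps trailing empty strings, so a template ending in "*" works too.
        String[] literals = template.split("\\*", -1);
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < literals.length; i++) {
            if (i > 0) {
                regex.append("[^/]+");
            }
            // Quote the literal parts so characters like "." are not treated as regex syntax.
            regex.append(Pattern.quote(literals[i]));
        }
        return Pattern.matches(regex.toString(), path);
    }
}
```

Unlike the getPathMatcher version, this treats the input purely as a string, so it does not depend on the default file system's path rules.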

How to exclude a specific file with Apache FileFilterUtils?

Please consider the following folder structure:
src
|_text1.txt
|_text2.txt
|_content
|_text1.txt
|_text2.txt
How do I have to design an org.apache.commons.io.filefilter.IOFileFilter to exclude the src/text1.txt and src/text2.txt but keeping src/content/text1.txt and src/content/text2.txt ?
Currently my filter looks like this:
IOFileFilter filter = FileFilterUtils.and(
FileFilterUtils.notFileFilter(FileFilterUtils.nameFileFilter("text1.txt", IOCase.SENSITIVE)),
FileFilterUtils.notFileFilter(FileFilterUtils.nameFileFilter("text2.txt", IOCase.SENSITIVE))
);
FileUtils.copyDirectory(new File("src"), new File("dst"), filter);
But the code snippet above obviously doesn't copy the two text files within the src/content/ folder either (which I want to have copied)... Btw. the names of the text files are not changeable.
Any ideas?
AFAIK Commons IO doesn't provide something like a PathFileFilter, so you'd have to add your own filter here.
NameFileFilter, as the name implies, only checks the file name, i.e. the path is not relevant.
Providing your own filter should not be that hard. I'd suggest subclassing AbstractFileFilter or NameFileFilter here. Subclassing NameFileFilter might be considered a somewhat dirty approach, since you're not only checking names, but it would only require you to override the accept() methods:
public boolean accept(File file) {
return accept( file.getPath() );
}
public boolean accept(File dir, String name) {
//use normalize to account for possible double separators or windows paths which use \
return accept( FilenameUtils.normalize( dir.getPath() + "/" + name ) );
}
protected boolean accept( String path ) {
for (String nameSuffix: names) {
if (caseSensitivity.checkEndsWith( path, nameSuffix )) {
return true;
}
}
return false;
}
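Then you'd use it like FileFilterUtils.notFileFilter(new PathFileFilter("src/text1.txt")) etc. Note that the suffix has to include the parent directory: a suffix of just "/text1.txt" would also match src/content/text1.txt.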
Alternatively you could provide a set of patterns and check those:
private Set<Pattern> pathPatterns = new HashSet<>();
PathFileFilter(String... patterns) {
for( String p : patterns ) {
pathPatterns.add( Pattern.compile(p) );
}
}
protected boolean accept( String path ) {
for (Pattern pattern : pathPatterns) {
//separatorsToUnix is used to convert \ to /
if ( pattern.matcher( FilenameUtils.separatorsToUnix( path ) ).matches() ) {
return true;
}
}
return false;
}
Usage: new PathFileFilter("(?i)(.*/)?src/text[12]\\.txt"); or, with several patterns, new PathFileFilter("(?i)(.*/)?src/text1\\.txt", "(?i)(.*/)?src/text2\\.txt");
Short breakdown of the regex:
(?i) makes the expression case-insensitive; leave it out for case-sensitive matches.
(.*/)? means that if the rest of the path is preceded by anything, that prefix must end with a slash, i.e. this would match some/path/src/text1.txt but not othersrc/text1.txt.
src/text[12]\\.txt is the end of the path: the src directory, then the file name text followed by 1 or 2, and finally .txt. Including src/ keeps src/content/text1.txt from matching.
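If you'd rather not deal with path strings at all, a plain java.io.FileFilter that compares the parent directory is another option. The sketch below uses only the JDK; the class name TopLevelNameExcludeFilter is made up, and it assumes the root and the files handed to the filter are built from the same base path (it does not resolve ".." or symbolic links).

```java
import java.io.File;
import java.io.FileFilter;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class TopLevelNameExcludeFilter implements FileFilter {
    private final File root;
    private final Set<String> excludedNames;

    // root is the directory being copied; excludedNames are rejected only directly inside it.
    public TopLevelNameExcludeFilter(File root, String... excludedNames) {
        this.root = root.getAbsoluteFile();
        this.excludedNames = new HashSet<>(Arrays.asList(excludedNames));
    }

    @Override
    public boolean accept(File file) {
        File parent = file.getAbsoluteFile().getParentFile();
        boolean directlyInRoot = root.equals(parent);
        // Everything is accepted except an excluded name sitting directly in the root.
        return !(directlyInRoot && excludedNames.contains(file.getName()));
    }
}
```

As far as I know, FileUtils.copyDirectory accepts any java.io.FileFilter, so something like new TopLevelNameExcludeFilter(new File("src"), "text1.txt", "text2.txt") should be usable in place of the IOFileFilter from the question.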

Sort files in numeric order

I made a program to combine all files in a folder together.
Here's part of my code:
File folder = new File("c:/some directory");
File[] listOfFiles = folder.listFiles();
for (File file : listOfFiles){
if (file.isFile()){
System.out.println(file.getName());
File f = new File("c:/some directory"+file.getName());
However, I hope my files can be in order of like:
job1.script, job2.script, .....
but I get:
job1.script, job10.script, job11.script, that 10,11,12... are in front of 2.
I hope I can get efficient code that can avoid this problem.
Time to get rid of all the clumsy code and use Java 8! This answer also features the Path class, which has been part of Java since version 7 but seems to be heavily improved in Java 8.
The code:
private void init() throws IOException {
Path directory = Paths.get("C:\\Users\\Frank\\Downloads\\testjob");
Files.list(directory)
.filter(path -> Files.isRegularFile(path))
.filter(path -> path.getFileName().toString().startsWith("job"))
.filter(path -> path.getFileName().toString().endsWith(".script"))
.sorted(Comparator.comparingInt(this::pathToInt))
.map(path -> path.getFileName())
.forEach(System.out::println);
}
private int pathToInt(final Path path) {
return Integer.parseInt(path.getFileName()
.toString()
.replace("job", "")
.replace(".script", "")
);
}
The explanation of pathToInt:
From a given Path, obtain the String representation of the file.
Remove "job" and ".script".
Try to parse the String as an Integer.
The explanation of init, the main method:
Obtain a Path to the directory where the files are located.
Obtain a lazily populated list of Paths in the directory, be aware: These Paths are still fully qualified!
Keep files that are regular files.
Keep files of which the last part of the Path, thus the filename (for example job1.script) starts with "job". Be aware that you need to first obtain the String representation of the Path before you can check it, else you will be checking if the whole Path starts with a directory called "job".
Do the same for files ending with ".script".
Now comes the fun part. Here we sort the file list based on a Comparator that compares the integers we obtain by calling pathToInt on the Path. Here I am using a method reference: the method comparingInt(ToIntFunction<? super T> keyExtractor) expects a function that maps a T, in this case a Path, to an int. This is exactly what pathToInt does, hence it can be used as a method reference.
Then I map every Path to the Path only consisting of the filename.
Lastly, for each element of the Stream<Path>, I call System.out.println(Path.toString()).
It may seem like this code could be written more simply, but I have purposely written it more verbosely. My design here is to keep the full Path intact at all times. The very last part of the code violates that principle: shortly before the forEach, each Path gets mapped to only its file name, so you cannot process the full Path at a later point.
This code is also designed to be fail-fast: it expects files to be named in the form job(\d+).script, and will throw a NumberFormatException if that is not the case.
Example output:
job1.script
job2.script
job10.script
job11.script
An arguably better alternative features the power of regular expressions:
private void init() throws IOException {
Path directory = Paths.get("C:\\Users\\Frank\\Downloads\\testjob");
Files.list(directory)
.filter(path -> Files.isRegularFile(path))
.filter(path -> path.getFileName().toString().matches("job\\d+\\.script"))
.sorted(Comparator.comparingInt(this::pathToInt))
.map(path -> path.getFileName())
.forEach(System.out::println);
}
private int pathToInt(final Path path) {
return Integer.parseInt(path.getFileName()
.toString()
.replaceAll("job(\\d+)\\.script", "$1")
);
}
Here I use the regular expression "job\\d+\\.script", which matches a string starting with "job", followed by one or more digits, followed by ".script". The dot is escaped so that it only matches a literal dot.
I use almost the same expression for the pathToInt method, however there I use a capturing group, the parentheses, and $1 to use that capturing group.
I will also provide a concise way to read the contents of the files in one big file, as you have also asked in your question:
private void init() throws IOException {
Path directory = Paths.get("C:\\Users\\Frank\\Downloads\\testjob");
try (BufferedWriter writer = Files.newBufferedWriter(directory.resolve("masterjob.script"))) {
Files.list(directory)
.filter(path -> Files.isRegularFile(path))
.filter(path -> path.getFileName().toString().matches("job\\d+\\.script"))
.sorted(Comparator.comparingInt(this::pathToInt))
.flatMap(this::wrappedLines)
.forEach(string -> wrappedWrite(writer, string));
}
}
private int pathToInt(final Path path) {
return Integer.parseInt(path.getFileName()
.toString()
.replaceAll("job(\\d+)\\.script", "$1")
);
}
private Stream<String> wrappedLines(final Path path) {
try {
return Files.lines(path);
} catch (IOException ex) {
//swallow
return null;
}
}
private void wrappedWrite(final BufferedWriter writer, final String string) {
try {
writer.write(string);
writer.newLine();
} catch (IOException ex) {
//swallow
}
}
Please note that the lambdas used here cannot throw checked exceptions, because the functional interfaces they implement (such as Function and Consumer) do not declare any. Hence it is necessary to write wrappers around the code that decide what to do with the errors. Swallowing the exceptions is rarely a good idea; I am only doing it here for simplicity.
The real big change here is that instead of printing out the names, I map every file to its contents and write those to a file.
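As an alternative to swallowing, the wrappers can rethrow the IOException as an UncheckedIOException and the caller can unwrap it again. The following is a sketch of that idea, not a drop-in replacement for the code above: the class name JobMerger, the static helpers, and the extra target parameter are my own choices.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

public class JobMerger {
    private static final Pattern JOB_NAME = Pattern.compile("job(\\d+)\\.script");

    static boolean isJobFile(Path path) {
        return JOB_NAME.matcher(path.getFileName().toString()).matches();
    }

    // Extracts N from "job<N>.script"; only called for names that passed isJobFile.
    static int jobNumber(Path path) {
        Matcher m = JOB_NAME.matcher(path.getFileName().toString());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a job file: " + path);
        }
        return Integer.parseInt(m.group(1));
    }

    // Wrapper: turns the checked IOException into an unchecked one instead of swallowing it.
    private static Stream<String> lines(Path path) {
        try {
            return Files.lines(path);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    private static void write(BufferedWriter writer, String line) {
        try {
            writer.write(line);
            writer.newLine();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    // Concatenates all job files in numeric order into target.
    public static void merge(Path directory, Path target) throws IOException {
        try (Stream<Path> entries = Files.list(directory);
             BufferedWriter writer = Files.newBufferedWriter(target)) {
            entries.filter(Files::isRegularFile)
                   .filter(JobMerger::isJobFile)
                   .sorted(Comparator.comparingInt(JobMerger::jobNumber))
                   .flatMap(JobMerger::lines)
                   .forEachOrdered(line -> write(writer, line));
        } catch (UncheckedIOException ex) {
            // Hand the original checked exception back to the caller.
            throw ex.getCause();
        }
    }
}
```

The try-with-resources block also closes the directory stream returned by Files.list, which the shorter versions leave open.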
If your file names are always of the form jobNumber.script, you could sort the array with a custom comparator:
Arrays.sort(listOfFiles, new Comparator<File>(){
@Override
public int compare(File f1, File f2) {
String s1 = f1.getName().substring(3, f1.getName().indexOf("."));
String s2 = f2.getName().substring(3, f2.getName().indexOf("."));
return Integer.valueOf(s1).compareTo(Integer.valueOf(s2));
}
});
public static void main(String[] args) throws Exception{
File folder = new File(".");
File[] listOfFiles = folder.listFiles(new FilenameFilter() {
@Override
public boolean accept(File arg0, String arg1) {
return arg1.endsWith(".script");
}
});
System.out.println(Arrays.toString(listOfFiles));
Arrays.sort(listOfFiles, new Comparator<File>(){
@Override
public int compare(File f1, File f2) {
String s1 = f1.getName().substring(3, f1.getName().indexOf("."));
String s2 = f2.getName().substring(3, f2.getName().indexOf("."));
return Integer.valueOf(s1).compareTo(Integer.valueOf(s2));
}
});
System.out.println(Arrays.toString(listOfFiles));
}
Prints:
[.\job1.script, .\job1444.script, .\job4.script, .\job452.script, .\job77.script]
[.\job1.script, .\job4.script, .\job77.script, .\job452.script, .\job1444.script]
The easiest solution is to zero-pad all numbers lower than 10, like
job01.script
instead of
job1.script
This assumes no more than 100 files. With more, simply add more zeros.
Otherwise, you'll need to analyze and break down each file name, and then order the files numerically. Currently, they are being ordered character by character.
The simplest method to solve this problem is to prefix your names with 0s. This is what I did when I had the same problem. So basically you choose the biggest number you have (for example 433234) and prefix all numbers with biggestLength - currentNumLength zeroes.
An example:
Biggest number is 12345: job12345.script.
This way the first job becomes job00001.script.
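For what it's worth, the padding itself is a one-liner with String.format. The sketch below is only an illustration; the class name ZeroPad and the method paddedName are made up, and it assumes the job<number>.script naming from the question.

```java
import java.util.Locale;

public class ZeroPad {
    // Builds "job<number>.script" with the number left-padded with zeros to `width` digits.
    public static String paddedName(int number, int width) {
        return String.format(Locale.ROOT, "job%0" + width + "d.script", number);
    }

    public static void main(String[] args) {
        // The width is the digit count of the biggest number, here 12345.
        int width = String.valueOf(12345).length();
        System.out.println(paddedName(1, width));
        System.out.println(paddedName(12345, width));
    }
}
```

With names padded like this, plain alphabetical ordering agrees with numeric ordering.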

Filtering input files using globStatus in MapReduce

I have a lot of input files and I want to process selected ones based on the date that has been appended at the end of their names. I am confused about where I should use the globStatus method to filter out the files.
I have a custom RecordReader class and I was trying to use globStatus in its next method but it didn't work out.
public boolean next(Text key, Text value) throws IOException {
Path filePath = fileSplit.getPath();
if (!processed) {
key.set(filePath.getName());
byte[] contents = new byte[(int) fileSplit.getLength()];
value.clear();
FileSystem fs = filePath.getFileSystem(conf);
fs.globStatus(new Path("/*" + date));
FSDataInputStream in = null;
try {
in = fs.open(filePath);
IOUtils.readFully(in, contents, 0, contents.length);
value.set(contents, 0, contents.length);
} finally {
IOUtils.closeStream(in);
}
processed = true;
return true;
}
return false;
}
I know it returns a FileStatus array, but how do I use it to filter the files. Can someone please shed some light?
The globStatus method takes two complementary arguments that allow you to filter your files. The first one is the glob pattern, but sometimes glob patterns are not powerful enough to filter specific files, in which case you can also pass a PathFilter.
Regarding the glob pattern, the following are supported:
Glob | Matches
-------------------------------------------------------------------------------------------------------------------
* | Matches zero or more characters
? | Matches a single character
[ab] | Matches a single character in the set {a, b}
[^ab] | Matches a single character not in the set {a, b}
[a-b] | Matches a single character in the range [a, b] where a is lexicographically less than or equal to b
[^a-b] | Matches a single character not in the range [a, b] where a is lexicographically less than or equal to b
{a,b} | Matches either expression a or b
\c | Matches character c when it is a metacharacter
PathFilter is simply an interface like this:
public interface PathFilter {
boolean accept(Path path);
}
So you can implement this interface and implement the accept method where you can put your logic to filter files.
An example taken from Tom White's excellent book which allows you to define a PathFilter to filter files that match a certain regular expression:
public class RegexExcludePathFilter implements PathFilter {
private final String regex;
public RegexExcludePathFilter(String regex) {
this.regex = regex;
}
public boolean accept(Path path) {
return !path.toString().matches(regex);
}
}
You can directly filter your input with a PathFilter implementation by calling FileInputFormat.setInputPathFilter(JobConf, RegexExcludePathFilter.class) when initializing your job.
EDIT: Since you have to pass the class in setInputPathFilter, you can't directly pass arguments, but you should be able to do something similar by playing with the Configuration. If you make your RegexExcludePathFilter also extend from Configured, you can get back a Configuration object which you will have initialized before with the desired values, so you can get back these values inside your filter and process them in the accept.
For example if you initialize like this:
conf.set("date", "2013-01-15");
Then you can define your filter like this:
public class RegexIncludePathFilter extends Configured implements PathFilter {
private String date;
private FileSystem fs;
public boolean accept(Path path) {
try {
if (fs.isDirectory(path)) {
return true;
}
} catch (IOException e) {}
return path.toString().endsWith(date);
}
public void setConf(Configuration conf) {
if (null != conf) {
this.date = conf.get("date");
try {
this.fs = FileSystem.get(conf);
} catch (IOException e) {}
}
}
}
EDIT 2: There were a few issues with the original code, please see the updated class. You also need to remove the constructor since it's not used anymore, and check if that's a directory in which case you should return true so the content of the directory can be filtered too.
For anyone reading this, can I say "please don't do anything more complex in the filters than validating the paths". Specifically: don't do checks for the files being a directory, getting their sizes, etc. Wait until the list/glob operation has returned and then do a filtering there, using the information now in the populated FileStatus entries.
Why? All those calls to getFileStatus(), directly or via isDirectory(), are needless calls to the filesystem, and they add needless namenode load on an HDFS cluster. More critically, against S3 and other object stores, each operation potentially makes multiple HTTPS requests, and those really do take measurable time. Even worse, S3 will throttle you if it thinks you are making too many requests across your entire cluster of machines. You don't want that.
Wait until after the call. The file status entries you get back are those from the object store's list commands, which usually return thousands of file entries per HTTPS request, and so are way more efficient.
For further details, inspect the source of org.apache.hadoop.fs.s3a.S3AFileSystem.
