Regex to match file extension - java

I have file names separated by a colon (:).
This one works as expected:
String fileName = "test.pdf:test1.txt:test2.png:test3.jpg:test4.jpeg:test5.doc";
String ext = "pdf";
System.out.println(fileName.matches(".*\\b\\."+ext+":\\b.*"));
But when the matching file is at the end, the above solution does not work:
String fileName = "test1.txt:test2.png:test3.jpg:test4.jpeg:test5.doc:test.pdf";
What is the regex to achieve it?

Change the pattern to look for either : or the end of the string ($):
".*\\." + ext + "(:|$).*"
(Also, I removed the unnecessary \\b.)
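A minimal sketch of that pattern applied to both inputs from the question. The class name `ExtensionMatchDemo` and the helper `containsExtension` are made up for illustration, and `Pattern.quote` is an addition here so that an extension containing regex metacharacters would be treated literally.

```java
import java.util.regex.Pattern;

class ExtensionMatchDemo {

    // Hypothetical helper: true if any colon-separated name ends with "." + ext.
    static boolean containsExtension(String fileNames, String ext) {
        // (:|$) accepts either a following colon or the end of the input.
        String regex = ".*\\." + Pattern.quote(ext) + "(:|$).*";
        return fileNames.matches(regex);
    }

    public static void main(String[] args) {
        String first = "test.pdf:test1.txt:test2.png:test3.jpg:test4.jpeg:test5.doc";
        String last = "test1.txt:test2.png:test3.jpg:test4.jpeg:test5.doc:test.pdf";
        System.out.println(containsExtension(first, "pdf")); // true
        System.out.println(containsExtension(last, "pdf"));  // true
        System.out.println(containsExtension(first, "gif")); // false
    }
}
```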

You can use Pattern and Matcher:
Pattern pdfPattern = Pattern.compile("\\.pdf");
if(pdfPattern.matcher(fileName).find()){
System.out.println("Found PDF");
}
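Because find() looks for "\\.pdf" anywhere in the string, the snippet above would also report a hit for a name such as notes.pdfx. One possible tightening, sketched here with a lookahead, requires a colon or the end of the input right after the extension. The names `PdfFindDemo` and `hasPdf` are hypothetical.

```java
import java.util.regex.Pattern;

class PdfFindDemo {

    // (?=:|$) is a lookahead: ".pdf" must be followed by a colon or the end of input.
    private static final Pattern PDF_AT_NAME_END = Pattern.compile("\\.pdf(?=:|$)");

    // Hypothetical helper: true if some colon-separated name ends with ".pdf".
    static boolean hasPdf(String fileNames) {
        return PDF_AT_NAME_END.matcher(fileNames).find();
    }

    public static void main(String[] args) {
        System.out.println(hasPdf("test1.txt:test.pdf"));   // true
        System.out.println(hasPdf("test.pdf:test1.txt"));   // true
        System.out.println(hasPdf("notes.pdfx:test1.txt")); // false
    }
}
```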

Related

Split a string in java based on custom logic

I have a string
"target/abcd12345671.csv"
and I need to extract
"abcd12345671"
from the string using Java. Can anyone suggest a clean way to extract this?
Core Java
String fileName = Paths.get("target/abcd12345671.csv").getFileName().toString();
fileName = fileName.replaceFirst("[.][^.]+$", "");
Using Apache Commons IO
import org.apache.commons.io.FilenameUtils;
String fileName = Paths.get("target/abcd12345671.csv").getFileName().toString();
String fileNameWithoutExt = FilenameUtils.getBaseName(fileName);
I like a regex replace approach here:
String filename = "target/abcd12345671.csv";
String output = filename.replaceAll("^.*/|\\..*$", "");
System.out.println(output); // abcd12345671
Here we use a regex alternation to remove all content up to and including the final forward slash, as well as all content from the first dot after it to the end of the filename. This leaves behind the content you actually want.
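A small sketch of the same alternation wrapped in a method, mainly to show what it does with a name that contains several dots. `BaseNameDemo` and `baseName` are hypothetical names, and the extra inputs are invented for illustration.

```java
class BaseNameDemo {

    // Hypothetical helper wrapping the alternation from the answer above.
    static String baseName(String path) {
        // "^.*/" strips everything through the last slash;
        // "\\..*$" strips everything from the first remaining dot to the end.
        return path.replaceAll("^.*/|\\..*$", "");
    }

    public static void main(String[] args) {
        System.out.println(baseName("target/abcd12345671.csv")); // abcd12345671
        System.out.println(baseName("target/archive.tar.gz"));   // archive
        System.out.println(baseName("plain.csv"));               // plain
    }
}
```

If only the last extension should be dropped (keeping archive.tar), the replaceFirst("[.][^.]+$", "") variant from the Core Java answer is the closer fit.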
Here is an approach using a regex with a capturing group:
String filename = "target/abcd12345671.csv";
// The dot before "csv" is escaped so that it matches only a literal dot
var pattern = Pattern.compile("target/(.*)\\.csv");
var matcher = pattern.matcher(filename);
if (matcher.find()) {
// Whole matched expression -> "target/abcd12345671.csv"
System.out.println(matcher.group(0));
// Matched in the first group -> in regex it is the (.*) expression
System.out.println(matcher.group(1));
}

Sanitizing strings with filenames and extension in Java

I have these four types of file names:
Filename with double extension
Filename with no extension
Filename with dot at the end, and no extension
Filename with a proper name.
Like this:
String doubleexsension = "doubleexsension.pdf.pdf";
String noextension = "noextension";
String nameWithDot = "nameWithDot.";
String properName = "properName.pdf";
String extension = "pdf";
My aim is to sanitize all of these types and output only a proper filename.filetype. I wrote a small script to illustrate the problem for this post:
ArrayList<String> app = new ArrayList<String>();
app.add(doubleexsension);
app.add(properName);
app.add(noextension);
app.add(nameWithDot);
System.out.println("------------");
for(String i : app) {
// Ends with .
if (i.endsWith(".")) {
String m = i + extension;
System.out.println(m);
break;
}
// Double extension
String p = i.replaceAll("(\\.\\w+)\\1+$", "$1");
System.out.println(p);
}
This outputs:
------------
doubleexsension.pdf
properName.pdf
noextension
nameWithDot.pdf
I don't know how to handle the noextension case. How can I do it? When there is no extension, it should take the extension value and append it to the end of the string.
My desired output would be:
------------
doubleexsension.pdf
properName.pdf
noextension.pdf
nameWithDot.pdf
Thanks in advance.
You may add alternatives to the regex to match all kinds of scenarios:
(?:(\.\w+)\1*|\.|([^.]))$
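And replace with $2.pdf. Group 2 is empty unless the last alternative matched, so the final non-dot character is put back only in the no-extension case.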
EDIT: In case the extensions that can be duplicated are known, you may use the whitelisting approach via an alternation group:
(?:(\.(?:pdf|gif|jpe?g))\1*|\.|([^.]))$
Details:
(?: - start of grouping, the $ end of string anchor is applied to all the alternatives below (they must be at the end of string)
(\.\w+)\1* - an extension (a dot followed by one or more word chars), optionally followed by repetitions of that same extension (with the whitelisting approach, only the listed extensions are taken into account: (?:pdf|gif|jpe?g) matches only pdf, gif, jpeg and jpg, plus whatever alternatives you add)
| - or
\. - a dot
| - or
([^.]) - any char that is not a dot captured into Group 2
) - end of the outer grouping
$ - end of string.
See Java demo:
List<String> strs = Arrays.asList("doubleexsension.pdf.pdf","noextension","nameWithDot.","properName.pdf");
for (String str : strs)
System.out.println(str.replaceAll("(?:(\\.\\w+)\\1*|\\.|([^.]))$", "$2.pdf"));
Easy
if (-1 == i.indexOf('.'))
System.out.println(i + "." + extension);
I would avoid the complexity (and reduced readability) of regular expressions:
String m = i;
if (m.endsWith(".")) {
m = m + extension;
}
if (m.endsWith("." + extension + "." + extension)) {
m = m.substring(0, m.length() - extension.length() - 1);
}
if (!m.endsWith("." + extension)) {
m = m + "." + extension;
}
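The three checks above can be wrapped in a method and run against the four names from the question. This is only a sketch: `ExtensionSanitizer` and `sanitize` are hypothetical names, and the logic is the same as in the snippet above.

```java
class ExtensionSanitizer {

    // Hypothetical helper: returns the name with exactly one "." + extension suffix.
    static String sanitize(String name, String extension) {
        String suffix = "." + extension;
        String m = name;
        if (m.endsWith(".")) {
            // Trailing dot with no extension: append the extension.
            m = m + extension;
        }
        if (m.endsWith(suffix + suffix)) {
            // Doubled extension: drop one copy.
            m = m.substring(0, m.length() - suffix.length());
        }
        if (!m.endsWith(suffix)) {
            // No extension at all: append dot and extension.
            m = m + suffix;
        }
        return m;
    }

    public static void main(String[] args) {
        String[] names = {"doubleexsension.pdf.pdf", "properName.pdf", "noextension", "nameWithDot."};
        for (String name : names) {
            System.out.println(sanitize(name, "pdf"));
        }
    }
}
```

Unlike the (\.\w+)\1* regex, this only collapses a doubled copy of the given extension; a name such as file.txt would become file.txt.pdf.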
Why so complex? Just do str.replaceAll("\\..*", "") + "." + extension
Java 7 NIO has a way to check this by using PathMatcher:
PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:*.pdf");
Path filename = Paths.get("namewithdot.pdf");
if (matcher.matches(filename)) {
System.out.println(filename);
}
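A self-contained version of the PathMatcher check, with the imports it needs. `PdfGlobDemo` and `isPdf` are hypothetical names. Note that this only detects whether a name already ends in .pdf; it does not repair the name.

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

class PdfGlobDemo {

    // Glob matcher from the default file system; "*.pdf" matches a single name ending in .pdf.
    private static final PathMatcher PDF_MATCHER =
            FileSystems.getDefault().getPathMatcher("glob:*.pdf");

    // Hypothetical helper: true if the given file name matches the glob.
    static boolean isPdf(String fileName) {
        Path path = Paths.get(fileName);
        return PDF_MATCHER.matches(path);
    }

    public static void main(String[] args) {
        System.out.println(isPdf("properName.pdf")); // true
        System.out.println(isPdf("noextension"));    // false
    }
}
```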

Regex expression to get the file name

I want to extract only the file name from the complete file name plus timestamp. Below is the input:
String filePath = "fileName1_20150108.csv";
expected output should be: "fileName1"
String filePath2 = "fileName1_filedesc1_20150108_002_20150109013841.csv";
And expected output should be: "fileName1_filedesc1"
I wrote the code below in Java to get the file name. It works for the first input (filePath) but not for filePath2.
Pattern pattern = Pattern.compile(".*.(?=_)");
String filePath = "fileName1_20150108.csv";
String filePath2 = "fileName1_filedesc1_20150108_002_20150109013841.csv";
Matcher matcher = pattern.matcher(filePath);
while (matcher.find()) {
System.out.print("Start index: " + matcher.start());
System.out.print(" End index: " + matcher.end() + " ");
System.out.println(matcher.group());
}
Can somebody please help me correct the regex so I can parse both file paths with the same regex?
Thanks
Anchor the start, and make the .* non-greedy:
^.*?(_\D.*?)?(?=[_.])
Update: change the second group (for fileDesc) to optional, and enforce that it starts with a non-digit character. This will work as long as your fileDesc strings never start with numbers.
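A quick sketch that runs this pattern against both inputs from the question. The class and method names (`LazyNameDemo`, `extractName`) are hypothetical; returning null when nothing matches is also just a choice made for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyNameDemo {

    // Lazy prefix, optional "_<non-digit>..." part, then a lookahead for "_" or ".".
    private static final Pattern NAME = Pattern.compile("^.*?(_\\D.*?)?(?=[_.])");

    // Hypothetical helper: returns the matched name, or null if nothing matches.
    static String extractName(String filePath) {
        Matcher matcher = NAME.matcher(filePath);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(extractName("fileName1_20150108.csv"));
        // fileName1
        System.out.println(extractName("fileName1_filedesc1_20150108_002_20150109013841.csv"));
        // fileName1_filedesc1
    }
}
```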
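You can get the characters before the first underscore, the first underscore, and then the characters until the next underscore. Note that this assumes a description part is present; for fileName1_20150108.csv it also matches the date and extension: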
^[^_]*_[^_]*
This should work: "^(.*?)_([0-9_]*)\\.([^.]*)$"
It will return you 3 groups:
the base name (assuming not a single part will be all numbers)
the timestamp info
the extension.
You can test here: http://fiddle.re/v0hne6 (RegexPlanet)
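For reference, a sketch of how the three groups could be read in Java. `TimestampedNameDemo` and `split` are hypothetical names; the regex is the one given above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TimestampedNameDemo {

    // Group 1: base name, group 2: digits/underscores (timestamp info), group 3: extension.
    private static final Pattern PARTS = Pattern.compile("^(.*?)_([0-9_]*)\\.([^.]*)$");

    // Hypothetical helper: returns {baseName, timestampInfo, extension}, or null if no match.
    static String[] split(String filePath) {
        Matcher matcher = PARTS.matcher(filePath);
        if (!matcher.matches()) {
            return null;
        }
        return new String[] {matcher.group(1), matcher.group(2), matcher.group(3)};
    }

    public static void main(String[] args) {
        String[] inputs = {
            "fileName1_20150108.csv",
            "fileName1_filedesc1_20150108_002_20150109013841.csv"
        };
        for (String input : inputs) {
            String[] parts = split(input);
            System.out.println(parts[0] + " | " + parts[1] + " | " + parts[2]);
        }
    }
}
```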

How to remove an id out of a path using a Java Regex?

I am trying to get rid of an "id" in URI paths and I can only use Java regex transformation.
The paths look like this:
/web/service/1223345/add
/web/service/1223345/delete
/web/service/v2/1223345/add
/web/service/1223345
/web/service/do
The id is always a series of numbers. In the example above it is "1223345".
I have tried a couple of regexes but none of them worked. Here are my tries:
(/\w.*)/?[0-9]*/(.*)
([^0-9]+){0,}
(/.*/)[0-9]*(/.*)
Thanks for your help
String input = "/web/service/1223345/add";
System.out.println(input.replaceAll("/\\d*/","/"));
Output:
/web/service/add
If you want to remove the id, you could do the following:
String input = "/web/service/v2/1223345/add";
String removed = input.replaceAll("/\\d*/?", "/");
System.out.println(removed);
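Note that arnoud's regex "/\d*/" will not work for e.g. /web/service/1223345.
The question mark at the end of the regex makes the trailing slash optional and so covers such cases: "/\d*/?". (For that input the result is /web/service/, with a trailing slash.)
If, on the other hand, you want to extract the id: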
Pattern pattern = Pattern.compile(".*?/(\\d*?)(/.*)?$");
Matcher matcher = pattern.matcher(input);
if (matcher.find()) {
String id = matcher.group(1);
System.out.println(id);
}
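As a variation on the removal regexes above (not taken from either answer), the sketch below requires at least one digit and uses a lookahead so that the slash after the id is left in place. That avoids the trailing slash when the id is the last segment. `IdRemover` and `removeId` are hypothetical names.

```java
class IdRemover {

    // Removes a path segment made only of digits; (?=/|$) checks for a following
    // slash or the end of the path without consuming it.
    static String removeId(String path) {
        return path.replaceAll("/\\d+(?=/|$)", "");
    }

    public static void main(String[] args) {
        String[] paths = {
            "/web/service/1223345/add",
            "/web/service/1223345/delete",
            "/web/service/v2/1223345/add",
            "/web/service/1223345",
            "/web/service/do"
        };
        for (String path : paths) {
            System.out.println(removeId(path));
        }
    }
}
```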

Java regex expression to sanitize an uploaded file name

I'm trying to sanitize a String that contains an uploaded file's name. I'm doing this because the files will be downloaded from the web, and I also want to normalize the names. This is what I have so far:
private String pattern = "[^0-9_a-zA-Z\\(\\)\\%\\-\\.]";
//Class methods & stuff
private String sanitizeFileName(String badFileName) {
StringBuffer cleanFileName = new StringBuffer();
Pattern filePattern = Pattern.compile(pattern);
Matcher fileMatcher = filePattern.matcher(badFileName);
boolean match = fileMatcher.find();
while(match) {
fileMatcher.appendReplacement(cleanFileName, "");
match = fileMatcher.find();
}
return cleanFileName.substring(0, cleanFileName.length() > 250 ? 250 : cleanFileName.length());
}
This works ok, but for a strange reason the extension of the file is erased. i.e. "p%Z_-...#!$()=¡¿&+.jpg" ends up being "p%Z_-...()".
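Any idea how I should tune my regex?
You need to call Matcher#appendTail after your loop. appendReplacement only copies the text up to each match, so without appendTail everything after the last match (here, .jpg) is never added to the buffer.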
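A sketch of the question's method with that call added. The pattern and the 250-character limit come from the question; the class name `FileNameSanitizer` and the constant names are made up, and the non-ASCII characters in the sample are written as Unicode escapes only to keep the source encoding-independent.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FileNameSanitizer {

    // Same character class as in the question: anything outside it is removed.
    private static final Pattern DISALLOWED =
            Pattern.compile("[^0-9_a-zA-Z\\(\\)\\%\\-\\.]");
    private static final int MAX_LENGTH = 250;

    static String sanitizeFileName(String badFileName) {
        StringBuffer cleanFileName = new StringBuffer();
        Matcher fileMatcher = DISALLOWED.matcher(badFileName);
        while (fileMatcher.find()) {
            // Copies the text before the match and replaces the match with nothing.
            fileMatcher.appendReplacement(cleanFileName, "");
        }
        // Copies whatever follows the last match (for example the extension).
        fileMatcher.appendTail(cleanFileName);
        return cleanFileName.substring(0, Math.min(cleanFileName.length(), MAX_LENGTH));
    }

    public static void main(String[] args) {
        // \u00A1 and \u00BF are the inverted exclamation and question marks.
        String sample = "p%Z_-...#!$()=\u00A1\u00BF&+.jpg";
        System.out.println(sanitizeFileName(sample)); // p%Z_-...().jpg
    }
}
```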
One line solution:
return badFileName.replaceAll("[^0-9_a-zA-Z\\(\\)\\%\\-\\.]", "");
If you want to restrict it to just alphanumeric and space:
return badFileName.replaceAll("[^a-zA-Z0-9 ]", "");
Cheers :)
