Cleaning a file name in Java


I want to write a script that will clean up the names of my .mp3 files.
I was able to write a few lines that change the name, but I want an automatic script that erases all the undesired characters ($%_!?7 etc.) and renames the file to the following format: Artist - Song.
File file = new File("C://Users//nikita//Desktop//$%#Artis8t_-_35&Son5g.mp3");
String Original = file.toString();
String New = "Code to change 'Original' to 'Artist - Song'";
File file2 = new File("C://Users//nikita//Desktop//" + New + ".mp3");
file.renameTo(file2);
I feel like I should make a list with all possible characters and then run the String through this list and erase all of the listed characters but I am not sure how to do it.
String test = "$%$#Arti56st_-_54^So65ng.mp3";
Edit 1:
When I try using the replace method, it still doesn't change the name.
String test = "$%$#Arti56st_-_54^So65ng.mp3";
System.out.println("Original: " + test);
test.replace( "[0-9]%#&\\$", "");
System.out.println("New: " + test);
The code above returns the following output
Original: $%$#Arti56st_-_54^So65ng.mp3
New: $%$#Arti56st_-_54^So65ng.mp3

I'd suggest something like this:
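import java.util.regex.Matcher;
import java.util.regex.Pattern;
public static String sanitizeFilename(String original) {
    // Greedy groups: everything before the last '-' is the artist, the rest (minus ".mp3") is the song.
    Pattern p = Pattern.compile("(.*)-(.*)\\.mp3");
    Matcher m = p.matcher(original);
    if (m.matches()) {
        // Whitelist: keep only ASCII letters and spaces.
        String artist = m.group(1).replaceAll("[^a-zA-Z ]", "");
        String song = m.group(2).replaceAll("[^a-zA-Z ]", "");
        return String.format("%s - %s", artist, song);
    } else {
        throw new IllegalArgumentException("Failed to match filename : " + original);
    }
}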
(Edit - changed whitelist regex to exclude digits and underscores)
Two points in particular - when sanitizing strings, it's a good idea to whitelist permitted characters, rather than blacklisting the ones you want to exclude, so you won't be surprised by edge cases later. (You may want a less restrictive whitelist than I've used here, but it's easy to vary)
It's also a good idea to handle the case where the filename doesn't match the expected pattern. If your code comes across something other than an MP3, how would you like it to respond? Here I've thrown an exception, so the calling code can catch and handle that appropriately.
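If you would rather skip non-matching files than throw, one possible variation is to return an Optional. This is only a sketch under that assumption: the class name Mp3Names and the method name clean are made up for illustration, and the trim() calls are an addition so that a name that already contains spaces around the dash does not end up with doubled spaces.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class Mp3Names {
    // Same pattern as in the answer above: "<artist>-<song>.mp3".
    private static final Pattern NAME = Pattern.compile("(.*)-(.*)\\.mp3");

    // Returns "Artist - Song", or an empty Optional when the name does not match.
    static Optional<String> clean(String original) {
        Matcher m = NAME.matcher(original);
        if (!m.matches()) {
            return Optional.empty();
        }
        // Whitelist letters and spaces, then drop leading/trailing spaces.
        String artist = m.group(1).replaceAll("[^a-zA-Z ]", "").trim();
        String song = m.group(2).replaceAll("[^a-zA-Z ]", "").trim();
        return Optional.of(artist + " - " + song);
    }

    public static void main(String[] args) {
        System.out.println(clean("$%#Artis8t_-_35&Son5g.mp3").orElse("(skipped)"));
        System.out.println(clean("notes.txt").orElse("(skipped)"));
    }
}
```

The caller can then decide per file whether to rename it or leave it alone, without a try/catch.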

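String cleaned = original.replaceAll("[0-9%#&$]", "");
This should remove almost all the characters you don't want. Use replaceAll (replace treats its argument as literal text, not as a regex) and keep the returned String, because Strings are immutable. Apply it to the name without the extension, since [0-9] also matches the 3 in .mp3.
Or you can come up with your own regex:
https://docs.oracle.com/javase/tutorial/essential/regex/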
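To make the difference concrete, here is a small sketch of why the call from the question's edit leaves the name unchanged, and what a working blacklist could look like. The class and method names (ReplaceDemo, stripJunk) are hypothetical, and the character class is only an example blacklist, not a complete one.

```java
final class ReplaceDemo {
    // Strips digits and a few symbols from the base name, leaving the extension untouched.
    static String stripJunk(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String base = dot < 0 ? fileName : fileName.substring(0, dot);
        String ext = dot < 0 ? "" : fileName.substring(dot);
        // Inside a character class, $ and a non-leading ^ are literal characters.
        return base.replaceAll("[0-9%#&$^]", "") + ext;
    }

    public static void main(String[] args) {
        String test = "$%$#Arti56st_-_54^So65ng.mp3";
        // No effect: the result is discarded, and replace() looks for this literal text.
        test.replace("[0-9]%#&\\$", "");
        System.out.println("Unchanged: " + test);
        System.out.println("Cleaned:   " + stripJunk(test));
    }
}
```

Underscores are left in place here; turning Artist_-_Song into Artist - Song is a separate step.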

Related

Java won't replace all strings, because there is text next to the tags (post improved)

I'm working on a program that formats HTML code extracted from a PDF file.
I have a list of Strings, one per paragraph.
As the PDF has hyperlinks, I decided to replace them with a footnote number such as "[1]".
This will be used for citing sources. Eventually I plan to put it at the end of a paragraph or sentence, so you can look up the sources as you would in a book.
My Problem
For some reason not all the hyperlinks are replaced.
The reason is most likely, that there is text directly next to the tag.
Hell<a href="http://www.example.com">o old chap!
Specifically, the "o" part and the "Hell" part seem to be blocking Java's .replaceAll function from doing its job.
Expected Result
Hello [1] old chap!
EDIT:
If I just added a space before and after the URL, it might split some words, like "help" into "hel p", which is also not an option.
My code would have to replace the URL tag (without the ) and create no new extra spaces.
This is some of my code, where the problem occurs:
for (int i = 0; i < EN.length; i++) {
    Pattern pattern_URL = Pattern.compile("<a(.+?)\">", Pattern.DOTALL);
    Matcher matcher_URL = pattern_URL.matcher(EN[i]); // Checks the current array element.
    if (matcher_URL.find() == true) {
        source_number++;
        String extractedURL = matcher_URL.group(0);
        //System.out.println(extractedURL);
        String extractedURL_fully = extractedURL.replaceAll("href=\"", ""); // Quotation marks
        //System.out.println(extractedURL_fully);
        String nobracketURL = extractedURL.replaceAll("\\)", ""); // Remove round brackets from URL
        /* Remove round brackets from the URL in the array. (For some reason there were href URLs with a bracket at the end. This was already in the PDF. They caused massive problems because they were not escaped, so the entire replaceAll command didn't work.) */
        EN[i] = EN[i].replaceAll("\\)\"", "\"");
        EN[i] = EN[i].replaceAll(nobracketURL, "[" + source_number + "]"); // Replace URL tags with number and square brackets
    } else {
        //System.out.println("FALSE: " + "[" + i + "]");
    }
}
The whole idea is that it loops through the array and replaces all the URLs, from the start of the opening tag <a to its end "> (which can also be seen in the pattern regex).

Correct me if I'm wrong, but what you need is to eliminate all the <a> tags from a given string, right? If that's the case all you needed to do was use a code like the following:
final String string = "<a href=\"http://www.example.com\">Sen";
final Pattern pattern = Pattern.compile("<a(.+?)>", Pattern.DOTALL);
final Matcher matcher = pattern.matcher(string);
final String result = matcher.replaceAll("");
System.out.println(result); // prints "Sen"
Notice I didn't use the replaceAll from the String object, but from the Matcher object. This replaces every match with the empty string "".
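If the goal is a running footnote number rather than plain removal, the same Matcher can build the result piece by piece. The following is a rough sketch, not the asker's code: the names FootnoteLinks and numberLinks are invented, it puts the number exactly where the opening tag was, and it assumes closing tags are handled separately. A regex like this is not a full HTML parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FootnoteLinks {
    // Matches an opening <a ...> tag; \b keeps it from matching tags such as <abbr>.
    private static final Pattern OPEN_A = Pattern.compile("<a\\b[^>]*>");

    // Replaces the first opening tag with [1], the second with [2], and so on.
    static String numberLinks(String html) {
        Matcher m = OPEN_A.matcher(html);
        StringBuffer out = new StringBuffer();
        int n = 0;
        while (m.find()) {
            n++;
            // quoteReplacement guards against $ or \ being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement("[" + n + "]"));
        }
        m.appendTail(out); // copy whatever follows the last match
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(numberLinks("Hell<a href=\"http://www.example.com\">o old chap!"));
    }
}
```

Because the replacement is built from the match position and not from the matched text, characters such as ")" inside the URL need no special handling.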

How can I get non-matching groups using a Matcher in Java?

I'm trying to write a java regex to catch some groups of words from a String using a Matcher.
Say I have this string: "Hello, we are #happy# to see you today".
I would like to get 2 groups of matches, one containing
Hello, we are
to see you today
and the other
happy
So far, I was only able to match the word between the #s using this Pattern:
Pattern p = Pattern.compile("#(.+?)#");
I've read about negative lookahead and lookaround, played a bit with it but without success.
I assume I should do some sort of negation of the regex so far, but I couldn't come up with anything.
Any help would be really appreciated, thank you.

From comment:
I may incur in a string where I got more than one instances of words wrapped by #, such as "#Hello# kind #stranger#"
From comment:
I need to apply some different style format to both the text inside and outside.
Since you need to apply different styling, the code needs to process each block of text separately, and needs to know whether the text is inside or outside a #..# section.
Note, in the following code, it will silently skip the last #, if there is an odd number of them.
String input = ...
for (Matcher m = Pattern.compile("([^#]+)|#([^#]+)#").matcher(input); m.find(); ) {
    if (m.start(1) != -1) {
        String outsideText = m.group(1);
        System.out.println("Outside: \"" + outsideText + "\"");
    } else {
        String insideText = m.group(2);
        System.out.println("Inside: \"" + insideText + "\"");
    }
}
Output for input = "Hello, we are #happy# to see you today"
Outside: "Hello, we are "
Inside: "happy"
Outside: " to see you today"
Output for input = "#Hello# kind #stranger#"
Inside: "Hello"
Outside: " kind "
Inside: "stranger"
Output for input = "This #text# has unpaired # characters"
Outside: "This "
Inside: "text"
Outside: " has unpaired "
Outside: " characters"

The best I could do is splitting in 3 groups, then merging the group 1 and 4 :
(^.*)(\#(.+?)\#)(.*)
EDIT: Taking remarks from the comments :
(^[^\#]*)(?:\#(.+?)\#)([^\#]*)
Thanks to @Lino we don't capture the useless group with # anymore, and we capture anything except #, instead of any non-whitespace character, in the 1st and 2nd groups.

Is this solution fine?
Pattern pattern = Pattern.compile("([^#]+)|#([^#]*)#");
Matcher matcher = pattern.matcher("Hello, we are #happy# to see you today");
List<String> notBetween = new ArrayList<>(); // not surrounded by #
List<String> between = new ArrayList<>(); // surrounded by #
while (matcher.find()) {
    if (Objects.nonNull(matcher.group(1))) notBetween.add(matcher.group(1));
    if (Objects.nonNull(matcher.group(2))) between.add(matcher.group(2));
}
System.out.println("Printing group 1");
for (String string : notBetween) {
    System.out.println(string);
}
System.out.println("Printing group 2");
for (String string : between) {
    System.out.println(string);
}

How to Check filename

I have a method that reads the name of a file from a certain source. It should handle two types of files, and my next method depends on the name of the file:
"_InPayed.txt" and "_OutPayed.txt"
My problem is how to check whether the filename is ..._InPayed or ..._OutPayed.
How do I check the part of the string after the "_"?
Code:
public class FilenameDemo {
public static void main(String[] args) {
final String FPATH = "/home/mem/filename.txt";
System.out.println("Extension = " + myHomePage.extension());
System.out.println("Filename = " + myHomePage.filename());
}
}

No need for subString. A simple call to contains(...) is all you need. More importantly, learn to use the API.
if (myString.toLowerCase().contains("inpayed")) {
// do something
}
String API

If you want to check for a certain string in the file name, use the contains method.
You can also get the part of the file name after _ using the expression below:
System.out.println("Filename = "+FPATH.substring(FPATH.lastIndexOf("_")+1,FPATH.lastIndexOf(".")));

The string must contain one and only one _ for this to work.
String fullFile = "whatever_InOutPayed.txt";
String[] split = fullFile.split("_");
// split[0] = "whatever"
// split[1] = "InOutPayed.txt"
String file = split[1];
If it contains more than one, take the last element of the array:
String file = split[split.length - 1];
or you can easily use String.contains().

You can use substring() method to extract a desired part of a file name stored as String. Or if the part you are checking will always be at the end of the String a simpler option is to use endsWith(). Check the documentation for details about the method usage.
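As an illustration of the endsWith() route, here is a minimal sketch. It assumes the two suffixes from the question and a case-insensitive comparison; the class PayedFiles, the enum Kind and the method classify are made-up names.

```java
import java.util.Locale;

final class PayedFiles {
    enum Kind { IN, OUT, UNKNOWN }

    // Decides the file type from the end of the name, ignoring upper/lower case.
    static Kind classify(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith("_inpayed.txt")) {
            return Kind.IN;
        }
        if (lower.endsWith("_outpayed.txt")) {
            return Kind.OUT;
        }
        return Kind.UNKNOWN;
    }

    public static void main(String[] args) {
        System.out.println(classify("2020_report_InPayed.txt"));
        System.out.println(classify("2020_report_OutPayed.txt"));
        System.out.println(classify("readme.txt"));
    }
}
```

Unlike split("_"), this does not care how many underscores appear earlier in the name.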

Search and replace formatted properties inside Java string

I will be given Strings that contain formatted "properties"; that is, Strings encapsulated inside the standard "${" and "}" tokens:
"This is an ${example} of a ${string} that I may be ${given}."
I will also have a HashMap<String,String> containing substitutions for each possible formatted property:
HashMap Keys HashMapValues
===========================================
bacon eggs
ham salads
So that, given the following String:
"I like to eat ${bacon} and ${ham}."
I can send this to a Java method that will transform it into:
"I like to eat eggs and salads."
Here's my best attempt:
System.out.println("Starting...");
String regex = "$\\{*\\}";
Map<String,String> map = new HashMap<String, String>();
map.put("bacon", "eggs");
map.put("ham", "salads");
String sampleString = "I like ${bacon} and ${ham}.";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(sampleString);
while(matcher.find()) {
System.out.println("Found " + matcher.group());
// Strip leading "${" and trailing "}" off.
String property = matcher.group();
if(property.startsWith("${"))
property = property.substring(2);
if(property.endsWith("}"))
property = property.substring(0, property.length() - 1);
System.out.println("Before being replaced, property is: " + property);
if(map.containsKey(property))
property = map.get(property);
// Now, not sure how to locate the original property ("${bacon}", etc.)
// inside the sampleString so that I can replace "${bacon}" with
// "eggs".
}
System.out.println("Ending...");
When I execute this, I get no errors, but just see the "Starting..." and "Ending..." outputs. This tells me that my regex is incorrect, and so the Matcher isn't able to match any properties.
So my first question is: what should this regex be?
Once I'm past that, I'm not sure how to perform the string replace once I've changed "${bacon}" into "eggs", etc. Any ideas? Thanks in advance!

Why not use a .properties file? That way you could get all your messages from that file and keep them separate from your code, something like this (file example.properties):
message1=This is a {0} with format markers on it {1}
And then in your class load your bundle and use it like this:
ResourceBundle bundle = ResourceBundle.getBundle("example", Locale.getDefault()); // base name, without the .properties extension
MessageFormat.format(bundle.getString("message1"), "param0", "param1"); // Returns the formatted String "This is a param0 with format markers on it param1"
You could use MessageFormat (from the java.text package) without the bundle (just use the String directly), but again, having a bundle keeps your code clean (and makes internationalization easy).
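For the no-bundle case, a tiny sketch could look like the following. The class and method names are invented for the example; note that MessageFormat uses positional markers such as {0}, not the ${name} style from the question.

```java
import java.text.MessageFormat;

final class MessageFormatDemo {
    // Fills the positional markers {0} and {1} in a fixed pattern.
    static String describe(String first, String second) {
        return MessageFormat.format("This is a {0} with format markers on it {1}", first, second);
    }

    public static void main(String[] args) {
        System.out.println(describe("param0", "param1"));
    }
}
```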

Use this instead:
String regex = "\\$\\{([^}]*)\\}";
Then you obtain only the content between ${ and } that is inside the capture group 1.
Note that $ has a special meaning in a pattern: end of the string.
Thus it must be escaped to be treated as a literal (as must the curly brackets).
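With that regex, the replacement step from the question can be sketched with Matcher.appendReplacement. This is one possible approach, not the only one: PropertyReplacer and resolve are hypothetical names, and leaving unknown properties untouched is an assumption about the desired behaviour.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PropertyReplacer {
    private static final Pattern PROPERTY = Pattern.compile("\\$\\{([^}]*)\\}");

    // Replaces each ${key} with its value from the map; unknown keys stay as they are.
    static String resolve(String text, Map<String, String> values) {
        Matcher m = PROPERTY.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String key = m.group(1); // the text between ${ and }
            String replacement = values.getOrDefault(key, m.group());
            // quoteReplacement keeps a literal $ in the replacement from being read as a group reference.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> map = new HashMap<>();
        map.put("bacon", "eggs");
        map.put("ham", "salads");
        System.out.println(resolve("I like to eat ${bacon} and ${ham}.", map));
    }
}
```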

Better use StrSubstitutor from apache commons lang. It can also substitute System props.
Since commons-lang 3.6 StrSubstitutor has been deprecated in favour of commons-text StringSubstitutor. Example:
import org.apache.commons.text.StringSubstitutor;
Properties props = new Properties();
props.setProperty("bacon", "eggs");
props.setProperty("ham", "salads");
String sampleString = "I like ${bacon} and ${ham}.";
String replaced = StringSubstitutor.replace(sampleString, props);

For completeness, here is a working solution:
static final Pattern EXPRESSION_PATTERN = Pattern.compile("\\$\\{([^}]*)\\}");
/**
 * Replace ${properties} in an expression
 * @param expression expression string
 * @param properties property map
 * @return resolved expression string
 */
static String resolveExpression(String expression, Map<String, String> properties) {
StringBuilder result = new StringBuilder(expression.length());
int i = 0;
Matcher matcher = EXPRESSION_PATTERN.matcher(expression);
while(matcher.find()) {
// Strip leading "${" and trailing "}" off.
result.append(expression.substring(i, matcher.start()));
String property = matcher.group();
property = property.substring(2, property.length() - 1);
if(properties.containsKey(property)) {
//look up property and replace
property = properties.get(property);
} else {
//property not found, don't replace
property = matcher.group();
}
result.append(property);
i = matcher.end();
}
result.append(expression.substring(i));
return result.toString();
}

Java regex expression to sanitize an uploaded file name

I'm trying to sanitize a String that contains an uploaded file's name. I'm doing this because the files will be downloaded from the web and, in addition, I want to normalize the names. This is what I have so far:
private String pattern = "[^0-9_a-zA-Z\\(\\)\\%\\-\\.]";
//Class methods & stuff
private String sanitizeFileName(String badFileName) {
StringBuffer cleanFileName = new StringBuffer();
Pattern filePattern = Pattern.compile(pattern);
Matcher fileMatcher = filePattern.matcher(badFileName);
boolean match = fileMatcher.find();
while(match) {
fileMatcher.appendReplacement(cleanFileName, "");
match = fileMatcher.find();
}
return cleanFileName.substring(0, cleanFileName.length() > 250 ? 250 : cleanFileName.length());
}
This works ok, but for a strange reason the extension of the file is erased. i.e. "p%Z_-...#!$()=¡¿&+.jpg" ends up being "p%Z_-...()".
Any idea how I should tune my regex?

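You need a call to Matcher#appendTail after your loop. appendReplacement only copies the text up to each match, so without it everything after the last match (here the extension) is dropped.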
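A sketch of the question's method with that call added might look like this. The class name FileNameSanitizer and the constant names are invented for the example; the character class and the 250-character limit are taken from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FileNameSanitizer {
    // Everything that is NOT a digit, letter, underscore, parenthesis, %, - or . gets removed.
    private static final Pattern DISALLOWED = Pattern.compile("[^0-9_a-zA-Z()%\\-.]");
    private static final int MAX_LENGTH = 250;

    static String sanitizeFileName(String badFileName) {
        Matcher m = DISALLOWED.matcher(badFileName);
        StringBuffer clean = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(clean, ""); // copies the text before the match, drops the match
        }
        m.appendTail(clean); // copies the text after the last match, e.g. ".jpg"
        return clean.substring(0, Math.min(clean.length(), MAX_LENGTH));
    }

    public static void main(String[] args) {
        System.out.println(sanitizeFileName("p%Z_-...#!$()=&+.jpg"));
    }
}
```

Note that cutting at 250 characters can still remove the extension from a very long name.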

One line solution:
return badFileName.replaceAll("[^0-9_a-zA-Z\\(\\)\\%\\-\\.]", "");
If you want to restrict it to just alphanumeric and space:
return badFileName.replaceAll("[^a-zA-Z0-9 ]", "");
Cheers :)

