How can I change the invalid characters to valid chars in Java?

private static void isValidName(String[] filename){
    FileSystem fs = FileSystems.getDefault();
    System.out.println(fs);
    String pattern = ("^[\\w&[^?\\\\/. ]]+?\\.*[\\w&[^?\\\\/. ]]+$");
    for (String s: filename) {
        //System.out.println(s.matches(pattern));
        if (s.matches(pattern)==false){
            System.out.println(s.matches(pattern));
        }
    }
}
Now I call this function:
String[] name2={"valami.txt."};
isValidName(name2);
Inside the if(s.matches(pattern)==false) branch, how can I replace the invalid characters with valid ones?
Output:
false

You may use this piece of code to remove/replace invalid characters:
String[] bad = {
    "foo.tar.gz",
    " foo.txt",
    "foo?",
    "foo/",
    "foo\\",
    ".foo",
    "foo."
};
String remove_pattern = "^[ .]+|\\.+$|\\.(?=[^.]*\\.[^.]*$)|[?\\\\/:;]";
for (String s: bad) {
    System.out.println(s.replaceAll(remove_pattern, "_"));
}
Output:
foo_tar.gz
_foo.txt
foo_
foo_
foo_
_foo
foo_
The regex consists of several alternatives joined with the | alternation operator, so that it matches only the invalid character(s):
^[ .]+ - matches one or more leading spaces or dots
\\.+$ - matches one or more trailing dots (change to [. ]+$ if you also want to replace trailing spaces)
\\.(?=[^.]*\\.[^.]*$) - matches a . that is followed by non-dot characters, one more dot, and then only non-dot characters up to the end of the string (that is, the second-to-last dot, so the last dot in the string is left in place)
[?\\\\/:;] - matches ?, \, /, : and ; literally.
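As a rough sketch of how the pattern above could be wrapped for reuse, the following helper applies it with "_" as the replacement. The class name FileNameSanitizer and the method sanitize are made up for this illustration and are not part of the original answer.

```java
class FileNameSanitizer {
    // The pattern from the answer above, unchanged.
    private static final String INVALID =
        "^[ .]+|\\.+$|\\.(?=[^.]*\\.[^.]*$)|[?\\\\/:;]";

    // Hypothetical helper: replaces every match of INVALID with an underscore.
    static String sanitize(String name) {
        return name.replaceAll(INVALID, "_");
    }

    public static void main(String[] args) {
        System.out.println(sanitize("foo.tar.gz"));  // foo_tar.gz
        System.out.println(sanitize("valami.txt.")); // valami_txt_
        System.out.println(sanitize("a.b.c.d"));     // a.b_c.d
    }
}
```

Two things are worth noticing. For the input from the question, "valami.txt.", the trailing dot counts as the last dot, so the dot before "txt" is replaced as well. And with three or more dots only the second-to-last one is replaced; if every dot except the last should go, the third alternative could probably be swapped for \\.(?=.*\\.).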

Related

Replace all characters between two delimiters using regex

I'm trying to replace all characters between two delimiters with another character using regex. The replacement should have the same length as the removed string.
String string1 = "any prefix [tag=foo]bar[/tag] any suffix";
String string2 = "any prefix [tag=foo]longerbar[/tag] any suffix";
String output1 = string1.replaceAll(???, "*");
String output2 = string2.replaceAll(???, "*");
The expected outputs would be:
output1: "any prefix [tag=foo]***[/tag] any suffix"
output2: "any prefix [tag=foo]*********[/tag] any suffix"
I've tried "\\[tag=.*?](.*?)\\[/tag]" but this replaces the whole sequence with a single "*".
I think that "(.*?)" is the problem here because it captures everything at once.
How would I write something that replaces every character separately?
You can use the regex
\w(?=\w*?\[)
which matches every word character that is followed, after any further word characters, by a "[". Replacing each such match with * masks the text directly in front of the closing tag.
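A minimal sketch of that lookahead used with replaceAll, with the backslashes doubled for a Java string literal. The class name TagBodyMask and the method mask are invented for the example.

```java
class TagBodyMask {
    // Replaces each word character that is followed only by more word
    // characters and then a '[' with a star.
    static String mask(String input) {
        return input.replaceAll("\\w(?=\\w*?\\[)", "*");
    }

    public static void main(String[] args) {
        System.out.println(mask("any prefix [tag=foo]bar[/tag] any suffix"));
        System.out.println(mask("any prefix [tag=foo]longerbar[/tag] any suffix"));
    }
}
```

This relies on the tag body consisting of word characters only, and on no other word sitting directly in front of a "[". An input such as "prefix[tag=foo]bar[/tag]" would have "prefix" masked too.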
You can capture the characters inside one by one and replace each of them with *:
public static String replaceByStar(String str) {
    String pattern = "(.*\\[tag=.*\\].*)\\w(.*\\[\\/tag\\].*)";
    while (str.matches(pattern)) {
        str = str.replaceAll(pattern, "$1*$2");
    }
    return str;
}
Used like this, it prints your two expected outputs:
public static void main(String[] args) {
    System.out.println(replaceByStar("any prefix [tag=foo]bar[/tag] any suffix"));
    System.out.println(replaceByStar("any prefix [tag=foo]loooongerbar[/tag] any suffix"));
}
So the pattern "(.*\\[tag=.*\\].*)\\w(.*\\[\\/tag\\].*)" breaks down as:
(.*\\[tag=.*\\].*) captures the beginning: the prefix, the opening tag, and possibly some characters after it
\\w is the character you want to replace
(.*\\[\\/tag\\].*) captures the end: possibly some characters, the closing tag, and the suffix
The substitution $1*$2:
The pattern has the shape (text$1)oneChar(text$2), and each pass replaces it with (text$1)*(text$2), so the loop runs until no word character is left between the tags.
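For comparison, here is a sketch of a single-pass alternative that avoids re-matching the whole string in a loop. It is a different technique from the answer above: it uses Matcher.appendReplacement and builds a run of stars as long as the tag body. The names TagMasker and mask are hypothetical, and unlike the \\w-based version it masks every character between the tags, not only word characters.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagMasker {
    // Group 1: opening tag, group 2: body, group 3: closing tag.
    private static final Pattern TAGGED =
        Pattern.compile("(\\[tag=[^\\]]*\\])(.*?)(\\[/tag\\])");

    static String mask(String input) {
        Matcher m = TAGGED.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // One star per character of the body.
            char[] stars = new char[m.group(2).length()];
            Arrays.fill(stars, '*');
            String replacement = m.group(1) + new String(stars) + m.group(3);
            // quoteReplacement keeps '$' and '\' in the tags from being interpreted.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(mask("any prefix [tag=foo]bar[/tag] any suffix"));
        System.out.println(mask("any prefix [tag=foo]longerbar[/tag] any suffix"));
    }
}
```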

Make Regex Match Whitespaces in Java

How can I make this regex match white spaces? Currently, it can only match the following:
abcdatcsdotuniversitydotedu
I would like it to match the following:
abcd at cs dot university dot edu
This is the Regex:
([A-Za-z][A-Za-z0-9.\\-_]*)\\s[ ]?(at)[ ]*([A-Za-z][A-Za-z0-9\\-_(dot)]*[ ]?(dot)[ ]*[A-Za-z]+)
\s matches a white-space character. When it is used in a Java string literal you need to escape the \, so it becomes \\s. If you want to match zero or more white-space characters, use \\s*.
This will match a single domain and TLD:
([A-Za-z][A-Za-z0-9.\\-_]*)\\s*(at)\\s*([A-Za-z][A-Za-z0-9\\-_()]*\\s*(dot)\\s*[A-Za-z]+)
However, you are trying to match multiple levels of sub-domains, so you need to wrap the domain-label part of the regular expression, [A-Za-z][A-Za-z0-9\\-_()]*\\s*(dot)\\s*, in ( )+ to match one or more of them:
([A-Za-z][A-Za-z0-9.\\-_]*)\\s*(at)\\s*(([A-Za-z][A-Za-z0-9\\-_()]*\\s*(dot)\\s*)+[A-Za-z]+)
The only change from the previous pattern is the extra ( in front of the domain label and the )+ after its final \\s*.
Something like this:
import java.util.regex.Pattern;

public class RegexpMatch {
    static Pattern Regex = Pattern.compile(
        "([A-Za-z][A-Za-z0-9.\\-_]*)\\s*(at)\\s*(([A-Za-z][A-Za-z0-9\\-_()]*\\s*(dot)\\s*)+[A-Za-z]+)"
    );
    public static void main( final String[] args ){
        final String[] tests = {
            "abcdatcsdotuniversitydotedu",
            "abcd at cs dot university dot edu"
        };
        for ( final String test : tests )
            System.out.println( test + " - " + ( Regex.matcher( test ).matches() ? "Match" : "No Match" ) );
    }
}
Which outputs:
abcdatcsdotuniversitydotedu - Match
abcd at cs dot university dot edu - Match
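If the goal is to turn such a string back into an ordinary address (an assumption, since the question only asks about matching), the capture groups of the same pattern can be reused. A small sketch; the class ObfuscatedEmail and the method normalize are names invented here:

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ObfuscatedEmail {
    // Same pattern as in the answer above.
    private static final Pattern OBFUSCATED = Pattern.compile(
        "([A-Za-z][A-Za-z0-9.\\-_]*)\\s*(at)\\s*(([A-Za-z][A-Za-z0-9\\-_()]*\\s*(dot)\\s*)+[A-Za-z]+)");

    // Group 1 is the local part, group 3 the whole "x dot y dot z" domain.
    static Optional<String> normalize(String text) {
        Matcher m = OBFUSCATED.matcher(text);
        if (!m.matches()) {
            return Optional.empty();
        }
        String domain = m.group(3).replaceAll("\\s*dot\\s*", ".");
        return Optional.of(m.group(1) + "@" + domain);
    }

    public static void main(String[] args) {
        System.out.println(normalize("abcd at cs dot university dot edu"));
        System.out.println(normalize("abcdatcsdotuniversitydotedu"));
    }
}
```

Note that the naive replacement of "dot" would also rewrite a domain label that happens to contain those three letters, so this is only suitable for inputs like the ones in the question.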
I am using this code:
public static boolean isAlphaNumericWithWhiteSpace(String text) {
    return text != null && text.matches("^[\\p{L}\\p{N}ın\\s]*$");
}
\p{L} matches a single code point in the category "letter".
\p{N} matches any kind of numeric character in any script.
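A quick way to see what \p{L} and \p{N} accept is to try a few inputs. The sketch below uses a simplified character class, [\\p{L}\\p{N}\\s], on the assumption that the two extra literal letters in the pattern above are already covered by \p{L}; the class name UnicodeTextCheck is made up, and non-ASCII test strings are written as Unicode escapes.

```java
class UnicodeTextCheck {
    // Letters and numbers from any script, plus white space. String.matches
    // already anchors the pattern to the whole input, so ^ and $ are omitted.
    static boolean isAlphaNumericWithWhiteSpace(String text) {
        return text != null && text.matches("[\\p{L}\\p{N}\\s]*");
    }

    public static void main(String[] args) {
        System.out.println(isAlphaNumericWithWhiteSpace("caf\u00e9 42"));  // accented letter: true
        System.out.println(isAlphaNumericWithWhiteSpace("\u0663\u0664"));  // Arabic-Indic digits: true
        System.out.println(isAlphaNumericWithWhiteSpace("abc_def"));       // underscore: false
        System.out.println(isAlphaNumericWithWhiteSpace(null));            // false
    }
}
```

Because of the * quantifier, the empty string is accepted as well.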

RegEx: Matching n-char long sequence of repeating character

I want to split a text string that might look like this:
(((Hello! --> ((( and Hello!
or
########No? --> ######## and No?
The string starts with the same special character repeated n times, and I want to match the longest possible run of it.
What I have at the moment is this regex:
([^a-zA-Z0-9])\\1+([a-zA-Z].*)
For the first example this returns
( (only once) and Hello!
and for the second
# and No?
How do I tell the regex that I want the longest possible repetition of the matching character?
I am using RegEx as part of a Java program in case this matters.
I suggest a solution with two regexes. First, check with (?s)(\\W)\\1+\\w.* whether the string starts with the same non-word character repeated at least twice, followed by a word character. If it does, split with the simple pattern (?<=\\W)(?=\\w), which matches the position between a non-word and a word character; otherwise, just return a list containing the whole string (as if not split):
String ptrn = "(?<=\\W)(?=\\w)";
List<String> strs = Arrays.asList("(((Hello!", "########No?", "$%^&^Hello!");
for (String str : strs) {
    if (str.matches("(?s)(\\W)\\1+\\w.*")) {
        System.out.println(Arrays.toString(str.split(ptrn)));
    } else {
        System.out.println(Arrays.asList(str));
    }
}
Result:
[(((, Hello!]
[########, No?]
[$%^&^Hello!]
Also, your original regex can be modified to fit the requirement like this:
String ptrn = "(?s)((\\W)\\2+)(\\w.*)";
List<String> strs = Arrays.asList("(((Hello!", "########No?", "$%^&^Hello!");
for (String str : strs) {
    Pattern p = Pattern.compile(ptrn);
    Matcher m = p.matcher(str);
    if (m.matches()) {
        System.out.println(Arrays.asList(m.group(1), m.group(3)));
    }
    else {
        System.out.println(Arrays.asList(str));
    }
}
That regex matches:
(?s) - DOTALL inline modifier (if the string has newline characters, .* will also match them).
((\\W)\\2+) - Group 1 captures the whole run: a non-word character (captured into Group 2) followed by one or more repetitions of that same character (via the backreference \\2).
(\\w.*) - matches and captures into Group 3 a word character and then zero or more further characters.
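The reason the pattern from the question reported only a single character is that its first group wraps just one character, while the repetition \\1+ sits outside the group. A small sketch that prints group 1 for both patterns makes the difference visible; the class GroupComparison and the method firstGroup are names chosen for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupComparison {
    // Returns group 1 of a full match, or null if the text does not match.
    static String firstGroup(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String original = "([^a-zA-Z0-9])\\1+([a-zA-Z].*)"; // group 1 = one character
        String modified = "(?s)((\\W)\\2+)(\\w.*)";         // group 1 = the whole run
        System.out.println(firstGroup(original, "(((Hello!")); // (
        System.out.println(firstGroup(modified, "(((Hello!")); // (((
    }
}
```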

word extraction and splitting using Java regex

I have a string "'GLO', FLO". Now I want a regex that checks each word in the string and:
- if a word begins and ends with a single quote, replaces the single quotes with spaces
- if a comma is encountered between words, splits the two words using a space.
So, in the end, I should get GLO FLO.
Any help on how to do this using the replaceAll() method on the string?
This regex didn't do it for me: "'([^' ]+)|\\s+'"
public static void displaySplitString(final String str) {
    String pattern1 = "^'?(\\w+)'?,\\s+(\\w+)$";
    StringTokenizer strTok = new StringTokenizer(str, " , ");
    while (strTok.hasMoreTokens()) {
        String delim = (strTok.nextToken());
        delim.replaceAll(pattern1, "$1$2");
        System.out.println(delim);
    }
}
// in main method:
displaySplitString("'GLO', FLO");
Here is the snippet that should get you going:
public static void displaySplitString(String str)
{
    String pattern1 = "^'?(\\w+)'?(?=\\S)";
    str = str.replaceAll(pattern1, " $1 ");
    StringTokenizer strTok = new StringTokenizer(str, " , ");
    while (strTok.hasMoreTokens())
    {
        String delim = (strTok.nextToken());
        System.out.println(delim);
    }
}
Here,
I removed final from the str parameter so that its value can be reassigned inside the method
I use the regex ^'?(\\w+)'?(?=\\S) first, to strip the optional single quotes from around the first word
Since you use a StringTokenizer, the two lines inside the while block are then enough.
The regex means:
^ - start looking for the match at the very start of the string
'? - match 0 or 1 single quote
(\\w+) - match and capture 1 or more word characters, i.e. letters, digits or underscores (we refer to them as $1 in the replacement pattern)
'? - match 0 or 1 single quote
(?=\\S) - match only if the next character is a non-whitespace character (here, the comma). Perhaps you can even replace this lookahead with a mere , if you always have it there, after the first word.
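Since the question asks for a replaceAll()-based solution that ends up with GLO FLO, here is one possible sketch without StringTokenizer. It is not the approach of the answer above: it strips quotes around any word, then splits on commas and joins with a space. The names QuoteStripper and normalize are hypothetical.

```java
class QuoteStripper {
    static String normalize(String input) {
        // Drop a pair of single quotes around a run of word characters.
        String unquoted = input.replaceAll("'(\\w+)'", "$1");
        // Split on commas (with optional surrounding white space) and re-join with spaces.
        return String.join(" ", unquoted.trim().split("\\s*,\\s*"));
    }

    public static void main(String[] args) {
        System.out.println(normalize("'GLO', FLO")); // GLO FLO
    }
}
```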

regex which should check string contains specified word

I wrote a regex that should check whether a string contains the word 'Page' followed by any number.
This is code:
public static void main(String[] args) {
    String str1 = "12/15/14 7:01:44 Page 10 ";
    String str2 = "12/15/14 7:01:44 Page 9 ";
    System.out.println(containsPage(str2));
}
private static boolean containsPage(String str) {
    String regExp = "^.*Page[ ]{1,}[0-9].$";
    return Pattern.matches(regExp, str);
}
Result: str1: false, str2: true
Can you help me find what is wrong?
Change the regex to the following:
String regExp = "^.*Page[ ]{1,}[0-9]+.$";
so that it matches one or more digits (hence the [0-9]+).
You also don't need the boundary matchers (^ and $) since Pattern#matches would match the entire input string; and [ ]{1,} is equivalent to [ ]+:
String regExp = ".*Page +[0-9]+.";
Change it to:
String regExp = "^.*Page[ ]{1,}[0-9]+.$"; // note the + added after [0-9]; \\d+ works as well
With your original regex, [0-9] matches 9 in the second example, and . matches the space.
In the first example, [0-9] matches 1, . matches 0, and the remaining space isn't matched. Note that ^ and $ are not really needed here.
Your regex can be simplified to:
String regExp = ".*Page\\s+\\d+.";
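The patterns above still end in a ., so they need at least one character after the digits. If the intent is only "contains Page followed by a number" (an assumption about the requirement), a sketch using Matcher.find() avoids both the leading .* and the trailing dot. The class name PageCheck is made up for the example.

```java
import java.util.regex.Pattern;

class PageCheck {
    // "Page", one or more white-space characters, then one or more digits.
    private static final Pattern PAGE = Pattern.compile("Page\\s+\\d+");

    // find() looks for the pattern anywhere in the input instead of
    // requiring the whole string to match.
    static boolean containsPage(String text) {
        return PAGE.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsPage("12/15/14 7:01:44 Page 10 ")); // true
        System.out.println(containsPage("12/15/14 7:01:44 Page 9 "));  // true
        System.out.println(containsPage("12/15/14 7:01:44 Page 9"));   // true
        System.out.println(containsPage("12/15/14 7:01:44 Pages"));    // false
    }
}
```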
