Java replaceAll regex error

I want to transform every "*" into ".*", except an escaped "\*".
String regex01 = "\\*toto".replaceAll("[^\\\\]\\*", ".*");
assertTrue("*toto".matches(regex01));// True
String regex02 = "toto*".replaceAll("[^\\\\]\\*", ".*");
assertTrue("tototo".matches(regex02));// True
String regex03 = "*toto".replaceAll("[^\\\\]\\*", ".*");
assertTrue("tototo".matches(regex03));// Error
If the "*" is the first character, an error occurs:
java.util.regex.PatternSyntaxException:
Dangling meta character '*' near index 0
What is the correct regex?

This is currently the only solution here that handles several escaped backslashes (\\) in a row:
String regex = input.replaceAll("\\G((?:[^\\\\*]|\\\\[\\\\*])*)[*]", "$1.*");
How it works
Let's print the string regex to have a look at the actual string being parsed by the regex engine:
\G((?:[^\\*]|\\[\\*])*)[*]
((?:[^\\*]|\\[\\*])*) matches a sequence made of characters other than \ and *, and of the escape sequences \\ and \*. This covers all the characters that we don't want to touch, and they are placed in a capturing group so that we can put them back.
The above sequence is followed by an unescaped asterisk, as described by [*].
In order to make sure that we don't "jump" when the regex can't match an unescaped *, \G is used to make sure the next match can only start at the beginning of the string, or from where the last match ends.
Why such a long solution? It is necessary because a look-behind that checks whether the number of consecutive \ preceding a * is odd or even would have to be of unbounded length, which is not officially supported by Java regex. Therefore, we need to consume the string from left to right, taking escape sequences into account, until we encounter an unescaped * and replace it with .*.
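For comparison, the same left-to-right consumption can be sketched as a plain loop instead of a regex. This is only an illustration: the names WildcardToRegex and convert are hypothetical, and unlike the regex above the loop treats a backslash as escaping whatever character follows it, not only \ and *.

```java
class WildcardToRegex {
    // Hypothetical helper: replaces each unescaped '*' with ".*".
    // Assumption: a backslash escapes whatever character follows it.
    static String convert(String input) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '\\' && i + 1 < input.length()) {
                // Copy the escape sequence untouched and skip the escaped character.
                out.append(c).append(input.charAt(++i));
            } else if (c == '*') {
                out.append(".*");
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(convert("*toto"));      // .*toto
        System.out.println(convert("\\*toto"));    // \*toto
        System.out.println(convert("\\\\*toto"));  // \\.*toto
    }
}
```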
Test program
// Requires: import java.util.regex.Pattern;
String[] inputs = {
    "toto*",
    "\\*toto",
    "\\\\*toto",
    "*toto",
    "\\\\\\\\*toto",
    "\\\\*\\\\\\*\\*\\\\\\\\*"};
for (String input : inputs) {
    String regex = input.replaceAll("\\G((?:[^\\\\*]|\\\\[\\\\*])*)[*]", "$1.*");
    System.out.println(input);
    // Compiling the result checks that it is a valid regex.
    System.out.println(Pattern.compile(regex));
    System.out.println();
}
Sample output
toto*
toto.*
\*toto
\*toto
\\*toto
\\.*toto
*toto
.*toto
\\\\*toto
\\\\.*toto
\\*\\\*\*\\\\*
\\.*\\\*\*\\\\.*

You need to use negative lookbehind here:
String regex01 = input.replaceAll("(?<!\\\\)\\*", ".*");
(?<!\\\\) is a negative lookbehind that means match * if it is not preceded by a backslash.
Examples:
regex01 = "\\*toto".replaceAll("(?<!\\\\)\\*", ".*");
//=> \*toto
regex01 = "*toto".replaceAll("(?<!\\\\)\\*", ".*");
//=> .*toto
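A small sketch of how this lookbehind behaves on a few inputs, including one where it differs from the \G-based answer. The class and method names (LookbehindDemo, convert) are made up for illustration. The lookbehind only inspects the single character before the *, so a * that follows an escaped backslash (\\*) is left alone.

```java
class LookbehindDemo {
    // Replaces every '*' that is not directly preceded by a backslash.
    static String convert(String input) {
        return input.replaceAll("(?<!\\\\)\\*", ".*");
    }

    public static void main(String[] args) {
        System.out.println(convert("*toto"));      // .*toto
        System.out.println(convert("\\*toto"));    // \*toto  (escaped star kept)
        System.out.println(convert("\\\\*toto"));  // \\*toto (star after an escaped backslash is not converted)
    }
}
```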

You have to cater for the case of a string starting with * in your regex:
(^|[^\\\\])\\*
The single caret (^) is the start anchor: it matches at the beginning of the string.
Edit
Apart from the correction above, the replacement string in the replaceAll call must be $1.* instead of .* lest a matched character before an unescaped * be lost.
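A minimal sketch of this pattern with and without the $1 in the replacement, to show why the captured character has to be put back. The names StartAnchorDemo, convert and convertLossy are hypothetical.

```java
class StartAnchorDemo {
    // Keeps the character captured before the '*' by re-inserting it with $1.
    static String convert(String input) {
        return input.replaceAll("(^|[^\\\\])\\*", "$1.*");
    }

    // Same pattern without $1: the character matched before the '*' is dropped.
    static String convertLossy(String input) {
        return input.replaceAll("(^|[^\\\\])\\*", ".*");
    }

    public static void main(String[] args) {
        System.out.println(convert("*toto"));        // .*toto
        System.out.println(convert("toto*"));        // toto.*
        System.out.println(convert("\\*toto"));      // \*toto
        System.out.println(convertLossy("toto*"));   // tot.*
        System.out.println("tototo".matches(convert("*toto"))); // true
    }
}
```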

Related

Regex: string can contain spaces, but not only spaces. It cannot contain `*` nor `:` characters either

I need help finding a regex that will allow most strings, except:
if the string only contains whitespaces
if the string contains : or *
I want to reject the following strings:
"hello:world"
"hello*world"
" " (just a whitespace)
But the following strings will pass:
"hello world"
"hello"
So far, I can accomplish what I want... in two patterns.
[^:*]* rejects the 2 special characters
.*\S.* rejects any string with only whitespaces
I'm not sure how to combine these two patterns into one...
I'll be using the regex pattern along with Java.

An example of how you could combine your two patterns for use with the matches method:
"[^:*]*[^:*\\s][^:*]*"
[^\s] is equivalent to \S.
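As a quick check, the combined pattern can be run against the strings from the question. The class and method names below (NonBlankNoColonStar, isValid) are only illustrative.

```java
class NonBlankNoColonStar {
    // Any run of allowed characters that contains at least one non-whitespace character.
    static final String PATTERN = "[^:*]*[^:*\\s][^:*]*";

    static boolean isValid(String s) {
        // String.matches requires the whole string to match.
        return s.matches(PATTERN);
    }

    public static void main(String[] args) {
        System.out.println(isValid("hello world"));  // true
        System.out.println(isValid("hello"));        // true
        System.out.println(isValid("hello:world"));  // false
        System.out.println(isValid("hello*world"));  // false
        System.out.println(isValid(" "));            // false
    }
}
```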

You could use a negative lookahead:
^(?!\s*$)[^:*]+$
^ - start of string anchor
(?!\s*$) negative lookahead rejecting whitespace-only strings
[^:*]+ - one or more of any character except : and *
$ - end of string anchor
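In Java the backslashes must be doubled inside the string literal. A minimal sketch using a precompiled Pattern follows; the names LookaheadValidator and isValid are hypothetical.

```java
import java.util.regex.Pattern;

class LookaheadValidator {
    // The lookahead rejects empty and whitespace-only input; the class rejects ':' and '*'.
    private static final Pattern VALID = Pattern.compile("^(?!\\s*$)[^:*]+$");

    static boolean isValid(String s) {
        return VALID.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("hello world"));  // true
        System.out.println(isValid("hello:world"));  // false
        System.out.println(isValid(" "));            // false
    }
}
```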

Since matches has to match the whole string, you can use this pattern (backslashes doubled for the Java string literal):
\\s*[^\\s:*][^:*]*
Explanation
\s* Match optional whitespace chars
[^\s:*] Match a non whitespace char other than : and *
[^:*]* Match optional chars other than : and *
As \s can also match a newline, if you don't want to cross matching newlines:
\\h*[^\\s:*][^\\r\\n:*]*
Explanation
\h* Match optional horizontal whitespace chars
[^\s:*] Match a non whitespace char other than : and *
[^\r\n:*]* Match optional chars other than : and * or newlines

Java/Groovy - string: replace characters on matched regex

I am having trouble writing a regex for strings such as NotificationGroup_n+En, where n is a digit from 1 to 4. When the string contains such a digit followed by a plus sign, I want to remove both.
String BEFORE processing: NotificationGroup_4+E3
String AFTER processing: NotificationGroup_E3
That is, the n (a digit from 1 to 4) is removed and the _E with its number is kept.
My question is how to write a regex for the string replace function that matches the digit and the plus sign, leaving only the string with _En.
String string = "Notification_Group_4+E3"
println(removeChar(string))

static def removeChar(String string) {
    if (string.contains("1+") || string.contains("2+") || string.contains("3+") || string.contains("4+")) {
        // Only '4+' is actually removed here; the other digits are not handled.
        def stringReplaced = string.replace('4+', "")
        return stringReplaced
    }
}

In Groovy:
def result = "Notification_Group_4+E3".replaceFirst(/_\d\+(.*)/, '_$1')
println result
output:
~>  groovy solution.groovy
Notification_Group_E3
~>
Regex explanation:
We use Groovy slashy strings /.../ to define the regex. This makes escaping simpler.
We first match an underscore _.
Then we match on a single digit (0-9) using the predefined character class \d as described in the javadoc for the java Pattern class.
We then match for one + character. We have to escape this with a backslash \ since + without escaping in regular expressions means "one or more" (see greedy quantifiers in the javadocs) . We don't want one or more, we want just a single + character.
We then create a regex capturing group as described in the logical operators part of the java Pattern regex using the parens expression (.*). We do this so that we are not locked into the input string ending with E3. This way the input string can end in an arbitrary string and the pattern will still work. This essentially says "capture a group and include any character (that is the . in regex) any number of times (that is the * in regex)" which translates to "just capture the rest of the line, whatever it is".
Finally we replace with _$1, i.e. just underscore followed by whatever the capturing group captured. The $1 is a "back reference" to the "first captured group" as documented in, for example, the java Matcher javadocs.
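The same replacement can be sketched in plain Java, where the backslashes have to be doubled because there are no slashy strings. The class and method names (RemoveDigitPlus, strip) are hypothetical, and the sketch narrows \d to [1-4] on the assumption that only the digits 1 to 4 from the question should be removed.

```java
class RemoveDigitPlus {
    // Removes the first "_<digit 1-4>+" and keeps the underscore plus the rest of the string.
    static String strip(String s) {
        return s.replaceFirst("_[1-4]\\+(.*)", "_$1");
    }

    public static void main(String[] args) {
        System.out.println(strip("Notification_Group_4+E3")); // Notification_Group_E3
        System.out.println(strip("Notification_Group_7+E3")); // Notification_Group_7+E3 (7 is outside 1-4)
    }
}
```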

Try this regex: \d.*?\+ (it removes everything from the first digit up to and including the next +).
In Java:
String string = "Notification_Group_4+E3";
System.out.print(string.replaceAll("\\d.*?\\+", ""));
Output:
Notification_Group_E3

The simple one-liner:
String res = 'Notification_Group_4+E3'.replaceAll( /_\d+\+/, '_' )
assert 'Notification_Group_E3' == res

Java String Split using Regex with Escape Character

I have a string that needs to be split on a delimiter (:). The delimiter can be escaped by a character (say '?'), and it can be preceded by any number of escape characters. Consider the example string below:
a:b?:c??:d???????:e
Here, after the split, it should give the below list of string:
a
b?:c??
d???????:e
Basically, if the delimiter (:) is preceded by even number of escape characters, it should split. If it is preceded by odd number of escape characters, it should not split. Is there a solution to this with regex?
Any help would be greatly appreciated.
A similar question has been asked before, but its answers do not work for this use case.
Update:
The regex (?:\?.|[^:?])* splits the string correctly, but it also yields a few unwanted empty strings. If + is used instead of *, genuine empty fields are dropped as well (e.g. a::b gives only a and b).

Scenario 1: No empty matches
You may use
(?:\?.|[^:?])+
Or, following the pattern in the linked answer
(?:\?.|[^:?]++)+
Details
(?: - start of a non-capturing group
\?. - a ? (the escape char) followed by any char
| - or
[^:?] - any char but the : (your delimiter char) and ? (the escape char)
)+ - 1 or more repetitions.
In Java:
String regex = "(?:\\?.|[^:?]++)+";
In case the input contains line breaks, prepend the pattern with (?s) (like (?s)(?:\\?.|[^:?])+) or compile the pattern with Pattern.DOTALL flag.
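A minimal sketch of Scenario 1: rather than calling split, it collects every match of the pattern as one token. The class and method names (EscapedSplit, split) are hypothetical, and (?s) is included on the assumption that tokens may contain line breaks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapedSplit {
    // A token is a run of escaped chars ("?" plus any char) and chars other than ':' and '?'.
    private static final Pattern TOKEN = Pattern.compile("(?s)(?:\\?.|[^:?]++)+");

    static List<String> split(String s) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(s);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("a:b?:c??:d???????:e")); // [a, b?:c??, d???????:e]
        System.out.println(split("a::b"));                // [a, b] (empty field dropped)
    }
}
```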
Scenario 2: Empty matches included
You may add the alternative (?<=:)(?=:) to the above pattern to also match empty strings between : chars:
String s = "::a:b?:c??::d???????:e::";
Pattern pattern = Pattern.compile("(?>\\?.|[^:?])+|(?<=:)(?=:)");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
    System.out.println("'" + matcher.group() + "'");
}
Output of the Java demo:
''
'a'
'b?:c??'
''
'd???????:e'
''
Note that if you want to also match empty strings at the start/end of the string, use (?<![^:])(?![^:]) rather than (?<=:)(?=:).

Java regex shortest match

I have the following string, (a.1) (b.2) (c.3) (d.4). I want to change it to (1) (2) (3) (4). I use the following method.
str.replaceAll("\(.*[.](.*)\)","($1)"). And I only get (4). What is the correct method?
Thanks

A couple of things here. First, your escapes for the parentheses are incorrect. In Java string literals the backslash is itself an escape character, so you need to write \\( to represent \( in the regex.
I think your question is how to do non-greedy matches in regex. Use ? to specify non-greedy matching; e.g. *? means "zero or more times, but as few times as possible".
This doesn't negate other answers, but they depend on your test input being as simple as it is in your question. This gives me the correct output without changing the spirit of your original regex (that only the parentheses and dot delimiter are known to be present):
String test = "(a.1) (b.2) (c.3) (d.4)";
String replaced = test.replaceAll("\\(.*?[.](.*?)\\)", "($1)");
System.out.println(replaced); // "(1) (2) (3) (4)"

Root cause
You want to match ()-delimited substrings, but are using .* greedy dot pattern that can match any 0 or more chars (other than line break chars). The \(.*[.](.*)\) pattern will match the first ( in (a.1) (b.2) (c.3) (d.4), then .* will grab the whole string, and backtracking will start trying to accommodate text for the subsequent obligatory subpatterns. [.] will find the last . in the string, the one before the last digit, 4. Then, (.*) will again grab all the rest of the string, but since the ) is required right after, due to backtracking the last (.*) will only capture 4.
Why is lazy / reluctant .*? not a solution?
Even if you use \(.*?[.](.*?)\), if there are (xxx) like substrings inside the string, they will get matched together with expected matches, as . matches any char but line break chars.
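To make the difference concrete, here is a small sketch that runs the greedy and the lazy pattern on an input containing a (xxxx) group without a dot. The class and method names (GreedyVsLazy, greedy, lazy) are only illustrative.

```java
class GreedyVsLazy {
    // Original greedy pattern from the question, with the parentheses escaped correctly.
    static String greedy(String s) {
        return s.replaceAll("\\(.*[.](.*)\\)", "($1)");
    }

    // Lazy variant: shortest possible match, but '.' can still run across ')' and '('.
    static String lazy(String s) {
        return s.replaceAll("\\(.*?[.](.*?)\\)", "($1)");
    }

    public static void main(String[] args) {
        String input = "(a.1) (xxxx) (b.2) (c.3) (d.4)";
        System.out.println(greedy(input)); // (4)
        System.out.println(lazy(input));   // (1) (2) (3) (4)  -- "(xxxx) (b.2)" was matched as one unit
    }
}
```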
Solution
.replaceAll("\\([^()]*\\.([^()]*)\\)", "($1)")
The [^()] matches any single char except ( and ), so a match can never run across a parenthesis.
Details
\( - a ( char
[^()]* - a negated character class matching 0 or more chars other than ( and )
\. - a dot
([^()]*) - Group 1 (its value is later referred to with $1 from the replacement pattern): any 0+ chars other than ( and )
\) - a ) char.
Java demo:
// Requires: import java.util.Arrays; import java.util.List;
List<String> strs = Arrays.asList("(a.1) (b.2) (c.3) (d.4)", "(a.1) (xxxx) (b.2) (c.3) (d.4)");
for (String str : strs) {
    System.out.println("\"" + str.replaceAll("\\([^()]*\\.([^()]*)\\)", "($1)") + "\"");
}
Output:
"(1) (2) (3) (4)"
"(1) (xxxx) (2) (3) (4)"

Try this one: it matches any letter, dot (.), or double quote (") and replaces each of them with the empty string:
str.replaceAll("[a-zA-Z\\.\"]", "")
Edit:
You can also use [^\\d)(\\s] to match every character that is not a digit, whitespace, ) or ( and replace it with the empty string:
String str = "(a.1) (b.2) (c.3) (d.4)";
System.out.println(str.replaceAll("[^\\d)(\\s]",""));

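Try this:
str.replaceAll("[A-Za-z0-9]+\\.", "");
[A-Za-z0-9] matches upper-case letters, lower-case letters and digits. The dot must be written as \\. in a Java string literal. If you want to match anything before the dot (.), you can use .+ or .* in place of [A-Za-z0-9]+, bearing in mind that these are greedy.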

Java Regexp to match words only (', -, space)

What is the Java regular expression to match all words containing only:
Letters from a to z and A to Z
The ' - and space characters, which must not be at the beginning or the end.
Examples
test'test match
test' doesn't match
'test doesn't match
-test doesn't match
test- doesn't match
test-test match

You can use the following pattern: ^(?!-|'|\\s)[a-zA-Z]*(?!-|'|\\s)$
Below are the examples:
String s1 = "abc";
String s2 = " abc";
String s3 = "abc ";
System.out.println(s1.matches("^(?!-|'|\\s)[a-zA-Z]*(?!-|'|\\s)$"));
System.out.println(s2.matches("^(?!-|'|\\s)[a-zA-Z]*(?!-|'|\\s)$"));
System.out.println(s3.matches("^(?!-|'|\\s)[a-zA-Z]*(?!-|'|\\s)$"));
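As far as I can tell, the pattern above only accepts strings made entirely of letters, because [a-zA-Z]* has to span the whole input; test-test from the question is therefore rejected. The sketch below shows one possible alternative, not a definitive answer: the class name WordPatternDemo and the SKETCH pattern are my own, and the pattern assumes that the empty string should not match.

```java
class WordPatternDemo {
    // Pattern from the answer above.
    static final String ORIGINAL = "^(?!-|'|\\s)[a-zA-Z]*(?!-|'|\\s)$";

    // Hypothetical alternative: a letter, optionally followed by letters, ', space or -,
    // as long as the last character is a letter again.
    static final String SKETCH = "[a-zA-Z](?:[a-zA-Z' -]*[a-zA-Z])?";

    public static void main(String[] args) {
        System.out.println("test-test".matches(ORIGINAL)); // false
        System.out.println("test-test".matches(SKETCH));   // true
        System.out.println("test'test".matches(SKETCH));   // true
        System.out.println("test'".matches(SKETCH));       // false
        System.out.println("-test".matches(SKETCH));       // false
    }
}
```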

If you mean the space character, use: [a-zA-Z ]
This character class accepts lower-case letters (a-z), upper-case letters (A-Z) and the space character. A string containing any other character fails the test.

Here's my solution:
/(\w{2,}(-|'|\s)\w{2,})/g
You can take it for a spin on Regexr.
It first checks for a word with \w, then for any of the three separators using "or" logic with |, and then for another word. The {2,} quantifiers make sure the words on either side are at least 2 characters long, so contractions like don't aren't captured. You could set that to any value to prevent longer words from being captured, or omit the quantifiers entirely.
Caveat: \w also looks for _ underscores. If you don't want that you could replace it with [a-zA-Z] like so:
/([a-zA-Z]{2,}(-|'|\s)[a-zA-Z]{2,})/g
