String.replaceAll() with regex gets messed up

String.replaceAll() with regex gets messed up - java

I'm trying to implement the Swedish "Robbers language" in Java. It's basically just replacing each consonant with itself, followed by a "o", followed by itself again. I thought I had it working with this code
str.replaceAll("[bcdfghjklmnpqrstvwxz]+", "$0o$0");
but it fails when there are two or more subsequent consonants, for example
String str = "horse";
It should produce hohororsose, but instead I get hohorsorse. I'm guessing the replacement somehow messes up the matching indexes in the original string. How can I make it work?

str.replaceAll("[bcdfghjklmnpqrstvwxz]", "$0o$0");
Remove the + quantifier as it will group consonants.
// when using a greedy quantifier
horse
h | o | rs | e
hoh | o | rsors | e
A plus sign matches one or more of the preceding character, class, or subpattern. For example a+ matches ab and aaab. But unlike a* and
a?, the pattern a+ does not match at the beginning of strings that
lack an "a" character.
https://autohotkey.com/docs/misc/RegEx-QuickRef.htm

+ means: Between one and unlimited times, as many times as possible, giving back as needed (greedy)
+? means: Between one and unlimited times, as few times as possible, expanding as needed (lazy)
{1} means: Exactly 1 time (meaningless quantifier)
In your case you don't need a quantifier.
You can experiment with regular expressions online at https://regex101.com/

Related

How do I select data to the end of the line using RegEx? [duplicate]

What are these two terms in an understandable way?

Greedy will consume as much as possible. From http://www.regular-expressions.info/repeat.html we see the example of trying to match HTML tags with <.+>. Suppose you have the following:
<em>Hello World</em>
You may think that <.+> (. means any non newline character and + means one or more) would only match the <em> and the </em>, when in reality it will be very greedy, and go from the first < to the last >. This means it will match <em>Hello World</em> instead of what you wanted.
Making it lazy (<.+?>) will prevent this. By adding the ? after the +, we tell it to repeat as few times as possible, so the first > it comes across, is where we want to stop the matching.
I'd encourage you to download RegExr, a great tool that will help you explore Regular Expressions - I use it all the time.

'Greedy' means match longest possible string.
'Lazy' means match shortest possible string.
For example, the greedy h.+l matches 'hell' in 'hello' but the lazy h.+?l matches 'hel'.

Greedy quantifier
Lazy quantifier
Description
*
*?
Star Quantifier: 0 or more
+
+?
Plus Quantifier: 1 or more
?
??
Optional Quantifier: 0 or 1
{n}
{n}?
Quantifier: exactly n
{n,}
{n,}?
Quantifier: n or more
{n,m}
{n,m}?
Quantifier: between n and m
Add a ? to a quantifier to make it ungreedy i.e lazy.
Example:
test string : stackoverflow
greedy reg expression : s.*o output: stackoverflow
lazy reg expression : s.*?o output: stackoverflow

Greedy means your expression will match as large a group as possible, lazy means it will match the smallest group possible. For this string:
abcdefghijklmc
and this expression:
a.*c
A greedy match will match the whole string, and a lazy match will match just the first abc.

As far as I know, most regex engine is greedy by default. Add a question mark at the end of quantifier will enable lazy match.
As #Andre S mentioned in comment.
Greedy: Keep searching until condition is not satisfied.
Lazy: Stop searching once condition is satisfied.
Refer to the example below for what is greedy and what is lazy.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Test {
public static void main(String args[]){
String money = "100000000999";
String greedyRegex = "100(0*)";
Pattern pattern = Pattern.compile(greedyRegex);
Matcher matcher = pattern.matcher(money);
while(matcher.find()){
System.out.println("I'm greedy and I want " + matcher.group() + " dollars. This is the most I can get.");
}
String lazyRegex = "100(0*?)";
pattern = Pattern.compile(lazyRegex);
matcher = pattern.matcher(money);
while(matcher.find()){
System.out.println("I'm too lazy to get so much money, only " + matcher.group() + " dollars is enough for me");
}
}
}
The result is:
I'm greedy and I want 100000000 dollars. This is the most I can get.
I'm too lazy to get so much money, only 100 dollars is enough for me

Taken From www.regular-expressions.info
Greediness: Greedy quantifiers first tries to repeat the token as many times
as possible, and gradually gives up matches as the engine backtracks to find
an overall match.
Laziness: Lazy quantifier first repeats the token as few times as required, and
gradually expands the match as the engine backtracks through the regex to
find an overall match.

From Regular expression
The standard quantifiers in regular
expressions are greedy, meaning they
match as much as they can, only giving
back as necessary to match the
remainder of the regex.
By using a lazy quantifier, the
expression tries the minimal match
first.

Greedy matching. The default behavior of regular expressions is to be greedy. That means it tries to extract as much as possible until it conforms to a pattern even when a smaller part would have been syntactically sufficient.
Example:
import re
text = "<body>Regex Greedy Matching Example </body>"
re.findall('<.*>', text)
#> ['<body>Regex Greedy Matching Example </body>']
Instead of matching till the first occurrence of ‘>’, it extracted the whole string. This is the default greedy or ‘take it all’ behavior of regex.
Lazy matching, on the other hand, ‘takes as little as possible’. This can be effected by adding a ? at the end of the pattern.
Example:
re.findall('<.*?>', text)
#> ['<body>', '</body>']
If you want only the first match to be retrieved, use the search method instead.
re.search('<.*?>', text).group()
#> '<body>'
Source: Python Regex Examples

Greedy Quantifiers are like the IRS
They’ll take as much as they can. e.g. matches with this regex: .*
$50,000
Bye-bye bank balance.
See here for an example: Greedy-example
Non-greedy quantifiers - they take as little as they can
Ask for a tax refund: the IRS sudden becomes non-greedy - and return as little as possible: i.e. they use this quantifier:
(.{2,5}?)([0-9]*) against this input: $50,000
The first group is non-needy and only matches $5 – so I get a $5 refund against the $50,000 input.
See here: Non-greedy-example.
Why do we need greedy vs non-greedy?
It becomes important if you are trying to match certain parts of an expression. Sometimes you don't want to match everything - as little as possible. Sometimes you want to match as much as possible. Nothing more to it.
You can play around with the examples in the links posted above.
(Analogy used to help you remember).

Greedy means it will consume your pattern until there are none of them left and it can look no further.
Lazy will stop as soon as it will encounter the first pattern you requested.
One common example that I often encounter is \s*-\s*? of a regex ([0-9]{2}\s*-\s*?[0-9]{7})
The first \s* is classified as greedy because of * and will look as many white spaces as possible after the digits are encountered and then look for a dash character "-". Where as the second \s*? is lazy because of the present of *? which means that it will look the first white space character and stop right there.

Best shown by example. String. 192.168.1.1 and a greedy regex \b.+\b
You might think this would give you the 1st octet but is actually matches against the whole string. Why? Because the.+ is greedy and a greedy match matches every character in 192.168.1.1 until it reaches the end of the string. This is the important bit! Now it starts to backtrack one character at a time until it finds a match for the 3rd token (\b).
If the string a 4GB text file and 192.168.1.1 was at the start you could easily see how this backtracking would cause an issue.
To make a regex non greedy (lazy) put a question mark after your greedy search e.g
*?
??
+?
What happens now is token 2 (+?) finds a match, regex moves along a character and then tries the next token (\b) rather than token 2 (+?). So it creeps along gingerly.

To give extra clarification on Laziness, here is one example which is maybe not intuitive on first look but explains idea of "gradually expands the match" from Suganthan Madhavan Pillai answer.
input -> some.email#domain.com#
regex -> ^.*?#$
Regex for this input will have a match. At first glance somebody could say LAZY match(".*?#") will stop at first # after which it will check that input string ends("$"). Following this logic someone would conclude there is no match because input string doesn't end after first #.
But as you can see this is not the case, regex will go forward even though we are using non-greedy(lazy mode) search until it hits second # and have a MINIMAL match.

try to understand the following behavior:
var input = "0014.2";
Regex r1 = new Regex("\\d+.{0,1}\\d+");
Regex r2 = new Regex("\\d*.{0,1}\\d*");
Console.WriteLine(r1.Match(input).Value); // "0014.2"
Console.WriteLine(r2.Match(input).Value); // "0014.2"
input = " 0014.2";
Console.WriteLine(r1.Match(input).Value); // "0014.2"
Console.WriteLine(r2.Match(input).Value); // " 0014"
input = " 0014.2";
Console.WriteLine(r1.Match(input).Value); // "0014.2"
Console.WriteLine(r2.Match(input).Value); // ""

How to match such kind of strings using and in regex?

How to make an or and and together in Regex.
We can do this in regex (Boo)|(l30o) and list all permutations which basically beats the purpose of using regex. Here or is being used.
I want to match B in any form, O in any form twice. Something like, [(B)|(l3)][0 o O]{2}. But, in this form, it matches (0O too.
O twice matching isn't a problem.
B when trying to match with multiple character match is a problem along with single character match.
Should match:
Boo
b0o
l300
I3oO
B00
etc.
All words which look like Boo, i.e., b - {B,b,l3,I3,i3} and o - {O, o, 0};

You could try (?:[bB]|[lIi]3)[0Oo]{2}:
(?:...) is a non-capturing group
[...] is a character class, i.e. any character inside it (except - depending on the position) will be assumed to be meant literally (i.e. [iIl] matches i, L or l, while [(B)|(l3)] wouldn't do what you think it does: it matches any of (, B, ), |, l or 3).
| means "or" and matches entire sequences
{...} is a numeric quantifier (i.e. {2} means exactly twice)
You could also use (?i) at the start of your expression to make it case-insensitive, i.e. the expression would then be (?i)(?:b|[li]3)[0o]{2}.

Can you try the following
(B|b|l3|I3|i3)[0oO]{2}
You can try it online at https://regex101.com/r/gLA6N2/3

(B|b|l3|I3|i3)(O|o|0)+
() is a group
| is an or
+ is a quantifier for {1,} which means 1 or more

Why a*? is not giving me output in Java Regular Expression [duplicate]

What are these two terms in an understandable way?

Greedy will consume as much as possible. From http://www.regular-expressions.info/repeat.html we see the example of trying to match HTML tags with <.+>. Suppose you have the following:
<em>Hello World</em>
You may think that <.+> (. means any non newline character and + means one or more) would only match the <em> and the </em>, when in reality it will be very greedy, and go from the first < to the last >. This means it will match <em>Hello World</em> instead of what you wanted.
Making it lazy (<.+?>) will prevent this. By adding the ? after the +, we tell it to repeat as few times as possible, so the first > it comes across, is where we want to stop the matching.
I'd encourage you to download RegExr, a great tool that will help you explore Regular Expressions - I use it all the time.

'Greedy' means match longest possible string.
'Lazy' means match shortest possible string.
For example, the greedy h.+l matches 'hell' in 'hello' but the lazy h.+?l matches 'hel'.

Greedy quantifier
Lazy quantifier
Description
*
*?
Star Quantifier: 0 or more
+
+?
Plus Quantifier: 1 or more
?
??
Optional Quantifier: 0 or 1
{n}
{n}?
Quantifier: exactly n
{n,}
{n,}?
Quantifier: n or more
{n,m}
{n,m}?
Quantifier: between n and m
Add a ? to a quantifier to make it ungreedy i.e lazy.
Example:
test string : stackoverflow
greedy reg expression : s.*o output: stackoverflow
lazy reg expression : s.*?o output: stackoverflow

Greedy means your expression will match as large a group as possible, lazy means it will match the smallest group possible. For this string:
abcdefghijklmc
and this expression:
a.*c
A greedy match will match the whole string, and a lazy match will match just the first abc.

As far as I know, most regex engine is greedy by default. Add a question mark at the end of quantifier will enable lazy match.
As #Andre S mentioned in comment.
Greedy: Keep searching until condition is not satisfied.
Lazy: Stop searching once condition is satisfied.
Refer to the example below for what is greedy and what is lazy.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Test {
public static void main(String args[]){
String money = "100000000999";
String greedyRegex = "100(0*)";
Pattern pattern = Pattern.compile(greedyRegex);
Matcher matcher = pattern.matcher(money);
while(matcher.find()){
System.out.println("I'm greedy and I want " + matcher.group() + " dollars. This is the most I can get.");
}
String lazyRegex = "100(0*?)";
pattern = Pattern.compile(lazyRegex);
matcher = pattern.matcher(money);
while(matcher.find()){
System.out.println("I'm too lazy to get so much money, only " + matcher.group() + " dollars is enough for me");
}
}
}
The result is:
I'm greedy and I want 100000000 dollars. This is the most I can get.
I'm too lazy to get so much money, only 100 dollars is enough for me

Taken From www.regular-expressions.info
Greediness: Greedy quantifiers first tries to repeat the token as many times
as possible, and gradually gives up matches as the engine backtracks to find
an overall match.
Laziness: Lazy quantifier first repeats the token as few times as required, and
gradually expands the match as the engine backtracks through the regex to
find an overall match.

From Regular expression
The standard quantifiers in regular
expressions are greedy, meaning they
match as much as they can, only giving
back as necessary to match the
remainder of the regex.
By using a lazy quantifier, the
expression tries the minimal match
first.

Greedy matching. The default behavior of regular expressions is to be greedy. That means it tries to extract as much as possible until it conforms to a pattern even when a smaller part would have been syntactically sufficient.
Example:
import re
text = "<body>Regex Greedy Matching Example </body>"
re.findall('<.*>', text)
#> ['<body>Regex Greedy Matching Example </body>']
Instead of matching till the first occurrence of ‘>’, it extracted the whole string. This is the default greedy or ‘take it all’ behavior of regex.
Lazy matching, on the other hand, ‘takes as little as possible’. This can be effected by adding a ? at the end of the pattern.
Example:
re.findall('<.*?>', text)
#> ['<body>', '</body>']
If you want only the first match to be retrieved, use the search method instead.
re.search('<.*?>', text).group()
#> '<body>'
Source: Python Regex Examples

Greedy Quantifiers are like the IRS
They’ll take as much as they can. e.g. matches with this regex: .*
$50,000
Bye-bye bank balance.
See here for an example: Greedy-example
Non-greedy quantifiers - they take as little as they can
Ask for a tax refund: the IRS sudden becomes non-greedy - and return as little as possible: i.e. they use this quantifier:
(.{2,5}?)([0-9]*) against this input: $50,000
The first group is non-needy and only matches $5 – so I get a $5 refund against the $50,000 input.
See here: Non-greedy-example.
Why do we need greedy vs non-greedy?
It becomes important if you are trying to match certain parts of an expression. Sometimes you don't want to match everything - as little as possible. Sometimes you want to match as much as possible. Nothing more to it.
You can play around with the examples in the links posted above.
(Analogy used to help you remember).

Greedy means it will consume your pattern until there are none of them left and it can look no further.
Lazy will stop as soon as it will encounter the first pattern you requested.
One common example that I often encounter is \s*-\s*? of a regex ([0-9]{2}\s*-\s*?[0-9]{7})
The first \s* is classified as greedy because of * and will look as many white spaces as possible after the digits are encountered and then look for a dash character "-". Where as the second \s*? is lazy because of the present of *? which means that it will look the first white space character and stop right there.

Best shown by example. String. 192.168.1.1 and a greedy regex \b.+\b
You might think this would give you the 1st octet but is actually matches against the whole string. Why? Because the.+ is greedy and a greedy match matches every character in 192.168.1.1 until it reaches the end of the string. This is the important bit! Now it starts to backtrack one character at a time until it finds a match for the 3rd token (\b).
If the string a 4GB text file and 192.168.1.1 was at the start you could easily see how this backtracking would cause an issue.
To make a regex non greedy (lazy) put a question mark after your greedy search e.g
*?
??
+?
What happens now is token 2 (+?) finds a match, regex moves along a character and then tries the next token (\b) rather than token 2 (+?). So it creeps along gingerly.

To give extra clarification on Laziness, here is one example which is maybe not intuitive on first look but explains idea of "gradually expands the match" from Suganthan Madhavan Pillai answer.
input -> some.email#domain.com#
regex -> ^.*?#$
Regex for this input will have a match. At first glance somebody could say LAZY match(".*?#") will stop at first # after which it will check that input string ends("$"). Following this logic someone would conclude there is no match because input string doesn't end after first #.
But as you can see this is not the case, regex will go forward even though we are using non-greedy(lazy mode) search until it hits second # and have a MINIMAL match.

try to understand the following behavior:
var input = "0014.2";
Regex r1 = new Regex("\\d+.{0,1}\\d+");
Regex r2 = new Regex("\\d*.{0,1}\\d*");
Console.WriteLine(r1.Match(input).Value); // "0014.2"
Console.WriteLine(r2.Match(input).Value); // "0014.2"
input = " 0014.2";
Console.WriteLine(r1.Match(input).Value); // "0014.2"
Console.WriteLine(r2.Match(input).Value); // " 0014"
input = " 0014.2";
Console.WriteLine(r1.Match(input).Value); // "0014.2"
Console.WriteLine(r2.Match(input).Value); // ""

Replace symbols in string [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 7 years ago.
Improve this question
How to replace ^ into calling Math.pow() ?
For example:
str = "10 + 5.2^12"; // -> "10 + Math.pow(5.2, 12)"
str = "2^(12) + 6"; // -> "Math.pow(2, 12) + 6"

You can do it like this:
str = str.replaceAll("\\(?(\\d+\\.?\\d*)\\)?\\^\\(?(\\d+\\.?\\d*)\\)?", "Math.pow($1,$2)");
In this case, you are looking for 2 groups of digits (\\d+\\.?\\d*), which could be a float value and can be inside of () \\(? and \\)?. Between this 2 groups, you need to have the ^ sign \\^. If it matches, then replaceAll method replaces all this pattern with Math.pow($1,$2), where $1 and $2 will be replaced with first and second groups of digits.
But one thing, it could lead to wrong results if you have a complicated expression, with 2 multiplications in a row, like a 10.22^31.22^5. In this case, this regular expression should be much more complicated. And may be you should use some other algorithm to parse such expressions.

You can do this with regular expression in java.
Use the below code snippet to devide and replace the string.
private static String replaceMatchPow(String pValue) {
String regEx = "(\\d*\\.*\\d*)\\^\\(*(\\d*\\.*\\d*)\\)*";
Pattern pattern = Pattern.compile(regEx);
Matcher m = pattern.matcher(pValue);
if (m.find()) {
String value = "Math.pow(" + m.group(1) + "," + m.group(2) + ")";
pValue = pValue.replaceAll(regEx, value);
}
return pValue;
}
How the regex will work?
Here is the example for the input "2^(12) + 6"
1st group (\d*\.\d)
\d* -> match a digit [0-9],
* -> zero to unlimited
\^ -> divide the groups by ^ symbol
(* -> left bracket, * means occurrence zero to unlimited
2nd group (\d*\.\d)
)* -> right bracket
Result: 1st group values is 2 and 2nd group values is 12
/(\d*.*\d*)\^(*(\d*.*\d*))*/
1st Capturing group (\d*.*\d*)
\d* match a digit [0-9]
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
.* matches the character . literally
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
\d* match a digit [0-9]
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
\^ matches the character ^ literally
(* matches the character ( literally
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
2nd Capturing group (\d*.*\d*)
\d* match a digit [0-9]
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
.* matches the character . literally
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
\d* match a digit [0-9]
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
)* matches the character ) literally
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]

For your special problem you should lookup how the caret character ^ is escaped in Java regular expressions, it should be \^.
Note that:
Backslashes within string literals in Java source code are interpreted
as required by The Java™ Language Specification as either Unicode
escapes (section 3.3) or other character escapes (section 3.10.6) It
is therefore necessary to double backslashes in string literals that
represent regular expressions to protect them from interpretation by
the Java bytecode compiler.
So you end up with "\\^" as a Java String.
However in general you will write a parser and interpreter for arithmetical expressions.
This subject is known as compiler construction and relies on knowledge of finite automata and formal languages.
A nice example can be found in chapter 10.2 "A Desk Calculator" of "The C++ Programming Language, Fourth Edition" by Bjarne Stroustrup. (Yes, I know you want to code in Java, please continue to read).
It features this formal grammar:
program:
end //end is end-of-input
expr_list end
expr_list:
expression print //pr intis newline or semicolon
expression print expr_list
expression:
expression + term
expression − term
term
term:
term / primary
term ∗ primary
primary
primary:
number //number is a ﬂoating-point literal
name //name is an identiﬁer
name = expression
−primary
( expression )
Stroustrup shows you how to code this in a similar structure:
double expr(bool get) //add and subtract
{
double left = term(get);
for(;;) { //‘‘forever’’
switch(ts.current().kind) {
case Kind::plus:
left += term(true);
break;
case Kind::minus:
left −= term(true);
break;
default:
return left;
}
}
}
this code embodies this part of the grammar:
expression:
expression + term
expression − term
term
This technique is called recursive descent, because recursion in the grammar will be implemented as a recursive function call in the corresponding code. The featured relevant C++ code should be understandable to a Java developer as well.
Once you got a bit familiar with grammars and how to implement them,
see e.g. Unambiguous grammar for exponentiation operation
how to deal with exponentiation in such a way, that the usual operator precedence rules are fulfilled.

Regex ([mb|kb|gb|b|bytes]) does not match 'b' in 'kb' or 'gb' without a + after the braces

I am writing a regular expression that can capture a value and any of mb, kb, gb, bytes that comes after it
The Regex is:
(?<sizevalue>\p{N}+)(?:\s*)(?<sizetype>[mb|kb|gb|b|bytes])
But when given an input "4096 mb", group sizetype matches only 'm' and not 'b'. adding a '+' quantifier after the braces gives the output of grop sizetype as 'mb'. The pattern was compiled with CASE_INSENSITIVE so that was not the issue.
This works
(?<sizevalue>\p{N}+)(?:\s*)(?<sizetype>[mb|kb|gb|b|bytes]+)
Ideally shouldn't the first regex match 'mb' completely ?

You need to use capturing or non-capturing group instead of a character class.
[mb|kb|gb|b|bytes] matches only a single charcater from the given list, ie, it may match an m or b or | or k or b, etc. It won't consider mb as a single word and | operator inside the character class will looses it's special meaning and matches only a literal | symbol. It won't do an OR operation.
(?<sizevalue>\p{N}+)(?:\s*)(?<sizetype>(?:mb|kb|gb|b|bytes)\b)
DEMO
Pattern p = Pattern.compile("(?<sizevalue>\\p{N}+)(?:\\s*)(?<sizetype>(?:mb|kb|gb|b|bytes)\\b)");

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

String.replaceAll() with regex gets messed up - java

Related

How do I select data to the end of the line using RegEx? [duplicate]

How to match such kind of strings using and in regex?

Why a*? is not giving me output in Java Regular Expression [duplicate]

Replace symbols in string [closed]

Regex ([mb|kb|gb|b|bytes]) does not match 'b' in 'kb' or 'gb' without a + after the braces

Categories

Resources