Regex Example Confusion - java

I am preparing for the Oracle Certified Java Programmer exam and am looking into regular expressions. I was going through the JavaRanch regular expression material and I am not able to understand the regular expression in the example. Please help me understand it. I am adding the source code here for reference. Thanks.
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Test
{
    static Map<String, String> props = new HashMap<>();
    static
    {
        props.put("key1", "fox");
        props.put("key2", "dog");
    }
    public static void main(String[] args)
    {
        String input = "The quick brown ${key1} jumps over the lazy ${key2}.";
        Pattern p = Pattern.compile("\\$\\{([^}]+)\\}");
        Matcher m = p.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find())
        {
            // Append the text before the match, then the looked-up value
            m.appendReplacement(sb, "");
            sb.append(props.get(m.group(1)));
        }
        m.appendTail(sb);
        System.out.println(sb.toString());
    }
}

An illustration of your regex:
\$\{([^}]+)\}

Here is a very good tutorial on regular expressions you might want to check out. The article on quantifiers has two sections "Laziness instead of Greediness" and "An Alternative to Laziness", that should explain this particular example really well.
Anyway, here is my explanation. First, you need to realize that there are two compilation steps in Java. The first compiles the string literal in your code into an actual string. This step already interprets some of the backslashes (each \\ becomes a single \). The second step is Pattern.compile, which turns that string into a regex, so the pattern the regex engine receives looks like
\$\{([^}]+)\}
Now let's pick that apart in free-spacing mode:
\$ # match a literal $
\{ # match a literal {
( # start capturing group 1
[^}] # match any single character except } - note the negation by ^
+ # repeat one or more times
) # end of capturing group 1
\} # match a literal }
So this really matches all occurrences of ${...}, where ... can be anything except closing }. The contents of the braces (i.e. the ...) can later be accessed via m.group(1), as it's the first set of parentheses in the expression.
Here are some more relevant articles of the above tutorial (but you should really read it in its entirety - it's definitely worth it):
Character classes (including how to negate them with ^)
Repetition/quantifiers
Grouping and capturing
Java's regex peculiarities
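The walkthrough above can be checked with a small sketch. The class name PlaceholderNames and its method names are made up for illustration; only Pattern and Matcher come from the JDK. Printing the literal shows what the regex engine receives after the first compilation step, and group(1) returns the text between the braces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PlaceholderNames {
    // Java source literal; once the compiler has processed the string
    // escapes, the regex engine receives \$\{([^}]+)\}
    static final String REGEX = "\\$\\{([^}]+)\\}";

    // Collects whatever group 1 captured for each ${...} occurrence.
    static List<String> names(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(REGEX).matcher(input);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(REGEX); // prints \$\{([^}]+)\}
        // prints [key1, key2]
        System.out.println(names("The quick brown ${key1} jumps over the lazy ${key2}."));
    }
}
```

Because [^}]+ needs at least one character, an empty ${} is not matched by this pattern.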

\\$: matches a literal dollar sign. Without the backslashes, it matches the end of a string.
\\{: matches a literal opening curly brace.
(: start of a capturing group
[^}]: matches any character that isn't a closing curly brace.
+: repeats the preceding character class, so it will match one or more characters that aren't closing curly braces.
): closing capturing group.
\\}: matches a literal closing curly brace.
It matches stuff that looks like ${key1}.

Explanation:
\\$ literal $ (must be escaped since it is a special character that
means "end of the string")
\\{ literal { (I'm not sure this must be escaped, but escaping it does no harm)
( open the capture group 1
[^}]+ character class containing all chars but }, repeated one or more times
) close the capture group 1
\\} literal }
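Putting the pieces together, the substitution loop from the question could be wrapped in a reusable method. This is only a sketch: the class TemplateFill, the method fill, and the choice to leave unknown keys untouched are assumptions made here, not part of the original example. Matcher.quoteReplacement is used so that a $ or \ inside a value is not interpreted as a group reference by appendReplacement.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TemplateFill {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    static String fill(String template, Map<String, String> props) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String key = m.group(1);
            // Assumption of this sketch: an unknown key keeps its ${...} text.
            String value = props.getOrDefault(key, m.group());
            // quoteReplacement keeps any $ or \ in the value literal.
            m.appendReplacement(sb, Matcher.quoteReplacement(value));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = "The quick brown ${key1} jumps over the lazy ${key2}.";
        // prints: The quick brown fox jumps over the lazy dog.
        System.out.println(fill(input, Map.of("key1", "fox", "key2", "dog")));
    }
}
```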

Related

java regular expression and replace all occurrences

I want to replace one string in a big string, but my regular expression is not proper I guess. So it's not working.
Main string is
Some sql part which is to be replaced
cond = emp.EMAIL_ID = 'xx#xx.com' AND
emp.PERMANENT_ADDR LIKE('%98n%')
AND hemp.EMPLOYEE_NAME = 'xxx' and is_active='Y'
String to find and replace is
Based on some condition sql part to be replaced
hemp.EMPLOYEE_NAME = 'xxx'
I have tried this with
Pattern and Matcher class is used and
Pattern pat1 = Pattern.compile("/^hemp.EMPLOYEE_NAME\\s=\\s\'\\w\'\\s[and|or]*/$", Pattern.CASE_INSENSITIVE);
Matcher mat = pat1.matcher(cond);
while (mat.find()) {
System.out.println("Match: " + mat.group());
cond = mat.replaceFirst("xx "+mat.group()+"x");
mat = pat1.matcher(cond);
}
It's not working, not entering the loop at all. Any help is appreciated.
Obviously not - your regexp pattern doesn't make any sense.
The opening /: In some languages, regexps aren't strings and start with an opening slash. Java is not one of those languages, and it has nothing to do with regexps itself. So, this looks for a literal slash in that SQL, which isn't there, thus, failure.
^ is regexpese for 'start of string'. Your string does not start with hemp.EMPLOYEE_NAME, so that also doesn't work. Get rid of both / and ^ here.
\\s is one whitespace character (there are many whitespace characters - this matches any one of them, exactly one though). Your string happens to have exactly one space on each side of the =, but SQL is rarely formatted that consistently. Your intent, surely, was \\s* which matches 0 to many of them, i.e.: \\s* is: "Whitespace is allowed here". \\s is: There must be exactly one whitespace character here. Make all the \\s in your regexp an \\s*.
\\w is exactly one 'word' character (which is more or less a letter or digit). Your value 'xxx' has three of them, so you obviously wanted \\w*.
[and|or] this is regexpese for: "An a, or an n, or a d, or an o, or an r, or a pipe symbol". Clearly you were looking for (and|or) which is regexpese for: Either the sequence "and", or the sequence "or".
* - so you want 0 to many 'and' or 'or', which makes no sense.
closing slash: You don't want this.
closing $: You don't want this - it means 'end of string'. Your string didn't end here.
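The difference between the character class and the group is easy to confirm. In the sketch below, the class and method names are hypothetical; Pattern.matches tests whether the whole input matches the pattern.

```java
import java.util.regex.Pattern;

public class ClassVsGroup {
    // [and|or] is a set of single characters: a, n, d, |, o, r
    static boolean charClass(String s) {
        return Pattern.matches("[and|or]", s);
    }

    // (and|or) is an alternation between two whole sequences
    static boolean alternation(String s) {
        return Pattern.matches("(and|or)", s);
    }

    public static void main(String[] args) {
        System.out.println(charClass("a"));     // true
        System.out.println(charClass("|"));     // true
        System.out.println(charClass("and"));   // false: three characters
        System.out.println(alternation("and")); // true
        System.out.println(alternation("a"));   // false
    }
}
```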
The code itself:
replaceFirst, itself, also does regexps. You don't want to double apply this stuff. That's not how you replace a found result.
This is what you wanted:
Matcher mat = pat1.matcher(cond);
cond = mat.replaceFirst("replacement goes here"); // strings are immutable, so keep the result
where replacement can include references to groups in the match if you want to take parts of what you matched (i.e. don't use mat.group(), use those references).
More generally did you read any regexp tutorial, did any testing, or did any reading of the javadoc of Pattern and Matcher?
I've been developing for a few years. It's just personal experience, perhaps, but, reading is pretty fundamental.
Instead of the anchors ^ and $, you can use word boundaries \b to prevent a partial match.
If you want to match spaces on the same line, you can use \h to match horizontal whitespace char, as \s can also match a newline.
You can use replaceFirst on the string using $0 to get the full match, and an inline modifier (?i) for a case insensitive match.
Note that [and|or] is a character class matching one of the listed chars; use a group such as (?:and|or) for alternation. Also escape the dot to match it literally, or else . matches any char except a newline.
(?i)\bhemp\.EMPLOYEE_NAME\h*=\h*'\w+'\h+(?:and|or)\b
See a regex demo or a Java demo
For example
String regex = "(?i)\\bhemp\\.EMPLOYEE_NAME\\h*=\\h*'\\w+'\\h+(?:and|or)\\b";
String string = "cond = emp.EMAIL_ID = 'xx#xx.com' AND\n"
+ "emp.PERMANENT_ADDR LIKE('%98n%') \n"
+ "AND hemp.EMPLOYEE_NAME = 'xxx' and is_active='Y'";
System.out.println(string.replaceFirst(regex, "xx$0x"));
Output
cond = emp.EMAIL_ID = 'xx#xx.com' AND
emp.PERMANENT_ADDR LIKE('%98n%')
AND xxhemp.EMPLOYEE_NAME = 'xxx' andx is_active='Y'
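The point about \h versus \s can be illustrated with a small sketch (class and method names are made up here; \h is available from Java 8 onward). Both patterns accept a space or a tab between the two words, but only \s also accepts a line break.

```java
import java.util.regex.Pattern;

public class HorizontalSpace {
    // \s+ allows any whitespace, including newlines
    static boolean joinedByAnyWhitespace(String s) {
        return Pattern.compile("AND\\s+hemp").matcher(s).find();
    }

    // \h+ allows only horizontal whitespace such as spaces and tabs
    static boolean joinedOnSameLine(String s) {
        return Pattern.compile("AND\\h+hemp").matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(joinedByAnyWhitespace("AND\nhemp")); // true
        System.out.println(joinedOnSameLine("AND\nhemp"));      // false
        System.out.println(joinedOnSameLine("AND \themp"));     // true
    }
}
```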

String to HTML paragraphs in Java with Regex [duplicate]

I had to match a digit followed by itself 14 times. I then came up with the following regular expression in the regexstor.net/tester:
(\d)\1{14}
Edit
When I pasted it into my code, I escaped the backslashes properly:
"(\\d)\\1{14}"
I then replaced the back-reference "\1" with "$1", which is what Java uses when replacing matches.
Then I realized that it doesn't work. In Java, when you need to back-reference a match inside the regex you have to use "\N", but when you want to refer to it in a replacement, the syntax is "$N".
My question is: why?
$1 is not a back reference in Java's regexes, nor in any other flavor I can think of. You only use $1 when you are replacing something:
String input="A12.3 bla bla my input";
input = StringUtils.replacePattern(
input, "^([A-Z]\\d{2}\\.\\d).*$", "$1");
// ^^^^
There is some misinformation about what a back reference is, including the very place I got that snippet from: simple java regex with backreference does not work.
Java modeled its regex syntax after other existing flavors where the $ was already a meta character. It anchors to the end of the string (or line in multi-line mode).
Similarly, Java uses \1 for back references. Because the regex is written as a Java string literal, the backslash must be escaped: \\1.
From a lexical/syntactic standpoint it is true that $1 could be used unambiguously (as a bonus it would prevent the need for the "evil escaped escape" when using back references).
To match a 1 that comes after the end of a line the regex would need to be $\n1:
this line
1
It just makes more sense to use a familiar syntax instead of changing the rules, most of which came from Perl.
The first version of Perl came out in 1987, which is much earlier than Java, which was released in beta in 1995.
I dug up the man pages for Perl 1, which say:
The bracketing construct (\ ...\ ) may also be used, in which case \<digit> matches the digit'th substring. (Outside of the pattern, always use $ instead of \ in front of the digit. The scope of $<digit> (and $\`, $& and $') extends to the end of the enclosing BLOCK or eval string, or to the next pattern match with subexpressions. The \<digit> notation sometimes works outside the current pattern, but should not be relied upon.) You may have as many parentheses as you wish. If you have more than 9 substrings, the variables $10, $11, ... refer to the corresponding substring. Within the pattern, \10, \11, etc. refer back to substrings if there have been at least that many left parens before the backreference. Otherwise (for backward compatibilty) \10 is the same as \010, a backspace, and \11 the same as \011, a tab. And so on. (\1 through \9 are always backreferences.)
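The split between \N inside the pattern and $N in the replacement string can be seen side by side in a short sketch. The class and method names are invented for illustration; String.replaceAll and Matcher.find are standard JDK calls.

```java
import java.util.regex.Pattern;

public class BackrefVsReplacement {
    // Inside the pattern, \1 refers back to what group 1 matched.
    static boolean hasRunOf15(String s) {
        return Pattern.compile("(\\d)\\1{14}").matcher(s).find();
    }

    // In the replacement string, $1 inserts what group 1 matched.
    static String collapseRepeats(String s) {
        return s.replaceAll("(\\w)\\1+", "$1");
    }

    public static void main(String[] args) {
        System.out.println(hasRunOf15("555555555555555")); // true
        System.out.println(hasRunOf15("123456789012345")); // false
        System.out.println(collapseRepeats("aaabbbc"));    // abc
    }
}
```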
I think the main problem is not the backreference, which works perfectly fine with \1 in Java.
Your problem is more likely the overall escaping of a regex pattern in Java.
If you want to have the pattern
(\d)\1{14}
passed to the regex engine, you first need to escape it, because it's a Java string when you write it:
(\\d)\\1{14}
Voila, works like a charm: goo.gl/BNCx7B (add http://, SO does not allow Url-Shorteners, but tutorialspoint.com has no other option as it seems)
Offline-Example:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class HelloWorld{
public static void main(String []args){
String test = "555555555555555"; // 5 followed by 5 for 14 times.
String pattern = "(\\d)\\1{14}";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(test);
if (m.find( )) {
System.out.println("Matched!");
}else{
System.out.println("not matched :-(");
}
}
}

Matching regex groups with Java

I am trying to split a line with regex by using groups, but it's not working as I expected.
I want to match for example this line:
Ex. #1: temp name(this is the data)
and also this:
Ex. #2: temp name()
I used this regex:
[\s]*temp[\s]+[\s]*([A-Za-z]+)[\s]*[(]\s*(.*)+[)]\s*[{]\s*
which means: grab anything that starts with temp then put in group #1 the "name" then grab whatever inside the bracket and put it in group #2.
However, group #2 is always empty.
This is my code to fetch the data:
Pattern PATTERN = Pattern.compile("[\\s]*temp[\\s]+[\\s]*([A-Za-z]+)[\\s]*[(]\\s*(.*)+[)]\\s*");
Matcher m = PATTERN.matcher("temp name(this is the data)");
m.matches();
String name = m.group(1);
String data = m.group(2); // always empty
What am I doing wrong?
Your pattern doesn't match because it requires an open curly brace at the end, but your input doesn't have one.
Ignoring that small problem, the main problem is the little + after your capture group (.*)+. The plus requires one or more matches of .* and the group returned is the last match of the many. The term .* is greedy, so it consumes everything up to the bracket. The only way to match again is to consume nothing. So the last match of group 2 is blank.
To fix it, remove the + after group 2:
Pattern PATTERN = Pattern.compile("\\s*temp\\s+([A-Za-z]+)\\s*[(]\\s*(.*)[)]\\s*");
Note also how I removed other unnecessary characters from your regex, e.g. the single-character character classes: [\\s] is identical to \\s. And \\s+\\s* is identical to just \\s+, because + is greedy and leaves nothing for the \\s* to match.
I also removed the trailing curly bracket, which you can restore if your input data actually has it (your question showed input of "temp name(this is the data)", which has no trailing curly bracket).
Your regex should be this:
Pattern pattern = Pattern.compile("\\s*temp\\s+([A-Za-z]+)\\s*[(]\\s*(.*)[)]\\s*");
You had (.*)+, which means one or more matches of .*. The last repetition matches the empty string, so that is what the group ends up holding.
Testing:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Example {
public static void main(String[] args) {
Pattern pattern = Pattern.compile("\\s*temp\\s+([A-Za-z]+)\\s*[(]\\s*(.*)[)]\\s*");
Matcher m = pattern.matcher("temp name(this is the data)");
if(m.matches()) {
System.out.println(m.group(1));
System.out.println(m.group(2));
}
}
}
Output:
name
this is the data
[\s] is equivalent to \s
[\s]+[\s]* is equivalent to \s+
[(] is equivalent to \( (same for [)] and [{])
This would leave your regexp as:
\s*temp\s+([A-Za-z]+)\s*\(\s*(.*)+\)\s*\{\s*
Assuming you actually want to match temp name(...) { (your regexp is looking for a {, while in your question you do not specify that):
(.*)+ is your problem. You're saying: "Match any number (including 0) of characters and put them in a capture group, repeat that at least once".
Regexps are greedy by default (= they consume as much as possible), so the capture group will first contain everything within the two brackets, then the + will try to match the entire group again, and will match it with "" (the empty string) as this fulfils the capture group's pattern. This will leave your capture group empty.
What you want instead is \s*temp\s+([A-Za-z]+)\s*\(\s*(.*)\)\s*\{\s*
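The effect of the extra + can be reproduced with the input from the question. The sketch below uses a hypothetical helper, group2, and leaves out the trailing { so that the sample line matches; apart from that, the two patterns differ only in the + after the second group.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RepeatedGroup {
    // Returns the text held by group 2 after a full match, or null if no match.
    static String group2(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String input = "temp name(this is the data)";
        String repeated = "\\s*temp\\s+([A-Za-z]+)\\s*\\(\\s*(.*)+\\)\\s*";
        String single = "\\s*temp\\s+([A-Za-z]+)\\s*\\(\\s*(.*)\\)\\s*";

        // The group keeps only its last repetition, which is empty here.
        System.out.println("[" + group2(repeated, input) + "]"); // []
        System.out.println("[" + group2(single, input) + "]");   // [this is the data]
    }
}
```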
The reason you are getting empty groups is because you are creating multiple capture groups every time you put something between (), even if it is nested.
To make a group that doesn't capture, you can designate it as a non-capturing group by using ?:. For example, (?:sometest(this is the value we want)) will return just one group while (sometest(this is the value we want)) will return 2 groups.
For your particular regex, I have refined and simplified it, as you had capture groups you did not need.
Simple solution:
\\s*temp\\s+([A-Za-z]+)\\s*\\(\\s*(.*)\\)\\s*\\{\\s*
given the input:
Ex. #1: temp name(this is the data) {
Ex. #2: temp name() {
$1 = name, $2 = data
Pay attention to the fact that your regex contains a trailing curly brace. You can modify the regex to match without it and it will result in this:
\\s*temp\\s+([A-Za-z]+)\\s*\\(\\s*(.*)\\)\\s*
https://regex101.com/r/tD0tO0/1

Regular expression in java

I know it's a simple problem but I'm stuck on it: I want to retrieve all strings written in this form:
$F{ETIQX}
Where X is a number. I wrote this regular expression but I'm getting errors:
if (textField.getText().matches("$F{ETIQ\d}")){
System.out.println("matches!!");
}
Any help will be appreciated.
i want to retrieve all strings
Then you shouldn't be using .matches() in the first place, but a Matcher and .find(). .matches() is a misnomer: it will succeed only if the whole input matches the regex (in contradiction with the definition of regex matching, which can occur anywhere in the input).
Also, your regex should be:
"\\$F\\{ETIQ\\d\\}"
(you need to escape backslashes in a Java string)
$, { and } are regex metacharacters; the first is an anchor matching the end of input, the two latter are bounds for a repetition quantifier.
Your code should read:
private static final Pattern PATTERN = Pattern.compile("\\$F\\{ETIQ\\d\\}");
// ...
final Matcher m = PATTERN.matcher(textField.getText());
while (m.find()) {
    // work with m.group()
}
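For a version that runs without a text field, here is a minimal sketch. The class EtiqFinder and its methods are hypothetical, and a plain String stands in for textField.getText(). It also shows why .matches() is the wrong tool when the pattern is only part of the input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EtiqFinder {
    private static final Pattern PATTERN = Pattern.compile("\\$F\\{ETIQ\\d\\}");

    // find() scans the input and reports every occurrence.
    static List<String> findAll(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = PATTERN.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    // matches() succeeds only if the entire input is one occurrence.
    static boolean isExactly(String text) {
        return PATTERN.matcher(text).matches();
    }

    public static void main(String[] args) {
        String text = "first $F{ETIQ1}, then $F{ETIQ2}";
        System.out.println(findAll(text));          // [$F{ETIQ1}, $F{ETIQ2}]
        System.out.println(isExactly(text));        // false
        System.out.println(isExactly("$F{ETIQ7}")); // true
    }
}
```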
\$F\{ETIQ\d\}
Escape the characters that have a special meaning in regex:
$ means end of string
{ means start of a quantifier
} means end of a quantifier
To match these literally, you must escape them.
here is a demo http://regex101.com/r/xT4mR6
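In Java source, \ is the escape character of string literals, and sequences such as \$, \{ and \d are not valid string escapes, so the compiler reports an error. Each \ must therefore itself be escaped as \\.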

Pattern.compile("(.*?):")

I'm trying to understand the following code:
Pattern.compile("(.*?):")
I already did some research about what it could mean,
but I don't quite get it:
According to the java docs the * would mean 0 or more times,
while ? means once or not at all.
Also, what does the ':' mean?
Thanks
This is called a reluctant quantifier. An asterisk and a question mark *? together mean "zero or more times, without matching more characters from the input than is needed". This is what prevents the dot . expression from matching the subsequent colon : in the input.
A better expression to match the same sequence is [^:]*:, because it lets you avoid backtracking. Here is a link to an article explaining why.
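To compare the alternatives directly, the sketch below runs three patterns against the same input. The helper firstField is a made-up name, and a capture group has been added around [^:]* so the result can be read the same way as for the other two patterns.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UpToColon {
    // Returns group 1 of the first match, or null if nothing matches.
    static String firstField(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String str = "Hello: This is a Test:";
        System.out.println(firstField("(.*?):", str));   // Hello
        System.out.println(firstField("([^:]*):", str)); // Hello
        System.out.println(firstField("(.*):", str));    // Hello: This is a Test
    }
}
```

One difference to keep in mind: by default . does not match line terminators, while [^:] does, so the two patterns can disagree on multi-line input.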
The ? after greedy operators such as + or * will make the operator non greedy. Without the ?, that regex will keep matching all the characters it finds, including the :.
As it is, the regex will match any string that comes before the colon (:). In this case, the colon is not a special character. Whatever comes before the colon will be put into a group, which can be accessed later through a Matcher object.
This code snippet will hopefully make things more clear:
String str = "Hello: This is a Test:";
Pattern p1 = Pattern.compile("(.*?):");
Pattern p2 = Pattern.compile("(.*):");
Matcher m1 = p1.matcher(str);
if (m1.find())
{
System.out.println(m1.group(1));
}
Matcher m2 = p2.matcher(str);
if (m2.find())
{
System.out.println(m2.group(1));
}
Yields:
Hello
Hello: This is a Test
This regular expression means anything ending with :, or it could be understood as anything up to the first :.
Here ':' has no special meaning; it is matched literally, so any string of the form anystring: will match this pattern.
I think the '?' is redundant and will be applied on '.*'.
':' has no special meaning whatsoever in regexps and will be matched to the characters in the string.
EDIT: dasblinkenlight is right: if greedy, the regexp will try to match as much as it can, and he is right in his suggestion as well.
I found a link which lists greedy vs reluctant: What is the difference between `Greedy` and `Reluctant` regular expression quantifiers?
