Why does this pattern not match? ([\\\\A\\\\W]its[\\\\W\\\\z]) - java

I'm trying to do a replace with this pattern, so I need to match this:
String pattern = "[\\\\A\\\\W]its[\\\\W\\\\z]";
The way I'm interpreting my pattern is: either a beginning of the string OR a non word character like a space or comma, then an "its", then a non word character OR the end of the string.
Why doesn't it match on this "its" inside this string?
its about time
The idea is to detect incorrectly written words like "its" and fix them to "it's".
Also why do I need so many escape characters in order for the pattern to be accepted by the vm at all?

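\\A and \\z are boundary matchers. They cannot go inside character classes. With four backslashes, each \\\\ in the string literal reaches the regex engine as \\, i.e. a literal backslash, so [\\\\A\\\\W] is just a character class matching a backslash, A, or W. If you wrote them properly, i.e. with two backslashes instead of four, the regex compiler would throw an exception, because \A and \z cannot go inside [] blocks.
Use plain alternation (|) with non-capturing groups instead: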
String pattern = "(?:\\A|\\W)its(?:\\W|\\z)";
Demo.
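As a rough illustration of the fix above, here is a small sketch. The class name ItsFixer and its helper methods are made up for this example. It uses capturing groups instead of non-capturing ones (my own variation) so that the characters around "its" can be put back with $1 and $2 during the replacement.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class ItsFixer {
    // Capturing variant of the pattern above: group 1 is the start of input or
    // a non-word character, group 2 is a non-word character or the end of input.
    private static final Pattern ITS = Pattern.compile("(\\A|\\W)its(\\W|\\z)");

    /** Replaces standalone "its" with "it's", keeping the characters around it. */
    static String fix(String text) {
        return ITS.matcher(text).replaceAll("$1it's$2");
    }

    /** Returns true if the regex compiles, false on a PatternSyntaxException. */
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(fix("its about time"));  // -> it's about time
        System.out.println(fix("a few bits"));      // unchanged, "its" is part of a word
        // Two backslashes inside the character class: rejected by the compiler.
        System.out.println(compiles("[\\A\\W]its[\\W\\z]"));
    }
}
```

Because each match consumes the neighbouring characters, two occurrences separated by a single character (such as "its its") would only have the first one replaced. Word boundaries (\\bits\\b) avoid that if it matters for your input.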

Related

Regex to check if String is one word in Java

I need regex to check if String has only one word (e.g. "This", "Country", "Boston ", " Programming ").
So far I used an alternative way of doing it which is to check if String contains spaces. However, I am sure that this can be done using regex.
One possible way in my opinion is "^\w{2,}\s". Does this work properly? Are there any other possible answers?
The pattern ^\w{2,}\s matches 2 or more word characters from the start of the string, followed by a mandatory whitespace char (which can also match a newline).
As the pattern is not anchored at the end, it can also match Boston followed by the space in Boston test.
To match a single word of at least 2 characters surrounded by optional horizontal whitespace, use \h* on both sides and add the anchor $ to assert the end of the string.
^\h*\w{2,}\h*$
In Java
String regex = "^\\h*\\w{2,}\\h*$";
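To see how that behaves on the sample inputs from the question, a minimal sketch could look like this. The class and method names (OneWordCheck, isOneWord) are hypothetical, and it assumes Java 8 or later, where \h is supported.

```java
import java.util.regex.Pattern;

final class OneWordCheck {
    // At least two word characters, optionally surrounded by horizontal whitespace.
    private static final Pattern ONE_WORD = Pattern.compile("^\\h*\\w{2,}\\h*$");

    static boolean isOneWord(String s) {
        return ONE_WORD.matcher(s).find();
    }

    public static void main(String[] args) {
        String[] samples = {"This", "Boston ", " Programming ", "Boston test"};
        for (String s : samples) {
            System.out.println("[" + s + "] -> " + isOneWord(s));
        }
    }
}
```

Only "Boston test" should be rejected here, because the second word prevents $ from matching after the optional trailing whitespace.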

java regular expression and replace all occurrences

I want to replace one string in a big string, but I guess my regular expression is not correct, so it's not working.
Main string is
Some sql part which is to be replaced
cond = emp.EMAIL_ID = 'xx#xx.com' AND
emp.PERMANENT_ADDR LIKE('%98n%')
AND hemp.EMPLOYEE_NAME = 'xxx' and is_active='Y'
String to find and replace is
Based on some condition sql part to be replaced
hemp.EMPLOYEE_NAME = 'xxx'
I have tried this with the Pattern and Matcher classes:
Pattern pat1 = Pattern.compile("/^hemp.EMPLOYEE_NAME\\s=\\s\'\\w\'\\s[and|or]*/$", Pattern.CASE_INSENSITIVE);
Matcher mat = pat1.matcher(cond);
while (mat.find()) {
System.out.println("Match: " + mat.group());
cond = mat.replaceFirst("xx "+mat.group()+"x");
mat = pat1.matcher(cond);
}
It's not working, not entering the loop at all. Any help is appreciated.
Obviously not - your regexp pattern doesn't make any sense.
The opening /: In some languages, regexps aren't strings and start with an opening slash. Java is not one of those languages, and it has nothing to do with regexps itself. So, this looks for a literal slash in that SQL, which isn't there, thus, failure.
^ is regexpese for 'start of string'. Your string does not start with hemp.EMPLOYEE_NAME, so that also doesn't work. Get rid of both / and ^ here.
\\s is one whitespace character (there are many whitespace characters - this matches any one of them, exactly one though). Your string happens to have exactly one space on each side of the =, but SQL spacing varies. Your intent, surely, was \\s*, which matches 0 to many of them, i.e. \\s* is: "Whitespace is allowed here". \\s is: "There must be exactly one whitespace character here". Make all the \\s in your regexp an \\s*.
\\w is exactly one 'word' character (which is more or less a letter or digit), you obviously wanted \\w*.
[and|or] this is regexpese for: "An a, or an n, or a d, or an o, or an r, or a pipe symbol". Clearly you were looking for (and|or) which is regexpese for: Either the sequence "and", or the sequence "or".
* - so you want 0 to many 'and' or 'or', which makes no sense.
closing slash: You don't want this.
closing $: You don't want this - it means 'end of string'. Your string didn't end here.
The code itself:
replaceFirst, itself, also does regexps. You don't want to double apply this stuff. That's not how you replace a found result.
This is what you wanted:
Matcher mat = pat1.matcher(cond);
cond = mat.replaceFirst("replacement goes here");
where replacement can include references to groups in the match if you want to take parts of what you matched (i.e. don't use mat.group(), use those references).
More generally did you read any regexp tutorial, did any testing, or did any reading of the javadoc of Pattern and Matcher?
I've been developing for a few years. It's just personal experience, perhaps, but, reading is pretty fundamental.
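Putting the corrections from the list above together, a sketch of the repaired pattern might look like the following. The class name SqlConditionReplace, the method name and the "1=1 $1" replacement are invented for illustration. Escaping the dot and adding a trailing \b are my own additions, not something this answer asked for.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SqlConditionReplace {
    // No slashes, no ^ or $, \s* where whitespace is optional, \w* for the
    // quoted value, and (and|or) as a group instead of a character class.
    // The trailing \b keeps "or" from matching the start of a longer word.
    private static final Pattern CONDITION = Pattern.compile(
            "hemp\\.EMPLOYEE_NAME\\s*=\\s*'\\w*'\\s*(and|or)\\b",
            Pattern.CASE_INSENSITIVE);

    /** Replaces the first matching condition; $1 in the replacement is the and/or keyword. */
    static String replaceCondition(String sql, String replacement) {
        Matcher m = CONDITION.matcher(sql);
        return m.replaceFirst(replacement); // returns a new string, sql is unchanged
    }

    public static void main(String[] args) {
        String cond = "emp.EMAIL_ID = 'xx#xx.com' AND\n"
                + "emp.PERMANENT_ADDR LIKE('%98n%')\n"
                + "AND hemp.EMPLOYEE_NAME = 'xxx' and is_active='Y'";
        // Hypothetical replacement: swap in another condition, keep the keyword.
        System.out.println(replaceCondition(cond, "1=1 $1"));
    }
}
```

Note that the result of replaceFirst has to be assigned or returned, and that no loop is needed when only the first occurrence should change.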
Instead of the anchors ^ and $, you can use word boundaries \b to prevent a partial match.
If you want to match spaces on the same line, you can use \h to match horizontal whitespace char, as \s can also match a newline.
You can use replaceFirst on the string using $0 to get the full match, and an inline modifier (?i) for a case insensitive match.
Note that [and|or] is a character class matching one of the listed chars, not the words. Also escape the dot to match it literally, or else . matches any char except a newline.
(?i)\bhemp\.EMPLOYEE_NAME\h*=\h*'\w+'\h+(?:and|or)\b
For example
String regex = "(?i)\\bhemp\\.EMPLOYEE_NAME\\h*=\\h*'\\w+'\\h+(?:and|or)\\b";
String string = "cond = emp.EMAIL_ID = 'xx#xx.com' AND\n"
+ "emp.PERMANENT_ADDR LIKE('%98n%') \n"
+ "AND hemp.EMPLOYEE_NAME = 'xxx' and is_active='Y'";
System.out.println(string.replaceFirst(regex, "xx$0x"));
Output
cond = emp.EMAIL_ID = 'xx#xx.com' AND
emp.PERMANENT_ADDR LIKE('%98n%')
AND xxhemp.EMPLOYEE_NAME = 'xxx' andx is_active='Y'

Returning java regex (words, spaces, special characters, double quotes)

I am trying to use java regex to tokenize any language source file. What I want the list to return is:
words ([a-z_A-Z0-9])
spaces
any of [()*.,+-/=&:] as a single character
and quoted items left in quotes.
Here is the code I have so far:
Pattern pattern = Pattern.compile("[\"(\\w)\"]+|[\\s\\(\\)\\*\\+\\.,-/=&:]");
Matcher matcher = pattern.matcher(str);
List<String> matchlist = new ArrayList<String>();
while(matcher.find()) {
matchlist.add(matcher.group(0));
}
For example,
"I" am_the 2nd "best".
returns: list, size 8
("I", ,am_the, ,2nd, ,"best", .)
which is what I want. However, if the whole sentence is quoted, except for the period:
"I am_the 2nd best".
returns: list, size 8
("I, ,am_the, ,2nd, ,best", .)
and I want it to be able to return: list, size 2
("I am_the 2nd best", .)
If that makes sense. I believe it works for everything I want it to except for returning string literals (which I want to keep the quotes). What is it that I am missing from the pattern that will allow me to achieve this?
And by all means, if there is an easier pattern to use that I do not see, please help me out. The pattern shown above was the compilation of many trial/error. Thank you very much in advance for any help.
First, you'll need to separate the word-matching code from the string-literal-matching code. For word matching, use:
\w+
Next there's whitespace.
\s+
To match strings as one token, you need to allow more characters than just \w. That only allows alphanumeric characters and _, which means whitespace and symbols are not allowed. You also need to move the starting and ending quotes outside of the square brackets.
And don't forget backslashes to escape characters. You want to allow \" inside of strings.
"(\\.|[^"])+"
Finally, there are the symbols. You could list all the symbols, or you could just treat any non-word, non-whitespace, non-quote character as a symbol. I recommend the latter so you don't choke on other symbols like # or |. So for symbols:
[^\s\w"]
Putting the pieces together, we get this combined regex:
\w+|\s+|"(\\.|[^"])+"|[^\s\w"]
Or, escaping everything properly so it can be put into source code:
Pattern pattern = Pattern.compile("\\w+|\\s+|\"(\\\\.|[^\"])+\"|[^\\s\\w\"]");
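For what it's worth, here is a self-contained sketch that runs the combined regex against both inputs from the question. The class name SimpleTokenizer and the tokenize method are placeholders of mine. The loop is the same find() loop the question already uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SimpleTokenizer {
    // words | whitespace runs | double-quoted strings (with \" escapes) | single symbols
    private static final Pattern TOKEN =
            Pattern.compile("\\w+|\\s+|\"(\\\\.|[^\"])+\"|[^\\s\\w\"]");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(input);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("\"I\" am_the 2nd \"best\".").size()); // 8 tokens
        System.out.println(tokenize("\"I am_the 2nd best\"."));           // 2 tokens
    }
}
```

Two caveats with the regex as written: an empty literal "" is not matched because of the +, and a lone unmatched quote is silently skipped by find(). Changing + to * would accept empty literals.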
Typically, when parsing text, the process you're describing is called "lexical analysis" and the function used is called a 'lexer' which is used to break up an input stream into identifiable tokens like words, numbers, spaces, periods, etc.
The output of a lexer is consumed by a 'parser' which does "syntactic analysis" by identifying groups of tokens which belong together, like [double-quote] [word] [double-quote].
I would recommend you follow the same two-pass strategy, since it's been proven time and again in many, many parsers.
So, your first step might be to use this regular expression as your lexer:
\W|\w+
which will break your input text into either single non-word characters (like spaces, double and single quotation marks, commas, periods, etc.) or sequences of one or more word characters where \w is really just a shortcut for [a-zA-Z_0-9].
So, using your example above:
String str = "\"I\" am_the 2nd \"best\".";
String p = "\\W|\\w+";
Pattern pattern = Pattern.compile(p);
Matcher matcher = pattern.matcher(str);
List<String> matchlist = new ArrayList<String>();
while(matcher.find()) {
matchlist.add(matcher.group(0));
}
produces:
['"', 'I', '"', ' ', 'am_the', ' ', '2nd', ' ', '"', 'best', '"', '.']
which you can then decide how to treat in your code.
No, this doesn't give you a single one-size-fits-all regular expression which matches both of the cases you list above. In my experience, though, regular expressions aren't really the best tool for the kind of syntactic analysis you require: they either lack the expressiveness needed to cover all possible cases or, and this is far more likely, they quickly become far too complex for all but the true RegExp maven to fully comprehend.
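To make the two-pass idea a bit more concrete, here is one possible sketch, not a definitive implementation. The names TwoPassTokenizer, lex and mergeQuoted are hypothetical. The second pass is deliberately naive: it glues together everything between a pair of double quotes and does not handle backslash-escaped quotes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TwoPassTokenizer {
    private static final Pattern LEXER = Pattern.compile("\\W|\\w+");

    /** Pass 1: split into single non-word characters and runs of word characters. */
    static List<String> lex(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = LEXER.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    /** Pass 2: glue everything between a pair of double quotes into one token. */
    static List<String> mergeQuoted(List<String> tokens) {
        List<String> result = new ArrayList<>();
        StringBuilder literal = null; // non-null while inside a quoted string
        for (String token : tokens) {
            if (literal == null) {
                if (token.equals("\"")) {
                    literal = new StringBuilder(token); // opening quote
                } else {
                    result.add(token);
                }
            } else {
                literal.append(token);
                if (token.equals("\"")) {               // closing quote
                    result.add(literal.toString());
                    literal = null;
                }
            }
        }
        if (literal != null) {
            result.add(literal.toString()); // unterminated literal: keep what was read
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(mergeQuoted(lex("\"I\" am_the 2nd \"best\".")));
        System.out.println(mergeQuoted(lex("\"I am_the 2nd best\".")));
    }
}
```

The first input should come out as the 8 tokens from the question and the second as 2 tokens, with the quotes kept on the string literals.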

capture all characters between match character (single or repeated) on string

I'm trying to extract the string preceding a specific character, even when that character is repeated (e.g. underscore '_'):
this_is_my_example_line_0
this_is_my_example_line_1_
this_is_my_example_line_2___
_this_is_my_ _example_line_3_
__this_is_my___example_line_4__
and after running my regex I should get this (the regex should ignore any instances of the matching character in the middle of the string):
this_is_my_example_line_0
this_is_my_example_line_1
this_is_my_example_line_2
this_is_my_ _example_line_3
this_is_my___example_line_4
In other words I'm trying to 'trim' the matched character(s) at the beginning and end of string.
I'm trying to use a Regex in Java to accomplish this, my idea is to capture the group of characters between the special character(s) at the end or beginning of the line.
So far I can only do this successfully for example 3 with this regexp:
/[^_]+|_+(.*)[_$]+|_$+/
[^_]+ not 'underscore' once or more
| OR
_+ underscore once or more
(.*) capture all characters
[_$]+ not 'underscore' once or more followed by end of line
|_$+ OR 'underscore' once or more followed by end of line
I just realized that this excludes the first word of the message in examples 0, 1 and 2, since the string doesn't start with an underscore and it only starts matching after finding an underscore.
Is there an easier way not involving regex?
I don't really care about the first character (although it would be nice); I only need to ignore the repeating character at the end. It looks (by this regex tester) like just doing this would work: /()_+$/. The empty parentheses match anything before a single or repeating match at the end of the line. Would that be correct?
Thank you!
There are a couple of options here: you could either replace matches of ^_+|_+$ with an empty string, or extract the contents of the first capture group from the match of ^_*(.*?)_*$. Note that if your strings may span multiple lines and you want to perform the replacement on each line, then you will need to use the Pattern.MULTILINE flag for either approach. If your strings may span multiple lines and you only want the replacement to occur at the very beginning and end, don't use Pattern.MULTILINE but use Pattern.DOTALL for the second approach.
For example: http://regexr.com?355ff
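In Java, both options could be sketched roughly as follows for single-line input. The class name UnderscoreTrim and the two method names are made up for this example. No flags are used, so the multi-line cases mentioned above are not covered.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnderscoreTrim {
    private static final Pattern EDGES = Pattern.compile("^_+|_+$");
    private static final Pattern CORE = Pattern.compile("^_*(.*?)_*$");

    /** Option 1: delete leading and trailing underscores. */
    static String trimByReplace(String s) {
        return EDGES.matcher(s).replaceAll("");
    }

    /** Option 2: capture what is left between the leading and trailing underscores. */
    static String trimByGroup(String s) {
        Matcher m = CORE.matcher(s);
        return m.find() ? m.group(1) : s;
    }

    public static void main(String[] args) {
        String[] lines = {
            "this_is_my_example_line_0",
            "this_is_my_example_line_2___",
            "__this_is_my___example_line_4__"
        };
        for (String line : lines) {
            System.out.println(trimByReplace(line) + " | " + trimByGroup(line));
        }
    }
}
```

Underscores in the middle are left alone in both variants, because only the runs touching ^ or $ are removed or skipped.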
How about [^_\n\r](.*[^_\n\r])? as the pattern?
Demo
String data=
"this_is_my_example_line_0\n" +
"this_is_my_example_line_1_\n" +
"this_is_my_example_line_2___\n" +
"_this_is_my_ _example_line_3_\n" +
"__this_is_my___example_line_4__";
Pattern p=Pattern.compile("[^_\n\r](.*[^_\n\r])?");
Matcher m=p.matcher(data);
while(m.find()){
System.out.println(m.group());
}
output:
this_is_my_example_line_0
this_is_my_example_line_1
this_is_my_example_line_2
this_is_my_ _example_line_3
this_is_my___example_line_4

Regular expression Java Merge Pattern

I have these three regular expressions. They work individually, but I would like to merge them into a single pattern.
regex1 = [0-9]{16}
regex2 = [0-9]{4}[-][0-9]{4}[-][0-9]{4}[-][0-9]{4}
regex3 = [0-9]{4}[ ][0-9]{4}[ ][0-9]{4}[ ][0-9]{4}
I use this method:
Pattern.compile(regex);
Which is the regex string to merge them?
You can use backreferences:
[0-9]{4}([ -]|)([0-9]{4}\1){2}[0-9]{4}
This will only match if the separators are either all
spaces
hyphens
blank
\1 means "this matches exactly what the first capturing group – expression in parentheses – matched".
Since ([ -]|) is that group, both other separators need to be the same for the pattern to match.
You can simplify it further to:
\d{4}([ -]|)(\d{4}\1){2}\d{4}
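A quick way to check the backreference behaviour is a sketch like the one below. The class name SixteenDigitFormat and the method isValid are hypothetical. It uses matches() so that the whole input has to fit the pattern.

```java
import java.util.regex.Pattern;

final class SixteenDigitFormat {
    // Group 1 captures the first separator (space, hyphen, or nothing);
    // \1 forces the remaining two separators to be identical to it.
    private static final Pattern SAME_SEPARATOR =
            Pattern.compile("[0-9]{4}([ -]|)([0-9]{4}\\1){2}[0-9]{4}");

    static boolean isValid(String s) {
        return SAME_SEPARATOR.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "1234567812345678",
            "1234-5678-1234-5678",
            "1234 5678 1234 5678",
            "0000-0000 00000000"   // mixed separators
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isValid(s));
        }
    }
}
```

The first three samples should be accepted and the mixed-separator one rejected.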
The following should match anything the three patterns match:
regex = [0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}
That is, I'm assuming you are happy with either a hyphen, a space or nothing between the numbers?
Note: this will also match situations where you have any combination of the three, e.g.
0000-0000 00000000
which may not be desired?
Alternatively, if you need to match any of the three individual patterns then just concatenate them with |, as follows:
([0-9]{16})|([0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{4})|([0-9]{4} [0-9]{4} [0-9]{4} [0-9]{4})
(Your original example appears to have unnecessary square brackets around the space and hyphen)
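To compare the two variants from this answer side by side, something along these lines could be used. The class name MergedPatterns and the method names lenient and strict are assumptions of mine, as is the use of matches() to require a full match.

```java
import java.util.regex.Pattern;

final class MergedPatterns {
    // Optional hyphen or space between groups; separators may be mixed.
    private static final Pattern LENIENT =
            Pattern.compile("[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}");

    // Plain alternation of the three original patterns.
    private static final Pattern STRICT = Pattern.compile(
            "([0-9]{16})"
            + "|([0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{4})"
            + "|([0-9]{4} [0-9]{4} [0-9]{4} [0-9]{4})");

    static boolean lenient(String s) {
        return LENIENT.matcher(s).matches();
    }

    static boolean strict(String s) {
        return STRICT.matcher(s).matches();
    }

    public static void main(String[] args) {
        String mixed = "0000-0000 00000000";
        System.out.println("lenient: " + lenient(mixed)); // accepted
        System.out.println("strict:  " + strict(mixed));  // rejected
    }
}
```

Which one is appropriate depends on whether mixed separators should count as valid input.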
