(Pattern and Matcher) not discovering all pattern matches - java

I have this string object which consists of tags(bounded by [$ and $]) and rest of the text. Im trying to isolate all of the tags. (Pattern-Matcher) recognize all of the tags properly, but two of them are combined into one. I dont have any idea why this is happening, probably some internal (Matcher-Pattern) bussiness.
String docBody = "This is sample text.\r\n[$ FOR i 1 10 1 $]\r\n This is" +
"[$ i $]-th time this message is generated.\r\n[$END$]\r\n" +
"[$ FOR i 0 10 2 $]\r\n sin([$= i $]^2) = [$= i i * #sin \"0.000\"" +
" #decfmt $]" +
"\r\n[$END$] ";
Pattern p = Pattern.compile("(\\[\\$)(.)+(\\$\\])");
Matcher m = p.matcher(docBody);
while(m.find()){
System.out.println(m.group());
}
output:
[$ FOR i 1 10 1 $]
[$ i $]
[$END$]
[$ FOR i 0 10 2 $]
[$= i $]^2) = [$= i i * #sin "0.000" #decfmt $]
[$END$]`
As you can see, this part [$= i $]^2) = [$= i i * #sin "0.000" #decfmt $] is not split into these two tags [$= i $] and [$= i i * #sin "0.000" #decfmt $]
Any suggestions why this is happening?

You should use reluctant quantifier - ".+?" instead of greedy - ".+" :
"(\\[\\$).+?(\\$\\])" // Note `?` after `.+`
If you use .+, it will match everything except the line-terminator till the last $. Note that a dot (.) matches everything except a newline. With reluctant quantifier, .+? matches only till the first $] it encounters.
In your given string, you got all those matches, because you had \r\n in between, where the .+ stops matching. If you remove all those newlines, then you will just get a single match from 1st [$ to the last $].

A good way is to replace the dot by a negated character class, example:
Pattern p = Pattern.compile("(\\[\\$)([^$]++)(\\$])");
(note that you don't need to escape closing square brackets)
But perhaps are you only interested by the content of the tags:
Pattern p = Pattern.compile("(?<=\\[\\$)[^$]++(?=\\$])");
In this case the content is the whole match

Related

java regex tell which column not match

Good day,
My java code is as follow:
Pattern p = Pattern.compile("^[a-zA-Z0-9$&+,:;=\\[\\]{}?##|\\\\'<>._^*()%!/~\"`  -]*$");
String i = "f698fec0-dd89-11e8-b06b-☺";
Matcher tagmatch = p.matcher(i);
System.out.println("tagmatch is " + tagmatch.find());
As expected, the answer will be false, because there is ☺ character inside. However, I would like to show the column number that not match. For this example, it should show column 25th having the invalid character.
May I know how can I do this?
You should remove anchors from your regex and then use Matcher#end() method to get the position where it stopped the previous match like this:
String i = "f698fec0-dd89-11e8-b06b-☺";
Pattern p = Pattern.compile("[\\w$&+,:;=\\[\\]{}?##|\\\\'<>.^*()%!/~\"` -]+");
Matcher m = p.matcher(i);
if (m.lookingAt() && i.length() > m.end()) {
System.out.println("Match <" + m.group() + "> failed at: " + m.end());
}
Output:
Match <f698fec0-dd89-11e8-b06b-> failed at: 24
PS: I have used lookingAt() to ensure that we match the pattern starting from the beginning of the region. You can use find() as well to get the next match anywhere or else keep the start anchor in pattern as
"^[\\w$&+,:;=\\[\\]{}?##|\\\\'<>.^*()%!/~\"` -]+"
and use find() to effectively make it behave like the above code with lookingAt().
Read difference between lookingAt() and find()
I have refactored your regex to use \w instead of [a-zA-Z0-9_] and used quantifier + (meaning match 1 or more) instead of * (meaning match 0 or more) to avoid returning success for zero-length matches.

Masking using regular expressions for below format

I am trying to write a regular expression to mask the below string. Example below.
Input
A1../D//FASDFAS--DFASD//.F
Output (Skip first five and last two Alphanumeric's)
A1../D//FA***********D//.F
I am trying using below regex
([A-Za-z0-9]{5})(.*)(.{2})
Any help would be highly appreciated.
You solve your issue by using Pattern and Matcher with a regex which match multiple groups :
String str = "A1../D//FASDFAS--DFASD//.F";
Pattern pattern = Pattern.compile("(.*?\\/\\/..)(.*?)(.\\/\\/.*)");
Matcher matcher = pattern.matcher(str);
if (matcher.find()) {
str = matcher.group(1)
+ matcher.group(2).replaceAll(".", "*")
+ matcher.group(3);
}
Detail
(.*?\\/\\/..) first group to match every thing until //
(.*?) second group to match every thing between group one and three
(.\\/\\/.*) third group to match every thing after the last character before the // until the end of string
Outputs
A1../D//FA***********D//.F
I think this solution is more readable.
If you want to do that with a single regex you may use
text = text.replaceAll("(\\G(?!^|(?:[0-9A-Za-z][^0-9A-Za-z]*){2}$)|^(?:[^0-9A-Za-z]*[0-9A-Za-z]){5}).", "$1*");
Or, using the POSIX character class Alnum:
text = text.replaceAll("(\\G(?!^|(?:\\p{Alnum}\\P{Alnum}*){2}$)|^(?:\\P{Alnum}*\\p{Alnum}){5}).", "$1*");
See the Java demo and the regex demo. If you plan to replace any code point rather than a single code unit with an asterisk, replace . with \P{M}\p{M}*+ ("\\P{M}\\p{M}*+").
To make . match line break chars, add (?s) at the start of the pattern.
Details
(\G(?!^|(?:[0-9A-Za-z][^0-9A-Za-z]*){2}$)|^(?:[^0-9A-Za-z]*[0-9A-Za-z]){5}) -
\G(?!^|(?:[0-9A-Za-z][^0-9A-Za-z]*){2}$) - a location after the successful match that is not followed with 2 occurrences of an alphanumeric char followed with 0 or more chars other than alphanumeric chars
| - or
^(?:[^0-9A-Za-z]*[0-9A-Za-z]){5} - start of string, followed with five occurrences of 0 or more non-alphanumeric chars followed with an alphanumeric char
. - any code unit other than line break characters (if you use \P{M}\p{M}*+ - any code point).
Usually, masking of characters in the middle of a string can be done using negative lookbehind (?<!) and positive lookahead groups (?=).
But in this case lookbehind group can't be used because it does not have an obvious maximum length due to unpredictable number of non-alphanumeric characters between first five alphanumeric characters (. and / in the A1../D//FA).
A substring method can used as a workaround for inability to use negative lookbehind group:
String str = "A1../D//FASDFAS--DFASD//.F";
int start = str.replaceAll("^((?:\\W{0,}\\w{1}){5}).*", "$1").length();
String maskedStr = str.substring(0, start) +
str.substring(start).replaceAll(".(?=(?:\\W{0,}\\w{1}){2})", "*");
System.out.println(maskedStr);
// A1../D//FA***********D//.F
But the most straightforward way is to use java.util.regex.Pattern and java.util.regex.Matcher:
String str = "A1../D//FASDFAS--DFASD//.F";
Pattern pattern = Pattern.compile("^((?:\\W{0,}\\w{1}){5})(.+)((?:\\W{0,}\\w{1}){2})");
Matcher matcher = pattern.matcher(str);
if (matcher.find()) {
String maskedStr = matcher.group(1) +
"*".repeat(matcher.group(2).length()) +
matcher.group(3);
System.out.println(maskedStr);
// A1../D//FA***********D//.F
}
\W{0,} - 0 or more non-alphanumeric characters
\w{1} - exactly 1 alphanumeric character
(\W{0,}\w{1}){5} - 5 alphanumeric characters and any number of alphanumeric characters in between
(?:\W{0,}\w{1}){5} - do not capture as a group
^((?:\\W{0,}\\w{1}){5})(.+)((?:\\W{0,}\\w{1}){2})$ - substring with first five alphanumeric characters (group 1), everything else (group 2), substring with last 2 alphanumeric characters (group 3)

How to build a Regex in java to detect a whitespace or end of a string?

I am trying to build a Regex to find and extract the string containing Post office box.
Here is two examples:
str = "some text p.o. box 12456 Floor 105 streetName Street";
str = "po box 1011";
str = "post office Box 12 Floor 105 Tallapoosa Street";
str = "leclair ryan pc p.o. Box 2499 8th floor 951 east byrd street";
str = "box 1 slot 3 building 2 136 harvey road";
Here is my pattern and code:
Pattern p = Pattern.compile("p.*o.*box \\d+(\\z|\\s)");
Matcher m = p.matcher(str);
int count =0;
while(m.find()) {
count++;
System.out.println("Match number "+count);
System.out.println("start(): "+m.start());
System.out.println("end(): "+m.end());
}
It works with the second example and note for the first one!
If change my pattern to the following:
Pattern p = Pattern.compile("p.*o.*box \d+ ");
It works just for the first example.
The question is how to group the Regex for end of string "\z" and Regex for whitespace "\s" or " "?
New Pattern:
Pattern p = Pattern.compile("(?i)((p.*o.box\s\w\s*\d*(\z|\s*)|(box\s*\w\s*\d*(\z|\s*)) ))");
You can leverage the following code:
String str = "some text p.o. box 12456 Floor 105 streetName Street";
Pattern p = Pattern.compile("(?i)\\bp\\.?\\s*o\\.?\\s*box\\s*(\\d+)(?:\\z|\\s)");
Matcher m = p.matcher(str);
int count =0;
while(m.find()) {
count++;
System.out.println("Match: "+m.group(0));
System.out.println("Digits: "+m.group(1));
System.out.println("Match number "+count);
System.out.println("start(): "+m.start());
System.out.println("end(): "+m.end());
}
To make the pattern case insensitive, just add Pattern.CASE_INSENSITIVE flag to the Pattern.compile declaration or pre-pend the inline (?i) modifier to the pattern.
Also, .* matches any characters other than a newline zero or more times, I guess you wanted to match . optionally. So, you need just ? quantifier and to escape the dot so as to match a literal dot. Note how I used (...) to capture digits into Group 1 (it is called a capturing group). The group where you match the end of the string or space is inside a non-capturing grouo ((?:...)) that is used for grouping only, not for storing its value in the memory buffer. Since you wanted to match a word boundary there, I suggest replacing (?:\\z|\\s) with a mere \\b:
Pattern p = Pattern.compile("(?i)\\bp\\.?\\s*o\\.?\\s*box\\s*(\\d+)\\b");
There are a couple items in your regex that look like they need work. From what I understand you want to extract the P.O. Box number from strings of such format that you've provided. Given that, the following regex will accomplish what you want, with a following explanation. See it in action here: https://regex101.com/r/cQ8lH3/2
Pattern p = Pattern.compile("p\.?o\.? box [^ \r\n\t]+");
Firstly, you need to use only ONE slash, for escape sequences. Also, you must escape the dots. If you do not escape the dots, regex will match . as ANY single character. \. will instead match a dot symbol.
Next, you need to change the * quantifier after the \. to a ?. Why? The * symbol will match zero or more of the preceding symbol while the ? quantifier will match only one or none.
Finally rethink how you're matching the box number. Instead of matching all characters AND THEN white space, just match everything that isn't a whitespace. [^ \r\n\t]+ will match all characters that are NOT a space (), carriage return (\r), newline (\n), or tab (\t). Therefore it will consume the box number and stop as soon as it hits any whitespace or end of file.
Some of these changes may not be necessary to get your code to work for the examples you gave, but they are the proper way to build the regex you want.

Latin Regex with symbols

I need split a text and get only words, numbers and hyphenated composed-words. I need to get latin words also, then I used \p{L}, which gives me é, ú ü ã, and so forth. The example is:
String myText = "Some latin text with symbols, ? 987 (A la pointe sud-est de l'île se dresse la cathédrale Notre-Dame qui fut lors de son achèvement en 1330 l'une des plus grandes cathédrales d'occident) : ! # # $ % ^& * ( ) + - _ #$% " ' : ; > < / \ | , here some is wrong… * + () e -"
Pattern pattern = Pattern.compile("[^\\p{L}+(\\-\\p{L}+)*\\d]+");
String words[] = pattern.split( myText );
What is wrong with this regex? Why it matches symbols like "(", "+", "-", "*" and "|"?
Some of results are:
dresse // OK
sud-est // OK
occident) // WRONG
987 // OK
() // WRONG
(a // WRONG
* // WRONG
- // WRONG
+ // WRONG
( // WRONG
| // WRONG
The regex explanation is:
[^\p{L}+(\-\p{L}+)*\d]+
* Word separator will be:
* [^ ... ] No sequence in:
* \p{L}+ Any latin letter
* (\-\p{L}+)* Optionally hyphenated
* \d or numbers
* [ ... ]+ once or more.
If my understanding of your requirement is correct, this regex will match what you want:
"\\p{IsLatin}+(?:-\\p{IsLatin}+)*|\\d+"
It will match:
A contiguous sequence of Unicode Latin script characters. I restrict it to Latin script, since \p{L} will match letter in any script. Change \\p{IsLatin} to \\pL if your version of Java doesn't support the syntax.
Or several such sequences, hyphenated
Or a contiguous sequence of decimal digits (0-9)
The regex above is to be used by calling Pattern.compile, and call matcher(String input) to obtain a Matcher object, and use a loop to find matches.
Pattern pattern = Pattern.compile("\\p{IsLatin}+(?:-\\p{IsLatin}+)*|\\d+");
Matcher matcher = pattern.matcher(inputString);
while (matcher.find()) {
System.out.println(matcher.group());
}
If you want to allow words with apostrophe ':
"\\p{IsLatin}+(?:['\\-]\\p{IsLatin}+)*|\\d+"
I also escape - in the character class ['\\-] just in case you want to add more. Actually - doesn't need escaping if it is the first or last in the character class, but I escape it anyway just to be safe.
If the opening bracket of a character class is followed by a ^ then the characters listed inside the class are not allowed. So your regex allows anything except unicode letter,+,(,-,),* and digit occurring one or more times.
Note that characters like +,(,),* etc. don't have any special meaning inside a character class.
What pattern.split does is that it splits the string at patterns matching the regex. Your regex matches whitespace and hence split occurs at each occurrence of one or more whitespace. So result will be this.
For example consider this
Pattern pattern = Pattern.compile("a");
for (String s : pattern.split("sda a f g")) {
System.out.println("==>"+s);
}
Output will be
==>sd
==>
==> f g
A regular expression set description with [] can contain only letters, classes (\p{...}), sequences (e.g. a-z) and the complement symbol (^). You have to place the other magic characters you are using (+*()) outside the [ ] block.

Regex - Match numbers & special cases

I'm trying to make a regex that would produce the following results :
for 7.0 + 5 - :asc + (8.256 - :b)^2 + :d/3 : 7.0, 5, :asc, 8.256, :b, 2, :d, 3
for -+*-/^^ )ç# : nothing
It's should first match numbers which can be float, so in my regex I have : [0-9]+(\\.[0-9])? but it should also mach special cases like :a or :Abc.
To be more precise, it should (if possible) match anything but mathematical operators /*+^- and parentheses.
So here is my final regex : ([0-9]+(\\.[0-9])?)|(:[a-zA-Z]+) but it's not working because matcher.groupCount() returns 3 for both of the examples I gave.
Groups are what you specifically group in the regex. Anything surrounded in parentheses is a group. (Hello) World has 1 group, Hello. What you need to be doing is finding all the matches.
In your code ([0-9]+(\\.[0-9])?)|(:[a-zA-Z]+), 3 sets of parentheses can be seen. This is why you will always be given 3 groups in every match.
Your code works fine as it is, here is an example:
String text = "7.0 + 5 - :asc + (8.256 - :b)^2 + :d/3";
Pattern p = Pattern.compile("([0-9]+(\\.[0-9]+)?)|(:[a-zA-Z]+)");
Matcher m = p.matcher(text);
List<String> matches = new ArrayList<String>();
while (m.find()) matches.add(m.group());
for (String match : matches) System.out.println(match);
The ArrayList matches will contain all of the matches that your regex finds.
The only change I made was add a + after the second [0-9].
Here is the output:
7.0
5
:asc
8.256
:b
2
:d
3
Here is some more information about groups in java.
Does that help?
Your regex is correct, run the following code:
String input = "7.0 + 5 - :asc + (8.256 - :b)^2 + :d/3"; // your input
String regex = "(\\d+(\\.\\d+)?)|(:[a-z-A-Z]+)"; // exactly yours.
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
System.out.println(matcher.group());
}
Your problem is the understanding of the method matcher.groupCount(). JavaDoc clearly says
Returns the number of capturing groups in this matcher's pattern.
([^\()+\-*\s])+ //put any mathematical operator inside square bracket

Categories

Resources