How to capture all instances with regex from a string with - java

I got help with a regex expression in java here: java regex to capture any number of periods within a string
That solved the problem of identifying the pattern, but I have been unable to figure out how to catch all instances within a body of text.
If I have a string body like this:
String body = "$tag:parent$ is the child of $tag:grand.parent$ who is the grandparent of $tag:child$";
I use the following and always catch only the first $tag:*$ string. Whichever tag comes first, the pattern gets it, but none of the later ones:
final String REGEX = "\\$tag:(?:[a-z]+?\\.*){1,4}\\$";
Pattern pattern = Pattern.compile(REGEX);
Matcher matcher = pattern.matcher(body);
if (matcher.find()) {
// do something with matcher.group() but should the group contain the all instances?
}
I have tried enclosing the pattern in () on regex101.com, where it matches and lists everything in a group, but this doesn't work in my code.
I also tried the following, though this was just me trying random stuff:
"(\\$tag:(?:[a-z]+?\\.*){1,4}\\$)"
"(^\\$tag:(?:[a-z]+?\\.*){1,4}\\$)"
"(?<=(\\$tag:(?:[a-z]+?\\.*){1,4}\\$))"
I basically just want a Java regex approach that catches all the instances in some manner. Thanks for the assistance.

Elaborating on my comment:
Use a while loop to iterate over all matched patterns in your input, e.g.:
// note: I have simplified your pattern a bit, you probably don't need all
// those restrictions
Pattern pattern = Pattern.compile("\\$tag:.+?\\$");
Matcher matcher = pattern.matcher(body);
while (matcher.find()) {
// TODO whatever you want with the matched group
System.out.println(matcher.group());
}
Output
$tag:parent$
$tag:grand.parent$
$tag:child$
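If the matches are needed as a collection rather than printed, the same loop can fill a list. This is a minimal sketch: the class name `TagFinder` and the helper `findAll` are made up for illustration, and the pattern is the simplified one from the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagFinder {
    // Simplified pattern from the answer: "$tag:", as few characters as possible, then "$".
    private static final Pattern TAG = Pattern.compile("\\$tag:.+?\\$");

    // Hypothetical helper: returns every match in order of appearance.
    static List<String> findAll(String body) {
        List<String> tags = new ArrayList<>();
        Matcher matcher = TAG.matcher(body);
        while (matcher.find()) { // each call resumes after the previous match
            tags.add(matcher.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        String body = "$tag:parent$ is the child of $tag:grand.parent$ "
                + "who is the grandparent of $tag:child$";
        System.out.println(findAll(body));
    }
}
```

A single `if (matcher.find())` only ever sees the first match, which is why the original code never got past the first tag.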

Related

Pattern and Matcher is not working in java

Basically I have a simple String in which I need to explicitly reject characters other than a-zA-Z0-9. Before I describe what is wrong, here is how I am doing it.
Pattern p = Pattern.compile("[&=]");
Matcher m = p.matcher("Nothing is wrong");
if (m.find()){
out.print("You are not allowed to have &=.");
return;
}
Pattern p1 = Pattern.compile("[a-zA-Z0-9]");
Matcher m1 = p1.matcher("Itissupposetobeworking");
if (m1.find()){
out.print("There is something wrong.");
return;
}
The first one works fine, but with the second matcher, if (m1.find()) always executes even though the input doesn't contain any character other than those specified in the pattern.
I also tried Pattern p1 = Pattern.compile("[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789]") but still have the same trouble.
Also, could you tell me which is better: String.matches("[a-zA-Z0-9]") or the way I am using above?
Thanks in advance.
[a-zA-Z0-9] tries to match alphanumeric characters.
So "There is something wrong." is printed whenever the input character sequence passed to matcher() contains an alphanumeric character.
Change it to [^a-zA-Z0-9] and try again.
This matches non-alphanumeric characters, so you will get the expected result.
You seem to want to find a partial match in a string that contains a character other than an alphanumeric character:
Pattern p1 = Pattern.compile("[^a-zA-Z0-9]");
or
Pattern p1 = Pattern.compile("\\P{Alnum}");
The [^a-zA-Z0-9] pattern is a negated character class that matches any char other than the ones defined in the class. So, if a string contains any chars other than ASCII letters or digits, your if (m1.find()) will get triggered and the message will appear.
Note that the whole negated character class can be replaced with a predefined character class \P{Alnum} that matches any char other than alphanumeric. \p{Alnum} matches any alphanumeric character and \P{Alnum} is the reverse class.
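As a minimal sketch of the negated check, the snippet below wraps it in a helper. The class name `AlnumCheck` and the method `hasIllegalChar` are hypothetical; the pattern is the `\P{Alnum}` form mentioned above, which by default covers ASCII letters and digits only.

```java
import java.util.regex.Pattern;

class AlnumCheck {
    // Matches one character that is NOT an ASCII letter or digit.
    private static final Pattern NON_ALNUM = Pattern.compile("\\P{Alnum}");

    // Hypothetical helper: true if the input holds any disallowed character.
    static boolean hasIllegalChar(String input) {
        return NON_ALNUM.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(hasIllegalChar("Itissupposetobeworking"));  // false
        System.out.println(hasIllegalChar("It$issupposetobeworking")); // true
    }
}
```

The whole-string alternative would be `input.matches("[a-zA-Z0-9]*")`, which asks the opposite question (is every character allowed?) and compiles the pattern on each call.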
If you use the isAlphanumeric method of org.apache.commons.lang.StringUtils, your code becomes much more readable. You then only need to write
if (!StringUtils.isAlphanumeric("Itissupposetobeworking"))
instead of
Pattern p1 = Pattern.compile("[a-zA-Z0-9]");
Matcher m1 = p1.matcher("Itissupposetobeworking");
if (!m1.find()){
When the expression above finds a match, it prints "There is something wrong." If you want to reject other characters instead, use the code below, which looks for any character outside the allowed set.
Pattern p1 = Pattern.compile("[^a-zA-Z0-9]");
String a = "It$issupposetobeworking";
Matcher m1 = p1.matcher(a);
if (m1.find()){
System.out.print("There is something wrong.");
}
else
{
System.out.println("Everything is fine");
}
If you want to keep the positive character class, require it to cover the whole input with matches() instead of find():
Pattern p1 = Pattern.compile("[a-zA-Z0-9]+");
Matcher m1 = p1.matcher("Itissupposetobeworking");
if (!m1.matches()){
out.print("There is something wrong.");
return;
}

Is it possible to use a quantifier on a non-capturing group ? - regex

I match a group with a regex and would like to capture
everything except the group(s).
The group can occur several times, at different locations in the String.
My first thought was to solve it with a negative lookahead, but I failed. I then tried a non-capturing group and got stuck there too.
(bar) (baz) foo
I want foo.
This is what I have so far:
String input = "(bar) (baz) foo";
String matchesGroup = "((?=\\().*?\\))"; //matches (...)
// as Casimir et Hippolyte commented, I now use
// ((?:(...))+) for the non capturing group
String matchesFoo = "((?:"+ matchesGroup +")+)\\s(.*)";
Pattern pattern = Pattern.compile(matchesFoo);
Matcher matcher = pattern.matcher(input);
while (matcher.find()){
System.out.println(matcher.group());
}
but nothing is captured at all
actual :
expected : foo
Where is the fault in my regex?
Since you want to match multiple (...) groups, account for the possible trailing space and move the + so that it quantifies one or more of those groups. I moved the space into the group and applied the + to that whole structure; the remaining text then lands in capture group 1.
String matchesFoo = "(?:(?:(?=\\().*?\\))\\s?)+(.*)";
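Note that `matcher.group()` returns the entire match, parenthesized parts included, so the loop in the question would still not print just foo. A small sketch of reading capture group 1 instead follows; the class name `AfterGroups` and the helper `remainder` are invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AfterGroups {
    // One or more "(...)" groups, each optionally followed by whitespace,
    // then whatever remains is captured in group 1.
    private static final Pattern P =
            Pattern.compile("(?:(?:(?=\\().*?\\))\\s?)+(.*)");

    // Hypothetical helper: text after the leading groups, or null if none are found.
    static String remainder(String input) {
        Matcher m = P.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(remainder("(bar) (baz) foo")); // foo
    }
}
```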
Why don't you try this:
String testString = "matches matches foo";
testString = testString.replaceAll("matches", "");
System.out.println(testString);

Pattern (string) allows characters only one time

I want to check whether my string contains only allowed characters. Everything works for inputs such as 7B, 77B or 7BBBB, but when I input something like 7B7 or 7BB2 it does not match.
Everything works fine, but when a digit is the last character it does not.
Could you tell me what is wrong with this code?
pattern = Pattern.compile("[0-9]*[a-f]*[A-F]*");
matcher = pattern.matcher(stNumber);
if (matcher.matches()) {...}
If you want to mix digits and letters in any order, you need something like:
Pattern pattern = Pattern.compile("[\\da-fA-F]*");
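To see why the original pattern rejects 7B7, it may help to run both patterns side by side. This is only an illustrative sketch; the class name `HexCheck` and the helpers `ordered` and `mixed` are hypothetical.

```java
import java.util.regex.Pattern;

class HexCheck {
    // The question's pattern: digits, then a-f, then A-F, strictly in that order.
    private static final Pattern ORDERED = Pattern.compile("[0-9]*[a-f]*[A-F]*");
    // A single character class lets digits and hex letters appear in any order.
    private static final Pattern MIXED = Pattern.compile("[\\da-fA-F]*");

    static boolean ordered(String s) { return ORDERED.matcher(s).matches(); }
    static boolean mixed(String s)   { return MIXED.matcher(s).matches(); }

    public static void main(String[] args) {
        for (String s : new String[] {"7B", "7BBBB", "7B7", "7BB2"}) {
            System.out.println(s + " ordered=" + ordered(s) + " mixed=" + mixed(s));
        }
    }
}
```

Three consecutive classes describe three consecutive runs, so once the letters have started, no further digit can match; one class with a single quantifier has no such ordering.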
Why not try it this way?
// Compile this pattern.
Pattern pattern = Pattern.compile("[0-9]*[a-f]*[A-F]*[0-9]*");
// See if this String matches.
Matcher m = pattern.matcher("7B7");
if (m.matches()) {
System.out.println(true);
}
Are you trying to verify that the string only has digits and letters and nothing else?
If so, try the following:
pattern = Pattern.compile("^[a-zA-Z\\d]*$");
matcher = pattern.matcher(stNumber);
if (matcher.matches()) {...}

Pattern Matcher Vs String Split, which should I use?

First time posting.
First, I know how to use both Pattern/Matcher and String.split.
My question is which is best to use in my example, and why?
Suggestions for better alternatives are also welcome.
Task:
I need to extract an unknown NOUN between two known regexp in an unknown string.
My Solution:
get the Start and End of the noun (from Regexp 1&2) and substring to extract the noun.
String line = "unknownXoooXNOUNXccccccXunknown";
int goal = 12 ;
String regexp1 = "Xo+X";
String regexp2 = "Xc+X";
I need to locate the index position AFTER the first regex.
I need to locate the index position BEFORE the second regex.
A) I can use pattern matcher
Pattern p = Pattern.compile(regexp1);
Matcher m = p.matcher(line);
if (m.find()) {
int afterRegex1 = m.end();
} else {
throw new IllegalArgumentException();
//TODO Exception Management;
}
B) I can use String Split
String[] split = line.split(regexp1, 2);
if (split.length != 2) {
throw new UnsupportedOperationException();
//TODO Exception Management;
}
int afterRegex1 = line.indexOf(split[1]);
Which Approach should I use and why?
I don't know which is more efficient on time and memory.
Both are about equally readable to me.
I'd do it like this:
String line = "unknownXoooXNOUNXccccccXunknown";
String regex = "Xo+X(.*?)Xc+X";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(line);
if (m.find()) {
String noun = m.group(1);
}
The (.*?) is used to make the inner match on the NOUN reluctant. This protects us from a case where our ending pattern appears again in the unknown portion of the string.
EDIT
This works because the (.*?) defines a capture group. There's only one such group defined in the pattern, so it gets index 1 (the parameter to m.group(1)). These groups are indexed from left to right starting at 1. If the pattern were defined like this
String regex = "(Xo+X)(.*?)(Xc+X)";
Then there would be three capture groups, such that
m.group(1); // yields "XoooX"
m.group(2); // yields "NOUN"
m.group(3); // yields "XccccccX"
There is a group 0, but that matches the whole pattern, and it's equivalent to this
m.group(); // yields "XoooXNOUNXccccccX"
For more information about what you can do with the Matcher, including ways to get the start and end positions of your pattern within the source string, see the Matcher JavaDocs
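The effect of the reluctant quantifier shows up once the closing delimiter occurs more than once. The sketch below uses a made-up input line for that purpose; the class name `ReluctantDemo` and the helper `firstGroup` are also hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReluctantDemo {
    // Hypothetical helper: group 1 of the first match, or null if there is none.
    static String firstGroup(String regex, String line) {
        Matcher m = Pattern.compile(regex).matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Made-up input in which the closing delimiter "XccX" occurs twice.
        String line = "unknownXoooXNOUNXccXmoreXccXunknown";
        System.out.println(firstGroup("Xo+X(.*?)Xc+X", line)); // NOUN
        System.out.println(firstGroup("Xo+X(.*)Xc+X", line));  // NOUNXccXmore
    }
}
```

The reluctant `(.*?)` stops at the first closing delimiter, while the greedy `(.*)` runs on to the last one.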
You should use String.split() for readability unless you're in a tight loop.
Per split()'s javadoc, split() does the equivalent of Pattern.compile(), which you can optimize away if you're in a tight loop.
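For the tight-loop case, one way to avoid repeated compilation is to keep the compiled Pattern in a constant and call its own split method. This is a sketch under that assumption; the class name `SplitDemo` and the helper `afterOpening` are invented for illustration.

```java
import java.util.regex.Pattern;

class SplitDemo {
    // Compiled once and reused across calls.
    private static final Pattern OPENING = Pattern.compile("Xo+X");

    // Hypothetical helper: text after the first opening delimiter, or null if absent.
    static String afterOpening(String line) {
        String[] parts = OPENING.split(line, 2); // limit 2: split at the first delimiter only
        return parts.length == 2 ? parts[1] : null;
    }

    public static void main(String[] args) {
        System.out.println(afterOpening("unknownXoooXNOUNXccccccXunknown"));
        // NOUNXccccccXunknown
    }
}
```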
It looks like you want to get a unique occurrence. For this, simply do
input.replaceAll(".*Xo+X(.*)Xc+X.*", "$1")
For efficiency, use Pattern.matcher(input).replaceAll instead.
In case your input contains line breaks, use Pattern.DOTALL or the s modifier.
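Put together, a precompiled pattern with DOTALL might look like the sketch below. The class name `ExtractByReplace`, the helper `extract`, and the multi-line sample input are assumptions made for this example.

```java
import java.util.regex.Pattern;

class ExtractByReplace {
    // DOTALL lets "." match line terminators as well.
    private static final Pattern WHOLE =
            Pattern.compile(".*Xo+X(.*)Xc+X.*", Pattern.DOTALL);

    // Hypothetical helper: replaces the whole input with the content of group 1.
    static String extract(String input) {
        return WHOLE.matcher(input).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(extract("line one\nXoooXNOUNXccccccX\nline three")); // NOUN
    }
}
```

If the input does not contain both delimiters, nothing matches and `replaceAll` returns the input unchanged, so a caller that needs to detect failure would have to check for that separately.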
In case you want to use split, consider using Guava's Splitter. It behaves more sanely and also accepts a Pattern, which is good for speed.
If you really need the locations you can do it like this:
String line = "unknownXoooXNOUNXccccccXunknown";
String regexp1 = "Xo+X";
String regexp2 = "Xc+X";
Matcher m=Pattern.compile(regexp1).matcher(line);
if(m.find())
{
int start=m.end();
if(m.usePattern(Pattern.compile(regexp2)).find())
{
final int end = m.start();
System.out.println("from "+start+" to "+end+" is "+line.substring(start, end));
}
}
But if you just need the word in between, I recommend the way Ian McLaird has shown.

Java String matches and replaceAll differ in matching parentheses

I have strings with parentheses and also escaped characters. I need to match against these characters and also delete them. In the following code, I use matches() and replaceAll() with the same regex, but the matches() returns false, while the replaceAll() seems to match just fine, because the replaceAll() executes and removes the characters. Can someone explain?
String input = "(aaaa)\\b";
boolean matchResult = input.matches("\\(|\\)|\\\\[a-z]+");
System.out.printf("matchResult=%s\n", matchResult);
String output = input.replaceAll("\\(|\\)|\\\\[a-z]+", "");
System.out.printf("INPUT: %s --> OUTPUT: %s\n", input, output);
Prints out:
matchResult=false
INPUT: (aaaa)\b --> OUTPUT: aaaa
matches() must match the whole input, not just part of it.
The regular expression \(|\)|\\[a-z]+ doesn't describe the whole string, only parts of it, so in your case it fails.
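The difference can be checked directly with the question's input. In this sketch the class name `MatchesVsFind` and the helpers `wholeMatch` and `partialMatch` are hypothetical names.

```java
import java.util.regex.Pattern;

class MatchesVsFind {
    private static final Pattern P = Pattern.compile("\\(|\\)|\\\\[a-z]+");

    // True only if the entire input fits one alternative of the pattern.
    static boolean wholeMatch(String s) { return P.matcher(s).matches(); }

    // True if any substring of the input fits the pattern.
    static boolean partialMatch(String s) { return P.matcher(s).find(); }

    public static void main(String[] args) {
        String input = "(aaaa)\\b"; // the characters ( a a a a ) \ b
        System.out.println(wholeMatch(input));   // false
        System.out.println(partialMatch(input)); // true, the first hit is "("
        System.out.println(wholeMatch("\\b"));   // true, one alternative covers it all
    }
}
```

`replaceAll` searches the way `find()` does, which is why it removes the characters even though `matches()` reports false.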
What matches() is doing has already been explained by Binyamin Sharet. I want to extend this a bit.
Java does not have a "findall" method or a "g" modifier, as other languages do, to get all matches at once.
The Java Matcher class offers two main methods for applying a pattern to a string (without replacing anything):
matches(): matches the whole string against the pattern
find(): returns the next match
If you want to get everything that fits your pattern, you need to call find() in a loop, something like this:
Pattern p = Pattern
.compile("\\(|\\)|\\\\[a-z]+");
Matcher m = p.matcher(text);
while(m.find()){
System.out.println(m.group(0));
}
or, if you are only interested in whether your pattern exists in the string:
if (m.find()) {
System.out.println(m.group());
} else {
System.out.println("not found");
}
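On Java 9 or later, the explicit loop can also be written with `Matcher.results()`, which streams one MatchResult per successful find. The sketch below assumes such a Java version; the class name `AllMatches` and the helper `allMatches` are made up for the example.

```java
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class AllMatches {
    private static final Pattern P = Pattern.compile("\\(|\\)|\\\\[a-z]+");

    // Assumes Java 9+, where Matcher.results() is available.
    static List<String> allMatches(String text) {
        return P.matcher(text)
                .results()               // Stream<MatchResult>, one element per match
                .map(MatchResult::group) // the matched text of each result
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(allMatches("(aaaa)\\b")); // [(, ), \b]
    }
}
```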
