Avoid overlapping regex matching in Java - java

For some reason this piece of Java code is giving me overlapping matches:
Pattern pat = Pattern.compile("(" + leftContext + ")" + ".*" + "(" + rightContext + ")", Pattern.DOTALL);
Is there any way or option to make it avoid detecting overlaps? E.g.
leftContext rightContext rightContext
should be 1 match instead of 2.
Here's the complete code:
public static String replaceWithContext(String input, String leftContext, String rightContext, String newString){
Pattern pat = Pattern.compile("(" + leftContext + ")" + ".*" + "(" + rightContext + ")", Pattern.DOTALL);
Matcher matcher = pat.matcher(input);
StringBuffer buffer = new StringBuffer();
while (matcher.find()) {
matcher.appendReplacement(buffer, "");
buffer.append(matcher.group(1) + newString + matcher.group(2));
}
matcher.appendTail(buffer);
return buffer.toString();
}
So here's the final answer using a negative lookahead, my bad for not realizing * was greedy:
Pattern pat = Pattern.compile("(" +
leftContext + ")" + "(?:(?!" +
rightContext + ").)*" + "(" +
rightContext + ")", Pattern.DOTALL);

Your use of the word "overlapping" is confusing. Apparently, what you meant was that the regex is too greedy, matching everything from the first leftContext to the last rightContext. It seems you figured that out already--and came up with a better approach as well--but there's still at least one potential problem.
You said leftContext and rightContext are "plain Strings", by which I assume you meant they aren't supposed to be interpreted as regexes, but they will be. You need to escape them, or any regex metacharacters they contain will cause incorrect results or run-time exceptions. The same goes for your replacement string, although only $ and the backslash have special meanings there. Here's an example (notice the non-greedy .*?, too):
public static String replaceWithContext(String input, String leftContext, String rightContext, String newString){
String lcRegex = Pattern.quote(leftContext);
String rcRegex = Pattern.quote(rightContext);
String replace = Matcher.quoteReplacement(newString);
Pattern pat = Pattern.compile("(" + lcRegex + ").*?(" + rcRegex + ")", Pattern.DOTALL);
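The snippet above stops at the Pattern declaration. As a rough sketch of how the whole method could look once both contexts and the replacement are quoted, here is a self-contained version. The class name ContextReplacer and the sample input in main are made up for illustration; the matcher loop mirrors the one in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContextReplacer {
    // Replaces whatever sits between leftContext and the nearest rightContext.
    static String replaceWithContext(String input, String leftContext,
                                     String rightContext, String newString) {
        // Quote the contexts so regex metacharacters in them are taken literally.
        String lcRegex = Pattern.quote(leftContext);
        String rcRegex = Pattern.quote(rightContext);
        // Quote the replacement so $ and \ in it are taken literally as well.
        String replace = Matcher.quoteReplacement(newString);
        Pattern pat = Pattern.compile("(" + lcRegex + ").*?(" + rcRegex + ")", Pattern.DOTALL);
        Matcher matcher = pat.matcher(input);
        StringBuffer buffer = new StringBuffer();
        while (matcher.find()) {
            // $1 and $2 put the matched contexts back around the new text.
            matcher.appendReplacement(buffer, "$1" + replace + "$2");
        }
        matcher.appendTail(buffer);
        return buffer.toString();
    }

    public static void main(String[] args) {
        // "[" and "]" would break an unquoted pattern; "$" would break an unquoted replacement.
        System.out.println(replaceWithContext("a [x] b [y] c", "[", "]", "$"));
    }
}
```

With the hypothetical input above, each bracketed span is replaced separately, because the lazy .*? stops at the first closing bracket.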
One other thing: if you aren't doing any post-match processing on the matched text, you can use replaceAll instead of rolling your own with appendReplacement and appendTail:
return input.replaceAll("(?s)(" + lcRegex + ")" +
"(?:(?!" + rcRegex + ").)*" +
"(" + rcRegex + ")",
"$1" + replace + "$2");

There are a few possibilities, depending on what you really need.
You can append $ at the end of your regex, like this:
"(" + leftContext + ")" + ".*" + "(" + rightContext + ")$"
so if rightContext isn't the last thing, your regex won't match.
Next, you can capture everything after rightContext:
"(" + leftContext + ")" + ".*" + "(" + rightContext + ")(.*)"
and after that discard everything in your third matching group.
But, since we don't know what leftContext and rightContext really are, maybe your problem lies within them.

Related

How to extract certain substrings from a string in Java?

I am trying to extract some information from a parse exception message which looks like the following:
"Encountered " <FUNCNAME> "FF "" at line 1, column 22.
Was expecting:
"DEF" ..."
From this message I would like to get the token encountered (in the case above, "FUNCNAME") and the expected token (in this case, "DEF").
String[] REGEX = { "Encountered \" <(.*)> ", "Encountered (.*)." };
Pattern pattern = Pattern.compile(REGEX[0]);
Matcher matcher = pattern.matcher(message);
System.out.println("Matched: " + matcher.group(1));
I used the pattern above to get the encountered token (which works fine), but I am struggling to get the expected one because of the line breaks.
You need to slightly rework your regex pattern, and also use the dot all (?s) modifier when declaring the regex, so that .* can match across lines.
String message = "\"Encountered \" <FUNCNAME> \"FF \" at line 1, column 22.\nWas expecting:\n\"DEF\" ...";
String regex = "(?s)\"Encountered \" <(.*?)>.*?Was expecting:\\s+\"(.*?)\"";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(message);
if (m.find()) {
System.out.println("Matched: " + m.group(1) + ", " + m.group(2));
}
This prints:
Matched: FUNCNAME, DEF

Look-behind issue with java regex. Look-behind group does not have an obvious maximum length

I need to match sequence='999' inside a <noteinfo> tag in an XML document using Java regex (an XML parser is not an option).
Snippet of the xml:
<xmltag sequence='11'>
<noteinfo noteid='1fe' unid='25436AF06906885A8525840B00805DBC' sequence='3'/>
</xmltag>
I am using this: (?<=<noteinfo.*)sequence='[0-9999]'(?=/>)
I am expecting a match on this: sequence='3'
Getting error: java.util.regex.PatternSyntaxException: Look-behind group does not have an obvious maximum length
I understand the issue is with the .* in the look-behind part. Any alternatives to avoid the error?
Never use a lookbehind unless it is absolutely necessary.
You can bound the length of a lookbehind with curly braces, e.g. {1,255}.
Your problem is solvable without the use of a lookbehind:
static final Pattern seqpat = Pattern.compile( "<noteinfo[^>]+(?<seq>sequence\\s*=\\s*'[\\d]*')", Pattern.MULTILINE );
read through the file with:
Matcher m = seqpat.matcher( s );
while( m.find() )
System.err.println( m.group( "seq" ) );
Pattern.MULTILINE is harmless but not strictly needed here: it only changes how ^ and $ behave, and [^>]+ already matches line breaks, so a wrapped noteinfo line is found either way
seqpat finds (not matches!) text that starts with <noteinfo and runs up to the sequence attribute inside the same tag
the requested sequence is captured in group( "seq" )
perhaps you have to deal with spaces or newlines between sequence, = and the sequence-id '3'; therefore: \\s*=\\s*
the above Pattern finds each sequence-id (even an empty one)
to find only the '999' sequence-id, take this Pattern:
Pattern.compile( "<noteinfo[^>]+(?<seq>sequence\\s*=\\s*'999')", Pattern.MULTILINE );
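For completeness, the bounded-lookbehind variant mentioned at the top of this answer could look roughly like the sketch below. The upper bound of 255, the class name BoundedLookbehind and the helper findSequences are assumptions for illustration; the bound has to be at least as long as the text between <noteinfo and the attribute.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundedLookbehind {
    // {1,255} gives the lookbehind an explicit maximum length (255 is an assumed limit),
    // which avoids the "no obvious maximum length" error caused by .* in the original.
    static final Pattern SEQ = Pattern.compile(
            "(?<=<noteinfo[^>]{1,255})sequence='\\d+'(?=/>)");

    // Collects every sequence attribute that sits inside a self-closing noteinfo tag.
    static List<String> findSequences(String xml) {
        List<String> found = new ArrayList<>();
        Matcher m = SEQ.matcher(xml);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String xml = "<xmltag sequence='11'>\n"
                + "<noteinfo noteid='1fe' unid='25436AF06906885A8525840B00805DBC' sequence='3'/>\n"
                + "</xmltag>";
        System.out.println(findSequences(xml)); // [sequence='3']
    }
}
```

The sequence='11' on xmltag is skipped because it is neither preceded by <noteinfo nor followed by />.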
My guess is that you might want to design an expression similar to:
(?=<noteinfo).*(sequence='[0-9]'|sequence='[1-9][0-9]{0,3}')
Test
import java.util.regex.Matcher;
import java.util.regex.Pattern;
final String regex = "(?=<noteinfo).*(sequence='[0-9]'|sequence='[1-9][0-9]{0,3}')";
final String string = "<xmltag sequence='11'>\n"
+ " <noteinfo noteid='1fe' unid='25436AF06906885A8525840B00805DBC' sequence='3'/>\n"
+ "</xmltag>\n"
+ "<xmltag sequence='11'>\n"
+ " <noteinfo noteid='1fe' unid='25436AF06906885A8525840B00805DBC' sequence='9999'/>\n"
+ "</xmltag>\n"
+ "<xmltag sequence='11'>\n"
+ " <noteinfo noteid='1fe' unid='25436AF06906885A8525840B00805DBC' sequence='10000'/>\n"
+ "</xmltag>\n"
+ "<xmltag sequence='11'>\n"
+ " <noteinfo noteid='1fe' unid='25436AF06906885A8525840B00805DBC' sequence='-1'/>\n"
+ "</xmltag>";
final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
final Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println("Full match: " + matcher.group(0));
for (int i = 1; i <= matcher.groupCount(); i++) {
System.out.println("Group " + i + ": " + matcher.group(i));
}
}

Why does the Java regular expression "|" find a matching substring for any input string?

I am trying to understand why a regular expression ending with "|" (or simply "|" itself) will find a matching substring with start index 0 and end "offset after the last character matched (as per JavaDoc for Matcher)" 0.
The following code demonstrates this:
public static void main(String[] args) {
String regExp = "|";
String toMatch = "A";
Matcher m = Pattern.compile(regExp).matcher(toMatch);
System.out.println("ReqExp: " + regExp +
" found " + toMatch + "(" + m.find() + ") " +
" start: " + m.start() +
" end: " + m.end());
}
Output is:
ReqExp: | found A(true) start: 0 end: 0
I'm confused by the fact that it is even a valid regular expression. And further confused by the fact that start and end are both 0.
Hoping someone can explain this to me.
The pipe in a regular expression means "or." So your regular expression is basically "(empty string) or (empty string)". It successfully finds an empty string at the beginning of the string, and an empty string has a length of 0.
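A small sketch makes the zero-length matches visible by listing every span that find() reports. The class name EmptyAlternative and the helper spans are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyAlternative {
    // Returns every match as "start-end" so that empty matches show up explicitly.
    static List<String> spans(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.start() + "-" + m.end());
        }
        return out;
    }

    public static void main(String[] args) {
        // "|" is "empty or empty": it matches the empty string before and after the A.
        System.out.println(spans("|", "A"));   // [0-0, 1-1]
        // matches() must consume the whole input, so the empty pattern fails on "A".
        System.out.println("A".matches("|"));  // false
        System.out.println("".matches("|"));   // true
    }
}
```

So find() succeeds on any input because an empty string can be found at every position, while matches() only succeeds when the whole input is empty.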

Digits are getting deleted when splitting a string

I have a string from which I need to remove all mentioned punctuations and spaces. My code looks as follows:
String s = "s[film] fever(normal) curse;";
String[] spart = s.split("[,/?:;\\[\\]\"{}()\\-_+*=|<>!`~##$%^&\\s+]");
System.out.println("spart[0]: " + spart[0]);
System.out.println("spart[1]: " + spart[1]);
System.out.println("spart[2]: " + spart[2]);
System.out.println("spart[3]: " + spart[3]);
But, I am getting some elements which are blank. The output is:
spart[0]: s
spart[1]: film
spart[2]:
spart[3]: normal
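The hyphen (-) is a special character inside regex character classes. For instance, [a-z] matches all chars from a to z inclusive. Note the )-_ sequence in your regex: left unescaped, it is a range from ) to _, and that range includes the digits 0 to 9.
Because the hyphen defines a range inside a character class in the regular expressions that String.split accepts, it needs to be escaped: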
String[] part = line.toLowerCase().split("[,/?:;\"{}()\\-_+*=|<>!`~##$%^&]");
String[] spart = s.split("[,/?:;\\[\\]\"{}()\\-_+*=|<>!`~##$%^&\\s]+");
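The blank elements come from splitting on each delimiter separately; moving the + outside the character class treats a run of delimiters as one separator. The sketch below compares the two using a reduced delimiter set, which is an assumption made to keep the example short; the class and method names are also made up.

```java
import java.util.Arrays;

class SplitRuns {
    // Splits on every single delimiter, so adjacent delimiters yield empty tokens.
    static String[] splitSingle(String s) {
        return s.split("[\\[\\]();\\s]");
    }

    // Splits on runs of delimiters, so adjacent delimiters count as one separator.
    static String[] splitRuns(String s) {
        return s.split("[\\[\\]();\\s]+");
    }

    public static void main(String[] args) {
        String s = "s[film] fever(normal) curse;";
        System.out.println(Arrays.toString(splitSingle(s))); // [s, film, , fever, normal, , curse]
        System.out.println(Arrays.toString(splitRuns(s)));   // [s, film, fever, normal, curse]
    }
}
```

Note that a + placed inside the brackets is just a literal plus sign, which is why the quantifier has to follow the closing bracket.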

Java regex matches nothing

String string = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
"<Request>\n" +
" <Item>\n" +
" <Type>C0401</Type>\n" +
" <InvDate>20150301</InvDate>\n" +
" <No>PK1000000</No>\n" +
" </Item>\n" +
" <Item>\n" +
" <Type>C0401</Type>\n" +
" <InvDate>20150301</InvDate>\n" +
" <No>PK1000002</No>\n" +
" </Item>\n" +
"</Request>";
Pattern pattern = Pattern.compile("(<Item>)(.*)(</Item>)");
Matcher matcher = pattern.matcher(string);
List<String> listMatches = new ArrayList<String>();
while(matcher.find())
{
listMatches.add(matcher.group(2));
}
If I replace Item with Type or InvDate or No, I can get the content.
Looking for answer. Thanks
You have to use the option Pattern.DOTALL for multiline matches:
Pattern pattern = Pattern.compile("(<Item>)(.*?)(</Item>)",Pattern.DOTALL);
Quantifier: *? Between zero and unlimited times, as few times as possible, expanding as needed [lazy]
You have to use the option Pattern.DOTALL for multiline matches:
Pattern pattern = Pattern.compile("(<Item>)(.*)(</Item>)",Pattern.DOTALL);
But it is better to use an XML parser.
You need to use the DOTALL flag to make the dot match any character, including newlines:
Pattern pattern = Pattern.compile("(?s)(<Item>)(.*)(</Item>)");
Or else:
Pattern pattern = Pattern.compile("(<Item>)(.*)(</Item>)", Pattern.DOTALL);
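The answers above differ in whether the quantifier is greedy or lazy, and that changes the result once DOTALL is on. The following sketch compares both on a shortened version of the input; the class name ItemExtractor, the helper items and the trimmed sample XML are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ItemExtractor {
    // Collects group 1 of every match, with DOTALL so the dot also matches newlines.
    static List<String> items(String xml, String regex) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(xml);
        while (m.find()) {
            out.add(m.group(1).trim());
        }
        return out;
    }

    public static void main(String[] args) {
        String xml = "<Request>\n"
                + "<Item>\n<No>PK1000000</No>\n</Item>\n"
                + "<Item>\n<No>PK1000002</No>\n</Item>\n"
                + "</Request>";
        // Lazy: stops at the first </Item>, so each Item is a separate match.
        System.out.println(items(xml, "<Item>(.*?)</Item>").size()); // 2
        // Greedy: runs to the last </Item>, so both Items end up in one match.
        System.out.println(items(xml, "<Item>(.*)</Item>").size());  // 1
    }
}
```

If each Item should end up as its own list entry, the lazy form is the one to use.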
