Finding and retrieving consecutive matches - java

Say I want to match a string that must consist solely of parts adhering to a specific (regex) pattern, and retrieve those parts in a loop. Matcher.find() seems to have been invented for this. However, find() will match anywhere in the remaining input, not only directly after the previous match, so intermediate characters are silently skipped.
So, for instance, I want to match \\p{XDigit}{2} (two hexadecimal digits) in such a way that:
aabb matches;
_aabb doesn't match;
aa_bb doesn't match;
aabb_ doesn't match.
by using find (or any other iterated call to the regex) so I can directly process each byte in the array. So I want to process aa and bb separately, after matching.
OK, that's it, the most elegant way of doing this wins the accept.
Notes:
the hexadecimal parsing is just an example of a simple repeating pattern;
preferably I would like to keep the regex to the minimal required to match the element;
yes, I know about using (\\p{XDigit}{2})*, but I don't want to scan the string twice (as it should be usable on huge input strings).

It appears you want to get all matches that appear either at the start of the string or right after a previous successful match. You can combine the \G operator with a lookahead that ensures the whole string consists only of the repeated pattern.
Use
(?:\G(?!^)|^(?=(?:\p{XDigit}{2})*$))\p{XDigit}{2}
Details
(?: - start of a non-capturing group with 2 alternatives:
\G(?!^) - the end of the previous successful match
| - or
^(?=(?:\p{XDigit}{2})*$) - start of a string (^) that is followed with 0+ occurrences of \p{XDigit}{2} pattern up to the end of the string ($)
) - end of the non-capturing group
\p{XDigit}{2} - 2 hex chars.
Java demo:
String regex = "(?:\\G(?!^)|^(?=(?:[0-9a-fA-F]{2})*$))[0-9a-fA-F]{2}";
String[] strings = {"aabb","_aabb","aa_bb", "aabb_"};
Pattern pattern = Pattern.compile(regex);
for (String s : strings) {
System.out.println("Checking " + s);
Matcher matcher = pattern.matcher(s);
List<String> res = new ArrayList<>();
while (matcher.find()) {
res.add(matcher.group(0));
}
if (res.size() > 0) {
System.out.println(res);
} else {
System.out.println("No match!");
}
}
Output:
Checking aabb
[aa, bb]
Checking _aabb
No match!
Checking aa_bb
No match!
Checking aabb_
No match!

OK, I may finally have had a brainwave: the idea is to move the find() call out of the condition of the while loop. Instead I simply keep a variable holding the current location and only stop parsing when the whole string has been processed. The location can also be used to produce a more informative error message.
The location starts at zero and is updated to the end of the match. Each time a new match is found the start of the match is compared with the location, i.e. end of the last match. An error occurs if:
the pattern is not found;
the pattern is found, but not at the end of the last match.
Code:
private static byte[] parseHex(String hex) {
    byte[] bytes = new byte[hex.length() / 2];
    int off = 0;
    // the pattern is normally a constant
    Pattern hexByte = Pattern.compile("\\p{XDigit}{2}");
    Matcher hexByteMatcher = hexByte.matcher(hex);
    int loc = 0;
    // so here we would normally do the while (hexByteMatcher.find()) ...
    while (loc < hex.length()) {
        // optimization in case we have a maximum size of the pattern;
        // the end is clamped because region() throws IndexOutOfBoundsException
        // when it points past the end of the input (odd-length strings)
        hexByteMatcher.region(loc, Math.min(loc + 2, hex.length()));
        // instead we try and find the pattern, and produce an error if not found at the right location
        if (!hexByteMatcher.find() || hexByteMatcher.start() != loc) {
            // only a single throw, message includes location
            throw new IllegalArgumentException("Hex string invalid at offset " + loc);
        }
        // the processing of the pattern, in this case a double hex digit representing a byte value
        bytes[off++] = (byte) Integer.parseInt(hexByteMatcher.group(), 16);
        // set the next location to the end of the match
        loc = hexByteMatcher.end();
    }
    return bytes;
}
The method can be improved by adding \\G (end of last match) to the regex: \\G\\p{XDigit}{2}. This way the regular expression fails immediately if the pattern cannot be found starting at the end of the last match (or at the start of the string).
For regular expressions with an expected maximum size (2 in this case) it is of course also possible to adjust the end of the region that needs to be matched.
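The \\G variant mentioned above might look roughly like the following sketch. It is an illustration under assumptions rather than a definitive implementation: the class name HexParser and the constant HEX_BYTE are made up for the example. Because \\G only lets a match start where the previous one ended, find() can go back into the loop condition, and a single comparison of loc with the string length afterwards catches both stray characters and a dangling odd digit.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HexParser {
    // \G anchors every match to the end of the previous one (or the start of the input)
    private static final Pattern HEX_BYTE = Pattern.compile("\\G\\p{XDigit}{2}");

    static byte[] parseHex(String hex) {
        byte[] bytes = new byte[hex.length() / 2];
        int off = 0;
        int loc = 0; // end of the last successful match
        Matcher m = HEX_BYTE.matcher(hex);
        while (m.find()) {
            bytes[off++] = (byte) Integer.parseInt(m.group(), 16);
            loc = m.end();
        }
        // find() stops at the first position where the pattern no longer fits
        if (loc != hex.length()) {
            throw new IllegalArgumentException("Hex string invalid at offset " + loc);
        }
        return bytes;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(parseHex("aabb")));
    }
}
```

With this shape the regex stays minimal and the input is scanned only once; the price is that validity is only known after the loop, so any per-byte processing has to tolerate a late failure.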

Related

Java, getting portion of pattern partially matched by input

As the title says, I'd like to get the portion of the pattern that is partially matched by the input. Example:
Pattern: aabb
Input string: "aa"
At this point I'll use the hitEnd() method of the Matcher class to find out whether the pattern is being matched partially, but I'd also like to find out that specifically the "aa" of "aabb" is matched.
Is there any way to do this in java?
This may be dirty, but here we go...
Once you know that hitEnd() returned true for some string, do a second pass:
1. Remove the last character from the string.
2. Search with the original regex.
3. If it matches, you are done and you have the matching part of the string.
4. If not, go to 1 and repeat the whole process until you get a match.
If the test strings can be long, performance may be a problem. So instead of moving the end position one character at a time from last to first, try searching in blocks (a binary search on the end position).
For example, considering a string of 1,000 chars:
Test the first 1000/2 characters (positions 1-500). For this example, we assume it matches.
Test the first 500 + 500/2 characters (positions 1-750). For this example, we assume it does not match. So we know that the end position must lie between 500 and 750.
Now test positions 1-625 ((750+500)/2). If it matches, the end position must lie between 625 and 750. If it does not match, it must lie between 500 and 625.
...
There is no such function in Matcher class. However you could achieve it for example in this way:
public String getPartialMatching(String pattern, String input) {
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(input);
int end = 0;
while(m.find()){
end = m.end();
}
if (m.hitEnd()) {
return input.substring(end);
} else {
return null;
}
}
First, iterate over all fully matched parts of the string and skip them. For example, for input = "aabbaa", m.hitEnd() would return false if aabb were not skipped first.
Second, check whether the remaining part of the string partially matches.
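As a quick check of the two steps above, here is a small runnable sketch. The class name PartialMatchDemo is hypothetical, and the method is the same logic as getPartialMatching, only made static so it can be called from main. Keep in mind that hitEnd() only reports that the engine ran into the end of the input during the last match attempt, so it is worth trying the method against your own patterns before relying on it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PartialMatchDemo {
    // Same logic as getPartialMatching above, static for the demo
    static String getPartialMatching(String pattern, String input) {
        Matcher m = Pattern.compile(pattern).matcher(input);
        int end = 0;
        // skip every complete match and remember where the last one ended
        while (m.find()) {
            end = m.end();
        }
        // the failed find() ran into the end of input: treat the tail as a partial match
        return m.hitEnd() ? input.substring(end) : null;
    }

    public static void main(String[] args) {
        System.out.println(getPartialMatching("aabb", "aa"));     // prints aa
        System.out.println(getPartialMatching("aabb", "aabbaa")); // prints aa
    }
}
```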

Java Matcher Pattern issue

I am trying to extract everything that is after this string path /share/attachments/docs/. All my strings are starting with /share/attachments/docs/
For example: /share/attachments/docs/image2.png
Number of characters after ../docs/ is not static!
I tried with
Pattern p = Pattern.compile("^(.*)/share/attachments/docs/(\\d+)$");
Matcher m = p.matcher("/share/attachments/docs/image2.png");
m.find();
String link = m.group(2);
System.out.println("Link #: "+link);
But I am getting Exception that: No match found.
Strange because if I use this:
Pattern p = Pattern.compile("^(.*)ABC Results for draw no (\\d+)$");
Matcher m = p.matcher("ABC Results for draw no 2888");
then it works!!!
Also one thing is that in some very rare cases my string does not start with /share/attachments/docs/ and then I should not parse anything but that is not related directly to the issue, but it will be good to handle.
I am getting Exception that: No match found.
This is because image2.png does not match \d+. Use a more appropriate pattern such as .+, assuming that you want to extract image2.png.
Your regular expression will then be ^(.*)/share/attachments/docs/(.+)$
In the case of ABC Results for draw no 2888, the regexp ^(.*)ABC Results for draw no (\\d+)$ works because the string ends with several successive digits, while in the first case you had image2.png, which is a mix of letters, digits and a dot; that is why no match was found.
Generally speaking, to avoid getting an IllegalStateException: No match found, you first need to check the result of find(). If it returns true, the input String matches:
if (m.find()) {
// The String matches with the pattern
String link = m.group(2);
System.out.println("Draw #: "+link);
} else {
System.out.println("Input value doesn't match with the pattern");
}
The regular expression \d+ (expressed as \\d+ inside a string literal) matches a run of one or more digits. Your example input does not have a corresponding digit run, so it is not matched. The regex metacharacter . matches any character (+/- newline, depending on regex options); it seems like that may be what you're really after.
Additionally, when you use Matcher.find() it is unnecessary for the pattern to match the whole string, so it is needless to include .* to match leading context. Furthermore, find() returns a value that tells you whether a match to the pattern was found. You generally want to use this return value, and in your particular case you can use it to reject those rare non-matching strings.
Maybe this is more what you want:
Pattern p = Pattern.compile("/share/attachments/docs/(.+)$");
Matcher m = p.matcher("/share/attachments/docs/image2.png");
String link;
if (m.find()) {
link = m.group(1);
System.out.println("Draw #: " + link);
} else {
link = null;
System.out.println("Draw #: (not found)");
}

Replacing Strings with a number in it without a for loop

So I currently have this code;
for (int i = 1; i <= this.max; i++) {
in = in.replace("{place" + i + "}", this.getUser(i)); // Get the place of a user.
}
Which works well, but I would like to just keep it simple (using Pattern matching)
so I used this code to check if it matches;
System.out.println(StringUtil.matches("{place5}", "\\{place\\d\\}"));
StringUtil's matches;
public static boolean matches(String string, String regex) {
if (string == null || regex == null) return false;
Pattern compiledPattern = Pattern.compile(regex);
return compiledPattern.matcher(string).matches();
}
This returns true. Then comes the next part I need help with: replacing the {place5} so I can parse the number. I could strip "{place" and "}", but if there are multiple placeholders in a string ("{place5} {username}") I can't do that anymore, as far as I'm aware. If you know a simple way to do this, please let me know; if not, I can just stick with the for-loop.
then comes the next part I need help with, replacing the {place5} so I can parse the number
In order to obtain the number after {place, you can use
s = s.replaceAll(".*\\{place(\\d+)}.*", "$1");
The regex matches an arbitrary number of characters before the string we are searching for, then {place, then matches and captures 1 or more digits with (\d+), then }, and finally matches the rest of the string with .*. Note that if the string contains newline characters, you should prepend (?s) to the pattern. $1 in the replacement pattern "restores" the value we need.
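For the original goal of replacing every {placeN} in one pass without the for-loop, one possible approach is Matcher.appendReplacement, sketched below. This is an assumption-laden illustration, not the asker's code: the class name PlaceholderDemo, the fill method and the list-based getUser are hypothetical stand-ins for this.getUser(i), and the digit count is capped at 9 so that Integer.parseInt cannot overflow.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PlaceholderDemo {
    // captures the number inside {placeN}; at most 9 digits so parseInt cannot overflow
    private static final Pattern PLACE = Pattern.compile("\\{place(\\d{1,9})\\}");

    // hypothetical stand-in for this.getUser(i): places are 1-based
    static String getUser(List<String> users, int place) {
        return place >= 1 && place <= users.size() ? users.get(place - 1) : "?";
    }

    static String fill(String in, List<String> users) {
        Matcher m = PLACE.matcher(in);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            int place = Integer.parseInt(m.group(1));
            // quoteReplacement keeps '$' and '\' in user names literal
            m.appendReplacement(sb, Matcher.quoteReplacement(getUser(users, place)));
        }
        m.appendTail(sb); // copy whatever follows the last placeholder
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> users = Arrays.asList("alice", "bob");
        System.out.println(fill("{place2} beat {place1}, {username}", users));
        // prints: bob beat alice, {username}
    }
}
```

Other placeholders such as {username} are left untouched because they do not match the pattern.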

Excluding markup on lowercased parentheses letters

A string can contain one or more lowercase letters in parentheses, like String content = "This is (a) nightmare"; I want to transform the string to "<centamp>This is </centamp>(a) <centamp>nightmare</centamp>"; So basically I want to add centamp markup around the string, but any lowercase letter in parentheses should be excluded from the markup.
This is what I have tried so far, but it doesn't achieve the desired result. There could be zero or more such parentheses in a string, and every one of them should be excluded from the markup.
Pattern pattern = Pattern.compile("^(.*)?(\\([a-z]*\\))?(.*)?$", Pattern.MULTILINE);
String content = "This is (a) nightmare";
System.out.println(content.matches("^(.*)?(\\([a-z]*\\))?(.*)?$"));
System.out.println(pattern.matcher(content).replaceAll("<centamp>$1$3</centamp>$2"));
This can be done in one replaceAll:
String outputString =
inputString.replaceAll("(?s)\\G((?:\\([a-z]+\\))*+)((?:(?!\\([a-z]+\\)).)+)",
"$1<centamp>$2</centamp>");
It allows a non-empty sequence of lower case English alphabet character inside bracket \\([a-z]+\\).
Features:
Whitespace only sequences are tagged.
There will be no tag surrounding empty string.
Explanation:
\G asserts the match boundary, i.e. the next match can only start from the end of last match. It can also match the beginning of the string (when we have yet to find any match).
Each match of the regex will contain a sequence of: 0 or more consecutive \\([a-z]+\\) (no space between allowed), and followed by at least 1 character that does not form \\([a-z]+\\) sequence.
0 or more consecutive \\([a-z]+\\) to cover the case where the string does not start with \\([a-z]+\\), and the case where the string does not contain \\([a-z]+\\).
In the pattern for this portion (?:\\([a-z]+\\))*+ - note that the + after * makes the quantifier possessive, in other words, it disallows backtracking. Simply put, an optimization.
One character restriction is necessary to prevent adding tag that encloses empty string.
In the pattern for this portion (?:(?!\\([a-z]+\\)).)+ - note that for every character, I check whether it is part of the pattern \\([a-z]+\\) before matching it (?!\\([a-z]+\\))..
(?s) flag will cause . to match any character including new line. This will allow a tag to enclose text that spans multiple lines.
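To see what the explanation above means in practice, the regex can be run on the example from the question. The wrapper below is only a sketch; the class name CentampDemo and the method name tag are invented for the illustration. Note that the whitespace next to a parenthesized letter ends up inside the tags, which differs slightly from the spacing in the question's desired output.

```java
final class CentampDemo {
    // the \G-based replaceAll from the answer, wrapped in a method
    static String tag(String input) {
        return input.replaceAll(
                "(?s)\\G((?:\\([a-z]+\\))*+)((?:(?!\\([a-z]+\\)).)+)",
                "$1<centamp>$2</centamp>");
    }

    public static void main(String[] args) {
        System.out.println(tag("This is (a) nightmare"));
        // prints: <centamp>This is </centamp>(a)<centamp> nightmare</centamp>
    }
}
```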
You just replace all occurrences of "([a-z])" with </centamp>$1<centamp> and then prepend <centamp> and append </centamp>:
String content = "Test (a) test (b) (c)";
Pattern pattern = Pattern.compile("(\\([a-z]\\))");
Matcher matcher = pattern.matcher(content);
String result = "<centamp>" + matcher.replaceAll("</centamp>$1<centamp>") + "</centamp>";
Note: I wrote the above in the browser, so there may be syntax errors.
EDIT Here's a full example with the simplest RegEx possible.
import java.util.*;
import java.lang.*;
import java.util.regex.*;
class Main
{
public static void main (String[] args) throws java.lang.Exception
{
String content = "test (a) (b) and (c)";
String result = "<centamp>" +
content.replaceAll("(\\([a-z]\\))", "</centamp>$1<centamp>") +
"</centamp>";
result = result.replaceAll("<centamp></centamp>", "");
System.out.print(result);
}
}
This is another solution which uses cleaner regex. The solution is longer, but it allows more flexibility in adjusting the condition to add tag.
The idea here is to match the parenthesis containing lower case characters (the part we don't want to tag), then use the indices from the matches to identify the portion we want to enclose in tag.
// Regex for the parenthesis containing only lowercase English
// alphabet characters
static Pattern REGEX_IN_PARENTHESIS = Pattern.compile("\\([a-z]+\\)");
private static String addTag(String str) {
Matcher matcher = REGEX_IN_PARENTHESIS.matcher(str);
StringBuilder sb = new StringBuilder();
// Index that we have processed up to last append into StringBuilder
int lastAppend = 0;
while (matcher.find()) {
String bracket = matcher.group();
// The string from lastAppend to start of a match is the part
// we want to tag
// If you want to, you can easily add extra logic to process
// the string
if (lastAppend < matcher.start()) { // will not tag if empty string
sb.append("<centamp>")
.append(str, lastAppend, matcher.start())
.append("</centamp>");
}
// Append the parenthesis with lowercase English alphabet as it is
sb.append(bracket);
lastAppend = matcher.end();
}
// The string from lastAppend to end of string (no more match)
// is the part we want to tag
if (lastAppend < str.length()) {
sb.append("<centamp>")
.append(str, lastAppend, str.length())
.append("</centamp>");
}
return sb.toString();
}

Java regex and pattern matching: finding "blanks" in pattern which do not include them?

So, I need to write a compiler scanner for a homework assignment, and thought it'd be "elegant" to use regex. The fact is, I have seldom used regexes before, and it was a long time ago, so I forgot most of the details and needed to have a look around. I used them successfully for the identifiers (or at least I think so; I still need to do some further tests, but for now they all look OK), but I have a problem with number recognition.
The function nextCh() reads the next character on the input (lookahead char). What I'd like to do here is to check if this char matches the regex [0-9]*. I append every matching char in the str field of my current token, then I read the int value of this field. It recognizes a single number input such as "123", but the problem I have is that for the input "123 456", the final str will be "123 456" while I should get 2 separate tokens with fields "123" and "456". Why is the " " being matched?
private void readNumber(Token t) {
t.str = "" + ch; // force conversion char --> String
final Pattern pattern = Pattern.compile("[0-9]*");
nextCh(); // get next char and check if it is a digit
Matcher match = pattern.matcher("" + ch);
while (match.find() && ch != EOF) {
t.str += ch;
nextCh();
match = pattern.matcher("" + ch);
}
t.kind = Kind.number;
try {
int value = Integer.parseInt(t.str);
t.val = value;
} catch(NumberFormatException e) {
error(t, Message.BIG_NUM, t.str);
}
}
Thank you!
PS: I did solve my problem using the code below. Nevertheless, I'd like to understand where the flaw is in my regex expression.
t.str = "" + ch;
nextCh(); // get next char and check if it is a number
while (ch>='0' && ch<='9') {
t.str += ch;
nextCh();
}
t.kind = Kind.number;
try {
int value = Integer.parseInt(t.str);
t.val = value;
} catch(NumberFormatException e) {
error(t, Message.BIG_NUM, t.str);
}
EDIT: turns out my regex also doesn't work for the identifiers recognition (again, includes blanks), so I had to switch to a system similar to my "solution" (while with a lot of conditions). Guess I'll need to study the regex again :O
I'm not 100% sure whether this is relevant in your case, but this:
Pattern.compile("[0-9]*");
matches zero or more digits anywhere in the string, because of the asterisk. I think the space gets "matched" because find() succeeds with an empty match of zero digits. If you want to make sure the char is a digit, you have to match one or more, using the plus sign:
Pattern.compile("[0-9]+");
or, since you are only comparing a single char at a time, just match exactly one digit:
Pattern.compile("^[0-9]$");
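The difference between the two quantifiers can be checked with a few lines of code. This is just a small sketch with made-up names (EmptyMatchDemo, findsMatch): with [0-9]* the call to find() succeeds on a single space because the empty string is a valid match, whereas [0-9]+ requires at least one digit.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EmptyMatchDemo {
    // true if find() locates the regex somewhere in the input
    static boolean findsMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        Matcher m = Pattern.compile("[0-9]*").matcher(" ");
        System.out.println(m.find());              // prints true
        System.out.println(m.group().isEmpty());   // prints true: the match is empty
        System.out.println(findsMatch("[0-9]+", " "));  // prints false
        System.out.println(findsMatch("^[0-9]$", "7")); // prints true
    }
}
```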
You should be using the matches method rather than the find method. From the documentation:
The matches method attempts to match the entire input sequence against the pattern
The find method scans the input sequence looking for the next subsequence that matches the pattern.
So in other words, find succeeds as soon as any part of the string matches the pattern (and with [0-9]* even an empty match counts), whereas with matches the entire string must match the pattern.
For example, try this:
Pattern p = Pattern.compile("[0-9]*");
Matcher m123abc = p.matcher("123 abc");
System.out.println(m123abc.matches()); // prints false
System.out.println(m123abc.find()); // prints true
Use a simpler regex like
/\d+/
Where
\d means a digit
+ means one or more
In code:
final Pattern pattern = Pattern.compile("\\d+");
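As an illustration of how \\d+ separates the tokens from the question's input "123 456", here is a minimal sketch that runs the pattern over a whole string instead of one character at a time. The class and method names (NumberTokens, numbers) are hypothetical, and a real scanner reading through nextCh() would need to be adapted to this style.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NumberTokens {
    private static final Pattern NUMBER = Pattern.compile("\\d+");

    // collects every maximal run of digits as a separate token
    static List<String> numbers(String input) {
        Matcher m = NUMBER.matcher(input);
        List<String> tokens = new ArrayList<>();
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(numbers("123 456")); // prints [123, 456]
    }
}
```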
