Java regex: tell which column does not match - java

Good day,
My Java code is as follows:
Pattern p = Pattern.compile("^[a-zA-Z0-9$&+,:;=\\[\\]{}?##|\\\\'<>._^*()%!/~\"`  -]*$");
String i = "f698fec0-dd89-11e8-b06b-☺";
Matcher tagmatch = p.matcher(i);
System.out.println("tagmatch is " + tagmatch.find());
As expected, the result is false, because the string contains the ☺ character. However, I would like to show the column number that does not match. For this example, it should report column 25 as containing the invalid character.
May I know how can I do this?

You should remove the anchors from your regex and then use the Matcher#end() method to get the position where the match stopped, like this:
String i = "f698fec0-dd89-11e8-b06b-☺";
Pattern p = Pattern.compile("[\\w$&+,:;=\\[\\]{}?##|\\\\'<>.^*()%!/~\"` -]+");
Matcher m = p.matcher(i);
if (m.lookingAt() && i.length() > m.end()) {
    System.out.println("Match <" + m.group() + "> failed at: " + m.end());
}
Output:
Match <f698fec0-dd89-11e8-b06b-> failed at: 24
PS: I have used lookingAt() to ensure that we match the pattern starting from the beginning of the region. You can use find() as well to get the next match anywhere or else keep the start anchor in pattern as
"^[\\w$&+,:;=\\[\\]{}?##|\\\\'<>.^*()%!/~\"` -]+"
and use find() to effectively make it behave like the above code with lookingAt().
See also: the difference between lookingAt() and find().
I have refactored your regex to use \w instead of [a-zA-Z0-9_] and used quantifier + (meaning match 1 or more) instead of * (meaning match 0 or more) to avoid returning success for zero-length matches.
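The approach above can be wrapped in a small helper. The following is only a sketch under stated assumptions: the class name FirstInvalidColumn, the method firstInvalidIndex, and the shortened whitelist are hypothetical and not from the question. It uses * rather than + so that lookingAt() always succeeds, which also lets it report an invalid character at index 0. Note that Matcher#end() is a 0-based index, so index 24 corresponds to column 25.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FirstInvalidColumn {
    // Assumption: a shortened whitelist for illustration; substitute the full character class as needed.
    private static final Pattern ALLOWED_RUN = Pattern.compile("[\\w$&+,:;=.-]*");

    /** Returns the 0-based index of the first disallowed character, or -1 if every character is allowed. */
    static int firstInvalidIndex(String input) {
        Matcher m = ALLOWED_RUN.matcher(input);
        m.lookingAt(); // always succeeds, because * may match the empty string
        return m.end() == input.length() ? -1 : m.end();
    }

    public static void main(String[] args) {
        String tag = "f698fec0-dd89-11e8-b06b-\u263A"; // \u263A is the smiley from the question
        int index = firstInvalidIndex(tag);
        // Convert the 0-based index to a 1-based column for display.
        System.out.println(index < 0 ? "valid" : "invalid character at column " + (index + 1));
    }
}
```

For the string from the question this reports column 25.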

Related

Finding and retrieving consecutive matches

Say I want to match a string that should solely consist of parts adhering to a specific (regex) pattern and retrieve the elements in a loop. For this it seems that Matcher.find() was invented. However, find() will match anywhere in the remaining input, not just directly after the previous match, so intermediate characters are skipped.
So, for instance, I want to match \\p{XDigit}{2} (two hexadecimal digits) in such a way that:
aabb matches;
_aabb doesn't match;
aa_bb doesn't match;
aabb_ doesn't match.
by using find (or any other iterated call to the regex) so I can directly process each byte in the array. So I want to process aa and bb separately, after matching.
OK, that's it, the most elegant way of doing this wins the accept.
Notes:
the hexadecimal parsing is just an example of a simple repeating pattern;
preferably I would like to keep the regex to the minimal required to match the element;
yes, I know about using (\\p{XDigit}{2})*, but I don't want to scan string twice (as it should be usable on huge input strings).
It appears you want to get all (multiple) matches that appear at the start of the string or right after a successful match. You may combine \G operator with a lookahead that will assure the string only matches some repeated pattern.
Use
(?:\G(?!^)|^(?=(?:\p{XDigit}{2})*$))\p{XDigit}{2}
Details
(?: - start of a non-capturing group with 2 alternatives:
\G(?!^) - the end of the previous successful match
| - or
^(?=(?:\p{XDigit}{2})*$) - start of a string (^) that is followed with 0+ occurrences of \p{XDigit}{2} pattern up to the end of the string ($)
) - end of the non-capturing group
\p{XDigit}{2} - 2 hex chars.
Java demo:
String regex = "(?:\\G(?!^)|^(?=(?:[0-9a-fA-F]{2})*$))[0-9a-fA-F]{2}";
String[] strings = {"aabb","_aabb","aa_bb", "aabb_"};
Pattern pattern = Pattern.compile(regex);
for (String s : strings) {
    System.out.println("Checking " + s);
    Matcher matcher = pattern.matcher(s);
    List<String> res = new ArrayList<>();
    while (matcher.find()) {
        res.add(matcher.group(0));
    }
    if (res.size() > 0) {
        System.out.println(res);
    } else {
        System.out.println("No match!");
    }
}
Output:
Checking aabb
[aa, bb]
Checking _aabb
No match!
Checking aa_bb
No match!
Checking aabb_
No match!
OK, I may finally have had a brainstorm: the idea is to move the find() call out of the condition of the while loop. Instead I should simply keep a variable holding the location and only stop parsing when the whole string has been processed. The location can also be used to produce a more informative error message.
The location starts at zero and is updated to the end of the match. Each time a new match is found the start of the match is compared with the location, i.e. end of the last match. An error occurs if:
the pattern is not found;
the pattern is found, but not at the end of the last match.
Code:
private static byte[] parseHex(String hex) {
    byte[] bytes = new byte[hex.length() / 2];
    int off = 0;
    // the pattern is normally a constant
    Pattern hexByte = Pattern.compile("\\p{XDigit}{2}");
    Matcher hexByteMatcher = hexByte.matcher(hex);
    int loc = 0;
    // so here we would normally do the while (hexByteMatcher.find()) ...
    while (loc < hex.length()) {
        // optimization in case we have a maximum size of the pattern;
        // the end is clamped so an odd-length input does not throw IndexOutOfBoundsException
        hexByteMatcher.region(loc, Math.min(loc + 2, hex.length()));
        // instead we try and find the pattern, and produce an error if not found at the right location
        if (!hexByteMatcher.find() || hexByteMatcher.start() != loc) {
            // only a single throw, message includes location
            throw new IllegalArgumentException("Hex string invalid at offset " + loc);
        }
        // the processing of the pattern, in this case a double hex digit representing a byte value
        bytes[off++] = (byte) Integer.parseInt(hexByteMatcher.group(), 16);
        // set the next location to the end of the match
        loc = hexByteMatcher.end();
    }
    return bytes;
}
The method can be improved by adding \\G (end of last match) to the regex: \\G\\p{XDigit}{2}. This way the regular expression will fail immediately if the pattern cannot be found starting at the end of the last match (or at the start of the string).
For regular expressions with an expected maximum size (2 in this case) it is of course also possible to adjust the end of the region that needs to be matched.
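The \\G variant mentioned above could look roughly like the sketch below. The class name HexParser is hypothetical, and this is one possible arrangement rather than the definitive one: because \\G pins each match to the end of the previous one, find() can go back into the loop condition, and a single check after the loop detects whether the whole input was consumed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HexParser {
    // \G anchors each match at the end of the previous match (or at the start of the input).
    private static final Pattern HEX_BYTE = Pattern.compile("\\G\\p{XDigit}{2}");

    static byte[] parseHex(String hex) {
        byte[] bytes = new byte[hex.length() / 2];
        Matcher m = HEX_BYTE.matcher(hex);
        int off = 0;
        int loc = 0; // end of the last successful match
        while (m.find()) {
            bytes[off++] = (byte) Integer.parseInt(m.group(), 16);
            loc = m.end();
        }
        // If matching stopped before the end, loc is the offset of the first bad element.
        if (loc != hex.length()) {
            throw new IllegalArgumentException("Hex string invalid at offset " + loc);
        }
        return bytes;
    }

    public static void main(String[] args) {
        System.out.println(parseHex("aabb").length); // number of decoded bytes
    }
}
```

With this arrangement the input is still scanned only once, and the error offset comes for free from the last successful match.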

Java Matcher Pattern issue

I am trying to extract everything that is after this string path /share/attachments/docs/. All my strings are starting with /share/attachments/docs/
For example: /share/attachments/docs/image2.png
Number of characters after ../docs/ is not static!
I tried with
Pattern p = Pattern.compile("^(.*)/share/attachments/docs/(\\d+)$");
Matcher m = p.matcher("/share/attachments/docs/image2.png");
m.find();
String link = m.group(2);
System.out.println("Link #: "+link);
But I am getting an exception: No match found.
Strange because if I use this:
Pattern p = Pattern.compile("^(.*)ABC Results for draw no (\\d+)$");
Matcher m = p.matcher("ABC Results for draw no 2888");
then it works!!!
Also, in some very rare cases my string does not start with /share/attachments/docs/, and then I should not parse anything. That is not directly related to the issue, but it would be good to handle.
I am getting Exception that: No match found.
This is because image2.png does not match \d+. Use a more appropriate pattern such as .+, assuming that you want to extract image2.png.
Your regular expression will then be ^(.*)/share/attachments/docs/(.+)$
In the case of ABC Results for draw no 2888, the regexp ^(.*)ABC Results for draw no (\\d+)$ works because the String ends with several successive digits, whereas in the first case image2.png is a mix of letters, digits and a dot, which is why no match was found.
Generally speaking, to avoid getting an IllegalStateException: No match found, you first need to check the result of find(). If it returns true, the pattern was found in the input String:
if (m.find()) {
    // The String matches with the pattern
    String link = m.group(2);
    System.out.println("Draw #: " + link);
} else {
    System.out.println("Input value doesn't match with the pattern");
}
The regular expression \d+ (expressed as \\d+ inside a string literal) matches a run of one or more digits. Your example input does not have a corresponding digit run, so it is not matched. The regex metacharacter . matches any character (+/- newline, depending on regex options); it seems like that may be what you're really after.
Additionally, when you use Matcher.find() it is unnecessary for the pattern to match the whole string, so it is needless to include .* to match leading context. Furthermore, find() returns a value that tells you whether a match to the pattern was found. You generally want to use this return value, and in your particular case you can use it to reject those rare non-matching strings.
Maybe this is more what you want:
Pattern p = Pattern.compile("/share/attachments/docs/(.+)$");
Matcher m = p.matcher("/share/attachments/docs/image2.png");
String link;
if (m.find()) {
    link = m.group(1);
    System.out.println("Draw #: " + link);
} else {
    link = null;
    System.out.println("Draw #: (not found)");
}

Matching Urls Inside Strings

I am trying to write a regex that will match urls inside strings of text that may be html-encoded. I am having a considerable amount of trouble with lookaround though. I need something that would correctly match both links in the string below:
some text "http://www.notarealwebsite.com/?q=asdf&searchOrder=1" "http://www.notarealwebsite.com" some other text
A verbose description of what I want would be: "http://" followed by any number of characters that are not spaces, quotes, or the string "&quot[semicolon]" (I don't care about accepting other non-url-safe characters as delimiters)
I have tried a few regexes using lookahead to check for &'s followed by q's followed by u's and so on, but as soon as I put one into the [^...] negation it just completely breaks down and evaluates more like: "http:// followed by any number of characters that are not spaces, quotes, ampersands, q's, u's, o's, t's, or semicolons" which is obviously not what I am looking for.
This will correctly match the &'s at the beginning of the &quot[semicolon]:
&(?=q(?=u(?=o(?=t(?=;)))))
But this does not work:
http://[^ "&(?=q(?=u(?=o(?=t(?=;)))))]*
I know just enough about regexes to get into trouble, and that includes not knowing why this won't work the way I want it to. I understand to some extent positive and negative lookaround, but I don't understand why it breaks down inside the [^...]. Is it possible to do this with regexes? Or am I wasting my time trying to make it work?
If your regex implementation supports it, use a positive lookahead and a backreference with a non-greedy expression in the body.
Here is one with your conditions: (["\s]|&quot;)(http://.*?)(?=\1)
For example, in Python:
import re
p = re.compile(r'(["\s]|&quot;)(https?://.*?)(?=\1)', re.IGNORECASE)
url = "http://test.url/here.php?var1=val&var2=val2"
formatstr = 'text "{0}" more text {0} and more &quot;{0}&quot; test greed&quot;'
data = formatstr.format(url)
for m in p.finditer(data):
    print("Found:", m.group(2))
Produces:
Found: http://test.url/here.php?var1=val&var2=val2
Found: http://test.url/here.php?var1=val&var2=val2
Found: http://test.url/here.php?var1=val&var2=val2
Or in Java:
@Test
public void testRegex() {
    Pattern p = Pattern.compile("([\"\\s]|&quot;)(https?://.*?)(?=\\1)",
            Pattern.CASE_INSENSITIVE);
    final String URL = "http://test.url/here.php?var1=val&var2=val2";
    final String INPUT = "some text " + URL + " more text + \"" + URL +
            "\" more then &quot;" + URL + "&quot; testing greed &quot;";
    Matcher m = p.matcher(INPUT);
    while (m.find()) {
        System.out.println("Found: " + m.group(2));
    }
}
Produces the same output.
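As for why the attempt in the question breaks down: inside [^...] every character is taken literally, so the lookahead syntax becomes a list of excluded single characters. One alternative, sketched here as an assumption rather than as part of the answer above, is to keep the lookahead outside the character class and repeat the pair: at each position, first check that &quot; does not start here, then consume one character that is not whitespace or a quote. The class name UrlFinder and the method findUrls are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UrlFinder {
    // (?!&quot;) is checked before each character, so the URL stops right before an encoded quote.
    private static final Pattern URL = Pattern.compile("https?://(?:(?!&quot;)[^\\s\"])*");

    static List<String> findUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String text = "some text &quot;http://www.notarealwebsite.com/?q=asdf&searchOrder=1&quot; "
                + "\"http://www.notarealwebsite.com\" some other text";
        findUrls(text).forEach(System.out::println);
    }
}
```

A plain ampersand inside the URL (as in &searchOrder=1) is still accepted, because the lookahead only rejects the full sequence &quot;.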

(Pattern and Matcher) not discovering all pattern matches

I have this string object which consists of tags (bounded by [$ and $]) and the rest of the text. I'm trying to isolate all of the tags. Pattern and Matcher recognize all of the tags properly, but two of them are combined into one. I don't have any idea why this is happening; it is probably some internal Matcher/Pattern business.
String docBody = "This is sample text.\r\n[$ FOR i 1 10 1 $]\r\n This is" +
"[$ i $]-th time this message is generated.\r\n[$END$]\r\n" +
"[$ FOR i 0 10 2 $]\r\n sin([$= i $]^2) = [$= i i * #sin \"0.000\"" +
" #decfmt $]" +
"\r\n[$END$] ";
Pattern p = Pattern.compile("(\\[\\$)(.)+(\\$\\])");
Matcher m = p.matcher(docBody);
while(m.find()){
System.out.println(m.group());
}
output:
[$ FOR i 1 10 1 $]
[$ i $]
[$END$]
[$ FOR i 0 10 2 $]
[$= i $]^2) = [$= i i * #sin "0.000" #decfmt $]
[$END$]
As you can see, this part [$= i $]^2) = [$= i i * #sin "0.000" #decfmt $] is not split into these two tags [$= i $] and [$= i i * #sin "0.000" #decfmt $]
Any suggestions why this is happening?
You should use a reluctant quantifier (".+?") instead of a greedy one (".+"):
"(\\[\\$).+?(\\$\\])" // Note `?` after `.+`
If you use .+, it will match everything except line terminators up to the last $] on the line. Note that a dot (.) matches any character except a line terminator by default. With a reluctant quantifier, .+? matches only up to the first $] it encounters.
In your given string, you got all those matches, because you had \r\n in between, where the .+ stops matching. If you remove all those newlines, then you will just get a single match from 1st [$ to the last $].
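The difference can be seen on a single line taken from the sample text. This is a small sketch; the class name TagExtractor and the helper findAll are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TagExtractor {
    /** Collects every match of regex in text, in order. */
    static List<String> findAll(String regex, String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String line = "sin([$= i $]^2) = [$= i i $]";
        // Greedy: .+ runs to the end of the line and backtracks to the LAST $]
        System.out.println(findAll("\\[\\$.+\\$\\]", line));
        // Reluctant: .+? stops at the FIRST $] it can reach
        System.out.println(findAll("\\[\\$.+?\\$\\]", line));
    }
}
```

The greedy pattern yields one combined match, while the reluctant pattern yields the two separate tags.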
A good way is to replace the dot with a negated character class, for example:
Pattern p = Pattern.compile("(\\[\\$)([^$]++)(\\$])");
(note that you don't need to escape closing square brackets)
But perhaps you are only interested in the content of the tags:
Pattern p = Pattern.compile("(?<=\\[\\$)[^$]++(?=\\$])");
In this case the content is the whole match.

Match the second substring using regular expression

I need a regular expression that matches the second "abc" in "abcasdabchjkabc".
I attempt to write code like this,
Pattern p = Pattern.compile("(?<=abc(.*?))abc");
but it throws a java.util.regex.PatternSyntaxException:
Look-behind group does not have an obvious maximum length near index 11
(?<=abc(.*?))abc
^
at java.util.regex.Pattern.error(Pattern.java:1713)
at java.util.regex.Pattern.group0(Pattern.java:2488)
at java.util.regex.Pattern.sequence(Pattern.java:1806)
at java.util.regex.Pattern.expr(Pattern.java:1752)
at java.util.regex.Pattern.compile(Pattern.java:1460)
at java.util.regex.Pattern.<init>(Pattern.java:1133)
at java.util.regex.Pattern.compile(Pattern.java:823)
Please show me the right one!
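As the error message says, the look-behind must have an obvious maximum length, so you cannot use unbounded quantifiers such as * or + in a look-behind assertion.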
Why does the look-behind expression in this regex not have an "obvious maximum length"?
Regex look-behind without obvious maximum length in Java
Do you actually want to match everything in between the two abcs?
Pattern.compile("abc(.*?)abc");
Or do you just want to check that there are two abcs?
Pattern.compile("abc.*?abc");
I don't see a need for lookbehind in either case.
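I guess you want something like:
java.util.regex.Pattern.compile("(?<=abc.{1,99})abc");
The first call to find() with this pattern returns the second abc, as long as it starts within 99 characters of the end of the first one.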
A simple option is to match your pattern twice:
String input = "abcXYabcZRabc";
Pattern p = Pattern.compile("abc");
Matcher m = p.matcher(input);
m.find(); // what to do when there is no match?
m.find(); // what to do when there is only one match?
System.out.println("Second match is between " + m.start() + " and " + m.end());
Working example: http://ideone.com/uVZL3j
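The two questions left open in the comments above (no match, or only one match) can be handled by checking the return value of each find() call. The following is a sketch; the class name NthMatch, the method startOfNthMatch, and the choice of -1 as the "not found" value are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NthMatch {
    /** Returns the start index of the n-th match (counting from 1), or -1 if there are fewer than n matches. */
    static int startOfNthMatch(Pattern pattern, String input, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        Matcher m = pattern.matcher(input);
        for (int i = 0; i < n; i++) {
            if (!m.find()) {
                return -1; // ran out of matches before reaching the n-th one
            }
        }
        return m.start();
    }

    public static void main(String[] args) {
        Pattern abc = Pattern.compile("abc");
        System.out.println(startOfNthMatch(abc, "abcasdabchjkabc", 2));
    }
}
```

For the string from the question, the second abc starts at index 6.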
