java regular expression lookahead non-capture but output it

I am trying to use the pattern \w(?=\w) to find each pair of consecutive characters.
The lookahead works, but I want the output to include the character matched by the lookahead without consuming it.
Here is the code:
Pattern pattern = Pattern.compile("\\w(?=\\w)");
Matcher matcher = pattern.matcher("abcde");
while (matcher.find())
{
System.out.println(matcher.group(0));
}
I want the output to be: ab bc cd de
but I only get single characters: a b c d
Any idea?

The content of the lookahead has zero width, so it is not part of group zero. To do what you want, you need to explicitly capture the content of the lookahead, and then reconstruct the combined text+lookahead, like this:
Pattern pattern = Pattern.compile("\\w(?=(\\w))");
// The inner (\\w) is the added capturing group (group 1) inside the lookahead.
Matcher matcher = pattern.matcher("abcde");
while (matcher.find()) {
// Use the captured content of the lookahead below:
System.out.println(matcher.group(0) + matcher.group(1));
}
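The same idea can be wrapped in a small helper that collects the pairs into a list. This is only a sketch of the technique above; the class name OverlappingPairs and the method name pairs are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OverlappingPairs {
    // One consumed word character, followed by a lookahead that captures
    // (but does not consume) the next word character as group 1.
    private static final Pattern PAIR = Pattern.compile("\\w(?=(\\w))");

    // Hypothetical helper: returns every overlapping two-character pair.
    static List<String> pairs(String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = PAIR.matcher(input);
        while (matcher.find()) {
            // group() is the consumed character, group(1) the looked-ahead one.
            result.add(matcher.group() + matcher.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(pairs("abcde")); // [ab, bc, cd, de]
    }
}
```

Because the lookahead consumes nothing, the next find() resumes right after the single consumed character, which is what lets the pairs overlap.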

Related

First pattern key is always not found

I want to read comments from .sql file and get the values:
<!--
#fake: some
#author: some
#ticket: ti-1232323
#fix: some fix
#release: master
#description: This is test example
-->
Code:
String text = String.join("", Files.readAllLines(file.toPath()));
Pattern pattern = Pattern.compile("^\\s*#(?<key>(fake|author|description|fix|ticket|release)): (?<value>.*?)$", Pattern.MULTILINE);
Matcher matcher = pattern.matcher(text);
while (matcher.find())
{
if (matcher.group("key").equals("author")) {
author = matcher.group("value");
}
if (matcher.group("key").equals("description")) {
description = matcher.group("value");
}
}
The first key (fake, in this case) is always empty. If I put author as the first key, it is empty as well. Do you know how I can fix the regex pattern?

Use the following regex pattern:
(?<!\S)#(?<key>(?:fake|author|description|fix|ticket|release)): (?<value>.*?(?![^#]))
The negative lookbehind (?<!\S) used above asserts that the # is preceded by whitespace or by the start of the string, which covers the initial edge case. The negative lookahead (?![^#]) at the end of the pattern stops the value just before the next # term begins, or at the end of the input:
String text = String.join("", Files.readAllLines(file.toPath()));
Pattern pattern = Pattern.compile("(?<!\\S)#(?<key>(?:fake|author|description|fix|ticket|release)): (?<value>.*?(?![^#]))", Pattern.DOTALL);
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
if ("author".equals(matcher.group("key"))) {
author = matcher.group("value");
}
if ("description".equals(matcher.group("key"))) {
description = matcher.group("value");
}
}

If the <!-- and --> parts should be there, you could make use of the \G anchor to get consecutive matches and keep the groups.
Note that the alternatives are already inside the named capturing group (?<key>...), so you don't have to wrap them in another group. The part in group value does not have to be non-greedy, as you are matching to the end of the line.
As @Wiktor Stribiżew mentioned, you are joining the lines back together without a newline, so the separate lines can no longer be matched using, for example, the anchor $ to assert the end of a line.
Pattern
(?:^<!--(?=.*(?:\R(?!-->).*)*\R-->)|\G(?!^))\R#(?<key>fake|author|description|fix|ticket|release): (?<value>.*)$
Explanation
(?: Non capture group
^ Start of line
<!-- Match literally
(?=.*(?:\R(?!-->).*)*\R-->) Assert an ending -->
| Or
\G(?!^) Assert the end of the previous match, not at the start
) Close group
\R# Match a unicode newline sequence and #
(?<key> Named group key, match any of the alternatives
fake|author|description|fix|ticket|release
): Match literally
(?<value>.*)$ Named group value, match any char except a newline until the end of the line
Example code
String text = String.join("\n", Files.readAllLines(file.toPath()));
String regex = "(?:^<!--(?=.*(?:\\R(?!-->).*)*\\R-->)|\\G(?!^))\\R#(?<key>fake|author|description|fix|ticket|release): (?<value>.*)$";
Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
if (matcher.group("key").equals("author")) {
System.out.println(matcher.group("value"));
}
if (matcher.group("key").equals("description")) {
System.out.println(matcher.group("value"));
}
}
Output
some
This is test example
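The point about the missing newline can be checked in isolation. The sketch below runs the pattern from the question (minus the redundant inner group) against the same lines joined with "\n" and with "". The class name CommentKeys and the method parse are hypothetical, and the lines are passed in directly instead of being read from a file.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommentKeys {
    // Pattern from the question; with MULTILINE, ^ and $ match at line boundaries.
    private static final Pattern ENTRY = Pattern.compile(
            "^\\s*#(?<key>fake|author|description|fix|ticket|release): (?<value>.*?)$",
            Pattern.MULTILINE);

    // Hypothetical helper: joins the lines with the given separator and
    // collects every key/value pair the pattern finds.
    static Map<String, String> parse(List<String> lines, String separator) {
        String text = String.join(separator, lines);
        Map<String, String> values = new LinkedHashMap<>();
        Matcher matcher = ENTRY.matcher(text);
        while (matcher.find()) {
            values.put(matcher.group("key"), matcher.group("value"));
        }
        return values;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("<!--", "#fake: some", "#author: some",
                "#description: This is test example", "-->");
        System.out.println(parse(lines, "\n")); // all three keys are found
        System.out.println(parse(lines, ""));   // {} because there are no line boundaries left
    }
}
```

Joined with an empty separator, the text is a single line that starts with <!--, so ^ can only match at index 0 and no entry is ever found.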

Regular Expression (regex). How to ignore or exclude everything in between?

I have this input text:
142d 000781fe0000326f BPD false 65535 FSK_75 FSK_75 -51.984 -48
I want to use regular expression to extract 000781fe0000326f and -51.984, so the output looks like this
000781fe0000326f-51.984
I can use [0-9]{5,7}(?:[a-z][a-z0-9_]*) and ([-]?\\d*\\.\\d+)(?![-+0-9\\.]) to extract 000781fe0000326f and -51.984, respectively.
Is there a way to ignore or exclude everything between 000781fe0000326f and -51.984, that is, everything that would be captured by the non-greedy filler (.*?)?
String ref="[0-9]{5,7}(?:[a-z][a-z0-9_]*)_____([-]?\\d*\\.\\d+)(?![-+0-9\\.])";
Pattern p = Pattern.compile(ref,Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
Matcher m = p.matcher(input);
while (m.find())
{
String all = m.group();
//list3.add(all);
}

For your example data you might use an alternation | to match either one of the regexes in your question and then concatenate the matches.
Note that in your regex you could write (?:[a-z][a-z0-9_]*) as [a-z][a-z0-9_]* and you don't have to escape the dot in a character class.
For example:
[0-9]{5,7}[a-z][a-z0-9_]*|-?\d*\.\d+(?![-+0-9.])
String regex = "[0-9]{5,7}[a-z][a-z0-9_]*|-?\\d*\\.\\d+(?![-+0-9.])";
String string = "142d 000781fe0000326f BPD false 65535 FSK_75 FSK_75 -51.984 -48";
Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
Matcher matcher = pattern.matcher(string);
String result = "";
while (matcher.find()) {
result += matcher.group(0);
}
System.out.println(result); // 000781fe0000326f-51.984

There's no way to combine strings together like that in pure regex, but it's easy to create a group for the first match, a group for the second match, and then use m.group(1) + m.group(2) to concatenate the two groups together and create your desired combined string.
Also note that [0-9] simplifies to \d, a character set with only one token in it simplifies to just that token, [a-z0-9_] with the i flag simplifies to \w, and there's no need to escape a . inside a character set:
String input = "142d 000781fe0000326f BPD false 65535 FSK_75 FSK_75 -51.984 -48";
String ref="(\\d{5,7}(?:[a-z]\\w*)).*?((?:-?\\d*\\.\\d+)(?![-+\\d.]))";
Pattern p = Pattern.compile(ref,Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
Matcher m = p.matcher(input);
while (m.find())
{
String all = m.group(1) + m.group(2);
System.out.println(all);
}

You cannot really ignore the words in between with a single match. You can only include them all.
Something like this will include all of them:
[0-9]{5,7}(?:[a-z][a-z0-9_])[a-zA-Z0-9_ ]([-]?\d*.\d+)(?![-+0-9.])
But that is not what you want.
I think the best bet is either having 2 regular expressions and then combining the result, or splitting the string on spaces/tab characters and checking the 'n'th elements as required
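The splitting approach could look roughly like the sketch below. It assumes every line has the same whitespace-separated layout as the sample, with the identifier in the second field and the value in the eighth; those positions, the class name FieldPicker and the method idAndSignal are assumptions for illustration.

```java
class FieldPicker {
    // Assumed column positions, taken from the single sample line in the question.
    private static final int ID_FIELD = 1;
    private static final int SIGNAL_FIELD = 7;

    // Hypothetical helper: splits on runs of whitespace and joins the two fields.
    static String idAndSignal(String line) {
        String[] fields = line.trim().split("\\s+");
        if (fields.length <= SIGNAL_FIELD) {
            throw new IllegalArgumentException("Too few fields: " + line);
        }
        return fields[ID_FIELD] + fields[SIGNAL_FIELD];
    }

    public static void main(String[] args) {
        String input = "142d 000781fe0000326f BPD false 65535 FSK_75 FSK_75 -51.984 -48";
        System.out.println(idAndSignal(input)); // 000781fe0000326f-51.984
    }
}
```

This avoids the regex entirely, at the cost of breaking if the number or order of columns changes.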

Java Pattern Matcher is not working for regex as expected

1) Pattern pattern = Pattern.compile("34238");
Matcher matcher = pattern.matcher("6003 Honore Ave Suite 101 Sarasota Florida, 34238");
if (matcher.find()) {
System.out.println("ok");
}
2) Pattern pattern = Pattern.compile("^[0-9]{5}(?:-[0-9]{4})?$");
Matcher matcher = pattern.matcher("34238");
if (matcher.find()) {
System.out.println("ok");
}
Output for the above code is: ok
But the following code is not printing anything:
Pattern pattern = Pattern.compile("^[0-9]{5}(?:-[0-9]{4})?$");
Matcher matcher = pattern.matcher("6003 Honore Ave Suite 101 Sarasota Florida, 34238");
if (matcher.find()) {
System.out.println("ok");
}
What is the reason for this not to print ok? I am using the same pattern here also.

Although the pattern is the same, the input strings are different:
In your second example, you are matching a string that consists entirely of a zip code, so the ^...$ expression matches.
The third example does not start with the zip code, so the ^ anchor prevents your regex from matching.
The ^ and $ anchors are used together when you want your expression to match the entire input. To match only at the beginning, keep ^ and remove $; to match only at the end, remove ^ and keep $; to match anywhere inside the string, remove both anchors.
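The four anchor combinations can be compared side by side on the address from the question. This is a minimal sketch; the class name ZipAnchors and the helper found are invented for the example.

```java
import java.util.regex.Pattern;

class ZipAnchors {
    // Zip code core from the question, without any anchors.
    static final String ZIP = "[0-9]{5}(?:-[0-9]{4})?";

    // Hypothetical helper: true if the regex is found anywhere it is allowed to match.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String address = "6003 Honore Ave Suite 101 Sarasota Florida, 34238";
        System.out.println(found("^" + ZIP + "$", address)); // false: whole input must be a zip code
        System.out.println(found("^" + ZIP, address));       // false: input starts with "6003 "
        System.out.println(found(ZIP + "$", address));       // true: input ends with 34238
        System.out.println(found(ZIP, address));             // true: zip code appears somewhere
        System.out.println(found("^" + ZIP + "$", "34238")); // true: input is exactly a zip code
    }
}
```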

The code is good and working as expected. In the 2) and 3) block in your question you are using the same regex but different input strings.
However, if you just want to check whether a string contains a US zip code, then the problem is that your regex uses anchors, so you are only matching input that starts and ends with a zip code.
The strings that match your regex look like 34238 or 34238-1234; it won't match something 12345 something.
If you remove the anchors, then you will match whatever 12345 whatever:
// Pattern pattern = Pattern.compile("^[0-9]{5}(?:-[0-9]{4})?$");
// ^--------- Here -------^
Pattern pattern = Pattern.compile("[0-9]{5}(?:-[0-9]{4})?");
Matcher matcher = pattern.matcher("6003 Honore Ave Suite 101 Sarasota Florida, 34238");
if (matcher.find()) {
System.out.println("ok");
}
Btw, if you just want to check if a string contains a zip code, then you can use String.matches(..), like this:
String str = "6003 Honore Ave Suite 101 Sarasota Florida, 34238";
if (str.matches(".*[0-9]{5}(?:-[0-9]{4})?.*")) {
System.out.println("ok");
}

How to parse a range input in java

I want to parse a range of data (e.g. 100-2000) in Java. Is this code correct:
String patternStr = "^(\\\\d+)-(\\\\d+)$";
Pattern pattern = Pattern.compile(patternStr);
Matcher matcher = pattern.matcher(inputStr);
if(matcher.find()){
// Doing some parser
}

Too many backslashes, and you can use matches() without anchors (^$).
String inputStr = "100-2000";
String patternStr = "(\\d+)-(\\d+)";
Pattern pattern = Pattern.compile(patternStr);
Matcher matcher = pattern.matcher(inputStr);
if (matcher.matches()) {
System.out.println(matcher.group(1) + " - " + matcher.group(2));
}
As for your question "Is this code correct", all you had to do was wrap the code in a class with a main method and run it, and you'd get the answer: No.

No, you're double (well, quadruple)-escaping the digits.
It should be: "^(\\d+)-(\\d+)$".
Meaning:
Start of input: ^
Group 1: 1+ digit(s): (\\d+)
Hyphen literal: -
Group 2: 1+ digit(s): (\\d+)
End of input: $
Notes
The groups are useful for back-references. Here you're using none, so you can ditch the parentheses around the \\d+ expressions.
You are parsing the representation of a range in this example.
If you want an actual character-class range, you can use the [min-max] idiom, where "min" and "max" are single characters, for instance [0-9].
As mentioned by Andreas, you can use String.matches without the Pattern-Matcher idiom and without the ^ and $, if you want to match the whole input.
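If the goal is to get the two numbers out of the input, one possible sketch keeps the capturing groups and converts them. The class name RangeParser, the method parse and the choice of long are assumptions for illustration, not part of the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RangeParser {
    // Two backslashes in the Java source produce the regex \d; matches() makes ^ and $ unnecessary.
    private static final Pattern RANGE = Pattern.compile("(\\d+)-(\\d+)");

    // Hypothetical helper: returns {min, max}, or throws if the input is not "digits-digits".
    static long[] parse(String input) {
        Matcher matcher = RANGE.matcher(input);
        if (!matcher.matches()) {
            throw new IllegalArgumentException("Not a range: " + input);
        }
        // Long.parseLong throws NumberFormatException for values too large for a long.
        return new long[] { Long.parseLong(matcher.group(1)), Long.parseLong(matcher.group(2)) };
    }

    public static void main(String[] args) {
        long[] range = parse("100-2000");
        System.out.println(range[0] + " to " + range[1]); // 100 to 2000

        // The quadruple-escaped pattern looks for a literal backslash followed by "d".
        System.out.println(Pattern.compile("^(\\\\d+)-(\\\\d+)$").matcher("100-2000").find()); // false
    }
}
```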

getting NULL values from Java regex Matcher with a found pattern

I'm trying to get the following regex to work on my String:
Pattern Regex = Pattern.compile("(?:(\\d+) ?(days?|d) *?)?(?:(\\d+) ?(hours?|h) *?)?(?:(\\d+) ?(minutes?|m) *?)?(?:(\\d+) ?(seconds?|s))?",
Pattern.CANON_EQ | Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
Matcher RegexMatcher = Regex.matcher(myString);
while (RegexMatcher.find()) {
...
}
.. it basically splits a string like 1day 3 hours into matched regex groups.
The problem I'm having is that when I get into the while loop, calls to RegexMatcher.group(i) always return null, meaning the groups were not found in the string.
When I try to output RegexMatcher.group(0), it returns an empty string, even though myString definitely contains something like "hello 1d", which should return at least the first group as "1" and the second as "d".
I've checked and double-checked the regex and it seems to be OK. No idea what's wrong here.
Thanks for any ideas :-)

For a matcher m, input sequence s, and group index g, the expressions m.group(g) and s.substring(m.start(g), m.end(g)) are equivalent.
Capturing groups are indexed from left to right, starting at one. Group zero denotes the entire pattern, so the expression m.group(0) is equivalent to m.group().
If the match was successful but the group specified failed to match any part of the input sequence, then null is returned. Note that some groups, for example (a*), match the empty string. This method will return the empty string when such a group successfully matches the empty string in the input.
If you want to iterate over all the matches, you can write code like this:
Pattern Regex = Pattern
.compile(
"(?:(\\d+) ?(days?|d) *?)?(?:(\\d+) ?(hours?|h) *?)?(?:(\\d+) ?(minutes?|m) *?)?(?:(\\d+) ?(seconds?|s))?",
Pattern.CANON_EQ | Pattern.CASE_INSENSITIVE
| Pattern.UNICODE_CASE);
Matcher RegexMatcher = Regex.matcher("1 d 3 hours");
while (RegexMatcher.find()) {
System.out.println(RegexMatcher.group());
}
Note: m.group() is equivalent to m.group(0)
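The behaviour described in the question can be reproduced with a short sketch. Every part of the pattern is optional, so the first find() on "hello 1d" succeeds with an empty match at index 0, where all groups are null; the real match only shows up on a later iteration. The class and method names below are hypothetical, and Pattern.CANON_EQ and Pattern.UNICODE_CASE are left out to keep the example minimal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DurationMatches {
    // Same pattern as in the question; every top-level part is optional.
    private static final Pattern DURATION = Pattern.compile(
            "(?:(\\d+) ?(days?|d) *?)?(?:(\\d+) ?(hours?|h) *?)?"
            + "(?:(\\d+) ?(minutes?|m) *?)?(?:(\\d+) ?(seconds?|s))?",
            Pattern.CASE_INSENSITIVE);

    // Text of the very first match; find() always succeeds because the pattern can match nothing.
    static String firstMatch(String input) {
        Matcher matcher = DURATION.matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    // Hypothetical helper: keeps iterating and skips the empty matches.
    static List<String> nonEmptyMatches(String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = DURATION.matcher(input);
        while (matcher.find()) {
            if (!matcher.group().isEmpty()) {
                result.add(matcher.group());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println("[" + firstMatch("hello 1d") + "]"); // [] (empty match at index 0)
        System.out.println(nonEmptyMatches("hello 1d"));        // [1d]
    }
}
```

Inside the loop, checking for an empty group() (or a null group) before reading the numbered groups avoids acting on these zero-length matches.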
