I am working with Pentaho, which uses the Java regex package, java.util.regex.
I want to extract several pieces of information from the lines of a text file, taken from both the start and the end of each string:
^StartofString Controls\(param1="(D[0-9]{0,})",param2="(G[0-9]{0,})",param3="([^"]{0,})",param4="([^"]{0,})"\):(?:.*)param5="([^"]{0,})",.*
There is a long part of the string that I want to ignore, and I try to do so with (?:.*)
The non-capturing group seems to work when I test the regex in the step, but it does not work when I execute the transformation.
I test the string in a 'Regex Evaluation' step, check the boolean result of that step with 'Filter rows', and extract the groups in a Javascript step:
var pattern = Packages.java.util.regex.Pattern.compile(patternStr);
var matcher = pattern.matcher(content.toString());
var matchFound = matcher.find();
with patternStr being the same regex as the one in the 'Regex Evaluation' step, but with the backslashes escaped : \
I have read many questions about ignoring parts of strings in regex and still can't find the answer.
Any help is welcome.
I can provide more information if needed.
A non capturing group doesn't mean that its content won't be captured, it means that it won't be captured in a group (although you're still grouping tokens in your regex, which can be useful to apply a modifier to them at once).
For example, these regexes will all match the exact same abc string :
abc
a(?:b)c
a(b)c
However, in the third case you've defined a capturing group, which lets you access b independently. The first two cases are equal in all respects.
The non-capturing group becomes useful when you want to apply a modifier to a group of tokens without having an extra group you can reference later. The following regexes will all match the same strings :
(ab)*(c)\2
(?:ab)*(c)\1
We want to apply * to the ab tokens. Either we do it with a capturing group (first example) and a group is created that we can reference, or we use a non-capturing group. The backreference at the end of the regex is supposed to match the c ; in the first example it's the second group since ab is the first one, while in the second c is the first group that can be referenced.
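The group numbering described above can be checked directly in Java. The following is a minimal sketch; the class and method names are made up for illustration, and the sample input ababcc is an assumption chosen so that both patterns match.

```java
import java.util.regex.Pattern;

class GroupNumberingDemo {
    // Number of capturing groups the pattern defines; (?:...) groups are not counted.
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    // True when the whole input matches the regex.
    static boolean matchesAll(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        String capturing = "(ab)*(c)\\2";      // c is group 2, so the backreference is \2
        String nonCapturing = "(?:ab)*(c)\\1"; // c is group 1, so the backreference is \1

        System.out.println(groupCount(capturing));             // 2
        System.out.println(groupCount(nonCapturing));          // 1
        System.out.println(matchesAll(capturing, "ababcc"));    // true
        System.out.println(matchesAll(nonCapturing, "ababcc")); // true
    }
}
```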
Now that I've explained what non-capturing groups do, let's solve your problem : you want to remove something from the middle of your string, where you know what's at the beginning and what's at the end.
Let's assume the string you want to match is the following :
Aremove-thisB
And that you want the result AB.
There are multiple strategies to do so. The easiest in your case is probably to match the beginning and the end of the string, each in its own capturing group, and build your output from there :
var pattern = Packages.java.util.regex.Pattern.compile("(A).*(B)");
var matcher = pattern.matcher(content.toString());
var matchFound = matcher.find();
if (matchFound) { return matcher.group(1) + matcher.group(2); }
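The same idea can be sketched in plain Java against a line shaped like the one in the question. The pattern below is a shortened version of the original (only param1 and param5 are kept), and the sample line is invented for illustration; neither comes from the real Pentaho data.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SkipMiddleDemo {
    // The .* in the middle is matched but not captured, so only the two
    // parenthesized fields end up in groups 1 and 2.
    private static final Pattern LINE = Pattern.compile(
            "^Controls\\(param1=\"(D[0-9]*)\"\\):.*param5=\"([^\"]*)\",.*");

    // Returns [param1, param5], or an empty list when the line does not match.
    static List<String> extract(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.find()) {
            return Collections.emptyList();
        }
        return Arrays.asList(m.group(1), m.group(2));
    }

    public static void main(String[] args) {
        // Hypothetical input line
        String line = "Controls(param1=\"D12\"):a long part to ignore,param5=\"end\",rest";
        System.out.println(extract(line)); // [D12, end]
    }
}
```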
I am using a regex to match a few possible values in a string that comes with my objects, and I need to get every value that matches in my string, as below.
My string value is "This is the code ABC : xyz use for something".
Here is the code I am using to extract the matches:
String my_regex = "(ABC|ABC :).*";
List<String> matchers = Pattern.compile(my_regex, Pattern.CASE_INSENSITIVE)
.matcher(my_string)
.results()
.map(MatchResult::group)
.collect(Collectors.toList());
I am expecting 2 list items as the output, {"ABC", "ABC :"}, but I am only getting one. Help would be highly appreciated.
What you describe just isn't how regex engines work. They do not find all possible variant search results; they simply consume and give you all results, moving forward. In other words, had you written:
String my_regex = "(ABC|ABC :)"; // note, get rid of the .*
String myString = "This is the code ABC : xyz use for something ABC again";
Then you'd get 2 results back. Note that Java tries the alternatives from left to right and stops at the first one that succeeds, so with ABC listed first both results are ABC; list the longer alternative first, (ABC :|ABC), to get ABC : and ABC.
Quantifiers behave differently: they are 'greedy' by default and will match as much as they can. For some operators (specifically, * and +) you can use the non-greedy variants: *? and +? which will match as little as possible.
In other words, given:
String regex = "(a*?)(a+)";
String myString = "aaaaa";
Then group 1 would match 0 a (that's the shortest string that can match (a*?) whilst still being able to match the entire regex to the input), and group 2 would be aaaaa.
If, on the other hand, you wrote (a*)(a+), then group 1 would be aaaa and group 2 would be a. It is not possible to ask the regexp engine to provide for you the combinatory explosion, with every possible length of 'a' - which appears to be what you want. The regexp API that ships with java does not have any option to do this, nor does any other regexp API I know of, so you'd have to write that yourself, perhaps. I admit I haven't scoured the web for every possible alternate regex engine impl for java, there are a bunch of third party libraries, perhaps one of them can do it.
NB: I said at the start: Get rid of the .*. That's because otherwise it's still just the one match: ABC : xyz use for something ABC again is the longest possible match and given that regex engines are greedy, that's what you will get: It is a valid 'interpretation' of your string (1 match), consuming the most - that's how it works.
NB2: Greediness can never change whether a regex even matches or not. It just changes which part of the input is assigned to which group and, when find()ing more than once (which .results() does: it find()s until no more matches are found), which matches you get.
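The greedy and reluctant behaviour described for (a*)(a+) and (a*?)(a+) can be reproduced with a few lines of Java. This is only a sketch; the class and method names are invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctantDemo {
    // Matches the whole input and returns what groups 1 and 2 captured.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            throw new IllegalArgumentException("input does not match " + regex);
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] reluctant = groups("(a*?)(a+)", "aaaaa");
        String[] greedy = groups("(a*)(a+)", "aaaaa");

        // Reluctant first group takes nothing, so group 2 gets all five a's.
        System.out.println("'" + reluctant[0] + "' / '" + reluctant[1] + "'"); // '' / 'aaaaa'
        // Greedy first group takes as much as it can while leaving one a for a+.
        System.out.println("'" + greedy[0] + "' / '" + greedy[1] + "'");       // 'aaaa' / 'a'
    }
}
```

Both patterns match the same input; only the distribution over the groups changes.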
I'm wondering about the behavior of using the matcher in java.
I have a pattern which I compiled, and when running through the results of the matcher I don't understand why a specific value is missing.
My code:
String str = "star wars";
Pattern p = Pattern.compile("star war|Star War|Starwars|star wars|star wars|pirates of the caribbean|long strage trip|drone|snatched (2017)");
Matcher matcher = p.matcher(str);
while (matcher.find()) {
System.out.println("\nRegex : " + matcher.group());
}
I get hit with "star war" which is right as it is in my pattern.
But I don't get "star wars" as a hit and I don't understand why as it is part of my pattern.
The behavior is expected because alternation in NFA regex is "eager", i.e. the first match wins, and the rest of the alternatives are not even tested against. Also, note that once a regex engine finds a match in a consuming pattern (and yours is a consuming pattern, it is not a zero-width assertion like a lookahead/lookbehind/word boundary/anchor) the index is advanced to the end of the match and the next match is searched for from that position.
So, once your first star war alternative branch matches, there is no way to match star wars as the regex index is before the last s.
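To see the effect of alternative order, here is a small sketch that collects every find() result for the two orderings of the same alternatives. The helper name findAll is made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationOrderDemo {
    // Collects every successive match that find() reports.
    static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // Shorter alternative first: it wins, and the trailing s is left over.
        System.out.println(findAll("star war|star wars", "star wars")); // [star war]
        // Longer alternative first: the whole input is matched.
        System.out.println(findAll("star wars|star war", "star wars")); // [star wars]
    }
}
```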
Just check if the string contains the strings you check against, the simplest approach is with a loop:
String str = "star wars";
String[] arr = {"star war","Star War","Starwars","star wars","pirates of the caribbean","long strage trip","drone","snatched (2017)"};
for(String s: arr){
if(str.contains(s))
System.out.println(s);
}
By the way, your regex contains snatched (2017), and it does not match ( and ), it only matches snatched 2017. To match literal parentheses, the ( and ) must be escaped. I also removed a dupe entry for star wars.
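The point about parentheses can be illustrated with a short sketch. It compares the unescaped alternative with two ways of matching the literal text: manual escaping and Pattern.quote. The class and method names are invented for this example.

```java
import java.util.regex.Pattern;

class EscapeParensDemo {
    // True when the whole input matches the regex.
    static boolean matchesAll(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        String title = "snatched (2017)";

        // Unescaped: ( and ) form a capturing group, so the regex matches "snatched 2017".
        System.out.println(matchesAll("snatched (2017)", "snatched 2017")); // true
        System.out.println(matchesAll("snatched (2017)", title));           // false

        // Escaped by hand, or quoted as a whole literal.
        System.out.println(matchesAll("snatched \\(2017\\)", title));       // true
        System.out.println(matchesAll(Pattern.quote(title), title));        // true
    }
}
```

Pattern.quote is convenient when the alternatives come from data rather than from hand-written regex.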
A better way to build your regex would be like this:
String pattern = "[Ss]tar[\\s]{0,1}[Ww]ar[s]{0,1}";
Breaking down:
[Ss]: it will match either S or s in the first position
\s: representation of space
{0,1}: the previous character (or set) will be matched from 0 to 1 times
An alternative is:
String pattern = "[Ss]tar[\\s]?[Ww]ar[s]?";
?: the previous character (or set) will be matched once or not at all
For more information, see https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
Edit 1: fixed typo (\s -> \\s). Thanks, @eugene.
You want to match the whole input sequence, so you should use Matcher.matches() or add ^ and $:
Pattern p = Pattern.compile("^(star war|Star War|Starwars|star wars|"
+ "star wars|pirates of the caribbean)$");
will print
Regex : star wars
But I agree with @NAMS: Don't build your regex like this.
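The difference between find() and matches() on the same pattern can be sketched as follows. The pattern is a shortened version of the one in the question, and the helper names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindVsMatchesDemo {
    private static final Pattern TITLES = Pattern.compile("star war|star wars");

    // First substring that find() reports, or null if there is none.
    static String firstFind(String input) {
        Matcher m = TITLES.matcher(input);
        return m.find() ? m.group() : null;
    }

    // Text matched when the whole input must match, or null if it does not.
    static String wholeMatch(String input) {
        Matcher m = TITLES.matcher(input);
        return m.matches() ? m.group() : null;
    }

    public static void main(String[] args) {
        // find() is satisfied by the first alternative that fits.
        System.out.println(firstFind("star wars"));  // star war
        // matches() must cover the whole input, so the engine falls back to the second alternative.
        System.out.println(wholeMatch("star wars")); // star wars
    }
}
```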
I am trying to split a line with regex by using groups, but it's not working as I expected.
I want to match for example this line:
Ex. #1: temp name(this is the data)
and also this:
Ex. #2: temp name()
I used this regex:
[\s]*temp[\s]+[\s]*([A-Za-z]+)[\s]*[(]\s*(.*)+[)]\s*[{]\s*
which means: grab anything that starts with temp then put in group #1 the "name" then grab whatever inside the bracket and put it in group #2.
However, group #2 is always empty.
This is my code to fetch the data:
Pattern PATTERN = Pattern.compile("[\\s]*temp[\\s]+[\\s]*([A-Za-z]+)[\\s]*[(]\\s*(.*)+[)]\\s*");
Matcher m = PATTERN.matcher("temp name(this is the data)");
m.matches();
String name = m.group(1);
String data = m.group(2); // always empty
What am I doing wrong?
Your pattern doesn't match because it requires an open curly brace at the end, but your input doesn't have one.
Ignoring that small problem, the main problem is the little + after your capture group (.*)+. The plus requires one or more matches of .* and the group returned is the last match of the many. The term .* is greedy, so it consumes everything up to the bracket. The only way to match again is to consume nothing. So the last match of group 2 is blank.
To fix it, remove the + after group 2:
Pattern PATTERN = Pattern.compile("\\s*temp\\s+([A-Za-z]+)\\s*[(]\\s*(.*)[)]\\s*");
Note also how I removed other unnecessary characters from your regex, e.g. the single-character character classes: [\\s] is identical to \\s. And \\s+\\s* is identical to just \\s+, because + is greedy.
I also removed the trailing curly bracket, which you can restore if your input data actually has it (your question showed input of "temp name(this is the data)", which has no trailing curly bracket).
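The effect of the extra + can be seen by running both variants against the input from the question. This is a minimal sketch; the class and method names are invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedGroupDemo {
    static final String WITH_PLUS = "\\s*temp\\s+([A-Za-z]+)\\s*[(]\\s*(.*)+[)]\\s*";
    static final String WITHOUT_PLUS = "\\s*temp\\s+([A-Za-z]+)\\s*[(]\\s*(.*)[)]\\s*";

    // Returns what group 2 captured, or null when the input does not match.
    static String dataGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String input = "temp name(this is the data)";
        // With (.*)+ the last repetition of the group is an empty match.
        System.out.println("[" + dataGroup(WITH_PLUS, input) + "]");    // []
        // With (.*) the group keeps the text between the brackets.
        System.out.println("[" + dataGroup(WITHOUT_PLUS, input) + "]"); // [this is the data]
    }
}
```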
Your regex should be this:
Pattern pattern = Pattern.compile("\\s*temp\\s+([A-Za-z]+)\\s*[(]\\s*(.*)[)]\\s*");
You had (.*)+ which means one or more matches of .*. This results in nothing being captured.
Testing:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Example {
public static void main(String[] args) {
Pattern pattern = Pattern.compile("\\s*temp\\s+([A-Za-z]+)\\s*[(]\\s*(.*)[)]\\s*");
Matcher m = pattern.matcher("temp name(this is the data)");
if(m.matches()) {
System.out.println(m.group(1));
System.out.println(m.group(2));
}
}
}
Output:
name
this is the data
[\s] is equivalent to \s
[\s]+[\s]* is equivalent to \s+
[(] is equivalent to \( (same for [)] and [{])
This would leave your regexp as:
\s*temp\s+([A-Za-z]+)\s*\(\s*(.*)+\)\s*\{\s*
Assuming you actually want to match temp name(...) { (your regexp is looking for a {, while in your question you do not specify that):
(.*)+ is your problem. You're saying: "Match any number (including 0) of characters and put them in a capture group, repeat that at least once".
Regexps are by default greedy (= they consume as much as possible), so the capture group will first contain everything within the two brackets, then the + will try to match the entire group again, and will match it with "" (the empty string) as this fulfils the capture group's pattern. This will leave your capture group empty.
What you want instead is \s*temp\s+([A-Za-z]+)\s*\(\s*(.*)\)\s*\{\s*
The reason you are getting empty groups is because you are creating multiple capture groups every time you put something between (), even if it is nested.
To make a group that doesn't capture, you can designate it as a non-capturing group by using ?:. For example, (?:sometest(this is the value we want)) will return just one group, while (sometest(this is the value we want)) will return 2 groups.
For your particular regex, I have refined and simplified it, as you had capture groups you did not need.
Simple solution:
\\s*temp\\s+([A-Za-z]+)\\s*\\(\\s*(.*)\\)\\s*\\{\\s*
given the input:
Ex. #1: temp name(this is the data) {
Ex. #2: temp name() {
$1 = name, $2 = data
Pay attention to the fact that your regex contains a trailing curly brace. You can modify the regex to match without it and it will result in this:
\\s*temp\\s+([A-Za-z]+)\\s*\\(\\s*(.*)\\)\\s*
https://regex101.com/r/tD0tO0/1
I'm trying to compile one Java Regex pattern but have that pattern look for 3 different matches. I've learned that I can do that using the pipe (|) but I'm having trouble with the actual syntax of the regex.
I'm looking through XML data and trying to pull out 3 matches. The XML will look something like this:
<Element createdOn="1405358703367" updatedOn="1405358718804" url="http://www.someurl.com" />
The regex I'm trying looks like this so far:
((?<="url": ").*(?=")) | (createdOn="(\d)") | (updatedOn="(\d)")
In the end I need to get everything between the quotes in the XML (i.e. 1405358703367, 1405358718804, and http://www.someurl.com).
I had the URL regex working on its own earlier, but there seems to be no matches being made.
Thanks.
Get the matched group from index 2.
(url|createdOn|updatedOn)="([^"]*)"
Here is sample code:
String string = "<Element createdOn=\"1405358703367\" updatedOn=\"1405358718804\" url=\"http://www.someurl.com\" />";
String patternString = "(url|createdOn|updatedOn)=\"([^\"]*)\"";
Pattern pattern = Pattern.compile(patternString);
Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println(matcher.group(2));
}
output:
1405358703367
1405358718804
http://www.someurl.com
Java doesn't have a library method that extracts matches, but you only need one line:
String[] matches = input.replaceAll("^[^\"]*\"|\"[^\"]*$", "").split("\"[^\"]*\"");
This works by stripping off leading and trailing input up to/from the first/last quote, then splits on quote-nonquote-quote input, leaving the target matches as an array.
In action:
String input = "<Element createdOn=\"1405358703367\" updatedOn=\"1405358718804\" url=\"http://www.someurl.com\" />";
String[] matches = input.replaceAll("^[^\"]*\"|\"[^\"]*$", "").split("\"[^\"]*\"");
System.out.println(Arrays.toString(matches));
Output:
[1405358703367, 1405358718804, http://www.someurl.com]
The pipe (|) is used to find a match that could be some-pattern OR some-other-pattern OR yet-another-pattern. It's not good at finding all occurrences of several patterns. To do that, if the patterns you're looking for aren't necessarily in a fixed order, you'll need to use a loop.
Here's a code example that starts with the pattern you tried, fixes some problems, and uses a loop to find the patterns:
Pattern p = Pattern.compile("((?<=url=\").*(?=\"))|(createdOn=\"(\\d+)\")|(updatedOn=\"(\\d+)\")");
Matcher m = p.matcher(source);
while (m.find()) {
System.out.println("Found: "+m.group());
System.out.println("Group 1: "+m.group(1));
System.out.println("Group 3: "+m.group(3));
System.out.println("Group 5: "+m.group(5));
}
(Some problems with your original pattern: You put space characters before and after each |, which are treated literally and mean the pattern has to match spaces that aren't there. I added + after \\d because you want to match more than one digit. There were some mistakes, like putting : after url instead of =.)
Now the code uses a loop to find each successive pattern that matches one of the patterns you're looking for. It matches either url=... or createdOn=... or updatedOn=..., but by using a loop we will find all of them. (Note that it doesn't care if it sees a url or a createdOn attribute twice in the source. You'll have to check that yourself.)
The group() method with no parameters will return whatever was matched by the pattern. group(1), group(3), and group(5) return certain subsections of the pattern; the numbers are determined by counting wherever you use ( in the pattern except for (?. So group 1 matches something using url as a lookbehind; group 2 starts with createdOn; group 3 is the sequence of digits following createdOn; group 4 starts with updatedOn, etc. The way the pattern is set up, not all of these will have values, since only one of the three alternatives separated by | will match. The rest will be null. As a result, the output of the above code will display null for two of the groups, and a useful value for the other. If you do things this way, you'll need to test for null to see which value actually got returned.
This would also be a case where named capturing groups could be useful. See http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html. Matcher has a group(name) function that takes a group name as a parameter.
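As a sketch of the named-group idea, the following collects each attribute into a map keyed by attribute name. The group names attr and value are chosen for this example; the pattern is the simpler single-alternation form rather than the lookbehind version above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupsDemo {
    // (?<attr>...) and (?<value>...) are named capturing groups (Java 7 and later).
    private static final Pattern ATTRIBUTE = Pattern.compile(
            "(?<attr>url|createdOn|updatedOn)=\"(?<value>[^\"]*)\"");

    // Maps each attribute name found in the source to its quoted value, in order of appearance.
    static Map<String, String> attributes(String source) {
        Map<String, String> found = new LinkedHashMap<>();
        Matcher m = ATTRIBUTE.matcher(source);
        while (m.find()) {
            found.put(m.group("attr"), m.group("value"));
        }
        return found;
    }

    public static void main(String[] args) {
        String source = "<Element createdOn=\"1405358703367\" updatedOn=\"1405358718804\" "
                + "url=\"http://www.someurl.com\" />";
        System.out.println(attributes(source));
    }
}
```

Reading groups by name avoids the null checks on numbered groups, since every match fills both named groups.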
This is one approach, but there are always multiple approaches to string parsing, and the other answers posted here are valid also. Plus there are already XML parsers to take care of things like this for you.
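For comparison, here is a minimal sketch of the XML-parser route using the DOM parser that ships with the JDK. It assumes the input is a well-formed XML fragment with a single root element; the class and method names are invented for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class XmlAttributeDemo {
    static final String SAMPLE = "<Element createdOn=\"1405358703367\" "
            + "updatedOn=\"1405358718804\" url=\"http://www.someurl.com\" />";

    // Parses the XML and returns the named attribute of the root element.
    static String attribute(String xml, String name) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Element root = builder.parse(new InputSource(new StringReader(xml))).getDocumentElement();
            return root.getAttribute(name);
        } catch (Exception e) {
            // Wrapped so callers do not have to handle the parser's checked exceptions.
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(attribute(SAMPLE, "createdOn"));
        System.out.println(attribute(SAMPLE, "updatedOn"));
        System.out.println(attribute(SAMPLE, "url"));
    }
}
```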
NOTE: This answer was meant to point out how | works. I don't recommend actually doing things this way, since it's overly complicated. If you're going to look separately for each attribute, it would be simpler just to set up three patterns and look for each one, one at a time. Or use @braj's suggestion in a loop, and perform a later check to make sure the createdOn and updatedOn values are numeric.
Is this regex correct to break a sentence into 3 tokens:
Characters before the lowercase letters inside parentheses
The lowercase letters inside parentheses, including the parentheses
Characters after the parenthesized lowercase letters
System.out.println("This is (a) test".matches("^(.*)?\\([a-z]*\\)(.*)?$"));
The string may or may not contain lowercase letters in parentheses, and they may appear anywhere in the sentence. If you see a flaw in a use case I haven't considered, can you provide a corrected regex?
For the e.g. above.
Group1 captures This is
Group2 captures (a)
Group3 captures test
EDIT:: How do I change the regex to achieve the following ?
If the string has (foo)(bar)(baz) how do I capture group1= empty group2=(foo) and group3=empty. And find the above pattern thrice because there are 3 parentheses.
Separate from examining the regex, whenever I write a regex, I write a series of unit tests to cover each case. I'd suggest you do the same. Create four tests (at least) using the regex and testing against the strings:
(a) This is test
This is (a) test
This is test (a)
This is a test
That should cover each of the cases you've described. That's much easier and faster than trying to hand analyze the regex for each case.
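The four cases can be turned into a tiny self-checking program. This sketch uses plain Java instead of a test framework to stay self-contained; the class and method names are made up, and the expected values follow from the regex in the question, which requires a parenthesized part to be present.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

class SentenceRegexCases {
    private static final Pattern SENTENCE = Pattern.compile("^(.*)?\\([a-z]*\\)(.*)?$");

    static boolean matchesSentence(String input) {
        return SENTENCE.matcher(input).matches();
    }

    public static void main(String[] args) {
        // Input mapped to whether the regex is expected to match it.
        Map<String, Boolean> cases = new LinkedHashMap<>();
        cases.put("(a) This is test", true);
        cases.put("This is (a) test", true);
        cases.put("This is test (a)", true);
        cases.put("This is a test", false); // no parentheses, so this regex rejects it

        for (Map.Entry<String, Boolean> c : cases.entrySet()) {
            boolean actual = matchesSentence(c.getKey());
            System.out.println((actual == c.getValue() ? "PASS " : "FAIL ") + c.getKey());
        }
    }
}
```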
If you want to ensure that there are characters inside your parentheses, you should use +, which means one or more times
[a-z]+
The way it is, This is (a) (b) test will yield
Group1 captures This is (a)
Group2 captures (b)
Group3 captures test
because the .* in Group1 is greedy. If Group2 is expected to be (a), you should use a reluctant quantifier in Group1, i.e. (.*?)
Suggested test cases:
empty - really empty, can't have a bullet point empty.
foo(bar)baz
(foo)(bar)(baz)
(foo)bar(baz)
foo(bar)(baz)bing
foo(bar)baz(bing)
foo(bar)
(foo)bar
Your regex has a little problem.
You say in your definition that you have 3 groups, when in fact your pattern contains 2.
Using literal parentheses doesn't count as a group, so you'd need to use something like this:
"^(.*)?(\\([a-z]*\\))(.*)?$"
Or if you don't really want the parentheses, just the letters, you can change the order:
"^(.*)?\\(([a-z]*)\\)(.*)?$"
Other than that, it seems to be OK, but keep in mind that the lowercase letters between parentheses are not mandatory in your pattern.
In python:
r=re.compile(r'([^()]*)(\([a-z)(]*\))([^()]*)')
r.match('abc(xx)dd').groups()
('abc', '(xx)', 'dd')
r.match('abc(xx)(dd)dd').groups()
('abc', '(xx)(dd)', 'dd')
r.match('(abc)').groups()
('', '(abc)', '')
If you want the first and third group to contain all characters before and after the parentheses, you must make sure they exclude ( and ) (your .* will also match groups that contain parentheses, such as (foo)(bar) in your second example).
So I'd replace .* with this [^\\(\\)]*.
Also, if you want to match strings that contain many substrings of the second group (like in your second example), you should have * after the second group.
My result was this:
^([^\\(\\)]*)?(\\([a-z]*\\))*([^\\(\\)]*)?$
This will work for the first example and the second, but the second group will eventually store only the last one found - (bz).
If you want to be able to capture the second group 3 times like you said for your second example, you could try using while m.find() instead of if m.matches() (m is a Matcher object); and also change your regex a little to this:
([^\\(\\)]*)(\\([a-z]*\\))([^\\(\\)]*)
This should capture the second group for every possible match in your string - (foo), (bar), (bz).
Edit:
For some reason that I can't really explain, for me it doesn't find (foo), only the other two. So I wrote a piece of code that tries to apply find() with a parameter, explicitly starting from some position, where the last found group ended:
String regex = "([^\\(\\)]*)(\\([a-z]*\\))([^\\(\\)]*)";
String text = "(foo)(bar)(bz)";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(text);
for (int reg = 0; reg < text.length(); reg+=(m.end()-m.start()))
if (m.find(reg))
for (int group = 1; group <=m.groupCount(); group++)
System.out.println("Group "+group+": "+m.group(group));
This works, and the output is:
Group 1:
Group 2: (foo)
Group 3:
Group 1:
Group 2: (bar)
Group 3:
Group 1:
Group 2: (bz)
Group 3:
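For comparison, a plain while (m.find()) loop should also visit each parenthesized token, because every call to find() resumes where the previous match ended. The following is a sketch with the same regex and text; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindLoopDemo {
    private static final Pattern TOKEN =
            Pattern.compile("([^\\(\\)]*)(\\([a-z]*\\))([^\\(\\)]*)");

    // Collects group 2 (the parenthesized part) of every successive match.
    static List<String> parenthesized(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group(2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(parenthesized("(foo)(bar)(bz)")); // [(foo), (bar), (bz)]
    }
}
```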