Regex, lookbehind/lookahead with ".*" - java

This word has to be taken with the space behind it
word like this has to be taken too
If the word is like \gloss{word}, \(anything here)sezione{word}, \gloss{anything word anything}, \(anything here)sezione{anything word anything}, it must not be taken.
If the word inside is like \(anything but gloss or sezione){word} or \(anything but gloss or sezione){strings word strings}, it has to be taken.
Obviously aword, worda and aworda must not be taken.
(the bold word has been taken, word has not)
I have problems in not catching the word that is inside "{.... word .....}"
My guess was (?<!(sezione\{)|(gloss\{))(\b)( ?)word(\b)(?!.*\{}) so far, and I would have added a ".*" on the lookbehind and lookahead ( (?<!(sezione\{)|(gloss\{).*)[...] ) but like this it stops working.
If this matters, I plan to use Java's regex engine.
Thanks in advance
edit: the major problem is
\(anything here)sezione{anything word anything}
If I can NOT get this one, this should solve the whole problem

Let's establish a few hard facts about your use case:
Java's regex engine (like most) doesn't support lookbehind of unbounded length, such as one containing .*
Java's regex engine doesn't support the \K operator, which resets the start of the reported match
In the absence of those, you need a workaround that works in 3 steps:
Make sure the input matches the expected lookbehind pattern
If it does, remove the part of the String matched by the lookbehind pattern
In the resulting String, match and extract your search pattern
Consider the following code:
String str = "(anything here)sezione{anything word anything}";
// look behind pattern
String lookbehind = "^.*?(?:sezione|gloss|word)\\{";
// make sure input is matching lookbehind pattern first
if (str.matches(lookbehind + ".*$")) {
    // actual search pattern
    Pattern p = Pattern.compile("[^}]*?\\b(word)\\b");
    // search in replaced String
    Matcher m = p.matcher(str.replaceFirst(lookbehind, ""));
    if (m.find())
        System.out.println(m.group(1));
    //> word
}
PS: You may need to improve the code by checking indexes in the input String to find the starting point of the search pattern.
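For the requirement as stated in the question (take word unless it sits inside the braces of \gloss{...} or \...sezione{...}), another option that needs no lookbehind at all is to let one branch of an alternation swallow the unwanted constructs and capture the word only in the other branch. The following is a minimal sketch, not a drop-in solution. It assumes braces are never nested and that the text before "sezione" consists of word characters; the class and method names are made up for illustration, and the optional leading space from the question is not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordOutsideGloss {
    // Branch 1 swallows \gloss{...} and \<word chars>sezione{...} entirely.
    // Branch 2 captures a standalone "word" anywhere else.
    // Assumption: the braces are not nested.
    private static final Pattern P = Pattern.compile(
            "\\\\(?:gloss|\\w*sezione)\\{[^}]*\\}|\\b(word)\\b");

    // Returns the start offset of every "word" that should be taken.
    static List<Integer> findWords(String input) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = P.matcher(input);
        while (m.find()) {
            if (m.group(1) != null) { // null means branch 1 matched
                starts.add(m.start(1));
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        String s = "word \\gloss{a word b} aword \\textbf{word} \\subsezione{word}";
        // Only the first "word" and the one inside \textbf{...} are reported.
        System.out.println(findWords(s));
    }
}
```

Because the unwanted constructs are consumed as whole matches, the engine never gets a chance to find word inside them.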


Java find value in a string using regex

I'm wondering about the behavior of using the matcher in java.
I have a pattern which I compiled, and when running through the results of the matcher I don't understand why a specific value is missing.
My code:
String str = "star wars";
Pattern p = Pattern.compile("star war|Star War|Starwars|star wars|star wars|pirates of the caribbean|long strage trip|drone|snatched (2017)");
Matcher matcher = p.matcher(str);
while (matcher.find()) {
    System.out.println("\nRegex : " + matcher.group());
}
I get hit with "star war" which is right as it is in my pattern.
But I don't get "star wars" as a hit and I don't understand why as it is part of my pattern.
The behavior is expected because alternation in NFA regex is "eager", i.e. the first match wins, and the rest of the alternatives are not even tested against. Also, note that once a regex engine finds a match in a consuming pattern (and yours is a consuming pattern, it is not a zero-width assertion like a lookahead/lookbehind/word boundary/anchor) the index is advanced to the end of the match and the next match is searched for from that position.
So, once your first star war alternative branch matches, there is no way to match star wars as the regex index is before the last s.
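The effect of alternative order can be seen in a small experiment. This is only an illustrative sketch; the class and the helper firstMatch are made-up names, not part of the question's code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationOrderDemo {
    // Returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Shorter alternative listed first: it wins and the trailing "s" is left over.
        System.out.println(firstMatch("star war|star wars", "star wars")); // star war
        // Longer alternative listed first: now the full text is matched.
        System.out.println(firstMatch("star wars|star war", "star wars")); // star wars
    }
}
```

So if you do stay with one big alternation, listing longer alternatives before their prefixes changes which one is reported.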
Just check whether the string contains each of the strings you are looking for. The simplest approach is a loop:
String str = "star wars";
String[] arr = {"star war","Star War","Starwars","star wars","pirates of the caribbean","long strage trip","drone","snatched (2017)"};
for (String s : arr) {
    if (str.contains(s))
        System.out.println(s);
}
See the Java demo
By the way, your regex contains snatched (2017), and it does not match ( and ), it only matches snatched 2017. To match literal parentheses, the ( and ) must be escaped. I also removed a dupe entry for star wars.
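To illustrate the point about parentheses, here is a small sketch comparing the unescaped pattern with two ways of making the parentheses literal: escaping by hand and Pattern.quote. The class and the helper found are hypothetical names used only for this example.

```java
import java.util.regex.Pattern;

class LiteralParenthesesDemo {
    // True if regex matches somewhere inside text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String title = "snatched (2017)";
        // Unescaped: ( and ) form a group, so the regex means "snatched 2017".
        System.out.println(found("snatched (2017)", title));     // false
        // Escaped by hand.
        System.out.println(found("snatched \\(2017\\)", title)); // true
        // Pattern.quote turns the whole string into a literal pattern.
        System.out.println(found(Pattern.quote(title), title));  // true
    }
}
```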
A better way to build your regex would be like this:
String pattern = "[Ss]tar[\\s]{0,1}[Ww]ar[s]{0,1}";
Breaking down:
[Ss]: it will match either S or s in the first position
\s: matches a whitespace character
{0,1}: the previous character (or set) will be matched from 0 to 1 times
An alternative is:
String pattern = "[Ss]tar[\\s]?[Ww]ar[s]?";
?: the previous character (or set) will be matched once or not at all
For more information, see https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
Edit 1: fixed typo (\s -> \\s). Thanks, @eugene.
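A quick way to check which spellings such a pattern accepts is to run it with matches() over a few candidates. This is a sketch only; it uses the shorter but equivalent form \s? and s? instead of [\s]? and [s]?, and the class and method names are made up.

```java
import java.util.regex.Pattern;

class StarWarsPatternDemo {
    // Optional whitespace between the words, optional trailing "s".
    static final Pattern STAR_WARS = Pattern.compile("[Ss]tar\\s?[Ww]ars?");

    static boolean isStarWars(String s) {
        return STAR_WARS.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] candidates = {"star wars", "Starwars", "Star War", "star war", "stars war"};
        for (String c : candidates) {
            System.out.println(c + " -> " + isStarWars(c));
        }
    }
}
```

Note that the pattern also accepts mixed spellings such as "star Wars", which the original list of alternatives did not contain.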
You want to match the whole input sequence, so you should use Matcher.matches() or add ^ and $:
Pattern p = Pattern.compile("^(star war|Star War|Starwars|star wars|"
+ "star wars|pirates of the caribbean)$");
will print
Regex : star wars
But I agree with @NAMS: Don't build your regex like this.
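The difference between find() and matches() on the same alternation can be sketched as follows. The shortened title list and the names are chosen only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeInputMatchDemo {
    static final Pattern TITLES = Pattern.compile(
            "star war|Star War|Starwars|star wars|pirates of the caribbean");

    // matches() requires the whole input to match, so the engine backtracks
    // into later alternatives until one of them covers the entire string.
    static boolean isKnownTitle(String s) {
        return TITLES.matcher(s).matches();
    }

    // find() is satisfied by the first alternative that matches anywhere.
    static String firstHit(String s) {
        Matcher m = TITLES.matcher(s);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstHit("star wars"));     // star war
        System.out.println(isKnownTitle("star wars")); // true
    }
}
```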

Regex: Match group if present otherwise ignore and proceed with other matches

I have been trying to match a regex pattern within the following data:
String:
TestData to 1colon delimiter list has 1 rows.Set...value is: 1 Save Error: 267467374736437-TestInfo send Error
Words to match:
TestData
267467374736437-TestInfo
Regex pattern I'm using:
(.+?\s)?.*(\s\d+-.*?\s)?
The scenario here is that the 2nd match (267467374736437-TestInfo) can be absent from the string to be matched. So I want it to be matched if it exists, and otherwise proceed with the other matches. That is why I added the zero-or-one quantifier ? to the group above. But then it ignores the 2nd group altogether.
If I use the below pattern:
(.+?\s)?.*(\s\d+-.*?\s)
It matches just fine, but fails if the string "267467374736437-TestInfo" is absent from the input, because the group no longer has the "?" quantifier.
Please help me understand where it is going wrong.
I would rather not use a complex regex, which will be ugly and a maintenance nightmare. Instead, one simple way would be to just split the string and grab the first term, and then use a smart regex to pinpoint the second term.
String input = "TestData to 1colon delimiter list has 1 rows.Set...value is: 1 Save Error: 267467374736437-TestInfo send Error";
String first = input.split(" ")[0];
String second = input.replaceAll(".*Save Error:\\s(\\S+).*", "$1");
Explore the regex:
Regex101
An optional pattern at the end will almost never be matched if a more generic pattern precedes it. In your case, the greedy dot .* grabs the whole rest of the line up to the end, and since the last pattern is optional, the regex engine calls it a day and does not try to accommodate any text for it.
If you had a lazy dot .*?, the only position where it would work is right after the preceding subpattern, which is rarely the case.
Thus, you can only rely on a tempered greedy token:
^(\S+)(?:(?!\d+-\S).)*(\d+-\S+)?
See the regex demo.
Or an unrolled version:
^(\S+)\D*(?:\d(?!\d*-\S)\D*)*(\d+-\S+)?
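In Java, the tempered greedy token version could be used roughly like this. It is a sketch under the assumption that each input is a single line; the class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionalTailDemo {
    // Group 1: first token. Group 2: optional "digits-text" token.
    static final Pattern P = Pattern.compile("^(\\S+)(?:(?!\\d+-\\S).)*(\\d+-\\S+)?");

    // Returns {first, second}; second is null when the optional part is absent.
    static String[] extract(String line) {
        Matcher m = P.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String full = "TestData to 1colon delimiter list has 1 rows.Set...value is: 1 "
                + "Save Error: 267467374736437-TestInfo send Error";
        String[] r = extract(full);
        System.out.println(r[0] + " / " + r[1]);
        r = extract("TestData to 1colon delimiter list has 1 rows.");
        System.out.println(r[0] + " / " + r[1]);
    }
}
```

The tempered dot stops right before the first position where \d+-\S can match, which is what gives the optional group a chance to capture.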

Regex in Java: Capture last {n} words

Hi, I am trying to write a regex in Java that captures the last {n} words (there may be a variable number of whitespace characters between words). The requirement is that it has to be done with a regex.
So e.g. in
The man is very tall.
For n = 2, I need to capture
very tall.
So I tried
(\S*\s*){2}$
But this does not match in java because the initial words have to be consumed first. So I tried
^(.*)(\S*\s*){2}$
But .* consumes everything, and the last 2 words are ignored.
I have also tried
^\S?\s?(\S*\s*){2}$
Anyone know a way around this please?
You had almost got it in your first attempt.
Just change + to *.
The plus sign means at least one character; because there is no whitespace after the last word, the match failed.
The asterisk, on the other hand, means zero or more, so it will work.
See it live here: (?:\S*\s*){2}$
Using replaceAll method, you could try this regex: ((?:\\S*\\s*){2}$)|.
Your regex contains, as you already mention, a greedy subpattern that eats up the whole string, and since (\S*\s*){2} can match an empty string, it matches an empty location at the end of the input string.
Lazy dot matching (changing .* to .*?) won't do the whole job since the capturing group is quantified, and the Matcher.group(1) will be set to the last captured non-whitespaces with optional whitespaces. You need to set the capturing group around the quantified group.
Since you most likely are using Matcher#matches, you can use
String str = "The man is very tall.";
Pattern ptrn = Pattern.compile("(.*?)((?:\\S*\\s*){2})"); // no need for `^`/`$` with matches()
Matcher matcher = ptrn.matcher(str);
if (matcher.matches()) { // Group 2 contains the last 2 "words"
    System.out.println(matcher.group(2)); // => very tall.
}
See IDEONE demo
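Since the question asks about a general {n}, the same pattern can be built with the count as a parameter. This is a sketch that assumes single-line input (the dot does not cross line breaks by default); lastWords is a hypothetical helper name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastWordsDemo {
    // Returns the last n whitespace-separated "words" of text, keeping the
    // original spacing between them, or null if the pattern does not match.
    static String lastWords(String text, int n) {
        Pattern p = Pattern.compile("(.*?)((?:\\S*\\s*){" + n + "})");
        Matcher m = p.matcher(text);
        return m.matches() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(lastWords("The man is very tall.", 2));
        System.out.println(lastWords("The man  is   very    tall.", 3));
    }
}
```

Because the leading group is lazy, group 2 ends up as the longest suffix that still consists of at most n word-plus-whitespace chunks.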

RegEx to find the word between last Upper Case word and another word

My problem is to find a word between two words. Of these two words, one is an all UPPER CASE word which can be anything and the other word is "is". I tried a few regexes, but none of them helped. Here is my example:
String :
In THE house BIG BLACK cat is very good.
Expected output :
cat
RegEx used :
(?<=[A-Z]*\s)(.*?)(?=\sis)
The above RegEx gives me BIG BLACK cat as output whereas I just need cat.
One solution is to simplify your regular expression a bit,
[A-Z]+\s(\w+)\sis
and use only the matched group (i.e., \1). See it in action here.
Since you came up with something more complex, I assume you understand all the parts of the above expression but for someone who might come along later, here are more details:
[A-Z]+ will match one or more upper-case characters
\s will match a space
(\w+) will match one or more word characters ([a-zA-Z0-9_]) and store the match in the first match group
\s will match a space
is will match "is"
My example is very specific and may break down for different input. Your question didn't provide many details about what other inputs you expect, so I'm not confident my solution will work in all cases.
Try this one:
String TestInput = "In THE house BIG BLACK cat is very good.";
Pattern p = Pattern.compile(
        "(?<=\\b\\p{Lu}+\\s) # lookbehind assertion to ensure a uppercase word before\n"
        + "\\p{L}+ # matching at least one letter\n"
        + "(?=\\sis) # lookahead assertion to ensure a whitespace is ahead\n",
        Pattern.COMMENTS);
Matcher m = p.matcher(TestInput);
if (m.find())
    System.out.println(m.group(0));
it matches only "cat".
\p{L} is a Unicode property for a letter in any language.
\p{Lu} is a Unicode property for an uppercase letter in any language.
You want to look for a condition that depends on several pieces of information and then only retrieve a specific part of that information. That is not possible in a regex without grouping. In Java you should do it like this:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("[A-Z]+\\s(\\w+)\\sis");
        Matcher matcher = pattern.matcher("In THE house BIG BLACK cat is very good.");
        if (matcher.find())
            System.out.println(matcher.group(1));
    }
}
The group(1) is the one with parentheses around it, in this case \w+, and that's your word. The return type of group() is String, so you can use it right away.
The following part has a strange behavior
(?<=[A-Z]*\s)(.*?)
For some reason [A-Z]* is matching an empty string, and (.*?) is matching BIG BLACK. With a few tweaks, I think the following will work (but it still matches some false positives):
(?<=[A-Z]+\s)(\w+)(?=\sis)
A slightly better regex would be:
(?<=\b[A-Z]+\s)(\w+)(?=\sis)
Hope it helps
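If the lookbehind version is to be used from Java, one cautious variant is to give the uppercase word an explicit maximum length, because Java accepts a lookbehind when it can determine a maximum length for it. The limit of 20 letters is an arbitrary assumption made for this sketch, and the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundedLookbehindDemo {
    // {1,20} instead of + keeps the lookbehind length bounded.
    // Assumption: the uppercase word has at most 20 letters.
    static final Pattern P = Pattern.compile("(?<=\\b[A-Z]{1,20}\\s)\\w+(?=\\sis)");

    // Returns the word between an all-uppercase word and "is", or null.
    static String wordBeforeIs(String s) {
        Matcher m = P.matcher(s);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(wordBeforeIs("In THE house BIG BLACK cat is very good.")); // cat
    }
}
```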
String m = "In THE house BIG BLACK cat is very good.";
Pattern p = Pattern.compile("[A-Z]+\\s\\w+\\sis");
Matcher m1 = p.matcher(m);
if (m1.find()) {
    String[] group = m1.group().split("\\s"); // split by space
    System.out.println(group[1]); // print the 2nd element
}

Regex in Java: Capturing string up to a certain group of words

I'm trying to capture a string up until a certain word that is within some group of words.
I only want to capture the string up until the FIRST instance of one of these words, as they may appear many times in the string.
For example:
Group of words: (was, in, for)
String = "Once upon a time there was a fox in a hole";
would return "Once upon a time there"
Thank you
What you need is called a Lookahead. The exact regex for your situation is:
/^.+?(?=(?:was)|(?:in)|(?:for))/
Here, ^ matches the beginning of the string, .+? is a lazy match (it will match the shortest possible string), (?= ... ) means "followed by", and (?: ... ) is a non-capturing group, which may or may not be necessary for you.
For bonus points, you should probably be using word boundaries to make sure you're matching the whole word, instead of a substring ("The fox washed" would return "The fox "), and a leading \s* in the lookahead to kill the trailing space in the match:
/^.+?(?=\s*\b(?:was)|(?:in)|(?:for)\b)/
Where \s* matches any amount of white space (including none at all) and \b matches the beginning or end of a word. It's a Zero-Width assertion, meaning it doesn't match an actual character.
Or, in Java:
Pattern p = Pattern.compile("^.+?(?=\\s*\\b(?:was)|(?:in)|(?:for)\\b)");
I think that will work. I haven't used it, but according to the documentation, that exact string should work. Just had to escape all the backslashes.
Edit
Here I am, more than a year later, and I just realized the regex above does not do what I thought it did at the time. Alternation has the lowest precedence, rather than the highest, so \s*\b applied only to the first alternative and the trailing \b only to the last. The pattern is more correctly written as:
/^.+?(?=\s*\b(?:was|in|for)\b)/
Compare this new regex to my old one. Additionally, future travelers, you may wish to capture the whole string if no such breaker word exists. Try THIS on for size:
/^(?:(?!\s*\b(?:was|in|for)\b).)*/
This one uses a NEGATIVE lookahead (which asserts a match that fails the pattern). It's possibly slower, but it still does the job. See it in action here.
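A Java rendering of the negative-lookahead pattern might look like the sketch below. The class and method names are made up for illustration, and the word list is hard-coded as in the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BeforeBreakerDemo {
    // Consumes characters one at a time as long as no breaker word
    // (optionally preceded by whitespace) starts at the current position.
    static final Pattern P = Pattern.compile("^(?:(?!\\s*\\b(?:was|in|for)\\b).)*");

    // Returns the text before the first breaker word, or the whole
    // (single-line) string if no breaker word occurs.
    static String beforeBreaker(String s) {
        Matcher m = P.matcher(s);
        return m.find() ? m.group() : "";
    }

    public static void main(String[] args) {
        System.out.println(beforeBreaker("Once upon a time there was a fox in a hole"));
        System.out.println(beforeBreaker("A fox within reach"));
    }
}
```

The second call prints the input unchanged, because "within" only contains "in" as a substring and the word boundaries reject it.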
You can use this code to capture the string before a terminating word:
Pattern p = Pattern.compile("^(.*?)((\\b(was|in|for)\\b)|$)");
String s = "Once upon a time there was a fox in a hole";
Matcher m = p.matcher(s);
if (m.find()) {
    System.out.println(m.group(1));
}
This code produces the following output (link):
Once upon a time there
Here is how this expression works: the (\\b(was|in|for)\\b) means "any of the words listed in the inner parentheses, when they appear on word boundaries". The expression just outside allows for $ in order to capture something even if none of the terminating words appear in the source.
A very simple way to handle this is to just split the string with a regex and keep the first thing returned:
String str = "Once upon a time there was a fox in a hole";
String match = str.split("(was|in|for)")[0];
// match = "Once upon a time there "
In this example, match will either contain the first part of the string before the first matched word or, in the case of a string where the word wasn't found it will contain the entire string.
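One caveat with the plain split pattern is that it also cuts inside longer words and keeps the trailing space. A possible refinement, sketched here with hypothetical names, adds word boundaries, absorbs the whitespace before the breaker word, and passes a limit of 2 so the string is split only once.

```java
class SplitBeforeBreakerDemo {
    // Splits at the first whole breaker word and keeps what precedes it.
    static String beforeBreaker(String s) {
        return s.split("\\s*\\b(?:was|in|for)\\b", 2)[0];
    }

    public static void main(String[] args) {
        System.out.println(beforeBreaker("Once upon a time there was a fox in a hole"));
        // Without word boundaries, "within" is cut at its embedded "in".
        System.out.println("A fox within reach".split("(was|in|for)")[0]); // A fox with
        System.out.println(beforeBreaker("A fox within reach"));           // A fox within reach
    }
}
```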
String s = "Once upon a time there was a fox in the hole";
String[] taboo = {"was", "in", "for"};
for (int i = 0; i < taboo.length; i++) {
    // cut the string at the first occurrence of each taboo word
    if (s.indexOf(taboo[i]) > -1) {
        s = s.substring(0, s.indexOf(taboo[i]));
    }
}
System.out.print(s);
works on my computer..
