How to Match a String Against Multiple Regex Patterns in Java - java

I understand how to match a single String against multiple regex patterns using the pipe symbol as explained in some of the answers to this question: Match a string against multiple regex patterns
My question is that when I have the following String:
this_isAnExample of What nav-input a-autoid-9-announce thisIsAnExampleToo
And I use the following regex to extract text:
[A-Z][a-z]*|(?<=_)[A-Za-z-]*
I am expecting to get the following matches:
is
An
Example
What
Is
An
Example
Too
But I actually get is:
isAnExample
What
Is
An
Example
Too
Basically the engine is automatically linking the word An with Example bec it matches the underscore pattern but I want it to treat them as two words (non greedy?) bec according to the other pattern there is another match.

You probably ment the regex to be
[A-Z][a-z]*|(?<=_)[a-z-]*
The first part being lowercase word starting with uppercase letter, or the second: lowercase word preceded by underscore.
The part of your posted regex (?<=_)[A-Za-z-]* matches lower and upper case letters after underscore, i.e. does not stop matching when uppercase letter met, which should be in fact start of another word.

You can use this alternation regex to capture all the lower case text that is wither preceded by _ OR mixed case text:
((?<=_)[a-z][a-z-]*|[A-Z][a-z]*)
RegEx Demo

Related

Why does Java-regex matches underscore? [duplicate]

This question already has answers here:
Java RegEx meta character (.) and ordinary dot?
(9 answers)
Closed 2 years ago.
I was trying to match the URL pattern string.string. for any number of string. using ^([^\\W_]+.)([^\\W_]+.)$ as a first attempt, and it works for matching two consecutive patterns. But then, when I generalize it to ^([^\\W_]+.)+$ stops working and matches the wrong pattern "string.str_ing.".
Do you know what is incorrect with the second version?
You need to escape your . character, else it will match any character including _.
^([^\\W_]+\.?)+$
this can be your generalised regex
With ^([^\\W_]+.)([^\\W_]+.)$ you match any two words with restricted set of characters. Although, you have not escaped the ., it still works as long as the first word is matched first string, then any literal (that's what unescaped . means) and then string again.
In the latter one the unescaped dot (.) is a part of the capturing group occurring at least once (since you use +), therefore it allows any character as a divisor. In other words string.str_ing. is understood as:
string as the 1st word
str as the 2nd word
ing as the 3rd word
... as long as the unescaped dot (.) allows any divisor (both . literally and _).
Escape the dot to make the Regex work as intented (demo):
^([^\\W_]+\.)+$
[^\W] seems a weird choice - it's matching 'not not-a-word-character'. I haven't thought it through, but that sounds like it's equivalent to \w, i.e., matching a word character.
Either way, with ^\W and \w, you're asking to match underscores - which is why it matches the string with the underscore. "Word characters" are uppercase alphabetics, lowercase alphabetics, digits, and underscore.
You probably want [a-z]+ or maybe [A-Za-z0-9]+

Reg Ex strictly match word start with a pattern

I'm trying to extract a text after a sequence. But I have multiple sequences. the regex should ideally match first occurrence of any of these sequences.
my sequences are
PIN, PIN :, PIN IN, PIN IN:, PIN OUT,PIN OUT :
So I came up with the below regex
(PIN)(\sOUT|\sIN)?\:?\s*
It is doing the job except that the regex is also matching strings like
quote lupin in, pippin etc.
My question is how can I strictly select the string that match the pattern being the whole word
note: I tried ^(PIN)(\sOUT|\sON)?\:?\s* but of no use.
I'm new to java, any help is appreciated
It’s always recommended to have the documentation at hand when using regular expressions.
There, under Boundary matchers we find:
\b          A word boundary
So you may use the pattern \bPIN(\sOUT|\sIN)?:?\s* to enforce that PIN matches at the beginning of a word only, i.e. stands at the beginning of a string/line or is preceded by non-word characters like space or punctuation. A boundary only matches a position, rather than characters, so if a preceding non-word character makes this a word boundary, the character still is not part of the match.
Note that the first (…) grouping was unnecessary for the literal match PIN, further the colon : has no special meaning and doesn’t need to be escaped.

How to match a substring following after a string satisfying the specific pattern

Imagine, that I have the string 12.34some_text.
How can I match the substring following after the second character (4 in my case) after the . character. In that particular case the string I want to match is some_text.
For the string 56.78another_text it will be another_text and so on.
All accepted strings have the pattern \d\d\.\d\d\w*
If you wish to match everything from the second character after a specific one (i.e. the dot) you can use a lookbehind, like this:
(?<=[.]\d{2})(\w*)
demo
(?<=[.]\d{2}) is a positive lookbehind that matches a dot [.] followed by two digits \d{2}.
Since you are using java and the given pattern is \d\d\.\d\d\w* you will get some_text from 12.34some_textby using
String s="12.34some_text";
s.substring(5,s.length());
and you can compare the substring!

match whole sentence with regex

I'm trying to match sentences without capital letters with regex in Java:
"Hi this is a test" -> Shouldn't match
"hi thiS is a test" -> Shouldn't match
"hi this is a test" -> Should match
I've tried the following regex, but it also matches my second example ("hi, thiS is a test").
[a-z]+
It seems like it's only looking at the first word of the sentence.
Any help?
[a-z]+ will match if your string contains any lowercase letter.
If you want to make sure your string doesn't contain uppercase letters, you could use a negative character class: ^[^A-Z]+$
Be aware that this won't handle accentuated characters (like É) though.
To make this work, you can use Unicode properties: ^\P{Lu}+$
\P means is not in Unicode category, and Lu is the uppercase letter that has a lowercase variant category.
^[a-z ]+$
Try this.This will validate the right ones.
It's not matching because you haven't used a space in the match pattern, so your regex is only matching whole words with no spaces.
try something like ^[a-z ]+$ instead (notice the space is the square brackets) you can also use \s which is shorthand for 'whitespace characters' but this can also include things like line feeds and carriage returns so just be aware.
This pattern does the following:
^ matches the start of a string
[a-z ]+ matches any a-z character or a space, where 1 or more exists.
$ matches the end of the string.
I would actually advise against regex in this case, since you don't seem to employ extended characters.
Instead try to test as following:
myString.equals(myString.toLowerCase());

Using regex to match beginning and end of string [Java]

I have a list of files in a folder:
maze1.in.txt
maze2.in.txt
maze3.in.txt
I've used substring to remove the .txt extensions.
How do I use regex to match the front and the back of the file name?
I need it to match "maze" at the front and ".in" at the back, and the middle must be a digit (can be single or double digit).
I've tried the following
if (name.matches("name\\din")) {
//dosomething
}
It doesn't match anything. What is the correct regex expression to use?
I'm a little confused what you are asking for in particular
^(maze[0-9]*\.in)$
This will match maze(any number).in
^(maze[0-9]*\.in)\.txt$
this will match maze(any number).in.txt -- excludes the .txt NO NEED FOR USING SUB STRING!
Edit live on Debuggex
The think i would be wary about as of right now is the capture groups... I'm not particularly sure what you are doing with this regex. However, I believe explaining capture groups could benefit you.
A capture group for instance is denoted by () this is basically store them in the pattern array and is a way to parse stuff.
example maze1.in.txt
So if you want to capture the entire line minus .txt i would use this ^(maze[0-9]*\.in\.txt)$
However, if I wanted to capture things separately I would do this ^(maze)([0-9]*)(\.in)\.txt$ this will exclude .txt but include maze, the number, and .in IN separate indexes of the pattern array.
Your original solution doesn't work because string "name" is not in your text. It is "maze".
You can try this
name.matches("maze\\d{1,2}\\.in")
d{1,2} is used to match a digit(can be single or double digit).
You need regex anchors that tell the regex to
start at the beginning: ^
and signal the end of the string: $
^maze[\d]{0,2}\.in$
or in Java:
name.matches("^maze[\\d]{0,2}\\.in$");
Also, your regex wasn't matching strings with a dot (.) which would not accept your examples given. You need to add \. to the regex to accept dots because . is a special character.
It is always good to think of what you are trying to do in english, before you create regular expressions.
You want to match a word maze followed by a digit, followed by a literal period . followed by another word.
word `\w` matches a word character
digit `\d` matches a single digit
period `\.` matches a literal period
word `\w` matches a word character
putting it all together into a single string you get (keep in mind the double backslash for the Java escape and the pluses to repeat the previous match one or more times):
"\\w+\\d\\.\\w+"
The above is the generic case for any file name in the format xxx1.yyy, if you wanted to match maze and in specifically, you can just add those in as literal strings.
"maze\\d+\\.in"
example: http://ideone.com/rS7tw1
name.matches("^maze[0-9]+\\.in\\.txt$")

Categories

Resources