Regex that matches certain trailings of words with certain length? - java

How can I write a regex that matches the trailing "e"s of every word, except the trailing "e"s of words with 2 or 3 letters?
Example:
abcdeeee: Full match for "eeee"
more: Full match for "e"
pie: No match
me: No match

Use a lookaround assertion:
e+(?<=\w{4})\b
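This matches a run of e characters only if the four characters immediately before the end of the run (the e's themselves included) are word characters, so the word must have at least 4 letters. The \b makes sure the run ends at a word boundary.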
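A quick way to check this against the examples from the question is a small Java program. This is only a sketch: the class name TrailingEDemo and the helper trailingEs are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class TrailingEDemo {
    // Run of 'e' at the end of a word; the lookbehind requires that the
    // four characters before the end of the run are word characters.
    private static final Pattern TRAILING_E = Pattern.compile("e+(?<=\\w{4})\\b");

    // Returns every match found in the input, in order of appearance.
    static List<String> trailingEs(String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = TRAILING_E.matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        for (String word : new String[] {"abcdeeee", "more", "pie", "me"}) {
            System.out.println(word + " -> " + trailingEs(word));
        }
    }
}
```

For the four example words this should give [eeee], [e], [] and [].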

Related

Regex match numbers with spaces but not without spaces

Trying to match a string of numbers with spaces in between while ignoring other strings of numbers without spaces in between them. I'd like to match 16 characters.
eg. Would like to match 12345 67890 1234 but NOT 1234567890123456
I have tried this:
[0-9 ]{16}
But this matches both sets of strings.
I used and corrected #Wiktor Stribiżew's regex, because the original regex also matches a space at the beginning and the end of the number.
Regex: \b(?![0-9]{16})\d[0-9 ]{14}\d\b
Details:
\b assert position at a word boundary (^\w|\w$|\W\w|\w\W)
(?![0-9]{16}) Negative lookahead: fails if the next 16 characters are all digits
[0-9 ] Matches a single character in the list: a digit 0-9 or a space
{14} Matches the previous token exactly 14 times
\d matches a digit (equal to [0-9])
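To try the pattern on the two strings from the question, something like the following should do. It is a minimal sketch: SpacedNumberDemo and containsSpacedNumber are hypothetical names, and the backslashes are doubled because the regex sits in a Java string literal.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class SpacedNumberDemo {
    // 16 characters, digit first and last, digits or spaces in between,
    // rejected by the lookahead when all 16 characters are digits.
    private static final Pattern SPACED_16 =
            Pattern.compile("\\b(?![0-9]{16})\\d[0-9 ]{14}\\d\\b");

    // True if the input contains at least one such 16-character number.
    static boolean containsSpacedNumber(String input) {
        return SPACED_16.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(containsSpacedNumber("12345 67890 1234")); // true
        System.out.println(containsSpacedNumber("1234567890123456")); // false
    }
}
```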
You can use this regex to enforce at least one space in between numbers:
\d+(?:\h+\d+)+
\d+: Match 1+ digits
(?:\h+\d+)+: Match 1+ groups of 1+ horizontal whitespace characters followed by 1+ digits
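In Java (8 or later, where \h is available) this could be tried roughly as follows. The class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class SpaceSeparatedDigitsDemo {
    // A digit group followed by one or more "whitespace + digit group" pairs.
    private static final Pattern DIGIT_GROUPS = Pattern.compile("\\d+(?:\\h+\\d+)+");

    // Collects every space-separated digit sequence found in the input.
    static List<String> findGroups(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = DIGIT_GROUPS.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // Only the first number contains spaces, so only it is reported.
        System.out.println(findGroups("12345 67890 1234 and 1234567890123456"));
    }
}
```

Note that this pattern does not check the total length of 16 characters; it only requires at least one gap between digit groups.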

Generate new attribute using regex in RapidMiner

I work with an Excel file that contains several sentences. I would like to generate a new attribute (I use the "Generate Attribute" operator) that returns true or false depending on whether the sentence contains some numbers with white space between them (e.g. 234 45 56). I have used the function "match nominal regex" (matches(sentences,"\d+\s+\d")) to do this. However, I ran into the problem that RapidMiner does not recognize the escape (\) character. How do I change my regex to make it work?
Some additional comments/examples:
My input sentences:
word word 123 345 6665 23456 54 word word word
word word word 12.3 34.5 6665 23.456 5.4 word word word
word word word 12,3 34,5 6665 23,456 5.4 word word word
word word word 12,3% 34,5% 6665% 23,456% 5.4% word word word
My output will be a new variable that is true or false, depending on whether the sentence contains such a chain of numbers.
I first thought of using the following regex to capture numbers: \d+[.,]?\d*\s+\d+[.,]?\d*.
You may express \d as [0-9] and \s as a space. Also, it seems you need to match the full line with matches, thus, add .*
matches(sentences,".*[0-9] +[0-9].*")
This matches any 0+ chars other than a newline (as many as possible), followed by a digit, 1+ spaces and a digit, and then again any 0+ chars other than a newline.
Also, try doubling the \ to match \d or \s (since the regex is Java flavor):
matches(sentences,".*\\d+\\s+\\d.*")
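The need for the surrounding .* can be seen in plain Java, assuming (as this answer does) that RapidMiner's matches behaves like Java's String.matches and has to cover the whole string. The class and method names below are hypothetical.

```java
// Hypothetical demo class; it uses plain Java, not the RapidMiner API.
class WholeStringMatchDemo {
    // Whole-string match: the .* on both sides lets the digit/space/digit
    // sequence appear anywhere inside the sentence.
    static boolean hasSpacedNumbers(String sentence) {
        return sentence.matches(".*\\d+\\s+\\d.*");
    }

    // Same core pattern without .*: only succeeds when the entire
    // string is "digits, whitespace, one digit".
    static boolean matchesWithoutWildcards(String sentence) {
        return sentence.matches("\\d+\\s+\\d");
    }

    public static void main(String[] args) {
        String sentence = "word 234 45 56 word";
        System.out.println(hasSpacedNumbers(sentence));           // true
        System.out.println(matchesWithoutWildcards(sentence));    // false
        System.out.println(hasSpacedNumbers("word 23445 word"));  // false
    }
}
```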

match whole sentence with regex

I'm trying to match sentences without capital letters with regex in Java:
"Hi this is a test" -> Shouldn't match
"hi thiS is a test" -> Shouldn't match
"hi this is a test" -> Should match
I've tried the following regex, but it also matches my second example ("hi thiS is a test").
[a-z]+
It seems like it's only looking at the first word of the sentence.
Any help?
[a-z]+ will match if your string contains any lowercase letter.
If you want to make sure your string doesn't contain uppercase letters, you could use a negative character class: ^[^A-Z]+$
Be aware that this won't handle accented characters (like É) though.
To make this work, you can use Unicode properties: ^\P{Lu}+$
\P{...} matches any character that is not in the given Unicode category, and Lu is the category of uppercase letters that have a lowercase variant.
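The difference between the two patterns shows up with an accented capital. Here is a small sketch; the class and method names are invented for the example, and \u00C9 is the Java escape for É.

```java
// Hypothetical demo class, not part of the original answer.
class NoUppercaseDemo {
    // Rejects only the ASCII capitals A-Z.
    static boolean asciiNoUpper(String s) {
        return s.matches("^[^A-Z]+$");
    }

    // Rejects any character in the Unicode category Lu (uppercase letter).
    static boolean unicodeNoUpper(String s) {
        return s.matches("^\\P{Lu}+$");
    }

    public static void main(String[] args) {
        System.out.println(asciiNoUpper("hi this is a test"));   // true
        System.out.println(asciiNoUpper("hi thiS is a test"));   // false
        System.out.println(asciiNoUpper("\u00C9cole"));          // true: É slips through
        System.out.println(unicodeNoUpper("\u00C9cole"));        // false
    }
}
```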
^[a-z ]+$
Try this. It will validate the right ones.
It's not matching because you haven't used a space in the match pattern, so your regex is only matching whole words with no spaces.
Try something like ^[a-z ]+$ instead (notice the space in the square brackets). You can also use \s, which is shorthand for 'whitespace characters', but be aware that this also includes things like line feeds and carriage returns.
This pattern does the following:
^ matches the start of a string
[a-z ]+ matches any a-z character or a space, where 1 or more exists.
$ matches the end of the string.
I would actually advise against regex in this case, since you don't seem to employ extended characters.
Instead, test it as follows:
myString.equals(myString.toLowerCase());
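Wrapped into a method, the check could look like the sketch below. Passing Locale.ROOT is my own addition, not part of the one-liner above: it keeps the result independent of the JVM's default locale (the Turkish dotless i is the usual example). The class and method names are hypothetical.

```java
import java.util.Locale;

// Hypothetical demo class, not part of the original answer.
class LowerCaseCheckDemo {
    // A string has no uppercase letters if lowercasing leaves it unchanged.
    // Locale.ROOT is an assumption added here to avoid locale-dependent results.
    static boolean hasNoUppercase(String s) {
        return s.equals(s.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(hasNoUppercase("Hi this is a test")); // false
        System.out.println(hasNoUppercase("hi thiS is a test")); // false
        System.out.println(hasNoUppercase("hi this is a test")); // true
    }
}
```

Unlike ^[a-z ]+$, this also accepts digits and punctuation, since they have no uppercase form.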

Extract string without last char if vowel

With regular expressions, how can I extract the whole word except the last character if it is a vowel?
Inputs:
ansia
bello
ansid
Expected output for each:
ansi
bell
ansid
This is what I tried, but it only works if I have a single vowel at the end:
^(.*[^aeiou])
Similar to what #Sotirios Delimanolis wrote in his comment, but using word boundaries so that it also works when there are multiple words in a line.
\b(\w+?)[aeiou]?\b
This works in the following way :
1) \b matches the start of a word. This will work for the first word on a line or a word preceded by a non-word character (a word character is any alphanumeric character or an underscore).
2) (\w+?) matches and captures the part of the word you care about.
2a) \w matches any word character.
2b) + makes the \w be matched one or more times
2c) ? makes the + match as few characters as possible. This is important because if there is a vowel at the end of the word we do not want to match it in the capturing group but instead let (3) take care of it.
3) [aeiou]? matches but does not capture a vowel character if one is present
3a) [aeiou] matches a vowel
3b) ? makes the [aeiou] be matched zero or one times
4) \b matches the end of the word. This will work for a word at the end of a line or a word followed by a non-word character.
You said that the tool you are using uses the Java regex implementation and ansid isn't working for you with my regex. I have tested it with pure Java and it seems to be working for me:
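import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Print the captured part (group 1) of every word in the input
Pattern pattern = Pattern.compile("\\b(\\w+?)[aeiou]?\\b");
Matcher matcher = pattern.matcher("ansia ansid cake cat dog");
while (matcher.find()) {
    System.out.println(matcher.group(1));
}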
prints
ansi
ansid
cak
cat
dog
Try the regex (\b[a-zA-Z]+?(?=[aeiou]\b))|(\b[a-zA-Z]+?[^aeiou]\b). This captures either a word ending in a consonant OR a word ending in a vowel, and omits the vowel at the end.
This pattern worked for me:
^(.*?)(?=[aeiou]$|$)
If the input can contain several words on one line, as pointed out below, use this pattern instead:
\b([a-z]+?)(?=[aeiou]\b|\b)
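As a rough check of the word-boundary version in Java, the following sketch collects group 1 for every word on a line. The names StripFinalVowelDemo and stems are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class StripFinalVowelDemo {
    // Lazily match lowercase letters until either a final vowel or the
    // end of the word follows; the lookahead itself consumes nothing.
    private static final Pattern STEM =
            Pattern.compile("\\b([a-z]+?)(?=[aeiou]\\b|\\b)");

    // Returns each word of the line without its final vowel, if it has one.
    static List<String> stems(String line) {
        List<String> result = new ArrayList<>();
        Matcher m = STEM.matcher(line);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(stems("ansia bello ansid")); // [ansi, bell, ansid]
    }
}
```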

Use of \b Boundary Matcher In Java

I am reading about boundary matchers in the Oracle documentation. I understand most of it, but I am not able to grasp the \b boundary matcher. Here is the example from the documentation.
To check if a pattern begins and ends on a word boundary (as opposed to a substring within a longer string), just use \b on either side; for example, \bdog\b
Enter your regex: \bdog\b
Enter input string to search: The dog plays in the yard.
I found the text "dog" starting at index 4 and ending at index 7.
Enter your regex: \bdog\b
Enter input string to search: The doggie plays in the yard.
No match found.
To match the expression on a non-word boundary, use \B instead:
Enter your regex: \bdog\B
Enter input string to search: The dog plays in the yard.
No match found.
Enter your regex: \bdog\B
Enter input string to search: The doggie plays in the yard.
I found the text "dog" starting at index 4 and ending at index 7.
In short, I am not able to understand how \b works. Can someone describe its usage and help me understand this example?
Thanks
\b is what you can call an "anchor": it will match a position in the input text.
More specifically, \b will match every position in the input text where:
there is no preceding character and the following character is a word character (any letter or digit, or an underscore);
there is no following character and the preceding character is a word character;
the preceding character is a word character and the following character is not; or
the following character is a word character and the preceding character is not.
For instance, the regex dog\b in the text "my dog eats" will match the position immediately after the g of dog (which is a word character) and before the following space (which is not).
Note that like all anchors, the fact that it matches a position means that it does not consume any input text.
Other anchors are ^, $, lookarounds.
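To see the positions for yourself, you can search for \b on its own and print where each zero-length match starts. This is just a sketch with made-up class and method names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class BoundaryPositionsDemo {
    // Returns the index of every position where \b matches.
    // Each match is empty, so start() and end() are the same.
    static List<Integer> boundaryPositions(String input) {
        List<Integer> positions = new ArrayList<>();
        Matcher m = Pattern.compile("\\b").matcher(input);
        while (m.find()) {
            positions.add(m.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        // Boundaries sit at both edges of "my", "dog" and "eats".
        System.out.println(boundaryPositions("my dog eats")); // [0, 2, 3, 6, 7, 11]
    }
}
```

Position 6 is the one described above: right after the g of dog and before the space.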
The docs don't seem to explain what exactly a word boundary is. Let me try:
\b matches a position between characters (so it doesn't match any text itself, it just asserts that a certain condition is met at the current position in the string). That condition is defined as:
There either is a character of the character set defined by \w (alphanumerics and underscore) before the current position or after the current position, but not both.
The inverse is true for \B - it matches iff \b doesn't match at the current position.
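Applied to the documentation example from the question, this can be checked with a few lines of Java. The helper firstMatchStart is a hypothetical name; it returns the start index of the first match, or -1 if there is none.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class DogBoundaryDemo {
    // Start index of the first match of regex in input, or -1 if none.
    static int firstMatchStart(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        String dog = "The dog plays in the yard.";
        String doggie = "The doggie plays in the yard.";
        // \b after "dog" needs a non-word character (or the end) next.
        System.out.println(firstMatchStart("\\bdog\\b", dog));     // 4
        System.out.println(firstMatchStart("\\bdog\\b", doggie));  // -1
        // \B after "dog" needs another word character next.
        System.out.println(firstMatchStart("\\bdog\\B", dog));     // -1
        System.out.println(firstMatchStart("\\bdog\\B", doggie));  // 4
    }
}
```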
\b matches the empty string at the beginning or end of a word.
The metacharacter \b is an anchor like the caret and the dollar sign.
It matches at a position that is called a "word boundary". This match is zero-length.
\B is opposite of \b
\B matches the empty string not at the beginning or end of a word.
For \b, if there is a 'word' char at one side of \b, there must be a not-'word' char at other side.
For \B, if there is a 'word' char at one side, there must be a 'word' char too at other side. If there is a not-'word' char at one side, there must be a not-'word' char too at other side.
The 'word' chars are A-Za-z0-9 and _; all others are not-'word' chars in the C locale.
Simply speaking, \b matches the position between a \w and \W (as in not \w) character,
and thus is the end or start of a Word. The end/start of String counts as \W here.
The most common \W characters you may find are:
Whitespace
Comma
Fullstop
Special Characters (§,$,%, [...])
But not the underscore, which is a word character
Anything not ASCII (Umlauts, Cyrillic, Arabic, [...])
\B is just the inverse match of \b
--> It matches the position, that \b does not match (eg. [\w][\w] OR [\W][\W])
