With regular expressions, how can I extract the whole word except the last character if it is a vowel?
Inputs:
ansia
bello
ansid
Expected output for each:
ansi
bell
ansid
This is what I tried, but it only works if I have a single vowel at the end:
^(.*[^aeiou])
This is similar to what @Sotirios Delimanolis wrote in his comment, but it uses word boundaries so it also works when a line contains multiple words.
\b(\w+?)[aeiou]?\b
This works in the following way :
1) \b matches the start of a word. This works for the first word on a line or for a word preceded by a non-word character (a word character is any letter, digit, or underscore).
2) (\w+?) matches and captures the part of the word you care about.
2a) \w matches any word character.
2b) + makes the \w be matched one or more times
2c) ? makes the + match as few characters as possible. This is important because if there is a vowel at the end of the word we do not want to match it in the capturing group but instead let (3) take care of it.
3) [aeiou]? matches but does not capture a vowel character if one is present
3a) [aeiou] matches a vowel
3b) ? makes the [aeiou] be matched zero or one times
4) \b matches the end of the word. This will work for a word at the end of a line or a word followed by a non-word character.
You said that the tool you are using uses the Java regex implementation and ansid isn't working for you with my regex. I have tested it with pure Java and it seems to be working for me:
Pattern pattern = Pattern.compile("\\b(\\w+?)[aeiou]?\\b");
Matcher matcher = pattern.matcher("ansia ansid cake cat dog");
while (matcher.find()) {
System.out.println(matcher.group(1));
}
prints
ansi
ansid
cak
cat
dog
Try the regex (\b[a-zA-Z]+?(?=[aeiou]\b))|(\b[a-zA-Z]+?[^aeiou]\b). This captures either a word ending in a consonant OR a word ending in a vowel, and omits the vowel at the end.
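As a quick check of the alternation above, here is a minimal sketch that runs it over the question's inputs. The class name TrailingVowelAlternation and the helper stripTrailingVowel are made up for illustration. It reads the whole match with group(), because in the first branch the vowel sits inside a lookahead and is never consumed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TrailingVowelAlternation {
    // Regex from the answer above, unchanged.
    private static final Pattern WORD = Pattern.compile(
            "(\\b[a-zA-Z]+?(?=[aeiou]\\b))|(\\b[a-zA-Z]+?[^aeiou]\\b)");

    // Hypothetical helper: returns every match found in the text.
    static List<String> stripTrailingVowel(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            out.add(m.group()); // whole match, whichever branch succeeded
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [ansi, bell, ansid]
        System.out.println(stripTrailingVowel("ansia bello ansid"));
    }
}
```

One caveat: [^aeiou] accepts any non-vowel, not only consonants, so for an input such as "a b" the second branch matches "a " including the space. Replacing it with a consonant-only class would avoid that if it matters for your data.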
this pattern worked for me
^(.*?)(?=[aeiou]$|$)
in case input is words that can be in a line as pointed out below
use this pattern
\b([a-z]+?)(?=[aeiou]\b|\b)
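To see both lookahead patterns side by side, here is a small sketch in Java. The class and method names (TrailingVowelLookahead, stripWhole, stripEachWord) are only illustrative; the regexes are the two from this answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TrailingVowelLookahead {
    // One word per input string.
    static final Pattern WHOLE_INPUT = Pattern.compile("^(.*?)(?=[aeiou]$|$)");
    // Several lowercase words in one line.
    static final Pattern PER_WORD = Pattern.compile("\\b([a-z]+?)(?=[aeiou]\\b|\\b)");

    // Hypothetical helper: strips one trailing vowel from a single-word input.
    static String stripWhole(String input) {
        Matcher m = WHOLE_INPUT.matcher(input);
        return m.find() ? m.group(1) : input;
    }

    // Hypothetical helper: strips one trailing vowel from each word in a line.
    static List<String> stripEachWord(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = PER_WORD.matcher(line);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(stripWhole("ansia"));                     // prints ansi
        System.out.println(stripWhole("ansid"));                     // prints ansid
        System.out.println(stripEachWord("ansia bello ansid"));      // prints [ansi, bell, ansid]
    }
}
```

The lazy quantifier does the work in both cases: the group grows one character at a time until the lookahead sees either a final vowel or the end of the word.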
I have a String like
String str = "305556710S or 100596269C OR CN111111111";
I only want to match the tokens in this string that start with digits, optionally ending with an English letter,
and then prefix each matched token with the two characters "??".
I wrote a Pattern like this:
Pattern pattern = Pattern.compile("^[0-9]{1,10}[A-Z]{0,1}", Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(str);
while (matcher.find()) {
int start = matcher.start();
int end = matcher.end();
String matchStr = matcher.group();
System.err.println(matchStr);
}
But it only matches the first token, "305556710S".
If I modify the Pattern to
Pattern pattern = Pattern.compile("[0-9]{1,10}[A-Z]{0,1}", Pattern.CASE_INSENSITIVE);
it matches "305556710S", "100596269C" and "111111111". But "111111111" is preceded by the letters "CN", so it is not something I want to match.
I only want to match "305556710S" and "100596269C" and add the two characters "??" before each match. Can somebody help me?
First, you should avoid the ^ in this particular regexp. As you noticed, you can't return more than one result, as "^" is an instruction for "match the beginning of the string"
Using \b can be a solution, but you may get invalid results. For example
305556710S or -100596269C OR CN111111111
The regexp "\\b[0-9]{1,10}[A-Z]{0,}\\b" will also match 100596269C in that string, because the hyphen is not a word character, so there is a word boundary between - and 1.
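The hyphen case is easy to reproduce. The following is just a sketch (BoundaryHyphenDemo and findAll are names I made up) that runs the \b pattern over the example string:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryHyphenDemo {
    // Hypothetical helper: returns all matches of regex in text.
    static List<String> findAll(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "305556710S or -100596269C OR CN111111111";
        // The token after the hyphen is still matched, because '-' is not a
        // word character and so \b holds between '-' and '1'.
        // prints [305556710S, 100596269C]
        System.out.println(findAll("\\b[0-9]{1,10}[A-Z]{0,}\\b", text));
    }
}
```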
The following regexp matches exactly what you want: all numbers, that may be followed by some English chars, either at the beginning of the string or after a space, and either followed by a space or at the end of the string.
(?<=^| )[0-9]{1,10}[A-Z]*(?= |$)
Explanations:
(?<=^| ) is a lookbehind. It makes sure that there is either ^ (string start) or a space behind actual location. Note that lookbehinds don't add matching chars to the result: the space won't be part of the result
[0-9]{1,10}[A-Z]* matches digits (at least one, up to ten), then zero or more letters.
(?= |$) is a lookahead. It makes sure that there will be either a space or $ (end of string) after this match. Like lookbehinds, the chars aren't added to the results and position remains the same : the space read here for example can also be read by the lookbehind of the next captured string
Examples: in 305556710S or 100596269C OR CN111111111 it matches at index 0 [305556710S] and at index 14 [100596269C]; 100596269C123 does not match.
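Since the question also asks for a "??" prefix, here is a minimal sketch of how the lookaround regex could be combined with replaceAll. The class LookaroundPrefixDemo and its two helpers are hypothetical; $0 in the replacement string stands for the whole match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundPrefixDemo {
    // Regex from the answer above, unchanged.
    static final Pattern TOKEN = Pattern.compile("(?<=^| )[0-9]{1,10}[A-Z]*(?= |$)");

    // Hypothetical helper: puts "??" in front of every matched token.
    static String addPrefix(String text) {
        return TOKEN.matcher(text).replaceAll("??$0");
    }

    // Hypothetical helper: start index of every match.
    static List<Integer> matchStarts(String text) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        String str = "305556710S or 100596269C OR CN111111111";
        System.out.println(matchStarts(str)); // prints [0, 14]
        // prints ??305556710S or ??100596269C OR CN111111111
        System.out.println(addPrefix(str));
        // No match, so the input comes back unchanged: prints 100596269C123
        System.out.println(addPrefix("100596269C123"));
    }
}
```

The pattern here is compiled without Pattern.CASE_INSENSITIVE; add the flag back if lowercase suffix letters should match too.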
I think you need to use word boundaries \b. Try this changed pattern:
"\\b[0-9]{1,10}[A-Z]{0,1}\\b"
This prints out:
305556710S
100596269C
Why it works:
The difference here is that it will check only those character sequences that are within a pair of word boundaries. In the earlier pattern you used, a character sequence even from the middle of a word may be used to match against the pattern due to which even 11111... from CN1111... was matched against the pattern and it passed.
A word boundary also matches the end of the string input. So, even if a candidate word appears at the end of the line, it will get picked up.
If more than one English letter can come at the end, then remove the maximum occurrence indicator, 1 in this case:
"\\b[0-9]{1,10}[A-Z]{0,}\\b"
How can I write a regex that matches the trailing "e" characters of every word, except for words with 2 or 3 letters?
Example:
abcdeeee: Full match for "eeee"
more: Full match for "e"
pie: No match
me: No match
Use a lookaround assertion:
e+(?<=\w{4})\b
This matches a run of e characters only if the four characters ending at the last e are all word characters, so the word must be at least four characters long. The \b makes sure the run ends at a word boundary.
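A short Java sketch to check the pattern against the four examples from the question (TrailingEDemo and trailingEs are illustrative names, not anything from the original post):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TrailingEDemo {
    // e+ grabs the run of e's; the lookbehind then checks that the four
    // characters ending here are word characters; \b pins the end of the word.
    static final Pattern TRAILING_E = Pattern.compile("e+(?<=\\w{4})\\b");

    // Hypothetical helper: returns every trailing-e match in the text.
    static List<String> trailingEs(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = TRAILING_E.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // "pie" and "me" are too short, so only two matches are found.
        // prints [eeee, e]
        System.out.println(trailingEs("abcdeeee more pie me"));
    }
}
```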
I am reading about boundary matchers in the Oracle documentation. I understand most of it, but I am not able to grasp the \b boundary matcher. Here is the example from the documentation.
To check if a pattern begins and ends on a word boundary (as opposed to a substring within a longer string), just use \b on either side; for example, \bdog\b
Enter your regex: \bdog\b
Enter input string to search: The dog plays in the yard.
I found the text "dog" starting at index 4 and ending at index 7.
Enter your regex: \bdog\b
Enter input string to search: The doggie plays in the yard.
No match found.
To match the expression on a non-word boundary, use \B instead:
Enter your regex: \bdog\B
Enter input string to search: The dog plays in the yard.
No match found.
Enter your regex: \bdog\B
Enter input string to search: The doggie plays in the yard.
I found the text "dog" starting at index 4 and ending at index 7.
In short, I am not able to understand how \b works. Can someone describe its usage and help me understand this example?
Thanks
\b is what you can call an "anchor": it will match a position in the input text.
More specifically, \b will match every position in the input text where:
there is no preceding character and the following character is a word character (any letter or digit, or an underscore);
there is no following character and the preceding character is a word character;
the preceding character is a word character and the following character is not; or
the following character is a word character and the preceding character is not.
For instance, the regex dog\b in the text "my dog eats" will match the position immediately after the g of dog (which is a word character) and before the following space (which is not).
Note that like all anchors, the fact that it matches a position means that it does not consume any input text.
Other anchors are ^, $, lookarounds.
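To make the "matches a position" idea concrete, here is a small sketch that lists every index where \b matches in the example text. WordBoundaryPositions and boundaries are made-up names; the input is plain ASCII, so the result does not depend on how a given JDK treats non-ASCII letters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBoundaryPositions {
    // Hypothetical helper: collects every index at which \b matches.
    static List<Integer> boundaries(String text) {
        List<Integer> positions = new ArrayList<>();
        Matcher m = Pattern.compile("\\b").matcher(text);
        while (m.find()) {
            positions.add(m.start()); // zero-length match: start() == end()
        }
        return positions;
    }

    public static void main(String[] args) {
        // Boundaries sit at the start and end of "my", "dog" and "eats".
        // prints [0, 2, 3, 6, 7, 11]
        System.out.println(boundaries("my dog eats"));
    }
}
```

Index 6 is the position right after the g of dog, which is the one the regex dog\b relies on.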
The docs don't seem to explain what exactly a word boundary is. Let me try:
\b matches a position between characters (so it doesn't match any text itself, it just asserts that a certain condition is met at the current position in the string). That condition is defined as:
There either is a character of the character set defined by \w (alphanumerics and underscore) before the current position or after the current position, but not both.
The inverse is true for \B - it matches iff \b doesn't match at the current position.
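With that definition, the four cases from the documentation can be replayed in a few lines. This is only a sketch; DogBoundaryDemo and search are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DogBoundaryDemo {
    // Hypothetical helper: "start-end" of the first match, or "no match".
    static String search(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.start() + "-" + m.end() : "no match";
    }

    public static void main(String[] args) {
        String dog = "The dog plays in the yard.";
        String doggie = "The doggie plays in the yard.";
        System.out.println(search("\\bdog\\b", dog));    // prints 4-7
        System.out.println(search("\\bdog\\b", doggie)); // prints no match
        System.out.println(search("\\bdog\\B", dog));    // prints no match
        System.out.println(search("\\bdog\\B", doggie)); // prints 4-7
    }
}
```

In "dog plays" the g is followed by a space, so the position after it is a boundary and \B fails there. In "doggie" the g is followed by another g, so \b fails and \B succeeds.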
\b matches the empty string at the beginning or end of a word.
The metacharacter \b is an anchor like the caret and the dollar sign.
It matches at a position that is called a "word boundary". This match is zero-length.
\B is opposite of \b
\B matches the empty string not at the beginning or end of a word.
For \b, if there is a 'word' char at one side of \b, there must be a not-'word' char at other side.
For \B, if there is a 'word' char at one side, there must be a 'word' char too at other side. If there is a not-'word' char at one side, there must be a not-'word' char too at other side.
In the C locale, the 'word' chars are A-Za-z0-9 and _; all others are not-'word' chars.
Simply speaking, \b matches the position between a \w and \W (as in not \w) character,
and thus is the end or start of a Word. The end/start of String counts as \W here.
The most common \W characters you may find are:
Whitespace
Comma
Fullstop
Special Characters (§,$,%, [...])
Anything not ASCII (Umlauts, Cyrillic, Arabic, [...])
Note that the underscore is not in this list: it is a word character (\w).
\B is just the inverse match of \b
--> It matches the position, that \b does not match (eg. [\w][\w] OR [\W][\W])
You can experiment with java regular expressions here
I need to filter the given text to get all words, including apostrophes (can't is considered a single word).
Para = "'hello' world '"
I am splitting the text using
String[] splits = Para.split("[^a-zA-Z']");
Expected output:
hello world
But it is giving:
'hello' world '
I get everything right, except a single apostrophe (') and 'hello' are not getting filtered by the above regex.
How can I filter these two things?
As far as I can tell, you're looking for a ' where either the next or previous character is not a letter.
The regex I came up with to do this, contained in some test code:
String str = "bob can't do 'well'";
String[] splits = str.split("(?:(?<=^|[^a-zA-Z])'|'(?=[^a-zA-Z]|$)|[^a-zA-Z'])+");
System.out.println(Arrays.toString(splits));
Explanation:
(?<=^|[^a-zA-Z])' - matches a ' where the previous character is not a letter, or we're at the start of the string.
'(?=[^a-zA-Z]|$) - matches a ' where the next character is not a letter, or we're at the end of the string.
[^a-zA-Z'] - not a letter or '.
(?:...)+ - one or more of any of the above (the ?: is just to make it a non-capturing group).
See this for more on regex lookaround ((?<=...) and (?=...)).
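Applied to the string from the question, the split leaves an empty first element, because the input starts with a separator (the leading '). A minimal sketch that filters those out, assuming Java 8 or later for the stream calls (ApostropheSplitDemo and words are illustrative names):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class ApostropheSplitDemo {
    // Separator regex from the test code above, unchanged.
    static final String SEPARATOR =
            "(?:(?<=^|[^a-zA-Z])'|'(?=[^a-zA-Z]|$)|[^a-zA-Z'])+";

    // Hypothetical helper: splits and drops empty strings, such as the one
    // produced when the input starts with a separator.
    static List<String> words(String text) {
        return Arrays.stream(text.split(SEPARATOR))
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(words("'hello' world '"));     // prints [hello, world]
        System.out.println(words("bob can't do 'well'")); // prints [bob, can't, do, well]
    }
}
```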
Simplification:
The regex can be simplified to the below by using negative lookaround:
"(?:(?<![a-zA-Z])'|'(?![a-zA-Z])|[^a-zA-Z'])+"
A Unicode version, without lookarounds:
String TestInput = "This voilà München is the test' 'sentence' that I'm willing to split";
String[] splits = TestInput.split("'?[^\\p{L}']+'?");
for (String t : splits) {
System.out.println(t);
}
\p{L} is matching a character with the Unicode property "Letter"
This splits on a non letter, non ' sequence, including a leading or trailing ' in the split.
Output:
This
voilà
München
is
the
test
sentence
that
I'm
willing
to
split
To handle leading and trailing ', just add them as alternatives
TestInput.split("'?[^\\p{L}']+'?|^'|'$")
If you define a word as a sequence that:
Must start and end with an English letter (a-zA-Z)
May contain apostrophe (') within.
Then you can use the following regex in Matcher.find() loop to extract matches:
[a-zA-Z](?:[a-zA-Z']*[a-zA-Z])?
Sample code:
Pattern p = Pattern.compile("[a-zA-Z](?:[a-zA-Z']*[a-zA-Z])?");
Matcher m = p.matcher(inputString);
while (m.find()) {
System.out.println(m.group());
}
Demo [1]
[1] The demo uses the PCRE regex flavor, but the result should not be different from Java for this regex
I have a regular expression,
end\\s+[a-zA-Z]{1}[a-zA-Z_0-9]
which is supposed to match a line with the specifications
end abcdef123
where abcdef123 must start with a letter, followed by alphanumeric characters.
However currently it is also matching this
foobar barfooend
bar fred bob
It's picking up that end at the end of barfooend and also picking up bar in effect returning end bar as a legitimate result.
I tried
^end\\s+[a-zA-Z]{1}[a-zA-Z_0-9]
but that doesn't seem to work at all. It ends up matching nothing.
It should be fairly simple but I can't seem to nut it out.
\s also matches newline characters. So you need a character class that either contains only the whitespace characters you want or excludes the ones you don't.
Use instead of \\s+ one of those:
[^\\S\r\n] this includes all whitespace but not \r and \n. See end[^\S\r\n]+[a-zA-Z][a-zA-Z_0-9]+ here on Regexr
[ \t] this includes only space and tab. See end[ \t]+[a-zA-Z][a-zA-Z_0-9]+ here on Regexr
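Here is a rough Java check of the difference, using the two-line input from the question. EndWhitespaceDemo and found are names I picked for the sketch; note the doubled backslashes needed inside Java string literals.

```java
import java.util.regex.Pattern;

class EndWhitespaceDemo {
    // Hypothetical helper: true if regex matches anywhere in text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String twoLines = "foobar barfooend\nbar fred bob";
        // \s+ happily crosses the line break: "end\nbar" matches.
        System.out.println(found("end\\s+[a-zA-Z][a-zA-Z_0-9]+", twoLines));           // prints true
        // Both replacement classes refuse the newline.
        System.out.println(found("end[^\\S\\r\\n]+[a-zA-Z][a-zA-Z_0-9]+", twoLines));  // prints false
        System.out.println(found("end[ \\t]+[a-zA-Z][a-zA-Z_0-9]+", twoLines));        // prints false
        // The intended line still matches.
        System.out.println(found("end[ \\t]+[a-zA-Z][a-zA-Z_0-9]+", "end abcdef123")); // prints true
    }
}
```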
You can use \b (word boundary detection) to check a word boundary. In our case we will use it to match the beginning of the word end. It can also be used to match the end of a word.
As @nhahtdh stated in his comment, the {1} is redundant because [a-zA-Z] already matches exactly one letter in the given range.
Also your regex does not do what you want because it only matches one alphanumeric character after the first letter. Add a + at the end (for one or more times) or * (for zero or more times).
This should work:
"\\bend\\s+[a-zA-Z]{1}[a-zA-Z_0-9]*"
Edit: I think \b is better than ^ because the latter only matches at the beginning of the input (or at the beginning of each line when Pattern.MULTILINE is set).
For example, take this input: "end azd123 end bfg456". There will be only one match with ^, whereas \b matches both.
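The ^ versus \b comparison can be sketched like this (EndBoundaryDemo and findAll are hypothetical names; the redundant {1} is left out of the patterns):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EndBoundaryDemo {
    // Hypothetical helper: returns all matches of regex in text.
    static List<String> findAll(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "end azd123 end bfg456";
        // prints [end azd123, end bfg456]
        System.out.println(findAll("\\bend\\s+[a-zA-Z][a-zA-Z_0-9]*", input));
        // prints [end azd123]
        System.out.println(findAll("^end\\s+[a-zA-Z][a-zA-Z_0-9]*", input));
        // \b also rejects the "end" inside "barfooend": prints []
        System.out.println(findAll("\\bend\\s+[a-zA-Z][a-zA-Z_0-9]*",
                "foobar barfooend\nbar fred bob"));
    }
}
```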
Try the regular expression:
end[ ]+[a-zA-Z]\w+
\w is a word character: [a-zA-Z_0-9]