regex whole word option - java

I have a problem about matching whole words in java, what I want to do is finding the start indices of each word in a given line
Pattern pattern = Pattern.compile("("+str+")\\b");
Matcher matcher = pattern.matcher(line.toLowerCase(Locale.ENGLISH));
if(matcher.find()){
//Doing something
}
I have a problem with this given case
line = "Watson has Watson's items.";
str = "watson";
I want to match with only the first watson here without matching the other one and i dont want my pattern to have some empty space control, what should i do in this case

The word boundary \b matches the location between a non-word and a word character (or the start/end before/after a word character). The ', -, +, etc. are non-word characters, so Watson\b will match in Watson's (partial match).
You might want to only match Watson if it is not enclosed with non-whitespace symbols:
Pattern p = Pattern.compile("(?<!\\S)" + str + "(?!\\S)");
To match Watson at the end of the sentence, you will need to allow matching before ., ? and !, use
Pattern p = Pattern.compile("(?<!\\S)" + str + "(?![^\\s.!?])");
See the regex demo
Just FYI: perhaps, it is a good idea to also use Pattern.quote(str) instead of plain str to avoid issues when your str contains special regex metacharacters.

Use find() method in matcher
Refer java docs

Related

How to match all combinations of numbers in a string that do not start with an English letter in regular matching in Java

I have a String like
String str = "305556710S or 100596269C OR CN111111111";
I just want to match the characters in this string that start with numbers or start with numbers and end with English letters,
Then prefix the matched characters add with two "??" characters.
I write a Patern like
Pattern pattern = Pattern.compile("^[0-9]{1,10}[A-Z]{0,1}", Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(str);
while (matcher.find()) {
int start = matcher.start();
int end = matcher.end();
String matchStr = matcher.group();
System.err.println(matchStr);
}
But it can only match the first character "305556710S".
But If I modify the Pattern
Pattern pattern = Pattern.compile("[0-9]{1,10}[A-Z]{0,1}", Pattern.CASE_INSENSITIVE);
It will matches "305556710S","100596269C","111111111".But the prefix of "111111111" is English character "CN" which is not my goal.
I only want match the "305556710S" and "100596269C" and add two "??" characters before the matched Characters.Can somebody help me ?
First, you should avoid the ^ in this particular regexp. As you noticed, you can't return more than one result, as "^" is an instruction for "match the beginning of the string"
Using \b can be a solution, but you may get invalid results. For example
305556710S or -100596269C OR CN111111111
The regexp "\\b[0-9]{1,10}[A-Z]{0,}\\b" will match 100596269C (because the hyphen is not word character, so there is a word boundary between - and 1)
The following regexp matches exactly what you want: all numbers, that may be followed by some English chars, either at the beginning of the string or after a space, and either followed by a space or at the end of the string.
(?<=^| )[0-9]{1,10}[A-Z]*(?= |$)
Explanations:
(?<=^| ) is a lookbehind. It makes sure that there is either ^ (string start) or a space behind actual location. Note that lookbehinds don't add matching chars to the result: the space won't be part of the result
[0-9]{1,10}[A-Z]* matches digits (at least one, up to ten), then one or more letters.
(?= |$) is a lookahead. It makes sure that there will be either a space or $ (end of string) after this match. Like lookbehinds, the chars aren't added to the results and position remains the same : the space read here for example can also be read by the lookbehind of the next captured string
Examples : 305556710S or 100596269C OR CN111111111 matches: at index 0 [305556710S], at index 15 [100596269C]; 100596269C123does not match.
I think you need to use word boundaries \b. Try this changed pattern:
"\\b[0-9]{1,10}[A-Z]{0,1}\\b"
This prints out:
305556710S
100596269C
Why it works:
The difference here is that it will check only those character sequences that are within a pair of word boundaries. In the earlier pattern you used, a character sequence even from the middle of a word may be used to match against the pattern due to which even 11111... from CN1111... was matched against the pattern and it passed.
A word boundary also matches the end of the string input. So, even if a candidate word appears at the end of the line, it will get picked up.
If more than one English alphabet can come at the end, then remove the max occurrence indicator, 1 in this case:
"\\b[0-9]{1,10}[A-Z]{0,}\\b"

Java String Split using Regex with Escape Character

I have a string which needs to be split based on a delimiter(:). This delimiter can be escaped by a character (say '?'). Basically the delimiter can be preceded by any number of escape character. Consider below example string:
a:b?:c??:d???????:e
Here, after the split, it should give the below list of string:
a
b?:c??
d???????:e
Basically, if the delimiter (:) is preceded by even number of escape characters, it should split. If it is preceded by odd number of escape characters, it should not split. Is there a solution to this with regex?
Any help would be greatly appreciated.
Similar question has been asked earlier here, But the answers are not working for this use case.
Update:
The solution with the regex: (?:\?.|[^:?])* correctly split the string. However, this also gives few empty strings. If + is given instead of *, even the real empty matches also ignored. (Eg:- a::b gives only a,b)
Scenario 1: No empty matches
You may use
(?:\?.|[^:?])+
Or, following the pattern in the linked answer
(?:\?.|[^:?]++)+
See this regex demo
Details
(?: - start of a non-capturing group
\?. - a ? (the delimiter) followed with any char
| - or
[^:?] - any char but the : (your delimiter char) and ? (the escape char)
)+ - 1 or more repetitions.
In Java:
String regex = "(?:\\?.|[^:?]++)+";
In case the input contains line breaks, prepend the pattern with (?s) (like (?s)(?:\\?.|[^:?])+) or compile the pattern with Pattern.DOTALL flag.
Scenario 2: Empty matches included
You may add (?<=:)(?=:) alternative to the above pattern to match empty strings between : chars, see this regex demo:
String s = "::a:b?:c??::d???????:e::";
Pattern pattern = Pattern.compile("(?>\\?.|[^:?])+|(?<=:)(?=:)");
Matcher matcher = pattern.matcher(s);
while (matcher.find()){
System.out.println("'" + matcher.group() + "'");
}
Output of the Java demo:
''
'a'
'b?:c??'
''
'd???????:e'
''
Note that if you want to also match empty strings at the start/end of the string, use (?<![^:])(?![^:]) rather than (?<=:)(?=:).

Regex to catch all the words and the "i'm you're etc" in Java

I am trying to split lines of a document, by creating a Pattern in Java.
The default Pattern in WordCount example is something like this: "\\s*\\b\\s*".
The problem with this pattern however, is that it splits everything to a single word, while I want to keep things such as (I'm, You're, it's) together. So far, what I've tried is [a-zA-Z]+'{0,1}[a-zA-Z]*,
the problem is that when I have a test string, for example:
Pattern BOUNDARY = "[a-zA-Z]+'{0,1}[a-zA-Z]*"
String test = "Hello i'm #£$#you ##can !!be.
and run
for(String word : BOUNDARY.split(test){
println(word)}
I get no results. Ideally, I want to get
Hello
i'm
you
can
be
Any ideas are welcome. In the regex101.com the regex I've put up works like a charm, so I'm guessing I have misunderstood something in the Java part.
Your initial pattern was splitting at a word boundary enclosed with 0+ whitespaces pattern. The second pattern is matching substrings.
Use it like this:
String BOUNDARY_STR = "[a-zA-Z]+(?:'[a-zA-Z]+)?";
String test = "Hello i'm #£$#you ##can !!be.";
Matcher matcher = Pattern.compile(BOUNDARY_STR).matcher(test);
List<String> results = new ArrayList<>();
while (matcher.find()){
results.add(matcher.group(0));
}
System.out.println(results); // => [Hello, i'm, you, can, be]
See the Java demo
Note I used [a-zA-Z]+(?:'[a-zA-Z]+)? that matches
[a-zA-Z]+ - 1 or more ASCII letters
(?:'[a-zA-Z]+)? - an optional substring of
' - an apostrophe
[a-zA-Z]+ - 1 or more ASCII letters
You may also wrap the pattern with word boundaries to only match words that are enclosed with non-word chars, "\\b[a-zA-Z]+(?:'[a-zA-Z]+)?\\b".
To find all Unicode letters, use "\\p{L}+(?:'\\p{L}+)?".

Splitting words from text using regex

I need to filter the given text to get all words, including apostrophes (can't is considered a single word).
Para = "'hello' world '"
I am splitting the text using
String[] splits = Para.split("[^a-zA-Z']");
Expected output:
hello world
But it is giving:
'hello' world '
I get everything right, except a single apostrophe (') and 'hello' are not getting filtered by the above regex.
How can I filter these two things?
As far as I can tell, you're looking for a ' where either the next or previous character is not a letter.
The regex I came up with to do this, contained in some test code:
String str = "bob can't do 'well'";
String[] splits = str.split("(?:(?<=^|[^a-zA-Z])'|'(?=[^a-zA-Z]|$)|[^a-zA-Z'])+");
System.out.println(Arrays.toString(splits));
Explanation:
(?<=^|[^a-zA-Z])' - matches a ' where the previous character is not a letter, or we're at the start of the string.
'(?=[^a-zA-Z]|$) - matches a ' where the next character is not a letter, or we're at the end of the string.
[^a-zA-Z'] - not a letter or '.
(?:...)+ - one or more of any of the above (the ?: is just to make it a non-capturing group).
See this for more on regex lookaround ((?<=...) and (?=...)).
Simplification:
The regex can be simplified to the below by using negative lookaround:
"(?:(?<![a-zA-Z])'|'(?![a-zA-Z])|[^a-zA-Z'])+"
A Unicode version, without lookarounds:
String TestInput = "This voilà München is the test' 'sentence' that I'm willing to split";
String[] splits = TestInput.split("'?[^\\p{L}']+'?");
for (String t : splits) {
System.out.println(t);
}
\p{L} is matching a character with the Unicode property "Letter"
This splits on a non letter, non ' sequence, including a leading or trailing ' in the split.
Output:
This
voilà
München
is
the
test
sentence
that
I'm
willing
to
split
To handle leading and trailing ', just add them as alternatives
TestInput.split("'?[^\\p{L}']+'?|^'|'$")
If you define a word as a sequence that:
Must start and end with English alphabet a-zA-Z
May contain apostrophe (') within.
Then you can use the following regex in Matcher.find() loop to extract matches:
[a-zA-Z](?:[a-zA-Z']*[a-zA-Z])?
Sample code:
Pattern p = Pattern.compile("[a-zA-Z](?:[a-zA-Z']*[a-zA-Z])?");
Matcher m = p.matcher(inputString);
while (m.find()) {
System.out.println(m.group());
}
Demo1
1 The demo uses PCRE flavor regex, but the result should not be different from Java for this regex

Constructing regex pattern to match sentence

I'm trying to write a regex pattern that will match any sentence that begins with multiple or one tab and/or whitespace.
For example, I want my regex pattern to be able to match " hello there I like regex!"
but so I'm scratching my head on how to match words after "hello". So far I have this:
String REGEX = "(?s)(\\p{Blank}+)([a-z][ ])*";
Pattern PATTERN = Pattern.compile(REGEX);
Matcher m = PATTERN.matcher(" asdsada adf adfah.");
if (m.matches()) {
System.out.println("hurray!");
}
Any help would be appreciated. Thanks.
String regex = "^\\s+[A-Za-z,;'\"\\s]+[.?!]$"
^ means "begins with"
\\s means white space
+ means 1 or more
[A-Za-z,;'"\\s] means any letter, ,, ;, ', ", or whitespace character
$ means "ends with"
An example regex to match sentences by the definition: "A sentence is a series of characters, starting with at lease one whitespace character, that ends in one of ., ! or ?" is as follows:
\s+[^.!?]*[.!?]
Note that newline characters will also be included in this match.
A sentence starts with a word boundary (hence \b) and ends with one or more terminators. Thus:
\b[^.!?]+[.!?]+
https://regex101.com/r/7DdyM1/1
This gives pretty accurate results. However, it will not handle fractional numbers. E.g. This sentence will be interpreted as two sentences:
The value of PI is 3.141...
If you looking to match all strings starting with a white space you can try using "^\s+*"
regular expression.
This tool could help you to test your regular expression efficiently.
http://www.rubular.com/
Based upon what you desire and asked for, the following will work.
String s = " hello there I like regex!";
Pattern p = Pattern.compile("^\\s+[a-zA-Z\\s]+[.?!]$");
Matcher m = p.matcher(s);
if (m.matches()) {
System.out.println("hurray!");
}
See working demo
String regex = "(?<=^|(\.|!|\?) |\n|\t|\r|\r\n) *\(?[A-Z][^.!?]*((\.|!|\?)(?! |\n|\r|\r\n)[^.!?]*)*(\.|!|\?)(?= |\n|\r|\r\n)"
This match any sentence following the definition 'a sentence start with a capital letter and end with a dot'.
The below regex pattern matches sentences in a paragraph.
Pattern pattern = Pattern.compile("\\b[\\w\\p{Space}“”’\\p{Punct}&&[^.?!]]+[.?!]");
Reference: https://devsought.com/regex-pattern-to-match-sentence

Categories

Resources