I'm trying to write a regex pattern that matches any sentence that begins with one or more tabs and/or whitespace characters.
For example, I want my regex pattern to be able to match " hello there I like regex!"
but I'm scratching my head over how to match the words after "hello". So far I have this:
String REGEX = "(?s)(\\p{Blank}+)([a-z][ ])*";
Pattern PATTERN = Pattern.compile(REGEX);
Matcher m = PATTERN.matcher(" asdsada adf adfah.");
if (m.matches()) {
System.out.println("hurray!");
}
Any help would be appreciated. Thanks.
String regex = "^\\s+[A-Za-z,;'\"\\s]+[.?!]$";
^ means "begins with"
\\s means white space
+ means 1 or more
[A-Za-z,;'"\\s] means any letter, ,, ;, ', ", or whitespace character
$ means "ends with"
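As a quick check, here is a minimal sketch that runs this pattern against a few inputs. The class name LeadingWhitespaceSentence, the helper isSentence, and the sample strings are made up for illustration.

```java
import java.util.regex.Pattern;

final class LeadingWhitespaceSentence {
    // Same pattern as above: leading whitespace, a body of letters and
    // light punctuation, then a single sentence terminator.
    static final Pattern SENTENCE =
            Pattern.compile("^\\s+[A-Za-z,;'\"\\s]+[.?!]$");

    static boolean isSentence(String s) {
        // matches() requires the whole input to match the pattern.
        return SENTENCE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isSentence(" hello there I like regex!")); // leading space
        System.out.println(isSentence("\t asdsada adf adfah."));       // tab, then space
        System.out.println(isSentence("no leading whitespace."));      // rejected
    }
}
```

Note that digits and punctuation outside the character class (for example a hyphen) make the match fail, so the class may need extending for real text.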
An example regex to match sentences by the definition: "A sentence is a series of characters, starting with at least one whitespace character, that ends in one of ., ! or ?" is as follows:
\s+[^.!?]*[.!?]
Note that newline characters will also be included in this match.
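A rough sketch of how this pattern could be used in a Matcher.find() loop, and of the newline behaviour mentioned above. The class name, the extract helper, and the sample text are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WhitespaceLedSentences {
    static final Pattern SENTENCE = Pattern.compile("\\s+[^.!?]*[.!?]");

    // Collects every substring that starts with whitespace and runs up to
    // the next sentence terminator.
    static List<String> extract(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = SENTENCE.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // The second sentence spans a line break; the newline stays in the match.
        for (String s : extract(" First one.\n Second\none? Third!")) {
            System.out.println("[" + s + "]");
        }
    }
}
```

If the leading whitespace and embedded newlines are unwanted, the matches can be cleaned up afterwards, for example with String.trim().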
A sentence starts with a word boundary (hence \b) and ends with one or more terminators. Thus:
\b[^.!?]+[.!?]+
https://regex101.com/r/7DdyM1/1
This gives pretty accurate results. However, it will not handle fractional numbers. For example, the following sentence will be interpreted as two sentences:
The value of PI is 3.141...
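The limitation can be reproduced with a short sketch. The class and method names here are illustrative, not from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WordBoundarySentences {
    static final Pattern SENTENCE = Pattern.compile("\\b[^.!?]+[.!?]+");

    static List<String> extract(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = SENTENCE.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Two ordinary sentences are separated cleanly.
        System.out.println(extract("Hello there! How are you?"));
        // The decimal point is treated as a terminator, so this one is cut in two.
        System.out.println(extract("The value of PI is 3.141..."));
    }
}
```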
If you are looking to match all strings that start with whitespace, you can try the regular expression "^\s+.*".
This tool can help you test your regular expressions:
http://www.rubular.com/
Based upon what you desire and asked for, the following will work.
String s = " hello there I like regex!";
Pattern p = Pattern.compile("^\\s+[a-zA-Z\\s]+[.?!]$");
Matcher m = p.matcher(s);
if (m.matches()) {
System.out.println("hurray!");
}
String regex = "(?<=^|(\\.|!|\\?) |\n|\t|\r|\r\n) *\\(?[A-Z][^.!?]*((\\.|!|\\?)(?! |\n|\r|\r\n)[^.!?]*)*(\\.|!|\\?)(?= |\n|\r|\r\n)";
This matches any sentence following the definition 'a sentence starts with a capital letter and ends with a dot'.
The below regex pattern matches sentences in a paragraph.
Pattern pattern = Pattern.compile("\\b[\\w\\p{Space}“”’\\p{Punct}&&[^.?!]]+[.?!]");
Reference: https://devsought.com/regex-pattern-to-match-sentence
Related
I am trying to split lines of a document, by creating a Pattern in Java.
The default Pattern in WordCount example is something like this: "\\s*\\b\\s*".
The problem with this pattern however, is that it splits everything to a single word, while I want to keep things such as (I'm, You're, it's) together. So far, what I've tried is [a-zA-Z]+'{0,1}[a-zA-Z]*,
the problem is that when I have a test string, for example:
Pattern BOUNDARY = Pattern.compile("[a-zA-Z]+'{0,1}[a-zA-Z]*");
String test = "Hello i'm #£$#you ##can !!be.";
and run
for (String word : BOUNDARY.split(test)) {
    System.out.println(word);
}
I get no results. Ideally, I want to get
Hello
i'm
you
can
be
Any ideas are welcome. In the regex101.com the regex I've put up works like a charm, so I'm guessing I have misunderstood something in the Java part.
Your initial pattern splits at a word boundary surrounded by zero or more whitespace characters. Your second pattern matches the words themselves, so it belongs in a Matcher.find() loop rather than in split().
Use it like this:
String BOUNDARY_STR = "[a-zA-Z]+(?:'[a-zA-Z]+)?";
String test = "Hello i'm #£$#you ##can !!be.";
Matcher matcher = Pattern.compile(BOUNDARY_STR).matcher(test);
List<String> results = new ArrayList<>();
while (matcher.find()){
results.add(matcher.group(0));
}
System.out.println(results); // => [Hello, i'm, you, can, be]
Note I used [a-zA-Z]+(?:'[a-zA-Z]+)? that matches
[a-zA-Z]+ - 1 or more ASCII letters
(?:'[a-zA-Z]+)? - an optional substring of
' - an apostrophe
[a-zA-Z]+ - 1 or more ASCII letters
You may also wrap the pattern with word boundaries to only match words that are enclosed with non-word chars, "\\b[a-zA-Z]+(?:'[a-zA-Z]+)?\\b".
To find all Unicode letters, use "\\p{L}+(?:'\\p{L}+)?".
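To illustrate the difference between the ASCII and the Unicode variant, here is a small sketch. The class name, the helper, and the sample text are made up; the accented letters are written as Unicode escapes so the source file encoding does not matter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnicodeWords {
    static final Pattern ASCII_WORD = Pattern.compile("[a-zA-Z]+(?:'[a-zA-Z]+)?");
    static final Pattern UNICODE_WORD = Pattern.compile("\\p{L}+(?:'\\p{L}+)?");

    static List<String> words(Pattern pattern, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "voil\u00e0, c'est M\u00fcnchen!";
        // The ASCII class stops at accented letters and breaks words apart.
        System.out.println(words(ASCII_WORD, text));
        // \p{L} accepts any Unicode letter, so the words stay whole.
        System.out.println(words(UNICODE_WORD, text));
    }
}
```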
I have a problem with matching whole words in Java. What I want to do is find the start indices of a word in a given line:
Pattern pattern = Pattern.compile("("+str+")\\b");
Matcher matcher = pattern.matcher(line.toLowerCase(Locale.ENGLISH));
if(matcher.find()){
//Doing something
}
I have a problem with this given case
line = "Watson has Watson's items.";
str = "watson";
I want to match only the first watson here, without matching the second one, and I don't want my pattern to rely on explicit whitespace checks. What should I do in this case?
The word boundary \b matches the location between a non-word and a word character (or the start/end of the string before/after a word character). Characters such as ', - and + are non-word characters, so Watson\b will match inside Watson's (a partial match).
You might want to only match Watson if it is not enclosed with non-whitespace symbols:
Pattern p = Pattern.compile("(?<!\\S)" + str + "(?!\\S)");
To match Watson at the end of the sentence as well, you will need to allow a match before ., ? and !. Use:
Pattern p = Pattern.compile("(?<!\\S)" + str + "(?![^\\s.!?])");
Just FYI: perhaps, it is a good idea to also use Pattern.quote(str) instead of plain str to avoid issues when your str contains special regex metacharacters.
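Putting the lookarounds and Pattern.quote together, a minimal sketch could look like this. The class name WholeWordIndex and the helper startIndices are hypothetical; the lowercasing follows the question's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WholeWordIndex {
    // Returns the start index of every occurrence of word that is preceded by
    // whitespace (or the start of the line) and followed by whitespace,
    // a sentence terminator, or the end of the line.
    static List<Integer> startIndices(String line, String word) {
        Pattern p = Pattern.compile(
                "(?<!\\S)" + Pattern.quote(word) + "(?![^\\s.!?])");
        Matcher m = p.matcher(line.toLowerCase(Locale.ENGLISH));
        List<Integer> starts = new ArrayList<>();
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        // Only the first "Watson" counts; "Watson's" is skipped.
        System.out.println(startIndices("Watson has Watson's items.", "watson"));
        // A trailing period is allowed by the second lookahead.
        System.out.println(startIndices("Ask Watson.", "watson"));
    }
}
```

Because the word is quoted, a search term such as "a.b" is treated literally and does not match "axb".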
Use the find() method of Matcher.
Refer to the Java docs.
I need to filter the given text to get all words, including apostrophes (can't is considered a single word).
String Para = "'hello' world '";
I am splitting the text using
String[] splits = Para.split("[^a-zA-Z']");
Expected output:
hello world
But it is giving:
'hello' world '
I get everything right, except a single apostrophe (') and 'hello' are not getting filtered by the above regex.
How can I filter these two things?
As far as I can tell, you're looking for a ' where either the next or previous character is not a letter.
The regex I came up with to do this, contained in some test code:
String str = "bob can't do 'well'";
String[] splits = str.split("(?:(?<=^|[^a-zA-Z])'|'(?=[^a-zA-Z]|$)|[^a-zA-Z'])+");
System.out.println(Arrays.toString(splits));
Explanation:
(?<=^|[^a-zA-Z])' - matches a ' where the previous character is not a letter, or we're at the start of the string.
'(?=[^a-zA-Z]|$) - matches a ' where the next character is not a letter, or we're at the end of the string.
[^a-zA-Z'] - not a letter or '.
(?:...)+ - one or more of any of the above (the ?: is just to make it a non-capturing group).
See this for more on regex lookaround ((?<=...) and (?=...)).
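One detail worth checking with the question's own input: String.split keeps a leading empty string when the text starts with a delimiter, which is the case for "'hello' world '". The sketch below applies the regex explained above and filters out empty strings; the class name and the words helper are illustrative assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class ApostropheSplit {
    // Delimiters: an apostrophe not attached to a letter on one side,
    // or any run of characters that are neither letters nor apostrophes.
    static final String DELIMITERS =
            "(?:(?<=^|[^a-zA-Z])'|'(?=[^a-zA-Z]|$)|[^a-zA-Z'])+";

    static List<String> words(String text) {
        return Arrays.stream(text.split(DELIMITERS))
                .filter(s -> !s.isEmpty()) // drop the leading empty string, if any
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(words("'hello' world '"));
        System.out.println(words("bob can't do 'well'"));
    }
}
```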
Simplification:
The regex can be simplified to the below by using negative lookaround:
"(?:(?<![a-zA-Z])'|'(?![a-zA-Z])|[^a-zA-Z'])+"
A Unicode version, without lookarounds:
String TestInput = "This voilà München is the test' 'sentence' that I'm willing to split";
String[] splits = TestInput.split("'?[^\\p{L}']+'?");
for (String t : splits) {
System.out.println(t);
}
\p{L} is matching a character with the Unicode property "Letter"
This splits on a non letter, non ' sequence, including a leading or trailing ' in the split.
Output:
This
voilà
München
is
the
test
sentence
that
I'm
willing
to
split
To handle leading and trailing ', just add them as alternatives
TestInput.split("'?[^\\p{L}']+'?|^'|'$")
If you define a word as a sequence that:
Must start and end with an English letter (a-zA-Z)
May contain apostrophe (') within.
Then you can use the following regex in Matcher.find() loop to extract matches:
[a-zA-Z](?:[a-zA-Z']*[a-zA-Z])?
Sample code:
Pattern p = Pattern.compile("[a-zA-Z](?:[a-zA-Z']*[a-zA-Z])?");
Matcher m = p.matcher(inputString);
while (m.find()) {
System.out.println(m.group());
}
Demo [1]
[1] The demo uses the PCRE regex flavor, but the result should not differ from Java for this regex.
I have the following string:
CLASSIC STF
CLASSIC
I am using a regexp to match the strings:
Pattern p = Pattern.compile("^CLASSIC(\\s*)$", Pattern.CASE_INSENSITIVE);
CLASSIC STF is also being displayed.
I am using m.find().
How can I make sure that only CLASSIC is displayed and not CLASSIC STF?
Thanks for helping.
If you use Matcher.find() the expression CLASSIC(\s*) will match CLASSIC STF.
Matcher.matches() will return false, however, since it requires the expression to match the entire input.
To make Matcher.find() do the same, change the expression to ^CLASSIC(\s*)$, as said by reto.
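The difference between the three cases can be sketched as follows. The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

final class FindVersusMatches {
    static final Pattern UNANCHORED =
            Pattern.compile("CLASSIC(\\s*)", Pattern.CASE_INSENSITIVE);
    static final Pattern ANCHORED =
            Pattern.compile("^CLASSIC(\\s*)$", Pattern.CASE_INSENSITIVE);

    // find() succeeds if any part of the input matches.
    static boolean findUnanchored(String s) {
        return UNANCHORED.matcher(s).find();
    }

    // matches() succeeds only if the entire input matches.
    static boolean matchesUnanchored(String s) {
        return UNANCHORED.matcher(s).matches();
    }

    // With ^ and $ in the pattern, find() is restricted in the same way.
    static boolean findAnchored(String s) {
        return ANCHORED.matcher(s).find();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"CLASSIC STF", "CLASSIC"}) {
            System.out.println(s + ": find=" + findUnanchored(s)
                    + ", matches=" + matchesUnanchored(s)
                    + ", anchored find=" + findAnchored(s));
        }
    }
}
```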
By default, ^ and $ match at the beginning and end of the entire input string respectively (with $ also allowed just before a final line terminator), not at every line break inside it. I would expect that your expression would not match on the string you mention. Indeed:
String pattern = "^CLASSIC(\\s*)$";
String input = "CLASSIC STF%nCLASSIC";
Pattern p = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(String.format(input));
while (m.find()) {
System.out.println(m.group());
}
prints no results.
If you want ^ and $ to match at the beginning and end of every line in the string, you should enable "multiline mode". Do so by replacing the third line of the code above with Pattern p = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);. When I do so I get one result, namely: "CLASSIC".
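For comparison, a small sketch that runs the same pattern with and without multiline mode. The class name and the matchingLines helper are hypothetical, and the input uses an explicit \n instead of %n to keep the example platform independent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MultilineClassic {
    static List<String> matchingLines(String input, int flags) {
        Pattern p = Pattern.compile("^CLASSIC(\\s*)$", flags);
        Matcher m = p.matcher(input);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "CLASSIC STF\nCLASSIC";
        // Default mode: ^ and $ refer to the whole input, so nothing is found.
        System.out.println(matchingLines(input, Pattern.CASE_INSENSITIVE));
        // Multiline mode: ^ and $ also match at line breaks.
        System.out.println(matchingLines(input,
                Pattern.CASE_INSENSITIVE | Pattern.MULTILINE));
    }
}
```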
You also asked why "CLASSIC STF" is not matched. Let's break down your pattern to see why. The pattern says: match anything that...
starts at the beginning of a line ~ ^
begins with a C, followed by an L, A, S, S, I and C ~ CLASSIC
after which 0 or more whitespace characters follow ~ (\s*)
after which we see a line ending ~ $
After matching the space in "CLASSIC STF" (step 3) we are looking at a character "S". This doesn't match a line ending (step 4), so we cannot match the regex.
Note that the parentheses in your regex are not necessary. You can leave them out.
The Javadoc of the Pattern class is very elaborate. It could be helpful to read it.
EDIT:
If you want to check if a string/line contains the word "CLASSIC" using a regex, then I'd recommend to use the regex \bCLASSIC\b. If you want to see if a string starts with the word "CLASSIC", then I'd use ^CLASSIC\b.
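Both suggestions can be tried with a short sketch. The class and method names are assumptions for illustration.

```java
import java.util.regex.Pattern;

final class ClassicWord {
    static final Pattern CONTAINS = Pattern.compile("\\bCLASSIC\\b");
    static final Pattern STARTS_WITH = Pattern.compile("^CLASSIC\\b");

    // True if CLASSIC occurs anywhere as a whole word.
    static boolean containsWord(String s) {
        return CONTAINS.matcher(s).find();
    }

    // True if the string begins with the whole word CLASSIC.
    static boolean startsWithWord(String s) {
        return STARTS_WITH.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(containsWord("A CLASSIC STF"));   // whole word in the middle
        System.out.println(containsWord("CLASSICAL STF"));   // only part of a longer word
        System.out.println(startsWithWord("CLASSIC STF"));   // whole word at the start
        System.out.println(startsWithWord("A CLASSIC"));     // word is not at the start
    }
}
```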
I wonder if this would help:
practice = c("CLASSIC STF", "CLASSIC")
grep("^CLASSIC[[:space:]STF]?", practice)
I am trying to search this string:
,"tt" : "ABC","r" : "+725.00","a" : "55.30",
For:
"r" : "725.00"
And here is my current code:
Pattern p = Pattern.compile("([r]\".:.\"[+|-][0-9]+.[0-9][0-9]\")");
Matcher m = p.matcher(raw_string);
I've been trying multiple variations of the pattern, and a match is never found. A second set of eyes would be great!
Your regexp actually works, it's almost correct
Pattern p = Pattern.compile("\"[r]\".:.\"[+|-][0-9]+.[0-9][0-9]\"");
Matcher m = p.matcher(raw_string);
if (m.find()){
String res = m.toMatchResult().group(0);
}
The next line should read:
if ( m.find() ) {
Are you doing that?
A few other issues: You're using . to match the spaces surrounding the colon; if that's always supposed to be whitespace, you should use " +" (a space followed by +, meaning one or more spaces) or \s+ (one or more whitespace characters). On the other hand, the dot between the digits is supposed to match a literal ., so you should escape it with a backslash: \. Of course, since this is a Java String literal, you need to escape the backslashes themselves: "\\s+" and "\\.".
You don't need the square brackets around the r, and if you don't want to match a | in front of the number you should change [+|-] to [+-].
While some of these issues I've mentioned could result in false positives, none of them would prevent it from matching valid input. That's why I suspect you aren't actually applying the regex by calling find(). It's a common mistake.
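With those suggestions applied, a tightened version might look like the sketch below. The class name AmountField and the helper findR are made up; the sample input is the string from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AmountField {
    // "r", whitespace, a colon, whitespace, then a quoted signed amount
    // with exactly two decimals.
    static final Pattern R_FIELD =
            Pattern.compile("\"r\"\\s+:\\s+\"[+-][0-9]+\\.[0-9][0-9]\"");

    // Returns the matched text, or null if the field is not present.
    static String findR(String raw) {
        Matcher m = R_FIELD.matcher(raw);
        return m.find() ? m.group() : null; // find() must be called before group()
    }

    public static void main(String[] args) {
        String raw = ",\"tt\" : \"ABC\",\"r\" : \"+725.00\",\"a\" : \"55.30\",";
        System.out.println(findR(raw));
    }
}
```

Without the call to find(), group() throws an IllegalStateException, which is the mistake suspected above.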
First, escape your dot symbol: ...[0-9]+\.[0-9][0-9]... (written as \\. inside a Java string literal),
because an unescaped dot matches any character.
Second, [+|-] defines a set of characters, but as written one of them is mandatory. To make the sign optional,
try [+|-]?
Alban.