Splitting words from text using regex - java

I need to extract all words from the given text, keeping apostrophes that are part of a word (can't is considered a single word).
Para = "'hello' world '"
I am splitting the text using
String[] splits = Para.split("[^a-zA-Z']");
Expected output:
hello world
But it is giving:
'hello' world '
Everything comes out right, except that a lone apostrophe (') and the quotes around 'hello' are not removed by the above regex.
How can I filter these two things?

As far as I can tell, you also want to split on any ' whose next or previous character is not a letter.
The regex I came up with to do this, contained in some test code:
String str = "bob can't do 'well'";
String[] splits = str.split("(?:(?<=^|[^a-zA-Z])'|'(?=[^a-zA-Z]|$)|[^a-zA-Z'])+");
System.out.println(Arrays.toString(splits));
Explanation:
(?<=^|[^a-zA-Z])' - matches a ' where the previous character is not a letter, or we're at the start of the string.
'(?=[^a-zA-Z]|$) - matches a ' where the next character is not a letter, or we're at the end of the string.
[^a-zA-Z'] - not a letter or '.
(?:...)+ - one or more of any of the above (the ?: is just to make it a non-capturing group).
(?<=...) and (?=...) are regex lookarounds: they check what precedes or follows the current position without consuming any characters.
Simplification:
The regex can be simplified to the below by using negative lookaround:
"(?:(?<![a-zA-Z])'|'(?![a-zA-Z])|[^a-zA-Z'])+"
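One thing to watch with split(): when the text starts with a delimiter (as in the question's "'hello' world '"), Java keeps an empty string as the first element. A minimal sketch of how the simplified regex might be wrapped to drop such empty tokens; the class and method names are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class ApostropheSplitSketch {
    // Delimiter: a ' that is not between two letters, or any character
    // that is neither an ASCII letter nor a ', repeated one or more times.
    private static final Pattern DELIMITER =
            Pattern.compile("(?:(?<![a-zA-Z])'|'(?![a-zA-Z])|[^a-zA-Z'])+");

    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        for (String token : DELIMITER.split(text)) {
            // split() yields a leading "" when the text begins with a delimiter
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("'hello' world '"));     // [hello, world]
        System.out.println(words("bob can't do 'well'")); // [bob, can't, do, well]
    }
}
```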

A Unicode version, without lookarounds:
String TestInput = "This voilà München is the test' 'sentence' that I'm willing to split";
String[] splits = TestInput.split("'?[^\\p{L}']+'?");
for (String t : splits) {
    System.out.println(t);
}
\p{L} matches any character with the Unicode property "Letter".
This splits on a run of characters that are neither letters nor ', and absorbs a ' directly before or after that run into the delimiter.
Output:
This
voilà
München
is
the
test
sentence
that
I'm
willing
to
split
To also handle a ' at the very start or end of the string, just add them as alternatives:
TestInput.split("'?[^\\p{L}']+'?|^'|'$")
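A small sketch of how that extended pattern could be used with Pattern.splitAsStream, filtering out the empty token that appears when the input starts with a quote. The sample strings and the class/method names are only illustrative, and the non-ASCII letters are written as Unicode escapes so the snippet does not depend on the source file encoding:

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class UnicodeWordSplitSketch {
    // Same delimiter as above: a non-letter, non-' run with an optional
    // adjacent ', or a ' at the very start or end of the input.
    private static final Pattern DELIMITER =
            Pattern.compile("'?[^\\p{L}']+'?|^'|'$");

    static List<String> words(String text) {
        return DELIMITER.splitAsStream(text)
                .filter(token -> !token.isEmpty()) // drop the leading "" token
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(words("'hello' world '"));
        // \u00e0 is a-grave, \u00fc is u-umlaut
        System.out.println(words("'voil\u00e0' M\u00fcnchen isn't 'far'"));
    }
}
```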

If you define a word as a sequence that:
Must start and end with English alphabet a-zA-Z
May contain apostrophe (') within.
Then you can use the following regex in Matcher.find() loop to extract matches:
[a-zA-Z](?:[a-zA-Z']*[a-zA-Z])?
Sample code:
Pattern p = Pattern.compile("[a-zA-Z](?:[a-zA-Z']*[a-zA-Z])?");
Matcher m = p.matcher(inputString);
while (m.find()) {
    System.out.println(m.group());
}
Demo (the demo uses the PCRE regex flavor, but the result should not differ from Java for this regex)

Related

Java String Split using Regex with Escape Character

I have a string that needs to be split on a delimiter (:). The delimiter can be escaped by a character (say '?'), and it can be preceded by any number of escape characters. Consider the example string below:
a:b?:c??:d???????:e
Here, after the split, it should give the below list of string:
a
b?:c??
d???????:e
Basically, if the delimiter (:) is preceded by an even number of escape characters, it should split. If it is preceded by an odd number of escape characters, it should not split. Is there a solution to this with regex?
Any help would be greatly appreciated.
A similar question has been asked earlier here, but the answers do not work for this use case.
Update:
The solution with the regex (?:\?.|[^:?])* splits the string correctly. However, it also returns a few empty strings. If + is used instead of *, the genuinely empty fields are dropped as well (e.g. a::b gives only a, b).
Scenario 1: No empty matches
You may use
(?:\?.|[^:?])+
Or, following the pattern in the linked answer
(?:\?.|[^:?]++)+
See this regex demo
Details
(?: - start of a non-capturing group
\?. - a ? (the escape char) followed by any char
| - or
[^:?] - any char but the : (your delimiter char) and ? (the escape char)
)+ - 1 or more repetitions.
In Java:
String regex = "(?:\\?.|[^:?]++)+";
In case the input contains line breaks, prepend the pattern with (?s) (like (?s)(?:\\?.|[^:?])+) or compile the pattern with Pattern.DOTALL flag.
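For reference, a minimal sketch of Scenario 1 as a helper that collects the matches into a list. The class and method names are hypothetical; the pattern is the one given above with (?s) prepended:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapedDelimiterSketch {
    // Each token is a run of "escape + any char" pairs and of chars that are
    // neither the delimiter (:) nor the escape (?). (?s) lets . match line breaks.
    private static final Pattern TOKEN =
            Pattern.compile("(?s)(?:\\?.|[^:?]++)+");

    static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("a:b?:c??:d???????:e")); // [a, b?:c??, d???????:e]
        System.out.println(tokens("a::b"));                // [a, b] (empty field skipped)
    }
}
```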
Scenario 2: Empty matches included
You may add (?<=:)(?=:) alternative to the above pattern to match empty strings between : chars, see this regex demo:
String s = "::a:b?:c??::d???????:e::";
Pattern pattern = Pattern.compile("(?>\\?.|[^:?])+|(?<=:)(?=:)");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
    System.out.println("'" + matcher.group() + "'");
}
Output of the Java demo:
''
'a'
'b?:c??'
''
'd???????:e'
''
Note that if you want to also match empty strings at the start/end of the string, use (?<![^:])(?![^:]) rather than (?<=:)(?=:).
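A sketch of that last variant, assuming the same sample string as above; the class and method names are made up for illustration. With (?<![^:])(?![^:]) the empty fields at the start and end of the input are reported too:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapedDelimiterWithEmptiesSketch {
    // First alternative: a non-empty field. Second alternative: an empty
    // position that has no non-: character on either side.
    private static final Pattern FIELD =
            Pattern.compile("(?>\\?.|[^:?])+|(?<![^:])(?![^:])");

    static List<String> fields(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = FIELD.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        for (String field : fields("::a:b?:c??::d???????:e::")) {
            System.out.println("'" + field + "'");
        }
    }
}
```

Treat this as a sketch rather than a complete splitter: the lookbehind only sees a :, so it cannot tell an escaped : from a real delimiter, and an input such as ?:: may produce an extra empty match.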

Regex to catch all the words and the "i'm you're etc" in Java

I am trying to split lines of a document, by creating a Pattern in Java.
The default Pattern in WordCount example is something like this: "\\s*\\b\\s*".
The problem with this pattern, however, is that it splits everything into separate tokens, while I want to keep things such as (I'm, You're, it's) together. So far, what I've tried is [a-zA-Z]+'{0,1}[a-zA-Z]*.
The problem is that when I have a test string, for example:
Pattern BOUNDARY = Pattern.compile("[a-zA-Z]+'{0,1}[a-zA-Z]*");
String test = "Hello i'm #£$#you ##can !!be.";
and run
for (String word : BOUNDARY.split(test)) {
    System.out.println(word);
}
I get no results. Ideally, I want to get
Hello
i'm
you
can
be
Any ideas are welcome. On regex101.com the regex I've put together works like a charm, so I'm guessing I have misunderstood something on the Java side.
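Your initial pattern describes the delimiter: it splits at a word boundary optionally surrounded by whitespace, so it belongs in split(). Your second pattern describes the words themselves, so it has to be used for matching substrings with Matcher.find(), not for splitting.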
Use it like this:
String BOUNDARY_STR = "[a-zA-Z]+(?:'[a-zA-Z]+)?";
String test = "Hello i'm #£$#you ##can !!be.";
Matcher matcher = Pattern.compile(BOUNDARY_STR).matcher(test);
List<String> results = new ArrayList<>();
while (matcher.find()) {
    results.add(matcher.group(0));
}
System.out.println(results); // => [Hello, i'm, you, can, be]
See the Java demo
Note I used [a-zA-Z]+(?:'[a-zA-Z]+)? that matches
[a-zA-Z]+ - 1 or more ASCII letters
(?:'[a-zA-Z]+)? - an optional substring of
' - an apostrophe
[a-zA-Z]+ - 1 or more ASCII letters
You may also wrap the pattern with word boundaries to only match words that are enclosed with non-word chars, "\\b[a-zA-Z]+(?:'[a-zA-Z]+)?\\b".
To find all Unicode letters, use "\\p{L}+(?:'\\p{L}+)?".

Java Regexp to match words only (', -, space)

What is the Java regular expression to match all words containing only:
Letters from a to z and A to Z
The ', - and space characters, but these must not be at the beginning or the end.
Examples
test'test match
test' doesn't match
'test doesn't match
-test doesn't match
test- doesn't match
test-test match
You can use the following pattern: ^(?!-|'|\\s)[a-zA-Z]*(?!-|'|\\s)$
Below are the examples:
String s1 = "abc";
String s2 = " abc";
String s3 = "abc ";
System.out.println(s1.matches("^(?!-|'|\\s)[a-zA-Z]*(?!-|'|\\s)$"));
System.out.println(s2.matches("^(?!-|'|\\s)[a-zA-Z]*(?!-|'|\\s)$"));
System.out.println(s3.matches("^(?!-|'|\\s)[a-zA-Z]*(?!-|'|\\s)$"));
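Note that the pattern above only admits letters between the anchors, so test'test and test-test from the question would not match it. A minimal sketch of one alternative, assuming each ', - or space must sit singly between two runs of letters (that rule, and the class and method names, are assumptions for illustration):

```java
import java.util.regex.Pattern;

class InnerSeparatorWordSketch {
    // One or more letters, then any number of "separator + letters" groups.
    // A separator can therefore never be the first or last character.
    private static final Pattern WORD =
            Pattern.compile("[a-zA-Z]+(?:[-' ][a-zA-Z]+)*");

    static boolean isValid(String s) {
        return WORD.matcher(s).matches(); // matches() checks the whole string
    }

    public static void main(String[] args) {
        String[] samples = {"test'test", "test'", "'test", "-test", "test-", "test-test"};
        for (String sample : samples) {
            System.out.println(sample + " -> " + isValid(sample));
        }
    }
}
```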
If by "Space" you mean the literal space character, the character class is: [a-zA-Z ]
It checks that your string contains only a-z (lowercase), A-Z (uppercase) and space characters. If not, the test will fail.
Here's my solution:
/(\w{2,}(-|'|\s)\w{2,})/g
You can take it for a spin on Regexr.
It first checks for a word with \w, then for any of the three separators using "or" logic with |, and then for another word. The {2,} quantifiers make sure the words on either end are at least 2 characters long, so contractions like don't aren't captured. You could set that minimum to any other value, or omit the quantifier bounds entirely.
Caveat: \w also looks for _ underscores. If you don't want that you could replace it with [a-zA-Z] like so:
/([a-zA-Z]{2,}(-|'|\s)[a-zA-Z]{2,})/g

regex whole word option

I have a problem with matching whole words in Java. What I want to do is find the start indices of a word in a given line:
Pattern pattern = Pattern.compile("("+str+")\\b");
Matcher matcher = pattern.matcher(line.toLowerCase(Locale.ENGLISH));
if (matcher.find()) {
    // Doing something
}
The problem shows up in this case:
line = "Watson has Watson's items.";
str = "watson";
I want to match only the first watson here, not the one inside Watson's, and I don't want my pattern to rely on checking for surrounding spaces. What should I do in this case?
The word boundary \b matches the location between a non-word and a word character (or the start/end before/after a word character). The ', -, +, etc. are non-word characters, so Watson\b will match in Watson's (partial match).
You might want to only match Watson if it is not enclosed with non-whitespace symbols:
Pattern p = Pattern.compile("(?<!\\S)" + str + "(?!\\S)");
To also match Watson at the end of a sentence, you will need to allow a match before ., ? and !. Use:
Pattern p = Pattern.compile("(?<!\\S)" + str + "(?![^\\s.!?])");
See the regex demo
Just FYI: perhaps, it is a good idea to also use Pattern.quote(str) instead of plain str to avoid issues when your str contains special regex metacharacters.
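Putting the pieces together, a minimal sketch that collects every start index with the second pattern and Pattern.quote. The class and method names are hypothetical, and the search word is assumed to be lowercase already, as in the question:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeWordFinderSketch {
    static List<Integer> startIndices(String line, String word) {
        // No non-whitespace char before the word; after it, only whitespace,
        // ., ! or ? (or the end of the line). quote() escapes metacharacters.
        Pattern p = Pattern.compile(
                "(?<!\\S)" + Pattern.quote(word) + "(?![^\\s.!?])");
        Matcher m = p.matcher(line.toLowerCase(Locale.ENGLISH));
        List<Integer> starts = new ArrayList<>();
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(startIndices("Watson has Watson's items.", "watson")); // [0]
        System.out.println(startIndices("I met Watson.", "watson"));              // [6]
        System.out.println(startIndices("i like c++ a lot", "c++"));              // [7]
    }
}
```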
Use find() method in matcher
Refer java docs

Regex deleted special character

I'm having the following problem with regex: I've written a program that reads words from some text (txt) files and writes them into another file, one word per line.
Everything works fine, except when a word contains special characters such as ľščťžýáíé. The regex deletes the character and splits the word where the special character was.
For Example :
Input:
I am Jožo.
Output:
I
am
Jo
o
Here's a snippet of the code:
while ((line = br.readLine()) != null) {
    Pattern p = Pattern.compile("[\\w']+");
    Matcher m = p.matcher(line);
}
Instead of this regex:
Pattern.compile("[\\w']+")
Use Unicode based:
Pattern.compile("[\\p{L}']+")
This is because by default \\w in Java matches only ASCII letters, digits 0-9 and underscore.
Another option is to use the modifier
Pattern.UNICODE_CHARACTER_CLASS
Like this:
Pattern.compile("[\\w']+", Pattern.UNICODE_CHARACTER_CLASS)
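To see the three variants side by side, here is a small sketch run on the question's sample line. The helper name findAll is made up, and ž is written as the escape \u017e so the snippet does not depend on the source file encoding:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeWordCharSketch {
    // Collects every match of the given pattern in the line.
    static List<String> findAll(Pattern p, String line) {
        List<String> words = new ArrayList<>();
        Matcher m = p.matcher(line);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        String line = "I am Jo\u017eo."; // \u017e is z with caron

        Pattern asciiWord = Pattern.compile("[\\w']+");
        Pattern letters = Pattern.compile("[\\p{L}']+");
        Pattern unicodeWord =
                Pattern.compile("[\\w']+", Pattern.UNICODE_CHARACTER_CLASS);

        System.out.println(findAll(asciiWord, line));   // splits the name in two
        System.out.println(findAll(letters, line));     // keeps the name whole
        System.out.println(findAll(unicodeWord, line)); // keeps the name whole
    }
}
```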
\\w matches only a-z, A-Z, 0-9 and the underscore (English alphabet plus digits and _).
If you want to accept any character except whitespace as part of a word, use \\S.
