I'm having the following problem with regex: I've written a program that reads words from some text (txt) files and writes them into another file, one word per line.
Everything works fine, except when a word contains a special character such as ľščťžýáíé. The regex drops that character and splits the word where it was.
For example:
Input:
I am Jožo.
Output:
I
am
Jo
o
Here's a snippet of the code:
while ((line = br.readLine()) != null) {
    Pattern p = Pattern.compile("[\\w']+");
    Matcher m = p.matcher(line);
}
Instead of this regex:
Pattern.compile("[\\w']+")
use a Unicode-based one:
Pattern.compile("[\\p{L}']+")
This is because, by default, \\w in Java matches only ASCII letters, the digits 0-9 and the underscore.
Another option is to use the modifier
Pattern.UNICODE_CHARACTER_CLASS
Like this:
Pattern.compile("[\\w']+", Pattern.UNICODE_CHARACTER_CLASS)
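To see the difference between the three patterns side by side, here is a minimal sketch. The class name UnicodeWords and the helper words() are made up for this illustration, and the sample line is the one from the question (written with a \u escape so the source file encoding does not matter).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeWords {
    // Compile each pattern once; there is no need to recompile per line.
    static final Pattern ASCII_WORD = Pattern.compile("[\\w']+");
    static final Pattern LETTER_WORD = Pattern.compile("[\\p{L}']+");
    static final Pattern UNICODE_WORD =
            Pattern.compile("[\\w']+", Pattern.UNICODE_CHARACTER_CLASS);

    // Collects every match of the pattern in the given line.
    static List<String> words(Pattern p, String line) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(line);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "I am Jo\u017Eo."; // U+017E is z with caron
        System.out.println(words(ASCII_WORD, line));   // word is cut at the non-ASCII letter
        System.out.println(words(LETTER_WORD, line));  // word stays whole
        System.out.println(words(UNICODE_WORD, line)); // word stays whole
    }
}
```

Note that the two fixes are not identical: [\\p{L}']+ accepts only letters and apostrophes, while \\w with UNICODE_CHARACTER_CLASS also accepts digits and the underscore.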
\\w matches only a-z, A-Z, 0-9 and the underscore (ASCII letters, digits and _).
If you want to accept any character except whitespace as part of a word, use \\S.
What regex pattern would I need to pass to java.lang.String.split() to split a String into an array of substrings, using all whitespace characters (' ', '\t', '\n', etc.) as delimiters?
Something along the lines of
myString.split("\\s+");
This groups all white spaces as a delimiter.
So if I have the string:
"Hello[space character][tab character]World"
This should yield the strings "Hello" and "World" and omit the empty space between the [space] and the [tab].
As VonC pointed out, the backslash should be escaped, because the Java compiler first processes escape sequences in the string literal, and only the result is passed to the regex engine. What you want is the literal \s, which means you need to write "\\s". It can get a bit confusing.
The \\s is equivalent to [ \\t\\n\\x0B\\f\\r].
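A minimal sketch of the split described above. The class name WhitespaceSplit and the helper tokens() are made up for this example; the trim() call is an addition, not part of the original answer, and is there because split() keeps an empty first element when the input starts with a delimiter.

```java
import java.util.Arrays;

class WhitespaceSplit {
    // Splits on runs of whitespace, ignoring leading and trailing whitespace.
    static String[] tokens(String text) {
        String trimmed = text.trim();
        // Splitting an empty string returns one empty token, so handle it up front.
        if (trimmed.isEmpty()) {
            return new String[0];
        }
        return trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokens("Hello \tWorld")));
        // Without trim(), leading whitespace produces an empty first element.
        System.out.println(Arrays.toString(" Hello World".split("\\s+")));
    }
}
```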
In most regex dialects there is a set of convenient shorthand character classes you can use for this kind of thing. These are good ones to remember:
\w - Matches any word character.
\W - Matches any nonword character.
\s - Matches any white-space character.
\S - Matches anything but white-space characters.
\d - Matches any digit.
\D - Matches anything except digits.
A search for "Regex Cheatsheets" should reward you with a whole lot of useful summaries.
To get this working in JavaScript, I had to do the following:
myString.split(/\s+/g)
"\\s+" should do the trick
Also, you may have a Unicode non-breaking space (\xA0) in the input:
String[] elements = s.split("[\\s\\xA0]+"); // also split on the Unicode non-breaking space
String string = "Ram is going to school";
String[] arrayOfString = string.split("\\s+");
Apache Commons Lang has a method to split a string with whitespace characters as delimiters:
StringUtils.split("abc def")
http://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/StringUtils.html#split(java.lang.String)
This might be easier to use than a regex pattern.
All you need is to split using one of the predefined character classes of the Java regex engine, namely the whitespace class:
\d Represents a digit: [0-9]
\D Represents a non-digit: [^0-9]
\s Represents a whitespace character including [ \t\n\x0B\f\r]
\S Represents a non-whitespace character as [^\s]
\v Represents a vertical whitespace character as [\n\x0B\f\r\x85\u2028\u2029]
\V Represents a non-vertical whitespace character as [^\v]
\w Represents a word character as [a-zA-Z_0-9]
\W Represents a non-word character as [^\w]
Here, the key point to remember is that the lowercase \s represents all types of whitespace, including a single space [ ], a tab character [\t] and similar characters.
So, if you try something like this:
String theString = "Java<a space><a tab>Programming";
String[] allParts = theString.split("\\s+");
You will get the desired output.
Hope this helps!
To split a string with any Unicode whitespace, you need to use
s.split("(?U)\\s+")
         ^^^^
The (?U) embedded flag is the inline equivalent of Pattern.UNICODE_CHARACTER_CLASS. It makes the \s shorthand character class match any character that has the Unicode White_Space property.
If you want to split with whitespace and keep the whitespaces in the resulting array, use
s.split("(?U)(?<=\\s)(?=\\S)|(?<=\\S)(?=\\s)")
Java demo:
String s = "Hello\t World\u00A0»";
System.out.println(Arrays.toString(s.split("(?U)\\s+"))); // => [Hello, World, »]
System.out.println(Arrays.toString(s.split("(?U)(?<=\\s)(?=\\S)|(?<=\\S)(?=\\s)")));
// => [Hello, , World, , »]
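To make the effect of (?U) concrete, here is a small sketch that splits the same string with and without the flag. The class and method names are made up for this example; the input contains a non-breaking space (U+00A0) between "one" and "two".

```java
import java.util.Arrays;

class UnicodeWhitespaceSplit {
    // Default \s: only space, tab, newline, vertical tab, form feed, carriage return.
    static String[] splitAscii(String s) {
        return s.split("\\s+");
    }

    // With (?U), \s also matches Unicode whitespace such as U+00A0.
    static String[] splitUnicode(String s) {
        return s.split("(?U)\\s+");
    }

    public static void main(String[] args) {
        String s = "one\u00A0two three";
        System.out.println(Arrays.toString(splitAscii(s)));   // 2 elements
        System.out.println(Arrays.toString(splitUnicode(s))); // 3 elements
    }
}
```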
Since it is a regular expression, and I'm assuming you would also not want non-alphanumeric characters such as commas and dots that may be surrounded by blanks (e.g. "one , two" should give [one][two]), it should be:
myString.split(/[\s\W]+/)
You can split a string by line breaks using the following statement:
String textStr[] = yourString.split("\\r?\\n");
You can split a string by whitespace using the following statement:
String textStr[] = yourString.split("\\s+");
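Both statements can be tried out with a short sketch like the one below. The class name LineSplit and the helper names are made up for illustration, and the trim() before the whitespace split is an addition to avoid an empty first element.

```java
import java.util.Arrays;

class LineSplit {
    // Accepts both Windows (CR LF) and Unix (LF) line endings.
    static String[] lines(String text) {
        return text.split("\\r?\\n");
    }

    // Splits on runs of whitespace after dropping leading and trailing blanks.
    static String[] words(String text) {
        return text.trim().split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(lines("first\r\nsecond\nthird")));
        System.out.println(Arrays.toString(words("  Ram is going to school ")));
    }
}
```

Keep in mind that split() drops trailing empty strings, so blank lines at the very end of the text do not show up in the result.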
String str = "Hello World";
String res[] = str.split("\\s+");
Study this code. Good luck!
import java.util.*;

class Demo {
    public static void main(String[] args) {
        Scanner input = new Scanner(System.in);
        System.out.print("Input String : ");
        String s1 = input.nextLine();
        // Split on runs of whitespace, including the non-breaking space (\xA0).
        String[] tokens = s1.split("[\\s\\xA0]+");
        System.out.println(tokens.length);
        for (String s : tokens) {
            System.out.println(s);
        }
    }
}
So I'm trying to split a string with a regex and the split function in Java.
The regex should split the string when there is a capital letter after a noncapital letter like this
hHere // -> should split to ["h", "Here"]
I'm trying to split a string like this
String str = "1. Test split hHere and not .Here and /Here";
String[] splitString = str.split("(?=\\w+)((?=[^\\s])(?=\\p{Upper}))");
/* print splitString */
// -> should split to ["1. Test split h", "Here and not .Here and /Here"]
for(String s : splitString) {
System.out.println(s);
}
Output I get:
1.
Test split h
Here and not .
Here and /
Here
Output I want:
1. Test split h
Here and not .Here and /Here
I just can't figure out the regex to do this.
You may use an easier pattern: (?<=\p{Ll})(?=\p{Lu})
(?<= ) is a lookbehind: it asserts that the given subpattern matches the text ending at the current position.
(?= ) is a lookahead: it asserts that the given subpattern matches the text starting at the current position.
Neither consumes any characters, which is very important here!
str.split("(?<=[a-z])(?=[A-Z])"); // older version, does not work for other alphabets
As per my original comment.
Code
Option 1
This option works with ASCII characters (it will not work for Unicode characters). Basically, this works with English text.
(?<=[a-z])(?=[A-Z])
Option 2
This option works with Unicode characters. This works with any language.
(?<=\p{Ll})(?=\p{Lu})
Explanation
Option 1
(?<=[a-z]) Positive lookbehind ensuring what precedes is a character in the set a-z (lowercase ASCII character)
(?=[A-Z]) Positive lookahead ensuring what follows is a character in the set A-Z (uppercase ASCII character)
Option 2
(?<=\p{Ll}) Positive lookbehind ensuring what precedes is a character in the set \p{Ll} (lowercase letter Unicode property/script category)
(?=\p{Lu}) Positive lookahead ensuring what follows is a character in the set \p{Lu} (uppercase letter Unicode property/script category)
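Here is a minimal sketch that runs both options against the string from the question, plus one non-ASCII sample. The class and method names are made up for this example, and "caf\u00E9Bar" is an invented input chosen because its lowercase letter before the capital is outside a-z.

```java
import java.util.Arrays;

class CaseBoundarySplit {
    // Option 1: ASCII only.
    static String[] splitAscii(String s) {
        return s.split("(?<=[a-z])(?=[A-Z])");
    }

    // Option 2: any Unicode lowercase letter followed by any uppercase letter.
    static String[] splitUnicode(String s) {
        return s.split("(?<=\\p{Ll})(?=\\p{Lu})");
    }

    public static void main(String[] args) {
        String str = "1. Test split hHere and not .Here and /Here";
        // Splits only between "h" and "Here"; ".Here" and "/Here" stay intact.
        System.out.println(Arrays.toString(splitUnicode(str)));

        String accented = "caf\u00E9Bar"; // U+00E9 is e with acute
        System.out.println(splitAscii(accented).length);   // 1: no split point found
        System.out.println(splitUnicode(accented).length); // 2
    }
}
```

Because both parts of the pattern are zero-width, no characters are lost at the split point.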
I want to write a regular expression to remove all tokens of a text file that do not have at least one letter. I used the OpenNLP tokenizer to extract the tokens of my text file. For instance, the tokens 90-87, 65#7, ---, 8/0 and ? should be removed from the given text.
I tried to follow these pages 1, 2 and 3, but I could not find the expression that I want. For example, the following code also removes the tokens anti-age and mid-november.
String[] tokens = t.getTokens(sen);
for (String word : tokens)
    if ((!isstopWord(word)) && word.matches("[a-zA-Z]+"))
        bufferedw.append(word + "\n");
But I do not know how to prevent the removal of tokens like anti-age.
Where is the problem?
The [a-zA-Z]+ expression matches a string that only consists of one or more ASCII letters. It does not allow hyphens, apostrophes, etc.
To match a string containing no spaces and at least one letter, you can use
word.matches("\\S*\\pL\\S*")
The \S* pattern matches zero or more non-whitespace characters and \pL matches any Unicode letter.
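A quick sketch of that check applied to the tokens from the question. The class name LetterTokenFilter and its helpers are made up for this example; the OpenNLP tokenizer and the stop-word check from the original code are left out so the snippet stays self-contained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LetterTokenFilter {
    // True if the token has no whitespace and contains at least one Unicode letter.
    static boolean hasLetter(String token) {
        return token.matches("\\S*\\pL\\S*");
    }

    // Keeps only the tokens that pass hasLetter().
    static List<String> keep(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (hasLetter(token)) {
                out.add(token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "90-87", "65#7", "---", "8/0", "?", "anti-age", "mid-november");
        System.out.println(keep(tokens)); // only the two hyphenated words remain
    }
}
```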
With regular expressions, how can I extract the whole word except the last character if it is a vowel?
Inputs:
ansia
bello
ansid
Expected output for each:
ansi
bell
ansid
This is what I tried, but it only works if I have a single vowel at the end:
^(.*[^aeiou])
Similar to what @Sotirios Delimanolis wrote in his comment, but using word boundaries so that it also works if you have multiple words on a line.
\b(\w+?)[aeiou]?\b
This works in the following way :
1) \b matches the start of a word. This will work for the first word on a line or a word preceded by a non-word character (a word character is any alphanumeric character or the underscore).
2) (\w+?) matches and captures the part of the word you care about.
2a) \w matches any word character.
2b) + makes the \w be matched one or more times
2c) ? makes the + match as few characters as possible. This is important because if there is a vowel at the end of the word we do not want to match it in the capturing group but instead let (3) take care of it.
3) [aeiou]? matches but does not capture a vowel character if one is present
3a) [aeiou] matches a vowel
3b) ? makes the [aeiou] be matched zero or one times
4) \b matches the end of the word. This will work for a word at the end of a line or a word followed by a non-word character.
You said that the tool you are using uses the Java regex implementation and ansid isn't working for you with my regex. I have tested it with pure Java and it seems to be working for me:
Pattern pattern = Pattern.compile("\\b(\\w+?)[aeiou]?\\b");
Matcher matcher = pattern.matcher("ansia ansid cake cat dog");
while (matcher.find()) {
System.out.println(matcher.group(1));
}
prints
ansi
ansid
cak
cat
dog
Try the regex (\b[a-zA-Z]+?(?=[aeiou]\b))|(\b[a-zA-Z]+?[^aeiou]\b). This captures either a word ending in a consonant OR a word ending in a vowel, and omits the vowel at the end.
This pattern worked for me:
^(.*?)(?=[aeiou]$|$)
If the input can contain several words on one line, as pointed out below,
use this pattern instead:
\b([a-z]+?)(?=[aeiou]\b|\b)
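The word-boundary pattern can be exercised with a sketch like this. The class name FinalVowel and the helper stems() are made up for this example; note that, as written, the pattern only handles lowercase ASCII words.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FinalVowel {
    // Lazy [a-z]+? stops as soon as a final vowel or the end of the word follows.
    static final Pattern WORD = Pattern.compile("\\b([a-z]+?)(?=[aeiou]\\b|\\b)");

    // Returns each word of the line without its final vowel, if it has one.
    static List<String> stems(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = WORD.matcher(line);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(stems("ansia bello ansid")); // [ansi, bell, ansid]
    }
}
```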
I need to filter the given text to get all words, including apostrophes (can't is considered a single word).
Para = "'hello' world '"
I am splitting the text using
String[] splits = Para.split("[^a-zA-Z']");
Expected output:
hello world
But it is giving:
'hello' world '
Everything else works, except that a lone apostrophe (') and the quotes around 'hello' are not filtered out by the above regex.
How can I filter these two things?
As far as I can tell, you're looking for a ' where either the next or previous character is not a letter.
The regex I came up with to do this, contained in some test code:
String str = "bob can't do 'well'";
String[] splits = str.split("(?:(?<=^|[^a-zA-Z])'|'(?=[^a-zA-Z]|$)|[^a-zA-Z'])+");
System.out.println(Arrays.toString(splits));
Explanation:
(?<=^|[^a-zA-Z])' - matches a ' where the previous character is not a letter, or we're at the start of the string.
'(?=[^a-zA-Z]|$) - matches a ' where the next character is not a letter, or we're at the end of the string.
[^a-zA-Z'] - not a letter or '.
(?:...)+ - one or more of any of the above (the ?: is just to make it a non-capturing group).
(?<=...) and (?=...) are regex lookarounds: a lookbehind and a lookahead, respectively.
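One thing to watch for when applying this to the input from the question: when the string starts with a delimiter, split() returns an empty first element. The sketch below uses the lookaround regex from the test code above and filters empty strings out; the class name ApostropheWords and the helper words() are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class ApostropheWords {
    // Delimiter: a ' that is not between letters, or any character that is
    // neither a letter nor a '.
    static final Pattern DELIM = Pattern.compile(
            "(?:(?<=^|[^a-zA-Z])'|'(?=[^a-zA-Z]|$)|[^a-zA-Z'])+");

    // Splits the text and drops the empty leading element, if any.
    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        for (String part : DELIM.split(text)) {
            if (!part.isEmpty()) {
                out.add(part);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("'hello' world '"));     // [hello, world]
        System.out.println(words("bob can't do 'well'")); // [bob, can't, do, well]
    }
}
```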
Simplification:
The regex can be simplified to the below by using negative lookaround:
"(?:(?<![a-zA-Z])'|'(?![a-zA-Z])|[^a-zA-Z'])+"
A Unicode version, without lookarounds:
String TestInput = "This voilà München is the test' 'sentence' that I'm willing to split";
String[] splits = TestInput.split("'?[^\\p{L}']+'?");
for (String t : splits) {
System.out.println(t);
}
\p{L} matches a character with the Unicode property "Letter".
This splits on a sequence of characters that are neither letters nor ', optionally including a leading or trailing ' in the delimiter.
Output:
This
voilà
München
is
the
test
sentence
that
I'm
willing
to
split
To handle a ' at the very start or end of the input, just add them as alternatives:
TestInput.split("'?[^\\p{L}']+'?|^'|'$")
If you define a word as a sequence that:
Must start and end with a letter of the English alphabet (a-zA-Z)
May contain apostrophes (') in between
then you can use the following regex in a Matcher.find() loop to extract the matches:
[a-zA-Z](?:[a-zA-Z']*[a-zA-Z])?
Sample code:
Pattern p = Pattern.compile("[a-zA-Z](?:[a-zA-Z']*[a-zA-Z])?");
Matcher m = p.matcher(inputString);
while (m.find()) {
System.out.println(m.group());
}
Note: the linked demo uses the PCRE regex flavor, but the result should not differ from Java for this regex.