Detect non-Latin characters with a regex Pattern in Java

I THINK Latin characters are what I mean in my question, but I'm not entirely sure what the correct classification is. I'm trying to use a regex Pattern to test if a string contains non Latin characters. I'm expecting the following results
"abcDE 123"; // Yes, this should match
"!##$%^&*"; // Yes, this should match
"aaàààäää"; // Yes, this should match
"ベビードラ"; // No, this shouldn't match
"😀😃😄😆"; // No, this shouldn't match
My understanding is that the built-in {IsLatin} preset simply detects if any of the characters are Latin. I want to detect if any characters are not Latin.
Pattern LatinPattern = Pattern.compile("\\p{IsLatin}");
Matcher matcher = LatinPattern.matcher(str);
if (!matcher.find()) {
System.out.println("is NON latin");
return;
}
System.out.println("is latin");

TL;DR: Use regex ^[\p{Print}\p{IsLatin}]*$
You want a regex that matches if the string consists of:
Spaces
Digits
Punctuation
Latin characters (Unicode script "Latin")
Easiest way is to combine \p{IsLatin} with \p{Print}, where Pattern defines \p{Print} as:
\p{Print} - A printable character: [\p{Graph}\x20]
\p{Graph} - A visible character: [\p{Alnum}\p{Punct}]
\p{Alnum} - An alphanumeric character: [\p{Alpha}\p{Digit}]
\p{Alpha} - An alphabetic character: [\p{Lower}\p{Upper}]
\p{Lower} - A lower-case alphabetic character: [a-z]
\p{Upper} - An upper-case alphabetic character: [A-Z]
\p{Digit} - A decimal digit: [0-9]
\p{Punct} - Punctuation: One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
\x20 - A space:
Which makes \p{Print} the same as [\p{ASCII}&&\P{Cntrl}], i.e. ASCII characters that are not control characters.
The \p{Alpha} part overlaps with \p{IsLatin}, but that's fine, since the character class eliminates duplicates.
So, regex is: ^[\p{Print}\p{IsLatin}]*$
Test
Pattern latinPattern = Pattern.compile("^[\\p{Print}\\p{IsLatin}]*$");
String[] inputs = { "abcDE 123", "!##$%^&*", "aaàààäää", "ベビードラ", "😀😃😄😆" };
for (String input : inputs) {
System.out.print("\"" + input + "\": ");
Matcher matcher = latinPattern.matcher(input);
if (! matcher.find()) {
System.out.println("is NON latin");
} else {
System.out.println("is latin");
}
}
Output
"abcDE 123": is latin
"!##$%^&*": is latin
"aaàààäää": is latin
"ベビードラ": is NON latin
"😀😃😄😆": is NON latin
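Since the question asks whether any character is not Latin, the same character class can also be negated and used with find(). The following is only a minimal sketch of that idea; the class name NonLatinDetector and the helper containsNonLatin are illustrative names, not part of the answer above.

```java
import java.util.regex.Pattern;

final class NonLatinDetector {
    // Negation of the answer's class: any character that is neither
    // printable ASCII nor in the Unicode "Latin" script.
    private static final Pattern NON_LATIN =
            Pattern.compile("[^\\p{Print}\\p{IsLatin}]");

    // Hypothetical helper: true if at least one such character occurs.
    static boolean containsNonLatin(String input) {
        return NON_LATIN.matcher(input).find();
    }

    public static void main(String[] args) {
        String[] inputs = { "abcDE 123", "aa\u00E0\u00E4", "\u30D9\u30D3\u30FC" };
        for (String input : inputs) {
            System.out.println("\"" + input + "\": " + containsNonLatin(input));
        }
    }
}
```

Unlike the anchored ^...*$ form, this variant reports the empty string as containing no non-Latin characters and stops at the first offending character.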

The main Latin Unicode blocks are:
\p{InBasic_Latin}: U+0000–U+007F
\p{InLatin-1_Supplement}: U+0080–U+00FF
\p{InLatin_Extended-A}: U+0100–U+017F
\p{InLatin_Extended-B}: U+0180–U+024F
So, the answer is either
Pattern LatinPattern = Pattern.compile("^[\\p{InBasicLatin}\\p{InLatin-1Supplement}\\p{InLatinExtended-A}\\p{InLatinExtended-B}]+$");
Pattern LatinPattern = Pattern.compile("^[\\x00-\\x{024F}]+$"); //U+0000-U+024F
Note that the underscores are dropped from the block names in the Java patterns above (for example, InBasicLatin rather than InBasic_Latin).
See the Java demo:
List<String> strs = Arrays.asList(
"abcDE 123", // Yes, this should match
"!##$%^&*", // Yes, this should match
"aaàààäää", // Yes, this should match
"ベビードラ", // No, this shouldn't match
"😀😃😄😆"); // No, this shouldn't match
Pattern LatinPattern = Pattern.compile("^[\\p{InBasicLatin}\\p{InLatin-1Supplement}\\p{InLatinExtended-A}\\p{InLatinExtended-B}]+$");
//Pattern LatinPattern = Pattern.compile("^[\\x00-\\x{024F}]+$"); //U+0000-U+024F
for (String str : strs) {
Matcher matcher = LatinPattern.matcher(str);
if (!matcher.find()) {
System.out.println(str + " => is NON Latin");
//return;
} else {
System.out.println(str + " => is Latin");
}
}
Note: if you replace .find() with .matches(), you can throw away ^ and $ in the pattern.
Output:
abcDE 123 => is Latin
!##$%^&* => is Latin
aaàààäää => is Latin
ベビードラ => is NON Latin
😀😃😄😆 => is NON Latin
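Block-based classes and the script-based \p{IsLatin} are not interchangeable: the Basic Latin block also covers digits, ASCII punctuation, and control characters, none of which belong to the Latin script. A small sketch of the difference, with an illustrative class name:

```java
import java.util.regex.Pattern;

final class ScriptVersusBlock {
    // Script test: is the character part of the Unicode "Latin" script?
    static final Pattern LATIN_SCRIPT = Pattern.compile("\\p{IsLatin}");
    // Block test: does the character fall in the Basic Latin block (U+0000 to U+007F)?
    static final Pattern BASIC_LATIN_BLOCK = Pattern.compile("\\p{InBasicLatin}");

    public static void main(String[] args) {
        String[] samples = { "a", "1", "!", "\t" };
        for (String s : samples) {
            System.out.println("script=" + LATIN_SCRIPT.matcher(s).matches()
                    + " block=" + BASIC_LATIN_BLOCK.matcher(s).matches());
        }
    }
}
```

This is why the block-based pattern accepts "abcDE 123" on its own, while the script-based one needs \p{Print} added to it.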

Related

Checking if there is whitespace between two elements in a String

I am working with Strings where I need to separate two chars/elements if there is a whitespace between them. I have seen an earlier post on SO about the same problem, but it has not worked for me as intended. As you would assume, I could just check if the String contains(" ") and then substring around the space. However, my strings could contain any number of whitespace characters at the end despite having no whitespace between characters. Hence my question is: how do I detect a whitespace between two chars (numbers too)?
//Example with numbers in a String
String test = "2 2";
final Pattern P = Pattern.compile("^(\\d [\\d\\d] )*\\d$");
final Matcher m = P.matcher(test);
if (m.matches()) {
System.out.println("There is between space!");
}
You could use String.strip() (available since Java 11; trim() is the older alternative) to remove any leading or trailing whitespace, followed by String.split(). If there is whitespace between the elements, the array will be of length 2 or greater. If there is not, it will be of length 1.
Example:
String test = " 2 2 ";
test = test.strip(); // Removes whitespace, test is now "2 2"
String[] testSplit = test.split(" "); // Splits the string, testSplit is ["2", "2"]
if (testSplit.length >= 2) {
System.out.println("There is whitespace!");
} else {
System.out.println("There is no whitespace");
}
If you need an array of a specified length, you can also specify a limit to split. For example:
"a b c".split(" ", 2); // Returns ["a", "b c"]
If you want a solution that only uses regex, the following regex matches any two groups of characters separated by a single space, with any amount of leading or trailing whitespace:
\s*(\S+\s\S+)\s*
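As a rough sketch of how that regex could be applied, the capturing group returns the two elements without the surrounding padding. The class name SingleSpaceBetween and the helper extract are hypothetical names chosen for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SingleSpaceBetween {
    private static final Pattern TWO_GROUPS =
            Pattern.compile("\\s*(\\S+\\s\\S+)\\s*");

    // Returns the "left right" part without padding, or null when the
    // whole input is not two groups separated by one whitespace character.
    static String extract(String input) {
        Matcher m = TWO_GROUPS.matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extract(" 2 2 "));
        System.out.println(extract(" 245 "));
    }
}
```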
Positive lookahead and lookbehind may also work if you use the regex (?<=\\w)\\s(?=\\w)
\\w : a word character [a-zA-Z_0-9]
\\s : a whitespace character
(?<=\\w)\\s : positive lookbehind, matches a whitespace if it is preceded by a \\w
\\s(?=\\w) : positive lookahead, matches a whitespace if it is followed by a \\w
List<String> testList = Arrays.asList("2 2", " 245 ");
Pattern p = Pattern.compile("(?<=\\w)\\s(?=\\w)");
for (String str : testList) {
Matcher m = p.matcher(str);
if (m.find()) {
System.out.println(str + "\t: There is a space!");
} else {
System.out.println(str + "\t: There is not a space!");
}
}
Output:
2 2 : There is a space!
245 : There is not a space!
The reason your pattern does not work as expected is that ^(\\d [\\d\\d] )*\\d$, which can be simplified to ^(\\d \\d )*\\d$ because [\\d\\d] is the same as \\d, starts by repeating 0 or more times what is between the parentheses.
Then it matches a digit at the end of the string. As the repetition is 0 or more times, it is optional and the pattern also matches just a single digit. Each repetition must end with a space, so "2 2" does not match: the group would need "2 2 " followed by one more digit.
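To make that behaviour concrete, here is a small check of the question's original pattern against a few inputs. It is only an illustration; the class and method names are made up for this sketch.

```java
import java.util.regex.Pattern;

final class OriginalPatternCheck {
    // The pattern from the question, unchanged.
    static final Pattern ORIGINAL = Pattern.compile("^(\\d [\\d\\d] )*\\d$");

    static boolean matches(String s) {
        return ORIGINAL.matcher(s).matches();
    }

    public static void main(String[] args) {
        // A lone digit and "digit space digit space digit" match,
        // but the intended "2 2" does not.
        for (String s : new String[] { "2", "2 2", "2 2 2" }) {
            System.out.println("\"" + s + "\" -> " + matches(s));
        }
    }
}
```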
If you want to check if there is a single space between 2 non whitespace chars:
\\S \\S
final Pattern P = Pattern.compile("\\S \\S");
final Matcher m = P.matcher(test);
if (m.find()) {
System.out.println("There is between space!");
}
Here is the simplest way you can do it:
String testString = " Find if there is a space. ";
testString = testString.trim(); // Strings are immutable, so keep the trimmed result
boolean hasSpace = testString.contains(" "); // true if the string still contains a space
You can also do it in one line by chaining the two methods:
String testString = " Find if there is a space. ";
boolean hasSpace = testString.trim().contains(" ");
Use
String text = "2 2";
Matcher m = Pattern.compile("\\S\\s+\\S").matcher(text.trim());
if (m.find()) {
System.out.println("Space detected.");
}
text.trim() will remove leading and trailing whitespaces, \S\s+\S pattern matches a non-whitespace, then one or more whitespace characters, and then a non-whitespace character again.

Tokenize words separated by non-word characters except single quote

I have the following method I'm trying to implement: parses the input into “word tokens”: sequences of word characters separated by non-word characters. However, non-word characters can become part of a token if they are quoted (in single quotes).
I want to use regex but have trouble getting my code just right:
public static List<String> wordTokenize(String input) {
Pattern pattern = Pattern.compile ("\\b(?:(?<=\')[^\']*(?=\')|\\w+)\\b");
Matcher matcher = pattern.matcher (input);
ArrayList ans = new ArrayList();
while (matcher.find ()){
ans.add (matcher.group ());
}
return ans;
}
My regex fails to recognize that a quote starting mid-word (with no space before it) does not start a new token. Examples:
The input: this-string 'has only three tokens' // works
The input:
"this*string'has only two#tokens'"
Expected :[this, stringhas only two#tokens]
Actual :[this, string, has only two#tokens]
The input: "one'two''three' '' four 'twenty-one'"
Expected :[onetwothree, , four, twenty-one]
Actual :[one, two, three, four, twenty-one]
How do I fix the spaces?
You want to match one or more occurrences of a word char or a substring between the closest single straight apostrophes, and remove all those apostrophes from the tokens.
Use the following regex and .replace("'", "") on the matches:
(?:\w|'[^']*')+
Details:
(?: - start of a non-capturing group
\w - a word char
| - or
' - a straight single quotation mark
[^']* - any 0+ chars other than a straight single quotation mark
' - a straight single quotation mark
)+ - end of the group, 1+ occurrences.
See the Java demo:
// String s = "this*string'has only two#tokens'"; // => [this, stringhas only two#tokens]
String s = "one'two''three' '' four 'twenty-one'"; // => [onetwothree, , four, twenty-one]
Pattern pattern = Pattern.compile("(?:\\w|'[^']*')+", Pattern.UNICODE_CHARACTER_CLASS);
Matcher matcher = pattern.matcher(s);
List<String> tokens = new ArrayList<>();
while (matcher.find()){
tokens.add(matcher.group(0).replace("'", ""));
}
Note the Pattern.UNICODE_CHARACTER_CLASS is added for the \w pattern to match all Unicode letters and digits.
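Put together as a complete method with the signature from the question, the approach might look like the sketch below. The wrapper class WordTokenizer is an assumed name; the regex and the quote removal are the ones described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WordTokenizer {
    // One or more word characters and/or single-quoted runs, back to back.
    private static final Pattern TOKEN =
            Pattern.compile("(?:\\w|'[^']*')+", Pattern.UNICODE_CHARACTER_CLASS);

    static List<String> wordTokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(input);
        while (matcher.find()) {
            // Drop the quotes themselves; an empty pair '' yields an empty token.
            tokens.add(matcher.group().replace("'", ""));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(wordTokenize("this*string'has only two#tokens'"));
        System.out.println(wordTokenize("one'two''three' '' four 'twenty-one'"));
    }
}
```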

Java Pattern for Word without Spaces

I am wondering what the regex for a word would be; I can't seem to find it anywhere. The string I'm trying to match is "Loop-num + 5" and I want to extract the "Loop-num" part. I am unsure what the regex would be to do so.
Pattern pattern = Pattern.compile("(loop-.*)");
Matcher matcher = pattern.matcher("5 * loop-num + 5");
if(matcher.find()){
String extractedString = matcher.group(1);
System.out.println(extractedString);
}
From this I get: "loop-num + 5"
If you really plan to use the regex to match words (entities comprising just letters, optionally split with hyphen(s)), you need to consider the following regex:
\b\pL+(?:-\pL+)*\b
Explanation:
\b - leading word boundary
\pL+ - 1 or more Unicode letters
(?:-\pL+)* - zero or more sequences of...
- - a literal hyphen
\pL+ - 1 or more Unicode letters
\b - trailing word boundary
In Java:
Pattern pattern = Pattern.compile("\\b\\pL+(?:-\\pL+)*\\b", Pattern.UNICODE_CHARACTER_CLASS);
Matcher matcher = pattern.matcher("5 * loop-num + 5");
if(matcher.find()){
String extractedString = matcher.group(0);
System.out.println(extractedString);
}
Note: in case words may include digits (not at the starting positions), you can use \b\pL\w*(?:-\pL\w*)*\b with Pattern.UNICODE_CHARACTER_CLASS. Here, \w will match letters, digits and an underscore.
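The digit-friendly variant from the note can be tried the same way. This is a sketch only; the class name HyphenatedWord, the helper firstWord, and the sample input "loop2-num" are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HyphenatedWord {
    // Each hyphen-separated part must start with a letter and may
    // continue with letters, digits or underscores.
    private static final Pattern WORD = Pattern.compile(
            "\\b\\pL\\w*(?:-\\pL\\w*)*\\b", Pattern.UNICODE_CHARACTER_CLASS);

    // Returns the first matching word, or null if there is none.
    static String firstWord(String input) {
        Matcher m = WORD.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstWord("5 * loop2-num + 5"));
    }
}
```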

Why doesn't a regex pattern ending in "." produce a match when using word boundary?

In the following Java code:
public static void main(String[] args) {
String largeText = "abc myphrase. def";
String phrase = "myphrase.";
Pattern myPattern = Pattern.compile("\\b"+Pattern.quote(phrase)+"\\b");
System.out.println("Pattern: "+myPattern);
Matcher myMatcher = myPattern.matcher( largeText );
boolean found = false;
while(myMatcher.find()) {
System.out.println("Found: "+myMatcher.group());
found = true;
}
if(!found){
System.out.println("Not found!");
}
}
I get this output:
Pattern: \b\Qmyphrase.\E\b
Not found!
Please, can someone explain to me why the above pattern does not produce a match? I do have a match if I use "myphrase" instead of "myphrase." in the pattern.
Thank you for your help.
There is no word boundary after the "." character. A boundary occurs between a word character and a non-word character. Since both "." and " " (space) are non-word characters, there is no boundary between them.
If you use "myphrase" in your pattern you get a match because there is a boundary between the word character e and the "." that follows it.
It doesn't match because the dot (.) is not considered a "word" character, so there won't be a word boundary after a literal dot (when the next char is a space).
FYI, "word" characters (which have their own regex \w) are equivalent to the character class [a-zA-Z0-9_]
Perhaps you are trying to use \s and not \b ?
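One possible workaround, not taken from the answers above, is to replace \b with lookarounds that only require "no word character on either side". This works whether the phrase ends in a word character or in punctuation. A minimal sketch with assumed names (PhraseFinder, containsPhrase):

```java
import java.util.regex.Pattern;

final class PhraseFinder {
    // Assumption: a "whole phrase" is one that is neither preceded nor
    // followed by a word character (\w).
    static boolean containsPhrase(String text, String phrase) {
        Pattern p = Pattern.compile(
                "(?<!\\w)" + Pattern.quote(phrase) + "(?!\\w)");
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsPhrase("abc myphrase. def", "myphrase."));
        System.out.println(containsPhrase("abc myphrase.def", "myphrase."));
    }
}
```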

java regex to accept any word other than none

I need a regular expression to match any string other than none.
I tried using
regular exp ="^[^none]$",
But it does not work.
If you are matching a String against a specific word in Java you should use equals(). In this case you want to invert the match so your logic becomes:
if(!theString.equals("none")) {
// do stuff here
}
Much less resource hungry, and much more intuitive.
If you need to match a String which contains the word "none", you are probably looking for something like:
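if(Pattern.compile("\\bnone\\b").matcher(theString).find()) {
/* find() is used because String.matches() only succeeds when the whole
* string matches. This finds "none" only where it is enclosed between
* “word boundaries”, so it will not match for example: "nonetheless"
*/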
}
Or if you can be fairly certain that “word boundaries” mean a specific delimiter you can still evade regular expressions by using the indexOf() method:
int i = theString.indexOf("none");
if(i > -1) {
if(i > 0) {
// check theString.charAt(i - 1) to see if it is a word boundary
// e.g.: whitespace
}
// the 4 is because of the fact that "none" is 4 characters long.
if((theString.length() - i - 4) > 0) {
// check theString.charAt(i + 4) to see if it is a word boundary
// e.g.: whitespace
}
}
else {
// not found.
}
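The indexOf() outline above leaves the boundary checks as comments. One way to fill them in is sketched below, under the assumption that letters, digits, and underscores count as word characters; the names WordSearch, containsWord, and isWordChar are illustrative.

```java
final class WordSearch {
    // Assumption: letters, digits and '_' are the "word" characters.
    private static boolean isWordChar(char c) {
        return Character.isLetterOrDigit(c) || c == '_';
    }

    // True if word occurs in text with no word character directly
    // before or after it.
    static boolean containsWord(String text, String word) {
        if (word.isEmpty()) {
            return false;
        }
        int i = text.indexOf(word);
        while (i >= 0) {
            int end = i + word.length();
            boolean startOk = i == 0 || !isWordChar(text.charAt(i - 1));
            boolean endOk = end == text.length() || !isWordChar(text.charAt(end));
            if (startOk && endOk) {
                return true;
            }
            // Keep looking: a later occurrence may stand alone.
            i = text.indexOf(word, i + 1);
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(containsWord("say none now", "none"));
        System.out.println(containsWord("nonetheless", "none"));
    }
}
```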
You can use the regular expression (?!^none$).*. See this question for details: Regex inverse matching on specific string?
The reason "^[^none]$" doesn't work is that [^none] is a character class: the pattern matches only strings consisting of a single character other than "n", "o", or "e".
Of course, it would be easier to just use String.equals like so: !"none".equals(testString).
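The difference between the character class and the lookahead version can be checked directly. A short sketch (class and method names are only for illustration):

```java
final class NotNoneDemo {
    // The question's attempt: one character that is not n, o or e.
    static boolean charClassMatches(String s) {
        return s.matches("^[^none]$");
    }

    // The lookahead version: any string except exactly "none".
    static boolean notNone(String s) {
        return s.matches("(?!^none$).*");
    }

    public static void main(String[] args) {
        for (String s : new String[] { "x", "n", "hello", "none", "nonetheless" }) {
            System.out.println(s + ": charClass=" + charClassMatches(s)
                    + ", notNone=" + notNone(s));
        }
    }
}
```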
Actually this is the regex to match all words except "word":
Pattern regex = Pattern.compile("\\b(?!word\\b)\\w+\\b");
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
// matched text: regexMatcher.group()
// match start: regexMatcher.start()
// match end: regexMatcher.end()
}
You must use word boundaries so that "word" is not contained in other words.
Explanation:
"
\b # Assert position at a word boundary
(?! # Assert that it is impossible to match the regex below starting at this position (negative lookahead)
word # Match the characters “word” literally
\b # Assert position at a word boundary
)
\w # Match a single character that is a “word character” (letters, digits, etc.)
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
\b # Assert position at a word boundary
"
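As a quick illustration of that pattern, the sketch below collects every word except "word" from a sample sentence. The class name, method name, and sample text are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WordsExceptWord {
    private static final Pattern NOT_WORD =
            Pattern.compile("\\b(?!word\\b)\\w+\\b");

    // Collects all words in the text that are not exactly "word".
    static List<String> wordsExceptWord(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = NOT_WORD.matcher(text);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // "wordy" and "words" are kept because the lookahead requires a
        // word boundary right after "word".
        System.out.println(wordsExceptWord("a word and wordy words"));
    }
}
```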
This is the regex you are looking for:
Pattern p = Pattern.compile("^(?!none$).*$");
String s = "your string";
Matcher m = p.matcher(s);
System.out.println(s + ": " + (m.matches() ? "Match" : "NO Match"));
That said, if you are not forced to use a single regex that matches everything but "none", it is simpler, faster, and clearer to match "none" itself:
Pattern p = Pattern.compile("^none$");
Then, you just exclude the matches.
Matcher m = p.matcher(s);
System.out.println(s + ": " + (m.matches() ? "NO Match" : "Match"));
