Latin Regex with symbols - java

I need to split a text and get only words, numbers and hyphenated compound words. I also need to match Latin words with accented letters, so I used \p{L}, which gives me é, ú, ü, ã, and so forth. The example is:
String myText = "Some latin text with symbols, ? 987 (A la pointe sud-est de l'île se dresse la cathédrale Notre-Dame qui fut lors de son achèvement en 1330 l'une des plus grandes cathédrales d'occident) : ! # # $ % ^& * ( ) + - _ #$% \" ' : ; > < / \\ | , here some is wrong… * + () e -";
Pattern pattern = Pattern.compile("[^\\p{L}+(\\-\\p{L}+)*\\d]+");
String words[] = pattern.split( myText );
What is wrong with this regex? Why does it match symbols like "(", "+", "-", "*" and "|"?
Some of the results are:
dresse // OK
sud-est // OK
occident) // WRONG
987 // OK
() // WRONG
(a // WRONG
* // WRONG
- // WRONG
+ // WRONG
( // WRONG
| // WRONG
The regex explanation is:
[^\p{L}+(\-\p{L}+)*\d]+
* Word separator will be:
* [^ ... ] No sequence in:
* \p{L}+ Any latin letter
* (\-\p{L}+)* Optionally hyphenated
* \d or numbers
* [ ... ]+ once or more.

If my understanding of your requirement is correct, this regex will match what you want:
"\\p{IsLatin}+(?:-\\p{IsLatin}+)*|\\d+"
It will match:
A contiguous sequence of Unicode Latin script characters. I restrict it to Latin script, since \p{L} will match letters in any script. Change \\p{IsLatin} to \\pL if your version of Java doesn't support the syntax.
Or several such sequences, hyphenated
Or a contiguous sequence of decimal digits (0-9)
To use the regex above, compile it with Pattern.compile, call matcher(String input) to obtain a Matcher object, and loop over the matches with find().
Pattern pattern = Pattern.compile("\\p{IsLatin}+(?:-\\p{IsLatin}+)*|\\d+");
Matcher matcher = pattern.matcher(inputString);
while (matcher.find()) {
System.out.println(matcher.group());
}
If you want to allow words with an apostrophe ('):
"\\p{IsLatin}+(?:['\\-]\\p{IsLatin}+)*|\\d+"
I also escape - in the character class ['\\-] just in case you want to add more. Actually - doesn't need escaping if it is the first or last in the character class, but I escape it anyway just to be safe.
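As a self-contained sketch of the matching approach described above, the apostrophe-tolerant pattern can be wrapped in a small helper. The class name WordExtractor, the method extractTokens and the sample sentence are only illustrative and not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordExtractor {
    // Latin-script words (optionally joined by ' or -), or runs of digits.
    private static final Pattern TOKEN =
            Pattern.compile("\\p{IsLatin}+(?:['\\-]\\p{IsLatin}+)*|\\d+");

    // Collects every match instead of splitting on separators.
    static List<String> extractTokens(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(extractTokens("Notre-Dame (1330) : l'une + - * |"));
        // prints [Notre-Dame, 1330, l'une]
    }
}
```

Matching the tokens you want is usually easier to get right than describing every possible separator for split.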

If the opening bracket of a character class is followed by a ^ then the characters listed inside the class are not allowed. So your regex allows anything except unicode letter,+,(,-,),* and digit occurring one or more times.
Note that characters like +,(,),* etc. don't have any special meaning inside a character class.
pattern.split splits the string at every substring that matches the regex. Your regex matches whitespace (and any other character not listed in the class), so the split occurs at each run of one or more such characters, and that produces the result you see.
For example consider this
Pattern pattern = Pattern.compile("a");
for (String s : pattern.split("sda a f g")) {
System.out.println("==>"+s);
}
Output will be
==>sd
==>
==> f g

A regular expression set description with [] can contain only letters, classes (\p{...}), sequences (e.g. a-z) and the complement symbol (^). You have to place the other magic characters you are using (+*()) outside the [ ] block.
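To see this on a fragment of the question's text, here is a minimal sketch that runs the asker's original pattern. The class and method names are made up for the illustration. Because +, (, ), * and - sit inside the negated class, they are treated as characters to keep, so they stay glued to the words.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class CharacterClassDemo {
    // The asker's pattern: everything between [^ and ] is one negated set
    // of single characters, not a sequence with quantifiers and groups.
    static String[] splitWithOriginal(String text) {
        return Pattern.compile("[^\\p{L}+(\\-\\p{L}+)*\\d]+").split(text);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitWithOriginal("occident) : ! # (A la")));
        // prints [occident), (A, la]
    }
}
```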

Related

Detect non Latin characters with regex Pattern in Java

I THINK Latin characters are what I mean in my question, but I'm not entirely sure what the correct classification is. I'm trying to use a regex Pattern to test if a string contains non Latin characters. I'm expecting the following results
"abcDE 123"; // Yes, this should match
"!##$%^&*"; // Yes, this should match
"aaàààäää"; // Yes, this should match
"ベビードラ"; // No, this shouldn't match
"😀😃😄😆"; // No, this shouldn't match
My understanding is that the built-in {IsLatin} preset simply detects if any of the characters are Latin. I want to detect if any characters are not Latin.
Pattern LatinPattern = Pattern.compile("\\p{IsLatin}");
Matcher matcher = LatinPattern.matcher(str);
if (!matcher.find()) {
System.out.println("is NON latin");
return;
}
System.out.println("is latin");
TL;DR: Use regex ^[\p{Print}\p{IsLatin}]*$
You want a regex that matches if the string consists of:
Spaces
Digits
Punctuation
Latin characters (Unicode script "Latin")
Easiest way is to combine \p{IsLatin} with \p{Print}, where Pattern defines \p{Print} as:
\p{Print} - A printable character: [\p{Graph}\x20]
\p{Graph} - A visible character: [\p{Alnum}\p{Punct}]
\p{Alnum} - An alphanumeric character: [\p{Alpha}\p{Digit}]
\p{Alpha} - An alphabetic character: [\p{Lower}\p{Upper}]
\p{Lower} - A lower-case alphabetic character: [a-z]
\p{Upper} - An upper-case alphabetic character: [A-Z]
\p{Digit} - A decimal digit: [0-9]
\p{Punct} - Punctuation: One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
\x20 - A space:
Which makes \p{Print} the same as [\p{ASCII}&&\P{Cntrl}], i.e. ASCII characters that are not control characters.
The \p{Alpha} part overlaps with \p{IsLatin}, but that's fine, since the character class eliminates duplicates.
So, regex is: ^[\p{Print}\p{IsLatin}]*$
Test
Pattern latinPattern = Pattern.compile("^[\\p{Print}\\p{IsLatin}]*$");
String[] inputs = { "abcDE 123", "!##$%^&*", "aaàààäää", "ベビードラ", "😀😃😄😆" };
for (String input : inputs) {
System.out.print("\"" + input + "\": ");
Matcher matcher = latinPattern.matcher(input);
if (! matcher.find()) {
System.out.println("is NON latin");
} else {
System.out.println("is latin");
}
}
Output
"abcDE 123": is latin
"!##$%^&*": is latin
"aaàààäää": is latin
"ベビードラ": is NON latin
"😀😃😄😆": is NON latin
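The claim above that \p{Print} is the same as [\p{ASCII}&&\P{Cntrl}] can be checked by brute force. This is only a sketch: the helper name sameOnRange and the upper bound 0x24F are arbitrary choices for the illustration, and it assumes the default (non-UNICODE_CHARACTER_CLASS) mode of Pattern.

```java
import java.util.regex.Pattern;

class PrintClassCheck {
    // Returns true if both patterns agree on every one-character string
    // whose char value lies in [from, to].
    static boolean sameOnRange(String regexA, String regexB, int from, int to) {
        Pattern a = Pattern.compile(regexA);
        Pattern b = Pattern.compile(regexB);
        for (int c = from; c <= to; c++) {
            String s = String.valueOf((char) c);
            if (a.matcher(s).matches() != b.matcher(s).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(
                sameOnRange("\\p{Print}", "[\\p{ASCII}&&\\P{Cntrl}]", 0, 0x24F));
    }
}
```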
All Latin Unicode character classes are:
\p{InBasic_Latin}: U+0000–U+007F
\p{InLatin-1_Supplement}: U+0080–U+00FF
\p{InLatin_Extended-A}: U+0100–U+017F
\p{InLatin_Extended-B}: U+0180–U+024F
So, the answer is either
Pattern LatinPattern = Pattern.compile("^[\\p{InBasicLatin}\\p{InLatin-1Supplement}\\p{InLatinExtended-A}\\p{InLatinExtended-B}]+$");
Pattern LatinPattern = Pattern.compile("^[\\x00-\\x{024F}]+$"); //U+0000-U+024F
Note that underscores are removed from the Unicode property class names in Java.
See the Java demo:
List<String> strs = Arrays.asList(
"abcDE 123", // Yes, this should match
"!##$%^&*", // Yes, this should match
"aaàààäää", // Yes, this should match
"ベビードラ", // No, this shouldn't match
"😀😃😄😆"); // No, this shouldn't match
Pattern LatinPattern = Pattern.compile("^[\\p{InBasicLatin}\\p{InLatin-1Supplement}\\p{InLatinExtended-A}\\p{InLatinExtended-B}]+$");
//Pattern LatinPattern = Pattern.compile("^[\\x00-\\x{024F}]+$"); //U+0000-U+024F
for (String str : strs) {
Matcher matcher = LatinPattern.matcher(str);
if (!matcher.find()) {
System.out.println(str + " => is NON Latin");
//return;
} else {
System.out.println(str + " => is Latin");
}
}
Note: if you replace .find() with .matches(), you can throw away ^ and $ in the pattern.
Output:
abcDE 123 => is Latin
!##$%^&* => is Latin
aaàààäää => is Latin
ベビードラ => is NON Latin
😀😃😄😆 => is NON Latin
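The note about .matches() can be sketched as follows, using the code point range variant from this answer without the ^ and $ anchors. The class name LatinOnlyCheck and the method isLatinOnly are hypothetical.

```java
import java.util.regex.Pattern;

class LatinOnlyCheck {
    // No ^ or $ needed: matches() already requires the whole input to match.
    private static final Pattern LATIN_RANGE =
            Pattern.compile("[\\x00-\\x{024F}]+"); // U+0000-U+024F

    static boolean isLatinOnly(String s) {
        return LATIN_RANGE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isLatinOnly("abcDE 123")); // prints true
        System.out.println(isLatinOnly("\u30d9\u30d3")); // prints false (Katakana)
    }
}
```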

How to surround all Bracket groups with * in a string

I have been trying to get a string replaceAll to work in Java that was originally from a JavaScript code block. I have the following
String regexSearch = "((?!([ \\*]))|^)\\[[A-Za-z0-9\\s]*\\](?!\\*)"; //Java Version must escape special characters again
String regexReplacement = "*$&*";
String inputString = "This is a User, [USER 1], and a second user [USER 2]";
Pattern p = Pattern.compile(regexSearch);
Matcher m = p.matcher(inputString);
System.out.println(m.replaceAll(regexReplacement));
My desired output is
This is a User, *[USER 1]*, and a second user *[USER 2]*
I keep getting illegal group reference errors.
Requirements are as follows. Any text that is surrounded by square brackets "[" and "]" will be surrounded by "*" while still retaining the brackets. However if within the bracketed text there is a "|" character then this will not apply.
Your initial ((?!([ \*]))|^)\[[A-Za-z0-9\s]*\](?!\*) regex attempts (but fails) to match [...] strings when not enclosed with * chars. In Java, you would write it as
(?<!\*)\[[A-Za-z0-9\s]*](?!\*)
String regexSearch = "(?<!\\*)\\[[A-Za-z0-9\\s]*](?!\\*)";
However, you may use a more lenient expression like
String regexSearch = "\\[[^\\]\\[|]*]";
Or, if you need to keep the original behavior to fail the matches inside asterisks:
String regexSearch = "(?<!\\*)\\[[^\\]\\[|]*](?!\\*)";
See the regex demo.
It matches:
(?<!\*) - a negative lookbehind that fails the match if there is a * char immediately to the left of the current location
\[ - a [ char
[^\]\[|]* - 0 or more chars other than [, ] and |
] - a ] char
(?!\*) - a negative lookahead that fails the match if there is a * char immediately to the right of the current location.
So, it will match from the [ till the closest ] without matching other [ and | inside, i.e. it will match innermost substrings between square brackets. It will also allow any other special and non-special chars inside brackets, like hyphens, apostrophes, etc. [A-Za-z0-9\s] only allowed ASCII letters, digits and whitespace.
Java demo:
String regexSearch = "\\[[^\\]\\[|]*]";
String regexReplacement = "*$0*";
String inputString = "This is a User, [USER 1], and a second user [USER 2] not [USER | 3]";
Pattern p = Pattern.compile(regexSearch);
Matcher m = p.matcher(inputString);
System.out.println(m.replaceAll(regexReplacement));
// => This is a User, *[USER 1]*, and a second user *[USER 2]* not [USER | 3]
You don't need to worry about matching the whole line, the following is sufficient:
\[(.*?)\]
Replacing this with *[$1]*.
Here's a demo on RegExr.
Further explanation: taking each element in the regex in turn:
\[ - we need to escape the opening square bracket because square brackets are a reserved character in regular expressions.
(.*?) - the .*? matches zero or more of any character lazily. This is surrounded in parentheses to indicate it's a capture group.
] - close the square bracket.
We then replace this with an asterisk followed by an open square bracket (*[), the first capture group ($1), and then the closing square bracket and another asterisk (]*).
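In Java, the capture group version described above could look like the following sketch. The class name BracketWrapper and the method wrapBrackets are made up for the example; note the doubled backslashes required by Java string literals.

```java
class BracketWrapper {
    // Captures the text between [ and ] lazily and re-emits it as *[text]*.
    static String wrapBrackets(String input) {
        return input.replaceAll("\\[(.*?)\\]", "*[$1]*");
    }

    public static void main(String[] args) {
        System.out.println(
                wrapBrackets("This is a User, [USER 1], and a second user [USER 2]"));
        // prints This is a User, *[USER 1]*, and a second user *[USER 2]*
    }
}
```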
It can be done as simple as this:
String s = inputString.replaceAll("\\[.*?]", "*$0*");
No capture groups needed.
Result
This is a User, *[USER 1]*, and a second user *[USER 2]*
Explanation
\\[ Match '[', escaped since '[' has special meaning, double-escaped because of Java
.*? Match any text on single line, match as little as possible
] Match ']', no need to escape since it's not in a character class
* Literal '*'
$0 Entire matched text '[XXX]'
* Literal '*'
This should do it.
String.replaceAll -- the first argument is a regex.
The second argument is the replacement string. $0 refers to the entire match (there is no capture group in this regex, so $1 would throw an exception).
String regexSearch = "\\[.*?]";
String inputString = "This is a User, [USER 1], and a second user [USER 2]";
inputString = inputString.replaceAll(regexSearch, "*$0*");
System.out.println(inputString);
Prints
This is a User, *[USER 1]*, and a second user *[USER 2]*
Try replacing every [ with *[ and every ] with ]* using the String method .replace(CharSequence, CharSequence) in Java.
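A minimal sketch of that plain-text approach follows; the names are illustrative only. It uses no regex at all, which also means it cannot honor the requirement to skip bracket groups containing "|".

```java
class BracketReplaceDemo {
    // Literal replacement, no regex: every [ becomes *[ and every ] becomes ]*.
    static String wrapWithPlainReplace(String input) {
        return input.replace("[", "*[").replace("]", "]*");
    }

    public static void main(String[] args) {
        System.out.println(wrapWithPlainReplace("a user [USER 1] and [USER | 3]"));
        // prints a user *[USER 1]* and *[USER | 3]*
    }
}
```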

Scanning letters and floats using the java scanner

I have a string which looks like this:
"m 535.71429,742.3622 55.71428,157.14286 c 0,0 165.71429,-117.14286 -55.71428,-157.14286 z"
and I want the Java Scanner to output the following strings: "m", "535.71429", "742.3622", "55.71428", "157.14286", "c", ...
so everything separated by a comma or a space, but I am having trouble getting it to work.
This is what my code looks like:
Scanner scanner = new Scanner(path_string);
scanner.useDelimiter(",||//s");
String s = scanner.next();
if (s.equals("m")){
s = scanner.next();
point[0] = Float.parseFloat(s);
s = scanner.next();
point[1] = Float.parseFloat(s);
....
but the strings that come out are: "m", " ", "5", "3", ...
I think the trouble is with //s. You have to use this pattern:
scanner.useDelimiter("(,|\\s)");
Regex patterns:
abc… Letters
123… Digits
\d Any Digit
\D Any Non-digit character
. Any Character
\. Period
[abc] Only a, b, or c
[^abc] Not a, b, nor c
[a-z] Characters a to z
[0-9] Numbers 0 to 9
\w Any Alphanumeric character
\W Any Non-alphanumeric character
{m} m Repetitions
{m,n} m to n Repetitions
* Zero or more repetitions
+ One or more repetitions
? Optional character
\s Any Whitespace
\S Any Non-whitespace character
^…$ Starts and ends
(…) Capture Group
(a(bc)) Capture Sub-group
(.*) Capture all
(ab|cd) Matches ab or cd
The backslash is doubled (\\s) because \ is an escape character inside a Java string literal; | is not, so it needs no escaping there.
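Putting the corrected delimiter to work on the path string from the question could look like this sketch. The class name PathTokenizer and the method tokenize are hypothetical. It assumes the values are always separated by exactly one comma or whitespace character, as in the sample.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class PathTokenizer {
    // Splits on a single comma or a single whitespace character.
    static List<String> tokenize(String path) {
        List<String> tokens = new ArrayList<>();
        try (Scanner scanner = new Scanner(path)) {
            scanner.useDelimiter("(,|\\s)");
            while (scanner.hasNext()) {
                tokens.add(scanner.next());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("m 535.71429,742.3622 55.71428,157.14286 c 0,0 z"));
        // prints [m, 535.71429, 742.3622, 55.71428, 157.14286, c, 0, 0, z]
    }
}
```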
If you want the output to be strings, the Float.parseFloat(s); is of no use for your problem. Is your array a float array?
Because if it is, you should not get any output but a NumberFormatException, because the string "m" cannot be parsed into a float.
Furthermore, to solve the problem of the single values, you could use a StringBuilder which constructs your numbers and ignores the letters and commas. The letters would then need special handling.
Finally, if it is not absolutely necessary, use double instead of float. It's just so much safer and might save you from some more problems within your program!

How to build a Regex in java to detect a whitespace or end of a string?

I am trying to build a Regex to find and extract the string containing Post office box.
Here are some examples:
str = "some text p.o. box 12456 Floor 105 streetName Street";
str = "po box 1011";
str = "post office Box 12 Floor 105 Tallapoosa Street";
str = "leclair ryan pc p.o. Box 2499 8th floor 951 east byrd street";
str = "box 1 slot 3 building 2 136 harvey road";
Here is my pattern and code:
Pattern p = Pattern.compile("p.*o.*box \\d+(\\z|\\s)");
Matcher m = p.matcher(str);
int count =0;
while(m.find()) {
count++;
System.out.println("Match number "+count);
System.out.println("start(): "+m.start());
System.out.println("end(): "+m.end());
}
It works with the second example and not for the first one!
If I change my pattern to the following:
Pattern p = Pattern.compile("p.*o.*box \\d+ ");
It works just for the first example.
The question is how to group the Regex for end of string "\z" and Regex for whitespace "\s" or " "?
New Pattern:
Pattern p = Pattern.compile("(?i)((p.*o.box\\s\\w\\s*\\d*(\\z|\\s*)|(box\\s*\\w\\s*\\d*(\\z|\\s*)) ))");
You can leverage the following code:
String str = "some text p.o. box 12456 Floor 105 streetName Street";
Pattern p = Pattern.compile("(?i)\\bp\\.?\\s*o\\.?\\s*box\\s*(\\d+)(?:\\z|\\s)");
Matcher m = p.matcher(str);
int count =0;
while(m.find()) {
count++;
System.out.println("Match: "+m.group(0));
System.out.println("Digits: "+m.group(1));
System.out.println("Match number "+count);
System.out.println("start(): "+m.start());
System.out.println("end(): "+m.end());
}
To make the pattern case insensitive, just add Pattern.CASE_INSENSITIVE flag to the Pattern.compile declaration or pre-pend the inline (?i) modifier to the pattern.
Also, .* matches any characters other than a newline zero or more times, I guess you wanted to match . optionally. So, you need just the ? quantifier and to escape the dot so as to match a literal dot. Note how I used (...) to capture digits into Group 1 (it is called a capturing group). The group where you match the end of the string or space is inside a non-capturing group ((?:...)) that is used for grouping only, not for storing its value in the memory buffer. Since you wanted to match a word boundary there, I suggest replacing (?:\\z|\\s) with a mere \\b:
Pattern p = Pattern.compile("(?i)\\bp\\.?\\s*o\\.?\\s*box\\s*(\\d+)\\b");
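As a rough sketch, the \\b variant can be wrapped in a helper and run against the question's sample strings. The class name PoBoxFinder and the method findBoxNumber are made up for the example. Note that this pattern only covers the "p.o. box" and "po box" spellings, not "post office Box" or a bare "box".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PoBoxFinder {
    private static final Pattern PO_BOX =
            Pattern.compile("(?i)\\bp\\.?\\s*o\\.?\\s*box\\s*(\\d+)\\b");

    // Returns the digits captured in Group 1, or null when nothing matches.
    static String findBoxNumber(String address) {
        Matcher m = PO_BOX.matcher(address);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(findBoxNumber("some text p.o. box 12456 Floor 105 streetName Street"));
        // prints 12456
        System.out.println(findBoxNumber("po box 1011"));
        // prints 1011
    }
}
```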
There are a couple items in your regex that look like they need work. From what I understand you want to extract the P.O. Box number from strings of such format that you've provided. Given that, the following regex will accomplish what you want, with a following explanation. See it in action here: https://regex101.com/r/cQ8lH3/2
Pattern p = Pattern.compile("p\\.?o\\.? box [^ \\r\\n\\t]+");
Firstly, you must escape the dots. If you do not escape the dots, regex will match . as ANY single character; \. will instead match a literal dot. In the regex itself each escape sequence takes a single backslash, but inside a Java string literal that backslash has to be doubled (\\.).
Next, you need to change the * quantifier after the \. to a ?. Why? The * symbol will match zero or more of the preceding symbol while the ? quantifier will match only one or none.
Finally, rethink how you're matching the box number. Instead of matching all characters AND THEN white space, just match everything that isn't whitespace. [^ \r\n\t]+ will match all characters that are NOT a space ( ), carriage return (\r), newline (\n), or tab (\t). Therefore it will consume the box number and stop as soon as it hits any whitespace or the end of the string.
Some of these changes may not be necessary to get your code to work for the examples you gave, but they are the proper way to build the regex you want.

(Pattern and Matcher) not discovering all pattern matches

I have this string object which consists of tags (bounded by [$ and $]) and the rest of the text. I'm trying to isolate all of the tags. Pattern and Matcher recognize all of the tags properly, but two of them are combined into one. I don't have any idea why this is happening, probably some internal Matcher/Pattern business.
String docBody = "This is sample text.\r\n[$ FOR i 1 10 1 $]\r\n This is" +
"[$ i $]-th time this message is generated.\r\n[$END$]\r\n" +
"[$ FOR i 0 10 2 $]\r\n sin([$= i $]^2) = [$= i i * #sin \"0.000\"" +
" #decfmt $]" +
"\r\n[$END$] ";
Pattern p = Pattern.compile("(\\[\\$)(.)+(\\$\\])");
Matcher m = p.matcher(docBody);
while(m.find()){
System.out.println(m.group());
}
output:
[$ FOR i 1 10 1 $]
[$ i $]
[$END$]
[$ FOR i 0 10 2 $]
[$= i $]^2) = [$= i i * #sin "0.000" #decfmt $]
[$END$]
As you can see, this part [$= i $]^2) = [$= i i * #sin "0.000" #decfmt $] is not split into these two tags [$= i $] and [$= i i * #sin "0.000" #decfmt $]
Any suggestions why this is happening?
You should use reluctant quantifier - ".+?" instead of greedy - ".+" :
"(\\[\\$).+?(\\$\\])" // Note `?` after `.+`
If you use .+, it will match everything except line terminators up to the last $] on the line. Note that a dot (.) matches everything except a newline. With the reluctant quantifier, .+? matches only up to the first $] it encounters.
In your given string, you got all those matches, because you had \r\n in between, where the .+ stops matching. If you remove all those newlines, then you will just get a single match from 1st [$ to the last $].
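The difference can be seen on the problematic line from the question. This is only a sketch; the class name TagMatcherDemo and the helper findAll are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagMatcherDemo {
    // Returns every match of the given regex in the text.
    static List<String> findAll(String regex, String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String line = "sin([$= i $]^2) = [$= i i * #sin \"0.000\" #decfmt $]";
        // Greedy: one match running from the first [$ to the last $].
        System.out.println(findAll("\\[\\$.+\\$\\]", line).size()); // prints 1
        // Reluctant: stops at the first $] each time, so two matches.
        System.out.println(findAll("\\[\\$.+?\\$\\]", line).size()); // prints 2
    }
}
```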
A good way is to replace the dot by a negated character class, example:
Pattern p = Pattern.compile("(\\[\\$)([^$]++)(\\$])");
(note that you don't need to escape closing square brackets)
But perhaps you are only interested in the content of the tags:
Pattern p = Pattern.compile("(?<=\\[\\$)[^$]++(?=\\$])");
In this case the content is the whole match
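A short sketch of the lookaround variant follows, with made-up names (TagContentExtractor, tagContents). It assumes, like the pattern itself, that a tag never contains a $ character in its body.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagContentExtractor {
    // The lookbehind and lookahead check for [$ and $] without consuming them,
    // so group() returns only the text between the delimiters.
    private static final Pattern CONTENT =
            Pattern.compile("(?<=\\[\\$)[^$]++(?=\\$])");

    static List<String> tagContents(String text) {
        List<String> contents = new ArrayList<>();
        Matcher m = CONTENT.matcher(text);
        while (m.find()) {
            contents.add(m.group());
        }
        return contents;
    }

    public static void main(String[] args) {
        System.out.println(tagContents("[$ i $]-th time [$END$]"));
        // prints [ i , END]
    }
}
```

The surrounding spaces are part of the match; call trim() on each element if they are not wanted.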
