Extract and print only accented characters with a regular expression in Java - java

I have been trying to extract only the accented characters (the particular words that contain them) from multiple text files in a folder.
I don't want to remove accented characters or convert them to plain characters. I want to print only the accented words from files that mix accented words and plain characters, in Java.
The goal is only to extract all words that contain accents.
After searching for a while, the link below is one kind of solution with a similar regex, but it doesn't work as required: it also selects null values and plain characters.
Regex accented Characters for special field
Another solution I found is ([a-zA-Z]|[à-ü]|[À-Ü])
It selects each letter separately, which is not feasible because it is not word-specific and it selects both plain and accented letters.

If you want to match a word that contains an accented letter, you need something like:
[a-zA-Zà-üÀ-Ü]*[à-üÀ-Ü][a-zA-Zà-üÀ-Ü]*
Explanation:
[a-zA-Zà-üÀ-Ü]* - this matches both accented and unaccented letters (so the word can contain other accented or unaccented letters); the star * quantifier matches zero or more letters
[à-üÀ-Ü] - this matches exactly one accented letter, which forces the match to cover only words with an accent
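As a minimal sketch of how that pattern could be applied, the class below collects every match from a string. The class name AccentWords and the method extract are hypothetical, and reading the text files from a folder is left out; the ranges are written as Unicode escapes only so that the source file encoding does not matter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AccentWords {
    // U+00E0..U+00FC is the range à-ü and U+00C0..U+00DC is the range À-Ü.
    private static final String LETTER = "[a-zA-Z\u00E0-\u00FC\u00C0-\u00DC]";
    private static final String ACCENT = "[\u00E0-\u00FC\u00C0-\u00DC]";

    // Any letters, then one accented letter, then any letters.
    private static final Pattern ACCENTED_WORD =
            Pattern.compile(LETTER + "*" + ACCENT + LETTER + "*");

    // Hypothetical helper: returns every word that contains at least one accented letter.
    static List<String> extract(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = ACCENTED_WORD.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // Input is "un café et une démo test" written with escapes.
        System.out.println(extract("un caf\u00E9 et une d\u00E9mo test"));
    }
}
```

Note that the ranges also cover the two non-letter code points × (U+00D7) and ÷ (U+00F7), so a stricter pattern may be needed if those can occur in the input.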

Related

How to prevent accented characters in an email field in Java using regex?

I have an email field in a form which is currently validated using GenericValidator.isEmail method. But now I need to apply another validation where I need to prevent accented characters being sent to the email address. So I was thinking of using a Regex Pattern Matching approach and I found one in stackoverflow itself
if (Pattern.matches(".*[éèàù].*", input)) {
// your code
}
Problem is I saw only é è à ù characters in the pattern but there are several other accented characters like õ ü ì etc. So is there a way we can match pattern for all types of accented characters?
I needed to match for NL (Dutch), FR(French) and DE(German) language accented characters. I need to check if my email address has any accented character and if it does need to stop execution there and throw an error
It turns out you want to match any letter but an ASCII letter.
I suggest subtracting ASCII letters from the \p{L} pattern that matches any Unicode letter:
Pattern.matches("(?s).*[\\p{L}&&[^A-Za-z]].*", input)
Here,
(?s) - Pattern.DOTALL embedded flag option that makes . match across lines
.* - any zero or more chars, as many as possible
[\\p{L}&&[^A-Za-z]] - any Unicode letter except ASCII letters
.* - any zero or more chars, as many as possible.
Note that it is better to use find(), since it also finds partial matches. There is then no need for (?s).* and .* in the above pattern, which makes it much more efficient with longer strings:
Pattern.compile("[\\p{L}&&[^A-Za-z]]").matcher(input).find()
See this Java demo.
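A minimal sketch of how the find() variant could be wired into the validation step follows. The class name EmailAccentCheck, the method names and the exception message are assumptions for illustration; only the pattern itself comes from the answer.

```java
import java.util.regex.Pattern;

public class EmailAccentCheck {
    // Any Unicode letter that is not an ASCII letter (pattern from the answer above).
    private static final Pattern NON_ASCII_LETTER =
            Pattern.compile("[\\p{L}&&[^A-Za-z]]");

    // True if the input contains at least one non-ASCII letter anywhere.
    static boolean hasNonAsciiLetter(String input) {
        return NON_ASCII_LETTER.matcher(input).find();
    }

    // Hypothetical validation hook: stop and throw when an accented letter is present.
    static void requireAsciiLetters(String email) {
        if (hasNonAsciiLetter(email)) {
            throw new IllegalArgumentException("Email must not contain accented letters");
        }
    }

    public static void main(String[] args) {
        System.out.println(hasNonAsciiLetter("jose@example.com"));
        System.out.println(hasNonAsciiLetter("jos\u00E9@example.com"));
    }
}
```

One caveat: an accent written in decomposed form (a plain letter followed by a combining mark) is not a letter in the \p{L} sense, so it would pass this check unless the input is normalized to NFC first or \p{M} is added to the class.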

Convert english characters to accent characters and symbols in java

I have a requirement where a dropdown contains String texts with accented characters as well as English characters. When I search with English characters, I should also be able to find their accented equivalents, that is, convert English characters to their equivalent accented characters.
When I provide the input DAG, I should get the output below:
ÐÁĞ
DAG
I had thought of a solution with a HashMap that maps each English letter to its accented characters, but I wondered whether a library already does this job. I tried every way I could think of but didn't find one. Please advise.

Java regex: search for a string without accent in a text with accent

In my Java app, I want to use a regex to be able to know if a string exists or not in a text.
The case I want to cover is this one: let's assume that my original text is the following french text (with an accent):
démo test
I want to know if the word demo (without accent) exists in the text, using a regex. The thing is: I can't change the original text (I can't use Normalizer.normalize() for example), since I'm using a library that takes a regex as an argument.
Here is what I tried:
If I use "(?i)démo", there is a match (since démo exists)
If I use "(?i)demo", there is no match, but I also want a match here. I want the regex to be accent insensitive.
So far, I haven't managed to find a regex that can cover that specific case.
Is there any regex that can cover that case?
Thanks for your help.
Assuming you really cannot change the input text, the following works:
If your input text is in decomposed form, meaning that démo consists of the unicode codepoints d e COMBINING ACUTE ACCENT m o, you can optionally match the accent:
de\pM?mo
where \pM describes the unicode property "Mark". This would match all marks. You can also just optionally match \u0301 directly if you only care about that exact accent
If your text is in composed form, meaning démo consists of the unicode codepoints d LATIN SMALL LETTER E WITH ACUTE m o, you'll have to just manually match either in the regex:
d(e|é)mo
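The two cases can also be folded into one pattern when the form of the text is not known in advance. This is only a sketch under that assumption; the class name AccentInsensitiveDemo and the method containsDemo are hypothetical, and the combined alternation is a variation on the two patterns above rather than something the answer itself proposes.

```java
import java.util.regex.Pattern;

public class AccentInsensitiveDemo {
    // "e" optionally followed by a combining mark (decomposed text),
    // or the precomposed letter U+00E9 (composed text).
    private static final Pattern DEMO =
            Pattern.compile("(?i)d(?:e\\pM?|\u00E9)mo");

    static boolean containsDemo(String text) {
        return DEMO.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsDemo("d\u00E9mo test"));   // composed form
        System.out.println(containsDemo("de\u0301mo test"));  // decomposed form
        System.out.println(containsDemo("demo test"));        // no accent at all
    }
}
```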
One way is to modify the regex literal: search for the accented characters and replace each with a class.
Find any one of these:     Replace with this literal:
---------------------------------------------
[aâàä]                 ->  [aâàä]
[cç]                   ->  [cç]
[eéèêë]                ->  [eéèêë]
[iîï]                  ->  [iîï]
[oô]                   ->  [oô]
[uùûü]                 ->  [uùûü]
[?œ]                   ->  ????
This requires running 7 separate regexes on the search string, each as a global find/replace.
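The table above can be sketched in Java roughly as follows. Instead of seven separate find/replace runs, this variation walks the search term once and emits the matching class for each character; the class name AccentClassBuilder and the method toAccentInsensitiveRegex are hypothetical, and the last (garbled) table row is left out.

```java
import java.util.regex.Pattern;

public class AccentClassBuilder {
    // One entry per table row, written with Unicode escapes:
    // a-group, c-group, e-group, i-group, o-group, u-group.
    private static final String[] GROUPS = {
        "a\u00E2\u00E0\u00E4",
        "c\u00E7",
        "e\u00E9\u00E8\u00EA\u00EB",
        "i\u00EE\u00EF",
        "o\u00F4",
        "u\u00F9\u00FB\u00FC"
    };

    // Turns a search term into a regex where each listed letter becomes its class.
    static String toAccentInsensitiveRegex(String term) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < term.length(); ) {
            int cp = term.codePointAt(i);
            i += Character.charCount(cp);
            String group = null;
            for (String g : GROUPS) {
                if (g.indexOf(cp) >= 0) {
                    group = g;
                    break;
                }
            }
            if (group != null) {
                sb.append('[').append(group).append(']');
            } else {
                // Quote everything else so regex metacharacters stay literal.
                sb.append(Pattern.quote(new String(Character.toChars(cp))));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("(?i)" + toAccentInsensitiveRegex("demo"));
        System.out.println(p.matcher("d\u00E9mo test").find());
    }
}
```

The resulting string can be handed to any library that takes a regex as an argument, which is the constraint described in the question.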

Java regex matches diacritics for the Latin corresponding characters

I am trying to use a regex such as [ăâîșțĂÂÎȘȚ] to match the Romanian alphabet diacritics (ISO 8859-16/Windows-1250). The problem is that the regex also matches a, i, s, t, A, I, S, T (the plain Latin letters corresponding to the diacritics above), and I don't want this.
I didn't try to compare strings character by character because of performance time.
Is there any way I can make the regex match exactly these characters?
If your regex exists as literal rendered text, it has already been combined
and should exist as a different code point.
000074 t LATIN SMALL LETTER T
+
000326 ̦ COMBINING COMMA BELOW
=
00021B ț LATIN SMALL LETTER T WITH COMMA BELOW
Just in case, you should use a hex code point to represent them, i.e. \u021B
Is it possible the Java engine could be stripping the combining character off of the regex?
Where x21B becomes x74? Might be that.
Meanwhile if you expect the letters in the source are not rendered, you could
use a regex like \p{Script=Latin}\p{Block=Combining_Diacritical_Marks}
to get those.
Updated info:
While searching around for a de facto solution, I came across this Java info
from http://www.regular-expressions.info/unicode.html.
In Java, the regex token \uFFFF only matches the specified
code point, even when you turned on canonical equivalence.
However, the same syntax \uFFFF is also used to insert
Unicode characters into literal strings in the Java source
code. Pattern.compile("\u00E0") will match both the
single-code-point and double-code-point encodings of à,
while Pattern.compile("\\u00E0") matches only the
single-code-point version. Remember that when writing a
regex as a Java string literal, backslashes must be escaped.
The former Java code compiles the regex à, while the latter
compiles \u00E0. Depending on what you're doing, the
difference may be significant.
So, by entering a literal that has two encodings inside a class, it looks like Pattern.compile("[à]")
will actually match
000061 a LATIN SMALL LETTER A
or
000300 ̀ COMBINING GRAVE ACCENT
or
0000E0 à LATIN SMALL LETTER A WITH GRAVE
This smacks of the same problem when putting surrogate pairs inside classes.
There is a solution.
Avoid entering those literals inside of a class.
Instead, put them as a series of alternations
(?:à|_|_|_)
Doing this forces it to match either
000061 a LATIN SMALL LETTER A
000300 ̀ COMBINING GRAVE ACCENT
or
0000E0 à LATIN SMALL LETTER A WITH GRAVE
It won't match a on its own, independent of the grave, as you see now.
Note - if you just use "[\\u00E0]", you'd miss the a + grave, which is valid.
I believe this is happening because those characters are being treated as two Unicode code points. I would recommend trying to specifically match the code points using syntax like \uFFFF where FFFF is the code point. The exact syntax will depend on the regex implementation you are using.
Keep in mind that Unicode characters can be encoded as single code points or as multiple, so you'll want to account for that. Example: à encoded as U+0061 U+0300 and also U+00E0.
I hope this helps!
As already mentioned, Unicode offers two alternatives.
'\u0061' 'a' LATIN SMALL LETTER A
'\u0300' ̀ COMBINING GRAVE ACCENT
or
'\u00E0' 'à' LATIN SMALL LETTER A WITH GRAVE
There is a Normalizer that can "normalize" to either form (and deal with ligatures):
String regex = "(?u)[ăâîșțĂÂÎȘȚ]";
regex = Normalizer.normalize(regex, Form.NFC); // Composed form
Pattern pattern = Pattern.compile(regex);
Using "(?u)", or passing a Unicode flag to Pattern.compile, might already solve the problem. But using the composed Unicode form, without a separate Latin 'a', will certainly do.
The normalizer should especially be applied on the searched-through string.
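A minimal sketch of normalizing both sides to NFC before matching, with the class name RomanianDiacritics and the method containsDiacritic as hypothetical names. The ten letters are written as Unicode escapes so the source file encoding does not matter.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.regex.Pattern;

public class RomanianDiacritics {
    // The class [ăâîșțĂÂÎȘȚ] as escapes, normalized to composed form (NFC)
    // so that every letter is a single code point inside the class.
    private static final Pattern DIACRITIC = Pattern.compile(
            Normalizer.normalize(
                    "[\u0103\u00E2\u00EE\u0219\u021B\u0102\u00C2\u00CE\u0218\u021A]",
                    Form.NFC));

    // Normalizes the searched text to NFC as well, then looks for any diacritic letter.
    static boolean containsDiacritic(String text) {
        return DIACRITIC.matcher(Normalizer.normalize(text, Form.NFC)).find();
    }

    public static void main(String[] args) {
        System.out.println(containsDiacritic("tara"));         // plain Latin letters only
        System.out.println(containsDiacritic("t\u0326ara"));   // t + COMBINING COMMA BELOW
        System.out.println(containsDiacritic("\u021Bar\u0103")); // precomposed letters
    }
}
```

With both the pattern and the input in NFC, a plain a, i, s or t no longer matches, while a decomposed t + comma below in the input is composed to U+021B before the match runs.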

Regex to find all variants of a certain character inside a text

I am trying to find Unicode variants of a user-entered character in a text in order to highlight them. E.g. if the user enters "Beyonce", I'd like to highlight all text with variants like "Beyoncé" or "Beyônce" or "Bèyönce" in the text. Currently the only idea I have is to create a regex by replacing each character of the input string with a character group, like this:
"Beyonce" => "B[eêéè]y[oóòôö]nc[eéèê]"
But this seems to be a very tedious and error prone way of doing it. What I am basically looking for is a regex character group that matches all variants of a given input character, something like \p{M} but with the possibility to specify the base letter. Is there something available like this in java regex? And if not, how could the regex creation process be improved? I don't think that specifying all variants by hand is going to work in the long run.
There are several ways an accented character can be represented. There's a good example in the javadoc of java.text.Normalizer:
For example, take the character A-acute. In Unicode, this can be encoded
as a single character (the "composed" form):
U+00C1 LATIN CAPITAL LETTER A WITH ACUTE
or as two separate characters (the "decomposed" form):
U+0041 LATIN CAPITAL LETTER A
U+0301 COMBINING ACUTE ACCENT
The second form makes it relatively easy to access the unaccented character, and fortunately Normalizer can help you here:
Normalizer.normalize(text, Form.NFD); // NFD = "Canonical decomposition"
You can then use a regex to ignore (or remove) any non-ASCII characters from the string, based on:
[^\p{ASCII}]
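Put together, the two steps could look like the sketch below. The class name AccentStripper and the methods stripAccents and sameBaseLetters are hypothetical; the comparison helper is just one assumed way to decide whether a word in the text is a variant of the user's input.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

public class AccentStripper {
    // Decomposes the text (NFD), then drops everything that is not ASCII,
    // which removes the combining marks left behind by the decomposition.
    static String stripAccents(String text) {
        String decomposed = Normalizer.normalize(text, Form.NFD);
        return decomposed.replaceAll("[^\\p{ASCII}]", "");
    }

    // Hypothetical helper: two words are variants if their stripped forms are equal.
    static boolean sameBaseLetters(String a, String b) {
        return stripAccents(a).equalsIgnoreCase(stripAccents(b));
    }

    public static void main(String[] args) {
        System.out.println(stripAccents("Beyonc\u00E9"));
        System.out.println(sameBaseLetters("Beyonce", "B\u00E8y\u00F6nce"));
    }
}
```

Because [^\p{ASCII}] removes every non-ASCII character, letters that have no decomposition (for example ø or ß) are dropped entirely rather than mapped to a base letter; replacing only \p{M} is a narrower alternative if that matters.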
