Convert English characters to accented characters and symbols in Java

I have a requirement where a dropdown contains strings with both accented and English characters. When I search with English characters, the search should also find the equivalent accented characters. In other words, I need to convert English characters to their equivalent accented characters.
For example, when I provide DAG as input, I should get the following output:
ÐÁĞ
DAG
I thought of using a HashMap that maps each English letter to its accented variants, but I wondered whether a library already does this. I searched but didn't find one. Please advise.
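One library-free way to approximate this is to go in the opposite direction: instead of expanding DAG into every accented variant, fold both the search term and each dropdown entry down to base letters and compare those. The following is a minimal sketch using java.text.Normalizer. The class and method names are hypothetical, and the explicit handling of Ð/ð is an illustrative assumption: those letters have no canonical decomposition, so normalization alone leaves them unchanged. Unicode escapes are used so the code does not depend on the source file encoding.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class AccentInsensitiveSearch {

    // Folds a string to unaccented base letters: decompose, then drop combining marks.
    static String fold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        // Assumption: letters without a decomposition (here only eth, U+00D0/U+00F0)
        // are mapped by hand. This list is illustrative, not exhaustive.
        return stripped.replace('\u00D0', 'D').replace('\u00F0', 'd');
    }

    // Returns the options whose folded form contains the folded query, ignoring case.
    static List<String> search(String query, List<String> options) {
        String key = fold(query).toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (String option : options) {
            if (fold(option).toLowerCase(Locale.ROOT).contains(key)) {
                hits.add(option);
            }
        }
        return hits;
    }
}
```

With this approach, searching for DAG would return both the accented entry and the plain DAG entry, because both fold to the same key.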

Related

Extract and Print only accent characters through regular expression in JAVA

I have been trying to extract only the accented words from multiple text files in a folder.
I don't want to remove accented characters or convert them to plain characters. I want to print only the words that contain accented characters, from multiple text files, including mixed files that contain both accented and plain words, in Java.
**Only extract the words that contain accented characters.**
After searching for a while, I found the question linked below, which offers a similar regex, but it doesn't work as required: it also selects null values and plain characters.
Regex accented Characters for special field
Another solution I found is ([a-zA-Z]|[à-ü]|[À-Ü])
It selects each letter separately, which is not feasible because it is not word-specific, and it also selects both plain and accented letters.
If you want to match a word that contains an accented letter, you need something like:
[a-zA-Zà-üÀ-Ü]*[à-üÀ-Ü][a-zA-Zà-üÀ-Ü]*
Explanation:
[a-zA-Zà-üÀ-Ü]* - matches both accented and unaccented letters (so the word can contain other letters around the accented one); the star * modifier matches zero or more letters
[à-üÀ-Ü] - matches exactly one accented letter, which forces a match only on words that contain an accent
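As a rough illustration of how that pattern could be applied in Java, here is a small sketch. The class name is hypothetical, and it assumes the text is in composed form (one code point per accented letter). The ranges are written as regex escapes (U+00E0 to U+00FC and U+00C0 to U+00DC) so the result does not depend on the source file encoding. Note that these ranges also include the × and ÷ signs, so treat this as a starting point rather than a precise definition of "accented letter".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AccentedWordExtractor {

    // The two Latin-1 ranges from the answer, as regex-level Unicode escapes.
    private static final String ACCENTED = "\\u00E0-\\u00FC\\u00C0-\\u00DC";

    // Any letters, then at least one accented letter, then any letters.
    private static final Pattern WORD_WITH_ACCENT = Pattern.compile(
            "[a-zA-Z" + ACCENTED + "]*[" + ACCENTED + "][a-zA-Z" + ACCENTED + "]*");

    // Returns every word in the text that contains at least one accented letter.
    static List<String> extract(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD_WITH_ACCENT.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }
}
```

To process a folder, each file's contents could be read into a string and passed to extract; words without accents are simply never matched.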

Java regex: search for a string without accent in a text with accent

In my Java app, I want to use a regex to be able to know if a string exists or not in a text.
The case I want to cover is this one: let's assume that my original text is the following French text (with an accent):
démo test
I want to know if the word demo (without accent) exists in the text, using a regex. The thing is: I can't change the original text (I can't use Normalizer.normalize() for example), since I'm using a library that takes a regex as an argument.
Here is what I tried:
If I use "(?i)démo", there is a match (since démo exists)
If I use "(?i)demo", there is no match, but I also want a match here. I want the regex to be accent insensitive.
So far, I haven't managed to find a regex that can cover that specific case.
Is there any regex that can cover that case?
Thanks for your help.
Assuming you really cannot change the input text, the following works:
If your input text is in decomposed form, meaning that démo consists of the unicode codepoints d e COMBINING ACUTE ACCENT m o, you can optionally match the accent:
de\pM?mo
where \pM describes the Unicode property "Mark". This matches any mark. You can also optionally match \u0301 directly if you only care about that exact accent.
If your text is in composed form, meaning démo consists of the unicode codepoints d LATIN SMALL LETTER E WITH ACUTE m o, you'll have to just manually match either in the regex:
d(e|é)mo
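To make the difference between the two cases concrete, here is a small sketch with three patterns: one for decomposed text, one for composed text, and a combined one that should cover both. The class and field names are hypothetical, and é is written as the regex escape \u00E9 so the pattern does not depend on how the source file is encoded.

```java
import java.util.regex.Pattern;

class AccentTolerantDemo {

    static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;

    // Decomposed text: plain e optionally followed by one combining mark.
    static final Pattern DECOMPOSED = Pattern.compile("de\\p{M}?mo", FLAGS);

    // Composed text: either plain e or the single code point U+00E9.
    static final Pattern COMPOSED = Pattern.compile("d(?:e|\\u00E9)mo", FLAGS);

    // Both forms in one pattern, for when the form of the text is unknown.
    static final Pattern EITHER = Pattern.compile("d(?:e\\p{M}?|\\u00E9)mo", FLAGS);
}
```

Since the library only accepts a regex, the combined pattern is probably the safer choice when you cannot inspect how the text is encoded.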
One way is to modify the regex literal: search it for accented characters and replace each one with a character class.
Find (any one of these)      Replace with this literal
---------------------------------------------
[aâàä]                       [aâàä]
[cç]                         [cç]
[eéèêë]                      [eéèêë]
[iîï]                        [iîï]
[oô]                         [oô]
[uùûü]                       [uùûü]
[?œ]                         ????
This requires running 7 separate regexes on the search string: a global find/replace, seven times.
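The same substitution can be sketched in Java as a single pass over the search string instead of seven separate replacements. This is only an illustration: the class and method names are hypothetical, only the six legible groups from the table are included (the œ row is left out), and every character outside those groups is quoted so it is matched literally.

```java
import java.util.regex.Pattern;

class AccentClassRegex {

    // One entry per row of the table: the base letter followed by its accented variants.
    private static final String[] GROUPS = {
        "a\u00E2\u00E0\u00E4",
        "c\u00E7",
        "e\u00E9\u00E8\u00EA\u00EB",
        "i\u00EE\u00EF",
        "o\u00F4",
        "u\u00F9\u00FB\u00FC"
    };

    // Returns the group containing the given (lowercase) code point, or null.
    private static String groupOf(int codePoint) {
        for (String group : GROUPS) {
            if (group.indexOf(codePoint) >= 0) {
                return group;
            }
        }
        return null;
    }

    // Replaces each letter that belongs to a group with that group's character class.
    static String toRegex(String word) {
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < word.length(); ) {
            int cp = word.codePointAt(i);
            String group = groupOf(Character.toLowerCase(cp));
            if (group != null) {
                regex.append('[').append(group).append(']');
            } else {
                regex.append(Pattern.quote(new String(Character.toChars(cp))));
            }
            i += Character.charCount(cp);
        }
        return regex.toString();
    }

    static Pattern compile(String word) {
        return Pattern.compile(toRegex(word), Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
    }
}
```

The resulting pattern string can be handed to any API that expects a regex. It only helps when the text is in composed form, as discussed above.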

Java regex matches diacritics for the Latin corresponding characters

I am trying to use a regex such as [ăâîșțĂÂÎȘȚ] to match Romanian diacritics (ISO 8859-16/Windows-1250). The problem is that the regex also matches a, i, s, t, A, I, S, T (the plain Latin letters corresponding to those diacritics), and I don't want this.
I didn't try comparing strings character by character because of the performance cost.
Is there any way to make the regex match exactly these characters?
If your regex exists as literal rendered text, it has already been combined
and should exist as a different code point.
000074 t LATIN SMALL LETTER T
+
000326 ̦ COMBINING COMMA BELOW
=
00021B ț LATIN SMALL LETTER T WITH COMMA BELOW
Just in case, you should use a hex code point to represent them, i.e. \u021B
Is it possible the Java engine could be stripping the combining character off of the regex?
Where x21B becomes x74? Might be that.
Meanwhile if you expect the letters in the source are not rendered, you could
use a regex like \p{Script=Latin}\p{Block=Combining_Diacritical_Marks}
to get those.
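In Java's own property syntax, that idea can be sketched as below. The class name is hypothetical. The pattern looks for a Latin letter immediately followed by one character from the Combining Diacritical Marks block, so it only finds letters stored in decomposed form.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DecomposedLetterFinder {

    // A Latin-script character followed by a mark from the block U+0300..U+036F.
    private static final Pattern BASE_PLUS_MARK =
            Pattern.compile("\\p{IsLatin}\\p{InCombiningDiacriticalMarks}");

    // Returns each base-letter-plus-mark pair found in the text.
    static List<String> find(String text) {
        List<String> pairs = new ArrayList<>();
        Matcher m = BASE_PLUS_MARK.matcher(text);
        while (m.find()) {
            pairs.add(m.group());
        }
        return pairs;
    }
}
```

Precomposed letters such as U+0219 are a single code point and are not matched by this pattern, which is why it complements rather than replaces the character class.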
updated info :
While searching around for a de facto solution, I came across this Java info
from http://www.regular-expressions.info/unicode.html.
In Java, the regex token \uFFFF only matches the specified
code point, even when you turned on canonical equivalence.
However, the same syntax \uFFFF is also used to insert
Unicode characters into literal strings in the Java source
code. Pattern.compile("\u00E0") will match both the
single-code-point and double-code-point encodings of à,
while Pattern.compile("\\u00E0") matches only the
single-code-point version. Remember that when writing a
regex as a Java string literal, backslashes must be escaped.
The former Java code compiles the regex à, while the latter
compiles \u00E0. Depending on what you're doing, the
difference may be significant.
So, by entering a duality literal inside a class, it looks like Pattern.compile("[à]")
will actually match
000061 a LATIN SMALL LETTER A
or
000300 ̀ COMBINING GRAVE ACCENT
or
0000E0 à LATIN SMALL LETTER A WITH GRAVE
This smacks of the same problem when putting surrogate pairs inside classes.
There is a solution.
Avoid entering those literals inside of a class.
Instead, put them as a series of alternations
(?:à|_|_|_)
Doing this forces it to match either
000061 a LATIN SMALL LETTER A
000300 ̀ COMBINING GRAVE ACCENT
or
0000E0 à LATIN SMALL LETTER A WITH GRAVE
It won't match a independent of the grave like you see now.
Note - If you just use a "[\\u00E0]" you'd miss the a + grave.
which is valid.
I believe this is happening because those characters are being treated as two Unicode code points. I would recommend trying to specifically match the code points using syntax like \uFFFF where FFFF is the code point. The exact syntax will depend on the regex implementation you are using.
Keep in mind that Unicode characters can be encoded as single code points or as multiple, so you'll want to account for that. Example: à encoded as U+0061 U+0300 and also U+00E0.
I hope this helps!
As already mentioned, in Unicode one has two alternatives:
'\u0061' 'a' LATIN SMALL LETTER A
'\u0300' ̀ COMBINING GRAVE ACCENT
or
'\u00E0' 'à' LATIN SMALL LETTER A WITH GRAVE
There is a Normalizer that can "normalize" to either form (and deal with ligatures):
// Requires java.text.Normalizer and java.util.regex.Pattern
String regex = "(?u)[ăâîșțĂÂÎȘȚ]";
regex = Normalizer.normalize(regex, Normalizer.Form.NFC); // composed form
Pattern pattern = Pattern.compile(regex);
Using "(?u)", or passing the UNICODE_CASE flag to Pattern.compile, might already solve the problem. But using the composed Unicode form, without a separate Latin base letter ('a'), will certainly do.
The normalizer should especially be applied on the searched-through string.
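Putting that advice together, a minimal sketch could normalize both the character class and the searched text to NFC before matching. The class and method names are hypothetical, and the ten Romanian letters are written as Unicode escapes so each one is guaranteed to be a single composed code point regardless of the source file encoding.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

class RomanianDiacritics {

    // The ten letters from the question, each as one precomposed code point.
    private static final Pattern DIACRITIC = Pattern.compile(Normalizer.normalize(
            "[\u0103\u00E2\u00EE\u0219\u021B\u0102\u00C2\u00CE\u0218\u021A]",
            Normalizer.Form.NFC));

    // True if the text contains any of the letters, in composed or decomposed form.
    static boolean containsDiacritic(String text) {
        String composed = Normalizer.normalize(text, Normalizer.Form.NFC);
        return DIACRITIC.matcher(composed).find();
    }
}
```

Because the text is composed first, a decomposed t followed by a combining comma below becomes the single letter from the class, while a bare t never matches.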

Translation Special Letters to English

I am wondering:
1. Is there any "standard" translation of special letters like ä, ö, ü, ç, Ñ, Ã, æ, etc. into English? A German would certainly transcribe ä as ae, but an American would probably just use a. Is there any standard?
2. If yes, is there any Java library that covers the characters in the Unicode blocks "Basic Latin" (U+0000-U+007F) and "Latin-1 Supplement" (U+0080-U+00FF)?
Thanks
I think a solution for your problem is transliteration.
Check those links below:
ICU Home page
Transliterator class
I had an idea, but it doesn't work. It's just complete rubbish. Don't try this.
I'm not sure if there is a standard as such.
One thing you could do would be to normalise the string into the NFKD form, which breaks all characters down to their most basic elements, such as base letters and combining marks, then filter out just the ASCII characters. This would take ä to a, and all other single characters with diacritics to their base characters. Note that æ has no Unicode decomposition, so the ASCII filter would drop it rather than turn it into ae; letters like that need an explicit mapping.
This won't make Germans happy, though.
With the java Normalizer you can split ä into a + combining diacritic mark. And then you can simply remove all diacritic marks.
String normalizedString = Normalizer.normalize(s, Normalizer.Form.NFKD);
String ascii = normalizedString.replaceAll("\\p{InCombiningDiacriticalMarks}", "");
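If the German convention matters, one option is to apply an explicit mapping before stripping the marks. The sketch below is only an illustration under that assumption: the class and method names are hypothetical, and the mapping covers just the German umlauts and ß, not a complete transliteration standard.

```java
import java.text.Normalizer;

class GermanAwareAscii {

    // Maps German umlauts and sharp s by hand, then strips the remaining diacritics.
    static String toAscii(String s) {
        // Compose first so that a decomposed umlaut is caught by the replacements.
        String t = Normalizer.normalize(s, Normalizer.Form.NFC);
        t = t.replace("\u00E4", "ae").replace("\u00F6", "oe").replace("\u00FC", "ue")
             .replace("\u00C4", "Ae").replace("\u00D6", "Oe").replace("\u00DC", "Ue")
             .replace("\u00DF", "ss");
        // Same technique as above: decompose, then remove the combining marks.
        String decomposed = Normalizer.normalize(t, Normalizer.Form.NFKD);
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }
}
```

Letters from other languages still fall through to the generic stripping step, so ç becomes c and é becomes e.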

Regex to find all variants of a certain character inside a text

I am trying to find Unicode variants of a user-entered string in a text in order to highlight them. E.g. if the user enters "Beyonce", I'd like to highlight all variants like "Beyoncé", "Beyônce" or "Bèyönce" in the text. Currently the only idea I have is to build a regex by replacing each character of the input string with a character class, like this:
"Beyonce" => "B[eêéè]y[oóòôö]nc[eéèê]"
But this seems to be a very tedious and error prone way of doing it. What I am basically looking for is a regex character group that matches all variants of a given input character, something like \p{M} but with the possibility to specify the base letter. Is there something available like this in java regex? And if not, how could the regex creation process be improved? I don't think that specifying all variants by hand is going to work in the long run.
There are several ways an accented character can be represented. There's a good example in the Javadoc of java.text.Normalizer:
For example, take the character A-acute. In Unicode, this can be encoded
as a single character (the "composed" form):
U+00C1 LATIN CAPITAL LETTER A WITH ACUTE
or as two separate characters (the "decomposed" form):
U+0041 LATIN CAPITAL LETTER A
U+0301 COMBINING ACUTE ACCENT
The second form would make it relatively easy to access the non-accentuated character, and fortunately Normalizer can help you here:
Normalizer.normalize(text, Form.NFD); // NFD = "Canonical decomposition"
You can then use a regex to ignore (or remove) any non-ASCII characters from the string, based on:
[^\p{ASCII}]
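A related way to find the variants for highlighting, rather than removing the accents, is to decompose the text and let each base letter of the input be followed by any number of combining marks. This is a sketch of that idea, not something built into Java regex; the class and method names are hypothetical.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class VariantFinder {

    // Builds a pattern where every base letter may be followed by combining marks.
    static Pattern variantPattern(String input) {
        String base = Normalizer.normalize(input, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", "");
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < base.length(); ) {
            int cp = base.codePointAt(i);
            regex.append(Pattern.quote(new String(Character.toChars(cp))))
                 .append("\\p{M}*");
            i += Character.charCount(cp);
        }
        return Pattern.compile(regex.toString(),
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
    }

    // Returns every variant of the input found in the text, recomposed to NFC.
    static List<String> findVariants(String input, String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        Matcher m = variantPattern(input).matcher(decomposed);
        List<String> hits = new ArrayList<>();
        while (m.find()) {
            hits.add(Normalizer.normalize(m.group(), Normalizer.Form.NFC));
        }
        return hits;
    }
}
```

One caveat: the match offsets refer to the decomposed text, so for highlighting you would either display the decomposed string or map the offsets back to the original.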
