I am wondering if there is any "standard" translation of special letters like ä, ö, ü, ç, Ñ, Ã, æ, etc. into English. A German would certainly transcribe ä as ae, but an American would probably just use a. Is there any standard? And if so, is there any Java library that covers the characters in the Unicode blocks "Basic Latin" (U+0000-U+007F) and "Latin-1 Supplement" (U+0080-U+00FF)?
Thx
I think the solution to your problem is transliteration.
Check the links below:
ICU Home page
Transliterator class
I had an idea, but it doesn't work. It's just complete rubbish. Don't try this.
I'm not sure if there is a standard as such.
One thing you could do would be to normalise the string into the NFKD form, which breaks characters down to their most basic elements, such as base letters and combining marks, and then keep only the ASCII characters. This would take ä to a, and all other single characters with diacritics to their base characters. Letters that have no decomposition, such as æ, ø or ß, are not broken up by NFKD, so the ASCII filter would drop them unless they are mapped by hand (for example æ to ae).
This won't make Germans happy, though.
With Java's java.text.Normalizer you can split ä into a + a combining diacritical mark, and then simply remove all the diacritical marks:
String normalizedString = Normalizer.normalize(s, Normalizer.Form.NFKD);
String ascii = normalizedString.replaceAll("\\p{InCombiningDiacriticalMarks}", "");
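The two lines above can be wrapped into a small helper. This is only a sketch: the class name AsciiFold, the method fold and the germanStyle switch are made up for illustration, and the German replacement table (ä to ae and so on) is an assumed convention, not a standard.

```java
import java.text.Normalizer;

class AsciiFold {
    // Hypothetical helper: strips diacritics; optionally expands German umlauts first.
    static String fold(String s, boolean germanStyle) {
        // Compose first so that "a + combining diaeresis" is handled like the single letter.
        String text = Normalizer.normalize(s, Normalizer.Form.NFC);
        if (germanStyle) {
            // Assumed German-style transcription rules (escapes: ä ö ü Ä Ö Ü ß).
            text = text.replace("\u00E4", "ae").replace("\u00F6", "oe").replace("\u00FC", "ue")
                       .replace("\u00C4", "Ae").replace("\u00D6", "Oe").replace("\u00DC", "Ue")
                       .replace("\u00DF", "ss");
        }
        // Decompose, then drop the combining marks that were split off.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFKD);
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        System.out.println(fold("M\u00FCller", false));
        System.out.println(fold("M\u00FCller", true));
    }
}
```

Letters without a decomposition (such as æ) pass through this helper unchanged, so they would still need their own mapping.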
Related
I have a requirement where I have a dropdown of strings that contain accented characters as well as English characters. When I search with English characters, the search should also match the equivalent accented characters, i.e. English characters should be treated as equal to their accented counterparts.
For example, when I provide DAG as input, I should get the output below:
ÐÁĞ
DAG
I had thought of a solution using a HashMap that maps each English letter to its accented characters, but wondered whether there is a library that already does this job. I tried several approaches but didn't find any. Please advise.
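One possible approach, sketched here under assumptions, is to fold both the dropdown entries and the query to unaccented letters and compare the folded forms. The class AccentInsensitiveSearch, its methods and the small EXTRA table are hypothetical; the table is needed because some letters, such as Ð, have no Unicode decomposition, and it is deliberately incomplete.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class AccentInsensitiveSearch {
    // Illustrative, incomplete table for letters that NFD does not decompose.
    private static final Map<Character, String> EXTRA = Map.of(
        '\u00D0', "D", '\u00F0', "d",   // Eth
        '\u00D8', "O", '\u00F8', "o",   // O with stroke
        '\u0141', "L", '\u0142', "l");  // L with stroke

    // Removes combining marks after decomposition and applies the extra table.
    static String fold(String s) {
        String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < nfd.length(); i++) {
            char c = nfd.charAt(i);
            if (Character.getType(c) == Character.NON_SPACING_MARK) {
                continue; // skip the accent itself
            }
            out.append(EXTRA.getOrDefault(c, String.valueOf(c)));
        }
        return out.toString();
    }

    // Returns the options whose folded, lower-cased form contains the folded query.
    static List<String> search(List<String> options, String query) {
        String key = fold(query).toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (String option : options) {
            if (fold(option).toLowerCase(Locale.ROOT).contains(key)) {
                hits.add(option);
            }
        }
        return hits;
    }
}
```

With this sketch the original entries are returned untouched; only the comparison uses the folded text.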
I need to match any special character in a string, for example & % € (), etc. The string may also contain Unicode letters such as ä ö å.
But I also want to match a dot ".". For example, for the string "8x8 Inc." it should return true, because it contains a dot.
I have tried a few expressions so far, but none of them worked for me. Please let me know how it can be done. Thanks in advance!
You can do that one:
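[^a-zA-Z\d\s] -> basically anything outside the set of ASCII letters a-z and A-Z, digits, and whitespace. It will capture all other characters, including special letters such as ä, dots, commas, braces, etc.
A simpler version would be [^\w\s], which matches any non-word, non-space character (note that it skips the underscore). Whether it matches ä ö å depends on the engine: where \w is Unicode-aware those letters count as word characters and are not matched, while Java's default \w is ASCII-only, so there they are matched.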
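In a Java regex, .* matches any sequence of characters (excluding line terminators unless DOTALL is set), so it is not a test for a dot.
If you want to match only a literal dot (.), escape it as \. so that it matches only the dot in the string.
In a Java program, the backslash itself has to be escaped in the string literal: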
String regex="\\.";
Take a look at Unicode character classes. For your example, I think something like "(\\p{IsAlphabetic}|\\d)+" should work
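As a rough sketch of the Unicode-class idea, the check can be inverted: treat everything that is not a letter, combining mark, digit or whitespace as "special". The class SpecialCharCheck and the method hasSpecial are illustrative names, and the choice of which categories count as "not special" is an assumption about the requirement.

```java
import java.util.regex.Pattern;

class SpecialCharCheck {
    // Anything that is not a Unicode letter, mark, number or whitespace counts as special.
    private static final Pattern SPECIAL = Pattern.compile("[^\\p{L}\\p{M}\\p{N}\\s]");

    static boolean hasSpecial(String s) {
        return SPECIAL.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(hasSpecial("8x8 Inc."));
        System.out.println(hasSpecial("8x8 Inc"));
    }
}
```

Here the dot in "8x8 Inc." counts as special, while letters such as ä ö å do not.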
I am trying to use a regex such as [ăâîșțĂÂÎȘȚ] to match Romanian alphabet diacritics (ISO 8859-16/Windows-1250). The problem is that the regex also matches a, i, s, t, A, I, S, T (the plain Latin letters corresponding to the diacritics above), and I don't want this.
I didn't try to compare the strings character by character because of the performance cost.
Is there any way I can make the regex match exactly these characters?
If your regex exists as literal rendered text, it has already been combined
and should exist as a different code point.
000074 t LATIN SMALL LETTER T
+
000326 ̦ COMBINING COMMA BELOW
=
00021B ț LATIN SMALL LETTER T WITH COMMA BELOW
Just in case, you should use a hex code point to represent them, i.e. \u021B
Is it possible the Java engine could be stripping the combining character off of the regex?
Where x21B becomes x74? Might be that.
Meanwhile if you expect the letters in the source are not rendered, you could
use a regex like \p{Script=Latin}\p{Block=Combining_Diacritical_Marks}
to get those.
updated info :
While searching around for a de facto solution, I came across this Java info
from http://www.regular-expressions.info/unicode.html.
In Java, the regex token \uFFFF only matches the specified
code point, even when you turned on canonical equivalence.
However, the same syntax \uFFFF is also used to insert
Unicode characters into literal strings in the Java source
code. Pattern.compile("\u00E0") will match both the
single-code-point and double-code-point encodings of à,
while Pattern.compile("\\u00E0") matches only the
single-code-point version. Remember that when writing a
regex as a Java string literal, backslashes must be escaped.
The former Java code compiles the regex à, while the latter
compiles \u00E0. Depending on what you're doing, the
difference may be significant.
So, when such a literal (one that has two possible encodings) is placed inside a character class, it looks like Pattern.compile("[à]")
will actually match
000061 a LATIN SMALL LETTER A
or
000300 ̀ COMBINING GRAVE ACCENT
or
0000E0 à LATIN SMALL LETTER A WITH GRAVE
This smacks of the same problem when putting surrogate pairs inside classes.
There is a solution.
Avoid entering those literals inside of a class.
Instead, put them as a series of alternations
(?:à|_|_|_)
Doing this forces it to match either
000061 a LATIN SMALL LETTER A
000300 ̀ COMBINING GRAVE ACCENT
or
0000E0 à LATIN SMALL LETTER A WITH GRAVE
It won't match a independent of the grave like you see now.
Note: if you just use "[\\u00E0]", you'd miss the a + grave sequence, which is also valid.
I believe this is happening because those characters are being treated as two Unicode code points. I would recommend trying to specifically match the code points using syntax like \uFFFF where FFFF is the code point. The exact syntax will depend on the regex implementation you are using.
Keep in mind that Unicode characters can be encoded as single code points or as multiple, so you'll want to account for that. Example: à encoded as U+0061 U+0300 and also U+00E0.
I hope this helps!
As already mentioned, in Unicode one has two alternatives.
'\u0061' 'a' LATIN SMALL LETTER A
'\u0300' ̀ COMBINING GRAVE ACCENT
or
'\u00E0' 'à' LATIN SMALL LETTER A WITH GRAVE
There is a Normalizer that can "normalize" to either form (and deal with ligatures):
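import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.regex.Pattern;
String regex = "(?u)[ăâîșțĂÂÎȘȚ]"; // (?u) is the embedded form of Pattern.UNICODE_CASE
regex = Normalizer.normalize(regex, Form.NFC); // Composed form
Pattern pattern = Pattern.compile(regex);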
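Note that "(?u)" (Pattern.UNICODE_CASE) only makes case-insensitive matching Unicode-aware; passing Pattern.CANON_EQ to Pattern.compile, which enables canonical equivalence, might already solve the problem. But using the composed Unicode variant, without a separate Latin letter ('a'), will certainly do.
The normalizer should also be applied to the string being searched.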
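A minimal sketch of that advice follows: both the character class and the searched text are normalized to NFC before matching. The class RomanianDiacritics and the method contains are illustrative names, and the letters are written as Unicode escapes only to keep the example independent of the source file encoding.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

class RomanianDiacritics {
    // Escapes for: a-breve, a-circumflex, i-circumflex, s-comma, t-comma (lower and upper case).
    private static final Pattern DIACRITICS = Pattern.compile(Normalizer.normalize(
        "[\u0103\u00E2\u00EE\u0219\u021B\u0102\u00C2\u00CE\u0218\u021A]",
        Normalizer.Form.NFC));

    // Normalizes the input to the composed form, then looks for any of the letters.
    static boolean contains(String text) {
        String composed = Normalizer.normalize(text, Normalizer.Form.NFC);
        return DIACRITICS.matcher(composed).find();
    }
}
```

Because the input is composed first, a decomposed t followed by COMBINING COMMA BELOW is found as well, while plain a, i, s, t are not.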
I want to validate a first name sent by the user with a regex. I found multiple expressions for first names, but I also want to add German characters like äöüß and French ones à À è È é É ù Ù ì Ì ò Ò ñ Ñ to it. I tried the regex evaluator suggested by SO here, but that didn't help. Whenever I added an extra pair of square brackets, it would tell me "Your regular expression does not match the subject string." What am I doing wrong?
Current Regex pattern :
^([A-Z][a-z]*((\s)))+[A-Z][a-z][äöüß]*$
Thank you.
It is not a good idea to restrict people's names too much, so I suggest a rather generic regex:
s.matches("(?U)[\\p{L}\\p{M}\\s'-]+")
This regex will match a string only consisting of 1 or more Unicode letters, diacritics, whitespaces, apostrophes or hyphens.
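For illustration, the generic pattern above can be wrapped in a small validator. The class NameValidator and the method isValid are hypothetical names; the pattern itself is the one quoted in this answer.

```java
import java.util.regex.Pattern;

class NameValidator {
    // One or more Unicode letters, combining marks, whitespace, apostrophes or hyphens.
    private static final Pattern NAME = Pattern.compile("(?U)[\\p{L}\\p{M}\\s'-]+");

    static boolean isValid(String name) {
        return name != null && NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("J\u00FCrgen M\u00FCller"));
        System.out.println(isValid("J0hn"));
    }
}
```

Note that a string made up only of whitespace also passes this generic check.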
If you need more restrictive checks, like a whitespace may only appear inside the string and only if not consecutive, use grouping:
"(?U)[\\p{L}\\p{M}'-]+(?:\\s[\\p{L}\\p{M}'-]+)*"
I agree with previous answer from Wiktor.
But to answer why your regex did not work:
Your pattern expects a capital letter followed by 0 or more lowercase letters and a space (this group can repeat), and then a final word. That final word is exactly 1 capital letter, 1 lowercase letter, and then 0 or more umlauts or ß. I do not know any name that would fit. All those special characters should be in the same [] as the other letters.
Also hyphenated names are not allowed.
So your regex could look more like this:
^([A-ZÜÄÖ][a-züäöß]*(\s|-))*[A-ZÜÄÖ][a-züäöß]*$
I am trying to find Unicode variants of a user-entered string in a text in order to highlight them. E.g. if the user enters "Beyonce", I'd like to highlight all variants like "Beyoncé" or "Beyônce" or "Bèyönce" in the text. Currently the only idea I have is creating a regex by replacing the characters of the input string with character groups like this:
"Beyonce" => "B[eêéè]y[oóòôö]nc[eéèê]"
But this seems to be a very tedious and error prone way of doing it. What I am basically looking for is a regex character group that matches all variants of a given input character, something like \p{M} but with the possibility to specify the base letter. Is there something available like this in java regex? And if not, how could the regex creation process be improved? I don't think that specifying all variants by hand is going to work in the long run.
There are several ways an accented character can be represented. There's a good example in the javadoc of java.text.Normalizer:
For example, take the character A-acute. In Unicode, this can be encoded
as a single character (the "composed" form):
U+00C1 LATIN CAPITAL LETTER A WITH ACUTE
or as two separate characters (the "decomposed" form):
U+0041 LATIN CAPITAL LETTER A
U+0301 COMBINING ACUTE ACCENT
The second form makes it relatively easy to get at the unaccented character, and fortunately Normalizer can help you here:
Normalizer.normalize(text, Form.NFD); // NFD = "Canonical decomposition"
You can then use a regex to ignore (or remove) any non-ASCII characters from the string, based on:
[^\p{ASCII}]
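Building on the decomposed form, one way to find the variants is to allow any number of combining marks after each base letter of the input. This is a sketch under assumptions, not a library feature: the class VariantFinder and its methods are made-up names, and it uses \p{M} to skip the marks instead of removing non-ASCII characters.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class VariantFinder {
    // Builds a pattern in which every base letter may be followed by combining marks.
    static Pattern variantPattern(String input) {
        String base = Normalizer.normalize(input, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
        StringBuilder regex = new StringBuilder();
        base.codePoints().forEach(cp ->
            regex.append(Pattern.quote(new String(Character.toChars(cp)))).append("\\p{M}*"));
        return Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
    }

    // Searches the decomposed text and returns each hit re-composed to NFC.
    static List<String> findVariants(String text, String input) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        Matcher m = variantPattern(input).matcher(decomposed);
        List<String> hits = new ArrayList<>();
        while (m.find()) {
            hits.add(Normalizer.normalize(m.group(), Normalizer.Form.NFC));
        }
        return hits;
    }
}
```

The match offsets refer to the decomposed text, so for highlighting in the original string they would have to be mapped back, or the text would have to be stored in NFD.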