remove unicode characters in given example and print relevant data [duplicate] - java
This question already has answers here:
Remove diacritical marks (ń ǹ ň ñ ṅ ņ ṇ ṋ ṉ ̈ ɲ ƞ ᶇ ɳ ȵ) from Unicode chars
(12 answers)
Closed 3 days ago.
Is there a better way for getting rid of accents and making those letters regular apart from using String.replaceAll() method and replacing letters one by one?
Example:
Input: orčpžsíáýd
Output: orcpzsiayd
It doesn't need to include all letters with accents like the Russian alphabet or the Chinese one.
Start with java.text.Normalizer.
string = Normalizer.normalize(string, Normalizer.Form.NFD);
// or Normalizer.Form.NFKD for a more "compatible" deconstruction
This will separate all of the accent marks from most characters. Then, you just need to check each character and throw out the ones that aren't plain letters (here, anything outside ASCII).
string = string.replaceAll("[^\\p{ASCII}]", "");
If your text is in Unicode, you should use this instead:
string = string.replaceAll("\\p{M}", "");
For Unicode, \\P{M} matches the base glyph and \\p{M} (lowercase) matches each accent.
Thanks to GarretWilson for the pointer and regular-expressions.info for the great Unicode guide.
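Putting the two steps together, here is a minimal, self-contained sketch using the question's sample input; it assumes nothing beyond the NFD-plus-\p{M} approach described above:
import java.text.Normalizer;
public class StripAccentsDemo {
    public static void main(String[] args) {
        String input = "orčpžsíáýd";
        // Decompose: accented letters become a base letter followed by combining mark(s)
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // Drop every combining mark (the \p{M} category)
        System.out.println(decomposed.replaceAll("\\p{M}", "")); // orcpzsiayd
    }
}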
It is important to note that Normalizer by itself is insufficient to remove diacritics. For example, the following will not replace the accented é with the unaccented e:
import static java.text.Normalizer.normalize;
import static java.text.Normalizer.Form.*;
public class T {
public static void main( final String[] args ) {
final var text = "Brévis";
System.out.println(
normalize( text, NFD ) + " " +
normalize( text, NFC ) + " " +
normalize( text, NFKD ) + " " +
normalize( text, NFKC )
);
}
}
As of 2011 you can use Apache Commons StringUtils.stripAccents(input) (since 3.0):
String input = StringUtils.stripAccents("Tĥïŝ ĩš â fůňķŷ Šťŕĭńġ");
System.out.println(input);
// Prints "This is a funky String"
Note:
The accepted answer (Erick Robertson's) doesn't work for Ø or Ł. Apache Commons 3.5 doesn't work for Ø either, but it does work for Ł. After reading the Wikipedia article for Ø, I'm not sure it should be replaced with "O": it's a separate letter in Norwegian and Danish, alphabetized after "z". It's a good example of the limitations of the "strip accents" approach.
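A small check of that note; this is only a sketch, it assumes commons-lang3 is on the classpath, and the exact results depend on the Commons Lang version (per the note, 3.5 handles Ł but not Ø):
import org.apache.commons.lang3.StringUtils;
public class StripAccentsEdgeCases {
    public static void main(String[] args) {
        // Ł was reportedly handled from Commons Lang 3.5 on; Ø is typically left untouched
        System.out.println(StringUtils.stripAccents("Łódź"));    // expected "Lodz" on 3.5+
        System.out.println(StringUtils.stripAccents("Øresund")); // likely stays "Øresund"
    }
}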
The solution by @virgo47 is very fast, but approximate. The accepted answer uses Normalizer and a regular expression. I wondered what part of the time was taken by Normalizer versus the regular expression, since removing all the non-ASCII characters can be done without a regex:
import java.text.Normalizer;
public class Strip {
public static String flattenToAscii(String string) {
StringBuilder sb = new StringBuilder(string.length());
string = Normalizer.normalize(string, Normalizer.Form.NFD);
for (char c : string.toCharArray()) {
if (c <= '\u007F') sb.append(c);
}
return sb.toString();
}
}
Small additional speed-ups can be obtained by writing into a char[] and not calling toCharArray(), although I'm not sure that the decrease in code clarity merits it:
public static String flattenToAscii(String string) {
char[] out = new char[string.length()];
string = Normalizer.normalize(string, Normalizer.Form.NFD);
int j = 0;
for (int i = 0, n = string.length(); i < n; ++i) {
char c = string.charAt(i);
if (c <= '\u007F') out[j++] = c;
}
return new String(out, 0, j); // keep only the j characters actually written, avoiding trailing '\u0000's
}
This variation has the advantage of the correctness of the one using Normalizer and some of the speed of the one using a table. On my machine, this one is about 4x faster than the accepted answer, and 6.6x to 7x slower than @virgo47's (the accepted answer is about 26x slower than @virgo47's on my machine).
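For reference, a quick usage sketch of the flattenToAscii variant above (assuming the method is in scope; the expected outputs follow from the examples already given in this thread):
public static void main(String[] args) {
    System.out.println(flattenToAscii("Tĥïŝ ĩš â fůňķŷ Šťŕĭńġ")); // This is a funky String
    System.out.println(flattenToAscii("orčpžsíáýd"));             // orcpzsiayd
}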
EDIT: If you're not stuck with Java <6, speed is not critical, and/or the translation table is too limiting, use the answer by David. The point is to use Normalizer (introduced in Java 6) instead of the translation table inside the loop.
While this is not a "perfect" solution, it works well when you know the input range (in our case Latin-1 and Latin-2), it worked before Java 6 (not a real issue though), and it is much faster than the most suggested version (which may or may not be an issue):
/**
* Mirror of the unicode table from 00c0 to 017f without diacritics.
*/
private static final String tab00c0 = "AAAAAAACEEEEIIII" +
"DNOOOOO\u00d7\u00d8UUUUYI\u00df" +
"aaaaaaaceeeeiiii" +
"\u00f0nooooo\u00f7\u00f8uuuuy\u00fey" +
"AaAaAaCcCcCcCcDd" +
"DdEeEeEeEeEeGgGg" +
"GgGgHhHhIiIiIiIi" +
"IiJjJjKkkLlLlLlL" +
"lLlNnNnNnnNnOoOo" +
"OoOoRrRrRrSsSsSs" +
"SsTtTtTtUuUuUuUu" +
"UuUuWwYyYZzZzZzF";
/**
* Returns string without diacritics - 7 bit approximation.
*
* @param source string to convert
* @return corresponding string without diacritics
*/
public static String removeDiacritic(String source) {
char[] vysl = new char[source.length()];
char one;
for (int i = 0; i < source.length(); i++) {
one = source.charAt(i);
if (one >= '\u00c0' && one <= '\u017f') {
one = tab00c0.charAt((int) one - '\u00c0');
}
vysl[i] = one;
}
return new String(vysl);
}
Tests on my HW with a 32-bit JDK show that this performs the conversion from àèéľšťč89FDČ to aeelstc89FDC 1 million times in ~100 ms, while the Normalizer way takes 3.7 s (37x slower). If your needs are around performance and you know the input range, this may be for you.
Enjoy :-)
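A quick usage sketch of removeDiacritic above, using the same test string as the timing run:
public static void main(String[] args) {
    System.out.println(removeDiacritic("àèéľšťč89FDČ")); // aeelstc89FDC
}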
System.out.println(Normalizer.normalize("àèé", Normalizer.Form.NFD).replaceAll("\\p{InCombiningDiacriticalMarks}+", ""));
worked for me. The output of the snippet above gives "aee" which is what I wanted, but
System.out.println(Normalizer.normalize("àèé", Normalizer.Form.NFD).replaceAll("[^\\p{ASCII}]", ""));
didn't do any substitution.
Depending on the language, those might not be considered accents (which change the sound of the letter), but diacritical marks
https://en.wikipedia.org/wiki/Diacritic#Languages_with_letters_containing_diacritics
"Bosnian and Croatian have the symbols č, ć, đ, š and ž, which are considered separate letters and are listed as such in dictionaries and other contexts in which words are listed according to alphabetical order."
Removing them might be inherently changing the meaning of the word, or changing the letters into completely different ones.
I have faced the same issue with a String equality check: one of the compared strings contains characters with ASCII codes 128-255, i.e. a non-breaking space [Hex A0] instead of a regular space [Hex 20].
To show a non-breaking space over HTML I had used the following spacing entities. Their characters and bytes are: &emsp is a very wide space [ ] {-30, -128, -125}, &ensp is a somewhat wide space [ ] {-30, -128, -126}, &thinsp is a narrow space [ ] {32}, and the non-HTML space {}.
String s1 = "My Sample Space Data", s2 = "My Sample Space Data";
System.out.format("S1: %s\n", java.util.Arrays.toString(s1.getBytes()));
System.out.format("S2: %s\n", java.util.Arrays.toString(s2.getBytes()));
Output in Bytes:
S1: [77, 121, 32, 83, 97, 109, 112, 108, 101, 32, 83, 112, 97, 99, 101, 32, 68, 97, 116, 97]
S2: [77, 121, -30, -128, -125, 83, 97, 109, 112, 108, 101, -30, -128, -125, 83, 112, 97, 99, 101, -30, -128, -125, 68, 97, 116, 97]
Use the code below for the different spaces and their byte codes (see the Wikipedia page List_of_Unicode_characters):
String spacing_entities = "very wide space,narrow space,regular space,invisible separator";
System.out.println("Space String :"+ spacing_entities);
byte[] byteArray =
// spacing_entities.getBytes( Charset.forName("UTF-8") );
// Charset.forName("UTF-8").encode( s2 ).array();
{-30, -128, -125, 44, -30, -128, -126, 44, 32, 44, -62, -96};
System.out.println("Bytes:"+ Arrays.toString( byteArray ) );
try {
System.out.format("Bytes to String[%S] \n ", new String(byteArray, "UTF-8"));
} catch (UnsupportedEncodingException e) {
e.printStackTrace();
}
➩ ASCII transliteration of a Unicode string for Java: unidecode.
String initials = Unidecode.decode( s2 );
➩ using Guava: Google Core Libraries for Java.
String replaceFrom = CharMatcher.WHITESPACE.replaceFrom( s2, " " );
For URL-encoding the space, use the Guava library.
String encodedString = UrlEscapers.urlFragmentEscaper().escape(inputString);
➩ To overcome this problem, use String.replaceAll() with a suitable regular expression.
// \p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
s2 = s2.replaceAll("\\p{Zs}", " ");
s2 = s2.replaceAll("[^\\p{ASCII}]", " ");
s2 = s2.replaceAll(" ", " ");
➩ Using java.text.Normalizer.Form.
This enum provides constants of the four Unicode normalization forms that are described in Unicode Standard Annex #15 — Unicode Normalization Forms and two methods to access them.
s2 = Normalizer.normalize(s2, Normalizer.Form.NFKC);
Testing the string and the outputs of the different approaches: ➩ Unidecode, Normalizer, StringUtils.
String strUni = "Tĥïŝ ĩš â fůňķŷ Šťŕĭńġ Æ,Ø,Ð,ß";
// This is a funky String AE,O,D,ss
String initials = Unidecode.decode( strUni );
// The following produces this o/p: Tĥïŝ ĩš â fůňķŷ Šťŕĭńġ Æ,Ø,Ð,ß
String temp = Normalizer.normalize(strUni, Normalizer.Form.NFD);
Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");
temp = pattern.matcher(temp).replaceAll("");
String input = org.apache.commons.lang3.StringUtils.stripAccents( strUni );
Using Unidecode is the best choice; my final code is shown below.
public static void main(String[] args) {
String s1 = "My Sample Space Data", s2 = "My Sample Space Data";
String initials = Unidecode.decode( s2 );
if( s1.equals(s2)) { //[ , ] %A0 - %2C - %20 « http://www.ascii-code.com/
System.out.println("Equal Unicode Strings");
} else if( s1.equals( initials ) ) {
System.out.println("Equal Non Unicode Strings");
} else {
System.out.println("Not Equal");
}
}
I suggest Junidecode. It will handle not only 'Ł' and 'Ø', but it also works well for transcribing from other alphabets, such as Chinese, into the Latin alphabet.
One of the best ways to do this with a regex and Normalizer, if you have no library, is:
public String flattenToAscii(String s) {
if(s == null || s.trim().length() == 0)
return "";
return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("[\u0300-\u036F]", "");
}
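For reference, a quick usage sketch reusing the question's sample input (assuming the method above is in scope, e.g. made static):
System.out.println(flattenToAscii("orčpžsíáýd")); // orcpzsiayd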
This is more efficient than replaceAll("[^\\p{ASCII}]", "") when all you need is to remove the diacritics (just like in your example).
Otherwise, you have to use the \\p{ASCII} pattern.
Regards.
This solution is already available in StringUtils.stripAccents() from the Maven repository and works for Ł, as mentioned by @DavidS.
But I needed it to work for both Ø and Ł, so I modified it as below. It may be helpful for others too.
Update
This is a modified version of StringUtils.stripAccents(String obj) that keeps the old functionality while also handling both the Ø and Ł characters.
public static String stripAccents(final String input) {
if (input == null) {
return null;
}
final StringBuilder decomposed = new StringBuilder(Normalizer.normalize(input, Normalizer.Form.NFD));
for (int i = 0; i < decomposed.length(); i++) {
if (decomposed.charAt(i) == '\u0141') {
decomposed.setCharAt(i, 'L');
} else if (decomposed.charAt(i) == '\u0142') {
decomposed.setCharAt(i, 'l');
}else if (decomposed.charAt(i) == '\u00D8') {
decomposed.setCharAt(i, 'O');
}else if (decomposed.charAt(i) == '\u00F8') {
decomposed.setCharAt(i, 'o');
}
}
// Note that this doesn't correctly remove ligatures...
return Pattern.compile("\\p{InCombiningDiacriticalMarks}+").matcher(decomposed).replaceAll("");
}
Input string: Ł Tĥïŝ ĩš â fůňķŷ Šťŕĭńġ Ø ø
Output string: L This is a funky String O o
@David Conrad's solution is the fastest I tried using the Normalizer, but it does have a bug. It basically strips characters which are not accents; for example, Chinese characters and other letters like æ are all stripped.
The characters that we want to strip are non-spacing marks, characters which don't take up extra width in the final string. These zero-width characters basically end up combined into some other character. If you can see them isolated as a character, for example like this `, my guess is that it's combined with the space character.
public static String flattenToAscii(String string) {
char[] out = new char[string.length()];
String norm = Normalizer.normalize(string, Normalizer.Form.NFD);
int j = 0;
for (int i = 0, n = norm.length(); i < n; ++i) {
char c = norm.charAt(i);
int type = Character.getType(c);
//Log.d(TAG,""+c);
//by Ricardo, modified the character check for accents, ref: http://stackoverflow.com/a/5697575/689223
if (type != Character.NON_SPACING_MARK){
out[j] = c;
j++;
}
}
//Log.d(TAG,"normalized string:"+norm+"/"+new String(out));
return new String(out, 0, j); // keep only the j characters actually written, avoiding trailing '\u0000's
}
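A quick check of the behaviour described above; the sample strings here are just illustrative: letters that are not combining marks, such as æ or Chinese characters, pass through unchanged while ordinary accents are still removed:
public static void main(String[] args) {
    System.out.println(flattenToAscii("encyclopædia café 中文"));
    // expected: "encyclopædia cafe 中文" - æ and the CJK characters survive, the acute accent is gone
}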
I think the best solution is converting each char to HEX and replacing it with another HEX, because there are two ways of typing Unicode:
Composite Unicode
Precomposed Unicode
For example, "Ồ" written as Composite Unicode is different from "Ồ" written as Precomposed Unicode. You can copy my sample chars and convert them to see the difference.
In Composite Unicode, "Ồ" is combined from 2 chars: Ô (U+00D4) and ̀ (U+0300).
In Precomposed Unicode, "Ồ" is a single char (U+1ED2).
I have developed this feature for some banks to convert the info before sending it to the core banking system (which usually doesn't support Unicode), and faced this issue when end users use multiple Unicode typings to input the data. So I think converting to HEX and replacing it is the most reliable way.
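For comparison, here is a small sketch of the same idea using java.text.Normalizer instead of a hand-written HEX mapping table; it is not the author's code, and it only unifies the two typings (composite vs. precomposed) rather than stripping the marks, but it shows why the two spellings of "Ồ" differ and how to make them compare equal:
import java.text.Normalizer;
public class ComposedVsPrecomposed {
    public static void main(String[] args) {
        String composite = "\u00D4\u0300";  // Ô followed by a combining grave accent (two code points)
        String precomposed = "\u1ED2";      // Ồ as a single code point
        System.out.println(composite.equals(precomposed)); // false - different code point sequences
        // Normalizing both to the same form (NFC here) makes them identical
        String a = Normalizer.normalize(composite, Normalizer.Form.NFC);
        String b = Normalizer.normalize(precomposed, Normalizer.Form.NFC);
        System.out.println(a.equals(b)); // true - both end up as U+1ED2
    }
}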
A fast and safer way
public static String removeDiacritics(String str) {
if (str == null)
return null;
if (str.isEmpty())
return "";
int len = str.length();
StringBuilder sb = new StringBuilder(len);
//iterate string codepoints
for (int i = 0; i < len; ) {
int codePoint = str.codePointAt(i);
int charCount = Character.charCount(codePoint);
if (charCount > 1) {
for (int j = 0; j < charCount; j++)
sb.append(str.charAt(i + j));
i += charCount;
continue;
}
else if (codePoint <= 127) {
sb.append((char)codePoint);
i++;
continue;
}
sb.append(
java.text.Normalizer
.normalize(
Character.toString((char)codePoint),
java.text.Normalizer.Form.NFD)
.charAt(0));
i++;
}
return sb.toString();
}
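A quick usage sketch of removeDiacritics above; the emoji input is just an illustrative assumption to show that surrogate pairs are copied through untouched:
public static void main(String[] args) {
    System.out.println(removeDiacritics("orčpžsíáýd")); // orcpzsiayd
    System.out.println(removeDiacritics("ok 😀"));       // ok 😀 - the surrogate pair is kept as-is
}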
Faced the same issue; here's a solution using a Kotlin extension:
val String.stripAccents: String
get() = Regex("\\p{InCombiningDiacriticalMarks}+")
.replace(
Normalizer.normalize(this, Normalizer.Form.NFD),
""
)
usage
val textWithoutAccents = "some accented string".stripAccents
In case anyone is struggling to do this in Kotlin, this code works like a charm. To avoid inconsistencies I also use .toUpperCase() and .trim(). Then I call this function:
fun stripAccents(s: String):String{
if (s == null) {
return "";
}
val chars: CharArray = s.toCharArray()
var sb = StringBuilder(s)
var cont: Int = 0
while (chars.size > cont) {
var c: kotlin.Char
c = chars[cont]
var c2:String = c.toString()
//these are my needs; in case you need to convert other accents, just add new entries here
c2 = c2.replace("Ã", "A")
c2 = c2.replace("Õ", "O")
c2 = c2.replace("Ç", "C")
c2 = c2.replace("Á", "A")
c2 = c2.replace("Ó", "O")
c2 = c2.replace("Ê", "E")
c2 = c2.replace("É", "E")
c2 = c2.replace("Ú", "U")
c = c2.single()
sb.setCharAt(cont, c)
cont++
}
return sb.toString()
}
To use this function, call it like this:
var str: String
str = editText.text.toString() //get the text from EditText
str = str.toUpperCase().trim()
str = stripAccents(str) //call the function
Related
masking of email address in java
I am trying to mask an email address with "*" but I am bad at regex. Input: nileshxyzae@gmail.com Output: nil********@gmail.com My code is
String maskedEmail = email.replaceAll("(?<=.{3}).(?=[^@]*?.@)", "*");
but it's giving me the output nil*******e@gmail.com. I am not getting what's going wrong here. Why is the last character not converted? Also, can someone explain the meaning of all this regex?
Your look-ahead (?=[^@]*?.@) requires at least 1 character to be there in front of @ (see the dot before @). If you remove it, you will get all the expected symbols replaced: (?<=.{3}).(?=[^@]*?@) Here is the regex demo (replace with *). However, the regex is not a proper regex for the task. You need a regex that will match each character after the first 3 characters up to the first @: (^[^@]{3}|(?!^)\G)[^@] See another regex demo, replace with $1*. Here, [^@] matches any character that is not @, so we do not match addresses like abc@example.com. Only those emails will be masked that have 4+ characters in the username part. See IDEONE demo:
String s = "nileshkemse@gmail.com";
System.out.println(s.replaceAll("(^[^@]{3}|(?!^)\\G)[^@]", "$1*"));
If you're bad at regular expressions, don't use them :) I don't know if you've ever heard the quote: Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. (source) You might get a working regular expression here, but will you understand it today? tomorrow? in six months' time? And will your colleagues? An easy alternative is using a StringBuilder, and I'd argue that it's a lot more straightforward to understand what is going on here:
StringBuilder sb = new StringBuilder(email);
for (int i = 3; i < sb.length() && sb.charAt(i) != '@'; ++i) {
sb.setCharAt(i, '*');
}
email = sb.toString();
"Starting at the third character, replace the characters with a * until you reach the end of the string or @." (You don't even need to use StringBuilder: you could simply manipulate the elements of email.toCharArray(), then construct a new string at the end). Of course, this doesn't work correctly for email addresses where the local part is shorter than 3 characters - it would actually then mask the domain.
Your look-ahead is kind of complicated. Try this code:
public static void main(String... args) throws Exception {
String s = "nileshkemse@gmail.com";
s = s.replaceAll("(?<=.{3}).(?=.*@)", "*");
System.out.println(s);
}
O/P: nil********@gmail.com
I like this one because I just want to hide 4 characters; it also dynamically decreases the hidden chars to 2 if the email address is too short:
public static String maskEmailAddress(final String email) {
final String mask = "*****";
final int at = email.indexOf("@");
if (at > 2) {
final int maskLen = Math.min(Math.max(at / 2, 2), 4);
final int start = (at - maskLen) / 2;
return email.substring(0, start) + mask.substring(0, maskLen) + email.substring(start + maskLen);
}
return email;
}
Sample outputs:
my.email@gmail.com > my****il@gmail.com
info@mail.com > i**o@mail.com
//In Kotlin
val email = "nileshkemse@gmail.com"
val maskedEmail = email.replace(Regex("(?<=.{3}).(?=.*@)"), "*")
public static string GetMaskedEmail(string emailAddress) {
string _emailToMask = emailAddress;
try {
if (!string.IsNullOrEmpty(emailAddress)) {
var _splitEmail = emailAddress.Split(Char.Parse("@"));
var _user = _splitEmail[0];
var _domain = _splitEmail[1];
if (_user.Length > 3) {
var _maskedUser = _user.Substring(0, 3) + new String(Char.Parse("*"), _user.Length - 3);
_emailToMask = _maskedUser + "@" + _domain;
} else {
_emailToMask = new String(Char.Parse("*"), _user.Length) + "@" + _domain;
}
}
} catch (Exception) { }
return _emailToMask;
}
Replace all extended ASCII characters by their "original" character [duplicate]
java how to escape accented character in string
For example {"orderNumber":"S301020000","customerFirstName":"ke ČECHA ","customerLastName":"张科","orderStatus":"PENDING_FULFILLMENT_REQUEST","orderSubmittedDate":"May 13, 2015 1:41:28 PM"} how to get the accented character like "Č" in above json string and escape it in java Just give some context of this question, please check this question from me Ajax unescape response text from java servlet not working properly Sorry for my English :)
You should escape all characters that are greater than 0x7F. You can loop through the String's characters using the .charAt(index) method. For each character ch that needs escaping, replace it with:
String hexDigits = Integer.toHexString(ch).toUpperCase();
String escapedCh = "\\u" + "0000".substring(hexDigits.length()) + hexDigits;
I don't think you will need to unescape them in JavaScript because JavaScript supports escaped characters in string literals, so you should be able to work with the string the way it is returned by the server. I'm guessing you will be using JSON.parse() to convert the returned JSON string into a JavaScript object, like this. Here's a complete function:
public static String escapeJavaScript(String source) {
StringBuilder result = new StringBuilder();
for (int i = 0; i < source.length(); i++) {
char ch = source.charAt(i);
if (ch > 0x7F) {
String hexDigits = Integer.toHexString(ch).toUpperCase();
String escapedCh = "\\u" + "0000".substring(hexDigits.length()) + hexDigits;
result.append(escapedCh);
} else {
result.append(ch);
}
}
return result.toString();
}
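A quick usage sketch with the accented name from the question's JSON (assuming the corrected method above):
System.out.println(escapeJavaScript("ke ČECHA")); // prints: ke \u010CECHA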
hex-Encoding in Java goes wrong
me and several experienced Java developers worked on this for like 1 hour now and we cannot get it to work. Someone has any tips for me? Problem: We got a text in an Excel file which seems to be encoded completely inconsistent and stupid. Sometimes there are special chars, sometimes not, sometimes they are shown and interpreted differently. What i wanted to do now is to write a little Java-Script, that checks the given Text in the Excel File and converts all the different Char-sequences into what we want it to be. My Code: while (iterator.hasNext()) { Entity entity = (Entity) iterator.next(); Dataset dataset = produkt_store.getDataset(entity); FormData formdata = dataset.getFormData(); DomElement dom = (DomElement) formdata.get(lang, "cs_description_short").get(); String beschreibung = dom.toText(true); System.out.println("Before: " + beschreibung); String hexBeschreibung = StringToHex(beschreibung); String newHexBeschreibung = hexBeschreibung.replaceAll("75 3F", "FC"); newHexBeschreibung = newHexBeschreibung.replaceAll("75 A8", "FC"); //beschreibung2 = beschreibung2.replaceAll("75A8", "FC"); System.out.println("After: " + HexToString(newHexBeschreibung)); System.out.println(hexBeschreibung.equals(newHexBeschreibung) + "\n"); // dom.set(beschreibung); } Also i got those functions to encode / decode to hex: private static String StringToHex(String s) { if (s.length() == 0) return ""; char c; StringBuffer buff = new StringBuffer(); for (int i = 0; i < s.length(); i++) { c = s.charAt(i); buff.append(Integer.toHexString(c) + " "); } return buff.toString().trim(); } private static String HexToString(String s) { if (s.length() == 0) return ""; String[] arr = s.split(" "); StringBuffer buff = new StringBuffer(); int i; for (String str : arr) { i = Integer.valueOf(str, 16).intValue(); String hs = new Character((char) i).toString(); buff.append(hs); } return buff.toString(); } Example: Sometimes where there should be an "ü" it is shown as "u?" which we obviously want to avoid. When looking into it in an hex-Editor we see those things represented sometimes as 753F or 75A8. Same goes for "ä" or "ö" or "ß". So even for "u?" it varies from 753F to sometimes being 75A8. We tried to replace that with "ü". Doesn't work. Someone got any tips? We tried to use String.replaceAll() before that and used something like String.replaceAll("u\?","ü"); But that didn't work either as of nothing was changed at all. Thanks for any tips on that encoding stuff! :) EDIT: This is the solution which works perfectly fine: beschreibung = beschreibung.replace("U\u0308", "\u00DC"); // "Ü" beschreibung = beschreibung.replace("u\u0308", "\u00FC"); // "ü" beschreibung = beschreibung.replace("A\u0308", "\u00C4"); // "Ä" beschreibung = beschreibung.replace("a\u0308", "\u00E4"); // "ä" beschreibung = beschreibung.replace("O\u0308", "\u00D6"); // "Ö" beschreibung = beschreibung.replace("o\u0308", "\u00F6"); // "ö" beschreibung = beschreibung.replace("s\u0308", "\u00DF"); // "ß"
Somewhere there was ü represented not as one char U-UMLAUT but as SMALL-LETTER-U followed by a COMBINING-DIAERESIS (the combining umlaut mark, U+0308). This is valid. Then there was some conversion back, to maybe ISO-8859-1 (or even US-ASCII?), and the umlaut got separately converted. There was no such character in ISO-8859-1 and you got a question mark instead. A repair afterwards would be:
String s = ...
s = s.replace("U?", "\u00DC"); // "Ü"
s = s.replace("u?", "\u00FC"); // "ü"
...
(I have escaped the chars to prevent problems with a possibly different encoding of the Java compiler and editor, which would otherwise be an error.) That can also be done a bit more sophisticatedly:
s = s.replaceAll("([aouAOU])\\?", "$1\u0308"); // Again ASCII + umlaut separately
s = Normalizer.normalize(s, Normalizer.Form.NFC); // Now single non-ASCII letters.
java.text.Normalizer might be of help here. Caveat: the '?' can also be shown in a console (i.e. from the IDE), as a conversion takes place there too. Somewhere a conversion was done. This can happen implicitly, where the encoding is optional and such. You might try setting the system property file.encoding to UTF-8 or Cp1252 (Windows Latin-1).
First thing to check: are upper/lowercase important? E.g. if your toHex produces "75 3f" you won't replace it with your given command. hexBeschreibung = hexBeschreibung.toLowerCase() would solve this issue. Second (more of a hint): "u?" doesn't mean 'u' + '?', but 'u' + <a non-displayable character which is definitely not '?'>. I hope my first suggestion will help :) -- Sorry I can't comment, so I have to edit: Hex editors may show hex values upper or lower case, because it doesn't matter. You have to check your used String by yourself, because Java may represent hex in Strings with lowercase letters.
Using Java Normalizer to convert accent ascii to non-accent but to exclude some symboles
I have a set of data that have accented ascii in them. I want to convert the accent to plain English alphabets. I achieve that with the following code : import java.text.Normalizer; import java.util.regex.Pattern; public String deAccent(String str) { String nfdNormalizedString = Normalizer.normalize(str, Normalizer.Form.NFD); Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+"); return pattern.matcher(nfdNormalizedString).replaceAll(""); } But what this code is missing is the exclude characters, I don't know how I can exclude certain characters from the conversion, for example I want to exclude the letter "ü" from the word Düsseldorf so when I convert, it doesn't turn into Dusseldorf word. Is there a way to pass an exclude list to the method or the matcher and don't convert certain accented characters ?
Do not use normalization to remove accents! For example, the following letters are not asciified using your method: ł đ ħ You may also want to split ligatures like œ into separate letters (i.e. oe). Try this: private static final String TAB_00C0 = "" + "AAAAAAACEEEEIIII" + "DNOOOOO×OUUUÜYTs" + // <-- note an accented letter you wanted // and preserved multiplication sign "aaaaaaaceeeeiiii" + "dnooooo÷ouuuüyty" + // <-- note an accented letter and preserved division sign "AaAaAaCcCcCcCcDd" + "DdEeEeEeEeEeGgGg" + "GgGgHhHhIiIiIiIi" + "IiJjJjKkkLlLlLlL" + "lLlNnNnNnnNnOoOo" + "OoOoRrRrRrSsSsSs" + "SsTtTtTtUuUuUuUu" + "UuUuWwYyYZzZzZzs"; public static String toPlain(String source) { StringBuilder sb = new StringBuilder(source.length()); for (int i = 0; i < source.length(); i++) { char c = source.charAt(i); switch (c) { case 'ß': sb.append("ss"); break; case 'Œ': sb.append("OE"); break; case 'œ': sb.append("oe"); break; // insert more ligatures you want to support // or other letters you want to convert in a non-standard way here // I recommend to take a look at: æ þ ð fl fi default: if (c >= 0xc0 && c <= 0x17f) { c = TAB_00C0.charAt(c - 0xc0); } sb.append(c); } } return sb.toString(); }