I need to convert a Unicode string into a string in which all non-ASCII characters are written as Unicode escape sequences. For example, the string "漢字 Max" should be represented as "\u6F22\u5B57 Max".
What I have tried:
Different combinations of
new String(sourceString.getBytes(encoding1), encoding2)
Apache Commons StringEscapeUtils, which also escapes ASCII characters such as double quotes:
StringEscapeUtils.escapeJava(source)
Is there an easy way to encode such a string? Ideally only Java SE 6 or Apache Commons should be used to achieve the desired result.
This is the kind of simple code Jon Skeet had in mind in his comment:
final String in = "šđčćasdf";
final StringBuilder out = new StringBuilder();
for (int i = 0; i < in.length(); i++) {
final char ch = in.charAt(i);
if (ch <= 127) out.append(ch);
else out.append("\\u").append(String.format("%04x", (int)ch));
}
System.out.println(out.toString());
As Jon said, surrogate pairs will be represented as a pair of \u escapes.
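If you prefer to walk code points instead of chars, so you can decide explicitly what to do with supplementary characters, a sketch along the same lines (not from the original answer) might look like this:
final String in = "šđčć\uD834\uDD1Easdf"; // includes U+1D11E, a supplementary character
final StringBuilder out = new StringBuilder();
in.codePoints().forEach(cp -> {
    if (cp <= 127) {
        out.appendCodePoint(cp);
    } else {
        // escape each UTF-16 code unit: one \u escape for BMP characters,
        // two for a supplementary character's surrogate pair
        for (char c : Character.toChars(cp)) {
            out.append(String.format("\\u%04x", (int) c));
        }
    }
});
System.out.println(out);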
Guava Escaper Based Solution:
This escapes any non-ASCII characters into Unicode escape sequences.
import static java.lang.String.format;
import com.google.common.escape.CharEscaper;
public class NonAsciiUnicodeEscaper extends CharEscaper
{
@Override
protected char[] escape(final char c)
{
if (c >= 32 && c <= 127) { return new char[]{c}; }
else { return format("\\u%04x", (int) c).toCharArray(); }
}
}
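Usage would look roughly like this (CharEscaper inherits a public escape(String) method from Escaper):
String escaped = new NonAsciiUnicodeEscaper().escape("漢字 Max");
System.out.println(escaped); // \u6f22\u5b57 Max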
Related
I want to check whether a String contains only Latin letters, but it may also contain numbers and other symbols like _/+), etc.
The string utm_source=google should pass, and utm_source=google&2019_and_2020! should pass too. But utm_ресурс=google should not pass (because of the Cyrillic letters). I know how to do it with a regex, but how can I do it without using a regex or a classic for loop, perhaps with Streams and the Character class?
Use this code
public static boolean isValidUsAscii (String s) {
return Charset.forName("US-ASCII").newEncoder().canEncode(s);
}
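For example, with the strings from the question (java.nio.charset.Charset is assumed to be imported):
isValidUsAscii("utm_source=google");                // true
isValidUsAscii("utm_source=google&2019_and_2020!"); // true
isValidUsAscii("utm_ресурс=google");                // false (Cyrillic letters)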
For restricted "Latin" (no é et cetera), the text must be either US-ASCII (7 bits) or ISO-8859-1 without accented letters:
boolean isBasicLatin(String s) {
    return s.codePoints().allMatch(cp -> cp < 128 || (cp < 256 && !Character.isLetter(cp)));
}
Less of a neat single-line approach, but really all you need to do is check whether the numeric value of each character falls within certain limits, like so:
public boolean isQwerty(String text) {
int length = text.length();
for(int i = 0; i < length; i++) {
char character = text.charAt(i);
int ascii = character;
if(ascii<32||ascii>126) {
return false;
}
}
return true;
}
Test Run
ä returns false
abc returns true
I found a website which can convert any text to different obscure unicode font styles, e.g. Small Caps pseudoalphabet.
I'm interested in doing the same thing in Java code. The following HxD screenshot shows the bytes of both text versions:
Is there any way to do the conversion in Java with built-in methods or a library? Preferably the result would be another String object.
Quoting the website you linked:
What makes an alphabet "psuedo"?
One or more of the letters transliterated has a different meaning or source than intended. In the non-bold version of Fraktur, for example, several letters are "black letter" but most are "mathematical fraktur". In the Faux Cyrillic and Faux Ethiopic, letters are selected merely based on superficial similarities, rather than phonetic or semantic similarities.
So there is no well-defined smallcaps transformation; rather, the author of the converter hand-picked codepoint mappings to give the desired effect.
In the case of small caps, this is probably because there is no small-caps version of x in Unicode.
In order to recreate the same effect, you'll have to implement a codepoint conversion lookup table (which you could generate by, e.g., passing the whole alphabet to the transformer).
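Such a lookup table could be as simple as a map from each letter to its hand-picked replacement. The mappings below are only illustrative examples chosen for this sketch, not taken from that site (Map.of needs Java 9+ and java.util.Map):
// Illustrative sketch of a hand-picked mapping (only three letters shown)
Map<Character, String> lookup = Map.of(
        'a', "\u1D00",   // ᴀ LATIN LETTER SMALL CAPITAL A
        'b', "\u0299",   // ʙ LATIN LETTER SMALL CAPITAL B
        'c', "\u1D04");  // ᴄ LATIN LETTER SMALL CAPITAL C
StringBuilder out = new StringBuilder();
"abc".chars().forEach(ch ->
        out.append(lookup.getOrDefault((char) ch, String.valueOf((char) ch))));
// out now contains "ᴀʙᴄ"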
The Unicode specification has an official, stable name for each and every codepoint. You can take advantage of this by looking up “LATIN LETTER SMALL CAPITAL c” using the method Character.codePointOf(String).
public static String translate(String s) {
int len = s.length();
Formatter smallCaps = new Formatter(new StringBuilder(len));
for (int i = 0; i < len; i++) {
char c = s.charAt(i);
if (c >= 'A' && c <= 'Z' && c != 'X') {
smallCaps.format("%c",
Character.codePointOf("LATIN LETTER SMALL CAPITAL " + c));
} else {
smallCaps.format("%c", c);
}
}
return smallCaps.toString();
}
I put && c != 'X' in the test because there currently is no LATIN LETTER SMALL CAPITAL X character, though it has been proposed.
Note that some small capital codepoints may not be in Java’s internal copy of the Unicode character data table. I found that I needed to use Java 12 or later to recognize them all.
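Example usage (assuming a JDK whose Unicode tables include all the small capital letters, as noted above):
System.out.println(translate("HELLO")); // prints "ʜᴇʟʟᴏ"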
I just found a simple solution by translating the plain text alphabet to the Unicode "small caps" alphabet as follows:
private static final String[] ALPHABET = "abcdefghijklmnopqrstuvwxyz".split("");
private static final String[] SMALL_CAPS_ALPHABET = "ᴀʙᴄᴅᴇꜰɢʜɪᴊᴋʟᴍɴᴏᴩqʀꜱᴛᴜᴠᴡxyᴢ".split("");
private static String toSmallCaps(String text)
{
text = text.toLowerCase();
StringBuilder convertedBuilder = new StringBuilder();
for (char textCharacter : text.toCharArray())
{
int index = 0;
boolean successfullyTranslated = false;
for (String alphabetLetter : ALPHABET)
{
if ((textCharacter + "").equals(alphabetLetter))
{
convertedBuilder.append(SMALL_CAPS_ALPHABET[index]);
successfullyTranslated = true;
break;
}
index++;
}
if (!successfullyTranslated)
{
convertedBuilder.append(textCharacter);
}
}
return convertedBuilder.toString();
}
Usage:
String smallCaps = toSmallCaps("Hello StackOverflow!");
System.out.println(smallCaps);
Output:
ʜᴇʟʟᴏ ꜱᴛᴀᴄᴋᴏᴠᴇʀꜰʟᴏᴡ!
It's not the most elegant or extendable solution but maybe someone can suggest improvements.
The answer posted by @BullyWiiPlaza is a good one, but the code is pretty inefficient.
Here is an alternative implementation which will be much faster and uses less memory:
private static final char[] SMALL_CAPS_ALPHABET = "ᴀʙᴄᴅᴇꜰɢʜɪᴊᴋʟᴍɴᴏᴩqʀꜱᴛᴜᴠᴡxyᴢ".toCharArray();
private static String toSmallCaps(String text)
{
if(null == text) {
return null;
}
int length = text.length();
StringBuilder smallCaps = new StringBuilder(length);
for(int i=0; i<length; ++i) {
char c = text.charAt(i);
if(c >= 'a' && c <= 'z') {
smallCaps.append(SMALL_CAPS_ALPHABET[c - 'a']);
} else {
smallCaps.append(c);
}
}
return smallCaps.toString();
}
I want to find the character codes (ASCII/Unicode) needed when programming Android to support the Arabic locale. Arabic has many characters that differ from English; the codes for many letters are joined, or some letters are split.
How can I find the specific code for each letter?
Unicode is a numbering of all characters. The numbers would need three-byte integers. A Unicode character is conventionally written as U+XXXX, where XXXX is the number in hexadecimal (base 16) notation. Such a numbered character is called a code point; in Java it is held in an int.
A Java char is 2 bytes (UTF-16), so it cannot represent the higher Unicode code points on its own; there, a pair of two chars (a surrogate pair) is used.
The Java class Character deals with the conversions.
char lowUnicode = '\u0627'; // Alef, fitting in a char
int cp = (int) lowUnicode;
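For a code point above U+FFFF a single char is not enough; the String then stores a surrogate pair, for example:
String emoji = "\uD83D\uDE00";               // U+1F600, stored as two chars (a surrogate pair)
int codePoint = emoji.codePointAt(0);        // 0x1F600
int units = Character.charCount(codePoint);  // 2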
One can iterate through code points of a String as follows:
String s = "...";
for (int i = 0; i < s.length(); ) {
int codePoint = s.codePointAt(i);
i += Character.charCount(codePoint);
}
Or, in Java 8:
s.codePoints().forEach(
(codePoint) -> System.out.println(codePoint));
Dumping Arabic between U+0600 and U+08FF:
The code below dumps Unicode in the main Arabic range.
for (int codePoint = 0x600; codePoint < 0x900; ++codePoint) {
if (Character.isAlphabetic(codePoint)
&& UnicodeScript.of(codePoint) == UnicodeScript.ARABIC) {
System.out.printf("\u200E\\%04X \u200F%s\u200E %s%n",
codePoint,
new String(Character.toChars(codePoint)),
Character.getName(codePoint));
}
}
Under Windows/Linux/... there are character-map tools for displaying Unicode.
In the code above, U+200E is the LEFT-TO-RIGHT MARK and U+200F is the RIGHT-TO-LEFT MARK.
If you want to get a character's Unicode value, the code below will do that:
char character = 'ع';
int code = (int) character;
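Going the other way, a code point can be turned back into text with Character.toChars (or StringBuilder.appendCodePoint); for example:
int code = 0x0639; // ع ARABIC LETTER AIN
String letter = new String(Character.toChars(code));
// or: new StringBuilder().appendCodePoint(code).toString()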
The following will replace ASCII control characters (shorthand for [\x00-\x1F\x7F]):
my_string.replaceAll("\\p{Cntrl}", "?");
The following will replace everything that is not a printable ASCII character (\p{Print} is shorthand for [\p{Graph}\x20]), which unfortunately includes accented characters:
my_string.replaceAll("[^\\p{Print}]", "?");
However, neither works for Unicode strings. Does anyone have a good way to remove non-printable characters from a Unicode string?
my_string.replaceAll("\\p{C}", "?");
See more about Unicode regular expressions. java.util.regex.Pattern / String.replaceAll supports them.
Op De Cirkel is mostly right. His suggestion will work in most cases:
myString.replaceAll("\\p{C}", "?");
But if myString might contain non-BMP codepoints then it's more complicated. \p{C} contains the surrogate codepoints of \p{Cs}. The replacement method above will corrupt non-BMP codepoints by sometimes replacing only half of the surrogate pair. It's possible this is a Java bug rather than intended behavior.
Using the other constituent categories is an option:
myString.replaceAll("[\\p{Cc}\\p{Cf}\\p{Co}\\p{Cn}]", "?");
However, solitary surrogate characters that are not part of a pair (each surrogate character has its own assigned codepoint) will not be removed. A non-regex approach is the only way I know to properly handle \p{C}:
StringBuilder newString = new StringBuilder(myString.length());
for (int offset = 0; offset < myString.length();)
{
int codePoint = myString.codePointAt(offset);
offset += Character.charCount(codePoint);
// Replace invisible control characters and unused code points
switch (Character.getType(codePoint))
{
case Character.CONTROL: // \p{Cc}
case Character.FORMAT: // \p{Cf}
case Character.PRIVATE_USE: // \p{Co}
case Character.SURROGATE: // \p{Cs}
case Character.UNASSIGNED: // \p{Cn}
newString.append('?');
break;
default:
newString.append(Character.toChars(codePoint));
break;
}
}
The methods below should serve your goal:
public static String removeNonAscii(String str)
{
return str.replaceAll("[^\\x00-\\x7F]", "");
}
public static String removeNonPrintable(String str) // All Control Char
{
return str.replaceAll("[\\p{C}]", "");
}
public static String removeSomeControlChar(String str) // Some Control Char
{
return str.replaceAll("[\\p{Cntrl}\\p{Cc}\\p{Cf}\\p{Co}\\p{Cn}]", "");
}
public static String removeFullControlChar(String str)
{
return removeNonPrintable(str).replaceAll("[\\r\\n\\t]", "");
}
You may be interested in the Unicode categories "Other, Control" and possibly "Other, Format" (unfortunately the latter seems to contain both unprintable and printable characters).
In Java regular expressions you can check for them using \p{Cc} and \p{Cf} respectively.
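For example, to strip just those two categories:
String cleaned = myString.replaceAll("[\\p{Cc}\\p{Cf}]", "");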
I have used this simple function for this:
private static final Pattern pattern = Pattern.compile("[^ -~]");
private static String cleanTheText(String text) {
    // remove every character outside the printable ASCII range (space to tilde)
    return pattern.matcher(text).replaceAll("");
}
Hope this is useful.
Based on the answers by Op De Cirkel and noackjr, the following is what I do for general string cleaning: 1. trimming leading or trailing whitespaces, 2. dos2unix, 3. mac2unix, 4. removing all "invisible Unicode characters" except whitespaces:
myString.trim.replaceAll("\r\n", "\n").replaceAll("\r", "\n").replaceAll("[\\p{Cc}\\p{Cf}\\p{Co}\\p{Cn}&&[^\\s]]", "")
Tested with Scala REPL.
I propose to remove the non-printable characters like below instead of replacing them:
private String removeNonBMPCharacters(final String input) {
StringBuilder strBuilder = new StringBuilder();
input.codePoints().forEach((i) -> {
if (Character.isSupplementaryCodePoint(i)) {
strBuilder.append("?");
} else {
strBuilder.append(Character.toChars(i));
}
});
return strBuilder.toString();
}
A version with multilanguage support:
public static String cleanUnprintableChars(String text, boolean multilanguage)
{
String regex = multilanguage ? "[^\\x00-\\xFF]" : "[^\\x00-\\x7F]";
// strips off all characters outside the allowed range (ASCII, or Latin-1 when multilanguage)
text = text.replaceAll(regex, "");
// erases all the ASCII control characters
text = text.replaceAll("[\\p{Cntrl}&&[^\r\n\t]]", "");
// removes non-printable characters from Unicode
text = text.replaceAll("\\p{C}", "");
return text.trim();
}
I have redesigned the code for phone numbers like +9 (987) 124124, based on:
Extract digits from a string in Java
public static String stripNonDigitsV2(CharSequence input) {
    if (input == null)
        return null;
    if (input.length() == 0)
        return "";
    char[] result = new char[input.length()];
    int cursor = 0;
    CharBuffer buffer = CharBuffer.wrap(input);
    int i = 0;
    while (i < buffer.length()) {
        char chr = buffer.get(i);
        // skip "uXXXX" escape sequences: jump past the 'u' and its four hex digits
        if (chr == 'u' && i + 5 < buffer.length()) {
            i = i + 5;
            chr = buffer.get(i);
        }
        // keep decimal digits only ('0' is 48, '9' is 57)
        if (chr >= '0' && chr <= '9')
            result[cursor++] = chr;
        i = i + 1;
    }
    return new String(result, 0, cursor);
}
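With the digit check corrected as above, a quick sanity check (hypothetical usage, not from the original post):
System.out.println(stripNonDigitsV2("+9 (987) 124124")); // prints "9987124124"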
Having ignored it all this time, I am currently forcing myself to learn more about unicode in Java. There is an exercise I need to do about converting a UTF-16 string to 8-bit ASCII. Can someone please enlighten me how to do this in Java? I understand that you can't represent all possible unicode values in ASCII, so in this case I want a code which exceeds 0xFF to be merely added anyway (bad data should also just be added silently).
Thanks!
You can use java.nio for an easy solution:
// first encode the utf-16 string as a ByteBuffer
ByteBuffer bb = Charset.forName("utf-16").encode(CharBuffer.wrap(utf16str));
// then decode those bytes as US-ASCII
CharBuffer ascii = Charset.forName("US-ASCII").decode(bb);
How about this:
String input = ... // my UTF-16 string
StringBuilder sb = new StringBuilder(input.length());
for (int i = 0; i < input.length(); i++) {
char ch = input.charAt(i);
if (ch <= 0xFF) {
sb.append(ch);
}
}
byte[] ascii = sb.toString().getBytes("ISO-8859-1"); // aka LATIN-1
This is probably not the most efficient way to do this conversion for large strings since we copy the characters twice. However, it has the advantage of being straightforward.
BTW, strictly speaking there is no such character set as 8-bit ASCII. ASCII is a 7-bit character set. LATIN-1 is the nearest thing there is to an "8-bit ASCII" character set (and block 0 of Unicode is equivalent to LATIN-1) so I'll assume that's what you mean.
EDIT: in the light of the update to the question, the solution is even simpler:
String input = ... // my UTF-16 string
byte[] ascii = new byte[input.length()];
for (int i = 0; i < input.length(); i++) {
ascii[i] = (byte) input.charAt(i);
}
This solution is more efficient. Since we now know how many bytes to expect, we can preallocate the byte array and copy the (truncated) characters into it without using a StringBuilder as an intermediate buffer.
However, I'm not convinced that dealing with bad data in this way is sensible.
EDIT 2: there is one more obscure "gotcha" with this. Unicode actually defines code points (characters) to be "roughly 21 bit" values ... 0x000000 to 0x10FFFF ... and uses surrogates to represent codes > 0x00FFFF. In other words, a Unicode codepoint > 0x00FFFF is actually represented in UTF-16 as two "characters". Neither my answer nor any of the others takes account of this (admittedly esoteric) point. In fact, dealing with codepoints > 0x00FFFF in Java is rather tricky in general. This stems from the fact that 'char' is a 16 bit type and String is defined in terms of 'char'.
EDIT 3: maybe a more sensible solution for dealing with unexpected characters that don't convert to ASCII is to replace them with the standard replacement character:
String input = ... // my UTF-16 string
byte[] ascii = new byte[input.length()];
for (int i = 0; i < input.length(); i++) {
char ch = input.charAt(i);
ascii[i] = (ch <= 0xFF) ? (byte) ch : (byte) '?';
}
Java internally represents strings in UTF-16. If a String object is what you are starting with, you can encode using String.getBytes(Charset c), where you might specify US-ASCII (which can map code points 0x00-0x7f) or ISO-8859-1 (which can map code points 0x00-0xff, and may be what you mean by "8-bit ASCII").
As for adding "bad data"... ASCII or ISO-8859-1 strings simply can't represent values outside of a certain range. With the Charset overload, getBytes replaces characters it can't represent in the destination character set with the charset's default replacement byte (typically '?') rather than dropping them.
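A quick way to see that default behavior for yourself (using java.nio.charset.StandardCharsets):
byte[] bytes = "漢字 Max".getBytes(StandardCharsets.US_ASCII);
System.out.println(new String(bytes, StandardCharsets.US_ASCII)); // prints "?? Max"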
Since this is an exercise, it sounds like you need to implement this manually. You can think of an encoding (e.g. UTF-16 or ASCII) as a lookup table that matches a sequence of bytes to a logical character (a codepoint).
Java uses UTF-16 strings, which means that any given codepoint can be represented in one or two char variables. Whether you want to handle the two-char surrogate pairs depends on how likely you think your application is to encounter them (see the Character class for detecting them). ASCII only uses the first 7 bits of an octet (byte), so the valid range of values is 0 to 127. UTF-16 uses identical values for this range (they're just wider). This can be confirmed with this code:
Charset ascii = Charset.forName("US-ASCII");
byte[] buffer = new byte[1];
char[] cbuf = new char[1];
for (int i = 0; i <= 127; i++) {
buffer[0] = (byte) i;
cbuf[0] = (char) i;
String decoded = new String(buffer, ascii);
String utf16String = new String(cbuf);
if (!utf16String.equals(decoded)) {
throw new IllegalStateException();
}
System.out.print(utf16String);
}
System.out.println("\nOK");
Therefore, you can convert UTF-16 to ASCII by casting a char to a byte.
You can read more about Java character encoding here.
Just to optimize on the accepted answer and not pay any penalty if the string is already all ASCII characters, here is the optimized version. Thanks @stephen-c
public static String toAscii(String input) {
final int length = input.length();
int ignoredChars = 0;
byte[] ascii = null;
for (int i = 0; i < length; i++) {
char ch = input.charAt(i);
if (ch > 0xFF) {
//-- ignore this non-ascii character
ignoredChars++;
if (ascii == null) {
//-- first non-ascii character. Create the ascii array and copy over all characters seen so far
ascii = new byte[input.length() - 1]; //-- we know the result will be shorter by at least 1
for (int j = 0; j < i; j++) {
ascii[j] = (byte) input.charAt(j);
}
}
} else if (ascii != null) {
ascii[i - ignoredChars] = (byte) ch;
}
}
//-- (ignoredChars == 0) is the same as (ascii == null) i.e. no non-ascii characters found
return ignoredChars == 0 ? input : new String(ascii, 0, length - ignoredChars, StandardCharsets.ISO_8859_1);
}
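Example usage (with the copy loop fixed as above; the two non-Latin-1 characters are simply dropped):
System.out.println(toAscii("漢字 Max")); // prints " Max"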