Arabic unicode or ASCII code in java - java

i want to find the character ASCII code for programming android to support the Arabic locale. Android programming has many characters are different English. The ASCII code in many letters joint or some of letters are split.
how can i find the special code for each letter?

Unicode is a numbering of all characters. The numbering would need three bytes integers. A Unicode character is represented in science as U+XXXX where XXXX stands for the number in hexadecimal (base 16) notation. A Unicode character is called code point, in Java with type int.
Java char is 2 bytes (UTF-16), so cannot represent the higher order Unicode; there a pair of two chars is used.
The java class Character deals with conversion.
char lowUnicode = '\u0627'; // Alef, fitting in a char
int cp = (int) lowUnicode;
One can iterate through code points of a String as follows:
String s = "...";
for (int i = 0; i < s.length(); ) {
int codePoint = s.codePointAt(i);
i += Character.charCount(codePoint);
}
String s = "...";
for (int i = 0; i < s.length(); ) {
int codePoint = s.codePointAt(i);
...
i += Character.charCount(codePoint);
}
Or in java 8:
s.codePoints().forEach(
(codePoint) -> System.out.println(codePoint));
Dumping Arabic between U+600 and U+8FF:
The code below dumps Unicode in the main Arabic range.
for (int codePoint = 0x600; codePoint < 0x900; ++codePoint) {
if (Character.isAlphabetic(codePoint)
&& UnicodeScript.of(codePoint) == UnicodeScript.ARABIC) {
System.out.printf("\u200E\\%04X \u200F%s\u200E %s%n",
codePoint,
new String(Character.toChars(codePoint)),
Character.getName(codePoint));
}
}
Under Windows/Linux/... there exist char map tools to display Unicode.
Above U+200E is the Left-To-Right, and U+200F is the Right-To-Left mark.

If you want to get Unicode characters code below will do that:
char character = 'ع';
int code = (int) character;

Related

replacing all cases of ISO Control characters in a string with "CTRL"

static String clean(String identifier) {
String firstString = "";
for (int i = 0; i < identifier.length(); i++)
if (Character.isISOControl(identifier.charAt(i))){
firstString = identifier.replaceAll(identifier.charAt(i),
"CTRL");
}
return firstString;
}
The logic behind the code above is to replace all instances of ISO Control characters in the string 'identifier' with "CTRL". I'm however faced with this error: "char cannot be converted to java.lang.String"
Can someone help me to solve and improve my code to produce the right output?
String#replaceAll expects a String as parameter, but it has to be a regular expression. Use String#replace instead.
EDIT: I haven't seen that you want to replace a character by some string. In that case, you can use this version of String#replace but you need to convert the character to a String, e. g. by using Character.toString.
Update
Example:
String text = "AB\003DE";
text = text.replace(Character.toString('\003'), "CTRL");
System.out.println(text);
// gives: ABCTRLDE
Code points, and Control Picture characters
I can add two points:
The char type is essentially broken since Java 2, and legacy since Java 5. Best to use code point integers when working with individual characters.
Unicode defines characters for display as placeholders for control characters. See Control Pictures section of one Wikipedia page, and see another page, Control Pictures.
For example, the NULL character at code point 0 decimal has a matching SYMBOL FOR NULL character at 9,216 decimal: ␀. To see all the Control Picture characters, use this PDF section of the Unicode standard specification.
Get an array of the code point integers representing each of the characters in your string.
int[] codePoints = myString.codePoints().toArray() ;
Loop those code points. Replace those of interest.
Here is some untested code.
int[] replacedCodePoints = new int[ codePoints.length ] ;
int index = 0 ;
for ( int codePoint : codePoints )
{
if( codePoint >= 0 && codePoint <= 32 ) // 32 is SPACE, so you may want to use 31 depending on your context.
{
replacedCodePoints[ index ] = codePoint + 9_216 ; // 9,216 is the offset to the beginning of the Control Picture character range defined in Unicode.
} else if ( codePoint == 127 ) // DEL character.
{
replacedCodePoints[ index ] = 9_249 ;
} else // Any other character, we keep as-is, no replacement.
{
replacedCodePoints[ index ] = codePoint ;
}
i ++ ; // Set up the next loop.
}
Convert code points back into text. Use StringBuilder#appendCodePoint to build up the characters of text. You can use the following stream-based code as boilerplate. For explanation, see this Question.
String result =
Arrays
.stream( replacedCodePoints )
.collect( StringBuilder::new , StringBuilder::appendCodePoint , StringBuilder::append )
.toString();

Convert UTF-8 Unicode string to ASCII Unicode escaped String

I need to convert unicode string to string which have non-ascii characters encoded in unicode. For example, string "漢字 Max" should be presented as "\u6F22\u5B57 Max".
What I have tried:
Differenct combinations of
new String(sourceString.getBytes(encoding1), encoding2)
Apache StringEscapeUtils which escapes also ascii chars like double-quote
StringEscapeUtils.escapeJava(source)
Is there an easy way to encode such string? Ideally only Java 6 SE or Apache Commons should be used to achieve desired result.
This is the kind of simple code Jon Skeet had in mind in his comment:
final String in = "šđčćasdf";
final StringBuilder out = new StringBuilder();
for (int i = 0; i < in.length(); i++) {
final char ch = in.charAt(i);
if (ch <= 127) out.append(ch);
else out.append("\\u").append(String.format("%04x", (int)ch));
}
System.out.println(out.toString());
As Jon said, surrogate pairs will be represented as a pair of \u escapes.
Guava Escaper Based Solution:
This escapes any non-ASCII characters into Unicode escape sequences.
import static java.lang.String.format;
import com.google.common.escape.CharEscaper;
public class NonAsciiUnicodeEscaper extends CharEscaper
{
#Override
protected char[] escape(final char c)
{
if (c >= 32 && c <= 127) { return new char[]{c}; }
else { return format("\\u%04x", (int) c).toCharArray(); }
}
}

Android ndk log unicode (wide characters)

I am trying to write to logcat unicode characters (UTF16; wide characters like Japanese, Korean or Chinese). The only success is a work around which is to send it to Java as a string (for example the code below):
unsigned short* text = new unsigned short[100]; // or jchar*
.... // all some unicode to the array
jstring jtext = jniEnv->NewString(text, (jsize)length);
jniEnv->CallVoidMethod( JavaClass, JavaPrintUnicode, jtext );
... // Clean up
// Then java prints it out
However if the bytes are in a char array, __android_log_print will print garbled text. Is there a method to print out the text within the NDK (C++) side?
Android NDK, as typical for Linux, uses 32 bits to represent wide chars, so if conversion to UTF-32 is trivial, you can try this:
unsigned short text[100];
// fill text with UTF-16; make sure that all characters are 2 bytes long
long utf32[100];
for (int i=0; i<100; i++)
utf32[i] = text[i];
__android_log_print(ANDROID_LOG_INFO, "UTF-32", "%ls", utf32);
You can print hex values for each character instead.
void print_string_as_hex(WCHAR* ws){
int l = wcslen(ws);
for(int i = 0 ; i < l ; i++)
__android_log_print(ANDROID_LOG_INFO, "MyTag", "%x", ws[i]);
}
To get the encoded character from hex values, you can use Microsoft Word - 4513[Alt-x] will encode 0x4513for you.

Creating Unicode character from its number

I want to display a Unicode character in Java. If I do this, it works just fine:
String symbol = "\u2202";
symbol is equal to "∂". That's what I want.
The problem is that I know the Unicode number and need to create the Unicode symbol from that. I tried (to me) the obvious thing:
int c = 2202;
String symbol = "\\u" + c;
However, in this case, symbol is equal to "\u2202". That's not what I want.
How can I construct the symbol if I know its Unicode number (but only at run-time---I can't hard-code it in like the first example)?
If you want to get a UTF-16 encoded code unit as a char, you can parse the integer and cast to it as others have suggested.
If you want to support all code points, use Character.toChars(int). This will handle cases where code points cannot fit in a single char value.
Doc says:
Converts the specified character (Unicode code point) to its UTF-16 representation stored in a char array. If the specified code point is a BMP (Basic Multilingual Plane or Plane 0) value, the resulting char array has the same value as codePoint. If the specified code point is a supplementary code point, the resulting char array has the corresponding surrogate pair.
Just cast your int to a char. You can convert that to a String using Character.toString():
String s = Character.toString((char)c);
EDIT:
Just remember that the escape sequences in Java source code (the \u bits) are in HEX, so if you're trying to reproduce an escape sequence, you'll need something like int c = 0x2202.
The other answers here either only support unicode up to U+FFFF (the answers dealing with just one instance of char) or don't tell how to get to the actual symbol (the answers stopping at Character.toChars() or using incorrect method after that), so adding my answer here, too.
To support supplementary code points also, this is what needs to be done:
// this character:
// http://www.isthisthingon.org/unicode/index.php?page=1F&subpage=4&glyph=1F495
// using code points here, not U+n notation
// for equivalence with U+n, below would be 0xnnnn
int codePoint = 128149;
// converting to char[] pair
char[] charPair = Character.toChars(codePoint);
// and to String, containing the character we want
String symbol = new String(charPair);
// we now have str with the desired character as the first item
// confirm that we indeed have character with code point 128149
System.out.println("First code point: " + symbol.codePointAt(0));
I also did a quick test as to which conversion methods work and which don't
int codePoint = 128149;
char[] charPair = Character.toChars(codePoint);
System.out.println(new String(charPair, 0, 2).codePointAt(0)); // 128149, worked
System.out.println(charPair.toString().codePointAt(0)); // 91, didn't work
System.out.println(new String(charPair).codePointAt(0)); // 128149, worked
System.out.println(String.valueOf(codePoint).codePointAt(0)); // 49, didn't work
System.out.println(new String(new int[] {codePoint}, 0, 1).codePointAt(0));
// 128149, worked
--
Note: as #Axel mentioned in the comments, with java 11 there is Character.toString(int codePoint) which would arguably be best suited for the job.
This one worked fine for me.
String cc2 = "2202";
String text2 = String.valueOf(Character.toChars(Integer.parseInt(cc2, 16)));
Now text2 will have ∂.
Remember that char is an integral type, and thus can be given an integer value, as well as a char constant.
char c = 0x2202;//aka 8706 in decimal. \u codepoints are in hex.
String s = String.valueOf(c);
String st="2202";
int cp=Integer.parseInt(st,16);// it convert st into hex number.
char c[]=Character.toChars(cp);
System.out.println(c);// its display the character corresponding to '\u2202'.
Although this is an old question, there is a very easy way to do this in Java 11 which was released today: you can use a new overload of Character.toString():
public static String toString​(int codePoint)
Returns a String object representing the specified character (Unicode code point). The result is a string of length 1 or 2, consisting solely of the specified codePoint.
Parameters:
codePoint - the codePoint to be converted
Returns:
the string representation of the specified codePoint
Throws:
IllegalArgumentException - if the specified codePoint is not a valid Unicode code point.
Since:
11
Since this method supports any Unicode code point, the length of the returned String is not necessarily 1.
The code needed for the example given in the question is simply:
int codePoint = '\u2202';
String s = Character.toString(codePoint); // <<< Requires JDK 11 !!!
System.out.println(s); // Prints ∂
This approach offers several advantages:
It works for any Unicode code point rather than just those that can be handled using a char.
It's concise, and it's easy to understand what the code is doing.
It returns the value as a string rather than a char[], which is often what you want. The answer posted by McDowell is appropriate if you want the code point returned as char[].
This is how you do it:
int cc = 0x2202;
char ccc = (char) Integer.parseInt(String.valueOf(cc), 16);
final String text = String.valueOf(ccc);
This solution is by Arne Vajhøj.
The code below will write the 4 unicode chars (represented by decimals) for the word "be" in Japanese. Yes, the verb "be" in Japanese has 4 chars!
The value of characters is in decimal and it has been read into an array of String[] -- using split for instance. If you have Octal or Hex, parseInt take a radix as well.
// pseudo code
// 1. init the String[] containing the 4 unicodes in decima :: intsInStrs
// 2. allocate the proper number of character pairs :: c2s
// 3. Using Integer.parseInt (... with radix or not) get the right int value
// 4. place it in the correct location of in the array of character pairs
// 5. convert c2s[] to String
// 6. print
String[] intsInStrs = {"12354", "12426", "12414", "12377"}; // 1.
char [] c2s = new char [intsInStrs.length * 2]; // 2. two chars per unicode
int ii = 0;
for (String intString : intsInStrs) {
// 3. NB ii*2 because the 16 bit value of Unicode is written in 2 chars
Character.toChars(Integer.parseInt(intsInStrs[ii]), c2s, ii * 2 ); // 3 + 4
++ii; // advance to the next char
}
String symbols = new String(c2s); // 5.
System.out.println("\nLooooonger code point: " + symbols); // 6.
// I tested it in Eclipse and Java 7 and it works. Enjoy
Here is a block to print out unicode chars between \u00c0 to \u00ff:
char[] ca = {'\u00c0'};
for (int i = 0; i < 4; i++) {
for (int j = 0; j < 16; j++) {
String sc = new String(ca);
System.out.print(sc + " ");
ca[0]++;
}
System.out.println();
}
Unfortunatelly, to remove one backlash as mentioned in first comment (newbiedoodle) don't lead to good result. Most (if not all) IDE issues syntax error. The reason is in this, that Java Escaped Unicode format expects syntax "\uXXXX", where XXXX are 4 hexadecimal digits, which are mandatory. Attempts to fold this string from pieces fails. Of course, "\u" is not the same as "\\u". The first syntax means escaped 'u', second means escaped backlash (which is backlash) followed by 'u'. It is strange, that on the Apache pages is presented utility, which doing exactly this behavior. But in reality, it is Escape mimic utility. Apache has some its own utilities (i didn't testet them), which do this work for you. May be, it is still not that, what you want to have. Apache Escape Unicode utilities But this utility 1 have good approach to the solution. With combination described above (MeraNaamJoker). My solution is create this Escaped mimic string and then convert it back to unicode (to avoid real Escaped Unicode restriction). I used it for copying text, so it is possible, that in uencode method will be better to use '\\u' except '\\\\u'. Try it.
/**
* Converts character to the mimic unicode format i.e. '\\u0020'.
*
* This format is the Java source code format.
*
* CharUtils.unicodeEscaped(' ') = "\\u0020"
* CharUtils.unicodeEscaped('A') = "\\u0041"
*
* #param ch the character to convert
* #return is in the mimic of escaped unicode string,
*/
public static String unicodeEscaped(char ch) {
String returnStr;
//String uniTemplate = "\u0000";
final static String charEsc = "\\u";
if (ch < 0x10) {
returnStr = "000" + Integer.toHexString(ch);
}
else if (ch < 0x100) {
returnStr = "00" + Integer.toHexString(ch);
}
else if (ch < 0x1000) {
returnStr = "0" + Integer.toHexString(ch);
}
else
returnStr = "" + Integer.toHexString(ch);
return charEsc + returnStr;
}
/**
* Converts the string from UTF8 to mimic unicode format i.e. '\\u0020'.
* notice: i cannot use real unicode format, because this is immediately translated
* to the character in time of compiling and editor (i.e. netbeans) checking it
* instead reaal unicode format i.e. '\u0020' i using mimic unicode format '\\u0020'
* as a string, but it doesn't gives the same results, of course
*
* This format is the Java source code format.
*
* CharUtils.unicodeEscaped(' ') = "\\u0020"
* CharUtils.unicodeEscaped('A') = "\\u0041"
*
* #param String - nationalString in the UTF8 string to convert
* #return is the string in JAVA unicode mimic escaped
*/
public String encodeStr(String nationalString) throws UnsupportedEncodingException {
String convertedString = "";
for (int i = 0; i < nationalString.length(); i++) {
Character chs = nationalString.charAt(i);
convertedString += unicodeEscaped(chs);
}
return convertedString;
}
/**
* Converts the string from mimic unicode format i.e. '\\u0020' back to UTF8.
*
* This format is the Java source code format.
*
* CharUtils.unicodeEscaped(' ') = "\\u0020"
* CharUtils.unicodeEscaped('A') = "\\u0041"
*
* #param String - nationalString in the JAVA unicode mimic escaped
* #return is the string in UTF8 string
*/
public String uencodeStr(String escapedString) throws UnsupportedEncodingException {
String convertedString = "";
String[] arrStr = escapedString.split("\\\\u");
String str, istr;
for (int i = 1; i < arrStr.length; i++) {
str = arrStr[i];
if (!str.isEmpty()) {
Integer iI = Integer.parseInt(str, 16);
char[] chaCha = Character.toChars(iI);
convertedString += String.valueOf(chaCha);
}
}
return convertedString;
}
char c=(char)0x2202;
String s=""+c;
(ANSWER IS IN DOT NET 4.5 and in java, there must be a similar approach exist)
I am from West Bengal in INDIA.
As I understand your problem is ...
You want to produce similar to ' অ ' (It is a letter in Bengali language)
which has Unicode HEX : 0X0985.
Now if you know this value in respect of your language then how will you produce that language specific Unicode symbol right ?
In Dot Net it is as simple as this :
int c = 0X0985;
string x = Char.ConvertFromUtf32(c);
Now x is your answer.
But this is HEX by HEX convert and sentence to sentence conversion is a work for researchers :P

UTF-16 to ASCII conversion in Java

Having ignored it all this time, I am currently forcing myself to learn more about unicode in Java. There is an exercise I need to do about converting a UTF-16 string to 8-bit ASCII. Can someone please enlighten me how to do this in Java? I understand that you can't represent all possible unicode values in ASCII, so in this case I want a code which exceeds 0xFF to be merely added anyway (bad data should also just be added silently).
Thanks!
You can use java.nio for an easy solution:
// first encode the utf-16 string as a ByteBuffer
ByteBuffer bb = Charset.forName("utf-16").encode(CharBuffer.wrap(utf16str));
// then decode those bytes as US-ASCII
CharBuffer ascii = Charset.forName("US-ASCII").decode(bb);
How about this:
String input = ... // my UTF-16 string
StringBuilder sb = new StringBuilder(input.length());
for (int i = 0; i < input.length(); i++) {
char ch = input.charAt(i);
if (ch <= 0xFF) {
sb.append(ch);
}
}
byte[] ascii = sb.toString().getBytes("ISO-8859-1"); // aka LATIN-1
This is probably not the most efficient way to do this conversion for large strings since we copy the characters twice. However, it has the advantage of being straightforward.
BTW, strictly speaking there is no such character set as 8-bit ASCII. ASCII is a 7-bit character set. LATIN-1 is the nearest thing there is to an "8-bit ASCII" character set (and block 0 of Unicode is equivalent to LATIN-1) so I'll assume that's what you mean.
EDIT: in the light of the update to the question, the solution is even simpler:
String input = ... // my UTF-16 string
byte[] ascii = new byte[input.length()];
for (int i = 0; i < input.length(); i++) {
ascii[i] = (byte) input.charAt(i);
}
This solution is more efficient. Since we now know how many bytes to expect, we can preallocate the byte array and in copy the (truncated) characters without using a StringBuilder as intermediate buffer.
However, I'm not convinced that dealing with bad data in this way is sensible.
EDIT 2: there is one more obscure "gotcha" with this. Unicode actually defines code points (characters) to be "roughly 21 bit" values ... 0x000000 to 0x10FFFF ... and uses surrogates to represent codes > 0x00FFFF. In other words, a Unicode codepoint > 0x00FFFF is actually represented in UTF-16 as two "characters". Neither my answer or any of the others take account of this (admittedly esoteric) point. In fact, dealing with codepoints > 0x00FFFF in Java is rather tricky in general. This stems from the fact that 'char' is a 16 bit type and String is defined in terms of 'char'.
EDIT 3: maybe a more sensible solution for dealing with unexpected characters that don't convert to ASCII is to replace them with the standard replacement character:
String input = ... // my UTF-16 string
byte[] ascii = new byte[input.length()];
for (int i = 0; i < input.length(); i++) {
char ch = input.charAt(i);
ascii[i] = (ch <= 0xFF) ? (byte) ch : (byte) '?';
}
Java internally represents strings in UTF-16. If a String object is what you are starting with, you can encode using String.getBytes(Charset c), where you might specify US-ASCII (which can map code points 0x00-0x7f) or ISO-8859-1 (which can map code points 0x00-0xff, and may be what you mean by "8-bit ASCII").
As for adding "bad data"... ASCII or ISO-8859-1 strings simply can't represent values outside of a certain range. I believe getBytes will simply drop characters it's not able to represent in the destination character set.
Since this is an exercise, it sounds like you need to implement this manually. You can think of an encoding (e.g. UTF-16 or ASCII) as a lookup table that matches a sequence of bytes to a logical character (a codepoint).
Java uses UTF-16 strings, which means that any given codepoint can be represented in one or two char variables. Whether you want to handle the two-char surrogate pairs depends on how likely you think your application is to encounter them (see the Character class for detecting them). ASCII only uses the first 7 bits of an octet (byte), so the valid range of values is 0 to 127. UTF-16 uses identical values for this range (they're just wider). This can be confirmed with this code:
Charset ascii = Charset.forName("US-ASCII");
byte[] buffer = new byte[1];
char[] cbuf = new char[1];
for (int i = 0; i <= 127; i++) {
buffer[0] = (byte) i;
cbuf[0] = (char) i;
String decoded = new String(buffer, ascii);
String utf16String = new String(cbuf);
if (!utf16String.equals(decoded)) {
throw new IllegalStateException();
}
System.out.print(utf16String);
}
System.out.println("\nOK");
Therefore, you can convert UTF-16 to ASCII by casting a char to a byte.
You can read more about Java character encoding here.
Just to optimize on the accepted answer and not pay any penalty if the string is already all ascii characters, here is the optimized version. Thanks #stephen-c
public static String toAscii(String input) {
final int length = input.length();
int ignoredChars = 0;
byte[] ascii = null;
for (int i = 0; i < length; i++) {
char ch = input.charAt(i);
if (ch > 0xFF) {
//-- ignore this non-ascii character
ignoredChars++;
if (ascii == null) {
//-- first non-ascii character. Create a new ascii array with all ascii characters
ascii = new byte[input.length() - 1]; //-- we know, the length will be at less by at least 1
for (int j = 0; j < i-1; j++) {
ascii[j] = (byte) input.charAt(j);
}
}
} else if (ascii != null) {
ascii[i - ignoredChars] = (byte) ch;
}
}
//-- (ignoredChars == 0) is the same as (ascii == null) i.e. no non-ascii characters found
return ignoredChars == 0 ? input : new String(Arrays.copyOf(ascii, length - ignoredChars));
}

Categories

Resources