UTF-16 to ASCII conversion in Java

Having ignored it all this time, I am currently forcing myself to learn more about Unicode in Java. There is an exercise I need to do on converting a UTF-16 string to 8-bit ASCII. Can someone please enlighten me how to do this in Java? I understand that you can't represent all possible Unicode values in ASCII, so in this case I want a code that exceeds 0xFF to be merely added anyway (bad data should also just be added silently).
Thanks!

You can use java.nio for an easy solution:
// first encode the utf-16 string as a ByteBuffer
ByteBuffer bb = Charset.forName("utf-16").encode(CharBuffer.wrap(utf16str));
// then decode those bytes as US-ASCII
CharBuffer ascii = Charset.forName("US-ASCII").decode(bb);
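Note, though, what this actually produces: the UTF-16 encoder emits a byte-order mark, plus a zero high byte for every ASCII character, and the US-ASCII decoder turns each byte it can't map into the replacement character. A quick self-check (a sketch; the string literal is just an example):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

public class Utf16AsciiDemo {
    public static void main(String[] args) {
        String utf16str = "hello";
        ByteBuffer bb = Charset.forName("UTF-16").encode(CharBuffer.wrap(utf16str));
        CharBuffer ascii = Charset.forName("US-ASCII").decode(bb);
        // 2 BOM bytes + 2 bytes per character = 12 decoded chars,
        // half of them NULs and the BOM bytes turned into U+FFFD
        System.out.println(ascii.length()); // 12
    }
}
```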

How about this:
String input = ... // my UTF-16 string
StringBuilder sb = new StringBuilder(input.length());
for (int i = 0; i < input.length(); i++) {
    char ch = input.charAt(i);
    if (ch <= 0xFF) {
        sb.append(ch);
    }
}
byte[] ascii = sb.toString().getBytes("ISO-8859-1"); // aka LATIN-1
This is probably not the most efficient way to do this conversion for large strings since we copy the characters twice. However, it has the advantage of being straightforward.
BTW, strictly speaking there is no such character set as 8-bit ASCII. ASCII is a 7-bit character set. LATIN-1 is the nearest thing there is to an "8-bit ASCII" character set (and block 0 of Unicode is equivalent to LATIN-1) so I'll assume that's what you mean.
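That identity is easy to verify: decoding each byte value 0x00-0xFF as ISO-8859-1 yields exactly the Unicode code point with the same number. A small self-check (not part of the exercise):

```java
import java.nio.charset.StandardCharsets;

public class Latin1Check {
    public static void main(String[] args) {
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        String s = new String(all, StandardCharsets.ISO_8859_1);
        for (int i = 0; i < 256; i++) {
            if (s.charAt(i) != (char) i) {
                throw new IllegalStateException("mismatch at " + i);
            }
        }
        System.out.println("LATIN-1 == Unicode block 0");
    }
}
```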
EDIT: in the light of the update to the question, the solution is even simpler:
String input = ... // my UTF-16 string
byte[] ascii = new byte[input.length()];
for (int i = 0; i < input.length(); i++) {
    ascii[i] = (byte) input.charAt(i);
}
This solution is more efficient. Since we now know how many bytes to expect, we can preallocate the byte array and copy the (truncated) characters without using a StringBuilder as an intermediate buffer.
However, I'm not convinced that dealing with bad data in this way is sensible.
EDIT 2: there is one more obscure "gotcha" with this. Unicode actually defines code points (characters) to be "roughly 21 bit" values ... 0x000000 to 0x10FFFF ... and uses surrogates to represent code points > 0x00FFFF. In other words, a Unicode code point > 0x00FFFF is actually represented in UTF-16 as two "characters". Neither my answer nor any of the others takes account of this (admittedly esoteric) point. In fact, dealing with code points > 0x00FFFF in Java is rather tricky in general. This stems from the fact that 'char' is a 16-bit type and String is defined in terms of 'char'.
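For completeness, a surrogate-aware variant can iterate by code point instead of by char, so that an astral character contributes one (unmappable) byte rather than two. This is only a sketch (the class and method names are made up, and '?' is an arbitrary substitute):

```java
import java.io.ByteArrayOutputStream;

public class CodePointAscii {
    static byte[] toAscii(String input) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            out.write(cp <= 0x7F ? cp : '?'); // one byte per code point
            i += Character.charCount(cp);     // skips both halves of a surrogate pair
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] b = toAscii("A\uD83D\uDE00B"); // 'A', an emoji (a surrogate pair), 'B'
        System.out.println(b.length); // 3 bytes, not 4
    }
}
```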
EDIT 3: maybe a more sensible solution for dealing with unexpected characters that don't convert to ASCII is to replace them with the standard replacement character:
String input = ... // my UTF-16 string
byte[] ascii = new byte[input.length()];
for (int i = 0; i < input.length(); i++) {
    char ch = input.charAt(i);
    ascii[i] = (ch <= 0xFF) ? (byte) ch : (byte) '?';
}

Java internally represents strings in UTF-16. If a String object is what you are starting with, you can encode using String.getBytes(Charset c), where you might specify US-ASCII (which can map code points 0x00-0x7f) or ISO-8859-1 (which can map code points 0x00-0xff, and may be what you mean by "8-bit ASCII").
As for adding "bad data"... ASCII and ISO-8859-1 strings simply can't represent values outside of their range. Rather than dropping characters it can't represent, getBytes substitutes the charset's default replacement byte (a '?' for these charsets).
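A quick check of the actual behavior, with US-ASCII as the example charset: the unmappable character is replaced, not dropped.

```java
import java.nio.charset.StandardCharsets;

public class GetBytesDemo {
    public static void main(String[] args) {
        byte[] b = "h\u00e9llo".getBytes(StandardCharsets.US_ASCII); // "héllo"
        // the 'é' is substituted with '?', so the length is preserved
        System.out.println(new String(b, StandardCharsets.US_ASCII)); // h?llo
    }
}
```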

Since this is an exercise, it sounds like you need to implement this manually. You can think of an encoding (e.g. UTF-16 or ASCII) as a lookup table that matches a sequence of bytes to a logical character (a codepoint).
Java uses UTF-16 strings, which means that any given codepoint can be represented in one or two char variables. Whether you want to handle the two-char surrogate pairs depends on how likely you think your application is to encounter them (see the Character class for detecting them). ASCII only uses the first 7 bits of an octet (byte), so the valid range of values is 0 to 127. UTF-16 uses identical values for this range (they're just wider). This can be confirmed with this code:
Charset ascii = Charset.forName("US-ASCII");
byte[] buffer = new byte[1];
char[] cbuf = new char[1];
for (int i = 0; i <= 127; i++) {
    buffer[0] = (byte) i;
    cbuf[0] = (char) i;
    String decoded = new String(buffer, ascii);
    String utf16String = new String(cbuf);
    if (!utf16String.equals(decoded)) {
        throw new IllegalStateException();
    }
    System.out.print(utf16String);
}
System.out.println("\nOK");
Therefore, you can convert UTF-16 to ASCII by casting a char to a byte.
You can read more about Java character encoding here.

Just to optimize on the accepted answer and not pay any penalty if the string is already all ASCII characters, here is an optimized version. Thanks @stephen-c
public static String toAscii(String input) {
    final int length = input.length();
    int ignoredChars = 0;
    byte[] ascii = null;
    for (int i = 0; i < length; i++) {
        char ch = input.charAt(i);
        if (ch > 0xFF) {
            //-- ignore this non-ascii character
            ignoredChars++;
            if (ascii == null) {
                //-- first non-ascii character: copy all ascii characters seen so far
                ascii = new byte[length - 1]; //-- we know the result is shorter by at least 1
                for (int j = 0; j < i; j++) {
                    ascii[j] = (byte) input.charAt(j);
                }
            }
        } else if (ascii != null) {
            ascii[i - ignoredChars] = (byte) ch;
        }
    }
    //-- (ignoredChars == 0) is the same as (ascii == null), i.e. no non-ascii characters found
    return ignoredChars == 0
            ? input
            : new String(Arrays.copyOf(ascii, length - ignoredChars), StandardCharsets.ISO_8859_1);
}
(Uses java.util.Arrays and java.nio.charset.StandardCharsets; specifying the charset avoids the platform default.)

Related

How to convert EBCDIC hex to binary

//PROBLEM SOLVED
I wrote a program to convert an EBCDIC string to hex, but I have a problem with some characters.
I read the string into bytes, then change them to hex (two hex digits per byte).
The problem is that Java converts the character with decimal value 136 (according to https://shop.alterlinks.com/ascii-table/ascii-ebcdic-us.php) to decimal 63 ('?'),
which is problematic and wrong, because it is the only incorrectly converted character.
EBCDIC 88 in UltraEdit
//edit - added code
int[] bytes = toByte(0, 8);
String bitmap = hexToBin(toHex(bytes));

// ebytes[] - the EBCDIC string, as bytes
public int[] toByte(int from, int to) {
    int[] newBytes = new int[to - from + 1];
    int k = 0;
    for (int i = from; i <= to; i++) {
        newBytes[k] = ebytes[i] & 0xff;
        k++;
    }
    return newBytes;
}

public String toHex(int[] hex) {
    StringBuilder sb = new StringBuilder();
    for (int b : hex) {
        if (Integer.toHexString(b).length() == 1) {
            sb.append("0").append(Integer.toHexString(b));
        } else {
            sb.append(Integer.toHexString(b));
        }
    }
    return sb.toString();
}

public String hexToBin(String hex) {
    String toReturn = new BigInteger(hex, 16).toString(2);
    return String.format("%" + (hex.length() * 4) + "s", toReturn).replace(' ', '0');
}
//edit2
Changing the encoding in Eclipse to ISO-8859-1 helped, but I lose some characters while reading the text from a file.
//edit3
Problem solved by changing the way the file is read.
Now I read it byte by byte and parse each byte to a char.
Before, it was read line by line.
There is no ASCII value of 136, since ASCII is only 7-bit - everything beyond 127 is some custom extended codepage (the linked table seems to use some sort of Windows codepage, e.g. Cp1252). Since it is printing a ?, it seems you are using a codepage that doesn't have a character assigned to the value 136 - e.g. some flavour of ISO-8859.
Solution:
Change the encoding in Eclipse to a more suitable one (in my case ISO-8859-1).
Change the way the file is read: byte by byte, parsing each byte to a char, instead of line by line - which is how some characters were lost.
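Reading the file as raw bytes and decoding with an explicit EBCDIC charset can be sketched like this (the file name and the code page Cp1047 are assumptions - substitute the code page your mainframe actually uses, e.g. IBM037):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class EbcdicRead {
    public static void main(String[] args) throws IOException {
        // read raw bytes so no line-oriented text decoding can corrupt them
        byte[] raw = Files.readAllBytes(Paths.get("data.ebcdic"));
        // decode explicitly with an EBCDIC code page (Cp1047 assumed here)
        String text = new String(raw, "Cp1047");
        System.out.println(text);
    }
}
```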

Arabic unicode or ASCII code in java

I want to find the character codes needed to program Android with support for the Arabic locale. Many Arabic characters differ from English ones: some letters are joined and some are written separately depending on position.
How can I find the code for each letter?
Unicode is a numbering of all characters. The numbering needs up to 21 bits (three bytes). A Unicode character is conventionally written as U+XXXX, where XXXX is the number in hexadecimal (base 16) notation. A Unicode character is called a code point; in Java it is held in an int.
A Java char is 2 bytes (UTF-16), so it cannot represent the higher Unicode code points on its own; for those, a surrogate pair of two chars is used.
The Java class Character deals with these conversions.
char lowUnicode = '\u0627'; // Alef, fitting in a char
int cp = (int) lowUnicode;
One can iterate through the code points of a String as follows:
String s = "...";
for (int i = 0; i < s.length(); ) {
    int codePoint = s.codePointAt(i);
    ...
    i += Character.charCount(codePoint);
}
Or in Java 8:
s.codePoints().forEach(
        codePoint -> System.out.println(codePoint));
Dumping Arabic between U+600 and U+8FF:
The code below dumps Unicode in the main Arabic range.
for (int codePoint = 0x600; codePoint < 0x900; ++codePoint) {
    if (Character.isAlphabetic(codePoint)
            && UnicodeScript.of(codePoint) == UnicodeScript.ARABIC) {
        System.out.printf("\u200E\\%04X \u200F%s\u200E %s%n",
                codePoint,
                new String(Character.toChars(codePoint)),
                Character.getName(codePoint));
    }
}
Under Windows/Linux/... there exist char map tools to display Unicode.
Above, U+200E is the Left-To-Right mark and U+200F the Right-To-Left mark.
If you want the Unicode code of a single character, the code below will do that:
char character = 'ع';
int code = (int) character;

encode the given byte array such that the output has only small alphabets

I am new to encryption and encoding/decoding. I want to encrypt and encode a given string so that the result contains only lowercase letters (no capital letters, digits, or special characters). Base64 is used for encoding. Is it possible to achieve this with Base64 and get result strings consisting only of lowercase characters? If not, which encoding method could give such results? Thanks in advance
public byte[] encode(byte[] data) {
    ByteArrayOutputStream output = new ByteArrayOutputStream();
    for (byte b : data) {
        output.write((b & 0x0F) + 'a');
        output.write(((b >>> 4) & 0x0F) + 'a');
    }
    return output.toByteArray();
}

public byte[] decode(byte[] encodedData) {
    int l = encodedData.length / 2;
    byte[] result = new byte[l];
    for (int i = 0; i < l; i++)
        result[i] =
            (byte) (
                encodedData[i * 2] - 'a' +
                ((encodedData[i * 2 + 1] - 'a') << 4)
            );
    return result;
}
If you use only lower case letters, it wouldn't be base64 anymore.
If your alphabet consists of only 16 lowercase chars you could call it base16, and map each nibble (4 bits) to one of the chars.
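To illustrate, here is a self-contained nibble-to-letter codec using the assumed alphabet 'a'-'p' (the class and method names are made up for the example):

```java
import java.nio.charset.StandardCharsets;

public class NibbleCodec {
    // each byte becomes two letters: high nibble first, then low nibble
    static String encode(byte[] data) {
        StringBuilder sb = new StringBuilder(data.length * 2);
        for (byte b : data) {
            sb.append((char) ('a' + ((b >>> 4) & 0x0F)));
            sb.append((char) ('a' + (b & 0x0F)));
        }
        return sb.toString();
    }

    static byte[] decode(String s) {
        byte[] out = new byte[s.length() / 2];
        for (int i = 0; i < out.length; i++) {
            int hi = s.charAt(2 * i) - 'a';
            int lo = s.charAt(2 * i + 1) - 'a';
            out[i] = (byte) ((hi << 4) | lo);
        }
        return out;
    }

    public static void main(String[] args) {
        String enc = encode("Hi!".getBytes(StandardCharsets.US_ASCII));
        System.out.println(enc); // lowercase letters only
        System.out.println(new String(decode(enc), StandardCharsets.US_ASCII)); // Hi!
    }
}
```

The output doubles the input length, which is the price of a 16-symbol alphabet.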
I would recommend you use org.jasypt.encryption.StringEncryptor#encrypt().
The resulting encrypted text is Base64-encoded and thus safely stored as US-ASCII chars.
Base32 uses single-case characters, therefore it should be suitable for your purpose. There is a Base32 implementation from Apache Commons that you could use, or you could just write your own.
But why do you want to restrict your output to lower case only?

How do I recognize a character such as "ç" as a letter?

I have an array of bytes that contains a sentence. I need to convert the lowercase letters in this sentence into uppercase letters. Here is the function I wrote:
public void CharUpperBuffAJava(byte[] word) {
    for (int i = 0; i < word.length; i++) {
        if (!Character.isUpperCase(word[i]) && Character.isLetter(word[i])) {
            word[i] -= 32;
        }
    }
}
It will work fine with sentences like: "a glass of water". The problem is it must work with all ANSI characters, which includes "ç,á,é,í,ó,ú" and so on. The method Character.isLetter doesn't work with these letters and, therefore, they are not converted into uppercase.
Do you know how can I identify these ANSI characters as a letter in Java?
EDIT
If someone wants to know, I did method again after the answers and now it looks like this:
public static int CharUpperBuffAJava(byte[] lpsz, int cchLength) {
    try {
        String value = new String(lpsz, 0, cchLength, "Windows-1252");
        String upperCase = value.toUpperCase();
        byte[] bytes = upperCase.getBytes("Windows-1252"); // encode with the same charset
        for (int i = 0; i < cchLength; i++) {
            lpsz[i] = bytes[i];
        }
        return cchLength;
    } catch (UnsupportedEncodingException e) {
        return 0;
    }
}
You need to "decode" the byte[] into a character string. There are several APIs to do this, but you must specify the character encoding that was used for the bytes. The overloaded versions that don't take an encoding will give different results on different machines, because they use the platform default.
For example, if you determine that the bytes were encoded with Windows-1252 (sometimes referred to as ANSI):
String s = new String(bytes, "Windows-1252");
String upper = s.toUpperCase();
Convert the byte array into a string, supporting the encoding. Then call toUpperCase(). Then, you can call getBytes() on the string if you need it as a byte array after capitalizing.
Can't you simply use:
String s = new String(bytes, "cp1252");
String upper = s.toUpperCase(someLocale);
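For instance (Windows-1252 is assumed here as the source encoding of the byte):

```java
import java.nio.charset.Charset;
import java.util.Locale;

public class CedillaDemo {
    public static void main(String[] args) {
        byte[] bytes = { (byte) 0xE7 }; // 'ç' in Windows-1252
        String s = new String(bytes, Charset.forName("windows-1252"));
        // once decoded to a proper String, toUpperCase handles accented letters
        System.out.println(s.toUpperCase(Locale.FRENCH)); // Ç
    }
}
```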
Wouldn't changing the character set do the trick before conversion? The internal conversion logic of Java might work fine. Something like http://www.exampledepot.com/egs/java.nio.charset/ConvertChar.html, but use ASCII as the target character set.
I am looking at this table:
http://slayeroffice.com/tools/ascii/
Anything from 224 up appears to be a lowercase letter, and to make it upper case you would subtract 32 from the value, just as with plain ASCII letters.

How to replace characters using Regex

I received a string from an IBM mainframe like the one below (2-byte full-width characters)
" ;A;B;C;D;E;F;G;H;I;J;K;L;M;N;O;P;Q;R;S;T;U;V;W;X;Y;Z;a;b;c;d;e;f;g;h;i;j;k;l;m;n;o;p;q;r;s;t;u;v;w;x;y;z;0;1;2;3;4;5;6;7;8;9;`;-;=;₩;~;!;@;#;$;%;^;&;*;(;);_;+;|;[;];{;};:;";';,;.;/;<;>;?;";
I want to change these characters to 1-byte ASCII codes.
How can I replace them using java.util.regex.Matcher or String.replaceAll() in Java?
Target characters:
;A;B;C;D;E;F;G;H;I;J;K;L;M;N;O;P;Q;R;S;T;U;V;W;X;Y;Z;a;b;c;d;e;f;g;h;i;j;k;l;m;n;o;p;q;r;s;t;u;v;w;x;y;z;0;1;2;3;4;5;6;7;8;9;`;-;=;\;~;!;#;#;$;%;^;&;*;(;);_;+;|;[;];{;};:;";';,;.;/;<;>;?;";
This is not (as other responders are saying) a character-encoding issue, but regexes are still the wrong tool. If Java had an equivalent of Perl's tr/// operator, that would be the right tool, but you can hand-code it easily enough:
public static String convert(String oldString)
{
    // oldChars holds the full-width (2-byte) characters; newChars the ASCII equivalents
    String oldChars = "　ＡＢＣＤＥＦＧＨＩＪＫＬＭＮＯＰＱＲＳＴＵＶＷＸＹＺａｂｃｄｅｆｇｈｉｊｋｌｍｎｏｐｑｒｓｔｕｖｗｘｙｚ０１２３４５６７８９｀－＝₩～！＠＃＄％＾＆＊（）＿＋｜［］｛｝：＂＇，．／＜＞？";
    String newChars = " ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789`-=\\~!@#$%^&*()_+|[]{}:\"',./<>?";
    StringBuilder sb = new StringBuilder();
    int len = oldString.length();
    for (int i = 0; i < len; i++)
    {
        char ch = oldString.charAt(i);
        int pos = oldChars.indexOf(ch);
        sb.append(pos < 0 ? ch : newChars.charAt(pos));
    }
    return sb.toString();
}
I'm assuming each character in the first string corresponds to the character at the same position in the second string, and that the first character (U+3000, 'IDEOGRAPHIC SPACE') should be converted to an ASCII space (U+0020).
Be sure to save the source file as UTF-8, and include the -encoding UTF-8 option when you compile it (or tell your IDE to do so).
Don't think this one's about regex; it's about encoding. It should be possible to read the data into a String with the 2-byte encoding and then write it with any other encoding.
Look here for supported encodings.
