I have a string \u0986\u09AE\u09BF \u0995\u09BF\u0982\u09AC\u09A6\u09A8\u09CD\u09A4\u09BF\u09B0 \u0995\u09A5\u09BE \u09AC\u09B2\u099B\u09BF.
I need to convert it to Avwg wKsewš—i K_v ejwQ, which is in ANSI format. How can I convert this Unicode to ANSI characters in Java?
Edit:
resultView.setTypeface(typeFace);
String str=new String("\u0986\u09AE\u09BF \u0995\u09BF\u0982\u09AC\u09A6\u09A8\u09CD\u09A4\u09BF\u09B0 \u0995\u09A5\u09BE \u09AC\u09B2\u099B\u09BF");
resultView.setText(str);
I need to convert it to AvwgwKsewš—i K_v ejwQ, which is in ANSI format.
That's not ANSI format. The (misleadingly-named) "ANSI" code pages in Windows are all based around ASCII, with different characters added in the high bytes. Byte 0x41 (A) as a leading letter in an ANSI code page always means Latin A and not Bengali আ.
What I think you have is a custom symbol font that maps arbitrary symbols to completely unrelated codepoints. Every such font has its own visual encoding; to convert between Unicode and the custom visual encoding you'd have to build up your own translation table by looking at the glyphs for each character and matching them to the Unicode character that represents the same letter.
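For what it's worth, here is a rough sketch of what such a hand-built table might look like (the mappings below are placeholders, not the real glyph assignments of any particular font; you would have to fill them in by inspecting the font yourself):
import java.util.HashMap;
import java.util.Map;

public class LegacyFontMapper {

    // Hypothetical examples only: map each Unicode Bengali character to the
    // codepoint(s) the legacy font uses for the same visual glyph.
    private static final Map<Character, String> UNICODE_TO_LEGACY = new HashMap<>();
    static {
        UNICODE_TO_LEGACY.put('\u0986', "Av"); // placeholder mapping
        UNICODE_TO_LEGACY.put('\u09AE', "g");  // placeholder mapping
    }

    public static String toLegacy(String unicodeText) {
        StringBuilder sb = new StringBuilder();
        for (char c : unicodeText.toCharArray()) {
            String mapped = UNICODE_TO_LEGACY.get(c);
            // Pass characters through unchanged if we have no mapping for them.
            sb.append(mapped != null ? mapped : c);
        }
        return sb.toString();
    }
}
Note that legacy visual encodings usually also store some vowel signs in visual order (for example ি is placed before the consonant it logically follows), so a plain per-character map is normally not enough; you would also need reordering rules.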
I would strongly advise getting a proper Unicode-aware font that supports Bengali instead. Content stuck in an arbitrary font-specific encoding is difficult to deal with (because semantically you really are dealing with a string that means "AvwgwKsewš—i K_v ejwQ", with all the editing and case-changing gotchas that implies).
Visual-encoded fonts are an unhappy relic of the time before Windows had good Unicode (or even ISCII) support. They should not be used for anything today.
I'm not sure exactly what you're asking, but I'll assume you're asking how to convert some characters from Unicode into an 8-bit character set (e.g. ISO-8859-1 is the character set for 'Western European' languages, like English).
I don't know of any way to automatically detect the relevant 8-bit charset, so I looked up one of your characters (at http://unicode.org/charts/ ), and I can see that these characters are Bengali.
I think the equivalent 8-bit character set for Bengali is known as x-iscii-be.
I don't have this installed on my system, so I couldn't do the conversion successfully.
EDIT: Java does not support the charset x-iscii-be, but I'll leave the remainder of this answer for illustration purposes. See http://download.oracle.com/javase/7/docs/technotes/guides/intl/encoding.doc.html for a list of supported Charsets.
EDIT2: Android certainly doesn't guarantee support for this charset (the only 8-bit character set it guarantees is ISO-8859-1). See: http://developer.android.com/reference/java/nio/charset/Charset.html .
So, I think you should run some Charset-detecting code on a Bengali Android device - perhaps it supports this charset. Everything you need is in my code sample.
In order to convert your data to a different charset, all you need to do in Java is check that the desired Charset is installed, and then specify that Charset when you convert the String into bytes.
The conversion itself would be extremely simple:
str.getBytes("x-iscii-be");
So, you see, the String itself is stored in a 'normalised' form (internally Java uses UTF-16), and you can treat getBytes(charsetName) as a kind of 'alternative output format' for the String.
In your situation, perhaps you just need to assign a Charset to the resultView, and the framework will work its magic for you ...
Here's some test code I put together to illustrate the point, and to check whether a given charset is supported on a system.
I have got this code to output the byte-arrays as 'hex' strings, so that you can see that the data is different after conversion.
import java.io.UnsupportedEncodingException;
import java.math.BigInteger;
import java.nio.charset.Charset;
import java.util.Map.Entry;
import java.util.SortedMap;

public class UnicodeTest {

    public static void main(String[] args) throws UnsupportedEncodingException {
        testWestern();
        testBengali();
    }

    public static void testWestern() throws UnsupportedEncodingException {
        String unicodeStr = "\u00c2"; // This is a capital A with a circumflex accent.
        String charsetName = "ISO-8859-1";
        System.out.println("Input (outputted as default charset - normally unicode): " + unicodeStr);
        attempt8bitCharsetConversion(unicodeStr, charsetName);
    }

    public static void testBengali() throws UnsupportedEncodingException {
        String unicodeStr = "\u0986\u09AE\u09BF \u0995\u09BF\u0982\u09AC\u09A6\u09A8\u09CD\u09A4\u09BF\u09B0 \u0995\u09A5\u09BE \u09AC\u09B2\u099B\u09BF";
        String charsetName = "x-iscii-be";
        System.out.println(unicodeStr);
        attempt8bitCharsetConversion(unicodeStr, charsetName);
    }

    public static void attempt8bitCharsetConversion(String input, String charsetName) throws UnsupportedEncodingException {
        SortedMap<String, Charset> availableCharsets = Charset.availableCharsets();
        for (Entry<String, Charset> entry : availableCharsets.entrySet()) {
            if (charsetName.equalsIgnoreCase(entry.getKey())) {
                System.out.println("HEXED input : " + toHex(input.getBytes(Charset.defaultCharset().name())));
                System.out.println("HEXED output: " + toHex(input.getBytes(entry.getKey())));
                return; // charset found and conversion shown - don't fall through to the exception below
            }
        }
        throw new UnsupportedEncodingException(charsetName + " is not supported on this system");
    }

    public static String toHex(byte[] input) {
        // Treat the bytes as an unsigned magnitude so the hex dump is never printed as negative.
        return String.format("%x", new BigInteger(1, input));
    }
}
See also here for more information on charset conversion: http://download.oracle.com/javase/tutorial/i18n/text/string.html
Character sets are a tricky business, so please forgive my convoluted answer.
HTH
I've written a class which can solve the problem of 09CB ো, 09CC ৌ, 09C7 ে, 09C8 ৈ, 09BF ি, ্য, ্র and ৃ in UTF-8. I reshape the text by editing the font glyphs, so you don't need to convert it to extended ASCII, but I still couldn't solve your Bengali conjuncts. For proper rendering it requires Android 3.5 or higher; it will work smoothly on Android 4.0 (Ice Cream Sandwich).
Related
I read a byte[] from a file and convert it to a String:
byte[] bytesFromFile = Files.readAllBytes(...);
String stringFromFile = new String(bytesFromFile, "UTF-8");
I want to compare this to another byte[] I get from a web service:
String stringFromWebService = webService.getMyByteString();
byte[] bytesFromWebService = stringFromWebService.getBytes("UTF-8");
So I read a byte[] from a file and convert it to a String and I get a String from my web service and convert it to a byte[]. Then I do the following tests:
// works!
org.junit.Assert.assertEquals(stringFromFile, stringFromWebService);
// fails!
org.junit.Assert.assertArrayEquals(bytesFromFile, bytesFromWebService);
Why does the second assertion fail?
Other answers have covered the likely fact that the file is not UTF-8 encoded giving rise to the symptoms described.
However, I think the most interesting aspect of this is not that the byte[] assert fails, but that the assert that the string values are the same passes. I'm not 100% sure why this is, but I think the following trawl through the source code might give us the answer:
Looking at how new String(bytesFromFile, "UTF-8") works, we see that the constructor calls through to StringCoding.decode().
This in turn, if supplied with the UTF-8 character set, calls through to StringDecoder.decode().
This calls through to CharsetDecoder.decode() which decides what to do if the character is unmappable (which I guess will be the case if a non-UTF-8 character is presented)
In this case it uses an action defined by
private CodingErrorAction unmappableCharacterAction
= CodingErrorAction.REPORT;
Which means that it still reports the character it has decoded, even though it's technically unmappable.
I think this means that even when the code gets an unmappable character, it substitutes its best guess - so I'm guessing that its best guess is correct and hence the String representations are the same under comparison, but the byte arrays are no longer the same.
This hypothesis is kind of supported by the fact that the catch block for CharacterCodingException in StringCoding.decode() says:
} catch (CharacterCodingException x) {
// Substitution is always enabled,
// so this shouldn't happen
I don't understand it fully, but here's what I've got so far:
The problem is that the data contains some bytes which are not valid UTF-8, as I found out with the following check:
// returns false for my data!
public static boolean isValidUTF8(byte[] input) {
    CharsetDecoder cs = Charset.forName("UTF-8").newDecoder();
    try {
        cs.decode(ByteBuffer.wrap(input));
        return true;
    } catch (CharacterCodingException e) {
        return false;
    }
}
When I change the encoding to ISO-8859-1 everything works fine. The strange thing (which I don't understand yet) is why my conversion (new String(bytesFromFile, "UTF-8")) doesn't throw any exception (as my isValidUTF8 method does), although the data is not valid UTF-8.
However, I think I will go another way and encode my byte[] as a Base64 string, as I don't want any more trouble with encodings.
The real problem in your code is that you don't know what the real file encoding is.
When you read the string from the web service you get a sequence of chars; when you convert the string from chars to bytes, the conversion is done correctly because you specify how to transform chars into bytes with a specific encoding ("UTF-8"). When you read a text file you face a different problem: you have a sequence of bytes that needs to be converted to chars. In order to do it properly you must know how the chars were converted to bytes, i.e. what the file encoding is. For files it is (unless specified otherwise) the platform default: on Windows files are typically encoded in windows-1252 (which is very close to ISO-8859-1); on Linux/Unix it depends, but I think UTF-8 is the default.
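For illustration, a minimal sketch of reading a file with an explicit charset instead of relying on the platform default (the file name and charset here are just examples; use whatever the file was actually written with):
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ReadWithExplicitCharset {
    public static void main(String[] args) throws IOException {
        byte[] bytesFromFile = Files.readAllBytes(Paths.get("data.txt")); // example path
        // Decode with the encoding the file was actually written in, not the platform default;
        // ISO-8859-1 is used here only as an example.
        String stringFromFile = new String(bytesFromFile, StandardCharsets.ISO_8859_1);
        System.out.println(stringFromFile);
    }
}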
By the way, the web service call performed a decoding operation under the hood as well; the HTTP call uses a header that defines how chars are encoded, i.e. how to read the bytes from the socket and transform them into chars. So calling a SOAP web service gives you back XML (which can be unmarshalled into a Java object) with all the encoding operations done properly.
So if you must read chars from a file you have to face the encoding issue; you can use Base64 as you stated, but you lose one of the main benefits of text files: they are human readable, which eases debugging and development.
My code (Qwe.java):
public class Qwe {
    public static void main(String[] args) {
        System.out.println("тест привет");
    }
}
where тест привет is Russian text. Qwe.java is saved in UTF-8.
On my machine (Ubuntu 14.04) the result is
тест привет
On the server (Ubuntu 12.04) I get:
???? ??????
$java Qwe > test.txt
and in test.txt I see
???? ??????
I fixed it by using export JAVA_TOOL_OPTIONS=-Dfile.encoding=UTF8
The Java source text must use the same encoding that the javac compiler expects. That seems to have been the case here, and UTF-8 is of course ideal.
The file Qwe.class is then fine, internally using Unicode for Strings. The output to the console, however, uses the server's platform encoding. That is, Java converts the Unicode text to bytes using (probably) the default platform encoding, and that encoding cannot handle Cyrillic.
So you have to write to a file, never using FileWriter (a convenience class that always uses the default platform encoding), but using:
... new OutputStreamWriter(new FileOutputStream(file), "UTF-8")
You can also set the user locale on the server, but that is a different matter.
In general I would switch to a file logger.
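For example, a minimal sketch along those lines (the file name is just an example):
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;

public class Utf8FileLogger {
    public static void main(String[] args) throws IOException {
        // Write the text with an explicit UTF-8 encoder instead of relying
        // on the platform default encoding of the server.
        try (Writer out = new OutputStreamWriter(new FileOutputStream("out.txt"), "UTF-8")) {
            out.write("тест привет");
            out.write(System.lineSeparator());
        }
    }
}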
I am not sure, but it might only accept ASCII characters from the English language unless you have some extension installed. Like I said, my best guess is that it is not finding the characters and is just outputting garbage in their place.
"In Java, any unknown character which is passed through the write() methods of an OutputStream gets printed as a plain question mark '?'"
as taken from here
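If the goal is only to stop the '?' substitution on standard output, one option (a sketch of mine, assuming the receiving terminal or file consumer expects UTF-8) is to wrap System.out in a PrintStream with an explicit encoding:
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8Console {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // Wrap stdout in a PrintStream that encodes as UTF-8 instead of the platform default.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println("тест привет");
    }
}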
I have a requirement to convert a String (which is usually in the ASCII character set) to the UCS-2 character set, which then needs to be converted to Base64.
I could find the code for the Base64 conversion, but I am facing an issue with encoding to UCS-2.
Any help with converting a string to the UCS-2 character set in Java would be greatly appreciated.
Thank you,
When you read your data into a String variable, the internal representation will already be Unicode, but when you do mystring.getBytes() the returned bytes will be the String encoded with the default encoding of the current platform.
If you want to get UTF-16 (which is basically the same as UCS-2 (a.k.a ISO 10646), see here) use
mystring.getBytes("UTF-16").
I initially started with getBytes("UTF-16") as mentioned by @piet.t, but there are a few caveats to consider when dealing with UCS-2: it encodes each character as exactly two bytes (see the complete code chart) and doesn't use any BOM. getBytes("UTF-16") adds a 2-byte BOM (0xFEFF) that has to be removed when encoding and added back when decoding.
I also noted that the last byte should be discarded during decoding (but I'm encoding mostly ASCII; it might be wrong to do that with other character codes).
EDIT: After @jtahlborn's hint about using UTF-16BE, I ended up using UTF-16LE (which does not produce any BOM, not even the extra 0 that UTF-16BE was giving) with the following two encode/decode methods, which work well in my use cases (adding XP TIFF tags):
public static byte[] encodeUCS2(String s) {
    try {
        // UTF-16LE: two bytes per character, no byte-order mark.
        return s.getBytes("UTF-16LE");
    } catch (UnsupportedEncodingException e) {
        // UTF-16LE is a standard charset every JVM must support, so this should never happen.
        return new byte[]{};
    }
}

public static String decodeUCS2(byte[] bytes) {
    try {
        return new String(bytes, "UTF-16LE");
    } catch (UnsupportedEncodingException e) {
        return null;
    }
}
Note that there is not much need for specific encode/decode methods in that case, as they are just no-exception-thrown replacements for getBytes()/new String().
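Putting the two halves of the original question together, here is a minimal sketch of UCS-2 (UTF-16LE) plus Base64, assuming Java 8's java.util.Base64 is available:
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Ucs2Base64 {
    public static void main(String[] args) {
        String input = "HELLO";

        // Encode as two bytes per character (UTF-16LE, no BOM), then Base64.
        byte[] ucs2 = input.getBytes(StandardCharsets.UTF_16LE);
        String base64 = Base64.getEncoder().encodeToString(ucs2);
        System.out.println(base64);

        // And back again.
        byte[] decoded = Base64.getDecoder().decode(base64);
        System.out.println(new String(decoded, StandardCharsets.UTF_16LE));
    }
}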
I'm having a strange issue trying to write strings which contain characters like "ñ", "á" and so on into text files. Let me first show you my little piece of code:
import java.io.*;

public class test {
    public static void main(String[] args) throws Exception {
        String content = "whatever";
        int c;
        c = System.in.read();
        content = content + (char) c;
        FileWriter fw = new FileWriter("filename.txt");
        BufferedWriter bw = new BufferedWriter(fw);
        bw.write(content);
        bw.close();
    }
}
In this example, I'm just reading a char from the keyboard input and appending it to a given string, then writing the final string into a txt file. The problem is that if I type an "ñ", for example (I have a Spanish layout keyboard), when I check the txt file it shows a strange char "¤" where there should be an "ñ"; that is, the content of the file is "whatever¤". The same happens with "ç", "ú", etc. However, it writes it fine ("whateverñ") if I just forget about the keyboard input and write:
...
String content = "whateverñ";
...
or
...
content = content + "ñ";
...
It makes me think that there might be something wrong with the read() method, or maybe I'm using it wrongly. Should I use a different method to get the keyboard input? I'm a bit lost here.
(I'm using JDK 7u45 on Windows 7 Pro x64.)
So ...
It works (i.e. you can read the accented characters on the output file) if you write them as literal strings.
It doesn't work when you read them from System.in and then write them.
This suggests that the problem is on the input side. Specifically, I think your console / keyboard must be using a character encoding for the input stream that does not match the encoding that Java thinks should be used.
You should be able to confirm this tentative diagnosis by outputting the characters you are reading in hexadecimal, and then checking the codes against the Unicode tables (which you can find at unicode.org, for example).
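Something along these lines (a rough sketch of mine, not part of the original code) would show the raw bytes System.in actually delivers:
import java.io.IOException;

public class InputHexDump {
    public static void main(String[] args) throws IOException {
        // Type a character such as 'ñ' and press Enter; the raw byte values
        // reveal which encoding the console is using. End with Ctrl+Z / Ctrl+D.
        int b;
        while ((b = System.in.read()) != -1) {
            System.out.printf("0x%02X ", b);
        }
    }
}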
It strikes me as "odd" that the "platform default encoding" appears to be working on the output side, but not the input side. Maybe someone else can explain ... and offer a concrete suggestion for fixing it. My gut feeling is that the problem is in the way your keyboard is configured, not in Java or your application.
Files do not remember their encoding format; when you look at a .txt file, the text editor makes a "best guess" at the encoding used.
If you try to read the file back into your program, the text should be back to normal.
Also, try printing the "strange" character directly.
public static void main(String[] args) throws IOException {
    String str1 = "ΔΞ123456";
    System.out.println(str1 + "-" + str1.matches("^\\p{InGreek}{2}\\d{6}")); // ΔΞ123456-true
    BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
    String str2 = br.readLine(); // ΔΞ123456, same as str1.
    System.out.println(str2 + "-" + str2.matches("^\\p{InGreek}{2}\\d{6}")); // Δ�123456-false
    System.out.println(str1.equals(str2)); // false
}
The same String doesn't match the regex when read from the keyboard.
What causes this problem, and how can we solve this?
Thanks in advance.
EDIT: I used System.console() for input and output.
public static void main(String[] args) throws IOException {
    PrintWriter pr = System.console().writer();
    String str1 = "ΔΞ123456";
    pr.println(str1 + "-" + str1.matches("^\\p{InGreek}{2}\\d{6}") + "-" + str1.length());
    String str2 = System.console().readLine();
    pr.println(str2 + "-" + str2.matches("^\\p{InGreek}{2}\\d{6}") + "-" + str2.length());
    pr.println("str1.equals(str2)=" + str1.equals(str2));
}
Output:
ΔΞ123456-true-8
ΔΞ123456
ΔΞ123456-true-8
str1.equals(str2)=true
There are multiple places where transcoding errors can take place here.
Ensure that your class is being compiled correctly (unlikely to be an issue in an IDE):
Ensure that the compiler is using the same encoding as your editor (i.e. if you save as UTF-8, set your compiler to use that encoding)
Or switch to escaping to the ASCII subset that most encodings are a superset of (i.e. change the string literal to "\u0394\u039e123456")
Ensure you are reading input using the correct encoding:
Use the Console to read input - this class will detect the console encoding
Or configure your Reader to use the correct encoding (probably windows-1253) or set the console to Java's default encoding
Note that System.console() returns null in an IDE, but there are things you can do about that.
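Tying the last two points together, here is a hedged sketch of the input side; Console handles the encoding for you when it is available, and the windows-1253 fallback is only a guess for a Greek Windows setup:
import java.io.BufferedReader;
import java.io.Console;
import java.io.IOException;
import java.io.InputStreamReader;

public class GreekInput {
    public static void main(String[] args) throws IOException {
        String line;
        Console console = System.console();
        if (console != null) {
            // The Console already knows the real console encoding.
            line = console.readLine();
        } else {
            // Fallback (e.g. inside an IDE): pick the encoding explicitly.
            // "windows-1253" is only a guess for a Greek Windows configuration.
            BufferedReader br = new BufferedReader(
                    new InputStreamReader(System.in, "windows-1253"));
            line = br.readLine();
        }
        if (line != null) {
            System.out.println(line + "-" + line.matches("^\\p{InGreek}{2}\\d{6}"));
        }
    }
}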
If you use Windows, it may be caused by the fact that console character encoding ("OEM code page") is not the same as a system encoding ("ANSI code page").
InputStreamReader without explicit encoding parameter assumes input data to be in the system default encoding, therefore characters read from the console are decoded incorrectly.
In order to correctly read non-us-ascii characters in Windows console you need to specify console encoding explicitly when constructing InputStreamReader (required codepage number can be found by executing mode con cp in the command line):
BufferedReader br = new BufferedReader(
new InputStreamReader(System.in, "CP737"));
The same problem applies to the output, you need to construct PrintWriter with proper encoding:
PrintWriter out = new PrintWriter(new OutputStreamWriter(System.out, "CP737"));
Note that since Java 1.6 you can avoid these workarounds by using Console object obtained from System.console(). It provides Reader and Writer with correctly configured encoding as well as some utility methods.
However, System.console() returns null when streams are redirected (for example, when running from IDE). A workaround for this problem can be found in McDowell's answer.
See also:
Code page
I get true in both cases with nothing changed in your code. (I tested with a Greek layout keyboard; I'm from Greece :])
Probably your keyboard input is being sent as ISO-8859-7 and not UTF-8. Mine sends UTF-8.
EDIT: I still get true with the addition of the equals check:
System.out.println(str1.equals(str2));
Check if you can get it working by changing everything to Greek in the regional options (if you are using Windows):
Rundll32 Shell32.dll,Control_RunDLL Intl.cpl,,0
If this is the case then you can act accordingly, as 'axtavt' said.
The keyboard is likely not sending the characters as UTF-8, but as the operating system's default character encoding.
See also
Java : How to determine the correct charset encoding of a stream
Java App : Unable to read iso-8859-1 encoded file correctly