Java convert String UTF-8 to UTF-16

I am trying to convert a String (String a = "try") to UTF-16.
I did this:
try {
    String ulany = new String("357810087745445");
    System.out.println(ulany.getBytes().length);
    String string = new String(ulany.getBytes(), "UTF-16");
    System.out.println(string.getBytes().length);
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}
And ulany.getBytes().length is 15,
and string.getBytes().length is 24, but I think it should be 30. What did I do wrong?

String (and char) already hold Unicode, so for the text itself nothing is needed.
However if you want bytes, binary data in some encoding such as UTF-16, you need a conversion:
ulany.getBytes("UTF-16") // big endian, prefixed with a byte order mark
ulany.getBytes("UTF-16LE") // little endian, no byte order mark
However, System.out uses the operating system's encoding, so one cannot just pick some different encoding for it.
In fact char itself is UTF-16 encoded.
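If the console must receive a specific encoding, you can wrap System.out in your own PrintStream. A minimal sketch (the Charset overload of the PrintStream constructor needs Java 10+; whether the characters display correctly still depends on the terminal):

import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

PrintStream utf8Out = new PrintStream(System.out, true, StandardCharsets.UTF_8); // autoflush enabled
utf8Out.println("\u4F60\u597D"); // prints Chinese correctly only on a UTF-8 terminal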
What happens
//String ulany = new String("357810087745445");
String ulany = "357810087745445";
The String copy constructor is a holdover from the language's C++-inspired beginnings and is pointless here; the literal alone suffices.
System.out.println(ulany.getBytes().length);
This will behave differently on different platforms, as getBytes() without arguments uses the default charset. Better:
System.out.println(ulany.getBytes("UTF-8").length);
String string = new String(ulany.getBytes(), "UTF-16");
This interprets those bytes pairwise; an odd count of 15 bytes is already a bad sign, as valid UTF-16 needs an even number. One gets 7 characters from the first 14 bytes plus a replacement character for the dangling byte, 8 in total, and since the high byte of each pair is not zero they are no longer digits but characters in the U+3xxx range.
System.out.println(string.getBytes().length);
Getting 24 now means 3 bytes per char (8 chars at 3 bytes each). Hence the default platform encoding is probably UTF-8, which encodes each character in the U+0800 to U+FFFF range as a 3-byte sequence.
The string will contain something like:
String string = "\u3335\u3738\u3130\u3038\u3737\u3435\u3434\uFFFD";
where the final \uFFFD is the replacement character produced by the dangling 15th byte.

You can also pass a charset name to getBytes(). For example:
String string = new String(ulany.getBytes("UTF-8"), "UTF-16");
Note that this still reinterprets UTF-8 bytes as UTF-16 and garbles the text; it only makes the byte count independent of the platform default.

Related

String format when reading from file

I have this example. It reads a line "hello" from a file saved as UTF-8. Here is my question:
Strings are stored in Java in UTF-16 format. So when it reads the line "hello" it converts it to UTF-16. So string s is UTF-16 with a UTF-16 BOM... Am I right?
FileReader filereader = new FileReader(file);
BufferedReader read = new BufferedReader(filereader);
String s = null;
while ((s = read.readLine()) != null)
{
    System.out.println(s);
}
So when I do this:
s = s.replace("\uFEFF", "A");
nothing happens. Should the above find and replace the UTF-16 BOM? Or is it eventually in UTF-8 format? I am a little bit confused about this.
Thank you
Try the Apache Commons IO library and its class org.apache.commons.io.input.BOMInputStream to get rid of this kind of problem.
Example:
import java.io.*;
import org.apache.commons.io.ByteOrderMark;
import org.apache.commons.io.input.BOMInputStream;

String defaultEncoding = "UTF-8";
InputStream inputStream = new FileInputStream(file);
try
{
    BOMInputStream bOMInputStream = new BOMInputStream(inputStream);
    ByteOrderMark bom = bOMInputStream.getBOM();
    String charsetName = bom == null ? defaultEncoding : bom.getCharsetName();
    InputStreamReader reader = new InputStreamReader(new BufferedInputStream(bOMInputStream), charsetName);
    // your code...
}
finally
{
    inputStream.close();
}
For what concerns the BOM itself, as @seand said, it's just metadata used when reading, writing, or storing strings. It is present in the underlying bytes, but you cannot replace or modify it in the decoded string unless you work at the binary level or re-encode the string.
Let's make a few examples:
String str = "Hadoop";
byte bt1[] = str.getBytes();
System.out.println(bt1.length); // 6
byte bt2a[] = str.getBytes("UTF-16");
System.out.println(bt2a.length); // 14
byte bt2b[] = str.getBytes("UTF-16BE");
System.out.println(bt2b.length); // 14
byte bt3[] = str.getBytes("UTF-16LE");
System.out.println(bt3.length); // 12
In the UTF-16 (which defaults to Big Endian) and UTF-16BE versions, you get 14 bytes because of the BOM being inserted to distinguish between BE and LE. If you specify UTF-16LE you get 12 bytes because of no BOM is being added.
You cannot strip the BOM from the string with the simple replace you tried, because FileReader decodes the file with the platform default charset; unless that default is UTF-8, the UTF-8 BOM bytes (EF BB BF) do not come out as the single character \uFEFF but as some other character sequence entirely. To deal with it reliably you have to work at the byte level, or decode with the correct charset.
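Without an extra library, one can also decode with an explicit charset and strip a leading \uFEFF by hand. A minimal sketch, assuming the file really is UTF-8 and that this runs where an IOException may propagate:

BufferedReader in = new BufferedReader(
        new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8));
String s;
boolean firstLine = true;
while ((s = in.readLine()) != null)
{
    if (firstLine && s.startsWith("\uFEFF")) // decoded as UTF-8, the BOM becomes U+FEFF
    {
        s = s.substring(1);
    }
    firstLine = false;
    System.out.println(s);
}
in.close();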

How to convert "Æàìáûë" to readable cyrillic in Java?

I tried to get the bytes and then convert them with UTF-8:
byte ptext[] = first_name.getBytes();
Log.i("", new String(ptext, "UTF-8"));
But it's not working. Sorry for my dumbness, I'm very confused.
try {
    String s = new String("Æàìáûë".getBytes(StandardCharsets.ISO_8859_1), "Windows-1251");
    Files.write(Paths.get("C:/cyrillic.txt"),
            ("\uFEFF" + s).getBytes(StandardCharsets.UTF_8));
} catch (IOException e) {
    e.printStackTrace();
}
This assumes that the editor and compiler are both set to UTF-8, so that the garbled string literal itself survives compilation intact.
getBytes(StandardCharsets.ISO_8859_1) treats the characters as single bytes, abusing ISO-8859-1 as a byte-preserving charset. The constructor then reinterprets those bytes as Windows-1251, a common Cyrillic encoding (there are others).
This way we have a Java String (always in Unicode).
This we then write to a text file in UTF-8, with a BOM, so Windows Notepad will identify the file as UTF-8.
Writing it in any Cyrillic encoding would be no problem either.
Жамбыл
Your byte array must have some encoding. The encoding cannot be ASCII if you've got negative byte values. Once you figure out which one it is, you can convert a set of bytes to a String using:
byte[] bytes = {...};
String str = new String(bytes, "UTF-8"); // for UTF-8 encoding
Log.i("value", str);
There are a bunch of encodings you can use; look at the Charset class in the javadocs.
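To see which charsets a particular JVM actually supports, you can list them; Charset.availableCharsets() and Charset.defaultCharset() are both part of java.nio.charset:

import java.nio.charset.Charset;

System.out.println(Charset.availableCharsets().keySet()); // canonical names of all supported charsets
System.out.println(Charset.defaultCharset()); // what no-argument getBytes() will use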
Seems your original encoding is Cp1251:
byte ptext[] = first_name.getBytes();
Log.i("", new String(ptext, "Cp1251")); // <- put it here
The resulting word is Жамбыл.

How to convert UTF8 string to UTF16

I'm getting what should be a UTF-8 string by processing a request sent by a client application, but the string is really UTF-16: what lands in my local string is each letter followed by a \0 character. I need to convert that String into proper UTF-16.
Sample received string: S\0a\0m\0p\0l\0e (read as UTF-8).
What I want is: Sample (UTF-16)
FileItem item = (FileItem) iter.next();
String field = "";
String value = "";
if (item.isFormField()) {
    try {
        value = item.getString();
        System.out.println("====" + value);
    } catch (Exception e) {
        e.printStackTrace();
    }
}
The bytes from the server are not UTF-8 if they look like S\0a\0m\0p\0l\0e; they are UTF-16. You can convert UTF-16 bytes to a Java String with:
byte[] bytes = ...
String string = new String(bytes, "UTF-16");
Or you can use UTF-16LE or UTF-16BE as the charset name if you know the endianness of the byte stream coming from the server.
If you've already (mistakenly) constructed a String from the bytes as if they were UTF-8, you can convert to UTF-16 with:
string = new String(string.getBytes("UTF-8"), "UTF-16");
However, as JB Nizet points out, this round trip (bytes -> UTF-8 string -> bytes) is potentially lossy if the bytes weren't valid UTF-8 to start with.
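As a quick sanity check of the pattern in the question (hypothetical bytes: a letter followed by \0 means the low byte comes first, i.e. UTF-16 little endian):

import java.nio.charset.StandardCharsets;

byte[] bytes = {'S', 0, 'a', 0, 'm', 0, 'p', 0, 'l', 0, 'e', 0};
System.out.println(new String(bytes, StandardCharsets.UTF_16LE)); // Sample
System.out.println(new String(bytes, StandardCharsets.UTF_16BE)); // garbage: wrong byte order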
I propose the following solution, in Objective-C:
NSString *line_utf16 = [NSString stringWithUTF8String:line_utf8];
stringWithUTF8String: decodes the UTF-8 C string into an NSString, which works in UTF-16 internally. The UTF-16 form can occupy up to twice the memory used for line_utf8, but the framework allocates that for you; no static buffer is needed.
If you run into a similar problem, please add a couple of sentences!

Get Multilingual Data from ByteBuffer

I am receiving ByteBuffers in a UDP Java application.
The data in such a ByteBuffer can be any string in any language, or any special chars, separated by zero bytes.
I use the following code to get Strings from it:
public String getString() {
    byte[] remainingBytes = new byte[this.byteBuffer.remaining()];
    this.byteBuffer.slice().get(remainingBytes);
    String dataString = new String(remainingBytes);
    int stringEnd = dataString.indexOf(0);
    if (stringEnd == -1) {
        return null;
    } else {
        dataString = dataString.substring(0, stringEnd);
        this.byteBuffer.position(this.byteBuffer.position() + dataString.getBytes().length + 1);
        return dataString;
    }
}
These strings are stored in a MySQL DB with everything set to UTF-8.
If I run the application on Windows, special chars like ® are displayed but Chinese is not.
On adding the VM argument -Dfile.encoding=UTF8, Chinese is displayed but chars like ® are shown as ?? etc.
Please help.
Edit:
Input strings in the UDP packet are a variable-length byte field, encoded in UTF-8, terminated by 0x00.
For JDBC I also use useUnicode=true&characterEncoding=UTF-8.
String dataString = new String(remainingBytes); is wrong. You should almost never do that, because it uses the platform default encoding. You should find out what encoding was used to put the bytes into the UDP packet and use the same encoding on that line:
String dataString = new String(remainingBytes, encoding); // e.g. "UTF-8"
Edit: based on your updated question, encoding should be "UTF-8".
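Beyond the charset, the position arithmetic in getString() is also fragile: dataString.getBytes().length re-encodes the substring with the default charset, which need not match the number of bytes consumed. A minimal sketch of a fixed version that finds the 0x00 terminator at the byte level (assuming UTF-8, 0x00-terminated fields as described in the edit):

import java.nio.charset.StandardCharsets;

public String getString() {
    byte[] remainingBytes = new byte[this.byteBuffer.remaining()];
    this.byteBuffer.slice().get(remainingBytes);
    int stringEnd = -1;
    for (int i = 0; i < remainingBytes.length; i++) { // locate the 0x00 terminator
        if (remainingBytes[i] == 0) {
            stringEnd = i;
            break;
        }
    }
    if (stringEnd == -1) {
        return null;
    }
    String dataString = new String(remainingBytes, 0, stringEnd, StandardCharsets.UTF_8);
    this.byteBuffer.position(this.byteBuffer.position() + stringEnd + 1); // skip the terminator too
    return dataString;
}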
Not sure, but dataString may contain only the data up to that zero, because stringEnd points at the first zero's position, not past it. Try
dataString = dataString.substring(0, stringEnd + 1);
or
String specChar = dataString.substring(stringEnd, stringEnd + 1); which should return only the special character. But as I said in the beginning, not sure...

How best to convert a byte[] array to a string buffer

I have a number of byte[] array variables I need to convert to string buffers.
Is there a method for this type of conversion?
Thanks
Thank you all for your responses. However, I didn't make myself clear...
I'm using some byte[] arrays pre-defined as public static "under" the class declaration
of my Java program. These "fields" are reused during the "life" of the process.
As the program issues status messages (written to a file), I've defined a string buffer
(mesg_data) that is used to format a status message.
So as the program executes
I tried msg2 = String(byte_array2)
I get a compiler error:
cannot find symbol
symbol : method String(byte[])
location: class APPC_LU62.java.LU62XnsCvr
convrsID = String(conversation_ID) ;
example:
public class LU62XnsCvr extends Object
.
.
static String convrsID ;
static byte[] conversation_ID = new byte[8] ;
So I can't use a "dynamic" define of a string variable because the same variable is used
in multiple occurances.
I hope I made myself clear
Thanks ever so much
Guy
String s = new String(myByteArray, "UTF-8");
StringBuilder sb = new StringBuilder(s);
There is a constructor that takes a byte array and an encoding:
byte[] bytes = new byte[200];
//...
String s = new String(bytes, "UTF-8");
In order to translate bytes to characters you need to specify an encoding: the scheme by which sequences of byte values 0-255 (typically of length 1, 2, or 3) are mapped to characters. UTF-8 is probably the best bet as a default.
You can turn it into a String directly
byte[] bytearray
....
String mystring = new String(bytearray)
and then convert that to a StringBuffer
StringBuffer buffer = new StringBuffer(mystring)
You may use
str = new String(bytes)
By the way, what the code above does is create a Java String (internally UTF-16) using the default platform character encoding.
If the byte array was created from a string encoded in the platform default character encoding, this will work well.
If not, you need to specify the correct character encoding (Charset), as in
String str = new String(bytes, charset); // constructor String(byte[] bytes, Charset charset)
It depends entirely on the character encoding, but you want:
String value = new String(bytes, "US-ASCII");
This would work for US-ASCII values.
See Charset for other valid character encodings (e.g., UTF-8)
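Tying it back to the fields in the question: on Java 7+, java.nio.charset.StandardCharsets lets you skip the checked UnsupportedEncodingException entirely. A sketch reusing the question's names (conversation_ID, convrsID, mesg_data), assuming the bytes hold UTF-8 text and mesg_data is the shared StringBuffer:

import java.nio.charset.StandardCharsets;

convrsID = new String(conversation_ID, StandardCharsets.UTF_8); // decode the static byte[] field
mesg_data.append(convrsID); // reuse the shared status-message buffer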
