How to convert UTF8 string to UTF16 - java

I'm getting a UTF8 string by processing a request sent by a client application. But the string is really UTF16. What can I do to get it into my local string is a letter followed by \0 character? I need to convert that String into UTF16.
Sample received string: S\0a\0m\0p\0l\0e (UTF8).
What I want is : Sample (UTF16)
FileItem item = (FileItem) iter.next();
String field = "";
String value = "";
if (item.isFormField()) {
try{
value=item.getString();
System.out.println("====" + value);
}

The bytes from the server are not UTF-8 if they look like S\0a\0m\0p\0l\0e. They are UTF-16. You can convert UTF16 bytes to a Java String with:
byte[] bytes = ...
String string = new String(bytes, "UTF-16");
Or you can use UTF-16LE or UTF-16BE as the character set name if you know the endian-ness of the byte stream coming from the server.
If you've already (mistakenly) constructed a String from the bytes as if it were UTF-8, you can convert to UTF-16 with:
string = new String(string.getBytes("UTF-8"), "UTF-16");
However, as JB Nizet points out, this round trip (bytes -> UTF-8 string -> bytes) is potentially lossy if the bytes weren't valid UTF-8 to start with.

I propose the following solution:
NSString *line_utf16[ENOUGH_MEMORY_SIZE];
line_utf16= [NSString stringWithFormat: #"%s", line_utf8];
ENOUGH_MEMORY_SIZE is at least twice exceeds memory used for line_utf8
I suppose memory for
line_utf16
has to be dynamically or statically allocated at least twice of the size of
line_utf8.
If you run into similar problem please add a couple of sentences!

Related

Convert byte[] to String and back

I'm trying to save content of a pdf file in a json and thought of saving the pdf as String value converted from byte[].
byte[] byteArray = feature.convertPdfToByteArray(Paths.get("path.pdf"));
String byteString = new String(byteArray, StandardCharsets.UTF_8);
byte[] newByteArray = byteString.getBytes(StandardCharsets.UTF_8);
String secondString = new String(newByteArray, StandardCharsets.UTF_8);
System.out.println(secondString.equals(byteString));
System.out.println(Arrays.equals(byteArray, newByteArray));
System.out.println(byteArray.length + " vs " + newByteArray.length);
The result of the above code is as follows:
true
false
421371 vs 760998
The two String's are equal while the two byte[]s are not. Why is that and how to correctly convert/save a pdf inside a json?
You are probably using the wrong charset when reading from the PDF file.
For example, the character é (e with acute) does not exists in ISO-8859-1 :
byte[] byteArray = "é".getBytes(StandardCharsets.ISO_8859_1);
String byteString = new String(byteArray, StandardCharsets.UTF_8);
byte[] newByteArray = byteString.getBytes(StandardCharsets.UTF_8);
String secondString = new String(newByteArray, StandardCharsets.UTF_8);
System.out.println(secondString.equals(byteString));
System.out.println(Arrays.equals(byteArray, newByteArray));
System.out.println(byteArray.length + " vs " + newByteArray.length);
Output :
true
false
1 vs 3
Why is that
If the byteArray indeed contains a PDF, it most likely is not valid UTF-8. Thus, wherever
String byteString = new String(byteArray, StandardCharsets.UTF_8);
stumbles over a byte sequence which is not valid UTF-8, it will replace that by a Unicode replacement character. I.e. this line damages your data, most likely beyond repair. So the following
byte[] newByteArray = byteString.getBytes(StandardCharsets.UTF_8);
does not result in the original byte array but instead a damaged version of it.
The newByteArray, on the other hand, is the result of UTF-8 encoding a given string, byteString. Thus, newByteArray is valid UTF-8 and
String secondString = new String(newByteArray, StandardCharsets.UTF_8);
does not need to replace anything outside the UTF-8 mappings, in particular byteString and secondString are equal.
how to correctly convert/save a pdf inside a json?
As #mammago explained in his comment,
JSON is not the appropriate format for binary content (like files). You should propably use something like base64 to create a string out of your PDF and store that in your JSON object.

Java convert String UTF-8 to UTF-16

I try to convert String a = "try" to String UTF-16
I did this :
try {
String ulany = new String("357810087745445");
System.out.println(ulany.getBytes().length);
String string = new String(ulany.getBytes(), "UTF-16");
System.out.println(string.getBytes().length);
} catch (UnsupportedEncodingException e) {
e.printStackTrace();
}
And ulany.getBytes().length = 15
and System.out.println(string.getBytes().length) = 24 but I think that it should be 30 what I did wrong ?
String (and char) hold Unicode. So nothing is needed.
However if you want bytes, binary data, that are in some encoding, like UTF-16, you need a conversion:
ulany.getBytes("UTF-16") // Those bytes are in UTF-16 big endian
ulany.getBytes("UTF-16LE")
However System.out uses the operating systems encoding, so one cannot just pick some different encoding.
In fact char is UTF-16 encoded.
What happens
//String ulany = new String("357810087745445");
String ulany = "357810087745445";
The String copy constructor stems from the C++ beginning, and is senseless.
System.out.println(ulany.getBytes().length);
This will run on different platforms differently, as getBytes() uses
the default Charset. Better
System.out.println(ulany.getBytes("UTF-8").length);
String string = new String(ulany.getBytes(), "UTF-16");
This interpretes those bytes pairwise; having 15 bytes is already wrong.
Evidently one gets 7 (8?) special characters, as the high byte is not zero.
System.out.println(string.getBytes().length);
Now getting 24 means an average 3 bytes per char. Hence the default platform encoding is probably UTF-8 creating multibyte sequences.
The string will contain something like:
String string = "\u3533\u3837\u3031\u3830\u3737\u3534\u3434?";
You can also include a text encoding in getBytes(). For example:
String string = new String(ulany.getBytes("UTF-8"), "UTF-16");

In Java Is it possible to convert character set 1047 into another character set, say 500?

I have a program which reads a message from MQ. the character set is 1047. Since my java version is very old it doesn't support thus character set.
Is it possible to change this string into char set 500 in the program after receiving but before reading.
For eg:
public void fun (String str){ //str in char set 1047. **1047 is not supported in my system**
/* can I convert str into char set 500 here. Convert it into byte stream and then back to string. Something like this */
byte [] b=str.getBytes();
ByteArrayOutputStream baos = null;
try{
baos = new ByteArrayOutputStream();
baos.write(b);
String str = baos.toString("IBM-500");
System.out.println(str);
}
byte [] b=str.getBytes(); //will convert string(encoding could only be Unicode in jvm) to bytes using file.encoding. You should check whether the str contains correct information, if so, you need not care the 1047 encoding, just run str.getBytes("IBM-500"), you will get the 500 encoded bytes. Again, String object only use Unicode, if you convert string to bytes, the encoding matters the result bytes array.

Get Multilingual Data from ByteBuffer

I am receiving ByteBuffers in an UDP Java application.
Now the data in this ByteBuffer can be any string in any language or any special chars separated by zero.
I use following code to get Strings from it.
public String getString() {
byte[] remainingBytes = new byte[this.byteBuffer.remaining()];
this.byteBuffer.slice().get(remainingBytes);
String dataString = new String(remainingBytes);
int stringEnd = dataString.indexOf(0);
if(stringEnd == -1) {
return null;
} else {
dataString = dataString.substring(0, stringEnd);
this.byteBuffer.position(this.byteBuffer.position() + dataString.getBytes().length + 1);
return dataString;
}
}
These strings are stored in MySQL DB with everything set as UTF8.
IF i run application in Windows then special chars like ® are displayed but chinese are not.
On adding VM argument -Dfile.encoding=UTF8 chinese are displayed but chars like ® are shown as ?? etc.
Please Help.
Edit:
Input Strings in UDP packet are variable-length byte field, encoded in UTF-8, terminated by 0x00
For JDBC also i use useUnicode=true&characterEncoding=UTF-8
String dataString = new String(remainingBytes); is wrong. You should almost never do that. You should find out what encoding was used to put the bytes into the UDP packet, and use the same encoding on that line:
String dataString = new String(remainingBytes, encoding); // e.g. "UTF-8"
Edit: based on your updated question, encoding should be "UTF-8"
Not sure, but dataString contains only data till this zero, because stringEnd shows on first zero postion but not behind.
dataString = dataString.substring(0, stringEnd+1);
or
char specChar = dataString.substring(stringEnd, stringEnd+1); and it should return only special character, but as I said in the biggining, not sure...

How best to convert a byte[] array to a string buffer

I have a number of byte[] array variables I need to convert to string buffers.
is there a method for this type of conversion ?
Thanks
Thank you all for your responses..However I didn't make myself clear....
I'm using some byte[] arrays pre-defined as public static "under" the class declaration
for my java program. these "fields" are reused during the "life" of the process.
As the program issues status messages, (written to a file) I've defined a string buffer
(mesg_data) that used to format a status message.
So as the program executes
I tried msg2 = String(byte_array2)
I get a compiler error:
cannot find symbol
symbol : method String(byte[])
location: class APPC_LU62.java.LU62XnsCvr
convrsID = String(conversation_ID) ;
example:
public class LU62XnsCvr extends Object
.
.
static String convrsID ;
static byte[] conversation_ID = new byte[8] ;
So I can't use a "dynamic" define of a string variable because the same variable is used
in multiple occurances.
I hope I made myself clear
Thanks ever so much
Guy
String s = new String(myByteArray, "UTF-8");
StringBuilder sb = new StringBuilder(s);
There is a constructor that a byte array and encoding:
byte[] bytes = new byte[200];
//...
String s = new String(bytes, "UTF-8");
In order to translate bytes to characters you need to specify encoding: the scheme by which sequences (typically of length 1,2 or 3) of 0-255 values (that is: sequence of bytes) are mapped to characters. UTF-8 is probably the best bet as a default.
You can turn it to a String directly
byte[] bytearray
....
String mystring = new String(bytearray)
and then to convert to a StringBuffer
StringBuffer buffer = new StringBuffer(mystring)
You may use
str = new String(bytes)
By thewhat the code above does is to create a java String (i.e. UTF-16) with the default platform character encoding.
If the byte array was created from a string encoded in the platform default character encoding this will work well.
If not you need to specify the correct character encoding (Charset) as
String str = new String (byte [] bytes, Charset charset)
It depends entirely on the character encoding, but you want:
String value = new String(bytes, "US-ASCII");
This would work for US-ASCII values.
See Charset for other valid character encodings (e.g., UTF-8)

Categories

Resources