Get Multilingual Data from ByteBuffer - Java

I am receiving ByteBuffers in a UDP Java application.
The data in a ByteBuffer can be any string in any language, or any special characters, separated by zeroes.
I use the following code to get Strings from it:
public String getString() {
    byte[] remainingBytes = new byte[this.byteBuffer.remaining()];
    this.byteBuffer.slice().get(remainingBytes);
    String dataString = new String(remainingBytes);
    int stringEnd = dataString.indexOf(0);
    if (stringEnd == -1) {
        return null;
    } else {
        dataString = dataString.substring(0, stringEnd);
        this.byteBuffer.position(this.byteBuffer.position() + dataString.getBytes().length + 1);
        return dataString;
    }
}
These strings are stored in a MySQL DB with everything set to UTF-8.
If I run the application on Windows, special characters like ® are displayed but Chinese characters are not. After adding the VM argument -Dfile.encoding=UTF8, Chinese characters are displayed but characters like ® show up as ?? etc.
Please help.
Edit:
The input strings in the UDP packet are a variable-length byte field, encoded in UTF-8 and terminated by 0x00.
For JDBC I also use useUnicode=true&characterEncoding=UTF-8.

String dataString = new String(remainingBytes); is wrong. You should almost never do that. You should find out what encoding was used to put the bytes into the UDP packet, and use the same encoding on that line:
String dataString = new String(remainingBytes, encoding); // e.g. "UTF-8"
Edit: based on your updated question, encoding should be "UTF-8"
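For reference, here is a minimal sketch (not the asker's exact code) of getString() rewritten along those lines: it scans for the 0x00 terminator at the byte level and advances the position by the byte count, which avoids the off-by-N error that dataString.getBytes().length causes for multi-byte characters.
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public String getString() {
    int start = this.byteBuffer.position();
    int zero = -1;
    // Scan the raw bytes for the 0x00 terminator.
    for (int i = start; i < this.byteBuffer.limit(); i++) {
        if (this.byteBuffer.get(i) == 0) {
            zero = i;
            break;
        }
    }
    if (zero == -1) {
        return null; // no terminator among the remaining bytes
    }
    byte[] raw = new byte[zero - start];
    this.byteBuffer.get(raw);           // reads up to (not including) the 0x00
    this.byteBuffer.position(zero + 1); // skip the terminator itself
    return new String(raw, StandardCharsets.UTF_8);
}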

Not sure, but dataString may contain only the data up to the zero, because stringEnd points at the first zero's position, not past it. Try:
dataString = dataString.substring(0, stringEnd + 1);
or
String specChar = dataString.substring(stringEnd, stringEnd + 1); and it should return only the terminator character. But as I said in the beginning, I'm not sure...

Related

Java convert String UTF-8 to UTF-16

I am trying to convert String a = "try" to UTF-16.
I did this:
try {
    String ulany = new String("357810087745445");
    System.out.println(ulany.getBytes().length);
    String string = new String(ulany.getBytes(), "UTF-16");
    System.out.println(string.getBytes().length);
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}
ulany.getBytes().length is 15, and string.getBytes().length is 24, but I think it should be 30. What did I do wrong?
String (and char) hold Unicode, so nothing is needed for the String itself.
However, if you want bytes, binary data in some encoding like UTF-16, you need a conversion:
ulany.getBytes("UTF-16") // big-endian, preceded by a byte-order mark
ulany.getBytes("UTF-16LE") // little-endian, no BOM
However, System.out uses the operating system's encoding, so one cannot just pick some different encoding for it.
In fact, char itself is UTF-16 encoded.
What happens
//String ulany = new String("357810087745445");
String ulany = "357810087745445";
The String copy constructor is a leftover from Java's C++ heritage and is pointless here.
System.out.println(ulany.getBytes().length);
This runs differently on different platforms, as getBytes() uses the default Charset. Better:
System.out.println(ulany.getBytes("UTF-8").length);
String string = new String(ulany.getBytes(), "UTF-16");
This interprets those bytes pairwise; an odd count of 15 bytes is already wrong. One gets eight characters: seven from the byte pairs (whose high bytes are non-zero, so they are not digits) plus a replacement character for the dangling fifteenth byte.
System.out.println(string.getBytes().length);
Now getting 24 means an average of 3 bytes per char (8 chars × 3 bytes). Hence the default platform encoding is probably UTF-8, which creates multi-byte sequences for these characters.
The string will contain something like (the decoder defaults to big-endian when there is no BOM):
String string = "\u3335\u3738\u3130\u3038\u3737\u3435\u3434\uFFFD";
You can also include a text encoding in getBytes(). For example:
String string = new String(ulany.getBytes("UTF-8"), "UTF-16");
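To see where the asker's numbers come from, a quick sketch (using StandardCharsets to avoid the checked exception); the commented lengths assume the 15-character ASCII string from the question:
import java.nio.charset.StandardCharsets;

String ulany = "357810087745445";                                      // 15 ASCII chars
System.out.println(ulany.getBytes(StandardCharsets.UTF_8).length);    // 15: one byte per char
System.out.println(ulany.getBytes(StandardCharsets.UTF_16).length);   // 32: 2-byte BOM + 15 * 2
System.out.println(ulany.getBytes(StandardCharsets.UTF_16LE).length); // 30: no BOM, the 30 the asker expected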

JAVA: failing to get encrypted data in string using xor

I was trying to print encrypted text as a string; perhaps I went wrong somewhere. I am doing a simple XOR on a plain text. The resulting encrypted text/string is then fed to a C program, which does the same XOR again to get the plain text back.
But in between, I am not able to get a proper string of the encrypted text to pass to C.
String xorencrypt(byte[] passwd, int pass_len) {
    char[] st = new char[pass_len];
    byte[] crypted = new byte[pass_len];
    for (int i = 0; i < pass_len; i++) {
        crypted[i] = (byte) (passwd[i] ^ (i + 1));
        st[i] = (char) crypted[i];
        // characters print fine; the problem appears when converting to a String
        System.out.println((char) passwd[i] + " " + passwd[i] + "= " + (char) crypted[i] + " " + crypted[i]);
    }
    return st.toString();
}
I don't know if some kind of encoding is also needed, because if I used one, how would I decode and decrypt from the C program?
For example, if passwd = bond007,
then the Java program should return akkb78>,
and the C program will then decrypt akkb78> back to bond007.
Use
return new String(crypted);
In that case you don't need the st[] array at all.
By the way, the encoded value for bond007 is cmm`560, not what you posted.
EDIT
While the solution above would most likely work in most Java environments, to be safe about encoding, as suggested by Alex, provide an encoding parameter to the String constructor. For example, if you want your string to carry 8-bit bytes:
return new String(crypted, "ISO-8859-1");
You would need the same parameter when getting the bytes back from your string:
byte[] bytes = myString.getBytes("ISO-8859-1");
Alternatively, use the solution provided by Alex:
return new String(st);
But convert bytes to chars properly:
st[i] = (char) (crypted[i] & 0xff);
Otherwise any negative byte (crypted[i] < 0) is sign-extended rather than converted properly, and you get surprising results.
Change this line:
return st.toString();
to this:
return new String(st);
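Putting both answers together, a minimal runnable sketch (class and method names are mine, not the asker's): XOR each byte with its 1-based index and use ISO-8859-1 so every byte value 0-255 survives the byte-to-String-to-byte round trip. Since XOR is its own inverse, running the function twice recovers the input:
import java.nio.charset.StandardCharsets;

public class XorDemo {
    static String xorEncrypt(byte[] passwd) {
        byte[] crypted = new byte[passwd.length];
        for (int i = 0; i < passwd.length; i++) {
            crypted[i] = (byte) (passwd[i] ^ (i + 1));
        }
        // ISO-8859-1 maps every byte value 0-255 to exactly one char, losslessly.
        return new String(crypted, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String enc = xorEncrypt("bond007".getBytes(StandardCharsets.ISO_8859_1));
        System.out.println(enc); // cmm`560
        String dec = xorEncrypt(enc.getBytes(StandardCharsets.ISO_8859_1));
        System.out.println(dec); // bond007
    }
}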

How to convert UTF8 string to UTF16

I'm getting a UTF-8 string by processing a request sent by a client application. But the string is really UTF-16: what I get in my local string is each letter followed by a \0 character. I need to convert that String into UTF-16.
Sample received string: S\0a\0m\0p\0l\0e (decoded as UTF-8).
What I want is: Sample (UTF-16)
FileItem item = (FileItem) iter.next();
String field = "";
String value = "";
if (item.isFormField()) {
    try {
        value = item.getString();
        System.out.println("====" + value);
    } catch (Exception e) {
        e.printStackTrace();
    }
}
The bytes from the server are not UTF-8 if they look like S\0a\0m\0p\0l\0e. They are UTF-16. You can convert UTF-16 bytes to a Java String with:
byte[] bytes = ...
String string = new String(bytes, "UTF-16");
Or you can use UTF-16LE or UTF-16BE as the character set name if you know the endianness of the byte stream coming from the server.
If you've already (mistakenly) constructed a String from the bytes as if it were UTF-8, you can convert to UTF-16 with:
string = new String(string.getBytes("UTF-8"), "UTF-16");
However, as JB Nizet points out, this round trip (bytes -> UTF-8 string -> bytes) is potentially lossy if the bytes weren't valid UTF-8 to start with.
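If the snippet is using Apache Commons FileUpload (an assumption based on the FileItem type), it may be cleaner to skip the mistaken UTF-8 decode entirely: FileItem.get() returns the raw bytes, which can then be decoded once, correctly. A sketch:
import java.nio.charset.StandardCharsets;

// Assumes org.apache.commons.fileupload.FileItem; get() returns the item's raw bytes.
byte[] raw = item.get();
String value = new String(raw, StandardCharsets.UTF_16LE); // or UTF_16BE / UTF_16, per the stream's endianness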
I propose the following solution (in Objective-C):
NSString *line_utf16[ENOUGH_MEMORY_SIZE];
line_utf16 = [NSString stringWithFormat: @"%s", line_utf8];
where ENOUGH_MEMORY_SIZE is at least twice the memory used for line_utf8.
I suppose memory for line_utf16 has to be dynamically or statically allocated to at least twice the size of line_utf8.
If you run into a similar problem, please add a couple of sentences!

Check if a String contains encoded characters

Hello, I am looking for a way to detect if a string has been (mis-)encoded.
For example:
String name = "Hellä world";
String encoded = new String(name.getBytes("utf-8"), "iso8859-1");
The output of this encoded variable is:
Hellä world
As you can see, there is an Ã (A with a tilde) and another symbol. Is there a way to check if the output contains such mis-decoded characters?
Sounds like you want to check if a string that was decoded from bytes in latin1 could have been decoded in UTF-8, too. That's easy because illegal byte sequences are replaced by the character \ufffd:
String recoded = new String(encoded.getBytes("iso-8859-1"), "UTF-8");
return recoded.indexOf('\uFFFD') == -1; // No replacement character found
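If relying on U+FFFD feels fragile (a string could legitimately contain that character), a stricter variant is a CharsetDecoder configured to report malformed input instead of replacing it. A sketch, with a method name of my choosing:
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

static boolean isValidUtf8(byte[] bytes) {
    CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
    try {
        decoder.decode(ByteBuffer.wrap(bytes)); // throws on any invalid sequence
        return true;
    } catch (CharacterCodingException e) {
        return false;
    }
}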
Your question doesn't quite make sense as asked. A Java String is a sequence of characters; the characters don't have an encoding until you convert them into bytes, at which point you need to specify one (although you will see a lot of code that uses the platform default, which is what e.g. String.getBytes() with no argument does).
I suggest you read http://kunststube.net/encoding/.
String name = "Hellä world";
String encoded = new String(name.getBytes("utf-8"), "iso8859-1");
This code is just a character corruption bug. You take a UTF-16 string, transcode it to UTF-8, pretend it is ISO-8859-1 and transcode it back to UTF-16, resulting in incorrectly encoded characters.
If I understood your question correctly, this code may help you. The function isEncoded checks whether its parameter can be encoded as ASCII or contains non-ASCII chars.
public boolean isEncoded(String text) {
    Charset charset = Charset.forName("US-ASCII");
    String checked = new String(text.getBytes(charset), charset);
    return !checked.equals(text);
}

@Test
public void testAscii() throws Exception {
    Assert.assertFalse(isEncoded("Hello world"));
}

@Test
public void testNonAscii() throws Exception {
    Assert.assertTrue(isEncoded("Hellä world"));
}
You can also check for other charsets by changing the charset variable, or by moving it to a parameter, as sketched below.
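For instance, a parameterized variant (the method name is mine): CharsetEncoder.canEncode gives the same yes/no answer without building the intermediate string.
import java.nio.charset.Charset;

public boolean fitsCharset(String text, Charset charset) {
    // true if every character of text can be represented in the given charset
    return charset.newEncoder().canEncode(text);
}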
I'm not really sure what you are trying to do or what your problem is.
This line doesn't make any sense:
String encoded = new String(name.getBytes("utf-8"), "iso8859-1");
You are encoding name as UTF-8 and then trying to decode the bytes as ISO-8859-1.
If you want to encode name as ISO-8859-1, just do name.getBytes("iso8859-1").
Please tell us what problem you encountered so that we can help more.
You can check whether your string is encoded or not with this code:
public boolean isEncoded(String input) {
    char[] charArray = input.toCharArray();
    for (int i = 0, charArrayLength = charArray.length; i < charArrayLength; i++) {
        char c = charArray[i];
        if (Character.getType(c) == Character.OTHER_LETTER) {
            return true;
        }
    }
    return false;
}

Java InputStream encoding/charset

Running the following (example) code
import java.io.*;

public class test {
    public static void main(String[] args) throws Exception {
        byte[] buf = {-27};
        InputStream is = new ByteArrayInputStream(buf);
        BufferedReader r = new BufferedReader(
                new InputStreamReader(is, "ISO-8859-1"));
        String s = r.readLine();
        System.out.println("test.java:9 [byte] (char)" + (char) s.getBytes()[0] +
                " (int)" + (int) s.getBytes()[0]);
        System.out.println("test.java:10 [char] (char)" + (char) s.charAt(0) +
                " (int)" + (int) s.charAt(0));
        System.out.println("test.java:11 string below");
        System.out.println(s);
        System.out.println("test.java:13 string above");
    }
}
gives me this output
test.java:9 [byte] (char)? (int)63
test.java:10 [char] (char)? (int)229
test.java:11 string below
?
test.java:13 string above
How do I retain the correct byte value (-27) in the line-9 printout, and consequently get the expected output (å) from System.out.println(s)?
If you want to retain byte values, don't use a Reader at all, ideally. To represent arbitrary binary data in text and convert it back to binary data later, you should use base16 or base64 encoding.
However, to explain what's going on: when you call s.getBytes(), that uses the default character encoding, which apparently can't encode the Unicode character U+00E5.
If you call s.getBytes("ISO-8859-1") everywhere instead of s.getBytes() I suspect you'll get back the right byte value... but relying on ISO-8859-1 for this is kinda dirty IMO.
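For the base16/base64 route the answer recommends, java.util.Base64 (Java 8+) round-trips arbitrary bytes through plain ASCII text, so no charset guessing is involved. A tiny sketch:
import java.util.Base64;

byte[] buf = {-27};
String text = Base64.getEncoder().encodeToString(buf); // "5Q=="
byte[] back = Base64.getDecoder().decode(text);        // {-27} again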
As noted, getBytes() with no arguments uses the Java platform default encoding, which may not be ISO-8859-1. Simply printing the string should work, provided your terminal and the default encoding match and support the character. For instance, on my system, the terminal and default Java encoding are both UTF-8. The fact that you're seeing a '?' indicates that yours don't match or that å is not supported.
If you want to manually encode to UTF-8 on your system, do:
String s = r.readLine();
byte[] utf8Bytes = s.getBytes("UTF-8");
It should give a byte array with {-61, -91}.
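And if the goal is simply to print the å reliably, one option (assuming the terminal itself expects UTF-8) is a PrintStream with an explicit charset instead of the platform default:
import java.io.PrintStream;

// PrintStream(OutputStream, boolean, String) throws UnsupportedEncodingException,
// so declare or catch it.
PrintStream out = new PrintStream(System.out, true, "UTF-8");
out.println(s); // prints å on a UTF-8 terminal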
