How can I read a NUL-terminated UTF-8 string from a Java ByteBuffer, starting at ByteBuffer#position()?
ByteBuffer b = /* 61 62 63 64 00 31 32 34 00 (hex) */;
String s0 = /* read first string */;
String s1 = /* read second string */;
// `s0` will now contain "abcd" and `s1` will contain "124".
I have already tried Charsets.UTF_8.decode(b), but it reads from the current position all the way to the buffer's limit instead of stopping at the first NUL byte.
Is there a more idiomatic way to read such strings from a byte buffer than scanning for the 0 byte and then limiting the buffer to it (or copying the string's bytes into a separate buffer)?
If by "idiomatic" you mean a one-liner, not that I know of (unsurprising, since NUL-terminated strings are not part of the Java platform).
The first thing I came up with is using b.slice().limit(x) to create a lightweight view onto the desired bytes only (better than copying them anywhere, since you can keep working directly with the original buffer):
ByteBuffer b = ByteBuffer.wrap(new byte[] {0x61, 0x62, 0x63, 0x64, 0x00, 0x31, 0x32, 0x34, 0x00});
int i;
while (b.hasRemaining()) {
    ByteBuffer nextString = b.slice(); // view on b sharing the current position
    for (i = 0; b.hasRemaining() && b.get() != 0x00; i++) {
        // count bytes up to the next NUL
    }
    nextString.limit(i); // the view now stops just before the NUL
    CharBuffer s = StandardCharsets.UTF_8.decode(nextString);
    System.out.println(s);
}
In Java the char '\u0000' (the UTF-8 byte 0, Unicode code point U+0000) is an ordinary character. So read everything (perhaps into an oversized byte array) and do:
String s = new String(bytes, StandardCharsets.UTF_8);
String[] s0s1 = s.split("\u0000");
String s0 = s0s1[0];
String s1 = s0s1[1];
If you do not have fixed positions and must read every byte sequentially, the code gets ugly. One of C's founders did indeed call the NUL-terminated string a historic mistake.
For the reverse direction, to avoid producing a 0 byte when encoding a Java String (typically so the result can be consumed as a C/C++ NUL-terminated string), there is Modified UTF-8, which encodes the char '\u0000' as two non-zero bytes.
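A hedged sketch of that (my own example, not from the question): DataOutputStream.writeUTF emits Modified UTF-8, in which '\u0000' becomes the byte pair C0 80, so no raw 0 byte appears in the payload:
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

static byte[] toModifiedUtf8(String s) throws IOException {
    // writeUTF prefixes a two-byte length, then the Modified UTF-8 payload;
    // "a\u0000b" encodes to 00 04 61 C0 80 62.
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    new DataOutputStream(out).writeUTF(s);
    return out.toByteArray();
}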
You can do it with the replace and split functions: decode the bytes to a String, replace the 0 character with a custom separator, then split the string on that separator.
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Jtest {
    public static void main(String[] args) {
        // ByteBuffer b = /* 61 62 63 64 00 31 32 34 00 (hex) */;
        ByteBuffer b = ByteBuffer.allocate(9);
        b.put((byte) 0x61);
        b.put((byte) 0x62);
        b.put((byte) 0x63);
        b.put((byte) 0x64);
        b.put((byte) 0x00);
        b.put((byte) 0x31);
        b.put((byte) 0x32);
        b.put((byte) 0x34);
        b.put((byte) 0x00);
        b.flip(); // prepare for reading: position = 0, limit = bytes written

        // print the ByteBuffer
        System.out.println("Original ByteBuffer: "
                + Arrays.toString(b.array()));

        // the first word will be "abcd" and the second "124"
        String s = StandardCharsets.UTF_8.decode(b).toString();
        String ss = s.replace((char) 0, ';');
        String[] words = ss.split(";");
        for (int i = 0; i < words.length; i++) {
            System.out.println(" Word " + i + " = " + words[i]);
        }
    }
}
I believe you can make it more efficient by removing the replace and splitting on the NUL character directly.
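For instance, a minimal sketch of that variant (reusing b from above):
// Decode, then split directly on the NUL character; split drops trailing
// empty strings, so the terminating NUL costs nothing extra.
String s = StandardCharsets.UTF_8.decode(b).toString();
String[] words = s.split("\u0000");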
Related
I'm working on a project that involves receiving a byte array over wireless; the Android app reads it as a String over a TCP connection:
input = new BufferedReader(new InputStreamReader(this.clientSocket.getInputStream()));
...
...
//Loop
String read = input.readLine();
//Do something meaningful with String read...
The String will always have a fixed format: the first 3 characters are an ID and the next 20 characters are the message data. The number of characters will not change (3 + 20 = 23 characters, plus a starting '[' and an ending ']', so 25 characters in total).
An example of a String received by the application would be [01A01020304050A0B0C0D]
ID - 0x01A
Byte0 0x01
Byte1 0x02
Byte2 0x03
Byte3 0x04
Byte4 0x05
Byte5 0x0A
Byte6 0x0B
Byte7 0x0C
Byte8 0x0D
I would guess that I have to use the substring operation, but I'm having trouble converting the substrings to byte values (note: the app expects byte[], not Byte[]), and I feel I'm not doing it efficiently. I came across this piece of code that I've been using:
public static byte[] hexStringToByteArray(String s) {
    int len = s.length();
    byte[] data = new byte[len / 2];
    for (int i = 0; i < len; i += 2) {
        data[i / 2] = (byte) ((Character.digit(s.charAt(i), 16) << 4)
                + Character.digit(s.charAt(i + 1), 16));
    }
    return data;
}
This returns a byte array of size 1 and has to be run 9 times (9 bytes) per message. I'm a bit concerned that this may be too strenuous on processing, especially when the application receives messages very frequently (roughly 10-15 messages per second).
I appreciate any thoughts and many thanks in advance!
Just use this:
byte[] decodedString = Base64.decode(your_string, Base64.DEFAULT);
byte[] b = string.getBytes();
byte[] b = string.getBytes(Charset.forName("UTF-8"));
byte[] b = string.getBytes("UTF-8");
There is no way to be more efficient than using these methods.
The best and simplest way:
String myString = "This is my string";
byte[] myByteArray = myString.getBytes("UTF-8");
Now you can easily access the ID, the message, or whatever else you need from myByteArray.
Just write your data like:
byte[] data = yourData.getBytes();
os.write(data, 0, data.length); // data is 23 bytes
os.flush();
What about reading through an InputStream? As you mentioned in your question, the String is 23 characters, so just do:
public byte[] readData(InputStream is) throws IOException {
    byte[] data = new byte[23];
    int read = is.read(data); // note: read() may return fewer than 23 bytes
    System.out.println("Read: " + read);
    return data;
}
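One caveat worth noting (my addition): InputStream.read may return fewer bytes than requested, so for a fixed-size frame something like DataInputStream.readFully is safer. A sketch:
// imports: java.io.DataInputStream, java.io.IOException, java.io.InputStream
public byte[] readFrame(InputStream is) throws IOException {
    byte[] data = new byte[23];
    new DataInputStream(is).readFully(data); // blocks until all 23 bytes arrive
    return data;
}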
Once you have the data, you can split it like this:
byte[] tempId = new byte[3];
System.arraycopy(data, 0, tempId, 0, tempId.length);
byte[] tempMessage = new byte[20];
System.arraycopy(data, 3, tempMessage, 0, tempMessage.length);
String id = new String(tempId);
String message = new String(tempMessage);
Now your ID and message are separated and converted into Strings.
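Putting the pieces together, a hedged sketch of the full parse (reusing hexStringToByteArray from the question; note the sample frame carries 18 hex digits, i.e. 9 bytes):
String frame = "[01A01020304050A0B0C0D]";
String body = frame.substring(1, frame.length() - 1);      // strip '[' and ']'
String id = body.substring(0, 3);                          // "01A"
byte[] payload = hexStringToByteArray(body.substring(3));  // all 9 bytes in one call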
byte[] array = yourString.getBytes("UTF-8");
I have a hex string (sA) converted from a UTF-8 string.
When I convert the hex string sA back to a UTF-8 string, I can't show it in the form UI in build mode (running the .jar), but when I run in run mode or debug mode the UTF-8 string shows fine in the form UI.
I use netbeans IDE 7.3.1.
My code below:
public String hexToString(String txtInHex) {
    byte[] txtInByte = new byte[txtInHex.length() / 2];
    int j = 0;
    for (int i = 0; i < txtInHex.length(); i += 2) {
        txtInByte[j++] = Byte.parseByte(txtInHex.substring(i, i + 2), 16);
    }
    return new String(txtInByte);
}
// (assumption: HEX_CHARS is something like "0123456789ABCDEF".toCharArray())
private String asHex(byte[] buf) {
    char[] chars = new char[2 * buf.length];
    for (int i = 0; i < buf.length; ++i) {
        chars[2 * i] = HEX_CHARS[(buf[i] & 0xF0) >>> 4];
        chars[2 * i + 1] = HEX_CHARS[buf[i] & 0x0F];
    }
    return new String(chars);
}
There are multiple problems with this code.
The valid range for byte values is -128 to 127 (-0x80 to 0x7F in hex), and Byte.parseByte enforces this. If your asHex method processes a byte whose unsigned value is greater than 0x7F, it produces a hex pair (80 to FF) that Byte.parseByte, and therefore hexToString, cannot parse.
If the byte array passed to asHex was built by keeping only the low byte of each character, the round trip will work correctly only for the first 256 Unicode characters and produce bogus output for the rest.
The hexToString method decodes the byte array with new String(byte[]), i.e. with the platform-specific default encoding, which gives incorrect results if the data was supposedly encoded in UTF-8 and the default encoding is something else.
Why are you trying to create your own methods for encoding and decoding hex strings instead of using a well-known and tested library?
new String(txtInByte, "UTF-8");
Without the encoding, the platform encoding is taken, for instance Windows-1252. The same holds for its inverse, String.getBytes:
String s = "....";
byte[] b = s.getBytes("UTF-8");
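If you do want to roll your own, here is a hedged sketch of a correct round trip (helper names are my own; note Integer.parseInt accepts the pairs 80 to FF that make Byte.parseByte throw):
static byte[] hexToBytes(String hex) {
    byte[] out = new byte[hex.length() / 2];
    for (int i = 0; i < out.length; i++) {
        out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
    }
    return out;
}

static String bytesToHex(byte[] bytes) {
    StringBuilder sb = new StringBuilder(bytes.length * 2);
    for (byte b : bytes) {
        sb.append(String.format("%02X", b)); // always two digits per byte
    }
    return sb.toString();
}

// Round trip, naming UTF-8 explicitly on both sides:
String hex = bytesToHex("señor".getBytes(StandardCharsets.UTF_8));
String back = new String(hexToBytes(hex), StandardCharsets.UTF_8);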
I'm trying to recognize a BOM for UTF-8 when reading a file. Of course, Java likes to deal with 16-bit chars, while the BOM is a sequence of 8-bit bytes.
My test code looks like:
public void testByteOrderMarks() {
    System.out.println("test byte order marks");
    byte[] bytes = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, (byte) 'a', (byte) 'b', (byte) 'c'};
    String test = new String(bytes, Charset.availableCharsets().get("UTF-8"));
    System.out.printf("test len: %s value %s\n", test.length(), test);
    String three = test.substring(0, 3);
    System.out.printf("len %d >%s<\n", three.length(), three);
    for (int i = 0; i < test.length(); i++) {
        byte b = bytes[i];
        char c = test.charAt(i);
        System.out.printf("b: %s %x c: %s %x\n", (char) b, b, c, (int) c);
    }
}
and the result is:
test byte order marks
test len: 4 value ?abc
len 3 >?ab<
b: ? ef c: ? feff
b: ? bb c: a 61
b: ? bf c: b 62
b: a 61 c: c 63
I can't figure out why the length of "test" is 4 and not 6.
I can't figure out why I don't pick up each 8-bit byte for the comparison.
Thanks
Don't use characters when trying to detect the BOM. The BOM is two bytes (UTF-16) or three bytes (UTF-8) at the start of the stream, so you should open a (File)InputStream, read those first bytes, and process them.
Incidentally, the XML header (<?xml version=... encoding=...?>) is pure ASCII, so it's safe to load it as a byte stream, too (well, unless there is a BOM indicating that the file is saved with 16-bit characters instead of UTF-8).
My solution (see DecentXML's XMLInputStreamReader) is to load the first few bytes of the file and analyze them. That gives me enough information to create a properly decoding Reader out of an InputStream.
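As a sketch of that byte-level approach (my own code, not DecentXML's): peek at the first three bytes with a PushbackInputStream and push them back if they are not the UTF-8 BOM (EF BB BF):
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

static InputStream skipUtf8Bom(InputStream in) throws IOException {
    PushbackInputStream pb = new PushbackInputStream(in, 3);
    byte[] bom = new byte[3];
    int n = pb.read(bom, 0, 3);
    if (n == 3 && bom[0] == (byte) 0xEF && bom[1] == (byte) 0xBB && bom[2] == (byte) 0xBF) {
        return pb; // BOM found and consumed
    }
    if (n > 0) {
        pb.unread(bom, 0, n); // not a BOM: push the bytes back
    }
    return pb;
}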
A character is a character. The Byte Order Mark is the Unicode character U+FEFF. In Java it is the character '\uFEFF'. There is no need to delve into bytes. Just read the first character of the file, and if it matches '\uFEFF' it is the BOM. If it doesn't match then the file was written without a BOM.
private final static char BOM = '\uFEFF'; // Unicode Byte Order Mark

String firstLine = readFirstLineOfFile("filename.txt");
if (firstLine.charAt(0) == BOM) {
    // We have a BOM
} else {
    // No BOM present.
}
If you want to recognize a file's encoding (BOM or no BOM), a better solution (and one that works for me) is to use Mozilla's encoding detector library: http://code.google.com/p/juniversalchardet/
That page describes how to use it:
import org.mozilla.universalchardet.UniversalDetector;
public class TestDetector {
    public static void main(String[] args) throws java.io.IOException {
        byte[] buf = new byte[4096];
        String fileName = "testFile.";
        java.io.FileInputStream fis = new java.io.FileInputStream(fileName);

        // (1)
        UniversalDetector detector = new UniversalDetector(null);

        // (2)
        int nread;
        while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
            detector.handleData(buf, 0, nread);
        }
        // (3)
        detector.dataEnd();

        // (4)
        String encoding = detector.getDetectedCharset();
        if (encoding != null) {
            System.out.println("Detected encoding = " + encoding);
        } else {
            System.out.println("No encoding detected.");
        }

        // (5)
        detector.reset();
    }
}
If you are using Maven, the dependency is:
<dependency>
<groupId>com.googlecode.juniversalchardet</groupId>
<artifactId>juniversalchardet</artifactId>
<version>1.0.3</version>
</dependency>
I'm looking to convert a Java char array to a byte array without creating an intermediate String, as the char array contains a password. I've looked up a couple of methods, but they all seem to fail:
char[] password = "password".toCharArray();
byte[] passwordBytes1 = new byte[password.length*2];
ByteBuffer.wrap(passwordBytes1).asCharBuffer().put(password);
byte[] passwordBytes2 = new byte[password.length * 2];
for (int i = 0; i < password.length; i++) {
    passwordBytes2[2 * i] = (byte) ((password[i] & 0xFF00) >> 8);
    passwordBytes2[2 * i + 1] = (byte) (password[i] & 0x00FF);
}
String passwordAsString = new String(password);
String passwordBytes1AsString = new String(passwordBytes1);
String passwordBytes2AsString = new String(passwordBytes2);
System.out.println(passwordAsString);
System.out.println(passwordBytes1AsString);
System.out.println(passwordBytes2AsString);
assertTrue(passwordAsString.equals(passwordBytes1AsString) || passwordAsString.equals(passwordBytes2AsString));
The assertion always fails (and, critically, when the code is used in production, the password is rejected), yet the print statements print "password" three times. Why are passwordBytes1AsString and passwordBytes2AsString different from passwordAsString, yet appear identical? Am I missing a null terminator or something? What can I do to make the conversion and back-conversion work?
Conversion between char and byte is character-set encoding and decoding. I prefer to make it as explicit as possible in code; it doesn't really mean extra code volume:
Charset latin1Charset = Charset.forName("ISO-8859-1");
charBuffer = latin1Charset.decode(ByteBuffer.wrap(byteArray)); // bytes -> chars (decode)
byteBuffer = latin1Charset.encode(charBuffer);                 // chars -> bytes (encode)
Aside:
java.nio classes and java.io Reader/Writer classes use ByteBuffer and CharBuffer (which use byte[] and char[] as backing arrays), so it is often preferable to use these classes directly. However, you can always do:
byteArray = byteBuffer.array();   byteBuffer = ByteBuffer.wrap(byteArray);
byteBuffer.get(byteArray);        byteBuffer.put(byteArray);
charArray = charBuffer.array();   charBuffer = CharBuffer.wrap(charArray);
charBuffer.get(charArray);        charBuffer.put(charArray);
Original Answer
public byte[] charsToBytes(char[] chars) {
    Charset charset = Charset.forName("UTF-8");
    ByteBuffer byteBuffer = charset.encode(CharBuffer.wrap(chars));
    return Arrays.copyOf(byteBuffer.array(), byteBuffer.limit());
}

public char[] bytesToChars(byte[] bytes) {
    Charset charset = Charset.forName("UTF-8");
    CharBuffer charBuffer = charset.decode(ByteBuffer.wrap(bytes));
    return Arrays.copyOf(charBuffer.array(), charBuffer.limit());
}
Edited to use StandardCharsets
public byte[] charsToBytes(char[] chars)
{
final ByteBuffer byteBuffer = StandardCharsets.UTF_8.encode(CharBuffer.wrap(chars));
return Arrays.copyOf(byteBuffer.array(), byteBuffer.limit());
}
public char[] bytesToChars(byte[] bytes)
{
final CharBuffer charBuffer = StandardCharsets.UTF_8.decode(ByteBuffer.wrap(bytes));
return Arrays.copyOf(charBuffer.array(), charBuffer.limit());
}
Here is a JavaDoc page for StandardCharsets.
Note this on the JavaDoc page:
These charsets are guaranteed to be available on every implementation of the Java platform.
The problem is your use of the String(byte[]) constructor, which uses the platform default encoding. That's almost never what you should be doing: if you pass "UTF-16" as the character encoding to both conversions, your tests will probably pass. Currently I suspect that passwordBytes1AsString and passwordBytes2AsString are each 16 characters long, with every other character being U+0000.
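To illustrate the point, a small sketch (my own, relying on ByteBuffer.wrap's default big-endian byte order): the bytes produced by asCharBuffer().put(...) decode cleanly as UTF-16BE:
char[] password = "password".toCharArray();
byte[] packed = new byte[password.length * 2];
ByteBuffer.wrap(packed).asCharBuffer().put(password); // big-endian by default
String roundTripped = new String(packed, StandardCharsets.UTF_16BE);
// roundTripped.equals("password") is true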
What I would do is use one loop to convert to bytes and another to convert back to chars:
char[] chars = "password".toCharArray();

byte[] bytes = new byte[chars.length * 2];
for (int i = 0; i < chars.length; i++) {
    bytes[i * 2] = (byte) (chars[i] >> 8);
    bytes[i * 2 + 1] = (byte) chars[i];
}

char[] chars2 = new char[bytes.length / 2];
for (int i = 0; i < chars2.length; i++) {
    chars2[i] = (char) ((bytes[i * 2] << 8) + (bytes[i * 2 + 1] & 0xFF));
}

String password = new String(chars2);
If you want to use a ByteBuffer and CharBuffer, don't just call .asCharBuffer(), which simply does a UTF-16 conversion (LE or BE depending on the buffer's byte order, which you can set with the order method), since Java Strings, and thus your char[], use this encoding internally.
Use Charset.forName(charsetName) and then its encode or decode method, or its newEncoder/newDecoder.
When converting your byte[] to a String, you should also indicate the encoding (and it should be the same one).
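For example, a brief sketch of keeping both directions symmetric (my own illustration; assumes the usual java.nio and java.nio.charset imports):
Charset utf8 = StandardCharsets.UTF_8;
ByteBuffer encoded = utf8.encode(CharBuffer.wrap(new char[] {'a', 'ñ'}));
byte[] bytes = new byte[encoded.remaining()];
encoded.get(bytes);
String back = new String(bytes, utf8); // decode with the same charset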
This is an extension to Peter Lawrey's answer. In order for the backward (bytes-to-chars) conversion to work correctly for the whole range of chars, the code should be as follows:
char[] chars = new char[bytes.length/2];
for (int i = 0; i < chars.length; i++) {
chars[i] = (char) (((bytes[i*2] & 0xff) << 8) + (bytes[i*2+1] & 0xff));
}
We need to "unsign" the bytes before use (& 0xff); otherwise half of all possible char values will not come back correctly. For instance, chars within the [0x80..0xff] range will be affected.
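To see the pitfall concretely, a tiny check (my own example; 'é' is U+00E9):
char original = 'é';
byte hi = (byte) (original >> 8); // 0x00
byte lo = (byte) original;        // 0xE9, i.e. -23 as a signed byte
char bad = (char) ((hi << 8) + lo);                    // sign-extends to '\uffe9'
char good = (char) (((hi & 0xff) << 8) + (lo & 0xff)); // 'é' again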
You should make use of getBytes() instead of toCharArray(). Replace the line
Replace the line
char[] password = "password".toCharArray();
with
byte[] password = "password".getBytes();
When you call getBytes() on a String in Java without an argument, the result depends on your computer's default encoding (e.g. StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1, etc.).
So whenever you want to get bytes from a String object, make sure to give an encoding, like:
String sample = "abc";
byte[] a_byte = sample.getBytes(StandardCharsets.UTF_8);
Let's check what happens in this code.
In Java, the String sample is stored in UTF-16: every char in the String takes 2 bytes.
sample: value "abc", in memory (hex): 00 61 00 62 00 63
a -> 00 61
b -> 00 62
c -> 00 63
But when we getBytes from the String, we get:
byte[] a_byte = sample.getBytes(StandardCharsets.UTF_8);
// result: 61 62 63
// length: 3 bytes
byte[] a_byte = sample.getBytes(StandardCharsets.UTF_16BE);
// result: 00 61 00 62 00 63
// length: 6 bytes
To get the original bytes of the String, we can just read the string's in-memory representation and take each byte. Below is sample code:
public static byte[] charArray2ByteArray(char[] chars) {
    int length = chars.length;
    byte[] result = new byte[length * 2]; // two bytes per char
    int i = 0;
    for (int j = 0; j < chars.length; j++) {
        result[i++] = (byte) ((chars[j] & 0xFF00) >> 8); // high byte
        result[i++] = (byte) (chars[j] & 0x00FF);        // low byte
    }
    return result;
}
Usage:
String sample = "abc";
// First get the chars of the String; each char is two bytes (in Java).
char[] sample_chars = sample.toCharArray();
// Get the bytes.
byte[] result = charArray2ByteArray(sample_chars);
// Back to String. Make sure to use UTF_16BE, because we read the memory
// of the String from left to right, which matches the byte order of UTF-16BE.
String sample_back = new String(result, StandardCharsets.UTF_16BE);
I have a program that handles byte arrays in Java, and now I would like to write them into an XML file. However, I am unsure how to convert the following byte array into a sensible String to write to a file. Assuming they were Unicode characters, I attempted the following code:
String temp = new String(encodedBytes, "UTF-8");
Only to have the debugger show that the encoded bytes decode to "\ufffd\ufffd ^\ufffd\ufffd-m\ufffd\ufffd\ufffd \ufffd\ufffdIA\ufffd\ufffd". The String should contain a hash in alphanumeric format.
How would I turn the above String into a sensible String for output?
The byte array doesn't look like UTF-8. Note that \ufffd (named REPLACEMENT CHARACTER) is "used to replace an incoming character whose value is unknown or unrepresentable in Unicode."
Addendum: Here's a simple example of how this can happen. When cast to a byte, the code point for ñ is valid in neither UTF-8 nor US-ASCII, but it is valid ISO-8859-1. In effect, you have to know what the bytes represent before you can decode them into a String.
public class Hello {
    public static void main(String[] args)
            throws java.io.UnsupportedEncodingException {
        String s = "Hola, señor!";
        System.out.println(s);
        byte[] b = new byte[s.length()];
        for (int i = 0; i < b.length; i++) {
            int cp = s.codePointAt(i);
            b[i] = (byte) cp;
            System.out.print((byte) cp + " ");
        }
        System.out.println();
        System.out.println(new String(b, "UTF-8"));
        System.out.println(new String(b, "US-ASCII"));
        System.out.println(new String(b, "ISO-8859-1"));
    }
}
Output:
Hola, señor!
72 111 108 97 44 32 115 101 -15 111 114 33
Hola, se�or!
Hola, se�or!
Hola, señor!
If your string is the output of a password hashing scheme (which it looks like it might be), then I think you will need to Base64-encode it to put it into plain text.
Standard procedure, if you have raw bytes you want to output to a text file, is to use Base64 encoding. The Commons Codec library provides a Base64 encoder/decoder for you to use.
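For example, a minimal sketch using the JDK's built-in codec (java.util.Base64, available since Java 8), if you'd rather avoid the extra dependency:
byte[] raw = "example hash bytes".getBytes(java.nio.charset.StandardCharsets.UTF_8);
String encoded = java.util.Base64.getEncoder().encodeToString(raw); // safe to embed in XML
byte[] decoded = java.util.Base64.getDecoder().decode(encoded);     // round trip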
Hope this helps.