I'm looking to convert a Java char array to a byte array without creating an intermediate String, as the char array contains a password. I've looked up a couple of methods, but they all seem to fail:
char[] password = "password".toCharArray();
byte[] passwordBytes1 = new byte[password.length*2];
ByteBuffer.wrap(passwordBytes1).asCharBuffer().put(password);
byte[] passwordBytes2 = new byte[password.length*2];
for (int i = 0; i < password.length; i++) {
    passwordBytes2[2*i]   = (byte) ((password[i] & 0xFF00) >> 8);
    passwordBytes2[2*i+1] = (byte) (password[i] & 0x00FF);
}
String passwordAsString = new String(password);
String passwordBytes1AsString = new String(passwordBytes1);
String passwordBytes2AsString = new String(passwordBytes2);
System.out.println(passwordAsString);
System.out.println(passwordBytes1AsString);
System.out.println(passwordBytes2AsString);
assertTrue(passwordAsString.equals(passwordBytes1AsString) || passwordAsString.equals(passwordBytes2AsString));
The assertion always fails (and, critically, when the code is used in production, the password is rejected), yet the print statements print out password three times. Why are passwordBytes1AsString and passwordBytes2AsString different from passwordAsString, yet appear identical? Am I missing a null terminator or something? What can I do to make the conversion and unconversion work?
Conversion between char and byte is character set encoding and decoding. I prefer to make that as explicit as possible in code, and it doesn't really add code volume:
Charset latin1Charset = Charset.forName("ISO-8859-1");
charBuffer = latin1Charset.decode(ByteBuffer.wrap(byteArray)); // can also decode to a String
byteBuffer = latin1Charset.encode(charBuffer); // can also encode from a String
Aside:
java.nio classes and java.io Reader/Writer classes use ByteBuffer & CharBuffer (which use byte[] and char[] as backing arrays), so it's often preferable to use these classes directly. However, you can always convert between buffers and arrays:
byteArray = byteBuffer.array();    byteBuffer = ByteBuffer.wrap(byteArray);
byteBuffer.get(byteArray);         byteBuffer.put(byteArray);
charArray = charBuffer.array();    charBuffer = CharBuffer.wrap(charArray);
charBuffer.get(charArray);         charBuffer.put(charArray);
Original Answer
public byte[] charsToBytes(char[] chars){
    Charset charset = Charset.forName("UTF-8");
    ByteBuffer byteBuffer = charset.encode(CharBuffer.wrap(chars));
    return Arrays.copyOf(byteBuffer.array(), byteBuffer.limit());
}

public char[] bytesToChars(byte[] bytes){
    Charset charset = Charset.forName("UTF-8");
    CharBuffer charBuffer = charset.decode(ByteBuffer.wrap(bytes));
    return Arrays.copyOf(charBuffer.array(), charBuffer.limit());
}
Edited to use StandardCharsets
public byte[] charsToBytes(char[] chars)
{
    final ByteBuffer byteBuffer = StandardCharsets.UTF_8.encode(CharBuffer.wrap(chars));
    return Arrays.copyOf(byteBuffer.array(), byteBuffer.limit());
}

public char[] bytesToChars(byte[] bytes)
{
    final CharBuffer charBuffer = StandardCharsets.UTF_8.decode(ByteBuffer.wrap(bytes));
    return Arrays.copyOf(charBuffer.array(), charBuffer.limit());
}
Here is a JavaDoc page for StandardCharsets.
Note this on the JavaDoc page:
These charsets are guaranteed to be available on every implementation of the Java platform.
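Since the point of keeping a password in a char[] is being able to wipe it afterwards, here is a possible usage sketch for the methods above; the Arrays.fill wiping and the getPasswordFromUi source are my additions, not part of the original answer:
char[] password = getPasswordFromUi(); // hypothetical source; avoid String literals for real secrets
byte[] passwordBytes = charsToBytes(password);
try {
    // ... hand passwordBytes to the crypto/authentication API ...
} finally {
    Arrays.fill(password, '\0');          // wipe both copies of the secret
    Arrays.fill(passwordBytes, (byte) 0);
}
Note that the temporary buffer allocated inside charsToBytes still holds a copy of the bytes; for strict handling you would zero byteBuffer.array() before returning from it as well.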
The problem is your use of the String(byte[]) constructor, which uses the platform default encoding. That's almost never what you should be doing - if you pass "UTF-16" as the character encoding, your tests will probably pass. Currently I suspect that passwordBytes1AsString and passwordBytes2AsString are each 16 characters long, with every other character being U+0000.
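A minimal sketch of that fix, assuming the arrays from the question (both happen to hold big-endian UTF-16, since a wrapped ByteBuffer defaults to big-endian byte order):
String s1 = new String(passwordBytes1, StandardCharsets.UTF_16BE);
String s2 = new String(passwordBytes2, StandardCharsets.UTF_16BE);
System.out.println(s1.equals(passwordAsString)); // true
System.out.println(s2.equals(passwordAsString)); // true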
What I would do is use a loop to convert to bytes and another to convert back to chars.
char[] chars = "password".toCharArray();
byte[] bytes = new byte[chars.length*2];
for (int i = 0; i < chars.length; i++) {
    bytes[i*2]   = (byte) (chars[i] >> 8);
    bytes[i*2+1] = (byte) chars[i];
}
char[] chars2 = new char[bytes.length / 2];
for (int i = 0; i < chars2.length; i++)
    chars2[i] = (char) ((bytes[i*2] << 8) + (bytes[i*2+1] & 0xFF));
String password = new String(chars2);
If you want to use a ByteBuffer and CharBuffer, don't do the simple .asCharBuffer(), which simply does a UTF-16 conversion (big-endian by default; you can set the byte order with the order method), since Java Strings, and thus your char[], use this encoding internally.
Use Charset.forName(charsetName), and then its encode or decode method, or the newEncoder/newDecoder methods.
When converting your byte[] to a String, you should also indicate the encoding (and it should be the same one).
This is an extension to Peter Lawrey's answer. In order for the backward (bytes-to-chars) conversion to work correctly for the whole range of chars, the code should be as follows:
char[] chars = new char[bytes.length / 2];
for (int i = 0; i < chars.length; i++) {
    chars[i] = (char) (((bytes[i*2] & 0xff) << 8) + (bytes[i*2+1] & 0xff));
}
We need to "unsign" the bytes before using them (& 0xff); otherwise half of all possible char values will not come back correctly. For instance, chars within the [0x80..0xff] range will be affected.
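A quick sketch of what goes wrong without the mask, using 'è' (U+00E8), whose low byte is negative as a Java byte:
byte hi = 0x00, lo = (byte) 0xE8;                       // 'è' (U+00E8) split into two bytes
char wrong = (char) ((hi << 8) + lo);                   // lo sign-extends: yields U+FFE8
char right = (char) (((hi & 0xff) << 8) + (lo & 0xff)); // masking keeps the byte values unsigned
System.out.println(Integer.toHexString(wrong)); // ffe8
System.out.println(Integer.toHexString(right)); // e8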
You should make use of getBytes() instead of toCharArray()
Replace the line
char[] password = "password".toCharArray();
with
byte[] password = "password".getBytes();
When you call getBytes() on a String in Java without an argument, the result depends on the default charset of your platform (e.g. StandardCharsets.UTF_8 or StandardCharsets.ISO_8859_1, etc.).
So whenever you want to get bytes from a String object, make sure to specify an encoding, like:
String sample = "abc";
byte[] a_byte = sample.getBytes(StandardCharsets.UTF_8);
Let's check what happens in this code.
In Java, the String named sample is stored as UTF-16: every char in the String occupies 2 bytes.
sample: value "abc", in memory (hex): 00 61 00 62 00 63
a -> 00 61
b -> 00 62
c -> 00 63
But when we get bytes from the String, we have:
byte[] a_byte = sample.getBytes(StandardCharsets.UTF_8)
//result is : 61 62 63
//length: 3 bytes
byte[] a_byte = sample.getBytes(StandardCharsets.UTF_16BE)
//result is : 00 61 00 62 00 63
//length: 6 bytes
In order to get the original bytes of the String, we can just read the memory of the String and take each byte. Below is sample code:
public static byte[] charArray2ByteArray(char[] chars) {
    byte[] result = new byte[chars.length * 2]; // two bytes per char
    int i = 0;
    for (int j = 0; j < chars.length; j++) {
        result[i++] = (byte) ((chars[j] & 0xFF00) >> 8); // high byte first
        result[i++] = (byte) (chars[j] & 0x00FF);        // then the low byte
    }
    return result;
}
Usages:
String sample = "abc";
//First get the chars of the String,each char has two bytes(Java).
Char[] sample_chars = sample.toCharArray();
//Get the bytes
byte[] result = charArray2ByteArray(sample_chars).
//Back to String.
//Make sure we use UTF_16BE. Because we read the memory of Unicode of
//the String from Left to right. That's the same reading
//sequece of UTF-16BE.
String sample_back= new String(result , StandardCharsets.UTF_16BE);
Related
How can I read NUL-terminated UTF-8 string from Java ByteBuffer starting at ByteBuffer#position()?
ByteBuffer b = /* 61 62 63 64 00 31 32 34 00 (hex) */;
String s0 = /* read first string */;
String s1 = /* read second string */;
// `s0` will now contain "abcd" and `s1` will contain "124".
I have already tried using Charsets.UTF_8.decode(b) but it seems this function ignores the current ByteBuffer position and reads until the end of the buffer.
Is there more idiomatic way to read such string from byte buffer than seeking for byte containing 0 and the limiting the buffer to it (or copying the part with string into separate buffer)?
If by "idiomatic" you mean a one-liner, not that I know of (unsurprising, since NUL-terminated strings are not part of the Java spec).
The first thing I came up with is using b.slice().limit(x) to create a lightweight view onto the desired bytes only (better than copying them anywhere, as you might be able to work with the buffer directly):
ByteBuffer b = ByteBuffer.wrap(new byte[] {0x61, 0x62, 0x63, 0x64, 0x00, 0x31, 0x32, 0x34, 0x00 });
int i;
while (b.hasRemaining()) {
    ByteBuffer nextString = b.slice(); // view on b with the same start position
    for (i = 0; b.hasRemaining() && b.get() != 0x00; i++) {
        // count bytes up to the next NUL
    }
    nextString.limit(i); // the view now stops before the NUL
    CharBuffer s = StandardCharsets.UTF_8.decode(nextString);
    System.out.println(s);
}
In Java the char \u0000 (the UTF-8 byte 0, the Unicode code point U+0000) is a normal char. So read everything (maybe into an overlarge byte array), and do:
String s = new String(bytes, StandardCharsets.UTF_8);
String[] s0s1 = s.split("\u0000");
String s0 = s0s1[0];
String s1 = s0s1[1];
If you do not have fixed positions and must sequentially read every byte, the code gets ugly. One of the C founders indeed called the NUL-terminated string a historic mistake.
For the reverse direction, when a Java String must not produce a UTF-8 zero byte (typically for further processing as a C/C++ NUL-terminated string), there exists modified UTF-8, which also encodes the zero byte as a multi-byte sequence.
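As an illustration (my sketch, not part of the answer above): DataOutputStream.writeUTF emits modified UTF-8, prefixed with a two-byte length, in which U+0000 is encoded as the pair 0xC0 0x80, so no zero byte appears in the payload:
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class ModifiedUtf8Demo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new DataOutputStream(bos).writeUTF("a\u0000b"); // a string containing U+0000
        for (byte b : bos.toByteArray()) {
            System.out.printf("%02x ", b); // prints: 00 04 61 c0 80 62
        }
    }
}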
You can do it with the replace and split functions: decode your bytes to a String, replace each 0 character with a custom character, then split the string on that custom character.
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Jtest {

    public static void main(String[] args) {
        //ByteBuffer b = /* 61 62 63 64 00 31 32 34 00 (hex) */;
        ByteBuffer b = ByteBuffer.allocate(10);
        b.put((byte) 0x61);
        b.put((byte) 0x62);
        b.put((byte) 0x63);
        b.put((byte) 0x64);
        b.put((byte) 0x00);
        b.put((byte) 0x31);
        b.put((byte) 0x32);
        b.put((byte) 0x34);
        b.put((byte) 0x00);
        b.flip(); // flip rather than rewind, so the limit excludes the unwritten tail

        // print the ByteBuffer
        System.out.println("Original ByteBuffer: " + Arrays.toString(b.array()));

        // `s0` will now contain "abcd" and `s1` will contain "124".
        String s = StandardCharsets.UTF_8.decode(b).toString();
        String ss = s.replace((char) 0, ';');
        String[] words = ss.split(";");
        String s0 = words[0];
        String s1 = words[1];
        for (int i = 0; i < words.length; i++) {
            System.out.println(" Word " + i + " = " + words[i]);
        }
    }
}
I believe you can do it more efficiently by removing the replace, as sketched below.
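For instance, a sketch of that shortcut, splitting on the NUL character directly (my assumption of what was meant):
String s = StandardCharsets.UTF_8.decode(b).toString();
String[] words = s.split("\u0000"); // no intermediate replace needed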
I'm using this code to convert a UTF-8 String to binary:
public String toBinary(String str) {
    byte[] buf = str.getBytes(StandardCharsets.UTF_8);
    StringBuilder result = new StringBuilder();
    for (int i = 0; i < buf.length; i++) {
        int ch = (int) buf[i];
        String binary = Integer.toBinaryString(ch);
        result.append(("00000000" + binary).substring(binary.length()));
        result.append(' ');
    }
    return result.toString().trim();
}
Before I was using this code:
private String toBinary2(String str) {
    StringBuilder result = new StringBuilder();
    for (int i = 0; i < str.length(); i++) {
        int ch = (int) str.charAt(i);
        String binary = Integer.toBinaryString(ch);
        if (ch < 256)
            result.append(("00000000" + binary).substring(binary.length()));
        else {
            binary = ("0000000000000000" + binary).substring(binary.length());
            result.append(binary.substring(0, 8));
            result.append(' ');
            result.append(binary.substring(8));
        }
        result.append(' ');
    }
    return result.toString().trim();
}
These two methods can return different results; for example:
toBinary("è") = "11000011 10101000"
toBinary2("è") = "11101000"
I think that's because the bytes of è are negative, while the corresponding char is not (because a char is a 2-byte unsigned integer).
What I want to know is: which of the two approaches is the correct one and why?
Thanks in advance.
Whenever you want to convert text into binary data (or into text representing binary data, as you do here) you have to use some encoding.
Your toBinary uses UTF-8 for that encoding.
Your toBinary2 uses something that's not a standard encoding: it encodes every UTF-16 codepoint* below 256 in a single byte and all others in 2 bytes. Unfortunately, that is not a useful encoding, since for decoding you would have to know whether a single byte stands alone or is part of a 2-byte sequence (UTF-8 and UTF-16 indicate which it is with their high-order bits).
tl;dr toBinary seems correct, toBinary2 will produce output that can't uniquely be decoded back to the original string.
* You might be wondering where the mention of UTF-16 comes from: all String objects in Java are implicitly encoded in UTF-16. So if you use charAt you get UTF-16 code units (which just so happen to be equal to the Unicode code points for all characters that fit into the Basic Multilingual Plane).
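A small sketch of that caveat, using a character outside the Basic Multilingual Plane:
String clef = "\uD834\uDD1E";             // U+1D11E MUSICAL SYMBOL G CLEF, stored as a surrogate pair
System.out.println(clef.length());        // 2 -- two UTF-16 code units
System.out.println((int) clef.charAt(0)); // 55348 -- a lone high surrogate, not a full character
System.out.println(clef.codePointAt(0));  // 119070 -- the actual Unicode code point (0x1D11E)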
This code snippet might help.
String s = "Some String";
byte[] bytes = s.getBytes();
StringBuilder binary = new StringBuilder();
for(byte b:bytes){
int val =b;
for(int i=;i<=s.length;i++){
binary.append((val & 128) == 0 ? 0 : 1);
val<<=1;
}
}
System.out.println(" "+s+ "to binary" +binary);
I have a hex string (sA) converted from a UTF-8 string.
When I convert the hex string sA back to a UTF-8 string, I can't show it in the form UI in build mode (running the .jar file), but when I run in run mode or debug mode, the UTF-8 string shows in the form UI correctly.
I use NetBeans IDE 7.3.1.
My code is below:
public String hexToString(String txtInHex) {
    byte[] txtInByte = new byte[txtInHex.length() / 2];
    int j = 0;
    for (int i = 0; i < txtInHex.length(); i += 2) {
        txtInByte[j++] = Byte.parseByte(txtInHex.substring(i, i + 2), 16);
    }
    return new String(txtInByte);
}

private String asHex(byte[] buf) {
    char[] chars = new char[2 * buf.length];
    for (int i = 0; i < buf.length; ++i) {
        chars[2 * i] = HEX_CHARS[(buf[i] & 0xF0) >>> 4];
        chars[2 * i + 1] = HEX_CHARS[buf[i] & 0x0F];
    }
    return new String(chars);
}
There are multiple problems with this code.
The valid range for byte values is -128 to 127 (-80 to 7F in hex), and Byte.parseByte enforces this. If your asHex method has to process a byte whose unsigned value is greater than 127, it will produce hex digits (80 to FF) that can't be decoded by hexToString.
The asHex method processes only the second byte of the input characters, so it will work correctly only for the first 256 Unicode characters and produce bogus output for the rest of them.
The hexToString method decodes a string from a byte array assuming the platform-specific default encoding, which will give incorrect results if the data was encoded in UTF-8 and the default encoding is something else.
Why are you trying to create your own methods for encoding and decoding hex strings instead of using a well known and tested library?
new String(txtInByte, "UTF-8");
Without the encoding, the platform encoding is used, for instance Windows-1252. The same holds for its inverse, String.getBytes:
String s = "....";
byte[] b = s.getBytes("UTF-8");
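Putting the two fixes together, a sketch of a corrected decoder for the code above (Integer.parseInt accepts the full 00 to FF range, unlike Byte.parseByte, and the charset is explicit):
public String hexToString(String txtInHex) {
    byte[] txtInByte = new byte[txtInHex.length() / 2];
    for (int i = 0; i < txtInByte.length; i++) {
        txtInByte[i] = (byte) Integer.parseInt(txtInHex.substring(2 * i, 2 * i + 2), 16);
    }
    return new String(txtInByte, StandardCharsets.UTF_8);
}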
I have a program that handles byte arrays in Java, and now I would like to write this into a XML file. However, I am unsure as to how I can convert the following byte array into a sensible String to write to a file. Assuming that it was Unicode characters I attempted the following code:
String temp = new String(encodedBytes, "UTF-8");
Only to have the debugger show that the encodedBytes contain "\ufffd\ufffd ^\ufffd\ufffd-m\ufffd\ufffd\ufffd \ufffd\ufffdIA\ufffd\ufffd". The String should contain a hash in alphanumeric format.
How would I turn the above String into a sensible String for output?
The byte array doesn't look like UTF-8. Note that \ufffd (named REPLACEMENT CHARACTER) is "used to replace an incoming character whose value is unknown or unrepresentable in Unicode."
Addendum: Here's a simple example of how this can happen. When cast to a byte, the code point for ñ is neither UTF-8 nor US-ASCII; but it is valid ISO-8859-1. In effect, you have to know what the bytes represent before you can encode them into a String.
public class Hello {
    public static void main(String[] args)
            throws java.io.UnsupportedEncodingException {
        String s = "Hola, señor!";
        System.out.println(s);
        byte[] b = new byte[s.length()];
        for (int i = 0; i < b.length; i++) {
            int cp = s.codePointAt(i);
            b[i] = (byte) cp;
            System.out.print((byte) cp + " ");
        }
        System.out.println();
        System.out.println(new String(b, "UTF-8"));
        System.out.println(new String(b, "US-ASCII"));
        System.out.println(new String(b, "ISO-8859-1"));
    }
}
Output:
Hola, señor!
72 111 108 97 44 32 115 101 -15 111 114 33
Hola, se�or!
Hola, se�or!
Hola, señor!
If your string is the output of a password hashing scheme (which it looks like it might be), then I think you will need to Base64-encode it in order to put it into plain text.
Standard procedure, if you have raw bytes you want to output to a text file, is to use Base 64 encoding. The Commons Codec library provides a Base64 encoder / decoder for you to use.
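On Java 8 and later, the built-in java.util.Base64 does the same job without an extra dependency; a sketch with arbitrary example bytes:
import java.util.Base64;

public class Base64Demo {
    public static void main(String[] args) {
        byte[] hash = { (byte) 0x9f, (byte) 0x86, (byte) 0xd0, 0x42 }; // arbitrary raw bytes
        String text = Base64.getEncoder().encodeToString(hash);        // safe to embed in XML
        byte[] back = Base64.getDecoder().decode(text);                // restores the exact bytes
        System.out.println(text);
    }
}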
Hope this helps.
In Java, I have a String and I want to encode it as a byte array (in UTF8, or some other encoding). Alternately, I have a byte array (in some known encoding) and I want to convert it into a Java String. How do I do these conversions?
Convert from String to byte[]:
String s = "some text here";
byte[] b = s.getBytes(StandardCharsets.UTF_8);
Convert from byte[] to String:
byte[] b = {(byte) 99, (byte)97, (byte)116};
String s = new String(b, StandardCharsets.US_ASCII);
You should, of course, use the correct encoding name. My examples used US-ASCII and UTF-8, two commonly-used encodings.
Here's a solution that avoids performing the Charset lookup for every conversion:
import java.nio.charset.Charset;
private final Charset UTF8_CHARSET = Charset.forName("UTF-8");

String decodeUTF8(byte[] bytes) {
    return new String(bytes, UTF8_CHARSET);
}

byte[] encodeUTF8(String string) {
    return string.getBytes(UTF8_CHARSET);
}
String original = "hello world";
byte[] utf8Bytes = original.getBytes("UTF-8");
You can convert directly via the String(byte[], String) constructor and getBytes(String) method. Java exposes available character sets via the Charset class. The JDK documentation lists supported encodings.
90% of the time, such conversions are performed on streams, so you'd use the Reader/Writer classes. You should not incrementally decode arbitrary byte streams using the String methods - you would leave yourself open to bugs involving multibyte characters.
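For example, a sketch of stream-based decoding (my illustration): InputStreamReader buffers partial multibyte sequences across read() calls, which naive chunk-by-chunk new String(...) conversion would corrupt:
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class StreamDecode {
    public static void main(String[] args) throws IOException {
        byte[] data = "héllo wörld".getBytes(StandardCharsets.UTF_8);
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            char[] buf = new char[4]; // deliberately small: multibyte sequences may straddle chunks
            int n;
            while ((n = r.read(buf)) != -1) {
                System.out.print(new String(buf, 0, n)); // already-decoded chars, safe to join
            }
        }
    }
}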
My Tomcat 7 implementation is accepting strings as ISO-8859-1, despite the content type of the HTTP request. The following solution worked for me when trying to correctly interpret characters like 'é'.
byte[] b1 = szP1.getBytes("ISO-8859-1");
System.out.println(Arrays.toString(b1)); // print the byte values, not the array reference
String szUT8 = new String(b1, "UTF-8");
System.out.println(szUT8);
When trying to interpret the string as US-ASCII, the byte info wasn't correctly interpreted.
b1 = szP1.getBytes("US-ASCII");
System.out.println(Arrays.toString(b1));
As an alternative, StringUtils from Apache Commons can be used.
byte[] bytes = {(byte) 1};
String convertedString = StringUtils.newStringUtf8(bytes);
or
String myString = "example";
byte[] convertedBytes = StringUtils.getBytesUtf8(myString);
If you have a non-standard charset, you can use getBytesUnchecked() or newString() accordingly.
I can't comment and don't want to start a new thread, but this isn't working. A simple round trip:
byte[] b = new byte[]{ 0, 0, 0, -127 }; // 0x00000081
String s = new String(b,StandardCharsets.UTF_8); // UTF8 = 0x0000, 0x0000, 0x0000, 0xfffd
b = s.getBytes(StandardCharsets.UTF_8); // [0, 0, 0, -17, -65, -67] 0x000000efbfbd != 0x00000081
I'd need b[] to be the same array before and after encoding, which it isn't (this refers to the first answer).
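That is expected: 0x81 is not a valid UTF-8 sequence, so the decoder substitutes U+FFFD, which re-encodes as EF BF BD. If arbitrary bytes must survive a byte-to-String-to-byte round trip, you need a charset that maps every byte value to a code point; a sketch with ISO-8859-1:
byte[] b = new byte[] { 0, 0, 0, -127 };
String s = new String(b, StandardCharsets.ISO_8859_1);  // every byte maps to U+0000..U+00FF
byte[] b2 = s.getBytes(StandardCharsets.ISO_8859_1);
System.out.println(java.util.Arrays.equals(b, b2));     // true: lossless round trip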
For decoding a series of bytes into a normal string message, I finally got it working with UTF-8 encoding using this code:
/* Convert a list of UTF-8 numbers to a normal String.
 * Useful for decoding a JMS message that is delivered as a sequence of bytes instead of plain text.
 */
public String convertUtf8NumbersToString(String[] numbers) {
    int length = numbers.length;
    byte[] data = new byte[length];
    for (int i = 0; i < length; i++) {
        data[i] = Byte.parseByte(numbers[i]);
    }
    return new String(data, Charset.forName("UTF-8"));
}
If you are using 7-bit ASCII or ISO-8859-1 (an amazingly common format) then you don't have to create a new java.lang.String at all. It's much, much more performant to simply cast the byte to a char:
Full working example:
for (byte b : new byte[] { 43, 45, (byte) 215, (byte) 247 }) {
    char c = (char) (b & 0xFF); // mask to avoid sign extension for byte values >= 0x80
    System.out.print(c);
}
If you are not using extended characters like Ä, Æ, Å, Ç, Ï, Ê, and can be sure that the only transmitted values are among the first 128 Unicode characters, then this code will also work for UTF-8 and extended ASCII (like cp-1252).
Charset UTF8_CHARSET = Charset.forName("UTF-8");
String strISO = "{\"name\":\"א\"}";
System.out.println(strISO);
byte[] b = strISO.getBytes();
for (byte c : b) {
    System.out.print("[" + c + "]");
}
String str = new String(b, UTF8_CHARSET);
System.out.println(str);
Reader reader = new BufferedReader(
new InputStreamReader(
new ByteArrayInputStream(
string.getBytes(StandardCharsets.UTF_8)), StandardCharsets.UTF_8));
//query is your json
DefaultHttpClient httpClient = new DefaultHttpClient();
HttpPost postRequest = new HttpPost("http://my.site/test/v1/product/search?qy=");
StringEntity input = new StringEntity(query, "UTF-8");
input.setContentType("application/json");
postRequest.setEntity(input);
HttpResponse response = httpClient.execute(postRequest);
Terribly late, but I just encountered this issue and this is my fix:
private static String removeNonUtf8CompliantCharacters(final String inString) {
    if (null == inString) return null;
    byte[] byteArr = inString.getBytes();
    for (int i = 0; i < byteArr.length; i++) {
        byte ch = byteArr[i];
        // remove any characters outside the valid UTF-8 range as well as all control characters
        // except tabs and new lines
        if (!((ch > 31 && ch < 253) || ch == '\t' || ch == '\n' || ch == '\r')) {
            byteArr[i] = ' ';
        }
    }
    return new String(byteArr);
}