Convert byte array to understandable String - java

I have a program that handles byte arrays in Java, and now I would like to write this into a XML file. However, I am unsure as to how I can convert the following byte array into a sensible String to write to a file. Assuming that it was Unicode characters I attempted the following code:
String temp = new String(encodedBytes, "UTF-8");
Only to have the debugger show that the encodedBytes decode to "\ufffd\ufffd ^\ufffd\ufffd-m\ufffd\ufffd\ufffd \ufffd\ufffdIA\ufffd\ufffd". The String should contain a hash in alphanumeric format.
How would I turn the above String into a sensible String for output?

The byte array doesn't look like UTF-8. Note that \ufffd (named REPLACEMENT CHARACTER) is "used to replace an incoming character whose value is unknown or unrepresentable in Unicode."
Addendum: Here's a simple example of how this can happen. When cast to a byte, the code point for ñ is neither UTF-8 nor US-ASCII; but it is valid ISO-8859-1. In effect, you have to know what the bytes represent before you can encode them into a String.
public class Hello {
    public static void main(String[] args)
            throws java.io.UnsupportedEncodingException {
        String s = "Hola, señor!";
        System.out.println(s);
        byte[] b = new byte[s.length()];
        for (int i = 0; i < b.length; i++) {
            int cp = s.codePointAt(i);
            b[i] = (byte) cp; // narrowing cast keeps only the low byte
            System.out.print((byte) cp + " ");
        }
        System.out.println();
        System.out.println(new String(b, "UTF-8"));
        System.out.println(new String(b, "US-ASCII"));
        System.out.println(new String(b, "ISO-8859-1"));
    }
}
Output:
Hola, señor!
72 111 108 97 44 32 115 101 -15 111 114 33
Hola, se�or!
Hola, se�or!
Hola, señor!

If your string is the output of a password hashing scheme (which it looks like it might be), then I think you will need to Base64 encode it to put it into plain text.
The standard procedure, if you have raw bytes you want to output to a text file, is Base64 encoding. The Commons Codec library provides a Base64 encoder/decoder for you to use.
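Since Java 8 you don't even need Commons Codec: the JDK ships java.util.Base64. A minimal sketch (class name and the SHA-256 input "hello" are just illustrative stand-ins for the asker's hash):

```java
import java.security.MessageDigest;
import java.util.Base64;

public class HashToText {
    public static void main(String[] args) throws Exception {
        // A hash is raw bytes, not text in any charset.
        byte[] hash = MessageDigest.getInstance("SHA-256")
                                   .digest("hello".getBytes("UTF-8"));
        // Base64 maps arbitrary bytes onto a safe ASCII alphabet for XML/text files.
        String printable = Base64.getEncoder().encodeToString(hash);
        System.out.println(printable);
        // The decode is lossless, so the original hash bytes come back intact.
        byte[] roundTrip = Base64.getDecoder().decode(printable);
        System.out.println(java.util.Arrays.equals(hash, roundTrip)); // true
    }
}
```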
Hope this helps.

Related

Read NUL-terminated String from ByteBuffer

How can I read NUL-terminated UTF-8 string from Java ByteBuffer starting at ByteBuffer#position()?
ByteBuffer b = /* 61 62 63 64 00 31 32 34 00 (hex) */;
String s0 = /* read first string */;
String s1 = /* read second string */;
// `s0` will now contain “abcd” and `s1` will contain “124”.
I have already tried using Charsets.UTF_8.decode(b), but it seems this function ignores the current ByteBuffer position and reads until the end of the buffer.
Is there a more idiomatic way to read such a string from a byte buffer than seeking the byte containing 0 and then limiting the buffer to it (or copying the string's bytes into a separate buffer)?
If by idiomatic you mean a one-liner, not that I know of (unsurprising, since NUL-terminated strings are not part of the Java spec).
The first thing I came up with is using b.slice().limit(x) to create a lightweight view onto just the desired bytes (better than copying them anywhere, since you may be able to work with the buffer directly):
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;

ByteBuffer b = ByteBuffer.wrap(new byte[] { 0x61, 0x62, 0x63, 0x64, 0x00, 0x31, 0x32, 0x34, 0x00 });
int i;
while (b.hasRemaining()) {
    ByteBuffer nextString = b.slice(); // view on b with the same start position
    for (i = 0; b.hasRemaining() && b.get() != 0x00; i++) {
        // count bytes up to the next NUL
    }
    nextString.limit(i); // the view now stops just before the NUL
    CharBuffer s = StandardCharsets.UTF_8.decode(nextString);
    System.out.println(s);
}
In Java, the char '\u0000' (Unicode code point U+0000, which UTF-8 encodes as the single byte 0) is a normal char. So read everything (maybe into an overlarge byte array), and do:
String s = new String(bytes, StandardCharsets.UTF_8);
String[] s0s1 = s.split("\u0000");
String s0 = s0s1[0];
String s1 = s0s1[1];
If you do not have fixed positions and must read sequentially byte by byte, the code gets ugly. One of C's creators indeed called the NUL-terminated string a historic mistake.
For the reverse direction, producing no 0 byte for a Java String (normally so it can be processed further as a C/C++ NUL-terminated string), there is "modified UTF-8", which encodes U+0000 without emitting a 0 byte.
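For instance, DataOutputStream.writeUTF emits modified UTF-8 (after a two-byte length prefix), encoding U+0000 as the two-byte sequence C0 80 so the payload never contains a raw zero byte. A small sketch:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class ModifiedUtf8Demo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new DataOutputStream(bos).writeUTF("a\u0000b");
        // First two bytes are a big-endian length prefix (here 4), then the payload:
        // 'a' = 0x61, U+0000 = 0xC0 0x80 (two bytes, never a raw zero), 'b' = 0x62.
        for (byte b : bos.toByteArray()) System.out.printf("%02x ", b);
        System.out.println(); // prints: 00 04 61 c0 80 62
    }
}
```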
You can do it with the replace and split functions: decode your bytes to a String, replace each NUL (char 0) with a custom separator character, then split the string on that separator.
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Jtest {
    public static void main(String[] args) {
        // ByteBuffer b = /* 61 62 63 64 00 31 32 34 00 (hex) */;
        ByteBuffer b = ByteBuffer.allocate(10);
        b.put((byte) 0x61);
        b.put((byte) 0x62);
        b.put((byte) 0x63);
        b.put((byte) 0x64);
        b.put((byte) 0x00);
        b.put((byte) 0x31);
        b.put((byte) 0x32);
        b.put((byte) 0x34);
        b.put((byte) 0x00);
        b.flip(); // limit to the bytes actually written, position back to 0

        // print the ByteBuffer
        System.out.println("Original ByteBuffer: " + Arrays.toString(b.array()));

        String s = StandardCharsets.UTF_8.decode(b).toString();
        String ss = s.replace((char) 0, ';'); // mark each NUL with a separator
        String[] words = ss.split(";");       // then split on the separator
        for (int i = 0; i < words.length; i++) {
            System.out.println(" Word " + i + " = " + words[i]);
        }
    }
}
I believe you can do this more efficiently by removing the replace and splitting on the NUL character directly.
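Along those lines, a minimal sketch (same sample bytes as in the question) that decodes once and splits on NUL directly, with no replace step:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class SplitOnNul {
    public static void main(String[] args) {
        ByteBuffer b = ByteBuffer.wrap(new byte[] {
            0x61, 0x62, 0x63, 0x64, 0x00, 0x31, 0x32, 0x34, 0x00 });
        // Decode the whole buffer once, then split on the NUL character itself;
        // String.split drops the trailing empty strings after the final NUL.
        String[] words = StandardCharsets.UTF_8.decode(b).toString().split("\u0000");
        for (String w : words) System.out.println(w); // prints abcd, then 124
    }
}
```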

Issue with Encoding base64 in PHP and Decoding base64 in Java

A string "gACA" was encoded in PHP using base64. Now I'm trying to decode it in Java using base64, but I'm getting an absurd value after decoding. I have tried this:
public class DecodeString {
    public static void main(String[] args) throws java.io.UnsupportedEncodingException {
        String strEncode = "gACA"; // gACA is the string encoded in PHP
        byte[] byteEncode = com.sun.org.apache.xerces.internal.impl.dv.util.Base64.decode(strEncode);
        System.out.println("Decoded String: " + new String(byteEncode, "UTF-8"));
    }
}
Output:
??
Please help me out
Java has a built-in Base64 encoder/decoder; there is no need for extra libraries to decode it:
byte[] data = javax.xml.bind.DatatypeConverter.parseBase64Binary("gACA");
for (byte b : data)
System.out.printf("%02x ", b);
Output:
80 00 80
It's 3 bytes with hexadecimal codes: 80 00 80
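Note that javax.xml.bind.DatatypeConverter was removed from the JDK in Java 11; on Java 8 and later the current built-in is java.util.Base64, which gives the same result. A small sketch:

```java
import java.util.Base64;

public class DecodeGaca {
    public static void main(String[] args) {
        byte[] data = Base64.getDecoder().decode("gACA");
        for (byte b : data)
            System.out.printf("%02x ", b);
        System.out.println(); // prints: 80 00 80 (not valid UTF-8 text)
    }
}
```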
public static void main(String[] args) throws java.io.UnsupportedEncodingException {
    String strEncode = "gACA"; // gACA is the string encoded in PHP
    byte[] byteEncode = java.util.Base64.getDecoder().decode(strEncode);
    String result = new String(byteEncode, "UTF-8");
    char[] resultChar = result.toCharArray();
    for (int i = 0; i < resultChar.length; i++) {
        System.out.println((int) resultChar[i]);
    }
    System.out.println("Decoded String: " + result);
}
I suspect it's an encoding problem. As the post "Issue about 65533 � in C# text file reading" suggests, the first and last bytes decode to the replacement character, with a char 0 in the middle. Your result is probably those three characters, just shown with the wrong encoding.
Try this, it worked fine for me (However I was decoding files):
Base64.decodeBase64(IOUtils.toByteArray(strEncode));
So it would look like this:
public class DecodeString {
    public static void main(String[] args) throws java.io.IOException {
        String strEncode = "gACA"; // gACA is the string encoded in PHP
        byte[] byteEncode = Base64.decodeBase64(IOUtils.toByteArray(strEncode));
        System.out.println("Decoded String: " + new String(byteEncode, "UTF-8"));
    }
}
Note that you will need extra libraries:
Commons Codec
Commons FileUpload
Commons IO
First things first: "gACA" is a valid base64 string as far as the format goes, but it doesn't decode to any meaningful UTF-8 text. I suspect you're using the wrong encoding, or the string got mangled somewhere along the way...
RFC 4648 defines two alphabets: standard Base64 and a URL- and filename-safe variant.
PHP's base64_encode uses the standard alphabet.
Java's java.util.Base64 supports both: getEncoder()/getDecoder() for the standard alphabet, and getUrlEncoder()/getUrlDecoder() for the URL-safe one.
The two alphabets are very close but not identical (+ and / versus - and _), so the decoder must match the encoder. To convert between them in PHP:
const REPLACE_PAIRS = [
    '-' => '+',
    '_' => '/'
];

public static function base64FromUrlSafeToPHP($base64_url_encoded) {
    return strtr($base64_url_encoded, self::REPLACE_PAIRS);
}

public static function base64FromPHPToUrlSafe($base64_encoded) {
    return strtr($base64_encoded, array_flip(self::REPLACE_PAIRS));
}
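On the Java side, both alphabets are available in java.util.Base64. A small sketch of my own showing how the two encodings of the same bytes differ:

```java
import java.util.Base64;

public class AlphabetDemo {
    public static void main(String[] args) {
        byte[] raw = {(byte) 0xFB, (byte) 0xEF, (byte) 0xFF};
        String standard = Base64.getEncoder().encodeToString(raw);
        String urlSafe  = Base64.getUrlEncoder().encodeToString(raw);
        System.out.println(standard); // "++//" : uses + and /
        System.out.println(urlSafe);  // "--__" : uses - and _
        // Decoding with the matching decoder round-trips; mixing the two
        // (e.g. getUrlDecoder() on "++//") throws IllegalArgumentException.
    }
}
```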

Extract hexadecimal values from a percent encoded URL

Let's say, for example, I have a URL containing the following percent-encoded character: %80
It is obviously not an ASCII character.
How would it be possible to convert this value to the corresponding hex string in Java?
I tried the following with no luck. The result should be 80.
public static void main(String[] args) throws java.io.UnsupportedEncodingException {
    System.out.print(byteArrayToHexString(URLDecoder.decode("%80", "UTF-8").getBytes()));
}

public static String byteArrayToHexString(byte[] bytes) {
    StringBuffer buffer = new StringBuffer();
    for (int i = 0; i < bytes.length; i++) {
        if (((int) bytes[i] & 0xff) < 0x10)
            buffer.append("0");
        buffer.append(Long.toString((int) bytes[i] & 0xff, 16));
    }
    return buffer.toString();
}
The best way to deal with this is to parse the url using either java.net.URL or java.net.URI, and then use the relevant getters to extract the components that you require. These will take care of decoding any %-encoded portions in the appropriate fashion.
The problem with your current idea is that %80 does not represent "80", or 80. Rather it represents a byte that further needs to be interpreted in the context of the character encoding of the URL. And if the encoding is UTF-8, then the %80 needs to be followed by one or two more %-encoded bytes ... otherwise this is a malformed UTF-8 character representation.
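If all you need is the literal hex digits rather than the decoded bytes, you can pull them straight out of the escapes without going through any charset at all. A small sketch of my own (the sample input is hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

public class PercentHex {
    public static void main(String[] args) {
        String encoded = "%80%C3%A9"; // e.g. a raw byte plus the UTF-8 bytes of 'é'
        List<String> hex = new ArrayList<>();
        for (int i = 0; i < encoded.length(); i++) {
            if (encoded.charAt(i) == '%' && i + 2 < encoded.length()) {
                hex.add(encoded.substring(i + 1, i + 3)); // the two hex digits
                i += 2; // skip past the digits just consumed
            }
        }
        System.out.println(hex); // prints: [80, C3, A9]
    }
}
```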
I don't really see what you are trying to do. However, I'll give it a try.
When you have the String "%80" and you want the string "80", you can use this:
String str = "%80";
String hex = str.substring(1); // Cut off the '%'
If you are trying to extract the value 0x80 (which is 128 in decimal) out of it:
String str = "%80";
String hex = str.substring(1); // Cut off the '%'
int value = Integer.parseInt(hex, 16);
If you are trying to convert an int to its hexadecimal representation use this:
String hexRepresenation = Integer.toString(value, 16);

Confusion about Java conversion of bytes to String for comparison of "byte order marks"

I'm trying to recognize a BOM for UTF-8 when reading a file. Of course, Java likes to deal with 16-bit chars, and the BOM characters arrive as 8-bit bytes.
My test code looks like:
public void testByteOrderMarks() {
    System.out.println("test byte order marks");
    byte[] bytes = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, (byte) 'a', (byte) 'b', (byte) 'c'};
    String test = new String(bytes, Charset.availableCharsets().get("UTF-8"));
    System.out.printf("test len: %s value %s\n", test.length(), test);
    String three = test.substring(0, 3);
    System.out.printf("len %d >%s<\n", three.length(), three);
    for (int i = 0; i < test.length(); i++) {
        byte b = bytes[i];
        char c = test.charAt(i);
        System.out.printf("b: %s %x c: %s %x\n", (char) b, b, c, (int) c);
    }
}
and the result is:
test byte order marks
test len: 4 value ?abc
len 3 >?ab<
b: ? ef c: ? feff
b: ? bb c: a 61
b: ? bf c: b 62
b: a 61 c: c 63
I can't figure out why the length of "test" is 4 and not 6.
I can't figure out why I don't pick up each 8 bit byte to do the comparison.
Thanks
Don't use characters when trying to figure out the BOM header. The BOM is two bytes (UTF-16) or three bytes (UTF-8), so you should open a (File)InputStream, read those first bytes and process them.
Incidentally, the XML header (<?xml version=... encoding=...>) is pure ASCII so it's safe to load that as a byte stream, too (well, unless there is a BOM to indicate that the file is saved with 16bit characters and not as UTF-8).
My solution (see DecentXML's XMLInputStreamReader) is to load the first few bytes of the file and analyze them. That gives me enough information to create a properly decoding Reader out of an InputStream.
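As an illustration of that "read the first few bytes" approach (a sketch of my own, not DecentXML's actual code), a UTF-8 BOM can be sniffed with a PushbackInputStream so that non-BOM bytes are not lost:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

public class BomSniffer {
    // Returns the stream positioned after a UTF-8 BOM if one is present,
    // otherwise with nothing consumed.
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pb = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = pb.read(head, 0, 3);
        if (n == 3 && head[0] == (byte) 0xEF && head[1] == (byte) 0xBB && head[2] == (byte) 0xBF) {
            return pb; // BOM consumed
        }
        if (n > 0) pb.unread(head, 0, n); // not a BOM: push the bytes back
        return pb;
    }

    public static void main(String[] args) throws IOException {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a', 'b', 'c'};
        InputStream in = skipUtf8Bom(new ByteArrayInputStream(withBom));
        System.out.println((char) in.read()); // prints: a
    }
}
```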
A character is a character. The Byte Order Mark is the Unicode character U+FEFF. In Java it is the character '\uFEFF'. There is no need to delve into bytes. Just read the first character of the file, and if it matches '\uFEFF' it is the BOM. If it doesn't match then the file was written without a BOM.
private final static char BOM = '\uFEFF'; // Unicode Byte Order Mark

String firstLine = readFirstLineOfFile("filename.txt");
if (firstLine.charAt(0) == BOM) {
    // We have a BOM
} else {
    // No BOM present.
}
If you want to detect a file's encoding (BOM or not), a better solution (which works for me) is to use Mozilla's encoding detector library: http://code.google.com/p/juniversalchardet/
That link describes easily how to use it:
import org.mozilla.universalchardet.UniversalDetector;

public class TestDetector {
    public static void main(String[] args) throws java.io.IOException {
        byte[] buf = new byte[4096];
        String fileName = "testFile.";
        java.io.FileInputStream fis = new java.io.FileInputStream(fileName);
        // (1)
        UniversalDetector detector = new UniversalDetector(null);
        // (2)
        int nread;
        while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
            detector.handleData(buf, 0, nread);
        }
        // (3)
        detector.dataEnd();
        // (4)
        String encoding = detector.getDetectedCharset();
        if (encoding != null) {
            System.out.println("Detected encoding = " + encoding);
        } else {
            System.out.println("No encoding detected.");
        }
        // (5)
        detector.reset();
        fis.close();
    }
}
If you are using maven the dependency is:
<dependency>
<groupId>com.googlecode.juniversalchardet</groupId>
<artifactId>juniversalchardet</artifactId>
<version>1.0.3</version>
</dependency>

Converting char array into byte array and back again

I'm looking to convert a Java char array to a byte array without creating an intermediate String, as the char array contains a password. I've looked up a couple of methods, but they all seem to fail:
char[] password = "password".toCharArray();

byte[] passwordBytes1 = new byte[password.length * 2];
ByteBuffer.wrap(passwordBytes1).asCharBuffer().put(password);

byte[] passwordBytes2 = new byte[password.length * 2];
for (int i = 0; i < password.length; i++) {
    passwordBytes2[2 * i] = (byte) ((password[i] & 0xFF00) >> 8);
    passwordBytes2[2 * i + 1] = (byte) (password[i] & 0x00FF);
}

String passwordAsString = new String(password);
String passwordBytes1AsString = new String(passwordBytes1);
String passwordBytes2AsString = new String(passwordBytes2);

System.out.println(passwordAsString);
System.out.println(passwordBytes1AsString);
System.out.println(passwordBytes2AsString);

assertTrue(passwordAsString.equals(passwordBytes1AsString) || passwordAsString.equals(passwordBytes2AsString));
The assertion always fails (and, critically, when the code is used in production, the password is rejected), yet the print statements print out password three times. Why are passwordBytes1AsString and passwordBytes2AsString different from passwordAsString, yet appear identical? Am I missing out a null terminator or something? What can I do to make the conversion and unconversion work?
Conversion between char and byte is character set encoding and decoding. I prefer to make it as clear as possible in code. It doesn't really mean extra code volume:
Charset latin1Charset = Charset.forName("ISO-8859-1");
charBuffer = latin1Charset.decode(ByteBuffer.wrap(byteArray)); // bytes -> chars (also works for String)
byteBuffer = latin1Charset.encode(charBuffer);                 // chars -> bytes (also works for String)
Aside:
The java.nio classes and the java.io Reader/Writer classes use ByteBuffer and CharBuffer (which use byte[] and char[] as backing arrays), so it is often preferable to use those classes directly. However, you can always convert:
byteArray = byteBuffer.array();  byteBuffer = ByteBuffer.wrap(byteArray);
byteBuffer.get(byteArray);       byteBuffer.put(byteArray);
charArray = charBuffer.array();  charBuffer = CharBuffer.wrap(charArray);
charBuffer.get(charArray);       charBuffer.put(charArray);
Original Answer
public byte[] charsToBytes(char[] chars) {
    Charset charset = Charset.forName("UTF-8");
    ByteBuffer byteBuffer = charset.encode(CharBuffer.wrap(chars));
    return Arrays.copyOf(byteBuffer.array(), byteBuffer.limit());
}

public char[] bytesToChars(byte[] bytes) {
    Charset charset = Charset.forName("UTF-8");
    CharBuffer charBuffer = charset.decode(ByteBuffer.wrap(bytes));
    return Arrays.copyOf(charBuffer.array(), charBuffer.limit());
}
Edited to use StandardCharsets
public byte[] charsToBytes(char[] chars) {
    final ByteBuffer byteBuffer = StandardCharsets.UTF_8.encode(CharBuffer.wrap(chars));
    return Arrays.copyOf(byteBuffer.array(), byteBuffer.limit());
}

public char[] bytesToChars(byte[] bytes) {
    final CharBuffer charBuffer = StandardCharsets.UTF_8.decode(ByteBuffer.wrap(bytes));
    return Arrays.copyOf(charBuffer.array(), charBuffer.limit());
}
Here is a JavaDoc page for StandardCharsets.
Note this on the JavaDoc page:
These charsets are guaranteed to be available on every implementation of the Java platform.
The problem is your use of the String(byte[]) constructor, which uses the platform default encoding. That's almost never what you should be doing - if you pass "UTF-16" as the character encoding in both directions, your tests will probably pass. Currently I suspect that passwordBytes1AsString and passwordBytes2AsString are each 16 characters long, with every other character being U+0000.
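To make that concrete, here is a small sketch of my own showing the asker's first attempt round-tripping once the matching charset is named on the way back (ByteBuffer defaults to big-endian, so asCharBuffer() writes UTF-16BE code units):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class Utf16RoundTrip {
    public static void main(String[] args) {
        char[] password = "password".toCharArray();
        byte[] bytes = new byte[password.length * 2];
        ByteBuffer.wrap(bytes).asCharBuffer().put(password);
        // asCharBuffer() wrote big-endian UTF-16 code units, so decode with UTF-16BE:
        String back = new String(bytes, StandardCharsets.UTF_16BE);
        System.out.println(back.equals("password")); // prints: true
    }
}
```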
What I would do is use one loop to convert to bytes and another to convert back to chars.
char[] chars = "password".toCharArray();

byte[] bytes = new byte[chars.length * 2];
for (int i = 0; i < chars.length; i++) {
    bytes[i * 2] = (byte) (chars[i] >> 8);
    bytes[i * 2 + 1] = (byte) chars[i];
}

char[] chars2 = new char[bytes.length / 2];
for (int i = 0; i < chars2.length; i++)
    chars2[i] = (char) ((bytes[i * 2] << 8) + (bytes[i * 2 + 1] & 0xFF));

String password = new String(chars2);
If you want to use a ByteBuffer and CharBuffer, don't do the simple .asCharBuffer(), which simply does an UTF-16 (LE or BE, depending on your system - you can set the byte-order with the order method) conversion (since the Java Strings and thus your char[] internally uses this encoding).
Use Charset.forName(charsetName), and then its encode or decode method, or the newEncoder /newDecoder.
When converting your byte[] to String, you also should indicate the encoding (and it should be the same one).
This is an extension to Peter Lawrey's answer. In order to backward (bytes-to-chars) conversion work correctly for the whole range of chars, the code should be as follows:
char[] chars = new char[bytes.length / 2];
for (int i = 0; i < chars.length; i++) {
    chars[i] = (char) (((bytes[i * 2] & 0xff) << 8) + (bytes[i * 2 + 1] & 0xff));
}
We need to "unsign" bytes before using (& 0xff). Otherwise half of the all possible char values will not get back correctly. For instance, chars within [0x80..0xff] range will be affected.
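A tiny demonstration of that sign-extension problem, using the two-bytes-per-char layout from the answer above:

```java
public class SignExtension {
    public static void main(String[] args) {
        char c = '\u00ff';         // high byte 0x00, low byte 0xFF
        byte hi = (byte) (c >> 8); // 0
        byte lo = (byte) c;        // -1 as a signed byte
        // Without & 0xff, the low byte is sign-extended to 0xFFFF when widened:
        char wrong = (char) ((hi << 8) + lo);
        char right = (char) (((hi & 0xff) << 8) + (lo & 0xff));
        System.out.printf("wrong=%04x right=%04x%n", (int) wrong, (int) right);
        // prints: wrong=ffff right=00ff
    }
}
```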
You should make use of getBytes() instead of toCharArray()
Replace the line
char[] password = "password".toCharArray();
with
byte[] password = "password".getBytes();
When you call getBytes() on a String in Java without arguments, the result depends on your computer's default encoding (e.g. StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1, etc.). So whenever you want bytes from a String, make sure to specify an encoding, like this:
String sample = "abc";
byte[] a_byte = sample.getBytes(StandardCharsets.UTF_8);
Let's check what happens here. In Java, the String named sample is stored as UTF-16: every char in the String takes 2 bytes.
sample: value "abc", in memory (hex): 00 61 00 62 00 63
a -> 00 61
b -> 00 62
c -> 00 63
But when we getBytes() from the String, we get:
byte[] a_byte = sample.getBytes(StandardCharsets.UTF_8);
// result: 61 62 63
// length: 3 bytes
byte[] a_byte = sample.getBytes(StandardCharsets.UTF_16BE);
// result: 00 61 00 62 00 63
// length: 6 bytes
To get the original in-memory bytes of the String, we can read each char and emit its two bytes. Below is sample code:
public static byte[] charArray2ByteArray(char[] chars) {
    byte[] result = new byte[chars.length * 2];
    int i = 0;
    for (int j = 0; j < chars.length; j++) {
        result[i++] = (byte) ((chars[j] & 0xFF00) >> 8); // high byte first
        result[i++] = (byte) (chars[j] & 0x00FF);        // then low byte
    }
    return result;
}
Usage:
String sample = "abc";
// First get the chars of the String; each char is two bytes in Java.
char[] sample_chars = sample.toCharArray();
// Get the bytes.
byte[] result = charArray2ByteArray(sample_chars);
// Back to String. Use UTF_16BE, because we emitted the bytes of each char
// from high to low, which is exactly the byte order of UTF-16BE.
String sample_back = new String(result, StandardCharsets.UTF_16BE);
