Java decode hex values using utf-8 encoding

I am trying to decode some hex values to their actual names, but I am having issues.
21043D0438043C043E043A04 should be decoded to СНИМОК.
Current code I am using
String test = "21043D0438043C043E043A04";
byte[] bytes = Hex.decodeHex(test.toCharArray());
String a = new String(bytes, "UTF-8");
but I am getting some pretty weird results.
I also tried getting it as UTF-8 bytes, but that did not work either.
byte[] bytesone = test.getBytes(Charset.forName("UTF-8"));
String b = new String(bytesone, Charset.forName("UTF-8"));
byte[] bytes = Hex.decodeHex(b.toCharArray());
String a = new String(bytes, "UTF-8");
Thanks in advance

\u0421 is the Cyrillic С, so the data appears to be UTF-16LE (little-endian): each character is two bytes, low byte first. Either of these works:
String a = new String(bytes, "UTF-16LE");
String a = new String(bytes, StandardCharsets.UTF_16LE);

This looks more like UTF-16LE.
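
For reference, here is a minimal sketch of the whole decoding without Apache Commons Codec. The class and method names (HexUtf16Demo, hexToBytes, decodeUtf16Le) are made up for illustration; only the hex string comes from the question. Note that the code points after the first one are lowercase Cyrillic letters, so the decoded text is Снимок.

```java
import java.nio.charset.StandardCharsets;

final class HexUtf16Demo {

    // Hypothetical helper: turns a hex string such as "2104" into the bytes {0x21, 0x04}.
    static byte[] hexToBytes(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("hex string must have an even length");
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            int hi = Character.digit(hex.charAt(2 * i), 16);
            int lo = Character.digit(hex.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("not a hex digit near index " + (2 * i));
            }
            out[i] = (byte) ((hi << 4) | lo);
        }
        return out;
    }

    // Interprets each pair of bytes as one UTF-16 code unit, low byte first.
    static String decodeUtf16Le(String hex) {
        return new String(hexToBytes(hex), StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        System.out.println(decodeUtf16Le("21043D0438043C043E043A04"));
    }
}
```

Using the StandardCharsets constant instead of the "UTF-16LE" string also avoids the checked UnsupportedEncodingException.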

Related

PDF file content to Base 64 and vice versa in Java

I need to convert PDF content to Base64 and use that as a String.
When I use the below program to test the out.pdf becomes blank.
byte[] pdfRawData = FileUtils.readFileToByteArray(new File("C:\\in.pdf")) ;
String pdfStr = new String(pdfRawData);
//My data is available in the form of String
BASE64Encoder encoder = new BASE64Encoder();
String encodedPdf = encoder.encode(pdfStr.getBytes());
System.out.println(encodedPdf);
// Decode the encoded content to test
BASE64Decoder decoder = new BASE64Decoder();
FileUtils.writeByteArrayToFile(new File("C:\\out.pdf") , decoder.decodeBuffer(encodedPdf));
Can anyone please help me?

Why are you doing:
String pdfStr = new String(pdfRawData);
instead of passing pdfRawData to the encoder?
Doing so leads to encoding issues, because you don't specify the charset used to build the string from the byte array (the platform default is used). It is also clearly redundant (byte array -> String -> byte array).
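
A minimal sketch of passing the raw bytes straight to the encoder. It swaps in java.util.Base64 (available since Java 8) for the sun.misc encoder used in the question, and the class name and sample bytes are invented stand-ins for a real file read.

```java
import java.util.Arrays;
import java.util.Base64;

final class PdfBase64Demo {

    // Encode the raw bytes directly; no intermediate String is built from them.
    static String encode(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    static byte[] decode(String encoded) {
        return Base64.getDecoder().decode(encoded);
    }

    public static void main(String[] args) {
        // Stand-in for real PDF content: "%PDF" followed by bytes that are not valid text.
        byte[] pdfRawData = {0x25, 0x50, 0x44, 0x46, (byte) 0xE2, (byte) 0xFF, 0x00, (byte) 0x80};

        String encodedPdf = encode(pdfRawData);
        byte[] restored = decode(encodedPdf);

        System.out.println(encodedPdf);
        System.out.println(Arrays.equals(pdfRawData, restored)); // true
    }
}
```

With a real file, pdfRawData would come from something like FileUtils.readFileToByteArray as in the question, and restored would be written back out unchanged.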

Convert byte[] to String and back

I'm trying to save content of a pdf file in a json and thought of saving the pdf as String value converted from byte[].
byte[] byteArray = feature.convertPdfToByteArray(Paths.get("path.pdf"));
String byteString = new String(byteArray, StandardCharsets.UTF_8);
byte[] newByteArray = byteString.getBytes(StandardCharsets.UTF_8);
String secondString = new String(newByteArray, StandardCharsets.UTF_8);
System.out.println(secondString.equals(byteString));
System.out.println(Arrays.equals(byteArray, newByteArray));
System.out.println(byteArray.length + " vs " + newByteArray.length);
The result of the above code is as follows:
true
false
421371 vs 760998
The two String's are equal while the two byte[]s are not. Why is that and how to correctly convert/save a pdf inside a json?

You are probably using the wrong charset when reading from the PDF file.
For example, the character é (e with acute) is the single byte 0xE9 in ISO-8859-1, and that byte on its own is not valid UTF-8:
byte[] byteArray = "é".getBytes(StandardCharsets.ISO_8859_1);
String byteString = new String(byteArray, StandardCharsets.UTF_8);
byte[] newByteArray = byteString.getBytes(StandardCharsets.UTF_8);
String secondString = new String(newByteArray, StandardCharsets.UTF_8);
System.out.println(secondString.equals(byteString));
System.out.println(Arrays.equals(byteArray, newByteArray));
System.out.println(byteArray.length + " vs " + newByteArray.length);
Output :
true
false
1 vs 3

Why is that
If the byteArray indeed contains a PDF, it most likely is not valid UTF-8. Thus, wherever
String byteString = new String(byteArray, StandardCharsets.UTF_8);
stumbles over a byte sequence which is not valid UTF-8, it will replace that by a Unicode replacement character. I.e. this line damages your data, most likely beyond repair. So the following
byte[] newByteArray = byteString.getBytes(StandardCharsets.UTF_8);
does not result in the original byte array but instead a damaged version of it.
The newByteArray, on the other hand, is the result of UTF-8 encoding a given string, byteString. Thus, newByteArray is valid UTF-8 and
String secondString = new String(newByteArray, StandardCharsets.UTF_8);
does not need to replace anything outside the UTF-8 mappings, in particular byteString and secondString are equal.
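
The damage described above can be made visible by printing the bytes before and after the trip through a String. This is only an illustrative sketch; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class LossyDecodeDemo {

    // Decodes the bytes as UTF-8 text and encodes that text again.
    // Any malformed input is replaced by U+FFFD during decoding.
    static byte[] throughUtf8String(byte[] raw) {
        String asText = new String(raw, StandardCharsets.UTF_8);
        return asText.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0xE9 alone is an incomplete UTF-8 sequence.
        byte[] original = {(byte) 0xE9};
        byte[] damaged = throughUtf8String(original);

        System.out.println(Arrays.toString(original)); // [-23]
        System.out.println(Arrays.toString(damaged));  // [-17, -65, -67], i.e. EF BF BD = U+FFFD
    }
}
```

Every invalid byte turns into the same three-byte replacement sequence, so the original value cannot be recovered and the array grows, as in the 421371 vs 760998 output.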
how to correctly convert/save a pdf inside a json?
As @mammago explained in his comment,
JSON is not the appropriate format for binary content (like files). You should probably use something like Base64 to create a string out of your PDF and store that in your JSON object.

How to convert String to byte without changing?

I need a solution to convert String to byte array without changing like this:
Input:
String s="Test";
Output:
String s="Test";
byte[] b="Test";
When I use
s.getBytes();
then the reply is
"[B@428b76b8"
but I want the reply to be
"Test"

You should always make sure serialization and deserialization use the same character set, which maps characters to byte sequences and vice versa. By default, String.getBytes() and new String(bytes) use the platform default character set, which can be locale-specific.
Use the getBytes(Charset) overload
byte[] bytes = s.getBytes(Charset.forName("UTF-8"));
Use the new String(bytes, Charset) constructor
String andBackAgain = new String(bytes, Charset.forName("UTF-8"));
Also Java 7 added the java.nio.charset.StandardCharsets class, so you don't need to use dodgy String constants anymore
byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
String andBackAgain = new String(bytes, StandardCharsets.UTF_8);

You can revert back using
String originalString = new String(b, "UTF-8");
That should get you back your original string. You don't want the bytes printed out directly.
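
To illustrate the difference between printing the array reference and printing its content, here is a small sketch (the class and method names are hypothetical):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class ByteArrayPrintDemo {

    // Human-readable view of the byte values.
    static String describe(byte[] bytes) {
        return Arrays.toString(bytes);
    }

    // Encode and decode with the same explicit charset.
    static String roundTrip(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] b = "Test".getBytes(StandardCharsets.UTF_8);

        System.out.println(b);            // array type and hash code, e.g. [B@..., not the content
        System.out.println(describe(b));  // [84, 101, 115, 116]
        System.out.println(roundTrip("Test")); // Test
    }
}
```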

You may try the following code snippet -
String string = "Sample String";
byte[] byteArray = string.getBytes();

In general that's probably not what you want to do, unless you're serializing or transmitting the data. Also, Java strings are UTF-16 rather than UTF-8, and UTF-8 is probably closer to what you're expecting. If you really do want/need this, then this should work:
String str = "Test";
byte[] raw = str.getBytes(Charset.forName("UTF-8"));

how to insert and read utf-8 text on RecordStore

How can I insert a UTF-8 String into a RecordStore and read it back as a UTF-8 String?
Thanks

//write
ByteArrayOutputStream boStream = new ByteArrayOutputStream();
DataOutputStream doStream = new DataOutputStream(boStream);
doStream.writeUTF(myString);
temp.addRecord(boStream.toByteArray(), 0, boStream.size());
//read
ByteArrayInputStream biStream = new ByteArrayInputStream(temp.getRecord(id));
DataInputStream diStream = new DataInputStream(biStream);
myString = diStream.readUTF();
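
The same write/read pair can be tried on plain Java SE without a RecordStore. In the sketch below the byte array stands in for the stored record, and the class and method names are hypothetical. Two details worth knowing: writeUTF uses Java's modified UTF-8, and it prefixes the data with a two-byte length, so a single string is limited to 65535 encoded bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class WriteUtfDemo {

    // Produces the bytes that would be handed to addRecord(...).
    static byte[] toRecord(String s) {
        try {
            ByteArrayOutputStream boStream = new ByteArrayOutputStream();
            DataOutputStream doStream = new DataOutputStream(boStream);
            doStream.writeUTF(s);
            doStream.flush();
            return boStream.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads back what getRecord(...) would return.
    static String fromRecord(byte[] record) {
        try {
            return new DataInputStream(new ByteArrayInputStream(record)).readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] record = toRecord("h\u00e9llo");
        System.out.println(record.length);                          // 8: 2-byte length prefix + 6 data bytes
        System.out.println(fromRecord(record).equals("h\u00e9llo")); // true
    }
}
```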

I misread the question at first. A RecordStore still stores byte arrays, so what you need to do is convert the string into a byte array and back again. Use string.getBytes("UTF-8") before storing, and do the opposite with String str = new String(bytes, "UTF-8"); after reading. Hope that helps. Name the encoding explicitly: the default charset is platform-dependent and is not guaranteed to be UTF-8 (Java SE only made UTF-8 the default in Java 18).

Issue Decoding for a specific charset

I'm trying to decode a char and get back the same char.
Following is my simple test.
I'm confused about whether I have to encode or decode. I tried both, and both print the same result.
Any suggestions would be greatly appreciated.
char inpData = '†';
String str = Character.toString((char) inpData);
byte b[] = str.getBytes(Charset.forName("MacRoman"));
System.out.println(b[0]); // prints -96
String decData = Integer.toString(b[0]);
CharsetDecoder decoder = Charset.forName("MacRoman").newDecoder();
ByteBuffer inBuffer = ByteBuffer.wrap(decData.getBytes());
CharBuffer result = decoder.decode(inBuffer);
System.out.println(result.toString()); // prints -96, expecting to print †
CharsetEncoder encoder = Charset.forName("MacRoman").newEncoder();
ByteBuffer bbuf = encoder.encode(CharBuffer.wrap(decData));
result = decoder.decode(bbuf);
System.out.println(result.toString());// prints -96, expecting to print †
Thank you.

When you do String decData = Integer.toString(b[0]);, you create the string "-96", and that is the string you're encoding/decoding, not the original char.
You have to change your String back to a byte first.
To get your character back as a char from the -96, you have to do this:
String string = new String(b, "MacRoman");
char specialChar = string.charAt(0);
With this you're reversing your first transformation, char -> String -> byte[], by doing byte[] -> String -> char.
If you have the String "-96", you must first change your string back into a byte with:
byte b = Byte.parseByte("-96");
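
Putting both directions together, a minimal sketch might look like this. The class and method names are hypothetical, and it assumes the JDK in use ships the MacRoman charset (Charset.forName throws UnsupportedCharsetException otherwise).

```java
import java.nio.charset.Charset;

final class MacRomanRoundTripDemo {

    // char -> String -> byte[] -> first byte
    static byte encode(char c, Charset cs) {
        return String.valueOf(c).getBytes(cs)[0];
    }

    // byte -> byte[] -> String -> char
    static char decode(byte b, Charset cs) {
        return new String(new byte[] {b}, cs).charAt(0);
    }

    // For the case where only the decimal text form (e.g. "-96") is available.
    static char decodeDecimal(String decimal, Charset cs) {
        return decode(Byte.parseByte(decimal), cs);
    }

    public static void main(String[] args) {
        Charset macRoman = Charset.forName("MacRoman"); // assumption: charset is installed
        char dagger = '\u2020'; // the dagger character from the question

        byte b = encode(dagger, macRoman);
        System.out.println(b); // -96, as reported in the question
        System.out.println(decode(b, macRoman) == dagger);
        System.out.println(decodeDecimal("-96", macRoman) == dagger);
    }
}
```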

String decData = Integer.toString(b[0]);
This is probably what gets you the "-96" output in the last two examples. Try
String decData = new String(b, "MacRoman");
Apart from that, keep in mind that System.out.println uses your system-charset to print out strings anyway. For a better test, consider writing your Strings to a file using your specific charset with something like
FileOutputStream fos = new FileOutputStream("test.txt");
OutputStreamWriter writer = new OutputStreamWriter(fos, "MacRoman");
writer.write(result.toString());
writer.close();
