Convert byte[] to String and back - java

I'm trying to save the content of a PDF file in a JSON document and thought of saving the PDF as a String value converted from byte[].
byte[] byteArray = feature.convertPdfToByteArray(Paths.get("path.pdf"));
String byteString = new String(byteArray, StandardCharsets.UTF_8);
byte[] newByteArray = byteString.getBytes(StandardCharsets.UTF_8);
String secondString = new String(newByteArray, StandardCharsets.UTF_8);
System.out.println(secondString.equals(byteString));
System.out.println(Arrays.equals(byteArray, newByteArray));
System.out.println(byteArray.length + " vs " + newByteArray.length);
The result of the above code is as follows:
true
false
421371 vs 760998
The two Strings are equal while the two byte[]s are not. Why is that, and how do I correctly convert/save a PDF inside a JSON?

You are probably using the wrong charset when reading from the PDF file.
For example, the ISO-8859-1 encoding of the character é (e with acute) is a single byte that is not a valid UTF-8 sequence:
byte[] byteArray = "é".getBytes(StandardCharsets.ISO_8859_1);
String byteString = new String(byteArray, StandardCharsets.UTF_8);
byte[] newByteArray = byteString.getBytes(StandardCharsets.UTF_8);
String secondString = new String(newByteArray, StandardCharsets.UTF_8);
System.out.println(secondString.equals(byteString));
System.out.println(Arrays.equals(byteArray, newByteArray));
System.out.println(byteArray.length + " vs " + newByteArray.length);
Output:
true
false
1 vs 3

Why is that
If the byteArray indeed contains a PDF, it most likely is not valid UTF-8. Thus, wherever
String byteString = new String(byteArray, StandardCharsets.UTF_8);
stumbles over a byte sequence which is not valid UTF-8, it will replace it with the Unicode replacement character (U+FFFD). That is, this line damages your data, most likely beyond repair. So the following
byte[] newByteArray = byteString.getBytes(StandardCharsets.UTF_8);
does not result in the original byte array but instead a damaged version of it.
The newByteArray, on the other hand, is the result of UTF-8 encoding a given string, byteString. Thus, newByteArray is valid UTF-8 and
String secondString = new String(newByteArray, StandardCharsets.UTF_8);
does not need to replace anything; in particular, byteString and secondString are equal.
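To see the replacement in isolation, a minimal illustrative sketch with a single byte that is not valid UTF-8:
byte[] notUtf8 = { (byte) 0xE9 }; // é in ISO-8859-1; on its own this is not a valid UTF-8 sequence
String decoded = new String(notUtf8, StandardCharsets.UTF_8);
System.out.println(decoded.equals("\uFFFD")); // true: the byte became the replacement character
System.out.println(decoded.getBytes(StandardCharsets.UTF_8).length); // 3: U+FFFD encodes as EF BF BD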
how to correctly convert/save a pdf inside a json?
As @mammago explained in his comment,
JSON is not the appropriate format for binary content (like files). You should probably use something like Base64 to create a string out of your PDF and store that in your JSON object.
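A minimal sketch of that approach, assuming Java 8+, using java.util.Base64 and java.nio.file.Files (the path is just a placeholder):
byte[] byteArray = Files.readAllBytes(Paths.get("path.pdf"));
// Encode the raw bytes; the resulting String is plain ASCII and safe to put in JSON.
String base64Pdf = Base64.getEncoder().encodeToString(byteArray);
// Store base64Pdf as an ordinary JSON string value, e.g. {"pdf": "<base64Pdf>"}.
// Later, decode it back to the exact original bytes.
byte[] restored = Base64.getDecoder().decode(base64Pdf);
System.out.println(Arrays.equals(byteArray, restored)); // true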

Related

How to convert binary payload (file) to byte[] in java?

I receive JSON data that contains binary data like the following, and I would like to convert it to byte[] in Java, but I don't know how.
"payload": "7V1bcxs3ln6frfdcfvfbghfdX8HSw9Zu1QzzartyhblfdcvberCObjvJpkiJUpmhRI1pKXYeXHRsZLSrCy
5dElN5tfvQaO72TdSoiOS3TH8Yxdffgtg754679513qdfrgvlslsqdeqaepdccngrdzedrtghBD+d++e7v//p80/v96v7h+u72
+z1gfK/39x/+9t391cPTzeP88aE/++Fvvd53n+8+Xd1c/fBm/unqAf+7
N7v65en++vGP3vx2fvPHw/XDdwfpHf5mevhq/vQDcnAAwD+gEPwDF+bDxTv+3UF61d/4eesrfP356uFx"
Based on the observation that the "binary" string consists of ASCII letters, digits and "+" and "/", I am fairly confident that it is actually Base64 encoded data.
To decode Base64 to a byte[] you can do something like this:
String s = "7V1bcxs3ln6...";
byte[] bytes = java.util.Base64.getDecoder().decode(s);
The decode call will throw IllegalArgumentException if the input string is not properly Base64 encoded.
When I decoded that particular string using an online Base64 decoder, the result is unintelligible. But that is what I would expect for an arbitrary "blob" of binary data.
In general, if you have a String in some object that denotes the JSON payload, you can do:
String s = "7V1bcxs3ln6...";
byte[] bytes = s.getBytes();
Other than that, if the payload needs to be decoded somehow, additional code will be required.
In my case I had to convert a payload that I knew was text, something like:
{"payload":"eyJ1c2VyX2lkIjo0LCJ1c2VybmFtZSI6IngiLCJjaXR5IjoiaGVyZSJ9"}
This is the difference between java.util.Base64.getDecoder() and getBytes():
String s = "eyJ1c2VyX2lkIjo0LCJ1c2VybmFtZSI6IngiLCJjaXR5IjoiaGVyZSJ9";
byte[] bytes = s.getBytes();
byte[] bytes_base64 = java.util.Base64.getDecoder().decode(s);
String bytesToStr = new String(bytes, StandardCharsets.UTF_8);
String bytesBase64Tostr = new String(bytes_base64, StandardCharsets.UTF_8);
System.out.println("bytesToStr="+bytesToStr);
System.out.println("bytesBase64Tostr="+bytesBase64Tostr);
Output:
bytesToStr=eyJ1c2VyX2lkIjo0LCJ1c2VybmFtZSI6IngiLCJjaXR5IjoiaGVyZSJ9
bytesBase64Tostr={"user_id":4,"username":"x","city":"here"}
java.util.Base64.getDecoder() is what worked in my case.

How to 'decode' a UTF-8 String which is built upon gzipped byte array

I have some legacy text data which was produced by UTF-8 decoding a gzipped byte array.
I'm wondering whether I can get the raw data back
something like:
String text = "Hello World!";
byte[] binData = text.getBytes("UTF-8");
byte[] compressData = gzip(binData); // via GZIPOutputStream
//this is what I have
String encodedString = new String(compressData, "UTF-8");
assertEquals(text, smartDecode(encodedString));
Is it possible to write a function like smartDecode that retrieves the original text 'Hello World!'?
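Applying the replacement-character explanation from the first answer above, a smartDecode like this generally cannot exist: the UTF-8 decoding step already discards information (the gzip header byte 0x8B, for instance, is not valid UTF-8). A minimal illustrative sketch, using java.util.zip.GZIPOutputStream and java.io.ByteArrayOutputStream in place of the gzip(...) helper:
byte[] binData = "Hello World!".getBytes(StandardCharsets.UTF_8);
// Compress, as the question's gzip(...) helper would.
ByteArrayOutputStream bos = new ByteArrayOutputStream();
try (GZIPOutputStream out = new GZIPOutputStream(bos)) {
    out.write(binData);
}
byte[] compressData = bos.toByteArray();
// Decoding arbitrary binary data as UTF-8 replaces invalid sequences with U+FFFD...
String encodedString = new String(compressData, StandardCharsets.UTF_8);
// ...so encoding the String again does not reproduce the compressed bytes.
byte[] roundTripped = encodedString.getBytes(StandardCharsets.UTF_8);
System.out.println(Arrays.equals(compressData, roundTripped)); // false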

PDF file content to Base 64 and vice versa in Java

I need to convert PDF content to Base64 and use that as a String.
When I use the program below to test it, out.pdf comes out blank.
byte[] pdfRawData = FileUtils.readFileToByteArray(new File("C:\\in.pdf")) ;
String pdfStr = new String(pdfRawData);
//My data is available in the form of String
BASE64Encoder encoder = new BASE64Encoder();
String encodedPdf = encoder.encode(pdfStr.getBytes());
System.out.println(encodedPdf);
// Decode the encoded content to test
BASE64Decoder decoder = new BASE64Decoder();
FileUtils.writeByteArrayToFile(new File("C:\\out.pdf") , decoder.decodeBuffer(encodedPdf));
Can anyone please help me?
Why are you doing:
String pdfStr = new String(pdfRawData);
instead of passing pdfRawData to the encoder?
Doing so leads to lots of encoding issues, as you don't specify which encoding to use when building the String from the byte array (it will use the platform default). And it is clearly redundant (byte array -> String -> byte array).
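A minimal sketch of that fix, keeping the question's FileUtils and BASE64Encoder/BASE64Decoder and simply dropping the intermediate String:
byte[] pdfRawData = FileUtils.readFileToByteArray(new File("C:\\in.pdf"));
// Encode the raw bytes directly; no lossy byte[] -> String -> byte[] detour.
BASE64Encoder encoder = new BASE64Encoder();
String encodedPdf = encoder.encode(pdfRawData);
// Decoding now yields the original bytes unchanged.
BASE64Decoder decoder = new BASE64Decoder();
FileUtils.writeByteArrayToFile(new File("C:\\out.pdf"), decoder.decodeBuffer(encodedPdf));
On Java 8 and later, java.util.Base64 is the supported replacement for the internal sun.misc BASE64Encoder/BASE64Decoder classes.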

How to convert String to byte without changing?

I need a solution to convert a String to a byte array without changing it, like this:
Input:
String s="Test";
Output:
String s="Test";
byte[] b="Test";
When I use
s.getBytes();
then the reply is
"[B#428b76b8"
but I want the reply to be
"Test"
You should always make sure serialization and deserialization use the same character set; the character set maps characters to byte sequences and vice versa. By default, String.getBytes() and new String(bytes) use the platform's default character set, which can be locale specific.
Use the getBytes(Charset) overload
byte[] bytes = s.getBytes(Charset.forName("UTF-8"));
Use the new String(bytes, Charset) constructor
String andBackAgain = new String(bytes, Charset.forName("UTF-8"));
Also Java 7 added the java.nio.charset.StandardCharsets class, so you don't need to use dodgy String constants anymore
byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
String andBackAgain = new String(bytes, StandardCharsets.UTF_8);
You can revert back using
String originalString = new String(b, "UTF-8");
That should get you back your original string. You don't want the bytes printed out directly: printing a byte[] just shows the array's default toString() (its type and hash code, like [B@428b76b8), not its contents.
You may try the following code snippet -
String string = "Sample String";
byte[] byteArray = string.getBytes();
In general that's probably not what you want to do, unless you're serializing or transmitting the data. Also, note that Java strings are UTF-16 internally, not UTF-8. If you really do want/need this, then this should work:
String str = "Test";
byte[] raw = str.getBytes(Charset.forName("UTF-8"));

How to convert UTF8 string to UTF16

I'm getting a UTF-8 string by processing a request sent by a client application, but the string is really UTF-16. When I read it into my local String, every letter is followed by a \0 character. I need to convert that String into UTF-16.
Sample received string: S\0a\0m\0p\0l\0e (as UTF-8).
What I want is: Sample (UTF-16)
FileItem item = (FileItem) iter.next();
String field = "";
String value = "";
if (item.isFormField()) {
    try {
        value = item.getString();
        System.out.println("====" + value);
    }
The bytes from the server are not UTF-8 if they look like S\0a\0m\0p\0l\0e. They are UTF-16. You can convert UTF-16 bytes to a Java String with:
byte[] bytes = ...
String string = new String(bytes, "UTF-16");
Or you can use UTF-16LE or UTF-16BE as the character set name if you know the endianness of the byte stream coming from the server.
If you've already (mistakenly) constructed a String from the bytes as if it were UTF-8, you can convert to UTF-16 with:
string = new String(string.getBytes("UTF-8"), "UTF-16");
However, as JB Nizet points out, this round trip (bytes -> UTF-8 string -> bytes) is potentially lossy if the bytes weren't valid UTF-8 to start with.
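For example, a minimal illustrative sketch of that recovery with made-up data; it works here only because the mangled bytes happen to be plain ASCII plus NUL bytes, which are valid UTF-8:
// Simulate the situation: the client sent UTF-16LE bytes,
// but they were mistakenly decoded as UTF-8, giving "S\0a\0m\0p\0l\0e\0".
byte[] fromClient = "Sample".getBytes(StandardCharsets.UTF_16LE);
String mangled = new String(fromClient, StandardCharsets.UTF_8);
// Go back to bytes and decode with the correct charset.
String recovered = new String(mangled.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_16LE);
System.out.println(recovered); // Sample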
I propose the following solution (this one is Objective-C, not Java):
NSString *line_utf16[ENOUGH_MEMORY_SIZE];
line_utf16 = [NSString stringWithFormat: @"%s", line_utf8];
where ENOUGH_MEMORY_SIZE is at least twice the memory used for line_utf8.
I suppose the memory for line_utf16 has to be dynamically or statically allocated to at least twice the size of line_utf8.
If you run into a similar problem, please add a couple of sentences!
