base64 decoded file is not equal to the original unencoded file - java

I have a normal PDF file A.pdf. A third party encodes the file in Base64 and sends it to me in a web service as a long string (I have no control over the third party).
My problem is that when I decode the string with Java's org.apache.commons.codec.binary.Base64 and write the output to a file called B.pdf,
I expect B.pdf to be identical to A.pdf, but B.pdf turns out slightly different from A.pdf. As a result, B.pdf is not recognized as a valid PDF by Acrobat.
Does Base64 have different types of encoding/charset mechanisms? Can I detect how the string I received is encoded so that B.pdf equals A.pdf?
EDIT: this is the file I want to decode; after decoding it should open as a PDF:
my encoded file
This is the header of the files, opened in Notepad++:
**A.pdf**
%PDF-1.4
%±²³´
%Created by Wnv/EP PDF Tools v6.1
1 0 obj
<<
/PageMode /UseNone
/ViewerPreferences 2 0 R
/Type /Catalog
**B.pdf**
%PDF-1.4
%±²³´
%Created by Wnv/EP PDF Tools v6.1
1 0! bj
<<
/PageMode /UseNone
/ViewerPreferences 2 0 R
/]
pe /Catalog
This is how I decode the string:
private static void decodeStringToFile(String encodedInputStr,
        String outputFileName) throws IOException {
    BufferedReader in = null;
    BufferedOutputStream out = null;
    try {
        in = new BufferedReader(new StringReader(encodedInputStr));
        out = new BufferedOutputStream(new FileOutputStream(outputFileName));
        decodeStream(in, out);
        out.flush();
    } finally {
        if (in != null)
            in.close();
        if (out != null)
            out.close();
    }
}

private static void decodeStream(BufferedReader in, OutputStream out)
        throws IOException {
    while (true) {
        String s = in.readLine();
        if (s == null)
            break;
        //System.out.println(s);
        byte[] buf = Base64.decodeBase64(s);
        out.write(buf);
    }
}

You are breaking your decoding by working line-by-line. Base64 decoders simply ignore whitespace, which means that a byte in the original content could very well be broken into two Base64 text lines. You should concatenate all the lines together and decode the file in one go.
Prefer using byte[] rather than String when supplying content to the Base64 class methods. String implies character set encoding, which may not do what you want.
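A reworked sketch of the question's decodeStringToFile along those lines (assuming commons-codec is still used and that the whole payload fits in memory):

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

import org.apache.commons.codec.binary.Base64;

private static void decodeStringToFile(String encodedInputStr,
        String outputFileName) throws IOException {
    // Join all lines so the payload is decoded as a single Base64 block
    String joined = encodedInputStr.replaceAll("\\s+", "");
    byte[] decoded = Base64.decodeBase64(joined);
    try (OutputStream out = new FileOutputStream(outputFileName)) {
        out.write(decoded);
    }
}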

Related

Damaged Pdf after setting content from a server response

I am currently making REST calls to a server for signing a PDF document.
I am sending a PDF (binary content) and retrieving the binary content of the signed PDF.
When I get the binary content from the InputStream:
try (InputStream inputStream = conn.getInputStream()) {
    if (inputStream != null) {
        try (BufferedReader br = new BufferedReader(new InputStreamReader(inputStream))) {
            String lines;
            while ((lines = br.readLine()) != null) {
                output += lines;
            }
        }
    }
}
signedPdf.setBinaryContent(output.getBytes());
(signedPdf is a DTO with byte[] attribute)
but when I try to set the content of the PDF with the content of the response PDF:
ByteArrayOutputStream out = new ByteArrayOutputStream();
out.write(signedPdf);
pdf.setContent(signedPdf);
and try to open it, it says that the PDF is damaged and cannot be repaired.
Has anyone encountered something similar? Do I need to set the Content-Length as well for the output stream?
PDF is binary data. Reading it as text (which in Java is always Unicode) corrupts the PDF.
It is also wasteful: storing each byte as a char doubles the memory usage, and
there are two conversions, from bytes to String and back, each using some encoding.
When converting from UTF-8, UTF-8 format errors may even be raised.
try (InputStream inputStream = conn.getInputStream()) {
    if (inputStream != null) {
        byte[] content = inputStream.readAllBytes();
        signedPdf.setBinaryContent(content);
    }
}
Whether to use a BufferedInputStream depends on, for instance, the expected PDF size.
Furthermore, new String(byte[], Charset) and String.getBytes(Charset) with an explicit Charset (like StandardCharsets.UTF_8) are preferable over the overloaded versions that use the default Charset. The default versions use the current platform encoding and hence produce non-portable code that behaves differently on another platform/computer.
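For illustration (text and bytes are placeholder variables here):

// Explicit charset: the same bytes on every platform
byte[] bytes = text.getBytes(java.nio.charset.StandardCharsets.UTF_8);
String copy = new String(bytes, java.nio.charset.StandardCharsets.UTF_8);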

Desired result data is not coming from compressed Base64 string when scanning a QR code with an Android phone?

I have extracted a Base64 string from an image, then performed compression on the Base64 string. Then I generate a QR code using this compressed string.
But when I scan the QR code with an Android phone, I get a value like 17740296, which is not the value I want. My purpose is that after getting the scanned value I will decompress it and display the image using a bitmap from Base64. What is wrong in my code? I am using Java code to generate the QR code. (I have tried UTF-8 also, but it is not working.) This code works for plain strings but not for the image.
The compression code is:
public static String compressBD(String str) throws IOException {
    if (str == null || str.length() == 0) {
        return str;
    }
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    GZIPOutputStream gzip = new GZIPOutputStream(out);
    gzip.write(str.getBytes());
    gzip.close();
    return out.toString("ISO-8859-1");
}
The decompression code is:
public static String decompressBD(String str) throws Exception {
    if (str == null || str.length() == 0) {
        return str.toString();
    }
    // System.out.println("Input String length : " + str.toString().length());
    GZIPInputStream gis = new GZIPInputStream(new ByteArrayInputStream(str.getBytes("ISO-8859-1")));
    BufferedReader bf = new BufferedReader(new InputStreamReader(gis, "ISO-8859-1"));
    String outStr = "";
    String line;
    while ((line = bf.readLine()) != null) {
        outStr += line;
    }
    //System.out.println("Output String length : " + outStr.length());
    return outStr;
}
This won't work.
Base64 is an encoding that can be used entirely in strings. You can print it on a piece of paper, type it into your machine, and decode an image from it.
However, if you apply gzip compression and turn the result into a string, the compression generates bytes that are outside the string encoding and cannot be printed or represented as a string.
Base64 is meant to be the encoding that carries binary data in strings. I would really encourage you not to use a string as storage, but to store or transmit the binary data directly. This would also be considerably faster, since Base64 encoding is quite slow.
Its purpose is entirely to store binary content, which contains non-printable bytes, in text messages, for whatever reason.
I hope that is understandable, but it basically means you can't store gzipped content directly in a string. You have to keep the binary representation if you want to compress.
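If a text payload is unavoidable (a QR code carries text), one possibility is to keep the data as bytes the whole way through and only Base64-encode the compressed bytes at the very end. A sketch of that idea (method names are illustrative; InputStream.readAllBytes needs Java 9+):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Base64;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class QrPayloadCodec {

    // Gzip the raw bytes, then Base64-encode the compressed bytes for the QR payload
    public static String compressForQr(byte[] data) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(data);
        }
        return Base64.getEncoder().encodeToString(out.toByteArray());
    }

    // Reverse: Base64-decode back to the compressed bytes, then gunzip
    public static byte[] decompressFromQr(String qrPayload) throws IOException {
        byte[] compressed = Base64.getDecoder().decode(qrPayload);
        try (GZIPInputStream gis = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return gis.readAllBytes();
        }
    }
}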

My Base64 encoded byte[] stream has extra characters after being sent through an HTTP response

I encode a PDF into a Base64 byte[] stream and I want to send it as an HTTP response to the browser. The problem is that the browser fails to load the PDF.
I compared the Base64 string which I printed to the IDE console with the one from the browser console. The one from the IDE console is correct, and the one from the browser has extra characters.
So my Base64 byte[] stream gets broken somehow when it is sent as an HTTP response? How do I solve this?
Later edit: the code
FileInputStream inputFileInputStream = null;
ServletOutputStream outputFileOutputStream = null;
File exportFile = new File(exportedReport);
int fileSize = (int) exportFile.length();
String fullPathToExport = exportFile.getAbsolutePath();
File fullPathFile = new File(fullPathToExport);
try {
    // test to see if the path of the file is correct
    System.out.println("The file is located at: "
            + exportFile.getAbsolutePath());
    response.reset();
    response.setContentType("application/pdf");
    response.setContentLength(fileSize);
    response.addHeader("Content-Transfer-Encoding", "base64");
    response.setHeader("Content-Disposition", "inline; filename=\"" + exportedReport + "\"");
    inputFileInputStream = new FileInputStream(fullPathFile);
    outputFileOutputStream = response.getOutputStream();
    if (bytesToRead == -1) {
        bytesToRead = (int) fullPathFile.length();
    }
    byte[] buffer = new byte[bytesToRead];
    int bytesRead = -1;
    while ((inputFileInputStream != null) && ((bytesRead = inputFileInputStream.read(buffer)) != -1)) {
        if (codec.equals("base64")) {
            //outputFileOutputStream.write(Base64.encodeBytes(buffer).getBytes("UTF-8"), 0, bytesToRead);
            outputFileOutputStream.write(org.apache.commons.codec.binary.Base64.encodeBase64(buffer));
        } else {
            outputFileOutputStream.write(buffer, 0, bytesToRead);
        }
    }
    inputFileInputStream.close();
    outputFileOutputStream.flush();
    outputFileOutputStream.close();
Your code has one major problem:
You are not sending one base64 encoded data part but many base64 encoded data parts (concatenated). But two or more base64 encoded data parts are not equal to one base64 encoded data part.
Example:
base64("This is a test") -> "VGhpcyBpcyBhIHRlc3Q="
base64("This ")+base64("is a ")+base64("test") -> "VGhpcyA=aXMgYSA=dGVzdA=="
You should use the org.apache.commons.codec.binary.Base64InputStream instead of the Base64.encodeBase64() utility method. Read the whole FileInputStream through it and you will get a valid base64 encoded data stream.
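A minimal sketch of that streaming approach (variable names follow the question's code, and the usual java.io imports are assumed; note that the Content-Length set earlier would no longer match the Base64-encoded size):

// Encode the whole file as one continuous Base64 stream while copying it to the response
try (InputStream in = new org.apache.commons.codec.binary.Base64InputStream(
             new FileInputStream(fullPathFile), true); // true = encode while reading
     OutputStream out = response.getOutputStream()) {
    byte[] buffer = new byte[4096];
    int n;
    while ((n = in.read(buffer)) != -1) {
        out.write(buffer, 0, n);
    }
    out.flush();
}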
Anyway, what you are doing is not necessary. You can transfer any 8-bit data via HTTP without further encoding.

How to tell the original encoding of a file

I have a bunch of plain text files which I downloaded from 3rd party servers.
Some of them are gibberish: the server sent the information that the encoding was ENCODING1 (e.g. UTF-8), but in reality the encoding of the file was ENCODING2 (e.g. Windows-1252).
Is there a way to somehow correct these files?
I presume the files were mostly encoded in UTF-8, ISO-8859-2, or Windows-1252 (and I presume they were mostly saved with one of these encodings). I was thinking about re-encoding every file's content with
new String(String.getBytes(ENCODING1), ENCODING2)
with all possibilities of ENCODING1 and ENCODING2 (for 3 encodings that would be 9 options),
then finding some way (for example: character frequency?) to tell which of the 9 results is the correct one.
Are there any 3rd party libraries for this?
I tried JChardet and ICU4J, but as far as I know both of them are only capable of detecting the encoding of the file before the step with ENCODING1 took place.
Thanks,
krisy
You can use the library provided by Google (juniversalchardet) to detect the character set of a file; please see the following:
import org.mozilla.universalchardet.UniversalDetector;

public class TestDetector {
    public static void main(String[] args) throws java.io.IOException {
        if (args.length != 1) {
            System.err.println("Usage: java TestDetector FILENAME");
            System.exit(1);
        }
        byte[] buf = new byte[4096];
        String fileName = args[0];
        java.io.FileInputStream fis = new java.io.FileInputStream(fileName);

        // (1) construct an instance of UniversalDetector
        UniversalDetector detector = new UniversalDetector(null);

        // (2) feed data chunks to the detector until it is done
        int nread;
        while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
            detector.handleData(buf, 0, nread);
        }

        // (3) notify the detector that no more data is coming
        detector.dataEnd();

        // (4) query the detected charset name
        String encoding = detector.getDetectedCharset();
        if (encoding != null) {
            System.out.println("Detected encoding = " + encoding);
        } else {
            System.out.println("No encoding detected.");
        }

        // (5) reset the detector for reuse
        detector.reset();
    }
}
Read more at the following URL.
You can also try jCharDet from SourceForge; please see the following URL.
Cheers!
Inside the JVM, Strings are always Unicode (converted on reading or creation), so aStringVariable.getBytes(ENCODING1) will only work for output.
For a basic understanding you should read http://www.joelonsoftware.com/articles/Unicode.html.
As mentioned in that article, there is no way to know for sure which original encoding was used; according to the article, Internet Explorer, for example, guesses by the frequency of different bytes.
So the original files are in UTF-8 (multibyte Unicode format), ISO-8859-2 (Latin-2), or Windows-1252 (MS Latin-1), and you want to have them all in UTF-8.
First, the download should not do any conversion, so the contents stay intact.
Otherwise you could only attempt to repair a wrong encoding, without any guarantee.
Java uses Unicode for text internally, so create a String only with the correct encoding. For the file contents, use byte[].
The functionality available:
If the file is in 7-bit US-ASCII, then it already is UTF-8.
If the file has only valid UTF-8 sequences, it most likely is UTF-8; this can be tested.
What remains is to distinguish between Latin-2 and MS Latin-1.
The latter can be done with some statistics. For instance, identifying the language by its 100 most frequent words works rather well.
I am aware of a couple of charset detectors. That the one you tried did not seem to work might also mean that the file is already corrupted. You might check with Notepad++, JEdit, or some other encoding-converting editor.
Charset detectCharset(Path path) throws IOException {
    byte[] content = Files.readAllBytes(path);
    boolean ascii = true;
    boolean utf8 = true;
    Map<Byte, Integer> specialCharFrequencies = new TreeMap<>();
    for (int i = 0; i < content.length; ++i) {
        byte b = content[i];
        if (b < 0) {
            ascii = false;
            if ((b & 0xC0) == 0x80) { // UTF-8 continuation byte
                if (i == 0 || content[i - 1] >= 0) { // must not start the file or follow an ASCII byte
                    utf8 = false;
                }
            }
            specialCharFrequencies.merge(b, 1, Integer::sum);
        }
    }
    if (ascii || utf8) {
        return StandardCharsets.UTF_8;
    }
    // ... determine by frequencies
    Charset latin1 = Charset.forName("Windows-1252");
    Charset latin2 = Charset.forName("ISO-8859-2");
    System.out.println(" B Freq 1 2");
    specialCharFrequencies.entrySet().stream()
            .forEach(e -> System.out.printf("%02x %06d %s %s%n",
                    e.getKey() & 0xFF, e.getValue(),
                    new String(new byte[] {e.getKey()}, 0, 1, latin1),
                    new String(new byte[] {e.getKey()}, 0, 1, latin2)));
    return null;
}
Illegal UTF-8 can slip through this check, but it would be easy to use a Charset decoder.
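A minimal sketch of such a strict check with a CharsetDecoder:

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Returns true only if every byte sequence in the content is well-formed UTF-8
static boolean isValidUtf8(byte[] content) {
    CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
    try {
        decoder.decode(ByteBuffer.wrap(content));
        return true;
    } catch (CharacterCodingException e) {
        return false;
    }
}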

Corrupt Gzip string due to character encoding

I have some corrupted gzip log files that I'm trying to restore. The files were transferred to our servers through a Java-backed web page. The files have always been sent as plain text, but we recently started to receive log files that were gzipped. These gzipped files appear to be corrupted and are not unzippable, and the originals have been deleted. I believe this is due to the character encoding in the method below.
Is there any way to revert the process to restore the files to their original zipped format? I have the resulting String's binary array data in a database blob.
Thanks for any help you can give!
private String convertStreamToString(InputStream is) throws IOException {
    /*
     * To convert the InputStream to String we use the
     * Reader.read(char[] buffer) method. We iterate until the
     * Reader returns -1, which means there's no more data to
     * read. We use the StringWriter class to produce the string.
     */
    if (is != null) {
        Writer writer = new StringWriter();
        char[] buffer = new char[1024];
        try {
            Reader reader = new BufferedReader(
                    new InputStreamReader(is, "UTF-8"));
            int n;
            while ((n = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, n);
            }
        } finally {
            is.close();
        }
        return writer.toString();
    } else {
        return "";
    }
}
If this is the method that was used to convert the InputStream to a String, then your data is almost certainly lost.
The problem is that UTF-8 has quite a few byte sequences that are simply not legal (i.e. they don't represent any value). These sequences are replaced with the Unicode replacement character.
That character is the same no matter which invalid byte sequence was decoded, so the specific information in those bytes is lost.
If that's the code you have, you never should have converted to a Reader (or in fact a String). Only preserving the data as a stream (or byte array) avoids corrupting binary files. Once it's read into the string, illegal sequences (and there are many in UTF-8) WILL be discarded.
So no, unless you are quite lucky, there is no way to recover the information. You'll have to provide another process where you handle the pure stream and insert it as a pure BLOB, not a CLOB.
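For future uploads, a byte-preserving sketch of the same helper (hypothetical name convertStreamToBytes), whose result can go straight into a BLOB:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Copy the stream as raw bytes; no Reader, no charset, nothing is lost
private static byte[] convertStreamToBytes(InputStream is) throws IOException {
    if (is == null) {
        return new byte[0];
    }
    try (InputStream in = is) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[1024];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return buffer.toByteArray();
    }
}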
