Corrupt Gzip string due to character encoding

I have some corrupted Gzip log files that I'm trying to restore. The files were transferred to our servers through a Java-backed web page. The files had always been sent as plain text, but we recently started to receive log files that were Gzipped. These Gzipped files appear to be corrupted and cannot be unzipped, and the originals have been deleted. I believe the corruption comes from the character encoding in the method below.
Is there any way to reverse the process and restore the files to their original zipped format? I have the resulting String's bytes stored in a database blob.
Thanks for any help you can give!
private String convertStreamToString(InputStream is) throws IOException {
    /*
     * To convert the InputStream to String we use the
     * Reader.read(char[] buffer) method. We iterate until the
     * Reader return -1 which means there's no more data to
     * read. We use the StringWriter class to produce the string.
     */
    if (is != null) {
        Writer writer = new StringWriter();
        char[] buffer = new char[1024];
        try {
            Reader reader = new BufferedReader(
                    new InputStreamReader(is, "UTF-8"));
            int n;
            while ((n = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, n);
            }
        } finally {
            is.close();
        }
        return writer.toString();
    } else {
        return "";
    }
}

If this is the method that was used to convert the InputStream to a String, then your data is almost certainly lost.
The problem is that UTF-8 has quite a few byte sequences that are simply not legal (i.e. they don't represent any value). These sequences will be replaced with the Unicode replacement character.
That character is the same no matter which invalid byte sequence was decoded. Therefore the specific information in those bytes is lost.
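To make the loss concrete, here is a small sketch (the class name ReplacementDemo is made up for illustration). It decodes two different invalid bytes as UTF-8 and encodes the result back; both come out as the bytes of U+FFFD, so the originals cannot be told apart any more. A gzip stream starts with the bytes 0x1F 0x8B, and 0x8B on its own is already not valid UTF-8, so this kind of damage starts at the very first bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ReplacementDemo {
    // Decode bytes as UTF-8, then encode the resulting String back to bytes.
    static byte[] roundTrip(byte[] original) {
        String decoded = new String(original, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] a = {(byte) 0xFF}; // never valid in UTF-8
        byte[] b = {(byte) 0xFE}; // never valid in UTF-8
        // Both lines print the same bytes (the encoding of U+FFFD), not the originals.
        System.out.println(Arrays.toString(roundTrip(a)));
        System.out.println(Arrays.toString(roundTrip(b)));
    }
}
```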

If that's the code you have, you should never have converted the data to a Reader (or, in fact, a String). Only keeping it as a stream (or byte array) avoids corrupting binary files. Once the data is read into a String, illegal sequences (and there are many in UTF-8) are replaced, and the original bytes are gone.
So no, unless you are quite lucky, there is no way to recover the info. You'll have to provide another process that handles the raw stream and inserts it as a BLOB, not a CLOB.
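For future uploads, a byte-only path could look like the sketch below. It is a minimal illustration, not the original application's code: the class name BinarySafeUpload and its helper methods are hypothetical, and the database insert is left out. The point is only that the gzip data survives when no Reader or String is involved.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class BinarySafeUpload {
    // Copies the stream into a byte array without any character decoding.
    static byte[] readAllBytes(InputStream is) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        int n;
        while ((n = is.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    // Helper to produce sample gzip data for the demonstration.
    static byte[] gzip(byte[] plain) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(plain);
        }
        return out.toByteArray();
    }

    static byte[] gunzip(byte[] compressed) throws IOException {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return readAllBytes(gz);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] upload = gzip("a log line\n".getBytes(StandardCharsets.UTF_8));
        // "stored" stands for the bytes that would be written to the BLOB column.
        byte[] stored = readAllBytes(new ByteArrayInputStream(upload));
        System.out.println(Arrays.equals(upload, stored));
        System.out.print(new String(gunzip(stored), StandardCharsets.UTF_8));
    }
}
```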

Related

Damaged Pdf after setting content from a server response

I am currently making REST calls to a server for signing a PDF document.
I am sending a PDF (binary content) and retrieving the binary content of the signed PDF.
When I get the binary content from the inputStream:
try (InputStream inputStream = conn.getInputStream()) {
    if (inputStream != null) {
        try (BufferedReader br = new BufferedReader(new InputStreamReader(inputStream))) {
            String lines;
            while ((lines = br.readLine()) != null) {
                output += lines;
            }
        }
    }
}
signedPdf.setBinaryContent(output.getBytes());
(signedPdf is a DTO with a byte[] attribute)
but when I try to set the content of the PDF with the content of the response PDF:
ByteArrayOutputStream out = new ByteArrayOutputStream();
out.write(signedPdf);
pdf.setContent(signedPdf);
and try to open it, it says that the PDF is damaged and cannot be repaired.
Has anyone encountered something similar? Do I need to set the content-length as well for the output stream?

PDF is binary data. Reading it as text (which in Java is always Unicode) corrupts it.
It is also wasteful: storing each byte as a char doubles the memory usage, and
there are two conversions, from bytes to String and back, each using some encoding.
When converting from UTF-8, the decoder may even hit UTF-8 format errors (byte sequences that are not valid UTF-8). In addition, readLine() drops the line terminators, which are part of the binary content.
try (InputStream inputStream = conn.getInputStream()) {
    if (inputStream != null) {
        byte[] content = inputStream.readAllBytes();
        signedPdf.setBinaryContent(content);
    }
}
Whether to use a BufferedInputStream depends, for instance, on the expected PDF size.
Furthermore, new String(byte[], Charset) and String.getBytes(Charset) with an explicit Charset (like StandardCharsets.UTF_8) are preferable to the overloads that take no Charset. Those use the current platform encoding and hence produce non-portable code that behaves differently on another platform or computer.
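As a small illustration of why the charset matters, the sketch below (the class ExplicitCharset and its helper encodedLength are hypothetical names) encodes the same one-character text with two explicit charsets and gets byte arrays of different lengths. With the no-argument overloads, which of the two you get would depend on the machine.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitCharset {
    // Number of bytes the text occupies in the given charset.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String text = "\u00e9"; // the character e with acute accent
        System.out.println(encodedLength(text, StandardCharsets.UTF_8));      // two bytes: C3 A9
        System.out.println(encodedLength(text, StandardCharsets.ISO_8859_1)); // one byte: E9
        // Decoding with the same charset that was used for encoding restores the text.
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(text));
    }
}
```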

inputStream and utf 8 sometimes shows "?" characters

So I've been dealing with this problem for over a month now. I have checked almost every related solution here and on Google, but I couldn't find anything that really solved my case.
My problem is that I'm trying to download the HTML source of a website, but in most cases some of the text comes back with "?" characters in it, most likely because the site is in Hebrew.
Here's my code:
public static InputStream openHttpGetConnection(String url)
        throws Exception {
    InputStream inputStream = null;
    HttpClient httpClient = new DefaultHttpClient();
    HttpResponse httpResponse = httpClient.execute(new HttpGet(url));
    inputStream = httpResponse.getEntity().getContent();
    return inputStream;
}
public static String downloadSource(String url) {
    int BUFFER_SIZE = 1024;
    InputStream inputStream = null;
    try {
        inputStream = openHttpGetConnection(url);
    } catch (Exception e) {
        // TODO: handle exception
    }
    int bytesRead;
    String str = "";
    byte[] inpputBuffer = new byte[BUFFER_SIZE];
    try {
        while ((bytesRead = inputStream.read(inpputBuffer)) > 0) {
            String read = new String(inpputBuffer, 0, bytesRead,"UTF-8");
            str +=read;
        }
    } catch (Exception e) {
        // TODO: handle exception
    }
    return str;
}
Thanks.

To read characters from a byte stream with a given encoding, use a Reader. In your case it would be something like:
InputStreamReader isr = new InputStreamReader(inputStream, "UTF-8");
char[] inputBuffer = new char[BUFFER_SIZE];
int charsRead;
while ((charsRead = isr.read(inputBuffer, 0, BUFFER_SIZE)) > 0) {
    String read = new String(inputBuffer, 0, charsRead);
    str += read;
}
You can see that the bytes are decoded directly into characters; it's the reader's job to know whether it needs to read one, two, or more bytes to create each character in the buffer. It's basically your approach, but decoding as the bytes are being read in instead of afterwards.
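A complete version of the Reader-based approach might look like the following sketch. The class ReaderDownload and the trickle helper are made up for this illustration; trickle hands out one byte per read call to imitate a network stream that splits multi-byte characters across reads, which the Reader is expected to cope with.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ReaderDownload {
    // Decodes the whole stream with the given charset, collecting chars in a StringBuilder.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[1024];
        try (Reader reader = new InputStreamReader(in, charset)) {
            int charsRead;
            while ((charsRead = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, charsRead);
            }
        }
        return sb.toString();
    }

    // Test stream that returns at most one byte per bulk read.
    static InputStream trickle(byte[] data) {
        return new ByteArrayInputStream(data) {
            @Override
            public synchronized int read(byte[] b, int off, int len) {
                return super.read(b, off, Math.min(len, 1));
            }
        };
    }

    public static void main(String[] args) throws IOException {
        String hebrew = "\u05e9\u05dc\u05d5\u05dd"; // four Hebrew letters, two bytes each in UTF-8
        byte[] bytes = hebrew.getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(trickle(bytes), StandardCharsets.UTF_8).equals(hebrew));
    }
}
```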

Converting an InputStream to a String entails specifying an encoding, just as you do at new String(inpputBuffer, 0, bytesRead,"UTF-8");.
But your approach has several drawbacks.
How do you know you have to use UTF-8?
When retrieving HTTP content, generally speaking, you cannot know in advance what encoding will be used in the HTTP response. But HTTP provides a mechanism for specifying that, using the Content-Type header.
More specifically, your response object should have a Content-Type header with a parameter called charset. In the response, it should look something like:
Content-Type: text/html; charset=UTF-8
You should use whatever comes after the charset= part to transform your bytes to chars.
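If you only have the raw header value, picking out the charset could be sketched as below. Note that the parameter is named charset in HTTP. The class ContentTypeCharset and the method charsetOf are hypothetical helpers, not part of HttpClient, and the parsing is deliberately simple (it splits on semicolons and does not handle every quoting rule).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ContentTypeCharset {
    // Returns the charset named in a Content-Type value, or the fallback if none is usable.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Parameter names are case-insensitive, so compare ignoring case.
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Unknown or malformed charset name.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=UTF-8", StandardCharsets.ISO_8859_1));
        System.out.println(charsetOf("text/html", StandardCharsets.ISO_8859_1));
    }
}
```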
Since you seem to use Apache HttpClient, their documentation states:
You can set the content type header for a request with the addRequestHeader method in each method and retrieve the encoding for the response body with the getResponseCharSet method.
If the response is known to be a String, you can use the getResponseBodyAsString method which will automatically use the encoding specified in the Content-Type header or ISO-8859-1 if no charset is specified.
Alternate way
If there is no Content-Type header, and if you know your content is HTML, then you can try to convert it to a String using some encoding (preferably UTF-8 or ISO Latin), look for content matching <meta charset="UTF-8">, and use that as the charset. This should only be a fallback.
Not every byte sequence is convertible to a String
Drawback number two is that you read an arbitrary number of bytes from your stream and try to convert them to a String, which may not be possible.
In practice, UTF-8 encodes some characters across several bytes. For example, "é" is encoded as the two bytes 0xC3 0xA9. So say, for example, that the response consists of two "é" characters. If your first call to read returns:
[c3, a9, c3]
your conversion to a String using new String(byte[], offset, length, charsetName) cannot decode the last byte, because on its own it is not a valid UTF-8 sequence.
Your following read will get what's left to read:
[a9]
which, on its own, is not an "é" character either.
Bottom line: you cannot reliably convert even a valid UTF-8 byte sequence to a String using your pattern.
Going forward: since you use HttpClient, use its method for converting the HTTP response to a String.
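The two-"é" case can be reproduced with a short sketch. ChunkSplitDemo and decodeInChunks are illustrative names; the method imitates the pattern of decoding each read buffer separately, with the chunk size standing in for however many bytes a single read happens to return.

```java
import java.nio.charset.StandardCharsets;

class ChunkSplitDemo {
    // Decodes each chunk on its own and concatenates the results.
    static String decodeInChunks(byte[] data, int chunkSize) {
        String result = "";
        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            result += new String(data, off, len, StandardCharsets.UTF_8);
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "\u00e9\u00e9"; // two e-acute characters: C3 A9 C3 A9
        byte[] data = text.getBytes(StandardCharsets.UTF_8);
        // A chunk size of 3 cuts the second character in half.
        System.out.println(decodeInChunks(data, 3).equals(text));
        // Decoding all four bytes at once keeps both characters intact.
        System.out.println(decodeInChunks(data, 4).equals(text));
    }
}
```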
If you wish to do it yourself, the easy way is to copy your input to a byte array, and then convert the byte array. Something along the lines of:
ByteArrayOutputStream responseContent = new ByteArrayOutputStream();
byte[] chunk = new byte[8192];
int n;
// Copy every byte first; decode only once the whole response is in memory.
while ((n = responseInputStream.read(chunk)) != -1) {
    responseContent.write(chunk, 0, n);
}
byte[] rawResponse = responseContent.toByteArray();
String stringResponse = new String(rawResponse, encoding);
But you could also use a CharsetDecoder if you want a fully streamed implementation (one that does not buffer the response fully into memory), or, as @jas answers, wrap your inputStream in a Reader and concatenate the output (preferably into a StringBuilder, which should be faster when many concatenations occur).

Decompressing PHP's gzcompress in Java

I'm trying to decompress a JSON object in Java that was initially compressed in PHP. Here's how it gets compressed in PHP:
function zip_json_encode(&$arr) {
    $uncompressed = json_encode($arr);
    return pack('L', strlen($uncompressed)).gzcompress($uncompressed);
}
and decoded (again in PHP):
function unzip_json_decode(&$data) {
    $uncompressed = @gzuncompress(substr($data,4));
    return json_decode($uncompressed, $array_instead_of_object);
}
That gets put into MySQL and now it must be pulled out of the db by Java. We pull it out from the ResultSet like this:
String field = rs.getString("field");
I then pass that string to a method to decompress it. This is where it falls apart.
private String decompressHistory(String historyString) throws SQLException {
    StringBuffer buffer = new StringBuffer();
    try {
        byte[] historyBytes = historyString.substring(4).getBytes();
        ByteArrayInputStream bin = new ByteArrayInputStream(historyBytes);
        InflaterInputStream in = new InflaterInputStream(bin, new Inflater(true));
        int len;
        byte[] buf = new byte[1024];
        while ((len = in.read(buf)) != -1) {
            // buf should be decoded, right?
        }
    } catch (IOException e) {
        e.getStackTrace();
    }
    return buffer.toString();
}
Not quite sure what's going wrong here, but any pointers would be appreciated!

You need to get rid of the true in Inflater(true). Use just Inflater(). The true makes it expect raw deflate data. Without the true, it is expecting zlib-wrapped deflate data. PHP's gzcompress() produces zlib-wrapped deflate data.
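As a rough sketch of the byte-level handling, the example below builds data in the same shape as the PHP function (a 4-byte length followed by zlib-wrapped deflate data) and unpacks it with a plain Inflater(). ZlibPrefixDemo, packLikePhp and unpack are made-up names, the Java-side packing is only a stand-in for the PHP code, and little-endian order for the length is an assumption (pack('L', ...) uses machine byte order). The unpack side ignores the prefix, so that assumption does not affect it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.Inflater;
import java.util.zip.InflaterInputStream;

class ZlibPrefixDemo {
    // Stand-in for the PHP side: 4 length bytes, then zlib-wrapped deflate data.
    static byte[] packLikePhp(String json) throws IOException {
        byte[] raw = json.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int n = raw.length;
        // Assumed little-endian; write(int) stores only the low 8 bits.
        out.write(n);
        out.write(n >>> 8);
        out.write(n >>> 16);
        out.write(n >>> 24);
        // The default Deflater writes a zlib header and checksum.
        try (DeflaterOutputStream z = new DeflaterOutputStream(out)) {
            z.write(raw);
        }
        return out.toByteArray();
    }

    // Skips the 4-byte prefix and inflates the rest as zlib data.
    static String unpack(byte[] data) throws IOException {
        ByteArrayInputStream bin = new ByteArrayInputStream(data, 4, data.length - 4);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = new InflaterInputStream(bin, new Inflater())) {
            byte[] buf = new byte[1024];
            int len;
            while ((len = in.read(buf)) != -1) {
                out.write(buf, 0, len);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] packed = packLikePhp("{\"history\":[1,2,3]}");
        System.out.println(unpack(packed));
    }
}
```

This only works if the compressed bytes reach Java unchanged, so they need to be fetched as bytes (for example with getBytes or getBinaryStream on the ResultSet) rather than as a String.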

Gzipped data is binary, i.e. byte[]. Holding it in a String (Unicode text) not only requires a conversion but is also error-prone.
For instance, this involves a conversion:
byte[] historyBytes = historyString.substring(4).getBytes();
byte[] historyBytes = historyString.substring(4).getBytes("ISO-8859-1");
The first version uses the default platform encoding, making the application non-portable.
The first to-do is to store the binary data in the database as VARBINARY or BLOB.
InputStream field = rs.getBinaryStream("field");
// gzcompress() emits zlib data, so InflaterInputStream fits here rather than GZIPInputStream;
// the 4-byte length prefix added by pack('L', ...) must be skipped first.
try (InputStream in = new InflaterInputStream(field)) {
    ...
}
Or so. Mind the other answer.

In the end, neither of the above solutions worked, but both have merits. When we pulled the data out of MySQL and cast it to bytes, a number of character bytes (67) were missing. This made it impossible to decompress on the Java side. As for the answers above: Mark is correct that gzcompress() uses zlib and therefore you should use the Inflater() class in Java.
Joop is correct that the data conversion is faulty. Our table was too large to convert to VARBINARY or BLOB. That may have solved the problem, but it wasn't an option for us. We ended up having Java make a request to our PHP app, then simply unpacked the compressed data on the PHP side. This worked well. Hopefully this is helpful to anyone else who stumbles across it.

Java Charset InputStreamReader, File Channel Differences

I'm trying to read a (Japanese) file that is encoded as a UTF-16 file.
When I read it using an InputStreamReader with a charset of "UTF-16", the file is read correctly:
try {
    InputStreamReader read = new InputStreamReader(new FileInputStream("JapanTest.txt"), "UTF-16");
    BufferedReader in = new BufferedReader(read);
    String str;
    while ((str = in.readLine()) != null) {
        System.out.println(str);
    }
    in.close();
} catch (Exception e) {
    System.out.println(e);
}
However, when I use File Channels and read from a byte array the Strings aren't always converted correctly:
File f = new File("JapanTest.txt");
fis = new FileInputStream(f);
channel = fis.getChannel();
MappedByteBuffer buffer = channel.map( FileChannel.MapMode.READ_ONLY, 0L, channel.size());
buffer.position(0);
int get = Math.min(buffer.remaining(), 1024);
byte[] barray = new byte[1024];
buffer.get(barray, 0, get);
Charset charSet = Charset.forName("UTF-16");
//endOfLinePos is a calculated value and defines the number of bytes to read
rowString = new String(barray, 0, endOfLinePos, charSet);
System.out.println(rowString);
The problem I've found is that I can only read characters correctly if the MappedByteBuffer is at position 0. If I increment the position of the MappedByteBuffer and then read a number of bytes into a byte array, which is then converted to a string using the charset UTF-16, then the bytes are not converted correctly. I haven't faced this issue if a file is encoded in UTF-8, so is this only an issue with UTF-16?
More Details:
I need to be able to read any line from the file channel, so to do this I build a list of line ending byte positions and then use those positions to be able to get the bytes for any given line and then convert them to a string.

The code unit of UTF-16 is 2 bytes, not a single byte as in UTF-8. Its bit patterns and single-byte code units make UTF-8 self-synchronizing: a decoder can start reading at any point, and if it lands on a continuation byte it can either backtrack or lose only a single character.
With UTF-16 you must always work with pairs of bytes; you cannot start or stop reading at an odd byte offset. You must also know the endianness and use either UTF-16LE or UTF-16BE when not reading from the start of the file, because there will be no BOM.
You can also encode the file as UTF-8.
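One way to apply this to reading an arbitrary line is sketched below. Utf16Slice, byteOrderOf, decodeSlice and sampleFile are names invented for the illustration, and the in-memory array stands in for the mapped file. The idea is to look at the BOM once, remember the byte order, and decode every later slice with the matching explicit charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Utf16Slice {
    // Picks the byte order from the BOM at the start of the file; big-endian if it is not FF FE.
    static Charset byteOrderOf(byte[] file) {
        if (file.length >= 2 && (file[0] & 0xFF) == 0xFF && (file[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        return StandardCharsets.UTF_16BE;
    }

    // Decodes a slice that starts on an even offset relative to the BOM.
    static String decodeSlice(byte[] file, int offset, int length) {
        return new String(file, offset, length, byteOrderOf(file));
    }

    // Builds a little-endian UTF-16 "file": BOM, two characters, a newline, three characters.
    static byte[] sampleFile() {
        byte[] body = "\u65e5\u672c\n\u30c6\u30b9\u30c8".getBytes(StandardCharsets.UTF_16LE);
        byte[] file = new byte[body.length + 2];
        file[0] = (byte) 0xFF;
        file[1] = (byte) 0xFE;
        System.arraycopy(body, 0, file, 2, body.length);
        return file;
    }

    public static void main(String[] args) {
        byte[] file = sampleFile();
        // The second line starts after the BOM (2 bytes) and three 2-byte characters.
        System.out.println(decodeSlice(file, 8, 6).equals("\u30c6\u30b9\u30c8"));
        // Plain UTF-16 finds no BOM in the slice and assumes the wrong byte order here.
        System.out.println(new String(file, 8, 6, StandardCharsets.UTF_16).equals("\u30c6\u30b9\u30c8"));
    }
}
```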

Possibly, the InputStreamReader does some transformations the normal new String(...) does not. As a work-around (and to verify this assumption) you could try to wrap the data read from the channel like new InputStreamReader( new ByteArrayInputStream( barray ) ).
Edit: Forget that :) - Channels.newReader() would be the way to go.

Reading Strings and Binary from the same FileInputStream

I have a file that contains some amount of plain text at the start followed by binary content at the end. The size of the binary content is determined by one of the plain text lines I read.
I was using a BufferedReader to read the individual lines, but it exposes no methods for reading a byte array. The readUTF method of DataInputStream doesn't read all the way to the end of the line, and its readLine method is deprecated.
Using the underlying FileInputStream to read returns empty byte arrays. Any suggestions on how to go about this?
private DOTDataInfo parseFile(InputStream stream) throws IOException{
    DOTDataInfo info = new DOTDataInfo();
    BufferedReader reader = new BufferedReader(new InputStreamReader(stream));
    int binSize = 0;
    String line;
    while((line = reader.readLine()) != null){
        if(line.length() == 0)
            break;
        DOTProperty prop = parseProperty(line);
        info.getProperties().add(prop);
        if(prop.getName().equals("ContentSize"))
            binSize = Integer.parseInt(prop.getValue());
    }
    byte[] content = new byte[binSize];
    stream.read(content); //Its all empty now. If I use a DataInputStream instead, its got the values from the file
    return info;
}

You could use RandomAccessFile. Use readLine() to read the plain text at the start (note the limitations of this, as described in the API), and then readByte() or readFully() to read the subsequent binary data.
"Using the underlying FileInputStream to read returns empty byte arrays."
That's because you have wrapped the stream in a BufferedReader, which has probably consumed all the bytes from the stream when filling up its buffer.

If you genuinely have a file (rather than something harder to seek in, e.g. a network stream) then I suggest something like this:
1. Open the file as a FileInputStream
2. Wrap it in InputStreamReader and a BufferedReader
3. Read the text, so you can find out how much content there is
4. Close the BufferedReader (which will close the InputStreamReader which will close the FileInputStream)
5. Reopen the file
6. Skip to (total file length - binary content length)
7. Read the rest of the data as normal
If you want to avoid reopening the file, you could call mark() at the start of the stream and then reset() and skip() to get to the right place; note that FileInputStream itself does not support mark(), so this needs a wrapper such as BufferedInputStream with a large enough read limit. (I was looking for an InputStream.seek() but I can't see one. I can't remember wanting it before in Java, but does it really not have one? Ick.)

You need to use an InputStream. Readers are for character data. Look into wrapping your input stream with a DataInputStream, like:
stream=new DataInputStream(new BufferedInputStream(new FileInputStream(...)));
The data input stream will give you many useful methods to read various types of data, and of course, the base InputStream methods for reading bytes.
(This is actually exactly what an HTTP server must do to read a request with content.)
The readUTF method doesn't read a line; it reads a string that was written in (modified) UTF-8 format. Refer to the JavaDoc.

Alas, DataInputStream.readLine() is deprecated and does not handle UTF-8. But this should help (it reads a line from a binary stream, without any lookahead).
public static String lineFrom(InputStream in) throws IOException {
    byte[] buf = new byte[128];
    int pos = 0;
    for (;;) {
        int ch = in.read();
        if (ch == '\n' || ch < 0) break;
        buf[pos++] = (byte) ch;
        if (pos == buf.length) buf = Arrays.copyOf(buf, pos + 128);
    }
    return new String(Arrays.copyOf(buf, pos), "UTF-8");
}
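Putting the pieces together, reading the text header and the binary tail from one stream might look like the sketch below. The class MixedFile is hypothetical, and the "Name=Value" header format is an assumption made for the example (the question's parseProperty is not shown). Because the line reader never reads ahead, the stream is positioned exactly at the first binary byte when the header ends.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class MixedFile {
    // Reads bytes up to (not including) '\n' without reading ahead.
    static String readLine(InputStream in) throws IOException {
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        int ch;
        while ((ch = in.read()) != -1 && ch != '\n') {
            line.write(ch);
        }
        return new String(line.toByteArray(), StandardCharsets.UTF_8);
    }

    // Parses header lines up to an empty line, then reads ContentSize raw bytes.
    static byte[] readBinaryPart(InputStream stream) throws IOException {
        DataInputStream in = new DataInputStream(stream);
        int binSize = 0;
        String line;
        while (!(line = readLine(in)).isEmpty()) {
            String[] kv = line.split("=", 2); // assumed "Name=Value" format
            if (kv.length == 2 && kv[0].equals("ContentSize")) {
                binSize = Integer.parseInt(kv[1].trim());
            }
        }
        byte[] content = new byte[binSize];
        in.readFully(content); // throws EOFException if the stream is too short
        return content;
    }
}
```

For a real file, wrapping the FileInputStream in a BufferedInputStream before passing it in should keep the byte-at-a-time reads cheap without losing any data, since the buffering then sits below the line reader rather than inside a Reader.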

The correct way is to use an InputStream of some form, probably a FileInputStream unless this becomes a performance barrier.
What do you mean "Using the underlying FileInputStream to read returns empty byte arrays."? This seems very unlikely and is probably where your mistake is. Can you show us the example code you've tried?

You can read the text with BufferedReader. When you know where the binary starts you can close the file and open it with RandomAccessFile and read binary from any point in the file.
Or you can read the file as binary and convert the sections you identify as text, using new String(bytes, encoding).

I recommend using DataInputStream. You have the following options:
Read both text and binary content with DataInputStream
Open a BufferedReader, read text and close the stream. Then open a DataInputStream, skip bytes equal to the size of the text and read binary data.
