My Java program implements a server that receives a very large gzip-compressed file from a client over WebSockets and has to check for a certain byte pattern in the file content.
The client sends the file in chunks embedded inside a proprietary protocol, so I receive message after message, parse each message, and extract the gzipped file content.
I can't hold the whole file in memory, so I'm trying to decompress each chunk, process the data, and continue to the next chunk.
I'm using the following code:
public static String gzipDecompress(byte[] compressed) throws IOException {
    String uncompressed;
    try (
        ByteArrayInputStream bis = new ByteArrayInputStream(compressed);
        GZIPInputStream gis = new GZIPInputStream(bis);
        Reader reader = new InputStreamReader(gis);
        Writer writer = new StringWriter()
    ) {
        char[] buffer = new char[10240];
        for (int length = 0; (length = reader.read(buffer)) > 0; ) {
            writer.write(buffer, 0, length);
        }
        uncompressed = writer.toString();
    }
    return uncompressed;
}
But I'm getting the following exception when calling the function with the first compressed chunk:
java.io.EOFException: Unexpected end of ZLIB input stream
at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:240)
at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:117)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.Reader.read(Reader.java:140)
It's important to mention that I'm not skipping any chunks and that I'm trying to decompress the chunks sequentially, in order.
What am I missing?
The problem is that you are handling those chunks manually.
The correct way would be to obtain some InputStream, wrap it with GZIPInputStream and then read the data.
InputStream is = // obtain the original gzip stream
GZIPInputStream gis = new GZIPInputStream(is);
Reader reader = new InputStreamReader(gis);
//... proceed reading and so on
GZIPInputStream works in stream fashion, so if you only ask 10kb at a time from your reader, the overall memory footprint will be low regardless of the size of the initial GZIP file.
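For example, here is a minimal sketch of reading fixed-size chunks from the gzip stream and scanning the decompressed bytes for a pattern; the overlap handling and the naive indexOf helper are only illustrative and assume a non-empty pattern that is much smaller than the buffer:
// Returns true if `pattern` occurs anywhere in the decompressed stream.
public static boolean containsPattern(InputStream gzipped, byte[] pattern) throws IOException {
    try (GZIPInputStream gis = new GZIPInputStream(gzipped)) {
        byte[] buffer = new byte[10240 + pattern.length - 1];
        int kept = 0; // bytes carried over from the previous read
        int length;
        while ((length = gis.read(buffer, kept, buffer.length - kept)) > 0) {
            int valid = kept + length;
            if (indexOf(buffer, valid, pattern) >= 0) {
                return true;
            }
            // keep the last pattern.length - 1 bytes so a match spanning two reads is not missed
            kept = Math.min(pattern.length - 1, valid);
            System.arraycopy(buffer, valid - kept, buffer, 0, kept);
        }
        return false;
    }
}

// naive search for `pattern` within the first `count` bytes of `data`
private static int indexOf(byte[] data, int count, byte[] pattern) {
    outer:
    for (int i = 0; i <= count - pattern.length; i++) {
        for (int j = 0; j < pattern.length; j++) {
            if (data[i + j] != pattern[j]) {
                continue outer;
            }
        }
        return i;
    }
    return -1;
}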
Update after the question was updated
A possible solution for your situation is to write an InputStream implementation that streams bytes that are being put to it in chunks by your client protocol handler.
Here is a prototype:
public class ProtocolDataInputStream extends InputStream {

    private final BlockingQueue<byte[]> nextChunks = new ArrayBlockingQueue<byte[]>(100);
    private byte[] currentChunk = null;
    private int currentChunkOffset = 0;
    private volatile boolean noMoreChunks = false;

    @Override
    public int read() throws IOException {
        boolean takeNextChunk = currentChunk == null || currentChunkOffset >= currentChunk.length;
        if (takeNextChunk) {
            if (noMoreChunks && nextChunks.isEmpty()) {
                // stream is exhausted
                return -1;
            }
            try {
                currentChunk = nextChunks.take();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new InterruptedIOException("interrupted while waiting for the next chunk");
            }
            currentChunkOffset = 0;
        }
        // mask to 0..255 so bytes >= 0x80 are not mistaken for end-of-stream
        return currentChunk[currentChunkOffset++] & 0xFF;
    }

    @Override
    public int available() throws IOException {
        if (currentChunk == null) {
            return 0;
        }
        return currentChunk.length - currentChunkOffset;
    }

    public void addChunk(byte[] chunk, boolean chunkIsLast) throws InterruptedException {
        nextChunks.put(chunk); // blocks if the queue is full
        if (chunkIsLast) {
            noMoreChunks = true;
        }
    }
}
Your client protocol handler adds byte chunks using addChunk(), while your decompressing code pulls the data out of this stream (via Reader).
Please note that this code has some issues:
The queue being used has a limited size. If addChunk() is called faster than the data is consumed, the queue may fill up, which will block addChunk(). This may or may not be desirable.
Only the single-byte read() method is implemented for illustration purposes. For performance, it is better to implement read(byte[], int, int) in the same manner.
Thread safety relies on the BlockingQueue and the volatile noMoreChunks flag, under the assumption that a single reader thread (the decompressor) and a single writer thread (the protocol handler calling addChunk()) are involved.
InterruptedException is handled only minimally: read() converts it into an InterruptedIOException and addChunk() simply declares it, to keep the example short.
If your decompressor and addChunk() execute in the same thread (in the same loop), then you can check InputStream.available() when pulling from the InputStream, or Reader.ready() when pulling through a Reader, to avoid blocking on an empty queue.
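For the two-thread case, a rough usage sketch (parseChunkFromMessage() and isLastMessage() are hypothetical placeholders for your protocol handling code):
ProtocolDataInputStream protocolStream = new ProtocolDataInputStream();

// protocol handler thread: feed chunks as messages arrive
// protocolStream.addChunk(parseChunkFromMessage(message), isLastMessage(message));

// decompressor thread: stream-decompress without ever holding the whole file;
// note that the GZIPInputStream constructor already reads the gzip header,
// so it blocks until the first chunk has been added
try (Reader reader = new InputStreamReader(new GZIPInputStream(protocolStream))) {
    char[] buffer = new char[10240];
    int length;
    while ((length = reader.read(buffer)) > 0) {
        // process buffer[0..length) here
    }
}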
An arbitrary sequence of bytes from a gzipped stream is not valid standalone gzip data. One way or another, you must concatenate all the byte chunks.
The easiest way is to accumulate them all with a simple pipe:
import java.io.IOException;
import java.io.InputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.util.zip.GZIPInputStream;

public class ChunkInflater {

    private final PipedOutputStream pipe;
    private final PipedInputStream pipedIn;
    private InputStream stream;

    public ChunkInflater() throws IOException {
        pipe = new PipedOutputStream();
        pipedIn = new PipedInputStream(pipe);
    }

    // Created lazily: the GZIPInputStream constructor reads the gzip header from
    // the pipe, so it must not run before the writer has started adding chunks.
    public synchronized InputStream getInputStream() throws IOException {
        if (stream == null) {
            stream = new GZIPInputStream(pipedIn);
        }
        return stream;
    }

    public void addChunk(byte[] compressedChunk) throws IOException {
        pipe.write(compressedChunk);
    }

    // Call once all chunks have been added so the reading side sees end-of-stream.
    public void finish() throws IOException {
        pipe.close();
    }
}
Now you have an InputStream you can read in whatever increments you desire. For instance:
final ChunkInflater inflater = new ChunkInflater();

Callable<Void> chunkReader = new Callable<Void>() {
    @Override
    public Void call() throws IOException {
        byte[] chunk;
        while ((chunk = readChunkFromSource()) != null) {
            inflater.addChunk(chunk);
        }
        inflater.finish(); // closes the pipe so the reader sees end-of-stream
        return null;
    }
};

ExecutorService executor = Executors.newSingleThreadExecutor();
executor.submit(chunkReader);
executor.shutdown();

Reader reader = new InputStreamReader(inflater.getInputStream());
// read text here
I want to calculate the CRC32 checksum of a given InputStream and then use it to get the string out of it. Here's what I've tried so far:
private long calculateChecksum(InputStream stream) throws IOException {
    CRC32 crc = new CRC32();
    byte[] buffer = new byte[8192];
    int length;
    while ((length = stream.read(buffer)) > 0) {
        crc.update(buffer, 0, length);
    }
    return crc.getValue();
}
and then
String text = IOUtils.toString(inputStream, UTF_8);
I also tried reversing the order: first reading it into a string and then calculating the checksum, but that didn't work either.
The issue seems to be that the stream position advances to the end while calculating the checksum and doesn't reset. Any idea how to use the InputStream after calculating the checksum?
As others said, a stream can be consumed only once. But you can consume it and calculate the CRC value at the same time by wrapping your InputStream with a java.util.zip.CheckedInputStream.
Here is a complete example, assuming the text file "test.txt" is in the current directory and contains only this one line: These are german umlauts: äöüÄÖÜß
import org.apache.commons.io.IOUtils;

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.CheckedInputStream;

public class App {

    private static final String INPUT_FILE = "test.txt";

    public static void main(String[] args) {
        final CRC32 crc32 = new CRC32();
        try (InputStream in = new CheckedInputStream(new BufferedInputStream(
                new FileInputStream(INPUT_FILE)), crc32)) {
            final String text = IOUtils.toString(in, StandardCharsets.UTF_8);
            System.out.println(text);
            System.out.println(String.format("CRC32: %x", crc32.getValue()));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Output:
These are german umlauts: äöüÄÖÜß
CRC32: 84bcd851
Yes, an InputStream is consumed. You have a few options:
mark
mark() / reset() are optional methods of input streams: mark() sets a mark (by itself this does nothing), and reset() 'rewinds' back to the mark, replaying everything that was read since the last time you called mark().
However, your average inputstream either does not support it, or, if it does, supports it by storing in memory all the bytes that are received since setting the mark. Meaning, if you do this to an inputstream that contains a few GB worth of data, you're going to get an OutOfMemoryError.
If there isn't a lot of data, just use mark and reset. Wrap in a BufferedInputStream which is specced to support mark/reset:
private void example(InputStream in) throws IOException {
    BufferedInputStream buffered = new BufferedInputStream(in);
    buffered.mark(Integer.MAX_VALUE); // buffer everything read until reset()
    long crc = calculateChecksum(buffered);
    buffered.reset();
    String text = IOUtils.toString(buffered, UTF_8);
}
Duplicate
Your second option is to duplicate the inputstream, sending each retrieved byte both to IOUtils as well as to the CRC algorithm.
This is complicated and not recommended.
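For completeness, here is a minimal sketch of that idea: a FilterInputStream that updates a CRC32 as bytes pass through. This is essentially a re-implementation of CheckedInputStream, which is why hand-rolling it is rarely worth it.
class Crc32InputStream extends FilterInputStream {
    final CRC32 crc = new CRC32();

    Crc32InputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) crc.update(b);
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) crc.update(buf, off, n);
        return n;
    }
}

// usage: read through it once, then ask for both results
// Crc32InputStream counting = new Crc32InputStream(in);
// String text = IOUtils.toString(counting, UTF_8);
// long checksum = counting.crc.getValue();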
Checksum the string instead.
You already have a string of data. Just checksum that:
private void example(InputStream in) throws IOException {
    String text = IOUtils.toString(in, UTF_8);
    CRC32 crc = new CRC32();
    crc.update(text.getBytes(UTF_8));
    long checksum = crc.getValue();
}
Or, ditching IOUtils:
private void example(InputStream in) throws IOException {
    byte[] data = in.readAllBytes(); // Java 9+
    CRC32 crc = new CRC32();
    crc.update(data);
    long checksum = crc.getValue();
    String text = new String(data, UTF_8);
}
InputStream is a read-once stream. Once you've read it, you can't go back to start again. This is because InputStream is general-purpose: it could be the stream of bytes read from a keyboard, for example, or read from a real-time data feed.
If your input stream is in fact a FileInputStream, then you could use
inputStream.getChannel().position(0);
to reset it to the start of the file.
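A small sketch of that case (the file name is just a placeholder; calculateChecksum is the method from the question):
try (FileInputStream fis = new FileInputStream("data.txt")) {
    long checksum = calculateChecksum(fis);
    fis.getChannel().position(0); // rewind the underlying file
    String text = IOUtils.toString(fis, StandardCharsets.UTF_8);
}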
If it's a ByteArrayInputStream, then you already have a byte array so you might as well just use that instead.
If you want to write a general-purpose function that doesn't know what kind of InputStream it is given, then you can wrap it in a BufferedInputStream and use its mark() method. This will use extra memory to buffer the whole of the stream.
I have a function which writes the given input stream to a given output stream. Code below.
static void copyStream(InputStream is, OutputStream os) throws IOException {
    byte[] buffer = new byte[4096];
    int len;
    while ((len = is.read(buffer)) != -1) {
        os.write(buffer, 0, len);
    }
}
The above function is called from this function
public static void copyFile(File srcFile, File destFile) throws IOException {
    FileInputStream fis = new FileInputStream(srcFile);
    try {
        FileOutputStream fos = new FileOutputStream(destFile);
        try {
            copyStream(fis, fos);
        } finally {
            if (fos != null)
                fos.close();
        }
    } finally {
        if (fis != null)
            fis.close();
    }
}
In this function, I am copying 4 KB at a time. I use this function to copy images. Occasionally I see that the destination file is not created, which causes an exception when that file is read for further processing. I am guessing the culprit is not closing the resources. Is my hypothesis right? What are the reasons why my function might fail? Please help.
I assume that the given InputStream and OutputStream are set up correctly.
Add os.flush(); at the end. Of course, both streams should be closed in the caller as well.
As an alternative, you could use Apache Commons IO's org.apache.commons.io.IOUtils.copy(InputStream input, OutputStream output).
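For instance, the whole copy could be written with try-with-resources (Java 7+), so both streams are always closed even if the copy throws (a sketch of the same method):
public static void copyFile(File srcFile, File destFile) throws IOException {
    try (FileInputStream fis = new FileInputStream(srcFile);
         FileOutputStream fos = new FileOutputStream(destFile)) {
        IOUtils.copy(fis, fos); // or your own copyStream(fis, fos)
    } // close() flushes and releases both streams, even on exceptions
}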
Yes you absolutely must close your destination file to ensure that all caches from the JVM through to the OS are flushed and the file is ready for a reader to consume.
Copying large files the way that you are doing is concise in code but inefficient in operation. Consider upgrading your code to use the more efficient NIO methods, documented here in a blog post. In case that blog disappears, here's the code:
Utility class:
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.WritableByteChannel;

public final class ChannelTools {

    public static void fastChannelCopy(final ReadableByteChannel src,
                                       final WritableByteChannel dest) throws IOException {
        final ByteBuffer buffer = ByteBuffer.allocateDirect(16 * 1024);
        while (src.read(buffer) != -1) {
            // prepare the buffer to be drained
            buffer.flip();
            // write to the channel, may block
            dest.write(buffer);
            // If partial transfer, shift remainder down
            // If buffer is empty, same as doing clear()
            buffer.compact();
        }
        // EOF will leave buffer in fill state
        buffer.flip();
        // make sure the buffer is fully drained
        while (buffer.hasRemaining()) {
            dest.write(buffer);
        }
    }
}
Usage example with your InputStream and OutputStream:
// allocate the streams ... only for example
final InputStream input = new FileInputStream(inputFile);
final OutputStream output = new FileOutputStream(outputFile);

// get a channel from each stream
final ReadableByteChannel inputChannel = Channels.newChannel(input);
final WritableByteChannel outputChannel = Channels.newChannel(output);

// copy the channels
ChannelTools.fastChannelCopy(inputChannel, outputChannel);

// closing the channels also closes the underlying streams
inputChannel.close();
outputChannel.close();
There is also a more concise method documented in Wikipedia that achieves the same thing with less code:
// Getting file channels
FileChannel in = new FileInputStream(source).getChannel();
FileChannel out = new FileOutputStream(target).getChannel();
// JavaVM does its best to do this as native I/O operations.
in.transferTo(0, in.size(), out);
// Closing file channels will close corresponding stream objects as well.
out.close();
in.close();
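One caveat: transferTo() may copy fewer bytes than requested, so for large files the single call above is safer written as a loop (a sketch replacing the transferTo line, using the same channels):
long position = 0;
long size = in.size();
while (position < size) {
    position += in.transferTo(position, size - position, out);
}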
I developed a simple media library where you can choose a set of images and download them.
When a client requests a download, a servlet receives the blob keys to use for creating a zip file, and then a Task is launched for the procedure.
The task iterates through the received blob keys and zips the images into the archive. When the task has finished, a mail with the download link is sent to the user.
Here is my problem:
FileWriteChannel writeChannel = fileService.openWriteChannel(file, lock);
OutputStream blobOutputStream = Channels.newOutputStream(writeChannel);
ZipOutputStream zip = new ZipOutputStream(blobOutputStream);
A single channel can handle only BlobstoreService.MAX_BLOB_FETCH_SIZE bytes.
Because of that, I must open and close the channel for every 1 MB of data I have to write (there is the same issue for reading, but for reading I used this code and it works), or the write() method throws a NullPointerException.
Opening and closing the channel with a normal OutputStream does not present any issue, as in this code.
But when handling a zip file I also have to manage:
ZipOutputStream zipOut = new ZipOutputStream(blobOutputStream);
ZipEntry zipEntry = new ZipEntry(image_file_name);
zipOut.putNextEntry(zipEntry);
// while the image has bytes to write
zipOut.write(bytesToWrite);
After I have written 1 MB of data to the ZipEntry, I have to close the channel and open it again.
So here is the problem: when I open a new channel I can't access the previous ZipEntry I was writing, and then I cannot continue writing the next 1 MB of the image I'm processing.
And after I open a new channel, if I try to write to the ZipEntry object (without re-initializing it) I get a ClosedChannelException.
Here is the sample code I wrote; I know it is not working, but it explains what I am trying to do.
My question, then: how (if it is possible, of course) can I create a zip file writing 1 MB at a time?
I'm also open to other approaches; what I need is to zip some images into one archive and save it into the blobstore. If you have other ideas for doing this, please tell me.
You should create your own stream that manages the channels. When the blob size limit is reached, your stream closes the current channel and opens a new one.
Example for local files:
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class ZipChannels {

    public static void main(String[] args) throws IOException {
        File dirToZip = new File("target\\dependency");

        // create zip-files
        ChannelOutput out = new ChannelOutput();
        ZipOutputStream zip = new ZipOutputStream(out);
        int b = 0;
        for (File file : dirToZip.listFiles()) {
            ZipEntry zipEntry = new ZipEntry(file.getName());
            zip.putNextEntry(zipEntry);
            BufferedInputStream bis = new BufferedInputStream(new FileInputStream(file));
            while ((b = bis.read()) != -1) {
                zip.write(b);
            }
            bis.close();
            zip.closeEntry();
        }
        zip.close();

        // merge all into one file to check it
        BufferedOutputStream bos = new BufferedOutputStream(new FileOutputStream("package_all.zip"));
        for (int i = 0; i < out.getChannelCount(); i++) {
            BufferedInputStream bis = new BufferedInputStream(new FileInputStream("package_" + i + ".zip"));
            while ((b = bis.read()) != -1) {
                bos.write(b);
            }
            bis.close();
        }
        bos.close();
    }

    public static class ChannelOutput extends OutputStream {

        private OutputStream channel;
        private int count = 0;
        private static final int MAX = 1000000;

        @Override
        public void write(int b) throws IOException {
            if (count++ % MAX == 0) {
                openNewChannel();
            }
            channel.write(b);
        }

        protected void openNewChannel() throws IOException {
            if (channel != null) {
                channel.close();
            }
            channel = new BufferedOutputStream(new FileOutputStream("package_" + (count / MAX) + ".zip"));
        }

        public int getChannelCount() {
            return count / MAX + 1;
        }

        @Override
        public void close() throws IOException {
            channel.close();
        }

        @Override
        public void flush() throws IOException {
            channel.flush();
        }
    }
}
If you have any questions, please feel free to ask.
I have to write data to Amazon S3, and I write the data through an OutputStream that is piped to the InputStream S3 reads from, as follows:
final PipedOutputStream outputStream = new PipedOutputStream();
final PipedInputStream inputStream;
try {
    inputStream = new PipedInputStream(outputStream);
    new Thread(
        new Runnable() {
            @Override
            public void run() {
                PutObjectRequest putObjectRequest = new PutObjectRequest(
                        S3EnvironmentConfigurator.BucketTypes.source.name(),
                        getProposalName(uniqueId), inputStream, null);
                amazonS3Client.putObject(putObjectRequest);
                try {
                    inputStream.close();
                } catch (IOException e) {
                }
            }
        }
    ).start();
} catch (IOException e) {
    // handle pipe setup failure
}
Now amazonS3Client.putObject looks like
@Override
public PutObjectResult putObject(@Nonnull final PutObjectRequest putObjectRequest)
        throws AmazonClientException, AmazonServiceException {
    try {
        InputStreamReader is = new InputStreamReader(new GZIPInputStream(putObjectRequest.getInputStream()));
        StringBuilder sb = new StringBuilder();
        BufferedReader br = new BufferedReader(is);
        String read = br.readLine();
        while (read != null) {
            System.out.println(read);
            read = br.readLine();
        }
    } catch (IOException e) {
        // ignore
    }
    return super.putObject(putObjectRequest);
}
This currently prints out the content on console
Needed
How can I do something like
while (putObjectRequest.getInputStream() is not completely available) {
// wait
}
// write inputStream, the entire InputStream is ready and available for processing
Once the writing side closes its OutputStream, the InputStream will deliver any bytes that have not yet been read, and the next read() after that will return -1 to indicate end of input.
The value byte is returned as an int in the range 0 to 255. If no byte is available because the end of the stream has been reached, the value -1 is returned. This method blocks until input data is available, the end of the stream is detected, or an exception is thrown.
If the writing side fails to call close() when it's done, then the reader will block waiting for more input until the writing process exits.
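So the writing side should look roughly like this (a sketch; writePayload() is a hypothetical stand-in for whatever produces your data):
new Thread(new Runnable() {
    @Override
    public void run() {
        try {
            writePayload(outputStream); // write everything to the PipedOutputStream
        } catch (IOException e) {
            // handle or log
        } finally {
            try {
                outputStream.close(); // signals end-of-stream to the PipedInputStream reader
            } catch (IOException ignored) {
            }
        }
    }
}).start();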
You can copy all the data into a byte array and thus be sure that you've read all the data from the input stream, and then create a new input stream based on this buffer. Something like:
byte[] array = IOUtils.toByteArray(inputStream);
InputStream newInputStream = new ByteArrayInputStream(array);
IOUtils is from Apache Commons IO.
I am reading data from a file that has, unfortunately, two types of character encoding.
There is a header and a body. The header is always in ASCII and defines the character set that the body is encoded in.
The header is not fixed length and must be run through a parser to determine its content/length.
The file may also be quite large, so I need to avoid bringing the entire content into memory.
So I started off with a single InputStream. I wrap it initially with an InputStreamReader with ASCII and decode the header and extract the character set for the body. All good.
Then I create a new InputStreamReader with the correct character set, drop it over the same InputStream and start trying to read the body.
Unfortunately it appears (and the Javadoc confirms this) that InputStreamReader may read ahead for efficiency purposes. So reading the header consumes some or all of the body.
Does anyone have any suggestions for working around this issue? Would creating a CharsetDecoder manually and feeding in one byte at a time be a good idea (possibly wrapped in a custom Reader implementation)?
Thanks in advance.
EDIT: My final solution was to write an InputStreamReader that has no buffering, to ensure I can parse the header without consuming part of the body. Although this is not terribly efficient, I wrap the raw InputStream with a BufferedInputStream so it isn't an issue.
// An InputStreamReader that only consumes as many bytes as is necessary.
// It does not do any read-ahead.
public class InputStreamReaderUnbuffered extends Reader {

    private final CharsetDecoder charsetDecoder;
    private final InputStream inputStream;
    private final ByteBuffer byteBuffer = ByteBuffer.allocate(1);

    public InputStreamReaderUnbuffered(InputStream inputStream, Charset charset) {
        this.inputStream = inputStream;
        charsetDecoder = charset.newDecoder();
    }

    @Override
    public int read() throws IOException {
        boolean middleOfReading = false;

        while (true) {
            int b = inputStream.read();
            if (b == -1) {
                if (middleOfReading)
                    throw new IOException("Unexpected end of stream, byte truncated");
                return -1;
            }

            byteBuffer.clear();
            byteBuffer.put((byte) b);
            byteBuffer.flip();
            CharBuffer charBuffer = charsetDecoder.decode(byteBuffer);

            // although this is theoretically possible, it would violate
            // the unbuffered nature of this class, so we throw an exception
            if (charBuffer.length() > 1)
                throw new IOException("Decoded multiple characters from one byte!");

            if (charBuffer.length() == 1)
                return charBuffer.get();

            middleOfReading = true;
        }
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        for (int i = 0; i < len; i++) {
            int ch = read();
            if (ch == -1)
                return i == 0 ? -1 : i;
            cbuf[off + i] = (char) ch;
        }
        return len;
    }

    @Override
    public void close() throws IOException {
        inputStream.close();
    }
}
Why don't you use 2 InputStreams? One for reading the header and another for the body.
The second InputStream should skip the header bytes.
Here is the pseudo code.
1. Use an InputStream, but do not wrap a Reader around it.
2. Read the bytes containing the header and store them into a ByteArrayOutputStream.
3. Create a ByteArrayInputStream from the ByteArrayOutputStream and decode the header, this time wrapping the ByteArrayInputStream in a Reader with the ASCII charset.
4. Compute the length of the non-ASCII input, and read that number of bytes into another ByteArrayOutputStream.
5. Create another ByteArrayInputStream from the second ByteArrayOutputStream and wrap it with a Reader using the charset from the header.
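A rough sketch of this pseudo code; parseHeader(), isHeaderComplete() and the Header type (charset plus body length) are hypothetical stand-ins for your own header parser:
void readFile(InputStream in) throws IOException {
    // steps 1-3: buffer the raw header bytes, then decode them as ASCII
    ByteArrayOutputStream headerBytes = new ByteArrayOutputStream();
    int b;
    while ((b = in.read()) != -1) {
        headerBytes.write(b);
        if (isHeaderComplete(headerBytes.toByteArray())) { // hypothetical
            break;
        }
    }
    Header header = parseHeader(new InputStreamReader(       // hypothetical
            new ByteArrayInputStream(headerBytes.toByteArray()),
            StandardCharsets.US_ASCII));

    // steps 4-5: read exactly the body length and decode it with the header's charset
    byte[] body = new byte[header.bodyLength];
    new DataInputStream(in).readFully(body);
    Reader bodyReader = new InputStreamReader(new ByteArrayInputStream(body), header.charset);
    // ... process bodyReader
}
Note that buffering the whole body like this conflicts with the "too large for memory" constraint in the question; for a large body you would wrap the remaining InputStream in the second Reader directly instead of copying it.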
I suggest rereading the stream from the start with a new InputStreamReader. Perhaps assume that InputStream.mark is supported.
My first thought is to close the stream and reopen it, using InputStream#skip to skip past the header before giving the stream to the new InputStreamReader.
If you really, really don't want to reopen the file, you could use file descriptors to get more than one stream to the file, although you may have to use channels to have multiple positions within the file (since you can't assume you can reset the position with reset, it may not be supported).
It's even easier:
As you said, your header is always in ASCII. So read the header directly from the InputStream, and when you're done with it, create the Reader with the correct encoding and read from it
private Reader reader;
private InputStream stream;

public void read() throws IOException {
    int c = 0;
    while ((c = stream.read()) != -1) {
        // Read encoding from the header bytes here
        if (headerFullyRead) {
            reader = new InputStreamReader(stream, encoding);
            break;
        }
    }
    while ((c = reader.read()) != -1) {
        // Handle rest of file
    }
}
If you wrap the InputStream and limit all reads to just 1 byte at a time, it seems to disable the buffering inside of InputStreamReader.
This way we don't have to rewrite the InputStreamReader logic.
public class OneByteReadInputStream extends InputStream {

    private final InputStream inputStream;

    public OneByteReadInputStream(InputStream inputStream) {
        this.inputStream = inputStream;
    }

    @Override
    public int read() throws IOException {
        return inputStream.read();
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        return super.read(b, off, 1);
    }
}
To construct:
new InputStreamReader(new OneByteReadInputStream(inputStream));