We have a bunch of threads that each take a block of data, compress it, and eventually concatenate the results into one large byte array. If anyone can expand on this idea or recommend another method, that'd be awesome. I've currently got two methods that I'm trying out, but neither is working the way it should:
The first: I have each thread's run() function take the input data and just use GZIPOutputStream to compress it and write it to the buffer.
The problem with this approach is that each thread's block is only part of a longer piece of data, but when I call GZIPOutputStream it treats that little block as a complete piece of data to zip. That means it sticks a header and trailer onto every block (and since I also use a custom dictionary, I've got no idea how many bits the header is now, nor how to find out).
I think you could manually cut off the header and trailer and you would just be left with compressed data (keeping the header of the first block and the trailer of the last block). The other thing I'm not sure about with this method is whether I can even do that. If I leave the header on the first block of data, will it still decompress correctly? Doesn't that header contain information for ONLY the first block of the data and not the other concatenated blocks?
The second method is to use the Deflater class. In that case, I can simply set the input, set the dictionary, and then call deflate().
The problem is, that's not gzip format. That's just "raw" compressed data. I have no idea how to make it so that gzip can recognize the final output.
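For reference, the second method looks roughly like this (the names are mine, and I'm assuming the Deflater is created in nowrap mode, which is what produces the headerless "raw" output):

```java
import java.util.zip.Deflater;

// Rough sketch of compressing one thread's block with a preset dictionary.
// With nowrap = true the output is raw deflate data: no header, no trailer,
// no checksum.
public class RawBlockCompressor {
    public static byte[] compressBlock(byte[] block, byte[] dictionary) {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, true); // raw deflate
        deflater.setDictionary(dictionary);
        deflater.setInput(block);
        deflater.finish();

        byte[] buf = new byte[block.length + 64];   // generous output buffer for a sketch
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buf, total, buf.length - total);
            if (total == buf.length) {               // grow if the data happened to expand
                byte[] bigger = new byte[buf.length * 2];
                System.arraycopy(buf, 0, bigger, 0, total);
                buf = bigger;
            }
        }
        deflater.end();

        byte[] out = new byte[total];
        System.arraycopy(buf, 0, out, 0, total);
        return out;
    }
}
```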
You need a single method that writes to one shared GZIPOutputStream and is called by all the threads, with suitable co-ordination between them so the data doesn't get mixed up. Or else have the threads write to temporary files, and assemble and zip it all in a second phase.
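A minimal sketch of the first option, assuming the blocks can simply be handed over as byte arrays (all names here are made up):

```java
import java.io.*;
import java.util.*;
import java.util.concurrent.*;
import java.util.zip.GZIPOutputStream;

// The threads do their per-block work in parallel, but one shared
// GZIPOutputStream does the actual compression, draining the results in
// their original order so nothing gets interleaved.
public class OrderedGzipWriter {
    public static void compressBlocks(List<byte[]> blocks, OutputStream sink)
            throws IOException, InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<byte[]>> results = new ArrayList<>();

        for (byte[] block : blocks) {
            // Stand-in for whatever per-block work the threads do before
            // the data is handed to the shared stream.
            results.add(pool.submit(() -> transform(block)));
        }
        pool.shutdown();

        try (GZIPOutputStream gzip = new GZIPOutputStream(sink)) {
            for (Future<byte[]> f : results) {
                gzip.write(f.get());   // written strictly in submission order
            }
        }
    }

    private static byte[] transform(byte[] block) {
        return block;                  // placeholder for the real per-block processing
    }
}
```

Note that the compression itself is serialised on the single stream, so the parallelism only covers whatever per-block work happens before the write; if that isn't enough, the temporary-file approach keeps the threads fully independent.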
Related
We currently have some data log. The log is append-only, but on each append, the whole log is scanned from the beginning for some consistency checks (certain combinations of events trigger an alarm).
Now, we want to transform that log into a compressed log. Individual log entries are typically a few dozen bytes, so they won't compress well. However, the whole log stream does compress well, enough redundancy is present.
In theory, appending to the compressed stream should be easy, as the state of the compression encoder can be reconstructed while the log is scanned (and decompressed).
Our current way is to have a compressor with identical settings running during the scan and decompression phase, feeding it with the just decompressed data (assuming it will build the identical state).
However, we know that this is not optimal. We'd like to reuse the state which is built during decompression for the compression of the new data. So the question is: how can we implement the (de)compression in a way that we do not need to feed the decompressed data to a compressor to build the state, but can re-use the state of the decompressor to compress the new data we append?
(We need to do this in java, unfortunately, which limits the number of available APIs. Inclusion of free/open source 3rd party code is an option, however.)
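For illustration, our current approach looks roughly like this (the names are hypothetical, and it assumes the log is stored as a raw deflate stream that was sync-flushed after every entry and never "finished", so appended output stays byte-aligned and a decompressor keeps reading past the old end):

```java
import java.io.*;
import java.util.zip.*;

// Sketch of the "mirror compressor" approach described above.
public class CompressedLogAppender {

    public static void append(File log, byte[] newEntry) throws IOException {
        Deflater mirror = new Deflater(Deflater.DEFAULT_COMPRESSION, true); // raw deflate
        byte[] scratch = new byte[8192];

        // Scan phase: decompress the existing log, run the consistency checks,
        // and feed the same bytes to the mirror compressor so it rebuilds the
        // encoder state. Its output is discarded; only the state matters.
        try (InflaterInputStream in =
                 new InflaterInputStream(new FileInputStream(log), new Inflater(true))) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) > 0) {
                // ... consistency checks on buf[0..n) go here ...
                mirror.setInput(buf, 0, n);
                while (!mirror.needsInput()) {
                    mirror.deflate(scratch, 0, scratch.length, Deflater.SYNC_FLUSH);
                }
            }
        }

        // Append phase: compress only the new entry with the rebuilt state and
        // append the resulting bytes to the existing compressed file.
        try (FileOutputStream out = new FileOutputStream(log, true)) {
            mirror.setInput(newEntry);
            int n;
            do {
                n = mirror.deflate(scratch, 0, scratch.length, Deflater.SYNC_FLUSH);
                out.write(scratch, 0, n);
            } while (n == scratch.length);
        }
        mirror.end();
    }
}
```

The obvious waste here is all the deflate work done purely to rebuild the dictionary, which is exactly what we would like to avoid.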
You probably don't have the interfaces you need in Java, but this can be done with zlib. You could write your own Java interface to zlib to do this.
While scanning you would retain the last 32K of uncompressed data using a queue. You would scan the compressed file using Z_BLOCK in inflate(). That would stop at every deflate block. When you get to the last block, which is identified by the first bit of the block, you would save the uncompressed data of that block, as well as the 32K that preceded it that you were saving in the queue. You would also save the last bits in the previous block that did not complete a byte (0..7 bits). You would then add your new log entry to that last chunk of uncompressed data, and then recompress just that part, using the 32K that preceded it with deflateSetDictionary(). You would start the compression on a bit boundary with deflatePrime(). That would overwrite what was the last compressed block with new compressed block or blocks.
I'm having a problem with a new file format I'm being asked to implement at work.
Basically, the file is a text file which contains a bunch of headers with information about the data in UTF-8, and then the rest of the file is the numerical data in binary. I can write the data and read it back just fine, and I recently added the code to write the headers.
The problem is that I don't know how to read a file that contains both text and binary data. I want to be able to read in and deal with the header information (which is fairly extensive) and then be able to continue reading the binary data without having to re-iterate through the headers. Is this possible?
I am currently using a FileInputStream to read the binary data, but I don't know how to start it at the beginning of the data, rather than the beginning of the whole file. One of the FileInputStream's constructors takes a FileDescriptor as the argument and I think that's my answer, but I don't know how to get one from another file reading class. Am I approaching this correctly?
You can reposition a FileInputStream to any arbitrary point by getting its channel via getChannel() and calling position() on that channel.
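For example (the file name, header length and record layout are all invented; the point is only getChannel().position()):

```java
import java.io.*;

// Skip past a header of (assumed) known length and read the binary payload
// from the same FileInputStream by repositioning its channel.
public class MixedFileReader {
    public static void main(String[] args) throws IOException {
        try (FileInputStream in = new FileInputStream("measurements.dat")) {
            // ... read and parse the text headers from `in` here, possibly
            // through a buffered Reader that over-reads past the header ...
            long headerLength = 120;                   // assumed: known after parsing

            in.getChannel().position(headerLength);    // jump to the binary section
            DataInputStream data = new DataInputStream(new BufferedInputStream(in));
            double firstValue = data.readDouble();     // hypothetical record format
            System.out.println(firstValue);
        }
    }
}
```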
The one caveat is that this position affects all consumers of the stream. It is not suitable if you have different threads (for example) reading from different parts of the same file. In that case, create a separate FileInputStream for each consumer.
Also, this technique only works for file streams, because the underlying file can be randomly accessed. There is no equivalent for sockets, or named pipes, or anything else that is actually a stream.
So, here is the situation:
I have to read big .gz archives (GBs) and "index" them so that I can later retrieve specific pieces using random access.
In other words, I wish to read the archive line by line, and be able to get the specific location in the file for any such line. (so that I can jump directly to these specific locations upon request). (PS: ...and it's UTF-8 so we cannot assume 1 byte == 1 char.)
So, basically, what I just need is a BufferedReader which keeps track of its location in the file. However, this doesn't seem to exist.
Is there anything available or do I have to roll my own?
A few additional comments:
I cannot use BufferedReader directly since the file location corresponds to what has been buffered so far. In other words, a multiple of the internal buffer size instead of the line location.
I cannot use InputStreamReader directly for performance reasons. Unbuffered would be way too slow, and, btw, it lacks convenience methods to read lines.
I cannot use RandomAccessFile since 1. it's zipped, and 2. RandomAccessFile uses "modified" UTF-8
I guess the best would be to use a kind of buffered reader that keeps track of the file location and buffer offset... but this sounds quite cumbersome. But maybe I missed something. Perhaps there is already something existing to do that: read files line by line and keep track of the location (even if zipped).
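Something like the sketch below is what I have in mind (all names are mine). Note that the offsets it records are positions in the decompressed stream, and that splitting on '\n' at the byte level is safe in UTF-8 because 0x0A never occurs inside a multi-byte sequence:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

// Counts bytes as they are consumed from the decompressed stream, so the
// exact start offset of every line can be recorded before decoding it.
public class OffsetTrackingLineReader implements Closeable {
    private final InputStream in;
    private long position = 0;   // bytes consumed so far = offset of the next line

    public OffsetTrackingLineReader(InputStream decompressed) {
        this.in = new BufferedInputStream(decompressed);
    }

    /** Offset (in decompressed bytes) at which the next readLine() will start. */
    public long getPosition() { return position; }

    /** Reads one '\n'-terminated line as UTF-8, or returns null at end of stream. */
    public String readLine() throws IOException {
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {
            position++;
            if (b == '\n') break;
            line.write(b);
        }
        if (b == -1 && line.size() == 0) return null;
        return new String(line.toByteArray(), StandardCharsets.UTF_8);
    }

    @Override public void close() throws IOException { in.close(); }

    public static void main(String[] args) throws IOException {
        try (OffsetTrackingLineReader r = new OffsetTrackingLineReader(
                new GZIPInputStream(new FileInputStream("big-archive.gz")))) {  // invented name
            long offset = r.getPosition();            // offset of the line about to be read
            for (String line; (line = r.readLine()) != null; offset = r.getPosition()) {
                System.out.println(offset + ": " + line);   // record per-line offsets here
            }
        }
    }
}
```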
Thanks for tips,
Arnaud
I think jzran could be pretty much what you're looking for:
It's a Java library based on the zran.c sample from zlib. You can preprocess a large gzip archive, producing an "index" that can be used for random read access. You can balance between index size and access speed.
What you are looking for is called mark(), markSupported() and skip().
These methods are declared both in InputStream and Reader, so you are welcome to use them.
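For illustration (the file name is invented, and the mark is only valid while no more than the given read limit has been consumed since mark() was called):

```java
import java.io.*;
import java.util.zip.GZIPInputStream;

// Basic use of mark()/markSupported()/skip() on the decompressed stream.
public class MarkSkipExample {
    public static void main(String[] args) throws IOException {
        try (InputStream in = new BufferedInputStream(
                new GZIPInputStream(new FileInputStream("archive.gz")))) {
            if (in.markSupported()) {
                in.mark(1 << 20);        // remember this point (up to 1 MB of read-ahead)
                in.skip(4096);           // move forward in the decompressed data
                // ... inspect the data here ...
                in.reset();              // jump back to the marked point
            }
        }
    }
}
```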
GZIP compression does not support seeking. Previous data blocks are needed to build compression tables...
I could use some hints or tips for a decent interface for reading files with special characteristics.
The files in question have a header (~120 bytes), a body (1 byte to 3 GB) and a footer (4 bytes).
The header contains information about the body and the footer is only a simple CRC32-value of the body.
I use Java, so my idea was to extend the InputStream class and add a constructor such as "public MyInStream(InputStream in)" where I immediately read the header and then direct the overridden read()s to the body.
Problem is, I can't give the user of the class the CRC32-value until the whole body has been read.
Because the file can be 3 GB large, putting it all in memory is a bad idea.
Reading it all in to a temporary file is going to be a performance hit if there are many small files.
I don't know how large the file is because the InputStream doesn't have to be a file, it could be a socket.
Looking at it again, maybe extending InputStream is a bad idea.
Thank you for reading the confused thoughts of a tired programmer. :)
"Looking at it again, maybe extending InputStream is a bad idea."
If users of the class need to access the body as a stream, it's IMO not a bad choice. Java's ObjectOutput/InputStream works like this.
"I don't know how large the file is because the InputStream doesn't have to be a file, it could be a socket."
Um, then your problem is not with the choice of Java class, but with the design of the file format. If you can't change the format, there isn't really anything you can do to make the data at the end of the file available before all of it is read.
But perhaps you could encapsulate the processing of the checksum completely? Presumably it's a checksum for the body, so your class could always "read ahead" 4 bytes to see when the file ends, withhold those last 4 bytes from the client instead of returning them as part of the body, and compare them with a CRC computed while reading the body, throwing an exception when they do not match.
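A rough sketch of that idea, assuming the header has already been consumed from the underlying stream and that the footer is a big-endian CRC32 (class name and details are mine):

```java
import java.io.*;
import java.util.zip.CRC32;

// Always holds back the last 4 bytes seen, returns everything before them as
// the body, updates a CRC32 over the body, and on end of stream compares the
// held-back 4 bytes (the footer) against the computed checksum.
public class CrcCheckingInputStream extends InputStream {
    private final InputStream in;
    private final CRC32 crc = new CRC32();
    private final int[] tail = new int[4]; // the 4-byte look-ahead
    private boolean eof = false;

    public CrcCheckingInputStream(InputStream in) throws IOException {
        this.in = in;
        for (int i = 0; i < 4; i++) {      // prime the look-ahead
            int b = in.read();
            if (b == -1) throw new EOFException("stream shorter than its 4-byte footer");
            tail[i] = b;
        }
    }

    @Override
    public int read() throws IOException {
        if (eof) return -1;
        int next = in.read();
        if (next == -1) {                  // tail[] now holds exactly the footer
            eof = true;
            long stored = ((long) tail[0] << 24) | (tail[1] << 16) | (tail[2] << 8) | tail[3];
            if (stored != crc.getValue()) throw new IOException("CRC32 mismatch");
            return -1;
        }
        int out = tail[0];                 // oldest look-ahead byte belongs to the body
        tail[0] = tail[1]; tail[1] = tail[2]; tail[2] = tail[3]; tail[3] = next;
        crc.update(out);
        return out;
    }
}
```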
I am using DeflaterOutputStream to compress data as a part of a proprietary archive file format. I'm then using jcraft zlib code to decompress that data on the other end. The other end is a J2ME application, hence my reliance on third party zip decompression code and not the standard Java libraries.
My problem is that some files zip and unzip just fine, and others do not.
For the ones that do not, the compression method in the first byte of the data seems to be '5'.
From my reading up on zlib, I understand that a default value of '8' indicates the default deflate compression method. Any other value appears to be unacceptable to the decompressor.
What I'd like to know is:
What does '5' indicate?
Why does DeflaterOutputStream use different compression methods some of the time?
Can I stop it from doing that somehow?
Is there another way to generate deflated data that uses only the default compression method?
It might help to pin down exactly what you're looking at.
Before the whole of your data, there's usually a two-byte ZLIB header. As far as I'm aware, the lower 4 bits of the first byte of these should ALWAYS be 8. If you initialise your Deflater in nowrap mode, then you won't get these two bytes at all (though your other library must then be expecting not to get them).
Then, before each individual block of data, there's a 3-bit block header (notice, defined as a number of bits, not a whole number of bytes). Conceivably, you could have a block starting with byte 5, which would indicate a compressed block that is the final block, or with byte 8, which would be a non-compressed, non-final block.
When you create your DeflaterOutputStream, you can pass in a Deflater of your choosing to the constructor, and on that Deflater there are some options you can set. The level is essentially the amount of look-ahead that the compression uses when looking for repeated patterns in the data; on the off chance, you might try setting this to a non-default value and see if it makes any difference to whether your decompressor can cope.
The strategy setting (see the setStrategy() method) can be used in some special circumstances to tell the deflater to only apply huffman compression. This can occasionally be useful in cases where you have already transformed your data so that frequencies of values are near negative powers of 2 (i.e. the distribution that Huffman coding works best on). I wouldn't expect this setting to affect whether a library can read your data, but juuust on the offchance, you might just try changing this setting.
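For what it's worth, both of these settings (plus the nowrap flag from above) are just applied to the Deflater you hand to the DeflaterOutputStream constructor; a minimal example, with invented names:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

// The second constructor argument is the "nowrap" flag mentioned earlier:
// false keeps the two-byte ZLIB header and trailing checksum, true omits them.
public class DeflaterConfigExample {
    public static void main(String[] args) throws IOException {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION, false);
        deflater.setStrategy(Deflater.HUFFMAN_ONLY);   // Huffman coding only, no string matching

        try (DeflaterOutputStream out =
                 new DeflaterOutputStream(new FileOutputStream("payload.bin"), deflater)) {
            out.write("example data".getBytes(StandardCharsets.UTF_8));
        } finally {
            deflater.end();   // release the native zlib resources
        }
    }
}
```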
In case it's helpful, I've written a little bit about configuring Deflater, including the use of Huffman-only compression on transformed data. I must admit, whatever options you choose, I'd really expect your library to be able to read the data. If you're really sure your compressed data is correct (i.e. ZLIB/Inflater can re-read your file), then you might consider just using another library...!
Oh, and stating the bleeding obvious, but I'll mention it anyway: if your data is fixed you can of course just stick it in the jar and it'll effectively be deflated/inflated "for free". Ironically, your J2ME device MUST be able to decode zlib-compressed data, because that's essentially the format the jar is in...