We currently have a data log. The log is append-only, but on each append the whole log is scanned from the beginning for consistency checks (certain combinations of events trigger an alarm).
Now we want to turn that log into a compressed log. Individual log entries are typically a few dozen bytes, so on their own they won't compress well. The log stream as a whole, however, compresses well; there is plenty of redundancy.
In theory, appending to the compressed stream should be easy, since the state of the compression encoder can be reconstructed while the log is scanned (and decompressed).
Our current approach is to run a compressor with identical settings alongside the scan/decompression phase, feeding it the freshly decompressed data (on the assumption that it builds up identical state).
However, we know this is not optimal. We would like to reuse the state that is built during decompression to compress the new data. So the question is: how can we implement the (de)compression in a way that we do not need to feed the decompressed data back into a compressor just to rebuild its state, but can instead reuse the decompressor's state to compress the data we append?
(We need to do this in Java, unfortunately, which limits the available APIs. Including free/open-source third-party code is an option, however.)
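For illustration, our current approach boils down to something like the following sketch. The helper names are made up, and it assumes the log was originally written with the same Deflater settings and a SYNC_FLUSH after every entry, so the shadow compressor ends up in exactly the state the original writer had:

    import java.util.zip.DataFormatException;
    import java.util.zip.Deflater;
    import java.util.zip.Inflater;

    public class AppendByShadowCompressor {
        // Sketch only: scanForAlarms and appendToLogFile are assumed helpers, not real APIs.
        static void append(byte[] compressedLog, byte[] newEntry) throws DataFormatException {
            Inflater inflater = new Inflater();
            Deflater shadow = new Deflater(Deflater.DEFAULT_COMPRESSION);

            inflater.setInput(compressedLog);
            byte[] plain = new byte[8192];
            byte[] scratch = new byte[8192];
            int n;
            while ((n = inflater.inflate(plain)) > 0) {
                scanForAlarms(plain, n);              // assumed: the existing consistency check
                shadow.setInput(plain, 0, n);         // feed the decompressed bytes to a
                while (!shadow.needsInput()) {        // compressor with identical settings;
                    shadow.deflate(scratch);          // its output is discarded, only the
                }                                     // internal state matters
            }

            // Compress only the new entry, flush, and append that output to the file.
            // (One call is enough here because entries are only a few dozen bytes.)
            shadow.setInput(newEntry);
            int produced = shadow.deflate(scratch, 0, scratch.length, Deflater.SYNC_FLUSH);
            appendToLogFile(scratch, 0, produced);    // assumed helper
        }

        static void scanForAlarms(byte[] data, int len) { /* assumed */ }
        static void appendToLogFile(byte[] data, int off, int len) { /* assumed */ }
    }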
You probably don't have the interfaces you need in Java, but this can be done with zlib. You could write your own Java interface to zlib to do this.
While scanning, you would retain the last 32K of uncompressed data in a queue. You would scan the compressed file using Z_BLOCK in inflate(), which stops at every deflate block. When you reach the last block, identified by the first bit of the block, you would save the uncompressed data of that block, the 32K that preceded it (which you were keeping in the queue), and the last bits of the previous block that did not complete a byte (0..7 bits).

You would then append your new log entry to that last chunk of uncompressed data and recompress just that part, supplying the preceding 32K with deflateSetDictionary() and starting the compression on the correct bit boundary with deflatePrime(). That would overwrite what was the last compressed block with a new compressed block or blocks.
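In outline, and only as a sketch: java.util.zip exposes none of the bit-level calls above, so ZlibNative below stands in for a hypothetical thin JNI wrapper around the zlib functions just named (inflate with Z_BLOCK, deflateSetDictionary(), deflatePrime()). All helper names are invented; treat this as structured pseudocode.

    // Hypothetical JNI wrapper; nothing in the standard library provides these calls.
    ZlibNative zn = ZlibNative.openRawInflate(compressedLogFile);

    byte[] window = new byte[32 * 1024];          // last 32K of uncompressed data (the queue)
    long lastBlockOffset = -1;                    // byte offset of the final block in the file
    int priorBits = 0, priorBitCount = 0;         // 0..7 leftover bits from the previous block
    byte[] lastBlockPlain = null;                 // uncompressed content of the final block

    while (zn.inflateNextBlock()) {               // inflate() with Z_BLOCK: stops per block
        byte[] plain = zn.currentBlockPlain();
        if (zn.currentBlockIsFinal()) {           // first bit of the block (BFINAL) is set
            lastBlockOffset = zn.currentBlockByteOffset();
            priorBits       = zn.bitsBeforeCurrentBlock();
            priorBitCount   = zn.bitCountBeforeCurrentBlock();
            lastBlockPlain  = plain;
        } else {
            slideInto(window, plain);             // assumed helper: keep only the last 32K
        }
    }

    // Append the new entry to the plain text of the old final block ...
    byte[] tailPlain = concat(lastBlockPlain, newLogEntry);   // assumed helper

    // ... and recompress just that tail so it continues the existing bit stream.
    ZlibNative def = ZlibNative.openRawDeflate();
    def.deflateSetDictionary(window);             // the 32K that preceded the final block
    def.deflatePrime(priorBitCount, priorBits);   // start on the same bit boundary
    byte[] newTail = def.deflateToEnd(tailPlain);

    // Overwrite the old final block (and anything after it) in place.
    try (RandomAccessFile f = new RandomAccessFile(compressedLogFile, "rw")) {
        f.seek(lastBlockOffset);
        f.write(newTail);
        f.setLength(f.getFilePointer());          // a gzip/zlib trailer, if present, would
    }                                             // also need to be regenerated here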
I am implementing a memory-mapped stream in Java using the java.nio package. It maps a chunk of memory to the output file, and when more elements need to be written it maps another chunk, and so on.
w_buffer = channel.map(FileChannel.MapMode.READ_WRITE, w_cycle*bufferSize, bufferSize);
My implementation works smoothly when the total number of elements written to the file is a multiple of the chunk size that is mapped at a time. However, when this is not the case (and it frequently will not be, since a stream cannot know when the user will stop writing), the remaining space in the mapped chunk is also dumped to the file in the form of trailing zeros. How should I avoid these trailing zeros in the output file?
Thanks in advance.
You can truncate a channel to a given size using channel.truncate(size), but it is doubtful that this works portably in combination with channel.map(). Much of channel.map's behaviour depends on the underlying operating system, and it is generally not a good idea to mix "file access" and "memory-mapped access" on the same file. Another solution would be to store a "size used" value at the beginning of each chunk.
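If you go the truncate route, the key is simply to remember how many bytes you actually wrote. A rough sketch (the class name is invented, and the portability caveat above still applies while a mapping is live):

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    // Sketch: write through mapped chunks, remember the real logical size,
    // and cut the file back to that size when closing.
    public final class MappedChunkWriter implements AutoCloseable {
        private final RandomAccessFile file;
        private final FileChannel channel;
        private final int chunkSize;
        private MappedByteBuffer buffer;
        private long written;          // bytes actually written so far
        private int cycle;             // index of the currently mapped chunk

        public MappedChunkWriter(String path, int chunkSize) throws IOException {
            this.file = new RandomAccessFile(path, "rw");
            this.channel = file.getChannel();
            this.chunkSize = chunkSize;
            mapNext();
        }

        private void mapNext() throws IOException {
            buffer = channel.map(FileChannel.MapMode.READ_WRITE,
                                 (long) cycle * chunkSize, chunkSize);
            cycle++;
        }

        public void write(byte b) throws IOException {
            if (!buffer.hasRemaining()) {
                mapNext();             // same pattern as in the question
            }
            buffer.put(b);
            written++;
        }

        @Override
        public void close() throws IOException {
            buffer.force();            // flush the last chunk
            buffer = null;             // drop the reference; actual unmapping is up to the GC
            channel.truncate(written); // cut off the zero-filled tail (may not work on every
                                       // platform while the mapping is still live)
            channel.close();
            file.close();
        }
    }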
I know the Java libraries pretty well, so I was surprised when I realized that, apparently, there's no easy way to do something seemingly simple with a stream. I'm trying to read an HTTP request containing multipart form data (large, multiline tokens separated by delimiters that look like, for example, ------WebKitFormBoundary5GlahTkFmhDfanAn--), and I want to read until I encounter a part of the request with a given name, and then return an InputStream of that part.
I'm fine with just reading the stream into memory and returning a ByteArrayInputStream, because the files submitted should never be larger than 1MB. However, I want to make sure that the reading method throws an exception if the file is larger than 1MB, so that excessively-large files don't fill up the JVM's memory and crash the server. The file data may be binary, so that rules out BufferedReader.readLine() (it drops newlines, which could be any of \r, \n, or \r\n, resulting in loss of data).
All of the obvious tokenizing solutions, such as Scanner, read the tokens as Strings, not streams, which could cause OutOfMemoryErrors for large files--exactly what I'm trying to avoid. As far as I can tell, there's no equivalent to Scanner that returns each token as an InputStream without reading it into memory. Is there something I'm missing, or is there any way to create something like that myself, using just the standard Java libraries (no Apache Commons, etc.), that doesn't require me to read the stream a character at a time and write all of the token-scanning code myself?
Addendum: Shortly before posting this, I realized that the obvious solution to my original problem was simply to read the full request body into memory, failing if it's too large, and then to tokenize the resulting ByteArrayInputStream with a Scanner. This is inefficient, but it works. However, I'm still interested to know if there's a way to tokenize an InputStream into sub-streams, without reading them into memory, without using extra libraries, and without resorting to character-by-character processing.
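For completeness, the capped read boils down to a few lines, something like this (the class name and limit handling are just illustrative):

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;

    public class CappedReader {
        /** Reads the whole stream into memory, failing fast once it exceeds maxBytes. */
        public static ByteArrayInputStream readCapped(InputStream in, int maxBytes)
                throws IOException {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                if (out.size() + n > maxBytes) {
                    throw new IOException("Request body exceeds " + maxBytes + " bytes");
                }
                out.write(chunk, 0, n);
            }
            return new ByteArrayInputStream(out.toByteArray()); // hand this to the Scanner
        }
    }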
It's not possible without loading the data into memory (the solution you don't want) or saving it to disk (which becomes I/O heavy). Tokenizing the stream into separate streams without loading it into memory implies that you can read the stream once (to tokenize it) and then read the same bytes again later. In short, what you want is impossible unless the stream is seekable, and seekable streams are generally specialized I/O objects for very specific applications, like RandomAccessFile.
We have a bunch of threads that each take a block of data, compress it, and eventually concatenate the results into one large byte array. If anyone can expand on this idea or recommend another method, that'd be awesome. I've currently got two methods that I'm trying out, but neither is working the way it should:
The first: I have each thread's run() function take the input data and just use GZIPOutputStream to compress it and write it to the buffer.
The problem with this approach is that, because each thread has only one block of data, which is part of a longer complete stream, GZIPOutputStream treats that little block as a complete piece of data to zip. That means it sticks on a header and trailer (I also use a custom dictionary, so I have no idea how many bits the header is now, nor how to find out).
I think you could manually cut off the header and trailer and be left with just the compressed data (keeping the header of the first block and the trailer of the last block). The other thing I'm not sure about with this method is whether I can even do that. If I leave the header on the first block of data, will it still decompress correctly? Doesn't that header contain information for ONLY the first block of the data and not the other concatenated blocks?
The second method is to use the Deflater class. In that case, I can simply set the input, set the dictionary, and then call deflate().
The problem is, that's not gzip format. That's just "raw" compressed data. I have no idea how to make it so that gzip can recognize the final output.
You need a method that writes to a single GZIPOutputStream that is called by the other threads, with suitable co-ordination between them so the data doesn't get mixed up. Or else have the threads write to temporary files, and assemble and zip it all in a second phase.
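A sketch of the first option, with sequence numbers providing the co-ordination (names are made up; note that in this variant the compression itself happens serially inside the shared stream):

    import java.io.IOException;
    import java.io.OutputStream;
    import java.util.TreeMap;
    import java.util.zip.GZIPOutputStream;

    // All threads funnel their blocks through this single writer; blocks carry a
    // sequence number so they are written to the one GZIPOutputStream in order.
    public final class OrderedGzipWriter implements AutoCloseable {
        private final GZIPOutputStream gzip;
        private final TreeMap<Integer, byte[]> pending = new TreeMap<>();
        private int nextToWrite = 0;

        public OrderedGzipWriter(OutputStream out) throws IOException {
            this.gzip = new GZIPOutputStream(out);
        }

        public synchronized void submit(int sequence, byte[] block) throws IOException {
            pending.put(sequence, block);
            // Drain every block that is now contiguous with what was already written.
            while (!pending.isEmpty() && pending.firstKey() == nextToWrite) {
                gzip.write(pending.pollFirstEntry().getValue());
                nextToWrite++;
            }
        }

        @Override
        public synchronized void close() throws IOException {
            gzip.finish();
            gzip.close();
        }
    }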
So, here is the situation:
I have to read big .gz archives (GBs) and kind of "index" them so that later on I can retrieve specific pieces using random access.
In other words, I wish to read the archive line by line, and be able to get the specific location in the file for any such line. (so that I can jump directly to these specific locations upon request). (PS: ...and it's UTF-8 so we cannot assume 1 byte == 1 char.)
So, basically, what I just need is a BufferedReader which keeps track of its location in the file. However, this doesn't seem to exist.
Is there anything available or do I have to roll my own?
A few additional comments:
I cannot use BufferedReader directly, since the file location it reflects corresponds to what has been buffered so far. In other words, it is a multiple of the internal buffer size rather than the position of the line.
I cannot use InputStreamReader directly for performance reasons. Unbuffered, it would be way too slow, and, by the way, it lacks convenience methods to read lines.
I cannot use RandomAccessFile since 1. it's zipped, and 2. RandomAccessFile uses "modified" UTF-8
I guess the best option would be to use some kind of buffered reader that keeps track of the file location and buffer offset... but this sounds quite cumbersome. Maybe I missed something, though. Perhaps something already exists to read files line by line while keeping track of the location (even if zipped).
Thanks for tips,
Arnaud
I think jzran could be pretty much what you're looking for:
It's a Java library based on the zran.c sample from zlib. You can preprocess a large gzip archive, producing an "index" that can be used for random read access. You can balance between index size and access speed.
What you are looking for is called mark(), markSupported() and skip().
These methods are declared in both InputStream and Reader, so you are welcome to use them.
GZIP compression does not support seeking. Previous data blocks are needed to build compression tables...
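If you do end up rolling your own, the offset-tracking part is not much code. A sketch, with invented names; note these are offsets into the uncompressed data, so jumping back to one still means re-inflating up to it, or using an index of the kind jzran builds:

    import java.io.BufferedInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.FileInputStream;
    import java.io.FilterInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPInputStream;

    /** Counts the (uncompressed) bytes handed out by the wrapped stream. */
    class CountingInputStream extends FilterInputStream {
        private long position;

        CountingInputStream(InputStream in) { super(in); }

        long position() { return position; }

        @Override public int read() throws IOException {
            int b = super.read();
            if (b >= 0) position++;
            return b;
        }

        @Override public int read(byte[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            if (n > 0) position += n;
            return n;
        }
    }

    class GzipLineIndexer {
        /** Reads lines at the byte level so multi-byte UTF-8 characters cannot skew offsets. */
        static void index(String gzPath) throws IOException {
            try (CountingInputStream in = new CountingInputStream(
                     new BufferedInputStream(new GZIPInputStream(new FileInputStream(gzPath))))) {
                ByteArrayOutputStream line = new ByteArrayOutputStream();
                long lineStart = in.position();
                int b;
                while ((b = in.read()) != -1) {
                    if (b == '\n') {
                        String text = new String(line.toByteArray(), StandardCharsets.UTF_8);
                        recordOffset(lineStart, text);   // assumed: store the index entry
                        line.reset();
                        lineStart = in.position();
                    } else {
                        line.write(b);
                    }
                }
            }
        }

        static void recordOffset(long offset, String line) { /* assumed */ }
    }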
I am using DeflaterOutputStream to compress data as part of a proprietary archive file format. I'm then using JCraft's zlib code (JZlib) to decompress that data on the other end. The other end is a J2ME application, hence my reliance on third-party decompression code rather than the standard Java libraries.
My problem is that some files zip and unzip just fine, and others do not.
For the ones that do not, the compression method in the first byte of the data seems to be '5'.
From my reading up on zlib, I understand that a value of '8' indicates the default deflate compression method, and that any other value appears to be unacceptable to the decompressor.
What I'd like to know is:
What does '5' indicate?
Why does DeflaterOutputStream use different compression methods some of the time?
Can I stop it from doing that somehow?
Is there another way to generate deflated data that uses only the default compression method?
It might help to hone down exactly what you're looking at.
Before the whole of your data, there's usually a two-byte ZLIB header. As far as I'm aware, the lower 4 bits of the first byte of these should ALWAYS be 8. If you initialise your Deflater in nowrap mode, then you won't get these two bytes at all (though your other library must then be expecting not to get them).
Then, before each individual block of data, there's a 3-bit block header (notice, defined as a number of bits, not a whole number of bytes). Conceivably, you could have a block starting with byte 5, which would indicate a compressed block that is the final block, or with byte 8, which would be a non-compressed, non-final block.
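If it helps to look at concrete bytes, here is a small self-contained check (per RFC 1950/1951: the low nibble of the first header byte is the compression method, and a raw deflate block starts with BFINAL in bit 0 and BTYPE in bits 1-2). Note that if you call setDictionary(), the FDICT flag is set and a 4-byte dictionary ID follows the two-byte header, so the first block then starts at offset 6 rather than 2:

    import java.io.ByteArrayOutputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.Deflater;
    import java.util.zip.DeflaterOutputStream;

    public class HeaderPeek {
        public static void main(String[] args) throws Exception {
            byte[] sample = "the quick brown fox jumps over the lazy dog"
                    .getBytes(StandardCharsets.UTF_8);

            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (DeflaterOutputStream dos =
                     new DeflaterOutputStream(bos, new Deflater())) {  // default: ZLIB-wrapped
                dos.write(sample);
            }
            byte[] out = bos.toByteArray();

            int cmf = out[0] & 0xFF;                                   // first header byte
            int flg = out[1] & 0xFF;                                   // second header byte
            System.out.println("compression method: " + (cmf & 0x0F)); // 8 = deflate
            System.out.println("header check ok:    " + (((cmf << 8) | flg) % 31 == 0));

            int first = out[2] & 0xFF;                                 // first deflate block
            System.out.println("BFINAL: " + (first & 1));
            System.out.println("BTYPE:  " + ((first >> 1) & 3));       // 0=stored, 1=fixed, 2=dynamic
        }
    }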
When you create your DeflaterOutputStream, you can pass a Deflater of your choosing to the constructor, and on that Deflater there are some options you can set. The level is essentially the amount of look-ahead the compression uses when looking for repeated patterns in the data; on the off-chance, you might try setting it to a non-default value and see if it makes any difference to whether your decompressor can cope.
The strategy setting (see the setStrategy() method) can be used in some special circumstances to tell the deflater to apply only Huffman compression. This can occasionally be useful in cases where you have already transformed your data so that the frequencies of values are near negative powers of 2 (i.e. the distribution that Huffman coding works best on). I wouldn't expect this setting to affect whether a library can read your data, but just on the off-chance, you might try changing it too.
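For reference, all of those knobs are plain setters on Deflater. A quick sketch (whether any of them changes what your decompression library accepts is, as said, doubtful):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.util.zip.Deflater;
    import java.util.zip.DeflaterOutputStream;

    public class DeflaterConfig {
        // e.g. compress(data, myDictionary, Deflater.DEFAULT_COMPRESSION, Deflater.HUFFMAN_ONLY, false)
        static byte[] compress(byte[] data, byte[] dictionary,
                               int level, int strategy, boolean nowrap) throws IOException {
            Deflater deflater = new Deflater(level, nowrap); // nowrap=true drops the 2-byte
                                                             // ZLIB header and Adler-32 trailer
            deflater.setStrategy(strategy);                  // e.g. Deflater.HUFFMAN_ONLY
            if (dictionary != null) {
                deflater.setDictionary(dictionary);
            }
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (DeflaterOutputStream out = new DeflaterOutputStream(bos, deflater)) {
                out.write(data);
            }
            deflater.end();
            return bos.toByteArray();
        }
    }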
In case it's helpful, I've written a little bit about configuring Deflater, including the use of Huffman-only compression on transformed data. I must admit, whatever options you choose, I'd really expect your library to be able to read the data. If you're really sure your compressed data is correct (i.e. ZLIB/Inflater can re-read your file), then you might consider just using another library...!
Oh, and stating the bleeding obvious, but I'll mention it anyway: if your data is fixed, you can of course just stick it in the jar and it'll effectively be deflated/inflated "for free". Ironically, your J2ME device MUST be able to decode deflate-compressed data, because that's essentially the format the jar is in...