Memory-mapped output stream produces an output file with trailing zeros - java

I am implementing a memory-mapped output stream in Java using the java.nio package. It maps a chunk of memory to the output file, and when more elements need to be written it maps another chunk, and so on.
w_buffer = channel.map(FileChannel.MapMode.READ_WRITE, w_cycle*bufferSize, bufferSize);
My implementation works smoothly when the total number of elements written to the file is a multiple of the chunk size that is mapped at a time. However, when this is not the case (and it frequently will not be, since a stream cannot know in advance when the user will stop writing), the remaining space in the mapped chunk is also dumped to the file in the form of trailing zeros. How should I avoid these trailing zeros in the output file?
Thanks in advance.

You can truncate a channel to a given size using channel.truncate(size), but it is doubtful that this would work portably in combination with channel.map(). Much of channel.map()'s behaviour depends on the underlying operating system, and it is generally not a good idea to mix file access and memory-mapped access on the same file. Another solution could be to store a "size used" value at the beginning of each chunk.
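If you do try the truncate() route, the idea is to track how many bytes have actually been written and cut the file back to that length when the stream is closed. Below is a minimal sketch reusing the w_buffer/w_cycle names from the question (class and method names are otherwise illustrative, and whether truncate() succeeds while a mapping is still live is platform-dependent, so treat it as an assumption to verify on your target OS):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedOutput implements AutoCloseable {
    private final RandomAccessFile file;
    private final FileChannel channel;
    private final int bufferSize;
    private MappedByteBuffer w_buffer;
    private long w_cycle = 0;

    public MappedOutput(String path, int bufferSize) throws IOException {
        this.file = new RandomAccessFile(path, "rw");
        this.channel = file.getChannel();
        this.bufferSize = bufferSize;
        mapNextChunk();
    }

    private void mapNextChunk() throws IOException {
        w_buffer = channel.map(FileChannel.MapMode.READ_WRITE, w_cycle * bufferSize, bufferSize);
        w_cycle++;
    }

    public void put(byte b) throws IOException {
        if (!w_buffer.hasRemaining()) {
            mapNextChunk();
        }
        w_buffer.put(b);
    }

    @Override
    public void close() throws IOException {
        // Bytes actually written: all previous full chunks plus the used part of the current one.
        long used = (w_cycle - 1) * (long) bufferSize + w_buffer.position();
        w_buffer.force();
        // Cut away the unused tail of the last mapping; this removes the trailing zeros.
        // Note: on some platforms (notably Windows) truncating a file that still has an
        // active mapping can fail, so this needs testing where portability matters.
        channel.truncate(used);
        channel.close();
        file.close();
    }
}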

Related

Java - possible to modify and parse gzipped xml files without unzipping?

I have an arraylist of gzipped xml files. Is it possible to view and manipulate the contents of these xml files all without unzipping them and taking up disk space? If so, what would be the correct class(es) to use for this task?
I know I can create a gzipinputstream from a fileinputstream of the zip file but from there I'm not sure what to do. I have only this written:
GZIPInputStream in = new GZIPInputStream(new FileInputStream(zippedFiles.get(i)));
I need some way to parse text within the xml files and modify the xml itself but again, extracting all of them would take up too much disk space.
What exactly are you trying to achieve? You can extract the file into memory using a ByteArrayOutputStream and convert it into a byte array that you pass to your XML parser library (converting it to a String and passing that is not recommended, because the encoding is specified inside the XML file itself, so the conversion to String should be done by the XML parser internally). Most XML parsers also support reading directly from any InputStream, so you could pass yours straight to the parser, which will probably reduce your memory consumption further. Disk space is only occupied when you write data back, by simply reversing the described procedure, and since you replace the source file by overwriting it, no extra disk space is wasted.
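A rough sketch of that round trip with the standard DOM and javax.xml.transform APIs (the file path and the example modification are placeholders, and a DOM parse still holds the whole document in memory, just not on disk):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

public class GzXmlEdit {
    public static void main(String[] args) throws Exception {
        String zipped = args[0]; // path to one of the .xml.gz files (placeholder)

        // Parse straight from the compressed stream; the parser deals with the XML encoding.
        Document doc;
        try (InputStream in = new GZIPInputStream(new FileInputStream(zipped))) {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
        }

        // ... modify the DOM here (hypothetical example edit) ...
        doc.getDocumentElement().setAttribute("processed", "true");

        // Write the modified document back out, recompressing on the fly.
        try (OutputStream out = new GZIPOutputStream(new FileOutputStream(zipped))) {
            TransformerFactory.newInstance().newTransformer()
                    .transform(new DOMSource(doc), new StreamResult(out));
        }
    }
}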
The fact that they're in a list doesn't change much, but no.
Ignoring compression, files are stored linearly on disks. You can append to them cheaply, you can replace bytes cheaply, but you can't replace sequences of different lengths (like replace("Testing Procedure Specification", "TPS")) without rewriting the file after the modified substring.
Gziping the file complicates things, but the same rule applies. In general, making arbitrary modifications to a file requires rewriting the file.
Your code for reading the files is on the right track, though. You can easily read through gzipped files as streams, without having to decompress the entire file first.

Append to a compressed stream with Java

We currently have some data log. The log is append-only, but on each append, the whole log is scanned from the beginning for some consistency checks (certain combinations of events trigger an alarm).
Now, we want to transform that log into a compressed log. Individual log entries are typically a few dozen bytes, so they won't compress well. However, the whole log stream does compress well, enough redundancy is present.
In theory, appending to the compressed stream should be easy, as the state of the compression encoder can be reconstructed while the log is scanned (and decompressed).
Our current way is to have a compressor with identical settings running during the scan and decompression phase, feeding it with the just decompressed data (assuming it will build the identical state).
However, we know that this is not optimal. We'd like to reuse the state which is built during decompression for the compression of the new data. So the question is: how can we implement the (de)compression in a way that we do not need to feed the decompressed data to a compressor to build the state, but can reuse the state of the decompressor to compress the new data we append?
(We need to do this in java, unfortunately, which limits the number of available APIs. Inclusion of free/open source 3rd party code is an option, however.)
You probably don't have the interfaces you need in Java, but this can be done with zlib. You could write your own Java interface to zlib to do this.
While scanning you would retain the last 32K of uncompressed data using a queue. You would scan the compressed file using Z_BLOCK in inflate(), which stops at every deflate block. When you get to the last block, which is identified by the first bit of the block, you would save the uncompressed data of that block, as well as the 32K that preceded it that you were saving in the queue. You would also save the last bits in the previous block that did not complete a byte (0..7 bits).

You would then add your new log entry to that last chunk of uncompressed data and recompress just that part, using the 32K that preceded it with deflateSetDictionary(). You would start the compression on a bit boundary with deflatePrime(). That would overwrite what was the last compressed block with a new compressed block or blocks.
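java.util.zip does not expose Z_BLOCK or deflatePrime(), so the approach above would need a JNI/JNA binding to zlib. The dictionary half of the idea does have a Java counterpart, though: Deflater.setDictionary() lets you prime a fresh compressor with the last 32K of previously decompressed data before compressing a new entry. A minimal sketch of just that part (it produces a standalone raw deflate stream rather than a bit-exact continuation of the existing file, and the matching Inflater would need the same dictionary):

import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

public class TailCompressor {
    private static final int WINDOW = 32 * 1024;

    // Compress a new log entry while allowing back-references into the last 32K
    // of the previously decompressed log, so short entries still compress well.
    public static byte[] compressWithHistory(byte[] history, byte[] newEntry) {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, true); // raw deflate, no zlib header
        int from = Math.max(0, history.length - WINDOW);
        deflater.setDictionary(history, from, history.length - from);

        deflater.setInput(newEntry);
        deflater.finish();

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }
}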

File size caching & efficient retrieval of file sizes in Java

I need to determine the size of a very large character-encoded file. A read of the file takes a significant amount of time.
My understanding is that when a file is first created/modified its size is cached, so the OS can quickly retrieve the value when the size is requested, say, by a file manager (e.g. it seems quick when opening the Properties dialog of a large file in Windows Explorer).
Assuming the above is true, can this cached value be retrieved in Java? I had thought that length() reads the file to determine the size... or does it in fact return the cached size? Or does creating the File object perform that read / retrieve the cached size?
My own research hasn't been able to answer these questions as yet.
I'd appreciate some help with my understanding
Thanks
File systems generally store the length as part of the file's metadata; this is how the OS knows where the end of the file is. This information is cached when accessed, and repeated calls for it will also hit the cache.
Note: the OS often reads more data from disk than you ask for, because disk accesses are expensive and memory is relatively cheap. E.g. when you get the length of one file, it may read in the details of many files on the assumption that you might want information about those files too, i.e. the first time you ask for a file's information it is likely to already be cached.
length() delegates to the underlying native operating-system call to get the length of the file. You should be fine using it.
The length() method doesn't read the file. It calls a native method which delegates to the OS to get the file length. Its response time should not depend on the actual file length.
I think you're overthinking this. length() should query the file system and figure this out very quickly. It's certainly not reading the entire file and counting bytes.
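A quick way to convince yourself (the path argument is whatever large file you have at hand; both calls below return file-system metadata rather than reading the contents, so they are effectively instant regardless of file size):

import java.io.File;
import java.nio.file.Files;
import java.nio.file.Paths;

public class FileSizeDemo {
    public static void main(String[] args) throws Exception {
        // Both of these ask the file system for the length it already stores.
        long viaFile = new File(args[0]).length();
        long viaNio = Files.size(Paths.get(args[0]));

        System.out.println("File.length(): " + viaFile + " bytes");
        System.out.println("Files.size():  " + viaNio + " bytes");
    }
}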

Reading a gz file and keeping track of position in file

So, here is the situation:
I have to read big .gz archives (GBs) and kind of "index" them to later on be able to retrieve specific pieces using random access.
In other words, I wish to read the archive line by line, and be able to get the specific location in the file for any such line. (so that I can jump directly to these specific locations upon request). (PS: ...and it's UTF-8 so we cannot assume 1 byte == 1 char.)
So, basically, what I just need is a BufferedReader which keeps track of its location in the file. However, this doesn't seem to exist.
Is there anything available or do I have to roll my own?
A few additional comments:
I cannot use BufferedReader directly since the file location corresponds to what has been buffered so far; in other words, a multiple of the internal buffer size rather than the location of the current line.
I cannot use InputStreamReader directly for performance reasons: unbuffered it would be way too slow, and, by the way, it lacks convenience methods to read lines.
I cannot use RandomAccessFile since 1. it's zipped, and 2. RandomAccessFile uses "modified" UTF-8
I guess the best would be to use a kind of buffered reader that keeps track of the file location and buffer offset... but this sounds quite cumbersome. Maybe I missed something, though. Perhaps something already exists to read files line by line while keeping track of the location (even if zipped).
Thanks for tips,
Arnaud
I think jzran could be pretty much what you're looking for:
It's a Java library based on the zran.c sample from zlib. You can preprocess a large gzip archive, producing an "index" that can be used for random read access. You can balance between index size and access speed.
What you are looking for is called mark(), markSupported() and skip(). These methods are declared in both InputStream and Reader, so you are welcome to use them.
GZIP compression does not support seeking. Previous data blocks are needed to build compression tables...
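If you do end up rolling your own, one workable approach is to count decompressed bytes yourself as you consume them, rather than relying on any buffer's position. A bare-bones sketch (the class name is made up, and it ignores '\r\n' line endings; it simply records the byte offset, in the decompressed stream, at which each line starts):

import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Closeable;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

public class OffsetTrackingLineReader implements Closeable {
    private final InputStream in; // buffered, decompressed byte stream
    private long offset = 0;      // decompressed bytes consumed so far
    private long lineStart = 0;   // offset at which the last returned line began

    public OffsetTrackingLineReader(File gzFile) throws IOException {
        this.in = new BufferedInputStream(new GZIPInputStream(new FileInputStream(gzFile)));
    }

    // Returns the next line decoded as UTF-8 (without the newline), or null at end of stream.
    public String readLine() throws IOException {
        lineStart = offset;
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {
            offset++;
            if (b == '\n') {
                return new String(line.toByteArray(), StandardCharsets.UTF_8);
            }
            line.write(b);
        }
        return line.size() > 0 ? new String(line.toByteArray(), StandardCharsets.UTF_8) : null;
    }

    // Offset, in decompressed bytes, of the start of the line most recently returned.
    public long lastLineOffset() {
        return lineStart;
    }

    public void close() throws IOException {
        in.close();
    }
}

Note that, as the last answer says, these offsets are into the decompressed stream; to jump back to them later you still need something like jzran's index, or you have to decompress from the start again.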

Indexing a set of key-value pairs for use in a J2ME Application

I have some 1000 key-value pairs that I will use in my J2ME application, reading them from a resource file. However, I will only be using a few of those values at any time, say 10, based on the record number generated by the application logic. Loading all the values into memory and then looking them up is not an efficient option, as I will not be using all the records. Is there a better scheme to store the values in the file, some indexing or something, so that I can retrieve those key-value pairs by skipping the right number of bytes in the file to reach and read the appropriate record? As this is a resource file in the jar, there won't be any modifications to it.
If you know the record length when they are created, you could write the records out in binary format to a file. But at the start of each record, you could first write a number indicating its size in bytes and use a RandomAccessFile to access the records by moving the file pointer.
In terms of speed, loading into memory will be faster than reading from a file, but if memory is at a premium, a file wouldn't be a bad way to go.
Jeff
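Since a resource inside a jar is read through an InputStream rather than a RandomAccessFile on J2ME, a variation on the same idea is to length-prefix every record and skip forward by those lengths until you reach the one you want. A rough sketch (the resource name and record layout are assumptions; the file would be produced offline by a small J2SE generator using DataOutputStream):

import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

public class RecordFile {
    // Layout assumed per record: int byteLength, then writeUTF(key), then writeUTF(value).
    public static String[] readRecord(int index) throws IOException {
        InputStream raw = RecordFile.class.getResourceAsStream("/records.bin");
        DataInputStream in = new DataInputStream(raw);
        try {
            for (int i = 0; i < index; i++) {
                int length = in.readInt();   // size of record i in bytes
                int skipped = 0;
                while (skipped < length) {   // skipBytes may skip less than asked
                    skipped += in.skipBytes(length - skipped);
                }
            }
            in.readInt();                    // length of the wanted record (not needed here)
            String key = in.readUTF();
            String value = in.readUTF();
            return new String[] { key, value };
        } finally {
            in.close();
        }
    }
}

Only the records before the requested one are touched, and none of them are decoded, so memory use stays at roughly a single record.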
Skipping bytes in a compressed resource file inside a jar is not really going to be optimal either, and the implementation of the InputStream you get from Class.getResourceAsStream() may be fragmented if you plan on running your application on several devices.
EDIT after additional info in comment:
It could be that the best way to do this is actually to store the (question, answer) data in 1000 different classes.
It's going to feel very weird as a solution, but the class loader should only load the 10 classes you actually use. You can generate the 1000 source files with a simple J2SE program and load 10 classes based on an integer in their names using java.lang.Class.forName().
If the jar file doesn't become too big to use, you're basically relying on the indexing of its zip file format for the class loader performances...
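For what it's worth, the loading side of that scheme could look roughly like this (the interface name, class-naming convention and zero-padding are made up for the sketch; each RecordNNNN class would be emitted by the J2SE generator):

// Common contract implemented by every generated record class.
interface Record {
    String key();
    String value();
}

// Example of one generated class (the real ones come from the generator).
class Record0042 implements Record {
    public String key()   { return "example key"; }
    public String value() { return "example value"; }
}

// Loads only the record classes that are actually requested.
public class RecordLoader {
    public static Record load(int recordNumber) throws Exception {
        // Zero-pad the number so it matches the generated class names.
        String num = Integer.toString(recordNumber);
        while (num.length() < 4) {
            num = "0" + num;
        }
        return (Record) Class.forName("Record" + num).newInstance();
    }
}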
