Writing a random access file transparently to a zip file - java

I have a java application that writes a RandomAccessFile to the file system. It has to be a RAF because some things are not known until the end, where I then seek back and write some information at the start of the file.
I would like to somehow put the file into a zip archive. I guess I could just do this at the end, but this would involve copying all the data that has been written so far. Since these files can potentially grow very large, I would prefer a way that somehow did not involve copying the data.
Is there some way to get something like a "ZipRandomAccessFile", à la the ZipOutputStream which is available in the JDK?
It doesn't have to be JDK-only; I don't mind pulling in third-party libraries to get the job done.
Any ideas or suggestions..?

Maybe you need to change the file format so it can be written sequentially.
In fact, since it is a Zip and Zip can contain multiple entries, you could write the sequential data to one ZipEntry and the data known 'only at completion' to a separate ZipEntry - which gives the best of both worlds.
It is easy to write, not having to go back to the beginning of the large sequential chunk. It is easy to read - if the consumer needs to know the 'header' data before reading the larger resource, they can read the data in that zip entry before proceeding.
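For illustration, here's a minimal sketch of that layout using ZipOutputStream; the entry names and the data written are just placeholders:

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipOutputStream;

    public class TwoEntryZip {
        public static void main(String[] args) throws IOException {
            try (ZipOutputStream zos = new ZipOutputStream(new FileOutputStream("archive.zip"))) {
                // Entry 1: the large payload, written strictly sequentially.
                zos.putNextEntry(new ZipEntry("data.bin"));
                for (int i = 0; i < 1_000_000; i++) {
                    zos.write(i & 0xFF); // stand-in for the real sequential data
                }
                zos.closeEntry();

                // Entry 2: the information only known at completion, written last.
                zos.putNextEntry(new ZipEntry("header.properties"));
                zos.write("recordCount=1000000\n".getBytes(StandardCharsets.UTF_8));
                zos.closeEntry();
            }
        }
    }

Since ZipFile lets the consumer open any entry by name, the reader can look at the "header" entry first even though it was physically written last.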

The way the DEFLATE format is specified, it only makes sense if you read it from the start. So each time you'd seek back and forth, the underlying zip implementation would have to start reading the file from the start. And if you modify something, the whole file would have to be decompressed first (not just up to the modification point), the change applied to the decompressed data, then compress the whole thing again.
To sum it up, ZIP/DEFLATE isn't the format for this. However, breaking your data up into smaller, fixed size files that are compressed individually might be feasible.

The point of compression is to recognize redundancy in data (like some characters occurring more often or repeated patterns) and make the data smaller by encoding it without that redundancy. This makes it infeasible to create a compression algorithm that would allow random access writing. In particular:
You never know in advance how well a piece of data will compress. So if you change some block of data, its compressed version will most likely be either longer or shorter than before.
As a compression algorithm processes the data stream, it uses the knowledge accumulated so far (like discovered repeated patterns) to compress the data at its current position. So if you change something, the algorithm needs to re-compress everything from that change to the end.
So the only reasonable solution is to manipulate the data first and compress it all at once at the end.

Related

Atomic file modification

There is a region in a file (possibly small) that I want to overwrite.
Assume I call fseek, fwrite, fsync. Is there any way to ensure the atomicity of such a region-rewriting operation? I.e. I need to be sure that, in any case of failure, the region will contain only the old (before modification) data, or only the new (modified) data, but not a mix of the two.
There are two things I want to highlight.
First: it's OK if there is no way to atomically write a region of ANY size - we can handle that by appending the data to the file, fsync'ing, then rewriting a 'pointer' area in the file, then fsync'ing again. However, if the 'pointer' write is not atomic, we can still end up with a corrupted file containing illegal pointers.
Second: I am pretty sure that writing 1-byte regions is atomic: I will never see in the file any byte I did not put there. So we can use a trick with allocating two slots for addresses plus a 1-byte switch, so rewriting a region becomes: append the new data, sync, rewrite one of the two (unused) pointer slots, sync again, then rewrite the 'switch' byte and sync once more. The overwrite-region operation now requires at least 3 fsync invocations.
All of this would be much easier if I had atomic writes for longs, but do I really have that?
Is there any way to handle this situation without the method mentioned in the second point?
Another question: is there any ordering guarantee between writing and syncing?
For example, if I call fseek, fwrite [1], fseek, fwrite [2], fsync, can write [2] be committed while write [1] is not?
This question applies to both Linux and Windows; answers specific to a particular version (e.g. Ubuntu a.b.c) are also welcome.
It's usually safe to assume that writes of 512-byte chunks are done as a single write by HDDs.
However, I would not rely on that. Instead, I'd go with your second solution, while adding a checksum to each write and verifying it before changing the pointer in the file.
Generally, it's good practice to add a checksum to everything written to disk.
To answer the question about the "sync" guarantee: you can assume that ordering holds. While sync behavior is FS- and disk-dependent, let's say we are talking about a 'reasonable' implementation.
After the first sync the data is guaranteed to be flushed to the disk (though the disk might still hold it in its cache), and whatever you read back is expected to be what you wrote.
If after the second sync the data of both syncs is still only in the disk cache, the situation you described can happen, but IMHO the probability of that is very low.
Anyway, there's no other mechanism that will promise you the data is on disk. That's why you must have checksums.
Some more info: Ensure fsync did its job
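For what it's worth, here is a rough Java analogue of the two-slot-plus-switch-byte scheme from the question combined with the checksum suggestion above (RandomAccessFile plus FileDescriptor.sync() standing in for fseek/fwrite/fsync). The file layout (switch byte at offset 0, two 8-byte pointer slots after it) and all names are invented purely for illustration:

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.zip.CRC32;

    class TwoSlotPointerFile {
        private static final long SWITCH_OFFSET = 0;  // 1-byte switch: 0 or 1
        private static final long SLOT_OFFSET = 1;    // two 8-byte pointer slots follow it

        static void overwriteRegion(RandomAccessFile raf, byte[] newData) throws IOException {
            // 1. Append the new data (with a CRC32 and length prefix) and sync.
            long recordPos = raf.length();
            CRC32 crc = new CRC32();
            crc.update(newData);
            raf.seek(recordPos);
            raf.writeLong(crc.getValue());
            raf.writeInt(newData.length);
            raf.write(newData);
            raf.getFD().sync();                        // fsync #1

            // 2. Write the record position into the currently unused slot and sync.
            raf.seek(SWITCH_OFFSET);
            int active = raf.readByte();
            int unused = 1 - active;
            raf.seek(SLOT_OFFSET + unused * 8L);
            raf.writeLong(recordPos);
            raf.getFD().sync();                        // fsync #2

            // 3. Flip the one-byte switch so readers pick up the new slot, then sync.
            raf.seek(SWITCH_OFFSET);
            raf.writeByte(unused);
            raf.getFD().sync();                        // fsync #3
        }
    }

A reader would check the switch byte, follow the corresponding slot, and verify the CRC32 before trusting the record. Note this still leans on the single-byte switch write being atomic, which is exactly the assumption the question raises.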

Java - handling billions of bytes

I'm creating a compression algorithm in Java;
to use my algorithm I require a lot of information about the structure of the target file.
After collecting the data, I need to reread the file. <- But I don't want to.
While rereading the file, I make it a good target for compression by 'converting' the data of the file to a rather peculiar format. Then I compress it.
The problems now are:
I don't want to open a new FileInputStream for rereading the file.
I don't want to save the converted file which is usually 150% the size of the target file to the disk.
Are there any ways to 'reset' a FileInputStream so it moves back to the start of the file, and how would I store the huge amount of 'converted' data efficiently without writing it to disk?
You can use one or more RandomAccessFiles. You can memory-map them to ByteBuffers, which consume almost no heap (actually about 128 bytes each) and no direct memory, but can be accessed randomly.
Your temporary data can be stored in direct ByteBuffer(s) or in more memory-mapped files. Since you have random access to the original data, you may not need to duplicate as much data in memory as you think.
This way you can access the whole data with just a few KB of heap.
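A minimal sketch of that idea, assuming the file fits in a single mapping (one MappedByteBuffer is limited to 2 GB, so a larger file would need several mappings); the file name is illustrative:

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class MappedReread {
        public static void main(String[] args) throws Exception {
            try (RandomAccessFile raf = new RandomAccessFile("target.dat", "r");
                 FileChannel ch = raf.getChannel()) {
                long size = Math.min(ch.size(), Integer.MAX_VALUE);
                MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, 0, size);

                // First pass: collect statistics about the file's structure.
                for (int i = 0; i < map.limit(); i++) {
                    int b = map.get(i) & 0xFF;
                    // ... update frequency tables etc. ...
                }

                // Second pass: no need to reopen anything, just read again.
                map.rewind();
                // ... convert and compress ...
            }
        }
    }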
There's the reset method, but you need to wrap the FileInputStream in a BufferedInputStream (and call mark first; reset only works within the marked read-ahead limit).
You could use RandomAccessFile, or perhaps a java.nio ByteBuffer is what you are looking for (I'm not sure).
Resources might be saved by using pipes/streams: writing immediately to a compressed stream.
To answer your question about reset: it is not possible. The base class InputStream has provisions for mark and reset-to-mark, but FileInputStream was made optimal for several operating systems and does purely sequential input. Closing and reopening is best.

Java - Parallelizing Gzip

I was assigned to parallelize GZip in Java 7, and I am not sure whether it is possible.
The assignment is:
Parallelize gzip using a given number of threads
Each thread takes a 1024 KiB block, using the last 32 KiB of the previous block as a dictionary. There is an option to use no dictionary.
Read from stdin and write to stdout
What I have tried:
I have tried using GZIPOutputStream, but there doesn't seem to be a way to isolate and parallelize the deflate() step, nor can I access the deflater to alter the dictionary. I tried extending GZIPOutputStream, but it didn't behave the way I wanted, since I still couldn't isolate the compress/deflate step.
I tried using Deflater with wrap enabled and a FilterOutputStream to output the compressed bytes, but I wasn't able to get it to compress properly in the gzip format. I made it so each thread has a compressor that writes to a byte array, which is then written to the OutputStream.
I am not sure whether my implementations were wrong or whether I took the wrong approach completely. Can anyone point me in the right direction as to which classes to use for this project?
Yep, gzipping a file with a dictionary can't be parallelized, as everything depends on everything that came before it. Maybe your teacher asked you to parallelize the individual gzipping of multiple files in a folder? That would be a great example of parallelizable work.
To make a process concurrent, you need portions of code which can run concurrently and independently. Most compression algorithms are designed to be run sequentially, where every byte depends on every byte that comes before it.
The only way to do compression concurrently is to change the algorithm (making it incompatible with existing approaches).
I think you can do it by inserting appropriate resets in the compression stream. The idea is that the underlying compression engine used in gzip allows the deflater to be reset; the intent is to make it easier to recover from stream corruption, at the cost of a worse compression ratio. After a reset, the deflater is in a known state that is independent of the content compressed so far. You could therefore start from that known state in multiple threads (and from many locations in the input data, of course), have each thread produce a compressed chunk plus the data produced by the following reset that takes the deflater back to the known state, and then reassemble the compressed pieces into the overall compressed stream. "Simple!" (Hah!)
I don't know if this will work, and I suspect that the complexity of the whole thing makes it a viable choice only when you're compressing single very large files. (If you had many files, it would be much easier to just compress each of those in parallel.) Still, that's what I'd try first.
(Also note that the gzip format is just a deflated stream with extra metadata.)
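Here is a rough sketch of that idea (it mirrors what the pigz tool does), using the flush-mode overload of Deflater.deflate() that Java 7 added. Splitting stdin into 1024 KiB blocks, the thread-pool wiring, and the CRC32 computation over the whole input are left out, and all names are placeholders:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;
    import java.util.List;
    import java.util.zip.Deflater;

    class ParallelGzipSketch {

        // Compress one 1024 KiB block; 'dict' is the last 32 KiB of the previous
        // block (or null for the first block / the no-dictionary option);
        // 'last' marks the final block of the input.
        static byte[] compressBlock(byte[] block, byte[] dict, boolean last) {
            Deflater def = new Deflater(Deflater.DEFAULT_COMPRESSION, true); // raw deflate, no zlib wrapper
            if (dict != null) {
                def.setDictionary(dict);
            }
            def.setInput(block);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            if (last) {
                def.finish();
                while (!def.finished()) {
                    out.write(buf, 0, def.deflate(buf));
                }
            } else {
                // FULL_FLUSH ends the block on a byte boundary and resets the
                // compressor, so independently produced chunks can be concatenated.
                int n;
                do {
                    n = def.deflate(buf, 0, buf.length, Deflater.FULL_FLUSH);
                    out.write(buf, 0, n);
                } while (n == buf.length);
            }
            def.end();
            return out.toByteArray();
        }

        // Reassemble: gzip header, the compressed chunks in input order, then the
        // CRC32 and length (mod 2^32) of the *uncompressed* input, little-endian.
        static void writeGzip(OutputStream os, List<byte[]> chunks,
                              long crc, long uncompressedLength) throws IOException {
            os.write(new byte[] {0x1f, (byte) 0x8b, 8, 0, 0, 0, 0, 0, 0, (byte) 0xff});
            for (byte[] chunk : chunks) {
                os.write(chunk);
            }
            writeIntLE(os, (int) crc);
            writeIntLE(os, (int) uncompressedLength);
        }

        private static void writeIntLE(OutputStream os, int v) throws IOException {
            os.write(v);
            os.write(v >>> 8);
            os.write(v >>> 16);
            os.write(v >>> 24);
        }
    }

gunzip should be able to decompress the concatenated result because each flush ends on a byte boundary, though every flush point costs a little compression ratio.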

.txt limit, need help

I have a problem here and no idea what to do. Basically I'm creating a .txt file which serves as an index for a random access file. In it I have the number of bytes required to seek to each entry in the file.
The file has 1484 records. This is where I have my problem: with the large number of bytes per record, I end up writing pretty long numbers into the file, and ultimately the .txt file ends up being too big. When I open it with an appropriate piece of software (such as Notepad) the file is simply cut off at a certain point.
I tried to minimize it as much as possible, but it's just too big.
What can I do here? I'm clueless.
Thanks.
I am not really sure the problem is what you think it is... only 1484 records?
You can write a binary file instead, in which each group of four or eight bytes corresponds to a record position. This way, all positions have the same length on disk, no matter how many digits they hold. If you need to browse or modify the file, you can easily write utility programs that decode it for inspection and re-encode your modifications.
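For example, a small sketch of such a fixed-width index (the file name and the choice of 8-byte longs are just for illustration):

    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;

    class BinaryIndex {

        // Write one 8-byte long per record offset; record i's offset sits at byte i * 8.
        static void write(String indexFile, long[] offsets) throws IOException {
            try (DataOutputStream out = new DataOutputStream(new FileOutputStream(indexFile))) {
                for (long offset : offsets) {
                    out.writeLong(offset);   // always 8 bytes, no matter how large the number
                }
            }
        }

        static long[] read(String indexFile, int recordCount) throws IOException {
            long[] offsets = new long[recordCount];
            try (DataInputStream in = new DataInputStream(new FileInputStream(indexFile))) {
                for (int i = 0; i < recordCount; i++) {
                    offsets[i] = in.readLong();
                }
            }
            return offsets;
        }
    }

With 1484 records the index would be only 1484 * 8 = 11,872 bytes.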
Another solution would be to compress the file. You can use the zip capabilities of Java: unzip the file before using it, and zip it again afterwards.
It is probably because you are not writing newlines to terminate each line. There is a limit on the maximum line length that text editors can handle safely.
Storing your indices in a binary file, inside some kind of Collection (depending on your needs) would probably be much faster and lighter.

File processing in Java

I have a file of size 2 GB which has student records in it. I need to find students based on certain attributes in each record and create a new file with the results. The order of the filtered students should be the same as in the original file. What's the most efficient and fastest way of doing this using the Java I/O API and threads without running into memory issues? The max heap size for the JVM is set to 512 MB.
What kind of file? Text-based, like CSV?
The easiest way would be to do something like grep does: Read the file line by line, parse the line, check your filter criterion, if matched, output a result line, then go to the next line, until the file is done. This is very memory efficient, as you only have the current line (or a buffer a little larger) loaded at the same time. Your process needs to read through the whole file just once.
I do not think multiple threads are going to help much. It would make things much more complicated, and since the process seems to be I/O bound anyway, trying to read the same file with multiple threads probably does not improve throughput.
If you find that you need to do this often, and going through the file each time is too slow, you need to build some kind of index. The easiest way to do that would be to import the file into a DB (can be an embedded DB like SQLite or HSQL) first.
I wouldn't overcomplicate this until you find that the boringly simple way doesn't work for what you need. Essentially you just need to:
open input stream to 2GB file, remembering to buffer (e.g. by wrapping with BufferedInputStream)
open output stream to filtered file you're going to create
read first record from input stream, look at whatever attribute to decide if you "need" it; if you do, write it to output file
repeat for remaining records
On one of my test systems with extremely modest hardware, a BufferedInputStream around a FileInputStream read about 500 MB in 25 seconds out of the box, i.e. probably under 2 minutes to process your 2 GB file, and the default buffer size is basically as good as it gets (see the BufferedInputStream timings I made for more details). I imagine with state-of-the-art hardware the time could well be halved.
Whether you need to go to a lot of effort to shave off those 2-3 minutes, or just go for a wee while you're waiting for it to run, is a decision you'll have to make depending on your requirements. I think the database option won't buy you much unless you need to do a lot of different processing runs on the same set of data (and there are other solutions to this that don't automatically mean a database).
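A minimal sketch of the steps above, assuming a text file with one student record per line (the file names and the attribute check are made-up placeholders):

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.IOException;

    public class StudentFilter {
        public static void main(String[] args) throws IOException {
            try (BufferedReader in = new BufferedReader(new FileReader("students.txt"));
                 BufferedWriter out = new BufferedWriter(new FileWriter("filtered.txt"))) {
                String line;
                while ((line = in.readLine()) != null) {   // only one line in memory at a time
                    if (matchesFilter(line)) {
                        out.write(line);                   // output preserves the input order
                        out.newLine();
                    }
                }
            }
        }

        // Hypothetical filter: keep records whose third comma-separated field is "CS".
        private static boolean matchesFilter(String line) {
            String[] fields = line.split(",");
            return fields.length > 2 && "CS".equals(fields[2].trim());
        }
    }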
2GB for a file is huge, you SHOULD go for a db.
If you really want to use Java I/O API, then try out this: Handling large data files efficiently with Java and this: Tuning Java I/O Performance
I think you should use memory-mapped files. They will help you map the bigger file into a smaller amount of memory; they act like virtual memory, and as far as performance is concerned, mapped files are faster than stream write/read.
