Java handling billions of bytes - java

I'm creating a compression algorithm in Java;
to use my algorithm I require a lot of information about the structure of the target file.
After collecting the data, I need to reread the file. <- But I don't want to.
While rereading the file, I make it a good target for compression by 'converting' the data of the file to a rather peculiar format. Then I compress it.
The problems now are:
I don't want to open a new FileInputStream for rereading the file.
I don't want to save the converted file, which is usually 150% the size of the target file, to disk.
Are there any ways to 'reset' a FileInputStream so it moves back to the start of the file, and how would I store the huge amount of 'converted' data efficiently without writing it to disk?

You can use one or more RandomAccessFiles. You can memory-map them as ByteBuffers, which consume neither heap (actually about 128 bytes each) nor direct memory, but can still be accessed randomly.
Your temporary data can be stored in direct ByteBuffer(s) or in more memory-mapped files. Since you have random access to the original data, you may not need to duplicate as much of it in memory as you think.
This way you can access the whole of the data with just a few KB of heap.
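A minimal sketch of that approach, assuming the original file fits in a single mapping; the file name "input.dat" and the byte-histogram pass are placeholders for the asker's analysis step:

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedRead {
    public static void main(String[] args) throws Exception {
        // "input.dat" stands in for the original file to be analysed.
        try (RandomAccessFile raf = new RandomAccessFile("input.dat", "r");
             FileChannel channel = raf.getChannel()) {
            // Map the whole file; the mapping lives outside the Java heap.
            // A single mapping is limited to 2 GB (see the related question below).
            MappedByteBuffer map = channel.map(
                    FileChannel.MapMode.READ_ONLY, 0, channel.size());

            // First pass: gather statistics about the file.
            int[] histogram = new int[256];
            for (int i = 0; i < map.limit(); i++) {
                histogram[map.get(i) & 0xFF]++;
            }

            // Second "pass": no reopening or resetting needed, just read again
            // from offset 0 (or any other offset).
            byte first = map.get(0);
            System.out.println("first byte = " + first);
        }
    }
}
```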

There's the reset method, but you need to wrap the FileInputStream in a BufferedInputStream and call mark first.
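A short sketch of what that looks like. Note that everything read between mark() and reset() is held in the BufferedInputStream's internal buffer on the heap, so this is only practical when the data read in the first pass fits in memory; the file name is a placeholder:

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class MarkResetExample {
    public static void main(String[] args) throws IOException {
        try (InputStream in = new BufferedInputStream(new FileInputStream("input.dat"))) {
            in.mark(Integer.MAX_VALUE); // mark the start; the read limit must cover the first pass
            while (in.read() != -1) {
                // first pass: collect information about the file
            }
            in.reset();                 // rewind to the marked position
            while (in.read() != -1) {
                // second pass: convert and compress
            }
        }
    }
}
```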

You could use RandomAccessFile, or perhaps a java.nio ByteBuffer is what you are looking for. (I do not know.)
Resources might be saved by pipes/streams: write immediately to a compressed stream instead of producing the converted file first.
To answer your question on reset: it is not possible. The base class InputStream has provisions for mark and reset-to-mark, but FileInputStream does not support them; it was made optimal for several operating systems and does purely sequential input. Closing and reopening the stream is your best option.
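As a rough sketch of the "write immediately to a compressed stream" idea: converted bytes are pushed straight into a DeflaterOutputStream, so the 150%-sized intermediate form never exists on disk. The convertNextChunk method and the file name are made-up placeholders for the asker's conversion step.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.DeflaterOutputStream;

public class StreamCompress {
    public static void main(String[] args) throws IOException {
        // Write converted data straight into a compressing stream so the
        // intermediate form never exists on disk at full size.
        try (DeflaterOutputStream out =
                 new DeflaterOutputStream(new FileOutputStream("output.deflate"))) {
            byte[] chunk;
            while ((chunk = convertNextChunk()) != null) {
                out.write(chunk);
            }
        }
    }

    // Stand-in for the questioner's conversion step (hypothetical).
    private static byte[] convertNextChunk() {
        return null; // produce the next block of 'converted' bytes, or null when done
    }
}
```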

Related

Most efficient way to temporarily store discontinuous data composing a larger file

We are emulating a p2p network in java. So we divide the file into chunks (with checksums) so that the individual chunks can be recompiled into the original file once we have all the parts. What is the best way to store the individual parts while they are being downloaded?
I was thinking of just storing each chunk as a separate file... but if there are 20000 chunks, that would create as many files. Is this the best way?
Thanks
Either keep the chunks in memory or in files; there is not much else to discuss. Find the ratio between chunk count and chunk size that suits your needs.
Files sound more reasonable, as the data would not be totally lost if the application crashes, and the download could be resumed.
I would write to memory until you reach some threshold, at which point you dump your memory to disk, and keep reading into memory. When the file transfer completes, you can take what is currently stored in memory, and concatenate it with what may have been stored on disk.
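As an illustration of the threshold idea, a rough sketch; the class name ChunkStore, the 64 MB limit, and the spill-file path are arbitrary choices, and chunks are assumed to arrive in order:

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;

public class ChunkStore {
    private static final long MEMORY_LIMIT = 64L * 1024 * 1024; // spill threshold, e.g. 64 MB

    private final List<byte[]> inMemory = new ArrayList<>();
    private long inMemoryBytes = 0;
    private final OutputStream spill;

    public ChunkStore(String spillPath) throws IOException {
        this.spill = new FileOutputStream(spillPath);
    }

    public void addChunk(byte[] chunk) throws IOException {
        if (inMemoryBytes + chunk.length > MEMORY_LIMIT) {
            flushToDisk(); // threshold reached: dump memory to disk and keep going
        }
        inMemory.add(chunk);
        inMemoryBytes += chunk.length;
    }

    // When the transfer completes, whatever is still in memory is appended
    // after what was already spilled, giving one contiguous file.
    public void finish() throws IOException {
        flushToDisk();
        spill.close();
    }

    private void flushToDisk() throws IOException {
        for (byte[] buffered : inMemory) {
            spill.write(buffered);
        }
        inMemory.clear();
        inMemoryBytes = 0;
    }
}
```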

Can I seek to a position inside a Memory-mapped file?

I would love to have a memory-mapped file in Java NIO so that I can move anywhere in the file to read any portion of it, pretty much like a seek method. Is that possible with a memory-mapped file, the same way you would do it with a RandomAccessFile?
NOTE: The file will be in READ/WRITE mode.
Thanks!
Assuming your file is small enough to fit into one ByteBuffer, you can use position(int). Another option is to access the buffer randomly with Xxx value = getXxx(offset) or putXxx(offset, value).
If the file is more than 2 GB, you will need an array or list of ByteBuffers to map it entirely (assuming you have a 64-bit JVM).
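A sketch of that approach, assuming a 64-bit JVM and a file opened in read/write mode; the file name, chunk size, and sample offset are placeholders:

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.ArrayList;
import java.util.List;

public class LargeMap {
    public static void main(String[] args) throws Exception {
        final long CHUNK = 1L << 30; // 1 GB per mapping; a single ByteBuffer is limited to 2 GB

        try (RandomAccessFile raf = new RandomAccessFile("big.dat", "rw");
             FileChannel ch = raf.getChannel()) {
            List<MappedByteBuffer> maps = new ArrayList<>();
            for (long pos = 0; pos < ch.size(); pos += CHUNK) {
                long size = Math.min(CHUNK, ch.size() - pos);
                maps.add(ch.map(FileChannel.MapMode.READ_WRITE, pos, size));
            }

            // "Seek" to an absolute offset by picking the right mapping and
            // using the absolute get/put overloads.
            long offset = 3_000_000_000L; // example offset; must lie within the file
            MappedByteBuffer map = maps.get((int) (offset / CHUNK));
            byte value = map.get((int) (offset % CHUNK));
            map.put((int) (offset % CHUNK), (byte) (value + 1));
        }
    }
}
```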

Writing a random access file transparently to a zip file

I have a java application that writes a RandomAccessFile to the file system. It has to be a RAF because some things are not known until the end, where I then seek back and write some information at the start of the file.
I would like to somehow put the file into a zip archive. I guess I could just do this at the end, but this would involve copying all the data that has been written so far. Since these files can potentially grow very large, I would prefer a way that somehow did not involve copying the data.
Is there some way to get something like a "ZipRandomAccessFile", a la the ZipOutputStream that is available in the JDK?
It doesn't have to be JDK-only; I don't mind pulling in third-party libraries to get the job done.
Any ideas or suggestions..?
Maybe you need to change the file format so it can be written sequentially.
In fact, since it is a Zip and Zip can contain multiple entries, you could write the sequential data to one ZipEntry and the data known 'only at completion' to a separate ZipEntry - which gives the best of both worlds.
It is easy to write, not having to go back to the beginning of the large sequential chunk. It is easy to read - if the consumer needs to know the 'header' data before reading the larger resource, they can read the data in that zip entry before proceeding.
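A minimal sketch of that two-entry layout using the standard ZipOutputStream; the entry names, the payload loop, and the recordCount field stand in for the real sequential data and the values only known at completion:

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class TwoEntryZip {
    public static void main(String[] args) throws IOException {
        try (ZipOutputStream zip = new ZipOutputStream(new FileOutputStream("archive.zip"))) {
            long recordCount = 0;

            // First entry: the large sequential payload, written straight through.
            zip.putNextEntry(new ZipEntry("data.bin"));
            for (int i = 0; i < 1_000; i++) {       // stand-in for the real payload
                zip.write(new byte[]{(byte) i});
                recordCount++;
            }
            zip.closeEntry();

            // Second entry: the values only known at completion (the old "seek back" data).
            zip.putNextEntry(new ZipEntry("header.txt"));
            zip.write(("records=" + recordCount + "\n").getBytes());
            zip.closeEntry();
        }
    }
}
```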
The way the DEFLATE format is specified, it only makes sense if you read it from the start. So each time you'd seek back and forth, the underlying zip implementation would have to start reading the file from the start. And if you modify something, the whole file would have to be decompressed first (not just up to the modification point), the change applied to the decompressed data, and then the whole thing compressed again.
To sum it up, ZIP/DEFLATE isn't the format for this. However, breaking your data up into smaller, fixed size files that are compressed individually might be feasible.
The point of compression is to recognize redundancy in data (like some characters occurring more often or repeated patterns) and make the data smaller by encoding it without that redundancy. This makes it infeasible to create a compression algorithm that would allow random access writing. In particular:
You never know in advance how well a piece of data can be compressed. So if you change some block of data, its compressed version will be most likely either longer or shorter.
As a compression algorithm processes the data stream, it uses the knowledge accumulated so far (like discovered repeated patterns) to compress the data at its current position. So if you change something, the algorithm needs to re-compress everything from that change to the end.
So the only reasonable solution is to manipulate the data first and compress it all at once at the end.

Write to Disk in Pages

I am writing some text to disk as bytes. I need to maximize my performance and write in complete pages.
Does anybody know what is the optimal size of a page in bytes when writing to disk?
If you use a BufferedWriter or buffered streams, you should be good. Java uses an 8 KB buffer by default. This should be sufficient for most usage patterns. Is there anything specific about your use case (like fixed-length data that needs to be written and fetched from disk in a single shot) that is making you optimize beyond what Java already provides?
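For example, a BufferedOutputStream lets you pass the buffer size explicitly if you want to experiment with other multiples of the filesystem block size; 8 KB here simply mirrors the default, and the file name and data are placeholders:

```java
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class PageSizedWrites {
    public static void main(String[] args) throws IOException {
        int bufferSize = 8 * 1024; // same as the default; try other block-size multiples here
        try (BufferedOutputStream out =
                 new BufferedOutputStream(new FileOutputStream("out.dat"), bufferSize)) {
            byte[] line = "some text\n".getBytes();
            for (int i = 0; i < 100_000; i++) {
                out.write(line); // small writes are coalesced into buffer-sized disk writes
            }
        }
    }
}
```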

How would you change a single byte in a file?

What is the best way to change a single byte in a file using Java? I've implemented this in several ways. One uses pure byte-array manipulation, but this is highly sensitive to the amount of memory available and doesn't scale past 50 MB or so (i.e. I can't allocate 100 MB worth of byte[] without getting OutOfMemory errors). I also implemented it another way that works and scales, but it feels quite hacky.
If you're a java io guru, and you had to contend with very large files (200-500MB), how might you approach this?
Thanks!
I'd use RandomAccessFile, seek to the position I wanted to change and write the change.
If all I wanted to do was change a single byte, I wouldn't bother reading the entire file into memory. I'd use a RandomAccessFile, seek to the byte in question, write it, and close the file.
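A sketch of that approach; the offset, the new value, and the file name are example placeholders:

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class PatchByte {
    public static void main(String[] args) throws IOException {
        long offset = 123_456_789L; // position of the byte to change (example value)
        byte newValue = 0x7F;       // replacement byte (example value)

        // No need to load the 200-500 MB file: seek straight to the byte and overwrite it.
        try (RandomAccessFile raf = new RandomAccessFile("big.dat", "rw")) {
            raf.seek(offset);
            raf.write(newValue);
        }
    }
}
```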
