Reading a gz file and keeping track of position in file - java

So, here is the situation:
I have to read big .gz archives (GBs) and kind of "index" them to later on be able to retrieve specific pieces using random access.
In other words, I wish to read the archive line by line, and be able to get the specific location in the file for any such line. (so that I can jump directly to these specific locations upon request). (PS: ...and it's UTF-8 so we cannot assume 1 byte == 1 char.)
So, basically, what I just need is a BufferedReader which keeps track of its location in the file. However, this doesn't seem to exist.
Is there anything available or do I have to roll my own?
A few additional comments:
I cannot use BufferedReader directly since the file location corresponds to what has been buffered so far. In other words, a multiple of the internal buffer size instead of the line location.
I cannot use InputStreamReader directly for performance reasons. Unbuffered would be way to slow, and, btw, lacks convenience methods to read lines.
I cannot use RandomAccessFile since 1. it's zipped, and 2. RandomAccessFile uses "modified" UTF-8
I guess the best would be use a kind of of buffered reader keeping track of file location and buffer offset ...but this sounds quite cumbersome. But maybe I missed something. Perhaps there is already something existing to do that, to read files line by lines and keep track of location (even if zipped).
Thanks for tips,
Arnaud

I think jzran could be pretty much what you're looking for:
It's a Java library based on the
zran.c sample from zlib.
You can preprocess a large gzip
archive, producing an "index" that can
be used for random read access.
You can balance between index size and
access speed.

What you are looking for is called mark(), markSupported() and skip().
This methods are declared both in InputStream and Reader, so you are welcome to use them.

GZIP compression does not support seeking. Previous data blocks are needed to build compression tables...

Related

Java: Replace part of file without writing the entire file again

Is it possible to replace part of a files content, without rewriting the entire file to the disk.
Say that i have a very large file of several gigabytes, how to i replace the bytes from, lets say position 100 to 200 without rewriting the entire file?
As an added bonus, i need a solution that does not use any features never than java 1.4.
If you're positive that you're going to be writing exactly the same number of bytes, you can use a RandomAccessFile to accomplish this (available since Java 1.0). Just open the file, seek to wherever you need to be, and overwrite those bytes with whatever your new data is.
RandomAccessFile f = new RandomAccessFile(new File("C:\\test\\huge.txt"), "rw");
f.seek(100); // Seek ahead
f.write("here is some new stuff".getBytes())
You can also read from the file at arbitrary points in the same fashion, in case you don't know exactly how much data you need to replace (e.g. so you can pad/truncate whatever you're writing to avoid doing something awful by accident).

Best way to use as little memory as possible when reading/writing large file?

I'm on mobile (android), and have a large text file, about 50mb. I want to be able to open the file and seek to a particular position, then start reading data into a buffer from that point. Is using FileReader + BufferedReader the best way to do this if I want to use as little memory as possible?:
BufferedReader in
= new BufferedReader(new FileReader("foo.txt"));
in.skip(byteCount); // in some cases I have to read from an offset
// start reading a line at a time here
I'll also need to write to the file, only ever appending data, so:
FileWriter w = new FileWriter("foo.txt", true);
w.write(someCharacters);
I'm primarily interested to know if by misusing the wrong file reader/writer classes, I may accidentally be loading the entire file contents into memory before the reads or writes,
Thanks
Basically you don't want to read the whole file, but just a certain portion of it. In this case use java.io.RandomAccessFile instead:
its seek() method is guaranteed to do seek instead of reading & discarding (which is what some implementations of InputStream.skip() actually do)
the seek() method can move back the file pointer - something you can't do for an InputStream
a getFilePointer() method is provided to get the current position in file.
it only reads what you tells it to read, so there's no fear you'll accidentally load more than what you want
My dictionary app uses RandomAccessFile to access about 45MB of data back when each Android app could only use 16MB of RAM, also a service running my dictionary engine that operates on the same 45MB of data uses only about 2MB of RAM(and most of it prob were used by Davlik VM and not my search engine). So this class definitely works as intended.
You could try using a memory mapped file (java.nio.channels.FileChannel.map()). I'm not sure how much heap space would be allocated for this though.

How can I read a file multiple times in java

This is my understanding regarding reading a file using BufferedReader in java. Please correct me if I am wrong somewhere...
Recently I had a requirement where we are required to read a file multiple times.
The usual way which I use is setting a mark() and doing a reset. But the input parameters to
a mark is an integer and it cannot accept a long number. Is there a way in which we can read the file, a large number of times.
In c++ we can do a seekg on the fstream and read the contents once again irrespective of the number of times we want to do so. Is there anything in java which is of this nature.
Just close the file and read it again.
But review your requirement. Why can't you process it in one pass?
Not much of a good answer but if you want to do random reading and writing then you can use Channels in java.nio package.
BufferedReader is for reading a file when you logically see it as a series of records and records are generally accessed sequentially.
Channels allow you to view your file as a series of blocks. Blocks are meant to be read randomly. :)
Using subclass of channel, FileChannel, you can read what you want from wherever you want. You need to specify two things:
Where to read from.
How much to read.
It has a read(dst,pstn) where dst is a ByteBuffer and pstn is a long position.
Don't worry that it is abstract because you use it via Files.newByteChannel() which does all the voodoo needed to make it work :)

Avoid obtaining same InputStream more than once

I can see there are a number of posts regarding reuse InputStream. I understand InputStream is a one-time thing and cannot be reused.
However, I have a use case like this:
I have downloaded the file from DropBox by obtaining the DropBoxInputStream using the DropBox's Java SDK. I then need to upload the file to another system by passing the InputStream. However, as part of the download, I have to provide the MD5 of the file. So I have to read the file from the stream before uploading the file. Because the DropBoxInputStream I received can only be used once, I have to get another DropBoxInputStream after I have calculated the MD5 and before uploading the file. The procedure is like:
Get first DropBoxInputStream
Read from the DropBoxInputStream and calculate MD5
Get the second DropBoxInputStream
Upload file using the MD5 and the second DropBoxInputStream.
I am thinking that, if there are many way for me to "cache" or "backup" the InputStream before I calculate the MD5 so that I can save step 3 of obtaining the same DropBoxInputStream again?
Many thanks
EDIT:
Sorry I missed some information.
What I am currently doing is that I use a MD5DigestOutputStream to calculate MD5. I stream data across the MD5DigestOutputStream and save them locally as a temp file. Once the data goes through the MD5DigestOutputStream, it will calculate the MD5.
I then call a third party library to upload the file using the calculated md5 and a FileInputStream which reads from the temp file.
However, this requires huge disk space sometime and I want to remove the needs to use temp file. The library I use only accepts a MD5 and InputStream. This means I have to calculate the MD5 on my end. My plan is to use my MD5DigestOutputStream to write data to /dev/null (not keeping the file) so that I can calculate theMD5, and get the InputStream from DropBox again and pass that to the library I use. I assume the library will be able to get the file directly from DropBox without the need for me to cache the file either in the memory of at the disk. Will it work?
Input streams aren't really designed for creating copies or re-using, they're specifically for situations where you don't want to read off into a byte array and use array operations on that (this is especially useful when the whole array isn't available, as in, for e.g. socket comunication). You could buffer up into a byte array, which is the process of reading sections from the stream into a byte array buffer until you have enough information.
But that's unnecessary for calculating an md5. Notice that InputStream is abstract, so it needs be implemented in an extended class. It has many implementations- GZIPInputStream, fileinputstream etc. These are, in design pattern speak, decorators of the IO stream: they add extra functionality to the abstract base IO classes. For example, GZIPInputStream gzips up the stream.
So, what you need is a stream to do this for md5. There is, joyfully, a well documented similar thing: see this answer. So you should just be able to pass your dropbox input stream (as it will be itself an input stream) to create a new DigestInputStream, and then you can both take the md5 and continue to read as before.
Worried about type casting? The idea with decorators in Java is that, since the InputStream base class interfaces all the methods and 'beef' you need to do your IO, there's no harm in passing instances of objects inheriting from InputStream in the constructor of each stream implementation, and you can still do the same core IO.
Finally, I should probably answer your actual question- say you still want to "cache" or "backup" the stream anyway? Well, you could just write it to a byte array. This is well documented, but can become faff when your streams get more complicated. Alternatively, try looking at a PushbackInputStream. Here, you can easily write a function to read off n bytes, perform and operation on them, and then restore them to the stream. Generally good to avoid these implementations of streams in Java, as it's bad for memory use, but no worse than buffering everything up which you'd otherwise have to do.
Or, of course, I would have a go with DigestInputStream.
Hope this helps,
Best.
You don't need to open a new InputStream from DropBox.
Once you have read the file from DropBox, you have it locally. So it is either in memory (in a byte array) or you stored it in a local file. Now you can create an InputStream that reads the data from memory (ByteArrayInputStream) or disk (FileInputStream) in order to upload the file.
So instead of caching the InputStream (which you can't) you cache the contents (which you can).

How do I write/read to the beginning of a text file?

EDIT
This is my file reader, can I make this read it from bottom to up seeing how difficult it is to make it write from bottom to up.
BufferedReader mainChat = new BufferedReader(new FileReader("./messages/messages.txt"));
String str;
while ((str = mainChat.readLine()) != null)
{
System.out.println(str);
}
mainChat.close();
OR (old question)
How can I make it put the next String at the beginning of the file and then insert an new line(to shift the other lines down)?
FileWriter chatBuffer = new FileWriter("./messages/messages.txt",true);
BufferedWriter mainChat = new BufferedWriter(chatBuffer);
mainChat.write(message);
mainChat.newLine();
mainChat.flush();
mainChat.close();
Someone could correct me, but I'm pretty sure in most operating systems, there is no option but to read the whole file in, then write it back again.
I suppose the main reason is that, in most modern OSs, all files on the disc start at the beginning of a boundary. The problem is, you cannot tell the file allocation table that your file starts earlier than that point.
Therefore, all the later bytes in the file have to be rewritten. I don't know of any OS routines that do this in one step.
So, I would use a BufferedReader to store whole file into a Vector or StringBuffer, then write it all back with the prepended string first.
--
Edit
A way that would save memory for larger files, reading #Saury's randomaccessfile suggestion, would be:
file has N bytes to start with
we want to add on "hello world"
open the file for append
append 11 spaces
i=N
loop {
go back to byte i
read a byte
move to byte i+11
write that byte back
i--
} until i==0
then move to byte 0
write "hello world"
voila
Use FileUtils from Apache Common IO to simplify this if you can. However, it still needs to read the whole file in so it will be slow for large files.
List<String> newList = Arrays.asList("3");
File file = new File("./messages/messages.txt");
newList.addAll(FileUtils.readLines(file));
FileUtils.writeLines(file, newList);
FileUtils also have read/write methods that take care of encoding.
Use RandomAccessFile to read/write the file in reverse order. See following links for more details.
http://www.java2s.com/Code/Java/File-Input-Output/UseRandomAccessFiletoreverseafile.htm
http://download.oracle.com/javase/1.5.0/docs/api/java/io/RandomAccessFile.html
As was suggested here pre-pending to a file is rather difficult and is indeed linked to how files are stored on the hard drive. The operation is not naturally available from the OS so you will have to make it yourself and most obvious answers to this involve reading the whole file and writing it again. this may be fine for you but will incur important costs and could be a bottleneck for your application performance.
Appending would be the natural choice but this would, as far as I understand, make reading the file unnatural.
There are many ways you could tackle this depending on the specificities of your situation.
If writing this file is not time critical in your application and the file does not grow too big you could bite the bullet and read the whole file, prepend the information and write it again. apache's common-io's FileUtils will be of help here simpifying the operation where you can read the file as a list of strings, prepend the new lines to the list and write the list again.
If writing is time critical but have control over the reading or the file. That is, if the file is to be read by another of your programs. you could load the file in a list of lines and reverse the list. Again FileUtils from the common-io library and helper functions in the Collections class in the standard JDK should do the trick nicely.
If writing is time critical but the file is intended to be read through a normal text editor you could create a small class or program that would read the file and write it in another file with the preferred order.

Categories

Resources