Java - possible to modify and parse gzipped XML files without unzipping?

I have an ArrayList of gzipped XML files. Is it possible to view and manipulate the contents of these XML files without unzipping them and taking up disk space? If so, what would be the correct class(es) to use for this task?
I know I can create a GZIPInputStream from a FileInputStream of the gzipped file, but from there I'm not sure what to do. I have only this written:
GZIPInputStream in = new GZIPInputStream(new FileInputStream(zippedFiles.get(i)));
I need some way to parse text within the XML files and modify the XML itself, but again, extracting all of them would take up too much disk space.

What exactly are you trying to achieve? You can extract the file into memory using a ByteArrayOutputStream and convert it into a byte array that you forward to your XML parser library (converting it to a String and passing that is not recommended, because the encoding is specified inside the XML file itself, so the conversion to String must be done by the XML parser internally). Most XML parsers also support reading directly from any InputStream, so you could pass yours straight in, which will probably reduce your memory consumption further. Disk space is only occupied when you write the data back by reversing the described procedure, and since you directly replace the source file by overwriting it, no extra disk space is wasted.
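For illustration, here is a minimal sketch of that round trip using the JDK's DOM parser, with "data.xml.gz" standing in for zippedFiles.get(i): parse straight from the decompressing stream, modify the document in memory, then write it back through a GZIPOutputStream.
import java.io.*;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

public class GzippedXmlEdit {
    public static void main(String[] args) throws Exception {
        File file = new File("data.xml.gz"); // stands in for zippedFiles.get(i)

        // Parse straight from the decompressing stream; nothing touches disk
        // and the parser resolves the encoding declared in the XML itself.
        Document doc;
        try (InputStream in = new GZIPInputStream(new FileInputStream(file))) {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
        }

        // ... modify the DOM here, e.g. doc.getDocumentElement().setAttribute(...) ...

        // Write the modified document back, re-compressing on the fly.
        try (OutputStream out = new GZIPOutputStream(new FileOutputStream(file))) {
            TransformerFactory.newInstance().newTransformer()
                    .transform(new DOMSource(doc), new StreamResult(out));
        }
    }
}
The whole document lives in memory between the two streams, which is exactly the trade-off described above: no extra disk space is used, but each file must fit in the heap while it is being modified.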

The fact that they're in a list doesn't change much, but no.
Ignoring compression, files are stored linearly on disk. You can append to them cheaply and you can replace bytes in place cheaply, but you can't substitute sequences of different lengths (like replace("Testing Procedure Specification", "TPS")) without rewriting the file after the modified substring.
Gzipping the file complicates things, but the same rule applies: in general, making arbitrary modifications to a file requires rewriting the file.
Your code for reading the files is on the right track, though. You can easily read through gzipped files as streams, without having to decompress the entire file.
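To make that last point concrete, here is a hedged sketch of a pure streaming read: a SAX handler fed directly from the GZIPInputStream ("data.xml.gz" is again a placeholder name) sees each element as it is decompressed, without the whole file ever being materialized.
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class GzippedXmlScan {
    public static void main(String[] args) throws Exception {
        // "data.xml.gz" is a placeholder; the stream is decompressed on the fly,
        // so only the parser's small working buffers ever sit in memory.
        try (InputStream in = new GZIPInputStream(new FileInputStream("data.xml.gz"))) {
            SAXParserFactory.newInstance().newSAXParser().parse(in, new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                          String qName, Attributes attributes) {
                    System.out.println("element: " + qName); // react to each element as it streams past
                }
            });
        }
    }
}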

Related

Using POI or Tika to extract text, stream-to-stream without loading the entire file in memory

I'm trying to use either Apache POI and PDFBox by themselves, or within the context of Apache Tika, to extract and process plain text from MASSIVE Microsoft Office and PDF files (i.e. hundreds of megs in some cases). Also, my application is multi-threaded, so I will be parsing many of these large files concurrently.
At that scale, I MUST work with the files in a streaming manner. It's not an option to hold an entire file in main memory at any step along the way.
I have seen many source code examples for loading files into Tika / POI / PDFBox via input streams. I have seen many examples for extracting plain text via output streams. However, I've performed some basic memory profiling experiments... and I haven't yet found a way with any of these libraries (Tika, POI, or PDFBox) to avoid loading an entire document into main memory.
In between reading from a stream and writing to a stream, there is obviously a conversion step in the middle... which I have not yet found a way to perform on a streaming basis. Am I missing something, or is this a known issue with extracting text from MS Office or PDF files using Tika / POI / PDFBox? Can I have true end-to-end streaming, without a file being fully loaded into main memory at any point along the way?
The first thing to make sure, if you care about the memory footprint, is that you're using a TikaInputStream backed by a File, e.g. change from something like
InputStream input = new FileInputStream("foo.xls");
To something like
InputStream input = TikaInputStream.get(new File("foo.xls"));
If you really only have an InputStream, not a file, and you want the lower memory option if possible, force Tika to buffer it to a temp file with something like
InputStream origInput = getAnInputStream();
TikaInputStream input = TikaInputStream.get(origInput);
input.getFile();
Many, but not all, parsers will be able to take advantage of the backing File and read only the bits they need into memory, rather than buffering the whole thing, which will help.
Next up, make sure your ContentHandler doesn't buffer the whole contents into memory before outputting. Anything that does XPath lookups on the resulting document is probably out, as is anything with an internal StringBuffer or similar. Pick a simpler one, and make sure you're set up to write the resulting HTML / text SAX events somewhere as they come in.
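As a rough illustration of combining both points (a File-backed TikaInputStream plus a non-buffering handler), with hypothetical file names: a BodyContentHandler wrapped around a Writer forwards the extracted text as the SAX events arrive instead of collecting it in an internal buffer.
import java.io.File;
import java.io.FileWriter;
import java.io.Writer;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class StreamingExtract {
    public static void main(String[] args) throws Exception {
        // Hypothetical file names; the point is that extracted text goes to the
        // Writer as the SAX events arrive instead of piling up in memory.
        try (TikaInputStream input = TikaInputStream.get(new File("huge.xls"));
             Writer out = new FileWriter("huge.txt")) {
            BodyContentHandler handler = new BodyContentHandler(out); // writes through, no internal buffer
            new AutoDetectParser().parse(input, handler, new Metadata(), new ParseContext());
        }
    }
}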
Finally, not all of the Tika parsers support streaming processing. Some only work by parsing the whole file's structure, then wandering through that finding the interesting bits to output. With those, using a File backed TikaInputStream will probably help, but won't stop a fair bit of memory being used.
IIRC, the low memory parsers include:
.xls
.xlsx
All ODF-based formats
XML
Some of the common document parsers which load + parse most/all of the file before being able to output anything include:
.doc / .docx / .ppt / .pptx
.pdf
Images
Videos

Skipping a number of bytes in a ZipInputStream

So at the end of a ZIP file, in roughly the last 64K, there is a Central Directory from which you can see what the ZIP file itself contains.
Now I have loaded my ZIP file into a ZipInputStream, and before that I declared a long variable that is the length of the ZIP file minus 64K.
So I want to skip as many bytes as the long variable states, and only start reading after that. But I don't really understand how the .skip() method works for ZipInputStream.
After using it, the .getNextEntry() method will still start from the beginning, and .read(byte[64 * 1024]) tells me it's the end of the stream, which it should not be.
So what is this skip() method actually doing and how can I get my Central Directory?
As far as I can see, you are mixing two things here.
Either read your data as a plain InputStream, seek to the position you want, and start reading and parsing the raw data yourself.
Or use the ZipInputStream API and iterate through the ZipEntries. ZipInputStream is an abstraction on top of the raw stream that handles reading the entries and decompressing the compressed bytes transparently, so you cannot get access to the raw central directory through it.
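A quick sketch of that second approach, with "archive.zip" as a placeholder name, letting ZipInputStream walk the entries rather than hunting for the central directory yourself:
import java.io.FileInputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class ListEntries {
    public static void main(String[] args) throws Exception {
        try (ZipInputStream zis = new ZipInputStream(new FileInputStream("archive.zip"))) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                System.out.println(entry.getName());
                zis.closeEntry(); // skip (or read) the entry's data, then move on
            }
        }
    }
}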
For more information, see also
How does ZipInputStream.getNextEntry() work?
How to read content of the Zipped file without extracting in java

Java: Replace part of file without writing the entire file again

Is it possible to replace part of a file's content without rewriting the entire file to disk?
Say that I have a very large file of several gigabytes; how do I replace the bytes from, let's say, position 100 to 200 without rewriting the entire file?
As an added bonus, I need a solution that does not use any features newer than Java 1.4.
If you're positive that you're going to be writing exactly the same number of bytes, you can use a RandomAccessFile to accomplish this (available since Java 1.0). Just open the file, seek to wherever you need to be, and overwrite those bytes with whatever your new data is.
RandomAccessFile f = new RandomAccessFile(new File("C:\\test\\huge.txt"), "rw");
f.seek(100); // seek ahead to byte offset 100
f.write("here is some new stuff".getBytes()); // overwrite in place
f.close();
You can also read from the file at arbitrary points in the same fashion, in case you don't know exactly how much data you need to replace (e.g. so you can pad/truncate whatever you're writing to avoid doing something awful by accident).
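As a rough sketch of that pad/truncate idea (same hypothetical path as above, and plain try/finally so it stays Java 1.4-friendly): read the 100 bytes you are about to replace, then write back exactly 100 bytes.
import java.io.RandomAccessFile;
import java.util.Arrays;

public class PeekBeforeWrite {
    public static void main(String[] args) throws Exception {
        RandomAccessFile f = new RandomAccessFile("C:\\test\\huge.txt", "rw");
        try {
            byte[] current = new byte[100];
            f.seek(100);
            f.readFully(current);                // bytes 100..199 as they are now
            System.out.println(new String(current));

            byte[] replacement = new byte[100];  // exactly the length being replaced
            Arrays.fill(replacement, (byte) ' ');
            byte[] newData = "here is some new stuff".getBytes();
            System.arraycopy(newData, 0, replacement, 0, newData.length);

            f.seek(100);
            f.write(replacement);                // overwrite bytes 100..199 in place
        } finally {
            f.close();
        }
    }
}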

Java: Reading a file containing both text and binary data

I'm having a problem with a new file format I'm being asked to implement at work.
Basically, the file is a text file which contains a bunch of headers with information about the data in UTF-8, and then the rest of the file is the numerical data in binary. I can write the data and read it back just fine, and I recently added the code to write the headers.
The problem is that I don't know how to read a file that contains both text and binary data. I want to be able to read in and deal with the header information (which is fairly extensive) and then be able to continue reading the binary data without having to re-iterate through the headers. Is this possible?
I am currently using a FileInputStream to read the binary data, but I don't know how to start it at the beginning of the data, rather than the beginning of the whole file. One of FileInputStream's constructors takes a FileDescriptor as the argument, and I think that's my answer, but I don't know how to get one from another file-reading class. Am I approaching this correctly?
You can reposition a FileInputStream to any arbitrary point by getting its channel via getChannel() and calling position() on that channel.
The one caveat is that this position affects all consumers of the stream. It is not suitable if you have different threads (for example) reading from different parts of the same file. In that case, create a separate FileInputStream for each consumer.
Also, this technique only works for file streams, because the underlying file can be randomly accessed. There is no equivalent for sockets, or named pipes, or anything else that is actually a stream.
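A minimal sketch of that technique, assuming a hypothetical "mixed.dat" whose UTF-8 headers you parse first and whose binary section starts at a byte offset you determine along the way:
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.nio.channels.FileChannel;

public class MixedFileReader {
    public static void main(String[] args) throws Exception {
        try (FileInputStream in = new FileInputStream("mixed.dat")) {
            long headerLength = 4096; // placeholder: byte offset where the headers end

            // ... parse the headers from `in` here, determining headerLength ...

            // Reposition the same stream to the start of the binary section.
            FileChannel channel = in.getChannel();
            channel.position(headerLength);

            // Everything read from here on comes from the binary part of the file.
            DataInputStream data = new DataInputStream(in);
            double firstValue = data.readDouble();
            System.out.println(firstValue);
        }
    }
}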

Reading a gz file and keeping track of position in file

So, here is the situation:
I have to read big .gz archives (GBs) and kind of "index" them to later on be able to retrieve specific pieces using random access.
In other words, I wish to read the archive line by line, and be able to get the specific location in the file for any such line. (so that I can jump directly to these specific locations upon request). (PS: ...and it's UTF-8 so we cannot assume 1 byte == 1 char.)
So, basically, what I just need is a BufferedReader which keeps track of its location in the file. However, this doesn't seem to exist.
Is there anything available or do I have to roll my own?
A few additional comments:
I cannot use BufferedReader directly since the file position it could report corresponds to what has been buffered so far, in other words a multiple of the internal buffer size rather than the position of the current line.
I cannot use InputStreamReader directly for performance reasons. Unbuffered reading would be way too slow, and it also lacks convenience methods to read lines.
I cannot use RandomAccessFile since 1. it's zipped, and 2. RandomAccessFile uses "modified" UTF-8
I guess the best would be a kind of buffered reader that keeps track of the file location and buffer offset... but this sounds quite cumbersome. Maybe I missed something; perhaps there is already something that reads files line by line and keeps track of the location (even when zipped).
Thanks for tips,
Arnaud
I think jzran could be pretty much what you're looking for:
It's a Java library based on the zran.c sample from zlib. You can preprocess a large gzip archive, producing an "index" that can be used for random read access. You can balance between index size and access speed.
What you are looking for is called mark(), markSupported() and skip().
These methods are declared in both InputStream and Reader, so you are welcome to use them.
GZIP compression does not support seeking. Previous data blocks are needed to build compression tables...
