Skipping a number of bytes in a ZipInputStream - java

At the end of a ZIP file, roughly within the last 64K, there is a Central Directory from which you can see what the ZIP file itself contains.
Now I have loaded my ZIP file into a ZipInputStream, and before that I declared a long variable that is the length of the ZIP file minus 64K.
So I want to skip as many bytes as the long variable states, and only start reading after that. But I don't really understand how the skip() method works for ZipInputStream.
After using it, the getNextEntry() method still starts from the beginning, and read(new byte[64 * 1024]) tells me it's the end of the stream, which it should not be.
So what is the skip() method actually doing, and how can I get my Central Directory?

As far as I can see, you are mixing two things here.
Either read your data as a plain InputStream, seek to the position you want, and start reading and parsing the raw data yourself.
Or use the ZipInputStream API and iterate through the ZipEntries. The ZipInputStream is an abstraction on top of the raw stream that handles reading the central directory and decompressing the compressed bytes transparently. So you cannot get access to the raw directory through the ZipInputStream.
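To illustrate the second approach, here is a minimal, self-contained sketch (the entry names and contents are invented for the example) that builds a small ZIP in memory and then iterates it with ZipInputStream:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipListing {

    /** Lists entry names by iterating the stream; no raw central-directory access needed. */
    public static List<String> listEntries(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zis = new ZipInputStream(in)) {
            ZipEntry entry;
            // getNextEntry() positions the stream at each entry's uncompressed data.
            while ((entry = zis.getNextEntry()) != null) {
                names.add(entry.getName());
                zis.closeEntry();
            }
        }
        return names;
    }

    /** Builds a small two-entry ZIP in memory so the example is self-contained. */
    public static byte[] sampleZip() throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(buf)) {
            zos.putNextEntry(new ZipEntry("a.txt"));
            zos.write("hello".getBytes("UTF-8"));
            zos.closeEntry();
            zos.putNextEntry(new ZipEntry("b.txt"));
            zos.write("world".getBytes("UTF-8"));
            zos.closeEntry();
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(listEntries(new ByteArrayInputStream(sampleZip())));
    }
}
```

Note there is no need to skip() to the central directory yourself; the stream hands you each entry in turn.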
For more information, see also
How does ZipInputStream.getNextEntry() work?
How to read content of the Zipped file without extracting in java

Related

Java - possible to modify and parse gzipped xml files without unzipping?

I have an arraylist of gzipped xml files. Is it possible to view and manipulate the contents of these xml files all without unzipping them and taking up disk space? If so, what would be the correct class(es) to use for this task?
I know I can create a GZIPInputStream from a FileInputStream of the zipped file, but from there I'm not sure what to do. I have only this written:
GZIPInputStream in = new GZIPInputStream(new FileInputStream(zippedFiles.get(i)));
I need some way to parse text within the xml files and modify the xml itself but again, extracting all of them would take up too much disk space.
What exactly are you trying to achieve? You can extract the file into memory using a ByteArrayOutputStream and convert it into a byte array that you forward to your XML parser library. (Converting it to a String and passing that is not recommended, as the encoding is specified inside the XML file itself, so the conversion to String should be done by the XML parser internally.) Most XML parsers also support reading directly from any InputStream, so you could pass yours to the parser directly, which will probably further reduce your memory consumption. Disk space is only occupied when writing the data back, by simply reversing the described procedure. And since you directly replace the source file by overwriting it, no extra disk space is wasted.
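As a sketch of the pass-the-InputStream approach: the snippet below gzips a tiny XML document in memory (the document and element names are invented for the example) and hands the GZIPInputStream straight to a DOM parser, so nothing is ever written to disk:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class GzXmlRead {

    /** Parses XML straight from a gzipped byte array; nothing touches the disk. */
    public static String rootName(byte[] gzippedXml) throws Exception {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gzippedXml))) {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(in); // the parser reads the decompressed stream directly
            return doc.getDocumentElement().getTagName();
        }
    }

    /** Gzips a small XML document in memory so the example is self-contained. */
    public static byte[] sample() throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write("<doc><item>42</item></doc>".getBytes("UTF-8"));
        }
        return buf.toByteArray();
    }
}
```

The same stream-wrapping works with a FileInputStream over a real .gz file; the in-memory buffer here just keeps the example runnable on its own.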
The fact that they're in a list doesn't change much, but no.
Ignoring compression, files are stored linearly on disks. You can append to them cheaply, you can replace bytes cheaply, but you can't replace sequences of different lengths (like replace("Testing Procedure Specification", "TPS")) without rewriting the file after the modified substring.
Gziping the file complicates things, but the same rule applies. In general, making arbitrary modifications to a file requires rewriting the file.
Your code for reading the files is on the right track, though. You can easily read through gzipped files as streams, without having to decompress the entire file first.

Java: Reading a file containing both text and binary data

I'm having a problem with a new file format I'm being asked to implement at work.
Basically, the file is a text file which contains a bunch of headers with information about the data in UTF-8, and then the rest of the file is the numerical data in binary. I can write the data and read it back just fine, and I recently added the code to write the headers.
The problem is that I don't know how to read a file that contains both text and binary data. I want to be able to read in and deal with the header information (which is fairly extensive) and then be able to continue reading the binary data without having to re-iterate through the headers. Is this possible?
I am currently using a FileInputStream to read the binary data, but I don't know how to start it at the beginning of the data rather than the beginning of the whole file. One of FileInputStream's constructors takes a FileDescriptor as the argument, and I think that's my answer, but I don't know how to get one from another file-reading class. Am I approaching this correctly?
You can reposition a FileInputStream to any arbitrary point by getting its channel via getChannel() and calling position() on that channel.
The one caveat is that this position affects all consumers of the stream. It is not suitable if you have different threads (for example) reading from different parts of the same file. In that case, create a separate FileInputStream for each consumer.
Also, this technique only works for file streams, because the underlying file can be randomly accessed. There is no equivalent for sockets, or named pipes, or anything else that is actually a stream.
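A minimal sketch of that technique (the file contents and offset are placeholders for the example): reposition the channel past the header, then keep reading from the same FileInputStream:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.Path;

public class SeekPastHeader {

    /** Reads the byte at the given offset by repositioning the stream's channel. */
    public static int readByteAt(Path file, long offset) throws IOException {
        try (FileInputStream in = new FileInputStream(file.toFile())) {
            // Repositioning the channel also moves the stream's read position,
            // so the next read() on the stream starts at `offset`.
            in.getChannel().position(offset);
            return in.read();
        }
    }
}
```

In the header-plus-binary scenario, you would first parse the header, note where it ends (e.g. by counting the header bytes yourself), and then position() the channel there before handing the stream to the binary reader.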

How to get individual file contents from ZipArchiveInputStream using Apache commons compress library

I have org.apache.commons.compress.archivers.zip.ZipArchiveInputStream Object,
from which I can get each ArchiveEntry item and get the individual file's metadata.
But I need to know the way to get each file's contents as Blob.
Using org.apache.commons.compress.archivers.zip.ZipFile it can be done,
but ZipFile has constructors with physical or temporary File object or a file location.
I dont want to create a temporary File for this.
Even if there is a way to convert ZipArchiveInputStream to ZipFile, it would solve
the problem indirectly.
In short, my requirement is to read a Zip file from an InputStream/Blob and store individual Files as BLOB in Database.
FYI: I am using org.apache.commons :: commons-compress :: 1.4.1
Any solutions/ideas/suggestions are highly appreciated.
Cheers
Kum
Perhaps you could use the standard java.util.zip.ZipInputStream - it has a constructor that takes an InputStream. You can use getNextEntry() / closeEntry() to iterate through the entries and read() to obtain the decompressed data.
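A sketch along those lines, assuming the goal is one byte array per entry, ready to be stored as a BLOB (the class and method names here are made up for the example):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class ZipToBlobs {

    /** Reads every entry's decompressed contents into a byte array, keyed by entry name. */
    public static Map<String, byte[]> extract(InputStream source) throws IOException {
        Map<String, byte[]> blobs = new LinkedHashMap<>();
        try (ZipInputStream zis = new ZipInputStream(source)) {
            ZipEntry entry;
            byte[] chunk = new byte[8192];
            while ((entry = zis.getNextEntry()) != null) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                int n;
                while ((n = zis.read(chunk)) != -1) { // read() yields decompressed bytes
                    out.write(chunk, 0, n);
                }
                blobs.put(entry.getName(), out.toByteArray());
                zis.closeEntry();
            }
        }
        return blobs;
    }
}
```

Each byte array can then be passed to JDBC via PreparedStatement.setBytes() (or wrapped for a BLOB column), with no temporary File involved.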

Java 6 File Input Output Stream (same file)

I searched and looked at multiple questions like this, but my question is really different than anything I found. I've looked at Java Docs.
How do I get the equivalent of this c file open:
stream1 = fopen (out_file, "r+b");
Once I've done a partial read from the file, the first write makes the next read return EOF no matter how many bytes were in the file.
Essentially I want a file I/O stream that doesn't do that. The whole purpose of what I'm trying to do is to replace the bytes in an existing file in the current file. I don't want to do it in a copy or make a copy before I do the Read->Write.
You can use a RandomAccessFile.
As Perception mentions, you can use a RandomAccessFile. Also, in some situations, a FileChannel may work better. I've used these to handle binary file data with great success.
EDIT: you can get a FileChannel from the RandomAccessFile object using getChannel().
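A small sketch of the RandomAccessFile approach (the file contents and offset are placeholders): opening in "rw" mode is the closest analogue of C's fopen(path, "r+b") - it reads and writes an existing file in place, without truncating it:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Path;

public class InPlacePatch {

    /** Overwrites bytes at the given offset without truncating or copying the file. */
    public static void patch(Path file, long offset, byte[] replacement) throws IOException {
        // "rw" opens the existing file for both reading and writing;
        // bytes outside the patched range are left untouched.
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw")) {
            raf.seek(offset);
            raf.write(replacement);
        }
    }
}
```

Because reads and writes go through one position that you control with seek(), the read-then-write-then-read sequence from the question does not hit the spurious EOF.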

Reading a gz file and keeping track of position in file

So, here is the situation:
I have to read big .gz archives (GBs) and kind of "index" them to later on be able to retrieve specific pieces using random access.
In other words, I wish to read the archive line by line, and be able to get the specific location in the file for any such line. (so that I can jump directly to these specific locations upon request). (PS: ...and it's UTF-8 so we cannot assume 1 byte == 1 char.)
So, basically, what I just need is a BufferedReader which keeps track of its location in the file. However, this doesn't seem to exist.
Is there anything available or do I have to roll my own?
A few additional comments:
I cannot use BufferedReader directly since the file location corresponds to what has been buffered so far. In other words, a multiple of the internal buffer size instead of the line location.
I cannot use InputStreamReader directly for performance reasons. Unbuffered it would be way too slow, and, by the way, it lacks convenience methods to read lines.
I cannot use RandomAccessFile since 1. it's zipped, and 2. RandomAccessFile uses "modified" UTF-8
I guess the best would be a kind of buffered reader keeping track of file location and buffer offset... but this sounds quite cumbersome. But maybe I missed something. Perhaps there is already something existing to do that, i.e. read files line by line and keep track of the location (even if zipped).
Thanks for tips,
Arnaud
I think jzran could be pretty much what you're looking for:
It's a Java library based on the zran.c sample from zlib. You can preprocess a large gzip archive, producing an "index" that can be used for random read access. You can balance between index size and access speed.
What you are looking for is called mark(), markSupported() and skip(). These methods are declared in both InputStream and Reader, so you are welcome to use them.
GZIP compression does not support seeking. Previous data blocks are needed to build compression tables...
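One way to sketch the "reader that keeps track of its location" idea from the question is a counting wrapper (this class is hypothetical, not part of the JDK; Apache Commons IO ships a similar CountingInputStream). Placed between the FileInputStream and the GZIPInputStream it reports compressed offsets; placed between the GZIPInputStream and the Reader it reports uncompressed offsets - but note that any buffering above it will have read ahead of the line you are currently parsing:

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Counts the bytes consumed from the wrapped stream (skip/mark not tracked in this sketch). */
public class CountingInputStream extends FilterInputStream {
    private long count;

    public CountingInputStream(InputStream in) {
        super(in);
    }

    /** Number of bytes read from the underlying stream so far. */
    public long getCount() {
        return count;
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) count++;
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n != -1) count += n;
        return n;
    }
}
```

To index line starts you would record getCount() before reading each line, then (as the jzran answer explains) pair those offsets with a gzip index, since the raw .gz file itself cannot be seeked into directly.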
