Uncompressing a ZIP file in memory in Java [closed]

It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center.
Closed 10 years ago.
I'm downloading zipped files containing XMLs, and I'd like to avoid writing the zip files to disk before manipulating them because of latency requirements. However, java.util.zip doesn't suffice for me. There's no way to say "here's a byte array of a zip file, use it" without turning it into a stream, and ZipInputStream is not reliable, since it scans for entry headers (see discussion below EDIT for reasons why that is not reliable).
I do not yet have access to the zip files I'll be handling, so I don't know whether I'll be able to handle them through the ZipInputStream, and I need to find a solution that will work with any valid ZIP files, as the penalty for a failure once I go into production will be high.
Assuming ZipInputStream won't work, what can I do to solve this problem in cases where there are no entry headers? I'm using Wikipedia's definition, which includes a comment on how to correctly uncompress zip files (quoted below), as the standard.
EDIT
The Apache Commons Compress library has a good write-up on some of the problems that stream-based reading (both their implementation and Java's) has. I'll further add, from Wikipedia and personal experience, that the size and CRC fields in entry headers may not be filled in (I have files with -1 in these fields). Thanks to centic for providing this link.
Also, let me quote the wikipedia on the subject:
Tools that correctly read zip archives must scan for the signatures of
the various fields, the zip central directory. They must not scan for
entries because only the directory specifies where a file chunk
starts. Scanning could lead to false positives, as the format doesn't
forbid other data to be between chunks, or uncompressed stream
containing such signatures.
Note that ZipInputStream scans for entries, not the central directory, which is the problem with it.
Final Edit
If anyone is interested, this script can be used to produce a valid ZIP file that cannot be read by ZipInputStream from an existing ZIP file. So, as a final edit to this closed question, I needed a library that can read files such as the ones produced by this script.

EDIT: Another suggestion...
Looking at ZipFile from the Apache Commons implementation, it looks like it wouldn't be too hard to effectively fork that for your project. Create a wrapper around your byte array which has all the pieces of the RandomAccessFile API which are required (I don't think there are very many). You've already indicated that you prefer the interface to ZipFile, so why not go with that?
We don't know enough about your project to know whether this opens up any legal questions - and even if you gave details, I doubt that anyone here would be able to give good legal advice - but I suspect it wouldn't take more than an hour or two to get this solution up and working, and I suspect you'd have reasonable confidence in it.
EDIT: This may be a slightly more productive answer...
If you're worried about the entries not being contiguous, but don't want to handle all the compression side yourself, you might consider an option where you effectively rewrite the data. Create a new ByteArrayOutputStream, and read the central directory at the end. For each entry in the central directory, write out an entry (header + data) to the output stream in a format that you believe ZipInputStream will be happy with. Then write a new central directory - if you want your replacement to be valid you may need to do this from scratch, but if you're using code which you know won't actually read the central directory, you could just provide the original one, ignoring the fact that it might not then be valid. So long as it starts with the right signature, that's probably good enough :)
Once you've done that, convert the ByteArrayOutputStream into a new byte[], wrap it in a ByteArrayInputStream and then pass that to ZipInputStream or ZipArchiveInputStream.
Depending on your purposes, you may not even need to do that much - you may be able to just extract each file as you go by creating a "mini" zip file with just the one entry you're reading from the directory at a time.
This does involve understanding the zip file format, but not completely - just the skeleton, effectively. It's not a quick and easy fix like using an existing API completely, but it shouldn't take very long. It doesn't guarantee it'll be able to read all invalid files (how could it?) but it will protect you against the "data between entries" issue you seem to be particularly concerned about. Hope it's at least a useful idea...
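To give a feel for how small the "skeleton" is, here is a minimal sketch that finds the end-of-central-directory record in a byte array and lists each entry's name and local header offset, using only the central directory. It assumes a single-disk archive without ZIP64 extensions and with UTF-8 entry names; the class and method names (CentralDirectorySketch, localHeaderOffsets, sampleZip) are made up for illustration and are not part of any library.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

class CentralDirectorySketch {
    private static final int EOCD_SIGNATURE = 0x06054b50; // end of central directory record
    private static final int CEN_SIGNATURE = 0x02014b50;  // central directory file header
    private static final int EOCD_MIN_SIZE = 22;          // record size without a comment
    private static final int CEN_FIXED_SIZE = 46;         // fixed part of a directory header

    /** Maps entry name to local header offset, reading the central directory only. */
    static Map<String, Long> localHeaderOffsets(byte[] zip) throws IOException {
        ByteBuffer buf = ByteBuffer.wrap(zip).order(ByteOrder.LITTLE_ENDIAN);

        // The record sits at the end, followed by a comment of at most 65535 bytes,
        // so search backwards from the last position where it could start.
        int eocd = -1;
        int lowest = Math.max(0, zip.length - EOCD_MIN_SIZE - 0xFFFF);
        for (int i = zip.length - EOCD_MIN_SIZE; i >= lowest; i--) {
            if (buf.getInt(i) == EOCD_SIGNATURE) {
                eocd = i;
                break;
            }
        }
        if (eocd < 0) {
            throw new IOException("End of central directory record not found");
        }
        int entryCount = buf.getShort(eocd + 10) & 0xFFFF;          // total entries
        long directoryOffset = buf.getInt(eocd + 16) & 0xFFFFFFFFL; // start of directory

        Map<String, Long> offsets = new LinkedHashMap<>();
        int pos = (int) directoryOffset;
        for (int n = 0; n < entryCount; n++) {
            if (pos < 0 || pos + CEN_FIXED_SIZE > zip.length || buf.getInt(pos) != CEN_SIGNATURE) {
                throw new IOException("Bad central directory header at offset " + pos);
            }
            int nameLength = buf.getShort(pos + 28) & 0xFFFF;
            int extraLength = buf.getShort(pos + 30) & 0xFFFF;
            int commentLength = buf.getShort(pos + 32) & 0xFFFF;
            long localHeaderOffset = buf.getInt(pos + 42) & 0xFFFFFFFFL;
            // Assumption: names are UTF-8 (older archives may use CP437).
            String name = new String(zip, pos + CEN_FIXED_SIZE, nameLength, StandardCharsets.UTF_8);
            offsets.put(name, localHeaderOffset);
            pos += CEN_FIXED_SIZE + nameLength + extraLength + commentLength;
        }
        return offsets;
    }

    /** Builds a small two-entry archive in memory, purely for the demo. */
    static byte[] sampleZip() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("a.xml"));
            zip.write("<a/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("dir/b.xml"));
            zip.write("<b/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        localHeaderOffsets(sampleZip())
                .forEach((name, offset) -> System.out.println(offset + "\t" + name));
    }
}
```

With the offsets in hand, each entry's local header and data can be copied out individually, which is the starting point for either the "rewrite" or the "mini zip" idea above.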
there's no way to say "here's a byte array of a zip file, use it"
Yes there is:
byte[] data = ...;
ByteArrayInputStream byteStream = new ByteArrayInputStream(data);
ZipInputStream zipStream = new ZipInputStream(byteStream);
That leaves the issue of whether ZipInputStream can handle all the zip files you'll give it - but I wouldn't write it off quite so quickly.
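As a slightly fuller sketch of the three lines above, the following reads every entry of an in-memory archive into a map of name to contents without touching the disk. It uses only java.util.zip; the class and method names (InMemoryUnzip, unzip, sampleZip) are hypothetical, and the usual caveat applies that ZipInputStream walks the entry headers rather than the central directory.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class InMemoryUnzip {
    /** Reads all file entries of a ZIP held in a byte array into memory. */
    static Map<String, byte[]> unzip(byte[] data) throws IOException {
        Map<String, byte[]> contents = new LinkedHashMap<>();
        byte[] buffer = new byte[8192];
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(data))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue; // directory entries carry no data
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                int n;
                // read() returns -1 at the end of the current entry, not of the archive
                while ((n = zip.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
                contents.put(entry.getName(), out.toByteArray());
            }
        }
        return contents;
    }

    /** Builds a small archive in memory, standing in for the downloaded bytes. */
    static byte[] sampleZip() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("first.xml"));
            zip.write("<first/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("second.xml"));
            zip.write("<second/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        unzip(sampleZip()).forEach((name, bytes) ->
                System.out.println(name + ": " + new String(bytes, StandardCharsets.UTF_8)));
    }
}
```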
Of course, there are other APIs available. You may want to look at Apache Commons Compress, for example. Even though ZipFile requires a file, ZipArchiveInputStream doesn't, so again you could use a ByteArrayInputStream. EDIT: It looks like ZipArchiveInputStream doesn't read from the central directory either. I was hoping it would use markSupported to check beforehand, but it appears not to...
EDIT: In the comments on the question, I asked where you'd read that the zip file doesn't have to contain entry data. You quoted wikipedia:
"Tools that correctly read zip archives must scan for the signatures of the various fields, the zip central directory. They must not scan for entries because only the directory specifies where a file chunk starts. Scanning could lead to false positives, as the format doesn't forbid other data to be between chunks, or uncompressed stream containing such signatures."
That's not the same as entry data being optional. It's saying that there may be extra data in awkward places, not that the entries may be missing completely. It's basically saying that the entries shouldn't be assumed to be contiguous. I could happily concede that ZipInputStream may not be reading the central directory at the end of the file, but finding code which does that isn't the same as finding code which copes with entry data not existing.
You then write:
I might further add that whether the zip is valid or not is not my concern. Working with it is.
... which suggests you want code which will handle invalid zip files. Combined with this:
I do not yet have access to the zip files I'll be handling, so I don't know whether I'll be able to handle them through the stream
That means you're asking for code which should handle zip files which are invalid in ways you can't even predict. Just how invalid would it have to be for you to be able to reject it? If I give you 1000 random bytes, with no attempt for them to be a zip file at all, what on earth would you do with it?
Basically, you need to pin the problem down more tightly before it's feasible to even say whether a particular library is a valid solution. It's reasonable to collect a set of zip files from various places, which may be invalid in well-understood ways, and say "I must be able to support all of these." Later you may need to do some work if it turns out that wasn't good enough. But to be able to support anything, however broken, simply isn't a valid requirement.

The TrueZIP library provides an alternative, mature ZIP implementation.
It also features a file system abstraction, even for HTTP.
For example:
Path path = new TPath(new URI("http://acme.com/download/everything.zip/entry.xml"));
try (InputStream in = Files.newInputStream(path)) {
    // Read archive entry contents here.
    ...
}
So, if you are interested only in specific entries, it would download only those entries, saving bandwidth and time.
You would also not have to write any downloading code.
See also http://truezip.java.net/faq.html#http.

I would use the Apache Commons Compress library; see http://commons.apache.org/compress/
It supports reading ZIP files via streams, and there is in-depth documentation at http://commons.apache.org/compress/zip.html. That page also states some limitations which are inherent in the ZIP format.
Sample code looks as follows:
ZipArchiveInputStream zip = new ZipArchiveInputStream(inputStream);
try {
    ZipArchiveEntry entry = zip.getNextZipEntry();
    while (entry != null) {
        assertEquals("README", entry.getName());
        ...
        entry = zip.getNextZipEntry();
    }
} finally {
    zip.close();
}

This question sounds similar to How to create a directory in memory? pseudo file system / virtual directory. Basically, my suggestion is to use a more general solution: an in-memory virtual filesystem (and I don't mean on the OS level, like Linux's ramfs/tmpfs).
One example is to use the Java 7 NIO APIs, which now provide an SPI for implementing a file system via FileSystemProvider. It seems that the ShrinkWrap filesystem implements this SPI.
A more accessible option would be to use Apache Commons VFS' ram filesystem: it requires only Java 5. If you need to be compatible with Java 5 and 6, this is probably your best bet.
I first remember reading about in-memory filesystems in Java from this article, which apart from pointing out solutions like Commons VFS and JBoss Microcontainer, gives a nice example use case for the NetBeans IDE.
While an in-memory virtual filesystem is a nice general solution for avoiding the OS-level filesystem (with the associated performance benefits), it probably suffers from other disadvantages, which more specialized solutions could address. For instance, I am not sure how this filesystem would behave when used concurrently from multiple threads. It might work fine as long as you don't access the same files, or you might need to create separate filesystems (which might be prohibitive in terms of resource usage).

Related

Use the Checkstyle API without providing a java.io.File

Is there a way to use the Checkstyle API without providing a java.io.File?
Our app already has the file contents in memory (these aren't read from a local file, but from another source), so it seems inefficient to me to have to create a temporary file and write the in-memory contents to it just to throw it away.
I've looked into using in-memory file systems to circumvent this, but it seems java.io.File is always bound to the actual file system. Obviously I have no way of testing whether or not performance would be better; I just wanted to ask if Checkstyle supports such a use case.

There is no clean way to do this. I recommend creating an issue at Checkstyle expanding more on your process and asking for a way to integrate it with Checkstyle.
Files are needed for our support of caching, as we skip over reading and processing a file if it is in the cache and it has not changed since the last run. The cache process is intertwined which is why no non-file route exists. Even without a file, Checkstyle processes the contents of files through FileText, which again needs a File for just a file name reference and lines of the file in a List.

java.io.File vs java.nio.Files which is the preferred in new code?

While writing answers around SO, a user tried pointing out that java.io.File should not be used in new code; he argues that the newer java.nio.Files class should be used instead, and he linked to this article.
I have been developing in Java for several years and had not heard this argument before. Since reading his post I have been searching, and have not found many other sources that confirm it. Personally, I feel that many of the points argued in the article are weak, and that if you know how to read them, errors thrown by the File class will generally tell you exactly what the issue is.
As I am continually developing new code my question is this:
Is this an active argument in the Java community? Is Files preferred over File for new code? What are the major advantages / disadvantages between the two?

The documentation that you linked gives the answer:
The java.nio.file package defines interfaces and classes for the Java
virtual machine to access files, file attributes, and file systems.
This API may be used to overcome many of the limitations of the
java.io.File class. The toPath method may be used to obtain a Path
that uses the abstract path represented by a File object to locate a
file. The resulting Path may be used with the Files class to provide
more efficient and extensive access to additional file operations,
file attributes, and I/O exceptions to help diagnose errors when an
operation on a file fails.

File has a newer counterpart: Path, created with the factory method Paths.get("..."). Files also has many useful utility methods with better implementations (for example, move instead of the sometimes-failing File.renameTo).
A Path maintains its file system. Hence you can copy out of a zip file system ("jar:file:..... .zip") some path to another file system and vice versa.
File.toPath() may help an incremental transition.
The utilities alone in Files make a move to the newer classes profitable.
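A minimal sketch of the zip file system point, assuming the zip file system provider that ships with the JDK since Java 7 is available: it opens an archive via a "jar:" URI, copies one entry out to the default file system with Files.copy, and then renames it with Files.move. The class and helper names (ZipFileSystemCopy, extract, writeSampleZip) and the entry name are invented for the example.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Collections;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

class ZipFileSystemCopy {
    /** Copies one entry out of a ZIP file into targetDir and returns the new file. */
    static Path extract(Path zipFile, String entryName, Path targetDir) throws IOException {
        URI uri = URI.create("jar:" + zipFile.toUri());
        try (FileSystem zipFs = FileSystems.newFileSystem(uri, Collections.<String, Object>emptyMap())) {
            Path source = zipFs.getPath(entryName);
            // Resolve by name: Path objects from different file systems cannot be mixed.
            Path target = targetDir.resolve(source.getFileName().toString());
            return Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    /** Writes a one-entry archive to dir, purely for the demo. */
    static Path writeSampleZip(Path dir) throws IOException {
        Path zipFile = dir.resolve("sample.zip");
        try (ZipOutputStream zip = new ZipOutputStream(Files.newOutputStream(zipFile))) {
            zip.putNextEntry(new ZipEntry("docs/entry.xml"));
            zip.write("<entry/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return zipFile;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("zipfs-demo");
        Path extracted = extract(writeSampleZip(dir), "/docs/entry.xml", dir);
        // Files.move reports failures as exceptions instead of File.renameTo's boolean.
        Path moved = Files.move(extracted, dir.resolve("renamed.xml"));
        System.out.println(new String(Files.readAllBytes(moved), StandardCharsets.UTF_8));
    }
}
```

Note that this provider is opened from a file on disk here, so it illustrates the Path/Files design rather than an in-memory approach.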

Java: any way to get a ZipFile (or anything with a direct getEntry method) from a byte array?

I have the contents of a zip file in a byte array. The file contains a number of entries (typically about 12), but I only care about three of them.
I would like to somehow get this into a ZipFile object, so I can pull those specific three ZipEntrys out using ZipFile.getEntry. I'm open to using something other than ZipFile that has a similar look-up-by-name method like getEntry.
My initial investigation suggests that I'm out of luck. ZipFile requires a real file in the file subsystem (which I cannot and do not want to access) and so I can't get there from here, and no means other than ZipFile exists that allows extracting particular entries by name; but I wanted to check. In languages like C# and Python, this is pretty straightforward (in C# I go from byte array to MemoryStream to ZipArchive; in Python I just wrap it in StringIO and treat like a file), so I wanted to make sure I'm not missing something obvious.
My Plan B is to use ZipInputStream and repeated calls to getNextEntry to go through all dozen or so entries, and throw away all except the three I care about, but that just smells bad to me.

A ZipInputStream can be instantiated for any InputStream ... including a ByteArrayInputStream.
Apart from that you are out of luck ... if you stick with Java SE classes.
The root of the problem (from an API design perspective) is that ZipFile is a wrapper for functionality that is implemented in native code. The native code opens the input stream for itself, and it uses a native filename / pathname.
The main reason for a native ZIP implementation that works that way is that the JVM needs to load code from ZIP files as part of the bootstrap procedures. This happens before the native implementation has loaded classes such as InputStream. Indeed, it has to.
There are a number of 3rd party libraries. Start by reading this Q&A - What is a good Java library to zip/unzip files?
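For what it's worth, the "Plan B" from the question can be kept fairly tidy with Java SE classes only. The sketch below wraps the byte array in a ZipInputStream, keeps only the entries whose names are wanted, and stops as soon as all of them have been found. The class and method names (ZipEntryLookup, readEntries, sampleZip) are hypothetical, and it is still a sequential scan rather than a true getEntry lookup.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class ZipEntryLookup {
    /** Returns the contents of the wanted entries that exist in the archive. */
    static Map<String, byte[]> readEntries(byte[] zipBytes, Set<String> wanted) throws IOException {
        Map<String, byte[]> found = new HashMap<>();
        byte[] buffer = new byte[8192];
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            // Stop early once every wanted name has been collected.
            while (found.size() < wanted.size() && (entry = zip.getNextEntry()) != null) {
                if (!wanted.contains(entry.getName())) {
                    continue; // the next getNextEntry() call skips this entry's data
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                int n;
                while ((n = zip.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
                found.put(entry.getName(), out.toByteArray());
            }
        }
        return found;
    }

    /** Builds a three-entry archive in memory, purely for the demo. */
    static byte[] sampleZip() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : Arrays.asList("a.xml", "b.xml", "c.xml")) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write(("<" + name.charAt(0) + "/>").getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        Set<String> wanted = new HashSet<>(Arrays.asList("a.xml", "c.xml"));
        readEntries(sampleZip(), wanted).forEach((name, data) ->
                System.out.println(name + " -> " + new String(data, StandardCharsets.UTF_8)));
    }
}
```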

How do I append files to a .tar archive in Java?

I would like to create a tar archive in Java. I have files which are constantly being created and I'd like a worker thread to take a reference to those files from a queue and copy them into the archive.
I tried using the Apache Commons Compress library's TarArchiveOutputStream to do this, but I do not wish to keep the archive open for the entire duration of the program (unless I finalize the archive, it can become corrupted, so I'd rather append to it in batches), and I haven't found a good way to append to an existing tar archive with their library. (They do have the ChangeSetPerformer class, but it basically just creates a new tar and needs to copy the old one entirely, which isn't good for me performance-wise.)
I also need the library to not have a low limit on the size of the archive (i.e. 4 GB or so is not enough), and I'd rather avoid having to actually compress the archive.
Any suggestions would be greatly appreciated!
thank you.

You are running into a limitation of tar here: http://en.wikipedia.org/wiki/Tar_(file_format)#Random_access
Because of that, it is hard to add or remove single files without copying the whole archive.

I use a library called Java Tar: http://www.trustice.com/java/tar/
It's worked for me. In that package, look for:
http://www.gjt.org/javadoc/com/ice/tar/TarArchive.html#writeEntry(com.ice.tar.TarEntry, boolean)
Which lets you add an entry to the file without using a stream at the user level. I don't know about file size - but it would be a simple matter to test this aspect.

Java content APIs for a large number of files

Does anyone know of any open-source Java libraries that provide features for handling a large number of files (write/read) on disk? I am talking about 2-4 million files (most of them PDF and MS docs). It is not a good idea to store all files in a single directory. Instead of reinventing the wheel, I am hoping that this has been done by many people already.
Features I am looking for
1) Able to write/read files from disk
2) Able to create random directories/sub-directories for new files
3) Provide version/audit (optional)
I was looking at the JCR API and it looks promising, but it starts with a workspace and I'm not sure what the performance will be when there are many nodes.

Edit: JCR does look pretty good. I'd suggest trying it out to see how it actually does perform for your use-case.
If you're running your system on Windows and noticed a horrible n^2 performance hit at some point, you're probably running up against the performance hit incurred by automatic 8.3 filename generation. Of course, you can disable 8.3 filename generation, but as you pointed out, it would still not be a good idea to store large numbers of files in a single directory.
One common strategy I've seen for handling large numbers of files is to create directories for the first n letters of the filename. For example, document.pdf would be stored in d/o/c/u/m/document.pdf. I don't recall ever seeing a library to do this in Java, but it seems pretty straightforward. If necessary, you can create a database to store the lookup table (mapping keys to the uniformly-distributed random filenames), so you won't have to rebuild your index every time you start up. If you want to get the benefit of automatic deduplication, you could hash each file's content and use that checksum as the filename (but you would also want to add a check so you don't accidentally discard a file whose checksum matches an existing file even though the contents are actually different).
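A rough sketch of both layouts, using only the JDK: shardByName nests a file under directories named after its first few characters, and shardByContent derives the location from a SHA-256 hash of the contents. The class and method names are invented here, and mapping non-alphanumeric characters to "_" and using two 2-character hash levels are assumptions of this sketch, not part of the strategy described above.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class ShardedStore {
    /** With depth 5, document.pdf is placed at root/d/o/c/u/m/document.pdf. */
    static Path shardByName(Path root, String fileName, int depth) {
        Path dir = root;
        int levels = Math.min(depth, fileName.length());
        for (int i = 0; i < levels; i++) {
            char c = fileName.charAt(i);
            // Assumption: avoid directory names such as "." by mapping odd characters to "_".
            String level = Character.isLetterOrDigit(c)
                    ? String.valueOf(Character.toLowerCase(c))
                    : "_";
            dir = dir.resolve(level);
        }
        return dir.resolve(fileName);
    }

    /** Lower-case hex SHA-256 of the content. */
    static String sha256Hex(byte[] content) throws NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest(content)) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** Content-addressed location: root/<hash[0..2]>/<hash[2..4]>/<hash><extension>. */
    static Path shardByContent(Path root, byte[] content, String extension)
            throws NoSuchAlgorithmException {
        String hash = sha256Hex(content);
        return root.resolve(hash.substring(0, 2))
                .resolve(hash.substring(2, 4))
                .resolve(hash + extension);
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        Path root = Paths.get("store");
        System.out.println(shardByName(root, "document.pdf", 5));
        System.out.println(shardByContent(root, "abc".getBytes(StandardCharsets.UTF_8), ".pdf"));
    }
}
```

Before writing to a content-addressed path that already exists, the stored bytes would still need to be compared with the new ones, as mentioned above, so that a hash collision does not silently drop a file.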
Depending on the sizes of the files, you might also consider storing the files themselves in a database--if you do this, it would be trivial to add versioning, and you wouldn't necessarily have to create random filenames because you could reference them using an auto-generated primary key.

Combine the functionality in the java.io package with your own custom solution.
The java.io package can write and read files from disk and create arbitrary directories or sub-directories for new files. There is no external API required.
The versioning or auditing would have to be provided with your own custom solution. There are many ways to handle this, and you probably have a specific need that needs to be filled. Especially if you're concerned about the performance of an open-source API, it's likely that you will get the best result by simply coding a solution that specifically fits your needs.
It sounds like your module should scan all the files on startup and form an index of everything that's available. Based on the method used for sharing and indexing these files, it can rescan the files every so often or you can code it to receive a message from some central server when a new file or version is available. When someone requests a file or provides a new file, your module will know exactly how it is organized and exactly where to get or put the file within the directory tree.
It seems that it would be far easier to just engineer a solution specific to your needs.
