I would like to create a tar archive in Java. I have files which are constantly being created and I'd like a worker thread to take a reference to those files from a queue and copy them into the archive.
I tried using Apache Compression library's TarArchiveOutputStream to do this, but I do not wish to keep the archive open for the entire duration of the program (since unless i finalize the archive, it can become corrupted - so i'd rather append to it in batches), and I haven't found a good way to append to an existing tar archive with their library (They do have the "ChangeSetPerformer" class, but it basically just creates a new tar and needs to copy the old one entirely, which isn't good for me, performance wise).
I also need the library to not have a low limit for the size of the archive (i.e. 4g or so is not enough), and i'd rather avoid having to actually compress the archive.
Any suggestions would be greatly appreciated!
thank you.
You run here in a limitation of tar: http://en.wikipedia.org/wiki/Tar_(file_format)#Random_access
because of that it is hard to add or remove single files without copying the whole archive.
I use a library called Java Tar: http://www.trustice.com/java/tar/
It's worked for me. In that package, look for:
http://www.gjt.org/javadoc/com/ice/tar/TarArchive.html#writeEntry(com.ice.tar.TarEntry, boolean)
Which lets you add an entry to the file without using a stream at the user level. I don't know about file size - but it would be a simple matter to test this aspect.
Related
Is there a way to use the Checkstyle API without providing a java.io.File?
Our app already has the file contents in memory (these aren't read from a local file, but from another source), so it
seems inefficent to me to have to create a temporary file and write the in-memory contents to it just to throw it away.
I've looked into using in-memory file systems to circumvent this, but it seems java.io.File is always bound to the
actual file system. Obviously I have no way of testing whether or not performance would be better, just wanted to ask if Checkstyle supports such a use case.
There is no clean way to do this. I recommend creating an issue at Checkstyle expanding more on your process and asking for a way to integrate it with Checkstyle.
Files are needed for our support of caching, as we skip over reading and processing a file if it is in the cache and it has not changed since the last run. The cache process is intertwined which is why no non-file route exists. Even without a file, Checkstyle processes the contents of files through FileText, which again needs a File for just a file name reference and lines of the file in a List.
I am reading different chunks of data from DB and writing each chunk into CSV file and adding that entry to zip file. Here are my questions:
I am dealing with huge data, Is it advisable to open zip stream open in the beginning and closing at end of transaction? If I do so, will it hold all these data in RAM and cause any memory issues?
Will there be any advantage if I keep these csv files in hard drive and zipping it at the end of transaction? If so, what is the best way to do it in java?
Note: We are using Java 1.6 for our application.
Have a look at the new File system introduced with Java 7
http://fahdshariff.blogspot.com/2011/08/java-7-working-with-zip-files.html
http://docs.oracle.com/javase/7/docs/technotes/guides/io/fsp/zipfilesystemprovider.html
This allows you to handle a zip file like a file system and just copy or write your data directly into files inside the zip file. However the Path.toFile() method is not supported on a zip File system, so for all legacy code that required a File object, you need to create a temporary file and then copy it over.
For your application you could just use something like Files.newBufferedWriter(...) to write the file directly into the zip archive without having to worry about the specifics.
Make sure the ZipOutputStream is wrapped around an outputstream that is not in memory (like a FileOutputStream). This will keep memory consumption to a minimum and you can basically write until your filesystem is full.
There is no advantage to first creating a csv file, then zipping it, write the csv line directly to the outputstream. This can easily be done with java 1.6
The only limitation you might run into if it gets really big is that java 1.6 does not support zip64 and as such you are limited to 4gb. At some point I backported the zip functionality of 1.7 to 1.6 to resolve this issue.
It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center.
Closed 10 years ago.
I'm downloading zipped files containing XMLs, and I'd like to avoid writing the zip files to disk before manipulating them because of latency requirements. However, java.util.zip doesn't suffice for me. There's no way to say "here's a byte array of a zip file, use it" without turning it into a stream, and ZipInputStream is not reliable, since it scans for entry headers (see discussion below EDIT for reasons why that is not reliable).
I do not yet have access to the zip files I'll be handling, so I don't know whether I'll be able to handle them through the ZipInputStream, and I need to find a solution that will work with any valid ZIP files, as the penalty for a failure once I go into production will be high.
Assuming ZipInputStream won't work, what can I do to solve this problem in cases where there are no entry headers? I'm using Wikipedia's definition, which includes a comment on how to correctly uncompress zip files (quoted below), as the standard.
EDIT
The Apache Commons Zip library has a good write up on some of the problems using Stream (both their solution and Java's) has. I'll further add, from wikipedia and personal experience, and the size and crc field on entry headers may not be filled (I've files with -1 in these fields). Thanks to centic for providing this link.
Also, let me quote the wikipedia on the subject:
Tools that correctly read zip archives must scan for the signatures of
the various fields, the zip central directory. They must not scan for
entries because only the directory specifies where a file chunk
starts. Scanning could lead to false positives, as the format doesn't
forbid other data to be between chunks, or uncompressed stream
containing such signatures.
Note that ZipInputStream scans for entries, not the central directory, which is the problem with it.
Final Edit
If anyone is interested, this script can be used to produce a valid ZIP file that cannot be read by ZipInputStream from an existing ZIP file. So, as a final edit to this closed question, I needed a library that can read files such as the ones produced by this script.
EDIT: Another suggestion...
Looking at ZipFile from the Apache Commons implementation, it looks like it wouldn't be too hard to effectively fork that for your project. Create a wrapper around your byte array which has all the pieces of the RandomAccessFile API which are required (I don't think there are very many). You've already indicated that you prefer the interface to ZipFile, so why not go with that?
We don't know enough about your project to know whether this opens up any legal questions - and even if you gave details, I doubt that anyone here would be able to give good legal advice - but I suspect it wouldn't take more than an hour or two to get this solution up and working, and I suspect you'd have reasonable confidence in it.
EDIT: This may be a slightly more productive answer...
If you're worried about the entries not being contiguous, but don't want to handle all the compression side yourself, you might consider an option where you effectively rewrite the data. Create a new ByteArrayOutputStream, and read the central directory at the end. For each entry in the central directory, write out an entry (header + data) to the output stream in a format that you believe ZipInputStream will be happy with. Then write a new central directory - if you want your replacement to be valid you may need to do this from scratch, but if you're using code which you know won't actually read the central directory, you could just provide the original one, ignoring the fact that it might not then be valid. So long as it starts with the right signature, that's probably good enough :)
Once you've done that, convert the ByteArrayOutputStream into a new byte[], wrap it in a ByteArrayInputStream and then pass that to ZipInputStream or ZipArchiveInputStream.
Depending on your purposes, you may not even need to do that much - you may be able to just extract each file as you go by creating a "mini" zip file with just the one entry you're reading from the directory at a time.
This does involve understanding the zip file format, but not completely - just the skeleton, effectively. It's not a quick and easy fix like using an existing API completely, but it shouldn't take very long. It doesn't guarantee it'll be able to read all invalid files (how could it?) but it will protect you against the "data between entries" issue you seem to be particularly concerned about. Hope it's at least a useful idea...
there's no way to say "here's a byte array of a zip file, use it"
Yes there is:
byte[] data = ...;
ByteArrayInputStream byteStream = new ByteArrayInputStream(data);
ZipInputStream zipStream = new ZipInputStream(byteStream);
That leaves the issue of whether ZipInputStream can handle all the zip files you'll give it - but I wouldn't write it off quite so quickly.
Of course, there are other APIs available. You may want to look at Apache Commons Compress, for example. Even though ZipFile requires a file, ZipArchiveInputStream doesn't - so again, you could use a ByteArrayInputStream. EDIT: It looks like ZipArchiveStream doesn't read from the central directory either. I was hoping it would use markSupported to check beforehand, but it appears not to...
EDIT: In the comments on the question, I asked where you'd read that the zip file doesn't have to contain entry data. You quoted wikipedia:
"Tools that correctly read zip archives must scan for the signatures of the various fields, the zip central directory. They must not scan for entries because only the directory specifies where a file chunk starts. Scanning could lead to false positives, as the format doesn't forbid other data to be between chunks, or uncompressed stream containing such signatures."
That's not the same as entry data being optional. It's saying that there may be extra data in awkward places, not that the entries may be missing completely. It's basically saying that the entries shouldn't be assumed to be contiguous. I could happily concede that ZipInputStream may not be reading the central directory at the end of the file, but finding code which does that isn't the same as finding code which copes with entry data not existing.
You then write:
I might further add that whether the zip is valid or not is not my concern. Working with it is.
... which suggests you want code which will handle invalid zip files. Combined with this:
I do not yet have access to the zip files I'll be handling, so I don't know whether I'll be able to handle them through the stream
That means you're asking for code which should handle zip files which are invalid in ways you can't even predict. Just how invalid would it have to be for you to be able to reject it? If I give you 1000 random bytes, with no attempt for them to be a zip file at all, what on earth would you do with it?
Basically, you need to pin the problem down more tightly before it's feasible to even say whether a particular library is a valid solution. It's reasonable to collect a set of zip files from various places, which may be invalid in well-understood ways, and say "I must be able to support all of these." Later you may need to do some work if it turns out that wasn't good enough. But to be able to support anything, however broken, simply isn't a valid requirement.
TrueZIP library provides alternative mature zip implementation.
It also features file system abstraction even for HTTP.
For example:
Path path = new TPath(new URI("http://acme.com/download/everything.zip/entry.xml"));
try (InputStream in = Files.newInputStream(path)) {
// Read archive entry contents here.
...
}
So, if you are interested only in specific entries, it would download them only, saving bandwidth and time.
And you would not have to write downloading code.
See also http://truezip.java.net/faq.html#http.
I would use the Apache library commons-compress, see http://commons.apache.org/compress/
It has support for reading Zip-files via streams, there is in-depth documentation at http://commons.apache.org/compress/zip.html for a detailed documentation. It also states some limitations which are inherent in the Zip-Format.
Sample code looks as follows:
ZipArchiveInputStream zip =
new ZipArchiveInputStream(inputStream);
try {
ZipArchiveEntry entry = zip.getNextZipEntry();
while(entry != null) {
assertEquals("README", entry.getName());
...
entry = zip.getNextZipEntry();
}
} finally {
zip.close();
}
This question sounds similar to How to create a directory in memory? pseudo file system / virtual directory. Basically, my suggestion is to use a more general solution- an in-memory virtual filesystem (and I don't mean on OS level, like Linux' ramfs/tmpfs).
One example is to use the Java 7 NIO APIs, which now provide an SPI for implementing a file system via FileSystemProvider. It seems that the ShrinkWrap filesystem implements this SPI.
A more accessible option would be to use Apache Commons VFS' ram filesystem: it requires only Java 5. If you need to be compatible with Java 5 and 6, this is probably your best bet.
I first remember reading about in-memory filesystems in Java from this article, which apart from pointing out solutions like Commons VFS and JBoss Microcontainer, gives a nice example use case for the NetBeans IDE.
While an in-memory virtual filesystem is a nice general solution of avoiding the OS-level filesystem (with the associated performance benefits), it probably suffers from other disadvantages, which more specialized solutions could address. For instance, I am not sure how using this filesystem would behave when used concurrently from multiple threads. It might work fine as long as you don't access the same files, or you might need to create separate filesystems (which might be prohibitive in terms of resource usage).
Does anyone know any java libraries (open source) that provides features for handling a large number of files (write/read) from a disk. I am talking about 2-4 millions of files (most of them are pdf and ms docs). it is not a good idea to store all files in a single directory. Instead of re-inventing the wheel, I am hoping that it has been done by many people already.
Features I am looking for
1) Able to write/read files from disk
2) Able to create random directories/sub-directories for new files
2) Provide version/audit (optional)
I was looking at JCR API and it looks promising but it starts with a workspace and not sure what will be the performance when there are many nodes.
Edit: JCP does look pretty good. I'd suggest trying it out to see how it actually does perform for your use-case.
If you're running your system on Windows and noticed a horrible n^2 performance hit at some point, you're probably running up against the performance hit incurred by automatic 8.3 filename generation. Of course, you can disable 8.3 filename generation, but as you pointed out, it would still not be a good idea to store large numbers of files in a single directory.
One common strategy I've seen for handling large numbers of files is to create directories for the first n letters of the filename. For example, document.pdf would be stored in d/o/c/u/m/document.pdf. I don't recall ever seeing a library to do this in Java, but it seems pretty straightforward. If necessary, you can create a database to store the lookup table (mapping keys to the uniformly-distributed random filenames), so you won't have to rebuild your index every time you start up. If you want to get the benefit of automatic deduplication, you could hash each file's content and use that checksum as the filename (but you would also want to add a check so you don't accidentally discard a file whose checksum matches an existing file even though the contents are actually different).
Depending on the sizes of the files, you might also consider storing the files themselves in a database--if you do this, it would be trivial to add versioning, and you wouldn't necessarily have to create random filenames because you could reference them using an auto-generated primary key.
Combine the functionality in the java.io package with your own custom solution.
The java.io package can write and read files from disk and create arbitrary directories or sub-directories for new files. There is no external API required.
The versioning or auditing would have to be provided with your own custom solution. There are many ways to handle this, and you probably have a specific need that needs to be filled. Especially if you're concerned about the performance of an open-source API, it's likely that you will get the best result by simply coding a solution that specifically fits your needs.
It sounds like your module should scan all the files on startup and form an index of everything that's available. Based on the method used for sharing and indexing these files, it can rescan the files every so often or you can code it to receive a message from some central server when a new file or version is available. When someone requests a file or provides a new file, your module will know exactly how it is organized and exactly where to get or put the file within the directory tree.
It seems that it would be far easier to just engineer a solution specific to your needs.
Supposing I have a File f that represents a directory, then f.delete() will only delete the directory if it is empty. I've found a couple of examples online that use File.listFiles() or File.list() to get all the files in the directory and then recursively traverses the directory structure and delete all the files. However, since it's possible to create infinitely recursive directory structures (in both Windows and Linux (with symbolic links)) presumably it's possible that programs written in this style might never terminate.
So, is there a better way to write such a program so that it doesn't fall into these pitfalls? Do I need to keep track of everywhere I've traversed and make sure I don't go around in circles or is there a nicer way?
Update: In response to some of the answers (thanks guys!) - I'd rather the code didn't follow symbolic links and stayed within the directory it was supposed to delete. Can I rely on the Commons-IO implementation to do that, even in the Windows case?
If you really want your recursive directory deletion to follow through symbolic links, then I don't think there is any platform independent way of doing so without keeping track of all the directories you have traversed.
However, in pretty much every case I can think of you would just want to delete the actual symbolic link pointing to the directory rather than recursively following through the symbolic link.
If this is the behaviour you want then you can use the FileUtils.deleteDirectory method in Apache Commons IO.
Try Apache Commons IO for a tested implementation.
However, I don't think it this handles the infinite-recursion problem.
File.getCanonicalPath() will tell you the “real” name of the file, including resolved symlinks. When while scanning you come across a directory you alread know (because you stored them in a Map) bail out.
If you could know which files are symlinks, you could just skip over those.
There is unfortunately no "clean" way of detecting symlinks in Java. Check out this pure Java workaround or this one involving native code.
At least under MacOSX, deleting a symbolic link to a directory does not delete the directory itself, and can therefore be deleted even if the target directory is not empty.
I assume this holds for most POSIX operating systems. And as far as I know, links under windows are also just files, and can be deleted as such from a Java program.