Split zip file in Java (without .part)

I have a 16 MB zip file containing many log files, and I want to split this zip into several other zip files of at most 6 MB each in my Java application.
I know it is possible with Zip4j, but with Zip4j my zip file is split into many parts (zip.part01, zip.part02), meaning I cannot open any of these files without extracting them all.
Assuming that no single log file is larger than 6 MB, what is the best approach to split my zip into many smaller zip files (and not parts)? Unzipping the zip file, then looping over all the files and creating new smaller zips?
I hope that my question is not confusing
Thank you very much

Assuming that the files are text files, I would design the program to read certain portions of the file, create new files from those portions, and zip them individually. This way, each file is readable.
This may not be as efficient as zipping the entire file and splitting it, though.
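If the goal is just to repackage the existing entries, here is a minimal sketch of the approach the question itself suggests: read each entry and roll over to a new output zip once a size budget is reached. The 6 MB budget and the file names are assumptions, and the budget is checked against the uncompressed entry sizes for simplicity:

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Enumeration;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipFile;
    import java.util.zip.ZipOutputStream;

    public class ZipSplitter {
        public static void main(String[] args) throws IOException {
            final long maxBytes = 6L * 1024 * 1024;   // assumed 6 MB budget per output zip
            int partIndex = 0;
            long currentBytes = 0;
            ZipOutputStream out = null;
            byte[] buffer = new byte[8192];

            try (ZipFile zip = new ZipFile("logs.zip")) {          // hypothetical input file
                Enumeration<? extends ZipEntry> entries = zip.entries();
                while (entries.hasMoreElements()) {
                    ZipEntry entry = entries.nextElement();
                    if (entry.isDirectory()) continue;
                    // Roll over to a new output zip when the uncompressed budget would be exceeded
                    if (out == null || currentBytes + entry.getSize() > maxBytes) {
                        if (out != null) out.close();
                        out = new ZipOutputStream(
                                Files.newOutputStream(Paths.get("logs-part" + (++partIndex) + ".zip")));
                        currentBytes = 0;
                    }
                    out.putNextEntry(new ZipEntry(entry.getName()));
                    try (InputStream in = zip.getInputStream(entry)) {
                        int n;
                        while ((n = in.read(buffer)) > 0) {
                            out.write(buffer, 0, n);
                        }
                    }
                    out.closeEntry();
                    currentBytes += entry.getSize();   // sizes from the central directory
                }
            } finally {
                if (out != null) out.close();
            }
        }
    }

Because the entries are copied one at a time, every output file is a complete, independently openable zip archive rather than a .part segment.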

Related

Is it faster to read a Zip file while unzipping it (in a stream) or wait for it to finish unzipping and then read it?

I am currently reading a CSV file from a zip file. I have 2 options to read it.
Read the CSV line by line while it is being unzipped, by streaming it (ZipInputStream and openCSV)
Unzip the entire CSV file first, and then go back and read the entire thing.
Which one would be faster? I am going to perform some tests, but I was wondering if anyone already knew which is more efficient in practice. Thanks!
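For reference, a minimal sketch of the streaming option, reading the CSV line by line straight out of the archive with ZipInputStream (a plain BufferedReader stands in for openCSV here, and the file names are assumptions):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipInputStream;

    public class StreamCsvFromZip {
        public static void main(String[] args) throws IOException {
            try (ZipInputStream zin = new ZipInputStream(
                    Files.newInputStream(Paths.get("data.zip")))) {      // hypothetical zip
                ZipEntry entry;
                while ((entry = zin.getNextEntry()) != null) {
                    if (!entry.getName().endsWith(".csv")) continue;
                    // Wrap the zip stream in a reader; do not close the reader here,
                    // closing it would also close the underlying ZipInputStream.
                    BufferedReader reader = new BufferedReader(
                            new InputStreamReader(zin, StandardCharsets.UTF_8));
                    String line;
                    while ((line = reader.readLine()) != null) {
                        // process each CSV line as it is decompressed
                        System.out.println(line);
                    }
                }
            }
        }
    }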

Unzipping tar.gz file partially

Would it be possible to unzip a tar.gz file partially, e.g. unzip only a few megabytes from the middle of a large tar.gz file?
I got this idea because we have a lot of zipped log files, and it's very time consuming to unzip a 100 MB log file into a ~1 GB file and then search in it. It would be great to have the option of a 'partial unzip'.
Unless the .gz file was specially prepared for this purpose, then no, you need to decompress all of the data up to the middle in order to decompress what's in the middle.
It is possible to use Z_FULL_FLUSH in deflate() periodically to put break points in the compressed data that allow decompression to start at those points. You would have to keep a separate file and your own software to track where those break points are and how far into the uncompressed data they fall.
Since it is a .tar.gz file, it would make sense to only have those breakpoints at file boundaries. The tar format itself can be read starting at any file header with no problem.
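To illustrate the flush-point idea in Java: java.util.zip.Deflater exposes the same FULL_FLUSH mode as zlib's Z_FULL_FLUSH (since Java 7). Below is a minimal sketch that compresses data block by block and records the compressed offset after each flush; the block size and file names are assumptions, and it writes a plain zlib stream rather than a .gz file:

    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.zip.Deflater;

    public class FlushPointCompressor {
        public static void main(String[] args) throws IOException {
            byte[] input = Files.readAllBytes(Paths.get("big.log"));   // hypothetical input file
            int blockSize = 1 << 20;                                   // assumed 1 MB block size
            List<Long> breakpoints = new ArrayList<>();                // compressed offsets of restart points

            Deflater deflater = new Deflater();
            byte[] out = new byte[64 * 1024];
            long compressedOffset = 0;

            try (OutputStream os = Files.newOutputStream(Paths.get("big.log.z"))) {
                for (int pos = 0; pos < input.length; pos += blockSize) {
                    int len = Math.min(blockSize, input.length - pos);
                    deflater.setInput(input, pos, len);
                    // FULL_FLUSH writes out all pending output and resets the compressor's
                    // history, so decompression can restart right after this point.
                    int n;
                    do {
                        n = deflater.deflate(out, 0, out.length, Deflater.FULL_FLUSH);
                        os.write(out, 0, n);
                        compressedOffset += n;
                    } while (n == out.length);
                    breakpoints.add(compressedOffset);   // a restart point begins here
                }
                deflater.finish();
                while (!deflater.finished()) {
                    int n = deflater.deflate(out);
                    os.write(out, 0, n);
                }
            }
            deflater.end();
            System.out.println("restart points at compressed offsets: " + breakpoints);
            // A real tool would also record the corresponding uncompressed offsets
            // in a separate index file, as the answer describes.
        }
    }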

How does one access a file within a directory in a zip file without creating any new files?

I'm working on a java project that requires me to access a file within multiple embedded zip files and directories.
For example, archive1.zip/archive1/archive2.zip/archive2/directory1/file_that_I_need.txt.
It would be a lot easier if each zip file, when extracted, immediately listed its contents, but instead there is a folder inside that contains all the contents.
The examples I found online deal with zip files that, when extracted, contain the files they need to access but I can't find any that deal with accessing files within a directory in a zip file. Any advice on this would be great.
Thanks!
Given the prohibition against creating new files, you're pretty much stuck with ZipInputStream. When you find the ZipEntry that corresponds to the embedded archive, you then read its stream to find the actual file. You can proceed recursively through as many levels of archives as you want.
This works OK if you're looking to process a single file. However, re-reading the archives for multiple files can be expensive. A better solution is to at least open the outer archive as a ZipFile, which memory-maps the actual file.
If you can then extract the contained archives into a temporary directory and open them as ZipFiles as well, you'll probably see a big speed increase (as long as you're pulling multiple files from each embedded archive).
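A minimal sketch of that recursive ZipInputStream approach, using the entry path from the question (no temporary files are created; everything is read in memory):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Arrays;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipInputStream;

    public class NestedZipReader {

        public static void main(String[] args) throws IOException {
            try (InputStream in = Files.newInputStream(Paths.get("archive1.zip"))) {
                byte[] data = extract(in,
                        "archive1/archive2.zip",                       // nested archive entry
                        "archive2/directory1/file_that_I_need.txt");   // target inside it
                System.out.println(new String(data));
            }
        }

        // Each name except the last is expected to be a zip entry that is itself
        // a zip archive; the last name is the target file.
        static byte[] extract(InputStream in, String... names) throws IOException {
            ZipInputStream zin = new ZipInputStream(in);
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                if (!entry.getName().equals(names[0])) continue;
                if (names.length == 1) {
                    return readAll(zin);                 // found the target file
                }
                // The entry is an embedded zip: recurse into its stream directly
                return extract(zin, Arrays.copyOfRange(names, 1, names.length));
            }
            throw new IOException("entry not found: " + names[0]);
        }

        static byte[] readAll(InputStream in) throws IOException {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] b = new byte[8192];
            int n;
            while ((n = in.read(b)) > 0) buf.write(b, 0, n);
            return buf.toByteArray();
        }
    }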
You might also look at http://truezip.java.net/. I've used an older version of it, and it's quite a bit more powerful than the support that's built into Java. I think there is also an Apache Commons library for reading files from within nested archive structures.

append files to an archive without reading/rewriting the whole archive

I've got many files that I want to store in a single archive file. My first approach was to store the files in a gzipped tarball. The problem is that I have to rewrite the whole archive whenever a single file is added.
I could get rid of the gzip compression, but adding a file would still be expensive.
What other archive format would you suggest that allows fast append operations?
The ZIP file format was designed to allow appends without a total re-write and is ubiquitous, even on Unix.
ZIP and TAR formats (and the old AR format) allow file append without a full rewrite. However:
The Java archive classes DO NOT support this mode of operation.
File append is likely to result in multiple copies of a file in the archive if you append an existing file.
The ZIP and AR formats have a directory that needs to be rewritten following a file append operation. The standard utilities take precautions when rewriting the directory, but it is possible in theory that you may end up with an archive with a missing or corrupted directory if the append fails.
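For completeness, if convenience matters more than true append performance, the JDK's built-in zip filesystem provider (Java 7+) can add a file to an existing archive in a few lines; note, consistent with the point above, that the archive is still rewritten when the filesystem is closed, so this does not avoid the cost of a full rewrite. The file names below are assumptions:

    import java.net.URI;
    import java.nio.file.*;
    import java.util.Collections;

    public class ZipAppend {
        public static void main(String[] args) throws Exception {
            Path zip = Paths.get("logs.zip");                 // hypothetical existing archive
            Path fileToAdd = Paths.get("new-entry.log");      // hypothetical file to add

            URI uri = URI.create("jar:" + zip.toUri());
            // "create" = "false": the archive is expected to exist already
            try (FileSystem fs = FileSystems.newFileSystem(uri,
                    Collections.singletonMap("create", "false"))) {
                Files.copy(fileToAdd, fs.getPath("/" + fileToAdd.getFileName()),
                        StandardCopyOption.REPLACE_EXISTING);
            }   // the whole archive is rewritten here on close
        }
    }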

How can I protect myself from a zip bomb?

I just read about zip bombs, i.e. zip files that contain a very large amount of highly compressible data (00000000000000000...).
When opened they fill the server's disk.
How can I detect that a zip file is a zip bomb before unzipping it?
UPDATE Can you tell me how is this done in Python or Java?
Try this in Python:
import zipfile
with zipfile.ZipFile('a_file.zip') as z:
    print(f'total files size={sum(e.file_size for e in z.infolist())}')
Zip is, erm, an "interesting" format. A robust solution is to stream the data out, and stop when you have had enough. In Java, use ZipInputStream rather than ZipFile. The latter also requires you to store the data in a temporary file, which is also not the greatest of ideas.
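Following that streaming suggestion, here is a minimal sketch that extracts entries with ZipInputStream and aborts as soon as a total uncompressed-byte budget is exceeded; the budget, file names, and output directory are assumptions:

    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipInputStream;

    public class LimitedUnzip {
        static final long MAX_TOTAL_BYTES = 100L * 1024 * 1024;   // assumed 100 MB budget

        public static void main(String[] args) throws IOException {
            Path target = Paths.get("extracted");                 // hypothetical output dir
            Files.createDirectories(target);
            long total = 0;
            byte[] buffer = new byte[8192];

            try (ZipInputStream zin = new ZipInputStream(
                    Files.newInputStream(Paths.get("upload.zip")))) {   // hypothetical upload
                ZipEntry entry;
                while ((entry = zin.getNextEntry()) != null) {
                    if (entry.isDirectory()) continue;
                    Path out = target.resolve(entry.getName()).normalize();
                    if (!out.startsWith(target)) {                // also guard against zip-slip paths
                        throw new IOException("illegal entry path: " + entry.getName());
                    }
                    Files.createDirectories(out.getParent());
                    try (OutputStream os = Files.newOutputStream(out)) {
                        int n;
                        while ((n = zin.read(buffer)) > 0) {
                            total += n;
                            if (total > MAX_TOTAL_BYTES) {
                                // Stop as soon as the decompressed data exceeds the budget
                                throw new IOException("zip expands beyond " + MAX_TOTAL_BYTES + " bytes");
                            }
                            os.write(buffer, 0, n);
                        }
                    }
                }
            }
        }
    }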
Reading over the description on Wikipedia -
Deny any compressed files that contain compressed files.
Use ZipFile.entries() to retrieve a list of files, then ZipEntry.getName() to find the file extension.
Deny any compressed files that contain files over a set size, or whose size cannot be determined up front.
While iterating over the files use ZipEntry.getSize() to retrieve the file size.
Don't allow the upload process to write enough data to fill up the disk, i.e. solve the problem, not just one possible cause of the problem.
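A minimal sketch of those pre-checks with the java.util.zip API mentioned above; the extensions and size limits are assumptions, and since ZipEntry.getSize() only reports the declared size, a streaming limit like the one sketched earlier is still a sensible backstop:

    import java.io.IOException;
    import java.util.Enumeration;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipFile;

    public class ZipPreCheck {
        static final long MAX_ENTRY_BYTES = 10L * 1024 * 1024;   // assumed per-entry limit
        static final long MAX_TOTAL_BYTES = 100L * 1024 * 1024;  // assumed total limit

        // Returns true if the archive passes the declared-size checks.
        static boolean looksSafe(String path) throws IOException {
            long total = 0;
            try (ZipFile zip = new ZipFile(path)) {
                Enumeration<? extends ZipEntry> entries = zip.entries();
                while (entries.hasMoreElements()) {
                    ZipEntry e = entries.nextElement();
                    String name = e.getName().toLowerCase();
                    // Reject nested archives outright, as the quoted advice suggests
                    if (name.endsWith(".zip") || name.endsWith(".gz") || name.endsWith(".7z")) {
                        return false;
                    }
                    long size = e.getSize();
                    // Reject entries whose size is undeclared or over the per-entry limit
                    if (size < 0 || size > MAX_ENTRY_BYTES) {
                        return false;
                    }
                    total += size;
                    if (total > MAX_TOTAL_BYTES) {
                        return false;
                    }
                }
            }
            return true;
        }
    }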
Check a zip header first :)
If the ZIP decompressor you use can provide data on the original and compressed sizes, you can use that. Otherwise, start unzipping and monitor the output size; if it grows too much, cut it off.
Make sure you are not using your system drive for temp storage. I am not sure whether a virus scanner will check it if it encounters it.
You can also look at the information inside the zip file and retrieve a list of its contents. How to do this depends on the utility used to extract the file, so you need to provide more information here.
