Unzipping tar.gz file partially - java

Would it be possible to unzip a tar.gz file partially, e.g. unzip only a few megabytes from the middle of a large tar.gz file?
I got this idea because we have a lot of zipped log files, and it's very time consuming to unzip a 100 MB log file into a ~1 GB file and then search in it. It would be great to have a 'partial unzip' option.

Unless the .gz file was specially prepared for this purpose, then no: you need to decompress all of the data up to the middle in order to decompress what's in the middle.
It is possible to call deflate() with Z_FULL_FLUSH periodically to put break points in the compressed data that allow decompression to start at those points. You would have to keep a separate file, and your own software, to track where those break points are and how far into the uncompressed data each one lands.
Since it is a .tar.gz file, it would make sense to place those break points only at file boundaries. The tar format itself can be read starting at any file header with no problem.
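For illustration, here is a minimal Java sketch of that break-point idea, using Deflater.FULL_FLUSH (available since Java 7). The chunking, file names, and side index are all assumptions for the example, and the output is a raw deflate stream rather than a real .gz (that would additionally need a gzip header and trailer):

import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.Deflater;

// Sketch: compress chunks with a full flush between them and record where
// each chunk starts in both the compressed and uncompressed streams.
public class BreakpointDeflate {
    public static void main(String[] args) throws IOException {
        // Stand-ins for the real data; in a .tar.gz you would use one chunk
        // per tar entry so break points land on file boundaries.
        List<byte[]> chunks = List.of(
                "first log file contents".getBytes(StandardCharsets.UTF_8),
                "second log file contents".getBytes(StandardCharsets.UTF_8));

        List<long[]> index = new ArrayList<>(); // {compressed offset, uncompressed offset}
        Deflater def = new Deflater(Deflater.DEFAULT_COMPRESSION, true); // raw deflate
        byte[] buf = new byte[8192];
        long compOff = 0, uncompOff = 0;

        try (FileOutputStream out = new FileOutputStream("data.deflate")) {
            for (byte[] chunk : chunks) {
                index.add(new long[] { compOff, uncompOff });
                def.setInput(chunk);
                int n;
                // FULL_FLUSH resets the compressor's state, so a reader can
                // later seek to the recorded offset and inflate from there
                // without having seen any earlier data.
                do {
                    n = def.deflate(buf, 0, buf.length, Deflater.FULL_FLUSH);
                    out.write(buf, 0, n);
                    compOff += n;
                } while (n == buf.length);
                uncompOff += chunk.length;
            }
            def.finish();
            while (!def.finished()) {
                int n = def.deflate(buf);
                out.write(buf, 0, n);
            }
        }
        def.end();

        // Persist this index in a side file; it is the "different file"
        // mentioned above that the reader uses to find its entry point.
        for (long[] e : index) {
            System.out.println("compressed " + e[0] + " <- uncompressed " + e[1]);
        }
    }
}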

Related

Reading Unix executable file

I have "Unix Executable file" with no file extension.
In Mac, I am able to see the content in preview mode but not sure about any other way to see the content.
Looking for a way to read the content and store in some other file location as JPG file or PNG file format.
Not sure how to read this thru Java.
In Mac terminal, I tried "file filename" and got the following output.
PNG image data, 110 x 103, 8-bit/color RGB, non-interlaced
Whatever is reporting 'Unix executable file' is oversimplifying things. It's simply 'a file' (Unix has sod all to do with it). The file system has the concept of an 'executable' flag, which you can set or clear on any file, and which is utterly unrelated to whether the file's contents are actually executable. On top of that, Macs and Linux can mount DOS file systems (which most USB sticks use, because every OS can deal with them), and those do not have the flag at all; 'for convenience' the OS then acts as if ALL files on such a volume have it, and you can't remove it. In other words, it's a lie; forget about that part.
file is just guessing. That is no knock on file; the authors of that tool are by no means lazy. It's mathematically impossible to do better: the file system doesn't know what kind of data a file contains, it just knows that this file has these bytes and ends there. file looks at the contents and takes a wild stab in the dark. Its stabs are decent, but there is no guarantee. I can make you a file that is BOTH a legal zip file (unzips just fine) AND an equally valid PNG image (renders in browsers, Preview, etc.). What could file possibly tell you there? Literally random garbage is ALSO a valid ISO-8859-1 formatted text file; the only way to know that this is clearly not the intended purpose of the file is to use artificial-intelligence-style analysis to realize that the contents form no legible words in any language on the planet. That is a very hard problem, and file doesn't try to solve it.
Thus, there's no real way to know whether it is a PNG file if all you have is the file on disk. The file extension is a good hint, but if it's missing, you're just guessing. You can toss it through a PNG reader, and if it doesn't crash, it probably is one; but it could just be a picture of random static because it isn't really a PNG file.
If you want to convert PNG files, ImageIO can do that.
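As a minimal sketch of that (the input and output paths are placeholders):

import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

// Sketch: read the extension-less file and re-save it with a proper name.
// ImageIO.read returns null if no installed reader recognises the bytes,
// which doubles as a rough "is this an image at all" probe.
public class ConvertImage {
    public static void main(String[] args) throws IOException {
        BufferedImage img = ImageIO.read(new File("mystery-file")); // placeholder path
        if (img == null) {
            System.err.println("No installed image reader understands this file.");
            return;
        }
        ImageIO.write(img, "png", new File("output.png"));
    }
}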
Generally, the process that got you that file usually DOES know the format. For example, if you downloaded it over the web, the web server didn't JUST send those bytes over; it also sent the header Content-Type: image/png. THAT (and not the file extension) is the web server's canonical truth. If the process that saved this file to disk took that information and tossed it in the garbage, well, now you're stuck guessing. If possible, go back to that part of the process and fix it so this info is no longer binned. For example, if you have a shell script that uses wget to download a resource, and later on you have no idea whether it's a PNG, a JPG, or the output of a 'file not found' explanatory page in HTML, then fix the script to save that header and react accordingly.

Split zip file in Java (without .part)

I have a 16 MB zip file containing many log files, and I want to split it into several other zip files of at most 6 MB each in my Java application.
I know it is possible with zip4j, but with zip4j my zip file is split into many parts (zip.part01, zip.part02), and I cannot open any of these files without extracting them all.
Assuming that no single log file is larger than 6 MB, what is the best approach to split my zip into many smaller zip files (and not parts)? Unzipping the zip file, then looping over all the files and creating new smaller zips?
I hope that my question is not confusing.
Thank you very much
Assuming that the files are text files, I would just design the program to read certain portions of each file, create new files from those portions, and zip them individually. That way each resulting file is readable on its own.
This may not be as efficient as zipping the entire file and splitting it, though.
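Alternatively, here is a minimal sketch of the repack approach the question itself proposes, using only java.util.zip. The file names and the 6 MB budget are placeholders, and the size accounting is approximate, since entries are recompressed and each part adds its own header overhead:

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

// Sketch: repack the entries of big.zip into part-0.zip, part-1.zip, ...
// each one a self-contained, openable zip of roughly MAX_PART bytes.
public class ZipSplitter {
    static final long MAX_PART = 6L * 1024 * 1024; // placeholder budget

    public static void main(String[] args) throws IOException {
        try (ZipFile source = new ZipFile("big.zip")) {
            int part = 0;
            long used = 0;
            ZipOutputStream out = newPart(part);
            Enumeration<? extends ZipEntry> entries = source.entries();
            while (entries.hasMoreElements()) {
                ZipEntry e = entries.nextElement();
                long size = Math.max(e.getCompressedSize(), 0);
                // Start a new part when this entry would blow the budget.
                if (used > 0 && used + size > MAX_PART) {
                    out.close();
                    out = newPart(++part);
                    used = 0;
                }
                out.putNextEntry(new ZipEntry(e.getName()));
                try (InputStream in = source.getInputStream(e)) {
                    in.transferTo(out);
                }
                out.closeEntry();
                used += size;
            }
            out.close();
        }
    }

    private static ZipOutputStream newPart(int n) throws IOException {
        return new ZipOutputStream(new FileOutputStream("part-" + n + ".zip"));
    }
}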

append files to an archive without reading/rewriting the whole archive

I've got many files that I want to store in a single archive file. My first approach was to store the files in a gzipped tarball. The problem is that I have to rewrite the whole archive if a single file is added.
I could get rid of the gzip compression, but adding a file would still be expensive.
What other archive format would you suggest that allows fast append operations?
The ZIP file format was designed to allow appends without a total re-write and is ubiquitous, even on Unix.
ZIP and TAR formats (and the old AR format) allow file append without a full rewrite. However:
The standard Java archive classes DO NOT support this mode of operation (a partial workaround is sketched after this list).
File append is likely to result in multiple copies of a file in the archive if you append a file that already exists.
The ZIP and AR formats have a directory that needs to be rewritten following a file append operation. The standard utilities take precautions when rewriting the directory, but it is in theory possible to end up with an archive whose directory is missing or corrupted if the append fails.
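The partial workaround mentioned above: the JDK's zip file system provider (Java 13+ for this newFileSystem overload) can add an entry to an existing archive through an ordinary Files.copy. The paths below are placeholders, and note that the provider may still write out a fresh archive when the file system closes, so this is an append API, not a fast append:

import java.io.IOException;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Map;

// Sketch: add one file to an existing zip via the zip file system provider.
public class ZipAppend {
    public static void main(String[] args) throws IOException {
        Path zip = Path.of("archive.zip");     // placeholder: existing archive
        Path extra = Path.of("new-log.txt");   // placeholder: file to add
        try (FileSystem fs = FileSystems.newFileSystem(zip, Map.of("create", "false"))) {
            Files.copy(extra, fs.getPath("/new-log.txt"), StandardCopyOption.REPLACE_EXISTING);
        }
    }
}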

download large files from a remote server, how to test for correct file and delivery?

There are 2 servers that are geographically very far from each other.
One server does file processing, then saves the processed file in a directory:
c:\processed\
Files can be 100 MB to 1 GB in size.
The 2nd server is to download these files.
What techniques can I use to check if the file correctly downloaded?
Is a checksum all I need to do? Will it hash according to the contents of the file or just the file attributes? (Or what is best practice?)
If the file is 1 GB, will creating the checksum take a long time?
Checksum is fine to make sure that the downloaded data matches the source data. For a discussion of making it fast, see What is the fastest way to create a checksum for large files in C#.
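A minimal Java sketch of a streamed checksum (the path is a placeholder; HexFormat needs Java 17+). Memory use stays flat regardless of file size, and the digest covers only the file's bytes, never its attributes, so running the same code on both servers and comparing the output answers the question above:

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

// Sketch: hash a large file in fixed-size chunks so memory use stays flat.
public class FileChecksum {
    public static void main(String[] args) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] buf = new byte[64 * 1024];
        try (InputStream in = Files.newInputStream(Path.of("c:/processed/file.bin"))) { // placeholder
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        System.out.println(HexFormat.of().formatHex(md.digest()));
    }
}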

How can I protect myself from a zip bomb?

I just read about zip bombs, i.e. zip files that contain a very large amount of highly compressible data (00000000000000000...).
When opened, they fill the server's disk.
How can I detect that a zip file is a zip bomb before unzipping it?
UPDATE: Can you tell me how this is done in Python or Java?
Try this in Python:
import zipfile
with zipfile.ZipFile('a_file.zip') as z:
    print(f'total files size={sum(e.file_size for e in z.infolist())}')
Note that file_size is read from each entry's own metadata, so a malicious archive can understate it; treat this as a quick sanity check rather than a guarantee.
Zip is, erm, an "interesting" format. A robust solution is to stream the data out, and stop when you have had enough. In Java, use ZipInputStream rather than ZipFile. ZipFile also requires you to store the data in a temporary file first, which is also not the greatest of ideas.
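A minimal sketch of that streaming approach (the output directory, budget, and archive name are placeholders). The byte budget is enforced on the actual decompressed output, so it holds no matter what the entry headers claim:

import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Sketch: extract from a stream and abort once a total-bytes budget is hit.
public class BoundedUnzip {
    static final long MAX_TOTAL = 100L * 1024 * 1024; // placeholder: 100 MB budget

    public static void main(String[] args) throws IOException {
        Path base = Path.of("out");
        long total = 0;
        byte[] buf = new byte[8192];
        try (ZipInputStream zin = new ZipInputStream(Files.newInputStream(Path.of("upload.zip")))) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                Path target = base.resolve(entry.getName()).normalize();
                if (!target.startsWith(base)) { // also guards against "zip slip" entry names
                    throw new IOException("Bad entry name: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(target);
                    continue;
                }
                Files.createDirectories(target.getParent());
                try (OutputStream out = Files.newOutputStream(target)) {
                    int n;
                    while ((n = zin.read(buf)) != -1) {
                        total += n;
                        if (total > MAX_TOTAL) {
                            throw new IOException("Archive expands past " + MAX_TOTAL + " bytes");
                        }
                        out.write(buf, 0, n);
                    }
                }
            }
        }
    }
}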
Reading over the description on Wikipedia -
Deny any compressed files that contain other compressed files.
Use ZipFile.entries() to retrieve the list of entries, then ZipEntry.getName() to check each file extension.
Deny any compressed files that contain files over a set size, or whose size cannot be determined at the start.
While iterating over the entries, use ZipEntry.getSize() to retrieve each declared file size (see the sketch after this list).
Don't allow the upload process to write enough data to fill up the disk, ie solve the problem, not just one possible cause of the problem.
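A sketch of that pre-scan (the thresholds and the nested-archive extension list are placeholders). Keep in mind the declared sizes come from the zip's own metadata, so treat a pass here as a first filter and still cap bytes while actually extracting, as in the sketch above:

import java.io.IOException;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Sketch: walk the central directory and reject archives containing nested
// archives, entries with unknown sizes, or oversized entries.
public class ZipPreScan {
    static final long MAX_ENTRY = 10L * 1024 * 1024; // placeholder per-entry cap

    public static void main(String[] args) throws IOException {
        check("upload.zip"); // placeholder path
    }

    static void check(String path) throws IOException {
        try (ZipFile zf = new ZipFile(path)) {
            Enumeration<? extends ZipEntry> entries = zf.entries();
            while (entries.hasMoreElements()) {
                ZipEntry e = entries.nextElement();
                String name = e.getName().toLowerCase();
                if (name.endsWith(".zip") || name.endsWith(".gz") || name.endsWith(".7z")) {
                    throw new IOException("Nested archive: " + e.getName());
                }
                long size = e.getSize(); // -1 if not recorded in the directory
                if (size < 0 || size > MAX_ENTRY) {
                    throw new IOException("Suspicious size for " + e.getName() + ": " + size);
                }
            }
        }
    }
}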
Check a zip header first :)
If the ZIP decompressor you use can report the original and compressed sizes, you can use that data. Otherwise, start unzipping and monitor the output size; if it grows too much, cut it loose.
Make sure you are not using your system drive for temp storage. I am not sure whether a virus scanner will check the file if it encounters it.
You can also look at the information inside the zip file and retrieve a list of its contents. How to do this depends on the utility used to extract the file, so you need to provide more information here.
