I have a Java application that takes a zip file as input.
My application decompresses the zip file, but some files inside it have filenames longer than 256 characters.
Can I modify the name of a file inside the zip without decompressing it?
OS: Linux/Mac
It's more of a limitation of the file system used, than the OS itself.
You can only change this by formatting the drives to a different file system that supports longer file names.
Why you would need a file name that is a couple of miles long is beyond me. The only advice I can give is to try to shorten the file name.
::EDIT::
Since you updated your question. Here's the correct answer. :)
Typically, your approach is correct. If your zip contains overly long filenames, though, you can truncate them: take the first 250 characters and ignore the rest. You may then end up with duplicate filenames, so add a number at the end, since you have about 5 characters left.
Another option is to ask the user to enter a new file name.
It is possible to edit the zip file itself, as long as you know how it is structured, etc.
I'm not aware of any built-in Java API that allows editing zip files. A while back, though, I came across a library named DotNetZip for Microsoft .NET, which offers all the typical functionality plus editing the entries inside a zip file, encryption, passwords, etc. (it is awesome, btw).
Look for a similar library for Java.
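A minimal sketch of the truncate-and-number idea, written in Python for brevity (the same loop should port to java.util.zip's ZipInputStream and ZipOutputStream). It copies every entry into a new archive under a shortened name, so nothing with a long name ever touches the file system. The function name, the 255/250 limits and the numbering scheme are assumptions made for illustration, not part of any library.

```python
import posixpath
import zipfile


def copy_with_short_names(src, dst, limit=255, keep=250):
    """Copy a zip to a new zip, shortening over-long entry names.

    src and dst may be paths or seekable file objects. Only the last path
    component is checked against `limit` (an assumed file-system limit).
    Returns a dict mapping old entry names to new ones.
    """
    mapping = {}
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        infos = zin.infolist()
        # Names that are already short enough are reserved up front so a
        # truncated name can never collide with them.
        used = {i.filename for i in infos
                if len(posixpath.basename(i.filename)) <= limit}
        for info in infos:
            new_name = info.filename
            directory, base = posixpath.split(info.filename)
            if len(base) > limit:
                # Take the first `keep` characters and drop the rest
                # (this also drops the extension, as in the answer above).
                stem = posixpath.join(directory, base[:keep])
                new_name, n = stem, 0
                while new_name in used:
                    n += 1
                    new_name = f"{stem}{n}"
                used.add(new_name)
            mapping[info.filename] = new_name
            # Each entry is held in memory while it is copied; nothing is
            # written to disk under the original name.
            zout.writestr(new_name, zin.read(info))
    return mapping
```

The entries are still decompressed and recompressed in memory, so this is not an in-place edit of the archive; it only avoids creating the long-named files on disk.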
If the user were supplying a filename for a file to be created, I think you'd just have to ask for a shorter name, short enough to avoid throwing this exception.
But it sounds like this is the name of an existing file to be read from. In that case, if the system (file system or operating system; either way, something you can't control) is saying it's not a valid filename, how could you expect to be able to read from it?
I have "Unix Executable file" with no file extension.
In Mac, I am able to see the content in preview mode but not sure about any other way to see the content.
Looking for a way to read the content and store in some other file location as JPG file or PNG file format.
Not sure how to read this through Java.
In Mac terminal, I tried "file filename" and got the following output.
PNG image data, 110 x 103, 8-bit/color RGB, non-interlaced
Whatever is reporting 'unix executable file' is oversimplifying things. It's simply 'a file' (unix has sod all to do with it). The file system has the concept of an 'executable' flag, which you can set or clear on any file and which is utterly unrelated to whether the file's contents are executable. On top of that, macs and linux can mount DOS file systems (which most USB sticks use, because every OS can deal with them), and those do not have this flag at all; 'for convenience' the OS then acts as if ALL files have the flag set, and you can't remove it. In other words, it's a lie; forget about that part.
file is just guessing. This is no blame on file and the authors of that tool are by no means lazy. It's mathematically impossible - the disk system doesn't know what kind of data a file contains, it just knows: This file has these bytes, and it ends there. file just looks at the contents and takes a wild stab in the dark. Its wild stabs are decent, but no guarantee. I can make you a file that is BOTH a legal zip file (will unzip and everything just fine), AND is a PNG image equally well (renders in browsers, preview, etc). What could file possibly tell you here? Literally completely random garbage is ALSO a valid ISO-8859-1 formatted text file. The only way to know that this is clearly not the intended purpose of the file is to use Artificial Intelligence algorithms to realize that the contents in no way form legible words in any language on the planet. That's a very hard problem and file doesn't try to solve it.
Thus, there's no real way to know if it is a PNG file, if all you have is a file on disk. The file extension is a good hint, but if it's missing, you're just guessing. You can toss it through a PNG reader, and if it doesn't crash, it probably is, but it could just be a picture with random static because it isn't really a PNG file.
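To make "looks at the contents and takes a stab" concrete, here is a rough sketch of a content check for PNG, in Python. It tests the fixed 8-byte PNG signature and reads the width and height from the IHDR chunk that follows it. That this resembles what file does internally is my assumption, and sniff_png is a made-up name. As the answer says, passing this check only makes "PNG" a good guess, not a certainty.

```python
import struct

# Every PNG stream begins with these 8 bytes.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def sniff_png(data):
    """Return (width, height) if `data` starts like a PNG, else None."""
    if len(data) < 24 or data[:8] != PNG_SIGNATURE:
        return None
    # First chunk: 4-byte big-endian length, then the 4-byte chunk type.
    _length, chunk_type = struct.unpack(">I4s", data[8:16])
    if chunk_type != b"IHDR":
        return None
    # IHDR payload begins with width and height as big-endian 32-bit ints.
    width, height = struct.unpack(">II", data[16:24])
    return width, height
```

Reading the first 24 bytes of the file and passing them in is enough; the rest of the file is never inspected, which is exactly why the result is only a hint.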
If you want to convert PNG files, ImageIO can do that.
Generally, the process that got you that file usually DOES know the format. For example, if you download it over the web, the web server didn't JUST send those bytes over. It also sent this header: Content-Type: image/png. THAT (and not the file extension) is what is the webserver's canonical truth. If the process that saves this file to disk elected to take that information and toss it in the garbage, well, now you're stuck guessing. If possible, go back to that part of the process and fix it so this info is no longer tossed in the bin. For example, if you have a shell script that uses wget to download a resource and then later on you have no idea if it's a PNG, or a JPG, or the output of a 'file not found' explanatory page in HTML, then fix wget to save that header and react accordingly.
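If the downloading step does keep the Content-Type header, turning it into a file extension is a small job. A sketch in Python using the standard mimetypes module; extension_for and the ".bin" fallback are names and choices I made up for the example.

```python
import mimetypes


def extension_for(content_type, default=".bin"):
    """Map a Content-Type header value to a file extension."""
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    # guess_extension returns None for types it does not know.
    return mimetypes.guess_extension(media_type) or default
```

For example, extension_for("image/png") gives ".png", while an unknown type falls back to the default, which is a signal to go and look at what the server actually sent.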
I have an InputStream of data that is the content of a file, but does not have any file information attached. I would like to be able to distinguish between cases when the data represents a *.zip file, and cases where it is a container file format (e.g. *.docx, *.odt, *.jar) that uses zip under the covers. I don't necessarily need to know what the container format is, just whether a stream is a "plain" zip or not (so I know whether it's appropriate to split the stream into separate files or not).
Is this possible? I'm happy to do the detection either after decompressing or before.
Ideally I'm trying to do this in Java, but if there are code examples in other languages then I'm happy to port them across if necessary.
There's no absolutely reliable and correct way to do this, because those formats that use the ZIP format as a container tend to be 100% valid and correct ZIP files.
So they are ZIP files.
However, since there's not an infinite number of those formats (and only a smaller subset of those are commonly found in the real world), you can probably get away with just specifically detecting those formats and treating everything that you don't recognize as a "real" ZIP file.
Most of these formats require some kind of easy-to-check identifier in the early bytes of the file, so if you are okay with writing specification-specific code it should be easy enough.
file detects most of those formats correctly, so looking into its source should give you enough pointers.
Some examples:
OpenDocument files (this file contains all kinds of archives, not just ODx files).
Office Open XML files
It's also quite likely (haven't checked) that Apache Tika already does all that detection.
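A minimal sketch of that "recognize the known containers, treat the rest as plain zip" approach, in Python. It relies on three conventions: Office Open XML packages carry a [Content_Types].xml entry, OpenDocument (and EPUB) archives start with an entry named mimetype, and JARs normally contain META-INF/MANIFEST.MF. The function name and the returned labels are assumptions for the example, and the list of formats is deliberately incomplete.

```python
import zipfile


def classify_zip(file_or_path):
    """Guess whether a zip is a known container format or a plain archive."""
    with zipfile.ZipFile(file_or_path) as z:
        names = z.namelist()
        if "[Content_Types].xml" in names:
            return "ooxml"  # docx, xlsx, pptx, ...
        if names and names[0] == "mimetype":
            # OpenDocument and EPUB store their media type in the first entry.
            return z.read("mimetype").decode("ascii", "replace").strip()
        if "META-INF/MANIFEST.MF" in names:
            return "jar"  # a manifest is usual, though not strictly required
    return "plain zip"
```

With only an InputStream-like source, the bytes would have to be buffered first (for instance into an io.BytesIO), because zipfile needs a seekable file to read the central directory.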
I need to duplicate various kinds of file types, change them a bit so that the original's md5 hash won't match the modified one, but keep them readable and not corrupted.
TXT files - that's obvious. I just add a random string to the end of the file.
PDF file - well I started looking for a java library to edit pdf files, but then I accidentally tried to open a pdf file in notepad++, and thought - why don't I try to add a random string to the end of the not readable content that I see there. Well, to my surprise it worked and the file wasn't corrupted.
ZIP file - I've tried the same that I did with pdf, and it also worked.
DOCX- the same method stopped working here. Appending just a space (" ") at the end of the binary content of a docx file that I open in a text editor, corrupts the file.
So what I need is:
java libraries for modifying office documents :doc, docx, xls, xlsx, ppt, pptx.
There are still file types whose md5 hash I need to change, but I don't think they are modifiable in Java: media files, executables, etc.
So, nevertheless, how can I do what I want with these files? Is there a way to just "touch" the file, change a header or something, and make it nonidentical to an untouched one?
edit:
Ok, here's the motivation - I want to generate massive amount of data as I asked here: How to produce massive amount of data?
At the time of that question, the answers I got there were enough, but now they aren't.
I need the data to be nonidentical. Pairs of files must fail md5 hash test.
I can't just generate random strings, because I need to simulate real files and documents.
I can't use existing data dumps, because I need various sizes of these data sets that include various file types. I need something that I'll give as an input the size, and it will generate the data for me.
So I figured that I should use a starting data set of all the file types that I eventually need, and just duplicate this data set.
java libraries for modifying office documents :doc, docx, xls, xlsx, ppt, pptx.
Apache POI is used to modify MS Office files. Note that newer formats (xlsx, docx, etc.) are simply ZIP files containing XML. Unzipping them and modifying plain text XML might work as well.
The same advice goes to ZIP files: try unzipping and modifying the easiest file.
But what are you actually trying to achieve? Note that randomly attaching some string at the end of the file works only by chance. On another computer, or with another version of the software, the file might be considered corrupted...
I would advise you either to store some metadata external to the file rather than comparing MD5, or to look deeper into the file formats. There are almost always headers and various pieces of metadata hidden in the file (ID3 tags in MP3, EXIF in images, etc.). It is much safer to modify those instead.
Also look for reserved or unused bytes; they are quite common. But again: why are you doing this in the first place?
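For zip-based files, one piece of metadata that can be changed without touching any entry is the archive-level comment. A minimal Python sketch; touch_zip and md5_of are hypothetical helper names, and I have not verified that every Office version accepts a docx/xlsx/pptx whose underlying zip carries a comment, so treat that part as an assumption to test on your own data set.

```python
import hashlib
import zipfile


def md5_of(path):
    """Hex MD5 of a file's bytes."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()


def touch_zip(path, nonce):
    """Change a zip's bytes (and so its hash) by setting the archive comment.

    `nonce` must be bytes, at most 65535 long. The entries themselves are
    left as they are; only the end-of-archive records are rewritten.
    """
    with zipfile.ZipFile(path, "a") as z:
        z.comment = nonce
```

Giving each copy a different nonce (a counter, for example) yields files that unzip to identical content but fail an MD5 comparison against each other.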
I have a directory with a name that contains Japanese characters, and I need to use the zip utils in java.util.zip to write it to a zip file. Writing the zip file succeeds, but when I open the resulting zip file with either Windows' built-in compressed file utility or 7-Zip, the directory with Japanese characters in the name appears as a bunch of garbage characters. I do have the Japanese/East Asian language pack installed on my system -- I can create directories with Japanese names, so that isn't the issue.
Interestingly, if I write a separate script to read the resulting zip file using java.util.zip, the directory name is correct, and I can extract the contents of the zip into appropriately named directories, with Japanese characters. But I can't do this using the commercial zip tools that I've tried, which is undoubtedly what our customers will want to do.
Any ideas about what is causing this problem, and how I can work around it?
I know about this bug, but I still need a workaround for this case.
TrueZIP claims to do this better:
The J2SE API always uses UTF-8 (eight bit Unicode character set) for entry names and comments instead of CP437 (a.k.a. IBM437, the genuine IBM-PC character set), which is used by the de-facto standard PKZIP from PKWARE. As a result, you cannot read or write ZIP files with international entry file names such as e.g. "täscht.txt" in a ZIP file created by a (southern) German.
[description of other problems omitted]
The TrueZIP Library has been developed to overcome these limitations/disadvantages.
Miracles do indeed happen, and Sun/Oracle really did fix the long-standing bug/RFE:
Now it's possible to [set up filename encodings upon creating][1] the zip file/stream (requires Java 7).
[1]: http://download.java.net/jdk7/docs/api/java/util/zip/ZipOutputStream.html#ZipOutputStream(java.io.OutputStream, java.nio.charset.Charset)
If java.util.zip still behaves as this post describes, I'm not sure if it is possible (with the built-in classes). I have seen Chilkat's Java Zip library mentioned before as a way to get this to work, but have never used it.
I just read about zip bombs, i.e. zip files that contain very large amount of highly compressible data (00000000000000000...).
When opened they fill the server's disk.
How can I detect a zip file is a zip bomb before unzipping it?
UPDATE Can you tell me how is this done in Python or Java?
Try this in Python:
import zipfile

# file_size is the uncompressed size each entry declares in the archive headers
with zipfile.ZipFile('a_file.zip') as z:
    print(f'total files size={sum(e.file_size for e in z.infolist())}')
Zip is, erm, an "interesting" format. A robust solution is to stream the data out, and stop when you have had enough. In Java, use ZipInputStream rather than ZipFile. The latter also requires you to store the data in a temporary file, which is also not the greatest of ideas.
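The "stream it out and stop when you have had enough" idea can be sketched like this. It uses Python's zipfile rather than Java's ZipInputStream, but the point is the same: count the bytes that actually come out and give up past a budget, whatever the headers claim. TooLarge, extract_capped, the sink callback and the default limits are all made up for the example.

```python
import zipfile


class TooLarge(Exception):
    """Raised when the decompressed output exceeds the allowed budget."""


def extract_capped(file_or_path, sink, max_total=100 * 1024 * 1024,
                   chunk=64 * 1024):
    """Stream every entry to sink(name, block); abort past max_total bytes.

    Returns the total number of decompressed bytes delivered.
    """
    total = 0
    with zipfile.ZipFile(file_or_path) as z:
        for info in z.infolist():
            with z.open(info) as entry:
                while True:
                    block = entry.read(chunk)  # at most `chunk` bytes out
                    if not block:
                        break
                    total += len(block)
                    if total > max_total:
                        raise TooLarge(
                            f"more than {max_total} bytes after {info.filename}")
                    sink(info.filename, block)
    return total
```

Here sink is whatever consumes the data (writing to a file, hashing, and so on); because the check happens before each block is handed over, the consumer never receives more than the budget.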
Reading over the description on Wikipedia -
Deny any compressed files that contain compressed files.
Use ZipFile.entries() to retrieve a list of files, then ZipEntry.getName() to find the file extension.
Deny any compressed files that contain files over a set size, or the size can not be determined at startup.
While iterating over the files use ZipEntry.getSize() to retrieve the file size.
Don't allow the upload process to write enough data to fill up the disk, i.e. solve the problem, not just one possible cause of the problem.
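The first two checks in that list can be sketched in Python as a pass over the archive's entry list before anything is extracted. precheck, the suffix list and the 50 MB limit are assumptions for the example. The sizes come from the archive's own headers and can be forged, which is why the last point in the list still matters.

```python
import zipfile

# Extensions treated as "compressed file inside a compressed file" (assumed list).
ARCHIVE_SUFFIXES = (".zip", ".gz", ".bz2", ".xz", ".7z", ".rar", ".tar",
                    ".tgz", ".jar")


def precheck(file_or_path, max_entry_size=50 * 1024 * 1024):
    """Return a list of reasons to reject the archive; empty means it passed."""
    problems = []
    with zipfile.ZipFile(file_or_path) as z:
        for info in z.infolist():
            if info.filename.lower().endswith(ARCHIVE_SUFFIXES):
                problems.append(f"nested archive: {info.filename}")
            # file_size is the uncompressed size declared in the headers.
            if info.file_size > max_entry_size:
                problems.append(
                    f"too large: {info.filename} ({info.file_size} bytes)")
    return problems
```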
Check a zip header first :)
If the ZIP decompressor you use can provide the data on original and compressed size you can use that data. Otherwise start unzipping and monitor the output size - if it grows too much cut it loose.
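When the library does expose both sizes, as Python's zipfile does through compress_size and file_size, the overall ratio is a quick first signal. A small sketch; compression_ratio is a made-up name, and any threshold you compare against is a judgment call, since the numbers come from the archive's headers.

```python
import zipfile


def compression_ratio(file_or_path):
    """Declared uncompressed size divided by compressed size (0.0 if empty)."""
    with zipfile.ZipFile(file_or_path) as z:
        infos = z.infolist()
        packed = sum(i.compress_size for i in infos)
        unpacked = sum(i.file_size for i in infos)
    return unpacked / packed if packed else 0.0
```

Ordinary documents rarely compress by more than a modest factor, so a ratio in the hundreds is a reason to fall back to monitoring the output size while unzipping.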
Make sure you are not using your system drive for temp storage. I am not sure if a virusscanner will check it if it encounters it.
You can also look at the information inside the zip file and retrieve a list of its contents. How to do this depends on the utility used to extract the file, so you need to provide more information here.