I have an encrypted ODT (Open Document Text) file and I need to unzip it. ODT is a ZIP file. An encrypted ODT is a normal ZIP file, just some files inside the ZIP are encrypted.
Using ZipFile works fine in a test, but I can't really use ZipFile because my data is a stream in memory and I don't want to work with a file.
Therefore I use ZipInputStream. But using ZipInputStream.getNextEntry() throws the dreadful
only DEFLATED entries can have EXT descriptor
exception.
From what I can understand, it throws on the first encrypted file inside the ZIP package, for example on content.xml. Because OpenOffice had already encrypted the XML file, there was probably no point in compressing it, so it was stored inside the ZIP package uncompressed.
But ZipInputStream seems to have a problem with that, and I don't see a way around it.
And yes, the encrypted ODT file was created by OpenOffice Writer 3.2.1. And yes, the stock ZipInputStream cannot even enumerate through entries in it.
Anything you can suggest?
You could check whether this is possible with the ODF Toolkit library.
The problem has nothing to do with encryption, but with the fact that ZipInputStream does not expect (and does not know how to handle) an EXT descriptor when the associated data was not DEFLATED (i.e. was stored uncompressed, as-is). This may well be a deficiency ("bug") in ZipInputStream, but I am not familiar enough with the zip specs to know one way or another.
An inelegant, even downright ugly workaround is to persist the stream to a temporary file, and then process it as a ZipFile.
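A minimal sketch of that workaround, assuming the data arrives as a plain InputStream. The class and method names (TempFileZipReader, listEntries) and the temp file prefix/suffix are made up for illustration:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

final class TempFileZipReader {

    // Persists the stream to a temporary file, then enumerates it with ZipFile
    // instead of ZipInputStream. The temp file is removed afterwards.
    static List<String> listEntries(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("odt-", ".zip"); // arbitrary prefix/suffix
        try {
            // createTempFile already created the file, so it must be replaced.
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            List<String> names = new ArrayList<>();
            try (ZipFile zip = new ZipFile(tmp.toFile())) {
                Enumeration<? extends ZipEntry> entries = zip.entries();
                while (entries.hasMoreElements()) {
                    names.add(entries.nextElement().getName());
                }
            }
            return names;
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

To read an entry's bytes rather than just its name, zip.getInputStream(entry) can be used inside the same try block. The encrypted entries still come back encrypted, of course.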
(I am the author of ODFind and the "Decrypting ODF Files" document mentioned above.)
Have you stumbled upon what Ringlord did in ODFind to read encrypted ODF files? This ODF document (viewable as HTML here courtesy of Google) claims there is simply no way to rely solely on the Java libraries to decrypt OpenOffice.org documents. However, the author explains how one can decrypt the content.xml payload of an ODF file with knowledge of the ODF Manifest, RFC 2898 (PBKDF2), the PBKDF2Engine in JBoss3 and some original discovery by the author.
Best wishes.
DISCLAIMER: I have no affiliation whatsoever with Ringlord, even though every link above points to Ringlord content.
Related
Problem:
I read and write large multipage tiffs. During my test I have seen plain tiffs on disk. I know I can disable writing to disk with
ImageIO.setUseCache(false)
but then all data is held in memory, which may lead to an OutOfMemoryError.
Is there any way to encrypt the cache/temp file created by ImageIO.createImageInputStream() and ImageIO.createImageOutputStream()?
The options I can or will try:
Registering a custom ImageInputStream/ImageOutputStream (and Spi) for encrypted files, similar to "javax.imageio.stream.FileImageInputStream". Is there any documentation or tutorial on how to do that?
Extending RandomAccessFile so that it writes encrypted data and reads decrypted data, since the existing "javax.imageio.stream.FileImageInputStream" already accepts a RandomAccessFile. Is there already a solution for that?
As a last resort, securing/encrypting the temp folder outside of my Java app, but that would be error prone.
PS: I would use AES128/256 encryption with temp. key/IV (save that in memory) !
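To illustrate the PS, here is a rough sketch of a temp file that is encrypted with an AES key and IV that only ever exist in memory. It does not plug into ImageIO by itself. The class name EncryptedTempFile, the 128-bit key size and the choice of AES/CTR/NoPadding are assumptions made for the example:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.CipherInputStream;
import javax.crypto.CipherOutputStream;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;

final class EncryptedTempFile {
    private final SecretKey key;            // never written to disk
    private final byte[] iv = new byte[16]; // never written to disk
    private final Path file;

    EncryptedTempFile() throws IOException, GeneralSecurityException {
        KeyGenerator gen = KeyGenerator.getInstance("AES");
        gen.init(128);
        key = gen.generateKey();
        new SecureRandom().nextBytes(iv);
        file = Files.createTempFile("cache-", ".bin"); // arbitrary prefix/suffix
    }

    private Cipher cipher(int mode) throws GeneralSecurityException {
        Cipher c = Cipher.getInstance("AES/CTR/NoPadding");
        c.init(mode, key, new IvParameterSpec(iv));
        return c;
    }

    // Write only once per instance: reusing the same key and IV for
    // different data in CTR mode would leak information.
    OutputStream openForWrite() throws IOException, GeneralSecurityException {
        return new CipherOutputStream(Files.newOutputStream(file), cipher(Cipher.ENCRYPT_MODE));
    }

    InputStream openForRead() throws IOException, GeneralSecurityException {
        return new CipherInputStream(Files.newInputStream(file), cipher(Cipher.DECRYPT_MODE));
    }

    Path path() {
        return file;
    }
}
```

This only covers sequential reading and writing. The random access that ImageInputStream needs would still have to be built on top of it.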
I have an InputStream of data that is the content of a file, but does not have any file information attached. I would like to be able to distinguish between cases when the data represents a *.zip file, and cases where it is a container file format (e.g. *.docx, *.odt, *.jar) that uses zip under the covers. I don't necessarily need to know what the container format is, just whether a stream is a "plain" zip or not (so I know whether it's appropriate to split the stream into separate files or not).
Is this possible? I'm happy to do the detection either after decompressing or before.
Ideally I'm trying to do this in Java, but if there are code examples in other languages then I'm happy to port them across if necessary.
There's no absolutely reliable and correct way to do this, because those formats that use the ZIP format as a container tend to be 100% valid and correct ZIP files.
So they are ZIP files.
However, since there's not an infinite number of those formats (and only a smaller subset of those are commonly found in the real world), you can probably get away with just specifically detecting those formats and treating everything that you don't recognize as a "real" ZIP file.
Most of these formats require some kind of easy-to-check identifier in the early bytes of the file, so if you are okay with writing specification-specific code it should be easy enough.
The file utility detects most of those formats correctly, so looking into its source should give you enough pointers.
Some examples:
OpenDocument files (this file contains all kinds of archives, not just ODx files).
Office Open XML files
It's also quite likely (haven't checked) that Apache Tika already does all that detection.
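As a rough sketch of the "detect the known formats, treat the rest as plain ZIP" approach: OpenDocument packages carry a "mimetype" entry as their first entry, Office Open XML packages contain "[Content_Types].xml", and JARs usually contain "META-INF/MANIFEST.MF". The class name, the enum and the set of checks are my own choices and far from complete:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

final class ZipContainerSniffer {

    enum Kind { PLAIN_ZIP, OPEN_DOCUMENT, OFFICE_OPEN_XML, JAR, NOT_A_ZIP_OR_EMPTY }

    // Walks the entry names and returns the first container format recognized.
    // The stream is consumed and closed.
    static Kind sniff(InputStream in) throws IOException {
        boolean sawAny = false;
        boolean sawJarManifest = false;
        try (ZipInputStream zin = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                String name = entry.getName();
                if (!sawAny && name.equals("mimetype")) {
                    // Read the content of the first entry and check its media type.
                    String type = new String(zin.readAllBytes(), StandardCharsets.US_ASCII);
                    if (type.startsWith("application/vnd.oasis.opendocument.")) {
                        return Kind.OPEN_DOCUMENT;
                    }
                    // Other formats with a mimetype entry fall through here.
                }
                if (name.equals("[Content_Types].xml")) {
                    return Kind.OFFICE_OPEN_XML;
                }
                if (name.equals("META-INF/MANIFEST.MF")) {
                    sawJarManifest = true;
                }
                sawAny = true;
            }
        }
        if (!sawAny) {
            return Kind.NOT_A_ZIP_OR_EMPTY;
        }
        return sawJarManifest ? Kind.JAR : Kind.PLAIN_ZIP;
    }
}
```

Anything that comes back as PLAIN_ZIP is simply "not recognized", so the list of checks has to grow with the formats you care about.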
This is to do with the AEM Translation API: the TranslationObject can give us a context package as a ZipInputStream. We need to take that input and either convert it to a Base64 String to send as part of an XML document, or convert it to an actual zip file to save locally.
I tried quite a few things, but all of them resulted in a 0 byte file, most likely because ZipInputStream has to be read entry by entry via getNextEntry(). (Again, the use of ZipInputStream is forced upon us by the API.)
Any insight would be greatly appreciated.
The other solution would be to extract the stream to disk, create a new zip from there, and open that file as an InputStream to convert it, which feels quite convoluted for nothing.
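An in-memory variant of that idea might look like the sketch below. ZipInputStream.read() returns -1 until getNextEntry() has positioned the stream on an entry, which would explain the 0 byte result, so the sketch walks the entries and re-packs them into a new ZIP held in a byte array. The class and method names are made up, and nothing here is specific to the AEM API:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Base64;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class ZipStreamRepacker {

    // Re-packs every entry of the given ZipInputStream into a fresh in-memory ZIP.
    static byte[] repack(ZipInputStream zin) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(bytes)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                // Only the name is carried over; sizes and CRC are recomputed on write.
                zout.putNextEntry(new ZipEntry(entry.getName()));
                zin.transferTo(zout); // copies the current entry's data only
                zout.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    // Base64 text of the re-packed ZIP, suitable for embedding in XML.
    static String toBase64(ZipInputStream zin) throws IOException {
        return Base64.getEncoder().encodeToString(repack(zin));
    }
}
```

To save a local zip file instead, the array returned by repack can be written out with Files.write. The result is a new archive with the same entry names and contents, not a byte-for-byte copy of the original package.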
I'm making two Java applications one to collect data, another to use it. The one collecting will be importing a file from the other which will include data and images and will be decrypted.
I'm unsure what filetype to use. So far all of the data is in XML and works great but I need the images and was hoping not to have to rely on giving all the images in a folder with a path reference.
Ideas?
Well, I think that the best way is to create your own format (.myformat or .data). This file will in fact be a Zip file that contains your XML file and images.
There is no perfect example written in Java as far as I know. However, here are some examples:
Not in Java
The best example is, as @Bolo said, the odt format. Indeed, OpenOffice writes the document to an XML file, and the images too. All of that is wrapped in an odt file.
The .exe file is another example. The compiled code and the resources are put in a single file. Try to open it with 7-zip, you'll see.
The Skyrim plugins are .esp files that contain the dds, the scripts, the niffs (textures)...
In Java
The Minecraft texture packs are zip files that contain a .mcmeta file (the infos) and the textures (.png)
Jar files are like exe.
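The "own format that is really a Zip file" idea could be sketched roughly like this. The entry names ("data.xml", "images/...") and the Bundle class are invented for the example, not a convention from any of the formats listed above:

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

final class Bundle {

    // Writes the XML and all images into a single ZIP-based file.
    static void write(Path target, byte[] xml, Map<String, byte[]> images) throws IOException {
        try (ZipOutputStream zout = new ZipOutputStream(Files.newOutputStream(target))) {
            zout.putNextEntry(new ZipEntry("data.xml"));
            zout.write(xml);
            zout.closeEntry();
            for (Map.Entry<String, byte[]> image : images.entrySet()) {
                zout.putNextEntry(new ZipEntry("images/" + image.getKey()));
                zout.write(image.getValue());
                zout.closeEntry();
            }
        }
    }

    // Reads one entry back, e.g. "data.xml" or "images/logo.png".
    static byte[] read(Path bundle, String entryName) throws IOException {
        try (ZipFile zip = new ZipFile(bundle.toFile())) {
            ZipEntry entry = zip.getEntry(entryName);
            if (entry == null) {
                throw new FileNotFoundException(entryName);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return in.readAllBytes();
            }
        }
    }
}
```

The XML can then refer to images by their entry name inside the bundle rather than by a path on disk.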
If both programs are in java you could also go with serialization, which is basically saving an object as a file (suffix will be .ser I think) and then being able to retrieve it. You should google it, even if it won't help right now it is quite good to know about it.
I'd suggest using JSON. Gson is a decent library.
You can embed images as byte arrays.
Save the serialized string in a file with a preferred extension, read it from the second application, de-serialize, and reconstruct images.
You can convert binary image data to text with Base64 encoding and this way you can embed your images in XML. [1]: http://en.wikipedia.org/wiki/Base64
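A small sketch of that, using java.util.Base64. The "image" element name and the helper class are just placeholders:

```java
import java.util.Base64;

final class Base64Image {

    // Wraps the image bytes as Base64 text inside an XML element.
    // The Base64 alphabet (A-Z, a-z, 0-9, +, /, =) needs no XML escaping.
    static String toXmlElement(byte[] imageBytes) {
        return "<image>" + Base64.getEncoder().encodeToString(imageBytes) + "</image>";
    }

    // Turns the element's text content back into the original bytes.
    static byte[] fromBase64(String encoded) {
        return Base64.getDecoder().decode(encoded);
    }
}
```

Keep in mind that Base64 makes the data about a third larger than the binary original.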
I need to duplicate various kinds of file types, change them a bit so that the original's md5 hash won't match the modified one, but keep them readable and not corrupted.
TXT files - that's obvious. I just add a random string to the end of the file.
PDF file - well I started looking for a java library to edit pdf files, but then I accidentally tried to open a pdf file in notepad++, and thought - why don't I try to add a random string to the end of the not readable content that I see there. Well, to my surprise it worked and the file wasn't corrupted.
ZIP file - I've tried the same that I did with pdf, and it also worked.
DOCX- the same method stopped working here. Appending just a space (" ") at the end of the binary content of a docx file that I open in a text editor, corrupts the file.
So what I need is:
java libraries for modifying office documents :doc, docx, xls, xlsx, ppt, pptx.
There are still file types whose md5 hash output I need to change, but I don't think they are modifiable in Java: media files, executables, etc.
So, nevertheless, how can I do what I want with these files? Is there a way to just "touch" the file, change a header or something and make it nonidentical to an untouched one?
edit:
Ok, here's the motivation - I want to generate massive amount of data as I asked here: How to produce massive amount of data?
At the time of that question, the answers I got there were enough, but now they aren't.
I need the data to be nonidentical. Pairs of files must fail md5 hash test.
I can't just generate random strings, because I need to simulate real files and documents.
I can't use existing data dumps, because I need various sizes of these data sets that include various file types. I need something that I'll give as an input the size, and it will generate the data for me.
So I figured that I should use a starting data set of all the file types that I eventually need, and just duplicate this data set.
java libraries for modifying office documents :doc, docx, xls, xlsx, ppt, pptx.
Apache POI is used to modify MS Office files. Note that newer formats (xlsx, docx, etc.) are simply ZIP files containing XML. Unzipping them and modifying plain text XML might work as well.
The same advice goes to ZIP files: try unzipping and modifying the easiest file.
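As a variation on this, here is a sketch that rewrites a ZIP-based file entry by entry and only changes the entries' timestamps, so the contents stay the same while the bytes (and therefore the MD5) differ. This is my own swap-in for "modifying the easiest file", the names are made up, and I have not checked how every Office version reacts to a re-packed archive:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

final class ZipRestamper {

    // Copies src to dst, giving every entry the supplied last-modified time.
    static void restamp(Path src, Path dst, long epochMillis) throws IOException {
        try (ZipFile zip = new ZipFile(src.toFile());
             ZipOutputStream zout = new ZipOutputStream(Files.newOutputStream(dst))) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry in = entries.nextElement();
                ZipEntry out = new ZipEntry(in.getName());
                out.setTime(epochMillis); // the only intentional change
                zout.putNextEntry(out);
                if (!in.isDirectory()) {
                    try (InputStream data = zip.getInputStream(in)) {
                        data.transferTo(zout);
                    }
                }
                zout.closeEntry();
            }
        }
    }
}
```

Calling it with a different timestamp per copy gives as many nonidentical duplicates as needed. ZIP timestamps are coarse, so the values should be at least a few seconds apart.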
But what are you actually trying to achieve? Note that randomly attaching some string at the end of the file works only by chance. On another computer or with another version of the software, the file might be considered corrupted...
I would advise you to either store some metadata external to the file rather than comparing MD5, or look deeper into the file formats. There are almost always headers and various pieces of metadata hidden in the file (ID3 tags in MP3, EXIF in images, etc.). It is much safer to modify those instead.
Also look for reserved/unused bytes; they are quite common. But again: why are you doing this in the first place?