Is it possible to efficiently append a line into a zip or gzip file?
I'm storing equity market data directly into the file system and I have around 40 different files which are being updated every 5ms.
What's the best way of doing this?
Use a database, not a zip file.
Database is recommended.
If you really want to use plain text files, put them directly on the file system (and if you are using Linux, choose a suitable file system for them).
If you do want to use plain text files and keep them inside a zip file, look at the zip file system options below:
java.nio.file.FileSystems:
http://docs.oracle.com/javase/7/docs/technotes/guides/io/fsp/zipfilesystemprovider.html
The zip file system provider introduced in the Java SE 7 release is an implementation of a custom file system provider. The zip file system provider treats a zip or JAR file as a file system and provides the ability to manipulate the contents of the file. The zip file system provider creates multiple file systems — one file system for each zip or JAR file.
TrueZip
http://truezip.java.net/
TrueZIP is a Java based plug-in framework for virtual file systems (VFS) which provides transparent access to archive files as if they were just plain directories.
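For the java.nio zip file system provider mentioned above, here is a minimal sketch of what "appending a line" to an entry could look like. The class name, the entry name AAPL.txt and the sample lines are made up for illustration. Note that this reads the entry, extends it and writes it back, so it is not a cheap in-place append, and as far as I know the provider writes out the archive again when the file system is closed.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

final class ZipFsAppendSketch {

    // "create=true" asks the provider to create the archive if it does not exist yet.
    private static final Map<String, String> ENV = Collections.singletonMap("create", "true");

    /** Adds one line to a text entry by reading it, extending it and writing it back. */
    static void appendLine(Path zip, String entryName, String line) {
        URI uri = URI.create("jar:" + zip.toUri());
        try (FileSystem fs = FileSystems.newFileSystem(uri, ENV)) {
            Path entry = fs.getPath("/", entryName);
            List<String> lines = new ArrayList<>();
            if (Files.exists(entry)) {
                lines.addAll(Files.readAllLines(entry, StandardCharsets.UTF_8));
            }
            lines.add(line);
            // Rewrites the whole entry; there is no in-place append here.
            Files.write(entry, lines, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads all lines of a text entry inside the archive. */
    static List<String> readLines(Path zip, String entryName) {
        URI uri = URI.create("jar:" + zip.toUri());
        try (FileSystem fs = FileSystems.newFileSystem(uri, ENV)) {
            return Files.readAllLines(fs.getPath("/", entryName), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Creates a scratch archive in a temp directory and appends two sample lines. */
    static List<String> demo() {
        try {
            Path zip = Files.createTempDirectory("zipfs-sketch").resolve("ticks.zip");
            appendLine(zip, "AAPL.txt", "09:30:00.000,189.10");
            appendLine(zip, "AAPL.txt", "09:30:00.005,189.11");
            return readLines(zip, "AAPL.txt");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Opening and closing the file system for every line would be far too slow for 5 ms updates, so this only makes sense combined with batching.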
And remember: use memory to cache, reduce disk operations, and make the writing non-blocking.
A complete cycle of editing a zip file (I mean open-modify-close) will produce too much overhead. I think it is better to accumulate changes in memory and modify the target file at some reasonable rate (e.g. every 5 seconds or even less often).
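A rough sketch of that idea, assuming gzip rather than zip: lines are collected in memory, and each flush appends one new gzip member to the end of the file. A gzip file may consist of several concatenated members, and java.util.zip.GZIPInputStream reads them back as one stream. The class name BatchedGzipAppender and the sample data are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class BatchedGzipAppender {
    private final Path file;
    private final StringBuilder pending = new StringBuilder();

    BatchedGzipAppender(Path file) {
        this.file = file;
    }

    /** Cheap: only touches memory. */
    synchronized void append(String line) {
        pending.append(line).append('\n');
    }

    /** Writes everything buffered so far as one new gzip member at the end of the file. */
    synchronized void flush() {
        if (pending.length() == 0) {
            return;
        }
        byte[] batch = pending.toString().getBytes(StandardCharsets.UTF_8);
        try (OutputStream raw = Files.newOutputStream(file,
                     StandardOpenOption.CREATE, StandardOpenOption.APPEND);
             GZIPOutputStream gz = new GZIPOutputStream(raw)) {
            gz.write(batch);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        pending.setLength(0);
    }

    /** Reads all lines back; concatenated gzip members are decoded as one stream. */
    static List<String> readAll(Path file) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(file)), StandardCharsets.UTF_8))) {
            for (String line; (line = reader.readLine()) != null; ) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    /** Two flushes produce two gzip members in the same scratch file. */
    static List<String> demo() {
        try {
            Path file = Files.createTempDirectory("gzip-sketch").resolve("ticks.csv.gz");
            BatchedGzipAppender appender = new BatchedGzipAppender(file);
            appender.append("t1,100.0");
            appender.append("t2,100.5");
            appender.flush();
            appender.append("t3,101.0");
            appender.flush();
            return readAll(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In a real writer flush() would be driven by a timer (for example a ScheduledExecutorService) instead of being called by hand. The trade-off is that anything still in memory is lost if the process dies between flushes, and very small batches compress poorly.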
One approach could be like compressing data sent over a socket and flushing compressed blocks to disk from time to time.
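You can use ZipOutputStream's write method to write an entry starting from a given offset in a byte array. Note that the offset is a position in the array, not a position in the zip file.
// Needs: java.io.*, java.nio.charset.StandardCharsets, java.util.zip.*
// mWriter is assumed to be a Writer (e.g. a StringWriter) holding the updated text.
String filepath = "/tmp/updated.txt";   // used as the entry name inside the archive
byte[] file = mWriter.toString().getBytes(StandardCharsets.UTF_8);
int yourOffset = 0;                     // offset into the byte array
// Opening the FileOutputStream truncates /tmp/example.zip, so the archive is rewritten.
try (FileOutputStream fout = new FileOutputStream("/tmp/example.zip");
     ZipOutputStream zout = new ZipOutputStream(fout)) {
    zout.putNextEntry(new ZipEntry(filepath));
    zout.write(file, yourOffset, file.length - yourOffset);
    zout.closeEntry();
} catch (IOException e) {
    e.printStackTrace();
}
Convert the updated text to a byte array (String.getBytes is enough, Apache Commons IOUtils is not required), then replace the zip entry by writing that array. The offset argument selects where in the array writing starts; in this case the entire content is written, from 0 to file.length. ZipOutputStream cannot patch bytes in the middle of an existing entry, and the name passed to ZipEntry is only the entry's name inside the archive, so the whole entry (and, with this code, the whole archive) is written again.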
Related
Using BufferedWriter.write() when is a file created?
I know from the docs that when the buffer is filled it will flush to file, does this mean that:
every time the buffer is filled an incomplete file will appear on my file system?
or that the file is only created when the BufferedWriter is closed?
My concern is that I am writing files to a directory using a BufferedWriter and another process is polling the directory for new files and reading them. I do not want an incomplete file to be created and be read by the other process.
Using BufferedWriter.write() when is a file created?
Never. BufferedWriter itself just writes to another Writer. Now if you're using a FileOutputStream or a FileWriter (where the first would probably be wrapped in an OutputStreamWriter) the file is created (or opened for write if it already exists) when you construct the object, i.e. before you've actually written any data.
My concern is that I am writing files to a directory using a BufferedWriter and another process is polling the directory for new files and reading them. I do not want an incomplete file to be created and be read by the other process.
One typical way of handling this is to write to a staging area and then rename the file into the correct place, which is usually an atomic operation. Or even write the file into the correct directory, but with a file extension which the polling process won't spot - and then rename the file to the final filename afterwards.
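A small sketch of the write-then-rename approach with java.nio.file. The ".part" suffix and the file name quotes.csv are hypothetical; the only requirement is that the polling process ignores the staging name. ATOMIC_MOVE throws AtomicMoveNotSupportedException if the file system cannot do the move atomically, for example when staging and target are on different file systems.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

final class AtomicPublishSketch {

    /** Writes the content under a staging name, then renames it into place. */
    static Path publish(Path dir, String fileName, byte[] content) {
        Path staging = dir.resolve(fileName + ".part"); // name the poller is assumed to skip
        Path target = dir.resolve(fileName);
        try {
            Files.write(staging, content);
            // The file becomes visible under its final name only once it is complete.
            return Files.move(staging, target, StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Publishes one file into a scratch directory and lists what a poller would see. */
    static List<String> demo() {
        try {
            Path dir = Files.createTempDirectory("publish-sketch");
            publish(dir, "quotes.csv", "id,price\n1,189.10\n".getBytes(StandardCharsets.UTF_8));
            List<String> names = new ArrayList<>();
            try (DirectoryStream<Path> listing = Files.newDirectoryStream(dir)) {
                for (Path p : listing) {
                    names.add(p.getFileName().toString());
                }
            }
            return names;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```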
BufferedWriter doesn't create a file as Jon Skeet said. And you cannot guarantee that another process won't read an incomplete file when it is being written to disk. But there are two things you can do:
Lock the file so that the other process cannot read it before writing is complete. There are several questions concerning file locking in Java on this site (search for "[java] lock file").
Create the file with another filename (i.e. use an extension that is not being looked for by the other process) and rename it when writing is finished.
Using TrueZIP, is there a way to open and modify an existing ZIP file from a stream (it may of course be outputted using another stream)?
I have code for modifying a ZIP that works perfectly as long as I work on an existing real ZIP file on the file system but I have a requirement that all temporary files need to be encrypted while stored on disk. In most parts of our application this is easy to achieve (using CipherOutputStream and CipherInputStream) but I have one function that uses TrueZIP to update an existing ZIP file. This part obviously fails if the file is encrypted.
The ZIP files will be consumed by proprietary applications that do not support encryption so using the encryption that is part of the ZIP specification isn't possible.
The reason we are using TrueZIP is that we need the support for Zip64 (which I know is included in Java 7 but we cannot switch right now).
No, an archive file must be stored in an accessible file system to be used with TrueZIP. But you have a number of other options:
TrueZIP uses instances of the IOPoolService interface to manage temporary files. You could provide your own implementation which encrypts all temporary files or maybe even just stores them on the heap (if they are small). Have a look at the TrueZIP Driver FILE to see the reference implementation.
You could use the ParanoidZipRaesDriver to use RAES encrypted ZIP files. This driver ensures that no unencrypted temporary files are used by limiting the number of concurrent threads for writing an archive file to one.
You could use the standard ZIP drivers with FsOutputOption.ENCRYPT to switch on WinZip AES encryption. To ensure that no unencrypted temporary files are used, you could then override the ZipDriver.newOutputSocket method just like the ParanoidZipRaesDriver does.
I was wondering if there was any way to create a FileInputStream object from just a file object without creating an actual file on the file system? What I am attempting to do is create a file object with some information, and then upload that file somewhere else. I have no need for it to be on the local file system. I know that I could just create a temp folder and then delete it afterwards, but was wondering if it was possible to not do it that way?
What I am attempting to do is create a file object with some
information, and then upload that file somewhere else
In that case you should not work with any file-related classes at all. Instead, create a byte array, which you can treat as an InputStream via ByteArrayInputStream.
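A minimal sketch of that, assuming the upload code only needs an InputStream. The upload method here is a hypothetical stand-in that just returns what it received; replace it with whatever client you actually use.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class InMemoryUploadSketch {

    /** Wraps in-memory content in an InputStream; no file is created anywhere. */
    static InputStream asStream(String content) {
        return new ByteArrayInputStream(content.getBytes(StandardCharsets.UTF_8));
    }

    /** Hypothetical stand-in for the real upload call; returns what it received. */
    static String upload(InputStream in) {
        ByteArrayOutputStream received = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            for (int n; (n = in.read(buf)) != -1; ) {
                received.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(received.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.print(upload(asStream("id,price\n1,189.10\n")));
    }
}
```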
You are probably looking for a ByteArrayInputStream or something similar.
A file input stream reads from a file on disk, that is its purpose. By the way, a File object in Java does not really represent a file, but rather the path pointing to a (potential) file on disk.
Try creating a memory stream, so that your data is stored in memory instead of on the file system.
I've got many files that I want to store in a single archive file. My first approach was to store the files in a gzipped tarball. The problem is that I have to rewrite the whole archive if a single file is added.
I could get rid of the gzip compression, but adding a file would still be expensive.
What other archive format would you suggest that allows fast append operations?
The ZIP file format was designed to allow appends without a total re-write and is ubiquitous, even on Unix.
ZIP and TAR formats (and the old AR format) allow file append without a full rewrite. However:
The Java archive classes DO NOT support this mode of operation.
File append is likely to result in multiple copies of a file in the archive if you append an existing file.
The ZIP and AR formats have a directory that needs to be rewritten following a file append operation. The standard utilities take precautions when rewriting the directory, but it is possible in theory that you may end up with an archive with a missing or corrupted directory if the append fails.
I just read about zip bombs, i.e. zip files that contain very large amount of highly compressible data (00000000000000000...).
When opened they fill the server's disk.
How can I detect a zip file is a zip bomb before unzipping it?
UPDATE: Can you tell me how this is done in Python or Java?
Try this in Python:
import zipfile
with zipfile.ZipFile('a_file.zip') as z:
    print(f'total files size={sum(e.file_size for e in z.infolist())}')
Zip is, erm, an "interesting" format. A robust solution is to stream the data out, and stop when you have had enough. In Java, use ZipInputStream rather than ZipFile. The latter also requires you to store the data in a temporary file, which is also not the greatest of ideas.
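A sketch of the streaming approach in Java: read the entries through ZipInputStream, count the bytes that actually come out, and give up as soon as a limit is exceeded. The class name and the limits are made up, and the sketch only counts bytes; a real extractor would write each buffer to its destination inside the same loop.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class ZipStreamLimitSketch {

    /** Returns false as soon as the decompressed data exceeds maxBytes. */
    static boolean isWithinLimit(InputStream zipData, long maxBytes) {
        byte[] buf = new byte[8192];
        long total = 0;
        try (ZipInputStream zin = new ZipInputStream(zipData)) {
            while (zin.getNextEntry() != null) {
                for (int n; (n = zin.read(buf)) != -1; ) {
                    total += n; // counts real output, not the sizes declared in headers
                    if (total > maxBytes) {
                        return false;
                    }
                }
            }
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a small in-memory archive holding `size` zero bytes, for testing. */
    static byte[] zipOfZeros(int size) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(out)) {
            zout.putNextEntry(new ZipEntry("zeros.bin"));
            zout.write(new byte[size]);
            zout.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] archive = zipOfZeros(1_000_000);
        System.out.println("archive bytes: " + archive.length);
        System.out.println(isWithinLimit(new ByteArrayInputStream(archive), 10_000L));
    }
}
```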
Reading over the description on Wikipedia -
Deny any compressed files that contain compressed files.
Use ZipFile.entries() to retrieve a list of files, then ZipEntry.getName() to find the file extension.
Deny any compressed files that contain files over a set size, or whose size cannot be determined up front.
While iterating over the files use ZipEntry.getSize() to retrieve the file size.
Don't allow the upload process to write enough data to fill up the disk, i.e. solve the problem, not just one possible cause of the problem.
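The first two checks in the list above could be sketched roughly like this, using ZipFile.entries(), ZipEntry.getName() and ZipEntry.getSize(). The class name, the list of "archive" extensions and the size limit are assumptions for illustration. Keep in mind that the declared sizes come from the archive itself and can be forged, so this is only a cheap first filter and not a replacement for limiting the actual output.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Enumeration;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

final class ZipDeclaredSizeSketch {

    /** Rejects nested archives and entries whose declared size is unknown or too large. */
    static boolean looksAcceptable(Path zip, long maxEntryBytes) {
        try (ZipFile zf = new ZipFile(zip.toFile())) {
            Enumeration<? extends ZipEntry> entries = zf.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                String name = entry.getName().toLowerCase(Locale.ROOT);
                if (name.endsWith(".zip") || name.endsWith(".gz")) {
                    return false; // compressed file inside a compressed file
                }
                long size = entry.getSize(); // -1 when the size is not known
                if (size < 0 || size > maxEntryBytes) {
                    return false;
                }
            }
            return true;
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    /** Writes a scratch archive with one entry of `size` zero bytes, for testing. */
    static Path writeZip(String entryName, int size) {
        try {
            Path zip = Files.createTempFile("declared-size-sketch", ".zip");
            try (ZipOutputStream zout = new ZipOutputStream(Files.newOutputStream(zip))) {
                zout.putNextEntry(new ZipEntry(entryName));
                zout.write(new byte[size]);
                zout.closeEntry();
            }
            return zip;
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(looksAcceptable(writeZip("data.txt", 100), 1_000L));
        System.out.println(looksAcceptable(writeZip("data.txt", 5_000), 1_000L));
    }
}
```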
Check a zip header first :)
If the ZIP decompressor you use can provide the data on original and compressed size you can use that data. Otherwise start unzipping and monitor the output size - if it grows too much cut it loose.
Make sure you are not using your system drive for temp storage. I am not sure if a virus scanner will check it if it encounters it.
Also, you can look at the information inside the zip file and retrieve a list of its contents. How to do this depends on the utility used to extract the file, so you need to provide more information here.