I'm trying to zip a large number of PDF files (stored as BLOBs in the DB) and then return the zip as an attachment to the user.
What's the best way to do this without running into memory issues?
Another note: I actually need to merge some PDFs before adding them to the ZipOutputStream, so a couple of PDFs will need to be held in memory at a time.
Given that, would it be best to store the merged PDFs as temporary files on the server before zipping them all?
You can create zip files in Java using ZipOutputStream, which writes to any OutputStream (in memory or straight to the response).
See http://www.exampledepot.com/egs/java.util.zip/CreateZip.html
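For the BLOB case in the question, the useful property is that ZipOutputStream can wrap any OutputStream, including the servlet response, so the archive never has to be assembled in memory or on disk. A minimal sketch, where NamedPdf is a hypothetical holder for an entry name plus a way to open its content stream (e.g. Blob.getBinaryStream()):

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipOutputStream;

    public class ZipStreamer {

        // Hypothetical holder: an entry name plus a way to open its content.
        public interface NamedPdf {
            String name();
            InputStream open() throws IOException;
        }

        // Streams each PDF into the zip one at a time; only one small
        // buffer is in memory at once. "out" can be the servlet
        // response's output stream.
        public static void zipTo(OutputStream out, Iterable<NamedPdf> pdfs)
                throws IOException {
            try (ZipOutputStream zos = new ZipOutputStream(out)) {
                byte[] buf = new byte[8192];
                for (NamedPdf pdf : pdfs) {
                    zos.putNextEntry(new ZipEntry(pdf.name()));
                    try (InputStream in = pdf.open()) {
                        int n;
                        while ((n = in.read(buf)) != -1) {
                            zos.write(buf, 0, n);
                        }
                    }
                    zos.closeEntry();
                }
            }
        }
    }

Only the merge step needs whole PDFs in memory; each merged result can be streamed into its entry and discarded before the next one is built, so temporary files are not strictly necessary.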
I am developing a Java application through which I need to read the log files present on a server and perform operations depending on the content of the logs.
Files range from 3GB up to 9GB.
Here on Stack Overflow I have already read the discussion about reading large files with Java; I am attaching the link:
Java reading large file discussion
In that discussion the files are read locally; in my case I have to retrieve and read the file on the server. Is there an efficient way to achieve this?
I would like to avoid having to download files given their size.
I had thought about using URL Reader to retrieve the files, but I have doubts about the speed of execution.
The files I need to recover are under the path C:\production\LOG\file.log
Do you have any suggestions or advice?
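If the server exposes the log over HTTP or a similar protocol (an assumption; the URL below is hypothetical), the "URL Reader" idea can work without downloading the whole file first: read it line by line as the bytes stream in, so nothing close to the full 3-9 GB is ever held in memory. A minimal sketch:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class RemoteLogReader {
        public static void main(String[] args) throws Exception {
            // Hypothetical URL; assumes the server exposes the log file.
            URL url = new URL("http://server.example.com/logs/file.log");
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // Act on each line as it arrives instead of buffering
                    // the whole file.
                    if (line.contains("ERROR")) {
                        System.out.println(line);
                    }
                }
            }
        }
    }

The transfer still moves the bytes over the network, of course; if only a small part of the log matters, it is usually faster to filter on the server side and ship just the results.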
I have a file which I need to upload to a service, and parse into relevant data. The parser and the uploader both require an InputStream. Ought I to open the file twice? I could save the file to a String, but having many of these files in memory is concerning.
EDIT: Thought I should make it clear that the parsing and uploading are entirely separate processes.
Since you are parsing it anyway, it would be most efficient to load the file into a string once. Parse it into indexes into that string: you will save memory and can upload the string whenever you want. This is the most effective way memory-wise, though maybe not in processing time.
A reply to one of the comments above: "separate processes" does not mean different threads or OS processes, just that they do not need each other to operate.
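A minimal sketch of the load-once idea, where parse and upload are hypothetical stand-ins for the real parser and uploader; each gets its own cheap in-memory stream over the same bytes, so the file is read from disk only once:

    import java.io.ByteArrayInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class ReadOnceUseTwice {

        public static void process(Path file) throws IOException {
            // Read the file into memory once.
            byte[] data = Files.readAllBytes(file);

            // Each consumer gets an independent stream over the same bytes.
            try (InputStream forParser = new ByteArrayInputStream(data)) {
                parse(forParser);   // hypothetical parser entry point
            }
            try (InputStream forUpload = new ByteArrayInputStream(data)) {
                upload(forUpload);  // hypothetical uploader entry point
            }
        }

        private static void parse(InputStream in) { /* ... */ }

        private static void upload(InputStream in) { /* ... */ }
    }

If holding many files in memory at once is the worry, simply opening the file twice with two FileInputStreams is also fine; the OS page cache makes the second read cheap.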
According to the current requirement, users will upload large files, which they may want to download later. I cannot store the uploaded files in the DB because they are large and performance would be impacted if I did.
Does anyone know of a Java plugin that provides efficient file management on the web server and maintains a link to each file, so the file can be downloaded when the link is requested? The code must also make sure that users can download only the files they uploaded themselves; they must not be able to download other files just by modifying the download link, etc. I am using Spring 3 as the framework.
Please suggest how to solve this problem.
If you have write access to the file system, why not just save them there?
You then generate a unique ID and save the ID/file relation in the DB; the client supplies the ID to a servlet, which feeds the file back.
Store the file content on a part of the filesystem outside the web application, so it cannot be reached by tampering with a link.
Then store the path for each file in the DB, and return the file only if the user has permission to read it.
Pay attention: do not store all the files in the same folder, or the number of files in it could grow too large. Find a way to spread them across more folder levels.
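A rough sketch combining both answers, assuming Spring 3 on Java 7+; FileDao, FileRecord and currentUserId() are hypothetical placeholders for the real persistence layer and session lookup:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import javax.servlet.http.HttpServletResponse;
    import org.springframework.stereotype.Controller;
    import org.springframework.web.bind.annotation.PathVariable;
    import org.springframework.web.bind.annotation.RequestMapping;

    @Controller
    public class FileDownloadController {

        // Hypothetical DAO and record; replace with your persistence layer.
        interface FileDao { FileRecord findById(String id); }
        static class FileRecord { String ownerId; String path; String originalName; }

        private FileDao fileDao; // injected

        @RequestMapping("/files/{id}")
        public void download(@PathVariable("id") String id,
                             HttpServletResponse response) throws IOException {
            FileRecord record = fileDao.findById(id);
            // Serve the file only if it exists and belongs to the current
            // user, so editing or guessing the link gets you nothing.
            if (record == null || !record.ownerId.equals(currentUserId())) {
                response.sendError(HttpServletResponse.SC_FORBIDDEN);
                return;
            }
            response.setContentType("application/octet-stream");
            response.setHeader("Content-Disposition",
                    "attachment; filename=\"" + record.originalName + "\"");
            Files.copy(Paths.get(record.path), response.getOutputStream());
        }

        // Shard by the first characters of the random ID so no single
        // directory accumulates millions of files.
        static Path storageLocation(Path baseDir, String id) {
            return baseDir.resolve(id.substring(0, 2)).resolve(id);
        }

        private String currentUserId() {
            return "..."; // hypothetical: resolve from the authenticated session
        }
    }

Because the ID is random and the ownership check runs on every request, users cannot reach other files by modifying the link.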
There are 2 servers that are geographically very far from each other.
One server does file processing, then saves the processed file in a directory:
c:\processed\
Files can be 100 MB to 1 GB in size.
The 2nd server is to download these files.
What techniques can I use to check if the file correctly downloaded?
Is a checksum all I need? Will it hash the contents of the file or just the file attributes? (What is best practice?)
If the file is 1 GB, will creating the checksum take a long time?
A checksum is fine to make sure that the downloaded data matches the source data. For a discussion of making it fast, see What is the fastest way to create a checksum for large files in C#.
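Note that a checksum such as SHA-256 is computed from the file's contents, not its attributes. As a rough Java sketch, hashing streams through a small buffer, so a 1 GB file costs roughly one sequential read of the disk and very little memory:

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.security.DigestInputStream;
    import java.security.MessageDigest;

    public class FileChecksum {

        public static byte[] sha256(Path file) throws Exception {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            try (InputStream in = new DigestInputStream(Files.newInputStream(file), md)) {
                byte[] buf = new byte[64 * 1024];
                // DigestInputStream updates the digest as a side effect of
                // reading; only one 64 KB buffer is in memory at a time.
                while (in.read(buf) != -1) {
                    // nothing to do per chunk
                }
            }
            return md.digest();
        }
    }

Compute the hash on both servers and compare; any corruption in transit changes the value.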
I just read about zip bombs, i.e. zip files that contain a very large amount of highly compressible data (00000000000000000...).
When opened they fill the server's disk.
How can I detect that a zip file is a zip bomb before unzipping it?
UPDATE: Can you tell me how this is done in Python or Java?
Try this in Python:
    import zipfile

    # Sum the declared uncompressed sizes of all entries. Note these come
    # from the entry headers, so a crafted archive can under-report them.
    with zipfile.ZipFile('a_file.zip') as z:
        print(f'total files size={sum(e.file_size for e in z.infolist())}')
Zip is, erm, an "interesting" format. A robust solution is to stream the data out, and stop when you have had enough. In Java, use ZipInputStream rather than ZipFile. The latter also requires you to store the data in a temporary file, which is also not the greatest of ideas.
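A minimal sketch of that streaming approach, with hypothetical caps on the entry count and the total uncompressed output; extraction stops as soon as either limit is exceeded, so a bomb can never fill the disk:

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipInputStream;

    public class BoundedUnzip {

        // Hypothetical limits; tune them for your environment.
        private static final long MAX_TOTAL_BYTES = 100L * 1024 * 1024;
        private static final int MAX_ENTRIES = 1000;

        public static void extractSafely(InputStream rawZip) throws IOException {
            long total = 0;
            int entries = 0;
            byte[] buf = new byte[8192];
            try (ZipInputStream zis = new ZipInputStream(rawZip)) {
                ZipEntry entry;
                while ((entry = zis.getNextEntry()) != null) {
                    if (++entries > MAX_ENTRIES) {
                        throw new IOException("too many entries - possible zip bomb");
                    }
                    int n;
                    while ((n = zis.read(buf)) != -1) {
                        total += n;
                        if (total > MAX_TOTAL_BYTES) {
                            throw new IOException("output exceeds limit - possible zip bomb");
                        }
                        // write buf[0..n) to the real destination here
                    }
                    zis.closeEntry();
                }
            }
        }
    }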
Reading over the description on Wikipedia:
Deny any compressed files that contain compressed files.
Use ZipFile.entries() to retrieve the list of entries, then ZipEntry.getName() to check each file extension.
Deny any compressed files that contain files over a set size, or whose size cannot be determined up front.
While iterating over the entries, use ZipEntry.getSize() to retrieve the declared size; it returns -1 when the size is unknown (a sketch of both checks follows below).
Don't allow the upload process to write enough data to fill up the disk, i.e. solve the problem, not just one possible cause of the problem.
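A sketch of the first two checks using ZipFile, with a hypothetical size cap. Note that the declared sizes come from the archive's own headers and can be forged, so this pre-check should complement, not replace, hard limits enforced during extraction:

    import java.io.File;
    import java.io.IOException;
    import java.util.Enumeration;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipFile;

    public class ZipPreCheck {

        private static final long MAX_DECLARED_BYTES = 100L * 1024 * 1024; // hypothetical cap

        public static void check(File file) throws IOException {
            long declaredTotal = 0;
            try (ZipFile zip = new ZipFile(file)) {
                Enumeration<? extends ZipEntry> entries = zip.entries();
                while (entries.hasMoreElements()) {
                    ZipEntry e = entries.nextElement();
                    // Rule 1: deny nested archives.
                    if (e.getName().toLowerCase().endsWith(".zip")) {
                        throw new IOException("nested archive: " + e.getName());
                    }
                    // Rule 2: deny unknown (-1) or excessive declared sizes.
                    long size = e.getSize();
                    if (size == -1 || (declaredTotal += size) > MAX_DECLARED_BYTES) {
                        throw new IOException("declared size unknown or too large");
                    }
                }
            }
        }
    }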
Check a zip header first :)
If the ZIP decompressor you use can report the original and compressed sizes, you can use that data. Otherwise, start unzipping and monitor the output size; if it grows too much, cut it loose.
Make sure you are not using your system drive for temp storage. I am not sure whether a virus scanner would check a zip bomb if it encountered one.
Also, you can look at the information inside the zip file and retrieve a list of its contents. How to do this depends on the utility used to extract the file, so you would need to provide more information here.