Java content APIs for a large number of files

Java content APIs for a large number of files - java

Does anyone know any java libraries (open source) that provides features for handling a large number of files (write/read) from a disk. I am talking about 2-4 millions of files (most of them are pdf and ms docs). it is not a good idea to store all files in a single directory. Instead of re-inventing the wheel, I am hoping that it has been done by many people already.
Features I am looking for
1) Able to write/read files from disk
2) Able to create random directories/sub-directories for new files
2) Provide version/audit (optional)
I was looking at JCR API and it looks promising but it starts with a workspace and not sure what will be the performance when there are many nodes.

Edit: JCP does look pretty good. I'd suggest trying it out to see how it actually does perform for your use-case.
If you're running your system on Windows and noticed a horrible n^2 performance hit at some point, you're probably running up against the performance hit incurred by automatic 8.3 filename generation. Of course, you can disable 8.3 filename generation, but as you pointed out, it would still not be a good idea to store large numbers of files in a single directory.
One common strategy I've seen for handling large numbers of files is to create directories for the first n letters of the filename. For example, document.pdf would be stored in d/o/c/u/m/document.pdf. I don't recall ever seeing a library to do this in Java, but it seems pretty straightforward. If necessary, you can create a database to store the lookup table (mapping keys to the uniformly-distributed random filenames), so you won't have to rebuild your index every time you start up. If you want to get the benefit of automatic deduplication, you could hash each file's content and use that checksum as the filename (but you would also want to add a check so you don't accidentally discard a file whose checksum matches an existing file even though the contents are actually different).
Depending on the sizes of the files, you might also consider storing the files themselves in a database--if you do this, it would be trivial to add versioning, and you wouldn't necessarily have to create random filenames because you could reference them using an auto-generated primary key.

Combine the functionality in the java.io package with your own custom solution.
The java.io package can write and read files from disk and create arbitrary directories or sub-directories for new files. There is no external API required.
The versioning or auditing would have to be provided with your own custom solution. There are many ways to handle this, and you probably have a specific need that needs to be filled. Especially if you're concerned about the performance of an open-source API, it's likely that you will get the best result by simply coding a solution that specifically fits your needs.
It sounds like your module should scan all the files on startup and form an index of everything that's available. Based on the method used for sharing and indexing these files, it can rescan the files every so often or you can code it to receive a message from some central server when a new file or version is available. When someone requests a file or provides a new file, your module will know exactly how it is organized and exactly where to get or put the file within the directory tree.
It seems that it would be far easier to just engineer a solution specific to your needs.

Related

Save Properties File MOST EFFICIENT

Which of these ways is better (faster, less storage)?
Save thousands of xyz.properties in every file — about 30 keys/values
One .properties file with all the data in it — about 30,000 keys/values

I think there are two aspects here:
As Guenther has correctly pointed out, dealing with files comes with overhead. You need "file handles"; and possible other data structures that deal with files; so there might many different levels where having one huge file is better than having many small files.
But there is also "maintainability". Meaning: from a developers point of view, dealing with a property file that contains 30 K key/values is something you really don't want to get into. If everything is in one file, you have to constantly update (and deploy) that one huge file. One change; and the whole file needs to go out. Will you have mechanisms in place that allow for "run-time" reloading of properties; or would that mean that your application has to shut down? And how often will it happen that you have duplicates in that large file; or worse: you put a value for property A on line 5082, and then somebody doesn't pay attention and overrides property A on line 29732. There are many things that can go wrong; just because of having all that stuff in one file; unable to be digested by any human being anymore! And rest assured: debugging something like that will be hard.
I just gave you some questions to think about; so you might want to step back to give more requirements from your end.
In any way; you might want to look into a solution where developers deal with the many small property file (you know, like one file per functionality). And then you use tooling to build that one large file used in the production environment.
Finally: if your application really needs 30K properties; then you should very much more worry about the quality of your product. In my eyes, this isn't a design "smell"; it sounds like a design fetidness. Meaning: no reasonably application should require 30K properties to function on.

Opening and closing 1000s of files is a major overhead with the operating system, so you'd probably best off with one big file.

Is it possible to attack a server by uploading / through uploaded files?

the title actually tells the issue. And before you get me wrong, I DO NOT want to know how this can be done, but how I can prevent it.
I want to write a file uploader (in Java with JPA and MySQL database). Since I'm not yet 100% sure about the internal management, there is the possibility that at some point the file could be executed/opened internally.
So, therefor I'd be glad to know, what there is, an attacker can do to harm, infect or manipulate my system by uploading whatever type of file, may it be a media file, a binary or whatever.
For instance:
What about special characters in the file name?
What about manipulating meta data like EXIF?
What about "embedded viruses" like in an MP3 file?
I hope this is not too vague and I'd be glad to read your tips and hints.
Best regards,
Stacky

It's really very application specific. If you're using a particular web app like phpBB, there are completely different security needs than if you're running a news group. If you want tailored security recommendations, you'll need to search for them based on the context of what you're doing. It could range from sanitizing input to limiting upload size and format.
For example, an MP3 file virus probably only works on a few specific MP3 players. Not on all of them.
At any rate, if you want broad coverage from viruses, then scan the files with a virus scanner, but that probably won't protect you from things like script injection.

If your server doesn't do something inherently stupid, there should be no problem. But...
Since I'm not yet 100% sure about the internal management, there is the possibility that at some point the file could be executed/opened internally.
... this qualifies as inherently stupid. You have to make sure you don't accidently execute uploaded files (permissions on the upload directory are a starting point, limit the upload to specific directories etc.).
Aside from executing, if the server attempts any file type specific processing (e.g. make thumbnails of images) there is always the possibility that the processing can be attacked through buffer overflow exploits (these are specific for each type of software/library though).
A pure file server (e.g. FTP) that just stores/serves files is save (when there are no other holes).

Uncompressing a ZIP file in memory in Java [closed]

It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center.
Closed 10 years ago.
I'm downloading zipped files containing XMLs, and I'd like to avoid writing the zip files to disk before manipulating them because of latency requirements. However, java.util.zip doesn't suffice for me. There's no way to say "here's a byte array of a zip file, use it" without turning it into a stream, and ZipInputStream is not reliable, since it scans for entry headers (see discussion below EDIT for reasons why that is not reliable).
I do not yet have access to the zip files I'll be handling, so I don't know whether I'll be able to handle them through the ZipInputStream, and I need to find a solution that will work with any valid ZIP files, as the penalty for a failure once I go into production will be high.
Assuming ZipInputStream won't work, what can I do to solve this problem in cases where there are no entry headers? I'm using Wikipedia's definition, which includes a comment on how to correctly uncompress zip files (quoted below), as the standard.
EDIT
The Apache Commons Zip library has a good write up on some of the problems using Stream (both their solution and Java's) has. I'll further add, from wikipedia and personal experience, and the size and crc field on entry headers may not be filled (I've files with -1 in these fields). Thanks to centic for providing this link.
Also, let me quote the wikipedia on the subject:
Tools that correctly read zip archives must scan for the signatures of
the various fields, the zip central directory. They must not scan for
entries because only the directory specifies where a file chunk
starts. Scanning could lead to false positives, as the format doesn't
forbid other data to be between chunks, or uncompressed stream
containing such signatures.
Note that ZipInputStream scans for entries, not the central directory, which is the problem with it.
Final Edit
If anyone is interested, this script can be used to produce a valid ZIP file that cannot be read by ZipInputStream from an existing ZIP file. So, as a final edit to this closed question, I needed a library that can read files such as the ones produced by this script.

EDIT: Another suggestion...
Looking at ZipFile from the Apache Commons implementation, it looks like it wouldn't be too hard to effectively fork that for your project. Create a wrapper around your byte array which has all the pieces of the RandomAccessFile API which are required (I don't think there are very many). You've already indicated that you prefer the interface to ZipFile, so why not go with that?
We don't know enough about your project to know whether this opens up any legal questions - and even if you gave details, I doubt that anyone here would be able to give good legal advice - but I suspect it wouldn't take more than an hour or two to get this solution up and working, and I suspect you'd have reasonable confidence in it.
EDIT: This may be a slightly more productive answer...
If you're worried about the entries not being contiguous, but don't want to handle all the compression side yourself, you might consider an option where you effectively rewrite the data. Create a new ByteArrayOutputStream, and read the central directory at the end. For each entry in the central directory, write out an entry (header + data) to the output stream in a format that you believe ZipInputStream will be happy with. Then write a new central directory - if you want your replacement to be valid you may need to do this from scratch, but if you're using code which you know won't actually read the central directory, you could just provide the original one, ignoring the fact that it might not then be valid. So long as it starts with the right signature, that's probably good enough :)
Once you've done that, convert the ByteArrayOutputStream into a new byte[], wrap it in a ByteArrayInputStream and then pass that to ZipInputStream or ZipArchiveInputStream.
Depending on your purposes, you may not even need to do that much - you may be able to just extract each file as you go by creating a "mini" zip file with just the one entry you're reading from the directory at a time.
This does involve understanding the zip file format, but not completely - just the skeleton, effectively. It's not a quick and easy fix like using an existing API completely, but it shouldn't take very long. It doesn't guarantee it'll be able to read all invalid files (how could it?) but it will protect you against the "data between entries" issue you seem to be particularly concerned about. Hope it's at least a useful idea...
there's no way to say "here's a byte array of a zip file, use it"
Yes there is:
byte[] data = ...;
ByteArrayInputStream byteStream = new ByteArrayInputStream(data);
ZipInputStream zipStream = new ZipInputStream(byteStream);
That leaves the issue of whether ZipInputStream can handle all the zip files you'll give it - but I wouldn't write it off quite so quickly.
Of course, there are other APIs available. You may want to look at Apache Commons Compress, for example. Even though ZipFile requires a file, ZipArchiveInputStream doesn't - so again, you could use a ByteArrayInputStream. EDIT: It looks like ZipArchiveStream doesn't read from the central directory either. I was hoping it would use markSupported to check beforehand, but it appears not to...
EDIT: In the comments on the question, I asked where you'd read that the zip file doesn't have to contain entry data. You quoted wikipedia:
"Tools that correctly read zip archives must scan for the signatures of the various fields, the zip central directory. They must not scan for entries because only the directory specifies where a file chunk starts. Scanning could lead to false positives, as the format doesn't forbid other data to be between chunks, or uncompressed stream containing such signatures."
That's not the same as entry data being optional. It's saying that there may be extra data in awkward places, not that the entries may be missing completely. It's basically saying that the entries shouldn't be assumed to be contiguous. I could happily concede that ZipInputStream may not be reading the central directory at the end of the file, but finding code which does that isn't the same as finding code which copes with entry data not existing.
You then write:
I might further add that whether the zip is valid or not is not my concern. Working with it is.
... which suggests you want code which will handle invalid zip files. Combined with this:
I do not yet have access to the zip files I'll be handling, so I don't know whether I'll be able to handle them through the stream
That means you're asking for code which should handle zip files which are invalid in ways you can't even predict. Just how invalid would it have to be for you to be able to reject it? If I give you 1000 random bytes, with no attempt for them to be a zip file at all, what on earth would you do with it?
Basically, you need to pin the problem down more tightly before it's feasible to even say whether a particular library is a valid solution. It's reasonable to collect a set of zip files from various places, which may be invalid in well-understood ways, and say "I must be able to support all of these." Later you may need to do some work if it turns out that wasn't good enough. But to be able to support anything, however broken, simply isn't a valid requirement.

TrueZIP library provides alternative mature zip implementation.
It also features file system abstraction even for HTTP.
For example:
Path path = new TPath(new URI("http://acme.com/download/everything.zip/entry.xml"));
try (InputStream in = Files.newInputStream(path)) {
// Read archive entry contents here.
...
}
So, if you are interested only in specific entries, it would download them only, saving bandwidth and time.
And you would not have to write downloading code.
See also http://truezip.java.net/faq.html#http.

I would use the Apache library commons-compress, see http://commons.apache.org/compress/
It has support for reading Zip-files via streams, there is in-depth documentation at http://commons.apache.org/compress/zip.html for a detailed documentation. It also states some limitations which are inherent in the Zip-Format.
Sample code looks as follows:
ZipArchiveInputStream zip =
new ZipArchiveInputStream(inputStream);
try {
ZipArchiveEntry entry = zip.getNextZipEntry();
while(entry != null) {
assertEquals("README", entry.getName());
...
entry = zip.getNextZipEntry();
}
} finally {
zip.close();
}

This question sounds similar to How to create a directory in memory? pseudo file system / virtual directory. Basically, my suggestion is to use a more general solution- an in-memory virtual filesystem (and I don't mean on OS level, like Linux' ramfs/tmpfs).
One example is to use the Java 7 NIO APIs, which now provide an SPI for implementing a file system via FileSystemProvider. It seems that the ShrinkWrap filesystem implements this SPI.
A more accessible option would be to use Apache Commons VFS' ram filesystem: it requires only Java 5. If you need to be compatible with Java 5 and 6, this is probably your best bet.
I first remember reading about in-memory filesystems in Java from this article, which apart from pointing out solutions like Commons VFS and JBoss Microcontainer, gives a nice example use case for the NetBeans IDE.
While an in-memory virtual filesystem is a nice general solution of avoiding the OS-level filesystem (with the associated performance benefits), it probably suffers from other disadvantages, which more specialized solutions could address. For instance, I am not sure how using this filesystem would behave when used concurrently from multiple threads. It might work fine as long as you don't access the same files, or you might need to create separate filesystems (which might be prohibitive in terms of resource usage).

writing a file finder (java)

I am writing an application which will search for for files with special filename extension on computer. (JPG for example). Input data: "D:", ".JPG" Output: txt file with results(file directories); I know one simple reccursive algo, but may be there is smth better. So, may be you tell me an efficient algorithm to traverse the file directory. Also I want to use multithreading for solving this problem to make better performance. But how many threads should I use? If I will use 1 thread for 1 directory - this will be stupid.

The recursive option you name is the only way to go, unless you want to get your hands dirty with the file system. I suspect you don't.
Regarding thread performance, your best choice is to make the number of threads configurable, create some sample directories, and measure performance for each setting.
By the way, most file-finders create an index of files. They scan the disc on a schedule, and update a file which contains the relevant information about the files and directories on disk. The file is in a format designed to facilitate searching. This index file is used to perform actual searches. If you're planning on repeatedly running this search against the same directory, you should do this.

Java File Cursor

Is there some library for using some sort of cursor over a file? I have to read big files, but can't afford to read them all at once into memory. I'm aware of java.nio, but I want to use a higher level API.
A little backgrond: I have a tool written in GWT that analyzes submitted xml documents and then pretty prints the xml, among other things. Currently I'm writing the pretty printed xml to a temp file (my lib would throw me an OOMException if I use plain Strings), but the temp file's size are approaching 18 megs, I can't afford to respond a GWT RPC with 18 megs :)
So I can have a widget to show only a portion of the xml (check this example), but I need to read the corresponding portion of the file.

Have you taken a look at using FileChannels (i.e., memory mapped files)? Memory mapped files allow you to manipulate large files without bringing the entire file into memory.
Here's a link to a good introduction:
http://www.developer.com/java/other/article.php/1548681

Maybe java.io.RandomAccessFile can be of use to you.

I don't understand when you ask for a "higher level API" when positioning the file pointer. It is the higher levels that may need to control the "cursor". If you want control, go lower, not higher.
I am certain that lower level Java io clases allow you to position yourself anywhere within any sized file without reading anything into memory until you want to. I know I have done it before. Try RandomAccessFile as one example.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.