Java File Cursor

Is there some library for using some sort of cursor over a file? I have to read big files, but can't afford to read them all at once into memory. I'm aware of java.nio, but I want to use a higher level API.
A little background: I have a tool written in GWT that analyzes submitted XML documents and then pretty prints the XML, among other things. Currently I'm writing the pretty-printed XML to a temp file (my lib would throw an OOMException if I used plain Strings), but the temp file's size is approaching 18 MB, and I can't afford to respond to a GWT RPC with 18 MB :)
So I can have a widget to show only a portion of the xml (check this example), but I need to read the corresponding portion of the file.

Have you taken a look at FileChannels (i.e., memory-mapped files)? Memory-mapped files let you manipulate large files without bringing the entire file into memory.
Here's a link to a good introduction:
http://www.developer.com/java/other/article.php/1548681
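For illustration, a minimal sketch of mapping just one window of a large file into memory (the class, method, and parameter names here are placeholders, not a prescribed API):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedWindow {
    // Maps only the requested window of the file; nothing else is read into memory.
    public static String readWindow(String path, long offset, int length) throws Exception {
        RandomAccessFile raf = new RandomAccessFile(path, "r");
        try {
            FileChannel channel = raf.getChannel();
            long size = Math.min(length, channel.size() - offset);
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, offset, size);
            byte[] bytes = new byte[(int) size];
            buffer.get(bytes);
            // Careful: an arbitrary byte offset may split a multi-byte UTF-8 character.
            return new String(bytes, "UTF-8");
        } finally {
            raf.close();
        }
    }
}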

Maybe java.io.RandomAccessFile can be of use to you.

I don't understand the request for a "higher level API" for positioning the file pointer. It is the higher levels that may need to control the "cursor". If you want control, go lower, not higher.
I am certain that the lower-level Java I/O classes allow you to position yourself anywhere within a file of any size without reading anything into memory until you want to. I know I have done it before. Try RandomAccessFile as one example.
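A minimal sketch of that approach, paging through the file in fixed-size chunks (the names are just for illustration):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Arrays;

public class FileCursor {
    // Reads one chunk starting at the given byte position; only that chunk
    // is ever held in memory.
    public static byte[] readChunk(String path, long position, int chunkSize) throws IOException {
        RandomAccessFile raf = new RandomAccessFile(path, "r");
        try {
            raf.seek(position);              // position the cursor; nothing is read yet
            byte[] chunk = new byte[chunkSize];
            int read = raf.read(chunk);      // may read fewer bytes near the end of the file
            return read < 0 ? new byte[0] : Arrays.copyOf(chunk, read);
        } finally {
            raf.close();
        }
    }
}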

Save Properties File MOST EFFICIENT

Which of these ways is better (faster, less storage)?
Thousands of xyz.properties files, each with about 30 keys/values
One .properties file with all the data in it, about 30,000 keys/values
I think there are two aspects here:
As Guenther has correctly pointed out, dealing with files comes with overhead. You need file handles and possibly other data structures that deal with files, so there may be many different levels at which having one huge file is better than having many small files.
But there is also "maintainability". Meaning: from a developer's point of view, dealing with a property file that contains 30 K key/values is something you really don't want to get into. If everything is in one file, you have to constantly update (and deploy) that one huge file. One change, and the whole file needs to go out. Will you have mechanisms in place that allow for "run-time" reloading of properties, or would that mean that your application has to shut down? And how often will it happen that you have duplicates in that large file; or worse: you put a value for property A on line 5082, and then somebody doesn't pay attention and overrides property A on line 29732? There are many things that can go wrong, just because all that stuff is in one file, no longer digestible by any human being. And rest assured: debugging something like that will be hard.
I just gave you some questions to think about; so you might want to step back to give more requirements from your end.
In any case, you might want to look into a solution where developers deal with the many small property files (you know, like one file per functionality), and you then use tooling to build the one large file used in the production environment.
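As an illustration of that build-time tooling, a small sketch that merges every small .properties file in a directory into the one big production file (the class and file names are placeholders):

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Properties;

public class PropertyMerger {
    // Merges every *.properties file in a directory into one big file.
    // Note: later files silently override duplicate keys, which is exactly
    // the kind of surprise described above.
    public static void merge(File sourceDir, File target) throws IOException {
        Properties merged = new Properties();
        File[] files = sourceDir.listFiles();
        if (files == null) {
            throw new IOException(sourceDir + " is not a readable directory");
        }
        for (File f : files) {
            if (!f.getName().endsWith(".properties")) continue;
            try (InputStream in = new FileInputStream(f)) {
                merged.load(in);
            }
        }
        try (OutputStream out = new FileOutputStream(target)) {
            merged.store(out, "Generated for production; do not edit by hand");
        }
    }
}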
Finally: if your application really needs 30 K properties, then you should worry much more about the quality of your product. In my eyes, this isn't just a design "smell"; it is a design stench. Meaning: no reasonable application should require 30 K properties to function.
Opening and closing thousands of files is a major overhead for the operating system, so you'd probably be best off with one big file.

Is it possible to attack a server by uploading / through uploaded files?

The title actually tells the issue. And before you get me wrong: I DO NOT want to know how this can be done, but how I can prevent it.
I want to write a file uploader (in Java with JPA and MySQL database). Since I'm not yet 100% sure about the internal management, there is the possibility that at some point the file could be executed/opened internally.
So I'd be glad to know what an attacker can do to harm, infect or manipulate my system by uploading any type of file, be it a media file, a binary or whatever.
For instance:
What about special characters in the file name?
What about manipulating meta data like EXIF?
What about "embedded viruses" like in an MP3 file?
I hope this is not too vague and I'd be glad to read your tips and hints.
Best regards,
Stacky
It's really very application specific. If you're using a particular web app like phpBB, there are completely different security needs than if you're running a news group. If you want tailored security recommendations, you'll need to search for them based on the context of what you're doing. It could range from sanitizing input to limiting upload size and format.
For example, an MP3 file virus probably only works on a few specific MP3 players, not on all of them.
At any rate, if you want broad coverage from viruses, then scan the files with a virus scanner, but that probably won't protect you from things like script injection.
If your server doesn't do something inherently stupid, there should be no problem. But...
Since I'm not yet 100% sure about the internal management, there is the possibility that at some point the file could be executed/opened internally.
... this qualifies as inherently stupid. You have to make sure you don't accidentally execute uploaded files (permissions on the upload directory are a starting point; limit uploads to specific directories, etc.).
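A minimal sketch of the usual precautions, assuming a hypothetical upload directory outside the webroot (the path and names are placeholders, not a complete defense):

import java.io.File;
import java.util.UUID;

public class UploadStore {
    // Never trust the client-supplied name: generate our own, keep only a
    // sanitized version of the original for display, and store the file in
    // a directory the web server cannot execute or serve directly.
    private static final File UPLOAD_DIR = new File("/var/app-data/uploads"); // outside the webroot

    public static File targetFor(String clientFileName) {
        String display = clientFileName.replaceAll("[^A-Za-z0-9._-]", "_"); // strip special characters
        String stored = UUID.randomUUID() + "_" + display;                  // no collisions, no path tricks
        return new File(UPLOAD_DIR, stored);
    }
}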
Aside from executing, if the server attempts any file-type-specific processing (e.g. making thumbnails of images), there is always the possibility that the processing can be attacked through buffer overflow exploits (these are specific to each type of software/library, though).
A pure file server (e.g. FTP) that just stores/serves files is safe (when there are no other holes).

What is the most efficient way of repeatedly writing to an XML file in Android?

I am writing an application which needs to add nodes to an existing XML file repeatedly, throughout the day. Here is a sample of the list of nodes that I am trying to append:
<gx:Track>
<when>2012-01-21T14:37:18Z</when>
<gx:coord>-0.12345 52.12345 274.700</gx:coord>
<when>2012-01-21T14:38:18Z</when>
<gx:coord>-0.12346 52.12346 274.700</gx:coord>
<when>2012-01-21T14:39:18Z</when>
<gx:coord>-0.12347 52.12347 274.700</gx:coord>
....
This can happen up to several times a second over a long time and I would like to know what the best or most efficient way of doing this is.
Here is what I am doing right now: Use a DocumentBuilderFactory to parse the XML file, look for the container node, append the child nodes and then use the TransformerFactory to write it back to the SD card. However, I have noticed that as the file grows larger, this is taking more and more time.
I have tried to think of a better way, and this is the only thing I can think of: use RandomAccessFile to open the file and .seek() to a specific position in it. I would calculate the position from the file length, subtracting what I 'know' is the length of the closing tags that follow the point where I'm appending.
I'm pretty sure this method will work, but it feels a bit blind compared to the ease of using a DocumentBuilderFactory.
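For illustration, a minimal sketch of that seek-and-append idea, assuming the file always ends with one fixed, known suffix (here just </gx:Track>; a real KML file would have further closing tags that the suffix would need to include):

import java.io.IOException;
import java.io.RandomAccessFile;

public class TrackAppender {
    // Assumes the file ends with exactly this suffix: we seek to just before
    // it, overwrite it with the new nodes, then rewrite the suffix.
    private static final String CLOSING = "</gx:Track>";

    public static void append(String path, String newNodes) throws IOException {
        RandomAccessFile raf = new RandomAccessFile(path, "rw");
        try {
            long insertAt = raf.length() - CLOSING.getBytes("UTF-8").length;
            raf.seek(insertAt);
            raf.write(newNodes.getBytes("UTF-8"));
            raf.write(CLOSING.getBytes("UTF-8"));
        } finally {
            raf.close();
        }
    }
}

The cost of each append stays constant regardless of file size, unlike the parse-and-rewrite approach.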
Is there a better way of doing this?
You should try using JAXB. It's a Java XML binding library that ships with most Java 1.6 JDKs. JAXB lets you specify an XML Schema Definition file (and has some experimental support for DTD). The library will then generate Java classes for you to use in your code, which marshal back into an XML document.
It's very quick and useful, with optional support for validation. This would be a good starting point, and this would be another good one to look at. Eclipse also has some great tools for generating the Java classes for you, and it provides a nice GUI tool for XSD creation. The Eclipse plugins are called Davi, I believe.
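For concreteness, a minimal sketch of the JAXB round trip; the Track model class here is hypothetical and ignores the gx namespace, which a real KML mapping would have to declare:

import javax.xml.bind.JAXBContext;
import javax.xml.bind.Marshaller;
import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model; JAXB can also generate such classes
// from an XSD via the xjc tool.
@XmlRootElement(name = "Track")
class Track {
    @XmlElement(name = "when")
    public List<String> when = new ArrayList<String>();
}

public class JaxbDemo {
    public static void main(String[] args) throws Exception {
        JAXBContext ctx = JAXBContext.newInstance(Track.class);

        // Read the existing document into objects, modify, write it back.
        File file = new File("track.xml");
        Track track = (Track) ctx.createUnmarshaller().unmarshal(file);
        track.when.add("2012-01-21T14:40:18Z");

        Marshaller m = ctx.createMarshaller();
        m.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, Boolean.TRUE);
        m.marshal(track, file);
    }
}

Note that this still rewrites the whole document on each save, so it addresses convenience more than the append-cost problem.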

XML With SimpleXML Library - Performance on Android

I'm using the Simple XML library to process XML files in my Android application. These files can get quite big, around 1 MB, and can be nested pretty deeply, so they are fairly complex.
When the app loads one of these files via the Simple API, it can take up to 30 seconds to complete. Currently I am passing a FileInputStream into the read(Class, InputStream) method of Simple's Persister class. Effectively it just reads the XML nodes and maps the data to instances of my model objects, replicating the XML tree structure in memory.
My question then is how do I improve the performance on Android? My current thought would be to read the contents of the file into a byte array, and pass a ByteArrayInputStream to the Persister's read method instead. I imagine the time to process the file would be quicker, but I'm not sure if the time saved would be counteracted by the time taken to read the whole file first. Also memory constraints might be an issue.
Is this a fool's errand? Is there anything else I could do to increase the performance in this situation? If not, I will just have to resort to improving the feedback to the user on the progress of loading the file.
Some caveats:
1) I can't change the XML library I'm using - the code in question is part of an "engine" which is used across desktop, mobile and web applications. The overhead of changing it would be too much for the time being.
2) The data files are created by users so I don't have control over the size/depth of nesting in them.
Well, there are many things you can do to improve this. Here they are.
1) On Android you should be using at least version 2.5.2, but ideally 2.5.3 as it uses KXML which is much faster and more memory efficient on Android.
2) Simple will dynamically build your object graph. This means that it will need to load classes that have not already been loaded and build a schema for each class based on its annotations using reflection. So the first use will always be by far the most expensive. Repeated use of the same Persister instance will be many times faster, so try to avoid multiple Persister instances; just use one if possible (see the sketch after this list).
3) Try measuring the time taken to read the file directly, without using the Simple XML library. How long does it take? If it takes forever, then you know the performance hit here is due to the file system. Try using a BufferedInputStream to improve performance.
4) If you still notice any issues, raise it on the mailing list.
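A small sketch of points 2 and 3 combined; it assumes Simple's org.simpleframework.xml.core.Persister, while the class and method names are otherwise placeholders:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;

import org.simpleframework.xml.core.Persister;

public class LoadHelper {
    // One shared instance: Simple builds its reflection-based schema once
    // per class, so reusing the persister avoids paying that cost again.
    private static final Persister PERSISTER = new Persister();

    // Point 3: time the raw file read alone, to see whether the bottleneck
    // is the file system or the XML binding.
    public static long timeRawRead(String path) throws Exception {
        long start = System.currentTimeMillis();
        InputStream in = new BufferedInputStream(new FileInputStream(path));
        try {
            byte[] buf = new byte[8192];
            while (in.read(buf) != -1) { /* discard; only the time matters */ }
        } finally {
            in.close();
        }
        return System.currentTimeMillis() - start;
    }

    public static <T> T load(Class<T> type, String path) throws Exception {
        InputStream in = new BufferedInputStream(new FileInputStream(path));
        try {
            return PERSISTER.read(type, in);
        } finally {
            in.close();
        }
    }
}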
EDIT:
Android has some issues with annotation processing (https://code.google.com/p/android/issues/detail?id=7811); Simple 2.6.6 implements workarounds for these issues. Performance increases of 10 times can be observed.

Java content APIs for a large number of files

Does anyone know of any Java libraries (open source) that provide features for handling a large number of files (write/read) on disk? I am talking about 2-4 million files (most of them PDF and MS Office docs). It is not a good idea to store all files in a single directory. Instead of re-inventing the wheel, I am hoping this has been done by many people already.
Features I am looking for
1) Able to write/read files from disk
2) Able to create random directories/sub-directories for new files
3) Provide versioning/audit (optional)
I was looking at the JCR API and it looks promising, but I'm not sure what the performance will be when there are many nodes.
Edit: JCR does look pretty good. I'd suggest trying it out to see how it actually performs for your use case.
If you're running your system on Windows and noticed a horrible n^2 performance hit at some point, you're probably running up against the performance hit incurred by automatic 8.3 filename generation. Of course, you can disable 8.3 filename generation, but as you pointed out, it would still not be a good idea to store large numbers of files in a single directory.
One common strategy I've seen for handling large numbers of files is to create directories for the first n letters of the filename. For example, document.pdf would be stored in d/o/c/u/m/document.pdf. I don't recall ever seeing a library to do this in Java, but it seems pretty straightforward. If necessary, you can create a database to store the lookup table (mapping keys to the uniformly-distributed random filenames), so you won't have to rebuild your index every time you start up. If you want to get the benefit of automatic deduplication, you could hash each file's content and use that checksum as the filename (but you would also want to add a check so you don't accidentally discard a file whose checksum matches an existing file even though the contents are actually different).
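A minimal sketch of that sharding idea, using a content hash both as the filename (for deduplication) and as the source of the directory path (the two-level depth is an arbitrary choice):

import java.io.File;
import java.security.MessageDigest;

public class ShardedStore {
    // Spreads files across nested directories derived from the first
    // characters of a content hash, so no single directory grows too large.
    public static File pathFor(byte[] content, File root) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest(content)) {
            hex.append(String.format("%02x", b));
        }
        String h = hex.toString();
        // e.g. root/ab/cd/abcdef... ; on a hash collision the caller should
        // still compare the actual contents before discarding a "duplicate".
        File dir = new File(root, h.substring(0, 2) + File.separator + h.substring(2, 4));
        dir.mkdirs();
        return new File(dir, h);
    }
}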
Depending on the sizes of the files, you might also consider storing the files themselves in a database; if you do this, it would be trivial to add versioning, and you wouldn't necessarily have to create random filenames because you could reference them using an auto-generated primary key.
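For illustration, a sketch of the database route with plain JDBC, assuming a hypothetical documents table:

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class DbFileStore {
    // Assumes a hypothetical table such as:
    //   CREATE TABLE documents (id BIGINT AUTO_INCREMENT PRIMARY KEY,
    //     name VARCHAR(255), version INT, content LONGBLOB)
    public static long store(Connection conn, File file, int version) throws Exception {
        InputStream in = new FileInputStream(file);
        try {
            PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO documents (name, version, content) VALUES (?, ?, ?)",
                    Statement.RETURN_GENERATED_KEYS);
            try {
                ps.setString(1, file.getName());
                ps.setInt(2, version);
                ps.setBinaryStream(3, in, (int) file.length());
                ps.executeUpdate();
                ResultSet keys = ps.getGeneratedKeys();
                keys.next();
                return keys.getLong(1); // reference the file by this key from now on
            } finally {
                ps.close();
            }
        } finally {
            in.close();
        }
    }
}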
Combine the functionality in the java.io package with your own custom solution.
The java.io package can write and read files from disk and create arbitrary directories or sub-directories for new files. There is no external API required.
The versioning or auditing would have to be provided with your own custom solution. There are many ways to handle this, and you probably have a specific need that needs to be filled. Especially if you're concerned about the performance of an open-source API, it's likely that you will get the best result by simply coding a solution that specifically fits your needs.
It sounds like your module should scan all the files on startup and form an index of everything that's available. Based on the method used for sharing and indexing these files, it can rescan the files every so often or you can code it to receive a message from some central server when a new file or version is available. When someone requests a file or provides a new file, your module will know exactly how it is organized and exactly where to get or put the file within the directory tree.
It seems that it would be far easier to just engineer a solution specific to your needs.
