I'm using the Simple XML library to process XML files in my Android application. These files can get quite big - around 1 MB - and can be nested pretty deeply, so they are fairly complex.
When the app loads one of these files via the Simple API, it can take up to 30 seconds to complete. Currently I am passing a FileInputStream into the read(Class, InputStream) method of Simple's Persister class. Effectively it just reads the XML nodes and maps the data to instances of my model objects, replicating the XML tree structure in memory.
My question then is: how do I improve the performance on Android? My current thought is to read the contents of the file into a byte array and pass a ByteArrayInputStream to the Persister's read method instead. I imagine the time to process the file would be shorter, but I'm not sure whether the time saved would be cancelled out by the time taken to read the whole file first. Memory constraints might also be an issue.
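Roughly what I have in mind (just a sketch; Catalog stands in for one of my real model classes):

```java
import org.simpleframework.xml.Element;
import org.simpleframework.xml.Root;
import org.simpleframework.xml.core.Persister;

import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.FileInputStream;

// "Catalog" is a placeholder for one of my real annotated model classes.
@Root(name = "catalog")
class Catalog {
    @Element
    String title;
}

public class LoadFromMemory {
    public static Catalog load(File file) throws Exception {
        // Read the whole file into memory first...
        byte[] buffer = new byte[(int) file.length()];
        try (FileInputStream in = new FileInputStream(file)) {
            int offset = 0;
            while (offset < buffer.length) {
                int read = in.read(buffer, offset, buffer.length - offset);
                if (read < 0) {
                    break;
                }
                offset += read;
            }
        }
        // ...then let Simple parse from the in-memory copy.
        return new Persister().read(Catalog.class, new ByteArrayInputStream(buffer));
    }
}
```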
Is this a fool's errand? Is there anything else I could do to increase performance in this situation? If not, I will just have to resort to improving the feedback to the user on the progress of loading the file.
Some caveats:
1) I can't change the XML library I'm using - the code in question is part of an "engine" which is used across desktop, mobile and web applications. The overhead of changing it would be too much for the time being.
2) The data files are created by users so I don't have control over the size/depth of nesting in them.
Well, there are many things you can do to improve this. Here they are.
1) On Android you should be using at least version 2.5.2, but ideally 2.5.3, as it uses KXML, which is much faster and more memory efficient on Android.
2) Simple builds your object graph dynamically, which means it has to load any classes that have not already been loaded and build a schema for each class from its annotations using reflection. First use will therefore always be by far the most expensive. Repeated use of the same persister instance will be many times faster, so avoid creating multiple persister instances; use a single one if possible.
3) Try measuring the time taken to read the file directly, without the Simple XML library. How long does it take? If it takes forever, then you know the performance hit here is coming from the file system. Try using a BufferedInputStream to improve performance (see the sketch after this list).
4) If you still notice any issues, raise it on the mailing list.
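To illustrate points 2 and 3, a rough sketch (class and file names are just placeholders):

```java
import org.simpleframework.xml.core.Persister;

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

// Rough sketch of points 2 and 3: build the Persister once and reuse it,
// and wrap the FileInputStream in a BufferedInputStream.
public class DocumentLoader {
    // One shared instance, so the class schemas are only built once.
    private static final Persister PERSISTER = new Persister();

    public static <T> T load(Class<T> type, File file) throws Exception {
        try (InputStream in = new BufferedInputStream(new FileInputStream(file))) {
            return PERSISTER.read(type, in);
        }
    }
}
```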
EDIT:
Android has some issues with annotation processing (https://code.google.com/p/android/issues/detail?id=7811). Simple 2.6.6 implements workarounds for these issues; performance increases of 10 times have been observed.
Related
Which of these ways is better (faster, less storage)?
Saving thousands of xyz.properties files (about 30 keys/values in each file)
One .properties file with all the data in it (about 30,000 keys/values)
I think there are two aspects here:
As Guenther has correctly pointed out, dealing with files comes with overhead. You need file handles and possibly other data structures that deal with files, so there may be many different levels at which having one huge file is better than having many small files.
But there is also maintainability. From a developer's point of view, dealing with a property file that contains 30K key/values is something you really don't want to get into. If everything is in one file, you have to constantly update (and deploy) that one huge file; one change, and the whole file needs to go out. Will you have mechanisms in place that allow for run-time reloading of properties, or would that mean your application has to shut down? And how often will it happen that you have duplicates in that large file, or worse: you put a value for property A on line 5082, and then somebody doesn't pay attention and overrides property A on line 29732? There are many things that can go wrong simply because all that stuff sits in one file that no human being can digest anymore. And rest assured: debugging something like that will be hard.
I just gave you some questions to think about; so you might want to step back to give more requirements from your end.
In any case, you might want to look into a solution where developers deal with many small property files (say, one file per functionality), and then use tooling to build the one large file used in the production environment.
Finally: if your application really needs 30K properties, then you should worry much more about the quality of your product. In my eyes this isn't just a design "smell"; it sounds like a design stench. No reasonable application should require 30K properties to function.
Opening and closing thousands of files is a major overhead for the operating system, so you'd probably be best off with one big file.
I am writing an application which needs to add nodes to an existing XML file repeatedly, throughout the day. Here is a sample of the list of nodes that I am trying to append:
<gx:Track>
<when>2012-01-21T14:37:18Z</when>
<gx:coord>-0.12345 52.12345 274.700</gx:coord>
<when>2012-01-21T14:38:18Z</when>
<gx:coord>-0.12346 52.12346 274.700</gx:coord>
<when>2012-01-21T14:39:18Z</when>
<gx:coord>-0.12347 52.12347 274.700</gx:coord>
....
This can happen up to several times a second over a long time and I would like to know what the best or most efficient way of doing this is.
Here is what I am doing right now: use a DocumentBuilderFactory to parse the XML file, look for the container node, append the child nodes, and then use a TransformerFactory to write it back to the SD card. However, I have noticed that as the file grows larger, this takes more and more time.
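In code, the current approach looks roughly like this (file path and namespace handling simplified for the example):

```java
import org.w3c.dom.Document;
import org.w3c.dom.Element;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import java.io.File;

// Roughly what I am doing now: re-parse the whole file, append the new
// nodes to the gx:Track container, then write everything back out.
public class TrackAppender {
    public void append(File kmlFile, String when, String coord) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(kmlFile);

        // Assumes there is exactly one gx:Track element in the file.
        Element track = (Element) doc.getElementsByTagName("gx:Track").item(0);

        Element whenElement = doc.createElement("when");
        whenElement.setTextContent(when);
        track.appendChild(whenElement);

        Element coordElement = doc.createElement("gx:coord");
        coordElement.setTextContent(coord);
        track.appendChild(coordElement);

        // Serialise the whole document back to the SD card.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.transform(new DOMSource(doc), new StreamResult(kmlFile));
    }
}
```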
I have tried to think of a better way, and this is the only thing I can come up with: use RandomAccessFile to load the file and .seek() to a specific position in it. I would calculate the position based on the file length, subtracting what I 'know' is the length of the part of the file that comes after the point where I'm appending.
I'm pretty sure this method would work, but it feels a bit blind compared to the ease of using a DocumentBuilderFactory.
Is there a better way of doing this?
You should try using JAXB. It's a Java XML binding library that is included in most Java 1.6 JDKs. JAXB lets you specify an XML Schema Definition file (and has some experimental support for DTD). The library will then generate Java classes for you to use in your code, which translate back into an XML document.
It's very quick and useful with optional support for validation. This would be a good starting point. This would be another good one to look at. Eclipse also has some great tools for generating the Java classes for you, and providing a nice GUI tool for XSD creation. The Eclipse plugins are called Davi I believe.
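A minimal sketch of what using JAXB looks like, assuming a hypothetical Track class (in practice it would be generated from, or annotated to match, your schema):

```java
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Marshaller;
import javax.xml.bind.Unmarshaller;
import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical class standing in for one generated from your XSD.
@XmlRootElement(name = "track")
class Track {
    @XmlElement(name = "when")
    public List<String> when = new ArrayList<>();

    @XmlElement(name = "coord")
    public List<String> coord = new ArrayList<>();
}

public class JaxbExample {
    public static void main(String[] args) throws Exception {
        JAXBContext context = JAXBContext.newInstance(Track.class);

        // Read the existing file into an object graph.
        Unmarshaller unmarshaller = context.createUnmarshaller();
        Track track = (Track) unmarshaller.unmarshal(new File("track.xml"));

        // Modify it in memory, then write it back out.
        track.when.add("2012-01-21T14:40:18Z");
        track.coord.add("-0.12348 52.12348 274.700");

        Marshaller marshaller = context.createMarshaller();
        marshaller.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, Boolean.TRUE);
        marshaller.marshal(track, new File("track.xml"));
    }
}
```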
Does anyone know of any open-source Java libraries that provide features for handling a large number of files (write/read) on disk? I am talking about 2-4 million files (most of them PDF and MS Office documents). It is not a good idea to store all the files in a single directory, and instead of reinventing the wheel I am hoping this has been done by many people already.
Features I am looking for
1) Able to write/read files from disk
2) Able to create random directories/sub-directories for new files
3) Provide versioning/audit (optional)
I was looking at the JCR API and it looks promising, but it starts with a workspace and I'm not sure what the performance will be when there are many nodes.
Edit: JCR does look pretty good. I'd suggest trying it out to see how it actually performs for your use case.
If you're running your system on Windows and noticed a horrible n^2 performance hit at some point, you're probably running up against the performance hit incurred by automatic 8.3 filename generation. Of course, you can disable 8.3 filename generation, but as you pointed out, it would still not be a good idea to store large numbers of files in a single directory.
One common strategy I've seen for handling large numbers of files is to create directories for the first n letters of the filename. For example, document.pdf would be stored in d/o/c/u/m/document.pdf. I don't recall ever seeing a library to do this in Java, but it seems pretty straightforward. If necessary, you can create a database to store the lookup table (mapping keys to the uniformly-distributed random filenames), so you won't have to rebuild your index every time you start up. If you want to get the benefit of automatic deduplication, you could hash each file's content and use that checksum as the filename (but you would also want to add a check so you don't accidentally discard a file whose checksum matches an existing file even though the contents are actually different).
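For example, a rough sketch of the content-hash variant (class and path names are just for illustration):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;

// Rough sketch: hash the file content, then fan the files out into
// subdirectories based on the first characters of the hash so that no
// single directory grows too large.
public class HashedFileStore {
    private final Path root;

    public HashedFileStore(Path root) {
        this.root = root;
    }

    public Path store(byte[] content) throws Exception {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest(content)) {
            hex.append(String.format("%02x", b));
        }
        String name = hex.toString();

        // e.g. ab/cd/abcdef... : two levels of fan-out.
        Path dir = root.resolve(name.substring(0, 2)).resolve(name.substring(2, 4));
        Files.createDirectories(dir);

        Path target = dir.resolve(name);
        if (!Files.exists(target)) {
            Files.write(target, content);
        }
        // Note: as mentioned above, a real implementation should also compare
        // contents on a hash collision rather than silently skipping the write.
        return target;
    }
}
```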
Depending on the sizes of the files, you might also consider storing the files themselves in a database. If you do this, it is trivial to add versioning, and you don't necessarily have to create random filenames because you can reference each file by an auto-generated primary key.
Combine the functionality in the java.io package with your own custom solution.
The java.io package can write and read files from disk and create arbitrary directories or sub-directories for new files. There is no external API required.
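For example, a rough sketch of the basic plumbing (paths are placeholders):

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

// Rough sketch: plain java.io is enough to create nested directories and
// to write and read file contents.
public class PlainIoStore {
    public void write(File target, byte[] content) throws IOException {
        // Creates any missing parent directories (e.g. a/b/c/).
        target.getParentFile().mkdirs();
        try (FileOutputStream out = new FileOutputStream(target)) {
            out.write(content);
        }
    }

    public byte[] read(File source) throws IOException {
        byte[] content = new byte[(int) source.length()];
        try (FileInputStream in = new FileInputStream(source)) {
            int offset = 0;
            while (offset < content.length) {
                int read = in.read(content, offset, content.length - offset);
                if (read < 0) {
                    break;
                }
                offset += read;
            }
        }
        return content;
    }
}
```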
The versioning or auditing would have to be provided by your own custom solution. There are many ways to handle this, and you probably have a specific need to fill. Especially if you're concerned about the performance of an open-source API, you will likely get the best result by simply coding a solution that fits your needs exactly.
It sounds like your module should scan all the files on startup and form an index of everything that's available. Based on the method used for sharing and indexing these files, it can rescan the files every so often or you can code it to receive a message from some central server when a new file or version is available. When someone requests a file or provides a new file, your module will know exactly how it is organized and exactly where to get or put the file within the directory tree.
It seems that it would be far easier to just engineer a solution specific to your needs.
I'm currently debugging some fairly complex persistence code, and trying to increase test coverage whilst I'm at it.
Some of the bugs I'm finding against the production code require large, and very specific object graphs to reproduce.
While technically I could sit and write out buckets of instantiation code in my tests to reproduce the specific scenarios, I'm wondering if there are tools that can do this for me?
I guess specifically I'd like to be able to dump out an object as it is in my debugger frame (probably to XML), then use something to load the XML and recreate the object graph for unit testing (e.g. XStream).
Can anyone recommend tools or techniques which are useful in this scenario?
I've done this sort of thing using ObjectOutputStream, but XML should work fine too. You need to be working with a serializable tree; you might also try JAXB or XStream. It's pretty straightforward: if you have a place in your code that builds the structure in a form that would be good for your test, inject the serialization code there, write everything to a file, and then remove the injected code. For the test, load the XML back in. You can put the file on the classpath somewhere; I usually use a resources or config directory and get a stream with Thread.currentThread().getContextClassLoader().getResourceAsStream(name). Then deserialize the stream and you're good to go.
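In rough outline (file and resource names are arbitrary placeholders):

```java
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

// Rough outline of the approach: dump the graph from the production code,
// then load it back from the classpath inside the test.
public class GraphSnapshot {

    // Temporarily injected where the production code has built the graph:
    public static void dump(Object graph) throws Exception {
        try (ObjectOutputStream out =
                     new ObjectOutputStream(new FileOutputStream("/tmp/graph.ser"))) {
            out.writeObject(graph);
        }
    }

    // Used from the test, after copying graph.ser into a test resources directory:
    public static Object load() throws Exception {
        InputStream in = Thread.currentThread()
                .getContextClassLoader()
                .getResourceAsStream("graph.ser");
        try (ObjectInputStream objects = new ObjectInputStream(in)) {
            return objects.readObject();
        }
    }
}
```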
XStream is of use here. It allows you to dump practically any POJO to/from XML without having to implement interfaces, add annotations, etc. The only headache I've had is with inner classes (since it will try to serialise the referenced outer class).
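Something along these lines (a quick sketch; the object being serialised is just an arbitrary placeholder):

```java
import com.thoughtworks.xstream.XStream;

import java.util.Arrays;
import java.util.List;

// Quick sketch: XStream round-trips arbitrary POJOs without annotations.
public class XStreamSnapshot {
    public static void main(String[] args) {
        XStream xstream = new XStream();

        // In practice this would be the graph you captured while debugging.
        List<String> graph = Arrays.asList("a", "b", "c");

        String xml = xstream.toXML(graph);       // dump to XML for the test
        Object restored = xstream.fromXML(xml);  // rebuild the graph in the test

        System.out.println(restored);
    }
}
```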
I guess all your data is persisted in a database. You can use a test-data generation tool to fill your database with test data, export that data as SQL scripts, and then preload it before your integration test starts.
You can use DBUnit to preload data in your unit test; it also has a number of options to verify database structure/data before the test starts. http://www.dbunit.org/
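A minimal sketch of preloading a flat XML dataset with DBUnit (the JDBC URL and dataset path are placeholders):

```java
import org.dbunit.database.DatabaseConnection;
import org.dbunit.database.IDatabaseConnection;
import org.dbunit.dataset.IDataSet;
import org.dbunit.dataset.xml.FlatXmlDataSetBuilder;
import org.dbunit.operation.DatabaseOperation;

import java.io.File;
import java.sql.Connection;
import java.sql.DriverManager;

// Minimal sketch: wipe the relevant tables and insert a known dataset
// before each integration test.
public class TestDataLoader {
    public static void preload() throws Exception {
        Connection jdbc = DriverManager.getConnection("jdbc:h2:mem:testdb", "sa", "");
        IDatabaseConnection connection = new DatabaseConnection(jdbc);

        IDataSet dataSet = new FlatXmlDataSetBuilder()
                .build(new File("src/test/resources/dataset.xml"));

        DatabaseOperation.CLEAN_INSERT.execute(connection, dataSet);
    }
}
```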
For test data generation in a database there are a number of commercial tools you can use. I don't know of any good free tool that can handle features like predefined lists of data, random data with a predefined distribution, foreign key usage from other tables, etc.
I don't know about Java but if you change the implementations of your classes then you may no longer be able to deserialize old unit tests (which were serialized from older versions of the classes). So in the future you may need to put some effort into migrating your unit test data if you change your class definitions.
Is there some library that provides a sort of cursor over a file? I have to read big files, but can't afford to read them all into memory at once. I'm aware of java.nio, but I want to use a higher-level API.
A little background: I have a tool written in GWT that analyzes submitted XML documents and then pretty-prints the XML, among other things. Currently I'm writing the pretty-printed XML to a temp file (my lib would throw an OOMException if I used plain Strings), but the temp file's size is approaching 18 MB, and I can't afford to respond to a GWT RPC with 18 MB :)
So I can have a widget to show only a portion of the xml (check this example), but I need to read the corresponding portion of the file.
Have you taken a look at FileChannels (i.e., memory-mapped files)? Memory-mapped files allow you to manipulate large files without bringing the entire file into memory.
Here's a link to a good introduction:
http://www.developer.com/java/other/article.php/1548681
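A minimal sketch of reading just one slice of a large file this way (offsets are placeholders):

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;

// Minimal sketch: map only the region of the file you actually need,
// instead of reading the whole thing into memory.
public class MappedSlice {
    public static String readSlice(String path, long offset, int length) throws Exception {
        try (RandomAccessFile file = new RandomAccessFile(path, "r");
             FileChannel channel = file.getChannel()) {
            MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_ONLY, offset, length);
            byte[] bytes = new byte[length];
            buffer.get(bytes);
            return new String(bytes, StandardCharsets.UTF_8);
        }
    }
}
```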
Maybe java.io.RandomAccessFile can be of use to you.
I don't understand the request for a "higher level API" when what you want is to position the file pointer. It is the higher levels that would need to control the "cursor". If you want control, go lower, not higher.
I am certain that the lower-level Java IO classes allow you to position yourself anywhere within a file of any size without reading anything into memory until you want to. I know I have done it before. Try RandomAccessFile as one example.
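For example (a rough sketch; offsets and lengths are placeholders):

```java
import java.io.RandomAccessFile;
import java.util.Arrays;

// Rough sketch: jump straight to the portion of the file you want and read
// only that many bytes; nothing before the offset is loaded into memory.
public class CursorRead {
    public static byte[] readChunk(String path, long offset, int length) throws Exception {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            file.seek(offset);                 // jump straight to the offset
            byte[] chunk = new byte[length];
            int read = file.read(chunk);       // may be shorter near end of file
            return read < length ? Arrays.copyOf(chunk, Math.max(read, 0)) : chunk;
        }
    }
}
```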