Size of jdom document vs single variables

Size of jdom document vs single variables - java

I planning on reading profile data from a xml file in JSP.
Now i can either read it and store the important information in single session variables or just put the whole section from the xml file in a jdom document and put that into a single session variable.
In your experience, will the data size impact the big or negligible ?

JDOM 2.0.0 - released yesterday (I am the maintainer) has an improved memory footprint... it uses less than 10x as much memory as the input document. Additionally, if you use the 'SlimJDOMFactory' when parsing the XML, you use even less. A typical 275KB document is parsed in 1.5Meg using the SlimJDOMFactory. See the performance metrics for JDOM 2.0.0 at http://hunterhacker.github.com/jdom/jdom2/performance.html and search for SlimJDOMFactory to get the results using the (slower) but more efficient factory.
This in no way answers your question, because in reality, it all depends on your input data size. My experience, and I am biased, is that for small documents it's easier to just load it all in memory, and only 'sweat it' for the big ones.

There is a rule of thumb when it comes down to storing XML in memory with DOM. The size of the XML-file * 10. So if you XML-File is 1MB big then you will need 10MB memory to store it.
But in my experience i would never ever store a DOM document in memory. Once some student tried to do that, but the XML-File was 50MB big, so guess what happens? We ran out of Memory.
For your case i would create a class which can hold all the relevant information, fill the class by reading the xml.
Do you really need to store the profile data in the session? Do you really need it all the time? Sometime it is enough to just the an id or a small class.

Well, the more objects you store in memory the less free heap memory you'll have, there isn't a way around that.
It depends on your application domain really.
But in general people use cache solutions (ehCache is one, from the top of my head) between the datasource (your xml file/s) and their application domain model.
The cache expires or is cleared on demand, therefore you have reasonable control of your objects, and the heap memory they occupy.

Related

How B-Tree works in term of serialisation?

In Java, I know that if you are going to build a B-Tree index on Hard Disk, you probably should use serialisation were the B-Tree structure has to be written from RAM to HD. My question is, if later I'd like to query the value of a key out of the index, is it possible to deserialise just part of the B-Tree back to RAM? Ideally, only retrieving the value of a specific key. Fetching the whole index to RAM is a bad design, at least where the B-Tree is larger than the RAM size.
If this is possible, it'd be great if someone provides some code. How DBMSs are doing this, either in Java or C?
Thanks in advance.

you probably should use serialisation were the B-Tree structure has to be written from RAM to HD
Absolutely not. Serialization is the last technique to use when implementing a disk-based B-tree. You have to be able to read individual nodes into memory, add/remove keys, change pointers, etc, and put them back. You also want the file to be readable by other languages. You should define a language-independent representation of a B-tree node. It's not difficult. You don't need anything beyond what RandomAccessFile provides.

You generally split the B-tree into several "pages," each with some of they key-value pairs, etc. Then you only need to load one page into memory at a time.

For inspiration of how rdbms are doing it, it's probably a good idea to check the source code of the embedded Java databases: Derby, HyperSql, H2, ...
And if those databases solve your problem, I'd rather forget about implementing indices and use their product right away. Because they're embedded, there is no need to set up a server. - the rdbms code is part of the application's classpath - and the memory footprint is modest.
IF that is a possibility for you of course...
If the tree can easily fit into memory, I'd strongly advise to keep it there. The difference in performance will be huge. Not to mention the difficulties to keep changes in sync on disk, reorganizing, etc...
When at some point you'll need to store it, check Externalizable instead of the regular serialization. Serializing is notoriously slow and extensive. While Externalizable allows you to control each byte being written to disk. Not to mention the difference in performance when reading the index back into memory.
If the tree is too big to fit into memory, you'll have to use RandomAccessFile with some kind of memory caching. Such that often accessed items come out of memory nonetheless. But then you'll need to take updates to the index into account. You'll have to flush them to disk at some point.
So, personally, I'd rather not do this from scratch. But rather use the code that's out there. :-)

What is an alternative to using DOM XML parser for large XML Documents for multiple find operations?

I am storing data for ranking users in XML documents - one row per user - containing a 36 char key, score, rank, and username as attributes.
<?xml version=\"1.0\" encoding=\"UTF-8\"?>
<!DOCTYPE Ranks [<!ELEMENT Rank ANY ><!ATTLIST Rank id ID #IMPLIED>]>
<Ranks>
..<Rank id="<userKey>" score="36.0" name="John Doe" rank=15></Rank>..
</Ranks>
There are several such documents which are parsed on request using a DOM parser and kept in memory until the file is updated. This happens from within a HttpServlet which is backing a widget. Every time the widget is loaded it calls the servlet with a get request which then requires one of the documents to be queried. The queries on the documents require the following operations:
Look up - finding a particular ID
Iterate through each Rank element and get the id attribute
In my test environment the number of users is <100 and everything works well. However we are soon supposed to be delivering to a system with 200K+ users. I have serious concerns about the scalability of my approach - i.e. OutOfMemoryException!
I'm stuck for ideas for an implementation which balances performance and memory usage. While DOM is good for find operations it may choke because of the large size. I don't know much about StAX, but from what I have read it seems that it might solve the memory issue but could really slow down the queries as I will have to effectively iterate through the document to find the element of interest (Is that correct?).
Questions:
Is it possible to use StAX for multiple find (like getElementById) operations on large documents quick enough to serve an HttpRequest?
What is the maximum file size that a DOM Parser can handle?
Is it possible to estimate how much memory per user would be used for an XML document with the above structure?
Thanks
Edit: I am not allowed to use databases.
Edit: Would it be better/neater to use a custom formatted file instead and use Regular expressions to search the file for the required entry?

It sounds like you're using the xml document as a database. I think you'll be much happier using a proper database for this, and importing/exporting to xml as needed. Several databases work well, so you might as well use one that's well supported, like mysql or postgresql, although even sqlite will work better than xml.
In terms of SAX parsing, you basically build a large state machine that handles various events that occur while parsing (entering a tag, leaving a tag, seeing data, etc.). You're then on your own to manage memory (recording the data you see depending on the state you're in), so you're correct that it can have a better memory footprint, but running a query like that for every web request is ridiculous, especially when you can store all your data in a nice indexed database.

One of the big problems here is that DOM is not thread-safe, so even read operations need to be synchronized. From that point of view, using JDOM or XOM would definitely be better.
The other issue is the search strategy used to find the data. You really want the queries to be supported by indexing rather than using serial search. In fact, you need a decent query optimizer to generate efficient access paths. So given your constraint of not using a database, this sounds like a case for an in-memory XQuery engine with agressive optimization, for which the obvious candidate is Saxon-EE. But then I would say that, wouldn't I?

For heavy due XML processing, VTD-XML is the most efficient option available, it is far more efficent than JDOM, DOM4j or DOM... the key is non-object oriented approach of its info-set modeling... it is also far less likely to cause out of memory exception... Read this 2013 paper for the comprehensive comparison/benchmark between various XML frameworks
Processing XML with Java – A Performance Benchmark

Reducing memory footprint while using large XML DOM's in Java

Our application is required to take client data presented in XML format (several files) and parse this into our common XML format (a single file with schema). For this purpose we are using apache's XMLBeans data binding framework. The steps of this process are briefly described below.
First, we take raw java.io.File objects pointing to the client XML files on-disk and load these into a collection. We then iterate over this collection creating a single apache.xmlbeans.XmlObject per file. After all files have been parsed into XmlObjects, we create 4 collections holding the individual objects from the XML documents that we are interested in (to be clear, these are not hand-crafted objects but what I can only describe as 'proxy' objects created by apache's XMLBeans framework). As a final step, we then iterate over these collections to produce our XML document (in memory) and then save this to disk.
For the majority of use cases, this process works fine and can easily run in the JVM when given the '-Xmx1500m' command-line argument. However, issues arise when we are given 'large datasets' by the client. Large in this instance is 123Mb of client XML spread over 7 files. Such datasets result in our in-code collections being populated with almost 40,000 of the aforementioned 'proxy objects'. In these cases the memory usage just goes through the roof. I do not get any outofmemory exceptions the program just hangs until garbage collection occurs, free-ing up a small amount of memory, the program then continues, uses up this new space and the cycle repeats. These parsing sessions currently take 4-5 hours. We are aiming to bring this down to within an hour.
Its important to note that the calculations required to transform client xml into our xml require all of the xml data to cross-reference. Therefore we cannot implement a sequential parsing model or batch this process into smaller blocks.
What I've tried so far
Instead of holding all 123Mb of client xml in memory, on each request for data, load the files, find the data and release the references to these objects. This does seem to reduce the amount of memory consumed during the process but as you can imagine, the amount of time the constant I/O takes removes the benefit of the reduced memory footprint.
I suspected an issue was that we are holding an XmlObject[] for 123Mb worth of XML files as well as the collections of objects taken from these documents (using xpath queries). To remedy, I altered the logic so that instead of querying these collections, the documents were queried directly. The idea here being that at no point does there exist 4 massive Lists with 10's of 1000's of objects in, just the large collection of XmlObjects. This did not seem to make a difference at all and in some cases, increases the memory footprint even more.
Clutching at straws now, I considered that the XmlObject we use to build our xml in-memory before writing to disk was growing too large to maintain alongside all the client data. However, doing some sizeOf queries on this object revealed that at its largest, this object is less than 10Kb. After reading into how XmlBeans manages large DOM objects, it seems to use some form of buffered writer and as a result, is managing this object quite well.
So now I am out of ideas; Can't use SAX approaches instead of memory intensive DOM approaches as we need 100% of the client data in our app at any one time, cannot hold off requesting this data until we absolutely need it as the conversion process requires a lot of looping and the disk I/O time is not worth the saved memory space and I cannot seem to structure our logic in such a way as to reduce the amount of space the internal java collections occupy. Am I out of luck here? Must I just accept that if I want to parse 123Mb worth of xml data into our Xml format that I cannot do it with the 1500m memory allocation? While 123Mb is a large dataset in our domain, I cannot imagine others have never had to do something similar with Gb's of data at a time.
Other information that may be important
I have used JProbe to try and see if that can tell me anything useful. While I am a profiling noob, I ran through their tutorials for memory leaks and thread locks, understood them and there doesn't appear to be any leaks or bottlenecks in our code. After running the application with a large dataset, we quickly see a 'sawblade' type shape on the memory analysis screen (see attached image) with PS Eden space being taken over with a massive green block of PS Old Gen. This leads me to believe that the issue here is simply sheer amount of space taken up by object collections rather than a leak holding onto unused memory.
I am running on a 64-Bit Windows 7 platform but this will need to run on a 32 Bit environment.

The approach I'd take would be make two passes on the files, using SAX in both cases.
The first pass would parse the 'cross-reference' data, needed in the calculations, into custom objects and store them Maps. If the 'cross-reference' data is large then look at using distributed cache (Coherence is the natural fit if you've started with Maps).
The second pass would parse the files, retreive the 'cross-reference' data to perform calculations as needed and then write the output XML using the javax.xml.stream APIs.

Scaling application that reads large XML files

I have an application which reads large set of XML files (multiple around 20-30) periodically, like once every 10 minutes. Now each XML file can be approximated to at least 40-100 MB in size. Once each XML has read, a map is created out of the file, and then the map is passed across a processor chain (10-15), each processor using the data, performing some filter or writing to database, etc.
Now the application is running in 32 bit JVM. No intention on moving to 64 bit JVM right now. The memory foot-print as expected is very high... nearing the threshold of a 32 bit JVM. For now when we receive large files, we serialize the generated map into disk and run through the processor chain maximum of 3-4 map concurrently as if we try to process all the maps at the same time, it would easily go OutOfMemory. Also garbage collection is pretty high.
I have some ideas but wanted to see if there are some options which people have already tried/evaluated. So what are the options here for scaling this kind of application?

Yea, to parrot #aaray and #MeBigFatGuy, you want to use some event based parser for this, the dom4j mentioned, or SAX or StAX.
As a simple example, that 100MB XML is consuming a minimum of 200MB of RAM if you load it wholesale, as each character is immediately expanded to a 16 bit character.
Next, any tag of elements that you're not using is going to consume extra memory (plus all of the other baggage and bookkeeping of the nodes) and it's all wasted. If you're dealing with numbers, converting the raw string to a long will be a net win if the number is larger than 2 digits.
IF (and this is a BIG IF) you are using a lot of a reasonably small set of Strings, you can save some memory by String.intern()'ing them. This is a canonicalization process that makes sure if the string already exists in the jvm, its shared. The downside of this is that it pollutes your permgen (once interned, always interned). PermGen is pretty finite, but on the other hand it's pretty much immune to GC.
Have you considered being able to run the XML through an external XSLT to remove all of the cruft that you don't want to process before it even enters your JVM? There are several standalone, command line XSL processors that you can use to pre-process the files to something perhaps more sane. It really depends on how much of the data that is coming in you're actually using.
By using an event based XML processing model, the XSLT step is pretty much redundant. But the event based models are all basically awful to use, so perhaps using the XSLT step would let you re-use some of your existing DOM logic (assuming that's what you're doing).
The flatter your internal structures, the cheaper they are in terms of memory. You actually have a little bit of an advantage running a 32b vm, since instance pointers are half the size. But still, when you're talking 1000's or millions of nodes, it all adds up, and quickly.

We had a similar problem processing large XML files (around 400Mb). We greatly reduced the memory footprint of the application using this:
http://dom4j.sourceforge.net/dom4j-1.6.1/faq.html#large-doc

You can insert the contents of each XML file into a temporary DB table and each chain link would fetch the data it needs. You will probably lose performance, but gain scalability.

Keep serialized and compressed Objects in-memory

I'm currently working on a Part of an Application where "a lot" of data must be selected for further work and I have the impression that the I/O is limiting and not the following work.
My idea is now to have all these objects in memory but serialized an compressed. The question is, if accessing the objects like this would be faster than direct Database access and if it is a good idea or not. (and if it is feasble in terms of memory consumption = serialized form uses less memory than normal object)
EDIT February 2011:
The creation of the objects is the slow part and not the database access itself. Having all in memory is not possible and using ehcache option to "overflow to disk" is actually slower than just getting the data from the database. Standard java serialization is also unusable. it is also a lot slower. So basically nothing I can do about it...

You're basically looking for an in-memory cache or an in-memory datagrid. There are plenty of APIs/products for this sort of thing. ehcache/hibernate chace/gridgain etc etc

The compressed serialized form will use less memory, if it is a large object. However for smaller objects e.g. which use primtives. The original object will be much smaller.
I would first check whether you really need to do this. e.g. Can you just consume more memory? or restructure your objects so they use less memory.

"I have the impression that the I/O is limiting and not the following work. " -> I would be very sure of this before starting implementing such a thing.
The simpler approach I can suggest you is to use ehcache with the option to store on disk when the size of the cache get too big.
Another completely different approach could be using some doc based nosql db like couchdb to store objects selected "for further work"

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.