Maximum size when parsing XML with DOM - java

Currently I'm implementing a REST client that parses the XML response messages. It is intended to run later on an Android device, so memory and processing speed are quite an issue. However, there will be only one XML response at a time, so processing or holding multiple XML documents at once is not a concern.
As far as I understand, there are three ways of parsing XML with the Android SDK:
SAX
XmlPullParser
DOM
Reading about these different parsing methods, I gathered that SAX is recommended for large XML files, as it won't hold the complete tree in memory the way DOM does.
However, I'm asking myself: what counts as large, in terms of kilobytes, megabytes, ...? Is there a practical size up to which it doesn't really matter whether I use SAX or DOM?
Thanks,
Robert

There are no standard limits set for XML documents or DOM size so it depends entirely on what the host machine can cope with.
As you are implementing on Android you should assume a pretty limited amount of memory, and remember that the DOM, the XML parser, your program logic, the display logic, the JVM and Android itself all have to fit in the available memory!
As a rule of thumb you can expect the DOM to occupy roughly four times as much memory as the source XML document. So assume 512 MB of available memory and aim to take no more than half of this for your DOM: you end up with 512/8, or a practical maximum of 64 MB for the XML document.
Just to be on the safe side I would halve that again to a 32 MB maximum. So if you expect many documents of this size, I would switch to SAX parsing!
If you want the app to respond with any speed on large documents, SAX is the way to go. A SAX parser can start returning results as soon as the first element is read, whereas a DOM parser needs to read the whole document before any output can be sent to your program.
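To illustrate that last point, here is a minimal SAX sketch (the element name "item", attribute "id" and file name are made up for the example) whose callback fires for each element as soon as the parser reads it, with no full tree ever built:

import java.io.File;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class StreamingHandler extends DefaultHandler {
    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes attrs) {
        // Invoked for every opening tag as it is read, long before
        // the end of the document is reached; nothing is retained.
        if ("item".equals(qName)) {
            System.out.println("got item id=" + attrs.getValue("id"));
        }
    }

    public static void main(String[] args) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new File("response.xml"), new StreamingHandler());
    }
}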

Excerpt from this article:
DOM parsers suffer from memory bloat. With smaller XML sets this isn't such an issue, but as the XML size grows DOM parsers become less and less efficient, making them not very scalable in terms of growing your XML. Push parsers are a happy medium, since they allow you to control parsing, thereby eliminating any kind of complex state management since the state is always known, and they don't suffer from the memory bloat of DOM parsers.
This could be the reason SAX is recommended over DOM: SAX functions as an XML push parser. Also, check out the Wikipedia article for SAX here.
EDIT: To address size specifically you would have to look at your implementation. An example of DOM Document object size in the memory of a Java-based XML parser is here. Java, like a lot of languages, defines some memory-based limitations such as the JVM heap size, and the Android web services/XML DOM API may also define some internal limits at the programmers' discretion (mentioned in part here). There is no one definitive answer as to maximum allowed size.

My experience suggests that with DOM the memory used is about 2x the file size, but of course that's just an indication. If the XML tree has just one field containing the entire data, the memory used is close to the file size!

Related

Java: read huge XML online, entry by entry

I am working with a huge XML file (a Wikipedia dump), and it certainly can't be read into memory at once, nor would it be practical to do so.
I googled SAX XML tutorials, but they all show quite an ugly, low-level approach, where you have to set flags manually and keep track of which element you are currently in.
Actually, the whole dump consists of many relatively small page entries, so a reasonable strategy looks like:
read the whole page entry into memory;
process it;
dispose of it and move to the next.
This would require only enough memory to handle a single page entry, while I could still use all the conveniences of a parsed, tree-like XML representation (a sketch of this idea appears after the questions below).
My questions are:
Is it possible to implement such a strategy in Java?
Is it possible to do so using Jsoup as it is my main tool for working with smaller XML files?
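For what it's worth, a sketch of that strategy in plain Java, using StAX to stream the file and the JDK's identity Transformer to build a small throwaway DOM tree per entry. The "page" element name and "dump.xml" file name are assumptions about the dump format:

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stax.StAXSource;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class DumpReader {
    public static void main(String[] args) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new FileInputStream("dump.xml"));
        Transformer toDom = TransformerFactory.newInstance().newTransformer();
        while (reader.hasNext()) {
            if (reader.getEventType() == XMLStreamConstants.START_ELEMENT
                    && "page".equals(reader.getLocalName())) {
                DOMResult entry = new DOMResult();
                // Consumes exactly one <page> subtree from the stream
                toDom.transform(new StAXSource(reader), entry);
                process(((Document) entry.getNode()).getDocumentElement());
                // The small per-entry tree becomes garbage right here
            } else {
                reader.next();
            }
        }
        reader.close();
    }

    static void process(Node page) {
        // Walk the per-entry tree with the usual DOM API
    }
}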

Memory efficient way to modify XML in Java

I need to modify a single piece of information in an XML file. The file is about 100 lines. What would be the most memory-efficient way in Java to modify a single element in the whole file?
Is JAXB better?
A simple SAX parser?
Or any other way? Kindly suggest.
A SAX parser gives more control over parsing and is faster than a DOM parser. JAXB will be easy in the sense of less code to write. XStream is another option, but it is similar to JAXB in that it is a high-level API, so it carries some overhead and will be a bit slower than SAX.
I would not suggest direct string manipulation (applying String.indexOf() and String.replace()); although it would be the fastest way to update a unique tag in the XML, it's risky: your XML might end up invalid, and if the XML structure is not simple there is a risk of updating a tag at the wrong level :-)
Therefore, a SAX parser looks like the best bet to me.
Your files are not big. The memory used to hold a 100-line XML file costs about as much as 5 milliseconds of a programmer's time. I would question your requirement: why do you need to do it in "the most memory efficient way"? I would use XSLT or JDOM2, unless there is clear, quantified evidence that this will not meet an externally-imposed performance requirement that cannot be solved by buying a bit more memory.
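For instance, a minimal JDOM2 sketch of the readable approach; the file name "config.xml" and the "timeout" element are hypothetical stand-ins for the single element to change:

import java.io.File;
import java.io.FileWriter;
import org.jdom2.Document;
import org.jdom2.Element;
import org.jdom2.input.SAXBuilder;
import org.jdom2.output.Format;
import org.jdom2.output.XMLOutputter;

public class UpdateOneElement {
    public static void main(String[] args) throws Exception {
        // A 100-line file parses in milliseconds; readability wins here.
        Document doc = new SAXBuilder().build(new File("config.xml"));
        Element timeout = doc.getRootElement().getChild("timeout"); // hypothetical element
        if (timeout != null) {
            timeout.setText("60");
        }
        try (FileWriter out = new FileWriter("config.xml")) {
            // Raw format preserves the document's original whitespace
            new XMLOutputter(Format.getRawFormat()).output(doc, out);
        }
    }
}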

DOM vs SAX Java

I have a big XML file that can be downloaded from the internet. To parse it I tried using the DOM parser; however, it doesn't let me skip certain tags, as it gives me an error. Is there a way around this? If I understood correctly, the SAX parser allows you to skip tags whilst the DOM doesn't. Can someone kindly clarify this? If that is the case, I can't understand what the advantage of a DOM parser is. Thanks in advance.
DOM was designed as a language-independent object model to hold any XML data, and as such is a large and complex system. It suits well the two-phase approach of first loading an XML document in, then performing various operations on it.
SAX, on the other hand, was designed as a fairly light-weight system using a single-phase approach. With SAX, user-specified operations are performed as the document is loaded. Some applications use SAX to generate a smaller object model, with uninteresting information filtered out, which is then processed similarly to DOM.
Note that although DOM and SAX are the well-known "standard" XML APIs, there are plenty of others available, and sometimes a particular application may be better off using a non-standard API. With XML the important bit is always the data; code can be rewritten.
Some quick points:
SAX is faster than DOM.
SAX is good for large documents because it takes comparatively less memory than DOM.
SAX takes less time to read a document, whereas DOM takes more.
With SAX we can access data but we can't modify it; with DOM we can modify the data.
We can stop SAX parsing whenever and wherever we want (see the sketch after this list).
SAX parsing is sequential, but with DOM we can also move backwards.
To parse machine-generated XML, SAX is better; to work with human-readable documents, DOM is useful.
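On the stopping point: SAX has no built-in cancel operation, so a common trick is to throw an exception from the handler once the element of interest has been seen. A minimal sketch, with the "user" element and its attributes made up for the example:

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class FirstMatchHandler extends DefaultHandler {
    String foundName;

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes attrs) throws SAXException {
        if ("user".equals(qName) && "42".equals(attrs.getValue("id"))) {
            foundName = attrs.getValue("name");
            // Aborts the parse; the caller catches this exception and
            // reads foundName, so the rest of the file is never touched.
            throw new SAXException("match found");
        }
    }
}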

XML transformation in Java with best Performance

I want to do some manipulation of XML content in Java. See the XML below.
Source XML:
<ns1:Order xmlns:ns1="com.test.ns" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<OrderHeader>
<Image>Image as BinaryData of size 250KB</Image>
</OrderHeader>
</ns1:Order>
Target XML:
<OrderData>
<OrderHeader>
<Image>Image as BinaryData of size 250KB</Image>
</OrderHeader>
</OrderData>
As shown, I have the source XML and I want the target XML for it. The only difference is that the root element "ns1:Order" is replaced with "OrderData" in the target XML.
FYI, OrderHeader has one sub-element, Image, which holds binary image data of 250 KB (so this XML is going to be a large one). Also, the root element of the target XML, "OrderData", is known in advance.
Now, I want to achieve the above result in Java with the best performance. I already have the source XML content as a byte[], and I want the target XML content as a byte[] as well. I am open to using a SAX parser too.
Please suggest the solution with the best performance for this.
Thanks in advance,
Nurali
Do you mean machine performance or human performance? Spending an infinite amount of programmer time to achieve a microscopic gain in machine performance is a strange trade-off to make these days, when a powerful computer costs about the same as half a day of a contract programmer's time.
I would recommend using XSLT. It might not be fastest, but it will be fast enough. For a simple transformation like this, XSLT performance will be dominated by parsing and serialization costs, and those won't be any worse than for any other solution.
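A sketch of that approach using the JDK's built-in XSLT support, with the stylesheet inlined as a string; the class and method names are placeholders, but the byte[]-in, byte[]-out shape matches what the question asks for:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class RenameRoot {
    // Copies the document but replaces the root element, whatever its
    // name, with <OrderData>; children such as <OrderHeader> pass through.
    private static final String XSLT =
          "<xsl:stylesheet version='1.0' "
        + "    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "  <xsl:template match='/*'>"
        + "    <OrderData><xsl:copy-of select='node()'/></OrderData>"
        + "  </xsl:template>"
        + "</xsl:stylesheet>";

    public static byte[] rename(byte[] source) throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        t.transform(new StreamSource(new ByteArrayInputStream(source)),
                    new StreamResult(out));
        return out.toByteArray();
    }
}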
Not much will beat direct byte/String manipulation, for instance a regular expression.
But be warned: manipulating XML with regex is always a hot debate.
I used XSLT to transform XML documents. That's another way to do it. There are several Java implementations of XSLT processors.
The fastest way to manipulate strings in Java is direct manipulation with a StringBuilder for the results. I wrote code to modify 20 MB strings that built a table of change locations and then copied the modified string into a new StringBuilder. For strings, XSLT and regex are much slower than direct manipulation, and SAX/DOM parsers are slower still.

What is an alternative to using DOM XML parser for large XML Documents for multiple find operations?

I am storing data for ranking users in XML documents - one row per user - containing a 36-character key, score, rank, and username as attributes.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Ranks [<!ELEMENT Rank ANY ><!ATTLIST Rank id ID #IMPLIED>]>
<Ranks>
..<Rank id="<userKey>" score="36.0" name="John Doe" rank="15"></Rank>..
</Ranks>
There are several such documents, which are parsed on request using a DOM parser and kept in memory until the file is updated. This happens from within an HttpServlet which is backing a widget. Every time the widget is loaded it calls the servlet with a GET request, which then requires one of the documents to be queried. The queries on the documents require the following operations:
Lookup - finding a particular ID
Iterate through each Rank element and get the id attribute
In my test environment the number of users is <100 and everything works well. However, we are soon supposed to deliver to a system with 200K+ users, and I have serious concerns about the scalability of my approach - i.e. an OutOfMemoryException!
I'm stuck for ideas for an implementation that balances performance and memory usage. While DOM is good for find operations, it may choke because of the large size. I don't know much about StAX, but from what I have read it seems that it might solve the memory issue but could really slow down the queries, as I will have to effectively iterate through the document to find the element of interest (is that correct?).
Questions:
Is it possible to use StAX for multiple find (like getElementById) operations on large documents quickly enough to serve an HttpRequest?
What is the maximum file size that a DOM Parser can handle?
Is it possible to estimate how much memory per user would be used for an XML document with the above structure?
Thanks
Edit: I am not allowed to use databases.
Edit: Would it be better/neater to use a custom formatted file instead and use Regular expressions to search the file for the required entry?
It sounds like you're using the XML document as a database. I think you'll be much happier using a proper database for this, and importing/exporting to XML as needed. Several databases work well, so you might as well use one that's well supported, like MySQL or PostgreSQL, although even SQLite will work better than XML.
In terms of SAX parsing, you basically build a large state machine that handles various events that occur while parsing (entering a tag, leaving a tag, seeing data, etc.). You're then on your own to manage memory (recording the data you see depending on the state you're in), so you're correct that it can have a better memory footprint, but running a query like that for every web request is ridiculous, especially when you can store all your data in a nice indexed database.
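As a middle ground under the "no database" constraint, the file could be streamed once and reduced to a plain in-memory map, so each HTTP request is just a hash lookup rather than a re-parse. A sketch using StAX, with the attribute names taken from the sample document above:

import java.io.FileInputStream;
import java.util.HashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class RankIndex {
    // Parse the document once with StAX, keeping only a small
    // id -> score index in memory; serve lookups from the map.
    public static Map<String, Double> build(String file) throws Exception {
        Map<String, Double> index = new HashMap<>();
        XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new FileInputStream(file));
        while (r.hasNext()) {
            if (r.next() == XMLStreamConstants.START_ELEMENT
                    && "Rank".equals(r.getLocalName())) {
                index.put(r.getAttributeValue(null, "id"),
                          Double.parseDouble(r.getAttributeValue(null, "score")));
            }
        }
        r.close();
        return index; // rebuild whenever the underlying file changes
    }
}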
One of the big problems here is that DOM is not thread-safe, so even read operations need to be synchronized. From that point of view, using JDOM or XOM would definitely be better.
The other issue is the search strategy used to find the data. You really want the queries to be supported by indexing rather than serial search. In fact, you need a decent query optimizer to generate efficient access paths. So given your constraint of not using a database, this sounds like a case for an in-memory XQuery engine with aggressive optimization, for which the obvious candidate is Saxon-EE. But then I would say that, wouldn't I?
For heavy-duty XML processing, VTD-XML is the most efficient option available; it is far more efficient than JDOM, DOM4J, or DOM... the key is the non-object-oriented approach of its info-set modeling... it is also far less likely to cause an out-of-memory exception. Read this 2013 paper for a comprehensive comparison/benchmark of various XML frameworks:
Processing XML with Java – A Performance Benchmark
