DOM vs SAX in Java

I have a big XML file that can be downloaded from the internet. To parse it I tried using the DOM parser, but it doesn't let me skip certain tags; it gives me an error. Is there a way around this? If I understood correctly, the SAX parser allows you to skip tags whilst the DOM parser doesn't. Can someone kindly clarify this, because if that is the case I can't understand what the advantage of a DOM parser is. Thanks in advance.

DOM was designed as a language-independent object model to hold any XML data, and as such it is a large and complex system. It is well suited to the two-phase approach of first loading an XML document in, then performing various operations on it.
SAX, on the other hand, was designed as a fairly light-weight system using a single-phase approach. With SAX, user-specified operations are performed as the document is loaded. Some applications use SAX to generate a smaller object model, with uninteresting information filtered out, which is then processed similarly to DOM.
Note that although DOM and SAX are the well-known "standard" XML APIs, there are plenty of others available, and sometimes a particular application may be better off using a non-standard API. With XML the important bit is always the data; code can be rewritten.
Some quick points:
- SAX is faster than DOM.
- SAX is good for large documents because it uses comparatively less memory than DOM.
- SAX takes less time to read a document, whereas DOM takes more.
- With SAX we can access data but we can't modify it; with DOM we can modify data.
- We can stop SAX parsing whenever and wherever we want.
- SAX parsing is strictly sequential, but with DOM we can also move backwards.
- SAX is better for parsing machine-generated documents; DOM is useful for human-readable ones.
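Coming back to the original question: "skipping" tags with SAX simply means the handler only reacts to the elements it cares about and ignores everything else. A minimal sketch of that idea (the title element name is invented for the example, not taken from the question):

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// A SAX handler that extracts only <title> text and silently skips all
// other elements: no error, no tree, no memory for uninteresting tags.
public class SkippingHandler extends DefaultHandler {
    private final StringBuilder current = new StringBuilder();
    private boolean inTitle = false;
    public String lastTitle;

    @Override
    public void startElement(String uri, String local, String qName, Attributes attrs) {
        if ("title".equals(qName)) {   // every other element is simply ignored
            inTitle = true;
            current.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inTitle) current.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if ("title".equals(qName)) {
            inTitle = false;
            lastTitle = current.toString();
        }
    }

    public static String parseTitle(String xml) throws Exception {
        SkippingHandler h = new SkippingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), h);
        return h.lastTitle;
    }
}
```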

Related

is it possible to use Event Sourcing with StAX XML event sequences

I've got an XML performance problem that I'm looking to solve.
Specifically I've got the same small/medium sized XML file that's being parsed many hundreds of times.
The functionality is bound to a StAX XML event reader. Its output cannot be cloned or otherwise copied, the only way to reproduce the needed functionality is to run this XML event reader over the XML document again.
For performance I would like to read the XML into a StAX event sequence eagerly, and then replay that event sequence rather than re-parse the XML each time.
I believe the problem is one of implementation: while the idea is reasonable in principle, "events" are expressed as state changes on the XMLStreamReader, which has a large API surface, a large portion (but not all) of which relates to its "current" event.
Does a system like this already exist?
If I have to build it myself, what might be the best way to ensure correctness?
The usual way to represent an XML document in memory, to avoid parsing it repeatedly, is to use one of the many tree models (JDOM2 and XOM are the best in my view, though many people still use the horrible old DOM model simply because it's packaged in the JDK). So I guess I'm asking why doesn't this "obvious" approach work for you?
There are cases where (internally within Saxon) I use a replayable stream of events instead, simply because storing the events and then replaying them is a bit more efficient than building a tree and then walking it. I don't use StAX events for this; I use my own class net.sf.saxon.event.EventBuffer, which holds a list of net.sf.saxon.event.Event objects. Perhaps this event model is a bit better designed for the purpose, being rather simpler than the StAX model. Saxon doesn't have any logic to read an EventBuffer as a StAX event stream, but it would be easy enough to add. It's open-source code, so see if you can adapt it.
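If you'd rather stay with the standard APIs, note that the StAX event API (XMLEventReader), unlike the cursor-style XMLStreamReader the question mentions, hands you immutable XMLEvent objects that can be cached in a list and iterated again. A rough sketch of the record-and-replay idea under that assumption:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.events.XMLEvent;

// Parse the document once with the StAX event API and cache every event;
// the cached list can then be replayed any number of times without
// touching the parser (or the underlying XML text) again.
public class EventBufferDemo {
    public static List<XMLEvent> record(String xml) throws Exception {
        XMLEventReader r = XMLInputFactory.newInstance()
                .createXMLEventReader(new StringReader(xml));
        List<XMLEvent> buffer = new ArrayList<>();
        while (r.hasNext()) buffer.add(r.nextEvent());
        return buffer;
    }

    // Example "replay": count start-element events from the cached list.
    public static int countStartElements(List<XMLEvent> buffer) {
        int n = 0;
        for (XMLEvent e : buffer) if (e.isStartElement()) n++;
        return n;
    }
}
```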

Memory efficient way to modify XML in Java

I need to modify a single piece of information in an XML file. The XML file is about 100 lines. What would be the most memory-efficient way in Java to modify a single element in the whole file?
Is JAXB better?
A simple SAX parser?
Or any other way? Kindly suggest.
A SAX parser gives more control over parsing and is faster than a DOM parser. JAXB will be easier in the sense of writing less code. XStream is another option, but it is similar to JAXB in being a high-level API, so it carries some overhead and will be a bit slower than SAX.
I would not suggest direct string manipulation (String.indexOf() and String.replace()): although it would be the fastest way to update a unique tag, it is risky because the result might not be valid XML, and if the structure is not simple there is a risk of updating a tag at the wrong level :-)
Therefore, a SAX parser looks the best bet to me.
Your files are not big. The memory used to hold a 100-line XML file costs about as much as 5 milliseconds of a programmer's time. I would question your requirement: why do you need to do it in "the most memory-efficient way"? I would use XSLT or JDOM2, unless there is clear, quantified evidence that this will not meet an externally-imposed performance requirement that cannot be solved by buying a bit more memory.
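To illustrate the small-file point, here is a sketch of a tree-based edit using the JDK's built-in DOM and an identity Transformer, with no extra dependencies (the setFirst helper and the price element are invented for the example; JDOM2 or XSLT would express the same thing):

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// For a ~100-line file: load the whole document as a DOM tree, change the
// text of the first matching element, and serialize the tree back out.
public class DomEdit {
    public static String setFirst(String xml, String tag, String newText) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        doc.getElementsByTagName(tag).item(0).setTextContent(newText);
        StringWriter out = new StringWriter();
        // Identity transform = "write this DOM tree back out as XML text".
        TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }
}
```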

Which one to choose between SAX and STAX to read large xml files

I have a requirement where I can have a 100 MB or bigger XML file containing a list of companies, and I need to add each company from that file into a table.
I was thinking of using a SAX parser, but I was also considering a StAX parser. Can someone please help me decide which one to use?
Thanks
StAX has a much easier-to-use API, so I think it is the better choice. SAX has a low-level push API, which is not very nice to use (e.g. working with char[]); StAX has a much nicer pull API.
Another potential advantage: with StAX you don't have to read the whole document; you can stop once you have what you need.
There is a nice, though quite old, comparison of the Java XML parsing APIs available online.
Using StAX will allow you to minimize the amount of data kept in memory to only the most recently parsed record. Once you insert that record into your table, you no longer need to keep it in memory.
If you use SAX you would (likely) have to parse the entire XML content into memory before inserting records into your table. While it would be possible to insert as you go (when encountering the closing element for a record), that is more complicated with SAX than with StAX.
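A sketch of that record-at-a-time StAX loop. The company and name element names are assumptions about the file's layout, and the database insert is stubbed out with a list add; only the record currently being parsed is held in memory:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Stream a (potentially huge) file with the StAX cursor API, handling one
// <company> element at a time and discarding it as soon as it is "inserted".
public class CompanyLoader {
    public static List<String> load(String xml) throws Exception {
        XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        List<String> inserted = new ArrayList<>();
        String name = null;
        while (r.hasNext()) {
            int ev = r.next();
            if (ev == XMLStreamConstants.START_ELEMENT && "name".equals(r.getLocalName())) {
                name = r.getElementText();   // text of <name> inside the current <company>
            } else if (ev == XMLStreamConstants.END_ELEMENT && "company".equals(r.getLocalName())) {
                inserted.add(name);          // stand-in for the real DB insert
                name = null;                 // the record is now out of memory
            }
        }
        return inserted;
    }
}
```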

parsing large XML file in Java [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Looping over a large XML file
What is a better way to parse large XML data, which is essentially a collection of XML records, in Java and Java-based frameworks? We get the data from a web service call and it runs into a few MB (typically 25 MB+). This data essentially corresponds to an unmarshalled list of objects. My objective is to create the list of objects from the XML.
I tried using the SAX parser and it takes a good 45 seconds to parse these 3000 objects.
What are the other recommended approaches?
Try pull parsing instead, use StAX?
First search hit on comparing:
http://docs.oracle.com/cd/E17802_01/webservices/webservices/docs/1.6/tutorial/doc/SJSXP2.html
Have you profiled and seen where the bottlenecks are?
StAX is built into Java (since Java 6), but some recommend the Woodstox StAX implementation for even better performance. I have not tried it though. http://woodstox.codehaus.org/
I tried using the SAX parser and it takes a good 45 seconds to parse these 3000 objects. What are the other recommended approaches?
There are only the following options:
DOM
SAX
StAX
SAX is the fastest of the three, so if you switch to a different style, I don't think you'll get any benefit.
Unless you are doing something wrong now.
Of course there are also the marshalling/unmarshalling frameworks such as JAXB, but IMO (I haven't done any measurements) they could be slower, since they add an extra layer of abstraction on top of the XML processing.
SAX doesn't provide random access to the structure of the XML file, which is what allows it to be a relatively fast and efficient method of parsing. Because a SAX parser deals with only one element at a time, implementations can be extremely memory-efficient, often making it the natural choice for dealing with large files.
Parsing 25 MB of XML should not take 45 seconds. There is something else going on. Perhaps most of the time is spent waiting for an external DTD to be fetched from the web, I don't know. Before changing your approach, you need to understand where the costs are coming from, and therefore what part of the system will benefit from changes.
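One cheap way to test the external-DTD theory is to disable external DTD loading and re-measure. The feature URI below is the Xerces one, which the JDK's bundled parser also accepts; treat it as an assumption for other parser implementations:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Parse a document that references an external DTD, with fetching of that
// DTD switched off: if this is suddenly fast, the network was your cost.
public class NoDtdParse {
    public static boolean parseWithoutExternalDtd(String xml) throws Exception {
        SAXParserFactory f = SAXParserFactory.newInstance();
        // Xerces feature: do not download/read external DTDs when not validating.
        f.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        SAXParser p = f.newSAXParser();
        p.parse(new InputSource(new StringReader(xml)), new DefaultHandler());
        return true; // parse completed without a DTD round-trip
    }
}
```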
However, if you really do want to convert the XML into Java objects (not the application architecture I would choose, but never mind), then JAXB sounds a good bet. I haven't used JAXB much since I prefer to stick with XML-oriented languages like XSLT and XQuery, but when I did try JAXB I found it pretty fast. Of course it uses a SAX or StAX parser underneath.

Maximum size when parsing XML with DOM

Currently I'm implementing a REST client which shall parse the XML response messages. It is intended, later, to be running on an Android device. Thus, memory and processing speed is quite an issue. However there will be only one XML response at a time so processing or holding multiple XML documents at a time is not an issue.
As far as I understood there are three ways of parsing XML with the Android SDK:
SAX
XmlPullParser
DOM
Reading about these different parsing methods I got that SAX is recommended for large XML files as it won't hold the complete tree in memory like DOM.
However, I'm asking myself what is large in terms of kilobytes, megabytes, ...? Is there a practical size up to which it does not really matter whether using SAX or DOM?
Thanks,
Robert
There are no standard limits set for XML documents or DOM size so it depends entirely on what the host machine can cope with.
As you are implementing on Android you should assume a pretty limited amount of memory, and remember that the DOM, the XML parser, your program logic, the display logic, the JVM and Android itself all have to fit in the available memory.
As a rule of thumb you can expect the DOM to occupy about four times the size of the source XML document in memory. So assume 512 MB of available memory and aim to take no more than half of that for your DOM: you end up with 512/8, or a practical maximum of 64 MB for the XML document.
Just to be on the safe side I would halve that again to a 32 MB maximum. So if you expect many documents of that size, I would switch to SAX parsing.
If you want the app to respond with any speed on large documents, SAX is the way to go. A SAX parser can start returning results as soon as the first element is read; a DOM parser needs to read the whole document before any output can be sent to your program.
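The four-times rule of thumb is easy to sanity-check on your own documents by comparing heap usage before and after building the tree. This is only a rough estimate, not a benchmark; GC timing makes the number noisy, and for tiny documents it can even come out negative:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Crude estimate of how many heap bytes a DOM tree occupies, by sampling
// used heap before and after parsing. Good enough for an order of magnitude.
public class DomFootprint {
    static long usedHeap() {
        Runtime rt = Runtime.getRuntime();
        System.gc(); // request a collection so the sample is less noisy
        return rt.totalMemory() - rt.freeMemory();
    }

    public static long approxDomBytes(String xml) throws Exception {
        long before = usedHeap();
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        long after = usedHeap();
        if (doc.getDocumentElement() == null) throw new IllegalStateException();
        return after - before; // rough delta; can be negative for tiny docs
    }
}
```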
Excerpt from this article:
DOM parsers suffer from memory bloat. With smaller XML sets this isn't such an issue, but as the XML size grows DOM parsers become less and less efficient, making them not very scalable in terms of growing your XML. Push parsers are a happy medium, since they allow you to control parsing, thereby eliminating any kind of complex state management since the state is always known, and they don't suffer from the memory bloat of DOM parsers.
This could be the reason SAX is recommended over DOM: SAX functions as an XML push parser. Also, check out the Wikipedia article for SAX here.
EDIT: To address size specifically you would have to look at your implementation. An example of DOM Document object size in the memory of a Java-based XML parser is here. Java, like a lot of languages, defines some memory-based limitations such as the JVM heap size, and the Android web services/XML DOM API may also define some internal limits at the programmers' discretion (mentioned in part here). There is no one definitive answer as to maximum allowed size.
In my experience, with DOM the memory used is about 2x the file size, but of course that's just an indication. If the XML tree has just one field containing all of the data, the memory used is similar to the file size!
