I'm trying to parse, and replace values in, large XML files, ~45MB each. The way I do this is:
private void replaceData(File xmlFile, File out)
        throws TransformerException, ParserConfigurationException, SAXException, IOException
{
    try {
        DocumentBuilderFactory df = DocumentBuilderFactory.newInstance();
        DocumentBuilder db = df.newDocumentBuilder();
        Document xmlDoc = db.parse(xmlFile);
        xmlDoc.getDocumentElement().normalize();

        Node allData = xmlDoc.getElementsByTagName("Data").item(0);
        Element ctrlData = getSubElement(allData, "ctrlData");
        NodeList subData = ctrlData.getElementsByTagName("SubData");
        int len = subData.getLength();
        for (int logIndex = 0; logIndex < len; logIndex++) {
            Node log = subData.item(logIndex);
            Element info = getSubElement(log, "info");
            Element value = getSubElement(info, "dailyInfo");
            Node valueNode = value.getElementsByTagName("value").item(0);
            valueNode.setTextContent("blah");
        }

        TransformerFactory tf = TransformerFactory.newInstance();
        Transformer t = tf.newTransformer();
        DOMSource s = new DOMSource(xmlDoc);
        StreamResult r = new StreamResult(out);
        t.transform(s, r);
    } catch (TransformerException | ParserConfigurationException | SAXException | IOException e) {
        throw e;
    }
}

private static Element getSubElement(Node node, String elementName)
{
    return (Element) ((Element) node).getElementsByTagName(elementName).item(0);
}
I notice that the further along the for loop I get, the longer it takes; for an average of 100k nodes it takes over 2 hours, while if I just break out smaller chunks of 1k by hand, it takes ~10s. Is there something inefficient in the way this document is being parsed?
----EDIT----
Based on comments and answers to this, I switched over to using SAX and XMLStreamWriter. Reference/example here: http://www.mkyong.com/java/how-to-read-xml-file-in-java-sax-parser/
After moving to SAX, memory usage for the replaceData function no longer grows to the size of the XML file, and XML file processing time dropped to ~18 seconds on average.
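For reference, a minimal sketch of what that streaming rewrite can look like, assuming StAX end to end (XMLStreamReader feeding XMLStreamWriter) and the element names from the code above; it is not the actual code, and namespace/attribute handling is simplified:

import java.io.*;
import javax.xml.stream.*;

// A sketch only: streams events from the input file to the output file,
// swapping the text of <value> elements. Namespaces, DTDs, comments, and
// processing instructions are ignored for brevity.
public class StreamingReplace {
    public static void replaceData(File xmlFile, File out) throws IOException, XMLStreamException {
        try (InputStream in = new FileInputStream(xmlFile);
             OutputStream os = new FileOutputStream(out)) {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
            XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(os, "UTF-8");
            boolean inValue = false;
            writer.writeStartDocument();
            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        writer.writeStartElement(reader.getLocalName());
                        for (int i = 0; i < reader.getAttributeCount(); i++) {
                            writer.writeAttribute(reader.getAttributeLocalName(i),
                                    reader.getAttributeValue(i));
                        }
                        inValue = "value".equals(reader.getLocalName());
                        break;
                    case XMLStreamConstants.CHARACTERS:
                        // Replace text only inside <value>; copy everything else verbatim.
                        writer.writeCharacters(inValue ? "blah" : reader.getText());
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        writer.writeEndElement();
                        inValue = false;
                        break;
                }
            }
            writer.writeEndDocument();
            writer.flush();
        }
    }
}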
As people have mentioned in the comments, loading the whole DOM into memory, especially for large XML files, can be very inefficient, so a better approach is to use a SAX parser, which consumes constant memory. The drawback is that you don't get the fluent API of having the whole DOM in memory, and visibility is quite limited if you want to perform complicated callback logic in nested nodes.
If all you are interested in doing is parsing particular nodes and node families, rather than the whole XML document, there is a better solution that gives you the best of both worlds; it has been blogged about and open-sourced. It's basically a very light wrapper on top of a SAX parser in which you register the XML elements you are interested in, and on each callback you have their corresponding partial DOM at your disposal to XPath over.
This way you can keep your complexity constant (scaling to over 1GB of XML, as documented in the blog above) while maintaining the fluency of XPath-ing the DOM of the XML elements you are interested in.
Why are you doing this in Java when XSLT is designed for the task?
45MB is a big file to hold in memory, but still viable. The tree models used by good XSLT processors such as Saxon are much more efficient (both in storage space and in search speed) than a general-purpose DOM (for example, because they are read-only). And XSLT has much more scope to optimize your code.
I can't reverse-engineer your specification from your code, but I don't see anything in your description that is intrinsically non-linear. I don't see any reason why this should take more than ten minutes or so in Saxon.
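For illustration only, a sketch of what the XSLT approach might look like, run from Java via JAXP; the match pattern is guessed from the question's code, and this stylesheet only needs XSLT 1.0 features (use Saxon for 2.0):

import java.io.*;
import javax.xml.transform.*;
import javax.xml.transform.stream.*;

// A sketch, not a tested solution: an identity transform plus one template
// that rewrites the <value> elements. Element names are assumed from the
// question's Java code.
public class XsltReplace {
    private static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "  <xsl:template match='@*|node()'>"
      + "    <xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "  </xsl:template>"
      + "  <xsl:template match='SubData/info/dailyInfo/value'>"
      + "    <value>blah</value>"
      + "  </xsl:template>"
      + "</xsl:stylesheet>";

    public static void main(String[] args) throws TransformerException {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        t.transform(new StreamSource(new File(args[0])),
                    new StreamResult(new File(args[1])));
    }
}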
Related
I need to check only one node from each file (109 files); they are stored at different URLs (109 URLs).
I use this code:
public class XPathParserXML {

    public String version(String link, String serial) throws SAXException, IOException,
            ParserConfigurationException, XPathExpressionException {
        String url = link + serial;
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(url);
        XPath xPath = XPathFactory.newInstance().newXPath();
        XPathExpression expr = xPath.compile("//swVersion/text()");
        NodeList nodes = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);
        String version;
        if (nodes.getLength() == 0) {
            version = "!!WORKING!!";
        } else {
            version = nodes.item(0).getNodeValue();
        }
        return version;
    }
}
and I call the method version(link, serial) in a loop 109 times.
My code takes about 20 seconds to process them all. Each file weighs 0.64KB and I have a 20MB connection.
What can I do to speed up my code?
1. Object caching:
While that's probably not the only issue, you should definitely cache and reuse all of these objects between calls to version(), as sketched after the list:
DocumentBuilderFactory
DocumentBuilder
XPathFactory
XPathExpression
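A sketch of what that caching might look like (single-threaded use assumed, since none of these classes is guaranteed thread-safe):

import java.io.IOException;
import javax.xml.parsers.*;
import javax.xml.xpath.*;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// A sketch: the expensive JAXP objects are created once and reused for all
// 109 calls instead of being rebuilt on every invocation.
public class XPathParserXML {
    private final DocumentBuilder builder;
    private final XPathExpression expr;

    public XPathParserXML() throws ParserConfigurationException, XPathExpressionException {
        builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        expr = XPathFactory.newInstance().newXPath().compile("//swVersion/text()");
    }

    public String version(String link, String serial)
            throws SAXException, IOException, XPathExpressionException {
        builder.reset(); // required before reusing a DocumentBuilder
        Document doc = builder.parse(link + serial);
        NodeList nodes = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);
        return nodes.getLength() == 0 ? "!!WORKING!!" : nodes.item(0).getNodeValue();
    }
}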
2. Circumvention of a known JAXP performance issue:
Besides, you should probably activate one of these flags:
-Dorg.apache.xml.dtm.DTMManager=
org.apache.xml.dtm.ref.DTMManagerDefault
or
-Dcom.sun.org.apache.xml.internal.dtm.DTMManager=
com.sun.org.apache.xml.internal.dtm.ref.DTMManagerDefault
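The same flag can also be set programmatically at startup, before the first use of XPath, which is equivalent to the -D switch; a one-line sketch:

// Equivalent to the -D command-line flag shown above:
System.setProperty("org.apache.xml.dtm.DTMManager",
        "org.apache.xml.dtm.ref.DTMManagerDefault");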
See also this question for details:
Java XPath (Apache JAXP implementation) performance
3. Reduce latency impact
Last but not least, you're serially accessing all those XML files over the wire. It may be useful to reduce the impact of your connection latency by parallelising access to those files, e.g. by using multiple threads on the client side, as sketched below. (If you choose multi-threading, beware of thread-safety issues when caching the objects mentioned in the first section, and avoid creating too many parallel requests at once so you don't overwhelm the server.)
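A sketch of that parallelisation with a deliberately small fixed thread pool, reusing the XPathParserXML sketch from section 1 with one instance per task to sidestep the thread-safety issue; link and serials are assumed inputs:

import java.util.*;
import java.util.concurrent.*;

// A sketch: fetches all serials in parallel, with a modest pool size so the
// server isn't flooded with simultaneous requests.
public class ParallelVersions {
    public static Map<String, String> fetchAll(String link, List<String> serials)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        try {
            Map<String, Future<String>> futures = new LinkedHashMap<>();
            for (String serial : serials) {
                // One parser instance per task: the cached JAXP objects are not thread-safe.
                futures.put(serial, pool.submit(() -> new XPathParserXML().version(link, serial)));
            }
            Map<String, String> versions = new LinkedHashMap<>();
            for (Map.Entry<String, Future<String>> e : futures.entrySet()) {
                versions.put(e.getKey(), e.getValue().get());
            }
            return versions;
        } finally {
            pool.shutdown();
        }
    }
}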
Another way to reduce that impact would be to expose those XML files in a ZIP file from the server to avoid multiple connections and transfer all XML files at once.
4. Avoid XML validation if you can trust the source
From your additional comments, I see that you're using XML validation. This is, of course, expensive and should only be done if really needed. Since you run a very arbitrary XPath expression, I take it that you don't care too much about XML validation. Best turn it off!
5. If all else fails... Avoid DOM
Since (from your comments) you've measured the parsing to take up most of the CPU, you have two more options to circumvent the whole issue:
Use a SAX parser and abort parsing once you reach the //swVersion element (from your code, I'm assuming there is only one). SAX is much faster than DOM for these use cases; a sketch follows the list below.
Avoid XML entirely and search the document with a regex: <swVersion>(.*?)</swVersion>. That should only be your last resort, because it doesn't handle:
namespaces
attributes
whitespace
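A sketch of the SAX early-abort option; since SAX offers no "stop" call, the usual idiom is to throw a dedicated exception from the handler and catch it at the call site:

import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.*;
import org.xml.sax.helpers.DefaultHandler;

// A sketch: parsing stops (by exception) as soon as </swVersion> is reached,
// so the rest of the document is never touched.
public class SwVersionReader {
    static class StopParsing extends SAXException {
        final String version;
        StopParsing(String version) { this.version = version; }
    }

    public static String version(String url) throws Exception {
        DefaultHandler handler = new DefaultHandler() {
            private boolean inSwVersion;
            private final StringBuilder text = new StringBuilder();

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                inSwVersion = "swVersion".equals(qName) || "swVersion".equals(local);
            }

            @Override
            public void characters(char[] ch, int start, int len) {
                if (inSwVersion) text.append(ch, start, len);
            }

            @Override
            public void endElement(String uri, String local, String qName) throws SAXException {
                if (inSwVersion) throw new StopParsing(text.toString());
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(url, handler);
            return "!!WORKING!!"; // no <swVersion> found
        } catch (StopParsing stop) {
            return stop.version;
        }
    }
}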
I have to write some code to handle reading and validating XML documents that use a version attribute in their root element to declare a version number, like this:
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<Junk xmlns="urn:com:initech:tps"
xmlns:xsi="http://www3.org/2001/XMLSchema-instance"
xsi:schemaLocation="urn:com:initech.tps:schemas/foo/Junk.xsd"
VersionAttribute="2.0">
There are a bunch of nested schemas; my code has an org.w3c.dom.ls.LSResourceResolver to figure out which schema to use, implementing this method:
LSInput resolveResource(String type,
String namespaceURI,
String publicId,
String systemId,
String baseURI)
Previous versions of the schema have embedded the schema version into the namespace, so I could use the namespaceURI and systemId to decide which schema to provide. Now the version number has been switched to an attribute in the root element, and my resolver doesn't have access to that. How am I supposed to figure out the version of the XML document in the LSResourceResolver?
I had never had to deal with schema versions before this and had no idea what was involved. When the version was part of the namespace, I could throw all the schemas in together and let them get sorted out, but with the version in the root element and the namespace shared across versions, there is no getting around reading the version information from the XML before starting the SAX parsing.
I'm going to do something very similar to what Pangea suggested (gets +1 from me), but I can't follow the advice exactly because the document is too big to read it all into memory, even once. By using StAX I can minimize the amount of work done to get the version from the file. See the DeveloperWorks article "Screen XML documents efficiently with StAX":
The screening or classification of XML documents is a common problem, especially in XML middleware. Routing XML documents to specific processors may require analysis of both the document type and the document content. The problem here is obtaining the required information from the document with the least possible overhead. Traditional parsers such as DOM or SAX are not well suited to this task. DOM, for example, parses the whole document and constructs a complete document tree in memory before it returns control to the client. Even DOM parsers that employ deferred node expansion, and thus are able to parse a document partially, have high resource demands because the document tree must be at least partially constructed in memory. This is simply not acceptable for screening purposes.
The code to get the version information will look like:
// Groovy; javax.xml.stream (StAX) classes assumed imported
def map = [:]
def startElementCount = 0
def inputStream = new File(inputFile).newInputStream()
try {
    XMLStreamReader reader =
            XMLInputFactory.newInstance().createXMLStreamReader(inputStream)
    for (int event; (event = reader.next()) != XMLStreamConstants.END_DOCUMENT;) {
        if (event == XMLStreamConstants.START_ELEMENT) {
            // Only the root element matters; bail out once it has been seen.
            if (startElementCount > 0) return map
            startElementCount += 1
            map.rootElementName = reader.localName
            for (int i = 0; i < reader.attributeCount; i++) {
                if (reader.getAttributeName(i).toString() == 'VersionAttribute') {
                    map.versionIdentifier = reader.getAttributeValue(i).toString()
                    return map
                }
            }
        }
    }
} finally {
    inputStream.close()
}
Then I can use the version information to figure out which resolver to use and which schema documents to set on the SAX factory.
My Suggestion
Parse the Document using SAX or DOM
Get the version attribute
Use the Validator.validate(Source) method and use the already parsed Document (from step 1) as shown below
Building DOMSource from parsed document
DocumentBuilder builder = factory.newDocumentBuilder();
Document document = builder.parse(new File(args[0]));
DOMSource domSource = new DOMSource(document);
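Completing the sketch with step 3, where the already-parsed document is validated against the schema chosen from the version attribute (the schema path here is illustrative):

import java.io.File;
import javax.xml.XMLConstants;
import javax.xml.validation.*;

// Validate the DOM that was already parsed above; no second parse needed.
SchemaFactory schemaFactory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
Schema schema = schemaFactory.newSchema(new File("schemas/foo/Junk-2.0.xsd")); // illustrative path
Validator validator = schema.newValidator();
validator.validate(domSource);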
I am getting a StackOverflowError while converting org.w3c.dom.Document to org.dom4j.Document.
Code:
public static org.dom4j.Document getDom4jDocument(Document w3cDocument)
{
    org.dom4j.Document dom4jDocument = null;
    DOMReader xmlReader = null;
    try {
        xmlReader = new DOMReader();
        dom4jDocument = xmlReader.read(w3cDocument);
    } catch (Exception e) {
        System.out.println("General Exception :- " + e.getMessage());
    }
    return dom4jDocument;
}
Log:
java.lang.StackOverflowError
at java.lang.String.indexOf(String.java:1564)
at java.lang.String.indexOf(String.java:1546)
at org.dom4j.tree.NamespaceStack.getQName(NamespaceStack.java:158)
at org.dom4j.io.DOMReader.readElement(DOMReader.java:184)
at org.dom4j.io.DOMReader.readTree(DOMReader.java:93)
at org.dom4j.io.DOMReader.readElement(DOMReader.java:226)
at org.dom4j.io.DOMReader.readTree(DOMReader.java:93)
at org.dom4j.io.DOMReader.readElement(DOMReader.java:226)
Actually I want to convert the XML to a String using org.dom4j.Document's asXML() method. Is this conversion possible without converting org.w3c.dom.Document to org.dom4j.Document? How?
When handling a heavy file, you shouldn't use a DOM reader but a SAX one. I assume your goal is to output your document to a string.
Here are some differences between SAX and DOM (source):
SAX
Parses node by node
Doesn't store the XML in memory
We can't insert or delete a node
SAX is an event-based parser
SAX is a Simple API for XML
Doesn't preserve comments
SAX generally runs a little faster than DOM
DOM
Stores the entire XML document in memory before processing
Occupies more memory
We can insert or delete nodes
Traverses in any direction
DOM is a tree-model parser
Document Object Model (DOM) API
Preserves comments
You don't need to produce a model that will take a lot of memory space. You only need to crawl through the nodes and output them one by one.
Here you will find some code to start with; then you should implement a tree-traversal algorithm.
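That said, if (per the question) the only goal is the XML as a String, the dom4j round-trip can be skipped entirely; a sketch using the JAXP identity transformer to serialize the org.w3c.dom.Document directly:

import java.io.StringWriter;
import javax.xml.transform.*;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

// A sketch: an identity transform writes the existing DOM out as text,
// without building a second document model via dom4j's DOMReader.
public class XMLUtility {
    public static String asXML(Document w3cDocument) throws TransformerException {
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        StringWriter out = new StringWriter();
        identity.transform(new DOMSource(w3cDocument), new StreamResult(out));
        return out.toString();
    }
}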
Take a look at java.lang.StackOverflowError in dom parser. Apparently trying to load a huge XML file into a String can result in a StackOverflowError. I think it's because the parser uses regexes to find the start and end of tags, which involves recursive calls for long Strings, as described in java.lang.StackOverflowError while using a RegEx to Parse big strings.
You can try splitting up the XML document and parsing the sections separately to see if that helps.
I have a series of XSL 2.0 stylesheets that feed into each other, i.e. the output of stylesheet A feeds B feeds C.
What is the most efficient way of doing this? The question rephrased is: how can one efficiently route the output of one transformation into another?
Here's my first attempt:
@Override
public void transform(Source data, Result out) throws TransformerException {
    for (Transformer autobot : autobots) {
        if (autobots.indexOf(autobot) != (autobots.size() - 1)) {
            log.debug("Transforming prelim stylesheet...");
            data = transform(autobot, data);
        } else {
            log.debug("Transforming final stylesheet...");
            autobot.transform(data, out);
        }
    }
}

private Source transform(Transformer autobot, Source data) throws TransformerException {
    DOMResult result = new DOMResult();
    autobot.transform(data, result);
    Node node = result.getNode();
    return new DOMSource(node);
}
As you can see, I'm using a DOM to sit in between transformations, and although it is convenient, it's non-optimal performance wise.
Is there an easy way to, say, route a SAXResult into a SAXSource? A StAX solution would be another option.
I'm aware of projects like XProc (which is very cool if you haven't looked at it yet), but I didn't want to invest in a whole framework.
I found this: "#3. Chaining Transformations", which shows two ways to use the TransformerFactory to chain transformations, with the results of one transform feeding the next and finally outputting to System.out. This avoids the need for an intermediate serialization to a String, file, etc. between transforms.
When multiple, successive transformations are required to the same XML document, be sure to avoid unnecessary parsing operations. I frequently run into code that transforms a String to another String, then transforms that String to yet another String. Not only is this slow, but it can consume a significant amount of memory as well, especially if the intermediate Strings aren't allowed to be garbage collected.
Most transformations are based on a series of SAX events. A SAX parser will typically parse an InputStream or another InputSource into SAX events, which can then be fed to a Transformer. Rather than having the Transformer output to a File, String, or another such Result, a SAXResult can be used instead. A SAXResult accepts a ContentHandler, which can pass these SAX events directly to another Transformer, etc.
Here is one approach, and the one I usually prefer as it provides more flexibility for various input and output sources. It also makes it fairly easy to create a transformation chain dynamically and with a variable number of transformations.
SAXTransformerFactory stf = (SAXTransformerFactory) TransformerFactory.newInstance();
// These templates objects could be reused and obtained from elsewhere.
Templates templates1 = stf.newTemplates(new StreamSource(
        getClass().getResourceAsStream("MyStylesheet1.xslt")));
Templates templates2 = stf.newTemplates(new StreamSource(
        getClass().getResourceAsStream("MyStylesheet2.xslt")));

TransformerHandler th1 = stf.newTransformerHandler(templates1);
TransformerHandler th2 = stf.newTransformerHandler(templates2);

th1.setResult(new SAXResult(th2));
th2.setResult(new StreamResult(System.out));

Transformer t = stf.newTransformer();
t.transform(new StreamSource(System.in), new SAXResult(th1));
// th1 feeds th2, which in turn feeds System.out.
The related question Efficient XSLT pipeline, with params, in Java clarified how to pass parameters correctly to such a transformer chain.
It also gave a hint towards a slightly shorter solution without the third transformer:
SAXTransformerFactory stf = (SAXTransformerFactory) TransformerFactory.newInstance();
Templates templates1 = stf.newTemplates(new StreamSource(
        getClass().getResourceAsStream("MyStylesheet1.xslt")));
Templates templates2 = stf.newTemplates(new StreamSource(
        getClass().getResourceAsStream("MyStylesheet2.xslt")));

TransformerHandler th1 = stf.newTransformerHandler(templates1);
TransformerHandler th2 = stf.newTransformerHandler(templates2);

th2.setResult(new StreamResult(System.out));
// Note that indent, etc. should be applied to the last transformer in the chain:
th2.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
th1.getTransformer().transform(new StreamSource(System.in), new SAXResult(th2));
Your best bet is to stick with DOM, as you're doing, because an XSLT processor would have to build a tree anyway. Streaming is only an option for a very limited category of transforms, and few if any processors can figure that out automatically and switch to a streaming-only implementation; otherwise they just read the input and build the tree.
I'm parsing an XML document into my own structure, but building it is very slow for large inputs. Is there a better way to do it?
public static DomTree<String> createTreeInstance(String path)
        throws ParserConfigurationException, SAXException, IOException {
    DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder db = docBuilderFactory.newDocumentBuilder();
    File f = new File(path);
    Document doc = db.parse(f);
    Node node = doc.getDocumentElement();
    DomTree<String> tree = new DomTree<String>(node);
    return tree;
}
Here is my DomTree constructor:
/**
 * Recursively builds a tree structure from a DOM object.
 * @param root
 */
public DomTree(Node root) {
    node = root;
    NodeList children = root.getChildNodes();
    DomTree<String> child = null;
    for (int i = 0; i < children.getLength(); i++) {
        child = new DomTree<String>(children.item(i));
        if (children.item(i).getNodeType() != Node.TEXT_NODE) {
            super.children.add(child);
        }
    }
}
UPDATE:
I have benchmarked the createTreeInstance() method using a 100MB XML file:
Creating docBuilderFactory... Done [3ms]
Creating docBuilder... Done [21ms]
parsing file... Done [5646ms]
getDocumentElement... Done [1ms]
creating DomTree... Done [17076ms]
UPDATE:
As John Doe suggests below, it may be more appropriate to use SAX. I have never used SAX before, so is there a good way to convert what I have to using SAX?
If you're parsing a large XML file, you don't use DOM; you use SAX, a pull parser such as XPP3, or anything else.
The problem is that you won't have an "XML tree" in memory, which might be convenient; you only get events and deal with them accordingly. However, it is memory-efficient, and you can map the elements to your own data structures, as sketched below.
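A sketch of that mapping: a DefaultHandler keeps a stack of partially built nodes and assembles the tree in one pass, with no DOM in between. TreeNode here is a hypothetical stand-in for the DomTree class (and, like the original, it skips text nodes):

import java.io.File;
import java.util.*;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// A sketch: TreeNode is a hypothetical stand-in for the DomTree class.
class TreeNode {
    final String name;
    final List<TreeNode> children = new ArrayList<TreeNode>();
    TreeNode(String name) { this.name = name; }
}

class TreeBuilder extends DefaultHandler {
    private final Deque<TreeNode> stack = new ArrayDeque<TreeNode>();
    TreeNode root;

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        // Attach each new element to its parent (the top of the stack).
        TreeNode node = new TreeNode(qName);
        if (stack.isEmpty()) {
            root = node;
        } else {
            stack.peek().children.add(node);
        }
        stack.push(node);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        stack.pop();
    }

    public static TreeNode build(String path) throws Exception {
        TreeBuilder builder = new TreeBuilder();
        SAXParserFactory.newInstance().newSAXParser().parse(new File(path), builder);
        return builder.root;
    }
}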
Have you tried profiling this? That may be more instructive than looking at the code; quite often a bottleneck shows up where you'd never expect it. A simple profile (which you can do trivially in code) is to time the DOM parsing vs. your tree building.
For more in-depth profiling, JProfiler is available as an evaluation copy. Others may be able to recommend something more appropriate.