XPath over multiple XML files from different URLs very slow - Java

I need to check a single node in each of 109 files, each stored at a different URL.
I use this code:
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.*;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;
import java.io.IOException;

public class XPathParserXML {
    public String version(String link, String serial) throws SAXException, IOException,
            ParserConfigurationException, XPathExpressionException {
        String url = link + serial;
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(url);
        XPath xpath = XPathFactory.newInstance().newXPath();
        XPathExpression expr = xpath.compile("//swVersion/text()");
        NodeList nodes = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);
        String version;
        if (nodes == null || nodes.getLength() == 0) {
            version = "!!WORKING!!";
        } else {
            version = nodes.item(0).getNodeValue();
        }
        return version;
    }
}
and I call the method version(link, serial) in a loop, 109 times.
My code takes about 20 seconds to process everything. Each file weighs 0.64 KB and I have a 20 MB connection.
What can I do to speed up my code?

1. Object caching:
While that's probably not the only issue, you should definitely cache and reuse all of these objects between calls to version() (see the sketch after this list):
DocumentBuilderFactory
DocumentBuilder
XPathFactory
XPathExpression
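A minimal sketch of that caching (assuming single-threaded use; none of these classes is guaranteed to be thread-safe):

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.*;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.SAXException;
import java.io.IOException;

public class XPathParserXML {
    // created once, reused across all 109 calls
    private final DocumentBuilder builder;
    private final XPathExpression expr;

    public XPathParserXML() throws ParserConfigurationException, XPathExpressionException {
        builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        expr = XPathFactory.newInstance().newXPath().compile("//swVersion/text()");
    }

    public String version(String link, String serial)
            throws SAXException, IOException, XPathExpressionException {
        builder.reset(); // allow reuse of the builder between documents
        Document doc = builder.parse(link + serial);
        Node node = (Node) expr.evaluate(doc, XPathConstants.NODE);
        return node == null ? "!!WORKING!!" : node.getNodeValue();
    }
}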
2. Circumvention of a known JAXP performance issue:
Besides, you should probably activate one of these flags:
-Dorg.apache.xml.dtm.DTMManager=org.apache.xml.dtm.ref.DTMManagerDefault
or
-Dcom.sun.org.apache.xml.internal.dtm.DTMManager=com.sun.org.apache.xml.internal.dtm.ref.DTMManagerDefault
See also this question for details:
Java XPath (Apache JAXP implementation) performance
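If you can't change the JVM launch flags, the same switch can be set programmatically, as long as it runs before the first XPath evaluation (a sketch, assuming the JDK-internal Xalan copy is the one in use):

// equivalent to the second -D flag above; must run before any XPath work
System.setProperty(
        "com.sun.org.apache.xml.internal.dtm.DTMManager",
        "com.sun.org.apache.xml.internal.dtm.ref.DTMManagerDefault");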
3. Reduce latency impact
Last but not least, you're accessing all those XML files serially over the wire. It may be useful to reduce the impact of your connection latency by parallelising access to those files, e.g. with multiple threads on the client side (see the sketch below). If you choose multi-threading, beware of thread-safety issues when caching the objects mentioned in the first section, and avoid issuing too many parallel requests at once, or you may overload the server.
Another way to reduce that impact would be to expose those XML files as a single ZIP file on the server, avoiding multiple connections and transferring all XML files at once.
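A minimal sketch of the client-side parallelism, reusing the parser class from section 1 (link and serials are assumed to be defined as in the question; the pool size of 8 is an arbitrary assumption, and each task builds its own parser so nothing shared needs to be thread-safe):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

ExecutorService pool = Executors.newFixedThreadPool(8);
List<Future<String>> results = new ArrayList<>();
for (String serial : serials) {
    results.add(pool.submit(() -> new XPathParserXML().version(link, serial)));
}
for (Future<String> f : results) {
    System.out.println(f.get()); // blocks until that download and parse finish
}
pool.shutdown();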
4. Avoid XML validation if you can trust the source
From your additional comments, I see that you're using XML validation. This is, of course, expensive and should only be done if really needed. Since you run a fairly arbitrary XPath expression, I take it that you don't care too much about XML validation. Best to turn it off!
5. If all else fails... Avoid DOM
Since (from your comments) you've measured the parsing to take up most of the CPU, you have two more options to circumvent the whole issue:
Use a SAX parser and abort parsing once you reach the //swVersion element (from your code, I'm assuming there is only one). SAX is much faster than DOM for these use cases; a sketch follows the list below.
Avoid XML entirely and search the document for a regex: <swVersion>(.*?)</swVersion>. That should only be your last resort, because it doesn't handle
namespaces
attributes
whitespace
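If you go the SAX route, here is a rough sketch of the early abort (throwing is the usual idiom for stopping a SAX parse, since the API has no cancel call):

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class SwVersionHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private boolean inSwVersion;
    String version; // filled in just before the parse is aborted

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if ("swVersion".equals(qName)) inSwVersion = true;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inSwVersion) text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        if ("swVersion".equals(qName)) {
            version = text.toString();
            throw new SAXException("found"); // aborts the rest of the document
        }
    }
}

// usage: parse, catch the "found" SAXException, then read handler.version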

Related

Transform multiple input XML documents with XSLT in a Java application using the Saxon9HE API

How can I transform multiple XML input document objects with a single XSL transformation script using the Saxon9HE processor in a Java application?
I found a way to transform multiple XML input files from the filesystem with an XSLT script here, but I can't figure out how to pass multiple loaded XML Document objects to a Java application utilizing the Saxon9HE API. For a single XML document my code looks like this and works:
Processor proc = new Processor(false);
XsltCompiler comp = proc.newXsltCompiler();
try {
    XsltExecutable exp = comp.compile(new StreamSource(stylesheetFile));
    XdmNode source = proc.newDocumentBuilder().build(new DOMSource(inputXML));
    Serializer out = proc.newSerializer();
    out.setOutputProperty(Serializer.Property.METHOD, "xml");
    out.setOutputProperty(Serializer.Property.INDENT, "yes");
    out.setOutputFile(new File(outputFilename));
    XsltTransformer trans = exp.load();
    trans.setInitialContextNode(source);
    trans.setDestination(out);
    trans.transform();
} catch (SaxonApiException e) {
    e.printStackTrace();
}
First point: avoid DOM if you can. When you are using Saxon, it's best to let Saxon build the document tree; this will be far more efficient. If you really need to use an external tree model, XOM and JDOM2 are much more efficient than DOM.
If you do want to provide a DOM as input, you have two choices: you can copy it to a Saxon tree, or you can wrap it as a Saxon tree. Use DocumentBuilder.build() in the first case, DocumentBuilder.wrap() in the second. Using build() gives you a higher initial cost, but the transformation itself is then faster.
If you want to pass pre-built trees into the transformation, declare the parameter using <xsl:param name="x" as="document-node()"/>, and then invoke the transformation using transformer.setParameter(new QName("x"), doc), where doc is an instance of XdmNode. You have to construct the XdmNode yourself using a DocumentBuilder (see the sketch below).
(Alternatively, if you want to access the documents in the stylesheet using the doc() or document() functions, you can invent a URI naming scheme and implement this in a URIResolver. When doc('my:uri') is called, your URIResolver is notified, and it should respond with a Source object. If you already have an XdmNode handy, then you can return XdmNode.asSource() to return this document tree as the result of your URIResolver.)
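A minimal sketch of the parameter approach (the stylesheet is assumed to declare <xsl:param name="x" as="document-node()"/>; file names are invented for illustration):

Processor proc = new Processor(false);
DocumentBuilder db = proc.newDocumentBuilder();
XdmNode main = db.build(new StreamSource(new File("first.xml")));
XdmNode extra = db.build(new StreamSource(new File("second.xml")));

XsltTransformer trans = proc.newXsltCompiler()
        .compile(new StreamSource(new File("transform.xsl")))
        .load();
trans.setInitialContextNode(main);
trans.setParameter(new QName("x"), extra); // available as $x in the stylesheet
Serializer out = proc.newSerializer();
out.setOutputFile(new File("result.xml"));
trans.setDestination(out);
trans.transform();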

Java parse large XML document

I'm trying to parse, and replace values in, a large XML file, ~45MB each. The way I do this is:
private void replaceData(File xmlFile, File out)
        throws TransformerException, ParserConfigurationException, SAXException, IOException
{
    try {
        DocumentBuilderFactory df = DocumentBuilderFactory.newInstance();
        DocumentBuilder db = df.newDocumentBuilder();
        Document xmlDoc = db.parse(xmlFile);
        xmlDoc.getDocumentElement().normalize();

        Node allData = xmlDoc.getElementsByTagName("Data").item(0);
        Element ctrlData = getSubElement(allData, "ctrlData");
        NodeList subData = ctrlData.getElementsByTagName("SubData");

        int len = subData.getLength();
        for (int logIndex = 0; logIndex < len; logIndex++) {
            Node log = subData.item(logIndex);
            Element info = getSubElement(log, "info");
            Element value = getSubElement(info, "dailyInfo");
            Node valueNode = value.getElementsByTagName("value").item(0);
            valueNode.setTextContent("blah");
        }

        TransformerFactory tf = TransformerFactory.newInstance();
        Transformer t = tf.newTransformer();
        DOMSource s = new DOMSource(xmlDoc);
        StreamResult r = new StreamResult(out);
        t.transform(s, r);
    } catch (TransformerException | ParserConfigurationException | SAXException | IOException e) {
        throw e;
    }
}

private static Element getSubElement(Node node, String elementName)
{
    return (Element) ((Element) node).getElementsByTagName(elementName).item(0);
}
I notice that the further along the for loop I am, the longer each iteration takes; for an average of 100k nodes it takes over 2 hours, whereas if I break the work into chunks of 1k by hand, each takes ~10s. Is there something inefficient in the way this document is being parsed?
----EDIT----
Based on comments and answers to this, I switched to using SAX and XMLStreamWriter. Reference/example here: http://www.mkyong.com/java/how-to-read-xml-file-in-java-sax-parser/
After moving to SAX, memory usage for the replaceData function no longer grows with the size of the XML file, and XML file processing time dropped to ~18 seconds on average (a rough sketch of the approach follows).
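For reference, a rough StAX sketch of the streaming rewrite (xmlFile and out are the same arguments as in the original method; attributes and namespaces are not copied here, so treat it as a starting point only):

import javax.xml.stream.*;
import java.io.FileInputStream;
import java.io.FileOutputStream;

XMLStreamReader r = XMLInputFactory.newInstance()
        .createXMLStreamReader(new FileInputStream(xmlFile));
XMLStreamWriter w = XMLOutputFactory.newInstance()
        .createXMLStreamWriter(new FileOutputStream(out), "UTF-8");

boolean inValue = false;
while (r.hasNext()) {
    switch (r.next()) {
        case XMLStreamConstants.START_ELEMENT:
            w.writeStartElement(r.getLocalName());
            inValue = "value".equals(r.getLocalName());
            break;
        case XMLStreamConstants.CHARACTERS:
            // replace the text of <value> elements, copy everything else
            w.writeCharacters(inValue ? "blah" : r.getText());
            break;
        case XMLStreamConstants.END_ELEMENT:
            w.writeEndElement();
            inValue = false;
            break;
    }
}
w.flush();
w.close();
r.close();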
As people have mentioned in the comments, loading the whole DOM into memory, especially for large XML files, can be very inefficient; a better approach is to use a SAX parser, which consumes constant memory. The drawback is that you don't get the fluent API of having the whole DOM in memory, and visibility is quite limited if you want to perform complicated callback logic on nested nodes.
If all you are interested in is parsing particular nodes and node families rather than the whole XML document, there is a better solution that gives you the best of both worlds, and it has been blogged about and open-sourced. It's basically a very light wrapper on top of a SAX parser: you register the XML elements you are interested in, and when you get the callback you have their corresponding partial DOM at your disposal to XPath over.
This way you can keep resource usage constant (scaling to XML files of over 1 GB, as documented in the blog above) while keeping the fluency of XPath-ing the DOM of the XML elements you are interested in.
Why are you doing this in Java when XSLT is designed for the task?
45 MB is a big file to hold in memory, but it's still viable. The tree models used by good XSLT processors such as Saxon are much more efficient (both in storage space and in search speed) than a general-purpose DOM (for example, because they are read-only). And XSLT has much more scope to optimize your code.
I can't reverse engineer your specification from your code, but I don't see anything in your description that is intrinsically non-linear. I don't see any reason why this should take more than 10 minutes or so in Saxon.

Resolving which version of an XML Schema to use for XML documents with a version attribute

I have to write some code to handle reading and validating XML documents that use a version attribute in their root element to declare a version number, like this:
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<Junk xmlns="urn:com:initech:tps"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="urn:com:initech.tps:schemas/foo/Junk.xsd"
      VersionAttribute="2.0">
There are a bunch of nested schemas, and my code has an org.w3c.dom.ls.LSResourceResolver to figure out which schema to use, implementing this method:
LSInput resolveResource(String type,
                        String namespaceURI,
                        String publicId,
                        String systemId,
                        String baseURI)
Previous versions of the schema embedded the schema version in the namespace, so I could use the namespaceURI and systemId to decide which schema to provide. Now the version number has been moved to an attribute in the root element, and my resolver doesn't have access to that. How am I supposed to figure out the version of the XML document in the LSResourceResolver?
I had never had to deal with schema versions before this and had no idea what was involved. When the version was part of the namespace then I could throw all the schemas in together and let them get sorted out, but with the version in the root element and namespace shared across versions there is no getting around reading the version information from the XML before starting the SAX parsing.
I'm going to do something very similar to what Pangea suggested (it gets +1 from me), but I can't follow the advice exactly because the document is too big to read into memory, even once. By using StAX I can minimize the amount of work done to get the version from the file. See the DeveloperWorks article "Screen XML documents efficiently with StAX":
The screening or classification of XML documents is a common problem,
especially in XML middleware. Routing XML documents to specific
processors may require analysis of both the document type and the
document content. The problem here is obtaining the required
information from the document with the least possible overhead.
Traditional parsers such as DOM or SAX are not well suited to this
task. DOM, for example, parses the whole document and constructs a
complete document tree in memory before it returns control to the
client. Even DOM parsers that employ deferred node expansion, and thus
are able to parse a document partially, have high resource demands
because the document tree must be at least partially constructed in
memory. This is simply not acceptable for screening purposes.
The code to get the version information will look like:
import javax.xml.stream.*

def map = [:]
def startElementCount = 0
def inputStream = new File(inputFile).newInputStream()
try {
    XMLStreamReader reader =
            XMLInputFactory.newInstance().createXMLStreamReader(inputStream)
    for (int event; (event = reader.next()) != XMLStreamConstants.END_DOCUMENT;) {
        if (event == XMLStreamConstants.START_ELEMENT) {
            if (startElementCount > 0) return map
            startElementCount += 1
            map.rootElementName = reader.localName
            for (int i = 0; i < reader.attributeCount; i++) {
                if (reader.getAttributeName(i).toString() == 'VersionAttribute') {
                    map.versionIdentifier = reader.getAttributeValue(i).toString()
                    return map
                }
            }
        }
    }
} finally {
    inputStream.close()
}
Then I can use the version information to figure out which resolver to use and which schema documents to set on the SAX parser factory.
My Suggestion
1. Parse the document using SAX or DOM
2. Get the version attribute
3. Use the Validator.validate(Source) method with the already parsed Document from step 1, as shown below
Building DOMSource from parsed document
DocumentBuilder builder = factory.newDocumentBuilder();
Document document = builder.parse(new File(args[0]));
DOMSource domSource = new DOMSource(document);
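The remaining steps might then look like this (the version-to-schema file mapping is invented for illustration):

import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.*;
import java.io.File;

// hypothetical mapping from the version attribute to a schema file
String version = document.getDocumentElement().getAttribute("VersionAttribute");
File schemaFile = new File("schemas/junk-" + version + ".xsd"); // assumed layout

SchemaFactory sf = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
Schema schema = sf.newSchema(new StreamSource(schemaFile));
schema.newValidator().validate(domSource); // validates the already-parsed document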

Are there any advantages to using an XSLT stylesheet compared to manually parsing an XML file using a DOM parser

For one of our applications, I've written a utility that uses java's DOM parser. It basically takes an XML file, parses it and then processes the data using one of the following methods to actually retrieve the data.
getElementsByTagName()
getElementAtIndex()
getFirstChild()
getNextSibling()
getTextContent()
Now I have to do the same thing, but I am wondering whether it would be better to use an XSLT stylesheet. The organisation that sends us the XML file keeps changing its schema, meaning that we have to change our code to cater for these schema changes. I'm not very familiar with XSLT processing, so I'm trying to find out whether I'm better off using XSLT stylesheets rather than "manual parsing".
The reason XSLT stylesheets look attractive is that, if the schema for the XML file changes, I think I will only need to change the stylesheet. Is this correct?
The other thing I would like to know is which of the two (XSLT transformer or DOM parser) is better performance-wise. For the manual option, I just use the DOM parser to parse the XML file. How does the XSLT transformer actually parse the file? Does it add overhead compared to manually parsing the XML file? I ask because performance is important given the nature of the data I will be processing.
Any advice?
Thanks
Edit
Basically, what I am currently doing is parsing an XML file and processing the values in some of its elements. I don't transform the XML file into any other format; I just extract some values, retrieve a row from an Oracle database, and save a new row into a different table. The XML file I parse just contains reference values I use to retrieve data from the database.
Is XSLT not suitable in this scenario? Is there a better approach I can use to avoid code changes when the schema changes?
Edit 2
Apologies for not being clear enough about what I am doing with the XML data. Basically, there is an XML file that contains some information. I extract this information from the XML file and use it to retrieve more information from a local database; the data in the XML file is more like a set of reference keys for the data I need in the database. I then take the content extracted from the XML file, plus the content retrieved from the database using a specific key from the XML file, and save that data into another database table.
The problem is that I know how to write a DOM parser to extract the information I need from the XML file, but I was wondering whether an XSLT stylesheet was a better option, as I wouldn't have to change the code if the schema changes.
Reading the responses below, it sounds like XSLT is only used for transforming an XML file into another XML file or some other format. Given that I don't intend to transform the XML file, there is probably no need to add the overhead of parsing the XSLT stylesheet as well as the XML file.
Transforming XML documents into other formats is XSLT's reason for being. You can use XSLT to output HTML, JSON, another XML document, or anything else you need. You don't specify what kind of output you want. If you're just grabbing the contents of a few elements, then maybe you won't want to bother with XSLT. For anything more, XSLT offers an elegant solution. This is primarily because XSLT understands the structure of the document it's working on. Its processing model is tree traversal and pattern matching, which is essentially what you're manually doing in Java.
You could use XSLT to transform your source data into the representation of your choice. Your code will always work on this structure. Then, when the organization you're working with changes the schema, you only have to change your XSLT to transform the new XML into your custom format. None of your other code needs to change. Why should your business logic care about the format of its source data?
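For instance, a minimal normalising stylesheet might look like this (element names are invented for illustration):

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" indent="yes"/>
    <!-- map whatever the supplier calls things onto a stable internal format -->
    <xsl:template match="/">
        <records>
            <xsl:for-each select="//Customer">
                <record key="{RefKey}" name="{Name}"/>
            </xsl:for-each>
        </records>
    </xsl:template>
</xsl:stylesheet>

When the supplier's schema changes, only the select expressions in this one file need to move.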
You are right that XSLT's processing model based on a rule-based event-driven approach makes your code more resilient to changes in the schema.
Because it's a different processing model to the procedural/navigational approach that you use with DOM, there is a learning and familiarisation curve, which some people find frustrating; if you want to go this way, be patient, because it will be a while before the ideas click into place. Once you are there, it's much easier than DOM programming.
The performance of a good XSLT processor will be good enough for your needs. It's of course possible to write very inefficient code, just as it is in any language, but I've rarely seen a system where XSLT was the bottleneck. Very often the XML parsing takes longer than the XSLT processing (and that's the same cost as with DOM or JAXB or anything else.)
As others have said, a lot depends on what you want to do with the XML data, which you haven't really explained.
I think that what you need is actually an XPath expression. You could configure that expression in some property file or whatever you use to retrieve your setup parameters.
In this way, you'd just change the XPath expression whenever your customer hides away the info you use in yet another place.
Basically, XSLT is overkill; you just need an XPath expression. A single XPath expression will let you home in on each value you are after.
Update
Since we are now talking about JDK 1.4 I've included below 3 different ways of fetching text in an XML file using XPath. (as simple as possible, no NPE guard fluff I'm afraid ;-)
Starting from the most up to date.
0. First the sample XML config file
<?xml version="1.0" encoding="UTF-8"?>
<config>
    <param id="MaxThread" desc="MaxThread" type="int">250</param>
    <param id="rTmo" desc="RespTimeout (ms)" type="int">5000</param>
</config>
1. Using JAXP 1.3 standard part of Java SE 5.0
import javax.xml.parsers.*;
import javax.xml.xpath.*;
import org.w3c.dom.Document;

public class TestXPath {

    private static final String CFG_FILE = "test.xml";
    private static final String XPATH_FOR_PRM_MaxThread = "/config/param[@id='MaxThread']/text()";

    public static void main(String[] args) {
        DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
        docFactory.setNamespaceAware(true);
        DocumentBuilder builder;
        try {
            builder = docFactory.newDocumentBuilder();
            Document doc = builder.parse(CFG_FILE);
            XPathExpression expr = XPathFactory.newInstance().newXPath().compile(XPATH_FOR_PRM_MaxThread);
            Object result = expr.evaluate(doc, XPathConstants.NUMBER);
            if (result instanceof Double) {
                System.out.println(((Double) result).intValue());
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
2. Using JAXP 1.2 standard part of Java SE 1.4.2
import javax.xml.parsers.*;
import org.apache.xpath.XPathAPI;
import org.w3c.dom.*;

public class TestXPath {

    private static final String CFG_FILE = "test.xml";
    private static final String XPATH_FOR_PRM_MaxThread = "/config/param[@id='MaxThread']/text()";

    public static void main(String[] args) {
        try {
            DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
            docFactory.setNamespaceAware(true);
            DocumentBuilder builder = docFactory.newDocumentBuilder();
            Document doc = builder.parse(CFG_FILE);
            Node param = XPathAPI.selectSingleNode(doc, XPATH_FOR_PRM_MaxThread);
            if (param instanceof Text) {
                System.out.println(Integer.decode(((Text) param).getNodeValue()));
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
3. Using JAXP 1.1 standard part of Java SE 1.4 + jdom + jaxen
You need to add these 2 jars (available from www.jdom.org - binaries, jaxen is included).
import java.io.File;
import org.jdom.*;
import org.jdom.input.SAXBuilder;
import org.jdom.xpath.XPath;

public class TestXPath {

    private static final String CFG_FILE = "test.xml";
    private static final String XPATH_FOR_PRM_MaxThread = "/config/param[@id='MaxThread']/text()";

    public static void main(String[] args) {
        try {
            SAXBuilder sxb = new SAXBuilder();
            Document doc = sxb.build(new File(CFG_FILE));
            Element root = doc.getRootElement();
            XPath xpath = XPath.newInstance(XPATH_FOR_PRM_MaxThread);
            Text param = (Text) xpath.selectSingleNode(root);
            Integer maxThread = Integer.decode(param.getText());
            System.out.println(maxThread);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Since performance is important, I would suggest using a SAX parser for this. JAXB will give you roughly the same performance as DOM parsing, plus it will be much easier to use and maintain. Handling schema changes also should not hurt you badly with JAXB: just get the new schema and regenerate the classes. If you have a bridge between JAXB and your domain logic, the changes can be absorbed in that layer without worrying about XML. I prefer treating XML as just a message used in the messaging layer; all application code should be agnostic of the XML schema. A rough sketch of the JAXB route follows.
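Here is that sketch (element and class names are invented; in practice you would generate the classes from the schema with xjc):

import javax.xml.bind.JAXBContext;
import javax.xml.bind.JAXBException;
import javax.xml.bind.annotation.*;
import java.io.File;

public class JaxbSketch {

    @XmlRootElement(name = "order")
    @XmlAccessorType(XmlAccessType.FIELD)
    static class Order {
        String customerRef; // mapped from <customerRef> by default
        int quantity;       // mapped from <quantity>
    }

    public static void main(String[] args) throws JAXBException {
        // a schema change means regenerating the classes,
        // not rewriting any parsing code
        JAXBContext ctx = JAXBContext.newInstance(Order.class);
        Order order = (Order) ctx.createUnmarshaller().unmarshal(new File("order.xml"));
        System.out.println(order.customerRef + " x " + order.quantity);
    }
}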

slow construction of tree structure from XML

I'm parsing an XML document into my own structure, but building it is very slow for large inputs. Is there a better way to do it?
public static DomTree<String> createTreeInstance(String path)
        throws ParserConfigurationException, SAXException, IOException {
    DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder db = docBuilderFactory.newDocumentBuilder();
    File f = new File(path);
    Document doc = db.parse(f);
    Node node = doc.getDocumentElement();
    DomTree<String> tree = new DomTree<String>(node);
    return tree;
}
Here is my DomTree constructor:
/**
 * Recursively builds a tree structure from a DOM object.
 * @param root the DOM node to wrap
 */
public DomTree(Node root) {
    node = root;
    NodeList children = root.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
        Node c = children.item(i);
        // only recurse into non-text nodes; building subtrees for
        // text nodes and then discarding them wastes most of the time
        if (c.getNodeType() != Node.TEXT_NODE) {
            super.children.add(new DomTree<String>(c));
        }
    }
}
UPDATE:
I have benchmarked the createTreeInstance() method using a 100MB XML file:
Creating docBuilderFactory... Done [3ms]
Creating docBuilder... Done [21ms]
parsing file... Done [5646ms]
getDocumentElement... Done [1ms]
creating DomTree... Done [17076ms]
UPDATE:
As John Doe suggests below, it may be more appropriate to use SAX. I have never used SAX before, so is there a good way to convert what I have to SAX?
If you're parsing a large XML document, you don't use DOM; you use SAX, a pull parser such as XPP3, or anything similar.
The problem is that you won't have a convenient "XML tree" in memory; you only get events and deal with them accordingly. However, it is memory-efficient, and you can map the elements to your own data structures. A sketch of building a simple tree from SAX events follows.
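A minimal sketch of that mapping, building a tree of element names straight from SAX events and skipping the intermediate DOM entirely (SimpleTree is a stand-in for the poster's DomTree):

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class SimpleTree {
    final String label;
    final List<SimpleTree> children = new ArrayList<>();
    SimpleTree(String label) { this.label = label; }
}

class TreeBuilder extends DefaultHandler {
    private final Deque<SimpleTree> stack = new ArrayDeque<>();
    SimpleTree root;

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        SimpleTree t = new SimpleTree(qName);
        if (root == null) root = t; else stack.peek().children.add(t);
        stack.push(t);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        stack.pop();
    }
}

// usage:
// TreeBuilder tb = new TreeBuilder();
// SAXParserFactory.newInstance().newSAXParser().parse(new File(path), tb);
// SimpleTree tree = tb.root;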
Have you tried profiling this? I think that may be more instructive than looking at the code. Quite often a bottleneck shows up where you'd never expect it. A simple profile (which you can do trivially in code) is to time the DOM parsing vs. your tree building.
For more in-depth profiling, JProfiler is available as an evaluation copy. Others may be able to recommend something more appropriate.
