I'm parsing an XML document into my own tree structure, but building the tree is very slow for large inputs. Is there a better way to do it?
public static DomTree<String> createTreeInstance(String path)
        throws ParserConfigurationException, SAXException, IOException {
    DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder db = docBuilderFactory.newDocumentBuilder();
    File f = new File(path);
    Document doc = db.parse(f);
    Node node = doc.getDocumentElement();
    DomTree<String> tree = new DomTree<String>(node);
    return tree;
}
Here is my DomTree constructor:
/**
 * Recursively builds a tree structure from a DOM object.
 * @param root the DOM node at the root of this subtree
 */
public DomTree(Node root){
    node = root;
    NodeList children = root.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
        Node n = children.item(i);
        // Skip text nodes up front instead of constructing a DomTree for them
        // and then throwing it away; that wastes an object (and a recursive
        // call) per text child.
        if (n.getNodeType() != Node.TEXT_NODE) {
            super.children.add(new DomTree<String>(n));
        }
    }
}
UPDATE:
I have benchmarked the createTreeInstance() method using a 100MB XML file:
Creating docBuilderFactory... Done [3ms]
Creating docBuilder... Done [21ms]
parsing file... Done [5646ms]
getDocumentElement... Done [1ms]
creating DomTree... Done [17076ms]
UPDATE:
As John Doe suggests below, it may be more appropriate to use SAX. I have never used SAX before, so is there a good way to convert what I have to use SAX?
If you're parsing a large XML document, don't use DOM; use SAX, a pull parser such as XPP3, or anything else stream-based.
The trade-off is that you won't have an "XML tree" in memory, which can be convenient; you only get events and have to deal with them as they arrive. However, it is memory-friendly, and you can map the elements onto your own data structures.
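To the update above: here is a minimal sketch of such a conversion (SimpleTree and TreeBuildingHandler are hypothetical names standing in for your DomTree). The handler keeps a stack of the currently open elements and attaches each new element to the one on top, so text nodes never become tree nodes at all:

import java.io.File;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical lightweight tree node; your DomTree could play this role.
class SimpleTree {
    final String name;
    final List<SimpleTree> children = new ArrayList<SimpleTree>();
    SimpleTree(String name) { this.name = name; }
}

class TreeBuildingHandler extends DefaultHandler {
    private final Deque<SimpleTree> stack = new ArrayDeque<SimpleTree>();
    SimpleTree root;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        SimpleTree node = new SimpleTree(qName);
        if (root == null) {
            root = node;                     // the first element becomes the root
        } else {
            stack.peek().children.add(node); // attach to the currently open parent
        }
        stack.push(node);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        stack.pop();                         // element closed; its parent is open again
    }

    static SimpleTree build(String path) throws Exception {
        TreeBuildingHandler handler = new TreeBuildingHandler();
        SAXParserFactory.newInstance().newSAXParser().parse(new File(path), handler);
        return handler.root;
    }
}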
Have you tried profiling this? That may be more instructive than looking at the code. Quite often a bottleneck shows up somewhere you'd never expect. A simple profile (which you can do trivially in code) is to time the DOM parsing vs. your tree building.
For more in-depth profiling, JProfiler is available as an evaluation copy. Others may be able to recommend something more appropriate.
I need to check just one node in each of 109 files, which are stored at 109 different URLs.
I use this code:
public class XPathParserXML {

    public String version(String link, String serial) throws SAXException, IOException,
            ParserConfigurationException, XPathExpressionException {
        String url = link + serial;
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(url);
        XPath xPath = XPathFactory.newInstance().newXPath();
        XPathExpression expr = xPath.compile("//swVersion/text()");
        NodeList nodes = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);
        // evaluate() returns an empty list rather than null when nothing matches
        if (nodes.getLength() == 0) {
            return "!!WORKING!!";
        }
        return nodes.item(0).getNodeValue();
    }
}
and I call the method version(link, serial) in a loop, 109 times.
My code takes about 20 seconds to process them all. Each file weighs 0.64 KB and I have a 20 MB connection.
What can I do to speed up my code?
1. Object caching:
While that's probably not the only issue, you should definitely cache and reuse all of the following objects between calls to version() (see the sketch after this list):
DocumentBuilderFactory
DocumentBuilder
XPathFactory
XPathExpression
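A sketch of what the cached version could look like (single-threaded use assumed, because DocumentBuilder and XPathExpression are not guaranteed to be thread-safe):

import java.io.IOException;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.*;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

public class XPathParserXML {
    // Created once, reused across all 109 calls.
    private final DocumentBuilder builder;
    private final XPathExpression expr;

    public XPathParserXML() throws ParserConfigurationException, XPathExpressionException {
        builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        expr = XPathFactory.newInstance().newXPath().compile("//swVersion/text()");
    }

    public String version(String link, String serial)
            throws SAXException, IOException, XPathExpressionException {
        builder.reset(); // return the builder to a clean state between parses
        Document doc = builder.parse(link + serial);
        NodeList nodes = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);
        return nodes.getLength() == 0 ? "!!WORKING!!" : nodes.item(0).getNodeValue();
    }
}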
2. Circumvention of a known JAXP performance issue:
Besides, you should probably activate one of these system properties:
-Dorg.apache.xml.dtm.DTMManager=org.apache.xml.dtm.ref.DTMManagerDefault
or
-Dcom.sun.org.apache.xml.internal.dtm.DTMManager=com.sun.org.apache.xml.internal.dtm.ref.DTMManagerDefault
See also this question for details:
Java XPath (Apache JAXP implementation) performance
3. Reduce latency impact
Last but not least, you're accessing all those XML files serially over the wire. It may be useful to reduce the impact of your connection latency by parallelising access to those files, e.g. with multiple threads on the client side. (If you choose multi-threading, beware of thread-safety issues when caching the objects mentioned in the first section, and avoid creating too many parallel requests at once so you don't overload the server.)
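A sketch of the threading idea, reusing the cached parser class from section 1 (one instance per task, since its internals are not thread-safe; the pool size and method names here are my own choices):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelVersions {
    public static List<String> fetchAll(final String link, List<String> serials) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(8); // modest pool: don't flood the server
        try {
            List<Future<String>> pending = new ArrayList<Future<String>>();
            for (final String serial : serials) {
                // One XPathParserXML per task keeps each thread's parser private.
                pending.add(pool.submit(() -> new XPathParserXML().version(link, serial)));
            }
            List<String> versions = new ArrayList<String>();
            for (Future<String> f : pending) {
                versions.add(f.get()); // blocks until that document has been fetched and parsed
            }
            return versions;
        } finally {
            pool.shutdown();
        }
    }
}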
Another way to reduce that impact would be to expose those XML files in a ZIP file from the server to avoid multiple connections and transfer all XML files at once.
4. Avoid XML validation if you can trust the source
From your additional comments, I see that you're using XML validation. This is, of course, expensive and should only be done if really needed. Since you run a fairly arbitrary XPath expression, I take it that you don't care too much about XML validation. Best turn it off!
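With JAXP, for instance, that's a matter of configuring the factory before creating the builder (a sketch; adapt it to however your factory is set up):

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

public class NonValidating {
    static DocumentBuilder newNonValidatingBuilder() throws ParserConfigurationException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false); // no DTD validation (this is also the JAXP default)
        factory.setSchema(null);      // detach any XML Schema
        return factory.newDocumentBuilder();
    }
}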
5. If all else fails... Avoid DOM
Since (from your comments) you've measured the parsing to take up most of the CPU, you have two more options to circumvent the whole issue:
Use a SAX parser and abort parsing once you reach the //swVersion element (from your code, I'm assuming there is only one); see the sketch after this list. SAX is much faster than DOM for these use cases.
Avoid XML entirely and search the document with a regex: <swVersion>(.*?)</swVersion>. That should only be your last resort, because it doesn't handle:
namespaces
attributes
whitespace
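A sketch of the first option (class names are mine; throwing an exception from the handler is the usual SAX idiom for stopping a parse early):

import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Carries the result out of the parser and aborts the parse.
class StopException extends SAXException {
    final String version;
    StopException(String version) { this.version = version; }
}

class SwVersionHandler extends DefaultHandler {
    private boolean inSwVersion;
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String local, String qName, Attributes attrs) {
        inSwVersion = "swVersion".equals(qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inSwVersion) text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        if ("swVersion".equals(qName)) throw new StopException(text.toString()); // found it: abort
    }

    static String versionOf(String url) throws Exception {
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(url, new SwVersionHandler());
            return "!!WORKING!!";  // reached the end without seeing the element
        } catch (StopException stop) {
            return stop.version;   // normal fast path: found it and stopped early
        }
    }
}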
I'm trying to parse and replace values in large XML files, ~45 MB each. The way I do this is:
private void replaceData(File xmlFile, File out)
        throws TransformerException, ParserConfigurationException, SAXException, IOException
{
    try {
        DocumentBuilderFactory df = DocumentBuilderFactory.newInstance();
        DocumentBuilder db = df.newDocumentBuilder();
        Document xmlDoc = db.parse(xmlFile);
        xmlDoc.getDocumentElement().normalize();

        Node allData = xmlDoc.getElementsByTagName("Data").item(0);
        Element ctrlData = getSubElement(allData, "ctrlData");
        NodeList subData = ctrlData.getElementsByTagName("SubData");
        int len = subData.getLength();
        for (int logIndex = 0; logIndex < len; logIndex++) {
            Node log = subData.item(logIndex);
            Element info = getSubElement(log, "info");
            Element value = getSubElement(info, "dailyInfo");
            Node valueNode = value.getElementsByTagName("value").item(0);
            valueNode.setTextContent("blah");
        }

        TransformerFactory tf = TransformerFactory.newInstance();
        Transformer t = tf.newTransformer();
        DOMSource s = new DOMSource(xmlDoc);
        StreamResult r = new StreamResult(out);
        t.transform(s, r);
    } catch (TransformerException | ParserConfigurationException | SAXException | IOException e) {
        throw e;
    }
}
private static Element getSubElement(Node node, String elementName)
{
    return (Element) ((Element) node).getElementsByTagName(elementName).item(0);
}
I notice that the further along the for loop I get, the longer each iteration takes; for an average of 100k nodes it takes over 2 hours, while if I break the input into 1k chunks by hand it takes ~10s. Is there something inefficient about the way this document is being parsed?
----EDIT----
Based on comments and answers to this, I switched over to using SAX and XMLStreamWriter. Reference/example here: http://www.mkyong.com/java/how-to-read-xml-file-in-java-sax-parser/
After moving to SAX, memory usage for the replaceData function no longer grows with the size of the XML file, and XML file processing time dropped to ~18 seconds on average.
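For reference, a minimal sketch of that streaming approach using StAX (XMLEventReader/XMLEventWriter): copy every event through and swap the text of each <value> element. For brevity it matches every <value>, not just those under SubData/info/dailyInfo; real code would track the enclosing path:

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import javax.xml.stream.*;
import javax.xml.stream.events.XMLEvent;

public class StreamingReplace {
    static void replaceData(File xmlFile, File out) throws Exception {
        XMLEventReader reader = XMLInputFactory.newInstance()
                .createXMLEventReader(new FileInputStream(xmlFile));
        XMLEventWriter writer = XMLOutputFactory.newInstance()
                .createXMLEventWriter(new FileOutputStream(out));
        XMLEventFactory events = XMLEventFactory.newInstance();

        boolean inValue = false;
        while (reader.hasNext()) {
            XMLEvent e = reader.nextEvent();
            if (e.isStartElement() && "value".equals(e.asStartElement().getName().getLocalPart())) {
                inValue = true;
            } else if (e.isEndElement() && "value".equals(e.asEndElement().getName().getLocalPart())) {
                inValue = false;
            }
            if (inValue && e.isCharacters()) {
                writer.add(events.createCharacters("blah")); // the replacement text
            } else {
                writer.add(e);                               // everything else passes through
            }
        }
        writer.close();
        reader.close();
    }
}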
As people have mentioned in the comments, loading the whole DOM into memory can be very inefficient, especially for large XML files, so a better approach is to use a SAX parser, which consumes constant memory. The drawback is that you lose the fluent API of having the whole DOM in memory, and visibility is quite limited if you want to perform complicated callback logic in nested nodes.
If all you are interested in doing is parsing particular nodes and node families, rather than the whole XML, there is a better solution that gives you the best of both worlds, and it has been blogged about and open-sourced. It's basically a very light wrapper on top of a SAX parser: you register the XML elements you are interested in, and when you get the callback you have their corresponding partial DOM at your disposal to XPath against.
This way you can keep your memory footprint constant (scaling to over 1 GB of XML, as documented in the blog above) while keeping the fluency of XPath-ing the DOM of just the elements you are interested in.
Why are you doing this in Java when XSLT is designed for the task?
45 MB is a big file to hold in memory, but still viable. The tree models used by good XSLT processors such as Saxon are much more efficient (both in storage space and in search speed) than a general-purpose DOM (for example, because they are read-only). And XSLT has much more scope to optimize your code.
I can't reverse-engineer your specification from your code, but I don't see anything in your description that is intrinsically non-linear, and no reason why this should take more than ten minutes or so in Saxon.
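As a sketch of that suggestion (the match pattern is my guess from the Java code above; any XSLT 1.0 processor will run it, and JAXP picks up Saxon automatically when it is registered on the classpath), the whole job is an identity transform plus one overriding template:

import java.io.File;
import java.io.StringReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XsltReplace {
    private static final String XSLT =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "  <xsl:template match='@*|node()'>" // identity: copy everything through
        + "    <xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
        + "  </xsl:template>"
        + "  <xsl:template match='SubData/info/dailyInfo/value/text()'>blah</xsl:template>"
        + "</xsl:stylesheet>";

    public static void replaceData(File xmlFile, File out) throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
        t.transform(new StreamSource(xmlFile), new StreamResult(out));
    }
}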
I have generated an XML file from an Excel database, and it automatically contains an element called "offset". To make the new file match my needs, I want to remove this element using Java.
Here is the XML content:
<Root><models><id>2</id><modelName>Baseline</modelName><domain_id>2</domain_id><description> desctiption </description><years><Y2013>value1</Y2013><Y2014>value2</Y2014><Y2015>value3</Y2015><Y2016>value4</Y2016><Y2017>value5</Y2017></years><offset/></models></Root>
I wrote code that reads the content (with a BufferedReader) and writes it to a new file, filtering with an if condition:
while (fileContent != null){
    fileContent = xmlReader.readLine();
    if (!(fileContent.equals("<offset/>"))){
        System.out.println("here is the line: " + fileContent);
        XMLFile += fileContent;
    }
}
But it does not work.
I would personally recommend using a proper XML parser like Java DOM to check and delete your nodes, rather than dealing with your XML as raw Strings (yuck). Try something like this to remove your 'offset' nodes:
File xmlFile = new File("your_xml.xml");
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(xmlFile);
NodeList nList = doc.getElementsByTagName("offset");
// getElementsByTagName returns a *live* list: removing node i shifts the
// rest down, so iterate backwards to avoid skipping every other match.
for (int i = nList.getLength() - 1; i >= 0; i--) {
    Node node = nList.item(i);
    node.getParentNode().removeChild(node);
}
The above code removes all 'offset' nodes from the XML file.
If resources/speed considerations are an issue (like when your_xml.xml is huge), you would be better off using SAX, which is faster (a little more code intensive) and doesn't store the XML in memory.
Once your Document has been edited, you'll probably want to serialize it to a String to write to your OutputStream of choice.
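For that last step, a common approach (a sketch, continuing from the doc variable above) is an identity Transformer:

import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

// Serialize the edited Document back to a String.
StringWriter sw = new StringWriter();
Transformer t = TransformerFactory.newInstance().newTransformer();
t.transform(new DOMSource(doc), new StreamResult(sw));
String result = sw.toString();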
Hope this helps.
Just try to replace the unwanted string with an empty string. It is quick... and seriously dirty. But it might solve your problem quickly, just for it to reappear later on ;-)
fileContent = fileContent.replace("<offset/>", "");
(Note that Strings are immutable, so the result of replace() must be assigned back.)
String sXML= "<Root><models><id>2</id><modelName>Baseline</modelName><domain_id>2</domain_id><description> desctiption </description><years><Y2013>value1</Y2013><Y2014>value2</Y2014><Y2015>value3</Y2015><Y2016>value4</Y2016><Y2017>value5</Y2017></years><offset/></models></Root>";
System.out.println(sXML);
String sNewXML = sXML.replace("<offset/>", "");
System.out.println(sNewXML);
String xml = "<Root><models><id>2</id><modelName>Baseline</modelName><domain_id>2</domain_id><description> desctiption </description><years><Y2013>value1</Y2013><Y2014>value2</Y2014><Y2015>value3</Y2015><Y2016>value4</Y2016><Y2017>value5</Y2017></years><offset/></models></Root>";
xml = xml.replaceAll("<offset/>", "");
System.out.println(xml);
In your original code that you included you have:
while (fileContent !=null)
Are you initializing fileContent to some non-null value before that line? If not, the code inside your while block will not run.
But I do agree with the other posters that a simple replaceAll() would be more concise, and a real XML API is better if you want to do anything more sophisticated.
I have to write some code to handle reading and validating XML documents that use a version attribute in their root element to declare a version number, like this:
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<Junk xmlns="urn:com:initech:tps"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="urn:com:initech.tps:schemas/foo/Junk.xsd"
VersionAttribute="2.0">
There are a bunch of nested schemas, and my code has an org.w3c.dom.ls.LSResourceResolver to figure out which schema to use, implementing this method:
LSInput resolveResource(String type,
String namespaceURI,
String publicId,
String systemId,
String baseURI)
Previous versions of the schema embedded the schema version in the namespace, so I could use the namespaceURI and systemId to decide which schema to provide. Now the version number has been moved to an attribute in the root element, and my resolver doesn't have access to that. How am I supposed to figure out the version of the XML document in the LSResourceResolver?
I had never had to deal with schema versions before this and had no idea what was involved. When the version was part of the namespace, I could throw all the schemas in together and let them get sorted out; but with the version in the root element and the namespace shared across versions, there is no getting around reading the version information from the XML before starting the SAX parsing.
I'm going to do something very similar to what Pangea suggested (+1 from me), but I can't follow the advice exactly, because the document is too big to read into memory even once. By using StAX I can minimize the amount of work done to get the version from the file. See the DeveloperWorks article "Screen XML documents efficiently with StAX":
The screening or classification of XML documents is a common problem,
especially in XML middleware. Routing XML documents to specific
processors may require analysis of both the document type and the
document content. The problem here is obtaining the required
information from the document with the least possible overhead.
Traditional parsers such as DOM or SAX are not well suited to this
task. DOM, for example, parses the whole document and constructs a
complete document tree in memory before it returns control to the
client. Even DOM parsers that employ deferred node expansion, and thus
are able to parse a document partially, have high resource demands
because the document tree must be at least partially constructed in
memory. This is simply not acceptable for screening purposes.
The code to get the version information will look like:
def map = [:]
def startElementCount = 0
def inputStream = new File(inputFile).newInputStream()
try {
    XMLStreamReader reader =
        XMLInputFactory.newInstance().createXMLStreamReader(inputStream)
    for (int event; (event = reader.next()) != XMLStreamConstants.END_DOCUMENT;) {
        if (event == XMLStreamConstants.START_ELEMENT) {
            // Only the root element matters; stop if we somehow get past it.
            if (startElementCount > 0) return map
            startElementCount += 1
            map.rootElementName = reader.localName
            for (int i = 0; i < reader.attributeCount; i++) {
                if (reader.getAttributeName(i).toString() == 'VersionAttribute') {
                    map.versionIdentifier = reader.getAttributeValue(i).toString()
                    return map
                }
            }
        }
    }
} finally {
    inputStream.close()
}
Then I can use the version information to figure out which resolver to use and which schema documents to set on the SAX parser factory.
My Suggestion
1. Parse the document using SAX or DOM
2. Get the version attribute
3. Use the Validator.validate(Source) method, re-using the already parsed document from step 1, as shown below

Building a DOMSource from the parsed document:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document document = builder.parse(new File(args[0]));
DOMSource domSource = new DOMSource(document);
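And a sketch of step 3, continuing from that snippet (the schema path and the version variable are hypothetical, standing in for whatever you derive in step 2):

import java.io.File;
import javax.xml.XMLConstants;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

// Pick the schema matching the version attribute read in step 2.
SchemaFactory sf = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
Schema schema = sf.newSchema(new File("schemas/Junk-" + version + ".xsd")); // hypothetical path
Validator validator = schema.newValidator();
validator.validate(domSource); // revalidates the already parsed document, no second parse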
I have an XML file which doesn't have a root node. Other than manually adding a "fake" root element, is there any way to parse such an XML file in Java? Thanks.
I suppose you could create a new implementation of InputStream that wraps the one you'll be parsing from. This implementation would return the bytes of the opening root tag before the bytes from the wrapped stream and the bytes of the closing root tag afterwards. That would be fairly simple to do.
I may be faced with this problem too. Legacy code, eh?
Ian.
Edit: You could also look at java.io.SequenceInputStream, which allows you to append streams to one another. You would need to put your prefix and suffix in byte arrays and wrap them in ByteArrayInputStreams, but it's all fairly straightforward.
Your XML document needs a root element to be considered well-formed. Without one, you will not be able to parse it with an XML parser.
One way is to provide your own dummy wrapper without touching the original (not-well-formed) 'XML', by declaring it as an external entity in a DTD:
Syntax:
<!DOCTYPE some_root_elem SYSTEM "/home/ego/some.dtd" [
  <!ENTITY entity-name "Some value to be inserted at the entity">
]>
Example:
<!DOCTYPE dummy [
<!ENTITY data SYSTEM "http://wherever-my-data-is">
]>
<dummy>
&data;
</dummy>
You could use another parser like Jsoup. It can parse XML without a root.
I think that even if some API had an option for this, it would only return the first node of the "XML" (which looks like a root) and discard the rest.
So the answer is probably to do it yourself; Scanner or StringTokenizer might do the trick.
Maybe some HTML parsers could help, since they are usually less strict.
Here's what I did:
There's the old java.io.SequenceInputStream class, which is so old that it takes an Enumeration rather than a List or such.
With it, you can prepend and append the root element tags (<div> and </div> in my case) around your root-less XML stream. (You shouldn't do it by concatenating Strings, for performance and memory reasons.)
public void tryExtractHighestHeader(ParserContext context)
{
    String xhtmlString = context.getBody();
    if (xhtmlString == null || "".equals(xhtmlString))
        return;

    // The XHTML needs to be wrapped, because it has no root element.
    ByteArrayInputStream divStart = new ByteArrayInputStream("<div>".getBytes(StandardCharsets.UTF_8));
    ByteArrayInputStream divEnd = new ByteArrayInputStream("</div>".getBytes(StandardCharsets.UTF_8));
    ByteArrayInputStream is = new ByteArrayInputStream(xhtmlString.getBytes(StandardCharsets.UTF_8));

    // IteratorEnumeration is from Apache Commons Collections; any Enumeration works.
    Enumeration<InputStream> streams = new IteratorEnumeration(
            Arrays.asList(new InputStream[]{divStart, is, divEnd}).iterator());

    try (SequenceInputStream wrapped = new SequenceInputStream(streams)) {
        DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = builderFactory.newDocumentBuilder();
        Document xmlDocument = builder.parse(wrapped);
        // From here you can do whatever you like, but keep in mind the extra <div> element.
        XPath xPath = XPathFactory.newInstance().newXPath();
    }
    catch (Exception e) {
        throw new RuntimeException("Failed parsing XML: " + e.getMessage(), e); // keep the cause
    }
}