Java XML library that preserves attribute order

I am writing a Java program that reads an XML file, makes some modifications, and writes back the XML.
Using the standard Java XML DOM API, the order of the attributes is not preserved.
That is, if I have an input file such as:
<person first_name="john" last_name="lederrey"/>
I might get an output file as:
<person last_name="lederrey" first_name="john"/>
That's correct, because the XML specification says that attribute order is not significant.
However, my program needs to preserve the order of the attributes, so that a person can easily compare the input and output document with a diff tool.
One solution for that is to process the document with SAX (instead of DOM):
Order of XML attributes after DOM processing
However, this does not work for my case,
because the transformation I need to do in one node might depend on an XPath expression on the whole document.
So, the simplest thing would be to have an XML library very similar to the standard Java DOM library, with the exception that it preserves the attribute order.
Is there such a library?
PS: Please avoid discussing whether I should preserve the attribute order or not. This is a very interesting discussion, but it is not the point of this question.

Saxon these days offers a serialization option[1] to control the order in which attributes are output. It doesn't retain the input order (because Saxon doesn't know the input order), but it does allow you to control, for example, that the ID attribute always appears first.
And this can be very useful if the XML is going to be hand-edited; XML in which the attributes appear in the "wrong" order can be very disorienting to a human reader or editor.
If you're using this as part of a diff process then you would want to put both files through a process that normalizes the attribute order before comparing them. However, for comparing files my preferred approach is to parse them both and use the XPath deep-equal() function; or to use a specialized tool like DeltaXML.
[1] saxon:attribute-order - see http://www.saxonica.com/documentation/index.html#!extensions/output-extras/serialization-parameters
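For the comparison use case, here is a minimal sketch that stays within the JDK. It does not use XPath deep-equal() (that needs an XPath 2.0 processor such as Saxon); instead it swaps in DOM Level 3 Node.isEqualNode(), which compares attribute maps without regard to order. The class and method names are made up for illustration, and whitespace-only text nodes still count as differences.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Saxon or any library mentioned above.
final class AttributeOrderCompare {

    /** Returns true when both root elements are equal, ignoring attribute order. */
    static boolean sameIgnoringAttributeOrder(String left, String right) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document a = builder.parse(new InputSource(new StringReader(left)));
            Document b = builder.parse(new InputSource(new StringReader(right)));
            // isEqualNode treats the attribute maps as unordered collections.
            return a.getDocumentElement().isEqualNode(b.getDocumentElement());
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String in = "<person first_name=\"john\" last_name=\"lederrey\"/>";
        String out = "<person last_name=\"lederrey\" first_name=\"john\"/>";
        System.out.println(sameIgnoringAttributeOrder(in, out));
    }
}
```

This only tells you whether the documents differ, not where; for a readable diff you would still normalize both files as described above.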

You might also want to try DecentXML, as it can preserve the attribute order, comments and even indentation.
It is very nice if you need to programmatically update an XML file that's also supposed to be human-editable. We use it for one of our configuration tools.
-- edit --
It seems it is no longer available at its original location; try these ones:
https://github.com/cartermckinnon/decentxml
https://github.com/haroldo-ok/decentxml (unofficial and unmaintained fork; kept here just in case the other forks disappear, too)
https://directory.fsf.org/wiki/DecentXML

Do it twice:
Read the document in using a DOM parser so you have references, a repository, if you will.
Then read it again using SAX. At the point where you need to make the transformation, reference the DOM version to determine what you need, then output what you need in the middle of the SAX stream.
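A rough sketch of this two-pass idea, under a few assumptions: it uses the StAX cursor API for the streaming pass instead of SAX (simply because copying is shorter to write), it ignores namespaces, comments and processing instructions, and the element name "people" and the added attribute "total" are invented for the example. Attribute order in the streaming pass is whatever the parser reports by index, which is document order in the JDK's bundled parser but is not guaranteed by the specification.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical example class; names are illustrative only.
final class TwoPassRewrite {

    static String rewrite(String xml) {
        try {
            // Pass 1: build a DOM copy that is only used for whole-document XPath lookups.
            Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            String total = XPathFactory.newInstance().newXPath()
                    .evaluate("count(//person)", dom);

            // Pass 2: stream the same input and copy it, attribute by attribute.
            XMLStreamReader in = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            StringWriter out = new StringWriter();
            XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            while (in.hasNext()) {
                switch (in.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        writer.writeStartElement(in.getLocalName());
                        for (int i = 0; i < in.getAttributeCount(); i++) {
                            writer.writeAttribute(in.getAttributeLocalName(i), in.getAttributeValue(i));
                        }
                        // The transformation: consult the value computed from the DOM pass.
                        if (in.getLocalName().equals("people")) {
                            writer.writeAttribute("total", total);
                        }
                        break;
                    case XMLStreamConstants.CHARACTERS:
                        writer.writeCharacters(in.getText());
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        writer.writeEndElement();
                        break;
                    default:
                        break; // comments, processing instructions etc. are dropped in this sketch
                }
            }
            writer.flush();
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("rewrite failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rewrite(
                "<people><person first_name=\"john\" last_name=\"lederrey\"/></people>"));
    }
}
```

The cost is parsing the file twice and holding one DOM copy in memory, so it only helps when the DOM is needed for lookups rather than for output.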

Your best bet would be to use StAX instead of DOM for generating the original document. StAX gives you a lot of fine control over these things and lets you stream output progressively to an output stream instead of holding it all in memory.
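As a small illustration of that control, here is a sketch using XMLStreamWriter, which emits attributes in the order you call writeAttribute. The class and method names are invented for the example; the element and attribute names come from the question.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical example class; not taken from any library.
final class StaxWriteOrder {

    static String person(String firstName, String lastName) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            writer.writeEmptyElement("person");
            // Attributes are serialized in call order.
            writer.writeAttribute("first_name", firstName);
            writer.writeAttribute("last_name", lastName);
            writer.writeEndDocument();
            writer.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not write XML", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(person("john", "lederrey"));
    }
}
```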

We had similar requirements per Dave's description. A solution that worked was based on Java reflection.
The idea is to set the propOrder for the attributes at runtime. In our case there's an APP_DATA element containing three attributes: app, name, and value. The generated AppData class includes "content" in propOrder and none of the other attributes:
@XmlAccessorType(XmlAccessType.FIELD)
@XmlType(name = "AppData", propOrder = {
"content"
})
public class AppData {
@XmlValue
protected String content;
@XmlAttribute(name = "Value", required = true)
protected String value;
@XmlAttribute(name = "Name", required = true)
protected String name;
@XmlAttribute(name = "App", required = true)
protected String app;
...
}
So Java reflection was used as follows to set the order at runtime:
final String[] propOrder = { "app", "name", "value" };
ReflectionUtil.changeAnnotationValue(
AppData.class.getAnnotation(XmlType.class),
"propOrder", propOrder);
final JAXBContext jaxbContext = JAXBContext
.newInstance(ADI.class);
final Marshaller adimarshaller = jaxbContext.createMarshaller();
adimarshaller.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT,
true);
adimarshaller.marshal(new JAXBElement<ADI>(new QName("ADI"),
ADI.class, adi),
new StreamResult(fileOutputStream));
The changeAnnotationValue() was borrowed from this post:
Modify a class definition's annotation string parameter at runtime
Here's the method for your convenience (credit goes to @assylias and @Balder):
/**
* Changes the annotation value for the given key of the given annotation to newValue and returns
* the previous value.
*/
@SuppressWarnings("unchecked")
public static Object changeAnnotationValue(Annotation annotation, String key, Object newValue) {
Object handler = Proxy.getInvocationHandler(annotation);
Field f;
try {
f = handler.getClass().getDeclaredField("memberValues");
} catch (NoSuchFieldException | SecurityException e) {
throw new IllegalStateException(e);
}
f.setAccessible(true);
Map<String, Object> memberValues;
try {
memberValues = (Map<String, Object>) f.get(handler);
} catch (IllegalArgumentException | IllegalAccessException e) {
throw new IllegalStateException(e);
}
Object oldValue = memberValues.get(key);
if (oldValue == null || oldValue.getClass() != newValue.getClass()) {
throw new IllegalArgumentException();
}
memberValues.put(key, newValue);
return oldValue;
}

You may override AttributeSortedMap and sort attributes as you need...
The main idea: load the document, recursively copy to elements that support sorted attributeMap and serialize using the existing XMLSerializer.
File test.xml
<root>
<person first_name="john1" last_name="lederrey1"/>
<person first_name="john2" last_name="lederrey2"/>
<person first_name="john3" last_name="lederrey3"/>
<person first_name="john4" last_name="lederrey4"/>
</root>
File AttOrderSorter.java
import com.sun.org.apache.xerces.internal.dom.AttrImpl;
import com.sun.org.apache.xerces.internal.dom.AttributeMap;
import com.sun.org.apache.xerces.internal.dom.CoreDocumentImpl;
import com.sun.org.apache.xerces.internal.dom.ElementImpl;
import com.sun.org.apache.xml.internal.serialize.OutputFormat;
import com.sun.org.apache.xml.internal.serialize.XMLSerializer;
import org.w3c.dom.*;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.Writer;
import java.util.List;
import static java.util.Arrays.asList;
public class AttOrderSorter {
private List<String> sortAtts = asList("last_name", "first_name");
public void format(String inFile, String outFile) throws Exception {
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = dbFactory.newDocumentBuilder();
Document outDocument = builder.newDocument();
try (FileInputStream inputStream = new FileInputStream(inFile)) {
Document document = dbFactory.newDocumentBuilder().parse(inputStream);
Element sourceRoot = document.getDocumentElement();
Element outRoot = outDocument.createElementNS(sourceRoot.getNamespaceURI(), sourceRoot.getTagName());
outDocument.appendChild(outRoot);
copyAtts(sourceRoot.getAttributes(), outRoot);
copyElement(sourceRoot.getChildNodes(), outRoot, outDocument);
}
try (Writer outxml = new FileWriter(new File(outFile))) {
OutputFormat format = new OutputFormat();
format.setLineWidth(0);
format.setIndenting(false);
format.setIndent(2);
XMLSerializer serializer = new XMLSerializer(outxml, format);
serializer.serialize(outDocument);
}
}
private void copyElement(NodeList nodes, Element parent, Document document) {
for (int i = 0; i < nodes.getLength(); i++) {
Node node = nodes.item(i);
if (node.getNodeType() == Node.ELEMENT_NODE) {
Element element = new ElementImpl((CoreDocumentImpl) document, node.getNodeName()) {
@Override
public NamedNodeMap getAttributes() {
return new AttributeSortedMap(this, (AttributeMap) super.getAttributes());
}
};
copyAtts(node.getAttributes(), element);
copyElement(node.getChildNodes(), element, document);
parent.appendChild(element);
}
}
}
private void copyAtts(NamedNodeMap attributes, Element target) {
for (int i = 0; i < attributes.getLength(); i++) {
Node att = attributes.item(i);
target.setAttribute(att.getNodeName(), att.getNodeValue());
}
}
public class AttributeSortedMap extends AttributeMap {
AttributeSortedMap(ElementImpl element, AttributeMap attributes) {
super(element, attributes);
nodes.sort((o1, o2) -> {
AttrImpl att1 = (AttrImpl) o1;
AttrImpl att2 = (AttrImpl) o2;
Integer pos1 = sortAtts.indexOf(att1.getNodeName());
Integer pos2 = sortAtts.indexOf(att2.getNodeName());
if (pos1 > -1 && pos2 > -1) {
return pos1.compareTo(pos2);
} else if (pos1 > -1 || pos2 > -1) {
return pos1 == -1 ? 1 : -1;
}
return att1.getNodeName().compareTo(att2.getNodeName());
});
}
}
public static void main(String[] args) throws Exception {
new AttOrderSorter().format("src/main/resources/test.xml", "src/main/resources/output.xml");
}
}
Result - file output.xml
<?xml version="1.0" encoding="UTF-8"?>
<root>
<person last_name="lederrey1" first_name="john1"/>
<person last_name="lederrey2" first_name="john2"/>
<person last_name="lederrey3" first_name="john3"/>
<person last_name="lederrey4" first_name="john4"/>
</root>

You can't use the DOM, but you can use SAX, or query children using XPath.
Visit the answer Order of XML attributes after DOM processing.

Related

Parsing XML file containing HTML entities in Java without changing the XML

I have to parse a bunch of XML files in Java that sometimes -- and invalidly -- contain HTML entities such as &mdash;, &gt; and so forth. I understand the correct way of dealing with this is to add suitable entity declarations to the XML file before parsing. However, I can't do that as I have no control over those XML files.
Is there some kind of callback I can override that is invoked whenever the Java XML parser encounters such an entity? I haven't been able to find one in the API.
I'd like to use:
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder parser = dbf.newDocumentBuilder();
Document doc = parser.parse( stream );
I found that I can override resolveEntity in org.xml.sax.helpers.DefaultHandler, but how do I use this with the higher-level API?
Here's a full example:
public class Main {
public static void main( String [] args ) throws Exception {
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder parser = dbf.newDocumentBuilder();
Document doc = parser.parse( new FileInputStream( "test.xml" ));
}
}
with test.xml:
<?xml version="1.0" encoding="UTF-8"?>
<foo>
<bar>Some&nbsp;text &mdash; invalid!</bar>
</foo>
Produces:
[Fatal Error] :3:20: The entity "nbsp" was referenced, but not declared.
Exception in thread "main" org.xml.sax.SAXParseException; lineNumber: 3; columnNumber: 20; The entity "nbsp" was referenced, but not declared.
Update: I have been poking around in the JDK source code with a debugger, and boy, what an amount of spaghetti. I have no idea what the design is there, or whether there is one. Just how many layers of an onion can one layer on top of each other?
The key class seems to be com.sun.org.apache.xerces.internal.impl.XMLEntityManager, but I cannot find any code that either lets me add stuff into it before it gets used, or that attempts to resolve entities without going through that class.

I would use a library like Jsoup for this purpose. I tested the following and it works; I hope it helps. Jsoup can be downloaded here: http://jsoup.org/download
public static void main(String args[]){
String html = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><foo>" +
"<bar>Some&nbsp;text &mdash; invalid!</bar></foo>";
Document doc = Jsoup.parse(html, "", Parser.xmlParser());
for (Element e : doc.select("bar")) {
System.out.println(e);
}
}
Result:
<bar>
Some text — invalid!
</bar>
Loading from a file can be found here:
http://jsoup.org/cookbook/input/load-document-from-file

Issue - 1: I have to parse a bunch of XML files in Java that sometimes -- and
invalidly -- contain HTML entities such as &mdash;
XML has only five predefined entities, and &mdash; is not among them. It works only in plain HTML or in legacy JSP, so SAX will not help. It can be done using StAX, which has a high-level iterator-based API. (Collected from this link)
Issue - 2: I found that I can override resolveEntity in
org.xml.sax.helpers.DefaultHandler, but how do I use this with the
higher-level API?
The Streaming API for XML, called StAX, is an API for reading and writing XML documents.
StAX is a pull-parsing model: the application takes control of parsing by pulling (taking) events from the parser.
The core StAX API falls into two categories:
Cursor-based API: the low-level API, which lets the application process XML as a stream of tokens (events).
Iterator-based API: the higher-level API, which lets the application process XML as a series of event objects, each of which communicates a piece of the XML structure to the application.
The StAX API has support for the notion of not replacing character entity references, by way of the IS_REPLACING_ENTITY_REFERENCES property:
Requires the parser to replace internal entity references with their
replacement text and report them as characters
This can be set on an XMLInputFactory, which is then in turn used to construct an XMLEventReader or XMLStreamReader.
However, the API is careful to say that this property is only intended to force the implementation to perform the replacement, rather than forcing it to not replace them.
You may try it. Hope it will solve your issue. For your case,
Main.java
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.EntityReference;
import javax.xml.stream.events.XMLEvent;
public class Main {
public static void main(String[] args) {
XMLInputFactory inputFactory = XMLInputFactory.newInstance();
inputFactory.setProperty(
XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, false);
XMLEventReader reader;
try {
reader = inputFactory
.createXMLEventReader(new FileInputStream("F://test.xml"));
while (reader.hasNext()) {
XMLEvent event = reader.nextEvent();
if (event.isEntityReference()) {
EntityReference ref = (EntityReference) event;
System.out.println("Entity Reference: " + ref.getName());
}
}
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (XMLStreamException e) {
e.printStackTrace();
}
}
}
test.xml:
<?xml version="1.0" encoding="UTF-8"?>
<foo>
<bar>Some&nbsp;text &mdash; invalid!</bar>
</foo>
Output:
Entity Reference: nbsp
Entity Reference: mdash
Credit goes to @skaffman.
Related Link:
http://www.journaldev.com/1191/how-to-read-xml-file-in-java-using-java-stax-api
http://www.journaldev.com/1226/java-stax-cursor-based-api-read-xml-example
http://www.vogella.com/tutorials/JavaXML/article.html
Is there a Java XML API that can parse a document without resolving character entities?
UPDATE:
Issue - 3: Is there a way to use StaX to "filter" the entities (replacing them
with something else, for example) and still produce a Document at the
end of the process?
To create a new document using the StAX API, you need to create an XMLStreamWriter, which provides methods to produce XML opening and closing tags, attributes and character content.
There are five XMLStreamWriter methods that matter for writing a document:
xmlsw.writeStartDocument() - initialises an empty document to which elements can be added
xmlsw.writeStartElement(String s) - creates a new element named s
xmlsw.writeAttribute(String name, String value) - adds the attribute name with the corresponding value to the last element produced by a call to writeStartElement. It is possible to add attributes as long as no call to writeStartElement, writeCharacters or writeEndElement has been made.
xmlsw.writeEndElement() - closes the last started element
xmlsw.writeCharacters(String s) - creates a new text node with content s as content of the last started element
A sample example is attached with it:
StAXExpand.java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;
import java.util.Arrays;
public class StAXExpand {
static XMLStreamWriter xmlsw = null;
public static void main(String[] argv) {
try {
xmlsw = XMLOutputFactory.newInstance()
.createXMLStreamWriter(System.out);
CompactTokenizer tok = new CompactTokenizer(
new FileReader(argv[0]));
String rootName = "dummyRoot";
// ignore everything preceding the word before the first "["
while(!tok.nextToken().equals("[")){
rootName=tok.getToken();
}
// start creating new document
xmlsw.writeStartDocument();
ignorableSpacing(0);
xmlsw.writeStartElement(rootName);
expand(tok,3);
ignorableSpacing(0);
xmlsw.writeEndDocument();
xmlsw.flush();
xmlsw.close();
} catch (XMLStreamException e){
System.out.println(e.getMessage());
} catch (IOException ex) {
System.out.println("IOException"+ex);
ex.printStackTrace();
}
}
public static void expand(CompactTokenizer tok, int indent)
throws IOException,XMLStreamException {
tok.skip("[");
while(tok.getToken().equals("#")) {// add attributes
String attName = tok.nextToken();
tok.nextToken();
xmlsw.writeAttribute(attName,tok.skip("["));
tok.nextToken();
tok.skip("]");
}
boolean lastWasElement=true; // for controlling the output of newlines
while(!tok.getToken().equals("]")){ // process content
String s = tok.getToken().trim();
tok.nextToken();
if(tok.getToken().equals("[")){
if(lastWasElement)ignorableSpacing(indent);
xmlsw.writeStartElement(s);
expand(tok,indent+3);
lastWasElement=true;
} else {
xmlsw.writeCharacters(s);
lastWasElement=false;
}
}
tok.skip("]");
if(lastWasElement)ignorableSpacing(indent-3);
xmlsw.writeEndElement();
}
private static char[] blanks = "\n".toCharArray();
private static void ignorableSpacing(int nb)
throws XMLStreamException {
if(nb>blanks.length){// extend the length of space array
blanks = new char[nb+1];
blanks[0]='\n';
Arrays.fill(blanks,1,blanks.length,' ');
}
xmlsw.writeCharacters(blanks, 0, nb+1);
}
}
CompactTokenizer.java
import java.io.Reader;
import java.io.IOException;
import java.io.StreamTokenizer;
public class CompactTokenizer {
private StreamTokenizer st;
CompactTokenizer(Reader r){
st = new StreamTokenizer(r);
st.resetSyntax(); // remove parsing of numbers...
st.wordChars('\u0000','\u00FF'); // everything is part of a word
// except the following...
st.ordinaryChar('\n');
st.ordinaryChar('[');
st.ordinaryChar(']');
st.ordinaryChar('#');
}
public String nextToken() throws IOException{
st.nextToken();
while(st.ttype=='\n'||
(st.ttype==StreamTokenizer.TT_WORD &&
st.sval.trim().length()==0))
st.nextToken();
return getToken();
}
public String getToken(){
return (st.ttype == StreamTokenizer.TT_WORD) ? st.sval : (""+(char)st.ttype);
}
public String skip(String sym) throws IOException {
if(getToken().equals(sym))
return nextToken();
else
throw new IllegalArgumentException("skip: "+sym+" expected but "+
getToken() +" found ");
}
}
For more, you can follow the tutorial
https://docs.oracle.com/javase/tutorial/jaxp/stax/example.html
http://www.ibm.com/developerworks/library/x-tipstx2/index.html
http://www.iro.umontreal.ca/~lapalme/ForestInsteadOfTheTrees/HTML/ch09s03.html
http://staf.sourceforge.net/current/STAXDoc.pdf
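For Issue 3 specifically, here is a minimal sketch of one way to filter the entities and still end up with a Document. It is an assumption-laden example rather than a tested recipe: it uses the StAX cursor API with IS_REPLACING_ENTITY_REFERENCES set to false (as in Main.java above) and builds the DOM tree by hand, it ignores namespaces, comments and processing instructions, and the class name and the replacement table are invented for illustration. Whether an undeclared entity is reported as an ENTITY_REFERENCE event rather than an error depends on the StAX implementation; the output shown above suggests the JDK's bundled one does report it.

```java
import java.io.StringReader;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper; the replacement table is an assumption, extend it as needed.
final class EntityFilteringDomBuilder {

    private static final Map<String, String> REPLACEMENTS =
            Map.of("nbsp", "\u00A0", "mdash", "\u2014");

    static Document parse(String xml) {
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            factory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, false);
            XMLStreamReader in = factory.createXMLStreamReader(new StringReader(xml));
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Node current = doc; // node that receives the next child
            while (in.hasNext()) {
                switch (in.next()) {
                    case XMLStreamConstants.START_ELEMENT: {
                        Element element = doc.createElement(in.getLocalName());
                        for (int i = 0; i < in.getAttributeCount(); i++) {
                            element.setAttribute(in.getAttributeLocalName(i), in.getAttributeValue(i));
                        }
                        current.appendChild(element);
                        current = element;
                        break;
                    }
                    case XMLStreamConstants.CHARACTERS:
                        if (current != doc) {
                            current.appendChild(doc.createTextNode(in.getText()));
                        }
                        break;
                    case XMLStreamConstants.ENTITY_REFERENCE: {
                        // Swap the unresolved entity for plain text; unknown names are kept literally.
                        String name = in.getLocalName();
                        String text = REPLACEMENTS.getOrDefault(name, "&" + name + ";");
                        current.appendChild(doc.createTextNode(text));
                        break;
                    }
                    case XMLStreamConstants.END_ELEMENT:
                        current = current.getParentNode();
                        break;
                    default:
                        break; // comments, processing instructions etc. are dropped in this sketch
                }
            }
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("could not build document", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<foo><bar>Some&nbsp;text &mdash; invalid!</bar></foo>");
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```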

Another approach, since you're not using a rigid OXM approach anyway.
You might want to try using a less rigid parser such as JSoup?
This will stop immediate problems with invalid XML schemas etc, but it will just devolve the problem into your code.

Just to throw in a different approach to a solution:
You might wrap your input stream in a stream implementation that replaces the entities with something legal.
While this is a hack for sure, it should be a quick and easy solution (or better said: workaround).
It is not as elegant and clean as a solution internal to the XML framework, though.
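A minimal sketch of that workaround, with some simplifications: it reads the whole text into a String instead of wrapping the stream, it rewrites named entities into numeric character references (which any XML parser accepts), and the two-entry table is only an assumption to be extended. The class and method names are hypothetical. Note that it also rewrites matches inside CDATA sections and comments.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical pre-filter; not part of any XML library.
final class EntityPrefilter {

    // Assumed mapping from HTML entity names to numeric character references.
    private static final Map<String, String> NUMERIC =
            Map.of("nbsp", "&#160;", "mdash", "&#8212;");
    private static final Pattern NAMED_ENTITY = Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    /** Replaces known HTML entities; anything else (such as &amp;) is left untouched. */
    static String toNumericReferences(String xml) {
        return NAMED_ENTITY.matcher(xml).replaceAll(match ->
                Matcher.quoteReplacement(NUMERIC.getOrDefault(match.group(1), match.group())));
    }

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(toNumericReferences(xml))));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse filtered XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<foo><bar>Some&nbsp;text &mdash; invalid!</bar></foo>");
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

For large files the same replacement could be done in a FilterReader, at the price of handling entities that straddle buffer boundaries.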

Yesterday I did something similar: I needed to add a value from an unzipped XML stream to a database.
// imports (I'm not sure all of them are necessary)
import java.io.FileInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.*;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
// I haven't tested this code right now because I'm at work, but it should work;
// you may need to make small changes.
InputSource is = new InputSource(new FileInputStream("test.xml"));
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(is);
XPathFactory xpf = XPathFactory.newInstance();
XPath xpath = xpf.newXPath();
String words= xpath.evaluate("/foo/bar", doc.getDocumentElement());
ParsingHexToChar.parseToChar(words);
// library used: commons-lang3.jar
// method that does the unescaping
public static String parseToChar( String words){
String decode= org.apache.commons.lang3.StringEscapeUtils.unescapeHtml4(words);
return decode;
}

Try this, using the Apache Commons libraries (org.apache.commons):
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder parser = dbf.newDocumentBuilder();
InputStream in = new FileInputStream(xmlfile);
String unescapeHtml4 = IOUtils.toString(in);
CharSequenceTranslator obj = new AggregateTranslator(new LookupTranslator(EntityArrays.ISO8859_1_UNESCAPE()),
new LookupTranslator(EntityArrays.HTML40_EXTENDED_UNESCAPE())
);
unescapeHtml4 = obj.translate(unescapeHtml4);
StringReader readerInput= new StringReader(unescapeHtml4);
InputSource is = new InputSource(readerInput);
Document doc = parser.parse(is);

Parsing XML file In java using DOM

I want to parse XML file to read values of certain elements in the file.
<row>
<element>
<status>OK</status>
<duration>
<value>499</value>
<text>8 mins</text>
</duration>
<distance>
<value>3208</value>
<text>3.2 km</text>
</distance>
</element>
<element>
<status>OK</status>
<duration>
<value>501</value>
<text>8 mins</text>
</duration>
<distance>
<value>2869</value>
<text>2.9 km</text>
</distance>
</element>
<element>
<status>OK</status>
<duration>
<value>788</value>
<text>13 mins</text>
</duration>
<distance>
<value>6718</value>
<text>6.7 km</text>
</distance>
</element>
</row>
I want to be able to read the values of all "value" tags under "distance" tags. I have written the basic code to read XML data; I just need an idea of how to read the values of elements at the third level.

Plain DOM parsing is no longer the preferred way, IMHO. Java comes with some mature frameworks that make parsing much easier. The following is my own preference; others may think differently.
If the structure of your XML is fixed, you could build some JAXB-annotated POJOs and read your data with them. JAXB delivers complete object hierarchies filled with your XML values, and it can create XML data as well.
If you don't know the structure of your XML data, or you stream XML data, then StAX parsing may be the way to go.
In any case, these frameworks take a lot of problems off your hands, such as file encoding, syntax checking, type safety (JAXB), and so on.
If you want to use DOM parsing, you can use XPath to shorten your queries dramatically:
XPath xPath = XPathFactory.newInstance().newXPath();
InputSource source = new InputSource(ParseUsingXPath.class.getResourceAsStream("data.xml"));
NodeList list = (NodeList) xPath.evaluate("/row/element/distance/value", source, XPathConstants.NODESET);
for (int i=0;i<list.getLength();i++) {
System.out.println(list.item(i).getTextContent());
}
and it outputs:
3208
2869
6718
Here I read your XML directly as a stream from a resource file. You could use XPath with a DOM document object as well, to run searches over the whole document.

Have you already tried using an XSD definition of the XML? With that and JAXB you can easily unmarshal the XML into a Java object.

The code below might help you access the value elements.
import java.io.FileInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
public class DOMParsarDemo {
protected DocumentBuilder docBuilder;
protected Element root;
public DOMParsarDemo() throws Exception {
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
docBuilder = dbf.newDocumentBuilder();
}
public void parse(String file) throws Exception {
Document doc = docBuilder.parse(new FileInputStream(file));
root = doc.getDocumentElement();
System.out.println("root element is :" + root.getNodeName());
}
public void printAllElements() throws Exception {
printElement(root);
}
public void printElement(Node node) {
if (node.getNodeType() != Node.TEXT_NODE) {
Node child = node.getFirstChild();
while (child != null) {
if (node.getNodeName().equals("distance")) {
if (child.getNodeName().equals("value")) {
System.out.println(child.getFirstChild().getNodeValue());
}
}
printElement(child);
child = child.getNextSibling();
}
}
}
public static void main(String args[]) throws Exception {
DOMParsarDemo demo = new DOMParsarDemo();
demo.parse("resources/abc.xml");
demo.printAllElements();
}
}
output:
root element is :row
3208
2869
6718

Parsing xml file contents without knowing xml file structure

I've been working on learning some new tech using Java to parse files, and for the most part it's going well. However, I'm at a loss as to how I could parse an XML file whose structure is not known upon receipt. There are lots of examples of how to do so if you know the structure (getElementsByTagName seems to be the way to go), but no dynamic options, at least not that I've found.
So the tl;dr version of this question: how can I parse an XML file where I cannot rely on knowing its structure?

Well the parsing part is easy; like helderdarocha stated in the comments, the parser only requires valid XML, it does not care about the structure. You can use Java's standard DocumentBuilder to obtain a Document:
InputStream in = new FileInputStream(...);
Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
(If you're parsing multiple documents, you can keep reusing the same DocumentBuilder.)
Then you can start with the root document element and use familiar DOM methods from there on out:
Element root = doc.getDocumentElement(); // perform DOM operations starting here.
As for processing it, well it really depends on what you want to do with it, but you can use the methods of Node like getFirstChild() and getNextSibling() to iterate through children and process as you see fit based on structure, tags, and attributes.
Consider the following example:
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
public class XML {
public static void main (String[] args) throws Exception {
String xml = "<objects><circle color='red'/><circle color='green'/><rectangle>hello</rectangle><glumble/></objects>";
// parse
InputStream in = new ByteArrayInputStream(xml.getBytes("utf-8"));
Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
// process
Node objects = doc.getDocumentElement();
for (Node object = objects.getFirstChild(); object != null; object = object.getNextSibling()) {
if (object instanceof Element) {
Element e = (Element)object;
if (e.getTagName().equalsIgnoreCase("circle")) {
String color = e.getAttribute("color");
System.out.println("It's a " + color + " circle!");
} else if (e.getTagName().equalsIgnoreCase("rectangle")) {
String text = e.getTextContent();
System.out.println("It's a rectangle that says \"" + text + "\".");
} else {
System.out.println("I don't know what a " + e.getTagName() + " is for.");
}
}
}
}
}
The input XML document (hard-coded for example) is:
<objects>
<circle color='red'/>
<circle color='green'/>
<rectangle>hello</rectangle>
<glumble/>
</objects>
The output is:
It's a red circle!
It's a green circle!
It's a rectangle that says "hello".
I don't know what a glumble is for.
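If you don't even know how deep the document goes, the same getFirstChild()/getNextSibling() loop can be applied recursively. The following is a small sketch of that idea; the class name and the path format are made up for illustration. It lists every element as a slash-separated path.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical example class showing a recursive walk over an unknown structure.
final class XmlOutline {

    /** Returns one path per element, in document order. */
    static List<String> outline(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<String> paths = new ArrayList<>();
            walk(doc.getDocumentElement(), "", paths);
            return paths;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    private static void walk(Element element, String parentPath, List<String> paths) {
        String path = parentPath + "/" + element.getTagName();
        paths.add(path);
        // Visit child elements only; text nodes and comments are skipped.
        for (Node child = element.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child instanceof Element) {
                walk((Element) child, path, paths);
            }
        }
    }

    public static void main(String[] args) {
        outline("<objects><circle color='red'/><group><glumble/></group></objects>")
                .forEach(System.out::println);
    }
}
```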

What's wrong with this Java XML-Parsing code?

I'm trying to parse an XML file and be able to insert a path and get the value of the field.
It looks as follows:
import java.io.IOException;
import javax.xml.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.SAXException;
public class XMLConfigManager {
private Element config = null;
public XMLConfigManager(String file) {
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
try {
Document domTree;
DocumentBuilder db = dbf.newDocumentBuilder();
domTree = db.parse(file);
config = domTree.getDocumentElement();
}
catch (IllegalArgumentException iae) {
iae.printStackTrace();
}
catch (ParserConfigurationException pce) {
pce.printStackTrace();
}
catch (SAXException se) {
se.printStackTrace();
}
catch (IOException ioe) {
ioe.printStackTrace();
}
}
public String getStringValue(String path) {
String[] pathArray = path.split("\\|");
Element tempElement = config;
NodeList tempNodeList = null;
for (int i = 0; i < pathArray.length; i++) {
if (i == 0) {
if (tempElement.getNodeName().equals(pathArray[0])) {
System.out.println("First element is correct, do nothing here (just in next step)");
}
else {
return "**This node does not exist**";
}
}
else {
tempNodeList = tempElement.getChildNodes();
tempElement = getChildElement(pathArray[i],tempNodeList);
}
}
return tempElement.getNodeValue();
}
private Element getChildElement(String identifier, NodeList nl) {
String tempNodeName = null;
for (int i = 0; i < nl.getLength(); i++) {
tempNodeName = nl.item(i).getNodeName();
if (tempNodeName.equals(identifier)) {
Element returner = (Element)nl.item(i).getChildNodes();
return returner;
}
}
return null;
}
}
The XML looks like this (for test purposes):
<?xml version="1.0" encoding="UTF-8"?>
<amc>
<controller>
<someOtherTest>bla</someOtherTest>
<general>
<spam>This is test return String</spam>
<interval>1000</interval>
</general>
</controller>
<agent>
<name>test</name>
<ifc>ifcTest</ifc>
</agent>
</amc>
Now I can call the class like this
XMLConfigManager xmlcm = new XMLConfigManager("myConfig.xml");
System.out.println(xmlcm.getStringValue("amc|controller|general|spam"));
Here, I'm expecting the value of the tag spam, so this would be "This is test return String". But I'm getting null.
I've tried to fix this for days now and I just can't get it. The iteration works so it gets to the tag spam, but then, just as I said, it returns null instead of the text.
Is this a bug or am I just doing it wrong? Why? :(
Thank you very much for help!
Regards, Flo

You're calling Node.getNodeValue() - which is documented to return null when you call it on an element. You should call getTextContent() instead - or use a higher level API, of course.
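A short sketch that shows the difference on the element from the question; the class and helper names are invented for the example:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class; only standard DOM calls are used.
final class NodeValueDemo {

    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        Element spam = parseRoot("<spam>This is test return String</spam>");
        // An element node has no node value of its own.
        System.out.println(spam.getNodeValue());
        // The text lives in a child text node; getTextContent() collects it.
        System.out.println(spam.getTextContent());
        System.out.println(spam.getFirstChild().getNodeValue());
    }
}
```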

As others mentioned before me, you seem to be reinventing the concept of XPath. You can replace your code with the following:
javax.xml.xpath.XPath xpath = javax.xml.xpath.XPathFactory.newInstance().newXPath();
String expression = "/amc/controller/general/spam";
org.xml.sax.InputSource inputSource = new org.xml.sax.InputSource("myConfig.xml");
String result = xpath.evaluate(expression, inputSource);
See also: XML Validation and XPath Evaluation in J2SE 5.0
EDIT:
An example of extracting a collection with XPath:
NodeList result = (NodeList) xpath.evaluate(expression, inputSource, XPathConstants.NODESET);
for (int i = 0; i < result.getLength(); i++) {
System.out.println(result.item(i).getTextContent());
}
The javax.xml.xpath.XPath interface is documented here, and there are a few more examples in the aforementioned article.
In addition, there are third-party libraries for XML manipulation, which you may find more convenient, such as dom4j (suggested by duffymo) or JDOM. Regardless of which library you use, you can leverage the quite powerful XPath language.

Because you're using getNodeValue() rather than getTextContent().
Doing this by hand is an accident waiting to happen; either use the built-in XPath solutions, or a third-party library as suggested by @duffymo. This is not a situation where re-invention adds value, IMO.

I'd wonder why you're not using a library like dom4j and built-in XPath. You're doing a lot of work with a very low-level API (W3C DOM).
Step through with a debugger and see what children that <spam> node has. You should quickly figure out why it's null. It'll be faster than asking here.

How to improve splitting xml file performance

I've seen quite a lot of posts/blogs/articles about splitting an XML file into smaller chunks and decided to write my own because I have some custom requirements. Here is what I mean; consider the following XML:
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<company>
<staff id="1">
<firstname>yong</firstname>
<lastname>mook kim</lastname>
<nickname>mkyong</nickname>
<salary>100000</salary>
</staff>
<staff id="2">
<firstname>yong</firstname>
<lastname>mook kim</lastname>
<nickname>mkyong</nickname>
<salary>100000</salary>
</staff>
<staff id="3">
<firstname>yong</firstname>
<lastname>mook kim</lastname>
<nickname>mkyong</nickname>
<salary>100000</salary>
</staff>
<staff id="4">
<firstname>yong</firstname>
<lastname>mook kim</lastname>
<nickname>mkyong</nickname>
<salary>100000</salary>
</staff>
<staff id="5">
<firstname>yong</firstname>
<lastname>mook kim</lastname>
<salary>100000</salary>
</staff>
</company>
I want to split this XML into n files, each containing 1 staff element, but the staff element must contain a nickname; if it's not there, I don't want it. So this should produce 4 split files, one for each staff id from 1 to 4.
Here is my code :
public int split() throws Exception{
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(inputFilePath)));
String line;
List<String> tempList = null;
while((line=br.readLine())!=null){
if(line.contains("<?xml version=\"1.0\"") || line.contains("<" + rootElement + ">") || line.contains("</" + rootElement + ">")){
continue;
}
if(line.contains("<"+ element +">")){
tempList = new ArrayList<String>();
}
tempList.add(line);
if(line.contains("</"+ element +">")){
if(hasConditions(tempList)){
writeToSplitFile(tempList);
writtenObjectCounter++;
totalCounter++;
}
}
if(writtenObjectCounter == itemsPerFile){
writtenObjectCounter = 0;
fileCounter++;
tempList.clear();
}
}
if(tempList.size() != 0){
writeClosingRootElement();
}
return totalCounter;
}
private void writeToSplitFile(List<String> itemList) throws Exception{
BufferedWriter wr = new BufferedWriter(new FileWriter(outputDirectory + File.separator + "split_" + fileCounter + ".xml", true));
if(writtenObjectCounter == 0){
wr.write("<" + rootElement + ">");
wr.write("\n");
}
for (String string : itemList) {
wr.write(string);
wr.write("\n");
}
if(writtenObjectCounter == itemsPerFile-1)
wr.write("</" + rootElement + ">");
wr.close();
}
private void writeClosingRootElement() throws Exception{
BufferedWriter wr = new BufferedWriter(new FileWriter(outputDirectory + File.separator + "split_" + fileCounter + ".xml", true));
wr.write("</" + rootElement + ">");
wr.close();
}
private boolean hasConditions(List<String> list){
int matchList = 0;
for (String condition : conditionList) {
for (String string : list) {
if(string.contains(condition)){
matchList++;
}
}
}
if(matchList >= conditionList.size()){
return true;
}
return false;
}
I know that opening and closing a stream for each written staff element hurts performance; the alternative would be to write once per file (which may contain n staff elements). Naturally, the root and split elements are configurable.
Any ideas how I can improve the performance/logic? I'd prefer some code, but good advice can sometimes be better.
Edit:
This XML example is actually a dummy example; the real XML I'm trying to split has about 300-500 different elements under the split element, all appearing in random order and in varying numbers. StAX may not be the best solution after all?
Bounty update :
I'm looking for a solution(code) that will:
Be able to split an XML file into n parts with x split elements (in the dummy XML example, staff is the split element).
The content of the split files should be wrapped in the root element from the original file (like company in the dummy example).
I'd like to be able to specify a condition that must hold in the split element, i.e. I want only staff elements which have a nickname and I want to discard those without one. But it should also be possible to split without any conditions.
The code doesn't necessarily have to improve on my solution, which lacks good logic and performance but works.
I'm not happy with "but it works", though. I can't find enough examples of StAX for this kind of operation, and the user community is not great either. It doesn't have to be a StAX solution.
I'm probably asking too much, but I'm here to learn, and I think I'm giving a good bounty for the solution.

First piece of advice: don't try to write your own XML handling code. Use an XML parser - it's going to be much more reliable and quite possibly faster.
If you use an XML pull parser (e.g. StAX) you should be able to read an element at a time and write it out to disk, never reading the whole document in one go.

Here's my suggestion. It requires a streaming XSLT 3.0 processor: which means in practice that it needs Saxon-EE 9.3.
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
<xsl:mode streamable="yes"/>
<xsl:template match="/">
<xsl:apply-templates select="company/staff"/>
</xsl:template>
<xsl:template match="staff">
<xsl:variable name="v" as="element(staff)">
<xsl:copy-of select="."/>
</xsl:variable>
<xsl:if test="$v/nickname">
<xsl:result-document href="{@id}.xml">
<xsl:copy-of select="$v"/>
</xsl:result-document>
</xsl:if>
</xsl:template>
</xsl:stylesheet>
In practice, though, unless you have hundreds of megabytes of data, I suspect a non-streaming solution will be quite fast enough, and probably faster than your hand-written Java code, given that your Java code is nothing to get excited about. At any rate, give an XSLT solution a try before you write reams of low-level Java. It's a routine problem, after all.

You could do the following with StAX:
Algorithm
Read and hold onto the root element event.
Read first chunk of XML:
Queue events until condition has been met.
If condition has been met:
Write start document event.
Write out root start element event
Write out split start element event
Write out queued events
Write out remaining events for this section.
If condition was not met then do nothing.
Repeat step 2 with next chunk of XML
Code for Your Use Case
The following code uses StAX APIs to break up the document as outlined in your question:
package forum7408938;
import java.io.*;
import java.util.*;
import javax.xml.namespace.QName;
import javax.xml.stream.*;
import javax.xml.stream.events.*;
public class Demo {
public static void main(String[] args) throws Exception {
Demo demo = new Demo();
demo.split("src/forum7408938/input.xml", "nickname");
//demo.split("src/forum7408938/input.xml", null);
}
private void split(String xmlResource, String condition) throws Exception {
XMLEventFactory xef = XMLEventFactory.newFactory();
XMLInputFactory xif = XMLInputFactory.newInstance();
XMLEventReader xer = xif.createXMLEventReader(new FileReader(xmlResource));
StartElement rootStartElement = xer.nextTag().asStartElement(); // Advance to the root element
StartDocument startDocument = xef.createStartDocument();
EndDocument endDocument = xef.createEndDocument();
XMLOutputFactory xof = XMLOutputFactory.newFactory();
while(xer.hasNext() && !xer.peek().isEndDocument()) {
boolean metCondition;
XMLEvent xmlEvent = xer.nextTag();
if(!xmlEvent.isStartElement()) {
break;
}
// BOUNTY CRITERIA
// Be able to split XML file into n parts with x split elements(from
// the dummy XML example staff is the split element).
StartElement breakStartElement = xmlEvent.asStartElement();
List<XMLEvent> cachedXMLEvents = new ArrayList<XMLEvent>();
// BOUNTY CRITERIA
// I'd like to be able to specify condition that must be in the
// split element i.e. I want only staff which have nickname, I want
// to discard those without nicknames. But be able to also split
// without conditions while running split without conditions.
if(null == condition) {
cachedXMLEvents.add(breakStartElement);
metCondition = true;
} else {
cachedXMLEvents.add(breakStartElement);
xmlEvent = xer.nextEvent();
metCondition = false;
while(!(xmlEvent.isEndElement() && xmlEvent.asEndElement().getName().equals(breakStartElement.getName()))) {
cachedXMLEvents.add(xmlEvent);
if(xmlEvent.isStartElement() && xmlEvent.asStartElement().getName().getLocalPart().equals(condition)) {
metCondition = true;
break;
}
xmlEvent = xer.nextEvent();
}
}
if(metCondition) {
// Create a file for the fragment, the name is derived from the value of the id attribute
FileWriter fileWriter = null;
fileWriter = new FileWriter("src/forum7408938/" + breakStartElement.getAttributeByName(new QName("id")).getValue() + ".xml");
// A StAX XMLEventWriter will be used to write the XML fragment
XMLEventWriter xew = xof.createXMLEventWriter(fileWriter);
xew.add(startDocument);
// BOUNTY CRITERIA
// The content of the split files should be wrapped in the
// root element from the original file(like in the dummy example
// company)
xew.add(rootStartElement);
// Write the XMLEvents that were cached while we were
// checking the fragment to see if it matched our criteria.
for(XMLEvent cachedEvent : cachedXMLEvents) {
xew.add(cachedEvent);
}
// Write the XMLEvents that we still need to parse from this
// fragment
xmlEvent = xer.nextEvent();
while(xer.hasNext() && !(xmlEvent.isEndElement() && xmlEvent.asEndElement().getName().equals(breakStartElement.getName()))) {
xew.add(xmlEvent);
xmlEvent = xer.nextEvent();
}
xew.add(xmlEvent);
// Close everything we opened
xew.add(xef.createEndElement(rootStartElement.getName(), null));
xew.add(endDocument);
xew.close(); // flush any events still buffered by the event writer
fileWriter.close();
}
}
}
}

@Jon Skeet is spot on as usual in his advice. @Blaise Doughan gave you a very basic picture of using StAX (which would be my preferred choice, although you can do basically the same thing with SAX). You seem to be looking for something more explicit, so here's some pseudo code to get you started (based on StAX):
find first "staff" StartElement
set a flag indicating you are in a "staff" element and start tracking the depth (StartElement is +1, EndElement is -1)
now, process the "staff" sub-elements, grab any of the data you care about and put it in a file (or where ever)
keep processing until your depth reaches 0 (when you find the matching "staff" EndElement)
unset the flag indicating you are in a "staff" element
search for the next "staff" StartElement
if found, go to 2. and repeat
if not found, document is complete
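The steps above can be sketched in Java roughly as follows. This is only an illustration under some assumptions: the class name DepthSplitter and the method split are invented here, the input is taken as a String rather than a file, and each fragment is collected into a list instead of being written to disk. It uses the JDK's StAX event API and omits the condition check and the root wrapper.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.XMLEvent;

class DepthSplitter {
    // Returns every top-most <splitName> subtree of the input as its own XML string.
    static List<String> split(String xml, String splitName) throws Exception {
        XMLEventReader reader = XMLInputFactory.newInstance()
                .createXMLEventReader(new StringReader(xml));
        XMLOutputFactory outputFactory = XMLOutputFactory.newInstance();
        List<String> fragments = new ArrayList<>();
        StringWriter buffer = null;
        XMLEventWriter writer = null;
        int depth = 0; // 0 means "not inside a split element"

        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (depth == 0) {
                // Search for the next split StartElement; ignore everything else.
                if (event.isStartElement()
                        && event.asStartElement().getName().getLocalPart().equals(splitName)) {
                    buffer = new StringWriter();
                    writer = outputFactory.createXMLEventWriter(buffer);
                    writer.add(event);
                    depth = 1; // we are now inside the split element
                }
                continue;
            }
            // Inside the split element: copy the event and track the depth.
            writer.add(event);
            if (event.isStartElement()) {
                depth++;
            } else if (event.isEndElement()) {
                depth--;
            }
            if (depth == 0) {
                // The matching EndElement was just written; the fragment is complete.
                writer.flush();
                writer.close();
                fragments.add(buffer.toString());
            }
        }
        reader.close();
        return fragments;
    }
}
```

Because the depth counter only returns to 0 at the matching end tag, a staff element nested inside another staff element stays part of the outer fragment instead of starting a new one.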
EDIT:
wow, i have to say i'm amazed at the number of people willing to do someone else's work for them. i didn't realize SO was basically a free version of rent-a-coder.

@Gandalf StormCrow:
Let me divide your problem into three separate issues:
i) Reading the XML and simultaneously splitting it in the best possible way
ii) Checking the condition in the split file
iii) If the condition is met, processing that split file.
For i), there are of course multiple solutions: SAX, StAX and other parsers, or, as you mentioned, simply reading with plain Java IO operations and searching for tags.
I believe SAX/StAX/simple Java IO, anything will do. I have taken your example as the base for my solution.
ii) Checking the condition in the split file: you have used the contains() method to check for the existence of nickname. This does not seem the best way: what if your conditions are more complex, e.g. nickname should be present but with length > 5, or salary should be numeric, etc.?
I would use the new Java XML validation framework for this, which makes use of XML Schema. Please note we can cache the schema object in memory so as to reuse it again and again. This new validation framework is pretty fast.
iii) If the condition is met, process that split file.
You may want to use the Java concurrency APIs to submit async tasks (ExecutorService class) to achieve parallel execution for faster performance.
So, considering the above points, one possible solution can be:
You can create a company.xsd file like:
<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://www.w3.org/2001/XMLSchema"
targetNamespace="http://www.example.org/NewXMLSchema"
xmlns:tns="http://www.example.org/NewXMLSchema"
elementFormDefault="unqualified">
<element name="company">
<complexType>
<sequence>
<element name="staff" type="tns:stafftype"/>
</sequence>
</complexType>
</element>
<complexType name="stafftype">
<sequence>
<element name="firstname" type="string" minOccurs="0" />
<element name="lastname" type="string" minOccurs="0" />
<element name="nickname" type="string" minOccurs="1" />
<element name="salary" type="int" minOccurs="0" />
</sequence>
</complexType>
</schema>
then your java code would look like:-
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;
public class testXML {
// Lookup a factory for the W3C XML Schema language
static SchemaFactory factory = SchemaFactory
.newInstance("http://www.w3.org/2001/XMLSchema");
// Compile the schema.
static File schemaLocation = new File("company.xsd");
static Schema schema = null;
static {
try {
schema = factory.newSchema(schemaLocation);
} catch (SAXException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
private final ExecutorService pool = Executors.newFixedThreadPool(20);
boolean validate(StringBuffer splitBuffer) {
boolean isValid = false;
Validator validator = schema.newValidator();
try {
validator.validate(new StreamSource(new ByteArrayInputStream(
splitBuffer.toString().getBytes())));
isValid = true;
} catch (SAXException ex) {
System.out.println(ex.getMessage());
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
return isValid;
}
void split(BufferedReader br, String rootElementName,
String splitElementName) {
StringBuffer splitBuffer = null;
String line = null;
String startRootElement = "<" + rootElementName + ">";
String endRootElement = "</" + rootElementName + ">";
String startSplitElement = "<" + splitElementName + ">";
String endSplitElement = "</" + splitElementName + ">";
String xmlDeclaration = "<?xml version=\"1.0\"";
boolean startFlag = false, endflag = false;
try {
while ((line = br.readLine()) != null) {
if (line.contains(xmlDeclaration)
|| line.contains(startRootElement)
|| line.contains(endRootElement)) {
continue;
}
if (line.contains(startSplitElement)) {
startFlag = true;
endflag = false;
splitBuffer = new StringBuffer(startRootElement);
splitBuffer.append(line);
} else if (line.contains(endSplitElement)) {
endflag = true;
startFlag = false;
splitBuffer.append(line);
splitBuffer.append(endRootElement);
} else if (startFlag) {
splitBuffer.append(line);
}
if (endflag) {
//process splitBuffer
boolean result = validate(splitBuffer);
if (result) {
//send it to a thread for processing further
//it is async so that main thread can continue for next
pool.submit(new ProcessingHandler(splitBuffer));
}
}
}
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
class ProcessingHandler implements Runnable {
String splitXML = null;
ProcessingHandler(StringBuffer splitXMLBuffer) {
this.splitXML = splitXMLBuffer.toString();
}
@Override
public void run() {
// do like writing to a file etc.
}
}

Have a look at this. This is slightly reworked sample from xmlpull.org:
http://www.xmlpull.org/v1/download/unpacked/doc/quick_intro.html
The following should do all you need unless you have nested splitting tags like:
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<company>
<staff id="1">
<firstname>yong</firstname>
<lastname>mook kim</lastname>
<nickname>mkyong</nickname>
<salary>100000</salary>
<other>
<staff>
...
</staff>
</other>
</staff>
</company>
To run it in pass-through mode simply pass null as splitting tag.
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import org.apache.commons.io.FileUtils;
import org.xmlpull.v1.XmlPullParser;
import org.xmlpull.v1.XmlPullParserException;
import org.xmlpull.v1.XmlPullParserFactory;
public class XppSample {
private String rootTag;
private String splitTag;
private String requiredTag;
private int flushThreshold;
private String fileName;
private String rootTagEnd;
private boolean hasRequiredTag = false;
private int flushCount = 0;
private int fileNo = 0;
private String header;
private XmlPullParser xpp;
private StringBuilder nodeBuf = new StringBuilder();
private StringBuilder fileBuf = new StringBuilder();
public XppSample(String fileName, String rootTag, String splitTag, String requiredTag, int flushThreshold) throws XmlPullParserException, FileNotFoundException {
this.rootTag = rootTag;
rootTagEnd = "</" + rootTag + ">";
this.splitTag = splitTag;
this.requiredTag = requiredTag;
this.flushThreshold = flushThreshold;
this.fileName = fileName;
XmlPullParserFactory factory = XmlPullParserFactory.newInstance(System.getProperty(XmlPullParserFactory.PROPERTY_NAME), null);
factory.setNamespaceAware(true);
xpp = factory.newPullParser();
xpp.setInput(new FileReader(fileName));
}
public void processDocument() throws XmlPullParserException, IOException {
int eventType = xpp.getEventType();
do {
if(eventType == XmlPullParser.START_TAG) {
processStartElement(xpp);
} else if(eventType == XmlPullParser.END_TAG) {
processEndElement(xpp);
} else if(eventType == XmlPullParser.TEXT) {
processText(xpp);
}
eventType = xpp.next();
} while (eventType != XmlPullParser.END_DOCUMENT);
saveFile();
}
public void processStartElement(XmlPullParser xpp) {
int holderForStartAndLength[] = new int[2];
String name = xpp.getName();
char ch[] = xpp.getTextCharacters(holderForStartAndLength);
int start = holderForStartAndLength[0];
int length = holderForStartAndLength[1];
if(name.equals(rootTag)) {
int pos = start + length;
header = new String(ch, 0, pos);
} else {
if(requiredTag==null || name.equals(requiredTag)) {
hasRequiredTag = true;
}
nodeBuf.append(xpp.getText());
}
}
public void flushBuffer() throws IOException {
if(hasRequiredTag) {
fileBuf.append(nodeBuf);
if(((++flushCount)%flushThreshold)==0) {
saveFile();
}
}
nodeBuf = new StringBuilder();
hasRequiredTag = false;
}
public void saveFile() throws IOException {
if(fileBuf.length()>0) {
String splitFile = header + fileBuf.toString() + rootTagEnd;
FileUtils.writeStringToFile(new File((fileNo++) + "_" + fileName), splitFile);
fileBuf = new StringBuilder();
}
}
public void processEndElement (XmlPullParser xpp) throws IOException {
String name = xpp.getName();
if(name.equals(rootTag)) {
flushBuffer();
} else {
nodeBuf.append(xpp.getText());
if(name.equals(splitTag)) {
flushBuffer();
}
}
}
public void processText (XmlPullParser xpp) throws XmlPullParserException {
int holderForStartAndLength[] = new int[2];
char ch[] = xpp.getTextCharacters(holderForStartAndLength);
int start = holderForStartAndLength[0];
int length = holderForStartAndLength[1];
String content = new String(ch, start, length);
nodeBuf.append(content);
}
public static void main (String args[]) throws XmlPullParserException, IOException {
//XppSample app = new XppSample("input.xml", "company", "staff", "nickname", 3);
XppSample app = new XppSample("input.xml", "company", "staff", null, 3);
app.processDocument();
}
}

Normally I would suggest using StAX, but it is unclear to me how 'stateful' your real XML is. If simple, then use SAX for ultimate performance, if not-so-simple, use StAX. So you need to
read bytes from disk
convert them to characters
parse the XML
determine whether to keep XML or throw away (skip out subtree)
write XML
convert characters to bytes
write to disk
Now, it might seem like steps 3-5 are the most resource-intensive, but I would rate them as
Most: 1 + 7
Middle: 2 + 6
Least: 3 + 4 + 5
As operations 1 and 7 are more or less separate from the rest, you should do them asynchronously; at the very least, creating multiple small files is best done in n other threads, if you are familiar with multi-threading. For increased performance, you might also look into the new IO stuff in Java.
Now for steps 2 + 3 and 5 + 6 you can go a long way with FasterXML, it really does a lot of the stuff you are looking for, like triggering JVM hot-spot attention in the right places; might even support async reading/writing looking through the code quickly.
So then we are left with step 5, and depending on your logic, you should either
a. make an object binding, then decide what to do
b. write the XML anyway, hoping for the best, and then throw it away if no 'staff' element is present.
Whatever you do, object reuse is sensible. Note that both alternatives (obviously) require the same amount of parsing (skip out of the subtree ASAP), and for alternative b, a little extra XML is actually not so bad performance-wise; ideally make sure your char buffers are > one unit.
Alternative b is the easiest to implement: simply copy the 'xml event' from your reader to your writer. Example for StAX:
private static void copyEvent(int event, XMLStreamReader reader, XMLStreamWriter writer) throws XMLStreamException {
if (event == XMLStreamConstants.START_ELEMENT) {
String localName = reader.getLocalName();
String namespace = reader.getNamespaceURI();
// TODO check this stuff again before setting in production
if (namespace != null) {
if (writer.getPrefix(namespace) != null) {
writer.writeStartElement(namespace, localName);
} else {
writer.writeStartElement(reader.getPrefix(), localName, namespace);
}
} else {
writer.writeStartElement(localName);
}
// first: namespace definition attributes
if(reader.getNamespaceCount() > 0) {
int namespaces = reader.getNamespaceCount();
for(int i = 0; i < namespaces; i++) {
String namespaceURI = reader.getNamespaceURI(i);
if(writer.getPrefix(namespaceURI) == null) {
String namespacePrefix = reader.getNamespacePrefix(i);
if(namespacePrefix == null) {
writer.writeDefaultNamespace(namespaceURI);
} else {
writer.writeNamespace(namespacePrefix, namespaceURI);
}
}
}
}
int attributes = reader.getAttributeCount();
// then write the rest of the attributes
for (int i = 0; i < attributes; i++) {
String attributeNamespace = reader.getAttributeNamespace(i);
if (attributeNamespace != null && attributeNamespace.length() != 0) {
writer.writeAttribute(attributeNamespace, reader.getAttributeLocalName(i), reader.getAttributeValue(i));
} else {
writer.writeAttribute(reader.getAttributeLocalName(i), reader.getAttributeValue(i));
}
}
} else if (event == XMLStreamConstants.END_ELEMENT) {
writer.writeEndElement();
} else if (event == XMLStreamConstants.CDATA) {
String array = reader.getText();
writer.writeCData(array);
} else if (event == XMLStreamConstants.COMMENT) {
String array = reader.getText();
writer.writeComment(array);
} else if (event == XMLStreamConstants.CHARACTERS) {
String array = reader.getText();
if (array.length() > 0 && !reader.isWhiteSpace()) {
writer.writeCharacters(array);
}
} else if (event == XMLStreamConstants.START_DOCUMENT) {
writer.writeStartDocument();
} else if (event == XMLStreamConstants.END_DOCUMENT) {
writer.writeEndDocument();
}
}
And for a subtree,
private static void copySubTree(XMLStreamReader reader, XMLStreamWriter writer) throws XMLStreamException {
reader.require(XMLStreamConstants.START_ELEMENT, null, null);
copyEvent(XMLStreamConstants.START_ELEMENT, reader, writer);
int level = 1;
do {
int event = reader.next();
if(event == XMLStreamConstants.START_ELEMENT) {
level++;
} else if(event == XMLStreamConstants.END_ELEMENT) {
level--;
}
copyEvent(event, reader, writer);
} while(level > 0);
}
From this you can probably deduce how to skip out to a certain level. In general, for stateful StAX parsing, use the pattern
private static void parseSubTree(XMLStreamReader reader) throws XMLStreamException {
int level = 1;
do {
int event = reader.next();
if(event == XMLStreamConstants.START_ELEMENT) {
level++;
// do stateful stuff here
// for child logic:
if(reader.getLocalName().equals("Whatever")) {
parseSubTreeForWhatever(reader);
level --; // read from level 1 to 0 in submethod.
}
// alternatively, faster
if(level == 4) {
parseSubTreeForWhateverAtRelativeLevel4(reader);
level --; // read from level 1 to 0 in submethod.
}
} else if(event == XMLStreamConstants.END_ELEMENT) {
level--;
// do stateful stuff here, too
}
} while(level > 0);
}
where, at the start of the document, you read until the first start element and break (add the writer and the copy calls for your use case, of course, as above).
Note that if you do an object binding, these methods should be placed in that object, and likewise for the serialization methods.
I am pretty sure you will get 10s of MB/s on a modern system, and that should be sufficient. An issue to investigate further is how to use multiple cores for the actual input: if you know the encoding subset for a fact, like non-crazy UTF-8 or ISO-8859, then random access might be possible, so that chunks can be sent to different cores.
Have fun, and tell us how it went ;)
Edit: Almost forgot, if you for some reason are the one who is creating the file in the first place, or you will be reading the files after splitting, you will see HUGE performance gains using XML binarization; there exist XML Schema generators which again can feed into code generators. (And some XSLT transform libs use code generation too.) And run with the -server option for the JVM.

How to make it faster:
Use asynchronous writes, possibly in parallel; this might boost your performance if you have disks in some RAID-X setup
Write to an SSD instead of HDD
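As a rough sketch of the asynchronous-write idea, the snippet below hands each already-built fragment to a small thread pool so the thread doing the parsing does not block on disk I/O. The names AsyncFragmentWriter and writeAll, the split_<n>.xml naming, and the assumption that fragments arrive as strings are all illustrative and not taken from the question. Whether this helps at all depends on the storage hardware.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class AsyncFragmentWriter {
    // Writes fragment i to outputDir/split_<i>.xml using a fixed-size thread pool.
    static void writeAll(List<String> fragments, Path outputDir, int threads) throws Exception {
        Files.createDirectories(outputDir);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<?>> pending = new ArrayList<>();
        try {
            for (int i = 0; i < fragments.size(); i++) {
                Path target = outputDir.resolve("split_" + i + ".xml");
                String content = fragments.get(i);
                // The lambda returns a value, so it is treated as a Callable
                // and may throw the checked IOException from Files.write.
                pending.add(pool.submit(() -> {
                    Files.write(target, content.getBytes(StandardCharsets.UTF_8));
                    return null;
                }));
            }
            // Wait for every write; get() rethrows any failure as ExecutionException.
            for (Future<?> write : pending) {
                write.get();
            }
        } finally {
            pool.shutdown();
        }
    }
}
```

In a real splitter the tasks would be submitted while parsing is still in progress; collecting all fragments first, as done here, only keeps the example short.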

My suggestion is that SAX, StAX, and DOM are not the ideal XML parsers for your problem; the perfect solution is called vtd-xml. There is an article on this subject explaining why DOM, SAX and StAX all do something very wrong. The code below is the shortest you have to write, yet it performs 10x faster than DOM or SAX. http://www.javaworld.com/javaworld/jw-07-2006/jw-0724-vtdxml.html
Here is a more recent paper entitled Processing XML with Java – A Performance Benchmark: http://recipp.ipp.pt/bitstream/10400.22/1847/1/ART_BrunoOliveira_2013.pdf
import com.ximpleware.*;
import java.io.*;
public class gandalf {
public static void main(String a[]) throws VTDException, Exception{
VTDGen vg = new VTDGen();
if (vg.parseFile("c:\\xml\\gandalf.txt", false)){
VTDNav vn=vg.getNav();
AutoPilot ap = new AutoPilot(vn);
ap.selectXPath("/company/staff[nickname]");
int i=-1;
int count=0;
while((i=ap.evalXPath())!=-1){
vn.dumpFragment("c:\\xml\\staff"+count+".xml");
count++;
}
}
}
}

Here is a DOM-based solution. I have tested this with the XML you provided. It needs to be checked against the actual XML files that you have.
Since this is based on a DOM parser, please remember that it will require a lot of memory, depending on your XML file size. But it's much faster, as it's DOM based.
Algorithm :
Parse the document
Extract the root element name
Get the list of nodes based on the split criteria (using XPath)
For each node, create an empty document with root element name as extracted in step #2
Insert the node in this new document
Check if nodes are to be filtered or not.
If nodes are to be filtered, then check if a specified element is present in the newly created doc.
If node is not present, don't write to the file.
If the nodes are NOT to be filtered at all, don't check for condition in #7, and write the document to the file.
This can be run from command prompt as follows
java XMLSplitter xmlFileLocation splitElement filter filterElement
For the xml you mentioned it will be
java XMLSplitter input.xml staff true nickname
In case you don't want to filter
java XMLSplitter input.xml staff
Here is the complete java code:
package com.xml.xpath;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.DOMException;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
public class XMLSplitter {
DocumentBuilder builder = null;
XPath xpath = null;
Transformer transformer = null;
String filterElement;
String splitElement;
String xmlFileLocation;
boolean filter = true;
public static void main(String[] arg) throws Exception{
XMLSplitter xMLSplitter = null;
if(arg.length < 4){
if(arg.length < 2){
System.out.println("Insufficient arguments !!!");
System.out.println("Usage: XMLSplitter xmlFileLocation splitElement filter filterElement ");
return;
}else{
System.out.println("Filter is off...");
xMLSplitter = new XMLSplitter();
xMLSplitter.init(arg[0],arg[1],false,null);
}
}else{
xMLSplitter = new XMLSplitter();
xMLSplitter.init(arg[0],arg[1],Boolean.parseBoolean(arg[2]),arg[3]);
}
xMLSplitter.start();
}
public void init(String xmlFileLocation, String splitElement, boolean filter, String filterElement )
throws ParserConfigurationException, TransformerConfigurationException{
//Initialize the Document builder
System.out.println("Initializing..");
DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
domFactory.setNamespaceAware(true);
builder = domFactory.newDocumentBuilder();
//Initialize the transformer
TransformerFactory transformerFactory = TransformerFactory.newInstance();
transformer = transformerFactory.newTransformer();
transformer.setOutputProperty(OutputKeys.METHOD, "xml");
transformer.setOutputProperty(OutputKeys.ENCODING,"UTF-8");
transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "4");
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
//Initialize the xpath
XPathFactory factory = XPathFactory.newInstance();
xpath = factory.newXPath();
this.filterElement = filterElement;
this.splitElement = splitElement;
this.xmlFileLocation = xmlFileLocation;
this.filter = filter;
}
public void start() throws Exception{
//Parse the file
System.out.println("Parsing file.");
Document doc = builder.parse(xmlFileLocation);
//Get the root node name
System.out.println("Getting root element.");
XPathExpression rootElementexpr = xpath.compile("/");
Object rootExprResult = rootElementexpr.evaluate(doc, XPathConstants.NODESET);
NodeList rootNode = (NodeList) rootExprResult;
String rootNodeName = rootNode.item(0).getFirstChild().getNodeName();
//Get the list of split elements
XPathExpression expr = xpath.compile("//"+splitElement);
Object result = expr.evaluate(doc, XPathConstants.NODESET);
NodeList nodes = (NodeList) result;
System.out.println("Total number of split nodes "+nodes.getLength());
for (int i = 0; i < nodes.getLength(); i++) {
//Wrap each node inside root of the parent xml doc
Node singleNode = wrappInRootElement(rootNodeName,nodes.item(i));
//Get the XML string of the fragment
String xmlFragment = serializeDocument(singleNode);
//System.out.println(xmlFragment);
//Write the xml fragment in file.
storeInFile(xmlFragment,i);
}
}
private Node wrappInRootElement(String rootNodeName, Node fragmentDoc)
throws XPathExpressionException, ParserConfigurationException, DOMException,
SAXException, IOException, TransformerException{
//Create empty doc with just root node
DOMImplementation domImplementation = builder.getDOMImplementation();
Document doc = domImplementation.createDocument(null,null,null);
Element theDoc = doc.createElement(rootNodeName);
doc.appendChild(theDoc);
//Insert a deep copy of the fragment inside the root node. importNode copies
//the node into the new document directly, which avoids a serialize/re-parse
//round trip that fails when the filter makes serializeDocument return null.
theDoc.appendChild(doc.importNode(fragmentDoc,true));
return doc;
}
private String serializeDocument(Node doc) throws TransformerException, XPathExpressionException{
if(!serializeThisNode(doc)){
return null;
}
DOMSource domSource = new DOMSource(doc);
StringWriter stringWriter = new StringWriter();
StreamResult streamResult = new StreamResult(stringWriter);
transformer.transform(domSource, streamResult);
String xml = stringWriter.toString();
return xml;
}
//Check whether node is to be stored in file or rejected based on input
private boolean serializeThisNode(Node doc) throws XPathExpressionException{
if(!filter){
return true;
}
XPathExpression filterElementexpr = xpath.compile("//"+filterElement);
Object result = filterElementexpr.evaluate(doc, XPathConstants.NODESET);
NodeList nodes = (NodeList) result;
return nodes.getLength() > 0;
}
private void storeInFile(String content, int fileIndex) throws IOException{
if(content == null || content.length() == 0){
return;
}
String fileName = splitElement+fileIndex+".xml";
File file = new File(fileName);
if(file.exists()){
System.out.println(" The file "+fileName+" already exists !! cannot create the file with the same name ");
return;
}
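//Write as UTF-8 to match the encoding declared by the transformer;
//try-with-resources closes the writer even if write() fails.
try (java.io.Writer fileWriter = java.nio.file.Files.newBufferedWriter(
file.toPath(), java.nio.charset.StandardCharsets.UTF_8)) {
fileWriter.write(content);
}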
System.out.println("Generated file "+fileName);
}
}
Let me know if this works for you, or if you need any other help with this code.

Your best bet would be to use StAX instead of DOM for generating the original document. StAX gives you a lot of fine control over these things and lets you stream output progressively to an output stream instead of holding it all in memory.
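
As a minimal sketch of the StAX approach (the class name StaxAttributeOrder and the method writePerson are made up for illustration), an XMLStreamWriter emits attributes in the order in which writeAttribute is called, so the calling code decides the order rather than a DOM attribute map:

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical demo class; not part of the code in the question.
public class StaxAttributeOrder {

    // Writes a single <person> element and returns the resulting XML text.
    static String writePerson() throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter writer =
                XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        writer.writeStartDocument();
        writer.writeEmptyElement("person");
        // Attributes appear in the output in the order of these calls.
        writer.writeAttribute("first_name", "john");
        writer.writeAttribute("last_name", "lederrey");
        writer.writeEndDocument();
        writer.close();
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(writePerson());
    }
}
```

This only covers the generating side. Modifying an existing document this way would mean reading it with a streaming reader and replaying each element through the writer.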

You can't use the DOM, but you can use SAX, or query the children using XPath. See the answer to "Order of XML attributes after DOM processing".
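
A small sketch of the SAX route, assuming the parser reports attributes in document order. The SAX API itself does not promise this, although the parser bundled with the JDK behaves this way in practice. The class and method names are illustrative only:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of the code in the question.
public class SaxAttributeOrder {

    // Collects attribute names in the order the parser reports them.
    static List<String> attributeNames(String xml) throws Exception {
        final List<String> names = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                // Walk the attributes by index rather than by name lookup.
                for (int i = 0; i < atts.getLength(); i++) {
                    names.add(atts.getQName(i));
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return names;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(attributeNames(
                "<person first_name=\"john\" last_name=\"lederrey\"/>"));
    }
}
```

A handler like this can pass each element straight to an output writer, but it sees only one element at a time, which is why it does not fit transformations that depend on an XPath expression over the whole document.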
