Parsing an XML file in Java using DOM


I want to parse an XML file to read the values of certain elements in the file.
<row>
<element>
<status>OK</status>
<duration>
<value>499</value>
<text>8 mins</text>
</duration>
<distance>
<value>3208</value>
<text>3.2 km</text>
</distance>
</element>
<element>
<status>OK</status>
<duration>
<value>501</value>
<text>8 mins</text>
</duration>
<distance>
<value>2869</value>
<text>2.9 km</text>
</distance>
</element>
<element>
<status>OK</status>
<duration>
<value>788</value>
<text>13 mins</text>
</duration>
<distance>
<value>6718</value>
<text>6.7 km</text>
</distance>
</element>
</row>
I want to be able to read the values of all "value" tags under "distance" tags. I have written the basic code to read XML data. I just need an idea of how to read the values of elements at the third level.

Simple DOM parsing is no longer the preferred way, IMHO. Java comes with some mature frameworks that make parsing much easier. The following reflects my own preference; others may think differently.
If the structure of your XML is fixed, you could build some JAXB-annotated POJOs and read your data with them. JAXB delivers complete object hierarchies filled with your XML values, and it can also create XML data.
If you don't know the structure of your XML data, or you stream XML data, then StAX parsing may be the way to go.
In any case, these frameworks take a lot of problems off your hands, such as file encoding, syntax checking, and type safety (JAXB).
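To give an idea of what the StAX route could look like for the XML in the question, here is a minimal sketch using the cursor API (XMLStreamReader). The class and method names are made up for illustration, and the XML is passed in as a string only to keep the example self-contained.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxDistanceReader {
    // Collects the text of every <value> element found inside a <distance> element.
    static List<String> distanceValues(String xml) throws XMLStreamException {
        List<String> values = new ArrayList<>();
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        boolean inDistance = false;
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                String name = reader.getLocalName();
                if (name.equals("distance")) {
                    inDistance = true;
                } else if (inDistance && name.equals("value")) {
                    // getElementText() reads the text and stops at </value>
                    values.add(reader.getElementText());
                }
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && reader.getLocalName().equals("distance")) {
                inDistance = false;
            }
        }
        reader.close();
        return values;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<row><element><duration><value>499</value></duration>"
                + "<distance><value>3208</value></distance></element></row>";
        System.out.println(distanceValues(xml));
    }
}
```

Because the reader only keeps track of whether it is inside a distance element, the value under duration is skipped without building a tree in memory.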
If you want to use DOM parsing, then you can use XPath to shorten your queries dramatically:
XPath xPath = XPathFactory.newInstance().newXPath();
InputSource source = new InputSource(ParseUsingXPath.class.getResourceAsStream("data.xml"));
NodeList list = (NodeList) xPath.evaluate("/row/element/distance/value", source, XPathConstants.NODESET);
for (int i=0;i<list.getLength();i++) {
System.out.println(list.item(i).getTextContent());
}
and it outputs:
3208
2869
6718
Here I read your XML directly as an InputStream from a classpath resource. You can also use XPath with a DOM Document object to run searches across the whole document.
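As a rough sketch of the DOM Document variant: parse once, then run as many XPath expressions as you like against the same Document. The class name and helper methods are invented for this example, and the XML is inlined as a string instead of being loaded from a file.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathOnDocument {
    // Parses an XML string into a DOM Document.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Evaluates an XPath expression against an already parsed Document.
    static NodeList select(Document doc, String expression) throws Exception {
        return (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<row>"
                + "<element><distance><value>3208</value></distance></element>"
                + "<element><distance><value>2869</value></distance></element>"
                + "</row>");
        // "//" searches the whole document, regardless of nesting depth
        NodeList values = select(doc, "//distance/value");
        for (int i = 0; i < values.getLength(); i++) {
            System.out.println(values.item(i).getTextContent());
        }
        // The same Document can be queried again without re-reading the input
        System.out.println(select(doc, "/row/element").getLength() + " elements");
    }
}
```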

Have you already used an XSD definition for your XML? With that and JAXB you can unmarshal the XML into a Java object easily.

The code below might help you access the value elements.
import java.io.FileInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
public class DOMParsarDemo {
protected DocumentBuilder docBuilder;
protected Element root;
public DOMParsarDemo() throws Exception {
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
docBuilder = dbf.newDocumentBuilder();
}
public void parse(String file) throws Exception {
Document doc = docBuilder.parse(new FileInputStream(file));
root = doc.getDocumentElement();
System.out.println("root element is :" + root.getNodeName());
}
public void printAllElements() throws Exception {
printElement(root);
}
public void printElement(Node node) {
if (node.getNodeType() != Node.TEXT_NODE) {
Node child = node.getFirstChild();
while (child != null) {
if (node.getNodeName().equals("distance")) {
if (child.getNodeName().equals("value")) {
System.out.println(child.getFirstChild().getNodeValue());
}
}
printElement(child);
child = child.getNextSibling();
}
}
}
public static void main(String args[]) throws Exception {
DOMParsarDemo demo = new DOMParsarDemo();
demo.parse("resources/abc.xml");
demo.printAllElements();
}
}
output:
root element is :row
3208
2869
6718
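If walking the whole tree recursively feels heavy, a shorter variation is to let the DOM find the distance elements first and then look only at their direct children. This is just a sketch of that idea; the class and method names are made up, and the XML is inlined to keep it self-contained.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DistanceValues {
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Returns the text of each <value> that is a direct child of a <distance>.
    static List<String> distanceValues(Document doc) {
        List<String> result = new ArrayList<>();
        NodeList distances = doc.getElementsByTagName("distance");
        for (int i = 0; i < distances.getLength(); i++) {
            for (Node child = distances.item(i).getFirstChild();
                    child != null; child = child.getNextSibling()) {
                // skip whitespace text nodes and any other element names
                if (child.getNodeType() == Node.ELEMENT_NODE
                        && child.getNodeName().equals("value")) {
                    result.add(child.getTextContent().trim());
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<row><element>"
                + "<duration><value>499</value></duration>"
                + "<distance><value>3208</value></distance>"
                + "</element></row>");
        System.out.println(distanceValues(doc));
    }
}
```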

Related

Which methods can be used to return valid and invalid XML data from a file in Java?

I have the following data that is supposed to be XML:
<?xml version="1.0" encoding="UTF-8"?>
<Product>
<id>1</id>
<description>A new product</description>
<price>123.45</price>
</Product>
<Product>
<id>1</id>
<description>A new product</description>
<price>123.45</price>
</Product>
<ProductTTTTT>
<id>1</id>
<description>A new product</description>
<price>123.45</price>
</Product>
<Product>
<id>1</id>
<description>A new product</description>
<price>123.45</price>
</ProductAAAAAA>
So, basically I have multiple root elements (product)...
The point is that I'm trying to transform this data into 2 XML documents, 1 for valid nodes and other for invalid nodes.
Valid node:
<Product>
...
</Product>
Invalid nodes: <ProductTTTTT>...</Product> and <Product>...</ProductAAAAAA>
So I am wondering how I can achieve this using Java (not web).
If I am not wrong, validating it with an XSD will invalidate the whole file, so that is not an option.
Using the default JAXB parser (unmarshaller) leads to the same problem as the item above, since internally it creates an XSD of my entity.
Using XPath will (from what I know) just return the whole file; I did not find a way to express something like GET !VALID (this is just to illustrate the idea).
Using XQuery (maybe?). By the way, how do I use XQuery with JAXB?
XSL(T) will lead to the same thing as XPath, since it uses XPath to select the content.
So... which method can I use to achieve the objective? (And if possible, provide links or code please)

Firstly, you're confusing valid and well-formed. You say you want to find invalid elements, but your examples aren't just invalid, they are ill-formed. That means that no XML parser is going to do anything with them other than throwing an error message at you. You can't use JAXB or XPath, or XQuery, or XSLT, or anything to process something that isn't XML.
You say "unfortunately I do not have access to the system that sends this xml format". I'm not sure why you call it an XML format: it isn't. I also don't understand why you (and many others on StackOverflow) are prepared to spend your time digging in garbage like this rather than telling the sender to get their act together. If you were served a salad with maggots in it, would you try to pick them out, or would you send it back for replacement? You should adopt a zero-tolerance approach to bad data; that's the only way senders will learn to improve the quality.
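To make the well-formed versus valid distinction concrete, here is a small sketch that only checks well-formedness: it hands a snippet to the standard DOM parser and reports whether parsing succeeded. The class and method names are invented for the example. Note that the JDK's default error handler may also print a "[Fatal Error]" line to stderr for the bad snippet.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class WellFormedCheck {
    // Returns true if the text parses as XML at all; no schema is involved.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // mismatched tags, several root elements, undeclared entities, ...
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<Product><id>1</id></Product>"));
        System.out.println(isWellFormed("<ProductTTTTT><id>1</id></Product>"));
    }
}
```

The second snippet is rejected by the parser itself, before any validation against a schema could even start.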

If the file contains lines with start and end tags whose names begin with "Product", you could:
use a file scanner to split this document into individual pieces whenever a line starts with <Product or </Product
attempt to parse the extracted text as XML using an XML API.
If it succeeds, add that object to a list of "good" well-formed XML documents
then perform any additional schema validation or validity checks
If it throws a parse error, catch it, and add that snippet of text to the list of "bad" items that need to be cleaned up or otherwise handled
An example to get you started:
package com.stackoverflow.questions.q52012383; // a package segment cannot start with a digit
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
public class FileSplitter {
public static void parseFile(File file, String elementName)
throws ParserConfigurationException, IOException {
List<Document> good = new ArrayList<>();
List<String> bad = new ArrayList<>();
String startTag = "<" + elementName;
String endTag = "</" + elementName;
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder;
StringBuffer buffer = new StringBuffer();
String line;
boolean append = false;
try (Scanner scanner = new Scanner(file)) {
while (scanner.hasNextLine()) {
line = scanner.nextLine();
if (line.startsWith(startTag)) {
append = true; //start accumulating content
} else if (line.startsWith(endTag)) {
append = false;
buffer.append(line);
//instead of the line above, you could hard-code the ending tag to compensate for bad data:
// buffer.append(endTag + ">");
try { // to parse as XML
builder = factory.newDocumentBuilder();
Document document = builder.parse(new InputSource(new StringReader(buffer.toString())));
good.add(document); // parsed successfully, add it to the good list
} catch (SAXException ex) {
bad.add(buffer.toString()); // something is wrong, not well-formed XML
} finally {
buffer.setLength(0); // reset the buffer for the next snippet, whether or not it parsed
}
}
if (append) { // accumulate content
buffer.append(line);
}
}
System.out.println("Good items: " + good.size() + " Bad items: " + bad.size());
//do stuff with the good/bad results...
}
}
public static void main(String args[])
throws ParserConfigurationException, IOException {
File file = new File("/tmp/test.xml");
parseFile(file, "Product");
}
}

Parsing XML file containing HTML entities in Java without changing the XML

I have to parse a bunch of XML files in Java that sometimes -- and invalidly -- contain HTML entities such as &mdash;, &gt; and so forth. I understand the correct way of dealing with this is to add suitable entity declarations to the XML file before parsing. However, I can't do that as I have no control over those XML files.
Is there some kind of callback I can override that is invoked whenever the Java XML parser encounters such an entity? I haven't been able to find one in the API.
I'd like to use:
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder parser = dbf.newDocumentBuilder();
Document doc = parser.parse( stream );
I found that I can override resolveEntity in org.xml.sax.helpers.DefaultHandler, but how do I use this with the higher-level API?
Here's a full example:
public class Main {
public static void main( String [] args ) throws Exception {
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder parser = dbf.newDocumentBuilder();
Document doc = parser.parse( new FileInputStream( "test.xml" ));
}
}
with test.xml:
<?xml version="1.0" encoding="UTF-8"?>
<foo>
<bar>Some text &nbsp; &mdash; invalid!</bar>
</foo>
Produces:
[Fatal Error] :3:20: The entity "nbsp" was referenced, but not declared.
Exception in thread "main" org.xml.sax.SAXParseException; lineNumber: 3; columnNumber: 20; The entity "nbsp" was referenced, but not declared.
Update: I have been poking around in the JDK source code with a debugger, and boy, what an amount of spaghetti. I have no idea what the design is there, or whether there is one. Just how many layers of an onion can one layer on top of each other?
The key class seems to be com.sun.org.apache.xerces.internal.impl.XMLEntityManager, but I cannot find any code that either lets me add stuff into it before it gets used, or that attempts to resolve entities without going through that class.

I would use a library like Jsoup for this purpose. I tested the following below and it works. I don't know if this helps. It can be located here: http://jsoup.org/download
public static void main(String args[]){
String html = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><foo>" +
"<bar>Some text &nbsp; &mdash; invalid!</bar></foo>";
Document doc = Jsoup.parse(html, "", Parser.xmlParser());
for (Element e : doc.select("bar")) {
System.out.println(e);
}
}
Result:
<bar>
Some text — invalid!
</bar>
Loading from a file can be found here:
http://jsoup.org/cookbook/input/load-document-from-file

Issue - 1: I have to parse a bunch of XML files in Java that sometimes -- and
invalidly -- contain HTML entities such as &mdash;
XML has only five predefined entities, and &mdash; is not among them. It works only when used in plain HTML or in legacy JSP. So SAX will not help. It can be done using StAX, which has a high-level iterator-based API. (Collected from this link)
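For reference, the five predefined entities are lt, gt, amp, apos and quot. The sketch below (class and method names are made up) shows that a plain DOM parser resolves those, and numeric character references, without any declarations; only named HTML entities such as nbsp or mdash trigger the error from the question.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class PredefinedEntities {
    // Parses the XML and returns the text content of the root element.
    static String textOf(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // The five predefined entities need no DTD
        System.out.println(textOf("<a>&lt; &gt; &amp; &apos; &quot;</a>"));
        // A numeric character reference (here U+2014) is also always allowed
        System.out.println(textOf("<a>&#8212;</a>").equals("\u2014"));
    }
}
```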
Issue - 2: I found that I can override resolveEntity in
org.xml.sax.helpers.DefaultHandler, but how do I use this with the
higher-level API?
Streaming API for XML, called StAX, is an API for reading and writing XML documents.
StAX is a pull-parsing model: the application takes control of parsing the XML document by pulling (taking) events from the parser.
The core StAX API falls into two categories:
Cursor-based API: a low-level API that allows the application to process XML as a stream of tokens, also known as events.
Iterator-based API: the higher-level iterator-based API allows the application to process XML as a series of event objects, each of which communicates a piece of the XML structure to the application.
The StAX API supports the notion of not replacing character entity references, by way of the IS_REPLACING_ENTITY_REFERENCES property:
Requires the parser to replace internal entity references with their
replacement text and report them as characters
This can be set on an XMLInputFactory, which is then used to construct an XMLEventReader or XMLStreamReader.
However, the API is careful to say that this property is only intended to force the implementation to perform the replacement, rather than forcing it not to replace them.
You may try it. Hope it will solve your issue. For your case,
Main.java
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.EntityReference;
import javax.xml.stream.events.XMLEvent;
public class Main {
public static void main(String[] args) {
XMLInputFactory inputFactory = XMLInputFactory.newInstance();
inputFactory.setProperty(
XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, false);
XMLEventReader reader;
try {
reader = inputFactory
.createXMLEventReader(new FileInputStream("F://test.xml"));
while (reader.hasNext()) {
XMLEvent event = reader.nextEvent();
if (event.isEntityReference()) {
EntityReference ref = (EntityReference) event;
System.out.println("Entity Reference: " + ref.getName());
}
}
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (XMLStreamException e) {
e.printStackTrace();
}
}
}
test.xml:
<?xml version="1.0" encoding="UTF-8"?>
<foo>
<bar>Some text &nbsp; &mdash; invalid!</bar>
</foo>
Output:
Entity Reference: nbsp
Entity Reference: mdash
Credit goes to @skaffman.
Related Link:
http://www.journaldev.com/1191/how-to-read-xml-file-in-java-using-java-stax-api
http://www.journaldev.com/1226/java-stax-cursor-based-api-read-xml-example
http://www.vogella.com/tutorials/JavaXML/article.html
Is there a Java XML API that can parse a document without resolving character entities?
UPDATE:
Issue - 3: Is there a way to use StaX to "filter" the entities (replacing them
with something else, for example) and still produce a Document at the
end of the process?
To create a new document using the StAX API, you need to create an XMLStreamWriter, which provides methods to produce XML opening and closing tags, attributes and character content.
The five main XMLStreamWriter methods for building a document are:
xmlsw.writeStartDocument() - initialises an empty document to which elements can be added
xmlsw.writeStartElement(String s) - creates a new element named s
xmlsw.writeAttribute(String name, String value) - adds the attribute name with the corresponding value to the last element produced by a call to writeStartElement. It is possible to add attributes as long as no call to writeStartElement, writeCharacters or writeEndElement has been made.
xmlsw.writeEndElement() - closes the last started element
xmlsw.writeCharacters(String s) - creates a new text node with content s as content of the last started element
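Before the longer sample, here is a minimal sketch that calls just these five methods in order. The class name and the element and attribute names are placeholders, and the output is written to a StringWriter rather than System.out so it can be inspected.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

class StreamWriterDemo {
    // Builds a tiny document: a root element with one attribute and one child.
    static String write() throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter xmlsw = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        xmlsw.writeStartDocument();
        xmlsw.writeStartElement("foo");
        // attributes must be written before any child or text of <foo>
        xmlsw.writeAttribute("id", "1");
        xmlsw.writeStartElement("bar");
        xmlsw.writeCharacters("Some text");
        xmlsw.writeEndElement(); // closes <bar>
        xmlsw.writeEndElement(); // closes <foo>
        xmlsw.writeEndDocument();
        xmlsw.close();
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(write());
    }
}
```

The writer emits elements and attributes in exactly the order of the calls, which is what the filtering approach relies on.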
A sample example is attached with it:
StAXExpand.java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;
import java.util.Arrays;
public class StAXExpand {
static XMLStreamWriter xmlsw = null;
public static void main(String[] argv) {
try {
xmlsw = XMLOutputFactory.newInstance()
.createXMLStreamWriter(System.out);
CompactTokenizer tok = new CompactTokenizer(
new FileReader(argv[0]));
String rootName = "dummyRoot";
// ignore everything preceding the word before the first "["
while(!tok.nextToken().equals("[")){
rootName=tok.getToken();
}
// start creating new document
xmlsw.writeStartDocument();
ignorableSpacing(0);
xmlsw.writeStartElement(rootName);
expand(tok,3);
ignorableSpacing(0);
xmlsw.writeEndDocument();
xmlsw.flush();
xmlsw.close();
} catch (XMLStreamException e){
System.out.println(e.getMessage());
} catch (IOException ex) {
System.out.println("IOException"+ex);
ex.printStackTrace();
}
}
public static void expand(CompactTokenizer tok, int indent)
throws IOException,XMLStreamException {
tok.skip("[");
while(tok.getToken().equals("#")) {// add attributes
String attName = tok.nextToken();
tok.nextToken();
xmlsw.writeAttribute(attName,tok.skip("["));
tok.nextToken();
tok.skip("]");
}
boolean lastWasElement=true; // for controlling the output of newlines
while(!tok.getToken().equals("]")){ // process content
String s = tok.getToken().trim();
tok.nextToken();
if(tok.getToken().equals("[")){
if(lastWasElement)ignorableSpacing(indent);
xmlsw.writeStartElement(s);
expand(tok,indent+3);
lastWasElement=true;
} else {
xmlsw.writeCharacters(s);
lastWasElement=false;
}
}
tok.skip("]");
if(lastWasElement)ignorableSpacing(indent-3);
xmlsw.writeEndElement();
}
private static char[] blanks = "\n".toCharArray();
private static void ignorableSpacing(int nb)
throws XMLStreamException {
if(nb>blanks.length){// extend the length of space array
blanks = new char[nb+1];
blanks[0]='\n';
Arrays.fill(blanks,1,blanks.length,' ');
}
xmlsw.writeCharacters(blanks, 0, nb+1);
}
}
CompactTokenizer.java
import java.io.Reader;
import java.io.IOException;
import java.io.StreamTokenizer;
public class CompactTokenizer {
private StreamTokenizer st;
CompactTokenizer(Reader r){
st = new StreamTokenizer(r);
st.resetSyntax(); // remove parsing of numbers...
st.wordChars('\u0000','\u00FF'); // everything is part of a word
// except the following...
st.ordinaryChar('\n');
st.ordinaryChar('[');
st.ordinaryChar(']');
st.ordinaryChar('#');
}
public String nextToken() throws IOException{
st.nextToken();
while(st.ttype=='\n'||
(st.ttype==StreamTokenizer.TT_WORD &&
st.sval.trim().length()==0))
st.nextToken();
return getToken();
}
public String getToken(){
return (st.ttype == StreamTokenizer.TT_WORD) ? st.sval : (""+(char)st.ttype);
}
public String skip(String sym) throws IOException {
if(getToken().equals(sym))
return nextToken();
else
throw new IllegalArgumentException("skip: "+sym+" expected but "+
getToken() +" found"); // report the token actually found, not the expected one
}
}
For more, you can follow the tutorial
https://docs.oracle.com/javase/tutorial/jaxp/stax/example.html
http://www.ibm.com/developerworks/library/x-tipstx2/index.html
http://www.iro.umontreal.ca/~lapalme/ForestInsteadOfTheTrees/HTML/ch09s03.html
http://staf.sourceforge.net/current/STAXDoc.pdf

Another approach, since you're not using a rigid OXM approach anyway.
You might want to try using a less rigid parser such as JSoup?
This will stop immediate problems with invalid XML schemas etc, but it will just devolve the problem into your code.

Just to throw in a different approach to a solution:
You might wrap your input stream in a stream implementation that replaces the entities with something legal.
While this is a hack for sure, it should be a quick and easy solution (or, better said, a workaround).
It is not as elegant and clean as a solution inside the XML framework, though.
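A rough sketch of that workaround follows. To keep it short it works on the whole content as a String instead of implementing a real filtering stream, and it rewrites named HTML entities as numeric character references, which any XML parser accepts. The lookup table is a deliberately tiny, hypothetical one; a real table would need every entity your files actually use. Be aware that this naive replacement also touches entity-like text inside CDATA sections and comments.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class EntityPreprocessor {
    // Hypothetical minimal table: entity name -> Unicode code point.
    private static final Map<String, Integer> HTML_ENTITIES = new HashMap<>();
    static {
        HTML_ENTITIES.put("nbsp", 0x00A0);
        HTML_ENTITIES.put("mdash", 0x2014);
    }
    private static final Pattern ENTITY = Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    // Rewrites known HTML entities as numeric character references.
    static String replaceHtmlEntities(String xml) {
        Matcher m = ENTITY.matcher(xml);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            Integer codePoint = HTML_ENTITIES.get(m.group(1));
            // names not in the table (e.g. the predefined &amp;) stay untouched
            String replacement = codePoint == null ? m.group() : "&#" + codePoint + ";";
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(replaceHtmlEntities(xml))));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<foo><bar>Some text &nbsp; &mdash; invalid!</bar></foo>";
        System.out.println(replaceHtmlEntities(xml));
        System.out.println(parse(xml).getDocumentElement().getTextContent());
    }
}
```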

I did something similar yesterday: I needed to add values from an unzipped XML stream to a database.
// imports (I'm not sure all of them are necessary)
import java.io.FileInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.*;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
// I haven't tested this code right now because I'm at work; it should work, but
// you may need to make small changes
InputSource is = new InputSource(new FileInputStream("test.xml"));
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(is);
XPathFactory xpf = XPathFactory.newInstance();
XPath xpath = xpf.newXPath();
String words= xpath.evaluate("/foo/bar", doc.getDocumentElement());
ParsingHexToChar.parseToChar(words);
// library I use: commons-lang3.jar
// method to parse
public static String parseToChar( String words){
String decode= org.apache.commons.lang3.StringEscapeUtils.unescapeHtml4(words);
return decode;
}

Try this using org.apache.commons package :
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder parser = dbf.newDocumentBuilder();
InputStream in = new FileInputStream(xmlfile);
String unescapeHtml4 = IOUtils.toString(in);
CharSequenceTranslator obj = new AggregateTranslator(new LookupTranslator(EntityArrays.ISO8859_1_UNESCAPE()),
new LookupTranslator(EntityArrays.HTML40_EXTENDED_UNESCAPE())
);
unescapeHtml4 = obj.translate(unescapeHtml4);
StringReader readerInput= new StringReader(unescapeHtml4);
InputSource is = new InputSource(readerInput);
Document doc = parser.parse(is);

Parsing xml file contents without knowing xml file structure

I've been working on learning some new tech using Java to parse files, and for the most part it's going well. However, I'm at a loss as to how I could parse an XML file whose structure is not known upon receipt. There are lots of examples of how to do so if you know the structure (getElementsByTagName seems to be the way to go), but no dynamic options, at least not that I've found.
So the tl;dr version of this question: how can I parse an XML file when I cannot rely on knowing its structure?

Well, the parsing part is easy; as helderdarocha stated in the comments, the parser only requires well-formed XML and does not care about the structure. You can use Java's standard DocumentBuilder to obtain a Document:
InputStream in = new FileInputStream(...);
Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
(If you're parsing multiple documents, you can keep reusing the same DocumentBuilder.)
Then you can start with the root document element and use familiar DOM methods from there on out:
Element root = doc.getDocumentElement(); // perform DOM operations starting here.
As for processing it, well it really depends on what you want to do with it, but you can use the methods of Node like getFirstChild() and getNextSibling() to iterate through children and process as you see fit based on structure, tags, and attributes.
Consider the following example:
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
public class XML {
public static void main (String[] args) throws Exception {
String xml = "<objects><circle color='red'/><circle color='green'/><rectangle>hello</rectangle><glumble/></objects>";
// parse
InputStream in = new ByteArrayInputStream(xml.getBytes("utf-8"));
Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
// process
Node objects = doc.getDocumentElement();
for (Node object = objects.getFirstChild(); object != null; object = object.getNextSibling()) {
if (object instanceof Element) {
Element e = (Element)object;
if (e.getTagName().equalsIgnoreCase("circle")) {
String color = e.getAttribute("color");
System.out.println("It's a " + color + " circle!");
} else if (e.getTagName().equalsIgnoreCase("rectangle")) {
String text = e.getTextContent();
System.out.println("It's a rectangle that says \"" + text + "\".");
} else {
System.out.println("I don't know what a " + e.getTagName() + " is for.");
}
}
}
}
}
The input XML document (hard-coded for example) is:
<objects>
<circle color='red'/>
<circle color='green'/>
<rectangle>hello</rectangle>
<glumble/>
</objects>
The output is:
It's a red circle!
It's a green circle!
It's a rectangle that says "hello".
I don't know what a glumble is for.
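If you really know nothing about the document, a generic recursive walk is a reasonable starting point. The following is only a sketch (the class and method names are invented): it visits every node with getFirstChild() and getNextSibling() and records each element with its nesting depth and attributes, without assuming any tag names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DomWalker {
    // Appends one line per element, indented by depth, with its attributes.
    static void walk(Node node, int depth, StringBuilder out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            for (int i = 0; i < depth; i++) {
                out.append("  ");
            }
            out.append(node.getNodeName());
            NamedNodeMap attrs = node.getAttributes();
            for (int i = 0; i < attrs.getLength(); i++) {
                out.append(" @").append(attrs.item(i).getNodeName())
                   .append('=').append(attrs.item(i).getNodeValue());
            }
            out.append('\n');
        }
        // recurse into children; text and comment nodes simply produce no line
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, depth + 1, out);
        }
    }

    static String describe(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        StringBuilder out = new StringBuilder();
        walk(doc.getDocumentElement(), 0, out);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.print(describe(
                "<objects><circle color='red'/><rectangle>hello</rectangle></objects>"));
    }
}
```

From there you can replace the line-building code with whatever processing you need once you have seen which elements actually occur.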

Java XML library that preserves attribute order

I am writing a Java program that reads an XML file, makes some modifications, and writes back the XML.
Using the standard Java XML DOM API, the order of the attributes is not preserved.
That is, if I have an input file such as:
<person first_name="john" last_name="lederrey"/>
I might get an output file as:
<person last_name="lederrey" first_name="john"/>
That's correct, because the XML specification says that attribute order is not significant.
However, my program needs to preserve the order of the attributes, so that a person can easily compare the input and output document with a diff tool.
One solution for that is to process the document with SAX (instead of DOM):
Order of XML attributes after DOM processing
However, this does not work for my case,
because the transformation I need to do in one node might depend on a XPath expression on the whole document.
So, the simplest thing would be to have a XML library very similar to the standard Java DOM library, with the exception that it preserves the attribute order.
Is there such a library?
PS: Please avoid discussing whether I should preserve the attribute order or not. This is a very interesting discussion, but it is not the point of this question.

Saxon these days offers a serialization option[1] to control the order in which attributes are output. It doesn't retain the input order (because Saxon doesn't know the input order), but it does allow you to control, for example, that the ID attribute always appears first.
And this can be very useful if the XML is going to be hand-edited; XML in which the attributes appear in the "wrong" order can be very disorienting to a human reader or editor.
If you're using this as part of a diff process then you would want to put both files through a process that normalizes the attribute order before comparing them. However, for comparing files my preferred approach is to parse them both and use the XPath deep-equal() function; or to use a specialized tool like DeltaXML.
[1] saxon:attribute-order - see http://www.saxonica.com/documentation/index.html#!extensions/output-extras/serialization-parameters
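If you don't have Saxon at hand, the "parse both and compare" idea can be approximated with the plain DOM method Node.isEqualNode(), which compares attribute sets without regard to their order. This is a different technique from XPath deep-equal(), shown only as a sketch; the class and method names are made up, and the comparison is still sensitive to whitespace differences between the two documents.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class AttributeOrderInsensitiveCompare {
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // True if both documents have the same elements, attributes and text,
    // regardless of the order in which attributes were written.
    static boolean sameContent(String xmlA, String xmlB) throws Exception {
        return parse(xmlA).getDocumentElement()
                .isEqualNode(parse(xmlB).getDocumentElement());
    }

    public static void main(String[] args) throws Exception {
        String in = "<person first_name=\"john\" last_name=\"lederrey\"/>";
        String out = "<person last_name=\"lederrey\" first_name=\"john\"/>";
        System.out.println(sameContent(in, out));
    }
}
```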

You might also want to try DecentXML, as it can preserve the attribute order, comments and even indentation.
It is very nice if you need to programmatically update an XML file that's also supposed to be human-editable. We use it for one of our configuration tools.
-- edit --
It seems it is no longer available on its original location; try these ones:
https://github.com/cartermckinnon/decentxml
https://github.com/haroldo-ok/decentxml (unofficial and unmaintained fork; kept here just in case the other forks disappear, too)
https://directory.fsf.org/wiki/DecentXML

Do it twice:
Read the document in using a DOM parser so you have references, a repository, if you will.
Then read it again using SAX. At the point where you need to make the transformation, reference the DOM version to determine what you need, then output what you need in the middle of the SAX stream.

Your best bet would be to use StAX instead of DOM for generating the original document. StAX gives you a lot of fine control over these things and lets you stream output progressively to an output stream instead of holding it all in memory.
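As a sketch of what that could look like, the following copies a document from an XMLStreamReader to an XMLStreamWriter, writing each attribute in the order the reader reports it. The class and method names are invented. It ignores namespaces, comments and processing instructions, and it relies on the reader reporting attributes in document order, which the StAX API does not formally promise, so treat that as an assumption to verify with your parser.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

class StaxCopy {
    // Streams the input to the output, element by element.
    static String copy(String xml) throws XMLStreamException {
        XMLStreamReader in = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        StringWriter out = new StringWriter();
        XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        while (in.hasNext()) {
            switch (in.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    w.writeStartElement(in.getLocalName());
                    // attributes are written in the order the reader lists them
                    for (int i = 0; i < in.getAttributeCount(); i++) {
                        w.writeAttribute(in.getAttributeLocalName(i), in.getAttributeValue(i));
                    }
                    break;
                case XMLStreamConstants.CHARACTERS:
                    w.writeCharacters(in.getText());
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    w.writeEndElement();
                    break;
                default:
                    break; // comments, processing instructions etc. are dropped here
            }
        }
        w.flush();
        w.close();
        in.close();
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(copy("<person first_name=\"john\" last_name=\"lederrey\"/>"));
    }
}
```

Any modification would go into the START_ELEMENT or CHARACTERS branch, so the rest of the document passes through untouched.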

We had similar requirements per Dave's description. A solution that worked was based on Java reflection.
The idea is to set the propOrder for the attributes at runtime. In our case there's an APP_DATA element containing three attributes: app, name, and value. The generated AppData class includes "content" in propOrder and none of the other attributes:
@XmlAccessorType(XmlAccessType.FIELD)
@XmlType(name = "AppData", propOrder = {
"content"
})
public class AppData {
@XmlValue
protected String content;
@XmlAttribute(name = "Value", required = true)
protected String value;
@XmlAttribute(name = "Name", required = true)
protected String name;
@XmlAttribute(name = "App", required = true)
protected String app;
...
}
So Java reflection was used as follows to set the order at runtime:
final String[] propOrder = { "app", "name", "value" };
ReflectionUtil.changeAnnotationValue(
AppData.class.getAnnotation(XmlType.class),
"propOrder", propOrder);
final JAXBContext jaxbContext = JAXBContext
.newInstance(ADI.class);
final Marshaller adimarshaller = jaxbContext.createMarshaller();
adimarshaller.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT,
true);
adimarshaller.marshal(new JAXBElement<ADI>(new QName("ADI"),
ADI.class, adi),
new StreamResult(fileOutputStream));
The changeAnnotationValue() was borrowed from this post:
Modify a class definition's annotation string parameter at runtime
Here's the method for your convenience (credit goes to @assylias and @Balder):
/**
* Changes the annotation value for the given key of the given annotation to newValue and returns
* the previous value.
*/
@SuppressWarnings("unchecked")
public static Object changeAnnotationValue(Annotation annotation, String key, Object newValue) {
Object handler = Proxy.getInvocationHandler(annotation);
Field f;
try {
f = handler.getClass().getDeclaredField("memberValues");
} catch (NoSuchFieldException | SecurityException e) {
throw new IllegalStateException(e);
}
f.setAccessible(true);
Map<String, Object> memberValues;
try {
memberValues = (Map<String, Object>) f.get(handler);
} catch (IllegalArgumentException | IllegalAccessException e) {
throw new IllegalStateException(e);
}
Object oldValue = memberValues.get(key);
if (oldValue == null || oldValue.getClass() != newValue.getClass()) {
throw new IllegalArgumentException();
}
memberValues.put(key, newValue);
return oldValue;
}

You may override AttributeSortedMap and sort attributes as you need...
The main idea: load the document, recursively copy to elements that support sorted attributeMap and serialize using the existing XMLSerializer.
File test.xml
<root>
<person first_name="john1" last_name="lederrey1"/>
<person first_name="john2" last_name="lederrey2"/>
<person first_name="john3" last_name="lederrey3"/>
<person first_name="john4" last_name="lederrey4"/>
</root>
File AttOrderSorter.java
import com.sun.org.apache.xerces.internal.dom.AttrImpl;
import com.sun.org.apache.xerces.internal.dom.AttributeMap;
import com.sun.org.apache.xerces.internal.dom.CoreDocumentImpl;
import com.sun.org.apache.xerces.internal.dom.ElementImpl;
import com.sun.org.apache.xml.internal.serialize.OutputFormat;
import com.sun.org.apache.xml.internal.serialize.XMLSerializer;
import org.w3c.dom.*;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.Writer;
import java.util.List;
import static java.util.Arrays.asList;
public class AttOrderSorter {
private List<String> sortAtts = asList("last_name", "first_name");
public void format(String inFile, String outFile) throws Exception {
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = dbFactory.newDocumentBuilder();
Document outDocument = builder.newDocument();
try (FileInputStream inputStream = new FileInputStream(inFile)) {
Document document = dbFactory.newDocumentBuilder().parse(inputStream);
Element sourceRoot = document.getDocumentElement();
Element outRoot = outDocument.createElementNS(sourceRoot.getNamespaceURI(), sourceRoot.getTagName());
outDocument.appendChild(outRoot);
copyAtts(sourceRoot.getAttributes(), outRoot);
copyElement(sourceRoot.getChildNodes(), outRoot, outDocument);
}
try (Writer outxml = new FileWriter(new File(outFile))) {
OutputFormat format = new OutputFormat();
format.setLineWidth(0);
format.setIndenting(false);
format.setIndent(2);
XMLSerializer serializer = new XMLSerializer(outxml, format);
serializer.serialize(outDocument);
}
}
private void copyElement(NodeList nodes, Element parent, Document document) {
for (int i = 0; i < nodes.getLength(); i++) {
Node node = nodes.item(i);
if (node.getNodeType() == Node.ELEMENT_NODE) {
Element element = new ElementImpl((CoreDocumentImpl) document, node.getNodeName()) {
@Override
public NamedNodeMap getAttributes() {
return new AttributeSortedMap(this, (AttributeMap) super.getAttributes());
}
};
copyAtts(node.getAttributes(), element);
copyElement(node.getChildNodes(), element, document);
parent.appendChild(element);
}
}
}
private void copyAtts(NamedNodeMap attributes, Element target) {
for (int i = 0; i < attributes.getLength(); i++) {
Node att = attributes.item(i);
target.setAttribute(att.getNodeName(), att.getNodeValue());
}
}
public class AttributeSortedMap extends AttributeMap {
AttributeSortedMap(ElementImpl element, AttributeMap attributes) {
super(element, attributes);
nodes.sort((o1, o2) -> {
AttrImpl att1 = (AttrImpl) o1;
AttrImpl att2 = (AttrImpl) o2;
Integer pos1 = sortAtts.indexOf(att1.getNodeName());
Integer pos2 = sortAtts.indexOf(att2.getNodeName());
if (pos1 > -1 && pos2 > -1) {
return pos1.compareTo(pos2);
} else if (pos1 > -1 || pos2 > -1) {
return pos1 == -1 ? 1 : -1;
}
return att1.getNodeName().compareTo(att2.getNodeName());
});
}
}
public static void main(String[] args) throws Exception {
new AttOrderSorter().format("src/main/resources/test.xml", "src/main/resources/output.xml");
}
}
Result - file output.xml
<?xml version="1.0" encoding="UTF-8"?>
<root>
<person last_name="lederrey1" first_name="john1"/>
<person last_name="lederrey2" first_name="john2"/>
<person last_name="lederrey3" first_name="john3"/>
<person last_name="lederrey4" first_name="john4"/>
</root>

You can't do this with the DOM, but you can use SAX, or query the children using XPath.
See the answer to "Order of XML attributes after DOM processing".
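As a rough illustration of the SAX route, the sketch below collects attributes in the order the parser reports them. The class and method names (SaxAttributeOrder, readAttributes) are made up for this example. Note that the SAX specification does not promise any particular attribute order; the JDK's bundled parser typically reports document order, but treat that as an assumption to verify on your own setup.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of the original answer.
class SaxAttributeOrder {

    // Returns "element@attribute=value" strings in the order the parser reports them.
    static List<String> readAttributes(String xml) throws Exception {
        List<String> seen = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                for (int i = 0; i < atts.getLength(); i++) {
                    seen.add(qName + "@" + atts.getQName(i) + "=" + atts.getValue(i));
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return seen;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><person first_name=\"john1\" last_name=\"lederrey1\"/></root>";
        readAttributes(xml).forEach(System.out::println);
    }
}
```

From such a handler you could write the attributes back out in whatever order you need, without relying on internal Xerces classes.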

Error with parsing DOM obtained from OMIM RESTful Web services

I am a beginner with web services. Could anyone with experience please
help me with the following?
I am writing a client that fetches information from the OMIM RESTful web
services, using the key OMIM provides after registration
(http://omim.org/help/api). I can connect successfully, and with the GET
method I can fetch the required data into a DOM document. I can also
write the entire DOM to a file on the local disk. However, I am unable
to work with the DOM using the standard DOM parsing functions.
For example, I can get the root node with NodeList
nl=doc.getDocumentElement() and print it to the console. But when I try
to print the first child of the root node, it returns null instead of
the expected child node.
Sample XML form: webservices -> DOM -> file
<?xml version="1.0" encoding="UTF-8" standalone="no"?><omim version="1.0">
<clinicalSynopsisList>
<clinicalSynopsis>
<mimNumber>100070</mimNumber>
<prefix>%</prefix>
</clinicalSynopsis>
</clinicalSynopsisList>
</omim>
Please find my code below:
String path="http://api.omim.org:8000/api/clinicalSynopsis?mimNumber="+"100070"+"&include=clinicalSynopsis&format=xml&apiKey="+"<< xxxxx private key xxxxxxxxxx >> ";
URL url = new URL(path);
HttpURLConnection conn=(HttpURLConnection)url.openConnection();
conn.setRequestMethod("GET");
InputStream is = conn.getInputStream();
DocumentBuilder docBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = docBuilder.parse(is);
Source src= new DOMSource(doc);
File file = new File("d:/text.xml");
Result rs = new StreamResult(file);
TransformerFactory tmf = TransformerFactory.newInstance();
Transformer trnsfrmr = tmf.newTransformer();
trnsfrmr.transform(src, rs);
System.out.println("XML file is created successfully");
System.out.println("The root element is :: "+doc.getDocumentElement().getNodeName());
NodeList nl=doc.getDocumentElement().getChildNodes();
System.out.println("child nodelist length::"+nl.getLength());
System.out.println("First child node name :: "+doc.getDocumentElement().getFirstChild().getNodeName());
System.out.println("Last child node name :: "+doc.getDocumentElement().getLastChild().getNodeName());
Output I got:
XML file is created successfully
The root element is :: omim
child nodelist length::3
First child node name :: #text
Last child node name :: #text
In the output, the root node is "omim" and it has 3 children, but
printing the first and last child names gives #text instead of the
expected element name.
Similarly, the getParent(), getChild() and getSibling() methods are not
working for me.
Any help will be highly appreciated.
Thank you,

I posted a comment and then figured I would rather explain it further in an answer. The question to ask is why the root has 3 child nodes. There is only one child element, clinicalSynopsisList, so why 3? The first and the last child are the line breaks (and maybe other whitespace too) before and after clinicalSynopsisList. Your node content is interpreted as MIXED because you don't have a schema or DTD to tell the parser that omim can only contain elements. If you had one, you could tell your parser to ignore the ignorable whitespace, as explained in the other SO question I referred you to in my comment.
It's been a while since I worked with the DOM API directly, but I don't believe you can ask it for the first child element. Instead you could use XPath (search SO or Google for examples) to get to your first child element, or just iterate over the child nodes with the DOM API and check their node type (ignoring the text nodes).
I would also recommend taking a look at Apache CXF and marshalling technologies like JAXB, so that you don't have to work with the "raw" XML that you read from the web service endpoint.
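To make the whitespace point concrete, here is a minimal sketch that parses the same document twice, with and without whitespace stripping. It assumes a DTD can be attached; the internal DTD subset below is invented for illustration (OMIM does not necessarily publish one), and the class name IgnorableWhitespaceDemo is hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; the DTD is an assumption made up for this sketch.
class IgnorableWhitespaceDemo {

    static final String XML =
          "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE omim [\n"
        + "<!ELEMENT omim (clinicalSynopsisList)>\n"
        + "<!ATTLIST omim version CDATA #IMPLIED>\n"
        + "<!ELEMENT clinicalSynopsisList (clinicalSynopsis*)>\n"
        + "<!ELEMENT clinicalSynopsis (mimNumber, prefix)>\n"
        + "<!ELEMENT mimNumber (#PCDATA)>\n"
        + "<!ELEMENT prefix (#PCDATA)>\n"
        + "]>\n"
        + "<omim version=\"1.0\">\n"
        + "<clinicalSynopsisList>\n"
        + "<clinicalSynopsis>\n"
        + "<mimNumber>100070</mimNumber>\n"
        + "<prefix>%</prefix>\n"
        + "</clinicalSynopsis>\n"
        + "</clinicalSynopsisList>\n"
        + "</omim>";

    // When dropWhitespace is true, the parser validates against the DTD and
    // omits whitespace-only text nodes inside element-only content.
    static Document parse(String xml, boolean dropWhitespace) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(dropWhitespace);
        factory.setIgnoringElementContentWhitespace(dropWhitespace);
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parse(XML, false).getDocumentElement().getFirstChild().getNodeName());
        System.out.println(parse(XML, true).getDocumentElement().getFirstChild().getNodeName());
    }
}
```

Without the DTD the parser has no way to know the whitespace is insignificant, so setIgnoringElementContentWhitespace alone has no effect on the document from the question.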
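Both suggestions can be sketched in a few lines. The helper names below (firstChildElement, firstChildViaXPath, parse) are made up for this example, and the XML string is the sample from the question; this is one possible way to do it, not the only one.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class illustrating the two approaches.
class FirstChildElement {

    // Walks the children and returns the first one that is an element, or null.
    static Element firstChildElement(Node parent) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                return (Element) n;
            }
        }
        return null;
    }

    // Same result via XPath: "*[1]" selects the first child element of the context node.
    static Node firstChildViaXPath(Node parent) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return (Node) xpath.evaluate("*[1]", parent, XPathConstants.NODE);
    }

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<omim version=\"1.0\">\n<clinicalSynopsisList>\n<clinicalSynopsis>\n"
                + "<mimNumber>100070</mimNumber>\n<prefix>%</prefix>\n"
                + "</clinicalSynopsis>\n</clinicalSynopsisList>\n</omim>";
        Element root = parse(xml).getDocumentElement();
        System.out.println(firstChildElement(root).getNodeName());
        System.out.println(firstChildViaXPath(root).getNodeName());
    }
}
```

The loop is cheaper when you only need one level; XPath pays off once the path gets deeper, for example "clinicalSynopsisList/clinicalSynopsis/mimNumber".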

(Copy of my answer on Biostar.) I cannot currently use the OMIM API, but the following Java code should do the job. I think your problem is that you assume the first child of an XML node is an ELEMENT, which is wrong: here it is a TEXT node containing a line break.
import java.net.URLEncoder;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
public class Biostar44705
{
private static final String API_KEY="XXXXXXXX";
private DocumentBuilder builder;
private Transformer echoTransformer=null;
private Biostar44705()throws Exception
{
DocumentBuilderFactory factory=DocumentBuilderFactory.newInstance();
factory.setCoalescing(true);
factory.setIgnoringComments(true);
factory.setNamespaceAware(false);
builder=factory.newDocumentBuilder();
TransformerFactory trf=TransformerFactory.newInstance();
this.echoTransformer =trf.newTransformer();
this.echoTransformer .setOutputProperty(OutputKeys.INDENT, "yes");
this.echoTransformer .setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
}
void get(int omimId)throws Exception
{
String uri="http://api.omim.org:8000/api/clinicalSynopsis?mimNumber="+omimId+
"&include=clinicalSynopsis&format=xml&apiKey="+
URLEncoder.encode(API_KEY,"UTF-8");
Document dom=builder.parse(uri);
Element root=dom.getDocumentElement();
if(root==null) return;
for(Node n1=root.getFirstChild();n1!=null;n1=n1.getNextSibling())
{
if(n1.getNodeType()!=Node.ELEMENT_NODE) continue;
echoTransformer.transform(new DOMSource(n1),new StreamResult(System.out));
break;
}
}
public static void main(String[] args) throws Exception
{
new Biostar44705().get(100070);
}
}

Have you already used an XSD definition of the XML? With that and JAXB you can unmarshal the XML into a Java object easily.
