Error with parsing DOM obtained from OMIM RESTful Web services - java

I am a beginner with web services; could anyone with experience please
help me with the following?
I am writing a client that gets information from the OMIM RESTful web
services, using the key OMIM provides after registration
(http://omim.org/help/api). I can connect to the service, and with the
GET method I can fetch the required data into a DOM document. I can
also successfully write the entire DOM to a file on the local disk.
However, I am unable to work with the DOM using the standard DOM
traversal methods.
For example: I can get the root node with doc.getDocumentElement()
and print it to the console. But when I try to print the first child
of the root node, it returns null instead of the expected child node.
Sample XML form: webservices -> DOM -> file
<?xml version="1.0" encoding="UTF-8" standalone="no"?><omim version="1.0">
<clinicalSynopsisList>
<clinicalSynopsis>
<mimNumber>100070</mimNumber>
<prefix>%</prefix>
</clinicalSynopsis>
</clinicalSynopsisList>
</omim>
Please find my code below:
String path="http://api.omim.org:8000/api/clinicalSynopsis?mimNumber="+"100070"+"&include=clinicalSynopsis&format=xml&apiKey="+"<< xxxxx private key xxxxxxxxxx >> ";
URL url = new URL(path);
HttpURLConnection conn=(HttpURLConnection)url.openConnection();
conn.setRequestMethod("GET");
InputStream is = conn.getInputStream();
DocumentBuilder docBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = docBuilder.parse(is);
Source src= new DOMSource(doc);
File file = new File("d:/text.xml");
Result rs = new StreamResult(file);
TransformerFactory tmf = TransformerFactory.newInstance();
Transformer trnsfrmr = tmf.newTransformer();
trnsfrmr.transform(src, rs);
System.out.println("XML file is created successfully");
System.out.println("The root element is :: "+doc.getDocumentElement().getNodeName());
NodeList nl=doc.getDocumentElement().getChildNodes();
System.out.println("child nodelist length::"+nl.getLength());
System.out.println("First child node name :: "+doc.getDocumentElement().getFirstChild().getNodeName());
System.out.println("Last child node name :: "+doc.getDocumentElement().getLastChild().getNodeName());
Output I got:
XML file is created successfully
The root element is :: omim
child nodelist length::3
First child node name :: #text
Last child node name :: #text
The output shows that the root node is “omim” and that it has 3
children, but I get null when I try to print the names of the first
and last child. Similarly, the getParent(), getChild() and
getSibling() methods are not working for me.
Any help will be highly appreciated.
Thank you,

I posted a comment and then I figured I would rather explain it further in an answer. You should have asked why the root has 3 child nodes. There's only one child element, clinicalSynopsisList, so why 3? The first and the last child are the line breaks (and maybe other whitespace) before and after clinicalSynopsisList. Your node content is interpreted as MIXED since you don't have a schema or DTD to tell the parser that omim can only contain elements. If you had one, you could tell your parser to ignore the ignorable whitespace, as explained in that other SO question I referred you to in my comment.
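To make the DTD point concrete, here is a minimal sketch. It assumes a hand-written internal DTD (the real OMIM response declares none, so the DOCTYPE below is purely hypothetical) and uses a trimmed-down document. With a content model available and validation switched on, setIgnoringElementContentWhitespace(true) lets the parser drop the line breaks between elements.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class IgnoreWhitespaceDemo {
    // Hypothetical DTD for illustration only; OMIM does not ship one.
    static final String XML =
        "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE omim [\n"
        + "<!ELEMENT omim (clinicalSynopsisList)>\n"
        + "<!ELEMENT clinicalSynopsisList (#PCDATA)>\n"
        + "]>\n"
        + "<omim>\n"
        + "<clinicalSynopsisList>x</clinicalSynopsisList>\n"
        + "</omim>";

    // Returns the node name of the root's first child.
    static String firstChildName(boolean ignoreWhitespace) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            if (ignoreWhitespace) {
                // The setting relies on the content model, so validation is enabled too.
                factory.setValidating(true);
                factory.setIgnoringElementContentWhitespace(true);
            }
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(XML)));
            return doc.getDocumentElement().getFirstChild().getNodeName();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstChildName(false));
        System.out.println(firstChildName(true));
    }
}
```

Without a DTD or schema the parser cannot know the whitespace is ignorable, which is why the original code sees #text nodes.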
It's been a while since I worked with the DOM API directly, but I don't believe you can ask it for the first child element. Instead, you could use XPath (start here, for example, or search SO or Google for examples) to get to your first child element, or just iterate over the child nodes with the DOM API and check each node's type, skipping the text nodes.
And I would also recommend to take a look at Apache CXF and marshalling technologies like JAXB so that you didn't have to work with the "raw" XML that you read off of the webservice endpoint.
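A minimal sketch of the XPath route mentioned above, using only the JDK's javax.xml.xpath package. The class and method names are made up for the example, and the XML is passed in as a string instead of being fetched from OMIM. The expression /*/*[1] selects the first child element of the root, so whitespace text nodes never show up.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class FirstChildElementDemo {
    // Returns the name of the root's first child ELEMENT, or null if there is none.
    static String firstChildElementName(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // "/*" is the root element, "/*/*[1]" its first child element.
            Node first = (Node) xpath.evaluate("/*/*[1]", doc, XPathConstants.NODE);
            return first == null ? null : first.getNodeName();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<omim version=\"1.0\">\n<clinicalSynopsisList>\n"
            + "<clinicalSynopsis/>\n</clinicalSynopsisList>\n</omim>";
        System.out.println(firstChildElementName(xml));
    }
}
```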

(copy of my answer on biostar) I cannot currently use the OMIM API, but the following Java code should do the job. I think your problem is that you assume the first child of an XML node is an ELEMENT, which is wrong: here it is a TEXT node containing a line break.
import java.net.URLEncoder;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
public class Biostar44705
{
private static final String API_KEY="XXXXXXXX";
private DocumentBuilder builder;
private Transformer echoTransformer=null;
private Biostar44705()throws Exception
{
DocumentBuilderFactory factory=DocumentBuilderFactory.newInstance();
factory.setCoalescing(true);
factory.setIgnoringComments(true);
factory.setNamespaceAware(false);
builder=factory.newDocumentBuilder();
TransformerFactory trf=TransformerFactory.newInstance();
this.echoTransformer =trf.newTransformer();
this.echoTransformer .setOutputProperty(OutputKeys.INDENT, "yes");
this.echoTransformer .setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
}
void get(int omimId)throws Exception
{
String uri="http://api.omim.org:8000/api/clinicalSynopsis?mimNumber="+omimId+
"&include=clinicalSynopsis&format=xml&apiKey="+
URLEncoder.encode(API_KEY,"UTF-8");
Document dom=builder.parse(uri);
Element root=dom.getDocumentElement();
if(root==null) return;
for(Node n1=root.getFirstChild();n1!=null;n1=n1.getNextSibling())
{
if(n1.getNodeType()!=Node.ELEMENT_NODE) continue;
echoTransformer.transform(new DOMSource(n1),new StreamResult(System.out));
break;
}
}
public static void main(String[] args) throws Exception
{
new Biostar44705().get(100070);
}
}

Related

Which methods can be used to return valid and invalid XML data from a file in Java?

I have the following data that is supposed to be XML:
<?xml version="1.0" encoding="UTF-8"?>
<Product>
<id>1</id>
<description>A new product</description>
<price>123.45</price>
</Product>
<Product>
<id>1</id>
<description>A new product</description>
<price>123.45</price>
</Product>
<ProductTTTTT>
<id>1</id>
<description>A new product</description>
<price>123.45</price>
</Product>
<Product>
<id>1</id>
<description>A new product</description>
<price>123.45</price>
</ProductAAAAAA>
So, basically I have multiple root elements (product)...
The point is that I'm trying to transform this data into 2 XML documents, 1 for valid nodes and other for invalid nodes.
Valid node:
<Product>
...
</Product>
Invalid nodes: <ProductTTTTT>...</Product> and <Product>...</ProductAAAAAA>
Then I am thinking how I can achieve this using JAVA (not web).
If I am not wrong, validating it with a XSD will invalidate the whole file, so not an option.
Using default JAXB parser (unmarshaller) will lead to item above since internally it creates a XSD of my entity.
Using XPath just (from what I know) will just return the whole file, I did not find a way to get something like GET !VALID (It is just to explain...)
Using XQuery (maybe?).. by the way, how to use XQuery with JAXB?
XSL(T) will lead to same thing on XPath, since it uses XPath to select the content.
So... which method can I use to achieve the objective? (And if possible, provide links or code please)
Firstly, you're confusing valid and well-formed. You say you want to find invalid elements, but your examples aren't just invalid, they are ill-formed. That means that no XML parser is going to do anything with them other than throwing an error message at you. You can't use JAXB or XPath, or XQuery, or XSLT, or anything to process something that isn't XML.
You say "unfortunately I do not have access to the system that sends this xml format". I'm not sure why you call it an XML format: it isn't. I also don't understand why you (and many others on StackOverflow) are prepared to spend your time digging in garbage like this rather than telling the sender to get their act together. If you were served a salad with maggots in it, would you try to pick them out, or would you send it back for replacement? You should adopt a zero-tolerance approach to bad data; that's the only way senders will learn to improve the quality.
If the file contains lines with start and end tags whose names begin with "Product", you could:
use a file scanner to split this document into individual pieces whenever a line starts with <Product or </Product
attempt to parse the extracted text as XML using an XML API.
If it succeeds, add that object to a list of "good" well-formed XML documents
then perform any additional schema validation or validity checks
If it throws a parse error, catch it, and add that snippet of text to the list of "bad" items that need to be cleaned up or otherwise handled
An example to get you started:
package com.stackoverflow.questions.q52012383;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import java.io.File;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
public class FileSplitter {
    public static void parseFile(File file, String elementName)
            throws ParserConfigurationException, IOException {
        List<Document> good = new ArrayList<>();
        List<String> bad = new ArrayList<>();
        String startTag = "<" + elementName;
        String endTag = "</" + elementName;
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        StringBuilder buffer = new StringBuilder();
        boolean append = false;
        try (Scanner scanner = new Scanner(file)) {
            while (scanner.hasNextLine()) {
                String line = scanner.nextLine();
                if (line.startsWith(startTag)) {
                    append = true; // start accumulating content
                } else if (line.startsWith(endTag)) {
                    append = false;
                    buffer.append(line);
                    // instead of the line above, you could hard-code the ending tag to compensate for bad data:
                    // buffer.append(endTag + ">");
                    try { // to parse as XML
                        Document document = builder.parse(new InputSource(new StringReader(buffer.toString())));
                        good.add(document); // parsed successfully, add it to the good list
                    } catch (SAXException ex) {
                        bad.add(buffer.toString()); // something is wrong, not well-formed XML
                    } finally {
                        buffer.setLength(0); // reset the buffer either way, so a bad snippet does not leak into the next one
                    }
                }
                if (append) { // accumulate content
                    buffer.append(line);
                }
            }
            System.out.println("Good items: " + good.size() + " Bad items: " + bad.size());
            // do stuff with the good/bad results...
        }
    }
    public static void main(String[] args)
            throws ParserConfigurationException, IOException {
        File file = new File("/tmp/test.xml");
        parseFile(file, "Product");
    }
}

Parsing XML file containing HTML entities in Java without changing the XML

I have to parse a bunch of XML files in Java that sometimes -- and invalidly -- contain HTML entities such as &mdash;, &gt; and so forth. I understand the correct way of dealing with this is to add suitable entity declarations to the XML file before parsing. However, I can't do that as I have no control over those XML files.
Is there some kind of callback I can override that is invoked whenever the Java XML parser encounters such an entity? I haven't been able to find one in the API.
I'd like to use:
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder parser = dbf.newDocumentBuilder();
Document doc = parser.parse( stream );
I found that I can override resolveEntity in org.xml.sax.helpers.DefaultHandler, but how do I use this with the higher-level API?
Here's a full example:
public class Main {
public static void main( String [] args ) throws Exception {
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder parser = dbf.newDocumentBuilder();
Document doc = parser.parse( new FileInputStream( "test.xml" ));
}
}
with test.xml:
<?xml version="1.0" encoding="UTF-8"?>
<foo>
<bar>Some&nbsp;text &mdash; invalid!</bar>
</foo>
Produces:
[Fatal Error] :3:20: The entity "nbsp" was referenced, but not declared.
Exception in thread "main" org.xml.sax.SAXParseException; lineNumber: 3; columnNumber: 20; The entity "nbsp" was referenced, but not declared.
Update: I have been poking around in the JDK source code with a debugger, and boy, what an amount of spaghetti. I have no idea what the design is there, or whether there is one. Just how many layers of an onion can one layer on top of each other?
The key class seems to be com.sun.org.apache.xerces.internal.impl.XMLEntityManager, but I cannot find any code that either lets me add stuff into it before it gets used, or that attempts to resolve entities without going through that class.
I would use a library like Jsoup for this purpose. I tested the following below and it works. I don't know if this helps. It can be located here: http://jsoup.org/download
public static void main(String args[]){
String html = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><foo>" +
"<bar>Some&nbsp;text &mdash; invalid!</bar></foo>";
Document doc = Jsoup.parse(html, "", Parser.xmlParser());
for (Element e : doc.select("bar")) {
System.out.println(e);
}
}
Result:
<bar>
Some text — invalid!
</bar>
Loading from a file can be found here:
http://jsoup.org/cookbook/input/load-document-from-file
Issue - 1: I have to parse a bunch of XML files in Java that sometimes -- and
invalidly -- contain HTML entities such as &mdash;
XML has only five predefined entities, and &mdash; is not among them. It works only in plain HTML or in legacy JSP, so SAX will not help. It can be done using StAX, which has a high-level iterator-based API. (Collected from this link)
Issue - 2: I found that I can override resolveEntity in
org.xml.sax.helpers.DefaultHandler, but how do I use this with the
higher-level API?
Streaming API for XML, called StAX, is an API for reading and writing XML documents.
StAX is a pull-parsing model: the application takes control of parsing by pulling (taking) events from the parser.
The core StAX API falls into two categories:
Cursor-based API: the low-level API. It lets the application process XML as a stream of tokens, also known as events.
Iterator-based API: the higher-level API. It lets the application process XML as a series of event objects, each of which communicates a piece of the XML structure to the application.
The StAX API has support for the notion of not replacing character entity references, by way of the IS_REPLACING_ENTITY_REFERENCES property:
Requires the parser to replace internal entity references with their
replacement text and report them as characters
This can be set on an XMLInputFactory, which is then used to construct an XMLEventReader or XMLStreamReader.
However, the API is careful to say that this property is only intended to force the implementation to perform the replacement, rather than forcing it not to replace them.
You may try it; I hope it solves your issue. For your case:
Main.java
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.EntityReference;
import javax.xml.stream.events.XMLEvent;
public class Main {
public static void main(String[] args) {
XMLInputFactory inputFactory = XMLInputFactory.newInstance();
inputFactory.setProperty(
XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, false);
XMLEventReader reader;
try {
reader = inputFactory
.createXMLEventReader(new FileInputStream("F://test.xml"));
while (reader.hasNext()) {
XMLEvent event = reader.nextEvent();
if (event.isEntityReference()) {
EntityReference ref = (EntityReference) event;
System.out.println("Entity Reference: " + ref.getName());
}
}
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (XMLStreamException e) {
e.printStackTrace();
}
}
}
test.xml:
<?xml version="1.0" encoding="UTF-8"?>
<foo>
<bar>Some&nbsp;text &mdash; invalid!</bar>
</foo>
Output:
Entity Reference: nbsp
Entity Reference: mdash
Credit goes to @skaffman.
Related Link:
http://www.journaldev.com/1191/how-to-read-xml-file-in-java-using-java-stax-api
http://www.journaldev.com/1226/java-stax-cursor-based-api-read-xml-example
http://www.vogella.com/tutorials/JavaXML/article.html
Is there a Java XML API that can parse a document without resolving character entities?
UPDATE:
Issue - 3: Is there a way to use StaX to "filter" the entities (replacing them
with something else, for example) and still produce a Document at the
end of the process?
To create a new document using the StAX API, you need an XMLStreamWriter, which provides methods to produce XML opening and closing tags, attributes and character content.
Five XMLStreamWriter methods cover the basics:
xmlsw.writeStartDocument() - initialises an empty document to which
elements can be added
xmlsw.writeStartElement(String s) - creates a new element named s
xmlsw.writeAttribute(String name, String value) - adds the attribute
name with the corresponding value to the last element produced by a
call to writeStartElement. Attributes can be added as long as no call
to writeStartElement, writeCharacters or writeEndElement has been
made since.
xmlsw.writeEndElement() - closes the last started element
xmlsw.writeCharacters(String s) - creates a new text node with
content s as content of the last started element.
A sample example is attached with it:
StAXExpand.java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;
import java.util.Arrays;
public class StAXExpand {
static XMLStreamWriter xmlsw = null;
public static void main(String[] argv) {
try {
xmlsw = XMLOutputFactory.newInstance()
.createXMLStreamWriter(System.out);
CompactTokenizer tok = new CompactTokenizer(
new FileReader(argv[0]));
String rootName = "dummyRoot";
// ignore everything preceding the word before the first "["
while(!tok.nextToken().equals("[")){
rootName=tok.getToken();
}
// start creating new document
xmlsw.writeStartDocument();
ignorableSpacing(0);
xmlsw.writeStartElement(rootName);
expand(tok,3);
ignorableSpacing(0);
xmlsw.writeEndDocument();
xmlsw.flush();
xmlsw.close();
} catch (XMLStreamException e){
System.out.println(e.getMessage());
} catch (IOException ex) {
System.out.println("IOException"+ex);
ex.printStackTrace();
}
}
public static void expand(CompactTokenizer tok, int indent)
throws IOException,XMLStreamException {
tok.skip("[");
while(tok.getToken().equals("#")) {// add attributes
String attName = tok.nextToken();
tok.nextToken();
xmlsw.writeAttribute(attName,tok.skip("["));
tok.nextToken();
tok.skip("]");
}
boolean lastWasElement=true; // for controlling the output of newlines
while(!tok.getToken().equals("]")){ // process content
String s = tok.getToken().trim();
tok.nextToken();
if(tok.getToken().equals("[")){
if(lastWasElement)ignorableSpacing(indent);
xmlsw.writeStartElement(s);
expand(tok,indent+3);
lastWasElement=true;
} else {
xmlsw.writeCharacters(s);
lastWasElement=false;
}
}
tok.skip("]");
if(lastWasElement)ignorableSpacing(indent-3);
xmlsw.writeEndElement();
}
private static char[] blanks = "\n".toCharArray();
private static void ignorableSpacing(int nb)
throws XMLStreamException {
if(nb>blanks.length){// extend the length of space array
blanks = new char[nb+1];
blanks[0]='\n';
Arrays.fill(blanks,1,blanks.length,' ');
}
xmlsw.writeCharacters(blanks, 0, nb+1);
}
}
CompactTokenizer.java
import java.io.Reader;
import java.io.IOException;
import java.io.StreamTokenizer;
public class CompactTokenizer {
private StreamTokenizer st;
CompactTokenizer(Reader r){
st = new StreamTokenizer(r);
st.resetSyntax(); // remove parsing of numbers...
st.wordChars('\u0000','\u00FF'); // everything is part of a word
// except the following...
st.ordinaryChar('\n');
st.ordinaryChar('[');
st.ordinaryChar(']');
st.ordinaryChar('#');
}
public String nextToken() throws IOException{
st.nextToken();
while(st.ttype=='\n'||
(st.ttype==StreamTokenizer.TT_WORD &&
st.sval.trim().length()==0))
st.nextToken();
return getToken();
}
public String getToken(){
return (st.ttype == StreamTokenizer.TT_WORD) ? st.sval : (""+(char)st.ttype);
}
public String skip(String sym) throws IOException {
if(getToken().equals(sym))
return nextToken();
else
throw new IllegalArgumentException("skip: "+sym+" expected but "+
getToken() +" found");
}
}
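For a much smaller illustration of the five XMLStreamWriter calls described above, here is a sketch that writes one element into a StringWriter. The class name, element name and attribute are invented for the example. The exact form of the XML declaration depends on the StAX implementation, so the code does not rely on it.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

class StreamWriterDemo {
    // Builds a one-element document and returns it as a string.
    static String build() {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            w.writeStartDocument();          // XML declaration
            w.writeStartElement("foo");      // <foo
            w.writeAttribute("id", "1");     // id="1", allowed only right after writeStartElement
            w.writeCharacters("hello");      // closes the start tag, then writes the text
            w.writeEndElement();             // </foo>
            w.writeEndDocument();
            w.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(build());
    }
}
```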
For more, you can follow the tutorial
https://docs.oracle.com/javase/tutorial/jaxp/stax/example.html
http://www.ibm.com/developerworks/library/x-tipstx2/index.html
http://www.iro.umontreal.ca/~lapalme/ForestInsteadOfTheTrees/HTML/ch09s03.html
http://staf.sourceforge.net/current/STAXDoc.pdf
Another approach, since you're not using a rigid OXM approach anyway.
You might want to try using a less rigid parser such as JSoup?
This will stop immediate problems with invalid XML schemas etc, but it will just devolve the problem into your code.
Just to throw in a different approach to a solution:
You might wrap your input stream in a stream implementation that replaces the entities with something legal.
While this is a hack for sure, it should be a quick and easy solution (or better say: workaround).
Not as elegant and clean as a xml framework internal solution, though.
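A rough sketch of that workaround, simplified to operate on a String instead of wrapping a stream. The entity table is a small illustrative subset that I picked, not a complete HTML list, and the class and method names are made up. Known HTML names are rewritten as numeric character references, which any XML parser accepts; unknown names and the five XML built-ins are left alone. Note that a blind text replacement like this also touches CDATA sections and comments.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class EntityPreprocessor {
    // Illustrative subset of HTML entity names mapped to code points; extend as needed.
    private static final Map<String, Integer> HTML_ENTITIES = new HashMap<>();
    static {
        HTML_ENTITIES.put("nbsp", 0x00A0);
        HTML_ENTITIES.put("mdash", 0x2014);
        HTML_ENTITIES.put("ndash", 0x2013);
        HTML_ENTITIES.put("copy", 0x00A9);
    }
    private static final Pattern NAMED = Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    // Replaces known HTML entity names with numeric references such as &#160;.
    static String toNumericReferences(String xml) {
        Matcher m = NAMED.matcher(xml);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            Integer codePoint = HTML_ENTITIES.get(m.group(1));
            String replacement = (codePoint == null) ? m.group() : "&#" + codePoint + ";";
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    // Parses the cleaned text with the ordinary DOM parser and returns the text of <bar>.
    static String barText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(toNumericReferences(xml))));
            return doc.getElementsByTagName("bar").item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(barText("<foo><bar>Some&nbsp;text &mdash; invalid!</bar></foo>"));
    }
}
```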
I did something similar yesterday: I needed to add values from an unzipped XML stream to a database.
// imports (I'm not sure all of them are necessary :)
import java.io.FileInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.*;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
// I haven't re-checked this code because I'm at work; it should work, but
// you may need to make small changes
InputSource is = new InputSource(new FileInputStream("test.xml"));
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(is);
XPathFactory xpf = XPathFactory.newInstance();
XPath xpath = xpf.newXPath();
String words= xpath.evaluate("/foo/bar", doc.getDocumentElement());
ParsingHexToChar.parseToChar(words);
// lib which i use common-lang3.jar
//metod to parse
public static String parseToChar( String words){
String decode= org.apache.commons.lang3.StringEscapeUtils.unescapeHtml4(words);
return decode;
}
Try this using org.apache.commons package :
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder parser = dbf.newDocumentBuilder();
InputStream in = new FileInputStream(xmlfile);
String unescapeHtml4 = IOUtils.toString(in);
CharSequenceTranslator obj = new AggregateTranslator(new LookupTranslator(EntityArrays.ISO8859_1_UNESCAPE()),
new LookupTranslator(EntityArrays.HTML40_EXTENDED_UNESCAPE())
);
unescapeHtml4 = obj.translate(unescapeHtml4);
StringReader readerInput= new StringReader(unescapeHtml4);
InputSource is = new InputSource(readerInput);
Document doc = parser.parse(is);

Meaningful XML comparison

I am trying to achieve a meaningful XML comparison. I want to compare two different XML documents to find out whether they are 'meaningfully' equal.
Example XML 1:
<?xml version="1.0" encoding="UTF-8"?>
<al:moAttribute>
<al:name>impiId</al:name>
<al:value>616731935012345678</al:value>
</al:moAttribute>
<al:moAttribute>
<al:name>impuId</al:name>
<al:value>tel:+16167319350</al:value>
</al:moAttribute>
XML 2 :
<?xml version="1.0" encoding="UTF-8"?>
<al:moAttribute>
<al:name>impuId</al:name>
<al:value>tel:+16167319350</al:value>
</al:moAttribute>
<al:moAttribute>
<al:name>impiId</al:name>
<al:value>616731935012345678</al:value>
</al:moAttribute>
In this example both XML documents are 'meaningfully' equal and differ only in the sequence of elements. I want to compare them to find out whether they are almost equal.
I tried this solution :
Best way to compare 2 XML documents in Java
I tried :
XMLUnit.setIgnoreWhitespace(true);
diff.identical (...);
diff.similar (...);
But if the XML documents differ in element order, the comparison returns false.
Any suggestions please ?
Any tools at the XML level will assume that the order of elements is significant. If you know that in your particular vocabulary, the order of elements is not significant, then you need a tool that works with an understanding of your vocabulary. Your best bet is therefore to write a normalizing transformation (typically in XSLT) that removes irrelevant differences from the documents (for example, by sorting elements on some suitable key) so that they then compare equal when compared using standard XML tools (perhaps after XML canonicalisation).
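The answer suggests XSLT for the normalizing step; the sketch below does the same kind of normalization in plain Java instead, to stay with the JDK. It assumes the moAttribute entries are wrapped in a single root element (the samples in the question have none) and that each entry is fully described by its al:name and al:value, so every entry is reduced to a "name=value" key and the sorted keys are compared. The class and method names are made up.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class UnorderedAttributeCompare {
    // Reduces a document to a sorted list of "name=value" keys, one per al:moAttribute.
    static List<String> normalize(String xml) {
        try {
            // The default factory is not namespace-aware, so "al:" is treated as part of the tag name.
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList entries = doc.getElementsByTagName("al:moAttribute");
            List<String> keys = new ArrayList<>();
            for (int i = 0; i < entries.getLength(); i++) {
                Element entry = (Element) entries.item(i);
                String name = entry.getElementsByTagName("al:name").item(0).getTextContent().trim();
                String value = entry.getElementsByTagName("al:value").item(0).getTextContent().trim();
                keys.add(name + "=" + value);
            }
            Collections.sort(keys); // sorting removes the irrelevant ordering difference
            return keys;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    static boolean sameMeaning(String a, String b) {
        return normalize(a).equals(normalize(b));
    }

    public static void main(String[] args) {
        String first = "<r><al:moAttribute><al:name>impiId</al:name><al:value>1</al:value></al:moAttribute>"
            + "<al:moAttribute><al:name>impuId</al:name><al:value>2</al:value></al:moAttribute></r>";
        String second = "<r><al:moAttribute><al:name>impuId</al:name><al:value>2</al:value></al:moAttribute>"
            + "<al:moAttribute><al:name>impiId</al:name><al:value>1</al:value></al:moAttribute></r>";
        System.out.println(sameMeaning(first, second));
    }
}
```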
You can use JAXB to achieve your goal (example: http://www.mkyong.com/java/jaxb-hello-world-example/)
1 construct two Java objects from the two XML files using JAXB
2 each Java object then holds a list of the al:value entries for its XML file (this is all you care about)
3 compare those two lists; please refer to Simple way to find if two different lists contain exactly the same elements?
By doing this, you will overcome the order problem.
You may find xmlunit's RecursiveElementNameAndTextQualifier useful here. Here is a snippet
XMLUnit.setIgnoreWhitespace(true);
XMLUnit.setIgnoreComments(true);
XMLUnit.setIgnoreAttributeOrder(true);
Document docx1 = XMLUnit.buildDocument(..);
Document docx2 = XMLUnit.buildDocument(..);
Diff diff = new Diff(docx1, docx2);
DifferenceEngine engine = new DifferenceEngine(diff);
ElementQualifier qualifier = new RecursiveElementNameAndTextQualifier();
diff = new Diff(docx1, docx2, engine, qualifier);
diff.overrideDifferenceListener(new DifferenceListener()
{
@Override public int differenceFound(Difference difference)
{
//do something with difference
// return processDiff(difference);
return RETURN_ACCEPT_DIFFERENCE; // placeholder so the snippet compiles: accept every difference as reported
}
@Override public void skippedComparison(Node node, Node node1)
{
//no op
}
});
//check diff.identical() || diff.similar();
This works perfectly for me.
It shows the differences wherever there are changes.
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.net.URL;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.custommonkey.xmlunit.DetailedDiff;
import org.custommonkey.xmlunit.Diff;
import org.custommonkey.xmlunit.Difference;
import org.custommonkey.xmlunit.XMLUnit;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;
public class Xmlreader {
public static void main(String[] args) throws SAXException, IOException, ParserConfigurationException {
XMLUnit.setIgnoreWhitespace(true);
XMLUnit.setIgnoreComments(true);
XMLUnit.setIgnoreAttributeOrder(true);
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(true);
dbf.setCoalescing(true);
dbf.setIgnoringElementContentWhitespace(true);
dbf.setIgnoringComments(true);
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc1 = db.parse(new File("C:/Users/sravanlx/Desktop/base.xml"));
doc1.normalizeDocument();
Document doc2 = db.parse(new File("C:/Users/sravanlx/Desktop/base2.xml"));
/* URL url1 = Xmlreader.class.getResource("C:/Users/sravanlx/Desktop/base.xml");
URL url2 = Xmlreader.class.getResource("C:/Users/sravanlx/Desktop/base2.xml");
FileReader fr1 = null;
FileReader fr2 = null;
try {
fr1 = new FileReader("C:/Users/username/Desktop/base.xml");
fr2 = new FileReader("C:/Users/username/Desktop/base2.xml");
} catch (FileNotFoundException e) {
e.printStackTrace();
}*/
Diff diff = new Diff(doc1, doc2);
System.out.println("Similar? " + diff.similar());
System.out.println("Identical? " + diff.identical());
DetailedDiff detDiff = new DetailedDiff(diff);
List differences = detDiff.getAllDifferences();
for (Object object : differences) {
Difference difference = (Difference)object;
System.out.println("***********************");
System.out.println(difference);
System.out.println("***********************");
}
} }
I have solved this issue with an XSLT stylesheet, available on my GitHub, that performs an unordered tree comparison. Basically it outputs the matches and mismatches of any two XML files with regard to their position relative to the root of the tree.
For example:
<a>
<c/>
<e/>
</a>
And:
<a>
<e/>
<c/>
</a>
Would be treated as equal.
You just have to modify the file variable at the top of the sheet to choose which XML file to compare against.
https://github.com/sflynn1812/xslt-diff-turbo
From an efficiency perspective the speed of any tree comparison algorithm is determined by the number of differences in the two trees.
Currently to apply it to your example I would suggest stripping out the xml namespaces first, because that is not currently supported.

Parsing XML file In java using DOM

I want to parse XML file to read values of certain elements in the file.
<row>
<element>
<status>OK</status>
<duration>
<value>499</value>
<text>8 mins</text>
</duration>
<distance>
<value>3208</value>
<text>3.2 km</text>
</distance>
</element>
<element>
<status>OK</status>
<duration>
<value>501</value>
<text>8 mins</text>
</duration>
<distance>
<value>2869</value>
<text>2.9 km</text>
</distance>
</element>
<element>
<status>OK</status>
<duration>
<value>788</value>
<text>13 mins</text>
</duration>
<distance>
<value>6718</value>
<text>6.7 km</text>
</distance>
</element>
</row>
I want to be able to read the values of all "value" tags under "distance" tags. I have written the basic code to read XML data. I just need an idea of how to read the values of elements at the third level.
Simple DOM parsing is not the preferred way anymore, IMHO. Java comes with some mature frameworks that make parsing much easier. The following is my preference; others may think differently.
If the structure of your XML is fixed, you could build some JAXB-annotated POJOs and read your data with them. JAXB delivers complete object hierarchies filled with your XML values. It can also create XML data from objects.
If you don't know the structure of your XML data, or you stream XML data, then StAX parsing may be the way to go.
Anyway these frameworks take a lot of problems away from you like file encoding, syntax checking, type safety (JAXB), ....
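As a hedged illustration of the StAX option, here is a sketch using the cursor API from the JDK. It takes the XML as a string for brevity (a file stream works the same way), and the class and method names are invented. It remembers whether the reader is inside a distance element and collects the text of each value element found there.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxDistanceValues {
    // Collects the text of every <value> that sits inside a <distance>.
    static List<String> distanceValues(String xml) {
        List<String> values = new ArrayList<>();
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            boolean inDistance = false;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if (name.equals("distance")) {
                        inDistance = true;
                    } else if (inDistance && name.equals("value")) {
                        // getElementText() reads up to the matching end tag.
                        values.add(reader.getElementText().trim());
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && reader.getLocalName().equals("distance")) {
                    inDistance = false;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return values;
    }

    public static void main(String[] args) {
        String xml = "<row><element><duration><value>499</value></duration>"
            + "<distance><value>3208</value><text>3.2 km</text></distance></element></row>";
        System.out.println(distanceValues(xml));
    }
}
```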
If you want to use DOM parsing, then you could use XPath to shorten your requests dramatically:
XPath xPath = XPathFactory.newInstance().newXPath();
InputSource source = new InputSource(ParseUsingXPath.class.getResourceAsStream("data.xml"));
NodeList list = (NodeList) xPath.evaluate("/row/element/distance/value", source, XPathConstants.NODESET);
for (int i=0;i<list.getLength();i++) {
System.out.println(list.item(i).getTextContent());
}
and it outputs:
3208
2869
6718
Here I read your XML directly as an InputSource from a classpath resource. You could also use XPath with a DOM document object to run searches across the whole document.
Have you already used an XSD definition of the XML? With that and JAXB you can easily unmarshal the XML into a Java object.
Below code might help you to access value element.
import java.io.FileInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
public class DOMParsarDemo {
protected DocumentBuilder docBuilder;
protected Element root;
public DOMParsarDemo() throws Exception {
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
docBuilder = dbf.newDocumentBuilder();
}
public void parse(String file) throws Exception {
Document doc = docBuilder.parse(new FileInputStream(file));
root = doc.getDocumentElement();
System.out.println("root element is :" + root.getNodeName());
}
public void printAllElements() throws Exception {
printElement(root);
}
public void printElement(Node node) {
if (node.getNodeType() != Node.TEXT_NODE) {
Node child = node.getFirstChild();
while (child != null) {
if (node.getNodeName().equals("distance")) {
if (child.getNodeName().equals("value")) {
System.out.println(child.getFirstChild().getNodeValue());
}
}
printElement(child);
child = child.getNextSibling();
}
}
}
public static void main(String args[]) throws Exception {
DOMParsarDemo demo = new DOMParsarDemo();
demo.parse("resources/abc.xml");
demo.printAllElements();
}
}
output:
root element is :row
3208
2869
6718

Parsing xml file contents without knowing xml file structure

I've been working on learning some new tech using Java to parse files, and for the most part it's going well. However, I'm at a loss as to how I could parse an XML file whose structure is not known upon receipt. There are lots of examples of how to do so if you know the structure (getElementsByTagName seems to be the way to go), but no dynamic options, at least not that I've found.
So the tl;dr version of this question: how can I parse an XML file where I cannot rely on knowing its structure?
Well, the parsing part is easy; as helderdarocha stated in the comments, the parser only requires well-formed XML, it does not care about the structure. You can use Java's standard DocumentBuilder to obtain a Document:
InputStream in = new FileInputStream(...);
Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
(If you're parsing multiple documents, you can keep reusing the same DocumentBuilder.)
Then you can start with the root document element and use familiar DOM methods from there on out:
Element root = doc.getDocumentElement(); // perform DOM operations starting here.
As for processing it, well it really depends on what you want to do with it, but you can use the methods of Node like getFirstChild() and getNextSibling() to iterate through children and process as you see fit based on structure, tags, and attributes.
Consider the following example:
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
public class XML {
public static void main (String[] args) throws Exception {
String xml = "<objects><circle color='red'/><circle color='green'/><rectangle>hello</rectangle><glumble/></objects>";
// parse
InputStream in = new ByteArrayInputStream(xml.getBytes("utf-8"));
Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
// process
Node objects = doc.getDocumentElement();
for (Node object = objects.getFirstChild(); object != null; object = object.getNextSibling()) {
if (object instanceof Element) {
Element e = (Element)object;
if (e.getTagName().equalsIgnoreCase("circle")) {
String color = e.getAttribute("color");
System.out.println("It's a " + color + " circle!");
} else if (e.getTagName().equalsIgnoreCase("rectangle")) {
String text = e.getTextContent();
System.out.println("It's a rectangle that says \"" + text + "\".");
} else {
System.out.println("I don't know what a " + e.getTagName() + " is for.");
}
}
}
}
}
The input XML document (hard-coded for example) is:
<objects>
<circle color='red'/>
<circle color='green'/>
<rectangle>hello</rectangle>
<glumble/>
</objects>
The output is:
It's a red circle!
It's a green circle!
It's a rectangle that says "hello".
I don't know what a glumble is for.
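The loop above only visits the direct children of the root. When the depth is unknown as well, a recursive walk is one option. The sketch below (class and method names are made up) records a path for every element, lists its attributes, and appends the text of leaf elements, without assuming any tag names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DomWalker {
    // Returns one line per element, in document order.
    static List<String> describe(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            List<String> lines = new ArrayList<>();
            walk(doc.getDocumentElement(), "", lines);
            return lines;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    private static void walk(Element element, String parentPath, List<String> lines) {
        String path = parentPath + "/" + element.getTagName();
        StringBuilder line = new StringBuilder(path);
        NamedNodeMap attributes = element.getAttributes();
        for (int i = 0; i < attributes.getLength(); i++) {
            Node attribute = attributes.item(i);
            line.append(" @").append(attribute.getNodeName())
                .append('=').append(attribute.getNodeValue());
        }
        // Collect child elements, skipping text, comments and other node types.
        List<Element> children = new ArrayList<>();
        for (Node child = element.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                children.add((Element) child);
            }
        }
        if (children.isEmpty()) {
            String text = element.getTextContent().trim();
            if (!text.isEmpty()) {
                line.append(" = ").append(text);
            }
        }
        lines.add(line.toString());
        for (Element child : children) {
            walk(child, path, lines);
        }
    }

    public static void main(String[] args) {
        for (String line : describe("<objects><circle color='red'/><rectangle>hello</rectangle></objects>")) {
            System.out.println(line);
        }
    }
}
```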
