Algorithm for identifying differences in XML documents - java

I'm trying to create a program in Java that takes two XML files (one is an updated version of the other) and takes them into main memory. It will then compare the files and count the number of differences between each corresponding node from the two (excluding white space). Later on the program will do more with the differences but I'm just confused on how to start comparing nodes from two separate files. Any suggestions would be much appreciated.

My first suggestion is that you could use XMLUnit:
Reader expected=new FileReader(...);
Reader tested=new FileReader(...);
Diff diff=XMLUnit.compareXML(expected, tested);

For an algorithm that computes signatures (hashes) at each node to facilitate comparison, see Detecting Changes in XML Documents.
For change detection on XML documents where element ordering is insignificant, see X-Diff: An Effective Change Detection Algorithm for XML Documents. Java and C++ implementations of the X-Diff algorithm are available.

It depends if you have differences of nodes, or differences inside nodes.
This code extract all nodes, and their paths,
and value inside
Assuming, you have two xml Documents:
XPath xPath = XPathFactory.newInstance().newXPath();
//Every nodes
expression="//*";
NodeList nodes = (NodeList) xPath.compile(expression).evaluate(document, XPathConstants.NODESET);
// iterate them all
for(int i=0; i<nodes.getLength(); i++)
{
Node the_node = nodes.item(i);
if(the_node instanceof Element)
{
Element the_element=(Element) the_node;
// PATH
String path ="";
Node noderec = the_node;
while( noderec != null)
{
if (path.equals("")) path = noderec.getNodeName();
else
path = noderec.getNodeName() + '/' + path;
noderec = noderec.getParentNode();
if (noderec==document){path="//"+path; noderec=null;}
}
System.out.println( "PATH:"+path );
System.out.println("CONTENT="+the_element.getTextContent());
}
}
PATH : gives you the path
CONTENT: sub content of the node
With that, you get all the pathes of your xml: you can compare one by one, sort, and use others algorithms to find if something is inserted, ...
And inside each node, you can make another comparisons.
Hope it helps

Related

DOM Parser wrong childNodes Count

This is strange but let me try my best to put it accross.
I have a XML which i am reading through the normal way from desktop and parsing it through DOM parser.
<?xml version="1.0" encoding="UTF-8"?>
<Abase
xmlns="www.abc.com/Events/Abase.xsd">
<FVer>0</FVer>
<DV>abc App</DV>
<DP>abc Wallet</DP>
<Dversion>11</Dversion>
<sigID>Ss22</sigID>
<activity>Adding New cake</activity>
</Abase>
Reading the XML to get the childs.
Document doc = docBuilder.parse("C://Users//Desktop//abc.xml");
Node root = doc.getElementsByTagName("Abase").item(0);
NodeList listOfNodes = root.getChildNodes(); //Sysout Prints 13
So here my logic works well.When am trying to do by pushing the same XML to a queue and read it and get the child nodes it gives me no. of child nodes is 6.
Document doc=docBuilder.parse(new InputSource(new ByteArrayInputStream(msg.getBytes("UTF-8"))));
Node root = doc.getElementsByTagName("Abase").item(0);
NodeList listOfNodes = root.getChildNodes(); //Sysout Prints 6
this screws my logic of parsing the XML.Can anyone help me out?
UPDATE
Adding sending logic :
javax.jms.TextMessage tmsg = session.createTextMessage();
tmsg.setText(inp);
sender.send(tmsg);
PROBLEM
If i read this xml from desktop it says 13 childs, 6 element node and 7 text nodes.The Common Logic is :
Read all the childs and iterate through list of child items.
If node ISNOT text node get inside if block,add one parent element with two child and append to existing ROOT.Then get NodeName and get TextContext between the element node and push them as setTextContext for both the childs respectively.
So i have a fresh ELEMENT NODE now which have two childs .And as i dont need the already existing element node now which are still the childs of root,Lastly am removing them.
So the above logic is all screwed if i am pushing the XML to queue and areading it for doing the same logic.
OUTPUT XML which is coming good when i read from desktop,but reading from queue is having problem, because it screw the complete tree.
<Abase
xmlns="www.abc.com/Events/Abase.xsd">
<Prop>
<propName>FVer</propName>
<propName>0</propName> //similarly for other nodes
</Prop>
</Abase>
Thanks
Well, there are 13 children if whitespace text nodes are included, but only 6 if whitespace text nodes are dropped. So there's some difference in the way the tree has been built between the two cases, that affects whether whitespace text nodes are retained or not.
The document under "Output XML" means that there is something wrong on the sender side. My guess would by that inp isn't a String but some kind of object and setText(inp) doesn't call inp.toString() but instead triggers some kind of serialization code which produces this odd XML that you're seeing.

How to extract xml tag value without using the tag name in java?

I am using java.I have an xml file which looks like this:
<?xml version="1.0"?>
<personaldetails>
<phno>1553294232</phno>
<email>
<official>xya#gmail.com</official>
<personal>bk#yahoo.com</personal>
</email>
</personaldetails>
Now,I need to check each of the tag values for its type using specific conditions,and put them in separate files.
For example,in the above file,i write conditions like 10 digits equals phone number,
something in the format of xxx#yy.com is an email..
So,what i need to do is i need to extract the tag values in each tag and if it matches a certain condition,it is put in the first text file,if not in the second text file.
in that case,the first text file will contain:
1553294232
xya#gmail.com
bk#yahoo.com
and the rest of the values in the second file.
i just don't know how to extract the tag values without using the tag name.(or without using GetElementsByTagName).
i mean this code should extract the email bk#yahoo.com even if i give <mailing> instead of <personal> tag.It should not depend on the tag name.
Hope i am not confusing.I am new to java using xml.So,pardon me if my question is silly.
Please Help.
Seems like a typical use case for XPath
XPath allows you to query XML in a very flexible way.
This tutorial could help:
http://www.javabeat.net/2009/03/how-to-query-xml-using-xpath/
If you're using Java script, which could to be the case, since you mention getElementsByTagName(), you could just use JQuery selectors, it will give you a consistent behavior across browsers, and JQuery library is useful for a lot of other things, if you are not using it already... http://api.jquery.com/category/selectors/
Here for example is information on this:
http://www.switchonthecode.com/tutorials/xml-parsing-with-jquery
Since you don't know your element name, I would suggest creating a DOM tree and iterating through it. As and when you get a element, you would try to match it against your ruleset (and I would suggest using regex for this purpose) and then write it to your a file.
This would be a sample structure to help you get started, but you would need to modify it based on your requirement:
public void parseXML(){
try{
DocumentBuilder documentBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc;
doc = documentBuilder.parse(new File("test.xml"));
getData(null, doc.getDocumentElement());
}catch(Exception exe){
exe.printStackTrace();
}
}
private void getData(Node parentNode, Node node){
switch(node.getNodeType()){
case Node.ELEMENT_NODE:{
if(node.hasChildNodes()){
NodeList list = node.getChildNodes();
int size = list.getLength();
for(int index = 0; index < size; index++){
getData(node, list.item(index));
}
}
break;
}
case Node.TEXT_NODE:{
String data = node.getNodeValue();
if(data.trim().length() > 0){
/*
* Here you need to check the data against your ruleset and perform your operation
*/
System.out.println(parentNode.getNodeName()+" :: "+node.getNodeValue());
}
break;
}
}
}
You might want to look at the Chain of Responsibility design pattern to design your ruleset.

java: unescaped quotes terminate xml text node value

I'm writing an android app in java. The app emulates flashcards, with questions on one side and answers on the other.
I am presently slurping a well-formed (as I believe) .xml document (which is produced by a Qt-based program which has no problem reading the output back in) using the following (fairly standard) code:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
try
{
DocumentBuilder builder = factory.newDocumentBuilder();
Document dom = builder.parse(new File(diskLocation));
Element pack = dom.getDocumentElement();
NodeList flashCards = pack.getElementsByTagName("flashcard");
for (int i=0; i < flashCards.getLength(); i++)
{
FlashCard flashCard = new FlashCard();
Node cardNode = flashCards.item(i);
NodeList cardProperties = cardNode.getChildNodes();
for (int j=0;j<cardProperties.getLength();j++)
{
Node cardProperty = cardProperties.item(j);
String propertyName = cardProperty.getNodeName();
if (propertyName.equalsIgnoreCase("Question"))
{
flashCard.setQuestion(cardProperty.getFirstChild().getNodeValue());
}
else if (propertyName.equalsIgnoreCase("Answer"))
{
flashCard.setAnswer(cardProperty.getFirstChild().getNodeValue());
}
else if
...etc.
Here is a flashcard for learning xml:
<flashcard>
<Question>What is the entity reference for ' " '?</Question>
<Answer>&quot;</Answer>
<Info></Info>
<Hint></Hint>
<KnownLevel>1</KnownLevel>
<LastCorrect>1</LastCorrect>
<CurrentStreak>4</CurrentStreak>
<LevelUp>4</LevelUp>
<AnswerTime>0</AnswerTime>
</flashcard>
As I understand the standard, '<' and '&' need to be escaped ('>' probably should be), but quotes and apostrophes don't (unless they're in attributes), yet when the question and answer for this card are parsed, they come out as What is the entity reference for ' and & respectively;
The input seems to follow standards. Is the java XMLDom implementation really not standards-compliant, or am I missing something?
I find it very difficult to believe I'm the only one to have (had) this problem, yet I've searched both google and stack overflow and found surprisingly little of direct relevance.
Thank you for any help!
Rob
Edit: I've just realised the file has a !DOCTYPE, but doesn't start with an <?xml tag.
I wonder if this makes any difference.
From the standard:
In the content of elements, character data is any string of characters which does not contain the start-delimiter of any markup
which means that either ' or " MUST be escaped in the content of elements.

Java appending XML docs to existing docs

I have two XML docs that I've created and I want to combine these two inside of a new envelope. So I have
<alert-set>
<warning>National Weather Service...</warning>
<start-date>5/19/2009</start-date>
<end-date>5/19/2009</end-date>
</alert-set>
and
<weather-set>
<chance-of-rain type="percent">31</chance-of-rain>
<conditions>Partly Cloudy</conditions>
<temperature type="Fahrenheit">78</temperature>
</weather-set>
What I'd like to do is combine the two inside a root node: < DataSet> combined docs < /DataSet>
I've tried creating a temporary doc and replacing children with the root nodes of the documents:
<DataSet>
<blank/>
<blank/>
</DataSet>
And I was hoping to replace the two blanks with the root elements of the two documents but I get "WRONG_DOCUMENT_ERR: A node is used in a different document than the one that created it." I tried adopting and importing the root nodes but I get the same error.
Is there not some easy way of combining documents without having to read through and create new elements for each node?
EDIT: Sample code snippets
Just trying to move one to the "blank" document for now... The importNode and adoptNode functions cannot import/adopt Document nodes, but they can't import the element node and its subtree... or if it does, it does not seem to work for appending/replacing still.
Document xmlDoc; //created elsewhere
Document weather = getWeather(latitude, longitude);
Element weatherRoot = weather.getDocumentElement();
Node root = xmlDoc.getDocumentElement();
Node adopt = weather.adoptNode(weatherRoot);
Node imported = weather.importNode(weatherRoot, true);
Node child = root.getFirstChild();
root.replaceChild(adopt, child); //initially tried replacing the <blank/> elements
root.replaceChild(imported, child);
root.appendChild(adopt);
root.appendChild(imported);
root.appendChild(adopt.cloneNode(true));
All of these throw the DOMException: WRONG_DOCUMENT_ERR: A node is used in a different document than the one that created it.
I think I'll have to figure out how to use stax or just reread the documents and create new elements... That kinda seems like too much work just to combine documents, though.
It's a bit tricky, but the following example runs:
public static void main(String[] args) {
DocumentImpl doc1 = new DocumentImpl();
Element root1 = doc1.createElement("root1");
Element node1 = doc1.createElement("node1");
doc1.appendChild(root1);
root1.appendChild(node1);
DocumentImpl doc2 = new DocumentImpl();
Element root2 = doc2.createElement("root2");
Element node2 = doc2.createElement("node2");
doc2.appendChild(root2);
root2.appendChild(node2);
DocumentImpl doc3 = new DocumentImpl();
Element root3 = doc3.createElement("root3");
doc3.appendChild(root3);
// root3.appendChild(root1); // Doesn't work -> DOMException
root3.appendChild(doc3.importNode(root1, true));
// root3.appendChild(root2); // Doesn't work -> DOMException
root3.appendChild(doc3.importNode(root2, true));
}
I know you got the issue solved already, but I still wanted to take a stab at this problem using the XOM library that I'm currently testing out (related to this question), and while doing that, offer a different approach than that of Andreas_D's answer.
(To simplify this example, I put your <alert-set> and <weather-set> into separate files, which I read into nu.xom.Document instances.)
import nu.xom.*;
[...]
Builder builder = new Builder();
Document alertDoc = builder.build(new File("src/xomtest", "alertset.xml"));
Document weatherDoc = builder.build(new File("src/xomtest", "weatherset.xml"));
Document mainDoc = builder.build("<DataSet><blank/><blank/></DataSet>", "");
Element root = mainDoc.getRootElement();
root.replaceChild(
root.getFirstChildElement("blank"), alertDoc.getRootElement().copy());
root.replaceChild(
root.getFirstChildElement("blank"), weatherDoc.getRootElement().copy());
The key is to make a copy of the elements to be inserted into mainDoc; otherwise you'll get a complain that "child already has a parent".
Outputting mainDoc now gives:
<?xml version="1.0" encoding="UTF-8"?>
<DataSet>
<alert-set>
<warning>National Weather Service...</warning>
<start-date>5/19/2009</start-date>
<end-date>5/19/2009</end-date>
</alert-set>
<weather-set>
<chance-of-rain type="percent">31</chance-of-rain>
<conditions>Partly Cloudy</conditions>
<temperature type="Fahrenheit">78</temperature>
</weather-set>
</DataSet>
To my delight, this turned out to be very straight-forward to do with XOM. It only took a few minutes to write this, even though I'm definitely not very experienced with the library yet. (It would have been even easier without the <blank/> elements, i.e., starting with simply <DataSet></DataSet>.)
So, unless you have compelling reasons for using only the standard JDK tools, I warmly recommend trying out XOM as it can make XML handling in Java much more pleasant.

How do I copy DOM nodes from one document to another in Java?

I'm having trouble with copying nodes from one document to another one. I've used both the adoptNode and importNode methods from Node but they don't work. I've also tried appendChild but that throws an exception. I'm using Xerces. Is this not implemented there? Is there another way to do this?
List<Node> nodesToCopy = ...;
Document newDoc = ...;
for(Node n : nodesToCopy) {
// this doesn't work
newDoc.adoptChild(n);
// neither does this
//newDoc.importNode(n, true);
}
The problem is that Node's contain a lot of internal state about their context, which includes their parentage and the document by which they are owned. Neither adoptChild() nor importNode() place the new node anywhere in the destination document, which is why your code is failing.
Since you want to copy the node and not move it from one document to another there are three distinct steps you need to take...
Create the copy
Import the copied node into the destination document
Place the copied into it's correct position in the new document
for(Node n : nodesToCopy) {
// Create a duplicate node
Node newNode = n.cloneNode(true);
// Transfer ownership of the new node into the destination document
newDoc.adoptNode(newNode);
// Make the new node an actual item in the target document
newDoc.getDocumentElement().appendChild(newNode);
}
The Java Document API allows you to combine the first two operations using importNode().
for(Node n : nodesToCopy) {
// Create a duplicate node and transfer ownership of the
// new node into the destination document
Node newNode = newDoc.importNode(n, true);
// Make the new node an actual item in the target document
newDoc.getDocumentElement().appendChild(newNode);
}
The true parameter on cloneNode() and importNode() specifies whether you want a deep copy, meaning to copy the node and all it's children. Since 99% of the time you want to copy an entire subtree, you almost always want this to be true.
adoptChild does not create a duplicate, it just moves the node to another parent.
You probably want the cloneNode() method.

Categories

Resources