java: unescaped quotes terminate xml text node value


I'm writing an Android app in Java. The app emulates flashcards, with questions on one side and answers on the other.
I am currently reading a (as far as I can tell) well-formed .xml document, produced by a Qt-based program that has no problem reading its own output back in, using the following fairly standard code:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
try
{
    DocumentBuilder builder = factory.newDocumentBuilder();
    Document dom = builder.parse(new File(diskLocation));
    Element pack = dom.getDocumentElement();
    NodeList flashCards = pack.getElementsByTagName("flashcard");
    for (int i = 0; i < flashCards.getLength(); i++)
    {
        FlashCard flashCard = new FlashCard();
        Node cardNode = flashCards.item(i);
        NodeList cardProperties = cardNode.getChildNodes();
        for (int j = 0; j < cardProperties.getLength(); j++)
        {
            Node cardProperty = cardProperties.item(j);
            String propertyName = cardProperty.getNodeName();
            if (propertyName.equalsIgnoreCase("Question"))
            {
                flashCard.setQuestion(cardProperty.getFirstChild().getNodeValue());
            }
            else if (propertyName.equalsIgnoreCase("Answer"))
            {
                flashCard.setAnswer(cardProperty.getFirstChild().getNodeValue());
            }
            else if
            ...etc.
Here is a flashcard for learning xml:
<flashcard>
<Question>What is the entity reference for ' " '?</Question>
<Answer>&quot;</Answer>
<Info></Info>
<Hint></Hint>
<KnownLevel>1</KnownLevel>
<LastCorrect>1</LastCorrect>
<CurrentStreak>4</CurrentStreak>
<LevelUp>4</LevelUp>
<AnswerTime>0</AnswerTime>
</flashcard>
As I understand the standard, '<' and '&' need to be escaped ('>' probably should be), but quotes and apostrophes don't (unless they're in attributes). Yet when the question and answer for this card are parsed, they come out as "What is the entity reference for '" and "&" respectively.
The input seems to follow the standard. Is the Java XML DOM implementation really not standards-compliant, or am I missing something?
I find it very difficult to believe I'm the only one to have (had) this problem, yet I've searched both google and stack overflow and found surprisingly little of direct relevance.
Thank you for any help!
Rob
Edit: I've just realised the file has a !DOCTYPE, but doesn't start with an <?xml ... ?> declaration.
I wonder if this makes any difference.

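From the standard:
In the content of elements, character data is any string of characters which does not contain the start-delimiter of any markup
The start-delimiters of markup are '<' and '&', so those are the only characters that must be escaped in element content; ' and " only need escaping inside an attribute value that is quoted with the same character. Your input is therefore fine. A more likely cause is that the parser delivers the element's text as several adjacent text nodes, in which case getFirstChild().getNodeValue() returns only the first piece.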
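One way to make the reading code independent of how the parser splits character data is to ask for the element's whole text content instead of its first child. The following is only a minimal sketch, assuming the standard javax.xml.parsers API and DOM Level 3's getTextContent(); the class FlashCardText and the method textOf are hypothetical names, not part of the question's code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.SAXException;

// Hypothetical helper class, for illustration only.
class FlashCardText {

    // Parses an XML string and returns the full text content of the first
    // element with the given tag name, or "" if there is no such element.
    static String textOf(String xml, String tagName) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document dom = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Element first = (Element) dom.getElementsByTagName(tagName).item(0);
            // getTextContent() concatenates every text node under the element,
            // so it does not matter how the parser split the character data.
            return first == null ? "" : first.getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<flashcard>"
                + "<Question>What is the entity reference for ' \" '?</Question>"
                + "<Answer>&quot;</Answer>"
                + "</flashcard>";
        System.out.println(textOf(xml, "Question"));
        System.out.println(textOf(xml, "Answer"));
    }
}
```

In the question's loop this would amount to calling cardProperty.getTextContent() where it currently calls cardProperty.getFirstChild().getNodeValue().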

Related

Java 9, INVALID_CHARACTER_ERR when trying to add URL as element in XML

I'm working with XML for the first time, trying to generate XML to send over to a client and I'm having a hell of a time doing it. Whenever I try to pass a URL, I get an INVALID_CHARACTER_ERR and nothing I've tried so far works.
I tried using replacements like & #123; and so on for the curly braces, and tried escaping everything that wasn't a letter, resulting in the abomination under my code. It seems to throw the error if I have any kind of character that isn't a letter. Another thing that I noticed is that the document's InputEncoding is null, but that seems to be because I'm creating it in code, does that mean that it actually doesn't have an encoding type? I haven't been able to find an easy way to set it either.
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document orders = dBuilder.newDocument();
Element order = orders.createElement("{https://secure.targeturl.com/foo/bar}tagpayload");
Element tOrder = orders.createElement("tagorder");
order.appendChild(tOrder);
Element header = orders.createElement("orderheader");
tOrder.appendChild(header);
Element billto = orders.createElement("billto");
header.appendChild(billto);
"& #123;https& #58;& #47;& #47;secure& #46;targeturl& #46;com/foo& #47;bar& #125;tagpayload"

This is not the correct way to create a namespaced element:
Element order = orders.createElement("{https://secure.targeturl.com/foo/bar}tagpayload");
Instead, use the createElementNS method:
Element order = orders.createElementNS("https://secure.targeturl.com/foo/bar", "tagpayload");
You are seeing an exception because { is not a legal character in an XML element name. createElement has no awareness of namespaces or the “{uri}name” namespace notation.
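As a rough illustration of the createElementNS approach, here is a self-contained sketch. The class name NamespacedOrder and the method buildOrder are made up for the example; leaving the child element without a namespace mirrors the question's code and is an assumption, since whether the children should also be namespaced depends on what the client expects.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical class name, for illustration only.
class NamespacedOrder {

    // Builds a document whose root element lives in the target namespace.
    static Element buildOrder() {
        Document orders;
        try {
            orders = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No DOM implementation available", e);
        }
        // The namespace URI and the element name are passed separately.
        Element order = orders.createElementNS(
                "https://secure.targeturl.com/foo/bar", "tagpayload");
        orders.appendChild(order);

        // Created with createElement, so this child is in no namespace
        // (an assumption carried over from the question's code).
        Element tOrder = orders.createElement("tagorder");
        order.appendChild(tOrder);
        return order;
    }

    public static void main(String[] args) {
        Element order = buildOrder();
        System.out.println(order.getNamespaceURI() + " / " + order.getLocalName());
    }
}
```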

Algorithm for identifying differences in XML documents

I'm trying to create a program in Java that takes two XML files (one is an updated version of the other) and takes them into main memory. It will then compare the files and count the number of differences between each corresponding node from the two (excluding white space). Later on the program will do more with the differences but I'm just confused on how to start comparing nodes from two separate files. Any suggestions would be much appreciated.

My first suggestion is that you could use XMLUnit:
Reader expected=new FileReader(...);
Reader tested=new FileReader(...);
Diff diff=XMLUnit.compareXML(expected, tested);

For an algorithm that computes signatures (hashes) at each node to facilitate comparison, see Detecting Changes in XML Documents.
For change detection on XML documents where element ordering is insignificant, see X-Diff: An Effective Change Detection Algorithm for XML Documents. Java and C++ implementations of the X-Diff algorithm are available.

It depends on whether you have differences between nodes or differences inside nodes.
This code extracts all nodes, with their paths and the values inside.
Assuming you have two XML Documents (run this on each of them; document is one of the two):
XPath xPath = XPathFactory.newInstance().newXPath();
// Select every element in the document
String expression = "//*";
NodeList nodes = (NodeList) xPath.compile(expression).evaluate(document, XPathConstants.NODESET);
// Iterate over them all
for (int i = 0; i < nodes.getLength(); i++)
{
    Node the_node = nodes.item(i);
    if (the_node instanceof Element)
    {
        Element the_element = (Element) the_node;
        // PATH: walk up through the parents until the document node is reached
        String path = "";
        Node noderec = the_node;
        while (noderec != null)
        {
            if (path.equals("")) path = noderec.getNodeName();
            else
                path = noderec.getNodeName() + '/' + path;
            noderec = noderec.getParentNode();
            if (noderec == document) { path = "//" + path; noderec = null; }
        }
        System.out.println("PATH:" + path);
        System.out.println("CONTENT=" + the_element.getTextContent());
    }
}
PATH gives you the path of the node.
CONTENT gives you the text content of the node, including its descendants.
With that, you get all the paths in your XML: you can compare them one by one, sort them, and use other algorithms to find out whether something was inserted, ...
Inside each node, you can then make further comparisons.
Hope it helps
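To make the "compare the paths one by one" idea concrete, here is a small sketch that uses only the JDK. It is not the code from the answer above: it swaps the XPath query for a recursive walk, indexes repeated siblings so paths are unique, and compares only each element's own trimmed text. The class XmlPathDiff and its methods are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;
import org.xml.sax.SAXException;

// Hypothetical class, for illustration only.
class XmlPathDiff {

    // Maps an indexed path such as /a/c[1] to the element's own trimmed text.
    static Map<String, String> paths(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            Map<String, String> out = new HashMap<>();
            collect(root, "/" + root.getNodeName(), out);
            return out;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    private static void collect(Element element, String path, Map<String, String> out) {
        StringBuilder text = new StringBuilder();
        Map<String, Integer> seen = new HashMap<>(); // sibling counters per tag name
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child instanceof Element) {
                int index = seen.merge(child.getNodeName(), 1, Integer::sum);
                collect((Element) child, path + "/" + child.getNodeName() + "[" + index + "]", out);
            } else if (child instanceof Text) {
                text.append(((Text) child).getData());
            }
        }
        out.put(path, text.toString().trim()); // ignore surrounding whitespace
    }

    // Counts paths that are missing on one side or carry different text.
    static int countDifferences(String oldXml, String newXml) {
        Map<String, String> a = paths(oldXml);
        Map<String, String> b = paths(newXml);
        Set<String> all = new TreeSet<>(a.keySet());
        all.addAll(b.keySet());
        int differences = 0;
        for (String path : all) {
            if (!Objects.equals(a.get(path), b.get(path))) {
                differences++;
            }
        }
        return differences;
    }

    public static void main(String[] args) {
        System.out.println(countDifferences(
                "<a><b>1</b><c>2</c></a>",
                "<a><b>1</b>\n<c>3</c><d/></a>"));
    }
}
```

Comparing by position like this treats a reordered sibling as a change; for order-insensitive comparison the X-Diff approach mentioned earlier is the better fit.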

How to extract xml tag value without using the tag name in java?

I am using Java. I have an XML file which looks like this:
<?xml version="1.0"?>
<personaldetails>
<phno>1553294232</phno>
<email>
<official>xya#gmail.com</official>
<personal>bk#yahoo.com</personal>
</email>
</personaldetails>
Now, I need to check each of the tag values for its type using specific conditions, and put them in separate files.
For example, for the above file I write conditions like: 10 digits is a phone number,
something in the format xxx#yy.com is an email.
So what I need to do is extract the value in each tag, and if it matches a certain condition it is put in the first text file; if not, in the second text file.
In that case, the first text file will contain:
1553294232
xya#gmail.com
bk#yahoo.com
and the rest of the values go in the second file.
I just don't know how to extract the tag values without using the tag name (or without using getElementsByTagName).
I mean this code should extract the email bk#yahoo.com even if I give <mailing> instead of the <personal> tag. It should not depend on the tag name.
I hope I am not being confusing. I am new to using XML in Java, so pardon me if my question is silly.
Please help.

Seems like a typical use case for XPath
XPath allows you to query XML in a very flexible way.
This tutorial could help:
http://www.javabeat.net/2009/03/how-to-query-xml-using-xpath/
If you're using JavaScript, which could be the case since you mention getElementsByTagName(), you could just use jQuery selectors. They give you consistent behavior across browsers, and the jQuery library is useful for a lot of other things if you are not using it already: http://api.jquery.com/category/selectors/
Here for example is information on this:
http://www.switchonthecode.com/tutorials/xml-parsing-with-jquery

Since you don't know your element names, I would suggest building a DOM tree and iterating through it. Whenever you reach a text value, you would try to match it against your ruleset (I would suggest using regex for this purpose) and then write it to the appropriate file.
This is a sample structure to help you get started, but you would need to modify it based on your requirements:
public void parseXML(){
    try{
        DocumentBuilder documentBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc;
        doc = documentBuilder.parse(new File("test.xml"));
        getData(null, doc.getDocumentElement());
    }catch(Exception exe){
        exe.printStackTrace();
    }
}
private void getData(Node parentNode, Node node){
    switch(node.getNodeType()){
        case Node.ELEMENT_NODE:{
            if(node.hasChildNodes()){
                NodeList list = node.getChildNodes();
                int size = list.getLength();
                for(int index = 0; index < size; index++){
                    getData(node, list.item(index));
                }
            }
            break;
        }
        case Node.TEXT_NODE:{
            String data = node.getNodeValue();
            if(data.trim().length() > 0){
                /*
                 * Here you need to check the data against your ruleset and perform your operation
                 */
                System.out.println(parentNode.getNodeName()+" :: "+node.getNodeValue());
            }
            break;
        }
    }
}
You might want to look at the Chain of Responsibility design pattern to design your ruleset.
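The ruleset check left as a comment in getData could be sketched roughly as below. Everything here is an assumption made for illustration: the class ValueRules, the two regular expressions, and the choice to accept '#' as the separator in addresses (taken from the question's sample data; real addresses would use '@', which the pattern also accepts).

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Hypothetical ruleset, for illustration only.
class ValueRules {

    // Exactly ten digits counts as a phone number.
    private static final Pattern PHONE = Pattern.compile("\\d{10}");

    // Something like xxx#yy.com (or xxx@yy.com) counts as an email address.
    private static final Pattern EMAIL =
            Pattern.compile("[\\w.-]+[#@][\\w-]+(\\.[\\w-]+)+");

    // Each rule is tried in turn, in the spirit of a chain of responsibility.
    private static final List<Predicate<String>> RULES = Arrays.asList(
            value -> PHONE.matcher(value).matches(),
            value -> EMAIL.matcher(value).matches());

    // Returns true if any rule accepts the trimmed value.
    static boolean matchesAnyRule(String value) {
        String trimmed = value.trim();
        for (Predicate<String> rule : RULES) {
            if (rule.test(trimmed)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        for (String value : new String[] {"1553294232", "xya#gmail.com", "Partly Cloudy"}) {
            System.out.println(value + " -> "
                    + (matchesAnyRule(value) ? "first file" : "second file"));
        }
    }
}
```

Inside the TEXT_NODE branch, the result of matchesAnyRule(data) would then decide which of the two output files the value is written to.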

Parsing XML with apostrophe

Taking the BBC News RSS feed for example, one of their news items is as follows:
<item><title>Pupils 'bullied on sports field'</title><description>bla bla..
I have some java code parsing this - however, when a title contains an apostrophe (as above), the parsing stops, so I end up with the following title: Pupils ' and then it continues on and parses the description (which is fine). How do I get it to parse the full title? The following is a segment of code from inside my for loop where I parse the info:
NodeList title = element.getElementsByTagName("title");
Element line = (Element) title.item(0);
tmp.setTitle(getCharacterDataFromElement(line).toString());
The exact same code is used to parse the other elements like description and pubDate etc, which are all fine.
This is the getCharacterDataFromElement method:
public static String getCharacterDataFromElement(Element e) {
    Node child = ((Node) e).getFirstChild();
    if (child instanceof CharacterData) {
        CharacterData cd = (CharacterData) child;
        return cd.getData();
    }
    return "";
}
What am I doing wrong? I use the DocumentBuilder, DocumentBuilderFactory and org.w3c.dom to work with the RSS Feed.

Your getCharacterDataFromElement only looks at the first child. See if there are further child nodes too, and tack all the text together.
HTH - DF

As davidfrancis suggested, you should iterate over all children in getCharacterDataFromElement().
Alternatively, if you can use DOM level 3, you can use the Node.getTextContent() method instead which does what you want.
NodeList title = element.getElementsByTagName("title");
Element line = (Element)title.item(0);
tmp.setTitle(line.getTextContent());
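If getTextContent() is not available, the iteration over all children could look like the sketch below. To reproduce the symptom without depending on a particular parser, the example builds a <title> element by hand with its text split across three text nodes; that splitting is an assumption about what the parser in the question produced, and the class name RssTitleText is made up.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.Text;

// Hypothetical class, for illustration only.
class RssTitleText {

    // Concatenates the data of every direct text (or CDATA) child.
    static String getCharacterDataFromElement(Element e) {
        StringBuilder sb = new StringBuilder();
        for (Node child = e.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child instanceof Text) {
                sb.append(((Text) child).getData());
            }
        }
        return sb.toString();
    }

    // Builds <title> with its text deliberately split into three text nodes.
    static Element splitTitle() {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException ex) {
            throw new IllegalStateException("No DOM implementation available", ex);
        }
        Element title = doc.createElement("title");
        doc.appendChild(title);
        title.appendChild(doc.createTextNode("Pupils "));
        title.appendChild(doc.createTextNode("'"));
        title.appendChild(doc.createTextNode("bullied on sports field'"));
        return title;
    }

    public static void main(String[] args) {
        Element title = splitTitle();
        // Reading only the first child would give just the first piece.
        System.out.println(title.getFirstChild().getNodeValue());
        System.out.println(getCharacterDataFromElement(title));
    }
}
```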

Well, AFAIK, the apostrophe only has to be encoded as &apos; inside an attribute value that is itself delimited by apostrophes; in element content it is an ordinary character.
This means the BBC News RSS feed is well-formed XML in this respect.
The truncated title therefore comes from how the text is read on the Java side, not from the feed.

Java appending XML docs to existing docs

I have two XML docs that I've created and I want to combine these two inside of a new envelope. So I have
<alert-set>
<warning>National Weather Service...</warning>
<start-date>5/19/2009</start-date>
<end-date>5/19/2009</end-date>
</alert-set>
and
<weather-set>
<chance-of-rain type="percent">31</chance-of-rain>
<conditions>Partly Cloudy</conditions>
<temperature type="Fahrenheit">78</temperature>
</weather-set>
What I'd like to do is combine the two inside a root node: <DataSet> combined docs </DataSet>
I've tried creating a temporary doc and replacing children with the root nodes of the documents:
<DataSet>
<blank/>
<blank/>
</DataSet>
And I was hoping to replace the two blanks with the root elements of the two documents but I get "WRONG_DOCUMENT_ERR: A node is used in a different document than the one that created it." I tried adopting and importing the root nodes but I get the same error.
Is there not some easy way of combining documents without having to read through and create new elements for each node?
EDIT: Sample code snippets
Just trying to move one to the "blank" document for now... The importNode and adoptNode functions cannot import/adopt Document nodes, but they can import the element node and its subtree... or if they do, it still does not seem to work for appending/replacing.
Document xmlDoc; //created elsewhere
Document weather = getWeather(latitude, longitude);
Element weatherRoot = weather.getDocumentElement();
Node root = xmlDoc.getDocumentElement();
Node adopt = weather.adoptNode(weatherRoot);
Node imported = weather.importNode(weatherRoot, true);
Node child = root.getFirstChild();
root.replaceChild(adopt, child); //initially tried replacing the <blank/> elements
root.replaceChild(imported, child);
root.appendChild(adopt);
root.appendChild(imported);
root.appendChild(adopt.cloneNode(true));
All of these throw the DOMException: WRONG_DOCUMENT_ERR: A node is used in a different document than the one that created it.
I think I'll have to figure out how to use stax or just reread the documents and create new elements... That kinda seems like too much work just to combine documents, though.

It's a bit tricky, but the following example runs:
public static void main(String[] args) {
    DocumentImpl doc1 = new DocumentImpl();
    Element root1 = doc1.createElement("root1");
    Element node1 = doc1.createElement("node1");
    doc1.appendChild(root1);
    root1.appendChild(node1);
    DocumentImpl doc2 = new DocumentImpl();
    Element root2 = doc2.createElement("root2");
    Element node2 = doc2.createElement("node2");
    doc2.appendChild(root2);
    root2.appendChild(node2);
    DocumentImpl doc3 = new DocumentImpl();
    Element root3 = doc3.createElement("root3");
    doc3.appendChild(root3);
    // root3.appendChild(root1); // Doesn't work -> DOMException
    root3.appendChild(doc3.importNode(root1, true));
    // root3.appendChild(root2); // Doesn't work -> DOMException
    root3.appendChild(doc3.importNode(root2, true));
}
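The same idea can be sketched with only the standard JDK classes instead of DocumentImpl. The point to note is that importNode is called on the document that receives the node; in the question's snippet, adoptNode and importNode are called on weather, the source document, so the resulting nodes still belong to it. The class CombineDocs and its methods are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.SAXException;

// Hypothetical class, for illustration only.
class CombineDocs {

    // Wraps the root elements of the given XML strings in one <DataSet> root.
    static Document combine(String... parts) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document combined = builder.newDocument();
            Element root = combined.createElement("DataSet");
            combined.appendChild(root);
            for (String part : parts) {
                Document source = builder.parse(
                        new ByteArrayInputStream(part.getBytes(StandardCharsets.UTF_8)));
                // importNode is called on the *target* document; it returns a
                // deep copy owned by 'combined', leaving 'source' untouched.
                root.appendChild(combined.importNode(source.getDocumentElement(), true));
            }
            return combined;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not combine documents", e);
        }
    }

    public static void main(String[] args) {
        Document combined = combine(
                "<alert-set><warning>National Weather Service...</warning></alert-set>",
                "<weather-set><conditions>Partly Cloudy</conditions></weather-set>");
        Element root = combined.getDocumentElement();
        System.out.println(root.getNodeName() + " has "
                + root.getChildNodes().getLength() + " children");
    }
}
```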

I know you got the issue solved already, but I still wanted to take a stab at this problem using the XOM library that I'm currently testing out (related to this question), and while doing that, offer a different approach than that of Andreas_D's answer.
(To simplify this example, I put your <alert-set> and <weather-set> into separate files, which I read into nu.xom.Document instances.)
import nu.xom.*;
[...]
Builder builder = new Builder();
Document alertDoc = builder.build(new File("src/xomtest", "alertset.xml"));
Document weatherDoc = builder.build(new File("src/xomtest", "weatherset.xml"));
Document mainDoc = builder.build("<DataSet><blank/><blank/></DataSet>", "");
Element root = mainDoc.getRootElement();
root.replaceChild(
root.getFirstChildElement("blank"), alertDoc.getRootElement().copy());
root.replaceChild(
root.getFirstChildElement("blank"), weatherDoc.getRootElement().copy());
The key is to make a copy of the elements to be inserted into mainDoc; otherwise you'll get a complaint that "child already has a parent".
Outputting mainDoc now gives:
<?xml version="1.0" encoding="UTF-8"?>
<DataSet>
<alert-set>
<warning>National Weather Service...</warning>
<start-date>5/19/2009</start-date>
<end-date>5/19/2009</end-date>
</alert-set>
<weather-set>
<chance-of-rain type="percent">31</chance-of-rain>
<conditions>Partly Cloudy</conditions>
<temperature type="Fahrenheit">78</temperature>
</weather-set>
</DataSet>
To my delight, this turned out to be very straightforward to do with XOM. It only took a few minutes to write this, even though I'm definitely not very experienced with the library yet. (It would have been even easier without the <blank/> elements, i.e., starting with simply <DataSet></DataSet>.)
So, unless you have compelling reasons for using only the standard JDK tools, I warmly recommend trying out XOM as it can make XML handling in Java much more pleasant.
