XmlUnit: The entity "nbsp" was referenced, but not declared - java

I need to test a XHTML code like <div> </div> using XmlUnit. The Diff constructor tells me this:
org.xml.sax.SAXParseException: The entity "nbsp" was referenced, but
not declared.
I know that nbsp entity is not defined in XML, but the HTML code is not mine, so I cannot replace it by #160 (that would be obvious solution otherwise).
I don't want to modify the HTML code by adding <!DOCTYPE html [ <!ENTITY nbsp " "> ]>, I would prefer to leave the code without change.
Is there another way around this problem? I know there is a HTMLDocumentBuilder class in XmlUnit, but I wasn't able to find good documentation or examples.

You can use a DOCTYPE declaration that refers to the MathML DTD:
<!DOCTYPE math
PUBLIC "-//W3C//DTD MathML 3.0//EN"
"http://www.w3.org/Math/DTD/mathml3/mathml3.dtd">
or a local copy of the same.

You can enable Feature "http://apache.org/xml/features/continue-after-fatal-error" to not throw an exception in case of unknown entities. This still gives a warning though:
documentBuilderFactory.setFeature(
"http://apache.org/xml/features/continue-after-fatal-error",
true);
É voilá!

Related

Add attribute to xml element not allowed in DTD

I have to use a external DTD, that specifies that a certain element can only have id attribute:
<!ELEMENT x (y | z)>
<!ATTLIST x id ID #IMPLIED>
So something like this is valid
<x id="x">...</x>
But if i try something like this:
<x id="x" custom="custom">...</x>
My parser gives me the following error:
Attribute "custom" must be declared for element type "x".
So I understand what the error says and why its happening, but as i said the DTD is external and sadly i cant change it. Is there a workaround or a hack that can use to add my own custom attribute?
You can either disable DTD validation in your parser, or try defining internal DTD.

Missing NameSpace Information In XML file using EXIficient

I am using EXIficient to convert XML data to EXI and back to XML. Here, i use their EXIficientDemo class. Sample Code:
EXIficientDemo sample = new EXIficientDemo();
sample.parseAndProofFileLocations("FilePath");
sample.codeSchemaLess();
Firstly it converted xml file to EXI then back to XML, when it generate XML from previously generated EXI's file, it loses some information about Namespace.
Actual XML File:
<?xml version="1.0" encoding="utf-8"?>
<tt xml:lang="ja" xmlns="http://www.w3.org/ns/ttml"
xmlns:tts="http://www.w3.org/ns/ttml#styling"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<body>
<div>
<p xml:id="s1">
<span tts:origin="somethings">somethings</span>
</p>
</div>
</body>
Generated XML File By EXIficient
<?xml version="1.0" encoding="UTF-8"?>
<ns3:tt xmlns:ns3="http://www.w3.org/ns/ttml"
xml:lang="ja"xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<ns3:body><ns3:div>
<ns3:p xml:id="s1">
<ns3:span xmlns:ns4="http://www.w3.org/ns/ttml#styling"
ns4:origin="somethings">somethings</ns3:span>
</ns3:p>
</ns3:div></ns3:body>
In the generated XML file, it is missing xmlns:tts="http://www.w3.org/ns/ttml#styling"
How to fixed this problem? If you can, please help me.
EXIficient may be suppressing unused namespaces. Your example doesn't show any use of the ttm namespace.
As you can see, it didn't retain the namespace prefix for the ttml namespace either (changed to ns3). The generated XML is perfectly valid if the ttml#metadata namespace is unused.
Update
With the updated question, where namespace ttml#styling is used by the origin attribute of the span element, the namespace is retained in the rebuilt XML, but it has been moved to the span element.
This is still a very valid XML document.
Namespace declarations (xmlns) can appear anywhere in a XML document, and applies to the element on which it appears, and all subelements (unless overridden, which is very unusual).
The same namespace can be declared many times on different elements. For simplicity and/or optimization, it is common to declare all namespaces up front, on the root element, using different prefixes, but it is not required to do so.
I read this question by accident and rather late unfortunately.
Just in case people are still struggling with this and are wondering what they can do.
As it was pointed out EXIficient behaves just fine with regards to namespace handling.
Having said that, the EXI specification allows one to preserve prefixes and namespaces (see Preserve Options).
In EXIficient one can set these options accordingly,
e.g.,
EXIFactory.getFidelityOptions().setFidelity(FidelityOptions.FEATURE_PREFIX, true);

tagsoup breaks good xml

Cleaning an xml file I have obtained unexpected results: tagsoup has orphaned some properties closing the parent tag too soon. It also downcases the parent tag's name.
Before tagsoup:
<Objects>
<Object>
<ObjectID>240</ObjectID>
[...]
<Status>Not Ready</Status>
<Title>Some description which includes word/word, 22,000</Title>
<Url>http://example.com/withquerystring?id=240&other=1&url=http%3A%2F%2Fredirected.example.com%2F40</Url>
[...]
<Owner>
<Name>JOHN MARSHALL, MR</Name>
</Owner>
</Object>
<Object>
<ObjectID>122</ObjectID>
[...]
After tagsoup:
<Objects>
<object>
<ObjectID>240</ObjectID>
[...]
<Status>Not Ready</Status>
</object>
<Title>Some description which includes word/word, 22,000</Title>
<Url>http://example.com/withquerystring?id=240&other=1&url=http%3A%2F%2Fredirected.example.com%2F40</Url>
[...]
<Owner>
<Name>JOHN MARSHALL, MR</Name>
</Owner>
<object>
<ObjectID>122</ObjectID>
[...]
I'm in a java project that uses this libraries:
import org.ccil.cowan.tagsoup.Parser;
import org.ccil.cowan.tagsoup.XMLWriter;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
I'm using Java 6.
Any clues for that?
The desired output of a valid xml file would be the same file (maybe just changing details, but not the structure), wouldn't it?
Tagsoup is intended as an HTML parser and to clean up poor HTML. For tag names that are defined by HTML tagsoup knows which elements are allowed inside which other elements and will try and correct any that are wrongly nested. Also remember that in HTML, unlike XML, tag names are not case sensitive.
In this case it seems to have decided that it knows what object and title should mean in HTML (respectively an embedded object of some kind, and the title of the page), and it knows that title is not allowed inside object. But ObjectID and Status are not known HTML element names, so it gives the benefit of the doubt and leaves them alone.

how to reference the path to a DTD value in XML

I am a newbie when it comes to XML and DTD values, so forgive me if this is a simple question or if I am going about this in the wrong way. Can you specify a DTD value in the same way you can specify a path to a property in XML?
For instance, if you have the XML file below:
<!DOCTYPE ... SYSTEM "<path_to_file>">
<BOOK>
<AUTHOR>
<FIRST>John</FIRST>
<LAST>Quncy</LAST>
</AUTHOR>
<NAME>blah</NAME>
<DATE>12/23/13</DATE>
</BOOK>
You could specify the first name of the author by the path:
/BOOK/AUTHOR/FIRST
Is there any syntax to specify a DTD entity like the DOCTYPE in the same way?
Ultimately what I would like to do is use an in house XML parser already written in java to find a DTD entry that I specify and delete it from the XML file. For instance, with the above XML, I would like to specify DOCTYPE and have it removed from the XML. There is already code in place that, given a path, will delete that section from the XML file. I would like to leverage that to also delete DTD entries as well, but I have no idea how to reference it.
No. DOCTYPE is a parsing and validation directive. That is: DOCTYPE and DTD affect parsing and validation but are not a part of the document as separate entities after that. The XML data model does not containDOCTYPE or DTD definitions and they practically don't exist after the document has been parsed.

How do I take off the XML version tag in the XOM library for Java?

I'm writing a small application in Java that uses XOM to output XHTML.
The problem is that XOM places the following tag before all the html:
<?xml version="1.0" encoding="UTF-8"?>
I've read their documentation, but I can't seem to find how to remove this tag. Thanks guys.
Edit: I'm outputting to a file using XOM's Serializer class
Follow up: If it is good practice to use the XML tag before the DOCTYPE, why don't any websites use it? Also, why does the W3C validator give me and error when it sees the XML tag? Here is the error:
Illegal processing instruction target (found xml)
Finally, if I were to put the XML tag before my DOCTYPE, does this mean I don't have to specify <meta charset="UTF-8" /> in my html header?
The tag is valid as XML and XHTML, and good practice. There should be no reason to remove it.
Just leave it there ... or fix whatever it is that is expecting it not to be there.
If you don't believe me, take a look at this excerpt from the XHTML 1.1 spec.
"Example of an XHTML 1.1 document
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html version="-//W3C//DTD XHTML 1.1//EN"
xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/1999/xhtml
http://www.w3.org/MarkUp/SCHEMA/xhtml11.xsd"
>
<head>
<title>Virtual Library</title>
</head>
<body>
<p>Moved to example.org.</p>
</body>
</html>
Note that in this example, the XML declaration is included. An XML declaration like the one above is not required in all XML documents. XHTML document authors SHOULD use XML declarations in all their documents. XHTML document authors MUST use an XML declaration when the character encoding of the document is other than the default UTF-8 or UTF-16 and no encoding is specified by a higher-level protocol."
By the way, the W3C validation service says that is OK ... but if there is any whitespace before the <?xml ...?> tag it complains.
Does this work? This is listed in the Javadoc
protected void writeXMLDeclaration()
throws IOException
You could override it, and do nothing.....
Agreed you should normally output the prologue
Assuming you wish to serve your XHTML as text/html content type, you are right to want to remove the XML declaration, because if you don't, it will throw IE6 into quirks mode.
Overriding writeXMLDeclaration() as suggested by MJB looks like a good way to do it.
But you should be aware that you may well hit other problems using an XML serializer and serving the output as text/html.
Most likely, is that the output will produce a tag like this: <script src="myscript.js" />. Browsers (except Safari) won't treat that as a script self closing tag, but as as a script start tag, and everything that follows will be treated as part of the script and not rendered by the browser.
You will probably need to override your serializer to make it HTML aware to resolve this. I suggest overriding the writeEmptyElementTag() function, and for all elements with names not in the list "area", "base", "basefont", "bgsound", "br", "col", "command", "embed", "frame", "hr", "isindex", "image", "img", "input", "keygen", "link", "meta", "param", "source", "spacer" and "wbr", call writeStartTag() and then writeEndTag() instead of the default behaviour.
Finally, if I were to put the XML tag
before my DOCTYPE, does this mean I
don't have to specify <meta
charset="UTF-8" /> in my html header?
No it doesn't. When served as text/html, the XML declaration is simply ignored by browsers, so you will still need to provide the character encoding by some other means, either the meta tag, or in the HTTP headers.

Categories

Resources