tagsoup breaks good xml

While cleaning an XML file with tagsoup I got unexpected results: it orphaned some child elements by closing the parent tag too early, and it also lowercased the parent tag's name.
Before tagsoup:
<Objects>
<Object>
<ObjectID>240</ObjectID>
[...]
<Status>Not Ready</Status>
<Title>Some description which includes word/word, 22,000</Title>
<Url>http://example.com/withquerystring?id=240&other=1&url=http%3A%2F%2Fredirected.example.com%2F40</Url>
[...]
<Owner>
<Name>JOHN MARSHALL, MR</Name>
</Owner>
</Object>
<Object>
<ObjectID>122</ObjectID>
[...]
After tagsoup:
<Objects>
<object>
<ObjectID>240</ObjectID>
[...]
<Status>Not Ready</Status>
</object>
<Title>Some description which includes word/word, 22,000</Title>
<Url>http://example.com/withquerystring?id=240&other=1&url=http%3A%2F%2Fredirected.example.com%2F40</Url>
[...]
<Owner>
<Name>JOHN MARSHALL, MR</Name>
</Owner>
<object>
<ObjectID>122</ObjectID>
[...]
I'm working on a Java project that uses these libraries:
import org.ccil.cowan.tagsoup.Parser;
import org.ccil.cowan.tagsoup.XMLWriter;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
I'm using Java 6.
Any clues for that?
I would expect a valid XML file to come out unchanged (perhaps with minor details altered, but not the structure), wouldn't it?

Tagsoup is intended as an HTML parser, meant to clean up poor HTML. For tag names that are defined by HTML, tagsoup knows which elements are allowed inside which other elements and will try to correct any that are wrongly nested. Also remember that in HTML, unlike XML, tag names are not case sensitive.
In this case it seems to have decided that it knows what object and title should mean in HTML (respectively an embedded object of some kind, and the title of the page), and it knows that title is not allowed inside object. But ObjectID and Status are not known HTML element names, so it gives the benefit of the doubt and leaves them alone.
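If the input is already well-formed XML, a plain XML parser avoids the HTML-specific repairs altogether. The following is a minimal sketch using only the JDK's own parser and identity transformer, not TagSoup; the class name XmlRoundTrip and the shortened sample document are made up for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of TagSoup or of the original project.
public class XmlRoundTrip {

    /** Parses well-formed XML with the JDK parser and serializes it again. */
    public static String roundTrip(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        // A Transformer created without a stylesheet performs an identity copy.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        identity.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Shortened stand-in for the document in the question.
        String xml = "<Objects><Object><ObjectID>240</ObjectID>"
                + "<Title>Some description</Title></Object></Objects>";
        System.out.println(roundTrip(xml));
    }
}
```

An XML parser has no notion of HTML's object or title elements, so Title stays inside Object and the element names keep their case.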

Related

How can i replace attribute of prefix in xml?

I want to rename an attribute in an XML document using Java.
How can I do that?
The XML looks like this:
<header p1:name="blabla">
<body>
<description>hello world !!!</description>
</body>
</header>
<!-- TO-BE -->
<header name="blabla">
<body>
<description>hello world !!!</description>
</body>
</header>
I want to remove the 'p1:' prefix from the attribute, as in the TO-BE version.

When you want to transform XML from Java, I would suggest using XSLT. For simple tasks you can use the XSLT 1.0 processor that comes with the JDK; for more complex tasks you can download an XSLT 3.0 implementation such as Saxon.
However, XSLT assumes that the XML input is well-formed. The sample you have shown isn't, because it uses a namespace prefix p1 that hasn't been declared. This suggests a problem further up the processing pipeline, and rather than getting rid of this prefix, you should perhaps consider how it got there in the first place: errors that create bad data should be fixed at source, rather than repairing the data later.
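As a rough sketch of the XSLT route with the processor bundled in the JDK: the stylesheet below rebuilds every element and attribute under its local name, which drops the prefix. It assumes the input has been made well-formed first, so the sample in main declares p1 with a made-up namespace URI (urn:example:p1), and it assumes the elements themselves are in no namespace. The class and method names are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper class for illustration only.
public class StripAttributePrefix {

    // XSLT 1.0: recreate elements and attributes by local name.
    // Text is copied by the built-in template rules; comments and
    // processing instructions are dropped by them.
    private static final String XSLT =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:template match='*'>"
        + "<xsl:element name='{local-name()}'>"
        + "<xsl:apply-templates select='@*|node()'/>"
        + "</xsl:element>"
        + "</xsl:template>"
        + "<xsl:template match='@*'>"
        + "<xsl:attribute name='{local-name()}'>"
        + "<xsl:value-of select='.'/>"
        + "</xsl:attribute>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    public static String strip(String xml) throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // The namespace declaration is an assumption; the original sample lacks it.
        String xml = "<header xmlns:p1=\"urn:example:p1\" p1:name=\"blabla\">"
                + "<body><description>hello world !!!</description></body>"
                + "</header>";
        System.out.println(strip(xml));
    }
}
```

If two attributes on the same element share a local name under different prefixes, this approach would collapse them into one, so it only suits documents where that cannot happen.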

Missing NameSpace Information In XML file using EXIficient

I am using EXIficient to convert XML data to EXI and back to XML, via their EXIficientDemo class. Sample code:
EXIficientDemo sample = new EXIficientDemo();
sample.parseAndProofFileLocations("FilePath");
sample.codeSchemaLess();
It first converts the XML file to EXI and then back to XML. When it regenerates the XML from the EXI file, it loses some namespace information.
Actual XML File:
<?xml version="1.0" encoding="utf-8"?>
<tt xml:lang="ja" xmlns="http://www.w3.org/ns/ttml"
xmlns:tts="http://www.w3.org/ns/ttml#styling"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<body>
<div>
<p xml:id="s1">
<span tts:origin="somethings">somethings</span>
</p>
</div>
</body>
Generated XML File By EXIficient
<?xml version="1.0" encoding="UTF-8"?>
<ns3:tt xmlns:ns3="http://www.w3.org/ns/ttml"
xml:lang="ja"xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<ns3:body><ns3:div>
<ns3:p xml:id="s1">
<ns3:span xmlns:ns4="http://www.w3.org/ns/ttml#styling"
ns4:origin="somethings">somethings</ns3:span>
</ns3:p>
</ns3:div></ns3:body>
The generated XML file is missing xmlns:tts="http://www.w3.org/ns/ttml#styling" on the root element.
How can I fix this problem?

EXIficient may be suppressing unused namespaces. Your example doesn't show any use of the ttm namespace.
As you can see, it didn't retain the namespace prefix for the ttml namespace either (changed to ns3). The generated XML is perfectly valid if the ttml#metadata namespace is unused.
Update
With the updated question, where namespace ttml#styling is used by the origin attribute of the span element, the namespace is retained in the rebuilt XML, but it has been moved to the span element.
This is still a valid XML document.
Namespace declarations (xmlns) can appear on any element in an XML document, and apply to the element on which they appear and to all of its descendants (unless overridden, which is unusual).
The same namespace can be declared many times on different elements. For simplicity and/or optimization, it is common to declare all namespaces up front, on the root element, using different prefixes, but it is not required to do so.
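To see that the two serializations are equivalent for a namespace-aware consumer, here is a small sketch using the JDK DOM parser. The two documents are trimmed versions of the ones in the question; the class name and method name are invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class; names are not from EXIficient.
public class NamespaceEquivalence {
    static final String TTML = "http://www.w3.org/ns/ttml";
    static final String STYLING = "http://www.w3.org/ns/ttml#styling";

    // Trimmed version of the original file: declarations on the root element.
    static final String ORIGINAL =
          "<tt xmlns='http://www.w3.org/ns/ttml'"
        + " xmlns:tts='http://www.w3.org/ns/ttml#styling'>"
        + "<body><div><p><span tts:origin='somethings'>somethings</span>"
        + "</p></div></body></tt>";

    // Trimmed version of the regenerated file: different prefixes,
    // styling namespace declared on the span itself.
    static final String REBUILT =
          "<ns3:tt xmlns:ns3='http://www.w3.org/ns/ttml'>"
        + "<ns3:body><ns3:div><ns3:p>"
        + "<ns3:span xmlns:ns4='http://www.w3.org/ns/ttml#styling'"
        + " ns4:origin='somethings'>somethings</ns3:span>"
        + "</ns3:p></ns3:div></ns3:body></ns3:tt>";

    /** Looks up span and its origin attribute by namespace URI, never by prefix. */
    public static String originOfFirstSpan(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Element span = (Element) doc.getElementsByTagNameNS(TTML, "span").item(0);
        return span.getAttributeNS(STYLING, "origin");
    }

    public static void main(String[] args) throws Exception {
        System.out.println(originOfFirstSpan(ORIGINAL));
        System.out.println(originOfFirstSpan(REBUILT));
    }
}
```

Code that matches on prefixes such as tts: as literal strings would break on the rebuilt file, which is usually a sign that the consumer should be made namespace-aware.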

I read this question by accident and rather late unfortunately.
Just in case people are still struggling with this and are wondering what they can do.
As it was pointed out EXIficient behaves just fine with regards to namespace handling.
Having said that, the EXI specification allows one to preserve prefixes and namespaces (see Preserve Options).
In EXIficient one can set these options accordingly,
e.g.,
EXIFactory exiFactory = DefaultEXIFactory.newInstance();
exiFactory.getFidelityOptions().setFidelity(FidelityOptions.FEATURE_PREFIX, true);

XmlUnit: The entity "nbsp" was referenced, but not declared

I need to test XHTML code like <div>&nbsp;</div> using XmlUnit. The Diff constructor tells me this:
org.xml.sax.SAXParseException: The entity "nbsp" was referenced, but
not declared.
I know that the nbsp entity is not defined in XML, but the HTML code is not mine, so I cannot replace it with &#160; (that would be the obvious solution otherwise).
I don't want to modify the HTML code by adding <!DOCTYPE html [ <!ENTITY nbsp "&#160;"> ]>; I would prefer to leave the code unchanged.
Is there another way around this problem? I know there is a HTMLDocumentBuilder class in XmlUnit, but I wasn't able to find good documentation or examples.

You can use a DOCTYPE declaration that refers to the MathML DTD:
<!DOCTYPE math
PUBLIC "-//W3C//DTD MathML 3.0//EN"
"http://www.w3.org/Math/DTD/mathml3/mathml3.dtd">
or a local copy of the same.
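A variation on the DOCTYPE idea that leaves the HTML source itself untouched is to prepend a declaration only inside the test code, just before parsing. This is a sketch under assumptions, not an XmlUnit feature: it uses the JDK DOM parser, the class and method names are invented, and it assumes the fragment has no XML declaration or DOCTYPE of its own.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical test helper; not part of XmlUnit.
public class NbspDoctypeWrapper {

    // Internal subset that declares only the nbsp entity. The root name is
    // irrelevant here because the parser is not validating.
    private static final String DOCTYPE =
            "<!DOCTYPE div [<!ENTITY nbsp \"&#160;\">]>";

    /** Parses an XHTML fragment after prepending the entity declaration. */
    public static Document parseFragment(String xhtml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        return dbf.newDocumentBuilder()
                .parse(new InputSource(new StringReader(DOCTYPE + xhtml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parseFragment("<div>&nbsp;</div>");
        // The entity is expanded to the no-break space character U+00A0.
        System.out.println((int) doc.getDocumentElement().getTextContent().charAt(0));
    }
}
```

The resulting Document objects can then be compared instead of the raw strings; as far as I recall, XMLUnit 1.x has a Diff constructor that accepts two Documents, but check that against the version in use.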

You can enable Feature "http://apache.org/xml/features/continue-after-fatal-error" to not throw an exception in case of unknown entities. This still gives a warning though:
documentBuilderFactory.setFeature(
"http://apache.org/xml/features/continue-after-fatal-error",
true);
Et voilà!

How do I take off the XML version tag in the XOM library for Java?

I'm writing a small application in Java that uses XOM to output XHTML.
The problem is that XOM places the following tag before all the html:
<?xml version="1.0" encoding="UTF-8"?>
I've read their documentation, but I can't seem to find how to remove this tag. Thanks guys.
Edit: I'm outputting to a file using XOM's Serializer class
Follow up: If it is good practice to use the XML tag before the DOCTYPE, why don't any websites use it? Also, why does the W3C validator give me an error when it sees the XML tag? Here is the error:
Illegal processing instruction target (found xml)
Finally, if I were to put the XML tag before my DOCTYPE, does this mean I don't have to specify <meta charset="UTF-8" /> in my html header?

The tag is valid as XML and XHTML, and good practice. There should be no reason to remove it.
Just leave it there ... or fix whatever it is that is expecting it not to be there.
If you don't believe me, take a look at this excerpt from the XHTML 1.1 spec.
"Example of an XHTML 1.1 document
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html version="-//W3C//DTD XHTML 1.1//EN"
xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/1999/xhtml
http://www.w3.org/MarkUp/SCHEMA/xhtml11.xsd"
>
<head>
<title>Virtual Library</title>
</head>
<body>
<p>Moved to example.org.</p>
</body>
</html>
Note that in this example, the XML declaration is included. An XML declaration like the one above is not required in all XML documents. XHTML document authors SHOULD use XML declarations in all their documents. XHTML document authors MUST use an XML declaration when the character encoding of the document is other than the default UTF-8 or UTF-16 and no encoding is specified by a higher-level protocol."
By the way, the W3C validation service says that is OK ... but if there is any whitespace before the <?xml ...?> tag it complains.

Does this work? This is listed in the Javadoc
protected void writeXMLDeclaration()
throws IOException
You could override it, and do nothing.....
Agreed you should normally output the prologue

Assuming you wish to serve your XHTML as text/html content type, you are right to want to remove the XML declaration, because if you don't, it will throw IE6 into quirks mode.
Overriding writeXMLDeclaration() as suggested by MJB looks like a good way to do it.
But you should be aware that you may well hit other problems using an XML serializer and serving the output as text/html.
Most likely, the output will contain a tag like this: <script src="myscript.js" />. Browsers (except Safari) won't treat that as a self-closing script tag but as a script start tag, so everything that follows will be treated as part of the script and not rendered by the browser.
You will probably need to override your serializer to make it HTML aware to resolve this. I suggest overriding the writeEmptyElementTag() function, and for all elements with names not in the list "area", "base", "basefont", "bgsound", "br", "col", "command", "embed", "frame", "hr", "isindex", "image", "img", "input", "keygen", "link", "meta", "param", "source", "spacer" and "wbr", call writeStartTag() and then writeEndTag() instead of the default behaviour.
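The element list above can be kept in a small lookup that an overridden writeEmptyElementTag() would consult. The sketch below covers only that lookup, using plain JDK classes; the class and method names are invented, and wiring it into a XOM Serializer subclass is left out.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper; the names are not part of XOM.
public class HtmlVoidElements {

    // Elements that browsers accept in the <name /> form when the
    // document is served as text/html (list taken from the answer above).
    private static final Set<String> VOID_ELEMENTS =
            Collections.unmodifiableSet(new HashSet<String>(Arrays.asList(
                "area", "base", "basefont", "bgsound", "br", "col", "command",
                "embed", "frame", "hr", "isindex", "image", "img", "input",
                "keygen", "link", "meta", "param", "source", "spacer", "wbr")));

    /**
     * Returns true if an empty element with this local name may be written
     * as a single empty-element tag; false means it should be written as an
     * explicit start tag followed by an end tag.
     */
    public static boolean mayUseEmptyElementTag(String localName) {
        return VOID_ELEMENTS.contains(localName);
    }

    public static void main(String[] args) {
        System.out.println(mayUseEmptyElementTag("br"));
        System.out.println(mayUseEmptyElementTag("script"));
    }
}
```

The comparison is case-sensitive on purpose, since XHTML element names are lowercase.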
"Finally, if I were to put the XML tag before my DOCTYPE, does this mean I don't have to specify <meta charset="UTF-8" /> in my html header?"
No it doesn't. When served as text/html, the XML declaration is simply ignored by browsers, so you will still need to provide the character encoding by some other means, either the meta tag, or in the HTTP headers.

How to parse xml having html tags within xml tags

I've got an XML document with HTML inside its XML tags, and I'm not able to parse it as it is.
When I start parsing the XML, the str tag has HTML in it.
Can anyone help me extract the HTML with all of its tags?

It is a good idea to store the HTML inside a CDATA section (between <![CDATA[ and ]]>), so that it can be retrieved as ordinary text:
<str name="body">
<![CDATA[<font face="arial" size="2"><ul><li><p align="justify">india’s first</p></li></ul></font>]]>
</str>
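Reading such a value back could look like the sketch below, which uses the JDK DOM parser. The wrapping doc element, the class name, and the method name are assumptions made for the example; only the str element with name="body" comes from the snippet above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class for illustration only.
public class CdataBody {

    /** Returns the text of <str name="body">, or null if there is none. */
    public static String bodyHtml(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        // Merge CDATA sections into plain text nodes while parsing.
        dbf.setCoalescing(true);
        Document doc = dbf.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        NodeList strs = doc.getElementsByTagName("str");
        for (int i = 0; i < strs.getLength(); i++) {
            Element str = (Element) strs.item(i);
            if ("body".equals(str.getAttribute("name"))) {
                // The markup inside CDATA arrives as ordinary characters.
                return str.getTextContent().trim();
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<doc><str name=\"body\">\n"
                + "<![CDATA[<ul><li><p align=\"justify\">first</p></li></ul>]]>\n"
                + "</str></doc>";
        System.out.println(bodyHtml(xml));
    }
}
```

The returned string is the HTML exactly as stored, so it can be handed to an HTML parser or written to a page without the XML parser ever having to understand it.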

The problem is not the HTML as such but malformed HTML. If the HTML is under your control, make sure it complies with XHTML, and the XML parser will treat it as normal XML. Otherwise, you can use a tool like "HTML Tidy" to fix the HTML, or use an HTML parser. For example:
http://www.codeproject.com/KB/dotnet/apmilhtml.aspx
