I am using the SAX parser in Java to read some XML. The XML I am giving it has problems and is causing the parse to fail. Here is the error message:
11-18 10:25:37.290: W/System.err(3712): org.xml.sax.SAXParseException: Illegal: "<" inside attribute value (position:START_TAG <question text='null'>#1:23 in java.io.InputStreamReader#4074c678)
I have a feeling that it does not like the fact that I have some HTML tags inside of a string in the XML. I would think that anything inside the quotes gets ignored from a syntax standpoint. Also, is it valid to use single quotes here? Here is an example:
<quiz>
<question text="<img src='//files/alex/hilltf.PNG' alt='hill' style='max-width:400px' /> is represented on map by cut. ">
<answer text="1"/>
<answer text="2" correct="true"/>
</question>
</quiz>
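You need to escape the < inside the text attribute value as &lt;. XML uses < to open a tag, so a literal < is illegal in attribute values and in element content unless it is escaped or, for element content only, enclosed in a CDATA section (which isn't an option for an attribute value).
The error message is correct. A < must be the start of a tag and cannot appear inside an attribute string; write &lt; instead. The quotes are not the problem: an attribute value may be delimited by either single or double quotes, and single quotes inside a double-quoted value are fine.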
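To illustrate the two answers above, here is a minimal sketch that escapes the HTML before placing it in the attribute and then reads it back with the JDK's built-in DOM parser. The class and method names (AttributeEscapeDemo, escapeAttribute, readTextAttribute) are made up for this example, and the HTML string is shortened from the question.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class AttributeEscapeDemo {

    // Hypothetical helper: escapes the characters that are unsafe inside a
    // double-quoted attribute value. '&' is handled first so that the
    // ampersands introduced by the later replacements are not escaped again.
    static String escapeAttribute(String raw) {
        return raw.replace("&", "&amp;")
                  .replace("<", "&lt;")
                  .replace(">", "&gt;")
                  .replace("\"", "&quot;");
    }

    // Parses the XML and returns the "text" attribute of the root element.
    static String readTextAttribute(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getAttribute("text");
        } catch (Exception e) {
            throw new IllegalStateException("XML could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        String html = "<img src='//files/alex/hilltf.PNG' alt='hill' /> is represented on map by cut.";
        String xml = "<question text=\"" + escapeAttribute(html) + "\"/>";
        System.out.println(xml);
        // The parser hands back the original, unescaped HTML string.
        System.out.println(readTextAttribute(xml));
    }
}
```

The single quotes inside the value are left alone because the attribute is delimited by double quotes.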
Related
I am getting below error pls help
"parse error:
Error on line 1 of document :
The markup in the document preceding the root element must be well-formed.
Nested exception: The markup in the document preceding the root element must be well-formed.
XML is below
<?xml version=\"1.0\" encoding=\"UTF-8\"?>
<'env:Envelope' xmlns>:env=\"http://www.w3.org/2003/05/soap-envelope\" xmlns:ns1=\"urn:zimbraAdmin\">
xmlns:ns2=\"urn:zimbraAdmin\"><env:Header><ns2:context/></env:Header><env:Body>
<ModifyAccountRequest xmlns=\"urn:zimbraAdmin\"><id>4d41ec71-d898-42b8-b522-3c3cdc5583a0</id>
<a n=\"zimbraIsAdminAccount\">TRUE</a>
</ModifyAccountRequest></env:Body></env:Envelope>
That was terribly malformed. Issues are highlighted below:
1. Every instance of \" should be replaced with a simple " as the backslash escapes the quote inside a Java string literal and is not needed in normal XML.
2. There should be no single quotes around <'env:Envelope' and I honestly have no idea where they came from.
3. The closing angle bracket at xmlns>:env= should be removed, as should the one at the end of the physical line xmlns:ns1=\"urn:zimbraAdmin\">. Removing that bracket brings the next namespace declaration (which seems unnecessarily identical to ns1) into the envelope tag.
I have no idea what caused the envelope to become so malformed, but you should read up on the purpose of the values you were setting with the xmlns namespace declarations so that next time you at least understand what all the parts of the XML request do. This will help you troubleshoot your own documents in the future.
In the meantime, since you seem to be at a total loss, here is the XML with the errors above corrected.
<?xml version="1.0" encoding="UTF-8"?>
<env:Envelope xmlns:env="http://www.w3.org/2003/05/soap-envelope" xmlns:ns1="urn:zimbraAdmin" xmlns:ns2="urn:zimbraAdmin">
<env:Header>
<ns2:context/>
</env:Header>
<env:Body>
<ModifyAccountRequest xmlns="urn:zimbraAdmin">
<id>4d41ec71-d898-42b8-b522-3c3cdc5583a0</id>
<a n="zimbraIsAdminAccount">TRUE</a>
</ModifyAccountRequest>
</env:Body>
</env:Envelope>
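If you want to catch this kind of mistake before sending the request, one option is to run the string through a parser locally. The sketch below uses the JDK's DOM parser; the class and method names (WellFormedCheck, isWellFormed) are invented for the example and the request body is shortened. Note that \" appears only because the XML sits inside a Java string literal; the string the parser sees contains plain quotes.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedCheck {

    // Returns true if the text parses as namespace-well-formed XML.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // also rejects unbound prefixes
            DocumentBuilder builder = factory.newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the document
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String corrected =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
            + "<env:Envelope xmlns:env=\"http://www.w3.org/2003/05/soap-envelope\""
            + " xmlns:ns2=\"urn:zimbraAdmin\">"
            + "<env:Header><ns2:context/></env:Header>"
            + "<env:Body><ModifyAccountRequest xmlns=\"urn:zimbraAdmin\"/></env:Body>"
            + "</env:Envelope>";
        String malformed =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
            + "<'env:Envelope' xmlns>:env=\"http://www.w3.org/2003/05/soap-envelope\">"
            + "</env:Envelope>";
        System.out.println(isWellFormed(corrected));
        System.out.println(isWellFormed(malformed));
    }
}
```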
For some reason someone changed the web service XML response that I needed. So now, the information I need to fetch is inside a CDATA tag.
The thing is that all "<" and ">" characters have been replaced with "&lt;" and "&gt;".
Example how it should look like:
<MapAAAResult><![CDATA[<map>http://tstgis.xxxxxxx.xxx/gis_n/WebService1/Users/Image/xxxxxbinkor4.png|vialcap:2</map>
<nbr>234</nbr>
<nbrProcess>97 ....
And this is how I am receiving it:
<MapAAAResult>
&lt;map&gt;http://tstgis.xxxxxxx.xxx/gis_n/WebService1/Users/Image/xxxxxxxxbi542m4.png|vialcap:1&lt;/map&gt;
&lt;nbr&gt;234&lt;/nbr&gt;
&lt;nbrProcess&gt;97 .....
How can I do to get the information back to its original form? More exactly how can I transform that information back to an xml?
Any ideas?
Thanks!!
Possibly related to the character escaping issue:
HTML inside XML CDATA being converted with &lt; and &gt; brackets
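Characters such as "<" and "&" are illegal as literal text in XML element content (">" is only a problem in the sequence "]]>"), so they have to be escaped, either by wrapping the content in CDATA or by replacing them with entity references. Looks like the web service switched from one to the other somewhere along the way.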
I've encountered a similar issue where I had to parse escaped XML. A quick way to get the XML back is to use replaceAll():
String data = "<MapAAAResult>"
+ "&lt;map&gt;http://tstgis.xxxxxxx.xxx/gis_n/WebService1/Users/Image/xxxxxxxxbi542m4.png|vialcap:1&lt;/map&gt;&lt;nbr&gt;234&lt;/nbr&gt;"
+ "&lt;nbrProcess&gt;97";
// Unescape "&amp;" last so that "&amp;lt;" becomes "&lt;" rather than "<".
data = data.replaceAll("&lt;", "<");
data = data.replaceAll("&gt;", ">");
data = data.replaceAll("&amp;", "&");
System.out.println(data);
you will get back:
<MapAAAResult><map>http://tstgis.xxxxxxx.xxx/gis_n/WebService1/Users/Image/xxxxxxxxbi542m4.png|vialcap:1</map><nbr>234</nbr><nbrProcess>97...
It can get more complex with embedded CDATA tags within the first CDATA field, and xml parsing could get confused with the ending "]]>" such as:
<xml><![CDATA[ <tag><![CDATA[data]]></tag> ]]></xml>
Thus, escaping the embedded data with &lt;, &gt; and &amp; is more resilient, but it can introduce unnecessary processing. Also note: some parsers or XML readers can recognize the escaped characters and unescape them for you.
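As a sketch of that last point: with the JDK's DOM parser, reading the text content of the outer element already returns the payload with &lt;, &gt; and &amp; turned back into real characters, so no manual replaceAll() is needed. The class and method names below are invented for the example, and the payload is a shortened stand-in wrapped in a hypothetical <result> root, because the real response in the question is truncated.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class EscapedPayloadDemo {

    // Parses a string as XML, wrapping checked exceptions for brevity.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("not well-formed XML", e);
        }
    }

    // The parser unescapes entity references, so the text content of the
    // outer root element is the inner XML in its original form.
    static String innerXml(String outerXml) {
        return parse(outerXml).getDocumentElement().getTextContent();
    }

    public static void main(String[] args) {
        String outer = "<MapAAAResult>"
                + "&lt;result&gt;&lt;map&gt;http://example.invalid/a.png|vialcap:1&lt;/map&gt;"
                + "&lt;nbr&gt;234&lt;/nbr&gt;&lt;/result&gt;"
                + "</MapAAAResult>";
        String inner = innerXml(outer);
        System.out.println(inner);

        // Second pass: the recovered text is itself XML and can be parsed.
        Document innerDoc = parse(inner);
        System.out.println(innerDoc.getElementsByTagName("nbr").item(0).getTextContent());
    }
}
```

The same two-pass approach works when the payload arrives in a CDATA section, since getTextContent() returns the CDATA text as well.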
Some other related threads:
XSL unescape HTML inside CDATA
When to CDATA vs. Escape & Vice Versa?
Hi, I tried to find a solution for this but couldn't find anything of much help.
The problem is that the CDATA section has XML in it and I want to escape the special character '&' present in it. I'm using XMLBeans and tried using XmlOptionCharEscapeMap, but it throws an exception while parsing.
`XmlObject.Factory.parse(XMLString, xmlOptionsObj);`
Here, setSaveSubstituteCharacters on xmlOptionsObj was set with an XmlOptionCharEscapeMap object.
XML example:
<Message xmlns="http://www.com.test/XMLSchema">
<Header></Header>
<Body><![CDATA[<Inner xmlns="http://www.com.test/XMLSchema">
<TagA>...</TagA>
</Inner>]]></Body>
</Message>
I transform xml with the Saxon XSLT2 processor (using Java + the Saxon S9API) and have to deal with xml-documents as the source, which contain invalid characters as tag names and therefore can't be parsed by the document-builder.
Example:
<A>
<B />
<C>
<D />
</C>
<E!_RANDOM_ />
< />
</A>
Code:
import net.sf.saxon.s9api.*;
[...]
/* XSLT Processor & Compiler */
proc = new Processor(false);
/* build document from input*/
XdmNode source = proc.newDocumentBuilder().build(new StreamSource(input));
Error:
Error on line X column Y
SXXP0003: Error reported by XML parser: Element type
"E" must be followed by either attribute specifications, ">" or "/>".
The exclamation mark and the tag name just being space are currently my only invalid tags.
I am searching for a more robust solution than just removing whole lines of the (formatted) XML.
With some mind-bending I could come up with a regular expression to identify the invalid strings, but would struggle with the removal of the nodes containing attributes and child-nodes.
Thank you for your help!
If the input contains invalid tags then it is not XML. It's best to get your mind-set right by referring to these as non-XML documents rather than XML documents; that helps to make it clear that to process non-XML documents, you need non-XML tools. (Forget about "nodes" - there are no nodes until the document has been parsed, and it can't be parsed until you have turned it into well-formed XML). To turn non-XML into XML, you will typically want to use non-XML tools that are good at text manipulation, such as Perl. Of course, it's much better to fix the problem at source: all the benefits of XML are lost if people generate data in private non-XML formats.
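To make the "text manipulation first, XML parsing second" idea concrete, here is a rough sketch in Java rather than Perl. It treats the input purely as text and drops empty-element tags whose name is not a valid XML name, which covers the two cases in the question (<E!_RANDOM_ /> and < />). The class name, method name and the simplified ASCII-only name pattern are assumptions made for this example. It does not handle invalid tags with child content, and it would be fooled by "/>" or ">" inside attribute values, comments or CDATA.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InvalidTagFilter {

    // Any empty-element tag: "<", then text without angle brackets, then "/>".
    private static final Pattern EMPTY_TAG = Pattern.compile("<([^<>]*)/>");

    // Simplified, ASCII-only approximation of a valid XML name.
    private static final Pattern VALID_NAME =
            Pattern.compile("[A-Za-z_:][A-Za-z0-9_:.-]*");

    // Removes empty-element tags whose name is missing or invalid and
    // leaves everything else in the text untouched.
    static String dropInvalidEmptyTags(String text) {
        Matcher m = EMPTY_TAG.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // The name is whatever precedes the first whitespace in the tag.
            String name = m.group(1).trim().split("\\s+", 2)[0];
            boolean valid = VALID_NAME.matcher(name).matches();
            m.appendReplacement(out, valid ? Matcher.quoteReplacement(m.group()) : "");
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "<A><B /><C><D /></C><E!_RANDOM_ />< /></A>";
        System.out.println(dropInvalidEmptyTags(input));
    }
}
```

The cleaned string can then be handed to the document builder, for example through a StreamSource wrapping a StringReader.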
I've got HTML data that I'm converting into a Dom4J document.
I ran into this error:
org.dom4j.DocumentException: Error on line 1 of document : Reference is not allowed in prolog. Nested exception: Reference is not allowed in prolog.
at org.dom4j.io.SAXReader.read(SAXReader.java:482)
at org.dom4j.DocumentHelper.parseText(DocumentHelper.java:278)
at MonTest.main(MonTest.java:21)
Nested exception:
org.xml.sax.SAXParseException: Reference is not allowed in prolog.
It was an "&" character that I needed to escape as &amp; in order to build the document.
In XML, it seems that we need to escape 5 characters: (gt, lt, quot, amp, apos)
Nevertheless, how can I escape them in the text without also escaping the markup of the elements ("nodes") themselves?
<div id="test" class='toto'>A&A<A"A</div>
should give:
<div id="test" class='toto'>A&amp;A&lt;A&quot;A</div>
and not
&lt;div id=&quot;test&quot; class=&apos;toto&apos;&gt;A&amp;A&lt;A&quot;A&lt;/div&gt;
Thank you,
Escape strings before adding them to the XML document, for example with the StringEscapeUtils.escapeXml method from Apache Commons Lang. Alternatively, use a library to build the XML, e.g. http://code.google.com/p/joox/.
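As a rough illustration of the "build the XML with a library" idea, the sketch below uses only the JDK's DOM and Transformer APIs instead of Commons Lang or jOOX. The raw text is set as element content and the serializer escapes it on output, so the markup itself is never escaped. The class and method names are invented for the example.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class BuildInsteadOfEscape {

    // Builds <div id="test" class="toto"> with the given raw text as its
    // content and returns the serialized XML.
    static String divWithText(String rawText) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .newDocument();
            Element div = doc.createElement("div");
            div.setAttribute("id", "test");
            div.setAttribute("class", "toto");
            div.setTextContent(rawText); // no manual escaping here
            doc.appendChild(div);

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // '&' and '<' in the text come out as &amp; and &lt;.
        System.out.println(divWithText("A&A<A\"A"));
    }
}
```

Attribute order and whether the double quote in the text is written as &quot; depend on the serializer, so the output may not match the hand-written example character for character.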
I would have a look at using a lenient HTML XMLReader instead of the default XMLReader implementation. Something like TagSoup or HTML Tidy.