Best practise to escape XML characters? - java

I've got html datas that i'm converting into a Dom4J document.
I've met an error:
org.dom4j.DocumentException: Error on line 1 of document : Reference is not allowed in prolog. Nested exception: Reference is not allowed in prolog.
at org.dom4j.io.SAXReader.read(SAXReader.java:482)
at org.dom4j.DocumentHelper.parseText(DocumentHelper.java:278)
at MonTest.main(MonTest.java:21)
Nested exception:
org.xml.sax.SAXParseException: Reference is not allowed in prolog.
It was a character "&" that i needed to escape into & amp; in order to build the document.
In XML, it seems that we need to escape 5 characters: (gt, lt, quot, amp, apos)
Nevertheless, how can i escape it, without escaping it into the "nodes" elements:
<div id="test" class='toto'>A&A<A"A</div>
should give:
<div id="test" class='toto'>A&A<A"A</div>
and not
<div id="test" class=&apos;toto&apos;>A&A<A"A</div>
Thank you,

Escape strings before adding to XML document. Use StringEscapeUtils.escapeXml method from Apache Commons Lang. Use some library to build XML e.g. http://code.google.com/p/joox/.

I would have a look at using a lenient HTML XMLReader instead of the default XMLReader implementation. Something like tag soup or html tidy.

Related

Trying to replace <br>, <BR>, <br +attribute> tags with <br/>

I am attempting to convert a bunch of HTML documents to XML compliance (via a java method) and there are a lot of <br> tags that either (1) are unclosed or (2) contain attributes. For some reason the regex I'm using does not address the tags that contain attributes. Here is the code:
htmlString = htmlString.replaceAll("(?i)<br *>", "<br/>");
This code works fine for all the <br> tags in the documents; it replaces them with <br/>. However, for tags like
<BR style="PAGE-BREAK-BEFORE: always" clear=all>
it doesn't do anything. I'd like all br tags to just be <br/>, regardless of any attributes in the tag prior to conversion.
What do I need to add to my regex in order to achieve this?
This regex will do what you want: <(BR|br)[^>]*>
Here is a working example: Regex101
You probably want <br\b[^>]*> to match all tags that
Start with <br
Have a word-break after the <br (so you wouldn't match a <brown> tag, for example
Contain any number of non-> characters, including 0
End with a >
You have to use .* instead of * :
htmlString.replaceAll("(?i)<br .*>", "<br/>")
//-----------------------------^^
because :
* Match the preceding character or subexpression 0 or more times.
and
.* Matches any character zero or many times
So for your case :
String htmlString = "<BR style=\"PAGE-BREAK-BEFORE: always\" clear=all>";
System.out.println(htmlString.replaceAll("(?i)<br .*>", "<br/>"));
Output
<br/>
Using regular expressions to parse HTML is not a good idea because HTML is not regular. You should use a proper parsing library like NekoHTML.
NekoHTML is a simple HTML scanner and tag balancer that enables
application programmers to parse HTML documents and access the
information using standard XML interfaces. The parser can scan HTML
files and "fix up" many common mistakes that human (and computer)
authors make in writing HTML documents. NekoHTML adds missing parent
elements; automatically closes elements with optional end tags; and
can handle mismatched inline element tags.

Java CDATA extract xml

For some reason someone changed the webService xml response that I needed. So now, the imformation I need to fetch is inside a CDATA tag.
The thing is that all "<" and ">" characters have been replaced with "<" and ">".
Example how it should look like:
<MapAAAResult><!CDATA[<map>http://tstgis.xxxxxxx.xxx/gis_n/WebService1/Users/Image/xxxxxbinkor4.png|vialcap:2</map>
<nbr>234</nbr>
<nbrProcess>97` ....
And this is how I am receiving it:
<MapAAAResult>
<mapa>http://tstgis.xxxxxxx.xxx/gis_n/WebService1/Users/Image/xxxxxxxxbi542m4.png|vialcap:1</map>
<nbr>234</nbr>
<nbrProcess>97 .....
How can I do to get the information back to its original form? More exactly how can I transform that information back to an xml?
Any ideas?
Thanks!!
Possibly related to the character escaping issue:
HTML inside XML CDATA being converted with < and > brackets
The characters like "<" , ">", "&" are illegal in XML elements and escaping these can be done via CDATA or character replacement. Looks like the webService switched up their schema somewhere along the way.
I've encountered a similar issue where I had to parse an escaped xml. A quick solution to get back the xml is to use replaceAll():
String data = "<MapAAAResult>"
+ "<map>http://tstgis.xxxxxxx.xxx/gis_n/WebService1/Users/Image/xxxxxxxxbi542m4.png|vialcap:1</map><nbr>234</nbr>"
+ "<nbrProcess>97";
data = data.replaceAll("<","<");
data = data.replaceAll(">", ">");
data = data.replaceAll("&","&");
System.out.println(data);
you will get back:
<MapAAAResult><map>http://tstgis.xxxxxxx.xxx/gis_n/WebService1/Users/Image/xxxxxxxxbi542m4.png|vialcap:1</map><nbr>234</nbr><nbrProcess>97...
It can get more complex with embedded CDATA tags within the first CDATA field, and xml parsing could get confused with the ending "]]>" such as:
<xml><![CDATA[ <tag><![CDATA[data]]></tag> ]]></xml>
Thus, escaping the embedded data by using the < > & is more resilient but can introduce unnecessary processing. Also note: some parsers or xml readers can recognize the escaped characters.
Some other related threads:
XSL unescape HTML inside CDATA
When to CDATA vs. Escape & Vice Versa?

Removing nodes with invalid tag names from a xml document

I transform xml with the Saxon XSLT2 processor (using Java + the Saxon S9API) and have to deal with xml-documents as the source, which contain invalid characters as tag names and therefore can't be parsed by the document-builder.
Example:
<A>
<B />
<C>
<D />
</C>
<E!_RANDOM_ />
< />
</A>
Code:
import net.sf.saxon.s9api.*;
[...]
/* XSLT Processor & Compiler */
proc = new Processor(false);
/* build document from input*/
XdmNode source = proc.newDocumentBuilder().build(new StreamSource(input));
Error:
Error on line X column Y
SXXP0003: Error reported by XML parser: Element type
"E" must be followed by either attribute specifications, ">" or "/>".
The exclamation mark and the tag name just being space are currently my only invalid tags.
I am searching for a more robust solution rather than just removing whole lines of the (formated) xml.
With some mind-bending I could come up with a regular expression to identify the invalid strings, but would struggle with the removal of the nodes containing attributes and child-nodes.
Thank you for your help!
If the input contains invalid tags then it is not XML. It's best to get your mind-set right by referring to these as non-XML documents rather than XML documents; that helps to make it clear that to process non-XML documents, you need non-XML tools. (Forget about "nodes" - there are no nodes until the document has been parsed, and it can't be parsed until you have turned it into well-formed XML). To turn non-XML into XML, you will typically want to use non-XML tools that are good at text manipulation, such as Perl. Of course, it's much better to fix the problem at source: all the benefits of XML are lost if people generate data in private non-XML formats.

Web harvest failing to convert malformed html to xml

I am using xquery processor in web harvest (from java) to parse an html page that contains an invalid tag inside a <div> element, like <div 3px="abc">. The exception is:
SXXP0003: Error reported by XML parser: Element type "div" must be followed by either
attribute specifications, ">" or "/>".
at org.webharvest.runtime.processors.XQueryProcessor.execute(Unknown Source)
Is there a quick way to clean the div pre-processing? Or any workaround for this problem?

XML Illegal Attribute Value

I am using the SAX parser in java to read some XML. The XML I am giving it has problems and is causing the parse to fail. Here is the error message:
11-18 10:25:37.290: W/System.err(3712): org.xml.sax.SAXParseException: Illegal: "<" inside attribute value (position:START_TAG <question text='null'>#1:23 in java.io.InputStreamReader#4074c678)
I have a feeling that it does not like the fact that I have some HTML tags inside of a string in the XML. I would think that anything inside the quotes gets ignored from a syntax standpoint. Also, is it valid to use single quotes here? Here is an example:
<quiz>
<question text="<img src='//files/alex/hilltf.PNG' alt='hill' style='max-width:400px' /> is represented on map by cut. ">
<answer text="1"/>
<answer text="2" correct="true"/>
</question>
</quiz>
You need to escape the < inside the text attribute value. Since XML uses < and > to denote tags, it's illegal in content unless escaped or enclosed in a CDATA tag (which isn't an option for an attribute value).
The error message is correct. A < must be the start of a tag, and cannot appear inside a string. It must be < instead. I don't believe the quotes is a problem.

Categories

Resources