Web harvest failing to convert malformed html to xml - java

I am using the XQuery processor in Web-Harvest (from Java) to parse an HTML page that contains an invalid attribute name inside a <div> element, like <div 3px="abc">. The exception is:
SXXP0003: Error reported by XML parser: Element type "div" must be followed by either
attribute specifications, ">" or "/>".
at org.webharvest.runtime.processors.XQueryProcessor.execute(Unknown Source)
Is there a quick way to clean up the div in a pre-processing step? Or any other workaround for this problem?

Related

Trying to replace <br>, <BR>, <br +attribute> tags with <br/>

I am attempting to make a bunch of HTML documents XML-compliant (via a Java method), and there are a lot of <br> tags that either (1) are unclosed or (2) contain attributes. For some reason the regex I'm using does not handle the tags that contain attributes. Here is the code:
htmlString = htmlString.replaceAll("(?i)<br *>", "<br/>");
This code works fine for all the <br> tags in the documents; it replaces them with <br/>. However, for tags like
<BR style="PAGE-BREAK-BEFORE: always" clear=all>
it doesn't do anything. I'd like all br tags to just be <br/>, regardless of any attributes in the tag prior to conversion.
What do I need to add to my regex in order to achieve this?
This regex will do what you want: <(BR|br)[^>]*>
Here is a working example: Regex101
You probably want <br\b[^>]*> to match all tags that:
Start with <br
Have a word boundary after the <br (so you wouldn't match a <brown> tag, for example)
Contain any number of non-> characters, including 0
End with a >
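As a minimal sketch of how that pattern could be applied from Java: the class and method names below are hypothetical, and the case-insensitive flag is carried over from the question's own (?i) prefix so that <BR ...> is covered too.

```java
import java.util.regex.Pattern;

class BrTagNormalizer {
    // "<br", a word boundary, any run of non-'>' characters, then '>'.
    // CASE_INSENSITIVE makes the same pattern cover <BR ...> as well.
    private static final Pattern BR_TAG =
            Pattern.compile("<br\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: rewrites every br tag, with or without attributes, to <br/>.
    static String normalize(String html) {
        return BR_TAG.matcher(html).replaceAll("<br/>");
    }

    public static void main(String[] args) {
        String html = "a<br>b<BR style=\"PAGE-BREAK-BEFORE: always\" clear=all>c<brown>d";
        // prints: a<br/>b<br/>c<brown>d
        System.out.println(normalize(html));
    }
}
```

The word boundary is what leaves the made-up <brown> tag alone, because there is no boundary between "r" and "o".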
You have to use .* instead of * :
htmlString.replaceAll("(?i)<br .*>", "<br/>")
//-----------------------------^^
because:
* matches the preceding character or subexpression zero or more times
and
.* matches any character zero or more times
So for your case:
String htmlString = "<BR style=\"PAGE-BREAK-BEFORE: always\" clear=all>";
System.out.println(htmlString.replaceAll("(?i)<br .*>", "<br/>"));
Output
<br/>
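One thing worth checking with this pattern is how .* behaves when other tags follow on the same line. The sketch below (hypothetical class and method names, made-up input line) compares the greedy .* from this answer with a reluctant .*? variant, which is one possible adjustment and not part of the original answer.

```java
class GreedyBrDemo {
    // The pattern from the answer: greedy ".*" runs to the LAST '>' on the line.
    static String greedy(String html) {
        return html.replaceAll("(?i)<br .*>", "<br/>");
    }

    // Same pattern with a reluctant quantifier: ".*?" stops at the FIRST '>'.
    // Like the original, it still requires a space after "br".
    static String reluctant(String html) {
        return html.replaceAll("(?i)<br .*?>", "<br/>");
    }

    public static void main(String[] args) {
        String line = "<BR clear=all>one<b>two</b>";
        System.out.println(greedy(line));    // prints: <br/>
        System.out.println(reluctant(line)); // prints: <br/>one<b>two</b>
    }
}
```

With the greedy form, everything up to the final closing tag on the line is swallowed into the replacement, so the text after the br tag is lost.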
Using regular expressions to parse HTML is not a good idea because HTML is not regular. You should use a proper parsing library like NekoHTML.
NekoHTML is a simple HTML scanner and tag balancer that enables
application programmers to parse HTML documents and access the
information using standard XML interfaces. The parser can scan HTML
files and "fix up" many common mistakes that human (and computer)
authors make in writing HTML documents. NekoHTML adds missing parent
elements; automatically closes elements with optional end tags; and
can handle mismatched inline element tags.

Removing nodes with invalid tag names from an XML document

I transform XML with the Saxon XSLT 2 processor (using Java and the Saxon s9api) and have to deal with source documents that contain invalid characters in tag names and therefore can't be parsed by the document builder.
Example:
<A>
<B />
<C>
<D />
</C>
<E!_RANDOM_ />
< />
</A>
Code:
import javax.xml.transform.stream.StreamSource;
import net.sf.saxon.s9api.*;
[...]
/* XSLT processor and compiler */
proc = new Processor(false);
/* Build the document from the input */
XdmNode source = proc.newDocumentBuilder().build(new StreamSource(input));
Error:
Error on line X column Y
SXXP0003: Error reported by XML parser: Element type
"E" must be followed by either attribute specifications, ">" or "/>".
An exclamation mark in the tag name and a tag name consisting only of a space are currently the only invalid cases.
I am looking for a more robust solution than just removing whole lines of the (formatted) XML.
With some mind-bending I could come up with a regular expression to identify the invalid strings, but would struggle with the removal of the nodes containing attributes and child-nodes.
Thank you for your help!
If the input contains invalid tags then it is not XML. It's best to get your mind-set right by referring to these as non-XML documents rather than XML documents; that helps to make it clear that to process non-XML documents, you need non-XML tools. (Forget about "nodes" - there are no nodes until the document has been parsed, and it can't be parsed until you have turned it into well-formed XML). To turn non-XML into XML, you will typically want to use non-XML tools that are good at text manipulation, such as Perl. Of course, it's much better to fix the problem at source: all the benefits of XML are lost if people generate data in private non-XML formats.
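A text-level cleanup of the kind described does not have to be Perl; as a rough sketch, the same idea can be expressed with a Java regex before the parser ever sees the input. The class name, the helper, and the pattern are assumptions for illustration. The pattern only removes empty-element tags (<.../>) whose name is not a plain ASCII name, which covers the two cases in the question but not invalid elements with child content, and it treats XML names more narrowly than the specification does.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class InvalidTagStripper {
    // Matches "<.../>" only when what follows '<' is NOT a simple ASCII name
    // followed by whitespace, '/' or '>'. Simplified: real XML names allow more.
    private static final String INVALID_EMPTY_TAG =
            "<(?![A-Za-z_][\\w.\\-]*[\\s/>])[^<>/]*/>";

    // Hypothetical helper: plain text manipulation, no XML parsing involved.
    static String strip(String text) {
        return text.replaceAll(INVALID_EMPTY_TAG, "");
    }

    public static void main(String[] args) throws Exception {
        String input = "<A>\n<B />\n<C>\n<D />\n</C>\n<E!_RANDOM_ />\n< />\n</A>";
        String cleaned = strip(input);
        // Only after the cleanup is the text well-formed enough to parse.
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(cleaned)));
        System.out.println(doc.getDocumentElement().getNodeName()); // prints: A
    }
}
```

The cleaned string could then be wrapped in a StreamSource over a StringReader and handed to the document builder from the question.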

Best practice to escape XML characters?

I've got HTML data that I'm converting into a dom4j document.
I ran into an error:
org.dom4j.DocumentException: Error on line 1 of document : Reference is not allowed in prolog. Nested exception: Reference is not allowed in prolog.
at org.dom4j.io.SAXReader.read(SAXReader.java:482)
at org.dom4j.DocumentHelper.parseText(DocumentHelper.java:278)
at MonTest.main(MonTest.java:21)
Nested exception:
org.xml.sax.SAXParseException: Reference is not allowed in prolog.
The cause was an "&" character that I needed to escape as &amp; in order to build the document.
In XML, it seems that we need to escape five characters (gt, lt, quot, amp, apos).
However, how can I escape them in the text content only, without also escaping the markup of the elements themselves?
<div id="test" class='toto'>A&A<A"A</div>
should give:
<div id="test" class='toto'>A&amp;A&lt;A&quot;A</div>
and not
<div id="test" class=&apos;toto&apos;>A&amp;A&lt;A&quot;A</div>
Thank you.
Escape strings before adding them to the XML document, for example with the StringEscapeUtils.escapeXml method from Apache Commons Lang. Alternatively, use a library to build the XML, e.g. http://code.google.com/p/joox/.
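To illustrate "escape before adding" without pulling in a dependency, here is a minimal hand-written sketch. The escapeXmlText helper is hypothetical and is not the Commons Lang implementation; it only handles the five predefined entities and is applied to the text content, never to the surrounding markup.

```java
class XmlTextEscaper {
    // Hypothetical helper: escapes the five predefined XML entities in character data.
    static String escapeXmlText(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Escape only the content, then place it inside the untouched markup.
        String content = "A&A<A\"A";
        String xml = "<div id=\"test\" class='toto'>" + escapeXmlText(content) + "</div>";
        // prints: <div id="test" class='toto'>A&amp;A&lt;A&quot;A</div>
        System.out.println(xml);
    }
}
```

This only works when the content and the markup are available separately. For HTML that already arrives as one string, a lenient parser is the more realistic route.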
I would look at using a lenient HTML XMLReader instead of the default XMLReader implementation, something like TagSoup or HTML Tidy.

How to parse xml having html tags within xml tags

I've got an XML document that has HTML inside its XML tags, and I'm not able to parse it as is.
When I start parsing the XML, the str tag has HTML in it.
Can anyone help me extract the HTML with all of its tags?
It is a good idea to store XHTML within CDATA tags (<![CDATA[ and ]]>), so that it can be retrieved normally:
<str name="body">
<![CDATA[<font face="arial" size="2"><ul><li><p align="justify">india’s first</p></li></ul></font>]]>
</str>
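A minimal sketch of reading such a CDATA-wrapped body back with the JDK's DOM parser follows. The class and method names are hypothetical; the str element name comes from the example above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class CdataBodyReader {
    // Hypothetical helper: returns the content of the first <str> element as a string.
    static String readBody(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            // getTextContent() returns the CDATA payload verbatim, tags included;
            // trim() drops the line breaks around the CDATA section.
            return doc.getElementsByTagName("str").item(0).getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<str name=\"body\">\n"
                + "<![CDATA[<font face=\"arial\" size=\"2\"><ul><li><p align=\"justify\">first</p></li></ul></font>]]>\n"
                + "</str>";
        System.out.println(readBody(xml));
    }
}
```

Because the parser treats everything inside the CDATA section as text, the embedded HTML does not need to be well-formed for the outer document to parse.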
The problem is not the HTML but improper HTML. If this HTML is under your control, ensure it complies with XHTML, and an XML parser will treat it as normal XML. Otherwise, you may use tools like "HTML Tidy" to fix your HTML, or use an HTML parser. For example:
http://www.codeproject.com/KB/dotnet/apmilhtml.aspx

Need to handle special characters in URL

My input html is
<p>
<span>first
</span>
<span>Google Cloud Connect for Microsoft Office</span>
</p>
I am using XSLT 1.0 to convert the HTML to XML. My output XML is
<Relationship Id="rId12700703801" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink" Target="http://tools.google.com/dlpage/cloudconnect#utm_campaign=launch&utm_source=en-na-us-gdb-GCC-Appsperience_02242011&utm_medium=blog" TargetMode="External"/></Relationships>
with the error "XML Parsing Error: not well-formed" reported at the = after launch&utm_source in the Target attribute.
I want to escape the special characters present in the URL through XSLT so that the resulting XML is well-formed.
Please help me. Thanks in advance.
Are you generating the input HTML? If so, you can use URLEncoder.encode to properly encode the string so the transformer doesn't complain about the syntax.
If this is just a random HTML page and you have no control over it, then you probably need to use an HTML parser, such as TagSoup, to pre-correct it, as most HTML files are not well-formed.
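For the URLEncoder.encode suggestion, a small sketch of what the call does to a single query-parameter value is shown here. The class name, helper, example.com URL, and parameter value are all made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class QueryValueEncoder {
    // Hypothetical helper: percent-encodes ONE parameter value, not a whole URL.
    static String encodeValue(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is required on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // '&' inside the value becomes %26 and the space becomes '+'.
        String target = "http://example.com/page?q=" + encodeValue("R&D blog");
        System.out.println(target); // prints: http://example.com/page?q=R%26D+blog
    }
}
```

Note that this only affects characters inside a value. The & characters that separate parameters are left as they are and still have to be written as &amp; when the URL is placed in an XML attribute.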
XSLT expects XML as input, not HTML. You need to turn your HTML into XML if you want to transform it with XSLT.
I think it might be possible to do it with HTML Tidy.
