Converting HTML table into XML

I have an application that generates test case results in the form of an HTML table.
I need to convert this HTML table data into JUnit XML format for use elsewhere.
How can I parse each entry of the table? I have searched the internet, and people in other posts suggest converting HTML to XHTML, which is its XML equivalent, but that does not meet my requirement. I want the XML to conform to the JUnit XSD schema.

You need a two-step process. First, use an HTML parser to create a DOM. There are lots to choose from in Java.
Once you have parsed the HTML into a DOM, you can transform it into the XML form you want, either with hand-written code or with a transformation language such as XSLT. Xalan is typically the library you want for performing XSLT transforms in Java.
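The second step (DOM to JUnit-style XML, by hand code) might look roughly like the sketch below. It makes several assumptions that are not in the question: the table has already been cleaned into well-formed markup, each data row has three cells (test name, status, message), and the target is the commonly used testsuite/testcase/failure layout. The class name TableToJUnitXml and the column order are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: assumes a well-formed table with columns name, status, message.
final class TableToJUnitXml {

    static String convert(String wellFormedTable, String suiteName) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document in = builder.parse(new InputSource(new StringReader(wellFormedTable)));
            Document out = builder.newDocument();

            Element suite = out.createElement("testsuite");
            suite.setAttribute("name", suiteName);
            out.appendChild(suite);

            int tests = 0;
            int failures = 0;
            NodeList rows = in.getElementsByTagName("tr");
            for (int i = 0; i < rows.getLength(); i++) {
                NodeList cells = ((Element) rows.item(i)).getElementsByTagName("td");
                if (cells.getLength() < 3) {
                    continue; // header row (th cells) or a row with an unexpected layout
                }
                Element testCase = out.createElement("testcase");
                testCase.setAttribute("name", cells.item(0).getTextContent().trim());
                suite.appendChild(testCase);
                tests++;

                // Anything other than PASS is reported as a failure in this sketch.
                if (!"PASS".equalsIgnoreCase(cells.item(1).getTextContent().trim())) {
                    Element failure = out.createElement("failure");
                    failure.setAttribute("message", cells.item(2).getTextContent().trim());
                    testCase.appendChild(failure);
                    failures++;
                }
            }
            suite.setAttribute("tests", String.valueOf(tests));
            suite.setAttribute("failures", String.valueOf(failures));

            // An identity transform serializes the new DOM to text.
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter xml = new StringWriter();
            serializer.transform(new DOMSource(out), new StreamResult(xml));
            return xml.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not convert table", e);
        }
    }

    public static void main(String[] args) {
        String table = "<table>"
            + "<tr><th>Test</th><th>Status</th><th>Message</th></tr>"
            + "<tr><td>login</td><td>PASS</td><td></td></tr>"
            + "<tr><td>logout</td><td>FAIL</td><td>timeout</td></tr>"
            + "</table>";
        System.out.println(convert(table, "nightly"));
    }
}
```

For real-world HTML, the first step still applies: run the page through a lenient HTML parser to obtain the DOM, then apply the same row-walking logic to it.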

Related

How to create a Word doc from a template using Content Control Data Binding with OpenDoPE

I have a Word template, complete with fonts, colors, etc. I am querying a database and retrieving information into a POJO. I want to extract the relevant info from said POJO and create a Word document as per my template's directives.
The doc will have tables and graphs so I need to use Content Control Data Binding. As I understand it, I'll have to do the following to achieve this
Modify the Word template to add content controls
Transform the POJO into an XML object (template?)
Use ContentControlMergeXML to bind the XML data to the Word template
Unfortunately, I can't find a good step-by-step example of this anywhere. Nearly all of the links in the docx4j forum lead to broken GitHub pages
My questions
How can I use OpenDoPE to add tags to my Word template? I'll need to preserve style, so I want the correct OpenDoPE version
Should the POJO be converted into an XML object or document?
Is there an end to end example of this entire process so I can follow along? (preferably with source code)

Content control data binding essentially injects an XPath value into a content control in the Word document.
That XPath is evaluated against an XML document, so yes, you need to convert your POJO into XML.
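To make "an XPath evaluated against an XML document" concrete, here is a small sketch that uses only the JDK. It is not docx4j code and does not touch a docx; the Customer class, the element names, and the XPath are hypothetical stand-ins for your POJO and your bindings.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

final class BindingSketch {

    /** Hypothetical POJO standing in for the data queried from the database. */
    static final class Customer {
        final String name;
        final String city;

        Customer(String name, String city) {
            this.name = name;
            this.city = city;
        }
    }

    /** Serializes the POJO by hand; JAXB would generate equivalent XML from annotations. */
    static String toXml(Customer customer) {
        try {
            StringWriter out = new StringWriter();
            XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            writer.writeStartElement("customer");
            writer.writeStartElement("name");
            writer.writeCharacters(customer.name); // special characters are escaped here
            writer.writeEndElement();
            writer.writeStartElement("city");
            writer.writeCharacters(customer.city);
            writer.writeEndElement();
            writer.writeEndElement();
            writer.close();
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize customer", e);
        }
    }

    /** Returns the string value the given XPath selects in the XML. */
    static String evaluate(String xpathExpression, String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(xpathExpression, new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + xpathExpression, e);
        }
    }

    public static void main(String[] args) {
        String xml = toXml(new Customer("Ada & Co", "London"));
        // A content control bound to /customer/name would display this value.
        System.out.println(evaluate("/customer/name", xml));
    }
}
```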
Authoring
Now, there are 3 different OpenDoPE Word AddIns which you can use to add content controls to your Word document. See the links at https://opendope.org/implementations.html
The most recent one assumes a fixed XML format. So to use that, you'd need to transform your POJO to match that format. (ie use the AddIn to author your docx, then inspect the resulting XML (embedded in the docx), then figure out how to transform your POJO to that).
The older AddIns support arbitrary XML, but are cruder. To use one of these, first convert your POJO to XML (eg using JAXB), then feed the AddIn your sample XML.
Runtime
To bind your XML to a docx "template" to create an instance docx, see https://github.com/plutext/docx4j/blob/master/docx4j-samples-docx4j/src/main/java/org/docx4j/samples/ContentControlBindingExtensions.java
You can run that sample code against the sample docx + data; you can take a look at the docx to see what the content controls look like (they bind a custom xml part in the docx, so unzip it to see that)
ps the GitHub links broke as a result of a recent code re-org. GitHub isn't smart enough to dynamically maintain them :-( See https://www.docx4java.org/downloads.html for downloadable sample code.

Creating HTML from XML v/s JAVA object

I am trying to create an HTML file from the result of an execution. The result is in the form of XML. There are a few transformers that I can use to transform XML to HTML using an XSLT file.
I will also have the Java object for the result, which I could use to generate the HTML instead.
Which of the two approaches is better, and is there any API that I can use to convert a Java object to HTML other than XSLT or file I/O?
Can anyone help?

I believe you have to go via XML (either directly, or generated from your Java object by JAXB).
In principle, templating frameworks (Velocity, FreeMarker, ...) let you prepare a template into which you inject your Java object and render the response as you wish. But personally I think it will be simpler just to transform the XML that you already have.
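A minimal sketch of the "transform the XML you already have" route, using the XSLT support built into the JDK (javax.xml.transform). The result/step element names and the stylesheet are assumptions made up for illustration; your result XML will have its own structure.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class ResultToHtml {

    // Hypothetical stylesheet: one heading for the run, one list item per step.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='html' indent='no'/>"
        + "<xsl:template match='/result'>"
        + "<html><body><h1><xsl:value-of select='@name'/></h1>"
        + "<ul><xsl:for-each select='step'>"
        + "<li><xsl:value-of select='.'/></li>"
        + "</xsl:for-each></ul></body></html>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    static String render(String resultXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter html = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(resultXml)),
                new StreamResult(html));
            return html.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not render result", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<result name='nightly'><step>compile</step><step>test</step></result>";
        System.out.println(render(xml));
    }
}
```

In practice the stylesheet would live in its own .xsl file, so the HTML layout can change without recompiling the Java code.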

Extensible HTML parsing in Java driven by decoupled rules

We are using the awesome jsoup library to parse HTML documents in Java.
The sources of these documents differ (they come from different clients), so the HTML elements and the text differ per source. To handle this we have written a separate HTML parser for each source, which deals with the elements, element text, element attributes etc. of that document. Some of the parsed text needs to be replaced as well.
This works, but it is not extensible. We have to write a new HTML parser for each new HTML document source, or add/change the code of an existing one if elements are added to or removed from a supported HTML document.
E.g. suppose that today the parser for a document from company ExampleCompany is expected to parse their HTML and process the following 2 elements:
Document doc = Jsoup.parse(htmlAsString);
String dataExampleCount = doc.select("div[id=top-share-bar]").attr("data-example_count");
String cbDateText = doc.select("div[class=cbdate]").text();
Tomorrow, ExampleCompany adds a new element to their HTML (it may be in JavaScript or CSS or in the body) like "a[class=loc mr10]" and expects us to use that element's text as well. So we have to go and add another line of code:
String locMr10Text = doc.select("a[class=loc mr10]").text();
Is there a way to decouple the rules or XPath expressions that find the elements and their text into some external file, be it XML, JSON or XSL, where I can just define which elements to look for and which attributes or text to extract?
So, from the above example, if I externalize the rules in JSON:
{
  "Attrs": {
    "div[id=top-share-bar]": "data-example_count"
  },
  "Text": [
    "div[class=cbdate]",
    "a[class=loc mr10]"
  ]
}
We could just keep updating the rules JSON without adding any line of Java code: just parse the JSON and parse the HTML accordingly.
This will help in two ways:
There will be only 1 HTML parser, which just takes the rules and the HTML document and produces the output.
No need to recompile the code if the HTML document's elements change. Just change the rules file to accommodate the change.
I am thinking of writing our own format to externalize the XPATH expressions etc but wished to know if there is something standard being used if there is a requirement like ours.
I have read a related question, "File format for storing html parser rules", however I am not sure whether the answer gives any direction on the best way of decoupling what to parse from how to parse it.
Any suggestions will be helpful.
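One possible shape for such a rule-driven parser is sketched below. To stay runnable with the JDK alone it makes two swaps that are assumptions rather than anything from the question: the rules are kept in a java.util.Properties file (field name = XPath expression) instead of JSON, and they are applied with the JDK's XPath engine, which requires well-formed markup. With jsoup, the same loop would store CSS selectors in the rules file and pass each one to doc.select(...). The field names and the RuleDrivenExtractor class are hypothetical.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class RuleDrivenExtractor {

    /**
     * Applies every "field = xpath" rule to the document and returns field -> extracted text.
     * Adding a field means adding a line to the rules text, not changing this code.
     */
    static Map<String, String> extract(String rulesText, String wellFormedHtml) {
        try {
            Properties rules = new Properties();
            rules.load(new StringReader(rulesText));

            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(wellFormedHtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();

            Map<String, String> values = new TreeMap<>();
            for (String field : rules.stringPropertyNames()) {
                values.put(field, xpath.evaluate(rules.getProperty(field), doc));
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("Could not apply extraction rules", e);
        }
    }

    public static void main(String[] args) {
        // In a real setup this text would be read from an external rules file per client.
        String rules =
            "exampleCount = //div[@id='top-share-bar']/@data-example_count\n"
            + "cbDate = //div[@class='cbdate']\n"
            + "loc = //a[@class='loc mr10']\n";
        String html = "<html><body>"
            + "<div id='top-share-bar' data-example_count='42'/>"
            + "<div class='cbdate'>2014-05-01</div>"
            + "<a class='loc mr10' href='#'>Berlin</a>"
            + "</body></html>";
        System.out.println(extract(rules, html));
    }
}
```

Because an XPath can select either an attribute or an element's text, a single flat list of rules covers both the "Attrs" and "Text" cases from the JSON above.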

How do I write unescaped XML outside of a CDATA

I am trying to write XML data using Stax where the content itself is HTML
If I try
xtw.writeStartElement("contents");
xtw.writeCharacters("<b>here</b>");
xtw.writeEndElement();
I get this
<contents>&lt;b&gt;here&lt;/b&gt;</contents>
Then I notice the CDATA method and change my code to:
xtw.writeStartElement("contents");
xtw.writeCData("<b>here</b>");
xtw.writeEndElement();
and this time the result is
<contents><![CDATA[<b>here</b>]]></contents>
which is still not good. What I really want is
<contents><b>here</b></contents>
So is there an XML API/Library that allows me to write raw text without being in a CDATA section? So far I have looked at Stax and JDom and they do not seem to offer this.
In the end I might resort to good old StringBuilder but this would not be elegant.
Update:
I agree mostly with the answers so far. However instead of <b>here</b> I could have a 1MB HTML document that I want to embed in a bigger XML document. What you suggest means that I have to parse this HTML document in order to understand its structure. I would like to avoid this if possible.
Answer:
It is not possible, otherwise you could create invalid XML documents.

The issue is that this is not raw text, it is an element, so you should be writing
xtw.writeStartElement("contents");
xtw.writeStartElement("b");
xtw.writeCharacters("here");
xtw.writeEndElement();
xtw.writeEndElement();

If you want the XML to be included AS XML and not as character data, then it has to be parsed at some point. If you don't want to manually do the parsing yourself, you have two alternatives:
(1) Use external parsed entities -- in this case the external file will be pulled in and parsed by the XML parser. When the output is again serialized, it will include the contents of the external file.
[ See http://www.javacommerce.com/displaypage.jsp?name=entities.sql&id=18238 ]
(2) Use XInclude -- in that case the file has to be run through an XInclude processor, which will merge the XInclude references into the output. Most XSLT processors, as well as xmllint, will also do XInclude with an appropriate option.
[ See: http://www.xml.com/pub/a/2002/07/31/xinclude.html ]
( XSLT can also be used to merge documents without using the XInclude syntax. XInclude just provides a standard syntax )

The problem is not "here", it's <b></b>.
Add the <b> element as a child of contents and you'll be able to do it. Any library like JDOM or DOM4J will allow you to do this. The general case is to parse the content into an XML DOM and add the root element as a child of <contents>.
You can't add unescaped markup as text outside of a CDATA section.

If you want to embed a large HTML document in an XML document then CDATA imho is the way to go. That way you don't have to understand or process the internal structure and you can later change the document type from HTML to something else without much hassle. Also I think you can't embed e.g. DOCTYPE instructions directly (i.e. as structured data that retains the semantics of the DOCTYPE instruction). They have to be represented as characters.
(This is primarily a response to your update, but alas I don't have enough rep to comment.)

I don't see what the problem is with parsing the large block of XML you want to insert into your output. Use a StAX parser to parse it, and just write code to forward all of the events to your existing serializer (variable "xtw").
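A rough sketch of that event forwarding is shown below. It uses the StAX event API (XMLEventReader and XMLEventWriter) rather than the XMLStreamWriter named xtw in the question, because an event writer can accept the reader's events directly. It assumes the embedded content is well-formed XML with a single root element; the class name EmbedFragment is hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.XMLEvent;

final class EmbedFragment {

    /** Wraps a well-formed fragment in a contents element without escaping its markup. */
    static String wrap(String wellFormedFragment) {
        try {
            StringWriter out = new StringWriter();
            XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
            XMLEventFactory events = XMLEventFactory.newInstance();
            XMLEventReader reader = XMLInputFactory.newInstance()
                .createXMLEventReader(new StringReader(wellFormedFragment));

            writer.add(events.createStartElement("", "", "contents"));
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                // The fragment's own document start/end must not be repeated mid-document.
                if (event.isStartDocument() || event.isEndDocument()) {
                    continue;
                }
                writer.add(event); // elements, text, comments are forwarded as-is
            }
            writer.add(events.createEndElement("", "", "contents"));

            writer.flush();
            writer.close();
            reader.close();
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not embed fragment", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(wrap("<b>here</b>"));
    }
}
```

Since both sides are streaming, a 1MB document passes through event by event; passing a file-backed Reader instead of a String avoids holding the whole input in memory.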

If the blob of html is actually xhtml then I'd suggest doing something like (in pseudo-code):
xtw.writeStartElement("contents")
XMLReader xtr=new XMLReader();
xtr.read(blob);
Dom dom=xtr.getDom();
for(element e:dom){
xtw.writeElement(e);
}
xtw.writeEndElement();
or something like that. I had to do something similar once but used a different library.

If your XML and HTML are not too big, you could make a workaround:
xtw.writeStartElement("contents");
xtw.writeCharacters("anUniqueIdentifierForReplace"); // <--
xtw.writeEndElement();
When you have your XML as a String:
xmlAsString = xmlAsString.replace("anUniqueIdentifierForReplace", yourHtmlAsString);
I know, it's not so nice, but this could work.
Edit: Of course, you should check if yourHtmlAsString is valid.

Getting elements by type in malformed HTML

What's the easiest way in Java to retrieve all elements with a certain type in a malformed HTML page? So I want to do something like this:
public static void main(String[] args) {
// Read in an HTML file from disk
// Retrieve all INPUT elements regardless of whether the HTML is well-formed
// Loop through all elements and retrieve their ids if they exist for the element
}

HtmlCleaner is arguably one of the best HTML parsers out there when it comes to dealing with (somewhat) malformed HTML.
Documentation is here with some code samples; you're basically looking for getElementsByName() method.
Take a look at Comparison of Java HTML parsers if you're considering other libraries.

I've had success using TagSoup. Here's a short description from their home page:
This is the home page of TagSoup, a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.

Check JTidy.
JTidy is a Java port of HTML Tidy, a HTML syntax checker and pretty printer. Like its non-Java cousin, JTidy can be used as a tool for cleaning up malformed and faulty HTML. In addition, JTidy provides a DOM interface to the document that is being processed, which effectively makes you able to use JTidy as a DOM parser for real-world HTML.
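The libraries above are separate downloads. As a dependency-free sketch of the same task, the example below swaps in the lenient HTML parser that ships with the JDK's Swing package (javax.swing.text.html.parser.ParserDelegator). That parser is not one of the options suggested in the answers, and whether it copes with your particular pages is an assumption to verify. The class name InputIdFinder is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

final class InputIdFinder {

    /** Returns the id of every INPUT element that has one, in document order. */
    static List<String> inputIds(String html) {
        List<String> ids = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // INPUT has no end tag, so the parser reports it as a "simple" tag.
                if (tag == HTML.Tag.INPUT) {
                    Object id = attrs.getAttribute(HTML.Attribute.ID);
                    if (id != null) {
                        ids.add(id.toString());
                    }
                }
            }
        };
        try {
            // The last argument tells the parser to ignore any charset declaration in the page.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return ids;
    }

    public static void main(String[] args) {
        // Unclosed <p>, unquoted attributes, missing </body> and </html>.
        String html = "<html><body><form><p>Name <input id=\"user\" type=text><br>"
            + "<input type=\"password\" id=pass><input type=\"submit\"></form>";
        System.out.println(inputIds(html));
    }
}
```

To read from disk as in the question, the same callback can be handed a java.io.FileReader instead of the StringReader.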
