We are using the awesome jsoup library to parse HTML documents in Java.
Now, the sources of these documents differ (they come from different clients), so the HTML elements and text differ per source. To handle this, we have written a separate HTML parser per source that deals with the elements, element text, element attributes, etc. of that document. Some of the parsed text needs to be replaced as well.
This works, but it is not extensible. We have to write a new HTML parser for each new document source, or add to and change an existing one whenever elements are added to or removed from a supported HTML document.
E.g. today the parser for documents from company ExampleCompany expects us to parse their HTML and extract the following two values, an element attribute and an element's text:
Document doc = Jsoup.parse(htmlAsString);
String dataExampleCount = doc.select("div[id=top-share-bar]").attr("data-example_count");
String cbDateText = doc.select("div[class=cbdate]").text();
Tomorrow, ExampleCompany adds a new element to their HTML (it may be in the JavaScript, the CSS, or the body), like a[class=loc mr10], and expects us to use that element's text as well. So we have to go and add another line of code:
String locMr10Text = doc.select("a[class=loc mr10]").text();
Is there a way to decouple the rules or XPath expressions used to find the elements and their text into some external file, be it XML, JSON, or XSL, where I can just define which elements to look for, which elements' attributes or text to extract, and so on?
So, from the above example, if I externalize the rules in JSON:
{
  "Attrs": {
    "div[id=top-share-bar]": "data-example_count"
  },
  "Text": [
    "div[class=cbdate]",
    "a[class=loc mr10]"
  ]
}
We could then just keep updating the rules JSON instead of adding Java code: the parser reads the JSON and parses the HTML accordingly.
This will give us:
- Only one HTML parser, which just takes the rules and the HTML document and produces the output.
- No need to recompile the code if the HTML document's elements change; just change the rules file to accommodate the change.
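For illustration, a minimal sketch of such a generic parser, assuming the JSON rules format above and using Jackson to read the rules file (the class and file names here are hypothetical):

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.io.File;
import java.util.HashMap;
import java.util.Map;

public class RuleDrivenParser {
    // Applies every rule in the JSON file to the document and returns
    // a map from selector to extracted value.
    public static Map<String, String> parse(String htmlAsString, File rulesJson) throws Exception {
        Document doc = Jsoup.parse(htmlAsString);
        JsonNode rules = new ObjectMapper().readTree(rulesJson);
        Map<String, String> result = new HashMap<>();

        // "Attrs": maps a selector to the attribute to extract from it
        rules.path("Attrs").fields().forEachRemaining(e ->
                result.put(e.getKey(), doc.select(e.getKey()).attr(e.getValue().asText())));

        // "Text": selectors whose element text we extract
        rules.path("Text").forEach(sel ->
                result.put(sel.asText(), doc.select(sel.asText()).text()));

        return result;
    }
}

With something like this, supporting the new a[class=loc mr10] element tomorrow would mean adding one entry to the "Text" array in the rules file and nothing else.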
I am thinking of writing our own format to externalize the XPath expressions etc., but wanted to know whether there is something standard that is used for a requirement like ours.
I have read a question related to what I am asking, File format for storing html parser rules, but I am not sure the answer gives any direction on the best way of decoupling what to parse from how to parse it.
Any suggestions will be helpful.
I have a Word template, complete with fonts, colors, etc. I am querying a database and retrieving information into a POJO. I want to extract the relevant info from said POJO and create a Word document as per my template's directives.
The doc will have tables and graphs, so I need to use Content Control Data Binding. As I understand it, I'll have to do the following to achieve this:
1. Modify the Word template to add content controls
2. Transform the POJO into an XML object (template?)
3. Use ContentControlMergeXML to bind the XML data to the Word template
Unfortunately, I can't find a good step-by-step example of this anywhere. Nearly all of the links in the docx4j forum lead to broken GitHub pages.
My questions:
1. How can I use OpenDoPE to add tags to my Word template? I'll need to preserve style, so I want the correct OpenDoPE version.
2. Should the POJO be converted into an XML object or document?
3. Is there an end-to-end example of this entire process that I can follow along with (preferably with source code)?
Content control data binding essentially injects an XPath value into a content control in the Word document.
That XPath is evaluated against an XML document, so yes, you need to convert your POJO into XML.
Authoring
Now, there are 3 different OpenDoPE Word AddIns which you can use to add content controls to your Word document. See the links at https://opendope.org/implementations.html
The most recent one assumes a fixed XML format. So to use that, you'd need to transform your POJO to match that format (i.e. use the AddIn to author your docx, then inspect the resulting XML embedded in the docx, then figure out how to transform your POJO to that).
The older AddIns support arbitrary XML, but are cruder. To use one of these, first convert your POJO to XML (e.g. using JAXB), then feed the AddIn your sample XML.
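For illustration, a minimal JAXB marshalling sketch; the Report class and its fields are hypothetical stand-ins for your POJO:

import javax.xml.bind.JAXBContext;
import javax.xml.bind.Marshaller;
import javax.xml.bind.annotation.XmlRootElement;
import java.io.StringWriter;

@XmlRootElement
class Report {               // hypothetical POJO populated from the database
    public String title;
    public int total;
}

public class PojoToXml {
    public static void main(String[] args) throws Exception {
        Report report = new Report();
        report.title = "Q1 figures";
        report.total = 42;

        // Marshal the POJO to XML; this is the sample XML you would feed
        // the AddIn, and later the custom XML part the controls bind to.
        Marshaller m = JAXBContext.newInstance(Report.class).createMarshaller();
        m.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, true);
        StringWriter out = new StringWriter();
        m.marshal(report, out);
        System.out.println(out); // e.g. <report><title>Q1 figures</title><total>42</total></report>
    }
}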
Runtime
To bind your XML to a docx "template" to create an instance docx, see https://github.com/plutext/docx4j/blob/master/docx4j-samples-docx4j/src/main/java/org/docx4j/samples/ContentControlBindingExtensions.java
You can run that sample code against the sample docx + data; take a look at the docx to see what the content controls look like (they bind a custom XML part inside the docx, so unzip it to see that).
P.S. The GitHub links broke as a result of a recent code re-org; GitHub isn't smart enough to dynamically maintain them :-( See https://www.docx4java.org/downloads.html for downloadable sample code.
For fun, I'm writing a basic parser that finds data within an HTML document. I want to find the best structure to represent the branches of the parsed file.
The criterion for "best structure" is this: I want to easily find a tag by its relative location and access its contents, like "the second image tag after the third h3 tag in the body" or "the title tag in the header".
I expect to search the first level of tags for the tag I'm looking for, then move into the branch associated with that tag. That's the structure this question is looking for, but if there is a better way to find relative locations in an HTML document, please explain.
So that's the question. More generally, what kind of Java structures are available through the API that can represent tree data structures?
Don't reinvent the wheel; just use an HTML parser like Jsoup. You will be able to get your tags with a CSS selector using the method Element#select(cssQuery):
Document doc = Jsoup.parse(file, encoding);   // e.g. new File("page.html"), "UTF-8"
Elements elements = doc.select(cssQuery);     // any CSS selector, e.g. "head > title"
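For relative locations like the ones described in the question, you can combine a selector with sibling traversal. A short sketch (the sample markup is made up):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class RelativeLookup {
    public static void main(String[] args) {
        String html = "<body><h3>a</h3><h3>b</h3><h3>c</h3>"
                + "<img src='1.png'><img src='2.png'></body>";
        Document doc = Jsoup.parse(html);

        // "the second image tag after the third h3 tag in the body":
        // select the third h3, then walk its following siblings counting <img>s.
        Element h3 = doc.select("body h3").get(2); // third h3 (0-based index)
        int seen = 0;
        for (Element sib = h3.nextElementSibling(); sib != null; sib = sib.nextElementSibling()) {
            if (sib.tagName().equals("img") && ++seen == 2) {
                System.out.println(sib.attr("src")); // prints 2.png
            }
        }
    }
}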
What's the easiest way in Java to retrieve all elements with a certain type in a malformed HTML page? So I want to do something like this:
public static void main(String[] args) {
// Read in an HTML file from disk
// Retrieve all INPUT elements regardless of whether the HTML is well-formed
// Loop through all elements and retrieve their ids if they exist for the element
}
HtmlCleaner is arguably one of the best HTML parsers out there when it comes to dealing with (somewhat) malformed HTML.
Documentation is here with some code samples; you're basically looking for the getElementsByName() method.
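For instance, a short sketch of what that might look like (the file name is made up):

import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import java.io.File;

public class InputIds {
    public static void main(String[] args) throws Exception {
        TagNode root = new HtmlCleaner().clean(new File("page.html"));
        // true = search the whole tree recursively
        for (TagNode input : root.getElementsByName("input", true)) {
            String id = input.getAttributeByName("id");
            if (id != null) System.out.println(id);
        }
    }
}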
Take a look at Comparison of Java HTML parsers if you're considering other libraries.
I've had success using TagSoup. Here's a short description from their home page:
This is the home page of TagSoup, a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.
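As a sketch, collecting the ids of all input elements through TagSoup's SAX interface might look like this (the file name is illustrative):

import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import java.io.FileReader;

public class TagSoupInputIds {
    public static void main(String[] args) throws Exception {
        XMLReader reader = new Parser(); // TagSoup's SAX-compliant parser
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("input".equalsIgnoreCase(localName)) {
                    String id = atts.getValue("id");
                    if (id != null) System.out.println(id);
                }
            }
        });
        reader.parse(new InputSource(new FileReader("page.html")));
    }
}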
Check out JTidy.
JTidy is a Java port of HTML Tidy, a HTML syntax checker and pretty printer. Like its non-Java cousin, JTidy can be used as a tool for cleaning up malformed and faulty HTML. In addition, JTidy provides a DOM interface to the document that is being processed, which effectively makes you able to use JTidy as a DOM parser for real-world HTML.
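A small sketch of that DOM usage, e.g. pulling the ids of all input elements (the file name is illustrative):

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;
import java.io.FileInputStream;

public class TidyDom {
    public static void main(String[] args) throws Exception {
        Tidy tidy = new Tidy();
        tidy.setQuiet(true); // suppress warnings about the messy input
        // parseDOM cleans the HTML and hands back a standard W3C Document
        Document doc = tidy.parseDOM(new FileInputStream("page.html"), null);
        NodeList inputs = doc.getElementsByTagName("input");
        for (int i = 0; i < inputs.getLength(); i++) {
            org.w3c.dom.Element input = (org.w3c.dom.Element) inputs.item(i);
            System.out.println(input.getAttribute("id"));
        }
    }
}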
I want to parse a document that is not pure XML. For example:
my name is <j> <b> mike</b> </j>
Example 2:
my name is <mytag1 attribute="val" >mike</mytag1> and yours is <mytag2> john</mytag2>
So my input is not pure XML. It's similar to HTML, but the tags are not HTML tags.
How can I parse it in Java?
Your examples are valid XML, except for the lack of a document element. If you know this to always be the case, then you could just wrap a set of dummy tags around the whole thing and use a standard parser (SAX, DOM, ...).
On the other hand if you get something uglier (e.g. tags don't match up, or are spaced out in an overlapping fashion), you'll have to do something custom which will involve a number of rules that you have to decide on that will be unique to your application. (e.g. How do I handle an opening tag that has no close? What do I do if the closing tag is outside the parent?)
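A minimal sketch of the wrapping approach with the standard DOM parser (the root tag name is arbitrary):

import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import java.io.ByteArrayInputStream;

public class WrapAndParse {
    public static void main(String[] args) throws Exception {
        String input = "my name is <j> <b> mike</b> </j>";
        String wrapped = "<root>" + input + "</root>"; // dummy document element
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(wrapped.getBytes("UTF-8")));
        // Standard DOM navigation now works on the mixed content
        System.out.println(doc.getElementsByTagName("b").item(0).getTextContent()); // " mike"
    }
}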
There are a few parsers that take non-well-formed HTML and turn it into well-formed XML; here is a comparison with examples that includes the most popular ones (except maybe HTMLParser). That is probably what you need.
I do a lot of HTML parsing in my line of work. Up until now, I was using the HtmlUnit headless browser for parsing and browser automation.
Now, I want to separate both the tasks.
I want to use a lightweight HTML parser, because HtmlUnit takes a lot of time to first load a page, then get the source, and then parse it.
I want to know which HTML parser can parse HTML efficiently. I need:
- Speed
- Ease of locating any HtmlElement by its "id", "name", or tag type

It is fine if it doesn't clean up dirty HTML code; I don't need to clean any HTML source. I just need an easy way to move across HtmlElements and harvest data from them.
Self plug: I have just released a new Java HTML parser: jsoup. I mention it here because I think it will do what you are after.
Its party trick is a CSS selector syntax to find elements, e.g.:
String html = "<html><head><title>First parse</title></head>"
        + "<body><p>Parsed HTML into a <a href='/doc'>doc</a>.</p></body></html>";
Document doc = Jsoup.parse(html);
Elements links = doc.select("a");           // all <a> elements
Element head = doc.select("head").first();  // the first <head> element
See the Selector javadoc for more info.
This is a new project, so any ideas for improvement are very welcome!
The best I've seen so far is HtmlCleaner:
HtmlCleaner is open-source HTML parser written in Java. HTML found on Web is usually dirty, ill-formed and unsuitable for further processing. For any serious consumption of such documents, it is necessary to first clean up the mess and bring the order to tags, attributes and ordinary text. For the given HTML document, HtmlCleaner reorders individual elements and produces well-formed XML. By default, it follows similar rules that the most of web browsers use in order to create Document Object Model. However, user may provide custom tag and rule set for tag filtering and balancing.
With HtmlCleaner you can locate any element using XPath.
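For example, a brief sketch using TagNode's evaluateXPath (the file name and XPath are illustrative):

import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import java.io.File;

public class XPathLookup {
    public static void main(String[] args) throws Exception {
        TagNode root = new HtmlCleaner().clean(new File("page.html"));
        // evaluateXPath returns the matching nodes as an Object[]
        Object[] dates = root.evaluateXPath("//div[@class='cbdate']");
        if (dates.length > 0) {
            System.out.println(((TagNode) dates[0]).getText());
        }
    }
}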
For other html parsers see this SO question.
I suggest the Validator.nu parser, based on the HTML5 parsing algorithm. It has been the parser used in Mozilla since 2010-05-03.