Possible to parse an HTML document and build a DOM tree (Java) - java

Is it possible, and what tools could be used, to parse an HTML document from a string or a file and then construct a DOM tree that a developer can walk through some API?
For example:
DomRoot = parse("myhtml.html");
for (tags : DomRoot) {
}
Note: this is an HTML document, not XHTML.

You can use TagSoup. It is a SAX-compliant parser that can clean malformed content, such as HTML from generic web pages, into well-formed XML.
For example:
This is <B>bold, <I>bold italic, </b>italic, </i>normal text
gets correctly rewritten as:
This is <b>bold, <i>bold italic, </i></b><i>italic, </i>normal text.

JTidy should let you do what you want.
Usage is fairly straightforward, and parsing is configurable, e.g.:
InputStream in = ...;
Tidy tidy = new Tidy();
// configure Tidy instance as required
...
...
Document doc = tidy.parseDOM(in, null);
Element root = doc.getDocumentElement();
The JavaDoc is hosted here.
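Once you have a Document, the standard org.w3c.dom API is enough to walk the tree. Below is a minimal sketch, not a definitive implementation. It assumes the Document comes from a parser like the ones mentioned here. So that it runs with the JDK alone, the stand-in helper parseWellFormed (a hypothetical name, not part of JTidy) builds the Document from markup that is already well-formed, using the JDK's DocumentBuilder. The class and method names are illustrative only.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DomWalker {

    // Recursively visits the node and its descendants, appending the name of
    // every element, indented two spaces per nesting level, to the buffer.
    static void walk(Node node, int depth, StringBuilder out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            for (int i = 0; i < depth; i++) {
                out.append("  ");
            }
            out.append(node.getNodeName()).append('\n');
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), depth + 1, out);
        }
    }

    // Assumption: stand-in for an HTML parser such as tidy.parseDOM(in, null).
    // It only accepts well-formed markup, unlike a real HTML parser.
    static Document parseWellFormed(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse markup", e);
        }
    }

    // Returns an indented outline of the element tree.
    static String outline(String markup) {
        Document doc = parseWellFormed(markup);
        StringBuilder out = new StringBuilder();
        walk(doc.getDocumentElement(), 0, out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(outline(
                "<html><body><p>Hello <b>world</b></p></body></html>"));
    }
}
```

With a real HTML parser you would pass its Document to walk(...) directly instead of calling parseWellFormed.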

You can take a look at NekoHTML, a Java library that performs best-effort cleaning and tag balancing on your document. It is an easy way to parse a malformed HTML (or invalid XML) file.
It is distributed under the Apache 2.0 license.

HTML Parser seems to support conversion from HTML to XML. Then you can build a DOM tree using the usual Java toolchain.

There are several open source tools to parse HTML from Java.
Check http://java-source.net/open-source/html-parsers
You can also check the answers to this question, which is almost the same: Reading HTML file to DOM tree using Java

Related

JSoup always timing out

I'm trying to use JSoup to parse an HTML file that I generate through a Servlet. From what I have read, I need to declare a Document. When I run the code
Document doc= Jsoup.parse(URL, 10000);
it always times out. If I increase the timeout, it runs until it reaches that time. When I put in Integer.MAX_VALUE, it simply runs forever. I am working in Google Chrome on a MacBook Pro.
My questions are:
Is this just my computer, or am I doing something wrong?
Is there a way to fix this, or a way to parse the HTML page in an entirely different way?
Alternative Solutions
As explained in the Jsoup documentation, if you have an accessible URL then you can get its content this way:
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
If you have HTML in a string this is how you should parse it:
Document doc = Jsoup.parse(htmlString);
If you have HTML in a local file then:
Document doc = Jsoup.parse(new File("FilePath"), "UTF-8", "http://example.com/");
Your Solution
The way you are using the Jsoup parser is correct, so the problem is probably with the link. If you can provide details about it, we can figure out what is going wrong.
Make sure the HTML generated by your Servlet is actually accessible. If it is, the link you pass should be a URL to that Servlet.

how to place HTML text into OpenOffice document using OpenOffice API

Let's look at this example:
I've got HTML tagged text:
<font size="100">Example text</font>
I have an *.odt (OpenDocument Text) document where I want to place this HTML text, with formatting that depends on the HTML tags (in this example the font tag should be omitted and the text "Example text" should have a 100-point font size in the resulting *.odt file).
I would prefer (but this is not a strong requirement) to use the OpenOffice UNO API for Java to achieve that. Is there any way to inject this HTML text into the body of an *.odt document with a built-in HTML-to-ODT converter in the UNO API, or something similar? Or do I have to go through the HTML tags manually and then use the UNO API to place text with specific formatting, e.g. font size?
OK, this is what I did to achieve this (using the OpenOffice UNO API with Java):
Load the odt document where you want to place the HTML text.
Go to the place where you want to insert the HTML text.
Save the HTML text to a temp file on the system (it may be possible to skip the file and pass an HTTP URL instead, but I did not test that).
Insert the HTML into the odt by following these instructions and passing the URL of the temp HTML file (remember to convert the system path to an OO path).
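The temp-file step can be sketched with the JDK alone. This is only an illustration under assumptions: the class and method names are hypothetical, and it assumes that the "OO path" is a file URL of the form file:///..., which is what Path.toUri() produces. The UNO calls that load the document and insert the file are not shown.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class HtmlTempFile {

    // Writes the HTML fragment to a temporary file and returns the file's
    // location as a file URL string (assumed here to be the form the
    // office API expects instead of a plain system path).
    static String writeToTempFileUrl(String html) {
        try {
            Path tmp = Files.createTempFile("fragment", ".html");
            tmp.toFile().deleteOnExit(); // clean up when the JVM exits
            Files.write(tmp, html.getBytes(StandardCharsets.UTF_8));
            return tmp.toUri().toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String url = writeToTempFileUrl("<font size=\"100\">Example text</font>");
        System.out.println(url);
    }
}
```

The returned string is what you would pass as the URL in the insert step.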
Maybe you can use JODConverter, or the XSLT from xhtml2odt.

Java Html parser to extract specific data?

I have an HTML file like the following:
...
<span itemprop="A">234</span>
...
<span itemprop="B">690</span>
...
From this I want to extract the values of A and B.
Can you suggest an HTML parser library for Java that can do this easily?
Personally, I favour JSoup over JTidy. It has CSS-like selectors, and the documentation is much better, imho. With JSoup, you can easily extract those values with the following lines:
Document doc = Jsoup.connect("your_url").get();
Elements spans = doc.select("span[itemprop]");
for (Element span : spans) {
    System.out.println(span.text()); // will print 234 and 690
}
http://jsoup.org/
JSoup is the way to go.
JTidy is a confusingly named yet respected HTML parser.

How to use Freemarker to convert a XML Word document to a DOC?

I'm trying to use Freemarker to convert an XML Word document to a standard DOC. For example:
I generate a Word document (A.doc) and then save it as XML Word document (A.xml).
In Freemarker, I import A.xml and export it as a Word 2003 document (B.doc).
In POI, I import the converted DOC (B.doc). (POI can't read XML docs.)
The problem is that the converted document isn't really a DOC, it's an XML doc, so POI fails to open it.
How can I use Freemarker to generate a real DOC, not an XML Word document?
I'm using Linux.
Your approach probably won't work because FreeMarker is designed for generating text output. Classic Word DOC files are not very "textual", so I think FreeMarker is not the right tool for your task.
(Side note: but RTF might work)

Library to query HTML with XPath in Java?

Can anyone recommend a Java library that allows XPath queries over URLs?
I've tried JAXP without success.
Thank you.
There are several different approaches to this documented on the Web:
Using HtmlCleaner
HtmlCleaner / Java DOM parser - Using XPath Contains against HTML in Java (This is the way I recommend)
HtmlCleaner itself has a built in utility supporting XPath - See the javadocs http://htmlcleaner.sourceforge.net/doc/org/htmlcleaner/XPather.html or this example http://thinkandroid.wordpress.com/2010/01/05/using-xpath-and-html-cleaner-to-parse-html-xml/
Using Jericho
Jericho and Jaxen
http://sujitpal.blogspot.com/2009/04/xpath-over-html-using-jericho-and-jaxen.html
I have tried a few different variations of these approaches, i.e. HtmlParser plus the Java DOM parser, and JSoup plus Jaxen, but the combination that worked best is HtmlCleaner plus the Java DOM parser. The next best combination was Jericho plus Jaxen.
jsoup, a Java HTML parser, has a selector syntax very similar to jQuery's.
You could use TagSoup together with Saxon. That way you simply replace any XML SAX parser used with TagSoup and the XPath 2.0 or XSLT 2.0 or XQuery 1.0 implementation works as usual.
Use Xsoup. According to the docs, it's faster than HtmlCleaner. Example
@Test
public void testSelect() {
    String html = "<html><div><a href='https://github.com'>github.com</a></div>" +
            "<table><tr><td>a</td><td>b</td></tr></table></html>";
    Document document = Jsoup.parse(html);
    String result = Xsoup.compile("//a/@href").evaluate(document).get();
    Assert.assertEquals("https://github.com", result);
    List<String> list = Xsoup.compile("//tr/td/text()").evaluate(document).list();
    Assert.assertEquals("a", list.get(0));
    Assert.assertEquals("b", list.get(1));
}
Link to Xsoup - https://github.com/code4craft/xsoup
I've used JTidy to make HTML into a proper DOM, then used plain XPath to query the DOM.
If you want to do cross-document/cross-URL queries, better use JTidy with XQuery.
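The "plain XPath" part can be sketched with the javax.xml.xpath API that ships with the JDK. This is a minimal sketch under assumptions, not the exact code I used: parseWellFormed is a hypothetical stand-in for the HTML parser (it only accepts well-formed markup), and the class and method names are illustrative. With a real parser, element-name case and namespaces depend on its settings, so the expressions may need adjusting.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DomXPath {

    // Evaluates the XPath expression against the DOM and returns the text
    // content of every matching node (elements, attributes or text nodes).
    static List<String> select(Document doc, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    expression, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent());
            }
            return values;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + expression, e);
        }
    }

    // Assumption: stand-in for the HTML-to-DOM step (e.g. tidy.parseDOM).
    static Document parseWellFormed(String markup) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse markup", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseWellFormed(
                "<html><div><a href='https://github.com'>github.com</a></div>"
                + "<table><tr><td>a</td><td>b</td></tr></table></html>");
        System.out.println(select(doc, "//a/@href"));
        System.out.println(select(doc, "//tr/td/text()"));
    }
}
```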
