Parse HTML document with NekoHTML - Java

I am using the NekoHTML framework with Xerces 2.11.0 to parse an HTML document.
But I am having a problem with this simple code:
import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

DOMParser parser = new DOMParser();
System.out.println(parser.getClass().toString());
InputSource url = new InputSource("http://www.cbgarden.org");
try {
    parser.parse(url);
    Document document = parser.getDocument();
    System.out.println(document.hasChildNodes());
    System.out.println(document.getBaseURI());
    System.out.println(document.getNodeName());
    System.out.println(document.getNodeValue());
} catch (Exception e) {
    e.printStackTrace();
}
Here is the output of those print statements:
class org.cyberneko.html.parsers.DOMParser
true
http://www.cbgarden.org
document
null
So my question is: what could be wrong?
No exception is thrown, and I am following the usage rules defined in the NekoHTML documentation. My build path libraries are in this order:
nekohtml.jar
nekohtmlSamples.jar
xercesImpl.jar
xercesSamples.jar
xml-apis.jar

I guess your question is about the null?
The document node has no value. It only has child nodes (like <html>, which contains <head> and <body>).
But if you want the whole page source as a String, you can simply download it using a URL and its openStream() method.
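For example, a minimal sketch of that approach using only the JDK (the UTF-8 charset is an assumption; a robust version would honor the response's Content-Type header):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Download the raw page source as a String via URL.openStream().
URL pageUrl = new URL("http://www.cbgarden.org");
StringBuilder sb = new StringBuilder();
try (BufferedReader in = new BufferedReader(
        new InputStreamReader(pageUrl.openStream(), StandardCharsets.UTF_8))) {
    String line;
    while ((line = in.readLine()) != null) {
        sb.append(line).append('\n');
    }
}
System.out.println(sb);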


How to parse random HTML pages with Apache Tika (and XPath) [duplicate]

I'm new to Tika and struggling to understand it.
What I want to achieve is extracting the href values of the links on an HTML page (which can be any webpage).
As a first attempt, I tried to extract the links as such (or even just the first one) using XPath, but I can never get it right and the handler is always empty.
(In this example, I removed the xhtml: namespace bits because otherwise I got a SAX error.)
The code example is below. Thanks so much for any help :)
XPathParser xhtmlParser = new XPathParser("xhtml", XHTMLContentHandler.XHTML);
org.apache.tika.sax.xpath.Matcher anchorLinkContentMatcher = xhtmlParser.parse("//body//a");
ContentHandler handler = new MatchingContentHandler(
        new ToXMLContentHandler(), anchorLinkContentMatcher);
HtmlParser parser = new HtmlParser();
Metadata metadata = new Metadata();
ParseContext pcontext = new ParseContext();
try {
    parser.parse(urlContentStream, handler, metadata, pcontext);
    System.out.println(handler);
} catch (Exception e) {
    // ...
}
I found an answer, at least enough to get something working: it is not the final version, but I did get content from the handler.
The answer is at java tika how to convert html to plain text retaining specific element.
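For reference, a sketch of the pattern described there. The main differences from the code above are the explicit xhtml: prefixes (Tika exposes parsed HTML as XHTML) and the descendant::node() step, which makes the matcher emit the matched content. The exact XPath is an assumption based on that answer, and urlContentStream is the question's input stream:

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.ToXMLContentHandler;
import org.apache.tika.sax.XHTMLContentHandler;
import org.apache.tika.sax.xpath.Matcher;
import org.apache.tika.sax.xpath.MatchingContentHandler;
import org.apache.tika.sax.xpath.XPathParser;
import org.xml.sax.ContentHandler;

XPathParser xhtmlParser = new XPathParser("xhtml", XHTMLContentHandler.XHTML);
// Match the anchors' content rather than the bare element nodes.
Matcher anchorMatcher = xhtmlParser.parse("/xhtml:html/xhtml:body//xhtml:a/descendant::node()");
ContentHandler handler = new MatchingContentHandler(new ToXMLContentHandler(), anchorMatcher);
new HtmlParser().parse(urlContentStream, handler, new Metadata(), new ParseContext());
System.out.println(handler.toString());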

How to save a Jsoup Document to an HTML file?

I have used this method to retrieve a webpage into an org.jsoup.nodes.Document object:
myDoc = Jsoup.connect(myURL).ignoreContentType(true).get();
How should I write this object to an HTML file?
The methods myDoc.html(), myDoc.text() and myDoc.toString() don't output all elements of the document.
Some information in a javascript element can be lost when parsing it, for example the "timestamp" in the source of an Instagram media page.
Use doc.outerHtml().
import java.io.File;
import java.nio.charset.StandardCharsets;
import org.apache.commons.io.FileUtils;
import org.jsoup.Connection.Response;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public void downloadPage() throws Exception {
    final Response response = Jsoup.connect("http://www.example.net").execute();
    final Document doc = response.parse();
    final File f = new File("filename.html");
    FileUtils.writeStringToFile(f, doc.outerHtml(), StandardCharsets.UTF_8);
}
Don't forget to catch exceptions. Add the dependency for (or download) the Apache commons-io library for an easy and quick way to save files in UTF-8.
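If you use Maven, the commons-io coordinates look like this (the version below is just one known release; pick whatever is current):

<dependency>
    <groupId>commons-io</groupId>
    <artifactId>commons-io</artifactId>
    <version>2.11.0</version>
</dependency>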
The fact that some elements are ignored must be due to Jsoup's attempt at normalization.
To get the server's exact output without any form of normalization, use this:
Connection.Response html = Jsoup.connect("PUT_URL_HERE").execute();
System.out.println(html.body());

Dom4j parsing - How to declare HTML entities programmatically? "The entity "nbsp" was referenced, but not declared."

I'm using Dom4j to parse HTML documents.
Dom4j expects XML, so HTML entities are not declared.
It's possible to declare them in the document's DTD, but I am parsing external input, so that's not appropriate. I'd rather declare them programmatically in the parser.
Here's my code:
// Read.
final DocumentFactory df = DOMDocumentFactory.getInstance();
SAXReader reader = new SAXReader();
Document doc, outDoc;
try {
    doc = reader.read(new StringReader(htmlStr));
} catch (Exception ex) {
    throw new RuntimeException("Error parsing the HTML:\n " + ex.toString());
}
I see that SAXReader has reader.setEntityResolver( ??? ), but it seems that's not the solution, as the method to override looks like this:
public InputSource resolveEntity(String publicId, String systemId) throws SAXException, IOException
What I am looking for is something like:
reader.setTrueEntityResolver( new EntityResolver(){
    public InputStream resolve( String name ){ ... }
});
I've found a possible solution at http://evc-cit.info/dom4j/dom4j_groovy.html, where it's suggested to use the XML Commons Catalog.
However, that seems like overkill, as there's no doctype specified anyway, and I only intend to resolve the common HTML 4 entities.
Update: it turned out that without an explicit DOCTYPE declaration this has no effect; the EntityResolver is never called.
Maven dep:
<dependency>
    <groupId>xml-resolver</groupId>
    <artifactId>xml-resolver</artifactId>
    <version>1.2</version>
    <scope>test</scope>
</dependency>
Config in /CatalogManager.properties on the classpath:
# allow location to be relative to this file's directory
relative-catalogs=yes
# A semicolon-delimited list of catalog files.
# In this instance, we have a single catalog file, and it's a relative path name
catalogs=sgml-lib/xml.soc
# no debugging messages, please
verbosity=0
# Use the SYSTEM identifier
prefer=system
Tell the parser to use the catalog resolver when it encounters the DTD:
CatalogResolver cResolver = new CatalogResolver(cMgr);
SAXReader reader = new SAXReader();
reader.setEntityResolver(cResolver);
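If the catalog machinery feels like too much, a lighter-weight workaround is to prepend an internal DTD subset declaring just the entities you need before parsing. This is only a sketch, and it assumes htmlStr (from the question) contains no DOCTYPE or XML declaration of its own:

import java.io.StringReader;
import org.dom4j.Document;
import org.dom4j.io.SAXReader;

// Declare the entities to resolve; extend the list for the other HTML 4 entities.
String doctype = "<!DOCTYPE html [\n"
        + "  <!ENTITY nbsp \"&#160;\">\n"
        + "  <!ENTITY copy \"&#169;\">\n"
        + "]>\n";
SAXReader reader = new SAXReader();
Document doc = reader.read(new StringReader(doctype + htmlStr));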
Well, as you said, DOM4J is not meant to parse HTML. I would rather use something like TagSoup or HtmlCleaner. It's not just entities; HTML is not XML.

Writing to an XML File in Java

I have an XML file with an element as shown:
<Event start="2011.12.12 13:45:00:0000" end="2011.12.12 13:47:00:0000" anon="89"/>
I want to add another attribute, "comment", and write it back to the XML file, giving:
<Event start="2011.12.12 13:45:00:0000" end="2011.12.12 13:47:00:0000" anon="89" comment=""/>
How would I go about doing this?
Thanks, Matt
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setIgnoringElementContentWhitespace(true);
Document document = factory.newDocumentBuilder().parse(xmlFile);
Element eventElement = (Element) document.getElementsByTagName("Event").item(0);
eventElement.setAttribute("comment", "");
FYI: I've used the DOM framework here (org.w3c.dom.*).
Use the setAttribute method to add an attribute:
// Add an attribute
element.setAttribute("newAttrName", "attrValue");
Use the following method to write the DOM document to an XML file:
import java.io.File;
import javax.xml.transform.Result;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

// This method writes a DOM document to a file
public static void writeXmlFile(Document doc, String filename) {
    try {
        // Prepare the DOM document for writing
        Source source = new DOMSource(doc);
        // Prepare the output file
        File file = new File(filename);
        Result result = new StreamResult(file);
        // Write the DOM document to the file
        Transformer xformer = TransformerFactory.newInstance().newTransformer();
        xformer.transform(source, result);
    } catch (TransformerConfigurationException e) {
        e.printStackTrace(); // at minimum, report the error instead of swallowing it
    } catch (TransformerException e) {
        e.printStackTrace();
    }
}
Parse the file, add the attribute, and write it back to disk.
There are plenty of frameworks that can do this. The DOM framework in Java is probably the first thing you should look at.
Using DOM, as suggested in the previous answers, is certainly reasonable for this particular problem, which is relatively simple.
However, I have found that JDOM is generally much easier to use when you want to parse and/or modify XML files. Its basic approach is to load the entire file into an easy-to-use data structure. This works well unless your XML file is very large.
For more info go to http://www.jdom.org/
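For instance, a sketch of the same task with JDOM2; the file name is hypothetical, and it assumes <Event> is a direct child of the root element:

import java.io.File;
import java.io.FileWriter;
import org.jdom2.Document;
import org.jdom2.Element;
import org.jdom2.input.SAXBuilder;
import org.jdom2.output.Format;
import org.jdom2.output.XMLOutputter;

File xmlFile = new File("events.xml"); // hypothetical file name
Document doc = new SAXBuilder().build(xmlFile);
// Find the first <Event> element and add the new attribute.
Element event = doc.getRootElement().getChild("Event");
event.setAttribute("comment", "");
// Write the modified document back to disk.
try (FileWriter out = new FileWriter(xmlFile)) {
    new XMLOutputter(Format.getRawFormat()).output(doc, out);
}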

Parsing an XML file without root in Java

I have an XML file which doesn't have a root node. Other than manually adding a "fake" root element, is there any way I can parse such a file in Java? Thanks.
I suppose you could create a new implementation of InputStream that wraps the one you'll be parsing from. This implementation would return the bytes of the opening root tag before the bytes from the wrapped stream and the bytes of the closing root tag afterwards. That would be fairly simple to do.
I may be faced with this problem too. Legacy code, eh?
Ian.
Edit: You could also look at java.io.SequenceInputStream which allows you to append streams to one another. You would need to put your prefix and suffix in byte arrays and wrap them in ByteArrayInputStreams but it's all fairly straightforward.
Your XML document needs a root element to be considered well-formed. Without one you will not be able to parse it with an XML parser.
One way is to provide your own dummy wrapper without touching the original (not well-formed) 'XML', using an entity declaration.
Syntax:
<!DOCTYPE some_root_elem SYSTEM "/home/ego/some.dtd" [
    <!ENTITY entity-name "Some value to be inserted at the entity">
]>
Example:
<!DOCTYPE dummy [
<!ENTITY data SYSTEM "http://wherever-my-data-is">
]>
<dummy>
&data;
</dummy>
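A sketch of driving that from Java with a plain DOM parser; the file name is hypothetical, and note that recent JDKs may block external entity resolution by default (XXE hardening), so this may need parser configuration:

import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Wrap the root-less file via an external entity reference.
String wrapper = "<!DOCTYPE dummy [\n"
        + "  <!ENTITY data SYSTEM \"file:rootless.xml\">\n" // hypothetical file
        + "]>\n"
        + "<dummy>&data;</dummy>";
Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse(new InputSource(new StringReader(wrapper)));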
You could use another parser like Jsoup. It can parse XML without a root.
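A minimal sketch (the root-less fragment is a made-up example):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

// Jsoup's XML parser tolerates multiple top-level elements.
String rootless = "<a>1</a><b>2</b>"; // hypothetical root-less input
Document doc = Jsoup.parse(rootless, "", Parser.xmlParser());
System.out.println(doc.outerHtml());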
I think even if an API had an option for this, it would only return the first node of the "XML", which looks like a root, and discard the rest.
So the answer is probably to do it yourself. Scanner or StringTokenizer might do the trick.
Maybe some HTML parsers could help; they are usually less strict.
Here's what I did:
There's an old java.io.SequenceInputStream class, which is so old that it takes an Enumeration rather than a List or such.
With it, you can prepend and append the root element tags (<div> and </div> in my case) around your root-less XML stream. (You shouldn't do it by concatenating Strings, for performance and memory reasons.)
public void tryExtractHighestHeader(ParserContext context)
{
    String xhtmlString = context.getBody();
    if (xhtmlString == null || "".equals(xhtmlString))
        return;

    // The XHTML needs to be wrapped, because it has no root element.
    ByteArrayInputStream divStart = new ByteArrayInputStream("<div>".getBytes(StandardCharsets.UTF_8));
    ByteArrayInputStream divEnd = new ByteArrayInputStream("</div>".getBytes(StandardCharsets.UTF_8));
    ByteArrayInputStream is = new ByteArrayInputStream(xhtmlString.getBytes(StandardCharsets.UTF_8));
    // Collections.enumeration gives a typed Enumeration without needing
    // commons-collections' IteratorEnumeration.
    Enumeration<InputStream> streams = Collections.enumeration(Arrays.asList(new InputStream[]{divStart, is, divEnd}));

    try (SequenceInputStream wrapped = new SequenceInputStream(streams)) {
        DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = builderFactory.newDocumentBuilder();
        Document xmlDocument = builder.parse(wrapped);
        // From here you can do whatever you like, but keep in mind the extra <div> element.
        XPath xPath = XPathFactory.newInstance().newXPath();
    }
    catch (Exception e) {
        throw new RuntimeException("Failed parsing XML: " + e.getMessage(), e);
    }
}
