I use HtmlCleaner 2.6.1 and Xpath to parse html page in Android application.
Here html page:
http://www.kino-govno.com/comments/42571-postery-kapitan-fillips-i-poslednij-rubezh
http://www.kino-govno.com/comments/42592-fantasticheskie-idei-i-mesta-ih-obitanija
The first link return document, is all right.The second link here in this place:
document = domSerializer.createDOM(tagNode);
returns nothing.
If you create a simple java project without android. That all works fine.
Here is the Code :
String queries = "//div[starts-with(#class, 'news_text op')]/p";
URL url = new URL(link2);
TagNode tagNode = new HtmlCleaner().clean(url);
CleanerProperties cleanerProperties = new CleanerProperties();
DomSerializer domSerializer = new DomSerializer(cleanerProperties);
document = domSerializer.createDOM(tagNode);
xPath = XPathFactory.newInstance().newXPath();
pageNode = (NodeList)xPath.evaluate(queries,document, XPathConstants.NODESET);
String val = pageNode.item(0).getFirstChild().getNodeValue();
That's because HtmlCleaner wraps the paragraphs of the second HTML page into another <div/>, so it is not a direct child any more. Use the descendent-or-self-axis // instead of the child-axis /:
//div[starts-with(#class, 'news_text op')]//p
Related
I'm trying to get DIV contents from previously fetched HTML document. I'm using Java Swing.
final java.io.Reader stringReader = new StringReader(html);
final HTMLEditorKit htmlKit = new HTMLEditorKit();
final HTMLDocument htmlDoc = (HTMLDocument) htmlKit.createDefaultDocument();
final HTMLEditorKit.Parser parser = new ParserDelegator();
parser.parse(stringReader, htmlDoc.getReader(0), true);
final javax.swing.text.Element el = htmlDoc.getElement("id");
This code should get a DIV with ID of "id" that I have inside html.
But what next? How to get the contents of div? Been searching it all around but only thing I found is how to get attribute value, not the Element contents.
Should I move to jsoup? I would rather use Java native, but so far I'm stuck.
Thanks!
not the Element contents.
Try something like:
int start = el.getStartOffset();
int end = el.getEndOffset();
String text = htmlDoc.getText(start, end - start);
I need to find the easier and the efficient way to convert a JDOM element (with all it's tailoring nodes) to a Document. ownerDocument( ) won't work as this is version JDOM 1.
Moreover, org.jdom.IllegalAddException: The Content already has an existing parent "root" exception occurs when using the following code.
DocumentBuilderFactory dbFac = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFac.newDocumentBuilder();
Document doc = null;
Element elementInfo = getElementFromDB();
doc = new Document(elementInfo);
XMLOutputter xmlOutput = new XMLOutputter();
byte[] byteInfo= xmlOutput.outputString(elementInfo).getBytes("UTF-8");
String stringInfo = new String(byteInfo);
doc = dBuilder.parse(stringInfo);
I think you have to use the following method of the element.
Document doc = <element>.getDocument();
Refer the API documentation It says
Return this parent's owning document or null if the branch containing this parent is currently not attached to a document.
JDOM content can only have one parent at a time, and you have to detatch it from one parent before you can attach it to another. This code:
Document doc = null;
Element elementInfo = getElementFromDB();
doc = new Document(elementInfo);
if that code is failing, it is because the getElementFromDB() method is returning an Element that is part of some other structure. You need to 'detach' it:
Element elementInfo = getElementFromDB();
elementInfo.detach();
Document doc = new Document(elementInfo);
OK, that solves the IllegalAddException
On the other hand, if you just want to get the document node containing the element, JDOM 1.1.3 allows you to do that with getDocument:
Document doc = elementInfo.getDocument();
Note that the doc may be null.
To get the top most element available, try:
Element top = elementInfo;
while (top.getParentElement() != null) {
top = top.getParentElement();
}
In your case, your elementInfo you get from the DB is a child of an element called 'root', something like:
<root>
<elementInfo> ........ </elementInfo>
</root>
That is why you get the message you do, with the word "root" in it:
The Content already has an existing parent "root"
I would like to transform a feed to a Document object.
I tried the following code but it seems it's not working with a real feed (uri = null), but it works with an XML file which is already in my computer.
The transform function :
public static Document obtainDocument(String feedurl) {
Document doc = null;
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
URL url = new URL(feedurl);
doc = builder.parse(url.openStream());
...Exceptions...
return doc;
}
EDIT
I'm pretty sure that the URL is right, I use:
String feedurl = "http://feeds2.feedburner.com/Pressecitron";
I tried to use the following code too:
public static Document obtainDocument(String feedurl) {
Document doc = null;
try {
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
URL url = new URL(feedurl);
URLConnection conn = url.openConnection();
doc = builder.parse(conn.getInputStream());
...
return doc;
}
which seems to not works better
And my first version of parser used a String too, but my mate wants me to use a Document (if the connection doesn't work). It worked with the String if I remember well.
Have you tried all the possible ways of using the parse() method ?
Are you sure the URI / URL is correct ?
From the method that you have, you get the feedURL as a String. You can directly pass it to the parse() method and see if that works.
i wanna do something like
Document dom = new Document();
Element ele = new Element("jsp:include");
dom.setRootElement(ele);
but its throwing error i am using jdom for getting dom(org.jdom.Document, org.jdom.Element)
whats wrong in doing this
Namespace ns = Namespace.getNamespace("jsp", "http://java.sun.com/JSP/Page");
Element element = new Element("include", ns);
Document dom = new Document(element);
I am trying to access a url, get the html from it and use xpaths to get certain values from it. I am getting the html just fine and Jtidy seems to be cleaning it appropriately. However, when I try to get the desired values using xpaths, I get an empty NodeList back. I know my xpath expression is correct; I have tested it in other ways. Whats wrong with this code. Thanks for the help.
String url_string = base_url + countries[c];
URL url = new URL(url_string);
Tidy tidy = new Tidy();
tidy.setShowWarnings(false);
tidy.setXHTML(true);
tidy.setMakeClean(true);
Document doc = tidy.parseDOM(url.openStream(), null);
//tidy.pprint(doc, System.out);
String xpath_string = "id('catlisting')//a";
XPath xpath = XPathFactory.newInstance().newXPath();
XPathExpression expr = xpath.compile(xpath_string);
NodeList nodes = (NodeList)expr.evaluate(doc, XPathConstants.NODESET);
System.out.println("size="+nodes.getLength());
for (int r=0; r<nodes.getLength(); r++) {
System.out.println(nodes.item(r).getNodeValue());
}
Try "//div[#id='catlisting']//a"