Parsing with htmlcleaner

I developed a method that extracts items of a specific class using HtmlCleaner. Now I was wondering:
How can I extract the body and all the elements inside it from an HTML document using HtmlCleaner?
public String htmlParser(String html) {
    HtmlCleaner html_cleaner = new HtmlCleaner();
    TagNode rootNode = html_cleaner.clean(html);
    TagNode[] items = rootNode.getElementsByName("body", true);
    ParseBody(items[0]);
    html = item_found;
    return html;
}
// start from an empty string; a null field would put the literal text "null" in front of the result
String item_found = "";
public void ParseBody(TagNode root) {
    if (root.getAllElements(true).length > 0) {
        for (TagNode node : root.getAllElements(true)) {
            ParseBody(node);
        }
    } else {
        // root.toString() only returns the tag name of the TagNode
        item_found = item_found + root.toString();
        // What I actually wanted here is just the text of all items in the body,
        // but a complete answer would still be beneficial for everyone.
        //if(root.getText().toString() != null || !(root.getText().toString().equals("null"))){
        //item_found = item_found + root.getText().toString();
        //}
    }
}

Related

Merging same elements in JSoup

I have the HTML string like
<b>test</b><b>er</b>
<span class="ab">continue</span><span> without</span>
I want to collapse adjacent tags that are the same and belong together. In the above sample I want to have
<b>tester</b>
since both elements have the same tag name and no further attributes or styles. The span tags should remain as they are, because one of them has a class attribute. I am aware that I can iterate over the tree with Jsoup.
Document doc = Jsoup.parse(input);
for (Element element : doc.select("b")) {
}
But it's not clear to me how to look ahead (I guess with something like nextSibling), and then how to collapse the elements.
Or is there a simple regexp-based merge?
I can specify the attributes myself. A one-size-fits-all solution for every tag is not required.

My approach would be like this. Comments in the code
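import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.select.Elements;
public class StackOverflow60704600 {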
public static void main(final String[] args) throws IOException {
Document doc = Jsoup.parse("<b>test</b><b>er</b><span class=\"ab\">continue</span><span> without</span>");
mergeSiblings(doc, "b");
System.out.println(doc);
}
private static void mergeSiblings(Document doc, String selector) {
Elements elements = doc.select(selector);
for (Element element : elements) {
// get the next sibling
Element nextSibling = element.nextElementSibling();
// merge only if the next sibling has the same tag name and the same set of attributes
if (nextSibling != null && nextSibling.tagName().equals(element.tagName())
&& nextSibling.attributes().equals(element.attributes())) {
// your element has only one child, but let's rewrite all of them if there's more
while (nextSibling.childNodes().size() > 0) {
Node siblingChildNode = nextSibling.childNodes().get(0);
element.appendChild(siblingChildNode);
}
// remove because now it doesn't have any children
nextSibling.remove();
}
}
}
}
output:
<html>
<head></head>
<body>
<b>tester</b>
<span class="ab">continue</span>
<span> without</span>
</body>
</html>
One more note on why I used the loop while (nextSibling.childNodes().size() > 0). It turned out that a for loop or an iterator couldn't be used here, because appendChild adds the child to the target but also removes it from the source element, so the remaining children are shifted. It may not be visible here, but the problem appears when you try to merge: <b>test</b><b>er<a>123</a></b>
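The shifting effect described in this note is not specific to jsoup. As a rough illustration, the sketch below reproduces it with the JDK's own org.w3c.dom API (a swapped-in library, not jsoup), where appendChild also moves a node out of its old parent and the child list is live. The class and method names are made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class MoveChildrenDemo {
    /** Index loop over the live child list: every move shifts the rest, so children get skipped. */
    static int moveByIndex(Element from, Element to) {
        NodeList kids = from.getChildNodes(); // live view of from's children
        int moved = 0;
        for (int i = 0; i < kids.getLength(); i++) {
            to.appendChild(kids.item(i)); // also detaches the node from 'from'
            moved++;
        }
        return moved;
    }

    /** Keep taking the first remaining child until the source is empty. */
    static int moveAll(Element from, Element to) {
        int moved = 0;
        while (from.getFirstChild() != null) {
            to.appendChild(from.getFirstChild());
            moved++;
        }
        return moved;
    }

    /** Parses <b>test</b><b>er<a>123</a></b> (inside a dummy root) and returns the two b elements. */
    static Element[] sampleBs() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader("<r><b>test</b><b>er<a>123</a></b></r>")));
            NodeList bs = doc.getElementsByTagName("b");
            return new Element[] { (Element) bs.item(0), (Element) bs.item(1) };
        } catch (Exception e) {
            throw new IllegalStateException("sample XML should always parse", e);
        }
    }

    public static void main(String[] args) {
        Element[] x = sampleBs();
        System.out.println("index loop moved " + moveByIndex(x[1], x[0]) + " of 2 children");
        Element[] y = sampleBs();
        System.out.println("while loop moved " + moveAll(y[1], y[0]) + " -> " + y[0].getTextContent());
    }
}
```

With the index loop the <a> child is left behind in the second element, which is the same failure the while loop in the answer avoids.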

I tried to update the code from @Krystian G, but my edit was rejected :-/ So I am posting it as a separate answer. The code is an excellent starting point, but it fails if a TextNode appears between the tags, e.g.
<span> no class but further</span> (in)valid <span>spanning</span> would result in
<span> no class but furtherspanning</span> (in)valid
Therefore the corrected code looks like this:
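import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.Elements;
public class StackOverflow60704600 {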
public static void main(final String[] args) throws IOException {
String test1="<b>test</b><b>er</b><span class=\"ab\">continue</span><span> without</span>";
String test2="<b>test</b><b>er<a>123</a></b>";
String test3="<span> no class but further</span> <span>spanning</span>";
String test4="<span> no class but further</span> (in)valid <span>spanning</span>";
Document doc = Jsoup.parse(test1);
mergeSiblings(doc, "b");
System.out.println(doc);
}
private static void mergeSiblings(Document doc, String selector) {
Elements elements = doc.select(selector);
for (Element element : elements) {
Node nextElement = element.nextSibling();
// if the next Element is a TextNode but has only space ==> we need to preserve the
// spacing
boolean addSpace = false;
if (nextElement != null && nextElement instanceof TextNode) {
String content = nextElement.toString();
if (!content.isBlank()) {
// the next element has some content
continue;
} else {
addSpace = true;
}
}
// get the next sibling
Element nextSibling = element.nextElementSibling();
// merge only if the next sibling has the same tag name and the same set of
// attributes
if (nextSibling != null && nextSibling.tagName().equals(element.tagName())
&& nextSibling.attributes().equals(element.attributes())) {
// your element has only one child, but let's rewrite all of them if there's more
while (nextSibling.childNodes().size() > 0) {
Node siblingChildNode = nextSibling.childNodes().get(0);
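if (addSpace) {
// since we have had some space previously ==> preserve it and add it
if (siblingChildNode instanceof TextNode) {
TextNode textChild = (TextNode) siblingChildNode;
// getWholeText() returns the unescaped text; toString() would return escaped HTML
textChild.text(" " + textChild.getWholeText());
} else {
element.appendChild(new TextNode(" "));
}
// add the space only once, in front of the first moved child
addSpace = false;
}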
element.appendChild(siblingChildNode);
}
// remove because now it doesn't have any children
nextSibling.remove();
}
}
}
}

Unable to parse element attribute with XOM

I'm attempting to parse an RSS feed using the XOM Java library. Each entry's image URL is stored as an attribute of the <img> element, as seen below.
<rss version="2.0">
<channel>
<item>
<title>Decision Paralysis</title>
<link>https://xkcd.com/1801/</link>
<description>
<img src="https://imgs.xkcd.com/comics/decision_paralysis.png"/>
</description>
<pubDate>Mon, 20 Feb 2017 05:00:00 -0000</pubDate>
<guid>https://xkcd.com/1801/</guid>
</item>
</channel>
</rss>
Attempting to read <img src=""> with .getFirstChildElement("img") only returns null, so my code crashes with a NullPointerException when I try to retrieve the src attribute. Why is my program failing to read in the <img> element, and how can I read it in properly?
import nu.xom.*;
public class RSSParser {
public static void main(String[] args) {
try {
Builder parser = new Builder();
Document doc = parser.build ( "https://xkcd.com/rss.xml" );
Element rootElement = doc.getRootElement();
Element channelElement = rootElement.getFirstChildElement("channel");
Elements itemList = channelElement.getChildElements("item");
// Iterate through itemList
for (int i = 0; i < itemList.size(); i++) {
Element item = itemList.get(i);
Element descElement = item.getFirstChildElement("description");
Element imgElement = descElement.getFirstChildElement("img");
// Crashes with NullPointerException
String imgSrc = imgElement.getAttributeValue("src");
}
}
catch (Exception error) {
error.printStackTrace();
System.exit(1);
}
}
}

There is no img element in the item. Try
if (imgElement != null) {
String imgSrc = imgElement.getAttributeValue("src");
}
What the item contains is this:
<description><img
src="http://imgs.xkcd.com/comics/us_state_names.png"
title="Technically DC isn't a state, but no one is too
pedantic about it because they don't want to disturb the snakes
."
alt="Technically DC isn't a state, but no one is too pedantic about it because they don't want to disturb the snakes." />
</description>
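That's not an img element. It's plain text: the markup is escaped in the feed, so the XML parser delivers it as the character content of <description>.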

I managed to come up with a somewhat hacky solution using regex and pattern matching.
// Iterate through itemList
for (int i = 0; i < itemList.size(); i++) {
Element item = itemList.get(i);
String descString = item.getFirstChildElement("description").getValue();
// Parse image URL (hacky)
String imgSrc = "";
Pattern pattern = Pattern.compile("src=\"[^\"]*\"");
Matcher matcher = pattern.matcher(descString);
if (matcher.find()) {
imgSrc = descString.substring( matcher.start()+5, matcher.end()-1 );
}
}
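As an alternative to matching the attribute with a regex, the description text can be parsed a second time as its own small XML fragment. The following is only a sketch using the JDK's built-in DOM parser. It assumes the description text is well-formed XML, which holds for the self-closing <img ... /> shown above but not for arbitrary HTML. The class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class DescriptionImage {
    /**
     * Parses the description text as a small XML document and returns the src
     * attribute of the first img element. Returns null if there is no img element
     * or if the text is not well-formed XML.
     */
    static String imgSrc(String descriptionText) {
        try {
            // Wrap the text in a dummy root so that text around the tag is still one document.
            String wrapped = "<d>" + descriptionText + "</d>";
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wrapped)));
            Element img = (Element) doc.getElementsByTagName("img").item(0);
            // getAttribute returns an empty string when the attribute is missing
            return img == null ? null : img.getAttribute("src");
        } catch (Exception e) {
            // e.g. unclosed tags or HTML-only entities such as &nbsp;
            return null;
        }
    }

    public static void main(String[] args) {
        String desc = "<img src=\"https://imgs.xkcd.com/comics/decision_paralysis.png\" alt=\"x\" />";
        System.out.println(imgSrc(desc));
    }
}
```

In the loop above, the argument would be the descString obtained from getValue(). If the feed ever contains markup that is not well-formed, the regex or a lenient HTML parser remains the fallback.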

Reading XML tags in java, code optimization

What I am actually doing is a recursive function which reads the tags in the xml. Below is the code:
private void readTag(org.w3c.dom.Node item, String histoTags, String fileName, Hashtable<String, String> tagsInfos) {
    if (item.getNodeType() == Node.ELEMENT_NODE) {
        NodeList itemChilds = item.getChildNodes();
        for (int i = 0; i < itemChilds.getLength(); i++) {
            org.w3c.dom.Node itemChild = itemChilds.item(i);
            readTag(itemChild, histoTags + "|" + item.getNodeName(), fileName, tagsInfos);
        }
    }
    else if (item.getNodeType() == Node.TEXT_NODE) {
        // store the text under the path of tags that leads to it
        tagsInfos.put(histoTags, item.getNodeValue());
    }
}
This function takes some time to execute. The xml the function reads is in this format:
<?xml version="1.0" encoding="UTF-8"?>
<Document>
  <Mouvement>
    <Com>
      <IdCom>32R01000000772669473</IdCom>
      <RefCde>32R</RefCde>
      <Edit>0</Edit>
    </Com>
  </Mouvement>
</Document>
Is there any way of optimizing this code in java?

Two optimizations, don't know how much they will help:
Don't use getChildNodes(). Use getFirstChild() and getNextSibling().
Reuse a single StringBuilder instead of creating a new one for every element (implicitly done by histoTags + "|" + item.getNodeName()).
But you should also be aware that the text content of an element node may be split across multiple TEXT and CDATA nodes.
Your code will also work better if it works on elements, not nodes.
private static void readTag(Element elem, StringBuilder histoTags, String fileName, Hashtable<String, String> tagsInfos) {
int histoLen = histoTags.length();
CharSequence textContent = null;
boolean hasChildElement = false;
for (Node child = elem.getFirstChild(); child != null; child = child.getNextSibling()) {
switch (child.getNodeType()) {
case Node.ELEMENT_NODE:
histoTags.append('|').append(child.getNodeName());
readTag((Element)child, histoTags, fileName, tagsInfos);
histoTags.setLength(histoLen);
hasChildElement = true;
break;
case Node.TEXT_NODE:
case Node.CDATA_SECTION_NODE:
//uncomment to test: System.out.println(histoTags + ": \"" + child.getTextContent() + "\"");
if (textContent == null)
// Optimization: Don't copy to a StringBuilder if only one text node will be found
textContent = child.getTextContent();
else if (textContent instanceof StringBuilder)
// And we keep collecting text from multiple nodes
((StringBuilder)textContent).append(child.getTextContent());
else
// Ok, now we need a StringBuilder to collect text from multiple nodes
textContent = new StringBuilder(textContent).append(child.getTextContent());
break;
default:
// ignore all others
}
}
if (textContent != null) {
String text = textContent.toString();
// Suppress pure whitespace content on elements with child elements, i.e. structural whitespace
if (! hasChildElement || ! text.trim().isEmpty())
tagsInfos.put(histoTags.toString(), text);
}
}
Test
String xml = "<root>\n" +
" <tag>hello <![CDATA[world]]> Foo <!-- comment --> Bar</tag>\n" +
"</root>\n";
Element docElem = DocumentBuilderFactory.newInstance()
.newDocumentBuilder()
.parse(new InputSource(new StringReader(xml)))
.getDocumentElement();
Hashtable<String, String> tagsInfos = new Hashtable<>();
readTag(docElem, new StringBuilder(docElem.getNodeName()), "fileName", tagsInfos);
System.out.println(tagsInfos);
Output (with print uncommented)
root: "
"
root|tag: "hello "
root|tag: "world"
root|tag: " Foo "
root|tag: " Bar"
root: "
"
{root|tag=hello world Foo  Bar}
See how splitting the text inside the <tag> node using CDATA and comments caused the DOM node to contain multiple TEXT/CDATA child nodes.

java android very large xml parsing

I have a very large XML file with categories, which map to subcategories in another XML file by category id. The XML file that contains only category ids and names loads fast, but the XML file that holds the subcategories with image paths, descriptions, latitude/longitude etc. takes a long time to load.
I am using the javax.xml package and the org.w3c.dom package.
The list action loads the file on each click to look for subcategories.
Is there any way to make this whole process faster?
Edit-1
Here's the code I am using to fetch subcategories:
Document doc = this.builder.parse(inStream, null);
doc.getDocumentElement().normalize();
NodeList pageList = doc.getElementsByTagName("page");
final int length = pageList.getLength();
for (int i = 0; i < length; i++)
{
boolean inCategory = false;
Element categories = (Element) getChild(pageList.item(i), "categories");
if(categories != null)
{
NodeList categoryList = categories.getElementsByTagName("category");
for(int j = 0; j < categoryList.getLength(); j++)
{
if(Integer.parseInt(categoryList.item(j).getTextContent()) == catID)
{
inCategory = true;
break;
}
}
}
if(inCategory == true)
{
final NamedNodeMap attr = pageList.item(i).getAttributes();
//
//get Page ID
final int categoryID = Integer.parseInt(getNodeValue(attr, "id"));
//get Page Name
final String categoryName = (getChild(pageList.item(i), "title") != null) ? getChild(pageList.item(i), "title").getTextContent() : "Untitled";
//get ThumbNail
final NamedNodeMap thumb_attr = getChild(pageList.item(i), "thumbnail").getAttributes();
final String categoryImage = "placethumbs/" + getNodeValue(thumb_attr, "file");
//final String categoryImage = "androidicon.png";
Category category = new Category(categoryName, categoryID, categoryImage);
this.list.add(category);
Log.d(tag, category.toString());
}
}

Use a SAX-based parser; DOM is not good for large XML.

Maybe a SAX processor would be quicker (assuming your App is slowing down due to memory requirements of using a DOM-style approach?)
Article on processing XML on android
SOF question about SAX processing on Android
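To make the SAX suggestion concrete, here is a minimal sketch using the JDK's javax.xml.parsers and org.xml.sax classes. The element and attribute names (page, id, title, thumbnail, file, categories, category) are taken from the question's code. The root element name, the class name and the "id:title:thumbnail" result strings are assumptions made for illustration and stand in for the question's Category objects.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class PageSaxFilter extends DefaultHandler {
    private final int wantedCategory;
    private final List<String> matches = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private String id, title, thumb;
    private boolean inCategory;

    PageSaxFilter(int wantedCategory) { this.wantedCategory = wantedCategory; }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        text.setLength(0); // start collecting the text of this element
        if (qName.equals("page")) {
            id = atts.getValue("id");
            title = "Untitled"; // default when a page has no title element
            thumb = null;
            inCategory = false;
        } else if (qName.equals("thumbnail")) {
            thumb = atts.getValue("file");
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // may be called several times per element
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (qName.equals("title")) {
            title = text.toString().trim();
        } else if (qName.equals("category")) {
            if (Integer.parseInt(text.toString().trim()) == wantedCategory) inCategory = true;
        } else if (qName.equals("page") && inCategory) {
            matches.add(id + ":" + title + ":placethumbs/" + thumb);
        }
        text.setLength(0);
    }

    /** Streams the document once and returns the pages that list the given category. */
    static List<String> pagesInCategory(InputStream in, int catId) {
        try {
            PageSaxFilter handler = new PageSaxFilter(catId);
            SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
            return handler.matches;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse pages", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<pages>"
                + "<page id=\"1\"><title>Cafe</title><thumbnail file=\"cafe.png\"/>"
                + "<categories><category>3</category><category>7</category></categories></page>"
                + "<page id=\"2\"><thumbnail file=\"x.png\"/>"
                + "<categories><category>4</category></categories></page>"
                + "</pages>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(pagesInCategory(in, 3));
    }
}
```

The handler never holds more than one page's fields in memory. Independently of the parser choice, parsing once and keeping the results, instead of re-reading the file on every click, would avoid most of the repeated work described in the question.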

Get Xpath from the org.w3c.dom.Node

Can I get the full XPath from an org.w3c.dom.Node?
Say the node currently points somewhere in the middle of the XML document. I would like to extract the XPath for that element.
The output XPath I'm looking for is //parent/child1/child2/child3/node, i.e. a path from the parent down to the node. Just ignore XPaths that contain expressions and point to the same node.

There's no generic method for getting the XPath, mainly because there's no one generic XPath that identifies a particular node in the document. In some schemas, nodes will be uniquely identified by an attribute (id and name are probably the most common attributes.) In others, the name of each element (that is, the tag) is enough to uniquely identify a node. In a few (unlikely, but possible) cases, there's no one unique name or attribute that takes you to a specific node, and so you'd need to use cardinality (get the n'th child of the m'th child of...).
EDIT:
In most cases, it's not hard to create a schema-dependent function to assemble an XPath for a given node. For example, suppose you have a document where every node is uniquely identified by an id attribute, and you're not using namespaces. Then (I think) the following pseudo-Java would work to return an XPath based on those attributes. (Warning: I have not tested this.)
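String getXPath(Node node)
{
    Node parent = node.getParentNode();
    // the root element's parent is the Document node, so stop there
    if (parent == null || parent.getNodeType() == Node.DOCUMENT_NODE) {
        return "/" + node.getNodeName();
    }
    // identify every other element by its id attribute
    return getXPath(parent) + "/" + node.getNodeName() + "[@id='" + ((Element) node).getAttribute("id") + "']";
}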

I am working for the company behind jOOX, a library that provides many useful extensions to the Java standard DOM API, mimicking the jquery API. With jOOX, you can obtain the XPath of any element like this:
String path = $(element).xpath();
The above path will then be something like this
/document[1]/library[2]/books[3]/book[1]

I've taken this code from Mikkel Flindt's post and modified it so that it also works for attribute nodes.
public static String getFullXPath(Node n) {
// abort early
if (null == n)
return null;
// declarations
Node parent = null;
Stack<Node> hierarchy = new Stack<Node>();
StringBuffer buffer = new StringBuffer();
// push element on stack
hierarchy.push(n);
switch (n.getNodeType()) {
case Node.ATTRIBUTE_NODE:
parent = ((Attr) n).getOwnerElement();
break;
case Node.ELEMENT_NODE:
parent = n.getParentNode();
break;
case Node.DOCUMENT_NODE:
parent = n.getParentNode();
break;
default:
throw new IllegalStateException("Unexpected Node type" + n.getNodeType());
}
while (null != parent && parent.getNodeType() != Node.DOCUMENT_NODE) {
// push on stack
hierarchy.push(parent);
// get parent of parent
parent = parent.getParentNode();
}
// construct xpath
Object obj = null;
while (!hierarchy.isEmpty() && null != (obj = hierarchy.pop())) {
Node node = (Node) obj;
boolean handled = false;
if (node.getNodeType() == Node.ELEMENT_NODE) {
Element e = (Element) node;
// is this the root element?
if (buffer.length() == 0) {
// root element - simply append element name
buffer.append(node.getNodeName());
} else {
// child element - append slash and element name
buffer.append("/");
buffer.append(node.getNodeName());
if (node.hasAttributes()) {
// see if the element has a name or id attribute
if (e.hasAttribute("id")) {
// id attribute found - use that
buffer.append("[@id='" + e.getAttribute("id") + "']");
handled = true;
} else if (e.hasAttribute("name")) {
// name attribute found - use that
buffer.append("[@name='" + e.getAttribute("name") + "']");
handled = true;
}
}
if (!handled) {
// no known attribute we could use - get sibling index
int prev_siblings = 1;
Node prev_sibling = node.getPreviousSibling();
while (null != prev_sibling) {
if (prev_sibling.getNodeType() == node.getNodeType()) {
if (prev_sibling.getNodeName().equalsIgnoreCase(
node.getNodeName())) {
prev_siblings++;
}
}
prev_sibling = prev_sibling.getPreviousSibling();
}
buffer.append("[" + prev_siblings + "]");
}
}
} else if (node.getNodeType() == Node.ATTRIBUTE_NODE) {
buffer.append("/@");
buffer.append(node.getNodeName());
}
}
// return buffer
return buffer.toString();
}

For me this one worked best ( using org.w3c.dom elements):
String getXPath(Node node)
{
Node parent = node.getParentNode();
if (parent == null)
{
return "";
}
return getXPath(parent) + "/" + node.getNodeName();
}
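When sibling elements share a name, a plain name path like the one above can match several nodes. The sketch below adds a positional index to each step and then checks the result by evaluating it with the JDK's javax.xml.xpath API. It assumes a document without namespaces and only handles element nodes. The class and method names are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class PositionalXPath {
    /** Builds /a[1]/b[2]/... by counting preceding sibling elements with the same name. */
    static String of(Node node) {
        if (node == null || node.getNodeType() != Node.ELEMENT_NODE) {
            return ""; // stop at the Document node (or anything that is not an element)
        }
        int index = 1; // XPath positions are 1-based
        for (Node s = node.getPreviousSibling(); s != null; s = s.getPreviousSibling()) {
            if (s.getNodeType() == Node.ELEMENT_NODE && s.getNodeName().equals(node.getNodeName())) {
                index++;
            }
        }
        return of(node.getParentNode()) + "/" + node.getNodeName() + "[" + index + "]";
    }

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<root><nodeA/><nodeB/><nodeB><subnodeX/><subnodeX id=\"MyCondition\"/></nodeB></root>");
        Node target = doc.getElementsByTagName("subnodeX").item(1);
        String path = of(target);
        System.out.println(path); // /root[1]/nodeB[2]/subnodeX[2]
        // Round trip: evaluating the generated path should lead back to the target node.
        Node found = (Node) XPathFactory.newInstance().newXPath()
                .evaluate(path, doc, XPathConstants.NODE);
        System.out.println(found != null && found.isSameNode(target));
    }
}
```

A positional path like this identifies exactly one node, but it breaks as soon as siblings are inserted or removed, which is the trade-off against attribute-based predicates.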

Some IDEs specialised in XML will do that for you.
Here are the most well-known ones:
oXygen
Stylus Studio
xmlSpy
For instance in oXygen, you can right-click on an element part of an XML document and the contextual menu will have an option 'Copy Xpath'.
There are also a number of Firefox add-ons (such as XPather) that will happily do the job for you. With XPather, you just click on a part of the web page and select 'show in XPather' in the contextual menu, and you're done.
But, as Dan has pointed out in his answer, the XPath expression will be of limited use. It will not include predicates for instance. Rather it will look like this.
/root/nodeB[2]/subnodeX[2]
For a document like
<root>
<nodeA>stuff</nodeA>
<nodeB>more stuff</nodeB>
<nodeB cond="thisOne">
<subnodeX>useless stuff</subnodeX>
<subnodeX id="MyCondition">THE STUFF YOU WANT</subnodeX>
<subnodeX>more useless stuff</subnodeX>
</nodeB>
</root>
The tools I listed will not generate
/root/nodeB[@cond='thisOne']/subnodeX[@id='MyCondition']
For instance, for an HTML page you'll end up with a pretty useless expression:
/html/body/div[6]/p[3]
And that's to be expected. If they had to generate predicates, how would they know which condition is relevant? There are zillions of possibilities.

Something like this will give you a simple xpath:
public String getXPath(Node node) {
return getXPath(node, "");
}
public String getXPath(Node node, String xpath) {
if (node == null) {
return "";
}
String elementName = "";
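if (node instanceof Element) {
// getLocalName() returns null when the document was built by a parser that is
// not namespace-aware, so fall back to getNodeName() in that case
elementName = node.getLocalName() != null ? node.getLocalName() : node.getNodeName();
}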
Node parent = node.getParentNode();
if (parent == null) {
return xpath;
}
return getXPath(parent, "/" + elementName + xpath);
}
