JSoup get text and inline images in order

JSoup get text and inline images in order - java

I've got some HTML that looks like this:
<tr>
<td>
Some text that is interrupted by an image here:
<img alt="imageName.png" src="linkhere" width="18" height="18">
and then continues here.
</td>
</tr>
and basically I just need a way to loop through the nodes here and add either the text or the image alt to a string with JSoup, maintaining the order of the nodes.
In the end it should look like this:
Some text that is interrupted by an image here: "imageName.png" and then continues here
So far I'm able to get the image by itself or the text by itself by using:
element.text();
//or
element.select("img").attr("alt")
but I'm having trouble getting both of them into an ordered list.
Any ideas?

The following code should give you the output string you are looking for. It basically loops through all the nodes in the document and determines whether or not they are text nodes or elements. If they are text nodes, it will add them to the output string. If they are elements, it will check for an image child and add the alt text to the string.
String test = "";
Element body = doc.getElementsByTag("body").first();
List<Node> childNodes = body.childNodes();
for(Node node : childNodes){
if(node instanceof TextNode){
// These are text nodes, lets see if they are empty or not and add them to the string.
String nodeString = node.toString();
if(nodeString != null && !nodeString.trim().isEmpty()){
test += nodeString;
}
} else if (node instanceof Element) {
// Here is an element, let's see if there is an image.
Element element = (Element)node;
Element image = element.children().select("img").first();
if(image != null)
{
test += image.attr("alt");
}
}
}

Related

Parse HTMl using JSOUP - Need specific pattern

I am trying to get text between tags and save into some variable, for example:
Here I want to save value return which is between em tags. Also I need the rest of the text which is in p tags,
em tag value is assigned with return and
p tag value should return only --> an item, cancel an order, print a receipt, track your purchases or reorder items.
if some value is before em tag, even that value should be in different variable basically one p if it has multiple tags within then it should be split and save into different variables. If I know how can I get rest of text which are not in inner tags I can retrieve the rest.
I have written below: the below is returning just "return" which is in "'em' tags.
Here ep is basically doc.select(p), selecting p tag and then iterating, not sure if I am doing right way, any other approaches are highly appreciated.
String text ="\<p><em>return </em>an item, cancel an order, print a receipt, track your purchases or reorder items.</p>"
Elements italic_tags = ep.select("em");
for(Element em:italic_tags) {
if(em.tagName().equals("em")) {
System.out.println( em.select("em").text());
}
}

If you need to select each sub text and text enclosed by different tags you need to try selecting Node instead of Element. I modified your HTML to include more tags so the example is more complete:
String text = "<p><em>return </em>an item, <em>cancel</em> an order, <em>print</em> a receipt, <em>track</em> your purchases or reorder items.</p>";
Document doc = Jsoup.parse(text);
Element ep = doc.selectFirst("p");
List<Node> childNodes = ep.childNodes();
for (Node node : childNodes) {
if (node instanceof TextNode) {
// if it's a text, just display it
System.out.println(node);
} else {
// if it's another element, then display its first
// child which in this case is a text
System.out.println(node.childNode(0));
}
}
output:
return
an item,
cancel
an order,
print
a receipt,
track
your purchases or reorder items.

Iterating through elements in jsoup and parsing href

I was having trouble getting just the href from a rows of table data. Although I was able to get it working, I am wondering if anyone has an explanation for why my code here works.
for (Element element : result.select("tr")) {
if (element.select("tr.header.left").isEmpty()) {
Elements tds = element.select("td");
//The line below is what I don't understand
String link = tds.get(0).getElementsByAttribute("href").first().attr("href");
String position = tds.get(1).text();
}
}
The line that I was using before, that did not work is below:
String link = tds.get(0).attr("href");
Why does this line return an empty string? I'm assuming it has to do with how I am iterating through the elements as I've selected by "tr". However, I'm not familiar with how Elements vs Element are structured.
Thanks for your help!

Elements is simply an ArrayList<Element>
The reason you're having to write that extra code is because <td> doesn't have an href attribute, so tds.get(0).attr("href"); won't work. You're presumably trying to capture the href from an <a> within the cell. The longer, working code is saying:
For the first cell in the row, get the first element with an #href attribute (i.e. a link), and get
its #href attribute
Try the following example (with example document) to show how to access the child links more clearly:
Element result = Jsoup.parse("<html><body><table><tr><td><a href=\"http://a.com\" /</td><td>Label1</td></tr><tr><td><a href=\"http://b.com\" /></td><td>Label2</td></tr></table></body></html>");
for (Element element : result.select("tr")) {
if (element.select("tr.header.left").isEmpty()) {
Elements tds = element.select("td");
String link = tds.get(0).getElementsByTag("a").attr("href");
String position = tds.get(1).text();
System.out.println(link + ", " + position);
}
}

Modify the text content of XML tag

How can I insert a new tag for each word from the text content of a tag?
If i have a xml like:
<root>
<el> Text content for tag
</el>
</root>
I want the output to be:
<root>
<el> <new>Text</new> <new>content</new> <new>for</new> <new>tag</new>
</el>
</root>
Any idea?

You already asked part of this question before here: Add new node in XML file
Based on that, I will use an example similar on the one you used in that question, which is a bit more complex than this one because the elements didn't contain plain text, but could have mixed content (elements and text).
The XML I am using there is the one you posted before:
<nodes>
<RegDef>This <i>text</i> have i node.</RegDef>
<RegDef>This text doesn't have i atribute.</RegDef>
</nodes>
Refer to the previous question. In that question I call a method which I called wrapWordsInContents() which returns a new element with its words wrapped inside <w> elements. That new element is used to replace the old one. This is that method:
public static Element wrapWordsInContents(Element node, Document document) {
NodeList children = node.getChildNodes();
int size = children.getLength();
Element newElement = document.createElement(node.getTagName());
for(int i = 0; i < size; i++) {
if (children.item(i).getNodeType() == Document.ELEMENT_NODE) {
newElement.appendChild(wrapWordsInContents((Element)(children.item(i)), document));
} else { // text node
String text = children.item(i).getTextContent().trim();
if(text.isEmpty()) {
continue;
}
String[] words = text.split("\\s");
for(String word : words) {
Element w = document.createElement("w");
Node textNode = document.createTextNode(word);
w.appendChild(textNode);
newElement.appendChild(w);
}
}
}
return newElement;
}
Note that it recursively processes any child elements, wrapping any words it finds inside them with the <w> tag. If you want to use <new>, just replace "w" for "new".
If you run the code in the previous question with this method, you will get a new document which will generate a XML that when serialized will produce this output:
<nodes>
<RegDef><w>This</w><i><w>text</w></i><w>have</w><w>i</w><w>node.</w></RegDef>
<RegDef><w>This</w><w>text</w><w>doesn't</w><w>have</w><w>i</w><w>atribute.</w></RegDef>
</nodes>
For the code example you posted in this question, you would use:
NodeList elNodes = document.getElementsByTagName("el");
int size = elNodes.getLength();
for(int i = 0; i < size; i++) {
Element el = (Element)elNodes.item(i);
Element newEl = wrapWordsInContents(el, document);
Element parent = (Element)el.getParentNode(); // this is `<root>`
parent.replaceChild(newEl, el);
}

Normalization in DOM parsing with java - how does it work?

I saw the line below in code for a DOM parser at this tutorial.
doc.getDocumentElement().normalize();
Why do we do this normalization ?
I read the docs but I could not understand a word.
Puts all Text nodes in the full depth of the sub-tree underneath this Node
Okay, then can someone show me (preferably with a picture) what this tree looks like ?
Can anyone explain me why normalization is needed?
What happens if we don't normalize ?

The rest of the sentence is:
where only structure (e.g., elements, comments, processing instructions, CDATA sections, and entity references) separates Text nodes, i.e., there are neither adjacent Text nodes nor empty Text nodes.
This basically means that the following XML element
<foo>hello
wor
ld</foo>
could be represented like this in a denormalized node:
Element foo
Text node: ""
Text node: "Hello "
Text node: "wor"
Text node: "ld"
When normalized, the node will look like this
Element foo
Text node: "Hello world"
And the same goes for attributes: <foo bar="Hello world"/>, comments, etc.

In simple, Normalisation is Reduction of Redundancies.
Examples of Redundancies:
a) white spaces outside of the root/document tags(...<document></document>...)
b) white spaces within start tag (<...>) and end tag (</...>)
c) white spaces between attributes and their values (ie. spaces between key name and =")
d) superfluous namespace declarations
e) line breaks/white spaces in texts of attributes and tags
f) comments etc...

As an extension to #JBNizet's answer for more technical users here's what implementation of org.w3c.dom.Node interface in com.sun.org.apache.xerces.internal.dom.ParentNode looks like, gives you the idea how it actually works.
public void normalize() {
// No need to normalize if already normalized.
if (isNormalized()) {
return;
}
if (needsSyncChildren()) {
synchronizeChildren();
}
ChildNode kid;
for (kid = firstChild; kid != null; kid = kid.nextSibling) {
kid.normalize();
}
isNormalized(true);
}
It traverses all the nodes recursively and calls kid.normalize()
This mechanism is overridden in org.apache.xerces.dom.ElementImpl
public void normalize() {
// No need to normalize if already normalized.
if (isNormalized()) {
return;
}
if (needsSyncChildren()) {
synchronizeChildren();
}
ChildNode kid, next;
for (kid = firstChild; kid != null; kid = next) {
next = kid.nextSibling;
// If kid is a text node, we need to check for one of two
// conditions:
// 1) There is an adjacent text node
// 2) There is no adjacent text node, but kid is
// an empty text node.
if ( kid.getNodeType() == Node.TEXT_NODE )
{
// If an adjacent text node, merge it with kid
if ( next!=null && next.getNodeType() == Node.TEXT_NODE )
{
((Text)kid).appendData(next.getNodeValue());
removeChild( next );
next = kid; // Don't advance; there might be another.
}
else
{
// If kid is empty, remove it
if ( kid.getNodeValue() == null || kid.getNodeValue().length() == 0 ) {
removeChild( kid );
}
}
}
// Otherwise it might be an Element, which is handled recursively
else if (kid.getNodeType() == Node.ELEMENT_NODE) {
kid.normalize();
}
}
// We must also normalize all of the attributes
if ( attributes!=null )
{
for( int i=0; i<attributes.getLength(); ++i )
{
Node attr = attributes.item(i);
attr.normalize();
}
}
// changed() will have occurred when the removeChild() was done,
// so does not have to be reissued.
isNormalized(true);
}
Hope this saves you some time.

Preserving lines with Jsoup

I am using Jsoup to get some data from html, I have this code:
System.out.println("nie jest");
StringBuffer url=new StringBuffer("http://www.darklyrics.com/lyrics/");
url.append(args[0]);
url.append("/");
url.append(args[1]);
url.append(".html");
//wyciaganie odpowiednich klas z naszego htmla
Document doc=Jsoup.connect(url.toString()).get();
Element lyrics=doc.getElementsByClass("lyrics").first();
Element tracks=doc.getElementsByClass("albumlyrics").first();
//Jso
//lista sciezek
int numberOfTracks=tracks.getElementsByTag("a").size();
Everything would be fine, I extracthe data I want, but when I do:
lyrics.text()
I get the text with no line breaks, so I am wondering how to leave line breaks in displayed text, I read other threads on stackoverflow on this matter but they weren't helpful, I tried to do something like this:
TextNode tex=TextNode.createFromEncoded(lyrics.text(), lyrics.baseUri());
but I can't get the text I want with line breaks. I looked at previous threads about this like,
Removing HTML entities while preserving line breaks with JSoup
but I can't get the effect I want. What should I do?
Edit: I got the effect I wanted but I don't think it is very good solution:
for (Node nn:listOfNodes)
{
String s=Jsoup.parse(nn.toString()).text();
if ((nn.nodeName()=="#text" || nn.nodeName()=="h3"))
{
buf.append(s+"\n");
}
}
Anyone got better idea?

You could get the text nodes (the text between <br />s) by checking if the node is an instance of TextNode. This should work out for you:
Document document = Jsoup.connect(url.toString()).get();
Element lyrics = document.select(".lyrics").first();
StringWriter buffer = new StringWriter();
PrintWriter writer = new PrintWriter(buffer);
for (Node node : lyrics.childNodes()) {
if (node.nodeName().equals("h3")) {
writer.println(((Element) node).text());
} else if (node instanceof TextNode) {
writer.println(((TextNode) node).text());
}
}
System.out.println(buffer.toString());
(please note that comparing the object's internal value should be done by equals() method, not ==; strings are objects, not primitives)
Oh, I also suggest to read their privacy policy.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

JSoup get text and inline images in order - java

Related

Parse HTMl using JSOUP - Need specific pattern

Iterating through elements in jsoup and parsing href

Modify the text content of XML tag

Normalization in DOM parsing with java - how does it work?

Preserving lines with Jsoup

Categories

Resources