Parse HTML using JSOUP - Need specific pattern

I am trying to get the text between tags and save it into variables, for example:
Here I want to save the value return, which is between the em tags. I also need the rest of the text, which is inside the p tag.
The em tag value should be return, and
the p tag value should be only --> an item, cancel an order, print a receipt, track your purchases or reorder items.
If some text comes before the em tag, that text should go into a separate variable as well. Basically, if one p contains multiple inner tags, it should be split and each piece saved into a different variable. If I knew how to get the text that is not inside inner tags, I could retrieve the rest.
Here is what I have written; it returns just "return", which is in the em tags.
Here ep is basically doc.select("p"): I select the p tags and then iterate. I am not sure whether this is the right way; any other approaches are highly appreciated.
String text = "<p><em>return </em>an item, cancel an order, print a receipt, track your purchases or reorder items.</p>";
Elements italic_tags = ep.select("em");
for(Element em:italic_tags) {
if(em.tagName().equals("em")) {
System.out.println( em.select("em").text());
}
}

If you need each piece of loose text as well as the text enclosed by the inner tags, iterate over Node objects instead of Element objects. I modified your HTML to include more tags so the example is more complete:
String text = "<p><em>return </em>an item, <em>cancel</em> an order, <em>print</em> a receipt, <em>track</em> your purchases or reorder items.</p>";
Document doc = Jsoup.parse(text);
Element ep = doc.selectFirst("p");
List<Node> childNodes = ep.childNodes();
for (Node node : childNodes) {
if (node instanceof TextNode) {
// if it's a text, just display it
System.out.println(node);
} else {
// if it's another element, then display its first
// child which in this case is a text
System.out.println(node.childNode(0));
}
}
output:
return
an item,
cancel
an order,
print
a receipt,
track
your purchases or reorder items.
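
The question also asks for each piece to end up in its own variable. The sketch below shows the same child-node walk collecting the pieces into a list. To keep it dependency-free it uses the JDK's built-in DOM parser rather than jsoup, so unlike jsoup it accepts only well-formed markup; the class and method names (ChildNodeSplitter, split) are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: not part of jsoup, JDK DOM is swapped in for illustration.
final class ChildNodeSplitter {

    // Returns the trimmed text of every direct child of the root element,
    // in document order: loose text and the text of inner elements alike.
    static List<String> split(String xml) {
        Element root;
        try {
            root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }

        List<String> pieces = new ArrayList<>();
        NodeList children = root.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            // For a text node this is its text; for an element, the text inside it.
            String piece = child.getTextContent().trim();
            if (!piece.isEmpty()) {
                pieces.add(piece);
            }
        }
        return pieces;
    }

    public static void main(String[] args) {
        String text = "<p><em>return </em>an item, <em>cancel</em> an order.</p>";
        for (String piece : split(text)) {
            System.out.println(piece);
        }
    }
}
```

With jsoup the loop body would stay the same in spirit: check for TextNode versus Element as in the answer above and add the text to a list instead of printing it.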

Related

JSoup get text and inline images in order

I've got some HTML that looks like this:
<tr>
<td>
Some text that is interrupted by an image here:
<img alt="imageName.png" src="linkhere" width="18" height="18">
and then continues here.
</td>
</tr>
and basically I just need a way to loop through the nodes here and add either the text or the image alt to a string with JSoup, maintaining the order of the nodes.
In the end it should look like this:
Some text that is interrupted by an image here: "imageName.png" and then continues here
So far I'm able to get the image by itself or the text by itself by using:
element.text();
//or
element.select("img").attr("alt")
but I'm having trouble getting both of them into an ordered list.
Any ideas?

The following code should give you the output string you are looking for. It loops through the direct child nodes of the body element and determines whether each one is a text node or an element. If it is a text node, it is added to the output string. If it is an element, the code looks for an image inside it and adds the alt text to the string.
String test = "";
Element body = doc.getElementsByTag("body").first();
List<Node> childNodes = body.childNodes();
for(Node node : childNodes){
if(node instanceof TextNode){
// These are text nodes, lets see if they are empty or not and add them to the string.
String nodeString = node.toString();
if(nodeString != null && !nodeString.trim().isEmpty()){
test += nodeString;
}
} else if (node instanceof Element) {
// Here is an element, let's see if there is an image.
Element element = (Element)node;
Element image = element.children().select("img").first();
if(image != null)
{
test += image.attr("alt");
}
}
}

Iterating through elements in jsoup and parsing href

I was having trouble getting just the href from the rows of a table. Although I was able to get it working, I am wondering if anyone can explain why my code here works.
for (Element element : result.select("tr")) {
if (element.select("tr.header.left").isEmpty()) {
Elements tds = element.select("td");
//The line below is what I don't understand
String link = tds.get(0).getElementsByAttribute("href").first().attr("href");
String position = tds.get(1).text();
}
}
The line that I was using before, that did not work is below:
String link = tds.get(0).attr("href");
Why does this line return an empty string? I'm assuming it has to do with how I am iterating through the elements as I've selected by "tr". However, I'm not familiar with how Elements vs Element are structured.
Thanks for your help!

Elements is simply a list of Element objects (it extends ArrayList<Element>).
The reason you have to write that extra code is that <td> doesn't have an href attribute, so tds.get(0).attr("href") returns an empty string. You're presumably trying to capture the href from an <a> within the cell. The longer, working code is saying:
For the first cell in the row, get the first element with an href attribute (i.e. a link), and get its href attribute.
Try the following example (with an example document), which shows how to access the child links more clearly:
Element result = Jsoup.parse("<html><body><table><tr><td><a href=\"http://a.com\" /></td><td>Label1</td></tr><tr><td><a href=\"http://b.com\" /></td><td>Label2</td></tr></table></body></html>");
for (Element element : result.select("tr")) {
if (element.select("tr.header.left").isEmpty()) {
Elements tds = element.select("td");
String link = tds.get(0).getElementsByTag("a").attr("href");
String position = tds.get(1).text();
System.out.println(link + ", " + position);
}
}

Extracting Table Data with JSoup on Yahoo Finance

Trying to practice extracting data from tables using JSoup. Can't figure out why I can't pull the "Shares Outstanding" field from
https://finance.yahoo.com/q/ks?s=AAPL+Key+Statistics
Here are two attempts, where 's' is AAPL:
public class YahooStatistics {
String sharesOutstanding = "Shares Outstanding:";
public YahooStatistics(String s) {
String keyStatisticsURL = ("https://finance.yahoo.com/q/ks?s="+s+"+Key+Statistics");
//Attempt 1
try {
Document doc = Jsoup.connect(keyStatisticsURL).get();
for (Element table : doc.select("table.yfnc_datamodoutline1")) {
for (Element row : table.select("tr")) {
Elements tds = row.select("td");
for (Element td : tds.select(sharesOutstanding)) {
System.out.println(td.ownText());
}
}
}
}
catch (IOException ex) {
ex.printStackTrace();
}
//Attempt 2
try {
Document doc = Jsoup.connect(keyStatisticsURL).get();
for (Element table : doc.select("table.yfnc_datamodoutline1")) {
for (Element row : table.select("tr")) {
Elements tds = row.select("td");
for (int j = 0; j < tds.size() - 1; j++) {
Element td = tds.get(j);
if ((td.ownText()).equals(sharesOutstanding)) {
System.out.println(tds.get(j+1).ownText());
}
}
}
}
}
catch(IOException ex) {
ex.printStackTrace();
}
}
}
The attempts return: BUILD SUCCESSFUL and nothing else.
I've disabled JavaScript on my browser and the table still shows, so I'm assuming this is not written in JavaScript but HTML.
Any suggestions are appreciated.

Notes about your source after the edit:
You should compare ownText() rather than text(). text() gives you the combined text of all the element and all its sub-elements. In this case the element contains Shares Outstanding<font size="-1"><sup>5</sup></font>:, so its combined text is "Shares Outstanding5:". If you use ownText it will just be "Shares Outstanding:".
Note the colon (:). Update the value in sharesOutstanding accordingly.
You are passing it the wrong URL. There should be a + following the AAPL.
Your current query (at least the second attempt) is returning the element twice, because there is a nested table so it finds the TDs twice.
You can either break from your loops once you found a match, go back to your original version (with corrections as above) - see note - or you can try using a more sophisticated query which will only match once:
Elements elems = doc.select("td.yfnc_tablehead1:containsOwn("+sharesOutstanding+") + td.yfnc_tabledata1");
if ( ! elems.isEmpty() ) {
System.out.println( elems.get(0).ownText() );
}
This selector gives you all the td elements whose class is yfnc_tabledata1, whose immediate preceding sibling is a td element whose class is yfnc_tablehead1 and whose own text contains the "Shares Outstanding:" string. This should basically select the exact TD you need.
Note: the previous version of this answer was a long rattle about the difference between Elements.select() and Element.select(). It turns out that I was dead wrong and your original version should have worked, if you had corrected the four points above. So to set the record straight: select() on an Elements actually does look inside each element, and the resulting list may contain descendants of any of the elements in the original list that match the selection. Sorry about that.
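
The difference between the combined text and the own text described in the first note can be reproduced on the cell markup quoted there. This is only a sketch: it uses the JDK's built-in DOM parser instead of jsoup so that it runs without dependencies, and OwnTextDemo, fullText and ownText are made-up names that imitate what jsoup's text() and ownText() return for this cell.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: JDK DOM stands in for jsoup here.
final class OwnTextDemo {

    private static Element parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    // Text of the element and all of its descendants, concatenated.
    static String fullText(String xml) {
        return parse(xml).getTextContent();
    }

    // Text of the element's direct text children only; nested elements are skipped.
    static String ownText(String xml) {
        StringBuilder own = new StringBuilder();
        NodeList children = parse(xml).getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                own.append(child.getNodeValue());
            }
        }
        return own.toString();
    }

    public static void main(String[] args) {
        String cell = "<td>Shares Outstanding<font size=\"-1\"><sup>5</sup></font>:</td>";
        System.out.println(fullText(cell)); // Shares Outstanding5:
        System.out.println(ownText(cell));  // Shares Outstanding:
    }
}
```

The footnote marker 5 sits inside nested elements, which is why only the own-text variant matches the "Shares Outstanding:" label exactly.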

Get the list of object containing text matching a pattern

I'm currently working with the Apache POI API and I'm trying to edit a Word document (*.docx) with it. A document is composed of paragraphs (XWPFParagraph objects), and a paragraph contains text embedded in 'runs' (XWPFRun). A paragraph can have many runs (depending on the text properties, but sometimes the split seems random). My document can contain specific tags which I need to replace with data (all my tags follow this pattern: <#TAG_NAME#>).
So for example, if I process a paragraph containing the text Some text with a tag <#SOMETAG#>, I could get something like this:
XWPFParagraph paragraph = ... // Get a paragraph from the document
System.out.println(paragraph.getText());
// Prints: Some text with a tag <#SOMETAG#>
But if I want to edit the text of that paragraph I need to process the runs, and the number of runs is not fixed. So if I print the content of the runs with this code:
System.out.println("Number of runs: " + paragraph.getRuns().size());
for (XWPFRun run : paragraph.getRuns()) {
System.out.println(run.text());
}
Sometimes it can be like this:
// Output:
// Number of runs: 1
// Some text with a tag <#SOMETAG#>
And other times like this:
// Output:
// Number of runs: 4
// Some text with a tag
// <#
// SOMETAG
// #>
What I need is to get the first run containing the start of the tag and the indexes of the following runs containing the rest of the tag (if the tag is split across several runs). I've managed to write a first version of that algorithm, but it only works if the beginning of the tag (<#) and the end of the tag (#>) aren't themselves split. Here's what I've already done.
So what I would like is an algorithm capable of handling that problem and, if possible, working with any given tag delimiters (not necessarily <# and #>, so that I could replace them with something like {{{ and }}}).
Sorry if my English isn't perfect; don't hesitate to ask me to clarify any point.

I finally found the answer myself by completely rethinking my original algorithm (I commented the code so it might help someone in the same situation):
// Before using the function, I'm sure that:
// paragraph.getText().contains(surroundedTag) == true
private void editParagraphWithData(XWPFParagraph paragraph, String surroundedTag, String replacement) {
List<Integer> runsToRemove = new LinkedList<Integer>();
StringBuilder tmpText = new StringBuilder();
int runCursor = 0;
// Process the runs in normal order until the surroundedTag has been found
while (!tmpText.toString().contains(surroundedTag)) {
tmpText.append(paragraph.getRuns().get(runCursor).text());
runsToRemove.add(runCursor);
runCursor++;
}
tmpText = new StringBuilder();
// Processing back (in reverse order) to only keep the runs I need to edit/remove
while (!tmpText.toString().contains(surroundedTag)) {
runCursor--;
tmpText.insert(0, paragraph.getRuns().get(runCursor).text());
}
// Edit the first run of the tag
XWPFRun runToEdit = paragraph.getRuns().get(runCursor);
// replace() treats the tag as literal text, so delimiters such as {{{ are safe
runToEdit.setText(tmpText.toString().replace(surroundedTag, replacement), 0);
// Forget the runs I don't need to remove
while (runCursor >= 0) {
runsToRemove.remove(0);
runCursor--;
}
// Remove the unused runs
Collections.reverse(runsToRemove);
for (Integer runToRemove : runsToRemove) {
paragraph.removeRun(runToRemove);
}
}
So now I process the runs of the paragraph until I find my surrounded tag, then I walk back through the paragraph to skip the leading runs that I don't need to edit.
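
The forward-then-backward scan can be tried without a Word document by treating each run as a plain string. The sketch below is a simplified model of the approach, not Apache POI code: RunTagReplacer and replaceTag are made-up names, and a List<String> stands in for the paragraph's XWPFRun objects.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the algorithm above; strings stand in for XWPFRun objects.
final class RunTagReplacer {

    // Merges the runs spanned by the first occurrence of tag into one run and
    // replaces the tag in that merged text. Returns a new list; the input is untouched.
    static List<String> replaceTag(List<String> runs, String tag, String replacement) {
        // Forward pass: find the index of the run in which the tag ends.
        StringBuilder seen = new StringBuilder();
        int end = -1;
        for (int i = 0; i < runs.size() && end < 0; i++) {
            seen.append(runs.get(i));
            if (seen.indexOf(tag) >= 0) {
                end = i;
            }
        }
        if (end < 0) {
            return new ArrayList<>(runs); // tag not present, nothing to do
        }

        // Backward pass: find the index of the run in which the tag starts.
        StringBuilder tail = new StringBuilder();
        int start = end + 1;
        while (tail.indexOf(tag) < 0) {
            start--;
            tail.insert(0, runs.get(start));
        }

        // Keep the runs before the tag, merge start..end into one run, keep the rest.
        List<String> result = new ArrayList<>(runs.subList(0, start));
        result.add(tail.toString().replace(tag, replacement)); // literal replace, no regex
        result.addAll(runs.subList(end + 1, runs.size()));
        return result;
    }

    public static void main(String[] args) {
        List<String> runs = List.of("Some text with a tag ", "<#", "SOMETAG", "#>");
        System.out.println(replaceTag(runs, "<#SOMETAG#>", "VALUE"));
    }
}
```

Because the tag is matched on the concatenated text and replaced literally, the same code handles delimiters such as {{{ and }}} even when the delimiters themselves are split across runs.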

Preserving lines with Jsoup

I am using Jsoup to get some data from html, I have this code:
System.out.println("nie jest");
StringBuffer url=new StringBuffer("http://www.darklyrics.com/lyrics/");
url.append(args[0]);
url.append("/");
url.append(args[1]);
url.append(".html");
// extract the relevant classes from our HTML
Document doc=Jsoup.connect(url.toString()).get();
Element lyrics=doc.getElementsByClass("lyrics").first();
Element tracks=doc.getElementsByClass("albumlyrics").first();
// list of tracks
int numberOfTracks=tracks.getElementsByTag("a").size();
Everything works and I extract the data I want, but when I call:
lyrics.text()
I get the text with no line breaks, so I am wondering how to keep the line breaks in the displayed text. I read other threads on Stack Overflow about this, but they weren't helpful. I tried something like this:
TextNode tex=TextNode.createFromEncoded(lyrics.text(), lyrics.baseUri());
but I can't get the text I want with line breaks. I looked at previous threads about this, such as
Removing HTML entities while preserving line breaks with JSoup
but I can't get the effect I want. What should I do?
Edit: I got the effect I wanted, but I don't think it is a very good solution:
for (Node nn:listOfNodes)
{
String s=Jsoup.parse(nn.toString()).text();
if ((nn.nodeName()=="#text" || nn.nodeName()=="h3"))
{
buf.append(s+"\n");
}
}
Does anyone have a better idea?

You could get the text nodes (the text between <br />s) by checking if the node is an instance of TextNode. This should work out for you:
Document document = Jsoup.connect(url.toString()).get();
Element lyrics = document.select(".lyrics").first();
StringWriter buffer = new StringWriter();
PrintWriter writer = new PrintWriter(buffer);
for (Node node : lyrics.childNodes()) {
if (node.nodeName().equals("h3")) {
writer.println(((Element) node).text());
} else if (node instanceof TextNode) {
writer.println(((TextNode) node).text());
}
}
System.out.println(buffer.toString());
(Please note that string contents should be compared with the equals() method, not ==; strings are objects, not primitives, so == compares references.)
Oh, I also suggest reading their privacy policy.
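
The note about equals() versus == can be checked with plain strings, no jsoup required. In this small sketch, NodeNameCheck and isTextOrHeading are made-up names that mirror the node-name test from the question.

```java
// Hypothetical helper illustrating content comparison versus reference comparison.
final class NodeNameCheck {

    // Same condition as in the question, but comparing string contents.
    // Putting the literal first also makes the check safe for a null name.
    static boolean isTextOrHeading(String nodeName) {
        return "#text".equals(nodeName) || "h3".equals(nodeName);
    }

    public static void main(String[] args) {
        // new String(...) forces a distinct object, as a name built at runtime would be.
        String runtimeName = new String("h3");
        System.out.println(runtimeName == "h3");          // false: different objects
        System.out.println(isTextOrHeading(runtimeName)); // true: same contents
    }
}
```

The == version in the question can appear to work when both sides happen to be the same interned literal, which is why the bug is easy to miss.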
