JSOUP - Select only some text from html - java

I am trying to select some text from HTML using Jsoup on Android.
My HTML looks like this:
<tr class="tip " data-original-title="">
<td>
!!! NOT That !!! </td>
<td>
A205 </td>
<td>
I want to get this </td>
<td>
And this </td>
<td>
!!! And not this !!! </td>
<td>
</td>
</tr>
How can I do that? Thank you so much!

For example:
package ru.java.study;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class Main {
private static String htmlText =
"<tr class=\"tip \" data-original-title=\"\">" +
"<td>!!! NOT That !!!</td>" +
" <td>" +
" A205 </td>" +
" <td>" +
" I want to get this </td>" +
" <td>" +
" And this </td>" +
" <td>" +
" !!! And not this !!! </td>" +
" <td>" +
" </td>" +
" </tr>";
public static void main(String[] args) {
// Wrap the fragment in <table>; without it the HTML parser ignores the bare <tr>/<td> tags
Document document = Jsoup.parse("<table>" + htmlText);
String first_TD = document.select("td").get(2).text();
String second_TD = document.select("td").get(3).text();
System.out.println(first_TD);
System.out.println(second_TD);
}
}

Make your selection more specific. Ideally the <table> tag has an id="..." or class="..." attribute that precisely identifies the table you need.
// Don't forget about <table> tag
String html = "<table>" +
"<tr class=\"tip \" data-original-title=\"\">" +
"<td>!!! NOT That !!!</td>" +
"<td>A205</td>" +
"<td>I want to get this</td>" +
"<td>And this</td>" +
"<td>!!! And not this !!!</td>" +
"<td></td>" +
"</tr>" +
"</table>";
Document doc = Jsoup.parseBodyFragment(html);
// You should use more specific selector.
// For example if table tag looks like this: <table id="myID">...</table>
// then selector should look like this "table#myID tr.tip > td"
Elements cells = doc.select("tr.tip > td");
String cell3content = cells.get(2).html(); // use .text() for content without html tags
String cell4content = cells.get(3).html();
System.out.println(cell3content);
System.out.println(cell4content);
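As an alternative to indexing into the result list, the position can be put into the selector itself. The following is only a sketch: the class name CellPicker and the method pickCells are made up for illustration, and it assumes a jsoup version that provides Elements.eachText() and the :nth-child pseudo-selector.

```java
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class CellPicker {
    // Sample markup from the question, wrapped in <table> so the parser keeps the cells.
    static final String SAMPLE = "<table>"
            + "<tr class=\"tip \" data-original-title=\"\">"
            + "<td>!!! NOT That !!!</td>"
            + "<td>A205</td>"
            + "<td>I want to get this</td>"
            + "<td>And this</td>"
            + "<td>!!! And not this !!!</td>"
            + "<td></td>"
            + "</tr></table>";

    // Hypothetical helper: returns the text of the 3rd and 4th cell of each tr.tip row.
    static List<String> pickCells(String html) {
        Document doc = Jsoup.parse(html);
        // :nth-child is 1-based, unlike the 0-based get(2) / get(3) used above.
        return doc.select("tr.tip > td:nth-child(3), tr.tip > td:nth-child(4)").eachText();
    }

    public static void main(String[] args) {
        pickCells(SAMPLE).forEach(System.out::println);
    }
}
```

Keeping the position in the selector means a page with several matching rows yields the third and fourth cell of every row, which may or may not be what you want.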

Related

Extract data from HTML page in between Tags using HTMLUNIT

I'm trying to extract data from a web page using HtmlUnit. I have already done this by converting the HtmlPage to text and applying regular expressions to it. I have also managed to extract data from HTML tables by using their class attribute.
As a learning exercise, I now want to do the whole extraction with HtmlUnit alone, without regular expressions. What I can't work out is how to extract the data inside the tags as key-value pairs.
Here is the sample HTML:
<div class="top_red_bar">
<div id="site-breadcrumbs">
Home
|
Queues
|
Topics
|
Subscribers
|
Connections
|
Network
|
Scheduled
|
<a href="/admin/send.jsp"
title="Send">Send</a>
</div>
<div id="site-quicklinks"><P>
<a href="http://activemq.apache.org/support.html"
title="Get help and support using Apache ActiveMQ">Support</a></p>
</div>
</div>
<table border="0">
<tbody>
<tr>
<td valign="top" width="100%" style="overflow:hidden;">
<div class="body-content">
<h2>Welcome!</h2>
<p>
Welcome to the Apache ActiveMQ Console of <b>localhost</b> (ID:TOOLCONTROLPJX526-524666-65544585445-2:3)
</p>
<p>
You can find more information about Apache ActiveMQ on the Apache ActiveMQ Site
</p>
<h2>Broker</h2>
<table>
<tr>
<td>Name</td>
<td><b>localhost</b></td>
</tr>
<tr>
<td>Version</td>
<td><b>5.13.3</b></td>
</tr>
<tr>
<td>ID</td>
<td><b>ID:TOOLCONTROLPJX526-524666-65544585445-2:3</b></td>
</tr>
<tr>
<td>Uptime</td>
<td><b>17 days 13 hours</b></td>
</tr>
<tr>
<td>Store percent used</td>
<td><b>19</b></td>
</tr>
<tr>
<td>Memory percent used</td>
<td><b>0</b></td>
</tr>
<tr>
<td>Temp percent used</td>
<td><b>0</b></td>
</tr>
</table>
I want to extract the data between the table tags.
Expected output:
Name:localhost
Version:5.13.3
ID:ID:TOOLCONTROLPJX526-524666-65544585445-2:3
Uptime:17 days 13 hours
Store percent used:19
Memory percent used:0
Temp percent used:0
How can this be achieved? I want to know which HtmlUnit methods to use.
These are the steps I followed (not the only solution):
Parse the string with the parseHtml method, using a dummy URL.
Get the second table via XPath.
Iterate with a nested loop (a for loop over the rows and an iterator over the cells, so the separator is appended correctly).
ExtractTableData:
import java.net.URL;
import com.gargoylesoftware.htmlunit.StringWebResponse;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HTMLParser;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlTable;
import com.gargoylesoftware.htmlunit.html.HtmlTableRow;
import com.gargoylesoftware.htmlunit.html.HtmlTableRow.CellIterator;
public class ExtractTableData {
public static void main(String[] args) throws Exception {
String html = "<div class=\"top_red_bar\">\n" + " <div id=\"site-breadcrumbs\">\n"
+ " Home\n"
+ " |\n"
+ " Queues\n"
+ " |\n"
+ " Topics\n"
+ " |\n"
+ " Subscribers\n"
+ " |\n"
+ " Connections\n"
+ " |\n"
+ " Network\n"
+ " |\n"
+ " Scheduled\n"
+ " |\n" + " <a href=\"/admin/send.jsp\"\n"
+ " title=\"Send\">Send</a>\n" + " </div>\n"
+ " <div id=\"site-quicklinks\"><P>\n"
+ " <a href=\"http://activemq.apache.org/support.html\"\n"
+ " title=\"Get help and support using Apache ActiveMQ\">Support</a></p>\n"
+ " </div>\n" + " </div>\n" + "\n"
+ " <table border=\"0\">\n" + " <tbody>\n"
+ " <tr>\n"
+ " <td valign=\"top\" width=\"100%\" style=\"overflow:hidden;\">\n"
+ " <div class=\"body-content\">\n" + "\n" + "\n"
+ "<h2>Welcome!</h2>\n" + "\n" + "<p>\n"
+ "Welcome to the Apache ActiveMQ Console of <b>localhost</b> (ID:TOOLCONTROLPJX526-524666-65544585445-2:3)\n"
+ "</p>\n" + "\n" + "<p>\n"
+ "You can find more information about Apache ActiveMQ on the Apache ActiveMQ Site\n"
+ "</p>\n" + "\n" + "<h2>Broker</h2>\n" + "\n" + "\n" + "<table>\n" + " <tr>\n"
+ " <td>Name</td>\n" + " <td><b>localhost</b></td>\n" + " </tr>\n" + " <tr>\n"
+ " <td>Version</td>\n" + " <td><b>5.13.3</b></td>\n" + " </tr>\n" + " <tr>\n"
+ " <td>ID</td>\n" + " <td><b>ID:TOOLCONTROLPJX526-524666-65544585445-2:3</b></td>\n"
+ " </tr>\n" + " <tr>\n" + " <td>Uptime</td>\n"
+ " <td><b>17 days 13 hours</b></td>\n" + " </tr>\n" + " <tr>\n"
+ " <td>Store percent used</td>\n" + " <td><b>19</b></td>\n" + " </tr>\n"
+ " <tr>\n" + " <td>Memory percent used</td>\n" + " <td><b>0</b></td>\n"
+ " </tr>\n" + " <tr>\n" + " <td>Temp percent used</td>\n" + " <td><b>0</b></td>\n"
+ " </tr>\n" + "</table>";
WebClient webClient = new WebClient();
HtmlPage page = HTMLParser.parseHtml(new StringWebResponse(html, new URL("http://dummy.url.for.parsing.com/")),
webClient.getCurrentWindow());
final HtmlTable table = (HtmlTable) page.getByXPath("//table").get(1);
for (final HtmlTableRow row : table.getRows()) {
CellIterator cellIterator = row.getCellIterator();
if (cellIterator.hasNext()) {
System.out.print(cellIterator.next().asText());
while (cellIterator.hasNext()) {
System.out.print(":" + cellIterator.next().asText());
}
}
System.out.println();
}
}
}
Output:
Name:localhost
Version:5.13.3
ID:ID:TOOLCONTROLPJX526-524666-65544585445-2:3
Uptime:17 days 13 hours
Store percent used:19
Memory percent used:0
Temp percent used:0

how to retrieve data from strong tags in html file using jsoup?

I have some HTML data like this:
<div class="bs-example">
<div class="panel panel-primary">
<div class="panel-heading">
<h3 class="panel-title">ABC</h3>
</div>
<div class="panel-body">
<div class="slimScroller" style="height:280px; position: relative;" data-rail-visible="1" data-always-visible="1">
<strong>Name:</strong>
<br />
<strong>ID No:</strong> XXXXX<br />
<strong>Status:</strong> ACTIVE<br />
<strong>Class:</strong> 5<br />
<strong>Category:</strong> A<br />
<strong>Marks:</strong> 500<br />
</div>
</div>
</div>
</div>
I want output like this (for multiple students):
Name: ABC
ID No.: XXXXX
Status: Active
Class: 5
Category: A
Marks: 500
How can I get this data using jsoup or any other way? Please help.
You can use Element.nextElementSibling() and/or Element.nextSibling() to get the output you need.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class Exam {
public static void main(String[] args) {
String html = "<div class=\"bs-example\">" +
" <div class=\"panel panel-primary\">" +
" <div class=\"panel-heading\">" +
" <h3 class=\"panel-title\">ABC</h3>" +
" </div>" +
" <div class=\"panel-body\">" +
" <div class=\"slimScroller\" style=\"height:280px; position: relative;\" data-rail-visible=\"1\" data-always-visible=\"1\">" +
" <strong>Name:</strong>" +
" <br />" +
" <strong>ID No:</strong> XXXXX<br />" +
" <strong>Status:</strong> ACTIVE<br />" +
" <strong>Class:</strong> 5<br />" +
" <strong>Category:</strong> A<br />" +
" <strong>Marks:</strong> 500<br />" +
" </div>" +
" </div>" +
" </div>" +
"</div>";
Document doc = Jsoup.parse(html);
Elements eles = doc.select("div.slimScroller strong");
for(Element e :eles)
System.out.println(e.text() +
( e.nextElementSibling().tagName().equals("a")?
e.nextElementSibling().attr("href").replace("https://", ""):
e.nextSibling().toString()));
}
}
The following code should produce the specified output, based on your comment describing your a tags:
private static void printStudentInfo(Document document){
Elements students = document.select("div.slimScroller strong");
for(Element student : students){
System.out.print(student.text());
System.out.println(student.nextElementSibling().tagName().equals("a") ?
student.nextElementSibling().text() : student.nextSibling().toString());
}
}

How to locate elements that have the same name and same attributes in Selenium and insert some text

Sometimes several elements share the same name and attributes, for example multiple text boxes with the same name and the same class, so there is no way to distinguish them. I want to insert a different value into each text box. How do I resolve this? Please advise, thank you.
My HTML code:
<tr class="model-added">
<td class="table_bg1 textTr">上2级代理佣金</td>
<td>
<input type="text" name="upRebate[]" value="" maxlength="18">
<td class="table_bg1 textTr">上3级代理佣金</td>
<td>
<input type="text" name="upRebate[]" value="" maxlength="18">
My code:
WebDriverWait insert3 = new WebDriverWait(driver, 20);
insert3.until(ExpectedConditions.presenceOfElementLocated(By.xpath("(//input[@name='upRebate[]'])[position()=2]")))
.sendKeys(dealerAmount);
The solution I figured out:
List<WebElement> li = driver.findElements(By.name(Constant.YHTY_Commission_upRebate));
li.get(1).sendKeys(dealerAmountList2);
System.out.println("INSERT 上2级代理佣金 : " + dealerAmountList2);
Log.info("INSERT 上2级代理佣金 : " + dealerAmountList2);
li.get(2).sendKeys(dealerAmountList3);
System.out.println("INSERT 上2级代理佣金 : " + dealerAmountList3);
Log.info("INSERT 上2级代理佣金 : " + dealerAmountList3);
li.get(3).sendKeys(dealerAmountList4);
System.out.println("INSERT 上2级代理佣金 : " + dealerAmountList4);
Log.info("INSERT 上2级代理佣金 : " + dealerAmountList4);
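For comparison, an indexed XPath can also pick one of the identically named inputs, provided the whole expression is wrapped in parentheses before the index is applied. The sketch below uses the JDK's own XPath engine rather than Selenium, so it can run without a browser. The class name NthInputXPath, the sample markup and its value attributes are made up for illustration, and the markup is assumed to be well-formed XML. In Selenium the same expression would be passed to By.xpath.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class NthInputXPath {
    // Hypothetical, well-formed stand-in for the row in the question.
    static final String SAMPLE = "<tr class=\"model-added\">"
            + "<td><input type=\"text\" name=\"upRebate[]\" value=\"first\"/></td>"
            + "<td><input type=\"text\" name=\"upRebate[]\" value=\"second\"/></td>"
            + "</tr>";

    // (//input[...])[n] indexes the whole result set in document order.
    static String valueOfNthInput(String xml, int n) {
        String expr = "(//input[@name='upRebate[]'])[" + n + "]/@value";
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expr, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    // Without parentheses, [2] means "second matching input within its own parent".
    static int countWithoutParens(String xml) {
        String expr = "count(//input[@name='upRebate[]'][2])";
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            Double count = (Double) xpath.evaluate(
                    expr, new InputSource(new StringReader(xml)), XPathConstants.NUMBER);
            return count.intValue();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(valueOfNthInput(SAMPLE, 2));
        System.out.println(countWithoutParens(SAMPLE));
    }
}
```

Because each input sits in its own td, the unparenthesized form matches nothing here, which is a common reason an indexed XPath appears not to work.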

Jsoup how to parse text inside span class="hps"

<span id="result_box" class="short_text" lang="es">
<span class="hps">
hello
</span>
<span class="hps">
world
</span>
</span>
I want to get the string "hello world" using Jsoup, but I have no idea how to do this.
Use Jsoup.parse to get the HTML Document, then select the elements you want with a CSS selector such as span.hps (http://jsoup.org/apidocs/org/jsoup/select/Selector.html):
Document doc = Jsoup.parse("<span id=\"result_box\" class=\"short_text\" lang=\"es\">\n" +
" <span class=\"hps\">\n" +
" hello\n" +
" </span>\n" +
" <span class=\"hps\">\n" +
" world\n" +
" </span>\n" +
"</span>");
System.out.println(doc.html());
Elements els = doc.select("span.hps");
for(Element e:els){
System.out.print(e.text());
}
If you don't need each element's value separately, you can replace the for loop with a single call, which returns the combined text of all matched elements separated by spaces:
els.text()

Parse HTML page in Java

I'm parsing this page segment:
<tr valign="middle">
<td class="inner"><span style=""><span class="" title=""></span> 2 <span class="icon ok" title="Verified"></span> </span><span class="icon cat_tv" title="Video » TV" style="bottom:-2;"></span> VALUE </td>
<td width="1%" align="center" nowrap="nowrap" class="small inner" >VALUE</td>
<td width="1%" align="right" nowrap="nowrap" class="small inner" >VALUE</td>
<td width="1%" align="center" nowrap="nowrap" class="small inner" >VALUE</td>
</tr>
I have this segment in the variable tv: HtmlElement tv = tr.get(i);
I read the first VALUE (from the link) this way:
HtmlElement a = tv.getElementsByTagName("a").get(0);
object.name.value(a.getTextContent());
url = a.getAttribute("href");
object.url_detail.value(myBase + url);
How can I read only the VALUE field of the other <td>...</td> sections?
I would suggest using XPath, which is the recommended way of parsing XML/HTML
Reference: How to read XML using XPath in Java
Also take a look at this question: RegEx match open tags except XHTML self-contained tags
Update
If I understood correctly, you need the "VALUE" from each td, right?
If so, your XPath would be something like this:
//td[@class="small inner"]/text()
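To show how an XPath of this shape behaves, here is a minimal sketch that uses the XPath engine bundled with the JDK. The class name TdTextXPath and the sample cell values are made up for illustration, and the input is assumed to be well-formed XML. Real-world HTML usually is not, in which case an HTML-aware API such as HtmlUnit's getByXPath is the better fit.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class TdTextXPath {
    // Hypothetical, well-formed stand-in for the row in the question.
    static final String SAMPLE = "<tr valign=\"middle\">"
            + "<td class=\"inner\">name</td>"
            + "<td class=\"small inner\">A</td>"
            + "<td class=\"small inner\">B</td>"
            + "<td class=\"small inner\">C</td>"
            + "</tr>";

    // Collects the text nodes of every td whose class attribute is exactly "small inner".
    static List<String> smallInnerValues(String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    "//td[@class=\"small inner\"]/text()",
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getNodeValue().trim());
            }
            return values;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(smallInnerValues(SAMPLE));
    }
}
```

Note that @class="small inner" is an exact string comparison, so a cell with class="inner small" or with extra classes would not match.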
You could try jsoup, a wonderful Java package.
UPDATE: using that package, you can solve the problem like this:
String html = "<tr valign=\"middle\">"
+ " <td class=\"inner\">"
+ " <span style=\"\"><span class=\"\" title=\"\"></span> 2 <span class=\"icon ok\" title=\"Verified\"></span> </span><span class=\"icon cat_tv\" title=\"Video » TV\" style=\"bottom:-2;\"></span>"
+ " VALUE "
+ " </td>"
+ " <td width=\"1%\" align=\"center\" nowrap=\"nowrap\" class=\"small inner\" >VALUE</td>"
+ " <td width=\"1%\" align=\"right\" nowrap=\"nowrap\" class=\"small inner\" >VALUE</td>"
+ " <td width=\"1%\" align=\"center\" nowrap=\"nowrap\" class=\"small inner\" >VALUE</td>"
+ "</tr>";
Document doc = Jsoup.parse(html, "", Parser.xmlParser());
Elements labelPLine = doc.select("a[href]");
System.out.println("value 1:" + labelPLine.text());
Elements labelPLine2 = doc.select("td[width=1%]");
Iterator<Element> it = labelPLine2.iterator();
int n = 2;
while (it.hasNext()) {
System.out.println("value " + (n++) + ":" + it.next().text());
}
The result would be:
value 1:VALUE
value 2:VALUE
value 3:VALUE
value 4:VALUE
