How to extract href detail from links using Jsoup? - java

I have an html with this form:
<table>
<tbody>
<tr>
<td class="t1"><img class="png" src="" alt="site1"></td>
<td class="t2 up">INFORMATION</td>
<td class="t2 down">INFORMATION</td>
<td class="t2 up mark">INFORMATION</td>
</tr>
<tr>
<td class="t1"><img class="png" src="" alt="site2"></td>
<td class="t2 down">INFORMATION</td>
<td class="t2 stable">INFORMATION</td>
<td class="t2 up">INFORMATION</td>
</tr>
.
.
.
</tbody>
</table>
and I want to extract either the value of href (/click/site1) or the value of alt (site1).
How can I do this using Jsoup?
Thanks.
Edit:
This is the code that I wrote:
for(Element table : doc.select("table"))
{
    for(Element row : table.select("tr"))
    {
        System.out.print(table.attr("href").toString());
        Elements column = row.select("td");
        {
            System.out.println(column.text());
        }
    }
    System.out.println();
}
but this line, System.out.print(table.attr("href").toString());, doesn't print anything.

This process is described in the jsoup cookbook:
http://jsoup.org/cookbook/extracting-data/working-with-urls
Document doc = Jsoup.connect("http://jsoup.org").get();
Element link = doc.select("a").first();
String relHref = link.attr("href"); // == "/"
String absHref = link.attr("abs:href"); // "http://jsoup.org/"
In your question you try to read the href attribute from the table, but the table element has no href attribute. Either search for all a tags, or select the td inside each row and then the link inside that cell.
I changed your example and added some code that prints only the links.
for (Element table : doc.select("table")) {
    for (Element row : table.select("tr")) {
        Elements column = row.select("td");
        // Links in the row's first cell; eq(0) is simply empty for rows without a td
        Elements atag = column.eq(0).select("a");
        // Elements.attr returns "" when no link was found
        System.out.print(atag.attr("href"));
        System.out.print(" ");
        System.out.println(column.text());
    }
    System.out.println();
}
// Alternatively, print the href of every link in the document
for (Element link : doc.select("a")) {
    System.out.println(link.attr("href"));
}
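For the table in the question, a minimal sketch might look like the following. It assumes the first cell of each row may wrap its image in a link such as /click/site1 (that a tag is not visible in the posted HTML, so it is an assumption) and falls back to the image's alt text otherwise. RowLabelSketch and labels are made-up names for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

final class RowLabelSketch {
    // Returns one label per row: the href of the link in the first cell if
    // there is one, otherwise the alt text of the image in that cell.
    static List<String> labels(String html) {
        Document doc = Jsoup.parse(html);
        List<String> result = new ArrayList<>();
        for (Element row : doc.select("table tr")) {
            // first() returns null when the selector matches nothing
            Element link = row.select("td.t1 a[href]").first();
            Element img = row.select("td.t1 img[alt]").first();
            if (link != null) {
                result.add(link.attr("href"));
            } else if (img != null) {
                result.add(img.attr("alt"));
            }
            // Rows with neither a link nor an image are skipped.
        }
        return result;
    }
}
```

Calling RowLabelSketch.labels(html) on a page where only the first row has a link would give that row's href followed by the second row's alt text.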

Related

Selenium Webdriver - Fetching column from a data table using Java 8

I am trying to grab a column from a data table.
Here is my table to use as an example, what I am looking for is to extract the Firstname from the table.
<table style="width:100%">
<tr>
<th>Firstname</th>
<th>Lastname</th>
<th>Age</th>
</tr>
<tr>
<td class="abc">Jill</td>
<td class="abc">Smith</td>
<td class="abc">50</td>
</tr>
<tr>
<td class="abc">Eve</td>
<td class="abc">Jackson</td>
<td class="abc">94</td>
</tr>
</table>
How can I modify the code below to give me this result:
Jill
Eve
WebElement table = driver.findElement(By.id("searchResultsGrid"));
// Now get all the TR elements from the table
List<WebElement> allRows = table.findElements(By.tagName("tr"));
// And iterate over them, getting the cells
for (WebElement row : allRows) {
    List<WebElement> cells = row.findElements(By.tagName("td"));
    for (WebElement cell : cells) {
        System.out.println("content >> " + cell.getText());
    }
}
Using Java 8, you can select only the cells of the Firstname column and then iterate over that list with .forEach, as below:
WebElement table = driver.findElement(By.id("searchResultsGrid"));
List<WebElement> firstCells = table.findElements(By.xpath(".//tr/td[1]"));
firstCells.forEach(firstCell->System.out.println("Firstname >> " + firstCell.getText()));

HREF + TEXT with Jsoup

I have the following HTML page:
</div><div id="page_content_list01" class="grid_12">
<h2><strong class="floatleft">TEXT1</strong></h2><br>
<table>
<tbody>
<tr>
<th class="no_width">
<p class="floatleft">Attachments:</p>
</th>
<td class="link_azure">
<a target="_blank" href="http://www.example.com">TEXT2</a><br/>
</td>
</tr>
</tbody>
</table><h2><strong class="floatleft">TEXT3</strong></h2><br>
<table>
<tbody>
<tr>
<th class="no_width">
<p class="floatleft">Atachments:</p>
</th>
<td class="link_azure">
<a target="_blank" href="http://www.example2.com">TEXT4</a><br/>
</td>
</tr>
</tbody>
</table><h2><strong class="floatleft">TEXT5</strong></h2><br>
<table>
<tbody>
<tr>
Currently I'm doing:
Elements rows = document.select("div#page_content_list01");
Now I want to select the "TEXT" headings and the links. I want to make a clickable link, so I'm using:
for (Element eleme : rows) {
    Elements elements = eleme.select("a");
    for (Element elem : elements) {
        String url = elem.attr("href");
        String title = elem.text();
    }
}
and I'm getting:
url = "http://www.example.com";
title = "TEXT2";
That part works, but this way I can't read "TEXT1" and "TEXT3".
Can someone help me, please?
I think you need to work on the selectors. First, your primary selector
Elements rows = document.select("div#page_content_list01");
will return a list of only ONE element, since you are selecting the div itself, not the tables or table rows. I would instead do this to get all the relevant info:
Elements tables = document.select("div#page_content_list01>table");
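for (Element table : tables) {
    // Walk back over the <br> that sits between each table and its heading
    Element h2 = table.previousElementSibling();
    while (h2 != null && !h2.tagName().equals("h2")) {
        h2 = h2.previousElementSibling();
    }
    String titleStr = (h2 != null) ? h2.text() : "";
    Element a = table.select("a").first();
    String linkStr = (a != null) ? a.attr("href") : "";
}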
Note that the h2 elements are siblings of the tables rather than being wrapped together with them in a common element, which is why I look at the table's preceding siblings. Also note that I wrote this off the top of my head and it is untested, but you should get the idea.
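A runnable variation on the same idea starts from the headings instead of the tables: for each h2 directly under the div, step forward over the intervening br until a table is found. This is only a sketch; HeadingLinkSketch and titlesToLinks are made-up names, and it assumes each heading is followed by at most one table before the next heading.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.LinkedHashMap;
import java.util.Map;

final class HeadingLinkSketch {
    // Maps each h2 title to the href of the first link in the table after it.
    static Map<String, String> titlesToLinks(String html) {
        Document doc = Jsoup.parse(html);
        Map<String, String> result = new LinkedHashMap<>();
        for (Element heading : doc.select("div#page_content_list01 > h2")) {
            // Skip the <br> (or anything else) between heading and table,
            // but stop if the next heading comes first.
            Element next = heading.nextElementSibling();
            while (next != null
                    && !next.tagName().equals("table")
                    && !next.tagName().equals("h2")) {
                next = next.nextElementSibling();
            }
            if (next == null || !next.tagName().equals("table")) {
                continue; // heading without a table
            }
            Element link = next.select("a[href]").first();
            if (link != null) {
                result.put(heading.text(), link.attr("href"));
            }
        }
        return result;
    }
}
```

The link text (TEXT2, TEXT4) could be collected in the same loop with link.text() if it is needed as well.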

Parsing table data with jsoup

I am using jsoup in my Android app to parse HTML, but now I need to parse table data and I cannot get it to work. I have tried many approaches without success, so I am trying my luck here in case anyone has experience with this.
Here is part of my html:
<div id="editacia_jedla">
<h2>My header</h2>
<h3>My sub header</h3>
<table border="0" class="jedalny_listok_tabulka" cellpadding="2" cellspacing="1">
<tr>
<td width="100" class="menu_nazov neparna" align="left">Food Menu 1</td>
<td class="jedlo neparna" align="left">vegetable and beef
<div class="jedlo_box_alergeny">Allergens: 1, 3</div>
</td>
</tr>
<tr>
<td width="100" class="menu_nazov parna" align="left">Food Menu 2</td>
<td class="jedlo parna" align="left">Potato salad and pork
<div class="jedlo_box_alergeny">Allergens: 6</div>
</td>
</tr>
</table>
etc
</div>
My java/android code:
try {
String tableHtmlCode="";
Document fullHtmlDocument = Jsoup.connect(urlOfFoodDay).get();
Element elm1 = fullHtmlDocument.select("#editacia_jedla").first();
for( Element element : elm1.children() )
{
tableHtmlCode+=element.getElementsByIndexEquals(2); //this set table content because 0=h2, 1=h3
}
Document parsedTableDocument = Jsoup.parse(tableHtmlCode);
//Element th = parsedTableDocument.select("td[class=jedlo neparna]").first(); THIS IS BAD
String foodContent="";
String foodAllergens="";
}
So now I want to extract the text "vegetable and beef" and save it to the string foodContent, and the numbers "1, 3" (together) from the div with class jedlo_box_alergeny and save them to the string foodAllergens. Can someone help? I would be very grateful for any ideas.
Select the table with class jedalny_listok_tabulka and loop over its td tags.
The td tag is the parent of the a (link) tags that contain the allergen values. Hence, you would loop over the a elements to get your numbers, something like:
Elements myElements = doc.getElementsByClass("jedalny_listok_tabulka")
        .first().getElementsByTag("td");
for (Element element : myElements) {
    if (element.className().contains("jedlo")) {
        String foodContent = element.ownText();
        String foodAllergen = "";
        for (Element href : element.getElementsByTag("a")) {
            foodAllergen += " " + href.text();
        }
        System.out.println(foodContent + " : " + foodAllergen);
    }
}
Output:
vegetable and beef : 1 3
Potato salad and pork : 6
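In the HTML excerpt posted in the question, the allergens appear as plain text inside div.jedlo_box_alergeny rather than inside a tags. A sketch for that markup follows; it assumes the "Allergens:" prefix always has that form, and MenuSketch and foodToAllergens are hypothetical names.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.LinkedHashMap;
import java.util.Map;

final class MenuSketch {
    // Maps each food description to its allergen list, e.g. "1, 3".
    static Map<String, String> foodToAllergens(String html) {
        Document doc = Jsoup.parse(html);
        Map<String, String> result = new LinkedHashMap<>();
        for (Element cell : doc.select("table.jedalny_listok_tabulka td.jedlo")) {
            // ownText() ignores the text of the nested allergen div
            String foodContent = cell.ownText().trim();
            Element box = cell.select("div.jedlo_box_alergeny").first();
            String foodAllergens = "";
            if (box != null) {
                // Drop the assumed "Allergens:" label and keep the numbers
                foodAllergens = box.text().replaceFirst("^Allergens:\\s*", "").trim();
            }
            result.put(foodContent, foodAllergens);
        }
        return result;
    }
}
```

Keeping the numbers as one string ("1, 3") matches what the question asks for; splitting on the comma would be the next step if separate values are needed.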

How I can cut html table into separate tables?

I have an HTML document parsed by jsoup. It contains a table with several rows:
<table>
<tbody>
<tr>...</tr>
<tr>...</tr>
<tr>...</tr>
<tr>...</tr>
<tr>...</tr>
<tr>...</tr>
</tbody>
</table>
Some of the rows act as headers. I find those rows with jsoup's select(...) method, so I have an Elements object containing all the header rows. Let's say it looks like this:
<table>
<tbody>
<tr id="tr1">...</tr>
<tr id="tr2">...</tr> // this is header
<tr id="tr3">...</tr>
<tr id="tr4">...</tr>
<tr id="tr5">...</tr> // this is header
<tr id="tr6">...</tr>
</tbody>
</table>
The id attributes are just for this example; in the real case there are no id attributes in the parsed HTML.
What I need is to get 2 tables (2 Element objects, one per table), one for each header, containing all rows below the given header but above the next header. So I expect:
<table> // Element 1
<tbody>
<tr id="tr3">...</tr>
<tr id="tr4">...</tr>
</tbody>
</table>
<table> // Element 2
<tbody>
<tr id="tr6">...</tr>
</tbody>
</table>
Can anyone help me with this task?
That's a good exercise for testing jsoup's DOM-handling capabilities. Below is the snippet you need; note that it moves at most the two rows following each header, which is enough for the example. The code is pretty much self-explanatory (createElement creates an element, and so on), but if you need any clarification, let me know:
Elements tables = new Elements();
for (Element headerTR : headerRows) {
    Element tbody = doc.createElement("tbody");
    Element firstSiblingTR = headerTR.nextElementSibling();
    if (firstSiblingTR != null) {
        Element secondSiblingTR = firstSiblingTR.nextElementSibling();
        tbody.appendChild(firstSiblingTR);
        if (secondSiblingTR != null) {
            tbody.appendChild(secondSiblingTR);
        }
    }
    Element table = doc.createElement("table");
    table.appendChild(tbody);
    tables.add(table);
}
Example usage:
public static void main(String[] args) {
    Document doc = Jsoup.parse("<html><body>"+
        "<table>" +
        " <tbody>" +
        " <tr><td>1</td></tr>" +
        " <tr class='header'><td>2</td></tr>" + // class added to simulate your list
        " <tr><td>3</td></tr>" +
        " <tr><td>4</td></tr>" +
        " <tr class='header'><td>5</td></tr>" + // class added to simulate your list
        " <tr><td>6</td></tr>" +
        " </tbody>" +
        "</table>" +
        "</body></html>");
    Elements headerRows = doc.getElementsByClass("header"); // simulating your list
    Elements tables = new Elements();
    for (Element headerTR : headerRows) {
        Element tbody = doc.createElement("tbody");
        Element firstSiblingTR = headerTR.nextElementSibling();
        if (firstSiblingTR != null) {
            Element secondSiblingTR = firstSiblingTR.nextElementSibling();
            tbody.appendChild(firstSiblingTR);
            if (secondSiblingTR != null) {
                tbody.appendChild(secondSiblingTR);
            }
        }
        Element table = doc.createElement("table");
        table.appendChild(tbody);
        tables.add(table);
    }
    System.out.println(tables); // print <table> list
}
Output:
<table>
<tbody>
<tr><td>3</td></tr>
<tr><td>4</td></tr>
</tbody>
</table>
<table>
<tbody>
<tr><td>6</td></tr>
</tbody>
</table>
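The snippet above moves at most two rows per header. If a section can hold any number of rows, one possible generalisation is to keep moving siblings until the next header row is reached. This is a sketch and has not been run against the asker's real page; TableSplitSketch and split are hypothetical names.

```java
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

final class TableSplitSketch {
    // Builds one detached <table> per header row, holding every row that
    // follows the header up to (but not including) the next header.
    static Elements split(Document doc, Elements headerRows) {
        // Identity-based set, so a row is a header only if it is one of
        // the very objects in headerRows.
        Set<Element> headers = Collections.newSetFromMap(new IdentityHashMap<>());
        headers.addAll(headerRows);

        Elements tables = new Elements();
        for (Element header : headerRows) {
            Element tbody = doc.createElement("tbody");
            Element row = header.nextElementSibling();
            while (row != null && !headers.contains(row)) {
                // Remember the following row first: appendChild moves the
                // row out of the original table.
                Element next = row.nextElementSibling();
                tbody.appendChild(row);
                row = next;
            }
            Element table = doc.createElement("table");
            table.appendChild(tbody);
            tables.add(table);
        }
        return tables;
    }
}
```

As in the original snippet, the rows are moved rather than copied, so the source table is left with only the header rows and anything that preceded the first header.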

Html parsing text from TD Tag

Here is my HTML data:
<table border='0' cellpadding='3' bgcolor="#CCCCCC" class="hostinfo_title2" width='100%' align="center">
<tr align='center' bgcolor="#ffffff">
<td width='26%' class="hostinfo_title3">Archive Url</td>
</tr>
<tr bgcolor="#ffffff">
<td height="25" align="center">http://www.toradio.com/prgramdetails/20130413_vali_mm.mp3</td>
</tr>
</table>
I want to get the mp3 URL (http://www.toradio.com/prgramdetails/20130413_vali_mm.mp3) from the above HTML.
I'm following this link. Is it correct, or is there a better way to parse this text?
Could anyone help?
Check out jsoup. It's a nice HTML parser for Java.
You should be able to do that with something like this:
String html = "<YOUR HTML HERE>";
Document doc = Jsoup.parse(html);
Elements tds = doc.select("table.hostinfo_title2").select("td");
String mp3Link = "";
for (Element td : tds) {
    if (td.text().contains("mp3")) {
        mp3Link = td.text();
        // do something with mp3Link
    }
}
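As a variation, the filtering can be pushed into the selector with jsoup's :containsOwn pseudo-selector, which matches elements whose own text contains the given string (case-insensitively). A sketch, with hypothetical class and method names:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class Mp3LinkSketch {
    // Returns the text of the first cell mentioning ".mp3", or null if
    // the table has no such cell.
    static String firstMp3Url(String html) {
        Document doc = Jsoup.parse(html);
        Element cell = doc
                .select("table.hostinfo_title2 td:containsOwn(.mp3)")
                .first();
        return cell == null ? null : cell.text().trim();
    }
}
```

This only checks that the cell text contains ".mp3"; if the page can hold other text with that substring, a stricter check on the returned string (for example, that it starts with "http") may be worth adding.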
