Jsoup selecting table data - java

For the life of me I can't figure out how to use jsoup to select the img src, i.e. the link ending in "51u1FaI-FHL._SL500_AA300_.jpg".
I've tried multiple things but none have worked. Any help?
Document doc1 = Jsoup.connect("http://www.amazon.com/gp/product/B0051HDDO2?ie=UTF8&ref=mas_faad").timeout(20000).get();
Element table = doc1.select("table[class=productImageGrid]").first();
Iterator<Element> ite = table.select("td[height=300]").iterator();
Thanks,
Cody
<table style="text-align: center;" border="0" cellpadding="0" cellspacing="0" width="300">
<tr>
<td id="prodImageCell" height="300" width="300" style="padding-bottom: 10px;"><img onclick="if(0 ){ async_openImmersiveView(event);} else {openImmersiveView(event);}" class="prod_image_selector" style="cursor:pointer;" onload="if (typeof uet == 'function') { uet('af'); }" src="http://ecx.images-amazon.com/images/I/51u1FaI-FHL._SL500_AA300_.jpg" id="prodImage"/><div id="prodImageCellInner" style="position: relative; height:0px; "><!--Comment for IE as it is empty div--></div></td>
<td id="prodVideoClick" style="display:none"></td>
<img id="loadingImage" src=http://g-ecx.images-amazon.com/images/G/01/ui/loadIndicators/loading-large_boxed._V192195297_.gif style="position: absolute; z-index: 200; display:none">
</tr>
<tr>
<td class="tiny" style="padding-bottom: 5px;"> <span id="prodImageCaption" style="color: #666666; font-size: 10px;">Click for larger image and other views</span> </td>
</tr>
</table>

@user793728: try this:
Document document = Jsoup.connect("http://www.amazon.com/gp/product/B0051HDDO2?ie=UTF8&ref=mas_faad").timeout(20000).get();
Elements elements = document.select(".prod_image_selector");
for (Element element : elements) {
    // Walk the attributes of each matched img and pick out src
    // (element.attr("src") would read it directly)
    Attributes imageAttributes = element.attributes();
    for (Attribute attribute : imageAttributes) {
        if (attribute.getKey().equals("src")) {
            String imageURL = attribute.getValue();
            System.out.println(imageURL);
        }
    }
}

The issue here seems to be that Amazon is returning different HTML to jsoup than it is to your browser, based on the request UserAgent.
I set the UserAgent to a known browser, and selected the element using the #prodImage ID, and got the result OK.
E.g.
Document doc = Jsoup.connect("http://www.amazon.com/gp/product/B0051HDDO2?ie=UTF8&ref=mas_faad")
        .timeout(20000)
        .userAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_7) AppleWebKit/534.30 (KHTML, like Gecko) Chrome/12.0.742.91 Safari/534.30")
        .get();
Element img = doc.select("#prodImage").first();
System.out.println(img.attr("src"));
Returns http://ecx.images-amazon.com/images/I/51u1FaI-FHL._SL500_AA300_.jpg
To troubleshoot issues like this, I suggest printing doc.html() and looking at the retrieved, parsed HTML, as it can differ from the view-source HTML in your browser (servers can return different HTML to different clients, and view-source shows the HTML before it has been tidied and built into a DOM).
Hope this helps!

Related

how to web-scrape from a table?

I am working on a project where I have to web-scrape from the site https://lite.ip2location.com.
When you come into the site there are a number of divs each with a different country. When you click on one of them the browser is redirected to a table on that site. The table has a thead and tbody. I need to access the tbody but for some reason, I only get the information from the thead tag.
This is my code:
public static void main(String[] args) {
    final String url = "https://lite.ip2location.com/ip-address-ranges-by-country";
    try {
        final Document document = Jsoup.connect(url).get();
        for (Element element : document.select("div.card-columns div")) {
            Elements link = element.select("a");
            String redirectUrl = "https://lite.ip2location.com" + link.attr("href");
            final Document redirectDoc = Jsoup.connect(redirectUrl).get();
            Element table = redirectDoc.select("table").get(0);
            for (Element row : table.select("tbody tr")) {
                System.out.println(row.text());
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}
Jsoup only parses the HTML returned for a URL; it does not execute JavaScript or fetch additional resources such as JS or CSS files.
On the page with the IP address ranges for a country, the data is JSON that the browser loads asynchronously and then uses to populate the table.
So with final Document redirectDoc = Jsoup.connect(redirectUrl).get(); you get an HTML page that contains only a template for the table, like this:
<div class="row my-5" style="min-height:500px;">
<div class="col table-responsive">
<table id="ip-address" class="table table-striped table-hover">
<thead>
<tr>
<th width="30%" class="no-sort">Begin IP Address</th>
<th width="30%" class="no-sort">End IP Address</th>
<th width="40%" class="text-right no-sort">Total Count</th>
</tr>
</thead>
<tbody>
</tbody>
</table>
</div>
</div>
This is exactly the fragment your code parses, which is why the tbody is empty.
One possible way to get the ranges is to fetch the JSON directly.
The IP address ranges for Zimbabwe, for example, are located at https://cdn-lite.ip2location.com/datasets/ZW.json . The file name matches the alpha-2 country code (ZW for Zimbabwe).
The codes for the available countries can be extracted from the https://lite.ip2location.com/ip-address-ranges-by-country page, where each country's <p class="card-text"> tag contains a span tag that draws the flag.
The span's second class ends with the country code (flag-icon-ba):
<div class="card" style="min-height:72px;">
<div class="card-body" style="padding:.85rem;">
<p class="card-text"><span class="flag-icon flag-icon-ba"></span> Bosnia and Herzegovina</p>
</div>
</div>
BA for Bosnia and Herzegovina.
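As a rough sketch of that mapping, the plain-Java snippet below turns a class attribute such as "flag-icon flag-icon-ba" into a dataset URL. The class and method names are made up for illustration, and the URL pattern is only inferred from the ZW.json example, so treat it as an assumption rather than a documented API.

```java
import java.util.Locale;
import java.util.Optional;

public class FlagClassToDatasetUrl {
    // Prefix of the flag class, as seen in the card markup.
    private static final String FLAG_PREFIX = "flag-icon-";
    // Assumed URL pattern, inferred from the ZW.json example.
    private static final String DATASET_BASE = "https://cdn-lite.ip2location.com/datasets/";

    /** Extracts the upper-cased country code from a class attribute value. */
    static Optional<String> countryCode(String classAttribute) {
        for (String cls : classAttribute.trim().split("\\s+")) {
            // "flag-icon" itself is shorter than the prefix, so it never matches
            if (cls.startsWith(FLAG_PREFIX) && cls.length() > FLAG_PREFIX.length()) {
                return Optional.of(cls.substring(FLAG_PREFIX.length()).toUpperCase(Locale.ROOT));
            }
        }
        return Optional.empty();
    }

    /** Builds the dataset URL for an alpha-2 country code. */
    static String datasetUrl(String code) {
        return DATASET_BASE + code + ".json";
    }

    public static void main(String[] args) {
        countryCode("flag-icon flag-icon-ba")
                .map(FlagClassToDatasetUrl::datasetUrl)
                .ifPresent(System.out::println);
    }
}
```

With jsoup, the class attribute for each country would presumably come from something like span.attr("class") on the spans selected from the country list page.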
Once you have the URL of the JSON data for a particular country, you can fetch it with Jsoup.
String data = Jsoup
        .connect("https://cdn-lite.ip2location.com/datasets/BA.json")
        .ignoreContentType(true) // the response is JSON, not HTML
        .execute()
        .body(); // raw response body, without parsing it as HTML

HREF + TEXT with Jsoup

I've the following HTML Page:
</div><div id="page_content_list01" class="grid_12">
<h2><strong class="floatleft">TEXT1</strong></h2><br>
<table>
<tbody>
<tr>
<th class="no_width">
<p class="floatleft">Attachments:</p>
</th>
<td class="link_azure">
<a target="_blank" href="http://www.example.com">TEXT2</a><br/>
</td>
</tr>
</tbody>
</table><h2><strong class="floatleft">TEXT3</strong></h2><br>
<table>
<tbody>
<tr>
<th class="no_width">
<p class="floatleft">Atachments:</p>
</th>
<td class="link_azure">
<a target="_blank" href="http://www.example2.com">TEXT4</a><br/>
</td>
</tr>
</tbody>
</table><h2><strong class="floatleft">TEXT5</strong></h2><br>
<table>
<tbody>
<tr>
Currently I'm doing:
Elements rows = document.select("div#page_content_list01");
Now I want to select the "TEXT" headings and the links. I want to make a clickable link, so I'm using:
for (Element eleme : rows) {
    Elements elements = eleme.select("a");
    for (Element elem : elements) {
        String url = elem.attr("href");
        String title = elem.text();
    }
}
and I'm getting:
url = "http://www.example.com";
title = "TEXT2";
and it's ok, but in this way I can't read "TEXT1" and "TEXT3".
Can someone help me please?
I think you need to work on the selectors. First, your primary selector
Elements rows = document.select("div#page_content_list01");
will return with a list of ONE element only, since you actually select the div, not the tables or table rows. I would instead do this to get all relevant info:
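Elements tables = document.select("div#page_content_list01>table");
for (Element table : tables) {
    // Walk back past the <br> that sits between each heading and its table
    Element h2 = table.previousElementSibling();
    while (h2 != null && !h2.tagName().equals("h2")) {
        h2 = h2.previousElementSibling();
    }
    String titleStr = (h2 != null) ? h2.text() : "";
    Element a = table.select("a").first();
    String linkStr = (a != null) ? a.attr("href") : "";
}
Note that the text in the h2 elements is on the same level as the table, not inside a common div, which is why I use the previous-sibling approach. There is also a <br> between each h2 and its table, so the loop steps back until it reaches the h2. Also note that I wrote this off the top of my head and it is untested. You should get the idea though.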

Getting Input tag id, value in webdriver using java

Html
<table id="tblRenewalList" class="adminlist dataTable" width="100%" cellspacing="1" cellpadding="1" border="1" style="margin-left: 0px; width: 100%;" aria-describedby="tblRenewalList_info">
<thead>
</thead>
<tbody role="alert" aria-live="polite" aria-relevant="all">
<tr class="odd">
<td class="alignCenter">
<input id="chkRenewal_868" class="chkPatent" type="checkbox" onclick="RenewalSelection(this)" companyid="33" value="868">
</td>
</tr>
</table>
With the above HTML I want to scrape the id and value of the input.
Below is my Java code; when I run it, it returns empty values:
WebElement inputValues = driver.findElement(By
        .xpath("//*[@id='tblRenewalList']/tbody/tr[1]/td[1]"));
String idValue = inputValues.getAttribute("id");
String ed2 = inputValues.getAttribute("value");
This is my expected output:
id = chkRenewal_868
value = 868
The document isn't well-formed (the tbody is never closed); I don't know if that matters for WebDriver, but the XPath must select the input element rather than the td:
//*[@id='tblRenewalList']/tbody/tr[1]/td[1]/input
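To see why the extra /input step matters, here is a small sketch that evaluates both paths with the JDK's built-in javax.xml.xpath API instead of WebDriver. It runs against a hand-tidied, well-formed copy of the table (an assumption, since the JDK parser needs valid XML), and the class name is invented for the example. Note that XPath attribute tests are written with @, as in @id.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class XPathInputDemo {
    // Trimmed, well-formed copy of the table from the question (tags closed by hand).
    static final String XML =
        "<table id='tblRenewalList'><tbody><tr class='odd'><td class='alignCenter'>"
      + "<input id='chkRenewal_868' class='chkPatent' type='checkbox' companyid='33' value='868'/>"
      + "</td></tr></tbody></table>";

    /** Returns the first element matching the expression, or null if nothing matches. */
    static Element select(String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            InputSource source = new InputSource(new StringReader(XML));
            return (Element) xpath.evaluate(expression, source, XPathConstants.NODE);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // The td has no id attribute, so this prints an empty string
        Element td = select("//*[@id='tblRenewalList']/tbody/tr[1]/td[1]");
        System.out.println("td id: '" + td.getAttribute("id") + "'");

        // The input carries the attributes the question is after
        Element input = select("//*[@id='tblRenewalList']/tbody/tr[1]/td[1]/input");
        System.out.println("id = " + input.getAttribute("id"));
        System.out.println("value = " + input.getAttribute("value"));
    }
}
```

The same expression passed to By.xpath(...) should hand back the input element, so getAttribute("id") and getAttribute("value") are read from the right node.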

Html parsing text from TD Tag

I have my Html data
<table border='0' cellpadding='3' bgcolor="#CCCCCC" class="hostinfo_title2" width='100%' align="center">
<tr align='center' bgcolor="#ffffff">
<td width='26%' class="hostinfo_title3">Archive Url</td>
</tr>
<tr bgcolor="#ffffff"
<td height="25" align="center">http://www.toradio.com/prgramdetails/20130413_vali_mm.mp3</td>
</tr>
</table>
I want to get the mp3 URL (http://www.toradio.com/prgramdetails/20130413_vali_mm.mp3) from the above HTML.
I'm following this link. Is it correct, or is there a better way to parse this text?
Could anyone help?
Check out jsoup. It's a nice HTML parser for Java.
You should be able to do that with something like this:
String html = "<YOUR HTML HERE>";
Document doc = Jsoup.parse(html);
Elements tds = doc.select("table.hostinfo_title2").select("td");
String mp3Link = "";
for (Element td : tds) {
    if (td.text().contains("mp3")) {
        mp3Link = td.text();
        // do something with mp3Link
    }
}

Replacing text inside tags using Jsoup

<table width="100%" border="0" cellpadding="0" cellspacing="1" class="table_border" id="center_table">
<tbody>
<tr>
<td width="25%" class="heading_table_top">S. No.</td>
<td width="45%" class="heading_table_top">
Booking Status (Coach No , Berth No., Quota)
</td>
<td width="30%" class="heading_table_top">
* Current Status (Coach No , Berth No.)
</td>
</tr>
</tbody>
</table>
I scrape a webpage and store the response in a string.
I then parse it into a jsoup Document:
Document doc = Jsoup.parse(result);
Then I select the table using
Element table=doc.select("table[id=center_table]").first();
Now I need to replace the text "Booking Status (Coach No , Berth No., Quota)" in the td tag with "Booking Status" using jsoup. Could anybody help?
I tried
table.children().text().replaceAll(RegEx to select the text?????, "Booking Status");
Elements tds = doc.select("table[id=center_table] td"); // select the tds from your table
for (Element td : tds) { // loop through them
    if (td.text().contains("Booking Status")) { // found the one you want
        td.text("Booking Status"); // replace with your text
    }
}
Then you can use doc.toString() to get the text of the HTML back to save to disk, send to a WebView, or whatever else you want to do with it.
Elements tablecells = doc.select("table tbody tr td");
will give you the 3 cells.
Use a loop to get each element with
Element e = tablecells.get(index);
Use e.text() to get the String.
Compare or replace strings with String.equals(), String.contains(), and String.replace().
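For the regex the question was asking about, one option is to strip a trailing parenthesized group from the cell text. This is only a sketch with a made-up helper name; it assumes the part to drop is always the last "( ... )" group at the end of the string.

```java
public class TrimParenthesizedSuffix {
    /** Removes a trailing "( ... )" group together with the whitespace around it. */
    static String stripParenthesized(String cellText) {
        // \s*\(.*\)\s*$ matches from the first "(" to the final ")" at the end of the text
        return cellText.replaceAll("\\s*\\(.*\\)\\s*$", "").trim();
    }

    public static void main(String[] args) {
        System.out.println(stripParenthesized("Booking Status (Coach No , Berth No., Quota)"));
        System.out.println(stripParenthesized("* Current Status (Coach No , Berth No.)"));
        System.out.println(stripParenthesized("S. No."));
    }
}
```

Inside the jsoup loop, the result could then be written back with td.text(stripParenthesized(td.text())).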
