JSoup get HTML table data from website


I'd like to get data from an HTML table which looks like this:
<tr>
<td rowspan="30" class="listWeekday">Mo</td>
<td class="listStart">05:00</td>
<td class="listEnd">08:30</td>
</tr>
<tr>
<td... unknown value of Start and End td's> </td></tr>
<tr>
<td rowspan="30" class="listWeekday">Tu</td>
<td.. same as Monday, continues so till Friday></td></tr>
I'd like to parse this table with Jsoup. I tried the select() method with "td.listWeekday" and looped over the result:
for (Element elem : values) {
    System.out.println(elem.text());
}
That works fine, but when I try to get the listStart values, I get the data from all days at once. I'd like to separate them so that I get the listStart and listEnd values for each day.
I think this is possible, but I don't have a clue where to start, because the number of listStart and listEnd cells changes every day.

Analyzing tables with rowspan entries is not straightforward in JSoup or any other HTML library I know. What you could do in your case is to keep a simple variable with the current day while cycling over all rows. Something like this:
String URL = "http://pastebin.com/raw/Sa2MRCTQ";
Document doc = Jsoup.connect(URL).get();
Elements trs = doc.select("tr:has(td.liste-startzeit)");
String currentDay = null;
for (Element tr : trs) {
    Element tdDay = tr.select("td.liste-wochentag").first();
    if (tdDay != null) {
        currentDay = tdDay.text();
    }
    Element tdStart = tr.select("td.liste-startzeit").first();
    System.out.println(currentDay + " : " + tdStart.text());
}
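If you want the start and end times collected per day rather than printed, the same "remember the current day" idea can feed a map. This is only a sketch: it assumes Jsoup is on the classpath, it uses the class names from the question (listWeekday, listStart, listEnd), and the DayGrouper class and its SAMPLE markup are made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class DayGrouper {
    // Hypothetical sample that mirrors the structure shown in the question.
    static final String SAMPLE =
        "<table>"
        + "<tr><td rowspan=\"2\" class=\"listWeekday\">Mo</td>"
        + "<td class=\"listStart\">05:00</td><td class=\"listEnd\">08:30</td></tr>"
        + "<tr><td class=\"listStart\">09:00</td><td class=\"listEnd\">10:15</td></tr>"
        + "<tr><td rowspan=\"1\" class=\"listWeekday\">Tu</td>"
        + "<td class=\"listStart\">06:00</td><td class=\"listEnd\">07:00</td></tr>"
        + "</table>";

    // Returns day -> list of "start-end" strings, in document order.
    static Map<String, List<String>> groupByDay(String html) {
        Document doc = Jsoup.parse(html);
        Map<String, List<String>> byDay = new LinkedHashMap<>();
        String currentDay = null;
        for (Element tr : doc.select("tr:has(td.listStart)")) {
            // Only the first row of each rowspan block carries the weekday cell.
            Element day = tr.selectFirst("td.listWeekday");
            if (day != null) {
                currentDay = day.text();
            }
            Element start = tr.selectFirst("td.listStart");
            Element end = tr.selectFirst("td.listEnd");
            if (currentDay == null || end == null) {
                continue; // skip rows that do not fit the expected shape
            }
            byDay.computeIfAbsent(currentDay, k -> new ArrayList<>())
                 .add(start.text() + "-" + end.text());
        }
        return byDay;
    }

    public static void main(String[] args) {
        System.out.println(groupByDay(SAMPLE));
    }
}
```

Because the map is a LinkedHashMap, the days come back in the order they appear in the table, and each day can have any number of rows.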

Related

Unable to grab attribute value having a space in attribute name using jsoup java instead getting empty string

I'm new to jsoup and trying to grab the value of the data-original-title attribute (in the markup it follows an empty title attribute), but I get an empty string. The value I want is:
Jul-30-2015 03:26:13 PM
<table class="table table-hover">
<thead>
<tr style="border-color: #E1E1E1; border-width: 1px; background-color: #F9F9F9; border-top-style: solid;">
<th>Height</th>
<th>Age</th>
<th>txn</th>
<th>Uncles</th>
<th>Miner</th>
<th>GasUsed</th>
<th>GasLimit</th>
<th>Avg.GasPrice</th>
<th>Reward</th>
</tr>
</thead>
<tbody>
<tr><td></td>
<td>
<span rel="tooltip" data-placement="bottom" title="" data-original-title="Jul-30-2015 03:26:13 PM">1149 days 18 hrs ago</span>
</td>
My code is
for (int i = total_pages; i >= 1; i--) {
    System.out.println("\nDisplaying blocks on page " + i);
    String newString = "https://etherscan.io/blocks?p=" + i;
    Document d3 = Jsoup.connect(newString).get();
    Elements e = d3.select("table.table-hover > tbody");
    Elements r = e.get(0).select("tr");
    for (Element cr : r) {
        Elements test = d3.select("span");
        System.out.println(test.attr("data-original-title"));
    }
}
Any help would be appreciated. When I change the attribute name to data-placement, that value is retrieved correctly, but data-original-title still returns an empty string.

Data attributes are a special kind of attribute: Jsoup also exposes them through dataset(), where the key drops the data- prefix, so accessing them is a bit different but still easy.
Instead of
System.out.println(test.attr("data-original-title"));
use:
System.out.println(test.first().dataset().get("original-title"));

You can try this and see if it works:
d3.select("span[data-original-title]").get(0).attr("data-original-title")
Explanation:
This looks for the first span that has a "data-original-title" attribute and gets the value of that attribute. Note that get(0) throws an exception if no span matches.
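Both suggestions can be tried offline against the markup from the question. The sketch below assumes Jsoup is on the classpath; TooltipReader and SAMPLE are made-up names, and the sample is a trimmed copy of the posted HTML. One caveat: Jsoup does not run JavaScript, so if the live page fills in data-original-title with a script after loading, the attribute will not be in the fetched HTML at all. Whether that is what happens on the real page is not something this snippet can confirm.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class TooltipReader {
    // Trimmed copy of the markup from the question (hypothetical sample).
    static final String SAMPLE =
        "<table class=\"table table-hover\"><tbody>"
        + "<tr><td></td><td><span rel=\"tooltip\" data-placement=\"bottom\" title=\"\" "
        + "data-original-title=\"Jul-30-2015 03:26:13 PM\">1149 days 18 hrs ago</span></td></tr>"
        + "</tbody></table>";

    // Reads the attribute by its full name; returns "" when no span matches.
    static String viaAttr(String html) {
        Document doc = Jsoup.parse(html);
        Element span = doc.selectFirst("table.table-hover > tbody span[data-original-title]");
        return span == null ? "" : span.attr("data-original-title");
    }

    // Reads the same value through dataset(), whose keys omit the "data-" prefix.
    static String viaDataset(String html) {
        Element span = Jsoup.parse(html).selectFirst("span[data-original-title]");
        return span == null ? "" : span.dataset().get("original-title");
    }

    public static void main(String[] args) {
        System.out.println(viaAttr(SAMPLE));
        System.out.println(viaDataset(SAMPLE));
    }
}
```

In the question's loop the span lookup runs on d3 (the whole document) instead of on the current row cr, so selecting from cr is probably what was intended there.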

Html parsing in Java using Jsoup

I've been using Jsoup for HTML parsing, but I ran into a big problem: it takes far too long, around 1 hour.
Here's the site that I am parsing.
<tr>
<td class="class1">value1 </td>
<td class="class1">value2</td>
<td class="class1">value3</td>
<td class="class1">value4</td>
<td class="class1">value5 </td>
<td class="class1">value6</td>
<td class="class1">value7</td>
<td class="class1">value8</td>
<td class="class1">value9</td>
</tr>
On the site there are thousands of tables like this, and I need to parse them all into a list. I only need value1 and value6, so I am using this code:
Document doc = Jsoup.connect(url).get();
ls = new LinkedList();
for (int i = 15; i < doc.text().length(); i++) { // 15 because the tables I want start at index 15
    Element element = doc.getElementsByTag("tr").get(i); // table index
    Elements row = element.getElementsByTag("td");
    value6 = row.get(5).text(); // getting value6
    value1 = row.get(0).text(); // getting value1
    node = new Node(value1, value6);
    ls.insert(node);
}
As I said, it takes too much time, so I need to make it faster. Any ideas how to fix this problem?

I think your problem stems from the for loop for(int i = 15; i<doc.text().length(); i++). Its bound is the number of characters in the document's text, not the number of rows, and both doc.text() and doc.getElementsByTag("tr") are recomputed on every iteration. I highly doubt that this is what you want to do. I think you want to iterate over the table rows instead, so something like this should work:
Document doc = Jsoup.connect(url).get();
Elements trs = doc.select("tr");
for (int i = 15; i < trs.size(); i++) {
    Element tr = trs.get(i);
    Elements tds = tr.select("td");
    String value6 = tds.get(5).text(); // getting value6
    String value1 = tds.get(0).text(); // getting value1
    // do whatever you need to do with the values
}
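For what it's worth, here is a self-contained version of the row loop that can be run without the real site. It assumes Jsoup is on the classpath; RowExtractor, SAMPLE, and the firstRow parameter are invented for the example, and the size check is an extra guard for rows that have fewer than six cells.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.select.Elements;

class RowExtractor {
    // Hypothetical sample: one short header row followed by two nine-cell rows.
    static final String SAMPLE =
        "<table>"
        + "<tr><td>header only</td></tr>"
        + "<tr><td>a1</td><td>a2</td><td>a3</td><td>a4</td><td>a5</td>"
        + "<td>a6</td><td>a7</td><td>a8</td><td>a9</td></tr>"
        + "<tr><td>b1</td><td>b2</td><td>b3</td><td>b4</td><td>b5</td>"
        + "<td>b6</td><td>b7</td><td>b8</td><td>b9</td></tr>"
        + "</table>";

    // Returns {first cell, sixth cell} for every row from firstRow onward.
    static List<String[]> extract(String html, int firstRow) {
        // Query the rows once, outside the loop.
        Elements trs = Jsoup.parse(html).select("tr");
        List<String[]> pairs = new ArrayList<>();
        for (int i = firstRow; i < trs.size(); i++) {
            Elements tds = trs.get(i).select("td");
            if (tds.size() < 6) {
                continue; // not a data row, so index 5 would be out of bounds
            }
            pairs.add(new String[] { tds.get(0).text(), tds.get(5).text() });
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (String[] pair : extract(SAMPLE, 0)) {
            System.out.println(pair[0] + " / " + pair[1]);
        }
    }
}
```

The loop does one pass over the rows and one small select per row, so the running time grows with the number of rows rather than with the number of characters on the page.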

Select link in a table with jsoup using Java code

I need to get the download link in this table:
<table cellpadding="0" cellspacing="3" border="0">
<tr>
<td><img class="img" src="...path" /></td>
<td>File -
<a id="1569" class="tepLink" href="javascript:void(0);">[Click me]</a>
</td>
</tr>
</table>
and this is what I tried:
Element table = doc.select("table[cellpadding=\"0\" cellspacing=\"3\" border=\"0\"]").first();
Element dwlLink = table.select("td:has(a)").first();
String absPath = dwlLink.attr("abs:href");
//use download manager to download from string absPath
I always get a "null object reference", so something must be wrong with that code. What should I do instead?

Just select all anchor tags and then get the first element in the Elements object.
Elements anchorTags = doc.select("table[cellpadding=0][cellspacing=3][border=0] a");
if(anchorTags.isEmpty())
{
System.out.println("Not found");
}
else
{
System.out.println(anchorTags.first());
}
EDIT:
I changed the select method to include the cellpadding, cellspacing and border attributes since that seems like what you were after in one of your examples.
Also, the Elements.first() method returns null if the Elements list is empty. Always check for null after calling that method to prevent NullPointerExceptions.

table.select("td:has(a)").first(); will select the first <td> element that contains an anchor. It will not select the anchor <a> itself, so reading the href attribute from it returns an empty string.
Here is what you can do:
Element aEl = doc.select("table[cellpadding] td a").first();

Jsoup query, only parse specific elements

I'm trying to extract some data (see HTML below). I would like to extract the people who are in HR, and only their first and last names.
HTML:
<tbody>
<tr>
<td>Peter</td>
<td>Smith</td>
<td>35</td>
<td>HR</td>
</tr>
<tr>
<td>Paul</td>
<td>Roberts</td>
<td>47</td>
<td>Legal</td>
</tr>
<tr>
<td>James</td>
<td>Griffin </td>
<td>23</td>
<td>HR</td>
</tr>
</tbody>
What I want to extract:
Peter Smith
James Griffin
What I have so far:
public class Extract {
public static void main(String[] args) throws IOException {
Document Page = Jsoup.connect("URL").get(); //pick up html
Element List = Page.select("tbody").first();
Elements Info = List.select("tr");
for(Element value: Info)
{
System.out.println(value.select("td").first()); //first <td> ... </td>
System.out.println(value.select("td").second() + "\n"); //??? Trying to take the second <td> ... </td>
}
}
}

I would suggest putting a class on every td that holds a first name or a last name, like this:
<td class="first-name">Peter</td>
<td class="last-name">Smith</td>
<td>35</td>
<td>HR</td>
Then call select inside the for loop, like this (select returns Elements, not a single Element):
Elements firstNames = value.select(".first-name");
Elements lastNames = value.select(".last-name");
Or something along those lines. The point is that selecting by class would be better and would ensure you get nothing but the names.
If you don't control the input, you can also select the cells by position:
Elements firstNames = value.select("td:eq(0)");
Elements lastNames = value.select("td:eq(1)");
However, this requires that you are sure the information is always in the same order.
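The snippets above pick out the name cells but do not yet filter on the department. One way to do both, as a sketch only: read the cells of each row by position and keep the row when the fourth cell matches. It assumes Jsoup is on the classpath and that the column order is always first name, last name, age, department; HrFilter, SAMPLE, and namesIn are names I made up.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class HrFilter {
    // The table body from the question, wrapped in <table> so the parser keeps the rows.
    static final String SAMPLE =
        "<table><tbody>"
        + "<tr><td>Peter</td><td>Smith</td><td>35</td><td>HR</td></tr>"
        + "<tr><td>Paul</td><td>Roberts</td><td>47</td><td>Legal</td></tr>"
        + "<tr><td>James</td><td>Griffin</td><td>23</td><td>HR</td></tr>"
        + "</tbody></table>";

    // Returns "first last" for every row whose fourth cell equals the department.
    static List<String> namesIn(String html, String department) {
        List<String> names = new ArrayList<>();
        for (Element tr : Jsoup.parse(html).select("tbody tr")) {
            Elements tds = tr.select("td");
            if (tds.size() < 4) {
                continue; // row does not have the expected four columns
            }
            if (department.equals(tds.get(3).text())) {
                names.add(tds.get(0).text() + " " + tds.get(1).text());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        for (String name : namesIn(SAMPLE, "HR")) {
            System.out.println(name);
        }
    }
}
```

If the page ever gains class attributes on the cells, swapping the positional get(...) calls for class selectors would make this less fragile.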

Locating data in a complex table in Selenium Webdriver

I am currently trying to drill down on a user in a table full of users using Selenium WebDriver. I have worked out how to iterate through the table, but I'm having trouble actually selecting the person I want.
Here is the HTML (modified with X's because it is not my data):
<table id="XXXXXXXXX_list" cellspacing="0" cellpadding="0" style=" border:0px black solid;WIDTH:100%;">
<tbody>
<tr cellspacing="0" style="height: 16px;">
<tr>
<tr onclick="widgetListView_onClick('XXXX_list',1,this,event)">
<tr onclick="widgetListView_onClick('XXXX_list',2,this,event)">
<tr onclick="widgetListView_onClick('XXXX_list',3,this,event)">
<tr onclick="widgetListView_onClick('XXXX_list',4,this,event)">
<tr onclick="widgetListView_onClick('XXXX_list',5,this,event)">
<tr onclick="widgetListView_onClick('XXXX_list',6,this,event)">
<tr onclick="widgetListView_onClick('XXXX_list',7,this,event)">
<td class="listView_default_dataStyle" nowrap="" style="font-size:12px ;
font-family: sans-serif ;color: black ;background: #FFFFFF "
ondblclick="XXXXListView_onDblClick('XXXXX_list',17, event)">NAME</td>
<td class="listView_default_dataStyle" nowrap="" style="font-size:12px ;font-family: sans-serif;
color: black ;background: #FFFFFF " ondblclick="XXXXX_onDblClick('XXXX_list',17, event)"> </td>
</tr>
Here is the code I am writing to try and find the user going by NAME in the table.
WebElement table = driver.findElement(By.id("table_list"));
// Now get all the TR elements from the table
List<WebElement> allRows = table.findElements(By.tagName("tr"));
// And iterate over them, getting the cells
for (WebElement row : allRows) {
List<WebElement> cells = row.findElements(By.tagName("td"));
for (WebElement cell : cells) {
List<WebElement> Names = cell.findElements(By.xpath("//td[text()='NAME']"));
System.out.println(Names);
This just prints thousands of [] (the table is huge in the real application).
Essentially, what I need is to stop when I find the correct name and get a WebElement for that table row, which I can then click to drill down.
Sorry if any of this is a bit vague.

Well if each name in the table is unique, you don't need to complicate things so much. Just search for element with text matching your 'Name' then select the row accordingly. Look at the code below:
WebElement name = driver.findElement(By.xpath("//table[@id='XXXXXXXXX_list']//td[contains(text(),'NAME')]")); // Select td with text NAME in table with id XXXXXXXXX_list
WebElement rowWithName = name.findElement(By.xpath("./.."));//Select the parent node, i.e., tr, of the td with text NAME
/*
* Look into that row for other element or perform any action on the row.
*/
If the names are not unique, i.e., the same name exists twice at a similar node, the first instance will be picked each time. In that case we will have to do things differently, i.e., we will have to index the XPath to get the correct instance of the matching name. Do ask if you have any further doubts :)
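The XPath expression itself (with @id for the attribute test and /.. to step up to the row) can be tried without a browser by using the XPath support that ships with the JDK. This is only a sketch of the expression, not of Selenium: XPathCheck, SAMPLE, the users_list id, and the row(...) handlers are all invented, and the sample has to be well-formed XML for the JDK parser to accept it.

```java
import java.io.StringReader;

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class XPathCheck {
    // Hypothetical, well-formed stand-in for the masked table in the question.
    static final String SAMPLE =
        "<table id='users_list'><tbody>"
        + "<tr onclick='row(1)'><td>OTHER</td><td/></tr>"
        + "<tr onclick='row(2)'><td>NAME</td><td/></tr>"
        + "</tbody></table>";

    // Returns the onclick value of the first row holding a cell that contains the name,
    // or null when no cell matches. The name must not contain a single quote.
    static String onclickOfRowContaining(String xml, String tableId, String name) {
        String expr = "//table[@id='" + tableId + "']//td[contains(text(),'" + name + "')]/..";
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            Node row = (Node) xpath.evaluate(
                expr, new InputSource(new StringReader(xml)), XPathConstants.NODE);
            if (row == null) {
                return null;
            }
            Node onclick = row.getAttributes().getNamedItem("onclick");
            return onclick == null ? null : onclick.getNodeValue();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate: " + expr, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(onclickOfRowContaining(SAMPLE, "users_list", "NAME"));
    }
}
```

With Selenium the same expression goes into By.xpath(...), and the returned WebElement is the tr that you can click.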

This will help you out.
try {
    // findElements returns a List, so no cast to ArrayList is needed
    List<WebElement> cells = driver.findElements(By.tagName("td"));
    log4j.info("Value = " + input_type + " is stored in array from Webpage for " + keyword + " ");
    for (WebElement type : cells) {
        // Constant-first equals avoids a NullPointerException when the attribute is absent
        if ("your correct name here".equals(type.getAttribute("name"))) {
            type.sendKeys("ABC");
        }
    }
    return true;
} catch (Throwable e) {
    return false;
}
Collect the cells in a list like this, then compare each one against the name you are looking for and perform whatever operation you need on it, such as sendKeys(), getText(), or click().
Enjoy!
