Html parsing in Java using Jsoup


I've been using Jsoup for HTML parsing, but I've run into a big problem: it takes far too long, around 1 hour.
Here's the markup of the site that I am parsing:
<tr>
<td class="class1">value1 </td>
<td class="class1">value2</td>
<td class="class1">value3</td>
<td class="class1">value4</td>
<td class="class1">value5 </td>
<td class="class1">value6</td>
<td class="class1">value7</td>
<td class="class1">value8</td>
<td class="class1">value9</td>
</tr>
The site contains thousands of tables like this, and I need to parse them all into a list. I only need value1 and value6, so I am using this code:
Document doc = Jsoup.connect(url).get();
ls = new LinkedList();
for(int i = 15; i<doc.text().length(); i++) {//15 because the tables I want start from index 15
    Element element = doc.getElementsByTag("tr").get(i);//table index
    Elements row = element.getElementsByTag("td");
    value6 = row.get(5).text();//getting value6
    value1 = row.get(0).text();//getting value1
    node = new Node(value1, value6);
    ls.insert(node);
}
As I said, it takes too much time, so I need to make it faster. Any ideas on how to fix this problem?

I think your problem stems from the loop header for(int i = 15; i<doc.text().length(); i++). Its upper bound is the number of characters in the document's text, not the number of rows, and both doc.text() and doc.getElementsByTag("tr") are recomputed on every iteration. I highly doubt that this is what you want. I think you want to iterate over the table rows instead, selecting them once before the loop. Something like this should work:
Document doc = Jsoup.connect(url).get();
Elements trs = doc.select("tr"); // select all rows once, outside the loop
for (int i = 15; i < trs.size(); i++) { // the rows of interest start at index 15
    Element tr = trs.get(i);
    Elements tds = tr.select("td");
    String value6 = tds.get(5).text(); // getting value6
    String value1 = tds.get(0).text(); // getting value1 (first cell, index 0)
    // do whatever you need to do with the values
}
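One thing the loop above does not handle is a row with fewer than six cells (a header row, for example), where tds.get(5) would throw an IndexOutOfBoundsException. Below is a minimal plain-Java sketch of a guarded version. It assumes the cell texts of each row have already been copied into a List<String> (for instance with Jsoup's Elements.eachText()); RowValues, firstAndSixth and the sample data are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RowValues {
    // Collects {value1, value6} pairs, starting at startRow.
    // Rows with fewer than six cells are skipped instead of throwing.
    static List<String[]> firstAndSixth(List<List<String>> rows, int startRow) {
        List<String[]> result = new ArrayList<>();
        for (int i = Math.max(0, startRow); i < rows.size(); i++) {
            List<String> cells = rows.get(i);
            if (cells.size() < 6) {
                continue; // e.g. a header or spacer row
            }
            result.add(new String[] {cells.get(0).trim(), cells.get(5).trim()});
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample: one short row, then one row shaped like the question's markup.
        List<List<String>> rows = Arrays.asList(
            Arrays.asList("header"),
            Arrays.asList("value1 ", "value2", "value3", "value4", "value5 ",
                          "value6", "value7", "value8", "value9"));
        for (String[] pair : firstAndSixth(rows, 0)) {
            System.out.println(pair[0] + " / " + pair[1]);
        }
    }
}
```

With real data, startRow would be 15, as in the question.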

Related

How to create element's Xpath using different search ( cssSelector / tag / ClassName )

I would like to find an element using a different locator (cssSelector / tag / className) and then get its XPath value. To be more specific, I have a website where, when the day changes, a different cell gets the "active" class. Here is what I mean:
<tr>
<td> 1.1.2019 </td>
<td> 2.1.2019 </td>
<td class="active"> 3.1.2019 </td>
<td> 4.1.2019 </td>
</tr>
<tr>
<td> </td>
<td> 10 </td>
<td> </td> #Here
<td> </td>
</tr>
Depending on which cell has the "active" class, I want to click the table cell under it. Any idea how to do so?
Short version of what I want:
Find the element using a cssSelector
Get this element's XPath <- the problem
Click the cell below it using the edited XPath
I want to GET THE XPATH OF THE LOCATED ELEMENT, not locate it using XPath.

You can find the index by locating all the <td> elements in the first row and checking which one has the "active" class:
List<WebElement> columns = driver.findElements(By.xpath("//tr[td[@class='active']]/td")); // just an example, can be any other locator
int index = 0;
for (int i = 0; i < columns.size(); i++) {
    String attribute = columns.get(i).getAttribute("class");
    if (attribute != null && attribute.equals("active")) {
        index = i + 1; // XPath positions are 1-based
    }
}
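Once the index is known, it can be dropped into an XPath that targets the cell in the same column of the next row. The sketch below shows only the index lookup and the string building in plain Java, under the assumption that the class attribute values of the first row's cells have already been read into a list. ActiveColumnXPath, activeIndex and cellBelowXPath are hypothetical names, and the XPath shape is one possible choice rather than the only one.

```java
import java.util.Arrays;
import java.util.List;

class ActiveColumnXPath {
    // Returns the 1-based position of the first "active" entry, or 0 if there is none.
    static int activeIndex(List<String> classValues) {
        for (int i = 0; i < classValues.size(); i++) {
            if ("active".equals(classValues.get(i))) {
                return i + 1; // XPath positions are 1-based
            }
        }
        return 0;
    }

    // Builds an XPath for the cell in the given column of the row
    // that directly follows the row containing the active cell.
    static String cellBelowXPath(int index) {
        return "//tr[td[@class='active']]/following-sibling::tr[1]/td[" + index + "]";
    }

    public static void main(String[] args) {
        // Class values of the four cells in the question's first row (empty = no class).
        List<String> classes = Arrays.asList("", "", "active", "");
        int index = activeIndex(classes);
        System.out.println(index);                 // 3
        System.out.println(cellBelowXPath(index)); // //tr[td[@class='active']]/following-sibling::tr[1]/td[3]
    }
}
```

The resulting string could then be passed to driver.findElement(By.xpath(...)).click() in the Selenium code above.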

Unable to grab attribute value having a space in attribute name using jsoup java instead getting empty string

I'm new to jsoup and am trying to grab the value of the "title data-original-title" attribute, but I'm getting an empty string. I want the value
Jul-30-2015 03:26:13 PM
<table class="table table-hover">
<thead>
<tr style="border-color: #E1E1E1; border-width: 1px; background-color: #F9F9F9; border-top-style: solid;">
<th>Height</th>
<th>Age</th>
<th>txn</th>
<th>Uncles</th>
<th>Miner</th>
<th>GasUsed</th>
<th>GasLimit</th>
<th>Avg.GasPrice</th>
<th>Reward</th>
</tr>
</thead>
<tbody>
<tr><td></td>
<td>
<span rel="tooltip" data-placement="bottom" title="" data-original-title="Jul-30-2015 03:26:13 PM">1149 days 18 hrs ago</span>
</td>
My code is
for (int i = total_pages; i >= 1; i--) {
System.out.println("\nDisplaying blocks on page " + i);
String newString = "https://etherscan.io/blocks?p=" + i;
Document d3 = Jsoup.connect(newString).get();
Elements e = d3.select("table.table-hover > tbody");
Elements r = e.get(0).select("tr");
for (Element cr : r) {
Elements test = d3.select("span");
System.out.println(test.attr("data-original-title"));
}
}
Any help would be appreciated. When I change the attribute name to "data-placement", its value is retrieved correctly, but "data-original-title" still returns an empty string.

Data attributes are a special kind of attribute, so accessing them is a bit different but still very easy.
Instead of
System.out.println(test.attr("data-original-title"));
use:
System.out.println(test.first().dataset().get("original-title"));
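The key passed to dataset().get(...) is the attribute name without its data- prefix, which is why the lookup above uses original-title rather than data-original-title. The following plain-Java sketch is a simplified model of that mapping, not Jsoup's implementation; DatasetSketch and its dataset method are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class DatasetSketch {
    private static final String PREFIX = "data-";

    // Keeps only the data-* attributes and strips the "data-" prefix from their names.
    static Map<String, String> dataset(Map<String, String> attributes) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : attributes.entrySet()) {
            String name = entry.getKey();
            if (name.startsWith(PREFIX) && name.length() > PREFIX.length()) {
                result.put(name.substring(PREFIX.length()), entry.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Attributes of the <span> from the question.
        Map<String, String> attributes = new LinkedHashMap<>();
        attributes.put("rel", "tooltip");
        attributes.put("data-placement", "bottom");
        attributes.put("title", "");
        attributes.put("data-original-title", "Jul-30-2015 03:26:13 PM");

        System.out.println(dataset(attributes).get("original-title")); // Jul-30-2015 03:26:13 PM
    }
}
```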

You can check whether this works:
d3.select("span[data-original-title]").get(0).attr("data-original-title")
Explanation:
This looks for the first span that has a "data-original-title" attribute and gets the value of that attribute.

JSoup get HTML table data from website

I'd like to get data from an HTML table which looks like this:
<tr>
<td rowspan="30" class="listWeekday">Mo</td>
<td class="listStart">05:00</td>
<td class="listEnd">08:30</td>
</tr>
<tr>
<td... unknown value of Start and End td's> </td></tr>
<tr>
<td rowspan="30" class="listWeekday">Tu</td>
<td.. same as Monday, continues so till Friday></td></tr>
I'd like to parse this table with Jsoup. I tried using the select() method with "td.listWeekday" and iterating over the result:
for (Element elem : values) {
    System.out.println(elem.text());
}
This works fine, but when I try to get the listStart values, it collects the data from all days. I'd like to separate them, so that I get the listStart and listEnd values for each day.
I think this is possible, but I don't have a clue where to start, because the number of listStart and listEnd cells changes every day.

Analyzing tables with rowspan entries is not straightforward in JSoup or any other HTML library I know. What you could do in your case is to keep a simple variable with the current day while cycling over all rows. Something like this:
String URL = "http://pastebin.com/raw/Sa2MRCTQ";
Document doc = Jsoup.connect(URL).get();
Elements trs = doc.select("tr:has(td.liste-startzeit)");
String currentDay = null;
for (Element tr : trs) {
    Element tdDay = tr.select("td.liste-wochentag").first();
    if (tdDay != null) {
        currentDay = tdDay.text();
    }
    Element tdStart = tr.select("td.liste-startzeit").first();
    System.out.println(currentDay + " : " + tdStart.text());
}
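The same carry-forward idea can also collect the times per day instead of printing them. The plain-Java sketch below assumes each row has already been reduced to a {day, start, end} triple, where day is null for rows covered by an earlier rowspan cell. RowspanGrouping and groupByDay are hypothetical names, and every sample value except the Mo 05:00 to 08:30 row is made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RowspanGrouping {
    // Groups "start-end" strings by day, carrying the last seen day forward
    // over rows whose day cell is missing because of a rowspan.
    static Map<String, List<String>> groupByDay(List<String[]> rows) {
        Map<String, List<String>> byDay = new LinkedHashMap<>();
        String currentDay = null;
        for (String[] row : rows) {
            if (row[0] != null) {
                currentDay = row[0]; // a new weekday cell starts here
            }
            if (currentDay == null) {
                continue; // no weekday seen yet, nothing to attach the row to
            }
            byDay.computeIfAbsent(currentDay, day -> new ArrayList<>())
                 .add(row[1] + "-" + row[2]);
        }
        return byDay;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[] {"Mo", "05:00", "08:30"},
            new String[] {null, "09:00", "12:00"},  // made-up row under the Mo rowspan
            new String[] {"Tu", "06:00", "07:15"}); // made-up row
        System.out.println(groupByDay(rows));
    }
}
```

A LinkedHashMap is used so that the days come out in the order they appear in the table.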

How to get specific tags value from HTML table

First, I want to apologise for my English. I am new to programming in Java and to Jsoup. I want to get some data from a website. The information on the website is given in an HTML table, and I don't need all the fields from the table. I use this:
Document doc = Jsoup.connect("http://www.emo.nl/barges/en.html")
.data("query", "Java")
.userAgent("Mozilla")
.cookie("auth", "token")
.timeout(3000)
.post();
Element table1 = doc.select("table").first();//.getElementsByTag("td");//.getElementsByTag("td")
String body = table1.toString();
Document docb = Jsoup.parseBodyFragment(body);
Element bbd = docb.body();
String hhk = bbd.toString();
System.out.println(hhk);
The result of this code gives me the whole table as a String, as follows:
<body>
<table>
<tbody>
<tr>
<th>Name</th>
<th>Bargeno.</th>
<th>Reported present</th>
<th>Busy</th>
<th>Starting</th>
<th>Harbour</th>
</tr>
<tr>
<td>AMETHYST</td>
<td>2327085</td>
<td>*</td>
<td>Busy</td>
<td>19-03-2014 spil 1</td>
<td>HH</td>
</tr>
<tr>
<td>AMETHYST 2</td>
<td>2327086</td>
<td>*</td>
<td>Busy</td>
<td>19-03-2014 spil 1</td>
<td>HH</td>
</tr>
<tr>
<td>AQUAPOLIS</td>
<td>6105002</td>
<td>*</td>
<td> </td>
<td>19-03-2014 spil 1</td>
<td>HH</td>
</tr>
</tbody>
</table>
</body>
This is too much information for me. I want to make two variables, let's say:
private String naam;
private String date;
In the naam variable I want to store the text of the first <td> tag (AMETHYST), and in the date variable I want to put the text of the fifth <td> tag (19-03-2014).
Is there any way to do this? Thanks a lot for any help.

One way to do it would be to read the elements at the specified index:
String naam = bbd.getElementsByTag("td").get(0).text();
String date = bbd.getElementsByTag("td").get(4).text();
System.out.println(naam + " " + date);
Gives,
AMETHYST 19-03-2014  spil 1
EDIT:
Since the td contains &nbsp; spil 1, you will see that being retrieved too. In case you want to eliminate it and its presence is consistent, then:
System.out.println(naam + " " + date.substring(0, date.indexOf('\u00A0') - 1));
Gives,
AMETHYST 19-03-2014
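The substring call above depends on exactly where the non-breaking space sits in the cell text. As an alternative sketch (not the method used in this answer), a regular expression can pull out the dd-MM-yyyy part directly, whatever whitespace follows it. DateCellSketch and extractDate are hypothetical names, and the sample string is an assumption about what text() returns for that cell.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DateCellSketch {
    // Two digits, dash, two digits, dash, four digits, e.g. 19-03-2014.
    private static final Pattern DATE = Pattern.compile("\\d{2}-\\d{2}-\\d{4}");

    // Returns the first date-shaped substring, or an empty string if there is none.
    static String extractDate(String cellText) {
        Matcher matcher = DATE.matcher(cellText);
        return matcher.find() ? matcher.group() : "";
    }

    public static void main(String[] args) {
        // Assumed cell text: the date, a non-breaking space, then "spil 1".
        String cellText = "19-03-2014\u00A0 spil 1";
        System.out.println(extractDate(cellText)); // 19-03-2014
    }
}
```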
EDIT 2: Based on OP's query on getting the collection of all 1st tds within the table use something like:
Elements tds = table1.select(" > tbody > tr > td:eq(0)");
for (Element el : tds) {
System.out.println(el.text());
}
Where > tbody > tr > td:eq(0) pulls out the 0th index td against every tr encountered within your table1
Output,
AQUAPOLIS
AQUAPOLIS
IMPERIAL 7
CHIMO
...
For more information on the selector syntax, refer to the Jsoup selector documentation.

Selenium webdriver : exclude child node

Here is the sample HTML Code :
<table width="100%" cellspacing="0" cellpadding="0" border="0">
<tbody>
<tr>
<tr class="tinyfont">
<tr height="2px">
<tr height="1px">
<tr height="1px">
<tr>
<tr height="2px">
<tr height="1px">
<tr height="1px">
<tr height="2px">
</tbody>
</table>
I am using selenium webdriver.
I have retrieved all the child elements with the code below, but now I want to exclude one particular child element. How can I exclude one of the child elements from my list?
I want to exclude the tr[6] child element.
List<WebElement> list = driver.findElements(By.xpath("/html/body/table/tbody //*"));
ArrayList<String> al1 = new ArrayList<String>();
for(WebElement ele:list){
String className = ele.getAttribute("class");
System.out.println("Class name = "+className);
al1.add(className);
}
Thanks in Advance!!

Either omit the 6th table row, then select all descendants:
/html/body/table/tbody/tr[position() != 6]//*
or only select the table rows that are either at position 1 or have an attribute (and then select their descendants):
/html/body/table/tbody/tr[position() = 1 or @*]//*
or to be more specific, also check the attribute name:
/html/body/table/tbody/tr[position() = 1 or @height or @class]//*
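These predicates can be checked outside the browser with the XPath engine that ships with the JDK. The sketch below runs them against a well-formed stand-in for the question's table, which is an assumption: the rows are self-closed and empty, the html and body wrappers are dropped, and only the rows themselves are counted rather than their descendants. XPathExcludeRow and count are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathExcludeRow {
    // Ten rows with the same attributes as in the question; rows 1 and 6 have none.
    static final String XML =
        "<table><tbody>"
        + "<tr/><tr class='tinyfont'/><tr height='2px'/><tr height='1px'/><tr height='1px'/>"
        + "<tr/><tr height='2px'/><tr height='1px'/><tr height='1px'/><tr height='2px'/>"
        + "</tbody></table>";

    // Evaluates the expression against XML and returns how many nodes it selects.
    static int count(String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                expression, new InputSource(new StringReader(XML)), XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Invalid XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(count("/table/tbody/tr"));                                    // 10
        System.out.println(count("/table/tbody/tr[position() != 6]"));                   // 9
        System.out.println(count("/table/tbody/tr[position() = 1 or @*]"));              // 9
        System.out.println(count("/table/tbody/tr[position() = 1 or @height or @class]")); // 9
    }
}
```

All three predicates drop exactly the sixth row here, because it is the only row other than the first that carries no attribute.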

Is it always element 6 that you want to avoid? If it is, use a for loop with a counter and skip element 6 with an if statement.
// count the rows, since the loop below indexes tr[i]
int numOfElements = driver.findElements(By.xpath("/html/body/table/tbody/tr")).size();
ArrayList<String> al1 = new ArrayList<String>();
for (int i = 1; i <= numOfElements; i++)
{
    if (i != 6) // skip the 6th row
    {
        String className = driver.findElement(By.xpath("/html/body/table/tbody/tr[" + i + "]")).getAttribute("class");
        System.out.println("Class name = " + className);
        al1.add(className);
    }
}
This may not sound like the solution you are looking for, but it is a roundabout way to achieve what you want. Off the top of my head, I can't think of another way unless you have an attribute to compare against or to exclude on.
