Jsoup query, only parse specific elements

I'm trying to extract some data (see HTML below). I would like to extract the people who are in HR, but only their first and last names.
HTML:
<tbody>
<tr>
<td>Peter</td>
<td>Smith</td>
<td>35</td>
<td>HR</td>
</tr>
<tr>
<td>Paul</td>
<td>Roberts</td>
<td>47</td>
<td>Legal</td>
</tr>
<tr>
<td>James</td>
<td>Griffin </td>
<td>23</td>
<td>HR</td>
</tr>
</tbody>
What I want to extract:
Peter Smith
James Griffin
What I have so far:
public class Extract {
    public static void main(String[] args) throws IOException {
        Document Page = Jsoup.connect("URL").get(); //pick up html
        Element List = Page.select("tbody").first();
        Elements Info = List.select("tr");
        for (Element value : Info) {
            System.out.println(value.select("td").first()); //first <td> ... </td>
            System.out.println(value.select("td").second() + "\n"); //??? Trying to take the second <td> ... </td>
        }
    }
}

I would suggest putting a class on every td that holds a first name or a last name, like:
<td class="first-name">Peter</td>
<td class="last-name">Smith</td>
<td>35</td>
<td>HR</td>
Then call your Jsoup select within the for loop like:
Elements firstNames = value.select(".first-name");
Elements lastNames = value.select(".last-name");
Or something along those lines. The point is that selecting by class is more robust and ensures you get nothing but the names.
If you don't control the input, you can select by position instead:
Elements firstNames = value.select("td:eq(0)");
Elements lastNames = value.select("td:eq(1)");
However, this requires that the cells always appear in the same order.
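Neither selector above filters on the department, which the question also asks for. The following is a minimal sketch of one way to do that, assuming the column order is always first name, last name, age, department. The class name HrNames and the method hrNames are made up for illustration, and the HTML is parsed from a string rather than fetched from a URL. Note that the rows are wrapped in a <table>, because jsoup's HTML parser discards <tr> and <td> tags that appear outside a table.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of jsoup.
final class HrNames {

    // Returns "First Last" for every row whose fourth cell is exactly "HR".
    // Assumption: columns are first name, last name, age, department.
    static List<String> hrNames(String html) {
        Document page = Jsoup.parse(html);
        List<String> names = new ArrayList<>();
        for (Element row : page.select("tbody tr")) {
            Elements cells = row.select("td");
            if (cells.size() < 4) {
                continue; // skip header rows or malformed rows
            }
            // text() already collapses and trims surrounding whitespace
            if (cells.get(3).text().equals("HR")) {
                names.add(cells.get(0).text() + " " + cells.get(1).text());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        String html = "<table><tbody>"
                + "<tr><td>Peter</td><td>Smith</td><td>35</td><td>HR</td></tr>"
                + "<tr><td>Paul</td><td>Roberts</td><td>47</td><td>Legal</td></tr>"
                + "<tr><td>James</td><td>Griffin </td><td>23</td><td>HR</td></tr>"
                + "</tbody></table>";
        hrNames(html).forEach(System.out::println);
    }
}
```

Comparing the cell text with equals rather than using a :contains selector avoids matching departments that merely include the letters "HR".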

Related

Select link in a table with jsoup using Java code

I need to get the download link in this table:
<table cellpadding="0" cellspacing="3" border="0">
<tr>
<td><img class="img" src="...path" /></td>
<td>File -
<a id="1569" class="tepLink" href="javascript:void(0);">[Click me]</a>
</td>
</tr>
</table>
and this is what I tried:
Element table = doc.select("table[cellpadding=\"0\" cellspacing=\"3\" border=\"0\"]").first();
Element dwlLink = table.select("td:has(a)").first();
String absPath = dwlLink.attr("abs:href");
//use download manager to download from string absPath
I always get a "null object reference", so something must be wrong with that code. What should it be?

Just select all anchor tags and then get the first element in the Elements object.
Elements anchorTags = doc.select("table[cellpadding=0][cellspacing=3][border=0] a");
if (anchorTags.isEmpty()) {
    System.out.println("Not found");
} else {
    System.out.println(anchorTags.first());
}
EDIT:
I changed the select call to include the cellpadding, cellspacing and border attributes, since that seems to be what you were after in one of your examples.
Also, the Elements.first() method returns null if the Elements list is empty. Always check for null after calling that method to prevent NullPointerExceptions.

table.select("td:has(a)").first(); selects the first <td> element that contains an anchor. It does not select the anchor <a> itself, so it has no href attribute to read.
Here is what you can do:
Element aEl = doc.select("table[cellpadding] td a").first();
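As a rough, null-safe sketch of the same idea: the hypothetical helper below (the names LinkFinder and firstLinkId are invented here) selects the first anchor with the tepLink class inside a table cell and returns its id, or null when no such anchor exists. It returns the id rather than a URL because the href in the sample is only a javascript:void(0); placeholder, so where the real download URL comes from is not something the HTML shown can tell us.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, not part of jsoup.
final class LinkFinder {

    // Returns the id attribute of the first a.tepLink inside a table cell,
    // or null if the document has no such anchor.
    static String firstLinkId(String html) {
        Document doc = Jsoup.parse(html);
        Element anchor = doc.select("table td a.tepLink").first();
        if (anchor == null) {
            return null; // first() returns null on an empty selection
        }
        return anchor.attr("id");
    }

    public static void main(String[] args) {
        String html = "<table cellpadding=\"0\" cellspacing=\"3\" border=\"0\"><tr>"
                + "<td><img class=\"img\" src=\"x.png\" /></td>"
                + "<td>File - <a id=\"1569\" class=\"tepLink\" href=\"javascript:void(0);\">[Click me]</a></td>"
                + "</tr></table>";
        System.out.println(firstLinkId(html));
    }
}
```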

HREF + TEXT with Jsoup

I have the following HTML page:
</div><div id="page_content_list01" class="grid_12">
<h2><strong class="floatleft">TEXT1</strong></h2><br>
<table>
<tbody>
<tr>
<th class="no_width">
<p class="floatleft">Attachments:</p>
</th>
<td class="link_azure">
<a target="_blank" href="http://www.example.com">TEXT2</a><br/>
</td>
</tr>
</tbody>
</table><h2><strong class="floatleft">TEXT3</strong></h2><br>
<table>
<tbody>
<tr>
<th class="no_width">
<p class="floatleft">Atachments:</p>
</th>
<td class="link_azure">
<a target="_blank" href="http://www.example2.com">TEXT4</a><br/>
</td>
</tr>
</tbody>
</table><h2><strong class="floatleft">TEXT5</strong></h2><br>
<table>
<tbody>
<tr>
Currently I'm doing:
Elements rows = document.select("div#page_content_list01");
Now I need to select the "TEXT" headings and the links. I want to make a clickable link, so I'm using:
for (Element eleme : rows) {
    Elements elements = eleme.select("a");
    for (Element elem : elements) {
        String url = elem.attr("href");
        String title = elem.text();
    }
}
and I'm getting:
url = "http://www.example.com";
title = "TEXT2";
and that's OK, but this way I can't read "TEXT1" and "TEXT3".
Can someone help me, please?

I think you need to work on the selectors. First, your primary selector
Elements rows = document.select("div#page_content_list01");
will return a list of ONE element only, since you actually select the div, not the tables or table rows. I would instead do this to get all relevant info:
Elements tables = document.select("div#page_content_list01 > table");
for (Element table : tables) {
    // Walk back past the <br> to the <h2> that precedes this table
    Element h2 = table.previousElementSibling();
    while (h2 != null && !h2.tagName().equals("h2")) {
        h2 = h2.previousElementSibling();
    }
    String titleStr = (h2 != null) ? h2.text() : "";
    Element a = table.select("a").first();
    String linkStr = (a != null) ? a.attr("href") : "";
}
Note that the text in the h2 elements is on the same level as the table, not inside a common div. This is why I use the previous-sibling approach. Because a <br> sits between each h2 and its table in your HTML, the loop steps back until it reaches the h2. Also note that I wrote this from memory and it is untested. You should get the idea though.

How to get specific tags value from HTML table

First, I want to apologise for my English. I am new to programming in Java and also to Jsoup. I want to get some data from a website. The information on the website is given in an HTML table, and I don't need all the fields from the table. I use this:
Document doc = Jsoup.connect("http://www.emo.nl/barges/en.html")
.data("query", "Java")
.userAgent("Mozilla")
.cookie("auth", "token")
.timeout(3000)
.post();
Element table1 = doc.select("table").first();//.getElementsByTag("td");//.getElementsByTag("td")
String body = table1.toString();
Document docb = Jsoup.parseBodyFragment(body);
Element bbd = docb.body();
String hhk = bbd.toString();
System.out.println(hhk);
The result of this code gives me the whole table as a String, as follows:
<body>
<table>
<tbody>
<tr>
<th>Name</th>
<th>Bargeno.</th>
<th>Reported present</th>
<th>Busy</th>
<th>Starting</th>
<th>Harbour</th>
</tr>
<tr>
<td>AMETHYST</td>
<td>2327085</td>
<td>*</td>
<td>Busy</td>
<td>19-03-2014 spil 1</td>
<td>HH</td>
</tr>
<tr>
<td>AMETHYST 2</td>
<td>2327086</td>
<td>*</td>
<td>Busy</td>
<td>19-03-2014 spil 1</td>
<td>HH</td>
</tr>
<tr>
<td>AQUAPOLIS</td>
<td>6105002</td>
<td>*</td>
<td> </td>
<td>19-03-2014 spil 1</td>
<td>HH</td>
</tr>
</tbody>
</table>
</body>
This is too much information for me. I want to make two variables, let's say:
private String naam;
private String date;
In the naam variable I want to store the first <td> tag (AMETHYST),
and in the date variable I want to put the fifth <td> tag (19-03-2014).
Is there any way to do this? Thanks a lot for any help.

One way to do it would be to read the elements at the specified index:
String naam = bbd.getElementsByTag("td").get(0).text();
String date = bbd.getElementsByTag("td").get(4).text();
System.out.println(naam + " " + date);
Gives,
AMETHYST 19-03-2014  spil 1
EDIT:
Since the td contains &nbsp;spil 1, you will see that retrieved too. If you want to eliminate it and its presence is consistent, then:
System.out.println(naam + " " + date.substring(0, date.indexOf('\u00A0') - 1));
Gives,
AMETHYST 19-03-2014
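The substring call above depends on the cell always containing a non-breaking space with exactly one ordinary space before it. If the non-breaking space is missing, indexOf returns -1 and substring throws a StringIndexOutOfBoundsException. A slightly more forgiving sketch, using only the standard library, is to treat non-breaking spaces as ordinary spaces and keep everything up to the first one. The class name DateCell and the method dateOnly are invented for this illustration.

```java
// Hypothetical helper class; uses only java.lang.
final class DateCell {

    // Returns the leading token of the cell text, i.e. everything before
    // the first space or non-breaking space (U+00A0).
    static String dateOnly(String cellText) {
        String cleaned = cellText.replace('\u00A0', ' ').trim();
        int space = cleaned.indexOf(' ');
        return space < 0 ? cleaned : cleaned.substring(0, space);
    }

    public static void main(String[] args) {
        System.out.println(dateOnly("19-03-2014 \u00A0spil 1"));
        System.out.println(dateOnly("19-03-2014"));
    }
}
```

This still assumes the date itself contains no spaces, which holds for the dd-MM-yyyy values shown in the table.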
EDIT 2: Based on OP's query on getting the collection of all 1st tds within the table use something like:
Elements tds = table1.select(" > tbody > tr > td:eq(0)");
for (Element el : tds) {
    System.out.println(el.text());
}
Where > tbody > tr > td:eq(0) pulls out the 0th index td against every tr encountered within your table1
Output,
AQUAPOLIS
AQUAPOLIS
IMPERIAL 7
CHIMO
...
For more information on the selector syntax, refer to the jsoup selector documentation.
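To get both the name and the date for every row rather than only the first cell, one option is to loop over the rows and index into each row's cells. The sketch below assumes the column layout shown in the question (name in the first cell, start date in the fifth) and skips the header row because it contains th cells only. The names BargeRows, firstToken and nameAndDate are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of jsoup.
final class BargeRows {

    // Everything before the first space or non-breaking space.
    static String firstToken(String text) {
        String cleaned = text.replace('\u00A0', ' ').trim();
        int space = cleaned.indexOf(' ');
        return space < 0 ? cleaned : cleaned.substring(0, space);
    }

    // Returns one "name date" string per data row of the first table.
    static List<String> nameAndDate(String html) {
        Document doc = Jsoup.parse(html);
        List<String> result = new ArrayList<>();
        Element table = doc.select("table").first();
        if (table == null) {
            return result; // no table in the document
        }
        for (Element row : table.select("tr")) {
            Elements cells = row.select("td");
            if (cells.size() < 5) {
                continue; // header row (th cells) or incomplete row
            }
            result.add(cells.get(0).text() + " " + firstToken(cells.get(4).text()));
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<table><tbody>"
                + "<tr><th>Name</th><th>Bargeno.</th><th>Reported present</th>"
                + "<th>Busy</th><th>Starting</th><th>Harbour</th></tr>"
                + "<tr><td>AMETHYST</td><td>2327085</td><td>*</td><td>Busy</td>"
                + "<td>19-03-2014 &nbsp;spil 1</td><td>HH</td></tr>"
                + "</tbody></table>";
        nameAndDate(html).forEach(System.out::println);
    }
}
```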

Jsoup - read one by one

I started using Jsoup recently. I need to list some elements in an HTML source. For example:
<table class="list">
<tr>
<td class="year" colspan="5">2012</td>
</tr>
<tr>
<td class="code">COMP0348</td>
<td class="name">Software Engineering</td>
</tr>
<tr>
<td class="code">COMP0734</td>
<td class="name">System Information</td>
</tr>
<tr>
<td class="year" colspan="5">2013</td>
</tr>
<tr>
<td class="code">COMP999</td>
<td class="name">Windows</td>
</tr>
</table>
This is what I want:
2012
Comp0348 Software Engineering
COMP0734 System Information
2013
COMP999 Windows
But my code doesn't list them one by one. It prints one string containing all the "year" values first, then another line containing all the "code" values, and then another line containing all the "name" values.
Like:
2012
Comp0348 COMP0734 COMP999
Software Engineering System Information Windows
How can I do this?

I guess you are only selecting the tags by criteria, without following the structure of the table.
Try this instead:
Document doc = ...
Element table = doc.select("table.list").first(); // select the table
for( Element element : table.select("tr") ) // select all 'tr' of the table
{
    final Elements td = element.select("td.year"); // select the 'td' with 'year' class
    if( !td.isEmpty() ) // if it's the one with the 'year' class
    {
        final String year = td.first().text(); // get year
        System.out.println(year);
    }
    else // if it's another 'tr' tag containing the 'code' and 'name' element
    {
        final String code = element.select("td.code").first().text(); // get code
        final String name = element.select("td.name").first().text(); // get name
        System.out.println(code + " " + name);
    }
}
Output (using your html):
2012
COMP0348 Software Engineering
COMP0734 System Information
2013
COMP999 Windows
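If you need the courses grouped by year in memory instead of printed directly, the same row-by-row walk can fill a map. This is only a sketch under the assumption that each year row comes before the course rows that belong to it. The names CoursesByYear and coursesByYear are made up for illustration, and a LinkedHashMap is used so the years keep their document order.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, not part of jsoup.
final class CoursesByYear {

    // Maps each year to its "code name" entries, in document order.
    static Map<String, List<String>> coursesByYear(String html) {
        Document doc = Jsoup.parse(html);
        Map<String, List<String>> byYear = new LinkedHashMap<>();
        String currentYear = null;
        for (Element row : doc.select("table.list tr")) {
            Element year = row.select("td.year").first();
            if (year != null) {
                // A year row starts a new group
                currentYear = year.text();
                byYear.putIfAbsent(currentYear, new ArrayList<>());
                continue;
            }
            Element code = row.select("td.code").first();
            Element name = row.select("td.name").first();
            if (currentYear != null && code != null && name != null) {
                byYear.get(currentYear).add(code.text() + " " + name.text());
            }
        }
        return byYear;
    }

    public static void main(String[] args) {
        String html = "<table class=\"list\">"
                + "<tr><td class=\"year\" colspan=\"5\">2012</td></tr>"
                + "<tr><td class=\"code\">COMP0348</td><td class=\"name\">Software Engineering</td></tr>"
                + "<tr><td class=\"year\" colspan=\"5\">2013</td></tr>"
                + "<tr><td class=\"code\">COMP999</td><td class=\"name\">Windows</td></tr>"
                + "</table>";
        System.out.println(coursesByYear(html));
    }
}
```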

Replacing text inside tags using Jsoup

<table width="100%" border="0" cellpadding="0" cellspacing="1" class="table_border" id="center_table">
<tbody>
<tr>
<td width="25%" class="heading_table_top">S. No.</td>
<td width="45%" class="heading_table_top">
Booking Status (Coach No , Berth No., Quota)
</td>
<td width="30%" class="heading_table_top">
* Current Status (Coach No , Berth No.)
</td>
</tr>
</tbody>
</table>
I scrape a webpage and store the response in a string.
I then parse it into a jsoup Document:
Document doc = Jsoup.parse(result);
Then I select the table using
Element table=doc.select("table[id=center_table]").first();
Now I need to replace the cell text "Booking Status (Coach No , Berth No., Quota)" with "Booking Status" using jsoup. Could anybody help?
I tried
table.children().text().replaceAll(RegEx to select the text?????, "Booking Status");

Elements tds = doc.select("table[id=center_table] td"); // select the tds from your table
for (Element td : tds) { // loop through them
    if (td.text().contains("Booking Status")) { // found the one you want
        td.text("Booking Status"); // replace with your text
    }
}
Then you can use doc.toString() to get the HTML back as text, to save to disk, send to a WebView, or whatever else you want to do with it.

Elements tablecells = doc.select("table tbody tr td");
will give you 3 cells.
Use a loop to get each element with
Element e = tablecells.get(index);
Use e.text() to get the String.
Compare or replace strings with String.equals(), String.contains(), String.replace(), then write the result back with e.text(newValue).
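The attempt in the question has no effect on the document because text() returns a copy of the text as a String, and replaceAll works on that copy only. Here is a rough sketch of the regex idea applied per cell and written back with text(String). The class and method names (HeadingCleaner, shortenBookingStatus) are invented, and the pattern assumes the part to drop is a single trailing parenthetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of jsoup.
final class HeadingCleaner {

    // Strips a trailing "(...)" from the cell that starts with "Booking Status"
    // and returns the text of all cells in the table afterwards.
    static List<String> shortenBookingStatus(String html) {
        Document doc = Jsoup.parse(html);
        List<String> texts = new ArrayList<>();
        for (Element td : doc.select("table#center_table td")) {
            String text = td.text();
            if (text.startsWith("Booking Status")) {
                // Remove one trailing parenthetical and write the result back
                td.text(text.replaceFirst("\\s*\\([^)]*\\)\\s*$", ""));
            }
            texts.add(td.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        String html = "<table id=\"center_table\"><tbody><tr>"
                + "<td class=\"heading_table_top\">S. No.</td>"
                + "<td class=\"heading_table_top\"> Booking Status (Coach No , Berth No., Quota) </td>"
                + "<td class=\"heading_table_top\"> * Current Status (Coach No , Berth No.) </td>"
                + "</tr></tbody></table>";
        shortenBookingStatus(html).forEach(System.out::println);
    }
}
```

Matching on startsWith leaves the "Current Status" cell untouched even though it also ends in a parenthetical.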
