Jsoup - read one by one - java

I've recently started using Jsoup. I need to list some elements from an HTML source. For example:
<table class="list">
<tr>
<td class="year" colspan="5">2012</td>
</tr>
<tr>
<td class="code">COMP0348</td>
<td class="name">Software Engineering</td>
</tr>
<tr>
<td class="code">COMP0734</td>
<td class="name">System Information</td>
</tr>
<tr>
<td class="year" colspan="5">2013</td>
</tr>
<tr>
<td class="code">COMP999</td>
<td class="name">Windows</td>
</tr>
</table>
This is what I want:
2012
COMP0348 Software Engineering
COMP0734 System Information
2013
COMP999 Windows
But my code doesn't list them one by one. It prints one line containing every "year", then another line containing every "code", and then another line containing every "name".
Like:
2012
COMP0348 COMP0734 COMP999
Software Engineering System Information Windows
How can I do this?

I guess you are selecting the tags only by criteria, without following the structure of the table.
But see here:
Document doc = ...
Element table = doc.select("table.list").first(); // select the table
for( Element element : table.select("tr") ) // select all 'tr' of the table
{
    final Elements td = element.select("td.year"); // select the 'td' with 'year' class
    if( !td.isEmpty() ) // if it's the one with the 'year' class
    {
        final String year = td.first().text(); // get year
        System.out.println(year);
    }
    else // if it's another 'tr' tag containing the 'code' and 'name' element
    {
        final String code = element.select("td.code").first().text(); // get code
        final String name = element.select("td.name").first().text(); // get name
        System.out.println(code + " " + name);
    }
}
Output (using your html):
2012
COMP0348 Software Engineering
COMP0734 System Information
2013
COMP999 Windows

Related

HREF + TEXT with Jsoup

I have the following HTML page:
</div><div id="page_content_list01" class="grid_12">
<h2><strong class="floatleft">TEXT1</strong></h2><br>
<table>
<tbody>
<tr>
<th class="no_width">
<p class="floatleft">Attachments:</p>
</th>
<td class="link_azure">
<a target="_blank" href="http://www.example.com">TEXT2</a><br/>
</td>
</tr>
</tbody>
</table><h2><strong class="floatleft">TEXT3</strong></h2><br>
<table>
<tbody>
<tr>
<th class="no_width">
<p class="floatleft">Atachments:</p>
</th>
<td class="link_azure">
<a target="_blank" href="http://www.example2.com">TEXT4</a><br/>
</td>
</tr>
</tbody>
</table><h2><strong class="floatleft">TEXT5</strong></h2><br>
<table>
<tbody>
<tr>
Currently I'm doing:
Elements rows = document.select("div#page_content_list01");
Now I want to select the "TEXT" headings and the links. I want to make clickable links, so I'm using:
for (Element eleme : rows) {
    Elements elements = eleme.select("a");
    for (Element elem : elements) {
        String url = elem.attr("href");
        String title = elem.text();
    }
}
and I'm getting:
url = "http://www.example.com";
title = "TEXT2";
and it's ok, but in this way I can't read "TEXT1" and "TEXT3".
Can someone help me please?
I think you need to work on the selectors. First, your primary selector
Elements rows = document.select("div#page_content_list01");
will return a list of only ONE element, since you actually select the div, not the tables or table rows. I would instead do this to get all relevant info:
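Elements tables = document.select("div#page_content_list01>table");
for (Element table : tables) {
    // Walk back over siblings such as <br> until the preceding <h2> is reached
    Element h2 = table.previousElementSibling();
    while (h2 != null && !h2.tagName().equals("h2")) {
        h2 = h2.previousElementSibling();
    }
    String titleStr = (h2 != null) ? h2.text() : "";
    Element a = table.select("a").first();
    String linkStr = (a != null) ? a.attr("href") : "";
}
Note that the text in the h2 elements is on the same level as the table, not inside a common div, which is why I navigate through the previous siblings. In the posted HTML a <br> sits between each h2 and its table, so the loop steps back until it reaches the h2. Also note that I wrote this off the top of my head and it is untested. You should get the idea though.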

Jsoup query, only parse specific elements

I'm trying to extract some data (see HTML below). I would like to extract the people who are in HR, but only their first and last names.
HTML:
<tbody>
<tr>
<td>Peter</td>
<td>Smith</td>
<td>35</td>
<td>HR</td>
</tr>
<tr>
<td>Paul</td>
<td>Roberts</td>
<td>47</td>
<td>Legal</td>
</tr>
<tr>
<td>James</td>
<td>Griffin </td>
<td>23</td>
<td>HR</td>
</tr>
</tbody>
What I want to extract:
Peter Smith
James Griffin
What I have so far:
public class Extract {
    public static void main(String[] args) throws IOException {
        Document Page = Jsoup.connect("URL").get(); // pick up html
        Element List = Page.select("tbody").first();
        Elements Info = List.select("tr");
        for (Element value : Info)
        {
            System.out.println(value.select("td").first()); // first <td> ... </td>
            System.out.println(value.select("td").second() + "\n"); // ??? Trying to take the second <td> ... </td>
        }
    }
}
I would suggest putting a class on every td that holds a first name or a last name, like:
<td class="first-name">Peter</td>
<td class="last-name">Smith</td>
<td>35</td>
<td>HR</td>
Then call Jsoup's select within the for loop like:
Elements firstNames = value.select(".first-name");
Elements lastNames = value.select(".last-name");
Or something along those lines. The point is that selecting by class is better and ensures you get nothing but the names.
If you don't control the input, you can also use the :eq(n) selector, which matches an element by its zero-based sibling index:
Elements firstNames = value.select("td:eq(0)");
Elements lastNames = value.select("td:eq(1)");
However, this requires you to be sure the information is always in the same order.
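Neither snippet filters by department yet. A minimal sketch of the whole task, assuming the department is always the fourth cell of each row. The class name HrNames, the method hrNames and the inline sample HTML are made up for illustration, and the rows are wrapped in a <table> because the HTML parser drops bare <tr> and <td> tags that appear outside a table:

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class HrNames {
    // Sample rows copied from the question, wrapped in <table> so the parser keeps them.
    static final String SAMPLE_HTML =
            "<table><tbody>"
            + "<tr><td>Peter</td><td>Smith</td><td>35</td><td>HR</td></tr>"
            + "<tr><td>Paul</td><td>Roberts</td><td>47</td><td>Legal</td></tr>"
            + "<tr><td>James</td><td>Griffin </td><td>23</td><td>HR</td></tr>"
            + "</tbody></table>";

    // Returns "first last" for every row whose fourth cell is exactly "HR".
    static List<String> hrNames(String html) {
        Document page = Jsoup.parse(html);
        List<String> names = new ArrayList<>();
        for (Element row : page.select("tbody > tr")) {
            Elements cells = row.select("td");
            // Skip header rows or malformed rows with fewer than four cells.
            if (cells.size() < 4 || !"HR".equals(cells.get(3).text())) {
                continue;
            }
            // text() trims surrounding whitespace, so "Griffin " becomes "Griffin".
            names.add(cells.get(0).text() + " " + cells.get(1).text());
        }
        return names;
    }

    public static void main(String[] args) {
        hrNames(SAMPLE_HTML).forEach(System.out::println);
    }
}
```

Checking the cell count first keeps the loop from throwing on rows that are shorter than expected.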

How to get specific tags value from HTML table

First, I want to apologise for my English. I am new to programming in Java and also to Jsoup. I want to get some data from a website. The information on the website is given in an HTML table, and I don't need all the fields from the table. I use this:
Document doc = Jsoup.connect("http://www.emo.nl/barges/en.html")
        .data("query", "Java")
        .userAgent("Mozilla")
        .cookie("auth", "token")
        .timeout(3000)
        .post();
Element table1 = doc.select("table").first();//.getElementsByTag("td");//.getElementsByTag("td")
String body = table1.toString();
Document docb = Jsoup.parseBodyFragment(body);
Element bbd = docb.body();
String hhk = bbd.toString();
System.out.println(hhk);
The result of this code gives me the whole table as a string, as follows:
<body>
<table>
<tbody>
<tr>
<th>Name</th>
<th>Bargeno.</th>
<th>Reported present</th>
<th>Busy</th>
<th>Starting</th>
<th>Harbour</th>
</tr>
<tr>
<td>AMETHYST</td>
<td>2327085</td>
<td>*</td>
<td>Busy</td>
<td>19-03-2014 spil 1</td>
<td>HH</td>
</tr>
<tr>
<td>AMETHYST 2</td>
<td>2327086</td>
<td>*</td>
<td>Busy</td>
<td>19-03-2014 spil 1</td>
<td>HH</td>
</tr>
<tr>
<td>AQUAPOLIS</td>
<td>6105002</td>
<td>*</td>
<td> </td>
<td>19-03-2014 spil 1</td>
<td>HH</td>
</tr>
</tbody>
</table>
</body>
This is too much information for me. I want to make two variables, let's say:
private String naam;
private String date;
In the naam variable I want to store the first <td> tag (AMETHYST),
and in the date variable I want to put the fifth <td> tag (19-03-2014).
Is there any way to do this? Thanks a lot for any help.
One way to do it would be to read the elements at the specified index:
String naam = bbd.getElementsByTag("td").get(0).text();
String date = bbd.getElementsByTag("td").get(4).text();
System.out.println(naam + " " + date);
Gives,
AMETHYST 19-03-2014  spil 1
EDIT:
Since the td contains &nbsp; spil 1, that part gets retrieved too. If you want to strip it, and it is consistently present, then:
System.out.println(naam + " " + date.substring(0, date.indexOf('\u00A0') - 1));
Gives,
AMETHYST 19-03-2014
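The substring arithmetic depends on exactly where the non-breaking space sits in the cell text. A sketch of a less position-sensitive alternative, assuming the date is always the first token of the cell (DateCell and firstToken are hypothetical names): split on any run of ordinary whitespace or U+00A0 and keep the first piece.

```java
class DateCell {
    // Returns the text before the first run of whitespace or non-breaking spaces.
    static String firstToken(String cellText) {
        // \s does not match U+00A0 by default, so it is listed explicitly.
        String[] parts = cellText.trim().split("[\\s\\u00A0]+", 2);
        return parts[0];
    }

    public static void main(String[] args) {
        System.out.println(firstToken("19-03-2014\u00A0 spil 1"));
    }
}
```

This gives the same result whether the non-breaking space comes before or after the ordinary space.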
EDIT 2: Based on OP's query on getting the collection of all 1st tds within the table use something like:
Elements tds = table1.select(" > tbody > tr > td:eq(0)");
for (Element el : tds) {
System.out.println(el.text());
}
Here > tbody > tr > td:eq(0) selects the td at index 0 of every tr found within your table1.
Output,
AQUAPOLIS
AQUAPOLIS
IMPERIAL 7
CHIMO
...
For more information on the selector syntax, refer to the Jsoup selector-syntax documentation.
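Combining index-based access with a per-row loop, a sketch that collects the first and fifth cell of every data row, assuming each data row has at least five cells as in the table shown in the question. BargeRows, nameAndStart and the shortened sample HTML are illustrative and not taken from the site:

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class BargeRows {
    // A header row and two data rows in the shape of the table from the question.
    static final String SAMPLE_HTML =
            "<table><tbody>"
            + "<tr><th>Name</th><th>Bargeno.</th><th>Reported present</th>"
            + "<th>Busy</th><th>Starting</th><th>Harbour</th></tr>"
            + "<tr><td>AMETHYST</td><td>2327085</td><td>*</td><td>Busy</td>"
            + "<td>19-03-2014 spil 1</td><td>HH</td></tr>"
            + "<tr><td>AQUAPOLIS</td><td>6105002</td><td>*</td><td></td>"
            + "<td>19-03-2014 spil 1</td><td>HH</td></tr>"
            + "</tbody></table>";

    // Returns {name, starting} pairs, one per row that has at least five <td> cells.
    static List<String[]> nameAndStart(String html) {
        List<String[]> result = new ArrayList<>();
        for (Element row : Jsoup.parse(html).select("table tr")) {
            Elements cells = row.select("td");
            if (cells.size() < 5) {
                continue; // header row (<th> only) or incomplete row
            }
            result.add(new String[] {cells.get(0).text(), cells.get(4).text()});
        }
        return result;
    }

    public static void main(String[] args) {
        for (String[] pair : nameAndStart(SAMPLE_HTML)) {
            System.out.println(pair[0] + " " + pair[1]);
        }
    }
}
```

Working row by row keeps each name paired with the date from the same row, which a flat list of all td elements does not guarantee.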

Parsing table data with jsoup

I am using jsoup in my Android app to parse my HTML code, but now I need to parse table data and I cannot get it to work. I have tried many ways without success, so I want to try my luck here in case anyone has experience with this.
Here is part of my html:
<div id="editacia_jedla">
<h2>My header</h2>
<h3>My sub header</h3>
<table border="0" class="jedalny_listok_tabulka" cellpadding="2" cellspacing="1">
<tr>
<td width="100" class="menu_nazov neparna" align="left">Food Menu 1</td>
<td class="jedlo neparna" align="left">vegetable and beef
<div class="jedlo_box_alergeny">Allergens: 1, 3</div>
</td>
</tr>
<tr>
<td width="100" class="menu_nazov parna" align="left">Food Menu 2</td>
<td class="jedlo parna" align="left">Potato salad and pork
<div class="jedlo_box_alergeny">Allergens: 6</div>
</td>
</tr>
</table>
etc
</div>
My java/android code:
try {
    String tableHtmlCode = "";
    Document fullHtmlDocument = Jsoup.connect(urlOfFoodDay).get();
    Element elm1 = fullHtmlDocument.select("#editacia_jedla").first();
    for (Element element : elm1.children())
    {
        tableHtmlCode += element.getElementsByIndexEquals(2); // this set table content because 0=h2, 1=h3
    }
    Document parsedTableDocument = Jsoup.parse(tableHtmlCode);
    // Element th = parsedTableDocument.select("td[class=jedlo neparna]").first(); THIS IS BAD
    String foodContent = "";
    String foodAllergens = "";
}
So now I want to extract the text vegetable and beef and save it to the string foodContent, and save the numbers 1, 3 (together) from the div with class jedlo_box_alergeny to the string foodAllergens. Can someone help? I would be very grateful for any ideas.
Select the table with class jedalny_listok_tabulka and loop over its td tags.
Each td tag is the parent of the a (link) tags that hold the allergen values. Hence, you loop over the a elements to get your numbers, something like:
Elements myElements = doc.getElementsByClass("jedalny_listok_tabulka")
        .first().getElementsByTag("td");
for (Element element : myElements) {
    if (element.className().contains("jedlo")) {
        String foodContent = element.ownText();
        String foodAllergen = "";
        for (Element href : element.getElementsByTag("a")) {
            foodAllergen += " " + href.text();
        }
        System.out.println(foodContent + " : " + foodAllergen);
    }
}
Output:
vegetable and beef : 1 3
Potato salad and pork : 6
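In the HTML as posted in the question, the allergens appear as plain text inside the div with class jedlo_box_alergeny rather than in a elements. A sketch for that markup, assuming the "Allergens:" prefix is always present (FoodMenu, parse and the inline sample HTML are illustrative names and data):

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class FoodMenu {
    // Trimmed copy of the table from the question.
    static final String SAMPLE_HTML =
            "<div id=\"editacia_jedla\"><table class=\"jedalny_listok_tabulka\">"
            + "<tr><td class=\"menu_nazov neparna\">Food Menu 1</td>"
            + "<td class=\"jedlo neparna\">vegetable and beef"
            + "<div class=\"jedlo_box_alergeny\">Allergens: 1, 3</div></td></tr>"
            + "<tr><td class=\"menu_nazov parna\">Food Menu 2</td>"
            + "<td class=\"jedlo parna\">Potato salad and pork"
            + "<div class=\"jedlo_box_alergeny\">Allergens: 6</div></td></tr>"
            + "</table></div>";

    // Returns one "food : allergens" line per td.jedlo cell.
    static List<String> parse(String html) {
        List<String> lines = new ArrayList<>();
        for (Element cell : Jsoup.parse(html).select("table.jedalny_listok_tabulka td.jedlo")) {
            // ownText() is the cell's direct text, without the nested div.
            String foodContent = cell.ownText().trim();
            // text() of an empty selection is "", so a missing div is harmless.
            String foodAllergens = cell.select("div.jedlo_box_alergeny").text()
                    .replaceFirst("^Allergens:\\s*", "");
            lines.add(foodContent + " : " + foodAllergens);
        }
        return lines;
    }

    public static void main(String[] args) {
        parse(SAMPLE_HTML).forEach(System.out::println);
    }
}
```

Selecting td.jedlo matches on the single class name, so it covers both the "jedlo neparna" and "jedlo parna" cells.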

jsoup to get a particular element in Table

I have an HTML page that I want to extract information from:
<table class="students">
<tbody>
<tr class="rz" style="color:red;" onclick="location.href='//andy.pvt.com';">
<td>
<a title="Display Andy website,Andy" href="//andy.pvt.com">15</a>
</td>
<td>Andy jr</td>
<td align="right">44.31</td>
<td align="right">23.79</td>
<td align="right">57</td>
<td align="right">1,164,700</td>
<td align="right">0.12</td>
<td align="center">
<td align="left">0.99</td>
<td align="right">
</tr>
I want to get Andy, 15, and andy.pvt.com.
I am able to extract this table using doc.select(table).get
but I am not able to extract the information I am looking for.
How do I write the "tables.select("xxxx");" call?
Can you please help me with the xxxx that I am missing?
You state:
I tried tables = doc.select("table").get(0); then tables.select("a title).
You want something more along the lines of
tables.select("a[href]").attr("href"); // to get your String
and
tables.select("a[href]").text(); // to get your number
e.g.,
Elements tables = doc.select("table");
String hrefAttr = tables.select("a[href]").attr("href");
System.out.println("href attribute: " + hrefAttr);
String number = tables.select("a[href]").text();
System.out.println("number: " + number);
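The snippet returns the link and the number but not the name. A sketch that reads all three from one row, assuming the name is always the second cell (StudentRow, describe and the trimmed sample HTML are illustrative):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class StudentRow {
    // First three cells of the row from the question, with the table closed properly.
    static final String SAMPLE_HTML =
            "<table class=\"students\"><tbody><tr class=\"rz\">"
            + "<td><a title=\"Display Andy website,Andy\" href=\"//andy.pvt.com\">15</a></td>"
            + "<td>Andy jr</td><td align=\"right\">44.31</td>"
            + "</tr></tbody></table>";

    // Returns "name, number, href" for the first row of table.students, or null if absent.
    static String describe(String html) {
        Element row = Jsoup.parse(html).select("table.students tr").first();
        if (row == null) {
            return null;
        }
        Element link = row.select("td a[href]").first();
        Element nameCell = row.select("td:eq(1)").first(); // second cell, zero-based index
        if (link == null || nameCell == null) {
            return null;
        }
        return nameCell.text() + ", " + link.text() + ", " + link.attr("href");
    }

    public static void main(String[] args) {
        System.out.println(describe(SAMPLE_HTML));
    }
}
```

Note that attr("href") returns the attribute exactly as written, including the leading "//".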
