Selecting innermost child of an element Jsoup

I am attempting to scrape the following HTML:
<table>
<tr>
<td class="cellRight" style="cursor:pointer;">
<table cellpadding="0" cellspacing="0" width="100%">
<tr>
<td class="cellRight" style="border:0;color:#0066CC;"
title="View summary" width="70%">92%</td>
<td class="cellRight" style="border:0;" width="30%">
</td>
</tr>
</table>
</td>
</tr>
<tr class="listroweven">
<td class="cellLeft" nowrap><span class="categorytab" onclick=
"showAssignmentsByMPAndCourse('08/03/2015','58100:6');" title=
"Display Assignments for Art 5 with Ms. Martinho"><span style=
"text-decoration: underline">58100/6 - Art 5 with Ms.
Martinho</span></span></td>
<td class="cellLeft" nowrap>
Martinho, Suzette<br>
<b>Email:</b> <a href="mailto:smartinho#mtsd.us" style=
"text-decoration:none"><img alt="" border="0" src=
"/genesis/images/labelIcon.png" title=
"Send e-mail to teacher"></a>
</td>
<td class="cellRight" onclick=
"window.location.href = '/genesis/parents?tab1=studentdata&tab2=gradebook&tab3=coursesummary&studentid=100916&action=form&courseCode=58100&courseSection=6&mp=MP4';"
style="cursor:pointer;">
<table cellpadding="0" cellspacing="0" width="100%">
<tr>
<td class="cellCenter"><span style=
"font-style:italic;color:brown;font-size: 8pt;">No
Grades</span></td>
</tr>
</table>
</td>
</tr>
<tr class="listrowodd">
<td class="cellLeft" nowrap><span class="categorytab" onclick=
"showAssignmentsByMPAndCourse('08/03/2015','58200:10');" title=
"Display Assignments for Family and Consumer Sciences 5 with Sheerin">
<span style="text-decoration: underline">58200/10 - Family and
Consumer Sciences 5 with Sheerin</span></span></td>
<td class="cellLeft" nowrap>
Sheerin, Susan<br>
<b>Email:</b> <a href="mailto:ssheerin#mtsd.us" style=
"text-decoration:none"><img alt="" border="0" src=
"/genesis/images/labelIcon.png" title=
"Send e-mail to teacher"></a>
</td>
<td class="cellRight" style="cursor:pointer;">
<table cellpadding="0" cellspacing="0" width="100%">
<tr>
<td class="cellCenter"><span style=
"font-style:italic;color:brown;font-size: 8pt;">No
Grades</span></td>
</tr>
</table>
</td>
</tr>
</table>
I am trying to extract each student's grade or, where no grade is present, the text "No Grades" that appears in the HTML instead. However, when I run a select such as the following:
doc.select("[class=cellRight]")
I get output in which every grade value is listed twice (because each one is nested inside two elements matching [class=cellRight]), while the "No Grades" entries are listed the expected number of times. So my question is: how can I select only the innermost element in the document that matches [class=cellRight]? (I have already dealt with the blank values.) All help is appreciated!

There are many ways to do this.
One would be this: for each "cellRight" element, test whether any of its parents also carries that class, and discard the element if one does:
import java.util.ArrayList;
import java.util.List;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
List<Element> keepList = new ArrayList<>();
Elements els = doc.select(".cellRight");
for (Element el : els) {
    boolean keep = true;
    // Walk up the ancestors looking for another "cellRight" element.
    for (Element parentEl : el.parents()) {
        if (parentEl.hasClass("cellRight")) {
            // An ancestor has the class as well -> discard this nested element.
            keep = false;
            break;
        }
    }
    if (keep) {
        keepList.add(el);
    }
}
// keepList now holds the "cellRight" elements that are not nested inside another one,
// i.e. the outermost cell of each nesting; its text() still contains the grade, once.
Note that I wrote this from memory without a compiler, so there might be spelling or syntax errors.
One more note: your use of "[class=cellRight]" works only when the class attribute is exactly "cellRight". With multiple classes in arbitrary order (which is entirely to be expected), it is better to use the dot syntax ".cellRight".
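The loop above keeps elements that have no "cellRight" ancestor. The opposite filter, keeping elements that have no "cellRight" descendant, can be sketched as follows. To stay runnable without extra libraries, this sketch uses the JDK's built-in XPath engine on a small well-formed fragment rather than Jsoup; the class name InnermostCells, the method innermostText and the sample markup are made up for illustration. Jsoup's selector syntax documents :has and :not, so a selector along the lines of .cellRight:not(:has(.cellRight)) may express the same idea, though that is not tested here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class InnermostCells {
    // Keep a cellRight cell only if it contains no other cellRight cell.
    static final String QUERY =
            "//td[@class='cellRight'][not(.//td[@class='cellRight'])]";

    // Hypothetical helper: returns the trimmed text of each innermost match.
    static List<String> innermostText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(QUERY, doc, XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent().trim());
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate the sample", e);
        }
    }

    public static void main(String[] args) {
        // Simplified, well-formed stand-in for the grade table in the question.
        String xml = "<table>"
                + "<tr><td class='cellRight'><table><tr>"
                + "<td class='cellRight'>92%</td><td class='cellRight'></td>"
                + "</tr></table></td></tr>"
                + "<tr><td class='cellRight'><table><tr>"
                + "<td class='cellCenter'>No Grades</td>"
                + "</tr></table></td></tr>"
                + "</table>";
        System.out.println(innermostText(xml));
    }
}
```

On this sample the filter returns the nested 92% cell, its empty sibling cell, and the cell that wraps "No Grades", so the blank entry still has to be skipped separately.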

Related

how to extract the href value using selenium in java

Here I want to extract the href value from the code below:
<table id="offers_table" class="fixed offers breakword" summary="" width="100%" cellspacing="0">
<tbody>
<tr>
<tr>
<tr>
<td class="offer onclick ">
<table class="fixed breakword ad_id1ezENl" summary="Ad" data-photos="2" width="100%" cellspacing="0">
<tbody>
<tr>
<td rowspan="2" width="164">
<div class="space">
<span class="rel inlblk detailcloudbox">
<a class="thumb vtop inlblk rel tdnone linkWithHash scale5 detailsLink"
href="https://www.olx.in/item/hyundai-accent-car-ID1ezENl.html#1a86c09693" title="">
</span>
</div>
</td>
<td valign="top">
<td class="wwnormal tright td-price" width="170" valign="top">
</tr>
<tr>
</tbody>
</table>
I have tried the code below, but it throws an error:
WebElement ele=driver.findElement(By.id("offers_table"));
WebElement href=ele.findElement(By.xpath("//tr[3]/span[@class='rel inlblk detailcloudbox']/a[@href]"));
System.out.println(href.getAttribute("href"));

You can use a cssSelector instead of XPath in this case. Try the code below:
String hrefvalue = driver.findElement(By.cssSelector("span.rel.inlblk.detailcloudbox > a")).getAttribute("href");

Try this xpath:
//a[@class='thumb vtop inlblk rel tdnone linkWithHash scale5 detailsLink']
[@href='https://www.olx.in/item/hyundai-accent-car-ID1ezENl.html#1a86c09693']
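One likely reason the XPath in the question finds nothing is the single slash between tr[3] and span: a single slash matches direct children only, while the span sits several levels below the row. The sketch below shows the difference using the JDK's own XPath engine on a cut-down, well-formed copy of the markup, not Selenium; the class name HrefLookup and the sample string are assumptions for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class HrefLookup {
    // Cut-down, well-formed stand-in for the offers table in the question.
    static final String SAMPLE =
            "<table id='offers_table'><tbody><tr/><tr/><tr><td>"
            + "<table><tbody><tr><td><div class='space'>"
            + "<span class='rel inlblk detailcloudbox'>"
            + "<a href='https://www.olx.in/item/hyundai-accent-car-ID1ezENl.html#1a86c09693'/>"
            + "</span></div></td></tr></tbody></table>"
            + "</td></tr></tbody></table>";

    // Returns the string value of the first match, or "" when nothing matches.
    static String evaluate(String query) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(SAMPLE)));
            return XPathFactory.newInstance().newXPath().evaluate(query, doc);
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate " + query, e);
        }
    }

    public static void main(String[] args) {
        // '/' needs the span to be a direct child of the third row: no match.
        System.out.println(evaluate(
                "//tr[3]/span[@class='rel inlblk detailcloudbox']/a/@href"));
        // '//' accepts the span at any depth below the third row.
        System.out.println(evaluate(
                "//tr[3]//span[@class='rel inlblk detailcloudbox']/a/@href"));
    }
}
```

In Selenium, an XPath that starts with // searches the whole document even when it is called on an element, so a leading .// is the usual way to search below ele only.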

java : hide or display table according to condition

I have a form with several inputs; each input returns a list of data that is displayed in a table on another HTML page. Each input has its own table to display its data. My task is to hide a table when the user has not entered the corresponding input.
Here is my code:
<!-- Country Table-->
<%for(int i = 0; i < countryList.length;i++){
if(countryList.length == 0)
break;
%>
<div class="box" align="center">
<table name="tab" align="center" class="gridtable">
<thead >
<tr>
<th style="width: 50%" scope="col">Entity Watch List Key</th>
<th style="width: 50%" scope="col">Watch List Name</th>
</tr>
</thead>
<tbody>
<tr>
<td style="width: 50%"><%out.println((String) (countryList[i].getEntityWatchListKey()));%></td>
<td style="width: 50%"><%out.println((String) (countryList[i].getEntityName()));%></td>
</tr>
</tbody>
</table>
</div>
<%}%>
I am using break to exit the loop so that the table is not displayed. Is that correct?

You can put this condition before the for loop:
if(countryList.length != 0)
or
if(countryList.length > 0)
Then you do not need the break.
Furthermore, the check inside your current for loop can never take effect: if the length of the array is 0, the loop condition i < countryList.length becomes 0 < 0, which is false, so the loop body is never entered and your if(countryList.length == 0) check is never reached.
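The point about the empty array can be checked outside a JSP. The following is a minimal sketch in plain Java; the class WatchListTable, its methods and the simplified markup are hypothetical stand-ins for the page in the question.

```java
class WatchListTable {
    // Hypothetical stand-in for the JSP: each row is {key, name}.
    static String render(String[][] rows) {
        if (rows.length == 0) {
            return ""; // nothing entered -> no table at all
        }
        StringBuilder html = new StringBuilder("<table><tbody>");
        for (String[] row : rows) {
            html.append("<tr><td>").append(row[0]).append("</td><td>")
                .append(row[1]).append("</td></tr>");
        }
        return html.append("</tbody></table>").toString();
    }

    // Counts loop iterations to show that an empty array never enters the body.
    static int iterations(String[][] rows) {
        int count = 0;
        for (int i = 0; i < rows.length; i++) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(iterations(new String[0][]));          // 0
        System.out.println(render(new String[0][]).isEmpty());    // true
        System.out.println(render(new String[][] {{"K1", "N1"}}));
    }
}
```

The guard sits outside the loop, so the loop only has to produce rows.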

Please modify your code:
<%if(countryList.length > 0){%>
<div class="box" align="center">
<table name="tab" align="center" class="gridtable">
<thead >
<tr>
<th style="width: 50%" scope="col">Entity Watch List Key</th>
<th style="width: 50%" scope="col">Watch List Name</th>
</tr>
</thead>
<tbody>
<%for(int i = 0; i < countryList.length;i++){%>
<tr>
<td style="width: 50%"><%out.println((String) (countryList[i].getEntityWatchListKey()));%></td>
<td style="width: 50%"><%out.println((String) (countryList[i].getEntityName()));%></td>
</tr>
<%}%>
</tbody>
</table>
</div>
<%}%>
As a matter of good practice, repeat the row, not the whole table. The outer check hides the table entirely when the list is empty.

Jsoup select not returning all elements

I am new to the Jsoup library. I have HTML like this:
<tr class="srrowns">
<td class="num"> <a name="y2015"> </a> 1 </td>
<td nowrap>CVE-2015-4004</td>
<td>119</td>
<td class="num"> <b style="color:red"> </b> </td>
<td> DoS Overflow +Info </td>
<td>2015-06-07</td>
<td>2015-06-08</td>
<td>
<div class="cvssbox" style="background-color:#ff8000">
8.5
</div></td>
<td align="center">None</td>
<td align="center">Remote</td>
<td align="center">Low</td>
<td align="center">Not required</td>
<td align="center">Partial</td>
<td align="center">None</td>
<td align="center">Complete</td>
</tr>
When I run element.select("td"), it returns:
<td class="num"> <a name="y2015"> </a> 1 </td>
<td nowrap>CVE-2015-4004</td>
<td>119</td>
<td class="num"> <b style="color:red"> </b> </td>
<td> DoS Overflow +Info </td>
<td>2015-06-07</td>
<td>2015-06-08</td>
<td>
<div class="cvssbox" style="background-color:#ff8000">
8.5
</div></td>
<td align="center">None</td>
<td align="center">Remote</td>
<td align="center">Low</td>
<td align="center">Not required</td>
<td align="center">Partial</td>
<td align="center">Complete</td>
Obviously, the <td align="center">None</td> before "Complete" has been dropped. Is there any way to get all of the items from the Jsoup selector?
My code in Scala looks something like this:
val connection = Jsoup.connect(url).get()
val treelist = connection.select("tr.srrowns:contains(CVE-2015-4001)")
val tree = treelist.select("td")
I just saw that Jsoup's select is implemented using a LinkedHashSet. My goal is to extract the text from each tag using Jsoup's text(). Is there a workaround for this, or do I have to write a parser just to get all nodes (including duplicates)?
Thank you very much.

Try this CSS selector:
tr.srrowns:has(td:contains(CVE-2015-4004)) > td
DEMO
http://try.jsoup.org/~vAgiHQY6TIJ5MSUzR-m_Y1GD5_U
SAMPLE CODE
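import scala.collection.JavaConverters._
val cve = "CVE-2015-4004"
val doc = Jsoup.connect(url).get()
val tds = doc.select("tr.srrowns:has(td:contains(" + cve + ")) > td")
// Elements is a java.util.List, so convert it before iterating in Scala.
for (td <- tds.asScala) {
  println(td.text())
}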
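The missing cell in the question is consistent with how a LinkedHashSet behaves: it keeps insertion order but stores only one copy of values that compare equal. Whether Jsoup elements compare equal by content depends on the Jsoup version and is an assumption here; in this sketch plain strings stand in for the last seven cells, and the class name DuplicateCells is made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

class DuplicateCells {
    // Passing values through a LinkedHashSet keeps their order but drops
    // every value that equals one already present.
    static List<String> throughSet(List<String> cells) {
        return new ArrayList<>(new LinkedHashSet<>(cells));
    }

    public static void main(String[] args) {
        List<String> cells = Arrays.asList(
                "None", "Remote", "Low", "Not required",
                "Partial", "None", "Complete");
        System.out.println(cells.size());              // 7
        System.out.println(throughSet(cells).size());  // 6, second "None" is gone
    }
}
```

A list-based traversal, by contrast, keeps both "None" cells.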

Getting Input tag id, value in webdriver using java

Html
<table id="tblRenewalList" class="adminlist dataTable" width="100%" cellspacing="1" cellpadding="1" border="1" style="margin-left: 0px; width: 100%;" aria-describedby="tblRenewalList_info">
<thead>
</thead>
<tbody role="alert" aria-live="polite" aria-relevant="all">
<tr class="odd">
<td class="alignCenter">
<input id="chkRenewal_868" class="chkPatent" type="checkbox" onclick="RenewalSelection(this)" companyid="33" value="868">
</td>
</tr>
</table>
With the above HTML, I want to scrape the id and value of the input.
The following is my Java code. When I run it, it returns empty values:
WebElement inputValues = driver.findElement(By
.xpath("//*[@id='tblRenewalList']/tbody/tr[1]/td[1]"));
String idValue = inputValues.getAttribute("id");
String ed2 = inputValues.getAttribute("value");
The following is my expected output:
id = chkRenewal_868
value = 868

The document isn't well-formed; I don't know whether that matters for WebDriver, but the XPath must select the input element, not the enclosing td:
//*[@id='tblRenewalList']/tbody/tr[1]/td[1]/input
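The difference between the two paths can be seen without a browser. The sketch below runs both against a well-formed copy of the table using the JDK's XPath engine, not WebDriver; the class name InputAttributes and the trimmed sample markup are assumptions for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class InputAttributes {
    // Well-formed, trimmed copy of the table from the question.
    static final String SAMPLE =
            "<table id='tblRenewalList'><tbody><tr class='odd'>"
            + "<td class='alignCenter'>"
            + "<input id='chkRenewal_868' class='chkPatent' type='checkbox'"
            + " companyid='33' value='868'/>"
            + "</td></tr></tbody></table>";

    // Returns the named attribute of the first element matched by the query.
    // The DOM API yields "" when the element lacks that attribute.
    static String attribute(String query, String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(SAMPLE)));
            Element match = (Element) XPathFactory.newInstance().newXPath()
                    .evaluate(query, doc, XPathConstants.NODE);
            return match.getAttribute(name);
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate " + query, e);
        }
    }

    public static void main(String[] args) {
        String cell = "//*[@id='tblRenewalList']/tbody/tr[1]/td[1]";
        // The td itself carries neither attribute.
        System.out.println("[" + attribute(cell, "id") + "]");
        // The input inside it carries both.
        System.out.println(attribute(cell + "/input", "id"));
        System.out.println(attribute(cell + "/input", "value"));
    }
}
```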

How to parse html and keep ALL line breaks?

I have a document that contains <br/>, <p>, and <table> elements.
I have been trying to parse this HTML using Jsoup and preserve the line breaks.
I tried many methods from similar questions, but without success:
FileInputStream in = new FileInputStream("C:............xxx.htm");
String htmlText = IOUtils.toString(in);
File file = new File("C:............xxx.txt") ;
PrintWriter pr = new PrintWriter(file) ;
String text = Jsoup.parse(htmlText.replaceAll("(?i)<br[^>]*>", "br2n")).text();
System.out.println(text.replaceAll("br2n", "\n"));
pr.println(text.replaceAll("br2n", "\n"));
// for (String line : htmlText.split("\n")) {
// String stripped = Jsoup.parse(line).text();
//
// System.out.println(stripped);
// pr.println(stripped);
//
// }
pr.close();
Here is a representative part of my HTML file (the original file starts with <html>, of course):
<table border="0" cellspacing="0" cellpadding="0" bgcolor="white"
width='650'>
<tr>
<td><font size="4"><br />
<b>The scientific explantion of the syndrom</b></font>
<table width='650' border="0" cellspacing="5" cellpadding="0">
<tr>
<td width='5%'> </td>
<td width='25%'> </td>
<td width='25%'> </td>
<td width='15%'> </td>
<td width='30%'> </td>
</tr>
<tr height="24">
<td align="left" nowrap="nowrap" colspan="3"><font size=
"3"><b>Recent Update</b></font></td>
<td align="left" nowrap="nowrap"><a name=
"9J003346248"></a><font size="3"><b>Issue:</b></font></td>
<td align="left"><font size="3">9569865248</font></td>
</tr>
<tr>
<td> </td>
<td align="left"><b>Locust:</b></td>
<td align="left" colspan="3">UYF78UIGK</td>
</tr>
</table>
<br/> The explanation above does not necc....... <p>
Blah ....
</p>
<table border="2" cellspacing="1" cellpadding="0" bgcolor="white"
width='750'>
<tr>
<td><font size="4"><br />
<b>Syndrom of the main ......</b></font>
<table width='650' border="0" cellspacing="5" cellpadding="0">
<tr>
<td width='5%'> </td>
<td width='25%'> </td>
<td width='25%'> </td>
<td width='15%'> </td>
<td width='30%'> </td>
</tr>
<tr height="24">
<td align="left" nowrap="nowrap" colspan="3"><font size=
"3"><b>Data</b></font></td>
<td align="left" nowrap="nowrap"><a name=
"9J003346248"></a><font size="3"><b>Issue:</b></font></td>
<td align="left"><font size="3">9509809248</font></td>
</tr>
<tr>
<td> </td>
<td align="left"><b>Locust:</b></td>
<td align="left" colspan="3">U344365GK</td>
</tr>
</table>
<br/> The explanation above does not necc....... <p>
Blah ....
</p>
I need to make sure that all rows in those tables appear one after another, the way they do in the original document. But I have multiple tables and other line-breaking elements. How can I do this using Jsoup? Is it possible to parse HTML and keep the lines more effectively using another API?

You almost had it right. Try this:
String text = Jsoup.parse(htmlText.replaceAll("(?i)</tr>", "</tr> br2n ").replaceAll("(?i)<br[^>]*>", "br2n").replaceAll("(?i)<p>", "<p> br2n ").replaceAll("(?i)</p>", "</p> br2n ")).text();
System.out.println(text.replaceAll("br2n", "\n"));
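The placeholder idea can be tried in isolation. In the sketch below a naive regex tag-stripper stands in for Jsoup's text(), which is an assumption made only to keep the example dependency-free; the class name LineBreakSketch and the sample input are made up. A real HTML parser may relocate text that sits directly between table rows, so results with Jsoup can differ from this sketch.

```java
class LineBreakSketch {
    // Marks every <br>, </tr>, <p> and </p> with the placeholder "br2n",
    // strips the remaining tags, then turns each placeholder into a newline.
    static String toText(String html) {
        String marked = html
                .replaceAll("(?i)<br[^>]*>", " br2n ")
                .replaceAll("(?i)</tr>", " br2n ")
                .replaceAll("(?i)</?p(\\s[^>]*)?>", " br2n ");
        // Naive stand-in for a parser's text(): drop tags, collapse whitespace.
        String plain = marked
                .replaceAll("<[^>]+>", " ")
                .replaceAll("\\s+", " ")
                .trim();
        return plain.replaceAll("\\s*br2n\\s*", "\n").trim();
    }

    public static void main(String[] args) {
        String html = "<table><tr><td>A</td><td>B</td></tr>"
                + "<tr><td>C</td></tr></table><br/>D<p>E</p>";
        System.out.println(toText(html));
    }
}
```

Each table row ends up on its own line, and consecutive break points (here the last row followed by a br) produce an empty line. Text that already contains the string br2n would be altered, so a less common placeholder is safer for real input.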
