Jsoup select elements after first element - java

I want to parse an HTML table with jsoup.
Here is the part of the HTML page I want to parse:
<tr>
<td class="dkHeading">A1</td>
<td class="dkHeading">A2</td>
<td class="dkHeading">A3</td>
<td class="dkHeading">A4</td>
<td class="dkHeading">A5</td>
<td class="dkHeading">A6</td>
<td class="dkHeading">A7</td>
</tr>
<tr id="RContents">
<td class="dkTextCenter">B1</td>
<td class="dkTextCenter">B2</td>
<td class="dkTextCenter">B3</td>
<td class="dkTextLeft">B4</td>
<td class="dkTextCenter">B5</td>
<td class="dkTextCenter">B6</td>
<td class="dkTextCenter">B7</td>
</tr>
<tr>
<td class="dkTextCenter">C1</td>
<td class="dkTextCenter">C2</td>
<td class="dkTextCenter">C3</td>
<td class="dkTextLeft">C4</td>
<td class="dkTextCenter">C5</td>
<td class="dkTextCenter">C6</td>
<td class="dkTextCenter">C7</td>
</tr>
<tr>
<td class="dkTextCenter">D1</td>
<td class="dkTextCenter">D2</td>
<td class="dkTextCenter">D3</td>
<td class="dkTextLeft">D4</td>
<td class="dkTextCenter">D5</td>
<td class="dkTextCenter">D6</td>
<td class="dkTextCenter">D7</td>
</tr>
How can I select all "tr" elements after (and including) the tr with id "RContents"?
I tried doc.select("tr[id=RContents] > tr"); but that didn't work.

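You can use the general sibling combinator ~, which matches the tr elements that follow the RContents row under the same parent:
doc.select("tr[id=RContents] ~ tr");
Note that ~ matches only the rows after the RContents row. To include that row itself, add tr[id=RContents] to the selector as a comma-separated alternative.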
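As a minimal sketch of the sibling selector, assuming jsoup is on the classpath: the class and method names (RowsFromMarker, rowsFromMarker) are made up for illustration, and the sample table is a shortened stand-in for the one in the question. The comma-separated alternative is what pulls in the RContents row itself.

```java
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

class RowsFromMarker {
    // Hypothetical helper: returns the text of the row with id "RContents"
    // and of every later sibling row, in document order.
    static List<String> rowsFromMarker(String html) {
        Document doc = Jsoup.parse(html);
        // "tr#RContents" matches the marker row itself;
        // "tr#RContents ~ tr" matches the tr siblings that come after it.
        Elements rows = doc.select("tr#RContents, tr#RContents ~ tr");
        return rows.eachText();
    }

    public static void main(String[] args) {
        // Shortened stand-in for the table in the question (one cell per row).
        String html = "<table>"
                + "<tr><td>A1</td></tr>"
                + "<tr id=\"RContents\"><td>B1</td></tr>"
                + "<tr><td>C1</td></tr>"
                + "<tr><td>D1</td></tr>"
                + "</table>";
        System.out.println(rowsFromMarker(html));
    }
}
```

The heading row before RContents is left out because neither alternative matches it.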

You can also select all tr elements and loop through them. Since the elements are returned in document order, you can try something like this:
Document document = Jsoup.parse("YOURHTML");
Elements elements = document.select("tr");
boolean start = false;
for (Element e : elements) {
    // id() returns an empty string when the attribute is missing
    if (e.id().equals("RContents")) {
        start = true;
    }
    if (start) {
        // all tr elements including id=RContents and after
    }
}

Related

how to extract the href value using selenium in java

Here I want to extract the href value from the code below:
<table id="offers_table" class="fixed offers breakword" summary="" width="100%" cellspacing="0">
<tbody>
<tr>
<tr>
<tr>
<td class="offer onclick ">
<table class="fixed breakword ad_id1ezENl" summary="Ad" data-photos="2" width="100%" cellspacing="0">
<tbody>
<tr>
<td rowspan="2" width="164">
<div class="space">
<span class="rel inlblk detailcloudbox">
<a class="thumb vtop inlblk rel tdnone linkWithHash scale5 detailsLink"
href="https://www.olx.in/item/hyundai-accent-car-ID1ezENl.html#1a86c09693" title="">
</span>
</div>
</td>
<td valign="top">
<td class="wwnormal tright td-price" width="170" valign="top">
</tr>
<tr>
</tbody>
</table>
I have tried the code below, but it throws an error:
WebElement ele = driver.findElement(By.id("offers_table"));
WebElement href = ele.findElement(By.xpath("//tr[3]/span[@class='rel inlblk detailcloudbox']/a[@href]"));
System.out.println(href.getAttribute("href"));
You can use a cssSelector instead of an xpath in this case. Try the code below:
String hrefvalue = driver.findElement(By.cssSelector("span.rel.inlblk.detailcloudbox > a")).getAttribute("href");
Try this xpath:
//a[@class='thumb vtop inlblk rel tdnone linkWithHash scale5 detailsLink']
[@href='https://www.olx.in/item/hyundai-accent-car-ID1ezENl.html#1a86c09693']

Selecting innermost child of an element Jsoup

I am attempting to scrape the following html:
<table>
<tr>
<td class="cellRight" style="cursor:pointer;">
<table cellpadding="0" cellspacing="0" width="100%">
<tr>
<td class="cellRight" style="border:0;color:#0066CC;"
title="View summary" width="70%">92%</td>
<td class="cellRight" style="border:0;" width="30%">
</td>
</tr>
</table>
</td>
</tr>
<tr class="listroweven">
<td class="cellLeft" nowrap><span class="categorytab" onclick=
"showAssignmentsByMPAndCourse('08/03/2015','58100:6');" title=
"Display Assignments for Art 5 with Ms. Martinho"><span style=
"text-decoration: underline">58100/6 - Art 5 with Ms.
Martinho</span></span></td>
<td class="cellLeft" nowrap>
Martinho, Suzette<br>
<b>Email:</b> <a href="mailto:smartinho@mtsd.us" style=
"text-decoration:none"><img alt="" border="0" src=
"/genesis/images/labelIcon.png" title=
"Send e-mail to teacher"></a>
</td>
<td class="cellRight" onclick=
"window.location.href = '/genesis/parents?tab1=studentdata&tab2=gradebook&tab3=coursesummary&studentid=100916&action=form&courseCode=58100&courseSection=6&mp=MP4';"
style="cursor:pointer;">
<table cellpadding="0" cellspacing="0" width="100%">
<tr>
<td class="cellCenter"><span style=
"font-style:italic;color:brown;font-size: 8pt;">No
Grades</span></td>
</tr>
</table>
</td>
</tr>
<tr class="listrowodd">
<td class="cellLeft" nowrap><span class="categorytab" onclick=
"showAssignmentsByMPAndCourse('08/03/2015','58200:10');" title=
"Display Assignments for Family and Consumer Sciences 5 with Sheerin">
<span style="text-decoration: underline">58200/10 - Family and
Consumer Sciences 5 with Sheerin</span></span></td>
<td class="cellLeft" nowrap>
Sheerin, Susan<br>
<b>Email:</b> <a href="mailto:ssheerin@mtsd.us" style=
"text-decoration:none"><img alt="" border="0" src=
"/genesis/images/labelIcon.png" title=
"Send e-mail to teacher"></a>
</td>
<td class="cellRight" style="cursor:pointer;">
<table cellpadding="0" cellspacing="0" width="100%">
<tr>
<td class="cellCenter"><span style=
"font-style:italic;color:brown;font-size: 8pt;">No
Grades</span></td>
</tr>
</table>
</td>
</tr>
</table>
I am trying to extract the student's grades or, if no grades are present, the text "No Grades", which appears in the HTML in that case. However, when I run a select such as the following:
doc.select("[class=cellRight]")
I get output in which every grade value is listed twice (because each one is nested within two elements matching [class=cellRight]), along with the expected number of "No Grades" entries. So my question is: how can I select only the innermost element in the document that matches [class=cellRight]? (I have already dealt with the issue of blank values.) All help is appreciated!
There are many ways to do this.
One option: for each "cellRight" element, check whether any of its ancestors also carries that class, and discard the element if one does. Note that this keeps the outermost matching element of each nested group, so its text() still yields each grade exactly once:
List<Element> keepList = new ArrayList<>();
Elements els = doc.select(".cellRight");
for (Element el : els) {
    boolean keep = true;
    for (Element parentEl : el.parents()) {
        if (parentEl.hasClass("cellRight")) {
            // an ancestor has the class as well -> discard!
            keep = false;
            break;
        }
    }
    if (keep) {
        keepList.add(el);
    }
}
// keepList now contains only the elements with no cellRight ancestor,
// i.e. the outermost element of each nested group
I wrote this from memory without compiling it, so there may be spelling or syntax errors.
One more note: your use of "[class=cellRight]" only works reliably if this is the element's single class. With multiple classes in arbitrary order (which is entirely to be expected), it is better to use the dot syntax ".cellRight".
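If you want the literally innermost matches instead, one possible sketch is to let the selector do the filtering with :not(:has(...)), which jsoup supports. This assumes jsoup is on the classpath. The class and method names are made up for illustration, and the sample HTML is a trimmed stand-in for the page in the question.

```java
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class InnermostCells {
    // Hypothetical helper: text of every .cellRight element that has
    // no .cellRight descendant, i.e. the innermost match of each group.
    static List<String> innermostCellTexts(String html) {
        Document doc = Jsoup.parse(html);
        // :has(...) looks at descendants only, so an element never excludes itself.
        // eachText() skips elements without any text, such as empty cells.
        return doc.select(".cellRight:not(:has(.cellRight))").eachText();
    }

    public static void main(String[] args) {
        // Trimmed stand-in for the page in the question.
        String html = "<table>"
                + "<tr><td class=\"cellRight\"><table><tr>"
                + "<td class=\"cellRight\">92%</td><td class=\"cellRight\"></td>"
                + "</tr></table></td></tr>"
                + "<tr><td class=\"cellRight\"><table><tr>"
                + "<td class=\"cellCenter\">No Grades</td>"
                + "</tr></table></td></tr>"
                + "</table>";
        System.out.println(innermostCellTexts(html));
    }
}
```

In the "No Grades" rows the outer cell is itself the innermost cellRight element, so it is kept and its text comes from the nested cellCenter cell.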

Jsoup select not returning all elements

I am new to the jsoup library. I have HTML like this:
<tr class="srrowns">
<td class="num"> <a name="y2015"> </a> 1 </td>
<td nowrap>CVE-2015-4004</td>
<td>119</td>
<td class="num"> <b style="color:red"> </b> </td>
<td> DoS Overflow +Info </td>
<td>2015-06-07</td>
<td>2015-06-08</td>
<td>
<div class="cvssbox" style="background-color:#ff8000">
8.5
</div></td>
<td align="center">None</td>
<td align="center">Remote</td>
<td align="center">Low</td>
<td align="center">Not required</td>
<td align="center">Partial</td>
<td align="center">None</td>
<td align="center">Complete</td>
</tr>
When I run element.select("td"), it returns:
<td class="num"> <a name="y2015"> </a> 1 </td>
<td nowrap>CVE-2015-4004</td>
<td>119</td>
<td class="num"> <b style="color:red"> </b> </td>
<td> DoS Overflow +Info </td>
<td>2015-06-07</td>
<td>2015-06-08</td>
<td>
<div class="cvssbox" style="background-color:#ff8000">
8.5
</div></td>
<td align="center">None</td>
<td align="center">Remote</td>
<td align="center">Low</td>
<td align="center">Not required</td>
<td align="center">Partial</td>
<td align="center">Complete</td>
Obviously, it drops the <td align="center">None</td> before "Complete". Is there any way to get all items from the jsoup selector?
My code looks something like this in Scala:
val connection = Jsoup.connect(url).get()
val treelist = connection.select("tr.srrowns:contains(CVE-2015-4001)")
val tree = treelist.select("td")
I just saw that jsoup's select is implemented using a LinkedHashSet. My goal is to extract the text from each tag using text(). Is there a workaround for this, or do I have to write a parser just to get all nodes (including duplicates)?
Thank you very much.
Try this CSS selector:
tr.srrowns:has(td:contains(CVE-2015-4004)) > td
DEMO
http://try.jsoup.org/~vAgiHQY6TIJ5MSUzR-m_Y1GD5_U
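SAMPLE CODE
// Assumption: Scala 2.x; on 2.13+ import scala.jdk.CollectionConverters._ instead
import scala.collection.JavaConverters._

val cve = "CVE-2015-4004"
val doc = Jsoup.connect(url).get()
val tds = doc.select("tr.srrowns:has(td:contains(" + cve + ")) > td")
// Elements is a Java collection, so convert it before iterating
for (td <- tds.asScala) {
  println(td.text())
}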

How to parse html and keep ALL line breaks?

I have a document that contains <br/>, <p>, and <table> elements.
I have been trying to parse this HTML using jsoup while preserving the line breaks.
I tried many methods from similar questions, but with no result:
FileInputStream in = new FileInputStream("C:............xxx.htm");
String htmlText = IOUtils.toString(in);
File file = new File("C:............xxx.txt") ;
PrintWriter pr = new PrintWriter(file) ;
String text = Jsoup.parse(htmlText.replaceAll("(?i)<br[^>]*>", "br2n")).text();
System.out.println(text.replaceAll("br2n", "\n"));
pr.println(text.replaceAll("br2n", "\n"));
// for (String line : htmlText.split("\n")) {
// String stripped = Jsoup.parse(line).text();
//
// System.out.println(stripped);
// pr.println(stripped);
//
// }
pr.close();
Here is a representative part of my HTML file (the original file starts with <html>, of course):
<table border="0" cellspacing="0" cellpadding="0" bgcolor="white"
width='650'>
<tr>
<td><font size="4"><br />
<b>The scientific explantion of the syndrom</b></font>
<table width='650' border="0" cellspacing="5" cellpadding="0">
<tr>
<td width='5%'> </td>
<td width='25%'> </td>
<td width='25%'> </td>
<td width='15%'> </td>
<td width='30%'> </td>
</tr>
<tr height="24">
<td align="left" nowrap="nowrap" colspan="3"><font size=
"3"><b>Recent Update</b></font></td>
<td align="left" nowrap="nowrap"><a name=
"9J003346248"></a><font size="3"><b>Issue:</b></font></td>
<td align="left"><font size="3">9569865248</font></td>
</tr>
<tr>
<td> </td>
<td align="left"><b>Locust:</b></td>
<td align="left" colspan="3">UYF78UIGK</td>
</tr>
</table>
<br/> The explanation above does not necc....... <p>
Blah ....
</p>
<table border="2" cellspacing="1" cellpadding="0" bgcolor="white"
width='750'>
<tr>
<td><font size="4"><br />
<b>Syndrom of the main ......</b></font>
<table width='650' border="0" cellspacing="5" cellpadding="0">
<tr>
<td width='5%'> </td>
<td width='25%'> </td>
<td width='25%'> </td>
<td width='15%'> </td>
<td width='30%'> </td>
</tr>
<tr height="24">
<td align="left" nowrap="nowrap" colspan="3"><font size=
"3"><b>Data</b></font></td>
<td align="left" nowrap="nowrap"><a name=
"9J003346248"></a><font size="3"><b>Issue:</b></font></td>
<td align="left"><font size="3">9509809248</font></td>
</tr>
<tr>
<td> </td>
<td align="left"><b>Locust:</b></td>
<td align="left" colspan="3">U344365GK</td>
</tr>
</table>
<br/> The explanation above does not necc....... <p>
Blah ....
</p>
I need to make sure that all rows in those tables appear one after another, the way they do in the original document. But I have multiple tables and other "line-breaking" elements. How can I do this using jsoup? Is there another API that can parse HTML and keep the line breaks more effectively?
You had it almost right. Try this (all the replaceAll calls must run on the HTML string, before it is passed to Jsoup.parse):
String text = Jsoup.parse(htmlText.replaceAll("(?i)</tr>", "</tr> br2n ").replaceAll("(?i)<br[^>]*>", "br2n").replaceAll("(?i)<p>", "<p> br2n ").replaceAll("(?i)</p>", "</p> br2n ")).text();
System.out.println(text.replaceAll("br2n", "\n"));
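A different approach, sketched here as an assumption rather than as part of the answer above, is to skip the placeholder strings and walk the parsed tree, starting a new line at each br, p, and tr. It uses jsoup's NodeVisitor. The class and method names are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

class LineBreakText {
    // Hypothetical helper: extracts the text of an HTML string and puts
    // each table row, paragraph, and <br>-separated run on its own line.
    static String textWithLineBreaks(String html) {
        final StringBuilder out = new StringBuilder();
        Jsoup.parse(html).body().traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                if (node instanceof TextNode) {
                    String text = ((TextNode) node).text().trim();
                    if (!text.isEmpty()) {
                        // Separate adjacent text runs on the same line with one space.
                        if (out.length() > 0 && out.charAt(out.length() - 1) != '\n') {
                            out.append(' ');
                        }
                        out.append(text);
                    }
                } else if (node.nodeName().equals("br") || node.nodeName().equals("p")) {
                    endLine(out); // <br> and the start of <p> begin a new line
                }
            }

            @Override
            public void tail(Node node, int depth) {
                if (node.nodeName().equals("tr") || node.nodeName().equals("p")) {
                    endLine(out); // the end of a row or paragraph ends the line
                }
            }
        });
        return out.toString().trim();
    }

    // Appends a newline unless the output is empty or already ends with one.
    private static void endLine(StringBuilder out) {
        if (out.length() > 0 && out.charAt(out.length() - 1) != '\n') {
            out.append('\n');
        }
    }
}
```

This sketch never emits blank lines and always puts a space between neighbouring text nodes, so adjust both rules if the output has to mirror the original spacing more closely.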

how to verify the sorting in collapse group selenium web driver-java

I want to test the sorting within collapsible groups. Is this possible, and if so, how? The page has two groups: 1) Branch: Clifton and 2) Branch: Holsopple. Columns are sorted by clicking the column headings (Contact, Type, etc.). The code below works where no groups exist but fails when the page is grouped.
When I compare the getText results in Java, the assertion fails even though the sorting on the page is correct, because my Java code reads the text of the whole column while the sorting is applied within each group. I want to write code that verifies the column sorting group by group.
Here is the HTML:
<tbody>
<tr class="rgGroupHeader">
<td class="rgGroupCol">
<td colspan="9">
<p>Branch: Clifton</p>
</td>
</tr>
<td class="rgGroupCol"/>
<td style="display:none;" title="289855">289855</td>
<td style="display:none;" title="31">31</td>
<td style="display:none;"/>
<td style="display:none;" title="12">12</td>
<td style="display:none;" title="6">6</td>
<td class="col_priority">
<td title="10/24/2013">10/24/2013</td>
<td class="col_status">
<div id="ctl00_CPHPageContents_dtgLeads_ctl00_ctl19_divStatus" class="status_active" title="Open - Active"/>
</td>
<td>
<td>
<td>
<td title="Nawaz, S (10/22/2013)">
<td class="col_manager_instruction">
<td class="col_expiry" title="N/A">N/A</td>
</tr>
<tr id="ctl00_CPHPageContents_dtgLeads_ctl00__8" class="rgRow">
<td class="rgGroupCol"/>
<td style="display:none;" title="289856">289856</td>
<td style="display:none;" title="31">31</td>
<td style="display:none;"/>
<td style="display:none;" title="11">11</td>
<td style="display:none;" title="6">6</td>
<td class="col_priority">
<td title="10/24/2013">10/24/2013</td>
<td class="col_status">
<td>
<td>
<td>
<td title="Nawaz, S (10/22/2013)">
<td class="col_manager_instruction">
<td class="col_expiry" title="11/25/2013">11/25/2013</td>
</tr>
<tr class="rgGroupHeader">
<td class="rgGroupCol">
<input id="ctl00_CPHPageContents_dtgLeads_ctl00__35__0" class="rgCollapse" type="button" title="Collapse group" onclick="$find("ctl00_CPHPageContents_dtgLeads_ctl00")._toggleGroupsExpand(this, event); return false;__doPostBack('ctl00$CPHPageContents$dtgLeads$ctl00$ctl37$ctl00','')" value=" " name="ctl00$CPHPageContents$dtgLeads$ctl00$ctl37$ctl00"/>
</td>
<td colspan="9">
<p>Branch: Holsopple</p>
</td>
</tr>
<tr id="ctl00_CPHPageContents_dtgLeads_ctl00__16" class="rgRow">
<td class="rgGroupCol"/>
<td style="display:none;" title="289768">289768</td>
<td style="display:none;" title="2">2</td>
<td style="display:none;"/>
<td style="display:none;" title="12">12</td>
<td style="display:none;" title="4">4</td>
<td class="col_priority">
<div id="ctl00_CPHPageContents_dtgLeads_ctl00_ctl38_divPriority" class="priority_high" title="High"/>
</td>
<td title="06/27/2013">06/27/2013</td>
<td class="col_status">
<div id="ctl00_CPHPageContents_dtgLeads_ctl00_ctl38_divStatus" class="status_active" title="Open - Active"/>
</td>
<td>
<div id="ctl00_CPHPageContents_dtgLeads_ctl00_ctl38_divInner">
<a id="ctl00_CPHPageContents_dtgLeads_ctl00_ctl38_hlnkContact" href="/Leads/Research/289768">John Ross</a>
<input id="ctl00_CPHPageContents_dtgLeads_ctl00_ctl38_hdfContactID" type="hidden" value="174120" name="ctl00$CPHPageContents$dtgLeads$ctl00$ctl38$hdfContactID"/>
<div id="ctl00_CPHPageContents_dtgLeads_ctl00_ctl38_divContactCardControl" class="pos_r"/>
</div>
</td>
<td>
<div class="lead_type">
<a id="ctl00_CPHPageContents_dtgLeads_ctl00_ctl38_lnkType" class="lead_type_link" href="/Leads/Research/289768" title="Maturing CD 100">Maturing CD 100</a>
<a id="ctl00_CPHPageContents_dtgLeads_ctl00_ctl38_lnkDownArrow" class="down_arrow" onclick="showCloseTransferLayer('ctl00_CPHPageContents_dtgLeads_ctl00_ctl38_CloseTransferLayer')" href="javascript:;"/>
<span class="pos_r">
<div id="ctl00_CPHPageContents_dtgLeads_ctl00_ctl38_CloseTransferLayer">
<a id="ctl00_CPHPageContents_dtgLeads_ctl00_ctl38_lnkCloseLead" onclick="return ShowPopupForm('/Forms/Popups/CloseLead.aspx?LeadID=289768','WindowCloseLead');" href="javascript:;">Cancel Lead</a>
<a id="ctl00_CPHPageContents_dtgLeads_ctl00_ctl38_lnkTransferLead" onclick="return ShowPopupForm('/Forms/Popups/TransferLead.aspx?LeadID=289768','WindowTransferLead');" href="javascript:;">Transfer Lead</a>
</div>
</span>
</div>
</td>
<td>
<div id="ctl00_CPHPageContents_dtgLeads_ctl00_ctl38_divAssignedTo">
<div id="ctl00_CPHPageContents_dtgLeads_ctl00_ctl38_ddlAssignedTo" class="RadComboBox RadComboBox_Default assigned_to_combo" style="width:160px;">
<table style="border-width: 0px; border-collapse: collapse;" summary="combobox">
<tbody>
<tr class="rcbReadOnly">
<td class="rcbInputCell rcbInputCellLeft" style="width:100%;">
<input id="ctl00_CPHPageContents_dtgLeads_ctl00_ctl38_ddlAssignedTo_Input" class="rcbInput radPreventDecorate" type="text" readonly="readonly" value="Org, T" name="ctl00$CPHPageContents$dtgLeads$ctl00$ctl38$ddlAssignedTo" autocomplete="off"/>
</td>
<td class="rcbArrowCell rcbArrowCellRight">
<a id="ctl00_CPHPageContents_dtgLeads_ctl00_ctl38_ddlAssignedTo_Arrow" style="overflow: hidden;display: block;position: relative;outline: none;">select</a>
</td>
</tr>
</tbody>
</table>
<div class="rcbSlide" style="z-index:6000;">
<div id="ctl00_CPHPageContents_dtgLeads_ctl00_ctl38_ddlAssignedTo_DropDown" class="RadComboBoxDropDown RadComboBoxDropDown_Default " style="display:none;">
<div class="rcbScroll rcbWidth" style="width:100%;">
<ul class="rcbList" style="list-style:none;margin:0;padding:0;zoom:1;">
<li class="rcbItem">Ghaffar, A</li>
<li class="rcbItem">Keller, K</li>
<li class="rcbItem">Nawaz, S</li>
<li class="rcbItem">Org, 1</li>
<li class="rcbItem">Org, T</li>
</ul>
</div>
</div>
</div>
<input id="ctl00_CPHPageContents_dtgLeads_ctl00_ctl38_ddlAssignedTo_ClientState" type="hidden" name="ctl00_CPHPageContents_dtgLeads_ctl00_ctl38_ddlAssignedTo_ClientState" autocomplete="off"/>
</div>
</div>
</td>
<td/>
<td class="col_manager_instruction">
<td class="col_expiry" title="N/A">N/A</td>
</tr>
</tbody>
Java Code:
List<String> displayedNames = new ArrayList<String>();
List<String> SortedNames = new ArrayList<String>();
String getData;
Thread.sleep(thread);
for(int i=0;i<tableType.size();i++)
{
getData=tableType.get(i).getText();
System.out.println(getData);
displayedNames.add(getData);
SortedNames.add(getData);
}
System.out.println(displayedNames);
Thread.sleep(thread);
List<String> sortingOperation = displayedNames;
Thread.sleep(thread);
Collections.sort(sortingOperation);
Thread.sleep(thread);
Assert.assertEquals(SortedNames, sortingOperation);
List<String> displayedNames = new ArrayList<String>();
List<String> SortedNames = new ArrayList<String>();
Thread.sleep(10000);
WebElement tableType = action.driver.findElement(By.xpath("//*[@id='table_1_core_table_content']/tbody/tr"));
Thread.sleep(10000);
List<WebElement> rowElmt = tableType.findElements(By.xpath("//tr/td[5]"));
String getData;
Thread.sleep(5000);
for(int i=2;i<rowElmt.size();i++)
{
getData=rowElmt.get(i).getText();
displayedNames.add(getData);
SortedNames.add(getData);
}
System.out.println(displayedNames);
Thread.sleep(5000);
List<String> sortingOperation = displayedNames;
Collections.sort(sortingOperation);
Assert.assertEquals(SortedNames, sortingOperation);
}
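One possible way to check the sorting group by group, sketched with plain lists so it runs without a browser: start a new group at every header row and compare each group with its own sorted copy. The class and method names are hypothetical. The two input lists are assumed to come from the page, for example the class attribute of each tr and the text of the column under test in that row.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class GroupedSortCheck {
    // rowClasses.get(i) is the class attribute of row i ("rgGroupHeader" or "rgRow");
    // cellTexts.get(i) is the text of the checked column in row i (ignored for headers).
    static boolean isSortedWithinGroups(List<String> rowClasses, List<String> cellTexts) {
        List<String> current = new ArrayList<>();
        for (int i = 0; i < rowClasses.size(); i++) {
            if (rowClasses.get(i).contains("rgGroupHeader")) {
                // A header row closes the previous group and opens a new one.
                if (!isSorted(current)) {
                    return false;
                }
                current = new ArrayList<>();
            } else {
                current.add(cellTexts.get(i));
            }
        }
        return isSorted(current); // check the last group as well
    }

    // True if the values are already in ascending String order.
    static boolean isSorted(List<String> values) {
        List<String> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        return sorted.equals(values);
    }

    public static void main(String[] args) {
        // Made-up data: two groups, each sorted on its own.
        List<String> classes = Arrays.asList(
                "rgGroupHeader", "rgRow", "rgRow", "rgGroupHeader", "rgRow", "rgRow");
        List<String> texts = Arrays.asList("", "Adams", "Zed", "", "Baker", "Young");
        System.out.println(isSortedWithinGroups(classes, texts));
    }
}
```

The sample column is not sorted as a whole (Zed comes before Baker), yet each group is sorted on its own, which is the situation described in the question.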
