Issue parsing HTML with jsoup in Java

I am trying to parse HTML using jsoup.
I used "Try jsoup" to check that the HTML parses correctly.
My code is:
public static void main(String[] args) throws MalformedURLException {
    URL url = new URL("http://tw.search.bid.yahoo.com/search/ac;_ylt=AtqkyTO06sgGHho20HzmPEX3_rF8?ei=UTF-8&p=%E8%A1%A3%E6%9C%8D");
    Document doc;
    try {
        doc = Jsoup.parse(url, 3000);
        Elements descriptions = doc.select("div#srp_sl_result div.att-item");
        for (Element element : descriptions) {
            System.out.println(element.ownText());
            System.out.println("--------------");
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
But the results come back empty.
I am getting the following output:
--------------
--------------
--------------
I am expecting output like:
女裝手套衣服＊艾爾莎＊暗釦長款披風式毛衣罩衫外套S~L【TAA1166】 出價 799 元 直購 799 元 運費80元 ｜ 30 次 ｜ 剩 16小時 60分 賣家：艾爾莎時尚精品 (評價 25229) 在新北市
☆意樂舖☆【塑鋼衣架】ABS強化多功能神奇魔術衣架(收納衣服.領帶.皮帶.肩帶) 出價 35 元 直購 35 元 運費 55元 ｜ 8 次 ｜ 1天 6小時 賣家：意樂舖(創意樂園小舖) (評價 14613) 在新北市
HappyLife【YK1324】韓國超人氣乾濕兩用衣架 防滑魔術衣架 止滑衣架 衣服衣櫃衣櫥收納 出價 25 元 直購 25 元 運費70元 ｜ 16 次 ｜ 2天 3小時 賣家：HappyLife快樂生活網 (評價 14360) 在新北市
Here is some sample HTML from the search page:
<div class="att-item item yui3-g " data-url="https://login.yahoo.com/config/login?.intl=tw&.pd=c%3D3Chd7Yq72e502eh4R99sgUvi5Q--&.done=https%3A%2F%2Ftw.search.bid.yahoo.com%2Fsearch%2Fauction%2Fproduct%3Fei%3DUTF-8%26p%3D%25E8%25A1%25A3%25E6%259C%258D&rr=2465463942">
<div class="yui3-u">
<div class="srp-pdimage">
<img height="120" alt=" (DAJIN達錦衣服設計中心)棒壘球帽字凸繡200元，棒球帽，帽子，棒壘球服，棒球衣 " src="https://s.yimg.com/hg/ac/30/ea/e79010279-ac-4511xf9x0430x0600-s.jpg" />
</div>
</div>
</div>
What should I change in my code to get this output?

You should use the text() method rather than ownText(). As the documentation states, text():
Gets the combined text of this element and all its children.
Here is an updated example:
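import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// The class name is arbitrary; it only wraps the example so that it compiles.
public class Main {
    public static void main(String[] args) throws MalformedURLException {
        URL url = new URL("http://tw.search.bid.yahoo.com/search/"
                + "ac;_ylt=AtqkyTO06sgGHho20HzmPEX3_rF8?ei=UTF-8&p=%E8%A1%A3%E6%9C%8D");
        try {
            Document doc = Jsoup.parse(url, 3000);
            Elements descriptions = doc.select("div#srp_sl_result div.att-item");
            for (Element element : descriptions) {
                // text() combines the element's own text with that of all its children
                System.out.println(element.text());
                System.out.println("--------------");
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}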

I visited the page you are trying to parse and ran this in the browser console:
$('div#srp_sl_result div.att-item')
The search returned a div:
<div class="att-item item yui3-u" data-url="https://login.yahoo.com/config/login?.intl=tw&.pd=c%3D3Chd7Yq72e502eh4R99sgUvi5Q--&.done=https%3A%2F%2Ftw.search.bid.yahoo.com%2Fsearch%2Fauction%2Fproduct%3Fei%3DUTF-8%26p%3D%25E8%25A1%25A3%25E6%259C%258D&rr=3456505015" id="yui_3_14_1_3_1394093660536_452">
<div class="wrap" id="yui_3_14_1_3_1394093660536_451">
<div class="srp-pdimage" id="yui_3_14_1_3_1394093660536_450">
<a href="https://tw.page.bid.yahoo.com/tw/auction/f61398121;_ylt=Ali1FeHY3kStUUeBmGO4vupyFbN8;_ylv=3?u=Y2583393636" id="yui_3_14_1_3_1394093660536_456">
<img width="200" alt=" HappyLife【SP323】納川6+1家庭裝真空收納袋/真空袋/壓縮袋/棉被衣物衣服收納~附吸氣管 " src="https://s.yimg.com/hg/ac/b6/51/f61398121-ac-6849xf8x0600x0400-s.jpg" id="yui_3_14_1_3_1394093660536_455">
</a>
</div>
<div class="srp-pdhead">
<div class="srp-pdinfo">
<a class="srp-bid" href="https://tw.page.bid.yahoo.com/tw/show/bid_hist;_ylt=Ahu0X7QeYNL6gEwV.IhDhWlyFbN8;_ylv=3?aID=f61398121">6 次</a>
<span>出價</span>
<em>399</em>
<span>元</span>
<span class="sep">｜</span>
</div>
<div class="srp-pdprice">
<span>直購</span>
<em>399</em>
<span>元</span>
</div>
</div>
<div class="srp-pdtitle">
HappyLife【SP323】納川6+1家庭裝真空收納袋/真空袋/壓縮袋/棉被衣物衣服收納~附吸氣管
</div>
<div class="srp-pdftitle">
HappyLife【SP323】納川6+1家庭裝真空收納袋/真空袋/壓縮袋/棉被衣物衣服收納~附吸氣管
</div>
<div class="srp-pdstore">
<a class="srp-ico" href="https://tw.help.yahoo.com/auct/policy/protection.html#reward" alt="享買賣家五萬保障"></a>
HappyLife快樂生活網
</div>
</div>
</div>
So I don't understand why so many elements are returned for you. In any case, element.ownText() returns only the text that belongs directly to that div, excluding any inner elements. Since that div contains no text of its own, only other elements, nothing is printed.
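To make the difference concrete, here is a minimal sketch that reproduces the "own text" versus "combined text" distinction using only the JDK's built-in DOM parser, so it runs without jsoup. The class name, the method names and the simplified markup are assumptions made for illustration; in jsoup itself the corresponding calls are element.ownText() and element.text().

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of jsoup: mimics ownText() vs text() on well-formed markup.
class OwnTextVsText {

    // Parses a small, well-formed snippet and returns its root element.
    private static Element root(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Snippet is not well-formed", e);
        }
    }

    // Text that sits directly inside the root element, ignoring child elements.
    static String ownText(String xml) {
        StringBuilder sb = new StringBuilder();
        NodeList children = root(xml).getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                sb.append(child.getNodeValue());
            }
        }
        return sb.toString().trim();
    }

    // Text of the root element and all of its descendants, with whitespace collapsed.
    static String combinedText(String xml) {
        return root(xml).getTextContent().trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        // Simplified stand-in for one search result: the outer div holds only child elements.
        String item = "<div class=\"att-item\">"
                + "<div class=\"srp-pdinfo\"><span>Bid</span> <em>399</em></div> "
                + "<div class=\"srp-pdstore\">HappyLife</div>"
                + "</div>";
        System.out.println("own text:      [" + ownText(item) + "]");
        System.out.println("combined text: [" + combinedText(item) + "]");
    }
}
```

The outer div has no text of its own, so the first line prints an empty pair of brackets, which matches the blank lines in the question's output.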

Related

Jsoup CSS selector "not" does not return anything

I'm trying to ignore an item and not parse it with Jsoup,
but the CSS selector "not" is not working.
I don't understand what is wrong.
My code:
MangaList list = new MangaList();
Document document = getPage("https://3asq.org/");
MangaInfo manga;
for (Element o : document.select("div.page-item-detail:not(.item-thumb#manga-item-5520)")) {
    manga = new MangaInfo();
    manga.name = o.select("h3").first().select("a").last().text();
    manga.path = o.select("a").first().attr("href");
    try {
        manga.preview = o.select("img").first().attr("src");
    } catch (Exception e) {
        manga.preview = "";
    }
    list.add(manga);
}
return list;
html code:
<div class="col-12 col-md-6 badge-pos-1">
<div class="page-item-detail manga">
<div id="manga-item-5520" class="item-thumb hover-details c-image-hover" data-post-id="5520">
<a href="https://3asq.org/manga/gosu/" title="Gosu">
<img width="110" height="150" src="https://3asq.org/wp-content/uploads/2020/03/IMG_4497-110x150.jpg" srcset="https://3asq.org/wp-content/uploads/2020/03/IMG_4497-110x150.jpg 110w, https://3asq.org/wp-content/uploads/2020/03/IMG_4497-175x238.jpg 175w" sizes="(max-width: 110px) 100vw, 110px" class="img-responsive" style="" alt="IMG_4497"/> </a>
</div>
<div class="item-summary">
<div class="post-title font-title">
<h3 class="h5">
<span class="manga-title-badges custom noal-manga">Noal-Manga</span> Gosu
</h3>

If I debug your code and print the HTML of the first match with:
System.out.println(document.select("div.page-item-detail").get(0));
(hint: use the expression evaluator in IntelliJ IDEA, Alt+F8, for in-session, real-time debugging)
I get:
<div class="page-item-detail manga">
<div id="manga-item-2003" class="item-thumb hover-details c-image-hover" data-post-id="2003">
<a href="http...
...
</div>
</div>
</div>
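It looks like you want the inner div whose class contains item-thumb, but only if its id isn't manga-item-5520. Note that :not(...) is tested against the element it is attached to (here the outer div.page-item-detail), not against its children, so your selector never excludes the item.
Here is what I did to remove that one item: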
document.select("div.page-item-detail div[class*=item-thumb][id!=manga-item-5520]")
Result size: 19
With the element included:
document.select("div.page-item-detail div[class*=item-thumb]")
Result size: 20
You can also try the following if you want to keep selecting the outer div tag rather than the inner div tag:
document.select("div.page-item-detail:has(div[class*=item-thumb][id!=manga-item-5520])")
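The underlying idea, that a condition placed on the outer element is checked against that element only, while excluding by a child's id needs a condition on the child, can be tried out without jsoup. The sketch below is an analogue written with the JDK's built-in XPath engine, not jsoup's CSS selectors; the class name, the trimmed-down markup and the two expressions are assumptions for illustration (the second id, manga-item-2003, is taken from the debug output above).

```java
import java.io.StringReader;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class: XPath analogue of filtering an outer div by its child's id.
class ExcludeByChildId {

    // Two list entries, reduced to the attributes that matter for the filter.
    static final String PAGE = "<body>"
            + "<div class=\"page-item-detail manga\">"
            + "<div id=\"manga-item-5520\" class=\"item-thumb hover-details\"/></div>"
            + "<div class=\"page-item-detail manga\">"
            + "<div id=\"manga-item-2003\" class=\"item-thumb hover-details\"/></div>"
            + "</body>";

    // Condition on the outer div itself: it has no id at all, so nothing is excluded.
    static final String ON_OUTER =
            "//div[contains(@class,'page-item-detail')][not(@id='manga-item-5520')]";

    // Condition on the child: keep the outer div only if it has an item-thumb child
    // whose id is not manga-item-5520.
    static final String ON_CHILD =
            "//div[contains(@class,'page-item-detail')]"
            + "[div[contains(@class,'item-thumb')][not(@id='manga-item-5520')]]";

    // Counts the nodes matched by an XPath expression in a well-formed snippet.
    static int count(String xml, String xpath) {
        try {
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            return hits.getLength();
        } catch (Exception e) {
            throw new IllegalArgumentException("Bad XML or XPath expression", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("condition on outer div: " + count(PAGE, ON_OUTER));
        System.out.println("condition on child div: " + count(PAGE, ON_CHILD));
    }
}
```

With this sample markup the first expression keeps both entries and the second keeps only one, which mirrors the 20 versus 19 results reported above.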

Selenium check dynamic elements

I am using Selenium with Java.
The task is to get the text of one of two possible elements: elem1 or elem2.
In scenario A, elem1 is displayed and the locator of elem2 doesn't exist, and vice versa.
My code:
public void checkTextInPopUp() {
    List<WebElement> commonDiv = driver.findElements(
            By.xpath(".//*[@id='CheckStockProductAvailabilityWidget']/div/div"));
    if (commonDiv.size() >= 1) {
        addToCartStock.click();
    } else {
        System.out.println(driver.findElement(By.id("ajaxErrorMsg")).getText());
        closeCheckStock.click();
    }
}
My code only works in scenario 1 and fails when element 2 is displayed, saying: unable to locate element2.
Elem1 html:
<div id="CheckStockProductAvailabilityWidget" class="dijitContentPane"
lang="en" controllerid="CheckStockProductAvailabilityController"
widgetid="CheckStockProductAvailabilityWidget"
dojotype="wc.widget.RefreshArea" style="">
<div class="row-fluid">
<div class="span11">
<p id="ajaxErrorMsg" class="error-font-color">Price &
Availability Check cannot be executed for your order.</p>
</div>
</div>
</div>
</div>
Elem2 html:
<div id="CheckStockProductAvailabilityWidget" class="dijitContentPane"
lang="en" controllerid="CheckStockProductAvailabilityController"
widgetid="CheckStockProductAvailabilityWidget"
dojotype="wc.widget.RefreshArea" style="">
<div class="row-fluid">
<div class="span11">
<div class="row-fluid ">
<div class="span12">
Part# 00000
<br/>
<p>
</div>
</div>
<div class="row-fluid space-bottom">
<div class="row-fluid ">
<div class="row-fluid mobile-inline-block">
Both elements have a common div, and both return 1 when I check the size of the element list.

I found a solution using try and catch. I will see whether it is an ideal one.
public void getStockPopUpMessage() {
    try {
        // findElement throws NoSuchElementException when the error message is absent
        WebElement ajaxError = driver.findElement(By.xpath(".//*[@id='ajaxErrorMsg']"));
        System.out.println("Stock displays: " + ajaxError.getText());
        closeCheckStock.click();
    } catch (NoSuchElementException e) {
        System.out.println("No ajax");
        dothis();
    }
}

You can modify the code as below
Code:
public void checkTextInPopUp() {
    WebElement rootElement = driver.findElement(By.id("CheckStockProductAvailabilityWidget"));
    List<WebElement> element1List = rootElement.findElements(By.xpath(".//div[@class='span11']/p"));
    if (element1List.size() == 1) {
        // do your stuff
        addToCartStock.click();
    } else {
        System.out.println(rootElement.findElement(By.xpath(".//div[@class='span12']")).getText());
    }
}
Details:
Find the root WebElement first, since it is always present in both the element 1 and the element 2 HTML.
Element 1 may or may not be present at a given time. To avoid a NoSuchElementException, look it up with the findElements method and store the result in a List (you can search from the root element).
If element 1 is found, it is present and the list size will be 1.
If element 1 is not found, element 2 is present and the list size will be 0.
Logic can then be added based on the list size.
Updated Code:
List<WebElement> elementList = driver.findElements(By.xpath("//p[@id='ajaxErrorMsg']"));
// If the element 1 HTML is present, the list size will be 1; otherwise it will be 0.
if (elementList.size() > 0) {
    // element 1 related stuff
    // do your stuff
    addToCartStock.click();
} else {
    // element 2 related stuff
    System.out.println(rootElement.findElement(By.xpath(".//div[@class='span12']")).getText());
}

Simply use:
@FindAll({
    @FindBy(/* locator of elem1 */),
    @FindBy(/* locator of elem2 */)
})
private WebElement elemX;
The FindAll annotation works with OR logic: it will find elem1 or elem2 and store it as a web element. Alternatively, you can store the result in a List of web elements within your object repository.

Element not found when using XPath in an Angular site

I have an Angular website and I am trying to automate it using Selenium/Java. I know Protractor is easier for Angular sites, but I wish to use Selenium.
I have been using the "contains" keyword in the XPath to find elements, as there are no unique ids available.
The element I am having a problem with is circled in red in the attached image. When I search the console with the XPath shown in the image, the element is highlighted. But when I use it in the code, I get the element not found error.
Is there a better way to handle this, and why am I getting the error? I already have a wait condition.
HTML Code:
<div class="status-selector ng-isolate-scope" fm-select=""
fm-select-options="::IssueDetailsCtrl.issue.allowedIssueStatuses"
fm-disabled="!IssueDetailsCtrl.issue.permissions.editStatus"
fm-model="IssueDetailsCtrl.issue.status"
fm-change="IssueDetailsCtrl.updateStatus()">
<div class="fm-select undefined selected" ng-class="getStyle()" tabindex=""
ng-keyup="handleKeys($event)" ng-keydown="handleKeyDown($event)" style="">
<div class="fm-select-title" ng-click="toggleVisibility()"
ng-class="{"fm-select-title-highlighted":
isOpen, "fm-select-disabled": fmDisabled }">
<div class="selected-item-icon" ng-class="selectedOption.imageClass"></div>
<span class="ng-binding">Open</span>
</div>
<!-- ngIf: isOpen -->
</div>
</div>
<div class="fm-select undefined selected" ng-class="getStyle()" tabindex=""
ng-keyup="handleKeys($event)" ng-keydown="handleKeyDown($event)" style="">
<div class="fm-select-title" ng-click="toggleVisibility()"
ng-class="{"fm-select-title-highlighted":
isOpen, "fm-select-disabled": fmDisabled }">
<div class="selected-item-icon" ng-class="selectedOption.imageClass"></div>
<span class="ng-binding">Open</span>
</div>
<!-- ngIf: isOpen -->
</div>
<div class="fm-select-title" ng-click="toggleVisibility()"
ng-class="{"fm-select-title-highlighted":
isOpen, "fm-select-disabled": fmDisabled }">
<div class="selected-item-icon" ng-class="selectedOption.imageClass"></div>
<span class="ng-binding">Open</span>
</div>
<div class="selected-item-icon" ng-class="selectedOption.imageClass"></div>
<span class="ng-binding">Open</span>
Wait condition:
public void waitAndClickElement(WebElement element) throws InterruptedException {
    boolean clicked = false;
    int attempts = 0;
    while (!clicked && attempts < 20) {
        try {
            this.wait.until(ExpectedConditions.elementToBeClickable(element)).click();
            System.out.println("Successfully clicked on the WebElement: " + "<" + element.toString() + ">");
            clicked = true;
        } catch (Exception e) {
            System.out.println("Unable to wait and click on WebElement, Exception: " + e.getMessage());
            Assert.fail("Unable to wait and click on the WebElement, using locator: " + "<" + element.toString() + ">");
        }
        attempts++;
    }
}

Your app is an Angular app, and the button is attached to the DOM tree dynamically, only once the condition in Angular's ng-if is satisfied. You therefore need to wait for Angular to compile the button's HTML and attach it to the DOM tree.
// Selenium 3 constructor; newer Selenium 4 releases take Duration.ofSeconds(20) as the timeout
WebDriverWait wait = new WebDriverWait(driver, 20);
WebElement ele = wait.until(ExpectedConditions.presenceOfElementLocated(
        By.xpath("//div/span[text()='Open']")));
ele.click();
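Conceptually, an explicit wait just repeats a lookup until it succeeds or a timeout expires. The following is a minimal JDK-only sketch of that idea, not Selenium's actual implementation; PollingWait, pollUntilPresent and the simulated lookup are hypothetical names used for illustration.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical helper: polls a lookup until it yields a non-null value or the timeout expires.
class PollingWait {

    static <T> Optional<T> pollUntilPresent(Supplier<T> lookup, Duration timeout, Duration interval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            T found = lookup.get();
            if (found != null) {
                return Optional.of(found);      // the "element" has appeared
            }
            if (System.nanoTime() - deadline >= 0) {
                return Optional.empty();        // timed out without finding it
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return Optional.empty();        // stop waiting if the thread is interrupted
            }
        }
    }

    public static void main(String[] args) {
        // Simulated lookup: the value only "appears" on the third attempt,
        // much like a node that is attached once an ng-if condition becomes true.
        int[] calls = {0};
        Optional<String> label = pollUntilPresent(
                () -> ++calls[0] < 3 ? null : "Open",
                Duration.ofSeconds(2), Duration.ofMillis(50));
        System.out.println(label.orElse("timed out") + " after " + calls[0] + " lookups");
    }
}
```

A single lookup made before the node is attached fails, which is why locating the element only after the wait succeeds matters.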

I would suggest using an explicit wait in this case:
new WebDriverWait(driver, 20).until(ExpectedConditions.elementToBeClickable(By.xpath("//span[@class='ng-binding' and text()='Open']"))).click();
Alternatively, you can use this XPath:
//div[@class='selected-item-icon']/following-sibling::span[text()='Open' and @class='ng-binding']

Jsoup: Get texts inside divs

I am having trouble getting the text out of the following HTML and would appreciate some help.
<div class="itemlist">
<ul>
<li>
<div class="Description">
<h2>Item 1</h2> // GET THIS
<h3 title="Shipping :01-02 Nov">Shipping :01-02Nov</h3> // GET THIS
</div>
<div class="price" style="margin: 0px auto; display: none;">
<span class="arial-12-88" style="display: inline;"></span>
<div class="currency-USD arial-24-26-bold">450 USD</div> // GET THIS
<span class="arial-12-d0" style="display: inline;"></span>
</div>
<div class="button_set" style="display: flex;">
<button class="learn">Learn More</button>
<a href="user/orderDetails.htm?m=add&pid=00020170918214914392zGPQW7nE06A2&count=1&fitting=">
<button class="add">Add To Cart</button></a> // GET THIS
</div>
</li>
next item ...
</ul>
</div>
The output should be:
Item 1
Shipping :01-02Nov
450 USD
My approach is too static and cannot handle changes in the item structure, because not every item has, for example, the price at the same child index. The only things that stay the same are the div class names.
At the moment I use the following, having found in the debugger which child I need to call:
Element content = doc.getElementsByClass("itemlist").first();
Node child1 = content.childNode(1);
for (Node node : child1.childNodes()) {
    try {
        Node desc = node.childNode(3);
        Node price = node.childNode(5);
        Node stock = node.childNode(7);
        // get description
        Node desc_elem = desc.childNode(1);
        Node desc_text = desc_elem.childNode(0);
        String desc_txt = ((TextNode) desc_text).text().trim();
    } catch (Exception e) {
        continue;
    }
}
Please help me find a more dynamic way. Ideally I would get all list items and loop over them, then fetch the description div and the price div of each item and read the text from their children.

// select the div with the item list
Element itemlist = doc.select("div.itemlist").first();
// select each li element
Elements items = itemlist.select("li");
// for each li element, select the corresponding elements with item name, shipping info and price
for (Element e : items) {
    System.out.println(e.select("div.Description h2").text());
    System.out.println(e.select("div.Description h3").text());
    System.out.println(e.select("div.currency-USD").text());
}

Jsoup get text from website

I can already navigate the site and get all the links that I want, but my main objective is to get the hotel reviews. The site I am using is this: http://www.booking.com/hotel/pt/park-italia-flat.pt-pt.html?label=gen173nr-17CAEoggJCAlhYSDNiBW5vcmVmaLsBiAEBmAEvuAEEyAEE2AEB6AEB-AEL;sid=637e7af0c3009aa9ea132a960e2d2d40;dcid=4;ucfs=1;room1=A,A;srfid=b8260a1c264a3873291a9061733a43536a4d35c2X979#tab-reviews
I can get there using jsoup without a problem, but now I don't know how to get the text. I have already tried getElementsByTag, getText and other solutions. Can this be done with jsoup, or do I need another library?
This is how I am trying to get the text, but the text that appears is not what I want.
Document doc;
try {
    doc = Jsoup.connect(pair.getValue().toString() + "#tab-reviews").get();
    Elements scriptElements = doc.getElementsMatchingText("span");
    for (Element link : scriptElements) {
        System.out.printf(" Text: <%s> \n", link.text());
    }
} catch (IOException ex) {
    Logger.getLogger(GetComentsThread.class.getName()).log(Level.SEVERE, null, ex);
}
To get the URLs I am using something like this:
Pattern pattern = Pattern.compile("src=destinationfinder");
Document doc = Jsoup.connect(url).get();
Elements links = doc.select("a[href]");
for (Element link : links) {
    Matcher matcher = pattern.matcher(link.attr("abs:href"));
    if (matcher.find()) {
        dest = link.attr("abs:href");
        break;
    }
}
Now I can get some reviews, but only the positive ones, and I don't know why:
doc = Jsoup.connect(pair.getValue().toString() + "#tab-reviews").get();
//doc = Jsoup.connect("http://www.booking.com/hotel/pt/pestanaportohotel.pt-pt.html?label=gen173nr-17CAEoggJCAlhYSDNiBW5vcmVmaLsBiAEBmAEvuAEEyAEE2AEB6AEB-AEL;sid=cff2dddd95e71c0768847a554584c888;dcid=4;dist=0;group_adults=2;room1=A%2CA;sb_price_type=total;srfid=798bd6b01ea1dba53ee6b6b945dda1f623859730X2;type=total;ucfs=1&#tab-reviews").get();
String teste = "p.trackit";
Elements scriptElements = doc.select(teste);
for (Element link : scriptElements) {
    //System.out.printf(" Text: <%s> ...%s\n", link.text(),link.attr("class=\"review_pos\""));
    System.out.printf(" Text: <> ...%s\n", link.text());
}

The reviews are loaded using an AJAX request to another URL.
There you can get all the info you need.
Response:
<li class="
review_item
clearfix
">
<p class="review_item_date">
16 de Setembro de 2015
</p>
<div class="review_item_reviewer">
<h4>
Beatriz
</h4>
<span class="reviewer_country">
<span class="reviewer_country_flag sflag slang-br">
</span>
Brasil
</span>
</div>
<!-- .review_item_reviewer -->
<div class="review_item_review">
<div class="
review_item_review_container
lang_ltr
seo_reviews_item
">
<div class="review_item_review_header">
<div class="
review_item_header_score_container
">
<div class="review_item_review_score jq_tooltip high_score_tooltip" title="
Excepcional
">
9,6
</div>
</div>
<div class="review_item_header_content_container">
<div class="review_item_header_content seo_review_title">
Excepcional
</div>
</div>
</div>
<ul class="review_item_info_tags">
<li class="review_info_tag"><span class="bullet">•</span> Viagem de lazer</li>
<li class="review_info_tag"><span class="bullet">•</span> Família</li>
<li class="review_info_tag"><span class="bullet">•</span> Apartamento com Varanda</li>
<li class="review_info_tag"><span class="bullet">•</span> Ficou 5 noites</li>
<li class="review_info_tag"><span class="bullet">•</span> Submetido através de dispositivo móvel</li>
</ul>
<div class="review_item_review_content">
<p class="review_pos"><i class="review_item_icon">눇</i>Conforto, perto do centro, perto de um lindo mercado, bem decorado, com todo material necessário para fazer as refeições, Wi-Fi excelente</p>
</div>
</div>
</div>
</li>

It looks like you just need to use jsoup to get the content of the elements with class="review_pos" and class="review_neg".
