JSOUP - Help getting <IMG SRC> from <DIV CLASS> - java

I have the HTML snippet below. There are multiple divs with the class "teaser-img" throughout the document. I want to grab the "img src" value from every one of these "teaser-img" divs.
<div class="teaser-img">
<a href="/julien/blog/failure-consciousness-vs-success-consciousness-shifting-focus-become-badass-or-loser">
<img src="http://www.rsdnation.com/files/imagecache/blog_thumbnail/files/blog_thumbs/rsdnatonaustin.jpg" alt="" title=""/>
</a>
</div>
I have tried many things so I wouldn't know what code to share with you guys. Your help will be much appreciated.

final String html = "<div class=\"teaser-img\">\n"
+ " <a href=\"/julien/blog/failure-consciousness-vs-success-consciousness-shifting-focus-become-badass-or-loser\">\n"
+ " <img src=\"http://www.rsdnation.com/files/imagecache/blog_thumbnail/files/blog_thumbs/rsdnatonaustin.jpg\" alt=\"\" title=\"\"/>\n"
+ " </a>\n"
+ "</div>";
// Parse the html from string or eg. connect to a website using connect()
Document doc = Jsoup.parseBodyFragment(html);
for( Element element : doc.select("div.teaser-img img[src]") )
{
System.out.println(element);
}
Output:
<img src="http://www.rsdnation.com/files/imagecache/blog_thumbnail/files/blog_thumbs/rsdnatonaustin.jpg" alt="" title="">
See the jsoup Selector documentation (http://jsoup.org/apidocs/org/jsoup/select/Selector.html) for the selector syntax.
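The loop above prints the whole <img> element. If only the URL is wanted, reading the src attribute of each match is one way to get it. A minimal sketch, assuming jsoup is on the classpath; the class and method names (TeaserImages, extractSources) and the example.com URLs are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TeaserImages {

    // Collects the src value of every <img> nested inside a div with class "teaser-img".
    static List<String> extractSources(String html) {
        Document doc = Jsoup.parse(html);
        List<String> sources = new ArrayList<>();
        for (Element img : doc.select("div.teaser-img img[src]")) {
            sources.add(img.attr("src"));
        }
        return sources;
    }

    public static void main(String[] args) {
        // Hypothetical input: two teaser divs and one unrelated div that should be skipped.
        String html = "<div class=\"teaser-img\"><a href=\"/a\"><img src=\"http://example.com/one.jpg\"/></a></div>"
                + "<div class=\"other\"><img src=\"http://example.com/skip.jpg\"/></div>"
                + "<div class=\"teaser-img\"><a href=\"/b\"><img src=\"http://example.com/two.jpg\"/></a></div>";
        System.out.println(extractSources(html));
    }
}
```

If the page uses relative src values, img.absUrl("src") can resolve them, provided the document was parsed with a base URI (for example when it was fetched with Jsoup.connect(url).get()).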

Related

Jsoup css selector "not", not return anything

I'm trying to skip one item so that Jsoup does not parse it, but the CSS selector ":not" is not working.
I don't understand what is wrong.
My code:
MangaList list = new MangaList();
Document document = getPage("https://3asq.org/");
MangaInfo manga;
for (Element o : document.select("div.page-item-detail:not(.item-thumb#manga-item-5520)")) {
manga = new MangaInfo();
manga.name = o.select("h3").first().select("a").last().text();
manga.path = o.select("a").first().attr("href");
try {
manga.preview = o.select("img").first().attr("src");
} catch (Exception e) {
manga.preview = "";
}
list.add(manga);
}
return list;
html code:
<div class="col-12 col-md-6 badge-pos-1">
<div class="page-item-detail manga">
<div id="manga-item-5520" class="item-thumb hover-details c-image-hover" data-post-id="5520">
<a href="https://3asq.org/manga/gosu/" title="Gosu">
<img width="110" height="150" src="https://3asq.org/wp-content/uploads/2020/03/IMG_4497-110x150.jpg" srcset="https://3asq.org/wp-content/uploads/2020/03/IMG_4497-110x150.jpg 110w, https://3asq.org/wp-content/uploads/2020/03/IMG_4497-175x238.jpg 175w" sizes="(max-width: 110px) 100vw, 110px" class="img-responsive" style="" alt="IMG_4497"/> </a>
</div>
<div class="item-summary">
<div class="post-title font-title">
<h3 class="h5">
<span class="manga-title-badges custom noal-manga">Noal-Manga</span> Gosu
</h3>
If I debug your code and print the first matched element with
System.out.println(document.select("div.page-item-detail").get(0));
(hint: use the expression evaluator in IntelliJ IDEA, Alt+F8, for in-session, real-time debugging), I get:
<div class="page-item-detail manga">
<div id="manga-item-2003" class="item-thumb hover-details c-image-hover" data-post-id="2003">
<a href="http...
...
</div>
</div>
</div>
It looks like you want to extract the next div tag down with class containing item-thumb ... but only if the id isn't manga-item-5520.
So here's what I did to remove that one item
document.select("div.page-item-detail div[class*=item-thumb][id!=manga-item-5520]")
Result size: 19
With the element included:
document.select("div.page-item-detail div[class*=item-thumb]")
Result size: 20
You can also try the following if you want to remain based at the outer div tag rather than the inner div tag.
document.select("div.page-item-detail:has(div[class*=item-thumb][id!=manga-item-5520])")
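As for why the selector in the question excludes nothing: :not(...) is tested against the div.page-item-detail element itself, and that outer div carries neither the item-thumb class nor the id, which both sit on the child div. A minimal sketch of the difference, assuming jsoup is on the classpath; the two-item HTML and the class name NotSelectorDemo are made up for illustration:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class NotSelectorDemo {

    // Two simplified items; only the first contains the thumb div with the unwanted id.
    static final String HTML =
            "<div class=\"page-item-detail manga\">"
          + "<div id=\"manga-item-5520\" class=\"item-thumb hover-details\"></div></div>"
          + "<div class=\"page-item-detail manga\">"
          + "<div id=\"manga-item-2003\" class=\"item-thumb hover-details\"></div></div>";

    // Returns how many elements of the sample document match the given CSS query.
    static int count(String cssQuery) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(cssQuery).size();
    }

    public static void main(String[] args) {
        // :not is checked on the outer div, so both items are still selected.
        System.out.println(count("div.page-item-detail:not(.item-thumb#manga-item-5520)"));
        // :has looks at descendants, so the item with the unwanted id is dropped.
        System.out.println(count("div.page-item-detail:has(div[class*=item-thumb][id!=manga-item-5520])"));
    }
}
```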

element not found when using xpath in angular site

I have an Angular website and I am trying to automate it using Selenium/Java. I know Protractor is easier for Angular sites, but I wish to use Selenium.
I have been using the "contains" keyword in the XPath to find elements, as there are no unique IDs available.
The element I am having a problem with is circled in red in the attached image. When I search the console with the XPath shown in the image, the element is highlighted. But when I use it in the code, I get an element-not-found error.
Is there a better way to handle this, and why am I getting the error? I already have a wait condition.
HTML Code:
<div class="status-selector ng-isolate-scope" fm-select=""
fm-select-options="::IssueDetailsCtrl.issue.allowedIssueStatuses"
fm-disabled="!IssueDetailsCtrl.issue.permissions.editStatus"
fm-model="IssueDetailsCtrl.issue.status"
fm-change="IssueDetailsCtrl.updateStatus()">
<div class="fm-select undefined selected" ng-class="getStyle()" tabindex=""
ng-keyup="handleKeys($event)" ng-keydown="handleKeyDown($event)" style="">
<div class="fm-select-title" ng-click="toggleVisibility()"
ng-class="{"fm-select-title-highlighted":
isOpen, "fm-select-disabled": fmDisabled }">
<div class="selected-item-icon" ng-class="selectedOption.imageClass"></div>
<span class="ng-binding">Open</span>
</div>
<!-- ngIf: isOpen -->
</div>
</div>
Wait condition:
public void waitAndClickElement(WebElement element) throws InterruptedException {
boolean clicked = false;
int attempts = 0;
while (!clicked && attempts < 20) {
try {
this.wait.until(ExpectedConditions.elementToBeClickable(element)).click();
System.out.println("Successfully clicked on the WebElement: " + "<" + element.toString() + ">");
clicked = true;
} catch (Exception e) {
System.out.println("Unable to wait and click on WebElement, Exception: " + e.getMessage());
Assert.fail("Unable to wait and click on the WebElement, using locator: " + "<" + element.toString() + ">");
}
attempts++;
}
}
Your app is an Angular app, and the button is attached to the DOM tree dynamically, only once an Angular ng-if condition is satisfied. You therefore need to wait until Angular has compiled the button's HTML and attached it to the DOM tree:
WebDriverWait wait = new WebDriverWait(driver, 20);
WebElement ele = wait.until(ExpectedConditions.presenceOfElementLocated(
By.xpath("//div/span[text()='Open']"))
);
ele.click();
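Conceptually, an explicit wait keeps re-running a lookup until it yields a result or a timeout expires. The following Selenium-free sketch illustrates only that idea; Poller and pollUntil are hypothetical names, and unlike WebDriverWait (which throws a TimeoutException) this version returns an empty Optional on timeout:

```java
import java.time.Duration;
import java.util.Optional;
import java.util.function.Supplier;

public class Poller {

    // Calls lookup repeatedly until it returns a non-null value or the timeout elapses.
    static <T> Optional<T> pollUntil(Supplier<T> lookup, Duration timeout, Duration interval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            T result = lookup.get();
            if (result != null) {
                return Optional.of(result);
            }
            if (System.nanoTime() >= deadline) {
                return Optional.empty();
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return Optional.empty();
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Simulates an element that only "appears" after about 200 ms.
        Optional<String> found = pollUntil(
                () -> System.currentTimeMillis() - start > 200 ? "Open" : null,
                Duration.ofSeconds(2), Duration.ofMillis(50));
        System.out.println(found.isPresent());
    }
}
```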
I would suggest using an explicit wait in this case:
new WebDriverWait(driver, 20).until(ExpectedConditions.elementToBeClickable(By.xpath("//span[@class='ng-binding' and text()='Open']"))).click();
Or you can use this XPath:
//div[@class='selected-item-icon']/following-sibling::span[text()='Open' and @class='ng-binding']
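XPath addresses attributes with @ (as in @class), so the expressions from these answers can be sanity-checked outside the browser in that form. The sketch below uses the JDK's built-in javax.xml.xpath engine rather than Selenium, so it only checks the matching logic against a simplified, well-formed copy of the markup; the class name XPathCheck and the trimmed MARKUP string are assumptions made for illustration:

```java
import java.io.StringReader;

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathCheck {

    // Simplified, well-formed version of the fm-select-title block from the question.
    static final String MARKUP =
            "<div class=\"fm-select-title\">"
          + "<div class=\"selected-item-icon\"></div>"
          + "<span class=\"ng-binding\">Open</span>"
          + "</div>";

    // Returns the number of nodes in MARKUP that the XPath expression selects.
    static int countMatches(String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            NodeList nodes = (NodeList) xpath.evaluate(
                    expression, new InputSource(new StringReader(MARKUP)), XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Invalid XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countMatches("//div/span[text()='Open']"));
        System.out.println(countMatches("//span[@class='ng-binding' and text()='Open']"));
        System.out.println(countMatches(
                "//div[@class='selected-item-icon']/following-sibling::span[text()='Open' and @class='ng-binding']"));
    }
}
```

Note that text()='Open' is an exact comparison, so surrounding whitespace in the real page would prevent a match; normalize-space() is the usual XPath remedy for that.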

How to fetch data from ul li in android studio with jsoup

I am trying to get Product 1 and Product 2, but I can't. Please help.
I am using jsoup and Volley.
<ul id="searched-products">
<li>
<div class="gd-col navUnitContainer1 gu4">
<div class="product_name">
<a>Prodict 1</a>
</div>
</div>
</li>
<li>
<div class="gd-col navUnitContainer1 gu4">
<div class="product_name">
<a>Prodict 2</a>
</div>
</div>
</li>
</ul>
I have tried this
Elements itemElements = doc.select("ul#searched-products li");
but it's not selecting the "li" elements. I have also tried this:
Elements itemElements = doc.select("ul#searched-products"); //this line works
Element e1 = itemElements.get(i);
e1.select("li"); // or e1.getElementsByTag("li");
Still no good.
There are hundreds of li elements on the page, so I can't just do this:
doc.select("li");
Please suggest something.
Like this:
public class JsoupList {
public static void main(String[] brawwwr){
String html = "<ul id=\"searched-products\">" +
"<li>" +
"<div class=\"gd-col navUnitContainer1 gu4\">" +
"<div class=\"product_name\">" +
"<a>Prodict 1</a>" +
"</div>" +
"</div>" +
"</li>" +
"<li>" +
"<div class=\"gd-col navUnitContainer1 gu4\">" +
"<div class=\"product_name\">"+
"<a>Prodict 2</a>" +
"</div>" +
"</div>" +
"</li>" +
"</ul>";
Document doc = Jsoup.parse(html);
Elements itemElements = doc.select("ul#searched-products li");
for(Element elem : itemElements){
System.out.println(elem.select("div div a").text());
}
}
}
Will return
Prodict 1
Prodict 2
You can think of the repeated markup inside each tag as a little page of its own.
Try this code.
Elements itemElements = doc.select("ul#searched-products");
itemElements = itemElements.select("li");
for(Element ele : itemElements){
String text = ele.text();
System.out.println(text); //this will return Prodict 1 and Prodict 2
}
// or you can select the <a> inside each item
for(Element ele : itemElements){
String text = ele.select("a").first().text();
System.out.println(text); //this will also return Prodict 1 and Prodict 2
}
To exclude <li> or <a> tags outside the list, you need to restrict the selector so that it matches only inside the list. The best option is to use the ID (#searched-products). Then select <li> or <a> tags not from the doc, but from the selected <ul> element.
You can get your text with any of the following selectors (not a complete list):
#searched-products li a
#searched-products a
#searched-products .product_name a
#searched-products .product_name
Even the last one is okay, since you need only the text, and div.product_name contains only the <a> tag.
for(Element e: doc.select("#searched-products .product_name")) {
String t = e.text(); // Prodict N
}
By the way, your original approach of selecting <li> tags inside ul#searched-products should have worked. If it doesn't return anything, the list may be generated dynamically on that page. You can test this easily by printing out the HTML that Jsoup has (doc.html() or doc.select("#searched-products").html()).
If that really is the case, Jsoup is not the right tool for you. I suggest using Selenium, possibly with a headless browser (HtmlUnit or PhantomJS). They can return and even interact with dynamically created elements, so other parts of your crawl process might be simplified as well.
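The check described above can be turned into a small helper. This is a minimal sketch, assuming jsoup is on the classpath; StaticListCheck and listIsInStaticHtml are hypothetical names, and the inline HTML strings stand in for whatever the server actually returns:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class StaticListCheck {

    // True if the HTML as delivered by the server already contains list items.
    // If this is false while the browser shows items, they are added by JavaScript.
    static boolean listIsInStaticHtml(String html) {
        Document doc = Jsoup.parse(html);
        return !doc.select("#searched-products li").isEmpty();
    }

    public static void main(String[] args) {
        String serverRendered = "<ul id=\"searched-products\"><li><a>Prodict 1</a></li></ul>";
        String scriptRendered = "<ul id=\"searched-products\"></ul><script>/* fills the list later */</script>";
        System.out.println(listIsInStaticHtml(serverRendered));
        System.out.println(listIsInStaticHtml(scriptRendered));
    }
}
```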

Jsoup get text from website

I can already navigate the site and get all the links that I want, but my main objective is getting the comments on the hotels. The site I am using is this: http://www.booking.com/hotel/pt/park-italia-flat.pt-pt.html?label=gen173nr-17CAEoggJCAlhYSDNiBW5vcmVmaLsBiAEBmAEvuAEEyAEE2AEB6AEB-AEL;sid=637e7af0c3009aa9ea132a960e2d2d40;dcid=4;ucfs=1;room1=A,A;srfid=b8260a1c264a3873291a9061733a43536a4d35c2X979#tab-reviews
I can get there using jsoup with no problem, but now I don't know how to get the text. I have already tried getElementsByTag, getText and other solutions. Can this be done with jsoup, or do I need another library?
I am trying to get the text this way, but the text that appears is not what I want.
Document doc ;
try {
doc = Jsoup.connect(pair.getValue().toString() + "#tab-reviews").get();
Elements scriptElements = doc.getElementsMatchingText("span");
for (Element link : scriptElements ) {
System.out.printf(" Text: <%s> \n", link.text());
}
} catch (IOException ex) {
Logger.getLogger(GetComentsThread.class.getName()).log(Level.SEVERE, null, ex);
}
To get the URLs I am using something like this:
Pattern pattern = Pattern.compile("src=destinationfinder");
Document doc = Jsoup.connect(url).get();
Elements links = doc.select("a[href]");
for (Element link : links) {
Matcher matcher = pattern.matcher(link.attr("abs:href"));
if (matcher.find()) {
dest = link.attr("abs:href");
break;
}
}
Now I can get some reviews, but only the positive ones, and I don't know why:
doc = Jsoup.connect(pair.getValue().toString() + "#tab-reviews").get();
//doc = Jsoup.connect("http://www.booking.com/hotel/pt/pestanaportohotel.pt-pt.html?label=gen173nr-17CAEoggJCAlhYSDNiBW5vcmVmaLsBiAEBmAEvuAEEyAEE2AEB6AEB-AEL;sid=cff2dddd95e71c0768847a554584c888;dcid=4;dist=0;group_adults=2;room1=A%2CA;sb_price_type=total;srfid=798bd6b01ea1dba53ee6b6b945dda1f623859730X2;type=total;ucfs=1&#tab-reviews").get();
String teste="p.trackit";
Elements scriptElements = doc.select(teste);
for (Element link : scriptElements) {
//System.out.printf(" Text: <%s> ...%s\n", link.text(),link.attr("class=\"review_pos\""));
System.out.printf(" Text: <> ...%s\n",link.text());
}
Reviews are loaded using an AJAX request to another url.
There you can get all the info you need.
Response:
<li class="
review_item
clearfix
">
<p class="review_item_date">
16 de Setembro de 2015
</p>
<div class="review_item_reviewer">
<h4>
Beatriz
</h4>
<span class="reviewer_country">
<span class="reviewer_country_flag sflag slang-br">
</span>
Brasil
</span>
</div>
<!-- .review_item_reviewer -->
<div class="review_item_review">
<div class="
review_item_review_container
lang_ltr
seo_reviews_item
">
<div class="review_item_review_header">
<div class="
review_item_header_score_container
">
<div class="review_item_review_score jq_tooltip high_score_tooltip" title="
Excepcional
">
9,6
</div>
</div>
<div class="review_item_header_content_container">
<div class="review_item_header_content seo_review_title">
Excepcional
</div>
</div>
</div>
<ul class="review_item_info_tags">
<li class="review_info_tag"><span class="bullet">•</span> Viagem de lazer</li>
<li class="review_info_tag"><span class="bullet">•</span> Família</li>
<li class="review_info_tag"><span class="bullet">•</span> Apartamento com Varanda</li>
<li class="review_info_tag"><span class="bullet">•</span> Ficou 5 noites</li>
<li class="review_info_tag"><span class="bullet">•</span> Submetido através de dispositivo móvel</li>
</ul>
<div class="review_item_review_content">
<p class="review_pos"><i class="review_item_icon"></i>Conforto, perto do centro, perto de um lindo mercado, bem decorado, com todo material necessário para fazer as refeições, Wi-Fi excelente</p>
</div>
</div>
</div>
</li>
It looks like you just need to use jsoup to get the content of the elements with
class="review_pos" and class="review_neg".
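One way to read both kinds of comment is a grouped selector. This is a minimal sketch, assuming jsoup is on the classpath and that negative comments use the same markup as the positive one in the response, only with class review_neg; ReviewText and extractComments are hypothetical names, and the sample texts are invented:

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class ReviewText {

    // Returns each comment prefixed with "+" (review_pos) or "-" (review_neg), in document order.
    static List<String> extractComments(String reviewHtml) {
        List<String> comments = new ArrayList<>();
        for (Element p : Jsoup.parse(reviewHtml).select("p.review_pos, p.review_neg")) {
            String sign = p.hasClass("review_pos") ? "+" : "-";
            // ownText() skips the text of child elements such as the <i> icon.
            comments.add(sign + " " + p.ownText().trim());
        }
        return comments;
    }

    public static void main(String[] args) {
        String html = "<div class=\"review_item_review_content\">"
                + "<p class=\"review_pos\"><i class=\"review_item_icon\">x</i>Good wifi</p>"
                + "<p class=\"review_neg\"><i class=\"review_item_icon\">x</i>Noisy street</p>"
                + "</div>";
        System.out.println(extractComments(html));
    }
}
```

In real use the input would be the body of the AJAX response rather than the hotel page itself.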

Jsoup how to parse text inside span class="hps"

<span id="result_box" class="short_text" lang="es">
<span class="hps">
hello
</span>
<span class="hps">
world
</span>
</span>
I want to get the "hello world" string using Jsoup, but I have no idea how to do this.
Use Jsoup.parse to get the HTML Document, then select the elements you want with a CSS selector such as span.hps (http://jsoup.org/apidocs/org/jsoup/select/Selector.html):
Document doc = Jsoup.parse("<span id=\"result_box\" class=\"short_text\" lang=\"es\">\n" +
" <span class=\"hps\">\n" +
" hello\n" +
" </span>\n" +
" <span class=\"hps\">\n" +
" world\n" +
" </span>\n" +
"</span>");
System.out.println(doc.html());
Elements els = doc.select("span.hps");
for(Element e:els){
System.out.print(e.text());
}
If you don't need each element's value separately, you can replace the for loop with:
els.text()
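The two variants differ slightly: printing each element with System.out.print runs the words together, while Elements.text() combines the texts of the matched elements with a single space between them. A minimal sketch, assuming jsoup is on the classpath; HpsText and joinedText are hypothetical names:

```java
import org.jsoup.Jsoup;
import org.jsoup.select.Elements;

public class HpsText {

    // Returns the combined text of all span.hps elements, separated by single spaces.
    static String joinedText(String html) {
        Elements els = Jsoup.parse(html).select("span.hps");
        return els.text();
    }

    public static void main(String[] args) {
        String html = "<span id=\"result_box\" class=\"short_text\" lang=\"es\">\n"
                + " <span class=\"hps\">\n hello\n </span>\n"
                + " <span class=\"hps\">\n world\n </span>\n"
                + "</span>";
        System.out.println(joinedText(html));
    }
}
```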
