JSoup not able to get links from html

I'm trying to extract links from a site's HTML with Jsoup, but I can't get it to work.
This is the HTML:
<div class="anime_muti_link">
<ul>
<li><div class="doamin">Domain</div><div class="link">Link</div></li>
<li class="anime">
Server m1</div><span>Watch This Link</span>
</li>
<li class="anime">
Server m2</div><span>Watch This Link</span>
</li>
<li class="xstreamcdn">
Xstreamcdn</div><span>Watch This Link</span>
</li>
<li class="mixdrop">
<div class="server mixdrop">Mixdrop</div><span>Watch This Link</span>
</li>
<li class="streamsb">
StreamSB</div><span>Watch This Link</span>
</li>
<li class="doodstream">
Doodstream</div><span>Watch This Link</span>
</li>
</ul>
</div>
This is the Android code I wrote, which doesn't seem to work:
try {
Document doc = Jsoup.connect(URL).get();
Elements content = doc.getElementsByClass("anime_muti_link");
Elements links = content.select("a");
String[] urls = new String[links.size()];
for (int i = 0; i < links.size(); i++) {
urls[i] = links.get(i).attr("data-video");
if (!urls[i].startsWith("https://")) {
urls[i] = "https:" + urls[i];
}
}
arrayList.addAll(Arrays.asList(urls));
Log.d("CALLING_URL", "Links: " + Arrays.toString(urls));
} catch (IOException e) {
e.getMessage();
}
Can someone please help me with this? Thanks.
Edit: Basically, I'm trying to get those 6 links and add them to my list so I can use them within the app.
Edit 2:
I found another HTML fragment that seems better suited:
<div class="heading-servers">
<span><i class="fa fa-signal"></i> Servers</span>
<ul class="servers">
<li data-vs="https://example.com" class="server server-active" style="display: block;" onclick="return loadIframe('ifrm', this.getAttribute('data-vs'));">Netu</li>
<li data-vs="https://example.com" class="server" style="display: block;" onclick="return loadIframe('ifrm', this.getAttribute('data-vs'));">VideoVard</li>
<li data-vs="https://example.com" class="server" style="display: block;" onclick="return loadIframe('ifrm', this.getAttribute('data-vs'));">Doodstream</li>
<li data-vs="https://example.com" class="server" style="display: block;" onclick="return loadIframe('ifrm', this.getAttribute('data-vs'));">Okstream</li>
</ul>
</div>

As you can see, in this li definition you are including a nested div:
<li class="xstreamcdn">
Xstreamcdn</div><span>Watch This Link</span>
</li>
As a result, the variable content (the HTML fragment with class anime_muti_link) looks like this once parsed:
<div class="anime_muti_link">
<ul>
<li>
<div class="doamin">
Domain
</div>
<div class="link">
Link
</div></li>
<li class="anime"> <a href="#" class="active" rel="1" data-video="example.com">
<div class="server m1">
Server m1
</div><span>Watch This Link</span></a> </li>
<li class="anime"> <a href="#" rel="1" data-video="example.com">
<div class="server m1">
Server m2
</div><span>Watch This Link</span></a> </li>
<li class="xstreamcdn"> Xstreamcdn</li>
</ul>
</div>
A similar result will be obtained even if you tidy your HTML. I used this code from one of my previous answers:
Tidy tidy = new Tidy();
tidy.setXHTML(true);
tidy.setIndentContent(true);
tidy.setPrintBodyOnly(true);
tidy.setInputEncoding("UTF-8");
tidy.setOutputEncoding("UTF-8");
tidy.setSmartIndent(true);
tidy.setShowWarnings(false);
tidy.setQuiet(true);
tidy.setTidyMark(false);
org.w3c.dom.Document htmlDOM = tidy.parseDOM(new ByteArrayInputStream(html.getBytes()), null);
OutputStream out = new ByteArrayOutputStream();
tidy.pprint(htmlDOM, out);
String tidiedHtml = out.toString();
// System.out.println(tidiedHtml);
Document document = Jsoup.parse(tidiedHtml);
Elements content = document.getElementsByClass("anime_muti_link");
System.out.println(content);
And this is why you are finding only three anchors.
Please try correcting your HTML, or select the anchor tags at the document level instead:
Document document = Jsoup.parse(html);
// Elements content = document.getElementsByClass("anime_muti_link");
// System.out.println(content);
Elements links = document.select("a");
String[] urls = new String[links.size()];
for (int i = 0; i < links.size(); i++) {
urls[i] = links.get(i).attr("data-video");
if (!urls[i].startsWith("https://")) {
urls[i] = "https://" + urls[i];
}
}
System.out.println(Arrays.asList(urls));
If the result obtained contains undesired links, perhaps you can try narrowing the selector used, something like:
document.select(".anime_muti_link a")
If this doesn't work, another possible alternative could be selecting the anchor elements with a data-video attribute, a[data-video]:
Document document = Jsoup.parse(html);
Elements videoLinks = document.select("a[data-video]");
String[] urls = new String[videoLinks.size()];
for (int i = 0; i < videoLinks.size(); i++) {
urls[i] = videoLinks.get(i).attr("data-video");
if (!urls[i].startsWith("https://")) {
urls[i] = "https://" + urls[i];
}
}
System.out.println(Arrays.asList(urls));
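The prefix check in the loops above treats every value that does not start with https:// the same way. As a hedged sketch, the helper below handles the shapes a data-video value might plausibly take: already absolute, plain http, protocol-relative (//host/path), or a bare host. The class name VideoUrls and the method toHttps are hypothetical (they are not part of Jsoup), and whether the real site uses any of these shapes is an assumption, so adjust it to what you actually see.

```java
// Hypothetical helper, not part of Jsoup: normalizes a raw attribute value to an https URL.
final class VideoUrls {
    private VideoUrls() {}

    static String toHttps(String raw) {
        String value = raw.trim();
        if (value.isEmpty()) {
            return value;                        // Jsoup's attr() returns "" when the attribute is missing
        }
        if (value.startsWith("https://")) {
            return value;                        // already in the desired form
        }
        if (value.startsWith("http://")) {
            // Assumption: the host also serves https, so the scheme can be upgraded.
            return "https://" + value.substring("http://".length());
        }
        if (value.startsWith("//")) {
            return "https:" + value;             // protocol-relative URL
        }
        return "https://" + value;               // bare host or host/path
    }
}
```

Inside the loop this would replace the startsWith check, for example urls[i] = VideoUrls.toHttps(videoLinks.get(i).attr("data-video"));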
With your new test case, you can obtain the desired information with very similar code:
String html = "<div class=\"heading-servers\">\n" +
" <span><i class=\"fa fa-signal\"></i> Servers</span>\n" +
" <ul class=\"servers\">\n" +
" <li data-vs=\"https://example.com\" class=\"server server-active\" style=\"display: block;\" onclick=\"return loadIframe('ifrm', this.getAttribute('data-vs'));\">Netu</li>\n" +
" <li data-vs=\"https://example.com\" class=\"server\" style=\"display: block;\" onclick=\"return loadIframe('ifrm', this.getAttribute('data-vs'));\">VideoVard</li>\n" +
" <li data-vs=\"https://example.com\" class=\"server\" style=\"display: block;\" onclick=\"return loadIframe('ifrm', this.getAttribute('data-vs'));\">Doodstream</li>\n" +
" <li data-vs=\"https://example.com\" class=\"server\" style=\"display: block;\" onclick=\"return loadIframe('ifrm', this.getAttribute('data-vs'));\">Okstream</li>\n" +
" </ul>\n" +
" </div>";
Document document = Jsoup.parse(html);
Elements videoLinks = document.select("div.heading-servers ul.servers li.server");
String[] urls = new String[videoLinks.size()];
for (int i = 0; i < videoLinks.size(); i++) {
urls[i] = videoLinks.get(i).attr("data-vs");
if (!urls[i].startsWith("https://")) {
urls[i] = "https://" + urls[i];
}
}
System.out.println(Arrays.asList(urls));
The most important part is the selector applied to the parsed document, div.heading-servers ul.servers li.server in our case.
I provided a selector with several fragments, but depending on the actual HTML it could be simplified to ul.servers li.server or even li.server.

Related

Jsoup css selector "not", not return anything

I'm trying to skip one item so that Jsoup doesn't parse it, but the CSS :not selector isn't working and I don't understand what is wrong.
My code:
MangaList list = new MangaList();
Document document = getPage("https://3asq.org/");
MangaInfo manga;
for (Element o : document.select("div.page-item-detail:not(.item-thumb#manga-item-5520)")) {
manga = new MangaInfo();
manga.name = o.select("h3").first().select("a").last().text();
manga.path = o.select("a").first().attr("href");
try {
manga.preview = o.select("img").first().attr("src");
} catch (Exception e) {
manga.preview = "";
}
list.add(manga);
}
return list;
html code:
<div class="col-12 col-md-6 badge-pos-1">
<div class="page-item-detail manga">
<div id="manga-item-5520" class="item-thumb hover-details c-image-hover" data-post-id="5520">
<a href="https://3asq.org/manga/gosu/" title="Gosu">
<img width="110" height="150" src="https://3asq.org/wp-content/uploads/2020/03/IMG_4497-110x150.jpg" srcset="https://3asq.org/wp-content/uploads/2020/03/IMG_4497-110x150.jpg 110w, https://3asq.org/wp-content/uploads/2020/03/IMG_4497-175x238.jpg 175w" sizes="(max-width: 110px) 100vw, 110px" class="img-responsive" style="" alt="IMG_4497"/> </a>
</div>
<div class="item-summary">
<div class="post-title font-title">
<h3 class="h5">
<span class="manga-title-badges custom noal-manga">Noal-Manga</span> Gosu
</h3>

If I debug your code and print the first match with:
System.out.println(document.select("div.page-item-detail").get(0))
(hint: the expression evaluator in IntelliJ IDEA, Alt+F8, is handy for in-session, real-time debugging)
I get:
<div class="page-item-detail manga">
<div id="manga-item-2003" class="item-thumb hover-details c-image-hover" data-post-id="2003">
<a href="http...
...
</div>
</div>
</div>
It looks like you want to extract the next div tag down with class containing item-thumb ... but only if the id isn't manga-item-5520.
So here's what I did to exclude that one item:
document.select("div.page-item-detail div[class*=item-thumb][id!=manga-item-5520]")
Result size: 19
With the element included:
document.select("div.page-item-detail div[class*=item-thumb]")
Result size: 20
You can also try the following if you want to remain based at the outer div tag rather than the inner div tag.
document.select("div.page-item-detail:has(div[class*=item-thumb][id!=manga-item-5520])")

I want to extract all <li> element text that are under <ul>

I need to click on every element in this menu (Basic, Tracks, ...).
My idea is to collect all the elements into a list, then loop over it and click each one.
The check should keep working when a new element is added, without me having to change the code.
<div class="headerarea" style="" xpath="1">
<h2>
<span id="ctl00_ctl00_phDesktop_lblModuleTitle">Abstract Setup</span>
</h2>
<ul>
<li>
<a id="ctl00_ctl00_phDesktop_rModuleNavigation_ctl01_btnModuleNavigation" class="headerarea_active" href="https://staging.m-anage.com/Modules/Abstract/Setup/basics.aspx">Basic</a></li>
<li>
<a id="ctl00_ctl00_phDesktop_rModuleNavigation_ctl02_btnModuleNavigation" href="https://staging.m-anage.com/testselenium/en-US/Abstract/AbstractSetup/Tracks">Tracks</a></li>
<li>
<a id="ctl00_ctl00_phDesktop_rModuleNavigation_ctl03_btnModuleNavigation" href="https://staging.m-anage.com/Modules/Abstract/Setup/steps.aspx">WIZARD</a></li>
<li>
<a id="ctl00_ctl00_phDesktop_rModuleNavigation_ctl04_btnModuleNavigation" href="https://staging.m-anage.com/Modules/Abstract/Setup/keywords.aspx">KEYWORDS</a></li>
<li>
<a id="ctl00_ctl00_phDesktop_rModuleNavigation_ctl05_btnModuleNavigation" href="https://staging.m-anage.com/Modules/Abstract/Setup/categories.aspx">CATEGORIES</a></li>
<li>
<a id="ctl00_ctl00_phDesktop_rModuleNavigation_ctl06_btnModuleNavigation" href="https://staging.m-anage.com/Modules/Abstract/Setup/conditions.aspx">CONDITIONS</a></li>
<li>
<a id="ctl00_ctl00_phDesktop_rModuleNavigation_ctl07_btnModuleNavigation" href="https://staging.m-anage.com/Modules/Abstract/Setup/interests.aspx">Interests</a></li>
<li>
<a id="ctl00_ctl00_phDesktop_rModuleNavigation_ctl08_btnModuleNavigation" href="https://staging.m-anage.com/Modules/Abstract/Setup/templates.aspx">Templates</a></li>
<li>
<a id="ctl00_ctl00_phDesktop_rModuleNavigation_ctl09_btnModuleNavigation" href="https://staging.m-anage.com/testselenium/en-US/Abstract/AbstractSetup/Index">Submission fee</a></li>
<li>
<a id="ctl00_ctl00_phDesktop_rModuleNavigation_ctl10_btnModuleNavigation" href="https://staging.m-anage.com/testselenium/en-US/Mail/MailServerSetup/Index?pModuleType=Abstract" style="">SMTP Setup</a></li>
<li>
<a id="ctl00_ctl00_phDesktop_rModuleNavigation_ctl11_btnModuleNavigation" href="https://staging.m-anage.com/testselenium/en-US/Abstract/AbstractSetup/Coauthor">Co-author</a></li>
</ul>
</div>
I tried navigating to the child path, but without success.
Here is the Java code that I tried:
List<WebElement> tags =
driver.findElements(By.xpath("//div[@class='headerarea']/ul/li"));
for(int i=0;i<tags.size();i++) {
while(???) {
//driver.findElement(By.xpath("//div[@class='headerarea']/ul/li")).click();
}
}

Try the code below:
// Collect every <li> on the page and print its visible text.
List<WebElement> links = driver.findElements(By.tagName("li"));
for (int i = 0; i < links.size(); i++)
{
System.out.println(links.get(i).getText());
}
You can also use WebDriverWait if you are facing a synchronization issue:
WebDriverWait wait = new WebDriverWait(driver, 10);
List<WebElement> links = wait.until(ExpectedConditions.presenceOfAllElementsLocatedBy(By.tagName("li")));
for (int i = 0; i < links.size(); i++)
{
System.out.println(links.get(i).getText());
}

List<WebElement> tags = driver.findElements(By.cssSelector(".headerarea ul>li"));
for(WebElement e : tags) {
e.click();
}

Iterate a list of web elements and obtain two or more values

I need to iterate over a list of web elements and get the text of both the name and the title from each one.
Example:
<ul id="GalleryViewInner" class="gv-ic">
<li id="item3ad73f1239" class="sresult gvresult">
<div id="1"
<span id="span1">5,000USD</span>
</div>
<div id="2"
<td id="td1">TITLE</td>
</div>
<li id="item3ad73f1239" class="sresult gvresult">
<li id="item3ad73f1239" class="sresult gvresult">
<li id="item3ad73f1239" class="sresult gvresult">
<li id="item3ad73f1239" class="sresult gvresult">
</ul>
Iterating over the ul list:
List<WebElement> allElements = driver.findElements(By.xpath("//ul[@id='GalleryViewInner']/li"));
Iterator<WebElement> iter = allElements.iterator();
while (iter.hasNext()) {
WebElement PRICE = iter.next();
PRICE.getTxt();
TITLE.getText();
}
In each iteration I need to get two or more elements from each li: the price and the name of every li element.
I'm using Java with Selenium WebDriver.

Assuming your web elements look like the ones you showed above, and the other <li></li> items have the same structure:
<ul id="GalleryViewInner" class="gv-ic">
<li id="item3ad73f1239" class="sresult gvresult">
<div id="1"
<span id="span1">5,000USD</span>
</div>
<div id="2"
<td id="td1">TITLE</td>
</div>
</li>
<li></li>
</ul>
Assuming the price is the <span></span> with id "span1" and the name is the <td></td> with id "td1", I'd go with this approach:
List<WebElement> liElements = driver.findElements(By.xpath("//ul[@id='GalleryViewInner']/li"));
for(WebElement li : liElements){
WebElement spanPrice = li.findElement(By.id("span1"));
String price = spanPrice.getText();
WebElement tdName = li.findElement(By.id("td1"));
String name = tdName.getText();
}

Based on the HTML you provided, you can get the text of the name and title with the following code block:
List<WebElement> all_span_elements = driver.findElements(By.xpath("//ul[@id='GalleryViewInner']//li/div/span"));
List<WebElement> all_td_elements = driver.findElements(By.xpath("//ul[@id='GalleryViewInner']//li//following::div[2]/td"));
List<String> names = new ArrayList<>();
List<String> titles = new ArrayList<>();
for(WebElement ele1:all_span_elements)
names.add(ele1.getAttribute("innerHTML"));
for(WebElement ele2:all_td_elements)
titles.add(ele2.getAttribute("innerHTML"));
for(int i=0; i<all_span_elements.size(); i++)
System.out.println("Name is : " + names.get(i) + " and Title is : " + titles.get(i));
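One thing to watch with two parallel lists: if any li lacks a span or a td, the lists end up with different lengths, and get(i) either throws or pairs the wrong values. Below is a minimal sketch of a defensive pairing step, using plain strings in place of the text read from each WebElement. The class name Pairing and the "name | title" output format are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: pairs two parallel lists without running past the shorter one.
final class Pairing {
    private Pairing() {}

    static List<String> pair(List<String> names, List<String> titles) {
        // Stop at the shorter list so get(i) can never throw IndexOutOfBoundsException.
        int n = Math.min(names.size(), titles.size());
        List<String> rows = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            rows.add(names.get(i) + " | " + titles.get(i));
        }
        return rows;
    }
}
```

This only avoids the exception. It cannot tell which li a missing value belonged to, so locating both values inside each li is still the more reliable way to keep them aligned.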

Jsoup get text from website

I can already navigate the site and get all the links I want, but my main objective is to get the hotel reviews. The site I am using is http://www.booking.com/hotel/pt/park-italia-flat.pt-pt.html?label=gen173nr-17CAEoggJCAlhYSDNiBW5vcmVmaLsBiAEBmAEvuAEEyAEE2AEB6AEB-AEL;sid=637e7af0c3009aa9ea132a960e2d2d40;dcid=4;ucfs=1;room1=A,A;srfid=b8260a1c264a3873291a9061733a43536a4d35c2X979#tab-reviews
I can get there with Jsoup without a problem, but I don't know how to get the text. I already tried getElementsByTag, getText, and other solutions. Can this be done with Jsoup, or do I need another library?
This is how I am trying to get the text, but the text that appears is not what I want:
Document doc ;
try {
doc = Jsoup.connect(pair.getValue().toString() + "#tab-reviews").get();
Elements scriptElements = doc.getElementsMatchingText("span");
for (Element link : scriptElements ) {
System.out.printf(" Text: <%s> \n", link.text());
}
} catch (IOException ex) {
Logger.getLogger(GetComentsThread.class.getName()).log(Level.SEVERE, null, ex);
}
To get the URLs I am using something like this:
Pattern pattern = Pattern.compile("src=destinationfinder");
Document doc = Jsoup.connect(url).get();
Elements links = doc.select("a[href]");
for (Element link : links) {
Matcher matcher = pattern.matcher(link.attr("abs:href"));
if (matcher.find()) {
dest = link.attr("abs:href");
break;
}
}
Now I can get some reviews, but only the positive ones, and I don't know why:
doc = Jsoup.connect(pair.getValue().toString() + "#tab-reviews").get();
//doc = Jsoup.connect("http://www.booking.com/hotel/pt/pestanaportohotel.pt-pt.html?label=gen173nr-17CAEoggJCAlhYSDNiBW5vcmVmaLsBiAEBmAEvuAEEyAEE2AEB6AEB-AEL;sid=cff2dddd95e71c0768847a554584c888;dcid=4;dist=0;group_adults=2;room1=A%2CA;sb_price_type=total;srfid=798bd6b01ea1dba53ee6b6b945dda1f623859730X2;type=total;ucfs=1&#tab-reviews").get();
String teste="p.trackit";
Elements scriptElements = doc.select(teste);
for (Element link : scriptElements) {
//System.out.printf(" Text: <%s> ...%s\n", link.text(),link.attr("class=\"review_pos\""));
System.out.printf(" Text: <> ...%s\n",link.text());
}

Reviews are loaded using an AJAX request to another url.
There you can get all the info you need.
Response:
<li class="
review_item
clearfix
">
<p class="review_item_date">
16 de Setembro de 2015
</p>
<div class="review_item_reviewer">
<h4>
Beatriz
</h4>
<span class="reviewer_country">
<span class="reviewer_country_flag sflag slang-br">
</span>
Brasil
</span>
</div>
<!-- .review_item_reviewer -->
<div class="review_item_review">
<div class="
review_item_review_container
lang_ltr
seo_reviews_item
">
<div class="review_item_review_header">
<div class="
review_item_header_score_container
">
<div class="review_item_review_score jq_tooltip high_score_tooltip" title="
Excepcional
">
9,6
</div>
</div>
<div class="review_item_header_content_container">
<div class="review_item_header_content seo_review_title">
Excepcional
</div>
</div>
</div>
<ul class="review_item_info_tags">
<li class="review_info_tag"><span class="bullet">•</span> Viagem de lazer</li>
<li class="review_info_tag"><span class="bullet">•</span> Família</li>
<li class="review_info_tag"><span class="bullet">•</span> Apartamento com Varanda</li>
<li class="review_info_tag"><span class="bullet">•</span> Ficou 5 noites</li>
<li class="review_info_tag"><span class="bullet">•</span> Submetido através de dispositivo móvel</li>
</ul>
<div class="review_item_review_content">
<p class="review_pos"><i class="review_item_icon">눇</i>Conforto, perto do centro, perto de um lindo mercado, bem decorado, com todo material necessário para fazer as refeições, Wi-Fi excelente</p>
</div>
</div>
</div>
</li>

It looks like you just need to use Jsoup to get the content of the elements with
class="review_pos" and class="review_neg".

Pagination in webDriver displays Stale Element Reference Exception

Environment: Eclipse, Chrome, Java
I am working on a test case for pagination in the application. I have tried some code, but it only gets as far as the 2nd page.
Code:
List<WebElement> allpages = driver.findElements(By.xpath("//div[@class='pagination']//a"));
System.out.println(allpages.size());
if(allpages.size()>0)
{
System.out.println("Pagination exists");
for(int i=0; i<allpages.size(); i++)
{
Thread.sleep(3000);
allpages.get(i).click();
driver.manage().timeouts().pageLoadTimeout(5, TimeUnit.SECONDS);
//System.out.println(i);
}
}
else
{
System.out.println("Pagination doesn't exists");
}
}
The size displayed is 12. The issue is that it only gets as far as the 2nd page and then throws a StaleElementReferenceException.
Here's the HTML for the pagination:
<div id="page-navigation" class="pull-right">
<div id="303b171e-5a26-e456" class="flex-view">
<div class="pagination">
<ul>
<li class="">
«
</li>
<li class="" data-value="0">
1
</li>
<li class="" data-value="1">
2
</li>
<li class="active" data-value="2">
3
</li>
.....

When a new page loads after you click one of the pagination links, the WebElements in allpages are no longer valid and need to be found again. Put another call to allpages = driver.findElements(By.xpath("//div[@class='pagination']//a")); inside your for loop so that you get fresh references on each new page.
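The "find again on every iteration" idea can be sketched independently of Selenium. The helper below is hypothetical (FreshLookup and forEachFresh are names made up for this illustration): it calls the lookup again before each step instead of reusing the list fetched before the loop, so each step acts on a freshly located element.

```java
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Hypothetical helper: re-runs the lookup on every pass so no reference from
// an earlier pass is reused.
final class FreshLookup {
    private FreshLookup() {}

    /** Returns how many elements were visited. */
    static <T> int forEachFresh(Supplier<List<T>> finder, Consumer<T> action) {
        int visited = 0;
        for (int i = 0; ; i++) {
            List<T> current = finder.get();   // re-locate before every step
            if (i >= current.size()) {
                break;                        // no element left at this index
            }
            action.accept(current.get(i));    // act on the freshly located element
            visited++;
        }
        return visited;
    }
}

// With Selenium this could be called as (a sketch, assuming driver is a WebDriver):
// FreshLookup.forEachFresh(
//     () -> driver.findElements(By.xpath("//div[@class='pagination']//a")),
//     WebElement::click);
```

Depending on how quickly the next page renders, a wait between clicks may still be needed. The sketch only covers re-locating the elements.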
