Jsoup: Get texts inside divs - java

I am having trouble extracting the texts from the following HTML and would appreciate some help.
<div class="itemlist">
<ul>
<li>
<div class="Description">
<h2>Item 1</h2> // GET THIS
<h3 title="Shipping :01-02 Nov">Shipping :01-02Nov</h3> // GET THIS
</div>
<div class="price" style="margin: 0px auto; display: none;">
<span class="arial-12-88" style="display: inline;"></span>
<div class="currency-USD arial-24-26-bold">450 USD</div> // GET THIS
<span class="arial-12-d0" style="display: inline;"></span>
</div>
<div class="button_set" style="display: flex;">
<button class="learn">Learn More</button>
<a href="user/orderDetails.htm?m=add&pid=00020170918214914392zGPQW7nE06A2&count=1&fitting=">
<button class="add">Add To Cart</button></a> // GET THIS
</div>
</li>
next item ...
</ul>
</div>
The output should be:
Item 1
Shipping :01-02Nov
450 USD
My approach is too static and cannot handle changes in the item structure, because not every item has, e.g., the price at the same child index. The only things the items have in common are the div class names.
This is what I use at the moment; I found the child indexes with the debugger:
Element content = doc.getElementsByClass("itemlist").first();
Node child1 = content.childNode(1);
for (Node node : child1.childNodes()) {
    try {
        Node desc = node.childNode(3);
        Node price = node.childNode(5);
        Node stock = node.childNode(7);
        // get description
        Node desc_elem = desc.childNode(1);
        Node desc_text = desc_elem.childNode(0);
        String desc_txt = ((TextNode) desc_text).text().trim();
    } catch (Exception e) {
        continue;
    }
}
Please help me find a more dynamic way. Ideally I would get all list items and loop over them, then select the description div and the price div of each item and read the text from their children.

// select the div with the item list
Element itemlist = doc.select("div.itemlist").first();
// select each li element
Elements items = itemlist.select("li");
// for each li element select the corresponding div with item name, shipping info and price
for (Element e : items) {
    System.out.println(e.select("div.Description h2").text());
    System.out.println(e.select("div.Description h3").text());
    System.out.println(e.select("div.currency-USD").text());
}

Related

Jsoup CSS selector "not" does not return anything

I'm trying to ignore one item so that Jsoup does not parse it,
but the CSS selector "not" is not working.
I don't understand what is wrong.
My code:
MangaList list = new MangaList();
Document document = getPage("https://3asq.org/");
MangaInfo manga;
for (Element o : document.select("div.page-item-detail:not(.item-thumb#manga-item-5520)")) {
manga = new MangaInfo();
manga.name = o.select("h3").first().select("a").last().text();
manga.path = o.select("a").first().attr("href");
try {
manga.preview = o.select("img").first().attr("src");
} catch (Exception e) {
manga.preview = "";
}
list.add(manga);
}
return list;
HTML code:
<div class="col-12 col-md-6 badge-pos-1">
<div class="page-item-detail manga">
<div id="manga-item-5520" class="item-thumb hover-details c-image-hover" data-post-id="5520">
<a href="https://3asq.org/manga/gosu/" title="Gosu">
<img width="110" height="150" src="https://3asq.org/wp-content/uploads/2020/03/IMG_4497-110x150.jpg" srcset="https://3asq.org/wp-content/uploads/2020/03/IMG_4497-110x150.jpg 110w, https://3asq.org/wp-content/uploads/2020/03/IMG_4497-175x238.jpg 175w" sizes="(max-width: 110px) 100vw, 110px" class="img-responsive" style="" alt="IMG_4497"/> </a>
</div>
<div class="item-summary">
<div class="post-title font-title">
<h3 class="h5">
<span class="manga-title-badges custom noal-manga">Noal-Manga</span> Gosu
</h3>
If I debug your code and extract the HTML with:
System.out.println(document.select("div.page-item-detail").get(0));
(Hint: use the expression evaluator in IntelliJ IDEA, Alt+F8, for in-session, real-time debugging.)
I get:
<div class="page-item-detail manga">
<div id="manga-item-2003" class="item-thumb hover-details c-image-hover" data-post-id="2003">
<a href="http...
...
</div>
</div>
</div>
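The :not(...) in your selector is tested against the div.page-item-detail element itself, which never has the class item-thumb or the id manga-item-5520, so it cannot exclude anything.
It looks like you want to extract the next div tag down with a class containing item-thumb, but only if the id isn't manga-item-5520.
So here's what I did to remove that one item: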
document.select("div.page-item-detail div[class*=item-thumb][id!=manga-item-5520]")
Result size: 19
With the element included:
document.select("div.page-item-detail div[class*=item-thumb]")
Result size: 20
You can also try the following if you want to remain based at the outer div tag rather than the inner div tag.
document.select("div.page-item-detail:has(div[class*=item-thumb][id!=manga-item-5520])")

How to access inner divs with the same class name but different ids in HTML data using Jsoup on Android

I'm trying to parse data from HTML. I need to get all the names from the inner divs with class=vacancy-item, each of which has a different id.
Please see the HTML code below:
<section class="home-vacancies" id="vacancy_wrapper">
<div class="home-block-title">job openings</div>
<div class="vacancy-filter">
...................
</div>
<div class="vacancy-wrapper">
<div class="vacancy-item" data-id="9120">
..............
</div>
<div class="vacancy-item" data-id="9119">
..................
</div>
<div class="vacancy-item" data-id="9118">
................................
</div>
<div class="vacancy-item" data-id="9117">
.............................
</div>
Here is my code:
Please help.
doc = Jsoup.connect("URL").get();
//title = doc.select(".page-content div:eq(3)");
title = doc.getElementsByClass("div[class=vacancy-wrapper]");
titleList.clear();
for (Element titles : title) {
String text = titles.getElementsB("vacancy-item").text();
titleList.add(text);
}
Thanks!
getElementsByClass only takes a plain class name, not a CSS selector, so getElementsByClass("vacancy-wrapper") would work.
You will also need a second loop to get each vacancy-item's text as a separate element:
Elements title = doc.getElementsByClass("vacancy-wrapper");
for (Element titles : title) {
Elements items = titles.getElementsByClass("vacancy-item");
for (Element item : items) {
String text = item.text();
// process text
}
}
Another option would be to use Jsoup's select method:
Elements es = doc.select("div.vacancy-wrapper div.vacancy-item");
for (Element vi : es) {
    String text = vi.text();
    // process text
}
This selects all div elements with the class vacancy-item that are descendants of a div with the class vacancy-wrapper.

How to find elements whose sibling index is less than x and greater than y

I have an Element eNews. After finding the indexes with a CSS query, I have to select the sibling elements whose index is greater than x and less than y:
Elements lines = eNews.select("div.clear");
int x = lines.get(0).elementSiblingIndex();
int y = lines.get(1).elementSiblingIndex();
Elements tNews = eNews.getElementsByIndexGreaterThan(x)
?AND?
eNews.getElementsByIndexLessThan(y)
This is some sample code. I want to extract the text from the HTML tags between the first and second <div class="clear"></div>:
<div class="aktualnosci">
<div class="zd">
<a href="/Data/Thumbs/ODAweDYwMA,dsc_0458.jpg" title="" rel="lightbox">
<img src="/Data/Thumbs/dsc_0458.jpg"/>
</a>
<p class="show"></p>
</div>
<h3>Awanse</h3>
<div class="data">
<img alt="" src="/Themes/kalendarz-ico.gif">
2013-11-18 12:26
</div>
<!--Start tag-->
<div class="clear"></div>
<!--Tags to extract-->
<p class="gr">W związku z Narodowym Świętem Niepodległości ....</p>
<p style="text-align: justify">W zeszły p....</p>
<p style="text-align: justify">OISW Kraków</p>
<!--End tag-->
<div class="clear"></div>
<div class="slider">
<span class="slide-left"></span>
<span class="slide-right"></span>
</div>
</div>
You can use a selector like div.clear ~ :gt(1):lt(4)
E.g.:
Elements tNews = eNews.select("div.clear ~ :gt(1):lt(4)");
See this example and the selector docs. (It's a bit hard to validate this does what you're trying to achieve without knowing your input HTML and the data you're trying to extract.)
Update based on your edit: there are a couple ways to do this if you can't know the indexes in advance. Below I get the first div, then accumulate sibling elements until we hit the next div.clear. (I'll have a think if I can generify this pattern and add it to jsoup.)
Document doc = Jsoup.parse(h);
Element firstDiv = doc.select("div.clear").first();
Elements news = new Elements();
Element item = firstDiv.nextElementSibling();
while (item != null && !(item.tagName().equals("div") && item.className().equals("clear"))) {
news.add(item);
item = item.nextElementSibling();
}
System.out.println(String.format("Found %s items", news.size()));
for (Element element : news) {
System.out.println(element.text());
}
Outputs:
Found 3 items
W związku z Narodowym Świętem Niepodległości ....
W zeszły p....
OISW Kraków
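The accumulate-until-the-next-marker loop above does not really depend on jsoup, so the pattern can be sketched on a plain list. This is only a minimal illustration, not jsoup API: the class name BetweenMarkers, the helper between and the string labels standing in for sibling elements are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class BetweenMarkers {
    // Hypothetical helper (not part of jsoup): returns the items that lie strictly
    // between the first item matching isMarker and the next item that matches.
    // If there is no second marker, everything after the first marker is returned,
    // which mirrors the "item != null" condition in the loop above.
    static <T> List<T> between(List<T> siblings, Predicate<T> isMarker) {
        List<T> result = new ArrayList<>();
        boolean inside = false;
        for (T item : siblings) {
            if (isMarker.test(item)) {
                if (inside) {
                    break; // second marker reached, stop collecting
                }
                inside = true; // first marker found, start collecting after it
                continue;
            }
            if (inside) {
                result.add(item);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Labels stand in for the children of div.aktualnosci in the question.
        List<String> siblings = List.of("div.zd", "h3", "div.data",
                "div.clear", "p.gr", "p", "p", "div.clear", "div.slider");
        System.out.println(between(siblings, "div.clear"::equals));
    }
}
```

With jsoup the same helper could presumably be fed the parent's children() and a predicate such as e -> e.is("div.clear"), since Elements is a List of Element; that combination is untested here.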

Selenium driver - select the desired li item in the list

In a list of 8 elements I want to select the one that contains the search text in a child div. I need this because the elements of the list change order every time. Here I would like to select the one that contains the text "TITLE TO LISTEN". How do I go through the list and select the desired li?
Thanks in advance.
Here is one li:
...
<li id="3636863298979137009" class="clearfix" data-size="1" data-fixed="1" data-side="r">
<div class="userContentWrapper">
<div class="jki">
<span class="userContent">
TITLE TO LISTEN
</div>
<div class="TimelineUFI uiContainer">
<form id="u_0_b0" class="able_item collapsed_s autoexpand_mode" onsubmit="return window.Event && E" action="/ajax/ufi/modify.php" method="post" >
<input type="hidden" value="1" name="data_only_response" autocomplete="off">
<div class="TimelineFeedbackHeader">
<a class="ction_link" role="button" title="Journal" data-ft="{"tn":"J","type":25}" rel="dialog" href="/ajax/" tabindex="0" rip-style-bordercolor-backup="" style="" rip-style-borderstyle-backup="" >LISTEN</a>
</div>
</form>
</div>
</div>
</li>
</ol>
</div>
...
I tried this code, but it doesn't work because the element ids and positions change each time.
driver.findElement(By.xpath("//li[8]/div[2]/div/div[2]/form/div/div/span[2]/a")).click();
For example:
The text "TEXT TO LISTEN" is at: li[3]/div[2]/div/div/div[2]/div/div/span
The "listen" link I want to click is at: li[3]/div[2]/div/div[2]/form/div/div/span[2]/a
Here the index is 3, but the order may change. I would first like to get that number and then click on the right link.
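Use this:
driver.findElement(By.xpath("//li[contains(., 'Your text goes here')]"))
(contains(., ...) tests the whole text content of the li, including nested elements; contains(text(), ...) would only look at the li's own first text node.)
EDIT: I just realised this is a very old question and you have probably found an answer by now, so this is for others who are looking for an answer to it.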
You could get a list of all li elements and then search for the specified text (C# bindings shown):
for (int i = 0; i < listOfLiElements.Count; i++)
{
    if (listOfLiElements[i].FindElement(By.ClassName("userContent")).Text == "TITLE TO LISTEN")
    {
        correctElement = listOfLiElements[i].FindElement(By.TagName("a"));
        break; // stop at the first match
    }
}
Well, then just iterate with a for-each loop and check whether the current element has the right text inside it.
List<WebElement> listOfLiTags = driver.findElement(By.id("yourUlId")).findElements(By.tagName("li"));
for (WebElement li : listOfLiTags) {
    String text = li.findElement(By.className("userContent")).getText();
    if (text.equals("TITLE TO LISTEN")) {
        // do whatever you want and don't forget break
        break;
    }
}
Note that this is much easier with a CSS selector.
List<WebElement> listOfSpans = driver.findElements(
    By.cssSelector("ul[id=yourId] li span[class=userContent]"));
Now just iterate and ask for the right text :)
You can try this:
public void ClickLink()
{
    WebElement ol = driver.findElement(By.id("ol"));
    List<WebElement> lis = ol.findElements(By.tagName("li"));
    for (int i = 0; i < lis.size(); i++)
    {
        // XPath positions are 1-based, hence (i + 1)
        WebElement li = ol.findElement(By.xpath("//ol[@id='ol']/li[" + (i + 1) + "]/div[2]/div/div/div[2]/div/div/span"));
        if (li.getText().trim().equals("TEXT TO LISTEN"))
        {
            WebElement link = ol.findElement(By.xpath("//ol[@id='ol']/li[" + (i + 1) + "]/div[2]/div/div[2]/form/div/div/span[2]/a"));
            if (link.getText().trim().equals("LISTEN"))
            {
                link.click();
                break;
            }
        }
    }
}

How do you get a div class inside a div class inside a div class?

My problem is that I need to get a div class inside a div class inside a div class, and there are 4 instances of the classes with the same name but different data.
I can currently get the first div class inside the div class, but I need to be able to access the other elements within it as well. For example:
docTide = Jsoup.connect("http://www.mhpa.co.uk/search-tide-times/").timeout(600000).get();
Elements tideTableRows = docTide.select("div.tide_row.odd");
Element firstDiv = tideTableRows.first();
Element secondDiv = tideTableRows.get(1);
System.out.println("This is the first div: " + firstDiv.text());
System.out.println("This is the second div: " + secondDiv.text());
But this is the structure of the web page, where there are 2 repeats, and I need to access each of them, e.g.:
<div class="tide_row odd">
<div class="time">00:57</div>
<div class="height_m">4.9</div>
<div class="height_f">16,1</div>
<div class="range_m">1.9</div>
<div class="range_f">6,3</div>
</div>
<div class="tide_row even">
<div class="time">07:23</div>
<div class="height_m">2.9</div>
<div class="height_f">9,6</div>
<div class="range_m">2</div>
<div class="range_f">6,7</div>
</div>
<div class="tide_row odd">
<div class="time">13:46</div>
<div class="height_m">5.1</div>
<div class="height_f">16,9</div>
<div class="range_m">2.2</div>
<div class="range_f">7,3</div>
</div>
<div class="tide_row even">
<div class="time">20:23</div>
<div class="height_m">2.8</div>
<div class="height_f">9,2</div>
<div class="range_m">2.3</div>
<div class="range_f">7,7</div>
</div>
So basically there are nested divs inside separate divs that share the same class names. What is the correct syntax to return the data from each of them separately?
This is quite hard to explain!
Edit: This is how I managed to extract the information from the nested classes:
docTide = Jsoup.connect("http://www.mhpa.co.uk/search-tide-times/").timeout(600000).get();
Elements tideTimeOdd = docTide.select("div.tide_row.odd div:eq(0)");
Elements tideTimeEven = docTide.select("div.tide_row.even div:eq(0)");
Elements tideHightOdd = docTide.select("div.tide_row.odd div:eq(2)");
Elements tideHightEven = docTide.select("div.tide_row.even div:eq(2)");
Element firstTideTime = tideTimeOdd.first();
Element secondTideTime = tideTimeEven.first();
Element thirdTideTime = tideTimeOdd.get(1);
Element fourthTideTime = tideTimeEven.get(1);
Element firstTideHight = tideHightOdd.first();
Element secondTideHight = tideHightEven.first();
Element thirdTideHight = tideHightOdd.get(1);
Element fourthTideHight = tideHightEven.get(1);
It would work fine by just doing:
docTide = Jsoup.connect("http://www.mhpa.co.uk/search-tide-times/").timeout(600000).get();
Elements tideTableRows = docTide.select("div[class=tide_row odd]");
// select(...) returns Elements; the height divs in the question are height_m and height_f
Elements firstDiv = tideTableRows.select("div[class=time]");
Elements secondDiv = tideTableRows.select("div[class=height_m]");
You should try to access elements by id or class name rather than by position if you can. It makes your code a lot simpler, and if you have, say, 50 headers in the same container, you don't have to count them all.
Separate elements:
docTide = Jsoup.connect("http://www.mhpa.co.uk/search-tide-times/").timeout(600000).get();
Elements oddRows = docTide.select("div[class=tide_row odd]");
Element tideTableRows = oddRows.first();
Element firstDiv1 = tideTableRows.select("div[class=time]").first();
Element secondDiv1 = tideTableRows.select("div[class=height_m]").first();
Element tideTableRows2 = oddRows.get(1); // second odd row
Element firstDiv2 = tideTableRows2.select("div[class=time]").first();
Element secondDiv2 = tideTableRows2.select("div[class=height_m]").first();
You can try this:
docTide = Jsoup.connect("http://www.mhpa.co.uk/search-tide-times/").timeout(600000).get();
Elements tideTableRows = docTide.select("div.tide_row");
for (Element tideTableRow : tideTableRows) {
    if (tideTableRow.hasClass("odd")) {
        // do the odd stuff...
    }
    // the direct child divs of this row (time, height_m, ...)
    Elements innerDivs = tideTableRow.children();
    for (Element innerDiv : innerDivs) {
        // do whatever you need
    }
}
Note: The code is not tested.
Update: I showed you how to access the odd rows only... From there you should be able to get the rest by yourself I hope.
