How to get content from span with Jsoup - java

I am using the Jsoup HTML parser to extract content from an HTML page.
<span class="mainPrice reduced_">
<span class="oPrice" data-test="preisArtikel">
<span itemprop="price" content="68.00"><span class="oPriceLeft">68</span><span class="oPriceSeparator">,</span><span class="oPriceRight">00</span></span><span class="oPriceSymbol oPriceSymbolRight">€</span>
I want to extract the content (68.00) and I tried the following:
Elements price = doc.select("span.oPrice");
String priceString = price.text();
That doesn't work because the class "oPrice" occurs 44 times in the page and the string "priceString" contains 44 different prices.
Thank you for your help.

Try this:
// For one element
Element priceElement = document.select("span[content]").first();
System.out.println(priceElement.attr("content"));
If there are several such spans:
// For multiple elements
Elements elements = document.select("span[content]");
for (Element element:elements){
System.out.println(element.attr("content"));
}
Output:
68.00
Also, check the Jsoup Selector documentation for reference.
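The content attribute comes back as a plain string, so working with the price as a number takes one more step. Below is a minimal sketch using only the standard library. It assumes the attribute uses a dot as the decimal separator (as in the snippet above) and that the visible text uses a decimal comma with no thousands separator. The class and method names are made up for illustration and are not part of Jsoup.

```java
import java.math.BigDecimal;

// Hypothetical helper, not part of Jsoup: turns scraped price strings into numbers.
class PriceParsing {

    // For the machine-readable value, e.g. the result of attr("content"): "68.00".
    static BigDecimal fromContentAttribute(String content) {
        return new BigDecimal(content.trim());
    }

    // For the visible text, e.g. "68,00".
    // Assumption: decimal comma, no thousands separator, no currency symbol.
    static BigDecimal fromVisibleText(String text) {
        return new BigDecimal(text.trim().replace(',', '.'));
    }

    public static void main(String[] args) {
        System.out.println(fromContentAttribute("68.00"));
        System.out.println(fromVisibleText("68,00"));
    }
}
```

BigDecimal is used rather than double so that the amount keeps its exact decimal value.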

Related

Check that two span classes are equal to text and click on the correct span element after

I have two span elements and I need to check that my text, Frat Brothers (2013), is equal to the text inside these span classes, and then click on the matching element.
<a href="/frat-brothers" class="">
<span class="name-content-row__row__title">Frat Brothers</span>
<span class="name-content-row__row--year">(2013)</span>
</a>
My code:
String title = "Frat Brothers (2013)";
List<WebElement> content = driver.findElements(By.cssSelector("span[class*='name-content-row__']"));
for (WebElement e : content) {
    System.out.println("element is : " + e.getText());
    if (e.getText().equals(title)) {
        click(e);
    }
}
Output:
element is : Frat Brothers
element is : (2013)
The if statement isn't executed.
The if statement did not execute because you have
String title = "Frat Brothers (2013)";
change that to
String title = "Frat Brothers";
and you should be good to go.
Also, do not use click(e); it should be e.click();
The driver.findElements method accepts a By parameter, while you are passing it a String.
In case you want to select elements according to this CSS Selector you can do this:
List<WebElement> content = driver.findElements(By.cssSelector("span[class*='name-content-row__']"));
Also, you will get 2 span elements, first with text Frat Brothers and second with text (2013).
Neither of these elements' texts will be equal to Frat Brothers (2013).
You can check whether the title contains these texts.
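One way to check the two span texts against the full title is to join them with a space and compare the result. Below is a minimal sketch in plain Java. It assumes the texts have already been read from the elements with getText(); the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: compares a full title with the texts of several spans.
class TitleMatch {

    // Joins the span texts with single spaces and compares them to the expected title.
    static boolean matches(String title, List<String> spanTexts) {
        return String.join(" ", spanTexts).equals(title);
    }

    public static void main(String[] args) {
        // Stand-ins for the two values returned by getText() in the question.
        List<String> texts = Arrays.asList("Frat Brothers", "(2013)");
        System.out.println(matches("Frat Brothers (2013)", texts));
        System.out.println(matches("Frat Brothers (2014)", texts));
    }
}
```

If the joined text matches, the enclosing link (or either span) can then be clicked.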
You can try the following xPath: //a[normalize-space()='Frat Brothers (2013)']
So that there would be no need for extra code. Like:
String title = "Frat Brothers (2013)";
WebElement content = driver.findElement(By.xpath("//a[normalize-space()='" + title + "']"));
content.click();
P.S. - Here is the xPath test: http://xpather.com/V9cjThsr
For text verification, add the TestNG dependency to your pom.xml:
String title = "Frat Brothers (2013)";
// storing text from the element
String first = driver.findElements(By.cssSelector("span[class*='name-content-row__']")).get(0).getText();
String second = driver.findElements(By.cssSelector("span[class*='name-content-row__']")).get(1).getText();
// validating the text
Assert.assertTrue(title.contains(first), "checking name is available in the element");
Assert.assertTrue(title.contains(second), "checking year is available in the element");

Web Crawler Amazon get span-Element

I'm crawling Amazon categories and I already get the sales rank and the product URLs. Now I want to crawl the category as well, and I get all of the information from the category span.
<span class="zg_hrsr_ladder">in <a href="...">Bücher</a> > <a href="...">Krimis & Thriller</a> > <b><a href="...">Deutschland</a></b></span>
This is an example snippet. With the following code
Elements category = htmlDocument.select("span.zg_hrsr_ladder");
I get everything inside the span. But I want only the text inside the a href elements: "Bücher", "Krimis & Thriller" and "Deutschland". How can I get this information?
You want to get the text inside the <a> elements, so select the anchors in your span (append " a" to the selector) and call text() on the resulting elements.
Example Code
String source = "<span class=\"zg_hrsr_ladder\">in <a href=\"...\">Bücher</a> > <a href=\"...\">Krimis & Thriller</a> > <b><a href=\"...\">Deutschland</a></b></span>";
Document htmlDocument = Jsoup.parse(source); // the second argument of parse(String, String) is a base URI, not a charset
Elements category = htmlDocument.select("span.zg_hrsr_ladder a");
category.forEach(aElement -> {
System.out.println(aElement.text());
});
Output
Bücher
Krimis & Thriller
Deutschland

Use Jsoup to select an HTML element with no class

Consider an html document like this one
<div>
<p>...</p>
<p>...</p>
...
<p class="random_class_name">...</p>
...
</div>
How could we select all of the p elements, but excluding the p element with random_class_name class?
Elements ps = body.select("p:not(.random_class_name)");
You can use the pseudo selector :not.
If the class name is not known, you can still use a similar expression:
Elements ps = body.select("p:not([class])");
The first example uses the normal syntax for classes; the second uses the attribute selector [].
See the Jsoup documentation on CSS selectors.
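Document doc = Jsoup.parse(htmlValue);
Elements pElements = doc.select("p");
for (Element element : pElements) {
    // "class" is a reserved word in Java, so the variable needs another name
    String className = element.attr("class");
    // attr() returns an empty string, not null, when the attribute is missing
    if (className.isEmpty()) {
        //.....
    } else {
        //.....
    }
}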

Using JSoup to select a group of tags

I am attempting to use JSoup to scrape some information off a page, which can be identified by a group of tags in a particular order. The order of them is as follows:
<span class="sold" >Sold</span></td>
<td class='prc'>
<div class="g-b bidsold" itemprop="price">
AU $1.00</div>
I am looking to grab each value that is in place of the AU $1.00 field on the page, but they can only be identified by the span class="sold" selector that occurs a few tags beforehand.
I have tried something like select("span.sold:lt(4) + [itemprop=price]") but feel like I'm flailing around in the dark!
The code below should do the trick:
Document doc = Jsoup.connect(url).get(); // url: the URL of your HTML document
Element part = doc.body();
Elements parts = part.getElementsByTag("div");
String attValue;
String requiredContent;
for(Element ent : parts)
{
if(ent.hasAttr("class"))
{
attValue = ent.attr("class");
if(attValue.equals("g-b bidsold"))
{
System.out.println("\n");
requiredContent=ent.text();
System.out.println(requiredContent);
}
}
}
If you need to keep the results, collect them in an array or list while iterating.
You could also do this:
Elements soldPrices = doc.select("td:has(.sold) + td [itemprop=price]");
That will return the elements (the DIVs) with itemprop=price that sit inside a TD immediately preceded by a TD containing an element (the SPAN) with class=sold.
See the Selector syntax for more details.

How do I parse this HTML with Jsoup

I am trying to extract "Know your tractor" and "Shell Petroleum Company.1955". Bear in mind that this is just a snippet of the whole page and there is more than one H2/H3 tag. I would like to get the data from all of the H2 and H3 tags.
Here's the HTML: http://i.stack.imgur.com/Pif3B.png
The code I have so far is:
ArrayList<String> arrayList = new ArrayList<String>();
Document doc = null;
try {
    doc = Jsoup.connect("http://primo.abdn.ac.uk:1701/primo_library/libweb/action/search.do?dscnt=0&scp.scps=scope%3A%28ALL%29&frbg=&tab=default_tab&dstmp=1332103973502&srt=rank&ct=search&mode=Basic&dum=true&indx=1&tb=t&vl(freeText0)=tractor&fn=search&vid=ABN_VU1").get();
    Elements heading = doc.select("h2.EXLResultTitle span");
    for (Element src : heading) {
        String j = src.text();
        System.out.println(j); // check what's going into the array
        arrayList.add(j);
    }
} catch (IOException e) {
    // get() throws IOException if the page cannot be fetched
    e.printStackTrace();
}
How would I extract "Know your tractor" and "Shell Petroleum Company.1955"? Thanks for your help!
Your selector only selects <span> elements which are inside <h2 class="EXLResultTitle">, while you actually need those <h2> elements themselves. So, just remove span from the selector:
Elements headings = doc.select("h2.EXLResultTitle");
for (Element heading : headings) {
System.out.println(heading.text());
}
You should be able to figure the selector for <h3 class="EXLResultAuthor"> yourself based on the lesson learnt.
See also:
Jsoup cookbook - CSS selectors
Jsoup Selector API documentation
