Web Crawler Amazon get span-Element


I'm crawling Amazon categories, and I already get the sales rank and the product URLs. Now I want to crawl the category as well and extract the information from the category span.
<span class="zg_hrsr_ladder">in <a href="...">Bücher</a> > <a href="...">Krimis & Thriller</a> > <b><a href="...">Deutschland</a></b></span>
This is an example snippet. With the following code
Elements category = htmlDocument.select("span.zg_hrsr_ladder");
I get everything inside the span, but I only want the text inside the a href elements: "Bücher", "Krimis & Thriller" and "Deutschland". How can I get this information?

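You want the text inside the <a> elements, so select the anchors within your span (append " a" to the selector) and call text() on each of the resulting elements.
Example code
// The href values are placeholders; the real link targets are not shown in the question.
String source = "<span class=\"zg_hrsr_ladder\">in <a href=\"#\">Bücher</a> > <a href=\"#\">Krimis & Thriller</a> > <b><a href=\"#\">Deutschland</a></b></span>";
// The second argument of Jsoup.parse(String, String) is a base URI, not a charset, so it is omitted here.
Document htmlDocument = Jsoup.parse(source);
Elements category = htmlDocument.select("span.zg_hrsr_ladder a");
// Print the text of every anchor inside the span
category.forEach(aElement -> System.out.println(aElement.text()));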
Output
Bücher
Krimis & Thriller
Deutschland

Related

Check that two span classes are equal to text and click on the correct span element after

I have two span elements, and I need to check that my text, Frat Brothers (2013), equals the text inside these spans and then click on that element.
<a href="/frat-brothers" class="">
<span class="name-content-row__row__title">Frat Brothers</span>
<span class="name-content-row__row--year">(2013)</span>
</a>
My code:
String title = "Frat Brothers (2013)";
List<WebElement> content = driver.findElements(By.cssSelector("span[class*='name-content-row__']"));
for (WebElement e : content) {
    System.out.println("element is : " + e.getText());
    if (e.getText().equals(title)) {
        click(e);
    }
}
Output:
element is : Frat Brothers
element is : (2013)
The if statement isn't executed.

The if statement did not execute because you have
String title = "Frat Brothers (2013)";
Change that to
String title = "Frat Brothers";
and you should be good to go.
Also, use e.click() rather than click(e), unless click is a helper method you defined yourself.

The driver.findElements method accepts a By parameter, while you are passing it a String.
If you want to select elements with this CSS selector, you can do this:
List<WebElement> content = driver.findElements(By.cssSelector("span[class*='name-content-row__']"));
Also, you will get two span elements: the first with the text Frat Brothers and the second with the text (2013).
Neither element's text is equal to Frat Brothers (2013).
Instead, you can check whether the title contains these texts.
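The containment check suggested above can be sketched in plain Java. This is only a minimal sketch without any Selenium dependency: the span texts are passed in as a list of strings (in a real test they would come from getText() on each WebElement), and the names TitleMatcher and matchesTitle are made up for illustration. It joins the span texts with a single space and compares the result with the expected title, which is a bit stricter than calling contains on each part separately.

```java
import java.util.Arrays;
import java.util.List;

class TitleMatcher {
    // Joins the trimmed, non-empty span texts with single spaces
    // and compares the result to the expected title.
    static boolean matchesTitle(List<String> spanTexts, String title) {
        StringBuilder joined = new StringBuilder();
        for (String text : spanTexts) {
            String trimmed = text.trim();
            if (trimmed.isEmpty()) {
                continue; // ignore spans without visible text
            }
            if (joined.length() > 0) {
                joined.append(' ');
            }
            joined.append(trimmed);
        }
        return joined.toString().equals(title);
    }

    public static void main(String[] args) {
        // Hypothetical input: the texts of the two spans from the question
        List<String> spanTexts = Arrays.asList("Frat Brothers", "(2013)");
        System.out.println(matchesTitle(spanTexts, "Frat Brothers (2013)")); // true
        System.out.println(matchesTitle(spanTexts, "Frat Brothers"));        // false
    }
}
```

If the combined text matches, the element to click is most likely the parent <a> rather than either span.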

You can try the following XPath: //a[normalize-space()='Frat Brothers (2013)']
That way no extra code is needed. For example:
String title = "Frat Brothers (2013)";
WebElement content = driver.findElement(By.xpath("//a[normalize-space()='" + title + "']"));
content.click();
P.S. Here is the XPath test: http://xpather.com/V9cjThsr

For text verification, add the TestNG dependency to your pom.xml:
String title = "Frat Brothers (2013)";
// Fetch both spans once, then read their text
List<WebElement> spans = driver.findElements(By.cssSelector("span[class*='name-content-row__']"));
String first = spans.get(0).getText();
String second = spans.get(1).getText();
// Validate that the title contains each part
Assert.assertTrue(title.contains(first), "checking name is available in the element");
Assert.assertTrue(title.contains(second), "checking year is available in the element");

Use JSoup to get all textual links

I'm using JSoup to grab content from web pages.
I want to get all the links on a page that contain some text. It doesn't matter what the text is; it just needs to be non-empty (not only an image, etc.).
Example of a link I want:
<a href="...">Link to Some Page</a>
since it contains the text "Link to Some Page".
Links I don't want:
<a href="..."><img src="someimage.jpg"/></a>
My code looks like this. How can I modify it to only get the first type of link?
Document document = // I get my document object
Elements linksOnPage = document.select("a[href]");
for (Element page : linksOnPage) {
    String link = page.attr("abs:href");
    // I do stuff with the link
}

You could do something like this.
It does its job, though it's probably not the fanciest solution out there.
Note: the function text() returns clean text, so if there are any HTML code fragments inside the link, it won't return them.
Document document = // get the document
Elements linksOnPage = document.select("a");
for (Element pageElem : linksOnPage) {
    // Skip links without visible text (e.g. image-only links)
    if (pageElem.text().trim().isEmpty()) {
        continue;
    }
    String link = pageElem.attr("abs:href");
    // do something with the link
}

I am using this and it's working fine:
Document document = // I get my document object
Elements linksOnPage = document.select("a:matches(([^\\s]+))");
for (Element page : linksOnPage) {
String link = page.attr("abs:href");
// I do stuff with the link
}
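To see why that selector skips image-only links, the regular expression can be tried on its own. The sketch below uses only java.util.regex and assumes, as far as I understand Jsoup's behaviour, that :matches(regex) looks for the pattern anywhere in the element's text. The names LinkTextFilter and hasVisibleText are hypothetical. A link whose text is empty or only whitespace contains no run of non-whitespace characters, so it is not matched.

```java
import java.util.regex.Pattern;

class LinkTextFilter {
    // Same pattern as in the selector above: one or more non-whitespace characters.
    private static final Pattern NON_BLANK = Pattern.compile("([^\\s]+)");

    // True when the text contains at least one non-whitespace character.
    static boolean hasVisibleText(String linkText) {
        return NON_BLANK.matcher(linkText).find();
    }

    public static void main(String[] args) {
        System.out.println(hasVisibleText("Link to Some Page")); // true
        System.out.println(hasVisibleText(""));                  // false
        System.out.println(hasVisibleText("   "));               // false
    }
}
```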

How to get content from span with Jsoup

I am using the Jsoup HTML parser to extract content from an HTML page.
<span class="mainPrice reduced_">
<span class="oPrice" data-test="preisArtikel">
<span itemprop="price" content="68.00"><span class="oPriceLeft">68</span><span class="oPriceSeparator">,</span><span class="oPriceRight">00</span></span><span class="oPriceSymbol oPriceSymbolRight">€</span>
I want to extract the content (68.00), and I tried the following:
Elements price = doc.select("span.oPrice");
String priceString = price.text();
That doesn't work because the class "oPrice" occurs 44 times in the page and the string "priceString" contains 44 different prices.
Thank you for your help.

Try this:
// For one element
Element element = document.select("span[content]").first();
System.out.println(element.attr("content"));
If there are multiple spans like this:
// For multiple elements
Elements elements = document.select("span[content]");
for (Element element : elements) {
    System.out.println(element.attr("content"));
}
Output:
68.00
Also, see the Jsoup Selector documentation for reference.
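If the extracted value is needed as a number rather than a string, the content attribute can be parsed with BigDecimal. This is a minimal sketch, assuming the attribute always uses a dot as the decimal separator, as in the snippet above. The names PriceParser and parsePrice are hypothetical.

```java
import java.math.BigDecimal;

class PriceParser {
    // Parses a machine-readable price such as "68.00"
    // (the value of the span's content attribute).
    static BigDecimal parsePrice(String content) {
        return new BigDecimal(content.trim());
    }

    public static void main(String[] args) {
        BigDecimal price = parsePrice("68.00");
        System.out.println(price);                                  // 68.00
        System.out.println(price.multiply(BigDecimal.valueOf(2))); // 136.00
    }
}
```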

Using JSoup to select a group of tags

I am attempting to use JSoup to scrape some information off a page, which can be identified by a group of tags in a particular order. The order of them is as follows:
<span class="sold" >Sold</span></td>
<td class='prc'>
<div class="g-b bidsold" itemprop="price">
AU $1.00</div>
I am looking to grab each value that is in place of the AU $1.00 field on the page, but they can only be identified by the span class="sold" selector that occurs a few tags beforehand.
I have tried something like select("span.sold:lt(4) + [itemprop=price]") but feel like I'm flailing around in the dark!

The code below should do the trick:
// url is a placeholder for the address of your HTML document
Document doc = Jsoup.connect(url).get();
Element part = doc.body();
Elements parts = part.getElementsByTag("div");
String attValue;
String requiredContent;
for (Element ent : parts) {
    if (ent.hasAttr("class")) {
        attValue = ent.attr("class");
        // Match only divs whose class attribute is exactly "g-b bidsold"
        if (attValue.equals("g-b bidsold")) {
            System.out.println("\n");
            requiredContent = ent.text();
            System.out.println(requiredContent);
        }
    }
}
Just make sure to iterate and collect the output in an array.

You could also do this:
Elements soldPrices = doc.select("td:has(.sold) + td [itemprop=price]");
That returns the elements (the DIVs) with itemprop=price that sit inside a TD whose immediately preceding sibling TD contains an element (the SPAN) with class sold.
See the Selector syntax for more details.

How do I parse this HTML with Jsoup

I am trying to extract "Know your tractor" and "Shell Petroleum Company.1955". Bear in mind that this is just a snippet of the whole code and there is more than one H2/H3 tag. I would like to get the data from all the H2 and H3 tags.
Here's the HTML: http://i.stack.imgur.com/Pif3B.png
The code I have just now is:
ArrayList<String> arrayList = new ArrayList<String>();
Document doc = null;
try {
    doc = Jsoup.connect("http://primo.abdn.ac.uk:1701/primo_library/libweb/action/search.do?dscnt=0&scp.scps=scope%3A%28ALL%29&frbg=&tab=default_tab&dstmp=1332103973502&srt=rank&ct=search&mode=Basic&dum=true&indx=1&tb=t&vl(freeText0)=tractor&fn=search&vid=ABN_VU1").get();
    Elements heading = doc.select("h2.EXLResultTitle span");
    for (Element src : heading) {
        String j = src.text();
        System.out.println(j); // check what's going into the array
        arrayList.add(j);
    }
} catch (IOException e) {
    // Jsoup.connect(...).get() throws IOException when the page cannot be fetched
    e.printStackTrace();
}
How would I extract "Know your tractor" and "Shell Petroleum Company.1955"? Thanks for your help!

Your selector only selects <span> elements which are inside <h2 class="EXLResultTitle">, while you actually need those <h2> elements themselves. So, just remove span from the selector:
Elements headings = doc.select("h2.EXLResultTitle");
for (Element heading : headings) {
    System.out.println(heading.text());
}
You should be able to figure out the selector for <h3 class="EXLResultAuthor"> yourself based on the lesson learnt.
See also:
Jsoup cookbook - CSS selectors
Jsoup Selector API documentation
