How do I parse this HTML with Jsoup

I am trying to extract "Know your tractor" and "Shell Petroleum Company.1955". Bear in mind that this is just a snippet of the whole page and there is more than one H2/H3 tag; I would like to get the data from all the H2 and H3 tags.
Here's the HTML: http://i.stack.imgur.com/Pif3B.png
The code I have so far is:
ArrayList<String> arrayList = new ArrayList<String>();
Document doc = null;
try {
    doc = Jsoup.connect("http://primo.abdn.ac.uk:1701/primo_library/libweb/action/search.do?dscnt=0&scp.scps=scope%3A%28ALL%29&frbg=&tab=default_tab&dstmp=1332103973502&srt=rank&ct=search&mode=Basic&dum=true&indx=1&tb=t&vl(freeText0)=tractor&fn=search&vid=ABN_VU1").get();
    Elements heading = doc.select("h2.EXLResultTitle span");
    for (Element src : heading) {
        String j = src.text();
        System.out.println(j); // check what's going into the array
        arrayList.add(j);
    }
} catch (IOException e) {
    // connect(...).get() throws IOException on network or HTTP errors
    e.printStackTrace();
}
How would I extract "Know your tractor" and "Shell Petroleum Company.1955"? Thanks for your help!

Your selector only selects <span> elements which are inside <h2 class="EXLResultTitle">, while you actually need those <h2> elements themselves. So, just remove span from the selector:
Elements headings = doc.select("h2.EXLResultTitle");
for (Element heading : headings) {
    System.out.println(heading.text());
}
You should be able to figure out the selector for <h3 class="EXLResultAuthor"> yourself based on the lesson learnt.
See also:
Jsoup cookbook - CSS selectors
Jsoup Selector API documentation
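For illustration, here is a minimal sketch that collects both kinds of heading in one pass by joining the two selectors with a comma. The class name ResultHeadings, the method extract and the sample markup are assumptions made up for this example (the real page is only available as a screenshot), so adjust them to the actual HTML.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

final class ResultHeadings {
    // Returns the text of every title <h2> and author <h3>, in document order.
    static List<String> extract(String html) {
        Document doc = Jsoup.parse(html);
        List<String> texts = new ArrayList<>();
        // The comma means "either selector"; text() also includes text of nested tags.
        for (Element heading : doc.select("h2.EXLResultTitle, h3.EXLResultAuthor")) {
            texts.add(heading.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        // Assumed markup, not copied from the real page.
        String html = "<div>"
                + "<h2 class=\"EXLResultTitle\"><a href=\"#\">Know your tractor</a></h2>"
                + "<h3 class=\"EXLResultAuthor\">Shell Petroleum Company.1955</h3>"
                + "</div>";
        for (String text : extract(html)) {
            System.out.println(text);
        }
    }
}
```

When fetching the live page, the same selector can be applied to the Document returned by Jsoup.connect(url).get() instead of Jsoup.parse(html).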

Related

Use JSoup to get all textual links

I'm using JSoup to grab content from web pages.
I want to get all the links on a page that contain some text. It doesn't matter what the text is; it just needs to be non-empty (so not only an image, etc.).
Example of links I want:
Link to Some Page
Since it contains the text "Link to Some Page"
Links I don't want:
<img src="someimage.jpg"/>
My code looks like this. How can I modify it to only get the first type of link?
Document document = // I get my document object
Elements linksOnPage = document.select("a[href]");
for (Element page : linksOnPage) {
String link = page.attr("abs:href");
// I do stuff with the link
}

You could do something like this.
It does its job, though it's probably not the fanciest solution out there.
Note: text() returns clean text, so any HTML markup inside the element is not included in the result.
Document document = // get the doc
Elements linksOnPage = document.select("a");
for (Element pageElem : linksOnPage) {
    // skip links without visible text (e.g. image-only links)
    if (pageElem.text().trim().isEmpty())
        continue;
    String link = pageElem.attr("abs:href");
    // do something with the link
}

I am using this and it's working fine:
Document document = // I get my document object
Elements linksOnPage = document.select("a:matches(([^\\s]+))");
for (Element page : linksOnPage) {
String link = page.attr("abs:href");
// I do stuff with the link
}
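As a self-contained variant of the two answers above, the sketch below filters with Element.hasText(), which is false when an element contains only whitespace or only non-text children such as an image. The class name TextualLinks, the method collect and the example.com base URI are hypothetical names chosen for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

final class TextualLinks {
    // Returns the absolute URLs of all links that contain non-blank text.
    static List<String> collect(String html, String baseUri) {
        // The base URI lets "abs:href" resolve relative links.
        Document doc = Jsoup.parse(html, baseUri);
        List<String> links = new ArrayList<>();
        for (Element anchor : doc.select("a[href]")) {
            if (anchor.hasText()) {
                links.add(anchor.attr("abs:href"));
            }
        }
        return links;
    }

    public static void main(String[] args) {
        // Assumed sample markup: one textual link and one image-only link.
        String html = "<a href=\"/page\">Link to Some Page</a>"
                + "<a href=\"/img\"><img src=\"someimage.jpg\"/></a>";
        for (String link : collect(html, "http://example.com/")) {
            System.out.println(link);
        }
    }
}
```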

How to get content from span with Jsoup

I am using the Jsoup HTML parser to extract content from an HTML page.
<span class="mainPrice reduced_">
<span class="oPrice" data-test="preisArtikel">
<span itemprop="price" content="68.00"><span class="oPriceLeft">68</span><span class="oPriceSeparator">,</span><span class="oPriceRight">00</span></span><span class="oPriceSymbol oPriceSymbolRight">€</span>
I want to extract the content (68.00) and I tried the following:
Elements price = doc.select("span.oPrice");
String priceString = price.text();
That doesn't work because the class "oPrice" occurs 44 times in the page and the string "priceString" contains 44 different prices.
Thank you for your help.

Try this:
// For one element
Element element = document.select("span[content]").first();
System.out.println(element.attr("content"));
If there are several such spans:
// For multiple elements
Elements elements = document.select("span[content]");
for (Element element : elements) {
    System.out.println(element.attr("content"));
}
Output:
68.00
See also the Jsoup Selector documentation for reference.
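If other spans on the page also carry a content attribute, a narrower selector may be safer. The sketch below keys on itemprop="price", which appears in the question's markup; the class name ItemPrices, the method extract and the second sample price are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

final class ItemPrices {
    // Returns the machine-readable price of every span marked itemprop="price".
    static List<String> extract(String html) {
        Document doc = Jsoup.parse(html);
        List<String> prices = new ArrayList<>();
        for (Element price : doc.select("span[itemprop=price]")) {
            // Read the attribute value, not the text split across the child spans.
            prices.add(price.attr("content"));
        }
        return prices;
    }

    public static void main(String[] args) {
        // Markup shaped like the question's snippet; the second price is made up.
        String html = "<span class=\"oPrice\"><span itemprop=\"price\" content=\"68.00\">"
                + "<span class=\"oPriceLeft\">68</span><span class=\"oPriceSeparator\">,</span>"
                + "<span class=\"oPriceRight\">00</span></span></span>"
                + "<span class=\"oPrice\"><span itemprop=\"price\" content=\"12.50\">12,50</span></span>";
        for (String price : extract(html)) {
            System.out.println(price);
        }
    }
}
```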

Use Jsoup to select an HTML element with no class

Consider an HTML document like this one:
<div>
<p>...</p>
<p>...</p>
...
<p class="random_class_name">...</p>
...
</div>
How can we select all of the p elements except the one with the random_class_name class?

You can use the :not pseudo-selector:
Elements ps = body.select("p:not(.random_class_name)");
If the class name is not known, you can still use a similar expression:
Elements ps = body.select("p:not([class])");
The first example uses the normal class syntax (.name); the second uses the attribute selector [] to exclude any p that has a class attribute at all.
See the Jsoup documentation on CSS selectors.
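A small runnable sketch of the second selector, assuming a document shaped like the one in the question. The class name ClasslessParagraphs, the method texts and the paragraph contents are made up for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

final class ClasslessParagraphs {
    // Returns the text of every <p> that has no class attribute.
    static List<String> texts(String html) {
        Document doc = Jsoup.parse(html);
        List<String> result = new ArrayList<>();
        for (Element p : doc.select("p:not([class])")) {
            result.add(p.text());
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed sample content standing in for the "..." placeholders.
        String html = "<div><p>one</p><p class=\"random_class_name\">two</p><p>three</p></div>";
        for (String text : texts(html)) {
            System.out.println(text);
        }
    }
}
```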

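Document doc = Jsoup.parse(htmlValue);
Elements pElements = doc.select("p");
for (Element element : pElements) {
    // "class" is a reserved word in Java, so the variable needs another name
    String className = element.attr("class");
    // attr() returns an empty string, not null, when the attribute is absent
    if (className.isEmpty()) {
        // p element without a class
    } else {
        // p element with a class
    }
}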

Wrong URL when parsing HTML with Jsoup Android

Could you help me with parsing an HTML site?
I need to get the src of an image and a link to another page, but I don't know why I get an empty list.
This is my code:
Elements elems2 = doc.select("div");
for (Element elem2 : elems2) {
if (elem2.attr("class").equals("grid-box-img")) {
System.out.println(elem2.attr("img"));
kfunewphoto.add(elem2.attr("src"));
}
}
and example of html:
<div class="grid-box-img"><img width="680" height="470" src="https://i.stack.imgur.com/c7PGK.png" class="attachment-full wp-post-image" alt="shou-talanty-uspej-uvidet-pervym-clever-russia" /></div>
I need to get "http://cleverrussia.com/wp-content/uploads/2014/10/shou-talanty-uspej-uvidet-pervym-clever-russia.png". The second part of the code:
Elements elems = doc.select("h2");
for (Element elem : elems) {
    if (elem.attr("class").equals("entry-title")) {
        str = elem.text();
        kfunews.add(elem.text());
        kfunewslist1.add(elem.attr("href"));
    }
}
<h2 class="entry-title">Шоу “Таланты”. Успей увидеть первым!</h2>
And I need to get: "http://cleverrussia.com/shou-talanty-uspej-uvidet-pervym/"
This is the full source of the page: view-source:http://cleverrussia.com/

The error is that you're reading img and a as if they were attributes of the div and h2, when they are child elements. Select the child element first, then read its attribute:
// Prints the image source
System.out.println(elem2.select("img").attr("src"));
kfunewphoto.add(elem2.select("img").attr("src"));
// Prints the target link
System.out.println(elem.select("a").attr("href"));
kfunewslist1.add(elem.select("a").attr("href"));
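The class filter can also be folded into the selector, which removes the manual attr("class") checks. The sketch below is only an illustration: the class name NewsItems, its two methods and the sample markup (including the a element inside the h2 and the example.com image URL) are assumptions, not taken from the real page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

final class NewsItems {
    // Returns the src of every <img> inside <div class="grid-box-img">.
    static List<String> imageSources(String html) {
        return attrs(html, "div.grid-box-img img", "src");
    }

    // Returns the href of every <a> inside <h2 class="entry-title">.
    static List<String> postLinks(String html) {
        return attrs(html, "h2.entry-title a", "href");
    }

    // Selects elements with a CSS query and reads one attribute from each.
    private static List<String> attrs(String html, String cssQuery, String attribute) {
        Document doc = Jsoup.parse(html);
        List<String> values = new ArrayList<>();
        for (Element element : doc.select(cssQuery)) {
            values.add(element.attr(attribute));
        }
        return values;
    }

    public static void main(String[] args) {
        // Assumed markup for illustration only.
        String html = "<div class=\"grid-box-img\"><img src=\"http://example.com/a.png\"/></div>"
                + "<h2 class=\"entry-title\"><a href=\"http://example.com/post/\">Title</a></h2>";
        System.out.println(imageSources(html));
        System.out.println(postLinks(html));
    }
}
```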

Using JSoup to select a group of tags

I am attempting to use JSoup to scrape some information off a page, which can be identified by a group of tags in a particular order. The order of them is as follows:
<span class="sold" >Sold</span></td>
<td class='prc'>
<div class="g-b bidsold" itemprop="price">
AU $1.00</div>
I am looking to grab each value that is in place of the AU $1.00 field on the page, but they can only be identified by the span class="sold" selector that occurs a few tags beforehand.
I have tried something like select("span.sold:lt(4) + [itemprop=price]") but feel like I'm flailing around in the dark!

The code below should do the trick:
Document doc = Jsoup.connect("URL of your HTML document").get();
Element part = doc.body();
Elements parts = part.getElementsByTag("div");
String attValue;
String requiredContent;
for (Element ent : parts) {
    if (ent.hasAttr("class")) {
        attValue = ent.attr("class");
        // matches only when the class attribute is exactly "g-b bidsold"
        if (attValue.equals("g-b bidsold")) {
            System.out.println("\n");
            requiredContent = ent.text();
            System.out.println(requiredContent);
        }
    }
}
To keep the results rather than print them, add each requiredContent value to a list inside the loop.

You could also do this:
Elements soldPrices = doc.select("td:has(.sold) + td [itemprop=price]");
That returns the elements (the DIVs) with itemprop=price that sit inside a TD whose immediately preceding sibling TD contains an element (the SPAN) with class sold.
See the Selector syntax for more details.
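To see the selector at work, here is a minimal sketch run against a small table. The class name SoldPrices, the method extract and the two-row sample table (including the "active" row) are assumptions for illustration; the real page layout may differ.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

final class SoldPrices {
    // Returns the price text of every row whose preceding cell is marked as sold.
    static List<String> extract(String html) {
        Document doc = Jsoup.parse(html);
        List<String> prices = new ArrayList<>();
        // td:has(.sold) -> a cell containing an element with class "sold"
        // + td          -> the cell immediately after it
        // [itemprop=price] -> any price element inside that cell
        for (Element price : doc.select("td:has(.sold) + td [itemprop=price]")) {
            prices.add(price.text());
        }
        return prices;
    }

    public static void main(String[] args) {
        // Assumed sample table: one sold row and one row that is not sold.
        String html = "<table>"
                + "<tr><td><span class=\"sold\">Sold</span></td>"
                + "<td class=\"prc\"><div class=\"g-b bidsold\" itemprop=\"price\">AU $1.00</div></td></tr>"
                + "<tr><td><span class=\"active\">Active</span></td>"
                + "<td class=\"prc\"><div class=\"g-b\" itemprop=\"price\">AU $2.00</div></td></tr>"
                + "</table>";
        for (String price : extract(html)) {
            System.out.println(price);
        }
    }
}
```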
