How do I parse this HTML with Jsoup - java

I am trying to extract "Know your tractor" and "Shell Petroleum Company.1955". Bear in mind that this is just a snippet of the whole page and there is more than one H2/H3 tag. I would like to get the data from all the H2 and H3 tags.
Here's the HTML: http://i.stack.imgur.com/Pif3B.png
The code I have right now is:
ArrayList<String> arrayList = new ArrayList<String>();
Document doc = null;
try {
    doc = Jsoup.connect("http://primo.abdn.ac.uk:1701/primo_library/libweb/action/search.do?dscnt=0&scp.scps=scope%3A%28ALL%29&frbg=&tab=default_tab&dstmp=1332103973502&srt=rank&ct=search&mode=Basic&dum=true&indx=1&tb=t&vl(freeText0)=tractor&fn=search&vid=ABN_VU1").get();
    Elements heading = doc.select("h2.EXLResultTitle span");
    for (Element src : heading) {
        String j = src.text();
        System.out.println(j); // check what's going into the array
        arrayList.add(j);
    }
} catch (IOException e) {
    // Jsoup.connect(...).get() throws IOException on network or HTTP errors
    e.printStackTrace();
}
How would I extract "Know your tractor" and "Shell Petroleum Company.1955"? Thanks for your help!

Your selector only selects <span> elements that are inside <h2 class="EXLResultTitle">, while you actually need the <h2> elements themselves. So just remove span from the selector:
Elements headings = doc.select("h2.EXLResultTitle");
for (Element heading : headings) {
System.out.println(heading.text());
}
You should be able to figure the selector for <h3 class="EXLResultAuthor"> yourself based on the lesson learnt.
See also:
Jsoup cookbook - CSS selectors
Jsoup Selector API documentation
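
For reference, here is a minimal sketch of how both heading types could be collected in one pass. It assumes the class names mentioned above (EXLResultTitle and EXLResultAuthor); the sample markup and the HeadingTexts class are made up for illustration, since the real page is only available as a screenshot.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class HeadingTexts {
    // Collects the text of every matching <h2> and <h3>, in document order.
    static List<String> extract(String html) {
        Document doc = Jsoup.parse(html);
        List<String> texts = new ArrayList<>();
        // A comma in a CSS selector means "match either of these".
        for (Element heading : doc.select("h2.EXLResultTitle, h3.EXLResultAuthor")) {
            texts.add(heading.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        // Hypothetical markup, not copied from the real page.
        String html = "<h2 class=\"EXLResultTitle\"><a href=\"#\">Know your tractor</a></h2>"
                + "<h3 class=\"EXLResultAuthor\">Shell Petroleum Company.1955</h3>";
        System.out.println(extract(html));
    }
}
```

Because text() includes the text of child elements, it does not matter whether the title is wrapped in a link or a span inside the heading.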

Related

Use JSoup to get all textual links

I'm using JSoup to grab content from web pages.
I want to get all the links on a page that contain some text. It doesn't matter what the text is; it just needs to be non-empty, and not only an image etc.
Example of links I want:
Link to Some Page
Since it contains the text "Link to Some Page"
Links I don't want:
<img src="someimage.jpg"/>
My code looks like this. How can I modify it to only get the first type of link?
Document document = // I get my document object
Elements linksOnPage = document.select("a[href]");
for (Element page : linksOnPage) {
String link = page.attr("abs:href");
// I do stuff with the link
}
You could do something like this.
It does its job, though it's probably not the fanciest solution out there.
Note: the function text() returns clean text, so if there are any HTML fragments inside the link, it won't return them.
Document doc = // get the doc
Elements linksOnPage = doc.select("a");
for (Element pageElem : linksOnPage) {
    // Skip links with no visible text (e.g. image-only links)
    if (pageElem.text().trim().isEmpty())
        continue;
    String link = pageElem.attr("abs:href");
    // do something with it
}
I am using this and it's working fine:
Document document = // I get my document object
Elements linksOnPage = document.select("a:matches(([^\\s]+))");
for (Element page : linksOnPage) {
String link = page.attr("abs:href");
// I do stuff with the link
}
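
As a self-contained illustration of the same idea, the sketch below filters links with Element.hasText(), which is true only when the element (or one of its children) contains non-whitespace text. The TextLinks class, the sample markup, and the example.com base URI are assumptions made up for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class TextLinks {
    // Returns the absolute URLs of all links that contain visible text.
    static List<String> textualLinks(String html, String baseUri) {
        // The base URI lets "abs:href" resolve relative links.
        Document doc = Jsoup.parse(html, baseUri);
        List<String> links = new ArrayList<>();
        for (Element a : doc.select("a[href]")) {
            if (a.hasText()) { // false for image-only or empty links
                links.add(a.attr("abs:href"));
            }
        }
        return links;
    }

    public static void main(String[] args) {
        // Hypothetical markup: one textual link and one image-only link.
        String html = "<a href=\"/page\">Link to Some Page</a>"
                + "<a href=\"/img\"><img src=\"someimage.jpg\"/></a>";
        System.out.println(textualLinks(html, "https://example.com/"));
    }
}
```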

How to get content from span with Jsoup

I am using the Jsoup HTML parser to extract content from an HTML page.
<span class="mainPrice reduced_">
<span class="oPrice" data-test="preisArtikel">
<span itemprop="price" content="68.00"><span class="oPriceLeft">68</span><span class="oPriceSeparator">,</span><span class="oPriceRight">00</span></span><span class="oPriceSymbol oPriceSymbolRight">€</span>
I want to extract the content (68.00) and I tried the following:
Elements price = doc.select("span.oPrice");
String priceString = price.text();
That doesn't work because the class "oPrice" occurs 44 times in the page and the string "priceString" contains 44 different prices.
Thank you for your help.
Try this:
// For one element
Element priceElement = document.select("span[content]").first();
System.out.println(priceElement.attr("content"));
If there are several such spans on the page:
// For multiple elements
Elements priceElements = document.select("span[content]");
for (Element element : priceElements) {
    System.out.println(element.attr("content"));
}
Output:
68.00
See also the Jsoup Selector documentation for reference.
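
If other spans on the page also carry a content attribute, a narrower selector may be safer. The sketch below is one possible variant, assuming the price always sits in a span with itemprop="price" inside span.oPrice as in the snippet from the question; the PriceContent class name is made up.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class PriceContent {
    // Reads the machine-readable price from the "content" attribute of each price span.
    static List<String> prices(String html) {
        Document doc = Jsoup.parse(html);
        List<String> result = new ArrayList<>();
        for (Element price : doc.select("span.oPrice span[itemprop=price]")) {
            result.add(price.attr("content"));
        }
        return result;
    }

    public static void main(String[] args) {
        // Markup shortened from the question.
        String html = "<span class=\"oPrice\" data-test=\"preisArtikel\">"
                + "<span itemprop=\"price\" content=\"68.00\">"
                + "<span class=\"oPriceLeft\">68</span>"
                + "<span class=\"oPriceSeparator\">,</span>"
                + "<span class=\"oPriceRight\">00</span></span></span>";
        System.out.println(prices(html));
    }
}
```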

Use Jsoup to select an HTML element with no class

Consider an html document like this one
<div>
<p>...</p>
<p>...</p>
...
<p class="random_class_name">...</p>
...
</div>
How could we select all of the p elements, but excluding the p element with random_class_name class?
Elements ps = body.select("p:not(.random_class_name)");
You can use the pseudo selector :not
If the class name is not known, you still can use a similar expression:
Elements ps = body.select("p:not([class])");
In the second example I use the attribute selector [], in the first the normal syntax for classes.
See the Jsoup documentation on CSS selectors.
Document doc = Jsoup.parse(htmlValue);
Elements pElements = doc.select("p");
for (Element element : pElements) {
String className = element.attr("class"); // attr() returns "" (never null) when the attribute is absent
if (className.isEmpty()) {
//.....
}else{
//.....
}
}
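
To see the :not([class]) selector in action, here is a small runnable sketch on made-up markup (the NoClassParagraphs class and the paragraph texts are illustrative only).

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class NoClassParagraphs {
    // Returns the text of every <p> that has no class attribute at all.
    static List<String> withoutClass(String html) {
        Document doc = Jsoup.parse(html);
        List<String> texts = new ArrayList<>();
        for (Element p : doc.select("p:not([class])")) {
            texts.add(p.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        // Hypothetical markup modelled on the question.
        String html = "<div><p>one</p><p>two</p>"
                + "<p class=\"random_class_name\">three</p></div>";
        System.out.println(withoutClass(html));
    }
}
```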

Wrong URL when parsing HTML with Jsoup Android

Could you help me with parsing an HTML page?
I need to get the src of an image and a link to another page, but I don't know why I get an empty list.
This is my code:
Elements elems2 = doc.select("div");
for (Element elem2 : elems2) {
if (elem2.attr("class").equals("grid-box-img")) {
System.out.println(elem2.attr("img"));
kfunewphoto.add(elem2.attr("src"));
}
}
and example of html:
<div class="grid-box-img"><img width="680" height="470" src="https://i.stack.imgur.com/c7PGK.png" class="attachment-full wp-post-image" alt="shou-talanty-uspej-uvidet-pervym-clever-russia" /></div>
I need to get "http://cleverrussia.com/wp-content/uploads/2014/10/shou-talanty-uspej-uvidet-pervym-clever-russia.png". The second part of the code:
Elements elems = doc.select("h2");
for (Element elem : elems) {
if (elem.attr("class").equals("entry-title")) {
str = elem.text();
kfunews.add(elem.text());
kfunewslist1.add(elem.attr("href"));
}
}
<h2 class="entry-title">Шоу “Таланты”. Успей увидеть первым!</h2>
And I need to get: "http://cleverrussia.com/shou-talanty-uspej-uvidet-pervym/"
This is full code of page - view-source:http://cleverrussia.com/
The error is that you're trying to read img and a as attributes, when they are child elements. Select the child element first, then read its attribute:
// Prints the image source
System.out.println(elem2.select("img").attr("src"));
kfunewphoto.add(elem2.select("img").attr("src"));
// Prints the target link
System.out.println(elem.select("a").attr("href"));
kfunewslist1.add(elem.select("a").attr("href"));
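
The same fix can also be written with descendant selectors instead of class checks inside the loop. This is only a sketch: the NestedAttributes class is made up, and it assumes the title link is an <a> nested inside the <h2>, which the snippet in the question does not show.

```java
import org.jsoup.Jsoup;

class NestedAttributes {
    // Returns the attribute of the first element matching the query, or "" if nothing matches.
    static String firstAttr(String html, String cssQuery, String attribute) {
        return Jsoup.parse(html).select(cssQuery).attr(attribute);
    }

    public static void main(String[] args) {
        // Hypothetical markup; the <a> inside the <h2> is an assumption.
        String html = "<div class=\"grid-box-img\">"
                + "<img src=\"https://i.stack.imgur.com/c7PGK.png\"/></div>"
                + "<h2 class=\"entry-title\">"
                + "<a href=\"http://cleverrussia.com/shou-talanty-uspej-uvidet-pervym/\">Title</a></h2>";
        // The <img> inside the div, then its src attribute
        System.out.println(firstAttr(html, "div.grid-box-img img", "src"));
        // The <a> inside the h2, then its href attribute
        System.out.println(firstAttr(html, "h2.entry-title a", "href"));
    }
}
```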

Using JSoup to select a group of tags

I am attempting to use JSoup to scrape some information off a page, which can be identified by a group of tags in a particular order. The order of them is as follows:
<span class="sold" >Sold</span></td>
<td class='prc'>
<div class="g-b bidsold" itemprop="price">
AU $1.00</div>
I am looking to grab each value that is in place of the AU $1.00 field on the page, but they can only be identified by the span class="sold" selector that occurs a few tags beforehand.
I have tried something like select("span.sold:lt(4) + [itemprop=price]") but feel like I'm flailing around in the dark!
The code below should do the trick:
Document doc = Jsoup.connect(/* URL of your HTML document */).get();
Element part = doc.body();
Elements parts = part.getElementsByTag("div");
String attValue;
String requiredContent;
for(Element ent : parts)
{
if(ent.hasAttr("class"))
{
attValue = ent.attr("class");
if(attValue.equals("g-b bidsold"))
{
System.out.println("\n");
requiredContent=ent.text();
System.out.println(requiredContent);
}
}
}
To keep the results, add each value to a list or array inside the loop instead of only printing it.
You could also do this:
Elements soldPrices = doc.select("td:has(.sold) + td [itemprop=price]");
That will return the elements (the DIVs) with the price itemprop that sit inside a TD whose immediately preceding TD contains an element (the SPAN) with class=sold.
See the Selector syntax for more details.
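
A runnable sketch of that selector on made-up table markup follows. The SoldPrices class and the second, unsold row are assumptions added to show that only the sold price is picked up. Note that the cells must sit inside a <table>, otherwise the HTML parser drops the <td> tags.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class SoldPrices {
    // Returns the price text of every cell that directly follows a cell marked as sold.
    static List<String> soldPrices(String html) {
        Document doc = Jsoup.parse(html);
        List<String> prices = new ArrayList<>();
        for (Element price : doc.select("td:has(.sold) + td [itemprop=price]")) {
            prices.add(price.text());
        }
        return prices;
    }

    public static void main(String[] args) {
        // Hypothetical rows: the first is sold, the second is not.
        String html = "<table>"
                + "<tr><td><span class=\"sold\">Sold</span></td>"
                + "<td class=\"prc\"><div class=\"g-b bidsold\" itemprop=\"price\">AU $1.00</div></td></tr>"
                + "<tr><td><span>Active</span></td>"
                + "<td class=\"prc\"><div class=\"g-b\" itemprop=\"price\">AU $2.00</div></td></tr>"
                + "</table>";
        System.out.println(soldPrices(html));
    }
}
```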
