JSoup with Wunderground Pollen data

I am scraping pollen data from Wunderground because their API doesn't offer it, specifically the values attributed to each day.
I've navigated the HTML using Chrome Dev Tools and found the specific line that I want. Following the JSoup documentation, I tried writing my own CSS selectors, but I am quite lost.
Could anyone give me some insight into how to access that particular element?
Here is what I have so far:
doc = Jsoup.connect("http://www.wunderground.com/DisplayPollen.asp?Zipcode=19104").get();
Element title = doc.getElementById("td");
Element tagName = doc.tagName("id");
System.out.println(tagName);

You don't want to use doc.getElementById("td"), because td is a tag name, not the value of an id attribute (also, getElementById doesn't accept CSS queries).
What you want is to select the first <td> with the class levels. You can do that with
Element tag = doc.select("td.levels").first();
To get only the text contained in this tag (and not its entire HTML), use the text() method:
System.out.println(tag.text());
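The difference between an id lookup and a tag-plus-class selector can be tried offline. The following is a minimal sketch, assuming jsoup is on the classpath; the inline markup, the id "pollen" and the values in the cells are made up for illustration and are not taken from the real Wunderground page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectorDemo {

    // Hypothetical markup standing in for the real page; only the class
    // name "levels" comes from the answer above.
    static final String HTML =
            "<table id=\"pollen\"><tr>"
          + "<td class=\"levels\">2.40</td>"
          + "<td class=\"levels\">5.10</td>"
          + "</tr></table>";

    // Returns the text of the first <td class="levels">, or null if none matches.
    static String firstLevel(String html) {
        Document doc = Jsoup.parse(html);
        Element cell = doc.select("td.levels").first();
        return cell == null ? null : cell.text();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        // No element carries id="td", so the id lookup finds nothing.
        System.out.println(doc.getElementById("td") == null);     // true
        // The id lookup works only with a real id attribute value.
        System.out.println(doc.getElementById("pollen") != null); // true
        System.out.println(firstLevel(HTML));                     // 2.40
    }
}
```

Checking the result of first() for null before calling text() avoids a NullPointerException when the selector matches nothing.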

Document doc = Jsoup.connect("http://www.wunderground.com/DisplayPollen.asp?Zipcode=19104").get();
Elements days = doc.select("table.pollen-table").first().select("td.even-four");
for (Element day : days) {
    System.out.println(day.text());
}
Elements levels = doc.select("td.levels");
for (Element level : levels) {
    System.out.println(level.text());
}

Related

JSoup Scraping based on custom attributes

So I am using JSoup to scrape a website that creates a bunch of divs with dynamic class names (they change every reload), but the same attribute names. E.g:
<div class="[random text here that changes] js_resultTile" data-listing-number="[some number]">
<div class="a12_regularTile js_rollover_container " itemscope itemtype="http://schema.org/Product" data-listing-number="[same number here]">
<a href...
I've tried multiple approaches to selecting those divs and saving them in elements, but I can't seem to get it right. I've tried by attribute:
Document doc = Jsoup.connect([theUrl]).get();
Elements myEls = doc.select("div[data-listing-number]");
I've tried by class:
Document doc = Jsoup.connect([theUrl]).get();
Elements myEls = doc.getElementsByClass("a12_regularTile");
And:
Document doc = Jsoup.connect([theUrl]).get();
Elements myEls = doc.select("div[class*=js_resultTile]");
I've tried another attribute method:
Document doc = Jsoup.connect([theUrl]).get();
Elements myEls = new Elements();
for (Element element : doc.getAllElements()) {
    for (Attribute attribute : element.attributes()) {
        if (attribute.getKey().equalsIgnoreCase("data-listing-number")) {
            myEls.add(element);
        }
    }
}
None of these work. I can select the doc that gets me all the HTML, but my myEls object is always empty. What can I use to select these elements?

Are you sure these elements are present in the HTML returned by the server? They may be added later by JavaScript. If JavaScript is involved in building the page, Jsoup won't see those elements, because it does not execute scripts. More details in my answer to a similar question here: JSoup: Difficulty extracting a single element
One more tip: instead of your for-for-if construction, you can use this:
for (Element element : doc.getAllElements()) {
    if (element.dataset().containsKey("listing-number")) {
        myEls.add(element);
    }
}
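To check whether the selectors themselves are at fault, it can help to run them against a static copy of the markup. The sketch below assumes jsoup is on the classpath; the inline HTML, the random-looking class prefixes and the listing numbers are invented for illustration. If a selector works here but returns nothing on the live page, the elements are most likely added by JavaScript after the page loads.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ListingNumberDemo {

    // Hypothetical markup modelled on the snippet in the question.
    static final String HTML =
            "<div class=\"x9f3 js_resultTile\" data-listing-number=\"101\">"
          + "<div class=\"a12_regularTile js_rollover_container\" data-listing-number=\"101\"></div>"
          + "</div>"
          + "<div class=\"q7k2 js_resultTile\" data-listing-number=\"102\"></div>";

    // Collects data-listing-number from every outer tile, matching on the
    // stable class and on the presence of the attribute.
    static List<String> listingNumbers(String html) {
        Document doc = Jsoup.parse(html);
        List<String> numbers = new ArrayList<>();
        for (Element tile : doc.select("div.js_resultTile[data-listing-number]")) {
            numbers.add(tile.attr("data-listing-number"));
        }
        return numbers;
    }

    public static void main(String[] args) {
        System.out.println(listingNumbers(HTML)); // [101, 102]
    }
}
```

Matching on the stable class js_resultTile sidesteps the class names that change on every reload.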

jSoup get data using td-class tags from webpage

I would like to get data from http://www.futbol24.com/Live/?__igp=1&LiveDate=20141104 using jSoup. I know how to use jSoup - but I am finding it difficult to pinpoint the data that I need.
I would like the Time, Home Team and Away Team from each row of the tbody table. So the output from the first row should be:
08:30 Persipura Jayapura Pelita Bandung Raya
I can see the td class of each of these elements as "status alt", "home" and "guest".
Currently I have tried the below, but it doesn't seem to output anything... what am I doing wrong?
matches = new ArrayList<Match>();
//getHistory
String website = "http://www.futbol24.com/Live/?__igp=1&LiveDate=20141104";
Document doc = Jsoup.connect(website).get();
Element tblHeader = doc.select("tbody").first();
List<Match> data = new ArrayList<>();
for (Element element1 : tblHeader.children()) {
    Match match = new Match();
    match.setTimeOfMatch(element1.select("td.status.alt").text());
    // td.home holds the home team, td.guest the away team
    match.setHomeTeam(element1.select("td.home").text());
    match.setAwayTeam(element1.select("td.guest").text());
    data.add(match);
    System.out.println(data.toString());
}
Does anybody know how I can use jSoup to get these elements from each row of the table?
Thanks,
Rob

The content of this site appears to be generated via AJAX. Jsoup can't handle this, since it is not a browser and does not interpret JavaScript. To solve this scraping problem you may need something like Selenium WebDriver. I gave a longer answer to a more general question about this before, so please look here:
Jsoup get dynamically generated HTML

Get title attribute with jsoup

I have a problem with parsing a website.
The website contains a phrase like this:
<td class="school">
<abbr title data-original-title="Highschool">...</abbr>
</td>
How can I get the title (Highschool)?
I'm programming with jsoup and java.
Thanks for your help.

Try reading the jsoup cookbook.
First get the abbr element, then read its data-original-title attribute:
Element abbrElement = doc.select("abbr").first();
String originalTitle = abbrElement.attr("data-original-title");
Of course, you should make sure that you select the right abbr element. The code above selects the first one appearing in the document.

This can be done relatively easily using jsoup's DOM methods or a selector on a parsed document. Check out these links for reference:
DOM navigation
Extracting attributes
// assuming that the element with class "school" contains the abbr tag for the title
// (Elements has no getElementsByTag method, so use select on the result instead)
Elements titles = doc.getElementsByClass("school").select("abbr");
for (Element t : titles) {
    String title = t.attr("data-original-title");
    // do something with the title
}
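Both answers can be combined into a single selector that scopes the lookup to the school cell. This is a minimal sketch, assuming jsoup is on the classpath; the surrounding table markup and the visible text "HS" are placeholders added so that the snippet from the question parses as a table cell.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AbbrTitleDemo {

    // The snippet from the question, wrapped in a table so the parser keeps
    // the <td>; the visible text "HS" is a placeholder.
    static final String HTML =
            "<table><tr><td class=\"school\">"
          + "<abbr title data-original-title=\"Highschool\">HS</abbr>"
          + "</td></tr></table>";

    // Returns data-original-title of the first matching abbr, or null if none matches.
    static String originalTitle(String html) {
        Document doc = Jsoup.parse(html);
        Element abbr = doc.select("td.school abbr[data-original-title]").first();
        return abbr == null ? null : abbr.attr("data-original-title");
    }

    public static void main(String[] args) {
        System.out.println(originalTitle(HTML)); // Highschool
    }
}
```

Note that the plain title attribute is present but empty in this markup, so attr("title") would return an empty string rather than "Highschool".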

Getting text from a website using JSoup

I’m working with JSoup to parse the html website.
I want to get the article from (for example) Wikipedia.
I would like to get the text from the main page (http://en.wikipedia.org/wiki/Main_Page) from the table “From today’s featured article”.
Here’s the code:
Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/Main_Page");
Elements el = doc.select("div.mp-tfa");
System.out.println(el);
The problem is that it doesn’t work properly - it prints out just a blank line.
The “From today’s featured article” table is inserted in div class=“mp-tfa”.
How to get this text in my java program?
Thanks in advance.

Change:
doc.select("div.mp-tfa");
To:
doc.select("div#mp-tfa");
A better approach is to iterate over the selected Elements and print the text of each one:
Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/Main_Page").get();
Elements el = doc.select("div#mp-tfa");
for (Element e : el) {
    System.out.println(e.text());
}
This prints:
The Boulonnais is a heavy draft horse breed from Fr....

I think it's supposed to be:
Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/Main_Page").get();
Elements el = doc.select("div#mp-tfa");
System.out.println(el);
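The effect of "." versus "#" in the selector can be reproduced without a network call. The sketch below assumes jsoup is on the classpath and uses a made-up, shortened stand-in for the Wikipedia markup: a div whose id, not class, is mp-tfa.

```java
import org.jsoup.Jsoup;

public class IdVsClassDemo {

    // Hypothetical, shortened stand-in for the featured-article block.
    static final String HTML =
            "<div id=\"mp-tfa\"><p>The Boulonnais is a heavy draft horse breed.</p></div>";

    // Number of elements matched by the given CSS query.
    static int matchCount(String html, String cssQuery) {
        return Jsoup.parse(html).select(cssQuery).size();
    }

    // Combined text of all elements matched by the given CSS query.
    static String textOf(String html, String cssQuery) {
        return Jsoup.parse(html).select(cssQuery).text();
    }

    public static void main(String[] args) {
        System.out.println(matchCount(HTML, "div.mp-tfa")); // 0 (class selector, no match)
        System.out.println(matchCount(HTML, "div#mp-tfa")); // 1 (id selector)
        System.out.println(textOf(HTML, "div#mp-tfa"));
    }
}
```

A selector that matches nothing yields an empty Elements collection rather than null, which is why the original code produced no visible output instead of an exception.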

Could the value of an html anchor tag be fetched using xpath?

If I have HTML that looks like:
<td class="blah">&nbs;???? </td>
Could I get the ???? value using xpath?
What would it look like?

To use XPath you usually need XML, not HTML, but some parsers (e.g. the one built into PHP) have a relaxed mode that will parse most HTML, too.
If you want to find all <a> elements that are direct children of <td class="blah">, the XPath you need is
//td[@class = 'blah']/a
or
//td[@class = 'blah']/a[@href = 'http://...']
(depending on whether you want all URLs or only one particular URL)
This will give you a set of nodes. Iterate over it, and for each node check that it has exactly one child node and that this firstChild is a text node. That firstChild then contains the ????
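In Java, the first expression above can be evaluated with the javax.xml.xpath classes from the standard library, provided the input is well-formed XML. The following is a minimal sketch under that assumption; the sample markup, the example.com URL and the link text are invented, and the non-breaking space is written as the numeric reference &#160; because an undefined entity would stop the XML parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathAnchorDemo {

    // Hypothetical, well-formed sample input.
    static final String XHTML =
            "<table><tr><td class=\"blah\">&#160;"
          + "<a href=\"http://example.com/\">some value</a> </td></tr></table>";

    // Returns the text of every <a> that is a direct child of <td class="blah">.
    static List<String> anchorTexts(String xhtml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    "//td[@class = 'blah']/a", doc, XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent().trim());
            }
            return texts;
        } catch (Exception e) {
            // Malformed input ends up here, since the XML parser is strict.
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(anchorTexts(XHTML)); // [some value]
    }
}
```

getTextContent() returns the concatenated text of the node, which covers the single-text-child case described above without inspecting firstChild by hand.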

Why would you use an XML parser to parse HTML?
I would suggest using a dedicated Java HTML parser. There are many, though I haven't tried any myself.
As for whether it would work: I suspect it will not. You will get an error when the parser reaches &nbs;, if not earlier.
