Parsing a website (WeatherBug) using Jsoup in Java

I have tried several approaches (below) and I have not been able to pull out the temperatures from the hourly forecast of weatherbug:
http://weather.weatherbug.com/MA/Boston-weather/local-forecast/hourly-forecast.html?zcode=z6286&zip=02108
I am using Java with Jsoup.
The temps are listed as: <span>33° F</span> within a table.
I suspect my problem is not fully understanding the html.
It appears to be within a table labeled: <table cellspacing="0" id="hourly">
Below are several things I have tried with no luck:
It seems like everything I have tried is not able to find or "see" the table.
doc=Jsoup.connect(urlString).get();
dataread = doc.body().text();
length = dataread.length();
System.out.printf("length = %d\n",length);
System.out.println(dataread);
The above was to check whether I was at least on the right track; the data was not in "dataread".
Then I tried printing the results from combinations of:
Elements table = doc.select("table[class=hourly]");
Elements table = doc.getElementsByTag("boxhdr");
Elements byclass = doc.getElementsByClass("boxhdr");
System.out.println(table.size());
System.out.println(table);
I extended my parsing further up the page hoping to get lucky - with the boxhdr label and so forth.
Can you help me extract the temperatures?
Thanks!

The table you want to extract the data from has the id hourly, so select it by id. Note that getElementById returns a single Element, not Elements:
Element table = doc.getElementById("hourly");
However, the content of the table is probably generated by JavaScript, in which case it is not in the HTML that Jsoup downloads and Jsoup cannot retrieve it. Refer to this thread.
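As a minimal sketch of the id-based lookup, assuming markup shaped like the snippet in the question: the HTML string, the class name HourlyTemps and the extract helper below are made up for illustration, and the code parses a static string rather than the live page. If the table is filled in by JavaScript, the same call on the downloaded page would simply return an empty list.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

final class HourlyTemps {

    // Returns the text of every <span> inside the element with id "hourly".
    // An empty list means the table (or its spans) is not in the parsed HTML.
    static List<String> extract(String html) {
        Document doc = Jsoup.parse(html);
        Element table = doc.getElementById("hourly"); // null when the id is absent
        List<String> temps = new ArrayList<>();
        if (table == null) {
            return temps;
        }
        for (Element span : table.select("span")) {
            temps.add(span.text());
        }
        return temps;
    }

    public static void main(String[] args) {
        // Hypothetical markup; the real page may be structured differently.
        String html = "<table cellspacing=\"0\" id=\"hourly\">"
                + "<tr><td><span>33&deg; F</span></td><td><span>31&deg; F</span></td></tr>"
                + "</table>";
        System.out.println(extract(html));
        // No table with that id: the result is an empty list.
        System.out.println(extract("<div id=\"other\"></div>"));
    }
}
```

Printing the size of the result is a quick way to tell whether the table is present in the static HTML at all.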

Related

In JSoup I am trying to get text from a span that has multiple classes with strange names the compiler is not liking

Here is my code:
text = text.toUpperCase();
Document doc = Jsoup.connect("https://finance.yahoo.com/quote/" + text + "?p=" + text).userAgent("Safari").get();
Element temp = doc.selectFirst("span.Trsdu(0.3s).Fw(b).Fz(36px).Mb(-4px).D(ib)");
System.out.println(temp);
here is the span I am trying to get:
<span class="Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)" data-reactid="35">1,119.50</span>
I am new to Jsoup, so if I am missing something obvious please let me know what I should do.
This may not be the answer but I can't comment yet since I don't have 50 rep points but I'd still like to help so I'll post it here.
Jsoup has trouble with some special characters in selectors; I've run into this as well.
For this particular example, I think you can use the data attribute 'data-reactid' to locate that element. You would select spans that carry that attribute value, something like this: doc.select("span[data-reactid=35]")
Hope that helps.
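A minimal sketch of the attribute-based lookup, run against the span quoted in the question rather than the live page. The class name ReactIdLookup and the spanText helper are hypothetical, and the data-reactid value is whatever the page happened to emit, so it may change between page loads.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class ReactIdLookup {

    // Returns the text of the first <span> whose data-reactid equals reactId,
    // or null when no such span exists in the parsed HTML.
    static String spanText(String html, String reactId) {
        Document doc = Jsoup.parse(html);
        Element span = doc.selectFirst("span[data-reactid=" + reactId + "]");
        return span == null ? null : span.text();
    }

    public static void main(String[] args) {
        // The span from the question, embedded as a static string.
        String html = "<span class=\"Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)\" "
                + "data-reactid=\"35\">1,119.50</span>";
        System.out.println(spanText(html, "35"));
    }
}
```

Selecting by attribute sidesteps the class names entirely, so the parentheses and dots in them never reach the selector parser.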

jSoup get data using td-class tags from webpage

I would like to get data from http://www.futbol24.com/Live/?__igp=1&LiveDate=20141104 using jSoup. I know how to use jSoup - but I am finding it difficult to pinpoint the data that I need.
I would like the Time, Home Team and Away Team from each row of the tbody table. So the output from the first row should be:
08:30 Persipura Jayapura Pelita Bandung Raya
I can see the td class of each of these elements as "status alt", "home" and "guest".
Currently I have tried the below, but it doesn't seem to output anything... what am I doing wrong?
matches = new ArrayList<Match>();
//getHistory
String website = "http://www.futbol24.com/Live/?__igp=1&LiveDate=20141104";
Document doc = Jsoup.connect(website).get();
Element tblHeader = doc.select("tbody").first();
List<Match> data = new ArrayList<>();
for (Element element1 : tblHeader.children()) {
    Match match = new Match();
    match.setTimeOfMatch(element1.select("td.status.alt").text());
    // td.home is the home team and td.guest is the away team
    match.setHomeTeam(element1.select("td.home").text());
    match.setAwayTeam(element1.select("td.guest").text());
    data.add(match);
}
System.out.println(data.toString());
Does anybody know how I can use jSoup to get these elements from each row of the table?
Thanks,
Rob
The content of this site is generated via AJAX it seems. Jsoup can't handle this, since it is not a browser that interprets JavaScript. To solve this scraping problem you may need something like Selenium webdriver. I gave a longer answer to a generalized question about this before, so please look here:
Jsoup get dynamically generated HTML

JSoup with Wunderground Pollen data

I am currently scraping pollen data from wunderground since their API accessor doesn't offer pollen data, specifically the values attributed to each day.
I've navigated the HTML using Chrome Dev Tools and found the specific line that I want. Using the documentation offered by JSoup, I tried putting in my own custom CSS Selectors, but I am quite lost.
I was wondering if anyone would give me some insight on how to access that particular element.
For example, below is an example of what I have so far.
doc = Jsoup.connect("http://www.wunderground.com/DisplayPollen.asp?Zipcode=19104").get();
Element title = doc.getElementById("td");
Element tagName = doc.tagName("id");
System.out.println(tagName);
You don't want to use doc.getElementById("td") because td is a tag name, not the value of an id attribute (also, getElementById doesn't accept CSS queries).
What you want is to select the first <td> with the class levels. You can do it via
Element tag = doc.select("td.levels").first();
Also, to get only the text contained in this tag (and not the entire HTML), use the text() method like
System.out.println(tag.text());
Document doc = Jsoup.connect("http://www.wunderground.com/DisplayPollen.asp?Zipcode=19104").get();
Elements days = doc.select("table.pollen-table").first().select("td.even-four");
for (Element day : days) {
System.out.println(day.text());
}
Elements levels = doc.select("td.levels");
for (Element level : levels) {
System.out.println(level.text());
}

Does JSoup find ALL images

I'm trying to analyze different websites to find all of the images they contain.
For this I'm using Jsoup with the following code:
Elements imagePath = doc.select("[src]");
e.attr("abs:src")
Now when I run this on a domain name I get a lot of images, but if I run the same thing on a subdomain I get the same images.
For instance, the site http://www.example.com would produce the same output as http://www.example.com/page1
Now my question is: does Jsoup find all images for all subsites of a domain, or is it just random luck that it produces the same output?
Are you updating your Document object? My guess (since no usable code was provided) is that you parsed your domain into doc and did not do the same for the subdomain. Jsoup applies your select only to the current document node and has nothing to do with subdomains, pages, etc. (the document doesn't even have to be a website).
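To illustrate the point, here is a small sketch that parses two separate documents and lists the images of each. The HTML strings, the class name ImageLister and the imageUrls helper are invented for the example, and it narrows the selector to img[src] (the question's [src] would also match scripts and iframes).

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

final class ImageLister {

    // Returns the absolute URLs of the <img> elements in this one document only.
    // baseUri is used to resolve relative src values.
    static List<String> imageUrls(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element img : doc.select("img[src]")) {
            urls.add(img.attr("abs:src"));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Two hypothetical pages with different content.
        String home = "<img src=\"/logo.png\"><img src=\"/banner.png\">";
        String page1 = "<img src=\"/logo.png\"><img src=\"photo.jpg\">";
        System.out.println(imageUrls(home, "http://www.example.com/"));
        System.out.println(imageUrls(page1, "http://www.example.com/page1/"));
    }
}
```

Each call only sees the document it was given, so identical output for two URLs suggests either that the pages share the same images or that the same Document was reused.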

HtmlUnit accessing an element without id or Name

How can I access this element:
<input type="submit" value="Save as XML" onclick="some code goes here">
More info: I have to access programmatically a web page and simulate clicking on a button on it, which then will generate a xml file which I hope to be able to save on the local machine.
I am trying to do so using the HtmlUnit libraries, but all the examples I could find use the getElementById() or getElementByName() methods. Unfortunately, this exact element doesn't have a name or id, so I failed miserably. I then supposed that I have to use the getByXPath() method, but I got completely lost in the XPath documentation (this is all new to me).
I have been stuck on this for a couple of hours so I really need all the help I can get.
Thanks in advance.
There are several options for an XPATH to select that input element.
Below is one option, which looks throughout the document for an input element that has an attribute named type with the value "submit" and an attribute named value with the value "Save as XML".
//input[@type='submit' and @value='Save as XML']
If you could provide a little bit more structure, a more specific (and efficient) XPATH could be created. For instance, something like this might work:
/html/body//form//input[@type='submit' and @value='Save as XML']
You should be able to use the XPATH with code like this:
client = new WebClient(BrowserVersion.FIREFOX_3)
client.javaScriptEnabled = false
page = client.getPage(url)
// getFirstByXPath returns the first match; getByXPath would return a list
submitButton = page.getFirstByXPath("/html/body//form//input[@type='submit' and @value='Save as XML']")
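To check that the XPath expression itself matches the intended element, here is a small sketch that uses the JDK's built-in javax.xml.xpath instead of HtmlUnit. That swap is an assumption made to keep the example self-contained. The sample markup, the class name XPathButtonDemo and the findButton helper are invented, and the input has to be well-formed XML, which real pages often are not.

```java
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import java.io.StringReader;

final class XPathButtonDemo {

    // The expression from the answer: any <input> with both attribute values.
    static final String EXPR = "//input[@type='submit' and @value='Save as XML']";

    // Returns the first matching element, or null when nothing matches.
    static Element findButton(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return (Element) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(EXPR, doc, XPathConstants.NODE);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse or query the document", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical, well-formed stand-in for the real page.
        String page = "<html><body><form>"
                + "<input type='text' value='ignored'/>"
                + "<input type='submit' value='Save as XML' onclick='save()'/>"
                + "</form></body></html>";
        Element button = findButton(page);
        System.out.println(button.getAttribute("onclick"));
    }
}
```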
Although I would, in most cases, recommend using XPath, if you don't know anything about it you can try the getInputByValue(String value) method. This is an example based on your question:
// Fetch the form somehow
HtmlForm form = this.page.getForms().get(0);
// Get the input by its value
System.out.println(form.getInputByValue("Save as XML").asXml());
