Jsoup how to get values from html - java

So I'm trying to get specific information from this link: https://myanimelist.net/anime/31988/Hibike_Euphonium_2
I don't really understand html so this is a bit harder for me.
I'm specifically trying to get information from here:
<div>
<span class="dark_text">Studios:</span>
Kyoto Animation </div>
<div class="spaceit">
What I'm trying to do is search for when it says "Studios" and then get the title of the href link (Kyoto Animation).
So far I have managed to get this:
Document doc = Jsoup.connect("https://myanimelist.net/anime/31988/Hibike_Euphonium_2").get();
Elements studio = doc.select("a[href][title]");
for(Element link : studio){
System.out.println(link.attr("title"));
}
And it's outputting this:
Lantis
Pony Canyon
Rakuonsha
Ponycan USA
Kyoto Animation
Drama
Music
School
Kyoto Animation
Go to the Last Post
Go to the Last Post
Anime You Should Watch Before Their Sequels Air This Fall 2016 Season
Collection
Follow #myanimelist on Twitter

It should be
doc.select("span:contains(Studios) + a[href][title]");
if I assume that span is the common element for the list headers.
Basically, this selector finds every span element that contains the text Studios and then takes the a element that immediately follows it as a sibling (+ is the adjacent sibling combinator, not a child combinator), provided that a element has both href and title attributes.
Note that this selector picks only the one link directly after the span.
A more general selector could be
*:contains(Studio) > a[title]
which means: take every a element that has a title attribute and is a direct child of any (*) element that contains the text Studio. :contains takes the text of all descendants into account as well, so large containers can match too. To match against only an element's own text, use :containsOwn.
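As a rough sketch of the adjacent sibling selector described above, the following parses a small inline HTML string instead of fetching the live page. The markup is an assumption: the snippet quoted in the question shows only the link text, so the a elements, their href values, and the class name StudioSelectorSketch are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class StudioSelectorSketch {
    // Assumed markup: the <a> elements and href values are hypothetical,
    // modelled on the structure the question describes.
    static final String HTML =
          "<div><span class=\"dark_text\">Producers:</span>"
        + "<a href=\"/p/1\" title=\"Lantis\">Lantis</a></div>"
        + "<div><span class=\"dark_text\">Studios:</span>"
        + "<a href=\"/p/2\" title=\"Kyoto Animation\">Kyoto Animation</a></div>";

    // Returns the title of the link directly after the "Studios" span,
    // or null when nothing matches.
    static String studioTitle(String html) {
        Document doc = Jsoup.parse(html);
        Element link = doc.select("span:contains(Studios) + a[href][title]").first();
        return link == null ? null : link.attr("title");
    }

    public static void main(String[] args) {
        System.out.println(studioTitle(HTML));
    }
}
```

The null check matters because first() returns null when the selector matches nothing, which is likely if the real page structure differs from the assumed one.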

Not tested, but what about something like
...
Elements studio = doc.select("a[title='Kyoto Animation']");
...

Related

JSoup crawling how to crawl from same tag but two items

I need to crawl all three items in the span tags. I have some code, but I need a few hints. This is my code so far.
News n = new News();
n.setHeadline(news.getElementsByTag("h2").first().text());
n.setTypeOfSport(news.getElementsByTag("span").first().text());
n.setDate(news.getElementsByTag("span").);
n.setTime(news.getElementsByTag("span").);
It looks like you want to pick all span elements from <div class="info"> and access them based on their position (index).
Assuming that your news variable is of type Document, Element, or Elements, you have access to the select(cssQuery) method. If news contains this <div class="info"> at some level, your code can look like:
Elements spans = news.select("div.info span");
//now you can get and handle text from all spans via
spans.get(0).text();
spans.get(1).text();
spans.get(2).text();
For more info about selecting elements using CSS see https://jsoup.org/cookbook/extracting-data/selector-syntax
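The positional access above could look roughly like this on a self-contained input. The HTML is an assumption (the question does not show its markup), and NewsSpansSketch and infoSpans are hypothetical names used only for this illustration.

```java
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.select.Elements;

public class NewsSpansSketch {
    // Assumed markup: three spans inside div.info, plus one span outside it
    // to show that the selector is scoped to the div.
    static final String HTML =
          "<h2>Headline</h2><span>unrelated</span>"
        + "<div class=\"info\"><span>Football</span>"
        + "<span>2016-10-01</span><span>18:30</span></div>";

    // Collects the text of every span under div.info, in document order.
    static List<String> infoSpans(String html) {
        Elements spans = Jsoup.parse(html).select("div.info span");
        return spans.eachText();
    }

    public static void main(String[] args) {
        List<String> parts = infoSpans(HTML);
        // Guard against pages with fewer spans than expected before indexing.
        if (parts.size() >= 3) {
            System.out.println(parts.get(0) + " | " + parts.get(1) + " | " + parts.get(2));
        }
    }
}
```

Checking the size first avoids an IndexOutOfBoundsException when a news item has fewer spans than expected.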

Get title attribute with jsoup

I have a problem with parsing a website.
The website contains a phrase like this:
<td class="school">
<abbr title data-original-title="Highschool">...</abbr>
</td>
How can I get the title (Highschool)?
I'm programming with jsoup and java.
Thanks for your help.
Just try reading jsoup cookbook.
First you should get abbr element, and then its data-original-title attribute:
Element abbrElement = doc.select("abbr").first();
String originalTitle = abbrElement.attr("data-original-title");
Of course you should make sure that you select right abbr element. Above code will select the first one appearing in the document.
This can be done relatively easily using jsoup's DOM methods or a selector on a parsed document. Check out these links for reference:
DOM navigation
Extracting attributes
// assuming that the cell with class "school" contains the abbr tag holding the title
// (Elements has no getElementsByTag method, so a CSS query is used instead)
Elements titles = doc.select(".school abbr");
for (Element t : titles) {
    String title = t.attr("data-original-title");
    // do something with the title
}
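A small self-contained sketch of the attribute lookup, run against inline HTML rather than the real site. The surrounding table markup is an assumption, added because an HTML parser drops a td that is not inside a table. SchoolTitleSketch is a hypothetical name.

```java
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SchoolTitleSketch {
    // Assumed markup: the table and tr wrappers are not in the question.
    static final String HTML =
          "<table><tr><td class=\"school\">"
        + "<abbr title data-original-title=\"Highschool\">HS</abbr>"
        + "</td></tr></table>";

    // Returns data-original-title for every abbr inside an element with class "school".
    static List<String> schoolTitles(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select(".school abbr[data-original-title]")
                  .eachAttr("data-original-title");
    }

    public static void main(String[] args) {
        System.out.println(schoolTitles(HTML));
    }
}
```

Filtering on [data-original-title] in the selector skips any abbr elements that lack the attribute.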

JSoup with Wunderground Pollen data

I am currently scraping pollen data from wunderground since their API accessor doesn't offer pollen data, specifically the values attributed to each day.
I've navigated the HTML using Chrome Dev Tools and found the specific line that I want. Using the documentation offered by JSoup, I tried putting in my own custom CSS Selectors, but I am quite lost.
I was wondering if anyone would give me some insight on how to access that particular element.
For example, below is an example of what I have so far.
doc = Jsoup.connect("http://www.wunderground.com/DisplayPollen.asp?Zipcode=19104").get();
Element title = doc.getElementById("td");
Element tagName = doc.tagName("id");
System.out.println(tagName);
You don't want to use doc.getElementById("td") because td is a tag name, not the value of an id attribute (also, getElementById doesn't accept CSS queries).
What you want is to select the first <td> with the class levels. You can do it via
Element tag = doc.select("td.levels").first();
Also, to get only the text inside this element (and not its entire HTML), use the text() method:
System.out.println(tag.text());
Document doc = Jsoup.connect("http://www.wunderground.com/DisplayPollen.asp?Zipcode=19104").get();
Elements days = doc.select("table.pollen-table").first().select("td.even-four");
for (Element day : days) {
System.out.println(day.text());
}
Elements levels = doc.select("td.levels");
for (Element level : levels) {
System.out.println(level.text());
}

Jsoup Select method returns null

I am trying to get the rating of each movie, but I can't seem to use the select method in the right way. I am trying to get the 7.0 part from the webpage:
http://www.imdb.com/title/tt0800369/
<div class="star-box giga-star">
<div class="titlePageSprite star-box-giga-star"> 7.0 </div>
I am using this line in java:
Element rating = doc.select("star-box giga-star").first();
System.out.println(rating);
Thanks in advance!
You can select an element by its class using .star-box-giga-star, and use text() to get the textual content of the element.
doc.select(".star-box-giga-star").text();
The problem with your selector is that "star-box giga-star" is read as a descendant selector on tag names (a giga-star element inside a star-box element) rather than as classes. Use .class or element.class instead, like div.star-box. Note that to match multiple classes on one element you need element.class1.class2, or just .class1.class2 if you don't want to specify the element.
Also, if you want to specify a parent-child relationship you will have to use >, so try something like
Document doc = Jsoup.connect("http://www.imdb.com/title/tt0800369/").get();
Element rating = doc
.select("div.star-box.giga-star > div.titlePageSprite.star-box-giga-star")
.first();
System.out.println(rating);
Unfortunately this will print
<div class="titlePageSprite star-box-giga-star">
7.0
</div>
so if you want to get only the text content of that element, use System.out.println(rating.text());
BTW since there is only one element with class star-box-giga-star you can just use
String rating = doc.select(".star-box-giga-star").text();
as shown in Alex's answer
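To try the class selectors without depending on the live page, here is a sketch that runs them on the fragment quoted in the question. The closing tag of the outer div is an assumption, and RatingSelectorSketch is a hypothetical name.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class RatingSelectorSketch {
    // Fragment from the question; the final </div> is assumed.
    static final String HTML =
          "<div class=\"star-box giga-star\">"
        + "<div class=\"titlePageSprite star-box-giga-star\"> 7.0 </div>"
        + "</div>";

    // Parent with two classes, then a direct child with two classes.
    static String ratingStrict(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("div.star-box.giga-star > div.titlePageSprite.star-box-giga-star").text();
    }

    // Shorter form that relies on the class being unique in the page.
    static String ratingShort(String html) {
        return Jsoup.parse(html).select(".star-box-giga-star").text();
    }

    public static void main(String[] args) {
        System.out.println(ratingStrict(HTML));
        System.out.println(ratingShort(HTML));
    }
}
```

Calling text() on the Elements result returns an empty string when nothing matches, so this form avoids the null that first() would give for a selector with no matches.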

Parsing XML with Jsoup

I get the following XML which represents a news article:
<content>
Some text blalalala
<h2>Small subtitle</h2>
Some more text blbla
<ul class="list">
<li>List item 1</li>
<li>List item 2</li>
</ul>
<br />
Even more freakin text
</content>
I know the format isn't ideal but for now I have to take it.
The Article should look like:
Some text blalalala
Small subtitle
List with items
Even more freakin text
I parse this XML with Jsoup. I can get the text within the <content> tag with doc.ownText() but then I have no idea where the other stuff (subtitle) is placed, I get only one big String.
Would it be better to use an event based parser for this (I hate them :() or is there a possibility to do something like doc.getTextUntilTagAppears("tagName")?
Edit: For clarification, I know how to get the elements under <content>; my problem is with getting the text within <content>, broken up every time it's interrupted by an element.
I learned that I can get all the text within content with .textNodes(), works great, but then again I don't know where which text node belongs in my article (one at the top before h2, the other one at the bottom).
Jsoup has a fantastic selector based syntax. See here
If you want the subtitle
Document doc = Jsoup.parse(xml, "", Parser.xmlParser()); // xml holds the XML text itself; Jsoup.parse(String) does not read a file path
You know that subtitle is in the h2 element
Element subtitle = doc.select("h2").first(); // first h2 element that appears
And if you like to have the list:
Elements listItems = doc.select("ul.list > li");
for(Element item: listItems)
System.out.println(item.text()); // print list's items one after another
The mistake I made was going through the XML by Elements, which do not include TextNodes. When I go through it Node by Node, I can check whether the Node is an Element or a TextNode, and that way I can treat them accordingly.
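The node-by-node walk could be sketched as follows. This is one possible reading of the approach, not the poster's actual code: the labels ("text", "subtitle", "item") and the class name ContentNodesSketch are made up, and the XML is a shortened version of the sample from the question.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.parser.Parser;

public class ContentNodesSketch {
    static final String XML =
          "<content>Some text<h2>Small subtitle</h2>More text"
        + "<ul class=\"list\"><li>List item 1</li><li>List item 2</li></ul>"
        + "<br />Even more text</content>";

    // Walks the direct children of <content> in document order and labels each part.
    static List<String> articleParts(String xml) {
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        Element content = doc.selectFirst("content");
        List<String> parts = new ArrayList<>();
        if (content == null) {
            return parts;
        }
        for (Node node : content.childNodes()) {
            if (node instanceof TextNode) {
                TextNode text = (TextNode) node;
                if (!text.isBlank()) {           // skip whitespace-only nodes
                    parts.add("text: " + text.text().trim());
                }
            } else if (node instanceof Element) {
                Element el = (Element) node;
                if (el.tagName().equals("h2")) {
                    parts.add("subtitle: " + el.text());
                } else if (el.tagName().equals("ul")) {
                    for (Element li : el.select("li")) {
                        parts.add("item: " + li.text());
                    }
                }
                // other elements such as <br /> are ignored in this sketch
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        articleParts(XML).forEach(System.out::println);
    }
}
```

Because childNodes() keeps text nodes and elements in their original order, each piece of text stays in its place relative to the subtitle and the list.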
