Parsing XML with Jsoup

I get the following XML which represents a news article:
<content>
Some text blalalala
<h2>Small subtitle</h2>
Some more text blbla
<ul class="list">
<li>List item 1</li>
<li>List item 2</li>
</ul>
<br />
Even more freakin text
</content>
I know the format isn't ideal but for now I have to take it.
The Article should look like:
Some text blalalala
Small subtitle
List with items
Even more freakin text
I parse this XML with Jsoup. I can get the text within the <content> tag with doc.ownText() but then I have no idea where the other stuff (subtitle) is placed, I get only one big String.
Would it be better to use an event based parser for this (I hate them :() or is there a possibility to do something like doc.getTextUntilTagAppears("tagName")?
Edit: For clarification, I know how to get the elements under <content>. My problem is getting the text within <content>, broken up every time it's interrupted by an element.
I learned that I can get all the text within content with .textNodes(), works great, but then again I don't know where which text node belongs in my article (one at the top before h2, the other one at the bottom).

Jsoup has a fantastic selector based syntax. See here
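If you want the subtitle, first parse the XML into a document:
Document doc = Jsoup.parse(xml, "", Parser.xmlParser()); // xml holds the markup itself; Jsoup.parse(String) expects content, not a file path
You know that the subtitle is in the h2 element: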
Element subtitle = doc.select("h2").first(); // first h2 element that appears
And if you like to have the list:
Elements listItems = doc.select("ul.list > li");
for (Element item : listItems) {
    System.out.println(item.text()); // print the list items one after another
}

The mistake I made was going through the XML by Elements, which do not include TextNodes. When I go through it Node by Node, I can check whether each Node is an Element or a TextNode and treat it accordingly.
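
A minimal sketch of that Node-by-Node walk, in case it helps. It assumes the XML from the question and a reasonably recent jsoup; the class name, method name, and the TEXT/SUBTITLE/ITEM labels are made up for illustration:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.parser.Parser;

import java.util.ArrayList;
import java.util.List;

class ContentWalker {

    // Walks the direct children of <content> in document order and labels each part.
    // The labels are hypothetical; replace them with whatever the article model needs.
    static List<String> articleParts(String xml) {
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        Element content = doc.select("content").first();
        List<String> parts = new ArrayList<>();
        if (content == null) {
            return parts; // no <content> element found
        }
        for (Node node : content.childNodes()) {
            if (node instanceof TextNode) {
                // Loose text between elements; skip whitespace-only nodes.
                String text = ((TextNode) node).text().trim();
                if (!text.isEmpty()) {
                    parts.add("TEXT: " + text);
                }
            } else if (node instanceof Element) {
                Element element = (Element) node;
                if (element.tagName().equals("h2")) {
                    parts.add("SUBTITLE: " + element.text());
                } else if (element.tagName().equals("ul")) {
                    for (Element item : element.select("li")) {
                        parts.add("ITEM: " + item.text());
                    }
                }
                // Other elements such as <br /> carry no text and are ignored here.
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        String xml = "<content>\n"
                + "Some text blalalala\n"
                + "<h2>Small subtitle</h2>\n"
                + "Some more text blbla\n"
                + "<ul class=\"list\">\n"
                + "<li>List item 1</li>\n"
                + "<li>List item 2</li>\n"
                + "</ul>\n"
                + "<br />\n"
                + "Even more freakin text\n"
                + "</content>";
        for (String part : articleParts(xml)) {
            System.out.println(part);
        }
    }
}
```

Because childNodes() returns text nodes and elements interleaved in document order, each piece of text keeps its position relative to the h2 and the list.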

Related

Jsoup not getting text of few elements when parsing all the elements in a page

I need to find key/value pairs in a web page with a known set of keys. For this, I am parsing all the elements in the web page using Jsoup in Java, but I am unable to retrieve the text of a few elements during the iteration.
I am using the code below to select all elements, and I iterate over them with a forEach loop.
Elements elements = document.body().select("*");
Sample HTML:
<div id="requisitionDescriptionInterface.ID1622.row1" class="contentlinepanel" title=""><h2 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:htm="http://www.w3.org/1999/xhtml" xmlns:ftl="http://www.taleo.net/ftl" class="no-change-header-inline"><span id="requisitionDescriptionInterface.ID1638.row1" class="subtitle" style="display: inline;" title="">Primary Location</span></h2><span class="inline"> </span><span id="requisitionDescriptionInterface.ID1659.row1" class="text" title="">India-Noida</span></div>
Both spans (with the texts Primary Location and India-Noida) are being selected. I can verify this by printing their IDs using element.id().
I am also able to print the text 'Primary Location', but the text 'India-Noida' is not being returned. I am using the method Element.ownText() to get the text.
Can someone tell me what I am doing wrong?

JSoup crawling how to crawl from same tag but two items

I need to crawl all three items in the span tags. I have some code, but I need a few hints. This is my code so far:
News n = new News();
n.setHeadline(news.getElementsByTag("h2").first().text());
n.setTypeOfSport(news.getElementsByTag("span").first().text());
n.setDate(news.getElementsByTag("span").);
n.setTime(news.getElementsByTag("span").);

It looks like you want to pick all span elements from <div class="info"> and access them by their position (index).
Assuming your news variable is a Document, an Element, or an Elements, it has a select(cssQuery) method. If that variable contains the <div class="info"> somewhere below it, your code can look like this:
Elements spans = news.select("div.info span");
//now you can get and handle text from all spans via
spans.get(0).text();
spans.get(1).text();
spans.get(2).text();
For more info about selecting elements using CSS see https://jsoup.org/cookbook/extracting-data/selector-syntax
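
For completeness, here is a small self-contained sketch of the same idea with a size check before indexing. The question does not show the real page, so the markup in SAMPLE, the class name, and the helper name are all invented for illustration:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

class InfoSpans {

    // Hypothetical markup; the real page structure is an assumption.
    static final String SAMPLE = "<div class=\"news\"><h2>Headline</h2>"
            + "<div class=\"info\"><span>Football</span><span>12.05.2017</span>"
            + "<span>18:30</span></div></div>";

    // Returns the text of every span inside div.info, in document order.
    static List<String> infoSpanTexts(String html) {
        Elements spans = Jsoup.parse(html).select("div.info span");
        List<String> texts = new ArrayList<>();
        for (Element span : spans) {
            texts.add(span.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        List<String> texts = infoSpanTexts(SAMPLE);
        // Guard against pages that have fewer spans than expected.
        if (texts.size() >= 3) {
            System.out.println("sport: " + texts.get(0));
            System.out.println("date:  " + texts.get(1));
            System.out.println("time:  " + texts.get(2));
        }
    }
}
```

Checking the size first avoids an IndexOutOfBoundsException from get(index) when a news item is missing one of the spans.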

Jsoup how to get values from html

So I'm trying to get specific information from this link: https://myanimelist.net/anime/31988/Hibike_Euphonium_2
I don't really understand html so this is a bit harder for me.
I'm specifically trying to get information from here:
<div>
<span class="dark_text">Studios:</span>
Kyoto Animation </div>
<div class="spaceit">
What I'm trying to do is search for when it says "Studios" and then get the title of the href link (Kyoto Animation).
So far I have managed to get this:
Document doc = Jsoup.connect("https://myanimelist.net/anime/31988/Hibike_Euphonium_2").get();
Elements studio = doc.select("a[href][title]");
for(Element link : studio){
System.out.println(link.attr("title"));
}
And it's outputting this:
Lantis
Pony Canyon
Rakuonsha
Ponycan USA
Kyoto Animation
Drama
Music
School
Kyoto Animation
Go to the Last Post
Go to the Last Post
Anime You Should Watch Before Their Sequels Air This Fall 2016 Season
Collection
Follow #myanimelist on Twitter

It should be
doc.select("span:contains(Studios) + a[href][title]");
if I assume that the span is the common element for the list header.
Basically, this selector finds every span element that contains the text Studios and then takes the a element that immediately follows it as a sibling, provided that a has both href and title attributes.
Note that this selector matches only one link: the one directly after the span.
A more universal selector could be
*:contains(Studio) > a[title]
which means: take every a element that has a title attribute and is a direct child of any (*) element that contains the text Studio. :contains also takes into account the text of all descendant elements. To match only an element's own text, use :containsOwn.
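
To see how the two selectors differ, here is a rough sketch. The markup is invented (modelled on the snippet in the question, with a hypothetical second studio link and made-up href values), and the class and method names are only for illustration:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class StudioSelectors {

    // Hypothetical markup; "Second Studio" and the href values are not from the real page.
    static final String SAMPLE =
            "<div><span class=\"dark_text\">Producers:</span> "
            + "<a href=\"/p/1\" title=\"Lantis\">Lantis</a>, "
            + "<a href=\"/p/2\" title=\"Pony Canyon\">Pony Canyon</a></div>"
            + "<div><span class=\"dark_text\">Studios:</span> "
            + "<a href=\"/s/1\" title=\"Kyoto Animation\">Kyoto Animation</a>, "
            + "<a href=\"/s/2\" title=\"Second Studio\">Second Studio</a></div>";

    // Collects the title attribute of every element matched by the CSS query.
    static List<String> titles(String html, String cssQuery) {
        List<String> titles = new ArrayList<>();
        for (Element link : Jsoup.parse(html).select(cssQuery)) {
            titles.add(link.attr("title"));
        }
        return titles;
    }

    public static void main(String[] args) {
        // Adjacent sibling: only the link right after the span.
        System.out.println(titles(SAMPLE, "span:contains(Studios) + a[href][title]"));
        // Direct child of any element containing the text: every link in that div.
        System.out.println(titles(SAMPLE, "*:contains(Studio) > a[title]"));
    }
}
```

With this sample, the first query should yield only Kyoto Animation, while the second yields both links in the Studios div and none from the Producers div.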

Not tested, but what about something like
...
Elements studio = doc.select("a[title='Kyoto Animation']");
...

Java Jsoup: Retrieve only the article

Trying to retrieve the text of the article. I want to select all of the text within
<p>... </p>
I was able to do that.
But I only want to retrieve the text from the article body, not the entire page
Document article = Jsoup.connect("html doc").get();
Elements paragraphs = article.select("p");
The code above gets the entire text from the page. I just want the text between
<article itemprop= "articleBody">...</article>
I'm sorry if this was hard to understand; I tried to formulate the question as best I could.

Elements#text() will return text-only content of all the combined paragraphs (see here for more details https://jsoup.org/apidocs/org/jsoup/select/Elements.html)

Try selecting on the itemprop attribute, then take the paragraphs inside that element:
for (Element paragraph : article.select("article[itemprop=articleBody] p"))
System.out.println(paragraph.text());
See CSS Selectors for more tips
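
A quick self-contained sketch of restricting the paragraphs to the article body. The sample page is invented for illustration, as are the class and method names:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class ArticleBody {

    // Hypothetical page: one <p> before the article, two inside it, one after it.
    static final String SAMPLE = "<html><body><p>Navigation teaser</p>"
            + "<article itemprop=\"articleBody\"><p>First paragraph.</p>"
            + "<p>Second paragraph.</p></article>"
            + "<p>Footer note</p></body></html>";

    // Returns the text of each <p> that sits inside <article itemprop="articleBody">.
    static List<String> articleParagraphs(String html) {
        Document page = Jsoup.parse(html);
        List<String> paragraphs = new ArrayList<>();
        for (Element paragraph : page.select("article[itemprop=articleBody] p")) {
            paragraphs.add(paragraph.text());
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        for (String paragraph : articleParagraphs(SAMPLE)) {
            System.out.println(paragraph);
        }
    }
}
```

The descendant combinator (the space before p) limits the match to paragraphs nested anywhere inside the article element, so the teaser and footer paragraphs are left out.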

Parsing data from HTML using Jsoup causing trouble with "span" element

I'm trying to parse text inside SPAN and it's causing me some trouble.
HTML code for what I'm trying to parse:
<span title="Geografija">GEO</span>
My selector syntax:
Elements eles = doc.select("table.ednevnik-seznam_ur_teden tbody tr:eq(2) span");
This is what I get:
<span title="Geografija">GEO</span>
It literally parses the HTML code, but I'm trying to only parse the text inside span element. In this case, I should get this:
GEO
What am I doing wrong here?

If you want the text of the element, get the element from your list (perhaps using Elements#first or Elements#get), then use Element#text to get the element's text.
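
As a rough illustration, using only the span from the question (the class name, helper name, and the simplified span[title] selector are assumptions for this sketch):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class SpanText {

    // Returns the text of the first span with a title attribute, or "" if there is none.
    static String firstSpanText(String html) {
        Elements spans = Jsoup.parse(html).select("span[title]");
        Element first = spans.first(); // null when nothing matched
        return first == null ? "" : first.text();
    }

    public static void main(String[] args) {
        String html = "<span title=\"Geografija\">GEO</span>";
        // Printing the Elements object itself would show the markup; text() returns the content.
        System.out.println(firstSpanText(html));
    }
}
```

Printing an Elements or Element object directly gives its HTML, which is why the question saw the whole span tag; calling text() on the element returns just GEO.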
