Parsing data from HTML using Jsoup causing trouble with "span" element - java

I'm trying to parse text inside SPAN and it's causing me some trouble.
HTML code for what I'm trying to parse:
<span title="Geografija">GEO</span>
My selector syntax:
Elements eles = doc.select("table.ednevnik-seznam_ur_teden tbody tr:eq(2) span");
This is what I get:
<span title="Geografija">GEO</span>
The result is the whole element, markup included, but I only want the text inside the span element. In this case, I should get this:
GEO
What am I doing wrong here?

If you want the text of the element, get the element from your list (perhaps using Elements#first or Elements#get), then use Element#text to get the element's text.
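As a minimal sketch of that advice, assuming jsoup is on the classpath: the class and method names below are hypothetical, and the selector is simplified to a bare span so the snippet runs against the sample markup on its own.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SpanText {
    // Returns the text of the first <span>, or an empty string if there is none.
    static String firstSpanText(String html) {
        Document doc = Jsoup.parse(html);
        Elements spans = doc.select("span");   // simplified selector (assumption)
        Element first = spans.first();         // null when nothing matched
        return first == null ? "" : first.text();
    }

    public static void main(String[] args) {
        // prints "GEO"
        System.out.println(firstSpanText("<span title=\"Geografija\">GEO</span>"));
    }
}
```

Printing the Elements object itself shows the outer HTML of the matches, which is why the markup appeared in the output; text() returns only the text content.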

Related

Jsoup not getting text of few elements when parsing all the elements in a page

I need to find key-value pairs in a web page with a known set of keys. To do this, I am parsing all the elements in the web page using Jsoup in Java, but I am unable to retrieve the text of a few elements during the iteration.
I am using the code below to select all elements and iterate over them with a forEach loop.
Elements elements = document.body().select("*");
Sample HTML:
<div id="requisitionDescriptionInterface.ID1622.row1" class="contentlinepanel" title=""><h2 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:htm="http://www.w3.org/1999/xhtml" xmlns:ftl="http://www.taleo.net/ftl" class="no-change-header-inline"><span id="requisitionDescriptionInterface.ID1638.row1" class="subtitle" style="display: inline;" title="">Primary Location</span></h2><span class="inline"> </span><span id="requisitionDescriptionInterface.ID1659.row1" class="text" title="">India-Noida</span></div>
Both spans (with the texts 'Primary Location' and 'India-Noida') are being selected. I can verify this by printing their IDs using element.id().
I can also print the text 'Primary Location', but the text 'India-Noida' is not returned. I am using the method Element.ownText() to get the text.
Can someone tell me what I am doing wrong?

Java - jsoup get element with specific string

I would like to select an element that matches a specific String:
<img src='http://iblink.ch/resized/sjg63ngi3h3g4a.jpg' alt='tree'>
Since I don't have a specific class or div to target, I tried the getElementsContainingOwnText("resized") method to get this element, but it does not find it.
I also tried getElementsContainingText, with the same result.
Does anyone have an idea?
The text is the part between the opening and closing tags, not the attributes inside a tag: <tag attribute="value">Text</tag>
The string you are looking for is in the src attribute, so you want to select elements by attribute value, like this:
Elements els = doc.select("img[src*=resized]");
Have a look into CSS selectors as they are implemented in Jsoup.
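A minimal sketch of the attribute-substring selector in use, assuming jsoup is on the classpath; the class and method names are hypothetical.

```java
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ImgBySrc {
    // Collects the src value of every <img> whose src contains "resized".
    static List<String> resizedSources(String html) {
        Document doc = Jsoup.parse(html);
        // [attr*=value] matches when the attribute value contains the substring
        return doc.select("img[src*=resized]").eachAttr("src");
    }

    public static void main(String[] args) {
        String html = "<img src='http://iblink.ch/resized/sjg63ngi3h3g4a.jpg' alt='tree'>";
        System.out.println(resizedSources(html));
    }
}
```

The getElementsContainingText family only looks at text nodes, so it never sees attribute values such as src.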

Parsing XML with Jsoup

I get the following XML which represents a news article:
<content>
Some text blalalala
<h2>Small subtitle</h2>
Some more text blbla
<ul class="list">
<li>List item 1</li>
<li>List item 2</li>
</ul>
<br />
Even more freakin text
</content>
I know the format isn't ideal but for now I have to take it.
The Article should look like:
Some text blalalala
Small subtitle
List with items
Even more freakin text
I parse this XML with Jsoup. I can get the text within the <content> tag with doc.ownText(), but then I have no idea where the other parts (such as the subtitle) belong, because I get only one big String.
Would it be better to use an event-based parser for this (I hate them :(), or is it possible to do something like doc.getTextUntilTagAppears("tagName")?
Edit: For clarification, I know how to get the elements under <content>. My problem is getting the text within <content>, split every time it is interrupted by an element.
I learned that I can get all the text within <content> with .textNodes(), which works great, but I still don't know where each text node belongs in my article (one at the top before the h2, another at the bottom).
Jsoup has a fantastic selector-based syntax.
If you want the subtitle, first parse the document:
Document doc = Jsoup.parse(xml, "", Parser.xmlParser()); // xml holds the XML as a String; Jsoup.parse(String) expects markup, not a path
You know that the subtitle is in the h2 element:
Element subtitle = doc.select("h2").first(); // first h2 element that appears
And if you would like to have the list:
Elements listItems = doc.select("ul.list > li");
for (Element item : listItems) {
    System.out.println(item.text()); // print the list items one after another
}
The mistake I made was going through the XML by Elements, which do not include TextNodes. When I go through it Node by Node, I can check whether each Node is an Element or a TextNode and treat it accordingly.
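The node-by-node walk described above could look roughly like this. It is a sketch, assuming jsoup is on the classpath; the class name, method name, and the "text:" / tag-name prefixes are hypothetical and only there to show which kind of node produced each piece.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.parser.Parser;

public class ContentWalker {
    // Walks the direct children of <content> in document order, so loose text
    // and child elements come out in the order they appear in the article.
    static List<String> blocks(String xml) {
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        Element content = doc.selectFirst("content");
        List<String> out = new ArrayList<>();
        if (content == null) {
            return out;
        }
        for (Node node : content.childNodes()) {
            if (node instanceof TextNode) {
                // Loose text directly inside <content>
                String text = ((TextNode) node).text().trim();
                if (!text.isEmpty()) {
                    out.add("text: " + text);
                }
            } else if (node instanceof Element) {
                // Child element such as <h2> or <ul>; elements without text (<br />) are skipped
                Element el = (Element) node;
                if (!el.text().isEmpty()) {
                    out.add(el.tagName() + ": " + el.text());
                }
            }
        }
        return out;
    }
}
```

Because childNodes() preserves document order, the text before the h2 and the text after the list stay in their original positions.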

Jsoup select and iterate all elements

I connect to a URL through jsoup and get all of its contents, but if I select like this,
doc.select("body")
it returns a single element. I want to get all the elements in the page and iterate over them one by one. For example,
<html>
<head><title>Test</title></head>
<body>
<p>Hello All</p>
Second Page
<div>Test</div>
</body>
</html>
If I select using body, I get the result in a single line, like this:
Test Hello All Second Page Test
Instead, I want to select all elements, iterate over them one by one, and produce results like this:
Test
Hello All
Second Page
Test
Will that be possible using jsoup?
Thanks,
Karthik
You can select all elements of the document using the * selector and then get the text of each one individually using Element#ownText().
Elements elements = document.body().select("*");
for (Element element : elements) {
    System.out.println(element.ownText());
}
To get all of the elements within the body of the document using the jsoup library:
doc.body().children().select("*");
To get just the first level of elements in the document's body:
doc.body().children();
You can use XPath, or any library that supports XPath.
The expression is //text()
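As a sketch of the //text() approach using only the JDK's built-in XPath support: this assumes the page is well-formed XML, which the sample above happens to be but real-world HTML often is not. The class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathTextNodes {
    // Returns every non-blank text node of a well-formed document, in document order.
    static List<String> textNodes(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate("//text()", doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                String text = nodes.item(i).getNodeValue().trim();
                if (!text.isEmpty()) {   // drop whitespace-only nodes between tags
                    result.add(text);
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html>\n<head><title>Test</title></head>\n<body>\n"
                + "<p>Hello All</p>\nSecond Page\n<div>Test</div>\n</body>\n</html>";
        // prints Test, Hello All, Second Page, Test on separate lines
        textNodes(page).forEach(System.out::println);
    }
}
```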

Could the value of an html anchor tag be fetched using xpath?

If I have HTML that looks like:
<td class="blah">&nbs;???? </td>
Could I get the ???? value using xpath?
What would it look like?
To use XPath you usually need XML, not HTML, but some parsers (e.g. the one built into PHP) have a relaxed mode that will parse most HTML, too.
If you want to find all <a> that are direct children of <td class="blah"> the XPath you need is
//td[@class = 'blah']/a
or
//td[@class = 'blah']/a[@href = 'http://...']
(depending on whether you only want the one url or all urls)
This will give you a Set of Nodes. You'll need to iterate through it and then check for the nodeType of the firstChild (supposed to be a text node) and the number of child nodes (supposed to be 1). Then the firstChild will contain the ????
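A minimal sketch of evaluating that expression with the JDK's XPath API. The markup in main is a hypothetical well-formed fragment (the sample in the question does not show the anchor itself), and the class and method names are illustrative only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class AnchorText {
    // Returns the text of every <a> that is a direct child of <td class="blah">.
    static List<String> anchorTexts(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList anchors = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate("//td[@class = 'blah']/a", doc, XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < anchors.getLength(); i++) {
                texts.add(anchors.item(i).getTextContent());
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalStateException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical input: the href and link text are made up for illustration
        String xml = "<table><tr><td class=\"blah\">"
                + "<a href=\"http://example.com/\">link text</a></td></tr></table>";
        System.out.println(anchorTexts(xml)); // prints [link text]
    }
}
```

Here getTextContent() stands in for the manual firstChild check: it returns the concatenated text of the node, which is the same value when the anchor has a single text child.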
Why would you use an XML parser to parse HTML?
I would suggest using a dedicated Java HTML parser. There are many, but I haven't tried any myself.
As for whether it would work: I suspect it will not. You will get an error when trying to parse it as XML, right at &nbs; if not earlier.
