I connect to a URL with jsoup and fetch the whole page, but when I select like this,
doc.select("body")
it returns a single element. I want to get all the elements in the page and iterate over them one by one. For example, given this page,
<html>
<head><title>Test</title></head>
<body>
<p>Hello All</p>
Second Page
<div>Test</div>
</body>
</html>
If I select using body, I get the result on a single line, like this:
Test Hello All Second Page Test
Instead, I want to select all elements, iterate over them one by one, and produce results like this:
Test
Hello All
Second Page
Test
Is that possible using jsoup?
Thanks,
Karthik
You can select all elements using the * selector (here, everything within the body) and then get the text of each one individually using Element#ownText().
Elements elements = document.body().select("*");
for (Element element : elements) {
    System.out.println(element.ownText());
}
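Note that ownText() is printed per element, so the body's own text ("Second Page") comes out when the body itself is visited, before the text of its children. If the text is needed in document order, as in the question, one alternative is to walk the text nodes instead of the elements. The following is only a sketch of that different technique, not the method from the answer above; the class name TextInDocumentOrder and the helper texts are made up for illustration, and the HTML is the sample page from the question.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

// Hypothetical helper class, not part of jsoup.
class TextInDocumentOrder {

    // The sample page from the question.
    static final String SAMPLE =
        "<html>\n"
        + "<head><title>Test</title></head>\n"
        + "<body>\n"
        + "<p>Hello All</p>\n"
        + "Second Page\n"
        + "<div>Test</div>\n"
        + "</body>\n"
        + "</html>";

    // Collects every non-blank text node, in the order it appears in the page.
    static List<String> texts(String html) {
        Document doc = Jsoup.parse(html);
        List<String> result = new ArrayList<>();
        doc.traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                if (node instanceof TextNode) {
                    TextNode textNode = (TextNode) node;
                    // Skip the whitespace-only nodes between tags.
                    if (!textNode.isBlank()) {
                        result.add(textNode.text().trim());
                    }
                }
            }

            @Override
            public void tail(Node node, int depth) {
                // Nothing to do when leaving a node.
            }
        });
        return result;
    }

    public static void main(String[] args) {
        for (String text : texts(SAMPLE)) {
            System.out.println(text);
        }
    }
}
```

For the sample page this prints Test, Hello All, Second Page, Test, one per line.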
To get all of the elements within the body of the document using the jsoup library:
doc.body().children().select("*");
To get just the first level of elements in the document's body:
doc.body().children();
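To make the difference between the two calls concrete, here is a small sketch. The markup in SAMPLE and the names ChildrenVersusDescendants, directChildren and allBelowBody are invented for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Hypothetical helper class, not part of jsoup.
class ChildrenVersusDescendants {

    // Made-up markup: the <p> is nested one level below the body's children.
    static final String SAMPLE =
        "<html><body><div><p>a</p></div><span>b</span></body></html>";

    static List<String> tagNames(Elements elements) {
        List<String> names = new ArrayList<>();
        for (Element element : elements) {
            names.add(element.tagName());
        }
        return names;
    }

    // Only the first level below <body>.
    static List<String> directChildren(String html) {
        return tagNames(Jsoup.parse(html).body().children());
    }

    // The body's children plus everything nested inside them.
    static List<String> allBelowBody(String html) {
        return tagNames(Jsoup.parse(html).body().children().select("*"));
    }

    public static void main(String[] args) {
        System.out.println(directChildren(SAMPLE)); // [div, span]
        System.out.println(allBelowBody(SAMPLE));   // [div, p, span]
    }
}
```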
You can use XPath, or any library that supports XPath.
The expression is //text()
Test the expression with your XML here
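As a rough sketch of the XPath route, the JDK's built-in javax.xml.xpath API can evaluate //text(). This assumes the page is well-formed XML, which the sample from the question happens to be but real-world HTML often is not. The class name XPathTextNodes and the helper texts are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class using only the JDK's XML APIs.
class XPathTextNodes {

    // The sample page from the question; it is also well-formed XML.
    static final String SAMPLE =
        "<html>\n"
        + "<head><title>Test</title></head>\n"
        + "<body>\n"
        + "<p>Hello All</p>\n"
        + "Second Page\n"
        + "<div>Test</div>\n"
        + "</body>\n"
        + "</html>";

    // Returns every non-blank text node matched by //text(), trimmed.
    static List<String> texts(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate("//text()", doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                String text = nodes.item(i).getNodeValue().trim();
                // Drop the whitespace-only nodes between tags.
                if (!text.isEmpty()) {
                    result.add(text);
                }
            }
            return result;
        } catch (Exception e) {
            // Input that is not well-formed XML ends up here.
            throw new IllegalArgumentException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        for (String text : texts(SAMPLE)) {
            System.out.println(text);
        }
    }
}
```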
Related
I have the following HTML:
<data-my-tag>
<data-another-tag>
<p>...</p>
<data-my-tag>
<span>...</span>
</data-my-tag>
</data-another-tag>
</data-my-tag>
I use JSOUP to parse it and I would like to match all elements whose tag name starts with <data-.
I only found getElementsByTag, which matches on the entire tag name. The select method only accepts CSS selectors, and there seems to be no way to match data-* with JSOUP (e.g. using XPath). Is there any way to match these tags via JSOUP?
Unfortunately, it is not possible to use XPath queries in JSOUP. The only way I found is the following:
Document doc = Jsoup.parse(content);
Elements elements = doc.select("*");
elements.stream().filter(e -> e.nodeName().startsWith("data-")).forEach(e -> {
    // do what you need with the node
});
I am using Jsoup and was wondering how to get nested tags. I can get the section tag, but I am not sure how to get the div tag inside it, as I have a list of elements. How do I fetch a div tag inside a section tag?
This should work:
Elements elements = doc.select("section.page-content-full div.content");
Just use the query selector syntax:
Elements elems = doc.select("section.main-page-content-full>div.content");
If you want just the first element, use the following (note that first() returns a single Element, not Elements):
Element elem = doc.select("section.main-page-content-full>div.content").first();
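The two selector styles used in the answers above differ: a space matches a div at any depth inside the section, while > matches only direct children. A minimal sketch, assuming made-up markup in SAMPLE and invented names (NestedDivSelectors and its helpers):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, not part of jsoup.
class NestedDivSelectors {

    // Made-up markup: one div.content directly under the section, one nested deeper.
    static final String SAMPLE =
        "<section class=\"main-page-content-full\">"
        + "<div class=\"content\">direct child</div>"
        + "<article><div class=\"content\">nested deeper</div></article>"
        + "</section>";

    // Descendant combinator (space): matches at any depth.
    static int descendantCount(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("section.main-page-content-full div.content").size();
    }

    // Child combinator (>): matches direct children only.
    static int directChildCount(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("section.main-page-content-full > div.content").size();
    }

    // first() returns a single Element, or null when nothing matched.
    static String firstDirectChildText(String html) {
        Document doc = Jsoup.parse(html);
        Element first = doc.select("section.main-page-content-full > div.content").first();
        return first == null ? null : first.text();
    }

    public static void main(String[] args) {
        System.out.println(descendantCount(SAMPLE));      // 2
        System.out.println(directChildCount(SAMPLE));     // 1
        System.out.println(firstDirectChildText(SAMPLE)); // direct child
    }
}
```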
I would like to select an element that matches a specific string:
<img src='http://iblink.ch/resized/sjg63ngi3h3g4a.jpg' alt='tree'>
Since I don't have a specific class or div to target, I tried the getElementsContainingOwnText("resized") method to get this element, but it does not find it.
I also tried getElementsContainingText, with the same result.
Does anyone have any idea?
The text is the part between the opening and closing tags, not anything inside the tags themselves: <tag attribute="value">Text</tag>
So you want to select elements by attribute value, like this:
Elements els = doc.select("img[src*=resized]");
Have a look at the CSS selectors as they are implemented in Jsoup.
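A small sketch of the difference, using the img tag from the question plus a second, made-up image for contrast. The names ImageBySrcSubstring, altsOfResized and ownTextMatches are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Hypothetical helper class, not part of jsoup.
class ImageBySrcSubstring {

    // First image is from the question; the second one is made up.
    static final String SAMPLE =
        "<img src='http://iblink.ch/resized/sjg63ngi3h3g4a.jpg' alt='tree'>"
        + "<img src='http://example.com/original/photo.jpg' alt='other'>";

    // [src*=resized] matches images whose src attribute contains "resized".
    static List<String> altsOfResized(String html) {
        Elements images = Jsoup.parse(html).select("img[src*=resized]");
        List<String> alts = new ArrayList<>();
        for (Element image : images) {
            alts.add(image.attr("alt"));
        }
        return alts;
    }

    // Text search looks at element text only, so attribute values never match.
    static int ownTextMatches(String html) {
        return Jsoup.parse(html).getElementsContainingOwnText("resized").size();
    }

    public static void main(String[] args) {
        System.out.println(altsOfResized(SAMPLE));  // [tree]
        System.out.println(ownTextMatches(SAMPLE)); // 0
    }
}
```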
I'm trying to extract the text inside a span element and it's causing me some trouble.
HTML code for what I'm trying to parse:
<span title="Geografija">GEO</span>
My selector syntax:
Elements eles = doc.select("table.ednevnik-seznam_ur_teden tbody tr:eq(2) span");
This is what I get:
<span title="Geografija">GEO</span>
It returns the HTML code itself, but I only want the text inside the span element. In this case, I should get this:
GEO
What am I doing wrong here?
If you want the text of the element, get the element from your list (perhaps using Elements#first or Elements#get), then use Element#text to get the element's text.
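A minimal sketch of that suggestion. Since the surrounding table from the question is not shown, a simplified selector (span[title]) is assumed here, and the names SpanText and firstSpanText are made up.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Hypothetical helper class, not part of jsoup.
class SpanText {

    // The span from the question, without its surrounding table.
    static final String SAMPLE = "<span title=\"Geografija\">GEO</span>";

    // Returns the text of the first matching span, or null if none matched.
    static String firstSpanText(String html) {
        Elements spans = Jsoup.parse(html).select("span[title]");
        Element first = spans.first();
        return first == null ? null : first.text();
    }

    public static void main(String[] args) {
        // Printing the element itself would give the markup;
        // text() gives only what is between the tags.
        System.out.println(firstSpanText(SAMPLE)); // GEO
    }
}
```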
I'm using JSOUP and trying to get the div elements whose id starts with a particular string. For example:
<div id="test123">
I need to check whether the id starts with the string "test" and get all such elements.
I looked at http://jsoup.org/cookbook/extracting-data/selector-syntax and tried multiple variations of:
doc.select("div:matches(test(*))");
But it still didn't work. Any help would be much appreciated.
Use the attribute-starts-with selector [attr^=value].
Elements elements = doc.select("div[id^=test]");
// ...
This will return all <div> elements with an id attribute starting with test.
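For illustration, a short sketch with made-up markup: two divs whose id starts with test, one div that does not, and a span that is excluded because the selector is restricted to div. The names IdPrefixSelector and matchingIds are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Hypothetical helper class, not part of jsoup.
class IdPrefixSelector {

    // Made-up markup for the example.
    static final String SAMPLE =
        "<div id=\"test123\">a</div>"
        + "<div id=\"testing\">b</div>"
        + "<div id=\"other\">c</div>"
        + "<span id=\"test9\">d</span>";

    // Collects the ids of all <div> elements whose id starts with "test".
    static List<String> matchingIds(String html) {
        Elements elements = Jsoup.parse(html).select("div[id^=test]");
        List<String> ids = new ArrayList<>();
        for (Element element : elements) {
            ids.add(element.id());
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(matchingIds(SAMPLE)); // [test123, testing]
    }
}
```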