<span id="result_box" class="short_text" lang="es">
<span class="hps">
hello
</span>
<span class="hps">
world
</span>
</span>
I want to get the string "hello world" using Jsoup, but I have no idea how to do this.
Use Jsoup.parse to get the HTML Document, then select the elements you want with a CSS selector such as span.hps (see http://jsoup.org/apidocs/org/jsoup/select/Selector.html):
Document doc = Jsoup.parse("<span id=\"result_box\" class=\"short_text\" lang=\"es\">\n" +
" <span class=\"hps\">\n" +
" hello\n" +
" </span>\n" +
" <span class=\"hps\">\n" +
" world\n" +
" </span>\n" +
"</span>");
System.out.println(doc.html());
Elements els = doc.select("span.hps");
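for (Element e : els) {
    System.out.print(e.text() + " "); // add a space so the output reads "hello world"
}
If you don't need each element's value separately, you can replace the for loop with a single call that returns the combined text of all matched elements: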
els.text()
Related
I want to extract content matching the XPath .//*[contains(@class, 'post-content')]. However, I wish to exclude child nodes:
1) Containing text: P3 or AP
2) Div containing id = bottom
3) Form containing label with text: Get email updates
I have the following HTML:
<div class="td-post-content">
<p>P1</p>
<p>P2</p>
<p>P3</p>
<p>P4</p>
<p>P5</p>
<p>AP</p>
<div id="td-a-rec bottom"> </div>
<form action="https://example.com/subscribe method=" post " id="subscribe-form " name="subscribe-form " class="validate " target="_blank " novalidate=" ">
<div id="signup_scroll ">
<label for="mce-EMAIL ">Get email updates from..</label>
<input type="email " value=" " name="EMAIL " class="email " id="EMAIL " placeholder="email address " required=" ">
<div style="position: absolute; left: -5000px; " aria-hidden="true "><input type="text " name="b_11 " tabindex="-1 " value=" "></div>
<div class="clear "><input type="submit " value="Subscribe " name="subscribe " id="-subscribe " class="button "></div>
</div>
</form>
</div>
I am able to achieve this by using the XPath syntax: [not(contains(@id,'bottom'))] + [not(contains(text(),'P3'))] + [not(contains(text(),'AP'))] etc. However, the main issue is that instead of matching all desired child elements as a single element, it now matches each element separately and returns a list of WebElements.
Right now the only way to extract the desired text is by iterating through the web element list and concatenating the results into a single String.
Is it possible to scrape all the desired content in one shot (with a single call to element.getText()) without iterating through the element list?
Thanks
From your description, it looks like all you want is the text from the P tags with a couple of exclusions. The CSS selector div.td-post-content > p will get you all the P tags including the ones you want to exclude. You can gather those into a list and then remove the text you want to exclude to give you the final list.
List<WebElement> ps = driver.findElements(By.cssSelector("div.td-post-content > p"));
List<String> text = ps.stream().map(e -> e.getText()).collect(Collectors.toList());
text.remove("AP");
text.remove("P3");
System.out.println(text);
Running this prints
[P1, P2, P4, P5]
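Since the question asks for a single String rather than a list, the filtering and joining can also be done in one stream pipeline. The following is a minimal sketch, not a tested Selenium snippet: the class name ParagraphTextJoiner and the method joinExcluding are hypothetical, the input list stands in for the values that getText() would return for each P tag, and exclusion is by exact match, as in the remove() calls above.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Selenium.
class ParagraphTextJoiner {

    // Joins the given texts with newlines, skipping any text that
    // exactly equals one of the entries in `excluded`.
    static String joinExcluding(List<String> texts, Set<String> excluded) {
        return texts.stream()
                .filter(t -> !excluded.contains(t))
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        // Stand-in for the getText() value of each <p> element (assumption).
        List<String> texts = List.of("P1", "P2", "P3", "P4", "P5", "AP");
        System.out.println(joinExcluding(texts, Set.of("P3", "AP")));
    }
}
```

With Selenium, the same filter and joining collector could be appended to the ps.stream().map(e -> e.getText()) pipeline shown above. The elements are still visited one by one, but the result is a single String.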
I have a website that I want to extract some data from. I want to extract the 8a on the second line (a-element) with JSoup. I cannot use regex because sometimes 8a is just 2 or 7c+, and these same values can appear in the text between the a tags as well. Any ideas?
<div class="vsr">
L'Américain (intégral) 8a
<span class="ag">7c+</span>
<em>Tony Fouchereau</em>
<span class="btype">traversée d-g, surplomb, départ assis</span>
<span class="glyphicon glyphicon-camera" aria-hidden="true"></span>
<span class="glyphicon glyphicon-film" aria-hidden="true"></span>
</div>
You can use Jsoup css selectors to extract specific information.
https://jsoup.org/cookbook/extracting-data/selector-syntax
@Test
public void extract8a() {
Document doc = Jsoup.parse("<div class=\"vsr\"> \n" +
" L'Américain (intégral) 8a \n" +
" <span class=\"ag\">7c+</span> \n" +
" <em>Tony Fouchereau</em> \n" +
" <span class=\"btype\">traversée d-g, surplomb, départ assis</span> \n" +
" <span class=\"glyphicon glyphicon-camera\" aria-hidden=\"true\"></span> \n" +
" <span class=\"glyphicon glyphicon-film\" aria-hidden=\"true\"></span> \n" +
"</div>");
System.out.println(doc.select("div.vsr").first().ownText());
}
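For the HTML exactly as shown, ownText() returns the div's directly owned text, which here still includes the name in front of the grade. If the grade is always the last whitespace-separated token of that text (an assumption about the page, not something the markup guarantees), it can be cut off with plain string handling and no regex. The class GradeExtractor and the method lastToken below are hypothetical names used only for this sketch.

```java
// Hypothetical helper, not part of Jsoup.
class GradeExtractor {

    // Returns the last whitespace-separated token of `ownText`,
    // or an empty string if the text is empty or blank.
    static String lastToken(String ownText) {
        String trimmed = ownText.trim();
        int start = trimmed.length();
        // Walk backwards until the preceding character is whitespace.
        while (start > 0 && !Character.isWhitespace(trimmed.charAt(start - 1))) {
            start--;
        }
        return trimmed.substring(start);
    }

    public static void main(String[] args) {
        // Stand-in for doc.select("div.vsr").first().ownText() (assumption);
        // \u00e9 is the accented "e" from the sample HTML.
        System.out.println(lastToken("L'Am\u00e9ricain (int\u00e9gral) 8a"));
    }
}
```

If the name is wrapped in its own element on the real page, ownText() would no longer contain it and this extra step would not be needed.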
For example, a website has code like this:
<div>
<div>
first
</div>
<div>
second
</div>
<div>
third
</div>
</div>
I want to get the text of the "second" div with Jsoup, but the div has no attribute or class.
There are a few ways to do it.
For instance, we could use the select method, which returns an Elements object containing all matching elements. Since Elements extends ArrayList<Element>, it inherits all of ArrayList's public methods. This means we can use the get(index) method to pick a specific match (indexing starts at 0):
String html =
"<div>\n" +
" <div>\n" +
" first\n" +
" </div>\n" +
" <div>\n" +
" second\n" +
" </div>\n" +
" <div>\n" +
" third\n" +
" </div>\n" +
"</div>";
Document doc = Jsoup.parse(html);
Elements select = doc.select("div > div");
System.out.println(select.get(1));
Output:
<div>
second
</div>
Another way is to use :eq(n) in the CSS selector (from the official Jsoup tutorial):
:eq(n): find elements whose sibling index is equal to n; e.g. form input:eq(1)
For example:
System.out.println(doc.select("div > div:eq(1)"));
I get problems when I parse a specific div's class.
<div class="box_3 box_3a">
<div class="title_new_2"></div>
<div class="list_indeks_2"></div>
</div>
I have tried to select <div class="list_indeks_2"></div> with jsoup as follows:
links = doc.select(".list_indeks_2")
However, this code didn't work because the div's class contains underscores (_). How does one handle an underscore (_) in the jsoup select method?
Try to access the element based on the attribute.
The snippet was tested with JSoup version 1.8.1.
Document doc = Jsoup.parse(
"<div class=\"box_3 box_3a\">\n"
+ " <div class=\"title_new_2\">some title</div>\n"
+ " <div class=\"list_indeks_2\">some index</div>\n"
+ "</div>");
Elements rows = doc.getElementsByAttributeValue("class", "list_indeks_2");
System.out.println("rows.size() = " + rows.size());
String index = "";
for (Element span : rows) {
index = span.text();
}
System.out.println("index = " + index);
This produces the following output:
rows.size() = 1
index = some index
I have the HTML snippet below. There are multiple divs with the class "teaser-img" throughout the document. I want to grab the "img src" from all of these "teaser-img" divs.
<div class="teaser-img">
<a href="/julien/blog/failure-consciousness-vs-success-consciousness-shifting-focus-become-badass-or-loser">
<img src="http://www.rsdnation.com/files/imagecache/blog_thumbnail/files/blog_thumbs/rsdnatonaustin.jpg" alt="" title=""/>
</a>
</div>
I have tried many things so I wouldn't know what code to share with you guys. Your help will be much appreciated.
final String html = "<div class=\"teaser-img\">\n"
+ " <a href=\"/julien/blog/failure-consciousness-vs-success-consciousness-shifting-focus-become-badass-or-loser\">\n"
+ " <img src=\"http://www.rsdnation.com/files/imagecache/blog_thumbnail/files/blog_thumbs/rsdnatonaustin.jpg\" alt=\"\" title=\"\"/>\n"
+ " </a>\n"
+ "</div>";
// Parse the HTML from a string, or fetch it from a website with Jsoup.connect()
Document doc = Jsoup.parseBodyFragment(html);
for (Element element : doc.select("div.teaser-img img[src]")) {
    // Prints the whole <img> element; use element.attr("src") to get only the URL
    System.out.println(element);
}
Output:
<img src="http://www.rsdnation.com/files/imagecache/blog_thumbnail/files/blog_thumbs/rsdnatonaustin.jpg" alt="" title="">
See the Jsoup selector syntax documentation (https://jsoup.org/cookbook/extracting-data/selector-syntax) for more about selectors.