HTML
<div id='one'>
<button id='two'>I am a button</button>
<button id='three'>I am a button</button>
I am a div
</div>
Code
driver.findElement(By.id('one')).getText();
I've seen this question come up a few times over the last year or so and have wanted to try writing this function, so here you go. It takes the parent element's text and removes each child's text from it until only the text node content remains. I've tested this on your HTML and it works.
/**
* Takes a parent element and strips out the textContent of all child elements and returns textNode content only
*
* @param e
* the parent element
* @return the text from the child textNodes
*/
public static String getTextNode(WebElement e)
{
String text = e.getText().trim();
List<WebElement> children = e.findElements(By.xpath("./*"));
for (WebElement child : children)
{
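// replaceFirst takes a regex, so quote the child text (needs import java.util.regex.Pattern)
text = text.replaceFirst(Pattern.quote(child.getText()), "").trim();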
}
return text;
}
and you call it like this:
System.out.println(getTextNode(driver.findElement(By.id("one"))));
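Note that String.replaceFirst treats its first argument as a regular expression, so the child text has to be matched literally (for example via Pattern.quote), otherwise text containing characters such as ( or . may not be removed. Here is a minimal sketch of that stripping logic on plain strings, with no WebDriver involved. The class and method names (ChildTextStripper, stripChildText) are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class ChildTextStripper {

    // Removes the first occurrence of each child's text from the parent's text.
    // Pattern.quote makes replaceFirst treat the child text as a literal string.
    static String stripChildText(String parentText, List<String> childTexts) {
        String text = parentText.trim();
        for (String childText : childTexts) {
            text = text.replaceFirst(Pattern.quote(childText), "").trim();
        }
        return text;
    }

    public static void main(String[] args) {
        // Same shape as the HTML in the question: two buttons, then the div's own text.
        String parent = "I am a button I am a button I am a div";
        List<String> children = Arrays.asList("I am a button", "I am a button");
        System.out.println(stripChildText(parent, children)); // prints "I am a div"

        // Child text containing regex metacharacters is still removed.
        System.out.println(stripChildText("Total (USD) 42", Arrays.asList("Total (USD)"))); // prints "42"
    }
}
```

Without the quoting, the second call would leave the text unchanged, because the unquoted pattern "Total (USD)" only matches the string "Total USD".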
Warning: the initial solution (further below) won't work. I opened an enhancement request, 2840, against Selenium WebDriver and another one against the W3C WebDriver specification; the more votes they get, the sooner they'll receive enough attention (one can hope). Until then, the solution suggested by @shivansh in the other answer (executing JavaScript via Selenium) remains the only alternative. Here's the Java adaptation of that solution (it collects all text nodes, discards those that are whitespace-only, and separates the rest with \t):
WebElement e=driver.findElement(By.xpath("//*[@id='one']"));
if(driver instanceof JavascriptExecutor) {
String jswalker=
"var tw = document.createTreeWalker("
+ "arguments[0],"
+ "NodeFilter.SHOW_TEXT,"
+ "{ acceptNode: function(node) { return NodeFilter.FILTER_ACCEPT;} },"
+ "false"
+ ");"
+ "var ret=null;"
+ "while(tw.nextNode()){"
+ "var t=tw.currentNode.wholeText.trim();"
+ "if(t.length>0){" // skip over all-white text values
+ "ret=(ret ? ret+'\t'+t : t);" // if many, tab-separate them
+ "}"
+ "}"
+ "return ret;" // will return null if no non-empty text nodes are found
;
Object val=((JavascriptExecutor) driver).executeScript(jswalker, e);
// ---- Pass the context node here ------------------------------^
String textNodesTabSeparated=(null!=val ? val.toString() : null);
// ----^ --- this is the result you want
}
References:
TreeWalker - supported by all browsers
Selenium Javascript Executor
Initial suggested solution - not working - see enhancement request: 2840
driver.findElement(By.id('one')).find(By.XPath("./text()")).getText();
In a single search
driver.findElement(By.XPath("//*[@id='one']/text()")).getText();
See the child::text() selector in the XPath spec, under Location Paths.
I use a function like below:
private static final String ALL_DIRECT_TEXT_CONTENT =
"var element = arguments[0], text = '';\n" +
"for (var i = 0; i < element.childNodes.length; ++i) {\n" +
" var node = element.childNodes[i];\n" +
" if (node.nodeType == Node.TEXT_NODE" +
" && node.textContent.trim() != '')\n" +
" text += node.textContent.trim();\n" +
"}\n" +
"return text;";
public String getText(WebDriver driver, WebElement element) {
return (String) ((JavascriptExecutor) driver).executeScript(ALL_DIRECT_TEXT_CONTENT, element);
}
var outerElement = driver.FindElement(By.XPath("a"));
var outerElementTextWithNoSubText = outerElement.Text.Replace(outerElement.FindElement(By.XPath("./*")).Text, "");
Similar solution to the ones given, but instead of JavaScript or setting text to "", I remove elements in the XML and then get the text.
Problem:
Need text from 'root element without children' where children can be x levels deep and the text in the root can be the same as the text in other elements.
The solution treats the web element as XML and replaces the children with empty nodes so that only the root remains.
The result is then parsed. In my cases this seems to be working.
I only verified this code in an environment with Groovy. I have no idea whether it will work in Java without modifications. Essentially, you need to replace the Groovy XML libraries with Java libraries, and that should be it.
As for the code itself, I have two parameters:
WebElement el
boolean strict
When strict is true, only the root itself is taken into account. If strict is false, whitelisted markup tags are left in place. The whitelist includes p, b, i, strong, em, mark, small, del, ins, sub, sup.
The logic is:
Manage whitelisted tags
Get element as string (XML)
Parse to an XML object
Set all child nodes to void
Parse and get text
Up until now this seems to be working out.
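For readers who want a Java starting point, the steps above can be sketched with the JDK's built-in DOM parser. This is not the linked Groovy code, just an approximation under some assumptions: the element's markup is passed in as a string that is already well-formed XML (real-world HTML often is not), only direct children of the root are inspected, and whitelisted tags are kept together with whatever they contain. The names RootTextExtractor and rootText are hypothetical.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class RootTextExtractor {

    // Markup tags that are kept when strict is false.
    static final Set<String> WHITELIST = new HashSet<>(Arrays.asList(
            "p", "b", "i", "strong", "em", "mark", "small", "del", "ins", "sub", "sup"));

    // Parses the markup as XML, drops child elements of the root, and returns the remaining text.
    static String rootText(String xml, boolean strict) {
        Element root;
        try {
            root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception ex) {
            throw new IllegalArgumentException("Markup is not well-formed XML", ex);
        }
        NodeList children = root.getChildNodes();
        // Walk backwards: the NodeList is live, so removals must not shift unvisited indices.
        for (int i = children.getLength() - 1; i >= 0; i--) {
            Node child = children.item(i);
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue; // text nodes of the root are what we want to keep
            }
            boolean keep = !strict && WHITELIST.contains(child.getNodeName().toLowerCase());
            if (!keep) {
                root.removeChild(child);
            }
        }
        // Collapse runs of whitespace left behind by removed children.
        return root.getTextContent().trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String xml = "<div id='one'><button id='two'>I am a button</button>"
                + "<button id='three'>I am a button</button>I am a div</div>";
        System.out.println(rootText(xml, true)); // prints "I am a div"
    }
}
```

How the markup string is obtained from the WebElement is left out of the sketch on purpose, since that step depends on the page and driver in use.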
You can find the code here: GitHub Code
I am trying to scrape prices from a website with jSoup, but I only get an empty string.
I've tested my code with jSoup Online and I expect <meta itemprop="price" content="6,99"> to be printed when I use the following code:
Document doc = Jsoup.connect(URL).get();
Elements meta = doc.select("meta[itemprop=price]");
System.out.println("meta: " + meta.text());
price = meta.attr("content");
However, I just get an empty string and no error. What am I doing wrong here?
For those interested, I am trying to scrape the price from this page.
Try this:
Document doc = Jsoup.connect(URL).get();
Element meta = doc.select("meta[itemprop=price]").first();
System.out.println("meta: " + meta.text());
String price = meta.attr("content");
The web server you are trying to access requires a different user agent string before it responds with the info you want. Try this:
Document doc = Jsoup.connect(URL).userAgent("Mozilla/5.0").get();
I have this web page, https://rrtp.comed.com/pricing-table-today/, and from it I need to get only the Time (Hour Ending) and Day-Ahead Hourly Price columns. I tried the following code:
Document doc = Jsoup.connect("https://rrtp.comed.com/pricing-table-today/").get();
for (Element table : doc.select("table.prices three-col")) {
for (Element row : table.select("tr")) {
Elements tds = row.select("td");
if (tds.size() > 2) {
System.out.println(tds.get(0).text() + ":" + tds.get(1).text());
}
}
}
but unfortunately I am unable to get the data I need.
Is there something wrong in the code, or can this page not be crawled?
Any help would be appreciated.
As I said in the comments:
You should request https://rrtp.comed.com/rrtp/ServletFeed?type=pricingtabledual&date=20150717 because that is the source from which the data is loaded on the page you pointed to.
The data behind this link is not a valid HTML document (which is why it's not working for you), but you can easily make it close enough.
All you have to do is get the response first and wrap it in <table>..</table> tags; after that it can be parsed as an HTML document.
Connection.Response response = Jsoup.connect("https://rrtp.comed.com/rrtp/ServletFeed?type=pricingtabledual&date=20150717").execute();
Document doc = Jsoup.parse("<table>" + response.body() + "</table>");
for (Element element : doc.select("tr")) {
System.out.println(element.html());
}
I have to use this code to search Google and show the result links, but I don't know how to send my query.
String s="reza";
Document doc = Jsoup.connect("http://google.com/search?q=").userAgent("Mozilla").data(?????).get();
Elements titles = doc.select(".entrytitle");
//print all titles in main page
for(Element e: titles){
System.out.println("text: " +e.text());
System.out.println("html: "+ e.html());
}
//print all available links on page
Elements links = doc.select("a[href]");
for(Element l: links){
System.out.println("link: " +l.attr("abs:href"));
}
}
"reza" is the query that I want to search for on Google.
How can I use this method to run the search?
My problem is sending the query to the Google search page.
Change Jsoup.connect("http://google.com/search?q=") to Jsoup.connect("http://google.com/search?q=" + s).
And remove the .data(???).
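Plain concatenation is fine for a single word like "reza", but a query containing spaces or characters such as & should be URL-encoded before it is appended. A minimal sketch using the JDK's URLEncoder follows; the class and method names (SearchUrlBuilder, buildSearchUrl) are made up for illustration, and the resulting string is what would be passed to Jsoup.connect.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class SearchUrlBuilder {

    // Appends the URL-encoded query to the search URL used in the question.
    static String buildSearchUrl(String query) {
        try {
            return "http://google.com/search?q=" + URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is required to be available on every JVM, so this should not happen.
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(buildSearchUrl("reza"));          // prints http://google.com/search?q=reza
        System.out.println(buildSearchUrl("reza khan & co")); // prints http://google.com/search?q=reza+khan+%26+co
    }
}
```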
I'm currently trying to scrape amazon for a bunch of data. I'm using jsoup to help me do this, and everything has gone pretty smoothly, but for some reason I can't figure out how to pull the current number of sellers selling new products.
Here's an example of the url I'm scraping : http://www.amazon.com/dp/B006L7KIWG
I want to extract "39 new" from the following below:
<div id="secondaryUsedAndNew" class="mbcOlp">
<div class="mbcOlpLink">
<a class="buyAction" href="/gp/offer-listing/B006L7KIWG/ref=dp_olp_new_mbc?ie=UTF8&condition=new">
39 new
</a> from
<span class="price">$60.00</span>
</div>
</div>
This project is the first time I've used jsoup, so the coding may be a bit iffy, but here are some of the things I have tried:
String asinPage = "http://www.amazon.com/dp/" + getAsin();
try {
Document document = Jsoup.connect(asinPage).timeout(timeout).get();
.....
//get new sellers try one
Elements links = document.select("a[href]");
for (Element link : links) {
// System.out.println("Span olp:"+link.text());
String code = link.attr("abs:href");
String label = trim(link.text(), 35);
if (label.contains("new")) {
System.out.println(label + " : " + code);
}
}
//get new sellers try one
Elements links = document.select("div.mbcOlpLink");
for (Element link : links) {
// System.out.println("Span olp:"+link.text());
}
//about a million other failed attempts that you'll just have to take my word on.
I've been successful at scraping everything else I need on the page, but for some reason this particular element is being a pain. Any help would be great. Thanks, guys!
I would use
String s = document.select("div[id=secondaryUsedAndNew] a.buyAction").text().replace("\u00a0", " "); // replace non-breaking spaces with regular ones
This should leave you with "42 new", as it says on the page at this moment.
Hope this works for you!