How to use :empty pseudo selector in jsoup - java

I want to select the div tag that contains no other div or any other tag.
I tried the code below and expect the output "This is output",
but the :empty pseudo-selector isn't working.
String htmlString =
"<html><div><div><div><p><b>This is first line</b></p> </div><b>This is second line</b></div><div>This is output</div><div><span style=\"color:blue\">This is third line</span></div></html>";
org.jsoup.nodes.Document doc1 = Jsoup.parse(htmlString);
Elements elements1 = doc1.select("html:empty");
for (Element element : elements1) {
System.out.println(element.toString());
}

Since you recently posted a couple of similar questions where your HTML structure changed and the CSS selector broke, it may suit you better to avoid CSS selectors and filter the elements yourself:
String htmlString = "<html><p><b>This has no div</b></p><div><div><div><p><b>This is first line</b></p></div><b>This is second line</b></div><div>This is output</div><div><span style=\"color:blue\">This is third line</span></div></html>";
Document doc = Jsoup.parse(htmlString);
Elements elements = doc.getAllElements();
// for all textnodes
outerloop:
for (Element element : elements) {
if(element.childNodes().size()>0 && element.childNode(0).nodeName().equals("#text")){
Element divContent = element;
if(divContent.nodeName().equals("div")){
System.out.println("No element in div; text: " + element.text()+ "\n");
}else{
while(divContent.parents().size()>0 && !divContent.parent().nodeName().equals("div")){
divContent = divContent.parent();
if(divContent.parent().nodeName().equals("body")){
continue outerloop; // continue, to skip element <p><b>This has no div</b></p>
//break; // break, if you want the element <p><b>This has no div</b></p> anyway
}
}
System.out.println("element: " + divContent.toString());
System.out.println("text: " + element.text() + "\n");
}
}
}
// only for <div>text...</div>
for (Element element : elements) {
if(element.childNodes().size()>0 && element.childNode(0).nodeName().equals("#text") && element.nodeName().equals("div")){
System.out.println("text: " + element.text());
}
}
Output:
element: <p><b>This is first line</b></p>
text: This is first line
element: <b>This is second line</b>
text: This is second line
No element in div; text: This is output
element: <span style="color:blue">This is third line</span>
text: This is third line
text: This is output

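I tried this and it works. Note that div:not(b) matches every div (a div is never a b element), so the filtering actually comes from ownText(), which is empty for divs whose text sits inside child elements; those divs print as blank lines, which are omitted from the output below: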
public class Test{
public static void main(String[] args) {
String htmlString =
"<html>" +
"<div><div>" +
"<div><p><b>This is first line</b></p> </div>" +
"<b>This is second line</b></div><div>This is output</div>" +
"<div><span style=\"color:blue\">This is third line</span></div></html>";
org.jsoup.nodes.Document doc1 = Jsoup.parse(htmlString);
for (Element e : doc1.select("div:not(b),div:not(p),div:not(span)"))
System.out.println(e.ownText());
}
}
Output:
This is output

Related

Get all links from a div with JSoup

I am using Jsoup to parse a site, and I want to get all the links from the following HTML:
<ul class="detail-main-list">
<li>
<a href="/manga/toki_wa/v01/c001/1.html">Dis Be the link</a>
</li>
</ul>
Any idea how?
Straight from jsoup.org, right there, first thing you see:
Document doc = Jsoup.connect("https://en.wikipedia.org/").get();
log(doc.title());
Elements newsHeadlines = doc.select("#mp-itn b a");
for (Element headline : newsHeadlines) {
log("%s\n\t%s",
headline.attr("title"), headline.absUrl("href"));
}
Modifying this to what you need seems trivial:
Document doc = Jsoup.connect("https://en.wikipedia.org/").get();
Elements anchorTags = doc.select("ul.detail-main-list a");
for (Element anchorTag : anchorTags) {
System.out.println("Links to: " + anchorTag.attr("href"));
System.out.println("In absolute form: " + anchorTag.absUrl("href"));
System.out.println("Text content: " + anchorTag.text());
}
The ul.detail-main-list a part is a so-called selector string. A real short tutorial on these:
foo means: Any HTML element with that tag name, i.e. <foo></foo>.
.bar means: Any HTML element with class bar, i.e. <foo class="bar baz"></foo>
#bar means: Any HTML element with id bar, i.e. <foo id="bar">
These can be combined: ul.detail-main-list matches any <ul> tags that have the string detail-main-list in their list of classes.
a b means: all things matching the 'b' selection that have something matching 'a' as an ancestor (not necessarily the direct parent). So ul a matches all <a> tags that have a <ul> tag around them someplace.
The JSoup docs are excellent.
You can extract the href of a specific link from any page this way:
public static void main(String[] args) {
    String htmlString = "<html>\n" +
            "  <head></head>\n" +
            "  <body>\n" +
            "<ul class=\"detail-main-list\">\n" +
            "  <li>\n" +
            // href matches the output shown below
            "    <a href=\"/manga/toki_wa/v01/c001/1.html\">Dis Be the link</a>\n" +
            "  </li>\n" +
            "</ul>" +
            "  </body>\n" +
            "</html>";
    Document html = Jsoup.parse(htmlString);
    // Select every anchor and print its href attribute
    Elements elements = html.select("a");
    for (Element element : elements) {
        System.out.println(element.attr("href"));
    }
}
Output:
/manga/toki_wa/v01/c001/1.html

How to get the values stored in a span class in selenium

I have a span like the one in the attached picture. I want to fetch all three values, i.e. 0.413%, 0.012% and --.
When I navigate to this span and get its text, all three values come back in one string, but I want them one by one.
'--' can appear in any position. How can I fetch these values?
<span class="text-light ng-binding" ng-show="calculatorStatus == 'COMPLETED'" style="font-size: 0.85em;">
0.413%
<br/>
0.012%
<br/>
--
</span>
Actual: 0.413% \n 0.012% \n --
Expected: 0.413%, 0.012%, --
Looks like homework. Hmmm. OK. I keep this in my kitbag:
public String getTextFromElementsTextNodes(WebDriver webDriver, WebElement element) throws IllegalArgumentException {
String text = "";
if (webDriver instanceof JavascriptExecutor) {
text = (String)((JavascriptExecutor) webDriver).executeScript(
"var nodes = arguments[0].childNodes;" +
"var text = '';" +
"for (var i = 0; i < nodes.length; i++) {" +
" if (nodes[i].nodeType == Node.TEXT_NODE) {" +
" text += nodes[i].textContent;" +
" }" +
"}" +
"return text;"
, element);
} else {
throw new IllegalArgumentException("driver is not an instance of JavascriptExecutor");
}
return text;
}
It returns all characters, including non-ASCII line breaks. I usually just want the text, so I add this:
getTextFromElementsTextNodes(driver, anElement).replaceAll("[^\\x00-\\x7F]", " ");
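To get the values one by one, as the question asks, the returned text still has to be split. Here is a minimal sketch in plain Java, assuming the text comes back with the values separated by line breaks (as in the "Actual" output above). The class and method names are hypothetical, not part of Selenium.

```java
import java.util.ArrayList;
import java.util.List;

class SpanValueSplitter {

    // Hypothetical helper: splits the element text on any line break,
    // trims each piece and drops the blank ones.
    static List<String> splitValues(String rawText) {
        List<String> values = new ArrayList<>();
        for (String line : rawText.split("\\R")) {
            String value = line.trim();
            if (!value.isEmpty()) {
                values.add(value);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Assumed shape of the text returned for the span in the question
        String raw = "\n0.413%\n\n0.012%\n\n--\n";
        System.out.println(String.join(", ", splitValues(raw)));
    }
}
```

Because blank pieces are dropped, it does not matter where '--' appears or how many line breaks separate the values.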

JSoup get specific data from webpage

I've been trying to get data from: http://www.betvictor.com/sports/en/to-lead-anytime, where I would like to get the list of matches using JSoup.
For example:
Caen v AS Saint Etienne
Celtic v Rangers
and so on...
My current code is:
String couponPage = "http://www.betvictor.com/sports/en/to-lead-anytime";
Document doc1 = Jsoup.connect(couponPage).get();
String match = doc1.select("#coupon_143751140 > table:nth-child(3) > tbody > tr:nth-child(2) > td.event_description").text();
System.out.println("match:" + match);
Once I can figure out how to get one item of data, I will put it in a for loop to loop through the whole table, but first I need to get one item of data.
Currently, the output is "match: " so it looks like the "match" variable is empty.
Any help is much appreciated.
I worked out the answer to my own question after a few hours of experimenting. It turns out the page didn't load properly straight away, and I had to use the timeout method.
Document doc;
try {
    // Allow up to 10 seconds for the page to respond
    doc = Jsoup.connect("http://www.betvictor.com/sports/en/football/coupons/100/0/0/43438/0/100/0/0/0/0/1").timeout(10000).get();
    // Each match name is the link text inside td.event_description
    Elements matches = doc.select("td.event_description a");
    for (Element match : matches) {
        String matchEvent = match.text();
        // Names look like "Caen v AS Saint Etienne"
        String[] parts = matchEvent.split(" v ");
        if (parts.length < 2) {
            continue; // skip entries that are not "team1 v team2"
        }
        String team1 = parts[0];
        String team2 = parts[1];
        System.out.println("text : " + team1 + " v " + team2);
    }
} catch (IOException e) {
    e.printStackTrace();
}

Can I include white space between all html text() elements in Jsoup

I want to use Jsoup to extract all text from an HTML page and return a single string of all the text without any HTML. The code I am using half works, but has the effect of joining elements which affects my keyword searches against the string.
This is the Java code I am using:
String resultText = scrapePage(htmldoc);
private String scrapePage(Document doc) {
Element allHTML = doc.select("html").first();
return allHTML.text();
}
Run against the following HTML:
<html>
<body>
<h1>Title</h1>
<p>here is para1</p>
<p>here is para2</p>
</body>
</html>
Outputting resultText gives "Titlehere is para1here is para2" meaning I can't search for the word "para1" as the only word is "para1here".
I don't want to split the document into more elements than necessary (for example, getting all h1 and p text elements), as there is such a wide range of tags I could be matching.
For example, data1data2 would come from:
<td>data1</td><td>data2</td>
Is there a way I can get all the text from the page but also include a space between the tags? I don't want to preserve whitespace otherwise; there is no need to keep line breaks etc., as I am just preparing a keyword string. I will probably collapse all other whitespace to a single space for this reason.
I don't have this issue using JSoup 1.7.3.
Here's the full code I used for testing:
final String html = "<html>\n"
+ " <body>\n"
+ " <h1>Title</h1>\n"
+ " <p>here is para1</p>\n"
+ " <p>here is para2</p>\n"
+ " </body>\n"
+ "</html>";
Document doc = Jsoup.parse(html);
Element element = doc.select("html").first();
System.out.println(element.text());
And the output:
Title here is para1 here is para2
Can you run my code? Also update to a newer version of jsoup if you don't have 1.7.3 yet.
The previous answer is not right: it works only because of the "\n" added at the end of each line, and real HTML may not have a line break at the end of every line.
void example2text() throws IOException {
String url = "http://www.example.com/";
String out = new Scanner(new URL(url).openStream(), "UTF-8").useDelimiter("\\A").next();
org.jsoup.nodes.Document doc = Jsoup.parse(out);
String text = "";
Elements tags = doc.select("*");
for (Element tag : tags) {
for (TextNode tn : tag.textNodes()) {
String tagText = tn.text().trim();
if (tagText.length() > 0) {
text += tagText + " ";
}
}
}
System.out.println(text);
}
Based on this answer: https://stackoverflow.com/a/35798214/4642669
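The question also mentions collapsing the remaining whitespace to a single space before running keyword searches. That step needs no jsoup at all. Here is a minimal sketch in plain Java; the class and method names are hypothetical, and the input stands in for whatever text was extracted from the page.

```java
import java.util.Arrays;

class WhitespaceNormalizer {

    // Collapses every run of whitespace (spaces, tabs, line breaks)
    // into one space and trims the ends.
    static String collapse(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    // Checks whether the keyword appears as a whole word in the text.
    static boolean containsWord(String text, String keyword) {
        return Arrays.asList(collapse(text).split(" ")).contains(keyword);
    }

    public static void main(String[] args) {
        String extracted = "Title \n here is  para1\t here is para2 ";
        System.out.println(collapse(extracted));
        System.out.println(containsWord(extracted, "para1"));
    }
}
```

Matching whole words instead of substrings also makes the original problem visible: a joined string such as "Titlehere is para1here is para2" does not contain the word "para1".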

Extract href values inside td tags in jsoup

I have
<table class="table" >
<tr>
<td>text1</td>
<td>text2</td>
</tr>
<tr>
<td>text</td>
<td>text</td>
</tr>
and I want to extract the url and text of all rows
I use
Document doc = Jsoup.connect(url).get();
for (Element table : doc.select("table.table")) {
for (Element row : table.select("tr")) {
Elements tds = row.select("td");
String text1=tds.get(0).text();
String url= row.attr("href");
System.out.println(text1+ "," + url);
}
}
I get the text1 value but url is null.
How can I get the url from the td tags?
Your row variable is the tr element, not the a tag, so it has no href attribute.
Try with this:
// select() returns Elements, so take the first matching table
Element table = doc.select("table.table").first();
Elements links = table.getElementsByTag("a");
for (Element link : links) {
    String url = link.attr("href");
    String text = link.text();
    System.out.println(text + ", " + url);
}
This is pretty much taken from the JSoup documentation.
You (or someone else) can try this:
Document doc = Jsoup.connect(url).get();
for (Element table : doc.select("table.table")) {
for (Element row : table.select("tr")) {
for (Element tds : row.select("td")) {
Elements links = tds.select("a[href]");
for (Element link : links) {
System.out.println("link : " + link.attr("href"));
System.out.println("text : " + link.text());
}
}
}
}
