<div></div>
<div></div>
<div></div>
<div></div>
<ul>
<form id=the_main_form method="post">
<li>
<div></div>
<div> <h2>
<a onclick="xyz;" target="_blank" href="http://sample.com" style="text-decoration:underline;">This is sample</a>
</h2></div>
<div></div>
<div></div>
</li>
There are 50 <li> elements like that.
I have posted a snippet from a much larger HTML page.
<div> </div> means there was data between the tags; I removed it because it is not necessary.
I would like to know what the jsoup select statement should be to extract the href and the text.
I tried doc.select("div div div ul xxxx");
where xxxx is the form. Should I give the form id, or how should I do that?
Try this:
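Elements allLis = doc.select("#the_main_form > li");
for (Element li : allLis) {
// select() returns Elements, so take the first match
Element a = li.select("div:eq(1) > h2 > a").first();
if (a == null) continue; // this <li> has no matching link
String href = a.attr("href");
String text = a.text();
}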
Hope it helps!
EDIT:
Elements allLis = doc.select("#the_main_form > li ");
This part of the code gets all <li> tags that are direct children of the <form> with id the_main_form.
Element a = li.select("div:eq(1) > h2 > a").first();
Then we iterate over the <li> tags and get the <a> tag in each: div:eq(1) picks the second <div> inside the <li> (the index is zero-based), then we step into its <h2>, where the <a> is. select() returns Elements, so first() takes the single match.
Hope you understand now! :)
Please try this:
package com.stackoverflow.works;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
/*
* # author: sarath_sivan
*/
public class HtmlParserService {
public static void parseHtml(String html) {
Document document = Jsoup.parse(html);
Element linkElement = document.select("a").first();
String linkHref = linkElement.attr("href"); // "http://sample.com"
String linkText = linkElement.text(); // "This is sample"
System.out.println(linkHref);
System.out.println(linkText);
}
public static void main(String[] args) {
String html = "<a onclick=\"xyz;\" target=\"_blank\" href=\"http://sample.com\" style=\"text-decoration:underline;\">This is sample</a>";
parseHtml(html);
}
}
Hope you have the Jsoup Library in your classpath.
Thank you!
I'm trying to skip one item so that jsoup does not parse it,
but the CSS :not selector is not working.
I don't understand what is wrong.
My code:
MangaList list = new MangaList();
Document document = getPage("https://3asq.org/");
MangaInfo manga;
for (Element o : document.select("div.page-item-detail:not(.item-thumb#manga-item-5520)")) {
manga = new MangaInfo();
manga.name = o.select("h3").first().select("a").last().text();
manga.path = o.select("a").first().attr("href");
try {
manga.preview = o.select("img").first().attr("src");
} catch (Exception e) {
manga.preview = "";
}
list.add(manga);
}
return list;
html code:
<div class="col-12 col-md-6 badge-pos-1">
<div class="page-item-detail manga">
<div id="manga-item-5520" class="item-thumb hover-details c-image-hover" data-post-id="5520">
<a href="https://3asq.org/manga/gosu/" title="Gosu">
<img width="110" height="150" src="https://3asq.org/wp-content/uploads/2020/03/IMG_4497-110x150.jpg" srcset="https://3asq.org/wp-content/uploads/2020/03/IMG_4497-110x150.jpg 110w, https://3asq.org/wp-content/uploads/2020/03/IMG_4497-175x238.jpg 175w" sizes="(max-width: 110px) 100vw, 110px" class="img-responsive" style="" alt="IMG_4497"/> </a>
</div>
<div class="item-summary">
<div class="post-title font-title">
<h3 class="h5">
<span class="manga-title-badges custom noal-manga">Noal-Manga</span> Gosu
</h3>
If I debug your code and print the HTML for the first match:
System.out.println(document.select("div.page-item-detail").get(0));
(Hint: use the expression evaluator in IntelliJ IDEA, Alt+F8, for in-session, real-time debugging.)
I get:
<div class="page-item-detail manga">
<div id="manga-item-2003" class="item-thumb hover-details c-image-hover" data-post-id="2003">
<a href="http...
...
</div>
</div>
</div>
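It looks like you want to extract the next div tag down, the one whose class contains item-thumb, but only if its id isn't manga-item-5520. In your selector the :not(...) is tested against the div.page-item-detail element itself, which never carries that class and id, so nothing is excluded.
So here's what I did to remove that one item: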
document.select("div.page-item-detail div[class*=item-thumb][id!=manga-item-5520]")
Result size: 19
With the element included:
document.select("div.page-item-detail div[class*=item-thumb]")
Result size: 20
You can also try the following if you want to remain based at the outer div tag rather than the inner div tag.
document.select("div.page-item-detail:has(div[class*=item-thumb][id!=manga-item-5520])")
I want to retrieve the visitor ID from “visitor” or "visitor.VisitorId". The code below runs without any error, but the value I receive is null.
HTML Code:-
<ul class="sidebar-menu">
<li id="visitorView" class="treeview active">
<a>
<ul id="visitorViewMenu" class="treeview-menu menu-open" style="display: block;">
<!-- ngRepeat: visitor in Visitors -->
<li class="ng-scope" ng-repeat="visitor in Visitors" style="">
<a id="visitor.VisitorId" class="ng-binding" ng-click="select(visitor)">
<countryflag class="flagimg ng-isolate-scope" visitor="visitor">
<span class="chattabname"/>
A
<span class="timmer1 pull-right" runtimer="{"VisitorID":"c2c45b4d-5077-492f-afd6-88ab3bba99cd","Name":"A","StartTime":"2016-09-09 10:33:21","WidgetId":"7fcf22c6-4a9d-4701-9865-b8a85d597862","ConnectionId":"edc7d72b-8217-4961-81ff-f4ef4138bc3b","TimeZone":"Asia/Colombo","CountryCode":"lk","VisitorName":null,"Department":null,"CompanyId":"a4afbd8b-1de9-49d9-8fe6-4ec8119f4bb8"}">
</a>
</li>
<!-- end ngRepeat: visitor in Visitors -->
<li>
</ul>
</li>
<li class="treeview">
<li class="treeview">
</ul>
Selenium Code:-
**1st method :-**
WebElement cityField = driver.findElement(By.cssSelector("a[ng-click='select(visitor)']"));
**2nd method :-**
WebElement cityField = driver.findElement(By.cssSelector("a[id='visitor.VisitorId']"));
**Output**
System.out.println("+++-- "+cityField.getAttribute("value"));
getAttribute("value") returns null because the <a> element has no value attribute. Try using getText(), which returns the innerText of the <a> element:
WebElement cityField = driver.findElement(By.id("visitor.VisitorId"));
System.out.println("+++-- " + cityField.getText());
Or, if you want the span element whose runtimer attribute contains the visitor ID, locate the span and read its runtimer attribute:
WebElement cityField = driver.findElement(By.cssSelector("a[id = 'visitor.VisitorId'] span.timmer1"));
String runtimeData = cityField.getAttribute("runtimer");
//Now do some programming stuff to retrieve visitor id
The runtimer attribute data looks like JSON, so you can convert it into an org.json.JSONObject and retrieve any value by its key:
import org.json.JSONException;
import org.json.JSONObject;
public static Object getValue(String data, String key) throws JSONException {
JSONObject jObject = new JSONObject(data);
return jObject.get(key);
}
String visitorID = (String) getValue(runtimeData, "VisitorID");
System.out.println(visitorID);
Output :-
c2c45b4d-5077-492f-afd6-88ab3bba99cd
As OP suggested, we can use split() function as well to retrieve data as :-
String[] splitS = runtimeData.split(",");
for(int i =0; i < splitS.length; i++)
{
System.out.println("splitS" + splitS[i]);
}
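Splitting on commas only yields raw key/value fragments that still need trimming. As a minimal sketch, assuming the runtimer value always contains a "VisitorID":"..." pair as in the HTML above, a regular expression from the Java standard library can pull the ID out directly. The class and method names here (VisitorIdExtractor, extractVisitorId) are hypothetical, not part of Selenium or org.json.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VisitorIdExtractor {
    // Matches "VisitorID":"<value>" and captures <value> (assumed to contain no quotes).
    private static final Pattern VISITOR_ID =
            Pattern.compile("\"VisitorID\"\\s*:\\s*\"([^\"]*)\"");

    // Hypothetical helper: returns the captured ID, or empty if the key is absent.
    public static Optional<String> extractVisitorId(String runtimeData) {
        if (runtimeData == null) {
            return Optional.empty();
        }
        Matcher m = VISITOR_ID.matcher(runtimeData);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Shortened, made-up sample in the same shape as the runtimer attribute.
        String runtimeData = "{\"VisitorID\":\"abc-123\",\"Name\":\"A\"}";
        System.out.println(extractVisitorId(runtimeData).orElse("not found"));
    }
}
```

This is only a shortcut for a single known key; for anything more than that, parsing the string as JSON, as shown earlier, is the more robust choice.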
If I understand correctly, the value you are looking for is in the runtimer attribute of an element nested inside id="visitor.VisitorId", so you need to pass that attribute name to getAttribute(). Note that the id cannot be written as #visitor.VisitorId, because CSS reads .VisitorId as a class; an attribute selector avoids that.
WebElement cityField = driver.findElement(By.cssSelector("a[id='visitor.VisitorId'] > .timmer1"));
String attributeData = cityField.getAttribute("runtimer");
String visitorId = attributeData.split(",")[0]; // split() returns String[]; take the first pair
System.out.println("+++-- " + visitorId);
Output: +++-- {"VisitorID":"c2c45b4d-5077-492f-afd6-88ab3bba99cd"
I'm trying to parse data from HTML. I need to get all the names from the inner div elements with class=vacancy-item, which have different data-id values.
Please see the HTML code below:
<section class="home-vacancies" id="vacancy_wrapper">
<div class="home-block-title">job openings</div>
<div class="vacancy-filter">
...................
</div>
<div class="vacancy-wrapper">
<div class="vacancy-item" data-id="9120">
..............
</div>
<div class="vacancy-item" data-id="9119">
..................
</div>
<div class="vacancy-item" data-id="9118">
................................
</div>
<div class="vacancy-item" data-id="9117">
.............................
</div>
Here is my code:
Please help.
doc = Jsoup.connect("URL").get();
//title = doc.select(".page-content div:eq(3)");
title = doc.getElementsByClass("div[class=vacancy-wrapper]");
titleList.clear();
for (Element titles : title) {
String text = titles.getElementsB("vacancy-item").text();
titleList.add(text);
}
Thanks!
getElementsByClass takes a bare class name, not a CSS selector, so getElementsByClass("vacancy-wrapper") would work.
You will also need a second loop to get each vacancy-item's text as a separate element:
Elements title = doc.getElementsByClass("vacancy-wrapper");
for (Element titles : title) {
Elements items = titles.getElementsByClass("vacancy-item");
for (Element item : items) {
String text = item.text();
// process text
}
}
Another option would be to use jsoup's select method:
Elements es = doc.select("div.vacancy-wrapper div.vacancy-item");
for (Element vi : es) {
String text = vi.text();
// process text
}
This would select all div elements with a class attribute vacancy-item that are under a div with a class attribute vacancy-wrapper.
What I want: I am new to Jsoup. I want to parse my html string and search for each text value that appears inside tags (any tag). And then change that text value to something else.
What I have done: I am able to change the text value for single tag. Below is the code:
public static void main(String[] args) {
String html = "<div><p>Test Data</p> <p>HELLO World</p></div>";
Document doc1=Jsoup.parse(html);
Elements ps = doc1.getElementsByTag("p");
for (Element p : ps) {
String pText = p.text();
p.text(base64_Dummy(pText));
}
System.out.println("======================");
String changedHTML=doc1.html();
System.out.println(changedHTML);
}
public static String base64_Dummy(String abc){
return "This is changed text";
}
output:
======================
<html>
<head></head>
<body>
<div>
<p>This is changed text</p>
<p>This is changed text</p>
</div>
</body>
</html>
The code above is able to change the p tag's value. But in my case the HTML string can contain any tag whose value I want to search and change.
How can I search all tags in an HTML string and change their text values one by one?
You can try with something similar to this code:
String html = "<html><body><div><p>Test Data</p> <div> <p>HELLO World</p></div></div> other text</body></html>";
Document doc = Jsoup.parse(html);
// We will search nodes in a breadth-first way
Queue<Node> nodes = new ArrayDeque<>();
nodes.addAll(doc.childNodes());
while (!nodes.isEmpty()) {
Node n = nodes.remove();
if (n instanceof TextNode && ((TextNode) n).text().trim().length() > 0) {
// Do whatever you want with n.
// Here we just print its text...
System.out.println(n.parent().nodeName()+" contains text: "+((TextNode) n).text().trim());
} else {
nodes.addAll(n.childNodes());
}
}
And you'll get the following output:
body contains text: other text
p contains text: Test Data
p contains text: HELLO World
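The order of that output follows from the queue: a breadth-first walk visits every node at one depth before going deeper, so the text directly under body is reached before the text nested inside p. Below is a minimal sketch of the same traversal using only the Java standard library. SimpleNode and sampleTree are hypothetical stand-ins for illustration and are not part of jsoup.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

public class BreadthFirstDemo {
    // Hypothetical stand-in for a DOM node: a label plus its child nodes.
    static final class SimpleNode {
        final String label;
        final List<SimpleNode> children = new ArrayList<>();

        SimpleNode(String label, SimpleNode... kids) {
            this.label = label;
            this.children.addAll(Arrays.asList(kids));
        }
    }

    // Visits nodes level by level and returns their labels in visiting order.
    static List<String> breadthFirst(SimpleNode root) {
        List<String> order = new ArrayList<>();
        Queue<SimpleNode> queue = new ArrayDeque<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            SimpleNode n = queue.remove();
            order.add(n.label);
            queue.addAll(n.children); // children wait behind everything already queued
        }
        return order;
    }

    // Small tree shaped like <body><div><p>Test Data</p></div> other text</body>.
    static SimpleNode sampleTree() {
        return new SimpleNode("body",
                new SimpleNode("div", new SimpleNode("p", new SimpleNode("Test Data"))),
                new SimpleNode("other text"));
    }

    public static void main(String[] args) {
        System.out.println(breadthFirst(sampleTree()));
    }
}
```

If document order matters more than depth, a depth-first walk (a stack instead of a queue) would visit "Test Data" before "other text".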
You want to use the CSS selector * and the method textNodes to get the text of a given tag (Element in Jsoup world).
This line below
Elements ps = doc1.getElementsByTag("p");
becomes
Elements ps = doc1.select("*");
Now, with this new selector, you'll be able to select any element (tag) within your HTML code.
FULL CODE EXAMPLE
public static void main(String[] args) {
// No proxy setup is needed: Jsoup.parse(String) does no network I/O
String html = "<html><body><div><p>Test Data</p> <div> <p>HELLO World</p></div></div> other text</body></html>";
Document doc1 = Jsoup.parse(html);
Elements tags = doc1.select("*");
for (Element tag : tags) {
for (TextNode tn : tag.textNodes()) {
String tagText = tn.text().trim();
if (tagText.length() > 0) {
tn.text(base64_Dummy(tagText));
}
}
}
System.out.println("======================");
String changedHTML = doc1.html();
System.out.println(changedHTML);
}
public static String base64_Dummy(String abc) {
return "This is changed text";
}
OUTPUT
======================
<html>
<head></head>
<body>
<div>
<p>This is changed text</p>
<div>
<p>This is changed text</p>
</div>
</div>This is changed text
</body>
</html>
I need to validate HTML using Java, so I tried the jsoup library, but some of my test cases fail with it.
For example, this is my HTML content. I don't have any control over this content; I get it from an external source provider.
String invalidHtml = "<div id=\"myDivId\" ' class = claasnamee value='undaa' > <<p> p tagil vanne <br> <span> span close cheythillee!! </p> </div>";
doc = Jsoup.parseBodyFragment(invalidHtml);
For above html I am getting this output.
<html>
<head></head>
<body>
<div id="myDivId" '="" class="claasnamee" value="undaa">
<
<p> p tagil vanne <br /> <span> span close cheythillee!! </span></p>
</div>
</body>
</html>
The single quote in my string above comes out as '="" in the output. How can I fix this issue? Can anyone help me, please?
The best place to validate your HTML would be http://validator.w3.org/, but that would be a manual process. Don't worry, jsoup can help here as well. The program below is a workaround, but it serves the purpose.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
public class JsoupValidate {
public static void main(String[] args) throws Exception {
String invalidHtml = "<div id=\"myDivId\" ' class = claasnamee value='undaa' > <<p> p tagil vanne <br> <span> span close cheythillee!! </p> </div>";
Document initialDoc = Jsoup.parseBodyFragment(invalidHtml);
Document validatedDoc = Jsoup.connect("http://validator.w3.org/check")
.data("fragment", initialDoc.html())
.data("st", "1")
.post();
System.out.println("******");
System.out.println("Errors");
System.out.println("******");
for(Element error : validatedDoc.select("li.msg_err")){
System.out.println(error.select("em").text() + " : " + error.select("span.msg").text());
}
System.out.println();
System.out.println("**************");
System.out.println("Cleaned output");
System.out.println("**************");
Document cleanedOuput = Jsoup.parse(validatedDoc.select("pre.source").text());
cleanedOuput.select("meta[name=generator]").first().remove();
cleanedOuput.outputSettings().indentAmount(4);
cleanedOuput.outputSettings().prettyPrint(true);
System.out.println(cleanedOuput.html());
}
}
Alternatively, jsoup's own parser can track parse errors:
var invalidHtml = "<div id=\"myDivId\" ' class = claasnamee value='undaa' > <<p> p tagil vanne <br> <span> span close cheythillee!! </p> </div>";
var parser = Parser.htmlParser()
.setTrackErrors(10); // Set the number of errors it can track. 0 by default so it's important to set that
var dom = Jsoup.parse(invalidHtml, "" /* this is the default */, parser);
System.out.println(parser.getErrors()); // Do something with the errors, if any