I am using Jsoup to parse a site, and I want to get all the links from the following HTML:
<ul class="detail-main-list">
<li>
<a href="/manga/toki_wa/v01/c001/1.html">Dis Be the link</a>
</li>
</ul>
Any idea how?
Straight from jsoup.org, right there, first thing you see:
Document doc = Jsoup.connect("https://en.wikipedia.org/").get();
log(doc.title());
Elements newsHeadlines = doc.select("#mp-itn b a");
for (Element headline : newsHeadlines) {
log("%s\n\t%s",
headline.attr("title"), headline.absUrl("href"));
}
Modifying this to what you need seems trivial:
Document doc = Jsoup.connect("https://en.wikipedia.org/").get();
Elements anchorTags = doc.select("ul.detail-main-list a");
for (Element anchorTag : anchorTags) {
System.out.println("Links to: " + anchorTag.attr("href"));
System.out.println("In absolute form: " + anchorTag.absUrl("href"));
System.out.println("Text content: " + anchorTag.text());
}
The ul.detail-main-list a part is a so-called selector string. A real short tutorial on these:
foo means: Any HTML element with that tag name, i.e. <foo></foo>.
.bar means: Any HTML element with class bar, i.e. <foo class="bar baz"></foo>
#bar means: Any HTML element with id bar, i.e. <foo id="bar"></foo>
These can be combined: ul.detail-main-list matches any <ul> tags that have the string detail-main-list in their list of classes.
a b means: all things matching the 'b' selection that have something matching 'a' as an ancestor (not necessarily the direct parent). So ul a matches all <a> tags that have a <ul> tag around them someplace.
The JSoup docs are excellent.
You can extract the href of every a tag this way, from any HTML:
public static void main(String[] args) {
String htmlString = "<html>\n" +
" <head></head>\n" +
" <body>\n" +
"<ul class=\"detail-main-list\">\n" +
" <li> \n" +
" <a href=\"/manga/toki_wa/v01/c001/1.html\">Dis Be the link</a>\n" +
" </li> \n" +
"</ul>" +
" </body>\n" +
"</html>";
Document html = Jsoup.parse(htmlString);
Elements elements = html.select("a");
for(Element element: elements){
System.out.println(element.attr("href"));
}
}
Output:
/manga/toki_wa/v01/c001/1.html
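The href printed above is relative to the site root. When a document is parsed from a plain string with no base URI, absUrl("href") has nothing to resolve against and, as far as I know, returns an empty string; passing a base URI as the second argument to Jsoup.parse avoids that. If you only have the raw href, here is a minimal sketch of doing the resolution yourself with java.net.URI. The class name, the helper toAbsolute and the example.com base URL are made up for illustration.

```java
import java.net.URI;

public class ResolveHref {
    // Resolves a possibly relative href against the URL of the page it came from.
    static String toAbsolute(String baseUrl, String href) {
        return URI.create(baseUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        // Hypothetical base URL; use the address the page was actually fetched from.
        String base = "https://example.com/manga/toki_wa/";
        // Root-relative href: replaces the whole path of the base URL.
        System.out.println(toAbsolute(base, "/manga/toki_wa/v01/c001/1.html"));
        // Path-relative href: appended to the directory of the base URL.
        System.out.println(toAbsolute(base, "v01/c001/2.html"));
    }
}
```

With that base URL, the two lines should print https://example.com/manga/toki_wa/v01/c001/1.html and https://example.com/manga/toki_wa/v01/c001/2.html.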
I'm trying to parse an error file from a system, which is presented to me in HTML. I don't find it very pretty, but this is what I have to work with.
The errors are presented with codes which I can find a reference to in a catalog based on a set and a message id.
<HTML>
<BODY>
<h4>2020-07-16 10:24:22.614</h4>
<SPAN STYLE="color:black; font:bold;"> Set:<INPUT TYPE="text" VALUE="158" SIZE=3</INPUT> Id: <INPUT TYPE="text" VALUE="10420" SIZE=5</INPUT>
</SPAN>
</BODY>
</HTML>
I'm trying to parse the timestamp, and the two values in the input fields with JSoup. The timestamp is not a problem at all, but I don't seem to find a way to parse the Set and the Id of the message.
Document doc = Jsoup.parse(errorLog, "UTF-8", "");
Element body = doc.body();
Elements MessageTimestamps = doc.select("h4");
Elements MessageSets = doc.getElementsByAttributeValue("SIZE", "3");
Elements MessageID = doc.getElementsByAttributeValue("SIZE","5");
String[] timestampArray = new String[MessageTimestamps.size()];
System.out.println("Total: " + timestampArray.length);
for(int i = 0; i< MessageTimestamps.size(); i++) {
System.out.println("Timestamp: " + MessageTimestamps.get(i).text());
System.out.println("MessageSets: " + MessageSets.get(i).text());
}
Result:
Total: 6
Timestamp: 2020-07-16 10:24:22.614
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
Does anyone have an idea?
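The markup is malformed: each INPUT tag is missing its closing > before </INPUT>, so the parser most likely reads the attribute value as 3</INPUT rather than 3, and an exact match on "3" finds nothing. The numbers you want are also in the VALUE attribute, not in the element text. You could select the input fields whose SIZE attribute contains 3 or 5 and read VALUE from them, with something like: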
public static void main(String[] args){
String html = "<HTML>\n"
+ "<BODY>\n"
+ "<h4>2020-07-16 10:24:22.614</h4>\n"
+ "<SPAN STYLE=\"color:black; font:bold;\"> Set:<INPUT TYPE=\"text\" VALUE=\"158\" SIZE=3</INPUT> Id: <INPUT TYPE=\"text\" VALUE=\"10420\" SIZE=5</INPUT>\n"
+ "</SPAN>\n"
+ "</BODY>\n"
+ "</HTML>";
Document doc = Jsoup.parse(html);
Element time = doc.selectFirst("h4");
Element set = doc.selectFirst("INPUT[SIZE*=3]");
Element id = doc.selectFirst("INPUT[SIZE*=5]");
System.out.println(time.text());
System.out.println(set.attr("value"));
System.out.println(id.attr("value"));
}
I want to select the div tag that contains no further div or any other tag.
I tried the code below and want the output to be "This is output",
but the :empty pseudo-selector isn't working.
String htmlString =
"<html><div><div><div><p><b>This is first line</b></p> </div><b>This is second line</b></div><div>This is output</div><div><span style=\"color:blue\">This is third line</span></div></html>";
org.jsoup.nodes.Document doc1 = Jsoup.parse(htmlString);
Elements elements1 = doc1.select("html:empty");
for (Element element : elements1) {
System.out.println(element.toString());
}
Since you recently posted a couple of similar questions where the HTML structure changed and the CSS selector broke, it may suit you better to avoid CSS selectors and filter the elements yourself:
String htmlString = "<html><p><b>This has no div</b></p><div><div><div><p><b>This is first line</b></p></div><b>This is second line</b></div><div>This is output</div><div><span style=\"color:blue\">This is third line</span></div></html>";
Document doc = Jsoup.parse(htmlString);
Elements elements = doc.getAllElements();
// for all textnodes
outerloop:
for (Element element : elements) {
if(element.childNodes().size()>0 && element.childNode(0).nodeName().equals("#text")){
Element divContent = element;
if(divContent.nodeName().equals("div")){
System.out.println("No element in div; text: " + element.text()+ "\n");
}else{
while(divContent.parents().size()>0 && !divContent.parent().nodeName().equals("div")){
divContent = divContent.parent();
if(divContent.parent().nodeName().equals("body")){
continue outerloop; // continue, to skip element <p><b>This has no div</b></p>
//break; // break, if you want the element <p><b>This has no div</b></p> anyway
}
}
System.out.println("element: " + divContent.toString());
System.out.println("text: " + element.text() + "\n");
}
}
}
// only for <div>text...</div>
for (Element element : elements) {
if(element.childNodes().size()>0 && element.childNode(0).nodeName().equals("#text") && element.nodeName().equals("div")){
System.out.println("text: " + element.text());
}
}
Output:
element: <p><b>This is first line</b></p>
text: This is first line
element: <b>This is second line</b>
text: This is second line
No element in div; text: This is output
element: <span style="color:blue">This is third line</span>
text: This is third line
text: This is output
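I tried this, and it works. Note that div:not(b) already matches every div, so the selector returns all of them; it is ownText() that leaves only the text sitting directly inside a div, and the other divs print as empty lines.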
public class Test{
public static void main(String[] args) {
String htmlString =
"<html>" +
"<div><div>" +
"<div><p><b>This is first line</b></p> </div>" +
"<b>This is second line</b></div><div>This is output</div>" +
"<div><span style=\"color:blue\">This is third line</span></div></html>";
org.jsoup.nodes.Document doc1 = Jsoup.parse(htmlString);
for (Element e : doc1.select("div:not(b),div:not(p),div:not(span)"))
System.out.println(e.ownText());
}
}
Output:
This is output
I came across a problem using Jsoup. I can't match <div id="shout_132684"> because the digits keep changing. How should I match those?
Elements content = doc.select("div:matches(id=\"shout_.+?\")");
Doesn't work.
You can use the starts-with attribute selector ^=. Jsoup's .select(...) supports it.
You can do it like this:
doc.select("div[id^=shout]");
This is a full example:
public static void main(String[] args) {
Document parse = Jsoup.parse("<div id=\"shout_23\"/>" +
"<div id=\"shout_42\"/>" +
"<div id=\"notValidId\"/>" +
"<div id=\"shout_1337\"/>");
Elements divs = parse.select("div[id^=shout]");
for (Element element : divs) {
System.out.println(element);
}
}
It will print:
<div id="shout_23"></div>
<div id="shout_42"></div>
<div id="shout_1337"></div>
For more accurate parsing you can still do it with regular expressions:
Elements content = doc.select("div[id~=(shout_)[0-9]+]");
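One caveat with the regex form: as far as I can tell, Jsoup's [attr~=regex] matches when the pattern is found anywhere in the attribute value, so (shout_)[0-9]+ would also accept an id like old_shout_7_copy. Anchoring the pattern with ^ and $ (for example div[id~=^shout_[0-9]+$]) should make it strict. The sketch below shows the difference using plain java.util.regex, with no Jsoup involved; the class name and the sample ids are made up for illustration.

```java
import java.util.regex.Pattern;

public class ShoutIdCheck {
    // Unanchored: succeeds if the pattern occurs anywhere in the id.
    static final Pattern LOOSE = Pattern.compile("(shout_)[0-9]+");
    // Anchored: the whole id must be "shout_" followed by digits only.
    static final Pattern STRICT = Pattern.compile("^shout_[0-9]+$");

    static boolean looseMatch(String id) {
        return LOOSE.matcher(id).find();
    }

    static boolean strictMatch(String id) {
        return STRICT.matcher(id).find();
    }

    public static void main(String[] args) {
        String[] ids = {"shout_132684", "shout_", "old_shout_7_copy", "notValidId"};
        for (String id : ids) {
            System.out.println(id + " loose=" + looseMatch(id) + " strict=" + strictMatch(id));
        }
    }
}
```

Only shout_132684 passes both checks; old_shout_7_copy passes the loose one alone, which is the case the anchors are there to exclude.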
I made a little test (with Jsoup 1.6.1):
String s = "" +Jsoup.parse("<td></td>").select("td").size();
System.out.println("Selected elements count : " + s);
It outputs:
Selected elements count : 0
But it should return 1, because I parsed HTML containing a td element. Is something wrong with my code, or is there a bug in Jsoup?
Because Jsoup is an HTML5-compliant parser and you fed it invalid HTML. A <td> has to go inside a <table> at the very least.
int size = Jsoup.parse("<table><td></td></table>").select("td").size();
System.out.println("Selected elements count : " + size);
String url = "http://foobar.com";
Document doc = Jsoup.connect(url).get();
Elements td = doc.select("td");
Jsoup 1.6.2 lets you choose a different parser, and a simple XML parser is provided. With the following code I could solve my problem. You can later run the fragment through the HTML parser to get valid HTML.
// Jsoup 1.6.2
String s = "" + Jsoup.parse("<td></td>", "", Parser.xmlParser()).select("td").size();
System.out.println("Selected elements count : " + s);
I'm using JSoup to parse this HTML content:
<div class="submitted">
<strong><a title="View user profile." href="/user/1">user1</a></strong>
on 27/09/2011 - 15:17
<span class="via">www.google.com</span>
</div>
Which looks like this in web browser:
user1 on 27/09/2011 - 15:17 www.google.com
The username and the website can be parsed into variables using this:
String user = content.getElementsByClass("submitted").first().getElementsByTag("strong").first().text();
String website = content.getElementsByClass("submitted").first().getElementsByClass("via").first().text();
But I'm unsure how to get the "on 27/09/2011 - 15:17" part into a variable. If I use
String date = content.getElementsByClass("submitted").first().text();
it also contains the username and the website.
You can always remove the user and the website elements like this (you can clone your submitted element if you do not want the remove actions to "damage" your document):
public static void main(String[] args) throws Exception {
Document content = Jsoup.parse(
"<div class=\"submitted\">" +
" <strong><a title=\"View user profile.\" href=\"/user/1\">user1</a></strong>" +
" on 27/09/2011 - 15:17 " +
" <span class=\"via\">www.google.com</span>" +
"</div> ");
// create a clone of the element so we do not destroy the original
Element submitted = content.getElementsByClass("submitted").first().clone();
// remove the elements that you do not need
submitted.getElementsByTag("strong").remove();
submitted.getElementsByClass("via").remove();
// print the result (demo)
System.out.println(submitted.text());
}
Outputs:
on 27/09/2011 - 15:17
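You can then parse the string that you get. Assuming contentString holds the full text, "user1 on 27/09/2011 - 15:17 www.google.com", split it on spaces:
String[] parts = contentString.split(" ");
Then you can construct the string you want like this:
String str = parts[1] + " " + parts[2] + " - " + parts[4];
This gives you the string you need.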
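If what you ultimately need is a date value rather than a string, the extracted text can be handed to java.time. This is a minimal sketch, assuming the text always has the shape "on dd/MM/yyyy - HH:mm" with the day first; the class name and the parse helper are made up for illustration.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class SubmittedDate {
    // Pattern for text such as "on 27/09/2011 - 15:17"; quoted parts are literals.
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("'on' dd/MM/yyyy '-' HH:mm");

    // Trims surrounding whitespace, then parses; throws DateTimeParseException on other shapes.
    static LocalDateTime parse(String text) {
        return LocalDateTime.parse(text.trim(), FORMAT);
    }

    public static void main(String[] args) {
        LocalDateTime when = parse(" on 27/09/2011 - 15:17 ");
        System.out.println(when); // prints 2011-09-27T15:17
    }
}
```

Text in any other shape makes parse throw a DateTimeParseException, so wrap the call in a try/catch if the page layout may vary.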
Select the element before the text you wish to grab, then get its next sibling node (not element), which is a text node:
Document doc = Jsoup.parse("<div class=\"submitted\">" +
" <strong><a title=\"View user profile.\" href=\"/user/1\">user1</a></strong>" +
" on 27/09/2011 - 15:17 " +
" <span class=\"via\">www.google.com</span>" +
"</div> ");
String str = doc.select("strong").first().nextSibling().toString().trim();
System.out.println(str);
You can also ask an element for its child text nodes and index directly (though referencing the nodes by sibling is usually more robust than indexing):
Document doc = Jsoup.parse(
"<div class=\"submitted\">" +
" <strong><a title=\"View user profile.\" href=\"/user/1\">user1</a></strong>" +
" on 27/09/2011 - 15:17 " +
" <span class=\"via\">www.google.com</span>" +
"</div> ");
String str = doc.select("div").first().textNodes().get(1).text().trim();
System.out.println(str);