I made a little test (with Jsoup 1.6.1):
String s = "" +Jsoup.parse("<td></td>").select("td").size();
System.out.println("Selected elements count : " + s);
It outputs:
Selected elements count : 0
But it should return 1, because I parsed HTML containing a td element. Is something wrong with my code, or is there a bug in Jsoup?
Because Jsoup is an HTML5-compliant parser and you fed it invalid HTML. A <td> has to go inside at least a <table>.
int size = Jsoup.parse("<table><td></td></table>").select("td").size();
System.out.println("Selected elements count : " + size);
String url = "http://foobar.com";
Document doc = Jsoup.connect(url).get();
Elements td = doc.select("td");
Jsoup 1.6.2 allows parsing with a different parser, and a simple XML parser is provided. With the following code I could solve my problem. You can later parse your fragment with the HTML parser to get valid HTML.
// Jsoup 1.6.2
String s = "" + Jsoup.parse("<td></td>", "", Parser.xmlParser()).select("td").size();
System.out.println("Selected elements count : " + s);
Basically, I am using Jsoup to parse a site, and I want to get all the links from the following HTML:
<ul class="detail-main-list">
<li>
<a href="/manga/toki_wa/v01/c001/1.html">Dis Be the link</a>
</li>
</ul>
Any idea how?
Straight from jsoup.org, right there, first thing you see:
Document doc = Jsoup.connect("https://en.wikipedia.org/").get();
log(doc.title());
Elements newsHeadlines = doc.select("#mp-itn b a");
for (Element headline : newsHeadlines) {
log("%s\n\t%s",
headline.attr("title"), headline.absUrl("href"));
}
Modifying this to what you need seems trivial:
Document doc = Jsoup.connect("https://en.wikipedia.org/").get();
Elements anchorTags = doc.select("ul.detail-main-list a");
for (Element anchorTag : anchorTags) {
System.out.println("Links to: " + anchorTag.attr("href"));
System.out.println("In absolute form: " + anchorTag.absUrl("href"));
System.out.println("Text content: " + anchorTag.text());
}
The ul.detail-main-list a part is a so-called selector string. A real short tutorial on these:
foo means: Any HTML element with that tag name, i.e. <foo></foo>.
.bar means: Any HTML element with class bar, i.e. <foo class="bar baz"></foo>
#bar means: Any HTML element with id bar, i.e. <foo id="bar">
These can be combined: ul.detail-main-list matches any <ul> tags that have the string detail-main-list in their list of classes.
a b means: all things matching the 'b' selection that have something matching 'a' as an ancestor (not necessarily the direct parent). So ul a matches all <a> tags that have a <ul> tag around them someplace.
The JSoup docs are excellent.
You can extract a specific a href link this way from any website's HTML.
public static void main(String[] args) {
    String htmlString = "<html>\n" +
            "  <head></head>\n" +
            "  <body>\n" +
            "<ul class=\"detail-main-list\">\n" +
            "    <li>\n" +
            "        <a href=\"/manga/toki_wa/v01/c001/1.html\">Dis Be the link</a>\n" +
            "    </li>\n" +
            "</ul>" +
            "  </body>\n" +
            "</html>";
    Document html = Jsoup.parse(htmlString);
    // select every anchor and print its href attribute
    Elements elements = html.select("a");
    for (Element element : elements) {
        System.out.println(element.attr("href"));
    }
}
Output:
/manga/toki_wa/v01/c001/1.html
I work with incoming html text blocks, like this:
String html = "<p>Some text here with already existing tags and it's escaped symbols.\n" +
" More text here:<br/>\\r\\n---<br/>\\r\\n" +
" <img src=\"/attachments/a0d4789a-1575-4b70-b57f-9e8fe21df46b\" sha256=\"2957635fcf46eb54d99f4f335794bd75a89d2ebc1663f5d1708a2fc662ee065c\"></a>" +
" It was img tag with attr to replace above</p>\\r\\n\\r\\n<p>More text here\n" +
" and here.<br/>\\r\\n---</p>";
I need to replace the src attribute value in img tags with a slightly modified version of the sha256 attribute value from the same tag. I can do it easily with Jsoup like this:
Document doc = Jsoup.parse(html);
Elements elementsByAttribute = doc.select("img[src]");
elementsByAttribute.forEach(x -> x.attr("src", "/usr/myfolder/" + x.attr("sha256") + ".zip"));
But there is a problem: the incoming text already has formatting, HTML tags, escaping, etc. that need to be preserved, while Jsoup adds and removes tags, escapes and unescapes characters, and makes other modifications to the original input.
For example, System.out.println(doc); or System.out.println(doc.html()); gives me the following:
<html>
<head></head>
<body>
<p>Some text here with already existing tags and it's escaped symbols. More text here:<br>\r\n---<br>\r\n <img src="/usr/myfolder/2957635fcf46eb54d99f4f335794bd75a89d2ebc1663f5d1708a2fc662ee065c.zip" sha256="2957635fcf46eb54d99f4f335794bd75a89d2ebc1663f5d1708a2fc662ee065c"> It was img tag with attr to replace above</p>\r\n\r\n
<p>More text here and here.<br>\r\n---</p>
</body>
</html>
My src attribute is replaced, but many more HTML tags are added, and the escaped apostrophe in "it's" is unescaped to a literal apostrophe.
If I use System.out.println(doc.text()); I get the following:
Some text here with already existing tags and it's escaped symbols. More text here: \r\n--- \r\n It was img tag with attr to replace above\r\n\r\n More text here and here. \r\n---
My tags are removed here, and the escaped apostrophe in "it's" is unescaped again.
I tried some other Jsoup features but couldn't find a way to solve this problem.
Question: is there any way to replace only some attributes with Jsoup without changing other tags? Maybe there is some other library for that purpose? Or is regex my only option?
I encountered the same problem, and in my case this code doesn't change the original formatting.
Try this:
public static void m(){
String html = "<p>Some text here with already existing tags and it's escaped symbols.\n" +
" More text here:<br/>\r\n---<br/>\r\n" +
" <img src=\"/attachments/a0d4789a-1575-4b70-b57f-9e8fe21df46b\" sha256=\"2957635fcf46eb54d99f4f335794bd75a89d2ebc1663f5d1708a2fc662ee065c\"></a>" +
" It was img tag with attr to replace above</p>\r\n\r\n<p>More text here\n" +
" and here.<br/>\r\n---</p>";
Document doc = Jsoup.parseBodyFragment(html);
doc.outputSettings().prettyPrint(false);
Elements elementsByAttribute = doc.select("img[src]");
elementsByAttribute.forEach(x -> x.attr("src", "/usr/myfolder/" + x.attr("sha256") + ".zip"));
String result = doc.body().html();
System.out.println(result);
}
It prints the following to the console (your example has a dangling </a> after the <img>, so the library removes it):
<p>Some text here with already existing tags and it's escaped symbols.
More text here:<br>
---<br>
<img src="/usr/myfolder/2957635fcf46eb54d99f4f335794bd75a89d2ebc1663f5d1708a2fc662ee065c.zip" sha256="2957635fcf46eb54d99f4f335794bd75a89d2ebc1663f5d1708a2fc662ee065c"> It was img tag with attr to replace above</p>
<p>More text here
and here.<br>
---</p>
And here is my case in Kotlin: it rewrites the src attribute of <img> tags and removes <script> tags (text is an input String? variable from the outer scope):
val content: String get(){
var c = text ?: ""
//val document = Jsoup.parse(c)
val document = Jsoup.parseBodyFragment(c)
document.outputSettings().prettyPrint(false)
val elementsByAttr = document.select("img[src]")
elementsByAttr.forEach {
val srcContent = it.attr("src")
val (type,value) = srcContent.let {
val eqIdx = it.indexOf('=')
it.substring(0, max(0,eqIdx)) to it.substring(eqIdx+1)
}
if (type=="path"){
it.attr("src", ArticleRepo.imgPathPrefix+value)
}
}
document.select("script").remove()
c = document.body().html()
return c
}
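As for the regex route raised in the question, here is a minimal sketch. It assumes every attribute value is double-quoted and that src comes before sha256 inside the same <img> tag, as in the sample input. Everything outside the old src value is left exactly as it was, including entities and the dangling </a>. The names ImgSrcRewriter and rewrite are hypothetical, the input in main is a shortened made-up sample, and a regex like this is fragile against HTML that deviates from those assumptions.

```java
import java.util.regex.Pattern;

// Hypothetical helper: rewrites img src from the sha256 attribute
// without re-serializing the surrounding HTML.
class ImgSrcRewriter {

    // Group 1: "<img ... src=\""   (everything up to the src value)
    // Group 2: "\" ... sha256=\""  (closing quote of src up to the sha256 value)
    // Group 3: the sha256 hex value
    private static final Pattern IMG = Pattern.compile(
            "(<img\\b[^>]*?\\bsrc=\")[^\"]*(\"[^>]*?\\bsha256=\")([0-9a-fA-F]+)");

    static String rewrite(String html) {
        // Groups 1-3 are written back unchanged; only the old src value is replaced
        return IMG.matcher(html).replaceAll("$1/usr/myfolder/$3.zip$2$3");
    }

    public static void main(String[] args) {
        // Shortened, made-up sample in the shape of the question's input
        String in = "<p>it&#39;s<br/>\r\n<img src=\"/attachments/a0d4\" sha256=\"2957ab\"></a> text</p>";
        System.out.println(rewrite(in));
    }
}
```

Input that contains no matching <img> tag is returned unchanged, so the method can be applied to every incoming block.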
I have a requirement to create hyperlinks for a TOC in a Word document using DOCX4J after converting a JSON request to an HTML request:
String HTMLString = "<div id=\"toc_container\">\r\n" +
"<p class=\"toc_title\">Contents</p>\r\n" +
"<ul class=\"toc_list\">\r\n" +
" <li>1 First Point Header\r\n" +
" <ul>\r\n" +
" <li>1.1 First Sub Point 1</li>\r\n" +
" <li>1.2 First Sub Point 2</li>\r\n" +
" </ul>\r\n" +
"</li>\r\n" +
"<li>2 Second Point Header</li>\r\n" +
"<li>3 Third Point Header</li>\r\n" +
"</ul>\r\n" +
"</div>";
I have tried the statements below, but those methods take hardcoded values / paragraph values as input, whereas I need to pass the HTML request that is converted from JSON.
Hyperlink h = mdp.hyperlinkToBookmark("Target String", "Hit");
mdp.addParagraphOfText("Click here --> " ).getContent().add(h);
Please provide complete code for converting the <a href> links to MS Word hyperlinks/bookmarks using DOCX4J, so that when I click on a TOC (table of contents) heading in the Word document, it goes to the appropriate page.
I've been trying to get data from: http://www.betvictor.com/sports/en/to-lead-anytime, where I would like to get the list of matches using JSoup.
For example:
Caen v AS Saint Etienne
Celtic v Rangers
and so on...
My current code is:
String couponPage = "http://www.betvictor.com/sports/en/to-lead-anytime";
Document doc1 = Jsoup.connect(couponPage).get();
String match = doc1.select("#coupon_143751140 > table:nth-child(3) > tbody > tr:nth-child(2) > td.event_description").text();
System.out.println("match:" + match);
Once I can figure out how to get one item of data, I will put it in a for loop to loop through the whole table, but first I need to get one item of data.
Currently, the output is "match: " so it looks like the "match" variable is empty.
Any help is most appreciated.
I worked out the answer to my question after a few hours of experimenting. It turns out the page didn't load properly straight away, and I had to use the timeout method.
Document doc;
try {
    // the URL must include the protocol (http://)
    doc = Jsoup.connect("http://www.betvictor.com/sports/en/football/coupons/100/0/0/43438/0/100/0/0/0/0/1").timeout(10000).get();
    // select the link inside each event description cell
    Elements matches = doc.select("td.event_description a");
    Elements odds = doc.select("td.event_description a"); // same selector as matches; not used below
    for (Element match : matches) {
        // the link text holds the match name, e.g. "Celtic v Rangers"
        String matchEvent = match.text();
        String[] parts = matchEvent.split(" v ");
        String team1 = parts[0];
        String team2 = parts[1];
        System.out.println("text : " + team1 + " v " + team2);
    }
} catch (IOException e) {
    // connection failure or timeout
    e.printStackTrace();
}
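One caveat about the loop above: parts[1] throws an ArrayIndexOutOfBoundsException for any link text that does not contain " v ". Here is a minimal sketch of a guard using only the standard library. The names MatchNames and splitTeams are hypothetical and not part of jsoup, and the second input in main is a made-up example of a link without a separator.

```java
import java.util.Optional;

// Hypothetical helper (not part of jsoup): splits "Home v Away" into two team names.
class MatchNames {

    // Returns the two trimmed team names, or Optional.empty() if the text
    // has no " v " separator or one side is blank.
    static Optional<String[]> splitTeams(String matchEvent) {
        // limit 2: anything after the first " v " stays in the second part
        String[] parts = matchEvent.split(" v ", 2);
        if (parts.length < 2 || parts[0].trim().isEmpty() || parts[1].trim().isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(new String[] { parts[0].trim(), parts[1].trim() });
    }

    public static void main(String[] args) {
        splitTeams("Celtic v Rangers")
                .ifPresent(t -> System.out.println("text : " + t[0] + " v " + t[1]));
        // Made-up link text without a separator: nothing is printed, nothing is thrown
        splitTeams("Outright winner")
                .ifPresent(t -> System.out.println("text : " + t[0] + " v " + t[1]));
    }
}
```

Inside the scraping loop, the call would take match.text() as its argument, and entries that come back empty can simply be skipped.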
I want to use Jsoup to extract all text from an HTML page and return a single string of all the text without any HTML. The code I am using half works, but has the effect of joining elements which affects my keyword searches against the string.
This is the Java code I am using:
String resultText = scrapePage(htmldoc);
private String scrapePage(Document doc) {
Element allHTML = doc.select("html").first();
return allHTML.text();
}
Run against the following HTML:
<html>
<body>
<h1>Title</h1>
<p>here is para1</p>
<p>here is para2</p>
</body>
</html>
Outputting resultText gives "Titlehere is para1here is para2" meaning I can't search for the word "para1" as the only word is "para1here".
I don't want to split the document into more elements than necessary (for example, getting all h1 and p text elements separately), as there is such a wide range of tags I could be matching.
For example, data1data2 would come from:
<td>data1</td><td>data2</td>
Is there a way I can get all the text from the page but also include a space between the tags? I don't want to preserve whitespace otherwise; there is no need to keep line breaks etc., as I am just preparing a keyword string. I will probably collapse all other whitespace to a single space for this reason.
I don't have this issue using JSoup 1.7.3.
Here's the full code I used for testing:
final String html = "<html>\n"
+ " <body>\n"
+ " <h1>Title</h1>\n"
+ " <p>here is para1</p>\n"
+ " <p>here is para2</p>\n"
+ " </body>\n"
+ "</html>";
Document doc = Jsoup.parse(html);
Element element = doc.select("html").first();
System.out.println(element.text());
And the output:
Title here is para1 here is para2
Can you run my code? Also update to a newer version of jsoup if you don't have 1.7.3 yet.
The previous answer is not right: it only works thanks to the "\n" line breaks added at the end of each line, but in reality you may not have a line break at the end of each HTML line.
void example2text() throws IOException {
    String url = "http://www.example.com/";
    // read the whole response body into one string
    String out = new Scanner(new URL(url).openStream(), "UTF-8").useDelimiter("\\A").next();
    org.jsoup.nodes.Document doc = Jsoup.parse(out);
    StringBuilder text = new StringBuilder();
    // visit every element and append its own direct text nodes, each followed by a space
    Elements tags = doc.select("*");
    for (Element tag : tags) {
        for (TextNode tn : tag.textNodes()) {
            String tagText = tn.text().trim();
            if (tagText.length() > 0) {
                text.append(tagText).append(" ");
            }
        }
    }
    System.out.println(text);
}
Based on this answer: https://stackoverflow.com/a/35798214/4642669
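The question also mentions collapsing all remaining whitespace to single spaces before searching for keywords. Here is a minimal sketch of that step with plain String methods, independent of jsoup. The names KeywordText and collapseWhitespace are hypothetical. Note that \s in Java's default regex mode does not match a non-breaking space (U+00A0), so text containing such characters would need extra handling.

```java
// Hypothetical helper (not part of jsoup): prepares extracted page text for keyword search.
class KeywordText {

    // Replaces every run of whitespace (spaces, tabs, line breaks) with one space
    // and strips leading and trailing whitespace.
    static String collapseWhitespace(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String raw = "Title \n here is para1\r\n\there is para2 ";
        // prints: Title here is para1 here is para2
        System.out.println(collapseWhitespace(raw));
    }
}
```

Applied to the text collected in the loop above, this also removes the trailing space left after the last text node.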