HTML Parsing using Jsoup.Jar

Document doc = Jsoup.connect("http://reviews.opentable.com/0938/9/reviews.htm").get();
Element part = doc.body();
Elements parts = part.getElementsByTag("span");
String attValue;
String html;
for (Element ent : parts)
{
    if (ent.hasAttr("class"))
    {
        attValue = ent.attr("class");
        if (attValue=="BVRRReviewText description")
        {
            System.out.println("\n");
            html = ent.text();
            System.out.println(html);
        }
    }
}
I am using Jsoup.jar for the above program.
I am accessing the webpage, and my aim is to print the text found within the tag <span class="BVRRReviewText description">text</span>.
But nothing is printed as output. No content is ever assigned to the String html in the program, although attValue does receive all the class attribute values of the span tags.
Where did I go wrong? Please advise.

if(attValue=="BVRRReviewText description")
should be
if(attValue.equals("...")) surely?
This is Java, not Javascript.
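To make the difference concrete: in Java, == on two String variables compares object references, while equals() compares the characters. The sketch below uses only the standard library; the class and method names (StringCompareSketch, sameReference, sameContent) are made up for illustration, and new String(...) stands in for a value built at run time, as an attribute value read from a parsed page would be.

```java
final class StringCompareSketch {

    /** Reference comparison: true only when both variables point at the same object. */
    static boolean sameReference(String a, String b) {
        return a == b;
    }

    /** Content comparison: true when both strings hold the same characters. */
    static boolean sameContent(String a, String b) {
        return a.equals(b);
    }

    public static void main(String[] args) {
        String literal = "BVRRReviewText description";
        // A distinct String object with the same characters as the literal.
        String runtime = new String(literal);

        System.out.println(sameReference(runtime, literal)); // false
        System.out.println(sameContent(runtime, literal));   // true
    }
}
```

Writing the literal first, as in "BVRRReviewText description".equals(attValue), also avoids a NullPointerException if attValue is ever null.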

Change
attValue=="BVRRReviewText description"
to
attValue.matches("...")
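One caveat with this suggestion: String.matches interprets its argument as a regular expression and has to match the entire string. For a plain class name that makes no difference, but a value containing regex metacharacters behaves differently unless it is quoted. A small standard-library sketch; the names MatchesSketch and matchesLiteral and the "note (draft)" value are hypothetical:

```java
import java.util.regex.Pattern;

final class MatchesSketch {

    /** Whole-string match against a literal, with regex metacharacters neutralised. */
    static boolean matchesLiteral(String value, String literal) {
        return value.matches(Pattern.quote(literal));
    }

    public static void main(String[] args) {
        // No metacharacters: behaves like equals().
        System.out.println("BVRRReviewText description".matches("BVRRReviewText description")); // true

        // Parentheses form a regex group, so the pattern expects "note draft".
        System.out.println("note (draft)".matches("note (draft)")); // false

        // Quoting the literal restores the expected behaviour.
        System.out.println(matchesLiteral("note (draft)", "note (draft)")); // true
    }
}
```

For an exact comparison, equals() states the intent more directly than matches().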

Related

Use JSoup to get all textual links

I'm using JSoup to grab content from web pages.
I want to get all the links on a page that contain some text. It doesn't matter what the text is; it just needs to be non-empty, and not only an image or similar.
Example of links I want:
<a href="...">Link to Some Page</a>
Since it contains the text "Link to Some Page"
Links I don't want:
<a href="..."><img src="someimage.jpg"/></a>
My code looks like this. How can I modify it to only get the first type of link?
Document document = // I get my document object
Elements linksOnPage = document.select("a[href]");
for (Element page : linksOnPage) {
    String link = page.attr("abs:href");
    // I do stuff with the link
}

You could do something like this.
It does its job, though it's probably not the fanciest solution out there.
Note: the function text() returns clean text, so if there are any HTML code fragments inside the link, it won't return them.
Document doc = // get the doc
Elements linksOnPage = doc.select("a");
for (Element pageElem : linksOnPage) {
    // Skip links that have no visible text (for example image-only links)
    if (pageElem.text().trim().equals(""))
        continue;
    String link = pageElem.attr("abs:href");
    // do something with it
}

I am using this and it's working fine:
Document document = // I get my document object
Elements linksOnPage = document.select("a:matches(([^\\s]+))");
for (Element page : linksOnPage) {
    String link = page.attr("abs:href");
    // I do stuff with the link
}
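The regular expression in that selector, [^\s]+, means "at least one non-whitespace character". As far as I know, Jsoup's :matches(regex) applies the pattern to the element's text with a find-style search, but treat that as an assumption and check the selector documentation. The standard-library sketch below only shows what the pattern itself accepts; NonBlankTextSketch and hasVisibleText are hypothetical names.

```java
import java.util.regex.Pattern;

final class NonBlankTextSketch {

    // Same pattern as in the selector: one or more non-whitespace characters.
    private static final Pattern NON_WHITESPACE = Pattern.compile("[^\\s]+");

    /** True when the given link text contains at least one non-whitespace character. */
    static boolean hasVisibleText(String linkText) {
        return NON_WHITESPACE.matcher(linkText).find();
    }

    public static void main(String[] args) {
        System.out.println(hasVisibleText("Link to Some Page")); // true
        System.out.println(hasVisibleText(""));                  // false
        System.out.println(hasVisibleText("  \n\t"));            // false
    }
}
```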

How can I use Jsoup to turn unallowed html tag delimiter into entities where there are unallowed tags

Using Jsoup clean is it possible to convert this string:
Here is some <b>important</b> stuff that can't have
<script>javascript</script> or the following embed tag
<embed src="helloworld.swf" type="application/vnd.adobe.flash-movie"> movie
in the output
to this:
Here is some <b>important</b> stuff that can't have
&lt;script&gt;javascript&lt;/script&gt; or the following embed tag
&lt;embed src="helloworld.swf" type="application/vnd.adobe.flash-movie"&gt;
movie in the output
so it renders
Here is some important stuff that can't have
<script>javascript</script> or the following embed tag
<embed src="helloworld.swf" type="application/vnd.adobe.flash-movie">
movie in the output
Where the bold tag is allowed and left alone, but the delimiters of the script and embed tags change from < and > to &lt; and &gt;, so they are treated as just text and not real HTML elements.
What settings are necessary to accomplish this? I have:
private static String limitHtml(String value) {
    String result = value;
    if (value != null && !value.isEmpty()) {
        Document.OutputSettings settings = new Document.OutputSettings();
        settings.prettyPrint(false);
        // what other settings ???
        Whitelist whitelist = Whitelist.none().addTags(ALLOWED_HTML_TAGS);
        whitelist.addAttributes(":all", ALLOWED_HTML_ATTRIBUTES);
        result = Jsoup.clean(value, "", whitelist, settings);
    }
    return result;
}
Is there a similar Java library that can accomplish this if Jsoup can't?

Jsoup can definitely help you here. The trick is to use a dummy document (the transitional variable in the code) with a single pre element in it.
We will simply add each unallowed element found in this pre element.
Later, we replace the unallowed element in the initial value with its escaped html code.
CODE
// Comma separated list of allowed tags.
private static String ALLOWED_HTML_TAGS_CSS_QUERY = "b,span";

private static String limitHtml(String value) {
    String result = value;
    if (value != null && !value.isEmpty()) {
        // Build a sided document. It will help us escape unallowed tags.
        Document transitional = Jsoup.parse("<pre></pre>");
        // Parse the actual value for finding unallowed tags
        Document doc = Jsoup.parseBodyFragment(value, "");
        Elements unallowedElements = doc.select("*:not("+ALLOWED_HTML_TAGS_CSS_QUERY+")");
        for (Element e : unallowedElements) {
            switch (e.tagName()) {
                case "#root": case "html": case "head": case "body":
                    // Those tags are added automatically by Jsoup. Nothing to do...
                    break;
                default:
                    // Load the unallowed element to escape its html code in the transitional document
                    Element pre = transitional.select("pre").first().text(e.outerHtml());
                    // Replace unallowed element with its escape html code
                    e.replaceWith(new TextNode(pre.text(), ""));
            }
        }
        // Get the final sanitized value
        Document.OutputSettings settings = new Document.OutputSettings();
        settings.prettyPrint(false);
        Whitelist whitelist = Whitelist.none().addTags(ALLOWED_HTML_TAGS);
        whitelist.addAttributes(":all", ALLOWED_HTML_ATTRIBUTES);
        result = Jsoup.clean(doc.body().html(), "", whitelist, settings);
    }
    return result;
}
SAMPLE USAGE
String unsanitizedHtml = "Here is some <b>important</b> stuff that can't have " + //
"<script>javascript</script> or the following embed tag " + //
"<embed src=\"helloworld.swf\" type=\"application/vnd.adobe.flash-movie\"> movie" + //
"in the output";
System.out.println("BEFORE:\n" + unsanitizedHtml);
System.out.println();
System.out.println("AFTER:\n" + limitHtml(unsanitizedHtml));
OUTPUT
BEFORE:
Here is some <b>important</b> stuff that can't have <script>javascript</script> or the following embed tag <embed src="helloworld.swf" type="application/vnd.adobe.flash-movie"> moviein the output
AFTER:
Here is some <b>important</b> stuff that can't have &lt;script&gt;javascript&lt;/script&gt; or the following embed tag &lt;embed src="helloworld.swf" type="application/vnd.adobe.flash-movie"&gt; moviein the output

Replace string with jsoup only in text portions

I have found several topics with similar questions and valuable answers, but I am still struggling with this:
I want to parse some html with Jsoup so I can replace, for example,
"changeme"
with
<changed>changeme</changed>
, but only if it appears in a text portion of the html, not if it is part of a tag. So, starting with this html:
<body>
<p>test changeme app</p>
</BODY>
</HTML>
I would want to get to this:
<body>
<p>test <changed>changeme</changed> app</p>
</BODY>
</HTML>
I have tried several approaches; this is the one that gets me closest to the desired result:
Document doc = null;
try {
    doc = Jsoup.parse(new File("tmp1450348256397.txt"), "UTF-8");
} catch (Exception ex) {
}
Elements els = doc.body().getAllElements();
for (Element e : els) {
    if (e.text().contains("changeme")) {
        e.html(e.html().replaceAll("changeme","<changed>changeme</changed>"));
    }
}
html = doc.toString();
System.out.println(html);
But with this approach I find two problems:
<body>
<p><a href="http://<changed>changeme</changed> .html">test
<changed>
changeme
</changed>
app</a></p>
</BODY>
</HTML>
Line breaks are inserted before and after the new element I am introducing. This is not a real problem, as I could get rid of them by using #changed# as a placeholder during the replacement and swapping it for the desired value (with < >) after doc.toString().
The real problem: The URL in the href has been modified, and I don't want it to happen.
Ideas? Thx.

Here is my solution:
String html = ""
        + "<p><a href=\"http://changeme.html\">"
        + "test changeme "
        + "<div class=\"changeme\">"
        + "inner text changeme"
        + "</div>"
        + " app</a>"
        + "</p>";
Document doc = Jsoup.parse(html);
Elements els = doc.body().getAllElements();
for (Element e : els) {
    List<TextNode> tnList = e.textNodes();
    for (TextNode tn : tnList) {
        String orig = tn.text();
        tn.text(orig.replaceAll("changeme","<changed>changeme</changed>"));
    }
}
html = doc.toString();
System.out.println(html);
TextNodes are always leaf nodes, i.e. they do not contain more HTML elements. In your original approach you replace the HTML of an element with new HTML in which the changeme strings have been replaced. You only check that changeme is part of the element's text, but you then replace every occurrence in the HTML string of the element, including all occurrences outside TextNodes, such as attribute values.
My solution basically works like yours, but I use the JSoup method textNodes(). This way I don't need to typecast.
P.S.
Of course, my solution as well as yours will contain &lt;changed&gt;changeme&lt;/changed&gt; instead of <changed>changeme</changed> in the end. This may or may not be what you want. If you do not want this, then your result is no longer valid HTML, since changed is not a valid HTML tag. Jsoup will not help you in this case. However, you can of course replace all occurrences of &lt;changed&gt;changeme&lt;/changed&gt; in the resulting string again, outside Jsoup.
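That last step, undoing the escaping outside Jsoup, could look roughly like the sketch below. It uses plain String.replace on the serialized output and assumes the serializer wrote the tag delimiters as &lt; and &gt;. RestoreTagSketch and restoreChangedTags are hypothetical names, not part of Jsoup.

```java
final class RestoreTagSketch {

    /**
     * Turns the escaped marker tags in already-serialized HTML back into real tags.
     * Only the exact "changed" open and close tags are touched.
     */
    static String restoreChangedTags(String serializedHtml) {
        return serializedHtml
                .replace("&lt;changed&gt;", "<changed>")
                .replace("&lt;/changed&gt;", "</changed>");
    }

    public static void main(String[] args) {
        String escaped = "<p>test &lt;changed&gt;changeme&lt;/changed&gt; app</p>";
        System.out.println(restoreChangedTags(escaped));
        // <p>test <changed>changeme</changed> app</p>
    }
}
```

Be aware that this also unescapes any &lt;changed&gt; that was already present as literal text in the source document, so a more unusual placeholder may be safer if that can occur.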

I think your issue is that you're replacing the element's HTML rather than just its text. Change:
e.html(e.html().replaceAll("changeme","<changed>changeme</changed>"));
to
e.text(e.text().replaceAll("changeme","<changed>changeme</changed>"));
The line break issue can probably be solved by calling doc.outputSettings().prettyPrint(false); before html = doc.toString();

Finally I tried this solution (at the end of the question), using TextNodes:
How I can replace "text" in the each tag using Jsoup
This is the resulting code:
Elements els = doc.body().getAllElements();
for (Element e : els) {
    for (Node child : e.childNodes()) {
        if (child instanceof TextNode && !((TextNode) child).isBlank()) {
            ((TextNode) child).text(((TextNode) child).text().replaceAll("changeme","<changed>changeme</changed>"));
        }
    }
}
Now the output is as expected, and it does not even introduce extra line breaks. In this case prettyPrint must be set to true.
The only problem is that I don't really understand the difference between using TextNode and Element.text(). If someone can provide some information, it will be much appreciated.
Thanks.

Wrong URL when parsing HTML with Jsoup Android

Could you help me with parsing an HTML site?
I need to get the src of an image and a link to another page, but I don't know why I get an empty list.
This is my code:
Elements elems2 = doc.select("div");
for (Element elem2 : elems2) {
    if (elem2.attr("class").equals("grid-box-img")) {
        System.out.println(elem2.attr("img"));
        kfunewphoto.add(elem2.attr("src"));
    }
}
and example of html:
<div class="grid-box-img"><img width="680" height="470" src="https://i.stack.imgur.com/c7PGK.png" class="attachment-full wp-post-image" alt="shou-talanty-uspej-uvidet-pervym-clever-russia" /></div>
I need to get "http://cleverrussia.com/wp-content/uploads/2014/10/shou-talanty-uspej-uvidet-pervym-clever-russia.png". This is the second part of the code:
Elements elems = doc.select("h2");
for (Element elem : elems) {
    if (elem.attr("class").equals("entry-title")) {
        str = elem.text();
        kfunews.add(elem.text());
        kfunewslist1.add(elem.attr("href"));
    }
}
<h2 class="entry-title">Шоу “Таланты”. Успей увидеть первым!</h2>
And I need to get: "http://cleverrussia.com/shou-talanty-uspej-uvidet-pervym/"
This is the full source of the page - view-source:http://cleverrussia.com/

The error is that you're trying to select img and a as attributes. Check the below code to see how to fix your code.
// Prints the image source
System.out.println(elem2.select("img").attr("src"));
kfunewphoto.add(elem2.select("img").attr("src"));
// Prints the target link
System.out.println(elem.select("a").attr("href"));
kfunewslist1.add(elem.select("a").attr("href"));

adding text before and after a link jSoup

I've just started learning Jsoup and working through the cookbook on their website, but I'm a bit stuck with adding text to an element I've parsed.
try {
    Document doc = Jsoup.connect(url).get();
    Element add = doc.prependText("a href");
    Elements links = add.select("a[href]");
    for (Element link : links) {
        PrintStream sb = System.out.format("%n %s", link.attr("abs:href"));
        System.out.print("<br>");
    }
}
catch (Exception e) {
    System.out.print("error --> " + e);
}
Example run with google.com I get
http://www.google.ie/imghp?hl=en&tab=wi<br>
http://maps.google.ie/maps?hl=en&tab=wl<br>
https://play.google.com/?hl=en&tab=w8<br>
But I really want
<a href> http://www.google.ie/imghp?hl=en&tab=wi<br></a>
<a href> http://maps.google.ie/maps?hl=en&tab=wl<br></a>
<a href> https://play.google.com/?hl=en&tab=w8<br></a>
With this code I've gotten all the links off the page, but I also want to get the <a href> and </a> tags so I can then create my own webpage. I've tried adding a string and using prependText, but I just can't seem to get it right.
Thanks

With link.attr(...) you get the attribute value.
But you need the whole tag:
Document doc = Jsoup.connect(...).get();
for (Element e : doc.select("a[href]")) // Select all 'a' tags with an 'href' attribute
{
    String wholeTag = e.toString(); // Get the element as an HTML string
    /* Now you can use the HTML - in this example for a simple output */
    System.out.println(wholeTag);
}
