I need to validate HTML using Java, so I tried the jsoup library, but some of my test cases fail with it.
For example, this is my HTML content. I have no control over it; it comes from an external source provider.
String invalidHtml = "<div id=\"myDivId\" ' class = claasnamee value='undaa' > <<p> p tagil vanne <br> <span> span close cheythillee!! </p> </div>";
doc = Jsoup.parseBodyFragment(invalidHtml);
For the above HTML I get this output:
<html>
<head></head>
<body>
<div id="myDivId" '="" class="claasnamee" value="undaa">
<
<p> p tagil vanne <br /> <span> span close cheythillee!! </span></p>
</div>
</body>
</html>
The single quote in the string above comes out like this. How can I fix this issue? Can anyone help me, please?
The best place to validate your HTML is http://validator.w3.org/, but that is a manual process. Don't worry, though: jsoup can do this for you as well. The program below is a workaround, but it serves the purpose.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
public class JsoupValidate {
public static void main(String[] args) throws Exception {
String invalidHtml = "<div id=\"myDivId\" ' class = claasnamee value='undaa' > <<p> p tagil vanne <br> <span> span close cheythillee!! </p> </div>";
Document initialDoc = Jsoup.parseBodyFragment(invalidHtml);
Document validatedDoc = Jsoup.connect("http://validator.w3.org/check")
.data("fragment", initialDoc.html())
.data("st", "1")
.post();
System.out.println("******");
System.out.println("Errors");
System.out.println("******");
for(Element error : validatedDoc.select("li.msg_err")){
System.out.println(error.select("em").text() + " : " + error.select("span.msg").text());
}
System.out.println();
System.out.println("**************");
System.out.println("Cleaned output");
System.out.println("**************");
Document cleanedOuput = Jsoup.parse(validatedDoc.select("pre.source").text());
cleanedOuput.select("meta[name=generator]").first().remove();
cleanedOuput.outputSettings().indentAmount(4);
cleanedOuput.outputSettings().prettyPrint(true);
System.out.println(cleanedOuput.html());
}
}
var invalidHtml = "<div id=\"myDivId\" ' class = claasnamee value='undaa' > <<p> p tagil vanne <br> <span> span close cheythillee!! </p> </div>";
var parser = Parser.htmlParser()
.setTrackErrors(10); // Set the number of errors it can track. 0 by default so it's important to set that
var dom = Jsoup.parse(invalidHtml, "" /* this is the default */, parser);
System.out.println(parser.getErrors()); // Do something with the errors, if any
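A slightly fuller sketch of the same idea, wrapped in a class so it can be run on its own. The class and method names (HtmlErrorCheck, collectErrors) are made up for illustration, and the limit of 10 tracked errors is arbitrary; the jsoup calls are Parser.setTrackErrors, Parser.getErrors, ParseError.getPosition and ParseError.getErrorMessage.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.parser.ParseError;
import org.jsoup.parser.Parser;

class HtmlErrorCheck {

    // Hypothetical helper: parses the input and returns one line per tracked parse error.
    static List<String> collectErrors(String html, int maxErrors) {
        // Error tracking is off by default, so a limit has to be set explicitly.
        Parser parser = Parser.htmlParser().setTrackErrors(maxErrors);
        Jsoup.parse(html, "", parser); // the resulting Document is not needed here

        List<String> messages = new ArrayList<>();
        for (ParseError error : parser.getErrors()) {
            messages.add(error.getPosition() + ": " + error.getErrorMessage());
        }
        return messages;
    }

    public static void main(String[] args) {
        String invalidHtml = "<div id=\"myDivId\" ' class = claasnamee value='undaa' > "
                + "<<p> p tagil vanne <br> <span> span close cheythillee!! </p> </div>";
        // An empty list would mean the parser recorded no errors.
        collectErrors(invalidHtml, 10).forEach(System.out::println);
    }
}
```

Note that this reports what jsoup's HTML parser considers a parse error; it is not a full conformance check like the W3C validator.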
I have the HTML snippet below. There are multiple divs with the class "teaser-img" throughout the document. I want to grab the src of every img inside these "teaser-img" divs.
<div class="teaser-img">
<a href="/julien/blog/failure-consciousness-vs-success-consciousness-shifting-focus-become-badass-or-loser">
<img src="http://www.rsdnation.com/files/imagecache/blog_thumbnail/files/blog_thumbs/rsdnatonaustin.jpg" alt="" title=""/>
</a>
</div>
I have tried many things so I wouldn't know what code to share with you guys. Your help will be much appreciated.
final String html = "<div class=\"teaser-img\">\n"
+ " <a href=\"/julien/blog/failure-consciousness-vs-success-consciousness-shifting-focus-become-badass-or-loser\">\n"
+ " <img src=\"http://www.rsdnation.com/files/imagecache/blog_thumbnail/files/blog_thumbs/rsdnatonaustin.jpg\" alt=\"\" title=\"\"/>\n"
+ " </a>\n"
+ "</div>";
// Parse the html from string or eg. connect to a website using connect()
Document doc = Jsoup.parseBodyFragment(html);
for( Element element : doc.select("div.teaser-img img[src]") )
{
System.out.println(element);
}
Output:
<img src="http://www.rsdnation.com/files/imagecache/blog_thumbnail/files/blog_thumbs/rsdnatonaustin.jpg" alt="" title="">
See the jsoup documentation for details about the selector syntax.
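The loop above prints the whole img elements. If only the URLs are wanted, a minimal sketch could collect the src attribute values with Elements.eachAttr. The class name, method name, and example.com URLs below are placeholders, not part of the original page.

```java
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class TeaserImages {

    // Returns the src value of every img inside a div with class "teaser-img".
    static List<String> teaserImageUrls(String html) {
        Document doc = Jsoup.parseBodyFragment(html);
        return doc.select("div.teaser-img img[src]").eachAttr("src");
    }

    public static void main(String[] args) {
        // Placeholder markup with two teaser blocks.
        String html = "<div class=\"teaser-img\"><a href=\"/a\"><img src=\"http://example.com/a.jpg\"/></a></div>"
                + "<div class=\"teaser-img\"><a href=\"/b\"><img src=\"http://example.com/b.jpg\"/></a></div>";
        for (String url : teaserImageUrls(html)) {
            System.out.println(url);
        }
    }
}
```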
So, I have an XHTML report skeleton that I want to populate by getting elements with certain IDs and setting their contents.
I tried getElementById(), and had null returned (because, as I found out, id is not implicitly "id" and needs to be declared in a schema).
panel.setDocument(Main.class.getResource("/halreportview/DefaultSiteDetails.html").toString());
panel = populateDefaultReport(panel);
Element header1 = panel.getDocument().getElementById("header1");
header1.setTextContent("<span class=\"b\">Instruction Type:</span> Example<br/><span class=\"b\">Allocated To:</span> "+employee.toString()+"<br/><span class=\"b\">Scheduled Date:</span> "+dateFormat.format(scheduledDate));
So, I tried some work-arounds because I don't want to have to validate my XHTML documents. I tried adding a quick DTD to the top of the file in question like so;
<?xml version="1.0"?>
<!DOCTYPE foo [<!ATTLIST bar id ID #IMPLIED>]>
But getElementById() still returned null. So I tried using xml:id instead of id in the XHTML document in the hope that it was supported, but again no luck. So instead I tried to use getElementsByTagName() and loop through the results, checking ids. This worked and found the correct element (as confirmed by the output printing "Found it"), but when I try to call setTextContent on this element I still get a NullPointerException. Code below;
Element header1;
NodeList sections = panel.getDocument().getElementsByTagName("p");
for (int i = 0; i < sections.getLength(); ++i) {
if (((Element)sections.item(i)).getAttribute("id").equals("header1")) {
System.out.println("Found it");
header1 = (Element) sections.item(i);
header1.setTextContent("<span class=\"b\">Instruction Type:</span> Example<br/><span class=\"b\">Allocated To:</span> "+employee.toString()+"<br/><span class=\"b\">Scheduled Date:</span> "+dateFormat.format(scheduledDate));
}
}
I'm losing my mind on this one. I must be suffering from some kind of fundamental misunderstanding of how this is supposed to work. Any ideas?
Edit; Excerpt from my XHTML file below with CSS removed.
<html>
<head>
<title>Site Details</title>
<style type="text/css">
</style>
</head>
<body>
<div class="header">
<p></p>
<img src="#" alt="Logo" height="81" width="69"/>
<p id="header1"><span class="b">Instruction Type:</span> Example<br/><span class="b">Allocated To:</span> Example<br/><span class="b">Scheduled Date:</span> Example</p>
</div>
</body>
</html>
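One point that is independent of the NullPointerException: setTextContent() inserts its argument as literal text, so the span and br markup in the string would show up as visible characters rather than as elements. A minimal sketch of staying with the W3C DOM API and building the children as nodes is below. It uses only the JDK's XML parser, not the panel class from the question, looks the element up with an XPath on the plain id attribute (so no DTD is needed), and the class and helper names are made up for illustration. It does not diagnose the exception itself.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class HeaderFiller {

    static Document parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
    }

    // Finds an element by its plain "id" attribute; returns null when nothing matches.
    // The id is assumed not to contain a single quote.
    static Element findById(Document doc, String id) throws Exception {
        return (Element) XPathFactory.newInstance().newXPath()
                .evaluate("//*[@id='" + id + "']", doc, XPathConstants.NODE);
    }

    // Appends <span class="b">label</span> followed by " value" as a text node.
    static void appendField(Document doc, Element parent, String label, String value) {
        Element span = doc.createElement("span");
        span.setAttribute("class", "b");
        span.setTextContent(label);
        parent.appendChild(span);
        parent.appendChild(doc.createTextNode(" " + value));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<html><body><p id=\"header1\">old content</p></body></html>");
        Element header1 = findById(doc, "header1");
        header1.setTextContent(""); // removes the existing children
        appendField(doc, header1, "Instruction Type:", "Example");
        header1.appendChild(doc.createElement("br"));
        appendField(doc, header1, "Allocated To:", "dummyEmployee");
        System.out.println(header1.getTextContent());
    }
}
```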
I am not sure why it's not working, but I have put together an example for you, and it works!
Note: my example uses the following libraries:
Apache Commons IO
Jsoup HTML Parser
Apache Commons Lang
My input XHTML file:
<html>
<head>
<title>Site Details</title>
<style type="text/css">
</style>
</head>
<body>
<div class="header">
<p></p>
<img src="#" alt="Logo" height="81" width="69" />
<p id="header1">
<span class="b">Instruction Type:</span> Example<br />
<span class="b">Allocated To:</span> Example<br />
<span class="b">Scheduled Date:</span> Example
</p>
</div>
</body>
</html>
The Java code that works (read the comments):
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Date;
import org.apache.commons.io.FileUtils;
import org.apache.commons.io.IOUtils;
import org.apache.commons.lang3.StringEscapeUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class Test {
/**
* @param args
* @throws IOException
* @throws InterruptedException
*/
public static void main(String[] args) throws IOException, InterruptedException {
//loading file from project
//When it is exported as JAR the files inside jar are not files but they are stream
InputStream stream = Test.class.getResourceAsStream("/test.xhtml");
//convert stream to file
File xhtmlfile = File.createTempFile("xhtmlFile", ".tmp");
FileOutputStream fileOutputStream = new FileOutputStream(xhtmlfile);
IOUtils.copy(stream, fileOutputStream);
xhtmlfile.deleteOnExit();
//get html string from file
String htmlString = FileUtils.readFileToString(xhtmlfile);
//parse using jsoup
Document doc = Jsoup.parse(htmlString);
//get all elements
Elements allElements = doc.getAllElements();
for (Element el : allElements) {
//if element id is header 1
if (el.id().equals("header1")) {
//dummy emp name
String employeeName = "dummyEmployee";
//update text
el.text("<span class=\"b\">Instruction Type:</span> Example<br/><span class=\"b\">Allocated To:</span> "
+ employeeName.toString() + "<br/><span class=\"b\">Scheduled Date:</span> " + new Date());
//dont loop further
break;
}
}
//now get html from the updated document
String html = doc.html();
//el.text() escaped the markup we inserted, so we need to unescape the html
String escapeHtml4 = StringEscapeUtils.unescapeHtml4(html);
//print html
System.out.println(escapeHtml4);
}
}
Output:
<html>
<head>
<title>Site Details</title>
<style type="text/css">
</style>
</head>
<body>
<div class="header">
<p></p>
<img src="#" alt="Logo" height="81" width="69" />
<p id="header1"><span class="b">Instruction Type:</span> Example<br/><span class="b">Allocated To:</span> dummyEmployee<br/><span class="b">Scheduled Date:</span> Sat Nov 02 07:37:12 GMT 2013</p>
</div>
</body>
</html>
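A possible simplification of the approach above, as a sketch: jsoup's Element.html(String) sets the inner HTML directly, so the new markup is parsed into child elements and no unescape step over the whole document is needed. Document.getElementById also replaces the loop over all elements. The class name, method name, and parameters are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class HeaderUpdate {

    // Replaces the content of the element with id "header1" and returns the updated document.
    static String fillHeader(String xhtml, String employeeName, String scheduledDate) {
        Document doc = Jsoup.parse(xhtml);
        Element header = doc.getElementById("header1");
        if (header == null) {
            throw new IllegalArgumentException("no element with id header1");
        }
        // html(...) parses the string as markup; the values are inserted as-is,
        // so escape them first if they may contain characters such as < or &.
        header.html("<span class=\"b\">Instruction Type:</span> Example<br/>"
                + "<span class=\"b\">Allocated To:</span> " + employeeName + "<br/>"
                + "<span class=\"b\">Scheduled Date:</span> " + scheduledDate);
        return doc.html();
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><div class=\"header\"><p id=\"header1\">placeholder</p></div></body></html>";
        System.out.println(fillHeader(xhtml, "dummyEmployee", "2013-11-02"));
    }
}
```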
I have the template and code below to generate "tags" in my web application, as indicated in the sample output:
Template:
<p class="tag" data-field="tags">Tags:
</p>
Java code:
@DataField
DivElement tags = DOM.createElement("p").cast();
@Override
public void setModel(MyModel model) {
binder.setModel(model, InitialState.FROM_MODEL);
for (String tag : model.getTags()){
Anchor a = new Anchor();
a.setText(tag);
a.setHref("#Tags?id=" + tag);
tags.appendChild(a.getElement());
Label comma = new Label(",");
tags.appendChild(comma.getElement());
}
}
HTML Output (Browser):
<p data-field="tags" class="tag">Tags:
<a class="gwt-Anchor" href="#Tags?id=test">test</a>
<div class="gwt-Label">,</div>
<a class="gwt-Anchor" href="#Tags?id=tagg">tagg</a>
<div class="gwt-Label">,</div>
<a class="gwt-Anchor" href="#Tags?id=new">new</a>
<div class="gwt-Label">,</div>
</p>
The problem I face now is that the HTML output when run from the browser should be like this:
<p data-field="tags" class="tag">Tags:
<a class="gwt-Anchor" href="#Tags?id=test">test</a>,
<a class="gwt-Anchor" href="#Tags?id=tagg">tagg</a>,
<a class="gwt-Anchor" href="#Tags?id=new">new</a>
</p>
That is, without a gwt-Label DIV created in between.
Instead of
Label comma = new Label(",");
tags.appendChild(comma.getElement());
Use
tags.setInnerHTML(tags.getInnerHTML() + ",");
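Instead of appending the Label's element, try appending a plain text node (Document here is com.google.gwt.dom.client.Document):
tags.appendChild(Document.get().createTextNode(","));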
I'd recommend using SafeHtml. SafeHtml templates could also make your code simpler.
public interface MyTemplate extends SafeHtmlTemplates {
@Template("<a class=\"gwt-Anchor\" href=\"#Tags?id={0}\">{1}</a>{2}")
SafeHtml getTag(String url, String text, String comma);
}
(the last argument, "comma" can either be ", " or "").
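To illustrate how the separator argument would be chosen per tag, here is a plain-Java stand-in, not GWT code: renderTag mimics the three template slots with string concatenation and does no escaping (which the real SafeHtmlTemplates handles), and the class and method names are invented for this sketch.

```java
import java.util.Arrays;
import java.util.List;

class TagLinks {

    // Stand-in for the template: {0} = id, {1} = link text, {2} = separator.
    static String renderTag(String id, String text, String comma) {
        return "<a class=\"gwt-Anchor\" href=\"#Tags?id=" + id + "\">" + text + "</a>" + comma;
    }

    // Every tag except the last one is followed by ", ".
    static String renderTags(List<String> tags) {
        StringBuilder html = new StringBuilder();
        for (int i = 0; i < tags.size(); i++) {
            String comma = (i < tags.size() - 1) ? ", " : "";
            html.append(renderTag(tags.get(i), tags.get(i), comma));
        }
        return html.toString();
    }

    public static void main(String[] args) {
        System.out.println(renderTags(Arrays.asList("test", "tagg", "new")));
    }
}
```

In the GWT version, the same loop would pass the computed separator as the third argument of getTag and append the resulting SafeHtml to the paragraph.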
In my Java application I have Strings that have to be edited. The problem is that these Strings can contain HTML tags/elements, which should not be edited (there is no id to retrieve an element by).
Scenario (add -):
String a = "<span> <table> </table> </span> <div></div> <div> text 2</div>";
should become: <span> <table> </table> </span> <div></div> <div> -text 2</div>
String b = "text";
should become: -text
String c = "<p> t </p>";
should become: <p> -t </p>
My question is: how can I retrieve the text in a string that can contain HTML tags (I cannot add an id or class)?
You can use an XML parsing library.
String newText = null;
for ( Node node : document.nodes() ) {
if ( node.text() != null ) newText = "-" + node.text();
}
Note that this is pseudocode.
newText will now be -text or whatever the node text is.
EDIT:
Your question is a bit ambiguous in terms of "the text can contain html elements."
If it doesn't contain html tags, then you cannot use an XML parser, which brings up the question.. if it doesn't contain tags, then why can't you just do...
String newString = "-" + a;
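Since the input is HTML rather than strict XML, one way to turn the pseudocode into something runnable is jsoup, which is an assumption here and not something the answer names. The sketch below walks the parsed tree and puts the prefix in front of the first non-whitespace character of every non-blank text node, leaving the tags untouched. The class and method names are hypothetical, and pretty-printing is switched off so the original whitespace is kept.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

class TextPrefixer {

    // Parses the fragment, prefixes every non-blank text node, and returns the body markup.
    static String prefixText(String html, String prefix) {
        Document doc = Jsoup.parseBodyFragment(html);
        doc.outputSettings().prettyPrint(false); // keep whitespace as it was
        prefixTextNodes(doc.body(), prefix);
        return doc.body().html();
    }

    private static void prefixTextNodes(Node node, String prefix) {
        if (node instanceof TextNode) {
            TextNode textNode = (TextNode) node;
            String text = textNode.getWholeText();
            if (!text.trim().isEmpty()) {
                // Skip leading whitespace so " t " becomes " -t ".
                int start = 0;
                while (Character.isWhitespace(text.charAt(start))) {
                    start++;
                }
                textNode.text(text.substring(0, start) + prefix + text.substring(start));
            }
            return;
        }
        for (Node child : node.childNodes()) {
            prefixTextNodes(child, prefix);
        }
    }

    public static void main(String[] args) {
        System.out.println(prefixText("text", "-"));
        System.out.println(prefixText("<p> t </p>", "-"));
    }
}
```

If a string has several non-blank text nodes, each of them gets the prefix; restricting it to the first or last one would need an extra flag in the traversal.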
<div></div>
<div></div>
<div></div>
<div></div>
<ul>
<form id=the_main_form method="post">
<li>
<div></div>
<div> <h2>
<a onclick="xyz;" target="_blank" href="http://sample.com" style="text-decoration:underline;">This is sample</a>
</h2></div>
<div></div>
<div></div>
</li>
There are 50 <li> elements like that.
This is a snippet from a much larger HTML page.
<div> </div> means there was data between the tags; I removed it because it is not relevant.
I would like to know what the jsoup select statement should look like to extract the href and the text.
I tried doc.select("div div div ul xxxx");
where xxxx is the form. Should I give the form id, or how else should I do that?
Try this:
Elements allLis = doc.select("#the_main_form > li ");
for (Element li : allLis) {
Element a = li.select("div:eq(1) > h2 > a").first(); // select() returns Elements, so take the first match
String href = a.attr("href");
String text = a.text();
}
Hope it helps!
EDIT:
Elements allLis = doc.select("#the_main_form > li ");
This part of the code gets all <li> tags that are inside the <form> with id #the_main_form.
Element a = li.select("div:eq(1) > h2 > a").first();
Then we iterate over all the <li> tags and get <a> tags, by first getting <div> tags ( the second one inside all <li>s by using index=1 -> div:eq(1)) then getting <h2> tags, where our <a> tags are present.
Hope you understand now! :)
Please try this:
package com.stackoverflow.works;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
/*
* @author: sarath_sivan
*/
public class HtmlParserService {
public static void parseHtml(String html) {
Document document = Jsoup.parse(html);
Element linkElement = document.select("a").first();
String linkHref = linkElement.attr("href"); // "http://sample.com"
String linkText = linkElement.text(); // "This is sample"
System.out.println(linkHref);
System.out.println(linkText);
}
public static void main(String[] args) {
String html = "<a onclick=\"xyz;\" target=\"_blank\" href=\"http://sample.com\" style=\"text-decoration:underline;\">This is sample</a>";
parseHtml(html);
}
}
Hope you have the Jsoup Library in your classpath.
Thank you!