I am trying to validate a String of HTML code. That is, when the HTML syntax is wrong I want to know, perhaps in the form of a return value of false.
I am currently using JTidy, but it doesn't tell me there was bad syntax; it just corrects it. I don't need to correct the markup, just to know whether the syntax is good or bad.
JTidy code:
String s = "<td>cookie<td>"; // bad syntax.
Tidy tidy = new Tidy();
InputStream stream = new ByteArrayInputStream(s.getBytes(StandardCharsets.UTF_8));
tidy.parse(stream, System.out);
Any help is appreciated.
Java has a built-in DOM parser.
Use the DOM parser to check; it will also report the errors.
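For example, a minimal sketch along those lines (this assumes the input is meant to be well-formed XHTML, since the JDK's DocumentBuilder is a strict XML parser, not a lenient HTML parser):
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXException;

public class MarkupChecker {
    // Returns false when the markup is not well-formed.
    public static boolean isWellFormed(String html) {
        try {
            DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            return false; // parse error: bad syntax
        } catch (Exception e) {
            return false; // I/O or parser configuration problem
        }
    }
}
With the input from the question, "<td>cookie<td>", this returns false because the inner td element is never closed.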
I am using the JSoup library in Java to sanitize input to prevent XSS attacks. It works well for simple inputs like <script>alert('vulnerable')</script>.
Example:
String data = "<script>alert('vulnerable')</script>";
data = Jsoup.clean(data, Whitelist.none());
data = StringEscapeUtils.unescapeHtml4(data); //StringEscapeUtils from apache-commons lib
System.out.println(data);
Output: ""
However, if I tweak the input to the following, JSoup cannot sanitize the input.
String data = "<<b>script>alert('vulnerable');<</b>/script>";
data = Jsoup.clean(data, Whitelist.none());
data = StringEscapeUtils.unescapeHtml4(data);
System.out.println(data);
Output: <script>alert('vulnerable');</script>
This output is obviously still prone to XSS attacks. Is there a way to fully sanitize the input so that all HTML tags are removed from it?
Not sure if this is the best solution, but a temporary workaround would be parsing the raw text into a Document and then cleaning the combined text of the document element and all its children:
String unsafe = "<<b>script>alert('vulnerable');<</b>/script>";
Document doc = Jsoup.parse(unsafe);
String safe = Jsoup.clean(doc.text(), Whitelist.none());
System.out.println(safe);
Wait for someone else to come up with the best solution.
The problem is that you are unescaping the safe HTML that jsoup has produced. The output of the Cleaner is HTML. The none safelist (Whitelist.none()) passes no tags through, only the text nodes, as HTML.
So the input:
<<b>script>alert('vulnerable');<</b>/script>
Run through the Cleaner, this returns:
&lt;script&gt;alert('vulnerable');&lt;/script&gt;
which is perfectly safe for presenting as HTML. See https://try.jsoup.org/~hfn2nvIglfl099_dVxLQEPxekqg
Just don't include the unescape line.
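In other words, a minimal sketch of the same pipeline with the unescape step left out stays safe:
String data = "<<b>script>alert('vulnerable');<</b>/script>";
String safe = Jsoup.clean(data, Whitelist.none());
// safe now holds "&lt;script&gt;alert('vulnerable');&lt;/script&gt;", which a browser renders as plain text
System.out.println(safe);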
I really need help extracting Microdata embedded in HTML5. My goal is to get structured data from a webpage, just like Google's rich snippets tool: http://www.google.com/webmasters/tools/richsnippets. I have searched a lot but haven't found a workable solution.
Currently I use the any23 library, but I can't find any documentation, only the Javadocs, which don't provide enough information for me.
I use any23's Microdata Extractor but get stuck at the third parameter: "org.w3c.dom.Document in". I can't parse HTML content into a W3C DOM. I have tried JTidy as well as JSoup, but the DOM objects in those libraries don't fit the extractor's constructor. In addition, I'm also unsure about the 2nd parameter of the Microdata Extractor.
I hope someone can help me with any23 or suggest another library that can solve this extraction issue.
Edit: I found a solution myself by doing the same thing the any23 command-line tool does. Here is the snippet of code:
HTTPDocumentSource doc = new HTTPDocumentSource(DefaultHTTPClient.createInitializedHTTPClient(), value);
InputStream documentInputStream = doc.openInputStream();
TagSoupParser tagSoupParser = new TagSoupParser(documentInputStream, doc.getDocumentURI());
Document document = tagSoupParser.getDOM();
ByteArrayOutputStream byteArrayOutput = new ByteArrayOutputStream();
MicrodataParser.getMicrodataAsJSON(document, new PrintStream(byteArrayOutput));
String result = byteArrayOutput.toString("UTF-8");
These lines of code only extract the microdata from HTML and write it out in JSON format. I tried to use MicrodataExtractor, which can change the output format to others (RDF, Turtle, ...), but its input seems to only accept XML. It throws "Document didn't start" when I pass in an HTML document.
If anyone found the way to use MicrodataExtractor, please leave the answer here.
Thank you.
XPath is generally the way to consume HTML or XML.
Have a look at: How to read XML using XPath in Java
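For example, once the HTML has been balanced into an org.w3c.dom.Document (e.g. with the TagSoupParser shown above), a rough sketch of reading itemprop values with the JDK XPath API could look like this; the expression is only illustrative and you would adapt it to the vocabulary you are after:
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;

// 'document' is the org.w3c.dom.Document produced by the tag balancer
XPath xpath = XPathFactory.newInstance().newXPath();
NodeList props = (NodeList) xpath.evaluate("//*[@itemprop]", document, XPathConstants.NODESET);
for (int i = 0; i < props.getLength(); i++) {
    String name = props.item(i).getAttributes().getNamedItem("itemprop").getNodeValue();
    String value = props.item(i).getTextContent();
    System.out.println(name + " = " + value);
}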
Good evening, everyone!
I'm trying to parse an HTML page in Java with JDOM2 to access some information from it.
My code looks like this (I've only added the package names for this code block; I don't have them written out like this in my real source):
//Here goes the reading of the site into my String "string" (using NekoHTML)
org.xml.sax.InputSource is = new InputSource();
is.setCharacterStream(new StringReader(string));
org.cyberneko.html.parsers.DOMParser parser = new DOMParser();
parser.parse(is);
org.jdom2.input.DOMBuilder builder = new DOMBuilder();
org.jdom2.Document doc = builder.build(parser.getDocument());
This works fine for everything except one special case: when the site contains quotation marks inside a tag. Here is an example of what I mean:
Der "realismo mágico" und die Phantastische...
So, after that wonderful tag I get the following error trace:
SEVERE: org.jdom2.IllegalNameException: The name "literatur"" is not legal for JDOM/XML attributes: XML name 'literatur"' cannot contain the character """.
So, now my question is: what are my options for taking care of this error? Is there maybe a feature in NekoHTML I can use for this (via setFeature()), or something within JDOM I could use?
If not: are there other libraries suitable for scraping websites that can handle quotation marks within a tag?
Thanks for your time!
Okay, so I solved the problem as follows:
Since there wasn't any dependency on NekoHTML, I switched to JTidy as the parser, which does the job in this case.
Question answered.
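For anyone else hitting this, a minimal sketch of the JTidy variant (assuming JTidy's parseDOM, which returns an org.w3c.dom.Document that JDOM2's DOMBuilder can consume; treat it as a starting point, since JTidy's DOM support has some gaps):
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import org.w3c.tidy.Tidy;

Tidy tidy = new Tidy();
tidy.setQuiet(true);          // suppress the usual report output
tidy.setShowWarnings(false);
InputStream in = new ByteArrayInputStream(string.getBytes(StandardCharsets.UTF_8));
org.w3c.dom.Document w3cDoc = tidy.parseDOM(in, null);   // JTidy repairs the markup instead of failing
org.jdom2.Document doc = new org.jdom2.input.DOMBuilder().build(w3cDoc);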
I get a stream of values as CSV. Based on some condition, I need to generate XML containing only a subset of those values. For example:
Input: a:value1, b:value2, c:value3, d:value4, e:value5.
if (condition1)
XML O/P = <Request><ValueOfA>value1</ValueOfA><ValueOfE>value5</ValueOfE></Request>
else if (condition2)
XML O/P = <Request><ValueOfB>value2</ValueOfB><ValueOfD>value4</ValueOfD></Request>
I want to externalize the process so that, given a template, the output XML is generated accordingly. String manipulation is the easiest way of implementing this, but I do not want to mess up the XML if special characters appear in the input, etc. Please suggest.
Perhaps you could benefit from a templating engine, something like Apache Velocity.
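A rough sketch of that idea (the template string and variable names here are made up, and note that Velocity does not XML-escape values by itself, so you would still need to escape them or use an escaping tool):
import java.io.StringWriter;
import org.apache.velocity.VelocityContext;
import org.apache.velocity.app.VelocityEngine;

VelocityEngine engine = new VelocityEngine();
engine.init();

VelocityContext ctx = new VelocityContext();
ctx.put("a", "value1");   // would come from the parsed CSV
ctx.put("e", "value5");

// Pick the template per condition
String template = "<Request><ValueOfA>$a</ValueOfA><ValueOfE>$e</ValueOfE></Request>";
StringWriter out = new StringWriter();
engine.evaluate(ctx, out, "csv-to-xml", template);
System.out.println(out);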
I would suggest creating an XSD and using JAXB to create the Java binding classes that you can use to generate the XML.
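A rough sketch of that approach, with a hand-written binding class instead of one generated from an XSD just to show the idea (JAXB takes care of escaping special characters for you):
import java.io.StringWriter;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Marshaller;
import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;

public class CsvToXml {

    @XmlRootElement(name = "Request")
    public static class Request {
        @XmlElement(name = "ValueOfA")
        public String valueOfA;
        @XmlElement(name = "ValueOfE")
        public String valueOfE;
    }

    public static void main(String[] args) throws Exception {
        Request request = new Request();
        request.valueOfA = "value1";  // would come from the parsed CSV
        request.valueOfE = "value5";

        Marshaller marshaller = JAXBContext.newInstance(Request.class).createMarshaller();
        marshaller.setProperty(Marshaller.JAXB_FRAGMENT, Boolean.TRUE); // omit the XML declaration
        StringWriter out = new StringWriter();
        marshaller.marshal(request, out);
        System.out.println(out); // <Request><ValueOfA>value1</ValueOfA><ValueOfE>value5</ValueOfE></Request>
    }
}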
I recommend my own templating engine (JATL, http://code.google.com/p/jatl/). Although it's geared towards (X)HTML, it's also very good at generating XML.
I didn't bother solving the whole problem for you (that is, double splitting the input on "," and then ":"), but this is how you would use JATL:
final String a = "stuff";
HtmlWriter html = new HtmlWriter() {
    @Override
    protected void build() {
        // If condition1
        start("Request").start("ValueOfA").text(a).end().end();
    }
};
//Now write.
StringWriter writer = new StringWriter();
String results = html.write(writer).getBuffer().toString();
Which would generate
<Request><ValueOfA>stuff</ValueOfA></Request>
All the correct escaping is handled for you.
I know how to read the HTML code of a website. For example, the following Java code reads all the HTML from http://www.transfermarkt.co.uk/en/fc-barcelona/startseite/verein_131.html, a website that shows all the football players of F.C. Barcelona.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

public class ReadWebPage {

    public static void main(String[] args) throws IOException {
        String urltext = "http://www.transfermarkt.co.uk/en/fc-barcelona/startseite/verein_131.html";
        URL url = new URL(urltext);
        BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
        String inputLine;
        while ((inputLine = in.readLine()) != null) {
            // Process each line.
            System.out.println(inputLine);
        }
        in.close();
    }
}
OK, but now I need to work with the HTML code. I need to obtain the names ("Valdés, Victor", "Pinto, José Manuel", etc.) and the positions (Goalkeeper, Defence, Midfield, Striker) of each of the players of the team. For example, I need to create an ArrayList<String> PlayerNames and an ArrayList<String> PlayerPositions and put all the names and positions of the players into these lists.
How can I do it? I can't find a code example that does this on Google. Code examples are welcome.
Thanks.
I would recommend using HtmlUnit, which will give you access to the DOM tree of the HTML page and can even execute JavaScript in case the data is put into the page dynamically via AJAX.
You could also use JSoup: no JavaScript support, but it is more lightweight and supports CSS selectors.
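For example, a JSoup sketch; the CSS selectors below are placeholders, so you would have to inspect the actual page to find the right ones for the name and position cells:
import java.util.ArrayList;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SquadScraper {
    public static void main(String[] args) throws Exception {
        String url = "http://www.transfermarkt.co.uk/en/fc-barcelona/startseite/verein_131.html";
        Document doc = Jsoup.connect(url).get();

        ArrayList<String> playerNames = new ArrayList<String>();
        ArrayList<String> playerPositions = new ArrayList<String>();

        // Placeholder selectors - adjust them to the real table structure of the page
        for (Element row : doc.select("table tr")) {
            Element name = row.select("td.name a").first();
            Element position = row.select("td.position").first();
            if (name != null && position != null) {
                playerNames.add(name.text());
                playerPositions.add(position.text());
            }
        }
        System.out.println(playerNames);
        System.out.println(playerPositions);
    }
}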
I think the best approach is to first purify the HTML into valid XHTML and then apply an XSL transformation; for retrieving parts of the information you can use XPath expressions. The best available HTML tag balancer is, in my opinion, NekoHTML (http://nekohtml.sourceforge.net/).
You might like to take a look at htmlparser.
I used it for something similar.
Usage is something like this:
Parser fullWebpage = new Parser("WEBADDRESS");
NodeList nl = fullWebpage.extractAllNodesThatMatch(new TagNameFilter("<insert html tag>"));
NodeList tds = nl.extractAllNodesThatMatch(new TagNameFilter("a"), true);
String data = tds.toHtml();
Java has its own built-in HTML parser. A positive feature of this parser is that it is error tolerant and will assume some tags even if they are missing or misspelled. While it lives under javax.swing.text.html, it actually has nothing to do with Swing (and with text only as much as HTML is text). Use ParserDelegator. You need to write a callback for use with this parser; otherwise it is not complex to use. The code example (written as a ParserDelegator test) can be found here. Some say it is a remnant of the HotJava browser. The only problem with it is that it does not seem to have been upgraded to the most recent versions of HTML.
A simple code example would be:
import java.io.Reader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

Reader reader; // read HTML from somewhere
HTMLEditorKit.ParserCallback callback = new MyCallBack(); // extend that class
ParserDelegator delegator = new ParserDelegator();
delegator.parse(reader, callback, false);
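A sketch of such a callback (the class name matches the snippet above; which tags you react to depends on the page, so the anchor check here is only an example):
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;

class MyCallBack extends HTMLEditorKit.ParserCallback {

    private boolean inAnchor;

    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (t == HTML.Tag.A) {
            inAnchor = true;   // e.g. player names sit inside links
        }
    }

    @Override
    public void handleEndTag(HTML.Tag t, int pos) {
        if (t == HTML.Tag.A) {
            inAnchor = false;
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        if (inAnchor) {
            System.out.println(new String(data));
        }
    }
}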
I've found a link that is just what you were looking for:
http://tiny-url.org/work_with_html_java