HTML Validation on back-end - java

I receive a response in HTML format from an external service and pass it directly to my front end. However, the external system sometimes returns broken HTML, which can break the page on my site. I therefore want to check whether the HTML response is broken or valid. If it is valid, I pass it on; otherwise I ignore it and log an error.
How can I perform this validation on the back end in Java?
Thank you.

I believe there is no such "generic" validator available in Java, but you can build your own on top of any open-source HTML parser.

I found the solution:
private static boolean isValidHtml(String htmlToValidate) throws ParserConfigurationException,
SAXException, IOException {
String docType = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" " +
"\"https://www.w3.org/TR/xhtml11/DTD/xhtml11-flat.dtd\"> " +
"<html xmlns=\"http://www.w3.org/1999/xhtml\" " + "xml:lang=\"en\">\n";
try {
InputSource inputSource = new InputSource(new StringReader(docType + htmlToValidate));
DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
domFactory.setValidating(true);
DocumentBuilder builder = domFactory.newDocumentBuilder();
builder.setErrorHandler(new ErrorHandler() {
@Override
public void error(SAXParseException exception) throws SAXException {
throw new SAXException(exception);
}
@Override
public void fatalError(SAXParseException exception) throws SAXException {
throw new SAXException(exception);
}
@Override
public void warning(SAXParseException exception) throws SAXException {
throw new SAXException(exception);
}
});
builder.parse(inputSource);
} catch (SAXException ex) {
//log.error(ex.getMessage(), ex); // validation message
return false;
}
return true;
}
The method can be used like this:
String htmlToValidate = "<head><title></title></head><body></body></html>";
boolean isValidHtml = isValidHtml(htmlToValidate);
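The method above downloads the DTD from w3.org on every call. Where a plain well-formedness check is enough, a lighter variant is possible. The following is only a sketch under assumptions: the class and method names are hypothetical, the input is assumed to be complete XHTML-style markup without a DOCTYPE, and HTML that is valid but not XML-well-formed (for example an unclosed <br>) would be rejected.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper: checks XML well-formedness only, no DTD validation.
final class WellFormedCheck {

    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(false); // no DTD needed for this check
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                @Override
                public void warning(SAXParseException e) {
                    // warnings do not make the markup broken
                }
                @Override
                public void error(SAXParseException e) throws SAXException {
                    throw e;
                }
                @Override
                public void fatalError(SAXParseException e) throws SAXException {
                    throw e;
                }
            });
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // log the message here if needed
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<html><body><p>ok</p></body></html>"));     // true
        System.out.println(isWellFormed("<html><body><p>broken</body></html>"));     // false
    }
}
```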

Related

Loss of special characters while using javax.xml.transform.Transformer

I have the following problem: I lose some special characters when using javax.xml.transform.Transformer. Both the XML and XSL files are UTF-8 encoded.
Some capital Polish characters (Ą, Ł, etc.) are lost during the transform and replaced by "�?" characters.
Here is my transformation method:
public static boolean transform(Logger logger, String inXML,String inXSL,String outTXT) throws Exception
{
try
{
TransformerFactory factory = TransformerFactory.newInstance();
ErrorListener listener = new ErrorListener()
{
@Override
public void warning(TransformerException exception)
throws TransformerException {}
@Override
public void fatalError(TransformerException exception)
throws TransformerException {}
@Override
public void error(TransformerException exception)
throws TransformerException {}
};
factory.setErrorListener(listener);
StreamSource xslStream = new StreamSource(inXSL);
Transformer transformer = factory.newTransformer(xslStream);
StreamSource in = new StreamSource(inXML);
StreamResult out = new StreamResult(outTXT);
transformer.transform(in,out);
return true;
}
catch(Exception e)
{
logger.log("ERROR DURING XSLT TRANSFORM (" + e.getMessage() + ")",2);
return false;
}
}
Any help will be appreciated!
=====
Using XSL file - Link
The output encoding had to be set explicitly.
After adding
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
the engine works correctly in both environments.
I had a similar problem; after setting the encoding to UTF-16 (not UTF-8)
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-16");
the special characters came through correctly.
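To see the effect of the output property in isolation, here is a minimal sketch. It assumes an identity transform (no stylesheet) and writes to a byte array instead of a file, so the bytes can be decoded with a known charset; the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class: identity transform with an explicit output encoding.
final class EncodingDemo {

    static String transformToUtf8(String xml) {
        try {
            // newTransformer() without a stylesheet performs an identity transform
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(bytes));
            // Decode with the same charset that was requested for the output
            return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // \u0104 and \u0141 are the Polish capitals from the question
        String result = transformToUtf8("<name>\u0104\u0141</name>");
        System.out.println(result.contains("\u0104\u0141"));
    }
}
```

Writing to an OutputStream and decoding explicitly keeps the platform default charset out of the picture, which is one plausible reason the same code can behave differently in two environments.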

Handle external Entities and Stylesheet in Sax Parser (XML)

I want to ignore external entities and external stylesheets (e.g. <?xml-stylesheet type="text/xsl" href="......."?>).
I know I have to set an XMLReader property to ignore external entities, but I don't know how to ignore stylesheets.
import org.apache.xerces.parsers.SAXParser;
import org.xml.sax.XMLReader;
//...
final XMLReader parser = new SAXParser();
// Ignore entities
parser.setProperty("http://xml.org/sax/features/external-general-entities", false);
// IS CORRECT???
parser.setProperty("http://xml.org/sax/features/external-general-entities", false);
Are there more properties to set to avoid external entities and stylesheets?
How can I detect whether a document contains external entities or stylesheets?
This works for me:
public class SaxParser extends DefaultHandler
implements ContentHandler, DTDHandler, EntityResolver{
public transient static final String STYLE_SHEET_TAG = "xml-stylesheet";
public transient static final String EXTERNAL_ENTITY = "ExternalEntity";
public static void main(String[] args) {
new SaxParser().execute();
}
public void execute() {
String pathFileXml = "test/XML.xml";
final XMLReader parser = new SAXParser();
parser.setContentHandler(this);
parser.setDTDHandler(this);
parser.setEntityResolver(this);
try {
parser.parse(pathFileXml);
} catch (IOException e) {
e.printStackTrace();
} catch (SAXException e) {
if (SaxParser.STYLE_SHEET_TAG.equals(e.getMessage())
|| SaxParser.EXTERNAL_ENTITY.equals(e.getMessage())) {
System.out.println("CAUGHT ERROR");
}
e.printStackTrace();
}
System.out.println("OK");
}
@Override
public void processingInstruction(String target, String data)
throws SAXException {
System.out.println("Processing Instruction");
System.out.println("PI=> target: " + target + ", data: " + data);
if (STYLE_SHEET_TAG.equalsIgnoreCase(target.trim())) {
throw new SAXException(STYLE_SHEET_TAG);
}
return;
}
@Override
public InputSource resolveEntity(String publicId, String systemId)
throws IOException, SAXException {
System.out.println("publicId: " + publicId + ", systemId: " + systemId);
throw new SAXException(SaxParser.EXTERNAL_ENTITY);
}
}
The external stylesheet declaration is a standard processing instruction.
You can ignore processing instructions by not implementing the handler method:
void processingInstruction(java.lang.String target, java.lang.String data) {}
in your SAX handler.
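As a sketch of both points with the JDK's built-in parser (rather than the Xerces SAXParser class used in the question): the two entity switches are SAX features, which take booleans and are set with setFeature rather than setProperty, and a stylesheet declaration shows up only as a processingInstruction callback. The class and method names below are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: reports xml-stylesheet instructions without following them.
final class StylesheetDetector {

    static List<String> findStylesheetInstructions(String xml) {
        final List<String> found = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // Standard SAX2 features: do not include external entities
            factory.setFeature("http://xml.org/sax/features/external-general-entities", false);
            factory.setFeature("http://xml.org/sax/features/external-parameter-entities", false);

            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void processingInstruction(String target, String data) {
                    // The parser never fetches the stylesheet; it only reports the instruction
                    if ("xml-stylesheet".equals(target)) {
                        found.add(data);
                    }
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return found;
    }

    public static void main(String[] args) {
        String xml = "<?xml-stylesheet type=\"text/xsl\" href=\"style.xsl\"?><root/>";
        System.out.println(findStylesheetInstructions(xml));
    }
}
```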

How to validate an XML file against a given DTD file?

I have an XML file, which has a DTD reference in it, like this:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE something SYSTEM "something.dtd">
I'm using a DocumentBuilderFactory:
public static Document validateXMLFile(String xmlFilePath) throws ParserConfigurationException, SAXException, IOException {
DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
domFactory.setValidating(true);
DocumentBuilder builder = domFactory.newDocumentBuilder();
builder.setErrorHandler(new ErrorHandler() {
@Override
public void error(SAXParseException exception) throws SAXException {
// do something more useful in each of these handlers
exception.printStackTrace();
}
@Override
public void fatalError(SAXParseException exception) throws SAXException {
exception.printStackTrace();
}
@Override
public void warning(SAXParseException exception) throws SAXException {
exception.printStackTrace();
}
});
Document doc = builder.parse(xmlFilePath);
return doc;
}
But now I want to validate the XML file against a DTD file on a user-defined location, and not relative to the path of the XML file.
How can I do that?
Example:
validateXMLFile("/path/to/the/xml_file.xml", "/path/to/the/dtd_file.dtd");
Use an EntityResolver:
final String dtd = "/path/to/the/dtd_file.dtd";
builder.setEntityResolver(new EntityResolver() {
public InputSource resolveEntity(String publicId, String systemId) throws SAXException, IOException {
if (systemId.endsWith("something.dtd")) {
return new InputSource(new FileInputStream(dtd));
}
return null;
}
});
Note that this works only if the XML document contains a DOCTYPE declaration; without one, the parser never asks the resolver for a DTD.
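Putting the resolver together with a validating parser could look roughly like the following sketch. To keep it self-contained it takes the DTD as a string instead of a file path, and the error handler throws so that validation errors turn into a false result; the class name, method name, and sample DTD are all assumptions for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.EntityResolver;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper: validates XML against a DTD supplied by the caller.
final class DtdValidation {

    static boolean isValid(String xml, final String dtdText) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();

            // Redirect the DTD named in the DOCTYPE to the caller-supplied text
            builder.setEntityResolver(new EntityResolver() {
                @Override
                public InputSource resolveEntity(String publicId, String systemId) {
                    if (systemId != null && systemId.endsWith("something.dtd")) {
                        return new InputSource(new StringReader(dtdText));
                    }
                    return null; // fall back to default resolution
                }
            });
            builder.setErrorHandler(new ErrorHandler() {
                @Override
                public void warning(SAXParseException e) { }
                @Override
                public void error(SAXParseException e) throws SAXException { throw e; }
                @Override
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed or not valid against the DTD
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String dtd = "<!ELEMENT something (item+)><!ELEMENT item (#PCDATA)>";
        String doctype = "<!DOCTYPE something SYSTEM \"something.dtd\">";
        System.out.println(isValid(doctype + "<something><item>a</item></something>", dtd)); // true
        System.out.println(isValid(doctype + "<something><other/></something>", dtd));       // false
    }
}
```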

Parsing XML without document start and end tags

I'm parsing a document that I cannot change from the internet using a SAX Parser. It was working just fine when the documents came formatted as such:
<outtertag>
<innertag>data</innertag>
<innertag>moreData</innertag>
</outtertag>
However, there are certain calls I make where the XML comes formatted without the outer tags, so I essentially get just a list of data, like such:
<innertag>data</innertag>
<innertag>moreData</innertag>
This seems silly to me, but I don't get to choose how the XML is formatted and it can't be changed for now. The problem is that it seems that the SAX Parser hits the endDocument event as soon as it hits the first closing innertag.
I have a rather hacky solution: converting the InputStream into a String, wrapping it in tags, and converting it back to an InputStream. It actually parses fine that way, but surely there's a better way. I'd also prefer not to write a whole separate parser, since most of the tags are the same apart from the missing opening and closing tags.
Just for the heck of it, I'll post the code, but it's pretty standard SAX Parser. The original is actually parsing about 30 some tags:
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser saxParser = factory.newSAXParser();
XMLReader xmlReader = saxParser.getXMLReader();
MyHandler handler = new MyHandler();
xmlReader.setContentHandler(handler);
InputSource inputSource = new InputSource(url.openStream());
xmlReader.parse(inputSource);
}
catch (SAXException e) { e.printStackTrace(); }
catch (ParserConfigurationException e) { e.printStackTrace(); }
catch(Exception e) { e.printStackTrace(); }
}
private class MyHandler extends DefaultHandler {
private StringBuilder content;
public MyHandler() {
content = new StringBuilder();
}
public void startElement(String uri, String localName, String qName,
Attributes atts) throws SAXException {
content = new StringBuilder();
if(localName.equalsIgnoreCase("innertag")) {
//Doing stuff
}
}
public void endElement(String uri, String localName, String qName)
throws SAXException {
//Doing stuff
}
public void characters(char[] ch, int start, int length)
throws SAXException {
content.append(ch, start, length);
}
public void endDocument() throws SAXException {
//When parsing the second type of document, hits this event almost immediately after parsing first tag
}
}
And, in case it matters, here's the hacky code I'm using. It feels wrong, yet it works:
BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()));
StringBuilder sb = new StringBuilder("<tag>");
String line = null;
while ((line = reader.readLine()) != null) {
sb.append(line);
}
sb.append("</tag>");
String xml =sb.toString();
InputStream is = new ByteArrayInputStream(xml.getBytes());
InputSource source = new InputSource(is);
xmlReader.parse(source);
I'd say what you're doing now is about as good as you'll get. The one thing to consider improving is the stream -> string -> stream conversion, especially if the documents are large. You could use something like Guava's ByteStreams.join(), which lets you concatenate streams together instead of strings. Something like the following:
import com.google.common.io.*;
import java.io.*;
public class ConcatenateStreams {
public static void main(String[] args) throws Exception {
InputStream malformedXmlContent = externalXmlStream();
InputSupplier<InputStream> joined = ByteStreams.join(
inputSupplier("<root>"),
inputSupplier(malformedXmlContent),
inputSupplier("</root>"));
ByteStreams.copy(joined, System.out);
}
private static InputStream externalXmlStream() {
return new ByteArrayInputStream("<foo>5</foo><bar>10</bar>".getBytes());
}
private static InputSupplier<InputStream> inputSupplier(final String text) {
return inputSupplier(new ByteArrayInputStream(text.getBytes()));
}
private static InputSupplier<InputStream> inputSupplier(final InputStream inputStream) {
return new InputSupplier<InputStream>() {
@Override
public InputStream getInput() throws IOException {
return inputStream;
}
};
}
}
which outputs:
<root><foo>5</foo><bar>10</bar></root>
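If the Guava version at hand does not provide ByteStreams.join() and InputSupplier, the same concatenation can be sketched with java.io.SequenceInputStream from the JDK. This swaps in a different technique from the answer above; the class and method names are hypothetical, and the fragment is assumed to be UTF-8 compatible and to carry no XML declaration of its own.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: wraps a root-less XML stream in <root>...</root> lazily.
final class WrapStream {

    static InputStream wrapInRoot(InputStream fragment) {
        List<InputStream> parts = Arrays.asList(
                new ByteArrayInputStream("<root>".getBytes(StandardCharsets.UTF_8)),
                fragment,
                new ByteArrayInputStream("</root>".getBytes(StandardCharsets.UTF_8)));
        // Streams are read one after another; the fragment is never buffered as a String
        return new SequenceInputStream(Collections.enumeration(parts));
    }

    // Reads a stream fully; only used here to show the result
    static String drain(InputStream in) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        InputStream external = new ByteArrayInputStream(
                "<foo>5</foo><bar>10</bar>".getBytes(StandardCharsets.UTF_8));
        System.out.println(drain(wrapInRoot(external)));
        // prints <root><foo>5</foo><bar>10</bar></root>
    }
}
```

In real use the wrapped stream would go straight into new InputSource(...) for the SAX parser instead of being drained.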
The XML you have is not a well-formed document, but it is a well-formed external parsed entity, which means it can be referenced from a well-formed document by means of an entity reference. So create a skeleton document like this:
<!DOCTYPE doc [
<!ENTITY e SYSTEM "data.xml">
]>
<doc>&e;</doc>
where data.xml is your XML, and pass this document to the XML parser in place of the original. Beats writing dozens of lines of Java code.
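When the data arrives as a stream or string rather than as a file named data.xml, the entity reference can be served by an EntityResolver. The sketch below assumes the JDK's default SAX parser with external general entities left enabled; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: parses a root-less fragment through a skeleton document.
final class EntityWrapper {

    private static final String SKELETON =
            "<!DOCTYPE doc [<!ENTITY e SYSTEM \"data.xml\">]><doc>&e;</doc>";

    static List<String> elementNames(final String fragment) {
        final List<String> names = new ArrayList<>();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            // Serve the entity "data.xml" from memory instead of the file system
            reader.setEntityResolver(new EntityResolver() {
                @Override
                public InputSource resolveEntity(String publicId, String systemId) {
                    if (systemId != null && systemId.endsWith("data.xml")) {
                        return new InputSource(new StringReader(fragment));
                    }
                    return null;
                }
            });
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName,
                                         Attributes attributes) {
                    names.add(qName);
                }
            });
            reader.parse(new InputSource(new StringReader(SKELETON)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(elementNames("<foo>5</foo><bar>10</bar>"));
    }
}
```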

Parse an XML string in Java?

How do you parse XML stored in a Java String object?
Java's XMLReader only parses XML documents from a URI or an InputStream. Is it not possible to parse from a String containing XML data?
Right now I have the following:
try {
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser sp = factory.newSAXParser();
XMLReader xr = sp.getXMLReader();
ContactListXmlHandler handler = new ContactListXmlHandler();
xr.setContentHandler(handler);
xr.p
} catch (ParserConfigurationException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (SAXException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
And in my handler I have this:
public class ContactListXmlHandler extends DefaultHandler implements Resources {
private List<ContactName> contactNameList = new ArrayList<ContactName>();
private ContactName contactItem;
private StringBuffer sb;
public List<ContactName> getContactNameList() {
return contactNameList;
}
@Override
public void startDocument() throws SAXException {
// TODO Auto-generated method stub
super.startDocument();
sb = new StringBuffer();
}
@Override
public void startElement(String uri, String localName, String qName,
Attributes attributes) throws SAXException {
// TODO Auto-generated method stub
super.startElement(uri, localName, qName, attributes);
if(localName.equals(XML_CONTACT_NAME)){
contactItem = new ContactName();
}
sb.setLength(0);
}
@Override
public void characters(char[] ch, int start, int length){
// TODO Auto-generated method stub
try {
super.characters(ch, start, length);
} catch (SAXException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
sb.append(ch, start, length);
}
@Override
public void endDocument() throws SAXException {
// TODO Auto-generated method stub
super.endDocument();
}
/**
* where the real stuff happens
*/
@Override
public void endElement(String uri, String localName, String qName)
throws SAXException {
// TODO Auto-generated method stub
//super.endElement(arg0, arg1, arg2);
if(contactItem != null){
if (localName.equalsIgnoreCase("title")) {
contactItem.setUid(sb.toString());
Log.d("handler", "setTitle = " + sb.toString());
} else if (localName.equalsIgnoreCase("link")) {
contactItem.setFullName(sb.toString());
} else if (localName.equalsIgnoreCase("item")){
Log.d("handler", "adding rss item");
contactNameList.add(contactItem);
}
sb.setLength(0);
}
}
Thanks in advance
The SAXParser can read an InputSource.
An InputSource can take a Reader in its constructor.
So you can parse an XML string via a StringReader:
new InputSource(new StringReader("... your xml here...."));
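As a complete but minimal sketch of that idea, the following parses a String with SAX and counts elements by name. The class and method names are hypothetical, and the handler is deliberately much simpler than the ContactListXmlHandler in the question.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: SAX-parses XML held in a String.
final class ParseFromString {

    static int countElements(String xml, final String name) {
        final int[] count = {0};
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName,
                                         Attributes attributes) {
                    if (qName.equals(name)) {
                        count[0]++;
                    }
                }
            });
            // The StringReader is what lets the parser read from a String
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return count[0];
    }

    public static void main(String[] args) {
        String xml = "<data><employee><name>John</name></employee>"
                   + "<employee><name>Sara</name></employee></data>";
        System.out.println(countElements(xml, "employee")); // 2
    }
}
```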
Try jcabi-xml (see this blog post) with a one-liner:
XML xml = new XMLDocument("<document>...</document>");
Your XML might be simple enough to parse manually using the DOM or SAX API, but I'd still suggest using an XML serialization API such as JAXB, XStream, or Simple instead because writing your own XML serialization/deserialization code is a drag.
Note that the XStream FAQ erroneously claims that you must use generated classes with JAXB:
How does XStream compare to JAXB (Java API for XML Binding)?
JAXB is a Java binding tool. It generates Java code from a schema and
you are able to transform from those classes into XML matching the
processed schema and back. Note, that you cannot use your own objects,
you have to use what is generated.
It seems this was true at one time, but JAXB 2.0 no longer requires you to use Java classes generated from a schema.
If you go this route, be sure to check out the side-by-side comparisons of the serialization/marshalling APIs I've mentioned:
http://blog.bdoughan.com/2010/10/how-does-jaxb-compare-to-xstream.html
http://blog.bdoughan.com/2010/10/how-does-jaxb-compare-to-simple.html
Take a look at this: http://www.rgagnon.com/javadetails/java-0573.html
import javax.xml.parsers.*;
import org.xml.sax.InputSource;
import org.w3c.dom.*;
import java.io.*;
public class ParseXMLString {
public static void main(String arg[]) {
String xmlRecords =
"<data>" +
" <employee>" +
" <name>John</name>" +
" <title>Manager</title>" +
" </employee>" +
" <employee>" +
" <name>Sara</name>" +
" <title>Clerk</title>" +
" </employee>" +
"</data>";
try {
DocumentBuilderFactory dbf =
DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
InputSource is = new InputSource();
is.setCharacterStream(new StringReader(xmlRecords));
Document doc = db.parse(is);
NodeList nodes = doc.getElementsByTagName("employee");
// iterate the employees
for (int i = 0; i < nodes.getLength(); i++) {
Element element = (Element) nodes.item(i);
NodeList name = element.getElementsByTagName("name");
Element line = (Element) name.item(0);
System.out.println("Name: " + getCharacterDataFromElement(line));
NodeList title = element.getElementsByTagName("title");
line = (Element) title.item(0);
System.out.println("Title: " + getCharacterDataFromElement(line));
}
}
catch (Exception e) {
e.printStackTrace();
}
/*
output :
Name: John
Title: Manager
Name: Sara
Title: Clerk
*/
}
public static String getCharacterDataFromElement(Element e) {
Node child = e.getFirstChild();
if (child instanceof CharacterData) {
CharacterData cd = (CharacterData) child;
return cd.getData();
}
return "?";
}
}
