Loss of special characters while using javax.xml.transform.Transformer

I have the following problem: I lose some special characters when using javax.xml.transform.Transformer. Both the XML and XSL files are UTF-8 encoded.
Some capital Polish characters (Ą, Ł, etc.) are lost during the transform and replaced by "�?" characters.
Here is my transform method:
public static boolean transform(Logger logger, String inXML,String inXSL,String outTXT) throws Exception
{
try
{
TransformerFactory factory = TransformerFactory.newInstance();
ErrorListener listener = new ErrorListener()
{
@Override
public void warning(TransformerException exception)
throws TransformerException {}
@Override
public void fatalError(TransformerException exception)
throws TransformerException {}
@Override
public void error(TransformerException exception)
throws TransformerException {}
};
factory.setErrorListener(listener);
StreamSource xslStream = new StreamSource(inXSL);
Transformer transformer = factory.newTransformer(xslStream);
StreamSource in = new StreamSource(inXML);
StreamResult out = new StreamResult(outTXT);
transformer.transform(in,out);
return true;
}
catch(Exception e)
{
logger.log("ERROR DURING XSLT TRANSFORM (" + e.getMessage() + ")",2);
return false;
}
}
Any help will be appreciated!
=====
Using XSL file - Link

It turned out that the output encoding had to be set explicitly.
After adding
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
the engine seems to work fine in both environments.
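As a minimal sketch of the fix above (not the asker's actual code): the class and method names here are hypothetical, and an identity transform stands in for the real stylesheet. It sets the output encoding explicitly, writes to a byte stream, and decodes the bytes as UTF-8 to check that the Polish capitals survive.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class; an identity transform replaces the real XSL file.
final class Utf8OutputSketch {

    /** Serializes the given XML through a Transformer, forcing UTF-8 output. */
    static byte[] transformToUtf8(String xml) throws Exception {
        // newTransformer() without a stylesheet performs an identity transform.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        // Without this property the serializer picks its own default encoding.
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // \u0104 is A-ogonek and \u0141 is L-stroke; escapes avoid source-encoding issues.
        byte[] bytes = transformToUtf8("<names><name>\u0104\u0141</name></names>");
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        System.out.println("characters preserved: " + decoded.contains("\u0104\u0141"));
    }
}
```

Writing to an OutputStream rather than a Writer lets the serializer, not the platform default charset, decide how characters become bytes.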

I had a similar problem. After setting the output encoding to UTF-16 (not UTF-8) with
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-16");
the special characters came through correctly.

Related

HTML Validation on back-end

I receive a response in HTML format from an external service and pass it directly to my front end. However, the external system sometimes returns broken HTML, which can break the page on my site. I therefore want to check whether the HTML response is valid. If it is valid, I will pass it on; otherwise it will be dropped and an error logged.
How can I do this validation on the back end in Java?
Thank you.

I believe there is no such "generic" validator available in Java, but you can build your own validation on top of any open-source HTML parser.

I found the solution:
private static boolean isValidHtml(String htmlToValidate) throws ParserConfigurationException,
SAXException, IOException {
String docType = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" " +
"\"https://www.w3.org/TR/xhtml11/DTD/xhtml11-flat.dtd\"> " +
"<html xmlns=\"http://www.w3.org/1999/xhtml\" " + "xml:lang=\"en\">\n";
try {
InputSource inputSource = new InputSource(new StringReader(docType + htmlToValidate));
DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
domFactory.setValidating(true);
DocumentBuilder builder = domFactory.newDocumentBuilder();
builder.setErrorHandler(new ErrorHandler() {
@Override
public void error(SAXParseException exception) throws SAXException {
throw new SAXException(exception);
}
@Override
public void fatalError(SAXParseException exception) throws SAXException {
throw new SAXException(exception);
}
@Override
public void warning(SAXParseException exception) throws SAXException {
throw new SAXException(exception);
}
});
builder.parse(inputSource);
} catch (SAXException ex) {
//log.error(ex.getMessage(), ex); // validation message
return false;
}
return true;
}
This method can be used this way:
String htmlToValidate = "<head><title></title></head><body></body></html>";
boolean isValidHtml = isValidHtml(htmlToValidate);

Marshaller on Windows adds new line at the end of file

I have a project that uses JAXB marshalled XML files in order to compare configuration states of different environments. I noticed that there must be some differences in the implementation of the JAXB marshaller under Windows against the Unix version. When I compare 2 files created on the different platforms, my comparison tool always flags one difference at the end of the file. The file created on Windows has a new line (CR and LF) at the end of the file while the Unix version doesn't have it.
Please note that the issue is not about the difference of the new line characters between both platforms! The Windows marshaller effectively adds a "new line" at the end of the file while the Unix marshaller stops after the closing ">" of the root tag.
Is there any parameter I can pass to the marshaller in order to prevent this additional line or do I have to explicitly remove it after marshalling on Windows, so that my comparison tool doesn't flag the difference?
This is what the marshalling code looks like:
public void marshal(final Object rootObject, final OutputStream outputStream) throws JAXBException, TransformerException {
Preconditions.checkArgument(rootObject != null, "rootObject must not be null");
Preconditions.checkArgument(outputStream != null, "outputStream must not be null");
final JAXBContext ctx = JAXBContext.newInstance(rootObject.getClass());
final Document document = getFactories().newDocument();
document.setXmlStandalone(true);
final Marshaller marshaller = ctx.createMarshaller();
marshaller.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, true);
marshaller.setSchema(schema);
marshaller.marshal(rootObject, document);
createTransformer().transform(new DOMSource(document), new StreamResult(outputStream));
}
public static Transformer createTransformer() {
final Transformer transformer = getFactories().newTransformer();
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty(OutputKeys.STANDALONE, "yes");
transformer.setOutputProperty(OutputKeys.ENCODING, JAXBDefaults.OUTPUT_CHARSET.name());
transformer.setOutputProperty(OutputKeys.CDATA_SECTION_ELEMENTS, CDATA_XML_ELEMENTS);
transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", IDENT_LENGTH);
return transformer;
}
private static class JAXBFactories {
private DocumentBuilderFactory documentBuilderFactory;
public DocumentBuilderFactory getDocumentBuilderFactory() {
if (documentBuilderFactory == null) {
documentBuilderFactory = DocumentBuilderFactory.newInstance();
}
return documentBuilderFactory;
}
private DocumentBuilder documentBuilder;
public DocumentBuilder getDocumentBuilder() {
if (documentBuilder == null) {
try {
documentBuilder = getDocumentBuilderFactory().newDocumentBuilder();
} catch (final ParserConfigurationException ex) {
throw new RuntimeException("Failed to create DocumentBuilder", ex);
}
}
return documentBuilder;
}
public Document newDocument() {
return getDocumentBuilder().newDocument();
}
private TransformerFactory transformerFactory;
public TransformerFactory getTransformerFactory() {
if (transformerFactory == null) {
transformerFactory = TransformerFactory.newInstance();
}
return transformerFactory;
}
public Transformer newTransformer() {
try {
return getTransformerFactory().newTransformer();
} catch (final TransformerConfigurationException ex) {
throw new RuntimeException("Failed to create Transformer", ex);
}
}
}
private static class FactoriesHolder {
static final JAXBFactories FACTORIES = new JAXBFactories();
}
private static JAXBFactories getFactories() {
return FactoriesHolder.FACTORIES;
}

There is no reason to expect that pretty-printing XML will produce exactly the same result on two different systems. It does seem likely, however, that if you switch off pretty-printing (and let your IDE or editor do it instead), the output will be the same.
Pretty-printing is a transformation of the original XML that adds layout whitespace, so the result is no longer the same document byte for byte.
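If pretty-printing has to stay on, the question's other option (removing the trailing line break after marshalling) can be sketched as below. This is an illustration only: the class and method names are hypothetical, it assumes the output is small enough to buffer in memory, and the byte-level check assumes an ASCII-compatible encoding such as UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of JAXB or JAXP.
final class TrailingNewlineSketch {

    /** Returns a copy of the data without CR or LF bytes at the very end. */
    static byte[] stripTrailingLineBreaks(byte[] data) {
        int end = data.length;
        // Only valid for ASCII-compatible encodings (UTF-8, ISO-8859-x), not UTF-16.
        while (end > 0 && (data[end - 1] == '\r' || data[end - 1] == '\n')) {
            end--;
        }
        return Arrays.copyOf(data, end);
    }

    /** Writes the buffered output to the real target, minus trailing line breaks. */
    static void writeNormalized(ByteArrayOutputStream buffered, OutputStream target) throws IOException {
        target.write(stripTrailingLineBreaks(buffered.toByteArray()));
        target.flush();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for what the transformer would have written on Windows.
        ByteArrayOutputStream buffered = new ByteArrayOutputStream();
        buffered.write("<root>\r\n  <a/>\r\n</root>\r\n".getBytes(StandardCharsets.UTF_8));
        ByteArrayOutputStream target = new ByteArrayOutputStream();
        writeNormalized(buffered, target);
        System.out.println(new String(target.toByteArray(), StandardCharsets.UTF_8).endsWith("</root>"));
    }
}
```

In the marshalling method from the question, the transformer would write into the buffer instead of the final stream. Line breaks inside the document are left untouched.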

How do I call DaisyDiff to compare two HTML files?

I need to create a diff between two HTML documents in my app. I found a library called DaisyDiff that can do it. It has an API that looks like this:
/**
* Diffs two html files, outputting the result to the specified consumer.
*/
public static void diffHTML(InputSource oldSource, InputSource newSource,
ContentHandler consumer, String prefix, Locale locale)
throws SAXException, IOException
I know absolutely nothing about SAX and I can't figure out what to pass as the third argument. After poking through https://code.google.com/p/daisydiff/source/browse/trunk/daisydiff/src/java/org/outerj/daisy/diff/Main.java I wrote this method:
@Override
protected String doInBackground(String... params)
{
try {
String oldFileName = params[0],
newFileName = params[1];
ByteArrayOutputStream os = new ByteArrayOutputStream();
FileInputStream oldis = null, newis = null;
oldis = openFileInput(oldFileName);
newis = openFileInput(newFileName);
SAXTransformerFactory tf = (SAXTransformerFactory) TransformerFactory
.newInstance();
TransformerHandler result = tf.newTransformerHandler();
result.setResult(new StreamResult(os));
DaisyDiff.diffHTML(new InputSource(oldis), new InputSource(newis), result, "", Locale.getDefault());
Log.d("diff", "output length = " + os.size());
return os.toString("Utf-8");
}catch (Exception e){
return e.toString();
}
}
I have no idea if that even makes sense. It doesn't work, nothing is written to the output. Please help me with this. Thanks in advance.

According to how HtmlTestFixture.diff is coded (inside src/test/java of DaisyDiff), you need to tell the transformer how the result should be formatted. Have you tried adding the setOutputProperty(...) calls below?
@Test
// @Test comes from TestNG and is not related to DaisyDiff
public void daisyDiffTest() throws Exception {
String html1 = "<html><body>var v2</body></html>";
String html2 = "<html> \n <body> \n Hello world \n </body> \n </html>";
try {
StringWriter finalResult = new StringWriter();
SAXTransformerFactory tf = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
TransformerHandler result = tf.newTransformerHandler();
result.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
result.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
result.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
result.getTransformer().setOutputProperty(OutputKeys.ENCODING, "UTF-8");
result.setResult(new StreamResult(finalResult));
ContentHandler postProcess = result;
DaisyDiff.diffHTML(new InputSource(new StringReader(html1)), new InputSource(new StringReader(html2)), postProcess, "test", Locale.ENGLISH);
System.out.println(finalResult.toString());
} catch (SAXException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
Done this way, my output is as follows. I can now put this into an HTML file, include the right CSS and JS files, and get nicely formatted output.
<span class="diff-html-removed" id="removed-test-0" previous="first-test" changeId="removed-test-0" next="added-test-0">var v2</span><span class="diff-html-added" previous="removed-test-0" changeId="added-test-0" next="last-test"> </span><span class="diff-html-added" id="added-test-0" previous="removed-test-0" changeId="added-test-0" next="last-test">Hello world </span>
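To show what the third argument is doing without involving DaisyDiff at all, here is a small sketch that uses only the JDK. A TransformerHandler is a ContentHandler that serializes whatever SAX events it receives; below, the events are fired by hand instead of by DaisyDiff.diffHTML. The class name and the sample span are made up for illustration.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical demo class: SAX events in, serialized markup out.
final class SaxSerializerSketch {

    /** Sends one span element with the given text through a TransformerHandler. */
    static String serialize(String text) throws Exception {
        // Same cast as in the answer; the JDK's built-in factory supports it.
        SAXTransformerFactory factory = (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
        handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        handler.setResult(new StreamResult(out)); // must be set before the first event

        AttributesImpl attrs = new AttributesImpl();
        attrs.addAttribute("", "class", "class", "CDATA", "diff-html-added");
        char[] chars = text.toCharArray();

        // These calls are what a SAX producer such as a diff engine would make.
        handler.startDocument();
        handler.startElement("", "span", "span", attrs);
        handler.characters(chars, 0, chars.length);
        handler.endElement("", "span", "span");
        handler.endDocument();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(serialize("Hello world"));
    }
}
```

Using a StringWriter as the result also sidesteps the byte-to-string decoding step in the question's code.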

Java setURIResolver not being called?

I want to use the setURIResolver callback function provided by javax.xml.transform.Transformer. I have implemented the 'resolve' function but it is not being called.
public class XSLMagic implements URIResolver {
public void DoXSLTransform(final File xslDoc, final File xmlDoc, final File resultDoc) {
// Create the factory...
TransformerFactory tf = TransformerFactory.newInstance();
// Create the transformer object from
Transformer tr = tf.newTransformer(new StreamSource(xslDoc));
tr.setURIResolver(this); // <--- THIS LINE doesn't seem to work.
tr.transform(new StreamSource(xmlDoc), new StreamResult(resultDoc));
}
@Override
public Source resolve(String href, String base) throws TransformerException {
System.out.print("resolve: " + href + " " + base + "\n");
return null;
}
}
I have tested that it is not being called by the lack of outputted messages, also by setting a debug point on the function and then stepping through.
What am I doing wrong?

Worked out the answer as I wrote this... :)
Call setURIResolver on the TransformerFactory, not on the Transformer object.
So the code would be...
public class XSLMagic implements URIResolver {
public void DoXSLTransform(final File xslDoc, final File xmlDoc, final File resultDoc) {
// Create the factory...
TransformerFactory tf = TransformerFactory.newInstance();
tf.setURIResolver(this); // WORKS - Set the URIResolver to the factory instead, 'resolve' function now called as expected.
// Create the transformer object from
Transformer tr = tf.newTransformer(new StreamSource(xslDoc));
tr.transform(new StreamSource(xmlDoc), new StreamResult(resultDoc));
}
@Override
public Source resolve(String href, String base) throws TransformerException {
System.out.print("resolve: " + href + " " + base + "\n");
return null;
}
}
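The reason, as far as the JAXP Javadoc describes it, is that xsl:import and xsl:include are resolved while the stylesheet is compiled, which happens inside newTransformer(...) and therefore goes through the factory's resolver; the resolver set on the Transformer is only consulted for document() calls at transform time. The sketch below illustrates this with two in-memory stylesheets. The class name, the href "included.xsl", and the stylesheet contents are all made up for the example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.URIResolver;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class; "included.xsl" exists only in memory.
final class FactoryResolverSketch {

    static final String MAIN_XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:include href='included.xsl'/>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='/'><xsl:call-template name='greet'/></xsl:template>"
        + "</xsl:stylesheet>";

    static final String INCLUDED_XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:template name='greet'>resolved</xsl:template>"
        + "</xsl:stylesheet>";

    /** Every href the XSLT processor asked the resolver for. */
    static final List<String> requested = new ArrayList<>();

    static String run() throws Exception {
        URIResolver resolver = (href, base) -> {
            requested.add(href);
            if (href != null && href.endsWith("included.xsl")) {
                return new StreamSource(new StringReader(INCLUDED_XSL));
            }
            return null; // let the processor try its default resolution
        };

        TransformerFactory factory = TransformerFactory.newInstance();
        factory.setURIResolver(resolver); // must happen before newTransformer(...)
        Transformer transformer = factory.newTransformer(new StreamSource(new StringReader(MAIN_XSL)));

        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader("<doc/>")), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run() + " / requested: " + requested);
    }
}
```

If the stylesheet also uses document(), setting the same resolver on the Transformer as well does no harm.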

Parsing XML without document start and end tags

I'm parsing a document that I cannot change from the internet using a SAX Parser. It was working just fine when the documents came formatted as such:
<outtertag>
<innertag>data</innertag>
<innertag>moreData</innertag>
</outtertag>
However, there are certain calls I make where the XML comes formatted without the outer tags, so I essentially get just a list of data, like such:
<innertag>data</innertag>
<innertag>moreData</innertag>
This seems silly to me, but I don't get to choose how the XML is formatted and it can't be changed for now. The problem is that it seems that the SAX Parser hits the endDocument event as soon as it hits the first closing innertag.
I have a rather hacky solution: convert the InputStream into a String, wrap tags around it, and convert it back to an InputStream. It actually parses fine that way, but surely there's a better way. I'd also prefer not to write a whole other parser, since most of the tags are the same apart from the missing opening and closing tags.
Just for the heck of it, I'll post the code, but it's a pretty standard SAX parser. The original actually parses about 30 tags:
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser saxParser = factory.newSAXParser();
XMLReader xmlReader = saxParser.getXMLReader();
MyHandler handler = new MyHandler();
xmlReader.setContentHandler(handler);
InputSource inputSource = new InputSource(url.openStream());
xmlReader.parse(inputSource);
}
catch (SAXException e) { e.printStackTrace(); }
catch (ParserConfigurationException e) { e.printStackTrace(); }
catch(Exception e) { e.printStackTrace(); }
}
private class MyHandler extends DefaultHandler {
private StringBuilder content;
public MyHandler() {
content = new StringBuilder();
}
public void startElement(String uri, String localName, String qName,
Attributes atts) throws SAXException {
content = new StringBuilder();
if(localName.equalsIgnoreCase("innertag")) {
//Doing stuff
}
}
public void endElement(String uri, String localName, String qName)
throws SAXException {
//Doing stuff
}
public void characters(char[] ch, int start, int length)
throws SAXException {
content.append(ch, start, length);
}
public void endDocument() throws SAXException {
//When parsing the second type of document, hits this event almost immediately after parsing first tag
}
}
And, if it matters, here's the hacky code I'm using. It feels wrong, yet it works:
BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()));
StringBuilder sb = new StringBuilder("<tag>");
String line = null;
while ((line = reader.readLine()) != null) {
sb.append(line);
}
sb.append("</tag>");
String xml =sb.toString();
InputStream is = new ByteArrayInputStream(xml.getBytes());
InputSource source = new InputSource(is);
xmlReader.parse(source);

I'd say what you're doing now is about as good as you'll get. The one thing to consider improving is the stream -> string -> stream conversion, especially if the documents are large. You could use something like Guava's ByteStreams.join(), which lets you concatenate streams together instead of strings. Something like the following:
import com.google.common.io.*;
import java.io.*;
public class ConcatenateStreams {
public static void main(String[] args) throws Exception {
InputStream malformedXmlContent = externalXmlStream();
InputSupplier<InputStream> joined = ByteStreams.join(
inputSupplier("<root>"),
inputSupplier(malformedXmlContent),
inputSupplier("</root>"));
ByteStreams.copy(joined, System.out);
}
private static InputStream externalXmlStream() {
return new ByteArrayInputStream("<foo>5</foo><bar>10</bar>".getBytes());
}
private static InputSupplier<InputStream> inputSupplier(final String text) {
return inputSupplier(new ByteArrayInputStream(text.getBytes()));
}
private static InputSupplier<InputStream> inputSupplier(final InputStream inputStream) {
return new InputSupplier<InputStream>() {
@Override
public InputStream getInput() throws IOException {
return inputStream;
}
};
}
}
which outputs:
<root><foo>5</foo><bar>10</bar></root>
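The Guava calls above (InputSupplier, ByteStreams.join) come from older Guava releases and, as far as I know, are no longer present in current ones. The same idea can be sketched with java.io.SequenceInputStream from the JDK. The class and method names below are hypothetical, and the sketch assumes the fragment carries no XML declaration or byte-order mark and uses an ASCII-compatible encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: wraps a rootless fragment without buffering it as a String.
final class WrapFragmentSketch {

    /** Returns a stream that yields an opening tag, the fragment, and a closing tag. */
    static InputStream wrap(InputStream fragment, String rootName) {
        InputStream open = new ByteArrayInputStream(("<" + rootName + ">").getBytes(StandardCharsets.UTF_8));
        InputStream close = new ByteArrayInputStream(("</" + rootName + ">").getBytes(StandardCharsets.UTF_8));
        return new SequenceInputStream(Collections.enumeration(Arrays.asList(open, fragment, close)));
    }

    /** Parses the stream with SAX and returns the element names in document order. */
    static List<String> elementNames(InputStream xml) throws Exception {
        List<String> names = new ArrayList<>();
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                names.add(qName);
            }
        });
        reader.parse(new InputSource(xml));
        return names;
    }

    public static void main(String[] args) throws Exception {
        InputStream fragment = new ByteArrayInputStream(
            "<foo>5</foo><bar>10</bar>".getBytes(StandardCharsets.UTF_8));
        System.out.println(elementNames(wrap(fragment, "root")));
    }
}
```

In the question's code, url.openStream() would take the place of the in-memory fragment.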

The XML you have is not a well-formed document, but it is a well-formed external parsed entity, which means it can be referenced from a well-formed document by means of an entity reference. So create a skeleton document like this:
<!DOCTYPE doc [
<!ENTITY e SYSTEM "data.xml">
]>
<doc>&e;</doc>
where data.xml is your XML, and pass this document to the XML parser in place of the original. Beats writing dozens of lines of Java code.
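When the fragment comes from a stream rather than a file, the entity reference can be served through a SAX EntityResolver. The following is a rough sketch under a few assumptions: the class name is hypothetical, the parser is the JDK default with external general entities enabled (its default setting, as far as I know), and the resolver matches on the system ID ending in "data.xml" because the parser may expand the relative name to an absolute URI.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: the skeleton document pulls the fragment in as an entity.
final class EntityWrapperSketch {

    static final String SKELETON =
        "<!DOCTYPE doc [<!ENTITY e SYSTEM \"data.xml\">]><doc>&e;</doc>";

    /** Parses the skeleton, feeding the fragment stream in place of data.xml. */
    static List<String> elementNames(InputStream fragment) throws Exception {
        List<String> names = new ArrayList<>();
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        // Hand the parser our stream whenever it asks for the external entity.
        reader.setEntityResolver((publicId, systemId) ->
            systemId != null && systemId.endsWith("data.xml") ? new InputSource(fragment) : null);
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                names.add(qName);
            }
        });
        reader.parse(new InputSource(new StringReader(SKELETON)));
        return names;
    }

    public static void main(String[] args) throws Exception {
        InputStream fragment = new ByteArrayInputStream(
            "<innertag>data</innertag><innertag>moreData</innertag>".getBytes(StandardCharsets.UTF_8));
        System.out.println(elementNames(fragment));
    }
}
```

Note that parsers hardened against XXE attacks usually disable external entities, in which case this approach will not work and wrapping the stream is the simpler route.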
