Escaped HTML in XML node via XSLT into XSL-FO


I have a document that has to be generated as PDF. I use Xalan and Apache FOP for processing an XML with XSLT into XSL-FO.
In my XML tree there is a node like this:
<root>
<formula>
<text>3+10*10^-6*l</text>
<html>&lt;html xmlns="http://www.w3.org/1999/xhtml"&gt;3 &middot; + 10 &middot; 10&lt;sup&gt;-6&lt;/sup&gt; &middot; &lt;i&gt;l&lt;/i&gt;&lt;/html&gt;</html>
</formula>
</root>
How can I not only get proper HTML (by using disable-output-escaping="yes") but also get a node-set (exsl:node-set?) that I can process later on? That is, I want to get an XSL-FO representation of that HTML formula so that I can integrate it into my PDF output.
Something like
<xsl:template match="xhtml:b">
<fo:inline font-weight="bold"><xsl:apply-templates/></fo:inline>
</xsl:template>
There may be a solution using saxon:parse(). However, I cannot switch to that from Xalan-J.
Is there a solution in my scenario?

You can certainly write one stylesheet to process with Xalan that does
<xsl:template match="html">
<xsl:value-of select="." disable-output-escaping="yes"/>
</xsl:template>
which then creates a serialized result document with the XHTML markup.
A second stylesheet could then process the result document of the first stylesheet e.g.
<xsl:template match="xhtml:html" xmlns:xhtml="http://www.w3.org/1999/xhtml">
<xsl:apply-templates/>
</xsl:template>
But you can't do it within one stylesheet using a result tree fragment, because disable-output-escaping (d-o-e) is a serialization feature: if you convert a result tree fragment to a node-set with exsl:node-set or similar within one stylesheet, no serialization happens.
Looking closer, your snippet seems to contain references to undeclared entities like &middot;, so I think the sample does not parse as XML at all. You would need to fix that first before doing any XSLT processing.
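The two-stylesheet approach can be chained from Java. The following is only a sketch, assuming the JDK's default JAXP TransformerFactory honours disable-output-escaping when serializing to a stream, and assuming the entity problem has been fixed in the input (here with &#183; instead of &middot;). The class name, the method names and the FO attributes chosen for sup and i are made up for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class TwoPassFormula {
    // Pass 1: identity copy, but write the escaped markup inside <html> unescaped.
    static final String UNESCAPE_XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
      + "<xsl:template match='html'>"
      + "<xsl:value-of select='.' disable-output-escaping='yes'/>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Pass 2: the XHTML is real markup after re-parsing, so ordinary templates match it.
    static final String TO_FO_XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:xhtml='http://www.w3.org/1999/xhtml'"
      + " xmlns:fo='http://www.w3.org/1999/XSL/Format' exclude-result-prefixes='xhtml'>"
      + "<xsl:output omit-xml-declaration='yes'/>"
      + "<xsl:template match='xhtml:html'>"
      + "<fo:block><xsl:apply-templates/></fo:block>"
      + "</xsl:template>"
      + "<xsl:template match='xhtml:sup'>"
      + "<fo:inline baseline-shift='super'><xsl:apply-templates/></fo:inline>"
      + "</xsl:template>"
      + "<xsl:template match='xhtml:i'>"
      + "<fo:inline font-style='italic'><xsl:apply-templates/></fo:inline>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Runs one stylesheet over one XML string and returns the serialized result.
    static String transform(String xsl, String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    // Serializing between the passes is what makes d-o-e take effect.
    static String toFo(String xml) {
        return transform(TO_FO_XSL, transform(UNESCAPE_XSL, xml));
    }

    public static void main(String[] args) {
        String xml = "<root><formula><text>3+10*10^-6*l</text>"
            + "<html>&lt;html xmlns=\"http://www.w3.org/1999/xhtml\"&gt;3 + 10 &#183; 10"
            + "&lt;sup&gt;-6&lt;/sup&gt; &#183; &lt;i&gt;l&lt;/i&gt;&lt;/html&gt;</html>"
            + "</formula></root>";
        System.out.println(toFo(xml));
    }
}
```

The intermediate string is the "serialized result document" mentioned above. In a real pipeline the second pass would be the stylesheet that builds the complete XSL-FO document.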

Related

XSLT required for placing the files in target directory

I have a requirement where the source file name is Rocky_InvoiceNo(uniquevalue)_Timestamp.xml.
The target system wants the filename to be InvoiceNo(uniquevalue)_Timestamp.xml.
Can anyone please share the XSLT code to achieve this?

Check this code:
<xsl:variable name="outputpath_1" select="substring-after('Rocky_InvoiceNo(uniquevalue)_Timestamp.xml', '_')"/>
<xsl:value-of select="$outputpath_1"/>

The context of your question is not entirely clear.
In general, you can retrieve the filepath of the source XML document using the base-uri() or the document-uri() function, e.g.
<xsl:variable name="source-path" select="base-uri()"/>
Then you can remove the "Rocky_" part of the filename using:
<xsl:variable name="target-path" select="replace($source-path, 'Rocky_', '')"/>
and use the resulting path to create a result document using the xsl:result-document instruction, e.g.
<xsl:result-document href="{$target-path}">
<!-- your transformation here -->
</xsl:result-document>
However, IMHO it would be much simpler to perform this task by the application that initiates the XSL transformation instead of in the XSLT stylesheet itself.
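If the renaming is done by the calling application, it could look roughly like this. This is a sketch only: the class and method names and the sample file name are hypothetical, and it assumes the prefix always ends at the first underscore.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class TargetFileName {
    // Drops everything up to and including the first underscore.
    // Unlike XPath substring-after(), a name without an underscore is returned unchanged.
    static String stripPrefix(String sourceName) {
        int i = sourceName.indexOf('_');
        return i < 0 ? sourceName : sourceName.substring(i + 1);
    }

    // Same directory as the source file, new file name.
    static Path targetPath(Path source) {
        return source.resolveSibling(stripPrefix(source.getFileName().toString()));
    }

    public static void main(String[] args) {
        // Hypothetical file name following the Rocky_InvoiceNo_Timestamp.xml pattern.
        Path source = Paths.get("inbox", "Rocky_INV123_20240101.xml");
        System.out.println(targetPath(source));
    }
}
```

The application can then pass the computed path to a StreamResult, or move the file with Files.move after the transformation has run.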

Merge XSLTs with import/include statement using Java

Let's say I have two XSLT stylesheets, A and B. In stylesheet A, we have an import/include statement to use some templates from B. Is there a way in Java to generate the resulting XSLT (A merged with the imported templates)?
SAXON has a way to export the compiled XSLT, but unfortunately the compiled XSLT has the link to the imported XSLT, which we don't want. Any input is appreciated.
Haven't explored XALAN yet on this one.

Why would you want to use Java for this, rather than XSLT?
Most of the job is easy, it can be done with a couple of template rules:
<xsl:mode on-no-match="shallow-copy"/>
<xsl:mode name="nested" on-no-match="shallow-copy"/>
<xsl:template match="xsl:stylesheet | xsl:transform" mode="nested">
<xsl:apply-templates mode="nested"/>
</xsl:template>
<xsl:template match="xsl:import | xsl:include" mode="#all">
<xsl:apply-templates select="document(@href)" mode="nested"/>
</xsl:template>
However, there are complications that make it difficult or impossible if certain XSLT features have been used, for example:
import precedences may not be converted correctly
xsl:apply-imports isn't going to work
attributes on xsl:stylesheet that have module scope (for example exclude-result-prefixes) will be lost.
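Since the question asked about Java, here is a rough DOM-based analogue of the template rules above. It is not the XSLT 3.0 solution itself, just a naive inliner with the same limitations listed above (no import precedence, no xsl:apply-imports, module-scoped attributes are lost). To keep it self-contained, the modules are looked up in a map of href to stylesheet text, which is a stand-in for loading files; all names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class StylesheetInliner {
    static final String XSL_NS = "http://www.w3.org/1999/XSL/Transform";

    // Parses the module named href and replaces each top-level xsl:include / xsl:import
    // with the (recursively inlined) children of the referenced module's root element.
    static Document inline(String href, Map<String, String> modules) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder()
                .parse(new InputSource(new StringReader(modules.get(href))));
            Element root = doc.getDocumentElement();

            // Collect the references first so the tree is not modified while scanning it.
            List<Element> refs = new ArrayList<>();
            for (Node n = root.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n instanceof Element && XSL_NS.equals(n.getNamespaceURI())
                        && ("include".equals(n.getLocalName())
                            || "import".equals(n.getLocalName()))) {
                    refs.add((Element) n);
                }
            }
            for (Element ref : refs) {
                Element nested = inline(ref.getAttribute("href"), modules).getDocumentElement();
                for (Node c = nested.getFirstChild(); c != null; c = c.getNextSibling()) {
                    root.insertBefore(doc.importNode(c, true), ref);
                }
                root.removeChild(ref);
            }
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("Cannot inline " + href, e);
        }
    }

    // Counts XSLT elements with the given local name anywhere in the document.
    static int count(Document doc, String localName) {
        return doc.getElementsByTagNameNS(XSL_NS, localName).getLength();
    }

    public static void main(String[] args) {
        String head = "<xsl:stylesheet version='1.0' xmlns:xsl='" + XSL_NS + "'>";
        Map<String, String> modules = new HashMap<>();
        modules.put("a.xsl", head + "<xsl:include href='b.xsl'/>"
            + "<xsl:template match='/'/></xsl:stylesheet>");
        modules.put("b.xsl", head + "<xsl:template match='item'/></xsl:stylesheet>");
        Document merged = inline("a.xsl", modules);
        System.out.println(count(merged, "template") + " templates, "
            + count(merged, "include") + " includes left");
    }
}
```

The merged DOM can be serialized with an identity Transformer or handed to the processor as a DOMSource.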

Accessing unparsed entities in XSLT with a SAXTransformerFactory and TransformerHandlers

I have some trouble while retrieving unparsed entity URIs, with the XPath function unparsed-entity-uri().
I'm using a SAXTransformerFactory as in the "Efficient XSLT pipeline in Java" question, because I need to perform a chain of transformations (i.e. apply several XSLT transformations, using the result of one transformation as the input for the next).
I discovered that I'm unable to retrieve unparsed entities with the code below. It works well with Xalan, but not with Saxon-HE (version 9.7.0), and I need Saxon because I'd rather use XSLT 2.0 (the code below contains nothing specific to XSLT 2.0; it only serves as an example). It also works with Saxon if I don't use a TransformerHandler, e.g. stf.newTransformer(new StreamSource("transfo.xsl")).transform(new StreamSource("input.xml"), new StreamResult(System.out)) will produce the desired output.
Is there a configuration step that I forgot?
// use "org.apache.xalan.processor.TransformerFactoryImpl" for Xalan
String transformerFactoryClassName = "net.sf.saxon.TransformerFactoryImpl";
SAXTransformerFactory stf = (SAXTransformerFactory) TransformerFactory.newInstance(transformerFactoryClassName,
LaunchSimpleTransformationUnparsedEntities.class.getClassLoader());
try {
TransformerHandler thTransf = stf
.newTransformerHandler(new StreamSource("transfo.xsl"));
// output the result in console
thTransf.setResult(new StreamResult(System.out));
// Launch transformation of input.xml
Transformer t = stf.newTransformer();
t.transform(new StreamSource("input.xml"),
new SAXResult(thTransf));
} catch (TransformerConfigurationException e) {
e.printStackTrace();
} catch (TransformerException e) {
e.printStackTrace();
}
In input, I have (for input.xml):
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE book
[<!ENTITY cover_hadrien SYSTEM "images/covers/cover_hadrien.jpg" NDATA jpeg>]>
<book>
<title>Les mémoires d'Hadrien</title>
<author>Marguerite Yourcenar</author>
<cover imgref="cover_hadrien" />
</book>
and a sample XSLT (for transfo.xsl):
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<xsl:template match="cover">
<xsl:copy>
<xsl:value-of select="unparsed-entity-uri(@imgref)"/>
</xsl:copy>
</xsl:template>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
as a result, I would expect something like:
<?xml version="1.0" encoding="UTF-8"?><book>
<title>Les mémoires d'Hadrien</title>
<author>Marguerite Yourcenar</author>
<cover>images/covers/cover_hadrien.jpg</cover>
</book>
but <cover> is empty when performing the transformation with Saxon.

Interesting observation. The issue in fact is not with Saxon's TransformerHandler, but rather with the "identity transformer" obtained using SAXTransformerFactory.newTransformer(): the identity transformer is not passing unparsed entities down the line. This is essentially because Saxon's identity transformer is reusing parts of the XSLT engine, and XSLT does not provide any way for a transformation to output unparsed entities in the result. If you sent the SAX parser output directly to the TransformerHandler, rather than going via an identity transformer, then I think it would all work.
As with all things JAXP-related, the specification of SAXTransformerFactory.newTransformer() is infuriatingly vague. All it says is that the returned Transformer performs a copy of the Source to the Result, i.e. the "identity transform". What exactly counts as a copy? I think Saxon's interpretation has been that it is equivalent to the effect of doing an XSLT identity transform, which would lose unparsed entities (as well as other things like CDATA sections, the DTD, etc).
Incidentally XSLT 2.0 specifies that the result of unparsed-entity-uri() should be an absolute URI (XSLT 1.0 doesn't say anything on the subject) so even if this is fixed, the Saxon output will be different.
Entered as a Saxon issue here: https://saxonica.plan.io/issues/3201 I think we need to be a bit careful about passing unparsed entities to a SAXResult if we don't pass all the other events expected by a SAX DTDHandler - and we're certainly not going to change the Saxon identity transformer to retain things (like DTD declarations) that aren't modelled in XDM.

Indeed, following @MichaelKay's details, launching the transformation this way works properly:
// launch transformation of input.xml
XMLReader reader = XMLReaderFactory.createXMLReader();
reader.setContentHandler(thTransf);
reader.setDTDHandler(thTransf);
reader.parse(new InputSource("input.xml"));
(this replaces the following lines:
// Launch transformation of input.xml
Transformer t = stf.newTransformer();
t.transform(new StreamSource("input.xml"),
new SAXResult(thTransf));
which were used initially).

How to use Parameters in xsl:apply-templates which are set from java?

I have a question on how to dynamically set the xpath expression in apply-templates select=?
<xsl:template match="CDS">
<xsl:result-document href="{$fileName}">
<xsl:copy>
<xsl:apply-templates select="$xpathCondition"/>
</xsl:copy>
</xsl:result-document>
</xsl:template>
I am trying to set $xpathCondition from Java: the value is read from a properties file and passed to the stylesheet as a param.
transformer.setParameter("fileName", "Test.xml");
transformer.setParameter("xpathCondition", "CD[contains(Title/text(),'TEST')]");
$fileName is working as expected. But $xpathCondition is not working as expected.

There's no standard way of parsing a string as a dynamic XPath expression and executing it until you get to the xsl:evaluate instruction in XSLT 3.0. You really need to tell us which version you are using - the fact that you use xsl:result-document tells us that it's 2.0 or later, but beyond that we are guessing.
Many XSLT processors have an extension function called xx:eval() or similar.

The problem can be tackled in XSLT 3.0 using static parameters and shadow attributes. You can write:
<xsl:param name="xpathCondition" static="yes"/>
and then:
<xsl:apply-templates _select="{$xpathCondition}"/>
(Note the underscore in _select)
With 2.0 (or indeed 1.0) you can simulate this approach by doing a transformation on the stylesheet before executing it.
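For the 1.0/2.0 case, "transforming the stylesheet first" can be as simple as substituting the expression into the stylesheet source before compiling it. The sketch below does this with a text placeholder and the JDK's built-in XSLT 1.0 processor, so xsl:result-document is left out. The __XPATH__ marker, the class name and the sample data are assumptions made for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class DynamicSelect {
    // Stylesheet source with a placeholder where the XPath expression goes.
    static final String TEMPLATE_XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output omit-xml-declaration='yes'/>"
      + "<xsl:template match='CDS'>"
      + "<xsl:copy><xsl:apply-templates select=\"__XPATH__\"/></xsl:copy>"
      + "</xsl:template>"
      + "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Escapes the characters that would break a double-quoted XML attribute value.
    static String escapeAttr(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace("\"", "&quot;");
    }

    // Builds the concrete stylesheet for this expression, compiles it and runs it.
    static String run(String xpath, String xml) {
        String xsl = TEMPLATE_XSL.replace("__XPATH__", escapeAttr(xpath));
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<CDS><CD><Title>TEST one</Title></CD>"
            + "<CD><Title>Other</Title></CD></CDS>";
        System.out.println(run("CD[contains(Title/text(),'TEST')]", xml));
    }
}
```

Because the expression becomes part of the stylesheet, it should only come from a trusted source such as your own properties file.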

PDF report with embedded HTML

We have a Java-based system that reads data from a database, merges individual data fields with preset XSL-FO tags and converts the result to PDF with Apache FOP.
In XSL-FO format it looks like this:
<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE Html [
<!ENTITY nbsp " ">
<!-- all other entities -->
]>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:fo="http://www.w3.org/1999/XSL/Format">
<xsl:output method="xml" indent="yes" />
<xsl:template match="/">
<fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format" xmlns:svg="http://www.w3.org/2000/svg" font-family="..." font-size="...">
<fo:layout-master-set>
<fo:simple-page-master master-name="Letter Page" page-width="8.500in" page-height="11.000in">
<!-- appropriate settings -->
</fo:simple-page-master>
</fo:layout-master-set>
<fo:page-sequence master-reference="Letter Page">
<!-- some static content -->
<fo:flow flow-name="xsl-region-body">
<fo:block>
<fo:table ...>
<fo:table-column ... />
<fo:table-body>
<fo:table-row>
<fo:table-cell ...>
<fo:block text-align="...">
<fo:inline font-size="..." font-weight="...">
<!-- Header / Title -->
</fo:inline>
</fo:block>
</fo:table-cell>
</fo:table-row>
</fo:table-body>
</fo:table>
</fo:block>
<fo:block>
<fo:table ...>
<fo:table-column ... />
<fo:table-body>
<fo:table-row>
<fo:table-cell>
<fo:block ...>
<!-- Field A -->
</fo:block>
</fo:table-cell>
</fo:table-row>
</fo:table-body>
</fo:table>
<!-- Other fields in a very similar fashion as the above "Field A" -->
</fo:block>
</fo:flow>
</fo:page-sequence>
</fo:root>
</xsl:template>
</xsl:stylesheet>
Now I am looking for a way to allow some of the fields to contain static HTML-formatted content. This content will be generated by our HTML-enabled editor (something along the lines of CLEditor, CKEditor, etc.) or pasted from outside.
My plan is to follow the recipe from this JavaWorld article:
use JTidy to convert HTML-formatted string to proper XHTML
further modify xhtml2fo.xsl from Antenna House to remove all document-wide and page-wide transformations
apply this modified XSLT to my XHTML string (javax.xml.transform)
extract all the nodes under the root with XPath (javax.xml.xpath)
feed the result directly into existing XSL-FO document
I have a bare-bone version of such code and got the following error:
(Location of error unknown)org.apache.fop1.fo.ValidationException:
"{http://www.w3.org/1999/XSL/Format}table-body" is not a valid child
of "fo:block"! (No context info available)
My questions:
What would be the way to troubleshoot this issue?
Can <fo:block> serve as a generic container with other objects (including tables) nested inside?
Is this an overall reasonable approach to solving the task?
If someone has already "been there, done that", please share your experience.

If you use an XSLT debugger such as in oXygen or XML Spy, then you can step through the transformation. With oXygen -- not sure about XML Spy or other editors -- if you click on the markup in the debugger output, oXygen highlights the markup from both the source and the stylesheet that produced that node.
Once you have the FO, the focheck framework (https://github.com/AntennaHouse/focheck) has the most complete validation of FO currently available.
fo:block can contain tables, etc. In the XSL 1.1 spec, the definition of every FO includes a 'Contents' subsection that lists its allowed content. See, e.g., http://www.w3.org/TR/xsl11/#fo_block. The definitions of the 'parameter entities' in the content models are at http://www.w3.org/TR/xsl11/#d0e6532, but some FOs have additional restrictions in the text of their definitions.
The article that you cite doesn't seem to have the 'extract all the nodes under the root with XPath' step, and I'm not sure why you need it. Other than that, it looks like a reasonable approach for doing the job using Java.
Instead of inserting the FO transformed from your JTidy-ed HTML into the static FO, you could replace your <!-- Field A --> with non-FO markup that provides enough information to make a reference to the field to insert. You can then make an XSLT stylesheet that transforms the template+references document into straight FO by doing an identity transform on the FO parts -- as in the answer from @kevin-brown -- and using the information in the reference markup to construct the URI to use with the document() function (http://www.w3.org/TR/xslt#document) to find the markup to insert.
If the FO for the field content is sitting on the disk, then using document() is straightforward. If it's not, then you'd have to do something like overriding the URIResolver used by the XSLT processor so that, rather than looking on the disk, it does the right thing to retrieve the content. You may even be able to have the JTidying happen as part of the URIResolver retrieving the HTML. You could also do the transformation to FO 'inside' the URIResolver or, also as @kevin-brown suggested, do it as a separate mode. If the transformation is done before or during the URIResolver retrieving the FO, then the 'main' transformation of template+references to FO just needs to extract the right part of the FO sub-document, e.g. document('constructed-URI')/fo:root/fo:page-sequence/*. However, if you're modifying the stylesheet from Antenna House, then you should be able to modify it to not produce an outer fo:root, etc., anyway.
I did something similar years ago with overriding the URI resolver for the libxslt XSLT processor for an XSLT-based server: the context for successive runs of the inner XSLT processor was saved as documents at special URIs and weren't necessarily written to the file system at all.
You could, instead, possibly write an extension function that does the lookup of the references to the fields. The Print and Page Layout Community Group @ W3C, for example, has produced extension functions for multiple XSLT processors that run an FO processor in the middle of the XSLT transformation to get back the XML for an area tree for the formatted result. See http://www.w3.org/community/ppl/wiki/XSLTExtensions

The best way to troubleshoot is to use a validating viewer/editor to examine the XSL FO. Many (such as oXygen) will show you errors in the XSL FO structure as you open the file and will describe the issue, much like the error you reported.
In your case, you obviously have an fo:table-body as a child of fo:block. That is not allowed: an fo:table-body has only one valid parent, fo:table. You are either missing the fo:table tag or you have erroneously inserted an fo:block in this position.
In my opinion, I might do things slightly differently. I would put the XHTML content inline into the XSL FO right where you want it. Then I would create an identity transform that copies over all the content that is fo-based, but converts the XHTML parts using XSL. This way, you can actually step through that transform in an XSL editor like oXygen and see where errors occur and exactly why, like in any other debugger.
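Without an editor at hand, a few lines of DOM code can at least locate this particular mistake in the generated FO before it reaches FOP. This is a quick sanity check under my own naming, not a replacement for a real FO validator, and it checks only the one rule discussed here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class TableBodyCheck {
    static final String FO_NS = "http://www.w3.org/1999/XSL/Format";

    // Counts fo:table-body elements whose parent is not fo:table.
    static int misplacedTableBodies(String fo) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder()
                .parse(new InputSource(new StringReader(fo)));
            NodeList bodies = doc.getElementsByTagNameNS(FO_NS, "table-body");
            int misplaced = 0;
            for (int i = 0; i < bodies.getLength(); i++) {
                Node parent = bodies.item(i).getParentNode();
                boolean ok = parent != null
                    && FO_NS.equals(parent.getNamespaceURI())
                    && "table".equals(parent.getLocalName());
                if (!ok) {
                    misplaced++;
                }
            }
            return misplaced;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String bad = "<fo:block xmlns:fo='" + FO_NS + "'><fo:table-body/></fo:block>";
        System.out.println("misplaced fo:table-body elements: " + misplacedTableBodies(bad));
    }
}
```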
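A minimal sketch of that idea, run from Java with the JDK's default JAXP processor: everything is copied unchanged except elements in the XHTML namespace, which are mapped to FO. Only p, b and i are handled, and the class name, sample input and chosen FO properties are illustrative assumptions; a real stylesheet needs many more rules.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class InlineXhtmlToFo {
    static final String XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:xhtml='http://www.w3.org/1999/xhtml'"
      + " xmlns:fo='http://www.w3.org/1999/XSL/Format' exclude-result-prefixes='xhtml'>"
      + "<xsl:output omit-xml-declaration='yes'/>"
      // Identity rule: FO markup (and anything else not matched below) is copied as is.
      + "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
      // XHTML rules: each one replaces an XHTML element with an FO equivalent.
      + "<xsl:template match='xhtml:p'>"
      + "<fo:block><xsl:apply-templates/></fo:block>"
      + "</xsl:template>"
      + "<xsl:template match='xhtml:b'>"
      + "<fo:inline font-weight='bold'><xsl:apply-templates/></fo:inline>"
      + "</xsl:template>"
      + "<xsl:template match='xhtml:i'>"
      + "<fo:inline font-style='italic'><xsl:apply-templates/></fo:inline>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Converts a document that mixes FO and XHTML into FO-only markup.
    static String convert(String mixed) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(mixed)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String mixed = "<fo:block xmlns:fo='http://www.w3.org/1999/XSL/Format'>"
            + "<p xmlns='http://www.w3.org/1999/xhtml'>Total: <b>42</b></p>"
            + "</fo:block>";
        System.out.println(convert(mixed));
    }
}
```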
Note: You may also wish to look at other XSL stylesheets, especially if your HTML may have any style="" CSS attributes. In that case it is not simple HTML, and you will need a better method for converting HTML with CSS to FO.
http://www.cloudformatter.com/css2pdf is based on this complete transform. That general stylesheet is available here: http://xep.cloudformatter.com/doc/XSL/xeponline-fo-translate-2.xsl
I am the author of that stylesheet. It does much more than you ask, but has a fairly complex parsing recursion for converting CSS styling into XSL FO attributes.
