PDF report with embedded HTML

We have a Java-based system that reads data from a database, merges individual data fields with preset XSL-FO tags and converts the result to PDF with Apache FOP.
In XSL-FO format it looks like this:
<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE Html [
<!ENTITY nbsp "&#160;">
<!-- all other entities -->
]>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:fo="http://www.w3.org/1999/XSL/Format">
<xsl:output method="xml" indent="yes" />
<xsl:template match="/">
<fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format" xmlns:svg="http://www.w3.org/2000/svg" font-family="..." font-size="...">
<fo:layout-master-set>
<fo:simple-page-master master-name="Letter Page" page-width="8.500in" page-height="11.000in">
<!-- appropriate settings -->
</fo:simple-page-master>
</fo:layout-master-set>
<fo:page-sequence master-reference="Letter Page">
<!-- some static content -->
<fo:flow flow-name="xsl-region-body">
<fo:block>
<fo:table ...>
<fo:table-column ... />
<fo:table-body>
<fo:table-row>
<fo:table-cell ...>
<fo:block text-align="...">
<fo:inline font-size="..." font-weight="...">
<!-- Header / Title -->
</fo:inline>
</fo:block>
</fo:table-cell>
</fo:table-row>
</fo:table-body>
</fo:table>
</fo:block>
<fo:block>
<fo:table ...>
<fo:table-column ... />
<fo:table-body>
<fo:table-row>
<fo:table-cell>
<fo:block ...>
<!-- Field A -->
</fo:block>
</fo:table-cell>
</fo:table-row>
</fo:table-body>
</fo:table>
<!-- Other fields in a very similar fashion as the above "Field A" -->
</fo:block>
</fo:flow>
</fo:page-sequence>
</fo:root>
</xsl:template>
</xsl:stylesheet>
Now I am looking for a way to allow some of the fields to contain static HTML-formatted content. This content will be generated by our HTML-enabled editor (something along the lines of CLEditor, CKEditor, etc.) or pasted from outside.
My plan is to follow the recipe from this JavaWorld article:
use JTidy to convert HTML-formatted string to proper XHTML
further modify xhtml2fo.xsl from Antenna House to remove all document-wide and page-wide transformations
apply this modified XSLT to my XHTML string (javax.xml.transform)
extract all the nodes under the root with XPath (javax.xml.xpath)
feed the result directly into existing XSL-FO document
I have a bare-bone version of such code and got the following error:
(Location of error unknown)org.apache.fop1.fo.ValidationException:
"{http://www.w3.org/1999/XSL/Format}table-body" is not a valid child
of "fo:block"! (No context info available)
My questions:
What would be the way to troubleshoot this issue?
Can <fo:block> serve as a generic container with other objects (including tables) nested inside?
Is this an overall reasonable approach to solving the task?
If someone has already "been there, done that", please share your experience.

If you use an XSLT debugger such as in oXygen or XML Spy, then you can step through the transformation. With oXygen -- not sure about XML Spy or other editors -- if you click on the markup in the debugger output, oXygen highlights the markup from both the source and the stylesheet that produced that node.
Once you have the FO, the focheck framework (https://github.com/AntennaHouse/focheck) has the most complete validation of FO currently available.
fo:block can contain tables, etc. In the XSL 1.1 spec, the definition of every FO includes a 'Contents' subsection that lists its allowed content. See, e.g., http://www.w3.org/TR/xsl11/#fo_block. The definitions of the 'parameter entities' in the content models are at http://www.w3.org/TR/xsl11/#d0e6532, but some FOs have additional restrictions in the text of their definitions.
The article that you cite doesn't seem to have the 'extract all the nodes under the root with XPath' step, and I'm not sure why you need it. Other than that, it looks like a reasonable approach for doing the job using Java.
Instead of inserting the FO transformed from your JTidy-ed HTML into the static FO, you could replace your <!-- Field A --> placeholder comments with non-FO markup that provides enough information to identify the field to insert. You can then write an XSLT stylesheet that transforms the template+references document into straight FO by doing an identity transform on the FO parts (as in the answer from @kevin-brown) and using the information in the reference markup to construct the URI to pass to the document() function (http://www.w3.org/TR/xslt#document) to find the markup to insert.
If the FO for the field content is sitting on the disk, then using document() is straightforward. If it's not, then you'd have to do something like overriding the URIResolver used by the XSLT processor so that, rather than looking on the disk, it does the right thing to retrieve the content. You may even be able to have the JTidying happen as part of the URIResolver retrieving the HTML. You could also do the transformation to FO 'inside' the URIResolver or, as @kevin-brown also suggested, do it as a separate mode. If the transformation is done before or during the URIResolver retrieving the FO, then the 'main' transformation of template+references to FO just needs to extract the right part of the FO sub-document, e.g. document('constructed-URI')/fo:root/fo:page-sequence/*. However, if you're modifying the stylesheet from Antenna House, then you should be able to modify it to not produce an outer fo:root, etc., anyway.
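The URIResolver idea can be sketched with the JDK's built-in XSLT 1.0 processor. This is only an illustration under stated assumptions: the field-ref element, the "field:" URI scheme, the wrapper root element and the in-memory map are all made up for the example, and the HTML-to-FO conversion is assumed to have happened already.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Map;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class FieldResolverSketch {
    // Identity transform plus one rule: <field-ref name="x"/> (hypothetical markup)
    // is replaced by the children of the document found at the made-up URI "field:x".
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output omit-xml-declaration='yes'/>"
      + "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
      + "<xsl:template match='field-ref'>"
      + "<xsl:copy-of select=\"document(concat('field:', @name))/*/node()\"/>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Expands every field-ref in the template using FO fragments held in memory. */
    static String render(String template, Map<String, String> fieldFo) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            // Serve document('field:...') from the map instead of the file system.
            // Returning null hands the URI back to the processor's default lookup.
            t.setURIResolver((href, base) -> {
                if (!href.startsWith("field:")) {
                    return null;
                }
                String fo = fieldFo.get(href.substring("field:".length()));
                return fo == null ? null : new StreamSource(new StringReader(fo));
            });
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(template)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Template expansion failed", e);
        }
    }
}
```

Each stored fragment needs a single root element (here an arbitrary wrapper) that declares the fo namespace; the stylesheet copies only its children into the template. A real resolver could fetch the HTML from the database and run JTidy and the HTML-to-FO transform before returning the Source.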
I did something similar years ago by overriding the URI resolver for the libxslt XSLT processor in an XSLT-based server: the context for successive runs of the inner XSLT processor was saved as documents at special URIs that weren't necessarily written to the file system at all.
You could, instead, possibly write an extension function that does the lookup of the references to the fields. The Print and Page Layout Community Group @ W3C, for example, has produced extension functions for multiple XSLT processors that run an FO processor in the middle of the XSLT transformation to get back the XML for an area tree for the formatted result. See http://www.w3.org/community/ppl/wiki/XSLTExtensions

The best way to troubleshoot is to examine the XSL-FO in a validating viewer/editor. Many (such as oXygen) flag errors in the XSL-FO structure as soon as you open the file and describe the issue, much like the error you were given.
In your case, you clearly have an fo:table-body as a child of fo:block, which is not allowed. An fo:table-body has only one valid parent, fo:table. Either you are missing the fo:table tag or you have erroneously inserted an fo:block at this position.
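If no validating editor is at hand, a quick structural check can be scripted with the JDK's own XPath support. The following is a minimal sketch, not a substitute for real FO validation: it only looks for fo:table-body elements whose parent is not fo:table, and the class and method names are invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class TableBodyParentCheck {
    static final String FO_NS = "http://www.w3.org/1999/XSL/Format";

    /** Returns the node name of the parent of every fo:table-body not inside fo:table. */
    static List<String> misplacedTableBodyParents(String foXml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder()
                .parse(new InputSource(new StringReader(foXml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // local-name()/namespace-uri() tests avoid having to register a NamespaceContext.
            String expr =
                "//*[local-name()='table-body' and namespace-uri()='" + FO_NS + "']"
              + "[not(parent::*[local-name()='table' and namespace-uri()='" + FO_NS + "'])]";
            NodeList hits = (NodeList) xpath.evaluate(expr, doc, XPathConstants.NODESET);
            List<String> parents = new ArrayList<String>();
            for (int i = 0; i < hits.getLength(); i++) {
                parents.add(hits.item(i).getParentNode().getNodeName());
            }
            return parents;
        } catch (Exception e) {
            throw new IllegalStateException("Could not inspect the FO document", e);
        }
    }
}
```

Running this on the FO that is about to be handed to FOP shows which parent element wraps the stray fo:table-body, which usually points at the template that produced it.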
Personally, I would do things slightly differently. I would put the XHTML content inline into the XSL-FO right where you want it. Then I would create an identity transform that copies over all the FO-based content but converts the XHTML parts using XSL. This way, you can step through that transform in an XSL editor like oXygen and see where errors occur and exactly why, like in any other debugger.
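A minimal sketch of that identity-transform idea, run from Java with the JDK's built-in XSLT 1.0 processor, might look like the following. The three XHTML rules (p, b/strong, i/em) are placeholders chosen for the example; a real stylesheet would need many more, and the class name is made up.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class InlineXhtmlToFo {
    // Copies FO markup unchanged and rewrites a few XHTML elements into FO.
    static final String XSLT =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:fo='http://www.w3.org/1999/XSL/Format'"
      + " xmlns:xhtml='http://www.w3.org/1999/xhtml'"
      + " exclude-result-prefixes='xhtml'>"
      + "<xsl:output omit-xml-declaration='yes'/>"
      // Identity rule: everything without a more specific rule is copied as-is.
      + "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
      + "<xsl:template match='xhtml:p'>"
      + "<fo:block><xsl:apply-templates/></fo:block>"
      + "</xsl:template>"
      + "<xsl:template match='xhtml:b|xhtml:strong'>"
      + "<fo:inline font-weight='bold'><xsl:apply-templates/></fo:inline>"
      + "</xsl:template>"
      + "<xsl:template match='xhtml:i|xhtml:em'>"
      + "<fo:inline font-style='italic'><xsl:apply-templates/></fo:inline>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Converts a mixed FO + XHTML document into plain FO. */
    static String toFo(String mixedDocument) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(mixedDocument)),
                        new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XHTML to FO conversion failed", e);
        }
    }

    public static void main(String[] args) {
        String mixed =
            "<fo:block xmlns:fo='http://www.w3.org/1999/XSL/Format'>"
          + "<p xmlns='http://www.w3.org/1999/xhtml'>Hello <b>world</b></p>"
          + "</fo:block>";
        System.out.println(toFo(mixed));
    }
}
```

Because the XHTML namespace is declared on the embedded fragment itself, the copied FO elements do not carry it into the result.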
Note: You may wish to look at other XSLs also, especially if your HTML may have any style="" CSS attributes. If so, it is no longer simple HTML, and you will need a more capable method for converting HTML with CSS to FO.
http://www.cloudformatter.com/css2pdf is based on this complete transform. That general stylesheet is available here: http://xep.cloudformatter.com/doc/XSL/xeponline-fo-translate-2.xsl
I am the author of that stylesheet. It does much more than you ask, but has a fairly complex parsing recursion for converting CSS styling into XSL FO attributes.

Related

XSLT required for placing the files in target directory

I have a requirement where the source file name is Rocky_InvoiceNo(uniquevalue)_Timestamp.xml..
Target system wants the filename to InvoiceNo(uniquevalue)_Timestamp.xml.
Can anyone please share the xslt code to achieve this.

Check this code:
<xsl:variable name="outputpath_1" select="substring-after('Rocky_InvoiceNo(uniquevalue)_Timestamp.xml', '_')"/>
<xsl:value-of select="$outputpath_1"/>

The context of your question is not entirely clear.
In general, you can retrieve the filepath of the source XML document using the base-uri() or the document-uri() function, e.g.
<xsl:variable name="source-path" select="base-uri()"/>
Then you can remove the "Rocky_" part of the filename using:
<xsl:variable name="target-path" select="replace($source-path, 'Rocky_', '')"/>
and use the resulting path to create a result document using the xsl:result-document instruction, e.g.
<xsl:result-document href="{$target-path}">
<!-- your transformation here -->
</xsl:result-document>
However, IMHO it would be much simpler to perform this task by the application that initiates the XSL transformation instead of in the XSLT stylesheet itself.
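If the renaming is done by the calling application, it could look roughly like this in Java. This is a sketch with invented class and method names; it works on the file name only, so a directory that happens to contain "Rocky_" is left alone.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class TargetFileName {
    /** Drops everything up to and including the first underscore of a file name. */
    static String targetName(String sourceName) {
        int sep = sourceName.indexOf('_');
        // Unlike XPath substring-after(), names without an underscore stay unchanged.
        return sep < 0 ? sourceName : sourceName.substring(sep + 1);
    }

    /** Copies the source file into targetDir under the shortened name. */
    static Path copyToTarget(Path source, Path targetDir) throws IOException {
        Files.createDirectories(targetDir);
        Path target = targetDir.resolve(targetName(source.getFileName().toString()));
        return Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

For a hypothetical source name such as Rocky_INV123_20240101.xml, targetName returns INV123_20240101.xml.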

Escaped HTML in XML node via XSLT into XSL-FO

I have a document that has to be generated as PDF. I use Xalan and Apache FOP for processing an XML with XSLT into XSL-FO.
In my XML tree there is a node like this:
<root>
<formula>
<text>3+10*10^-6*l</text>
<html>&lt;html xmlns="http://www.w3.org/1999/xhtml"&gt;3 &middot; + 10 &middot; 10&lt;sup&gt;-6&lt;/sup&gt; &middot; &lt;i&gt;l&lt;/i&gt;&lt;/html&gt;</html>
</formula>
</root>
How can I not only get proper HTML (by using disable-output-escaping="yes") but also get a node-set (exsl:node-set?) that I can process later on? I mean, I want to get a XSL-FO representation of that HTML formula in order to integrate that into my PDF output.
Something like
<xsl:template match="xhtml:b">
<fo:inline font-weight="bold"><xsl:apply-templates/></fo:inline>
</xsl:template>
There may be a solution using saxon:parse(). However, I cannot switch to that from Xalan-J.
Is there a solution in my scenario?

You can certainly write one stylesheet to process with Xalan that does
<xsl:template match="html">
<xsl:value-of select="." disable-output-escaping="yes"/>
</xsl:template>
which then creates a serialized result document with the XHTML markup.
A second stylesheet could then process the result document of the first stylesheet e.g.
<xsl:template match="xhtml:html" xmlns:xhtml="http://www.w3.org/1999/xhtml">
<xsl:apply-templates/>
</xsl:template>
But you can't do it within one stylesheet with a result tree fragment: disable-output-escaping (d-o-e) is a serialization feature, and if you convert a result tree fragment to a node set with exsl:node-set or similar within one stylesheet, no serialization ever happens.
Looking closer, your snippet seems to contain references to undeclared entities such as &middot;, so I think the sample does not parse as XML at all; you would need to fix that first before doing any XSLT processing.
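Chaining the two stylesheets from Java could be sketched as below, using only javax.xml.transform (shown here against the JDK's bundled processor; Xalan-J plugs into the same API). The sample input, the FO mapping for sup and i, and the class name are assumptions for the illustration, and the input deliberately avoids named entities like &middot; so that the re-parse in the second pass succeeds.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class TwoPassFormula {
    static final String XSL_NS = "http://www.w3.org/1999/XSL/Transform";

    // Pass 1: identity copy, but the escaped text of <html> is written out raw,
    // so the serialized result contains real XHTML markup.
    static final String PASS1 =
        "<xsl:stylesheet version='1.0' xmlns:xsl='" + XSL_NS + "'>"
      + "<xsl:output omit-xml-declaration='yes'/>"
      + "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
      + "<xsl:template match='html'>"
      + "<xsl:value-of select='.' disable-output-escaping='yes'/>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Pass 2: the re-parsed XHTML is now a real node tree and can be matched.
    static final String PASS2 =
        "<xsl:stylesheet version='1.0' xmlns:xsl='" + XSL_NS + "'"
      + " xmlns:fo='http://www.w3.org/1999/XSL/Format'"
      + " xmlns:xhtml='http://www.w3.org/1999/xhtml'"
      + " exclude-result-prefixes='xhtml'>"
      + "<xsl:output omit-xml-declaration='yes'/>"
      + "<xsl:template match='/'><xsl:apply-templates select='//xhtml:html'/></xsl:template>"
      + "<xsl:template match='xhtml:html'>"
      + "<fo:block><xsl:apply-templates/></fo:block>"
      + "</xsl:template>"
      + "<xsl:template match='xhtml:sup'>"
      + "<fo:inline baseline-shift='super' font-size='smaller'><xsl:apply-templates/></fo:inline>"
      + "</xsl:template>"
      + "<xsl:template match='xhtml:i'>"
      + "<fo:inline font-style='italic'><xsl:apply-templates/></fo:inline>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    static String transform(String xslt, String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    /** Serializes with pass 1, then parses that output again for pass 2. */
    static String formulaToFo(String xml) {
        return transform(PASS2, transform(PASS1, xml));
    }
}
```

The intermediate string is what makes this work: the first result is serialized, so disable-output-escaping takes effect, and the second transform parses it as ordinary XML.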

Creating a word document from a template dynamically using values from java objects

I want to create a word document from an HTML page.
I am planning to get the values on the HTML page and then pass these values to a document template.
I have used JSOUP to parse the contents of the HTML page and I get the values in my java program. I now want to pass these values to a word document template.
I want to know what are the best techniques I can use to create the document template and pass the values to the template to create the word document.
Thank You.

I found something very interesting and simple. We just need to create a simple .xml template for the document we want to create, then programmatically change the contents of the XML file and save it as an MS Word document.
You can find the xml template and the code here.

I suggest you use XSLT, because your data is already in XML format and there are well-defined XML formats from Microsoft.
You could write a document template in Word and save it in XML format. Then you can convert the Word XML into an XSL template that takes your HTML-derived XML as input. After the XSLT transformation you have valid Word XML with your dynamic values filled in.
XSLT example for Excel:
<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<xsl:output method="xml" encoding="UTF-8" omit-xml-declaration="no" />
<xsl:template match="/">
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet" xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
xmlns:html="http://www.w3.org/TR/REC-html40">
...
<xsl:for-each
select="/yourroot/person">
...
<Cell ss:StyleID="uf">
<Data ss:Type="String">
<xsl:value-of
select="@Name" />
</Data>
</Cell>
..
</xsl:for-each>
...
</xsl:template>
</xsl:stylesheet>

JODReports and Docmosis might also be useful options for you since there is template populate and Doc output. If DOCX is your real target, then you can write out the document yourself since the XML is published - but that is a lot of work.

ERROR: 'The first argument to the non-static Java function 'evaluate' is not a valid object reference.' when using TransformerFactory

I am trying to transform XML with an XSL stylesheet into XML (to later turn it into a PDF with the FOP library). I am on JDK 1.5 and cannot switch to another version (it is what my company uses). I read that the Xalan jar bundled with Java 1.5 is responsible for the error. The text that causes the error is:
"dyn:evaluate($xpath)"/>
in:
<xsl:variable name="paramName" select="@name"/>
<xsl:variable name="xpath"
select="concat('/doc/data/',$paramName)" />
<fo:inline>
<xsl:value-of select="dyn:evaluate($xpath)"/>
</fo:inline>
</xsl:template>
Is there a way around it without changing the jar? Is there a way to write it differently? Or am I using the wrong syntax?
Thanks for your help

evaluate() is an EXSLT extension function. It is non-standard, but many XSLT processors, including Xalan, support it.
Have you declared the dyn namespace prefix in your stylesheet, so that it correctly references the EXSLT dynamic namespace?
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:dyn="http://exslt.org/dynamic"
extension-element-prefixes="dyn">
...
</xsl:stylesheet>
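If the extension still turns out to be unavailable in the bundled processor, one possible way to write it differently is to avoid dynamic evaluation altogether. When the only dynamic part is a single element name under /doc/data/, as in the question, a plain XPath 1.0 predicate on name() does the same lookup. The sketch below assumes a made-up input document and class name, and it is not a general replacement for dyn:evaluate with arbitrary expressions.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class NameLookupWithoutEvaluate {
    // Looks up /doc/data/<paramName> without dyn:evaluate by comparing element names.
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/'><xsl:apply-templates select='//param'/></xsl:template>"
      + "<xsl:template match='param'>"
      + "<xsl:variable name='paramName' select='@name'/>"
      + "<xsl:value-of select='/doc/data/*[name() = $paramName]'/>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    static String run(String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical input: one data element is selected by the param's name attribute.
        String xml = "<doc><data><city>Oslo</city><zip>0150</zip></data>"
                   + "<layout><param name='city'/></layout></doc>";
        System.out.println(run(xml));
    }
}
```

In the original stylesheet this would mean replacing the dyn:evaluate($xpath) call with select="/doc/data/*[name() = $paramName]" and dropping the $xpath variable.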

SXXP0003: Error reported by XML parser: Content is not allowed in prolog

My XML file is
<?xml version="1.0" encoding="ISO-8859-1"?>
<T0020
xsi:schemaLocation="http://www.safersys.org/namespaces/T0020V1 T0020V1.xsd"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.safersys.org/namespaces/T0020V1">
<INTERFACE>
<NAME>SAFER</NAME>
<VERSION>04.02</VERSION>
</INTERFACE>
<TRANSACTION>
<VERSION>01.00</VERSION>
<OPERATION>REPLACE</OPERATION>
<DATE_TIME>2009-09-01T00:00:00</DATE_TIME>
<TZ>CT</TZ>
</TRANSACTION>
<IRP_ACCOUNT>
<IRP_CARRIER_ID_NUMBER>564182</IRP_CARRIER_ID_NUMBER>
<IRP_BASE_COUNTRY>US</IRP_BASE_COUNTRY>
<IRP_BASE_STATE>AR</IRP_BASE_STATE>
<IRP_ACCOUNT_NUMBER>67432</IRP_ACCOUNT_NUMBER>
<IRP_ACCOUNT_TYPE>I</IRP_ACCOUNT_TYPE>
<IRP_STATUS_CODE>100</IRP_STATUS_CODE>
<IRP_STATUS_DATE>2008-02-01</IRP_STATUS_DATE>
<IRP_UPDATE_DATE>2009-06-18</IRP_UPDATE_DATE>
<IRP_NAME>
<NAME_TYPE>LG</NAME_TYPE>
<NAME>LARRY SHADDON</NAME>
<IRP_ADDRESS>
<ADDRESS_TYPE>PH</ADDRESS_TYPE>
<STREET_LINE_1>10291 HWY 124</STREET_LINE_1>
<STREET_LINE_2/>
<CITY>RUSSELLVILLE</CITY>
<STATE>AR</STATE>
<ZIP_CODE>72802</ZIP_CODE>
<COUNTY>POPE</COUNTY>
<COLONIA/>
<COUNTRY>US</COUNTRY>
</IRP_ADDRESS>
<IRP_ADDRESS>
<ADDRESS_TYPE>MA</ADDRESS_TYPE>
<STREET_LINE_1>10291 HWY124</STREET_LINE_1>
<STREET_LINE_2/>
<CITY>RUSSELLVILLE</CITY>
<STATE>AR</STATE>
<ZIP_CODE>72802</ZIP_CODE>
<COUNTY>POPE</COUNTY>
<COLONIA/>
<COUNTRY>US</COUNTRY>
</IRP_ADDRESS>
</IRP_NAME>
</IRP_ACCOUNT>
</T0020>
I am using the following XSLT to split my XML file into multiple XML files.
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:t="http://www.safersys.org/namespaces/T0020V1" version="2.0">
<xsl:output method="xml" indent="yes" name="xml" />
<xsl:variable name="accounts" select="t:T0020/t:IRP_ACCOUNT" />
<xsl:variable name="size" select="30" />
<xsl:template match="/">
<xsl:for-each select="$accounts[position() mod $size = 1]">
<xsl:variable name="filename" select="resolve-uri(concat('output/',position(),'.xml'))" />
<xsl:result-document href="{$filename}" method="xml">
<T0020>
<xsl:for-each select=". | following-sibling::t:IRP_ACCOUNT[position() &lt; $size]">
<xsl:copy-of select="." />
</xsl:for-each>
</T0020>
</xsl:result-document>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
It works well in a sample Java application, but when I tried to use the same stylesheet in my Spring-based application, it gives the following error.
Error on line 1 column 1 of T0020:
SXXP0003: Error reported by XML parser: Content is not allowed in prolog.
I don't know what is going wrong. Please help me. Thanks in advance.

Your XML starts with a byte-order mark in UTF-8 (0xEF,0xBB,0xBF), which isn't visible. Try opening your file with a hex editor and have a look.
Many text editors under Windows like to insert this at the start of UTF-8 encoded text, despite the fact that UTF-8 doesn't actually need a byte order mark since the ordering of bytes in UTF-8 is already well defined.
Java's XML parsers will all choke on a BOM with exactly the error message you are seeing. You'll need to either strip out the BOM, or write a wrapper for your InputStream that you're handing the XML parser to do this for you at parsing time.
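A wrapper of that kind can be sketched with a PushbackInputStream. This is a minimal version that handles only the UTF-8 BOM (EF BB BF); the class and method names are made up for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

final class BomSkippingStream {
    /** Returns a stream positioned after a leading UTF-8 BOM, if one is present. */
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int read = 0;
        // read() may deliver fewer bytes than requested, so loop until 3 bytes or EOF.
        while (read < 3) {
            int n = pushback.read(head, read, 3 - read);
            if (n < 0) {
                break;
            }
            read += n;
        }
        boolean hasBom = read == 3
            && (head[0] & 0xFF) == 0xEF
            && (head[1] & 0xFF) == 0xBB
            && (head[2] & 0xFF) == 0xBF;
        // No BOM: push the bytes back so the parser still sees the whole document.
        if (!hasBom && read > 0) {
            pushback.unread(head, 0, read);
        }
        return pushback;
    }
}
```

The returned stream can then be passed to the parser or wrapped in a StreamSource. If the XML reaches the parser as a String or Reader instead, the equivalent fix is to drop a leading U+FEFF character.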

There is some content in the document before the XML data starts, probably whitespace at a guess (that's where I've seen this before).
The prolog is the part of the document that is before the opening tag, with tag-like constructs like <? and <!. You may have some characters/whitespace in between these tags too. Prologs and valid content are explained on tiztag.com.
Maybe post a depersonalised example of your XML data?

It's also possible to get this if you attempt to process the content twice (which is fairly easy to do in Spring). In that case, there'd be nothing wrong with your XML. This scenario seems likely, since the sample application works but introducing Spring causes problems.

In my case the encoding="UTF-16" was causing this issue. It got resolved when I changed it to UTF-8.
