How to transform XML with XSL using Java

How to transform XML with XSL using Java - java

I am currently using the standard javax.xml.transform library to transform my XML to CSV using XSL. My XSL file is large - at around 950 lines. My XML files can be quite large also.
It was working fine in the prototype stage with a fraction of the XSL in place at around 50 lines or so. Now in the 'final system' when it performs the transform it comes up with the error Branch target offset too large for short.
private String transformXML() {
String formattedOutput = "";
try {
TransformerFactory tFactory = TransformerFactory.newInstance();
Transformer transformer =
tFactory.newTransformer( new StreamSource( xslFilename ) );
StreamSource xmlSource = new StreamSource(new ByteArrayInputStream( xmlString.getBytes() ) );
ByteArrayOutputStream baos = new ByteArrayOutputStream();
transformer.transform( xmlSource, new StreamResult( baos ) );
formattedOutput = baos.toString();
} catch( Exception e ) {
e.printStackTrace();
}
return formattedOutput;
}
I came across a few postings on this error but not too sure what to do.
Am I doing anything wrong code wise?
Are there any alternative 3rd Party transformers available that could do this?
Thanks,
Andez

Try Saxon instead.
Your code would stay the same. All you would need to do is set javax.xml.transform.TransformerFactory to net.sf.saxon.TransformerFactoryImpl in the JVM's system properties.

Use saxon. offtop: if you use the same stylesheet to transform many XML files, you might want to consider templates (pre-compiled stylesheets):
javax.xml.transform.Templates style = tFactory.newTemplates(xslSource);
style.newTransformer().transform(...);

I came across a post on the net that mentioned apache XALAN. So I added the jars to my project. Everything has started working since even though I do not directly reference any XALAN classes in my code. As far as I can tell it still should use the jaxax.xml classes.
Not too sure what is happening there. But it is working.

As an alternative to Saxon, you can split up your large template into smaller templates.
Template definitions contained in XSLT stylesheets are compiled by SAP
JVM's XSLT compiler "Xalan" into Java methods for faster execution of
transformations. Java bytecode branch instructions contained in these
Java methods are limited to 32K offsets. Large template definitions
can now lead to very large Java methods, where the branch offset would
need to be larger than 32K. Therefore these stylesheets cannot be
compiled to Java methods and therefore cannot be used for
transformations.
Solution
Since each template definition of an XSLT stylesheet is compiled into
a separate Java method, using multiple smaller templates can be used
as solution. A very large template can be broken into multiple smaller
templates by using the "call-template" element.
It is described in-depth in this article Size limitation for XSLT stylesheets.
Sidenote: I would only recommend this as a last resort if saxon is not available, as this requires quite a few changes to your xsl file.

Related

Transform multiple input XML documents with XSLT in a Java application using the Saxon9HE API

How can I transform multiple XML input document objects with a single XSL transformation script using the Saxon9HE processor in a Java application?
I found a way to transform multiple XML input files from the filesystem with an XSLT script here, but I can't figure out how to pass multiple loaded XML Document objects to a Java application utilizing the Saxon9HE API. For a single XML document my code looks like this and works:
Processor proc = new Processor(false);
XsltCompiler comp = proc.newXsltCompiler();
try {
XsltExecutable exp = comp.compile(new StreamSource(stylesheetFile));
XdmNode source = proc.newDocumentBuilder().build(new DOMSource(inputXML));
Serializer out = proc.newSerializer();
out.setOutputProperty(Serializer.Property.METHOD, "xml");
out.setOutputProperty(Serializer.Property.INDENT, "yes");
out.setOutputFile(new File(outputFilename));
XsltTransformer trans = exp.load();
trans.setInitialContextNode(source);
trans.setDestination(out);
trans.transform();
} catch (SaxonApiException e) {
e.printStackTrace();
}

First point: avoid DOM if you can. When you are using Saxon, it's best to let Saxon build the document tree; this will be far more efficient. If you really need to use an external tree model, XOM and JDOM2 are much more efficient than DOM.
If you do want to provide a DOM as input, you have two choices: you can copy it to a Saxon tree, or you can wrap it as a Saxon tree. Use DocumentBuilder.build() in the first case, DocumentBuilder.wrap() in the second. Using build() gives you a higher initial cost, but the transformation itself is then faster.
If you want to pass pre-built trees into the transformation, declare the parameter using <xsl:param name="x" as="document-node()"/>, and then invoke the transformation using transformer.setParameter(new QName('x'), doc) where doc is an instance of XdmNode. You have to construct the XdmNode yourself by using a DocumentBuilder.
(Alternatively, if you want to access the documents in the stylesheet using the doc() or document() functions, you can invent a URI naming scheme and implement this in a URIResolver. When doc('my:uri') is called, your URIResolver is notified, and it should respond with a Source object. If you already have an XdmNode handy, then you can return XdmNode.asSource() to return this document tree as the result of your URIResolver.)

Are there any advantages to using an XSLT stylesheet compared to manually parsing an XML file using a DOM parser

For one of our applications, I've written a utility that uses java's DOM parser. It basically takes an XML file, parses it and then processes the data using one of the following methods to actually retrieve the data.
getElementByTagName()
getElementAtIndex()
getFirstChild()
getNextSibling()
getTextContent()
Now i have to do the same thing but i am wondering whether it would be better to use an XSLT stylesheet. The organisation that sends us the XML file keeps changing their schema meaning that we have to change our code to cater for these shema changes. Im not very familiar with XSLT process so im trying to find out whether im better of using XSLT stylesheets rather than "manual parsing".
The reason XSLT stylesheets looks attractive is that i think that if the schema for the XML file changes i will only need to change the stylesheet? Is this correct?
The other thing i would like to know is which of the two (XSLT transformer or DOM parser) is better performance wise. For the manual option, i just use the DOM parser to parse the xml file. How does the XSLT transformer actually parse the file? Does it include additional overhead compared to manually parsing the xml file? The reason i ask is that performance is important because of the nature of the data i will be processing.
Any advice?
Thanks
Edit
Basically what I am currently doing is parsing an xml file and process the values in some of the xml elements. I don't transform the xml file into any other format. I just extract some value, extract a row from an Oracle database and save a new row into a different table. The xml file I parse just contains reference values I use to retrieve some data from the database.
Is xslt not suitable in this scenario? Is there a better approach that I can use to avoid code changes if the schema changes?
Edit 2
Apologies for not being clear enough about what i am doing with the XML data. Basically there is an XML file which contains some information. I extract this information from the XML file and use it to retrieve more information from a local database. The data in the xml file is more like reference keys for the data i need in the database. I then take the content i extracted from the XML file plus the content i retrieved from the database using a specific key from the XML file and save that data into another database table.
The problem i have is that i know how to write a DOM parser to extract the information i need from the XML file but i was wondering whether using an XSLT stylesheet was a better option as i wouldnt have to change the code if the schema changes.
Reading the responses below it sounds like XSLT is only used for transorming and XML file to another XML file or some other format. Given that i dont intend to transform the XML file, there is probably no need to add the additional overhead of parsing the XSLT stylesheet as well as the XML file.

Transforming XML documents into other formats is XSLT's reason for being. You can use XSLT to output HTML, JSON, another XML document, or anything else you need. You don't specify what kind of output you want. If you're just grabbing the contents of a few elements, then maybe you won't want to bother with XSLT. For anything more, XSLT offers an elegant solution. This is primarily because XSLT understands the structure of the document it's working on. Its processing model is tree traversal and pattern matching, which is essentially what you're manually doing in Java.
You could use XSLT to transform your source data into the representation of your choice. Your code will always work on this structure. Then, when the organization you're working with changes the schema, you only have to change your XSLT to transform the new XML into your custom format. None of your other code needs to change. Why should your business logic care about the format of its source data?

You are right that XSLT's processing model based on a rule-based event-driven approach makes your code more resilient to changes in the schema.
Because it's a different processing model to the procedural/navigational approach that you use with DOM, there is a learning and familiarisation curve, which some people find frustrating; if you want to go this way, be patient, because it will be a while before the ideas click into place. Once you are there, it's much easier than DOM programming.
The performance of a good XSLT processor will be good enough for your needs. It's of course possible to write very inefficient code, just as it is in any language, but I've rarely seen a system where XSLT was the bottleneck. Very often the XML parsing takes longer than the XSLT processing (and that's the same cost as with DOM or JAXB or anything else.)
As others have said, a lot depends on what you want to do with the XML data, which you haven't really explained.

I think that what you need is actually an XPath expression. You could configure that expression in some property file or whatever you use to retrieve your setup parameters.
In this way, you'd just change the XPath expression whenever your customer hides away the info you use in yet another place.
Basically, an XSLT is an overkill, you just need an XPath expression. A single XPath expression will allow to home in onto each value you are after.
Update
Since we are now talking about JDK 1.4 I've included below 3 different ways of fetching text in an XML file using XPath. (as simple as possible, no NPE guard fluff I'm afraid ;-)
Starting from the most up to date.
0. First the sample XML config file
<?xml version="1.0" encoding="UTF-8"?>
<config>
<param id="MaxThread" desc="MaxThread" type="int">250</param>
<param id="rTmo" desc="RespTimeout (ms)" type="int">5000</param>
</config>
1. Using JAXP 1.3 standard part of Java SE 5.0
import javax.xml.parsers.*;
import javax.xml.xpath.*;
import org.w3c.dom.Document;
public class TestXPath {
private static final String CFG_FILE = "test.xml" ;
private static final String XPATH_FOR_PRM_MaxThread = "/config/param[#id='MaxThread']/text()";
public static void main(String[] args) {
DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
docFactory.setNamespaceAware(true);
DocumentBuilder builder;
try {
builder = docFactory.newDocumentBuilder();
Document doc = builder.parse(CFG_FILE);
XPathExpression expr = XPathFactory.newInstance().newXPath().compile(XPATH_FOR_PRM_MaxThread);
Object result = expr.evaluate(doc, XPathConstants.NUMBER);
if ( result instanceof Double ) {
System.out.println( ((Double)result).intValue() );
}
} catch (Exception e) {
e.printStackTrace();
}
}
}
2. Using JAXP 1.2 standard part of Java SE 1.4-2
import javax.xml.parsers.*;
import org.apache.xpath.XPathAPI;
import org.w3c.dom.*;
public class TestXPath {
private static final String CFG_FILE = "test.xml" ;
private static final String XPATH_FOR_PRM_MaxThread = "/config/param[#id='MaxThread']/text()";
public static void main(String[] args) {
try {
DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
docFactory.setNamespaceAware(true);
DocumentBuilder builder = docFactory.newDocumentBuilder();
Document doc = builder.parse(CFG_FILE);
Node param = XPathAPI.selectSingleNode( doc, XPATH_FOR_PRM_MaxThread );
if ( param instanceof Text ) {
System.out.println( Integer.decode(((Text)(param)).getNodeValue() ) );
}
} catch (Exception e) {
e.printStackTrace();
}
}
}
3. Using JAXP 1.1 standard part of Java SE 1.4 + jdom + jaxen
You need to add these 2 jars (available from www.jdom.org - binaries, jaxen is included).
import java.io.File;
import org.jdom.*;
import org.jdom.input.SAXBuilder;
import org.jdom.xpath.XPath;
public class TestXPath {
private static final String CFG_FILE = "test.xml" ;
private static final String XPATH_FOR_PRM_MaxThread = "/config/param[#id='MaxThread']/text()";
public static void main(String[] args) {
try {
SAXBuilder sxb = new SAXBuilder();
Document doc = sxb.build(new File(CFG_FILE));
Element root = doc.getRootElement();
XPath xpath = XPath.newInstance(XPATH_FOR_PRM_MaxThread);
Text param = (Text) xpath.selectSingleNode(root);
Integer maxThread = Integer.decode( param.getText() );
System.out.println( maxThread );
} catch (Exception e) {
e.printStackTrace();
}
}
}

Since performance is important, I would suggest using a SAX parser for this. JAXB will give you roughly the same performance as DOM parsing PLUS it will be much easier and maintainable. Handling the changes in the schema also should not affect you badly if you are using JAXB, just get the new schema and regenerate the classes. If you have a bridge between the JAXB and your domain logic, then the changes can be absorbed in that layer without worrying about XML. I prefer treating XML as just a message that is used in the messaging layer. All the application code should be agnostic of XML schema.

Add nodes to an xml at runtime?

I am writing an application that must update parts of an already existing xml file based on a set of files in a directory. An example of this xml file can be seen below:
http://izpack.org/documentation/sample-install-definition.html
In the below scope a list of files is added and its specified if they should be "parsable" (used for parameter substitution):
<packs>
<pack name="Main Application" required="yes" installGroups="New Application" >
<file src="post-install-tasks.bat" targetdir="$INSTALL_PATH"/>
<file src="build.xml" targetdir="$INSTALL_PATH"/>
<parsable targetfile="$INSTALL_PATH/post-install-tasks.bat"/>
<parsable targetfile="$INSTALL_PATH/build.xml"/>
</pack>
</packs>
Now the number of files that must be added to this scope can change each time the application is run. To make this possible I have considered the following approach:
1) Read the whole xml into a org.w3c.dom.*; Document and add nodes based on result from reading the directory.
2) Somehow add the content from a .properties file to the scope. This way its possible to update the filelist without recompiling the code.
3) ??
Any suggestions on a good approach to this kind of task?

if there's a chance that your XML configuration might be of significant size, then it is really not good to go ahead with a DOM based approach [due to the associated memory footprint of loading a large XML document]
you should take a look at StaX. it has a highly optimised approach for both parsing and writing XML documents.

3) Overwrite the old file with your new, modified version. The DOM parsers keep comments intact, but you could end up with formatting differences. In order to write to a file, do:
Source source = new DOMSource(doc);
File file = new File(filename);
Result result = new StreamResult(file);
Transformer xformer = TransformerFactory.newInstance().newTransformer();
xformer.transform(source, result);

Efficient XSLT pipeline in Java (or redirecting Results to Sources)

I have a series of XSL 2.0 stylesheets that feed into each other, i.e. the output of stylesheet A feeds B feeds C.
What is the most efficient way of doing this? The question rephrased is: how can one efficiently route the output of one transformation into another.
Here's my first attempt:
#Override
public void transform(Source data, Result out) throws TransformerException{
for(Transformer autobot : autobots){
if(autobots.indexOf(autobot) != (autobots.size()-1)){
log.debug("Transforming prelim stylesheet...");
data = transform(autobot,data);
}else{
log.debug("Transforming final stylesheet...");
autobot.transform(data, out);
}
}
}
private Source transform(Transformer autobot, Source data) throws TransformerException{
DOMResult result = new DOMResult();
autobot.transform(data, result);
Node node = result.getNode();
return new DOMSource(node);
}
As you can see, I'm using a DOM to sit in between transformations, and although it is convenient, it's non-optimal performance wise.
Is there any easy way to route to say, route a SAXResult to a SAXSource? A StAX solution would be another option.
I'm aware of projects like XProc, which is very cool if you haven't taken a look at yet, but I didn't want to invest in a whole framework.

I found this: #3. Chaining Transformations that shows two ways to use the TransformerFactory to chain transformations, having the results of one transform feed the next transform and then finally output to system out. This avoids the need for an intermediate serialization to String, file, etc. between transforms.
When multiple, successive
transformations are required to the
same XML document, be sure to avoid
unnecessary parsing operations. I
frequently run into code that
transforms a String to another String,
then transforms that String to yet
another String. Not only is this slow,
but it can consume a significant
amount of memory as well, especially
if the intermediate Strings aren't
allowed to be garbage collected.
Most transformations are based on a
series of SAX events. A SAX parser
will typically parse an InputStream or
another InputSource into SAX events,
which can then be fed to a
Transformer. Rather than having the
Transformer output to a File, String,
or another such Result, a SAXResult
can be used instead. A SAXResult
accepts a ContentHandler, which can
pass these SAX events directly to
another Transformer, etc.
Here is one approach, and the one I
usually prefer as it provides more
flexibility for various input and
output sources. It also makes it
fairly easy to create a transformation
chain dynamically and with a variable
number of transformations.
SAXTransformerFactory stf = (SAXTransformerFactory)TransformerFactory.newInstance();
// These templates objects could be reused and obtained from elsewhere.
Templates templates1 = stf.newTemplates(new StreamSource(
getClass().getResourceAsStream("MyStylesheet1.xslt")));
Templates templates2 = stf.newTemplates(new StreamSource(
getClass().getResourceAsStream("MyStylesheet1.xslt")));
TransformerHandler th1 = stf.newTransformerHandler(templates1);
TransformerHandler th2 = stf.newTransformerHandler(templates2);
th1.setResult(new SAXResult(th2));
th2.setResult(new StreamResult(System.out));
Transformer t = stf.newTransformer();
t.transform(new StreamSource(System.in), new SAXResult(th1));
// th1 feeds th2, which in turn feeds System.out.

Related question Efficient XSLT pipeline, with params, in Java clarified on correct parameters passing to such transformer chain.
And it also gave a hint on slightly shorter solution without third transformer:
SAXTransformerFactory stf = (SAXTransformerFactory)TransformerFactory.newInstance();
Templates templates1 = stf.newTemplates(new StreamSource(
getClass().getResourceAsStream("MyStylesheet1.xslt")));
Templates templates2 = stf.newTemplates(new StreamSource(
getClass().getResourceAsStream("MyStylesheet2.xslt")));
TransformerHandler th1 = stf.newTransformerHandler(templates1);
TransformerHandler th2 = stf.newTransformerHandler(templates2);
th2.setResult(new StreamResult(System.out));
// Note that indent, etc should be applied to the last transformer in chain:
th2.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
th1.getTransformer().transform(new StreamSource(System.in), new SAXResult(th2));

Your best bet is to stick to DOM as you're doing, because an XSLT processor would have to build a tree anyway - streaming is only an option for very limited category of transforms, and few if any processors can figure it out automatically and switch to a streaming-only implementation; otherwise they just read the input and build the tree.

How can I validate documents against Schematron schemas in Java?

As far as I can tell, JAXP by default supports W3C XML Schema and RelaxNG from Java 6.
I can see a few APIs, mostly experimental or incomplete, on the schematron.com links page.
Is there an approach on validating schematron in Java that's complete, efficient and can be used with the JAXP API?

Jing supports pre-ISO Schematron validation (note that Jing's implementation is based also on XSLT).
There are also XSLT implementations that can be very easily invoked from Java. You need to decide what version of Schematron you are interested in and then get the corresponding stylesheet - all of them should be available from schematron.com. The process is very simple simple, involving basically 2 steps:
apply the skeleton XSLT on your Schematron schema to get a new XSLT stylesheet that represents your Schematron schema in XSLT
apply the obtained XSLT on your instance document or documents to validate them
JAXP is just an API and it does not require support for Relax NG from an implementation. You need to check the specific implementation that you use to see if that supports Relax NG or not.

A pure Java Schematron implementation is located at https://github.com/phax/ph-schematron/
It brings support for both the XSLT approach and the pure Java approach.

You can check out SchematronAssert (disclosure: my code). It is meant primarily for unit testing, but you may use it for normal code too. It is implemented using XSLT.
Unit testing example:
ValidationOutput result = in(booksDocument)
.forEvery("book")
.check("author")
.validate();
assertThat(result).hasNoErrors();
Standalone validation example:
StreamSource schemaSource = new StreamSource(... your schematron schema ...);
StreamSource xmlSource = new StreamSource(... your xml document ... );
StreamResult output = ... here your SVRL will be saved ...
// validation
validator.validate(xmlSource, schemaSource, output);
Work with an object representation of SVRL:
ValidationOutput output = validator.validate(xmlSource, schemaSource);
// look at the output
output.getFailures() ...
output.getReports() ...

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.