(Limited) Java extension function support in XSLT - java

I've long made do with the default XML libraries provided by Java (Xerces2-J and Xalan-J), occasionally using the latest versions of those libraries directly when needed. It appears I'm running up against some of the limits of these libraries, especially Xalan-J, which is no longer really maintained and has gone almost 6 years without a release.
I need to provide some custom functions to pull information from external services when called, so they must be implemented in Java. (i.e. I can't implement them within the XSLT itself, as either XSLT or JavaScript functions, etc.) I've done this before using Xalan-Java Extensions. However, providing this seems to be either allow-all or nothing:
http://www.biglist.com/lists/lists.mulberrytech.com/xsl-list/archives/200911/msg00198.html
http://marc.info/?l=xalan-j-users&m=123750821029013
https://issues.apache.org/jira/browse/XALANJ-1850
I need to be able to provide access to a Java extension - but without allowing any arbitrary calls out to Java (think of an embedded call to System.exit(), for example), and ideally, without the XSLT authors even needing to know that it is a Java function (by use of xmlns:java="http://xml.apache.org/xalan/java", for example). Ideally, I'd also be able to keep TransformerFactory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);.
I can do (almost) exactly what I'm looking for using XPath.setXPathFunctionResolver, but this only works for direct XPath calls, and I've not found a way to supply a custom XPath instance for use in XSLT by Xalan. (It also requires FEATURE_SECURE_PROCESSING to not be set on the XPathFactory, though I might be able to get away with having it set only on the TransformerFactory, ignoring that when it is set on the TransformerFactory, the TransformerFactory also automatically sets the same flag on the XPathFactory that it uses.)
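For reference, the direct-XPath case described above can be sketched with only the JDK's built-in JAXP classes. The namespace urn:example:ext, the prefix ext, and the function name shout are hypothetical, chosen for illustration; the sketch also assumes FEATURE_SECURE_PROCESSING is left at the XPathFactory default. It does not show how to wire the resolver into an XSLT transform, which is the open part of the question.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.Locale;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import javax.xml.xpath.XPathFunction;
import org.w3c.dom.Document;

final class ResolverSketch {
    // Hypothetical namespace for the whitelisted functions (an assumption, not from the question).
    static final String EXT_NS = "urn:example:ext";

    /** Evaluates an XPath expression that may call ext:shout(...), returning the string result. */
    static String evaluate(String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();

            // Bind only the "ext" prefix; expressions never mention Java.
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override public String getNamespaceURI(String prefix) {
                    return "ext".equals(prefix) ? EXT_NS : XMLConstants.NULL_NS_URI;
                }
                @Override public String getPrefix(String uri) {
                    return EXT_NS.equals(uri) ? "ext" : null;
                }
                @Override public Iterator<String> getPrefixes(String uri) {
                    return EXT_NS.equals(uri)
                            ? Collections.singletonList("ext").iterator()
                            : Collections.<String>emptyIterator();
                }
            });

            // The resolver is the whitelist: only ext:shout with one argument is offered.
            xpath.setXPathFunctionResolver((QName name, int arity) -> {
                if (EXT_NS.equals(name.getNamespaceURI())
                        && "shout".equals(name.getLocalPart()) && arity == 1) {
                    XPathFunction shout =
                            args -> String.valueOf(args.get(0)).toUpperCase(Locale.ROOT);
                    return shout;
                }
                return null; // anything else stays unresolved
            });

            // Evaluate against an empty document as the context item.
            Document empty = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            return xpath.evaluate(expression, empty);
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed: " + expression, e);
        }
    }
}
```

Calling ResolverSketch.evaluate("ext:shout('abc')") goes through the resolver; a real implementation would call the external service inside the XPathFunction instead of upper-casing its argument.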
So I decided to give Saxon (Saxon-HE 9.5.1-1) a try - and am immediately noticing 2 issues:
When just using XPath directly, XPath.setXPathFunctionResolver does not seem to have any effect under Saxon. The set call completes without issue, but resolveFunction on the passed-in XPathFunctionResolver is never even called under Saxon. (It simply "just works" under Xalan.) Is there some additional configuration necessary for Saxon, or is this maybe a limitation of the HE version?
I've looked at http://www.saxonica.com/documentation/#!extensibility/integratedfunctions/ext-simple-J, which per the author is provided even in the HE version. It also looks to be exactly what I need; however (like with XPathFunctionResolver under Xalan), I can't see how to wire this into XSLT processing. (This part is answered by the question "XALAN register Extension Function like in SAXON".)

Related

Sample code to load & validate an XML file & schema in Saxon under Java

This is for a tool in our system that will verify that it can load an XML file with Saxon, and list any problems. So I want to have Saxon load the file and throw an exception if it can't fully parse it. This test has an option to be given a schema file so it validates against the schema if it exists.
This is for our Java version, so I need to use the Java API. I tried to port over the C# Validate example (there is no Java validate example that I could find), but the API is quite a bit different.
And if possible, get a list of errors it found parsing.
In the download file you should find samples/java/ee/SchemaValidatorExample.java which uses the JAXP interface, and also java/he/S9APIExamples.java which uses the s9api API and includes use cases SchemaA and SchemaB.
These could do with updating: none of them takes advantage of the new interface SchemaValidator.setInvalidityHandler(), which allows you either to supply an instance of InvalidityReportGeneratorEE, which gives you a Saxon-generated report of all invalidities found, or your own InvalidityReportGenerator or InvalidityHandler to produce your own customized reporting. I suggest you take a tour of the Javadoc for documentation of what these do.
These are very much geared to custom reporting of the invalidities found. Throwing an exception if the file isn't valid is much simpler.
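As a rough illustration of the simpler route through the JAXP interface, here is a minimal sketch that parses a document, optionally validates it against a schema, and collects every problem reported. It uses whatever JAXP implementation the JDK resolves by default, not Saxon-EE specifically; selecting Saxon's schema factory is left out. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

final class ValidateSketch {
    /** Returns every problem found; an empty list means the document loaded (and validated) cleanly. */
    static List<String> check(String xml, String xsdOrNull) {
        final List<String> problems = new ArrayList<>();
        // Collect recoverable problems and keep going; stop on fatal ones.
        ErrorHandler collector = new ErrorHandler() {
            @Override public void warning(SAXParseException e) { problems.add(describe("warning", e)); }
            @Override public void error(SAXParseException e) { problems.add(describe("error", e)); }
            @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
        };
        try {
            if (xsdOrNull == null) {
                // No schema supplied: only check that the document is well-formed.
                DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
                builder.setErrorHandler(collector);
                builder.parse(new InputSource(new StringReader(xml)));
            } else {
                SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
                Schema schema = factory.newSchema(new StreamSource(new StringReader(xsdOrNull)));
                Validator validator = schema.newValidator();
                validator.setErrorHandler(collector);
                validator.validate(new StreamSource(new StringReader(xml)));
            }
        } catch (SAXParseException e) {
            problems.add(describe("fatal", e));
        } catch (SAXException | IOException | ParserConfigurationException e) {
            problems.add("fatal: " + e.getMessage());
        }
        return problems;
    }

    private static String describe(String level, SAXParseException e) {
        return level + " at line " + e.getLineNumber() + ": " + e.getMessage();
    }
}
```

A caller that wants the exception behaviour from the question can throw when the returned list is not empty and include the list in the message.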

How to customize Apache Nutch 2.3 generate step

I want Nutch to select specific URLs according to my own rules. This selection happens at generate time. I know how to write a parser/indexer plugin, but how do I do it at generate time? My Nutch version is the 2.3 series.
The Nutch generator is not really an extension point in Nutch, so you cannot write plugins to customize it. Nevertheless, nothing stops you from writing your own generator with your own logic.
You would need to adjust the bin/nutch and bin/crawl scripts in order to call your own generator instead of the default one. Keep in mind that some other parts of Nutch rely on some parts of the generator implementation (SegmentMerger for instance). If you customize these parts, then you'll need to update some other classes as well.
The generator uses the ScoringFilter.generatorSortValue() method when deciding which elements to return. So, this is one alternative that doesn't require changing the generator.
Side note: this is not entirely uncommon to do; I've seen some clients requiring customized generators.
As suggested by Jorge, you could write a scoringfilter to assign scores to pages based on your own logic and filter during the generation step based on that. Alternatively, if by chance your selection rules can be determined based on the URL alone, you could have a bespoke URL normaliser used with a scope of generate (or whatever the value is) which would rewrite the URLs into something that the URL filters would then discard. You'd need to activate the filtering as part of the generate step. This is an ugly hack.
Nutch 2.x is really awkward and I am not sure you could create a copy of your table based on a filter of the original one.
What Gora backend do you use?
StormCrawler is a lot more flexible for this and we've recently added a mechanism for filtering URLs at the spout level, which is exactly what you'd need. You could do a similar thing in Nutch 2.x but that would probably mean changing things in GORA as well.

Java/Maven - Saxon without ServiceLoader override

We are building a common component that is a dependency for multiple other projects.
Our project does some XSLT transformations and we need to use the Saxon engine.
We have full control over the specific XSLT transformation that must use Saxon, but no control over the classpath of the applications that are dependent on us, and we don't want to force them to use Saxon for other XML work.
We can manually invoke the Saxon library directly when doing our transformations using the API provided by those factories.
The problem is that Saxon uses the ServiceLoader pattern to inject itself as the TransformerFactory implementation using this file in the jar:
[saxon.jar]/META-INF/services/javax.xml.transform.TransformerFactory
This means that applications that use us as a dependency might end up using Saxon instead of their existing XML libraries. Asking those applications to change their code to invoke their specific implementation is not an option.
Is there any way we can 'override' the Saxon library to remove the ServiceLoader implementation? Either using Maven, Java or some other process?
Unfortunately it's all too common to find yourself using libraries that have been written to use the JAXP pluggability mechanism to pick up whatever XSLT processor is on the classpath, but which will actually only work if that processor happens to be Xalan.
For the XPath scenario this problem was so acute that the Saxon META-INF no longer declares itself as an XPath service provider (although it still implements all the JAXP interfaces). But for XSLT that solution wouldn't be acceptable.
I would think that for most situations, setting the Java system property javax.xml.transform.TransformerFactory to the relevant classname for Xalan should solve the problem.
Answering this for any future developers with the same issue.
We were not able to find a way to solve this issue. We considered writing a Maven plugin to remove the META-INF/services/ file from the JAR but ultimately decided this was not an appropriate solution.
We are now in the same position we started - dependent applications end up with Saxon as a registered provider and it might override their existing configuration.
For those applications that must use a specific XSLT processor we have asked them to set the system property, e.g.
javax.xml.transform.TransformerFactory=org.apache.xalan.processor.TransformerFactoryImpl
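A dependent application can check whether Saxon has taken over its JAXP lookup, and what the system property changes, with a small probe like the one below. This is a sketch using only standard JAXP calls; the class name FactoryProbe and its methods are hypothetical.

```java
import javax.xml.transform.TransformerFactory;

final class FactoryProbe {
    static final String KEY = "javax.xml.transform.TransformerFactory";

    /** Class name of the factory that the JAXP lookup currently resolves to. */
    static String resolved() {
        return TransformerFactory.newInstance().getClass().getName();
    }

    /** Resolves with the system property temporarily forced to the given class, then restores it. */
    static String resolvedWith(String factoryClass) {
        String previous = System.getProperty(KEY);
        System.setProperty(KEY, factoryClass);
        try {
            return resolved();
        } finally {
            if (previous == null) {
                System.clearProperty(KEY);
            } else {
                System.setProperty(KEY, previous);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("lookup resolves to: " + resolved());
        if (args.length > 0) {
            // e.g. pass the factory class name the application expects to use
            System.out.println("with property set:  " + resolvedWith(args[0]));
        }
    }
}
```

If the first line printed names a Saxon class, the service file in the Saxon JAR is winning the lookup; the second line shows what the application would get once the property is set.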

Woodstox on Android

I've previously written a library in Java that makes heavy use of the native Java 1.6 Stax parser. However, I now want to use this library on Android, where this parser is not supported. I'd like to use Woodstox as it implements the Stax 1.0 API and I wouldn't have to rewrite any of my current code, just sub in the dependency.
Android does not have the stax 1 api, so I realize I have to add it. Right now, I've added the woodstox-core-asl-4.2.0.jar, stax-api-1.0-2.jar, and the stax2-api-3.1.3.jar to the classpath. Everything compiles fine, but when I actually try to run an Android application which depends on this library, I get runtime errors indicating it isn't using Woodstox as the implementation for the stax 1 api.
Is there something I'm misunderstanding or doing incorrectly? Am I missing a jar? I've read the Woodstox help page thoroughly but can't find anything else I'm missing.
EDIT: I'm starting to wonder if it's actually possible to use Woodstox on Android. The issue is with the dependency on the stax api. After some research I discovered that the Dalvik VM appears to not be ok with those packages being in the javax.* namespace.
How are you passing the XMLInputFactory / XMLOutputFactory instance? It is usually better to construct instances directly (of WstxInputFactory, WstxOutputFactory), since there is no real benefit from calling XMLInputFactory.newInstance(): at most it adds overhead (much slower, scans the classpath).
You are not missing any jars: core and stax2-api are always needed; and if the platform does not include the Stax API jar (which Android for some odd reason does not, even though it is part of JDK 1.6), then that one too.
EDIT:
Auto-discovery should be based on a couple of things, checked in order:
1. XMLInputFactory (etc.) checks whether the matching system property ("javax.xml.stream.XMLInputFactory") has a value; this is the class name of the factory to use, if any. So you can set the system property to a specific implementation class.
2. If no system property is found, it looks for the resource META-INF/services/javax.xml.stream.XMLInputFactory and reads it; it contains one line with the class name.
3. If neither works, it tries to create the default instance; for the J2SE SDK this would be the implementation that Oracle provides (Sjsxp).
Woodstox provides SPI information (step 2), to auto-register itself. But since on Android you repackage jars (as Android packages, APK?), it is possible that resource files are either not included, or not found. If this is the root cause, you can still set matching system property. If not, you will need to provide class name or factory instance using other means.
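To see which of these steps would succeed inside the packaged application, a small diagnostic along the following lines may help. It is only a sketch that mirrors the first two checks described above (system property, then service file); it is not the actual JAXP lookup code, and the class name StaxLookupProbe is hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

final class StaxLookupProbe {
    static final String KEY = "javax.xml.stream.XMLInputFactory";

    /** Reports which discovery step would supply the factory class name. */
    static String describe(ClassLoader loader) {
        String fromProperty = System.getProperty(KEY);
        if (fromProperty != null) {
            return "system property -> " + fromProperty;
        }
        String fromService = firstServiceEntry(loader);
        if (fromService != null) {
            return "service file -> " + fromService;
        }
        return "no property or service file; platform default applies";
    }

    /** First non-comment line of META-INF/services/<KEY>, or null if the resource is missing. */
    static String firstServiceEntry(ClassLoader loader) {
        try (InputStream in = loader.getResourceAsStream("META-INF/services/" + KEY)) {
            if (in == null) {
                return null; // resource was not packaged, or is not visible to this loader
            }
            BufferedReader reader =
                    new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
            for (String line; (line = reader.readLine()) != null; ) {
                int hash = line.indexOf('#');
                if (hash >= 0) {
                    line = line.substring(0, hash); // strip trailing comment
                }
                line = line.trim();
                if (!line.isEmpty()) {
                    return line;
                }
            }
            return null;
        } catch (IOException e) {
            return null;
        }
    }
}
```

If describe(...) reports the platform default on the device while the Woodstox jars are bundled, the service file was most likely dropped during packaging, and setting the system property or constructing the Woodstox factories directly is the way around it.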

How to set Saxon as the Xslt processor in Java?

This is a simple question, but one I cannot find the answer to. I have an XSLT 2.0 stylesheet that I'm trying to process in Java. It relies on XSL elements from Saxon.
My current class works fine with simple XSLT 1.0, but I'm getting errors about unrecognized elements with my 2.0 XSLT built with Saxon.
I cannot figure out how to tell Java to use Saxon as the processor. I'm using javax.xml.transform in my class. Is this a property I can set? What do I set it to?
Thanks!
Edited
I figured out how to set the property to use Saxon, but now I'm getting this error.
Provider net.sf.saxon.TransformerFactoryImpl not found
How do I include Saxon in my application?
There are multiple ways to do this (in order of lookup precedence):
Direct Instantiation
Explicitly instantiate the Saxon factory (with a nod to Michael's comment above):
TransformerFactory fact = new net.sf.saxon.TransformerFactoryImpl();
This approach means that your code is locked into using Saxon at compile time. This can be seen as an advantage (no risk of it running with the wrong processor) or a disadvantage (no opportunity to configure a different processor at execution time - not even Saxon Enterprise Edition).
For Saxon-PE, substitute com.saxonica.config.ProfessionalTransformerFactory. For Saxon-EE, substitute com.saxonica.config.EnterpriseTransformerFactory.
Specify Class Name
Specify the factory class when constructing it:
TransformerFactory fact = TransformerFactory.newInstance(
"net.sf.saxon.TransformerFactoryImpl", null);
Note: available as of Java 6. The Java 5 version does not have this method.
This approach allows you to choose the processor at execution time, while still avoiding the costs and risks of a classpath search. For example, your application could provide some configuration mechanism to allow it to run with different Saxon editions by choosing between the various Saxon factory classes.
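A configuration mechanism of that kind could look roughly like the sketch below. The class name ConfiguredFactory is hypothetical, and where the class name string comes from (a config file, an environment variable) is left to the application. If the named class is not on the classpath, JAXP reports a "Provider ... not found" error, which the sketch turns into a more explicit message.

```java
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.TransformerFactoryConfigurationError;

final class ConfiguredFactory {
    /**
     * Creates the factory named by configuration, e.g. one of the Saxon
     * factory classes mentioned above, without any classpath search.
     */
    static TransformerFactory create(String factoryClass) {
        try {
            // A null class loader means the context class loader is used.
            return TransformerFactory.newInstance(factoryClass, null);
        } catch (TransformerFactoryConfigurationError e) {
            throw new IllegalStateException(
                    "XSLT factory " + factoryClass
                            + " could not be loaded; is its JAR on the classpath?", e);
        }
    }
}
```

With the Saxon JAR on the classpath, ConfiguredFactory.create("net.sf.saxon.TransformerFactoryImpl") would return the Saxon factory; without it, the call fails immediately with the message above instead of falling back silently to another processor.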
Use System Property
Set the javax.xml.transform.TransformerFactory system property before creating an instance:
System.setProperty("javax.xml.transform.TransformerFactory",
"net.sf.saxon.TransformerFactoryImpl");
Or on the command line (line broken for readability):
java -Djavax.xml.transform.TransformerFactory=
net.sf.saxon.TransformerFactoryImpl YourApp
This approach has the disadvantage that system properties affect the whole Java VM. Setting this property to select Saxon could mean that some other module in the application, which you might not even know about, starts using Saxon instead of Xalan, and that module could fail as a result if it uses Xalan-specific XSLT constructs.
Use Properties File
Create the following file:
JRE/lib/jaxp.properties
With the following contents:
javax.xml.transform.TransformerFactory=net.sf.saxon.TransformerFactoryImpl
This approach has similar consequences to using the system property.
Service Loader
Create the following file in any JAR on the CLASSPATH:
META-INF/services/javax.xml.transform.TransformerFactory
With the following contents:
net.sf.saxon.TransformerFactoryImpl
This approach has the disadvantage that a small change to the classpath could cause the application to run with a different XSLT engine, perhaps one that the application has never been tested with.
Platform Default
If none of the above is done, then the platform default TransformerFactory instance will be loaded. A friendly description of this pluggability layer can be found here.
Note that 'platform' here means the Java VM, not the hardware or operating system that it is running on. For all current known Java VMs, the platform default is a version of Xalan (which only supports XSLT 1.0). There is no guarantee that this will always be true of every Java VM in the future.
I'd consider this answer an argument against the Java way of doing things.
You can explicitly construct the required Source and Result objects to make sure they are Saxon implementations rather than whatever the default ones are.
I wrote a wrapper around Saxon parser in order to make its use simple, and I called it "EasySaxon": you can find it here, with some code snippet of samples.
Hope it helps.
Francesco
