How to extend doc() functionality in saxon

How to extend doc() functionality in saxon - java

I am looking for an extension of doc() functionality currently available in SAXON in a way that it will read XML not from filesystem or from http network, but from memory, where I have those xmls.
The way I want to use it is like:
mydoc('id')/root/subroot/#myattr
or
doc('mydoc://id')/root/subroot/#myattr
What I have considered so far:
use queryEvaluator.setContextItem() - does not solve my use case as I can have multiple XML sources in one query
register some own URL scheme protocol into Java - seems to me like overkill and I have never done this
write own ExtensionFunction - seems to be the right way so far, but i am confused whether I should use ExtensionFunction or rather ExtensionFunctionDefinition. Also I am littel bit confused by Doc_1 and Doc Saxonica source code as it uses Atomizer and other unknown internall stuff.
So the questions are:
Is it variant 3 the best one (in the means of simplicity) or would you recommend some other approach ?
Is it OK to use ExtensionFunction and return XdmNode from my in-memory xmls ? It seems to me it should work, but I really do not want to step into some edge cases or saxon minefield.
Any comment from experienced Saxon user will be appretiated.

The standard way of doing this is to write a URIResolver and register it with the transformer. The URIResolver is called, supplying the requested URI, and it is expected to return a Source (which can be a StreamSource, SAXSource, or DOMSource, for example). In this scenario you would typically return a StreamSource wrapping a StringReader which wraps the String containing the XML.
You could equally well use an extension function, but it's probably a little bit more complicated.

Related

Is it possible to cache XML documents in Saxon to avoid re-parsing and re-indexing?

I am currently assessing whether XSLT3 with Saxon could be useful for our purposes. Please hear me out.
We are developing a REST API which provides credentials given an input request XML. Basically, there are 3 files in play:
site.xml:
This file holds the data representing the complete organisation: users, roles, credentials, settings, ...
It could easily contain 10.000 lines.
It could be considered as static/immutable.
You could compare it as XML representation of a database, so to say.
request.xml:
This file holds the request as provided to the REST API.
It is rather small, usually around 10 to 50 lines.
It is different for each request.
request.xslt:
This file holds the stylesheet to convert the given request.xml to an output XML.
It loads site.xml via the XSLT document() function, as it needs that data to fulfill the request.
The problem here is that loading site.xml in request.xslt takes a long time. In addition, for each request, indexes as introduced by the XSLT <xsl:key .../> directive must be rebuilt. This adds up.
So it would make sense to somehow cache site.xml, to avoid having to parse and index that file for every request.
It's important to note that multiple API requests can happen concurrently, thus it should be safe to share this cached site.xml between several ongoing XSLT transformations.
Is this possible with Saxon (Java)? How would that work?
Update 1
After some additional reflecting, I realize that maybe I should not attempt to just cache the site.xml XML file, but the request.xslt instead? This assumes that site.xml, which is loaded in request.xslt via document(), is part of that cache.

It would help if you show/tell us which API you use to run XSLT with Saxon.
As for caching the XSLT, with JAXP I think you can do that with a Templates created with newTemplates from the TransformerFactoryImpl (http://saxonica.com/html/documentation/using-xsl/embedding/jaxp-transformation.html), each time you want to run the XSLT you will to create a Transformer with newTransformer().
With the s9api API you can compile once to get an XsltExecutable (http://saxonica.com/html/documentation/javadoc/net/sf/saxon/s9api/XsltExecutable.html) that "is immutable, and therefore thread-safe", you then have to us load() or load30() to create an XsltTransformer or Xslt30Transformer each time you need to run the code.
As for sharing a document, see http://saxonica.com/html/documentation/sourcedocs/preloading.html:
An option is available (Feature.PRE_EVALUATE_DOC_FUNCTION) to indicate
that calls to the doc() or document() functions with constant string
arguments should be evaluated when a query or stylesheet is compiled,
rather than at run-time. This option is intended for use when a
reference or lookup document is used by all queries and
transformations
The section on that configuration option, however, states:
In XSLT 3.0 a better way of having external documents pre-loaded at
stylesheet compile time is to use the new facility of static global
variables.
So in that case you could declare
<xsl:variable name="site-doc" static="yes" select="doc('site.xml')"/>
You will need to wait on Michael Kay's response as to whether that suffices to share the document.

Well, it is certainly possible, but the best way of doing it depends a little on the circumstances, e.g. what happens when site.xml changes.
I would be inclined to create a single s9api Processor at application startup, and immediately (that is, during application initialization) load site.xml into an XdmNode using Processor.DocumentBuilder.build(); this can then be passed as a parameter value (an <xsl:param>) into each transformation that uses it. Or if you prefer to access it using document(), you could register a URIResolver that responds to the document() call by returning the relevant XdmNode.
As for indexing and the key() function, so long as the xsl:key definition is "sharable", then if two transformations based on the same compiled stylesheet (s9api XsltExecutable) access the same document, the index will not be rebuilt. An xsl:key definition is shareable if its match and use attributes do not depend on anything that can vary from one transformation to another, such as the content of global variables or parameters.
Saxon's native tree implementations (unlike the DOM) are thread-safe: if you build a document once, you can access it in multiple threads. The building of indexes to support the key() function is synchronized so concurrent transformations will not interfere with each other.
Martin's suggestion of allowing compile-time evaluation of the document() call would also work. You could also put the document into a global variable defined with static="yes". This doesn't play well, however, with exporting compiled stylesheets into persistent files: there are some restrictions that apply when exporting a stylesheet that contains node-valued static variables.

What is the proper media-type for HAL+JSON?

I'm using Spring to create a RESTful service and I'm curious about the syntax for media-types.
From my understanding, the general media-type for HAL+JSON is application/hal+json. Also, from my understanding, a vendor-specific custom media-type that supports HAL+JSON would be something like application/vnd.api.entity.hal+json. However, I have also seen application/vnd.api.entity+hal+json. Which one is correct?
Also, what would the correct wild-card type be for HAL+JSON? Would it be application/*.hal+json or application/*+hal+json. Links to any pertinent RFC's would be appreciated. Thanks!

application/vnd.api.entity+json
application/vnd.api.entity.hal+json would only make sense if you plan to provide your data also without support for HAL. The client has to know the structure of the content anyway and HAL is part of it.
application/vnd.api.entity+hal+json is just wrong. The standard states that only registered suffixes should be used. It also refers to them as "Structured Syntax Suffixes". So it's quite clear that it's about how to read data not about its meaning. Only one suffix is allowed and more wouldn't make sense.
Think about it as application/semantic+syntax, or application/what's in it + how to read it.

Allowing maximal flexibly/extensibility using a factory

I have a little design issue on which I would like to get some advice:
I have several classes that inherit from the same base class, each one can accept the same data and analyze it in a slightly different way.
Analyzer
|
˪_ AnalyzerA
|
˪_ AnalyzerB
...
I have an input file (I do not have control over the file's format) that defines which analyzers should be invoked and their parameters. Plus it defines data-extractors in the same way and other similar things too (in similar I mean that this is an action that can have several variations).
I have a module that iterates over different analyzers in the file and calls some factory that constructs the correct analyzer. I have a factory for each of the archetypes the input file can define and so far so good.
But what if I want to extend it and to add a new type of analyzer?
The solution I was thinking about is using a property file for each factory that will be named after the factories name and it will hold a mapping between the input file's definition of whatever it wants me to execute and the actual classes that I use to execute the action.
This way I could load that class at run-time -> verify that it's implementing the right interface and then execute it.
If some John Doe would like to create his own analyzer he'd just need to add a new property to the correct file (I'm not quite sure what would be the best strategy to allow this kind of property customization).
So in short:
Is my solution too flawed?
If no what would be the most user friendly/convenient way to allow customization of properties?
P.S
Unfortunately I'm confined to using only build in JDK classes as the existing solution, so I can't just drop in SF on them.
I hope this question is not out of line I'm just not used to having my wings clipped this way, not having SF or some other to help me implement an elegant solution.

Have a look at the way how the java.sql.DriverManager.getConnection(connectionString) method is implemented. The best way is to watch the source code.
Very rough summary of the idea (it is hidden inside a lot of private methods). It is more or less an implementation of chain of responsibility, although there is not linked list of drivers.
DriverManager manages a list of drivers.
Each driver must register itself to the DriverManager by calling its method registerDriver().
Upon request for a connection, the getConnection(connectionString) method sequentially calls the drivers passing them the connectionString.
Each driver KNOWS if the given connection string is within its competence. If yes, it creates the connection and returns it. Otherwise the control is passed to the next driver.
Analogy:
drivers = your concrete Analyzers
connection strings = types of your files to be analyzed
Advantages:
There is no need to explicitly bind the analyzers with their type of file they are meant for. Let the analyzer to decide itself if it is able to analyze the file. If not, null is returned (or an exception or whatever) to tell the AnalyzerManager that the next analyzer in the row should be asked.
Adding new analyzer just means adding a new call to the register() method. Complete decoupling.

Java Sax to parse complex large XML file

I am using SAX to parse some large XML files and I want to ask the following: The XML files have a complex structure. Something like the following:
<library>
<books>
<book>
<title></title>
<img>
<name></name>
<url></url>
</img>
...
...
</book>
...
...
</books>
<categories>
<category id="abcd">
<locations>
<location>...</location>
</locations>
<url>...</url>
</category>
...
...
</categories>
<name>...</name>
<url>...</url>
</library>
The fact is that these files are over 50MB each and a lot of tags are repeated under different context, e.g. url under /books/book/img but also under /library and under /library/categories/category and so on.
My SAX parser uses a subclass of DefaultHandler in which I override teh startElement and the endElement methods (among others). But the problem is that these methods are huge in terms of lines of code due to the business logic of these XML files. I am using a lot of
if ("url".equalsIgnoreCase(qName)) {
// peek at stack and if book is on top
// ...
// else if category is on top
// ...
} else if (....) {
}
I was wondering whether there is a more proper / correct / elegant way to perform the xml parsing.
Thank you all

What you can do is implement separate ContentHandler for different contexts. For example write one for <books>, one for <categories> and one top-level one.
Then, as soon as the books startElement method is called, you immediately switch the ContentHandler using XMLReader.setContentHandler(). Then the <books> specific ContentHandler switches back to the top-level handler to when its endElement method is called for books.
This way each ContentHandler can focus on his particular part of the XML and need not know about all the other parts.
The only ugly-ish part is that the specific handlers need to know of the top-level handler and when to switch back to it, which can be worked around by providing a simple "handler stack" that handles that for you.

Not sure whether you're asking 1) is there something else you can do besides checking the tag against a bunch of strings or 2) if there's an alternative to a long if-then-else kind of statement.
The answer to 1 is not that I've found. Someone else may tackle that one.
The answer to 2 depends on your domain. One way I see is that if the point of this is to hydrate a bunch of objects from an XML file, then you can use a factory method.
So the first factory method has the long if then else statement that simply passes off the XML to the appropriate classes. Then each of your classes has a method like constructYourselfFromXmlString. This will improve your design because only the objects themselves know about the private data that is in an XML to hydrate them.
the reason this is hard is that, if you think about it, exporting an Object to XML and importing back in really violates encapsulation. Nothing to be done about it, just is. This at least makes things a little more encapsulated.
HTH

Agreeing with the sentiment that exporting an object to XML is a violation of encapsulation, the actual technique used to handle tags which are nested at different lengths isn't terribly difficult using SAX.
Basically, keep a StringBuffer which will maintain your "location" in the document, which will be a directory like representation of the nested tag you are currently within. For example, if at the moment the string buffer's contents are /library/book/img/url then you know it's an URL for an image in a book, and not a URL for some category.
Once you ensure that your "path tracking" algorithms are correct you can then wrap your object creation routines with better handling by using string matches. Instead of
if ("url".equalsIgnoreCase(qName)) {
...
}
you can now substitute
if (location.equalsIgnoreCase("/library/book/img/url")) {
...
}
If for some reason this doesn't appeal to you, there are still other solutions. For example, you can make a SAX handler which implements a stack of Handlers where the top handler is responsible for handling just it's portion of the XML document, and it pops itself off the stack once it is done. Using such a scheme, the each object gets created by its own unique individual handler, and some handlers basically check and direct which "object creation" handlers get shoved onto the handling stack at the appropriate times.
I've used both techniques. There are strengths in both, and which one is best really depends on the input and the needed objects.

You could refactor your SAX content handling so that you register a set of rules, each of which has a test that it applies to see if it matches the element, and an action that is executed if it does. This is moving closer to the XSLT processing model, while still doing streamed processing. Or you could move to XSLT - processing 50Mb input files is well within the capabilities of a modern XSLT processor.

try SAX-JAVA Binding Made Easier

Java+XSL, calling Java code from within template

I'm working with XSL templates in Java, and I'm trying to build a custom tag that will call some Java code, then put a result inside the template. I'm using XOM as my XML engine. I'm kind of new with both XOM and XSL, so I'm not even sure if this is a smart idea.
A very simple example of something I want to do is this, where my_ns is a custom namespace with 'custom_tag' that the method custom tag
<xsl:template name="foo">
<my_ns:custom_tag />
</xsl:template>
public Node custom_tag() {
return Node("<a/>");
}
#result of calling the template foo
<a/>
I'm open to suggestions for involve alternate ways of calling Java from a XSL template.

This is more a question about if your XSLT processor can execute/call java code from within the template more than your XML engine/parser/api. The default XSLT processor for java is Xalan-C or Xalan-J (can't remember which) from the Apache Software Foundation. I do believe both of them allow for extension functions to execute java code inside the method. I've done JDBC sql queries inside a XSL stylesheet before using a xalan-j extension function. I also recall reading that the Saxon XSLT processor also allows this functionality. You'll have search your XSLT processor to get the specifics on to implement this.
The question on whether this is a good idea or not really depends on the problem. Even though I used the SQL extension function mentioned above and it fit the bill in that case, I felt really dirty about it afterwards. The reason I say this is because you lose portability between XSLT processors when you add in the vendor-specific extension functions.
Your example shows you are just simply creating a new node in the output and if that is the case, I don't see the need to have java do this when that is one of the main functions of XSLT: creating nodes. I suspect your real problem is more complex than simply creating a node so I'll suggest you may want to look into doing all the work in java to get the results you are looking for OR doing some of the work in java and passing a parameter (name/value pair using the xsl:param element) to your XSL stylesheet a runtime.
Here's some quick sites to get you started:
http://xml.apache.org/xalan-j/extensions.html
http://www.saxonica.com/documentation/extensions/intro.xml
http://www.w3schools.com/xsl/
http://www.w3schools.com/xsl/el_param.asp

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.