Selecting an XSD schema during SAX parsing

Assume the following:
I have a set of XSD schemas S, each with distinct namespace URIs.
I know that I'm going to be receiving an XML document containing a root element that contains exactly one namespace declaration that refers to a member of S. I can abort parsing immediately with an error if I don't receive exactly one namespace declaration, or if the received namespace doesn't refer to any schema in S.
I want to parse the incoming XML document with a SAX parser, and I want to validate the incoming document during parsing against one of the schemas in S. I know from the above that the first call I'm going to see in the ContentHandler that I give to the parser will be a call to startPrefixMapping when the parser encounters the namespace declaration.
Is it possible to, in the startPrefixMapping call, pick one of the schemas in S for validation once I know which one I need?
It seems that I could maybe call setSchema on the parser inside the startPrefixMapping call, but I get the feeling from the API documentation that I'm not supposed to do this (and that it may be too late to call the method at that point anyway).
Is there some other way to supply a set of schemas to the parser and perhaps have it pick the right one itself based on the namespace declaration it receives?
Edit: I was wrong. Calling setSchema on a parser once parsing has started is not just inadvisable, it's impossible: parsers don't expose a setSchema call, only parser factories do. This means that my options are limited to those that allow the parser to select a schema for itself. Unfortunately, that has its own problems. An XML document can't merely specify a namespace; it also has to specify a filename for the intended schema (which in my opinion is an implementation detail on the parser side and should not be required of the incoming data), and the parser has to intercept the request for this filename to supply a member of S for validation.
Edit: I've solved this. I've put together some heavily-commented public domain example code here that looks up schemas based on pre-specified systemIds, and the schemas are delivered programmatically (so they can be served from databases, class resources, etc). It correctly rejects any document that specifies an unknown schema, specifies no schema, or tries to specify its own schemaLocation to try to fool the validator.
https://github.com/io7m/xml-schema-lookup-example
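For comparison, here is a minimal sketch of a different approach from the one in the linked repository: compile every member of S into a single javax.xml.validation.Schema with SchemaFactory.newSchema(Source[]) and set it on the SAXParserFactory before parsing, so that the validator matches the root element against the declarations for its namespace. The class name NamespaceSchemaSet, the demoXsd helper and the urn: namespaces are made up for illustration, and the sketch assumes the JDK's built-in JAXP implementation.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Source;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: holds all schemas of the set S compiled into one Schema object.
final class NamespaceSchemaSet {
    private final Schema schema;

    NamespaceSchemaSet(String... xsdTexts) {
        Source[] sources = new Source[xsdTexts.length];
        for (int i = 0; i < xsdTexts.length; i++) {
            sources[i] = new StreamSource(new StringReader(xsdTexts[i]));
        }
        try {
            // One Schema built from several sources, each with its own target namespace.
            schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI).newSchema(sources);
        } catch (SAXException e) {
            throw new IllegalArgumentException("schema set does not compile", e);
        }
    }

    // Parses with SAX while validating; returns false on any validation or well-formedness error.
    boolean accepts(String xml) {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setSchema(schema); // must happen on the factory, before the parser exists
        try {
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
                @Override
                public void error(SAXParseException e) throws SAXException {
                    throw e; // DefaultHandler ignores validation errors unless this is overridden
                }
            });
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Illustrative schema: a single <item> element of type xs:int in the given namespace.
    static String demoXsd(String namespace) {
        return "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema' targetNamespace='" + namespace
                + "' elementFormDefault='qualified'><xs:element name='item' type='xs:int'/></xs:schema>";
    }
}
```

With a set built from demoXsd("urn:a") and demoXsd("urn:b"), a document whose root is in urn:a or urn:b is validated against the matching declarations, and a root in any other namespace should be reported as an error because no declaration for it exists in the set.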

Related

Is it possible to cache XML documents in Saxon to avoid re-parsing and re-indexing?

I am currently assessing whether XSLT3 with Saxon could be useful for our purposes. Please hear me out.
We are developing a REST API which provides credentials given an input request XML. Basically, there are 3 files in play:
site.xml:
This file holds the data representing the complete organisation: users, roles, credentials, settings, ...
It could easily contain 10,000 lines.
It can be considered static/immutable.
You could think of it as an XML representation of a database, so to speak.
request.xml:
This file holds the request as provided to the REST API.
It is rather small, usually around 10 to 50 lines.
It is different for each request.
request.xslt:
This file holds the stylesheet to convert the given request.xml to an output XML.
It loads site.xml via the XSLT document() function, as it needs that data to fulfill the request.
The problem here is that loading site.xml in request.xslt takes a long time. In addition, for each request, indexes as introduced by the XSLT <xsl:key .../> directive must be rebuilt. This adds up.
So it would make sense to somehow cache site.xml, to avoid having to parse and index that file for every request.
It's important to note that multiple API requests can happen concurrently, thus it should be safe to share this cached site.xml between several ongoing XSLT transformations.
Is this possible with Saxon (Java)? How would that work?
Update 1
After some additional reflection, I realize that maybe I should not attempt to cache just the site.xml XML file, but the request.xslt instead? This assumes that site.xml, which is loaded in request.xslt via document(), is part of that cache.

It would help if you show/tell us which API you use to run XSLT with Saxon.
As for caching the XSLT, with JAXP I think you can do that with a Templates object created with newTemplates from the TransformerFactoryImpl (http://saxonica.com/html/documentation/using-xsl/embedding/jaxp-transformation.html); each time you want to run the XSLT you will need to create a Transformer with newTransformer().
With the s9api API you can compile once to get an XsltExecutable (http://saxonica.com/html/documentation/javadoc/net/sf/saxon/s9api/XsltExecutable.html) that "is immutable, and therefore thread-safe"; you then have to use load() or load30() to create an XsltTransformer or Xslt30Transformer each time you need to run the code.
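A minimal sketch of the JAXP variant of this compile-once pattern follows. It uses only the standard javax.xml.transform API, so it runs against whichever TransformerFactory is on the classpath (Saxon if present, otherwise the JDK's built-in processor). The class name CachedStylesheet is made up for illustration, and the sketch does not cover sharing site.xml itself.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical wrapper: compiles the stylesheet once and reuses it for every request.
final class CachedStylesheet {
    // Templates is the compiled, thread-safe form of the stylesheet.
    private final Templates templates;

    CachedStylesheet(String xsltText) {
        try {
            templates = TransformerFactory.newInstance()
                    .newTemplates(new StreamSource(new StringReader(xsltText)));
        } catch (TransformerException e) {
            throw new IllegalArgumentException("stylesheet does not compile", e);
        }
    }

    // Safe to call from several threads: each call gets its own Transformer.
    String transform(String requestXml) {
        try {
            Transformer transformer = templates.newTransformer();
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(requestXml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }
}
```

The CachedStylesheet instance would typically be created once at application startup and shared; only the cheap newTransformer() call happens per request.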
As for sharing a document, see http://saxonica.com/html/documentation/sourcedocs/preloading.html:
An option is available (Feature.PRE_EVALUATE_DOC_FUNCTION) to indicate
that calls to the doc() or document() functions with constant string
arguments should be evaluated when a query or stylesheet is compiled,
rather than at run-time. This option is intended for use when a
reference or lookup document is used by all queries and
transformations
The section on that configuration option, however, states:
In XSLT 3.0 a better way of having external documents pre-loaded at
stylesheet compile time is to use the new facility of static global
variables.
So in that case you could declare
<xsl:variable name="site-doc" static="yes" select="doc('site.xml')"/>
You will need to wait on Michael Kay's response as to whether that suffices to share the document.

Well, it is certainly possible, but the best way of doing it depends a little on the circumstances, e.g. what happens when site.xml changes.
I would be inclined to create a single s9api Processor at application startup, and immediately (that is, during application initialization) load site.xml into an XdmNode using the build() method of a DocumentBuilder obtained from Processor.newDocumentBuilder(); this can then be passed as a parameter value (an <xsl:param>) into each transformation that uses it. Or if you prefer to access it using document(), you could register a URIResolver that responds to the document() call by returning the relevant XdmNode.
As for indexing and the key() function, so long as the xsl:key definition is "shareable", then if two transformations based on the same compiled stylesheet (s9api XsltExecutable) access the same document, the index will not be rebuilt. An xsl:key definition is shareable if its match and use attributes do not depend on anything that can vary from one transformation to another, such as the content of global variables or parameters.
Saxon's native tree implementations (unlike the DOM) are thread-safe: if you build a document once, you can access it in multiple threads. The building of indexes to support the key() function is synchronized so concurrent transformations will not interfere with each other.
Martin's suggestion of allowing compile-time evaluation of the document() call would also work. You could also put the document into a global variable defined with static="yes". This doesn't play well, however, with exporting compiled stylesheets into persistent files: there are some restrictions that apply when exporting a stylesheet that contains node-valued static variables.

JAXB issue with missing namespace definition

So I searched around quite a bit for a solution to this particular issue and I am hoping someone can point me in a good direction.
We are receiving data as XML, and we only have an XSD to validate the data. So I used JAXB to generate the Java classes. When I went to unmarshal a sample XML, I found that some attribute values are missing. It turns out that the schema expects those attributes to be QNames, but the data provider didn't define the prefix in the XML.
For instance, one XML attribute value is "repository:<uuid>", but the namespace prefix "repository" is never defined in the dataset. (Never mind that the provider's own best practices suggest defining it!)
So when I went to unmarshal a sample set, the QName attributes with the specified prefix ("repository" in my sample above) were NULL! So it looks like JAXB is "throwing out" those attribute QName values that have an undefined namespace prefix. I am surprised that it doesn't preserve even the local name.
Ideally, I would like to maintain the value as is, but it looks like I can't map the QName to a String at binding time (Schema to Java).
I tried "manually" inserting a namespace definition to the XML and it works like a charm. What would be the least complicated method to do this?
Is there a way to "insert" namespace mapping/definition at runtime? Or define it "globally" at binding time?

The simplest would be to use strings instead of QName. You can use the javaType customization to achieve this.
If you want to add prefix/namespace mappings in the runtime, there are quite a few ways to do it:
Similar to above, you could provide your own QName converter which would consider your prefixes.
You can put a SAX or StAX filter in between and declare additional prefixes in the startDocument.
What you actually need is to add your prefix mappings into the UnmarshallingContext.environmentNamespaceContext. I've checked the source code but could not find a direct and easy way to do it.
I personally would implement a SAX/StAX filter to "preprocess" your XML on the event level.
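A minimal sketch of such a SAX filter, assuming the missing prefix and the namespace URI it should map to are known in advance: it extends org.xml.sax.helpers.XMLFilterImpl and emits an extra startPrefixMapping event right after startDocument. The class name DeclarePrefixFilter, the mappingsSeen demo helper and the URI used below are made up for illustration. The demo records the events with a plain DefaultHandler; with JAXB, the content handler would presumably be the one returned by Unmarshaller.getUnmarshallerHandler(), which has not been tried here.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter: announces one extra prefix mapping for the whole document.
final class DeclarePrefixFilter extends XMLFilterImpl {
    private final String prefix;
    private final String uri;

    DeclarePrefixFilter(XMLReader parent, String prefix, String uri) {
        super(parent);
        this.prefix = prefix;
        this.uri = uri;
    }

    @Override
    public void startDocument() throws SAXException {
        super.startDocument();
        super.startPrefixMapping(prefix, uri); // in scope before the root element starts
    }

    @Override
    public void endDocument() throws SAXException {
        super.endPrefixMapping(prefix);
        super.endDocument();
    }

    // Demo helper: parses xml through the filter and returns the prefix mappings seen downstream.
    static Map<String, String> mappingsSeen(String xml, String prefix, String uri) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            Map<String, String> seen = new LinkedHashMap<>();
            DeclarePrefixFilter filter = new DeclarePrefixFilter(reader, prefix, uri);
            filter.setContentHandler(new DefaultHandler() {
                @Override
                public void startPrefixMapping(String p, String u) {
                    seen.put(p, u);
                }
            });
            filter.parse(new InputSource(new StringReader(xml)));
            return seen;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The downstream handler sees the "repository" mapping even though the document never declares it, which is the same effect as manually inserting the namespace definition into the XML.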

What Java XML API to use in my app - StAX or DOM?

I did some research, looked at the table at the bottom here (1) and I am trying to find out what kind of API I should use.
Let me introduce the problem my app is going to solve:
My application listens to observer events fired from all over the place (e.g. events from CDI) in an observer class. That class contains the methods which observe these events.
I need to construct an XML file on the fly as these events are observed. More concretely, when I observe the event "start", I need to create this XML:
<start></start>
After that when I observe some other event, like "installed" (does not matter how it is called really), I need to have this structure:
<start><installed></installed></start>
Every time I observe an event, I need to be able to write that XML representation to an external file. To sum up: it seems I cannot use SAX, because SAX only parses XML documents and I need to construct and write them. That leaves StAX or DOM. StAX is "forward only", which I do not fully understand, but judging by the StAX API (2) it means I am forced to start and end elements manually, and that does not fit my case: I do not know when document generation will end, I just need valid XML at every point so that I can write it out.
However, there is this method (3) which says that when I call it, it automatically closes all elements. So e.g. when I have this:
<a>
<b></b>
<c>
<d>
</d>
and I call writeEndDocument(), does that mean that it automatically closes "c" and "a"?
(1) http://docs.oracle.com/cd/E17802_01/webservices/webservices/docs/1.6/tutorial/doc/SJSXP2.html
(2) http://docs.oracle.com/javase/tutorial/jaxp/stax/example.html#bnbgx
(3) http://docs.oracle.com/javase/6/docs/api/javax/xml/stream/XMLStreamWriter.html#writeEndDocument()

I recommend using the following XML libraries (ordered by recommendation; only use the next one if the one before doesn't suit your needs):
JAXB (work with objects rather than XML)
StAX (lower level than JAXB)
SAX (only for reading; should be rarely used now with JAXB and StAX available)
DOM (should be rarely used now with JAXB and StAX available)

Do not use lower-level XML techniques (either SAX or DOM) unless you really need them. I believe that is not the case here.
Use JAXB. Create a class that represents your events. Every time you get an event, create an instance of this class and populate its fields. Every time you have to create XML, just marshal the instance(s) to any stream you want (file, socket, whatever).
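On the writeEndDocument() question: the XMLStreamWriter Javadoc (3) says the method closes any start tags and writes the corresponding end tags. A minimal sketch that reproduces the a/b/c/d example from the question, assuming the JDK's built-in StAX implementation (the class name EndDocumentDemo is made up for illustration):

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical demo: leaves <c> and <a> open and lets writeEndDocument() close them.
final class EndDocumentDemo {
    static String write() {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            writer.writeStartDocument();
            writer.writeStartElement("a");
            writer.writeStartElement("b");
            writer.writeEndElement();      // closes b
            writer.writeStartElement("c");
            writer.writeStartElement("d");
            writer.writeEndElement();      // closes d
            writer.writeEndDocument();     // closes the still-open c, then a
            writer.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }
}
```

Note that the document is finished once writeEndDocument() has been called, so this does not by itself give a file that is valid after every event and can still be appended to.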

Is there a way of handling multiple xsd versions with xmlBeans?

I know that I can compile multiple xsd files into a single jar. I've tried using different namespaces, which only takes me halfway to my goal. This way I can parse against the correct schema, but I want this to be transparent to my users, who will receive the xmlBeans object that I've parsed.
They shouldn't have to know which version of the xml file is currently present on the system. I would need a super class common to every xsd version to achieve this.
Could this be done with xmlBeans?

My understanding is that you have a com namespace plus com.v1 and com.v2 namespaces, with an xsd element called EmployeeV1 in com.v1 and one called EmployeeV2 in com.v2.
You want a super class called Employee in the com namespace, which is what you return to your caller?
Do you think EmployeeV1 and EmployeeV2 could extend Employee in your xsd? Then maybe when you generate the classes you will get a class hierarchy that mirrors your xsd.
If that doesn't work (I haven't used xmlbeans in years now), you might have to create your own domain object and make your callers consume that. That might be worth the effort: since it looks like you handle the parsing of an XML that other people rely on, you could shield all other users from the structure of the XML (which is in flux) by having an intermediary domain object.
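The intermediary domain object could look roughly like the sketch below. Everything in it is hypothetical: EmployeeV1 and EmployeeV2 are hand-written stand-ins for whatever classes XMLBeans generates from the two schema versions, and their fields are invented to show how version differences get hidden behind one interface.

```java
// Stand-in for the class generated from the v1 schema (hypothetical shape).
final class EmployeeV1 {
    private final String name;
    EmployeeV1(String name) { this.name = name; }
    String getName() { return name; }
}

// Stand-in for the class generated from the v2 schema (hypothetical shape).
final class EmployeeV2 {
    private final String firstName;
    private final String lastName;
    EmployeeV2(String firstName, String lastName) {
        this.firstName = firstName;
        this.lastName = lastName;
    }
    String getFirstName() { return firstName; }
    String getLastName() { return lastName; }
}

// The version-independent view that callers consume.
interface Employee {
    String displayName();
}

// Adapters from each schema version to the common view.
final class Employees {
    private Employees() { }

    static Employee from(EmployeeV1 v1) {
        return () -> v1.getName();
    }

    static Employee from(EmployeeV2 v2) {
        return () -> v2.getFirstName() + " " + v2.getLastName();
    }
}
```

Callers only ever see Employee, so adding a third schema version means adding one more adapter method rather than changing every consumer.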

Line number information in Spring XML Namespace Handler

I'm writing a custom Spring namespace handler (Java). If the XML is invalid, I'd like to report an error message that includes the line number (in the parsed document), so that the user knows where to look. However, I don't know how to retrieve the line number from DOM objects or otherwise.
Note that I'm talking about errors that are not discovered by XSD validation (those report line numbers correctly).
Is it even possible to get such information from inside Namespace handler?
Thanks,
Ondrej

If you are using a SAX parser, you can extend the DefaultHandler class and override its setDocumentLocator method to store the Locator that the parser passes in. The parser keeps that locator up to date as events occur, so you can call its getLineNumber() method inside any callback to obtain the line number of interest.
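A minimal sketch of the Locator technique with plain SAX, outside of Spring: the handler keeps the org.xml.sax.Locator it receives in setDocumentLocator and reads getLineNumber() in startElement. The class name LineTrackingHandler and the scan helper are made up for illustration, and whether a Spring namespace handler can get at this information is not shown here.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records the line on which each element's start tag ends.
final class LineTrackingHandler extends DefaultHandler {
    private Locator locator;
    final Map<String, Integer> startLines = new LinkedHashMap<>();

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator; // supplied by the parser before startDocument
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        if (locator != null) {
            // The locator reports the position where the current event ends.
            startLines.put(qName, locator.getLineNumber());
        }
    }

    // Demo helper: parses the text and returns element name -> line number.
    static Map<String, Integer> scan(String xml) {
        try {
            LineTrackingHandler handler = new LineTrackingHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.startLines;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A parser is not required to supply a locator, which is why the handler checks for null before using it.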
