Force resolution of xsl:include, xsl:import in Java


I'm using Saxon 9.3 HE and Java 1.6. I can resolve xsl:include and xsl:import statements in the xsl by supplying a resolver to setURIResolver on the TransformerFactory instance.
However the Source resolve(String includee, String includer) method doesn't get called if the file was resolved previously. This is a problem for me because I want to resolve differently based on the includer file. For example <xsl:include href="foo.xsl"/> in file1.xsl would be a different file from <xsl:include href="foo.xsl"/> in file2.xsl, and file1.xsl and file2.xsl would be included by file3.xsl. I have some "base" code and "customer-specific" code that can override the template file and I need to resolve them differently for a framework I'm building.

The XSLT specification is clear that resolving a relative URI in the href attribute against the base URI of the containing element must be done according to the standard rules for handling relative URIs, while dereferencing the resulting absolute URI can be done any way the implementation likes. I'd suggest rethinking your design to take account of this.
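One way to read this advice, as a minimal sketch: let java.net.URI resolve href against base by the standard rules, then dereference the resulting absolute URI from a store of your own. The class name OverlayResolver, its register method, and the http://example.com URIs are hypothetical. Nothing here is Saxon-specific, and it does not change when the processor decides to call the resolver.

```java
import javax.xml.transform.Source;
import javax.xml.transform.URIResolver;
import javax.xml.transform.stream.StreamSource;
import java.io.StringReader;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical resolver: standard relative resolution, custom dereferencing. */
final class OverlayResolver implements URIResolver {
    // Hypothetical in-memory store: absolute URI -> stylesheet text.
    private final Map<String, String> stylesheets = new HashMap<>();

    void register(String absoluteUri, String xslText) {
        stylesheets.put(absoluteUri, xslText);
    }

    @Override
    public Source resolve(String href, String base) {
        // Step 1: resolve the relative href against the includer's base URI.
        URI absolute = (base == null) ? URI.create(href) : URI.create(base).resolve(href);
        // Step 2: dereference the absolute URI however we like (here: a map lookup).
        String text = stylesheets.get(absolute.toString());
        if (text == null) {
            return null; // null lets the processor fall back to its default resolution
        }
        // Carry the absolute URI as system ID so nested includes resolve against it.
        return new StreamSource(new StringReader(text), absolute.toString());
    }
}
```

With this shape, foo.xsl included from two includers in different directories yields two different absolute URIs, so the lookup can return different content for each. Includers that share a directory produce the same absolute URI and therefore the same document.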

I would have expected that, because the two XSLs containing those includes have different base URIs, the URIResolver would need to be called for each one (what if they are in different directories?).
When creating sources for file1.xsl and file2.xsl, what are their system IDs? If they are null, point to the same directory, or carry no path information (i.e. the systemId is just file1.xsl and file2.xsl), maybe Saxon is optimizing by assuming they are in the same directory and therefore that the foo.xsl referred to by each one is the same file.
Maybe try explicitly setting the systemId of the source of the base files so that they are in different directories?
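A small sketch of that suggestion, assuming the stylesheets are supplied as text rather than read from files. SystemIdDemo and the http://example.com URIs are made up for illustration. resolveHref only shows what the href would resolve to under java.net.URI's rules; it says nothing about Saxon's internal caching.

```java
import javax.xml.transform.stream.StreamSource;
import java.io.StringReader;
import java.net.URI;

final class SystemIdDemo {
    /** Wraps stylesheet text in a Source that carries an explicit system ID. */
    static StreamSource withSystemId(String xslText, String systemId) {
        StreamSource source = new StreamSource(new StringReader(xslText));
        source.setSystemId(systemId);
        return source;
    }

    /** Shows what a relative href resolves to for a given includer. */
    static String resolveHref(StreamSource includer, String href) {
        return URI.create(includer.getSystemId()).resolve(href).toString();
    }
}
```

If file1.xsl and file2.xsl are given system IDs in different directories, the same href="foo.xsl" resolves to two distinct absolute URIs.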

Related

Java XML: Avoid relative systemId expansion against user.dir

Consider the following example XML:
<book xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="xsd/book.xsd" />
When parsing this xml file with standard JAXP APIs (which will often use a Xerces implementation), the "xsd/book.xsd" systemId will get "resolved" against the user directory and will result in file:///home/user/xsd/book.xsd.
For xerces, this behavior is implemented here: https://github.com/apache/xerces2-j/blob/cf0c517a41b31b0242b96ab1af9627a3ab07fcd2/src/org/apache/xerces/impl/XMLEntityManager.java#L1894
To work around this, we're currently using an EntityResolver2 to extract the original, relative systemId out of the absolute URI file:///home/user/xsd/book.xsd, but this is really hacky.
Question:
Is there a better way, e.g. by disabling this strange "userdir" behavior and just keeping the relative systemIds as they are?

If you want the schemaLocation to be interpreted as relative to the base URI of the source document, just make sure that the base URI of the source document is known to Xerces. For example, don't supply the input as a FileInputStream with no known system Id. It will only use the current directory as a fallback if it doesn't know where the input file is located.
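As a rough illustration of "make the base URI known", the sketch below parses XML from a character stream but sets a system ID on the InputSource first. KnownBaseUri and the file:///data/docs/book.xml location are hypothetical. The returned document URI is what the parser recorded as the document's location, which is the base that relative references are resolved against.

```java
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

import javax.xml.parsers.DocumentBuilderFactory;
import java.io.StringReader;

final class KnownBaseUri {
    /** Parses xml from memory while telling the parser where it notionally lives. */
    static String documentUriFor(String xml, String systemId) {
        try {
            InputSource in = new InputSource(new StringReader(xml));
            in.setSystemId(systemId); // without this the parser has no base URI
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(in);
            return doc.getDocumentURI();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Passing a File, or a StreamSource built from a File, achieves the same thing because the system ID is derived from the file's path.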

is "content://" in the Uri of Content Provider in Android replaceable ?

In our platform, we use a certain format for paths. In the Android App, it receives those paths to load some data or do something.
I want to do all the data handling using content provider, I want to give the path and get data. A simple transaction.
When I read into content providers, the documentation and all the tutorials out there always use "content://" at the beginning. However, I want to use our own start of the path which is usually "is-://". Can something like this work?

No, this is how the system identifies the URI as belonging to a content provider.
It's like replacing file:// with something else.

According to the Android developer documentation:
A content URI is a URI that identifies data in a provider. Content URIs include the symbolic name of the entire provider (its authority) and a name that points to a table (a path). When you call a client method to access a table in a provider, the content URI for the table is one of the arguments.
From this I believe you can't set it yourself, as the URI includes the provider's symbolic name.
Also why do you want to change it?
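To make the quoted description concrete, here is a small sketch that splits a content URI into its scheme, authority, and path. It uses plain java.net.URI rather than android.net.Uri so it runs outside Android. The class name and the com.example.provider authority are invented for the example.

```java
import java.net.URI;

final class ContentUriParts {
    /** Returns "scheme | authority | path" for a hierarchical URI. */
    static String describe(String uri) {
        URI parsed = URI.create(uri);
        return parsed.getScheme() + " | " + parsed.getAuthority() + " | " + parsed.getPath();
    }
}
```

For content://com.example.provider/books/42 this gives the fixed content scheme, the provider's authority, and the table path.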

Custom URL scheme as adapter on existing URL schemes

Is there a clean and spec-conformant way to define a custom URL scheme that acts as an adapter on the resource returned by another URL?
I have already defined a custom URL protocol which returns a decrypted representation of a local file. So, for instance, in my code,
decrypted-file:///path/to/file
transparently decrypts the file you would get from file:///path/to/file. However, this only works for local files. No fun! I am hoping that the URL specification allows a clean way that I could generalize this by defining a new URL scheme as a kind of adapter on existing URLs.
For example, could I instead define a custom URL scheme decrypted: that could be used as an adapter that prefixes another absolute URL that retrieved a resource? Then I could just do
decrypted:file:///path/to/file
or decrypted:http://server/path/to/file or decrypted:ftp://server/path/to/file or whatever. This would make my decrypted: protocol composable with all existing URL schemes that do file retrieval.
Java does something similar with the jar: URL scheme but from my reading of RFC 3986 it seems like this Java technology violates the URL spec. The embedded URL is not properly byte-encoded, so any /, ?, or # delimiters in the embedded URL should officially be treated as segment delimiters in the embedding URL (even if that's not what JarURLConnection does). I want to stay within the specs.
Is there a nice and correct way to do this? Or is the only option to byte-encode the entire embedded URL (i.e., decrypted:file%3A%2F%2F%2Fpath%2Fto%2Ffile, which is not so nice)?
Is what I'm suggesting (URL adapters) done anywhere else? Or is there a deeper reason why this is misguided?
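For reference, the fully encoded form mentioned above can be produced and reversed with the JDK alone. This is only a sketch: EmbeddedUrlCodec and its method names are hypothetical, and URLEncoder applies HTML form encoding (a space becomes "+"), which is close to, but not the same as, RFC 3986 percent-encoding.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

final class EmbeddedUrlCodec {
    /** Builds scheme:<encoded inner URL>, e.g. decrypted:file%3A%2F%2F%2F... */
    static String wrap(String scheme, String innerUrl) {
        try {
            return scheme + ":" + URLEncoder.encode(innerUrl, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    /** Reverses wrap(): strips the scheme prefix and decodes the inner URL. */
    static String unwrap(String scheme, String wrapped) {
        String prefix = scheme + ":";
        if (!wrapped.startsWith(prefix)) {
            throw new IllegalArgumentException("expected " + prefix + " prefix: " + wrapped);
        }
        try {
            return URLDecoder.decode(wrapped.substring(prefix.length()), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }
}
```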

There's no built-in adaptor in Cocoa, but writing your own using NSURLProtocol is pretty straightforward for most uses. Given an arbitrary URL, encoding it like so seems simplest:
myscheme:<originalurl>
For example:
myscheme:http://example.com/path
At its simplest, NSURL only actually cares if the string you pass in is a valid URI, which the above is. Yes, there is then extra URL support layered on top, based around RFC 1808 etc. but that's not essential.
All that's required to be a valid URI is a colon to indicate the scheme, and no invalid characters (basically, ASCII without spaces).
You can then use the -resourceSpecifier method to retrieve the original URL and work with that.
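The same idea can be sketched in Java, where java.net.URI treats scheme:<originalurl> as an opaque URI and getSchemeSpecificPart() plays roughly the role of -resourceSpecifier. AdapterUri is a hypothetical helper, not part of any library. Note that a "#" in the embedded URL is parsed as the fragment of the outer URI, which is the delimiter concern raised in the question.

```java
import java.net.URI;

final class AdapterUri {
    /** Returns the embedded URL of an adapter URI such as myscheme:http://example.com/path. */
    static String innerUrl(String adapterUri) {
        URI uri = URI.create(adapterUri);
        if (!uri.isOpaque()) {
            throw new IllegalArgumentException("not an adapter-style URI: " + adapterUri);
        }
        return uri.getSchemeSpecificPart();
    }

    /** Returns the fragment as seen by the outer URI, or null if there is none. */
    static String outerFragment(String adapterUri) {
        return URI.create(adapterUri).getFragment();
    }
}
```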

Serializing supplementary unicode characters into XML documents with Java

I am trying to serialize DOM documents with supplementary unicode characters such as U+1D49C (𝒜, mathematical script capital A). Creating a node with such a character is not a problem (I just set the node value to the UTF-16 equivalent, "\uD835\uDC9C"). When serializing, however, Xalan and XSLTC (with a Transformer) and Xerces (with LSSerializer) all create invalid character entities like "&#55349;&#56476;" (one reference per UTF-16 surrogate) instead of "&#x1D49C;". I tried the "normalize-characters" parameter for LSSerializer, but it is not supported. Only Saxon gets it right, without using a character entity when the encoding is unicode.
I cannot use Saxon in practice (among other reasons, I use Java applets and do not want to load another jar), so I am looking for a solution with the default JDK libraries. Is it possible to get valid XML documents serialized from a DOM document with supplementary unicode characters ?
[edit] I found someone else who encountered this problem : http://www.dragishak.com/?p=131
[edit2] actually, it seems to work with LSSerializer when I don't have xerces on the classpath (the class used is com.sun.org.apache.xml.internal.serialize.DOMSerializerImpl). It does not work with a transformer and com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl.
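To illustrate the difference, here is a minimal sketch, not the actual Xalan or Xerces code, that contrasts escaping per UTF-16 char with escaping per code point. CharRefs and its method names are made up, and the methods only handle non-ASCII characters; they are not a complete XML escaper.

```java
import java.util.Locale;

final class CharRefs {
    /** Flawed approach: one reference per UTF-16 char, so surrogates leak out. */
    static String perChar(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                out.append(c);
            } else {
                out.append("&#").append((int) c).append(';');
            }
        }
        return out.toString();
    }

    /** Correct approach: one reference per Unicode code point. */
    static String perCodePoint(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i); // combines a surrogate pair into one value
            if (cp < 0x80) {
                out.append((char) cp);
            } else {
                out.append("&#x")
                   .append(Integer.toHexString(cp).toUpperCase(Locale.ROOT))
                   .append(';');
            }
            i += Character.charCount(cp); // advance by 1 or 2 chars
        }
        return out.toString();
    }
}
```

For "\uD835\uDC9C", perChar yields two references to surrogate code units, which are not legal XML characters, while perCodePoint yields a single reference to U+1D49C.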

Since I didn't see any answer coming, and other people seem to have the same problem, I looked into it further...
To find the origin of the bug, I used the serializer source code from Xalan 2.7.1, which is also used in Xerces.
org.apache.xml.serializer.dom3.LSSerializerImpl uses org.apache.xml.serializer.ToXMLStream, which extends org.apache.xml.serializer.ToStream.
ToStream.characters(final char chars[], final int start, final int length) handles the characters, and does not support unicode characters properly (note: org.apache.xml.serializer.ToTextStream (which can be used with a Transformer) does a better job in the characters method, but it only handles plain text and ignores all markup; one would think that XML files are text, but for some reason ToXMLStream does not extend ToTextStream).
org.apache.xalan.transformer.TransformerIdentityImpl is also using org.apache.xml.serializer.ToXMLStream (which is returned by org.apache.xml.serializer.SerializerFactory.getSerializer(Properties format)), so it suffers from the same bug.
ToStream is using org.apache.xml.serializer.CharInfo to check if a character should be replaced by a String, so the bug could also be fixed there instead of directly in ToStream. CharInfo is using a property file, org.apache.xml.serializer.XMLEntities.properties, with a list of character entities, so changing this file could also be a way to fix the bug, although so far it is designed just for the special XML characters (quot,amp,lt,gt). The only way to make ToXMLStream use a different property file than the one in the package would be to add an org.apache.xml.serializer.XMLEntities.properties file earlier on the classpath, which would not be very clean...
With the default JDK (1.6 and 1.7), TransformerFactory returns a com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl, which uses com.sun.org.apache.xml.internal.serializer.ToXMLStream. In com.sun.org.apache.xml.internal.serializer.ToStream, characters() is sometimes calling processDirty(), which calls accumDefaultEscape(), which could handle unicode characters better, but in practice it does not seem to work (maybe processDirty is not called for unicode characters)...
com.sun.org.apache.xml.internal.serialize.DOMSerializerImpl is using com.sun.org.apache.xml.internal.serialize.XMLSerializer, which supports unicode. Strangely enough, XMLSerializer comes from Xerces, and yet it is not used by Xerces when xalan or xsltc are on the classpath. This is because org.apache.xerces.dom.CoreDOMImplementationImpl.createLSSerializer is using org.apache.xml.serializer.dom3.LSSerializerImpl when it is available instead of org.apache.xerces.dom.DOMSerializerImpl. With serializer.jar on the classpath, org.apache.xml.serializer.dom3.LSSerializerImpl is used. Warning: xalan.jar and xsltc.jar both reference serializer.jar in the manifest, so serializer.jar ends up on the classpath if it is in the same directory and either xalan.jar or xsltc.jar is on the classpath ! If only xercesImpl.jar and xml-apis.jar are on the classpath, org.apache.xerces.dom.DOMSerializerImpl is used as the LSSerializer, and unicode characters are properly handled.
CONCLUSION AND WORKAROUND: the bug lies in Apache's org.apache.xml.serializer.ToStream class (renamed com.sun.org.apache.xml.internal.serializer.ToStream inside the JDK). A serializer that handles unicode characters properly is org.apache.xml.serialize.DOMSerializerImpl (renamed com.sun.org.apache.xml.internal.serialize.DOMSerializerImpl inside the JDK). However, Apache prefers ToStream instead of DOMSerializerImpl when it is available, so maybe it behaves better for other things (or maybe it's just a reorganization). On top of that, they went as far as deprecating DOMSerializerImpl in Xerces 2.9.0. Hence the following workaround, which might have side effects :
when Xerces and Apache's serializer are on the classpath, replace "(doc.getImplementation()).createLSSerializer()" by "new org.apache.xerces.dom.DOMSerializerImpl()"
when Apache's serializer is on the classpath (for instance because of xalan) but not Xerces, try to replace "(doc.getImplementation()).createLSSerializer()" by "new com.sun.org.apache.xml.internal.serialize.DOMSerializerImpl()" (a fallback is necessary because this class might disappear in the future)
These 2 workarounds produce a warning when compiling.
I don't have a workaround for XSLT transforms, but this is beyond the scope of the question. I guess one could do a transform to another DOM document and use DOMSerializerImpl to serialize.
Some other workarounds, which might be a better solution for some people :
use Saxon with a Transformer
use XML documents with UTF-16 encoding

Here is an example that worked for me. The code is written in Groovy running on Java 7, and it translates easily to Java since it uses only Java APIs. If you pass in a DOM document that contains supplementary (plane 1) unicode characters, you get back a String in which those characters are properly serialized. For example, if the document has a unicode Script L (see http://www.fileformat.info/info/unicode/char/1d4c1/index.htm), it will be serialized in the returned String as &#x1d4c1; instead of &#55349;&#56513; (which is what you will get with a Xalan Transformer).
import org.w3c.dom.Document
...
String writeToStringLS(Document doc) {
    // Ask the DOM implementation for its Load/Save (LS) feature
    def domImpl = doc.getImplementation()
    def implLS = domImpl.getFeature("LS", "3.0")
    def lsOutput = implLS.createLSOutput()
    lsOutput.encoding = "UTF-8"
    def bo = new ByteArrayOutputStream()
    def out = new BufferedWriter(new OutputStreamWriter(bo, "UTF-8"))
    lsOutput.characterStream = out
    def lsWriter = implLS.createLSSerializer()
    lsWriter.write(doc, lsOutput)
    out.flush()                  // make sure buffered characters reach the byte stream
    return bo.toString("UTF-8")  // decode with the same charset used for writing
}

Creating XML Schema from URL works but from Local File fails?

I need to validate XML Schema Instance (XSD) documents which are programmatically generated so I'm using the following Java snippet, which works fine:
SchemaFactory factory = SchemaFactory.newInstance(
XMLConstants.W3C_XML_SCHEMA_NS_URI);
Schema xsdSchema = factory.newSchema( // Reads URL every time...
new URL("http://www.w3.org/2001/XMLSchema.xsd"));
Validator xsdValidator = xsdSchema.newValidator();
xsdValidator.validate(new StreamSource(schemaInstanceStream));
However, when I save the XML Schema definition file locally and refer to it this way:
Schema schema = factory.newSchema(
new File("test/xsd/XMLSchema.xsd"));
It fails with the following exception:
org.xml.sax.SAXParseException: schema_reference.4: Failed to read schema document 'file:/Users/foo/bar/test/xsd/XMLSchema.xsd', because 1) could not find the document; 2) the document could not be read; 3) the root element of the document is not <xsd:schema>.
I've ensured that the file exists and is readable by doing exists() and canRead() assertions on the File object. I've also downloaded the file with a couple different utilities (web browser, wget) to ensure that there is no corruption.
Any idea why I can validate XSD instance documents when I generate the schema from the HTTP URL but I get the above exception when trying to generate from a local file with the same contents?
[Edit]
To elaborate, I've tried multiple forms of factory.newSchema(...) using Readers and InputStreams (instead of the File directly) and still get exactly the same error. Moreover, I've dumped the file contents before using it or the various input streams to ensure it's the right one. Quite vexing.
Full Answer
It turns out that there are three additional files referenced by XML Schema which must be also stored locally and XMLSchema.xsd contains an import statement whose schemaLocation attribute must be changed. Here are the files that must be saved in the same directory:
XMLSchema.xsd - change schemaLocation to "xml.xsd" in the "import" element for XML Namespace.
XMLSchema.dtd - as is.
datatypes.dtd - as is.
xml.xsd - as is.
Thanks to @Blaise Doughan and @Tomasz Nurkiewicz for their hints.
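The underlying mechanism, relative references being resolved against the schema file's own location, can be sketched with two tiny stand-in schemas instead of XMLSchema.xsd. Everything named here (LocalSchemaDemo, main.xsd, types.xsd, the book element) is invented for the example. It assumes the JDK's default SchemaFactory settings, which permit file access.

```java
import org.xml.sax.SAXException;

import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import java.io.File;
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

final class LocalSchemaDemo {
    // main.xsd pulls in types.xsd through a relative schemaLocation.
    static final String MAIN_XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:include schemaLocation='types.xsd'/>"
      + "<xs:element name='book' type='TitleType'/>"
      + "</xs:schema>";
    static final String TYPES_XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:simpleType name='TitleType'><xs:restriction base='xs:string'/></xs:simpleType>"
      + "</xs:schema>";

    /** Writes both schemas into one temp directory, then validates instanceXml. */
    static boolean validates(String instanceXml) {
        try {
            File dir = Files.createTempDirectory("xsd-demo").toFile();
            File main = new File(dir, "main.xsd");
            Files.write(main.toPath(), MAIN_XSD.getBytes(StandardCharsets.UTF_8));
            Files.write(new File(dir, "types.xsd").toPath(),
                        TYPES_XSD.getBytes(StandardCharsets.UTF_8));

            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            // Passing a File gives the factory a system ID, so types.xsd is
            // looked up next to main.xsd rather than in the working directory.
            Schema schema = factory.newSchema(main);
            schema.newValidator().validate(new StreamSource(new StringReader(instanceXml)));
            return true;
        } catch (SAXException e) {
            return false; // schema could not be built or the instance is invalid
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

If types.xsd were missing from that directory, building the schema would fail because TitleType could not be resolved, which mirrors having XMLSchema.xsd locally without its companion files.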

I assume you are trying to load XMLSchema.xsd. Please also download XMLSchema.dtd and datatypes.dtd and put them in the same directory. This should push you a little bit further.

UPDATE
Is XMLSchema.xsd importing any other schemas by relative paths that are not on the local file system?
Your relative path may not be correct with respect to your working directory. Try entering a fully qualified path to eliminate the possibility that the file cannot be found.
org.xml.sax.SAXParseException: schema_reference.4: Failed to read schema document 'file:/Users/foo/bar/test/xsd/XMLSchema.xsd', because 1) could not find the document; 2) the document could not be read; 3) the root element of the document is not <xsd:schema>.
