How can I validate documents against Schematron schemas in Java?

As far as I can tell, JAXP by default supports W3C XML Schema and RelaxNG from Java 6.
I can see a few APIs, mostly experimental or incomplete, on the schematron.com links page.
Is there an approach on validating schematron in Java that's complete, efficient and can be used with the JAXP API?

Jing supports pre-ISO Schematron validation (note that Jing's implementation is also based on XSLT).
There are also XSLT implementations that can be very easily invoked from Java. You need to decide which version of Schematron you are interested in and then get the corresponding stylesheet; all of them should be available from schematron.com. The process is very simple, involving basically two steps:
1. Apply the skeleton XSLT to your Schematron schema to get a new XSLT stylesheet that represents your Schematron schema in XSLT.
2. Apply the resulting XSLT to your instance document or documents to validate them.
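A minimal sketch of those two steps using only JAXP, assuming you have already downloaded a skeleton stylesheet. The class name XsltSchematronValidator is made up for illustration, and the format of the returned report (for example SVRL) depends entirely on which skeleton you pick:
```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Source;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical helper: two-step, XSLT-based Schematron validation through plain JAXP. */
final class XsltSchematronValidator {
    private final Templates compiledSchema;

    /** Step 1: run the skeleton over the Schematron schema and compile the generated XSLT. */
    XsltSchematronValidator(Source skeletonXslt, Source schematronSchema) {
        try {
            TransformerFactory factory = TransformerFactory.newInstance();
            StringWriter generatedXslt = new StringWriter();
            factory.newTransformer(skeletonXslt)
                   .transform(schematronSchema, new StreamResult(generatedXslt));
            // The generated stylesheet is kept as text here for simplicity.
            compiledSchema = factory.newTemplates(
                    new StreamSource(new StringReader(generatedXslt.toString())));
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not compile the Schematron schema", e);
        }
    }

    /** Step 2: apply the generated XSLT to an instance document and return its raw output. */
    String validate(Source instanceDocument) {
        try {
            StringWriter report = new StringWriter();
            compiledSchema.newTransformer()
                          .transform(instanceDocument, new StreamResult(report));
            return report.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Validation transform failed", e);
        }
    }
}
```
If the skeleton uses xsl:import or xsl:include, give its Source a system ID (for example by constructing the StreamSource from a File) so that relative references can be resolved. One validator instance can be reused for many instance documents.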
JAXP is just an API and it does not require support for Relax NG from an implementation. You need to check the specific implementation that you use to see if that supports Relax NG or not.

A pure Java Schematron implementation is located at https://github.com/phax/ph-schematron/
It brings support for both the XSLT approach and the pure Java approach.

You can check out SchematronAssert (disclosure: my code). It is meant primarily for unit testing, but you may use it for normal code too. It is implemented using XSLT.
Unit testing example:
ValidationOutput result = in(booksDocument)
.forEvery("book")
.check("author")
.validate();
assertThat(result).hasNoErrors();
Standalone validation example:
StreamSource schemaSource = new StreamSource(... your schematron schema ...);
StreamSource xmlSource = new StreamSource(... your xml document ... );
StreamResult output = ... here your SVRL will be saved ...
// validation
validator.validate(xmlSource, schemaSource, output);
Work with an object representation of SVRL:
ValidationOutput output = validator.validate(xmlSource, schemaSource);
// look at the output
output.getFailures() ...
output.getReports() ...

Related

Transform multiple input XML documents with XSLT in a Java application using the Saxon9HE API

How can I transform multiple XML input document objects with a single XSL transformation script using the Saxon9HE processor in a Java application?
I found a way to transform multiple XML input files from the filesystem with an XSLT script here, but I can't figure out how to pass multiple loaded XML Document objects to a Java application utilizing the Saxon9HE API. For a single XML document my code looks like this and works:
Processor proc = new Processor(false);
XsltCompiler comp = proc.newXsltCompiler();
try {
XsltExecutable exp = comp.compile(new StreamSource(stylesheetFile));
XdmNode source = proc.newDocumentBuilder().build(new DOMSource(inputXML));
Serializer out = proc.newSerializer();
out.setOutputProperty(Serializer.Property.METHOD, "xml");
out.setOutputProperty(Serializer.Property.INDENT, "yes");
out.setOutputFile(new File(outputFilename));
XsltTransformer trans = exp.load();
trans.setInitialContextNode(source);
trans.setDestination(out);
trans.transform();
} catch (SaxonApiException e) {
e.printStackTrace();
}

First point: avoid DOM if you can. When you are using Saxon, it's best to let Saxon build the document tree; this will be far more efficient. If you really need to use an external tree model, XOM and JDOM2 are much more efficient than DOM.
If you do want to provide a DOM as input, you have two choices: you can copy it to a Saxon tree, or you can wrap it as a Saxon tree. Use DocumentBuilder.build() in the first case, DocumentBuilder.wrap() in the second. Using build() gives you a higher initial cost, but the transformation itself is then faster.
If you want to pass pre-built trees into the transformation, declare the parameter using <xsl:param name="x" as="document-node()"/>, and then invoke the transformation using transformer.setParameter(new QName("x"), doc) where doc is an instance of XdmNode. You have to construct the XdmNode yourself by using a DocumentBuilder.
(Alternatively, if you want to access the documents in the stylesheet using the doc() or document() functions, you can invent a URI naming scheme and implement this in a URIResolver. When doc('my:uri') is called, your URIResolver is notified, and it should respond with a Source object. If you already have an XdmNode handy, then you can return XdmNode.asSource() to return this document tree as the result of your URIResolver.)
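The URIResolver idea can be sketched with plain JAXP. Note the assumptions: this version runs on whatever XSLT processor JAXP selects (the JDK's built-in one by default) rather than on Saxon's s9api, it hands back DOM documents instead of XdmNode instances, and the class name and the my: URI scheme are invented for illustration:
```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.URIResolver;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical resolver that serves already-loaded documents under invented URIs. */
final class InMemoryDocumentResolver implements URIResolver {
    private final Map<String, Document> documents = new HashMap<>();

    void register(String uri, Document document) {
        documents.put(uri, document);
    }

    @Override
    public Source resolve(String href, String base) {
        Document document = documents.get(href);
        // Returning null tells the processor to fall back to its default resolution.
        return document == null ? null : new DOMSource(document);
    }

    /** Parses XML text into a namespace-aware DOM document. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // XSLT needs namespace-aware trees
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    /** Transforms the primary document; document() calls are routed to the resolver. */
    static String transform(String stylesheet, Document primary, URIResolver resolver) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(stylesheet)));
            transformer.setURIResolver(resolver);
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(primary), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }
}
```
In the stylesheet, each extra input is then reachable as document('my:second') or whatever URI it was registered under.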

How to validate a document using a grammar in Xerces

I have the following situation:
- I create XML documents (DocumentImpl) on the fly (using data), so the XML is never written to disk.
- I create XSD schemas on the fly (using data definitions); these are also never written to disk. The grammars are complex, with assertions, so they need to be used as XML Schema 1.1.
- I store the grammars (SchemaGrammar) from the XSD schemas in a HashMap, because the same grammars are often used multiple times.
Now my question:
I want to validate the documents against a grammar. I know which grammar to use; it is looked up by the corresponding data-definition name.
My problem is that I cannot find example code showing how to do this, because all the examples seem to work from streams or files, while I already have the objects.

I think, it works like this:
XMLGrammarPoolImpl pool = new XMLGrammarPoolImpl();
pool.putGrammar(grammar);
XMLSchema11Factory factory = new XMLSchema11Factory();
Schema schema = factory.newSchema(pool);
Validator validator = schema.newValidator();
DOMSource source = new DOMSource(document);
validator.validate(source);

How to transform XML with XSL using Java

I am currently using the standard javax.xml.transform library to transform my XML to CSV using XSL. My XSL file is large - at around 950 lines. My XML files can be quite large also.
It was working fine in the prototype stage with a fraction of the XSL in place at around 50 lines or so. Now in the 'final system' when it performs the transform it comes up with the error Branch target offset too large for short.
private String transformXML() {
String formattedOutput = "";
try {
TransformerFactory tFactory = TransformerFactory.newInstance();
Transformer transformer =
tFactory.newTransformer( new StreamSource( xslFilename ) );
StreamSource xmlSource = new StreamSource(new ByteArrayInputStream( xmlString.getBytes() ) );
ByteArrayOutputStream baos = new ByteArrayOutputStream();
transformer.transform( xmlSource, new StreamResult( baos ) );
formattedOutput = baos.toString();
} catch( Exception e ) {
e.printStackTrace();
}
return formattedOutput;
}
I came across a few postings on this error but not too sure what to do.
Am I doing anything wrong code wise?
Are there any alternative 3rd Party transformers available that could do this?
Thanks,
Andez

Try Saxon instead.
Your code would stay the same. All you would need to do is set javax.xml.transform.TransformerFactory to net.sf.saxon.TransformerFactoryImpl in the JVM's system properties.

Use Saxon. Off topic: if you use the same stylesheet to transform many XML files, you might want to consider templates (pre-compiled stylesheets):
javax.xml.transform.Templates style = tFactory.newTemplates(xslSource);
style.newTransformer().transform(...);
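A slightly fuller sketch of that idea, assuming nothing beyond standard JAXP (the CompiledStylesheet class name is made up). A Templates object is safe to share between threads, while each Transformer it creates should be used by one thread at a time:
```java
import java.io.StringWriter;
import javax.xml.transform.Source;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;

/** Hypothetical wrapper: compile the stylesheet once, then reuse it for every input. */
final class CompiledStylesheet {
    private final Templates templates;

    CompiledStylesheet(Source xsl) {
        try {
            // The expensive parse and compile step happens only here.
            templates = TransformerFactory.newInstance().newTemplates(xsl);
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not compile stylesheet", e);
        }
    }

    /** Runs the precompiled stylesheet against one input and returns the output as text. */
    String apply(Source xml) {
        try {
            StringWriter out = new StringWriter();
            // newTransformer() is cheap compared to recompiling the stylesheet.
            templates.newTransformer().transform(xml, new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }
}
```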

I came across a post on the net that mentioned Apache Xalan, so I added the jars to my project. Everything has worked since then, even though I do not directly reference any Xalan classes in my code. As far as I can tell it should still use the javax.xml classes.
Not too sure what is happening there. But it is working.

As an alternative to Saxon, you can split up your large template into smaller templates.
Template definitions contained in XSLT stylesheets are compiled by SAP
JVM's XSLT compiler "Xalan" into Java methods for faster execution of
transformations. Java bytecode branch instructions contained in these
Java methods are limited to 32K offsets. Large template definitions
can now lead to very large Java methods, where the branch offset would
need to be larger than 32K. Therefore these stylesheets cannot be
compiled to Java methods and therefore cannot be used for
transformations.
Solution
Since each template definition of an XSLT stylesheet is compiled into
a separate Java method, using multiple smaller templates can be used
as solution. A very large template can be broken into multiple smaller
templates by using the "call-template" element.
It is described in-depth in this article Size limitation for XSLT stylesheets.
Sidenote: I would only recommend this as a last resort if Saxon is not available, as it requires quite a few changes to your XSL file.

How to read an XML file with Java?

I don't need to read complex XML files. I just want to read the following configuration file with the simplest possible XML reader:
<config>
<db-host>localhost</db-host>
<db-port>3306</db-port>
<db-username>root</db-username>
<db-password>root</db-password>
<db-name>cash</db-name>
</config>
How can I read the above XML file with an XML reader in Java?

I like jdom:
SAXBuilder parser = new SAXBuilder();
Document docConfig = parser.build("config.xml");
Element elConfig = docConfig.getRootElement();
String host = elConfig.getChildText("db-host");

Since you want to parse config files, I think commons-configuration would be the best solution.
Commons Configuration provides a generic configuration interface which enables a Java application to read configuration data from a variety of sources (including XML)

You could use a simple DOM parser to read the xml representation.
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document dom = db.parse("config.xml");
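A small sketch that carries the DOM approach through to reading the values from the question's config file. The DomConfigReader class is hypothetical, and it assumes each setting appears once as a simple element:
```java
import java.io.IOException;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical helper that looks up simple element values in a small config document. */
final class DomConfigReader {
    private final Element root;

    DomConfigReader(InputSource xml) {
        try {
            root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(xml)
                    .getDocumentElement();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not read configuration", e);
        }
    }

    /** Returns the trimmed text of the first element with this tag name, or null if absent. */
    String get(String tagName) {
        NodeList matches = root.getElementsByTagName(tagName);
        return matches.getLength() == 0 ? null : matches.item(0).getTextContent().trim();
    }
}
```
For a file on disk this would be used as new DomConfigReader(new InputSource("config.xml")).get("db-host").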

If you just need a simple solution that's included with the Java SDK (since 5.0), check out the XPath package. I'm sure others perform better, but this was all I needed. Here's an example:
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;
...
try {
XPath xpath = XPathFactory.newInstance().newXPath();
InputSource inputSource = new InputSource("strings.xml");
// result will equal "Save My Changes" (see XML below)
String result = xpath.evaluate("//string", inputSource);
}
catch(XPathExpressionException e) {
// do something
}
strings.xml
<?xml version="1.0" encoding="utf-8"?>
<resources>
<string name="saveLabel">Save My Changes</string>
</resources>

There are several XML parsers for Java. One I've used and found particularly developer friendly is JDOM. And by developer friendly, I mean "java oriented" (i.e., you work with objects in your program), instead of "document oriented", as some other tools are.

I would recommend Commons Digester, which allows you to parse a file without writing reams of code. It uses a series of rules to determine what action it should perform when encountering a given element or attribute (a typical rule might be to create a particular business object).

For a similar use case in my application I used JAXB. With JAXB, reading XML files is like interacting with Java POJOs. But to use JAXB you need to have the XSD for this XML file. You can look for more info here

If you want to be able to read and write objects to XML directly, you can use XStream

Although I have not tried XPath yet as it has just come to my attention now, I have tried a few solutions and have not found anything that works for this scenario.
I decided to make a library that fulfills this need as long as you are working under the assumptions mentioned in the readme. It has the advantage that it uses SAX to parse the entire file and return it to the user in the form of a map so you can lookup values as key -> value.
https://github.com/daaso-consultancy/ConciseXMLParser
If something is missing kindly inform me of the missing item as I only develop it based on the needs of others and myself.

Is there a Java XML API that can parse a document without resolving character entities?

I have a program that needs to parse XML that contains character entities. The program itself doesn't need to have them resolved, and the list of them is large and will change, so I want to avoid explicit support for these entities if I can.
Here's a simple example:
<?xml version="1.0" encoding="UTF-8"?>
<xml>Hello there &something;</xml>
Is there a Java XML API that can parse a document successfully without resolving (non-standard) character entities? Ideally it would translate them into a special event or object that could be handled specially, but I'd settle for an option that would silently suppress them.
Answer & Example:
Skaffman gave me the answer: use a StAX parser with IS_REPLACING_ENTITY_REFERENCES set to false.
Here's the code I whipped up to try it out:
XMLInputFactory inputFactory = XMLInputFactory.newInstance();
inputFactory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, false);
XMLEventReader reader = inputFactory.createXMLEventReader(
new FileInputStream("your file here"));
while (reader.hasNext()) {
XMLEvent event = reader.nextEvent();
if (event.isEntityReference()) {
EntityReference ref = (EntityReference) event;
System.out.println("Entity Reference: " + ref.getName());
}
}
For the above XML, it will print "Entity Reference: something".

The StAX API has support for the notion of not replacing character entity references, by way of the IS_REPLACING_ENTITY_REFERENCES property:
Requires the parser to replace
internal entity references with their
replacement text and report them as
characters
This can be set on an XMLInputFactory, which is then used to construct an XMLEventReader or XMLStreamReader. However, the API documentation is careful to say that this property is only intended to force the implementation to perform the replacement, rather than to force it not to replace them. Still, it's got to be worth a try.

Works for me only when disabling support of external entities:
XMLInputFactory inputFactory = XMLInputFactory.newInstance();
inputFactory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, false);
inputFactory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);

A SAX parse with an org.xml.sax.EntityResolver might suit your purpose. You could for sure suppress them, and you could probably find a way to leave them unresolved.
This tutorial seems the most relevant: it shows how to resolve entities into strings.

I am not a Java developer, but I think the Java XML classes support functionality similar to .NET's for accomplishing this. In .NET, on the XmlReaderSettings class, you set the ProhibitDtd property to false and the XmlResolver property to null. This causes the parser to ignore externally referenced entities without throwing an exception when they are read. I just did a Google search for "Java ignore entity" and got lots of hits, some of which appear to address this topic. I realize this is not a complete answer to your question, but it should point you in a useful direction.
