Java XML Parse/Validation Error Handling - java

I'm trying to write something in Java that receives an XML string and validates it against an XSD schema, and does automatic error handling for some simple common errors, and outputs a fixed XML string.
I've come across the SAX ErrorHandler interface for the Validator.validate() function, but that seems to be mostly for reporting exceptions, and I can't figure out how to modify the XML from it, other than by getting the line/column number, which would make fixing problems very tedious.
I also found the Validator.validate() function which has a source and a result, and returns augmented XML, which to my knowledge just fills in missing attributes that have default values, which is part of what I need to do.
But I also need something along the lines of fixing a missing start or end tag, and correcting a tag that has been misspelled by a letter, and things like that. There are so many "Handler" interfaces (ValidatorHandler, ContentHandler, EntityResolver) that I'm not sure which ones to look at in depth, so if someone could point me in the right direction that would be great (I don't need a detailed code example).
Also I'm not sure how the XMLReader fits in to it all.

To deal with errors, you have to implement the ErrorHandler interface, or extend the DefaultHandler helper class and override its error method, which is the method called for validation errors. If you want to be more precise, I think you will have to analyze the error message. I don't think SAX will give you anything that makes errors easy to fix.
BTW, note that for validating against an XSD, you should not use the setValidating method. See the code below.
The Javadoc (1.7) of the setValidating method says:
Note that "the validation" here means a validating parser as defined in the XML recommendation. In other words, it essentially just controls the DTD validation. (except the legacy two properties defined in JAXP 1.2.)
To use modern schema languages such as W3C XML Schema or RELAX NG instead of DTD, you can configure your parser to be a non-validating parser by leaving the setValidating(boolean) method false, then use the setSchema(Schema) method to associate a schema to a parser.
import java.io.File;
import java.io.PrintStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
// ...
public class Validator {
    public static void main(String[] args) throws Exception {
        if (args.length == 0 || args.length > 2) {
            System.err.println("Usage: java Validator <doc.xml> [<schema.xsd>]");
            System.exit(1);
        }
        SchemaFactory sf = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        String xsdpath = "book.xsd";
        if (args.length == 2) {
            xsdpath = args[1];
        }
        Schema s = sf.newSchema(new File(xsdpath));
        SAXParserFactory factory = SAXParserFactory.newInstance();
        // Leave DTD validation off; the schema set below drives XSD validation.
        factory.setValidating(false);
        factory.setNamespaceAware(true);
        factory.setSchema(s);
        XMLReader parser = factory.newSAXParser().getXMLReader();
        parser.setFeature("http://xml.org/sax/features/namespaces", true);
        parser.setFeature("http://xml.org/sax/features/namespace-prefixes", false);
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        // MyHandler is the application's own ContentHandler (not shown here).
        parser.setContentHandler(new MyHandler(out));
        // DefaultHandler ignores warnings and errors; override error() to see validation problems.
        parser.setErrorHandler(new DefaultHandler());
        parser.parse(args[0]);
    }
}
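As a rough sketch of the "redefine the error method" advice, here is an ErrorHandler that records each problem with its line and column instead of ignoring it. The class name CollectingErrorHandler and the helper validate(String, String) are made up for illustration, and it uses javax.xml.validation.Validator on in-memory strings rather than the SAXParserFactory setup above. It only reports problems; it does not attempt to repair the XML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper: collects every reported problem as a readable string.
final class CollectingErrorHandler implements ErrorHandler {
    final List<String> problems = new ArrayList<>();

    private void record(String level, SAXParseException e) {
        problems.add(level + " at " + e.getLineNumber() + ":" + e.getColumnNumber()
                + " " + e.getMessage());
    }

    @Override
    public void warning(SAXParseException e) {
        record("warning", e);
    }

    // Called for schema validation errors; not rethrowing lets validation continue.
    @Override
    public void error(SAXParseException e) {
        record("error", e);
    }

    // Called when the document is not well formed; parsing cannot continue.
    @Override
    public void fatalError(SAXParseException e) throws SAXException {
        record("fatal", e);
        throw e;
    }

    // Validates an XML string against an XSD string and returns the recorded problems.
    static List<String> validate(String xsd, String xml) throws Exception {
        SchemaFactory sf = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = sf.newSchema(new StreamSource(new StringReader(xsd)));
        Validator validator = schema.newValidator();
        CollectingErrorHandler handler = new CollectingErrorHandler();
        validator.setErrorHandler(handler);
        try {
            validator.validate(new StreamSource(new StringReader(xml)));
        } catch (SAXException e) {
            // Already recorded by fatalError(); nothing more to do here.
        }
        return handler.problems;
    }
}
```

Calling CollectingErrorHandler.validate(xsd, xml) returns an empty list for a valid document. Any automatic fixing would still have to be built on top of the recorded messages and positions.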

I've used DocumentBuilderFactory with setValidating(true) to generate an instance of an XML validating parser (i.e. DocumentBuilder).
Note that both validating and non-validating XML parsers will verify that the XML is "well formed" (e.g. end-tags, etc.). "Validating" refers to checking that the XML conforms to a DTD or schema.
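To illustrate the well-formedness point, here is a small sketch with a non-validating DocumentBuilder: no DTD or schema is involved, yet a document with a missing end tag is still rejected. The class name WellFormedCheck and the method isWellFormed are hypothetical names chosen for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: checks well-formedness only, with no DTD or schema validation.
final class WellFormedCheck {
    static boolean isWellFormed(String xml) throws ParserConfigurationException, IOException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false); // non-validating parser
        DocumentBuilder builder = factory.newDocumentBuilder();
        // DefaultHandler rethrows fatal errors and keeps the parser from printing to stderr.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Thrown for problems such as a missing or mismatched end tag.
            return false;
        }
    }
}
```

With this, isWellFormed("<a><b/></a>") is accepted and isWellFormed("<a><b></a>") is rejected, even though validation is switched off.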

Related

Using Apache Commons Digester with JSON

I am writing a class that extends a class that uses Digester to parse an XML response from an API (example existing class, code snippet below). After receiving the response, the code creates an object and calls specific methods on it.
Code snippet edited for brevity:
private Digester createDigester() {
    Digester digester = new Digester();
    digester.addObjectCreate("GeocodeResponse/result", GoogleGeocoderResult.class);
    digester.addObjectCreate("GeocodeResponse/result/address_component", GoogleAddressComponent.class);
    digester.addCallMethod("GeocodeResponse/result/address_component/long_name", "setLongName", 0);
    ...
    digester.addSetNext("GeocodeResponse/result/address_component", "addAddressComponent");
    Class<?>[] dType = {Double.class};
    digester.addCallMethod("GeocodeResponse/result/formatted_address", "setFormattedAddress", 0);
    ...
    digester.addSetNext("GeocodeResponse/result", "add");
    return digester;
}
}
The API that I will be calling, however, only supports JSON. I have found a probable solution, which involves converting the JSON to XML and then running it through Digester, but that seems incredibly hackish.
public JsonDigester(final String customRootElementName) {
super(new JsonXMLReader(customRootElementName));
}
Is there a better way to do this?
This class is specifically meant to deal with XML as per the documentation:
Basically, the Digester package lets you configure an XML -> Java
object mapping module, which triggers certain actions called rules
whenever a particular pattern of nested XML elements is recognized. A
rich set of predefined rules is available for your use, or you can
also create your own.
Why would you think it would work with JSON?

Sax parser with DefaultHandler implementing LexicalHandler

I have a handler with this signature:
class XMLReaderHandler extends DefaultHandler implements LexicalHandler
I'm using it with a SAX parser (javax.xml.parsers.SAXParser).
Is it possible that, for an empty element like <id></id>, the parser does not call this method?
characters(char[] ch, int start, int length)
throws SAXException
That's what I'm seeing in the debugger when an empty tag is present. I still need the characters callback because I want to add an empty string in that case.
PS: don't suggest XPath because I'm already using it, I'm implementing it with Sax in case I need to process big data.
Thanks
EDIT:
So some new info. I'm using in my schema the xs:anyType with processContents="skip". Apparently SaxParser will skip things like that, so I found that I could use something like:
saxParserFactory.setFeature(
"http://java.sun.com/xml/schema/features/report-ignored-element-content-whitespace",
true);
but it's giving me org.xml.sax.SAXNotRecognizedException: Feature 'http://java.sun.com/xml/schema/features/report-ignored-element-content-whitespace' is not recognized.
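For the plain empty-element case, one common workaround is to not rely on characters() being called at all: SAX does not promise a call for an element with no content, and it may split text across several calls. Below is a minimal sketch that resets a buffer in startElement and reads it in endElement, so an empty element yields an empty string. The class name TextCollectingHandler and its values map are made up for the example, and it does not address the processContents="skip" part of the question.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: stores the text of each element, using "" when there is none.
final class TextCollectingHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    final Map<String, String> values = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text.setLength(0); // start with an empty buffer for every element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // may be called zero, one, or many times
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // The buffer is simply empty if characters() was never called.
        // Repeated element names overwrite each other in this simplified map.
        values.put(qName, text.toString());
        text.setLength(0);
    }

    static Map<String, String> parse(String xml) throws Exception {
        TextCollectingHandler handler = new TextCollectingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.values;
    }
}
```

The empty string is produced in endElement, so it no longer matters whether the parser reports any characters for <id></id>.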

How do I serialize / deserialize a class in XML with Woodstox StAX 2

I'm pretty much trying to achieve in Java what has been done in how-to-serialize-deserialize-simple-classes-to-xml-and-back (C#). If possible, I would like to avoid writing serialize / deserialize methods for each class.
For example, part of serialize:
XMLOutputFactory xof = null;
XMLStreamWriter2 writer = null;
try {
    resp.setContentType("text/plain");
    xof = XMLOutputFactory.newInstance();
    // The cast assumes a StAX2 implementation such as Woodstox is on the classpath.
    writer = (XMLStreamWriter2) xof.createXMLStreamWriter(resp.getOutputStream());
    writer.writeStartDocument("1.0");
    writer.writeStartElement("data");
    //
    // Magic happens here.
    //
    writer.writeEndElement();
    writer.writeEndDocument();
} catch (XMLStreamException e) {
    e.printStackTrace();
    resp.sendError(1, "Problem 1 occurred.");
} finally {
    try {
        // writer is still null if creating it failed above.
        if (writer != null) {
            writer.flush();
            writer.close();
        }
    } catch (XMLStreamException e) {
        e.printStackTrace();
        resp.sendError(2, "Problem 2 occurred.");
    }
}
Not part of this question, as I'm trying to tackle problems one at a time, but it might give you a sense of what I'm trying to do: when I deserialize, I would also like to check if the input is valid. Eventually I want to use XSLT transforms on the serialized form.
JAXB is how you serialize Java objects to XML. The following will help you get started:
http://wiki.eclipse.org/EclipseLink/Examples/MOXy/GettingStarted
JAXB Implementations
There are several implementations of this standard:
EclipseLink MOXy (I'm the tech lead)
Metro (the reference implementation, included in Java SE 6)
JaxMe
Woodstox StAX 2
JAXB accepts many input/output formats including StAX.
Validation
XML is converted to objects using an Unmarshaller, and objects are converted to XML with a Marshaller. You can set an instance of javax.xml.validation.Schema to validate the input during these operations.
You can also use the javax.xml.validation APIs directly with JAXB, check out the following for an example:
Checking a java value with an xml schema
XSLT
The javax.xml.transform libraries are used in Java to perform XSLT transforms. JAXB is designed to work with these libraries using JAXBSource and JAXBResult.
For More Information
Check out my blog:
http://bdoughan.blogspot.com
In addition to the comprehensive accepted answer, it's worth noting that Woodstox (or any Stax2 implementation) can actually validate both input and output; see this blog entry for sample code. One benefit is that you can also validate against Relax NG (not supported AFAIK by JAXP parser that JAXB uses by default) or DTD.
Also: there is a new project called Jackson-xml-databinder (a spin-off of the Jackson JSON processor) that implements "mini-JAXB" (a subset of full JAXB functionality) using a Stax2 parser (like Woodstox or Aalto). The main benefits are a somewhat more powerful data binding part and even better performance than JAXB implementations; the downside is that it is not as mature and does not support all XML-specific aspects. It is probably most useful in cases where both JSON and XML formats are to be supported.

How do I stop the Sun JDK1.6 builtin StAX parser from resolving DTD entities

I'm using the StAX event-based APIs to modify an XML stream. The stream represents an HTML document, complete with DTD declaration. I would like to copy this DTD declaration into the output document (written using an XMLEventWriter). When I ask the factory to disregard DTDs, it does not download the DTD, but it removes the whole statement and leaves only a "<!DOCUMENTTYPE" string. When DTDs are not disregarded, the whole DTD gets downloaded and is included when outputting the DTD event verbatim. I don't want to spend the time downloading this DTD, but I do want to include the complete DTD specification (resolving entities is already disabled and I don't need that). Does anyone know how to disable the fetching of external DTDs?
You should be able to implement a custom XMLResolver that redirects attempts to fetch external DTDs to a local resource (if your code parses only a specific doc type, this is often a class resource right in a JAR).
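import javax.xml.stream.XMLResolver;
import javax.xml.stream.XMLStreamException;

class CustomResolver implements XMLResolver {
    @Override
    public Object resolveEntity(String publicID,
                                String systemID,
                                String baseURI,
                                String namespace)
            throws XMLStreamException {
        if ("The public ID you expect".equals(publicID)) {
            // Serve the DTD from a class resource instead of the network.
            return getClass().getResourceAsStream("doc.dtd");
        } else {
            // Returning null lets the parser fall back to its default resolution.
            return null;
        }
    }
}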
Note that some documents only include the "systemID", so you should fall back to checking that. The problem with system identifier is that it's supposed to be "system" specific URL, rather than a well-known, stable URI. In practice, it's often used as if it were a URI though.
See the setXMLResolver method.
Also: your original approach (setting SUPPORT_DTD to false) might work with Woodstox, if so far you have been using the default Sun stax parser bundled with JDK 1.6.
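Here is a rough sketch of wiring a resolver into the factory with setXMLResolver. As a simplification it does not map to a local DTD file; it answers every external lookup with an empty stream, so nothing is fetched but entities declared in the real DTD are not available either. The class name NoFetchDtdExample, the method rootElementName and the example.invalid URL in the usage note are made up for illustration, and the exact behaviour may differ between StAX implementations.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLResolver;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example: parse a document with a DOCTYPE without fetching the external DTD.
final class NoFetchDtdExample {
    static String rootElementName(String xml) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Assumption: an empty external subset is acceptable for this document.
        XMLResolver emptyDtd = (publicID, systemID, baseURI, namespace) ->
                new ByteArrayInputStream(new byte[0]);
        factory.setXMLResolver(emptyDtd);

        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    return reader.getLocalName(); // first start element is the root
                }
            }
            return null;
        } finally {
            reader.close();
        }
    }
}
```

For example, a document starting with <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://example.invalid/xhtml1-strict.dtd"> can be read this way even though that URL cannot be fetched.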

How would you use Java to handle various XML documents?

I'm looking for the best method to parse various XML documents using a Java application. I'm currently doing this with SAX and a custom content handler and it works great - zippy and stable.
I've decided to explore having the same program, which currently receives a single XML document format, receive two additional XML document formats with various XML element changes. I was hoping to just swap out the ContentHandler with an appropriate one based on the first "startElement" in the document... but, uh-duh, the ContentHandler is set and then the document is parsed!
... constructor ...
{
    SAXParserFactory spf = SAXParserFactory.newInstance();
    try {
        SAXParser sp = spf.newSAXParser();
        parser = sp.getXMLReader();
        parser.setErrorHandler(new MyErrorHandler());
    } catch (Exception e) {}
... parse StringBuffer ...
    try {
        parser.setContentHandler(pP);
        parser.parse(new InputSource(new StringReader(xml.toString())));
        return true;
    } catch (IOException e) {
        e.printStackTrace();
    } catch (SAXException e) {
        e.printStackTrace();
    }
...
So, it doesn't appear that I can do this in the way I initially thought I could.
That being said, am I looking at this entirely wrong? What is the best method to parse multiple, discrete XML documents with the same XML handling code? I tried to ask in a more general post earlier... but, I think I was being too vague. For speed and efficiency purposes I never really looked at DOM because these XML documents are fairly large and the system receives about 1200 every few minutes. It's just a one-way send of information.
To make this question even longer and add to my confusion, the following is a mockup of some various XML documents that I would like to have a single SAX, StAX, or ?? parser cleanly deal with.
products.xml:
<products>
    <product>
        <id>1</id>
        <name>Foo</name>
    </product>
    <product>
        <id>2</id>
        <name>bar</name>
    </product>
</products>
stores.xml:
<stores>
    <store>
        <id>1</id>
        <name>S1A</name>
        <location>CA</location>
    </store>
    <store>
        <id>2</id>
        <name>A1S</name>
        <location>NY</location>
    </store>
</stores>
managers.xml:
<managers>
    <manager>
        <id>1</id>
        <name>Fen</name>
        <store>1</store>
    </manager>
    <manager>
        <id>2</id>
        <name>Diz</name>
        <store>2</store>
    </manager>
</managers>
As I understand it, the problem is that you don't know what format the document is prior to parsing. You could use a delegate pattern. I'm assuming you're not validating against a DTD/XSD/etcetera and that it is OK for the DefaultHandler to have state.
import java.util.Map;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class DelegatingHandler extends DefaultHandler {
    private final Map<String, DefaultHandler> saxHandlers;
    private DefaultHandler delegate = null;

    public DelegatingHandler(Map<String, DefaultHandler> delegates) {
        saxHandlers = delegates;
    }

    @Override
    public void startElement(String uri, String localName, String name,
            Attributes attributes) throws SAXException {
        if (delegate == null) {
            // The first element seen is the root; its qualified name picks the delegate.
            delegate = saxHandlers.get(name);
            if (delegate == null) {
                throw new SAXException("No handler registered for <" + name + ">");
            }
        }
        delegate.startElement(uri, localName, name, attributes);
    }

    @Override
    public void endElement(String uri, String localName, String name)
            throws SAXException {
        delegate.endElement(uri, localName, name);
    }

    //etcetera...
}
You've done a good job of explaining what you want to do but not why. There are several XML frameworks that simplify marshalling and unmarshalling Java objects to/from XML.
The simplest is Commons Digester, which I typically use to parse configuration files. But if you want to deal with Java objects, then you should look at Castor, JiBX, JAXB, XMLBeans, XStream, or something similar. Castor and JiBX are my two favourites.
I have tried the SAXParser once, but once I found XStream I never went back to it. With XStream you can create Java Objects and convert them to XML. Send them over and use XStream to recreate the object. Very easy to use, fast, and creates clean XML.
Either way, you have to know what data you're going to receive from the XML file. You can send the documents over in different ways so that you know which parser to use. Or have a data object that can hold everything, where only one structure is populated (products/stores/managers). Maybe something like:
public class DataStructure {
    List<ProductStructure> products;
    List<StoreStructure> stores;
    List<ManagerStructure> managers;
    ...
    public int getProductCount() {
        return products.size();
    }
    ...
}
Then, with XStream, convert it to XML, send it over, and recreate the object. Then do what you want with it.
See the documentation for XMLReader.setContentHandler(), it says:
Applications may register a new or different handler in the middle of a parse, and the SAX parser must begin using the new handler immediately.
Thus, you should be able to create a SelectorContentHandler that consumes events until the first startElement event, changes the ContentHandler on the XML reader based on that event, and passes the first start element event to the new content handler. You just have to pass the XMLReader to the SelectorContentHandler in the constructor. If you need all the events to be passed to the vocabulary-specific content handler, SelectorContentHandler has to cache the events and then pass them on, but in most cases this is not needed.
On a side note, I've lately used XOM in almost all my projects to handle XML, and thus far performance hasn't been an issue.
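A minimal sketch of the SelectorContentHandler idea might look like the following. The map keyed by root element name, the ElementCounter demo handler and the parse helper are assumptions made for this example, and earlier events such as startDocument are not replayed to the selected handler.

```java
import java.io.StringReader;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Picks the real handler from the root element, then swaps it in mid-parse.
final class SelectorContentHandler extends DefaultHandler {
    private final XMLReader reader;
    private final Map<String, ContentHandler> handlersByRoot; // hypothetical lookup table

    SelectorContentHandler(XMLReader reader, Map<String, ContentHandler> handlersByRoot) {
        this.reader = reader;
        this.handlersByRoot = handlersByRoot;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        ContentHandler target = handlersByRoot.get(qName);
        if (target == null) {
            throw new SAXException("No handler registered for root element <" + qName + ">");
        }
        reader.setContentHandler(target);                 // takes effect immediately
        target.startElement(uri, localName, qName, atts); // forward the root element event
    }

    static void parse(String xml, Map<String, ContentHandler> handlers) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(new SelectorContentHandler(reader, handlers));
        reader.parse(new InputSource(new StringReader(xml)));
    }
}

// Demo handler for the sketch: counts start tags with a given name.
final class ElementCounter extends DefaultHandler {
    private final String elementName;
    int count;

    ElementCounter(String elementName) {
        this.elementName = elementName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals(elementName)) {
            count++;
        }
    }
}
```

Registering one handler under "products", one under "stores" and one under "managers" lets the same parse call deal with all three mock documents from the question.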
JAXB. The Java Architecture for XML Binding. Basically you create an xsd defining your XML layout (I believe you could also use a DTD). Then you pass the XSD to the JAXB compiler and the compiler creates Java classes to marshal and unmarshal your XML document into Java objects. It's really simple.
BTW, there are command line options to jaxb to specify the package name you want to place the resulting classes in, etc.
If you want more dynamic handling, Stax approach would probably work better than Sax.
That's quite low-level, still; if you want simpler approach, XStream and JAXB are my favorites. But they do require quite rigid objects to map to.
Agree with StaxMan, who interestingly enough wants you to use Stax. It's a pull based parser instead of the push you are currently using. This would require some significant changes to your code though.
:-)
Yes, I have some bias towards Stax. But as I said, oftentimes data binding is more convenient than a streaming solution. But if it's streaming you want, and you don't need pipelining (of multiple filtering stages), Stax is simpler than SAX.
One more thing: as good as XOM is (wrt alternatives), often a tree model is not the right thing to use if you are not dealing with "document-centric" XML (~= XHTML pages, DocBook, OpenOffice docs).
For data interchange, config files, etc., data binding is more convenient, more efficient, more natural. Just say no to tree models like DOM for these use cases.
So, JAXB, XStream, JiBX are good. Or, for a more acquired taste, Digester, Castor, XMLBeans.
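To show what the pull style buys you here, this is a small sketch that reads the root element first and then decides how to process the rest, which is awkward with a push parser. The class name StaxDispatchExample and the choice to collect only the name values are assumptions made for illustration, based on the mock documents in the question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example: choose the processing path from the root element name.
final class StaxDispatchExample {
    static List<String> readNames(String xml) throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        try {
            reader.nextTag(); // advance to the root start element
            String root = reader.getLocalName();
            switch (root) {
                case "products": return collectNames(reader, "product");
                case "stores":   return collectNames(reader, "store");
                case "managers": return collectNames(reader, "manager");
                default: throw new IllegalArgumentException("Unknown document type: " + root);
            }
        } finally {
            reader.close();
        }
    }

    // Collects the text of every <name> element found inside an <itemName> element.
    private static List<String> collectNames(XMLStreamReader reader, String itemName)
            throws XMLStreamException {
        List<String> names = new ArrayList<>();
        boolean inItem = false;
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                if (reader.getLocalName().equals(itemName)) {
                    inItem = true;
                } else if (inItem && reader.getLocalName().equals("name")) {
                    names.add(reader.getElementText()); // leaves the reader on </name>
                }
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && reader.getLocalName().equals(itemName)) {
                inItem = false;
            }
        }
        return names;
    }
}
```

Because the application drives the reader, the branch on the root element is an ordinary switch statement rather than a handler swap.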
VTD-XML is known for being the best XML processing technology for heavy duty XML processing. See the reference below for a proof
http://sdiwc.us/digitlib/journal_paper.php?paper=00000582.pdf
