How to ignore inline DTD when parsing XML file in Java

I have a problem reading an XML file that has a DTD declaration inside it (the external declaration case is already solved). I'm using SAX (javax.xml.parsers.SAXParser). When there is no DTD, the callbacks arrive in an order like StartElement-Characters-StartElement-Characters-EndElement-Characters... So the characters method is called immediately after each start or end element, and that's how I need it to be. When the DTD is in the file, the order changes to, for example, StartElement-StartElement-StartElement-Characters-EndElement-EndElement-EndElement. I need a characters call after every element. Is there any way to prevent this change in the event sequence?
My code:
SAXParserFactory factory = SAXParserFactory.newInstance();
factory.setValidating(false);
SAXParser parser = factory.newSAXParser();
XMLReader reader = parser.getXMLReader();
reader.setFeature("http://xml.org/sax/features/validation", false);
reader.setFeature("http://apache.org/xml/features/nonvalidating/load-dtd-grammar", false);
reader.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
reader.setFeature("http://xml.org/sax/features/external-general-entities", false);
reader.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
reader.setFeature("http://xml.org/sax/features/use-entity-resolver2", false);
reader.setFeature("http://apache.org/xml/features/validation/unparsed-entity-checking", false);
reader.setFeature("http://xml.org/sax/features/resolve-dtd-uris", false);
reader.setFeature("http://apache.org/xml/features/validation/dynamic", false);
reader.setFeature("http://apache.org/xml/features/validation/schema/augment-psvi", false);
reader.parse(input);
Here is the XML file that I'm trying to parse: link (it's a link to my Dropbox).

I suspect that the nodes that were previously being reported to the characters() callback are now being reported to the ignorableWhitespace() callback. The simplest solution might be to simply call characters() from ignorableWhitespace().
This is what the spec has to say about ignorableWhitespace():
Validating Parsers must use this method to report each chunk of
whitespace in element content (see the W3C XML 1.0 recommendation,
section 2.10): non-validating parsers may also use this method if they
are capable of parsing and using content models.
In other words, if there is a DTD, and if you are not validating, then
it's up to the parser whether it reports whitespace in element-only
content models using the characters() callback or the
ignorableWhitespace() callback.
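A minimal sketch of that suggestion follows. The class name WhitespaceForwardingHandler and the collectText helper are made up for illustration and are not from the question. The handler forwards ignorableWhitespace() to characters(), so whitespace in element content reaches the same callback whether or not the parser used the DTD's content models:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class WhitespaceForwardingHandler extends DefaultHandler {
    // Collects everything that reaches characters(), directly or forwarded.
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        // Treat "ignorable" whitespace exactly like ordinary character data.
        characters(ch, start, length);
    }

    String text() {
        return text.toString();
    }

    // Hypothetical helper: parses an XML string and returns all reported text.
    static String collectText(String xml) throws Exception {
        WhitespaceForwardingHandler handler = new WhitespaceForwardingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.text();
    }
}
```

In a real handler the body of characters() would be whatever logic you already have; only the forwarding method is new.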

Related

Resolving which version of an XML Schema to use for XML documents with a version attribute

I have to write some code to handle reading and validating XML documents that use a version attribute in their root element to declare a version number, like this:
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<Junk xmlns="urn:com:initech:tps"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="urn:com:initech.tps:schemas/foo/Junk.xsd"
VersionAttribute="2.0">
There are a bunch of nested schemas. My code has an org.w3c.dom.ls.LSResourceResolver to figure out which schema to use, implementing this method:
LSInput resolveResource(String type,
String namespaceURI,
String publicId,
String systemId,
String baseURI)
Previous versions of the schema embedded the schema version in the namespace, so I could use the namespaceURI and systemId to decide which schema to provide. Now the version number has been moved to an attribute in the root element, and my resolver doesn't have access to that. How am I supposed to figure out the version of the XML document in the LSResourceResolver?

I had never had to deal with schema versions before this and had no idea what was involved. When the version was part of the namespace then I could throw all the schemas in together and let them get sorted out, but with the version in the root element and namespace shared across versions there is no getting around reading the version information from the XML before starting the SAX parsing.
I'm going to do something very similar to what Pangea suggested (gets +1 from me), but I can't follow the advice exactly because the document is too big to read into memory, even once. By using StAX I can minimize the amount of work done to get the version from the file. See this DeveloperWorks article, "Screen XML documents efficiently with StAX":
The screening or classification of XML documents is a common problem,
especially in XML middleware. Routing XML documents to specific
processors may require analysis of both the document type and the
document content. The problem here is obtaining the required
information from the document with the least possible overhead.
Traditional parsers such as DOM or SAX are not well suited to this
task. DOM, for example, parses the whole document and constructs a
complete document tree in memory before it returns control to the
client. Even DOM parsers that employ deferred node expansion, and thus
are able to parse a document partially, have high resource demands
because the document tree must be at least partially constructed in
memory. This is simply not acceptable for screening purposes.
The code to get the version information will look like:
def map = [:]
def startElementCount = 0
def inputStream = new File(inputFile).newInputStream()
try {
    XMLStreamReader reader =
        XMLInputFactory.newInstance().createXMLStreamReader(inputStream)
    for (int event; (event = reader.next()) != XMLStreamConstants.END_DOCUMENT;) {
        if (event == XMLStreamConstants.START_ELEMENT) {
            if (startElementCount > 0) return map
            startElementCount += 1
            map.rootElementName = reader.localName
            for (int i = 0; i < reader.attributeCount; i++) {
                if (reader.getAttributeName(i).toString() == 'VersionAttribute') {
                    map.versionIdentifier = reader.getAttributeValue(i).toString()
                    return map
                }
            }
        }
    }
} finally {
    inputStream.close()
}
Then I can use the version information to figure out what resolver to use and what schema documents to set on the SaxFactory.
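For comparison, here is a rough plain-Java sketch of the same screening step. The class and method names (RootVersionPeek, rootAttribute) are hypothetical, and it assumes the version lives in an un-namespaced attribute on the root element. It stops at the first start element, so the rest of the document is never read:

```java
import java.io.Reader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class RootVersionPeek {
    /** Returns the named attribute of the root element, or null if it is absent. */
    static String rootAttribute(Reader in, String attributeName) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        try {
            while (reader.hasNext()) {
                // The first START_ELEMENT event is the root element.
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    // A null namespace means "match on the local name only".
                    return reader.getAttributeValue(null, attributeName);
                }
            }
            return null;
        } finally {
            reader.close(); // releases parser resources; the caller still closes the Reader
        }
    }
}
```

A caller would pass something like a FileReader (or an InputStreamReader with the right encoding) and "VersionAttribute", then pick the resolver and schemas based on the returned string.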

My Suggestion
Parse the Document using SAX or DOM
Get the version attribute
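Use the Validator.validate(Source) method and pass it the already parsed Document (from step 1), as shown below
Building DOMSource from parsed document
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true); // schema validation needs namespace-aware parsing
DocumentBuilder builder = factory.newDocumentBuilder();
Document document = builder.parse(new File(args[0]));
DOMSource domSource = new DOMSource(document);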
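Putting the three steps together might look roughly like this. It is only a sketch: the class name VersionedValidation and the map from version string to schema text are assumptions made for illustration, and a real setup would more likely map versions to schema files plus an LSResourceResolver:

```java
import java.io.StringReader;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class VersionedValidation {
    /**
     * Parses the document once, reads VersionAttribute from the root element,
     * and validates the parsed DOM against the schema registered for that version.
     * Returns the version on success; validate() throws SAXException on invalid input.
     */
    static String validate(String xml, Map<String, String> schemaTextByVersion) throws Exception {
        // Step 1: parse (namespace-aware, as schema validation requires).
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));

        // Step 2: read the version attribute from the root element.
        String version = doc.getDocumentElement().getAttribute("VersionAttribute");
        String schemaText = schemaTextByVersion.get(version);
        if (schemaText == null) {
            throw new IllegalArgumentException("No schema registered for version '" + version + "'");
        }

        // Step 3: validate the already parsed document.
        Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(schemaText)));
        schema.newValidator().validate(new DOMSource(doc));
        return version;
    }
}
```

Note that this holds the whole document in memory as a DOM, which is exactly the limitation the other answer works around with StAX.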

how can I tell xalan NOT to validate XML retrieved using the "document" function?

Yesterday Oracle decided to take down java.sun.com for a while. This screwed things up for me because xalan tried to validate some XML but couldn't retrieve the properties.dtd.
I'm using xalan 2.7.1 to run some XSL transforms, and I don't want it to validate anything.
So I tried loading the XSL like this:
SAXParserFactory spf = SAXParserFactory.newInstance();
spf.setNamespaceAware(true);
spf.setValidating(false);
XMLReader rdr = spf.newSAXParser().getXMLReader();
Source xsl = new SAXSource(rdr, new InputSource(xslFilePath));
Templates cachedXSLT = factory.newTemplates(xsl);
Transformer transformer = cachedXSLT.newTransformer();
transformer.transform(xmlSource, result);
In the XSL itself, I do something like this:
<xsl:variable name="entry" select="document(concat($prefix, $locale_part, $suffix))/properties/entry[@key=$key]"/>
The XML this code retrieves has the following definition at the top:
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
<entry key="...
Despite the java code above instructing the parser to NOT VALIDATE, it still sends a request to java.sun.com. While java.sun.com is unavailable, this makes the transform fail with the message:
Can not load requested doc: http://java.sun.com/dtd/properties.dtd
How do I get xalan to stop trying to validate the XML loaded from the "document" function?

The documentation mentions that the parser may read the DTDs even if not validating, as it may become necessary to use the DTD to resolve (expand) entities.
Since I don't have control over the XML documents, nont's option of modifying the XML was not available to me.
I managed to shut down attempts to pull in DTD documents by sabotaging the resolver, as follows.
My code uses a DocumentBuilder to return a Document (= DOM) but the XMLReader as per the OP's example also has a method setEntityResolver so the same technique should work with that.
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setValidating(false); // turns off validation
factory.setSchema(null); // turns off use of schema
// but that's *still* not enough!
DocumentBuilder builder = factory.newDocumentBuilder();
builder.setEntityResolver(new NullEntityResolver()); // swap in a dummy resolver
return builder.parse(xmlFile);
Here, now, is my fake resolver: It returns an empty InputStream no matter what's asked of it.
/** my resolver that doesn't */
private static class NullEntityResolver implements EntityResolver {
    public InputSource resolveEntity(String publicId, String systemId)
            throws SAXException, IOException {
        // Message only for debugging / if you care
        System.out.println("I'm asked to resolve: " + publicId + " / " + systemId);
        return new InputSource(new ByteArrayInputStream(new byte[0]));
    }
}
Alternatively, your fake resolver could return streams of actual documents read as local resources or whatever.
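A rough sketch of that alternative: a resolver that serves known DTDs from local content and falls back to an empty stream for anything else, so the network is never touched but entities declared in a known DTD still expand. The class name LocalDtdResolver, the in-memory map, and the rootText helper are illustrative assumptions; in practice the map values would more likely come from classpath resources or files.

```java
import java.io.StringReader;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

final class LocalDtdResolver implements EntityResolver {
    // Maps a DTD's system identifier to the DTD text held locally.
    private final Map<String, String> dtdTextBySystemId;

    LocalDtdResolver(Map<String, String> dtdTextBySystemId) {
        this.dtdTextBySystemId = dtdTextBySystemId;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        // Unknown identifiers get an empty (but valid) DTD instead of a network fetch.
        String dtdText = dtdTextBySystemId.getOrDefault(systemId, "");
        return new InputSource(new StringReader(dtdText));
    }

    // Hypothetical helper: parses xml with the resolver and returns the root's text.
    static String rootText(String xml, Map<String, String> dtdTextBySystemId) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        builder.setEntityResolver(new LocalDtdResolver(dtdTextBySystemId));
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }
}
```

The trade-off compared with the null resolver above is that you have to keep local copies of the DTDs you care about.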

Be aware that disabling DTD loading will cause parsing to fail if the DTD defines any entities that your XML file depends on. That said, to disable DTD loading try this, which assumes you're using the default Xerces that ships with Java.
/*
* Instantiate the SAXParser and set the features to prevent loading of an external DTD
*/
SAXParser sp = SAXParserFactory.newInstance().newSAXParser();
XMLReader xrdr = sp.getXMLReader();
xrdr.setFeature("http://xml.org/sax/features/validation", false);
xrdr.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
If you really need the DTD, then the other alternative is to implement a local XML catalog
/*
* Instantiate the SAXParser and add catalog support
*/
SAXParser sp = SAXParserFactory.newInstance().newSAXParser();
XMLReader xrdr = sp.getXMLReader();
CatalogResolver cr = new CatalogResolver();
xrdr.setEntityResolver(cr);
You will then have to provide the appropriate DTDs and an XML catalog definition. This Wikipedia article and this article were helpful.
CatalogResolver looks at the system property xml.catalog.files to determine what catalogs to load.

Try using setFeature on SAXParserFactory.
Try this:
SAXParserFactory spf = SAXParserFactory.newInstance();
spf.setValidating(false);
spf.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
I think that should be enough, otherwise try setting a few other features:
spf.setFeature("http://xml.org/sax/features/validation", false);
spf.setFeature("http://xml.org/sax/features/external-general-entities", false);
spf.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
spf.setFeature("http://apache.org/xml/features/nonvalidating/load-dtd-grammar", false);

I just ended up stripping the doctype declaration out of the XML, because nothing else worked. When I get around to it, I'll try this: http://www.sagehill.net/docbookxsl/UseCatalog.html#UsingCatsXalan

Sorry for necroposting, but I have found a solution which actually works and decided I should share it.
1.
For some reason, setValidating(false) doesn't work. In some cases, it still downloads external DTD files. To prevent this, you should attach a custom EntityResolver as advised here:
XMLReader rdr = spf.newSAXParser().getXMLReader();
rdr.setEntityResolver(new MyCustomEntityResolver());
The EntityResolver will be called for every external entity request. Returning null will not work because the framework will still download the file from the Internet after that. Instead, you can return an empty stream which is a valid DTD, as advised here:
private class MyCustomEntityResolver implements EntityResolver {
    public InputSource resolveEntity(String publicId, String systemId) {
        return new InputSource(new StringReader(""));
    }
}
2.
You are calling setValidating(false) on the SAX parser that reads your XSLT code, so only the XSLT itself is not validated. When the transformer encounters a document() function, it loads the linked XML file using another parser, which still validates it and also downloads external entities. To handle this, you should attach a custom URIResolver to the transformer:
Transformer transformer = cachedXSLT.newTransformer();
transformer.setURIResolver(new MyCustomURIResolver());
The transformer will call your URIResolver implementation when it encounters the document() function. Your implementation will have to return a Source for the passed URI. The simplest thing is to return a StreamSource as advised here. But in your case you should parse the document yourself, preventing validation and external requests using the customized SAXParser you already have (or create a new one each time).
private class MyCustomURIResolver implements URIResolver {
    public Source resolve(String href, String base) {
        return new SAXSource(rdr, new InputSource(href));
    }
}
So you will have to implement two custom interfaces in your code.
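Here is how the two pieces might fit together, as a sketch rather than a drop-in solution. It uses the JDK's default TransformerFactory, not necessarily xalan 2.7.1, and the class name OfflineDocumentFunction, the in-memory map of document() targets, and the helper names are all assumptions made so the example is self-contained. A real URIResolver would open the file or URL that href points to.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.URIResolver;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

final class OfflineDocumentFunction {

    /** Non-validating reader whose EntityResolver answers every request with an empty stream. */
    static XMLReader offlineReader() throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        spf.setValidating(false);
        XMLReader reader = spf.newSAXParser().getXMLReader();
        reader.setEntityResolver((publicId, systemId) -> new InputSource(new StringReader("")));
        return reader;
    }

    /** URIResolver that serves document() targets from a map (stand-in for real files). */
    static URIResolver resolverFor(Map<String, String> xmlByHref) {
        return (href, base) -> {
            String xml = xmlByHref.get(href);
            if (xml == null) {
                throw new TransformerException("Unknown document: " + href);
            }
            try {
                // Parse with our own reader so the DOCTYPE never triggers a download.
                return new SAXSource(offlineReader(), new InputSource(new StringReader(xml)));
            } catch (Exception e) {
                throw new TransformerException(e);
            }
        };
    }

    static String transform(String xslt, String inputXml, Map<String, String> xmlByHref)
            throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
        transformer.setURIResolver(resolverFor(xmlByHref));
        StringWriter out = new StringWriter();
        transformer.transform(
                new SAXSource(offlineReader(), new InputSource(new StringReader(inputXml))),
                new StreamResult(out));
        return out.toString();
    }
}
```

A fresh reader is created per document here because sharing one XMLReader between documents that may be parsed at overlapping times is not safe.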

How to detect "Invalid character found in text content"

I'm doing XML validation in Java, using SAX, and I'd like to recognize the following kind of error:
"An invalid character was found in text content".
At the moment, I have validation working with SAX, but for some documents corrupted characters are not detected as errors. When I try to open the resulting XML file with the IE browser, for example, I get the error message "an invalid character was found in text content".
This is an example of XML data:
<?xml version='1.0' encoding='UTF-8' standalone='yes'?>
<!DOCTYPE blabla SYSTEM 'blabla.dtd'>
<blabla type='type' num='num'>
<...>... corrupted character </...>
</blabla>
And this is an example of the instantiation of the parser:
SAXParserFactory factory = SAXParserFactory.newInstance();
factory.setValidating(true);
factory.setNamespaceAware(true);
parser = factory.newSAXParser();
parser.setProperty(JAXP_SCHEMA_LANGUAGE, W3C_XML_SCHEMA);
parser.setProperty(JAXP_SCHEMA_SOURCE, new File(theConfig.getRoot()
.concat(File.separator).concat(theConfig.getXsdFileName())
.concat("-v").concat(theConfig.getXsdFileVersion()).concat(
XSD_EXTENSION)));
reader = parser.getXMLReader();
reader.setErrorHandler(getHandler());
reader.setEntityResolver(new MyEntityResolver(theConfig.getRoot(),
theConfig));
InputSource is = new InputSource();
is.setCharacterStream(new StringReader(theDataToParse));
reader.parse(is);
The error handler implements the methods 'warning', 'error' and 'fatalError', but nothing is detected.
The entity resolver makes it possible to load a custom entity file stored in a configuration directory.
Does someone have an idea why such a malformed character error is not detected? Is it because my stream comes from a String and not a file?
Thanks in advance for your help.
Regards.

Yes. Apparently you have already done the byte-to-character conversion, since you are holding the data as a String. If you want to detect the invalid character, you need to parse the bytes. In general, it's not good to hold XML data as String data, because you risk corrupting it through incorrect character encoding. The best way to treat XML is as binary data.
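To illustrate the difference, here is a small sketch (class and method names are hypothetical) that parses the same data once as raw bytes and once after decoding it to a String. It assumes the corruption is a byte sequence that is not valid in the declared encoding; decoding such bytes to a String silently replaces them with U+FFFD, which is a legal XML character, so the parser has nothing left to complain about.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class ByteLevelCheck {
    /** Returns true if the source parses without a fatal error. */
    static boolean parses(InputSource source) {
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(source, new DefaultHandler());
            return true;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            // Malformed byte sequences surface here as a SAX or I/O exception.
            return false;
        }
    }

    /** Lets the parser do the byte-to-character decoding itself. */
    static boolean parsesAsBytes(byte[] raw) {
        return parses(new InputSource(new ByteArrayInputStream(raw)));
    }

    /** Decodes first, as the question's code effectively does, then parses the characters. */
    static boolean parsesAsString(byte[] raw) {
        String decoded = new String(raw, StandardCharsets.UTF_8); // bad bytes become U+FFFD
        return parses(new InputSource(new StringReader(decoded)));
    }
}
```

With a stray byte such as 0xFF inside the text content, the byte-based parse should fail while the String-based parse goes through, which matches the behaviour described in the question.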

Querying an HTML page with XPath in Java

Can anyone recommend a library for Java that allows me to perform an XPath query over an HTML page?
I tried using JAXP but it keeps giving me a strange error that I cannot seem to fix (thread "main" java.io.IOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd).
Thank you very much.
EDIT
I found this:
// Create a new SAX Parser factory
SAXParserFactory factory = SAXParserFactory.newInstance();
// Turn on validation
factory.setValidating(true);
// Create a validating SAX parser instance
SAXParser parser = factory.newSAXParser();
// Create a new DOM Document Builder factory
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
// Turn on validation
factory.setValidating(true);
// Create a validating DOM parser
DocumentBuilder builder = factory.newDocumentBuilder();
from http://www.ibm.com/developerworks/xml/library/x-jaxpval.html, but changing the argument to false did not change anything.

Setting the parser to "non validating" just turns off validation; it does not inhibit fetching of DTDs. Fetching of the DTD is needed not just for validation, but also for entity expansion... as far as I recall.
If you want to suppress fetching of DTDs, you need to register a suitable EntityResolver on the DocumentBuilder (the factory itself has no setEntityResolver method). Implement the EntityResolver's resolveEntity method to always return an InputSource wrapping an empty string.
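A minimal sketch of that approach, assuming the page is already well-formed XHTML (the class name OfflineXPath and the evaluate helper are made up for this example):

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class OfflineXPath {
    /** Parses xml without fetching any DTD and returns the string value of the XPath expression. */
    static String evaluate(String xml, String expression) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setValidating(false);
        DocumentBuilder builder = dbf.newDocumentBuilder();
        // Answer every DTD or entity request with an empty stream instead of going to the network.
        builder.setEntityResolver((publicId, systemId) -> new InputSource(new StringReader("")));
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return XPathFactory.newInstance().newXPath().evaluate(expression, doc);
    }
}
```

One caveat: with an empty DTD, named entities that the real DTD defines (for example &nbsp; in XHTML) are no longer declared, so documents that rely on them need a resolver that returns a local copy of the DTD instead.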

Take a look at this:
http://www.w3.org/2005/06/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic
You probably have the parser set to perform DTD validation, and it is trying to retrieve the DTD. JAXP should have a way to disable DTD validation and just run XPath against a document assumed to be valid. I haven't used JAXP in many years, so I'm sorry I can't be more helpful.

Error accessing w3.org when applying a XSLT

I'm applying an XSLT to an HTML file (already filtered and tidied to make it parseable as XML).
My code looks like this:
TransformerFactory transformerFactory = TransformerFactory.newInstance();
this.xslt = transformerFactory.newTransformer(xsltSource);
xslt.transform(sanitizedXHTML, result);
However, I receive an error like this for every DOCTYPE found:
ERROR: 'Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/html4/loose.dtd'
I have no issue accessing the dtds from my browser.
I have little control over the HTML being parsed, and can't rip the DOCTYPE since I need them for entities.
Any help is welcome.
EDIT:
I tried to disable DTD validation like this:
private Source getSource(StreamSource sanitizedXHTML) throws ParsingException {
    SAXParserFactory spf = SAXParserFactory.newInstance();
    spf.setNamespaceAware(false);
    spf.setValidating(false); // Turn off validation
    XMLReader rdr;
    try {
        rdr = spf.newSAXParser().getXMLReader();
    } catch (SAXException e) {
        throw new ParsingException(e);
    } catch (ParserConfigurationException e) {
        throw new ParsingException(e);
    }
    InputSource inputSrc = new InputSource(sanitizedXHTML.getInputStream());
    return new SAXSource(rdr, inputSrc);
}
and then just calling it...
Source source = getSource(sanitizedXHTML);
xslt.transform(source, result);
The error persists.
EDIT 2:
I wrote an entity resolver and saved the HTML 4.01 Transitional DTD to my local disk. However, I now get this error:
ERROR: 'The declaration for the entity "HTML.Version" must end with '>'.'
The DTD is as is, downloaded from w3.org

I have some suggestions in an answer to a related question.
In particular, when parsing the XML document, you might want to turn DTD validation off, to prevent the parser from trying to fetch the DTD. Alternatively, you might use your own entity resolver to return a local copy of the DTD instead of fetching it over the network.
Edit: Just calling setValidating(false) on the SAX Parser Factory might not be enough to prevent the parser from loading the external DTD. The parser may need the DTD for other purposes, such as entity definitions. (Perhaps you could change your HTML sanitization/preprocessing phase to replace all entity references with the equivalent numeric character entity references, eliminating the need for the DTD?)
I don't think there is a standard SAX feature flag which would ensure that external DTD loading is completely disabled, so you might have to use something specific to your parser. So if you are using Xerces, for example, you might want to look up Xerces-specific features and call setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false) just to be sure.

Assuming you want the DTD loaded (for your entities), you will need to use a resolver. The basic problem you are encountering is that the W3C limits access to the DTD URLs for performance reasons (its servers could not keep up with the traffic otherwise).
You should be working with a local copy of the DTD and using a catalog to handle this. Take a look at the Apache Commons Resolver. If you don't know how to use a catalog, they're well documented in Norm Walsh's article.
Of course, you will have problems if you do validate. That's an SGML DTD and you are trying to use it for XML. This will (probably) not work.
