How to make use of a dtd file to parse xml data? - java

I have more than 1,000 XML files of 2-3 MB each. I am using a DOM parser to parse them. I have now been given a DTD file for every XML file, but I don't know how to use the DTD to improve the parsing.
I did some research and found that a DTD can be used to validate an XML file. If so, how can I validate it in Java? Is there any other use for a DTD file?
Thanks,

To validate the XML against its DTD, you just need to switch the parser factory into validating mode:
DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
domFactory.setValidating(true);
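Turning validation on is only half of it: unless an ErrorHandler is registered, validity errors are merely reported by the parser's default handler and parsing carries on. A minimal sketch of collecting them follows; the class name DtdValidationSketch and the tiny inline DTD are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

/** Hypothetical helper: collects DTD validity errors instead of letting them pass silently. */
final class DtdValidationSketch {

    /** Parses the XML text with DTD validation on and returns the validity errors found. */
    static List<String> validationErrors(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true); // validate against the DTD named in the DOCTYPE

        DocumentBuilder builder = factory.newDocumentBuilder();
        List<String> errors = new ArrayList<>();
        builder.setErrorHandler(new ErrorHandler() {
            @Override public void warning(SAXParseException e) { /* ignored in this sketch */ }
            // Validity errors are recoverable, so they are recorded and parsing continues.
            @Override public void error(SAXParseException e) { errors.add(e.getMessage()); }
            // Well-formedness errors are not recoverable, so they are rethrown.
            @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
        });
        builder.parse(new InputSource(new StringReader(xml)));
        return errors;
    }

    public static void main(String[] args) throws Exception {
        // Assumed example DTD, given as an internal subset so no file is needed.
        String doctype = "<!DOCTYPE note [<!ELEMENT note (to)> <!ELEMENT to (#PCDATA)>]>";
        System.out.println(validationErrors(doctype + "<note><to>Ann</to></note>"));
        System.out.println(validationErrors(doctype + "<note><from>Bob</from></note>"));
    }
}
```

The first document matches the DTD, so its list should come back empty; the second uses an undeclared element, so its list should not.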

Related

How to cache a dtd file when parsing xml in java

I am parsing a few million XML files that are formatted like so:
<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE test-document PUBLIC "-//TEST//TEST DOC//EN" "https://somerandomurl.com/test.dtd">
<test-document>...</test-document>
Every time I parse a file, the same https://somerandomurl.com/test.dtd file is downloaded, which consumes a lot of bandwidth and seems unnecessary. Is there a way to store the file and have my code redirect to my local copy? I can't edit the XML files, so it has to be done in my code. Given the following Java code, what would be a reasonable way to implement such a thing?
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setIgnoringComments(true);
factory.setIgnoringElementContentWhitespace(true);
factory.setValidating(true);
DocumentBuilder builder = factory.newDocumentBuilder();
Document document = builder.parse(new InputSource("file.xml"));//My final document object.
First read the DTD into a string variable.
Then do
builder.setEntityResolver(
(publicId, systemId) -> new InputSource(new StringReader(dtd)));
Or if you want to be more careful, have your EntityResolver check that the systemId and/or publicId are as expected before returning the contents of dtd.
Note that this will still involve parsing the DTD each time; it just saves the cost of fetching it from the network.
Also important: instantiating the XML parser is a significant cost (and instantiating a DocumentBuilderFactory is even more expensive). Make sure you reuse both the factory and the parser.
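Putting the points above together, here is a rough sketch of a resolver that checks the system identifier before handing out the cached DTD, wrapped in an object that keeps one DocumentBuilder for reuse. The class name CachedDtdParser is invented, and the DTD text is assumed to have been loaded once by the caller.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical wrapper: serves one DTD from memory and reuses a single parser. */
final class CachedDtdParser {
    // System identifier taken from the DOCTYPE shown in the question.
    private static final String DTD_SYSTEM_ID = "https://somerandomurl.com/test.dtd";

    private final DocumentBuilder builder;

    /** @param dtdText the DTD contents, read once (for example from a local file) */
    CachedDtdParser(String dtdText) throws ParserConfigurationException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true);
        builder = factory.newDocumentBuilder();
        // resolveEntity receives (publicId, systemId), in that order.
        builder.setEntityResolver((publicId, systemId) ->
                DTD_SYSTEM_ID.equals(systemId)
                        ? new InputSource(new StringReader(dtdText)) // fresh reader per request
                        : null); // null lets the parser fall back to its default lookup
    }

    /** Parses one document with the shared builder; not safe for concurrent use. */
    Document parse(InputSource xml) throws SAXException, IOException {
        return builder.parse(xml);
    }
}
```

A DocumentBuilder is not thread-safe, so with several worker threads one instance of this wrapper per thread would be the cautious choice.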
If you just want to cache downloaded DTD files, the way to go is XML catalogs. In particular, you'd specify, in a resolution rule in a catalog file such as the following,
<catalog
xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
<system
systemId="https://somerandomurl.com/test.dtd"
uri="file:///mydir/test.dtd"/>
</catalog>
that the entity with system identifier https://somerandomurl.com/test.dtd is resolved as the file /mydir/test.dtd which should contain a downloaded local copy of the DTD file linked to by the https: URL.
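On Java 9 and later, a catalog file like this can be plugged into a DocumentBuilder through the javax.xml.catalog API, since CatalogResolver also implements EntityResolver. The following is a sketch under that assumption; the class name CatalogParserSketch is invented, and the catalog location is whatever URI you pass in.

```java
import java.net.URI;
import javax.xml.catalog.CatalogFeatures;
import javax.xml.catalog.CatalogManager;
import javax.xml.catalog.CatalogResolver;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

/** Hypothetical factory for a validating builder that resolves DTDs through a catalog. */
final class CatalogParserSketch {

    /** @param catalogFile absolute URI of the catalog file, e.g. from Path.toUri() */
    static DocumentBuilder newBuilder(URI catalogFile) throws ParserConfigurationException {
        // With the default features, an identifier missing from the catalog is treated
        // as an error instead of silently falling back to the network.
        CatalogResolver resolver =
                CatalogManager.catalogResolver(CatalogFeatures.defaults(), catalogFile);

        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        builder.setEntityResolver(resolver); // CatalogResolver is also an EntityResolver
        return builder;
    }
}
```

Usage would look like CatalogParserSketch.newBuilder(Paths.get("catalog.xml").toUri()), keeping the returned builder around for all files.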
Links
https://www.xml.com/pub/a/2004/03/03/catalogs.html
https://docs.oracle.com/javase/10/core/xml-catalog-api1.htm#JSCOR-GUID-96D2C9AC-641A-4BDB-BB08-9FA04358A6F4
https://www.oasis-open.org/committees/entity/spec-2001-08-06.html#s.system

Which is the best way to fetch values from XML: JAXB or DOM?

Which is the more efficient way to read XML? I'm aware of two ways:
1) JAXB:
By annotating my classes with JAXB annotations, XML can be converted to Java objects and vice versa through unmarshalling and marshalling.
2) DOM:
Using a DOM parser to parse the XML, after which values can be accessed with XPath.
Example of DOM:
File fXmlFile = new File("/Users/link1/input.xml");
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(fXmlFile);
doc.getDocumentElement().normalize();
Given the business requirements, I need the faster and better of the two approaches above. Suggestions and a few tactics would be appreciated.
First question to ask: does your XML always have the same structure, and can this structure be mapped onto a hierarchy of Java objects?
1. If Yes -> use either JAXB or Jackson XmlMapper
2. If No (the structure of your XML varies) -> Do you require random access to the data in your XML with many reads and possibly some writes (after which you convert the data back to XML)?
2.1. If Yes -> use DOM (it is designed for in-memory handling of the XML document tree, but has more overhead)
2.2. If No (more efficient XML parsing) -> Do you need to parse all information in the XML or do you need XML validation?
2.2.1. If Yes -> use SAX (it is included in the JDK and allows for validation)
2.2.2. If No -> use StAX (it is an XML pull parser that allows reading some values in the XML without having to parse the full XML, but it does not offer validation)
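To make the last branch concrete, here is a small sketch of pulling a single value out with StAX and stopping as soon as it is found. The class and method names are invented for the example, and it assumes the wanted element contains only text.

```java
import java.io.StringReader;
import java.util.Optional;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper: reads the text of the first element with a given name. */
final class StaxFirstValueSketch {

    static Optional<String> firstText(String xml, String elementName) throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                // Pull events one at a time; nothing is built in memory.
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && elementName.equals(reader.getLocalName())) {
                    // getElementText only works for text-only elements.
                    return Optional.of(reader.getElementText());
                }
            }
            return Optional.empty();
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<order><id>42</id><note>not needed</note></order>";
        System.out.println(firstText(xml, "id").orElse("(none)"));
    }
}
```

Because the method returns at the first match, the remainder of the document is never pulled through the reader.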

Writing XML according to a DTD

I would like to know if there is a way (particularly, an API) in Java to write an XML document in a SAX-like way (i.e., event by event, unlike JDOM, which I cannot use) that takes a DTD and guarantees that my XML document is being written correctly.
I have been using SAX for parsing, and I have written an XML writer layer myself as if I were writing a plain file (through OutputStreamWriter), but I have seen that my XML writer layer does not always follow the DTD rules.
SAX does not know how to write XML documents; it is intended for parsing them. So you can create the document with any method you like and then validate it against the DTD using the SAX API.
BTW, may I ask why you are limiting yourself to tools that were almost obsolete about 10 years ago? Why not use a higher-level API that converts objects to XML and vice versa, such as JAXB?
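As a rough illustration of "write it however you like, then validate", the sketch below writes the document event by event with the JDK's XMLStreamWriter (StAX, not SAX) and then re-parses the result with a validating SAX parser. The contacts DTD, the class name and the method names are all made up for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: event-style writing followed by DTD validation of the output. */
final class WriteThenValidateSketch {
    // Assumed example DTD, embedded as an internal subset so no file is needed.
    static final String DOCTYPE =
            "<!DOCTYPE contacts [<!ELEMENT contacts (contact+)> <!ELEMENT contact (#PCDATA)>]>";

    /** Writes the document one event at a time, mirroring how SAX reports it. */
    static String write(String... names) throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        writer.writeStartDocument();
        writer.writeDTD(DOCTYPE); // written verbatim
        writer.writeStartElement("contacts");
        for (String name : names) {
            writer.writeStartElement("contact");
            writer.writeCharacters(name); // escapes <, > and & as needed
            writer.writeEndElement();
        }
        writer.writeEndElement();
        writer.writeEndDocument();
        writer.close();
        return out.toString();
    }

    /** Re-parses the text with DTD validation; false on any validity or parse error. */
    static boolean isValid(String xml) throws ParserConfigurationException, IOException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(true);
        try {
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)),
                    new DefaultHandler() {
                        // Promote validity errors to exceptions so they end the parse.
                        @Override public void error(SAXParseException e) throws SAXException {
                            throw e;
                        }
                    });
            return true;
        } catch (SAXException e) {
            return false;
        }
    }
}
```

The writer itself knows nothing about the DTD, so mistakes only surface in the validation pass. For large outputs the same check could be run on the written file instead of an in-memory string.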
The standard DocumentBuilder approach can validate for you.
This snippet is taken from http://www.edankert.com/validate.html#Validate_using_internal_DTD
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setValidating(true);
factory.setNamespaceAware(true);
SchemaFactory schemaFactory =
SchemaFactory.newInstance("http://www.w3.org/2001/XMLSchema");
factory.setSchema(schemaFactory.newSchema(
new Source[] {new StreamSource("contacts.xsd")}));
DocumentBuilder builder = factory.newDocumentBuilder();
builder.setErrorHandler(new SimpleErrorHandler());
Document document = builder.parse(new InputSource("document.xml"));

SAXParseException when &ldquo; is used in XML

I'm getting a "org.xml.sax.SAXParseException; lineNumber: 4; columnNumber: 26; The entity "ldquo" was referenced, but not declared." exception when reading an XML document. I'm reading it as follows:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
InputSource is = new InputSource(new StringReader(xmlBody));
Document document = builder.parse(is);
And then there's an exception on builder.parse(is);
From searching I figured that it is necessary to declare some of those new entities externally; unfortunately, I cannot modify the original XML document.
How do I fix this problem?
Thanks
"From searching I figured that it is necessary to declare some of those new entities externally; unfortunately, I cannot modify the original XML document."
Well, unless the entity is declared, the document isn't well-formed XML and you won't be able to process it with an XML parser.
When you are asked to process input that isn't well-formed XML, the best approach is to fix the process that created the document (the whole idea of using XML for interchange relies on it being well-formed XML). The alternatives are to "repair" the document to turn it into well-formed XML (which you say you can't do), or to forget the fact that it was intended to be XML, and treat it as you would any proprietary non-XML format.
Not a pleasant set of choices - but that's the mess you get into when people pay lip-service to XML but fail to conform to the letter of the standard.
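If repairing an in-memory copy is acceptable even though the original cannot be touched, one way to do the "repair" is to rewrite the undeclared named entities as numeric character references before parsing. This is only a sketch: the class name is invented, the entity table is deliberately tiny and would need extending, and the plain text replacement also touches matches inside comments and CDATA sections.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical pre-processing step for input that uses HTML entity names. */
final class EntityRepairSketch {
    // Assumed, incomplete table: HTML entity name -> Unicode code point.
    private static final Map<String, Integer> NAMED =
            Map.of("ldquo", 0x201C, "rdquo", 0x201D, "nbsp", 0x00A0);

    private static final Pattern ENTITY = Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    /** Replaces known named entities with numeric references; others are left alone. */
    static String repair(String xmlBody) {
        return ENTITY.matcher(xmlBody).replaceAll(match -> {
            Integer codePoint = NAMED.get(match.group(1));
            // Names not in the table (including amp, lt, gt) stay exactly as they were.
            String replacement = codePoint == null ? match.group() : "&#" + codePoint + ";";
            return Matcher.quoteReplacement(replacement);
        });
    }

    /** Parses the repaired copy; the caller's original string is not modified. */
    static Document parse(String xmlBody)
            throws ParserConfigurationException, SAXException, IOException {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(repair(xmlBody))));
    }
}
```

Numeric references need no declaration, so the parser accepts them without a DTD.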
Try
factory.setExpandEntityReferences(false);
This will prevent the parser from trying to expand entities.
EDIT: How about this http://xerces.apache.org/xerces2-j/features.html#dom.create-entity-ref-nodes -- The top of that page has an example of how to set features on the underlying parser. This should cause the parser to create entity-reference DOM nodes instead of trying to expand the entities.

Error accessing w3.org when applying an XSLT

I'm applying an XSLT stylesheet to an HTML file (already filtered and tidied to make it parseable as XML).
My code looks like this:
TransformerFactory transformerFactory = TransformerFactory.newInstance();
this.xslt = transformerFactory.newTransformer(xsltSource);
xslt.transform(sanitizedXHTML, result);
However, I receive an error like this for every DOCTYPE found:
ERROR: 'Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/html4/loose.dtd'
I have no issue accessing the dtds from my browser.
I have little control over the HTML being parsed, and I can't strip the DOCTYPE declarations since I need them for entities.
Any help is welcome.
EDIT:
I tried to disable DTD validation like this:
private Source getSource(StreamSource sanitizedXHTML) throws ParsingException {
SAXParserFactory spf = SAXParserFactory.newInstance();
spf.setNamespaceAware(false);
spf.setValidating(false); // Turn off validation
XMLReader rdr;
try {
rdr = spf.newSAXParser().getXMLReader();
} catch (SAXException e) {
throw new ParsingException(e);
} catch (ParserConfigurationException e) {
throw new ParsingException(e);
}
InputSource inputSrc = new InputSource(sanitizedXHTML.getInputStream());
return new SAXSource(rdr, inputSrc);
}
and then just calling it...
Source source = getSource(sanitizedXHTML);
xslt.transform(source, result);
The error persists.
EDIT 2:
I wrote an entity resolver and put the HTML 4.01 Transitional DTD on my local disk. However, I get this error now:
ERROR: 'The declaration for the entity "HTML.Version" must end with '>'.'
The DTD is used as is, downloaded from w3.org.
I have some suggestions in an answer to a related question.
In particular, when parsing the XML document, you might want to turn DTD validation off, to prevent the parser from trying to fetch the DTD. Alternatively, you might use your own entity resolver to return a local copy of the DTD instead of fetching it over the network.
Edit: Just calling setValidating(false) on the SAX Parser Factory might not be enough to prevent the parser from loading the external DTD. The parser may need the DTD for other purposes, such as entity definitions. (Perhaps you could change your HTML sanitization/preprocessing phase to replace all entity references with the equivalent numeric character entity references, eliminating the need for the DTD?)
I don't think there is a standard SAX feature flag which would ensure that external DTD loading is completely disabled, so you might have to use something specific to your parser. So if you are using Xerces, for example, you might want to look up Xerces-specific features and call setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false) just to be sure.
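Here is roughly how that Xerces feature could be wired into the SAXSource handed to the transformer. It is a sketch: it assumes the parser in use recognises the feature (the JDK's built-in parser is Xerces-derived), it only helps when the document contains no entity references that need the DTD, and the class and method names are invented.

```java
import java.io.Reader;
import java.io.StringWriter;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

/** Hypothetical sketch: a transformer input that never fetches the external DTD. */
final class NoExternalDtdSketch {
    // Xerces-specific feature name quoted in the answer above.
    private static final String LOAD_EXTERNAL_DTD =
            "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    static Source sourceWithoutDtdLoading(Reader xhtml)
            throws ParserConfigurationException, SAXException {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true); // XSLT processing expects namespace-aware events
        spf.setValidating(false);
        spf.setFeature(LOAD_EXTERNAL_DTD, false); // throws if the parser lacks the feature
        XMLReader reader = spf.newSAXParser().getXMLReader();
        return new SAXSource(reader, new InputSource(xhtml));
    }

    /** Identity transform, standing in here for the real stylesheet. */
    static String identityTransform(Source source) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.METHOD, "xml");
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(source, new StreamResult(out));
        return out.toString();
    }
}
```

In the question's code this would replace getSource, with the result passed to xslt.transform(source, result) as before.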
Assuming you want the DTD loaded (for your entities), you will need to use a resolver. The basic problem you are encountering is that the W3C limits access to the URLs for the DTDs for performance reasons (its servers would not perform acceptably otherwise).
Now you should be working with a local copy of the DTD and using a catalog to handle this. Take a look at the Apache Commons Resolver. If you don't know how to use a catalog, they're well documented in Norm Walsh's article.
Of course, you will have problems if you do validate. That is an SGML DTD and you are trying to use it for XML, which will probably not work.
