I am parsing a few million xml files with that are formatted like so:
<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE test-document PUBLIC "-//TEST//TEST DOC//EN" "https://somerandomurl.com/test.dtd">
<test-document>...</test-document>
Every time I am parsing a file the same https://somerandomurl.com/test.dtd file is downloaded and that consumes a lot of bandwidth and seems unnecessary. Is there a way to store the file and have my code redirect my local copy? I can't edit the xml files so it has to be in my code. Given the following java code what would be a reasonable way to implement such a thing?
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setIgnoringComments(true);
factory.setIgnoringElementContentWhitespace(true);
factory.setValidating(true);
DocumentBuilder builder = factory.newDocumentBuilder();
Document document = builder.parse(new InputSource("file.xml"));//My final document object.
First read the DTD into a string variable.
Then do
builder.setEntityResolver(
(sysId, PubId) -> new InputSource(new StringReader(dtd)));
Or if you want to be more careful, have your EntityResolver check that the systemId and/or publicId are as expected before returning the contents of dtd.
Note that this will still involve parsing the DTD each time, it just saves the cost of fetching it from the network.
Also important: instantiating the XML parser is a significant cost (and instantiating a DocumentBuilderFactory is even bigger). Make sure you reuse both the factory and the parser.
If you just want to cache downloaded DTD files, way to go is using XML catalogs. In particular, you'd be specifying, in a resolution rule in a catalog file such as the following
<catalog
Xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
<system
systemId="https://somerandomurl.com/test.dtd"
uri="file://mydir/test.dtd"/>
</catalog>
that the entity with system identifier https://somerandomurl.com/test.dtd is resolved as the file /mydir/test.dtd which should contain a downloaded local copy of the DTD file linked to by the https: URL.
Links
https://www.xml.com/pub/a/2004/03/03/catalogs.html
https://docs.oracle.com/javase/10/core/xml-catalog-api1.htm#JSCOR-GUID-96D2C9AC-641A-4BDB-BB08-9FA04358A6F4
https://www.oasis-open.org/committees/entity/spec-2001-08-06.html#s.system
Related
Which one is the efficient way for reading xml. I'm aware of two ways:
1)JAXB:
By annotating my classes with jaxb annotation we get the xml in java object vice versa using Marshalling & Unmarshalling of object.
2)DOM:
Using dom parser for parsing the xml and using xpath values from xml can be accessed.
Example of DOM:
File fXmlFile = new File("/Users/link1/input.xml");
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(fXmlFile);
doc.getDocumentElement().normalize();
As per the business demands, I'm expecting to use the fastest way and the better way between the above two. Suggestions and few tactics would be appreciated.
First question to ask: does your XML always have the same structure and can this structure be mapped on a hierarchy of Java objects?
If Yes -> either use JAXB or Jackson XmlMapper
If No (the structure of your XML varies) -> Do you require random access to the data in your XML with many reads and possibly some writes (after which you convert the data back to XML)?
2.1. If Yes -> use DOM (It is designed for in memory handling of the XML Document Tree, but has more overhead)
2.2. If No (more efficient XML parsing) -> Do you need to parse all information in the XML or do you need XML validation?
2.2.1 If Yes -> use SAX (it is included in the JDK and allows for validation)
2.2.2 If No -> use StAX (it is an XML pull parser that allows reading some values in the XML without having to parse the full XML, but it does not offer validation.)
I have an input XML that has entities declared in it. It looks something like this:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE doctype PUBLIC "desc" "DTD.dtd" [
<!ENTITY SLSD_68115_jpg SYSTEM "68115.jpg" NDATA JPEG>
]>
The DTD.dtd file contains the neccessary notation:
<!NOTATION JPEG SYSTEM "JPG" >
During XSLT transformation I would like to get the URI declared in the entity using the name 'SLSD_68115_jpg' like so:
<xsl:value-of select="unparsed-entity-uri('SLSD_68115_jpg')"/>
So that it would return something like "68115.jpg".
The problem is that it always returns an empty string. There is no way for me to modify the input xml. I understand that this could be a common problem from what I found on the internet, but i haven't found any final conclusions, solutions or alternatives to this problem.
It might be important to note that I had a problem before since I am using a StreamSource and things like systemId had to be set manually, I think this is where the problem might be hidden. It's like the transformator is unable to resolve the entity with given id.
I'm using Xalan, I probably need to provide more details but I'm not sure what to add, I'll answer any questions is there are any.
Any help would be greatly appretiated.
I found out why the "unparsed-entity-uri" was unable to resolve the declared entities. This might be a special case, but I will post this solution so it might save someone else a lot of time.
I'm (very) new to XSLT. The xsl file I got to work with however as a student was pretty extreme with multiple import statements and files containing more than 5K lines of code.
Simply by the time I got to the point where I needed the entities the transformator used a different document that was essentially the sub document of the original one, which is okay, but for example the entity declarations are not passed to the sub document. Therefore there is no way for me to use the entities from that point beyond.
Now like I said im new to XSLT but I think that for example lines like this can cause the problem:
<xsl:apply-templates select="exslt:node-set($nodelist)"/>
Because after this, entity references are no bueno.
If this was trivial then my apologies for waisting your time.
Thanks to everyone none the less!
Instead of a StreamSource, try a SAXSource configured with a validating parser:
SAXParserFactory spf = SAXParserFactory.newInstance();
spf.setValidating(true);
spf.setNamespaceAware(true);
XMLReader xmlr = spf.newSAXParser().getXMLReader();
InputSource input = new InputSource(
new File("/path/to/file.xml").toURI().toString());
// if you already have an InputStream/Reader then do
// input.setByteStream or input.setCharacterStream as appropriate
SAXSource source = new SAXSource(xmlr, input);
Or you can use a DOMSource in the same way
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setValidating(true);
dbf.setNamespaceAware(true);
File f = new File("/path/to/file.xml");
Document doc = dbf.newDocumentBuilder().parse(f);
DOMSource source = new DOMSource(doc, f.toURI().toString());
I have more than a 1000 xml files with size in 2-3MBs each. I am using a DOM parser for parsing the xml. Now I have been provided with dtd file for every xml file. I dont know how to use the dtd for better parsing.
I did my side of research and found that dtd can be used for validation of an xml file. If so how can I validate it in Java? Is there any other use of a dtd file?
Thanks,
To validate the xml with DTD you just need to set the validator on true mode.
How to do that?
DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
domFactory.setValidating(true);
I would like to know if there is a way (particularly, an API), in Java, to write a XML in a SAX-like way (i.e., event-like way, differently from JDOM, which I cannot use) that takes a DTD and guarantees that my XML document is being correctly written.
I have been using SAX for parsing and I have written a XML writer layer by myself as if I were writing a plain file (through OutputStreamWriter), but I have seen that my XML writer layer is not always following the DTD rules.
SAX does not know to write XML documents. It is attended to parse them. So, you can choose any method you want to create document and then validate it using SAX API against DTD.
BTW may I ask you why are you limiting yourself to using tools that were almost obsolete about 10 years ago? Why not to use higher level API that converts objects to XML and vice versa? For example JAXB.
The Standard DocumentBuilder methodology can validate for you.
This snippet taken from http://www.edankert.com/validate.html#Validate_using_internal_DTD
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setValidating(true);
factory.setNamespaceAware(true);
SchemaFactory schemaFactory =
SchemaFactory.newInstance("http://www.w3.org/2001/XMLSchema");
factory.setSchema(schemaFactory.newSchema(
new Source[] {new StreamSource("contacts.xsd")}));
DocumentBuilder builder = factory.newDocumentBuilder();
builder.setErrorHandler(new SimpleErrorHandler());
Document document = builder.parse(new InputSource("document.xml"));
I'm getting a "org.xml.sax.SAXParseException; lineNumber: 4; columnNumber: 26; The entity "ldquo" was referenced, but not declared." exception when reading an XML document. I'm reading it as follows:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
InputSource is = new InputSource(new StringReader(xmlBody));
Document document = builder.parse(is);
And then there's an exception on builder.parse(is);
From searching I figured that it is necessary to declare some of those new entities externally, unfortunately, I cannot modify the original XML document.
How do I fix this problem?
Thanks
From searching I figured that it is necessary to declare some of those new entities externally, unfortunately, I cannot modify the original XML document.
Well, unless you declare the entity then the document isn't XML and you won't be able to process it using an XML parser.
When you are asked to process input that isn't well-formed XML, the best approach is to fix the process that created the document (the whole idea of using XML for interchange relies on it being well-formed XML). The alternatives are to "repair" the document to turn it into well-formed XML (which you say you can't do), or to forget the fact that it was intended to be XML, and treat it as you would any proprietary non-XML format.
Not a pleasant set of choices - but that's the mess you get into when people pay lip-service to XML but fail to conform to the letter of the standard.
Try
factory.setExpandEntityReferences(false);
This will prevent the parser from trying to expand entities.
EDIT: How about this http://xerces.apache.org/xerces2-j/features.html#dom.create-entity-ref-nodes -- The top of that page has an example of how to set features on the underlying parser. This should cause the parser to create entity-reference DOM nodes instead of trying to expand the entities.