XSLT: Getting URI of a declared entity - java

I have an input XML that has entities declared in it. It looks something like this:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE doctype PUBLIC "desc" "DTD.dtd" [
<!ENTITY SLSD_68115_jpg SYSTEM "68115.jpg" NDATA JPEG>
]>
The DTD.dtd file contains the neccessary notation:
<!NOTATION JPEG SYSTEM "JPG" >
During XSLT transformation I would like to get the URI declared in the entity using the name 'SLSD_68115_jpg' like so:
<xsl:value-of select="unparsed-entity-uri('SLSD_68115_jpg')"/>
So that it would return something like "68115.jpg".
The problem is that it always returns an empty string. There is no way for me to modify the input xml. I understand that this could be a common problem from what I found on the internet, but i haven't found any final conclusions, solutions or alternatives to this problem.
It might be important to note that I had a problem before since I am using a StreamSource and things like systemId had to be set manually, I think this is where the problem might be hidden. It's like the transformator is unable to resolve the entity with given id.
I'm using Xalan, I probably need to provide more details but I'm not sure what to add, I'll answer any questions is there are any.
Any help would be greatly appretiated.

I found out why the "unparsed-entity-uri" was unable to resolve the declared entities. This might be a special case, but I will post this solution so it might save someone else a lot of time.
I'm (very) new to XSLT. The xsl file I got to work with however as a student was pretty extreme with multiple import statements and files containing more than 5K lines of code.
Simply by the time I got to the point where I needed the entities the transformator used a different document that was essentially the sub document of the original one, which is okay, but for example the entity declarations are not passed to the sub document. Therefore there is no way for me to use the entities from that point beyond.
Now like I said im new to XSLT but I think that for example lines like this can cause the problem:
<xsl:apply-templates select="exslt:node-set($nodelist)"/>
Because after this, entity references are no bueno.
If this was trivial then my apologies for waisting your time.
Thanks to everyone none the less!

Instead of a StreamSource, try a SAXSource configured with a validating parser:
SAXParserFactory spf = SAXParserFactory.newInstance();
spf.setValidating(true);
spf.setNamespaceAware(true);
XMLReader xmlr = spf.newSAXParser().getXMLReader();
InputSource input = new InputSource(
new File("/path/to/file.xml").toURI().toString());
// if you already have an InputStream/Reader then do
// input.setByteStream or input.setCharacterStream as appropriate
SAXSource source = new SAXSource(xmlr, input);
Or you can use a DOMSource in the same way
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setValidating(true);
dbf.setNamespaceAware(true);
File f = new File("/path/to/file.xml");
Document doc = dbf.newDocumentBuilder().parse(f);
DOMSource source = new DOMSource(doc, f.toURI().toString());

Related

SAXParseException when “ is used in XML

I'm getting a "org.xml.sax.SAXParseException; lineNumber: 4; columnNumber: 26; The entity "ldquo" was referenced, but not declared." exception when reading an XML document. I'm reading it as follows:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
InputSource is = new InputSource(new StringReader(xmlBody));
Document document = builder.parse(is);
And then there's an exception on builder.parse(is);
From searching I figured that it is necessary to declare some of those new entities externally, unfortunately, I cannot modify the original XML document.
How do I fix this problem?
Thanks
From searching I figured that it is necessary to declare some of those new entities externally, unfortunately, I cannot modify the original XML document.
Well, unless you declare the entity then the document isn't XML and you won't be able to process it using an XML parser.
When you are asked to process input that isn't well-formed XML, the best approach is to fix the process that created the document (the whole idea of using XML for interchange relies on it being well-formed XML). The alternatives are to "repair" the document to turn it into well-formed XML (which you say you can't do), or to forget the fact that it was intended to be XML, and treat it as you would any proprietary non-XML format.
Not a pleasant set of choices - but that's the mess you get into when people pay lip-service to XML but fail to conform to the letter of the standard.
Try
factory.setExpandEntityReferences(false);
This will prevent the parser from trying to expand entities.
EDIT: How about this http://xerces.apache.org/xerces2-j/features.html#dom.create-entity-ref-nodes -- The top of that page has an example of how to set features on the underlying parser. This should cause the parser to create entity-reference DOM nodes instead of trying to expand the entities.

how can I tell xalan NOT to validate XML retrieved using the "document" function?

Yesterday Oracle decided to take down java.sun.com for a while. This screwed things up for me because xalan tried to validate some XML but couldn't retrieve the properties.dtd.
I'm using xalan 2.7.1 to run some XSL transforms, and I don't want it to validate anything.
so tried loading up the XSL like this:
SAXParserFactory spf = SAXParserFactory.newInstance();
spf.setNamespaceAware(true);
spf.setValidating(false);
XMLReader rdr = spf.newSAXParser().getXMLReader();
Source xsl = new SAXSource(rdr, new InputSource(xslFilePath));
Templates cachedXSLT = factory.newTemplates(xsl);
Transformer transformer = cachedXSLT.newTransformer();
transformer.transform(xmlSource, result);
in the XSL itself, I do something like this:
<xsl:variable name="entry" select="document(concat($prefix, $locale_part, $suffix))/properties/entry[#key=$key]"/>
The XML this code retrieves has the following definition at the top:
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
<entry key="...
Despite the java code above instructing the parser to NOT VALIDATE, it still sends a request to java.sun.com. While java.sun.com is unavailable, this makes the transform fail with the message:
Can not load requested doc: http://java.sun.com/dtd/properties.dtd
How do I get xalan to stop trying to validate the XML loaded from the "document" function?
The documentation mentions that the parser may read the DTDs even if not validating, as it may become necessary to use the DTD to resolve (expand) entities.
Since I don't have control over the XML documents, nont's option of modifying the XML was not available to me.
I managed to shut down attempts to pull in DTD documents by sabotaging the resolver, as follows.
My code uses a DocumentBuilder to return a Document (= DOM) but the XMLReader as per the OP's example also has a method setEntityResolver so the same technique should work with that.
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setValidating(false); // turns off validation
factory.setSchema(null); // turns off use of schema
// but that's *still* not enough!
builder = factory.newDocumentBuilder();
builder.setEntityResolver(new NullEntityResolver()); // swap in a dummy resolver
return builder().parse(xmlFile);
Here, now, is my fake resolver: It returns an empty InputStream no matter what's asked of it.
/** my resolver that doesn't */
private static class NullEntityResolver implements EntityResolver {
public InputSource resolveEntity(String publicId, String systemId)
throws SAXException, IOException {
// Message only for debugging / if you care
System.out.println("I'm asked to resolve: " + publicId + " / " + systemId);
return new InputSource(new ByteArrayInputStream(new byte[0]));
}
}
Alternatively, your fake resolver could return streams of actual documents read as local resources or whatever.
Be aware that disabling DTD loading will cause parsing to fail if the DTD defines any entities that your XML file depends on. That said, to disable DTD loading try this, which assumes you're using the default Xerces that ships with Java.
/*
* Instantiate the SAXParser and set the features to prevent loading of an external DTD
*/
SAXParser sp = SAXParserFactory.newInstance().newSAXParser();
XMLReader xrdr = sp.getXMLReader();
xrdr.setFeature("http://xml.org/sax/features/validation", false);
xrdr.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
If you really need the DTD, then the other alternative is to implement a local XML catalog
/*
* Instantiate the SAXParser and add catalog support
*/
SAXParser sp = SAXParserFactory.newInstance().newSAXParser();
XMLReader xrdr = sp.getXMLReader();
CatalogResolver cr = new CatalogResolver();
xrdr.setEntityResolver(cr);
To which you will have to provide the appropriate DTDs and an XML catalog definition. This Wikipedia Article and this article were helpful.
CatalogResolver looks at the system property xml.catalog.files to determine what catalogs to load.
Try using setFeature on SAXParserFactory.
Try this:
SAXParserFactory spf = SAXParserFactory.newInstance();
spf.setValidating(false);
spf.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
I think that should be enough, otherwise try setting a few other features:
spf.setFeature("http://xml.org/sax/features/validation", false);
spf.setFeature("http://xml.org/sax/features/external-general-entities", false);
spf.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
spf.setFeature("http://apache.org/xml/features/nonvalidating/load-dtd-grammar", false);
I just ended up stripping the doctype declaration out of the XML, because nothing else worked. When I get around to it, I'll try this: http://www.sagehill.net/docbookxsl/UseCatalog.html#UsingCatsXalan
Sorry for necroposting, but I have found a solution which actually works and decided I should share it.
1.
For some reason, setValidating(false) doesn't work. In some cases, it still downloads external DTD files. To prevent this, you should attach a custom EntityResolver as advised here:
XMLReader rdr = spf.newSAXParser().getXMLReader();
rdr.setEntityResolver(new MyCustomEntityResolver());
The EntityResolver will be called for every external entity request. Returning null will not work because the framework will still download the file from the Internet after that. Instead, you can return an empty stream which is a valid DTD, as advised here:
private class MyCustomEntityResolver implements EntityResolver {
public InputSource resolveEntity(String publicId, String systemId) {
return new InputSource(new StringReader(""));
}
}
2.
You are telling setValidating(false) to the SAX parser which reads your XSLT code. That is, it will not validate your XSLT. When it encounters a document() function, it loads the linked XML file using another parser which still validates it, and also downloads external entities. To handle this, you should attach a custom URIResolver to the transformer:
Transformer transformer = cachedXSLT.newTransformer();
transformer.setURIResolver(new MyCustomURIResolver());
The transformer will call your URIResolver implementation when it encounters the document() function. Your implementation will have to return a Source for the passed URI. The simplest thing is to return a StreamSource as advised here. But in your case you should parse the document yourself, preventing validation and external requests using the customized SAXParser you already have (or create a new one each time).
private class MyCustomURIResolver implements URIResolver {
public Source resolve(String href, String base) {
return new SAXSource(rdr,new InputSource(href));
}
}
So you will have to implement two custom interfaces in your code.

Parsing an XML file without root in Java

I have this XML file which doesn't have a root node. Other than manually adding a "fake" root element, is there any way I would be able to parse an XML file in Java? Thanks.
I suppose you could create a new implementation of InputStream that wraps the one you'll be parsing from. This implementation would return the bytes of the opening root tag before the bytes from the wrapped stream and the bytes of the closing root tag afterwards. That would be fairly simple to do.
I may be faced with this problem too. Legacy code, eh?
Ian.
Edit: You could also look at java.io.SequenceInputStream which allows you to append streams to one another. You would need to put your prefix and suffix in byte arrays and wrap them in ByteArrayInputStreams but it's all fairly straightforward.
Your XML document needs a root xml element to be considered well formed. Without this you will not be able to parse it with an xml parser.
One way is to provide your own dummy wrapper without touching the original 'xml' (the not well formed 'xml') Need the word for that:
Syntax
<!DOCTYPE some_root_elem SYSTEM "/home/ego/some.dtd"
[
<!ENTITY entity-name "Some value to be inserted at the entity">
]
Example:
<!DOCTYPE dummy [
<!ENTITY data SYSTEM "http://wherever-my-data-is">
]>
<dummy>
&data;
</dummy>
You could use another parser like Jsoup. It can parse XML without a root.
I think even if any API would have an option for this, it will only return you the first node of the "XML" which will look like a root and discard the rest.
So the answer is probably to do it yourself. Scanner or StringTokenizer might do the trick.
Maybe some html parsers might help, they are usually less strict.
Here's what I did:
There's an old java.io.SequenceInputStream class, which is so old that it takes Enumeration rather than List or such.
With it, you can prepend and append the root element tags (<div> and </div> in my case) around your no-root XML stream. (You shouldn't do it by concatenating Strings due to performance and memory reasons.)
public void tryExtractHighestHeader(ParserContext context)
{
String xhtmlString = context.getBody();
if (xhtmlString == null || "".equals(xhtmlString))
return;
// The XHTML needs to be wrapped, because it has no root element.
ByteArrayInputStream divStart = new ByteArrayInputStream("<div>".getBytes(StandardCharsets.UTF_8));
ByteArrayInputStream divEnd = new ByteArrayInputStream("</div>".getBytes(StandardCharsets.UTF_8));
ByteArrayInputStream is = new ByteArrayInputStream(xhtmlString.getBytes(StandardCharsets.UTF_8));
Enumeration<InputStream> streams = new IteratorEnumeration(Arrays.asList(new InputStream[]{divStart, is, divEnd}).iterator());
try (SequenceInputStream wrapped = new SequenceInputStream(streams);) {
DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = builderFactory.newDocumentBuilder();
Document xmlDocument = builder.parse(wrapped);
From here you can do whatever you like, but keep in mind the extra element.
XPath xPath = XPathFactory.newInstance().newXPath();
}
catch (Exception e) {
throw new RuntimeException("Failed parsing XML: " + e.getMessage());
}
}

Error accessing w3.org when applying a XSLT

I'm applying a xslt to a HTML file (already filtered and tidied to make it parseable as XML).
My code looks like this:
TransformerFactory transformerFactory = TransformerFactory.newInstance();
this.xslt = transformerFactory.newTransformer(xsltSource);
xslt.transform(sanitizedXHTML, result);
However, I receive error for every doctype found like this:
ERROR: 'Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/html4/loose.dtd'
I have no issue accessing the dtds from my browser.
I have little control over the HTML being parsed, and can't rip the DOCTYPE since I need them for entities.
Any help is welcome.
EDIT:
I tried to disable DTD validation like this:
private Source getSource(StreamSource sanitizedXHTML) throws ParsingException {
SAXParserFactory spf = SAXParserFactory.newInstance();
spf.setNamespaceAware(false);
spf.setValidating(false); // Turn off validation
XMLReader rdr;
try {
rdr = spf.newSAXParser().getXMLReader();
} catch (SAXException e) {
throw new ParsingException(e);
} catch (ParserConfigurationException e) {
throw new ParsingException(e);
}
InputSource inputSrc = new InputSource(sanitizedXHTML.getInputStream());
return new SAXSource(rdr, inputSrc);
}
and then just calling it...
Source source = getSource(sanitizedXHTML);
xslt.transform(source, result);
The error persists.
EDIT 2:
Wrote a entity resolver, and got HTML 4.01 Transitional DTD on my local disk. However, I get this error now:
ERROR: 'The declaration for the entity "HTML.Version" must end with '>'.'
The DTD is as is, downloaded from w3.org
I have some suggestions in an answer to a related question.
In particular, when parsing the XML document, you might want to turn DTD validation off, to prevent the parser from trying to fetch the DTD. Alternatively, you might use your own entity resolver to return a local copy of the DTD instead of fetching it over the network.
Edit: Just calling setValidating(false) on the SAX Parser Factory might not be enough to prevent the parser from loading the external DTD. The parser may need the DTD for other purposes, such as entity definitions. (Perhaps you could change your HTML sanitization/preprocessing phase to replace all entity references with the equivalent numeric character entity references, eliminating the need for the DTD?)
I don't think there is a standard SAX feature flag which would ensure that external DTD loading is completely disabled, so you might have to use something specific to your parser. So if you are using Xerces, for example, you might want to look up Xerces-specific features and call setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false) just to be sure.
Assuming you want the DTD loaded (for your entities), you will need to use a resolver. The basic problem that you are encountering is that the W3C limits access to the urls for the DTDs for performance reasons (they don't get any performance if they don't).
Now you should be working with a local copy of the DTD and using a catalog to handle this. You should take a look at the Apache Commons Resolver. If you don't know how to use a catalog, they're well documented in Norm Walsh's article
Of course, you will have problems if you do validate. That's an SGML DTD and you are trying to use it for XML. This will not work (probably)

How to read an XML file with Java?

I don't need to read complex XML files. I just want to read the following configuration file with a simplest XML reader
<config>
<db-host>localhost</db-host>
<db-port>3306</db-port>
<db-username>root</db-username>
<db-password>root</db-password>
<db-name>cash</db-name>
</config>
How to read the above XML file with a XML reader through Java?
I like jdom:
SAXBuilder parser = new SAXBuilder();
Document docConfig = parser.build("config.xml");
Element elConfig = docConfig.getRootElement();
String host = elConfig.getChildText("host");
Since you want to parse config files, I think commons-configuration would be the best solution.
Commons Configuration provides a generic configuration interface which enables a Java application to read configuration data from a variety of sources (including XML)
You could use a simple DOM parser to read the xml representation.
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
dom = db.parse("config.xml");
If you just need a simple solution that's included with the Java SDK (since 5.0), check out the XPath package. I'm sure others perform better, but this was all I needed. Here's an example:
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;
...
try {
XPath xpath = XPathFactory.newInstance().newXPath();
InputSource inputSource = new InputSource("strings.xml");
// result will equal "Save My Changes" (see XML below)
String result = xpath.evaluate("//string", inputSource);
}
catch(XPathExpressionException e) {
// do something
}
strings.xml
<?xml version="1.0" encoding="utf-8"?>
<resources>
<string name="saveLabel">Save My Changes</string>
</resources>
There are several XML parsers for Java. One I've used and found particularly developer friendly is JDOM. And by developer friendly, I mean "java oriented" (i.e., you work with objects in your program), instead of "document oriented", as some other tools are.
I would recommend Commons Digester, which allows you to parse a file without writing reams of code. It uses a series of rules to determine what action is should perform when encountering a given element or attribute (a typical rule might be to create a particular business object).
For a similar use case in my application I used JaxB. With Jaxb, reading XML files is like interacting with Java POJOs. But to use JAXB you need to have the xsd for this xml file. You can look for more info here
If you want to be able to read and write objects to XML directly, you can use XStream
Although I have not tried XPath yet as it has just come to my attention now, I have tried a few solutions and have not found anything that works for this scenario.
I decided to make a library that fulfills this need as long as you are working under the assumptions mentioned in the readme. It has the advantage that it uses SAX to parse the entire file and return it to the user in the form of a map so you can lookup values as key -> value.
https://github.com/daaso-consultancy/ConciseXMLParser
If something is missing kindly inform me of the missing item as I only develop it based on the needs of others and myself.

Categories

Resources