I'm getting a "org.xml.sax.SAXParseException; lineNumber: 4; columnNumber: 26; The entity "ldquo" was referenced, but not declared." exception when reading an XML document. I'm reading it as follows:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
InputSource is = new InputSource(new StringReader(xmlBody));
Document document = builder.parse(is);
And then there's an exception on builder.parse(is);
From searching I figured that it is necessary to declare some of those new entities externally, unfortunately, I cannot modify the original XML document.
How do I fix this problem?
Thanks
From searching I figured that it is necessary to declare some of those new entities externally, unfortunately, I cannot modify the original XML document.
Well, unless the entity is declared, the document isn't well-formed XML, and you won't be able to process it using an XML parser.
When you are asked to process input that isn't well-formed XML, the best approach is to fix the process that created the document (the whole idea of using XML for interchange relies on it being well-formed XML). The alternatives are to "repair" the document to turn it into well-formed XML (which you say you can't do), or to forget the fact that it was intended to be XML, and treat it as you would any proprietary non-XML format.
Not a pleasant set of choices - but that's the mess you get into when people pay lip-service to XML but fail to conform to the letter of the standard.
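If the source file itself cannot be changed, one way to "repair" the input is to patch the string in memory just before parsing. The sketch below assumes the only offending entities are the HTML-style &ldquo; and &rdquo;. The class and method names are made up for illustration, and a plain text replacement like this would also alter matching text inside CDATA sections.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
class EntityRepair {
    // Swap HTML-only named entities for numeric character references,
    // which an XML parser accepts without any DTD declaration.
    // Assumption: only &ldquo; and &rdquo; occur; extend the list as needed.
    static String repair(String xml) {
        return xml.replace("&ldquo;", "&#8220;")
                  .replace("&rdquo;", "&#8221;");
    }

    // Same parsing steps as in the question, applied to the repaired string.
    static Document parse(String xmlBody) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(repair(xmlBody))));
    }
}
```

Calling EntityRepair.parse(xmlBody) in place of builder.parse(is) leaves the stored document untouched; only the in-memory copy is altered.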
Try
factory.setExpandEntityReferences(false);
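This asks the parser to leave entity references unexpanded in the DOM. Note that it may not help with an entity that is never declared, since the parser can still report that as a fatal error.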
EDIT: How about this http://xerces.apache.org/xerces2-j/features.html#dom.create-entity-ref-nodes -- The top of that page has an example of how to set features on the underlying parser. This should cause the parser to create entity-reference DOM nodes instead of trying to expand the entities.
I need to send a request to a SOAP web service that requires the root tag to contain XML data. This is the XML that I'm trying to send:
<root>&lt;test key=&quot;Applicants&quot;&gt;this is a data&lt;/test&gt;</root>
I need to append this to the SoapBody object as a document with this code:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
factory.setExpandEntityReferences(false);
DocumentBuilder builder = factory.newDocumentBuilder();
Document result = builder.parse(new ByteArrayInputStream(request.getRequest().getBytes()));
Then adding it to the SoapBody to be sent to the webservice.
However, upon sending this request and tracing the logs, it's actually reverting the &quot; character to literal quotes (")
This is the xml being sent:
<root>&lt;test key="Applicants"&gt;this is a data&lt;/test&gt;</root>
As you can see, the &quot; is being transformed to literal quotes. How can I keep the original data within the root tag (which has the &quot;)? It seems to be transformed when I convert it to a Document object.
Would appreciate any help. Thanks.
Edit:
The webservice actually requires this format (from their documentation and sample xml requests). If this isn't possible, is it a limitation? Should I use another framework?
The &quot; and " are completely equivalent in this context. You haven't actually said whether this is causing a problem: if it is, then it's because some recipient of the XML isn't processing it correctly. Incidentally, it would also be legitimate to convert the &gt; to >.
When you parse XML and re-serialise it, irrelevant details like redundant whitespace get lost - just as if you copy this text into your text editor, the line-wrapping and font size get lost.
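A small sketch of that equivalence: the two spellings of the quote character produce the same text once parsed, so a conforming recipient cannot tell them apart. The class and method names are illustrative only.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of any library.
class QuoteEquivalence {
    // Parses the XML and returns the text content of its root element.
    static String rootText(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }
}
```

Passing the same root element once with &quot; and once with a literal " in its text should give equal strings from rootText, which is why a serializer is free to pick either form.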
Which is the more efficient way to read XML? I'm aware of two ways:
1) JAXB:
By annotating my classes with JAXB annotations, XML can be converted to Java objects and vice versa through unmarshalling and marshalling.
2) DOM:
Using a DOM parser to parse the XML, with XPath used to access values from it.
Example of DOM:
File fXmlFile = new File("/Users/link1/input.xml");
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(fXmlFile);
doc.getDocumentElement().normalize();
Given the business requirements, I want to use whichever of the two is faster and better. Suggestions and a few tips would be appreciated.
First question to ask: does your XML always have the same structure, and can this structure be mapped onto a hierarchy of Java objects?
1. If Yes -> use either JAXB or Jackson XmlMapper
2. If No (the structure of your XML varies) -> Do you require random access to the data in your XML with many reads and possibly some writes (after which you convert the data back to XML)?
2.1. If Yes -> use DOM (it is designed for in-memory handling of the XML document tree, but has more overhead)
2.2. If No (more efficient XML parsing) -> Do you need to parse all information in the XML or do you need XML validation?
2.2.1. If Yes -> use SAX (it is included in the JDK and allows for validation)
2.2.2. If No -> use StAX (it is an XML pull parser that allows reading some values in the XML without having to parse the full XML, but it does not offer validation)
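To make the last branch concrete, here is a minimal StAX sketch that pulls a single value and stops as soon as it is found. It assumes the wanted element contains only text; the class name, method name, and sample element names are illustrative.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of any library.
class StaxFirstValue {
    // Returns the text of the first element with the given local name,
    // or null if no such element exists.
    static String firstText(String xml, String elementName) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals(elementName)) {
                    // Assumes a text-only element; we return without
                    // processing the remainder of the document.
                    return reader.getElementText();
                }
            }
            return null;
        } finally {
            reader.close();
        }
    }
}
```

For example, StaxFirstValue.firstText(xml, "id") on an order document would hand back the id without walking the item list that follows it.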
I'm trying to follow the UniversalNamespaceResolver example from http://www.ibm.com/developerworks/xml/library/x-nmspccontext/index.html for resolving namespaces during XPath evaluation against an XML document. The problem I've encountered is that the lookupNamespaceURI call below returns null for the XML given below:
DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = domFactory.newDocumentBuilder();
Document dDoc = builder.parse(new InputSource(new StringReader(xml)));
String nsURI = dDoc.lookupNamespaceURI("h");
the XML:
<?xml version="1.0"?>
<h:root xmlns:h="http://www.w3.org/TR/html4/">
<h:table>
<h:tr>
<h:td>Apples</h:td>
<h:td>Bananas</h:td>
</h:tr>
</h:table>
</h:root>
while I'd expect it to return "http://www.w3.org/TR/html4/".
When configuring a DocumentBuilder, you have to explicitly make it namespace-aware (a silly relic from the early days of XML, when there were no namespaces):
domFactory.setNamespaceAware(true);
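Put together with the code from the question, a minimal sketch looks like this. The wrapper class and its method are made up for illustration; the JAXP calls are the ones already used above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical wrapper around the question's code.
class NamespaceAwareLookup {
    // Parses the XML with the given namespace-awareness setting and
    // looks up the URI bound to the prefix.
    static String lookup(String xml, String prefix, boolean namespaceAware) throws Exception {
        DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
        domFactory.setNamespaceAware(namespaceAware); // the factory default is false
        Document dDoc = domFactory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return dDoc.lookupNamespaceURI(prefix);
    }
}
```

With namespaceAware set to true, looking up "h" on the sample document should yield the declared URI rather than null.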
As a side note, the advice in that article is not very good. It fundamentally misses the point that you don't care what the namespace prefixes are in the actual document; they are irrelevant. You need the XPath namespace resolver to match the XPath expressions that you are using, and that is all. If you do what they are suggesting, you will have to change your XPath code whenever the document's prefixes change, which is a horrible idea.
Note that they sort of concede this point in their last bullet, but the rest of the article seems to miss that this is the fundamental idea when using XPath.
But if you don't have control over the XML file, and someone can send you any prefixes they wish, it might be better to be independent of their choices. You can code your own namespace resolution as in Example 1 (HardcodedNamespaceResolver), and use them in your XPath expressions.
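A rough sketch of that idea, in the spirit of the article's hardcoded resolver but not copied from it: the prefix "x" belongs to the XPath expression only, and is bound to the namespace URI regardless of which prefix the document happens to use. The class name and the choice of "x" are assumptions for illustration.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical example class, not taken from the article.
class FixedPrefixXPath {
    static final String HTML4_NS = "http://www.w3.org/TR/html4/";

    // Binds the prefix used in our own XPath expressions ("x") to the URI.
    static final NamespaceContext CONTEXT = new NamespaceContext() {
        @Override public String getNamespaceURI(String prefix) {
            return "x".equals(prefix) ? HTML4_NS : XMLConstants.NULL_NS_URI;
        }
        @Override public String getPrefix(String uri) {
            return HTML4_NS.equals(uri) ? "x" : null;
        }
        @Override public Iterator<String> getPrefixes(String uri) {
            String p = getPrefix(uri);
            return p == null ? Collections.<String>emptyIterator()
                             : Collections.singleton(p).iterator();
        }
    };

    // Evaluates the expression against a namespace-aware parse of the XML.
    static String evaluate(String xml, String expression) throws Exception {
        DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
        f.setNamespaceAware(true);
        Document doc = f.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(CONTEXT);
        return xpath.evaluate(expression, doc);
    }
}
```

An expression such as /x:root/x:table/x:tr/x:td[1] then works whether the sender wrote h:td, foo:td, or used a default namespace, as long as the URI is the same.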
I have to write some code to handle reading and validating XML documents that use a version attribute in their root element to declare a version number, like this:
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<Junk xmlns="urn:com:initech:tps"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="urn:com:initech.tps:schemas/foo/Junk.xsd"
VersionAttribute="2.0">
There are a bunch of nested schemas. My code has an org.w3c.dom.ls.LSResourceResolver to figure out what schema to use, implementing this method:
LSInput resolveResource(String type,
String namespaceURI,
String publicId,
String systemId,
String baseURI)
Previous versions of the schema have embedded the schema version into the namespace, so I could use the namespaceURI and systemId to decide which schema to provide. Now the version number has been switched to an attribute in the root element, and my resolver doesn't have access to that. How am I supposed to figure out the version of the XML document in the LsResourceResolver?
I had never had to deal with schema versions before this and had no idea what was involved. When the version was part of the namespace then I could throw all the schemas in together and let them get sorted out, but with the version in the root element and namespace shared across versions there is no getting around reading the version information from the XML before starting the SAX parsing.
I'm going to do something very similar to what Pangea suggested (gets +1 from me), but I can't follow the advice exactly because the document is too big to read into memory, even once. By using StAX I can minimize the amount of work done to get the version from the file. See this DeveloperWorks article, "Screen XML documents efficiently with StAX":
The screening or classification of XML documents is a common problem,
especially in XML middleware. Routing XML documents to specific
processors may require analysis of both the document type and the
document content. The problem here is obtaining the required
information from the document with the least possible overhead.
Traditional parsers such as DOM or SAX are not well suited to this
task. DOM, for example, parses the whole document and constructs a
complete document tree in memory before it returns control to the
client. Even DOM parsers that employ deferred node expansion, and thus
are able to parse a document partially, have high resource demands
because the document tree must be at least partially constructed in
memory. This is simply not acceptable for screening purposes.
The code to get the version information will look like:
def map = [:]
def startElementCount = 0
def inputStream = new File(inputFile).newInputStream()
try {
    XMLStreamReader reader =
        XMLInputFactory.newInstance().createXMLStreamReader(inputStream)
    for (int event; (event = reader.next()) != XMLStreamConstants.END_DOCUMENT;) {
        if (event == XMLStreamConstants.START_ELEMENT) {
            // only the root element is of interest; stop at the second start tag
            if (startElementCount > 0) return map
            startElementCount += 1
            map.rootElementName = reader.localName
            for (int i = 0; i < reader.attributeCount; i++) {
                if (reader.getAttributeName(i).toString() == 'VersionAttribute') {
                    map.versionIdentifier = reader.getAttributeValue(i).toString()
                    return map
                }
            }
        }
    }
} finally {
    inputStream.close()
}
Then I can use the version information to figure out which resolver to use and which schema documents to set on the SAXParserFactory.
My suggestion:
1. Parse the document using SAX or DOM.
2. Get the version attribute.
3. Use the Validator.validate(Source) method with the already parsed Document (from step 1), as shown below.
Building a DOMSource from the parsed document:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true); // schema validation needs a namespace-aware DOM
DocumentBuilder builder = factory.newDocumentBuilder();
Document document = builder.parse(new File(args[0]));
DOMSource domSource = new DOMSource(document);
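The three steps can be sketched end to end as follows. This is only an outline under assumptions: the schema is passed in as a string, where a real application would presumably pick a schema file based on the version, and the class and method names are invented for the example.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
class VersionedValidation {
    // Step 1: parse once, namespace-aware so the DOM can be validated later.
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    // Step 2: read the version from the root element ("" if absent).
    static String versionOf(Document document) {
        return document.getDocumentElement().getAttribute("VersionAttribute");
    }

    // Step 3: validate the already parsed tree against the chosen schema.
    // With no ErrorHandler set, a validation error is thrown as a SAXException.
    static void validate(Document document, String xsdText) throws Exception {
        Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(xsdText)));
        schema.newValidator().validate(new DOMSource(document));
    }
}
```

The caller would use versionOf(document) to select the schema text (or file) and then call validate, so the input is parsed only once.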
I would like to know if there is a way (particularly, an API) in Java to write XML in a SAX-like way (i.e., event-driven, unlike JDOM, which I cannot use) that takes a DTD and guarantees that my XML document is being written correctly.
I have been using SAX for parsing, and I have written an XML writer layer myself as if I were writing a plain file (through OutputStreamWriter), but I have seen that my XML writer layer does not always follow the DTD rules.
SAX does not know how to write XML documents; it is intended for parsing them. So you can choose any method you like to create the document and then validate it against the DTD using the SAX API.
BTW, may I ask why you are limiting yourself to tools that were almost obsolete about 10 years ago? Why not use a higher-level API that converts objects to XML and vice versa, such as JAXB?
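As one possible shape for "write it any way you like, then validate", the sketch below writes with the JDK's event-style XMLStreamWriter (a swapped-in choice, not something the question used) and then re-reads the result with a validating SAX parser. The DTD, element names, and class name are invented for the example. The writer itself does not enforce the DTD; only the second pass checks it.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class with a made-up DTD.
class WriteThenValidate {
    static final String DOCTYPE =
        "<!DOCTYPE contacts [<!ELEMENT contacts (contact*)><!ELEMENT contact (#PCDATA)>]>";

    // Writes a tiny document event by event, with the DTD embedded.
    static String write(String contactName) throws Exception {
        StringWriter out = new StringWriter();
        XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        w.writeStartDocument();
        w.writeDTD(DOCTYPE);
        w.writeStartElement("contacts");
        w.writeStartElement("contact");
        w.writeCharacters(contactName); // special characters are escaped for us
        w.writeEndElement();
        w.writeEndElement();
        w.writeEndDocument();
        w.close();
        return out.toString();
    }

    // Re-parses with DTD validation on; any validity error is rethrown.
    static void validate(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(true);
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)),
            new DefaultHandler() {
                @Override public void error(SAXParseException e) throws SAXParseException {
                    throw e; // the default implementation would silently ignore it
                }
            });
    }
}
```

Running validate(write("Ann")) during development or in tests would flag a writer layer that drifts from the DTD, at the cost of parsing the output a second time.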
The standard DocumentBuilder approach can validate for you.
This snippet is taken from http://www.edankert.com/validate.html#Validate_using_internal_DTD
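DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
// setValidating(true) turns on DTD validation; the JAXP documentation recommends
// leaving it false when validating only against a Schema passed to setSchema
factory.setValidating(true);
factory.setNamespaceAware(true);
SchemaFactory schemaFactory =
    SchemaFactory.newInstance("http://www.w3.org/2001/XMLSchema");
factory.setSchema(schemaFactory.newSchema(
    new Source[] {new StreamSource("contacts.xsd")}));
DocumentBuilder builder = factory.newDocumentBuilder();
// SimpleErrorHandler is the linked page's own ErrorHandler implementation, not a JDK class
builder.setErrorHandler(new SimpleErrorHandler());
Document document = builder.parse(new InputSource("document.xml"));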
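Since SimpleErrorHandler is not part of the JDK, here is a rough stand-in, paired with DTD validation through DocumentBuilder as the question asks about. It is not the linked page's class: the name, the collected message format, and the inline-DTD usage are assumptions for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical ErrorHandler that records problems instead of printing them.
class CollectingErrorHandler implements ErrorHandler {
    final List<String> problems = new ArrayList<>();

    @Override public void warning(SAXParseException e) { record("warning", e); }
    @Override public void error(SAXParseException e) { record("error", e); }
    @Override public void fatalError(SAXParseException e) throws SAXParseException {
        record("fatal", e);
        throw e; // not well-formed, so parsing cannot continue
    }

    private void record(String level, SAXParseException e) {
        problems.add(level + " at line " + e.getLineNumber() + ": " + e.getMessage());
    }

    // Parses xml with DTD validation on and returns the problems reported.
    static List<String> validate(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true); // validate against the document's DTD
        DocumentBuilder builder = factory.newDocumentBuilder();
        CollectingErrorHandler handler = new CollectingErrorHandler();
        builder.setErrorHandler(handler);
        builder.parse(new InputSource(new StringReader(xml)));
        return handler.problems;
    }
}
```

An empty list from CollectingErrorHandler.validate(xml) suggests the document matched its DTD; each entry otherwise names the line and the parser's message.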