I am getting this error when parsing an incorrectly-generated XML document:
org.xml.sax.SAXParseException: The value of attribute "bar" associated with an element type "foo" must not contain the '<' character.
I know what is causing the problem. It is this line:
<foo bar="x<y">42</foo>
It should have been
<foo bar="x<y">42</foo>
I am aware that this is not valid XML, but my code has to download and parse similar files unattended, and for political reasons it might not be possible to persuade the supplier to fix the faulty program, especially when other programs already read the file and tolerate this error.
Is there any way to configure Xerces to tolerate it? At present it treats it as a fatal error. Implementing an ErrorHandler to ignore it is not satisfactory because then the remainder of the document is not parsed.
Alternatively can you suggest another stream-based parser that can be configured to tolerate this error? Using a DOM parser is not feasible as these documents run into hundreds of megabytes.
... and for political reasons it might not be possible to persuade the supplier to fix the faulty program ...
For political reasons you ought to try your damnedest to get them to fix it. Wave the requirements specification in front of them that says that the input must be well-formed XML. Threaten to bill them for the cost of developing a bespoke parser. (OK, that probably won't work ...)
By giving up without a fight, you are just leaving the problem to trouble other people who have to deal with this supplier in the future.
I don't think you will find any XML parsers that will tolerate this sort of error. The only thing I can suggest is that you pre-process the XML to remove errors that might occur.
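If you do go down the pre-processing route, the usual trick is to wrap the input in a filtering Reader that repairs the one specific fault before the real parser ever sees the stream, so you never have to hold the whole file in memory. Here is a minimal sketch, assuming the only fault is a bare '<' inside quoted attribute values; the class name and the little state machine are mine, and it deliberately ignores comments, CDATA sections and processing instructions:

import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

// Sketch (hypothetical class): escape bare '<' characters found inside
// quoted attribute values before the stream reaches the real XML parser.
// It is not a full XML lexer; comments, CDATA and PIs are not handled.
public class AttributeLtEscapingReader extends FilterReader {

    private enum State { TEXT, IN_TAG, IN_ATTR }

    private State state = State.TEXT;
    private char quote;                                        // quote that opened the attribute value
    private final StringBuilder pending = new StringBuilder(); // queued replacement characters

    public AttributeLtEscapingReader(Reader in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        if (pending.length() > 0) {               // drain a queued "&lt;" replacement first
            char c = pending.charAt(0);
            pending.deleteCharAt(0);
            return c;
        }
        int i = in.read();
        if (i < 0) {
            return i;
        }
        char c = (char) i;
        switch (state) {
            case TEXT:
                if (c == '<') state = State.IN_TAG;
                return c;
            case IN_TAG:
                if (c == '"' || c == '\'') { quote = c; state = State.IN_ATTR; }
                else if (c == '>') state = State.TEXT;
                return c;
            case IN_ATTR:
                if (c == quote) { state = State.IN_TAG; return c; }
                if (c == '<') {                   // the illegal character: emit "&lt;" instead
                    pending.append("lt;");
                    return '&';
                }
                return c;
        }
        return c;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        // simple implementation on top of read(); fine for a sketch, not tuned for speed
        int n = 0;
        for (; n < len; n++) {
            int c = read();
            if (c < 0) {
                return n == 0 ? -1 : n;
            }
            cbuf[off + n] = (char) c;
        }
        return n;
    }
}

You would then hand the SAX parser new InputSource(new AttributeLtEscapingReader(reader)) instead of the raw stream.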
I am writing a small application which will be used to validate XML files, correct them (if possible) and then perform tests on their contents. The end users will have very little knowledge of XML and parsing, so I want to catch the validation errors and then write my own event and error handler that produces error messages that are hopefully easier for the end users to understand.
I based my initial attempt on the solution detailed in this blogpost.
So far I have classified the errors based on the contents of event.getMessage(). Unfortunately, without knowing all types of parsing errors that may occur, it is more or less impossible to write good custom error messages. Is there a good way to find out what types of error messages can occur during validation?
I.e. I am looking for a listing of all messages, e.g. "The content of element X is not complete. One of ...", "Invalid content was found starting with element Y ...", "Value Z is not facet-valid with respect to pattern ..."
Or is there some better way to do this?
The format of and number of messages will depend on which XML Schema validator you've plugged into JAXB. If you're using one based on Xerces-J (like the one included in Oracle's JDK) most of the messages will be prefixed with an identifier corresponding to a validation rule/constraint in the XML Schema specification (such as cvc-maxLength-valid). The list of identifiers for the XML Schema validation rules are available here in the specification. The full list of XML schema related error messages produced by Xerces can be found in its XMLSchemaMessages.properties message file, but keep in mind that this has changed over time and will depend on which version of Xerces you're using.
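Rather than trying to enumerate every possible message, one practical approach is to branch on that identifier prefix and fall back to the raw message for anything you haven't classified yet. A rough sketch using a JAXB ValidationEventHandler; the prefixes and wording below are just examples and assume a Xerces-based validator whose messages start with the spec identifier (e.g. "cvc-complex-type.2.4.a: ..."):

import javax.xml.bind.ValidationEvent;
import javax.xml.bind.ValidationEventHandler;

// Sketch: classify validation events by the constraint identifier at the
// start of the message instead of by the full message text.
public class FriendlyEventHandler implements ValidationEventHandler {
    @Override
    public boolean handleEvent(ValidationEvent event) {
        String message = event.getMessage();
        String key = (message != null && message.contains(":"))
                ? message.substring(0, message.indexOf(':'))
                : "";
        String friendly;
        if (key.startsWith("cvc-complex-type")) {
            friendly = "An element is missing or appears in the wrong place.";
        } else if (key.startsWith("cvc-datatype-valid") || key.startsWith("cvc-pattern-valid")) {
            friendly = "A value does not match the expected format.";
        } else if (key.startsWith("cvc-maxLength-valid")) {
            friendly = "A value is too long.";
        } else {
            friendly = "The file does not match the expected structure: " + message;
        }
        int line = (event.getLocator() != null) ? event.getLocator().getLineNumber() : -1;
        System.err.println("Line " + line + ": " + friendly);
        return true; // keep going so every problem in the file gets reported
    }
}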
I am using the Java SAX parser to parse XML data sent from a third-party source that is around 3 GB in size. I am getting an error because the XML document is not well formed: The processing instruction target matching "[xX][mM][lL]" is not allowed.
As far as I understand, this is normally due to a character being somewhere it should not be.
Main problem: Cannot manually edit these files due to their very large size.
I was wondering if there is a workaround for files that are too large to open and edit manually, and whether there is a way to remove any problematic characters from them automatically in code.
I would think the most likely explanation is that the file contains a concatenation of several XML documents, or perhaps an embedded XML document: either way, an XML declaration that isn't at the start of the file.
A lot now depends on your relationship with the supplier of the bad data. If they sent you faulty equipment or buggy software, you would presumably complain and ask them to fix it. But if you don't have a service relationship with the third party, you either have to change supplier or do the best you can with the faulty input, which means repairing the fault yourself. In general, you can't repair faulty XML unless you know what kind of fault you are looking for, and that can be very difficult to determine if the files are large (or if the failures are very rare).
The data isn't XML, so don't try to use XML tools to process it. Use text processing tools such as sed or awk. The first step is to search the file for occurrences of <?xml and see if that gives any hints.
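If the tooling has to stay in Java, the same idea as the sed/awk suggestion above can be a small streaming scan that reports where each XML declaration sits, without ever loading the file; the file name argument is a placeholder:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Sketch: walk a multi-gigabyte file line by line and report every
// occurrence of "<?xml" with its line and column.
public class FindXmlDeclarations {
    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader(args[0]))) {
            String line;
            long lineNumber = 0;
            while ((line = reader.readLine()) != null) {
                lineNumber++;
                int col = line.indexOf("<?xml");
                if (col >= 0) {
                    System.out.printf("line %d, column %d: %s%n",
                            lineNumber, col + 1,
                            line.substring(col, Math.min(line.length(), col + 60)));
                }
            }
        }
    }
}

If that shows an XML declaration somewhere other than the very first bytes of the file, you know what kind of repair you are looking at.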
This error occurs if the XML declaration appears anywhere but at the very beginning of the document. The reason might be:
Whitespace before the XML declaration
Any hidden character before the XML declaration
The XML declaration appears anywhere else in the document
You should start by checking case #2; see here: http://www.w3.org/International/questions/qa-byte-order-mark#remove
If that doesn't help, you should remove leading whitespace from the document. You could do that by wrapping the original InputStream in another InputStream and using that to strip the whitespace.
The same can be done if you are facing case #3, but the implementation would be a bit more complex.
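For cases #1 and #2 the wrapper can be as simple as skipping a UTF-8 byte order mark and any leading whitespace and pushing everything else back. A minimal sketch; the class and method names are made up, and a production version should also handle UTF-16 BOMs and partial reads:

import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Sketch: strip a UTF-8 BOM and leading whitespace before the parser sees
// the stream.
public final class LeadingJunkStripper {

    public static InputStream strip(InputStream original) throws IOException {
        PushbackInputStream in = new PushbackInputStream(original, 3);

        // Skip a UTF-8 BOM (EF BB BF) if present. Note: read() may return
        // fewer than 3 bytes; a robust version should loop until 3 or EOF.
        byte[] bom = new byte[3];
        int read = in.read(bom, 0, 3);
        boolean isBom = (read == 3
                && bom[0] == (byte) 0xEF && bom[1] == (byte) 0xBB && bom[2] == (byte) 0xBF);
        if (!isBom && read > 0) {
            in.unread(bom, 0, read);      // not a BOM, push the bytes back
        }

        // Skip leading whitespace bytes.
        int b;
        while ((b = in.read()) != -1) {
            if (b != ' ' && b != '\t' && b != '\r' && b != '\n') {
                in.unread(b);             // first real byte of the document
                break;
            }
        }
        return in;
    }
}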
Is there a way to lookup the line number that a given element is at in an xml file via the w3c dom api?
My use case for this is that we have 30,000+ maps in KML/XML format. I wrote a unit test that iterates over each file found on the hard drive (about 17 GB worth) and tests that it is parseable by our application. When it fails, I throw an exception that contains the element instance that was considered "invalid". In order for our mapping department (nobody here knows how to program) to easily track down the typo, we would like to log the line number of the element that caused the exception.
Can anybody suggest a way to do this? Please note we are using the W3C dom api included in the Android 1.6 SDK.
I'm not sure whether the Android API is different, but a normal Java application could catch a SAXParseException when parsing and look at the line number.
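For example, with the standard JAXP DocumentBuilder it would look roughly like this (whether the classes bundled with Android 1.6 behave identically is something you would have to verify; the file name is a placeholder):

import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXParseException;

// Sketch: parse a file and report the line and column of the first fatal error.
public class ReportParseLocation {
    public static void main(String[] args) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try {
            builder.parse(new File(args[0]));
        } catch (SAXParseException e) {
            System.err.printf("Parse error at line %d, column %d: %s%n",
                    e.getLineNumber(), e.getColumnNumber(), e.getMessage());
        }
    }
}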
I may be wrong, but the line number shouldn't be relevant to your XML parser/reader as long as the XML structure itself is valid.
You might try to extrapolate the line number programmatically on the assumption that each node/content item is on a distinct line, but it's going to be tricky.
It looks like you're validating your XML files. That is, you're not interested in whether the documents are syntactically correct ("well-formed"), but whether they are semantically valid for your application. The right tool for this would be a validating XML parser coupled with a dedicated XML schema. See for example this tutorial on XML validation in Java. Validation errors will usually contain detailed error information, including the line number of the problematic elements.
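A sketch of that approach on the desktop JDK; the file names are placeholders, and whether javax.xml.validation is fully available on Android 1.6 would need checking:

import java.io.File;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXParseException;

// Sketch: validate a document against an XSD and report the line number
// of every violation instead of stopping at the first one.
public class ValidateWithLineNumbers {
    public static void main(String[] args) throws Exception {
        SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new File("maps.xsd"));
        Validator validator = schema.newValidator();
        validator.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { report(e); }
            public void error(SAXParseException e) { report(e); }
            public void fatalError(SAXParseException e) throws SAXParseException {
                report(e);
                throw e;
            }
            private void report(SAXParseException e) {
                System.err.printf("line %d: %s%n", e.getLineNumber(), e.getMessage());
            }
        });
        validator.validate(new StreamSource(new File("map.kml")));
    }
}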
I would like to be able to parse XML that isn't necessarily well-formed. I'd be looking for a fuzzy rather than a strict parser, able to recover from badly nested tags, for example. I could write my own but it's worth asking here first.
Update:
What I'm trying to do is extract links and other info from HTML. In the case of well-formed XML I can use the Scala XML API. In the case of ill-formed XML, it would be nice to somehow convert it into correct XML and deal with it the same way; otherwise I'd have to have two completely different sets of functions for dealing with documents.
Obviously, because the input is not well-formed and I'm trying to create a well-formed tree, there would have to be some heuristic involved (such as: when you see <parent><child></parent> you close the <child> first, and when you then see the stray </child> you ignore it). But of course this isn't a proper grammar, so there's no single correct way of doing it.
What you're looking for would not be an XML parser. XML is very strict about nesting, closing, etc. One of the other answers suggests Tag Soup. This is a good suggestion, though technically it is much closer to a lexer than a parser. If all you want from XML-ish content is an event stream without any validation, then it's almost trivial to roll your own solution. Just loop through the input, consuming content which matches regular expressions along the way (this is exactly what Tag Soup does).
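To illustrate how small that event-stream loop can be, here is a deliberately naive sketch; the interface, class name and regex are mine, and it will happily mis-tokenize comments, doctypes and '>' characters inside attribute values:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Rough sketch of the "roll your own event stream" idea: walk the input with
// a regex and report open tags, close tags and text, with no nesting checks.
public class SoupLexer {

    public interface Events {
        void openTag(String name);
        void closeTag(String name);
        void text(String content);
    }

    private static final Pattern TOKEN =
            Pattern.compile("<\\s*(/?)([a-zA-Z][\\w:-]*)[^>]*>|([^<]+)");

    public static void lex(String input, Events events) {
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            if (m.group(2) != null) {                 // a tag token
                if (m.group(1).isEmpty()) {
                    events.openTag(m.group(2));
                } else {
                    events.closeTag(m.group(2));
                }
            } else {                                  // a run of text
                events.text(m.group(3));
            }
        }
    }
}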
The problem is that a lexer is not going to be able to give you many of the features you want from a parser (e.g. production of a tree-based representation of the input). You have to implement that logic yourself because there is no way that such a "lenient" parser would be able to determine how to handle cases like the following:
<parent>
<child>
</parent>
</child>
Think about it: what sort of tree would you expect to get out of this? There's really no sane answer to that question, which is precisely why a parser isn't going to be of much help.
Now, that's not to say that you couldn't use Tag Soup (or your own hand-written lexer) to produce some sort of tree structure based on this input, but the implementation would be very fragile. With tree-oriented formats like XML, you really have no choice but to be strict, otherwise it becomes nearly impossible to get a reasonable result (this is part of why browsers have such a hard time with compatibility).
Try the parser on the XHtml object. It is much more lenient than the one on XML.
Take a look at htmlcleaner. I have used it successfully to convert "HTML from the wild" to valid XML.
Try Tag Soup.
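TagSoup exposes itself as an ordinary SAX XMLReader, so wiring it up looks roughly like the sketch below (assuming the org.ccil.cowan.tagsoup.Parser class name from the 1.x releases; the file name is a placeholder). You could just as easily feed the events into a DOM builder or a Scala-side structure instead of printing them:

import java.io.FileReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Sketch: let TagSoup repair the markup and hand us clean SAX events;
// here we just print every href attribute it reports.
public class TagSoupLinks {
    public static void main(String[] args) throws Exception {
        XMLReader reader = new org.ccil.cowan.tagsoup.Parser();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                String href = attributes.getValue("href");
                if (href != null) {
                    System.out.println(href);
                }
            }
        });
        reader.parse(new InputSource(new FileReader(args[0])));
    }
}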
JTidy does something similar but only for HTML.
I mostly agree with Daniel Spiewak's answer. This is just another way to create "your own parser".
While I don't know of any Scala-specific solution, you can try using Woodstox, a Java library that implements the StAX API. (Being an event-based API, I am assuming it will be more fault-tolerant than a DOM parser.)
There is also a Scala wrapper around Woodstox called Frostbridge, developed by the same guy who made the Simple Build Tool for Scala.
I had mixed opinions about Frostbridge when I tried it, but perhaps it is more suitable for your purposes.
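For reference, plain StAX usage looks like the sketch below (Woodstox is picked up automatically when it is the StAX provider on the classpath; the file name is a placeholder). Keep in mind that a conforming StAX parser still reports well-formedness errors, so the gain here is streaming rather than leniency:

import java.io.FileReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Sketch: stream through a document and print the href of every <a> element.
public class StaxLinkDump {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader =
                factory.createXMLStreamReader(new FileReader(args[0]));
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && "a".equals(reader.getLocalName())) {
                System.out.println(reader.getAttributeValue(null, "href"));
            }
        }
        reader.close();
    }
}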
I agree with the answers that turning invalid XML into "correct" XML is impossible.
Why don't you just do a regular text search for the hrefs, if that's all you're interested in? One issue would be commented-out links, but if the XML is invalid, it might not be possible to tell what was intended to be commented out anyway!
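A plain regex scan over the raw text (as opposed to anything XML-aware) could look like this; as noted, it will also pick up links inside comments:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: pull href values out of a file with a text search instead of a parser.
public class HrefGrep {
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) throws Exception {
        try (BufferedReader reader = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = reader.readLine()) != null) {
                Matcher m = HREF.matcher(line);
                while (m.find()) {
                    System.out.println(m.group(1));
                }
            }
        }
    }
}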
Caucho has a JAXP compliant XML parser that is a little bit more tolerant than what you would usually expect. (Including support for dealing with escaped character entity references, AFAIK.)
Find JavaDoc for the parsers here
A related topic (with my solution) is listed below:
Scala and html parsing
When I try to validate an XML file against an XSD in java (see this example) there are some incompatibilities between the regular expressions given in the XSD file and the regular expressions in java.
If there is a regular expression like "[ab-]" in the XSD (meaning any of the characters "a", "b" or "-"), Java complains about a syntax error in the expression.
This is a known bug since 28-MAR-2005, see Sun bug database.
What can I do to work around this bug? Up to now I have tried to "correct" the XSD file by replacing "[ab-]" with "[ab\-]", but sometimes this is not an option.
If you have problems with this bug, too, please vote for it at the Sun bug database!
Since a bug is already filed, I'd recommend you try a different XML Schema processor. There's not going to be a lot you can do about it.
If you can preprocess the stream the XSD is coming in on, then you could create a parser which understands the basic regular expression structure and can fix anything that looks of the form [.*-] (where the .* is not a literal in this case).
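A very rough sketch of that idea; the regex and class name are mine, and a robust version would parse the XSD and only touch xs:pattern facets rather than rewriting the raw text:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: rewrite character classes of the form [...-] to [...\-] in the
// schema text before it reaches the validator.
public class PatternFacetFixer {
    private static final Pattern TRAILING_HYPHEN_CLASS =
            Pattern.compile("\\[([^\\]]*)-\\]");

    public static String fix(String xsdText) {
        Matcher m = TRAILING_HYPHEN_CLASS.matcher(xsdText);
        return m.replaceAll("[$1\\\\-]");     // e.g. [ab-] becomes [ab\-]
    }
}

You would run the schema text through fix(...) before handing it to the SchemaFactory.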
Although it may not be the best solution in the world, you could consider using the SAX parser. I have used it for over 3 years now; however, I have not done much regex validation with it, so I cannot speak to its robustness in that respect.
Other than that, I think Kaleb is probably correct on the preprocessing side (which is anything but ideal) - you might be able to use a regex over any of the incoming regexes to do a replace... although that has quite the code smell about it.
Edit:
An additional thought that just came to me: if the regex does not need to be in the XSD - i.e. it is there simply because that was "easiest" in the past - you could do the regex validation outside of the XSD. But if other systems use the XSD, that is likely not the correct solution, and you can forget I said anything.