How to detect "Invalid character found in text content" - java

I'm validating XML in Java using SAX, and I'd like to detect the following kind of error:
"An invalid character was found in text content".
At the moment, validation with SAX works, but for some documents corrupted characters are not reported as errors. When I open the resulting XML file in Internet Explorer, for example, I get the error message "an invalid character was found in text content".
This is an example of XML data:
<?xml version='1.0' encoding='UTF-8' standalone='yes'?>
<!DOCTYPE blabla SYSTEM 'blabla.dtd'>
<blabla type='type' num='num'>
<...>... corrupted character </...>
</blabla>
And this is how the parser is instantiated:
SAXParserFactory factory = SAXParserFactory.newInstance();
factory.setValidating(true);
factory.setNamespaceAware(true);
parser = factory.newSAXParser();
parser.setProperty(JAXP_SCHEMA_LANGUAGE, W3C_XML_SCHEMA);
parser.setProperty(JAXP_SCHEMA_SOURCE, new File(theConfig.getRoot()
.concat(File.separator).concat(theConfig.getXsdFileName())
.concat("-v").concat(theConfig.getXsdFileVersion()).concat(
XSD_EXTENSION)));
reader = parser.getXMLReader();
reader.setErrorHandler(getHandler());
reader.setEntityResolver(new MyEntityResolver(theConfig.getRoot(),
theConfig));
InputSource is = new InputSource();
is.setCharacterStream(new StringReader(theDataToParse));
reader.parse(is);
The error handler implements the methods 'warning', 'error' and 'fatalError', but nothing is detected.
The entity resolver loads a custom entity file, stored in a configuration directory.
Does anyone have an idea why such a malformed character error is not detected? Is it because my stream comes from a String and not from a file?
Thanks in advance for your help.
Regards.

Yes. Since you are already holding the data as a String, the byte-to-character conversion has already happened. If you want to detect the invalid character, you need to parse the bytes. In general, it's not a good idea to hold XML data as String data, because you risk corrupting it through incorrect character encoding. The best way to treat XML is as binary data.
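A minimal sketch of the point above, assuming the raw bytes are still available and are supposed to be UTF-8 (the class and method names are made up for illustration). A strict CharsetDecoder reports malformed byte sequences, whereas new String(bytes, UTF_8) silently substitutes the replacement character U+FFFD, which is why a parser fed from a String never sees the corruption:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: checks raw bytes before they are ever turned into a String.
final class StrictUtf8Check {

    // Returns true only if the bytes decode as UTF-8 without any malformed sequence.
    static boolean isWellFormedUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            // A byte sequence that is not valid UTF-8 was found.
            return false;
        }
    }
}
```

The same idea applies to the parser itself: handing it an InputStream (for example via new InputSource(inputStream)) instead of a StringReader lets it do the decoding and report encoding problems.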

Related

Java XML Document converting &quot; to " (literal quote) upon parsing/converting to Document

I need to send a request to a SOAP web service that requires the root tag to contain XML data. This is the XML that I'm trying to send:
<root>&lt;test key=&quot;Applicants&quot;&gt;this is a data&lt;/test&gt;</root>
I need to append this to the SoapBody object as a document with this code:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
factory.setExpandEntityReferences(false);
DocumentBuilder builder = factory.newDocumentBuilder();
Document result = builder.parse(new ByteArrayInputStream(request.getRequest().getBytes()));
Then adding it to the SoapBody to be sent to the webservice.
However, when I send this request and trace the logs, the &quot; entity is being reverted to a literal quote (").
This is the xml being sent:
<root>&lt;test key="Applicants"&gt;this is a data&lt;/test&gt;</root>
As you can see, the &quot; is being transformed into a literal quote. How can I keep the original data within the root tag (which has the &quot;)? It seems to be transformed when I convert it to a Document object.
Would appreciate any help. Thanks.
Edit:
The web service actually requires this format (according to their documentation and sample XML requests). If this isn't possible, is it a limitation? Should I use another framework?
The &quot; and " are completely equivalent in this context. You haven't actually said whether this is causing a problem: if it is, then it's because some recipient of the XML isn't processing it correctly. Incidentally, it would also be legitimate to convert the &gt; to >.
When you parse XML and re-serialise it, irrelevant details like redundant whitespace get lost - just as if you copy this text into your text editor, the line-wrapping and font size gets lost.
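A small sketch of that equivalence, assuming the payload sits as escaped text inside the root element (the class and method names are hypothetical). Both spellings parse to exactly the same text content, so a conforming recipient cannot tell them apart:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only.
final class QuotEquivalence {

    // Parses the document and returns the text content of its root element.
    static String rootText(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String escaped = "<root>&lt;test key=&quot;Applicants&quot;&gt;x&lt;/test&gt;</root>";
        String literal = "<root>&lt;test key=\"Applicants\"&gt;x&lt;/test&gt;</root>";
        // Both lines print the same unescaped text.
        System.out.println(rootText(escaped));
        System.out.println(rootText(literal));
    }
}
```")) throw new AssertionError("unexpected text content");
</test>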

Decoding a base64 XML cuts off the last part

I have a Base64-encoded string that represents an XML Schema (XSD). I decode it using Apache's Base64 utilities, put the resulting byte array into an InputSource, and let an XMLSchemaCollection read this InputSource:
String base64String = ......
byte[] decoded = Base64.decodeBase64(base64String);
InputSource inputSource = new InputSource(new ByteArrayInputStream(decoded));
xmlSchemaCollection.read(inputSource, new ValidationEventHandler());
This gives an error:
XML document structure must start and end within the same entity
Which usually means the XML structure isn't valid. I performed two tests to see what the base64 actually holds. First is printing it out to the console:
System.out.println(new String(decoded,"UTF-8"));
In Eclipse, I see that my XML is suddenly cut off, as if part of it is missing. However, if I use an online tool such as https://www.base64decode.org/ and copy/paste my Base64, I see the complete XML. If I validate this XML, the validation succeeds. So I'm a bit confused as to why Eclipse seemingly cuts off my XML after decoding.
Errors like this are usually indicative of a badly formatted document:
XML document structures must start and end within the same entity...
A few things you can do to debug this:
1. Print out the XML document to a log and run it through some sort of XML validator.
2. Check to make sure that there are no invalid characters (e.g. UTF-16 encoded characters in a document declared as UTF-8).
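The first debugging step can be sketched as below, under some assumptions: java.util.Base64 is swapped in for the Apache codec used in the question, and the class and method names are hypothetical. Checking the decoded bytes directly, instead of reading console output, helps separate a display limit in the IDE from data that is really truncated:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.Base64;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration only.
final class DecodedXmlCheck {

    // Decodes Base64 text; the MIME decoder tolerates line breaks in the input.
    static byte[] decode(String base64) {
        return Base64.getMimeDecoder().decode(base64);
    }

    // Returns true if the bytes parse as a well-formed XML document.
    static boolean isWellFormed(byte[] xmlBytes) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(new ByteArrayInputStream(xmlBytes), new DefaultHandler());
            return true;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            // Truncated or otherwise malformed input ends up here.
            return false;
        }
    }
}
```

Logging decoded.length next to the result of isWellFormed(decoded) shows whether the byte array itself is short or only its printed form.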

How to replace invalid characters in XML string?

I have a string which was encoded as UTF-16. When parsing it using javax.xml.parsers.DocumentBuilder, I got an error like this:
Character reference "&#x0" is an invalid XML character
Here is the code I used to parse the XML:
InputSource inputSource = new InputSource();
inputSource.setCharacterStream(new StringReader(xmlString));
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder parser = factory.newDocumentBuilder();
org.w3c.dom.Document document = parser.parse(inputSource);
My question is: how can I replace the invalid characters with a space?
You just need to use String.replaceAll and pass a pattern that matches the invalid characters.
You are trying to parse an invalid XML entity, and this is what raises the exception. It seems you don't need to worry about UTF-16 in your situation.
Find some explanation and an example here.
As an example, the & character cannot be used directly in valid XML; we need to use &amp; instead. Here &amp; is the XML entity.
The example above should make clear what an XML entity is.
As I understand it, some XML entities are not valid. But there is no need to worry: it is possible to declare and add new XML entities. Take a look at the above article for more detail.
EDIT: Assuming there is an & character making the XML invalid:
StringEscapeUtils()
escapeXml
public static void escapeXml(java.io.Writer writer,
java.lang.String str)
throws java.io.IOException
Escapes the characters in a String using XML entities.
For example: "bread" & "butter" => &quot;bread&quot; &amp; &quot;butter&quot;.
Supports only the five basic XML entities (gt, lt, quot, amp, apos).
Does not support DTDs or external entities.
Note that unicode characters greater than 0x7f are currently escaped to their
numerical \\u equivalent. This may change in future releases.
Parameters:
writer - the writer receiving the unescaped string, not null
str - the String to escape, may be null
Throws:
java.lang.IllegalArgumentException - if the writer is null
java.io.IOException - if there is a problem writing
See Also:
unescapeXml(java.lang.String)
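The replaceAll suggestion can be sketched as follows. This is only an illustration under stated assumptions: the character class follows the Char production of XML 1.0, the second pattern targets only numeric references to NUL such as the one in the error message, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
final class XmlCharSanitizer {

    // Anything outside the XML 1.0 Char production: tab, LF, CR,
    // U+0020..U+D7FF, U+E000..U+FFFD and U+10000..U+10FFFF are allowed.
    private static final Pattern INVALID_XML_CHARS = Pattern.compile(
            "[^\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    // Numeric character references that resolve to NUL, e.g. &#x0; or &#0;.
    private static final Pattern NUL_REFERENCE = Pattern.compile("&#(?:x0+|0+);");

    // Replaces each invalid character or NUL reference with a single space.
    static String sanitize(String xml) {
        String withoutNulRefs = NUL_REFERENCE.matcher(xml).replaceAll(" ");
        return INVALID_XML_CHARS.matcher(withoutNulRefs).replaceAll(" ");
    }
}
```

Usage would be along the lines of inputSource.setCharacterStream(new StringReader(XmlCharSanitizer.sanitize(xmlString))). References to other forbidden code points (for example &#x1;) are not covered by this sketch.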

How to ignore inline DTD when parsing XML file in Java

I have a problem reading an XML file with a DTD declaration inside (the external declaration is solved). I'm using SAX (javax.xml.parsers.SAXParser). When there is no DTD definition, the event sequence looks like, for example, StartElement-Characters-StartElement-Characters-EndElement-Characters... So the characters method is called immediately after each start or end element, and that's how I need it to be. When a DTD is in the file, the event sequence changes to, for example, StartElement-StartElement-StartElement-Characters-EndElement-EndElement-EndElement. I need the characters method to be called after every element. So I'm asking: is there any way to prevent this change in the event sequence?
My code:
SAXParserFactory factory = SAXParserFactory.newInstance();
factory.setValidating(false);
SAXParser parser = factory.newSAXParser();
XMLReader reader = parser.getXMLReader();
reader.setFeature("http://xml.org/sax/features/validation", false);
reader.setFeature("http://apache.org/xml/features/nonvalidating/load-dtd-grammar", false);
reader.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
reader.setFeature("http://xml.org/sax/features/external-general-entities", false);
reader.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
reader.setFeature("http://xml.org/sax/features/use-entity-resolver2", false);
reader.setFeature("http://apache.org/xml/features/validation/unparsed-entity-checking", false);
reader.setFeature("http://xml.org/sax/features/resolve-dtd-uris", false);
reader.setFeature("http://apache.org/xml/features/validation/dynamic", false);
reader.setFeature("http://apache.org/xml/features/validation/schema/augment-psvi", false);
reader.parse(input);
This is the XML file that I'm trying to parse: link (it's a link to my Dropbox).
I suspect that the nodes that were previously being reported to the characters() callback are now being reported to the ignorableWhitespace() callback. The simplest solution might be to simply call characters() from ignorableWhitespace().
This is what the spec has to say about ignorableWhitespace():
Validating Parsers must use this method to report each chunk of
whitespace in element content (see the W3C XML 1.0 recommendation,
section 2.10): non-validating parsers may also use this method if they
are capable of parsing and using content models.
In other words, if there is a DTD, and if you are not validating, then
it's up to the parser whether it reports whitespace in element-only
content models using the characters() callback or the
ignorableWhitespace() callback.
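A minimal sketch of that suggestion (the handler class and its collectText helper are hypothetical names). Forwarding ignorableWhitespace() to characters() means the handler sees the same text whether or not a DTD is present:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler for illustration only.
final class WhitespaceForwardingHandler extends DefaultHandler {

    final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        // Treat whitespace in element-only content like any other character data.
        characters(ch, start, length);
    }

    // Parses the XML and returns all character data seen by the handler.
    static String collectText(String xml) {
        try {
            WhitespaceForwardingHandler handler = new WhitespaceForwardingHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.text.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }
}
```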

Stream xml input to sax Parser, How to print the xml streamed?

I am connecting to a remote server via a socket, and I get large XML responses back, delimited by a '\n' character.
<?xml version="1.0" encoding="UTF-8"?>
<Response>
<data>
.......
.......
</data>
</Response>\n <---- \n acts as delimiter
<?xml version="1.0" encoding="UTF-8"?>
<Response>
<data>
....
....
</data>
</Response>\n
..
I am trying to parse these XML responses using a SAX parser. Ideally I would read one full response into a String by searching for '\n' and hand that response to the parser. But since a single response is very large, I get an OutOfMemoryError when holding such a large XML document in a String. So the only remaining option was to stream the XML to SAX.
SAXParserFactory spfactory = SAXParserFactory.newInstance();
SAXParser saxParser = spfactory.newSAXParser();
XMLReader xmlReader = saxParser.getXMLReader();
xmlReader.setContentHandler(new MyDefaultHandler(context));
InputSource xmlInputSource = new InputSource(new
CloseShieldInputStream(mySocket.getInputStream()));
xmlReader.parse(xmlInputSource);
I am using CloseShieldInputStream to prevent SAX from closing my socket stream when an exception occurs because of the '\n'. I asked a previous question about that.
Now I sometimes get a parse error:
org.apache.harmony.xml.ExpatParser$ParseException: At line 1, column 8: not well-formed (invalid token)
I searched for it and found that this error normally occurs when the encoding of the actual XML is not what SAX expects. I wrote a C program to print out the XML, and all my XML is encoded as UTF-8.
Now my questions:
1. Is there any other reason for the above error in XML parsing, other than an encoding issue?
2. Is there any way to print (or write to a file) the input to SAX as it streams from the socket?
After trying Hemal Pandya's answer..
OutputStream log = new BufferedOutputStream(new FileOutputStream("log.txt"));
InputSource xmlInputSource = new InputSource(new CloseShieldInputStream(new
TeeInputStream(mReadStream, log)));
xmlReader.parse(xmlInputSource);
A new file named log.txt is created when I mount the SD card, but it is empty. Am I using this right?
Here is how I finally did it.
I worked it out with TeeInputStream itself. Thanks to Hemal Pandya for suggesting it.
// Open a log file in append mode.
OutputStream log = new BufferedOutputStream(new FileOutputStream("log.txt", true));
InputSource xmlInputSource = new InputSource(new CloseShieldInputStream(new
TeeInputStream(mReadStream, log)));
try {
    xmlReader.parse(xmlInputSource);
} catch (SAXException e) {
    // Parsing failed; the log is still flushed in the finally block below.
} finally {
    // Flush the buffered log to the file whether or not parsing succeeded.
    log.flush();
}
Is there any way to print (or write to any file) the input to SAX as
it streams from socket?
Apache Commons has a TeeInputStream that should be useful.
OutputStream log = new BufferedOutputStream(new FileOutputStream("response.xml"));
InputSource xmlInputSource = new InputSource(new
CloseShieldInputStream(new TeeInputStream(mySocket.getInputStream(), log)));
I haven't used it, so you might want to try it first in a standalone program to figure out the close semantics. Looking at the docs and your requirements, though, it looks like you would want to close the log stream separately at the end.
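To make the idea concrete without the library, here is a JDK-only stand-in that copies every byte the parser reads into a log stream. It is a sketch of the technique, not Commons IO's TeeInputStream itself, and the class name is hypothetical. It never closes or flushes the log, so the caller decides when to do that:

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical stand-in for a tee stream; for illustration only.
final class LoggingInputStream extends FilterInputStream {

    private final OutputStream log;

    LoggingInputStream(InputStream in, OutputStream log) {
        super(in);
        this.log = log;
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) {
            log.write(b); // copy the single byte that was just read
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            log.write(buf, off, n); // copy exactly the bytes that were read
        }
        return n;
    }
}
```

Because the log is usually buffered, it stays empty on disk until it is flushed or closed, which matches the empty log.txt seen in the question.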
I am not familiar with Expat, but to accomplish what you are describing in general, you need a SAX parser that supports pushing data into the parser instead of having the parser pull data from a source. Check whether Expat supports a push model. If it does, then you can simply read a chunk of data from the socket, push it to the parser, and it will parse whatever it can from the chunk, caching any remaining data for use during the next push. Repeat as needed until you are ready to close the socket connection. In this model, the \n separator would be treated as miscellaneous whitespace between nodes, so you have to use the SAX events to detect when a new <Response> node opens and closes. Also, because you are receiving multiple <Response> nodes in the data, and XML does not allow more than one top-level document node, you would need to push a custom opening tag into the parser before you start pushing the socket data into it. The custom opening tag then becomes the top-level document node, and the <Response> nodes become its children.
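The synthetic top-level node can be sketched with the standard pull-based SAX API, although this is not a push parser. The sketch assumes the responses carry no XML declaration of their own (the ones in the question do, and would have to be stripped first, since a declaration is only allowed at the very start of a document). The wrapper element name and the class are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collections;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler for illustration only.
final class WrappedResponseCounter extends DefaultHandler {

    private int depth = 0;
    int responses = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        depth++;
        // Depth 1 is the synthetic wrapper; each Response is a direct child of it.
        if (depth == 2 && "Response".equals(qName)) {
            responses++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        depth--;
    }

    // Surrounds the stream with <wrapper>...</wrapper> and counts the Response elements.
    static int countResponses(InputStream body) {
        InputStream open = new ByteArrayInputStream("<wrapper>".getBytes(StandardCharsets.UTF_8));
        InputStream close = new ByteArrayInputStream("</wrapper>".getBytes(StandardCharsets.UTF_8));
        InputStream all = new SequenceInputStream(
                Collections.enumeration(Arrays.asList(open, body, close)));
        try {
            WrappedResponseCounter handler = new WrappedResponseCounter();
            SAXParserFactory.newInstance().newSAXParser().parse(all, handler);
            return handler.responses;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse wrapped stream", e);
        }
    }
}
```

With a live socket the closing tag is only reached once the socket stream ends, so per-response work belongs in the startElement and endElement callbacks rather than after parse() returns.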
