Unicode(0xb) error while parsing an XML file using Stax - java

While parsing an XML file Stax produces an error:
Unicode(0xb) error-An invalid XML character (Unicode: 0xb) was found in the element content of the document.
The link below shows the XML line containing the special character, which appears as "VI". It is not an alphabetical character: when you copy and paste it into Notepad, it shows up as a symbol. I tried parsing the file using StAX, and it produced the error above.
Can somebody please give me a solution for this?
Thanks in advance.

0xB (vertical tab) is not a valid character in XML. The only valid characters before ASCII 32 (0x20, space) are 0x9 (tab), 0xA (line feed) and 0xD (carriage return).
In short, what you are trying to parse is NOT XML.
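If you cannot fix the producer of the file, one possible workaround is to clean the character stream before StAX sees it. The following is only a sketch under that assumption: the class name `SanitizingStaxExample`, the nested `XmlSanitizingReader` and the helper `firstElementText` are made up for illustration, and replacing the offending characters with a space is a choice that may or may not suit your data.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class SanitizingStaxExample {

    /** Hypothetical helper: replaces chars that XML 1.0 forbids with a space. */
    static class XmlSanitizingReader extends FilterReader {
        XmlSanitizingReader(Reader in) {
            super(in);
        }

        // Surrogates are passed through so that valid pairs survive;
        // this sketch does not try to detect unpaired surrogates.
        static boolean allowed(char c) {
            return c == 0x9 || c == 0xA || c == 0xD
                    || (c >= 0x20 && c <= 0xD7FF)
                    || Character.isSurrogate(c)
                    || (c >= 0xE000 && c <= 0xFFFD);
        }

        @Override
        public int read() throws IOException {
            int c = super.read();
            return (c == -1 || allowed((char) c)) ? c : ' ';
        }

        @Override
        public int read(char[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            // n is -1 at end of stream, in which case the loop does not run.
            for (int i = off; i < off + n; i++) {
                if (!allowed(buf[i])) {
                    buf[i] = ' ';
                }
            }
            return n;
        }
    }

    /** Returns the text of the first element found, parsing through the sanitizing reader. */
    static String firstElementText(Reader xml) {
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new XmlSanitizingReader(xml));
            try {
                while (r.hasNext()) {
                    if (r.next() == XMLStreamConstants.START_ELEMENT) {
                        return r.getElementText();
                    }
                }
                return null;
            } finally {
                r.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("XML could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        // The input contains a vertical tab (0xB) between x and y.
        System.out.println(firstElementText(new StringReader("<a>x\u000By</a>")));
    }
}
```

Keep in mind that this changes the data: the parser never sees the original character, so use it only when dropping or replacing those characters is acceptable.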

Whenever an invalid XML character appears in the XML, the parser gives this kind of error. When you open the file in Notepad++, such characters show up as VT, SOH, FF and so on; these are invalid XML characters. I am using XML version 1.0, and I validate text data before entering it in the database with this pattern:
Pattern p = Pattern.compile("[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]+");
retunContent = p.matcher(retunContent).replaceAll("");
This ensures that no invalid special character ends up in the XML.

According to the XML W3C Recommendation 0xb is not allowed in an XML file:
Character Range
[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
So strictly speaking your input file is not an XML file.
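To find out where the offending character sits before handing the text to a parser, the Char production above can be translated directly into a code point check. This is a minimal sketch; the class name `XmlCharCheck` and both method names are made up for illustration.

```java
public class XmlCharCheck {

    /** Mirrors the XML 1.0 Char production quoted above. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the index of the first char that is not allowed in XML 1.0, or -1 if all are fine. */
    static int firstInvalidIndex(String s) {
        for (int i = 0; i < s.length(); ) {
            // codePointAt joins valid surrogate pairs; an unpaired surrogate
            // comes back as a value in D800-DFFF, which isXmlChar rejects.
            int cp = s.codePointAt(i);
            if (!isXmlChar(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    public static void main(String[] args) {
        String sample = "ab\u000Bc"; // contains a vertical tab at index 2
        int idx = firstInvalidIndex(sample);
        if (idx >= 0) {
            System.out.printf("Invalid XML character 0x%X at index %d%n",
                    sample.codePointAt(idx), idx);
        }
    }
}
```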

Related

Unable to parse trademark symbol in XML using XPath

I'm trying to parse an XML file in Java, and some lines contain the HTML character reference &#153;. Still, when I do
((String) myXPath.evaluate(node, STRING));
I get a square symbol instead of ™. My machine runs Linux and the XML encoding is UTF-8. I can't understand how to properly encode this exact symbol. &#8482; is decoded perfectly well...
I create a Document instance in the following way:
File xmlFile = new File(path);
FileInputStream fileIS = new FileInputStream(xmlFile);
xmlDocument = builder.parse(fileIS);
The character reference &#153; represents the character with Unicode code point 153, which is an unprintable control character. It isn't a trademark symbol. 153 might be a trademark symbol in some Microsoft Windows character set, but that's irrelevant on the web. You need to use the Unicode code point, which is 8482 - https://en.wikipedia.org/wiki/Trademark_symbol
Note that the numbers used in HTML entity references have nothing to do with the file encoding. In fact, that's the whole point of using them - they survive changes of encoding.
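The difference can be checked from Java. The sketch below assumes the "Microsoft Windows character set" in question is windows-1252 and that your JDK ships that charset; the class name `Cp1252Reference` and the method `remapC1` are made up for illustration. It shows that code point 153 is a control character, and how a number that was really meant as a windows-1252 byte could be mapped to the intended Unicode code point.

```java
import java.nio.charset.Charset;

public class Cp1252Reference {

    /**
     * If codePoint falls in the C1 control range (0x80-0x9F), reinterprets it
     * as a windows-1252 byte and returns the Unicode code point it decodes to.
     * Other values are returned unchanged.
     */
    static int remapC1(int codePoint) {
        if (codePoint < 0x80 || codePoint > 0x9F) {
            return codePoint;
        }
        String decoded = new String(new byte[] {(byte) codePoint},
                Charset.forName("windows-1252"));
        return decoded.codePointAt(0);
    }

    public static void main(String[] args) {
        System.out.println("153 is a control character: " + Character.isISOControl(153));
        System.out.println("153 read as a windows-1252 byte maps to code point " + remapC1(153));
    }
}
```

Fixing the reference in the source XML to &#8482; is still the cleaner option; remapping after parsing only guesses at what the author meant.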

How to solve the IllegalDataException in jdom2 library?

I am using JDOM version 2.0.6 and received this IllegalDataException:
Error in setText for tokenization:
It fails when calling the setText() method.
Element text = new Element("Text");
text.setText(doc.getText());
It seems that it does not accept some characters in the text. Two examples:
Originally Posted by Yvette H( 45) Odd socks, yes, no undies yes, no coat yes, no shoes odd. 🏻
ParryOtter said: Posted
Should I specify an encoding somewhere, or is there some other reason?
In fact, you just have to wrap the text that contains illegal characters in CDATA:
Element text = new Element("Text");
text.setContent(new CDATA(doc.getText()));
The reverse operation (reading text wrapped in CDATA) is transparent in JDOM2, so you won't have to unescape it.
For my tests I added an illegal character at the end of my text by creating one from the hex value 0x2, like this:
String text = doc.getText();
int hex = 0x2;
text += (char) hex;

JAXB Getting error While parsing the XML

While parsing XML using JAXB I am getting the error "javax.xml.bind.UnmarshalException: An invalid XML character (Unicode: 0xffffffff) was found in the element content of the document". My XML nodes contain special characters such as "TRÊS2115". How can I handle this scenario? I need those special character values too.
This is a problem with your input data. The accented character needs to be escaped within the XML file. The code that wrote the file failed to properly encode the character.

how to store special character '{' and '}' in StringBuilder.AppendFormat

I am creating a project using RESTful web services, but when I request this URL
url: "http://localhost:8080/RestfulWebservicesNewVersion2/REST/webservices/GetFriend"
I am getting this output:
"\u0027EmployeeList\u0027:{{\u0027emp_id\u0027:\u00272\u0027,\u0027emp_ename\u0027:
\u0027rkjha\u0027,\u0027emp_phoneno\u0027:\u00273232323232\u0027,\u0027emp_email\u0027
Can you tell me how I can remove the "\u0027" parts?
You can use java.text.Normalizer to normalize the text and then remove characters that are not in the plain ASCII character set.

about SAXparseException: content is not allowed in prolog

I am using glassfish server and the following error keeps coming:
Caused by: org.xml.sax.SAXParseException: Content is not allowed in prolog.
at com.sun.enterprise.deployment.io.DeploymentDescriptorFile.read(DeploymentDescriptorFile.java:304)
at com.sun.enterprise.deployment.io.DeploymentDescriptorFile.read(DeploymentDescriptorFile.java:226)
at com.sun.enterprise.deployment.archivist.Archivist.readStandardDeploymentDescriptor(Archivist.java:480)
at com.sun.enterprise.deployment.archivist.Archivist.readDeploymentDescriptors(Archivist.java:305)
at com.sun.enterprise.deployment.archivist.Archivist.open(Archivist.java:213)
at com.sun.enterprise.deployment.archivist.ApplicationArchivist.openArchive(ApplicationArchivist.java:813)
at com.sun.enterprise.instance.WebModulesManager.getDescriptor(WebModulesManager.java:395)
... 65 more
Check this link
http://mark.koli.ch/2009/02/resolving-orgxmlsaxsaxparseexception-content-is-not-allowed-in-prolog.html
In short, some XML files contain the three-byte pattern (0xEF 0xBB 0xBF) at the front (right before <?xml ...?>), which is the UTF-8 byte order mark. Java's default XML parser can't handle this case.
The quick and dirty solution is to remove everything in front of the first < in the XML string:
String xml = "<?xml ...";
xml = xml.replaceFirst("^([\\W]+)<","<");
Note that String.trim() is not enough, since it only trims a limited set of whitespace characters.
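A narrower alternative to the regex is to remove only the byte order mark itself. The sketch below is an illustration, not the linked article's code: the class name `BomStripper` and both methods are made up. It handles the two usual cases, a decoded String that starts with U+FEFF and raw bytes that start with 0xEF 0xBB 0xBF.

```java
import java.nio.charset.StandardCharsets;

public class BomStripper {

    /** Removes a single leading U+FEFF from an already decoded string. */
    static String stripBom(String xml) {
        if (xml != null && !xml.isEmpty() && xml.charAt(0) == '\uFEFF') {
            return xml.substring(1);
        }
        return xml;
    }

    /** Returns 3 if the bytes start with the UTF-8 byte order mark, otherwise 0. */
    static int utf8BomLength(byte[] data) {
        boolean hasBom = data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
        return hasBom ? 3 : 0;
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>'};
        int skip = utf8BomLength(raw);
        // Decode only the bytes after the byte order mark.
        String xml = new String(raw, skip, raw.length - skip, StandardCharsets.UTF_8);
        System.out.println(xml);
        System.out.println(stripBom("\uFEFF<a/>"));
    }
}
```

Unlike the replaceFirst call, this leaves any other leading content alone, so real garbage in front of the prolog still surfaces as a parse error instead of being silently dropped.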
