XML acute accent encoding issue - Java

For my customer I have to unmarshal an XML file (received from an external service) into Java entities and save it to a database.
For that I am using a simple JAXB method that does the job.
I have an issue with the XML file: I received it and I don't understand why the acute accent character doesn't show correctly in the file.
It is encoded in UTF-8 with Unix (LF) line endings.
The acute accent shows up in the file as a garbled character (the raw byte 0xB4).
When I copy it and paste it into a new file, it is displayed correctly.
The problem is that when JAXB processes the file, I get this error:
org.springframework.dao.DataAccessResourceFailureException: Error reading XML stream; nested exception is javax.xml.stream.XMLStreamException: ParseError at [row,col]:[14642,669]
Message: The element type "Nm" must be terminated by the matching end-tag "</Nm>".
It's not an end-tag issue; the element is correctly closed. When I replace this 0xB4 character with another one, it works properly.
The Java file encoding is UTF-8 as well.
Does anyone have an idea?
Thanks a lot.
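For what it's worth, byte 0xB4 is the acute accent (´) in ISO-8859-1/Windows-1252, but on its own it is not a valid byte sequence in UTF-8, which is why a strict UTF-8 parser aborts mid-element and reports a misleading end-tag error. A minimal sketch (the file name is an assumption) that uses a strict decoder to locate the offending byte:

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class StrictUtf8Check {
    public static void main(String[] args) throws Exception {
        byte[] bytes = Files.readAllBytes(Paths.get("input.xml")); // hypothetical file name
        // REPORT makes the decoder fail on malformed input instead of substituting U+FFFD
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(bytes);
        CharBuffer out = CharBuffer.allocate(4096);
        while (true) {
            CoderResult result = decoder.decode(in, out, true);
            if (result.isUnderflow()) {
                System.out.println("File is valid UTF-8");
                break;
            }
            if (result.isOverflow()) {
                out.clear(); // discard decoded chars; we only want the error position
                continue;
            }
            // On error, the buffer position points at the start of the bad sequence
            System.out.printf("Invalid UTF-8 byte 0x%02X at offset %d%n",
                    bytes[in.position()] & 0xFF, in.position());
            break;
        }
    }
}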

Related

Reading UTF-16 XML files with JCabi Java

I have found this jcabi snippet that works well with UTF-8 encoded XML files; it reads the XML file and then prints it as a string.
XML xml;
try {
    xml = new XMLDocument(new File("test8.xml"));
    String xmlString = xml.toString();
    System.out.println(xmlString);
} catch (FileNotFoundException e1) {
    e1.printStackTrace();
}
However, when I run this same code on a UTF-16 encoded XML file, it gives me the following error:
[Fatal Error] :1:1: Content is not allowed in prolog.
Exception in thread "AWT-EventQueue-0" java.lang.IllegalArgumentException: Can't parse, most probably the XML is invalid
Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
I have read about this error, and it means that the parser is not recognizing the prolog because, due to the encoding, it is seeing characters that are not supposed to be there.
I have tried other libraries that offer a way to tell the parser which encoding the source file uses, but the only library I was able to get working to some degree was jcabi, and I was not able to find a way to tell it that my source file is encoded in UTF-16.
Thanks, any help is appreciated.
The jcabi XMLDocument class has various constructors, including one which takes a String. So one approach is to use:
Path path = Paths.get("test16_LE_with_bom.xml");
XML xml = new XMLDocument(Files.readString(path, StandardCharsets.UTF_16LE));
String xmlString = xml.toString();
System.out.println(xmlString);
This makes use of java.nio.charset.StandardCharsets and java.nio.file.Files.
In my first test, my XML file was encoded as UTF-16-LE (and with a BOM at the start: FF FE for little-endian). The above approach handled the BOM OK.
My test file's prolog is as follows (with no explicit encoding - maybe that's a bad thing, here?):
<?xml version="1.0"?>
In my second test I removed the BOM and re-ran with the updated file - which also worked.
I used Notepad++ and a hex editor to verify/select encodings & to edit the test files.
Your file may be different from my test files (BE vs. LE).
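If you don't know the byte order up front, note that StandardCharsets.UTF_16 (with no LE/BE suffix) uses the BOM to pick the byte order and defaults to big-endian when there is none. A hedged variant of the above (the file name is an assumption):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import com.jcabi.xml.XML;
import com.jcabi.xml.XMLDocument;

public class ReadUtf16Xml {
    public static void main(String[] args) throws Exception {
        Path path = Paths.get("test16.xml"); // hypothetical file name
        // UTF_16 honours a leading BOM (FF FE or FE FF) and strips it while decoding
        String content = Files.readString(path, StandardCharsets.UTF_16);
        XML xml = new XMLDocument(content);
        System.out.println(xml.toString());
    }
}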

CharConversionException while transforming xml file

I have a Java program which processes XML files. When transforming an XML file into another XML file based on a certain schema (XSD/XSL), it throws the following error.
This error is only thrown for one XML file, which has an XML tag like this:
<abc>xxx yyyy “ggggg vvvv” uuuu</abc>
But after removing or re-typing the two quotes, it doesn't throw the error.
Could anybody please assist me in resolving this issue?
java.io.CharConversionException: Character larger than 4 bytes are not supported: byte 0x93 implies a length of more than 4 bytes
at org.apache.xmlbeans.impl.piccolo.xml.UTF8XMLDecoder.decode(UTF8XMLDecoder.java:162)
<?xml version=“1.0” encoding=“UTF-8” standalone=“yes”?><xyz xmlns=“http://pqr.yy”><Header><abc>aaa “cccc” aaaaa vvv</abc></Header></xyz>
As others have reported in the comments, it has failed because the typographical quotation marks are encoded in Windows-1252, not in UTF-8, so the parser hasn't managed to decode them.
The encoding declared in the XML declaration must match the actual encoding used for the characters.
To find out how this error arose, and to prevent it happening again, we would need to know where this (wannabe) XML file came from, and how it was created.
My guess would be that someone used a "smart" editor; Microsoft editors in particular are notorious for changing what you type to what Microsoft think you wanted to type. If you're editing XML by hand it's best to use an XML-aware editor.
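If you can't fix it at the source, one possible workaround (file names are assumptions) is to re-decode the raw bytes as Windows-1252, where 0x93 and 0x94 are the typographic quotes, and write them back as genuine UTF-8 before transforming:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class FixEncoding {
    public static void main(String[] args) throws Exception {
        Path in = Paths.get("broken.xml");  // hypothetical input
        Path out = Paths.get("fixed.xml");  // hypothetical output
        // Interpret the raw bytes as Windows-1252, so 0x93/0x94 decode to “ and ”
        String text = new String(Files.readAllBytes(in), Charset.forName("windows-1252"));
        // Write back as real UTF-8 so the declared encoding matches the content
        Files.write(out, text.getBytes(StandardCharsets.UTF_8));
    }
}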

JAXB Getting error While parsing the XML

While parsing XML using JAXB, I am getting the error "javax.xml.bind.UnmarshalException: An invalid XML character (Unicode: 0xffffffff) was found in the element content of the document", because a node in my XML has special characters like "TRÊS2115". How do I handle this scenario? I need those special character values too.
This is a problem with your input data. The accented character needs to be properly encoded (or escaped as a character reference such as &#202; for Ê) within the XML file; the code that wrote the file failed to encode the character correctly.
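If you cannot fix the producer, one hedged workaround (the class and file names are assumptions) is to unmarshal through a Reader with the encoding the file was actually written in, for example ISO-8859-1, so the parser decodes Ê correctly instead of trusting the declared encoding:

import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;

public class UnmarshalWithCharset {
    public static void main(String[] args) throws Exception {
        JAXBContext ctx = JAXBContext.newInstance(MyRoot.class); // MyRoot is hypothetical
        Unmarshaller unmarshaller = ctx.createUnmarshaller();
        // Decode explicitly with the real source encoding before JAXB sees the content
        try (Reader reader = new InputStreamReader(
                new FileInputStream("input.xml"), StandardCharsets.ISO_8859_1)) {
            MyRoot root = (MyRoot) unmarshaller.unmarshal(reader);
            System.out.println(root);
        }
    }
}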

Trouble viewing an XML file encoded in UTF-8 with non-ASCII characters

I have an XML file which gets its data from an Oracle table column of CLOB type. I wrote into the CLOB value using a Unicode character stream:
Writer value = clob.setCharacterStream(0L);
value.write(strValue);
When I write non-ASCII characters like Chinese and then access the CLOB attribute using PL/SQL Developer, I see the characters showing up as they are. However, when I put the value in an XML file encoded in UTF-8 and try to open the XML file in IE, I get the error message:
"An invalid character was found in text content. Error processing
resource ...".
The other interesting thing is that when I write into the CLOB using an ASCII stream, like:
OutputStream value = clob.getAsciiOutputStream();
value.write(strValue.getBytes("UTF-8"));
then the characters appear correctly in the XML in the browser, but are garbled in the DB when accessed using PL/SQL Developer.
Is there any problem in converting Unicode characters to UTF-8? Any suggestions, please?
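No answer was recorded here, but the symptoms suggest the character data is fine in the database and the mis-encoding happens when the XML file itself is written. A minimal sketch (method and file names are assumptions) that keeps the character stream for the CLOB and uses an explicit UTF-8 writer for the file, so the bytes on disk match the declared encoding:

import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class WriteXmlUtf8 {
    static void writeXml(String xmlContent) throws Exception {
        // Encode explicitly as UTF-8 so the file bytes match the XML declaration
        try (Writer out = Files.newBufferedWriter(
                Paths.get("out.xml"), StandardCharsets.UTF_8)) { // hypothetical file name
            out.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
            out.write(xmlContent);
        }
    }
}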

Forcing escaped characters when writing to XML

I'm using org.w3c.dom and javax.xml.parsers in Java for reading and writing XML files.
When I read an XML file, the &#10; escaped line breaks are replaced by real line breaks. When I write the content back to the file, I lose the escaping and the content of the file changes unintentionally.
So
<somenode>First line.&#10;Second line.</somenode>
will be replaced by:
<somenode>First line.
Second line.</somenode>
Before writing xml content back to disk I tried:
String content = node.getTextContent().replace("\n", "&#10;");
node.setTextContent(content);
Of course it does not work: the ampersand itself gets escaped, so the file ends up containing &amp;#10;.
I do not want to litter the file with CDATA sections!
What I want is legal XML output, so there has to be a way to do it.
Thanks in advance for any ideas :)
You can do it by setting the following property on the JAXB Marshaller:
marshaller.setProperty("jaxb.encoding", "Unicode");
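For context, a minimal sketch of where that property would be set (the Root class and output file are assumptions); "jaxb.encoding" is the same key as the Marshaller.JAXB_ENCODING constant, and "Unicode" is Java's alias for UTF-16:

import java.io.File;

import javax.xml.bind.JAXBContext;
import javax.xml.bind.Marshaller;

public class MarshalWithEncoding {
    public static void main(String[] args) throws Exception {
        JAXBContext ctx = JAXBContext.newInstance(Root.class); // Root is hypothetical
        Marshaller marshaller = ctx.createMarshaller();
        // Same as marshaller.setProperty("jaxb.encoding", "Unicode")
        marshaller.setProperty(Marshaller.JAXB_ENCODING, "Unicode");
        marshaller.marshal(new Root(), new File("out.xml")); // hypothetical output file
    }
}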
