Problems with JAXB and UTF-16 encoding - java

Hi, I have a small app that reads content from an XML file and puts it into a corresponding Java object.
Here is the XML:
<?xml version="1.0" encoding="UTF-16"?>
<Marker>
<TimePosition>2700</TimePosition>
<SamplePosition>119070</SamplePosition>
</Marker>
Here is the corresponding Java code:
JAXBContext jaxbContext = JAXBContext.newInstance(MarkerDto.class);
Unmarshaller jaxbUnmarshaller = jaxbContext.createUnmarshaller();
InputStream inputStream = new FileInputStream("D:/marker.xml");
Reader reader = new InputStreamReader(inputStream, StandardCharsets.UTF_16.toString());
MarkerDto markerDto = (MarkerDto) jaxbUnmarshaller.unmarshal(reader);
If I run this code I get a "Content is not allowed in prolog." exception. If I run the same code with UTF-8, everything works fine. Does anyone have a clue what the problem might be?

There are several things wrong here (ranging from slightly suboptimal to potentially very wrong). In increasing order of likelihood of causing the problem:
When constructing an InputStreamReader, there's no need to call toString() on the Charset: that class has a constructor that takes a Charset directly, so simply remove the .toString():
Reader reader = new InputStreamReader(inputStream, StandardCharsets.UTF_16);
This is a tiny nitpick and has no effect on functionality.
Don't construct a Reader at all! XML is a format that's self-describing when it comes to encoding: Valid XML files can be parsed without knowing the encoding up-front. So instead of creating a Reader, simply pass the InputStream directly into your XML-handling code. Delete the line that creates the Reader and change the next one to this:
MarkerDto markerDto = (MarkerDto) jaxbUnmarshaller.unmarshal(inputStream);
This may or may not fix your problem, depending on whether the input is well-formed.
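For completeness, a minimal sketch of the unmarshalling step with that change applied (MarkerDto and the file path are from the question; the class wrapper and try-with-resources are my additions):
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import java.io.FileInputStream;
import java.io.InputStream;

public class MarkerReader {
    public static void main(String[] args) throws Exception {
        JAXBContext jaxbContext = JAXBContext.newInstance(MarkerDto.class);
        Unmarshaller jaxbUnmarshaller = jaxbContext.createUnmarshaller();
        // Hand the raw InputStream to JAXB; the XML parser detects the encoding
        // from the byte order mark / XML declaration on its own.
        try (InputStream inputStream = new FileInputStream("D:/marker.xml")) {
            MarkerDto markerDto = (MarkerDto) jaxbUnmarshaller.unmarshal(inputStream);
            System.out.println(markerDto);
        }
    }
}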
Your XML file might have encoding="UTF-16" in the header and not actually be UTF-16 encoded. If that's the case, then it is malformed and a conforming parser will decline to parse it. Verify this by opening the file with the advanced text editor of your choice (I suggest Notepad++ on Windows, Linux users probably know what their preference is) and check if it shows "UTF-16" as encoding (and the content is readable).
If I run the same with UTF-8 everything works fine.
This line suggests that that's what's actually happening here: the XML file is mis-labeling itself. This needs to be fixed at the point where the XML file is created.
Notably, this demo code produces exactly the same "Content is not allowed in prolog." exception message that is reported in the question:
String xml = "<?xml version=\"1.0\" encoding=\"UTF-16\"?>\n<foo />";
JAXBContext jaxbContext = JAXBContext.newInstance();
Unmarshaller jaxbUnmarshaller = jaxbContext.createUnmarshaller();
InputStream inputStream = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
jaxbUnmarshaller.unmarshal(inputStream);
Note that the XML encoding attribute claims UTF-16, but the actual data handed to the XML parser is UTF-8 encoded.
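If the file happens to be produced with JAXB as well, the mis-labeling can be avoided at the source by letting the Marshaller decide both the declared and the actual encoding. A sketch under that assumption (not part of the original question; MarkerDto is reused from above):
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Marshaller;
import java.io.FileOutputStream;
import java.io.OutputStream;

public class MarkerWriter {
    public static void write(MarkerDto markerDto) throws Exception {
        JAXBContext ctx = JAXBContext.newInstance(MarkerDto.class);
        Marshaller marshaller = ctx.createMarshaller();
        // JAXB_ENCODING controls both the prolog's encoding attribute and the bytes
        // written to the OutputStream, so the two can no longer disagree.
        marshaller.setProperty(Marshaller.JAXB_ENCODING, "UTF-16");
        try (OutputStream out = new FileOutputStream("D:/marker.xml")) {
            marshaller.marshal(markerDto, out); // pass an OutputStream, not a Writer
        }
    }
}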

Related

JAXB fails to load file having name containing URL encoded characters

I have a file whose path looks like this: /home/jwayne/test/55-0388%25car.xml. I try to unmarshal the XML back to an object using JAXB as follows.
File file = new File("/home/jwayne/test/55-0388%25car.xml");
JAXBContext context = JAXBContext.newInstance(Rectangle.class);
Unmarshaller unmarshaller = context.createUnmarshaller();
Rectangle rectangle = (Rectangle) unmarshaller.unmarshal(file);
However, I get a FileNotFoundException (FNFE) with the following stack trace.
[java.io.FileNotFoundException: /home/jwayne/test/55-0388%car.xml (No such file or directory)]
at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal0(UnmarshallerImpl.java:246)
at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal(UnmarshallerImpl.java:214)
at javax.xml.bind.helpers.AbstractUnmarshallerImpl.unmarshal(AbstractUnmarshallerImpl.java:157)
at javax.xml.bind.helpers.AbstractUnmarshallerImpl.unmarshal(AbstractUnmarshallerImpl.java:162)
at javax.xml.bind.helpers.AbstractUnmarshallerImpl.unmarshal(AbstractUnmarshallerImpl.java:171)
at javax.xml.bind.helpers.AbstractUnmarshallerImpl.unmarshal(AbstractUnmarshallerImpl.java:189)
...
Note that somehow (as suggested by the stack trace), the unmarshaller has modified the file name from 55-0388%25car.xml to 55-0388%car.xml.
Stepping through the code, however, the problem is actually pretty deep: sun.net.www.protocol.file.Handler has a method openConnection that does the following.
File var4 = new File(ParseUtil.decode(var1.getPath()));
That sun.net.www.ParseUtil.decode method actually transforms the file path.
Any idea on how to quickly fix this problem (besides renaming the file)? (Note I am using JDK v1.8.0_191).
The root cause of your problem is that % is the character used to introduce URL-encoded escapes, and the encoded form of % itself is %25.
Internally, JAXB decodes %25 back to a plain %, and thus the file cannot be found.
A quick (and dirty) solution is to do some string replacement, like:
String fileName = "/home/jwayne/test/55-0388%25car.xml";
fileName = fileName.replace("%25", "%2525");
File file = new File(fileName);
This works whenever there is %25 in a file name, but the same presumably happens with any URL-encoded character, so if there are other special characters you need some handling for each of them, or some clever regex solution.
Update:
To get around this JAXB behaviour, provide it with an InputStream instead of a File, like so:
FileInputStream fis = new FileInputStream(fileName);
Rectangle r2 = (Rectangle) unmarshaller.unmarshal(fis);
Then JAXB has no way to alter any URI / file name.
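Put together, a sketch of that workaround (Rectangle and the path are from the question; the class wrapper and try-with-resources are my additions so the stream gets closed):
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import java.io.FileInputStream;

public class RectangleReader {
    public static void main(String[] args) throws Exception {
        JAXBContext context = JAXBContext.newInstance(Rectangle.class);
        Unmarshaller unmarshaller = context.createUnmarshaller();
        // We open the stream ourselves, so JAXB never sees (or URL-decodes) the file name.
        try (FileInputStream fis =
                 new FileInputStream("/home/jwayne/test/55-0388%25car.xml")) {
            Rectangle rectangle = (Rectangle) unmarshaller.unmarshal(fis);
            System.out.println(rectangle);
        }
    }
}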

DOM4J UTF-8 encoding writes umlauts (Ä, ü, ß) incorrectly

I'm using DOM4J for parsing and writing an XML tree, which is always in UTF-8.
My XML file includes German special characters. Parsing them is not a problem, but when I write the tree to a file, the special characters get converted to � characters.
I can't change the encoding of the XML file as it is restricted to UTF-8.
Code
SAXReader xmlReader = new SAXReader();
xmlReader.setEncoding("UTF-8");
Document doc = xmlReader.read(file);
doc.setXMLEncoding("UTF-8");
Element root = doc.getRootElement();
// manipulate doc
OutputFormat format = new OutputFormat();
format.setEncoding("UTF-8");
XMLWriter writer = new XMLWriter(new FileWriter(file), format);
writer.write(doc);
writer.close();
Expected output
...
<statementText>This is a test!Ä Ü ß</statementText>
...
Actual output
...
<statementText>This is a test!� � �</statementText>
...
You are passing a FileWriter to the XMLWriter. A Writer already handles String or char[] data, so the byte encoding has already been decided by the Writer, which means the XMLWriter has no chance of influencing it.
Additionally, FileWriter is an especially problematic Writer type: before Java 11 you cannot specify which encoding it should use at all; it always uses the platform default encoding (often something like ISO-8859-1 on Windows and UTF-8 on Linux). For this reason it should basically never be used.
To let the XMLWriter apply the encoding it was given in its configuration, pass it an OutputStream instead (which handles byte[] data). The most obvious one to use here is FileOutputStream:
XMLWriter writer = new XMLWriter(new FileOutputStream(file), format);
This is even documented in the JavaDoc for XMLWriter:
Warning: using your own Writer may cause the writer's preferred character encoding to be ignored. If you use encodings other than UTF8, we recommend using the method that takes an OutputStream instead.
Arguably the warning is a bit misleading, as the Writer can be problematic even if you intend to write UTF-8 data.
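Putting the pieces together, a sketch of the corrected write path (based on the code from the question; only the FileWriter is swapped for a FileOutputStream, and the method wrapper is mine):
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import org.dom4j.Document;
import org.dom4j.io.OutputFormat;
import org.dom4j.io.SAXReader;
import org.dom4j.io.XMLWriter;

public class Dom4jRoundTrip {
    public static void rewrite(File file) throws Exception {
        SAXReader xmlReader = new SAXReader();
        Document doc = xmlReader.read(file);
        // ... manipulate doc ...

        OutputFormat format = new OutputFormat();
        format.setEncoding("UTF-8");
        // With an OutputStream, the XMLWriter performs the UTF-8 encoding itself.
        try (OutputStream out = new FileOutputStream(file)) {
            XMLWriter writer = new XMLWriter(out, format);
            writer.write(doc);
            writer.flush();
        }
    }
}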

Getting string '???' while parsing utf-8 encoded XML using JAXB parser for Korean string 작동불

I want to read the XML content below and I'm using a JAXB parser to convert the XML to an object. The XML document is in UTF-8 format and contains some UTF-8 characters, but I'm not getting them through my object; I'm getting ??? instead.
XML file data:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<CallDetails>
<APPOINTMENTDATE>29.11.2016</APPOINTMENTDATE>
<APPOINTMENTTIME>29.11.2016 11:11:00</APPOINTMENTTIME>
<ASCCODE>83000220</ASCCODE>
<CALLDESC>작동불</CALLDESC>
<CALLRECEIVEDBY>김정권</CALLRECEIVEDBY>
<CALLRECEIVEDMODECODE></CALLRECEIVEDMODECODE>
<CALLREGBYCAT></CALLREGBYCAT>
<CALLREGBYCODE></CALLREGBYCODE>
<CALLREGDATE>29.11.2016</CALLREGDATE>
<CALLREGTIME>29.11.2016 09:11:00</CALLREGTIME>
<CALLTYPECODE>SVC</CALLTYPECODE>
<COVERAGETYPECODE>UW</COVERAGETYPECODE>
<SPECIALREQUEST></SPECIALREQUEST>
</CallDetails>
Reading the file as below:
InputStream inputStream = null;
inputStream = new FileInputStream(path);
InputStreamReader reader = new InputStreamReader(inputStream,"UTF-8");
JAXBContext context = JAXBContext.newInstance(clazz);
Unmarshaller um = context.createUnmarshaller();
InputSource is = new InputSource(reader);
is.setEncoding("UTF-8");
return um.unmarshal(is);
and getting the values from the object as below:
THIRDPARTYSERVICEORDERNO = serviceOrderListDTO.getServiceOrderList().get(0).getThirdPartyServiceOrderNo();
CALLDESC = serviceOrderListDTO.getServiceOrderList().get(0).getCallDetailsList().getCallDesc();
System.out.println("THIRDPARTYSERVICEORDERNO : "+THIRDPARTYSERVICEORDERNO);
System.out.println("CALLDESC: "+CALLDESC);
After running this code, I'm getting the output below:
THIRDPARTYSERVICEORDERNO : AJ16110004904;
CALLDESC: ???;
I have tested your code.
The result it produces is correct, which means that in debug mode the in-memory values are displayed correctly.
When printing those symbols to the console you will see ??? because, by default, the console window can't display them.
You have to make sure that:
the encoding of the project in your IDE is set to UTF-8,
the fonts used to display the message actually cover those characters (take a look at http://unifoundry.com/unifont.html),
and that you run your JRE with -Dfile.encoding=UTF-8.
To verify that the value itself is intact in memory, see the sketch below.
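One way to check the in-memory value without depending on the console's default encoding is to print through an explicitly UTF-8 PrintStream and to look at the raw bytes; a small sketch (the CALLDESC value is the one from the question, the class and variable names are mine):
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingCheck {
    public static void main(String[] args) throws Exception {
        String callDesc = "작동불"; // value unmarshalled from <CALLDESC>

        // Print via an explicitly UTF-8 PrintStream instead of the console default.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println("CALLDESC: " + callDesc);

        // If these bytes contain 0x3F ('?'), the string was already damaged in memory;
        // otherwise only the display is at fault.
        System.out.println(Arrays.toString(callDesc.getBytes(StandardCharsets.UTF_8)));
    }
}
(Note that the literal "작동불" in the source also requires compiling the .java file as UTF-8.)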

XMLStreamException : Parse error

I have a process that parses an XML file with Java 5 on Apache Tomcat 6.
Since compiling it with Java 7 and running it on Apache Tomcat 7, I receive the following error:
Caused by: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,60]
Message: Invalid encoding name "ISO8859-1".
at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.setInputSource(XMLStreamReaderImpl.java:219)
at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.&lt;init&gt;(XMLStreamReaderImpl.java:189)
at com.sun.xml.internal.stream.XMLInputFactoryImpl.getXMLStreamReaderImpl(XMLInputFactoryImpl.java:262)
at com.sun.xml.internal.stream.XMLInputFactoryImpl.createXMLStreamReader(XMLInputFactoryImpl.java:129)
at com.sun.xml.internal.stream.XMLInputFactoryImpl.createXMLEventReader(XMLInputFactoryImpl.java:78)
at org.simpleframework.xml.stream.StreamProvider.provide(StreamProvider.java:66)
at org.simpleframework.xml.stream.NodeBuilder.read(NodeBuilder.java:58)
at org.simpleframework.xml.core.Persister.read(Persister.java:543)
at org.simpleframework.xml.core.Persister.read(Persister.java:444)
Here is the XML declaration used:
<?xml version="1.0" encoding="ISO8859-1" standalone="no" ?>
If I replace ISO8859-1 with UTF-8 the parsing process works, but that's not an option for me.
The lib that I use is simple-xml-2.1.8.jar
As someone pointed out to me, ISO8859-1 is not a valid encoding name; ISO-8859-1 is the correct one. As I mentioned, it's difficult to ask the "producers" to correct their files, so I want to handle the problem in my application.
Get access to the Xerces XMLReader instance from Simple XML and set
reader.setFeature("http://apache.org/xml/features/allow-java-encodings", true)
before parsing the XML.
Since ISO8859-1 "works" as a charset name in Java, this may just work.
The list of supported "features" of Xerces is available here
Alternatively, a good old regex replacing encoding="ISO8859-1" in the prolog should do the trick, applied prior to processing the XML.
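A sketch of that pre-processing step, assuming the file is small enough to read into memory and that the bad declaration appears only once (neither assumption is part of the original answer):
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class PrologFixer {
    public static InputStream fixedStream(String path) throws Exception {
        // ISO-8859-1 maps every byte to a char losslessly, so reading and re-encoding
        // with it cannot corrupt the document while we patch the prolog.
        String xml = new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.ISO_8859_1);
        xml = xml.replaceFirst("encoding=\"ISO8859-1\"", "encoding=\"ISO-8859-1\"");
        return new ByteArrayInputStream(xml.getBytes(StandardCharsets.ISO_8859_1));
    }
}
The returned stream can then be passed to Persister.read as usual.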
If you know the file encoding up front (UTF-8, ISO-8859-1 or whatever) then you should create a suitable Reader configured for that encoding, then use the Persister.read method that takes a Reader instead of the one that takes a File or InputStream. That way you are in control of the byte-to-character decoding rather than relying on the XML reader to detect the encoding (and fail, as the file declared it wrongly). So instead of
File f = new File(....);
MyType obj = persister.read(MyType.class, f);
you would do something more like
File f = new File(....);
MyType obj = null;
try (FileInputStream fis = new FileInputStream(f);
     InputStreamReader reader = new InputStreamReader(fis, "ISO-8859-1")) { // or UTF-8, ...
    obj = persister.read(MyType.class, reader);
}

Filtering Wikipedia's XML dump: error on some accents

I'm trying to index Wikipedia dumps. My SAX parser makes Article objects from the XML with only the fields I care about, then sends them to my ArticleSink, which produces Lucene Documents.
I want to filter out special/meta pages like those prefixed with Category: or Wikipedia:, so I made an array of those prefixes and test the title of each page against this array in my ArticleSink, using article.getTitle().startsWith(prefix). In English, everything works fine: I get a Lucene index with all the pages except those with the matching prefixes.
In French, the prefixes with no accents also work (i.e. they filter the corresponding pages), some of the accented prefixes don't work at all (like Catégorie:), and some work most of the time but fail on some pages (like Wikipédia:), although I cannot see any difference between the corresponding lines (in less).
I can't really inspect all the differences in the file because of its size (5 GB), but it looks like correct UTF-8 XML. If I take a portion of the file using grep or head, the accents are correct (even on the incriminated pages, the <title>Catégorie:something</title> is correctly displayed by grep). On the other hand, when I recreate a wiki XML by tail/head-cutting the original file, the same page (here Catégorie:Rock par ville) gets filtered in the small file, but not in the original…
Any idea?
Alternatives I tried:
Getting the file (commented lines were tried without success*):
FileInputStream fis = new FileInputStream(new File(xmlFileName));
//ReaderInputStream ris = ReaderInputStream.forceEncodingInputStream(fis, "UTF-8" );
//(custom function opening the stream,
//reading it as UTF-8 into a Reader and returning another byte stream)
//InputSource is = new InputSource( fis ); is.setEncoding("UTF-8");
parser.parse(fis, handler);
Filtered prefixes:
ignoredPrefix = new String[] {"Catégorie:", "Modèle:", "Wikipédia:",
"Cat\uFFFDgorie:", "Mod\uFFFDle:", "Wikip\uFFFDdia:", //invalid char
"Catégorie:", "Modèle:", "Wikipédia:", // UTF-8 as ISO-8859-1
"Image:", "Portail:", "Fichier:", "Aide:", "Projet:"}; // those last always work
* ERRATUM
Actually, my bad: that one I tried does work; I had tested the wrong index:
InputSource is = new InputSource( fis );
is.setEncoding("UTF-8"); // force UTF-8 interpretation
parser.parse(fis, handler);
Since you write the prefixes as plain strings into your source file, you want to make sure that you save that .java file in UTF-8, too (or any other encoding that supports the special characters you're using). Then, however, you have to tell the compiler which encoding the file is in with the -encoding flag:
javac -encoding utf-8 *.java
For the XML source, you could try
Reader r = new InputStreamReader(new FileInputStream(xmlFileName), "UTF-8");
InputStreams do not deal with encodings since they are byte-based, not character-based. So, here we create a Reader from a FileInputStream: the latter (the stream) doesn't know about encodings, but the former (the reader) does, because we give the encoding in the constructor.
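SAXParser has no parse(Reader, ...) overload, so the Reader then needs to be wrapped in an InputSource; a short sketch under that assumption (the method wrapper is mine, and handler stands in for the SAX handler from the question):
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class WikiDumpParse {
    public static void parse(String xmlFileName, DefaultHandler handler) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        // The Reader decodes the bytes as UTF-8 up front; the parser then works on characters.
        try (Reader r = new InputStreamReader(new FileInputStream(xmlFileName), "UTF-8")) {
            parser.parse(new InputSource(r), handler);
        }
    }
}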
