DOM4J writes umlauts (Ä, ü, ß) incorrectly with UTF-8 encoding

I'm using DOM4J to parse and write an XML tree that is always in UTF-8.
My XML file includes German special characters. Parsing them is not a problem, but when I write the tree to a file, the special characters are converted to � characters.
I can't change the encoding of the XML file as it is restricted to UTF-8.
Code
SAXReader xmlReader = new SAXReader();
xmlReader.setEncoding("UTF-8");
Document doc = xmlReader.read(file);
doc.setXMLEncoding("UTF-8");
Element root = doc.getRootElement();
// manipulate doc
OutputFormat format = new OutputFormat();
format.setEncoding("UTF-8");
XMLWriter writer = new XMLWriter(new FileWriter(file), format);
writer.write(doc);
writer.close();
Expected output
...
<statementText>This is a test!Ä Ü ß</statementText>
...
Actual output
...
<statementText>This is a test!� � �</statementText>
...

You are passing a FileWriter to the XMLWriter. A Writer already handles String or char[] data, so it already handles the encoding, which means the XMLWriter has no chance of influencing it.
Additionally, FileWriter is an especially problematic Writer type: before Java 11 it offered no way to specify an encoding, and its classic constructors always use the platform default encoding (which is often something like windows-1252 on Windows and UTF-8 on Linux). For this reason those constructors should basically never be used.
To let the XMLWriter apply the configuration it is given, pass it an OutputStream instead (which handles byte[]). The most obvious one to use here would be FileOutputStream:
XMLWriter writer = new XMLWriter(new FileOutputStream(file), format);
This is even documented in the JavaDoc for XMLWriter:
Warning: using your own Writer may cause the writer's preferred character encoding to be ignored. If you use encodings other than UTF8, we recommend using the method that takes an OutputStream instead.
Arguably the warning is a bit misleading, as the Writer can be problematic even if you intend to write UTF-8 data.
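The effect can be reproduced without dom4j. The following is a minimal sketch using only the JDK; the class and method names are hypothetical, and ISO-8859-1 merely stands in for a non-UTF-8 platform default. It shows that once a Writer is bound to the wrong charset, the bytes on disk are no longer valid UTF-8, no matter what the XML declaration says.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of dom4j.
final class WriterEncodingDemo {

    // Encodes text through a Writer bound to the given charset, the way
    // FileWriter binds itself to the platform default charset.
    static byte[] writeWith(Charset charset, String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(bytes, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        String text = "\u00c4 \u00dc \u00df"; // Ä Ü ß

        byte[] latin1 = writeWith(StandardCharsets.ISO_8859_1, text);
        byte[] utf8 = writeWith(StandardCharsets.UTF_8, text);

        // Read both results back as UTF-8, as a parser would for a file declared as UTF-8.
        System.out.println(new String(latin1, StandardCharsets.UTF_8));
        System.out.println(new String(utf8, StandardCharsets.UTF_8));
    }
}
```

Only the second line printed round-trips the original text; the first contains replacement characters, which matches the symptom in the question.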

Related

Problems with JAXB and UTF-16 encoding

Hi, I have a small app that reads content from an XML file and puts it into a corresponding Java object.
Here is the XML:
<?xml version="1.0" encoding="UTF-16"?>
<Marker>
<TimePosition>2700</TimePosition>
<SamplePosition>119070</SamplePosition>
</Marker>
Here is the corresponding Java code:
JAXBContext jaxbContext = JAXBContext.newInstance(MarkerDto.class);
Unmarshaller jaxbUnmarshaller = jaxbContext.createUnmarshaller();
InputStream inputStream = new FileInputStream("D:/marker.xml");
Reader reader = new InputStreamReader(inputStream, StandardCharsets.UTF_16.toString());
MarkerDto markerDto = (MarkerDto) jaxbUnmarshaller.unmarshal(reader);
If I run this code I get a "Content is not allowed in prolog." exception. If I run the same with UTF-8 everything works fine. Does anyone have a clue what might be the problem?
There are several things wrong here (ranging from slightly suboptimal to potentially very wrong). In increasing order of likelihood of causing the problem:
When constructing an InputStreamReader, there's no need to call toString() on the Charset, because that class has a constructor that takes a Charset, so simply remove the .toString():
Reader reader = new InputStreamReader(inputStream, StandardCharsets.UTF_16);
This is a tiny nitpick and has no effect on functionality.
Don't construct a Reader at all! XML is a format that's self-describing when it comes to encoding: Valid XML files can be parsed without knowing the encoding up-front. So instead of creating a Reader, simply pass the InputStream directly into your XML-handling code. Delete the line that creates the Reader and change the next one to this:
MarkerDto markerDto = (MarkerDto) jaxbUnmarshaller.unmarshal(inputStream);
This may or may not fix your problem, depending on whether the input is well-formed.
Your XML file might have encoding="UTF-16" in the header and not actually be UTF-16 encoded. If that's the case, then it is malformed and a conforming parser will decline to parse it. Verify this by opening the file with the advanced text editor of your choice (I suggest Notepad++ on Windows, Linux users probably know what their preference is) and check if it shows "UTF-16" as encoding (and the content is readable).
If I run the same with UTF-8 everything works fine.
This line suggests that that's what's actually happening here: the XML file is mis-labeling itself. This needs to be fixed at the point where the XML file is created.
Notably, this demo code provides exactly the same Content is not allowed in prolog. exception message that is reported in the question:
String xml = "<?xml version=\"1.0\" encoding=\"UTF-16\"?>\n<foo />";
JAXBContext jaxbContext = JAXBContext.newInstance();
Unmarshaller jaxbUnmarshaller = jaxbContext.createUnmarshaller();
InputStream inputStream = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
jaxbUnmarshaller.unmarshal(inputStream);
Note that the XML encoding attribute claims UTF-16, but the actual data handed to the XML parser is UTF-8 encoded.
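Since JAXB is no longer bundled with recent JDKs, the same check can be sketched with the JDK's built-in DOM parser instead. This is an assumption-laden illustration rather than the code from the question: the class and method names are hypothetical, and the exact error message may differ between parser implementations. It feeds the parser the same document twice, once mislabeled and once genuinely UTF-16 encoded.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

// Hypothetical demo class using the JDK's DOM parser in place of JAXB.
final class MislabeledEncodingDemo {

    // Returns true if the bytes parse as well-formed XML, false if the parser rejects them.
    static boolean parses(byte[] xmlBytes) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new ByteArrayInputStream(xmlBytes));
            return true;
        } catch (SAXException | IOException e) {
            System.out.println("Rejected: " + e.getMessage());
            return false;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-16\"?>\n<foo />";

        // Declaration says UTF-16 but the bytes are UTF-8: expected to be rejected.
        System.out.println(parses(xml.getBytes(StandardCharsets.UTF_8)));

        // Declaration and bytes agree (Java's UTF-16 encoder also writes a byte order mark).
        System.out.println(parses(xml.getBytes(StandardCharsets.UTF_16)));
    }
}
```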

XML Document read in as Latin1 but half converted to UTF-8

I'm hitting my head off a brick wall with a bizarre problem that I know there will be an obvious answer to, but I can't see it for the life of me. It's all to do with encoding. Before the code, a simple description: I want to take in an XML document which is Latin1 (ISO-8859-1) encoded, and then send it completely unchanged over an HttpURLConnection. I have a small test class and the raw XML which shows my problem. The XML file contains the Latin1 byte 0xa2 (a cent character), which on its own is invalid UTF-8; I'm deliberately using this as my test case. The XML declaration is ISO-8859-1. I can read it in no bother, but when I convert the org.w3c.dom.Document to a byte[] array to send down the HttpURLConnection, the 0xa2 byte gets converted to the UTF-8 encoded cent character (0xc2 0xa2), and the declaration stays as ISO-8859-1. In other words, one byte is converted to two, which is totally wrong.
The code which does this:
FileInputStream input = new FileInputStream( "input-file" );
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware( true );
DocumentBuilder builder = factory.newDocumentBuilder();
Document document = builder.parse( input );
Source source = new DOMSource( document );
ByteArrayOutputStream baos = new ByteArrayOutputStream();
Result result = new StreamResult( baos );
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.transform( source, result );
byte[] bytes = baos.toByteArray();
FileOutputStream fos = new FileOutputStream( "output-file" );
fos.write( bytes );
I'm just writing it to a file at the moment while I figure out what on earth is converting this character. The input-file has 0xa2, the output-file contains 0xc2 0xa2. One way to fix this is to put this line in the 2nd last block:
transformer.setOutputProperty(OutputKeys.ENCODING, "ISO-8859-1");
However, not all XML documents that I'll be dealing with will be Latin1; most, indeed, will be UTF-8 when they come in. I'm assuming I shouldn't have to be working out what the encoding is such that I feed that in to the transformer though? I mean, surely it should be working this out for itself, and I'm just doing something else wrong?
A thought had occurred to me that I could just query the document to find out the encoding and thus the extra line could just do the trick:
transformer.setOutputProperty(OutputKeys.ENCODING, document.getInputEncoding());
However, I then determined that this wasn't the answer, as document.getInputEncoding() returns a different String when I run it in a terminal on the Linux box than when I run it within Eclipse on my Mac.
Any hints would be appreciated. I fully accept I'm missing out on something obvious.
Yes, by default, XML documents are written as UTF-8, so you need to explicitly tell the Transformer to use a different encoding. Your last edit is the "trick" to doing this such that it always matches the input XML encoding:
transformer.setOutputProperty(OutputKeys.ENCODING, document.getXmlEncoding());
The only question is, do you really need to maintain the input encoding?
Why not just open it with a normal FileInputStream and stream the bytes to the output stream directly from that? Why do you need to load it into DOM format in memory if you are just sending it byte for byte over an HttpURLConnection?
Edit: According to javadoc for Document, you should probably be using document.getXmlEncoding() to get what matches the encoding in the XML prolog.
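Put together, the round trip might look like the sketch below. It assumes the JDK's default parser and transformer; the class name, the method name and the fallback to UTF-8 when the document has no encoding declaration are illustrative choices, not something stated in the question.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

// Hypothetical demo class.
final class PreserveEncodingDemo {

    // Parses the XML and serializes it again in the encoding named by its declaration.
    static byte[] roundTrip(byte[] xmlBytes) {
        try {
            Document document = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlBytes));

            // getXmlEncoding() returns null when the declaration names no encoding;
            // falling back to UTF-8 is an assumption of this sketch.
            String encoding = document.getXmlEncoding();
            if (encoding == null) {
                encoding = "UTF-8";
            }

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.ENCODING, encoding);

            ByteArrayOutputStream out = new ByteArrayOutputStream();
            transformer.transform(new DOMSource(document), new StreamResult(out));
            return out.toByteArray();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // \u00a2 is the cent sign, a single byte (0xa2) in ISO-8859-1.
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><price>\u00a2</price>";
        byte[] output = roundTrip(xml.getBytes(StandardCharsets.ISO_8859_1));
        System.out.println(new String(output, StandardCharsets.ISO_8859_1));
    }
}
```

If the document only needs to be forwarded unchanged, copying the raw bytes avoids the question of encodings entirely, as suggested above.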
This may be helpful - it's too long for a comment, but not really an answer. From the spec:
The encoding attribute specifies the preferred encoding to use for
outputting the result tree. XSLT processors are required to respect
values of UTF-8 and UTF-16. For other values, if the XSLT processor
does not support the specified encoding it may signal an error; if it
does not signal an error it should use UTF-8 or UTF-16 instead.
You may want to test with "encoding=junk", as it were, to see what it does.
The valid values for Java are described here. See also IANA charsets.

Change encoding of DOM4J Document: UTF to ISO-8859-1 (Java)

I need to create an org.dom4j.Document but when I print it it's always UTF-8.
I want to change it to ISO-8859-1 but I didn't find the way to do it.
It's not possible to use .setEncoding() and the Document is created on the fly (not reading from an InputStream).
It's the same problem that is discussed at http://www.coderanch.com/t/127978/XML/change-Encoding-Dom
Thanks a lot!
I believe you can set the encoding in the OutputFormat format class and use that to configure XMLWriter.
OutputFormat outFormat = new OutputFormat();
outFormat.setEncoding("ISO-8859-1");
XMLWriter out = new XMLWriter(outputStream, outFormat);
out.write(myDocumentObject);
You will need to provide the XMLWriter class an OutputStream or Writer.

Filtering Wikipedia's XML dump: error on some accents

I'm trying to index Wikipedia dumps. My SAX parser makes Article objects from the XML with only the fields I care about, then sends them to my ArticleSink, which produces Lucene Documents.
I want to filter special/meta pages like those prefixed with Category: or Wikipedia:, so I made an array of those prefixes and test the title of each page against this array in my ArticleSink, using article.getTitle().startsWith(prefix). In English, everything works fine: I get a Lucene index with all the pages except for those with matching prefixes.
In French, the prefixes with no accent also work (i.e. filter the corresponding pages), some of the accented prefixes don't work at all (like Catégorie:), and some work most of the time but fail on some pages (like Wikipédia:), but I cannot see any difference between the corresponding lines (in less).
I can't really inspect all the differences in the file because of its size (5 GB), but it looks like correct UTF-8 XML. If I take a portion of the file using grep or head, the accents are correct (even on the offending pages, the <title>Catégorie:something</title> is correctly displayed by grep). On the other hand, when I recreate a wiki XML by tail/head-cutting the original file, the same page (here Catégorie:Rock par ville) gets filtered in the small file but not in the original…
Any ideas?
Alternatives I tried:
Getting the file (commented lines were tried without success*):
FileInputStream fis = new FileInputStream(new File(xmlFileName));
//ReaderInputStream ris = ReaderInputStream.forceEncodingInputStream(fis, "UTF-8" );
//(custom function opening the stream,
//reading it as UTF-8 into a Reader and returning another byte stream)
//InputSource is = new InputSource( fis ); is.setEncoding("UTF-8");
parser.parse(fis, handler);
Filtered prefixes:
ignoredPrefix = new String[] {"Catégorie:", "Modèle:", "Wikipédia:",
"Cat\uFFFDgorie:", "Mod\uFFFDle:", "Wikip\uFFFDdia:", //invalid char
"CatÃ©gorie:", "ModÃ¨le:", "WikipÃ©dia:", // UTF-8 as ISO-8859-1
"Image:", "Portail:", "Fichier:", "Aide:", "Projet:"}; // those last always work
* ERRATUM
Actually, my bad: one of the approaches I tried does work; I had tested against the wrong index:
InputSource is = new InputSource( fis );
is.setEncoding("UTF-8"); // force UTF-8 interpretation
parser.parse(is, handler); // pass the InputSource, not the raw stream
Since you write the prefixes as plain strings into your source file, you want to make sure that you save that .java file in UTF-8, too (or any other encoding that supports the special characters you're using). Then, however, you have to tell the compiler which encoding the file is in with the -encoding flag:
javac -encoding utf-8 *.java
For the XML source, you could try
Reader r = new InputStreamReader(new FileInputStream(xmlFileName), "UTF-8");
InputStreams do not deal with encodings since they are byte-based, not character-based. So, here we create a Reader from a FileInputStream: the latter (stream) doesn't know about encodings, but the former (reader) does, because we give the encoding in the constructor.
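To see why a decoding mismatch makes startsWith fail, here is a small sketch (the class and method names are hypothetical). It decodes the same UTF-8 bytes once correctly and once as ISO-8859-1, then tests the prefix. The prefix is written with a Unicode escape, which is one way to keep the literal independent of the encoding in which the .java file is saved or compiled.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class.
final class PrefixDecodingDemo {

    // "Catégorie:" spelled with a Unicode escape for the accented letter.
    static final String PREFIX = "Cat\u00e9gorie:";

    // Decodes the title bytes with the given charset and tests the prefix.
    static boolean hasPrefix(byte[] titleBytes, Charset decodeAs) {
        return new String(titleBytes, decodeAs).startsWith(PREFIX);
    }

    public static void main(String[] args) {
        byte[] utf8Title = "Cat\u00e9gorie:Rock par ville".getBytes(StandardCharsets.UTF_8);

        System.out.println(hasPrefix(utf8Title, StandardCharsets.UTF_8));      // true
        System.out.println(hasPrefix(utf8Title, StandardCharsets.ISO_8859_1)); // false
    }
}
```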

Converting document encoding when reading with dom4j

Is there any way I can convert a document being parsed by dom4j's SAXReader from the ISO-8859-2 encoding to UTF-8? I need that to happen while parsing, so that the objects created by dom4j are already Unicode/UTF-8 and running code such as:
"some text".equals(node.getText());
returns true.
This is done automatically by dom4j. All String instances in Java are in a common, decoded form; once a String is created, it isn't possible to tell what the original character encoding was (or even if the string was created from encoded bytes).
Just make sure that the XML document has the character encoding specified (which is required unless it is UTF-8 or UTF-16).
The decoding happens in (or before) the InputSource (before the SAXReader). From that class's javadocs:
The SAX parser will use the InputSource object to determine how to read XML input. If there is a character stream available, the parser will read that stream directly, disregarding any text encoding declaration found in that stream. If there is no character stream, but there is a byte stream, the parser will use that byte stream, using the encoding specified in the InputSource or else (if no encoding is specified) autodetecting the character encoding using an algorithm such as the one in the XML specification. If neither a character stream nor a byte stream is available, the parser will attempt to open a URI connection to the resource identified by the system identifier.
So it depends on how you are creating the InputSource. To guarantee the proper decoding you can use something like the following:
InputStream stream = <input source>
Charset charset = Charset.forName("ISO-8859-2");
Reader reader = new BufferedReader(new InputStreamReader(stream, charset));
InputSource source = new InputSource(reader);
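Both routes can be compared in a short sketch. It uses the JDK's DOM parser rather than dom4j's SAXReader, so it illustrates the InputSource behaviour described above, not dom4j itself; the class and method names are hypothetical. Either way the parsed text is an ordinary Java String that compares equal to a literal.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Hypothetical demo class; the JDK parser stands in for dom4j's SAXReader.
final class DecodedTextDemo {

    static String rootText(InputSource source) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(source).getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Byte stream: the parser takes the encoding from the XML declaration.
    static String textFromBytes(byte[] xmlBytes) {
        return rootText(new InputSource(new ByteArrayInputStream(xmlBytes)));
    }

    // Character stream: the caller decodes, and the declaration is disregarded.
    static String textFromReader(byte[] xmlBytes, Charset charset) {
        Reader reader = new InputStreamReader(new ByteArrayInputStream(xmlBytes), charset);
        return rootText(new InputSource(reader));
    }

    public static void main(String[] args) {
        Charset latin2 = Charset.forName("ISO-8859-2");
        String word = "\u017c\u00f3\u0142w"; // a Polish word with three non-ASCII letters
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-2\"?><t>" + word + "</t>";
        byte[] bytes = xml.getBytes(latin2);

        System.out.println(word.equals(textFromBytes(bytes)));
        System.out.println(word.equals(textFromReader(bytes, latin2)));
    }
}
```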
