XML Document read in as Latin1 but half converted to UTF-8 - java

I'm hitting my head off a brick wall with a bizarre problem that I know there will be an obvious answer to, but I can't see it for the life of me. It's all to do with encoding. Before the code, a simple description: I want to take in an XML document which is Latin1 (ISO-8859-1) encoded, and then send the thing completely unchanged over an HttpURLConnection. I have a small test class and the raw XML which shows my problem. The XML file contains the Latin1 character 0xa2 (a cent sign), which is not a valid UTF-8 byte sequence on its own - I'm deliberately using this as my test case. The XML declaration says ISO-8859-1. I can read it in no bother, but when I convert the org.w3c.dom.Document to a byte[] array to send down the HttpURLConnection, the 0xa2 character gets converted to the UTF-8 encoding of the cent sign (0xc2 0xa2), while the declaration stays ISO-8859-1. In other words, one character becomes two - totally wrong.
The code which does this:
FileInputStream input = new FileInputStream( "input-file" );

// Parse the Latin1-encoded file into a DOM tree
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware( true );
DocumentBuilder builder = factory.newDocumentBuilder();
Document document = builder.parse( input );

// Serialise the DOM back out to a byte array
Source source = new DOMSource( document );
ByteArrayOutputStream baos = new ByteArrayOutputStream();
Result result = new StreamResult( baos );
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.transform( source, result );
byte[] bytes = baos.toByteArray();

// Write the bytes to a file so the raw output can be inspected
FileOutputStream fos = new FileOutputStream( "output-file" );
fos.write( bytes );
I'm just writing it to a file at the moment while I figure out what on earth is converting this character. The input-file contains 0xa2; the output-file contains 0xc2 0xa2. One way to fix this is to set the output encoding explicitly on the transformer:
transformer.setOutputProperty(OutputKeys.ENCODING, "ISO-8859-1");
However, not all XML documents that I'll be dealing with will be Latin1; most, in fact, will be UTF-8 when they come in. I'm assuming I shouldn't have to work out the encoding myself just to feed it back into the transformer, though? Surely it should work that out for itself, and I'm just doing something else wrong?
A thought had occurred to me that I could just query the document to find out the encoding and thus the extra line could just do the trick:
transformer.setOutputProperty(OutputKeys.ENCODING, document.getInputEncoding());
However, I then determined that this wasn't the answer, as document.getInputEncoding() returns a different String when I run it in a terminal on the Linux box than when I run it within Eclipse on my Mac.
Any hints would be appreciated. I fully accept I'm missing out on something obvious.

Yes - by default, XML documents are written out as UTF-8, so you need to explicitly tell the Transformer to use a different encoding. Your last edit is nearly the "trick": use getXmlEncoding(), which reports the encoding declared in the prolog, so the output always matches the input XML encoding:
transformer.setOutputProperty(OutputKeys.ENCODING, document.getXmlEncoding());
The only question is: do you really need to maintain the input encoding?
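If you do, here's how the pieces fit together, building on the code in the question (a minimal sketch; note that getXmlEncoding() returns null when the prolog carries no encoding declaration, so a fallback is sensible):
Document document = builder.parse( new FileInputStream( "input-file" ) );
Transformer transformer = TransformerFactory.newInstance().newTransformer();
// getXmlEncoding() reports the encoding declared in the XML prolog,
// or null if the document didn't declare one
String declared = document.getXmlEncoding();
transformer.setOutputProperty( OutputKeys.ENCODING,
        declared != null ? declared : "UTF-8" );
ByteArrayOutputStream baos = new ByteArrayOutputStream();
transformer.transform( new DOMSource( document ), new StreamResult( baos ) );
// the output now keeps 0xa2 as a single byte, and the declaration stays ISO-8859-1
byte[] bytes = baos.toByteArray();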

Why not just open it with a normal FileInputStream and stream the bytes to the output stream directly from that? Why do you need to load it into DOM format in memory if you are just sending it byte for byte over an HttpURLConnection?
Edit: According to the Javadoc for Document, you should probably be using document.getXmlEncoding() to get what matches the encoding in the XML prolog.
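A minimal sketch of that pass-through approach (the URL is hypothetical; the point is that the bytes are never decoded into characters, so no encoding conversion can happen):
HttpURLConnection conn = (HttpURLConnection)
        new URL( "http://example.com/endpoint" ).openConnection();
conn.setRequestMethod( "POST" );
conn.setDoOutput( true );
// no charset parameter: the XML prolog already declares the encoding
conn.setRequestProperty( "Content-Type", "text/xml" );
try ( InputStream in = new FileInputStream( "input-file" );
      OutputStream out = conn.getOutputStream() ) {
    byte[] buf = new byte[8192];
    int n;
    while ( (n = in.read( buf )) != -1 ) {
        out.write( buf, 0, n ); // copy the raw bytes, untouched
    }
}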

This may be helpful - it's too long for a comment, but not really an answer. From the spec:
The encoding attribute specifies the preferred encoding to use for outputting the result tree. XSLT processors are required to respect values of UTF-8 and UTF-16. For other values, if the XSLT processor does not support the specified encoding it may signal an error; if it does not signal an error it should use UTF-8 or UTF-16 instead.
You may want to test with "encoding=junk", as it were, to see what it does.
The valid values for Java are described here. See also IANA charsets.
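A quick way to run that experiment, assuming a parsed document as in the question (a sketch; the behaviour genuinely varies by Transformer implementation, which is the point):
Transformer t = TransformerFactory.newInstance().newTransformer();
t.setOutputProperty( OutputKeys.ENCODING, "junk" ); // deliberately unsupported
ByteArrayOutputStream out = new ByteArrayOutputStream();
// per the spec quoted above, this may signal an error,
// or silently fall back to UTF-8/UTF-16
t.transform( new DOMSource( document ), new StreamResult( out ) );
System.out.println( out.toString( "ISO-8859-1" ) ); // inspect the prolog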

Related

DOM4J UTF-8 encoding writes umlauts (Ä, ü, ß) incorrectly

I'm using DOM4j for parsing and writing an XML tree which is always in UTF-8.
My XML file includes German special characters. Parsing them is not a problem, but when I write the tree back to a file, the special characters get converted to � characters.
I can't change the encoding of the XML file, as it is restricted to UTF-8.
Code
SAXReader xmlReader = new SAXReader();
xmlReader.setEncoding("UTF-8");
Document doc = xmlReader.read(file);
doc.setXMLEncoding("UTF-8");
Element root = doc.getRootElement();
// manipulate doc
OutputFormat format = new OutputFormat();
format.setEncoding("UTF-8");
XMLWriter writer = new XMLWriter(new FileWriter(file), format);
writer.write(doc);
writer.close();
Expected output
...
<statementText>This is a test!Ä Ü ß</statementText>
...
Actual output
...
<statementText>This is a test!� � �</statementText>
...
You are passing a FileWriter to the XMLWriter. A Writer already handles String or char[] data, so it already handles the encoding, which means the XMLWriter has no chance of influencing it.
Additionally, FileWriter is an especially problematic Writer type, since you can never specify which encoding it should use; it always uses the platform default encoding (often something like ISO-8859-1 on Windows and UTF-8 on Linux). It should basically never be used, for this reason.
To let the XMLWriter apply the configuration it is given, pass it an OutputStream instead (which handles byte[]). The most obvious one to use here is FileOutputStream:
XMLWriter writer = new XMLWriter(new FileOutputStream(file), format);
This is even documented in the JavaDoc for XMLWriter:
Warning: using your own Writer may cause the writer's preferred character encoding to be ignored. If you use encodings other than UTF8, we recommend using the method that takes an OutputStream instead.
Arguably the warning is a bit misleading, as the Writer can be problematic even if you intend to write UTF-8 data.
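Putting the corrected write together (a sketch; given an OutputStream, the XMLWriter performs the encoding itself using the OutputFormat):
OutputFormat format = new OutputFormat();
format.setEncoding("UTF-8");
// XMLWriter now encodes the characters itself, so the bytes on disk
// match the encoding declared in the XML prolog
try (FileOutputStream out = new FileOutputStream(file)) {
    XMLWriter writer = new XMLWriter(out, format);
    writer.write(doc);
    writer.flush();
}
If you genuinely need a Writer, construct one with an explicit charset, e.g. new OutputStreamWriter(new FileOutputStream(file), StandardCharsets.UTF_8), so the platform default never comes into play.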

Parse XML file containing umlaute using SAX parser

I have looked through a lot of posts regarding the same problem, but I can't figure it out. I'm trying to parse an XML file with umlauts in it. This is what I have now:
File file = new File(this.xmlConfig);
InputStream inputStream= new FileInputStream(file);
Reader reader = new InputStreamReader(inputStream,"UTF-8");
InputSource is = new InputSource(reader);
is.setEncoding("UTF-8");
saxParser.parse(is, handlerConfig);
But it doesn't handle the umlauts properly: Ä, Ü and Ö come out as weird characters. The file is definitely UTF-8, and it is declared as such in the first line: <?xml version="1.0" encoding="utf-8"?>
What am I doing wrong?
First rule: Don't second-guess the encoding used in the XML document. Always use byte streams to parse XML documents:
InputStream inputStream= new FileInputStream(this.xmlConfig);
InputSource is = new InputSource(inputStream);
saxParser.parse(is, handlerConfig);
If that doesn't work, the <?xml version=".." encoding="UTF-8" ?> (or whatever) in the XML is wrong, and you have to take it from there.
Second rule: Make sure you inspect the result with a tool that supports the encoding used in the target, or result, document. Have you?
Third rule: Check the byte values in the source document. Bring up your favourite hex editor/viewer and inspect the content (or see the sketch after these rules). For example, the letter Ä should be the byte sequence 0xC3 0x84 if the encoding is UTF-8.
Fourth rule: If it doesn't look correct, suspect that the UTF-8 source is being viewed, or interpreted, as an ISO-8859-1 source. Verify this by comparing the first and second byte from the UTF-8 source with the ISO-8859-1 code charts.
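If you don't have a hex editor handy, a few lines of Java perform the same check as the third rule (a sketch, reusing the xmlConfig path from the question):
// dump the first bytes of the source document as hex;
// in UTF-8, the letter Ä should show up as C3 84
try (InputStream in = new FileInputStream(this.xmlConfig)) {
    byte[] head = new byte[64];
    int n = in.read(head);
    for (int i = 0; i < n; i++) {
        System.out.printf("%02X ", head[i]);
    }
}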
UPDATE:
The byte sequence for the Unicode letter ä (LATIN SMALL LETTER A WITH DIAERESIS, U+00E4) is 0xC3 0xA4 in the UTF-8 encoding. If you use a viewing tool that only understands (or is configured to interpret the source as) the ISO-8859-1 encoding, the first byte, 0xC3, is the letter Ã, and the second byte is the letter ¤, the currency sign (U+00A4), which may look like a circle.
Hence, the "TextView" thingy in Android is interpreting your input as an ISO-8859-1 stream. I have no idea if it is possible to change that or not. But if you have your parsing result as a String or a byte array, you could convert that to an ISO-8859-1 stream (or byte array), and then feed it to "TextView".

How can I convert a String to a Document (DOM) with ISO-8859-1 charset

I'm converting a string received in a web service to an XML Document (DOM), like this:
Document file = null;
String xmlFile = "blablabla"; // Latin-1 encoded
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
this.file = builder.parse(new InputSource(new StringReader(xmlFile)));
But the string is encoded with ISO-8859-1 (Latin-1), and when I read a node of this Document, I get some errors. How can I correctly create a DOM object with ISO-8859-1 encoding? Or how can I read a node into a String with Latin-1 encoding?
try this:
this.file = builder.parse(new ByteArrayInputStream(xmlFile.getBytes("ISO-8859-1")));
Foreword
Strings have no encoding, as they represent a sequence of characters (abstract entities defined in the Unicode standard).
Byte sequences have an encoding and may be interpreted as a sequence of characters (provided that you tell Java how to interpret them).
Your problem
In your case, the data is stored in a String, so it has already been interpreted as a sequence of characters - and apparently the interpretation was incorrect.
Depending on your problem and the way you know the encoding of your data, there are 2 options:
Solution 1 (probably the best):
DO NOT INTERPRET the data you receive; keep it as a byte sequence (a stream or byte[]). Then pass this byte sequence directly to your DOM parser (it will correctly decode the XML file, whatever its encoding, provided that the markup is correct); see the sketch after solution 2.
Solution 2 (may be the only one possible, depending on the way you get the data):
Re-encode the String as a byte array, as mentioned in ThOrndike's answer:
this.file = builder.parse(new ByteArrayInputStream(xmlFile.getBytes("ISO-8859-1")));
This will only work if you are sure the String was correctly interpreted in the first place.
Apparently that is not the case here: it seems that the library that gives you the String has already interpreted the bytes as UTF-8, replacing all erroneous bytes with '?' (the standard behavior of UTF-8 decoders). In that case you cannot do anything, as the original bytes have been lost.
Your only hope is solution 1, or finding a way to force the library that gives you the String to interpret it correctly.
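A sketch of solution 1, assuming you can reach the raw bytes before anything decodes them (rawInputStream is hypothetical here, e.g. the web service request body):
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
// the parser sniffs the encoding from the XML prolog itself, so ISO-8859-1,
// UTF-8, etc. are all decoded correctly without any guessing on your part
Document document = builder.parse(new InputSource(rawInputStream));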

Filtering Wikipedia's XML dump: error on some accents

I'm trying to index Wikipedia dumps. My SAX parser makes Article objects for the XML with only the fields I care about, then sends them to my ArticleSink, which produces Lucene Documents.
I want to filter special/meta pages like those prefixed with Category: or Wikipedia:, so I made an array of those prefixes and test the title of each page against this array in my ArticleSink, using article.getTitle().startsWith(prefix). In English, everything works fine: I get a Lucene index with all the pages except for the matching prefixes.
In French, the prefixes with no accent also work (i.e. filter the corresponding pages), some of the accented prefixes don't work at all (like Catégorie:), and some work most of the time but fail on some pages (like Wikipédia:), yet I cannot see any difference between the corresponding lines (in less).
I can't really inspect all the differences in the file because of its size (5 GB), but it looks like correct UTF-8 XML. If I take a portion of the file using grep or head, the accents are correct (even on the incriminated pages; the <title>Catégorie:something</title> is correctly displayed by grep). On the other hand, when I recreate a wiki XML by tail/head-cutting the original file, the same page (here Catégorie:Rock par ville) gets filtered in the small file but not in the original…
Any ideas?
Alternatives I tried:
Getting the file (commented lines were tried without success*):
FileInputStream fis = new FileInputStream(new File(xmlFileName));
//ReaderInputStream ris = ReaderInputStream.forceEncodingInputStream(fis, "UTF-8" );
//(custom function opening the stream,
//reading it as UTF-8 into a Reader and returning another byte stream)
//InputSource is = new InputSource( fis ); is.setEncoding("UTF-8");
parser.parse(fis, handler);
Filtered prefixes:
ignoredPrefix = new String[] {"Catégorie:", "Modèle:", "Wikipédia:",
"Cat\uFFFDgorie:", "Mod\uFFFDle:", "Wikip\uFFFDdia:", //invalid char
"Catégorie:", "Modèle:", "Wikipédia:", // UTF-8 as ISO-8859-1
"Image:", "Portail:", "Fichier:", "Aide:", "Projet:"}; // those last always work
* ERRATUM
Actually, my bad: that one I tried does work; I had tested the wrong index:
InputSource is = new InputSource( fis );
is.setEncoding("UTF-8"); // force UTF-8 interpretation
parser.parse(fis, handler);
Since you write the prefixes as plain strings into your source file, you want to make sure that you save that .java file in UTF-8, too (or any other encoding that supports the special characters you're using). Then, however, you have to tell the compiler which encoding the file is in with the -encoding flag:
javac -encoding utf-8 *.java
For the XML source, you could try
Reader r = new InputStreamReader(new FileInputStream(xmlFileName), "UTF-8");
InputStreams do not deal with encodings, since they are byte-based, not character-based. So here we create a Reader from a FileInputStream: the latter (the stream) doesn't know about encodings, but the former (the reader) does, because we give it the encoding in the constructor.
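Wired into the parse call from the question, that looks like this (a sketch):
Reader r = new InputStreamReader(new FileInputStream(xmlFileName), "UTF-8");
InputSource is = new InputSource(r);
// with a character stream, the parser reads it directly and ignores the
// encoding pseudo-attribute in the XML declaration
parser.parse(is, handler);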

Converting document encoding when reading with dom4j

Is there any way I can convert a document being parsed by dom4j's SAXReader from the ISO-8859-2 encoding to UTF-8? I need that to happen while parsing, so that the objects created by dom4j are already Unicode/UTF-8 and running code such as:
"some text".equals(node.getText());
returns true.
This is done automatically by dom4j. All String instances in Java are in a common, decoded form; once a String is created, it isn't possible to tell what the original character encoding was (or even if the string was created from encoded bytes).
Just make sure that the XML document has the character encoding specified (which is required unless it is UTF-8).
The decoding happens in (or before) the InputSource (before the SAXReader). From that class's javadocs:
The SAX parser will use the InputSource object to determine how to read XML input. If there is a character stream available, the parser will read that stream directly, disregarding any text encoding declaration found in that stream. If there is no character stream, but there is a byte stream, the parser will use that byte stream, using the encoding specified in the InputSource or else (if no encoding is specified) autodetecting the character encoding using an algorithm such as the one in the XML specification. If neither a character stream nor a byte stream is available, the parser will attempt to open a URI connection to the resource identified by the system identifier.
So it depends on how you are creating the InputSource. To guarantee the proper decoding you can use something like the following:
InputStream stream = <input source>
Charset charset = Charset.forName("ISO-8859-2");
Reader reader = new BufferedReader(new InputStreamReader(stream, charset));
InputSource source = new InputSource(reader);
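That InputSource can then go straight to dom4j (a short usage sketch):
SAXReader xmlReader = new SAXReader();
Document document = xmlReader.read(source);
// strings in the resulting tree are already decoded, so
// "some text".equals(node.getText()) compares characters, not bytes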
