I get XML from a third party encoded in UTF-8, and I need to send it on to another third party, but in ISO-8859-1 encoding. The XML contains many different languages, e.g. Russian in Cyrillic. I know that it's technically impossible to represent everything in UTF-8 directly in ISO-8859-1, but I found StringEscapeUtils.escapeXml(); when I use that method, the whole XML is escaped, including <, > and so on, whereas I only want to convert the Cyrillic characters to numeric character references. Does such a method exist in Java, or does escaping always apply to the whole XML? Is there another way to convert only the characters that can't be encoded in ISO-8859-1 to numeric character references?
I've seen similar questions on SO, like How do I convert between ISO-8859-1 and UTF-8 in Java?, but they don't mention numeric character references.
UPDATE: Removed unnecessary DOM loading.
Use the XML transformer. It knows how to escape characters that are not supported by the given output encoding, writing them as numeric character references.
Example
Transformer transformer = TransformerFactory.newInstance().newTransformer();
// Convert XML file to UTF-8 encoding
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
transformer.transform(new StreamSource(new File("test.xml")),
                      new StreamResult(new File("test-utf8.xml")));
// Convert XML file to ISO-8859-1 encoding
transformer.setOutputProperty(OutputKeys.ENCODING, "ISO-8859-1");
transformer.transform(new StreamSource(new File("test.xml")),
                      new StreamResult(new File("test-8859-1.xml")));
test.xml (input, UTF-8)
<?xml version="1.0" encoding="UTF-8"?>
<test>
<english>Hello World</english>
<portuguese>Olá Mundo</portuguese>
<czech>Ahoj světe</czech>
<russian>Привет мир</russian>
<chinese>你好,世界</chinese>
<emoji>👋 🌎</emoji>
</test>
Translated by https://translate.google.com (except emoji)
test-utf8.xml (output, UTF-8)
<?xml version="1.0" encoding="UTF-8"?><test>
<english>Hello World</english>
<portuguese>Olá Mundo</portuguese>
<czech>Ahoj světe</czech>
<russian>Привет мир</russian>
<chinese>你好,世界</chinese>
<emoji>👋 🌎</emoji>
</test>
test-8859-1.xml (output, ISO-8859-1)
<?xml version="1.0" encoding="ISO-8859-1"?><test>
<english>Hello World</english>
<portuguese>Olá Mundo</portuguese>
<czech>Ahoj sv&#283;te</czech>
<russian>&#1055;&#1088;&#1080;&#1074;&#1077;&#1090; &#1084;&#1080;&#1088;</russian>
<chinese>&#20320;&#22909;&#65292;&#19990;&#30028;</chinese>
<emoji>&#128075; &#127758;</emoji>
</test>
If you replace the test.xml with the test-8859-1.xml file (copy/paste/rename), you still get the same outputs, since the parser both auto-detects the encoding and unescapes all the escaped characters.
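The round trip in code, reusing the transformer from the example above (the output file name here is just a placeholder):
// Re-reading the escaped ISO-8859-1 file and writing it back out as UTF-8
// restores the literal characters, because the parser expands the references.
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
transformer.transform(new StreamSource(new File("test-8859-1.xml")),
                      new StreamResult(new File("test-roundtrip-utf8.xml")));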
Related
I have a problem with XML encoding.
When I create the XML on localhost with cp1251 encoding, everything is fine.
But when I deploy my module on the server, the XML file has incorrect symbols like "ФайлПФР".
StringWriter writer = new StringWriter();
StreamResult result = new StreamResult(writer);
DOMSource source = new DOMSource(doc);
transformer.setOutputProperty(OutputKeys.ENCODING, "cp1251");
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.transform(source, result);
String attach = writer.toString();
How can I fix it?
I tried to read an XML Document which was UTF-8 encoded, and attempted to transform it with a different encoding, which had no effect at all (the existing encoding of the document was used instead of the one I specified with the output property). When creating a new Document in memory (encoding is null), the output property was used correctly.
Looks like when transforming an XML Document, the output property OutputKeys.ENCODING is only used when the org.w3c.dom.Document does not have an encoding yet.
Solution
To change the encoding of an XML Document, don't use the Document as the source, but its root node (the document element) instead.
// use doc.getDocumentElement() instead of doc
DOMSource source = new DOMSource(doc.getDocumentElement());
Works like a charm.
Source document:
<?xml version="1.0" encoding="UTF-8"?>
<foo bla="Grüezi">
Encoding test äöüÄÖÜ «Test»
</foo>
Output with "cp1251":
<?xml version="1.0" encoding="WINDOWS-1251"?><foo bla="Grüezi">
Encoding test äöüÄÖÜ «Test»
</foo>
A (String)Writer is not affected by an output encoding (only by the input encoding used), as Java keeps all text in Unicode internally. Either write to a binary stream, or encode the string as Cp1251 yourself.
Note that the encoding should also appear in the <?xml ... encoding="Windows-1251"?> line. And I guess "Cp1251" is a bit more Java-specific.
So the error probably lies in the writing of the string; for instance
response.setCharacterEncoding("Windows-1251");
response.getWriter().write(attach);
Or
attach.getBytes("Windows-1251")
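For the "write to binary" option, a minimal sketch reusing the transformer and doc from the question (the ByteArrayOutputStream is just one possible sink, chosen for illustration):
// Let the transformer serialize straight to bytes, so it performs the Cp1251 encoding itself
// (document element as the source, per the workaround above)
ByteArrayOutputStream out = new ByteArrayOutputStream();
transformer.setOutputProperty(OutputKeys.ENCODING, "cp1251");
transformer.transform(new DOMSource(doc.getDocumentElement()), new StreamResult(out));
byte[] attachBytes = out.toByteArray(); // Cp1251-encoded XML, ready to be written as binary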
I have parsed an XML file which has UTF-8 encoding; it parses fine.
I did not change any encoding in the XML file.
The xml header for UTF-8 looks like:
<?xml version="1.0" encoding="UTF-8"?>
There is no error for the above format.
Now suppose I have another file to check for well-formedness, which has a header like:
<?xml version="1.0" encoding="UTF-16"?>
How do I resolve this error?
Java XML parsers usually receive their input wrapped in an InputSource object. This can be constructed with a Reader parameter that does the character decoding for the given charset.
InputStream in = ...
InputSource is = new InputSource(new InputStreamReader(in, "utf-16"));
For the "utf-16" charset the stream should start with a byte order mark, if that is not the case use either "utf-16le" or "utf-16be".
I have looked through a lot of posts regarding the same problem, but I can't figure it out. I'm trying to parse an XML file with umlauts in it. This is what I have now:
File file = new File(this.xmlConfig);
InputStream inputStream= new FileInputStream(file);
Reader reader = new InputStreamReader(inputStream,"UTF-8");
InputSource is = new InputSource(reader);
is.setEncoding("UTF-8");
saxParser.parse(is, handlerConfig);
But it doesn't read the umlauts properly: Ä, Ü and Ö come out as weird characters. The file is definitely in UTF-8, and it is declared as such in the first line: <?xml version="1.0" encoding="utf-8"?>
What am I doing wrong?
First rule: Don't second-guess the encoding used in the XML document. Always use byte streams to parse XML documents:
InputStream inputStream= new FileInputStream(this.xmlConfig);
InputSource is = new InputSource(inputStream);
saxParser.parse(is, handlerConfig);
If that doesn't work, the <?xml version=".." encoding="UTF-8" ?> (or whatever) in the XML is wrong, and you have to take it from there.
Second rule: Make sure you inspect the result with a tool that supports the encoding used in the target (result) document. Have you?
Third rule: Check the byte values in the source document. Bring up your favourite HEX editor/viewer and inspect the content. For example, the letter Ä should be the byte sequence 0xC3 0x84, if the encoding is UTF-8.
Fourth rule: If it doesn't look correct, always suspect that the UTF-8 source is being viewed, or interpreted, as an ISO-8859-1 source. Verify this by comparing the first and second byte from the UTF-8 source with the ISO 8859-1 code charts.
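If you don't have a hex editor handy, a quick sketch for checking the byte values from Java itself (the file name is just a placeholder for your source document):
// Dump the first bytes in hex so they can be compared against the UTF-8 / ISO-8859-1 tables
byte[] bytes = Files.readAllBytes(Paths.get("config.xml"));
for (int i = 0; i < Math.min(bytes.length, 64); i++) {
    System.out.printf("%02X ", bytes[i] & 0xFF);
}
System.out.println();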
UPDATE:
The byte sequence for the Unicode letter ä (latin small letter a with diaeresis, U+00E4) is 0xC3 0xA4 in the UTF-8 encoding. If you use a viewing tool that only understands (or is configured to interpret the source as) ISO-8859-1 encoding, the first byte, 0xC3, is the letter Ã, and the second byte is the letter ¤, the currency sign (Unicode U+00A4), which may look like a small circle.
Hence, the "TextView" thingy in Android is interpreting your input as an ISO-8859-1 stream. I have no idea if it is possible to change that or not. But if you have your parsing result as a String or a byte array, you could convert that to a ISO-8859-1 stream (or byte array), and then feed it to "TextView".
<?xml version="1.0" encoding="UTF-16"?>
<note>
<from>Jani</from>
<to>ALOK</to>
<message>AshuTosh</message>
</note>
I have an XML parser which supports only UTF-8 encoding; otherwise it gives a SAX parser exception. How can I convert the UTF-16 to UTF-8?
In that case it's not an XML parser that you are using; see section 2.2 of the XML specification:
All XML processors MUST accept the UTF-8 and UTF-16 encodings of Unicode
Java XML parsers usually receive their input wrapped in an InputSource object. This can be constructed with a Reader parameter that does the character decoding for the given charset.
InputStream in = ...
InputSource is = new InputSource(new InputStreamReader(in, "utf-16"));
For the "utf-16" charset the stream should start with a byte order mark, if that is not the case use either "utf-16le" or "utf-16be".
Hi, I have some simple code:
InputSource is = new InputSource(new StringReader(xml));
Document d = documentBuilder.parse(is);
StringWriter result = new StringWriter();
DOMSource ds = new DOMSource(d);
Transformer t = TransformerFactory.newInstance().newTransformer();
t.setOutputProperty(OutputKeys.INDENT, "yes");
t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no");
t.setOutputProperty(OutputKeys.STANDALONE, "yes");
t.setOutputProperty(OutputKeys.ENCODING, "UTF-16");
t.transform(ds, new StreamResult(result));
return result.toString();
which should transform an XML document to UTF-16 encoding. I know that the internal representation of String in the JVM already uses UTF-16 chars, but my expectation is that the resulting String should contain a header where the encoding is set to "UTF-16" (the original XML was UTF-8). Instead I get:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
(also the standalone property seems to be wrong)
The transformer instance is: com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl
(which I think is the default)
So what am I missing here?
Use a writer where you explicitly declare UTF-16 as the output encoding. Try OutputStreamWriter(OutputStream out, String charsetName), wrapping a ByteArrayOutputStream, and see if this works.
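A minimal sketch of that suggestion, reusing t and ds (with the UTF-16 output property already set) from the code above; the variable names are just for illustration:
// The OutputStreamWriter performs the actual UTF-16 encoding of the serialized output
ByteArrayOutputStream bos = new ByteArrayOutputStream();
Writer writer = new OutputStreamWriter(bos, "UTF-16");
t.transform(ds, new StreamResult(writer));
writer.flush();
byte[] utf16Xml = bos.toByteArray(); // genuinely UTF-16-encoded bytes
If the declaration still claims UTF-8, the DOMSource(doc.getDocumentElement()) workaround described earlier on this page may help, since a parsed document's own encoding can override the ENCODING output property.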
I have now written a test of my own, with one minor change:
t.transform(ds,new StreamResult(new File("dest.xml")));
I get the same results, but the file is indeed UTF-16 encoded (checked with a hex editor). For some strange reason the XML declaration is not changed. So your code works.