UTF-16 Encoding - java

<?xml version="1.0" encoding="UTF-16"?>
<note>
<from>Jani</from>
<to>ALOK</to>
<message>AshuTosh</message>
</note>
I have an XML parser that supports only UTF-8 encoding; otherwise it throws a SAX parser exception. How can I convert the UTF-16 input to UTF-8?

In that case it's not an XML parser that you are using; see section 2.2 of the XML specification:
All XML processors MUST accept the UTF-8 and UTF-16 encodings of Unicode
Java xml parsers usually receive their input wrapped in an InputSource object. This can be constructed with a Reader parameter that does the character decoding for the given charset.
InputStream in = ...
InputSource is = new InputSource(new InputStreamReader(in, "utf-16"));
For the "utf-16" charset the stream should start with a byte order mark; if that is not the case, use either "utf-16le" or "utf-16be".
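Putting the answer together, here is a minimal, self-contained sketch; the in-memory document is illustrative, and in practice `in` would be your file or network stream:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class Utf16Parse {
    public static void main(String[] args) throws Exception {
        // Java's UTF-16 encoder writes a big-endian byte order mark.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-16\"?><note><from>Jani</from></note>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_16));

        // The Reader performs the UTF-16 decoding (consuming the BOM), so the
        // parser only ever sees already-decoded characters.
        InputSource is = new InputSource(new InputStreamReader(in, StandardCharsets.UTF_16));
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(is);
        System.out.println(doc.getDocumentElement().getTagName()); // prints "note"
    }
}
```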

Related

Convert UTF-8 to ISO-8859-1 with Numeric Character Reference

I get XML from a third party encoded in UTF-8 and I need to send it to another third party in ISO-8859-1 encoding. The XML contains many different languages, e.g. Russian in Cyrillic. I know that it's technically impossible to directly convert UTF-8 into ISO-8859-1; however, I found StringEscapeUtils.escapeXml(), but that method escapes the whole XML, including < and >, whereas I only want to convert the Cyrillic text to numeric character references. Does such a method exist in Java, or does it always escape the whole XML? Is there another way to convert only the characters that cannot be encoded in ISO-8859-1 into numeric character references?
I've seen similar questions on SO, such as: How do I convert between ISO-8859-1 and UTF-8 in Java? but they don't mention numeric character references.
UPDATE: Removed unnecessary DOM loading.
Use the XML transformer. It knows how to XML escape characters that are not supported by the given encoding.
Example
Transformer transformer = TransformerFactory.newInstance().newTransformer();

// Convert XML file to UTF-8 encoding
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
transformer.transform(new StreamSource(new File("test.xml")),
        new StreamResult(new File("test-utf8.xml")));

// Convert XML file to ISO-8859-1 encoding
transformer.setOutputProperty(OutputKeys.ENCODING, "ISO-8859-1");
transformer.transform(new StreamSource(new File("test.xml")),
        new StreamResult(new File("test-8859-1.xml")));
test.xml (input, UTF-8)
<?xml version="1.0" encoding="UTF-8"?>
<test>
<english>Hello World</english>
<portuguese>Olá Mundo</portuguese>
<czech>Ahoj světe</czech>
<russian>Привет мир</russian>
<chinese>你好,世界</chinese>
<emoji>👋 🌎</emoji>
</test>
Translated by https://translate.google.com (except emoji)
test-utf8.xml (output, UTF-8)
<?xml version="1.0" encoding="UTF-8"?><test>
<english>Hello World</english>
<portuguese>Olá Mundo</portuguese>
<czech>Ahoj světe</czech>
<russian>Привет мир</russian>
<chinese>你好,世界</chinese>
<emoji>👋 🌎</emoji>
</test>
test-8859-1.xml (output, ISO-8859-1)
<?xml version="1.0" encoding="ISO-8859-1"?><test>
<english>Hello World</english>
<portuguese>Olá Mundo</portuguese>
<czech>Ahoj sv&#283;te</czech>
<russian>&#1055;&#1088;&#1080;&#1074;&#1077;&#1090; &#1084;&#1080;&#1088;</russian>
<chinese>&#20320;&#22909;&#65292;&#19990;&#30028;</chinese>
<emoji>&#128075; &#127758;</emoji>
</test>
If you replace the test.xml with the test-8859-1.xml file (copy/paste/rename), you still get the same outputs, since the parser both auto-detects the encoding and unescapes all the escaped characters.

How to check well formedness of xml file which has encoding type of UTF-16 using Java?

I have parsed an XML file with encoding UTF-8, and it parses well.
I did not change the encoding declared in the file.
The XML header for UTF-8 looks like:
<?xml version="1.0" encoding="UTF-8"?>
There is no error for the above format.
Now suppose I have to check the well-formedness of another file whose header looks like:
<?xml version="1.0" encoding="UTF-16"?>
Parsing that file gives an error. How can I resolve it?
Java xml parsers usually receive their input wrapped in an InputSource object. This can be constructed with a Reader parameter that does the character decoding for the given charset.
InputStream in = ...
InputSource is = new InputSource(new InputStreamReader(in, "utf-16"));
For the "utf-16" charset the stream should start with a byte order mark; if that is not the case, use either "utf-16le" or "utf-16be".
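One way to handle the no-BOM case mentioned above is to peek at the first two bytes before choosing the charset name. This is a sketch under the assumption that BOM-less input from your source happens to be little-endian; swap the fallback to "UTF-16BE" if yours is big-endian:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;

public class Utf16Bom {
    // Peek at the first two bytes; if they are a UTF-16 BOM, "UTF-16" is safe,
    // otherwise an explicit endianness must be chosen (UTF-16LE is only a guess).
    static String detectUtf16Charset(PushbackInputStream in) throws IOException {
        byte[] bom = new byte[2];
        int n = in.read(bom);
        if (n > 0) in.unread(bom, 0, n); // put the bytes back for the parser
        if (n == 2 && ((bom[0] == (byte) 0xFE && bom[1] == (byte) 0xFF)
                    || (bom[0] == (byte) 0xFF && bom[1] == (byte) 0xFE))) {
            return "UTF-16";   // BOM present: the decoder picks the endianness itself
        }
        return "UTF-16LE";     // no BOM: endianness must be stated explicitly
    }

    public static void main(String[] args) throws Exception {
        byte[] withBom = {(byte) 0xFE, (byte) 0xFF, 0, '<'};
        PushbackInputStream in =
                new PushbackInputStream(new ByteArrayInputStream(withBom), 2);
        System.out.println(detectUtf16Charset(in)); // prints "UTF-16"
    }
}
```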

Parse XML file containing umlaute using SAX parser

I have looked through a lot of posts regarding the same problem, but I can't figure it out. I am trying to parse an XML file with umlauts in it. This is what I have now:
File file = new File(this.xmlConfig);
InputStream inputStream= new FileInputStream(file);
Reader reader = new InputStreamReader(inputStream,"UTF-8");
InputSource is = new InputSource(reader);
is.setEncoding("UTF-8");
saxParser.parse(is, handlerConfig);
But it won't read the umlauts properly: Ä, Ü and Ö come out as weird characters. The file is definitely in UTF-8, and it is declared as such in its first line: <?xml version="1.0" encoding="utf-8"?>
What am I doing wrong?
First rule: Don't second guess the encoding used in the XML document. Always use byte streams to parse XML documents:
InputStream inputStream= new FileInputStream(this.xmlConfig);
InputSource is = new InputSource(inputStream);
saxParser.parse(is, handlerConfig);
If that doesn't work, the <?xml version=".." encoding="UTF-8" ?> (or whatever) in the XML is wrong, and you have to take it from there.
Second rule: Make sure you inspect the result with a tool that supports the encoding used in the target, or result, document. Have you?
Third rule: Check the byte values in the source document. Bring up your favourite HEX editor/viewer and inspect the content. For example, the letter Ä should be the byte sequence 0xC3 0x84, if the encoding is UTF-8.
Fourth rule: If it doesn't look correct, always suspect that the UTF-8 source is viewed, or interpreted, as an ISO-8859-1 source. Verify this by comparing the first and second byte from the UTF-8 source with the ISO-8859-1 code charts.
UPDATE:
The byte sequence for the Unicode letter ä (latin small letter a with diaeresis, U+00E4) is 0xC3 0xA4 in the UTF-8 encoding. If you use a viewing tool that only understands (or is configured to interpret the source as) the ISO-8859-1 encoding, the first byte, 0xC3, is the letter Ã, and the second byte, 0xA4, is the currency sign ¤ (Unicode U+00A4), which may look like a circle.
Hence, the "TextView" thingy in Android is interpreting your input as an ISO-8859-1 stream. I have no idea whether it is possible to change that. But if you have your parsing result as a String or a byte array, you could convert it to an ISO-8859-1 stream (or byte array) and then feed it to "TextView".
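The mix-up described in the update is easy to reproduce in a couple of lines; the sketch below just re-encodes a string, without touching any parser:

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        // "ä" (U+00E4) encodes to the two bytes 0xC3 0xA4 in UTF-8.
        byte[] utf8 = "ä".getBytes(StandardCharsets.UTF_8);

        // Decoding those same bytes as ISO-8859-1 maps each byte to one
        // character: 0xC3 -> 'Ã' and 0xA4 -> '¤' (the currency sign).
        String misread = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(misread); // prints "Ã¤"
    }
}
```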

Why using InputSource fixes SAX parser when file contains special UTF-8 characters

I'm looking for an explanation of why my SAX parser fails when certain special UTF-8 characters appear in my XML file.
To parse the XML file I use: Document doc = builder.parse(file);
However, when I wrap the file in an InputSource it works fine:
DocumentBuilder builder = factory.newDocumentBuilder();
InputStream in = new FileInputStream(file);
InputSource inputSource = new InputSource(new InputStreamReader(in));
Document doc = builder.parse(inputSource);
I don't quite understand why the latter works. I've seen examples of it being used, but no explanation of why it works.
Does the second approach parse a string rather than a file, so that the encoding will be UTF-8?
I suspect your document isn't really in the encoding you've declared. This line:
InputSource inputSource = new InputSource(new InputStreamReader(in));
will use the platform default encoding to convert the binary data into text within InputStreamReader. The XML parser doesn't get to do it any more - it doesn't get to see the raw bytes.
If this is working, your XML file is probably subtly bust - it may be declaring that it's in UTF-8, but using the platform default encoding (e.g. Windows-1252). Rather than use the workaround, you should fix the XML if you have any choice about it.
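The effect can be demonstrated with a document whose bytes are really ISO-8859-1 even though its declaration says UTF-8 (the content below is made up for the sketch, and the reader's charset is fixed to ISO-8859-1 rather than the platform default so the result is deterministic):

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class DeclaredVsActual {
    public static void main(String[] args) throws Exception {
        // Declares UTF-8 but is encoded as ISO-8859-1: 'é' becomes the single
        // byte 0xE9, which is not a valid UTF-8 sequence.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><r>café</r>";
        byte[] bytes = xml.getBytes(StandardCharsets.ISO_8859_1);

        // Byte stream: the parser honours the (wrong) declaration and fails.
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes));
        } catch (Exception e) { // reported as a SAX or CharConversion exception
            System.out.println("byte stream: rejected as invalid UTF-8");
        }

        // Character stream: the Reader has already decoded the bytes, so the
        // declaration is ignored and parsing "works" - the real bug is hidden.
        InputSource src = new InputSource(new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.ISO_8859_1));
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(src);
        System.out.println(doc.getDocumentElement().getTextContent()); // prints "café"
    }
}
```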

Converting document encoding when reading with dom4j

Is there any way I can convert a document being parsed by dom4j's SAXReader from the ISO-8859-2 encoding to UTF-8? I need that to happen while parsing, so that the objects created by dom4j are already Unicode/UTF-8 and running code such as:
"some text".equals(node.getText());
returns true.
This is done automatically by dom4j. All String instances in Java are in a common, decoded form; once a String is created, it isn't possible to tell what the original character encoding was (or even if the string was created from encoded bytes).
Just make sure that the XML document has its character encoding specified (which is required unless it is UTF-8 or UTF-16).
The decoding happens in (or before) the InputSource (before the SAXReader). From that class's javadocs:
The SAX parser will use the InputSource object to determine how to read XML input. If there is a character stream available, the parser will read that stream directly, disregarding any text encoding declaration found in that stream. If there is no character stream, but there is a byte stream, the parser will use that byte stream, using the encoding specified in the InputSource or else (if no encoding is specified) autodetecting the character encoding using an algorithm such as the one in the XML specification. If neither a character stream nor a byte stream is available, the parser will attempt to open a URI connection to the resource identified by the system identifier.
So it depends on how you are creating the InputSource. To guarantee the proper decoding you can use something like the following:
InputStream stream = <input source>
Charset charset = Charset.forName("ISO-8859-2");
Reader reader = new BufferedReader(new InputStreamReader(stream, charset));
InputSource source = new InputSource(reader);
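dom4j's SAXReader.read(InputSource) follows the same InputSource rules, so the decoding behaviour can be sketched with the JDK parser alone (the document content is invented; for dom4j, substitute new SAXReader().read(source) for the DocumentBuilder call):

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class Iso88592Decode {
    public static void main(String[] args) throws Exception {
        Charset charset = Charset.forName("ISO-8859-2");

        // Czech text, encodable in ISO-8859-2 but not in ISO-8859-1.
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-2\"?><t>ěšč</t>";
        InputStream stream = new ByteArrayInputStream(xml.getBytes(charset));

        // The Reader does the ISO-8859-2 decoding; node text then comes back
        // as ordinary Java Strings, comparable with equals().
        Reader reader = new BufferedReader(new InputStreamReader(stream, charset));
        InputSource source = new InputSource(reader);
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(source);
        System.out.println("ěšč".equals(doc.getDocumentElement().getTextContent())); // prints "true"
    }
}
```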
