Hi, I have some simple code:
InputSource is = new InputSource(new StringReader(xml));
Document d = documentBuilder.parse(is);
StringWriter result = new StringWriter();
DOMSource ds = new DOMSource(d);
Transformer t = TransformerFactory.newInstance().newTransformer();
t.setOutputProperty(OutputKeys.INDENT, "yes");
t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no");
t.setOutputProperty(OutputKeys.STANDALONE, "yes");
t.setOutputProperty(OutputKeys.ENCODING, "UTF-16");
t.transform(ds, new StreamResult(result));
return result.toString();
which should transform an XML document to UTF-16 encoding. I know the internal representation of a String in the JVM is already UTF-16, but my expectation is that the resulting String should contain a declaration where the encoding is set to "UTF-16"; the original XML declared UTF-8. Instead I get:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
(also the standalone property seems to be wrong)
The transformer instance is: com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl
(what I think is a default)
So what am I missing here?
Use a writer where you explicitly declare UTF-16 as the output encoding. Try OutputStreamWriter(OutputStream out, String charsetName) wrapping a ByteArrayOutputStream and see if this works.
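A minimal sketch of that suggestion (class and method names here are made up for illustration). This writes real UTF-16 bytes; whether the declaration echoes "UTF-16" may still depend on the transformer quirk discussed in this thread:

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class Utf16Output {

    // Serialize the document to real UTF-16 bytes via a wrapped
    // ByteArrayOutputStream instead of a StringWriter (a StringWriter
    // holds chars and has no byte encoding at all).
    public static byte[] toUtf16Bytes(String xml) throws Exception {
        Document d = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.ENCODING, "UTF-16");

        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        OutputStreamWriter osw = new OutputStreamWriter(baos, "UTF-16");
        t.transform(new DOMSource(d), new StreamResult(osw));
        osw.flush();
        return baos.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = toUtf16Bytes("<root>hello</root>");
        // decoding with BOM-aware "UTF-16" recovers the document
        System.out.println(new String(bytes, "UTF-16"));
    }
}
```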
I wrote a test of my own now, with one minor change:
t.transform(ds,new StreamResult(new File("dest.xml")));
I get the same result, but the file is indeed UTF-16 encoded (checked with a hex editor). For some strange reason the XML declaration is not changed, but your approach works.
Related
I receive XML from a third party with UTF-8 encoding and I need to send it to another third party, but with ISO-8859-1 encoding. The XML contains many different languages, e.g. Russian in Cyrillic. I know that it's technically impossible to directly convert all of UTF-8 into ISO-8859-1; however, I found StringEscapeUtils.escapeXML(), but when using this method the whole XML is escaped, even < and >, and I only want to convert the Cyrillic to numeric character references. Does such a method exist in Java, or does it always escape the whole XML? Is there another possibility to convert only the characters which can't be encoded in ISO-8859-1 to numeric character references?
I've seen similar questions on SO, like How do I convert between ISO-8859-1 and UTF-8 in Java?, but they don't mention numeric character references.
UPDATE: Removed unnecessary DOM loading.
Use the XML transformer. It knows how to XML escape characters that are not supported by the given encoding.
Example
Transformer transformer = TransformerFactory.newInstance().newTransformer();

// Convert XML file to UTF-8 encoding
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
transformer.transform(new StreamSource(new File("test.xml")),
        new StreamResult(new File("test-utf8.xml")));

// Convert XML file to ISO-8859-1 encoding
transformer.setOutputProperty(OutputKeys.ENCODING, "ISO-8859-1");
transformer.transform(new StreamSource(new File("test.xml")),
        new StreamResult(new File("test-8859-1.xml")));
test.xml (input, UTF-8)
<?xml version="1.0" encoding="UTF-8"?>
<test>
<english>Hello World</english>
<portuguese>Olá Mundo</portuguese>
<czech>Ahoj světe</czech>
<russian>Привет мир</russian>
<chinese>你好,世界</chinese>
<emoji>👋 🌎</emoji>
</test>
Translated by https://translate.google.com (except emoji)
test-utf8.xml (output, UTF-8)
<?xml version="1.0" encoding="UTF-8"?><test>
<english>Hello World</english>
<portuguese>Olá Mundo</portuguese>
<czech>Ahoj světe</czech>
<russian>Привет мир</russian>
<chinese>你好,世界</chinese>
<emoji>👋 🌎</emoji>
</test>
test-8859-1.xml (output, ISO-8859-1)
<?xml version="1.0" encoding="ISO-8859-1"?><test>
<english>Hello World</english>
<portuguese>Olá Mundo</portuguese>
<czech>Ahoj světe</czech>
<russian>Привет мир</russian>
<chinese>你好,世界</chinese>
<emoji>👋 🌎</emoji>
</test>
If you replace the test.xml with the test-8859-1.xml file (copy/paste/rename), you still get the same outputs, since the parser both auto-detects the encoding and unescapes all the escaped characters.
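As a small check of that unescaping behavior, this sketch (names made up) feeds the identity transform a document written entirely as decimal character references and re-serializes it as UTF-8, where the characters are representable and therefore come out as plain text:

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class RoundTrip {

    // Re-serialize an XML string in the requested encoding. Characters the
    // encoding cannot represent are written as character references, and
    // references to representable characters are written as plain text.
    public static byte[] reencode(String xml, String encoding) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.ENCODING, encoding);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        t.transform(new StreamSource(new StringReader(xml)),
                    new StreamResult(out));
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // "Привет" written entirely as decimal character references
        String escaped = "<russian>&#1055;&#1088;&#1080;&#1074;&#1077;&#1090;</russian>";
        String utf8 = new String(reencode(escaped, "UTF-8"), "UTF-8");
        System.out.println(utf8); // the references are now literal Cyrillic
    }
}
```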
I have a requirement to convert a DOM document object to xml and ensure the content of the xml is in utf-8 charset.
My code looks like below but it is not achieving the intended result and in the xml generated I can see that the characters are not getting encoded.
Document doc = (Document)operation.getResult(); //this method is returning the document object
TransformerFactory tFactory = TransformerFactory.newInstance();
Transformer transformer = tFactory.newTransformer();
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no");
transformer.setOutputProperty(OutputKeys.METHOD, "xml");
DOMSource domSource = new DOMSource(doc);
OutputStreamWriter osw = new OutputStreamWriter(outputStream, "UTF-8");
StreamResult result = new StreamResult(osw);
transformer.transform(domSource,result);
The outputStream from the above code is provided to a file-download component in ADF, and I see that the generated XML file does not encode the special characters, though the header line stating the encoding is generated.
The sample of xml file getting generated is like this.
<?xml version = '1.0' encoding = 'UTF-8'?>
<PlanObjects>
<CompPlan BusinessUnit="Vision Operations" OrgId="204" Name="RNNewCompPlan" StartDate="2015-01-01" EndDate="2015-12-31">
<CompPlansVORow>
<CompPlanName>RNNewCompPlan</CompPlanName>
<Description>Using some special chars in desc - ¥ © ¢ </Description>
<DisplayName>RNNewCompPlan</DisplayName>
</CompPlansVORow>
</CompPlan>
</PlanObjects>
I was expecting the characters "¥ © ¢" to be encoded and displayed as hex/octal character references.
Can someone please suggest what is going wrong here ?
Your understanding of UTF-8 is incorrect: ¥ © ¢ have been encoded as UTF-8, along with the rest of the file. You can verify that by opening your file in a hex editor and finding the sequence c2a5 c2a9 c2a2, which is the UTF-8 encoding of ¥ © ¢.
AFAIK, you shouldn't use hexadecimal/octal character escape sequences in XML. An XML parser will decode your file without issue.
To test that your code works with another parser, use the following Python code:
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
root = tree.getroot()
print(ET.tostring(root, encoding="UTF-8"))
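The same check can be done in Java itself; this small sketch (class and method names made up) prints the UTF-8 bytes of those three characters as hex:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Bytes {

    // Render the UTF-8 encoding of a string as a lowercase hex string.
    public static String utf8Hex(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        // U+00A5, U+00A9, U+00A2 each become a two-byte UTF-8 sequence
        System.out.println(utf8Hex("¥©¢")); // c2a5c2a9c2a2
    }
}
```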
I have a problem with XML encoding.
When I create the XML on localhost with cp1251 encoding, everything is fine.
But when I deploy my module on the server, the XML file contains incorrect symbols like "ФайлПФР":
StringWriter writer = new StringWriter();
StreamResult result = new StreamResult(writer);
DOMSource source = new DOMSource(doc);
transformer.setOutputProperty(OutputKeys.ENCODING, "cp1251");
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.transform(source, result);
String attach = writer.toString();
How can I fix it?
I tried to read an XML Document which was UTF-8 encoded, and attempted to transform it with a different encoding, which had no effect at all (the existing encoding of the document was used instead of the one I specified with the output property). When creating a new Document in memory (encoding is null), the output property was used correctly.
Looks like when transforming an XML Document, the output property OutputKeys.ENCODING is only used when the org.w3c.dom.Document does not have an encoding yet.
Solution
To change the encoding of an XML Document, don't use the Document as the source, but its root node (the document element) instead.
// use doc.getDocumentElement() instead of doc
DOMSource source = new DOMSource(doc.getDocumentElement());
Works like a charm.
Source document:
<?xml version="1.0" encoding="UTF-8"?>
<foo bla="Grüezi">
Encoding test äöüÄÖÜ «Test»
</foo>
Output with "cp1251":
<?xml version="1.0" encoding="WINDOWS-1251"?><foo bla="Grüezi">
Encoding test äöüÄÖÜ «Test»
</foo>
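The document-element workaround above can be sketched end-to-end like this (names are made up; the content is Cyrillic so that it is representable in cp1251):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

public class Cp1251Reencode {

    public static byte[] toCp1251(byte[] utf8Xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(utf8Xml));

        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.ENCODING, "windows-1251");

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Key point: serialize the root element, not the Document itself,
        // so the requested encoding is not overridden by the input encoding.
        t.transform(new DOMSource(doc.getDocumentElement()),
                    new StreamResult(out));
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><foo>Привет</foo>";
        byte[] cp1251 = toCp1251(xml.getBytes("UTF-8"));
        System.out.println(new String(cp1251, "windows-1251"));
    }
}
```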
A (String)Writer is not influenced by an output encoding (only by the input encoding used), as Java keeps all text in Unicode. Either write to a binary stream, or output the string as Cp1251 bytes.
Note that the encoding should appear in the <?xml encoding="Windows-1251"?> line. And I guess "Cp1251" is a bit more Java-specific.
So the error probably lies in how the string is written; for instance:
response.setCharacterEncoding("Windows-1251");
response.write(attach);
Or
attach.getBytes("Windows-1251")
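In other words, the output encoding only comes into existence when the string is turned into bytes. A minimal sketch of that step (string content is made up):

```java
public class AttachBytes {

    // Convert the in-memory (Unicode) string to actual Cp1251 bytes;
    // only at this point does an output encoding exist at all.
    public static byte[] encode(String attach) throws Exception {
        return attach.getBytes("windows-1251");
    }

    public static void main(String[] args) throws Exception {
        String attach = "<?xml version=\"1.0\" encoding=\"windows-1251\"?><a>Привет</a>";
        byte[] bytes = encode(attach);
        // single-byte encoding: one byte per character, Cyrillic included
        System.out.println(bytes.length == attach.length());
    }
}
```

In a servlet setting, these bytes would be written to the response's raw output stream (with the character encoding set to match), rather than through a writer using the platform default charset.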
I convert org.w3c.dom.Element to String in this way:
StringWriter writer = new StringWriter();
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
transformer.transform(new DOMSource(node), new StreamResult(writer));
String result = writer.toString();
But when I use it later I get an exception, io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence, which points to a wrong encoding.
In fact, the conversion was completely unnecessary. I needed it for serializing nodes and then exporting them to different formats (XML, HTML, text). So I found out that it's better to pass around the org.w3c.dom Documents themselves; from a Document you can get any information you need.
I'm hitting my head off a brick wall with a bizarre problem that I know there will be an obvious answer to, but I can't see it for the life of me. It's all to do with encoding. Before the code, a simple description: I want to take in an XML document which is Latin1 (ISO-8859-1) encoded, and then send the thing completely unchanged over an HttpURLConnection. I have a small test class and the raw XML which shows my problem.

The XML file contains a Latin1 character 0xa2 (a cent character), which is invalid UTF-8; I'm deliberately using this as my test case. The XML declaration is ISO-8859-1. I can read it in no bother, but when I convert the org.w3c.dom.Document to a byte[] array to send down the HttpURLConnection, the 0xa2 character gets converted to the UTF-8 encoded cent character (0xc2 0xa2), and the declaration stays as ISO-8859-1. In other words, it's converted to two bytes, which is totally wrong.
The code which does this:
FileInputStream input = new FileInputStream( "input-file" );
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware( true );
DocumentBuilder builder = factory.newDocumentBuilder();
Document document = builder.parse( input );
Source source = new DOMSource( document );
ByteArrayOutputStream baos = new ByteArrayOutputStream();
Result result = new StreamResult( baos );
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.transform( source, result );
byte[] bytes = baos.toByteArray();
FileOutputStream fos = new FileOutputStream( "output-file" );
fos.write( bytes );
I'm just writing it to a file at the moment while I figure out what on earth is converting this character. The input file has 0xa2; the output file contains 0xc2 0xa2. One way to fix this is to put this line in the second-to-last block:
transformer.setOutputProperty(OutputKeys.ENCODING, "ISO-8859-1");
However, not all XML documents that I'll be dealing with will be Latin1; most, indeed, will be UTF-8 when they come in. I'm assuming I shouldn't have to work out what the encoding is just to feed it into the transformer, though? I mean, surely it should work this out for itself, and I'm just doing something else wrong?
A thought had occurred to me that I could just query the document to find out the encoding and thus the extra line could just do the trick:
transformer.setOutputProperty(OutputKeys.ENCODING, document.getInputEncoding());
However, I then determined that this wasn't the answer, as document.getInputEncoding() returns a different String if I run it in a terminal on the linux box in comparison to when I run it within Eclipse on my Mac.
Any hints would be appreciated. I fully accept I'm missing out on something obvious.
Yes: by default, XML documents are written as UTF-8, so you need to explicitly tell the Transformer to use a different encoding. Your last edit is the "trick" to doing this so that it always matches the input XML encoding:
transformer.setOutputProperty(OutputKeys.ENCODING, document.getXmlEncoding());
The only question is: do you really need to maintain the input encoding?
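Putting that trick together, a minimal self-contained sketch (names made up) that re-serializes a document in whatever encoding its own declaration names:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

public class PreserveEncoding {

    public static byte[] reserialize(byte[] xmlBytes) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document document = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xmlBytes));

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        // getXmlEncoding() reports the encoding named in the XML declaration
        // (null if the declaration has none), not a JVM-detected one
        String enc = document.getXmlEncoding();
        if (enc != null) {
            transformer.setOutputProperty(OutputKeys.ENCODING, enc);
        }

        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        transformer.transform(new DOMSource(document), new StreamResult(baos));
        return baos.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><price>\u00a2</price>";
        byte[] out = reserialize(xml.getBytes("ISO-8859-1"));
        // the cent sign stays a single 0xA2 byte instead of UTF-8's 0xC2 0xA2
        System.out.println(new String(out, "ISO-8859-1"));
    }
}
```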
Why not just open it with a normal FileInputStream and stream the bytes to the output stream directly? Why do you need to load it into a DOM in memory if you are just sending it byte-for-byte over an HttpURLConnection?
Edit: According to the javadoc for Document, you should probably be using document.getXmlEncoding() to get the encoding that matches the one in the XML prolog.
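The stream-the-bytes suggestion might look like this (a sketch; in the real code `in` would be the FileInputStream and `out` would come from connection.getOutputStream()):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Arrays;

public class RawCopy {

    // Pass the XML through untouched: no parser, no DOM, no re-encoding,
    // so the declaration and every byte stay exactly as they arrived.
    public static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] original = {'<', 'a', '>', (byte) 0xA2, '<', '/', 'a', '>'};
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        copy(new ByteArrayInputStream(original), sink);
        // sink now holds the identical bytes, 0xA2 included
        System.out.println(Arrays.equals(original, sink.toByteArray()));
    }
}
```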
This may be helpful - it's too long for a comment, but not really an answer. From the spec:
The encoding attribute specifies the preferred encoding to use for outputting the result tree. XSLT processors are required to respect values of UTF-8 and UTF-16. For other values, if the XSLT processor does not support the specified encoding it may signal an error; if it does not signal an error it should use UTF-8 or UTF-16 instead.
You may want to test with "encoding=junk", as it were, to see what it does.
The valid values for Java are described here. See also IANA charsets.