Troubles with XML encoding in Java

I have a problem with XML encoding.
When I create the XML on localhost with cp1251 encoding, everything works.
But when I deploy my module to the server, the XML file contains incorrect symbols like "Р¤Р°Р№Р»РџР¤Р ".
StringWriter writer = new StringWriter();
StreamResult result = new StreamResult(writer);
DOMSource source = new DOMSource(doc);
transformer.setOutputProperty(OutputKeys.ENCODING, "cp1251");
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.transform(source, result);
String attach = writer.toString();
How can I fix it?

I tried to read an XML Document which was UTF-8 encoded, and attempted to transform it with a different encoding, which had no effect at all (the existing encoding of the document was used instead of the one I specified with the output property). When creating a new Document in memory (encoding is null), the output property was used correctly.
It looks like, when transforming an XML Document, the output property OutputKeys.ENCODING is only used if the org.w3c.dom.Document does not have an encoding yet.
Solution
To change the encoding of an XML Document, don't use the Document as the source, but its root node (the document element) instead.
// use doc.getDocumentElement() instead of doc
DOMSource source = new DOMSource(doc.getDocumentElement());
Works like a charm.
Source document:
<?xml version="1.0" encoding="UTF-8"?>
<foo bla="Grüezi">
Encoding test äöüÄÖÜ «Test»
</foo>
Output with "cp1251":
<?xml version="1.0" encoding="WINDOWS-1251"?><foo bla="Gr&#252;ezi">
Encoding test &#228;&#246;&#252;&#196;&#214;&#220; «Test»
</foo>

A (String)Writer is not affected by the output encoding (only by the input encoding used), because Java keeps all text in Unicode. Either write to a binary stream, or output the string as Cp1251.
Note that the encoding should appear in the <?xml version="1.0" encoding="Windows-1251"?> declaration. I guess "Cp1251" is a more Java-specific name.
So the error probably lies in how the string is written out; for instance
response.setCharacterEncoding("Windows-1251");
response.getWriter().write(attach);
Or
attach.getBytes("Windows-1251")
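The "write to binary" option can be sketched as follows. This is only an illustration under assumptions: the class name, the element name "file" and its text are made up, and the document is built in memory rather than taken from the question. The transformer writes to a byte stream, so it performs the character encoding itself and the bytes should match the XML declaration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.SAXException;

class Cp1251XmlBytes {

    /** Builds a tiny in-memory document; element name and text are hypothetical. */
    static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("file");
            // Cyrillic text written with Unicode escapes so the source file encoding does not matter
            root.setTextContent("\u0424\u0430\u0439\u043b");
            doc.appendChild(root);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Serializes to bytes, so the transformer (not a Writer) does the encoding. */
    static byte[] toBytes(Document doc, String encoding) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.ENCODING, encoding);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toByteArray();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Parses the bytes back; the parser picks the charset from the XML declaration. */
    static String rootText(byte[] xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml));
            return doc.getDocumentElement().getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] xml = toBytes(sampleDocument(), "windows-1251");
        // The round trip should give back the original text
        System.out.println(rootText(xml).equals("\u0424\u0430\u0439\u043b"));
    }
}
```

The byte array can then be sent as the attachment or response body as-is, without passing through a String on the way.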

Related

Convert UTF-8 to ISO-8859-1 with Numeric Character Reference

I receive XML from a third party in UTF-8 and need to send it to another third party in ISO-8859-1. The XML contains many different languages, e.g. Russian in Cyrillic. I know that UTF-8 cannot be converted directly into ISO-8859-1 without losing characters. I found StringEscapeUtils.escapeXml(), but that method escapes the whole XML, including < and >, and I only want to convert the Cyrillic to numeric character references. Does such a method exist in Java, or does it always escape the whole XML? Is there another way to convert only the characters that cannot be encoded in ISO-8859-1 into numeric character references?
I've seen similar questions on SO, like: How do I convert between ISO-8859-1 and UTF-8 in Java? but they don't mention numeric character references.

UPDATE: Removed unnecessary DOM loading.
Use the XML transformer. It knows how to XML escape characters that are not supported by the given encoding.
Example
Transformer transformer = TransformerFactory.newInstance().newTransformer();
// Convert XML file to UTF-8 encoding
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
transformer.transform(new StreamSource(new File("test.xml")),
new StreamResult(new File("test-utf8.xml")));
// Convert XML file to ISO-8859-1 encoding
transformer.setOutputProperty(OutputKeys.ENCODING, "ISO-8859-1");
transformer.transform(new StreamSource(new File("test.xml")),
new StreamResult(new File("test-8859-1.xml")));
test.xml (input, UTF-8)
<?xml version="1.0" encoding="UTF-8"?>
<test>
<english>Hello World</english>
<portuguese>Olá Mundo</portuguese>
<czech>Ahoj světe</czech>
<russian>Привет мир</russian>
<chinese>你好，世界</chinese>
<emoji>👋 🌎</emoji>
</test>
Translated by https://translate.google.com (except emoji)
test-utf8.xml (output, UTF-8)
<?xml version="1.0" encoding="UTF-8"?><test>
<english>Hello World</english>
<portuguese>Olá Mundo</portuguese>
<czech>Ahoj světe</czech>
<russian>Привет мир</russian>
<chinese>你好，世界</chinese>
<emoji>👋 🌎</emoji>
</test>
test-8859-1.xml (output, ISO-8859-1)
<?xml version="1.0" encoding="ISO-8859-1"?><test>
<english>Hello World</english>
<portuguese>Olá Mundo</portuguese>
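<czech>Ahoj sv&#283;te</czech>
<russian>&#1055;&#1088;&#1080;&#1074;&#1077;&#1090; &#1084;&#1080;&#1088;</russian>
<chinese>&#20320;&#22909;&#65292;&#19990;&#30028;</chinese>
<emoji>&#128075; &#127758;</emoji>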
</test>
If you replace the test.xml with the test-8859-1.xml file (copy/paste/rename), you still get the same outputs, since the parser both auto-detects the encoding and unescapes all the escaped characters.
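When the XML arrives as a String rather than a file, the same idea can be sketched in memory. This is a hedged variant, not part of the original answer: the class and method names are hypothetical, and it relies on the JDK's built-in transformer escaping characters that ISO-8859-1 cannot represent.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class Latin1XmlConverter {

    /** Re-serializes the XML text as ISO-8859-1 bytes via an identity transform. */
    static byte[] toLatin1(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.ENCODING, "ISO-8859-1");
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toByteArray();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // \u041f is a Cyrillic letter (not in ISO-8859-1); \u00e1 is "a" with acute (in ISO-8859-1)
        String input = "<t><ru>\u041f</ru><pt>Ol\u00e1</pt></t>";
        byte[] latin1 = toLatin1(input);
        // Decode with the same charset only to inspect the result
        System.out.println(new String(latin1, StandardCharsets.ISO_8859_1));
    }
}
```

Markup such as < and > stays untouched because the transformer parses the input and only escapes character data that the target encoding cannot hold.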

How to Convert DOM document object to xml applying utf-8 charset encoding

I have a requirement to convert a DOM document object to XML and ensure the content of the XML is in the UTF-8 charset.
My code is shown below, but it does not achieve the intended result: in the generated XML I can see that the characters are not getting encoded.
Document doc = (Document)operation.getResult(); //this method is returning the document object
TransformerFactory tFactory = TransformerFactory.newInstance();
Transformer transformer = tFactory.newTransformer();
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no");
transformer.setOutputProperty(OutputKeys.METHOD, "xml");
DOMSource domSource = new DOMSource(doc);
OutputStreamWriter osw = new OutputStreamWriter(outputStream, "UTF-8");
StreamResult result = new StreamResult(osw);
transformer.transform(domSource,result);
The outputStream from the code above is passed to a file download component in ADF, and there I see that the special characters in the generated XML file are not encoded, even though the header line stating the encoding is generated.
The sample of xml file getting generated is like this.
<?xml version = '1.0' encoding = 'UTF-8'?>
<PlanObjects>
<CompPlan BusinessUnit="Vision Operations" OrgId="204" Name="RNNewCompPlan" StartDate="2015-01-01" EndDate="2015-12-31">
<CompPlansVORow>
<CompPlanName>RNNewCompPlan</CompPlanName>
<Description>Using some special chars in desc - ¥ © ¢ </Description>
<DisplayName>RNNewCompPlan</DisplayName>
</CompPlansVORow>
</CompPlan>
</PlanObjects>
I was expecting the characters "¥ © ¢ " to be encoded and displayed as hex / octet codes.
Can someone please suggest what is going wrong here?

Your understanding of UTF-8 is incorrect: ¥ © ¢ have been encoded as UTF-8, along with the rest of the file. You can verify that by opening your file in a hex editor and finding the byte sequences c2a5, c2a9 and c2a2, which are the UTF-8 encodings of ¥, © and ¢.
AFAIK, there is no need for hexadecimal or octal escape sequences here: XML only offers numeric character references such as &#xA5;, and they are unnecessary when the encoding can represent the character directly. An XML parser will decode your file without issue.
To test that your output works with another parser, use the following Python code:
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
root = tree.getroot()
print(ET.tostring(root, encoding="unicode"))
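The hex-editor check can also be done from Java. The following is a small sketch (the class and method names are made up) that prints the UTF-8 bytes of the three characters from the question.

```java
import java.nio.charset.StandardCharsets;

class Utf8Bytes {

    /** Returns the UTF-8 encoding of the text as lowercase hex digits. */
    static String hex(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file independent of its own encoding
        System.out.println(hex("\u00a5")); // yen sign       -> c2a5
        System.out.println(hex("\u00a9")); // copyright sign -> c2a9
        System.out.println(hex("\u00a2")); // cent sign      -> c2a2
    }
}
```

Each character takes two bytes in UTF-8, which is why the file already counts as "encoded" even though the characters look unchanged in a UTF-8 aware editor.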

How to check well formedness of xml file which has encoding type of UTF-16 using Java?

I have parsed an XML file with UTF-8 encoding. It parses fine.
I did not change the encoding of the XML file.
The XML header for UTF-8 looks like:
<?xml version="1.0" encoding="UTF-8"?>
There is no error for the format above.
Suppose I have another file to check for well-formedness, with a header like:
<?xml version="1.0" encoding="UTF-16"?>
How do I resolve this error?

Java XML parsers usually receive their input wrapped in an InputSource object. It can be constructed with a Reader that does the character decoding for the given charset.
InputStream in = ...
InputSource is = new InputSource(new InputStreamReader(in, "utf-16"));
For the "utf-16" charset the stream should start with a byte order mark; if it does not, use either "utf-16le" or "utf-16be".
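A minimal well-formedness check built on that idea might look like the sketch below. The class and method names are hypothetical, and it uses a SAX parser with a DefaultHandler, which is one possible choice rather than the only one; a fatal parse error is taken to mean "not well-formed".

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class Utf16WellFormedness {

    /** Decodes the bytes as UTF-16 (byte order mark expected) and parses them. */
    static boolean isWellFormedUtf16(byte[] xmlBytes) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            Reader reader = new InputStreamReader(
                    new ByteArrayInputStream(xmlBytes), StandardCharsets.UTF_16);
            // DefaultHandler rethrows fatal errors, which is how malformed input surfaces
            parser.parse(new InputSource(reader), new DefaultHandler());
            return true;
        } catch (SAXParseException e) {
            return false;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // String.getBytes with UTF_16 writes a big-endian byte order mark first
        byte[] good = "<?xml version=\"1.0\" encoding=\"UTF-16\"?><a><b/></a>"
                .getBytes(StandardCharsets.UTF_16);
        byte[] bad = "<a><b></a>".getBytes(StandardCharsets.UTF_16);
        System.out.println(isWellFormedUtf16(good)); // true
        System.out.println(isWellFormedUtf16(bad));  // false
    }
}
```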

How to create an empty DOCTYPE using W3C DOM in Java?

I am trying to read an XML document and output it into a new XML document using the W3C DOM API in Java. To handle DOCTYPEs, I am using the following code (from an input Document doc to a target File target):
TransformerFactory transfac = TransformerFactory.newInstance();
Transformer trans = transfac.newTransformer();
trans.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no"); // "no" keeps the '<?xml version="1.0"?>' declaration
trans.setOutputProperty(OutputKeys.INDENT, "yes");
// if a doctype was set, it needs to persist
if (doc.getDoctype() != null) {
DocumentType doctype = doc.getDoctype();
trans.setOutputProperty(OutputKeys.DOCTYPE_SYSTEM, doctype.getSystemId());
trans.setOutputProperty(OutputKeys.DOCTYPE_PUBLIC, doctype.getPublicId());
}
FileWriter sw = new FileWriter(target);
StreamResult result = new StreamResult(sw);
DOMSource source = new DOMSource(doc);
trans.transform(source, result);
This works fine for both XML documents with and without DOCTYPEs. However, I am now coming across a NullPointerException when trying to transform the following input XML document:
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE permissions >
<permissions>
// ...
</permissions>
HTML 5 uses a similar syntax for its DOCTYPEs, and it is valid. But I have no idea how to handle this using the W3C DOM API - trying to set the DOCTYPE_SYSTEM to null throws an exception. Can I still use the W3C DOM API to output an empty doctype?

Although this question is two years old, it is a top search result in some web search engine, so maybe it is a useful shortcut. See the question Set HTML5 doctype with XSLT referring to http://www.w3.org/html/wg/drafts/html/master/syntax.html#doctype-legacy-string, which says:
For the purposes of HTML generators that cannot output HTML markup
with the short DOCTYPE "<!DOCTYPE html>", a DOCTYPE legacy
string may be inserted into the DOCTYPE [...]
In other words, <!DOCTYPE html SYSTEM "about:legacy-compat"> or
<!DOCTYPE html SYSTEM 'about:legacy-compat'>, case-insensitively
except for the part in single or double quotes.
Leading to a line of Java code like this:
trans.setOutputProperty(OutputKeys.DOCTYPE_SYSTEM, "about:legacy-compat");
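For context, here is a self-contained sketch of that property in use. The class and method names are hypothetical, the document is a single empty element built in memory, and the output method is pinned to "xml" as an assumption so that a root element named html does not switch the serializer to HTML output.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

class LegacyDoctypeDemo {

    /** Serializes a one-element document with the legacy-compat system identifier. */
    static String serializeWithLegacyDoctype(String rootName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            doc.appendChild(doc.createElement(rootName));

            Transformer trans = TransformerFactory.newInstance().newTransformer();
            trans.setOutputProperty(OutputKeys.METHOD, "xml");
            // A non-null system identifier is what makes the serializer emit a DOCTYPE
            trans.setOutputProperty(OutputKeys.DOCTYPE_SYSTEM, "about:legacy-compat");

            StringWriter out = new StringWriter();
            trans.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serializeWithLegacyDoctype("html"));
    }
}
```

The DOCTYPE name is taken from the root element, so this is a workaround that yields a short DOCTYPE with a placeholder system identifier, not a truly empty one.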

Try the suggestions here https://stackoverflow.com/a/6637886/116509. Basically it looks like it can't be done with standard Java DOM support.
You can also try StAX
XMLStreamWriter xmlStreamWriter =
XMLOutputFactory.newFactory().createXMLStreamWriter( System.out, doc.getXmlEncoding() );
Result result = new StAXResult( xmlStreamWriter );
// ... create dtd String
xmlStreamWriter.writeDTD( dtd );
DOMSource source = new DOMSource( doc );
trans.transform( source, result );
but it's ugly because the DTD parameter is a String, and you only have a DocumentType object.

Xml transformation encoding problem

Hi, I have some simple code:
InputSource is = new InputSource(new StringReader(xml))
Document d = documentBuilder.parse(is)
StringWriter result = new StringWriter()
DOMSource ds = new DOMSource(d)
Transformer t = TransformerFactory.newInstance().newTransformer()
t.setOutputProperty(OutputKeys.INDENT, "yes");
t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no");
t.setOutputProperty(OutputKeys.STANDALONE, "yes");
t.setOutputProperty(OutputKeys.ENCODING,"UTF-16")
t.transform(ds,new StreamResult(result))
return result.toString()
that should transform an XML document to UTF-16 encoding. As far as I know, the internal representation of a String in the JVM already uses UTF-16 chars, but I expect the result String to contain a header with the encoding set to "UTF-16". The original XML was UTF-8, but I get:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
(also the standalone property seems to be wrong)
The transformer instance is: com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl
(which I think is the default)
So what am I missing here?

Use a writer where you explicitly declare UTF-16 as the output encoding. Try OutputStreamWriter(OutputStream out, String charsetName), which should wrap a ByteArrayOutputStream, and see if this works.
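That suggestion could be tried with something like the following sketch. The names are hypothetical, and it only demonstrates that the bytes come out as UTF-16; it makes no claim about what the XML declaration will say, since that is the open point of the question.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.StringReader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class Utf16Serialization {

    /** Parses the XML text and writes it back through an explicit UTF-16 writer. */
    static byte[] toUtf16Bytes(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.ENCODING, "UTF-16");

            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            // The writer, not the transformer, decides how characters become bytes here
            try (Writer writer = new OutputStreamWriter(bytes, StandardCharsets.UTF_16)) {
                t.transform(new DOMSource(doc), new StreamResult(writer));
            }
            return bytes.toByteArray();
        } catch (ParserConfigurationException | SAXException | IOException | TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] out = toUtf16Bytes("<greeting>hi</greeting>");
        // Java's UTF-16 encoder starts with a big-endian byte order mark (FE FF)
        System.out.printf("%02x %02x%n", out[0], out[1]);
        System.out.println(new String(out, StandardCharsets.UTF_16));
    }
}
```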

I have written a test of my own now, with one minor change:
t.transform(ds,new StreamResult(new File("dest.xml")));
I have the same results but the file is indeed UTF-16 encoded, checked with a hex editor. For some strange reason the xml declaration is not changed. So your code works.
