Decoding of Unicode characters in an ISO-8859-1 encoded XML document - java

Using javax.xml.transform I created this ISO-8859-1 document, which contains two &#-encoded characters, &#50108; and &#50102;:
<?xml version="1.0" encoding="ISO-8859-1"?>
<xml>&#50108; and &#50102;</xml>
Question: how will a standards-compliant XML reader interpret the &#50108; and &#50102;,
just as the plain &#... strings (not converted back to 쎼 and 쎶), or
as 쎼 and 쎶?
Code to generate the XML:
public void testInvalidCharacter() {
    try {
        String str = "\uC3BC and \uC3B6"; // 쎼 and 쎶
        System.out.println(str);
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.newDocument();
        Element root = doc.createElement("xml");
        root.setTextContent(str);
        doc.appendChild(root);
        DOMSource domSource = new DOMSource(doc);
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, StandardCharsets.ISO_8859_1.name());
        StringWriter out = new StringWriter();
        transformer.transform(domSource, new StreamResult(out));
        System.out.println(out.toString());
    } catch (ParserConfigurationException | DOMException | IllegalArgumentException | TransformerException e) {
        e.printStackTrace(System.err);
    }
}

An XML parser will recognize the '&#...' character reference syntax and return 쎼 and 쎶 through its API as the text of the element.
For example, in Java the org.w3c.dom.Element.getTextContent() method for the element with the tag name 'xml' will return a String containing those Unicode characters, even though the XML document itself is encoded as ISO-8859-1.
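As a minimal sketch of that behaviour, the following parses an ISO-8859-1 byte stream containing the two character references and reads the element text back. The class name CharRefDemo and the helper parseText are made up for illustration, and the sketch assumes the references in the document are the decimal forms &#50108; and &#50102; (U+C3BC and U+C3B6).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.SAXException;

public class CharRefDemo {

    // Hypothetical helper: parses the XML as ISO-8859-1 bytes and returns
    // the text content of the root element.
    static String parseText(String xml) {
        try {
            // The references are plain ASCII, so they survive ISO-8859-1 encoding.
            byte[] bytes = xml.getBytes(StandardCharsets.ISO_8859_1);
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes));
            Element root = doc.getDocumentElement();
            return root.getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<xml>&#50108; and &#50102;</xml>";
        String text = parseText(xml);
        // The parser has already resolved the references to real characters.
        System.out.println(text.equals("\uC3BC and \uC3B6"));
    }
}
```

The document encoding only limits which characters can appear literally in the byte stream; character references are resolved by the parser regardless of that encoding.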

Related

Java DOM Transformer - XML creation doesn't replace apostrophe and quotes in the final xml

I'm trying to create an XML document and return it as a response to the caller, based on the input.
The transformer works as expected for the most part, but it doesn't convert apostrophes and quotes to their XML equivalents. Below is the code I'm using:
DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
// root elements
Document doc = docBuilder.newDocument();
Element rootElement = doc.createElement("template");
doc.appendChild(rootElement);
/* Adding attendant ID */
Element line = doc.createElement("line");
line.appendChild(doc.createTextNode("----&----<------>------'-----\"--------"));
Attr Attr1 = doc.createAttribute("Attr1");
Attr1.setValue("attribute value 1");
line.setAttributeNode(Attr1);
Attr Attr2 = doc.createAttribute("Attr2");
Attr2.setValue("attribute value 2");
line.setAttributeNode(Attr2);
rootElement.appendChild(line);
// write the content into xml file
TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer transformer = transformerFactory.newTransformer();
DOMSource source = new DOMSource(doc);
// Output to String
StringWriter writer = new StringWriter();
StreamResult result = new StreamResult(writer);
transformer.transform(source, result);
String strResult = writer.toString();
//return escapeXml(strResult);
System.out.println(strResult);
Resulting output
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<template>
<line Attr1="attribute value 1" Attr2="attribute value 2">----&amp;----&lt;------&gt;------'-----"--------</line>
</template>
Expected Result
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<template>
<line Attr1="attribute value 1" Attr2="attribute value 2">----&amp;----&lt;------&gt;------&apos;-----&quot;--------</line>
</template>
Initially I thought I could escape those characters before sending them as input to the transformer, but it replaced every ampersand with its equivalent "&amp;". If I replace the apostrophes or quotes after the final XML is created, the replacement hits the attributes as well.
I think we could solve this in two ways:
Escape the & , < , > , ' , " myself before adding the text to the node, and have the transformer leave it alone.
Give the transformer explicit directions to convert ' and " to their XML equivalents.
Currently I don't know how to achieve either of these. Could someone help me with this? A better way to create valid XML would also be hugely appreciated.
Thanks.
Why do you want quotation marks and apostrophes to be escaped? XML doesn't require them to be escaped (except in attributes where they conflict with the attribute delimiters). The serializer knows what it's doing: trust it.
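A short round-trip sketch may make this concrete: the text from the question is serialized with the default transformer and parsed back, and the parser returns exactly the original string even though the apostrophe and the quotation mark were never escaped. The class name QuoteRoundTrip and its two helpers are illustrative names, not part of any library.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class QuoteRoundTrip {

    // Hypothetical helper: wraps the text in a <line> element and serializes it.
    static String serialize(String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element line = doc.createElement("line");
            line.setTextContent(text);
            doc.appendChild(line);
            StringWriter out = new StringWriter();
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | TransformerException e) {
            throw new IllegalStateException("Could not serialize", e);
        }
    }

    // Hypothetical helper: parses the XML and returns the root element's text.
    static String parseText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse", e);
        }
    }

    public static void main(String[] args) {
        String text = "----&----<------>------'-----\"--------";
        String xml = serialize(text);
        System.out.println(xml);
        // The parser recovers the original text, so the output is valid XML as written.
        System.out.println(parseText(xml).equals(text));
    }
}
```

Only & and < must be escaped in element content; the serializer handles those, and any conforming consumer reads the rest unchanged.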

In Java I need to convert an XML Document to a string, with non-printable characters in data as hex

I have a method that takes a Document and produces an XML String value. It works fine, except that spaces, tabs, and other characters like that are preserved as-is in the node values. I need them converted to their hex equivalents.
Here's the method I have:
public static String docToXML( Document doc )
{
    try
    {
        StringWriter sw = new StringWriter();
        TransformerFactory tf = TransformerFactory.newInstance();
        Transformer transformer = tf.newTransformer();
        transformer.setOutputProperty(OutputKeys.METHOD, "xml");
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        transformer.transform(new DOMSource(doc), new StreamResult(sw));
        return sw.toString();
    }
    catch (Exception ex)
    {
        throw new RuntimeException("Error converting to String", ex);
    }
}
Even if the value is entered into the document in hex form, it is converted to a space or tab as it's converted to a String.
Does anyone know how to make this happen? I'm assuming it's an Output Property, but I haven't found one.
EDIT:
The current behavior is something like this (for a space):
<MyField> </MyField>
The desired behavior is:
<MyField>&#x20;</MyField>
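With XSLT 2.0 you can use character maps to achieve this. Note that the JDK's built-in transformer only implements XSLT 1.0, so this needs an XSLT 2.0 processor such as Saxon. A character map must be named and referenced from xsl:output; the name "whitespace" below is arbitrary:
<xsl:output use-character-maps="whitespace"/>
<xsl:character-map name="whitespace">
  <xsl:output-character character=" " string="&amp;#x20;"/>
  <xsl:output-character character="&#x9;" string="&amp;#x09;"/>
  ...
</xsl:character-map>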

Display XML with stylesheet in JEditorPane

I have an XML file, which uses an XSS and XSL stored in the folder to display the XML in a proper format.
When I use the following code:
JEditorPane editor = new JEditorPane();
editor.setBounds(114, 65, 262, 186);
frame.getContentPane().add(editor);
editor.setContentType( "html" );
File file=new File("c:/r/testResult.xml");
editor.setPage(file.toURI().toURL());
All I can see is the text part of the XML, without any styling. What should I do to make it display with the style sheet?
The JEditorPane does not automatically process XSLT style-sheets. You must perform the transformation yourself:
try (InputStream xslt = getClass().getResourceAsStream("StyleSheet.xslt");
     InputStream xml = getClass().getResourceAsStream("Document.xml")) {
    DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    Document doc = db.parse(xml);
    StringWriter output = new StringWriter();
    TransformerFactory tf = TransformerFactory.newInstance();
    Transformer transformer = tf.newTransformer(new StreamSource(xslt));
    transformer.transform(new DOMSource(doc), new StreamResult(output));
    String html = output.toString();
    // JEditorPane doesn't like the META tag...
    html = html.replace("<META http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">", "");
    editor.setContentType("text/html; charset=UTF-8");
    editor.setText(html);
} catch (IOException | ParserConfigurationException | SAXException | TransformerException e) {
    editor.setText("Unable to format document due to:\n\t" + e);
}
editor.setCaretPosition(0);
Use an appropriate InputStream or StreamSource for your particular XSLT and XML documents.

XML encoding UTF-8 not working for turkish characters

I have a method that creates records and writes them to an XML file. It produces a corrupted result: my Turkish characters are written as hexadecimal expressions. Even though I'm using UTF-8, I couldn't solve the problem. By the way, I checked the file with both the Sublime and Notepad++ editors.
public boolean add(BatFile batFile) throws Exception {
    File inputFile = new File(fileLocation);
    DocumentBuilderFactory docFactory = DocumentBuilderFactory
            .newInstance();
    DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
    Document doc = docBuilder.parse(inputFile);
    Element rootElement = doc.getDocumentElement();
    Element batFileElement = doc.createElement("BatFile");
    rootElement.appendChild(batFileElement);
    Element batJobName = doc.createElement("Name");
    batJobName.appendChild(doc.createTextNode(batFile.getName()));
    batFileElement.appendChild(batJobName);
    Element batFileBriefDesc = doc.createElement("BriefDesc");
    batFileBriefDesc
            .appendChild(doc.createTextNode(batFile.getBriefDesc()));
    batFileElement.appendChild(batFileBriefDesc);
    Element batFileDesc = doc.createElement("Desc");
    batFileDesc.appendChild(doc.createTextNode(batFile.getDesc()));
    batFileElement.appendChild(batFileDesc);
    Element batFileName = doc.createElement("FileName");
    batFileName.appendChild(doc.createTextNode(batFile.getFileName()));
    batFileElement.appendChild(batFileName);
    Element batCommandArgs = doc.createElement("CommandArgs");
    for (int k = 0; k < batFile.getCommandArgs().size(); k++) {
        Element commandArg = doc.createElement("CommandArg");
        // commandArg.setAttribute("ID", String.valueOf(k));
        commandArg.appendChild(doc.createTextNode(batFile.getCommandArgs()
                .get(k)));
        batCommandArgs.appendChild(commandArg);
    }
    batFileElement.appendChild(batCommandArgs);
    Element batCreationTime = doc.createElement("CreationTime");
    batCreationTime.appendChild(doc.createTextNode(batFile
            .getCreationTime()));
    batFileElement.appendChild(batCreationTime);
    Element batSchedulerPattern = doc.createElement("SchedulerPattern");
    batSchedulerPattern.appendChild(doc.createTextNode(batFile
            .getExecutionPattern()));
    batFileElement.appendChild(batSchedulerPattern);
    Element batTaskID = doc.createElement("TaskID");
    if (batFile.getTaskID() != null) {
        batTaskID.appendChild(doc.createTextNode(batFile.getTaskID()));
    }
    batFileElement.appendChild(batTaskID);
    TransformerFactory tFactory = TransformerFactory.newInstance();
    Transformer transformer = tFactory.newTransformer();
    DOMSource domSource = new DOMSource(doc);
    StreamResult result = new StreamResult(new FileWriter(inputFile));
    transformer.setOutputProperty(OutputKeys.INDENT, "yes");
    transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
    transformer.transform(domSource, result);
    return true;
}
When I test it with the code below:
@Test
public void testAddingTask() throws Exception {
    IBAO testBao = XMLBAO.getInstance();
    BatFile testBatFile = new BatFile();
    testBatFile.setName("ŞŞŞŞŞ");
    testBatFile.setBriefDesc("ÇÇÇÇÇÇ");
    testBatFile.setDesc("ĞĞĞĞĞĞ");
    testBatFile.setFileName("FileName");
    testBatFile.setCreationTime("Merhaba");
    testBatFile.setExecutionPattern("ööçöçöçüü");
    testBatFile.addCommandArgs("ZZZZZZZZ");
    testBatFile.setTaskID("ÜÜÜÜÜÜÜÜ");
    testBao.add(testBatFile);
}
It produces this result:
<BatFiles>
<BatFile>
<Name>???</Name>
<BriefDesc>???</BriefDesc>
<Desc>???</Desc>
<FileName>FileName</FileName>
<CommandArgs>
<CommandArg>ZZZZZZZZ</CommandArg>
</CommandArgs>
<CreationTime>Merhaba</CreationTime>
<SchedulerPattern>??????</SchedulerPattern>
<TaskID>????</TaskID>
</BatFile>
</BatFiles>
You're writing to a character stream, which prevents the API from controlling the encoding the data is written in. FileWriter uses the platform default encoding, which might not be UTF-8. From its documentation:
"The constructors of this class assume that the default character encoding and the default byte-buffer size are acceptable."
Use a FileOutputStream with the StreamResult (in a try-with-resources block).
You might also be having issues due to the encoding of your Java source files. Consider using Unicode escapes instead of literals, that is, "\u015E" instead of "Ş".
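The suggestion above could look roughly like the following sketch, which hands the transformer a byte stream so that it applies the UTF-8 encoding itself. The class name Utf8Output, the helpers sample, write and toUtf8Bytes, and the temporary file are assumptions made for illustration; they are not taken from the question's code.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class Utf8Output {

    // Hypothetical helper: builds a one-element document <Name>...</Name>.
    static Document sample(String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element element = doc.createElement("Name");
            element.setTextContent(name);
            doc.appendChild(element);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("Could not create document", e);
        }
    }

    // Writes to a byte stream, so the transformer (not a Writer) chooses the encoding.
    static void write(Document doc, OutputStream out) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            transformer.transform(new DOMSource(doc), new StreamResult(out));
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not serialize document", e);
        }
    }

    // Convenience wrapper used to inspect the raw bytes.
    static byte[] toUtf8Bytes(Document doc) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        write(doc, buffer);
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Unicode escapes avoid any dependence on the source file encoding.
        Document doc = sample("\u015E\u00C7\u011E");
        File file = File.createTempFile("batfiles", ".xml");
        try (OutputStream out = new FileOutputStream(file)) {
            write(doc, out);
        }
        System.out.println("Wrote " + file.length() + " bytes to " + file);
    }
}
```

Because the result wraps an OutputStream, the bytes on disk match the encoding named in the XML declaration, whatever the platform default happens to be.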

Converting string to XMLDocument doesn't create text nodes

Here's my situation.
I have a string containing XML data:
<tag>
<anotherTag> data </anotherTag>
</tag>
I take that string and I run it through this code to convert it to a Document:
try {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    DocumentBuilder builder = factory.newDocumentBuilder();
    return builder.parse(new InputSource(new StringReader(sXMLString)));
}
catch (Exception e) {
    // Parser with specified options can't be built
    ceLogger.logError("Unable to build a new XML Document from string provided:\t" + e.getMessage());
    return null;
}
The resulting XML is almost perfect. It's missing the data, however, and looks like this:
<tag>
<anotherTag />
</tag>
How can I copy over the text when creating an XML Document and why is it removing the text in the first place?
Edit:
The actual problem ended up being something along the lines of this:
While parsing through the XML structure with my own function this line is there:
if (curChild.getNodeType()==Node.ELEMENT_NODE)
sResult.append(XMLToString((Element)children.item(i),attribute_mask));
But no such logic exists for TEXT nodes, so they are simply ignored.
Your code is correct. The only guess I can make is that you are printing the document incorrectly. I've tested your code and used the following method to output it, and the XML was displayed correctly with the text node:
public static void outputXML(Document dom) throws TransformerException
{
    Transformer transformer = TransformerFactory.newInstance().newTransformer();
    transformer.setOutputProperty(OutputKeys.INDENT, "yes");
    // write to a StringWriter; initialize StreamResult with a File object instead to save to a file
    StreamResult result = new StreamResult(new StringWriter());
    DOMSource source = new DOMSource(dom);
    transformer.transform(source, result);
    String xmlString = result.getWriter().toString();
    System.out.println(xmlString);
}
The output was:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<tag> <anotherTag> data </anotherTag>
</tag>
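For the hand-written traversal described in the question's edit, the missing piece is a branch for text nodes. The following is only a sketch under simplifying assumptions: it ignores attributes, comments and CDATA sections, and the names TextAwareSerializer, toXml, append, escape and parseRoot are invented here rather than taken from the asker's XMLToString function.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class TextAwareSerializer {

    // Serializes a node and its descendants (elements and text only).
    static String toXml(Node node) {
        StringBuilder sb = new StringBuilder();
        append(node, sb);
        return sb.toString();
    }

    private static void append(Node node, StringBuilder sb) {
        switch (node.getNodeType()) {
            case Node.ELEMENT_NODE: {
                sb.append('<').append(node.getNodeName()).append('>');
                NodeList children = node.getChildNodes();
                for (int i = 0; i < children.getLength(); i++) {
                    append(children.item(i), sb);
                }
                sb.append("</").append(node.getNodeName()).append('>');
                break;
            }
            case Node.TEXT_NODE:
                // This is the branch the original traversal lacked.
                sb.append(escape(node.getNodeValue()));
                break;
            default:
                // Other node types are skipped in this sketch.
                break;
        }
    }

    // Escapes the two characters that must not appear literally in element content.
    private static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;");
    }

    // Hypothetical helper: parses a string and returns the root element.
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse", e);
        }
    }

    public static void main(String[] args) {
        Element root = parseRoot("<tag><anotherTag> data </anotherTag></tag>");
        System.out.println(toXml(root));
    }
}
```

Whitespace between elements is also stored as text nodes, so a traversal that handles Node.TEXT_NODE reproduces indentation and line breaks from the input as well.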
