How to save Element as string in UTF8? - java

I convert org.w3c.dom.Element to String in this way:
StringWriter writer = new StringWriter();
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
transformer.transform(new DOMSource(node), new StreamResult(writer));
String result = writer.toString();
But when I use the string later, I get an exception: io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence, which points to a wrong encoding.

In fact, it's completely unnecessary. I needed it for serializing nodes and later exporting them to different formats (xml, html, test). I found out that it's better to pass around org.w3c.dom Documents instead. From a Document you can get any information you need.
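For readers who do need the string form: the question does not show how the string is consumed, so the following is only a sketch under the assumption that it is later turned back into bytes with the platform default charset (for example via getBytes() with no argument) while the XML declaration still says UTF-8. Encoding the string explicitly as UTF-8 before parsing avoids that mismatch. The class and method names (ElementToString, serialize, reparse, sample) are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper, not part of the original question.
final class ElementToString {

    // Same approach as in the question: serialize a node into a String.
    static String serialize(Node node) throws Exception {
        StringWriter writer = new StringWriter();
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        transformer.transform(new DOMSource(node), new StreamResult(writer));
        return writer.toString();
    }

    // Turn the String back into bytes with an explicit charset that matches
    // the encoding named in the XML declaration, then parse those bytes.
    static Document reparse(String xml) throws Exception {
        byte[] utf8 = xml.getBytes(StandardCharsets.UTF_8);
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(utf8));
    }

    // Small element with non-ASCII content to exercise the round trip.
    static Element sample() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = doc.createElement("note");
        root.setTextContent("Gr\u00fc\u00dfe");
        doc.appendChild(root);
        return root;
    }
}
```

Parsing from a StringReader wrapped in an InputSource would sidestep the byte conversion altogether.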

Related

Processing xml file (Java)

I have to read an xml file, make some changes, and copy it to another location. I also have to keep the German special characters and keep the empty tags as they are (prevent them from becoming self-closing tags). To prevent the self-closing tags, I used the Xerces library, as described in the link:
preventing empty xml elements are converted to self closing elements
In my application, leaving out my changes to the xml, the code looks like this:
public static void main(String args[]) throws Exception {
InputStream inputStream= new FileInputStream(new File("D:\\qwe.xml"));
Reader reader = new InputStreamReader(inputStream,"ISO-8859-1");
InputSource is = new InputSource(reader);
is.setEncoding("ISO-8859-1");
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder;
dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(is);
doc.setXmlStandalone(true);
File file = new File ("D:\\qwerty.xml");
XMLStreamWriter writer = XMLOutputFactory.newFactory().createXMLStreamWriter(new FileOutputStream(file));
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.ENCODING, "ISO-8859-1") ;
transformer.transform(new DOMSource(doc), new StAXResult(writer));
}
The first row in the source file is
<?xml version="1.0" encoding="UTF-8"?>
The problem is in the destination file, qwerty.xml, where encoding="UTF-8" is removed. Although the source file declares UTF-8, I had to read it as "ISO-8859-1" because of the German characters. I want to keep the first row as in the original, keep the empty tags as they are (not self-closing), and keep the German characters. My code only manages the second and third.
The call
Transformer.setOutputProperty(OutputKeys.ENCODING, "ISO-8859-1");
has no effect unless the transformer is producing serialized output.
In your case the transformer is not producing serialized output because you are sending the output to a StAXResult. I'm not sure why you are using the XMLStreamWriter to produce output, but if you want to do it that way, it's the XMLStreamWriter that decides on the encoding, not the Transformer.
I would have thought it was simpler to send the Transformer output to a StreamResult.
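A minimal sketch of that suggestion, assuming the Document is already in memory: the Transformer serializes to a StreamResult backed by an OutputStream, so OutputKeys.ENCODING actually takes effect. The names EncodedXmlWriter, write and demo are made up for illustration, and this only addresses the encoding, not the self-closing tags.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

// Hypothetical helper illustrating the StreamResult suggestion.
final class EncodedXmlWriter {

    // Serialize the document to a byte stream; because the transformer itself
    // produces the bytes, the requested encoding is used for both the
    // XML declaration and the actual byte output.
    static void write(Document doc, OutputStream out, String encoding) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, encoding);
        transformer.transform(new DOMSource(doc), new StreamResult(out));
    }

    // Builds a tiny document with a German umlaut and serializes it as ISO-8859-1.
    static byte[] demo() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        doc.appendChild(doc.createElement("stadt")).setTextContent("M\u00fcnchen");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        write(doc, out, "ISO-8859-1");
        return out.toByteArray();
    }
}
```

Passing a FileOutputStream instead of the ByteArrayOutputStream writes the same bytes to a file.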

Convert org.w3c.dom.Document to File file

I have an xml file as an object in Java (org.w3c.dom.Document doc) and I want to convert it into a File file. How can I convert the type Document to File?
thanks
I want to add metadata elements in an existing xml file (standard dita) with type File.
I know a way to add elements to the file, but then I have to convert the file to an org.w3c.dom.Document. I did that with the method loadXML:
private Document loadXML(File f) throws Exception {
    DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    return builder.parse(f);
}
After that I change the org.w3c.dom.Document. Then I want to continue with the flow of the program, so I have to convert the Document doc back to a File file.
What is an efficient way to do that? Or is there a better way to add some elements to an xml File without converting it?
You can use the Transformer class to write the entire XML content to a File, as shown below:
Document doc =...
// write the content into xml file
DOMSource source = new DOMSource(doc);
FileWriter writer = new FileWriter(new File("/tmp/output.xml"));
StreamResult result = new StreamResult(writer);
TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer transformer = transformerFactory.newTransformer();
transformer.transform(source, result);
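Putting the question's loadXML step and this answer together, a round trip could look like the sketch below. It writes through an OutputStream rather than a FileWriter so the bytes on disk match the declared encoding; FileWriter would use the platform default charset. The names XmlFileEditor, load, save and addMetadata are hypothetical. Real DITA topics usually carry a DOCTYPE, which this sketch makes no attempt to preserve.

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper: load a file into a DOM, change it, write it back.
final class XmlFileEditor {

    // Parse the file into a DOM Document.
    static Document load(Path file) throws Exception {
        try (InputStream in = Files.newInputStream(file)) {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
        }
    }

    // Serialize the Document back to the file as UTF-8, replacing its content.
    static void save(Document doc, Path file) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        try (OutputStream out = Files.newOutputStream(file)) {
            transformer.transform(new DOMSource(doc), new StreamResult(out));
        }
    }

    // Append a simple <name>value</name> child to the root element and save.
    static void addMetadata(Path file, String name, String value) throws Exception {
        Document doc = load(file);
        Element meta = doc.createElement(name);
        meta.setTextContent(value);
        doc.getDocumentElement().appendChild(meta);
        save(doc, file);
    }
}
```

The rest of the program can keep working with the same Path (or file.toFile()) after addMetadata returns.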
With JDK 1.8.0 a short way is to use the built-in XMLSerializer (which was introduced with JDK 1.4 as a fork of Apache Xerces)
import com.sun.org.apache.xml.internal.serialize.XMLSerializer;
Document doc = //use your method loadXML(File f)
//change Document
java.io.Writer writer = new java.io.FileWriter("MyOutput.xml");
XMLSerializer xml = new XMLSerializer(writer, null);
xml.serialize(doc);
Use an object of type OutputFormat to configure output, for example like this:
OutputFormat format = new OutputFormat(Method.XML, StandardCharsets.UTF_8.toString(), true);
format.setIndent(4);
format.setLineWidth(80);
format.setPreserveEmptyAttributes(true);
format.setPreserveSpace(true);
XMLSerializer xml = new XMLSerializer(writer, format);
Note that the classes are from com.sun.* package which is not documented and therefore generally is not seen as the preferred way of doing things. However, with javax.xml.transform.OutputKeys you cannot specify the amount of indentation or line width for example. So, if this is important then this solution should help.
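As a possible middle ground for the indentation part only: the Xalan-derived transformer that ships with the JDK accepts an implementation-specific output property, {http://xml.apache.org/xslt}indent-amount, in addition to the standard OutputKeys. It is not part of javax.xml.transform.OutputKeys, and other TransformerFactory implementations may ignore it, so treat the sketch below as an assumption about the default JDK transformer. IndentedXml and its methods are hypothetical names; there is no comparable setting for line width that I know of.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper; relies on a Xalan-specific output property.
final class IndentedXml {

    static String toIndentedString(Document doc, int spaces) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        // Implementation-specific: understood by the JDK's built-in transformer.
        transformer.setOutputProperty(
                "{http://xml.apache.org/xslt}indent-amount", Integer.toString(spaces));
        StringWriter writer = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(writer));
        return writer.toString();
    }

    // <root><child>x</child></root>, built without any whitespace nodes.
    static Document sample() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = doc.createElement("root");
        Element child = doc.createElement("child");
        child.setTextContent("x");
        root.appendChild(child);
        doc.appendChild(root);
        return doc;
    }
}
```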

How to Convert DOM document object to xml applying utf-8 charset encoding

I have a requirement to convert a DOM document object to xml and ensure the content of the xml is in utf-8 charset.
My code is shown below, but it does not achieve the intended result: in the generated xml I can see that the characters are not getting encoded.
Document doc = (Document)operation.getResult(); //this method is returning the document object
TransformerFactory tFactory = TransformerFactory.newInstance();
Transformer transformer = tFactory.newTransformer();
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no");
transformer.setOutputProperty(OutputKeys.METHOD, "xml");
DOMSource domSource = new DOMSource(doc);
OutputStreamWriter osw = new OutputStreamWriter(outputStream, "UTF-8");
StreamResult result = new StreamResult(osw);
transformer.transform(domSource,result);
The outputStream from the above code is passed to a file download component in ADF, and the generated xml file does not have the special characters encoded, though the header line stating the encoding is generated.
A sample of the generated xml file looks like this:
<?xml version = '1.0' encoding = 'UTF-8'?>
<PlanObjects>
<CompPlan BusinessUnit="Vision Operations" OrgId="204" Name="RNNewCompPlan" StartDate="2015-01-01" EndDate="2015-12-31">
<CompPlansVORow>
<CompPlanName>RNNewCompPlan</CompPlanName>
<Description>Using some special chars in desc - ¥ © ¢ </Description>
<DisplayName>RNNewCompPlan</DisplayName>
</CompPlansVORow>
</CompPlan>
</PlanObjects>
I was expecting the characters "¥ © ¢ " to be encoded and displayed as hex / octet codes.
Can someone please suggest what is going wrong here?
Your understanding of UTF-8 is incorrect: ¥ © ¢ have been encoded as UTF-8, along with the rest of the file. You can verify that by opening your file in a hex editor and finding the byte sequences c2 a5, c2 a9 and c2 a2, which are the UTF-8 encodings of ¥, © and ¢.
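The same check can be done without a hex editor. This small sketch (Utf8Hex is a hypothetical name) prints the UTF-8 bytes of a string as hex:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: show the UTF-8 bytes of a string as lowercase hex.
final class Utf8Hex {

    static String hex(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            // Mask to 0..255 so negative bytes format as two hex digits.
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }
}
```

For example, Utf8Hex.hex("\u00a5") returns "c2a5", the two-byte UTF-8 form of ¥.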
AFAIK, you shouldn't use hexadecimal/octal character escape sequences in XML. An XML parser will decode your file without issue.
To test your code works with another parser, use the following python code:
import xml.etree.ElementTree as ET
tree = ET.parse('test.xml')
root = tree.getroot()
print(ET.tostring(root, encoding="unicode"))
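If the consumer really does need the special characters written as escapes, one option (a sketch, not something the original answer proposes) is to ask the serializer for an encoding that cannot represent them, such as US-ASCII. Characters outside the output encoding then have to be written as numeric character references, typically of the form &#165;. AsciiXml and its methods are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper: serialize a DOM using only ASCII bytes.
final class AsciiXml {

    static String toAsciiXml(Document doc) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        // Non-ASCII characters cannot be written directly in this encoding,
        // so the serializer falls back to character references.
        transformer.setOutputProperty(OutputKeys.ENCODING, "US-ASCII");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return new String(out.toByteArray(), StandardCharsets.US_ASCII);
    }

    // <Description> element containing the three characters from the question.
    static Document sample() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element description = doc.createElement("Description");
        description.setTextContent("\u00a5 \u00a9 \u00a2");
        doc.appendChild(description);
        return doc;
    }
}
```

Any XML parser turns the references back into the original characters, so the document content is unchanged.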

Convert Doc or Docx into HTML in Java

How do I convert doc or docx into HTML in Java? Using Apache POI, I was able to convert doc to HTML but not docx. Could you show me sample code? This code works with doc but not docx.
HWPFDocumentCore wordDocument = WordToHtmlUtils.loadDoc(stream);
WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument());
wordToHtmlConverter.processDocument(wordDocument);
Document htmlDocument = wordToHtmlConverter.getDocument();
ByteArrayOutputStream out = new ByteArrayOutputStream();
DOMSource domSource = new DOMSource(htmlDocument);
StreamResult streamResult = new StreamResult(out);
TransformerFactory tf = TransformerFactory.newInstance();
Transformer serializer = tf.newTransformer();
serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
serializer.setOutputProperty(OutputKeys.INDENT, "yes");
serializer.setOutputProperty(OutputKeys.METHOD, "html");
serializer.transform(domSource, streamResult);
out.close();
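// Decode with the same charset the serializer was told to write (UTF-8).
String result = new String(out.toByteArray(), java.nio.charset.StandardCharsets.UTF_8);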
There is no reason why this shouldn't / can't work.
Please review the following:
How to extract plain text from a DOCX file using the new OOXML support in Apache POI 3.5?
https://stackoverflow.com/a/5507019/751158
In short, make sure you're using an up-to-date version of POI, and have all of the required libraries.
(If you need additional assistance, please explain what isn't working. Are you getting compile-time errors? Run-time errors? Unexpected output?)

Xml transformation encoding problem

Hi, I have some simple code:
InputSource is = new InputSource(new StringReader(xml))
Document d = documentBuilder.parse(is)
StringWriter result = new StringWriter()
DOMSource ds = new DOMSource(d)
Transformer t = TransformerFactory.newInstance().newTransformer()
t.setOutputProperty(OutputKeys.INDENT, "yes");
t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no");
t.setOutputProperty(OutputKeys.STANDALONE, "yes");
t.setOutputProperty(OutputKeys.ENCODING,"UTF-16")
t.transform(ds,new StreamResult(result))
return result.toString()
that should transform an xml document to UTF-16 encoding. As far as I know, the JVM's internal representation of String already uses UTF-16 chars, but I expect the result String to contain a header where the encoding is set to "UTF-16" (in the original xml it was UTF-8). Instead I get:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
(also the standalone property seems to be wrong)
The transformer instance is: com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl
(which I think is the default)
So what am I missing here?
Use a writer where you explicitly declare UTF-16 as the output encoding. Try OutputStreamWriter(OutputStream out, String charsetName), which should wrap a ByteArrayOutputStream, and see if this works.
I have now written a test of my own, with one minor change:
t.transform(ds,new StreamResult(new File("dest.xml")));
I have the same results but the file is indeed UTF-16 encoded, checked with a hex editor. For some strange reason the xml declaration is not changed. So your code works.
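The OutputStreamWriter suggestion could be sketched as follows. The writer, not the transformer, decides the bytes here, and Java's UTF-16 charset writes a big-endian byte order mark first. What the XML declaration says may still depend on the transformer version, so this sketch makes no claim about it. Utf16Xml and its methods are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper: produce UTF-16 encoded bytes from a DOM.
final class Utf16Xml {

    static byte[] toUtf16Bytes(Document doc) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-16");
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        // The writer performs the char-to-byte conversion as UTF-16.
        try (Writer writer = new OutputStreamWriter(bytes, StandardCharsets.UTF_16)) {
            transformer.transform(new DOMSource(doc), new StreamResult(writer));
        }
        return bytes.toByteArray();
    }

    static Document sample() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = doc.createElement("greeting");
        root.setTextContent("hello");
        doc.appendChild(root);
        return doc;
    }
}
```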
