How to save parsed and changed DOM document in xml file?

I have an XML file. I need to read it, make some changes, and write the changed version to a new destination.
I managed to read, parse, and patch the file (with DocumentBuilderFactory, DocumentBuilder, Document and so on).
But I cannot find a way to save it. Is there a way to get its plain-text view (as a String), or is there a better way?

Something like this works:
Transformer transformer = TransformerFactory.newInstance().newTransformer();
Result output = new StreamResult(new File("output.xml"));
Source input = new DOMSource(myDocument);
transformer.transform(input, output);
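
Since the question also asks for a String view, the same Transformer approach can write into a StringWriter instead of a File. The following is a minimal sketch, not a definitive implementation: the class name DomToString, the method name toXmlString, and the sample XML are made up for illustration. Only JDK classes are used.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, named for illustration only.
class DomToString {

    // Serializes a DOM document to a String with the JDK's identity transformer.
    static String toXmlString(Document doc) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        // The encoding only affects the XML declaration here, because the
        // result is a character stream rather than a byte stream.
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        StringWriter writer = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(writer));
        return writer.toString();
    }

    public static void main(String[] args) throws Exception {
        // Parse a small sample document, patch it, and print the result.
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader("<root><item>1</item></root>")));
        doc.getDocumentElement().setAttribute("patched", "true");
        System.out.println(toXmlString(doc));
    }
}
```

If the returned String is written to a file later, the file should be encoded with the same charset that the XML declaration names (UTF-8 in this sketch).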

This will work, provided you're using Xerces-J:
public String serialise(org.w3c.dom.Document document) throws java.io.IOException {
    java.io.ByteArrayOutputStream data = new java.io.ByteArrayOutputStream();
    java.io.PrintStream ps = new java.io.PrintStream(data);
    // Pretty-printed XML output, encoded as ISO-8859-1
    org.apache.xml.serialize.OutputFormat of =
        new org.apache.xml.serialize.OutputFormat("XML", "ISO-8859-1", true);
    of.setIndent(1);
    of.setIndenting(true);
    org.apache.xml.serialize.XMLSerializer serializer =
        new org.apache.xml.serialize.XMLSerializer(ps, of);
    // As a DOM Serializer
    serializer.asDOMSerializer();
    serializer.serialize(document);
    // Decode with the same charset the serializer wrote
    return data.toString("ISO-8859-1");
}

This lets you define the XML output format (XMLWriter and OutputFormat here appear to be the dom4j classes, in which case document must be a dom4j Document):
new XMLWriter(new FileOutputStream(fileName),
    new OutputFormat() {{
        setEncoding("UTF-8");
        setIndent(" ");
        setTrimText(false);
        setNewlines(true);
        setPadText(true);
    }}).write(document);

Related

Javax DocumentBuilder produces “double-UTF-8’ed” charset encoding

I’ve got a Java DOM Document which MyFilter has rewritten. From logging output I know that the contents of the Document are still correct. I am using the following lines to convert theDocument to a List<String> to pass it back through an interface:
Transformer transformer = TransformerFactory.newInstance().newTransformer();
ByteArrayOutputStream buffer = new ByteArrayOutputStream();
transformer.transform(new DOMSource(theDocument), new StreamResult(buffer));
return Arrays.asList(new String(buffer.toByteArray()).split("\r?\n"));
The filter is called from this file copying method using org.apache.commons.io.FileUtils:
List<String> lines = FileUtils.readLines(source, "UTF-8");
if (filters != null) {
    for (final MyFilter filter : filters) {
        lines = filter.filter(lines);
    }
}
FileUtils.writeLines(destination, "UTF-8", lines);
This works perfectly fine on my machine (where I could debug it), but on other machines just running the code, reproducibly any non-ASCII characters get double-UTF-8’ed (e.g., Größe becomes GrÃ¶ÃŸe). The code is executed within a web app running in Tomcat. I am sure they are differently configured, but what I want is that I get the non-corrupt result on any configuration.
Any ideas what I could be missing?

Once you have created the Document object, you need to read its content.
After that, you can write it to a file using the LSSerializer interface, which the DOM standard provides for this purpose.
By default, the LSSerializer produces an XML document without spaces or line breaks. As a result, the output looks less pretty, but it is actually more suitable for parsing by another program because it is free of unnecessary whitespace.
If you want whitespace, use yet another magic incantation after creating the serializer:
ser.getDomConfig().setParameter("format-pretty-print", true);
The code looks like this:
private String getContentFromDocument(Document doc) {
    String content;
    DOMImplementation impl = doc.getImplementation();
    DOMImplementationLS implLS = (DOMImplementationLS) impl.getFeature("LS", "3.0");
    LSSerializer ser = implLS.createLSSerializer();
    ser.getDomConfig().setParameter("format-pretty-print", true);
    content = ser.writeToString(doc);
    return content;
}
Once you have the string content, you can write it to a file like this:
public void writeToXmlFile(String xmlContent) {
    File theDir = new File("./output");
    if (!theDir.exists())
        theDir.mkdir();
    String fileName = "./output/" + this.getClass().getSimpleName() + "_"
        + Calendar.getInstance().getTimeInMillis() + ".xml";
    try (OutputStream stream = new FileOutputStream(new File(fileName))) {
        try (OutputStreamWriter out = new OutputStreamWriter(stream, StandardCharsets.UTF_8)) {
            out.write(xmlContent);
            out.write("\n");
        }
    } catch (IOException ex) {
        System.err.println("Cannot write to file! " + ex.getMessage());
    }
}
BTW:
Have you tried getting the Document object in a slightly simpler way, like this?
DocumentBuilderFactory documentFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = documentFactory.newDocumentBuilder();
Document doc = builder.parse(new File(fileName));
You can try this as well. It should be enough for parsing an XML file.
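
One caveat with the LSSerializer approach: writeToString produces a UTF-16 string, and the XML declaration it emits typically names UTF-16 as well, which then disagrees with a file written as UTF-8. A possible way around this, sketched here under the assumption that the JDK's built-in DOM implementation is in use, is to let the serializer write the bytes itself through an LSOutput. The class name LsFileWriter and the method name write are made up for illustration.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import org.w3c.dom.Document;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSOutput;
import org.w3c.dom.ls.LSSerializer;

// Hypothetical helper class, named for illustration only.
class LsFileWriter {

    // Serializes the document straight to a file, so that the declared
    // encoding and the actual byte encoding are both UTF-8.
    static void write(Document doc, Path target) throws IOException {
        DOMImplementationLS ls =
                (DOMImplementationLS) doc.getImplementation().getFeature("LS", "3.0");
        LSSerializer serializer = ls.createLSSerializer();
        serializer.getDomConfig().setParameter("format-pretty-print", true);

        LSOutput output = ls.createLSOutput();
        output.setEncoding("UTF-8");
        try (OutputStream stream = Files.newOutputStream(target)) {
            output.setByteStream(stream);
            serializer.write(doc, output);
        }
    }
}
```

A call such as LsFileWriter.write(doc, Paths.get("output.xml")) would then replace both the string conversion and the manual file writing.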

I finally found it: The problem was within the String(byte[]) constructor which interprets byte[] relative to the platform’s default charset. This should at least have been tagged deprecated. The transformer obviously produces UTF-8 output independent of the platform. Changing the method like below passes the same charset to both:
final String ENCODING = "UTF-8";
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.ENCODING, ENCODING);
ByteArrayOutputStream buffer = new ByteArrayOutputStream();
transformer.transform(new DOMSource(theDocument), new StreamResult(buffer));
return Arrays.asList(new String(buffer.toByteArray(), ENCODING).split("\r?\n"));
To get it working, it does not really matter which encoding is used, as long as both sides use the same one. However, it is good to choose a Unicode charset, as otherwise unmappable characters may get lost. Also, the charset will be reflected in the XML declaration, so when the List<String> is saved later, it is important to save it accordingly.
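
The effect can be reproduced without any XML at all. The sketch below (the class name CharsetRoundTrip and the method name roundTrip are made up for illustration) encodes a string with one charset and decodes it with another. ISO-8859-1 stands in for a non-UTF-8 platform default because every JVM is guaranteed to support it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, named for illustration only.
class CharsetRoundTrip {

    // Encodes text with one charset and decodes the bytes with another.
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        return new String(text.getBytes(encodeWith), decodeWith);
    }

    public static void main(String[] args) {
        // "Groesse" with o-umlaut and sharp s, written with Unicode escapes
        // so the demo does not depend on the source file's encoding.
        String original = "Gr\u00f6\u00dfe";

        String same = roundTrip(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        String mixed = roundTrip(original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);

        System.out.println(same.equals(original));  // matching charsets preserve the text
        System.out.println(mixed.length());         // each 2-byte character became 2 characters
    }
}
```

With matching charsets the text survives; with mismatched ones every non-ASCII character turns into two characters, which is the first half of the "double-UTF-8" effect described in the question.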

How to perform mail merge functionality in java using dot/dotx and doc/docx format document template

I want to perform mail merge functionality in Java using dot/dotx and doc/docx format documents. I tried using docx4j, but it removes much of the rich-text indentation from the documents.
I also tried extracting some of the HTML content from the Word document, but was not able to paste it back into a Word document.
public static void readDocxFile1(String fileName) {
    // this.file = file;
    try {
        File file = new File(fileName);
        FileInputStream finStream = new FileInputStream(file.getAbsolutePath());
        HWPFDocument doc = new HWPFDocument(finStream);
        WordExtractor wordExtract = new WordExtractor(doc);
        Document newDocument = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(newDocument);
        wordToHtmlConverter.processDocument(doc);
        StringWriter stringWriter = new StringWriter();
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.setOutputProperty(OutputKeys.ENCODING, "utf-8");
        transformer.setOutputProperty(OutputKeys.METHOD, "html");
        transformer.transform(new DOMSource(wordToHtmlConverter.getDocument()), new StreamResult(stringWriter));
        String html = stringWriter.toString();
        System.out.println("html>>>>>>" + html);
    } catch (Exception e) {
        e.printStackTrace();
    }
}
My requirement is to (1) read a dot/dotx or doc/docx template and then, looping over the number of people, (2) replace the keywords and (3) paste the result into a new document.
Please suggest how I can implement this feature.
Also, please tell me whether the Aspose.Words API for Java will do this for me.

Yes, you can meet these requirements using the Aspose.Words for Java API. I suggest you refer to the following sections of the documentation:
Loading, Saving and Converting
Mail Merge and Reporting
Find and Replace Overview
I work with Aspose as a Developer Evangelist.

how to create xml file in runtime?

I am trying to create an XML file at run-time under my web content folder, but a No such file or directory error was displayed.
My code:
Document document = DocumentHelper.createDocument();
Element rootElement = document.addElement("Students");
Element studentElement = rootElement.addElement("student").addAttribute("country", "USA");
studentElement.addElement("id").addText("1");
studentElement.addElement("name").addText("Peter");
XMLWriter writer = new XMLWriter(new FileWriter("/WebContent/Students.xml"));
// Note that you can format this XML document
/*
 * FileWriter output = new FileWriter(new File("Students.xml"));
 * OutputFormat format = OutputFormat.createPrettyPrint();
 * XMLWriter writer = new XMLWriter(output, format); // <- will format the output
 */
//You can print this to the console and see what it looks like
String xmlElement = document.asXML();
System.out.println(xmlElement);
writer.write(document);
writer.close();
I don't know how to do this. Can anyone help me to fix my code?

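I got the answer: I just changed the path from /WebContent/Students.xml to WebContent/Students.xml, that is, I removed the / before WebContent. With the leading slash the path is absolute and starts at the filesystem root; without it, the path is relative to the application's working directory.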
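
A relative path still fails with the same error if the target directory does not exist under the working directory. The sketch below uses only JDK classes and shows one way to guard against that by resolving the path and creating missing parent directories first. The class name RelativeXmlPath and the method name writeXml are made up for illustration, and the XML is written as plain text rather than through dom4j.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper class, named for illustration only.
class RelativeXmlPath {

    // Writes the XML text to the target, creating missing parent directories first.
    static Path writeXml(Path target, String xml) throws IOException {
        // A relative path is resolved against the JVM's working directory.
        Path absolute = target.toAbsolutePath();
        Files.createDirectories(absolute.getParent());
        try (Writer writer = Files.newBufferedWriter(absolute, StandardCharsets.UTF_8)) {
            writer.write(xml);
        }
        return absolute;
    }

    public static void main(String[] args) throws IOException {
        Path written = writeXml(Paths.get("WebContent", "Students.xml"), "<Students/>");
        // Printing the resolved path shows where the file really ended up.
        System.out.println("Wrote " + written);
    }
}
```

In a web application the working directory is usually that of the server process, not the project folder, so printing the resolved path is a quick way to check where the file was written.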

How to use Tika's XWPFWordExtractorDecorator class?

Someone told me that Tika's XWPFWordExtractorDecorator class is used to convert docx into HTML, but I am not sure how to use this class to get the HTML from a docx file. Suggestions for any other library that does the same job are also appreciated.

You shouldn't use it directly.
Instead, call Tika in the usual way, and it'll call the appropriate code for you.
If you want XHTML from parsing a file, the code looks something like this:
// Either of these will work, the latter is recommended
//InputStream input = new FileInputStream("test.docx");
InputStream input = TikaInputStream.get(new File("test.docx"));
// AutoDetect is normally best, unless you know the best parser for the type
Parser parser = new AutoDetectParser();
// Handler for indented XHTML
StringWriter sw = new StringWriter();
SAXTransformerFactory factory = (SAXTransformerFactory)
    SAXTransformerFactory.newInstance();
TransformerHandler handler = factory.newTransformerHandler();
handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
handler.setResult(new StreamResult(sw));
// Call the Tika Parser
try {
    Metadata metadata = new Metadata();
    parser.parse(input, handler, metadata, new ParseContext());
    String xml = sw.toString();
} finally {
    input.close();
}

get node raw text

How can I get a node's value together with its child nodes? For example, I have the following node parsed into a DOM Document instance:
<root>
<ch1>That is a text with <value name="val1">value contents</value></ch1>
</root>
I select the ch1 node using XPath. Now I need to get its contents, that is, everything contained between <ch1> and </ch1>: That is a text with <value name="val1">value contents</value>.
How can I do it?

I have found the following code snippet that uses a transformation; it gives almost exactly what I want. The result can be tuned by changing the output method.
public static String serializeDoc(Node doc) {
    StringWriter outText = new StringWriter();
    StreamResult sr = new StreamResult(outText);
    Properties oprops = new Properties();
    oprops.put(OutputKeys.METHOD, "xml");
    TransformerFactory tf = TransformerFactory.newInstance();
    Transformer t = null;
    try {
        t = tf.newTransformer();
        t.setOutputProperties(oprops);
        t.transform(new DOMSource(doc), sr);
    } catch (Exception e) {
        System.out.println(e);
    }
    return outText.toString();
}
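
The snippet above serializes the node itself, so the output includes the <ch1> tags and, by default, an XML declaration. To get only what lies between the tags, one option is to serialize each child node in turn with the declaration switched off. This is a minimal sketch, assuming the JDK's built-in transformer; the class name InnerXml and the method name innerXml are made up for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, named for illustration only.
class InnerXml {

    // Serializes the children of a node, leaving out the node's own tags.
    static String innerXml(Node node) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.METHOD, "xml");
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            // Each child (text or element) is appended to the same writer.
            transformer.transform(new DOMSource(children.item(i)), new StreamResult(out));
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><ch1>That is a text with "
                + "<value name=\"val1\">value contents</value></ch1></root>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Node ch1 = doc.getElementsByTagName("ch1").item(0);
        System.out.println(innerXml(ch1));
    }
}
```

Note that the result is re-serialized XML, not the original source text, so details such as quote style or entity escaping may differ from the input file.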

If this is server-side Java (i.e. you do not need to worry about it running on other JVMs) and you are using the Sun/Oracle JDK, you can do the following:
import com.sun.org.apache.xml.internal.serialize.OutputFormat;
import com.sun.org.apache.xml.internal.serialize.XMLSerializer;
...
Node n = ...;
OutputFormat outputFormat = new OutputFormat();
outputFormat.setOmitXMLDeclaration(true);
ByteArrayOutputStream baos = new ByteArrayOutputStream();
XMLSerializer ser = new XMLSerializer(baos, outputFormat);
ser.serialize(n);
System.out.println(new String(baos.toByteArray()));
Remember that the final conversion to a String may need an explicit encoding parameter: if the serializer writes its bytes in an encoding other than your platform's default one, you'll get garbage for the non-ASCII characters.

You could use jOOX to wrap your DOM objects and get many utility functions from it, such as the one you need. In your case, this will produce the result you need (using CSS-style selectors to find <ch1/>):
String xml = $(document).find("ch1").content();
Or with XPath as you did:
String xml = $(document).xpath("//ch1").content();
Internally, jOOX will use a transformer to generate that output, as others have mentioned.

As far as I know, there is no equivalent of innerHTML in Document. DOM is meant to hide the details of the markup from you.
You can probably get the effect you want by going through the children of that node. Suppose for example that you want to copy out the text, but replace each "value" tag with a programmatically supplied value:
HashMap<String, String> values = ...;
StringBuilder str = new StringBuilder();
// Children can be text nodes as well as elements, so iterate over Node
for (Node child = ch1.getFirstChild(); child != null; child = child.getNextSibling()) {
    if (child.getNodeType() == Node.TEXT_NODE) {
        str.append(child.getTextContent());
    } else if (child.getNodeName().equals("value")) {
        str.append(values.get(child.getAttributes().getNamedItem("name").getTextContent()));
    }
}
String output = str.toString();
