I have a xml file as object in Java as org.w3c.dom.Document doc and I want to convert this into File file. How can I convert the type Document to File?
thanks
I want to add metadata elements in an existing xml file (standard dita) with type File.
I know a way to add elements to the file, but then I have to convert the file to a org.w3c.dom.Document. I did that with the method loadXML:
private Document loadXML(File f) throws Exception{
DocumentBuilder b = DocumentBuilderFactory.newInstance().newDocumentBuilder();
return builder.parse(f);
After that I change the org.w3c.dom.Document, then I want to continue with the flow of the program and I have to convert the Document doc back to a File file.
What is a efficient way to do that? Or what is a better solution to get some elements in a xml File without converting it?
You can use a Transformer class to output the entire XML content to a File, as showed below:
Document doc =...
// write the content into xml file
DOMSource source = new DOMSource(doc);
FileWriter writer = new FileWriter(new File("/tmp/output.xml"));
StreamResult result = new StreamResult(writer);
TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer transformer = transformerFactory.newTransformer();
transformer.transform(source, result);
With JDK 1.8.0 a short way is to use the built-in XMLSerializer (which was introduced with JDK 1.4 as a fork of Apache Xerces)
import com.sun.org.apache.xml.internal.serialize.XMLSerializer;
Document doc = //use your method loadXML(File f)
//change Document
java.io.Writer writer = new java.io.FileWriter("MyOutput.xml");
XMLSerializer xml = new XMLSerializer(writer, null);
xml.serialize(doc);
Use an object of type OutputFormat to configure output, for example like this:
OutputFormat format = new OutputFormat(Method.XML, StandardCharsets.UTF_8.toString(), true);
format.setIndent(4);
format.setLineWidth(80);
format.setPreserveEmptyAttributes(true);
format.setPreserveSpace(true);
XMLSerializer xml = new XMLSerializer(writer, format);
Note that the classes are from com.sun.* package which is not documented and therefore generally is not seen as the preferred way of doing things. However, with javax.xml.transform.OutputKeys you cannot specify the amount of indentation or line width for example. So, if this is important then this solution should help.
Related
In input i have xml file(it can be 1000 or 100000 files) and i have to convert it to 6 csv files for later saving to the database. My question is how to do this in java more efficient, now i create 6 transformers with different xslt stylesheets and manually transform xml 6 times. I tried do this in one xslt transformation with function: result-document, it works, but in inputmay be more than one xml file and after each transformation data in result files rewrites. My idea collect all data from xml files in csv and then copy it to db tables.
TransformerFactory tf = TransformerFactory.newInstance();
Transformer transformer = tf.newTemplates(stylesource).newTransformer();
Transformer transformer2 = tf.newTemplates(stylesource2).newTransformer();
Transformer transformer3 = tf.newTemplates(stylesource3).newTransformer();
Transformer transformer4 = tf.newTemplates(stylesource4).newTransformer();
Transformer transformer5 = tf.newTemplates(stylesource5).newTransformer();
Transformer transformer6 = tf.newTemplates(stylesource6).newTransformer();
DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
public void transformXmlToCsv(String content) throws TransformerException, IOException, SAXException {
Document doc = db.parse(new InputSource(new StringReader(content)));
Source source = new DOMSource(doc);
transformer.transform(source, outputTarget);
transformer2.transform(source, outputTarget2);
transformer3.transform(source, outputTarget3);
transformer4.transform(source, outputTarget4);
transformer5.transform(source, outputTarget5);
transformer6.transform(source, outputTarget6);
}
One improvement you could make would be to avoid repeated parsing of the source document by building the input tree once. For example, by building a DOM tree and using a DOMSource, or (better if you're using Saxon) by using Saxon interfaces to build the tree once in Saxon's internal format.
Another improvement would be to only create one TransformerFactory for everything. Creating a TransformerFactory is typically expensive (it involves a search of the classpath) and there's no need to ever create more than one.
It should be easy to fix your problem with xsl:result-document. There are many ways of doing it, e.g. by directing the output of each transformation to a different directory, but I can't tell what the best way is without more information.
I have to read and xml file, do some changes, and copy it to another location. I also have to keep the german special characters, and keep the empty tags as they are (prevent them to become self-closing tags). For preventing the self closing tags, I used Xerces Library, as in the link:
preventing empty xml elements are converted to self closing elements
In my application, if my changes in xml are ignored, the code looks like:
public static void main(String args[]) throws Exception {
InputStream inputStream= new FileInputStream(new File("D:\\qwe.xml"));
Reader reader = new InputStreamReader(inputStream,"ISO-8859-1");
InputSource is = new InputSource(reader);
is.setEncoding("ISO-8859-1");
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder;
dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(is);
doc.setXmlStandalone(true);
File file = new File ("D:\\qwerty.xml");
XMLStreamWriter writer = XMLOutputFactory.newFactory().createXMLStreamWriter(new FileOutputStream(file));
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.ENCODING, "ISO-8859-1") ;
transformer.transform(new DOMSource(doc), new StAXResult(writer));
}
The first row in the source file is
<?xml version="1.0" encoding="UTF-8"?>
The problem is in the destination file, qwerty.xml, where encoding="UTF-8" is removed. In the source file, although the encoding is UTF-8, I had to set it as "ISO-8859-1" because of german characters. I want to keep the first row as the original, keep the empty tags as they are (not self-closing tags), and keep the german characters. My code succeeds to do only the second and third thing.
The call
Transformer.setOutputProperty(OutputKeys.ENCODING, "ISO-8859-1");
has no effect unless the transformer is producing serialized output.
In your case the transformer is not producing serialized output because you are sending the output to a StAXResult. I'm not sure why you are using the XmlStreamWriter to produce output, but if you want to do it that way, it's the XmlStreamWriter that decides on the encoding, not the Transformer.
I would have thought it was simpler to send the Transformer output to a StreamResult.
My code receives an XML String from an InputStreamReader (it's actually the output of REST request to another server) and then the String is written to a file (file includes not only XML).
The problem is that the String is received as one line of XML and so it's stored as one huge line in the file (no indentation, tabs, formatting etc.).
Can I receive this XML stream and format it while writing it to the file?
Note: I can't use DOM here, it must be implemented without loading the XML to memory.
You can do it using Transformer and SAX, if you are allowed to :-
public static void prettyPrintXmlToFile(String sourceXml, File targetFile) throws Exception{
Transformer serializer = SAXTransformerFactory.newInstance().newTransformer();
serializer.setOutputProperty(OutputKeys.INDENT, "yes");
serializer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");
Source xmlSource = new SAXSource(new InputSource(new StringReader(sourceXml)));
StreamResult res = new StreamResult(targetFile);
serializer.transform(xmlSource, res);
}
How to convert doc or docx into HTML in Java. Using Apache POI, I was able to convert doc to html but unable to convert docx into html? Please show me sample code? This code work with doc but not docx.
HWPFDocumentCore wordDocument = WordToHtmlUtils.loadDoc(stream);
WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument());
wordToHtmlConverter.processDocument(wordDocument);
Document htmlDocument = wordToHtmlConverter.getDocument();
ByteArrayOutputStream out = new ByteArrayOutputStream();
DOMSource domSource = new DOMSource(htmlDocument);
StreamResult streamResult = new StreamResult(out);
TransformerFactory tf = TransformerFactory.newInstance();
Transformer serializer = tf.newTransformer();
serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
serializer.setOutputProperty(OutputKeys.INDENT, "yes");
serializer.setOutputProperty(OutputKeys.METHOD, "html");
serializer.transform(domSource, streamResult);
out.close();
String result = new String(out.toByteArray());
There is no reason why this shouldn't / can't work.
Please review the following:
How to extract plain text from a DOCX file using the new OOXML support in Apache POI 3.5?
https://stackoverflow.com/a/5507019/751158
In short, make sure you're using an up-to-date version of POI, and have all of the required libraries.
(If you need additional assistance, please explain what isn't working. Are you getting compile-time errors? Run-time errors? Unexpected output?)
How do I control the order that the XML attributes are listed within the output file?
It seems by default they are getting alphabetized, which the program I'm sending this XML to apparently isn't handling.
e.g. I want zzzz to show first, then bbbbb in the following code.
DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
Document doc = docBuilder.newDocument();
Element root = doc.createElement("requests");
doc.appendChild(root);
root.appendChild(request);
root.setAttribute("zzzzzz", "My z value");
root.setAttribute("bbbbbbb", "My b value");
TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer transformer = transformerFactory.newTransformer();
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
DOMSource source = new DOMSource(doc);
StreamResult result = new StreamResult(new File(file));
transformer.transform(source, result);
The order of attributes is defined to be insignificant in XML: no conformant XML application should produce results that depend on the order in which attributes appear. Therefore, serializers (code that produces lexical XML as output) will usually give you no control over the order.
Now, it would sometimes be nice to have that control for cosmetic reasons, because XML is designed to be human-readable. So there's a valid reason for wanting the feature. But the fact is, I know of no serializer that offers it.
I had the same issue when I used XML DOM API for writing file. To resolve the problem I had to use XMLStreamWriter. Attributes appear in a xml file in the order you write it using XMLStreamWriter.
XML Canonicalisation results in a consistent attribute ordering, primarily to allow one to check a signature over some or all of the XML, though there are other potential uses. This may suit your purposes.
If you don't want to use another framework just for a custom attribute order you can simply add an order identifier to the attributes.
<someElement a__price="32" b__amount="3"/>
After the XML serializer is done post process the raw XML like so:
public static String removeAttributeOrderIdentifiers(String xml) {
return xml.replaceAll(
" [a-z]__(.+?=\")",
"$1"
);
}
And you will get:
<someElement amount="3" price="32"/>