Straight from the manual:
Writing Out a DOM as an XML File
After you have constructed a DOM (either by parsing an XML file or
building it programmatically) you frequently want to save it as XML.
This section shows you how to do that using the Xalan transform
package.
Using that package, you will create a transformer object to wire a
DOMSource to a StreamResult. You will then invoke the transformer's
transform() method to write out the DOM as XML data.
my output:
thufir#dur:~/NetBeansProjects/helloWorldSaxon$
thufir#dur:~/NetBeansProjects/helloWorldSaxon$ gradle clean run
> Task :run
Jan 04, 2019 3:28:24 PM helloWorldSaxon.HandlerForXML createDocumentFromURL
INFO: http://books.toscrape.com/
Jan 04, 2019 3:28:26 PM helloWorldSaxon.HandlerForXML createDocumentFromURL
INFO: javax.xml.transform.dom.DOMResult#3cda1055
Jan 04, 2019 3:28:26 PM helloWorldSaxon.HandlerForXML createDocumentFromURL
INFO: html
BUILD SUCCESSFUL in 2s
4 actionable tasks: 4 executed
thufir#dur:~/NetBeansProjects/helloWorldSaxon$
Firstly, I'd like more meaningful output for what the domResult is, looks like, or contains. More important, I believe, is iterating or traversing document below:
public void createDocumentFromURL() throws SAXException, IOException, TransformerException, ParserConfigurationException {
LOG.info(url.toString());
TransformerFactory transformerFactory = TransformerFactory.newInstance();
XMLReader xmlReader = XMLReaderFactory.createXMLReader("org.ccil.cowan.tagsoup.Parser");
Source source = new SAXSource(xmlReader, new InputSource(url.toString()));
DOMResult domResult = new DOMResult();
Transformer transformer = transformerFactory.newTransformer();
transformer.transform(source, domResult); //how do I find the result of this operation?
LOG.info(domResult.toString()); //traverse or iterate how?
DocumentBuilder documentBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
// Document document = documentBuilder.parse(); ///bzzzt, wrong
Document document = (Document) domResult.getNode();
LOG.info(document.getDocumentElement().getTagName());
}
That the output is "html" inclines me to believe that this is the html. The desired output is that html, but from a Document, rather than a String.
Oracle documention on writing out a DOM is to parse the document. Is this document not already parsed? Or, to put another way, how do I establish that it is or is not an XML file at all?
So.....transform it again?
see also:
Java: convert StreamResult to DOM
You really just have to transform the DOM to your file.
Example
// Create DOM
Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
Element root = document.createElement("Root");
document.appendChild(root);
Element foo = document.createElement("Foo");
foo.appendChild(document.createTextNode("Bar"));
root.appendChild(foo);
You can save that DOM to a file like this:
// Write DOM to file as XML
File xmlFile = new File("/path/to/file.xml");
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.transform(new DOMSource(document), new StreamResult(xmlFile));
You can also just print the DOM like this:
// Print DOM as XML
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.transform(new DOMSource(document), new StreamResult(System.out));
Output
<?xml version="1.0" encoding="UTF-8" standalone="no"?><Root><Foo>Bar</Foo></Root>
If you want the XML formatted:
// Print DOM as formatted XML
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.transform(new DOMSource(document), new StreamResult(System.out));
Output
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<Root>
<Foo>Bar</Foo>
</Root>
Related
I have to read and xml file, do some changes, and copy it to another location. I also have to keep the german special characters, and keep the empty tags as they are (prevent them to become self-closing tags). For preventing the self closing tags, I used Xerces Library, as in the link:
preventing empty xml elements are converted to self closing elements
In my application, if my changes in xml are ignored, the code looks like:
public static void main(String args[]) throws Exception {
InputStream inputStream= new FileInputStream(new File("D:\\qwe.xml"));
Reader reader = new InputStreamReader(inputStream,"ISO-8859-1");
InputSource is = new InputSource(reader);
is.setEncoding("ISO-8859-1");
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder;
dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(is);
doc.setXmlStandalone(true);
File file = new File ("D:\\qwerty.xml");
XMLStreamWriter writer = XMLOutputFactory.newFactory().createXMLStreamWriter(new FileOutputStream(file));
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.ENCODING, "ISO-8859-1") ;
transformer.transform(new DOMSource(doc), new StAXResult(writer));
}
The first row in the source file is
<?xml version="1.0" encoding="UTF-8"?>
The problem is in the destination file, qwerty.xml, where encoding="UTF-8" is removed. In the source file, although the encoding is UTF-8, I had to set it as "ISO-8859-1" because of german characters. I want to keep the first row as the original, keep the empty tags as they are (not self-closing tags), and keep the german characters. My code succeeds to do only the second and third thing.
The call
Transformer.setOutputProperty(OutputKeys.ENCODING, "ISO-8859-1");
has no effect unless the transformer is producing serialized output.
In your case the transformer is not producing serialized output because you are sending the output to a StAXResult. I'm not sure why you are using the XmlStreamWriter to produce output, but if you want to do it that way, it's the XmlStreamWriter that decides on the encoding, not the Transformer.
I would have thought it was simpler to send the Transformer output to a StreamResult.
I have a xml file as object in Java as org.w3c.dom.Document doc and I want to convert this into File file. How can I convert the type Document to File?
thanks
I want to add metadata elements in an existing xml file (standard dita) with type File.
I know a way to add elements to the file, but then I have to convert the file to a org.w3c.dom.Document. I did that with the method loadXML:
private Document loadXML(File f) throws Exception{
DocumentBuilder b = DocumentBuilderFactory.newInstance().newDocumentBuilder();
return builder.parse(f);
After that I change the org.w3c.dom.Document, then I want to continue with the flow of the program and I have to convert the Document doc back to a File file.
What is a efficient way to do that? Or what is a better solution to get some elements in a xml File without converting it?
You can use a Transformer class to output the entire XML content to a File, as showed below:
Document doc =...
// write the content into xml file
DOMSource source = new DOMSource(doc);
FileWriter writer = new FileWriter(new File("/tmp/output.xml"));
StreamResult result = new StreamResult(writer);
TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer transformer = transformerFactory.newTransformer();
transformer.transform(source, result);
With JDK 1.8.0 a short way is to use the built-in XMLSerializer (which was introduced with JDK 1.4 as a fork of Apache Xerces)
import com.sun.org.apache.xml.internal.serialize.XMLSerializer;
Document doc = //use your method loadXML(File f)
//change Document
java.io.Writer writer = new java.io.FileWriter("MyOutput.xml");
XMLSerializer xml = new XMLSerializer(writer, null);
xml.serialize(doc);
Use an object of type OutputFormat to configure output, for example like this:
OutputFormat format = new OutputFormat(Method.XML, StandardCharsets.UTF_8.toString(), true);
format.setIndent(4);
format.setLineWidth(80);
format.setPreserveEmptyAttributes(true);
format.setPreserveSpace(true);
XMLSerializer xml = new XMLSerializer(writer, format);
Note that the classes are from com.sun.* package which is not documented and therefore generally is not seen as the preferred way of doing things. However, with javax.xml.transform.OutputKeys you cannot specify the amount of indentation or line width for example. So, if this is important then this solution should help.
I am trying to read a XML file using FileInputStreamReader class. But, when I try to read large XML files, some problems occur. Which java class is more suitable to read large XML files and what are the most efficient parsers to parse XML files?
I think that DOM parsing is a good way to parse XML files
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = factory.newDocumentBuilder();
Document doc = docBuilder.parse(yourFile);
in this way you start the parsing of your XML document, then you can modify the nodes and change what you want to change
TransformerFactory tf = TransformerFactory.newInstance();
Transformer transformer = tf.newTransformer();
DOMSource source = new DOMSource(yourDoc);
StreamResult result = new StreamResult(yourFile);
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.transform(source, result);
And this second parte is to save all your changes. The "setOutputPropery" method is not mandatory, it's only used to give to the XML file a nice indent.
I am trying to read an XML document and output it into a new XML document using the W3C DOM API in Java. To handle DOCTYPEs, I am using the following code (from an input Document doc to a target File target):
TransformerFactory transfac = TransformerFactory.newInstance();
Transformer trans = transfac.newTransformer();
trans.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no"); // omit '<?xml version="1.0"?>'
trans.setOutputProperty(OutputKeys.INDENT, "yes");
// if a doctype was set, it needs to persist
if (doc.getDoctype() != null) {
DocumentType doctype = doc.getDoctype();
trans.setOutputProperty(OutputKeys.DOCTYPE_SYSTEM, doctype.getSystemId());
trans.setOutputProperty(OutputKeys.DOCTYPE_PUBLIC, doctype.getPublicId());
}
FileWriter sw = new FileWriter(target);
StreamResult result = new StreamResult(sw);
DOMSource source = new DOMSource(doc);
trans.transform(source, result);
This works fine for both XML documents with and without DOCTYPEs. However, I am now coming across a NullPointerException when trying to transform the following input XML document:
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE permissions >
<permissions>
// ...
</permissions>
HTML 5 uses a similar syntax for its DOCTYPEs, and it is valid. But I have no idea how to handle this using the W3C DOM API - trying to set the DOCTYPE_SYSTEM to null throws an exception. Can I still use the W3C DOM API to output an empty doctype?
Although this question is two years old, it is a top search result in some web search engine, so maybe it is a useful shortcut. See the question Set HTML5 doctype with XSLT referring to http://www.w3.org/html/wg/drafts/html/master/syntax.html#doctype-legacy-string, which says:
For the purposes of HTML generators that cannot output HTML markup
with the short DOCTYPE "<!DOCTYPE html>", a DOCTYPE legacy
string may be inserted into the DOCTYPE [...]
In other words, <!DOCTYPE html SYSTEM "about:legacy-compat"> or
<!DOCTYPE html SYSTEM 'about:legacy-compat'>, case-insensitively
except for the part in single or double quotes.
Leading to a line of Java code like this:
trans.setOutputProperty(OutputKeys.DOCTYPE_SYSTEM, "about:legacy-compat");
Try the suggestions here https://stackoverflow.com/a/6637886/116509. Basically it looks like it can't be done with standard Java DOM support.
You can also try StAX
XMLStreamWriter xmlStreamWriter =
XMLOutputFactory.newFactory().createXMLStreamWriter( System.out, doc.getXmlEncoding() );
Result result = new StAXResult( xmlStreamWriter );
// ... create dtd String
xmlStreamWriter.writeDTD( dtd );
DOMSource source = new DOMSource( doc );
trans.transform( source, result );
but it's ugly because the DTD parameter is a String, and you only have a DocumentType object.
How do I control the order that the XML attributes are listed within the output file?
It seems by default they are getting alphabetized, which the program I'm sending this XML to apparently isn't handling.
e.g. I want zzzz to show first, then bbbbb in the following code.
DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
Document doc = docBuilder.newDocument();
Element root = doc.createElement("requests");
doc.appendChild(root);
root.appendChild(request);
root.setAttribute("zzzzzz", "My z value");
root.setAttribute("bbbbbbb", "My b value");
TransformerFactory transformerFactory = TransformerFactory.newInstance();
Transformer transformer = transformerFactory.newTransformer();
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
DOMSource source = new DOMSource(doc);
StreamResult result = new StreamResult(new File(file));
transformer.transform(source, result);
The order of attributes is defined to be insignificant in XML: no conformant XML application should produce results that depend on the order in which attributes appear. Therefore, serializers (code that produces lexical XML as output) will usually give you no control over the order.
Now, it would sometimes be nice to have that control for cosmetic reasons, because XML is designed to be human-readable. So there's a valid reason for wanting the feature. But the fact is, I know of no serializer that offers it.
I had the same issue when I used XML DOM API for writing file. To resolve the problem I had to use XMLStreamWriter. Attributes appear in a xml file in the order you write it using XMLStreamWriter.
XML Canonicalisation results in a consistent attribute ordering, primarily to allow one to check a signature over some or all of the XML, though there are other potential uses. This may suit your purposes.
If you don't want to use another framework just for a custom attribute order you can simply add an order identifier to the attributes.
<someElement a__price="32" b__amount="3"/>
After the XML serializer is done post process the raw XML like so:
public static String removeAttributeOrderIdentifiers(String xml) {
return xml.replaceAll(
" [a-z]__(.+?=\")",
"$1"
);
}
And you will get:
<someElement amount="3" price="32"/>