Writing XML in different character encodings with Java

I am attempting to write an XML library file that can be read again into my program.
The file writer code is as follows:
XMLBuilder builder = new XMLBuilder();
Document doc = builder.build(bookList);
DOMImplementation impl = doc.getImplementation();
DOMImplementationLS implLS = (DOMImplementationLS) impl.getFeature("LS", "3.0");
LSSerializer ser = implLS.createLSSerializer();
String out = ser.writeToString(doc);
//System.out.println(out);
try{
FileWriter fstream = new FileWriter(location);
BufferedWriter outwrite = new BufferedWriter(fstream);
outwrite.write(out);
outwrite.close();
}catch (Exception e){
}
The above code does write an XML document.
However, the XML declaration carries an encoding attribute saying the file is encoded in UTF-16.
When I read the file back in, I get the error:
"Content is not allowed in prolog"
This error does not occur when the encoding attribute is manually changed to UTF-8.
I am trying to get the above code to write an XML document encoded in UTF-8, or to successfully parse a UTF-16 file.
The code for parsing it is:
DocumentBuilderFactory factory =
DocumentBuilderFactory.newInstance();
DocumentBuilder loader = factory.newDocumentBuilder();
Document document = loader.parse(filename);
The last line throws the error.

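The LSSerializer writeToString method does not let you pick an encoding; the string it returns always declares UTF-16.
To choose the encoding, create an LSOutput, call its setEncoding method, and pass it to LSSerializer's write method instead. The LSOutput's character stream can be set to the BufferedWriter, so that the serializer writes straight to the file. Make sure the writer itself encodes with the same charset (for example an OutputStreamWriter created with UTF-8 rather than a FileWriter, which uses the platform default charset).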
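
The approach described above can be sketched roughly as follows. This is a minimal sketch rather than the answerer's exact code: the class name Utf8LsWriter and the method serializeUtf8 are made up for illustration, and it hands the serializer a byte stream instead of a character stream, on the assumption that letting the serializer do the encoding is the simplest way to keep the bytes and the declared encoding in agreement.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSOutput;
import org.w3c.dom.ls.LSSerializer;

// Hypothetical helper for illustration; not part of the original question's code.
final class Utf8LsWriter {

    /** Serializes the document to UTF-8 bytes with a matching XML declaration. */
    static byte[] serializeUtf8(Document doc) {
        DOMImplementationLS ls =
                (DOMImplementationLS) doc.getImplementation().getFeature("LS", "3.0");
        LSSerializer serializer = ls.createLSSerializer();

        LSOutput output = ls.createLSOutput();
        output.setEncoding("UTF-8");   // used for both the bytes and the declaration
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        output.setByteStream(bytes);   // the serializer encodes; no Writer is involved

        serializer.write(doc, output);
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        Document doc = factory.newDocumentBuilder().newDocument();
        doc.appendChild(doc.createElement("library")).setTextContent("Gr\u00f6\u00dfe");

        byte[] xml = serializeUtf8(doc);

        // Round trip: the parser honours the declared encoding when decoding the bytes.
        Document back = factory.newDocumentBuilder().parse(new ByteArrayInputStream(xml));
        System.out.println(back.getXmlEncoding());
        System.out.println("Gr\u00f6\u00dfe".equals(back.getDocumentElement().getTextContent()));
    }
}
```

To write a file, the returned bytes can be passed to java.nio.file.Files.write unchanged, so no second encoding step can disagree with the declaration.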

Related

Format attributes for XML in Pretty format in java

I am trying to pretty-print an XML string. I want all of an element's attributes to be printed on a single line.
XML input:
<root><feeds attribute1="a" attribute2="b" attribute3="c" attribute4="d" attribute5="e" attribute6="f"> <id>2140</id><title>gj</title><description>ghj</description>
<msg/>
Expected output:
<root>
<feeds attribute1="a" attribute2="b" attribute3="c" attribute4="d" attribute5="e" attribute6="f">
<id>2140</id>
<title>gj</title>
<description>ghj</description>
<msg/>
</feeds>
Actual Output:
<root>
<feeds attribute1="a" attribute2="b" attribute3="c" attribute4="d"
attribute5="e" attribute6="f">
<id>2140</id>
<title>gj</title>
<description>ghj</description>
<msg/>
</feeds>
Here is my code to format xml. I have also tried SAX parser. I don't want to use DOM4J.
public static String formatXml(String xml) {
DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
DOMImplementationLS impl = (DOMImplementationLS) registry.getDOMImplementation("LS");
LSSerializer writer = impl.createLSSerializer();
writer.getDomConfig().setParameter("format-pretty-print", Boolean.TRUE);
writer.getDomConfig().setParameter("xml-declaration", false);
writer.getDomConfig().setParameter("well-formed", true);
LSOutput output = impl.createLSOutput();
ByteArrayOutputStream out = new ByteArrayOutputStream();
output.setByteStream(out);
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
InputSource is = new InputSource(new StringReader(xml));
writer.write(db.parse(is), output);
return new String(out.toByteArray());
}
Is there any way to keep the attributes on one line with a SAX or DOM parser? I am not looking for an additional library; I am looking for a solution that uses only the Java standard library.

A SAX or DOM parser will read your input string and allow your application to understand what was passed in. At some point in time your application then writes out that data, and that is the moment where you decide to insert additional whitespace (like linefeeds and tab characters) to pretty-print the document.
If you really want to use SAX and keep the processing efficient, the best you can do is write the document while it is being parsed. To do that, implement the ContentHandler interface (https://docs.oracle.com/en/java/javase/11/docs/api/java.xml/org/xml/sax/ContentHandler.html) so that it writes out the data directly, adding linefeeds where you feel they belong.
Check this tutorial to see how the ContentHandler can then be applied in a SAX parser: https://docs.oracle.com/javase/tutorial/jaxp/sax/parsing.html
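
A handler along those lines might look like the sketch below. It is only an illustration under some assumptions: the class name OneLineAttributePrinter and its format method are invented here, it extends DefaultHandler (which implements ContentHandler), it indents by two spaces, and it trims text nodes, so it is not suitable for whitespace-sensitive or mixed content. Comments, processing instructions and namespaces are not handled.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: pretty-prints while parsing and never wraps attributes.
final class OneLineAttributePrinter extends DefaultHandler {
    private final StringBuilder out = new StringBuilder();
    private final StringBuilder text = new StringBuilder();
    private int depth = 0;
    private boolean startTagOpen = false; // "<name attrs" written, closing '>' pending

    static String format(String xml) throws Exception {
        OneLineAttributePrinter handler = new OneLineAttributePrinter();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.out.toString();
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String pending = takeText();
        if (startTagOpen) {
            out.append(">\n"); // the parent has child nodes: finish its start tag
        }
        if (!pending.isEmpty()) {
            indent(depth).append(escape(pending)).append('\n');
        }
        indent(depth).append('<').append(qName);
        for (int i = 0; i < atts.getLength(); i++) { // all attributes on the same line
            out.append(' ').append(atts.getQName(i))
               .append("=\"").append(escape(atts.getValue(i))).append('"');
        }
        startTagOpen = true;
        depth++;
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String pending = takeText();
        depth--;
        if (startTagOpen) { // no child elements: empty tag or text-only element
            if (pending.isEmpty()) {
                out.append("/>\n");
            } else {
                out.append('>').append(escape(pending)).append("</").append(qName).append(">\n");
            }
            startTagOpen = false;
        } else {
            if (!pending.isEmpty()) {
                indent(depth + 1).append(escape(pending)).append('\n');
            }
            indent(depth).append("</").append(qName).append(">\n");
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // may arrive in several chunks
    }

    private String takeText() {
        String s = text.toString().trim();
        text.setLength(0);
        return s;
    }

    private StringBuilder indent(int level) {
        for (int i = 0; i < level; i++) {
            out.append("  ");
        }
        return out;
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    public static void main(String[] args) throws Exception {
        // A well-formed version of the input from the question.
        System.out.print(format("<root><feeds attribute1=\"a\" attribute2=\"b\" attribute3=\"c\" "
                + "attribute4=\"d\" attribute5=\"e\" attribute6=\"f\"> <id>2140</id>"
                + "<title>gj</title><description>ghj</description><msg/></feeds></root>"));
    }
}
```

Because the handler writes each start tag itself, nothing ever decides to wrap a long attribute list, which is the behaviour the question asks for.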

Javax DocumentBuilder produces “double-UTF-8’ed” charset encoding

I’ve got a Java DOM Document which MyFilter has rewritten. From logging output I know that the contents of the Document are still correct. I am using the following lines to convert theDocument to a List<String> to pass it back through an interface:
Transformer transformer = TransformerFactory.newInstance().newTransformer();
ByteArrayOutputStream buffer = new ByteArrayOutputStream();
transformer.transform(new DOMSource(theDocument), new StreamResult(buffer));
return Arrays.asList(new String(buffer.toByteArray()).split("\r?\n"));
The filter is called from this file copying method using org.apache.commons.io.FileUtils:
List<String> lines = FileUtils.readLines(source, "UTF-8");
if (filters != null) {
for (final MyFilter filter : filters) {
lines = filter.filter(lines);
}
}
FileUtils.writeLines(destination, "UTF-8", lines);
This works perfectly fine on my machine (where I could debug it), but on other machines just running the code, reproducibly any non-ASCII characters get double-UTF-8’ed (e.g., Größe becomes GrÃ¶ÃŸe). The code is executed within a web app running in Tomcat. I am sure they are differently configured, but what I want is that I get the non-corrupt result on any configuration.
Any ideas what I could be missing?

Once you have created the Document object, you have to read its content.
After that, write it to a file using the LSSerializer interface, which the DOM standard provides for this purpose.
By default, the LSSerializer produces an XML document without spaces or line breaks. As a result, the output looks less pretty, but it is actually more suitable for parsing by another program because it is free of unnecessary white space.
If you want white space, set one more parameter after creating the serializer:
ser.getDomConfig().setParameter("format-pretty-print", true);
The code looks like this:
private String getContentFromDocument(Document doc) {
String content;
DOMImplementation impl = doc.getImplementation();
DOMImplementationLS implLS = (DOMImplementationLS) impl.getFeature("LS", "3.0");
LSSerializer ser = implLS.createLSSerializer();
ser.getDomConfig().setParameter("format-pretty-print", true);
content = ser.writeToString(doc);
return content;
}
And after you have string content you can write it to file, like:
public void writeToXmlFile(String xmlContent) {
File theDir = new File("./output");
if (!theDir.exists())
theDir.mkdir();
String fileName = "./output/" + this.getClass().getSimpleName() + "_"
+ Calendar.getInstance().getTimeInMillis() + ".xml";
try (OutputStream stream = new FileOutputStream(new File(fileName))) {
try (OutputStreamWriter out = new OutputStreamWriter(stream, StandardCharsets.UTF_8)) {
out.write(xmlContent);
out.write("\n");
}
} catch (IOException ex) {
System.err.println("Cannot write to file!" + ex.getMessage());
}
}
By the way, have you tried getting the Document object in a slightly simpler way, like this:
DocumentBuilderFactory documentFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = documentFactory.newDocumentBuilder();
Document doc = builder.parse(new File(fileName));
You can try this as well. It should be enough to parse an XML file.

I finally found it: The problem was within the String(byte[]) constructor which interprets byte[] relative to the platform’s default charset. This should at least have been tagged deprecated. The transformer obviously produces UTF-8 output independent of the platform. Changing the method like below passes the same charset to both:
final String ENCODING = "UTF-8";
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.ENCODING, ENCODING);
ByteArrayOutputStream buffer = new ByteArrayOutputStream();
transformer.transform(new DOMSource(theDocument), new StreamResult(buffer));
return Arrays.asList(new String(buffer.toByteArray(), ENCODING).split("\r?\n"));
To get it working, it does not really matter which encoding you pick, as long as both sides use the same one. However, it is good to choose a Unicode charset, as otherwise unmappable characters may get lost. Note also that the charset will be reflected in the XML declaration, so when the List<String> gets saved later, it is important to save it accordingly.

Set xml encoding

I am sending XML to a web service, where I convert the input XML to a string, and now I am having a problem setting its encoding. Here is the code:
Element soapinElement = (Element) streams.getSoapin().getValue().getAny();
Node node = (Node) soapinElement;
Document document = node.getOwnerDocument();
DOMImplementationLS domImplLS = (DOMImplementationLS) document.getImplementation();
LSSerializer serializer = domImplLS.createLSSerializer();
LSOutput output = domImplLS.createLSOutput();
output.setEncoding("UTF-8");
Writer stringWriter = new StringWriter();
output.setCharacterStream(stringWriter);
serializer.write(document, output);
String soapinString = stringWriter.toString();
This code makes a String from request xml. The problem is that when the request XML is encoded not in UTF-8 it produces unreadable characters inside xml elements:
<some element>РћР’Р” Р’РћР</some element>
When I send UTF-8 encoded xml there is no problem. So the question is how to set UTF-8 encoding when converting xml to String.
Default encoding used by JVM is ISO8859-1.

The setEncoding method says what the encoding actually is, not what you want it to be. The XML library won't convert the characters.
See this question: Meaning of XML encoding
If you want to convert the encoding, that is another question.

I would rethink my whole approach if I were you; generally, XML should be kept as a tree.
But if you really need a string, try this:
final StringWriter sw = new StringWriter();
try {
TransformerFactory.newInstance().newTransformer().transform(
new DOMSource(document),
new StreamResult(sw)
);
} catch (TransformerException e) {
throw new RuntimeException(e);
}
// Now you have the XML as a String:
System.out.println(sw.toString());

Java XML parsing error : Content is not allowed in prolog

My code writes an XML file with the LSSerializer class:
DOMImplementation impl = doc.getImplementation();
DOMImplementationLS implLS = (DOMImplementationLS) impl.getFeature("LS","3.0");
LSSerializer ser = implLS.createLSSerializer();
String str = ser.writeToString(doc);
System.out.println(str);
String file = racine+"/"+p.getNom()+".xml";
OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(file),"UTF-8");
out.write(str);
out.close();
The XML is well-formed, but when I parse it, I get an error.
Parse code :
File f = new File(racine+"/"+filename);
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse(f);
XPathFactory xpfactory = XPathFactory.newInstance();
XPath xp = xpfactory.newXPath();
String expression;
expression = "root/nom";
String nom = xp.evaluate(expression, doc);
The error :
[Fatal Error] Terray.xml:1:40: Content is not allowed in prolog.
9 août 2011 19:42:58 controller.MakaluController activatePatient
GRAVE: null
org.xml.sax.SAXParseException: Content is not allowed in prolog.
at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:249)
at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:284)
at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:208)
at model.MakaluModel.setPatientActif(MakaluModel.java:147)
at controller.MakaluController.activatePatient(MakaluController.java:59)
at view.ListePatientsPanel.jButtonOKActionPerformed(ListePatientsPanel.java:92)
...
Now, with some research, I found that this error is due to a "hidden" character at the very beginning of the XML.
In fact, I can fix the bug by creating the XML file manually.
But where is the error in the XML writing ? (When I try to println the string, there is no space before ths
Solution: change the serializer
I ran with the UTF-16 encoding solution for a while, but it was not very stable.
So I found a new solution: change the serializer of the XML document, so that the encoding is consistent between the XML declaration and the file encoding:
DOMSource domSource = new DOMSource(doc);
TransformerFactory tf = TransformerFactory.newInstance();
Transformer transformer = tf.newTransformer();
String file = racine+"/"+p.getNom()+".xml";
OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(file),"UTF-8");
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
transformer.setOutputProperty(OutputKeys.INDENT,"yes");
transformer.transform(domSource, new StreamResult(out));

But where is the error in the XML writing ?
It looks like the error is not in the writing but in the parsing. As you have already discovered, there is a blank character at the beginning of the file, which causes the error in the parse call in your stack trace:
Document doc = builder.parse(f);
The reason you do not see the space when you print it out may be simply the encoding you are using. Try changing this line:
OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(file),"UTF-8");
to use 'UTF-16' or 'US-ASCII'

I think it is probably linked to the BOM (Byte Order Mark). See Wikipedia.
You can verify this with Notepad++, for example: open your file and check the "Encoding" menu to see whether it is "UTF8 without BOM" or "UTF8 with BOM".
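
The same check can be done in code by looking at the first bytes of the file. The sketch below is only an illustration: the class name BomCheck and the method detectBom are hypothetical, and it recognises just the UTF-8 and UTF-16 marks (a UTF-32 little-endian mark would be reported as UTF-16LE).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration: inspects the leading bytes for a byte order mark.
final class BomCheck {

    /** Returns the encoding implied by a leading BOM, or null if there is none. */
    static String detectBom(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return "UTF-8";
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return "UTF-16BE";
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return "UTF-16LE";
        }
        return null;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) { // mask because Java bytes are signed
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?><root/>";
        // Java's UTF-8 encoder writes no BOM; its "UTF-16" encoder writes a big-endian one.
        System.out.println(detectBom(xml.getBytes(StandardCharsets.UTF_8)));
        System.out.println(detectBom(xml.getBytes(StandardCharsets.UTF_16)));
    }
}
```

For a file on disk, the bytes from java.nio.file.Files.readAllBytes can be passed to detectBom directly.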

Using UTF-16 is the way to go,
OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(fileName),"UTF-16");
This can read the file with no issues

Try this code:
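InputSource source = new InputSource(new FileInputStream(file));
source.setEncoding("UTF-8"); // the String argument of parse(InputStream, String) is a system ID, not an encoding
Document doc = builder.parse(source);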

How do I modify an XML element using StAX in Java

I want to modify my name in an XML CV file, but when I use these statements:
XMLOutputFactory xof = XMLOutputFactory.newInstance();
XMLStreamWriter xtw = null;
xtw = xof.createXMLStreamWriter(new FileWriter("eman.xml"));
all the content of the file is removed and it becomes empty. Basically, I want to open eman.xml for modification without removing its content.

If you want to use StAX to process your XML file, you should know that it can only read from or write to an XML file; it cannot modify one in place. If you want to modify the XML when a particular event occurs in StAX, you can do that processing with DOM.
Here is an example:
XMLInputFactory inputFactory = XMLInputFactory.newInstance();
XMLStreamReader xmlReader = inputFactory.createXMLStreamReader(new FileReader(fileName));
while (xmlReader.hasNext()) {
    int eventType = xmlReader.next();
    if (eventType == XMLEvent.START_ELEMENT) {
        QName qName = xmlReader.getName();
        if ("YOURTAG".equals(qName.getLocalPart())) {
            // Load the document into a DOM tree so it can be modified.
            DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = domFactory.newDocumentBuilder();
            File file = new File("YOURXML.xml");
            Document doc = builder.parse(file);
            // Do the required processing on the document here.
        }
    }
}
xmlReader.close();

The question about reading and writing at the same time with stax is also answered here: How to modify a huge XML file by StAX?
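
One common way to "modify" a document with StAX alone is to stream it from a reader to a writer and substitute events on the way through. The sketch below illustrates that idea and is not taken from the linked question: the class name StaxTextReplacer, the method replaceText and the sample data are all hypothetical, and it assumes the target element is not nested inside another element of the same name.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper for illustration: copies a document and replaces one element's content.
final class StaxTextReplacer {

    static String replaceText(String xml, String elementName, String newText) throws Exception {
        XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
        StringWriter sink = new StringWriter();
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(sink);
        XMLEventFactory events = XMLEventFactory.newInstance();

        boolean inside = false; // true while between the target's start and end tags
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()
                    && event.asStartElement().getName().getLocalPart().equals(elementName)) {
                inside = true;
                writer.add(event);
                writer.add(events.createCharacters(newText)); // the replacement content
            } else if (event.isEndElement()
                    && event.asEndElement().getName().getLocalPart().equals(elementName)) {
                inside = false;
                writer.add(event);
            } else if (!inside) {
                writer.add(event); // everything else is copied unchanged
            }
            // Events inside the target element (the old content) are dropped.
        }
        writer.close();
        reader.close();
        return sink.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(replaceText("<cv><name>Old Name</name></cv>", "name", "New Name"));
    }
}
```

When working with files, the output should go to a different (for example temporary) file that replaces the original afterwards, because opening a FileWriter on the file being read truncates it, which is what emptied eman.xml in the question.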
