Set xml encoding - java

I am sending xml to a web service and there I am converting input xml to string and now I am having a problem setting its encoding. Here is a code:
Element soapinElement = (Element) streams.getSoapin().getValue().getAny();
Node node = (Node) soapinElement;
Document document = node.getOwnerDocument();
DOMImplementationLS domImplLS = (DOMImplementationLS) document.getImplementation();
LSSerializer serializer = domImplLS.createLSSerializer();
LSOutput output = domImplLS.createLSOutput();
output.setEncoding("UTF-8");
Writer stringWriter = new StringWriter();
output.setCharacterStream(stringWriter);
serializer.write(document, output);
String soapinString = stringWriter.toString();
This code makes a String from request xml. The problem is that when the request XML is encoded not in UTF-8 it produces unreadable characters inside xml elements:
<some element>РћР’Р” Р’РћР</some element>
When I send UTF-8 encoded xml there is no problem. So the question is how to set UTF-8 encoding when converting xml to String.
Default encoding used by JVM is ISO8859-1.

The setEncoding method says what the encoding actually is, not what you want it to be. The XML library won't convert the characters.
See this question: Meaning of XML encoding
If you want to convert the encoding, that is another question.

I would rethink my whole approach if I were you, generally XML should be kept as a tree.
But if you really need a string, try this
final StringWriter sw = new StringWriter();
try {
TransformerFactory.newInstance().newTransformer().transform(
new DOMSource(document),
new StreamResult(sw)
);
} catch (TransformerException e) {
throw new RuntimeException(e);
}
// Now you have the XML as a String:
System.out.println(sw.toString());

Related

Format attributes for XML in Pretty format in java

I am trying to format XML string to pretty. I want all the attributes to be printed in single line.
XML input:
<root><feeds attribute1="a" attribute2="b" attribute3="c" attribute4="d" attribute5="e" attribute6="f"> <id>2140</id><title>gj</title><description>ghj</description>
<msg/>
Expected output:
<root>
<feeds attribute1="a" attribute2="b" attribute3="c" attribute4="d" attribute5="e" attribute6="f">
<id>2140</id>
<title>gj</title>
<description>ghj</description>
<msg/>
</feeds>
Actual Output:
<root>
<feeds attribute1="a" attribute2="b" attribute3="c" attribute4="d"
attribute5="e" attribute6="f">
<id>2140</id>
<title>gj</title>
<description>ghj</description>
<msg/>
</feeds>
Here is my code to format xml. I have also tried SAX parser. I don't want to use DOM4J.
public static String formatXml(String xml) {
DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
DOMImplementationLS impl = (DOMImplementationLS) registry.getDOMImplementation("LS");
LSSerializer writer = impl.createLSSerializer();
writer.getDomConfig().setParameter("format-pretty-print", Boolean.TRUE);
writer.getDomConfig().setParameter("xml-declaration", false);
writer.getDomConfig().setParameter("well-formed", true);
LSOutput output = impl.createLSOutput();
ByteArrayOutputStream out = new ByteArrayOutputStream();
output.setByteStream(out);
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
InputSource is = new InputSource(new StringReader(xml));
writer.write(db.parse(is), output);
return new String(out.toByteArray());
}
Is there any way to keep attributes in one line with SAX or DOM parser? I am not looking for any additional library. I am looking for solution with java library only.
A SAX or DOM parser will read your input string and allow your application to understand what was passed in. At some point in time your application then writes out that data, and that is the moment where you decide to insert additional whitespace (like linefeeds and tab characters) to pretty-print the document.
If you really want to use SAX and make the parser efficient the best you could do is write the document while it is being parsed. So you would implement the ContentHandler interface (https://docs.oracle.com/en/java/javase/11/docs/api/java.xml/org/xml/sax/ContentHandler.html) such that it directly writes out the data while adding linefeeds where you feel they belong to.
Check this tutorial to see how the ContentHandler can then be applied in a SAX parser: https://docs.oracle.com/javase/tutorial/jaxp/sax/parsing.html

Javax DocumentBuilder produces “double-UTF-8’ed” charset encoding

I’ve got a Java DOM Document which MyFilter has rewritten. From logging output I know that the contents of the Document are still correct. I am using the following lines to convert theDocument to a List<String> to pass it back through an interface:
Transformer transformer = TransformerFactory.newInstance().newTransformer();
ByteArrayOutputStream buffer = new ByteArrayOutputStream();
transformer.transform(new DOMSource(theDocument), new StreamResult(buffer));
return Arrays.asList(new String(buffer.toByteArray()).split("\r?\n"));
The filter is called from this file copying method using org.apache.commons.io.FileUtils:
List<String> lines = FileUtils.readLines(source, "UTF-8");
if (filters != null) {
for (final MyFilter filter : filters) {
lines = filter.filter(lines);
}
}
FileUtils.writeLines(destination, "UTF-8", lines);
This works perfectly fine on my machine (where I could debug it), but on other machines just running the code, reproducibly any non-ASCII characters get double-UTF-8’ed (e.g., Größe becomes Größe). The code is executed within a web app running in Tomcat. I am sure they are differently configured, but what I want is that I get the non-corrupt result on any configuration.
Any ideas what I could be missing?
When you have Document object created you have to read it Content.
After it you have to write it to file using LSSerializer interface, which DOM standart provides for this purpose.
By default, the LSSerializer produces an XML document without spaces or line
breaks. As a result, the output looks less pretty, but it is actually more suitable for parsing by another program because it is free from unnecessary white space.
If you want white space, you use yet another magic incantation after creating the serializer:
ser.getDomConfig().setParameter("format-pretty-print", true);
Code snippets looks like:
private String getContentFromDocument(Document doc) {
String content;
DOMImplementation impl = doc.getImplementation();
DOMImplementationLS implLS = (DOMImplementationLS) impl.getFeature("LS", "3.0");
LSSerializer ser = implLS.createLSSerializer();
ser.getDomConfig().setParameter("format-pretty-print", true);
content = ser.writeToString(doc);
return content;
}
And after you have string content you can write it to file, like:
public void writeToXmlFile(String xmlContent) {
File theDir = new File("./output");
if (!theDir.exists())
theDir.mkdir();
String fileName = "./output/" + this.getClass().getSimpleName() + "_"
+ Calendar.getInstance().getTimeInMillis() + ".xml";
try (OutputStream stream = new FileOutputStream(new File(fileName))) {
try (OutputStreamWriter out = new OutputStreamWriter(stream, StandardCharsets.UTF_8)) {
out.write(xmlContent);
out.write("\n");
}
} catch (IOException ex) {
System.err.println("Cannot write to file!" + ex.getMessage());
}
}
BTW:
Have you tried to get Document object at a little bit easier, like:
DocumentBuilderFactory documentFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = documentFactory.newDocumentBuilder();
Document doc = builder.parse(new File(fileName));
You can try this as well. It should be enough for parsing xml file.
I finally found it: The problem was within the String(byte[]) constructor which interprets byte[] relative to the platform’s default charset. This should at least have been tagged deprecated. The transformer obviously produces UTF-8 output independent of the platform. Changing the method like below passes the same charset to both:
final String ENCODING = "UTF-8";
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.ENCODING, ENCODING);
ByteArrayOutputStream buffer = new ByteArrayOutputStream();
transformer.transform(new DOMSource(theDocument), new StreamResult(buffer));
return Arrays.asList(new String(buffer.toByteArray(), ENCODING).split("\r?\n"));
To get it working, it does not really matter which encoding, just both should use the same. Hovever, it is good to choose some unicode charset as otherwise unmappable characters may get lost. However, the charset will be reflected in the XML declaration, thus when the List<String> gets saved later, it is important to save it accordigly.

get node raw text

How get node value with its children nodes? For example I have following node parsed into dom Document instance:
<root>
<ch1>That is a text with <value name="val1">value contents</value></ch1>
</root>
I select ch1 node using xpath. Now I need to get its contents, everything what is containing between <ch1> and </ch1>, e.g. That is a text with <value name="val1">value contents</value>.
How can I do it?
I have found the following code snippet that uses transformation, it gives almost exactly what I want. It is possible to tune result by changing output method.
public static String serializeDoc(Node doc) {
StringWriter outText = new StringWriter();
StreamResult sr = new StreamResult(outText);
Properties oprops = new Properties();
oprops.put(OutputKeys.METHOD, "xml");
TransformerFactory tf = TransformerFactory.newInstance();
Transformer t = null;
try {
t = tf.newTransformer();
t.setOutputProperties(oprops);
t.transform(new DOMSource(doc), sr);
} catch (Exception e) {
System.out.println(e);
}
return outText.toString();
}
If this is server side java (ie you do not need to worry about it running on other jvm's) and you are using the Sun/Oracle JDK, you can do the following:
import com.sun.org.apache.xml.internal.serialize.OutputFormat;
import com.sun.org.apache.xml.internal.serialize.XMLSerializer;
...
Node n = ...;
OutputFormat outputFormat = new OutputFormat();
outputFormat.setOmitXMLDeclaration(true);
ByteArrayOutputStream baos = new ByteArrayOutputStream();
XMLSerializer ser = new XMLSerializer(baos, outputFormat);
ser.serialize(n);
System.out.println(new String(baos.toByteArray()));
Remember to ensure your ultimate conversion to string may need to take an encoding parameter if the parsed xml dom has its text nodes in a different encoding than your platforms default one or you'll get garbage on the unusual characters.
You could use jOOX to wrap your DOM objects and get many utility functions from it, such as the one you need. In your case, this will produce the result you need (using css-style selectors to find <ch1/>:
String xml = $(document).find("ch1").content();
Or with XPath as you did:
String xml = $(document).xpath("//ch1").content();
Internally, jOOX will use a transformer to generate that output, as others have mentioned
As far as I know, there is no equivalent of innerHTML in Document. DOM is meant to hide the details of the markup from you.
You can probably get the effect you want by going through the children of that node. Suppose for example that you want to copy out the text, but replace each "value" tag with a programmatically supplied value:
HashMap<String, String> values = ...;
StringBuilder str = new StringBuilder();
for(Element child = ch1.getFirstChild; child != null; child = child.getNextSibling()) {
if(child.getNodeType() == Node.TEXT_NODE) {
str.append(child.getTextContent());
} else if(child.getNodeName().equals("value")) {
str.append(values.get(child.getAttributes().getNamedItem("name").getTextContent()));
}
}
String output = str.toString();

Java XML parsing error : Content is not allowed in prolog

My code write a XML file with the LSSerializer class :
DOMImplementation impl = doc.getImplementation();
DOMImplementationLS implLS = (DOMImplementationLS) impl.getFeature("LS","3.0");
LSSerializer ser = implLS.createLSSerializer();
String str = ser.writeToString(doc);
System.out.println(str);
String file = racine+"/"+p.getNom()+".xml";
OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(file),"UTF-8");
out.write(str);
out.close();
The XML is well-formed, but when I parse it, I get an error.
Parse code :
File f = new File(racine+"/"+filename);
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse(f);
XPathFactory xpfactory = XPathFactory.newInstance();
XPath xp = xpfactory.newXPath();
String expression;
expression = "root/nom";
String nom = xp.evaluate(expression, doc);
The error :
[Fatal Error] Terray.xml:1:40: Content is not allowed in prolog.
9 août 2011 19:42:58 controller.MakaluController activatePatient
GRAVE: null
org.xml.sax.SAXParseException: Content is not allowed in prolog.
at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:249)
at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:284)
at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:208)
at model.MakaluModel.setPatientActif(MakaluModel.java:147)
at controller.MakaluController.activatePatient(MakaluController.java:59)
at view.ListePatientsPanel.jButtonOKActionPerformed(ListePatientsPanel.java:92)
...
Now, with some research, I found that this error is dure to a "hidden" character at the very beginning of the XML.
In fact, I can fix the bug by creating a XML file manually.
But where is the error in the XML writing ? (When I try to println the string, there is no space before ths
Solution : change the serializer
I run the solution of UTF-16 encoding for a while, but it was not very stable.
So I found a new solution : change the serializer of the XML document, so that the encoding is coherent between the XML header and the file encoding. :
DOMSource domSource = new DOMSource(doc);
TransformerFactory tf = TransformerFactory.newInstance();
Transformer transformer = tf.newTransformer();
String file = racine+"/"+p.getNom()+".xml";
OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(file),"UTF-8");
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
transformer.setOutputProperty(OutputKeys.INDENT,"yes");
transformer.transform(domSource, new StreamResult(out));
But where is the error in the XML writing ?
Looks like the error is not in the writing but the parsing. As you have already discovered there is a blank character at the beginning of the file, which causes the error in the parse call in your stach trace:
Document doc = builder.parse(f);
The reason you do not see the space when you print it out may be simply the encoding you are using. Try changing this line:
OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(file),"UTF-8");
to use 'UTF-16' or 'US-ASCII'
I think that it is probably linked to BOM (Byte Order Mark). See Wikipedia
You can verify with Notepad++ by example : Open your file and check the "Encoding" Menu to see if you're in "UTF8 without BOM" or "UTF8 with BOM".
Using UTF-16 is the way to go,
OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(fileName),"UTF-16");
This can read the file with no issues
Try this code:
InputStream is = new FileInputStream(file);
Document doc = builder.parse(is , "UTF-8");

Writing XML in different character encodings with Java

I am attempting to write an XML library file that can be read again into my program.
The file writer code is as follows:
XMLBuilder builder = new XMLBuilder();
Document doc = builder.build(bookList);
DOMImplementation impl = doc.getImplementation();
DOMImplementationLS implLS = (DOMImplementationLS) impl.getFeature("LS", "3.0");
LSSerializer ser = implLS.createLSSerializer();
String out = ser.writeToString(doc);
//System.out.println(out);
try{
FileWriter fstream = new FileWriter(location);
BufferedWriter outwrite = new BufferedWriter(fstream);
outwrite.write(out);
outwrite.close();
}catch (Exception e){
}
The above code does write an xml document.
However, in the XML header, it is an attribute that the file is encoded in UTF-16.
when i read in the file, i get the error:
"content not allowed in prolog"
this error does not occur when the encoding attribute is manually changed to UTF-8.
I am trying to get the above code to write an XML document encoded in UTF-8, or successfully parse a UTF-16 file.
the code for parsing in is
DocumentBuilderFactory factory =
DocumentBuilderFactory.newInstance();
DocumentBuilder loader = factory.newDocumentBuilder();
Document document = loader.parse(filename);
the last line returns the error.
the LSSerializer writeToString method does not allow the Serializer to pick a encoding.
with the setEncoding method of an instance of LSOutput, LSSerializer's write method can be used to change encoding. the LSOutput CharacterStream can be set to an instance of the BufferedWriter, such that calls from LSSerializer to write will write to the file.

Categories

Resources