I don't really understand the definition from the Oracle documentation:
The property that requires the parser to coalesce adjacent character
data sections
I've tried a few examples with this property set to both true and false, and there don't seem to be any noticeable differences.
Can anyone please provide me with a better explanation and maybe an example in which it matters?
It can e.g. make a difference if the text content of an element is a mix of plain &-encoded text, and CDATA-encoded text.
Demo
public static void main(String[] args) throws Exception {
    test(false);
    test(true);
}
static void test(boolean coalesce) throws Exception {
    System.out.println("IS_COALESCING = " + coalesce + ":");
    String xml = "<Root>abc<![CDATA[def]]>ghi</Root>";
    XMLInputFactory xmlInputFactory = XMLInputFactory.newFactory();
    xmlInputFactory.setProperty(XMLInputFactory.IS_COALESCING, coalesce);
    XMLEventReader reader = xmlInputFactory.createXMLEventReader(new StringReader(xml));
    while (reader.hasNext()) {
        XMLEvent event = reader.nextEvent();
        if (event.isCharacters())
            System.out.println(" \"" + event.asCharacters().getData() + "\"");
    }
}
Output
IS_COALESCING = false:
"abc"
"def"
"ghi"
IS_COALESCING = true:
"abcdefghi"
If you parsed into DOM, the <Root> element would have 3 Node children:
Text where getData() returns "abc"
CDATASection where getData() returns "def"
Text where getData() returns "ghi"
The XMLInputFactory property works the same as the DocumentBuilderFactory.setCoalescing(boolean coalescing) method:
Specifies that the parser produced by this code will convert CDATA nodes to Text nodes and append it to the adjacent (if any) text node. By default the value of this is set to false
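To see the DOM side of this, here is a minimal sketch that parses the same document with setCoalescing switched off and on and lists the children of <Root>. The class and method names (CoalescingDomDemo, childTypes) are made up for this illustration, and it assumes the JDK's built-in parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class CoalescingDomDemo {
    // Returns one "<kind>:<value>" entry per child node of <Root>.
    static List<String> childTypes(boolean coalesce) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setCoalescing(coalesce);
        String xml = "<Root>abc<![CDATA[def]]>ghi</Root>";
        Node root = dbf.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
        List<String> result = new ArrayList<>();
        NodeList children = root.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            String kind = child.getNodeType() == Node.CDATA_SECTION_NODE ? "CDATA" : "Text";
            result.add(kind + ":" + child.getNodeValue());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(childTypes(false)); // three children: Text, CDATA, Text
        System.out.println(childTypes(true));  // a single merged Text child
    }
}
```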
Related
I'd like to use XMLStreamReader to read an XML file that contains horizontal tab character references (&#9;), for example:
<tag>foo&#9;bar</tag>
and print out or write it back to another xml file.
Google tells me to set javax.xml.stream.isCoalescing to true in XMLInputFactory, but my test code below does not work as expected.
public static void main(String[] args) throws IOException, XMLStreamException {
XMLInputFactory factory = XMLInputFactory.newInstance();
factory.setProperty(factory.IS_COALESCING, true);
System.out.println("IS_COALESCING supported ? " + factory.isPropertySupported(factory.IS_COALESCING));
System.out.println("factory IS_COALESCING value is " +factory.getProperty(factory.IS_COALESCING));
String rawString = "<tag>foo&#9;bar</tag>";
XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(rawString));
System.out.println("reader IS_COALESCING value is " +reader.getProperty(factory.IS_COALESCING));
PrintWriter pw = new PrintWriter(System.out, true);
while (reader.hasNext())
{
reader.next();
pw.print(reader.getEventType());
if (reader.hasText())
pw.append(' ').append(reader.getText());
pw.println();
}
}
The output is
IS_COALESCING supported ? true
factory IS_COALESCING value is true
reader IS_COALESCING value is true
1
4 foo bar
2
8
But I want to keep the same Horizontal Tab like:
IS_COALESCING supported ? true
factory IS_COALESCING value is true
reader IS_COALESCING value is true
1
4 foo&#9;bar
2
8
What am I missing here? thanks
From what I can see, the parsing part is correct; the text is just not printed the way you envision it. The XML reader resolves the character reference &#9; to a tab character (\t), and that is how it is represented in Java.
Using Guava's XmlEscapers, I can produce something similar to what you want to have:
public class Test {
public static void main(String[] args) throws IOException, XMLStreamException {
XMLInputFactory factory = XMLInputFactory.newInstance();
factory.setProperty(XMLInputFactory.IS_COALESCING, true);
System.out.println("IS_COALESCING supported ? " + factory.isPropertySupported(XMLInputFactory.IS_COALESCING));
System.out.println("factory IS_COALESCING value is " + factory.getProperty(XMLInputFactory.IS_COALESCING));
String rawString = "<tag>foo&#9;bar</tag>";
XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(rawString));
System.out.println("reader IS_COALESCING value is " + reader.getProperty(XMLInputFactory.IS_COALESCING));
PrintWriter pw = new PrintWriter(System.out, true);
while (reader.hasNext()) {
reader.next();
pw.print(reader.getEventType());
if (reader.hasText()) {
pw.append(' ').append(XmlEscapers.xmlAttributeEscaper().escape(reader.getText()));
}
pw.println();
}
}
}
The Output looks like this:
IS_COALESCING supported ? true
factory IS_COALESCING value is true
reader IS_COALESCING value is true
1
4 foo&#x9;bar
2
8
Some remarks to this:
The library itself is marked as unstable; there might be other alternatives.
\t does not need to be escaped in XML content, so the content escaper leaves it alone and I had to choose the attribute escaper. While it works, there might be side effects, since the attribute escaper also escapes other characters such as quotes.
Is a 100% copy of the content really required? Otherwise, I would suggest letting the XML libraries do their work and have them create the correct encoding.
If you really want to have a 1:1 copy, is it an option to specify the input as CDATA?
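If pulling in Guava is not an option, the same idea can be sketched with the JDK alone. The example below shows that the reader hands back a real tab character, and that a plain string replacement can turn it back into a reference before writing. The names TabReferenceDemo, readText and escapeTabs are hypothetical, and escapeTabs deliberately handles only the tab; characters such as & and < would still need proper escaping.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;

class TabReferenceDemo {
    // Reads the text content of the root element; references are resolved by the parser.
    static String readText(String xml) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        reader.nextTag();               // advance to the root start tag
        return reader.getElementText(); // "&#9;" has already become '\t' here
    }

    // Turns tab characters back into numeric character references.
    static String escapeTabs(String text) {
        return text.replace("\t", "&#9;");
    }

    public static void main(String[] args) throws Exception {
        String text = readText("<tag>foo&#9;bar</tag>");
        System.out.println(text.equals("foo\tbar")); // true
        System.out.println(escapeTabs(text));        // foo&#9;bar
    }
}
```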
I am attempting to read XML from a server on http://localhost:8000, into a string variable.
The layout of the XML document is very simple, and when directing to http://localhost:8000, the following is displayed:
<result>Hello World</result>
Is there a simple way to parse this into a String variable from the localhost URL, so that for example, if I was to run:
System.out.println(XMLVariable)
(where XMLVariable is the String variable holding the content), the output on the command line would simply be "Hello World"?
You need to parse the response from the server into an XML data structure of some sort.
The easiest way that I'm aware of (in Java) to do that is to use dom4j.
It can be as simple as this...
SAXReader reader = new SAXReader();
Document document = reader.read("http://localhost:8000/");
System.out.println(document.getRootElement().getText());
You can use StAX for parsing the response:
private Optional<String> extractResultValue(String xml) throws XMLStreamException {
final XMLInputFactory factory = XMLInputFactory.newInstance();
final XMLEventReader reader = factory.createXMLEventReader(new StringReader(xml));
while (reader.hasNext()) {
XMLEvent event = reader.nextEvent();
if (event.isCharacters()) {
return Optional.ofNullable(event.asCharacters().getData());
}
}
return Optional.empty();
}
Example call:
extractResultValue("<Your data from server>")
extractResultValue("<result>Hello World</result>") // Optional[Hello World]
extractResultValue("<result></result>") // Optional.empty
extractResultValue("<test>value</test>") // Optional[value]
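For completeness, a JDK-only sketch using DOM is shown below. It takes an InputStream so it can be tried without a running server; in real use the stream would presumably come from the server URL, for example via new URL("http://localhost:8000/").openStream(). The names RootTextReader and rootText are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

class RootTextReader {
    // Parses the stream as XML and returns the text content of the root element.
    static String rootText(InputStream in) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(in)
                .getDocumentElement()
                .getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the server response from the question.
        InputStream in = new ByteArrayInputStream(
                "<result>Hello World</result>".getBytes(StandardCharsets.UTF_8));
        String xmlVariable = rootText(in);
        System.out.println(xmlVariable); // Hello World
    }
}
```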
I'm using SAX with the Xalan implementation (v. 2.7.2). I have a string in HTML format
" <p>Test k&quot;nnen</p>"
and I have to pass it as the content of an XML tag.
The result is:
"&lt;p&gt;Test k&amp;quot;nnen&lt;/p&gt;"
Xalan encodes the ampersand even though it is part of an already escaped entity.
Does anyone know a way to make Xalan recognize escaped entities and not escape their ampersand?
One possible solution is to call startCDATA() on the transformerHandler, but that is not something I can use in my code.
public class TestSax{
public static void main(String[] args) throws TransformerConfigurationException, SAXException {
TestSax t = new TestSax();
System.out.println(t.createSAXXML());
}
public String createSAXXML() throws SAXException, TransformerConfigurationException {
Writer writer = new StringWriter( );
StreamResult streamResult = new StreamResult(writer);
SAXTransformerFactory transformerFactory =
(SAXTransformerFactory) SAXTransformerFactory.newInstance( );
String data = null;
TransformerHandler transformerHandler =
transformerFactory.newTransformerHandler( );
transformerHandler.setResult(streamResult);
transformerHandler.startDocument( );
transformerHandler.startElement(null,"decimal","decimal", null);
data = " <p>Test k&quot;nnen</p>";
transformerHandler.characters(data.toCharArray(),0,data.length( ));
transformerHandler.endElement(null,"decimal","decimal");
transformerHandler.endDocument( );
return writer.toString( );
}}
If your input is XML, then you need to parse it. Then <p> and </p> will be recognized as tags, and &quot; will be recognized as an entity reference.
On the other hand if you want to treat it as a string and pass it through XML machinery, then "<" and "&" are going to be preserved as ordinary characters, which means they will be escaped as &lt; and &amp; respectively.
If you want "<" treated as an ordinary character but "&" treated with its XML meaning, then you need software with some kind of split personality, and you're not going to get that off-the-shelf.
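The two consistent interpretations can be sketched with JDK classes. This is only an illustration under assumptions: it uses StAX and DOM rather than the Xalan TransformerHandler from the question, and the names TextVersusMarkup, asText and asMarkup are made up.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;
import org.xml.sax.InputSource;

class TextVersusMarkup {
    // Treat the input as a plain string: "<" and "&" are escaped on output.
    static String asText(String data) throws Exception {
        StringWriter out = new StringWriter();
        XMLStreamWriter w = XMLOutputFactory.newFactory().createXMLStreamWriter(out);
        w.writeStartElement("decimal");
        w.writeCharacters(data);
        w.writeEndElement();
        w.flush();
        w.close();
        return out.toString();
    }

    // Treat the input as XML: tags and entity references are interpreted by the parser.
    static String asMarkup(String data) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(data)))
                .getDocumentElement()
                .getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String data = "<p>Test k&quot;nnen</p>";
        System.out.println(asText(data));   // the ampersand comes out as &amp;
        System.out.println(asMarkup(data)); // Test k"nnen
    }
}
```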
I am doing some surgical XML transformations using XMLEventReader and XMLEventWriter. For the most part, I just write the events as they are read:
import javax.xml.stream.*;
import javax.xml.stream.events.XMLEvent;
import java.io.StringReader;
import java.io.StringWriter;
public class StaxExample {
public static void main(String[] args) throws XMLStreamException {
String inputXml =
"<foo>" +
" <bar baz=\"a&#10;b&#10;c&#10;\"/>" +
" <changeme/>" +
"</foo>";
StringWriter result = new StringWriter();
XMLEventReader reader = XMLInputFactory.newFactory().createXMLEventReader(new StringReader(inputXml));
XMLEventWriter writer = XMLOutputFactory.newFactory().createXMLEventWriter(result);
while (reader.hasNext()) {
XMLEvent event = reader.nextEvent();
//in real code, look for "changeme" and insert some stuff
writer.add(event);
}
System.out.println(result.toString());
}
}
My problem is, this produces:
<?xml version="1.0" ?><foo> <bar baz="a
b
c
"></bar> <changeme></changeme></foo>
While syntactically valid XML, it's necessary (due to a downstream consumer) that I preserve the newlines. The above XML will instead be normalized to a b c by that consumer (and indeed, by StAX itself--if I take this output and feed it back into the same program, the second time it will output baz="a b c ").
While I've given up on XMLEventWriter preserving non-semantic formatting, is there a way to prevent it from essentially changing my attribute values?
Well, I suggest you implement your own Writer:
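public class EscappingNLWriter extends FilterWriter
{
    public EscappingNLWriter(Writer out) {super(out);}

    @Override
    public void write(int c) throws IOException
    {
        if (c=='\n')
        {
            // Emit a character reference instead of the raw newline
            out.write("&#10;");
        }
        else
        {
            out.write(c);
        }
    }

    @Override
    public void write(char[] buff, int offset, int len) throws IOException
    {
        // Same char filtering, one character at a time
        for (int i = offset; i < offset + len; i++)
        {
            write(buff[i]);
        }
    }

    @Override
    public void write(String str, int offset, int len) throws IOException
    {
        // Same char filtering, one character at a time
        for (int i = offset; i < offset + len; i++)
        {
            write(str.charAt(i));
        }
    }
}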
And then use it to encapsulate the StringWriter:
Writer result = new EscappingNLWriter(new StringWriter());
If you need absolute accuracy about where to escape newlines in the XML and where not to (i.e. you need to escape newlines only within attributes and not elsewhere), I have another suggestion, though it is a little more complicated:
Look at your code:
while (reader.hasNext()) {
XMLEvent event = reader.nextEvent();
//in real code, look for "changeme" and insert some stuff
writer.add(event);
}
There is one point where you can interpose between the attribute and the writer: just after initializing event and before passing it to writer.add, you can wrap the event in your own implementation of XMLEvent so that, if it is an instance of javax.xml.stream.events.Attribute, it overrides Attribute.getValue() to return the properly escaped value.
But there is an extra complication: The XMLEvents returned by a XMLEventReader usually do not include Attribute events: Attributes are included within its corresponding StartElement events. So you need one more level of encapsulation: The StartElement objects and then the contained Attribute objects.
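To make that last point concrete, here is a small sketch of how attributes are reached through their StartElement event. It only reads the values and does not implement the wrapping itself. The names AttributeWalk and attributesOf are hypothetical.

```java
import java.io.StringReader;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.XMLEvent;

class AttributeWalk {
    // Collects attribute name -> value for every start element in the document.
    static Map<String, String> attributesOf(String xml) throws Exception {
        Map<String, String> found = new LinkedHashMap<>();
        XMLEventReader reader = XMLInputFactory.newFactory()
                .createXMLEventReader(new StringReader(xml));
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                // Attributes are not separate events; they hang off the StartElement.
                Iterator<?> it = event.asStartElement().getAttributes();
                while (it.hasNext()) {
                    Attribute attr = (Attribute) it.next();
                    found.put(attr.getName().getLocalPart(), attr.getValue());
                }
            }
        }
        return found;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<foo><bar baz=\"a&#10;b&#10;c&#10;\"/></foo>";
        // The value already contains real newline characters at this point.
        System.out.println(attributesOf(xml).get("baz").equals("a\nb\nc\n")); // true
    }
}
```

Because the reader has already resolved the references, any wrapper has to work on the real newline characters in getValue(), and whatever it returns is still subject to the writer's own escaping.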
I have written a method to check my XML strings for &.
I need to modify the method to include the following:
< &lt;
> &gt;
" &quot;
& &amp;
' &apos;
Here is the method
private String xmlEscape(String s) {
try {
return s.replaceAll("&(?!amp;)", "&amp;");
}
catch (PatternSyntaxException pse) {
return s;
}
} // end xmlEscape()
Here is the way I am using it
sb.append(" <Host>" + xmlEscape(url.getHost()) + "</Host>\n");
How can I modify my method to incorporate the rest of the symbols?
EDIT
I think I must not have phrased the question correctly.
In the xmlEscape() method I want to check the string for the following chars:
< > ' " &. If any are found, I want to replace the found char with the corresponding entity.
Example: if there is a char &, it would be replaced with &amp; in the string.
Can you do something as simple as
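try {
    // Strings are immutable, so each result must be assigned back.
    // The ampersand is handled first so the entities added below are not re-escaped.
    s = s.replaceAll("&(?!amp;)", "&amp;");
    s = s.replaceAll("<", "&lt;");
    s = s.replaceAll(">", "&gt;");
    s = s.replaceAll("'", "&apos;");
    s = s.replaceAll("\"", "&quot;");
    return s;
}
catch (PatternSyntaxException pse) {
    return s;
}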
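A single-pass variant is sketched below as an alternative to chained replaceAll calls. Note one deliberate difference, which is an assumption about the input: it escapes every ampersand, so it expects raw text that has not been escaped before, unlike the &(?!amp;) pattern above. The class name XmlEscape is made up.

```java
class XmlEscape {
    // Escapes the five XML special characters in one pass over the input.
    static String xmlEscape(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(xmlEscape("Tom & 'Jerry' <b>")); // Tom &amp; &apos;Jerry&apos; &lt;b&gt;
    }
}
```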
You may want to consider using the Apache Commons StringEscapeUtils.escapeXml method or one of the many other XML escape utilities out there. That gives you correct XML escaping without worrying about missing something when you need to escape something other than a host name.
Alternatively have you considered using the StAX (JSR-173) APIs to compose your XML document rather than appending strings together (an implementation is included in the JDK/JRE)? This will handle all the necessary character escaping for you:
package forum12569441;
import java.io.*;
import javax.xml.stream.*;
public class Demo {
public static void main(String[] args) throws Exception {
// WRITE THE XML
XMLOutputFactory xof = XMLOutputFactory.newFactory();
StringWriter sw = new StringWriter();
XMLStreamWriter xsw = xof.createXMLStreamWriter(sw);
xsw.writeStartDocument();
xsw.writeStartElement("foo");
xsw.writeCharacters("<>\"&'");
xsw.writeEndDocument();
String xml = sw.toString();
System.out.println(xml);
// READ THE XML
XMLInputFactory xif = XMLInputFactory.newFactory();
XMLStreamReader xsr = xif.createXMLStreamReader(new StringReader(xml));
xsr.nextTag(); // Advance to "foo" element
System.out.println(xsr.getElementText());
}
}
Output
<?xml version="1.0" ?><foo>&lt;&gt;"&amp;'</foo>
<>"&'