Xalan's SAX implementation - double encoding entities in string

I'm using SAX with the Xalan implementation (v. 2.7.2). I have a string in HTML format
"<p>Test k&quot;nnen</p>"
and I have to pass it as the content of an XML tag.
The result is:
"&lt;p&gt;Test k&amp;quot;nnen&lt;/p&gt;"
Xalan encodes the ampersand even though it is part of an already escaped entity.
Does anyone know a way to make Xalan understand escaped entities and not escape their ampersand?
One possible solution is to call startCDATA() on the transformerHandler, but that's not something I can use in my code.
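import java.io.StringWriter;
import java.io.Writer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

public class TestSax {
    public static void main(String[] args) throws TransformerConfigurationException, SAXException {
        TestSax t = new TestSax();
        System.out.println(t.createSAXXML());
    }

    public String createSAXXML() throws SAXException, TransformerConfigurationException {
        Writer writer = new StringWriter();
        StreamResult streamResult = new StreamResult(writer);
        SAXTransformerFactory transformerFactory =
                (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler transformerHandler = transformerFactory.newTransformerHandler();
        transformerHandler.setResult(streamResult);
        transformerHandler.startDocument();
        // SAX expects "" (not null) for "no namespace" and a non-null Attributes object
        transformerHandler.startElement("", "decimal", "decimal", new AttributesImpl());
        String data = "<p>Test k&quot;nnen</p>";
        // characters() receives literal text, so '<' and '&' are escaped on output
        transformerHandler.characters(data.toCharArray(), 0, data.length());
        transformerHandler.endElement("", "decimal", "decimal");
        transformerHandler.endDocument();
        return writer.toString();
    }
}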

If your input is XML, then you need to parse it. Then <p> and </p> will be recognized as tags, and &quot; will be recognized as an entity reference.
On the other hand, if you want to treat it as a string and pass it through XML machinery, then "<" and "&" are going to be preserved as ordinary characters, which means they will be escaped as &lt; and &amp; respectively.
If you want "<" treated as an ordinary character but "&" treated with its XML meaning, then you need software with some kind of split personality, and you're not going to get that off-the-shelf.
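As a rough sketch of the first option (treating the input as XML), the fragment can be parsed with a SAX parser and its events forwarded to the TransformerHandler, instead of passing the raw string to characters(). The class and method names below are made up for illustration, and the sketch assumes the fragment is well-formed XML with a single root element, which general HTML often is not:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical helper, not part of Xalan or the original question.
class ParseFragment {

    // Parses a well-formed fragment and re-serializes it through a TransformerHandler.
    static String reserialize(String fragment) {
        try {
            StringWriter out = new StringWriter();
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) SAXTransformerFactory.newInstance();
            TransformerHandler handler = factory.newTransformerHandler();
            handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            handler.setResult(new StreamResult(out));

            // The parser resolves &quot; to a plain character and reports <p> as an
            // element, so the serializer sees real markup rather than a flat string.
            SAXParserFactory parserFactory = SAXParserFactory.newInstance();
            parserFactory.setNamespaceAware(true);
            XMLReader reader = parserFactory.newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(fragment)));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Fragment could not be parsed as XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(reserialize("<p>Test k&quot;nnen</p>"));
    }
}
```

With this approach the ampersand is never seen as a literal character by the serializer, so it cannot be escaped a second time. The trade-off is that the input must really be XML; a stray "<" or an HTML-only entity would make the parse fail.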

Related

What is the property IS_COALESCING in XMLInputFactory for?

I don't really understand the definition in the Oracle documentation:
The property that requires the parser to coalesce adjacent character data sections
I've tried a few examples with this property set to both true and false, and there don't seem to be any noticeable changes.
Can anyone please provide a better explanation, and maybe an example in which it matters?

It can, for example, make a difference if the text content of an element is a mix of plain (&-escaped) text and CDATA-encoded text.
Demo
import java.io.StringReader;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.events.XMLEvent;

public class Demo {
    public static void main(String[] args) throws Exception {
        test(false);
        test(true);
    }

    static void test(boolean coalesce) throws Exception {
        System.out.println("IS_COALESCING = " + coalesce + ":");
        String xml = "<Root>abc<![CDATA[def]]>ghi</Root>";
        XMLInputFactory xmlInputFactory = XMLInputFactory.newFactory();
        xmlInputFactory.setProperty(XMLInputFactory.IS_COALESCING, coalesce);
        XMLEventReader reader = xmlInputFactory.createXMLEventReader(new StringReader(xml));
        // Print every character event; without coalescing, CDATA is reported separately
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isCharacters())
                System.out.println(" \"" + event.asCharacters().getData() + "\"");
        }
    }
}
Output
IS_COALESCING = false:
"abc"
"def"
"ghi"
IS_COALESCING = true:
"abcdefghi"
If you parsed into DOM, the <Root> element would have 3 Node children:
Text where getData() returns "abc"
CDATASection where getData() returns "def"
Text where getData() returns "ghi"
The XMLInputFactory property works the same as the DocumentBuilderFactory.setCoalescing(boolean coalescing) method:
Specifies that the parser produced by this code will convert CDATA nodes to Text nodes and append it to the adjacent (if any) text node. By default the value of this is set to false
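The DOM behaviour described above can be checked with a small sketch like the following. The class and method names are invented for this example; only the JAXP calls are real API. It counts the child nodes of <Root> for both settings of setCoalescing:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; only the JAXP calls are real API.
class DomCoalescing {

    // Returns how many child nodes <Root> has for the given coalescing setting.
    static int childCount(boolean coalescing) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setCoalescing(coalescing);
            String xml = "<Root>abc<![CDATA[def]]>ghi</Root>";
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getChildNodes().getLength();
        } catch (Exception e) {
            throw new IllegalStateException("Parsing the sample document failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("coalescing=false: " + childCount(false) + " child node(s)");
        System.out.println("coalescing=true:  " + childCount(true) + " child node(s)");
    }
}
```

With the default (false) the element should report the three children listed above; with coalescing enabled they should collapse into a single Text node.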

Marshalling CDATA elements with CDATA_SECTION_ELEMENTS adds carriage return characters

I'm working on an application that exports and imports data from / to a DB. The format of the data extract is XML and I'm using JAXB for the serialization / (un)marshalling. I want some elements to be marshalled as CDATA elements and am using this solution which sets OutputKeys.CDATA_SECTION_ELEMENTS to the Transformer properties.
So far this was working quite well, but now I have come to a field in the DB that itself contains an XML string, which I would also like to place inside a CDATA element. For some reason the Transformer now adds an unnecessary carriage return character (\r) at each line end, so that every line of the output ends in \r\r\n.
This is my code:
private static final String IDENT_LENGTH = "3";
private static final String CDATA_XML_ELEMENTS = "text definition note expression mandatoryExpression optionalExpression settingsXml";

public static void marshall(final Object rootObject, final Schema schema, final Writer writer) throws Exception {
    final JAXBContext ctx = JAXBContext.newInstance(rootObject.getClass());
    final Document document = createDocument();
    final Marshaller marshaller = ctx.createMarshaller();
    marshaller.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, true);
    marshaller.setSchema(schema);
    // marshal into a DOM first, then let the Transformer write the CDATA sections
    marshaller.marshal(rootObject, document);
    createTransformer().transform(new DOMSource(document), new StreamResult(writer));
}

private static Document createDocument() throws ParserConfigurationException {
    // the DocumentBuilderFactory is actually held in a singleton
    final DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
    return builderFactory.newDocumentBuilder().newDocument();
}

private static Transformer createTransformer() throws TransformerConfigurationException, TransformerFactoryConfigurationError {
    // the TransformerFactory is actually held in a singleton
    final TransformerFactory transformerFactory = TransformerFactory.newInstance();
    final Transformer transformer = transformerFactory.newTransformer();
    transformer.setOutputProperty(OutputKeys.INDENT, "yes");
    transformer.setOutputProperty(OutputKeys.STANDALONE, "yes");
    transformer.setOutputProperty(OutputKeys.CDATA_SECTION_ELEMENTS, CDATA_XML_ELEMENTS);
    transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", IDENT_LENGTH);
    return transformer;
}
I'm passing a FileWriter to the marshall method.
My annotated model class looks like this:
@XmlType
@XmlRootElement
public class DashboardSettings {
    @XmlElement
    private String settingsXml;

    public String getSettingsXml() {
        return settingsXml;
    }

    public void setSettingsXml(final String settingsXml) {
        this.settingsXml = settingsXml;
    }
}
NOTE:
The XML string coming from the DB has Windows-style line endings, i.e. \r\n. I have the feeling that JAXB currently expects Linux-style input (i.e. only \n) and is therefore adding a \r character because I'm running on a Windows machine. So the question is actually: what's the best way to solve this? Is there any parameter I can pass to control the line ending characters when marshalling? Or should I convert the line endings to Linux style prior to marshalling? How will my program behave on different platforms (Windows, Linux, Mac OS)?
I don't necessarily need a platform-independent solution; it's OK if the output is in Windows, Linux or whatever style. What I want to avoid is the combination \r\r\n.

I realise this question is pretty old, but I ran into a similar problem, so maybe an answer can help someone else.
It seems to be an issue with CDATA sections. In my case, I was using the createCDATASection method to create them. When the code was running on a Windows machine, an additional CR was added, as in your example.
I've tried a bunch of things to solve this "cleanly", to no avail.
In my project, the XML document was then exported to a string to POST to a Linux server. So once the string was generated, I just removed the CR characters, leaving only the LF:
myXmlString = myXmlString.replace("\r", "");
It might not be an appropriate solution for the specific question, but once again, it may help you (or someone else) find a solution.
Note: I'm stuck with Java 7 for this specific project, so it may have been fixed in a more recent version.
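The other option raised in the question, converting the line endings before marshalling, could look roughly like this. The helper name is made up, and whether the serializer then writes \n or the platform line separator has not been verified here, so treat it as a sketch rather than a confirmed fix:

```java
// Hypothetical helper; the name is made up for this sketch.
class LineEndings {

    // Converts Windows (\r\n) and lone carriage return (\r) line endings to \n.
    static String normalize(String text) {
        if (text == null) {
            return null;
        }
        return text.replace("\r\n", "\n").replace('\r', '\n');
    }

    public static void main(String[] args) {
        String fromDb = "<settings>\r\n  <a>1</a>\r\n</settings>";
        System.out.println(normalize(fromDb).contains("\r")); // prints "false"
    }
}
```

A natural place to call it would be the setter of the affected field (setSettingsXml in the question's model class), so that the DOM never contains a \r in the first place.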

Get specified parameter from encoded URL parameters String with Java

Note that what I want is not to get a specified parameter in a servlet, but to get the parameter from a String like this:
res_data=%3C%3Fxml+version%3D%221.0%22+encoding%3D%22utf8%22%3F%3E%3Cdirect_trade_create_res%3E%3Crequest_token%3E201502051324ee4d4baf14d30e3510808c08ee1d%3C%2Frequest_token%3E%3C%2Fdirect_trade_create_res%3E&service=alipay.wap.trade.create.direct&sec_id=MD5&partner=2088611853232587&req_id=20121212344553&v=2.0
It's a URL-encoded UTF-8 string; when I decode it with Python I get the real data it represents:
res_data=<?xml version="1.0" encoding="utf-8"?><direct_trade_create_res><request_token>201502051324ee4d4baf14d30e3510808c08ee1d</request_token></direct_trade_create_res>&service=alipay.wap.trade.create.direct&sec_id=MD5&partner=2088611853232587&req_id=20121212344553&v=2.0
I want to get the res_data parameter that I care about; more specifically, I just want the request_token in the XML of res_data.
I know I can use a regex to get this to work, but is there a more suitable way, using a library such as an Apache URL library or something else, to get the res_data parameter more elegantly? Maybe by borrowing some components from the servlet mechanism?

Since you say you don't want to hack it with a regex you might use a proper XML parser, although for such a small example it is probably overkill.
If you can assume that you can simply split your string on &'s, i.e., there aren't any &'s in there that do not signal the boundary of two attribute-value pairs, you can split the string into attribute-value pairs, URL-decode the pair you need, and finally use a DOM parser + XPath to get to the request token:
// s holds the raw, still URL-encoded parameter string
// split up URL parameters into attribute value pairs
String[] pairs = s.split("&");
// expect the first attribute/value pair to contain the data
// and decode the URL escape sequences
String resData = URLDecoder.decode(pairs[0], "utf-8");
int equalIndex = resData.indexOf("=");
if (equalIndex >= 0) {
    // the value is right of the '=' sign
    String xmlString = resData.substring(equalIndex + 1);
    // prepare XML parser
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    DocumentBuilder parser = dbf.newDocumentBuilder();
    InputSource is = new InputSource(new StringReader(xmlString));
    Document doc = parser.parse(is);
    // prepare XPath expression to extract request token
    XPath xpath = XPathFactory.newInstance().newXPath();
    XPathExpression xp = xpath.compile("//request_token/text()");
    String requestToken = xp.evaluate(doc);
}

You can use java.net.URLDecoder. Assuming the parameter is in a string called param (and you have already split it away from the other parameters that were connected to it by &):
// limit 2 keeps any further '=' characters inside the value
String[] splitString = param.split("=", 2);
String realData = null;
try {
    realData = java.net.URLDecoder.decode(splitString[1], "UTF-8");
} catch (UnsupportedEncodingException e) {
    // Nothing to do; it should not happen, as UTF-8 is a standard charset
}
Once you do that, you can parse it with the XML parser of your choice and extract whatever you want. Don't try to parse XML with a regex, though.
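Putting the two answers together, a small helper that turns the whole parameter string into a map might look like the sketch below. The class and method names are invented for illustration, and it assumes the usual convention that literal & and = inside values are percent-encoded, so splitting before decoding is safe:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; class and method names are invented for this sketch.
class QueryParams {

    // Splits "k1=v1&k2=v2" on '&' first, then URL-decodes each key and value.
    static Map<String, String> parse(String query) {
        Map<String, String> params = new LinkedHashMap<>();
        for (String pair : query.split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            String[] kv = pair.split("=", 2); // limit 2: keep any later '=' in the value
            String value = kv.length > 1 ? kv[1] : "";
            params.put(decode(kv[0]), decode(value));
        }
        return params;
    }

    private static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to be available on every Java platform
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> p = parse("res_data=%3Ctoken%3Eabc%3C%2Ftoken%3E&sec_id=MD5&v=2.0");
        System.out.println(p.get("res_data")); // prints "<token>abc</token>"
    }
}
```

The value returned for res_data can then be handed to an XML parser as shown in the first answer.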

How to parse an XML that has an & in its tag name using XPath in Java?

I want to parse an XML document whose tag name contains an &, for example: <xml><OC&C>12.4</OC&C></xml>. I tried to escape & as &amp;, but that didn't fix the issue for the tag name (it only fixes it for values). Currently my code throws an exception; see the complete function below.
public static void main(String[] args) throws Exception
{
    String xmlString = "<xml><OC&C>12.4</OC&C></xml>";
    // escape every ampersand; this does not make the tag name legal
    xmlString = xmlString.replaceAll("&", "&amp;");
    String path = "xml";
    InputSource inputSource = new InputSource(new StringReader(xmlString));
    try
    {
        Document xmlDocument = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(inputSource);
        XPath xPath = XPathFactory.newInstance().newXPath();
        XPathExpression xPathExpression = xPath.compile(path);
        System.out.println("Compiled Successfully.");
    }
    catch (SAXException e)
    {
        System.out.println("Error while retrieving node Path:" + path + " from " + xmlString + ". Returning null");
    }
}

Hmmm... I don't think that is a legal XML name. I'd consider using a regex to replace OC&C with something legal first, and then parse it.
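A minimal sketch of that idea follows. The class name, the choice of '_' as the replacement character, and the regex itself are assumptions made for this example. The regex only handles a single & inside a tag name and can be fooled by unusual text content, so it is a workaround for known input rather than a general repair tool:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper; the '_' replacement is an arbitrary choice for this sketch.
class TagNameRepair {

    // Replaces one '&' inside a start or end tag name with '_', e.g. <OC&C> -> <OC_C>.
    static String repair(String xml) {
        return xml.replaceAll("(</?[A-Za-z_][\\w.-]*)&(?=[\\w.-]*[\\s/>])", "$1_");
    }

    // Repairs the input, parses it, and evaluates the XPath expression as a string.
    static String evaluate(String brokenXml, String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(repair(brokenXml))));
            return XPathFactory.newInstance().newXPath().evaluate(xpath, doc);
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is still not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(evaluate("<xml><OC&C>12.4</OC&C></xml>", "/xml/OC_C")); // prints "12.4"
    }
}
```

Note that the XPath expression has to use the repaired name (OC_C here), since the original name never exists in the parsed document.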

It's not "an XML". It's a non-XML. XML doesn't allow ampersands in names. Therefore, you can't parse it successfully using an XML parser.

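Names beginning with xml are reserved by the XML specification, so xml should not be used as an element name either (although most parsers accept it). As a workaround you could wrap the fragment in a CDATA section, something like this: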
<name><![CDATA[<OC&C>12.4</OC&C>]]></name>

XML escape code

I have written a method to check my XML strings for &.
I need to modify the method to include the following:
< &lt;
> &gt;
" &quot;
& &amp;
' &apos;
Here is the method
private String xmlEscape(String s) {
    try {
        // escape every '&' that is not already the start of "&amp;"
        return s.replaceAll("&(?!amp;)", "&amp;");
    }
    catch (PatternSyntaxException pse) {
        return s;
    }
} // end xmlEscape()
Here is the way I am using it
sb.append(" <Host>" + xmlEscape(url.getHost()) + "</Host>\n");
How can I modify my method to incorporate the rest of the symbols?
EDIT
I think I must not have phrased the question correctly.
In the xmlEscape() method I want to check the string for the following chars
< > ' " &; if they are found, I want to replace the found char with the correct entity.
Example: if there is a char &, it would be replaced with &amp; in the string.
Can you do something as simple as
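try {
    // Strings are immutable, so each result has to be assigned back to s
    s = s.replaceAll("&(?!amp;)", "&amp;");
    s = s.replaceAll("<", "&lt;");
    s = s.replaceAll(">", "&gt;");
    s = s.replaceAll("'", "&apos;");
    s = s.replaceAll("\"", "&quot;");
    return s;
}
catch (PatternSyntaxException pse) {
    return s;
}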

You may want to consider using the Apache Commons StringEscapeUtils.escapeXml method or one of the many other XML escape utilities out there. That gives you correct escaping of XML content without worrying about missing something when you need to escape something other than a host name.
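If adding a library is not an option, a hand-written version can do the escaping in a single pass over the string. The sketch below is an assumption about what such a method could look like, not the Commons implementation. Unlike the regex in the question it escapes every &, so it must only be given raw, not yet escaped text:

```java
// Hypothetical stand-in for the xmlEscape() method from the question.
class XmlEscaper {

    // Escapes the five characters that have predefined XML entities.
    // Every '&' is escaped, so input must be raw text, not already-escaped text.
    static String xmlEscape(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(xmlEscape("Tom & Jerry's <b>\"show\"</b>"));
        // prints: Tom &amp; Jerry&apos;s &lt;b&gt;&quot;show&quot;&lt;/b&gt;
    }
}
```

Walking the string once avoids the ordering problem of chained replaceAll calls, where an ampersand produced by an earlier replacement could be escaped again by a later one.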

Alternatively have you considered using the StAX (JSR-173) APIs to compose your XML document rather than appending strings together (an implementation is included in the JDK/JRE)? This will handle all the necessary character escaping for you:
package forum12569441;

import java.io.*;
import javax.xml.stream.*;

public class Demo {
    public static void main(String[] args) throws Exception {
        // WRITE THE XML
        XMLOutputFactory xof = XMLOutputFactory.newFactory();
        StringWriter sw = new StringWriter();
        XMLStreamWriter xsw = xof.createXMLStreamWriter(sw);
        xsw.writeStartDocument();
        xsw.writeStartElement("foo");
        xsw.writeCharacters("<>\"&'");
        xsw.writeEndDocument();
        String xml = sw.toString();
        System.out.println(xml);

        // READ THE XML
        XMLInputFactory xif = XMLInputFactory.newFactory();
        XMLStreamReader xsr = xif.createXMLStreamReader(new StringReader(xml));
        xsr.nextTag(); // Advance to "foo" element
        System.out.println(xsr.getElementText());
    }
}
Output
<?xml version="1.0" ?><foo>&lt;&gt;"&amp;'</foo>
<>"&'
