XML escape code - java

I have written a method to check my XML strings for &.
I need to modify the method to include the following:
< &lt;
> &gt;
" &quot;
& &amp;
' &apos;
Here is the method
private String xmlEscape(String s) {
try {
return s.replaceAll("&(?!amp;)", "&amp;");
}
catch (PatternSyntaxException pse) {
return s;
}
} // end xmlEscape()
Here is the way I am using it
sb.append(" <Host>" + xmlEscape(url.getHost()) + "</Host>\n");
How can I modify my method to incorporate the rest of the symbols?
EDIT
I think I must not have phrased the question correctly.
In the xmlEscape() method I want to check the string for the following chars:
< > ' " &. If any of them are found, I want to replace the char with the corresponding entity.
Example: if there is a char &, it would be replaced with &amp; in the string.
Can you do something as simple as
try {
s = s.replaceAll("&(?!amp;)", "&amp;");
s = s.replaceAll("<", "&lt;");
s = s.replaceAll(">", "&gt;");
s = s.replaceAll("'", "&apos;");
s = s.replaceAll("\"", "&quot;");
return s;
}
catch (PatternSyntaxException pse) {
return s;
}

You may want to consider using the Apache Commons StringEscapeUtils.escapeXml method or one of the many other XML escape utilities out there. That gives you correct escaping for XML content, without the risk of missing something when you later need to escape something other than a host name.
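If adding a library is not an option, the asker's method can be extended by hand. The following is only a minimal sketch: the class name XmlEscapeSketch is made up for illustration, and it assumes the input is raw, not yet escaped text, so every & is escaped (unlike the &(?!amp;) regex in the question, which tries to skip an existing &amp;).

```java
class XmlEscapeSketch {

    // Escapes the five characters that have predefined entities in XML.
    // Assumption: the input is raw text, so every '&' is escaped, even one
    // that already starts an entity such as "&amp;".
    static String xmlEscape(String s) {
        if (s == null) {
            return null;
        }
        StringBuilder sb = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(xmlEscape("a<b & \"c\" > 'd'"));
    }
}
```

A single pass over the characters avoids the ordering problem of chained replaceAll calls, where a later replacement could touch the & introduced by an earlier one.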

Alternatively have you considered using the StAX (JSR-173) APIs to compose your XML document rather than appending strings together (an implementation is included in the JDK/JRE)? This will handle all the necessary character escaping for you:
package forum12569441;
import java.io.*;
import javax.xml.stream.*;
public class Demo {
public static void main(String[] args) throws Exception {
// WRITE THE XML
XMLOutputFactory xof = XMLOutputFactory.newFactory();
StringWriter sw = new StringWriter();
XMLStreamWriter xsw = xof.createXMLStreamWriter(sw);
xsw.writeStartDocument();
xsw.writeStartElement("foo");
xsw.writeCharacters("<>\"&'");
xsw.writeEndDocument();
String xml = sw.toString();
System.out.println(xml);
// READ THE XML
XMLInputFactory xif = XMLInputFactory.newFactory();
XMLStreamReader xsr = xif.createXMLStreamReader(new StringReader(xml));
xsr.nextTag(); // Advance to "foo" element
System.out.println(xsr.getElementText());
}
}
Output
<?xml version="1.0" ?><foo>&lt;&gt;"&amp;'</foo>
<>"&'

Related

Xalan's SAX implementation - double encoding entities in string

I'm using SAX with the Xalan implementation (v. 2.7.2). I have a string in HTML format
"<p>Test k&quot;nnen</p>"
and I have to pass it as the content of an XML tag.
The result is:
"&lt;p&gt;Test k&amp;quot;nnen&lt;/p&gt;"
Xalan encodes the ampersand even though it is part of an already escaped entity.
Does anyone know a way to make Xalan understand escaped entities and not escape their ampersand?
One possible solution is to call startCDATA() on the transformerHandler, but that's not something I can use in my code.
public class TestSax{
public static void main(String[] args) throws TransformerConfigurationException, SAXException {
TestSax t = new TestSax();
System.out.println(t.createSAXXML());
}
public String createSAXXML() throws SAXException, TransformerConfigurationException {
Writer writer = new StringWriter( );
StreamResult streamResult = new StreamResult(writer);
SAXTransformerFactory transformerFactory =
(SAXTransformerFactory) SAXTransformerFactory.newInstance( );
String data = null;
TransformerHandler transformerHandler =
transformerFactory.newTransformerHandler( );
transformerHandler.setResult(streamResult);
transformerHandler.startDocument( );
transformerHandler.startElement(null,"decimal","decimal", null);
data = "<p>Test k&quot;nnen</p>";
transformerHandler.characters(data.toCharArray(),0,data.length( ));
transformerHandler.endElement(null,"decimal","decimal");
transformerHandler.endDocument( );
return writer.toString( );
}}
If your input is XML, then you need to parse it. Then <p> and </p> will be recognized as tags, and &quot; will be recognized as an entity reference.
On the other hand, if you want to treat it as a string and pass it through XML machinery, then "<" and "&" are going to be preserved as ordinary characters, which means they will be escaped as &lt; and &amp; respectively.
If you want "<" treated as an ordinary character but "&" treated with its XML meaning, then you need software with some kind of split personality, and you're not going to get that off-the-shelf.
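The first option in the answer above (parse the input) could be sketched roughly as follows. This is an illustration, not the asker's code: the class and method names are hypothetical, and it assumes the fragment is well-formed XML that uses only the predefined XML entities (an HTML-only entity such as &ouml; would make the parse fail).

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class ParseFragmentSketch {

    // Wraps the fragment in the target element and parses it, so that tags
    // and entity references are interpreted instead of being escaped again.
    static String wrapAndSerialize(String fragment) throws Exception {
        String xml = "<decimal>" + fragment + "</decimal>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        // Identity transform: serializes the parsed tree back to text.
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(wrapAndSerialize("<p>Test k&quot;nnen</p>"));
    }
}
```

Because the fragment is parsed before it is written, the serializer sees a p element and a quote character rather than a flat string, so nothing is double-escaped.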

How to parse an XML that has an & in its tag name using XPath in Java?

I want to parse an XML document whose tag name contains an &, for example: <xml><OC&C>12.4</OC&C></xml>. I tried to escape & to &amp; but that didn't fix the issue for the tag name (it only fixes it for values). Currently my code throws an exception; see the complete function below.
public static void main(String[] args) throws Exception
{
String xmlString = "<xml><OC&C>12.4</OC&C></xml>";
xmlString = xmlString.replaceAll("&", "&amp;");
String path = "xml";
InputSource inputSource = new InputSource(new StringReader(xmlString));
try
{
Document xmlDocument = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(inputSource);
XPath xPath = XPathFactory.newInstance().newXPath();
XPathExpression xPathExpression = xPath.compile(path);
System.out.println("Compiled Successfully.");
}
catch (SAXException e)
{
System.out.println("Error while retrieving node Path:" + path + " from " + xmlString + ". Returning null");
}
}
Hmmm... I don't think that is a legal XML name. I'd think about using a regex to replace OC&C with something legal first, and then parse it.
It's not "an XML". It's a non-XML. XML doesn't allow ampersands in names. Therefore, you can't parse it successfully using an XML parser.
Names beginning with "xml" are reserved by the XML specification, so xml should not be used as an element name either. Beyond that, you could wrap the problematic markup in a CDATA section, something like this:
<name><![CDATA[<OC&C>12.4</OC&C>]]></name>
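The regex pre-processing suggested in the first answer might look like the sketch below. The class name, the choice of "_" as the replacement character, and the resulting OC_C name are all assumptions made for illustration. Note that this simple pattern also rewrites any & inside attribute values, so it only suits input like the example, where tags carry no attributes.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class TagNameSanitizerSketch {

    // Matches anything shaped like a tag: '<', optional '/', non-bracket chars, '>'.
    private static final Pattern TAG = Pattern.compile("</?[^<>]+>");

    // Replaces '&' with '_' inside tags only; text content is left untouched.
    static String sanitizeTagNames(String xml) {
        Matcher m = TAG.matcher(xml);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String fixed = m.group().replace('&', '_');
            m.appendReplacement(sb, Matcher.quoteReplacement(fixed));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    // Parses the XML and evaluates an XPath expression as a string.
    static String evaluate(String xml, String xpath) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return XPathFactory.newInstance().newXPath().evaluate(xpath, doc);
    }

    public static void main(String[] args) throws Exception {
        String cleaned = sanitizeTagNames("<xml><OC&C>12.4</OC&C></xml>");
        System.out.println(cleaned);
        System.out.println(evaluate(cleaned, "/xml/OC_C"));
    }
}
```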

JAXB: Marshal output XML with indentation create empty line break on the first line

When I marshal XML with these properties
marshal.setProperty(Marshaller.JAXB_FRAGMENT, Boolean.TRUE);
marshal.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, Boolean.TRUE);
it generates an empty line at the very top
//Generate empty line break here
<XX>
<YY>
<PDF>pdf name</PDF>
<ZIP>zip name</ZIP>
<RECEIVED_DT>received date time</RECEIVED_DT>
</YY>
</XX>
I think the reason is that marshal.setProperty(Marshaller.JAXB_FRAGMENT, Boolean.TRUE);, which removes <?xml version="1.0" encoding="UTF-8" standalone="yes"?>, leaves a line break at the beginning of the output XML. Is there a way to fix this? I use the JAXB that comes with JDK 6. Does MOXy suffer from this problem?
As you point out EclipseLink JAXB (MOXy) does not have this problem so you could use that (I'm the MOXy lead):
http://blog.bdoughan.com/2011/05/specifying-eclipselink-moxy-as-your.html
Option #1
One option would be to use a java.io.FilterWriter or java.io.FilterOutputStream and customize it to ignore the leading new line.
Option #2
Another option would be to marshal to StAX, and use a StAX implementation that supports formatting the output. I haven't tried this myself but the answer linked below suggests using com.sun.xml.txw2.output.IndentingXMLStreamWriter.
https://stackoverflow.com/a/3625359/383861
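Option #1 above could be sketched as a java.io.FilterWriter that drops whitespace until the first real character arrives. The class name LeadingBlankSkippingWriter is hypothetical, and the sketch assumes that skipping all leading whitespace (not only a single newline) is acceptable.

```java
import java.io.FilterWriter;
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

class LeadingBlankSkippingWriter extends FilterWriter {

    // Becomes true once the first non-whitespace character has been written.
    private boolean started = false;

    LeadingBlankSkippingWriter(Writer out) {
        super(out);
    }

    @Override
    public void write(int c) throws IOException {
        if (!started && Character.isWhitespace(c)) {
            return; // still in the leading whitespace, drop it
        }
        started = true;
        super.write(c);
    }

    @Override
    public void write(char[] buf, int off, int len) throws IOException {
        while (!started && len > 0 && Character.isWhitespace(buf[off])) {
            off++;
            len--;
        }
        if (len > 0) {
            started = true;
            super.write(buf, off, len);
        }
    }

    @Override
    public void write(String str, int off, int len) throws IOException {
        while (!started && len > 0 && Character.isWhitespace(str.charAt(off))) {
            off++;
            len--;
        }
        if (len > 0) {
            started = true;
            super.write(str, off, len);
        }
    }

    public static void main(String[] args) throws IOException {
        StringWriter target = new StringWriter();
        try (Writer w = new LeadingBlankSkippingWriter(target)) {
            w.write("\n<XX>\n  <YY/>\n</XX>");
        }
        System.out.println(target);
    }
}
```

With JAXB, the wrapper would be passed to marshaller.marshal(object, writer) in place of the plain writer; whitespace after the first element is passed through unchanged, so the indentation is preserved.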
Inspired by the first option in bdoughan's answer in this post, I wrote a custom writer that removes the blank line from the XML file:
public class XmlWriter extends FileWriter {
public XmlWriter(File file) throws IOException {
super(file);
}
public void write(String str) throws IOException {
if(org.apache.commons.lang3.StringUtils.isNotBlank(str)) {
super.write(str);
}
}
}
To check for an empty line, I used the org.apache.commons.lang3.StringUtils.isNotBlank() method; you can use your own custom condition.
Then pass this writer to the marshal method, like this in Java 8:
// skip other code
File file = new File("test.xml");
marshaller.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, true);
marshaller.setProperty(Marshaller.JAXB_FRAGMENT, true);
try (FileWriter writer = new XmlWriter(file)) {
marshaller.marshal(object, writer);
}
This removes the <?xml version="1.0" encoding="UTF-8" standalone="yes"?> declaration and also does not print the blank line.
Since I was marshalling to a File object, I decided to remove this line afterwards:
public static void removeEmptyLines(File file) throws IOException {
long fileTimestamp = file.lastModified();
List<String> lines = Files.readAllLines(file.toPath());
try (Writer writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file), StandardCharsets.UTF_8))) {
for (String line : lines) {
if (!line.trim().isEmpty()) {
writer.write(line + "\n");
}
}
}
file.setLastModified(fileTimestamp);
}

Need an application to fix a XML with unescaped chars

This XML (it has an .rdf file extension, but it is XML) was generated by an automatic tool, but unfortunately it contains various "unescaped" strings like
<tag xml:lang="fr">L'insuline (du latin insula, île) </tag>
And the parser (and the reasoner software) crashes on this...
Java or PHP solutions are fine with me too!
Thanks,
Celso
Here's a general method that I use a lot to make sure a String is escaped properly for XML.
private static final String AMP = "&amp;";
private static final String LT = "&lt;";
private static final String GT = "&gt;";
private static final String QUOTE = "&quot;";
private static final String APOS = "&apos;";
public static String encodeEntities(String dirtyString) {
StringBuffer buff = new StringBuffer();
char[] chars = dirtyString.toCharArray();
for (int i = 0; i < chars.length; i++) {
if (chars[i] > 0x7f) {
buff.append("&#" + (int) chars[i] + ";");
continue;
}
switch (chars[i]) {
case '&':
buff.append(AMP);
break;
case '<':
buff.append(LT);
break;
case '\'':
buff.append(APOS);
break;
case '"':
buff.append(QUOTE);
break;
case '>':
buff.append(GT);
break;
default:
buff.append(chars[i]);
break;
}
}
return buff.toString();
}
The XML given by the OP is well-formed: the single quote character is valid and so is the circumflex "i", and neither needs escaping. I would make sure you're using a text encoding such as UTF-8. Here's a quick Java example that does an identity transformation:
public static void main(String[] args) throws Exception {
Transformer t = TransformerFactory.newInstance().newTransformer();
StreamResult s = new StreamResult(System.out);
t.transform(new StreamSource(new StringReader("<tag xml:lang=\"fr\">L'insuline (du latin insula, île) </tag>")), s);
}
The XML fragment given by the OP looks well-formed. Neither the apostrophe nor the i-circumflex needs escaping. The most likely problem is that the XML is encoded using iso-8859-1 but lacks an XML declaration, so the parser thinks it is in UTF-8 encoding. The solution then is to add the XML declaration <?xml version="1.0" encoding="iso-8859-1"?>, which tells the parser how to decode the characters. (For a document containing only ASCII characters, iso-8859-1 and utf-8 are indistinguishable, so this problem only surfaces when you use characters outside the ASCII range.)
A word of advice: if you had given the error message generated by the parser, you wouldn't have got so many incorrect answers.
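The encoding explanation above can be checked with a small experiment. This is a sketch under the assumption that the file really is iso-8859-1 encoded; the class and method names are invented for the example, and the i-circumflex is written as \u00EE so the source file's own encoding does not matter.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class EncodingDeclarationSketch {

    // Parses raw bytes and returns the root element's text, or null if parsing fails.
    static String parseRootText(byte[] bytes) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            return null; // e.g. an invalid UTF-8 byte sequence
        }
    }

    public static void main(String[] args) {
        String body = "<tag xml:lang=\"fr\">L'insuline (du latin insula, \u00EEle) </tag>";
        String decl = "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>";

        // Same characters, both encoded as iso-8859-1; only one says so.
        byte[] undeclared = body.getBytes(StandardCharsets.ISO_8859_1);
        byte[] declared = (decl + body).getBytes(StandardCharsets.ISO_8859_1);

        System.out.println("without declaration: " + parseRootText(undeclared));
        System.out.println("with declaration:    " + parseRootText(declared));
    }
}
```

Without the declaration the parser assumes UTF-8, and the single iso-8859-1 byte for the i-circumflex is not a valid UTF-8 sequence, so the parse fails; with the declaration the same bytes decode correctly.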

Ideal Java library for cleaning html, and escaping malformed fragments

I've got some HTML files that need to be parsed and cleaned, and they occasionally have content with special characters like <, >, ", etc. which have not been properly escaped.
I have tried running the files through jTidy, but the best I can get it to do is just omit the content it sees as malformed html. Is there a different library that will just escape the malformed fragments instead of omitting them? If not, any recommendations on what library would be easiest to modify?
Clarification:
Sample input: <p> blah blah <M+1> blah </p>
Desired output: <p> blah blah &lt;M+1&gt; blah </p>
You can also try TagSoup. TagSoup emits regular old SAX events so in the end you get what looks like a well-formed XML document.
I have had very good luck with TagSoup and I'm always surprised at how well it handles poorly constructed HTML files.
Ultimately I solved this by running a regular expression first and an unmodified TagSoup second.
Here is my regular expression code to escape unknown tags like <M+1>
private static String escapeUnknownTags(String input) {
Scanner scan = new Scanner(input);
StringBuilder builder = new StringBuilder();
while (scan.hasNext()) {
String s = scan.findWithinHorizon("[^<]*</?[^<>]*>?", 1000000);
if (s == null) {
builder.append(escape(scan.next(".*")));
} else {
processMatch(s, builder);
}
}
return builder.toString();
}
private static void processMatch(String s, StringBuilder builder) {
if (!isKnown(s)) {
String escaped = escape(s);
builder.append(escaped);
}
else {
builder.append(s);
}
}
private static String escape(String s) {
s = s.replaceAll("<", "&lt;");
s = s.replaceAll(">", "&gt;");
return s;
}
private static boolean isKnown(String s) {
Scanner scan = new Scanner(s);
if (scan.findWithinHorizon("[^<]*</?([^<> ]*)[^<>]*>?", 10000) == null) {
return false;
}
MatchResult mr = scan.match();
try {
String tag = mr.group(1).toLowerCase();
if (HTML.getTag(tag) != null) {
return true;
}
}
catch (Exception e) {
// Should never happen
e.printStackTrace();
}
return false;
}
HTML cleaner
HtmlCleaner is open-source HTML parser written in Java. HTML found on
Web is usually dirty, ill-formed and unsuitable for further
processing. For any serious consumption of such documents, it is
necessary to first clean up the mess and bring the order to tags,
attributes and ordinary text. For the given HTML document, HtmlCleaner
reorders individual elements and produces well-formed XML. By default,
it follows similar rules that the most of web browsers use in order to
create Document Object Model. However, user may provide custom tag and
rule set for tag filtering and balancing.
Ok, I suspect it is this. Use the following code, it will help.
javax.swing.text.html.HTML
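As a rough illustration of how javax.swing.text.html.HTML can help here: HTML.getTag(String) returns null for names that are not well-known HTML tags, which gives a cheap "is this a real tag?" check. The sketch below is a simplified take on the idea, not the code from the answer above; the class name and the regular expression are assumptions, and it does not handle a stray < that is never closed by a >.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.swing.text.html.HTML;

class UnknownTagEscaperSketch {

    // Group 1 captures the tag name of anything shaped like an opening or closing tag.
    private static final Pattern TAG = Pattern.compile("</?([^<>\\s/]+)[^<>]*>");

    // Escapes the angle brackets of tag-like text whose name is not a known HTML tag.
    static String escapeUnknownTags(String input) {
        Matcher m = TAG.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String tag = m.group();
            if (HTML.getTag(m.group(1).toLowerCase()) == null) {
                tag = tag.replace("<", "&lt;").replace(">", "&gt;");
            }
            m.appendReplacement(sb, Matcher.quoteReplacement(tag));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeUnknownTags("<p> blah blah <M+1> blah </p>"));
    }
}
```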
