XMLStreamWriter writeCharacters without escaping


How do I use XMLStreamWriter to write exactly what I put in? For instance, if I create a script tag and fill it with JavaScript, I don't want all my single quotes coming out as &apos;.
Here's a small test I wrote that doesn't use any of the abstractions I've got in place, just calls to writeCharacters:
public void testWriteCharacters() {
    StringWriter sw = new StringWriter();
    XMLOutputFactory factory = XMLOutputFactory.newInstance();
    StringBuffer out = new StringBuffer();
    try {
        XMLStreamWriter writer = factory.createXMLStreamWriter(sw);
        writer.writeStartElement("script");
        writer.writeAttribute("type", "text/javascript");
        writer.writeCharacters("function hw(){ \n" +
                "\t alert('hello world');\n" +
                "}\n");
        writer.writeEndElement();
        writer.flush(); // make sure any buffered output reaches the StringWriter
        out.append(sw);
    } catch (XMLStreamException e) {
        e.printStackTrace(); // report the failure instead of silently swallowing it
    } finally {
        try {
            sw.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    System.out.println(out.toString());
}
This produces an apos entity for both the single quotes surrounding hello world.

You could use a property on the factory:
final XMLOutputFactory streamWriterFactory = XMLOutputFactory.newFactory();
streamWriterFactory.setProperty("escapeCharacters", false);
A writer created by this factory will then write element text without escaping it, provided the factory implementation supports this property. XMLOutputFactoryImpl does.

XMLStreamWriter.writeCharacters() isn't supposed to escape '. It only escapes <, > and &, and writeAttribute() also escapes " (see the javadoc).
However, if you want to write text without any escaping at all, you have to write it as a CDATA section using writeCData().
The typical approach for writing scripts in CDATA sections is:
<script>//<![CDATA[
...
//]]></script>
That is:
out.writeCharacters("//");
out.writeCData("\n ... \n//");
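Putting the two calls above into a complete method, a minimal sketch could look like the following. The class name CDataScriptSketch and the method writeScript are made up for illustration, and the sketch assumes the script text does not itself contain the sequence ]]>, which would end the CDATA section early.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper class, not part of any library.
class CDataScriptSketch {

    // Writes <script type="text/javascript">//<![CDATA[ ... //]]></script>
    // Assumption: js does not contain "]]>".
    static String writeScript(String js) throws XMLStreamException {
        StringWriter sw = new StringWriter();
        XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(sw);
        writer.writeStartElement("script");
        writer.writeAttribute("type", "text/javascript");
        writer.writeCharacters("//");           // JavaScript comment hides the CDATA opener
        writer.writeCData("\n" + js + "\n//");  // content is written as-is, no entity escaping
        writer.writeEndElement();
        writer.flush();
        writer.close();
        return sw.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(writeScript("function hw(){\n\t alert('hello world');\n}"));
    }
}
```

The single quotes come out untouched because CDATA content is not subject to entity escaping.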

Alternative method, with custom escape handler:
XMLOutputFactory xmlFactory = XMLOutputFactory.newInstance();
xmlFactory.setProperty(XMLOutputFactory2.P_TEXT_ESCAPER, new MyEscapingWriterFactory());
Here 'MyEscapingWriterFactory' is your own implementation of the 'EscapingWriterFactory' interface. It allows fine-grained control over text escaping, which is useful when a text element has to carry arbitrary input (say, invalid XML with multiple processing instructions or incorrectly written CDATA sections).

You can also use Woodstox's StAX implementation. Its XMLStreamWriter2 class has a writeRaw() method. We're using it for this specific reason and it works great.

Write directly to the underlying Writer or OutputStream:
Writer out = new StringWriter();
XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
... //write your XML
writer.flush();
//write extra characters directly to the underlying writer
out.write("<yourstuff>Test characters</yourstuff>");
out.flush();
... //continue with normal XML
writer.writeEndElement();
writer.flush();
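One pitfall with this approach is that a StAX writer may keep the most recent start tag open (without its closing >) until the next write call, so raw text written straight after writeStartElement() can land inside the tag. The sketch below works around that by writing an empty string through writeCharacters() first. That this closes the pending start tag is an assumption based on how the JDK's bundled implementation behaves; other implementations may differ. The class and method names are invented for the example.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper class, not part of any library.
class RawWriteSketch {

    static String writeWithRawFragment(String rawFragment) throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);

        writer.writeStartElement("root");
        // Assumption: an empty writeCharacters() call makes the writer
        // emit the pending '>' of <root before we bypass it.
        writer.writeCharacters("");
        writer.flush();

        // Write directly to the underlying writer: no escaping is applied.
        out.write(rawFragment);
        out.flush();

        // Continue with normal XML.
        writer.writeEndElement();
        writer.flush();
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(writeWithRawFragment("<yourstuff>it's raw</yourstuff>"));
    }
}
```

Because the raw fragment bypasses the writer entirely, keeping it well-formed is the caller's responsibility.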

Related

XML escape code

I have written a method to check my XML strings for &.
I need to modify the method to include the following:
< &lt;
> &gt;
" &quot;
& &amp;
' &apos;
Here is the method:
private String xmlEscape(String s) {
    try {
        return s.replaceAll("&(?!amp;)", "&amp;");
    }
    catch (PatternSyntaxException pse) {
        return s;
    }
} // end xmlEscape()
Here is the way I am using it
sb.append(" <Host>" + xmlEscape(url.getHost()) + "</Host>\n");
How can I modify my method to incorporate the rest of the symbols?
EDIT
I think I must not have phrased the question correctly.
In the xmlEscape() method I want to check the string for the following chars:
< > ' " &. If any are found, I want to replace each one with the corresponding entity.
Example: if the string contains &, it should be replaced with &amp;.
Can you do something as simple as this?
try {
    s.replaceAll("&(?!amp;)", "&amp;");
    s.replaceAll("<", "&lt;");
    s.replaceAll(">", "&gt;");
    s.replaceAll("'", "&apos;");
    s.replaceAll("\"", "&quot;");
    return s;
}
catch (PatternSyntaxException pse) {
    return s;
}

You may want to consider using the Apache Commons StringEscapeUtils.escapeXml method or one of the many other XML escaping utilities out there. That gives you correct escaping of XML content without having to worry about missing a case when you later need to escape something other than a host name.
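If pulling in a library is not an option, a hand-written version might look like the sketch below. It is only an illustration: the class name XmlEscapeSketch is made up, it assumes the input is raw, not-yet-escaped text (so an existing "&amp;" would be escaped again), and it does not deal with characters that are illegal in XML altogether. Note that String.replaceAll returns a new string rather than modifying the original, which is why a single pass with a StringBuilder is used here.

```java
// Hypothetical helper class, not part of any library.
class XmlEscapeSketch {

    // Replaces the five characters that have predefined XML entities.
    // Assumption: s is raw text that has not been escaped before.
    static String xmlEscape(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(xmlEscape("<a href=\"x\">Tom & 'Jerry'</a>"));
    }
}
```

Handling each character once in a single pass also avoids the ordering problem of chained replacements, where escaping & after the other characters would corrupt the entities already inserted.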

Alternatively have you considered using the StAX (JSR-173) APIs to compose your XML document rather than appending strings together (an implementation is included in the JDK/JRE)? This will handle all the necessary character escaping for you:
package forum12569441;
import java.io.*;
import javax.xml.stream.*;
public class Demo {
public static void main(String[] args) throws Exception {
// WRITE THE XML
XMLOutputFactory xof = XMLOutputFactory.newFactory();
StringWriter sw = new StringWriter();
XMLStreamWriter xsw = xof.createXMLStreamWriter(sw);
xsw.writeStartDocument();
xsw.writeStartElement("foo");
xsw.writeCharacters("<>\"&'");
xsw.writeEndDocument();
String xml = sw.toString();
System.out.println(xml);
// READ THE XML
XMLInputFactory xif = XMLInputFactory.newFactory();
XMLStreamReader xsr = xif.createXMLStreamReader(new StringReader(xml));
xsr.nextTag(); // Advance to "foo" element
System.out.println(xsr.getElementText());
}
}
Output
<?xml version="1.0" ?><foo>&lt;&gt;"&amp;'</foo>
<>"&'

JAXB: Marshal output XML with indentation create empty line break on the first line

When I marshal XML with these properties set:
marshal.setProperty(Marshaller.JAXB_FRAGMENT, Boolean.TRUE);
marshal.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, Boolean.TRUE);
the output starts with an empty line at the very top:
//Generate empty line break here
<XX>
<YY>
<PDF>pdf name</PDF>
<ZIP>zip name</ZIP>
<RECEIVED_DT>received date time</RECEIVED_DT>
</YY>
</XX>
I think the cause is marshal.setProperty(Marshaller.JAXB_FRAGMENT, Boolean.TRUE), which removes <?xml version="1.0" encoding="UTF-8" standalone="yes"?> but leaves a line break at the beginning of the output. Is there a way to fix this? I use the JAXB implementation that comes with JDK 6. Does MOXy suffer from this problem?

As you point out EclipseLink JAXB (MOXy) does not have this problem so you could use that (I'm the MOXy lead):
http://blog.bdoughan.com/2011/05/specifying-eclipselink-moxy-as-your.html
Option #1
One option would be to use a java.io.FilterWriter or java.io.FilterOutputStream and customize it to ignore the leading new line.
Option #2
Another option would be to marshal to StAX, and use a StAX implementation that supports formatting the output. I haven't tried this myself but the answer linked below suggests using com.sun.xml.txw2.output.IndentingXMLStreamWriter.
https://stackoverflow.com/a/3625359/383861
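Option #1 could be sketched roughly as follows: a java.io.FilterWriter that discards whitespace until the first non-whitespace character arrives and passes everything after that through unchanged. The class name LeadingWhitespaceSkippingWriter is invented for this example, and the sketch has not been tried against a particular JAXB implementation.

```java
import java.io.FilterWriter;
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

// Hypothetical helper class, not part of any library.
class LeadingWhitespaceSkippingWriter extends FilterWriter {

    private boolean started = false; // true once real content has been seen

    LeadingWhitespaceSkippingWriter(Writer out) {
        super(out);
    }

    @Override
    public void write(int c) throws IOException {
        if (!started && Character.isWhitespace(c)) {
            return; // still in the leading whitespace: drop it
        }
        started = true;
        super.write(c);
    }

    @Override
    public void write(char[] cbuf, int off, int len) throws IOException {
        int end = off + len;
        while (!started && off < end && Character.isWhitespace(cbuf[off])) {
            off++;
        }
        if (off < end) {
            started = true;
            super.write(cbuf, off, end - off);
        }
    }

    @Override
    public void write(String str, int off, int len) throws IOException {
        int end = off + len;
        while (!started && off < end && Character.isWhitespace(str.charAt(off))) {
            off++;
        }
        if (off < end) {
            started = true;
            super.write(str, off, end - off);
        }
    }

    public static void main(String[] args) throws IOException {
        StringWriter sink = new StringWriter();
        try (Writer w = new LeadingWhitespaceSkippingWriter(sink)) {
            w.write("\n<XX>\n  <YY/>\n</XX>");
        }
        System.out.println(sink);
    }
}
```

In the marshalling code, the wrapper would be passed in place of the plain writer, for example marshaller.marshal(object, new LeadingWhitespaceSkippingWriter(someWriter)). Indentation inside the document is kept because only whitespace before the first real character is dropped.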

Inspired by the first option in bdoughan's answer to this post, I've written a custom writer that drops blank lines from the XML file:
public class XmlWriter extends FileWriter {
public XmlWriter(File file) throws IOException {
super(file);
}
public void write(String str) throws IOException {
if(org.apache.commons.lang3.StringUtils.isNotBlank(str)) {
super.write(str);
}
}
}
To check for an empty line I've used the org.apache.commons.lang3.StringUtils.isNotBlank() method; you can substitute your own condition.
Then pass this writer to the marshal method, as follows (Java 8):
// skip other code
File file = new File("test.xml");
marshaller.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, true);
marshaller.setProperty(Marshaller.JAXB_FRAGMENT, true);
try (FileWriter writer = new XmlWriter(file)) {
marshaller.marshal(object, writer);
}
This omits the <?xml version="1.0" encoding="UTF-8" standalone="yes"?> declaration and also does not print the blank line.

Since I was marshalling to a File object, I decided to remove this line afterwards:
public static void removeEmptyLines(File file) throws IOException {
long fileTimestamp = file.lastModified();
List<String> lines = Files.readAllLines(file.toPath());
try (Writer writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file), StandardCharsets.UTF_8))) {
for (String line : lines) {
if (!line.trim().isEmpty()) {
writer.write(line + "\n");
}
}
}
file.setLastModified(fileTimestamp);
}

Code for Using StAX in java

I have a 200 MB XML file of the following form:
<school name = "some school">
<class standard = "2A">
<student>
.....
</student>
<student>
.....
</student>
<student>
.....
</student>
</class>
</school>
I need to split this XML into several files using StAX so that each output file contains n students and the structure is preserved: <school>, then <class>, with the <student> elements under them. The attributes of school and class must also be preserved in the resulting files.
Here is the code I am using:
XMLInputFactory inputFactory = XMLInputFactory.newInstance();
String xmlFile = "input.XML";
XMLEventReader reader = inputFactory.createXMLEventReader(new FileReader(xmlFile));
XMLOutputFactory outputFactory = XMLOutputFactory.newInstance();
outputFactory.setProperty("javax.xml.stream.isRepairingNamespaces", Boolean.TRUE);
XMLEventWriter writer = null;
int count = 0;
QName name = new QName(null, "student");
try {
while (true) {
XMLEvent event = reader.nextEvent();
if (event.isStartElement()) {
StartElement element = event.asStartElement();
if (element.getName().equals(name)) {
String filename = "input"+ count + ".xml";
writer = outputFactory.createXMLEventWriter(new FileWriter(filename));
writeToFile(reader, event, writer);
writer.close();
count++;
}
}
if (event.isEndDocument())
break;
}
} catch (XMLStreamException e) {
throw e;
} catch (IOException e) {
e.printStackTrace();
} finally {
reader.close();
}
private static void writeToFile(XMLEventReader reader, XMLEvent startEvent, XMLEventWriter writer) throws XMLStreamException, IOException {
StartElement element = startEvent.asStartElement();
QName name = element.getName();
int stack = 1;
writer.add(element);
while (true) {
XMLEvent event = reader.nextEvent();
if (event.isStartElement() && event.asStartElement().getName().equals(name))
stack++;
if (event.isEndElement()) {
EndElement end = event.asEndElement();
if (end.getName().equals(name)) {
stack--;
if (stack == 0) {
writer.add(event);
break;
}
}
}
writer.add(event);
}
}
Please look at the call writeToFile(reader, event, writer) in the try block. At that point the reader only supplies the student element. I need the output to contain school, class, and then n students, so that each generated file has the same structure as the original, only with fewer children per file.
Thanks in advance.

I think you can keep track of the list of parent events that precede the "student" start element event and pass it to the writeToFile() method. Then, in writeToFile(), you can use that list to replay the "school" and "class" events.

You have code for determining when to start a new file which I haven't examined closely, but the process of finishing one file and starting the next is definitely incomplete.
On reaching a point where you want to end a file, you have to generate end events for the enclosing <class> and <school> tags and for the document before closing it. When you start your new file, you need to generate start events for the same after opening it and before starting again to copy student events.
In order to generate the start events properly, you will have to retain the corresponding events from the input.
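The approach described above (retain the parent start events, replay them at the top of each output, and generate matching end events before closing it) could be sketched like this. To stay self-contained the sketch reads from a String and returns the parts as Strings instead of writing files; the class name StudentSplitter and the perFile parameter are invented for the example, and it assumes student elements are never nested inside each other.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper class, not part of any library.
class StudentSplitter {
    private static final XMLEventFactory EVENTS = XMLEventFactory.newInstance();

    static List<String> split(String xml, int perFile) throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
        XMLOutputFactory outputFactory = XMLOutputFactory.newInstance();
        Deque<StartElement> parents = new ArrayDeque<>(); // open ancestors, innermost first
        List<String> parts = new ArrayList<>();
        StringWriter sink = null;
        XMLEventWriter writer = null;
        int inCurrent = 0;
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                StartElement start = event.asStartElement();
                if (!"student".equals(start.getName().getLocalPart())) {
                    parents.push(start); // remember <school>, <class> with their attributes
                    continue;
                }
                if (writer == null) { // begin a new part: replay the parents, outermost first
                    sink = new StringWriter();
                    writer = outputFactory.createXMLEventWriter(sink);
                    for (Iterator<StartElement> it = parents.descendingIterator(); it.hasNext();) {
                        writer.add(it.next());
                    }
                }
                copyElement(reader, start, writer);
                if (++inCurrent == perFile) {
                    parts.add(finish(writer, sink, parents));
                    writer = null;
                    inCurrent = 0;
                }
            } else if (event.isEndElement()) { // a parent closed in the input
                if (writer != null) {
                    parts.add(finish(writer, sink, parents));
                    writer = null;
                    inCurrent = 0;
                }
                parents.pop();
            }
        }
        reader.close();
        return parts;
    }

    // Copies one complete element, including nested content, to the writer.
    private static void copyElement(XMLEventReader reader, StartElement start, XMLEventWriter writer)
            throws XMLStreamException {
        writer.add(start);
        int depth = 1;
        while (depth > 0) {
            XMLEvent event = reader.nextEvent();
            writer.add(event);
            if (event.isStartElement()) depth++;
            else if (event.isEndElement()) depth--;
        }
    }

    // Generates end events for every open parent, innermost first, then closes the part.
    private static String finish(XMLEventWriter writer, StringWriter sink, Deque<StartElement> parents)
            throws XMLStreamException {
        for (StartElement parent : parents) {
            writer.add(EVENTS.createEndElement(parent.getName().getPrefix(),
                    parent.getName().getNamespaceURI(), parent.getName().getLocalPart()));
        }
        writer.flush();
        writer.close();
        return sink.toString();
    }
}
```

For real files, each part would go to its own FileWriter instead of a StringWriter; the event handling stays the same. Whitespace between students and the XML declaration are deliberately not copied in this sketch.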

Save yourself trouble and time: keep the XML file structure you currently have and create POJOs that represent each object you've described: Student, School and Class. Then use JAXB to bind the objects to the corresponding parts of the structure. You can then unmarshal the XML and access the various elements as if you were dealing with SQL objects.
Use this link as a starting point: XML parsing with JAXB
One issue with doing it this way is memory consumption. For design flexibility and memory management, I would suggest using SQL to handle this.

Tagsoup fails to parse html document from a StringReader ( java )

I have this function:
private Node getDOM(String str) throws SearchEngineException {
DOMResult result = new DOMResult();
try {
XMLReader reader = new Parser();
reader.setFeature(Parser.namespacesFeature, false);
reader.setFeature(Parser.namespacePrefixesFeature, false);
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.transform(new SAXSource(reader,new InputSource(new StringReader(str))), result);
} catch (Exception ex) {
throw new SearchEngineException("NukatSearchEngine.getDom: " + ex.getMessage());
}
return result.getNode();
}
It takes a String that contains the HTML document sent by the HTTP server after a POST request, but fails to parse it properly: I only get about four nodes from the entire document. The string itself looks fine. If I print it out and copy-paste it into a text document, I see the page I expected.
When I use an overloaded version of the above method:
private Node getDOM(URL url) throws SearchEngineException {
DOMResult result = new DOMResult();
try {
XMLReader reader = new Parser();
reader.setFeature(Parser.namespacesFeature, false);
reader.setFeature(Parser.namespacePrefixesFeature, false);
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.transform(new SAXSource(reader, new InputSource(url.openStream())), result);
} catch (Exception ex) {
throw new SearchEngineException("NukatSearchEngine.getDom: " + ex.getMessage());
}
return result.getNode();
}
then everything works just fine - I get a proper DOM tree, but I need to somehow retrieve the POST answer from server.
Storing the string in a file and reading it back does not work - still getting the same results.
What could be the problem?

Is it maybe a problem with the xml encoding?

This seems like an encoding problem. In the code example of yours that doesn't work you're passing the url as a string into the constructor, which uses it as the systemId, and you get problems with Tagsoup parsing the html. In the example that works you're passing the stream in to the InputSource constructor. The difference is that when you pass in the stream then the SAX implementation can figure out the encoding from the stream.
If you want to test this you could try these steps:
Stream the html you're parsing through a java.io.InputStreamReader and call getEncoding on it to see what encoding it detects.
In your first example code, call setEncoding on the InputSource passing in the encoding that the inputStreamReader reported.
See if the first example, changed to explicitly set the encoding, parses the html correctly.
There's a discussion of this toward the end of an article on using the SAX InputSource.

To get a POST response you first need to make a POST request. new InputSource(url.openStream()) probably opens a connection and reads the response to a GET request. Check out Sending a POST Request Using a URL.
Other possibilities that might be interesting to check out for doing POST requests and getting the response:
Jersey Web Client
HtmlUnit

StAX XML formatting in Java

Is it possible using StAX (specifically Woodstox) to format the output XML with newlines and tabs, i.e. in the form:
<element1>
    <element2>
        someData
    </element2>
</element1>
instead of:
<element1><element2>someData</element2></element1>
If this is not possible in Woodstox, are there any other lightweight libraries that can do this?

There is com.sun.xml.txw2.output.IndentingXMLStreamWriter
XMLOutputFactory xmlof = XMLOutputFactory.newInstance();
XMLStreamWriter writer = new IndentingXMLStreamWriter(xmlof.createXMLStreamWriter(out));

Using the JDK Transformer:
public String transform(String xml) throws XMLStreamException, TransformerException
{
Transformer t = TransformerFactory.newInstance().newTransformer();
t.setOutputProperty(OutputKeys.INDENT, "yes");
t.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");
Writer out = new StringWriter();
t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
return out.toString();
}

Via the JDK: transformer.setOutputProperty(OutputKeys.INDENT, "yes");.

If you're using the StAX cursor API, you can indent the output by wrapping the XMLStreamWriter in an indenting proxy. I tried this in my own project and it worked nicely.
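One way such an indenting proxy might be built with java.lang.reflect.Proxy is sketched below. This is not the code from the project mentioned above; the class name IndentingProxySketch is made up, and the indentation rule (a line break before each child element, and before an end tag only when the element had child elements) is one possible choice. Text-only elements stay on a single line.

```java
import java.io.StringWriter;
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper class, not part of any library.
class IndentingProxySketch implements InvocationHandler {

    private final XMLStreamWriter target;
    // One entry per open element: has it contained a child element so far?
    private final Deque<Boolean> hasChildElement = new ArrayDeque<>();

    private IndentingProxySketch(XMLStreamWriter target) {
        this.target = target;
    }

    static XMLStreamWriter wrap(XMLStreamWriter target) {
        return (XMLStreamWriter) Proxy.newProxyInstance(
                XMLStreamWriter.class.getClassLoader(),
                new Class<?>[] { XMLStreamWriter.class },
                new IndentingProxySketch(target));
    }

    @Override
    public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
        String name = method.getName();
        if (name.equals("writeStartElement") || name.equals("writeEmptyElement")) {
            if (!hasChildElement.isEmpty()) {
                hasChildElement.pop();
                hasChildElement.push(Boolean.TRUE); // parent now has a child element
                breakLine(hasChildElement.size());
            }
            if (name.equals("writeStartElement")) {
                hasChildElement.push(Boolean.FALSE);
            }
        } else if (name.equals("writeEndElement") && !hasChildElement.isEmpty()) {
            if (hasChildElement.pop()) {
                breakLine(hasChildElement.size()); // end tag goes on its own line
            }
        }
        try {
            return method.invoke(target, args); // delegate the actual write
        } catch (InvocationTargetException e) {
            throw e.getCause();
        }
    }

    private void breakLine(int depth) throws XMLStreamException {
        StringBuilder sb = new StringBuilder("\n");
        for (int i = 0; i < depth; i++) {
            sb.append("  "); // two spaces per level
        }
        target.writeCharacters(sb.toString());
    }

    static String demo() throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter w = wrap(XMLOutputFactory.newInstance().createXMLStreamWriter(out));
        w.writeStartElement("element1");
        w.writeStartElement("element2");
        w.writeCharacters("someData");
        w.writeEndElement();
        w.writeEndElement();
        w.flush();
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(demo());
    }
}
```

Because the proxy only inserts whitespace through writeCharacters(), it works with any XMLStreamWriter implementation, at the cost of adding whitespace text nodes to the document.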

Rather than relying on a com.sun... class that might go away (or get renamed to a com.oracle... class), I recommend downloading the StAX utility classes from java.net. This package contains an IndentingXMLStreamWriter class that works nicely. (Source and javadoc are included in the download.)

How about StaxMate:
http://www.cowtowncoder.com/blog/archives/2006/09/entry_21.html
Works well with Woodstox, fast, low-memory usage (no in-memory tree built), and indents like so:
SMOutputFactory sf = new SMOutputFactory(XMLOutputFactory.newInstance());
SMOutputDocument doc = sf.createOutputDocument(new FileOutputStream("output.xml"));
doc.setIndentation("\n ", 1, 2); // for unix linefeed, 2 spaces per level
// write doc like:
SMOutputElement root = doc.addElement("element1");
root.addElement("element2").addCharacters("someData");
doc.closeRoot(); // important, flushes, closes output

If you're using the iterating method (XMLEventReader), can't you just attach a new line '\n' character to the relevant XMLEvents when writing to your XML file?

Not sure about StAX, but there was a recent discussion about pretty-printing XML here: "pretty print xml from java".
My attempt at a solution, posted under "How to pretty print XML from Java?", uses the org.dom4j.io.OutputFormat.createPrettyPrint() method.

If you are using XMLEventWriter, then an easier way to do that is:
XMLOutputFactory outputFactory = XMLOutputFactory.newInstance();
XMLEventWriter writer = outputFactory.createXMLEventWriter(w);
XMLEventFactory eventFactory = XMLEventFactory.newInstance();
Characters newLine = eventFactory.createCharacters("\n");
writer.add(startRoot);
writer.add(newLine);

With Spring Batch, this requires a subclass since JIRA issue BATCH-1867:
public class IndentingStaxEventItemWriter<T> extends StaxEventItemWriter<T> {
    @Setter
    @Getter
    private boolean indenting = true;

    @Override
    protected XMLEventWriter createXmlEventWriter(XMLOutputFactory outputFactory, Writer writer) throws XMLStreamException {
        if (isIndenting()) {
            return new IndentingXMLEventWriter(super.createXmlEventWriter(outputFactory, writer));
        }
        else {
            return super.createXmlEventWriter(outputFactory, writer);
        }
    }
}
But this requires an additional dependency, because Spring Batch does not include the code to indent the StAX output:
<dependency>
<groupId>net.java.dev.stax-utils</groupId>
<artifactId>stax-utils</artifactId>
<version>20070216</version>
</dependency>
