JDOM XPath Getting Inner Element without Namespace - java

I have an xml like this:
<root
xmlns:gl-bus="http://www.xbrl.org/int/gl/bus/2006-10-25"
xmlns:gl-cor="http://www.xbrl.org/int/gl/cor/2006-10-25" >
<gl-cor:entityInformation>
<gl-bus:accountantInformation>
...............
</gl-bus:accountantInformation>
</gl-cor:entityInformation>
</root>
All I want is to extract the element "gl-cor:entityInformation" from the root together with its child elements. However, I do not want the namespace declarations to come with it.
The code is like this:
XPathExpression<Element> xpath = XPathFactory.instance().compile("gl-cor:entityInformation", Filters.element(), null, NAMESPACES);
Element innerElement = xpath.evaluateFirst(xmlDoc.getRootElement());
The problem is that the inner element holds the namespace declarations now. Sample output:
<gl-cor:entityInformation xmlns:gl-cor="http://www.xbrl.org/int/gl/cor/2006-10-25">
<gl-bus:accountantInformation xmlns:gl-bus="http://www.xbrl.org/int/gl/bus/2006-10-25">
</gl-bus:accountantInformation>
</gl-cor:entityInformation>
This is how I get xml as string:
public static String toString(Element element) {
Format format = Format.getPrettyFormat();
format.setTextMode(Format.TextMode.NORMALIZE);
format.setEncoding("UTF-8");
XMLOutputter xmlOut = new XMLOutputter();
xmlOut.setFormat(format);
return xmlOut.outputString(element);
}
As you see the namespace declarations are passed into the inner elements. Is there a way to get rid of these declarations without losing the prefixes?
I want this because later on I will be merging these inner elements into another parent element, and this parent element already has those namespace declarations.

JDOM by design insists that the in-memory model of the XML is well structured at all times. The behaviour you are seeing is exactly what I would expect from JDOM and I consider it to be "right". JDOM's XMLOutputter also outputs well structured and internally consistent XML and XML fragments.
Changing the behaviour of the internal in-memory model is not an option with JDOM, but customizing the XMLOutputter to change its behaviour is relatively easy. The XMLOutputter is structured to have an "engine" supplied as a constructor argument: XMLOutputter(XMLOutputProcessor). In addition, JDOM supplies an easy-to-customize default XMLOutputProcessor called AbstractXMLOutputProcessor.
You can get the behaviour you want by doing the following:
private static final XMLOutputProcessor noNamespaces = new AbstractXMLOutputProcessor() {
@Override
protected void printNamespace(final Writer out, final FormatStack fstack,
final Namespace ns) throws IOException {
// do nothing with printing Namespaces....
}
};
Now, when you create your XMLOutputter to print your XML element fragment, you can do the following:
public static String toString(Element element) {
Format format = Format.getPrettyFormat();
format.setTextMode(Format.TextMode.NORMALIZE);
format.setEncoding("UTF-8");
XMLOutputter xmlOut = new XMLOutputter(noNamespaces);
xmlOut.setFormat(format);
return xmlOut.outputString(element);
}
Here's a full program working with your input XML:
import java.io.IOException;
import java.io.Writer;
import org.jdom2.Document;
import org.jdom2.Element;
import org.jdom2.JDOMException;
import org.jdom2.Namespace;
import org.jdom2.filter.Filters;
import org.jdom2.input.SAXBuilder;
import org.jdom2.output.Format;
import org.jdom2.output.XMLOutputter;
import org.jdom2.output.support.AbstractXMLOutputProcessor;
import org.jdom2.output.support.FormatStack;
import org.jdom2.output.support.XMLOutputProcessor;
import org.jdom2.xpath.XPathExpression;
import org.jdom2.xpath.XPathFactory;
public class JDOMEray {
public static void main(String[] args) throws JDOMException, IOException {
Document eray = new SAXBuilder().build("eray.xml");
Namespace[] NAMESPACES = {Namespace.getNamespace("gl-cor", "http://www.xbrl.org/int/gl/cor/2006-10-25")};
XPathExpression<Element> xpath = XPathFactory.instance().compile("gl-cor:entityInformation", Filters.element(), null, NAMESPACES);
Element innerElement = xpath.evaluateFirst(eray.getRootElement());
System.out.println(toString(innerElement));
}
private static final XMLOutputProcessor noNamespaces = new AbstractXMLOutputProcessor() {
@Override
protected void printNamespace(final Writer out, final FormatStack fstack,
final Namespace ns) throws IOException {
// do nothing with printing Namespaces....
}
};
public static String toString(Element element) {
Format format = Format.getPrettyFormat();
format.setTextMode(Format.TextMode.NORMALIZE);
format.setEncoding("UTF-8");
XMLOutputter xmlOut = new XMLOutputter(noNamespaces);
xmlOut.setFormat(format);
return xmlOut.outputString(element);
}
}
For me the above program outputs:
<gl-cor:entityInformation>
<gl-bus:accountantInformation>...............</gl-bus:accountantInformation>
</gl-cor:entityInformation>

Related

Read XML in JAVA

I would like to read my XML in JAVA
<?xml version="1.0" encoding="UTF-8"?>
<myapp version="1.0">
<photo_information>
<date>2016/08/20</date>
<time>17:21:59</time>
<user_data></user_data>
<prints>1</prints>
<photos>
<photo image="1">IMG_0001.JPG</photo>
<photo image="2">IMG_0002.JPG</photo>
<photo image="3">IMG_0003.JPG</photo>
<photo image="4">IMG_0004.JPG</photo>
<output>prints\160820_172159.jpg</output>
</photos>
</photo_information>
</myapp>
I need the following infos:
prints
All images (IMG_0001.JPG, IMG_0002.JPG, IMG_0003.JPG, IMG_0004.JPG)
Output (prints\160820_172159.jpg)
I tried it with this code, but it's not working:
package my.app.test;
import java.io.File;
import java.io.IOException;
import java.util.List;
import org.jdom.Document;
import org.jdom.Element;
import org.jdom.JDOMException;
import org.jdom.input.SAXBuilder;
import org.jdom.output.XMLOutputter;
public class TestXML {
public static void main(String[] args) {
Document doc = null;
String filePath = "/myPath/IMG_0001.xml";
File f = new File(filePath);
try {
SAXBuilder builder = new SAXBuilder();
doc = builder.build(f);
XMLOutputter fmt = new XMLOutputter();
fmt.output(doc, System.out);
Element element = doc.getRootElement();
System.out.println("\nWurzelelement: " + element);
System.out.println("Wurzelelementname: " + element.getName());
List alleKinder = (List) element.getChildren();
System.out.println("Erstes Kindelement: "
+ ((Element) alleKinder.get(0)).getName());
List benannteKinder = element.getChildren("photos");
System.out.println("benanntes Kindelement: "
+ ((Element) benannteKinder.get(0)).getName());
Element kind = element.getChild("bw_mode");
System.out.println("Photo: " + kind.getValue());
Element kind2 = element.getChild("photo");
System.out.println("Photo: " + kind2.getAttributeValue("name"));
} catch (JDOMException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}
}
The best way is to create a schema and use JAXB to unmarshal the incoming XML into Java objects, then read them like POJOs.
If you don't know how to create a schema, you can use an online tool to help: http://xmlgrid.net/xml2xsd.html
You will then have to generate the Java classes using Ant.
Yes, that is a hell of a lot of work for such a simple problem, but this is how XML should be parsed in Java.
Remember: XML is like a war. If it isn't helping you, then most probably you are not using it enough.
As an alternative to your approach, you could take a look at the XStream library.
This library enables you to serialize and deserialize objects to xml code.
Your first step is to model a class that contains all fields of your photo information. Normally you would call it PhotoInformation:
class PhotoInformation {
LocalDate date;
LocalTime time;
UserData userData;
int prints;
Photos photos;
}
In addition you need to create a few other classes: UserData and Photos.
In the next step you need to set up the parser of xstream to fill your objects with the content from the xml file.
For that you'll find a tutorial here or here if you like annotations.
Using JAXB would be a better option. To parse this file you have to create classes that mirror the XML.
//Add this to your pom in plugins
<groupId>org.codehaus.mojo</groupId>
<artifactId>jaxb2-maven-plugin</artifactId>
<version>1.6</version>
Then create the following classes:
@XmlRootElement(name = "myapp")
class MyApp {
    PhotoInformation photoInformation;

    @XmlElement(name = "photo_information")
    public PhotoInformation getPhotoInformation() {
        return photoInformation;
    }

    public void setPhotoInformation(PhotoInformation photoInformation) {
        this.photoInformation = photoInformation;
    }
}

@XmlRootElement(name = "photo_information")
class PhotoInformation {
    LocalDate date;
    LocalTime time;
    UserData userData;
    int prints;
    Photos[] photos;
    // add getters and setters for the above fields with @XmlElement annotations and the correct tag names
}

@XmlRootElement(name = "photo")
class Photo {
    String photo;
    // add a getter and setter for the above field with @XmlElement annotations and the correct tag name
}
To unmarshal (parse) the file:
JAXBContext jaxbContext = JAXBContext.newInstance(MyApp.class);
Unmarshaller jaxbUnmarshaller = jaxbContext.createUnmarshaller();
File xmlFile = new File(folderPath);
MyApp myapp = (MyApp) jaxbUnmarshaller.unmarshal(xmlFile);
Through the myapp object you can access the tags mapped in the MyApp class, then go on to photo_information and from there to the photos.
I hope this helps
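For comparison, the three values the question asks for (prints, the photo file names, and the output path) can also be read with the DOM and XPath APIs that ship with the JDK, without generating any classes. This is only a minimal sketch: the class name PhotoInfoReader and its helper methods are made up for illustration, and the XPath expressions assume the element layout shown in the question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper for illustration; not taken from any answer above. */
final class PhotoInfoReader {

    /** Parses an XML string into a DOM document. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    /** Returns the text of the first node matched by the expression ("" if none). */
    static String text(Document doc, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException("Bad XPath: " + expression, e);
        }
    }

    /** Returns the text of every node matched by the expression, in document order. */
    static List<String> texts(Document doc, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Bad XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Shortened copy of the XML from the question.
        String xml = "<myapp version=\"1.0\"><photo_information><prints>1</prints><photos>"
                + "<photo image=\"1\">IMG_0001.JPG</photo><photo image=\"2\">IMG_0002.JPG</photo>"
                + "<output>prints\\160820_172159.jpg</output></photos></photo_information></myapp>";
        Document doc = parse(xml);
        System.out.println(text(doc, "/myapp/photo_information/prints"));
        System.out.println(texts(doc, "/myapp/photo_information/photos/photo"));
        System.out.println(text(doc, "/myapp/photo_information/photos/output"));
    }
}
```

To read from the file in the question instead of a string, the parse step would take a File, as in DocumentBuilder.parse(File).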

Insert XML document to a specific node on another XML document (java)

I have an XML1:
<letterContent>
<key1>key1</key1>
<key2>key2</key2>
<type>456</type>
<object1>789</object1>
<effectiveDate>00</effectiveDate>
<expandedData />
</letterContent>
... and XML 2:
<expandedData>
<rsnForReg>
<legacyTIN>
<asCurrent>leg123</asCurrent>
</legacyTIN>
<etpmTIN>
<asCurrent>etpm123</asCurrent>
</etpmTIN>
<regType>
<asCurrent/>
</regType>
</rsnForReg>
</expandedData>
I want to insert XML 2 in XML 1 document on the expandedData node using JAVA.
The final XML1 should look like:
<letterContent>
<key1>key1</key1>
<key2>key2</key2>
<type>456</type>
<object1>789</object1>
<effectiveDate>00</effectiveDate>
<expandedData>
<rsnForReg>
<legacyTIN>
<asCurrent>leg123</asCurrent>
</legacyTIN>
<etpmTIN>
<asCurrent>etpm123</asCurrent>
</etpmTIN>
<regType>
<asCurrent/>
</regType>
</rsnForReg>
</expandedData>
</letterContent>
XML2 is inserted at XML1's expandedData node. Any ideas? I know I need to build a recursive function to loop through XML 2, but I am not sure how to implement it in Java.
Consider using XPath:
import static javax.xml.xpath.XPathConstants.*;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.*;
import org.w3c.dom.*;
import org.xml.sax.InputSource;
public class Xml2into1 {
public static void main(String[] args) throws Exception {
// read from files
InputSource xml1 = new InputSource("xml1.xml");
InputSource xml2 = new InputSource("xml2.xml");
// find the node to add to
XPath xpath = XPathFactory.newInstance()
.newXPath();
Node expandedData1 = (Node) xpath.evaluate("//expandedData", xml1, NODE);
Document doc1 = expandedData1.getOwnerDocument();
// insert the nodes
Node expandedData2 = (Node) xpath.evaluate("//expandedData", xml2, NODE);
expandedData1.getParentNode()
.replaceChild(doc1.adoptNode(expandedData2), expandedData1);
// print results
TransformerFactory.newInstance()
.newTransformer()
.transform(new DOMSource(doc1), new StreamResult(System.out));
}
}
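No hand-written recursion is needed with the W3C DOM: Document.importNode(node, true) makes a deep copy of a subtree that belongs to the target document, while adoptNode (used above) moves the original node instead. The following is a minimal sketch of the importNode variant that works on strings so it is easy to try out; the class name XmlMerge and the method replaceElement are hypothetical names chosen for this example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Hypothetical helper for illustration; not taken from the answer above. */
final class XmlMerge {

    /**
     * Replaces the first element called tagName in targetXml with a deep copy
     * of the root element of replacementXml and returns the merged XML.
     */
    static String replaceElement(String targetXml, String replacementXml, String tagName) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document target = builder.parse(new InputSource(new StringReader(targetXml)));
            Document replacement = builder.parse(new InputSource(new StringReader(replacementXml)));

            Node placeholder = target.getElementsByTagName(tagName).item(0);
            if (placeholder == null) {
                throw new IllegalArgumentException("No <" + tagName + "> element in target");
            }
            // Deep copy (true) of the whole subtree, owned by the target document.
            Node copy = target.importNode(replacement.getDocumentElement(), true);
            placeholder.getParentNode().replaceChild(copy, placeholder);

            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(target), new StreamResult(out));
            return out.toString();
        } catch (RuntimeException e) {
            throw e;
        } catch (Exception e) {
            throw new IllegalStateException("Could not merge documents", e);
        }
    }

    public static void main(String[] args) {
        String xml1 = "<letterContent><key1>key1</key1><expandedData/></letterContent>";
        String xml2 = "<expandedData><rsnForReg><asCurrent>leg123</asCurrent></rsnForReg></expandedData>";
        System.out.println(replaceElement(xml1, xml2, "expandedData"));
    }
}
```

Because importNode copies rather than moves, the second document stays intact and can be inserted into several targets.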

How to generate CDATA block using JAXB?

I am using JAXB to serialize my data to XML. The class code is simple as given below. I want to produce XML that contains CDATA blocks for the value of some Args. For example, current code produces this XML:
<command>
<args>
<arg name="test_id">1234</arg>
<arg name="source">&lt;html&gt;EMAIL&lt;/html&gt;</arg>
</args>
</command>
I want to wrap the "source" arg in CDATA such that it looks like below:
<command>
<args>
<arg name="test_id">1234</arg>
<arg name="source"><![CDATA[<html>EMAIL</html>]]></arg>
</args>
</command>
How can I achieve this in the below code?
@XmlRootElement(name="command")
public class Command {
@XmlElementWrapper(name="args")
protected List<Arg> arg;
}
@XmlRootElement(name="arg")
public class Arg {
@XmlAttribute
public String name;
@XmlValue
public String value;
public Arg() {}
static Arg make(final String name, final String value) {
Arg a = new Arg();
a.name = name;
a.value = value;
return a;
}
}
Note: I'm the EclipseLink JAXB (MOXy) lead and a member of the JAXB (JSR-222) expert group.
If you are using MOXy as your JAXB provider then you can leverage the @XmlCDATA extension:
package blog.cdata;
import javax.xml.bind.annotation.XmlRootElement;
import org.eclipse.persistence.oxm.annotations.XmlCDATA;
@XmlRootElement(name="c")
public class Customer {
private String bio;
@XmlCDATA
public void setBio(String bio) {
this.bio = bio;
}
public String getBio() {
return bio;
}
}
For More Information
http://bdoughan.blogspot.com/2010/07/cdata-cdata-run-run-data-run.html
http://blog.bdoughan.com/2011/05/specifying-eclipselink-moxy-as-your.html
Use JAXB's Marshaller#marshal(ContentHandler) to marshal into a ContentHandler object. Simply override the characters method on the ContentHandler implementation you are using (e.g. JDOM's SAXHandler, Apache's XMLSerializer, etc):
public class CDataContentHandler extends (SAXHandler|XMLSerializer|Other...) {
// see http://www.w3.org/TR/xml/#syntax
private static final Pattern XML_CHARS = Pattern.compile("[<>&]");
public void characters(char[] ch, int start, int length) throws SAXException {
boolean useCData = XML_CHARS.matcher(new String(ch,start,length)).find();
if (useCData) super.startCDATA();
super.characters(ch, start, length);
if (useCData) super.endCDATA();
}
}
This is much better than using the XMLSerializer.setCDataElements(...) method because you don't have to hardcode any list of elements. It automatically outputs CDATA blocks only when one is required.
Solution Review:
The answer of fred is just a workaround which will fail while validating the content when the Marshaller is linked to a Schema because you modify only the string literal and do not create CDATA sections. So if you only rewrite the String from foo to <![CDATA[foo]]> the length of the string is recognized by Xerces with 15 instead of 3.
The MOXy solution is implementation-specific and does not work with the JDK classes alone.
The getSerializer solution refers to the deprecated XMLSerializer class.
The LSSerializer solution is just a pain.
I modified the solution of a2ndrade by using a XMLStreamWriter implementation. This solution works very well.
XMLOutputFactory xof = XMLOutputFactory.newInstance();
XMLStreamWriter streamWriter = xof.createXMLStreamWriter( System.out );
CDataXMLStreamWriter cdataStreamWriter = new CDataXMLStreamWriter( streamWriter );
marshaller.marshal( jaxbElement, cdataStreamWriter );
cdataStreamWriter.flush();
cdataStreamWriter.close();
Here is the CDataXMLStreamWriter implementation. The delegate class simply delegates all method calls to the given XMLStreamWriter implementation.
import java.util.regex.Pattern;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;
/**
* Implementation which is able to decide to use a CDATA section for a string.
*/
public class CDataXMLStreamWriter extends DelegatingXMLStreamWriter
{
private static final Pattern XML_CHARS = Pattern.compile( "[&<>]" );
public CDataXMLStreamWriter( XMLStreamWriter del )
{
super( del );
}
@Override
public void writeCharacters( String text ) throws XMLStreamException
{
boolean useCData = XML_CHARS.matcher( text ).find();
if( useCData )
{
super.writeCData( text );
}
else
{
super.writeCharacters( text );
}
}
}
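The DelegatingXMLStreamWriter base class is not shown in the answer above. One way to avoid writing every pass-through method by hand is a java.lang.reflect.Proxy that forwards all XMLStreamWriter calls and intercepts only writeCharacters(String). This is a different technique from the subclass above, sketched here under the assumption that only the String overload needs the CDATA treatment; the class name CDataProxy and its methods are hypothetical.

```java
import java.io.StringWriter;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.regex.Pattern;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

/** Hypothetical proxy-based alternative to a hand-written delegating writer. */
final class CDataProxy {

    private static final Pattern XML_CHARS = Pattern.compile("[&<>]");

    /** Wraps a writer so that text containing &, < or > is written as a CDATA section. */
    static XMLStreamWriter wrap(XMLStreamWriter delegate) {
        return (XMLStreamWriter) Proxy.newProxyInstance(
                CDataProxy.class.getClassLoader(),
                new Class<?>[] {XMLStreamWriter.class},
                (proxy, method, args) -> {
                    try {
                        // Only the single-argument writeCharacters(String) is intercepted.
                        boolean isText = method.getName().equals("writeCharacters")
                                && args != null && args.length == 1 && args[0] != null;
                        if (isText && XML_CHARS.matcher((String) args[0]).find()) {
                            delegate.writeCData((String) args[0]);
                            return null;
                        }
                        // Everything else is passed through unchanged.
                        return method.invoke(delegate, args);
                    } catch (InvocationTargetException e) {
                        throw e.getCause();
                    }
                });
    }

    /** Writes a single arg element containing the given text and returns the XML. */
    static String writeArg(String text) {
        try {
            StringWriter out = new StringWriter();
            XMLStreamWriter writer = wrap(XMLOutputFactory.newInstance().createXMLStreamWriter(out));
            writer.writeStartElement("arg");
            writer.writeCharacters(text);
            writer.writeEndElement();
            writer.flush();
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not write XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(writeArg("1234"));
        System.out.println(writeArg("<html>EMAIL</html>"));
    }
}
```

The wrapped writer can be handed to marshaller.marshal(...) in the same way as the CDataXMLStreamWriter above. Like that class, this sketch does not handle text that itself contains "]]>", which cannot appear inside a single CDATA section.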
Here is the code sample referenced by the site mentioned above:
import java.io.File;
import java.io.StringWriter;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Marshaller;
import javax.xml.bind.Unmarshaller;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.apache.xml.serialize.OutputFormat;
import org.apache.xml.serialize.XMLSerializer;
import org.w3c.dom.Document;
public class JaxbCDATASample {
public static void main(String[] args) throws Exception {
// unmarshal a doc
JAXBContext jc = JAXBContext.newInstance("...");
Unmarshaller u = jc.createUnmarshaller();
Object o = u.unmarshal(...);
// create a JAXB marshaller
Marshaller m = jc.createMarshaller();
// get an Apache XMLSerializer configured to generate CDATA
XMLSerializer serializer = getXMLSerializer();
// marshal using the Apache XMLSerializer
m.marshal(o, serializer.asContentHandler());
}
private static XMLSerializer getXMLSerializer() {
// configure an OutputFormat to handle CDATA
OutputFormat of = new OutputFormat();
// specify which of your elements you want to be handled as CDATA.
// The use of the '^' between the namespaceURI and the localname
// seems to be an implementation detail of the xerces code.
// When processing xml that doesn't use namespaces, simply omit the
// namespace prefix as shown in the third CDataElement below.
of.setCDataElements(
new String[] { "ns1^foo", // <ns1:foo>
"ns2^bar", // <ns2:bar>
"^baz" }); // <baz>
// set any other options you'd like
of.setPreserveSpace(true);
of.setIndenting(true);
// create the serializer
XMLSerializer serializer = new XMLSerializer(of);
serializer.setOutputByteStream(System.out);
return serializer;
}
}
For the same reasons as Michael Ernst I wasn't that happy with most of the answers here. I could not use his solution as my requirement was to put CDATA tags in a defined set of fields - as in raiglstorfer's OutputFormat solution.
My solution is to marshal to a DOM document, and then do a null XSL transform to do the output. Transformers allow you to set which elements are wrapped in CDATA tags.
Document document = ...
jaxbMarshaller.marshal(jaxbObject, document);
Transformer nullTransformer = TransformerFactory.newInstance().newTransformer();
nullTransformer.setOutputProperty(OutputKeys.INDENT, "yes");
nullTransformer.setOutputProperty(OutputKeys.CDATA_SECTION_ELEMENTS, "myElement {myNamespace}myOtherElement");
nullTransformer.transform(new DOMSource(document), new StreamResult(writer/stream));
Further info here: http://javacoalface.blogspot.co.uk/2012/09/outputting-cdata-sections-with-jaxb.html
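To try the transformer step in isolation, the sketch below parses a small XML string into a DOM document (standing in for the document that the JAXB marshaller would fill) and serializes it with CDATA_SECTION_ELEMENTS set. The class name CDataTransform and the sample element names are assumptions made for this example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Hypothetical demo of the null transform; the DOM here is parsed instead of marshalled. */
final class CDataTransform {

    /**
     * Serializes the XML again, wrapping the text of the listed elements
     * (a whitespace-separated list of element names) in CDATA sections.
     */
    static String serialize(String xml, String cdataElements) {
        try {
            Document document = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Transformer nullTransformer = TransformerFactory.newInstance().newTransformer();
            nullTransformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            nullTransformer.setOutputProperty(OutputKeys.CDATA_SECTION_ELEMENTS, cdataElements);
            StringWriter out = new StringWriter();
            nullTransformer.transform(new DOMSource(document), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<command><id>x&amp;y</id><arg>a&lt;b</arg></command>";
        // Only <arg> is listed, so <id> keeps its normal entity escaping.
        System.out.println(serialize(xml, "arg"));
    }
}
```

Elements that are not listed keep their normal escaping, which is what makes this approach suitable when CDATA is wanted only for a defined set of fields.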
The following simple method adds CDATA support to JAXB, which does not support CDATA natively:
Declare a custom simple type CDataString extending string to identify the fields that should be handled via CDATA.
Create a custom CDataAdapter that parses and prints the content of a CDataString.
Use JAXB bindings to link CDataString and your CDataAdapter. The CDataAdapter will add/remove <![CDATA[ and ]]> to/from CDataStrings at marshal/unmarshal time.
Declare a custom character escape handler that does not escape characters when printing CDATA strings, and set it as the Marshaller's CharacterEscapeHandler.
Et voila, any CDataString element will be encapsulated with <![CDATA[ and ]]> at marshal time. At unmarshal time, the <![CDATA[ and ]]> will automatically be removed.
A supplement to @a2ndrade's answer.
I found a class to extend in JDK 8. Note, however, that the class is in a com.sun package. You may want to keep a copy of its code in case the class is removed in a future JDK.
public class CDataContentHandler extends com.sun.xml.internal.txw2.output.XMLWriter {
public CDataContentHandler(Writer writer, String encoding) throws IOException {
super(writer, encoding);
}
// see http://www.w3.org/TR/xml/#syntax
private static final Pattern XML_CHARS = Pattern.compile("[<>&]");
public void characters(char[] ch, int start, int length) throws SAXException {
boolean useCData = XML_CHARS.matcher(new String(ch, start, length)).find();
if (useCData) {
super.startCDATA();
}
super.characters(ch, start, length);
if (useCData) {
super.endCDATA();
}
}
}
How to use:
JAXBContext jaxbContext = JAXBContext.newInstance(...class);
Marshaller marshaller = jaxbContext.createMarshaller();
StringWriter sw = new StringWriter();
CDataContentHandler cdataHandler = new CDataContentHandler(sw,"utf-8");
marshaller.marshal(gu, cdataHandler);
System.out.println(sw.toString());
Result example:
<?xml version="1.0" encoding="utf-8"?>
<genericUser>
<password><![CDATA[dskfj>><<]]></password>
<username>UNKNOWN::UNKNOWN</username>
<properties>
<prop2>v2</prop2>
<prop1><![CDATA[v1><]]></prop1>
</properties>
<timestamp/>
<uuid>cb8cbc487ee542ec83e934e7702b9d26</uuid>
</genericUser>
As of Xerces-J 2.9, XMLSerializer has been deprecated. The suggestion is to replace it with the DOM Level 3 LSSerializer or JAXP's Transformation API for XML. Has anyone tried this approach?
Just a word of warning: according to the documentation of javax.xml.transform.Transformer.setOutputProperty(...), you should use the qualified-name syntax when indicating an element from another namespace. According to the JavaDoc (Java 1.6 rt.jar):
"(...) For example, if a URI and local name were obtained from an element defined with <xyz:foo xmlns:xyz="http://xyz.foo.com/yada/baz.html"/>, then the qualified name would be "{http://xyz.foo.com/yada/baz.html}foo". Note that no prefix is used."
Well, this doesn't work: the implementing class from the Java 1.6 rt.jar, com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl, interprets elements belonging to a different namespace correctly only when they are declared as "http://xyz.foo.com/yada/baz.html:foo", because the implementation parses the value by looking for the last colon. So instead of invoking:
transformer.setOutputProperty(OutputKeys.CDATA_SECTION_ELEMENTS, "{http://xyz.foo.com/yada/baz.html}foo")
which should work according to JavaDoc, but ends up being parsed as "http" and "//xyz.foo.com/yada/baz.html", you must invoke
transformer.setOutputProperty(OutputKeys.CDATA_SECTION_ELEMENTS, "http://xyz.foo.com/yada/baz.html:foo")
At least in Java 1.6.
The following code prevents the marshaller from escaping text, so CDATA markers written into a value survive unchanged:
Marshaller marshaller = context.createMarshaller();
marshaller.setProperty(Marshaller.JAXB_ENCODING, "UTF-8");
marshaller.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, true);
StringWriter stringWriter = new StringWriter();
PrintWriter printWriter = new PrintWriter(stringWriter);
DataWriter dataWriter = new DataWriter(printWriter, "UTF-8", new CharacterEscapeHandler() {
@Override
public void escape(char[] buf, int start, int len, boolean b, Writer out) throws IOException {
out.write(buf, start, len);
}
});
marshaller.marshal(data, dataWriter);
System.out.println(stringWriter.toString());
It will also keep UTF-8 as your encoding.

Find the value of a specific attribute in an XML file in java

I need to just read the value of a single attribute inside an XML file using java. The XML would look something like this:
<behavior name="Fred" version="2.0" ....>
and I just need to read out the version. Can someone point in the direction of a resource that would show me how to do this?
You don't need a fancy library -- plain old JAXP versions of DOM and XPath are pretty easy to read and write for this. Whatever you do, don't use a regular expression.
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
public class GetVersion {
public static void main(String[] args) throws Exception {
XPath xpath = XPathFactory.newInstance().newXPath();
Document doc = DocumentBuilderFactory.newInstance()
.newDocumentBuilder().parse("file:////tmp/whatever.xml");
String version = xpath.evaluate("//behavior/@version", doc);
System.out.println(version);
}
}
JAXB for brevity:
private static String readVersion(File file) {
@XmlRootElement class Behavior {
@XmlAttribute String version;
}
return JAXB.unmarshal(file, Behavior.class).version;
}
StAX for efficiency:
private static String readVersionEfficient(File file)
throws XMLStreamException, IOException {
XMLInputFactory inFactory = XMLInputFactory.newInstance();
XMLStreamReader xmlReader = inFactory
.createXMLStreamReader(new StreamSource(file));
try {
while (xmlReader.hasNext()) {
if (xmlReader.next() == XMLStreamConstants.START_ELEMENT) {
if (xmlReader.getLocalName().equals("behavior")) {
return xmlReader.getAttributeValue(null, "version");
} else {
throw new IOException("Invalid file");
}
}
}
throw new IOException("Invalid file");
} finally {
xmlReader.close();
}
}
Here's one.
import javax.xml.parsers.SAXParser;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.SAXException;
import org.xml.sax.Attributes;
import javax.xml.parsers.SAXParserFactory;
/**
* Here is sample of reading attributes of a given XML element.
*/
public class SampleOfReadingAttributes {
/**
* Application entry point
* @param args command-line arguments
*/
public static void main(String[] args) {
try {
// creates and returns new instance of SAX-implementation:
SAXParserFactory factory = SAXParserFactory.newInstance();
// create SAX-parser...
SAXParser parser = factory.newSAXParser();
// .. define our handler:
SaxHandler handler = new SaxHandler();
// and parse:
parser.parse("sample.xml", handler);
} catch (Exception ex) {
ex.printStackTrace(System.out);
}
}
/**
* Our own implementation of SAX handler reading
* a purchase-order data.
*/
private static final class SaxHandler extends DefaultHandler {
// we enter to element 'qName':
public void startElement(String uri, String localName,
String qName, Attributes attrs) throws SAXException {
if (qName.equals("behavior")) {
// get version
String version = attrs.getValue("version");
System.out.println("Version is " + version );
}
}
}
}
As mentioned you can use the SAXParser.
Digester mentioned using regular expressions, which I won't recommend as it would lead to code that is difficult to maintain: What if you add another version attribute in another tag, or another behaviour tag? You can handle it, but it won't be pretty.
You can also use XPath, which is a language for querying xml. That's what I would recommend.
If all you need is to read the version, then you can use regex. But really, I think you need apache digester
Apache Commons Configuration is nice, too. Commons Digester is based on it.

Can JAXB parse large XML files in chunks

I need to parse potentially large XML files, of which the schema is already provided to me in several XSD files, so XML binding is highly favored. I'd like to know if I can use JAXB to parse the file in chunks and if so, how.
Because code matters, here is a PartialUnmarshaller that reads a big file in chunks. It can be used like this: new PartialUnmarshaller<YourClass>(stream, YourClass.class)
import javax.xml.bind.JAXBContext;
import javax.xml.bind.JAXBException;
import javax.xml.bind.Unmarshaller;
import javax.xml.stream.*;
import java.io.InputStream;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import static javax.xml.stream.XMLStreamConstants.*;
public class PartialUnmarshaller<T> {
XMLStreamReader reader;
Class<T> clazz;
Unmarshaller unmarshaller;
public PartialUnmarshaller(InputStream stream, Class<T> clazz) throws XMLStreamException, FactoryConfigurationError, JAXBException {
this.clazz = clazz;
this.unmarshaller = JAXBContext.newInstance(clazz).createUnmarshaller();
this.reader = XMLInputFactory.newInstance().createXMLStreamReader(stream);
/* ignore headers */
skipElements(START_DOCUMENT, DTD);
/* ignore root element */
reader.nextTag();
/* if there's no tag, ignore root element's end */
skipElements(END_ELEMENT);
}
public T next() throws XMLStreamException, JAXBException {
if (!hasNext())
throw new NoSuchElementException();
T value = unmarshaller.unmarshal(reader, clazz).getValue();
skipElements(CHARACTERS, END_ELEMENT);
return value;
}
public boolean hasNext() throws XMLStreamException {
return reader.hasNext();
}
public void close() throws XMLStreamException {
reader.close();
}
void skipElements(int... elements) throws XMLStreamException {
int eventType = reader.getEventType();
List<Integer> types = IntStream.of(elements).boxed().collect(Collectors.toList());
while (types.contains(eventType))
eventType = reader.next();
}
}
This is detailed in the user guide. The JAXB download from http://jaxb.java.net/ includes an example of how to parse one chunk at a time.
When a document is large, it's usually because there's repetitive parts in it. Perhaps it's a purchase order with a large list of line items, or perhaps it's an XML log file with large number of log entries.
This kind of XML is suitable for chunk-processing; the main idea is to use the StAX API, run a loop, and unmarshal individual chunks separately. Your program acts on a single chunk, and then throws it away. In this way, you'll be only keeping at most one chunk in memory, which allows you to process large documents.
See the streaming-unmarshalling example and the partial-unmarshalling example in the JAXB RI distribution for more about how to do this. The streaming-unmarshalling example has an advantage that it can handle chunks at arbitrary nest level, yet it requires you to deal with the push model --- JAXB unmarshaller will "push" new chunk to you and you'll need to process them right there.
In contrast, the partial-unmarshalling example works in a pull model (which usually makes the processing easier), but this approach has some limitations in databinding portions other than the repeated part.
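The pull-model loop described in this quote can be sketched with JDK classes only. JAXB is not bundled with every JDK, so in this sketch each chunk is copied into a small DOM document with an identity Transformer at the point where real code would call unmarshaller.unmarshal(reader, YourClass.class). The class name ChunkLoop and the element names are assumptions made for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stax.StAXSource;
import org.w3c.dom.Document;

/** Hypothetical sketch of a StAX chunk loop; a DOM copy stands in for JAXB unmarshalling. */
final class ChunkLoop {

    /** Returns the text content of every element called chunkName, one chunk at a time. */
    static List<String> readChunks(String xml, String chunkName) {
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            Transformer copier = TransformerFactory.newInstance().newTransformer();
            List<String> chunks = new ArrayList<>();
            while (reader.hasNext()) {
                boolean atChunk = reader.getEventType() == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals(chunkName);
                if (atChunk) {
                    // Consumes exactly one chunk; only this chunk is held in memory.
                    DOMResult chunk = new DOMResult();
                    copier.transform(new StAXSource(reader), chunk);
                    Document chunkDoc = (Document) chunk.getNode();
                    chunks.add(chunkDoc.getDocumentElement().getTextContent());
                    // The reader has already moved on, so do not advance it here.
                } else {
                    reader.next();
                }
            }
            reader.close();
            return chunks;
        } catch (Exception e) {
            throw new IllegalStateException("Could not read chunks", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<log><entry>a</entry><entry>b</entry>\n<entry>c</entry></log>";
        System.out.println(readChunks(xml, "entry"));
    }
}
```

The loop re-checks the current event after each chunk instead of calling next() or nextTag() unconditionally. JAXB's unmarshal(XMLStreamReader, Class) is documented to leave the reader on the token right after the chunk's end event, so when chunks are not separated by whitespace the next chunk's start tag is already the current event.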
Yves Amsellem's answer is pretty good, but only works if all elements are of exactly the same type. Otherwise your unmarshall will throw an exception, but the reader will have already consumed the bytes, so you would be unable to recover. Instead, we should follow Skaffman's advice and look at the sample from the JAXB jar.
To explain how it works:
Create a JAXB unmarshaller.
Add a listener to the unmarshaller for intercepting the appropriate elements. This is done by "hacking" the ArrayList to ensure the elements are not stored in memory after being unmarshalled.
Create a SAX parser. This is where the streaming happens.
Use the unmarshaller to generate a handler for the SAX parser.
Stream!
I modified the solution to be generic*. However, it required some reflection. If this is not OK, please look at the code samples in the JAXB jars.
ArrayListAddInterceptor.java
import java.lang.reflect.Field;
import java.util.ArrayList;
public class ArrayListAddInterceptor<T> extends ArrayList<T> {
private static final long serialVersionUID = 1L;
private AddInterceptor<T> interceptor;
public ArrayListAddInterceptor(AddInterceptor<T> interceptor) {
this.interceptor = interceptor;
}
@Override
public boolean add(T t) {
interceptor.intercept(t);
return false;
}
public static interface AddInterceptor<T> {
public void intercept(T t);
}
public static void apply(AddInterceptor<?> interceptor, Object o, String property) {
try {
Field field = o.getClass().getDeclaredField(property);
field.setAccessible(true);
field.set(o, new ArrayListAddInterceptor(interceptor));
} catch (Exception e) {
throw new RuntimeException(e);
}
}
}
Main.java
public class Main {
public void parsePurchaseOrders(AddInterceptor<PurchaseOrder> interceptor, List<File> files) {
try {
// create JAXBContext for the primer.xsd
JAXBContext context = JAXBContext.newInstance("primer");
Unmarshaller unmarshaller = context.createUnmarshaller();
// install the callback on all PurchaseOrders instances
unmarshaller.setListener(new Unmarshaller.Listener() {
public void beforeUnmarshal(Object target, Object parent) {
if (target instanceof PurchaseOrders) {
ArrayListAddInterceptor.apply(interceptor, target, "purchaseOrder");
}
}
});
// create a new XML parser
SAXParserFactory factory = SAXParserFactory.newInstance();
factory.setNamespaceAware(true);
XMLReader reader = factory.newSAXParser().getXMLReader();
reader.setContentHandler(unmarshaller.getUnmarshallerHandler());
for (File file : files) {
reader.parse(new InputSource(new FileInputStream(file)));
}
} catch (Exception e) {
throw new RuntimeException(e);
}
}
}
*This code has not been tested and is for illustrative purposes only.
I wrote a small library (available on Maven Central) to help to read big XML files and process them by chunks. Please note it can only be applied for files with a unique container having a list of data (even from different types). In other words, your file has to follow the structure:
<container>
<type1>...</type1>
<type2>...</type2>
<type1>...</type1>
...
</container>
Here is an example where Type1, Type2, ... are the JAXB representation of the repeating data in the file:
try (StreamingUnmarshaller unmarshaller = new StreamingUnmarshaller(Type1.class, Type2.class, ...)) {
unmarshaller.open(new FileInputStream(fileName));
unmarshaller.iterate((type, element) -> doWhatYouWant(element));
}
You can find more information with detailed examples on the GitHub page of the library.
