Java StAX for Complex / Large XML

I have an XML file that is 4.2 GB! Obviously, parsing it into a full DOM is not practical. I have been looking at SAX and StAX for parsing this gigantic file. However, all the examples I've seen are simple, while the XML I am dealing with is nested many levels deep; in some areas it goes 10+ levels.
I found this tutorial, but I'm not sure whether it's a viable solution:
http://www.javacodegeeks.com/2013/05/parsing-xml-using-dom-sax-and-stax-parser-in-java.html (bottom example, using StAX)
I'm not really sure how to handle nested objects.
I have created Java objects to mimic the structure of the XML. Here are a few; there are too many to display them all.
Record.java
public class Record implements Serializable {
String uid;
StaticData staticData;
DynamicData dynamicData;
}
Summary.java
public class Summary {
EWUID ewuid;
PubInfo pubInfo;
Titles titles;
Names names;
DocTypes docTypes;
Publishers publishers;
}
EWUID.java
public class EWUID {
String collId;
String edition;
}
PubInfo.java
public class PubInfo {
String coverDate;
String hasAbstract;
String issue;
String pubMonth;
String pubType;
String pubYear;
String sortDate;
String volume;
}
This is the code I've come up with so far.
public class TRWOSParser {
XMLEventReader eventReader;
XMLInputFactory inputFactory;
InputStream inputStream;
public TRWOSParser(String file) throws FileNotFoundException, XMLStreamException {
inputFactory = XMLInputFactory.newInstance();
inputStream = new FileInputStream(file);
eventReader = inputFactory.createXMLEventReader(inputStream);
}
public void parse() throws XMLStreamException{
while (eventReader.hasNext()) {
XMLEvent event = eventReader.nextEvent();
if (event.isStartElement()) {
StartElement startElement = event.asStartElement();
if (startElement.getName().getLocalPart().equals("record")) {
Record record = new Record();
Iterator<Attribute> attributes = startElement.getAttributes();
while (attributes.hasNext()) {
Attribute attribute = attributes.next();
if (attribute.getName().toString().equals("UID")) {
System.out.println("UID: " + attribute.getValue());
}
}
}
}
}
}
}
Update:
The data in the XML is licensed so I cannot show the full file. This is a very very small segment in which I have scrambled the data.
<?xml version="1.0" encoding="UTF-8"?>
<records>
<REC>
<UID>WOS:000310438600004</UID>
<static_data>
<summary>
<EWUID>
<WUID coll_id="WOS" />
<edition value="WOS.SCI" />
</EWUID>
<pub_info coverdate="NOV 2012" has_abstract="N" issue="5" pubmonth="NOV" pubtype="Journal" pubyear="2012" sortdate="2012-11-01" vol="188">
<page begin="1662" end="1663" page_count="2">1662-1663</page>
</pub_info>
<titles count="6">
<title type="source">JOURNAL OF UROLOGY</title>
<title type="source_abbrev">J UROLOGY</title>
<title type="abbrev_iso">J. Urol.</title>
<title type="abbrev_11">J UROL</title>
<title type="abbrev_29">J UROL</title>
<title type="item">Something something</title>
</titles>
<names count="1">
<name addr_no="1 2 3" reprint="Y" role="author" seq_no="1">
<display_name>John Doe</display_name>
<full_name>John Doe</full_name>
<wos_standard>Doe, John</wos_standard>
<first_name>John</first_name>
<last_name>Doe</last_name>
</name>
</names>
<doctypes count="1">
<doctype>Editorial Material</doctype>
</doctypes>
<publishers>
<publisher>
<address_spec addr_no="1">
<full_address>360 PARK AVE SOUTH, NEW YORK, NY 10010-1710 USA</full_address>
<city>NEW YORK</city>
</address_spec>
<names count="1">
<name addr_no="1" role="publisher" seq_no="1">
<display_name>ELSEVIER SCIENCE INC</display_name>
<full_name>ELSEVIER SCIENCE INC</full_name>
</name>
</names>
</publisher>
</publishers>
</summary>
</static_data>
</REC>
</records>

A similar solution to lscoughlin's answer is to use DOM4J, which has mechanisms to deal with this scenario: http://dom4j.sourceforge.net/
In my opinion it is more straightforward and easier to follow. It might not support namespaces, though.

I'm making two assumptions 1) that there is an early level of repetition, and 2) that you can do something meaningful with a partial document.
Let's assume you can move some level of nesting in, and then handle the document multiple times, removing the nodes at the working level each time you "handle" the document. This means that only a single working subtree will be in memory at any given time.
Here's a working code snippet:
package bigparse;
import static javax.xml.stream.XMLStreamConstants.CHARACTERS;
import static javax.xml.stream.XMLStreamConstants.END_DOCUMENT;
import static javax.xml.stream.XMLStreamConstants.END_ELEMENT;
import static javax.xml.stream.XMLStreamConstants.START_DOCUMENT;
import static javax.xml.stream.XMLStreamConstants.START_ELEMENT;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
public class BigParse {
public static void main(String... args) {
XMLInputFactory factory = XMLInputFactory.newInstance();
DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
try {
XMLStreamReader streamReader = factory.createXMLStreamReader(new FileReader("src/main/resources/test.xml"));
DocumentBuilder documentBuilder = documentBuilderFactory.newDocumentBuilder();
Document document = documentBuilder.newDocument();
Element rootElement = null;
Element currentElement = null;
int branchLevel = 0;
int maxBranchLevel = 1;
while (streamReader.hasNext()) {
int event = streamReader.next();
switch (event) {
case START_DOCUMENT:
continue;
case START_ELEMENT:
if (branchLevel < maxBranchLevel) {
Element workingElement = readElementOnly(streamReader, document);
if (rootElement == null) {
document.appendChild(workingElement);
rootElement = document.getDocumentElement();
currentElement = rootElement;
} else {
currentElement.appendChild(workingElement);
currentElement = workingElement;
}
branchLevel++;
} else {
workingLoop(streamReader, document, currentElement);
}
continue;
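case CHARACTERS:
// Append a text node; setTextContent() would replace any child elements already attached
if (!streamReader.isWhiteSpace()) {
currentElement.appendChild(document.createTextNode(streamReader.getText()));
}
continue;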
case END_ELEMENT:
if (currentElement != rootElement) {
currentElement = (Element) currentElement.getParentNode();
branchLevel--;
}
continue;
case END_DOCUMENT:
break;
}
}
} catch (ParserConfigurationException
| FileNotFoundException
| XMLStreamException e) {
throw new RuntimeException(e);
}
}
private static Element readElementOnly(XMLStreamReader streamReader, Document document) {
Element workingElement = document.createElement(streamReader.getLocalName());
for (int attributeIndex = 0; attributeIndex < streamReader.getAttributeCount(); attributeIndex++) {
workingElement.setAttribute(
streamReader.getAttributeLocalName(attributeIndex),
streamReader.getAttributeValue(attributeIndex));
}
return workingElement;
}
private static void workingLoop(final XMLStreamReader streamReader, final Document document, final Element fragmentRoot)
throws XMLStreamException {
Element startElement = readElementOnly(streamReader, document);
fragmentRoot.appendChild(startElement);
Element currentElement = startElement;
while (streamReader.hasNext()) {
int event = streamReader.next();
switch (event) {
case START_DOCUMENT:
continue;
case START_ELEMENT:
Element workingElement = readElementOnly(streamReader, document);
currentElement.appendChild(workingElement);
currentElement = workingElement;
continue;
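case CHARACTERS:
// Append a text node; setTextContent() on inter-element whitespace would wipe the children read so far
if (!streamReader.isWhiteSpace()) {
currentElement.appendChild(document.createTextNode(streamReader.getText()));
}
continue;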
case END_ELEMENT:
if (currentElement != startElement) {
currentElement = (Element) currentElement.getParentNode();
continue;
} else {
handleDocument(document, startElement);
fragmentRoot.removeChild(startElement);
startElement = null;
return;
}
}
}
}
// THIS FUNCTION DOES SOMETHING MEANINGFUL
private static void handleDocument(Document document, Element startElement) {
System.out.println(stringify(document));
}
private static String stringify(Document document) {
try {
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
StreamResult result = new StreamResult(new StringWriter());
DOMSource source = new DOMSource(document);
transformer.transform(source, result);
String xmlString = result.getWriter().toString();
return xmlString;
} catch (Exception e) {
throw new RuntimeException(e);
}
}
}
EDIT: I made an incredibly silly mistake. It's fixed now. It's working but imperfect -- should be enough to lead you in a useful direction.
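If building a DOM fragment per record is more than you need, the same one-record-at-a-time idea can be sketched with the StAX cursor API alone. The following is only a sketch under assumptions: the Rec holder class and the choice of fields (UID, pubyear, titles) are hypothetical stand-ins for the question's Record/Summary classes, and the element names are taken from the scrambled sample in the question. A stack of open element names is one way to tell nested elements apart, for example a title under titles versus some other element of the same name.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class RecStreamSketch {
    // Hypothetical holder; the question's Record/Summary classes would take its place.
    static final class Rec {
        String uid;
        String pubYear;
        final List<String> titles = new ArrayList<>();
    }

    // Streams the input and hands each completed REC to the handler, so only one record is in memory.
    static void parse(Reader xml, Consumer<Rec> handler) throws XMLStreamException {
        XMLStreamReader r = XMLInputFactory.newFactory().createXMLStreamReader(xml);
        Deque<String> path = new ArrayDeque<>();   // names of the currently open elements
        StringBuilder text = new StringBuilder();  // text of the innermost open element
        Rec current = null;
        while (r.hasNext()) {
            switch (r.next()) {
                case XMLStreamConstants.START_ELEMENT: {
                    String name = r.getLocalName();
                    path.push(name);
                    text.setLength(0);
                    if ("REC".equals(name)) {
                        current = new Rec();
                    } else if (current != null && "pub_info".equals(name)) {
                        current.pubYear = r.getAttributeValue(null, "pubyear");
                    }
                    break;
                }
                case XMLStreamConstants.CHARACTERS:
                    text.append(r.getText());
                    break;
                case XMLStreamConstants.END_ELEMENT: {
                    String name = path.pop();
                    String parent = path.peek(); // null once the root element is closed
                    if (current != null) {
                        if ("UID".equals(name) && "REC".equals(parent)) {
                            current.uid = text.toString().trim();
                        } else if ("title".equals(name) && "titles".equals(parent)) {
                            current.titles.add(text.toString().trim());
                        } else if ("REC".equals(name)) {
                            handler.accept(current); // record complete: process it, then let it go
                            current = null;
                        }
                    }
                    text.setLength(0);
                    break;
                }
                default:
                    break;
            }
        }
        r.close();
    }

    // Convenience wrapper for small inputs; a multi-gigabyte file would use parse() with a handler instead.
    static List<Rec> parseAll(String xml) {
        List<Rec> out = new ArrayList<>();
        try {
            parse(new StringReader(xml), out::add);
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        String xml = "<records><REC><UID>WOS:1</UID><static_data><summary>"
                + "<pub_info pubyear=\"2012\"/><titles><title>J UROL</title></titles>"
                + "</summary></static_data></REC></records>";
        for (Rec rec : parseAll(xml)) {
            System.out.println(rec.uid + " " + rec.pubYear + " " + rec.titles);
        }
    }
}
```

For the real file, parse() would be given a reader over the file and a handler that writes each record to its destination, rather than collecting records in a list.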

Consider using an XSLT 3.0 streaming transformation of the form:
<xsl:template name="main">
<xsl:stream href="bigInput.xml">
<xsl:for-each select="copy-of(/records/REC)">
<!-- process one record -->
</xsl:for-each>
</xsl:stream>
</xsl:template>
You can process this using Saxon-EE 9.6.
The "process one record" logic could use the Saxon SQL extension, or it could invoke an extension function: the context node will be a REC element with its contained tree, fully navigable within the subtree, but with no ability to navigate outside the REC element currently being processed.

Related

How to parse xml and get the corresponding values using Stax Iterator ?

I would like to parse XML nodes using the StAX Iterator API and get the value of each id node. In the code below, how do I get the value of the id element whose type is id2 or id3?
<entity>
<id type="id1">8500123</id>
<id type="id2">8500124</id>
<id type="id3">8500125</id>
<link idType="someId">99369</link>
</entity>
The StAX Iterator API code is below:
XMLEventReader xmlEventReader = xmlInputFactory.createXMLEventReader(new FileInputStream(fileName));
while (xmlEventReader.hasNext()) {
XMLEvent xmlEvent = xmlEventReader.nextEvent();
if (xmlEvent.isStartElement()) {
StartElement startElement = xmlEvent.asStartElement();
if (startElement.getName().getLocalPart().equals("entity")) {
XMLEvent xmlEvent2 = xmlEventReader.nextEvent();//has to forgo this bcoz it always return a new line.
XMLEvent xmlEvent3 = xmlEventReader.nextEvent();
if (xmlEvent3.isStartElement()) {
StartElement startElement2 = xmlEvent3.asStartElement();
if (startElement2.getName().getLocalPart().equals("id")) {
connector = new Connector();
Attribute idAttr = startElement2.getAttributeByName(new QName("type"));
if(idAttr.getName().equals("id1")){
connector.setId1(idAttr.getValue());
}
}
}
}
}
}
Since the question is old there is probably no longer an issue, but I was just trying to do the same thing. The sample code was almost there; the missing step was to check for an event type of XMLStreamConstants.CHARACTERS which corresponds to either:
The data between an opening and closing tag.
Whitespace between tags.
So in your case you want to extract the data only if all of these conditions are met:
The event type being processed is XMLStreamConstants.CHARACTERS (in which case XMLEvent.isCharacters() returns true).
The immediately preceding event processed was of type XMLStreamConstants.START_ELEMENT.
The value of the type attribute of that preceding start element was "id2" or "id3".
It's possible to do that by tweaking your existing code, but a cleaner and more generic approach is to iteratively process the events returned by XMLEventReader using a switch statement. To get the value of the data between a start tag and end tag:
Characters characters = xmlEvent.asCharacters();
String data = characters.getData();
Here's a working example, where the file sample.xml contains the data in the OP:
package pkg;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.Characters;
import javax.xml.stream.events.EndElement;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;
public class StaxDemo {
public static void main(String[] args) throws XMLStreamException, IOException {
try (Reader reader = new FileReader("sample.xml");) {
XMLInputFactory xmlInputFactory = XMLInputFactory.newFactory();
XMLEventReader xmlEventReader = xmlInputFactory.createXMLEventReader(reader);
parseXml(xmlEventReader);
}
}
static void parseXml(XMLEventReader xmlEventReader) throws XMLStreamException {
String typeValue = null;
while (xmlEventReader.hasNext()) {
XMLEvent xmlEvent = xmlEventReader.nextEvent();
switch (xmlEvent.getEventType()) {
case XMLStreamConstants.START_DOCUMENT:
System.out.println("XMLEvent.START_DOCUMENT");
break;
case XMLStreamConstants.START_ELEMENT:
StartElement startElement = xmlEvent.asStartElement();
Attribute typeAttribute = startElement.getAttributeByName(new QName("type"));
if (typeAttribute != null) {
typeValue = typeAttribute.getValue();
}
System.out.println("XMLEvent.START_ELEMENT: <" + startElement.getName() + "> " + "type=" + typeValue);
break;
case XMLStreamConstants.CHARACTERS:
Characters characters = xmlEvent.asCharacters();
if ((typeValue != null)) { // Non-null if preceding event was for START_ELEMENT.
if ((typeValue.equals("id2")) || (typeValue.equals("id3"))) {
String data = characters.getData();
System.out.println("XMLEvent.CHARACTERS: data=[" + data + "]");
}
typeValue = null;
}
break;
case XMLStreamConstants.END_ELEMENT:
EndElement endElement = xmlEvent.asEndElement();
System.out.println("XMLEvent.END_ELEMENT: </" + endElement.getName() + ">");
break;
case XMLStreamConstants.END_DOCUMENT:
System.out.println("XMLEvent.END_DOCUMENT");
break;
default:
System.out.println("case default: Event Type = " + xmlEvent.getEventType());
break;
}
}
}
}
I added a few println() calls just to clarify how the file is processed by XMLEventReader. Here's the output:
XMLEvent.START_DOCUMENT
XMLEvent.START_ELEMENT: <entity> type=null
XMLEvent.START_ELEMENT: <id> type=id1
XMLEvent.END_ELEMENT: </id>
XMLEvent.START_ELEMENT: <id> type=id2
XMLEvent.CHARACTERS: data=[8500124]
XMLEvent.END_ELEMENT: </id>
XMLEvent.START_ELEMENT: <id> type=id3
XMLEvent.CHARACTERS: data=[8500125]
XMLEvent.END_ELEMENT: </id>
XMLEvent.START_ELEMENT: <link> type=null
XMLEvent.END_ELEMENT: </link>
XMLEvent.END_ELEMENT: </entity>
XMLEvent.END_DOCUMENT
Oracle provides a tutorial for StAX. While all the basic information is there, I found it a bit disorganized.
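When an element such as id contains only text, a shorter variation is possible. This is a sketch, not part of the answer above: it relies on XMLEventReader.getElementText(), which reads the text of the element whose start tag was just returned, and the class and method names (IdByTypeSketch, readIds) are made up for illustration.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

class IdByTypeSketch {
    // Returns the text of every <id> element, keyed by the value of its type attribute.
    static Map<String, String> readIds(String xml) {
        Map<String, String> ids = new LinkedHashMap<>();
        try {
            XMLEventReader reader =
                    XMLInputFactory.newFactory().createXMLEventReader(new StringReader(xml));
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (!event.isStartElement()) {
                    continue; // whitespace, end tags, etc. are not interesting here
                }
                StartElement start = event.asStartElement();
                if (!"id".equals(start.getName().getLocalPart())) {
                    continue;
                }
                Attribute type = start.getAttributeByName(new QName("type"));
                // Valid only right after a START_ELEMENT; consumes up to the matching end tag.
                String value = reader.getElementText();
                if (type != null) {
                    ids.put(type.getValue(), value.trim());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return ids;
    }

    public static void main(String[] args) {
        String xml = "<entity><id type=\"id1\">8500123</id><id type=\"id2\">8500124</id>"
                + "<id type=\"id3\">8500125</id><link idType=\"someId\">99369</link></entity>";
        Map<String, String> ids = readIds(xml);
        System.out.println(ids.get("id2") + " " + ids.get("id3"));
    }
}
```

Because the attribute and the text are read together, there is no state to carry between events; the trade-off is that getElementText() throws if the element contains child elements.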

How to read and modify fragments of XML using StAX in Java?

My goal is to read objects (featureMember) into a DOM, change them and write them back into a new XML file. The XML is too big to load into a DOM as a whole. I figured that what I need is StAX and TransformerFactory, but I can't make it work.
This is what I've done till now:
private void change(File pathIn, File pathOut) {
try {
XMLInputFactory factory = XMLInputFactory.newInstance();
XMLOutputFactory factoryOut = XMLOutputFactory.newInstance();
TransformerFactory tf = TransformerFactory.newInstance();
Transformer t = tf.newTransformer();
XMLEventReader in = factory.createXMLEventReader(new FileReader(pathIn));
XMLEventWriter out = factoryOut.createXMLEventWriter(new FileWriter(pathOut));
while (in.hasNext()) {
XMLEvent e = in.nextTag();
if (e.getEventType() == XMLStreamConstants.START_ELEMENT) {
if (((StartElement) e).getName().getLocalPart().equals("featureMember")) {
DOMResult result = new DOMResult();
t.transform(new StAXSource(in), result);
Node domNode = result.getNode();
System.out.println(domNode);
}
}
out.add(e);
}
in.close();
out.close();
} catch (FileNotFoundException e1) {
e1.printStackTrace();
} catch (IOException e1) {
e1.printStackTrace();
} catch (TransformerConfigurationException e1) {
e1.printStackTrace();
} catch (XMLStreamException e1) {
e1.printStackTrace();
} catch (TransformerException e1) {
e1.printStackTrace();
}
}
I get an exception (on t.transform()):
Exception in thread "AWT-EventQueue-0" java.lang.IllegalStateException: StAXSource(XMLEventReader) with XMLEventReader not in XMLStreamConstants.START_DOCUMENT or XMLStreamConstants.START_ELEMENT state
A simplified version of my XML looks like this (it has namespaces):
<?xml version="1.0" encoding="UTF-8"?>
<gml:FeatureCollection xmlns:gml="http://www.opengis.net/gml/3.2" gml:id="featureCollection">
<gml:featureMember>
</eg:RST>
<eg:pole>Krakow</eg:pole>
<eg:localId>id1234</eg:localId>
</gml:featureMember>
<gml:featureMember>
<eg:RST>1002</eg:RST>
<eg:pole>Rzeszow</eg:pole>
<eg:localId>id1235</eg:localId>
</gml:featureMember>
</gml:FeatureCollection>
I have a list of localIds of the objects (featureMember) that I want to change, together with the corresponding new RST or pole value (which of the two changes depends on the user):
localId (id1234) RST (1001)
localId (id1236) RST (1003)
...
The problem you're having is that when you create the StAXSource, your START_ELEMENT event has already been consumed. So the XMLEventReader is probably at some whitespace text node event, or something else that can't be an XML document source. You can use the peek() method to view the next event without consuming it. Make sure there is an event with hasNext() first, though.
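Another way to keep the start tag from being consumed is to use the cursor API instead of the event API: an XMLStreamReader stays positioned on START_ELEMENT, which is a state that StAXSource accepts. The sketch below is an illustration under assumptions, not your original code: the class and method names are made up, the eg prefix is bound to a dummy namespace, and the fragments are collected in a list only to keep the example short.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stax.StAXSource;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

class FeatureMemberSketch {
    // Reads each featureMember into its own small DOM and replaces RST where the localId is listed.
    static List<Document> readAndPatch(String xml, Map<String, String> newRstByLocalId) {
        List<Document> fragments = new ArrayList<>();
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newFactory().createXMLStreamReader(new StringReader(xml));
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            while (reader.hasNext()) {
                if (reader.getEventType() == XMLStreamConstants.START_ELEMENT
                        && "featureMember".equals(reader.getLocalName())) {
                    DOMResult result = new DOMResult();
                    // The transform reads this element's whole subtree and moves the reader past it,
                    // so the loop re-checks the current event instead of calling next() here.
                    identity.transform(new StAXSource(reader), result);
                    Node node = result.getNode();
                    Document fragment = node instanceof Document
                            ? (Document) node : node.getOwnerDocument();
                    patch(fragment, newRstByLocalId);
                    fragments.add(fragment);
                } else {
                    reader.next();
                }
            }
            reader.close();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return fragments;
    }

    // Sets the RST text if this fragment's localId has a replacement value.
    static void patch(Document fragment, Map<String, String> newRstByLocalId) {
        Node localId = fragment.getElementsByTagNameNS("*", "localId").item(0);
        Node rst = fragment.getElementsByTagNameNS("*", "RST").item(0);
        if (localId == null || rst == null) {
            return;
        }
        String replacement = newRstByLocalId.get(localId.getTextContent().trim());
        if (replacement != null) {
            rst.setTextContent(replacement);
        }
    }

    public static void main(String[] args) {
        String xml = "<gml:FeatureCollection xmlns:gml=\"http://www.opengis.net/gml/3.2\""
                + " xmlns:eg=\"acme.com\"><gml:featureMember><eg:RST/>"
                + "<eg:localId>id1234</eg:localId></gml:featureMember></gml:FeatureCollection>";
        List<Document> docs = readAndPatch(xml, Map.of("id1234", "1001"));
        System.out.println(docs.get(0).getDocumentElement().getTextContent());
    }
}
```

In a real run each patched fragment would be written to the output as soon as it is ready, so that only one featureMember is held in memory at a time.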
I'm not 100% sure of what you wish to accomplish, so here are some things you could do depending on the scenario.
EDIT: I just read some of the comments on your question which make things a bit more clear. The below could still help you to achieve the desired result with some adjustment. Also note that Java XSLT processors allow for extension functions and extension elements, which can call into Java code from an XSLT stylesheet. This can be a powerful method to extend basic XSLT functionality with external resources such as database queries.
In case you want the input XML to be transformed into one output XML, you might be better off simply using an XML stylesheet transformation. In your code, you create a transformer without any templates, so it becomes the default "identity transformer" which just copies input to output. Suppose your input XML is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<gml:FeatureCollection xmlns:gml="http://www.opengis.net/gml/3.2" gml:id="featureCollection" xmlns:eg="acme.com">
<gml:featureMember>
<eg:RST/>
<eg:pole>Krakow</eg:pole>
<eg:localId>id1234</eg:localId>
</gml:featureMember>
<gml:featureMember>
<eg:RST>1002</eg:RST>
<eg:pole>Rzeszow</eg:pole>
<eg:localId>id1235</eg:localId>
</gml:featureMember>
</gml:FeatureCollection>
I've bound the eg prefix to some dummy namespace since it was missing from your sample and fixed the malformed RST element.
The following program runs an XSLT transformation on your input and writes the result to an output file.
package xsltplayground;
import java.io.File;
import java.net.URL;
import java.util.logging.Level;
import java.util.logging.Logger;
import javax.xml.transform.Result;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
public class XSLTplayground {
public static void main(String[] args) throws Exception {
URL url = XSLTplayground.class.getResource("sample.xml");
File input = new File(url.toURI());
URL url2 = XSLTplayground.class.getResource("stylesheet.xsl");
File xslt = new File(url2.toURI());
URL url3 = XSLTplayground.class.getResource(".");
File output = new File(new File(url3.toURI()), "output.xml");
change(input, output, xslt);
}
private static void change(File pathIn, File pathOut, File xsltFile) {
try {
// Creating transformer with XSLT file
TransformerFactory tf = TransformerFactory.newInstance();
Source xsltSource = new StreamSource(xsltFile);
Transformer t = tf.newTransformer(xsltSource);
// Input source
Source input = new StreamSource(pathIn);
// Output target
Result output = new StreamResult(pathOut);
// Transforming
t.transform(input, output);
} catch (TransformerConfigurationException ex) {
Logger.getLogger(XSLTplayground.class.getName()).log(Level.SEVERE, null, ex);
} catch (TransformerException ex) {
Logger.getLogger(XSLTplayground.class.getName()).log(Level.SEVERE, null, ex);
}
}
}
Here's a sample stylesheet.xsl file, which for convenience I just dumped into the same package as the input XML and class.
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0" xmlns:gml="http://www.opengis.net/gml/3.2" xmlns:eg="acme.com">
<xsl:output method="xml" indent="yes"/>
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*" />
</xsl:copy>
</xsl:template>
<xsl:template match="gml:featureMember">
<gml:member>
<xsl:apply-templates select="node()|@*" />
</xsl:apply-templates>
</gml:member>
</xsl:template>
</xsl:stylesheet>
The above stylesheet will copy everything by default, but when it gets to a <gml:featureMember> element it will wrap the contents into a new <gml:member> element. Just a very simple example of what you can do with XSLT.
The output would be:
<?xml version="1.0" encoding="UTF-8"?>
<gml:FeatureCollection xmlns:gml="http://www.opengis.net/gml/3.2" xmlns:eg="acme.com" gml:id="featureCollection">
<gml:member>
<eg:RST/>
<eg:pole>Krakow</eg:pole>
<eg:localId>id1234</eg:localId>
</gml:member>
<gml:member>
<eg:RST>1002</eg:RST>
<eg:pole>Rzeszow</eg:pole>
<eg:localId>id1235</eg:localId>
</gml:member>
</gml:FeatureCollection>
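Since both input and output are file streams, you never have to build a DOM yourself. XSLT in Java is pretty fast and efficient, so this might suffice. Keep in mind, though, that an XSLT 1.0 processor such as the one built into the JDK still builds its own in-memory tree of the whole input, so for very large files memory can remain the limiting factor.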
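Closer to the goal stated in the question (changing RST for a given localId), the same identity-plus-override pattern can be sketched with stylesheet parameters. Everything specific here is an assumption for illustration: the parameter names localId and rst, the class name, the dummy acme.com namespace for eg, and the inline stylesheet string. XSLT 1.0 does not allow variable references in match patterns, so the comparison is done inside the template.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class RstUpdateSketch {
    // Identity template plus one override for eg:RST; the parameters are supplied from Java.
    static final String STYLESHEET =
            "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
            + " xmlns:eg='acme.com'>"
            + "<xsl:param name='localId'/>"
            + "<xsl:param name='rst'/>"
            + "<xsl:template match='node()|@*'>"
            + "<xsl:copy><xsl:apply-templates select='node()|@*'/></xsl:copy>"
            + "</xsl:template>"
            + "<xsl:template match='eg:RST'>"
            + "<xsl:copy><xsl:choose>"
            + "<xsl:when test='../eg:localId = $localId'><xsl:value-of select='$rst'/></xsl:when>"
            + "<xsl:otherwise><xsl:apply-templates select='node()|@*'/></xsl:otherwise>"
            + "</xsl:choose></xsl:copy>"
            + "</xsl:template>"
            + "</xsl:stylesheet>";

    // Returns the input with RST replaced in the featureMember whose localId matches.
    static String update(String xml, String localId, String newRst) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            t.setParameter("localId", localId);
            t.setParameter("rst", newRst);
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<gml:FeatureCollection xmlns:gml=\"http://www.opengis.net/gml/3.2\""
                + " xmlns:eg=\"acme.com\"><gml:featureMember><eg:RST/>"
                + "<eg:localId>id1234</eg:localId></gml:featureMember></gml:FeatureCollection>";
        System.out.println(update(xml, "id1234", "1001"));
    }
}
```

A single pair of parameters handles one change per run; a longer list of changes would need a lookup document or an extension function, as mentioned in the edit note above.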
Maybe you actually want to split every occurrence of some element into its own output file, with some changes to it. Here's an example of code that uses StAX for splitting off the <gml:featureMember> elements as separate documents. You could then iterate over the created files and transform them however you want (XSLT would again be a good choice). Obviously the error handling would need to be a bit more robust. This is just for demonstration.
package xsltplayground;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.net.URL;
import java.util.logging.Level;
import java.util.logging.Logger;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;
import javax.xml.transform.stream.StreamSource;
public class XSLTplayground {
public static void main(String[] args) throws Exception {
URL url = XSLTplayground.class.getResource("sample.xml");
File input = new File(url.toURI());
URL url2 = XSLTplayground.class.getResource("stylesheet.xsl");
File xslt = new File(url2.toURI());
URL url3 = XSLTplayground.class.getResource(".");
File output = new File(url3.toURI());
change(input, output, xslt);
}
private static void change(File pathIn, File directoryOut, File xsltFile) throws InterruptedException {
try {
// Creating a StAX event reader from the input
XMLInputFactory xmlIf = XMLInputFactory.newFactory();
XMLEventReader reader = xmlIf.createXMLEventReader(new StreamSource(pathIn));
// Create a StAX output factory
XMLOutputFactory xmlOf = XMLOutputFactory.newInstance();
int counter = 1;
// Keep going until no more events
while (reader.hasNext()) {
// Peek into the next event to find out what it is
XMLEvent next = reader.peek();
// If it's the start of a featureMember element, commence output
if (next.isStartElement()
&& next.asStartElement().getName().getLocalPart().equals("featureMember")) {
File output = new File(directoryOut, "output_" + counter + ".xml");
try (OutputStream ops = new FileOutputStream(output)) {
XMLEventWriter writer = xmlOf.createXMLEventWriter(ops);
copy(reader, writer);
writer.flush();
writer.close();
}
counter++;
} else {
// Not in a featureMember element: ignore
reader.next();
}
}
} catch (XMLStreamException ex) {
Logger.getLogger(XSLTplayground.class.getName()).log(Level.SEVERE, null, ex);
} catch (IOException ex) {
Logger.getLogger(XSLTplayground.class.getName()).log(Level.SEVERE, null, ex);
}
}
private static void copy(XMLEventReader reader, XMLEventWriter writer) throws XMLStreamException {
// Creating an XMLEventFactory
XMLEventFactory ef = XMLEventFactory.newFactory();
// Writing an XML document start
writer.add(ef.createStartDocument());
int depth = 0;
boolean stop = false;
while (!stop) {
XMLEvent next = reader.nextEvent();
writer.add(next);
if (next.isStartElement()) {
depth++;
} else if (next.isEndElement()) {
depth--;
if (depth == 0) {
writer.add(ef.createEndDocument());
stop = true;
}
}
}
}
}

Java: Having trouble parsing XML with nested nodes

I have an XML file with something like this
<album>
<title> Sample Album </title>
<year> 2014 </year>
<musicalStyle> Waltz </musicalStyle>
<song> Track 1 </song>
<song> Track 2 </song>
<song> Track 3 </song>
<song> Track 4 </song>
<song> Track 5 </song>
<song> Track 6 </song>
<song> Track 7 </song>
</album>
I was able to parse the songs by following a tutorial, but now I'm stuck with the nested nodes.
Song.XMLtitleStartTag is <title> and Song.XMLtitleEndTag is </title>.
public static SongList parseFromFile(File inputFile){
System.out.println("Parse File Data:");
if(inputFile == null) return null;
SongList theSongs = new SongList();
BufferedReader inputFileReader;
String inputLine; //current input line
try{
inputFileReader= new BufferedReader(new FileReader(inputFile));
while((inputLine = inputFileReader.readLine()) != null){
if(inputLine.trim().startsWith(Song.XMLtitleStartTag) &&
inputLine.endsWith(Song.XMLtitleEndTag)){
String titleString = inputLine.substring(Song.XMLtitleStartTag.length()+1,
inputLine.length()- Song.XMLtitleEndTag.length()).trim();
if(titleString != null && titleString.length() > 0)
theSongs.add(new Song(titleString));
}
}
inputFileReader.close();
} catch (IOException e) {
e.printStackTrace();
}
return theSongs;
}
I understand there are different ways to parse XML. I was wondering whether I should stick to the method I'm using and build on it, or try a different, easier approach.
I'm also wondering if I could get a pointer on parsing the rest of the album information, if possible.
The short answer is, yes, you should drop your current approach and seek something else. Many hundreds of developer hours have gone into producing libraries that are capable of parsing XML files in a standardised manner.
There are any number of libraries available for parsing XML.
You could start by taking a look at the inbuilt APIs, Java API for XML Processing (JAXP).
Generally it comes down to two approaches.
SAX or DOM.
SAX is basically inline processing of the XML as it's parsed. This means that, as the XML document is being read, you are given the opportunity to handle each piece as the parser encounters it. This is good for large documents and when you only need linear access to the content.
DOM (or Document Object Model) generates a model of the XML, which you can process at your leisure. It's better suited to smaller XML documents, as the entire model is normally read into memory and when you want to interact with the document in a non-linear fashion, such as searching for example...
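To make the SAX side concrete, here is a minimal sketch against the album file from the question. The class and field names (AlbumSaxSketch, AlbumHandler) are made up for illustration; the parser calls startElement, characters and endElement as it reads, and the handler simply remembers the text of the element that is currently open.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class AlbumSaxSketch {
    // Collects the album title and song names while the parser streams through the document.
    static final class AlbumHandler extends DefaultHandler {
        String title;
        final List<String> songs = new ArrayList<>();
        private final StringBuilder text = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            text.setLength(0); // start collecting text for the newly opened element
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length); // may be called several times per element
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            String value = text.toString().trim();
            if ("title".equals(qName)) {
                title = value;
            } else if ("song".equals(qName)) {
                songs.add(value);
            }
            text.setLength(0);
        }
    }

    static AlbumHandler parse(String xml) {
        try {
            AlbumHandler handler = new AlbumHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        AlbumHandler album = parse("<album><title> Sample Album </title>"
                + "<song> Track 1 </song><song> Track 2 </song></album>");
        System.out.println(album.title + ": " + album.songs);
    }
}
```

Note that the handler never sees lines, only elements and text, so it does not matter how the file is broken across lines.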
The following is a simple snippet for loading an XML document into a DOM...
try {
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
try {
Document doc = builder.parse(new File("Album.xml"));
} catch (SAXException | IOException ex) {
ex.printStackTrace();
}
} catch (ParserConfigurationException exp) {
exp.printStackTrace();
}
Once you have the Document, you are ready to process it in any way you see fit. To my mind, I'd take a look at XPath, which is a query API for XML.
For example...
import java.io.File;
import java.io.IOException;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;
public class SongList {
public static void main(String[] args) {
try {
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
try {
Document doc = builder.parse(new File("Album.xml"));
XPathFactory xPathFactory = XPathFactory.newInstance();
XPath xPath = xPathFactory.newXPath();
// Find all album tags starting at the root level
XPathExpression xExpress = xPath.compile("/album");
NodeList nl = (NodeList)xExpress.evaluate(doc.getDocumentElement(), XPathConstants.NODESET);
for (int index = 0; index < nl.getLength(); index++) {
Node albumNode = nl.item(index);
// Find the title node that is a child of the albumNode
Node titleNode = (Node) xPath.compile("title").evaluate(albumNode, XPathConstants.NODE);
System.out.println(titleNode.getTextContent());
}
// Find all albums whose title is equal to " Sample Album "
xExpress = xPath.compile("/album[title=' Sample Album ']");
nl = (NodeList)xExpress.evaluate(doc.getDocumentElement(), XPathConstants.NODESET);
for (int index = 0; index < nl.getLength(); index++) {
Node albumNode = nl.item(index);
Node titleNode = (Node) xPath.compile("title").evaluate(albumNode, XPathConstants.NODE);
System.out.println(titleNode.getTextContent());
}
} catch (SAXException | IOException | XPathExpressionException ex) {
ex.printStackTrace();
}
} catch (ParserConfigurationException exp) {
exp.printStackTrace();
}
}
}
Perhaps you could try something like:
import java.io.File;
import java.util.LinkedList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
public class Test {
public static final class Album {
public final String title;
public final String year;
public final String style;
public final List<Song> songs;
Album(final String title, final String year, final String style){
this.title = title;
this.year = year;
this.style = style;
songs = new LinkedList<>();
}
}
public static final class Song {
public final Album album;
public final String name;
Song(final Album album, final String name){
this.album = album;
this.name = name;
}
}
public static List<Album> getAlbums(final File xml) throws Exception {
final List<Album> albums = new LinkedList<>();
final Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(xml);
doc.getDocumentElement().normalize();
final NodeList list = doc.getElementsByTagName("album");
for(int i = 0; i < list.getLength(); i++){
final Node node = list.item(i);
if(node.getNodeType() != Node.ELEMENT_NODE)
continue;
final Element e = (Element) node;
// getNodeValue() is null for element nodes, so read each child element's text content by tag name
final Album album = new Album(
e.getElementsByTagName("title").item(0).getTextContent().trim(),
e.getElementsByTagName("year").item(0).getTextContent().trim(),
e.getElementsByTagName("musicalStyle").item(0).getTextContent().trim());
final NodeList songs = e.getElementsByTagName("song");
for(int j = 0; j < songs.getLength(); j++)
album.songs.add(new Song(album, songs.item(j).getTextContent().trim()));
albums.add(album);
}
return albums;
}
}
Parsing XML correctly requires a much more flexible (and complicated) mechanism than the routine you have here. You would do better to make use of an existing parser.
If you really want to write your own, this code is not the foundation of a workable approach. Remember that XML is not line-based, and nothing requires an element's start and end tags to be on the same line. This makes parsing a file line by line a difficult and awkward way to get started, and trying to identify elements by pattern matching one line at a time is simply a broken technique (any element may span more than a single line).

Multiple NameSpace in Xml Xpath value

I am new to XPath parsing of XML in Java. I learnt it and it worked pretty well until the issue below: I am not sure how to traverse to the next node. Please see the code below and let me know what needs to be corrected.
package test;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
public class CallTestcall {
public static void main(String[] args) throws Exception {
DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
domFactory.setNamespaceAware(true);
DocumentBuilder builder = domFactory.newDocumentBuilder();
String responsePath1 = "C:/Verizon/webserviceTestTool/generatedResponse/example.xml";
Document doc1 = builder.parse(responsePath1);
String responsePath0 = "C:/Verizon/webserviceTestTool/generatedResponse/response.xml";
Document doc0 = builder.parse(responsePath0);
example0(doc0);
example1(doc1);
}
private static void example0(Document example)
throws XPathExpressionException, TransformerException {
System.out.println("\n*** First example - namespacelookup hardcoded ***");
XPath xPath = XPathFactory.newInstance().newXPath();
xPath.setNamespaceContext(new HardcodedNamespaceResolver());
String result = xPath.evaluate("s:Envelope/s:Body/ns1:UpdateSessionResponse",
example);
// I cannot traverse any further down to UpdateSessionResult. These are the XPath expressions I tried:
result = xPath.evaluate("s:Envelope/s:Body/ns1:UpdateSessionResponse/a:UpdateSessionResult",
example);
result = xPath.evaluate("s:Envelope/s:Body/ns1:UpdateSessionResponse/i:UpdateSessionResult",
example);
System.out.println("example0 : "+result);
}
private static void example1(Document example)
throws XPathExpressionException, TransformerException {
System.out.println("\n*** First example - namespacelookup hardcoded ***");
XPath xPath = XPathFactory.newInstance().newXPath();
xPath.setNamespaceContext(new HardcodedNamespaceResolver());
String result = xPath.evaluate("books:booklist/technical:book/:author",
example);
System.out.println("example1 : "+result);
}
}
Here is the class that implements NamespaceContext, where I have added the prefixes:
package test;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
public class HardcodedNamespaceResolver implements NamespaceContext {
/**
* This method returns the uri for all prefixes needed. Wherever possible it
* uses XMLConstants.
*
* @param prefix
* @return uri
*/
public String getNamespaceURI(String prefix) {
if (prefix == null) {
throw new IllegalArgumentException("No prefix provided!");
} else if (prefix.equals(XMLConstants.DEFAULT_NS_PREFIX)) {
return "http://univNaSpResolver/book";
} else if (prefix.equals("books")) {
return "http://univNaSpResolver/booklist";
} else if (prefix.equals("fiction")) {
return "http://univNaSpResolver/fictionbook";
} else if (prefix.equals("technical")) {
return "http://univNaSpResolver/sciencebook";
} else if (prefix.equals("s")) {
return "http://schemas.xmlsoap.org/soap/envelope/";
} else if (prefix.equals("a")) {
return "http://channelsales.corp.cox.com/vzw/v1/data/";
} else if (prefix.equals("i")) {
return "http://www.w3.org/2001/XMLSchema-instance";
} else if (prefix.equals("ns1")) {
return "http://channelsales.corp.cox.com/vzw/v1/";
}
else {
return XMLConstants.NULL_NS_URI;
}
}
public String getPrefix(String namespaceURI) {
// Not needed in this context.
return null;
}
public Iterator getPrefixes(String namespaceURI) {
// Not needed in this context.
return null;
}
}
Here is my XML:
String XmlString = "<s:Envelope xmlns:s="http://schemas.xmlsoap.org/soap/envelope/"><s:Body><UpdateSessionResponse xmlns="http://channelsales.corp.cox.com/vzw/v1/"><UpdateSessionResult xmlns:a="http://channelsales.corp.cox.com/vzw/v1/data/" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<a:ResponseHeader>
<a:SuccessFlag>true</a:SuccessFlag>
<a:ErrorCode i:nil="true"/>
<a:ErrorMessage i:nil="true"/>
<a:Timestamp>2012-12-05T15:28:35.5363903-05:00</a:Timestamp>
</a:ResponseHeader>
<a:SessionId>cd3ce09e-eb33-48e8-b628-ecd406698aee</a:SessionId>
<a:CacheKey i:nil="true"/>
Try the following. It works for me.
result = xPath.evaluate("/s:Envelope/s:Body/ns1:UpdateSessionResponse/ns1:UpdateSessionResult",
example);
Since you are searching from the root of the document, precede the XPath expression with a forward slash (/).
Also, in the XML fragment below, xmlns="http... declares a default namespace, and your namespace resolver maps that same URI to the prefix ns1. UpdateSessionResult declares two namespace prefixes, a and i, but it is not itself written with either of them (it is not <a:UpdateSessionResult...), so it inherits the default namespace from its parent, which your XPath has to address as ns1.
<UpdateSessionResponse xmlns="http://channelsales.corp.cox.com/vzw/v1/">
<UpdateSessionResult xmlns:a="http://channelsales.corp.cox.com/vzw/v1/data/" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
That's why you need to use ns1:UpdateSessionResult instead of either a:UpdateSessionResult or i:UpdateSessionResult.
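The explanation above can be checked with a small self-contained sketch. The XML here is a cut-down version of the response in the question (only SessionId is kept), and the class name, the Resolver class and the sessionId helper are made up for this illustration. Passing ns1 as the prefix for UpdateSessionResult finds the node; passing a does not, and evaluate then returns an empty string.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class DefaultNamespaceXPathDemo {
    // Cut-down version of the SOAP response from the question.
    static final String XML =
          "<s:Envelope xmlns:s=\"http://schemas.xmlsoap.org/soap/envelope/\">"
        + "<s:Body>"
        + "<UpdateSessionResponse xmlns=\"http://channelsales.corp.cox.com/vzw/v1/\">"
        + "<UpdateSessionResult xmlns:a=\"http://channelsales.corp.cox.com/vzw/v1/data/\">"
        + "<a:SessionId>cd3ce09e-eb33-48e8-b628-ecd406698aee</a:SessionId>"
        + "</UpdateSessionResult>"
        + "</UpdateSessionResponse>"
        + "</s:Body>"
        + "</s:Envelope>";

    // Minimal resolver: the prefixes only have to match the URIs, not the prefixes used in the file.
    static final class Resolver implements NamespaceContext {
        public String getNamespaceURI(String prefix) {
            if (prefix == null) {
                throw new IllegalArgumentException("No prefix provided!");
            }
            switch (prefix) {
                case "s":   return "http://schemas.xmlsoap.org/soap/envelope/";
                case "ns1": return "http://channelsales.corp.cox.com/vzw/v1/";
                case "a":   return "http://channelsales.corp.cox.com/vzw/v1/data/";
                default:    return XMLConstants.NULL_NS_URI;
            }
        }
        public String getPrefix(String namespaceURI) { return null; }
        public Iterator<String> getPrefixes(String namespaceURI) { return Collections.emptyIterator(); }
    }

    // resultPrefix is the prefix used for UpdateSessionResult in the XPath expression.
    static String sessionId(String resultPrefix) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(XML)));
        XPath xPath = XPathFactory.newInstance().newXPath();
        xPath.setNamespaceContext(new Resolver());
        return xPath.evaluate("/s:Envelope/s:Body/ns1:UpdateSessionResponse/"
                + resultPrefix + ":UpdateSessionResult/a:SessionId", doc);
    }

    public static void main(String[] args) throws Exception {
        System.out.println("ns1: [" + sessionId("ns1") + "]");
        System.out.println("a:   [" + sessionId("a") + "]");
    }
}
```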

Find the value of a specific attribute in an XML file in java

I need to just read the value of a single attribute inside an XML file using java. The XML would look something like this:
<behavior name="Fred" version="2.0" ....>
and I just need to read out the version. Can someone point me in the direction of a resource that shows how to do this?
You don't need a fancy library -- plain old JAXP versions of DOM and XPath are pretty easy to read and write for this. Whatever you do, don't use a regular expression.
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
public class GetVersion {
public static void main(String[] args) throws Exception {
XPath xpath = XPathFactory.newInstance().newXPath();
Document doc = DocumentBuilderFactory.newInstance()
.newDocumentBuilder().parse("file:////tmp/whatever.xml");
String version = xpath.evaluate("//behavior/@version", doc);
System.out.println(version);
}
}
JAXB for brevity:
private static String readVersion(File file) {
@XmlRootElement class Behavior {
@XmlAttribute String version;
}
return JAXB.unmarshal(file, Behavior.class).version;
}
StAX for efficiency:
private static String readVersionEfficient(File file)
throws XMLStreamException, IOException {
XMLInputFactory inFactory = XMLInputFactory.newInstance();
XMLStreamReader xmlReader = inFactory
.createXMLStreamReader(new StreamSource(file));
try {
while (xmlReader.hasNext()) {
if (xmlReader.next() == XMLStreamConstants.START_ELEMENT) {
if (xmlReader.getLocalName().equals("behavior")) {
return xmlReader.getAttributeValue(null, "version");
} else {
throw new IOException("Invalid file");
}
}
}
throw new IOException("Invalid file");
} finally {
xmlReader.close();
}
}
Here's one.
import javax.xml.parsers.SAXParser;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.SAXException;
import org.xml.sax.Attributes;
import javax.xml.parsers.SAXParserFactory;
/**
* A sample of reading the attributes of a given XML element.
*/
public class SampleOfReadingAttributes {
/**
* Application entry point
* @param args command-line arguments
*/
public static void main(String[] args) {
try {
// creates and returns new instance of SAX-implementation:
SAXParserFactory factory = SAXParserFactory.newInstance();
// create SAX-parser...
SAXParser parser = factory.newSAXParser();
// .. define our handler:
SaxHandler handler = new SaxHandler();
// and parse:
parser.parse("sample.xml", handler);
} catch (Exception ex) {
ex.printStackTrace(System.out);
}
}
/**
* Our own implementation of a SAX handler that reads
* the version attribute of the behavior element.
*/
private static final class SaxHandler extends DefaultHandler {
// called when the parser enters element 'qName':
public void startElement(String uri, String localName,
String qName, Attributes attrs) throws SAXException {
if (qName.equals("behavior")) {
// get version
String version = attrs.getValue("version");
System.out.println("Version is " + version );
}
}
}
}
As mentioned you can use the SAXParser.
Digester mentioned using regular expressions, which I wouldn't recommend, as it leads to code that is difficult to maintain: what if you add another version attribute in another tag, or another behavior tag? You can handle it, but it won't be pretty.
You can also use XPath, which is a language for querying xml. That's what I would recommend.
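As a rough sketch of why the regular expression is fragile, consider a file where some other tag also carries a version attribute. The XML, the class name and the method names below are made up for this illustration. The naive pattern picks up the first version it sees, while the XPath expression asks specifically for the attribute on behavior.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class RegexVersusXPathDemo {
    // Hypothetical file: the enclosing config element has its own version attribute.
    static final String XML =
          "<config version=\"9.9\">"
        + "<behavior name=\"Fred\" version=\"2.0\"/>"
        + "</config>";

    // Naive approach: grab the first version="..." anywhere in the text.
    static String viaRegex() {
        Matcher m = Pattern.compile("version=\"([^\"]*)\"").matcher(XML);
        return m.find() ? m.group(1) : null;
    }

    // XPath approach: select the version attribute of a behavior element only.
    static String viaXPath() throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate("//behavior/@version", new InputSource(new StringReader(XML)));
    }

    public static void main(String[] args) throws Exception {
        System.out.println("regex: " + viaRegex());
        System.out.println("xpath: " + viaXPath());
    }
}
```

The regex could be tightened to anchor on the behavior tag, but every such fix re-implements a little more of an XML parser, which is the maintenance problem described above.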
If all you need is to read the version, then you can use a regex. But really, I think you need Apache Commons Digester.
Apache Commons Configuration is nice, too. Commons Configuration 1.x uses Commons Digester for parts of its configuration loading.
