Match set of simple xpaths with SAX - java

I have a set of simple xpaths involving only tags and attributes, no predicates. My XML input has a size of several MB so I want to use a streaming XML parser.
How can I match the streaming XML parser against the set of xpaths to retrieve one value for each xpath?
The crux seems to be building the right data structure from the set of xpaths so that it can be evaluated against the XML events.
This seems like a fairly common task but I couldn't find any readily available solutions.

To match a streaming XML parser against a set of simple xpaths, you can use the following steps:
Create a Map<String, String> to store the xpaths and their corresponding values. Initialize the values to null.
Create a Stack<String> to keep track of the current path of the XML elements.
Create a SAXParser and a DefaultHandler to parse the XML input.
In the startElement method of the handler, push the element name to the stack and append it to the current path. Then, check if the current path matches any of the xpaths in the map. If yes, set a flag to indicate that the value should be extracted.
In the endElement method of the handler, pop the element name from the stack and remove it from the current path. Then, reset the flag to indicate that the value should not be extracted.
In the characters method of the handler, check if the flag is set. If yes, append the character data to the value of the matching xpath in the map.
After parsing the XML input, return the map with the xpaths and their values.
Explanation
A streaming XML parser, such as SAXParser, reads the XML input sequentially and triggers events when it encounters different parts of the document, such as start tags, end tags, text, etc. It does not build a tree structure of the document in memory, which makes it more efficient for large XML inputs.
An xpath is a syntax for selecting nodes from an XML document. It consists of a series of steps, separated by slashes, that describe the location of the desired node. For example, /bookstore/book/title selects the title element of the book element of the bookstore element.
A simple xpath involves only tags and attributes, with no general predicates. The one predicate form handled here is a single attribute equality test: for example, /bookstore/book[@lang='en']/title selects the title element of each book element whose lang attribute has the value en.
To match a streaming XML parser against a set of simple xpaths, we need to keep track of the current path of the XML elements as we parse the input, and compare it with the xpaths in the set. If we find a match, we need to extract the value of the node and store it in a map. We also need to handle the cases where the node value spans across multiple character events, or where the node has multiple occurrences in the document.
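One possible data structure for the set of xpaths is a trie keyed by path steps, walked in lockstep with the SAX events so that each event costs one map lookup regardless of how many xpaths there are. The sketch below is only an illustration under stated assumptions: the class PathTrieMatcher and its methods are hypothetical names, steps are plain element names or a trailing @attribute, target elements are leaves containing only text, and only the first occurrence of each xpath is kept.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: a trie over xpath steps, walked in lockstep with SAX events.
class PathTrieMatcher {
    // One node per distinct path prefix; xpath is non-null where a requested path ends.
    static final class Node {
        final Map<String, Node> children = new HashMap<>();
        String xpath;
    }

    // Sentinel for elements that lie outside every requested path.
    private static final Node DEAD = new Node();

    static Node buildTrie(List<String> xpaths) {
        Node root = new Node();
        for (String xpath : xpaths) {
            Node node = root;
            for (String step : xpath.substring(1).split("/")) {
                node = node.children.computeIfAbsent(step, k -> new Node());
            }
            node.xpath = xpath;
        }
        return root;
    }

    static Map<String, String> match(String xml, List<String> xpaths) throws Exception {
        Map<String, String> result = new LinkedHashMap<>();
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(buildTrie(xpaths));
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                Node next = stack.peek().children.get(qName);
                if (next == null) {
                    next = DEAD;
                }
                stack.push(next);
                // Attribute steps such as /a/b/@id are resolved as soon as the element opens
                for (int i = 0; i < atts.getLength(); i++) {
                    Node attr = next.children.get("@" + atts.getQName(i));
                    if (attr != null && attr.xpath != null) {
                        result.putIfAbsent(attr.xpath, atts.getValue(i));
                    }
                }
                text.setLength(0);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Text may arrive in several chunks, so it is buffered until the end tag
                if (stack.peek().xpath != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                Node node = stack.pop();
                if (node.xpath != null) {
                    result.putIfAbsent(node.xpath, text.toString()); // keep the first occurrence
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<a><b id=\"7\">x</b><b id=\"8\">y</b><c>z</c></a>";
        System.out.println(match(xml, List.of("/a/b", "/a/b/@id", "/a/c")));
    }
}
```

Xpaths that never occur in the document are simply absent from the returned map. Keeping every occurrence instead of the first would mean replacing putIfAbsent with a list-valued map.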
Example
Suppose we have the following XML input:
<bookstore>
<book lang="en">
<title>Harry Potter and the Philosopher's Stone</title>
<author>J. K. Rowling</author>
<price>10.99</price>
</book>
<book lang="fr">
<title>Le Petit Prince</title>
<author>Antoine de Saint-Exupéry</author>
<price>8.50</price>
</book>
</bookstore>
And the following set of simple xpaths:
/bookstore/book/title
/bookstore/book/author
/bookstore/book[@lang='fr']/price
We can use the following Java code to match the streaming XML parser against the set of xpaths:
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class XPathMatcher {

    public static Map<String, String> match(InputStream xmlInput, Set<String> xpaths) throws Exception {
        // Create a map to store the xpaths and their values (null until a match is found)
        Map<String, String> map = new LinkedHashMap<>();
        for (String xpath : xpaths) {
            map.put(xpath, null);
        }
        // Create a SAXParser and a DefaultHandler to parse the XML input
        SAXParserFactory factory = SAXParserFactory.newInstance();
        SAXParser parser = factory.newSAXParser();
        DefaultHandler handler = new DefaultHandler() {
            // The stack of open elements (names and attributes), root first
            private final List<String> names = new ArrayList<>();
            private final List<Map<String, String>> attrs = new ArrayList<>();
            // The xpaths matched by the innermost open element; empty means "do not extract"
            private List<String> matching = Collections.emptyList();

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attributes) {
                // Push the element name and a copy of its attributes (SAX may reuse the Attributes object)
                names.add(qName);
                Map<String, String> copy = new HashMap<>();
                for (int i = 0; i < attributes.getLength(); i++) {
                    copy.put(attributes.getQName(i), attributes.getValue(i));
                }
                attrs.add(copy);
                // Check which xpaths match the current path exactly
                matching = new ArrayList<>();
                for (String xpath : map.keySet()) {
                    if (matches(xpath)) {
                        matching.add(xpath);
                    }
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                // Pop the element from the stack and stop extracting
                names.remove(names.size() - 1);
                attrs.remove(attrs.size() - 1);
                matching = Collections.emptyList();
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Append the character data to the value of every matching xpath.
                // Text may arrive in several chunks, and repeated elements are concatenated.
                String text = new String(ch, start, length);
                for (String xpath : matching) {
                    map.merge(xpath, text, String::concat);
                }
            }

            // Compare the xpath step by step with the open elements.
            // Assumes steps of the form name or name[@attr='value'], with no '/' inside the value.
            private boolean matches(String xpath) {
                String[] steps = xpath.substring(1).split("/");
                if (steps.length != names.size()) {
                    return false;
                }
                for (int i = 0; i < steps.length; i++) {
                    String step = steps[i];
                    int bracket = step.indexOf("[@");
                    String name = bracket < 0 ? step : step.substring(0, bracket);
                    if (!name.equals(names.get(i))) {
                        return false;
                    }
                    if (bracket >= 0) {
                        int eq = step.indexOf('=', bracket);
                        String attrName = step.substring(bracket + 2, eq);
                        // Strip the opening quote and the closing quote plus bracket
                        String attrValue = step.substring(eq + 2, step.length() - 2);
                        if (!attrValue.equals(attrs.get(i).get(attrName))) {
                            return false;
                        }
                    }
                }
                return true;
            }
        };
        // Parse the XML input
        parser.parse(xmlInput, handler);
        // Return the map with the xpaths and their values
        return map;
    }

    public static void main(String[] args) throws Exception {
        // Create a set of simple xpaths (LinkedHashSet keeps the output order stable)
        Set<String> xpaths = new LinkedHashSet<>();
        xpaths.add("/bookstore/book/title");
        xpaths.add("/bookstore/book/author");
        xpaths.add("/bookstore/book[@lang='fr']/price");
        // Create an input stream from the XML file and match it against the set of xpaths
        try (InputStream xmlInput = new FileInputStream("bookstore.xml")) {
            Map<String, String> map = match(xmlInput, xpaths);
            // Print the results
            for (Map.Entry<String, String> entry : map.entrySet()) {
                System.out.println(entry.getKey() + " = " + entry.getValue());
            }
        }
    }
}
The output of the code is:
/bookstore/book/title = Harry Potter and the Philosopher's StoneLe Petit Prince
/bookstore/book/author = J. K. RowlingAntoine de Saint-Exupéry
/bookstore/book[@lang='fr']/price = 8.50

Related

How to read values inside a tag using STAX parser in JAVA

I have a xml like below.
<user VERSION_NO="1">
<userCompany QTAG="30000-9" LITERAL="Pharmaxy Group" CA_ID="33">PG</userCompany></user>
where "user " is my parent USER DTO and in the DTO I have attributes like "userCompany".
I am hitting a webservice(soap) , where I get the response as above. Based on the "Literal" value I need to do perform some business logic and set to my USER DTO.
So how to read the "LITERAL" value using STAX ?
First, you need to know some technical terms. In XML, your userCompany is an element, while LITERAL is an attribute of that element.
Using StAX, you first get a reader from XMLInputFactory. I prefer XMLEventReader. This reader can iterate over all the events in the given XML. Once you have found the wanted StartElement (userCompany in your case), you can get the value of the attribute named "LITERAL" from it.
QNames are used instead of plain Strings for naming XML elements and attributes, because a QName can take namespaces into account.
Example:
import java.io.StringReader;
import javax.xml.stream.*;
import javax.xml.stream.events.*;
import javax.xml.namespace.QName;
public class StAXGetAttributeValue {
static String getAttributeValue(StartElement startElement, QName attributeName) {
Attribute attribute = startElement.getAttributeByName(attributeName);
String attributeValue = attribute.getValue();
return attributeValue;
}
public static void main (String args[]) throws Exception {
String xmlString = "<user VERSION_NO=\"1\"><userCompany QTAG=\"30000-9\" LITERAL=\"Pharmaxy Group\" CA_ID=\"33\">PG</userCompany></user>";
XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xmlString));
while (reader.hasNext()) {
XMLEvent event = reader.nextEvent();
if(event.isStartElement()) {
StartElement startElement = (StartElement)event;
QName startElementName = startElement.getName();
if("userCompany".equals(startElementName.getLocalPart())) {
String valueOf_LITERAL_Attribute = getAttributeValue(startElement, QName.valueOf("LITERAL"));
System.out.println(valueOf_LITERAL_Attribute); //prints Pharmaxy Group
}
}
}
}
}
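As a possible variation, the same lookup can be sketched with the cursor-style XMLStreamReader, which avoids allocating event objects and returns null instead of throwing when the attribute is missing. The class and method names below are made up for illustration; passing null as the namespace URI to getAttributeValue means the namespace is not checked.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper illustrating the StAX cursor API.
class StAXCursorAttribute {
    // Returns the attribute value on the first element with the given local name,
    // or null if the element or the attribute is not present.
    static String firstAttribute(String xml, String elementName, String attributeName) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && elementName.equals(reader.getLocalName())) {
                    // A null namespace URI means "do not check the namespace"
                    return reader.getAttributeValue(null, attributeName);
                }
            }
            return null;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<user VERSION_NO=\"1\"><userCompany QTAG=\"30000-9\" "
                + "LITERAL=\"Pharmaxy Group\" CA_ID=\"33\">PG</userCompany></user>";
        System.out.println(firstAttribute(xml, "userCompany", "LITERAL"));
    }
}
```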

Java Regular Expression to Remove tags from html

<table><tr><td>HEADER</td><td>Header Value <supporting value></td></tr><tr><td>SUB</td><td>sub value. write to <test#gmail.com></td></tr><tr><td>START DATE</td><td>11/23/ 2016</td></tr><tr><td>END DATE</td><td>11/23/2016</td></tr></table>
The above text is my html String, need to extract values for HEADER, SUB,START DATE and END DATE. I used Jsoup to extract the values but I have issues with non-html element tags. The API either skips these elements OR adds an ending tag which was not there in the first place.
So my idea is to replace the angle brackets of the non-HTML tags with &lt; and &gt; and then use Jsoup to extract the values.
Any suggestions??
You may wish to use jsoup to parse HTML documents. You can extract and manipulate data using this API.
You can extract the content using this regex:
/<td>[^<]*<([^>]*)><\/td>/
assuming the markup layout always looks the same.
Although you can't parse a complete HTML document using a regular expression, because HTML is not a regular language, a partial extraction like this is in fact possible.
Found the solution: I got all the tags from the HTML String using the pattern <([^\s>/]+)
Then I replaced the angle brackets of all the tags except for TABLE, TR and TD with "&lt;" and "&gt;". When I parse the text using Jsoup I get the desired value.
Please find the code below,
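import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
//Files and Charsets come from Guava
import com.google.common.base.Charsets;
import com.google.common.io.Files;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;
import org.jsoup.select.Elements;
public class JsoupParser2 {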
public static void main(String args[]) {
String orginalString, replaceString = null;
HashSet<String> tagSet = new HashSet<String>();
HashMap<String,String> notes = new HashMap<String,String>();
Document document = null;
try{
//Read the html content as String
File testFile = new File("C:\\test.html");
List<String> content = Files.readLines(testFile, Charsets.UTF_8);
String testContent = content.get(0);
//Get all the tags present in the html content
Pattern p = Pattern.compile("<([^\\s>/]+)");
Matcher m = p.matcher(testContent);
while(m.find()) {
String tag = m.group(1);
tagSet.add(tag);
}
//Escape the angle brackets of the tags that are not html
for(String replaceTag : tagSet){
if(!"table".equals(replaceTag) && !"tr".equals(replaceTag) && !"td".equals(replaceTag)){
orginalString = "<"+replaceTag+">";
replaceString = "&lt;"+replaceTag+"&gt;";
//replace() treats the tag literally; replaceAll() would interpret it as a regex
testContent = testContent.replace(orginalString, replaceString);
}
}
//Parse the html content
document = Jsoup.parse(testContent, "", Parser.xmlParser());
//traverse through TR and TD to get to the values
//store the values in the map
Elements pTags = document.select("tr");
for (Element tag : pTags) {
if(!tag.getElementsByTag("td").isEmpty()){
String key = tag.getElementsByTag("td").get(0).text().trim();
String value = tag.getElementsByTag("td").get(1).html().trim();
System.out.println("KEY : "+key); System.out.println("VALUE : "+value);
notes.put(key, value);
System.out.println("==============================================");
}
}
}catch (IOException e) {
e.printStackTrace();
}catch(IndexOutOfBoundsException ioobe){
System.out.println("ioobe");
}
}
}
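The escaping step can also be sketched as a single regex pass that rewrites every tag whose name is not on a whitelist, including tags that carry extra text such as <supporting value>. This is only an illustration: PseudoTagEscaper and KNOWN_TAGS are hypothetical names, the whitelist is assumed to be table, tr and td as in the code above, and it relies on Matcher.replaceAll with a function argument (Java 9 or later).

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: escapes the angle brackets of every tag not on the whitelist.
class PseudoTagEscaper {
    private static final Set<String> KNOWN_TAGS = Set.of("table", "tr", "td");
    // Group 1: optional slash, group 2: tag name, group 3: anything else up to '>'
    private static final Pattern TAG = Pattern.compile("<(/?)([^\\s>/]+)([^>]*)>");

    static String escapeUnknownTags(String html) {
        return TAG.matcher(html).replaceAll(m -> {
            String tag = m.group();
            String name = m.group(2).toLowerCase(Locale.ROOT);
            if (!KNOWN_TAGS.contains(name)) {
                // Keep the inner text, replace only the surrounding brackets
                tag = "&lt;" + tag.substring(1, tag.length() - 1) + "&gt;";
            }
            // quoteReplacement protects '$' and '\' in the replacement text
            return Matcher.quoteReplacement(tag);
        });
    }

    public static void main(String[] args) {
        System.out.println(escapeUnknownTags("<td>Header Value <supporting value></td>"));
    }
}
```

The escaped string can then be handed to Jsoup as in the code above.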

Match specific html attribute values

I would like to match all attribute values for id, class, name and for! I created a simple function for that task.
private Collection<String> getAttributes(final String htmlContent) {
final Set<String> attributes = new HashSet<>();
final Pattern pattern = Pattern.compile("(class|id|for|name)=\\\"(.*?)\\\"");
final Matcher matcher = pattern.matcher(htmlContent);
while (matcher.find()) {
attributes.add(matcher.group(2));
}
return attributes;
}
Example html content:
<input id="test" name="testName" class="aClass bClass" type="input" />
How can I split html classes via regular expression, so that I get the following result set:
test
testName
aClass
bClass
And is there any way to improve my code? I really don't like the loop.
If you take a look at the JSoup library you can find useful tools for html parsing and manipulation.
For example:
Document doc = ...//create HTML document
Elements htmlElements = doc.children();
htmlElements.traverse(new MyHtmlElementVisitor());
The class MyHtmlElementVisitor simply has to implement NodeVisitor and can access the Node attributes.
Though you might find a good regex for the same job, it has several drawbacks. Just to name a few:
hard to find a failsafe regex for every possible html document
hard to read, therefore difficult to find bugs and implement changes
the regex usually isn't reusable
Don't use regular expressions for parsing HTML. Seriously, it's more complicated than you think.
If your document is actually XHTML, you can use XPath:
XPath xpath = XPathFactory.newInstance().newXPath();
NodeList nodes = (NodeList) xpath.evaluate(
"//@*["
+ "local-name()='class'"
+ " or local-name()='id'"
+ " or local-name()='for'"
+ " or local-name()='name'"
+ "]",
new InputSource(new StringReader(htmlContent)),
XPathConstants.NODESET);
int count = nodes.getLength();
for (int i = 0; i < count; i++) {
Collections.addAll(attributes,
nodes.item(i).getNodeValue().split("\\s+"));
}
If it's not XHTML, you can use Swing's HTML parsing:
HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
private final Object[] attributesOfInterest = {
HTML.Attribute.CLASS,
HTML.Attribute.ID,
"for",
HTML.Attribute.NAME,
};
private void addAttributes(AttributeSet attr) {
for (Object a : attributesOfInterest) {
Object value = attr.getAttribute(a);
if (value != null) {
Collections.addAll(attributes,
value.toString().split("\\s+"));
}
}
}
@Override
public void handleStartTag(HTML.Tag tag,
MutableAttributeSet attr,
int pos) {
addAttributes(attr);
super.handleStartTag(tag, attr, pos);
}
@Override
public void handleSimpleTag(HTML.Tag tag,
MutableAttributeSet attr,
int pos) {
addAttributes(attr);
super.handleSimpleTag(tag, attr, pos);
}
};
HTMLDocument doc = (HTMLDocument)
new HTMLEditorKit().createDefaultDocument();
doc.getParser().parse(new StringReader(htmlContent), callback, true);
As for doing it without a loop, I don't think that's possible. But any implementation is going to use one or more loops internally anyway.
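If you stay with the regex from the question despite the caveats above, the splitting itself can be sketched by breaking each captured value on whitespace. This is a minimal illustration, not a robust HTML parser: the class name AttributeValues is made up, and a leading \b is added as an assumption so that attributes such as xname= are not matched by accident.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: the question's regex plus a whitespace split of each value.
class AttributeValues {
    private static final Pattern ATTR = Pattern.compile("\\b(class|id|for|name)=\"(.*?)\"");

    static Set<String> getAttributes(String htmlContent) {
        // LinkedHashSet keeps the values in document order and drops duplicates
        Set<String> values = new LinkedHashSet<>();
        Matcher matcher = ATTR.matcher(htmlContent);
        while (matcher.find()) {
            for (String token : matcher.group(2).trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    values.add(token);
                }
            }
        }
        return values;
    }

    public static void main(String[] args) {
        String html = "<input id=\"test\" name=\"testName\" class=\"aClass bClass\" type=\"input\" />";
        System.out.println(getAttributes(html));
    }
}
```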

DOM4J Parse not returning any child nodes

I am attempting to begin writing a program which uses DOM4j with which I wish to parse a XML file, save it to some tables and finally allow the user to manipulate the data.
Unfortunately I am stuck on the most basic step, the parsing.
Here is the portion of my XML I am attempting to include:
<?xml version="1.0"?>
<Document xmlns="urn:iso:std:iso:20022:tech:xsd:camt.054.001.04">
<BkToCstmrDbtCdtNtfctn>
<GrpHdr>
<MsgId>000022222</MsgId>
When I attempt to find the root of my XML it does return the root correctly as "Document". When I attempt to get the child node from Document it also correctly gives me "BkToCstmrDbtCdtNtfctn". The problem is that when I try to go any further and get the child nodes from "Bk" I can't. I get this in the console:
org.dom4j.tree.DefaultElement@2b05039f [Element: <BkToCstmrDbtCdtNtfctn uri: urn:iso:std:iso:20022:tech:xsd:camt.054.001.04 attributes: []/>]
Here is my code, I would appreciate any feedback. Ultimately I want to get the "MsgId" attribute back but in general I just want to figure how to parse deeper into the XML because in reality it probably has about 25 layers.
public static Document getDocument(final String xmlFileName){
Document document = null;
SAXReader reader = new SAXReader();
try{
document = reader.read(xmlFileName);
}
catch (DocumentException e)
{
e.printStackTrace();
}
return document;
}
public static void main(String args[]){
String xmlFileName = "C:\\Users\\jhamric\\Desktop\\Camt54.xml";
String xPath = "//Document";
Document document = getDocument(xmlFileName);
Element root = document.getRootElement();
List<Node> nodes = document.selectNodes(xPath);
for(Iterator i = root.elementIterator(); i.hasNext();){
Element element = (Element) i.next();
System.out.println(element);
}
for(Iterator i = root.elementIterator("BkToCstmrDbtCdtNtfctn");i.hasNext();){
Element bk = (Element) i.next();
System.out.println(bk);
}
}
}
The best approach is probably to use XPath, but since the XML document uses namespaces, you cannot use the "simple" selectNodes methods in the API. I would create a helper method to easily evaluate any XPath expression on either the Document or the Element level:
public static void main(String[] args) throws Exception {
Document doc = getDocument(...);
Map<String, String> namespaceContext = new HashMap<>();
namespaceContext.put("ns", "urn:iso:std:iso:20022:tech:xsd:camt.054.001.04");
// Select the first GrpHdr element in document order
Element element = (Element) select("//ns:GrpHdr[1]", doc, namespaceContext);
System.out.println(element.asXML());
// Select the text content of the MsgId element
Text msgId = (Text) select("./ns:MsgId/text()", element, namespaceContext);
System.out.println(msgId.getText());
}
static Object select(String expression, Branch contextNode, Map<String, String> namespaceContext) {
XPath xp = contextNode.createXPath(expression);
xp.setNamespaceURIs(namespaceContext);
return xp.evaluate(contextNode);
}
Note that the XPath expression must use namespace prefixes that are mapped to the namespace URIs used in the input document, but the actual value of the prefix doesn't matter.
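The point about prefixes can be illustrated without dom4j, using only the JDK's javax.xml.xpath API. This is a sketch rather than a replacement for the dom4j code: the class name NamespacedXPath is hypothetical, and the prefix x is chosen arbitrarily to show that only the URI mapping has to agree with the document.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: namespace-aware XPath with the JDK only.
class NamespacedXPath {
    static final String CAMT = "urn:iso:std:iso:20022:tech:xsd:camt.054.001.04";

    static String msgId(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // without this, prefixed steps never match
        Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(new NamespaceContext() {
            @Override
            public String getNamespaceURI(String prefix) {
                return "x".equals(prefix) ? CAMT : XMLConstants.NULL_NS_URI;
            }

            @Override
            public String getPrefix(String uri) {
                return CAMT.equals(uri) ? "x" : null;
            }

            @Override
            public Iterator<String> getPrefixes(String uri) {
                return CAMT.equals(uri)
                        ? Collections.singletonList("x").iterator()
                        : Collections.<String>emptyIterator();
            }
        });
        // Every step needs the prefix, because the default namespace applies to all elements
        return xpath.evaluate("/x:Document/x:BkToCstmrDbtCdtNtfctn/x:GrpHdr/x:MsgId", doc);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<Document xmlns=\"" + CAMT + "\"><BkToCstmrDbtCdtNtfctn><GrpHdr>"
                + "<MsgId>000022222</MsgId></GrpHdr></BkToCstmrDbtCdtNtfctn></Document>";
        System.out.println(msgId(xml));
    }
}
```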

using xpath in java to go through this list?

I have an XML file that contains lots of different nodes. Some in particular are nested like this:
<emailAddresses>
<emailAddress>
<value>sambj1981#gmail.com</value>
<typeSource>WORK</typeSource>
<typeUser></typeUser>
<primary>false</primary>
</emailAddress>
<emailAddress>
<value>sambj#hotmail.co.uk</value>
<typeSource>HOME</typeSource>
<typeUser></typeUser>
<primary>true</primary>
</emailAddress>
</emailAddresses>
From the above node, what i want to do is go through each and get the values inside it(value, typeSource, typeUser etc) and put them in a POJO.
I tried to see if I can use the xpath expression "//emailAddress", but it doesn't return the tags inside it. Maybe I am doing it wrong. I am pretty new to using xpath.
i could do something like this:
//emailAddress/value | //emailAddress/typeSource | .. but doing that will list all elements values together if im not mistaken leaving me to work out when i have finished reading from a specific emailAddress tag and going to the next emailAddress tag.
Well, to sum up my needs: I basically want this to be returned similar to how a bog-standard SQL query returns results in rows, i.e. if your SQL query produces 10 emailAddress rows, it will return each emailAddress as a row and I can simply iterate over each emailAddress and get the appropriate value based on the column name or index.
No,
//emailAddress
doesn't return the tags inside, that is correct. What it does return is a NodeList/NodeSet. To actually get the values you can do something like this:
String emailpath = "//emailAddress";
String emailvalue = ".//value";
XPathFactory xPathFactory = XPathFactory.newInstance();
XPath xpath = xPathFactory.newXPath();
Document document;
public XpathStuff(String file) throws ParserConfigurationException, IOException, SAXException {
DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = docFactory.newDocumentBuilder();
BufferedInputStream bis = new BufferedInputStream(new FileInputStream(file));
document = builder.parse(bis);
NodeList nodeList = getNodeList(document, emailpath);
for(int i = 0; i < nodeList.getLength(); i++){
System.out.println(getValue(nodeList.item(i), emailvalue));
}
bis.close();
}
public NodeList getNodeList(Document doc, String expr) {
try {
XPathExpression pathExpr = xpath.compile(expr);
return (NodeList) pathExpr.evaluate(doc, XPathConstants.NODESET);
} catch (XPathExpressionException e) {
e.printStackTrace();
}
return null;
}
//extracts the String value for the given expression
private String getValue(Node n, String expr) {
try {
XPathExpression pathExpr = xpath.compile(expr);
return (String) pathExpr.evaluate(n,
XPathConstants.STRING);
} catch (XPathExpressionException e) {
e.printStackTrace();
}
return null;
}
Maybe I should point out that when iterating over the NodeList, the leading dot in .//value means the current context node. Without the dot, the expression would be evaluated from the document root, so you would get the first node every time.
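To get the row-like result the question describes, the same relative-expression idea can be used to fill one object per emailAddress node. The sketch below is an assumption-laden illustration: the EmailRows class and the EmailAddress POJO are hypothetical, with field names mirroring the child elements in the question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: one POJO per emailAddress element, like rows of a query result.
class EmailRows {
    static final class EmailAddress {
        String value;
        String typeSource;
        String typeUser;
        boolean primary;
    }

    static List<EmailAddress> read(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList rows = (NodeList) xpath.evaluate("//emailAddress", doc, XPathConstants.NODESET);
        List<EmailAddress> result = new ArrayList<>();
        for (int i = 0; i < rows.getLength(); i++) {
            Node row = rows.item(i);
            EmailAddress e = new EmailAddress();
            // Relative expressions are evaluated against the current emailAddress node
            e.value = xpath.evaluate("value", row);
            e.typeSource = xpath.evaluate("typeSource", row);
            e.typeUser = xpath.evaluate("typeUser", row);
            e.primary = Boolean.parseBoolean(xpath.evaluate("primary", row));
            result.add(e);
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<emailAddresses><emailAddress><value>a@example.com</value>"
                + "<typeSource>WORK</typeSource><typeUser></typeUser><primary>false</primary>"
                + "</emailAddress></emailAddresses>";
        for (EmailAddress e : read(xml)) {
            System.out.println(e.value + " " + e.typeSource + " " + e.primary);
        }
    }
}
```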
//emailAddress/*
will get these nodes in the document order.
It depends on how you want to iterate through the nodes. We do all our XML using XOM (http://www.xom.nu/) which is an easy reliable Java package. It's possible to write your own strategy using XOM calls.
If you use XStream you can set it up quite easily. Like so:
@XStreamAlias("emailAddress")
public class EmailAddress {
@XStreamAlias("value")
private String value;
@XStreamAlias("typeSource")
private String typeSource;
@XStreamAlias("typeUser")
private String typeUser;
@XStreamAlias("primary")
private boolean primary;
// ... the rest omitted for brevity
}
You then marshal & unmarshal quite simply like so:
XStream xstream = new XStream();
xstream.processAnnotations( EmailAddress.class );
xstream.toXML( /* Object value here */ emailAddress );
xstream.fromXML( /* String xml value here */ "" );
IDK if you have to use XPath or not, but if not I'd consider an out of the box solution like this.
I am totally aware this is not what you were asking for, but you may consider using jibx. This is a tool for human-readable XML to POJO mapping.
So I believe you could generate mapping for your email structure in a quick way and let the jibx do the work for you.
