Read sitemap with XPath - java

I want to read a sitemap with XPath, but it doesn't work.
Here is my code:
private void evaluate2(String src){
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
try{
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse(new ByteArrayInputStream(src.getBytes()));
System.out.println(src);
XPathFactory xp_factory = XPathFactory.newInstance();
XPath xpath = xp_factory.newXPath();
XPathExpression expr = xpath.compile("//url/loc");
Object result = expr.evaluate(doc, XPathConstants.NODESET);
NodeList nodes = (NodeList) result;
System.out.println(nodes.getLength());
for (int i = 0; i < nodes.getLength(); i++) {
items.add(nodes.item(i).getNodeValue());
System.out.println(nodes.item(i).toString());
}
}catch(Exception e){
System.out.println(e.getMessage());
}
}
Before this, I retrieve the remote source of the sitemap and pass it to evaluate2 through the variable src.
The call System.out.println(nodes.getLength()); displays 0.
I believe my XPath query is correct, because the same query works in PHP.
Do you see any errors in my code?
Thanks

You parse the sitemap with a namespace-aware parser (that's what factory.setNamespaceAware(true) does), but then attempt to access it using an XPath that does not use a namespace resolver (or reference any namespaces).
The simplest solution is to configure the parser as not namespace aware. As long as you're just parsing a self-contained sitemap, that shouldn't be a problem.
One more problem in your code is that you pass the sitemap contents as a String, then convert that String using the platform default encoding. This will work as long as your platform-default encoding matches that of the actual bytes that you retrieved from the server (assuming that you also created the string using the platform-default encoding). If it doesn't, you're likely to get a conversion error.
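A minimal sketch of both suggestions, assuming the sitemap is available as raw bytes rather than as a String. The class name SitemapLocs and the helper extractLocs are made up for illustration. The parser is left non-namespace-aware so the unprefixed //url/loc matches, and the bytes go straight to the parser so it can pick the encoding from the XML declaration itself.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class SitemapLocs {

    // Hypothetical helper: returns the text of every <loc> inside a <url>.
    static List<String> extractLocs(byte[] sitemapBytes) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(false); // this is the default, stated explicitly
        DocumentBuilder builder = factory.newDocumentBuilder();
        // Handing over bytes lets the parser detect the encoding on its own.
        Document doc = builder.parse(new ByteArrayInputStream(sitemapBytes));

        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate("//url/loc", doc, XPathConstants.NODESET);

        List<String> locs = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            // getNodeValue() is null for element nodes; getTextContent() returns their text.
            locs.add(nodes.item(i).getTextContent());
        }
        return locs;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">"
                + "<url><loc>http://example.com/a</loc></url>"
                + "<url><loc>http://example.com/b</loc></url>"
                + "</urlset>";
        System.out.println(extractLocs(xml.getBytes(StandardCharsets.UTF_8)));
    }
}
```

Note that the loop uses getTextContent() rather than getNodeValue(), since the latter returns null when called on an element.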

I think the input has a namespace, so you would have to set a NamespaceContext on the xpath object and use prefixes in your XPath expression, i.e. //url/loc should become //ns:url/ns:loc
and then add the namespace prefix binding to the namespace context object.
You can find a NamespaceContext implementation in Apache WS Commons: http://ws.apache.org/commons/util/apidocs/index.html
ws-commons-utils
NamespaceContextImpl namespaceContextObj = new NamespaceContextImpl();
namespaceContextObj.startPrefixMapping("ns", "http://sitename/xx");
xpath.setNamespaceContext(namespaceContextObj);
XPathExpression expr = xpath.compile("//ns:url/ns:loc");
In case you don't know which namespaces are coming, you can get them from the document itself, but I doubt it'll be of much use. There are a few how-tos here:
http://www.ibm.com/developerworks/xml/library/x-nmspccontext/index.html
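If adding a library only for the namespace context is not desirable, the interface is small enough to implement by hand. The following is a sketch using only JDK classes. It assumes the document uses the standard sitemap namespace http://www.sitemaps.org/schemas/sitemap/0.9, and the class and method names are invented for the example.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class SitemapNamespaceXPath {

    // Assumed namespace: the one the sitemap protocol declares on <urlset>.
    static final String SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9";

    // Hand-written context that knows a single binding: "ns" -> SITEMAP_NS.
    static final NamespaceContext SITEMAP_CONTEXT = new NamespaceContext() {
        @Override
        public String getNamespaceURI(String prefix) {
            return "ns".equals(prefix) ? SITEMAP_NS : XMLConstants.NULL_NS_URI;
        }

        @Override
        public String getPrefix(String namespaceURI) {
            return SITEMAP_NS.equals(namespaceURI) ? "ns" : null;
        }

        @Override
        public Iterator<String> getPrefixes(String namespaceURI) {
            String prefix = getPrefix(namespaceURI);
            return prefix == null
                    ? Collections.<String>emptyIterator()
                    : Collections.singletonList(prefix).iterator();
        }
    };

    // Counts the <loc> elements found with a prefixed, namespace-aware query.
    static int countLocs(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(SITEMAP_CONTEXT);
        NodeList nodes = (NodeList) xpath.evaluate("//ns:url/ns:loc", doc, XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<urlset xmlns=\"" + SITEMAP_NS + "\">"
                + "<url><loc>http://example.com/a</loc></url>"
                + "<url><loc>http://example.com/b</loc></url>"
                + "</urlset>";
        System.out.println(countLocs(xml));
    }
}
```

The prefix "ns" is arbitrary: only the URI has to match the one declared in the document.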

I can't see any errors in your code, so I guess the problem is the source.
Are you sure that the source file contains this element?
Maybe you could try this code to parse the String into a Document:
builder.parse(new InputSource(new StringReader(xml)));

Related

Trying to get the value of a tag in an xml string java

I have an xml string stored in a StringBuilder.
My XML was posted as two screenshots (not reproduced here): one of the overall document, and one of what it looks like inside the report tag.
I would like to get access to any tag value I want in the record tag, what I have is :
StringBuilder informationString = new StringBuilder();
Scanner scanner = new Scanner(url.openStream());
while (scanner.hasNext()) {
informationString.append(scanner.nextLine());
}
//Close the scanner
scanner.close();
System.out.println(informationString);
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document document = builder.parse(new InputSource(new StringReader(String.valueOf(informationString))));
Element rootElement = document.getDocumentElement();
But I do not know what to do with this, and I am very lost.
Thanks in advance for helping.
In general, you can use the below routine
Element documentElement=....
NodeList elmList=documentElement.getElementsByTagName("elementName");
Element e = (Element) elmList.item(x); // x is an index; put this in a loop over elmList
You can keep applying the above to the returned elements to walk further down the tree.
A better approach, though, would be to use XPath (Saxon has a decent XPath implementation, and there are many more libraries to choose from).
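As a rough illustration of the getElementsByTagName routine, here is a sketch that collects the text of every element with a given tag name. Since the real XML was only shown as a screenshot, the element names used in main (report, record, name) are placeholders, and the class TagValues is hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class TagValues {

    // Returns the trimmed text content of every element named tagName, in document order.
    static List<String> valuesOf(String xml, String tagName) throws Exception {
        Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Element root = document.getDocumentElement();

        // Searches all descendants of the root, not only its direct children.
        NodeList matches = root.getElementsByTagName(tagName);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < matches.getLength(); i++) {
            Element e = (Element) matches.item(i);
            values.add(e.getTextContent().trim());
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        // Placeholder document; substitute the real tag names.
        String xml = "<report>"
                + "<record><name>first</name></record>"
                + "<record><name>second</name></record>"
                + "</report>";
        System.out.println(valuesOf(xml, "name"));
    }
}
```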

Casting JDom 1.1.3 Element to Document without DocumentBuilderFactory or DocumentBuilder

I need to find the easiest and most efficient way to convert a JDOM element (with all its child nodes) to a Document. ownerDocument() won't work, as this is JDOM 1.
Moreover, an org.jdom.IllegalAddException: The Content already has an existing parent "root" occurs when using the following code.
DocumentBuilderFactory dbFac = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFac.newDocumentBuilder();
Document doc = null;
Element elementInfo = getElementFromDB();
doc = new Document(elementInfo);
XMLOutputter xmlOutput = new XMLOutputter();
byte[] byteInfo= xmlOutput.outputString(elementInfo).getBytes("UTF-8");
String stringInfo = new String(byteInfo);
doc = dBuilder.parse(stringInfo);
I think you have to use the following method of the element:
Document doc = <element>.getDocument();
Refer to the API documentation, which says:
Return this parent's owning document or null if the branch containing this parent is currently not attached to a document.
JDOM content can only have one parent at a time, and you have to detach it from one parent before you can attach it to another. Take this code:
Document doc = null;
Element elementInfo = getElementFromDB();
doc = new Document(elementInfo);
If that code is failing, it is because the getElementFromDB() method is returning an Element that is part of some other structure. You need to 'detach' it:
Element elementInfo = getElementFromDB();
elementInfo.detach();
Document doc = new Document(elementInfo);
OK, that solves the IllegalAddException
On the other hand, if you just want to get the document node containing the element, JDOM 1.1.3 allows you to do that with getDocument:
Document doc = elementInfo.getDocument();
Note that the doc may be null.
To get the top most element available, try:
Element top = elementInfo;
while (top.getParentElement() != null) {
top = top.getParentElement();
}
In your case, your elementInfo you get from the DB is a child of an element called 'root', something like:
<root>
<elementInfo> ........ </elementInfo>
</root>
That is why you get the message you do, with the word "root" in it:
The Content already has an existing parent "root"

Extract a node with its entire content from a namespaced xml

Given the following namespaced xml file:
<ptk:PrintTalk xmlns:ptk="http://linkToNameSpace" xmlns:xjdf="http://linkToNamespace">
<ptk:Request>
<ptk:PurchaseOrder Currency="EUR">
<xjdf:XJDF name="someName" version="2.0">
<xjdf:ProductList>
<xjdf:Product>
...
</xjdf:Product>
<xjdf:OtherProduct>
...
</xjdf:OtherProduct>
and many other products
</xjdf:ProductList>
<xjdf:ParameterSet>
<xjdf:Parameter>
...
</xjdf:Parameter> and so on until
</xjdf:XJDF>
</ptk:PurchaseOrder>
</ptk:Request>
</ptk:PrintTalk>
how would I extract following using XPath:
<xjdf:XJDF name="someName" version="2.0">
<xjdf:ProductList>
<xjdf:Product>
...
</xjdf:Product>
<xjdf:OtherProduct>
...
</xjdf:OtherProduct>
and many other products
</xjdf:ProductList>
<xjdf:ParameterSet>
<xjdf:Parameter>
...
</xjdf:Parameter> and so on until
</xjdf:XJDF>
I already tried something like:
/ptk:PrintTalk/ptk:Request/ptk:PurchaseOrder/*
or
//xjdf:XJDF
but these expressions do not give me the result I am looking for. I use IntelliJ IDEA's built-in XPath expression evaluator, and the programming language is Java. No libraries for XPath, just java.xml.*
UPDATE
using
//ptk:PurchaseOrder//*
I get every node as a single node without any child nodes inside, e. g. would
<xjdf:ProductList>
<xjdf:Product>
...
</xjdf:Product>
</xjdf:ProductList> (here the product tag is a child of product list tag)
result in
<xjdf:ProductList>
<xjdf:Product>
The java code I use to do the operation:
@Override
public XJDF readFrom(
final Class<XJDF> type, final Type genericType, final Annotation[] annotations, final MediaType mediaType,
final MultivaluedMap<String, String> multivaluedMap, final InputStream inputStream
) throws IOException {
try {
DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder documentBuilder = documentBuilderFactory.newDocumentBuilder();
Document documentPtk = documentBuilder.parse(new InputSource(inputStream));
XPathFactory xPathFactory = XPathFactory.newInstance();
XPath xPath = xPathFactory.newXPath();
XPathExpression xPathExpression = xPath.compile("//ptk:PurchaseOrder//*");
Document documentXjdf = (Document) xPathExpression.evaluate(documentPtk, XPathConstants.NODE);
} catch (Exception e) {
throw new WebApplicationException("PrintTalk document could not be deserialized.", e);
}
}
Three main points to make here:
DocumentBuilderFactory is not namespace-aware by default; you must explicitly switch on namespaces before you create the DocumentBuilder.
XPath doesn't use the namespace prefix mappings from the XML document; it uses its own NamespaceContext instead.
The Node returned by this query won't be a Document; it'll be an Element.
Annoyingly there's no default implementation of NamespaceContext in the Java core class library so you have to either use a third party one (I usually use the SimpleNamespaceContext from Spring) or write your own implementation of the interface.
Here's an example using SimpleNamespaceContext:
DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
documentBuilderFactory.setNamespaceAware(true);
DocumentBuilder documentBuilder = documentBuilderFactory.newDocumentBuilder();
Document documentPtk = documentBuilder.parse(new InputSource(inputStream));
XPathFactory xPathFactory = XPathFactory.newInstance();
XPath xPath = xPathFactory.newXPath();
SimpleNamespaceContext nsCtx = new SimpleNamespaceContext();
nsCtx.bindNamespaceUri("p", "http://linkToNameSpace");
xPath.setNamespaceContext(nsCtx);
XPathExpression xPathExpression = xPath.compile("/p:PrintTalk/p:Request/p:PurchaseOrder/*");
Element documentXjdf = (Element) xPathExpression.evaluate(documentPtk, XPathConstants.NODE);
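If depending on Spring is not an option, the same thing can be done with a hand-written NamespaceContext. The sketch below uses only JDK classes and the placeholder namespace URI from the question. It also serializes the selected element with an identity Transformer, to show that the returned Element still carries its whole subtree. XjdfExtractor and extractXjdf are illustrative names, not part of any library.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class XjdfExtractor {

    // Placeholder URI taken from the question's sample document.
    static final String XJDF_NS = "http://linkToNamespace";

    // Returns the first xjdf:XJDF element, with all its children, as an XML string (or null).
    static String extractXjdf(String printTalkXml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder()
                .parse(new InputSource(new StringReader(printTalkXml)));

        XPath xPath = XPathFactory.newInstance().newXPath();
        xPath.setNamespaceContext(new NamespaceContext() {
            @Override
            public String getNamespaceURI(String prefix) {
                return "xjdf".equals(prefix) ? XJDF_NS : XMLConstants.NULL_NS_URI;
            }

            @Override
            public String getPrefix(String uri) {
                return XJDF_NS.equals(uri) ? "xjdf" : null;
            }

            @Override
            public Iterator<String> getPrefixes(String uri) {
                String p = getPrefix(uri);
                return p == null
                        ? Collections.<String>emptyIterator()
                        : Collections.singletonList(p).iterator();
            }
        });

        Element xjdf = (Element) xPath.evaluate("//xjdf:XJDF", doc, XPathConstants.NODE);
        if (xjdf == null) {
            return null;
        }

        // The identity transform writes the element together with its descendants.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(xjdf), new StreamResult(out));
        return out.toString();
    }
}
```

Selecting the single XJDF element with XPathConstants.NODE, rather than every descendant with //*, is what keeps the children nested inside the result.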

How to parse large SOAP response

I have a large SOAP response that I want to process and store in a database. I'm trying to process the whole thing as a Document, as below:
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setCoalescing(true);
DocumentBuilder db = dbf.newDocumentBuilder();
InputStream is = new ByteArrayInputStream(resp.getBytes());
Document doc = db.parse(is);
XPathFactory xPathfactory = XPathFactory.newInstance();
XPath xpath = xPathfactory.newXPath();
XPathExpression expr = xpath.compile(fetchResult);
String result = (String) expr.evaluate(doc, XPathConstants.STRING);
resp is the SOAP response and fetchResult is
String fetchResult = "//result/text()";
I'm getting an out-of-memory exception with this approach, so I am trying to process the document as a stream rather than consuming the entire response as a Document.
But I cannot come up with the code.
Could any of you please help me out?
DOM and JDOM are memory-hungry parsing APIs: they build a tree of the whole XML document in memory. You should use StAX or SAX instead, because they read the document as a stream and never hold all of it in memory at once.
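As a sketch of the StAX route, the following reads the stream event by event and returns the text of the first result element, which is roughly what //result/text() evaluated as a string would give. It assumes that result contains only text (getElementText() throws if it meets a child element), and the class name StreamingResultReader is made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StreamingResultReader {

    // Returns the text of the first <result> element, or null if there is none.
    static String firstResult(InputStream in) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && "result".equals(reader.getLocalName())) {
                    // Assumption: <result> holds text only, no nested elements.
                    return reader.getElementText();
                }
            }
            return null;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws Exception {
        String soap = "<soap:Envelope xmlns:soap=\"http://schemas.xmlsoap.org/soap/envelope/\">"
                + "<soap:Body><response><result>payload</result></response></soap:Body>"
                + "</soap:Envelope>";
        InputStream in = new ByteArrayInputStream(soap.getBytes(StandardCharsets.UTF_8));
        System.out.println(firstResult(in));
    }
}
```

To benefit from streaming, pass the response's InputStream directly instead of first collecting the whole response into a String.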
If this is in Java, you could try dom4j, which has a convenient way of reading XML using XPath expressions.
Additionally, dom4j provides an event-based model for processing XML documents. Using this event-based model allows us to prune the XML tree when parts of the document have been successfully processed, avoiding having to keep the entire document in memory.
Suppose you need to process a very large XML file that is generated externally by some database process and looks something like the following (where N is a very large number):
<ROWSET>
<ROW id="1">
...
</ROW>
<ROW id="2">
...
</ROW>
...
<ROW id="N">
...
</ROW>
</ROWSET>
To process each <ROW> individually, you can do the following:
// enable pruning mode to call me back as each ROW is complete
SAXReader reader = new SAXReader();
reader.addHandler( "/ROWSET/ROW",
new ElementHandler() {
public void onStart(ElementPath path) {
// do nothing here...
}
public void onEnd(ElementPath path) {
// process a ROW element
Element row = path.getCurrent();
Element rowSet = row.getParent();
Document document = row.getDocument();
...
// prune the tree
row.detach();
}
}
);
Document document = reader.read(url);
// The document will now be complete but all the ROW elements
// will have been pruned.
// We may want to do some final processing now
...
Please see "How dom4j handle very large XML documents?" to understand how it works.
Moreover, dom4j works with any SAX parser via JAXP.
For more details, see "What XML parser does dom4j use?"
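The XPath and XPathExpression interfaces have evaluate methods that accept an InputSource argument, so you do not have to build the Document yourself. Be aware that an implementation may still parse the whole source into an in-memory tree before evaluating, so this does not necessarily reduce memory use.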
InputStream input = ...;
InputSource source = new InputSource(input);
XPathFactory factory = XPathFactory.newInstance();
XPath xpath = factory.newXPath();
XPathExpression expr = xpath.compile("...");
String result = (String) expr.evaluate(source, XPathConstants.STRING);

XML Searching and Parsing

I have an XML file that I am trying to search using Java. I just need to find an element by its Tag name and then find that Tag's value. So for example:
I have this XML file:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="https://company.com/test/xslt/processing_report.xslt"?>
<Certificate xmlns="urn:us:net:exchangenetwork:Company">
<Value1>Veggie</Value1>
<Value2>Fruits</Value2>
<type1>Apple</type1>
<FindME>Red</FindME>
<Value3>Bread</Value3>
</Certificate>
I want to find the value inside of the FindME Tag. I can't use XPath because different files can have different structures, but they always have a FindME tag. Lastly I am looking for the simplest piece of code, I do not care much about performance. Thank you
Here is the code:
XPathFactory f = XPathFactory.newInstance();
XPathExpression expr = f.newXPath().compile(
"//*[local-name() = 'FindME']/text()");
DocumentBuilderFactory domFactory = DocumentBuilderFactory
.newInstance();
domFactory.setNamespaceAware(true);
DocumentBuilder builder = domFactory.newDocumentBuilder();
Document doc = builder.parse("src/test.xml"); //your XML file
Object result = expr.evaluate(doc, XPathConstants.NODESET);
NodeList nodes = (NodeList) result;
System.out.println(nodes.getLength());
for (int i = 0; i < nodes.getLength(); i++) {
System.out.println(nodes.item(i).getNodeValue());
}
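Explained:
//* - matches any element node, wherever it is in the document
[local-name() = 'FindME'] - keeps only the elements whose local name (the name without any namespace prefix or URI) is 'FindME'
/text() - selects the text nodes inside those elements; their node value is the text.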
I think you need to read up on XPath because it can very easily solve this problem. So can using getElementsByTagName in the DOM API.
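For the DOM route, a minimal sketch could look like the following. Because the sample document declares a default namespace, it uses getElementsByTagNameNS with "*" as the namespace so the lookup works whatever namespace the file uses. FindMeLookup and findValue are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class FindMeLookup {

    // Returns the text of the first element with the given local name, or null if absent.
    static String findValue(String xml, String localName) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed for the ...NS lookup below
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        // "*" matches elements in any namespace, wherever they sit in the tree.
        NodeList matches = doc.getElementsByTagNameNS("*", localName);
        return matches.getLength() == 0 ? null : matches.item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<Certificate xmlns=\"urn:us:net:exchangenetwork:Company\">"
                + "<Value1>Veggie</Value1>"
                + "<FindME>Red</FindME>"
                + "</Certificate>";
        System.out.println(findValue(xml, "FindME"));
    }
}
```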
You can still use XPath. All you need to do is use the expression //FindME. The leading // finds the FindME elements anywhere in the XML, irrespective of their parent or path from the root. (Element names are case-sensitive, and in a document with a default namespace, such as your sample, an unprefixed name only matches if the parser is not namespace-aware.)
If you are using namespaces, make sure the parser is aware of them:
String findMeVal = null;
InputStream is = //...
XmlPullParser parser = //...
parser.setFeature(XmlPullParser.FEATURE_PROCESS_NAMESPACES, true);
parser.setInput(is, null);
int event;
while (XmlPullParser.END_DOCUMENT != (event = parser.next())) {
if (event == XmlPullParser.START_TAG) {
if ("FindME".equals(parser.getName())) {
findMeVal = parser.nextText();
break;
}
}
}
