I am reading an XML document with dom4j and using XPath expressions to select the nodes I need. Suppose my XML looks like this:
<Emp_Dir>
<Emp_Classification type ="Permanent" >
<Emp id= "1">
<name>jame</name>
<Emp_Bio>
<age>12</age>
<height>5.4</height>
<weight>78</weight>
</Emp_Bio>
<Emp_Details>
<salary>2000</salary>
<designation>developer</designation>
</Emp_Details>
</Emp>
<Emp id= "2">
<name>jame</name>
<Emp_Bio>
<age>12</age>
<height>5.4</height>
<weight>78</weight>
</Emp_Bio>
<Emp_Details>
<salary>2000</salary>
<designation>developer</designation>
</Emp_Details>
</Emp>
</Emp_Classification>
<Emp_Classification type ="Contract" >
.
.
.
</Emp_Classification>
<Emp_Classification type ="PartTime" >
.
.
.
</Emp_Classification>
</Emp_Dir>
Note: The XML above may look ugly, but I created this dummy file only to illustrate the problem while keeping my project confidential.
When I specify a simple XPath expression, like:
//Emp_Classification (or)
/Emp_Dir/Emp_Classification
then it works fine, but when I specify a more complex expression like:
/Emp_Dir/Emp_Classification/[@type='Permanent'] (or)
//Emp_Dir/Emp_Classification/[@type='Permanent']
then it gives me the following error:
"Invalid XPath expression: /Emp_Dir/Emp_Classification/[@type='Permanent'] Expected one of '.', '..', '@', '*', <QName>"
Could anybody tell me what is wrong with my XPath?
My second question: how do I select the Emp_Bio nodes of Permanent employees only? Does this work?
//Emp_Dir/Emp_Classification/[@type='Permanent']/Emp/Emp_Bio
Use: //Emp_Dir/Emp_Classification[@type='Permanent']
(note the removal of the / before the predicate; a predicate attaches directly to the step it filters)
And then use this: //Emp_Dir/Emp_Classification[@type='Permanent']/Emp/Emp_Bio for the second part of the question.
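As a rough check of the two expressions, with the attribute test written as @type, here is a minimal sketch. It assumes the JDK's built-in javax.xml.xpath API rather than dom4j, and the class name, the helper countMatches and the trimmed-down sample document are all hypothetical; the XPath strings themselves are library-independent.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class PermanentBioDemo {
    // Hypothetical, shortened version of the employee directory from the question.
    static final String XML =
        "<Emp_Dir>"
      + "<Emp_Classification type='Permanent'>"
      + "<Emp id='1'><Emp_Bio><age>12</age></Emp_Bio></Emp>"
      + "<Emp id='2'><Emp_Bio><age>30</age></Emp_Bio></Emp>"
      + "</Emp_Classification>"
      + "<Emp_Classification type='Contract'>"
      + "<Emp id='3'><Emp_Bio><age>41</age></Emp_Bio></Emp>"
      + "</Emp_Classification>"
      + "</Emp_Dir>";

    // Parses the sample document and returns how many nodes the expression selects.
    static int countMatches(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // The predicate follows the element name directly, with no slash in between.
        System.out.println(countMatches("/Emp_Dir/Emp_Classification[@type='Permanent']"));
        System.out.println(countMatches("/Emp_Dir/Emp_Classification[@type='Permanent']/Emp/Emp_Bio"));
    }
}
```

With dom4j the same expression strings should be usable with its own node-selection methods; only the evaluation call differs.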
Related
I have the following XML structure
<CodeSnippet>
<Code id="code1">
<Tags>button java</Tags>
<Snippet>sample code</Snippet>
</Code>
<Code id="code2">
<Tags>eclipse jbutton java</Tags>
<Snippet>sample code</Snippet>
</Code>
<.....>
</CodeSnippet>
Now, I want to retrieve all the Snippet elements from the above XML when I search by tag. For instance, if I search for "java", then every Code node whose Tags contain "java" must return its Snippet.
My search query is:
//Code/Tags[contains(concat(' ',/text(),' '), ' "+ searchTags[0] +" ')]";
Here, searchTags[0] contains "java".
My result set should contain the Snippets of the selected nodes, i.e. code1 and code2 in the XML structure above.
Try this expression:
//Code/Tags[contains(., 'java')]/../Snippet
To retrieve all the "Tags" containing "java", I used the XPath expression below:
//Code/Tags/text()[contains(., 'java')]
To retrieve all the "Snippet" elements related to the tag "java", I used the expression below:
//Code/Tags[contains(./text(), 'java')]/parent::Code/Snippet/text()
Thanks to @dfsq for helping me out with his expression. Thanks a lot.
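One caveat with contains(., 'java') is that it also matches longer tags such as "javascript". The sketch below pads both the tag list and the search term with spaces so that only whole tags match, which is the idea behind the concat(' ', ..., ' ') attempt in the question. It assumes the JDK's javax.xml.xpath API; the class name, the snippetsFor helper, the $tag variable and the extra code3 entry are illustrative additions, not part of the original data.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class SnippetSearchDemo {
    // Sample data modelled on the question; code3 is a hypothetical near-miss entry.
    static final String XML =
        "<CodeSnippet>"
      + "<Code id='code1'><Tags>button java</Tags><Snippet>first</Snippet></Code>"
      + "<Code id='code2'><Tags>eclipse jbutton java</Tags><Snippet>second</Snippet></Code>"
      + "<Code id='code3'><Tags>javascript dom</Tags><Snippet>third</Snippet></Code>"
      + "</CodeSnippet>";

    // Whole-tag match: pad the normalized tag list and the search term with spaces.
    static final String EXPRESSION =
        "//Code[contains(concat(' ', normalize-space(Tags), ' '), concat(' ', $tag, ' '))]/Snippet";

    // Returns the text of every Snippet whose sibling Tags element lists the given tag.
    static List<String> snippetsFor(String tag) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Supplying the term as an XPath variable avoids quoting problems
            // that arise when it is concatenated into the expression string.
            xpath.setXPathVariableResolver(
                name -> "tag".equals(name.getLocalPart()) ? tag : null);
            NodeList nodes = (NodeList) xpath.evaluate(EXPRESSION, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Tag search failed for: " + tag, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(snippetsFor("java"));
        System.out.println(snippetsFor("button"));
    }
}
```

Filtering on Code and then stepping down to Snippet also avoids the trip up through the parent axis.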
You can write :
//Tags//Snippet
If you used XPath
I transform xml with the Saxon XSLT2 processor (using Java + the Saxon S9API) and have to deal with xml-documents as the source, which contain invalid characters as tag names and therefore can't be parsed by the document-builder.
Example:
<A>
<B />
<C>
<D />
</C>
<E!_RANDOM_ />
< />
</A>
Code:
import net.sf.saxon.s9api.*;
[...]
/* XSLT Processor & Compiler */
proc = new Processor(false);
/* build document from input*/
XdmNode source = proc.newDocumentBuilder().build(new StreamSource(input));
Error:
Error on line X column Y
SXXP0003: Error reported by XML parser: Element type
"E" must be followed by either attribute specifications, ">" or "/>".
Tag names containing an exclamation mark and tag names consisting only of a space are currently the only invalid tags I have.
I am looking for a more robust solution than simply removing whole lines of the (formatted) XML.
With some effort I could come up with a regular expression to identify the invalid strings, but I would struggle with removing nodes that contain attributes and child nodes.
Thank you for your help!
If the input contains invalid tags then it is not XML. It's best to get your mind-set right by referring to these as non-XML documents rather than XML documents; that helps to make it clear that to process non-XML documents, you need non-XML tools. (Forget about "nodes" - there are no nodes until the document has been parsed, and it can't be parsed until you have turned it into well-formed XML). To turn non-XML into XML, you will typically want to use non-XML tools that are good at text manipulation, such as Perl. Of course, it's much better to fix the problem at source: all the benefits of XML are lost if people generate data in private non-XML formats.
I'm having big problems with XPath evaluation using Jaxen.
Here's part of the XML I'm evaluating against:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
<responseDate>2011-05-31T13:04:08+00:00</responseDate>
<request metadataPrefix="oai_dc" verb="ListRecords">http://citeseerx.ist.psu.edu/oai2</request>
<ListRecords>
<record>
<header>
<identifier>oai:CiteSeerXPSU:10.1.1.1.1484</identifier>
<datestamp>2009-05-24</datestamp>
</header>
<metadata>
<oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>Winner-Take-All..</dc:title>
<dc:relation>10.1.1.134.6077</dc:relation>
<dc:relation>10.1.1.65.2144</dc:relation>
<dc:relation>10.1.1.54.7277</dc:relation>
<dc:relation>10.1.1.48.5282</dc:relation>
</oai_dc:dc>
</metadata>
</record>
<resumptionToken>10.1.1.1.2041-1547151-500-oai_dc</resumptionToken>
</ListRecords>
</OAI-PMH>
I'm using Jaxen because in my use case it's much faster than the Apache implementation. I'm using W3C DOM for the XML representation.
I need to select all record elements, and then evaluate other XPaths on the selected nodes (this is needed because of my processing architecture).
I'm selecting all record nodes (this works):
/OAI-PMH/ListRecords/record
Then on every selected record node I'm evaluating other xpaths to get needed data:
Select identifier text value (this works):
header/identifier/text()
Select title text value (this does NOT work):
metadata/oai_dc:dc/dc:title/text()
I've registered the namespace prefixes with their URIs (oai_dc and dc). I also tried other XPaths, but none of them work:
metadata/dc/title/text()
metadata//dc:title/text()
I've read other Stack Overflow questions about XPaths and namespaces, including the suggestion to register a prefix "oai" with the URI "http://www.openarchives.org/OAI/2.0/". I tried adding that "oai:" prefix to the nodes without a declared prefix, but then I couldn't even select the record nodes. Any ideas what I'm doing wrong?
Solution:
The problem was in the parser (thanks jasso). It wasn't set to be namespace aware; after changing that setting everything works as expected.
I can't see how the XPath expression /OAI-PMH/ListRecords/record can possibly select anything, since your document does not have a {}OAI-PMH element, only a {http://www.openarchives.org/OAI/2.0/}OAI-PMH element. See http://jaxen.codehaus.org/faq.html
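To illustrate both points (a namespace-aware parser, and a prefix bound to the default namespace URI), here is a minimal sketch. It assumes the JDK's javax.xml.xpath API instead of Jaxen, so the evaluation calls differ from the asker's code; the prefix name oai, the class name and the evalOnFirstRecord helper are arbitrary choices made for the example.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class OaiTitleDemo {
    // Shortened version of the OAI-PMH response from the question.
    static final String XML =
        "<OAI-PMH xmlns='http://www.openarchives.org/OAI/2.0/'><ListRecords><record>"
      + "<header><identifier>oai:CiteSeerXPSU:10.1.1.1.1484</identifier></header>"
      + "<metadata><oai_dc:dc xmlns:dc='http://purl.org/dc/elements/1.1/'"
      + " xmlns:oai_dc='http://www.openarchives.org/OAI/2.0/oai_dc/'>"
      + "<dc:title>Winner-Take-All..</dc:title>"
      + "</oai_dc:dc></metadata></record></ListRecords></OAI-PMH>";

    // "oai" is an arbitrary prefix for the document's default namespace.
    static final Map<String, String> PREFIXES = Map.of(
        "oai", "http://www.openarchives.org/OAI/2.0/",
        "oai_dc", "http://www.openarchives.org/OAI/2.0/oai_dc/",
        "dc", "http://purl.org/dc/elements/1.1/");

    static final NamespaceContext NAMESPACES = new NamespaceContext() {
        @Override public String getNamespaceURI(String prefix) {
            if (prefix == null) throw new IllegalArgumentException("prefix is null");
            return PREFIXES.getOrDefault(prefix, XMLConstants.NULL_NS_URI);
        }
        // Reverse lookups are not needed for XPath evaluation in this sketch.
        @Override public String getPrefix(String uri) { return null; }
        @Override public Iterator<String> getPrefixes(String uri) {
            return Collections.emptyIterator();
        }
    };

    // Selects the first record, then evaluates a relative expression against it.
    static String evalOnFirstRecord(String relativeExpression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // off by default; this was the missing setting
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(NAMESPACES);
            NodeList records = (NodeList) xpath.evaluate(
                "/oai:OAI-PMH/oai:ListRecords/oai:record", doc, XPathConstants.NODESET);
            Node record = records.item(0);
            return xpath.evaluate(relativeExpression, record);
        } catch (Exception e) {
            throw new IllegalStateException("Evaluation failed: " + relativeExpression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(evalOnFirstRecord("oai:header/oai:identifier"));
        System.out.println(evalOnFirstRecord("oai:metadata/oai_dc:dc/dc:title"));
    }
}
```

Every element inherited from the default namespace (OAI-PMH, ListRecords, record, header, identifier, metadata) needs the prefix in the expression, because XPath 1.0 treats an unprefixed name as belonging to no namespace.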
I have this job in Talend that is supposed to retrieve a field and loop through it.
My big problem is that the code is looping through the XML fields but it's returning null.
Here is a sample of the XML:
<?xml version="1.0" encoding="ISO-8859-1"?>
<empresas>
<empresa>
<imoveis>
<imovel>
[-- some fields -- ]
<fotos>
<nome id="" order="">photo1</nome>
<nome id="" order=""></nome>
<nome id="" order=""></nome>
<nome id="" order=""></nome>
</fotos>
</imovel>
[ -- other entries here -- ]
</imoveis>
</empresa>
</empresas>
Now using the tExtractXMLField component I am trying to get the "fotos" element.
Here is what I have in the component:
I have tried changing the XPath query and the XPath loop query, but either I don't loop through the field at all or I get null in the value field in the tMap.
Here is an image of the job:
You can see that I have retrieved 4 items from the XML but what I get is null in the "nome" field. There must be something wrong with the XPath but I can't seem to find the problem :(
Hope someone can help me out. Thanks
Note: I am using Talend v4.1.2 on Ubuntu 10.10 64-bit.
If you want to loop on <nome> nodes your Loop XPath Query has to be
"/empresas/empresa/imoveis/imovel/fotos/nome"
and foto_nome XPath Query something like
"text()"
Take care: I also corrected an error in your XML that could bring issues (</imoveis> missing the "s").
There are two ways to go about it. One way is to use the XML input component directly, following the instructions that bluish gave.
The other way is to continue on the path that you chose. In the XML input component, make sure that your Loop XPath query is set to "/empresas/empresa/imoveis/imovel/fotos" and that you pass through the fotos element with the Get Nodes option checked. The XPath Query of your fotos element should be "../fotos" or ".".
Your extractXMLField component looks to be well configured.
Also, I don't know what tSetGlobalVar does in your design, but make sure it doesn't affect the fotos element that you're trying to pass through.
I have made a test job that should help. If I'm not mistaken, you want to get all the "nome" elements under the "fotos" tag.
Try changing your loop XPath to the top-level element in the file, "empresas". Sometimes that works for me. I have also seen the <?xml version="1.0" encoding="ISO-8859-1"?> declaration cause problems before, so you could try removing it.
Also make sure that the encoding is set correctly in the tFileInputXML.
I think you are confusing reading XML and extracting XML from XML.
Reading XML:
If the XML you have provided is the file read by your tFileInputXML, you don't need tExtractXMLField; just configure the tFileInputXML like this:
set the XPath loop to the <nome> elements, like this: "//nome"
add 3 columns in the tFileInputXML component: id, order and content
get the content column with XPath query "."
get the id value with XPath query "@id"
get the order value with XPath query "@order"
Extracting XML from XML:
That is the goal of the tExtractXMLField component:
It lets you parse XML data contained in a database column or in another XML document as if it were itself a data flow.
In a nutshell, tExtractXMLField creates a flow of data from a column that contains XML.
It is very useful when parsing the result of a SOAP query: the server reply is usually provided as XML, like this one:
<arg2>
<![CDATA[
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<exportInscriptionEnLigneType>
<date>2015-04-10</date>
<nbDossiers>2</nbDossiers>
<reference>20150410100</reference>
<listeDossiers>
<dossier>
<numOrdre>1</numOrdre>
<identifiantDossier>AAAAA</identifiantDossier>
</dossier>
<dossier>
<numOrdre>2</numOrdre>
<identifiantDossier>BBBBB</identifiantDossier>
</dossier>
</listeDossiers>
</exportInscriptionEnLigneType>
]]>
</arg2>
In the XML above, the <arg2> element contains an XML document that you may need to parse.
tExtractXMLField was created for this purpose.
I've written a tutorial on how to do this; please have a look at "how to extract xml from xml". It is in French, but the screenshots may help you follow the few comments provided.
Hope it will help.
Best regards,
I need to select all nodes on the path a/b/c as a NodeList from a Document using getElementsByTagName(). How do I provide the path of the node as input?
For example:
<root>
<a>
<b>
<c>1</c>
<c>2</c>
<c>3</c>
<c>4</c>
<c>5</c>
<c>6</c>
</b>
</a>
</root>
I need to select all 'c' nodes on the path a/b/c. Selecting c directly is an option, but to avoid ambiguity if more 'c' elements are present elsewhere, I need to give the path. How do I achieve this?
Take a look at the Java XPath API. getElementsByTagName() accepts only a tag name, not a path, so you probably want to evaluate an XPath of /root/a/b/c to select all the <c/> nodes in the above hierarchy.
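A minimal sketch of that approach with the JDK's javax.xml.xpath package follows. The class name, the textOf helper and the extra <x><c>99</c></x> branch are hypothetical; the branch is added only to show the ambiguity that a full path avoids.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class PathSelectDemo {
    // Shortened sample from the question plus one hypothetical stray <c> outside a/b.
    static final String XML =
        "<root><a><b><c>1</c><c>2</c><c>3</c></b></a><x><c>99</c></x></root>";

    // Evaluates the expression and returns the text content of each selected node.
    static List<String> textOf(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            // XPathConstants.NODESET yields an org.w3c.dom.NodeList,
            // the same type getElementsByTagName() returns.
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed: " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(textOf("/root/a/b/c")); // only the c elements under a/b
        System.out.println(textOf("//c"));         // every c element in the document
    }
}
```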