I am very new to XML document database technologies, XQuery/XPath and so on, so this is probably a very newbie question.
Scenario: I have a number of XML documents as input and want to run a number of transformations on them (using XQuery). I'd like to store the results in the same XML data store as the input.
So far I have been experimenting with the BaseX document database to store and process these XML documents, and it's been very easy to work with; I am impressed.
Ideally I'd like to interface with BaseX using the XQJ API (http://xqj.net/basex/), my reasoning being that XQJ would keep the application code as independent of BaseX as possible. The secondary option would be to write my Java code directly against the BaseX API.
The Problem: I am having a hard time figuring out how to store the results from an XQuery as a new "document" in the database. Perhaps this is more of a conceptual lack of understanding with XQuery (or XQuery Update) itself than any difficulty with the BaseX/XQJ API.
In this simple example, a query like this returns XML output in the format that I want for my new document:
let $items := //firstName
return <results>
  { for $item in $items
    return <result> {$item} </result>
  }
</results>
Gives
<results>
<result>
<firstName>Bob</firstName>
</result>
<result>
<firstName>Joe</firstName>
</result>
<result>
<firstName>Tom</firstName>
</result>
</results>
I want to store this new <results> document back into the database, for use in later querying, transformations and so on. In SQL this makes sense to me: I would do CREATE TABLE <name> SELECT <query>, or INSERT INTO, etc. But I am unclear what the equivalent is in XQuery. I think the XQuery Update functionality is what I need here, but I'm having trouble finding concrete examples.
This is further complicated when dealing with XQJ:
XQResultSequence rs = xqe.executeQuery("//firstName");
// what do i do with it now??
Is there a way to persist this XQResultSequence back INTO the database using BaseX? Or even better, can I run additional XQueries directly on the XQResultSequence?
Thanks for the help!
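BaseX implements the XQuery Update Facility, so you should be able to use fn:put. As far as I can tell, BaseX treats the target URI of fn:put as a file location, so if the result has to land inside a database, its BaseX-specific db:add function is worth a look. With fn:put: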
let $items := //firstName
return fn:put(
  <results>{
    for $item in $items
    return <result> {$item} </result>
  }</results>,
  "/results/result-new.xml")
If you are running simple ad-hoc queries like above, it should be fairly straightforward. I'm not very familiar with XQJ, but if you want to run queries in a sequence, I suspect there is a way to pass those XQResultSequence variables back to a new query, in which you would likely accept it by declaring a variable as external in the following query:
declare variable $previous-result as item()* external;
I have a list of around 1 million web pages, and I want to efficiently extract just the text from those pages. Currently I fetch the HTML of each page with an HTTP request and use the BeautifulSoup library in Python to get the text out of it. This approach extracts some extra information in addition to the text, for example any JavaScript that is listed in the body.
Could you please suggest a suitable and efficient way to do this? I looked at Scrapy, but it looks like it crawls a specific website. Can we pass it a list of specific web pages to get information from?
Thank you in advance.
Yes, you can use Scrapy to crawl a set of URLs in a generic fashion.
You simply need to set them on the start_urls list attribute of your spider, or reimplement the start_requests spider method to yield requests from any data source, and then implement your parse callback to perform the generic content extraction you want.
You can use html-text to extract text from them, and regular Scrapy selectors to extract additional data like the one you mention.
In Scrapy you can plug in your own parser, e.g. Beautiful Soup, and call it from your parse method.
To extract text from generic pages I traverse the body only, exclude comments etc and some tags like script, style, etc:
import re
import bs4

snippets = []
# soup is an already-parsed bs4.BeautifulSoup document
for snippet in soup.find('body').descendants:
    if isinstance(snippet, bs4.element.NavigableString) \
            and not isinstance(snippet, EXCLUDED_STRING_TYPES) \
            and snippet.parent.name not in EXCLUDED_TAGS:
        snippet = re.sub(UNICODE_WHITESPACES, ' ', snippet)
        snippet = snippet.strip()
        if snippet != '':
            snippets.append(snippet)
with
EXCLUDED_STRING_TYPES = (bs4.Comment, bs4.CData, bs4.ProcessingInstruction, bs4.Declaration)
EXCLUDED_TAGS = ['script', 'noscript', 'style', 'pre', 'code']
UNICODE_WHITESPACES = re.compile(u'[\t\n\x0b\x0c\r\x1c\x1d\x1e\x1f \x85\xa0\u1680\u2000\u2001\u2002\u2003\u2004'
u'\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000]+')
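If Beautiful Soup is not available, roughly the same filtering can be sketched with only the standard library. This is a minimal sketch, not a drop-in replacement: the class name TextExtractor and the helper extract_text are made up for illustration, the tag list is copied from EXCLUDED_TAGS above, and html.parser is less forgiving of broken markup than the parsers Beautiful Soup can use.

```python
import re
from html.parser import HTMLParser

# Same tag list as EXCLUDED_TAGS above; text inside these tags is skipped.
EXCLUDED_TAGS = {"script", "noscript", "style", "pre", "code"}


class TextExtractor(HTMLParser):
    """Collects text snippets found inside <body>, outside excluded tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.snippets = []
        self._in_body = False
        self._excluded_depth = 0  # > 0 while inside an excluded tag

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self._in_body = True
        elif tag in EXCLUDED_TAGS:
            self._excluded_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self._in_body = False
        elif tag in EXCLUDED_TAGS and self._excluded_depth > 0:
            self._excluded_depth -= 1

    def handle_data(self, data):
        # Comments never reach handle_data, so they are dropped implicitly.
        if self._in_body and self._excluded_depth == 0:
            # \s matches Unicode whitespace for str patterns in Python 3.
            text = re.sub(r"\s+", " ", data).strip()
            if text:
                self.snippets.append(text)


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.snippets


sample = """<html><head><title>T</title><style>p {color: red}</style></head>
<body><p>Hello   <b>world</b></p><script>var x = 1;</script>
<!-- a comment --><p>Bye</p></body></html>"""
print(extract_text(sample))
```

The depth counter rather than a simple flag is there so that nested excluded tags (a code element inside a pre element, say) do not switch collection back on too early.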
I have an XML structure of a message received from an XMPP subscription (below). I only care about the "user" part of this message and would like to convert this into an equivalent "User" object in Java so that I can use it to perform other processing. Is there a way to achieve this in Java?
The only way I know to do it is to use Jackson annotations (e.g. @JsonProperty) and create equivalent objects for all the parent elements (event, notification, update, data, etc.), but I don't really care about those, so it seems like a waste.
Not sure how I can just convert the "user" part to an object and forget about the rest?
<event xmlns='http://jabber.org/protocol/pubsub#event'>
<notification xmlns='http://jabber.org/protocol/pubsub'>
<Update>
<data>
<user>
<dialogs>/finesse/api/User/1234/Dialogs</dialogs>
<extension></extension>
<firstName>1234</firstName>
<lastName>1234</lastName>
<loginId>1234</loginId>
<loginName>1234</loginName>
<roles>
<role>Agent</role>
</roles>
<state>LOGOUT</state>
<stateChangeTime>2015-03-11T14:25:42Z</stateChangeTime>
<teamId>1</teamId>
<teamName>Default</teamName>
<uri>/finesse/api/User/1234</uri>
</user>
</data>
</Update>
</notification>
</event>
It's a little bit ugly and not optimal for huge XML data, but you could extract the user part from the XML using, for example, dom4j, and then use Jackson to parse only the user XML.
Document doc = new SAXReader().read(...);
Node user = doc.getRootElement()
.element("notification")
.element("Update")
.element("data")
.element("user");
String onlyUserXml = user.asXML();
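The underlying idea, jumping straight to the user element and mapping only its children, is not specific to dom4j. As a language-neutral illustration, here is a minimal sketch in Python using only the standard library. The User dataclass, its field names and the parse_user helper are assumptions made for this example, and the message is an abbreviated copy of the one in the question. Note that user inherits the namespace declared on notification, which a namespace-aware lookup has to take into account.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

# Namespace declared on <notification>; <user> and its children inherit it.
PUBSUB_NS = {"p": "http://jabber.org/protocol/pubsub"}


@dataclass
class User:  # hypothetical target type, only a few fields mapped
    login_id: str
    first_name: str
    last_name: str
    state: str
    roles: list = field(default_factory=list)


def parse_user(xml_text):
    root = ET.fromstring(xml_text)
    # ".//p:user" matches at any depth, so the wrapper elements
    # (event, notification, Update, data) never need to be modelled.
    user = root.find(".//p:user", PUBSUB_NS)
    if user is None:
        raise ValueError("no <user> element found")

    def text(name):
        return user.findtext("p:" + name, default="", namespaces=PUBSUB_NS)

    return User(
        login_id=text("loginId"),
        first_name=text("firstName"),
        last_name=text("lastName"),
        state=text("state"),
        roles=[r.text for r in user.findall("p:roles/p:role", PUBSUB_NS)],
    )


message = """<event xmlns='http://jabber.org/protocol/pubsub#event'>
<notification xmlns='http://jabber.org/protocol/pubsub'>
<Update><data><user>
<firstName>1234</firstName><lastName>1234</lastName>
<loginId>1234</loginId>
<roles><role>Agent</role></roles>
<state>LOGOUT</state>
</user></data></Update>
</notification>
</event>"""
user = parse_user(message)
print(user)
```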
How can I find and iterate through all the nodes inside the CDATA section, given that those nodes are started by (<) and closed by (>)?
Also, how should I iterate over all the child nodes and get their values, as for the child nodes below? I want to retrieve the values.
Input XML
<SOURCE TransactionId="1" ProviderName="ABCDD"><RESPONSE><![CDATA[<?xml version="1.0" encoding="utf-8"?><soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema"><soap:Body><NetworkResponse xmlns="http://www.example.com/"><NetworkResult><Network offering_id="13" transaction_id="2" submission_id="3" timestamp="20140828 16010683 GMT" customer_id="NETTest">
<Network_List>
<Network_Info att0="Y" att1="N" att2="N" att3="Y" att4="Y">
<SIM_DATA>
<SIM><![CDATA[1100040101]]></SIM>
</SIM_DATA>
<NetworkResponseInfo k_status="C">
<KEY1>269</KEY1>
<PARENTNODE>
<CHILDNODE1>
<KEY2>XXXXXXX</KEY2>
<KEY3>YYYYYYY</KEY3>
</CHILDNODE1>
<CHILDNODE2>
<KEY4>N</KEY4>
<KEY5>I</KEY5>
</CHILDNODE2>
<CHILDNODE3>
<KEY6>1</KEY6>
<KEY7>3</KEY7>
</CHILDNODE3>
</PARENTNODE>
<KEY8><![CDATA[some image not visible]]></KEY8>
<KEY9>N</KEY9>
<KEY10>15</KEY10>
</NetworkResponseInfo>
</Network_Info>
</Network_List>
<response_message_list transaction_status_code="000" transaction_status_text="Successful"/>
</Network></NetworkResult></NetworkResponse></soap:Body></soap:Envelope>]]></RESPONSE></SOURCE>
Output XML
<ns3:NetworkResponse>
<Networks_OF_List>
<NetCharSeq>
<Nrep>
<type>Some Image</type>
<data> Data Coming from KEY8 CDATA section</data>
</Nrep>
<Nrep>
<type>ANYTHING</type>
<data>VALUE INSIDE SIM CDATA</data>
</Nrep>
<NetDetail>
<MYKEY1>Value present inside KEY4</MYKEY1>
<MYKEY2>Value present inside KEY5</MYKEY2>
</NetDetail>
<SystemID>Value of KEY2</SystemID>
<SystemPath>Value of KEY3</SystemPath>
</NetCharSeq>
</Networks_OF_List>
</ns3:NetworkResponse>
(Welcome to SO. Please note that some users have downvoted you because you do not show what you have tried so far. Have a look at the How To Ask section to learn how to ask questions that can actually be answered and are considered proper questions in the SO format.)
If you can use XSLT 3.0, you can consider using the new fn:parse-xml function, which takes a document as a string.
However, your CDATA section itself contains escaped data, which means that, after you apply fn:parse-xml, you will have to apply it once more to the text node that is the child of NetworkResult.
A better solution is often to fix this at the source and create an XML format that allows other XML in certain elements (you can allow this with a proper XSD). It will save you a lot of trouble, and at least your XML can then be pre-validated.
If you are stuck with XSLT 2.0 or 1.0, you can use disable-output-escaping (google it, there is a lot of info around on how to use it), but you will have to re-process your output once more because of the double escaping that is used. You may want to consider an XProc pipeline to ease the process.
You wrote: Also, how should I iterate over all the child nodes and get the values like in below child node
That is what XSLT is all about, please read this XSLT Tutorial, or any other tutorial you can find, it will be explained to you in the first minutes.
Update: as suggested by michael.hor257k in the comments, you can also parse the escaped data by hand using string manipulation functions. As he already says in the comments, this is laborious and error-prone, but sometimes, esp. if the XML is not really XML after unescaping, but something like XML, then this may be your only option.
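The parse-then-parse-again idea is the same outside XSLT. As a rough illustration, here is a minimal Python sketch using the standard library: the outer document is parsed first, and the text of RESPONSE is then parsed as a second document. The sample is a heavily shortened stand-in for the input above (KEY2 and KEY3 are placed directly under NetworkResult, and the nested CDATA sections are left out), and parse_embedded is a name invented for this example.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the input: one XML document embedded as CDATA text.
outer_xml = (
    '<SOURCE TransactionId="1" ProviderName="ABCDD"><RESPONSE><![CDATA['
    '<?xml version="1.0" encoding="utf-8"?>'
    '<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope">'
    '<soap:Body><NetworkResponse xmlns="http://www.example.com/">'
    '<NetworkResult><KEY2>XXXXXXX</KEY2><KEY3>YYYYYYY</KEY3></NetworkResult>'
    '</NetworkResponse></soap:Body></soap:Envelope>'
    ']]></RESPONSE></SOURCE>'
)


def parse_embedded(outer_text):
    """Parse the outer document, then parse the text of RESPONSE again."""
    outer = ET.fromstring(outer_text)
    embedded = outer.findtext("RESPONSE")
    # The embedded text carries its own XML declaration with an encoding,
    # so it is handed to the parser as bytes.
    return ET.fromstring(embedded.encode("utf-8"))


inner = parse_embedded(outer_xml)
ns = {"n": "http://www.example.com/"}
result = inner.find(".//n:NetworkResult", ns)
# Iterate over the child elements; strip the "{namespace}" prefix from tags.
values = {child.tag.split("}")[1]: child.text for child in result}
print(values)
```

If the embedded text contained yet another escaped document, the same step would simply be repeated on that text node.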
I have an XML file starting:
<?xml version="1.0"?>
<results>
<result id="0001">
<hometeam>
<name>Dantooine Destroyers</name>
<score>6</score>
</hometeam>
<awayteam>
<name>Wayland Warriors</name>
<score>0</score>
</awayteam>
</result>
<result id="0002">
<hometeam>
<name>Dantooine Destroyers</name>
<score>3</score>
</hometeam>
<awayteam>...
and in a java file:
if(event.isStartElement()){
if(event.asStartElement().getName().getLocalPart().equals(HOME)){
System.out.println("In hometeam"); // for testing purposes
event = eventReader.nextEvent(); // I expect <name> element
if(event.isStartElement()){ // <------------ FALSE
if(event.asStartElement().getName().getLocalPart().equals(NAME)){....
I'd expect this if statement to be true for the <name> element but if I stick in a System.out.println(event.isStartElement()) I get FALSE....
Also event.getEventType() returns XMLEvent.CHARACTERS which I don't understand... Can anybody see why?
Feel free to make edits to tags/title and question if necessary.
CHARACTERS means that the next part of the XML is, well, characters (in your case a newline and indentation). A low-level parser is not in a position to discard them for you; it just delivers raw events. It's your job to process the structure correctly.
That .nextEvent() call is probably bringing in the whitespace between <hometeam> and <name>. Note that in XML, all character data between tags (even if it's whitespace or newlines) is also accessible from the API.
You can verify this by printing the event.
You don't normally see that whitespace with DOM-based APIs (or you can easily ignore it) but with event-driven APIs (like SAX or StAX) you have to ignore it.
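The effect is easy to reproduce with any pull parser. The sketch below uses Python's xml.dom.pulldom purely as a stand-in for StAX (the event names happen to be similar); significant_events and event_after are helper names invented for this example. It shows that the event right after the hometeam start tag is a CHARACTERS event, and that skipping whitespace-only character events makes the name start tag come next.

```python
from xml.dom import pulldom

SAMPLE = """<results>
  <result id="0001">
    <hometeam>
      <name>Dantooine Destroyers</name>
      <score>6</score>
    </hometeam>
  </result>
</results>"""


def significant_events(xml_text):
    """Yield (event, node) pairs, dropping whitespace-only character events."""
    for event, node in pulldom.parseString(xml_text):
        if event == pulldom.CHARACTERS and node.data.strip() == "":
            continue  # indentation and newlines between tags
        yield event, node


def event_after(xml_text, tag, skip_whitespace):
    """Return the type of the event that follows the start tag named `tag`."""
    if skip_whitespace:
        events = significant_events(xml_text)
    else:
        events = pulldom.parseString(xml_text)
    seen = False
    for event, node in events:
        if seen:
            return event
        if event == pulldom.START_ELEMENT and node.tagName == tag:
            seen = True
    return None


print(event_after(SAMPLE, "hometeam", skip_whitespace=False))
print(event_after(SAMPLE, "hometeam", skip_whitespace=True))
```

In StAX the equivalent of the filter is a loop that keeps calling nextEvent() while the event is whitespace-only character data.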
I want to take an XML file as input and output the same XML except for some search/replace actions for attributes and text, based on matching certain node characteristics.
What's the best general approach for this, and are there tutorials somewhere?
DOM is out since I can't guarantee being able to keep the whole thing in memory.
I don't mind using SAX or StAX, except that I want the default behavior to be a pass-through no-op filter; I did something similar with StAX once and it was a pain, didn't work with namespaces, and I was never sure if I had included all the cases I needed to handle.
I think XSLT won't work (but am not sure), because it's declarative and I need to do some procedural calculations when figuring out what text/attributes to emit on the output.
(contrived example:
Suppose I was looking for all nodes with an XPath of /group/item/@number and wanted to evaluate the number attribute as an integer, factor it using a method public List<Integer> factorize(int i), convert the list of factors to a space-delimited string, and add an attribute factors to the corresponding /group/item node?
input:
<group name="beatles"><item name="paul" number="64"></group>
<group name="rolling stones"><item name="mick" number="19"></group>
<group name="who"><item name="roger" number="515"></group>
expected output:
<group name="beatles"><item name="paul" number="64" factors="2 2 2 2 2 2"></group>
<group name="rolling stones"><item name="mick" number="19" factors="19"></group>
<group name="who"><item name="roger" number="515" factors="103 5"></group>
)
Update: I got the StAX XMLEventReader/Writer method working easily, but it doesn't preserve certain formatting quirks that are important in my application. (I guess the program that saves/loads XML doesn't honor valid XML files. >:( argh.) Is there a way to process XML that minimizes textual differences between input and output? (at least when it comes to character data.)
XSLT seems like an appropriate model for what you are doing. Look into using XSLT with procedural extensions.
If you really can't keep the whole document in memory, Saxon is your only XSLT choice. It's likely that whatever calculations you need to do can be done in XSLT -- but if not, it's not too hard to write your own extension functions.
I find Apache Digester a big help for rules-based parsing of XML.
Update: If it's filtering and output that you're concerned with, review this set of articles on Developerworks which is concerned with the same issues. Of particular relevance are parts 2, 3 and 4. The summary: Use SAX, XMLFilter and XMLWriter.
While I suppose this is technically a good fit for XSLT, I've always found it hard to debug for complex transformations. YMMV :-)
Further Update: XMLWriter is available from here. I don't know what your particular difficulty with SAX is. I created a file groups.xml containing:
<groups>
<group name="beatles"><item name="paul" number="64"/></group>
<group name="rolling stones"><item name="mick" number="19"/></group>
<group name="who"><item name="roger" number="515"/></group>
</groups>
Note that I had to make some changes to make it well-formed XML. Then, I knocked up this simple Jython script, groups.py, to illustrate how to solve your problem:
import java.io
import org.xml.sax.helpers
import sys
sys.path.append("xml-writer.jar")
import com.megginson.sax

def get_factors(n):
    return "factors for %s" % n

class MyFilter(org.xml.sax.helpers.XMLFilterImpl):
    def startElement(self, uri, localName, qName, attrs):
        if qName == "item":
            newAttrs = org.xml.sax.helpers.AttributesImpl(attrs)
            n = attrs.length
            for i in range(n):
                name = attrs.getLocalName(i)
                if name == "number":
                    newAttrs.addAttribute("", "factors", "factors",
                                          "CDATA",
                                          get_factors(attrs.getValue(i)))
            attrs = newAttrs
        #call superclass method...
        org.xml.sax.helpers.XMLFilterImpl.startElement(self, uri, localName,
                                                       qName, attrs)

source = org.xml.sax.InputSource(java.io.FileInputStream("groups.xml"))
reader = org.xml.sax.helpers.XMLReaderFactory.createXMLReader()
filter = MyFilter(reader)
writer = com.megginson.sax.XMLWriter(filter,
                                     java.io.FileWriter("output.xml"))
writer.parse(source)
Obviously, I've mocked up the factor finding function as your example was, I believe, purely illustrative. The script reads groups.xml, applies a filter, and outputs to output.xml. Let's run it:
$ jython groups.py
$ cat output.xml
<?xml version="1.0" standalone="yes"?>
<groups>
<group name="beatles"><item name="paul" number="64" factors="factors for 64"></item></group>
<group name="rolling stones"><item name="mick" number="19" factors="factors for 19"></item></group>
<group name="who"><item name="roger" number="515" factors="factors for 515"></item></group>
</groups>
Job done? Of course, you'll need to transcribe this code to Java.
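For readers on plain CPython rather than Jython, roughly the same pass-through filter can be sketched with the standard library's xml.sax package, where XMLFilterBase plays the role of XMLFilterImpl and XMLGenerator the role of XMLWriter. This is a minimal sketch: FactorsFilter and add_factors are names made up for the example, and factorize is a simple trial-division stand-in for the asker's method, so factors come out in ascending order.

```python
import io
from xml.sax import make_parser
from xml.sax.saxutils import XMLFilterBase, XMLGenerator
from xml.sax.xmlreader import AttributesImpl, InputSource


def factorize(n):
    """Prime factors of n in ascending order, by trial division."""
    factors, d = [], 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors


class FactorsFilter(XMLFilterBase):
    """Passes every event through; adds factors="..." to <item number="...">."""

    def startElement(self, name, attrs):
        if name == "item" and "number" in attrs:
            new_attrs = dict(attrs.items())
            factors = factorize(int(attrs["number"]))
            new_attrs["factors"] = " ".join(str(f) for f in factors)
            attrs = AttributesImpl(new_attrs)
        # Everything not overridden is forwarded unchanged by XMLFilterBase.
        super().startElement(name, attrs)


def add_factors(xml_bytes):
    out = io.StringIO()
    xml_filter = FactorsFilter(make_parser())
    xml_filter.setContentHandler(XMLGenerator(out, encoding="utf-8"))
    source = InputSource()
    source.setByteStream(io.BytesIO(xml_bytes))
    xml_filter.parse(source)
    return out.getvalue()


groups = b"""<groups>
<group name="beatles"><item name="paul" number="64"/></group>
<group name="rolling stones"><item name="mick" number="19"/></group>
<group name="who"><item name="roger" number="515"/></group>
</groups>"""
result = add_factors(groups)
print(result)
```

Because the document is streamed event by event, memory use does not grow with the size of the input, which was the reason for ruling out DOM.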
StAX should work well for you. Piping input to output is super easy; you just write the XMLEvent you get from the XMLEventReader to the XMLEventWriter.
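// Assumes java.util.*, javax.xml.namespace.QName, javax.xml.stream.* and
// javax.xml.stream.events.* are imported, and that reader and writer have been
// created elsewhere from XMLInputFactory / XMLOutputFactory.
XMLEventFactory EVT_FACTORY = XMLEventFactory.newInstance();
XMLEventReader reader;
XMLEventWriter writer;
QName numberQName = new QName("number");
QName factorsQName = new QName("factors");
while (reader.hasNext()) {
    XMLEvent e = reader.nextEvent();
    // The reader does not deliver attributes as separate events;
    // they are part of the StartElement event.
    if (e.isStartElement()) {
        StartElement start = e.asStartElement();
        Attribute number = start.getAttributeByName(numberQName);
        if (number != null) {
            // factorize(int) is the asker's method returning List<Integer>
            StringBuilder factors = new StringBuilder();
            for (Integer f : factorize(Integer.parseInt(number.getValue()))) {
                if (factors.length() > 0) factors.append(' ');
                factors.append(f);
            }
            // Rebuild the start element with the extra attribute appended.
            List<Attribute> attrs = new ArrayList<>();
            for (Iterator<?> it = start.getAttributes(); it.hasNext(); ) {
                attrs.add((Attribute) it.next());
            }
            attrs.add(EVT_FACTORY.createAttribute(factorsQName, factors.toString()));
            e = EVT_FACTORY.createStartElement(start.getName(), attrs.iterator(), start.getNamespaces());
        }
    }
    // pass through
    writer.add(e);
}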