Can anybody suggest how I can solve this? There is an XML document that I'm trying to append to, and I'm looking up nodes with XPath. The thing is, the software which generates this XML sometimes screws it up, as follows:
<element name="Element ">Element value</element>
So when I look the node up with XPath using //element[@name="Element"] (without the trailing space) I don't get a match. Naturally, I do get a match with //element[@name="Element "].
Is there something I can do to match this without the blank space? Does XPath accept regular expressions, or is there a smarter way to do this? I can also change the XML file after it has been generated by the software (with the faulty user input).
Would
//element[normalize-space(@name)="Element"]
work? (untested)
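For what it's worth, here is how that expression could be evaluated with the JDK's built-in XPath API (a minimal sketch; the file name is a placeholder):

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class TrimmedMatch {
    public static void main(String[] args) throws Exception {
        // Placeholder file name for the generated XML.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File("input.xml"));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // normalize-space() strips leading/trailing whitespace (and collapses
        // internal runs), so the trailing blank in the attribute no longer matters.
        NodeList nodes = (NodeList) xpath.evaluate(
                "//element[normalize-space(@name)='Element']",
                doc, XPathConstants.NODESET);

        System.out.println("Matches: " + nodes.getLength());
    }
}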
A more general (though less accurate) solution could be:
//element[contains(@name,"Element")]
However, that would also match things like name="Elements" and name="Elementary", etc.
Can you recommend some useful resources for XPath?
Unfortunately, there aren't many great resources in my experience - w3schools is probably your best bet, or the spec itself for more advanced stuff.
I want to write a script that checks a document for keywords and identifies the HTML nodes that contain them (possibly assigning each a unique identifier).
I am not a professional programmer and do not know the strengths of low-level languages or concepts like OOP; I'm afraid of doing something very bad and unsupportable. How is it possible to isolate the desired nodes?
My experience is JS and PHP - PHP only for very simple things. Also, I do not want to use the JS facilities for working with nodes. My thoughts:
make a string of the HTML
verify the existence of the words on the page
if a word exists on the page: for each node in the body element, get its first and last character positions (for example, when we see an opening tag, we know the position of each character from the start, and so we can calculate the first position where the tag opens and the last where it closes, and so on for all nodes). We then know the position of the word (e.g. 192-199) and check which range it falls into (in this case, these ranges are the nodes of the HTML document).
I need ideas from experienced programmers. It does not matter what language you program in (except web-oriented ones) - every opinion is important to me. It is likely that there are libraries that solve such problems. I very much hope that you will understand me; English is not my native language.
I always recommend Beautiful Soup for this kind of thing. It is a Python library that lets you parse XML/HTML documents really quickly. You could get something running quite quickly that extracts the text from each div element, I would have thought. Then, using Python's built-in string manipulation tools, searching for particular words should be fairly simple.
You need to use an HTML parser. Refer to
Which HTML Parser is the best?
After that, you can use its XPath support to extract whichever nodes you need.
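To sketch what that can look like once the markup has been turned into a DOM (assuming the page has already been cleaned up to well-formed XHTML by the parser; the file name, keyword, and data-match-id attribute below are made up for illustration):

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class KeywordSearch {
    public static void main(String[] args) throws Exception {
        // Assumes page.xhtml is already well-formed output from an HTML parser.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File("page.xhtml"));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Select every element whose text directly contains the keyword.
        NodeList hits = (NodeList) xpath.evaluate(
                "//*[contains(text(), 'keyword')]", doc, XPathConstants.NODESET);

        for (int i = 0; i < hits.getLength(); i++) {
            Element e = (Element) hits.item(i);
            // Tag each matching node with a unique identifier,
            // as the question suggests.
            e.setAttribute("data-match-id", "kw-" + i);
            System.out.println(e.getTagName() + ": " + e.getTextContent().trim());
        }
    }
}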
I need to build a component which would take a few XML documents as input and check the following kind of rules:
XML1:/bookstore/book[price>35.00] != null
and (XML2:/city/name = 'Montreal'
or XML3://customer[@language] contains 'en')
Basically my component should be able to:
substitute each XML token with the corresponding XML document (the part before the colon)
apply the XPath query to that XML document
check the XPath output against the expected result ("=", "!=", "contains")
follow the basic boolean syntax ("and", "or" and parentheses)
tell if the rule is true or false
Do you know of any library which could help me? Maybe JavaCC?
Thanks
For evaluating XPaths I recommend Jaxen.
Jaxen is an open source XPath library written in Java. It is adaptable to many different object models, including DOM, XOM, dom4j, and JDOM. It is also possible to write adapters that treat non-XML trees, such as compiled Java byte code or Java beans, as XML, thus enabling you to query these trees with XPath too.
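For illustration, here is roughly how the first atom of the rule above could be checked with Jaxen (a sketch only; the file name is an assumption):

import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.jaxen.dom.DOMXPath;
import org.w3c.dom.Document;

public class RuleCheck {
    public static void main(String[] args) throws Exception {
        // Parse the document bound to the XML1 token (file name assumed).
        Document xml1 = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse("bookstore.xml");

        // XML1:/bookstore/book[price>35.00] != null holds when the
        // expression selects at least one node.
        DOMXPath xpath = new DOMXPath("/bookstore/book[price>35.00]");
        List<?> hits = xpath.selectNodes(xml1);
        System.out.println("XML1 atom is " + (!hits.isEmpty()));
    }
}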
The Java XPath API (javax.xml.xpath, included since Java 5) is also an option, but I haven't tried it yet.
Somebody on the JavaCC mailing list pointed me in the right direction, mentioning Schematron. That led me to Probatron, which seems to be the best Java implementation available.
The Schematron web site claims that the language supports "jump across links and between XML documents to check constraints", but it seems Probatron doesn't allow that. I may need to tweak it or find a trick for that (like building a temporary XML document containing all my source documents). Apart from that, it looks like Probatron is the right library for me.
I want to use an HTML parser that does the following in a nice, elegant way:
Extract text (this is most important)
Extract links, meta keywords
Reconstruct original doc (optional but nice feature to have)
From my investigation so far, Jericho seems to fit. Are there any other open source libraries you would recommend?
I recently experimented with HtmlCleaner and CyberNekoHtml. CyberNekoHtml is a DOM/SAX parser that produces predictable results. HtmlCleaner is a tad faster, but quite often fails to produce accurate results.
I would recommend CyberNekoHtml. CyberNekoHtml can do all of the things you mentioned. It is very easy to extract a list of all elements, and their attributes, for example. It would be possible to traverse the DOM tree building each element back into HTML if you wanted to reconstruct the page.
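As a rough sketch of that kind of extraction with CyberNekoHtml's DOMParser (the HTML string is a placeholder):

import java.io.StringReader;
import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NekoExample {
    public static void main(String[] args) throws Exception {
        String html = "<html><body><a href='http://example.com'>a link</a></body></html>";

        // NekoHTML repairs the markup and exposes it as a standard DOM tree.
        DOMParser parser = new DOMParser();
        parser.parse(new InputSource(new StringReader(html)));
        Document doc = parser.getDocument();

        // Extract the text ...
        System.out.println(doc.getDocumentElement().getTextContent().trim());

        // ... and the links. NekoHTML upper-cases element names by default.
        NodeList links = doc.getElementsByTagName("A");
        for (int i = 0; i < links.getLength(); i++) {
            System.out.println(((Element) links.item(i)).getAttribute("href"));
        }
    }
}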
There's a list of open source Java HTML parsers here:
http://java-source.net/open-source/html-parsers
I would definitely go for JSoup.
Very elegant library and does exactly what you need.
See Example Here
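For reference, a quick sketch of how the three requirements map onto the jsoup API (the URL is a placeholder):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupExample {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://example.com/").get();

        // 1. Extract text.
        System.out.println(doc.body().text());

        // 2. Extract links and meta keywords.
        for (Element link : doc.select("a[href]")) {
            System.out.println(link.attr("abs:href"));
        }
        System.out.println(doc.select("meta[name=keywords]").attr("content"));

        // 3. Reconstruct the (cleaned-up) document.
        System.out.println(doc.outerHtml());
    }
}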
I ended up using HtmlCleaner http://htmlcleaner.sourceforge.net/ for something similar. It's really easy to use and was quick for what I needed.
I would like to be able to parse XML that isn't necessarily well-formed. I'd be looking for a fuzzy rather than a strict parser, able to recover from badly nested tags, for example. I could write my own but it's worth asking here first.
Update:
What I'm trying to do is extract links and other info from HTML. In the case of well-formed XML I can use the Scala XML API. In the case of ill-formed XML, it would be nice to somehow convert it into correct XML and deal with it the same way; otherwise I'd have to have two completely different sets of functions for dealing with documents.
Obviously, because the input is not well-formed and I'm trying to create a well-formed tree, there would have to be some heuristics involved (such as: when you see <parent><child></parent>, you close the <child> first, and when you then see a </child>, you ignore it). But of course this isn't a proper grammar, and so there's no single correct way of doing it.
What you're looking for would not be an XML parser. XML is very strict about nesting, closing, etc. One of the other answers suggests Tag Soup. This is a good suggestion, though technically it is much closer to a lexer than a parser. If all you want from XML-ish content is an event stream without any validation, then it's almost trivial to roll your own solution. Just loop through the input, consuming content which matches regular expressions along the way (this is exactly what Tag Soup does).
The problem is that a lexer is not going to be able to give you many of the features you want from a parser (e.g. production of a tree-based representation of the input). You have to implement that logic yourself because there is no way that such a "lenient" parser would be able to determine how to handle cases like the following:
<parent>
<child>
</parent>
</child>
Think about it: what sort of tree would you expect to get out of this? There's really no sane answer to that question, which is precisely why a parser isn't going to be of much help.
Now, that's not to say that you couldn't use Tag Soup (or your own hand-written lexer) to produce some sort of tree structure based on this input, but the implementation would be very fragile. With tree-oriented formats like XML, you really have no choice but to be strict, otherwise it becomes nearly impossible to get a reasonable result (this is part of why browsers have such a hard time with compatibility).
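To make the "roll your own" remark above concrete, here is a toy version of such a regex-driven event stream (a sketch only; real markup would need attributes containing '>', comments, CDATA, entities, and much more):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ToyTagLexer {
    // Matches a closing tag, an opening tag, or a run of text.
    private static final Pattern TOKEN =
            Pattern.compile("</(\\w+)>|<(\\w+)[^>]*>|([^<]+)");

    public static void main(String[] args) {
        String input = "<parent><child>text</parent></child>";
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            // Emit events without any validation: the mis-nesting above is
            // reported exactly as it appears, and it is the consumer's
            // problem to decide what tree (if any) it implies.
            if (m.group(1) != null) {
                System.out.println("CLOSE " + m.group(1));
            } else if (m.group(2) != null) {
                System.out.println("OPEN  " + m.group(2));
            } else {
                System.out.println("TEXT  " + m.group(3));
            }
        }
    }
}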
Try the parser on the XHtml object. It is much more lenient than the one on XML.
Take a look at htmlcleaner. I have used it successfully to convert "HTML from the wild" to valid XML.
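A minimal sketch of that conversion, assuming HtmlCleaner's 2.x API (the input string is a placeholder):

import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.SimpleXmlSerializer;
import org.htmlcleaner.TagNode;

public class CleanExample {
    public static void main(String[] args) throws Exception {
        String messy = "<parent><child>text</parent>";

        // HtmlCleaner repairs the nesting and produces a tree ...
        HtmlCleaner cleaner = new HtmlCleaner();
        TagNode root = cleaner.clean(messy);

        // ... which can be serialized back out as well-formed XML.
        String xml = new SimpleXmlSerializer(cleaner.getProperties())
                .getAsString(root);
        System.out.println(xml);
    }
}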
Try Tag Soup.
JTidy does something similar but only for HTML.
I mostly agree with Daniel Spiewak's answer. This is just another way to create "your own parser".
While I don't know of any Scala-specific solution, you can try using Woodstox, a Java library that implements the StAX API. (Being an event-based API, I assume it will be more fault-tolerant than a DOM parser.)
There is also a Scala wrapper around Woodstox called Frostbridge, developed by the same guy who made the Simple Build Tool for Scala.
I had mixed opinions about Frostbridge when I tried it, but perhaps it is more suitable for your purposes.
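If you do go the StAX route, the standard javax.xml.stream API looks like this (Woodstox plugs in as the implementation when it is on the classpath); note that the input must still be well-formed enough for the parser to keep going:

import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxLinks {
    public static void main(String[] args) throws Exception {
        String xml = "<p><a href=\"http://example.com\">a link</a></p>";

        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));

        // Pull events one at a time and pick out the anchor elements.
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && "a".equals(reader.getLocalName())) {
                System.out.println(reader.getAttributeValue(null, "href"));
            }
        }
    }
}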
I agree with the answers that turning invalid XML into "correct" XML is impossible.
Why don't you just do a regular text search for the hrefs, if that's all you're interested in? One issue would be commented-out links, but if the XML is invalid, it might not be possible to tell what was intended to be commented out anyway!
Caucho has a JAXP compliant XML parser that is a little bit more tolerant than what you would usually expect. (Including support for dealing with escaped character entity references, AFAIK.)
Find JavaDoc for the parsers here
A related topic (with my solution) is listed below:
Scala and html parsing
I need to write a Java application that does a keyword search within the tags and the actual data of many XML files. From my research online I get the feeling I have to use Xalan, but I can't figure out how to use it or what it does. Could somebody point me in the right direction? Thanks
The first thing you need to do is to decide what data you're actually going to search. You say "within the tags and actual data" -- does that mean that you'll do a keyword search for an element name? Or an element name and content within it?
Depending on how complex your search queries are, you'll probably want to turn to a real search engine, like Lucene. I will say, however, that before you take this step you need to give a lot of thought to how you plan to search, so that you build an appropriate index.
If your search requirements are simpler, you could load the documents into a DOM and use XPath. I'd suggest trying this out before moving to Lucene.
You don't need Xalan; the JDK comes with XML parsers and an XPath evaluator. I've written a couple of articles on using them: (parsing), (xpath).
Xalan is an XSLT processor: it enables you to write an XSL stylesheet that will transform your source XML document into something else.
You could write an XSL transform and then search the result of the transform.
Another option is to parse the document with an XML parser and then use Lucene: see Parsing, indexing, and searching XML documents with Digester and Lucene.
You may also want to use XPath. It all depends on what exactly you want to achieve.
It sounds like you are looking for an XPath implementation for Java. This allows you to construct a search expression and apply it to one or more XML documents (which generally have to have been parsed first). Xalan is one option, but there are others. Versions of Java starting with Java 5 include XML parsing and XPath capabilities, so if you are using a recent version of Java and simply want to parse and search through a set of XML documents, you likely need nothing besides the Java SDK.
See this article for a good (but somewhat dated) overview of the XPath capabilities that come "out of the box": http://www.ibm.com/developerworks/library/x-javaxpathapi.html
See this SO post on how to do a search using the contains() XPath function.
As for an example on how to do an XPath query, I suggest looking at the Java XPath documentation. Here's the example code they provide:
import javax.xml.xpath.*;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

XPath xpath = XPathFactory.newInstance().newXPath();
String expression = "/widgets/widget";
InputSource inputSource = new InputSource("widgets.xml");
NodeList nodes = (NodeList) xpath.evaluate(expression, inputSource, XPathConstants.NODESET);
This would load the file widgets.xml and return a NodeList of all nodes matching the expression.