I have this MATLAB function:
function trackNames = getTrackName(xpath, gpxSourceDom)
% Import the XPath classes
import javax.xml.xpath.*
% Compile the expression
expression = xpath.compile('gpx/trk/name');
% Apply the expression to the DOM.
trackNames = expression.evaluate(gpxSourceDom, XPathConstants.NODESET);
end
I need a way to print every element in the trackNames node set. How can I do that?
A quick search of MATLAB and xpath returned this result:
using xpath in matlab
The part you're missing is iterating through the results and displaying the name. For more ideas of what you can do with the nodes, check out the javadoc.
for i = 1:trackNames.getLength
    % NodeList.item is zero-based, while the MATLAB loop is one-based
    node = trackNames.item(i-1);
    disp(char(node.getFirstChild.getNodeValue))
end
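Since the MATLAB code is calling the Java XPath API directly, the same loop can be sketched in plain Java. This is only an illustration: the class name TrackNames, the helper trackNames, and the sample GPX string are made up, and the sample has no namespace declaration (real GPX files usually declare one, which would need extra handling).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TrackNames {
    // Hypothetical helper: returns the text of every gpx/trk/name element.
    static List<String> trackNames(String gpxXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(gpxXml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    "gpx/trk/name", doc, XPathConstants.NODESET);
            List<String> names = new ArrayList<>();
            // NodeList is indexed from 0 to getLength() - 1
            for (int i = 0; i < nodes.getLength(); i++) {
                names.add(nodes.item(i).getTextContent());
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample data without a namespace declaration
        String gpx = "<gpx><trk><name>Morning run</name></trk>"
                + "<trk><name>Evening ride</name></trk></gpx>";
        trackNames(gpx).forEach(System.out::println);
    }
}
```

getTextContent is used here instead of getFirstChild.getNodeValue so that an empty name element does not cause a null pointer error.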
public WebElement findChildByXpath(WebElement parent, String xpath) {
    loggingService.timeMark("findChildByXpath", "begin. Xpath: " + xpath);
    String parentInnerHtml = parent.getAttribute("innerHTML"); // Uncomment for debug purpose.
    WebElement child = parent.findElement(By.xpath(xpath));
    String childInnerHtml = child.getAttribute("innerHTML"); // Uncomment for debug purpose.
    return child;
}
The problem with this code is that childInnerHtml gives me the wrong result. I scrape numbers, and they all come out equal.
I even suspect that my code is equivalent to driver.findElement(By.xpath(...)).
Could you tell me whether my code really finds a child, and if not, what to correct?
The child XPath needs to be a relative XPath. Normally this means the expression starts with a dot (.), which makes it relative to the node it is applied to, i.e. the parent node. Otherwise Selenium will search for the given XPath (the parameter you pass to this method) starting from the top of the entire page.
So if, for example, the passed XPath is "//span[@id='myId']", it should be ".//span[@id='myId']".
Alternatively, you can add the dot inside the parent.findElement(By.xpath(xpath)); line to make it
WebElement child = parent.findElement(By.xpath("." + xpath));
But passing the XPath with the dot already in place is the simpler and cleaner way, especially if the passed XPath is some complex expression like "(//div[@class='myClass'])[5]//input". In that case automatically prepending a dot may not work properly: the result, ".(//div[@class='myClass'])[5]//input", is not a valid expression.
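The same context-node rule applies in the JDK's own javax.xml.xpath API, so the effect can be reproduced without a browser. The sketch below is not Selenium code: the class name RelativeXPathDemo, the helper firstSpanText, and the sample markup are all made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class RelativeXPathDemo {
    // Hypothetical markup: two parents, each holding its own number.
    static final String PAGE =
            "<body>"
            + "<div id='first'><span>1</span></div>"
            + "<div id='second'><span>2</span></div>"
            + "</body>";

    // Evaluates the expression with the second div as the context node
    // and returns the text of the first match.
    static String firstSpanText(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(PAGE)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            Node parent = (Node) xpath.evaluate(
                    "//div[@id='second']", doc, XPathConstants.NODE);
            Node child = (Node) xpath.evaluate(
                    expression, parent, XPathConstants.NODE);
            return child.getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        // "//span" starts at the document root, so it finds the span
        // under the first div even though the context is the second div.
        System.out.println(firstSpanText("//span"));   // prints 1
        // ".//span" starts at the context node.
        System.out.println(firstSpanText(".//span"));  // prints 2
    }
}
```

This matches the symptom in the question: without the leading dot, every parent yields the same first match on the page.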
Using Jsoup, I want to be able to add the text in each HTML tag to a List<String>, in order.
This is fairly easy using BeautifulSoup4 in Python, but I'm having a hard time in Java.
BeautifulSoup Code:
from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request

def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True

def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)
    text_list = []
    for t in visible_texts:
        text_list.append(t.strip())
    return list(filter(None, text_list))

html = urllib.request.urlopen('https://someURL.com/something').read()
print(text_from_html(html))
This code will print ["text1", "text2", "text3",...]
My initial attempt was to follow the Jsoup documentation for text conversion.
Jsoup Code Attempt-1:
Document doc = Jsoup.connect("https://someURL.com/something")
        .userAgent("Bot")
        .get();
Elements divElements = doc.select("*");
List<String> texts = divElements.eachText();
System.out.println(texts);
What ends up happening is a duplication of texts ["text1 text2 text3","text2 text3", "text3",...]
My assumption is that Jsoup goes through each Element and prints out every text within that Element including the text existing in each child node. Then it goes to the child node and prints out the remaining text, so on and so forth.
I have seen many people specify tags or attributes via cssQuery to get around this problem, but my project needs this to work for any scrapeable website.
Any suggestion is appreciated.
Your assumption is right, but BeautifulSoup would probably do the same. It is only the text=True argument in findAll(text=True) that limits the result to pure text nodes. For the equivalent in Jsoup, use the following selector:
Elements divElements = doc.select(":matchText");
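The difference between "text of every element" and "text nodes only" can also be seen with the JDK's DOM and XPath, which may help when reasoning about the duplication. This is not Jsoup code and only handles well-formed markup. The class name TextNodeDemo, the helper texts, and the sample fragment are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TextNodeDemo {
    // Hypothetical fragment with text at three nesting levels.
    static final String MARKUP = "<div>text1<p>text2<b>text3</b></p></div>";

    // Returns the trimmed text content of every node the expression selects.
    static List<String> texts(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(MARKUP)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    expression, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent().trim());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        // Every element: each one repeats the text of its descendants.
        System.out.println(texts("//*"));
        // Non-blank text nodes only: each piece of text appears once.
        System.out.println(texts("//text()[normalize-space()]"));
    }
}
```

Selecting elements repeats nested text once per ancestor, while selecting text nodes yields each piece exactly once, in document order.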
I'm using the S9API with Saxon 9.7 HE, and I have a NodeInfo object. I need to create an xPath statement that uniquely corresponds to this NodeInfo. It's easy enough to get the prefix, the node name, and the parent:
String prefix = node.getPrefix();
String localPart = node.getLocalPart();
NodeInfo parent = node.getParent();
So I can walk up the tree, building the xPath as I go. But what I can't find is any way to get the positional predicate info. IOW, it's not sufficient to create:
/persons/person/child
because it might match multiple child elements. I need to create:
/persons/person[2]/child[1]
which will match only one child element. Any ideas on how to get the positional predicate info? Or maybe there's a better way to do this altogether?
BTW, for those who use the generic DOM and not the S9API, here's an easy solution to this problem: http://practicalxml.sourceforge.net/apidocs/net/sf/practicalxml/DomUtil.html#getAbsolutePath(org.w3c.dom.Element)
Edit: @Michael Kay's answer works. To add some meat to it:
XPathExpression xPathExpression = xPath.compile("./path()");
List matches = (List) xPathExpression.evaluate(node, XPathConstants.NODESET);
String pathToNode = matches.get(0).toString();
// If you want to remove the expanded QName syntax:
pathToNode = pathToNode.replaceAll("Q\\{.*?\\}", "");
This must be done using the same xPath object that was previously used to acquire the NodeInfo object.
In XPath 3.0 you can use fn:path().
Earlier Saxon releases offer saxon:path().
The challenge here is handling namespaces. fn:path() returns a path that's not sensitive to namespace-prefix bindings by using the new expanded-QName syntax
/Q{}persons/Q{}person[2]/Q{}child[1]
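For readers on the generic DOM rather than Saxon, the "walk up the tree" idea from the question can be sketched by counting preceding siblings with the same name at each level. This is a minimal sketch that assumes a document without namespaces. The class name DomPath and both helpers are made up, and this is not the implementation of the DomUtil method linked above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DomPath {
    // Builds a path such as /persons[1]/person[2]/child[1].
    // Assumption: no namespaces, so plain node names identify elements.
    static String absolutePath(Element element) {
        StringBuilder path = new StringBuilder();
        // Walk upwards until the parent is the Document node.
        for (Node node = element; node instanceof Element; node = node.getParentNode()) {
            int position = 1;
            // Count earlier siblings that share this element's name.
            for (Node sib = node.getPreviousSibling(); sib != null; sib = sib.getPreviousSibling()) {
                if (sib instanceof Element && sib.getNodeName().equals(node.getNodeName())) {
                    position++;
                }
            }
            path.insert(0, "/" + node.getNodeName() + "[" + position + "]");
        }
        return path.toString();
    }

    // Hypothetical convenience wrapper: parse, select one element, build its path.
    static String pathOfFirstMatch(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Element match = (Element) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODE);
            return absolutePath(match);
        } catch (Exception e) {
            throw new IllegalStateException("Could not build path", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<persons><person><child/></person>"
                + "<person><child/><child/></person></persons>";
        System.out.println(pathOfFirstMatch(xml, "//person[2]/child[1]"));
    }
}
```

Every step gets a positional predicate, including the root, which keeps the result unambiguous at the cost of some verbosity.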
So I'm trying to learn some XML parsing here, and I'm getting the hang of it, but for whatever reason I seem to have to tack "text()" onto the end of each query; otherwise I get null values returned to me. I don't actually understand what this "text()" ending does, but I know it's not necessary, and I'm wondering why I can't omit it. Please help! Here is my code:
import org.w3c.dom.*;
import javax.xml.xpath.*;
import javax.xml.parsers.*;
import java.io.IOException;
import org.xml.sax.SAXException;
public class ParseClass
{
    public static void main(String[] args)
            throws ParserConfigurationException, SAXException,
                   IOException, XPathExpressionException
    {
        DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
        domFactory.setNamespaceAware(true);
        DocumentBuilder builder = domFactory.newDocumentBuilder();
        Document doc = builder.parse("C:\\Users\\Brandon\\Job\\XPath\\XPath_Sample_Stuff\\catalog.xml");
        XPath xpath = XPathFactory.newInstance().newXPath();
        XPathExpression expr = xpath.compile("/catalog/book[author='Thurman, Paula']/title/text()");
        Object result = expr.evaluate(doc, XPathConstants.NODESET);
        NodeList nodes = (NodeList) result;
        for (int i = 0; i < nodes.getLength(); i++)
        {
            System.out.println(nodes.item(i).getNodeValue());
        }
    }
}
P.S. In case you didn't notice, I'm using XPath and DOM for my parsing.
You're calling getNodeValue on your result, and as the docs show (see the table), it is null for a node of type Element. When you use text(), the returned set contains nodes of type Text instead, so you get the results you wanted (i.e. the contents of the title element rather than the element itself).
I'd also suggest seeing this for more info on the usage of text() in xpath.
And if you want to extract the text from your element, directly, you could use getTextContent instead of getNodeValue:
// Will work for both element and text nodes
System.out.println(nodes.item(i).getTextContent());
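The three cases can be compared side by side in a small sketch. The class name NodeValueDemo, the helper first, and the one-book catalog are made up for illustration, since the original catalog.xml is not shown.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NodeValueDemo {
    // Hypothetical stand-in for catalog.xml
    static final String CATALOG =
            "<catalog><book><author>Thurman, Paula</author>"
            + "<title>Splish Splash</title></book></catalog>";

    // Returns the first node selected by the expression.
    static Node first(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(CATALOG)));
            return (Node) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODE);
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        String base = "/catalog/book[author='Thurman, Paula']/title";
        Node element = first(base);            // an Element node
        Node text = first(base + "/text()");   // a Text node

        System.out.println(element.getNodeValue());   // prints null
        System.out.println(text.getNodeValue());      // prints Splish Splash
        System.out.println(element.getTextContent()); // prints Splish Splash
    }
}
```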
First of all, check the predicate in your XPath expression. Attributes are indicated with @, so if author is an attribute of book rather than a child element, the correct XPath is /catalog/book[@author='Thurman, Paula']/title/text(). As written, book[author='Thurman, Paula'] tests a child element named author.
/catalog/book[@author='Thurman, Paula']/title will match the <title> element node from your XML, whereas /catalog/book[@author='Thurman, Paula']/title/text() will match the text node inside <title>. That is, if the title node were something like <title>The Godfather</title>, the latter expression would match The Godfather.
A suggestion: don't use DOM. There are many tree representations of XML available in the Java world (JDOM, XOM, DOM4J) that are vastly more usable than DOM. DOM is full of gotchas like the one you just encountered, where getNodeValue() on an element returns null. The only reason anyone uses DOM is that (a) it came originally from W3C, and (b) it found its way into the JDK. But that all happened an awfully long time ago, and people have learnt from its design mistakes.
Below is my element hierarchy. How can I check (using XPath) that the AttachedXml element is present under CreditReport of the Primary Consumer?
<Consumers xmlns="http://xml.mycompany.com/XMLSchema">
<Consumer subjectIdentifier="Primary">
<DataSources>
<Credit>
<CreditReport>
<AttachedXml><![CDATA[ blah blah]]>
Use the boolean() XPath function.
The boolean function converts its argument to a boolean as follows:
- a number is true if and only if it is neither positive or negative zero nor NaN
- a node-set is true if and only if it is non-empty
- a string is true if and only if its length is non-zero
- an object of a type other than the four basic types is converted to a boolean in a way that is dependent on that type
If there is an AttachedXml in the CreditReport of primary Consumer, then it will return true().
boolean(/mc:Consumers
          /mc:Consumer[@subjectIdentifier='Primary']
            //mc:CreditReport/mc:AttachedXml)
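With the JDK's XPath API, the mc prefix in that expression has to be bound to the document's namespace URI through a NamespaceContext, and the parser has to be namespace-aware. A minimal sketch follows, assuming the structure from the question. The class name AttachedXmlCheck and the helper hasAttachedXml are made up, and the mc prefix is arbitrary.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class AttachedXmlCheck {
    static final String NS = "http://xml.mycompany.com/XMLSchema";

    static boolean hasAttachedXml(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for prefixed XPath names
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            // Bind the (arbitrary) prefix "mc" to the document's namespace.
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override
                public String getNamespaceURI(String prefix) {
                    return "mc".equals(prefix) ? NS : XMLConstants.NULL_NS_URI;
                }
                @Override
                public String getPrefix(String uri) {
                    return NS.equals(uri) ? "mc" : null;
                }
                @Override
                public Iterator<String> getPrefixes(String uri) {
                    return NS.equals(uri)
                            ? Collections.singletonList("mc").iterator()
                            : Collections.<String>emptyIterator();
                }
            });

            return (Boolean) xpath.evaluate(
                    "boolean(/mc:Consumers"
                    + "/mc:Consumer[@subjectIdentifier='Primary']"
                    + "//mc:CreditReport/mc:AttachedXml)",
                    doc, XPathConstants.BOOLEAN);
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Consumers xmlns='" + NS + "'>"
                + "<Consumer subjectIdentifier='Primary'><DataSources><Credit>"
                + "<CreditReport><AttachedXml><![CDATA[ blah blah]]></AttachedXml>"
                + "</CreditReport></Credit></DataSources></Consumer></Consumers>";
        System.out.println(hasAttachedXml(xml));
    }
}
```

Requesting XPathConstants.BOOLEAN avoids having to interpret a null or empty node result by hand.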
The Saxon documentation, though a little unclear, seems to suggest that the JAXP XPath API will return false when evaluating an XPath expression if no matching nodes are found.
This IBM article mentions a return value of null when no nodes are matched.
You might need to play around with the return types a bit based on this API, but the basic idea is that you just run a normal XPath and check whether the result is a node / false / null / etc.
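// Saxon-specific object model; NamespaceConstant is a Saxon class
XPathFactory xpathFactory = XPathFactory.newInstance(NamespaceConstant.OBJECT_MODEL_SAXON);
XPath xpath = xpathFactory.newXPath();
// Note: unprefixed names only match elements in no namespace. The sample document
// declares a default namespace, so prefixes and a NamespaceContext may be needed.
XPathExpression expr = xpath.compile("/Consumers/Consumer/DataSources/Credit/CreditReport/AttachedXml");
Object result = expr.evaluate(doc, XPathConstants.NODE);
if (result == null) {
    // no matching node was found; do something
}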
Use:
boolean(/*/*[@subjectIdentifier="Primary"]/*/*/*/*
[name()='AttachedXml'
and
namespace-uri()='http://xml.mycompany.com/XMLSchema'
]
)
Normally, when you try to select a node using XPath, your XPath engine will return null or an equivalent if the node doesn't exist.
xpath: "/Consumers/Consumer/DataSources/Credit/CreditReport/AttachedXml"
If you're using XSL, check out this question for an answer:
xpath find if node exists
Take a look at my example:
<tocheading language="EN">
<subj-group>
<subject>Editors Choice</subject>
<subject>creative common</subject>
</subj-group>
</tocheading>
Now, to check whether 'creative common' exists:
tocheading/subj-group/subject/text() = 'creative common'
Hope this helps.
If boolean() is not available (the tool I'm using does not support it), one way to achieve the same thing is:
//SELECT[@id='xpto']/OPTION[not(not(@selected))]
In this case, one of the OPTION elements under the SELECT is the selected one. The "selected" attribute has no value; it just exists, while the other OPTION elements do not have it. The double negation keeps only the OPTION that carries the attribute, which achieves the objective.
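In XPath 1.0 a predicate already converts a node-set to a boolean, so the shorter OPTION[@selected] should select the same element as the double negation. A small sketch with the JDK's XPath shows both forms. The class name SelectedOptionDemo, the helper selectedText, and the markup are made up, and selected is written as selected='selected' because a valueless attribute is not well-formed XML.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class SelectedOptionDemo {
    // Hypothetical markup; only the second OPTION carries the attribute.
    static final String FORM =
            "<SELECT id='xpto'>"
            + "<OPTION>a</OPTION>"
            + "<OPTION selected='selected'>b</OPTION>"
            + "<OPTION>c</OPTION>"
            + "</SELECT>";

    // Returns the text of the first OPTION that satisfies the predicate.
    static String selectedText(String predicate) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(FORM)));
            return (String) XPathFactory.newInstance().newXPath().evaluate(
                    "//SELECT[@id='xpto']/OPTION[" + predicate + "]",
                    doc, XPathConstants.STRING);
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(selectedText("not(not(@selected))")); // prints b
        System.out.println(selectedText("@selected"));           // prints b
    }
}
```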