Java XML DOM: how are id Attributes special? - java

The javadoc for the Document class has the following note under getElementById.
Note: Attributes with the name "ID" or "id" are not of type ID unless so defined
So, I read an XHTML doc into the DOM (using Xerces 2.9.1).
The doc has a plain old <p id='fribble'> in it.
I call getElementById("fribble"), and it returns null.
I use XPath to get "//*[@id='fribble']", and all is well.
So, the question is, what causes the DocumentBuilder to actually mark ID attributes as 'so defined?'

These attributes are special because of their type and not because of their name.
IDs in XML
Although it is easy to think of attributes as name="value" pairs, with the value being a simple string, that is not the full story: every attribute also has an attribute type associated with it.
This is easy to appreciate when there is an XML Schema involved, since XML Schema supports datatypes for both XML elements and XML attributes. The XML attributes are defined to be of a simple type (e.g. xs:string, xs:integer, xs:dateTime, xs:anyURI). The attributes being discussed here are defined with the xs:ID built-in datatype (see section 3.3.8 of the XML Schema Part 2: Datatypes).
<xs:element name="foo">
<xs:complexType>
...
<xs:attribute name="bar" type="xs:ID"/>
...
</xs:complexType>
</xs:element>
Although DTDs don't support the rich datatypes of XML Schema, they do support a limited set of attribute types (defined in section 3.3.1 of XML 1.0). The attributes being discussed here are declared with an attribute type of ID.
<!ATTLIST foo bar ID #IMPLIED>
With either the above XML Schema or DTD, the following element will be identified by the ID value of "xyz".
<foo bar="xyz"/>
Without knowing the XML Schema or DTD, there is no way to tell what is an ID and what is not:
Attributes with the name of "id" do not necessarily have an attribute type of ID; and
Attributes with names that are not "id" might have an attribute type of ID!
To improve this situation, xml:id was subsequently invented (see the xml:id W3C Recommendation). This is an attribute that always has the same prefix and name, and is intended to be treated as an attribute with an attribute type of ID. However, whether it is treated that way depends on whether the parser being used is aware of xml:id. Since many parsers were written before xml:id was defined, it might not be supported.
IDs in Java
In Java, getElementById() finds elements by looking for attributes of type ID, not for attributes with the name of "id".
In the above example, getElementById("xyz") will return that foo element, even though the name of the attribute on it is not "id" (assuming the DOM knows that bar has an attribute type of ID).
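As a minimal sketch of the foo/bar example, the following assumes the ATTLIST declaration is supplied through an internal DTD subset and that the JDK's default (Xerces-derived) parser is used. The class name DtdIdDemo and its parse helper are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

final class DtdIdDemo {
    // The internal DTD subset declares foo/@bar with attribute type ID.
    static final String XML =
          "<!DOCTYPE foo [\n"
        + "  <!ELEMENT foo EMPTY>\n"
        + "  <!ATTLIST foo bar ID #IMPLIED>\n"
        + "]>\n"
        + "<foo bar=\"xyz\"/>";

    // Hypothetical helper: parses a string with the default DocumentBuilder.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(XML);
        // The lookup goes by attribute type, so the attribute name "bar" is irrelevant.
        Element found = doc.getElementById("xyz");
        System.out.println(found == null ? "not found" : "found <" + found.getTagName() + ">");
    }
}
```

Parsing the same element without the DOCTYPE should make the lookup return null, because nothing then tells the DOM that bar is an ID.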
So how does the DOM know what attribute type an attribute has? There are three ways:
Provide an XML Schema to the parser (example)
Provide a DTD to the parser
Explicitly indicate to the DOM that it is treated as an attribute type of ID.
The third option is done using the setIdAttribute() or setIdAttributeNS() or setIdAttributeNode() methods on the org.w3c.dom.Element class.
Document doc;
Element fooElem;
doc = ...; // load XML document instance
fooElem = ...; // locate the element node "foo" in doc
fooElem.setIdAttribute("bar", true); // without this, 'found' would be null
Element found = doc.getElementById("xyz");
This has to be done for each element node that carries an attribute of this type. There is no simple built-in method to make all occurrences of attributes with a given name (e.g. "id") be of attribute type ID.
This third approach is only useful in situations where the code calling getElementById() is separate from the code creating the DOM. If it were the same code, it would already have found the element in order to set the ID attribute, so it is unlikely to need to call getElementById().
Also, be aware that these methods were not in the original DOM specification: getElementById() was introduced in DOM Level 2, and the setIdAttribute() family in DOM Level 3.
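Since there is no built-in way to flag every attribute of a given name, one option is a small helper that walks the document and calls setIdAttribute() on each matching element. This is only a sketch: IdMarker and markIdAttributes are hypothetical names, and the helper does not check that the values are unique.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class IdMarker {
    /** Flags every attribute called attrName as type ID and returns how many were flagged. */
    static int markIdAttributes(Document doc, String attrName) {
        NodeList all = doc.getElementsByTagName("*"); // "*" matches every element
        int flagged = 0;
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            if (e.hasAttribute(attrName)) {
                e.setIdAttribute(attrName, true);
                flagged++;
            }
        }
        return flagged;
    }

    // Hypothetical helper: parses a string with the default DocumentBuilder.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html><p id='fribble'>hello</p></html>");
        System.out.println("before: " + doc.getElementById("fribble")); // no ID types known yet
        markIdAttributes(doc, "id");
        System.out.println("after: " + doc.getElementById("fribble").getTagName());
    }
}
```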
IDs in XPath
The XPath in the original question gave a result because it was only matching the attribute name.
To match on attribute type ID values, the XPath id function needs to be used (it is one of the Node Set Functions from XPath 1.0):
id("xyz")
If that had been used, the XPath would have given the same result as getElementById() (i.e. no match found).
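The difference between the two expressions can be checked with the JDK's built-in XPath engine. The sketch below assumes a document parsed without any DTD or schema, as in the original question; XPathIdDemo and count are illustrative names only.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XPathIdDemo {
    // No DTD and no schema, so no attribute here has attribute type ID.
    static final String XML = "<html><p id='fribble'>text</p></html>";

    /** Returns the number of nodes the expression selects in XML. */
    static int count(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("evaluation failed: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Matches on the attribute *name*.
        System.out.println("by name: " + count("//*[@id='fribble']"));
        // Matches on the attribute *type*, like getElementById().
        System.out.println("by type: " + count("id('fribble')"));
    }
}
```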
IDs in XML continued
Two important features of ID should be highlighted.
Firstly, the values of all attributes of attribute type ID must be unique to the whole XML document. In the following example, if personId and companyId both have attribute type of ID, it would be an error to add another company with companyId of id24601, because it will be a duplicate of an existing ID value. Even though the attribute names are different, it is the attribute type that matters.
<test1>
<person personId="id24600">...</person>
<person personId="id24601">...</person>
<company companyId="id12345">...</company>
<company companyId="id12346">...</company>
</test1>
Secondly, the attributes are defined on elements rather than on the entire XML document, so attributes with the same attribute name on different elements might have different attribute types. In the following example XML document, if only alpha/@bar has an attribute type of ID (and no other attribute does), getElementById("xyz") will return an element, but getElementById("abc") will not (since beta/@bar is not of attribute type ID). Also, it is not an error for the attribute gamma/@bar to have the same value as alpha/@bar: that value is not considered in the uniqueness of IDs in the XML document because it is not of attribute type ID.
<test2>
<alpha bar="xyz"/>
<beta bar="abc"/>
<gamma bar="xyz"/>
</test2>

For the getElementById() call to work, the Document has to know the types of its nodes, and the target node must be of the XML ID type for the method to find it. It knows about the types of its elements via an associated schema. If the schema is not set, or does not declare the id attribute to be of the XML ID type, getElementById() will never find it.
My guess is that your document doesn't know the p element's id attribute is of the XML ID type (is it?). You can navigate to the node in the DOM using getChildNodes() and other DOM-traversal functions, and try calling Attr.isId() on the id attribute to tell for sure.
From the getElementById javadoc:
The DOM implementation is expected to
use the attribute Attr.isId to
determine if an attribute is of type
ID.
Note: Attributes with the name "ID" or
"id" are not of type ID unless so
defined.
If you are using a DocumentBuilder to parse your XML into a DOM, be sure to call setSchema(schema) on the DocumentBuilderFactory before calling newDocumentBuilder(), to ensure that the builder you get from the factory is aware of element types.
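A hedged sketch of that setup follows, using an inline schema equivalent to the foo/bar example. It assumes the JDK's default parser, which is expected to pass the schema's xs:ID type on to the DOM; other implementations may behave differently. SchemaIdDemo is a made-up name.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class SchemaIdDemo {
    // Declares foo/@bar as xs:ID.
    static final String XSD =
          "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
        + "<xs:element name=\"foo\"><xs:complexType>"
        + "<xs:attribute name=\"bar\" type=\"xs:ID\"/>"
        + "</xs:complexType></xs:element>"
        + "</xs:schema>";

    static Document parse(String xml) {
        try {
            Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // schema processing needs namespace awareness
            factory.setSchema(schema);       // must be set before newDocumentBuilder()
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<foo bar=\"xyz\"/>");
        System.out.println(doc.getElementById("xyz") != null ? "found" : "not found");
    }
}
```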

An ID attribute isn't an attribute whose name is "ID"; it's an attribute that is declared to be an ID attribute by a DTD or a schema. For example, the HTML 4 DTD declares it:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">

The corresponding XPath expression would actually be id('fribble'), which should return the same result as getElementById. For this to work, the DTD or schema associated with your document has to declare the attribute as being of type ID.
If you are in control of the queried XML you could also try renaming the attribute to xml:id as per http://www.w3.org/TR/xml-id/.

The following will allow you to get an element by id:
public static Element getElementById(Element rootElement, String id)
{
try
{
String path = String.format("//*[@id = '%1$s' or @Id = '%1$s' or @ID = '%1$s' or @iD = '%1$s' ]", id);
XPath xPath = XPathFactory.newInstance().newXPath();
NodeList nodes = (NodeList)xPath.evaluate(path, rootElement, XPathConstants.NODESET);
return (Element) nodes.item(0);
}
catch (Exception e)
{
return null;
}
}

Related

Add attribute to xml element not allowed in DTD

I have to use an external DTD that specifies that a certain element can only have an id attribute:
<!ELEMENT x (y | z)>
<!ATTLIST x id ID #IMPLIED>
So something like this is valid
<x id="x">...</x>
But if I try something like this:
<x id="x" custom="custom">...</x>
My parser gives me the following error:
Attribute "custom" must be declared for element type "x".
So I understand what the error says and why it's happening, but as I said the DTD is external and sadly I can't change it. Is there a workaround or a hack that I can use to add my own custom attribute?
You can either disable DTD validation in your parser, or try declaring the attribute in an internal DTD subset.
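The internal-subset option can be sketched as follows. The external DTD is simulated with an EntityResolver so the example is self-contained; the declarations for y and z, the file name external.dtd, and the class name InternalSubsetDemo are assumptions made for illustration. XML allows several ATTLIST declarations for the same element, so the internal one adds custom without touching the external file.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

final class InternalSubsetDemo {
    // Stand-in for the external DTD that cannot be changed (y and z assumed EMPTY).
    static final String EXTERNAL_DTD =
          "<!ELEMENT x (y | z)>\n<!ATTLIST x id ID #IMPLIED>\n"
        + "<!ELEMENT y EMPTY>\n<!ELEMENT z EMPTY>";

    static final String BODY = "<x id=\"x\" custom=\"custom\"><y/></x>";
    static final String WITHOUT_SUBSET = "<!DOCTYPE x SYSTEM \"external.dtd\">" + BODY;
    // The internal subset declares the extra attribute.
    static final String WITH_SUBSET =
        "<!DOCTYPE x SYSTEM \"external.dtd\" [<!ATTLIST x custom CDATA #IMPLIED>]>" + BODY;

    /** Parses with DTD validation switched on and returns the validation errors. */
    static List<String> validate(String xml) {
        final List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Serve the simulated external DTD for any external entity request.
            builder.setEntityResolver((publicId, systemId) ->
                    new InputSource(new StringReader(EXTERNAL_DTD)));
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) { errors.add(e.getMessage()); }
                public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println("without subset: " + validate(WITHOUT_SUBSET));
        System.out.println("with subset:    " + validate(WITH_SUBSET));
    }
}
```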

web crawling from BBC website using XPath

This is an example of a typical webpage I am trying to crawl
http://www.bbc.com/news/business-31013604
If you inspect the page, the main article is under
<div class="story-body">
However, when I try to get the main content using
MongoClient mongoClient = new MongoClient("127.0.0.1", 27017);
DB db = mongoClient.getDB("nutch");
DBCollection coll = db.getCollection("crawl_data");
BasicDBObject bo = new BasicDBObject("url", url).append("fetch_time", new Date());
bo.append("article_text", getXPathValue(doc,"//DIV[@class='story-body']"));
I am not able to get the article content. In the database, it shows null in that field.
I have successfully crawled some pages from Reuters, so the function getXPathValue should be correct.
I fetch pages using an HTTP request; I don't know if that is the issue here.
The problem is that you are crawling an XHTML page (or at least a document in the XHTML namespace). The most significant difference between HTML and XHTML is that XHTML documents have a default namespace:
<root xmlns="www.example-of-default-namespace.com"/>
An XPath expression that does not take into account namespaces, for example
//root
will never find this element, because it's in a namespace.
The same happens with your XHTML document. There are two ways to solve this problem.
Register the XHTML namespace
The first, and more appropriate, solution is to register or declare the XHTML namespace in your code, and then use a prefix in your XPath expression. Since you do not show the relevant code, I can hardly comment on how to do that; we don't even know the programming language.
Ignore namespaces
Secondly, you can ignore any namespaces by modifying your XPath expression to
//*[local-name() = 'div' and @class='story-body']
Here * is a wildcard for any element, in any (or no) namespace, and local-name() returns the local part of an element or attribute name. In XML, there are qualified names that look like:
prefix:root
The first part of this qualified name is the prefix, and the second part is the local name of this element. So, the result of local-name(prefix:root) is root.
Also note that I have lowercased "div". HTML might be case-insensitive, but XHTML, and by extension, XML, and by extension, XPath are not.
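Both solutions can be sketched in Java with the JDK's javax.xml.xpath API. The prefix "h", the class name XhtmlXPathDemo and the tiny sample page are assumptions for illustration; the real BBC page is not fetched here.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XhtmlXPathDemo {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";
    static final String PAGE = "<html xmlns=\"" + XHTML_NS + "\"><body>"
            + "<div class=\"story-body\">article text</div></body></html>";

    /** Binds the arbitrary prefix "h" to the XHTML namespace. */
    static final NamespaceContext XHTML_CONTEXT = new NamespaceContext() {
        public String getNamespaceURI(String prefix) {
            return "h".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
        }
        public String getPrefix(String uri) {
            return XHTML_NS.equals(uri) ? "h" : null;
        }
        public Iterator<String> getPrefixes(String uri) {
            String p = getPrefix(uri);
            return (p == null ? Collections.<String>emptyList()
                              : Collections.singletonList(p)).iterator();
        }
    };

    /** Returns how many nodes the expression selects in PAGE. */
    static int count(String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // keep namespace information in the DOM
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(PAGE)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(XHTML_CONTEXT);
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("evaluation failed: " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(count("//div[@class='story-body']"));   // no namespace: no match
        System.out.println(count("//h:div[@class='story-body']")); // registered prefix
        System.out.println(count("//*[local-name()='div' and @class='story-body']"));
    }
}
```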

How to remove an specific xml attribute from org.w3c.dom.Document

I have this XML:
<Body xmlns:wsu="http://mynamespace">
<Ticket xmlns="http://othernamespace">
<Customer xmlns="">Robert</Customer>
<Products xmlns="">
<Product>a product</Product>
</Products>
</Ticket>
<Delivered xmlns="" />
<Payment xmlns="">cash</Payment>
</Body>
I am using Java to read it as a DOM document. I want to remove the empty namespace attributes (i.e., xmlns=""). Is there any way to do that?
You need to understand that xmlns is a very special attribute. Basically, the xmlns="" is so that your Customer element is in the "unnamed" namespace, rather than the http://othernamespace namespace (and likewise for other elements which would otherwise inherit a default namespace from their ancestors).
If you want to get rid of the xmlns="", you basically need to put the elements into the appropriate namespace - so it's changing the element name. I don't think the W3C API lets you change the name of an element - you may well need to create a new element with the appropriate namespaced-name, and copy the content. Or if you're responsible for creating the document to start with, just use the right namespace.
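For what it is worth, DOM Level 3 added Document.renameNode(), which can move an element into another namespace without copying its content by hand. The sketch below assumes a namespace-aware parse; NamespaceFixer and adoptParentNamespace are hypothetical names, and whether a given serializer then omits xmlns="" is not verified here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class NamespaceFixer {
    static final String XML =
          "<Ticket xmlns=\"http://othernamespace\">"
        + "<Customer xmlns=\"\">Robert</Customer>"
        + "<Products xmlns=\"\"><Product>a product</Product></Products>"
        + "</Ticket>";

    /** Moves descendants that are in no namespace into the namespace of their parent. */
    static void adoptParentNamespace(Element parent) {
        String ns = parent.getNamespaceURI();
        Document doc = parent.getOwnerDocument();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() != Node.ELEMENT_NODE) {
                continue;
            }
            Element child = (Element) n;
            if (ns != null && child.getNamespaceURI() == null) {
                child.removeAttribute("xmlns"); // drop the xmlns="" declaration itself
                // renameNode may return a new node, so keep iterating from its result.
                child = (Element) doc.renameNode(child, ns, child.getLocalName());
                n = child;
            }
            adoptParentNamespace(child);
        }
    }

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(XML);
        adoptParentNamespace(doc.getDocumentElement());
        Element customer = (Element) doc.getDocumentElement().getFirstChild();
        System.out.println(customer.getLocalName() + " is now in " + customer.getNamespaceURI());
    }
}
```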

how to get the value of attribute processing by STAX using java language?

I want to get the value of an attribute in an XML file without knowing its index, since attributes are repeated in more than one element in the XML file.
Here is my XML file:
<fields>
<form name="userAdditionFrom">
</form>
</fields>
And here is the processing code:
case XMLEvent.ATTRIBUTE:
//how can i know the index of attribute?
String attName = xmlReader.getAttributeValue(?????);
break;
Thanks in advance.
Alaa
If it is an XMLStreamReader, then getAttributeValue(int index) and getAttributeValue(String namespaceURI, String localName) can be used to get the attribute value.
From your question it looks like you are using a mix of the Event and Cursor APIs. I have appended a "Using StAX" link for your reference that gives an idea of how to use both.
Resources:
XMLStreamReader getAttributeValue(String, String) JavaDoc Entry
Using StAX
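A minimal cursor-API sketch of the lookup by name follows. With XMLStreamReader, attributes are read while the reader is positioned on START_ELEMENT, so no index is needed. StaxAttributeDemo and firstAttribute are illustrative names, not part of StAX.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class StaxAttributeDemo {
    static final String XML =
        "<fields><form name=\"userAdditionFrom\"></form></fields>";

    /** Returns attrName's value on the first element called elementName, or null. */
    static String firstAttribute(String xml, String elementName, String attrName) {
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && elementName.equals(reader.getLocalName())) {
                        // A null namespace URI means the namespace is not checked.
                        return reader.getAttributeValue(null, attrName);
                    }
                }
                return null;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not read XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstAttribute(XML, "form", "name"));
    }
}
```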

How to set namespace only on first tag with XOM?

I am using XOM to build XML documents in Java.
I have created a simple XML document, and I want an XML namespace. But when I set the namespace on the first tag, an empty namespace is set on the children, like xmlns="". How can I get rid of this behaviour? I only want xmlns on the first tag.
I want this XML:
<request xmlns="http://my-namespace">
<type>Test</type>
<data>
<myData>test data</myData>
</data>
</request>
But this is the XML document output from XOM
<request xmlns="http://my-namespace">
<type xmlns="">Test</type>
<data xmlns="">
<myData>test data</myData>
</data>
</request>
This is my Java XOM code:
String namespace = "http://my-namespace";
Element request = new Element("request", namespace);
Element type = new Element("type");
type.appendChild("Test");
request.appendChild(type);
Element data = new Element("data");
request.appendChild(data);
Element myData = new Element("myData");
myData.appendChild("test data");
data.appendChild(myData);
Document doc = new Document(request);
doc.toXML();
This works for me. However, I'm a bit puzzled as to why the Element objects don't inherit the namespace of their parents. (Not an XML nor XOM expert)
Code:
String namespace = "http://my-namespace";
Element request = new Element("request", namespace);
Element type = new Element("type", namespace);
type.appendChild("Test");
request.appendChild(type);
Element data = new Element("data", namespace);
request.appendChild(data);
Element myData = new Element("myData", namespace);
myData.appendChild("test data");
data.appendChild(myData);
Document doc = new Document(request);
System.out.println(doc.toXML());
Output:
<?xml version="1.0"?>
<request xmlns="http://my-namespace">
<type>Test</type>
<data>
<myData>test data</myData>
</data>
</request>
I ran into the same problem, and Google led me here.
@Michael - That's what it says in the javadoc, yes, but unfortunately, that's not how it works when you implement it. The child elements will continue to get blank xmlns attributes unless you do Catchwa's implementation.
Catchwa's implementation works just fine. Only the element I tell it to have a namespace, has a namespace. All empty xmlns attributes are gone. It's strange.
Is it a bug? I can't seem to figure that part out. Or is it just the way XOM works?
Don't confuse namespaces and namespace declarations. The namespace is an intrinsic property of each element. The namespace declaration is the xmlns attribute. They are not the same thing, although they are connected. When you create an element, you set its namespace, not its namespace declaration.
In the XOM data model namespaces are not attributes. They are an intrinsic property of the element itself. There is no rule in XML that requires children of an element to be in the same namespace as the parent. Indeed theoretically every element in the document could be in a different namespace.
In XOM you specify the namespace of an element or attribute at the same time you specify the local name. When you create an element, the element initially has no parent so there's no way for XOM to default to giving the element the same namespace as its parent, even if that's what was wanted (and it's not).
When the document is serialized the namespaces are represented by xmlns and xmlns:prefix attributes. XOM figures out where to put these attributes to match the namespaces you've assigned to each element. Just specify the namespace you want for each element in your code, and let XOM figure out where to put the namespace declarations.
In XOM you can add a namespace declaration to the root element.
Here's a short example with three different namespaces:
final String NS_XLINK = "http://www.w3.org/1999/xlink";
final String NS_OTHER = "http://other.com";
Element root = new Element("root", "http://root.com");
root.addNamespaceDeclaration("xlink", NS_XLINK);
root.addNamespaceDeclaration("other", NS_OTHER);
root.addAttribute(new Attribute("xlink:href", NS_XLINK, "http://somewhere.com"));
root.appendChild(new Element("other:alien", NS_OTHER));
Document doc = new Document(root);
System.out.println(doc.toXML());
which produces this result (with additional line breaks inserted for readability):
<?xml version="1.0"?>
<root
xmlns="http://root.com"
xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns:other="http://other.com"
xlink:href="http://somewhere.com">
<other:alien />
</root>
