Querying an HTML page with XPath in Java - java

Can anyone advise me a library for Java that allows me to perform an XPath Query over an html page?
I tried using JAXP but it keeps giving me a strange error that I cannot seem to fix (thread "main" java.io.IOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd).
Thank you very much.
EDIT
I found this:
// Create a new SAX Parser factory
SAXParserFactory factory = SAXParserFactory.newInstance();
// Turn on validation
factory.setValidating(true);
// Create a validating SAX parser instance
SAXParser parser = factory.newSAXParser();
// Create a new DOM Document Builder factory
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
// Turn on validation
factory.setValidating(true);
// Create a validating DOM parser
DocumentBuilder builder = factory.newDocumentBuilder();
from http://www.ibm.com/developerworks/xml/library/x-jaxpval.html But turning the argumrent to false did not change anything.

Setting the parser to "non validating" just turns off validation; it does not inhibit fetching of DTD's. Fetching of DTD is needed not just for validation, but also for entity expansion... as far as I recall.
If you want to suppress fetching of DTD's, you need to register a proper EntityResolver to the DocumentBuilderFactory or DocumentBuilder. Implement the EntityResolver's resolveEntity method to always return an empty string.

Take a look at this:
http://www.w3.org/2005/06/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic
Probably you have the parser set to perform DOM validation, and it is trying to retrieve the DTD. JAXP should have a way to disable DTD validation, and just run XPATH against a document assumed to be valid. I haven't used JAXP is many years so I'm sorry I couldn't be more helpful.

Related

java DOM lookupNamespaceURI is not able to locate namespace URI

I'm trying to follow http://www.ibm.com/developerworks/xml/library/x-nmspccontext/index.html
UniversalNamespaceResolver
example for resolving namespaces of the XPath evaluation agains an XML. The problem I encountered is that lookupNamespaceURI call below returns null on the XML, I given below:
DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = domFactory.newDocumentBuilder();
Document dDoc = builder.parse(new InputSource(new StringReader(xml)));
String nsURI = dDoc.lookupNamespaceURI("h");
the XML:
<?xml version="1.0"?>
<h:root xmlns:h="http://www.w3.org/TR/html4/">
<h:table>
<h:tr>
<h:td>Apples</h:td>
<h:td>Bananas</h:td>
</h:tr>
</h:table>`
</h:root>
while I'd expect it to return "http://www.w3.org/TR/html4/".
When configuring a DocumentBuilder, you have to explicitly make it namespace aware (a silly relic from the first days of xml when there were no namespaces):
domFactory.setNamespaceAware(true);
As a side note, the advice in that article is not very good. it fundamentally misses the point that you don't care what the namespace prefixes are in the actual document, they are irrelevant. you need the xpath namespace resolver to match the xpath expressions that you are using, and that is all. if you do what they are suggesting, you will have to change your xpath code whenever the document's prefixes change, which is a horrible idea.
Note, they sort of cede this point in their last bullet, but the rest of the article seems to miss that this is the fundamental idea when using xpath.
But if you don't have control over the XML file, and someone can send you any prefixes they wish, it might be better to be independent of their choices. You can code your own namespace resolution as in Example 1 (HardcodedNamespaceResolver), and use them in your XPath expressions.

Writing XML according to a DTD

I would like to know if there is a way (particularly, an API), in Java, to write a XML in a SAX-like way (i.e., event-like way, differently from JDOM, which I cannot use) that takes a DTD and guarantees that my XML document is being correctly written.
I have been using SAX for parsing and I have written a XML writer layer by myself as if I were writing a plain file (through OutputStreamWriter), but I have seen that my XML writer layer is not always following the DTD rules.
SAX does not know to write XML documents. It is attended to parse them. So, you can choose any method you want to create document and then validate it using SAX API against DTD.
BTW may I ask you why are you limiting yourself to using tools that were almost obsolete about 10 years ago? Why not to use higher level API that converts objects to XML and vice versa? For example JAXB.
The Standard DocumentBuilder methodology can validate for you.
This snippet taken from http://www.edankert.com/validate.html#Validate_using_internal_DTD
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setValidating(true);
factory.setNamespaceAware(true);
SchemaFactory schemaFactory =
SchemaFactory.newInstance("http://www.w3.org/2001/XMLSchema");
factory.setSchema(schemaFactory.newSchema(
new Source[] {new StreamSource("contacts.xsd")}));
DocumentBuilder builder = factory.newDocumentBuilder();
builder.setErrorHandler(new SimpleErrorHandler());
Document document = builder.parse(new InputSource("document.xml"));

SAXParseException when “ is used in XML

I'm getting a "org.xml.sax.SAXParseException; lineNumber: 4; columnNumber: 26; The entity "ldquo" was referenced, but not declared." exception when reading an XML document. I'm reading it as follows:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
InputSource is = new InputSource(new StringReader(xmlBody));
Document document = builder.parse(is);
And then there's an exception on builder.parse(is);
From searching I figured that it is necessary to declare some of those new entities externally, unfortunately, I cannot modify the original XML document.
How do I fix this problem?
Thanks
From searching I figured that it is necessary to declare some of those new entities externally, unfortunately, I cannot modify the original XML document.
Well, unless you declare the entity then the document isn't XML and you won't be able to process it using an XML parser.
When you are asked to process input that isn't well-formed XML, the best approach is to fix the process that created the document (the whole idea of using XML for interchange relies on it being well-formed XML). The alternatives are to "repair" the document to turn it into well-formed XML (which you say you can't do), or to forget the fact that it was intended to be XML, and treat it as you would any proprietary non-XML format.
Not a pleasant set of choices - but that's the mess you get into when people pay lip-service to XML but fail to conform to the letter of the standard.
Try
factory.setExpandEntityReferences(false);
This will prevent the parser from trying to expand entities.
EDIT: How about this http://xerces.apache.org/xerces2-j/features.html#dom.create-entity-ref-nodes -- The top of that page has an example of how to set features on the underlying parser. This should cause the parser to create entity-reference DOM nodes instead of trying to expand the entities.

how can I tell xalan NOT to validate XML retrieved using the "document" function?

Yesterday Oracle decided to take down java.sun.com for a while. This screwed things up for me because xalan tried to validate some XML but couldn't retrieve the properties.dtd.
I'm using xalan 2.7.1 to run some XSL transforms, and I don't want it to validate anything.
so tried loading up the XSL like this:
SAXParserFactory spf = SAXParserFactory.newInstance();
spf.setNamespaceAware(true);
spf.setValidating(false);
XMLReader rdr = spf.newSAXParser().getXMLReader();
Source xsl = new SAXSource(rdr, new InputSource(xslFilePath));
Templates cachedXSLT = factory.newTemplates(xsl);
Transformer transformer = cachedXSLT.newTransformer();
transformer.transform(xmlSource, result);
in the XSL itself, I do something like this:
<xsl:variable name="entry" select="document(concat($prefix, $locale_part, $suffix))/properties/entry[#key=$key]"/>
The XML this code retrieves has the following definition at the top:
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
<entry key="...
Despite the java code above instructing the parser to NOT VALIDATE, it still sends a request to java.sun.com. While java.sun.com is unavailable, this makes the transform fail with the message:
Can not load requested doc: http://java.sun.com/dtd/properties.dtd
How do I get xalan to stop trying to validate the XML loaded from the "document" function?
The documentation mentions that the parser may read the DTDs even if not validating, as it may become necessary to use the DTD to resolve (expand) entities.
Since I don't have control over the XML documents, nont's option of modifying the XML was not available to me.
I managed to shut down attempts to pull in DTD documents by sabotaging the resolver, as follows.
My code uses a DocumentBuilder to return a Document (= DOM) but the XMLReader as per the OP's example also has a method setEntityResolver so the same technique should work with that.
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setValidating(false); // turns off validation
factory.setSchema(null); // turns off use of schema
// but that's *still* not enough!
builder = factory.newDocumentBuilder();
builder.setEntityResolver(new NullEntityResolver()); // swap in a dummy resolver
return builder().parse(xmlFile);
Here, now, is my fake resolver: It returns an empty InputStream no matter what's asked of it.
/** my resolver that doesn't */
private static class NullEntityResolver implements EntityResolver {
public InputSource resolveEntity(String publicId, String systemId)
throws SAXException, IOException {
// Message only for debugging / if you care
System.out.println("I'm asked to resolve: " + publicId + " / " + systemId);
return new InputSource(new ByteArrayInputStream(new byte[0]));
}
}
Alternatively, your fake resolver could return streams of actual documents read as local resources or whatever.
Be aware that disabling DTD loading will cause parsing to fail if the DTD defines any entities that your XML file depends on. That said, to disable DTD loading try this, which assumes you're using the default Xerces that ships with Java.
/*
* Instantiate the SAXParser and set the features to prevent loading of an external DTD
*/
SAXParser sp = SAXParserFactory.newInstance().newSAXParser();
XMLReader xrdr = sp.getXMLReader();
xrdr.setFeature("http://xml.org/sax/features/validation", false);
xrdr.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
If you really need the DTD, then the other alternative is to implement a local XML catalog
/*
* Instantiate the SAXParser and add catalog support
*/
SAXParser sp = SAXParserFactory.newInstance().newSAXParser();
XMLReader xrdr = sp.getXMLReader();
CatalogResolver cr = new CatalogResolver();
xrdr.setEntityResolver(cr);
To which you will have to provide the appropriate DTDs and an XML catalog definition. This Wikipedia Article and this article were helpful.
CatalogResolver looks at the system property xml.catalog.files to determine what catalogs to load.
Try using setFeature on SAXParserFactory.
Try this:
SAXParserFactory spf = SAXParserFactory.newInstance();
spf.setValidating(false);
spf.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
I think that should be enough, otherwise try setting a few other features:
spf.setFeature("http://xml.org/sax/features/validation", false);
spf.setFeature("http://xml.org/sax/features/external-general-entities", false);
spf.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
spf.setFeature("http://apache.org/xml/features/nonvalidating/load-dtd-grammar", false);
I just ended up stripping the doctype declaration out of the XML, because nothing else worked. When I get around to it, I'll try this: http://www.sagehill.net/docbookxsl/UseCatalog.html#UsingCatsXalan
Sorry for necroposting, but I have found a solution which actually works and decided I should share it.
1.
For some reason, setValidating(false) doesn't work. In some cases, it still downloads external DTD files. To prevent this, you should attach a custom EntityResolver as advised here:
XMLReader rdr = spf.newSAXParser().getXMLReader();
rdr.setEntityResolver(new MyCustomEntityResolver());
The EntityResolver will be called for every external entity request. Returning null will not work because the framework will still download the file from the Internet after that. Instead, you can return an empty stream which is a valid DTD, as advised here:
private class MyCustomEntityResolver implements EntityResolver {
public InputSource resolveEntity(String publicId, String systemId) {
return new InputSource(new StringReader(""));
}
}
2.
You are telling setValidating(false) to the SAX parser which reads your XSLT code. That is, it will not validate your XSLT. When it encounters a document() function, it loads the linked XML file using another parser which still validates it, and also downloads external entities. To handle this, you should attach a custom URIResolver to the transformer:
Transformer transformer = cachedXSLT.newTransformer();
transformer.setURIResolver(new MyCustomURIResolver());
The transformer will call your URIResolver implementation when it encounters the document() function. Your implementation will have to return a Source for the passed URI. The simplest thing is to return a StreamSource as advised here. But in your case you should parse the document yourself, preventing validation and external requests using the customized SAXParser you already have (or create a new one each time).
private class MyCustomURIResolver implements URIResolver {
public Source resolve(String href, String base) {
return new SAXSource(rdr,new InputSource(href));
}
}
So you will have to implement two custom interfaces in your code.

Java SAX Parser raises UnknownHostException

The XML file I want to parse starts with :
<!DOCTYPE plist PUBLIC "-//...//DTD PLIST 1.0//EN" "http://www.....dtd">
So when I start the SAX praser, it tries to access this DTD online, and I get a java.net.UnknownHostException.
I cannot modify the XML file before feeding it to the SAX parser
I have to run even with no internet connection
How can I change the SAX Parser behaviour so that it does not try to load the DTD ?
Thanks.
javax.xml.parsers.SAXParserFactory factory = javax.xml.parsers.SAXParserFactory.newInstance();
factory.setValidating(false);
javax.xml.parsers.SAXParser parser = factory.newSAXParser();
parser.parse(xmlFile, handler);
Ok, turns out the parse() method overrides any previously set entity resolvers with the handler passed in to the parse method. The following code should work:
javax.xml.parsers.SAXParserFactory factory = javax.xml.parsers.SAXParserFactory.newInstance();
factory.setValidating(false);
javax.xml.parsers.SAXParser parser = factory.newSAXParser();
parser.parse(new java.io.File("x.xml"), new org.xml.sax.helpers.DefaultHandler(){
public org.xml.sax.InputSource resolveEntity(String publicId, String systemId)
throws org.xml.sax.SAXException, java.io.IOException {
System.out.println("Ignoring: " + publicId + ", " + systemId);
return new org.xml.sax.InputSource(new java.io.StringReader(""));
}
});
Use XMLReader instead of SAXParser.
XMLReader reader =XMLReaderFactory.createXMLReader();
reader.setEntityResolver(new DummyEntityResolver());
reader.setContentHandler(handler);
reader.parse(inputSource);
It should also work with SAXParser, but for some reasons it doesn't.
You can implement a custom EntityResolver which is what is used to lookup external entities during XML parsing.
org.xml.sax.EntityResolver customEntityResolver = new DummyEntityResolver();
javax.xml.parsers.SAXParser parser = factory.newSAXParser();
parser.getXMLReader().setEntityResolver(customEntityResolver);
parser.parse(xmlFile, handler);
And in your custom EntityResolver, just always return null. I think that should fix this problem.
You should provide an EntityResolve to have the problem resolve. I will recommend you to write a resolver that will know how to read the DTDs locally instead (provided that you have them shipped together with your application). Otherwise, return null like Gowri suggested.
You might want to read up the the API doc.
yc

Categories

Resources