Java SAX Parser raises UnknownHostException - java

The XML file I want to parse starts with :
<!DOCTYPE plist PUBLIC "-//...//DTD PLIST 1.0//EN" "http://www.....dtd">
So when I start the SAX parser, it tries to access this DTD online, and I get a java.net.UnknownHostException.
I cannot modify the XML file before feeding it to the SAX parser.
I have to be able to run even with no internet connection.
How can I change the SAX parser's behaviour so that it does not try to load the DTD?
Thanks.
javax.xml.parsers.SAXParserFactory factory = javax.xml.parsers.SAXParserFactory.newInstance();
factory.setValidating(false);
javax.xml.parsers.SAXParser parser = factory.newSAXParser();
parser.parse(xmlFile, handler);

Ok, turns out the parse() method overrides any previously set entity resolvers with the handler passed in to the parse method. The following code should work:
javax.xml.parsers.SAXParserFactory factory = javax.xml.parsers.SAXParserFactory.newInstance();
factory.setValidating(false);
javax.xml.parsers.SAXParser parser = factory.newSAXParser();
parser.parse(new java.io.File("x.xml"), new org.xml.sax.helpers.DefaultHandler(){
public org.xml.sax.InputSource resolveEntity(String publicId, String systemId)
throws org.xml.sax.SAXException, java.io.IOException {
System.out.println("Ignoring: " + publicId + ", " + systemId);
return new org.xml.sax.InputSource(new java.io.StringReader(""));
}
});

Use XMLReader instead of SAXParser.
XMLReader reader = XMLReaderFactory.createXMLReader();
reader.setEntityResolver(new DummyEntityResolver());
reader.setContentHandler(handler);
reader.parse(inputSource);
It should also work with SAXParser, but for some reason it doesn't.
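The DummyEntityResolver used above is not shown in the answer. A minimal sketch of what it could look like follows, assuming it simply answers every request with empty content. The class OfflineSaxDemo, the method rootElementName, and the unreachable.invalid URL are made up for illustration, and the XMLReader is obtained through SAXParserFactory here rather than XMLReaderFactory.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical resolver: answers every external entity request with empty content. */
final class DummyEntityResolver implements EntityResolver {
    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        // An empty stream is read as an empty DTD, so the parser has no reason to open the URL.
        return new InputSource(new StringReader(""));
    }
}

final class OfflineSaxDemo {
    /** Parses the XML text and returns the qualified name of the root element. */
    static String rootElementName(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setEntityResolver(new DummyEntityResolver());
        final String[] root = new String[1];
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (root[0] == null) {
                    root[0] = qName; // remember only the first element seen
                }
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return root[0];
    }

    public static void main(String[] args) throws Exception {
        // The DTD URL is deliberately unreachable; the resolver should keep the parser away from it.
        String xml = "<!DOCTYPE plist PUBLIC \"-//Example//DTD PLIST 1.0//EN\" "
                + "\"http://unreachable.invalid/plist.dtd\">"
                + "<plist><dict/></plist>";
        System.out.println(rootElementName(xml));
    }
}
```

Because the reader's own parse method is called, the resolver set on it stays in effect; nothing replaces it with a handler passed to SAXParser.parse().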

You can implement a custom EntityResolver, which is what is used to look up external entities during XML parsing.
org.xml.sax.EntityResolver customEntityResolver = new DummyEntityResolver();
javax.xml.parsers.SAXParser parser = factory.newSAXParser();
parser.getXMLReader().setEntityResolver(customEntityResolver);
parser.parse(xmlFile, handler);
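And in your custom EntityResolver, return an empty InputSource rather than null: a null return tells the parser to fall back to its default behaviour and open the system identifier itself. I think that should fix this problem.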

You should provide an EntityResolver to have the problem resolved. I would recommend writing a resolver that knows how to read the DTDs locally instead (provided that you ship them together with your application). Otherwise, use a dummy resolver like Gowri suggested.
You might want to read up on the API doc.
yc
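A resolver that reads DTDs locally could be sketched as follows. This is only an illustration under stated assumptions: the class name LocalDtdResolver, the map from public ID to classpath resource, and any resource path such as /dtds/plist.dtd are hypothetical, and the DTD files would have to be bundled with the application.

```java
import java.io.InputStream;
import java.io.StringReader;
import java.util.Map;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

/** Hypothetical resolver that serves DTDs shipped on the classpath. */
final class LocalDtdResolver implements EntityResolver {
    // Maps a DOCTYPE public ID to a classpath resource, e.g. "/dtds/plist.dtd" (assumed layout).
    private final Map<String, String> resourceByPublicId;

    LocalDtdResolver(Map<String, String> resourceByPublicId) {
        this.resourceByPublicId = resourceByPublicId;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        String resource = (publicId == null) ? null : resourceByPublicId.get(publicId);
        if (resource != null) {
            InputStream in = LocalDtdResolver.class.getResourceAsStream(resource);
            if (in != null) {
                InputSource source = new InputSource(in);
                source.setPublicId(publicId);
                source.setSystemId(systemId);
                return source; // local copy of the DTD
            }
        }
        // Unknown or missing DTD: hand back empty content instead of null,
        // because null asks the parser to open the system identifier itself.
        return new InputSource(new StringReader(""));
    }
}
```

It would be registered the same way as any other resolver, for example with reader.setEntityResolver(new LocalDtdResolver(map)). Serving the real DTD keeps entity declarations available, which an empty stand-in does not.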

Related

Invoking the parser with new File in the parse(new File(args[0]), handler);

I just learned to write a parser and need some quick help. I have tried searching for this, but it was not of much help.
I have written all the callback methods. Now I am writing the static main method to call the parser.
SAXParserFactory parserFactor = SAXParserFactory.newInstance();
SAXParser parser = parserFactor.newSAXParser();
SAXParserExample handler = new SAXParserExample();
parser.parse(new File(args[0]), handler);
Now my XML file is named Employees.xml, and I also have a class Employee.java.
When I run this, I get the following error. I guess I need to pass the XML file as an argument.
Exception in thread "main" java.lang.RuntimeException: The name of the XML file is required!
at main.java.SAXParserExample.main(SAXParserExample.java:82)
You can try passing the file name as a command-line argument:
java Classname Filename.txt
Hope it works.
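For context, a main method that expects the file name as its first argument might look like the sketch below. The names SaxMainSketch and requireXmlFile are invented for illustration, and DefaultHandler stands in for the asker's SAXParserExample handler, whose code is not shown.

```java
import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

final class SaxMainSketch {
    /** Returns the file named by the first argument, or fails with a usage message. */
    static File requireXmlFile(String[] args) {
        if (args.length < 1) {
            throw new IllegalArgumentException("Usage: java SaxMainSketch <file.xml>");
        }
        return new File(args[0]);
    }

    public static void main(String[] args) throws Exception {
        File xmlFile = requireXmlFile(args);
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        // DefaultHandler ignores every event; a real handler would override the callbacks.
        parser.parse(xmlFile, new DefaultHandler());
        System.out.println("Parsed " + xmlFile.getName());
    }
}
```

Running it without an argument fails with the usage message, while passing a file name such as Employees.xml lets args[0] reach the parser.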

XSLT: Getting URI of a declared entity

I have an input XML that has entities declared in it. It looks something like this:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE doctype PUBLIC "desc" "DTD.dtd" [
<!ENTITY SLSD_68115_jpg SYSTEM "68115.jpg" NDATA JPEG>
]>
The DTD.dtd file contains the necessary notation:
<!NOTATION JPEG SYSTEM "JPG" >
During XSLT transformation I would like to get the URI declared in the entity using the name 'SLSD_68115_jpg' like so:
<xsl:value-of select="unparsed-entity-uri('SLSD_68115_jpg')"/>
So that it would return something like "68115.jpg".
The problem is that it always returns an empty string. There is no way for me to modify the input XML. I understand from what I found on the internet that this could be a common problem, but I haven't found any final conclusions, solutions or alternatives.
It might be important to note that I had a problem before, since I am using a StreamSource and things like the systemId had to be set manually. I think this is where the problem might be hidden. It's as if the transformer is unable to resolve the entity with the given id.
I'm using Xalan. I probably need to provide more details, but I'm not sure what to add; I'll answer any questions if there are any.
Any help would be greatly appreciated.
I found out why the "unparsed-entity-uri" was unable to resolve the declared entities. This might be a special case, but I will post this solution so it might save someone else a lot of time.
I'm (very) new to XSLT. The XSL file I got to work with as a student, however, was pretty extreme, with multiple import statements and files containing more than 5K lines of code.
Simply put, by the time I got to the point where I needed the entities, the transformer was using a different document that was essentially a sub-document of the original one. That is okay, but the entity declarations are not passed to the sub-document, so there is no way for me to use the entities from that point on.
Now, like I said, I'm new to XSLT, but I think that lines like this can cause the problem:
<xsl:apply-templates select="exslt:node-set($nodelist)"/>
Because after this, entity references are no bueno.
If this was trivial then my apologies for wasting your time.
Thanks to everyone nonetheless!
Instead of a StreamSource, try a SAXSource configured with a validating parser:
SAXParserFactory spf = SAXParserFactory.newInstance();
spf.setValidating(true);
spf.setNamespaceAware(true);
XMLReader xmlr = spf.newSAXParser().getXMLReader();
InputSource input = new InputSource(
new File("/path/to/file.xml").toURI().toString());
// if you already have an InputStream/Reader then do
// input.setByteStream or input.setCharacterStream as appropriate
SAXSource source = new SAXSource(xmlr, input);
Or you can use a DOMSource in the same way
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setValidating(true);
dbf.setNamespaceAware(true);
File f = new File("/path/to/file.xml");
Document doc = dbf.newDocumentBuilder().parse(f);
DOMSource source = new DOMSource(doc, f.toURI().toString());

How to ignore inline DTD when parsing XML file in Java

I have a problem reading an XML file with a DTD declaration inside (the external declaration is solved). I'm using the SAX method (javax.xml.parsers.SAXParser). When there is no DTD definition, the sequence of callbacks looks like, for example, StartElement-Characters-StartElement-Characters-EndElement-Characters... So the characters method is called immediately after each start or end element, and that's how I need it to be. When a DTD is in the file, the sequence changes to, for example, StartElement-StartElement-StartElement-Characters-EndElement-EndElement-EndElement. I need the characters method to be called after every element. So I'm asking: is there any way to prevent this change in the callback sequence?
My code:
SAXParserFactory factory = SAXParserFactory.newInstance();
factory.setValidating(false);
SAXParser parser = factory.newSAXParser();
XMLReader reader = parser.getXMLReader();
reader.setFeature("http://xml.org/sax/features/validation", false);
reader.setFeature("http://apache.org/xml/features/nonvalidating/load-dtd-grammar", false);
reader.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
reader.setFeature("http://xml.org/sax/features/external-general-entities", false);
reader.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
reader.setFeature("http://xml.org/sax/features/use-entity-resolver2", false);
reader.setFeature("http://apache.org/xml/features/validation/unparsed-entity-checking", false);
reader.setFeature("http://xml.org/sax/features/resolve-dtd-uris", false);
reader.setFeature("http://apache.org/xml/features/validation/dynamic", false);
reader.setFeature("http://apache.org/xml/features/validation/schema/augment-psvi", false);
reader.parse(input);
The XML file that I'm trying to parse is on my Dropbox (link).
I suspect that the nodes that were previously being reported to the characters() callback are now being reported to the ignorableWhitespace() callback. The simplest solution might be to call characters() from ignorableWhitespace().
This is what the spec has to say about ignorableWhitespace():
Validating Parsers must use this method to report each chunk of
whitespace in element content (see the W3C XML 1.0 recommendation,
section 2.10): non-validating parsers may also use this method if they
are capable of parsing and using content models.
In other words, if there is a DTD, and if you are not validating, then
it's up to the parser whether it reports whitespace in element-only
content models using the characters() callback or the
ignorableWhitespace() callback.
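A minimal sketch of that forwarding follows. The class names WhitespaceForwardingDemo and TextCollector and the sample document are invented for illustration. The handler collects text the same way whichever of the two callbacks the parser chooses for whitespace in element-only content.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class WhitespaceForwardingDemo {
    /** Collects all character data, including whitespace the parser deems ignorable. */
    static final class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void ignorableWhitespace(char[] ch, int start, int length) {
            // Treat "ignorable" whitespace exactly like ordinary character data.
            characters(ch, start, length);
        }
    }

    /** Parses the XML text and returns everything seen by characters(). */
    static String collectText(String xml) throws Exception {
        TextCollector handler = new TextCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Internal DTD declaring element-only content for <a>.
        String xml = "<!DOCTYPE a [<!ELEMENT a (b)><!ELEMENT b (#PCDATA)>]>\n"
                + "<a>\n<b>x</b>\n</a>";
        System.out.println(collectText(xml).length());
    }
}
```

With the forwarding in place, the rest of the handler no longer needs to care whether a DTD is present.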

how can I tell xalan NOT to validate XML retrieved using the "document" function?

Yesterday Oracle decided to take down java.sun.com for a while. This screwed things up for me because xalan tried to validate some XML but couldn't retrieve the properties.dtd.
I'm using xalan 2.7.1 to run some XSL transforms, and I don't want it to validate anything.
So I tried loading up the XSL like this:
SAXParserFactory spf = SAXParserFactory.newInstance();
spf.setNamespaceAware(true);
spf.setValidating(false);
XMLReader rdr = spf.newSAXParser().getXMLReader();
Source xsl = new SAXSource(rdr, new InputSource(xslFilePath));
Templates cachedXSLT = factory.newTemplates(xsl);
Transformer transformer = cachedXSLT.newTransformer();
transformer.transform(xmlSource, result);
In the XSL itself, I do something like this:
<xsl:variable name="entry" select="document(concat($prefix, $locale_part, $suffix))/properties/entry[@key=$key]"/>
The XML this code retrieves has the following definition at the top:
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
<entry key="...
Despite the java code above instructing the parser to NOT VALIDATE, it still sends a request to java.sun.com. While java.sun.com is unavailable, this makes the transform fail with the message:
Can not load requested doc: http://java.sun.com/dtd/properties.dtd
How do I get xalan to stop trying to validate the XML loaded from the "document" function?
The documentation mentions that the parser may read the DTDs even if not validating, as it may become necessary to use the DTD to resolve (expand) entities.
Since I don't have control over the XML documents, nont's option of modifying the XML was not available to me.
I managed to shut down attempts to pull in DTD documents by sabotaging the resolver, as follows.
My code uses a DocumentBuilder to return a Document (= DOM) but the XMLReader as per the OP's example also has a method setEntityResolver so the same technique should work with that.
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setValidating(false); // turns off validation
factory.setSchema(null); // turns off use of schema
// but that's *still* not enough!
DocumentBuilder builder = factory.newDocumentBuilder();
builder.setEntityResolver(new NullEntityResolver()); // swap in a dummy resolver
return builder.parse(xmlFile);
Here, now, is my fake resolver: It returns an empty InputStream no matter what's asked of it.
/** my resolver that doesn't */
private static class NullEntityResolver implements EntityResolver {
public InputSource resolveEntity(String publicId, String systemId)
throws SAXException, IOException {
// Message only for debugging / if you care
System.out.println("I'm asked to resolve: " + publicId + " / " + systemId);
return new InputSource(new ByteArrayInputStream(new byte[0]));
}
}
Alternatively, your fake resolver could return streams of actual documents read as local resources or whatever.
Be aware that disabling DTD loading will cause parsing to fail if the DTD defines any entities that your XML file depends on. That said, to disable DTD loading try this, which assumes you're using the default Xerces that ships with Java.
/*
* Instantiate the SAXParser and set the features to prevent loading of an external DTD
*/
SAXParser sp = SAXParserFactory.newInstance().newSAXParser();
XMLReader xrdr = sp.getXMLReader();
xrdr.setFeature("http://xml.org/sax/features/validation", false);
xrdr.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
If you really need the DTD, then the other alternative is to implement a local XML catalog
/*
* Instantiate the SAXParser and add catalog support
*/
SAXParser sp = SAXParserFactory.newInstance().newSAXParser();
XMLReader xrdr = sp.getXMLReader();
CatalogResolver cr = new CatalogResolver();
xrdr.setEntityResolver(cr);
To which you will have to provide the appropriate DTDs and an XML catalog definition. This Wikipedia Article and this article were helpful.
CatalogResolver looks at the system property xml.catalog.files to determine what catalogs to load.
Try using setFeature on SAXParserFactory.
Try this:
SAXParserFactory spf = SAXParserFactory.newInstance();
spf.setValidating(false);
spf.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
I think that should be enough, otherwise try setting a few other features:
spf.setFeature("http://xml.org/sax/features/validation", false);
spf.setFeature("http://xml.org/sax/features/external-general-entities", false);
spf.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
spf.setFeature("http://apache.org/xml/features/nonvalidating/load-dtd-grammar", false);
I just ended up stripping the doctype declaration out of the XML, because nothing else worked. When I get around to it, I'll try this: http://www.sagehill.net/docbookxsl/UseCatalog.html#UsingCatsXalan
Sorry for necroposting, but I have found a solution which actually works and decided I should share it.
1.
For some reason, setValidating(false) doesn't work. In some cases, it still downloads external DTD files. To prevent this, you should attach a custom EntityResolver as advised here:
XMLReader rdr = spf.newSAXParser().getXMLReader();
rdr.setEntityResolver(new MyCustomEntityResolver());
The EntityResolver will be called for every external entity request. Returning null will not work because the framework will still download the file from the Internet after that. Instead, you can return an empty stream which is a valid DTD, as advised here:
private class MyCustomEntityResolver implements EntityResolver {
public InputSource resolveEntity(String publicId, String systemId) {
return new InputSource(new StringReader(""));
}
}
2.
You are calling setValidating(false) on the SAX parser that reads your XSLT code. That is, it will not validate your XSLT. When it encounters a document() function, it loads the linked XML file using another parser, which still validates it and also downloads external entities. To handle this, you should attach a custom URIResolver to the transformer:
Transformer transformer = cachedXSLT.newTransformer();
transformer.setURIResolver(new MyCustomURIResolver());
The transformer will call your URIResolver implementation when it encounters the document() function. Your implementation will have to return a Source for the passed URI. The simplest thing is to return a StreamSource as advised here. But in your case you should parse the document yourself, preventing validation and external requests using the customized SAXParser you already have (or create a new one each time).
private class MyCustomURIResolver implements URIResolver {
public Source resolve(String href, String base) {
return new SAXSource(rdr, new InputSource(href));
}
}
So you will have to implement two custom interfaces in your code.
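The two pieces can also be combined in one class. The sketch below is an illustration, not the answerer's exact code: OfflineUriResolver is an invented name, a fresh XMLReader is created per call instead of reusing rdr, and relative hrefs are resolved against the base URI, which the snippet above does not do.

```java
import java.io.StringReader;
import java.net.URI;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Source;
import javax.xml.transform.TransformerException;
import javax.xml.transform.URIResolver;
import javax.xml.transform.sax.SAXSource;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

/** Hypothetical URIResolver: loads document() targets without fetching their DTDs. */
final class OfflineUriResolver implements URIResolver {
    @Override
    public Source resolve(String href, String base) throws TransformerException {
        try {
            SAXParserFactory spf = SAXParserFactory.newInstance();
            spf.setNamespaceAware(true);
            XMLReader reader = spf.newSAXParser().getXMLReader();
            // Answer every external entity request with empty content.
            reader.setEntityResolver(
                    (publicId, systemId) -> new InputSource(new StringReader("")));
            // Resolve a relative href against the stylesheet's base URI when one is given.
            String systemId = (base == null || base.isEmpty())
                    ? href
                    : URI.create(base).resolve(href).toString();
            return new SAXSource(reader, new InputSource(systemId));
        } catch (Exception e) {
            throw new TransformerException("Cannot resolve " + href, e);
        }
    }
}
```

It would be attached with transformer.setURIResolver(new OfflineUriResolver()). Creating a reader per call avoids sharing one XMLReader between documents that are being parsed at the same time.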

Querying an HTML page with XPath in Java

Can anyone advise me a library for Java that allows me to perform an XPath Query over an html page?
I tried using JAXP but it keeps giving me a strange error that I cannot seem to fix (thread "main" java.io.IOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd).
Thank you very much.
EDIT
I found this:
// Create a new SAX Parser factory
SAXParserFactory factory = SAXParserFactory.newInstance();
// Turn on validation
factory.setValidating(true);
// Create a validating SAX parser instance
SAXParser parser = factory.newSAXParser();
// Create a new DOM Document Builder factory
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
// Turn on validation
factory.setValidating(true);
// Create a validating DOM parser
DocumentBuilder builder = factory.newDocumentBuilder();
from http://www.ibm.com/developerworks/xml/library/x-jaxpval.html, but changing the argument to false did not change anything.
Setting the parser to "non validating" just turns off validation; it does not inhibit fetching of DTD's. Fetching of DTD is needed not just for validation, but also for entity expansion... as far as I recall.
If you want to suppress fetching of DTDs, you need to register a proper EntityResolver on the DocumentBuilder. Implement the EntityResolver's resolveEntity method to always return an empty InputSource (for example, one that wraps an empty string).
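As a rough sketch of that suggestion with JAXP's DOM and XPath APIs (the class OfflineXPathDemo and its evaluate helper are invented for illustration, and the input is assumed to be well-formed XHTML):

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class OfflineXPathDemo {
    /** Parses the markup without fetching its DTD and evaluates an XPath expression on it. */
    static String evaluate(String xml, String expression) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // Every external entity (including the DOCTYPE's DTD) resolves to empty content.
        builder.setEntityResolver(
                (publicId, systemId) -> new InputSource(new StringReader("")));
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return XPathFactory.newInstance().newXPath().evaluate(expression, doc);
    }

    public static void main(String[] args) throws Exception {
        String page = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
                + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
                + "<html><head><title>Hello</title></head><body/></html>";
        System.out.println(evaluate(page, "/html/head/title"));
    }
}
```

One caveat: with the DTD replaced by empty content, named entities that the DTD would have declared (such as &nbsp; in XHTML) are undefined, so pages that use them will fail to parse.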
Take a look at this:
http://www.w3.org/2005/06/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic
Probably you have the parser set to perform DOM validation, and it is trying to retrieve the DTD. JAXP should have a way to disable DTD validation and just run XPath against a document assumed to be valid. I haven't used JAXP in many years, so I'm sorry I couldn't be more helpful.
