I want to extract a small subtree from an XML file (100 MB) and need to turn off DTD validation, but I cannot find a way to do that.
XPath xpath = XPathFactory.newInstance().newXPath();
XPathExpression expr = xpath.compile("//HEADER");
Node node = (Node) expr.evaluate(new InputSource(new FileReader(file)), XPathConstants.NODE);
I tried using DocumentBuilder with DTD validation turned off, but that is very slow.
Thanks,
Joo
It is slow because your XPath criterion is too vague and forces a full scan: //HEADER means that the XPath engine visits each and every node of your 100 MB document to select the ones whose name is HEADER. If you can make the XPath expression more specific (for example, an absolute path from the root), you should see dramatic improvements.
Other than that, the code below is something I had to do to prevent DTD validation in the past.
It forces Xerces as the SAX parser and explicitly sets a number of Xerces-specific features. But again, this will probably not significantly affect the response time.
import java.io.File;
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.apache.xerces.jaxp.SAXParserFactoryImpl;
import org.xml.sax.InputSource;
[...]
private static SAXParserFactory spf ;
private BillCooker() throws Exception {
System.setProperty("javax.xml.parsers.SAXParserFactory", "org.apache.xerces.jaxp.SAXParserFactoryImpl" ) ;
spf = SAXParserFactoryImpl.newInstance();
spf.setNamespaceAware(true);
spf.setValidating(false);
spf.setFeature("http://xml.org/sax/features/validation", false);
spf.setFeature("http://apache.org/xml/features/nonvalidating/load-dtd-grammar", false);
spf.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
I trimmed it to leave only the lines relevant to validation
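As a rough illustration of both points (a more specific path and no DTD loading), here is a minimal sketch that uses the JDK's built-in DocumentBuilderFactory rather than a forced Xerces SAX parser. The class name HeaderLookup, the sample document, and the /ROOT/HEADER path are assumptions made up for the example. The two feature URIs are the Xerces ones from the snippet above, which the JDK's bundled parser also accepts. It still builds a full DOM, so do not expect it to be fast on 100 MB.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class HeaderLookup {

    // Parses the XML without validating or loading any DTD, then selects one node.
    static Node find(String xml, String path) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setValidating(false);
        dbf.setFeature("http://apache.org/xml/features/nonvalidating/load-dtd-grammar", false);
        dbf.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        return (Node) xpath.evaluate(path, doc, XPathConstants.NODE);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical document; the referenced DTD does not exist and is never fetched.
        String xml = "<?xml version=\"1.0\"?>"
                + "<!DOCTYPE ROOT SYSTEM \"missing.dtd\">"
                + "<ROOT><HEADER><ID>42</ID></HEADER><BODY/></ROOT>";
        Node header = find(xml, "/ROOT/HEADER");
        System.out.println(header.getNodeName() + ": " + header.getTextContent());
    }
}
```

The absolute path lets the engine go straight to the children of the root element instead of walking every descendant, which is the improvement the answer describes.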
I have one composite object being inserted into my KieSession. This object is composed of a pre-created Document from an XML file, an XPath, and an expression that I am trying to find inside the DOM. I am able to match a single expression, but I am trying to modify the rule so that it searches the DOM for all of the Strings in a list. Thank you.
This is the Object that will be initialized and inserted into the KieSession
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import lombok.AllArgsConstructor;
import lombok.Data;
import org.w3c.dom.Document;
@Data
@AllArgsConstructor
public class Person {
    private String expression;
    private XPath xPath;
    private Document dom;
    private List<String> validExpressions = new ArrayList<>();
}
In my Drools file, I am able to use the XPath parser to search the Document and check whether any node has a name that matches the expression. However, what I want is to run the same search while iterating through all of the items in the ArrayList, and have the rule fire if any of them matches.
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathConstants;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
rule "Person Matcher"
when
    Person($xp : xPath)
    Person($ex : expression)
    Person($dom : dom)
    Person($validExpressions: getValidExpressions())
    Person(((NodeList) $xp.compile($ex).evaluate($dom, XPathConstants.NODESET)).getLength() == "1")
then
    System.out.println("MATCHED ONE OF THEM");
end
In the above example, let's say that my XML DOM has a node "height". If I set the expression to "height" when I insert the fact into the KieSession and fire the rules, this rule fires correctly. However, I want to check all of the validExpressions instead. I have tried using from and accumulate, but can't get anything to work. Is there any way to make this work?
A new line between patterns in Drools acts as a logical AND, so your rule asks for a Cartesian match of five separate Person patterns, each with a different constraint. You probably meant something like this (untested):
rule "Person Matcher"
when
    Person($xp : xPath,
           $ex : expression,
           $dom : dom,
           $validExpressions : getValidExpressions(),
           $nodeList : ((NodeList) $xp.compile($ex).evaluate($dom, XPathConstants.NODESET)),
           $nodeList.getLength() == 1
    )
then
    System.out.println("MATCHED ONE OF THEM");
end
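The rule above still checks only the single expression. One possible way to get the "any item in the list" behaviour the question asks for is to move the loop into a plain Java helper. The sketch below assumes that approach; the class XPathMatcher and the method anyMatches are hypothetical names, not part of Drools, and the sample document is made up.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathMatcher {

    // True if at least one expression selects exactly one node in the document,
    // mirroring the getLength() == 1 test used in the rule.
    static boolean anyMatches(XPath xPath, Document dom, List<String> expressions)
            throws XPathExpressionException {
        for (String expression : expressions) {
            NodeList nodes = (NodeList) xPath.evaluate(expression, dom, XPathConstants.NODESET);
            if (nodes.getLength() == 1) {
                return true;
            }
        }
        return false;
    }

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document dom = parse("<person><height>180</height></person>");
        XPath xPath = XPathFactory.newInstance().newXPath();
        System.out.println(anyMatches(xPath, dom, Arrays.asList("//weight", "//height"))); // true
        System.out.println(anyMatches(xPath, dom, Arrays.asList("//weight", "//age")));    // false
    }
}
```

From the rule, this could presumably be called as eval(XPathMatcher.anyMatches($xp, $dom, $validExpressions)) after importing the class, but that part is untested.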
Here is the code snippet for the DOM parser I am using, which is Oracle's DOMParser.
import oracle.xml.parser.v2.DOMParser;
DOMParser domParser = new DOMParser();
domParser.parse(new StringReader(xmlPayload));
Document doc = domParser.getDocument();
doc.getDocumentElement().normalize();
System.out.println("Root element :" + doc.getDocumentElement().getNodeName());
NodeList nList = doc.getElementsByTagName("student");
Recently our security team raised a concern that the above DOM parser is vulnerable to attack and recommended setting two attributes:
domParser.setAttribute("RESOLVE_ENTITY_DEFAULT", true);
domParser.setAttribute("DEFAULT_ENTITY_EXPANSION_DEPTH", 150);
But on setting these attributes, I am getting the below error,
Exception in thread "main" java.lang.IllegalArgumentException
at oracle.xml.parser.v2.XMLParser.setAttribute(XMLParser.java:870)
at oracle.xml.parser.v2.DOMParser.setAttribute(DOMParser.java:538)
at DomParserExample.main(DomParserExample.java:20)
Kindly let me know how I can prevent XML Entity Expansion injection and XXE attacks. I have looked at the OWASP XXE Cheat Sheet and browsed various questions and answers about XXE attacks, but could not find a solution.
try this
domParser.setAttribute(XMLParser.RESOLVE_ENTITY_DEFAULT, true);
domParser.setAttribute(XMLParser.DEFAULT_ENTITY_EXPANSION_DEPTH, 150);
The proper way to handle XXE in Oracle DOMParser is documented here.
https://docs.oracle.com/en/database/oracle/oracle-database/18/adxdk/security-considerations-oracle-xml-developers-kit.html#GUID-45303542-41DE-4455-93B3-854A826EF8BB
// Extend oracle.xml.parser.v2.XMLParser
DOMParser domParser = new DOMParser();
// Do not expand entity references
domParser.setAttribute(DOMParser.EXPAND_ENTITYREF, false);
// dtdObj is an instance of oracle.xml.parser.v2.DTD
domParser.setAttribute(DOMParser.DTD_OBJECT, dtdObj);
// Do not allow more than 11 levels of entity expansion
domParser.setAttribute(DOMParser.ENTITY_EXPANSION_DEPTH, 12);
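For comparison only, here is a minimal sketch of the same idea with the JDK's built-in JAXP parser instead of Oracle's DOMParser (a swapped-in technique, not the Oracle API). It rejects any document that carries a DOCTYPE, which rules out both external entities and entity expansion. The class and method names are hypothetical; the feature URI is the Xerces one that the JDK's bundled parser accepts.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class SafeDomParsing {

    // Builds a parser that refuses DOCTYPE declarations altogether.
    static DocumentBuilder hardenedBuilder() throws ParserConfigurationException {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        dbf.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        dbf.setXIncludeAware(false);
        dbf.setExpandEntityReferences(false);
        return dbf.newDocumentBuilder();
    }

    // Returns true if the hardened parser refused the document.
    static boolean isRejected(String xml) throws Exception {
        try {
            hardenedBuilder().parse(new InputSource(new StringReader(xml)));
            return false;
        } catch (SAXException e) {
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        String withEntity = "<!DOCTYPE student [<!ENTITY x \"y\">]><student>&x;</student>";
        System.out.println(isRejected(withEntity));              // true
        System.out.println(isRejected("<student>ok</student>")); // false
    }
}
```

Refusing DOCTYPE outright is stricter than capping the expansion depth, so it only fits if your payloads never need a DTD.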
Which Maven dependency versions are needed to use XMLParser and DOMParser so that the Fortify finding for the DOM parser is resolved?
I'm trying to learn some XML parsing, and I'm getting the hang of it, but for whatever reason I seem to have to tack "text()" onto the end of each query; otherwise I get null values returned. I don't really understand what this "text()" ending does, but I know it's not necessary, and I'm wondering why I can't omit it. Please help! Here is my code:
import org.w3c.dom.*;
import javax.xml.xpath.*;
import javax.xml.parsers.*;
import java.io.IOException;
import org.xml.sax.SAXException;
public class ParseClass
{
    public static void main(String[] args)
        throws ParserConfigurationException, SAXException,
               IOException, XPathExpressionException
    {
        DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
        domFactory.setNamespaceAware(true);
        DocumentBuilder builder = domFactory.newDocumentBuilder();
        Document doc = builder.parse("C:\\Users\\Brandon\\Job\\XPath\\XPath_Sample_Stuff\\catalog.xml");
        XPath xpath = XPathFactory.newInstance().newXPath();
        XPathExpression expr = xpath.compile("/catalog/book[author='Thurman, Paula']/title/text()");
        Object result = expr.evaluate(doc, XPathConstants.NODESET);
        NodeList nodes = (NodeList) result;
        for (int i = 0; i < nodes.getLength(); i++)
        {
            System.out.println(nodes.item(i).getNodeValue());
        }
    }
}
PS. In case you didn't notice, I'm using XPath and DOM for my parsing.
You're calling getNodeValue on your result, and as the docs for org.w3c.dom.Node show (see the table), it is null for a node of type Element. When you use text(), the returned set contains nodes of type Text instead, so you get the results you wanted (i.e. the contents of the title element rather than the element itself).
I'd also suggest seeing this for more info on the usage of text() in xpath.
And if you want to extract the text from your element, directly, you could use getTextContent instead of getNodeValue:
// Will work for both element and text nodes
System.out.println(nodes.item(i).getTextContent());
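To make the difference concrete, here is a small self-contained sketch. The inline XML is a made-up sample modelled on the question's catalog.xml, and the class name TextNodeDemo is hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class TextNodeDemo {

    // Made-up sample document with author as a child element of book.
    static final String XML =
            "<catalog><book><author>Thurman, Paula</author>"
            + "<title>Splish Splash</title></book></catalog>";

    // Evaluates the path against the sample and returns the first selected node.
    static Node selectFirst(String path) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
        return (Node) XPathFactory.newInstance().newXPath()
                .evaluate(path, doc, XPathConstants.NODE);
    }

    public static void main(String[] args) throws Exception {
        String base = "/catalog/book[author='Thurman, Paula']/title";
        Node element = selectFirst(base);          // an Element node
        Node text = selectFirst(base + "/text()"); // a Text node
        System.out.println(element.getNodeValue());   // null
        System.out.println(text.getNodeValue());      // Splish Splash
        System.out.println(element.getTextContent()); // Splish Splash
    }
}
```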
First of all, if author is an attribute of book rather than a child element, your XPath expression will not match (I am assuming a typo). Attributes are selected with @, so the XPath would be /catalog/book[@author='Thurman, Paula']/title/text().
/catalog/book[@author='Thurman, Paula']/title matches the <title> element in your XML, whereas /catalog/book[@author='Thurman, Paula']/title/text() matches the text node inside <title>. For example, if the title node were <title>The Godfather</title>, the latter expression would match The Godfather.
A suggestion: don't use DOM. There are many tree representations of XML available in the Java world (JDOM, XOM, DOM4J) that are vastly more usable than DOM. DOM is full of gotchas like the one you just encountered, where getNodeValue() on an element returns null. The only reason anyone uses DOM is that (a) it came originally from W3C, and (b) it found its way into the JDK. But that all happened an awfully long time ago, and people have learnt from its design mistakes.
I want to do an XPath query on this file (excerpt shown):
<?xml version="1.0" encoding="UTF-8"?>
<!-- MetaDataAPI generated on: Friday, May 25, 2007 3:26:31 PM CEST -->
<ModelClass xmlns="http://xml.sap.com/2002/10/metamodel/webdynpro" xmlns:IDX="urn:sap.com:WebDynpro.ModelClass:2.0">
<ModelClass.Parent>
<Core.Reference package="com.test.mypackage" name="ModelName" type="Model"/>
This is a snippet of the code I'm using:
DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = domFactory.newDocumentBuilder();
Document document = builder.parse(new File(testFile));
XPathFactory factory = XPathFactory.newInstance();
XPath xpath = factory.newXPath();
xpath.setNamespaceContext( new NamespaceContext() {
public String getNamespaceURI(String prefix) {
...
String result = xpath.evaluate(xpathQueryString, document);
System.out.println(result);
The problem I'm facing is that when the default namespace is referenced in an XPath query, the getNamespaceURI method is not called to resolve it.
This query for example doesn't extract anything:
//xmlns:ModelClass.Parent/xmlns:Core.Reference[@type=\"Model\"]/@package
Now I've tried "tricking" the parser by replacing xmlns with a fake prefix d and writing the getNamespaceURI method accordingly (so that it returns http://xml.sap.com/2002/10/metamodel/webdynpro when d is encountered). In this case, getNamespaceURI is called, but the result of the XPath expression evaluation is always an empty string.
If I strip out namespaces from the file and from the XPath query expression, I can get the string I wanted (com.test.mypackage).
Is there a way to make things work properly with the default namespace?
The XPath 1.0 specification requires that "no prefix means no namespace". So JAXP, which was designed for XPath 1.0, is quite right to stop you binding the "null prefix" to some non-null namespace.
XPath 2.0 allows you to declare a default namespace for unqualified names in your XPath expression, but to take advantage of that you will need an API (such as Saxon's s9api) that supports this feature.
In your Namespace context, bind a prefix of your choice (e.g. df) to the namespace URI in the document
xpath.setNamespaceContext( new NamespaceContext() {
public String getNamespaceURI(String prefix) {
switch (prefix) {
case "df": return "http://xml.sap.com/2002/10/metamodel/webdynpro";
...
}
});
and then use that prefix in your path expressions to qualify element names, e.g. /df:ModelClass/df:ModelClass.Parent/df:Core.Reference[@type = 'Model']/@package.
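A complete version of this might look like the sketch below. The class name, the df prefix, and the closed-off sample document are assumptions for the example. Note that it also makes the DocumentBuilderFactory namespace-aware, which the question's snippet does not show; without that, prefixed names in the XPath will not match anything.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class DefaultNamespaceQuery {

    static final String MODEL_NS = "http://xml.sap.com/2002/10/metamodel/webdynpro";

    static String findPackage(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // required for namespace-qualified XPath steps
        Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(new NamespaceContext() {
            @Override
            public String getNamespaceURI(String prefix) {
                // "df" is an arbitrary prefix bound to the document's default namespace.
                return "df".equals(prefix) ? MODEL_NS : XMLConstants.NULL_NS_URI;
            }

            @Override
            public String getPrefix(String namespaceURI) {
                return MODEL_NS.equals(namespaceURI) ? "df" : null;
            }

            @Override
            public Iterator<String> getPrefixes(String namespaceURI) {
                String prefix = getPrefix(namespaceURI);
                return prefix == null
                        ? Collections.<String>emptyIterator()
                        : Collections.singletonList(prefix).iterator();
            }
        });
        return xpath.evaluate(
                "/df:ModelClass/df:ModelClass.Parent/df:Core.Reference[@type = 'Model']/@package",
                doc);
    }

    public static void main(String[] args) throws Exception {
        // The question's excerpt, closed off so that it is well-formed.
        String xml = "<ModelClass xmlns=\"" + MODEL_NS + "\">"
                + "<ModelClass.Parent>"
                + "<Core.Reference package=\"com.test.mypackage\" name=\"ModelName\" type=\"Model\"/>"
                + "</ModelClass.Parent></ModelClass>";
        System.out.println(findPackage(xml)); // com.test.mypackage
    }
}
```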
My XpathUtility class has the following method:
public Node findElementByXpath(Document doc, String axpath) throws Exception {
    XPath xPath = XPathFactory.newInstance().newXPath();
    Node node = (Node) xPath.evaluate(axpath, doc, XPathConstants.NODE);
    return node;
}
In my main method I load an org.w3c.dom Document and attempt to locate an element via XPath:
XpathUtility xu = new XpathUtility();
Node foundElement= xu.findElementByXpath(domdoc, "/html[1]/body[1]/div[32]/a[1]");
I have checked manually via Firebug that the element exists at that XPath.
When this code runs, it hangs and becomes unresponsive for about 30 seconds, and then a NullPointerException is thrown because foundElement is null.
An XHTML document is an XML document with a DTD reference, which XML parsers are obliged to download and evaluate in order to properly parse the XML infoset, and the elements are bound to the XHTML namespace.
So, it appears that you have two problems:
The XHTML DTD is taking a really long time to download from the W3C website.
The W3C servers are slow to return DTDs. Is the delay intentional?
Yes. Due to various software systems downloading DTDs from our site millions of times a day (despite the caching directives of our servers), we have started to serve DTDs from our site with an artificial delay. Our goals in doing so are to bring more attention to our ongoing issues with excessive DTD traffic, and to protect the stability and response time of the rest of our site.
You can overcome this by using a local entity resolver that loads a local copy of the DTD, rather than reaching out to the W3C website on every request.
The elements in the document are bound to the XHTML namespace, but you are using an XPath that is matching on the default no-namespace.
There are several things that you can do to ensure that your XPath matches what you want:
Register the XHTML namespace with your XPath engine and adjust your XPath expressions to use the registered XHTML namespace prefix.
Use an XPath statement that matches on the XHTML namespace and the local name inside of a predicate filter for a more generic match on elements, e.g. /*[local-name()='html' and namespace-uri()='http://www.w3.org/1999/xhtml'][1]/*[local-name()='body' and namespace-uri()='http://www.w3.org/1999/xhtml'][1]/*[local-name()='div' and namespace-uri()='http://www.w3.org/1999/xhtml'][32]/*[local-name()='a' and namespace-uri()='http://www.w3.org/1999/xhtml'][1]
Use an XPath statement that simply matches on the local name for a more generic match on elements. e.g. /*[local-name()='html'][1]/*[local-name()='body'][1]/*[local-name()='div'][32]/*[local-name()='a'][1]
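Both fixes can be sketched together as follows. This is a minimal illustration, not a drop-in solution: the class name XhtmlLookup and the sample page are made up, and the entity resolver answers every DTD request with an empty stand-in instead of a real local copy of the XHTML DTD, purely to keep the example self-contained. With an empty DTD, named entities such as &nbsp; would be undefined, so a real setup should return the locally stored DTD file.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class XhtmlLookup {

    static Node find(String xhtml, String path) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        DocumentBuilder builder = dbf.newDocumentBuilder();
        // Stand-in resolver: never goes to the network. Replace the empty
        // source with a stream over your local copy of the DTD.
        builder.setEntityResolver(
                (publicId, systemId) -> new InputSource(new StringReader("")));
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
        return (Node) XPathFactory.newInstance().newXPath()
                .evaluate(path, doc, XPathConstants.NODE);
    }

    public static void main(String[] args) throws Exception {
        String page = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" "
                + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"
                + "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<div><a href=\"x\">first</a></div>"
                + "<div><a href=\"y\">second</a></div>"
                + "</body></html>";
        // No-namespace path: selects nothing, so the result is null.
        System.out.println(find(page, "/html[1]/body[1]/div[2]/a[1]"));
        // local-name() path: finds the second link.
        Node link = find(page, "/*[local-name()='html'][1]/*[local-name()='body'][1]"
                + "/*[local-name()='div'][2]/*[local-name()='a'][1]");
        System.out.println(link.getTextContent());
    }
}
```

The first lookup returning null is the same situation as in the question: the caller then dereferences the missing node and gets a NullPointerException.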