Streaming XPath evaluation

Are there any production-ready libraries for streaming evaluation of XPath expressions against a given XML document? My investigations show that most existing solutions load the entire DOM tree into memory before evaluating the XPath expression.

XSLT 3.0 provides a streaming mode of processing, and this will become standard once the XSLT 3.0 W3C specification becomes a W3C Recommendation.
At the time of writing this answer (May 2011), Saxon provides some support for XSLT 3.0 streaming.

Would this be practical for a complete XPath implementation, given that XPath syntax allows for:
/AAA/XXX/following::*
and
/AAA/BBB/following-sibling::*
which imply look-ahead requirements? That is, from a particular node you are going to have to load the rest of the document anyway.
The doc for the Nux library (specifically StreamingPathFilter) makes this point, and references some implementations that rely on a subset of XPath. Nux claims to perform some streaming query capability, but given the above there will be some limitations in terms of XPath implementation.

There are several options:
DataDirect Technologies sells an XQuery implementation that employs projection and streaming, where possible. It can handle files into the multi-gigabyte range - e.g. larger than available memory. It's a thread-safe library, so it's easy to integrate. Java-only.
Saxon comes in an open-source edition and a modestly priced commercial edition, and the commercial one will do streaming in some contexts. Java, but with a .NET port also.
MarkLogic and eXist are XML databases that, if your XML is loaded into them, will process XPaths in a fairly intelligent fashion.

Try Joost.

Though I have no practical experience with it, I thought it was worth mentioning QuiXProc ( http://code.google.com/p/quixproc/ ). It is a streaming approach to XProc, and it uses libraries that provide streaming support for XPath, amongst others.

FWIW, I've used Nux streaming filter XPath queries against very large (>3GB) files, and it has both worked flawlessly and used very little memory. My use case has been slightly different (not validation-centric), but I'd highly encourage you to give Nux a shot.

I think I'll go for custom code. The .NET library gets us quite close to the target if one just wants to read some paths of the XML document.
Since all the solutions I have seen so far support only a subset of XPath, this solution does too. The subset is really small, though. :)
This C# code reads an XML file and counts the nodes on an explicit path. You can also operate on attributes easily, using the xr["attrName"] syntax.
int c = 0;
var r = new System.IO.StreamReader(asArgs[1]);
var se = new System.Xml.XmlReaderSettings();
var xr = System.Xml.XmlReader.Create(r, se);
// Names of the currently open elements, root first.
var lstPath = new System.Collections.Generic.List<string>();
var sbPath = new System.Text.StringBuilder();
while (xr.Read()) {
    //Console.WriteLine("type " + xr.NodeType);
    if (xr.NodeType == System.Xml.XmlNodeType.Element) {
        lstPath.Add(xr.Name);
    }
    // It takes some time. If 1 unit is time needed for parsing the file,
    // then this takes about 1.0.
    sbPath.Clear();
    foreach (string n in lstPath) {
        sbPath.Append('/');
        sbPath.Append(n);
    }
    // This takes about 0.6 time units.
    string sPath = sbPath.ToString();
    // An element is complete at its end tag, or at once if it is empty (<x/>).
    if (xr.NodeType == System.Xml.XmlNodeType.EndElement
        || xr.IsEmptyElement) {
        if (xr.Name == "someElement" && lstPath[0] == "main")
            c++;
        // And test simple XPath explicitly:
        // if (sPath == "/main/someElement")
        lstPath.RemoveAt(lstPath.Count - 1);
    }
}
xr.Close();
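Since the question is about Java, here is a rough equivalent of the same idea using the StAX pull parser that ships with the JDK. It is only a sketch: the class name StreamingPathCounter, the count method and the sample document are made up for illustration, namespaces are ignored, and only explicit absolute paths such as /main/someElement are supported.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: counts elements on one explicit absolute path.
final class StreamingPathCounter {

    static int count(Reader xml, String targetPath) {
        Deque<String> open = new ArrayDeque<>(); // names of open elements, root first
        int matches = 0;
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(xml);
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    open.addLast(reader.getLocalName());
                    // Compare the current path with the requested one.
                    if (targetPath.equals("/" + String.join("/", open))) {
                        matches++;
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    // StAX also reports an end event for empty elements like <x/>.
                    open.removeLast();
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return matches;
    }

    public static void main(String[] args) {
        String xml = "<main><someElement/><other><someElement/></other>"
                + "<someElement>x</someElement></main>";
        System.out.println(count(new StringReader(xml), "/main/someElement"));
    }
}
```

Only the stack of open element names is kept in memory, so memory use depends on nesting depth rather than on document size.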

Related

How do I get the string value of a node?

The XPath string(/ROOT/Products/UnitPrice) works fine in dom4j & the .NET runtime. But in Saxon it throws an exception of:
net.sf.saxon.s9api.SaxonApiException: A sequence of more than one item is not allowed as the first argument of string() (<UnitPrice/>, <UnitPrice/>, ...)
What's going on here? Why is this not OK?

Saxon expects a single node as input.
The .NET implementation is different; it considers only the first one:
The string() function converts a node-set to a string by returning the string value of the first node in the node-set, which in some instances may yield unexpected results.
See MSDN
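The same first-node rule can be seen from Java with the javax.xml.xpath API bundled with the JDK, assuming the default JDK implementation (which is XPath 1.0) is the one picked up. This is just an illustrative sketch; the class name, method names and sample prices are invented.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo of XPath 1.0 string() applied to a multi-node result.
final class FirstNodeStringDemo {

    static final String XML = "<ROOT><Products>"
            + "<UnitPrice>10</UnitPrice><UnitPrice>20</UnitPrice>"
            + "</Products></ROOT>";

    // XPath 1.0: string() silently uses only the first node in document order.
    static String firstValue(String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate("string(/ROOT/Products/UnitPrice)",
                    new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    // To see every value, select the node-set and iterate over it explicitly.
    static List<String> allValues(String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        List<String> values = new ArrayList<>();
        try {
            NodeList nodes = (NodeList) xpath.evaluate("/ROOT/Products/UnitPrice",
                    new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent());
            }
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(firstValue(XML));
        System.out.println(allValues(XML));
    }
}
```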

The problem is that /ROOT/Products/UnitPrice may return more than one node, and the XPath 2.0 string() function does not accept a sequence of more than one item as its argument.
Saxon is XPath 2.0 compliant. To solve your problem, you can write this XPath expression:
for $price in /ROOT/Products/UnitPrice return string($price)
You will then have to iterate over the result (XdmValue object).

If you are using the s9api interface, you can call
XPathCompiler.setBackwardsCompatible(true);
to make XPath expressions run in XPath 1.0 compatibility mode. This doesn't completely replicate all aspects of XPath 1.0 behaviour, but it will handle most of the things that changed between XPath 1.0 and 2.0.
Very often the incompatibilities that were introduced in 2.0 are because they affect areas that were a common source of user errors in 1.0. It's really best not to rely on the implicit truncation of an input sequence performed by functions like string(); it's the cause of many application bugs.
==LATER==
We tried to remove 1.0 compatibility mode in Saxon-HE 9.8, thinking that after 10 years few people would still be relying on it. Unfortunately those few made a fuss, and we decided to backtrack. But I've just seen that in HE 9.8, the setBackwardsCompatible() method will throw an error saying it's not supported. Try instead:
XPathCompiler.getUnderlyingStaticContext().setBackwardsCompatibilityMode(true);

Java parallel execution on multicore

I would like to know whether there is a way (or a framework or library) to parallelize queries in Java, like in C# with LINQ:
var query = from item in source.AsParallel().WithDegreeOfParallelism(2)
            where Compute(item) > 42
            select item;
If queries cannot be parallelized, can I at least do something like this C# parallel for-each?
Parallel.ForEach(files, currentFile =>
{
    // The more computational work you do here, the greater
    // the speedup compared to a sequential foreach loop.
    string filename = System.IO.Path.GetFileName(currentFile);
    System.Drawing.Bitmap bitmap = new System.Drawing.Bitmap(currentFile);
    bitmap.RotateFlip(System.Drawing.RotateFlipType.Rotate180FlipNone);
    bitmap.Save(System.IO.Path.Combine(newDir, filename));
    // Peek behind the scenes to see how work is parallelized.
    // But be aware: Thread contention for the Console slows down parallel loops!!!
    Console.WriteLine("Processing {0} on thread {1}", filename,
        Thread.CurrentThread.ManagedThreadId);
});
If you post a framework or library, can you please tell me about the experience you had with it?
Thanks for your time.
The documentation for C# and LINQ is here: http://msdn.microsoft.com/en-us/library/dd997425.aspx

There isn't a direct translation. Firstly, Java doesn't have LINQ, nor does it have any standard parallel collection classes.
GPars is perhaps the closest fit.
Note that while it is targeted at Groovy, its API is perfectly usable from Java.

Maybe the Fork/Join framework can help you? Here is the Java tutorial.
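For readers on Java 8 or newer (an assumption, since the answers here predate it), parallel streams come fairly close to the AsParallel() query in the question; they run on the common Fork/Join pool. A minimal sketch, where compute is a made-up stand-in for the question's Compute(item) and the class name is hypothetical:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical demo; requires Java 8 or newer.
final class ParallelQueryDemo {

    // Stand-in for the Compute(item) call in the question.
    static int compute(int item) {
        return item * 2;
    }

    // Rough counterpart of:
    //   from item in source.AsParallel() where Compute(item) > 42 select item
    static List<Integer> query(List<Integer> source) {
        return source.parallelStream()
                .filter(item -> compute(item) > 42)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> source =
                IntStream.rangeClosed(1, 30).boxed().collect(Collectors.toList());
        System.out.println(query(source));
    }
}
```

The filter runs on several threads, but collecting from an ordered list keeps the results in their original order. A parallel for-each in the style of Parallel.ForEach would be files.parallelStream().forEach(...), with the same caveat about shared state as in the C# version.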

Simple java recursive descent parsing library with placeholders

For an application I want to parse a String with arithmetic expressions and variables. Just imagine this string:
((A + B) * C) / (D - (E * F))
So I have placeholders here and no actual integer/double values. I am searching for a library which allows me to get the first placeholder, put (via a database query for example) a value into the placeholder and proceed with the next placeholder.
So what I essentially want to do is to allow users to write a string in their domain language without knowing the actual values of the variables. So the application would provide numeric values depending on some "contextual logic" and would output the result of the calculation.
I googled and did not find any suitable library. I found ANTLR, but I think it would be very "heavyweight" for my use case. Any suggestions?

You are right that ANTLR is a bit of an overkill. However parsing arithmetic expressions in infix notation isn't that hard, see:
Operator-precedence parser
Shunting-yard algorithm
Algorithms for Parsing Arithmetic Expressions
Also you should consider using some scripting languages like Groovy or JRuby. Also JDK 6 onwards provides built-in JavaScript support. See my answer here: Creating meta language with Java.

If all you want to do is simple expressions, and you know the grammar for those expressions in advance, you don't even need a library; you can code this trivially in pure Java.
See this answer for a detailed version of how:
Is there an alternative for flex/bison that is usable on 8-bit embedded systems?
If the users are defining their own expression language, it is always in the form of a few monadic or binary operators, and they can specify the precedence, you can bend the above answer by parameterizing the parser with a list of operators at several levels of precedence.
If the language can be more sophisticated, you might want to investigate metacompilers.
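To give an idea of how little code the hand-written route takes, here is a small recursive descent evaluator for the grammar in the question. It is a sketch, not a finished library: the class name PlaceholderExpression and the resolver callback are invented for illustration, and the resolver is where a database lookup or other "contextual logic" would go. Placeholders are resolved one at a time, in the order the parser meets them.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToDoubleFunction;

// Hypothetical evaluator for +, -, *, /, parentheses, numbers and placeholders.
final class PlaceholderExpression {
    private final String src;
    private final ToDoubleFunction<String> resolver; // supplies placeholder values
    private int pos;

    private PlaceholderExpression(String src, ToDoubleFunction<String> resolver) {
        this.src = src;
        this.resolver = resolver;
    }

    static double evaluate(String expression, ToDoubleFunction<String> resolver) {
        PlaceholderExpression p = new PlaceholderExpression(expression, resolver);
        double value = p.expression();
        p.skipSpaces();
        if (p.pos != p.src.length()) {
            throw new IllegalArgumentException("Unexpected input at " + p.pos);
        }
        return value;
    }

    // expression := term (('+' | '-') term)*
    private double expression() {
        double value = term();
        while (true) {
            if (accept('+')) value += term();
            else if (accept('-')) value -= term();
            else return value;
        }
    }

    // term := factor (('*' | '/') factor)*
    private double term() {
        double value = factor();
        while (true) {
            if (accept('*')) value *= factor();
            else if (accept('/')) value /= factor();
            else return value;
        }
    }

    // factor := '(' expression ')' | number | placeholder
    private double factor() {
        if (accept('(')) {
            double value = expression();
            if (!accept(')')) throw new IllegalArgumentException("Expected ')' at " + pos);
            return value;
        }
        skipSpaces();
        int start = pos;
        while (pos < src.length() && (Character.isLetterOrDigit(src.charAt(pos))
                || src.charAt(pos) == '.' || src.charAt(pos) == '_')) {
            pos++;
        }
        if (start == pos) throw new IllegalArgumentException("Expected operand at " + pos);
        String token = src.substring(start, pos);
        // Tokens starting with a letter or '_' are placeholders; the rest are numbers.
        boolean isName = Character.isLetter(token.charAt(0)) || token.charAt(0) == '_';
        return isName ? resolver.applyAsDouble(token) : Double.parseDouble(token);
    }

    private boolean accept(char c) {
        skipSpaces();
        if (pos < src.length() && src.charAt(pos) == c) {
            pos++;
            return true;
        }
        return false;
    }

    private void skipSpaces() {
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) pos++;
    }

    public static void main(String[] args) {
        Map<String, Double> values = new HashMap<>();
        values.put("A", 1.0); values.put("B", 2.0); values.put("C", 4.0);
        values.put("D", 10.0); values.put("E", 2.0); values.put("F", 3.0);
        System.out.println(evaluate("((A + B) * C) / (D - (E * F))", values::get));
    }
}
```

Each grammar rule is one method, and operator precedence falls out of which method calls which. Unary minus and functions are left out to keep the sketch short.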

Is there a Java equivalent of Python's printf hash replacement?

Specifically I am converting a python script into a java helper method. Here is a snippet (slightly modified for simplicity).
# hash of values
vals = {}
vals['a'] = 'a'
vals['b'] = 'b'
vals['1'] = 1
output = sys.stdout
file = open(filename).read()
print >>output, file % vals,
So in the file there are %(a), %(b), %(1) etc. that I want replaced with the values stored under those hash keys. I perused the API but couldn't find anything. Did I miss it, or does something like this not exist in the Java API?

You can't do this directly without some additional templating library. I recommend StringTemplate. Very lightweight, easy to use, and very optimized and robust.

I doubt you'll find a pure Java solution that'll do exactly what you want out of the box.
With this in mind, the best answer depends on the complexity and variety of Python formatting strings that appear in your file:
If they're simple and not varied, the easiest way might be to code something up yourself.
If the opposite is true, one way to get the result you want with little work is by embedding Jython into your Java program. This will enable you to use Python's string formatting operator (%) directly. What's more, you'll be able to give it a Java Map as if it were a Python dictionary (vals in your code).
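For the simple case, "coding something up yourself" can be a regex and a Map. The sketch below assumes the placeholders in the file are in the full Python form %(key)s; other conversions such as %(key)d, widths, flags and the %% escape are not handled. The class and method names are made up.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: replaces %(key)s placeholders with values from a Map.
final class NamedFormat {

    // Matches %(key)s only; this is an assumption about the template format.
    private static final Pattern PLACEHOLDER = Pattern.compile("%\\(([^)]+)\\)s");

    static String format(String template, Map<String, ?> values) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer(); // StringBuffer keeps this Java 8 compatible
        while (m.find()) {
            String key = m.group(1);
            if (!values.containsKey(key)) {
                throw new IllegalArgumentException("No value for key: " + key);
            }
            // quoteReplacement stops '$' and '\' in values from being treated specially.
            m.appendReplacement(out,
                    Matcher.quoteReplacement(String.valueOf(values.get(key))));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, Object> vals = new HashMap<>();
        vals.put("a", "a");
        vals.put("b", "b");
        vals.put("1", 1);
        System.out.println(format("a=%(a)s b=%(b)s one=%(1)s", vals));
    }
}
```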

What is the simplest and minimalistic java xml api?

There are many pretty good JSON libs like Gson. But for XML I know only Xerces/JDOM, and both have tedious APIs.
I don't like to use unnecessary objects like DocumentFactory, XpathExpressionFactory, NodeList and so on.
So, in light of the native XML support in languages such as Groovy/Scala, I have a question.
Is there a minimalistic Java XML I/O framework?
PS: XStream/JAXB are good for serialization/deserialization, but in this case I'm looking for streaming some data in XML, with XPath for example.

The W3C DOM model is unpleasant and cumbersome, I agree. JDOM is already pretty simple. The only other DOM API that I'm aware of that is simpler is XOM.

What about StAX? With Java 6 you don't even need additional libs.
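To show roughly what that looks like, here is a small sketch using only the javax.xml.stream classes from the JDK. The class name, the names method and the sample XML are invented for illustration; it pulls one attribute from every element with a given name.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo: collect one attribute from all elements with a given name.
final class StaxAttributeDemo {

    static List<String> names(String xml, String element, String attribute) {
        List<String> result = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && element.equals(reader.getLocalName())) {
                    // A null namespace URI means the namespace is not checked.
                    String value = reader.getAttributeValue(null, attribute);
                    if (value != null) {
                        result.add(value);
                    }
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return result;
    }

    public static void main(String[] args) {
        String xml = "<root><foo><bar><baz name=\"phleem\" />"
                + "<baz name=\"gumbo\" /></bar></foo></root>";
        System.out.println(names(xml, "baz", "name"));
    }
}
```

There are no factories beyond XMLInputFactory and no tree in memory, but there is also no XPath: any path matching has to be done by hand.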

Dom4J rocks. It's very easy and understandable
Sample Code:
public static void main(String[] args) throws Exception {
    final String xml = "<root><foo><bar><baz name=\"phleem\" />"
            + "<baz name=\"gumbo\" /></bar></foo></root>";
    Document document = DocumentHelper.parseText(xml);
    // simple collection views
    for (Element element : (List<Element>) document
            .getRootElement()
            .element("foo")
            .element("bar")
            .elements("baz")) {
        System.out.println(element.attributeValue("name"));
    }
    // and easy xpath support
    List<Element> elements2 = (List<Element>)
            document.createXPath("//baz").evaluate(document);
    for (final Element element : elements2) {
        System.out.println(element.attributeValue("name"));
    }
}
Output:
phleem
gumbo
phleem
gumbo

Try VTD-XML. It's almost 3 to 4 times faster than DOM parsers, with an outstanding memory footprint.

It depends on how complex your Java objects are: are they self-contained, etc. (like graph nodes)? If your objects are simple, you can use Google Gson; it is the simplest API (IMO).
In XStream things start to get messy when you need to debug. Also, you need to be careful when you choose an appropriate Driver for XStream.

JDOM and XOM are probably the simplest. DOM4J is more powerful but more complex. DOM is just horrible. Processing XML in Java will always be more complex than processing JSON, because JSON was designed for structured data while XML was designed for documents, and documents are more complex than structured data. Why not use a language that was designed for XML instead, specifically XSLT or XQuery?

NanoXML is very small, below 50kb. I've found this today and I'm really impressed.
