Saxon XQuery Memory Management - java

So I have been working with Saxon quite a bit recently and have some concerns about its memory management.
From what I understand, Saxon does not take data as streams, which means that if I need to make comparisons on 1000 tuples, all 1000 tuples are allocated in memory. This seems like a flawed design to me. Is there a reason behind this other than limitations in Java?
I feel like this really makes XQuery a less viable alternative to SQL over JDBC, which does support streaming.

In general, XPath allows navigation anywhere in the source document; for example, you can write things like //x[@y = //z/@y] - such queries are clearly not streamable.
Saxon-EE does support streaming for a restricted subset of queries. The streaming capability is currently much more advanced in XSLT than in XQuery, simply because the XSL working group has been working on this area extensively over the last few years. Saxon-EE 9.6 supports pretty well all the streaming capability of the draft XSLT 3.0 specification.
Details are here:
http://www.saxonica.com/documentation/#!sourcedocs/streaming
This also includes information about Saxon's capability for streaming in XQuery.
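For illustration, here is a minimal sketch of invoking Saxon through its s9api interface; whether a given run actually streams depends on using Saxon-EE and on the stylesheet declaring streamable constructs, and the file names here are placeholders:

import java.io.File;
import javax.xml.transform.stream.StreamSource;
import net.sf.saxon.s9api.Processor;
import net.sf.saxon.s9api.Serializer;
import net.sf.saxon.s9api.XsltCompiler;
import net.sf.saxon.s9api.XsltExecutable;
import net.sf.saxon.s9api.XsltTransformer;

public class SaxonS9apiSketch {
    public static void main(String[] args) throws Exception {
        Processor processor = new Processor(true);   // true enables EE features if a licence is present
        XsltCompiler compiler = processor.newXsltCompiler();
        XsltExecutable executable = compiler.compile(new StreamSource(new File("streamable.xsl")));

        // Run the compiled stylesheet against a (potentially large) input document.
        XsltTransformer transformer = executable.load();
        transformer.setSource(new StreamSource(new File("large-input.xml")));
        Serializer out = processor.newSerializer(new File("output.xml"));
        transformer.setDestination(out);
        transformer.transform();
    }
}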

Related

MarkLogic to Java & Back Solution

I need to query XML out of a MarkLogic server and marshal it into Java objects. What is a good way to go about this? Specifically:
Does using MarkLogic have any impact on the XML technology stack? (i.e. is there something about MarkLogic that leads to a different approach to searching for, reading and writing XML snippets?)
Should I process the XML myself using one of the XML APIs or is there a simpler way?
Is it worth using JAXB for this?
Someone asked a good question of why I am using Java. I am using Java/Java EE because I am strongest in that language. This is a one man project and I don't want to get stuck anywhere. The project is to develop web service APIs and data processing and transformation (CSV to XML) functionality. Java/Java EE can do this well and do it elegantly.
Note: I'm the EclipseLink JAXB (MOXy) lead, and a member of the JAXB 2 (JSR-222) expert group.
Does using MarkLogic have any impact on the XML technology stack? (i.e. is there something about MarkLogic that leads to a different approach to searching for, reading and writing XML snippets?)
Potentially. Some object-to-XML libraries support a wider variety of documents than others do. MOXy leverages XPath-based mappings, which allow it to handle a wider variety of documents. Below are some examples:
http://blog.bdoughan.com/2010/09/xpath-based-mapping-geocode-example.html
http://blog.bdoughan.com/2011/03/map-to-element-based-on-attribute-value.html
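For illustration, a minimal sketch of a MOXy XPath-based mapping using the @XmlPath annotation (the class and paths are made up; MOXy also needs to be registered as the JAXB provider via a jaxb.properties file naming org.eclipse.persistence.jaxb.JAXBContextFactory):

import javax.xml.bind.annotation.XmlAccessType;
import javax.xml.bind.annotation.XmlAccessorType;
import javax.xml.bind.annotation.XmlRootElement;
import org.eclipse.persistence.oxm.annotations.XmlPath;

@XmlRootElement(name = "customer")
@XmlAccessorType(XmlAccessType.FIELD)
public class Customer {

    // Map this field to a nested attribute path instead of a flat child element.
    @XmlPath("contact-info/address/@city")
    private String city;

    public String getCity() { return city; }
    public void setCity(String city) { this.city = city; }
}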
Should I process the XML myself using one of the XML APIs or is there a simpler way?
Using a framework is generally easier. Java SE offers many standard libraries for processing XML: JAXB (javax.xml.bind), XPath (javax.xml.xpath), DOM, SAX, StAX. Since these are standards, there are also alternative implementations (e.g. MOXy and Apache JaxMe both implement JAXB).
http://blog.bdoughan.com/2011/05/specifying-eclipselink-moxy-as-your.html
Is it worth using JAXB for this?
Yes.
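For illustration, a minimal JAXB sketch (the Customer class and the XML snippet are made up; the snippet stands in for content returned from MarkLogic):

import java.io.StringReader;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import javax.xml.bind.annotation.XmlRootElement;

@XmlRootElement(name = "customer")
public class Customer {
    private String name;

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }

    public static void main(String[] args) throws Exception {
        JAXBContext context = JAXBContext.newInstance(Customer.class);
        Unmarshaller unmarshaller = context.createUnmarshaller();
        // Unmarshal an XML snippet (e.g. an XQuery result) into a Java object.
        Customer c = (Customer) unmarshaller.unmarshal(new StringReader(
                "<customer><name>Jane Doe</name></customer>"));
        System.out.println(c.getName());
    }
}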
There are a number of XML-to-Java object marshalling libraries. I think you might want to look for an answer to this question by searching for generic Java XML marshalling/unmarshalling questions like this one:
Java Binding Vs Manually Defining Classes
Your use case is still not perfectly clear, although the title edit helps. If you're looking for Java connectivity, you might also want to look at http://developer.marklogic.com/code/mljam, which allows you to execute Java code from within MarkLogic XQuery.
XQSync uses XStream for this. As I understand it, JAXB is more powerful - but also more complex.
Having used JAXB to unmarshal XML served from XQuery for 5 years now, I have to say that I have found it to be exceptionally useful and time-saving. As for complexity, it is easy to learn and use for probably 90% of what you would be using it for. I've used it for both simple and complex schemas and found it to be very performant and time-saving.
Executing Java code from within MarkLogic is usually a non-starter, because it runs in a separate VM on the MarkLogic server, so it really can't leverage any session state or libraries from, say, a Java EE web application.
With JAXB, it is very easy to take a result stream and convert it to Java objects. I really can't say enough good things about it. It has made my development efforts infinitely easier and allows you to leverage Java for those things that it does best (rich integration across various technologies and platforms, advanced business logic, fast memory management for heavy processing jobs, etc.) while still using XQuery for what it does best (i.e. searching and transforming content).
Does using MarkLogic have any impact on the XML technology stack?
No. By the time it comes out of MarkLogic, it's just XML that could have come from anywhere.
I need to query XML and marshal it into Java objects.
Why?
If you have a good reason for using Java, then we need to know what that reason is before we can tell you which Java technology is appropriate.
If you don't have a good reason for using Java, then you are better off using a high-level XML processing language such as XSLT or XQuery.
As for JAXB, it is appropriate when your schema is reasonably simple and stable. If the schema is complex (e.g. the schema for articles in an academic journal), then JAXB can be hopelessly unwieldy because of the number of classes that are generated. One problem with using it to process XQuery output is that it's very likely the XQuery output will not conform to any known schema, and the structure of the XQuery results will be different for each query that gets written.

XSLT Performance Considerations

I am working on a project which uses following technologies.
Java, XML, XSLs
There's heavy use of XMLs. Quite often I need to
- convert one XML document into another
- convert one XML document into another after applying some business logic.
Everything will be built into an EAR and deployed on an application server. As the number of users is huge, I need to take performance into consideration before defining coding standards.
I am not a very big fan of XSLs, but I am trying to understand whether using XSLs is a better option in this scenario or whether I should stick to Java only. Note that I have requirements to convert XML into XML format only. I don't have requirements to convert XML into some other format like HTML etc.
From a performance and maintainability point of view - isn't Java a better option than using XSLT for XML-to-XML transformations?
From my previous experience of this kind of application, if you have a performance bottleneck, then it won't be the XSLT processing. (The only exception might be if the processing is very complex and the programmer very inexperienced in XSLT.) There may be performance bottlenecks in XML parsing or serialisation if you are dealing with large documents, but these will apply whatever technology you use for the transformation.
Simple transformations are much simpler to code in XSLT than in Java. Complex transformations are also usually simpler to code in XSLT, unless they make heavy use of functionality available for free in the Java class library (an example might be date parsing). Of course, that's only true for people equally comfortable with coding in both languages.
Of course, it's impossible to give any more than arm-waving advice about performance until you start talking concrete numbers.
I agree with the above responses. XSLT is faster and more concise to develop than performing transformations in Java. You can change the XSLT without having to recompile the entire application (just re-create the EAR and redeploy). Manual transformations should always be faster, but the code may be much larger than the XSLT, because XPath and related technologies allow very condensed and powerful expressions. Try several XSLT engines (the one provided with Java, Saxon, Xalan, ...) and try to debug and profile the XSLT, using tools like the standalone Altova XMLSpy IDE, to detect bottlenecks. Try to load the XSLT transformation once and reuse it when processing several XML documents that require the same transformation. Another option is to compile the XSLT to Java classes, allowing faster parsing (Saxon seems to allow it), but changes are not as easy because you need to re-compile the XSLT and the generated classes.
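To illustrate the "load the transformation once and reuse it" advice, here is a minimal JAXP sketch (file names are placeholders); the compiled Templates object is immutable and thread-safe, so it can be shared across requests:

import java.io.File;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class TemplatesReuse {
    public static void main(String[] args) throws Exception {
        TransformerFactory factory = TransformerFactory.newInstance();
        // Compile the stylesheet once; Templates is immutable and thread-safe.
        Templates templates = factory.newTemplates(new StreamSource(new File("transform.xsl")));

        for (String input : new String[] {"order1.xml", "order2.xml"}) {
            // Each transformation gets its own (cheap) Transformer instance.
            Transformer transformer = templates.newTransformer();
            transformer.transform(new StreamSource(new File(input)),
                                  new StreamResult(new File(input + ".out.xml")));
        }
    }
}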
We use XSLT and XSL-FO to generate invoices for a billing software. We extract the data from the database and create an XML file, transform it with XSLT into XSL-FO, and process the resulting XML (FO instructions) with Apache FOP to generate a PDF. When generating invoices of several pages, the job is done in less than a second in a multi-user environment and on a per-request basis (online processing). We also do batch processing (billing cycles), and there the job is done even faster by reusing the XSLT transformation. Only for very large PDF documents (>100 pages) do we have some trouble (minutes), but the most expensive task is always processing the FO XML into PDF, not the XML-to-XML transformation with XSLT.
As is always said, if you need more processing power, you can just "add" more processors and easily run the jobs in parallel. The development time saved by using XSLT, if you have some experience with it, can be used to buy more hardware. It's the usual dichotomy of using powerful development tools to save development time and buying more hardware, versus doing things "manually" in order to get maximum performance.
Integration tools like ESBs are heavily based on XSLT transformations to adapt XML data from one system (sender) to another system (receiver), and they can usually perform hundreds of "transactions" (data processing and integration) per second.
If you use a modern XSLT processor, such as Saxon (available in a free version), you will find the performance to be quite good. Also, in the long term XSL transforms will be much more maintainable than hardcoded Java classes.
(I have no connection with the authors of Saxon)
Here is my observation based on empirical data. I use XSLT extensively, and in many cases as an alternative to data processors implemented in Java. Some of the data processors we have built are fairly involved. We primarily use Saxon-EE, through the Oxygen XML editor. Here is what we have noticed in terms of transformation performance.
For less complex XSL stylesheets, the performance is quite good (2s to read a 30MB XML file and generate over 20 HTML content pages with a lot of div structures), and the variance in performance seems to be linear or better with respect to the size of the file.
However, when the complexity of the XSL stylesheet changes, the performance change can be exponential. (The same file, with a function call introduced into a frequently called template, the function implementing a simple XPath resolution, went from 2s to 24s of processing time.) The introduction of functions and function calls seems to be a major culprit.
That said, we have not done a detailed performance review or code optimization. (We are still in alpha mode, and the performance is still within our limits, i.e. a batch job.) I must admit that we may have "abused" XSL functions, since in a lot of places we used the idea of abstracting code into functions (in addition to using templates). My suspicion is that, due to the way XSLT templates are called, there might be a lot of eventual recursion in the implementation procedures (for the XSLT processor), and function calls can become expensive if they are not optimized. We think a change in "strategy" in the way we write our XSL scripts (to be more XSLT/XPath-centric) may help the performance of the XSLT processor - for instance, the use of xsl:key. So yes, we may be just as guilty as the processor charged :)
One other performance issue is memory utilization. While RAM is not technically a problem, a simple processor run ramping from 1GB (!!!) to 6GB for a single invocation/transformation is not exactly kosher. There may be scalability and capacity concerns (depending on the application and usage). This may have less to do with the underlying XSLT processor and more to do with the editor tool. It seems to have a huge impact on debugging the stylesheets in real time (i.e. stepping through the XSLT).
A few observations:
- command-line or "production" invocation of the processor has better performance
- for consecutive runs (invoking the XSLT processor), the first run takes the longest (say 10s) and subsequent runs take a lot less (say 4s); again, this may have something to do with the editor environment
That said, while the performance of the processors may be a pain at times, depending on the application requirements, it is my opinion that if you consider the other factors already mentioned here (code maintenance, ease of implementation, rapid changes, size of the code base), the performance issues can be mitigated, or "accepted" (if the end application can still live with the performance numbers), when comparing an implementation using XSLT with Java (or another language).
...adieu!

implementing simple Document management

My question is: how would you go about implementing a simple DMS (document management system) based on the following requirements?
The DMS should be a distributed web application.
Support for document versioning.
Support for document locking.
Document search.
I'm already clear on what technologies I want to use. I will use Spring MVC, Hibernate and a relational (most likely MySQL) database.
One thing I'm not very clear on is whether I need to use WebDAV, since I could just upload or download documents. I think I have to, because I need to accomplish point 2 and especially point 3 somehow. Is this the right way to go?
Any examples or experience with this would come in very handy :). Maybe Milton is not the best library to pick for WebDAV?
@Eduard, regarding dependencies on 3rd parties - are you doing this as a college/university exercise or something that will affect real users in a production environment?
At the risk of sounding very pretentious: don't reinvent the wheel! I'd definitely second the call to use JCR; this way you are depending on a standard and not a 3rd-party implementation.
JCR is a well-defined standard (which means a lot of people have invested commercial effort, i.e. cash and expertise in huge amounts, into it). I would seriously consider looking into JCR - think of it as an API where 3rd parties provide the implementation (no vendor lock-in).
Have a look at the features you'll get out of the box; I believe 99 - 110% of the functionality you require is available through a JCR implementation. Plus you'll benefit from the fact that the code you'll be using has been tested by hundreds of people in real-world situations.
Where I'd differ from bmscomp is in suggesting JackRabbit http://jackrabbit.apache.org/
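To make the JCR suggestion concrete, here is a minimal sketch against the JCR 2.0 API with an embedded Jackrabbit repository (node names, credentials and lock settings are placeholders); it touches the versioning and locking requirements (points 2 and 3) directly:

import javax.jcr.Node;
import javax.jcr.Repository;
import javax.jcr.Session;
import javax.jcr.SimpleCredentials;
import javax.jcr.version.VersionManager;
import org.apache.jackrabbit.core.TransientRepository;

public class JcrSketch {
    public static void main(String[] args) throws Exception {
        Repository repository = new TransientRepository();   // embedded Jackrabbit repository
        Session session = repository.login(
                new SimpleCredentials("admin", "admin".toCharArray()));
        try {
            // Store a document node and make it versionable and lockable.
            Node doc = session.getRootNode().addNode("invoice-2011-001");
            doc.setProperty("title", "Invoice 2011-001");
            doc.addMixin("mix:versionable");   // enables version history
            doc.addMixin("mix:lockable");      // enables JCR locking
            session.save();

            // Create a new version (check-in), then check out to allow further edits.
            VersionManager vm = session.getWorkspace().getVersionManager();
            vm.checkin(doc.getPath());
            vm.checkout(doc.getPath());

            // Lock the node for exclusive editing, then release the lock.
            session.getWorkspace().getLockManager()
                   .lock(doc.getPath(), false, true, Long.MAX_VALUE, session.getUserID());
            session.getWorkspace().getLockManager().unlock(doc.getPath());
        } finally {
            session.logout();
        }
    }
}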
Option 1:
I am not sure about WebDAV; I have no real experience with it. But I would highly recommend using a document database like MongoDB.
With MongoDB, you can:
1. Handle document versions
2. Use MongoDB's atomic operations to implement your document-locking logic (a sketch follows below)
This will also give you the awesome added benefit of being able to search your document store.
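A rough sketch of that atomic lock acquisition, using the current MongoDB Java driver (the database, collection and field names are made up):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import static com.mongodb.client.model.Filters.and;
import static com.mongodb.client.model.Filters.eq;
import static com.mongodb.client.model.Updates.set;

public class MongoLockSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> docs =
                    client.getDatabase("dms").getCollection("documents");

            // Atomically acquire a lock: the update only matches when locked == false,
            // so two concurrent callers can never both succeed.
            Document acquired = docs.findOneAndUpdate(
                    and(eq("_id", "doc-42"), eq("locked", false)),
                    set("locked", true));

            if (acquired != null) {
                // ... edit the document, store a new entry in its version history ...
                docs.updateOne(eq("_id", "doc-42"), set("locked", false));   // release the lock
            }
        }
    }
}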
Option 2:
Apache Jackrabbit: A Content repository
A content repository is a hierarchical content store with support for structured and unstructured content, full text search, versioning, transactions, observation, and more.
Think about using JCR (Java Content Repository):
http://en.wikipedia.org/wiki/Content_repository_API_for_Java - or you can have a look at the work done on Alfresco or the eXo framework; they did a good job.
You can use these open source projects to meet your requirements:
http://sourceforge.net/projects/logicaldoc/ -
LogicalDOC is a modern document management system with a nice interface, easy to use and very fast. It uses open source Java technologies such as GWT, Spring, Lucene in order to provide a flexible and scalable DMS platform. http://www.logicaldoc.com
http://sourceforge.net/projects/openkm/ -
OpenKM Document Management - DMS
OpenKM is a powerful, scalable Document Management System (DMS). OpenKM uses JBoss + J2EE + Ajax web (GWT) + Jackrabbit (Lucene) open-source technologies. http://www.openkm.com/
Spring MVC is a good choice. If you want to use a relational database then you can also check out DataNucleus. At least the JDO layer (plus maybe the JPA layer) provides versioning support. For search I recommend Apache Solr, based on Lucene, which has excellent and powerful full-text search capabilities.
Although WebDAV seems like the natural choice as a simple and cross-platform file transfer protocol, I never had good experiences with it. Either the client or the server didn't work well (Konqueror, Internet Explorer, Zope 2, ...). So abstract from the protocol and provide multiple ways to access the files.

Alternatives to rules engine for centralizing and maintaining rules

I'm trying to find an appropriate solution/framework to centralize and maintain rules. The number of rules is huge and they change frequently. I've gone through rules engines like Drools but find them unsuitable, for reasons like the complexity of rule execution (which affects maintainability) and the overhead of centralizing rules (rules engines often require another repository system to hold the rules).
The solution/framework I'm looking for should ideally allow me to write rules in standard programming languages such as Java with little overheads to centralizing and maintaining rules.
Big thanks in advance.
Drools 5.2.0 will have the new parser API, which - in theory - allows you to avoid DRL and write a rule engine's Left Hand Side (LHS) in Java, much like you'd write a JPA query with the JPA 2.0 criteria API.
Have you tried Spring support for dynamic languages? You can invoke beans written in languages like Groovy or JRuby (I wrote JavaScript support some time ago if you care). Source code of these dynamic beans can be extracted into separate files which are scanned periodically to discover changes at runtime.
Much simpler, yet still powerful.

What XSLT processor should I use for Java transformation?

What XSLT processor should I use for Java transformations? There are Saxon, Xalan and TrAX. What are the criteria for my choice? I need simple transformations without complicated transformation logic. I need a fast and easy-to-implement solution. The app runs under Tomcat on Java 1.5.
I have some JAXP libraries and could not work out which version of JAXP is being used.
Thanks in advance.
Best regards.
The JDK comes bundled with an internal version of Xalan, and you get an instance of it by using the standard API (e.g. TransformerFactory.newInstance()).
Unless Xalan doesn't work for you (which is highly unlikely), there's no need to look elsewhere.
By the way, TrAX is the old name for the javax.xml.transform API, from the days when it was an optional extension to the JDK.
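For reference, a minimal sketch of a transformation through the standard javax.xml.transform API (file names are placeholders); by default it uses the JDK's bundled processor, and you can switch to Saxon later via a system property without changing the code:

import java.io.File;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class SimpleTransform {
    public static void main(String[] args) throws Exception {
        // Uses the JDK's built-in XSLT processor by default. To use Saxon instead, run with:
        // -Djavax.xml.transform.TransformerFactory=net.sf.saxon.TransformerFactoryImpl
        TransformerFactory factory = TransformerFactory.newInstance();
        Transformer transformer = factory.newTransformer(new StreamSource(new File("stylesheet.xsl")));
        transformer.transform(new StreamSource(new File("input.xml")),
                              new StreamResult(new File("output.xml")));
    }
}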
In general, it's a hard question to answer because it heavily depends on what you mean by "simple transformation" and "fast", as well as the amount of XML you want to process. There are probably other considerations as well. To illustrate, "fast" could mean "fast to write" or "fast to execute"; if you process files the size of the available memory you might make a different choice (maybe STX, as described in another SO question) than if you parse small files, etc.
