XSLT Performance Considerations in Java

I am working on a project that uses the following technologies:
Java, XML, XSLs
There is heavy use of XML. Quite often I need to
- convert one XML document into another
- convert one XML document into another after applying some business logic.
Everything will be built into an EAR and deployed on an application server. As the number of users is huge, I need to take performance into consideration before defining coding standards.
I am not a big fan of XSLs, but I am trying to understand whether using XSLs is the better option in this scenario or whether I should stick to Java only. Note that I only have requirements to convert XML into XML; I don't have requirements to convert XML into some other format such as HTML.
From a performance and maintainability point of view, isn't Java a better option than XSLT for XML-to-XML transformations?

From my previous experience of this kind of application, if you have a performance bottleneck, then it won't be the XSLT processing. (The only exception might be if the processing is very complex and the programmer very inexperienced in XSLT.) There may be performance bottlenecks in XML parsing or serialisation if you are dealing with large documents, but these will apply whatever technology you use for the transformation.
Simple transformations are much simpler to code in XSLT than in Java. Complex transformations are also usually simpler to code in XSLT, unless they make heavy use of functionality available for free in the Java class library (an example might be date parsing). Of course, that's only true for people equally comfortable with coding in both languages.
Of course, it's impossible to give any more than arm-waving advice about performance until you start talking concrete numbers.

I agree with the above responses. XSLT is faster and more concise to develop than performing transformations in Java. You can change the XSLT without having to recompile the entire application (just re-create the EAR and redeploy). Manual transformations should always be faster, but the code will be much larger than the XSLT, because XPath and related technologies allow very condensed and powerful expressions. Try several XSLT engines (the JDK's built-in one, Saxon, Xalan...) and try to debug and profile the XSLT, using tools such as the standalone Altova XMLSpy IDE, to detect bottlenecks. Try to load the XSLT transformation once and reuse it when processing several XMLs that require the same transformation (see the sketch below). Another option is to compile the XSLT to Java classes for faster execution (Saxon seems to allow this), but changes are then less convenient, since you need to recompile the XSLT and the generated classes.
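A minimal JAXP sketch of that reuse pattern (the file names are placeholders):

    import javax.xml.transform.*;
    import javax.xml.transform.stream.StreamResult;
    import javax.xml.transform.stream.StreamSource;
    import java.io.File;

    public class ReusableTransform {
        public static void main(String[] args) throws TransformerException {
            // Compile the stylesheet once; a Templates object is thread-safe and reusable.
            TransformerFactory factory = TransformerFactory.newInstance();
            Templates templates = factory.newTemplates(new StreamSource(new File("transform.xsl")));

            // Reuse the compiled stylesheet for many inputs.
            for (String input : new String[] {"order1.xml", "order2.xml"}) {
                Transformer transformer = templates.newTransformer(); // cheap per-run object
                transformer.transform(new StreamSource(new File(input)),
                                      new StreamResult(new File(input + ".out.xml")));
            }
        }
    }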
We use XSLT and XSL-FO to generate invoices for billing software. We extract the data from the database, create an XML file, transform it with an XSLT stylesheet that produces XSL-FO, and process the resulting FO with Apache FOP to generate a PDF. When generating invoices of several pages, the job is done in less than a second in a multi-user environment and on a per-request basis (online processing). We also do batch processing (billing cycles), and that job runs faster because the XSLT transformation is reused. Only for very large PDF documents (>100 pages) do we see some trouble (minutes), but the most expensive task is always turning the FO into PDF, not the XML-to-XML step with XSLT.
As is often said, if you need more processing power you can simply add more processors and run the jobs in parallel. I think the development time saved by using XSLT (if you have some experience with it) can be used to buy more hardware. It is the classic trade-off between using powerful development tools to save development time and buying more hardware, versus doing things "manually" to squeeze out maximum performance.
Integration tools such as ESBs are heavily based on XSLT transformations to adapt XML data from one system (the sender) to another (the receiver), and they can usually perform hundreds of such "transactions" (data processing and integration) per second.

If you use a modern XSLT processor, such as Saxon (available in a free version), you will find the performance to be quite good. Also, in the long term XSL transforms will be much more maintainable than hardcoded Java classes.
(I have no connection with the authors of Saxon)
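For illustration only, a hedged sketch of explicitly selecting Saxon as the JAXP processor (it assumes a Saxon jar is on the classpath; net.sf.saxon.TransformerFactory is Saxon's JAXP factory class):

    import javax.xml.transform.TransformerFactory;

    public class UseSaxon {
        public static void main(String[] args) {
            // Point JAXP at Saxon instead of relying on classpath discovery.
            // Assumes a Saxon jar (e.g. Saxon-HE) is on the classpath.
            System.setProperty("javax.xml.transform.TransformerFactory",
                               "net.sf.saxon.TransformerFactory");
            TransformerFactory factory = TransformerFactory.newInstance();
            System.out.println("Using: " + factory.getClass().getName());
        }
    }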

Here is my observation based on empirical data. I use XSLT extensively, in many cases as an alternative to data processors implemented in Java. Some of the data processors we built are fairly involved. We primarily use Saxon-EE, through the oXygen XML editor. Here is what we have noticed in terms of transformation performance.
For less complex XSL stylesheets, the performance is quite good (2 s to read a 30 MB XML file and generate over 20 HTML content pages with a lot of div structures), and the variance in performance seems roughly linear or better with respect to the size of the file.
However, when the complexity of the XSL stylesheet changes, the performance change can be exponential. (The same file, with a function call introduced into a frequently called template, where the function implements a simple XPath resolution, went from 2 s to 24 s of processing time.) The introduction of functions and function calls seems to be a major culprit.
That said, we have not done a detailed performance review or code optimization (we are still in alpha, and the performance is still within our limits, i.e. a batch job). I must admit that we may have "abused" XSL functions, as in a lot of places we used the idea of abstracting code into functions (in addition to using templates). My suspicion is that, due to the way XSLT templates are called, there may be a lot of recursion inside the processor, and function calls can become expensive if they are not optimized. We think a change of "strategy" in the way we write our XSL scripts (to be more XSLT/XPath-centric, for instance using xsl:key) may help the performance of the XSLT processor. So yes, we may be just as guilty as the processor we are charging :)
The other performance issue is memory utilization. While RAM is not technically a problem, a simple transformation ramping from 1 GB (!!!) to 6 GB for a single invocation is not exactly kosher. There may be scalability and capacity concerns (depending on the application and its usage). This may have less to do with the underlying XSLT processor and more to do with the editor tool. It seems to have a huge impact on debugging the stylesheets in real time (i.e. stepping through the XSLT).
A few observations:
- command-line or "production" invocation of the processor has better performance
- for consecutive runs (invoking the XSLT processor), the first run takes the longest (say 10 s) and subsequent runs take much less (say 4 s). Again, this may have something to do with the editor environment.
That said, while the performance of the processors may be a pain at times, depending on the application requirements, it is my opinion that if you consider the other factors already mentioned here, such as code maintenance, ease of implementation, rapid changes, and the size of the code base, the performance issues can be mitigated or "accepted" (if the end application can still live with the performance numbers) when comparing an implementation in XSLT with one in Java (or anything else).
...adieu!

Related

Saxon XQuery Memory Management

So I have been working with Saxon quite a bit recently and am having some concerns about its memory management ability.
From what I understand, Saxon does not take data as streams, which means that if I need to make comparisons across 1000 tuples, all 1000 tuples are held in memory. This seems like a flawed design to me. Is there a reason behind this other than limitations in Java?
I feel like this really makes XQuery a less viable alternative to SQL and JDBC which supports streaming.
In general, XPath allows navigation anywhere in the source document; for example, you can write things like //x[@y = //z/@y] - such queries are clearly not streamable.
Saxon-EE does support streaming for a restricted subset of queries. The streaming capability is currently much more advanced in XSLT than in XQuery, simply because the XSL working group has been working on this area extensively over the last few years. Saxon-EE 9.6 supports pretty well all the streaming capability of the draft XSLT 3.0 specification.
Details are here:
http://www.saxonica.com/documentation/#!sourcedocs/streaming
This page also includes information about Saxon's capability for streaming XQuery.

What other alternatives exist for XML-to-XML transformation other than XSLT

I have huge XML files (3000+ unique nodes) that need to be translated from one format to another. My main concerns are speed and memory usage. Are there any alternatives to XSLT for this, other than programmatically parsing the input XML with StAX and creating the target XML with StAX?
I know there is the STX project, but I don't think it is being maintained.
If you are so concerned about speed and memory usage, you might want to write your own SAX transformer (a minimal sketch follows below). Whether that's easy enough depends on the complexity of the transformation.
That said, 3000 nodes is not much, and I've used Apache Cocoon to transform much bigger documents. STX worked well, too; not maintained does not necessarily mean it doesn't work.
Better to try the existing solutions first and improve as needed.
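For illustration only, a rough sketch of such a hand-rolled SAX transformer using the standard XMLFilterImpl (the element names and file names are made-up placeholders):

    import org.xml.sax.Attributes;
    import org.xml.sax.InputSource;
    import org.xml.sax.SAXException;
    import org.xml.sax.helpers.XMLFilterImpl;
    import org.xml.sax.helpers.XMLReaderFactory;

    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.sax.SAXSource;
    import javax.xml.transform.stream.StreamResult;

    // Streams the document through, renaming <oldName> elements to <newName>
    // without ever building an in-memory tree.
    public class RenamingFilter extends XMLFilterImpl {

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            if ("oldName".equals(localName)) {          // placeholder element name
                super.startElement(uri, "newName", "newName", atts);
            } else {
                super.startElement(uri, localName, qName, atts);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            if ("oldName".equals(localName)) {
                super.endElement(uri, "newName", "newName");
            } else {
                super.endElement(uri, localName, qName);
            }
        }

        public static void main(String[] args) throws Exception {
            RenamingFilter filter = new RenamingFilter();
            filter.setParent(XMLReaderFactory.createXMLReader());
            // An identity transform serializes whatever events the filter emits.
            TransformerFactory.newInstance().newTransformer().transform(
                    new SAXSource(filter, new InputSource("input.xml")),
                    new StreamResult("output.xml"));
        }
    }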
Smooks can help you. Handy and fast. http://www.smooks.org/
I've found JDom helpful for simple programmatic manipulation of XML structures in Java.

XML Parsing: JDOM or RegEx? Which is faster?

A colleague of mine needs to develop an Eclipse plugin that has to parse multiple XML files to check for programming rules imposed by a client (for example, no xsl:for-each, or no namespaces declared but not used). There are about 1000 files to be parsed regularly, each containing about 300-400 lines.
We were wondering which solution would be faster. I'm thinking JDOM, and he's thinking RegEx.
Can anyone help us decide which is best?
Thanks
DOM, hands down. RegEx would be madness. Use the tool that was intended for the job.
You can't parse recursive structures with RegEx. So unless you have really simple XML files, XML parsing will be much faster and the code will be somewhat sane (so you won't spend endless hours to locate bugs).
Since the files are pretty small, JDom will make your job much easier. For larger files, you will have to use a SAX or similar parser (so you don't have to keep the whole file in RAM).
If you try to parse XML using regular expressions, you are entering a world of pain. If speed is important, using an event-based API might be a tad faster than DOM/JDOM.
If all the checks are simple ones such as "no xsl:for-each" or unused-namespace checks, a StAX parser would be best: you just stream the documents through it, receive the start-element events, and do your checking. The parser needs relatively little memory for this.
If you need to do referential checking, DOM may be better, as you can easily walk the tree (perhaps via XPath).
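A minimal StAX sketch of that kind of rule check (the file name is a placeholder; the namespace URI is the standard XSLT one):

    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;
    import javax.xml.transform.stream.StreamSource;

    public class ForEachRuleCheck {
        private static final String XSLT_NS = "http://www.w3.org/1999/XSL/Transform";

        public static void main(String[] args) throws Exception {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new StreamSource("stylesheet.xsl"));

            // Stream through the document and flag every xsl:for-each start tag.
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && XSLT_NS.equals(reader.getNamespaceURI())
                        && "for-each".equals(reader.getLocalName())) {
                    System.out.println("Rule violation: xsl:for-each at line "
                            + reader.getLocation().getLineNumber());
                }
            }
            reader.close();
        }
    }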

What can we do to make XML processing faster?

We work on an internal corporate system that has a web front-end as one of its interfaces.
The front-end (Java + Tomcat + Apache) communicates to the back-end (proprietary system written in a COBOL-like language) through SOAP web services.
As a result, we pass large XML files back and forth.
We believe that this architecture has a significant impact on performance due to the large overhead of XML transportation and parsing. Unfortunately, we are stuck with this architecture.
How can we make this XML set-up more efficient?
Any tips or techniques are greatly appreciated.
Profiling!
Do some proper profiling of your system under load - there isn't really enough information to go on here.
You need to work out where the time is going and what the bottlenecks are (network bandwidth, CPU, memory, etc.). Only then will you know what to do about it - many optimisations are really just trade-offs (for example, caching sacrifices memory to improve performance elsewhere).
The only thing that I can think of off-hand is making sure that you are using HTTP compression with web services - XML can usually be compacted down to a fraction of its normal size, but again this will only help if you have CPU cycles to spare.
You can compress the transfer if both ends can support that, and you can try different parsers, but since you say SOAP there aren't many choices. SOAP is bloated anyway.
I'm going to go out on a limb here and suggest GZIP compression if you think the problem is bandwidth (you mentioned XML transportation). Yes, this would increase your CPU time, but it might speed up the transport.
Here's the first Google hit on GZIP Compression as a starting point. It describes how it works on Apache.
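As a hedged illustration of the client side (the endpoint URL is a placeholder), requesting and decoding a gzip-compressed response with plain java.net looks roughly like this:

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.zip.GZIPInputStream;

    public class GzipSoapClient {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://backend.example.com/soap");   // placeholder endpoint
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestProperty("Accept-Encoding", "gzip");     // ask the server to compress

            InputStream in = conn.getInputStream();
            if ("gzip".equalsIgnoreCase(conn.getContentEncoding())) {
                in = new GZIPInputStream(in);                       // transparently decompress
            }
            // Hand the (possibly decompressed) stream to the XML parser as usual.
            System.out.println("Response bytes immediately available: " + in.available());
            in.close();
        }
    }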
First make sure that your parsing methods are efficient for large documents. StAX is a good choice here.
Additionally, you can take a look at binary XML approaches. These provide more efficient transport but also attempt to aid in parsing.
Try StAX. It performs well and has a nice, concise syntax.
Check whether your application reads whole XML documents in as DOM trees. Those can get VERY big, and frequently you can make do with a simple SAX event inspection or a SAX-based XSLT program (which can be compiled for fast processing).
This is very visible in a profiler such as VisualVM in the Sun Java 6 JDK.

What XSLT processor should I use for Java transformation?

What XSLT processor should I use for Java transformation? There are Saxon, Xalan and TrAX. What are the criteria for my choice? I need simple transformations without complicated logic, and a fast, easy-to-implement solution. The app runs under Tomcat, Java 1.5.
I have some JAXP libraries and could not work out which version of JAXP is being used.
Thanks in advance.
Best regards.
The JDK comes bundled with an internal version of Xalan, and you get an instance of it by using the standard API (e.g. TransformerFactory.newInstance()).
Unless Xalan doesn't work for you (which is highly unlikely), there's no need to look elsewhere.
By the way, TrAX is the old name for the javax.xml.transform API, from the days when it was an optional extension to the JDK.
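If you want to confirm which implementation your runtime actually picks up (which also answers the "which JAXP is used" question above), a tiny sketch:

    import javax.xml.transform.TransformerFactory;

    public class WhichProcessor {
        public static void main(String[] args) {
            // Prints the concrete TransformerFactory class, e.g. the JDK-internal Xalan
            // (com.sun.org.apache.xalan...), or Saxon/Xalan if one of those is on the classpath.
            TransformerFactory factory = TransformerFactory.newInstance();
            System.out.println(factory.getClass().getName());
        }
    }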
In general, it's a hard question to answer because it depends heavily on what you mean by "simple transformation" and "fast", as well as on the amount of XML you want to process. There are probably other considerations too. To illustrate, "fast" could mean "fast to write" or "fast to execute"; if you process files the size of available memory, you might make a different choice (maybe STX, as described in another SO question) than if you parse small files.
