What can we do to make XML processing faster? - java

We work on an internal corporate system that has a web front-end as one of its interfaces.
The front-end (Java + Tomcat + Apache) communicates with the back-end (a proprietary system written in a COBOL-like language) through SOAP web services.
As a result, we pass large XML files back and forth.
We believe that this architecture has a significant impact on performance due to the large overhead of XML transportation and parsing. Unfortunately, we are stuck with this architecture.
How can we make this XML set-up more efficient?
Any tips or techniques are greatly appreciated.

Profiling!
Do some proper profiling of your system under load - there isn't really enough information to go on here.
You need to work out where the time is going and what the bottlenecks are (network bandwidth, CPU, memory, etc.). Only then will you know what to do about it - many optimisations are really just trade-offs (for example, caching sacrifices memory to improve performance elsewhere).
The only thing that I can think of off-hand is making sure that you are using HTTP compression with web services - XML can usually be compacted down to a fraction of its normal size, but again this will only help if you have CPU cycles to spare.
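For example, here is a minimal client-side sketch over plain HttpURLConnection (the endpoint URL and envelope are placeholders, and a real SOAP stack such as Axis or CXF has its own configuration switches for compression):

    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.zip.GZIPInputStream;

    public class GzipSoapCall {
        public static void main(String[] args) throws Exception {
            // Hypothetical endpoint; substitute your real back-end service URL.
            URL url = new URL("http://backend.example.com/soap/endpoint");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type", "text/xml; charset=utf-8");
            // Tell the server we can handle a compressed response.
            conn.setRequestProperty("Accept-Encoding", "gzip");

            OutputStream out = conn.getOutputStream();
            out.write("<soap:Envelope>...</soap:Envelope>".getBytes("UTF-8"));
            out.close();

            InputStream in = conn.getInputStream();
            // Only decompress if the server actually honoured the header.
            if ("gzip".equalsIgnoreCase(conn.getContentEncoding())) {
                in = new GZIPInputStream(in);
            }
            // ... hand 'in' to your XML parser ...
            in.close();
        }
    }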

You can compress the transfer if both ends can support that, and you can try different parsers, but since you say SOAP there aren't many choices. SOAP is bloated anyway.

I'm going to go out on a limb here and suggest GZIP compression if you think the problem is bandwidth (you mentioned XML transportation). Yes, this would increase your CPU time, but it might speed things up in transport.
Here's the first Google hit on GZIP Compression as a starting point. It describes how it works on Apache.

First make sure that your parsing methods are efficient for large documents. StAX is a good choice for parsing large documents.
Additionally, you can take a look at binary XML approaches. These provide more efficient transport and also aim to make parsing cheaper.

Try StAX. It performs well and has a nice, concise syntax.
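For illustration, a minimal StAX sketch (the file name and the "record"/"id" names are made up) that pulls events from a large document without ever building a tree:

    import java.io.FileInputStream;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;

    public class StaxExample {
        public static void main(String[] args) throws Exception {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new FileInputStream("large.xml"));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && "record".equals(reader.getLocalName())) {
                        // Process one record at a time instead of loading the whole document.
                        String id = reader.getAttributeValue(null, "id");
                        System.out.println("record id=" + id);
                    }
                }
            } finally {
                reader.close();
            }
        }
    }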

Check whether your application reads whole XML documents in as a DOM tree. Those trees can get VERY big, and frequently you can get by with a simple SAX event inspection or a SAX-based XSLT program (which can be compiled for fast processing).
This is very visible in a profiler such as VisualVM, which ships with the Sun Java 6 JDK.
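As an illustration of the event-inspection approach, a minimal SAX sketch (the file and element names are hypothetical):

    import java.io.File;
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    public class SaxInspection {
        public static void main(String[] args) throws Exception {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new File("large.xml"), new DefaultHandler() {
                private int count;

                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes attrs) {
                    // Count elements of interest instead of materialising a tree.
                    if ("record".equals(qName)) {
                        count++;
                    }
                }

                @Override
                public void endDocument() {
                    System.out.println("records seen: " + count);
                }
            });
        }
    }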

XSLT Performance Considerations

I am working on a project which uses the following technologies:
Java, XML, XSL
There's heavy use of XML. Quite often I need to
- convert one XML document into another
- convert one XML document into another after applying some business logic.
Everything will be built into an EAR and deployed on an application server. As the number of users is huge, I need to take performance into consideration before defining coding standards.
I am not a very big fan of XSL, but I am trying to understand whether XSL is the better option in this scenario or whether I should stick to Java only. Note that I only need to convert XML into XML; I don't have requirements to convert XML into some other format like HTML.
From a performance and maintainability point of view, isn't Java a better option than XSLT for XML-to-XML transformations?
From my previous experience of this kind of application, if you have a performance bottleneck, then it won't be the XSLT processing. (The only exception might be if the processing is very complex and the programmer very inexperienced in XSLT.) There may be performance bottlenecks in XML parsing or serialisation if you are dealing with large documents, but these will apply whatever technology you use for the transformation.
Simple transformations are much simpler to code in XSLT than in Java. Complex transformations are also usually simpler to code in XSLT, unless they make heavy use of functionality available for free in the Java class library (an example might be date parsing). Of course, that's only true for people equally comfortable with coding in both languages.
Of course, it's impossible to give any more than arm-waving advice about performance until you start talking concrete numbers.
I agree with the above responses. XSLT is faster and more concise to develop than performing transformations in Java, and you can change the XSLT without having to recompile the entire application (just rebuild the EAR and redeploy). Hand-written transformations should always be faster, but the code may be much larger than the XSLT, since XPath and related technologies allow very condensed and powerful expressions. Try several XSLT engines (the one provided with Java, Saxon, Xalan...), and try to debug and profile the XSLT using tools such as the standalone Altova XMLSpy IDE to find bottlenecks. Load the XSLT transformation once and reuse it when processing several XML documents that require the same transformation. Another option is to compile the XSLT to Java classes, which allows faster processing (Saxon seems to support this), but changes are not as easy, because you need to recompile the XSLT and the generated classes.
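As a sketch of the "load it once and reuse it" point (the file names here are hypothetical; with Saxon on the classpath the same JAXP code picks it up through the TransformerFactory lookup, or you can name the factory class explicitly), the stylesheet can be compiled into a Templates object and shared:

    import java.io.File;
    import javax.xml.transform.Templates;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.stream.StreamResult;
    import javax.xml.transform.stream.StreamSource;

    public class ReusableTransform {
        // Compile the stylesheet once; Templates is thread-safe and reusable.
        private static final Templates TEMPLATES;
        static {
            try {
                TEMPLATES = TransformerFactory.newInstance()
                        .newTemplates(new StreamSource(new File("transform.xsl")));
            } catch (Exception e) {
                throw new ExceptionInInitializerError(e);
            }
        }

        public static void transform(File in, File out) throws Exception {
            // Transformer instances are cheap to create from a compiled Templates,
            // but are not thread-safe themselves, so create one per call.
            Transformer t = TEMPLATES.newTransformer();
            t.transform(new StreamSource(in), new StreamResult(out));
        }
    }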
We use XSLT and XSL-FO to generate invoices for billing software. We extract the data from the database and create an XML file, transform it with an XSLT stylesheet into XSL-FO, and process the resulting FO instructions with Apache FOP to generate a PDF. When generating invoices of several pages, the job is done in less than a second in a multi-user environment, on a per-request basis (online processing). We also do batch processing (billing cycles), and those jobs run even faster because the XSLT transformation is reused. Only for very large PDF documents (>100 pages) do we have some trouble (minutes), but the most expensive task is always the FO-to-PDF processing, not the XML-to-XML XSLT.
As is always said, if you need more processing power, you can just "add" more processors and run the jobs in parallel. The development time saved by using XSLT (if you have some experience with it) can be spent on more hardware instead. It's the usual dichotomy: use powerful development tools to save development time and buy more hardware, or do things "manually" to get maximum performance.
Integration tools like ESBs are heavily based on XSLT transformations to adapt XML data from one system (the sender) to another (the receiver), and they can usually perform hundreds of "transactions" (data processing and integration) per second.
If you use a modern XSLT processor, such as Saxon (available in a free version), you will find the performance to be quite good. Also, in the long term XSL transforms will be much more maintainable than hardcoded Java classes.
(I have no connection with the authors of Saxon)
Here are my observations, based on empirical data. I use XSLT extensively, in many cases as an alternative to data processors implemented in Java; some of the data processors we have built are fairly involved. We primarily use Saxon EE, through the oXygen XML editor. Here is what we have noticed about transformation performance.
For less complex XSL stylesheets, the performance is quite good (2 s to read a 30 MB XML file and generate over 20 HTML content pages with a lot of div structures), and performance seems to vary roughly linearly (or better) with the size of the file.
However, when the complexity of the XSL stylesheet changes, the performance change can be exponential: for the same file, introducing a function call (implementing a simple XPath resolution) into a frequently called template changed the processing time from 2 s to 24 s. The introduction of functions and function calls seems to be the major culprit.
That said, we have not done a detailed performance review or code optimization (we are still in alpha, and the performance is still within our limits, i.e. a batch job). I must admit that we may have "abused" XSL functions, as in a lot of places we used the idea of abstracting code into functions (in addition to using templates). My suspicion is that, due to the way XSLT templates are called, there might be a lot of eventual recursion in the processor's implementation, and function calls can become expensive if they are not optimized. We think a change of "strategy" in the way we write our XSL scripts (to be more XSLT/XPath-centric, for instance using XSL keys) may help the XSLT processor's performance. So yes, we may be just as guilty as the processor charged :)
One other performance issue is memory utilization. While RAM is not technically a problem, a process ramping from 1 GB (!!!) to 6 GB for a single invocation/transformation is not exactly kosher, and there may be scalability and capacity concerns (depending on the application and usage). This may have less to do with the underlying XSLT processor and more to do with the editor tool; it seems to have a huge impact on debugging the stylesheets in real time (i.e. stepping through the XSLT).
A few observations:
- command-line or "production" invocation of the processor performs better
- for consecutive runs (invoking the XSLT processor), the first run takes the longest (say 10 s) and subsequent runs take a lot less (say 4 s); again, this may have something to do with the editor environment.
That said, while the performance of the processors may be a pain at times, depending on the application requirements, it is my opinion that if you consider the other factors already mentioned here - code maintenance, ease of implementation, rapid changes, size of the code base - the performance issues can be mitigated or "accepted" (if the end application can still live with the numbers) when comparing an XSLT implementation against Java (or anything else).
...adieu!

Key factors for designing scalable web based application

Currently I am working on a web-based application. I want to know what key factors a designer should take care of while designing a scalable web-based application.
That's a fairly vague and broad question and something you could write books about. How far do you take it? At some point the performance of SQL JOINs breaks down and you have to implement some sharding/partitioning strategy. Is that the level you mean?
General principles are:
Cache and version all static content (images, CSS, Javascript);
Put such content on another domain to stop needless cookie traffic;
GZip/deflate everything;
Only execute required Javascript;
Never do with Javascript what you can do on the serverside (eg style table rows with CSS rather than using fancy jQuery odd/even tricks, which can be a real time killer);
Keep external HTTP requests to a minimum. That means very few CSS, Javascript and image files. That may mean implementing some form of CSS spriting and/or combining CSS or JS files;
Use server-side caching where necessary, but only after you find there's a problem. Memory is an expensive but often effective trade-off for more performance (a minimal sketch follows this list);
Test and tune all database queries;
Minimize redirects.
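On the server-side caching point above, a minimal illustrative sketch (hypothetical names, no eviction, no expiry; a real deployment would reach for something like Ehcache or memcached instead):

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    // Trades memory for speed on repeated requests. Deliberately simplistic:
    // unbounded, no expiry, and the load may run twice under contention.
    public class SimpleQueryCache<K, V> {
        private final ConcurrentMap<K, V> cache = new ConcurrentHashMap<K, V>();

        public V get(K key, Loader<K, V> loader) {
            V value = cache.get(key);
            if (value == null) {
                value = loader.load(key);   // e.g. run the expensive DB query
                cache.put(key, value);      // sacrifice memory for repeat-request speed
            }
            return value;
        }

        public interface Loader<K, V> {
            V load(K key);
        }
    }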
Having a good read of highscalability.com should give you some ideas. I highly recommend the Amazon articles.
Every application is different. You'll have to profile your application to see where you should concentrate your optimization efforts. Some web applications might require database access optimizations, while others have complicated business logic that cause the bottleneck.
Don't attempt to optimize random, arbitrary parts of your application without first profiling. You might end up having to support complicated optimized code that doesn't actually make your application snappier.
I get the sense from the other answers here that there is a general confusion between scalability and performance. High performance means that the response is quick. High scalability means that you get a response no matter how many others are also using the site at the same time. There's a big difference.
In fact, you actually have to sacrifice a little performance just to get good scalability. A general pattern to scalability is distributed computing. Factoring functionality out into separate tiers of clustered servers (web, business rules, database) is the usual approach to scalability. That extra round trip will slow down page load a little bit.
Everyone always wants to focus on high scalability, but don't forget that, for software vendors who sell licenses to customers who self-host the application, scaling down can be just as important as scaling up. An application that can run on a single server for ten users, but can also be configured to run on a ten-server web cluster, a three-server middle tier, and a four-server database cluster for 10,000 users, would be a system well designed for scalability.
None. Just code the application using proper design techniques (separation of concerns, etc) and then when the application is done or nearly done, do your performance testing. You'll find the real bottlenecks then - they won't be what you might have guessed in the beginning. This is where your proper design from the beginning comes into play - it makes it easy to make changes to fix the bottlenecks.
Sometimes, a specific answer is more helpful than just generic tips.
If you want to scale, the things to target are SPEED (in hardware and software) and RESOURCES (in hardware).
The latter, hardware, is expensive (more servers, load-balancers, etc.).
So, by carefully selecting your initial development framework you can save a lot of time and resources - up to several orders of magnitude.
For example, nginx is (much) faster than Apache.
Other solutions are faster than nginx (for both static and dynamic contents) but I could not disclose them without being censored on StackOverflow (it was rated SPAM & advertising despite the fact that this is a FREE solution).
That's the limits of "sharing": we must share only "acceptable" solutions rather than efficient solutions.
Cheers,
Pierre.

Is Web Service suitable for ETL purpose?

My company is considering using web services as the means of an ETL process. However, I don't think web services fit this purpose, for several reasons:
1. a web service could consume a lot of memory when generating large XML.
2. XML is a bloated format.
3. possible time-outs if the server takes a huge amount of time to generate data.
4. file size limitations? (for Windows it's 2 GB, if my memory serves me right)
I am not a web service expert, so I need your opinions. :)
Thanks.
There are plenty of technologies in the Web Services tool shed that circumvent all the problems you list. There is stream-oriented XML shredding, there are XML compression formats for delivery, there are protocols that deal with fragmentation and fairness, and there are many storage systems that can hold terabytes upon terabytes of data.
If by web service you imagine some college freshman's homework concoction of an interface that accepts a single glop argument with a 2 GB serialized table in it, then all your arguments are valid. But if you give your requirements to an experienced team with knowledge of the concepts involved in WS-ReliableMessaging and WS-Transaction, then there is no reason not to build an ETL process around Web Services. Note that I do not advocate the SOAP protocols per se, but I do advocate knowledge and understanding of the concepts involved.
Now, that being said, whether a Web Service-oriented ETL process makes sense for you depends on a whole set of other reasons. However, your rebuttal of the Web Service technologies does not hold water.
I would not use a web service for an ETL task. There are specialized tools for that task (e.g., Ab Initio, Informatica, etc.) that are better suited.
If you have a large amount of data, I'd say that the price of the extra latency that the network would introduce would be prohibitive.
It really does depend on what you are doing and how you are trying to accomplish it. In general, web services require more care and feeding than you would normally put into an ETL process, but they can be surprisingly effective at the task as well. There aren't enough specifics about your scenario here to say whether it would work.
I have worked on web services which transmit and receive 100+ MB documents, some encoded in XML and some not, and do it in seconds (on a closed local network). These services required a good deal of tuning and planning, but they did work well for our scenario, and they allowed a wide variety of clients to connect and transmit differing amounts of data through a fairly standard interface. This differed from some of the other ETL jobs we had, where the job was specific to each client and had to be set up and maintained for each client.
It all depends on what you are doing and what your constraints are.
If you are going to pursue this route sit down and draft out the process from beginning to end, including how you want clients to connect, verify that the data was received and verify that the job is finished. Consider some of the scenarios, the clients and the types of data being transmitted and then work out what would be needed. Contrast that with what is already available in other tools, and how much time you have to get it done.
I'm really wondering why your company is not considering using a real ETL tool like those mentioned by duffymo in his answer, or Talend or CloverETL if open source is an option.
They are, in general, good for ETL purposes :)
Building your own solution sounds like reinventing the wheel.
Many of them have web services oriented features (see Export a job as webservice in Talend's wiki or CloverETL Server HTTP Launch Services for example).
I'm not an ETL product expert and I didn't check them all but I'm pretty sure this is something to consider.
Look up MTOM, to start with, which allows arbitrary non-XML data to be streamed in a web service.
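For illustration, a rough JAX-WS sketch (the service, method, and file path are made up): the @MTOM annotation plus a DataHandler lets the runtime ship the payload as a binary MIME attachment and stream it, rather than base64-encoding it inside the envelope.

    import javax.activation.DataHandler;
    import javax.activation.FileDataSource;
    import javax.jws.WebMethod;
    import javax.jws.WebService;
    import javax.xml.bind.annotation.XmlMimeType;
    import javax.xml.ws.soap.MTOM;

    // Hypothetical extract service: the binary result travels as an MTOM
    // attachment instead of base64 text inside the SOAP body.
    @MTOM
    @WebService
    public class ExtractService {

        @WebMethod
        public @XmlMimeType("application/octet-stream") DataHandler fetchExtract(String jobId) {
            // Backing the DataHandler with a file lets the runtime stream it
            // out instead of buffering the whole payload in memory.
            return new DataHandler(new FileDataSource("/data/extracts/" + jobId + ".dat"));
        }
    }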
Web services are just fine for ETL tasks. Remember that each task is going to get handled in its own thread for free, and you're guaranteed proper cleanup between requests. Using web services inside something like Tomcat wouldn't be nearly as heavy as you think.
If you're concerned over the bloat of XML, consider JSON format.

Large amount of data - what is the best way to send them?

We have this scenario:
a server which contains the needed data, and a client component which wants that data.
On the server are stored 2 types of data:
- some information - just a couple of strings basically
- binary data
We have a problem with getting the binary data. Both sides are written in Java 5, so we have a couple of options...
A web service is not the best solution because of speed, memory, etc...
So, what would you prefer?
I would like to avoid a low-level socket connection if possible...
thanks in advance
Vitek
I think the only way to do LARGE amounts of data is going to be with raw socket access.
You will hit the Out of Memory issues on large files with most other methods.
Socket handling is really pretty straightforward in Java, and it will let you stream the data without loading the entire file into memory (which is what happens behind the scenes without your own buffering).
Using this strategy I managed to build a system that allowed for the transfer of arbitrarily large files (I was using a 7+ GB DVD image to test the system) without hitting memory issues.
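For illustration, a minimal sketch of the sending side only (a real implementation also needs a matching receiver, framing such as a length prefix, and error handling):

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.Socket;

    public class FileSender {
        // Streams a file over an already-connected socket in fixed-size chunks,
        // so memory use stays at one buffer regardless of file size.
        public static void send(Socket socket, String path) throws Exception {
            InputStream in = new BufferedInputStream(new FileInputStream(path));
            OutputStream out = socket.getOutputStream();
            try {
                byte[] buffer = new byte[64 * 1024];
                int read;
                while ((read = in.read(buffer)) != -1) {
                    out.write(buffer, 0, read);
                }
                out.flush();
            } finally {
                in.close();
            }
        }
    }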
Take a look at the W3C standard MTOM to transfer binary data as part of a SOAP service. It is efficient in that it sends the data as binary and can also send it in buffered chunks. It will also interoperate with other clients or providers:
How to do MTOM Interop
Server Side - Sending Attachments with SOAP
You might want to have a look at protobuf, the library that Google uses to exchange data. It's very efficient and extensible. On a side note, never underestimate the bandwidth of a station wagon full of 1 TB hard disks!
I've tried converting the binary data to Base64 and then sending it over via SOAP calls and it's worked for me. I don't know if that counts as a web service, but if it does, then you're pretty much stuck with sockets.
Some options:
You could use RMI which will hide the socket level stuff for you, and perhaps gzip the data...but if the connection fails it won't resume for you. Probably will encounter memory issues too.
Just HTTP the data with a binary MIME type (again, perhaps configuring gzip on the webserver). Similar problem on resume.
Spawn something like wget (I think this can do resume).
If the client already has the data (a previous version of it), rsync would copy only the changes.
What about the old, affordable and robust FTP? You can, for example, easily embed an FTP server in your server-side components and then code an FTP client. FTP was born exactly for this (File Transfer Protocol, isn't it?), while SOAP with attachments was not designed with that in mind and can perform very badly.
For example you could have a look at:
http://mina.apache.org/ftpserver/
But there are other implementations out there, Apache Mina is just the first one I can recall.
Good luck & regards
Is sneakernet an option? :P
RMI is well known for its ease-of-use and its memory leaks. Be warned. Depending on just how much data we're talking about, sneakernet and sockets are both good options.
Consider GridFTP as your transport layer. See also this question.

Best binary XML format for JavaME

Can anyone recommend a good binary XML format? It's for a JavaME application, so it needs to be a) Easy to implement on the server, and b) Easy to write a low-footprint parser for on a low-end JavaME client device.
And it goes without saying that it needs to be smaller than XML, and faster to parse.
The data would be something akin to SVG.
You might want to take a look at WBXML (WAP Binary XML); it is optimized for size and often used on mobile phones, but it is not optimized for parsing speed.
Hessian might be an alternative worth looking at. It is a small protocol, well-suited for Java ME applications.
"Hessian is a binary web service protocol that makes web services usable without requiring a large framework, and without learning a new set of protocols. Because it is a binary protocol, it is well-suited to sending binary data without any need to extend the protocol with attachments."
What kind of data are you planning to use? I would say that if the server is also written in Java, the easiest way to keep the footprint small is to send/receive binary data in a predefined format. Just write everything in a known order to a DataOutputStream.
But it really depends on what kind of data you are working with and whether you can define the format.
Actually, you should evaluate whether this kind of optimization is even needed; maybe your target devices are not so limited.
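If you do go the hand-rolled binary route, a minimal sketch might look like this (the record fields are made up; the client reads them back with a DataInputStream in exactly the same order):

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;

    public class RecordWriter {
        // Writes one record in a fixed, agreed-upon field order; the client
        // decodes it with a DataInputStream using the same order.
        public static byte[] encode(int id, String name, long timestamp) throws Exception {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(id);
            out.writeUTF(name);        // length-prefixed modified UTF-8
            out.writeLong(timestamp);
            out.flush();
            return bytes.toByteArray();
        }
    }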
It very much depends on the target device. If you have JSR 172 available, then you are done with the parsing; the runtime does it for you. And XML is mainly about making your own format. As was already stated, if your goal is performance, then XML is probably not the best way to go and you will end up doing something binary.
