Java Sax to parse complex large XML file

I am using SAX to parse some large XML files and I want to ask the following: The XML files have a complex structure. Something like the following:
<library>
<books>
<book>
<title></title>
<img>
<name></name>
<url></url>
</img>
...
...
</book>
...
...
</books>
<categories>
<category id="abcd">
<locations>
<location>...</location>
</locations>
<url>...</url>
</category>
...
...
</categories>
<name>...</name>
<url>...</url>
</library>
The files are over 50 MB each, and many tags are repeated in different contexts, e.g. url appears under /library/books/book/img but also under /library and under /library/categories/category, and so on.
My SAX parser uses a subclass of DefaultHandler in which I override the startElement and endElement methods (among others). The problem is that these methods are huge in terms of lines of code because of the business logic these XML files require. I am using a lot of
if ("url".equalsIgnoreCase(qName)) {
// peek at stack and if book is on top
// ...
// else if category is on top
// ...
} else if (....) {
}
I was wondering whether there is a more proper / correct / elegant way to perform the xml parsing.
Thank you all

What you can do is implement a separate ContentHandler for each context. For example, write one for <books>, one for <categories> and one top-level one.
Then, as soon as the top-level startElement method is called for books, you immediately switch the ContentHandler using XMLReader.setContentHandler(). The <books>-specific ContentHandler then switches back to the top-level handler when its endElement method is called for books.
This way each ContentHandler can focus on its particular part of the XML and need not know about all the other parts.
The only ugly-ish part is that the specific handlers need to know of the top-level handler and when to switch back to it, which can be worked around by providing a simple "handler stack" that handles that for you.
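A minimal sketch of the handler switch described above, assuming a non-namespace-aware parser (so qName is the plain tag name). The class names HandlerSwitchDemo, LibraryHandler and BooksHandler are hypothetical, and the books handler only collects titles to keep the example short:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

final class HandlerSwitchDemo {

    /** Top-level handler (hypothetical name): it only decides when to delegate. */
    static final class LibraryHandler extends DefaultHandler {
        private final XMLReader reader;
        final List<String> titles = new ArrayList<>();

        LibraryHandler(XMLReader reader) {
            this.reader = reader;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("books".equals(qName)) {
                // Everything up to </books> now goes to the books-specific handler.
                reader.setContentHandler(new BooksHandler(reader, this, titles));
            }
        }
    }

    /** Handles only the content of <books>; here it just collects book titles. */
    static final class BooksHandler extends DefaultHandler {
        private final XMLReader reader;
        private final DefaultHandler parent;
        private final List<String> titles;
        private StringBuilder text; // non-null only while inside <title>

        BooksHandler(XMLReader reader, DefaultHandler parent, List<String> titles) {
            this.reader = reader;
            this.parent = parent;
            this.titles = titles;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("title".equals(qName)) {
                text = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (text != null) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("title".equals(qName)) {
                titles.add(text.toString());
                text = null;
            } else if ("books".equals(qName)) {
                // Done with <books>: give control back to the top-level handler.
                reader.setContentHandler(parent);
            }
        }
    }

    static List<String> parseTitles(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        LibraryHandler top = new LibraryHandler(reader);
        reader.setContentHandler(top);
        reader.parse(new InputSource(new StringReader(xml)));
        return top.titles;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parseTitles(
            "<library><books><book><title>A</title></book></books></library>"));
    }
}
```

A title element outside <books> never reaches BooksHandler, so the books logic needs no context checks of its own.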

Not sure whether you're asking 1) is there something else you can do besides checking the tag against a bunch of strings or 2) if there's an alternative to a long if-then-else kind of statement.
The answer to 1 is: not that I've found. Someone else may tackle that one.
The answer to 2 depends on your domain. One way I see is that if the point of this is to hydrate a bunch of objects from an XML file, then you can use a factory method.
So the first factory method has the long if then else statement that simply passes off the XML to the appropriate classes. Then each of your classes has a method like constructYourselfFromXmlString. This will improve your design because only the objects themselves know about the private data that is in an XML to hydrate them.
The reason this is hard is that, if you think about it, exporting an object to XML and importing it back in really violates encapsulation. Nothing to be done about it; it just is. This at least makes things a little more encapsulated.
HTH

While I agree with the sentiment that exporting an object to XML is a violation of encapsulation, the actual technique for handling tags that are nested at different depths isn't terribly difficult using SAX.
Basically, keep a StringBuffer which will maintain your "location" in the document, as a directory-like representation of the nested tag you are currently within. For example, if at the moment the string buffer's contents are /library/book/img/url then you know it's a URL for an image in a book, and not a URL for some category.
Once you ensure that your "path tracking" algorithms are correct you can then wrap your object creation routines with better handling by using string matches. Instead of
if ("url".equalsIgnoreCase(qName)) {
...
}
you can now substitute
if (location.equalsIgnoreCase("/library/book/img/url")) {
...
}
If for some reason this doesn't appeal to you, there are still other solutions. For example, you can make a SAX handler which implements a stack of handlers, where the top handler is responsible for handling just its portion of the XML document and pops itself off the stack once it is done. Using such a scheme, each object gets created by its own individual handler, and some handlers basically check and direct which "object creation" handlers get pushed onto the handling stack at the appropriate times.
I've used both techniques. There are strengths in both, and which one is best really depends on the input and the needed objects.
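The path-tracking technique can be sketched roughly as follows. This is only an illustration: the names PathTrackingDemo and PathHandler are made up, a StringBuilder stands in for the StringBuffer, and the paths follow the structure shown in the question (/library/books/book/img/url), which is an assumption about the real files.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class PathTrackingDemo {

    /** Hypothetical handler that dispatches on the full element path. */
    static final class PathHandler extends DefaultHandler {
        private final StringBuilder location = new StringBuilder(); // e.g. /library/books/book
        private final StringBuilder text = new StringBuilder();
        final List<String> imageUrls = new ArrayList<>();
        final List<String> categoryUrls = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            location.append('/').append(qName); // descend one level
            text.setLength(0);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            String path = location.toString();
            if ("/library/books/book/img/url".equals(path)) {
                imageUrls.add(text.toString().trim());
            } else if ("/library/categories/category/url".equals(path)) {
                categoryUrls.add(text.toString().trim());
            }
            // Climb back up: drop "/" + qName from the end of the path.
            location.setLength(location.length() - qName.length() - 1);
            text.setLength(0);
        }
    }

    static PathHandler parse(String xml) throws Exception {
        PathHandler handler = new PathHandler();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        PathHandler h = parse("<library><books><book><img><url>img1</url></img></book></books>"
            + "<url>lib</url></library>");
        System.out.println(h.imageUrls);
    }
}
```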

You could refactor your SAX content handling so that you register a set of rules, each of which has a test that it applies to see if it matches the element, and an action that is executed if it does. This moves closer to the XSLT processing model, while still doing streamed processing. Or you could move to XSLT - processing 50 MB input files is well within the capabilities of a modern XSLT processor.
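One possible shape for such a rule registry, as a sketch only. The names RuleHandlerDemo, Rule, RuleHandler and on() are hypothetical, the test is a predicate over the current element path, the action receives the element's text, and "first matching rule wins" is a simplifying assumption rather than XSLT's actual conflict resolution:

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class RuleHandlerDemo {

    /** A rule pairs a test on the element path with an action on the element text. */
    static final class Rule {
        final Predicate<String> test;
        final Consumer<String> action;

        Rule(Predicate<String> test, Consumer<String> action) {
            this.test = test;
            this.action = action;
        }
    }

    static final class RuleHandler extends DefaultHandler {
        private final List<Rule> rules = new ArrayList<>();
        private final Deque<String> path = new ArrayDeque<>();
        private final StringBuilder text = new StringBuilder();

        /** Registers a rule; rules are tried in registration order. */
        RuleHandler on(Predicate<String> test, Consumer<String> action) {
            rules.add(new Rule(test, action));
            return this;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            path.addLast(qName);
            text.setLength(0);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            String current = "/" + String.join("/", path);
            for (Rule rule : rules) {
                if (rule.test.test(current)) {
                    rule.action.accept(text.toString().trim());
                    break; // first matching rule wins (an assumption of this sketch)
                }
            }
            path.removeLast();
            text.setLength(0);
        }
    }

    static List<String> collect(String xml) throws Exception {
        List<String> out = new ArrayList<>();
        RuleHandler handler = new RuleHandler()
            .on(p -> p.endsWith("/img/url"), t -> out.add("image:" + t))
            .on(p -> p.equals("/library/url"), t -> out.add("library:" + t));
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return out;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(collect("<library><url>lib</url></library>"));
    }
}
```

The business logic then lives in small, separately testable actions, and startElement/endElement stay the same size no matter how many rules are registered.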

try SAX-JAVA Binding Made Easier

Related

Is it possible to cache XML documents in Saxon to avoid re-parsing and re-indexing?

I am currently assessing whether XSLT3 with Saxon could be useful for our purposes. Please hear me out.
We are developing a REST API which provides credentials given an input request XML. Basically, there are 3 files in play:
site.xml:
This file holds the data representing the complete organisation: users, roles, credentials, settings, ...
It could easily contain 10.000 lines.
It could be considered as static/immutable.
You could compare it to an XML representation of a database, so to say.
request.xml:
This file holds the request as provided to the REST API.
It is rather small, usually around 10 to 50 lines.
It is different for each request.
request.xslt:
This file holds the stylesheet to convert the given request.xml to an output XML.
It loads site.xml via the XSLT document() function, as it needs that data to fulfill the request.
The problem here is that loading site.xml in request.xslt takes a long time. In addition, for each request, indexes as introduced by the XSLT <xsl:key .../> directive must be rebuilt. This adds up.
So it would make sense to somehow cache site.xml, to avoid having to parse and index that file for every request.
It's important to note that multiple API requests can happen concurrently, thus it should be safe to share this cached site.xml between several ongoing XSLT transformations.
Is this possible with Saxon (Java)? How would that work?
Update 1
After some additional reflecting, I realize that maybe I should not attempt to just cache the site.xml XML file, but the request.xslt instead? This assumes that site.xml, which is loaded in request.xslt via document(), is part of that cache.

It would help if you show/tell us which API you use to run XSLT with Saxon.
As for caching the XSLT, with JAXP I think you can do that with a Templates object created with newTemplates from the TransformerFactoryImpl (http://saxonica.com/html/documentation/using-xsl/embedding/jaxp-transformation.html); each time you want to run the XSLT you will need to create a Transformer with newTransformer().
With the s9api API you can compile once to get an XsltExecutable (http://saxonica.com/html/documentation/javadoc/net/sf/saxon/s9api/XsltExecutable.html) that "is immutable, and therefore thread-safe"; you then have to use load() or load30() to create an XsltTransformer or Xslt30Transformer each time you need to run the code.
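A minimal sketch of the JAXP variant, using only the standard javax.xml.transform API so it runs with whichever TransformerFactory is on the classpath (Saxon or the JDK's built-in one). The class name CachedStylesheetDemo and the tiny stylesheet are made up for illustration:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class CachedStylesheetDemo {

    // Compiled once; a Templates object may be shared between threads.
    private final Templates templates;

    CachedStylesheetDemo(String xslt) throws TransformerException {
        this.templates = TransformerFactory.newInstance()
            .newTemplates(new StreamSource(new StringReader(xslt)));
    }

    /** Runs one transformation; a Transformer is cheap to create and not shared. */
    String transform(String requestXml) throws TransformerException {
        Transformer transformer = templates.newTransformer();
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(requestXml)),
                              new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xslt =
            "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
            + "<xsl:output method='text'/>"
            + "<xsl:template match='/request'>Hello <xsl:value-of select='@user'/></xsl:template>"
            + "</xsl:stylesheet>";
        CachedStylesheetDemo demo = new CachedStylesheetDemo(xslt);
        System.out.println(demo.transform("<request user='alice'/>"));
    }
}
```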
As for sharing a document, see http://saxonica.com/html/documentation/sourcedocs/preloading.html:
An option is available (Feature.PRE_EVALUATE_DOC_FUNCTION) to indicate
that calls to the doc() or document() functions with constant string
arguments should be evaluated when a query or stylesheet is compiled,
rather than at run-time. This option is intended for use when a
reference or lookup document is used by all queries and
transformations
The section on that configuration option, however, states:
In XSLT 3.0 a better way of having external documents pre-loaded at
stylesheet compile time is to use the new facility of static global
variables.
So in that case you could declare
<xsl:variable name="site-doc" static="yes" select="doc('site.xml')"/>
You will need to wait on Michael Kay's response as to whether that suffices to share the document.

Well, it is certainly possible, but the best way of doing it depends a little on the circumstances, e.g. what happens when site.xml changes.
I would be inclined to create a single s9api Processor at application startup, and immediately (that is, during application initialization) load site.xml into an XdmNode using a DocumentBuilder obtained from Processor.newDocumentBuilder() and its build() method; this can then be passed as a parameter value (an <xsl:param>) into each transformation that uses it. Or if you prefer to access it using document(), you could register a URIResolver that responds to the document() call by returning the relevant XdmNode.
As for indexing and the key() function, so long as the xsl:key definition is "shareable", then if two transformations based on the same compiled stylesheet (s9api XsltExecutable) access the same document, the index will not be rebuilt. An xsl:key definition is shareable if its match and use attributes do not depend on anything that can vary from one transformation to another, such as the content of global variables or parameters.
Saxon's native tree implementations (unlike the DOM) are thread-safe: if you build a document once, you can access it in multiple threads. The building of indexes to support the key() function is synchronized so concurrent transformations will not interfere with each other.
Martin's suggestion of allowing compile-time evaluation of the document() call would also work. You could also put the document into a global variable defined with static="yes". This doesn't play well, however, with exporting compiled stylesheets into persistent files: there are some restrictions that apply when exporting a stylesheet that contains node-valued static variables.

How to extend doc() functionality in saxon

I am looking for an extension of doc() functionality currently available in SAXON in a way that it will read XML not from filesystem or from http network, but from memory, where I have those xmls.
The way I want to use it is like:
mydoc('id')/root/subroot/@myattr
or
doc('mydoc://id')/root/subroot/@myattr
What I have considered so far:
1. use queryEvaluator.setContextItem() - does not solve my use case, as I can have multiple XML sources in one query
2. register my own URL scheme (protocol handler) in Java - seems like overkill to me, and I have never done this
3. write my own ExtensionFunction - seems to be the right way so far, but I am confused whether I should use ExtensionFunction or rather ExtensionFunctionDefinition. I am also a little bit confused by the Doc_1 and Doc Saxonica source code, as it uses Atomizer and other unknown internal stuff.
So the questions are:
Is variant 3 the best one (in terms of simplicity), or would you recommend some other approach?
Is it OK to use ExtensionFunction and return an XdmNode built from my in-memory XMLs? It seems to me it should work, but I really do not want to step into some edge cases or a Saxon minefield.
Any comment from an experienced Saxon user will be appreciated.

The standard way of doing this is to write a URIResolver and register it with the transformer. The URIResolver is called, supplying the requested URI, and it is expected to return a Source (which can be a StreamSource, SAXSource, or DOMSource, for example). In this scenario you would typically return a StreamSource wrapping a StringReader which wraps the String containing the XML.
You could equally well use an extension function, but it's probably a little bit more complicated.
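A rough sketch of the URIResolver approach for the transformer case, using only standard JAXP. The names InMemoryResolverDemo and InMemoryResolver, the mydoc://id URI and the map-based lookup are assumptions for illustration; registering the resolver for a Saxon XQuery evaluator is not shown here.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Map;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.URIResolver;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class InMemoryResolverDemo {

    /** Serves documents from a map of URI to XML text (hypothetical helper). */
    static final class InMemoryResolver implements URIResolver {
        private final Map<String, String> documents;

        InMemoryResolver(Map<String, String> documents) {
            this.documents = documents;
        }

        @Override
        public Source resolve(String href, String base) {
            String xml = documents.get(href);
            if (xml == null) {
                return null; // unknown URI: let the processor use its default resolution
            }
            return new StreamSource(new StringReader(xml));
        }
    }

    /** Reads /root/subroot/@myattr from the in-memory document mydoc://id. */
    static String readAttribute(Map<String, String> documents) throws TransformerException {
        String xslt =
            "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
            + "<xsl:output method='text'/>"
            + "<xsl:template match='/'>"
            + "<xsl:value-of select=\"document('mydoc://id')/root/subroot/@myattr\"/>"
            + "</xsl:template></xsl:stylesheet>";
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(xslt)));
        transformer.setURIResolver(new InMemoryResolver(documents));
        StringWriter out = new StringWriter();
        // The principal input is irrelevant here; the data comes from document().
        transformer.transform(new StreamSource(new StringReader("<dummy/>")),
                              new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(readAttribute(
            java.util.Collections.singletonMap("mydoc://id",
                "<root><subroot myattr='42'/></root>")));
    }
}
```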

What Java XML API to use in my app - StAX or DOM?

I did some research, looked at the table at the bottom here (1) and I am trying to find out what kind of API I should use.
Let me introduce the problem my app is going to solve:
My application listens to observer events fired from all places (e.g. events from CDI) in an observer class. That class has methods which observe these events.
I need to construct an XML file on the fly as these events are being observed. More concretely, when I observe the event "start", I need to create this XML:
<start></start>
After that when I observe some other event, like "installed" (does not matter how it is called really), I need to have this structure:
<start><installed></installed></start>
Every time I observe an event, I need to be able to write that XML representation to an external file. To sum up, it seems I cannot use SAX, because SAX only parses XML documents and I need to write or construct them. That leaves StAX or DOM. StAX is "forward only", which I do not quite understand, but the StAX API behaves like this (2): being forward-only, it forces me to start and end elements manually, and that does not fit my case. I do not know when the document generation will end; I just need to have valid XML at every point in order to write it.
However, there is this method (3) which says that when I call it, it automatically closes all elements. So e.g. when I have this:
<a>
<b></b>
<c>
<d>
</d>
and I call writeEndDocument(), does that mean that it automatically closes "c" and "a"?
(1) http://docs.oracle.com/cd/E17802_01/webservices/webservices/docs/1.6/tutorial/doc/SJSXP2.html
(2) http://docs.oracle.com/javase/tutorial/jaxp/stax/example.html#bnbgx
(3) http://docs.oracle.com/javase/6/docs/api/javax/xml/stream/XMLStreamWriter.html#writeEndDocument()

I recommend using the following XML libraries (ordered by recommendation; only use the next one if the one before doesn't suit your needs):
JAXB (work with objects rather than XML)
StAX (lower level than JAXB)
SAX (only for reading; should be rarely used now with JAXB and StAX available)
DOM (should be rarely used now with JAXB and StAX available)

Do not use lower level XML techniques (either SAX or DOM) unless you really need them. I believe that this is not the case.
Use JAXB. Create a class that represents your events. Every time you get an event, create an instance of this class and populate its fields. Every time you have to create XML, just marshal the instance(s) to any stream you want (file, socket, whatever).
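On the writeEndDocument() question itself: the XMLStreamWriter Javadoc (3) says the method "closes any start tags and writes corresponding end tags", so the still-open c and a elements do get closed. A small sketch that reproduces the a/b/c/d example from the question (the class name StaxAutoCloseDemo is made up; exact serialization details such as the XML declaration may vary between StAX implementations):

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

final class StaxAutoCloseDemo {

    /** Writes a, b, c, d as in the question and lets writeEndDocument() close the rest. */
    static String write() throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        writer.writeStartDocument();
        writer.writeStartElement("a");
        writer.writeStartElement("b");
        writer.writeEndElement();      // closes b
        writer.writeStartElement("c");
        writer.writeStartElement("d");
        writer.writeEndElement();      // closes d
        writer.writeEndDocument();     // closes the still-open c and a
        writer.close();
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(write());
    }
}
```

Note that "forward only" also means the writer cannot go back and reopen a once the end tags are written, so a later event cannot be appended inside the same document through this writer.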

Java+XSL, calling Java code from within template

I'm working with XSL templates in Java, and I'm trying to build a custom tag that will call some Java code, then put a result inside the template. I'm using XOM as my XML engine. I'm kind of new with both XOM and XSL, so I'm not even sure if this is a smart idea.
A very simple example of what I want to do is this, where my_ns is a custom namespace and 'custom_tag' is a tag that calls the Java method custom_tag:
<xsl:template name="foo">
<my_ns:custom_tag />
</xsl:template>
public Node custom_tag() {
return Node("<a/>");
}
#result of calling the template foo
<a/>
I'm open to suggestions that involve alternate ways of calling Java from an XSL template.

This is more a question of whether your XSLT processor can call Java code from within the template than of your XML engine/parser/API. The default XSLT processor for Java is based on Xalan-J from the Apache Software Foundation (Xalan-C is the C++ counterpart). I believe Xalan-J allows extension functions that execute Java code. I've done JDBC SQL queries inside an XSL stylesheet before using a Xalan-J extension function. I also recall reading that the Saxon XSLT processor allows this functionality. You'll have to search your XSLT processor's documentation for the specifics on how to implement this.
The question on whether this is a good idea or not really depends on the problem. Even though I used the SQL extension function mentioned above and it fit the bill in that case, I felt really dirty about it afterwards. The reason I say this is because you lose portability between XSLT processors when you add in the vendor-specific extension functions.
Your example shows you are simply creating a new node in the output, and if that is the case, I don't see the need to have Java do this when that is one of the main functions of XSLT: creating nodes. I suspect your real problem is more complex than simply creating a node, so I suggest you look into doing all the work in Java to get the results you are looking for, OR doing some of the work in Java and passing a parameter (a name/value pair, using the xsl:param element) to your XSL stylesheet at runtime.
Here's some quick sites to get you started:
http://xml.apache.org/xalan-j/extensions.html
http://www.saxonica.com/documentation/extensions/intro.xml
http://www.w3schools.com/xsl/
http://www.w3schools.com/xsl/el_param.asp
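The portable alternative mentioned above (do the work in Java, then pass the result in through xsl:param) might look like this with plain JAXP. The class name XslParamDemo, the parameter name greeting and the inline stylesheet are made up for illustration:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class XslParamDemo {

    /** Computes a value in Java and hands it to the stylesheet as a parameter. */
    static String render(String greeting) throws TransformerException {
        String xslt =
            "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
            + "<xsl:param name='greeting'/>"   // filled in from Java at runtime
            + "<xsl:template match='/'>"
            + "<a><xsl:value-of select='$greeting'/></a>"
            + "</xsl:template></xsl:stylesheet>";
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(xslt)));
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        transformer.setParameter("greeting", greeting);
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader("<input/>")),
                              new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        System.out.println(render("hello"));
    }
}
```

Because it relies only on xsl:param and Transformer.setParameter(), this keeps the stylesheet free of processor-specific extension functions.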

Servlet doPost() Method setup?

I am interested in creating a web app that uses JSP, Servlets and XML.
At the moment I have the following:
JSP - Form input.
Servlet - Retrieving Form data and sending that data to a java object.
Java object (1) - Converts data into XML file....instantiates java object (2).
Java object (2) - Sends that file to a database.
On the returning side the database will send back another XML file that I will then process using XSLT to display back to the user.
Can I place that XSLT code in the original servlet's doPost() method? My doPost() method would then:
Retrieve the user-entered data from the form on my JSP page.
Instantiate a Java object to convert that data to XML; in turn, that object will instantiate another object to send the XML file to a database.
Convert the resulting XML file sent back from the database and display it for the user.
Can one servlet doPost() method handle all of this? If not, how would I set up my application and classes to handle this work flow?
Thank you in advance

I wouldn't load the XSLT in POST, because then every request would have to do it.
Read the XSLT in the init method, precompile it and cache it. Just make sure that you keep it thread safe.
Once you have the XSLT, you've got to apply it to every XML response, so those steps do belong in POST.

All your doPost() method has to do is generate a suitable servlet response (some form of content, and a suitable HTTP response structure). So it can do anything you want (including the above).
However it sounds like your rendering requirement is distinct from your form submission and storage requirement. So I would make your doPost() method delegate to a suitable method for rendering the output. That way you can generate output from stored data separately from submitting data to the database.

Well, this is not really specific to servlets, but more to Java/OOP (object-oriented programming) in general. You can in fact do everything in a single method, even in a main() method. But a single method of hundreds of lines or more isn't readable, maintainable, reusable or testable in the long term. Right now, you're probably just starting with Java and you probably don't need to do anything else than this, but if you ever need to duplicate (almost) the same lines of code, then it's time to refactor. Extract the variables from the duplicated lines and wrap those lines in a new method which takes those variables as arguments and does a simple one-step task.
In general, you'd like to split the big task into separate subtasks beforehand, using separate and reusable classes and methods. In your case, you can for example have a single DAO class for all the DB interaction, a generic XML helper class to convert Javabeans to XML and vice versa with the help of XSL, and (maybe) a domain object to manage the input/output processing (conversion/validation/error handling/response) and to execute actions. Write down on paper how the big picture is to be accomplished in small single tasks. Each task can often be done well by a single method. Group the methods with the same responsibilities and/or the same shared data in the same class.
To go a step further, for several tasks there may be 3rd party tools available which ease the work. I can think of, for example, XMLBeans and/or XStream to do the Javabean <--> XML conversion. That would already save a lot of boilerplate code and likely also the XSL step.
That said, duffymo's suggestion to load the XSL only once is a very good one. You don't need to re-execute exactly the same task which isn't dependent on request parameters at all again and again on every request, that's only inefficient.
