XML Parsing: JDOM or RegEx? Which is faster? - java

A colleague of mine needs to develop an Eclipse plugin that has to parse multiple XML files to check for programming rules imposed by a client (for example, no xsl:for-each, or no namespaces declared but not used). There are about 1000 files to be parsed regularly, each file containing about 300-400 lines.
We were wondering which solution would be faster. I'm thinking JDOM, and he's thinking RegEx.
Can anyone help us decide which is best?
Thanks

DOM, hands down. RegEx would be madness. Use the tool that was intended for the job.

You can't parse recursive structures with RegEx. So unless you have really simple XML files, XML parsing will be much faster and the code will be somewhat sane (so you won't spend endless hours to locate bugs).
Since the files are pretty small, JDom will make your job much easier. For larger files, you will have to use a SAX or similar parser (so you don't have to keep the whole file in RAM).

If you try to parse XML using regular expressions, you are entering a world of pain. If speed is important, using an event-based API might be a tad faster than DOM/JDOM.

If all checks are of the simple "no such element" or "no unused namespace" kind, a StAX parser would be best: you just stream the documents through it, receive the start-element events, and do your checking on each one. For this, the parser needs relatively little memory.
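To make the streaming idea concrete, here is a minimal sketch using the StAX API that ships with the JDK (`javax.xml.stream`). It only looks for the xsl:for-each rule from the question; the class and method names are made up for illustration, and the line numbers are whatever the parser reports for each event, so treat them as approximate.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of any library.
final class ForbiddenElementCheck {
    static final String XSLT_NS = "http://www.w3.org/1999/XSL/Transform";

    /** Returns the parser-reported line number of every xsl:for-each start tag. */
    static List<Integer> findForEach(String xml) {
        List<Integer> lines = new ArrayList<>();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // The rule files need no DTD processing, so switch it off.
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    // Only start-element events matter for this rule.
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && XSLT_NS.equals(reader.getNamespaceURI())
                            && "for-each".equals(reader.getLocalName())) {
                        lines.add(reader.getLocation().getLineNumber());
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Not well-formed XML", e);
        }
        return lines;
    }
}
```

Matching on the namespace URI rather than the `xsl:` prefix means the check still works if a stylesheet binds the XSLT namespace to a different prefix, which a regular expression on the raw text would likely miss.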
If you need to do referential checking, DOM may be better, as you can easily walk the tree (perhaps via XPath).
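As a rough illustration of a check that needs the whole tree, the sketch below walks a namespace-aware W3C DOM and reports prefixes that are declared but never used on an element or attribute name (the second rule from the question). The class name is hypothetical, and the check is deliberately naive: it ignores scoping (a prefix declared twice counts once) and it does not look inside attribute values, so a prefix used only within an XPath expression would be reported as unused.

```java
import java.io.StringReader;
import java.util.Set;
import java.util.TreeSet;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
final class UnusedNamespaceCheck {
    /** Returns prefixes declared via xmlns:prefix that no element or attribute name uses. */
    static Set<String> unusedPrefixes(String xml) {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for getPrefix()/getLocalName()
            doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        Set<String> declared = new TreeSet<>();
        Set<String> used = new TreeSet<>();
        collect(doc.getDocumentElement(), declared, used);
        declared.removeAll(used);
        return declared;
    }

    // Recursively records declared and used prefixes for one element and its children.
    private static void collect(Element element, Set<String> declared, Set<String> used) {
        if (element.getPrefix() != null) {
            used.add(element.getPrefix());
        }
        NamedNodeMap attrs = element.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            Attr attr = (Attr) attrs.item(i);
            if ("xmlns".equals(attr.getPrefix())) {
                declared.add(attr.getLocalName()); // xmlns:foo declares prefix "foo"
            } else if (attr.getPrefix() != null) {
                used.add(attr.getPrefix());
            }
        }
        for (Node child = element.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child instanceof Element) {
                collect((Element) child, declared, used);
            }
        }
    }
}
```

The same walk would look much the same in JDOM; the W3C DOM is used here only because it needs no extra dependency.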

Related

Is it possible to use ANTLR4 Listeners without generating a parse tree?

I have developed a successful translator that uses ANTLR4 grammar and parse tree listeners. I'm very pleased with the rapid time to success using ANTLR for this project, and just started comparing binary outputs of my ANTLR4 based solution against a legacy C++ body of code that is slower. (My solution is targeted at speed so even though it's implemented in Java it could be faster).
However, when I started testing it with larger 110 MB ASCII input files, I found that I run out of heap. This occurs during the ANTLRInputStream instantiation,
which I believe I can fix with UnbufferedChar/Token streams. This stackoverflow question also suggests that the parse tree generation should be turned off, as the parse tree consumes a significant amount of memory.
If I turn off parse tree generation, my parse tree listeners won't be called. At least that's how I understand it. I suspect I won't be able to manage translating 1 GB files with parse tree generation on. What is the solution?
I'd like to avoid moving my ParseTreeListener code to the grammar files if I can.

What other alternatives exist for XML-to-XML transformation other than XSLT

I have huge XML files (3000+ unique nodes) that need to be translated from one format to another. My main concern is speed and memory usage. Are there any alternatives to XSLT for this, other than programmatically parsing the input XML using StAX and creating the target XML using StAX?
I know there is an STX project, but I don't think it is being maintained.
If you are so concerned about speed and memory usage you might want to write your own SAX transformer. Whether that's easy enough depends on the complexity of the transformation.
That said - 3000 nodes is not much and I've used Apache Cocoon to transform much bigger documents. And STX worked well, too. Not maintained does not necessarily mean it's not working.
Better try the existing solutions and then improve as needed.
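For what it's worth, a hand-written SAX transformer of the kind mentioned above can be quite small for simple cases. The sketch below is one possible shape, using only JDK classes: an `XMLFilterImpl` that rewrites events on the fly, plugged into an identity `Transformer` that serializes the result. The class name and the "rename one element" rule are invented for illustration; a real format conversion would carry more state.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter: renames every element whose qualified name is `from` to `to`.
final class RenameFilter extends XMLFilterImpl {
    private final String from;
    private final String to;

    RenameFilter(XMLReader parent, String from, String to) {
        super(parent);
        this.from = from;
        this.to = to;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        if (from.equals(qName)) {
            super.startElement(uri, to, to, atts);
        } else {
            super.startElement(uri, local, qName, atts);
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        if (from.equals(qName)) {
            super.endElement(uri, to, to);
        } else {
            super.endElement(uri, local, qName);
        }
    }

    /** Streams the input through the filter and returns the serialized output. */
    static String transform(String xml, String from, String to) {
        try {
            SAXParserFactory parserFactory = SAXParserFactory.newInstance();
            parserFactory.setNamespaceAware(true);
            XMLReader parser = parserFactory.newSAXParser().getXMLReader();
            RenameFilter filter = new RenameFilter(parser, from, to);
            // newTransformer() with no stylesheet gives an identity transform (a serializer).
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new SAXSource(filter, new InputSource(new StringReader(xml))),
                    new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }
}
```

Because nothing is buffered beyond the current event, memory use should stay roughly flat regardless of document size; the price is that any rule needing look-ahead has to keep its own state.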
Smooks can help you. Handy and fast. http://www.smooks.org/
I've found JDom helpful for simple programmatic manipulation of XML structures in Java.

libxml2 from java

This question is somewhat related to
Fastest XML parser for small, simple documents in Java
but with a few more specifics.
I'm working on an application which needs to parse many (tens of millions) small (approx. 300k) XML documents. The current implementation uses xerces-j and takes about 2.5 ms per document on a 1.5 GHz machine. I'd like to improve this performance. I came across this article
http://www.xml.com/pub/a/2007/05/16/xml-parser-benchmarks-part-2.html
claiming that libxml2 can parse about an order of magnitude faster than any java parsers. I'm not sure if I believe it, but it caught my attention. Has anyone tried using libxml2 from the jvm? If so, is it faster than java dom parsing (xerces)? I'm thinking I'd still need my java dom structure, but I'm guessing that copying from a c-structured dom into java-dom shouldn't take long. I must have java-dom - sax will not help me in this case.
update: I just wrote a test for libxml2 and it wasn't any faster than xerces... granted my c coding ability is extremely rusty.
update: I broadened the question a bit here:
why is sax parsing faster than dom parsing ? and how does stax work?
and am open to the possibility of ditching dom.
Thanks
In Java, StAX (JSR-173) is generally considered to be the fastest approach to parsing XML. There are multiple implementations of StAX; the Woodstox implementation is generally regarded as being fast.
To improve performance I would avoid DOM. What are you doing with the XML? If you are ultimately dealing with it as objects, then you should consider an OXM solution. The standard is JAXB (JSR-222). JAXB implementations such as MOXy (I'm the tech lead) will even allow you to do a partial mapping, which will improve performance:
http://bdoughan.blogspot.com/2010/09/xpath-based-mapping-geocode-example.html
First of all, your question does not contain a question. What do you want to know?
I suppose you were using JNI to convert the C DOM into a Java DOM. I don't know if there are official numbers, but in my experience C plus JNI is often slower than doing it directly in Java.
If you really want to speed up your processing, try to get rid of the dom (why do you need it? Maybe we can think of a solution together). If all xml files have the same schema, use your own specialized data model (and a SAX parser).
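A minimal sketch of the "own data model plus SAX" idea, assuming a made-up document shape of `<points><point x="1" y="2"/>...</points>`: the handler builds plain Java objects directly and never materializes a DOM. The `Point` class and element names are purely illustrative.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical reader for an invented <points> format.
final class PointReader {
    /** The specialized data model: just the fields the application needs. */
    static final class Point {
        final int x;
        final int y;

        Point(int x, int y) {
            this.x = x;
            this.y = y;
        }
    }

    static List<Point> read(String xml) {
        final List<Point> points = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                // The default factory is not namespace-aware, so qName is the plain tag name.
                if ("point".equals(qName)) {
                    points.add(new Point(Integer.parseInt(atts.getValue("x")),
                            Integer.parseInt(atts.getValue("y"))));
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return points;
    }
}
```

When parsing millions of documents it is probably worth creating the `SAXParserFactory` once and reusing it, since factory lookup is not free; the sketch creates it per call only to stay short.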
If you only use a subset of xml (i.e. without namespaces, only few attributes), consider writing your own parser that directly produces more efficient java objects (but I would not recommend that).

parsing and translating from text to xml

I need to translate programs written in a domain specific language into an XML representation. These programs are simple text files. What approach would you suggest? What API should I use to:
1. Parse the text files written in this language.
2. Write XML based on the tokens and token streams I obtain.
My main criterion is rapid and easy development rather than memory or computing-time efficiency.
Many Thanks
Ketan
The less trivial part of the job is with step #1, parsing the Domain Specific Language (DSL) text, rather than #2, pushing this to some XML language.
Hopefully you readily have a parser for the DSL (obviously this language must have been put to use somewhere...), and you may be able to "hook" your export/conversion logic into this parser. If such is not possible, you'll need to write a new parser.
Depending on the complexity of the DSL, you may be able to write, longhand, a simple parser based on a few loops and switch cases.
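To give an idea of what "a few loops and switch cases" can look like, here is a sketch for an invented line-oriented DSL with just two statements, `set NAME VALUE` and `print NAME`, plus `#` comments. Nothing here reflects the asker's actual language; the XML side uses the JDK's `XMLStreamWriter`, which takes care of escaping.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical translator for an invented two-statement DSL.
final class TinyDslToXml {
    static String translate(String program) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            xml.writeStartElement("program");
            for (String rawLine : program.split("\\R")) {
                String line = rawLine.trim();
                if (line.isEmpty() || line.startsWith("#")) {
                    continue; // skip blank lines and comments
                }
                String[] tokens = line.split("\\s+");
                switch (tokens[0]) {
                    case "set":
                        if (tokens.length != 3) {
                            throw new IllegalArgumentException("Bad set statement: " + line);
                        }
                        xml.writeEmptyElement("set");
                        xml.writeAttribute("name", tokens[1]);
                        xml.writeAttribute("value", tokens[2]);
                        break;
                    case "print":
                        if (tokens.length != 2) {
                            throw new IllegalArgumentException("Bad print statement: " + line);
                        }
                        xml.writeEmptyElement("print");
                        xml.writeAttribute("name", tokens[1]);
                        break;
                    default:
                        throw new IllegalArgumentException("Unknown statement: " + line);
                }
            }
            xml.writeEndElement();
            xml.writeEndDocument();
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not write XML", e);
        }
        return out.toString();
    }
}
```

This style stops scaling once the language has nesting or expressions with precedence, which is the point at which a generated parser starts to pay for itself.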
For more complicated languages, ANTLR is often a good choice. In a nutshell, one formalizes the grammar of the DSL in Backus-Naur Form (BNF, or actually EBNF here, i.e. the Extended variant) and ANTLR produces a parser, written in a target language of choice (including Java). The learning curve with ANTLR is a factor to consider, but for a moderately to extremely sophisticated language it is a well worthwhile investment. ANTLR is similar to GNU Bison but, in my opinion, a better tool; Bison would do the trick as well, and can also target Java if so desired.
If you are familiar with other languages, in particular Python, there are many other tools that can be put to use for more or less ad-hoc parsers; I've also used PyParsing and gladly recommend it.
XStream is the best XML serializer/deserializer for Java EVAR. If you can turn your DSL into Java classes, this is a great library to use.

XOM v/s javax.xml.parsers

I want to read a simple XML file. I found
Simple way to do Xml in Java
There are also several parsers available; I just wanted to find out what the advantages are of using the XOM parser over Sun's parser.
Any suggestions?
XOM is extremely quick compared to the standard W3C DOM. If that's your priority, there's none better.
However, it's still a DOM-type API, and so it's not memory efficient. It's not a replacement for SAX or StAX.
You might want to check this question about the best XML library and its top (XOM) answer; lots of details about advantages of XOM. (Leave a comment if something is unclear; Peter Štibraný seems to know XOM inside and out.)
As mentioned, XOM is very quick and simple in most tasks compared to standard javax.xml. For examples, see this post in a question about the simplest way to read in an XML file in Java. I collected some nice examples that make XOM look pretty good (and javax.xml rather clumsy) there. :-)
So personally I've come to like XOM after evaluating it (as you can see in the linked posts); for any new Java project I'd most likely choose XOM for XML handling. The only shortcoming I've found is that it doesn't directly support streaming XML (unlike dom4j, where I'm coming from), but with a simple workaround it can stream just fine.
How do you need to access your data?
If it is one-pass, then you don't need to build the tree in memory. You can use SAX (fast, simple) or StAX (faster, not quite so simple).
If you need to keep the tree in memory to navigate, XOM or JDOM are good choices. DOM is the Choice Of Last Resort, whether it is level 1, 2, or 3, with or without extensions.
Xerces, which is the parser included with Java (although you should get the updated version from Apache and not use the one bundled with Java, even in 6.0), also has a streaming native interface called XNI.
If you want to hook other pre-made parts up in the chain, often SAX or StAX work well, since they might build their own model in memory. For example, the Saxon XSLT/XQuery engine works with DOM, SAX or StAX, but builds internally a TinyTree (default) or DOM (optional). DataDirect XQuery works with SAX, StAX or DOM also, but really likes StAX.
