libxml2 from java

libxml2 from java - java

This question is somewhat related to
Fastest XML parser for small, simple documents in Java
but with a few more specifics.
I'm working on an application which needs to parse many (10s of millions), small (approx. 300k) xml documents. The current implementation is using xerces-j and it takes about 2.5 ms per xml document on a 1.5 GHz machine. I'd like to improve this performance. I came across this article
http://www.xml.com/pub/a/2007/05/16/xml-parser-benchmarks-part-2.html
claiming that libxml2 can parse about an order of magnitude faster than any java parsers. I'm not sure if I believe it, but it caught my attention. Has anyone tried using libxml2 from the jvm? If so, is it faster than java dom parsing (xerces)? I'm thinking I'd still need my java dom structure, but I'm guessing that copying from a c-structured dom into java-dom shouldn't take long. I must have java-dom - sax will not help me in this case.
update: I just wrote a test for libxml2 and it wasn't any faster than xerces... granted my c coding ability is extremely rusty.
update I broadened the question a bit here:
why is sax parsing faster than dom parsing ? and how does stax work?
and am open to the possibility of ditching dom.
Thanks

In Java, StAX JSR-173 is generally considered to be the fastest approach to parsing XML. There are multiple implementations of StAX, the Woodstox implementation is generally regarded as being fast.
To improve performance I would avoid DOM. What are you doing with the XML? If you are ultimately dealing with it as objects, the you should consider an OXM solution. The standard is JAXB JSR-222. JAXB implementations such as MOXy (I'm the tech lead) will even allow you to do a partial mapping which will improve performance:
http://bdoughan.blogspot.com/2010/09/xpath-based-mapping-geocode-example.html

First of all, your question does not contain a question. What do you want to know?
I suppose you were using JNI to convert the c-dom into a java-dom. I dont know if there are official numbers, but in my experience c+JNI often is slower than directly doing it in java.
If you really want to speed up your processing, try to get rid of the dom (why do you need it? Maybe we can think of a solution together). If all xml files have the same schema, use your own specialized data model (and a SAX parser).
If you only use a subset of xml (i.e. without namespaces, only few attributes), consider writing your own parser that directly produces more efficient java objects (but I would not recommend that).

Related

What other alternatives exist for XML-to-XML transformation other than XSLT

I've huge XML files (3000+ unique nodes) that need to be translated from 1 format to another format. My main concern is about the speed and memory usage. Are there any alternatives to XSLT for this other than programatically parsing the input XML using StAX and creating the target XML using StAX?
I know there is a STX project but I doesn't think it is being maintained.

If you are so concerned about speed and memory usage you might want to write your own SAX transformer. Whether that's easy enough depends on the complexity of the transformation.
That said - 3000 nodes is not much and I've used Apache Cocoon to transform much bigger documents. And STX worked well, too. Not maintained does not necessarily mean it's not working.
Better try the existing solutions and then improve as needed.

Smooks can help you. Handy and fast. http://www.smooks.org/

I've found JDom helpful for simple programmatic manipulation of XML structures in Java.

MarkLogic to Java & Back Solution

I need to query XML out of a MarkLogic server and marshal it into Java objects. What is a good way to go about this? Specifically:
Does using MarkLogic have any impact on the XML technology stack? (i.e. is there something about MarkLogic that leads to a different approach to searching for, reading and writing XML snippets?)
Should I process the XML myself using one of the XML APIs or is there a simpler way?
Is it worth using JAXB for this?
Someone asked a good question of why I am using Java. I am using Java/Java EE because I am strongest in that language. This is a one man project and I don't want to get stuck anywhere. The project is to develop web service APIs and data processing and transformation (CSV to XML) functionality. Java/Java EE can do this well and do it elegantly.

Note: I'm the EclipseLink JAXB (MOXy) lead, and a member of the JAXB 2 (JSR-222) expert group.
Does using MarkLogic have any impact on the XML technology stack?
(i.e. is there something about MarkLogic that leads to a different
approach to searching for, reading and writing XML snippets?)
Potentially. Some object-to-XML libraries support a larger variety of documents than other ones do. MOXy leverages XPath based mappings that allows it to handle a wider variety of documents. Below are some examples:
http://blog.bdoughan.com/2010/09/xpath-based-mapping-geocode-example.html
http://blog.bdoughan.com/2011/03/map-to-element-based-on-attribute-value.html
Should I process the XML myself using one of the XML APIs or is there
a simpler way?
Using a framework is generally easier. Java SE offers may standard libraries for processing XML: JAXB (javax.xml.bind), XPath (javax.xml.xpath), DOM, SAX, StAX. Since these standards there are also other implementations (i.e. MOXy and Apache JaxMe implement JAXB).
http://blog.bdoughan.com/2011/05/specifying-eclipselink-moxy-as-your.html
Is it worth using JAXB for this?
Yes.

There are a number of XML-> Java object marshall-ing libraries. I think you might want to look for an answer to this question by searching for generic Java XML marshalling/unmarshalling questions like this one:
Java Binding Vs Manually Defining Classes
Your use case is still not perfectly clear although the title edit helps - If you're looking for Java connectivity, you might also want to look at http://developer.marklogic.com/code/mljam which allows you to execute Java code from within MarkLogic XQuery.

XQSync uses XStream for this. As I understand it, JAXB is more powerful - but also more complex.

Having used JAXB to unmarshal xml served from XQuery for 5 years now, I have to say that I have found it to be exceptionally useful and time-saving. As for complexity, it is easy to learn and use for probably 90% of what you would be using it for. I've used it for both simple and complex schemas and found it to be very performant and time-saving. Executing Java code from within MarkLogic is usually a non-starter, because it runs in a separate VM on the Marklogic server, so it really can't leverage any session state or libraries from, say, a Java EE web application. With JAXB, it is very easy to take a result stream and convert it to Java objects. I really can't say enough good things about it. It has made my development efforts infinitely easier and allows you to leverage Java for those things that it does best (rich integration across various technologies and platforms, advanced business logic, fast memory management for heavy processing jobs, etc.) while still using XQuery for what it does best (i.e. searching and transforming content).

Does using MarkLogic have any impact on the XML technology stack?
No. By the time it comes out of MarkLogic, it's just XML that could have come from anywhere.
I need to query XML and marshal it into Java objects.
Why?
If you have a good reason for using Java, then we need to know what that reason is before we can tell you which Java technology is appropriate.
If you don't have a good reason for using Java, then you are better off using a high-level XML processing language such as XSLT or XQuery.
As for JAXB, it is appropriate when your schema is reasonably simple and stable. If the schema is complex (e.g. the schema for articles in an academic journal), then JAXB can be hopelessly unwieldy because of the number of classes that are generated. One problem with using it to process XQuery output is that it's very likely the XQuery output will not conform to any known schema, and the structure of the XQuery results will be different for each query that gets written.

XML Parsing : JDOM or RegEx ? Which is faster?

A colleague of mine needs to develop an Eclipse plugin that has to parse multiple XML files to check for programming rules imposed by a client (for example, no xsl:for-each, or no namespaces declared but not used). There are about a 1000 files to be parsed regularly, each file containing about 300-400 lines.
We were wondering which solution was faster to do it. I'm thinking JDOM, and he's thinking RegEx.
Anyone can help us decide which is best ?
Thanks

DOM, hands down. RegEx would be madness. Use the tool that was intended for the job.

You can't parse recursive structures with RegEx. So unless you have really simple XML files, XML parsing will be much faster and the code will be somewhat sane (so you won't spend endless hours to locate bugs).
Since the files are pretty small, JDom will make your job much easier. For larger files, you will have to use a SAX or similar parser (so you don't have to keep the whole file in RAM).

I you try to parse XML using regular expressions, you are entering a world of pain. If speed is important, using a event-based API might be a tad faster than DOM/JDOM.

If all checks are simple "no " or no namespace, a StAX parser would be best, as you are just streaming the documents through it, get all the start elements 'events' and then do your checking. For this, the parser needs relatively little memory.
If you need to referential checking, DOM may be better, as you can easily walk the tree (perhaps via xpath).

XOM v/s javax.xml.parsers

i want to do read simple XML file .i found
Simple way to do Xml in Java
There are also several parsers available just wanted to make sure that what are the advantages of using XOM parser over suns parser
Any suggestions?

XOM is extremely quick compared to the standard W3C DOM. If that's your priority, there's none better.
However, it's still a DOM-type API, and so it's not memory efficient. It's not a replacement for SAX or STAX.

You might want to check this question about the best XML library and its top (XOM) answer; lots of details about advantages of XOM. (Leave a comment if something is unclear; Peter Štibraný seems to know XOM inside and out.)
As mentioned, XOM is very quick and simple in most tasks compared to standard javax.xml. For examples, see this post in a question about the simplest way to read in an XML file in Java. I collected some nice examples that make XOM look pretty good (and javax.xml rather clumsy) there. :-)
So personally I've come to like XOM after evaluating (as you can see in the linked posts); for any new Java project I'd most likely choose XOM for XML handling. The only shortcoming I've found is that it doesn't directly support streaming XML (unlike dom4j where I'm coming from), but with a simple workaround it can stream just fine.

How do you need to access your data?
If it is one-pass, then you don't need to build the tree in memory. You can use SAX (fast, simple) or StAX (faster, not quite so simple).
If you need to keep the tree in memory to navigate, XOM or JDOM are good choices. DOM is the Choice Of Last Resort, whether it is level 1, 2, or 3, with or without extensions.
Xerces, which is the parser included with Java (although you should get the updated version from Apache and not use the one bundled with Java, even in 6.0), also has a streaming native interface called XNI.
If you want to hook other pre-made parts up in the chain, often SAX or StAX work well, since they might build their own model in memory. For example, the Saxon XSLT/XQuery engine works with DOM, SAX or StAX, but builds internally a TinyTree (default) or DOM (optional). DataDirect XQuery works with SAX, StAX or DOM also, but really likes StAX.

What Java XML library do you recommend (to replace dom4j)? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking us to recommend or find a tool, library or favorite off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it.
Closed 9 years ago.
Improve this question
I'm looking for something like dom4j, but without dom4j's warts, such as bad or missing documentation and seemingly stalled development status.
Background: I've been using and advocating dom4j, but don't feel completely right about it because I know the library is far from optimal (example: see how methods in XSLT related Stylesheet class are documented; what would you pass to run() as the String mode parameter?)
Requirements:
The library should make basic XML handling easier than it is when using pure JDK (javax.xml and org.w3c.dom packages). Things like this:
Read an XML document (from file or String) into an object, easily traverse and manipulate the DOM, do XPath queries and run XSLT against it.
Build an XML document in your Java code, add elements and attributes and data, and finally write the document into a file or String.
I really like what dom4j promises, actually: "easy to use, open source library for working with XML, XPath and XSLT [...] with full support for DOM, SAX and JAXP." And upcoming dom4j 2.0 does claim to fix everything: fully utilise Java 5 and add missing documentation. But unfortunately, if you look closer:
Warning: dom4j 2.0 is in pre-alpha
stage. It is likely it can't be
compiled. In case it can be compiled
at random it is likely it can't run.
In case it runs occasionally it can
explode suddenly. If you want to use
dom4j, you want version 1.6.1. Really.
...and the website has said that for a long time. So is there a good alternative to dom4j? Please provide some justification for your preferred library, instead of just dumping names and links. :-)

Sure, XOM :-)
XOM is designed to be easy to learn
and easy to use. It works very
straight-forwardly, and has a very
shallow learning curve. Assuming
you're already familiar with XML, you
should be able to get up and running
with XOM very quickly.
I use XOM for several years now, and I still like it very much. Easy to use, plenty of documentation and articles on the web, API doesn't change between releases. 1.2 was released recently.
XOM is the only XML API that makes no
compromises on correctness. XOM only
accepts namespace well-formed XML
documents, and only allows you to
create namespace well-formed XML
documents. (In fact, it's a little
stricter than that: it actually
guarantees that all documents are
round-trippable and have well-defined
XML infosets.) XOM manages your XML so
you don't have to. With XOM, you can
focus on the unique value of your
application, and trust XOM to get the
XML right.
Check out web page http://www.xom.nu/ for FAQ, Cookbook, design rationale, etc. If only everything was designed with so much love :-)
Author also wrote about What's Wrong with XML APIs (and how to fix them). (Basically, reasons why XOM exists in the first place)
Here is also 5-part Artima interview with author about XOM, where they talk about what's wrong with XML APIs, The Good, the Bad, and the DOM, A Design Review of JDOM, Lessons Learned from JDOM and finally Design Principles and XOM.

The one built into the JDK ... with a few additions.
Yes, it's painful to use: it is modeled after W3C specs that were clearly designed by committee. However, it is available anywhere, and if you settle on it you don't run into the "I like Dom4J," "I like JDOM," "I like StringBuffer" arguments that come from third-party libraries. Especially since such arguments can turn into different pieces of code using different libraries ...
However, as I said, I do enhance slightly: the Practical XML library is a collection of utility classes that make it easier to work with the DOM. Other than the XPath wrapper, there's nothing complex here, just a bunch of routines that I found myself rewriting for every job.

I've been using XMLTool for replacing Dom4j and it's working pretty well.
XML Tool uses Fluent Interface pattern to facilitate XML manipulations:
XMLTag tag = XMLDoc.newDocument(false)
.addDefaultNamespace("http://www.w3.org/2002/06/xhtml2/")
.addNamespace("wicket", "http://wicket.sourceforge.net/wicket-1.0")
.addRoot("html")
.addTag("wicket:border")
.gotoRoot().addTag("head")
.addNamespace("other", "http://other-ns.com")
.gotoRoot().addTag("other:foo");
System.out.println(tag.toString());
It's made for Java 5 and it's easy to create an iterable object over
selected elements:
for (XMLTag xmlTag : tag.getChilds()) {
System.out.println(xmlTag.getCurrentTagName());
}

I've always liked jdom. It was written to be more intuitive than DOM parsing(and SAX parsing always seems clumsy anyway).
From the mission statement:
There is no compelling reason for a
Java API to manipulate XML to be
complex, tricky, unintuitive, or a
pain in the neck. JDOMTM is both
Java-centric and Java-optimized. It
behaves like Java, it uses Java
collections, it is completely natural
API for current Java developers, and
it provides a low-cost entry point for
using XML.
That's pretty much been my experience - fairly intuitive navigation of node trees.

I use XStream, its a simple library to serialize objects to XML and back again.
it can be annotation-driven (like JAXB), but it has very simple and easy to use api and you can even generate JSON.

I'll add to the built-in answer by #kdgregory by saying why not JAXB?
With a few annotations its pretty easy to model most XML documents. I mean your probably going to parse the stuff and put in an object right?
JAXB 2.0 is built in to JDK 1.6 and unlike many other builtin javax libraries this one is pretty good (Kohusuke worked on it so you know its good).

In a recent project I had to do some XML parsing, and ended up using Simple Framework, recommended by a colleague.
I was quite happy with it in the end. It uses an annotation-based approach of mapping XML elements and attributes to Java classes and fields.
<example>
<a>
<b>
<x>foo</x>
</b>
<b>
<y>bar</y>
</b>
</a>
</example>
Corresponding Java code:
#Root
public class Example {
#Path("a/b[1]")
#Element
private String x;
#Path("a/b[2]")
#Element
private String y;
}
It's all quite different from dom4j or XOM. You avoid writing silly, boilerplatey XML handling code, but at first you'll probably bang your head against a wall for a while trying to get the annotations right.
(It was me who asked this question 4 years ago. While XOM seems a decent and quite popular dom4j replacement, I haven't come to fully embrace it. Curious that no-one had mentioned Simple Framework here. I decided to fix that, as I'd probably use it again.)

In our project we are using http://www.castor.org/ but just for small XML files. It's really easy to learn, needs just a mapping XML file (or none if the XML tags match perfectly class attributes) and it's done. It supports listeners (like callbacks) to perform additional processing. The cons: it is not a Java EE standard like JAXB.

you can try JAXB, with annotations its very handy and simple to do: Java Architecture for XML Binding.

I'm sometimes using Jericho, which is primarily HTML parser, but can parse any XML-like structure.
Of course it is only for the simplest XML operations, such as finding tags with given name, iterating through structure, replacing tags and its attributes, but aren't this the most use cases?

For building XML documetns, I suggest xmlenc. It is used in cassandra.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.