What XSLT processor should I use for Java transformation? There are SAXON, Xalan and TrAX. What are the criteria for my choice? I need simple transformations without complicated logic, and a fast, easy-to-implement solution. The app runs under Tomcat on Java 1.5.
I have some JAXP libraries and could not work out which version of JAXP is being used.
Thanks in advance.
Best regards.
The JDK comes bundled with an internal version of Xalan, and you get an instance of it by using the standard API (e.g. TransformerFactory.newInstance()).
Unless Xalan doesn't work for you (which is highly unlikely), there's no need to look elsewhere.
By the way, TrAX is the old name for the javax.xml.transform API, from the days when it was an optional extension to the JDK.
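For what it's worth, here is a minimal sketch of that approach using the standard javax.xml.transform API; the stylesheet and file names are just placeholders:

    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.stream.StreamResult;
    import javax.xml.transform.stream.StreamSource;

    public class SimpleTransform {
        public static void main(String[] args) throws Exception {
            // Returns the JDK's bundled processor (the internal Xalan) unless another is configured
            TransformerFactory factory = TransformerFactory.newInstance();
            Transformer transformer = factory.newTransformer(new StreamSource("stylesheet.xsl"));
            // Apply the stylesheet to input.xml and write the result to output.html
            transformer.transform(new StreamSource("input.xml"), new StreamResult("output.html"));
        }
    }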
In general, it's a hard question to answer because it depends heavily on what you mean by "simple transformation" and "fast", as well as the amount of XML you want to process. There are probably other considerations as well. To illustrate, "fast" could mean "fast to write" or "fast to execute", and if you process files the size of available memory you might make a different choice (maybe STX, as described in another SO question) than if you parse small files, etc.
I need to query XML out of a MarkLogic server and marshal it into Java objects. What is a good way to go about this? Specifically:
Does using MarkLogic have any impact on the XML technology stack? (i.e. is there something about MarkLogic that leads to a different approach to searching for, reading and writing XML snippets?)
Should I process the XML myself using one of the XML APIs or is there a simpler way?
Is it worth using JAXB for this?
Someone asked a good question about why I am using Java. I am using Java/Java EE because I am strongest in that language. This is a one-man project and I don't want to get stuck anywhere. The project is to develop web service APIs and data processing and transformation (CSV to XML) functionality. Java/Java EE can do this well and do it elegantly.
Note: I'm the EclipseLink JAXB (MOXy) lead, and a member of the JAXB 2 (JSR-222) expert group.
Does using MarkLogic have any impact on the XML technology stack? (i.e. is there something about MarkLogic that leads to a different approach to searching for, reading and writing XML snippets?)
Potentially. Some object-to-XML libraries support a wider variety of documents than others. MOXy leverages XPath-based mappings that allow it to handle a wider variety of documents. Below are some examples:
http://blog.bdoughan.com/2010/09/xpath-based-mapping-geocode-example.html
http://blog.bdoughan.com/2011/03/map-to-element-based-on-attribute-value.html
Should I process the XML myself using one of the XML APIs or is there a simpler way?
Using a framework is generally easier. Java SE offers many standard libraries for processing XML: JAXB (javax.xml.bind), XPath (javax.xml.xpath), DOM, SAX, StAX. Since these are standards, there are also other implementations (e.g. MOXy and Apache JaxMe implement JAXB).
http://blog.bdoughan.com/2011/05/specifying-eclipselink-moxy-as-your.html
Is it worth using JAXB for this?
Yes.
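To give a feel for it, here is a minimal JAXB unmarshalling sketch; the Customer class, its field and the sample XML are made up for the example:

    import java.io.StringReader;
    import javax.xml.bind.JAXBContext;
    import javax.xml.bind.Unmarshaller;
    import javax.xml.bind.annotation.XmlRootElement;

    public class JaxbDemo {

        // Hypothetical domain class; with default JAXB rules the public field maps to <name>
        @XmlRootElement(name = "customer")
        public static class Customer {
            public String name;
        }

        public static void main(String[] args) throws Exception {
            String xml = "<customer><name>Jane</name></customer>";
            JAXBContext context = JAXBContext.newInstance(Customer.class);
            Unmarshaller unmarshaller = context.createUnmarshaller();
            Customer c = (Customer) unmarshaller.unmarshal(new StringReader(xml));
            System.out.println(c.name); // prints "Jane"
        }
    }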
There are a number of XML-to-Java object marshalling libraries. I think you might want to look for an answer to this question by searching for generic Java XML marshalling/unmarshalling questions like this one:
Java Binding Vs Manually Defining Classes
Your use case is still not perfectly clear, although the title edit helps. If you're looking for Java connectivity, you might also want to look at http://developer.marklogic.com/code/mljam, which allows you to execute Java code from within MarkLogic XQuery.
XQSync uses XStream for this. As I understand it, JAXB is more powerful - but also more complex.
Having used JAXB to unmarshal XML served from XQuery for 5 years now, I have to say that I have found it to be exceptionally useful and time-saving. As for complexity, it is easy to learn and use for probably 90% of what you would be using it for. I've used it for both simple and complex schemas and found it to be very performant and time-saving.
Executing Java code from within MarkLogic is usually a non-starter, because it runs in a separate VM on the MarkLogic server, so it really can't leverage any session state or libraries from, say, a Java EE web application.
With JAXB, it is very easy to take a result stream and convert it to Java objects. I really can't say enough good things about it. It has made my development efforts infinitely easier and allows you to leverage Java for those things that it does best (rich integration across various technologies and platforms, advanced business logic, fast memory management for heavy processing jobs, etc.) while still using XQuery for what it does best (i.e. searching and transforming content).
Does using MarkLogic have any impact on the XML technology stack?
No. By the time it comes out of MarkLogic, it's just XML that could have come from anywhere.
I need to query XML and marshal it into Java objects.
Why?
If you have a good reason for using Java, then we need to know what that reason is before we can tell you which Java technology is appropriate.
If you don't have a good reason for using Java, then you are better off using a high-level XML processing language such as XSLT or XQuery.
As for JAXB, it is appropriate when your schema is reasonably simple and stable. If the schema is complex (e.g. the schema for articles in an academic journal), then JAXB can be hopelessly unwieldy because of the number of classes that are generated. One problem with using it to process XQuery output is that it's very likely the XQuery output will not conform to any known schema, and the structure of the XQuery results will be different for each query that gets written.
I need to use an XML pull parser. I can find stax-api.jar, which seems to already be part of com.sun.xml.*, and it seems that there is already something StAX-related implemented.
com.sun.xml unfortunately has no sources in JDK 6, so I can't tell.
There are also XmlPull, stax.codehaus.org and Apache Axiom, which more or less implement the StAX API. stax.codehaus.org seems to be a StAX reference implementation. XmlPull seems to be done by the same people as the reference implementation, and Apache Axiom seems to be a StAX-based parser that was created for Apache Axis2.
Could you please clarify what the main differences are, which API to use, and when you would use one of these implementations and why?
Edit: Before you decide to close this question, note that the xmlpull.org and stax.codehaus.org releases are pretty old (5 years) and one really can't say whether the StAX parser implementation is part of com.sun.xml.*.
I'd just need someone with pull-parser experience to tell me what to use and why.
For instance, the Apache Abdera project (I'm parsing Atom feeds too) is using the Axiom implementation, which seems to implement its own Axiom API and also geronimo-stax-api_1.0_spec.
Aside from pointing out that the JDK/JRE bundles Sun's SJSXP, which works OK at this point, I would recommend AGAINST using the StAX reference implementation (stax.codehaus.org) -- do NOT use it for anything, ever. It has lots of remaining bugs (although many were fixed, the initial versions were horrible), isn't particularly fast, and doesn't even implement all mandatory features. Stay clear of it.
I am partial to Woodstox, which is by far the most complete implementation for XML features (on par with Xerces, about the only other Java XML parser that can say this), more performant than SJSXP, and an all-around solid parser and generator -- this is why most modern Java XML web service frameworks and containers bundle Woodstox.
Or, if you want super-high performance, check out Aalto. It is the successor to Woodstox, with fewer features (no DTD handling) but 2x faster for many common cases.
And if you ever need non-blocking/async parsing (for NIO based input for example), Aalto is the only known Java XML parser to offer that feature.
As to Axiom: it is NOT a parser, but a tree model built on top of a StAX parser such as Woodstox, so they didn't reinvent the wheel. XmlPull predates the StAX API by a couple of years; basically, StAX standardization came about from people using XmlPull, liking what they saw, and Sun+BEA wanting to standardize the approach. There was some friction in the process, so in the end XmlPull was not discontinued when StAX was finalized, but one can think of StAX as its successor -- XmlPull is still used for mobile devices; I think the Android platform includes it.
(disclaimers: I am involved in both the Aalto and Woodstox projects, and have provided more than a dozen bug fixes to both SJSXP and the StAX RI)
As of Java 1.6, there is a StAX implementation inside the plain bundled JRE. You can use that. If you don't like the performance, drop in Woodstox.
Axiom is something else entirely, much more complex. XmlPull seems to be going by the wayside in favor of one StAX implementation or another.
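For completeness, a small pull-parsing sketch against the StAX API that ships with the JRE (it works the same with Woodstox dropped in); the XML content is made up:

    import java.io.StringReader;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;

    public class StaxReadDemo {
        public static void main(String[] args) throws Exception {
            String xml = "<feed><entry>first</entry><entry>second</entry></feed>";
            XMLInputFactory factory = XMLInputFactory.newInstance();
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                // The application pulls events one at a time instead of being called back
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT && "entry".equals(reader.getLocalName())) {
                    System.out.println(reader.getElementText());
                }
            }
            reader.close();
        }
    }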
Does Java have a built in XML library for generating and parsing documents? If not, which third party one should I be using?
The Sun Java Runtime comes with the Xerces and Xalan implementations, which provide the ability to parse XML (via the DOM and SAX interfaces), and also perform XSL transformations and execute XPath queries.
However, it is better to use the JAXP API to work on XML, since JAXP allows you not to worry about the underlying implementation used (Xerces or Crimson or any other). When you use JAXP, at runtime the JRE will use whichever service provider it can locate to perform the needed operations. As indicated previously, Xerces/Xalan will be used, since they are shipped with the Sun JRE (not other JREs though), so you don't have to download and install a specific provider (say, a different version of Xerces, or Crimson).
A basic JAXP tutorial can be found in the J2EE 1.4 tutorial (it's from the J2EE tutorial, but it will help).
Do note that the Xerces/Xalan implementations provided by the Sun JRE will not be found in the org.apache.xerces.* or org.apache.xalan.* packages. Instead, they will be present in the internal com.sun.org.apache.xerces.* and com.sun.org.apache.xalan.* packages.
By the way, JDOM is not an XML parser - it will use the parser provided to it by JAXP in order to provide you with an easier abstraction to work with.
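As an illustration of coding against JAXP rather than a concrete parser, here is a minimal DOM read; the file name and element name are placeholders:

    import java.io.File;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;

    public class DomParseDemo {
        public static void main(String[] args) throws Exception {
            // JAXP locates whichever parser implementation is available at runtime
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new File("books.xml"));
            NodeList titles = doc.getElementsByTagName("title");
            for (int i = 0; i < titles.getLength(); i++) {
                System.out.println(titles.item(i).getTextContent());
            }
        }
    }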
Yes. It has two options in the javax.xml package: DOM builds documents in memory, and SAX is an event-based approach.
You may also want to look at JDOM, which is a 3rd party library that offers a combination of the two, and can be easier to use.
Yes. Java contains javax.xml library. You can checkout some samples at Sun's Java API for XML Code Samples.
However, I personally like using the JDOM library.
The javax.xml package contains Java's native XML solution, which is actually a special version of Xerces. You can do what you asked with it; however, using 3rd-party libraries such as JDOM makes the whole process a lot easier.
Have a look at JAXB. This is increasingly the "standard" way to do XML processing. It uses Java annotations to simplify the programming model. The reference gives sample code for reading and writing XML.
Java does come with a large set of packages and classes to handle XML. These are part of the Standard Edition JDK, and located under the javax.xml package.
Aside from reading XML and writing it with DOM or SAX, these packages also perform XSL transformations, JAXB object marshalling and unmarshalling, XPath processing and web services SOAP handling. I advise you to read more about these online in Sun's excellent tutorials.
I can't tell you which one to use (few requirements specified, and there are a dozen libraries), but I would seriously consider XOM (here). Written by Elliotte Rusty Harold, it is quite complete in terms of the XML spec, and generally excellent. I have found it very easy to use. See the link above for Harold's motivation and criticism of other solutions.
You could have a look at the javax.xml package, which contains everything you need to work with XML documents in Java...
The Java API for XML Processing (JAXP) is part of the Java SE standard library. JAXP allows you to code against a standard interface and lets you pick the parser implementation later if needed.
The Java API for XML Processing, or JAXP for short, enables applications to parse and transform XML documents using an API that is independent of a particular XML processor implementation. JAXP also provides a pluggability feature which enables applications to easily switch between particular XML processor implementations.
You can use StAX (streaming API for XML)
http://en.wikipedia.org/wiki/StAX
http://www.xml.com/pub/a/2003/09/17/stax.html
https://sjsxp.dev.java.net/
StAX is optimized to process large XML files without causing OOM (out of memory) problems :)
As is said above, Java's SDK now comes with Xerces and Xalan. Xalan only implements version 1.0 of XSLT, so if you want 2.0, you should look at Saxon from Michael Kay.
I've been using JDOM for general XML parsing for a long time, but get the feeling that there must be something better, or at least more lightweight, for Java 5 or 6.
There's nothing wrong with the JDOM API, but I don't like having to include Xerces with my deployments. Is there a more lightweight alternative, if all I want to do is read in an XML file or write one out?
The best lightweight alternative is, in my opinion, XOM, but JDOM is still a very good API, and I see no reason to replace it.
It doesn't have a dependency on Xerces, though (at least, it doesn't need the Apache Xerces distro, it works alongside the Xerces that's packaged into the JRE).
I've used the javax.xml.stream package (XMLStreamReader/XMLStreamWriter) to read and write XML using xml pull/push techniques. It's worked for me so far.
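The writing side of that, as a rough sketch (the element names are arbitrary):

    import java.io.StringWriter;
    import javax.xml.stream.XMLOutputFactory;
    import javax.xml.stream.XMLStreamWriter;

    public class StaxWriteDemo {
        public static void main(String[] args) throws Exception {
            StringWriter out = new StringWriter();
            XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            writer.writeStartDocument();
            writer.writeStartElement("customer");
            writer.writeStartElement("name");
            writer.writeCharacters("Jane");
            writer.writeEndElement(); // </name>
            writer.writeEndElement(); // </customer>
            writer.writeEndDocument();
            writer.close();
            // Prints something like <?xml version="1.0" ?><customer><name>Jane</name></customer>
            System.out.println(out);
        }
    }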
We use JAXB - it generates the classes based on the schema. You can generate your files without a schema and just annotate how you want the XML to be.
There was recently a fork of JDOM for Java 5 called CoffeeDOM. You should check it out.
You should check out Commons Digester (see the answer I've given here). It provides a very lightweight mechanism for parsing XML.
JDOM is very good and simple. There have been many new ways to parse XML since JDOM's release, but those have a different focus than simplicity. JAXB makes things simple in some cases, when you have a well-known XML document and your schema does not get updated on a daily basis.
New push parsers are very good and even mandatory for very large XML files (hundreds of MBs).
The speed benefit of a SAX parser can be tenfold.
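A rough sketch of streaming a large file with SAX, where only the handler's callbacks hold state (the file and element names are placeholders):

    import java.io.File;
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    public class SaxCountDemo {
        public static void main(String[] args) throws Exception {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            final int[] count = {0};
            // The parser pushes events to the handler as it streams through the document,
            // so memory use stays flat regardless of file size
            parser.parse(new File("large.xml"), new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName, Attributes attrs) {
                    if ("record".equals(qName)) {
                        count[0]++;
                    }
                }
            });
            System.out.println("records: " + count[0]);
        }
    }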
Use one of the XML APIs that come standard with Java, so that you don't have to include any third-party libraries.
XML in the Java Platform Standard Edition (Java SE) 6
I would like to think JAXP is a good choice for you.
It's standard, included in the JDK, provides a clear interface and allows you to hook up any implementation.
If all you need is to read and write XML files that are not very large or overcomplicated, the JAXP DOM API embedded in the JDK will cover your requirements.
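For the writing half, the usual JAXP idiom is to build a Document and serialize it with an identity transform; a rough sketch with made-up element names:

    import java.io.StringWriter;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.dom.DOMSource;
    import javax.xml.transform.stream.StreamResult;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;

    public class DomWriteDemo {
        public static void main(String[] args) throws Exception {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("customer");
            Element name = doc.createElement("name");
            name.setTextContent("Jane");
            root.appendChild(name);
            doc.appendChild(root);

            // An identity transform writes the in-memory tree back out as text
            StringWriter out = new StringWriter();
            TransformerFactory.newInstance().newTransformer()
                    .transform(new DOMSource(doc), new StreamResult(out));
            System.out.println(out);
        }
    }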
I'm currently working on a project which needs to persist any kind of object (over whose implementation we don't have any control) so that these objects can be recovered afterwards.
We can't implement an ORM because we can't restrict the users of our library at development time.
Our first alternative was to serialize it with the default Java serialization, but we had a lot of trouble recovering the objects when the users started to pass different versions of the same object (attributes changed types, names, ...).
We have tried the XMLEncoder class (which transforms an object into XML), but we have found that there is a lack of functionality (it doesn't support enums, for example).
Finally, we also tried JAXB, but this forces our users to annotate their classes.
Any good alternative?
It's 2011, and in a commercial grade REST web services project we use the following serializers to offer clients a variety of media types:
XStream (for XML but not for JSON)
Jackson (for JSON)
Kryo (a fast, compact binary serialization format)
Smile (a binary format that comes with Jackson 1.6 and later).
Java Object Serialization.
We experimented with other serializers recently:
SimpleXML seems solid, runs at 2x the speed of XStream, but requires a bit too much configuration for our situation.
YamlBeans had a couple of bugs.
SnakeYAML had a minor bug relating to dates.
Jackson JSON, Kryo, and Jackson Smile were all significantly faster than good old Java Object Serialization, by about 3x to 4.5x. XStream is on the slow side. But these are all solid choices at this point. We'll keep monitoring the other three.
http://x-stream.github.io/ is nice, please take a look at it! Very convenient
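Roughly how it is used (the Point class is just an example; recent XStream versions also need the type whitelisted before deserializing):

    import com.thoughtworks.xstream.XStream;

    public class XStreamDemo {

        // Hypothetical class to serialize; no annotations or interfaces required
        public static class Point {
            int x = 1;
            int y = 2;
        }

        public static void main(String[] args) {
            XStream xstream = new XStream();
            xstream.alias("point", Point.class);              // emit <point> instead of the full class name
            xstream.allowTypes(new Class[] { Point.class });  // required by newer versions for deserialization
            String xml = xstream.toXML(new Point());
            Point copy = (Point) xstream.fromXML(xml);
            System.out.println(xml);
            System.out.println(copy.x + "," + copy.y);
        }
    }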
over whose implementation we don't have any control
The solution is don't do this. If you don't have control of a type's implementation you shouldn't be serialising it. End of story. Java serialisation provides serialVersionUID specifically for managing serialisation incompatibilities between different versions of a type. If you don't control the implementation you cannot be sure that IDs are being changed correctly when a developer changes a class.
Take a simple example of a 'Point'. It can be represented by either a cartesian or a polar coordinate system. It would be cost prohibitive for you to build a system that could cope dynamically with these sorts of corrections - it really has to be the developer of the class who designs the serialisation.
In short it's your design that's wrong - not the technology.
The easiest thing for you to do is still to use serialization, IMO, but put more thought into the serialized form of the classes (which you really ought to do anyway). For instance:
Explicitly define the serialVersionUID.
Define your own serialized form where appropriate.
The serialized form is part of the class' API and careful thought should be put into its design.
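For example, a rough sketch of both points on a made-up class:

    import java.io.IOException;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;

    // Hypothetical class showing an explicit version ID and a hand-tuned serialized form
    public class Money implements Serializable {

        // Explicit version ID, so recompiling the class doesn't silently break old streams
        private static final long serialVersionUID = 1L;

        private long cents;
        private transient String formatted; // derived state, deliberately left out of the stream

        private void writeObject(ObjectOutputStream out) throws IOException {
            out.defaultWriteObject();        // writes the non-transient fields (cents)
        }

        private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
            in.defaultReadObject();          // reads cents
            formatted = cents + " cents";    // rebuild the derived field on the way in
        }
    }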
I won't go into a lot of detail, since pretty much everything I have said comes from Effective Java. I'll instead refer you to it, specifically the chapters about serialization. It warns you about all the problems you're running into, and provides proper solutions to them:
http://www.amazon.com/Effective-Java-2nd-Joshua-Bloch/dp/0321356683
With that said, if you're still considering a non-serialization approach, here are a couple:
XML marshalling
As many have pointed out, this is an option, but I think you'll still run into the same problems with backward compatibility. However, with XML marshalling, you'll hopefully catch these right away, since some frameworks may do some checks for you during initialization.
Conversion to/from YAML
This is an idea I have been toying with, and I really like the YAML format (at least as a custom toString() format). But really, the only difference for you is that you'd be marshalling to YAML instead of XML. The only benefit is that YAML is slightly more human-readable than XML. The same restrictions apply.
Google came up with a binary protocol, Protocol Buffers (http://code.google.com/apis/protocolbuffers/). It is faster and has a smaller payload compared to XML, which others have suggested as an alternative.
One of the advantages of Protocol Buffers is that it can exchange info with C, C++, Python and Java.
Try serializing to JSON, with Gson for example.
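For example (the Point class is made up; Gson needs no annotations for simple cases):

    import com.google.gson.Gson;

    public class GsonDemo {

        // Hypothetical class to serialize
        public static class Point {
            int x = 1;
            int y = 2;
        }

        public static void main(String[] args) {
            Gson gson = new Gson();
            String json = gson.toJson(new Point());          // {"x":1,"y":2}
            Point copy = gson.fromJson(json, Point.class);   // back to an object
            System.out.println(json + " -> " + copy.x + "," + copy.y);
        }
    }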
Also a very fast JDK serialization drop-in replacement:
http://ruedigermoeller.github.io/fast-serialization/
If serialization speed is important to you then there is a comprehensive benchmark of JVM serializers here:
https://github.com/eishay/jvm-serializers/wiki
Personally, I use Fame a lot, since it features interoperability with Smalltalk (both VW and Squeak) and Python. (Disclaimer, I am the main contributor of the Fame project.)
Possibly Castor?
Betwixt is a good library for serializing objects - but it's not going to be an automatic kind of thing. If the number of objects you have to serialize is relatively fixed, this may be a good option for you, but if your 'customer' is going to be throwing new classes at you all the time, it may be more effort than it's worth (Definitely easier than XMLEncoder for all the special cases, though).
Another approach is to require your customer to provide the appropriate .betwixt files for any objects they throw at you (that effectively offloads the responsibility to them).
Long and short - serialization is hard - there is no completely brain dead approach to it. Java serialization is as close to a brain dead solution as I've ever seen, but as you've found, incorrect use of the version uid value can break it. Java serialization also requires use of the marker 'Serializable' interface, so if you can't control your source, you are kind of out of luck on that one.
If the requirement is truly as arduous as you describe, you may have to resort to some sort of BCE (Byte code modification) on the objects / aspects / whatever. This is getting way outside the realm of a small development project, and into the realm of Hibernate, Casper or an ORM....
SBE is an established library for fast, ByteBuffer-based serialization and is capable of versioning. However, it is a bit hard to use, as you need to write length wrapper classes over it.
In light of its shortcomings, I recently made a Java-only serialization library inspired by SBE and the FIX protocol (a common financial-market protocol for exchanging trade/quote messages), which tries to keep the advantages of both while overcoming their weaknesses. You can take a look at https://github.com/iceberglet/anymsg
Another idea: use a cache. Caches provide much better control, scalability and robustness to the application. You still need to serialize, but management becomes much easier within a caching service framework. A cache can be persisted in memory, on disk, in a database or an array - or all of these options, with one serving as overflow, standby or fail-over for the others. Commons JCS and Ehcache are two Java implementations; the latter is an enterprise solution free up to 32 GB of storage (disclaimer: I don't work for Ehcache ;-)).