Compiling a huge schema into Java

There are two major tools that provide a way to compile an XSD schema into Java: XMLBeans and JAXB.
The problem is that the XSD schema is really huge: 30MB of XML files. Most of the schema isn't used in my project, so I could comment out most of it, but that's not a good solution.
Currently my project uses XMLBeans, which only compiles the schema after major changes to it. It produces ~60MB of classes and takes ~30 minutes to compile.
Another solution is to use JAXB, which generates ~14MB of code without any need to edit the schema. But it produces a huge ObjectFactory class, which fails to compile with a "too many constants" error. I can throw that class away and compile the schema without it, but as I understand it, it's a very useful class.
Any ideas how to handle this huge schema?

Could you create a script to extract the portion(s) of the schema you need and integrate that into your build process prior to mapping with XmlBeans or JAXB?
You could probably script this extraction fairly simply and easily in Python, Perl, Awk, etc; or even in XSL if you have expertise there (I've never spent enough contiguous time coding XSL to get proficient, so I'd probably stick to a scripting language, but that's just me).
e.g.:
python extract.py big-schema.xsd >small-schema.xsd
xsd2java <args> small-schema.xsd
...
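For illustration, here is a rough sketch of what such an extractor could look like (in Java, though a scripting language would do just as well). The keep-list and file names are placeholders, and a real extractor would also have to follow the type references of the elements you keep:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Rough sketch: keep only the named top-level schema components you actually use.
// Usage: java ExtractSchema big-schema.xsd ElementA TypeB ... > small-schema.xsd
public class ExtractSchema {
    public static void main(String[] args) throws Exception {
        Set<String> keep = new HashSet<String>(
                Arrays.asList(args).subList(1, args.length));
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(args[0]);
        Element schema = doc.getDocumentElement(); // the <xs:schema> root
        NodeList children = schema.getChildNodes();
        for (int i = children.getLength() - 1; i >= 0; i--) { // backwards: we remove as we go
            Node child = children.item(i);
            if (child instanceof Element) {
                String name = ((Element) child).getAttribute("name");
                if (name.length() > 0 && !keep.contains(name)) {
                    schema.removeChild(child);
                }
            }
        }
        TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(doc), new StreamResult(System.out));
    }
}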
You might find that a subsequent update by the 3rd-party vendor would invalidate your extraction script, but unless they're making very large changes to the overall schema, you should be able to update the script fairly quickly, and it sounds like those updates should be fairly infrequent.
Incidentally, I'm a little partial to XmlBeans; when we did our own evaluation of XML-Java mapping tools, it seemed to handle constructs like xs:choice, xs:all, and type-substitution better than anything else we tried. But that was several years ago, and could certainly have changed by now. At this point, we're continuing to use it more out of institutional inertia than anything else, so take that recommendation with a dash of salt.

30MB of schema? What on earth is this - I'd be interested to know if it's available as a test case for schema processors.
Data mapping (a la JAXB) works best with small schemas. I've seen people really struggle when the schema gets as large as about 200 element types. You must be dealing with something a couple of orders of magnitude larger here - I would say it's a non-starter.

Related

EclipseLink Moxy minimum libraries required for unmarshalling

I have been working with EclipseLink for the past couple of days to implement one of our small converter applications. The input for these is usually one document format type and now, going forward, a more sophisticated metadata XML.
Since we have a schema for this and there are still slight changes to be expected in the future, I wanted to give the JAXB approach a try, and I like it very much so far.
However, as I finished the application I noticed that, due to the use of eclipselink.jar, the application is rather large (~10MB) in comparison to similar converters (~1MB).
This is due to the fact that there is, for reasons of the technological environment, no global classpath for the converter jars; every one of them needs to be self-sufficient.
This means that I copy every required jar into one big jar using Ant.
I am not quite fond of this approach myself, but so far I can only suspect that some different approach may or may not be more elegant.
There are some smaller jars containing fragments of the needed classes in the EclipseLink distribution, but I found none that contained org.eclipse.persistence.jaxb.JAXBContextFactory (plus its dependencies).
It seems to me, though this is a lot of guesswork, that eclipselink.jar includes the complete-wellness-package-that-leaves-nothing-to-be-desired, and that is a bit of overkill for me.
Long story short:
Is there a lightweight version of eclipselink.jar which would support the unmarshalling of an XML document for which I generated Java classes in advance? Or am I attempting the impossible?
Thank you in advance
Christian
Instead of using eclipselink.jar, you can use the bundles. Then you will need to include the following:
org.eclipse.persistence.asm.version.jar
org.eclipse.persistence.core.version.jar
org.eclipse.persistence.moxy.version.jar
The total is still larger than other providers, but we're working on fixing that.
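For reference, unmarshalling with the generated classes then stays plain JAXB. The sketch below assumes a generated class called Customer and a jaxb.properties file in the same package as the generated classes naming org.eclipse.persistence.jaxb.JAXBContextFactory as the context factory:

import java.io.File;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;

public class ConverterMain {
    public static void main(String[] args) throws Exception {
        // MOXy is selected via the jaxb.properties file; the code itself is provider-neutral.
        JAXBContext context = JAXBContext.newInstance(Customer.class); // Customer is a placeholder
        Unmarshaller unmarshaller = context.createUnmarshaller();
        Customer customer = (Customer) unmarshaller.unmarshal(new File("metadata.xml"));
        System.out.println(customer);
    }
}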

Java marshaller performance

I've used the JAXB Marshaller as well as my own marshaller for marshalling pure Java bean objects into XML. It has been observed that both of them require almost the same time to marshal. The performance is not acceptable and needs to be improved. What are possible ways to improve the marshaller's performance? Threading, for example?
Make sure you create the JAXBContext instance only once; creating the context takes some time, as it uses reflection to parse the object's annotations.
Note that the JAXBContext is thread-safe, but the marshallers/unmarshallers aren't, so you still have to create a marshaller for every thread. However, I found that creating marshallers when you already hold a JAXB context is pretty fast.
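A sketch of that pattern (Order is just a placeholder annotated bean):

import java.io.StringWriter;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Marshaller;

public class OrderXmlWriter {
    // Created once: JAXBContext construction is the expensive, reflective part,
    // and the context itself is thread-safe.
    private static final JAXBContext CONTEXT = createContext();

    private static JAXBContext createContext() {
        try {
            return JAXBContext.newInstance(Order.class); // Order is a placeholder bean
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Marshallers are not thread-safe, but creating one from an existing
    // context is cheap, so create a fresh one per call (or per thread).
    public String toXml(Order order) throws Exception {
        Marshaller marshaller = CONTEXT.createMarshaller();
        StringWriter out = new StringWriter();
        marshaller.marshal(order, out);
        return out.toString();
    }
}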
Seconding the use of JiBX. Like questzen, I found that JiBX was 9 times faster than JAXB in my performance tests.
Also, make sure you have Woodstox on the classpath when using JiBX. I found Woodstox's StAX implementation is roughly 1050% faster than the Java 6 implementation of StAX.
Beyond the other good suggestions, I suggest there is something wrong with the way you use JAXB -- it generally performs reasonably well as long as:
You use JAXB version 2 (NEVER ever use the obsolete JAXB 1 -- that was a horribly slow, useless piece of crap); preferably a recent 2.1.x version from http://jaxb.dev.java.net
Ensure that you use a SAX or StAX source/destination; NEVER use DOM unless you absolutely must for interoperability: using DOM will make it 3-5x slower, without any benefit (it just duplicates the object model: POJO -> DOM -> XML; the DOM part is completely unnecessary). See the sketch after this list.
Ideally use the fastest SAX/StAX parser available; Woodstox is faster than Sun's bundled StAX processor (and BEA's reference implementation is buggy, and no faster than Sun's)
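As a sketch of the second and third points, marshal straight to a StAX XMLStreamWriter instead of building a DOM (Order is a placeholder bean; with Woodstox on the classpath the StAX factory will pick it up through the usual service lookup):

import java.io.FileOutputStream;
import java.io.OutputStream;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Marshaller;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

public class StaxMarshalExample {
    public static void main(String[] args) throws Exception {
        JAXBContext context = JAXBContext.newInstance(Order.class); // placeholder bean
        Marshaller marshaller = context.createMarshaller();
        OutputStream out = new FileOutputStream("order.xml");
        try {
            // Events go straight from JAXB to the stream -- no intermediate DOM tree.
            XMLStreamWriter writer =
                    XMLOutputFactory.newInstance().createXMLStreamWriter(out, "UTF-8");
            marshaller.marshal(new Order(), writer);
            writer.close();
        } finally {
            out.close();
        }
    }
}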
If JAXB is still more than 50% slower than a manually written variant, I would profile it to see what else is going wrong. It should not be slow when used properly -- I have measured it, continuously, and found it so fast that hand-writing converters is usually not worth the time and effort.
JiBX is a good package too, by the way, so I have nothing against trying it out. It might still be a bit faster than JAXB; just not 5x or 10x, when both are used correctly.
If large XML trees are written, providing a BufferedOutputStream to the javax.xml.bind.Marshaller.marshal(Object jaxbElement, java.io.OutputStream os) method made a big difference in my case: the time needed to write a 100MB XML file could be reduced from 130 sec to 7 sec.
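A minimal illustration of that (the buffer size is arbitrary):

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Marshaller;

public class BufferedMarshal {
    // Wrap the file stream yourself; otherwise the marshaller issues many small writes.
    static void writeLargeTree(JAXBContext context, Object root, String file) throws Exception {
        Marshaller marshaller = context.createMarshaller();
        OutputStream out = new BufferedOutputStream(new FileOutputStream(file), 64 * 1024);
        try {
            marshaller.marshal(root, out);
        } finally {
            out.close();
        }
    }
}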
In my experience, JiBX (http://jibx.sourceforge.net/) was nearly 10x faster than JAXB. Yes, I measured it for a performance spec. We used it to bind Java beans with large HL7 XML. That being said, the way to improve performance is not to rely on the schema definition but to write custom bindings.
We have just tracked down a JAXB performance problem related to the default parser configuration used by Xerces. JAXB performance was very slow (30s+) for one data file (<1MB).
Quoting "How do I change the default parser configuration?" from http://xerces.apache.org/xerces2-j/faq-xni.html
The DOM and SAX parsers decide which parser configuration to use in the following order:
The org.apache.xerces.xni.parser.XMLParserConfiguration system property is queried for the class name of the parser configuration.
If a file called xerces.properties exists in the lib subdirectory of the JRE installation and the org.apache.xerces.xni.parser.XMLParserConfiguration property is defined in it, then its value will be read from the file.
The org.apache.xerces.xni.parser.XMLParserConfiguration file is requested from the META-INF/services/ directory. This file contains the class name of the parser configuration.
The org.apache.xerces.parsers.XIncludeAwareParserConfiguration is used as the default parser configuration.
Unmarshalling using JAXB results in this algorithm being repeatedly applied. So a huge amount of time can be spent repeatedly scanning the classpath, looking for the configuration file that doesn't exist. The fix is to do option 1, option 2 or option 3 (create the configuration file under META-INF). Anything to prevent the repeated classpath scanning.
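For example, option 1 is a one-liner done once at startup, before any JAXB work (the class name is just the default configuration quoted above):

// Point Xerces straight at its default configuration so it stops scanning the
// classpath for a xerces.properties / META-INF/services entry on every parse.
System.setProperty("org.apache.xerces.xni.parser.XMLParserConfiguration",
        "org.apache.xerces.parsers.XIncludeAwareParserConfiguration");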
Hope this helps someone - it's taken us days to track this down.
You can improve marshalling and unmarshalling performance by setting the fast-boot property at the system level. This can give a significant performance improvement:
System.setProperty("com.sun.xml.bind.v2.runtime.JAXBContextImpl.fastBoot","true");

Castor performance issues

We recently upgraded to Castor 1.2 from version 0.9.5.3 and we've noticed a dramatic drop in performance when calling unmarshal on XML. We're unmarshalling to Java classes that were generated by Castor in both cases. For comparison, using identical XML, the unmarshal call used to take about 10-20ms and now takes about 2300ms.
Is there something obvious I might be missing in our new Castor implementation, maybe in a property file that I missed, or should I start looking at reverting to the old version? Maybe there was something in the Java class file generation that is killing the unmarshal call? I might also consider Castor alternatives if there were a good reason to drop it in favor of something else. We're using Java 1.5 on a WebLogic server.
We had very serious performance issues using Castor 1.0.5 with a .castor.cdr file (a few seconds to unmarshal, whereas it took milliseconds in the past).
It turned out that the generated .castor.cdr file contained old/wrong values (nonexistent types and descriptors). After deleting the offending lines in this file, everything was back to normal.
Hope this helps anyone who has the same problem!
You might want to consider using JiBX instead. It is considerably faster than Castor (9x faster on one project where I made the switch), and cleaner.
EDIT: See also my answer to this related question.
We wound up reverting to Castor version 0.9.5.3, and the performance jumped back up after we regenerated the Java classes from the new XSDs. I'm not sure why exactly there's such a big difference between the resulting Java, but the 1.2 classes were about two orders of magnitude slower when unmarshalling.
EDIT: It looks like we could also improve performance by creating ClassDescriptorResolvers/a mapping file and turning off validation, but since we create about 1000 objects with the schema generation process this isn't really viable from a cost perspective.
I too have this issue: when generating a basic customer/address set of XML, it takes around 3s to generate a document containing 74 customers.
Reverting to 1.0.4 (for testing) sees this return to 1.4s,
but hand-rolling the XML produces the output in under 40ms. I know the frameworks add some overhead, but there must be something causing this.
Has any profiling been run on Castor?
I'll go investigate JiBX as suggested by Dan.

The drawbacks of annotation processing in Java?

I am considering starting a project which is used to generate code in Java using annotations (I won't get into specifics, as it's not really relevant). I am wondering about the validity and usefulness of the project, and something that has struck me is the dependence on the Annotation Processing Tool (apt).
What I'd like to know, as I can't speak from experience, is what are the drawbacks of using annotation processing in Java?
These could be anything, including the likes of:
it is hard to do TDD when writing the processor
it is difficult to include the processing on a build system
processing takes a long time, and it is very difficult to get it to run fast
using the annotations in an IDE requires a plugin for each, to get it to behave the same when reporting errors
These are just examples, not my opinion. I am in the process of researching if any of these are true (including asking this question ;-) )
I am sure there must be drawbacks (for instance, Qi4J specifically lists not using pre-processors as an advantage) but I don't have the experience with it to tell what they are.
The only reasonable alternative to using annotation processing is probably to create plugins for the relevant IDEs to generate the code (it would be something vaguely similar to the override/implement methods feature that generates all the signatures without method bodies). However, that step would have to be repeated each time relevant parts of the code change; annotation processing would not, as far as I can tell.
In regards to the example given with the invasive amount of annotations, I don't envision the use needing to be anything like that, maybe a handful for any given class. That wouldn't stop it being abused of course.
I created a set of JavaBean annotations to generate property getters/setters, delegation, and interface extraction (edit: removed link; no longer supported)
Testing
Testing them can be quite trying...
I usually approach it by creating a project in Eclipse with the test code and building it, then making a copy and turning off annotation processing.
I can then use Eclipse to compare the "active" test project to the "expected" copy of the project.
I don't have too many test cases yet (it's very tedious to generate so many combinations of attributes), but this is helping.
Build System
Using annotations in a build system is actually very easy. Gradle makes this incredibly simple, and using it in Eclipse is just a matter of making a plugin specifying the annotation processor extension and turning on annotation processing in projects that want to use it.
I've used annotation processing in a continuous build environment, building the annotations & processor, then using it in the rest of the build. It's really pretty painless.
Processing Time
I haven't found this to be an issue - be careful of what you do in the processors. I generate a lot of code in mine and it runs fine. It's a little slower in Ant.
Note that Java 6 processors can run a little faster because they are part of the normal compilation process. However, I've had trouble getting them to work properly in a code generation capacity (I think much of the problem is Eclipse's support and running multiple-phase compiles). For now, I stick with Java 5.
Error Processing
This is one of the best-thought-through things in the annotation API. The API has a Messager object that handles all errors. Each IDE provides an implementation that converts this into appropriate error messages at the right location in the code.
The only Eclipse-specific thing I did was to cast the processing environment object so I could check whether it was being run as a build or for editor reconciliation. If editing, I exit. Eventually I'll change this to just do error checking at edit time so it can report errors as you type. Be careful, though -- you need to keep it really fast for use during reconciliation or editing gets sluggish.
Code Generation Gotcha
[added a little more per comments]
The annotation processor specifications state that you are not allowed to modify the class that contains the annotation. I suspect this is to simplify the processing (further rounds do not need to include the annotated classes, preventing infinite update loops as well)
You can generate other classes, however, and they recommend that approach.
I generate a superclass for all of the get/set methods and anything else I need to generate. I also have the processor verify that the annotated class extends the generated class. For example:
@Bean(...)
public class Foo extends FooGen
I generate a class in the same package with the name of the annotated class plus "Gen" and verify that the annotated class is declared to extend it.
I have seen someone use the compiler tree api to modify the annotated class -- this is against spec and I suspect they'll plug that hole at some point so it won't work.
I would recommend generating a superclass.
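For a concrete (if bare-bones) sketch of that approach using the Java 6 (JSR-269) API -- @Bean here is hypothetical, and the generated body is reduced to a stub:

import java.io.Writer;
import java.util.Set;
import javax.annotation.processing.AbstractProcessor;
import javax.annotation.processing.RoundEnvironment;
import javax.annotation.processing.SupportedAnnotationTypes;
import javax.annotation.processing.SupportedSourceVersion;
import javax.lang.model.SourceVersion;
import javax.lang.model.element.Element;
import javax.lang.model.element.TypeElement;
import javax.tools.Diagnostic;

// Sketch of the "generate a superclass" approach: for each @Bean class Foo,
// emit a FooGen superclass and verify that Foo actually extends it.
@SupportedAnnotationTypes("com.example.Bean")
@SupportedSourceVersion(SourceVersion.RELEASE_6)
public class BeanProcessor extends AbstractProcessor {

    @Override
    public boolean process(Set<? extends TypeElement> annotations, RoundEnvironment roundEnv) {
        for (TypeElement annotation : annotations) {
            for (Element element : roundEnv.getElementsAnnotatedWith(annotation)) {
                TypeElement type = (TypeElement) element;
                String genName = type.getQualifiedName() + "Gen"; // assumes a packaged class

                // Verify "public class Foo extends FooGen"; report through the Messager
                // so the IDE shows the error on the annotated class.
                if (!type.getSuperclass().toString().equals(genName)) {
                    processingEnv.getMessager().printMessage(Diagnostic.Kind.ERROR,
                            "Class annotated with @Bean must extend " + genName, type);
                    continue;
                }

                try {
                    Writer writer = processingEnv.getFiler()
                            .createSourceFile(genName, type).openWriter();
                    writer.write("package " + packageOf(genName) + ";\n"
                            + "public class " + simpleNameOf(genName) + " {\n"
                            + "    // generated getters/setters would go here\n"
                            + "}\n");
                    writer.close();
                } catch (Exception e) {
                    processingEnv.getMessager().printMessage(Diagnostic.Kind.ERROR,
                            e.toString(), type);
                }
            }
        }
        return true;
    }

    private static String packageOf(String qualifiedName) {
        return qualifiedName.substring(0, qualifiedName.lastIndexOf('.'));
    }

    private static String simpleNameOf(String qualifiedName) {
        return qualifiedName.substring(qualifiedName.lastIndexOf('.') + 1);
    }
}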
Overall
I'm really happy using annotation processors. Very well designed, especially looking at IDE/command-line build independence.
For now, I would recommend sticking with the Java 5 annotation processors if you're doing code generation - you need to run a separate tool called apt to process them, then do the compilation.
Note that the API for Java 5 and Java 6 annotation processors is different! The Java 6 processing API is better IMHO, but I just haven't had luck with Java 6 processors doing what I need yet.
When Java 7 comes out I'll give the new processing approach another shot.
Feel free to email me if you have questions. (scott@javadude.com)
Hope this helps!
I think if you write an annotation processor, then definitely use the Java 6 version of the API. That is the one which will be supported in the future. The Java 5 API was still in the non-official com.sun.xyz namespace.
I think we will see a lot more uses of the annotation processor API in the near future. For example Hibernate is developing a processor for the new JPA 2 query related static meta model functionality. They are also developing a processor for validating Bean Validation annotations. So annotation processing is here to stay.
Tool integration is OK. The latest versions of the mainstream IDEs contain options to configure the annotation processors and integrate them into the build process. The mainstream build tools also support annotation processing, though Maven can still cause some grief.
Testing I find a big problem, though. All tests are indirect and somehow verify the end result of the annotation processing. I cannot write any simple unit tests which just assert simple methods working on TypeMirrors or other reflection-based classes. The problem is that one cannot instantiate these types of classes outside the processor's compilation cycle. I don't think that Sun really had testability in mind when designing the API.
One specific that would be helpful in answering the question: as opposed to what? Not doing the project, or doing it without using annotations? And if not using annotations, what are the alternatives?
Personally, I find excessive annotations unreadable, and many times too inflexible. Take a look at this for one method on a web service to implement a vendor required WSDL:
@WebMethod(action=QBWSBean.NS+"receiveResponseXML")
@WebResult(name="receiveResponseXML"+result,targetNamespace = QBWSBean.NS)
@TransactionAttribute(TransactionAttributeType.NOT_SUPPORTED)
public int receiveResponseXML(
        @WebParam(name = "ticket",targetNamespace = QBWSBean.NS) String ticket,
        @WebParam(name = "response",targetNamespace = QBWSBean.NS) String response,
        @WebParam(name = "hresult",targetNamespace = QBWSBean.NS) String hresult,
        @WebParam(name = "message",targetNamespace = QBWSBean.NS) String message) {
I find that code highly unreadable. An XML configuration alternative isn't necessarily better, though.

Is it worth the effort to move from a hand-crafted Hibernate mapping file to annotations?

I've got a webapp whose original code base was developed with a hand-crafted Hibernate mapping file. Since then, I've become fairly proficient at 'coding' my hbm.xml file. But all the cool kids are using annotations these days.
So, the question is: Is it worth the effort to refactor my code to use Hibernate annotations? Will I gain anything, other than being hip and modern? Will I lose any of the control I have in my existing hand-coded mapping file?
A sub-question is, how much effort will it be? I like my databases lean and mean. The mapping covers only a dozen domain objects, including two sets, some subclassing, and about 8 tables.
Thanks, dear SOpedians, in advance for your informed opinions.
"If it ain't broke - don't fix it!"
I'm an old fashioned POJO/POCO kind of guy anyway, but why change to annotations just to be cool? To the best of my knowledge you can do most of the stuff as annotations, but the more complex mappings are sometimes expressed more clearly as XML.
One thing you'll gain from using annotations instead of an external mapping file is that your mapping information will be on the classes and fields themselves, which improves maintainability. You add a field, you immediately add the annotation. You remove one, you also remove the annotation. You rename a class or a field, the annotation is right there and you can rename the table or column as well. You change the class inheritance, and it's taken into account. You don't have to go and edit an external file some time later. This makes the whole thing more efficient and less error-prone.
On the other side, you'll lose the global view your mapping file used to give you.
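For a sense of what that looks like, a minimal annotated class (the table and column names are made up; the equivalent hbm.xml entry disappears entirely):

import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import javax.persistence.Table;

// The mapping lives next to the fields it describes: rename a field or add a
// column and the change happens in one place.
@Entity
@Table(name = "customer")
public class Customer {

    @Id
    @GeneratedValue
    private Long id;

    @Column(name = "full_name", nullable = false, length = 100)
    private String name;

    public Long getId() { return id; }
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
}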
I've recently done both in a project and found:
I prefer writing annotations to XML (plays well with static typing of Java, auto-complete in IDE, refactoring, etc). I like seeing the stuff all woven together rather than going back and forth between code and XML.
Encodes db information in your classes. Some people find that gross and unacceptable. I can't say it bothered me. It has to go somewhere and we're going to rebuild the WAR for a change regardless.
We actually went all the way to JPA annotations, but there are definitely cases where the JPA annotations are not enough, so we then had to use either Hibernate annotations or config to tweak things.
Note that you can actually use both annotations AND hbm files. It might be a nice hybrid to specify the O part in annotations and the R part in hbm files, but it sounds like more trouble than it's worth.
As much as I like to move on to new and potentially better things I need to remember to not mess with things that aren't broken. So if having the hibernate mappings in a separate file is working for you now I wouldn't change it.
I definitely prefer annotations, having used them both. They are much more maintainable and since you aren't dealing with that many classes to re-map, I would say it's worth it. The annotations make refactoring much easier.
All the features are supported both in the XML and in annotations.
You will still be able to override your annotations with XML declarations.
As for the effort, I think it is worth it, as you will be able to see it all in one place and not switch between your code and the XML file (unless, of course, you are using two monitors ;) )
"The only thing you'll gain from using annotations"
I would probably argue that this is the thing you want to gain from using annotations. Because you don't get compile-time safety with NHibernate, this is the next best thing.
"If it ain't broke - don't fix it!"
@Macka - Thanks, I needed to hear that. And thanks to everyone for your answers.
While I am in the very fortunate position of having an insane amount of professional and creative control over my work, and can bring in just about any technology, library, or tool for just about any reason (barring expensive stuff), including "because all the cool kids are using it", it does not really make sense to port what amounts to a significant portion of the core of an existing project.
I'll try out Hibernate or JPA annotations with a green-field project some time. Unfortunately, I rarely get new completely independent projects.
