I am looking for a tool, Java code, or a class library/API that can generate an XSD from XML files (something like the xsd.exe utility in the .NET Framework SDK).
These tools can provide a good starting point, but they aren't a substitute for thinking through what the actual schema constraints ought to be. You get the opportunity for two kinds of errors: (1) allowing XML that shouldn't be allowed and (2) disallowing XML that should be ok.
As an example, pretend that you want to infer an XSD from a few thousand patient records that include a 'gender' tag (I used to work on medical records software). The tool would likely encounter 'M' and 'F' as values and might deduce that the element is an enumeration. However, other valid (although rarely used) values are B (both), U (unknown), or N (none). These are rare, of course. So, if you used your derived schema as an input validator, it would perform well until a patient with multiple sex organs was admitted to the hospital.
Conversely, to avoid this error, an XSD generator might not add enumerated type restrictions (I can't remember what these are called in schemas), and your application would work well until it encountered an errant record with gender=X.
So, beware. It's best to use these tools only as a starting point. Also, they tend to produce verbose and redundant schemas because they can't figure out patterns as well as humans.
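To make both failure modes concrete, here is a minimal sketch using the JDK's javax.xml.validation API. The two schemas are hypothetical stand-ins for what a generator might infer (they are not output from any particular tool), and the class and method names are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

final class GenderSchemaDemo {
    // Hypothetical schema inferred from records that only ever contained M and F.
    static final String STRICT_XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='gender'>"
      + "<xs:simpleType><xs:restriction base='xs:string'>"
      + "<xs:enumeration value='M'/><xs:enumeration value='F'/>"
      + "</xs:restriction></xs:simpleType>"
      + "</xs:element></xs:schema>";

    // Hypothetical schema from a generator that never infers enumerations.
    static final String LOOSE_XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='gender' type='xs:string'/>"
      + "</xs:schema>";

    // Returns true if the XML validates against the schema, false if the validator rejects it.
    static boolean isValid(String xsd, String xml) {
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(xsd)));
            Validator validator = schema.newValidator();
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // schema violation (or malformed input)
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Error (2): a legitimate but rare value is rejected by the inferred enumeration.
        System.out.println("strict accepts U: " + isValid(STRICT_XSD, "<gender>U</gender>"));
        // Error (1): an errant value slips through the unrestricted schema.
        System.out.println("loose accepts X: " + isValid(LOOSE_XSD, "<gender>X</gender>"));
    }
}
```

Either way, a person who knows the domain still has to decide which values belong in the enumeration.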
Check Castor, I think it has the functionality you are looking for. They also provide you with an ant task that creates XSD schemas from XML files.
PS: I suggest you add more specific tags in the future. For instance, tagging with xml, xsd and java will increase your chances of getting answers.
You can use the xsd-gen-0.2.0-jar-with-dependencies.jar file to convert XML to XSD.
The command for it is "java -jar xsd-gen-VERSION-jar-with-dependencies.jar /path/to/xml.xml > /path/to/my.xsd"
Try the xsd-gen project from Google.
https://code.google.com/p/xsd-gen/
In the thread What’s your favorite “programmer ignorance” pet peeve?, the following answer appears, with a large number of upvotes:
Programmers who build XML using string concatenation.
My question is, why is building XML via string concatenation (such as a StringBuilder in C#) bad?
I've done this several times in the past, as it's sometimes the quickest way for me to get from point A to point B when it comes to the data structures/objects I'm working with. So far, I have come up with a few reasons why this isn't the greatest approach, but is there something I'm overlooking? Why should this be avoided?
Probably the biggest reason I can think of is you need to escape your strings manually, and most new programmers (and even some experienced programmers) will forget this. It will work great for them when they test it, but then "randomly" their apps will fail when someone throws an & symbol in their input somewhere. Ok, I'll buy this, but it's really easy to prevent the problem (SecurityElement.Escape to name one).
When I do this, I usually omit the XML declaration (i.e. <?xml version="1.0"?>). Is this harmful?
Performance penalties? If you stick with proper string concatenation (i.e. StringBuilder), is this anything to be concerned about? Presumably, a class like XmlWriter will also need to do a bit of string manipulation...
There are more elegant ways of generating XML, such as using XmlSerializer to automatically serialize/deserialize your classes. Ok sure, I agree. C# has a ton of useful classes for this, but sometimes I don't want to make a class for something really quick, like writing out a log file or something. Is this just me being lazy? If I am doing something "real" this is my preferred approach for dealing w/ XML.
You can end up with invalid XML, but you will not find out until you parse it again - and then it is too late. I learned this the hard way.
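The thread is about C#, but the same trap is easy to sketch in Java with JDK classes only. This is an illustrative example with made-up class and method names: the concatenated version breaks as soon as the input contains an ampersand, while the XMLStreamWriter version escapes it.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class EscapingDemo {
    // Naive approach: the value is pasted into the markup unescaped.
    static String byConcatenation(String title) {
        return "<product><title>" + title + "</title></product>";
    }

    // Library approach: the writer escapes character data for us.
    static String byWriter(String title) {
        try {
            StringWriter out = new StringWriter();
            XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            writer.writeStartElement("product");
            writer.writeStartElement("title");
            writer.writeCharacters(title);
            writer.writeEndElement();
            writer.writeEndElement();
            writer.close();
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    // Parses the string and reports whether it is well-formed XML.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler()); // rethrows fatal errors without printing them
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String title = "Fish & Chips";
        System.out.println(isWellFormed(byConcatenation(title)));
        System.out.println(isWellFormed(byWriter(title)));
    }
}
```

The concatenated output looks fine with test data like "Widget", which is why the problem tends to surface only in production.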
I think readability, flexibility and scalability are important factors. Consider the following piece of Linq-to-Xml:
XDocument doc = new XDocument(new XDeclaration("1.0", "UTF-8", "yes"),
    new XElement("products", from p in collection
        select new XElement("product",
            new XAttribute("guid", p.ProductId),
            new XAttribute("title", p.Title),
            new XAttribute("version", p.Version))));
Can you find an easier way to do it than this? I can output it to a browser, save it to a document, add attributes/elements in seconds and so on ... just by adding a couple of lines of code. I can do practically everything with it without much effort.
Actually, I find the biggest problem with string concatenation is not getting it right the first time, but rather keeping it right during code maintenance. All too often, a perfectly-written piece of XML using string concat is updated to meet a new requirement, and string concat code is just too brittle.
As long as the alternatives were XML serialization and XmlDocument, I could see the simplicity argument in favor of string concat. However, ever since XDocument et al., there is just no reason to use string concat to build XML anymore. See Sander's answer for the best way to write XML.
Another benefit of XDocument is that XML is actually a rather complex standard, and most programmers simply do not understand it. I'm currently dealing with a person who sends me "XML", complete with unquoted attribute values, missing end tags, improper case sensitivity, and incorrect escaping. But because IE accepts it (as HTML), it must be right! Sigh... Anyway, the point is that string concatenation lets you write anything, but XDocument will force standards-compliant XML.
I wrote a blog entry back in 2006 moaning about XML generated by string concatenation; the simple point is that if an XML document fails to validate (encoding issues, namespace issues and so on) it is not XML and cannot be treated as such.
I have seen multiple problems with XML documents that can be directly attributed to generating XML documents by hand using string concatenation, and nearly always around the correct use of encoding.
Ask yourself this; what character set am I currently encoding my document with ('ascii7', 'ibm850', 'iso-8859-1' etc)? What will happen if I write a UTF-16 string value into an XML document that has been manually declared as 'ibm850'?
Given the richness of the XML support in .NET with XmlDocument and now especially with XDocument, there would have to be a seriously compelling argument for not using these libraries over basic string concatenation IMHO.
I think the problem is that you aren't looking at the XML file as a logical data store, but as a simple text file where you write strings.
It's obvious that those libraries do string manipulation for you, but reading/writing XML should be treated like saving data into a database or something logically similar.
If you need trivial XML then it's fine. It's just that the maintainability of string concatenation breaks down when the XML becomes larger or more complex. You pay either at development or at maintenance time. The choice is always yours - but history suggests that maintenance is always more costly, and thus anything that makes it easier is generally worthwhile.
You need to escape your strings manually. That's right. But is that all? Sure, you can put the XML spec on your desk and double-check every time that you've considered every possible corner-case when you're building an XML string. Or you can use a library that encapsulates this knowledge...
Another point against using string concatenation is that the hierarchical structure of the data is not clear when reading the code. In @Sander's example of Linq-to-XML, for example, it's clear to what parent element the "product" element belongs, to what element the "title" attribute applies, etc.
As you said, it's just awkward to build XML correctly using string concatenation, especially now that you have LINQ to XML, which allows for simple construction of an XML graph and will get namespaces, etc. correct.
Obviously context and how it is being used matters, such as in the logging example string.Format can be perfectly acceptable.
But too often people ignore these alternatives when working with complex XML graphs and just use a StringBuilder.
The main reason is DRY: Don't Repeat Yourself.
If you use string concat to do XML, you will constantly be repeating the functions that keep your string as a valid XML document. All the validation would be repeated, or not present. Better to rely on a class that is written with XML validation included.
I've always found creating an XML to be more of a chore than reading in one. I've never gotten the hang of serialization - it never seems to work for my classes - and instead of spending a week trying to get it to work, I can create an XML file using strings in a mere fraction of the time and write it out.
And then I load it in using an XMLReader tree. If the XML file doesn't read as valid, I go back, find the problem in my saving routines and correct it. I refuse to perform mission-critical work until I have a working save/load system and know my tools are solid.
I guess it comes down to programmer preference. Sure, there are different ways of doing things, for sure, but for developing/testing/researching/debugging, this would be fine. However I would also clean up my code and comment it before handing it off to another programmer.
Because regardless of the fact you're using StringBuilder or XMLNodes to save/read your file, if it is all gibberish mess, nobody is going to understand how it works.
Maybe it won't ever happen, but what if your environment switches to XML 2.0 someday? Your string-concatenated XML may or may not be valid in the new environment, but XDocument will almost certainly do the right thing.
Okay, that's a reach, but especially if your not-quite-standards-compliant XML doesn't specify an XML version declaration... just saying.
I am currently trying to use the XMLUnit library to compare two XML files.
One of them, the candidate, is generated by my code from Java Objects (using JAXB) and the other one is the reference (I cannot modify it).
Basically I am trying to prove that given a reference XML file I can unserialize it (using Jaxb and some classes of my own) then serialize it back to another file and still have the same content.
The library seems to provide the services I need, but when the generated file is not properly indented ("pretty-printed"), the comparison fails; when the indentation is OK, it succeeds.
For example, when the candidate is generated there is no indentation and the content is a one-liner. If I indent it properly (manually), the comparison is OK.
Here is the error message generated by XMLUnit:
[different] Expected number of child nodes '3' but was '1'
Do you guys have any idea to solve this?
Maybe the solution is to generate a pretty-print version of the candidate, in this case do you have an idea to combine it with the JAXB serialiser?
By the way, if you know a better solution in Java to compare XML files, I'll be glad to hear about it ;)
Thanks in advance for your help.
You can relax some of the constraints used by XMLUnit when comparing two trees by setting properties on the org.custommonkey.xmlunit.XMLUnit class.
In your case, you probably want:
XMLUnit.setIgnoreComments(true);
XMLUnit.setIgnoreWhitespace(true);
XMLUnit.setIgnoreDiffBetweenTextAndCDATA(true);
You may also find the setIgnoreAttributeOrder property helpful.
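For what it's worth, the "3 but was 1" in the error message is what a plain DOM parse produces when whitespace is significant: the indentation around a child element becomes two extra text nodes. A minimal JDK-only sketch (no XMLUnit involved, class and method names invented for illustration):

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class WhitespaceNodesDemo {
    // Counts the direct child nodes of the root element, text nodes included.
    static int rootChildCount(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getChildNodes().getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String indented = "<root>\n  <item/>\n</root>"; // text, element, text
        String oneLine = "<root><item/></root>";        // element only
        System.out.println(rootChildCount(indented));   // 3
        System.out.println(rootChildCount(oneLine));    // 1
    }
}
```

Telling the comparison to ignore whitespace, as above, makes both documents look like the one-line form.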
We work with messages that are text-based (no XML). Our goal is to validate the messages; a message is valid if its content is correct. We developed our own language, defined in XML, to express rules on the message. We need to add more complex rules, and we think it is now time to look at alternatives and use a real rules engine. We support these types of rules:
name in a list of values or a regular expression, e.g. {SMITH, MOORE, A*}
name is present in the message
name is not present in the message
if condition then name = John else name = Jane
Note that the condition is simple and does not contain any logical operators.
We need to support these types of rules:
if then else but the condition contains logical operators
for ... loop :
For all the customers in the message we want at least one from the USA and at least one from France
For all the customers in the message we want at least five that are from the USA and are buying more than $1000 a year
For any customer with name John, the last name must be Doe
Total customers with name John < 15
The name of the company is equal to the name of the company in another location in the message
The rules will depend on the type of messages we process. So we were investigating several existing solutions like:
JESS
OWL (Consistency checking)
Schematron (by transforming the message in XML)
What would be the best alternatives considering that we develop in Java? Another thing to consider is that we should be able to do error reporting like error description, error location (line and column number).
It sounds to me like you're on the right track already; my suggestions are:
Inspect your text-based messages directly with a parser/interpreter and apply rules over the generated objects. @Kdeveloper has suggested JavaCC for generating parser/interpreters, and I can add to this by personally vouching for ANTLRv3, which is an excellent environment for generating parser/interpreter/transformers in Java (amongst other languages). From there, you could use Jess or some other Java rules engine to validate the objects you generate. You could possibly also try encoding your rules into a parser/interpreter directly, but I'd advise against this and instead opt for separating the rules out to keep the parsing and semantic validation steps separate.
Transforming your text-based messages to XML to apply Schematron is also another viable option, but you'll obviously need to parse your text messages to get them into XML anyway. For this, I'd still suggest looking at JavaCC or ANTLRv3, and perhaps populating a pre-determined object model which can be marshaled to XML (such as that which can be generated by Castor or JAXB from a W3C XML Schema). From there, you can apply Schematron over the resulting XML.
I'd argue that transforming to OWL is the trickiest of your suggested options, but it could be the most powerful. To start with, you'll probably want an ontology terminology (TBox) (the classes, properties, etc.) to map your instance data (ABox) into. From there, consistency checking will only get you so far; many of the kinds of constraints you've outlined as wanting to capture simply can't be represented in OWL and validated using a DL-reasoner alone. However, if you couple your OWL ontology with SWRL rules (for example), you have a chance of capturing many of the types of rules you've outlined. Look at the types of rules and built-ins available in SWRL to see if this is expressive enough for you. If it is, you can employ DL-reasoners with SWRL support such as Pellet or HermiT. Note that individual implementations of OWL/SWRL reasoners such as these may implement more or less of the W3C specification, so you'll need to inspect each to determine its applicability.
If your rules are static (i.e. known at compile time), you could do this with the well-known Java parser generator JavaCC.
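To illustrate the first suggestion (parse into objects, then apply rules to them), here is a rough sketch of three of the rules from the question written as plain Java predicates. The Customer class and its fields are hypothetical; in practice a parser generated by JavaCC or ANTLR would populate them, and a rules engine such as Jess would replace the hand-written checks.

```java
import java.util.Arrays;
import java.util.List;

final class CustomerRules {
    /** Hypothetical object that the message parser would produce. */
    static final class Customer {
        final String firstName;
        final String lastName;
        final String country;
        final double yearlySpend;

        Customer(String firstName, String lastName, String country, double yearlySpend) {
            this.firstName = firstName;
            this.lastName = lastName;
            this.country = country;
            this.yearlySpend = yearlySpend;
        }
    }

    // "At least one customer from the given country."
    static boolean hasCustomerFrom(List<Customer> customers, String country) {
        return customers.stream().anyMatch(c -> c.country.equals(country));
    }

    // "At least minCount customers from the country who buy more than minSpend a year."
    static boolean hasBigSpenders(List<Customer> customers, String country,
                                  double minSpend, long minCount) {
        return customers.stream()
                .filter(c -> c.country.equals(country) && c.yearlySpend > minSpend)
                .count() >= minCount;
    }

    // "For any customer with first name John, the last name must be Doe."
    static boolean everyJohnIsDoe(List<Customer> customers) {
        return customers.stream()
                .filter(c -> c.firstName.equals("John"))
                .allMatch(c -> c.lastName.equals("Doe"));
    }

    public static void main(String[] args) {
        List<Customer> customers = Arrays.asList(
                new Customer("John", "Doe", "USA", 1500.0),
                new Customer("Marie", "Martin", "France", 200.0));
        System.out.println(hasCustomerFrom(customers, "USA") && hasCustomerFrom(customers, "France"));
        System.out.println(hasBigSpenders(customers, "USA", 1000.0, 5));
        System.out.println(everyJohnIsDoe(customers));
    }
}
```

For the error reporting requirement, the parsed objects would also need to carry the line and column they came from; that is left out here.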
I'm a newbie when it comes to properties, and I read that XML is the preferred way to store these. I noticed however, that writing a regular .properties file in the style of
foo=bar
fu=baz
also works. This would mean a lot less typing (and maybe easier to read and more efficient as well). So what are the benefits of using an XML file?
In XML you can store more complex (e.g. hierarchical) data than in a properties file, so it depends on your use case. If you just want to store a small number of direct properties, a properties file is easier to handle (though the Java Properties class can read XML-based properties, too).
It would make sense to keep your configuration interface as generic as possible anyway, so that you can easily switch to another representation (e.g. by using Apache Commons Configuration) if you need to.
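As a small sketch of that last point, java.util.Properties can load the same key/value pairs from either format. The sample data and class name are made up for illustration; the XML layout with its DOCTYPE is the one documented for Properties.loadFromXML.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

final class PropertiesFormatsDemo {
    static final String PLAIN = "foo=bar\nfu=baz\n";

    static final String XML =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
      + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
      + "<properties>\n"
      + "  <entry key=\"foo\">bar</entry>\n"
      + "  <entry key=\"fu\">baz</entry>\n"
      + "</properties>\n";

    // Loads the classic key=value format.
    static Properties fromPlain(String text) {
        try {
            Properties props = new Properties();
            props.load(new StringReader(text));
            return props;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Loads the XML properties format; the encoding comes from the XML declaration.
    static Properties fromXml(String xml) {
        try {
            Properties props = new Properties();
            props.loadFromXML(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return props;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(fromPlain(PLAIN).equals(fromXml(XML)));
    }
}
```

Note that the XML flavour here is still flat key/value data; it does not give you hierarchy by itself.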
The biggest benefit to using an XML file is that XML declares its encoding, while .properties does not.
If you are translating these properties files to N languages, it is possible that these files could come back in N different encodings. And if you're not careful, you or someone else could irreversibly corrupt the character encodings.
If you have a lot of repeating data, it can be simpler to process
<connections>
<connection>this</connection>
<connection>that</connection>
<connection>the other</connection>
</connections>
than it is to process
connection1=this
connection2=that
connection3=the other
especially if you are expecting to have to store a lot of data, or it must be stored in a definite hierarchy
If you are just storing a few scalar values though, I'd go for the simple Properties approach every time
If you have both hierarchical data & duplicate namespaces, then use XML.
1) To emulate just a hierarchical structure in a properties file, simply use dot notation:
a.b=The Joker
a.b.c=Batgirl
a.b=Batman
a.b=Superman
a.b.c=Supergirl
So, complex (hierarchical) data representation is not a reason to use XML.
2) For just repeating data, we can use a 3rd party library like ini4j, which explicitly attaches a count identifier in Java to keys that are repeated implicitly in the properties file itself.
a.b=The Joker
a.b=Batgirl
a.b=Batman
is translated to (in the background)
a.b1=The Joker
a.b2=Batgirl
a.b3=Batman
However, enumerating same-name properties still doesn't maintain the specific parent-child relationships, i.e. how do we represent whether Batgirl is with The Joker or Batman?
So, XML is required when both features are needed. We can now decide whether the 1st XML entry is what we want or the 2nd.
[a]
[b]Joker[/b]
[b]
[c]Batgirl[/c]
[/b]
[/a]
--or--
[a]
[b]Batman[/b]
[b]
[c]Batgirl[/c]
[/b]
[/a]
Further detail in ....
http://ilupper.blogspot.com/2010/05/xml-vs-properties.html
XML is handy for complex data structures and or relationships. It does a decent job for having a "common language" between systems.
However, XML comes at a cost. It is heavy to consume. You've got to load a parser, ensure the file is in the correct format, find the information, etc.
Properties files, by contrast, are pretty lightweight and easy to read. They work well for simple key/value pairs.
It depends on the data you're encoding. With XML, you can define a more complex representation of the configuration data in your application. Take something like the struts framework as an example. Within the framework you have a number of Action classes that can contain 1...n number of forward branches. With an XML configuration file, you can define it like:
<action class="MyActionClass">
<forward name="prev" targetAction="..."/>
<forward name="next" targetAction="..."/>
<forward name="help" targetAction="..."/>
</action>
This kind of association is difficult to accomplish using just the key-value pair representation of the properties file. Most likely, you would need to come up with a delimiting character and then include all of the forward actions on a single property separated by this delimiting character. It's quite a bit of work for a hackish solution.
Yet, as you pointed out, the XML syntax can become a burden if you just want to state something very simple, like set feature blah to true.
The disadvantages of XML:
It is hard to read - the tags make it look busier than it really is
The hierarchies and tags make it hard to edit and more prone to human errors
It is not possible to "append" to an XML property file to introduce a new property or provide an overriding value for an existing property so that the last one wins. The ability to append a property can be very powerful - we can implement a property management logic around this so that certain properties are "hot" and we don't need to restart the instance when these change
The Java property file solves the above problems. Consistent naming conventions and dot notation can help in solving the issue of hierarchy.
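A minimal sketch of both points, using only java.util.Properties: a value appended later in the file overrides the earlier one, and a dot-notation prefix can be used to pull out one "branch" of the hierarchy. The key names and the subtree helper are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

final class DotNotationDemo {
    static Properties load(String text) {
        try {
            Properties props = new Properties();
            props.load(new StringReader(text)); // later duplicates overwrite earlier ones
            return props;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical helper: all entries whose key starts with "<prefix>.", sorted by key.
    static Map<String, String> subtree(Properties props, String prefix) {
        Map<String, String> result = new TreeMap<>();
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith(prefix + ".")) {
                result.put(key, props.getProperty(key));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String file = "db.host=localhost\n"
                    + "db.port=5432\n"
                    + "ui.theme=dark\n"
                    + "db.host=db.example.org\n"; // appended override: the last one wins
        Properties props = load(file);
        System.out.println(props.getProperty("db.host"));
        System.out.println(subtree(props, "db"));
    }
}
```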
I have a requirement where I need to generate HTML forms on the fly based on many different XML schemas (as of now I have 20 of them, and the count keeps increasing). I need to collect data from the user to create instance documents corresponding to each of them and then store the instance documents in a database.
Challenges:
1) The schemas have a lot of unbounded complex types, so we don't know in advance the number and type of inputs to be created. Pre-creating the HTML is therefore not an option.
2) Even if I can handle generating the form on the fly, the problem is collecting the data entered, since dynamically generated forms will have dynamic ids/names for their inputs.
Can anyone suggest the best way to implement this?
thank you in advance
It seems to me like a clear case for XSLT.
Generating HTML from XML through XSLT is the primary goal of XSLT.
As for the id/names, you can create an XSLT which will also generate a set of id/names in a way that you can use.
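As a rough sketch of the idea, the stylesheet below turns every typed xs:element in a schema into a labelled text input whose id and name are taken from the element name, and the JDK's built-in XSLT processor applies it. The sample schema, stylesheet and class name are invented for illustration; a real XSD (unbounded elements, nested complex types, namespaces) needs a much richer stylesheet.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class XsdToFormDemo {
    // Hypothetical stylesheet: one <label>/<input> pair per xs:element that has a type attribute.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:xs='http://www.w3.org/2001/XMLSchema'"
      + " exclude-result-prefixes='xs'>"
      + "<xsl:output method='html'/>"
      + "<xsl:template match='/'>"
      + "<form method='post'><xsl:apply-templates select='//xs:element[@type]'/></form>"
      + "</xsl:template>"
      + "<xsl:template match='xs:element'>"
      + "<label for='{@name}'><xsl:value-of select='@name'/></label>"
      + "<input type='text' id='{@name}' name='{@name}'/>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Hypothetical sample schema with two simple fields.
    static final String SAMPLE_XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='person'><xs:complexType><xs:sequence>"
      + "<xs:element name='firstName' type='xs:string'/>"
      + "<xs:element name='age' type='xs:int'/>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:schema>";

    // Applies the stylesheet to the schema and returns the generated HTML.
    static String toHtmlForm(String xsd) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xsd)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toHtmlForm(SAMPLE_XSD));
    }
}
```

Because the input names are derived from the schema, the code that receives the submitted form can map each posted field back to its element by name.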
Use WSDL2XForms to create XForms from XML Schemas (XSD). Then publish them with Chiba (chiba.sourceforge.net) - it converts these XForms to standard HTML forms on the server side.
The Google Code project xsd-forms seems to be a promising approach.
An XQuery-based translator from XSD to XForms is available at http://en.wikibooks.org/wiki/XRX/XForms_Generator.
I don't know much about that one: http://nunojob.wordpress.com/2008/01/05/creating-a-user-interface-for-xml-schema-using-xforms/. Seems to be a presentation only.
We had a problem somewhat like this. One of our team thought that we ought to be able to create a web form UI on the fly to accept data conforming to an XSD. It turned out that this is very difficult ... given all the complexity of full XSD. So we ended up inventing our own schema language (which was both simpler and richer than XSD) and using this as the basis for generating our UI layouts. We also implemented a tool-chain for creating and validating the schemas and for generating equivalent XSDs and OWL schemas.