We work with messages that are text-based (no XML). Our goal is to validate the messages; a message is valid if its content is correct. We developed our own language, defined in XML, to express rules on the messages. We need to add more complex rules, and we think it is now time to look at alternatives and use a real rules engine. We currently support these types of rules:
name is in a list of values or matches a regular expression, e.g. {SMITH, MOORE, A*}
name is present in the message
name is not present in the message
if condition then name = John else name = Jane
Note that the condition is simple and does not contain any logical operators.
We need to support these types of rules:
if then else but the condition contains logical operators
for ... loop :
For all the customers in the message we want at least one from the USA and at least one from France
For all the customers in the message we want at least five that are from the USA and are buying more than $1000 a year
For any customer with name John, the last name must be Doe
Total customers with name John < 15
The name of the company is equal to the name of the company in another location in the message
The rules will depend on the type of messages we process. So we were investigating several existing solutions like:
JESS
OWL (Consistency checking)
Schematron (by transforming the message in XML)
What would be the best alternatives considering that we develop in Java? Another thing to consider is that we should be able to do error reporting like error description, error location (line and column number).
It sounds to me like you're on the right track already; my suggestions are:
Inspect your text-based messages directly with a parser/interpreter and apply rules over the generated objects. #Kdeveloper has suggested JavaCC for generating parser/interpreters, and I can add to this by personally vouching for ANTLRv3 which is an excellent environment for generating parser/interpreter/transformers in Java (amongst other languages). From there, you could use Jess or some other Java rules engine to validate the objects you generate. You could possibly also try encoding your rules into a parser/interpreter directly, but I'd advise against this and instead opt for separating the rules out to keep the parsing and semantic validation steps separate.
Transforming your text-based messages to XML to apply Schematron is also another viable option, but you'll obviously need to parse your text messages to get them into XML anyway. For this, I'd still suggest looking at JavaCC or ANTLRv3, and perhaps populating a pre-determined object model which can be marshaled to XML (such as that which can be generated by Castor or JAXB from a W3C XML Schema). From there, you can apply Schematron over the resulting XML.
I'd argue that transforming to OWL is the trickiest of the options you suggest, but it could be the most powerful. To start with, you'll probably want an ontology terminology (TBox: the classes, properties, etc.) to map your instance data (ABox) into. From there, consistency checking will only get you so far; many of the kinds of constraints you've outlined as wanting to capture simply can't be represented in OWL and validated using a DL reasoner alone. However, if you couple your OWL ontology with SWRL rules (for example), you have a chance of capturing many of the types of rules you've outlined. Look at the types of rules and built-ins available in SWRL to see if this is expressive enough for you. If it is, you can use DL reasoners with SWRL support such as Pellet or HermiT. Note that individual implementations of OWL/SWRL reasoners such as these may implement more or less of the W3C specification, so you'll need to inspect each to determine its applicability.
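The first suggestion above (parse the message into objects, then apply rules over those objects) can be sketched in plain Java before committing to a rules engine. This is only an illustration under assumptions: the Customer record, its fields, and the CustomerRules class are hypothetical names, the parser that would fill in the line and column values is not shown, and records require Java 16 or later.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical object model; a parser (for example one generated by ANTLR or
// JavaCC) would build these objects and record where each one was found.
record Customer(String firstName, String lastName, String country,
                double yearlySpend, int line, int column) {}

class CustomerRules {

    /** Returns human-readable violations; an empty list means the message is valid. */
    static List<String> validate(List<Customer> customers) {
        List<String> errors = new ArrayList<>();

        // "At least one customer from the USA and at least one from France"
        for (String country : List.of("USA", "France")) {
            if (customers.stream().noneMatch(c -> c.country().equals(country))) {
                errors.add("no customer from " + country);
            }
        }

        // "At least five customers from the USA buying more than $1000 a year"
        long bigUsBuyers = customers.stream()
                .filter(c -> c.country().equals("USA") && c.yearlySpend() > 1000)
                .count();
        if (bigUsBuyers < 5) {
            errors.add("only " + bigUsBuyers + " US customers spend more than $1000 a year");
        }

        // "For any customer named John, the last name must be Doe"
        for (Customer c : customers) {
            if (c.firstName().equals("John") && !c.lastName().equals("Doe")) {
                errors.add("line " + c.line() + ", column " + c.column()
                        + ": John must have last name Doe, found " + c.lastName());
            }
        }

        // "Total customers named John < 15"
        long johns = customers.stream().filter(c -> c.firstName().equals("John")).count();
        if (johns >= 15) {
            errors.add(johns + " customers are named John; fewer than 15 are allowed");
        }
        return errors;
    }
}
```

Keeping the source position on each parsed object is what makes the requested error location reporting possible, whichever engine ends up evaluating the rules.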
If your rules are static (i.e. known at compile time), you could do this with a well-known Java parser generator: JavaCC.
Trying to build hashmaps using values from XML for the purpose of sending the list of maps off through an API call to build a table in another app. I need to find the most efficient way, whether library, pattern, or even another language, to find the values in the XML and build this structure.
As of now, I am building a class for each table type. I am doing this because there are dozens of types of tables and they are all looking for different values. For example, imagine I am working for a supermarket and there is an XML document holding all the items in the store along with various details about each item. I need to build a table of items from the XML for each section in the supermarket. I would have a GroceryBuilder class, a ClothingBuilder class, and so on. So in the GroceryBuilder class, I would traverse the XML, find all the grocery items, and add the items, along with other data related to those items, to a hashmap, looking like this, where [n] equals a row:
("Grocery[1].Item", "Apple"),
("Grocery[1].Description", "Granny Smith"),
("Grocery[1].Color", "Green"),
("Grocery[2].Item", "Paper Plates"),
("Grocery[2].PricePerEach", ".03"),
("Grocery[2].Purpose", "Eating"),
("Grocery[3].Item", "Bologna"),
("Grocery[3].Description", "Meat-like"),
("Grocery[3].Purpose", "Sustainence"),
As you can see above, each row can have different column values because not every cell in the table is populated.
Here is an example of what the XML could look like:
<grocery>
<food>
<produce>
<apple>
<description>Granny Smith</description>
<itemCd>93jfu4n</itemCd>
<color>Green</color>
</apple>
<pear>
<description>Concorde</description>
<itemCd>0272ve6dg3</itemCd>
<color>Yellow</color>
</pear>
<banana>
<description>Regular</description>
<itemCd>2je7c3</itemCd>
<color>Yellow</color>
</banana>
...
<insert 50 types of produce here/>
...
</produce>
<meat>
<bologna>
<description>Meat-like</description>
<itemCd>9dmd623</itemCd>
<purpose>Sustainence</purpose>
</bologna>
...
<insert 50 types of meat here/>
...
</meat>
</food>
<sporting goods>
...
<insert 50 types of sporting goods here/>
...
</sporting goods>
<clothing>
...
<insert 50 types of clothing here/>
...
</clothing>
</grocery>
The problem I am facing is that there are potentially 100+ table types (using the example above, imagine a table for every section in the store), each looking for specific values, so I would potentially have to build 100+ different classes. I am looking for a more generic way to build these structures.
The challenge is that there are many conditions on the values I am getting from the XML. For instance, insert the value into the XML only if the ItemCd is a certain value. Or if the Apple Description equals whatever value, insert this value instead.
So far, I've been building these maps manually, looping over each item in a section (i.e. "produce"), checking conditions, and inserting the values based on those conditions. But this is going to be a ton of effort if I must do this for 100+ tables. Is there an established pattern or library that could handle this better? Or even a language other than Java?
XPath approach
To select parts of an XML document use XPath; to transform from one XML document to another, or even construct an XML document from text or JSON, use XSLT (which embeds XPath). XPath supports the powerful conditionality that you describe and more.
See also
How to read XML using XPath in Java
XSLT processing with Java?
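As a minimal sketch of the XPath approach, the following uses the standard javax.xml.xpath API to select values with a condition in the expression itself. The class name XPathSelect and the trimmed-down XML string are made up for illustration; the element names follow the question's example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathSelect {

    /** Evaluates an XPath expression against an XML string and returns the text of each match. */
    static List<String> select(String xml, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
                expression, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getTextContent());
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<grocery><food><produce>"
                + "<apple><description>Granny Smith</description><color>Green</color></apple>"
                + "<pear><description>Concorde</description><color>Yellow</color></pear>"
                + "</produce></food></grocery>";
        // The predicate [color='Yellow'] carries the condition; no Java if-statement is needed.
        System.out.println(select(xml, "/grocery/food/produce/*[color='Yellow']/description"));
    }
}
```

Because the conditions live in the expression strings, each table type could be described by a list of expressions kept in configuration rather than by its own class.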
Data binding approach
There are data binding tools such as Jakarta XML Binding that can help automate the mapping between XML and Java objects.
See also
XML data binding
Java XML Binding
Simple, structurally typed XML data binding (without code generation or reflection)
so I would potentially have to build 100+ different classes
I am looking for a more generic way to build the
and maybe the grocery may introduce a new product type anytime (assumption)
In this case you are right: defining a class for each type is not effective. You had good intuition to ask for a more generic approach.
I don't see other requirements or constraints, so it is hard to tell exactly what a good solution would be.
To start with, I'd propose a common parametrized TypeBuilder: a parser or filter (e.g. pass the required type name as a parameter and return the properties as a map built from the parsed input, or have a parser return a list of TypeBuilder instances).
The challenge is that there are many conditions on the values I am getting from the XML.
Your intuition is right again: it is considered bad practice to put (business) rules into the code.
If you cannot find some generic set of rules, then you need to put the rules somewhere. Maybe a rule engine is overkill (though feasible for enterprises such as supermarkets); maybe a list of regular expressions would be good enough. Without more requirements or constraints it is hard to propose a better answer.
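One possible shape for such a parametrized builder is sketched below. It is an assumption-laden illustration, not a recommendation of a specific design: the class name TableBuilder, the rule that the element name becomes the "Item" column, and the capitalized column names are all invented to match the question's sample output, and the include predicate stands in for whatever externalized rules are eventually chosen.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TableBuilder {

    /** Builds "Table[n].Column" -> value entries for every accepted item under the named section. */
    static Map<String, String> build(String xml, String section, String table,
                                     Predicate<Element> include) throws Exception {
        Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml))).getDocumentElement();
        Map<String, String> cells = new LinkedHashMap<>();
        int row = 0;
        NodeList sections = root.getElementsByTagName(section);
        for (int s = 0; s < sections.getLength(); s++) {
            for (Element item : childElements(sections.item(s))) {
                if (!include.test(item)) continue;   // condition supplied by the caller
                row++;
                String prefix = table + "[" + row + "].";
                cells.put(prefix + "Item", capitalize(item.getTagName()));
                for (Element field : childElements(item)) {
                    cells.put(prefix + capitalize(field.getTagName()), field.getTextContent());
                }
            }
        }
        return cells;
    }

    /** Returns only the element children of a node, skipping whitespace text nodes. */
    private static List<Element> childElements(Node parent) {
        List<Element> result = new ArrayList<>();
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i).getNodeType() == Node.ELEMENT_NODE) {
                result.add((Element) children.item(i));
            }
        }
        return result;
    }

    private static String capitalize(String s) {
        return s.isEmpty() ? s : Character.toUpperCase(s.charAt(0)) + s.substring(1);
    }
}
```

A call such as build(xml, "produce", "Grocery", item -> true) would then replace a hand-written builder class, with the section name, table name, and condition passed in as data.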
For my project, I need to store info about protocols (the data sent (most likely integers) and in the order it's sent) and info that might be formatted something like this:
'ID' 'STRING' 'ADDITIONAL INTEGER DATA'
This info will be read by a Java program and stored in memory for processing, but I don't know what would be the most sensible format to store this data in?
EDIT: Here's some extra information:
1)I will be using this data in a game server.
2)Since it is a game server, speed is not the primary concern, since this data will primarily be read and utilized during startup, which shouldn't occur very often.
3)Memory consumption I would like to keep at a minimum, however.
4)The second data "example" will be used as a "dictionary" to look up names of specific in-game items, their stats and other integer data (and therefore might become very large, unlike the first data containing the protocol information, where each file will only note small protocol bites, like a login protocol for instance).
5)And yes, I would like the data to be "human-editable".
EDIT 2: Here's the choices that I've made:
JSON - For the protocol descriptions
CSV - For the dictionaries
There are many factors that could come to weigh--here are things that might help you figure this out:
1) Speed/memory usage: If the data needs to load very quickly or is very large, you'll probably want to consider rolling your own binary format.
2) Portability/compatibility: Balanced against #1 is the consideration that you might want to use the data elsewhere, with programs that won't read a custom binary format. In this case, your heavy hitters are probably going to be CSV, dBase, XML, and my personal favorite, JSON.
3) Simplicity: Delimited formats like CSV are easy to read, write, and edit by hand. Either use double-quoting with proper escaping or choose a delimiter that will not appear in the data.
If you could post more info about your situation and how important these factors are, we might be able to guide you further.
How about XML, JSON or CSV ?
I've written a similar protocol-specification using XML. (Available here.)
I think it is a good match, since it captures the hierarchical nature of specifying messages / network packages / fields etc. The order of fields is well defined, and so on.
I even wrote a code-generator that generated the message sending / receiving classes with methods for each message type in XSLT.
The only drawback as I see it is the verbosity. If you have a really simple structure of the specification, I would suggest you use some simple home-brewed format and write a parser for it using a parser-generator of your choice.
In addition to the formats suggested by others here (CSV, XML, JSON, etc.) you might consider storing the info in a Java properties file. (See the java.util.Properties class.) The code is already there for you, so all you have to figure out is the properties names (or name prefixes) you want to use.
The Properties class also provides for storing/loading properties in a simple XML format.
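A small sketch of that suggestion, using only java.util.Properties. The key scheme (item.1001.name and so on) and the ItemDictionary class are hypothetical; the question does not specify how its dictionary entries are named.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

class ItemDictionary {

    /** Parses text in the ordinary key=value properties syntax. */
    static Properties parse(String text) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(text));
        return props;
    }

    /** Writes the same properties in the simple XML format that Properties supports. */
    static String toXml(Properties props) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        props.storeToXML(out, "item dictionary", "UTF-8");
        return out.toString("UTF-8");
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical naming scheme: item.<id>.<field>
        Properties items = parse("item.1001.name=Iron Sword\nitem.1001.attack=12\n");
        System.out.println(items.getProperty("item.1001.name"));
        System.out.println(toXml(items));
    }
}
```

All values come back as strings, so integer data such as the attack value has to be converted with Integer.parseInt by the caller.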
I'm a newbie when it comes to properties, and I read that XML is the preferred way to store these. I noticed however, that writing a regular .properties file in the style of
foo=bar
fu=baz
also works. This would mean a lot less typing (and maybe easier to read and more efficient as well). So what are the benefits of using an XML file?
In XML you can store more complex (e.g. hierarchical) data than in a properties file, so it depends on your use case. If you just want to store a small number of direct properties, a properties file is easier to handle (though the Java Properties class can read XML-based properties, too).
It would make sense to keep your configuration interface as generic as possible anyway, so you have no problem to switch to another representation ( e.g. by using Apache Commons Configuration ) if you need to.
The biggest benefit to using an XML file is that XML declares its encoding, while .properties does not.
If you are translating these properties files to N languages, it is possible that these files could come back in N different encodings. And if you're not careful, you or someone else could irreversibly corrupt the character encodings.
If you have a lot of repeating data, it can be simpler to process
<connections>
<connection>this</connection>
<connection>that</connection>
<connection>the other</connection>
</connections>
than it is to process
connection1=this
connection2=that
connection3=the other
especially if you are expecting to have to store a lot of data, or it must be stored in a definite hierarchy
If you are just storing a few scalar values though, I'd go for the simple Properties approach every time
If you have both hierarchical data & duplicate namespaces, then use XML.
1) To emulate just a hierarchical structure in a properties file, simply use dot notation:
a.b=The Joker
a.b.c=Batgirl
a.b=Batman
a.b=Superman
a.b.c=Supergirl
So, complex (hierarchical) data representation is not a reason to use xml.
2) For just repeating data, we can use a 3rd party library like ini4j to peg explicitly in java a count identifier on an implicit quantifier in the properties file itself.
a.b=The Joker
a.b=Batgirl
a.b=Batman
is translated to (in the background)
a.b1=The Joker
a.b2=Batgirl
a.b3=Batman
However, enumerating same-name properties still doesn't maintain the specific parent-child relationships, i.e. how do we represent whether Batgirl is with The Joker or Batman?
So, xml is required when both features are needed. We can now decide if the 1st xml entry is what we want or the 2nd.
[a]
[b]Joker[/b]
[b]
[c]Batgirl[/c]
[/b]
[/a]
--or--
[a]
[b]Batman[/b]
[b]
[c]Batgirl[/c]
[/b]
[/a]
Further detail in ....
http://ilupper.blogspot.com/2010/05/xml-vs-properties.html
XML is handy for complex data structures and or relationships. It does a decent job for having a "common language" between systems.
However, xml comes at a cost. It is heavy to consume. You've got to load a parser, ensure the file is in the correct format, find the information etc...
Whereas properties files are pretty lightweight and easy to read. They work well for simple key/value pairs.
It depends on the data you're encoding. With XML, you can define a more complex representation of the configuration data in your application. Take something like the struts framework as an example. Within the framework you have a number of Action classes that can contain 1...n number of forward branches. With an XML configuration file, you can define it like:
<action class="MyActionClass">
<forward name="prev" targetAction="..."/>
<forward name="next" targetAction="..."/>
<forward name="help" targetAction="..."/>
</action>
This kind of association is difficult to accomplish using just the key-value pair representation of the properties file. Most likely, you would need to come up with a delimiting character and then include all of the forward actions on a single property separated by this delimiting character. It's quite a bit of work for a hackish solution.
Yet, as you pointed out, the XML syntax can become a burden if you just want to state something very simple, like set feature blah to true.
The disadvantages of XML:
It is hard to read - the tags make it look busier than it really is
The hierarchies and tags make it hard to edit and more prone to human errors
It is not possible to "append" to an XML property file to introduce a new property or provide an overriding value for an existing property so that the last one wins. The ability to append a property can be very powerful - we can implement a property management logic around this so that certain properties are "hot" and we don't need to restart the instance when these change
The Java property file solves the above problems. Consistent naming conventions and dot notation can help in solving the issue of hierarchy.
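Both points (appended overrides where the last one wins, and dot notation standing in for hierarchy) can be tried out with java.util.Properties alone. The sketch below is illustrative: LayeredProperties, its method names, and the sample keys are invented, and it does not implement the "hot" reloading mentioned above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

class LayeredProperties {

    /** Loads the base text followed by appended lines; for a repeated key the last value wins. */
    static Properties load(String base, String appended) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(base + "\n" + appended));
        return props;
    }

    /** Returns the entries below a dotted prefix, with the prefix stripped, sorted by key. */
    static Map<String, String> subtree(Properties props, String prefix) {
        Map<String, String> result = new TreeMap<>();
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith(prefix + ".")) {
                result.put(key.substring(prefix.length() + 1), props.getProperty(key));
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        Properties props = load("db.host=localhost\ndb.port=5432\nui.theme=dark",
                                "db.host=prod.example.com");
        System.out.println(subtree(props, "db"));
    }
}
```

The override works because Properties.load stores entries in a map, so a later line for the same key simply replaces the earlier value.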
I am looking for a tool or java code or class library/API that can generate XSD from XML files. (Something like the xsd.exe utility in the .NET Framework sdk)
These tools can provide a good starting point, but they aren't a substitute for thinking through what the actual schema constraints ought to be. You get the opportunity for two kinds of errors: (1) allowing XML that shouldn't be allowed and (2) disallowing XML that should be ok.
As an example, pretend that you want to infer an XSD from a few thousand patient records that include a 'gender' tag (I used to work on medical records software). The tool would likely encounter 'M' and 'F' as values and might deduce that the element is an enumeration. However, other valid (although rarely used) values are B (both), U (unknown), or N (none). These are rare, of course. So, if you used your derived schema as an input validator, it would perform well until a patient with multiple sex organs was admitted to the hospital.
Conversely, to avoid this error, an XSD generator might not add enumerated type restrictions (I can't remember what these are called in schemas), and your application would work well until it encountered an errant record with gender=X.
So, beware. It's best to use these tools only as a starting point. Also, they tend to produce verbose and redundant schemas because they can't figure out patterns as well as humans.
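The gender example can be reproduced with the JDK's built-in validator. The schema string below is a hand-written guess at what an inference tool might emit after seeing only 'M' and 'F' (the restriction is expressed with xs:enumeration facets); it is not the output of any particular tool, and GenderSchemaCheck is a made-up class name.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

class GenderSchemaCheck {

    // Hypothetical inferred schema: only the two values seen in the sample data are allowed.
    static final String INFERRED_XSD =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='gender'><xs:simpleType>"
        + "<xs:restriction base='xs:string'>"
        + "<xs:enumeration value='M'/><xs:enumeration value='F'/>"
        + "</xs:restriction></xs:simpleType></xs:element></xs:schema>";

    /** Returns true if the document validates against the inferred schema. */
    static boolean isValid(String xml) throws Exception {
        Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(INFERRED_XSD)));
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;   // validation error, e.g. a value outside the enumeration
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isValid("<gender>M</gender>"));
        System.out.println(isValid("<gender>U</gender>"));   // legitimate but rare value
    }
}
```

The second document is rejected even though 'U' is a legitimate value, which is exactly the first kind of error described above when a derived schema is used as an input validator.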
Check Castor, I think it has the functionality you are looking for. They also provide you with an ant task that creates XSD schemas from XML files.
PS: I suggest adding more specific tags in the future. For instance, using xml, xsd and java will increase the likelihood of getting answers.
You can use xsd-gen-0.2.0-jar-with-dependencies.jar file to convert xml to xsd.
And Command for it is "java -jar xsd-gen-VERSION-jar-with-dependencies.jar /path/to/xml.xml > /path/to/my.xsd"
Try the xsd-gen project from Google.
https://code.google.com/p/xsd-gen/
Using Java, I need to encode a Map<String, String> of name value pairs to store into a String, and be able to decode it again. These will be stored in a database column, and will probably usually be short and simple, so the common case should produce a simple nice looking line, but shouldn't corrupt the data, even if it contains unexpected characters, etc.
How would you choose to do it such that:
The encoded form is a single, human readable line
It doesn't require a big library or much context to encode / decode
Any delimiters are properly escaped
Url encoding? JSON? Do it yourself? Please specify any helper libraries or methods you'd use.
(Edited to specify more context and requirements as requested.)
As #Uri says, additional context would be good. I think your primary concerns are less about the particular encoding scheme, as rolling your own for most encodings is pretty easy for a simple Map<String, String>.
An interesting question is: what will this intermediate string encoding be used for?
if it's purely internal, an ad-hoc format is fine eg simple concatenation:
key1|value1|key2|value2
if humans might read it, a format like Ruby's map declaration is nice:
{ first_key => first_value,
second_key => second_value }
if the encoding is to send a serialised map over the wire to another application, the XML suggestion makes a lot of sense as it's standard-ish and reasonably self-documenting, at the cost of XML's verbosity.
<map>
<entry key='foo' value='bar'/>
<entry key='this' value='that'/>
</map>
if the map is going to be flushed to file and read back later by another Java application, #Cletus' suggestion of the Properties class is a good one, and has the additional benefit of being easy to open and inspect by human beings.
Edit: you've added the information that this is to store in a database column - is there a reason to use a single column, rather than three columns like so:
CREATE TABLE StringMaps
(
map_id NUMBER NOT NULL, -- ditch this if you only store one map...
key VARCHAR2 NOT NULL,
value VARCHAR2
);
As well as letting you store more semantically meaningful data, this moves the encoding/decoding into your data access layer more formally, and allows other database readers to easily see the data without having to understand any custom encoding scheme you might use. You can also easily query by key or value if you want to.
Edit again: you've said that it really does need to fit into a single column, in which case I'd either:
use the first pipe-separated encoding (or whatever exotic character you like, maybe some unprintable-in-English unicode character). Simplest thing that works. Or...
if you're using a database like Oracle that recognises XML as a real type (and so can give you XPath evaluations against it and so on) and need to be able to read the data well from the database layer, go with XML. Writing XML parsers for decoding is never fun, but shouldn't be too painful with such a simple schema.
Even if your database doesn't support XML natively, you can just throw it into any old character-like column-type...
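For the pipe-separated option, a rough sketch of an encoder and decoder with escaping is given below. The backslash escape character, the "\n" convention for newlines, and the MapCodec name are choices made for this illustration; null keys or values are not handled.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MapCodec {
    private static final char SEP = '|';
    private static final char ESC = '\\';

    /** Encodes the map as key1|value1|key2|value2 on a single line. */
    static String encode(Map<String, String> map) {
        StringBuilder out = new StringBuilder();
        boolean first = true;
        for (Map.Entry<String, String> e : map.entrySet()) {
            if (!first) out.append(SEP);
            first = false;
            appendEscaped(out, e.getKey());
            out.append(SEP);
            appendEscaped(out, e.getValue());
        }
        return out.toString();
    }

    private static void appendEscaped(StringBuilder out, String s) {
        for (char c : s.toCharArray()) {
            if (c == '\n') { out.append(ESC).append('n'); continue; }  // keep output on one line
            if (c == SEP || c == ESC) out.append(ESC);                 // escape delimiter and escape char
            out.append(c);
        }
    }

    /** Reverses encode(); an empty string decodes to an empty map. */
    static Map<String, String> decode(String encoded) {
        Map<String, String> map = new LinkedHashMap<>();
        if (encoded.isEmpty()) return map;
        List<String> tokens = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < encoded.length(); i++) {
            char c = encoded.charAt(i);
            if (c == ESC && i + 1 < encoded.length()) {
                char next = encoded.charAt(++i);
                cur.append(next == 'n' ? '\n' : next);
            } else if (c == SEP) {
                tokens.add(cur.toString());
                cur.setLength(0);
            } else {
                cur.append(c);
            }
        }
        tokens.add(cur.toString());
        for (int i = 0; i + 1 < tokens.size(); i += 2) {
            map.put(tokens.get(i), tokens.get(i + 1));
        }
        return map;
    }
}
```

In the common case of short, simple values the encoded form is exactly the key1|value1|key2|value2 line shown earlier; the escaping only becomes visible when a key or value contains a pipe, a backslash, or a newline.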
Why not just use the Properties class? That does exactly what you want.
I have been contemplating a similar need of choosing a common representation for the conversations (transport content) between my clients and servers via a facade pattern. I want a representation that is standardized, human-readable (brief), robust, fast. I want it to be lightweight to implement and run, easy to test, and easy to "wrap". Note that I have already eliminated XML by my definition, and by explicit intent.
By "wrap", I mean that I want to support other transport content representations such as XML, SOAP, possibly Java properties or Windows INI formats, comma-separated values (CSV) and that ilk, Google protocol buffers, custom binary formats, proprietary binary formats like Microsoft Excel workbooks, and whatever else may come along. I would implement these secondary representations using wrappers/decorators around the primary facade. Each of these secondary representations is desirable, especially to integrate with other systems in certain circumstances, but none of them is desirable as a primary representation due to various shortcomings (failure to meet one or more of my criteria listed above).
Therefore, so far, I am opting for the JSON format as my primary transport content representation. I intend to explore that option in detail in the near future.
Only in cases of extreme performance considerations would I skip translating the underlying conventional format. The advantages of a clean design include good performance (no wasted effort, ease of maintainability) for which a decent hardware selection should be the only necessary complement. When performance needs become extreme (e.g., processing forty thousand incoming data files totaling forty million transactions per day), then EVERYTHING has to be revisited anyway.
As a developer, DBA, architect, and more, I have built systems of practically every size and description. I am confident in my selection of criteria, and eagerly await confirmation of its suitability. Indeed, I hope to publish an implementation as open-source (but don't hold your breath quite yet).
Note that this design discussion ignores the transport medium (HTTP, SMTP, RMI, .Net Remoting, etc.), which is intentional. I find that it is much more effective to treat the transport medium and the transport content as completely separate design considerations, from each other and from the system in question. Indeed, my intent is to make these practically "pluggable".
Therefore, I encourage you to strongly consider JSON. Best wishes.
Some additional context for the question would help.
If you're going to be encoding and decoding at the entire-map granularity, why not just use XML?
As #DanVinton says, if you need this for internal use (by "internal use" I mean "it's used only by my components, not components written by others"), you can concatenate key and value.
I prefer to use different separators between one pair and the next and between a key and its value:
Instead of
key1+SEPARATOR+value1+SEPARATOR+key2 etc
I code
key1+SEPARATOR_KEY_AND_VALUE+value1+SEPARATOR_KEY(n)_AND_KEY(N+1)+key2 etc
if you must debug, this way is clearer (by design too)
Check out the apache commons configuration package. This will allow you to read/save a file as XML or properties format. It also gives you an option of automatically saving the property changes to a file.
Apache Configuration
I realise this is an old "deadish" thread, but I've got a solution not posited previously which I think is worth throwing in the ring.
We store "arbitrary" attributes (i.e. created by the user at runtime) of geographic features in a single CLOB column in the DB in the standard XML attributes format. That is:
name="value" name="value" name="value"
To create an XML element you just "wrap up" the attributes in an xml element. That is:
String xmlString = "<arbitraryAttributes " + arbitraryAttributesString + " />";
"Serialising" a Properties instance to an xml-attributes-string is a no-brainer... it's like ten lines of code. We're lucky in that we can impose on the users the rule that all attribute names must be valid xml-element-names; and we xml-escape (i.e. &quot; etc.) each "value" to avoid problems from double-quotes and whatever in the value strings.
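A sketch of what those "ten lines" might look like follows. It is not the poster's actual code: AttributeSerializer and its methods are hypothetical, names are assumed to be valid XML names already, and keys are sorted only to make the output deterministic.

```java
import java.util.Properties;
import java.util.TreeSet;

class AttributeSerializer {

    /** Escapes the characters that are unsafe inside a double-quoted attribute value. */
    static String escape(String value) {
        return value.replace("&", "&amp;")   // must come first so later entities are not re-escaped
                    .replace("<", "&lt;")
                    .replace(">", "&gt;")
                    .replace("\"", "&quot;");
    }

    /** Produces name="value" name="value" ... with names in sorted order. */
    static String toAttributes(Properties props) {
        StringBuilder out = new StringBuilder();
        for (String name : new TreeSet<>(props.stringPropertyNames())) {
            if (out.length() > 0) out.append(' ');
            out.append(name).append("=\"").append(escape(props.getProperty(name))).append('"');
        }
        return out.toString();
    }

    /** Wraps the attribute string in a single empty element. */
    static String toElement(Properties props) {
        return "<arbitraryAttributes " + toAttributes(props) + " />";
    }
}
```

The resulting element can be handed to any XML parser, which will unescape the values again when the attributes are read back.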
It's effective, flexible, fast (enough) and simple.
Now, having said all that... if we had the time again, we'd just totally divorce ourselves from the whole "metadata problem" by storing the complete unadulterated uninterpreted metadata xml-document in a CLOB and use one of the open-source metadata editors to handle the whole mess.
Cheers. Keith.