Mapping large XML documents using JAXB or Alternative - java

I am trying to map large XML documents, specified by a large set of XSDs, to another large XML document (about 2,500 lines). The mapping is not exactly one-to-one, but it's relatively close: maybe 30-40 elements change, with some needing to be concatenated or to have basic filtering logic applied. I've found Altova MapForce to be a good solution, however it seems to be overkill for the features it provides. Another option I've explored is building a custom mapping framework using JAXB, but I fear I would end up building a product like MapForce, and I estimate it would take a few hundred man-hours.
I have found very little online about XML Mapping, with the biggest finds being a handful of commercial product solutions, all of which seem a bit overkill.
Any ideas?

If you are using the Eclipse IDE, you have an option similar to File - New - "Create bean from XSD schema". It's very useful!
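For what it's worth, a minimal sketch of that bean-based approach: generate classes from both the source and target XSDs (via that wizard or xjc), unmarshal the source document, copy the mostly one-to-one fields with the handful of tweaks applied in plain Java, and marshal the result. SourceOrder, TargetOrder and their accessors are hypothetical stand-ins for whatever classes your schemas generate.

    import javax.xml.bind.JAXBContext;   // newer JAXB runtimes use jakarta.xml.bind instead
    import javax.xml.bind.Marshaller;
    import javax.xml.bind.Unmarshaller;
    import java.io.File;

    public class OrderMapper {

        public static void main(String[] args) throws Exception {
            // Unmarshal the source document into the beans generated from the source XSD
            JAXBContext sourceCtx = JAXBContext.newInstance(SourceOrder.class);
            Unmarshaller unmarshaller = sourceCtx.createUnmarshaller();
            SourceOrder source = (SourceOrder) unmarshaller.unmarshal(new File("input.xml"));

            // Copy the (mostly one-to-one) fields, applying the few tweaks in plain Java
            TargetOrder target = new TargetOrder();
            target.setOrderId(source.getOrderId());
            target.setCustomerName(source.getFirstName() + " " + source.getLastName()); // concatenation
            if (!"CANCELLED".equals(source.getStatus())) {                              // basic filtering
                target.setStatus(source.getStatus());
            }
            // ...and so on for the remaining 30-40 elements that differ...

            // Marshal the result using the beans generated from the target XSD
            Marshaller marshaller = JAXBContext.newInstance(TargetOrder.class).createMarshaller();
            marshaller.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, Boolean.TRUE);
            marshaller.marshal(target, new File("output.xml"));
        }
    }

Writing the copy code by hand like this is tedious, but it may be far less work than building a generic mapping framework.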

Related

Apache Solr, SolrJ vs Data Import Handler for parsing XML

I'm hoping to use Solr to run searches from info parsed from XML files.
These XML files are not in Solr's document format, so I have to parse them myself and extract the fields I need.
I am familiar with Java programming and was wondering if SolrJ would be an easier method than using the Data Import Handler. I'm considering running through each XML file I have and parsing the fields that I need from each. Is there any downside to one method over the other? I imagine since I have familiarity with Java it may be easier to parse the XML that way?
I will probably need multiple conditions and regular expressions. Above all, I need a reliable way to get my fields out of relatively unstructured XML.
How would SolrJ work with the interface? That is, if I index using SolrJ, can I do my queries through the interface still?
DIH was designed for prototyping, though some people do use it in production. You can start from it, but be ready to jump to SolrJ or other methods if you hit its limitations. And if you have very complex mappings, you may be better off starting with SolrJ.
You can also apply an XSLT transform to an incoming XML document to map it to the Solr format.
And as said elsewhere, search is a separate issue from indexing.
How you index your content into Solr is orthogonal to how you query it. You can index any way you want, as long as it produces the right docs in the index.
Now, regarding indexing: if DIH will get you what you need without much tweaking, go for it. But if you need to do a lot of tweaking of the data, in the end you might finish faster if you just write some Java with SolrJ. With SolrJ you have all the flexibility; with DIH you are more constrained (think of the 80/20 rule).
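For illustration, a rough sketch of the SolrJ route using a plain DOM/XPath parse; the core URL, field names and XPath expressions are placeholders, and HttpSolrClient.Builder assumes a reasonably recent SolrJ version.

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;
    import org.w3c.dom.Document;

    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathFactory;
    import java.io.File;

    public class XmlIndexer {

        public static void main(String[] args) throws Exception {
            try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
                // Parse one of the non-Solr-format XML files
                Document xml = DocumentBuilderFactory.newInstance()
                        .newDocumentBuilder()
                        .parse(new File("record.xml"));
                XPath xpath = XPathFactory.newInstance().newXPath();

                // Pull out the fields you need; your conditions and regexes go here
                String id = xpath.evaluate("/record/@id", xml);
                String title = xpath.evaluate("/record/metadata/title", xml).trim();

                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", id);
                doc.addField("title", title);

                solr.add(doc);
                solr.commit();
            }
        }
    }

Documents indexed this way are queried exactly like any others, whether through the admin interface, the query API, or SolrJ itself.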

Neo4j: Enforcing schema with XSD

I was wondering if there exists a tool for Neo4j that can read an XSD file and use it to enforce a schema on Neo4j.
I'm a newbie with graph databases, but I'm starting to appreciate the schema-less approach. There are a lot of projects out there that have been pumping in lots of non-sequential data and making sense of it all, which is really cool.
I've come across some requirements that call for control on what properties a node or edge can have given a certain label and what labels an edge can have given the labels of its source and destination nodes. The schema is also subject to change - although not frequent.
As I understand it, the standard practice is to control the schema from the application itself, which to me doesn't seem like it should be a BEST practice. For example, the picky developers from Oracle land create views for applications to interact with, and then apply triggers to those views that execute the appropriate transactions when the application attempts to insert into or update the view.
I would be looking for a similar device in Neo4j and since I already have the XSD files, it would be a lot less work overall to simply dump them into a folder and have it use those for reference on what to enforce.
This is something I'm willing to write myself unless there's already a library out there for this. I have a day job after all. :)
Thanks!
Not only does this tool not exist, but it couldn't even exist without more work on standardizing how XML is stored in neo4j. There are key differences between the XML model and the neo4j model.
There's a Python application here that can import XML into neo4j (documents, not schemas). But given the way it does it, there are many things to keep in mind:
There's no obvious mapping from XML elements/attributes onto neo4j nodes/properties. You'd think that elements should be nodes and attributes properties, but a better graph model would usually be different from that. For example, XML namespaces would make great nodes because they connect to so many other things (e.g. all elements defined in a namespace), yet typically they're attributes. Maybe namespaces should be labels? Also perhaps a reasonable choice, except there's no standard answer there.
XML trees have sequence, and sequence matters; graphs don't. Say you have an XML element with 2 children, A and B. In neo4j you might have a node connected to two other nodes, but you need a way of expressing (probably via a relationship property) that A comes before B. That's of course doable in neo4j, but there's no agreement as far as I know about how to do that. So maybe you pick a sequence attribute, and give it an integer value. Seems reasonable...but now your schema validation software has a dependency on that design choice. XML in neo4j stored any other way won't validate.
There's a host of XML processing options that matter in schema validation that wouldn't in a graph, for example whether or not you care about ignoring whitespace nodes, strict vs. lax schema validation, and so on.
Look, neo4j is great but if you really need to validate a pile of XML documents, it's probably not your best choice because of some mismatches between the graph model and XML's document model. Possible options might be to validate the documents before they go into neo4j, or just to come up with a way of synthesizing XML documents from what is in neo4j, and then validating that result once it's outside of the graph database, as an XML file.
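To make the sequencing point concrete, here is one arbitrary way to encode child order using the official Neo4j Java driver; the seq property, the labels and the connection details are all placeholder choices, which is precisely the "no agreed convention" problem described above.

    import org.neo4j.driver.AuthTokens;
    import org.neo4j.driver.Driver;
    import org.neo4j.driver.GraphDatabase;
    import org.neo4j.driver.Session;

    public class XmlOrderExample {

        public static void main(String[] args) {
            try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                    AuthTokens.basic("neo4j", "password"));
                 Session session = driver.session()) {

                // Store an element with two children, A before B, keeping order on the relationship
                session.run(
                    "CREATE (p:Element {name: 'parent'}) " +
                    "CREATE (a:Element {name: 'A'}) " +
                    "CREATE (b:Element {name: 'B'}) " +
                    "CREATE (p)-[:HAS_CHILD {seq: 1}]->(a) " +
                    "CREATE (p)-[:HAS_CHILD {seq: 2}]->(b)");

                // Reading the children back in document order only works if you know the convention
                session.run(
                        "MATCH (p:Element {name: 'parent'})-[r:HAS_CHILD]->(c) " +
                        "RETURN c.name ORDER BY r.seq")
                       .list(record -> record.get("c.name").asString())
                       .forEach(System.out::println);
            }
        }
    }

Any schema-validation tool would have to hard-code choices like these, which is why no general XSD-enforcement library for neo4j exists.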

Hibernate Search, Lucene or any other alternative?

I have a query doing ILIKE over some 11 string/text fields of a table which is not big (500,000 rows), but obviously too big for ILIKE: the search query takes around 20 seconds. The database is Postgres 8.4.
I need to implement this search to be much faster.
What came to my mind:
I made an additional tsvector column assembled from all the columns that need to be searched, and created a full-text index on it. The full-text search was quite fast. But... I cannot map this tsvector type in my .hbm files, so this idea fell through (in any case I thought of it more as a temporary solution).
Hibernate Search. (I heard about it for the first time today.) It seems promising, but I need an experienced opinion on it, since I don't want to get into a new and possibly not-so-simple API for something that could be done more simply.
Lucene
In any case, this has come up now with this table, but I would like the solution to be more generic and applicable to future cases involving full-text search.
All advices appreciated!
Thanx
I would strongly recommend Hibernate Search, which provides a very easy to use bridge between Hibernate and Lucene. Remember you will be using both here. You simply annotate the properties on your domain classes which you wish to be able to search over. Then, when you update/insert/delete an entity which is enabled for searching, Hibernate Search simply updates the relevant indexes. This only happens if the transaction in which the database changes occurred was committed, i.e. if it's rolled back the indexes will not be affected.
So to answer your questions:
Yes, you can index specific columns on specific tables. You also have the ability to tokenize the contents of a field so that you can match on parts of it.
It's not hard to use at all: you simply work out which properties you wish to search on, tell Hibernate Search where to keep its indexes, and then use the EntityManager/Session interfaces to load the entities your searches return.
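As a rough sketch of what that looks like (Hibernate Search 5.x-style annotations and query DSL; the Product entity and its fields are placeholders):

    import javax.persistence.Entity;
    import javax.persistence.GeneratedValue;
    import javax.persistence.Id;
    import org.hibernate.search.annotations.Field;
    import org.hibernate.search.annotations.Indexed;

    @Entity
    @Indexed                 // this entity gets a Lucene index maintained by Hibernate Search
    public class Product {

        @Id
        @GeneratedValue
        private Long id;

        @Field               // tokenized and indexed with the default analyzer
        private String name;

        @Field
        private String description;

        // getters/setters omitted
    }

And a full-text query over those fields through the JPA-flavoured API:

    import java.util.List;
    import javax.persistence.EntityManager;
    import org.hibernate.search.jpa.FullTextEntityManager;
    import org.hibernate.search.jpa.Search;
    import org.hibernate.search.query.dsl.QueryBuilder;

    public class ProductSearch {

        @SuppressWarnings("unchecked")
        public List<Product> search(EntityManager em, String terms) {
            FullTextEntityManager ftem = Search.getFullTextEntityManager(em);
            QueryBuilder qb = ftem.getSearchFactory()
                    .buildQueryBuilder().forEntity(Product.class).get();

            org.apache.lucene.search.Query luceneQuery = qb.keyword()
                    .onFields("name", "description")
                    .matching(terms)
                    .createQuery();

            // Returns managed Product entities, hydrated through the usual Hibernate machinery
            return ftem.createFullTextQuery(luceneQuery, Product.class).getResultList();
        }
    }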
Since you're already using Hibernate and Lucene, Hibernate Search is an excellent choice.
What Hibernate Search will primarily provide is a mechanism to have your Lucene indexes updated when data is changed, and the ability to maximize what you already know about Hibernate to simplify your searches against the Lucene indexes.
You'll be able to specify which fields in each entity you want indexed, as well as adding multiple types of indexes as needed (e.g., stemmed and full text). You'll also be able to manage the index graph for associations, so you can make fairly complex queries through Search/Lucene.
I have found that it's best to rely on Hibernate Search for the text heavy searches, but revert to plain old Hibernate for more traditional searching and for hydrating complex object graphs for result display.
I recommend Compass. It's an open source project built on top of Lucene that provides a simpler API than Lucene's. It integrates nicely with many common Java libraries and frameworks such as Spring and Hibernate.
I have used Lucene in the past to index database tables. The solution works great, but remember that you need to maintain the index: either you update the index every time your objects are persisted, or you have a daemon indexer that dumps the database tables into your Lucene index.
Have you considered Solr? It's built on top of Lucene and offers automatic indexing from a DB and a REST API.
A year ago I would have recommended Compass. It was good at what it does, and technically still happily runs along in the application I developed and maintain.
However, there's no more development on Compass, with efforts having switched to ElasticSearch. From that project's website I cannot quite determine if it's ready for the Big Time yet or even actually alive.
So I'm switching to Hibernate Search which doesn't give me that good a feeling but that migration is still in its initial stages, so I'll reserve judgement for a while longer.
All of these projects are based on Lucene. If you want to implement very advanced features, I advise you to use Lucene directly. If not, you may use Solr, which is a powerful API on top of Lucene that can help you index and search data from a DB.

Lossless schema mapping from XML records to relations

I have to deal with close to 100,000 XML records. The problem is to construct a schema mapping from the XML schema of these records to relations.
Any ideas in this field are welcome. Please propose an algorithm or methodology that can be followed to achieve this schema mapping.
Or any related work would surely be helpful.
Thanks
Peter
We're thin on details here, but it might be that you don't need an algorithm or methodology as much as you need really good tools. Altova has a set of XML tools; some of them can help you map XML documents to a SQL database. (I'm not sure whether they will help you create tables based on XML document elements.) You can download Altova MissionKit here and use it free for 30 days.
I'm sure they're not the only player in this market.
Disclaimer: I have no relationship with Altova. I used XMLSpy briefly during a contract job for a Fortune 500 company a while back. It worked well, and without surprises.

How to update large XML file

Rather than rewriting the entire contents of an xml file when a single element is updated, is there a better alternative to updating the file?
I would recommend using VTD-XML http://vtd-xml.sourceforge.net/
From their FAQ ( http://vtd-xml.sourceforge.net/faq.html ):
Why should I use VTD-XML for large XML files?
For numerous reasons summarized below:
Performance: The performance of VTD-XML is far better than SAX
Ease to use: Random access combined with XPath makes application easy to write
Better maintainability: App code is shorter and simpler to understand.
Incremental update: Occasional, small changes become very efficient.
Indexing: Pre-parsed form of XML will further boost processing performance.
Other features: Cut, paste, split and assemble XML documents is only possible with VTD-XML.
In order to take advantage of VTD-XML, we recommend that developers split their ultra-large XML documents into smaller, more manageable chunks (<2GB).
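For illustration, a rough sketch of an incremental update with VTD-XML's XMLModifier (com.ximpleware); the file name and XPath are placeholders.

    import com.ximpleware.AutoPilot;
    import com.ximpleware.VTDGen;
    import com.ximpleware.VTDNav;
    import com.ximpleware.XMLModifier;

    public class IncrementalUpdate {

        public static void main(String[] args) throws Exception {
            VTDGen vg = new VTDGen();
            if (!vg.parseFile("big.xml", true)) {            // namespace-aware parse
                throw new RuntimeException("parse failed");
            }
            VTDNav vn = vg.getNav();
            XMLModifier xm = new XMLModifier(vn);

            AutoPilot ap = new AutoPilot(vn);
            ap.selectXPath("/catalog/item[@id='42']/price"); // placeholder XPath
            if (ap.evalXPath() != -1) {
                int textIndex = vn.getText();
                if (textIndex != -1) {
                    xm.updateToken(textIndex, "19.99");      // change just this text token
                }
            }
            // Untouched content is copied through as-is; only the modified token is re-encoded
            xm.output("big-updated.xml");
        }
    }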
If your XML file is so large that updating it is a performance bottleneck, you should consider moving away from XML to a more efficient disk format (or a real database).
If, however, you just feel like it might be a problem, remember the rules of optimization:
Don't do it
(experts only) Don't do it, yet.
You have a few options here, but none of them are good.
Since an XML document isn't stored broken into distinct parts, you'll either have to use some filesystem-level modification with regex pattern matching (sed is a good start), or you should break your XML into smaller parts for manageability.
If possible, serialize the XML and use the Linux diff/patch/apply tools (or equivalent tools on your platform). This way, you don't have to deal with parsing and writing.
Process large XML files with XQuery; it works with gigabyte-size XML files.
http://www.xquery.com
XQuery is a query language that was designed as a native XML query language. Because most types of data can be represented as XML, XQuery can also be used to query other types of data. For example, XQuery can be used to query relational data using an XML view of a relational database. This is important because many Internet applications need to integrate information from multiple sources, including data found in web messages, relational data, and various XML sources. XQuery was specifically designed for this kind of data integration.
For example, suppose your company is a financial institution that needs to produce reports of stock holdings for each client. A client requests a report with a Simple Object Access Protocol (SOAP) message, which is represented in XML. In most businesses, the stock holdings data is stored in multiple relational databases, such as Oracle, Microsoft SQL Server, or DB2. XQuery can query both the SOAP message and the relational databases, creating a report in XML.
XQuery is based on the structure of XML and leverages that structure to make it possible to perform queries on any type of data that can be represented as XML, including relational data. In addition, XQuery API for Java (XQJ) lets your queries run in any environment that supports the J2EE platform.
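As an illustration of the XQJ route mentioned above, a small sketch using Saxon's XQJ driver (any XQJ-compliant processor looks much the same); the document name and the query are made up.

    import javax.xml.xquery.XQConnection;
    import javax.xml.xquery.XQDataSource;
    import javax.xml.xquery.XQPreparedExpression;
    import javax.xml.xquery.XQResultSequence;
    import net.sf.saxon.xqj.SaxonXQDataSource;

    public class HoldingsReport {

        public static void main(String[] args) throws Exception {
            XQDataSource ds = new SaxonXQDataSource();   // swap in any other XQJ implementation
            XQConnection conn = ds.getConnection();

            XQPreparedExpression exp = conn.prepareExpression(
                "for $h in doc('holdings.xml')//holding " +
                "where $h/quantity > 100 " +
                "return <line>{$h/symbol/text()} x {$h/quantity/text()}</line>");

            XQResultSequence result = exp.executeQuery();
            while (result.next()) {
                System.out.println(result.getItemAsString(null));
            }
            conn.close();
        }
    }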
