Apache Solr, SolrJ vs Data Import Handler for parsing XML - java

I'm hoping to use Solr to run searches from info parsed from XML files.
These XML files are not in Solr's document format, so I have to parse them and extract the fields I need myself.
I am familiar with Java programming and was wondering if SolrJ would be an easier approach than the Data Import Handler. I'm considering running through each XML file I have and parsing the fields I need from each. Is there any downside to one method over the other? I imagine that, since I'm familiar with Java, it may be easier to parse the XML that way?
I will probably need multiple conditions and regular expressions; above all, I need a reliable way to get my fields out of relatively unstructured XML.
How would SolrJ work with the interface? That is, if I index using SolrJ, can I still run my queries through the interface?

DIH was designed for prototyping, though some people do use it in production. You can start with it, but be ready to jump to SolrJ or other methods if you hit its limitations. And if you have very complex mappings, you may be better off starting with SolrJ.
You can also apply an XSLT transform to an incoming XML document to map it to the Solr format.
And as said elsewhere, search is a separate issue from indexing.
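If you go the XSLT route, here is a minimal sketch of applying a stylesheet with the standard JAXP API before posting the result to Solr's /update handler. The file names and the stylesheet itself are hypothetical; the stylesheet would be whatever maps your format onto Solr's add/doc/field structure.

```java
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import java.io.File;

public class XsltToSolrFormat {
    public static void main(String[] args) throws Exception {
        // Hypothetical file names: your source XML and a stylesheet that maps it
        // to Solr's <add><doc><field name="...">...</field></doc></add> format.
        File source = new File("books.xml");
        File stylesheet = new File("to-solr.xsl");

        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(stylesheet));

        // Write the transformed document; the output file can then be posted to /update.
        transformer.transform(new StreamSource(source),
                new StreamResult(new File("solr-add.xml")));
    }
}
```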

How you index your content into Solr is orthogonal to how you query it. You can index any way you want, as long as it produces the right docs in the index.
Now, regarding indexing: if DIH gets you what you need without much tweaking, go for it. But if you need to do a lot of tweaking of the data, you might finish faster in the end by just writing some Java with SolrJ. With SolrJ you have all the flexibility; with DIH you are more constrained (think of the 80/20 rule).
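To make the SolrJ option concrete, here is a rough sketch of "just write some Java": parse each file with the JDK's DOM and XPath APIs, pull out the fields you need (this is where your conditions and regexes would live), and push SolrInputDocuments to the server. The core URL, field names and XPath expressions are placeholders, and the builder call assumes a reasonably recent SolrJ; older versions construct the client differently.

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.w3c.dom.Document;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import java.io.File;

public class XmlIndexer {
    public static void main(String[] args) throws Exception {
        // Placeholder core URL; point it at your own Solr instance/core.
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();
        XPath xpath = XPathFactory.newInstance().newXPath();

        // Placeholder input directory containing the source XML files.
        for (File file : new File("input-xml").listFiles()) {
            Document xml = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(file);

            // Pull out whatever fields you need; these XPath expressions are made up.
            String id = xpath.evaluate("/record/@id", xml);
            String title = xpath.evaluate("/record/title", xml);

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", id);
            doc.addField("title", title);
            // Any conditions or regexes on the raw values would go here.
            solr.add(doc);
        }
        solr.commit();
        solr.close();
    }
}
```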

Related

Batch indexing to Solr

I have a Java class that sends HTTP POST requests to a Solr instance to index JSON files. It is implemented in a multithreaded manner. However, I have realized that sending so many HTTP requests (close to 20,000) is making the network a bottleneck. I read online that I can do batch indexing, but I can't find any clear examples. Is there any advice on how to batch index with Solr?
Thank you.
For generic JSON, you must have a configuration somewhere in solrconfig.xml that defines how it is treated.
One of the parameters is split. You might be able to use it to combine your JSON documents into one bigger document that Solr would then split and process separately. Note that the specific format may differ a little between Solr versions; get the correct version of the downloadable reference guide PDF if something is not working.
Or, if you can generate it, use the JSON format Solr understands directly, which has full support for multiple documents.
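If switching the Java class to SolrJ is an option, batching comes almost for free: collect documents and send each batch in a single add() call, committing once at the end instead of per request. A sketch, with the core URL and batch size as assumptions:

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

import java.util.ArrayList;
import java.util.List;

public class BatchIndexer {
    private static final int BATCH_SIZE = 1000; // tune to your document size

    public static void index(List<SolrInputDocument> docs) throws Exception {
        // Placeholder core URL.
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();
        List<SolrInputDocument> batch = new ArrayList<>();
        for (SolrInputDocument doc : docs) {
            batch.add(doc);
            if (batch.size() >= BATCH_SIZE) {
                solr.add(batch);   // one HTTP request for the whole batch
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            solr.add(batch);
        }
        solr.commit();             // commit once, not per document
        solr.close();
    }
}
```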

Neo4j: Enforcing schema with XSD

I was wondering if there exists a tool for Neo4j that can read an XSD file and use it to enforce a schema on Neo4j.
I'm a newbie to graph databases, but I'm starting to appreciate the schema-less approach. There are a lot of projects out there that have been pumping in a lot of non-sequential data and making sense of it all, which is really cool.
I've come across some requirements that call for control over what properties a node or edge can have given a certain label, and what labels an edge can have given the labels of its source and destination nodes. The schema is also subject to change, although not frequently.
As I understand, the standard practice is to control the schema from the application itself which to me doesn't seem like it should be a BEST practice. For example, the picky developers from Oracle land create views for applications to interact with and then apply triggers onto the views that execute the appropriate transactions upon the application attempting to insert or update on the view.
I would be looking for a similar device in Neo4j and since I already have the XSD files, it would be a lot less work overall to simply dump them into a folder and have it use those for reference on what to enforce.
This is something I'm willing to write myself unless there's already a library out there for this. I have a day job after all. :)
Thanks!
Not only does this tool not exist, but it couldn't even exist without more work on standardizing how XML is stored in neo4j. There are key differences between the XML model and the neo4j model.
There's a Python application that can import XML into neo4j (documents, not schemas), but given the way it does it, there are many things to keep in mind:
There's no obvious mapping from XML elements/attributes onto neo4j nodes/properties. You'd think that elements should be nodes and attributes properties, but a better graph model would usually be different from that. For example, XML namespaces would make great nodes because they connect to so many other things (e.g. all elements defined in a namespace), yet typically they're attributes. Maybe namespaces should be labels? That might also be a reasonable choice, except there's no standard answer there.
XML trees have sequence, and sequence matters; graphs don't. Say you have an XML element with two children, A and B. In neo4j you might have a node connected to two other nodes, but you need a way of expressing (probably via a relationship property) that A comes before B. That's of course doable in neo4j, but as far as I know there's no agreement about how to do it. So maybe you pick a sequence attribute and give it an integer value. Seems reasonable... but now your schema validation software has a dependency on that design choice. XML stored in neo4j any other way won't validate.
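To make the sequence point concrete, here is one possible convention, written against the Neo4j Java driver with Cypher: a HAS_CHILD relationship carrying an integer seq property. The label, relationship type and property name are invented for the example, which is exactly the standardization problem described above.

```java
import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Session;

public class XmlSequenceExample {
    public static void main(String[] args) {
        // Connection details are placeholders.
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "password"));
             Session session = driver.session()) {

            // Parent element with two children; the seq property records document order.
            session.run(
                "CREATE (p:Element {name: 'parent'}) " +
                "CREATE (p)-[:HAS_CHILD {seq: 1}]->(:Element {name: 'A'}) " +
                "CREATE (p)-[:HAS_CHILD {seq: 2}]->(:Element {name: 'B'})");

            // Read the children back in document order.
            session.run(
                "MATCH (p:Element {name: 'parent'})-[r:HAS_CHILD]->(c) " +
                "RETURN c.name ORDER BY r.seq").list()
                .forEach(record -> System.out.println(record.get("c.name").asString()));
        }
    }
}
```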
There's a host of XML processing options that matter for schema validation but wouldn't in a graph, for example whether you care about ignoring whitespace nodes, strict vs. lax schema validation, and so on.
Look, neo4j is great, but if you really need to validate a pile of XML documents it's probably not your best choice, because of the mismatches between the graph model and XML's document model. Possible options are to validate the documents before they go into neo4j, or to come up with a way of synthesizing XML documents from what is in neo4j and then validate the result once it's outside the graph database, as an XML file.

XML as data store. Insert, remove, delete

I was planning to use XML to store the data for a Java DVD database application I'm writing. I know that the word "database" is right there in the title, but XML just seemed so much more portable, was human readable and (I assumed before looking into it) simpler to implement.
Parsing XML seems to be the easiest thing in the world... even creating a new XML file isn't much trouble, but changing, inserting or deleting records I can only see how to do by writing out a fresh XML file.
Am I missing something? Or is the thing I'm missing that I should switch over to a database format (unless there's some wonderful database format I've not heard of that's totally portable and doesn't require users to install something separate :) )?
The most popular way to use a file as a database is probably SQLite http://www.sqlite.org/ and that's what I would use if I were solving your problem (it's pretty much a standard SQL database, but uses just one file as storage). Another, pure-Java option is Apache Derby http://db.apache.org/derby/
However, pure XML databases do exist (and were quite fashionable about 10 years ago - the "nosql" of their time) - the associated standards are XPath http://en.wikipedia.org/wiki/XPath and XQuery http://en.wikipedia.org/wiki/Xquery . I haven't used it, but it seems like BaseX http://basex.org/open-source/ is an open-source implementation that you could use (and it does claim to provide ACID guarantees - http://basex.org/products/ ).
If you're more familiar with XML than SQL, I don't see any great harm in using an XML database for a small project. Just structure your code so that most of the program doesn't care what the storage is (i.e. by providing a neutral interface). Then if XML doesn't work out you can switch to SQL by re-implementing just that interface and leaving the rest of your program alone (and if it does work, post back here saying so - it would be interesting to know).
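Putting the two suggestions together (a single-file SQLite store hidden behind a neutral interface), a rough sketch might look like the following. It assumes the xerial sqlite-jdbc driver is on the classpath, and the DvdStore interface and table layout are invented for the example.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

// Neutral storage interface: the rest of the program only sees this,
// so the backing store (SQLite, XML, ...) can be swapped later.
interface DvdStore {
    void add(String title, int year) throws Exception;
    void remove(String title) throws Exception;
}

class SqliteDvdStore implements DvdStore {
    private final Connection conn;

    SqliteDvdStore(String file) throws Exception {
        // Single-file database, created on first use; requires sqlite-jdbc on the classpath.
        conn = DriverManager.getConnection("jdbc:sqlite:" + file);
        try (Statement st = conn.createStatement()) {
            st.execute("CREATE TABLE IF NOT EXISTS dvds (title TEXT PRIMARY KEY, year INTEGER)");
        }
    }

    public void add(String title, int year) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT OR REPLACE INTO dvds (title, year) VALUES (?, ?)")) {
            ps.setString(1, title);
            ps.setInt(2, year);
            ps.executeUpdate();
        }
    }

    public void remove(String title) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement("DELETE FROM dvds WHERE title = ?")) {
            ps.setString(1, title);
            ps.executeUpdate();
        }
    }
}
```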
If you're going to have a web-based front end, it seems that a regular database is the way to go as the back end. I don't believe your users would have a need to download anything new, since that's all taken care of server-side. A real database also has the ACID advantage over a pseudobase; it should be atomic, consistent, isolated, and durable, and I can't imagine XML would be a good substitute in those respects.

Suggest a persistent strategy for a workflow system

I am in the process of creating a UI configuration tool for my pet project. One aspect of this tool lets the end user DEFINE his orchestration. I then need to save this orchestration definition into a database. There will be an executable version of this definition in a running system; the executable version is created dynamically on demand.
The idea is to separate the DEFINITION from the EXECUTABLE version so that I have the flexibility to choose the runtime among BPMN, JPDL, or a POJO-based workflow solution (BeanFlow).
Limitation: I can't use the BPMN editors that come with frameworks like jBPM, Activiti etc., as I want to use my own UI that is specific to my domain.
I need suggestions on HOW to PERSIST the definition.
Should I use rdbms tables? If so, is there a db schema I can borrow that is close to orchestration concepts?
Should I serialize my definition to BPMN/JPDL XML instance document?
Are there any other simple formats that I can use?
By "orchestration" I'm assuming you mean a finite state machine. Where the current state dictates what transitions can be followed to other states. The representation of states and transitions as edges and vertices often produces a directed acyclic graph, however there are times when the graph will cycle (e.g. draft -- submit for approval --> pending approval -- reject --> draft).
In practice, separating the definition from execution calls for a persistence format that can easily accommodate customization. As your system evolves you will find a number of unanticipated edge cases whose solution should not require altering a persistence schema, only code. This implies XML or a NoSQL solution - something whose schema is easily changed or non existent.
Now, having written my own XML definition for this purpose (for uninteresting reasons I'll exclude), my suggestion is using JPDL (or BPMN). Reason is their definitions likely incorporate whatever you're considering now, will in the future, and enable customization - such as hanging arbitrary data or behavior off them at a given point. You also get the advantage of tools already built - not just UI - for dealing with cycle detection and ensuring there is a path to completion for example.
Some of the interesting features I know JPDL possesses are an ability to help merge forked processes, timed tasks (including those that repeat periodically), and facilities for sending notification. This last item - notification - bears some further exposition. One of the things I've found with my own system is the need for sending out configurable email whose content is based on the data flowing through. These existing engines make that relatively easy by providing a way to plugin variables for instance into text that's then dynamically evaluated at run time before transmission. Also they provide bridges between the engine and whatever user store for the purpose of sending notifications to groups of people, tasking them and enforcing security policy.
Finally, depending on the scope of your system, you will probably still be using a database as well. What I suggest is storing off the XML and data being orchestrated into the database in a serialized format. Then, if the data is being altered as it travels through the execution, write out serializations of the data - and perhaps workflow if it is also changed - into a history/audit log table as well.
I would NOT use RDBMS tables, or if you do, store the definitions as text blobs. Trying to break the definition into records is a bad idea because it makes it much more inflexible and difficult to change your definition over time. Many people would use different approaches, but I'd use JSON or YAML and avoid XML. The motivation is to keep it as simple as possible. Trying to use XML, especially a formalized, specific XML format, is going to make you spend much more time meeting an exact specification that doesn't actually do anything to help what you're trying to accomplish. JSON and YAML are both very easy to work with from a code perspective. YAML is more easily readable by humans and easier to edit, and isn't as tricky about punctuation and escaping as JSON. JSON is more widely used and is smaller than YAML. JSON also has a binary counterpart, BSON, if document size is a concern.
Once you have an importer/exporter that goes to/from your internal objects to your data format, then persisting using RDBMS, or other mechanisms, will be straightforward. You could even use CouchDB, which could offer other benefits to your application and may be a great fit.
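As a sketch of the JSON route: model the definition as plain objects and let a mapper such as Jackson handle the round trip; the resulting string can then go into a text column, a document store or CouchDB. The WorkflowDefinition shape below is made up purely for illustration.

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.List;

public class WorkflowJsonExample {
    // Invented, minimal definition classes; your real model will differ.
    public static class WorkflowDefinition {
        public String name;
        public List<Step> steps;
    }
    public static class Step {
        public String id;
        public String next;   // id of the following step, null for the end
    }

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();

        // Serialize the definition to JSON for persistence...
        WorkflowDefinition def = new WorkflowDefinition();
        def.name = "approval";
        String json = mapper.writeValueAsString(def);

        // ...and read it back when building the executable version.
        WorkflowDefinition restored = mapper.readValue(json, WorkflowDefinition.class);
        System.out.println(restored.name);
    }
}
```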
Very good question! Here are my two cents:
RDBMS: if you do this you will be able to query the workflow instances, for example: which tokens are at 'node X'?
Storing XML as a CLOB: simplicity is the main appeal of this solution, but you can't really query the definitions, you can only fetch them by id.
NoSQL: there are a lot of different solutions for different problems. MongoDB is a popular one; it provides document-oriented persistence.
How about a simple serialisation of the composed UI using, for example, XStream, and then storing the serialised bits in the database as a binary column? Then when the user logs in, get the associated data, deserialise it, initialise if required, and display.
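For the XStream suggestion, the round trip is only a few lines; the UiDefinition class is invented for the example, and the resulting XML string (or its bytes) is what would go into the database column.

```java
import com.thoughtworks.xstream.XStream;

public class XStreamExample {
    // Hypothetical class representing the composed UI.
    public static class UiDefinition {
        String layout = "two-column";
        int widgetCount = 5;
    }

    public static void main(String[] args) {
        XStream xstream = new XStream();
        // Newer XStream versions require the types to be whitelisted before deserialising.
        xstream.allowTypes(new Class[] { UiDefinition.class });

        UiDefinition ui = new UiDefinition();
        String xml = xstream.toXML(ui);                           // serialise before saving
        UiDefinition back = (UiDefinition) xstream.fromXML(xml);  // deserialise after loading

        System.out.println(xml);
        System.out.println(back.layout);
    }
}
```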

How to update large XML file

Rather than rewriting the entire contents of an XML file when a single element is updated, is there a better alternative for updating the file?
I would recommend using VTD-XML http://vtd-xml.sourceforge.net/
From their FAQ ( http://vtd-xml.sourceforge.net/faq.html ):
Why should I use VTD-XML for large XML files?
For numerous reasons summarized below:
Performance: The performance of VTD-XML is far better than SAX
Ease of use: Random access combined with XPath makes applications easy to write
Better maintainability: App code is shorter and simpler to understand.
Incremental update: Occasional, small changes become very efficient.
Indexing: Pre-parsed form of XML will further boost processing performance.
Other features: Cutting, pasting, splitting and assembling XML documents is only possible with VTD-XML.
In order to take advantage of VTD-XML, we recommend that developers split their ultra large XML documents into smaller, more manageable chunks (<2GB).
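For a sense of what an incremental update looks like with VTD-XML (the com.ximpleware API), here is a small sketch; the file name and XPath are placeholders, and the exact method signatures should be checked against the VTD-XML samples for your version.

```java
import com.ximpleware.AutoPilot;
import com.ximpleware.VTDGen;
import com.ximpleware.VTDNav;
import com.ximpleware.XMLModifier;

public class VtdUpdateExample {
    public static void main(String[] args) throws Exception {
        VTDGen gen = new VTDGen();
        if (!gen.parseFile("catalog.xml", true)) {   // placeholder file name
            throw new RuntimeException("parse failed");
        }
        VTDNav nav = gen.getNav();

        // Locate the single text node we want to change (placeholder XPath).
        AutoPilot ap = new AutoPilot(nav);
        ap.selectXPath("/catalog/item[@id='42']/price/text()");

        XMLModifier modifier = new XMLModifier(nav);
        int i = ap.evalXPath();
        if (i != -1) {
            modifier.updateToken(i, "19.99");        // only this token is rewritten
        }
        modifier.output("catalog-updated.xml");      // original bytes plus the small change
    }
}
```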
If your XML file is so large that updating it is a performance bottleneck, you should consider moving away from XML to a more efficient disk format (or a real database).
If, however, you just feel like it might be a problem, remember the rules of optimization:
Don't do it
(experts only) Don't do it, yet.
You have a few options here, but none of them are good.
Since an XML document isn't broken into distinct parts, you'll either have to use some filesystem-level modification with regex pattern matching (sed is a good start), or you should break your XML into smaller parts for manageability.
If possible, serialize the XML and use the diff/patch/apply Linux tools (or equivalent tools on your platform). This way, you don't have to deal with parsing and writing the whole file.
Process large XML files with XQuery; it works with gigabyte-size XML files.
http://www.xquery.com
XQuery is a query language that was designed as a native XML query language. Because most types of data can be represented as XML, XQuery can also be used to query other types of data. For example, XQuery can be used to query relational data using an XML view of a relational database. This is important because many Internet applications need to integrate information from multiple sources, including data found in web messages, relational data, and various XML sources. XQuery was specifically designed for this kind of data integration.
For example, suppose your company is a financial institution that needs to produce reports of stock holdings for each client. A client requests a report with a Simple Object Access Protocol (SOAP) message, which is represented in XML. In most businesses, the stock holdings data is stored in multiple relational databases, such as Oracle, Microsoft SQL Server, or DB2. XQuery can query both the SOAP message and the relational databases, creating a report in XML.
XQuery is based on the structure of XML and leverages that structure to make it possible to perform queries on any type of data that can be represented as XML, including relational data. In addition, XQuery API for Java (XQJ) lets your queries run in any environment that supports the J2EE platform.
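As a taste of running XQuery from Java, here is a sketch using the XQJ API; the Saxon data source class is an assumption (any XQJ-capable processor provides its own XQDataSource), and the document and query are placeholders.

```java
import javax.xml.xquery.XQConnection;
import javax.xml.xquery.XQDataSource;
import javax.xml.xquery.XQExpression;
import javax.xml.xquery.XQResultSequence;

public class XQueryExample {
    public static void main(String[] args) throws Exception {
        // Assumes Saxon with XQJ support is on the classpath; other processors
        // ship their own XQDataSource implementation.
        XQDataSource ds = new net.sf.saxon.xqj.SaxonXQDataSource();
        XQConnection conn = ds.getConnection();
        XQExpression expr = conn.createExpression();

        // Placeholder document and query.
        XQResultSequence result = expr.executeQuery(
            "for $h in doc('holdings.xml')//holding " +
            "where $h/client = 'ACME' " +
            "return $h/stock/text()");

        while (result.next()) {
            System.out.println(result.getItemAsString(null));
        }
        conn.close();
    }
}
```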
