I'm trying to define Solr data-config.xml and schema.xml files so that I can have multiple independent document and/or root entity nodes which are then linked together. It seems Solr will only index the first root entity defined in data-config.xml.
What I'm trying to achieve is indexed documents which are imported from a database. Each row will create one document. This is fine and already working.
Next I want to have a kind of context for the documents. The context can be updated, and in that case I also need to update the Solr index. The issue is that if the context is indexed as a sub-entity of the documents, I would need to delete and re-add all the documents involved.
The goal is to have the context as a separate entity to which the docs hold a link. Then, in the case above, I could delete and re-add only the context, while the link between it and the related docs would remain unchanged and there would be no need to drop the documents during the update.
The number of documents linked to a context can be anything from hundreds to tens of thousands, so I really wouldn't like to recreate all of them whenever the context updates.
In our project we are using Spring Data Neo4j and have a lot of queries. When we want to update a node with new data, we do not actually update the node; instead we create a new node with a PREVIOUS relation to the current node. Furthermore, we can make revisions: these are nodes with a CONTAINS relation to the nodes belonging to that revision.
Now the client passes an ID of a revision in the request header, and the executed queries should then return only the nodes which are either the newest (so no PREVIOUS relation exists there) or are part of the revision. In other words, we want to limit the queries to a subset of nodes.
Now, I don't want to change every query manually, and I think there should be an easier way, but unfortunately I haven't found one after a long search.
If your model follows your description, this should do it.
Assuming a parameter $id:
MATCH (r:Revision {ID: $id})
MATCH (n)
WHERE (n)<-[:CONTAINS]-(r)
OR NOT (n)-[:PREVIOUS]->()
RETURN n
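One way to avoid changing every query by hand is to centralize this match in a Spring Data Neo4j repository method. A minimal sketch in SDN 6 style, assuming a hypothetical MyNode entity (the query is the one above, narrowed to the MyNode label so the results map to the entity):

import java.util.List;
import org.springframework.data.neo4j.core.schema.GeneratedValue;
import org.springframework.data.neo4j.core.schema.Id;
import org.springframework.data.neo4j.core.schema.Node;
import org.springframework.data.neo4j.repository.Neo4jRepository;
import org.springframework.data.neo4j.repository.query.Query;
import org.springframework.data.repository.query.Param;

// Hypothetical domain type standing in for your versioned nodes.
@Node
class MyNode {
    @Id @GeneratedValue Long id;
}

interface MyNodeRepository extends Neo4jRepository<MyNode, Long> {

    // Nodes that are part of the given revision, or that have no
    // PREVIOUS relation, exactly as in the Cypher above.
    @Query("MATCH (r:Revision {ID: $id}) "
         + "MATCH (n:MyNode) "
         + "WHERE (n)<-[:CONTAINS]-(r) OR NOT (n)-[:PREVIOUS]->() "
         + "RETURN n")
    List<MyNode> findForRevision(@Param("id") String id);
}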
Let's say I have a collection called root.
Can I create a document with its subcollection in one call?
I mean if I do:
db.Collection("root").document("doc1").Collection("sub").document("doc11").Set(data)
Then will that create the structure in one shot?
To be honest, I gave this a try and doc1 had an italicized heading, which I thought was only for deleted docs.
The code you shared doesn't create an actual document. It merely "reserves" the ID for a document in root and then creates a subcollection under it with an actual doc11 document.
Seeing a document name in italics in the Firestore console indicates that there is no physical document at that location, but that there is data under the location. This is most typical when you've deleted a document that previously existed, but your code is another way of accomplishing the same.
There is no way to create two documents in one call, although you can create multiple documents in a single transaction or batch write if you want.
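With the Firestore Java Admin SDK, for example, a batch write can create both documents together. A minimal sketch (the data maps are placeholders):

import java.util.Map;
import com.google.cloud.firestore.DocumentReference;
import com.google.cloud.firestore.Firestore;
import com.google.cloud.firestore.WriteBatch;

class CreateBoth {
    static void createBoth(Firestore db) throws Exception {
        DocumentReference parent = db.collection("root").document("doc1");
        DocumentReference child = parent.collection("sub").document("doc11");

        Map<String, Object> parentData = Map.of("exists", true); // placeholder data
        Map<String, Object> childData = Map.of("name", "doc11"); // placeholder data

        WriteBatch batch = db.batch();
        batch.set(parent, parentData);
        batch.set(child, childData);
        batch.commit().get(); // both writes succeed or fail together
    }
}

This way doc1 is a real document (no italics in the console) and doc11 is created in the same atomic commit.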
I've never used CouchDB/MongoDB/Couchbase before and am evaluating them for my application. Generally speaking, they seem to be a very interesting technology that I would like to use. However, coming from an RDBMS background, I am hung up on the lack of transactions. But at the same time, I know that there is going to be much less of a need for transactions than I would have in an RDBMS, given the way the data is organized.
That being said, I have the following requirement and am not sure if/how I can use a NoSQL DB.
I have a list of clients
Each client can have multiple files
Each file must be sequentially numbered for that specific client.
Given an RDBMS this would be fairly simple: one table for clients, one (or more) for files. In the client table, keep a counter of the last file number, and increment it by one when inserting a new record into the file table. Wrap everything in a transaction and you are assured that there are no inconsistencies. Heck, just to be safe, I could even put a unique constraint on a (clientId, filenumber) index to ensure that there is never the same filenumber used twice for a client.
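For reference, a minimal JDBC sketch of that RDBMS approach (table and column names are assumptions):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

class FileNumbers {
    // Bump the client's counter and insert the file in one transaction.
    static int insertFile(Connection con, long clientId) throws Exception {
        con.setAutoCommit(false);
        try (PreparedStatement bump = con.prepareStatement(
                 "UPDATE client SET last_filenumber = last_filenumber + 1 WHERE id = ?");
             PreparedStatement read = con.prepareStatement(
                 "SELECT last_filenumber FROM client WHERE id = ?");
             PreparedStatement ins = con.prepareStatement(
                 "INSERT INTO file (client_id, filenumber) VALUES (?, ?)")) {
            bump.setLong(1, clientId);
            bump.executeUpdate();
            read.setLong(1, clientId);
            ResultSet rs = read.executeQuery();
            rs.next();
            int filenumber = rs.getInt(1);
            ins.setLong(1, clientId);
            ins.setInt(2, filenumber);
            ins.executeUpdate();
            con.commit();
            return filenumber;
        } catch (Exception e) {
            con.rollback();
            throw e;
        }
    }
}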
How can I accomplish something similar in MongoDB or CouchDB/Couchbase? Is it even feasible? I keep reading about two-phase commits, but I can't seem to wrap my head around how that works in this kind of instance. Is there anything in Spring/Java that provides two-phase commit that would work with these DBs, or does it need to be custom code?
CouchDB is transactional by default at the level of a single document. Every document in CouchDB contains a _rev key, and all updates to a document are performed against this _rev key:
Get the document.
Send it for update using the _rev property.
If the update succeeds, then you have updated the latest _rev of the document.
If the update fails, the document you read was not the most recent one. Repeat steps 1-3.
Check out this answer by MrKurt for a more detailed explanation.
The CouchDB recipes book has a banking example that shows how transactions are done in CouchDB.
And there is also this atomic bank transfers article that illustrates transactions in CouchDB.
Anyway, the common theme in all of these links is that if you follow the CouchDB pattern of updating against a _rev, you can't have an inconsistent state in your database.
Heck, just to be safe, I could even put a unique constraint on a (clientId, filenumber) index to ensure that there is never the same filenumber used twice for a client.
All CouchDB documents are unique, since the _id fields of two documents can't be the same. Check out the view cookbook:
This is an easy one: within a CouchDB database, each document must have a unique _id field. If you require unique values in a database, just assign them to a document’s _id field and CouchDB will enforce uniqueness for you.
There’s one caveat, though: in the distributed case, when you are running more than one CouchDB node that accepts write requests, uniqueness can be guaranteed only per node or outside of CouchDB. CouchDB will allow two identical IDs to be written to two different nodes. On replication, CouchDB will detect a conflict and flag the document accordingly.
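Following that advice for the (clientId, filenumber) constraint: make the pair the document's _id, and a duplicate insert fails with a conflict. A sketch over CouchDB's HTTP API using java.net.http (server URL and database name are hypothetical):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class UniqueFileInsert {
    // PUT a file document whose _id encodes (clientId, filenumber).
    // CouchDB enforces _id uniqueness, so a second PUT for the same
    // pair (without a _rev) fails with 409 Conflict.
    static boolean insertFile(HttpClient client, String clientId, int fileNumber)
            throws Exception {
        String url = "http://localhost:5984/files/" + clientId + ":" + fileNumber;
        HttpResponse<String> resp = client.send(
                HttpRequest.newBuilder(URI.create(url))
                        .header("Content-Type", "application/json")
                        .PUT(HttpRequest.BodyPublishers.ofString("{}"))
                        .build(),
                HttpResponse.BodyHandlers.ofString());
        return resp.statusCode() == 201; // 409 means the filenumber is taken
    }
}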
Edit based on comment
In a case where you want to increment a field in one document based on the successful insert of another document:
You could use separate documents in this case. You insert a document and wait for the success response. Then you add another document like:
{"_id": "some_id", "count": 1}
With this you can set up a map/reduce view that simply counts these documents, and you have an update counter. Instead of updating a single document for every change, you insert a new document to reflect each successful insert.
I always end up with the case where a failed file insert would leave the DB in an inconsistent state, especially with another client successfully inserting a file at the same time.
Okay, so I already described how you can do updates over separate documents, but even when updating a single document you can avoid inconsistency if you:
Insert a new file.
When CouchDB gives a success message, attempt to update the counter.
Why does this work?
It works because when you try to update a document you must supply a _rev string. You can think of _rev as the local state of your document. Consider this scenario:
You read the document that is to be updated.
You change some fields.
Meanwhile, another request has already changed the original document. This means the document now has a new _rev.
But you ask CouchDB to update the document with the now-stale _rev that you read in step 1.
CouchDB will reject the update with a conflict error (HTTP 409).
You read the document again, get the latest _rev, and attempt the update again.
So if you do this, you will always be updating against the latest revision of the document. I hope this makes things a bit clearer.
Note:
As pointed out by Daniel, the _rev rules don't apply to bulk updates.
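A minimal sketch of this read-modify-retry loop over CouchDB's HTTP API with java.net.http (server URL, database, document name, and payload are hypothetical):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class RevRetryUpdate {
    static final HttpClient CLIENT = HttpClient.newHttpClient();
    static final String DOC_URL = "http://localhost:5984/mydb/counter";

    static void updateCounter() throws Exception {
        while (true) {
            // Step 1: read the document; the body contains its current _rev.
            HttpResponse<String> get = CLIENT.send(
                    HttpRequest.newBuilder(URI.create(DOC_URL)).GET().build(),
                    HttpResponse.BodyHandlers.ofString());
            String rev = extractRev(get.body());

            // Step 2: write the changed document back, citing the _rev we read.
            String updated = "{\"_id\":\"counter\",\"_rev\":\"" + rev + "\",\"count\":42}";
            HttpResponse<String> put = CLIENT.send(
                    HttpRequest.newBuilder(URI.create(DOC_URL))
                            .header("Content-Type", "application/json")
                            .PUT(HttpRequest.BodyPublishers.ofString(updated))
                            .build(),
                    HttpResponse.BodyHandlers.ofString());

            if (put.statusCode() == 201) {
                return; // success: we updated the latest revision
            }
            if (put.statusCode() != 409) {
                throw new IllegalStateException("unexpected status " + put.statusCode());
            }
            // 409 Conflict: someone updated the document first; loop and retry.
        }
    }

    // Naive _rev extraction for the sketch; a real app would use a JSON library.
    static String extractRev(String json) {
        int i = json.indexOf("\"_rev\":\"") + 8;
        return json.substring(i, json.indexOf('"', i));
    }
}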
Yes, you can do the same with MongoDB and Couchbase/CouchDB using the proper approach.
First of all, MongoDB has unique indexes, which solve part of the problem:
- http://docs.mongodb.org/manual/tutorial/create-a-unique-index/
You also have a pattern to implement sequences properly:
- http://docs.mongodb.org/manual/tutorial/create-an-auto-incrementing-field/
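A sketch of both patterns with the MongoDB Java driver (database, collection, and field names are assumptions):

import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.FindOneAndUpdateOptions;
import com.mongodb.client.model.IndexOptions;
import com.mongodb.client.model.Indexes;
import com.mongodb.client.model.ReturnDocument;
import com.mongodb.client.model.Updates;
import org.bson.Document;

class ClientFileNumbers {
    public static void main(String[] args) {
        MongoDatabase db = MongoClients.create("mongodb://localhost").getDatabase("demo");
        MongoCollection<Document> counters = db.getCollection("counters");
        MongoCollection<Document> files = db.getCollection("files");

        // Unique index: no two files may share (clientId, filenumber).
        files.createIndex(Indexes.ascending("clientId", "filenumber"),
                new IndexOptions().unique(true));

        // Auto-increment: atomically bump and fetch the client's counter.
        Document counter = counters.findOneAndUpdate(
                Filters.eq("_id", "client42"),
                Updates.inc("seq", 1),
                new FindOneAndUpdateOptions()
                        .upsert(true)
                        .returnDocument(ReturnDocument.AFTER));
        int next = counter.getInteger("seq");

        // If two writers race to the same number, the unique index
        // rejects the second insert instead of corrupting the data.
        files.insertOne(new Document("clientId", "client42").append("filenumber", next));
    }
}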
You have many options to implement cross-document/collection transactions; you can find some good information about this in this blog post:
http://edgystuff.tumblr.com/post/93523827905/how-to-implement-robust-and-scalable-transactions (two-phase commits are documented in detail here: http://docs.mongodb.org/manual/tutorial/perform-two-phase-commits/ )
Since you are talking about Couchbase, you can find some patterns here too:
http://docs.couchbase.com/couchbase-devguide-2.5/#providing-transactional-logic
I'm integrating search functionality into a desktop application and I'm using vanilla Lucene to do so. The application handles potentially thousands of POJOs, each with its own set of key/value properties. When mapping models between my application and Lucene I originally thought of assigning each POJO a Document and adding the properties as Fields. This approach works great as far as indexing and searching goes, but the main downside is that whenever a POJO changes its properties I have to reindex ALL the properties again, even the ones that didn't change, in order to update the index. I have been thinking of changing my approach and instead creating a Document per property and assigning the same id to all the Documents from the same POJO. This way, when a POJO property changes, I only update its corresponding Document without reindexing all the other unchanged properties. I think that the graph db Neo4j follows a similar approach when it comes to indexing, but I'm not completely sure. Could anyone comment on the possible impact on performance, querying, etc.?
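To make the proposal concrete, a sketch of that per-property layout in vanilla Lucene (the field names and composite-key scheme are assumptions):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

class PropertyIndexer {
    // Each (pojoId, propertyKey) pair becomes its own tiny Document,
    // so one changed property can be replaced without touching the rest.
    static void indexProperty(IndexWriter writer, String pojoId, String key, String value)
            throws java.io.IOException {
        Document doc = new Document();
        String docKey = pojoId + ":" + key; // composite primary key
        doc.add(new StringField("docKey", docKey, Field.Store.YES));
        doc.add(new StringField("pojoId", pojoId, Field.Store.YES));
        doc.add(new TextField(key, value, Field.Store.YES));
        // updateDocument atomically deletes any document matching the
        // term and adds the new one: a single-property update.
        writer.updateDocument(new Term("docKey", docKey), doc);
    }
}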
It depends fundamentally on what you want to return as a Document in a search result.
But indexing is pretty cheap. Does a changed POJO really have so many properties that reindexing them all is a major problem?
If you only search one field in every search request, splitting one POJO into several documents will speed up reindexing. But it will cause another problem if you search multiple fields: a POJO may appear many times in the results.
Actually, I agree with EJP: building an index is very fast on a small dataset.
I'd like to implement a filter/search feature in my application using Lucene.
Querying a Lucene index gives me a Hits instance, which is nothing more than a list of Documents matching my criteria.
Since I generate the indexed Documents from my objects, which is the best way to find the original object related to a specific Lucene Document?
A better description of my situation:
Three model classes for now: Folder (can have other Folders or Lists as children), List (can have Tasks as children) and Task (can have other Tasks as children). They are all DefaultMutableTreeNode subclasses. I'll add the Tag entity in the future.
Each Task has a text, a start date, a due date, some boolean flags.
They are displayed in a JTree.
The whole tree is saved in an XML file.
I'd like to do things like these:
Search Tasks with Google-like queries.
Find all Tasks that start today.
Filter Tasks by Tag.
You can't, not with vanilla Lucene. You said yourself that you converted your objects into Documents and then stored the Documents in Lucene; how would you imagine that process would be reversible?
If you want to store and retrieve your own objects in Lucene, I strongly recommend that you use Compass instead. Compass is to Lucene what Hibernate is to JDBC - you define a mapping between your objects and Lucene documents, Compass takes care of the conversion.
Add a "stored" field that contains an object identifier. For each hit, lookup the original object via the identifier.
Without knowing more context, it's hard to be more specific.
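A minimal sketch of that stored-identifier approach in vanilla Lucene (the field names are assumptions; how you map ids back to your Task objects is up to your application):

import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;

class ObjectLookup {
    // Indexing: store each object's identifier alongside its searchable text.
    static void index(IndexWriter writer, String taskId, String text) throws Exception {
        Document doc = new Document();
        doc.add(new StringField("id", taskId, Field.Store.YES)); // stored, not tokenized
        doc.add(new TextField("text", text, Field.Store.NO));    // searchable, not stored
        writer.addDocument(doc);
    }

    // Searching: recover the identifiers, then look the original
    // objects up in whatever map or store your application keeps.
    static List<String> search(IndexSearcher searcher, Query query) throws Exception {
        List<String> ids = new ArrayList<>();
        for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
            ids.add(searcher.doc(hit.doc).get("id"));
        }
        return ids;
    }
}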