How to perform Upsert in solr [duplicate] - java

I've been attempting to do the equivalent of an UPSERT (insert, or update if it already exists) in Solr. I only know what does not work, and the Solr/Lucene documentation I have read has not been helpful. Here's what I have tried:
curl 'localhost:8983/solr/update?commit=true' -H 'Content-type:application/json' -d '[{"id":"1","name":{"set":"steve"}}]'
{"responseHeader":{"status":409,"QTime":2},"error":{"msg":"Document not found for update. id=1","code":409}}
I do up to 50 updates in one request, and a request may contain the same id with exclusive fields (title_en and title_es, for example). If there were a way of querying whether or not a list of ids exists, I could split the data and perform separate insert and update commands... This would be an acceptable alternative, but is there already a handler that does this? I would like to avoid writing any in-house routines at this point.
Thanks.

With Solr 4.0 you can do a partial update of those documents, sending just the fields that have changed while keeping the rest of the document the same. The id should match an existing document.

Solr does not support UPSERT mechanics out of the box. You can create a record or you can update a record, and the syntax is different for each.
And if you update a record, you must make sure all of its other previously inserted fields are stored (not just indexed). Under the covers, an update creates a completely new record, pre-populated with the previously stored values. But that functionality is very deep inside (probably in Lucene itself).
Have you looked at DataImportHandler? You reverse the control flow (start from Solr), but it does have support for checking which records need to be updated and which records need to be created.
Or you can just run a Solr query like http://solr.example.com:8983/solr/select?q=id%3A(ID1+ID2+ID3)&fl=id&wt=csv, where you ask Solr to look for your ID records and return only the IDs of the records it does find. Then you can post-process that result to split your updates from your inserts, for example along the lines of the sketch below.
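For illustration, here is a rough SolrJ sketch of that split. It is only a sketch: the collection name, the name field, and the sample data are placeholders, and it assumes a reasonably recent SolrJ client (HttpSolrClient). Existing ids get an atomic {"set": ...} update, new ids get a plain document.

import java.util.*;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class SolrUpsert {
    public static void main(String[] args) throws Exception {
        // Collection name "collection1" and field "name" are placeholders.
        HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/collection1").build();

        Map<String, String> incoming = new LinkedHashMap<>();
        incoming.put("1", "steve");
        incoming.put("2", "anna");

        // 1. Ask Solr which of the incoming ids already exist.
        SolrQuery q = new SolrQuery("id:(" + String.join(" ", incoming.keySet()) + ")");
        q.setFields("id");
        q.setRows(incoming.size());
        QueryResponse rsp = solr.query(q);
        Set<String> existing = new HashSet<>();
        for (SolrDocument d : rsp.getResults()) {
            existing.add(d.getFieldValue("id").toString());
        }

        // 2. Atomic updates ({"set": ...}) for existing ids, plain documents for new ones.
        List<SolrInputDocument> docs = new ArrayList<>();
        for (Map.Entry<String, String> e : incoming.entrySet()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", e.getKey());
            if (existing.contains(e.getKey())) {
                doc.addField("name", Collections.singletonMap("set", e.getValue()));
            } else {
                doc.addField("name", e.getValue());
            }
            docs.add(doc);
        }
        solr.add(docs);
        solr.commit();
        solr.close();
    }
}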

Related

CouchDB/Couchbase/MongoDB transaction emulation?

I've never used CouchDB/MongoDB/Couchbase before and am evaluating them for my application. Generally speaking, they seem to be a very interesting technology that I would like to use. However, coming from an RDBMS background, I am hung up on the lack of transactions. But at the same time, I know that there is going to be much less of a need for transactions than I would have in an RDBMS, given the way the data is organized.
That being said, I have the following requirement and am not sure if/how I can use a NoSQL DB.
I have a list of clients
Each client can have multiple files
Each file must be sequentially numbered for that specific client
Given an RDBMS this would be fairly simple. One table for clients, one (or more) for files. In the client table, keep a counter of the last filenumber, and increment it by one when inserting a new record into the file table. Wrap everything in a transaction and you are assured that there are no inconsistencies. Heck, just to be safe, I could even put a unique constraint on a (clientId, filenumber) index to ensure that the same filenumber is never used twice for a client.
How can I accomplish something similar in MongoDB or CouchDB/Couchbase? Is it even feasible? I keep reading about two-phase commits, but I can't seem to wrap my head around how that works in this kind of instance. Is there anything in Spring/Java that provides two-phase commit that would work with these DBs, or does it need to be custom code?
CouchDB is transactional by default. Every document in CouchDB contains a _rev key. All updates to a document are performed against this _rev key, as in the following steps (a sketch of the loop comes after them):
Get the document.
Send it for update using the _rev property.
If the update succeeds, you have updated the latest _rev of the document.
If the update fails, your copy of the document was not the most recent revision. Repeat steps 1-3.
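A minimal sketch of that loop over CouchDB's plain HTTP API, assuming Java 11's HttpClient and the org.json library; the database name (mydb), the document id, and the count field are placeholders.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import org.json.JSONObject;

public class CouchDbRevUpdate {
    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();
        String docUrl = "http://localhost:5984/mydb/some_doc_id";   // hypothetical document

        while (true) {
            // 1. Get the current document (its body includes the current _rev).
            HttpResponse<String> get = http.send(
                    HttpRequest.newBuilder(URI.create(docUrl)).GET().build(),
                    HttpResponse.BodyHandlers.ofString());
            JSONObject doc = new JSONObject(get.body());

            // 2. Change some fields locally.
            doc.put("count", doc.optInt("count", 0) + 1);

            // 3. Send the update back, carrying the _rev we just read.
            HttpResponse<String> put = http.send(
                    HttpRequest.newBuilder(URI.create(docUrl))
                            .header("Content-Type", "application/json")
                            .PUT(HttpRequest.BodyPublishers.ofString(doc.toString()))
                            .build(),
                    HttpResponse.BodyHandlers.ofString());

            if (put.statusCode() == 201 || put.statusCode() == 202) {
                break;                  // we updated the latest _rev
            } else if (put.statusCode() == 409) {
                continue;               // stale _rev: re-read and retry (steps 1-3)
            } else {
                throw new IllegalStateException("Unexpected status: " + put.statusCode());
            }
        }
    }
}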
Check out this answer by MrKurt for a more detailed explanation.
The CouchDB recipes book has a banking example that shows how transactions are done in CouchDB.
And there is also this atomic bank transfers article that illustrates transactions in CouchDB.
Anyway, the common theme in all of these links is that if you follow the CouchDB pattern of updating against a _rev, you can't end up with an inconsistent state in your database.
Heck, just to be safe, I could even put a unique constraint on a (clientId, filenumber) index to ensure that the same filenumber is never used twice for a client.
All CouchDB documents are unique, since the _id fields of two documents can't be the same. Check out the view cookbook:
This is an easy one: within a CouchDB database, each document must have a unique _id field. If you require unique values in a database, just assign them to a document’s _id field and CouchDB will enforce uniqueness for you.
There’s one caveat, though: in the distributed case, when you are running more than one CouchDB node that accepts write requests, uniqueness can be guaranteed only per node or outside of CouchDB. CouchDB will allow two identical IDs to be written to two different nodes. On replication, CouchDB will detect a conflict and flag the document accordingly.
Edit based on comment
In a case where you want to increment a field in one document based on the successful insert of another document:
You could use separate documents in this case. You insert a document and wait for the success response, then add another document like:
{"_id": "some_id", "count": 1}
With this you can set up a map/reduce view that simply counts these documents, and you have an update counter. Instead of updating a single counter document, you insert a new document for each successful insert; a sketch of such a view follows.
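As an illustration, such a view could be created and queried over HTTP roughly like this (the database name, design document name, and view name are made up; the map function is embedded as a string and the built-in _count reduce does the counting):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CouchDbCounterView {
    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();

        // Design document with a view that emits one row per counter document
        // and reduces with the built-in _count function.
        String designDoc = "{\"views\":{\"successful_inserts\":{"
                + "\"map\":\"function(doc){ if (doc.count) { emit(doc._id, doc.count); } }\","
                + "\"reduce\":\"_count\"}}}";

        http.send(HttpRequest.newBuilder(
                        URI.create("http://localhost:5984/mydb/_design/counters"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(designDoc))
                .build(), HttpResponse.BodyHandlers.ofString());

        // Querying the reduced view returns a single row whose value is the
        // number of successful inserts recorded so far.
        HttpResponse<String> count = http.send(HttpRequest.newBuilder(
                        URI.create("http://localhost:5984/mydb/_design/counters/_view/successful_inserts"))
                .GET().build(), HttpResponse.BodyHandlers.ofString());
        System.out.println(count.body()); // e.g. {"rows":[{"key":null,"value":42}]}
    }
}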
I always end up with the case where a failed file insert would leave the DB in an inconsistent state especially with another client successfully inserting a file at the same time.
Okay, so I already described how you can do updates over separate documents, but even when updating a single document you can avoid inconsistency if you:
Insert a new file
When CouchDB gives a success message, attempt to update the counter.
Why does this work?
This works because when you try to update a document you must supply a _rev string. You can think of _rev as the local state of your document. Consider this scenario:
You read the document that is to be updated.
You change some fields.
Meanwhile another request has already changed the original document. This means the document now has a new _rev.
But you request CouchDB to update the document with the stale _rev that you read in step 1.
CouchDB will reject the update (a conflict error).
You read the document again, get the latest _rev, and attempt the update again.
So if you do this, you will always be updating against the latest revision of the document. I hope this makes things a bit clearer.
Note:
As pointed out by Daniel, the _rev rules don't apply to bulk updates.
Yes, you can do the same with MongoDB and Couchbase/CouchDB using the proper approach.
First of all, MongoDB has unique indexes, which take care of part of the problem:
- http://docs.mongodb.org/manual/tutorial/create-a-unique-index/
There is also a documented pattern for implementing sequences properly (a minimal sketch of both follows these links):
- http://docs.mongodb.org/manual/tutorial/create-an-auto-incrementing-field/
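A minimal sketch of both ideas using the MongoDB Java driver; the collection names (clients, files), the field names, and the client id are illustrative, not part of any official recipe:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.FindOneAndUpdateOptions;
import com.mongodb.client.model.IndexOptions;
import com.mongodb.client.model.Indexes;
import com.mongodb.client.model.ReturnDocument;
import com.mongodb.client.model.Updates;
import org.bson.Document;

public class SequentialFileNumbers {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> clients = client.getDatabase("test").getCollection("clients");
            MongoCollection<Document> files = client.getDatabase("test").getCollection("files");

            // Unique index: the same (clientId, fileNumber) pair can never be inserted twice.
            files.createIndex(Indexes.ascending("clientId", "fileNumber"),
                    new IndexOptions().unique(true));

            // Atomically increment the per-client counter and read the new value back.
            Document updated = clients.findOneAndUpdate(
                    Filters.eq("_id", "client-42"),                 // hypothetical client id
                    Updates.inc("lastFileNumber", 1),
                    new FindOneAndUpdateOptions()
                            .upsert(true)
                            .returnDocument(ReturnDocument.AFTER));
            int nextNumber = updated.getInteger("lastFileNumber");

            // The unique index is the safety net: if two writers ever got the same number,
            // the second insert would fail instead of silently duplicating it.
            files.insertOne(new Document("clientId", "client-42").append("fileNumber", nextNumber));
        }
    }
}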
You have many options for implementing cross-document/collection transactions; you can find some good information about this in this blog post:
http://edgystuff.tumblr.com/post/93523827905/how-to-implement-robust-and-scalable-transactions (the 2 phase commit is documented in detail here: http://docs.mongodb.org/manual/tutorial/perform-two-phase-commits/ )
Since you are also talking about Couchbase, you can find some patterns here too:
http://docs.couchbase.com/couchbase-devguide-2.5/#providing-transactional-logic

Hbase sort on column qualifiers

I have an HBase table with a couple of million records. Each record has a couple of properties describing it, each stored in a column qualifier (mostly int or string values).
I have a requirement that I should be able to see the records paginated and sorted based on a column qualifier (or even more than one, in the future). What would be the best approach to do this? I have looked into secondary indexes using coprocessors (mostly hindex from Huawei), but it doesn't seem to match my use case exactly. I've also thought about replicating all the data into multiple tables, one for each sort property (which would be included in the rowkey), and then redirecting queries to those tables. But this seems very tedious, as I already have quite a few such properties.
Thanks for any suggestions.
You need your NoSQL database to work just like an RDBMS, and given the size of your data, your life would be a lot simpler if you stuck with an RDBMS, unless you expect exponential growth :) Also, you don't mention whether your data gets updated, which is very important for making a good decision.
Having said that, you have a lot of options, here are some:
If you can wait for the results: Write a MapReduce task to do the scan, sort it, and retrieve the top X rows. Do you really need more than 1000 pages (20-50k rows) for each sort type? Another option would be using something like Hive.
If you can aggregate the data and "reduce" the dataset: Write a MapReduce task to periodically export the newest aggregated data to a SQL table (which will handle the queries). I've done this a few times and it works like a charm, but it depends on your requirements.
If you have plenty of storage: Write a MapReduce task to periodically regenerate (or append data to) a new table for each property (sorted by it in the rowkey). You don't need multiple tables; just use a prefix in your rowkeys for each case. Or, if you don't want extra tables and won't have a lot of queries, simply write the sorted data to CSV files and store them in HDFS, where they can easily be read by your frontend app.
Manually maintain a secondary index: This would not be very tolerant of schema updates and new properties, but would work great for near-real-time results. To do it, you have to update your code to also write to the secondary table, with a good buffer to help with performance while avoiding hot regions. Think about this type of rowkey: [4B SORT FIELD ID (4 chars)] [8B SORT FIELD VALUE] [8B TIMESTAMP], with just one column storing the rowkey of the main table. To retrieve the data sorted by any of the fields, just perform a SCAN using the SORT FIELD ID as the start row, plus the starting sort field value as the pivot for pagination (ignore it to get the first page, then set it to the last value retrieved). That way you'll have the rowkeys of the main table, and you can perform a multiget against it to retrieve the full data (see the sketch after this list). Keep in mind that you'll need a small script to scan the main table and backfill the index table for the existing rows.
Rely on automatic secondary indexing through coprocessors, like the one you mentioned, although I do not like this option at all.
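A rough sketch of the manual secondary index from option 4, written against the HBase 2.x client API; the table names, the "d:key" column, and the 4-character sort field id are made-up examples:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ManualSecondaryIndex {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table index = conn.getTable(TableName.valueOf("idx"))) {

            // Index rowkey: [4B sort field id][8B sort field value][8B timestamp]
            byte[] rowkey = Bytes.add(
                    Bytes.toBytes("pric"),                      // which property we sort by
                    Bytes.toBytes(1299L),                       // the property value (fixed width)
                    Bytes.toBytes(System.currentTimeMillis()));
            Put put = new Put(rowkey);
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("key"), Bytes.toBytes("mainRowKey123"));
            index.put(put);

            // Paginated, sorted read: start at the sort field id
            // (append the last seen sort value to the start row for page N).
            Scan scan = new Scan()
                    .withStartRow(Bytes.toBytes("pric"))
                    .setLimit(50);
            try (ResultScanner scanner = index.getScanner(scan)) {
                for (Result r : scanner) {
                    byte[] mainKey = r.getValue(Bytes.toBytes("d"), Bytes.toBytes("key"));
                    // collect mainKey values here, then multi-get them from the main table
                }
            }
        }
    }
}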
You have mostly enumerated the options. As you are aware, HBase does not natively support secondary indexes. In addition to hindex, you may consider Phoenix:
https://github.com/forcedotcom/phoenix
(from Salesforce), which in addition to secondary indexes has a JDBC driver and SQL support.

Updating lucene index - the most popular way

I am updating my Lucene index once a day. My strategy in general is:
Find all objects in the DB that were modified since the last index generation.
Create a new tmp-index for these objects (the old index is still available).
Remove all newly indexed documents (they are in tmp-index) from the old index using IndexWriter.deleteDocuments(Term).
Merge the old index and tmp-index using IndexWriter.addIndexes(...).
I have found this in the Lucene wiki: There is no direct update procedure in Lucene...
I have also found this in the Lucene 4.1.0 docs: A document can be updated with updateDocument...
I have tried IndexWriter.updateDocument(Term, Document), but then, when performing a search with a filter, I got an NPE from one of my methods, which does not happen when I update the index as described in steps 1-4. Has anyone had a similar problem? How do you update your index?
What I do is basically this:
I keep persistent IndexReader/Readers; they keep the state they had at the time they were created.
I start to delete and create all documents once again. I think I just do a deleteAll() and then recreate them (addDocument()).
I commit, which will activate all those changes.
I drop all the IndexReaders that I have, so the next time the system requests a Reader, it will create it and store it for subsequent requests.
The updateDocument is basically a delete/create, afaik.
You might want to use a SearcherManager to get new IndexSearchers as you update the index with an IndexWriter; see the sketch below. I see no need for using a temporary index.
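A minimal sketch of that combination, written roughly against the Lucene 4.x API (constructors differ between Lucene versions, so treat the details as an approximation):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.SearcherFactory;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class InPlaceUpdate {
    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(Version.LUCENE_41, new StandardAnalyzer(Version.LUCENE_41)));

        // SearcherManager hands out searchers and refreshes them after commits.
        SearcherManager manager = new SearcherManager(writer, true, new SearcherFactory());

        // updateDocument = delete-by-term + add, so "id" must uniquely identify the object.
        Document doc = new Document();
        doc.add(new StringField("id", "42", Store.YES));
        doc.add(new TextField("title", "new title for object 42", Store.YES));
        writer.updateDocument(new Term("id", "42"), doc);
        writer.commit();

        // Pick up the changes without reopening readers by hand.
        manager.maybeRefresh();
        IndexSearcher searcher = manager.acquire();
        try {
            System.out.println(searcher.search(new TermQuery(new Term("id", "42")), 1).totalHits);
        } finally {
            manager.release(searcher);
        }
        manager.close();
        writer.close();
    }
}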

Accessing database multiple times

I am working on a solution to the problem described below, but could not find any best practice/tool for it.
For a batch of requests (say 5000 unique ids and records) received by the web service, it has to fetch the rows for those unique ids from the database, keep them in a buffer (or cache), and compare them with the records received in the web service call. If a particular piece of data (say a column) has changed, it is updated in the table for that unique id. In turn, the child tables of that table are also affected. For example, if someone changes his laptop model number and country, the model number will be updated in one table and the country value in another table. Likewise it goes on accessing multiple tables within a short time. The number of records coming in a web service call might reach 70K in one call in an hour.
I don't have any option other than implementing it in Java. Is there a good practice for implementing this, or can it be achieved using any open-source Java tools? Please suggest. Thanks.
Hibernate is likely the first thing you should try. I tend to avoid it because it is overkill for most of my applications, but it is a standard tool for accessing databases which anyone who knows Java should at least have an understanding of. There are dozens of other solutions you could use, but Hibernate is the most often used.
JDBC is the API to use to access a relational database. Useful performance and security tips (a sketch combining them follows below):
use prepared statements
use where ... in () queries to load many rows at once, but beware of the limit on the number of values in the in clause (1000 max in Oracle)
use batched statements to make your updates, rather than executing each update separately (see http://download.oracle.com/javase/1.3/docs/guide/jdbc/spec2/jdbc2.1.frame6.html)
See http://download.oracle.com/javase/tutorial/jdbc/ for a tutorial on JDBC.
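A minimal sketch putting those tips together with plain JDBC; the connection URL and the device table with id/model columns are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.*;

public class BatchedSync {
    public static void main(String[] args) throws Exception {
        // URL/credentials and the "device" table with "id"/"model" columns are placeholders.
        try (Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/SERVICE", "user", "password")) {
            con.setAutoCommit(false);

            List<Long> ids = Arrays.asList(1L, 2L, 3L);            // ids from the web service batch
            Map<Long, String> incoming = new HashMap<>();          // values from the web service
            incoming.put(1L, "model-A"); incoming.put(2L, "model-B"); incoming.put(3L, "model-C");

            // Load the current rows in one round trip with an IN clause
            // (chunk the id list to stay under Oracle's 1000-value limit).
            String placeholders = String.join(",", Collections.nCopies(ids.size(), "?"));
            Map<Long, String> current = new HashMap<>();
            try (PreparedStatement ps = con.prepareStatement(
                    "select id, model from device where id in (" + placeholders + ")")) {
                for (int i = 0; i < ids.size(); i++) ps.setLong(i + 1, ids.get(i));
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) current.put(rs.getLong("id"), rs.getString("model"));
                }
            }

            // Batch the updates for the rows that actually changed.
            try (PreparedStatement upd = con.prepareStatement(
                    "update device set model = ? where id = ?")) {
                for (Long id : ids) {
                    String value = incoming.get(id);
                    if (value != null && !value.equals(current.get(id))) {
                        upd.setString(1, value);
                        upd.setLong(2, id);
                        upd.addBatch();
                    }
                }
                upd.executeBatch();
            }
            con.commit();   // in a real run, commit every so many records
        }
    }
}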
This doesn't sound that complicated. Of course, you must know (or learn):
SQL
JDBC
Then you can go through the web service data record by record and for each record do the following:
fetch corresponding database record
for each field in record
    if updated
        execute corresponding update SQL statement
commit // every so many records
70K records per hour should not be the slightest problem for a decent RDBMS.

Should I use Lucene only for search?

Our website needs to give out data to the world. This is open-source data that we have stored, and we want to make it publicly available. It's about 2 million records.
We've implemented the search of these records using Lucene, which is fine; however, we'd like to show an individual record (say the user clicks on it after the search is done) and provide more detailed information for that record.
This more detailed information, however, isn't stored in the index directly... there are many-to-many relationships, and we use our relational database (MySQL) to provide this information.
So, for example, a single record belongs to a category; we want the user to be able to click on that category and see the rest of the records within that category (there are lots more associations like this).
My question is, should we use Lucene also to store this sort of information and retrieve it through simple search (category:apples), or should MySQL continue doing this logical job? Should I use Lucene only for the search part?
EDIT
I would like to point out that all of our records are pretty static.... changes are made to this data once every week or so.
Lucene's strength lies in rapidly building an index of a set of documents and allowing you to search over them. If this "detailed information" does not need to be indexed or searched over, then don't store it in Lucene.
Lucene is not a database, it's an index.
You want to use Lucene to store data? I think it's OK. I've used Solr (http://lucene.apache.org/solr/),
which is built on top of Lucene, as a search engine that also stores extra data related to each record for front-end display. It worked with 500k records for me, and with 2 million records I think it should be fine. A rough sketch of doing the storage part directly with Lucene follows.
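If you do decide to keep the detail data in the index itself, a sketch against the Lucene 4.x API might look roughly like this: an indexed category field for queries like category:apples, plus a stored-only field for display (field names and versions are illustrative):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class CategoryBrowse {
    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(Version.LUCENE_41, new StandardAnalyzer(Version.LUCENE_41)));

        Document doc = new Document();
        doc.add(new StringField("id", "1", Store.YES));
        doc.add(new StringField("category", "apples", Store.YES));   // indexed, so category:apples works
        doc.add(new StoredField("detail", "long description, only stored, never searched"));
        writer.addDocument(doc);
        writer.commit();
        writer.close();

        // Search by category, then read the stored "detail" field back for display.
        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        for (ScoreDoc hit : searcher.search(new TermQuery(new Term("category", "apples")), 10).scoreDocs) {
            Document found = searcher.doc(hit.doc);
            System.out.println(found.get("id") + " -> " + found.get("detail"));
        }
    }
}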
