Importing MongoDB data to Solr in an optimal way - Java

I have millions of documents in my MongoDB database, and to implement searching I am using Apache Solr. I googled how to import data from MongoDB into Solr, but found that there is no straightforward approach using the Data Import Handler. So I decided to insert each document into Solr at the same time as I insert it into MongoDB, using the Solr client for Java, SolrJ. My documents have the following format in MongoDB:
{
    "_id" : ObjectId("51cc52c9e4b04f75b27542ba"),
    "article" : {
        "summary" : "As more people eschew landlines, companies are offering technologies like\npersonal cell sites and signal boosters to augment cellphones, Eric A. Taub\nwrites in The New York Times.\n\n",
        "author" : "By THE NEW YORK TIMES",
        "title" : "Daily Report: Trying to Fix a Big Flaw in Cellphone Technology (NYT)",
        "source" : "NYT",
        "publish_date" : "Thu, 27 Jun 2013 12:01:00 +0100",
        "source_url" : "http://bits.blogs.nytimes.com/feed/",
        "url" : "http://news.feedzilla.com/en_us/stories/top-news/315578403?client_source=api&format=json"
    },
    "topics" : [
        {
            "categoryName" : "Technology Internet",
            "score" : "94%"
        }
    ],
    "socialTags" : [
        {
            "originalValue" : "Cell site",
            "importance" : "1"
        },
        {
            "originalValue" : "Cellular repeater",
            "importance" : "1"
        },
        {
            "originalValue" : "Technology Internet",
            "importance" : "1"
        }
    ],
    "entities" : [
        {
            "_type" : "PublishedMedium",
            "name" : "The New York Times"
        },
        {
            "_type" : "Company",
            "name" : "The New York Times"
        },
        {
            "_type" : "Person",
            "name" : "Eric A. Taub"
        }
    ]
}
I want to index two fields: 'summary' and 'title' of the 'article' subdocument.
So far, what I have learned is that putting this entire document into Solr does not make sense, as it would increase the size of the index and make searches slower. So I decided to store only the following fields in Solr: 'docId', 'title' and 'summary'. When searching in Solr I will retrieve only the docId and then fetch the remaining details from MongoDB, because that is faster than retrieving the full data from Solr with its analysers, tokenisers and so on.
FIRST:
So I need to maintain a unique field 'docId'. Shall I use the default '_id' generated by mongod? But for that the document has to be inserted first, so that mongod can generate the '_id'. That means I would have to retrieve the document after inserting it into MongoDB, fetch the '_id', and then add the fields 'docId', 'summary' and 'title' to Solr. Can this be improved?
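For reference, the flow described above can be collapsed into a single pass, because the MongoDB Java driver lets you create the ObjectId yourself before the insert, so the docId is known before either write happens. A minimal sketch, assuming a local MongoDB and the SolrJ 4.x-era HttpSolrServer against the default 'collection1'; the database/collection names and Solr field names are placeholders, not anything prescribed:

import java.io.IOException;
import java.net.UnknownHostException;

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.bson.types.ObjectId;

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.MongoClient;

public class ArticleWriter {

    public static void main(String[] args) throws UnknownHostException, IOException, SolrServerException {
        MongoClient mongo = new MongoClient("localhost", 27017);
        DB db = mongo.getDB("newsdb");                          // hypothetical database name
        DBCollection articles = db.getCollection("articles");   // hypothetical collection name

        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        // Generate the id client-side so it is known before either write happens.
        ObjectId id = new ObjectId();

        BasicDBObject article = new BasicDBObject("summary", "As more people eschew landlines ...")
                .append("title", "Daily Report: Trying to Fix a Big Flaw in Cellphone Technology (NYT)");
        BasicDBObject doc = new BasicDBObject("_id", id).append("article", article);

        // 1. Insert the full document into MongoDB.
        articles.insert(doc);

        // 2. Index only the searchable fields (plus the id) into Solr.
        SolrInputDocument solrDoc = new SolrInputDocument();
        solrDoc.addField("docId", id.toHexString());
        solrDoc.addField("title", article.getString("title"));
        solrDoc.addField("summary", article.getString("summary"));
        solr.add(solrDoc);
        solr.commit();
    }
}

If the two stores must stay consistent, you would also want to handle failures, e.g. remove the MongoDB document or retry the Solr add when the second step fails.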
SECOND:
I need to define the schema in Solr for this, mapping fields from MongoDB to fields in Solr. I have the default instance of Solr running, from solr/example/start.jar.
It has a default schema and a default collection called 'collection1'. How can I create my own collection? I cannot find anything in the admin interface for this. I want to create a collection for my project and then write a schema as described above.
Whatever tutorials I have found simply add documents to Solr. So do I need to override the default schema?
Note: I am new to Solr, as you will already have inferred from reading the question :D So please help!

Related

How to handle race-condition (Synchronisation) in Java project using ElasticSearch as persistence

Problem: We have data stored in a DB. For example, take a USER object which is stored. CRUD operations can occur on any USER.
Is there an efficient way in which I can make these CRUD operations thread-safe when simultaneous operations arrive for the same USER?
My Solution
We use a cache, like a concurrent map, which helps hold a lock on a specific OID while it is being worked on.
The issue with this approach is that we now have an additional cache, which has to be handled in various scenarios such as a system restart.
Is there any other, more efficient way in which I can achieve this? Can anything be done in the persistence layer itself?
PS: I am not using any framework.
I think you need to use _seq_no and _primary_term.
You should read the document from Elasticsearch and retrieve its sequence number and primary term:
request:
GET example-index/_doc/FOjCPXdy5uFjy80o4vxPI4kzlRQo
response:
{
    "_index" : "example-index",
    "_type" : "_doc",
    "_id" : "FOjCPXdy5uFjy80o4vxPI4kzlRQo",
    "_version" : 2,
    "_seq_no" : 1,
    "_primary_term" : 1,
    "found" : true,
    "_source" : {
        "data": "text"
    }
}
Then pass them with your update query:
request:
POST example-index/_update/FOjCPXdy5uFjy80o4vxPI4kzlRQo?if_seq_no=1&if_primary_term=1
{
    "doc" : {
        "data": "updated text"
    }
}
If another agent has updated the document in the meantime, those parameters will have changed and you will get a 409 error:
response:
{
    "error" : {
        "root_cause" : [
            {
                "type" : "version_conflict_engine_exception",
                "reason" : "[FOjCPXdy5uFjy80o4vxPI4kzlRQo]: version conflict, required seqNo [1], primary term [1]. current document has seqNo [2] and primary term [1]",
                "index_uuid" : "GUZKnab3T5aBlzXEuOPI7Q",
                "shard" : "0",
                "index" : "example-index"
            }
        ],
        "type" : "version_conflict_engine_exception",
        "reason" : "[FOjCPXdy5uFjy80o4vxPI4kzlRQo]: version conflict, required seqNo [1], primary term [1]. current document has seqNo [2] and primary term [1]",
        "index_uuid" : "GUZKnab3T5aBlzXEuOPI7Q",
        "shard" : "0",
        "index" : "example-index"
    },
    "status" : 409
}
documentation: elastic.co/optimistic-concurrency-control
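For a Java project, the same flow might look roughly like the sketch below, written against the Elasticsearch 7.x High Level REST Client; the index name and document id come from the example above, while the client setup and the "data" field are assumptions:

import java.io.IOException;

import org.apache.http.HttpHost;
import org.elasticsearch.ElasticsearchStatusException;
import org.elasticsearch.action.get.GetRequest;
import org.elasticsearch.action.get.GetResponse;
import org.elasticsearch.action.update.UpdateRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.rest.RestStatus;

public class OptimisticUpdate {

    public static void main(String[] args) throws IOException {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            // Read the current document together with its seq_no and primary_term.
            GetResponse current = client.get(
                    new GetRequest("example-index", "FOjCPXdy5uFjy80o4vxPI4kzlRQo"),
                    RequestOptions.DEFAULT);

            UpdateRequest update = new UpdateRequest("example-index", "FOjCPXdy5uFjy80o4vxPI4kzlRQo")
                    .doc("data", "updated text")
                    .setIfSeqNo(current.getSeqNo())
                    .setIfPrimaryTerm(current.getPrimaryTerm());

            try {
                client.update(update, RequestOptions.DEFAULT);
            } catch (ElasticsearchStatusException e) {
                if (e.status() == RestStatus.CONFLICT) {
                    // Someone else modified the document first: re-read and retry, or give up.
                }
            }
        }
    }
}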

How do I upsert a mongo (or spring-mongo) document containing a list

I'm trying to determine whether there is a way, using spring-mongodb (or even the mongo Java API), to upsert a document containing a list such that the elements of the list are always a union of the values upserted.
Suppose I have the following Classes (made up to simplify things):
public class Patron {
    private String name;
    private String address;
    private List<Book> booksRead;
    // assume gets/sets
}

public class Book {
    private String title;
    private String author;
    // assume gets/sets
}
Further, let's assume I get updates containing only the latest books read, but I want to keep the full list of all books read in the DB. So what I'd like to do is insert a Patron (with booksRead) if it doesn't exist, or update its booksRead if the Patron already exists.
So, on the first upsert 'John Doe' is not in the collection, so the document is inserted and looks like this:
"_id": ObjectId("553450062ef7b63435ec1f57"),
"name" : "John Doe"
"address" : "123 Oak st, Anytown, NY, 13760"
"booksRead" : [
{
"title" : "Grapes of Wrath",
"author" : "John Steinbeck"
},
{
"title" : "Creatures Great and Small",
"author" : "James Herriot"
}
]
John re-reads 'Grapes of Wrath' and also reads 'Of Mice and Men'. An upsert is attempted passing the two books as books read, but I'd like only 'Of Mice and Men' to be added to the read list, so the document looks like this:
"_id": ObjectId("553450062ef7b63435ec1f57"),
"name" : "John Doe"
"address" : "123 Oak st, Anytown, NY, 13760"
"booksRead" : [
{
"title" : "Grapes of Wrath",
"author" : "John Steinbeck"
},
{
"title" : "Creatures Great and Small",
"author" : "James Herriot"
},
{
"title" : "Of Mice and Men",
"author" : "John Steinbeck"
}
]
Everything I've tried seems to point to needing two separate calls: one for the insert and one for the update. Update.set works for the initial load (insert) but replaces the full list on the second update. $addToSet works on update, but complains about trying to insert into a 'non-array' on the initial insert.
UPDATE:
This appears to be an issue with Spring-mongodb. I can achieve the above with mongo Java API calls; it just fails when using the Spring equivalents (at least up to spring-data-mongodb 1.6.2).
The easiest way to accomplish this would be to remove the old entry and then re-add it with the added books. Ideally, this should be done transactionally.
I think you could do this using the $addToSet operator.
You can find the documentation on MongoDB's web site:
http://docs.mongodb.org/manual/reference/operator/update/addToSet/#up._S_addToSet
It is used with either the update() or findAndModify() methods.
The issue appears to be at the Spring layer. While I get errors upserting with its FindAndModify command, I don't have an issue with $addToSet when using the mongo DBCollection.findAndModify and DBCollection.update methods.
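For reference, a working plain-driver call is roughly the following: a single update with upsert = true, using $addToSet with $each so existing books are not duplicated. This is a sketch; the database/collection names and the book values are illustrative, mirroring the documents in the question:

import java.util.Arrays;
import java.util.List;

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.MongoClient;

public class PatronUpsert {

    public static void main(String[] args) throws Exception {
        MongoClient mongo = new MongoClient("localhost", 27017);
        DB db = mongo.getDB("library");                        // hypothetical database name
        DBCollection patrons = db.getCollection("patrons");    // hypothetical collection name

        // Match the patron by name; on upsert this also becomes part of the new document.
        DBObject query = new BasicDBObject("name", "John Doe");

        List<DBObject> latestBooks = Arrays.<DBObject>asList(
                new BasicDBObject("title", "Grapes of Wrath").append("author", "John Steinbeck"),
                new BasicDBObject("title", "Of Mice and Men").append("author", "John Steinbeck"));

        // $addToSet with $each appends only the books that are not already in booksRead.
        DBObject update = new BasicDBObject()
                .append("$set", new BasicDBObject("address", "123 Oak st, Anytown, NY, 13760"))
                .append("$addToSet", new BasicDBObject("booksRead", new BasicDBObject("$each", latestBooks)));

        // upsert = true, multi = false: insert the document if it does not exist, otherwise update it.
        patrons.update(query, update, true, false);
    }
}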

How to insert multiple index at a time on a solr update using json

I have referred to various related web pages to find out how to post multiple documents to Solr in a single request. I have gone through the Solr link http://wiki.apache.org/solr/UpdateJSON#Example but it does not explain the feature very clearly.
I have also found that creating a JSON like this:
{
    "add": {"doc": {"id" : "TestDoc1", "title" : "test1"} },
    "add": {"doc": {"id" : "TestDoc2", "title" : "another test"} }
}
can solve the issue. But in this case only the last document is updated/inserted into the index. My project is a Java project. Please help me with this.
The JSON module supports the regular JSON array notation (from 3.2 and forward). If you're adding documents, there is no need for the "add" key either:
[
    {
        "id" : "MyTestDocument",
        "title" : "This is just a test"
    },
    {
        "id" : "MyTestDocument2",
        "title" : "This is another test"
    }
]
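Since the project is in Java, the same thing can also be done through SolrJ, which sends all documents in one request. A sketch against the SolrJ 4.x-era HttpSolrServer; the URL and field names follow the example above and are not tied to any particular schema:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BulkAdd {

    public static void main(String[] args) throws SolrServerException, IOException {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        List<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();

        SolrInputDocument doc1 = new SolrInputDocument();
        doc1.addField("id", "MyTestDocument");
        doc1.addField("title", "This is just a test");
        docs.add(doc1);

        SolrInputDocument doc2 = new SolrInputDocument();
        doc2.addField("id", "MyTestDocument2");
        doc2.addField("title", "This is another test");
        docs.add(doc2);

        // A single add() call with a collection indexes all documents in one request.
        solr.add(docs);
        solr.commit();
    }
}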

Fetching referenced mongodb documents in another collection using Morphia

I've been trying to wrap my head around this...
I have the following referenced documents in the users and groups collections.
group documents
{
    "_id" : ObjectId("52eabc9914cc8d6cc1e6f723"),
    "className" : "org.xxxxxx.sms.core.domain.Group",
    "name" : "CE Play group",
    "description" : "CE Play group",
    "creationdate" : ISODate("2014-01-30T20:56:57.848Z"),
    "user" : DBRef("users", ObjectId("52ea69c714ccd207329b2476"))
}
{
    "_id" : ObjectId("52ea69c714ccd207329b2477"),
    "className" : "org.xxxxxx.sms.core.domain.Group",
    "name" : "Default",
    "description" : "Default sms group",
    "creationdate" : ISODate("2014-01-30T15:03:35.916Z"),
    "user" : DBRef("users", ObjectId("52ea69c714ccd207329b2476"))
}
users document
{
    "_id" : ObjectId("52ea69c714ccd207329b2476"),
    "className" : "org.xxxxxx.core.domain.User",
    "username" : "jomski2009",
    "firstname" : "Jome",
    "lastname" : "Akpoduado",
    "email" : "jomea#example.com",
    "active" : true,
    "usergroups" : [
        DBRef("groups", ObjectId("52ea69c714ccd207329b2477")),
        DBRef("groups", ObjectId("52eabc9914cc8d6cc1e6f723"))
    ]
}
I have a Morphia Datastore singleton object which has been set up in a class to retrieve a user's groups and perform some manipulations on them. Say I wanted to fetch the group named "Default" for the user with username "jomski2009"; how would I achieve this in Morphia without fetching the usergroups as a list and iterating over them just to find the group I want?
Thanks.
DBRef is a client-side concept in MongoDB. MongoDB does not do joins, so the purpose of DBRef is to hand the client a location from which to fetch the required object. This is handled by the various client libraries in different ways.
If it is feasible for your application, you might want to look at using the @Embedded annotation instead of the @Reference one. Or, at the very least, include a list of usernames in your Group objects in addition to the object references. This will allow you to filter on them in queries.
It is also worthwhile looking at moving any unique identifier like "username" into the _id field of the document, as long as it is always unique. Primary key lookups are always faster.
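As an illustration of the second suggestion: if the Group additionally stores the owner's username, the lookup becomes a single filtered query. A rough sketch against the old org.mongodb.morphia 1.x API; the extra username field is an assumption, not part of the original mapping, and User is the existing entity from the question:

import org.bson.types.ObjectId;
import org.mongodb.morphia.Datastore;
import org.mongodb.morphia.annotations.Entity;
import org.mongodb.morphia.annotations.Id;
import org.mongodb.morphia.annotations.Reference;

@Entity("groups")
public class Group {

    @Id
    private ObjectId id;
    private String name;
    private String description;
    private String username;   // denormalised copy of the owning user's username (assumption)
    @Reference
    private User user;         // the existing User entity from the question

    // getters/setters omitted

    /** Looks up one group directly instead of iterating over user.usergroups. */
    public static Group findByNameAndUsername(Datastore datastore, String name, String username) {
        return datastore.createQuery(Group.class)
                .field("name").equal(name)
                .field("username").equal(username)
                .get();
    }
}

The trade-off is keeping the duplicated username in sync whenever the user is renamed, which is the usual cost of denormalising.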

Elasticsearch for logging - need architectural advice

I am trying to come up with an optimized architecture to store event logging messages on Elasticsearch.
Here are my specs/needs:
Messages are read-only; once entered, they are only queried for reporting.
No free text search. User will use only filters for reporting.
Must be able to do timestamp range queries.
Mainly need to filter by agent and customer interactions (in addition to other fields).
Customers and agents belong to the same location.
So the most frequently executed query will be: get all LogItems given client_id, customer_id, and timestamp range.
Here is what a LogItem looks like:
"_source": {
"agent_id" : 14,
"location_id" : 2,
"customer_id" : 5289,
"timestamp" : 1320366520000, //Java Long millis since epoch
"event_type" : 7,
"screen_id" : 12
}
I need help indexing my data.
I have been reading "What is an elasticsearch index?" and "Using elasticsearch to serve events for customers" to get an idea of a good indexing architecture, but I need assistance from the pros.
So here are my questions:
The article suggests creating "One index per day". How would I do range queries with that architecture? (eg: is it possible to query on index range?)
Currently I'm using one big index. If I create one index per location_id, how do I use shards for further organization of my records?
Given the specs above, is there a better architecture you can suggest?
What fields should I filter with vs query with?
EDIT: Here's a sample query run from my app:
{
    "query" : {
        "bool" : {
            "must" : [ {
                "term" : {
                    "agent_id" : 6
                }
            }, {
                "range" : {
                    "timestamp" : {
                        "from" : 1380610800000,
                        "to" : 1381301940000,
                        "include_lower" : true,
                        "include_upper" : true
                    }
                }
            }, {
                "terms" : {
                    "event_type" : [ 4, 7, 11 ]
                }
            } ]
        }
    },
    "filter" : {
        "term" : {
            "customer_id" : 56241
        }
    }
}
You can definitely search over multiple indices. You can use wildcards or a comma-separated list of indices, for instance, but keep in mind that index names are strings, not dates.
Shards are not for organizing your data but for distributing it and eventually scaling out. How you do that is driven by your data and what you do with it. Have a look at this talk: http://vimeo.com/44716955.
Regarding your question about filters vs. queries, have a look at this other question.
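To make the multi-index point concrete: with one index per day, a timestamp range that spans several days is handled by passing several index names (or a wildcard such as "logs-2013.10.*") to the search request. A rough sketch against the pre-2.x Java TransportClient; the index naming scheme and client setup are assumptions, and the field names follow the sample query above:

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.index.query.QueryBuilders;

public class LogSearch {

    public static void main(String[] args) {
        TransportClient client = new TransportClient()
                .addTransportAddress(new InetSocketTransportAddress("localhost", 9300));

        // One index per day, e.g. logs-2013.10.01; several of them are queried in one request.
        SearchResponse response = client.prepareSearch("logs-2013.10.01", "logs-2013.10.02", "logs-2013.10.03")
                .setQuery(QueryBuilders.boolQuery()
                        .must(QueryBuilders.termQuery("agent_id", 6))
                        .must(QueryBuilders.termQuery("customer_id", 56241))
                        .must(QueryBuilders.rangeQuery("timestamp")
                                .from(1380610800000L)
                                .to(1381301940000L)))
                .execute()
                .actionGet();

        System.out.println("hits: " + response.getHits().getTotalHits());

        client.close();
    }
}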
Take a good look at logstash (and kibana). They are all about solving this problem. If you decide to roll your own architecture for this, you might copy some of their design.
