How to group Lucene's results? - java

My application indexes discussion threads. Each entry in the discussion is indexed as a separate Lucene document with a common_id field which can be used to group search hits into one discussion.
Currently, when a search is performed, a thread with 3 entries returns 3 separate hits. Even though this is correct, from the user's point of view the same entry appears in the search results multiple times.
Is there a way to tell Lucene to group its search results by the common_id field before returning them?

I believe what you are asking for is Field Collapsing, which is a feature of Solr (and I believe Elasticsearch as well).
If you want to roll your own, one possible way to do this is:
1. Add a "series id" field to each document that is a member of a series. You will have to ensure that this gets incremented for every new series.
2. Make an initial query to Lucene, and get a hit list.
3. For each hit, check to see if it has a series id; if it does, make another query by the series id in order to retrieve all the members of the series.
An alternative is to store the ids of all the series members in a field inside each member's document.
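As a rough illustration of the roll-your-own idea, here is a minimal sketch in plain Java that groups an already-retrieved hit list by series id. It is a simpler variation of the steps above: it only groups the hits that the first query returned and does not issue a second query per series. The `Hit` and `SeriesCollapser` types are hypothetical and stand in for whatever you extract from Lucene's results.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** A search hit reduced to the fields needed for collapsing (hypothetical type). */
final class Hit {
    final String docId;
    final String seriesId; // null when the document is not part of a series
    final float score;

    Hit(String docId, String seriesId, float score) {
        this.docId = docId;
        this.seriesId = seriesId;
        this.score = score;
    }
}

final class SeriesCollapser {
    /**
     * Groups hits by series id, preserving the order in which each series
     * first appears in the hit list. Hits without a series id form a group
     * of their own, keyed by their document id.
     */
    static Map<String, List<Hit>> collapse(List<Hit> hits) {
        Map<String, List<Hit>> groups = new LinkedHashMap<>();
        for (Hit hit : hits) {
            String key = hit.seriesId != null
                    ? "series:" + hit.seriesId
                    : "doc:" + hit.docId;
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(hit);
        }
        return groups;
    }
}
```

If the input list is ordered by score, the first hit of each group is its best-scoring member, so displaying one row per map entry gives one result per discussion.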

There is nothing built into Lucene that collapses results based on a field. You will need to implement that yourself.
However, they've recently built this feature into Solr.
See http://www.lucidimagination.com/blog/2010/09/16/2446/

Since version 3.2, Lucene supports grouping search results based on a field.
http://lucene.apache.org/core/4_1_0/grouping/org/apache/lucene/search/grouping/package-summary.html

Related

Is it a good idea to have unique keys to better aggregate data in MongoDB

Hello, I am creating an app where people essentially join groups to do tasks, and each group has a unique name. I want to be able to update every user document that belongs to a specific group without having to loop over the users and update one per iteration.
I want to know if it's a good idea to have a unique key like this in MongoDB:
{
...
"specific_group_name": (whatever data point here)
...
}
in each user's document, so I can just call a simple
updateMany(eq("specific_group_name", (whatever data point here)), Bson object)
to decrease the run time involved, in case there are a lot of users within the group.
Thank you
One point to note: instead of a specific group name, better to use a specific groupId. Also pay special attention to cases where you have to remove a group from its members, and to cases where a person in a particular group shouldn't receive this update.
What you want to do is entirely valid, though. If you put specific_group_name/id in the collection, then you're moving the selection logic to the db. If you're doing a one-by-one update, then you have more flexibility in how to select the users to update on the Java/application side.
If the selection is simple (i.e. always update everyone in this group), then go ahead.

Grouping results based on fields - Lucene

I'm using Lucene 4.10.4. I want to take "n results" from 20 different fields in an efficient way without searching 20 times. If I search using boolean query, we might get all the results in single search. I want to group results based on fields, is there any grouping concept?
Yes there is:
http://lucene.apache.org/core/4_10_4/grouping/org/apache/lucene/search/grouping/package-summary.html
But in newer versions it only works on DocValues, so you would have to add the field again as a DocValue to be able to group over it. (In 4.10 it may still work with the FieldCache, but I'm not familiar with that.)
You can use GroupingSearch or maybe BlockGroupingCollector to have multiple elements per group and specify how results are ordered within a group.
You have to include the lucene-grouping dependency to use it.

Performance of search in java list vs on database records using hibernate

I have a situation where I need to do some comparisons and result filtering that is not simple to do. What I want is something like Lucene's search, but I have to develop it myself; that was not my decision, and I would have gone with Lucene.
What I will do is:
Find the element by a full-word match on a certain field; if that fails, check whether the field starts with the term, then check whether it just contains it.
Every field has a weight according to its match type (full -> begins -> contains) and its priority to me.
After one field has matched, I will also check the weights of the other fields to compute a final total row weight.
Then I will return a Map of rows to their weights.
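The scoring described in these steps could be sketched in Java roughly as follows. The multiplier values, the case-insensitive comparison, the default priority of 1, and the names `WeightedMatcher`, `matchWeight` and `rowWeight` are all assumptions made for illustration; the question does not specify them.

```java
import java.util.Locale;
import java.util.Map;

final class WeightedMatcher {
    // Match-type multipliers: illustrative values, not taken from the question.
    static final int FULL = 3;
    static final int BEGINS = 2;
    static final int CONTAINS = 1;

    /** Returns the multiplier for one field value, or 0 when the term is absent. */
    static int matchWeight(String value, String term) {
        if (value == null || term == null || term.isEmpty()) {
            return 0;
        }
        // Assumption: matching ignores case.
        String v = value.toLowerCase(Locale.ROOT);
        String t = term.toLowerCase(Locale.ROOT);
        if (v.equals(t)) {
            return FULL;
        }
        if (v.startsWith(t)) {
            return BEGINS;
        }
        if (v.contains(t)) {
            return CONTAINS;
        }
        return 0;
    }

    /** Sums matchWeight * field priority over every field of a row. */
    static int rowWeight(Map<String, String> row,
                         Map<String, Integer> fieldPriority,
                         String term) {
        int total = 0;
        for (Map.Entry<String, String> field : row.entrySet()) {
            // Fields without an explicit priority count with priority 1.
            int priority = fieldPriority.getOrDefault(field.getKey(), 1);
            total += matchWeight(field.getValue(), term) * priority;
        }
        return total;
    }
}
```

Calling `rowWeight` for each candidate row and putting the non-zero results into a `Map` gives the row-to-weight mapping described above. Whether the rows come from one broad query or several narrow ones is a separate decision.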
I realized that this is not easily done in Hibernate's HQL, meaning I would have to run multiple queries.
So my question is: should I do it in Java, retrieving all records and doing my calculations to find my target, or should I do it in Hibernate by executing multiple queries? Which is better in terms of performance and speed?
Unfortunately, I think the right answer is "it depends": how many words, what data structure, whether the data fits in memory, how often you have to do the search, etc.
I am inclined to think that a database is a better solution, even if Hibernate is not part of it. You might need to learn how to write better SQL. Perhaps the dynamic SQL that Hibernate generates for you isn't sufficient. Proper JOINs and indexing might make this perform nicely.
There might be a third way to consider: Lucene and indexing. I'd need to know more about your problem to decide.

Avoiding exploding indices and entity-group write-rate limits with appengine

I have an application in which there are Courses, Topics, and Tags. Each Topic can be in many Courses and have many Tags. I want to look up every Topic that has a specific Tag x and is in specific Course y.
1. Naively, I give each Topic a list of Course ids and Tag ids, so I can select * from Topic where tagIds = x && courseIds = y. I think this query would require an exploding index: with 30 courses and 30 tags we're looking at ~900 index entries, right? At 50 x 20 I'm well over the 5000-entry limit.
2. I could just select * from Topic where tagIds = x, and then use a for loop to go through the result, choosing only Topics whose courseIds.contains(y). This returns way more results than I'm interested in and spends a lot of time deserializing those results, but the index stays small.
3. I could select __KEY__ from Topic where tagIds = x AND select __KEY__ from Topic where courseIds = y and find the intersection in my application code. If the sets are small this might not be unreasonable.
4. I could make a sort of join table, TopicTagLookup with a tagId and courseId field. The parent key of these entities would point to the relevant Topic. Then I would need to make one of these TopicTagLookup entities for every combination of courseId x tagId x relevant topic id. This is effectively like creating my own index. It would still explode, but there would be no 5000-entry limit. Now, however, I need to write 5000 entities to the same entity group, which would run up against the entity-group write-rate limit!
5. I could precalculate each query. A TopicTagQueryCache entity would hold a tagId, courseId, and a List<TopicId>. Then the query looks like select * from TopicTagQueryCache where tagId=x && courseId = y, fetching the list of topic ids, and then using a getAllById call on the list. Similar to #3, but I only have one entity per courseId x tagId. There's no need for entity groups, but now I have this potentially huge list to maintain transactionally.
Appengine seems great for queries you can precalculate. I just don't quite see a way to precalculate this query efficiently. The question basically boils down to:
What's the best way to organize data so that we can do set operations like finding the Topics in the intersection of a Course and a Tag?
Your assessment of your options is correct. If you don't need any sort criteria, though, option 3 is more or less already done for you by the App Engine datastore, with the merge join strategy. Simply do a query as you detail in option 1, without any sorts or inequality filters, and App Engine will do a merge join internally in the datastore, and return only the relevant results.
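How the datastore performs its merge join internally is not described in this thread. The sketch below only illustrates the general idea in plain Java: two key lists that are already sorted (for example, the results of the two keys-only queries in option 3) can be intersected in a single pass. `KeyIntersection` and `intersectSorted` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class KeyIntersection {
    /**
     * Intersects two ascending lists of keys in one pass. Each step advances
     * whichever cursor points at the smaller key, so no nested loop is needed.
     */
    static List<Long> intersectSorted(List<Long> a, List<Long> b) {
        List<Long> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = a.get(i).compareTo(b.get(j));
            if (cmp == 0) {
                out.add(a.get(i)); // key is present in both lists
                i++;
                j++;
            } else if (cmp < 0) {
                i++;
            } else {
                j++;
            }
        }
        return out;
    }
}
```

The cost is proportional to the combined length of the two lists, which is why this approach is reasonable when both result sets are small.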
Options 4 and 5 are similar to the relation index pattern documented in this talk.
I like #5 - you are essentially creating your own (exploding) index. It will be fast to query.
The only downsides are that you have to manually maintain it (next paragraph), and retrieving the Topic entity will require an extra query (first you query TopicTagQueryCache to get the topic ID and then you need to actually retrieve the topic).
Updating the TopicTagQueryCache you suggested shouldn't be a problem either. I wouldn't worry about doing it transactionally: this "index" will just be stale for a short period of time when you update a Topic (at worst, your Topic will temporarily show up in results it should no longer show up in, and perhaps take a moment before it shows up in new results which it should show up in; this doesn't seem so bad). You can even do this update on the task queue (to make sure this potentially large number of database writes all succeed, and so that you can quickly finish the request so your user isn't waiting).
As you said yourself, you should arrange your data to facilitate the scaling of your app. So, on the question of "What's the best way to organize data so that we can do set operations like finding the Topics in the intersection of a Course and a Tag?":
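To make the shape of such a precalculated index concrete, here is a minimal in-memory sketch in Java. It uses a plain `Map` as a stand-in for the TopicTagQueryCache entities and does not touch the datastore API; `TopicIndex` and its methods are hypothetical names, and a real version would persist each entry and run the updates from a task.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class TopicIndex {
    // Key "courseId|tagId" -> ids of the topics in that intersection.
    private final Map<String, Set<Long>> byCourseAndTag = new HashMap<>();

    private static String key(long courseId, long tagId) {
        return courseId + "|" + tagId;
    }

    /** Adds the topic under every (course, tag) pair it belongs to. */
    void index(long topicId, Set<Long> courseIds, Set<Long> tagIds) {
        for (long courseId : courseIds) {
            for (long tagId : tagIds) {
                byCourseAndTag
                        .computeIfAbsent(key(courseId, tagId), k -> new TreeSet<>())
                        .add(topicId);
            }
        }
    }

    /** Removes the topic from every entry; call before re-indexing an edited topic. */
    void remove(long topicId) {
        for (Set<Long> topicIds : byCourseAndTag.values()) {
            topicIds.remove(topicId);
        }
    }

    /** Returns the topic ids for one (course, tag) pair, or an empty set. */
    Set<Long> lookup(long courseId, long tagId) {
        return byCourseAndTag.getOrDefault(key(courseId, tagId), Collections.emptySet());
    }
}
```

Between `remove` and the following `index` call a lookup may miss the topic, which mirrors the short staleness window described above.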
You can hold your own indexes of these sets by creating objects of CourseRef and TopicRef which consist of Key only, with the ID portion being an actual Key of the corresponding entity. These "Ref" entities will be under a specific tag, thus no actual Key duplicates. So the structure for a given Tag is : Tag\CourseRef...\TopicRef...
This way given a Tag and Course, you construct the Key Tag\CourseRef and do an ancestor Query which gets you a set of keys you can fetch. This is extremely fast as it is actually a direct access, and this should handle large lists of courses or topics without the issues of List properties.
This method will require you to use the DataStore API to some extent.
As you can see, this answers one specific question; the model will do no good for other types of set operations.

Should I use Lucene only for search?

Our website needs to give out data to the world. This is open-source data that we have stored, and we want to make it publicly available. It's about 2 million records.
We've implemented the search of these records using Lucene, which is fine, however we'd like to show an individual record (say the user clicks on it after the search is done) and provide more detailed information for that record.
This more detailed information however isn't stored in the index directly... there are like many-to-many relationships and we use our relational database (MySQL) to provide this information.
So like a single record belongs to a category, we want the user to click on that category and show the rest of the records within that category (lots more associations like this).
My question is, should we use Lucene also to store this sort of information and retrieve it through simple search (category:apples), or should MySQL continue doing this logical job? Should I use Lucene only for the search part?
EDIT
I would like to point out that all of our records are pretty static.... changes are made to this data once every week or so.
Lucene's strength lies in rapidly building an index of a set of documents and allowing you to search over them. If this "detailed information" does not need to be indexed or searched over, then don't store it in Lucene.
Lucene is not a database, it's an index.
You want to use Lucene to store data? I think it's OK. I've used Solr (http://lucene.apache.org/solr/), which is built on top of Lucene, as a search engine that also stores extra data related to each record for front-end display. It worked with 500k records for me, and I think 2 million records should be fine.
