Even though I read the Elasticsearch documentation to understand what a percolator is, I still have difficulty understanding what it means and where it is used, in simple terms. Can anyone provide me with more details?
What you usually do is index documents and get them back by querying. In a nutshell, the percolator lets you index your queries and percolate documents against those indexed queries to find out which queries each document matches. It's also called reverse search, because it is the opposite of what you are used to.
There are different use cases for the percolator, the first one being any platform that stores users' interests in order to send the right content to the right users as soon as it comes in.
For instance, a user subscribes to a specific topic, and as soon as a new article on that topic comes in, a notification is sent to the interested users. You can express the users' interests as an Elasticsearch query, using the query DSL, and you can register it in Elasticsearch as if it were a document. Every time a new article is issued, you can percolate it, without needing to index it, to find out which users are interested in it. At that point you know who needs to receive a notification containing the article link (sending the notification is not done by Elasticsearch, though). An additional step would be to index the content itself, but that is not required.
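To make the "reverse search" idea concrete, here is a minimal in-memory sketch in plain Java. It is not the Elasticsearch API: the class name `ReverseSearchSketch`, the methods `register` and `percolate`, and the keyword-only "queries" are all assumptions made for illustration, and each subscriber holds a single query.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Toy model of reverse search; hypothetical names, NOT the Elasticsearch API.
class ReverseSearchSketch {
    // Registered queries: subscriber id -> predicate over an article's text.
    private final Map<String, Predicate<String>> registered = new LinkedHashMap<>();

    // "Index" a query: remember which keyword a subscriber is interested in.
    void register(String subscriberId, String keyword) {
        String needle = keyword.toLowerCase();
        registered.put(subscriberId, text -> text.toLowerCase().contains(needle));
    }

    // "Percolate" a document: return the subscribers whose query matches it.
    // The article itself is never stored anywhere.
    List<String> percolate(String articleText) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, Predicate<String>> entry : registered.entrySet()) {
            if (entry.getValue().test(articleText)) {
                matches.add(entry.getKey());
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        ReverseSearchSketch percolator = new ReverseSearchSketch();
        percolator.register("alice", "elasticsearch");
        percolator.register("bob", "football");
        // Prints the ids of the subscribers who should be notified.
        System.out.println(percolator.percolate("Elasticsearch 1.0 released"));
    }
}
```

The point of the sketch is only the direction of the lookup: the queries are the stored data and the document is the input.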
Have a look at this presentation to see a couple of other use cases and other features available in combination with the percolator starting from Elasticsearch 1.0.
In simple terms, the percolator does this:
User: Hey Percolator! How can you help me?
Percolator: Hi User! I can help you get alerts about your interests.
User: That's great! What should I do next?
Percolator: Please let me know your interests in the form of queries indexed in Elasticsearch.
User: I've prepared all my interests as queries and indexed them into Elasticsearch. Is it that simple?
Percolator: Yes! It is that simple! I'll watch all incoming documents and get back to you with the documents that match any of your interests (queries)!
User: That's awesome! I'm just curious: how do you figure out which documents match my interests?
Percolator: That's a good question! The answer is very simple! You indexed your interests as queries into Elasticsearch, right? I take those queries and run all of them (not exactly all, but for simplicity let's assume all) against incoming documents (these documents need not be indexed and can just be sent for percolation). In fact, this process is called percolation! If a document matches any of your queries, I'll send that result to the client (which could be you)!
Under the hood, a percolate query will take what you want to percolate (e.g. that news article that you want to alert on) and Elasticsearch will create a tiny in-memory index with that document.
You'd have a bunch of registered queries (e.g. one for each user's preferences). Initially, Elasticsearch will pre-filter queries that are likely to match, then run those likely ones. Much like Luwak used to do (now Lucene Monitor).
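The "pre-filter, then run the likely ones" idea can be sketched as follows. This is a simplified illustration under stated assumptions, not how Elasticsearch or Lucene Monitor are implemented: every registered query is assumed to be a plain AND of terms, one of its terms is used as an "anchor" for candidate selection, and the names `PrefilterSketch`, `register` and `percolate` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: queries are conjunctions of terms, pre-filtered by one anchor term.
class PrefilterSketch {
    private final Map<String, Set<String>> termsByQuery = new HashMap<>();
    private final Map<String, List<String>> queriesByAnchor = new HashMap<>();

    void register(String queryId, String... requiredTerms) {
        if (requiredTerms.length == 0) {
            throw new IllegalArgumentException("a query needs at least one term");
        }
        Set<String> terms = new LinkedHashSet<>();
        for (String term : requiredTerms) {
            terms.add(term.toLowerCase());
        }
        termsByQuery.put(queryId, terms);
        // Index the query under its first term only; that term must be in any matching document.
        String anchor = terms.iterator().next();
        queriesByAnchor.computeIfAbsent(anchor, k -> new ArrayList<>()).add(queryId);
    }

    List<String> percolate(String document) {
        // Stand-in for the tiny in-memory index: the set of terms in this one document.
        Set<String> docTerms = new HashSet<>(Arrays.asList(document.toLowerCase().split("\\W+")));
        List<String> matches = new ArrayList<>();
        for (String term : docTerms) {
            // Pre-filter: only queries anchored on a term of the document are candidates.
            for (String candidate : queriesByAnchor.getOrDefault(term, Collections.emptyList())) {
                // Full evaluation: every required term must be present.
                if (docTerms.containsAll(termsByQuery.get(candidate))) {
                    matches.add(candidate);
                }
            }
        }
        Collections.sort(matches);
        return matches;
    }
}
```

Queries whose anchor term does not occur in the document are never evaluated, which is what keeps the cost down when there are many registered queries.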
The rule of thumb, for the alerting use-case at least, is:
have lots of incoming documents and few queries (e.g. alert on logs)? Simply run queries at a scheduled interval
have fewer documents and lots of queries? Then percolate these documents
I've also seen people using percolator to tag documents, but implementing something custom in the indexing pipeline to do that sounds more logical.
Related
I have a database with 20,000 records. Each record has a name. When a user wants to view a record, he can visit a webapp and type the name of the record in an input field. While he types, results from the database would be shown/filtered, matching what the user typed. I would like to know the basic architecture/concepts for programming this.
I'm using the following language stack:
frontend: html5/javascript (+ajax to make instant calls while user is typing)
backend: java + jdbc to connect to simple sql database
My initial idea is:
A user types text
Whenever a character is entered or removed in the input field, make an ajax request to the backend
The backend does a LIKE %input% query on the name field in the database
All data found by the query is sent as a JSON string to the frontend
The frontend processes the JSON string and displays whatever results it finds
My two concerns are: the high amount of ajax requests to process, in conjunction with the possibly very heavy LIKE queries. What are ways to optimize this? Only search for every two characters they type/remove? Only query for the first ten results?
Do you know of websites that utilise these optimizations?
NOTE: assume the records are persons and names are like real people names, so some names are more common than others.
You can choose an SPA approach: load all 20,000 names/ids to the client side and then filter them in memory. It should be the fastest way, with minimal load on the database and back end.
Here are possible solutions:
Restrict search to prefix search - LIKE 'prefix%' can be executed efficiently using a BTREE-type index.
Measure the performance of the naive LIKE '%str%' solution - if you are working on a B2B application, the database will likely load that table in memory and run the queries fast enough.
Look at the documentation for your database - there could be special features for this, like an inverted index.
As @Stepan Novikov suggested, load your data in memory and search manually.
Use specialized search indexers like SOLR or ElasticSearch (likely overkill for only 20k records)
If you are feeling ninja, implement your own N-gram index.
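As an illustration of the last option, here is a minimal sketch of an in-memory trigram (3-gram) index for substring search over names. The class `TrigramIndexSketch` and its methods are hypothetical; inputs shorter than three characters fall back to a full scan, and the `limit` parameter reflects the "only return the first ten results" idea from the question.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of an N-gram (N = 3) index for case-insensitive substring search.
class TrigramIndexSketch {
    private final List<String> names = new ArrayList<>();
    private final Map<String, Set<Integer>> postings = new HashMap<>(); // trigram -> name ids

    void add(String name) {
        int id = names.size();
        names.add(name);
        for (String gram : trigrams(name.toLowerCase())) {
            postings.computeIfAbsent(gram, k -> new HashSet<>()).add(id);
        }
    }

    List<String> search(String input, int limit) {
        String needle = input.toLowerCase();
        Set<Integer> candidates = null; // null means no trigram filter could be applied
        for (String gram : trigrams(needle)) {
            Set<Integer> ids = postings.getOrDefault(gram, Collections.emptySet());
            if (candidates == null) {
                candidates = new TreeSet<>(ids);
            } else {
                candidates.retainAll(ids); // a match must contain every trigram of the input
            }
        }
        if (candidates == null) { // input shorter than 3 characters: scan everything
            candidates = new TreeSet<>();
            for (int id = 0; id < names.size(); id++) {
                candidates.add(id);
            }
        }
        List<String> result = new ArrayList<>();
        for (int id : candidates) {
            if (result.size() >= limit) {
                break;
            }
            // Verify each candidate: sharing all trigrams does not guarantee a real substring match.
            if (names.get(id).toLowerCase().contains(needle)) {
                result.add(names.get(id));
            }
        }
        return result;
    }

    private static List<String> trigrams(String s) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            grams.add(s.substring(i, i + 3));
        }
        return grams;
    }
}
```

For 20,000 names a plain scan is probably fast enough already, so this is mainly useful to see how the index narrows the candidate set before the exact check.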
I am looking for a good design pattern for sharding a list in Google App Engine. I have read about and implemented sharded counters as described in the Google Docs here but I am now trying to apply the same principle to a list. Below is my problem and possible solution - please can I get your input?
Problem:
A user on my system could receive many messages, kind of like an online chat system. I'd like the server to record all incoming messages (they will contain several fields - from, to, etc). However, I know from the docs that updating the same entity group often can result in an exception caused by datastore contention. This could happen when one user receives many messages in a short time, causing his entity to be written to many times. So what about abstracting out the sharded counter example above:
Define, say, five entities/entity groups
For each message to be added, pick one entity at random and append the message to it, writing it back to the store
To get the list of messages, read all entities in and merge...
Ok some questions on the above:
Most importantly, is this the best way to go about things or is there a more elegant/more efficient design pattern?
What would be an efficient way to filter the list of messages by one of the fields, say everything after a certain date?
What if I require a sharded set instead? Should I read in all entities and check if the new item already exists on every write? Or just add it as above and then remove duplicates whenever the next request comes in to read?
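The sharding steps proposed above can be modelled in memory as a rough sketch. It does not use the App Engine datastore API at all: each inner list merely stands in for one shard entity, and the names `ShardedMessageListSketch`, `Message`, `append` and `readSince` are hypothetical. The date filter is applied while merging, which is one possible answer to the second question.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

// In-memory model only: each inner list stands in for one datastore shard entity.
class ShardedMessageListSketch {
    static final class Message {
        final String from;
        final String to;
        final long sentAtMillis;

        Message(String from, String to, long sentAtMillis) {
            this.from = from;
            this.to = to;
            this.sentAtMillis = sentAtMillis;
        }
    }

    private final List<List<Message>> shards = new ArrayList<>();
    private final Random random;

    ShardedMessageListSketch(int shardCount, Random random) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        for (int i = 0; i < shardCount; i++) {
            shards.add(new ArrayList<>());
        }
        this.random = random;
    }

    // Write path: pick one shard at random so that writes spread over several entities.
    void append(Message message) {
        shards.get(random.nextInt(shards.size())).add(message);
    }

    // Read path: read every shard, keep messages at or after the cutoff, merge by time.
    List<Message> readSince(long cutoffMillis) {
        List<Message> merged = new ArrayList<>();
        for (List<Message> shard : shards) {
            for (Message message : shard) {
                if (message.sentAtMillis >= cutoffMillis) {
                    merged.add(message);
                }
            }
        }
        merged.sort(Comparator.comparingLong(m -> m.sentAtMillis));
        return merged;
    }

    List<Message> readAll() {
        return readSince(Long.MIN_VALUE);
    }
}
```

In a real datastore the read would cost one fetch per shard, so the number of shards trades write contention against read cost.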
Why would you want to put all messages in one entity group?
If you don't specify an ancestor, you won't need sharding, but the end user might see some lag when querying the messages, due to eventual consistency.
It depends on whether that is an acceptable tradeoff.
I have found the jQuery DataTables plug-in extremely useful for simple, read-only applications where I'd like to give the user pagination, sorting and searching of very large sets of data (millions of rows using server-side processing).
I have a system for reusing this code, but I end up doing the same thing over and over a lot. I'd like to write a very generalized API for which I essentially just need to configure the SQL needed to retrieve the data used in the table. I am looking for a good design pattern/approach to do this. I've seen articles like this http://www.codeproject.com/Articles/359750/jQuery-DataTables-in-Java-Web-Applications and have a complete understanding of how server-side processing works (I have done it in Java and ASP.NET many times). To answer, you will probably need a deep understanding of how server-side processing works in Java, but here are some issues that come up when attempting to do this:
I generally run three separate queries: a count without the search clause, a count with the clause included, and the query for the actual data. I haven't found an efficient way to do all three at once, and doing so requires a lot of extra data to come back from the DB (i.e. the counts over and over). The API needs to support behavior based on these three different queries, and complex queries at that. I generally use ROW_NUMBER() OVER an indexed column so that pagination stays relatively speedy with large data.
*where clause changes dynamically (user can search over a variable number of rows).
*order by clause changes for the same reason.
Overall, each case is often pretty specific to the data we need. Is there a good way to abstract this so that I can do minimal work when I want to use the plug-in server side?
So, the steps are as follows in most projects:
*extract the params the plug-in sends to the server (a lot of the time my own are added, mostly date ranges)
*build the unfiltered count query (this is rarely dynamic).
*build the filtered count query (is dynamic)
*build the data query
*construct a model object of the table and return it as json.
A lot of the issues occur setting the prepared statements with a variable number of parameters. Dynamically generating the sql in a general way (say based on just column names) seems unlikely. I am wondering if someone else has created something they are using for this or if it sounds like a specific pattern is applicable. It has just occurred to me that creating a reusable filter may be helpful in java. Any advice would be greatly appreciated. Feel free to be language agnostic as the architecture is what I'm trying to figure out.
We have a base search criteria class where all request parameters relevant to DataTables are mapped onto class properties (fields), and a custom search criteria class that extends the base and contains the business-logic-specific fields for custom search. On the server side we also have a repository class that takes the custom search criteria as an argument and makes the queries to the database.
If you are familiar with C#, you could check out custom binding code and example of usage.
You could do such custom binding in your Java code as well.
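One way to picture such a reusable criteria object in Java is a small builder that collects the dynamic WHERE conditions together with their parameters, and derives the three statements (unfiltered count, filtered count, data query) from the same state. This is a sketch under assumptions, not code from either post: `TableQuerySketch` and its methods are hypothetical, column names are assumed to come from a server-side whitelist, and pagination is left out because its syntax differs between databases.

```java
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical criteria/builder: one object produces all three statements.
class TableQuerySketch {
    private final String fromClause; // e.g. "FROM person"
    private final List<String> conditions = new ArrayList<>();
    private final List<Object> params = new ArrayList<>();
    private String orderBy = "";

    TableQuerySketch(String fromClause) {
        this.fromClause = fromClause;
    }

    // Adds one "column LIKE ?" condition. The column must be whitelisted by the caller;
    // only the value travels as a bound parameter.
    TableQuerySketch like(String column, String value) {
        conditions.add(column + " LIKE ?");
        params.add("%" + value + "%");
        return this;
    }

    TableQuerySketch orderBy(String column, boolean ascending) {
        orderBy = " ORDER BY " + column + (ascending ? " ASC" : " DESC");
        return this;
    }

    String totalCountSql() {
        return "SELECT COUNT(*) " + fromClause;
    }

    String filteredCountSql() {
        return "SELECT COUNT(*) " + fromClause + where();
    }

    String dataSql(String columns) {
        return "SELECT " + columns + " " + fromClause + where() + orderBy;
    }

    List<Object> params() {
        return params;
    }

    // Binds the collected parameters in order; meant for the filtered count and data statements.
    void bind(PreparedStatement statement) throws SQLException {
        for (int i = 0; i < params.size(); i++) {
            statement.setObject(i + 1, params.get(i));
        }
    }

    private String where() {
        return conditions.isEmpty() ? "" : " WHERE " + String.join(" AND ", conditions);
    }
}
```

Because conditions and parameters are appended together, the number of `?` placeholders and the number of bound values cannot drift apart, which addresses the variable-parameter-count problem described in the question.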
I am developing a search component of a web application using Lucene. I would like to save the user queries to an index and use them to suggest alternate queries to users, and to keep query statistics (most often used queries, top scoring queries, ...).
To use this data for alternate query suggestions, I would analyze the queries to see which terms are most often used with one another and use that to create a suggestion to the user.
But I can't figure out in which form to index the data. I was thinking of simply adding the queries into the index, but in that way there could be a lot of redundant data since many documents in the index would have the same content. Does anyone have any ideas about the way this can be accomplished?
Thanks for the help.
"I was thinking of simply adding the queries into the index, but in that way there could be a lot of redundant data since many documents in the index would have the same content"
You can tell Lucene not to store document content, which means that the principal overhead will be the unique Terms, and the index itself. So, it might not be a large overhead to store each query as a unique Document...this way you will not be throwing away any information.
First, I believe that you should store the queries separately from the existing index. The problem is not redundant data but rather "watering down" your index - storing the queries in the same index may harm the relevance of your searches. Some options for this are:
Use a separate Lucene index.
Use Solr, with two separate cores, one for the documents and the other for the queries.
Use a query log. Store scores with the queries. Build query statistics using post-processing. As this is a web application, you can probably use the logs of a servlet container, such as Tomcat, for this.
Second, Auto-Suggest From Popular Queries Using EdgeNGrams suggests an alternative implementation of query suggestion using Solr.
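The post-processing of logged queries could, for example, count which terms occur together, which is the analysis the question describes. The following is a plain-Java sketch that uses no Lucene or Solr API; `QueryCooccurrenceSketch`, `record` and `suggest` are hypothetical names, and terms are split on whitespace only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: term co-occurrence counts built from a query log.
class QueryCooccurrenceSketch {
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    // Record one logged query: every pair of distinct terms in it co-occurs once.
    void record(String query) {
        Set<String> terms = new TreeSet<>(Arrays.asList(query.toLowerCase().trim().split("\\s+")));
        terms.remove("");
        for (String a : terms) {
            for (String b : terms) {
                if (!a.equals(b)) {
                    counts.computeIfAbsent(a, k -> new HashMap<>()).merge(b, 1, Integer::sum);
                }
            }
        }
    }

    // Suggest the terms most often used together with the given term.
    List<String> suggest(String term, int limit) {
        Map<String, Integer> related = counts.getOrDefault(term.toLowerCase(), Collections.emptyMap());
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(related.entrySet());
        // Highest count first; ties broken alphabetically so the output is stable.
        entries.sort((x, y) -> {
            int byCount = Integer.compare(y.getValue(), x.getValue());
            return byCount != 0 ? byCount : x.getKey().compareTo(y.getKey());
        });
        List<String> suggestions = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : entries) {
            if (suggestions.size() >= limit) {
                break;
            }
            suggestions.add(entry.getKey());
        }
        return suggestions;
    }
}
```

Only the counts are kept, so repeated identical queries add no redundant documents; they just increase the counters.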
I have a simple data model that includes
USERS: store basic information (key, name, phone # etc)
RELATIONS: describe, e.g. a friendship between two users (supplying a relationship_type + two user keys)
COMMENTS: posted by users (key, comment text, user_id)
I'm getting very poor performance, for instance, if I try to print the first names of all of a user's friends. Say the user has 500 friends: I can fetch the list of friend user_ids very easily in a single query. But then, to pull out first names, I have to do 500 back-and-forth trips to the Datastore, each of which seems to take on the order of 30 ms. If this were SQL, I'd just do a JOIN and get the answer out fast.
I understand there are rudimentary facilities for performing two-way joins across un-owned relations in a relaxed implementation of JDO (as described at http://gae-java-persistence.blogspot.com) but they sound experimental and non-standard (e.g. my code won't work in any other JDO implementation).
Worse yet, what if I want to pull out all the comments posted by a user's friends. Then I need to get from User --> Relation --> Comments, i.e. a three-way join, which isn't even supported experimentally. The overhead of 500 back-and-forths to get a friend list + another 500 trips to see if there are any comments from a user's friends is already enough to push runtime >30 seconds.
How do people deal with these problems in real-world datastore-backed JDO applications? (Or do they?)
Has anyone managed to extract satisfactory performance from JDO/Datastore in this kind of (very common) situation?
-Bosh
First of all, for objects that are frequently accessed (like users), I rely on memcache. This should speed up your application quite a bit.
If you have to go to the datastore, the right way to do this should be through getObjectsById(). Unfortunately, it looks like GAE doesn't optimize this call. However, a contains() query on keys is optimized to fetch all the objects in one trip to the datastore, so that's what you should use:
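// fetchFriendKeys() stands for whatever query returns the friends' keys
List myFriendKeys = fetchFriendKeys();
// ":p" is an implicit parameter; contains(key) matches users whose key is in that list
Query query = pm.newQuery(User.class, ":p.contains(key)");
// the whole key list is passed as the single parameter
List<User> friends = (List<User>) query.execute(myFriendKeys);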
You could also rely on the low-level API get() that accepts multiple keys, or do like me and use Objectify.
A totally different approach would be to use an equality filter on a list property. This will match if any item in the list matches. So if you have a friendOf list property in your user entity, you can issue a single Query friendOf == theUser. You might want to check this: http://www.scribd.com/doc/16952419/Building-scalable-complex-apps-on-App-Engine
You have to minimize DB reads. That must be a huge focus for any GAE project - anything else will cost you. To do that, pre-calculate as much as you can, especially oft-read information. To solve the issue of reading 500 friends' names, consider that you'll likely be changing the friend list far less than reading it, so on each change, store all names in a structure you can read with one get.
If you absolutely cannot then you have to tweak each case by hand, e.g. use the low-level API to do a batch get.
Also, rather optimize for speed and not data size. Use extra structures as indexes, save objects in multiple ways so you can read it as quickly as possible. Data is cheap, CPU time is not.
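The "store all names in a structure you can read with one get" idea can be modelled in memory as follows. This is only a sketch: the two maps stand in for datastore kinds, no App Engine API is used, and the names `FriendNamesSketch`, `putUser`, `addFriend` and `friendFirstNames` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory model only: the maps stand in for datastore entities.
class FriendNamesSketch {
    private final Map<String, String> firstNames = new HashMap<>(); // user id -> first name
    // Denormalized structure: user id -> (friend id -> friend's first name).
    private final Map<String, Map<String, String>> friendNamesByUser = new HashMap<>();

    void putUser(String userId, String firstName) {
        firstNames.put(userId, firstName);
    }

    // Write path (rare): copy the friend's name into the user's precomputed structure.
    void addFriend(String userId, String friendId) {
        friendNamesByUser.computeIfAbsent(userId, k -> new TreeMap<>())
                .put(friendId, firstNames.get(friendId));
    }

    void removeFriend(String userId, String friendId) {
        Map<String, String> names = friendNamesByUser.get(userId);
        if (names != null) {
            names.remove(friendId);
        }
    }

    // Read path (frequent): a single lookup instead of one fetch per friend.
    List<String> friendFirstNames(String userId) {
        return new ArrayList<>(friendNamesByUser.getOrDefault(userId, Collections.emptyMap()).values());
    }
}
```

The cost of this design is that a change to a user's first name would also have to be propagated to every structure that copied it, which the sketch does not do.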
Unfortunately Phillipe's suggestion
Query query = pm.newQuery(User.class, ":p.contains(key)");
is only optimized to make a single query when searching by primary key. Passing in a list of ten non-primary-key values, for instance, gives the following trace
(Image of the trace: http://img293.imageshack.us/img293/7227/slowquery.png)
I'd like to be able to bulk-fetch comments, for example, from all of a user's friends. If I do store a List on each user, this list can't be longer than 1000 elements (if it's an indexed property of the user), as described at: http://code.google.com/appengine/docs/java/datastore/overview.html .
Seems increasingly like I'm using the wrong toolset here.
-B
Facebook has 28 terabytes of memory cache... However, making 500 trips to memcached isn't very cheap either. It can't be used to store a gazillion small items. "Denormalization" is the key. Such applications do not need to support ad-hoc queries. Compute and store the results directly for the few supported queries.
In your case, you probably have just one type of query: return the data of this, that and the others that should be displayed on a user page. You can precompute this big ball of mess, so that later one query based on userId can fetch it all.
When userA makes a comment to userB, you retrieve userB's big ball of mess, insert userA's comment in it, and save it.
Of course, there are a lot of problems with this approach. For giant internet companies, they probably don't have a choice, generic query engines just don't cut it. But for others? Wouldn't you be happier if you can just use the good old RDBMS?
If it is a frequently used query, you can consider preparing indexes for the same.
http://code.google.com/appengine/articles/index_building.html
The indexed property limit is now raised to 5000.
However you can go even higher than that by using the method described in http://www.scribd.com/doc/16952419/Building-scalable-complex-apps-on-App-Engine
Basically just have a bunch of child entities for the User called UserFriends, thus splitting the big list and raising the limit to n*5000, where n is the number of UserFriends entities.
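The splitting step itself is simple to sketch. The helper below cuts one long friend list into chunks of at most a given size, where each chunk would become the list property of one UserFriends child entity. The class and method names are hypothetical and no datastore API is involved; 5000 is the limit quoted above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: each returned chunk would be stored in one UserFriends child entity.
class FriendListChunkerSketch {
    static <T> List<List<T>> chunk(List<T> items, int maxPerEntity) {
        if (maxPerEntity <= 0) {
            throw new IllegalArgumentException("maxPerEntity must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int start = 0; start < items.size(); start += maxPerEntity) {
            int end = Math.min(start + maxPerEntity, items.size());
            // Copy the slice so that each chunk is independent of the source list.
            chunks.add(new ArrayList<>(items.subList(start, end)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<Integer> friendIds = new ArrayList<>();
        for (int i = 0; i < 12001; i++) {
            friendIds.add(i);
        }
        // Prints how many UserFriends entities would be needed for this list.
        System.out.println(chunk(friendIds, 5000).size());
    }
}
```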