Avoiding exploding indices and entity-group write-rate limits with appengine - java

I have an application in which there are Courses, Topics, and Tags. Each Topic can be in many Courses and have many Tags. I want to look up every Topic that has a specific Tag x and is in specific Course y.
Naively, I give each Topic a list of Course ids and Tag ids, which leaves me with these options:
1. select * from Topic where tagIds = x && courseIds = y. I think this query would require an exploding index: with 30 courses and 30 tags we're looking at ~900 index entries, right? At 50 x 20 I'm well over the 5000-entry limit.
2. I could just select * from Topic where tagIds = x, and then use a for loop to go through the result, choosing only Topics whose courseIds.contain(y). This returns way more results than I'm interested in and spends a lot of time deserializing those results, but the index stays small.
3. I could select __KEY__ from Topic where tagIds = x AND select __KEY__ from Topic where courseIds = y and find the intersection in my application code. If the sets are small this might not be unreasonable.
4. I could make a sort of join table, TopicTagLookup, with a tagId and a courseId field. The parent key of these entities would point to the relevant Topic. Then I would need to make one of these TopicTagLookup entities for every combination of courseId x tagId x relevant topic id. This is effectively like creating my own index. It would still explode, but there would be no 5000-entry limit. Now, however, I need to write 5000 entities to the same entity group, which would run up against the entity-group write-rate limit!
5. I could precalculate each query. A TopicTagQueryCache entity would hold a tagId, courseId, and a List<TopicId>. Then the query looks like select * from TopicTagQueryCache where tagId = x && courseId = y, fetching the list of topic ids, and then using a getAllById call on the list. Similar to #4, but I only have one entity per courseId x tagId. There's no need for entity groups, but now I have this potentially huge list to maintain transactionally.
App Engine seems great for queries you can precalculate. I just don't quite see a way to precalculate this query efficiently. The question basically boils down to:
What's the best way to organize data so that we can do set operations like finding the Topics in the intersection of a Course and a Tag?

Your assessment of your options is correct. If you don't need any sort criteria, though, option 3 is more or less already done for you by the App Engine datastore, with the merge join strategy. Simply do a query as you detail in option 1, without any sorts or inequality filters, and App Engine will do a merge join internally in the datastore, and return only the relevant results.
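For illustration, here's a minimal sketch of that merge-join query with the low-level datastore API (the variable names are mine; the kind and multi-valued properties are from the question):
DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
// Equality filters only - no sort, no inequality - so the datastore can
// merge-join the tagIds and courseIds indexes without an exploding index.
Query q = new Query("Topic");
q.addFilter("tagIds", Query.FilterOperator.EQUAL, tagId);
q.addFilter("courseIds", Query.FilterOperator.EQUAL, courseId);
List<Entity> topics = ds.prepare(q).asList(FetchOptions.Builder.withDefaults());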
Options 4 and 5 are similar to the relation index pattern documented in this talk.

I like #5 - you are essentially creating your own (exploding) index. It will be fast to query.
The only downsides are that you have to manually maintain it (next paragraph), and retrieving the Topic entity will require an extra query (first you query TopicTagQueryCache to get the topic ID and then you need to actually retrieve the topic).
Updating the TopicTagQueryCache you suggested shouldn't be a problem either. I wouldn't worry about doing it transactionally - this "index" will just be stale for a short period of time when you update a Topic (at worst, your Topic will temporarily show up in results it should no longer appear in, and may take a moment to show up in new results it should appear in - that doesn't seem so bad). You can even do this update on the task queue (to make sure this potentially large number of datastore writes all succeeds, and so that you can finish the request quickly and your user isn't left waiting).
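As a sketch, querying the cache and then batch-fetching the topics might look like this with the low-level API (kind and property names are the ones proposed in the question; I'm assuming one cache entity per tagId/courseId pair, with the topic ids stored as keys):
Query q = new Query("TopicTagQueryCache");
q.addFilter("tagId", Query.FilterOperator.EQUAL, tagId);
q.addFilter("courseId", Query.FilterOperator.EQUAL, courseId);
Entity cache = ds.prepare(q).asSingleEntity(); // null if no cache entry yet
List<Key> topicKeys = (List<Key>) cache.getProperty("topicIds");
Map<Key, Entity> topics = ds.get(topicKeys); // the getAllById step: one batch round trip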

As you said yourself, you should arrange your data to facilitate the scaling of your app. So, to the question of What's the best way to organize data so that we can do set operations like finding the Topics in the intersection of a Course and a Tag?
You can hold your own indexes of these sets by creating CourseRef and TopicRef entities that consist of a Key only, with the ID portion being the actual Key of the corresponding entity. These "Ref" entities live under a specific Tag, so there are no actual Key duplicates. The structure for a given Tag is: Tag\CourseRef...\TopicRef...
This way, given a Tag and a Course, you construct the Key Tag\CourseRef and do an ancestor query, which gets you a set of keys you can fetch. This is extremely fast, as it is effectively direct access, and it handles large lists of courses or topics without the issues of list properties.
This method will require you to use the DataStore API to some extent.
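A rough sketch of what that looks like with the low-level API (the kind names are the ones proposed above; encoding the referenced entity's Key as the Ref's key name is one way to carry the ID portion):
Key tagKey = KeyFactory.createKey("Tag", tagName);
Key courseRefKey = KeyFactory.createKey(tagKey, "CourseRef", KeyFactory.keyToString(courseKey));
// Keys-only ancestor query: returns just the TopicRef keys under this Tag\CourseRef
Query q = new Query("TopicRef").setAncestor(courseRefKey).setKeysOnly();
for (Entity ref : ds.prepare(q).asIterable()) {
    Key topicKey = KeyFactory.stringToKey(ref.getKey().getName()); // the actual Topic's Key
    // collect topicKey, then batch-fetch the Topics with ds.get(...)
}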
As you can see, this answers one specific question; the model will do no good for other types of set operations.

Related

Huge Leaderboard ranking with filtering

We are building a massive multiplayer educational game with some millions of entries in the leaderboard (ranked by aggregated XP gained). After a game finishes, we need to show the leaderboard and where this player/student ranks.
But there are a couple of filters for this leaderboard (global/by country, by month/year/today, by age etc) that can be mixed together e.g. 'Get me the leaderboard for my Country for the last month'. Number of combinations is ~20.
My problem is how to store such a structure when it is updated regularly; rankings must be recalculated after each game. A typical full leaderboard at the moment has ~5 million entries for players coming from >150 countries.
I used to have a MySQL Cluster table (userid, xps, countryid) with 3 nodes, but ordering by XP (either in the DBMS or in the application, which required pulling all the data from the DB) proved too slow as the numbers grew (>20K users). This is an interesting post, but again, half a second per query is too much.
Then we used REDIS (see this post), but filtering is the problem here. We used separate lists for TOP 5 and the rest. TOP 5 was updated instantly, for the rest there was some delay of 20-30 minutes. We in fact ranked this user based on a cached instance of the Leaderboard (using the real XPs though, not the cached), so this was acceptable. Real-time on non-Top5 is not a prerequisite.
This is fine for one global ranking, but how to filter the results based on month and/or country and/or age. Do we need to keep a list for every filtering combination?
We also tested custom structures in Java (using it as a Java caching server, similar in functionality to REDIS), and we are still experimenting with it. Which is the best combination of structures to achieve our goal? We ended up using one list per filtering combination, e.g. Map<FilteringCombination, SortedList<User>>, and then doing binary search on the list for a specific key. This way, a finished game requires a couple of insertions, say X, but it requires X*NumOfPlayers space, which is X times more than keeping a single list (not sure if this can fit in memory, but we can always create a cluster here by splitting combinations across different servers). There is an issue with how to rebuild the cache in case of failure, but that is another problem we can deal with.
Extending the above method, we might slightly improve performance by defining scoring buckets inside each list (e.g. a bucket for 0-100xp, another for 101-1000xp, another for 1001-10000xp, etc). The bucket-splitting policy would be based on the players' XP distribution in our game. It's true that this distribution is dynamic in the real world, but we have seen that after a few months the changes are minor, bearing in mind that XP always increases but new users keep arriving as well.
We are also testing Cassandra's natural ordering, using clustering keys and its wide-rows feature, although we know that having some millions of rows may not be easy to handle.
All in all, this is what we need to achieve. If a user (let's name her UserX) is not in the Top5 list, we need to show this user's ranking together with some surrounding players (e.g. 2 above and 2 below), as in the example below:
Global TOP 5           My Global Ranking (425)   My Country Ranking   Other Rankings
1. karen (12000xp)     423. george               1. david
2. greg (11280xp)      424. nancy                2. donald
3. philips (10293xp)   **425. UserX**            3. susan
4. jason (9800xp)      426. rebecca              **4. UserX**
5. barbara (8000xp)    427. james                5. teresa
I've studied many SO and other posts, but still cannot find a solution for efficiently updating and filtering large leaderboard tables. Which candidate solution would you choose, and what performance improvements are possible (space, memory, and insertion/search CPU cost)?
That's a very interesting problem - thanks for posting. In general, databases excel at this type of problem, in which large amounts of data need to be filtered and searched. My first guess is that you are not using your MySQL indexes correctly. Having said that, you clearly need to regularly find the nth row in an ordered list, which is something SQL is not at all good at.
If you are looking at some form of in-memory database, then you'll need something more sophisticated than REDIS. I would suggest you look at VoltDB, which is very fast but not cheap.
If you would like to build your own in-memory store, then you'll need to calculate memory use to see if it's feasible. You will need an index (discussed later in this answer) for each field you want to search or filter on, along with the record for each user. However, even for 10 million rows and 20 fields, it's still going to be less than 1 GB of RAM, which should be fine on modern machines.
Now for the data structures. I believe you are on the right track using maps from field values to lists. I don't think the lists need to be sorted - you just need to be able to get the set of users for a particular value. In fact, sets may be more appropriate (again, worth testing performance). Here is my suggestion to try (I've just added country and age fields - I assume you'll need others, but it's a reasonable example to start with):
enum Country {
    ...
}

class User {
    String givenName;
    String familyName;
    int xp;
    Country country;
    int age;
}

class LeaderBoard {
    Set<User> users;
    Map<Integer, Set<User>> xpIndex;
    Map<Country, Set<User>> countryIndex;
    Map<Integer, Set<User>> ageIndex;
}
Each of the indices will need to be updated when a field changes. For example:
private void setUserAge(User user, int age) {
    assert users.contains(user);
    assert ageIndex.get(user.getAge()).contains(user);
    ageIndex.get(user.getAge()).remove(user);
    if (!ageIndex.containsKey(age)) {
        // HashSet rather than a bare TreeSet: User has no natural ordering,
        // so a TreeSet without a comparator would throw ClassCastException on add
        ageIndex.put(age, new HashSet<>());
    }
    ageIndex.get(age).add(user);
    user.setAge(age);
}
Getting all users, by rank, that satisfy a given combination can be done in a number of ways:
countryIndex.get(Country.Germany).stream()
    .filter(ageIndex.get(20)::contains)
    .sorted(User::compareRank)
    ...
or
SortedSet<User> germanUsers = new TreeSet<>(User::compareRank);
germanUsers.addAll(countryIndex.get(Country.Germany));
germanUsers.retainAll(ageIndex.get(20));
You'll need to check which of these is more efficient - I would guess the stream implementation will be. It can also easily be converted to a parallelStream().
You mention a concern about update efficiency. I would be very surprised if this were an issue unless there were many updates per second. In general, with these types of applications you will get many more reads than writes.
I see no reason to manually partition the indexes as you are suggesting unless you are going to have hundreds of millions of entries. Better would be to experiment with HashMap vs TreeMap for the concrete instantiation of the indices.
The next obvious enhancement if you need better performance is to multithread the application. That should not be too complex as you have relatively simple data structures to synchronize. Use of parallel streams in the searches helps of course (and you get them for free in Java 8).
So my recommendation is to go with these simple data structures and eke out performance using multithreading and adjusted concrete implementations (e.g. hash functions) before trying anything more sophisticated.
Although I am still in the middle of benchmarking, here is an update on the current state of development.
The best performance so far comes from using:
Map<Country, Map<Age, Map<TimeIdentifier, List<User>>>>
(each List is kept sorted)
Some notes on the keys: I added a Country called World in order to have an instance of the full leaderboard that is country-independent (as if the Country filter were not selected). I did the same for Age (All-Ages) and TimeIdentifier (All-Time). The TimeIdentifier key values are [All-Time, Month, Week, Day].
The above can be extended for other filters, so it can be applied for other scenarios as well.
Map<Filter1, Map<Filter2, Map<Filter3, Map<Filter4, ...other Map keys here..., List<User>>>>>
Update: instead of using multiple Map wrappers, a class used as the key of a single Map (with the above fields) is slightly faster. Of course, we need a multiton-like pattern to create all the available FilterCombination objects:
class FilterCombination {
    private int countryId;
    private int ageId;
    private int timeId;
    // equals() and hashCode() over all three fields - required for use as a Map key
    ...
}
then we define the Map<FilterCombination, List<User>> (with each List kept sorted)
I could have used a TreeSet, but I didn't. Why? Basically, I was looking for an order statistic tree (see here), but there seems to be no official Java implementation (see here). That would probably be the way to go versus a sorted List, given the inefficiency of List.add(index, Object), which is O(n). A LinkedList would be better for add(index, Object), but unfortunately it is slow at getting the k-th element (so ranking is O(n)). Every structure has its pros and cons for such a task.
At the moment, I have ended up using a sorted List. The reason is that when adding an element to the sorted list, I use a slightly modified binary search algorithm (see here). This gives me the current user's rank at insertion time (so no additional search query is required), at a cost of O(log n + n) (binary search for the index + List.add(index, Object)).
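A minimal sketch of that insertion (RANK_ORDER standing in for the XP-then-timestamp comparator described below):
int i = Collections.binarySearch(list, user, RANK_ORDER); // O(log n)
int insertAt = (i >= 0) ? i : -(i + 1); // binarySearch encodes the insertion point
list.add(insertAt, user); // O(n) element shift in an ArrayList
int rank = insertAt + 1;  // 1-based rank, no extra search needed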
Is there any other structure that performs better than O(log n + n) for insert + get-rank together?
*Of course, if I need to ask for a user's ranking at a later time, I will again do a binary search, based on the user's XP (+ timestamp, see below) and not their id, because I cannot search a List by user id.
**As a comparator I use the following criteria:
1st: XP points
2nd (in case of a draw): timestamp of the last XP update
So it is highly likely that ties in the sorted list will be very few. Moreover, I wouldn't mind if two users with the same XP were ranked in reverse order (even with our sample data of some millions of games, I found very few ties, not counting zero XP, which I don't care about at all).
An XP update requires some work and resources. Fortunately, the second comparison criterion significantly improved user search within the List (binary search again): before updating a user's XP, I have to remove that user's previous entries from the lists, but since I can look them up by the previous XP and timestamp, that is O(log n).
The easiest option is to use Redis' sorted sets, with master-slave replication. Turn on RDB on each slave and back the RDB files up to S3. Use Kafka to persist all writes before they go to Redis, so missing transactions can be replayed later.
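As an illustration, a sketch with the Jedis client (my choice of client and key scheme; any Redis client exposing ZADD/ZREVRANK works), with one sorted set per filter combination:
Jedis jedis = new Jedis("localhost");
// score = XP, member = user id; one sorted-set key per filter combination
jedis.zadd("lb:GR:month", xp, userId);
Long rank = jedis.zrevrank("lb:GR:month", userId); // 0-based rank, highest XP first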

How to get the number of results in an App Engine query before actually iterating through them all

In my Google App Engine app I need to fetch and return a potentially large number of entities from a datastore query in response to a service call GET request. This call may return thousands of entities and MBs of serialized data.
The first portion of the response packet communicates how many entities are in the serialized results, followed by all of the serialized entities. Currently I am iterating through all the entities in the query with a QueryResultIterator up to a maximum page size limit, after which I return a cursor that can be used to continue fetching where the previous call left off (if the maximum was reached and there are still results in the query). As I iterate through the results, I save them in a list. Once I've either exhausted the query results or reached the maximum page size, I can then get the number of entities from the size of this list. But then I have to iterate through this list again to serialize each of the entities and write the results to the response output stream.
I don't know that this is the most efficient way to perform this operation. Is there a way I can get the number of entities in a query's results before actually iterating through them all or fetching them directly into a list? (The list method doesn't work anyway because I'm using cursors, which require the use of QueryResultIterator.)
QueryResultIterator has a method getIndexList(). Would this be a less costly way to get the number of entities in the query's results? I'm assuming this list would contain exactly one index object for each entity in the query's results. Also, I'd need this list to contain only the indexes for the entities after the current cursor position for the iterator. Is my understanding correct, or would this method not do what I think it would?
A list of just indexes would require much less memory than loading a list of whole entities. However, I don't know if this list would be limited at all by the query's prefetch or chunk sizes, or whether I'd want to use the query's limit parameter at all, because I would only be interested in knowing how many entities were in the results up to the maximum page size plus one (to know there are still more results and provide a cursor to continue).
Currently I'm setting the prefetch and chunk size (to the size of my page limit), but I'm not using the limit or offset parameters since I'm using cursors instead. From what I understand cursors are preferable to offset/limit. Would setting the limit parameter affect continuing a query with a cursor?
Clearly I have quite a few questions about how GAE datastore queries work and how they're affected by changing parameters, so any insights are appreciated. The documentation for the App Engine APIs is often sparse, as in one-sentence descriptions of methods stating pretty much what can be deduced from the method signature; they don't generally go into much detail otherwise. Maybe the way I'm doing it currently is fine after all. It works as is, but I'm trying to optimize the service call to get the best response time possible for my client application.
UPDATE: By the way, I am using Objectify v3 in my app to perform this query. There are several places where I am required to use the low-level datastore API, including geo-location queries (with geomodel) and projection queries (which aren't supported in Objectify v3). So if there is a good way to do this using Objectify, that would be ideal. Otherwise I can use the low-level API, but it's always messier this way.
Both the low-level api and Objectify have a count() method (look at the javadocs for details). However, counting can be a very expensive and lengthy operation - it costs 1 small op for every number returned. For example, count() returning 5000 costs 5000 small ops (plus 1 read for the query), and takes as long as it would take to do a keys-only scan of all 5000 (which is what GAE actually does).
If you absolutely must have an exact count, you probably need to aggregate this value yourself by incrementing/decrementing a (possibly sharded) counter. This gets very tricky when you are dealing with filtered queries.
There is no one right solution here. Google searches give you totals like "About 119,000,000 results" which are deliberately inexact and almost certainly precalculated. For smaller result sets, using count() can be acceptable - but you might want to apply a limit() so that you never break the bank. You can always say "More than 500 results..."
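For example, a capped count with the low-level API (a sketch; the kind name is a placeholder, and the cap is whatever fits your cost budget):
DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
Query q = new Query("EntityKind"); // plus whatever filters the request needs
// Never pay for more than 501 small ops, however large the result set is
int count = ds.prepare(q).countEntities(FetchOptions.Builder.withLimit(501));
String label = (count > 500) ? "More than 500 results..." : count + " results";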
If you want to fetch the number of records, you can use the following code:
com.google.appengine.api.datastore.Query qry = new com.google.appengine.api.datastore.Query("EntityName");
com.google.appengine.api.datastore.DatastoreService datastoreService = DatastoreServiceFactory.getDatastoreService();
int totalCount = datastoreService.prepare(qry).countEntities(FetchOptions.Builder.withDefaults());
And if you want a filter, you can use:
qry.addFilter("firstName", FilterOperator.EQUAL, firstName);
I hope this helps.

IllegalArgumentException: Splitting the provided query requires that too many subqueries are merged in memory

I look up a bunch of model ids:
List<Long> ids = lookupIds(searchCriteria);
And then I run a query to order them:
fooModelList = (List<FooModel>) query.execute(ids);
The log shows the GQL this is compiled to:
Compiling "SELECT FROM com.foo.FooModel WHERE
:p.contains(id) ORDER BY createdDateTime desc RANGE 0,10"
When the ids ArrayList is small this works fine.
But over a certain size (40 maybe?) I get this error:
IllegalArgumentException: Splitting the provided query requires
that too many subqueries are merged in memory.
Is there a way to work around this or is this a fixed limit in GAE?
This is a fixed limit. If you're looking up entities by ID, though, you shouldn't be doing queries in the first place - you should be doing fetches by key. If you're querying by a foreign key, you'll need to do separate queries yourself if you want to go over the limit of 40 - but you should probably reconsider your design, since this is extremely inefficient.
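For example, a sketch of a batch fetch by key with the low-level API (the kind name matches the model class above, which is how JDO names kinds by default):
DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
List<Key> keys = new ArrayList<Key>();
for (Long id : ids) {
    keys.add(KeyFactory.createKey("FooModel", id));
}
Map<Key, Entity> entities = ds.get(keys); // one batch round trip, no query, no subquery limit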
I could not verify this in the GAE documentation, so my answer might not be complete. But I found that the "ORDER BY createdDateTime desc" clause is what triggers this limit, which is 30, by the way. My hypothesis is that if GAE doesn't need to sort the results, it doesn't need to merge the subqueries in memory.
If you do need to 'sort it', do this (which is the way to go with time-based data in GAE anyway):
Add a field 'week' or 'month' or something to the entity, containing an integer that uniquely identifies the week/month (so you need something other than 0..52 or 0..11, as the values also need to be unique across years). Then make your query state that you are only interested in this week, and maybe also last week (or month). So if we are in week 4353, your query has something like ":week IN [4353, 4352]". That should give you a relatively small result set. Then filter out the entries that are too old and sort in memory.
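A rough JDO sketch of that idea (the 'week' field and the getCreatedDateTime getter are hypothetical):
// Query on the unique week numbers only - no ORDER BY, so no in-memory subquery merge
Query q = pm.newQuery(FooModel.class, ":weeks.contains(week)");
List<FooModel> candidates = new ArrayList<>((List<FooModel>) q.execute(Arrays.asList(4353, 4352)));
// Sort the small candidate set in memory instead
candidates.sort(Comparator.comparing(FooModel::getCreatedDateTime).reversed());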

Google app engine: Poor Performance with JDO + Datastore

I have a simple data model that includes
USERS: store basic information (key, name, phone # etc)
RELATIONS: describe, e.g. a friendship between two users (supplying a relationship_type + two user keys)
COMMENTS: posted by users (key, comment text, user_id)
I'm getting very poor performance, for instance, if I try to print the first names of all of a user's friends. Say the user has 500 friends: I can fetch the list of friend user_ids very easily in a single query. But then, to pull out first names, I have to do 500 back-and-forth trips to the Datastore, each of which seems to take on the order of 30 ms. If this were SQL, I'd just do a JOIN and get the answer out fast.
I understand there are rudimentary facilities for performing two-way joins across un-owned relations in a relaxed implementation of JDO (as described at http://gae-java-persistence.blogspot.com) but they sound experimental and non-standard (e.g. my code won't work in any other JDO implementation).
Worse yet, what if I want to pull out all the comments posted by a user's friends. Then I need to get from User --> Relation --> Comments, i.e. a three-way join, which isn't even supported experimentally. The overhead of 500 back-and-forths to get a friend list + another 500 trips to see if there are any comments from a user's friends is already enough to push runtime >30 seconds.
How do people deal with these problems in real-world datastore-backed JDO applications? (Or do they?)
Has anyone managed to extract satisfactory performance from JDO/Datastore in this kind of (very common) situation?
-Bosh
First of all, for objects that are frequently accessed (like users), I rely on memcache. This should speed up your application quite a bit.
If you have to go to the datastore, the right way to do this should be through getObjectsById(). Unfortunately, it looks like GAE doesn't optimize this call. However, a contains() query on keys is optimized to fetch all the objects in one trip to the datastore, so that's what you should use:
List<Key> myFriendKeys = fetchFriendKeys();
Query query = pm.newQuery(User.class, ":p.contains(key)");
List<User> friends = (List<User>) query.execute(myFriendKeys);
You could also rely on the low-level API get() that accepts multiple keys, or do as I do and use Objectify.
A totally different approach would be to use an equality filter on a list property. This will match if any item in the list matches. So if you have a friendOf list property in your user entity, you can issue a single Query friendOf == theUser. You might want to check this: http://www.scribd.com/doc/16952419/Building-scalable-complex-apps-on-App-Engine
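A sketch of that single-query approach in JDO (friendOf being the hypothetical multi-valued Key property on User):
// An equality filter on a list property matches if ANY element equals the value
Query q = pm.newQuery(User.class, "friendOf == :userKey");
List<User> friends = (List<User>) q.execute(theUserKey);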
You have to minimize datastore reads. That must be a huge focus for any GAE project - anything else will cost you. To do that, precalculate as much as you can, especially often-read information. To solve the issue of reading 500 friends' names, consider that you'll likely change the friend list far less often than you read it, so on each change, store all the names in a structure you can read with a single get.
If you absolutely cannot then you have to tweak each case by hand, e.g. use the low-level API to do a batch get.
Also, optimize for speed rather than data size. Use extra structures as indexes and save objects in multiple ways so you can read them as quickly as possible. Data is cheap; CPU time is not.
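For instance, a sketch of the "store all names on each change" idea from above (the kind and property names are mine):
// On every friend-list change, denormalize the names onto one cache entity...
Entity nameCache = new Entity("FriendNames", "cache", userKey); // fixed key name under the user
nameCache.setProperty("firstNames", friendFirstNames); // a List<String>
datastore.put(nameCache);
// ...so rendering the friends page later is one get instead of 500 fetches
// (datastore.get throws EntityNotFoundException if the cache is absent)
Entity cached = datastore.get(KeyFactory.createKey(userKey, "FriendNames", "cache"));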
Unfortunately Philippe's suggestion
Query query = pm.newQuery(User.class, ":p.contains(key)");
is only optimized to make a single query when searching by primary key. Passing in a list of ten non-primary-key values, for instance, gives the following trace
(screenshot of the query trace, originally hosted at http://img293.imageshack.us/img293/7227/slowquery.png)
I'd like to be able to bulk-fetch comments, for example, from all of a user's friends. If I do store a List on each user, this list can't be longer than 1000 elements (if it's an indexed property of the user), as described at: http://code.google.com/appengine/docs/java/datastore/overview.html .
Seems increasingly like I'm using the wrong toolset here.
-B
Facebook has 28 terabytes of memory cache... However, making 500 trips to memcache isn't very cheap either. It can't be used to store a gazillion small items. "Denormalization" is the key. Such applications do not need to support ad-hoc queries; compute and store the results directly for the few supported queries.
In your case, you probably have just one type of query - return the data of this, that, and the others that should be displayed on a user page. You can precompute this big ball of mess, so that later a single query based on userId can fetch it all.
When userA makes a comment to userB, you retrieve userB's big ball of mess, insert userA's comment into it, and save it.
Of course, there are a lot of problems with this approach. Giant internet companies probably don't have a choice; generic query engines just don't cut it. But for others? Wouldn't you be happier if you could just use a good old RDBMS?
If it is a frequently used query, you can consider preparing indexes for it in advance.
http://code.google.com/appengine/articles/index_building.html
The indexed property limit is now raised to 5000.
However you can go even higher than that by using the method described in http://www.scribd.com/doc/16952419/Building-scalable-complex-apps-on-App-Engine
Basically just have a bunch of child entities for the User called UserFriends, thus splitting the big list and raising the limit to n*5000, where n is the number of UserFriends entities.
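A rough sketch with the low-level API (UserFriends is the hypothetical child kind from above; friendKeys is my name for the list property):
// Each chunk is a child entity of the user holding up to 5000 friend keys
Entity chunk = new Entity("UserFriends", userKey);
chunk.setProperty("friendKeys", friendKeySublist); // a List<Key> of <= 5000 entries
datastore.put(chunk);
// Reading them back: an ancestor query touches only this user's chunks
Query q = new Query("UserFriends").setAncestor(userKey);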

How large can an appengine task payload be?

I'm using the new experimental taskqueue for java appengine and I'm trying to create tasks that aggregate statistics in my datastore. I'm trying to count the number of UNIQUE values within all the entitities (of a certain type) in my datastore. More concretely, say entity of type X has a field A. I want to count the NUMBER of unique values of A in my datastore.
My current approach is to create a task which queries for the first 10 entities of type X, creating a hashtable to store the unique values of A in, then passing this hashtable to the next task as the payload. This next task will count the next 10 entities and so on and so forth until I've gone through all the entities. During the execution of the last task, I'll count the number of keys in my hashtable (that's been passed from task to task all along) to find the total number of unique values of A.
This works for a small number of entities in my datastore. But I'm worried that this hashtable will get too big once there are a lot of unique values. What is the maximum allowable size for the payload of an App Engine task?
Can you suggest any alternative approaches?
Thanks.
According to the docs, the maximum task object size is 100K.
"Can you suggest any alternative approaches?".
Create an entity for each unique value, by constructing a key based on the value and using Model.get_or_insert. Then Query.count up the entities in batches of 1000 (or however many you can count before your request times out - more than 10), using the normal paging tricks.
Or use code similar to that given in the docs for get_or_insert to keep count as you go - App Engine transactions can be run more than once, so a memcached count incremented in the transaction would be unreliable. There may be some trick around that, though, or you could keep the count in the datastore provided that you aren't doing anything too unpleasant with entity parents.
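In the Java low-level API, the same idea looks roughly like this (a sketch; "UniqueA" is my hypothetical kind, and the get/put should be wrapped in a transaction to mirror get_or_insert):
// One entity per unique value of A, keyed by the value itself, so recording
// the same value twice collapses into a single row and the count stays exact
Key k = KeyFactory.createKey("UniqueA", value);
try {
    datastore.get(k); // already recorded, nothing to do
} catch (EntityNotFoundException e) {
    datastore.put(new Entity(k)); // first occurrence of this value
}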
This may be too late, but perhaps it can be of use. First, anytime you have even a remote chance of wanting to walk serially through a set of entities, use an indexed date_created or date_modified auto-update field. From there, you can create a model with a TextProperty to store your hash table using json.dumps(). All you need to do is pass the last date processed and the model id of the hash-table entity. Do a query for entities with date_created later than the last date, json.loads() the TextProperty, and accumulate the next 10 records. You could get a bit more sophisticated (e.g. handle date_created collisions by utilizing the parameters passed and a slightly different query approach). Add a 1-second countdown to the next task to avoid any issues with updating the hash-table entity too quickly. HTH, -stevep
