I'm developing a Google App Engine Java app where users can search for business objects in the database based on search criteria.
The search results (a list of records) should not include any records (a certain number, say 100) from the user's past searches. I'm storing the past results in the User Profile for this reason.
Any suggestions on efficiently implementing this logic (without multiple iterations over collections)? I'm using JDO, and there are restrictions on using a 'NOT IN' condition in queries.
Here's a solution, assuming your goal is to get 200 keys that are not in the history already.
I will attempt to estimate the number of operations used as a proxy for "efficiency", since this is how we will be charged in the new pricing model:
1. Fetch the User object and "history keys" (1 read operation).
2. Do a keys-only query and fetch 300 records (300 small operations).
3. In your code, subtract any of the history keys from the 300 records (0 operations).
4. If you end up with fewer than 200 records after step 3, fetch another 100, repeating if necessary (100 small operations).
5. Once you have 200 keys not seen before, you can fetch the full business object entities if you need them, or display the keys to the user (200 read operations if you fetch the entire objects).
If the datastore supported a native "NOT IN" operator, then we could shave off 100 small operations from step 2, and skip step 4. The largest cost here will be fetching the actual 200 entities, which would have to happen with or without the NOT IN operator. Ultimately, this method is not that inefficient compared to what a native NOT IN operator would do.
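For concreteness, here is a minimal sketch of steps 2-4 with the low-level datastore API. The kind name "BusinessObject" and the historyKeys set (loaded from the user profile in step 1) are assumptions for illustration, not names from your app.

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.FetchOptions;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.Query;

public class UnseenKeysFetcher {

    // Returns up to `wanted` keys whose entities the user has not seen before.
    static List<Key> fetchUnseenKeys(Set<Key> historyKeys, int wanted) {
        DatastoreService ds = DatastoreServiceFactory.getDatastoreService();

        // Step 2: keys-only query, over-fetching to allow for history collisions.
        Query q = new Query("BusinessObject").setKeysOnly();
        List<Entity> batch = ds.prepare(q)
                .asList(FetchOptions.Builder.withLimit(wanted + historyKeys.size()));

        // Step 3: subtract the history keys in memory.
        List<Key> fresh = new ArrayList<Key>();
        for (Entity e : batch) {
            if (!historyKeys.contains(e.getKey())) {
                fresh.add(e.getKey());
                if (fresh.size() == wanted) {
                    break;
                }
            }
        }
        // Step 4 (not shown): if fresh.size() < wanted, run the keys-only query
        // again from a cursor and filter once more.
        return fresh;
    }
}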
Further optimizations:
If you don't need to display 200 keys all at once, then you can use cursors to only get N results at a time.
I am simply guessing when I suggest that you get 300 keys at first. You may need to get more or less. You can also probably get less than 100 on the second attempt.
Our data set has a lot of duplicate partition keys. We are using the TOKEN method to paginate through the data. If the rows with the duplicate keys are split across a page we don't get the remainder of the duplicates on the next call.
For example, assume we have the following keys: 1 2 3 5 5 5 6 7 8 and a limit of 5 rows per query. The first query "select * from table where TOKEN(id) > TOKEN('') limit 5;" returns 1 2 3 5 5 as expected. The second query "select * from table where TOKEN(id) > TOKEN('5') limit 5;" returns 6 7 8. This is not the desired behavior; we want the second query to return 5 6 7 8. Thinking about it, it is obvious why this happens: TOKEN(id) > TOKEN('5') fails if id == 5.
Are we doing something wrong, or is this just the way it works? We are using the latest Java driver, but I don't think this is a driver problem, since the Golang driver also exhibits this behavior.
We've (mostly) worked around the problem by either dropping any duplicated records at the end of the row set (the 5 5 in the example) or dropping the last record (to cover the case where the last record is duplicated in the second record set). This fails if the record set is all duplicates. Obviously larger limits reduce this edge case, but it doesn't seem safe to use in a production environment.
* EDITED *
The TOKEN method is recommended in a lot of pages both here on Stackoverflow and elsewhere on the web. But obviously it doesn't work :-(
#alex:
Thanks for your reply. The example was just that, a simplified example of the issue. In reality we have 30 million rows and are using a limit of 1000. When the table was first designed years ago, the designer didn't understand how the partition key works, so they used the user ID as the partition key, giving us 30 million partitions. We believe that this is at least contributing to our excessive repair times (currently 12 hours for the cluster). We need to copy the entire table into a new one with a different partition key (in a live production environment) to resolve the partition key issue. This page https://docs.datastax.com/en/developer/java-driver/2.1/manual/paging/ seems to be a better solution.
#Nadav:
Thanks for your reply. Simply removing the limit will cause the request to time out in multiple layers of our software. The DataStax page above seems to be the best solution for us.
You are mixing up several things. In Cassandra, data is organized into partitions, and you can fetch data by partition key or perform a range scan using the token function. The results of a query can be delivered to the application in pages: you specify the fetch size (although 5 is quite small), fetch one page, process it, fetch the next, and so on until the result set is exhausted.
In your case, the page size doesn't match the result set size: you have 6 results there, and the next result set (for token(id) > token(5)) has only 3 rows. I don't know a solution that works out of the box (except select * from table, but it may time out if you have a lot of data). In your case I would rather go with bigger ranges (for example, the whole token range), page through the results inside it (without using limit), and then handle the case where you switch to the next token range while some rows are left over from the previous one.
I have an example of Java code that performs an efficient scan of all the token ranges, similar to what the Spark connector does. The main trick there is to route each request to a node that holds the data, so it reads the data directly from that node without needing to reach other nodes (if you're reading with LOCAL_ONE, of course).
You shouldn't, and can't, use token ranges and LIMIT to page through results, and you found out yourself that it doesn't work - because LIMIT cuts off some of the result, and you have no way to continue.
Instead, Cassandra gives you a separate paging feature: You make a request, get the first 1000 (or whatever) rows and also a "cookie" with which you can resume the query to get the next page of results. Please refer to your favorite driver's documentation on the syntax of using Cassandra paging in your favorite language. It's not "LIMIT" - it's a separate feature.
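For reference, a rough sketch of that paging flow with the DataStax Java driver (3.x-style API; the contact point, keyspace, table name, and the way the "cookie" is shuttled back and forth are illustrative assumptions):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PagingState;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class PagingExample {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("ks")) {

            Statement stmt = new SimpleStatement("SELECT * FROM mytable")
                    .setFetchSize(1000);                       // page size, not LIMIT

            ResultSet rs = session.execute(stmt);
            int inThisPage = rs.getAvailableWithoutFetching(); // rows already fetched
            for (Row row : rs) {
                process(row);
                if (--inThisPage == 0) {
                    break;                                     // stop at the page boundary
                }
            }

            // The "cookie": return it to the caller, then resume with it later.
            PagingState cookie = rs.getExecutionInfo().getPagingState();
            if (cookie != null) {
                Statement next = new SimpleStatement("SELECT * FROM mytable")
                        .setFetchSize(1000)
                        .setPagingState(cookie);
                session.execute(next);                         // picks up where the last page ended
            }
        }
    }

    private static void process(Row row) { /* handle one row */ }
}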
Splitting up a large query into multiple token ranges still has its uses. For example, it allows you to query the different ranges in parallel, since different token ranges will often come from different nodes. But still, you need to query each range to completion, using paging, and cannot use "LIMIT" because you can't know how many results to expect from each range and need to read them all.
I am trying to write an algorithm that stores counts of frequent data searches.
Let's say a user can search different combinations of two entities (Source-Destination). Each time a user searches, I want to store the data with a count, and if they search the same combination (Source-Destination) again, I will update the count.
In this case there are 1000 users, each user searches for a number of different combinations (Source-Destination), and the data will be stored for 30 days.
So the total number of rows will be 100000*30*30=13500000 (1.3 Billion) rows (using MySQL).
Please suggest if there is a better way to do this.
GOAL: I want to get the top 10 search combinations of users at any point in time.
1,000 users and 60,000 rows are nothing by today's standards. Don't even think about it, there is no performance concern whatsoever, so just focus on doing it properly instead of worrying about slowness. There will be no slowness.
The proper way of doing it is by creating a table in which each row contains the search terms ([source, destination] in your case) and a sum, with a unique index on the [source, destination] pair of columns, which is the same as making those two columns the primary key.
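A minimal sketch of that approach over JDBC, assuming hypothetical table and column names (search_count, source, destination, hits) and connection details you would supply yourself:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class SearchCounter {

    // Table assumed to exist:
    //   CREATE TABLE search_count (
    //     source      VARCHAR(64) NOT NULL,
    //     destination VARCHAR(64) NOT NULL,
    //     hits        BIGINT      NOT NULL,
    //     PRIMARY KEY (source, destination)
    //   );

    static void recordSearch(String jdbcUrl, String user, String pass,
                             String source, String destination) throws SQLException {
        String upsert =
                "INSERT INTO search_count (source, destination, hits) VALUES (?, ?, 1) "
              + "ON DUPLICATE KEY UPDATE hits = hits + 1";
        try (Connection con = DriverManager.getConnection(jdbcUrl, user, pass);
             PreparedStatement ps = con.prepareStatement(upsert)) {
            ps.setString(1, source);
            ps.setString(2, destination);
            ps.executeUpdate();
        }
    }
}

The "top 10 at any point in time" goal is then a single query: SELECT source, destination, hits FROM search_count ORDER BY hits DESC LIMIT 10.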
If you had 100,000,000 rows, and performance was critical, and you also had a huge budget affording you the luxury to do whatever weird thing it takes to make ends meet, then you would perhaps want to do something exotic, like appending each search to an indexless table (allowing the fastest appends possible) and then compute the sums in a nightly batch process. But with less than a million rows such an approach would be a complete overkill.
Edit:
Aha, so the real issue is the OP's need for a "sliding window". Well, in that case, I cannot see any approach other than saving every single search, along with the time that it happened, and in a batch process a) computing sums, and b) deleting entries that are older than the "window".
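For illustration, a rough sketch of such a batch job, assuming hypothetical search_log (one row per search, with a searched_at timestamp) and search_count tables and a 30-day window:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class SlidingWindowBatch {

    static void recomputeWindow(String jdbcUrl, String user, String pass) throws SQLException {
        try (Connection con = DriverManager.getConnection(jdbcUrl, user, pass);
             Statement st = con.createStatement()) {
            // a) Recompute the sums over the last 30 days.
            st.executeUpdate("TRUNCATE TABLE search_count");
            st.executeUpdate(
                    "INSERT INTO search_count (source, destination, hits) "
                  + "SELECT source, destination, COUNT(*) FROM search_log "
                  + "WHERE searched_at >= NOW() - INTERVAL 30 DAY "
                  + "GROUP BY source, destination");
            // b) Delete log entries that have fallen out of the window.
            st.executeUpdate(
                    "DELETE FROM search_log WHERE searched_at < NOW() - INTERVAL 30 DAY");
        }
    }
}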
I am trying to reduce the datastore cost by using projections. I have read that a projection query costs only 1 read operation, but in my case the projection costs more than 1. Here is the code:
Query<Finders> q = ofy().load().type(Finders.class).project("Password", "Country");
for (Finders finder : q) {
    resp.getWriter().println(finder.getCountry() + " " + finder.getPassword());
}
On executing this, the q object contains 6 items and to retrieve these 6 items it takes 6 Read operations as shown in Appstats.
Can anyone tell me what's wrong over here ?
To read all items (with a single read operation, if they all fit), call .list() on the query to get a List<Finders>. You chose to iterate over the query instead, and that is quite likely not to rely on a single, possibly huge read from the datastore, but to parcel things out over multiple reads.
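For example, a minimal sketch of the .list() form, reusing the Finders class, ofy() and resp from your code and the same projection:

List<Finders> finders = ofy().load().type(Finders.class)
        .project("Password", "Country")
        .list();   // one batched fetch of the whole (small) result set
resp.getWriter().println(finders.size() + " results");
for (Finders finder : finders) {
    resp.getWriter().println(finder.getCountry() + " " + finder.getPassword());
}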
Where projections enter the picture is quite different: if you have entities with many fields, or some fields that are very large, and in a certain case you know you need only a certain subset of fields (esp. if it's one not requiring "some fields that are very large"), then a projection is a very wise idea because it avoids reading stuff you don't need.
That makes it more likely that a certain fetch of (e.g.) 10 entities will take a single datastore read -- there are byte limits on how much can come from a single datastore read, so if, by carefully picking and choosing the fields you actually require, you're reading only (say) 10k per entity rather than (say) 500k per entity, then clearly you may well need fewer reads from the datastore.
But if you don't do one big massive read with .list(), but an entity-by-entity read by iteration, then most likely you'll still get multiple reads -- essentially, by iterating, you've said you want that!-)
In my Google App Engine app I need to fetch and return a potentially large number of entities from a datastore query in response to a service call GET request. This call may return potentially thousands of entities and MBs of serialized data.
The first portion of the response packet communicates how many entities are in the serialized results, followed by all of the serialized entities. Currently I am iterating through all the entities in the query with a QueryResultIterator, up to a maximum page size limit, after which I return a cursor that can be used to continue fetching where the previous call left off (if the maximum was reached and there are still results in the query). As I iterate through the results, I save them in a list. Once I've either exhausted the query results or reached the maximum page size, I can then get the number of entities from the size of this list. But then I have to iterate through this list again to serialize each of the entities and write the results to the response output stream.
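For reference, a loop like the one described might look roughly like this with the low-level API (MAX_PAGE_SIZE, preparedQuery, and startCursor are placeholders; the classes come from com.google.appengine.api.datastore):

FetchOptions opts = FetchOptions.Builder.withChunkSize(MAX_PAGE_SIZE)
        .prefetchSize(MAX_PAGE_SIZE);
if (startCursor != null) {
    opts.startCursor(startCursor);
}
QueryResultIterator<Entity> it = preparedQuery.asQueryResultIterator(opts);

List<Entity> page = new ArrayList<Entity>();
while (it.hasNext() && page.size() < MAX_PAGE_SIZE) {
    page.add(it.next());
}
Cursor next = it.hasNext() ? it.getCursor() : null;   // resume point for the next call
// page.size() gives the count written first; each entity is then serialized in a second pass.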
I don't know that this is the most efficient method to perform this operation. Is there a way I can get the number of entities in a query's results before actually iterating through them all or fetching them directly into a list? (The list method doesn't work anyway because I'm using cursors, which requires the use of QueryResultIterator).
QueryResultIterator has a method getIndexList(). Would this be a less costly way to get the number of entities in the query's results? I'm assuming this list would contain exactly one index object for each entity in the query's results. Also, I'd need this list to only contain the indexes for the entities after the current cursor position for the iterator. Is my understanding correct, or would this method not do what I think it would?
A list of just indexes would require much less memory than loading a list of whole entities. Although, I don't know if this list would be limited at all by the query's prefetch or chunk sizes, or if I'd want to use the query's limit parameter at all because I would only be interested in knowing how many entities were in the results up to the maximum page size plus one (to know there are still more results and provide a cursor to continue).
Currently I'm setting the prefetch and chunk size (to the size of my page limit), but I'm not using the limit or offset parameters since I'm using cursors instead. From what I understand cursors are preferable to offset/limit. Would setting the limit parameter affect continuing a query with a cursor?
Clearly I have quite a few questions as to how GAE datastore queries work and how they're affected by changing parameters. So any insights are appreciated. The documentation for App Engine APIs is often sparse, as in one sentence descriptions of methods stating pretty much what can be deduced from the method signature. They don't generally go into much detail otherwise. Maybe the way I'm doing it currently is just fine after all. It works as is, but I'm trying to optimize the service call to get the best response time possible for my client application.
UPDATE: By the way, I am using Objectify v3 in my app and to perform this query. There are several places where I am required to use the low-level datastore API, including geo-location queries (with geomodel) and projection queries (which aren't supported in Objectify v3). So if there is a good way to do this using Objectify, that would be ideal. Otherwise I can use the low-level API, but it's always messier this way.
Both the low-level API and Objectify have a count() method (look at the javadocs for details). However, counting can be a very expensive and lengthy operation - it costs 1 small op for every number returned. For example, count() returning 5000 costs 5000 small ops (plus 1 read for the query), and takes as long as it would take to do a keys-only scan of all 5000 (which is what GAE actually does).
If you absolutely must have an exact count, you probably need to aggregate this value yourself by incrementing/decrementing a (possibly sharded) counter. This gets very tricky when you are dealing with filtered queries.
There is no one right solution here. Google searches give you totals like "About 119,000,000 results" which are deliberately inexact and almost certainly precalculated. For smaller result sets, using count() can be acceptable - but you might want to apply a limit() so that you never break the bank. You can always say "More than 500 results..."
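For instance, a capped count along those lines might look like the sketch below (written against Objectify's query API; the entity class and filter are hypothetical, and the exact call chain may differ slightly between Objectify versions):

// Never pay for more than 501 small ops, yet still know whether to say "More than 500".
int count = ofy.query(MyEntity.class)
        .filter("owner", user)   // hypothetical filter
        .limit(501)
        .count();
String label = (count > 500) ? "More than 500 results" : count + " results";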
If you want to fetch the number of records, you can use the following code:
com.google.appengine.api.datastore.Query qry = new com.google.appengine.api.datastore.Query("EntityName");
com.google.appengine.api.datastore.DatastoreService datastoreService = DatastoreServiceFactory.getDatastoreService();
int totalCount = datastoreService.prepare(qry).countEntities(FetchOptions.Builder.withDefaults());
And if you want a filter, you can use:
qry.addFilter("firstName", FilterOperator.EQUAL, firstName);
I hope it helps.
I have a simple data model that includes
USERS: store basic information (key, name, phone # etc)
RELATIONS: describe, e.g. a friendship between two users (supplying a relationship_type + two user keys)
COMMENTS: posted by users (key, comment text, user_id)
I'm getting very poor performance, for instance, if I try to print the first names of all of a user's friends. Say the user has 500 friends: I can fetch the list of friend user_ids very easily in a single query. But then, to pull out first names, I have to do 500 back-and-forth trips to the Datastore, each of which seems to take on the order of 30 ms. If this were SQL, I'd just do a JOIN and get the answer out fast.
I understand there are rudimentary facilities for performing two-way joins across un-owned relations in a relaxed implementation of JDO (as described at http://gae-java-persistence.blogspot.com) but they sound experimental and non-standard (e.g. my code won't work in any other JDO implementation).
Worse yet, what if I want to pull out all the comments posted by a user's friends. Then I need to get from User --> Relation --> Comments, i.e. a three-way join, which isn't even supported experimentally. The overhead of 500 back-and-forths to get a friend list + another 500 trips to see if there are any comments from a user's friends is already enough to push runtime >30 seconds.
How do people deal with these problems in real-world datastore-backed JDO applications? (Or do they?)
Has anyone managed to extract satisfactory performance from JDO/Datastore in this kind of (very common) situation?
-Bosh
First of all, for objects that are frequently accessed (like users), I rely on the memcache. This should speed up your application quite a bit.
If you have to go to the datastore, the right way to do this should be through getObjectsById(). Unfortunately, it looks like GAE doesn't optimize this call. However, a contains() query on keys is optimized to fetch all the objects in one trip to the datastore, so that's what you should use:
List<Key> myFriendKeys = fetchFriendKeys();
Query query = pm.newQuery(User.class, ":p.contains(key)");
List<User> friends = (List<User>) query.execute(myFriendKeys);
You could also rely on the low-level API get() that accepts multiple keys, or do like me and use Objectify.
A totally different approach would be to use an equality filter on a list property. This will match if any item in the list matches. So if you have a friendOf list property in your user entity, you can issue a single Query friendOf == theUser. You might want to check this: http://www.scribd.com/doc/16952419/Building-scalable-complex-apps-on-App-Engine
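A sketch of that list-property query in JDO, assuming a hypothetical friendOf field on User holding the keys of the users they are friends with:

// friendOf is a hypothetical List<Key> property on User; the equality filter
// matches when any element of the list equals theUserKey.
Query q = pm.newQuery(User.class, "friendOf == :userKey");
List<User> friends = (List<User>) q.execute(theUserKey);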
You have to minimize DB reads. That must be a huge focus for any GAE project - anything else will cost you. To do that, pre-calculate as much as you can, especially oft-read information. To solve the issue of reading 500 friends' names, consider that you'll likely be changing the friend list far less than reading it, so on each change, store all names in a structure you can read with one get.
If you absolutely cannot, then you have to tweak each case by hand, e.g. use the low-level API to do a batch get.
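For example, a minimal sketch of such a batch get with the low-level API (friendKeys is whatever Iterable<Key> you already have, and the "firstName" property name is assumed):

DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
Map<Key, Entity> friends = ds.get(friendKeys);   // one round trip for the whole batch
for (Entity friend : friends.values()) {
    String firstName = (String) friend.getProperty("firstName");   // property name assumed
    // ... render the name ...
}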
Also, optimize for speed rather than data size. Use extra structures as indexes and save objects in multiple ways so you can read them as quickly as possible. Data is cheap; CPU time is not.
Unfortunately Phillipe's suggestion
Query query = pm.newQuery(User.class, ":p.contains(key)");
is only optimized to make a single query when searching by primary key. Passing in a list of ten non-primary-key values, for instance, gives the following trace
(trace screenshot: http://img293.imageshack.us/img293/7227/slowquery.png)
I'd like to be able to bulk-fetch comments, for example, from all of a user's friends. If I do store a List on each user, this list can't be longer than 1000 elements (if it's an indexed property of the user), as described at http://code.google.com/appengine/docs/java/datastore/overview.html.
Seems increasingly like I'm using the wrong toolset here.
-B
Facebook has 28 Terabytes of memory cache... However, making 500 trips to memcached isn't very cheap either. It can't be used to store a gazillion small items. "Denormalization" is the key. Such applications do not need to support ad-hoc queries. Compute and store the results directly for the few supported queries.
In your case, you probably have just one type of query: return the data of this, that, and the others that should be displayed on a user page. You can precompute this big ball of mess, so that later a single query based on userId can fetch it all.
When userA makes a comment to userB, you retrieve userB's big ball of mess, insert userA's comment into it, and save it.
Of course, there are a lot of problems with this approach. For giant internet companies, they probably don't have a choice, generic query engines just don't cut it. But for others? Wouldn't you be happier if you can just use the good old RDBMS?
If it is a frequently used query, you can consider preparing indexes for it.
http://code.google.com/appengine/articles/index_building.html
The indexed property limit is now raised to 5000.
However you can go even higher than that by using the method described in http://www.scribd.com/doc/16952419/Building-scalable-complex-apps-on-App-Engine
Basically just have a bunch of child entities for the User called UserFriends, thus splitting the big list and raising the limit to n*5000, where n is the number of UserFriends entities.
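A rough sketch of that fan-out with the low-level API; the "UserFriends" kind follows this answer, while the friendKeys property name and the partition() helper are assumptions:

DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
// Split the full friend-key list into slices of at most 5000 indexed values each.
for (List<Key> slice : partition(allFriendKeys, 5000)) {   // partition() is a hypothetical helper
    Entity chunk = new Entity("UserFriends", userKey);      // child entity of the User
    chunk.setProperty("friendKeys", slice);
    ds.put(chunk);
}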