Selecting distinct entities across a large Google App Engine table - Java

I was wondering if anyone can help me with this problem.
We have an idea we'd like to implement, and we're currently unable to do this efficiently.
I've anonymised the data as best as possible, but the structure is the same.
We have two entities, Car and CarJourney. Each Car has zero to many CarJourneys. Each CarJourney has (amongst other properties) a date associated with it: the date the journey was started.
I wish to query car journeys by time. I'll have two dates, a start date and an end date, where startDate <= endDate, and I want to receive the most recently started journey in that period.
So, if I had a particular car in mind, say car 123, I'd write a query that filters on Car.key and Journey.startDate, where Car.key == 123, Journey.startDate >= startDate and Journey.startDate <= endDate, with an ordering on Journey.startDate descending and a limit of 1.
e.g. Car A has 3 journeys, taken on the 1st, 2nd and 3rd of the month. The query start date is the 1st and the query end date is the 2nd. The result of this query would be one CarJourney: the one from the 2nd.
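For concreteness, this is roughly what that per-car query looks like with the low-level datastore API (the kind and property names "CarJourney", "carKey" and "startDate" are illustrative assumptions):
DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
Query q = new Query("CarJourney");
q.addFilter("carKey", Query.FilterOperator.EQUAL, carKey);
q.addFilter("startDate", Query.FilterOperator.GREATER_THAN_OR_EQUAL, startDate);
q.addFilter("startDate", Query.FilterOperator.LESS_THAN_OR_EQUAL, endDate);
q.addSort("startDate", Query.SortDirection.DESCENDING);
// Limit 1: the first result is the most recently started journey in the range.
List<Entity> latest = datastore.prepare(q).asList(FetchOptions.Builder.withLimit(1));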
Once the result of that query is returned, a very small amount of processing is done to return a result to the user.
That's the easy bit.
But instead of 1 car, I want this for a list of cars, where the list contains N keys to cars.
So, I want to run the above query N times, once for every car, and get the latest journey for each car.
Because the time range is flexible (and thus can't be known beforehand) we can't implement a "isMostRecent" flag, because while it might be the most recent for now, it might not be the most recent for the specified date parameters.
We also need to ensure that this returns promptly (current queries are around the 3-5 second mark for a small set of data) as this goes straight back to the user. This means that we can't use task queues, and because the specified dates are arbitrary we can't implement mass indexing of "isWithinDate" fields.
We tried using an async query, but because the amount of processing is negligible the bottleneck is still the queries on the datastore (because the async api still sends the requests synchronously, it just doesn't block).
Ideally, we'd implement this as a select on car journeys ordered by startDate where the Car.key is distinct, but we can't seem to pull this off in GAE.
There are numerous small optimisations we can make (for example, some MemCaching of repeated queries) but none have made a significant dent in our query time. And MemCaching can only help for a maximum of 1-2 minutes (due to the inevitable forward march of time!)
Any ideas are most welcome and highly appreciated.
Thanks,
Ed

It sounds like the best option is to execute the many queries yourself. You say you tried asynchronous queries, but the bottleneck was sending the query. This seems extremely odd - you should be able to have many queries in flight at the same time, substantially cutting down your latency. How did you determine this?
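For reference, a minimal sketch of what "many queries in flight" can look like with the asynchronous datastore service (buildLatestJourneyQuery is a hypothetical helper that builds the per-car query described above):
AsyncDatastoreService datastore = DatastoreServiceFactory.getAsyncDatastoreService();
List<Iterable<Entity>> pending = new ArrayList<Iterable<Entity>>();
// Kick off all N queries before consuming any results. With the async service,
// each asIterable() call starts prefetching immediately instead of blocking.
for (Key carKey : carKeys) {
    Query q = buildLatestJourneyQuery(carKey, startDate, endDate); // hypothetical
    pending.add(datastore.prepare(q).asIterable(FetchOptions.Builder.withLimit(1)));
}
// Only now iterate; each iterator blocks only if its RPC hasn't finished yet.
for (Iterable<Entity> result : pending) {
    for (Entity journey : result) {
        // small amount of per-journey processing
    }
}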

First of all, I'd recommend using Objectify. JDO/JPA on App Engine just fools people into thinking that the App Engine datastore is a SQL database which, as you've realized, is far from the truth.
If I understand correctly you have a Car which contains a List of CarJourneys?
List properties on App Engine are limited to 5000 entries, and any time you access or change them they have to be serialized/deserialized in whole. So if you plan to have a lot of CarJourneys per Car, this will get slow. Also, because App Engine creates an index entry for every value in the collection, this can lead to exploding indexes.
Instead, just create a property Car inside CarJourney that points to the Car that made the journey: a one-to-one relationship from CarJourney to Car. The type can be Key or just a string/long containing the id of the Car. When querying, just add a filter on the Car property.
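A sketch of what that looks like with Objectify's v3-style API (class and field names are assumptions):
@Entity
public class CarJourney {
    @Id Long id;
    Key<Car> car;       // many CarJourneys point at one Car
    Date startDate;
}

// Latest journey for one car in the date range:
CarJourney latest = ObjectifyService.begin()
        .query(CarJourney.class)
        .filter("car", carKey)
        .filter("startDate >=", startDate)
        .filter("startDate <=", endDate)
        .order("-startDate")
        .limit(1)
        .get();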
I suggest watching Brett Slatkin's video: Scalable, Complex Apps on App Engine.

You can also use one query and filter distinct cars yourself: something like select CarJourney where startDate >= startDate and startDate <= endDate order by startDate descending, then iterate (and filter on your side) through the results until you find enough data to show.
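A sketch of that approach with the low-level API (property names assumed), keeping only the first journey seen per car, which in descending order is its most recent one:
Set<Key> wanted = new HashSet<Key>(carKeys);
Map<Key, Entity> latestByCar = new HashMap<Key, Entity>();

DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
Query q = new Query("CarJourney");
q.addFilter("startDate", Query.FilterOperator.GREATER_THAN_OR_EQUAL, startDate);
q.addFilter("startDate", Query.FilterOperator.LESS_THAN_OR_EQUAL, endDate);
q.addSort("startDate", Query.SortDirection.DESCENDING);

for (Entity journey : datastore.prepare(q).asIterable()) {
    Key carKey = (Key) journey.getProperty("carKey");
    if (wanted.contains(carKey) && !latestByCar.containsKey(carKey)) {
        latestByCar.put(carKey, journey);
        if (latestByCar.size() == wanted.size()) break; // found one per car
    }
}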

Denormalization should solve your problem: have a last_journey reference property in your Car, so every time you start a journey, you also update the Car entity. This way you'd be able to query all cars and have their latest journey in the result set.
It's worth noting that when you access last_journey, a new get() will be issued to the datastore, so if you're listing a lot of cars, you could build a list with all the last_journey keys and fetch them all at once by passing that list to db.get().
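In Java, a sketch of that batched fetch (the property name "last_journey" as above):
DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
List<Key> journeyKeys = new ArrayList<Key>();
for (Entity car : cars) {
    journeyKeys.add((Key) car.getProperty("last_journey"));
}
// One batch round trip instead of one get() per car.
Map<Key, Entity> lastJourneys = datastore.get(journeyKeys);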
Scalable, Complex Apps on App Engine is definitely a must watch (sadly the sound is terrible on this video).

I faced the same kind of problem some time ago. I tried several solutions (in-memory sorting and filtering, encoding things into keys, etc.) and benchmarked them for both latency and CPU cycles using test data of around 100K entities.
Another approach I have taken is encoding the date as an integer (days since the start of the epoch, or days since the start of the year; likewise for hour of day or month, depending on how much detail you need in your output) and saving this into a property. This way you turn your date filter into an equality-only filter, which does not even need a custom index, and you can then sort or filter on other properties.
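A sketch of the encoding (the property name "dayNumber" and the surrounding variables are assumptions):
// Write side: derive the day number once when the journey is stored.
long dayNumber = journeyStart.getTime() / (24L * 60 * 60 * 1000); // days since epoch
journeyEntity.setProperty("dayNumber", dayNumber);

// Query side: the date constraint is now a plain equality filter.
Query q = new Query("CarJourney");
q.addFilter("dayNumber", Query.FilterOperator.EQUAL, requestedDay);
q.addSort("startDate", Query.SortDirection.DESCENDING);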
Benchmarking the latter solution, I found that when the filtered result set is a small fraction of the unfiltered original set, it is an order of magnitude (or more) faster and more CPU-efficient. In the worst case, when filtering doesn't reduce the result set at all, latency and CPU usage were comparable to the previous solutions.
Hope this helps, or did I miss something?
Happy coding :-)

You can also run these queries in parallel by issuing them straight from the client, using Ajax. That is, you can return an almost empty HTML page to the user, containing just the car definitions, and then make an Ajax call for the journeys of every car on that page.

As JB Nizet suggested, I am wondering if the answer might be a single query, possibly with a temporary table or an anonymous intermediate table (I don't know what Google supports to this end), using a GROUP BY, thus eliminating the extra transfer of data and the need for Java to do the processing. I am thinking of something along the lines of:
CREATE TEMPORARY TABLE temp1 AS
SELECT * FROM car_journey
WHERE start_date >= ? AND start_date <= ?;

SELECT t1.car_id, t1.journey_id
FROM temp1 t1, (
    SELECT car_id, MAX(start_date) AS latest_start
    FROM temp1
    GROUP BY car_id
) t2
WHERE t1.car_id = t2.car_id
AND t1.start_date = t2.latest_start;
With the temporary table you can greatly reduce the time for the secondary query, since theoretically the data will be much smaller than the full table.
Finally, again not knowing what Google supports, I would ask whether you have indexes defined on the appropriate columns, which may help speed up the query.

Related

How to query records where datetime is greater than X in DynamoDB?

I have a table in DynamoDB, and I need to get a list of records (in Java) which are from the last day. They all have a dateTime attribute.
Relevant attributes of the table I'm referring to:
customerUrl(string, hashkey), dateTime(number, range key), and a few other attributes which aren't relevant
I've already tried setting a Global Secondary Index with a hashkey of dateTime and no range key. This index is named 'performanceIndex'. I then tried to query it as follows:
Map<String, AttributeValue> eav = new HashMap<>();
eav.put(":val1", new AttributeValue().withN(maximumAgeMillis));
DynamoDBQueryExpression<PingLog> pinglogQuery = new DynamoDBQueryExpression<PingLog>();
pinglogQuery.setKeyConditionExpression("dateTime > :val1");
pinglogQuery.setExpressionAttributeValues(eav);
pinglogQuery.setIndexName("performanceIndex");
pinglogQuery.setConsistentRead(false);
List<PingLog> pinglogs = PostDatabaseMapper.getInstance().query(PingLog.class, pinglogQuery);
However, the query just keeps running and never returns. I added a println statement before and after it, and only the first one actually printed.
Before this query I just did a scan with a filter, and that worked, but now we have so many records (80 million) that a scan takes forever. What should I do? Do I need a different secondary index? Is my query wrong?
You should create a GSI with yyyy-mm-dd as the partition key, and hh:mm:ss as the sort key. (This might require backfilling the entire table, but if you query by date often, it will be worth it.) Check out this answer to a related question, which has some more details on this approach.
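A rough sketch of the resulting query, assuming a GSI named "dateIndex" with a string "date" attribute (yyyy-mm-dd) as partition key and a "time" attribute (hh:mm:ss) as sort key; to cover the last day you would run it for today and yesterday and merge the results:
// "date" is a DynamoDB reserved word, hence the #d placeholder.
Map<String, AttributeValue> eav = new HashMap<>();
eav.put(":d", new AttributeValue().withS("2016-01-15"));

DynamoDBQueryExpression<PingLog> query = new DynamoDBQueryExpression<PingLog>()
        .withIndexName("dateIndex")
        .withKeyConditionExpression("#d = :d")
        .withExpressionAttributeNames(Collections.singletonMap("#d", "date"))
        .withExpressionAttributeValues(eav)
        .withConsistentRead(false); // GSIs only support eventually consistent reads

List<PingLog> pinglogs = PostDatabaseMapper.getInstance().query(PingLog.class, query);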
There is a potential complication depending on what sort of data access patterns you have. Is it fairly steady, or is it bursty? Will current items have a much higher write throughput than any other day?
If you’re dealing with time-series data, such as IoT sensor readings, this strategy may not work for you. You could have a hot partition in your GSI, which could put back-pressure in your main table and cause writes to be throttled. This is unlikely because of DynamoDB’s adaptive capacity, but it is possible.
In this case, you should consider DynamoDB’s recommended best practice for handling time-series data. It discusses how to deal with data that has different access requirements over time. The gist of their solution is to create separate tables for each period of time (day/month/year/whatever) so that data from different time frames can have different provisioned capacity.

IllegalArgumentException: Splitting the provided query requires that too many subqueries are merged in memory

I look up a bunch of model ids:
List<Long> ids = lookupIds(searchCriteria);
And then I run a query to order them:
fooModelList = (List<FooModel>) query.execute(ids);
The log shows the GQL this is compiled to:
Compiling "SELECT FROM com.foo.FooModel WHERE
:p.contains(id) ORDER BY createdDateTime desc RANGE 0,10"
When the ids ArrayList is small this works fine.
But over a certain size (40 maybe?) I get this error:
IllegalArgumentException: Splitting the provided query requires
that too many subqueries are merged in memory.
Is there a way to work around this or is this a fixed limit in GAE?
This is a fixed limit. If you're looking up entities by ID, though, you shouldn't be doing queries in the first place - you should be doing fetches by key. If you're querying by a foreign key, you'll need to do separate queries yourself if you want to go over the limit of 40 - but you should probably reconsider your design, since this is extremely inefficient.
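A sketch of the batch fetch by key with the low-level API (the kind name "FooModel" is an assumption); this is a single round trip with no subquery limit:
DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
List<Key> keys = new ArrayList<Key>();
for (Long id : ids) {
    keys.add(KeyFactory.createKey("FooModel", id));
}
Map<Key, Entity> entities = datastore.get(keys);
// Sort the results by createdDateTime in memory afterwards.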
I could not verify this in the GAE documentation, so my answer might not be complete. Yet I found that it is the "ORDER BY createdDateTime desc" that triggers this limit (which is 30, by the way). My hypothesis is that if GAE doesn't need to sort, it does not need to merge the subqueries in memory.
If you do need to sort, do this (which is the way to go with time-based data in GAE anyway):
Add a field 'week' or 'month' or something to the entity, containing an integer that uniquely identifies the week/month (so you need something other than 0..52 or 0..11, as the values also need to be unique across years). Then you make your query and state that you are only interested in those of this week, and maybe also last week (or month). So if we are in week 4353, your query has something like ":week IN [4353, 4352]". That should give you a relatively small result set; then filter out the entries that are too old and sort the rest in memory, as sketched below.
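A sketch of one such integer that stays unique across years (weeks since the epoch, computed at write time):
long week = createdDateTime.getTime() / (7L * 24 * 60 * 60 * 1000); // weeks since epoch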

Avoiding exploding indices and entity-group write-rate limits with appengine

I have an application in which there are Courses, Topics, and Tags. Each Topic can be in many Courses and have many Tags. I want to look up every Topic that has a specific Tag x and is in specific Course y.
Naively, I give each Topic a list of Course ids and Tag ids, so I can select * from Topic where tagIds = x && courseIds = y. I think this query would require an exploding index: with 30 courses and 30 tags we're looking at ~900 index entries, right? At 50 x 20 I'm well over the 5000-entry limit.
I could just select * from Topic where tagIds = x, and then use a for loop to go through the result, choosing only Topics whose courseIds.contain(y). This returns way more results than I'm interested in and spends a lot of time deserializing those results, but the index stays small.
I could select __KEY__ from Topic where tagIds = x AND select __KEY__ from Topic where courseIds = y and find the intersection in my application code. If the sets are small this might not be unreasonable.
I could make a sort of join table, TopicTagLookup with a tagId and courseId field. The parent key of these entities would point to the relevant Topic. Then I would need to make one of these TopicTagLookup entities for every combination of courseId x tagId x relevant topic id. This is effectively like creating my own index. It would still explode, but there would be no 5000-entry limit. Now, however, I need to write 5000 entities to the same entity group, which would run up against the entity-group write-rate limit!
I could precalculate each query. A TopicTagQueryCache entity would hold a tagId, courseId, and a List<TopicId>. Then the query looks like select * from TopicTagQueryCache where tagId=x && courseId = y, fetching the list of topic ids, and then using a getAllById call on the list. Similar to #3, but I only have one entity per courseId x tagId. There's no need for entity groups, but now I have this potentially huge list to maintain transactionally.
Appengine seems great for queries you can precalculate. I just don't quite see a way to precalculate this query efficiently. The question basically boils down to:
What's the best way to organize data so that we can do set operations like finding the Topics in the intersection of a Course and a Tag?
Your assessment of your options is correct. If you don't need any sort criteria, though, option 3 is more or less already done for you by the App Engine datastore, with the merge join strategy. Simply do a query as you detail in option 1, without any sorts or inequality filters, and App Engine will do a merge join internally in the datastore, and return only the relevant results.
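For example, a sketch of that merge-join-friendly query with the low-level API:
// Two equality filters, no sort order: the datastore can serve this with a
// zig-zag merge join over the built-in single-property indexes, so no custom
// (exploding) composite index is needed.
DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
Query q = new Query("Topic");
q.addFilter("tagIds", Query.FilterOperator.EQUAL, tagId);
q.addFilter("courseIds", Query.FilterOperator.EQUAL, courseId);
List<Entity> topics = datastore.prepare(q).asList(FetchOptions.Builder.withDefaults());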
Options 4 and 5 are similar to the relation index pattern documented in this talk.
I like #5 - you are essentially creating your own (exploding) index. It will be fast to query.
The only downsides are that you have to manually maintain it (next paragraph), and retrieving the Topic entity will require an extra query (first you query TopicTagQueryCache to get the topic ID and then you need to actually retrieve the topic).
Updating the TopicTagQueryCache you suggested shouldn't be a problem either. I wouldn't worry about doing it transactionally - this "index" will just be stale for a short period of time when you update a Topic (at worst, your Topic will temporarily show up in results it should no longer show up in, and perhaps take a moment before it shows up in new results in which it should appear - this doesn't seem so bad). You can even do this update on the task queue (to make sure the potentially large number of datastore writes all succeed, and so that you can finish the request quickly so your user isn't kept waiting).
As you said yourself, you should arrange your data to facilitate the scaling of your app. So, to the question of "What's the best way to organize data so that we can do set operations like finding the Topics in the intersection of a Course and a Tag?":
You can hold your own indexes of these sets by creating CourseRef and TopicRef entities which consist of a Key only, with the ID portion being the actual Key of the corresponding entity. These "Ref" entities live under a specific Tag, so there are no actual Key duplicates. The structure for a given Tag is: Tag\CourseRef...\TopicRef...
This way, given a Tag and Course, you construct the Key Tag\CourseRef and do an ancestor query, which gets you a set of keys you can fetch. This is extremely fast as it is effectively direct access, and it should handle large lists of courses or topics without the issues of List properties.
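A sketch of that lookup with the low-level API (kind names and the key encoding are assumptions):
DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
Key tagKey = KeyFactory.createKey("Tag", tagId);
Key courseRefKey = KeyFactory.createKey(tagKey, "CourseRef", KeyFactory.keyToString(courseKey));

// Keys-only ancestor query: returns the TopicRef keys under Tag\CourseRef.
Query q = new Query("TopicRef");
q.setAncestor(courseRefKey);
q.setKeysOnly();
List<Entity> topicRefs = datastore.prepare(q).asList(FetchOptions.Builder.withDefaults());
// Decode each TopicRef's name back into a Topic key and batch-fetch the Topics.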
This method will require you to use the DataStore API to some extent.
As you can see, this answers the specific question, but the model will do no good for other types of set operations.

System.currentTimeMillis() as column names (time sorted) in a row of a NoSQL database

I want to use a long timestamp value (maybe generated by System.currentTimeMillis()) as column names in my database. Can the System.currentTimeMillis() method guarantee always-increasing values? I have seen people complaining that it sometimes became slower!
I am also open to other alternatives that may be suitable as increasing column names. I just want to guarantee uniqueness (unless two fall in the same millisecond, which I can live with) and an increasing sequence (and perhaps also a smaller size (fewer bytes) if at all possible!).
Edit: I have a NoSQL database where column names (and hence columns) are sorted in a row as an ascending/descending number sequence. Thus I am looking to generate timestamps as column names that enable me to sort the columns by time.
I am looking to store the comments of a blog post in a single row, using timestamp values as column names to enable sorting by time. I think I wouldn't mind even if 10 ms is the resolution, since the probability of someone commenting within the same 1/100 of a second on the same blog post in my application would be very low.
Edit: Thank you all for your comments and suggestions. Really helpful. I think I have got a solution to work around the occasional misbehaviour of System.currentTimeMillis(). I could implement it like this:
When a user adds a new comment to a post, the frontend will send an id 'suggestedId' which is one greater than the id of the last comment (the frontend knows this from the previous database read). This id is compared with an id generated using System.nanoTime(). If the suggestedId is less than the generatedId, the generatedId is used; otherwise the suggestedId is used. So it simply means: whatever is greater, use that id. This guarantees monotonicity.
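In code, the scheme boils down to taking whichever candidate is larger (a sketch):
long generatedId = System.nanoTime();
// suggestedId = last comment's id + 1, supplied by the frontend.
long nextId = Math.max(suggestedId, generatedId);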
Although not truly perfect, it sounds good enough for practical usage!
Would you guys like to share your thoughts on this? Thanks!
The general database design issues have been addressed by other commenters, but just on this point:
Can the System.currentTimeMillis() method guarantee always-increasing values? I have seen people complaining that it sometimes became slower!
For future reference, the word for this (always-increasing values) is monotonicity. No, System.currentTimeMillis() is not monotonic. Not only can it go more slowly, or speed up (if, say, the System it's running on is using NTP for time correction), but it can arbitrarily change up or down (if the user, or a script, changes the system time).
System.nanoTime() does not formally guarantee monotonicity; however, the Hotspot JVM does if and only if the underlying system supports it (modern Linux kernels on modern hardware certainly do). Sounds better - with the caveat that some processors use power management techniques etc which can screw this up in the presence of multiple cores. So it's better, but still not perfect.
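If you only need monotonicity within a single JVM, you can also enforce it yourself; a sketch of one way (my own, not from the discussion above):
private static final AtomicLong last = new AtomicLong(Long.MIN_VALUE);

static long nextMonotonic() {
    long candidate = System.nanoTime();
    long prev;
    do {
        prev = last.get();
        if (candidate <= prev) candidate = prev + 1; // clock stalled or went backwards
    } while (!last.compareAndSet(prev, candidate));
    return candidate;
}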
On many systems, System.currentTimeMillis() does not resolve below 10 ms increments. So two different calls can easily return the same value.
I suggest that you keep an auxiliary table with a counter that you can increment to give the next value.
Why do you want this for column names? It seems a very odd sort of database design.
I am looking to store comments of a blog post in a single row using timestamp values as column names to enable sort by time.
I'm no NoSQL expert, but I'd say it's not a good idea to store comments as columns in one row. Why don't you add a row per comment, along with a timestamp you can sort by?
Using a traditional relational database the table could look like this:
comments
--------
id (PK)
blog_id (FK)
created_on (timestamp)
text
Selecting the comments in order would then be in SQL:
SELECT * from comments WHERE blog_id = ? ORDER BY created_on
System.currentTimeMillis() typically has around 10-20ms granularity, but even if it had 1ms granularity, in principle, 1ms is an eternity in computing time and it would be quite plausible, depending on what you're doing, for two calls to end up with the same value. However, I'm guessing that even 20ms is probably not an eternity compared to how frequently people make blog comments.
So, if two people post a comment within the same 20 ms (or whatever), just sorting on this value will not define an order for the posts in question. But do you particularly care about this unlikely situation? If you do, then you need to build in a little bit of extra logic (keep a counter for the number of messages posted "this millisecond"). I personally wouldn't bother in your use case.
As far as I can understand, you're also storing the data in a fundamentally silly way. Why not just have a "Comments" table with a row per comment and a single time column, which you can sort on as required.
Many databases provide a way to get serial numbers into a column. For example, see this: PostgreSQL Autoincrement.

Google app engine: Poor Performance with JDO + Datastore

I have a simple data model that includes
USERS: store basic information (key, name, phone # etc)
RELATIONS: describe, e.g. a friendship between two users (supplying a relationship_type + two user keys)
COMMENTS: posted by users (key, comment text, user_id)
I'm getting very poor performance, for instance, if I try to print the first names of all of a user's friends. Say the user has 500 friends: I can fetch the list of friend user_ids very easily in a single query. But then, to pull out first names, I have to do 500 back-and-forth trips to the Datastore, each of which seems to take on the order of 30 ms. If this were SQL, I'd just do a JOIN and get the answer out fast.
I understand there are rudimentary facilities for performing two-way joins across un-owned relations in a relaxed implementation of JDO (as described at http://gae-java-persistence.blogspot.com) but they sound experimental and non-standard (e.g. my code won't work in any other JDO implementation).
Worse yet, what if I want to pull out all the comments posted by a user's friends. Then I need to get from User --> Relation --> Comments, i.e. a three-way join, which isn't even supported experimentally. The overhead of 500 back-and-forths to get a friend list + another 500 trips to see if there are any comments from a user's friends is already enough to push runtime >30 seconds.
How do people deal with these problems in real-world datastore-backed JDO applications? (Or do they?)
Has anyone managed to extract satisfactory performance from JDO/Datastore in this kind of (very common) situation?
-Bosh
First of all, for objects that are frequently accessed (like users), I rely on memcache. This should speed up your application quite a bit.
If you have to go to the datastore, the right way to do this should be through getObjectsById(). Unfortunately, it looks like GAE doesn't optimize this call. However, a contains() query on keys is optimized to fetch all the objects in one trip to the datastore, so that's what you should use:
List<Key> myFriendKeys = fetchFriendKeys();
Query query = pm.newQuery(User.class, ":p.contains(key)");
List<User> friends = (List<User>) query.execute(myFriendKeys);
You could also rely on the low-level API get() that accepts multiple keys, or do like me and use Objectify.
A totally different approach would be to use an equality filter on a list property. This will match if any item in the list matches. So if you have a friendOf list property in your user entity, you can issue a single Query friendOf == theUser. You might want to check this: http://www.scribd.com/doc/16952419/Building-scalable-complex-apps-on-App-Engine
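A sketch of that single query with the low-level API (the "friendOf" list property is the assumption here):
// An equality filter on a multi-valued property matches if any element of the
// list equals the value, so one query replaces the 500-way fan-out.
DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
Query q = new Query("User");
q.addFilter("friendOf", Query.FilterOperator.EQUAL, theUserKey);
Iterable<Entity> friends = datastore.prepare(q).asIterable();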
You have to minimize DB reads. That must be a huge focus for any GAE project - anything else will cost you. To do that, pre-calculate as much as you can, especially oft-read information. To solve the issue of reading 500 friends' names, consider that you'll likely be changing the friend list far less than reading it, so on each change, store all names in a structure you can read with one get.
If you absolutely cannot then you have to tweak each case by hand, e.g. use the low-level API to do a batch get.
Also, optimize for speed rather than data size. Use extra structures as indexes and save objects in multiple ways so you can read them as quickly as possible. Data is cheap, CPU time is not.
Unfortunately Phillipe's suggestion
Query query = pm.newQuery(User.class, ":p.contains(key)");
is only optimized to make a single query when searching by primary key. Passing in a list of ten non-primary-key values, for instance, gives the following trace
(trace screenshot: http://img293.imageshack.us/img293/7227/slowquery.png)
I'd like to be able to bulk-fetch comments, for example, from all a user's friends. If I do store a List on each user, this list can't be longer than 1000 elements long (if it's an indexed property of the user) as described at: http://code.google.com/appengine/docs/java/datastore/overview.html .
Seems increasingly like I'm using the wrong toolset here.
-B
Facebook has 28 terabytes of memory cache... However, making 500 trips to memcached isn't very cheap either, and it can't be used to store a gazillion small items. "Denormalization" is the key. Such applications do not need to support ad-hoc queries; compute and store the results directly for the few queries you do support.
In your case, you probably have just one type of query: return the data of this, that and the other that should be displayed on a user page. You can precompute this big ball of mess, so that later a single query based on userId can fetch it all.
When userA makes a comment to userB, you retrieve userB's big ball of mess, insert userA's comment into it, and save it.
Of course, there are a lot of problems with this approach. For giant internet companies there is probably no choice; generic query engines just don't cut it. But for others? Wouldn't you be happier if you could just use the good old RDBMS?
If it is a frequently used query, you can consider preparing indexes for the same.
http://code.google.com/appengine/articles/index_building.html
The indexed property limit is now raised to 5000.
However you can go even higher than that by using the method described in http://www.scribd.com/doc/16952419/Building-scalable-complex-apps-on-App-Engine
Basically just have a bunch of child entities for the User called UserFriends, thus splitting the big list and raising the limit to n*5000, where n is the number of UserFriends entities.
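A sketch of such a child entity with Objectify-style annotations (names are assumptions):
@Entity
public class UserFriends {
    @Id Long id;
    @Parent Key<User> owner;     // child entity of the User whose friends these are
    List<Key<User>> friends;     // each shard stays under the per-entity index limit
}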
