I would like to use the appengine mapper to iterate over a range of dates (from-date and to-date passed as properties to the configuration). For each date in the range, I would retrieve the entities that have this date as a property and operate on this set.
For example, if I have the following set of entities:
Key   Date         Value
a     2011/09/09   323
b     2011/09/09   132
c     2011/09/08   354
d     2011/09/08   432
e     2011/09/08   234
f     2011/09/07   423
g     2011/09/07   543
I would like to specify a date range of 2011/09/09 - 2011/09/07 which would create three mapper instances, for 2011/09/09, 2011/09/08 and 2011/09/07. In turn these would query for entities a+b, c+d+e and f+g respectively, and perform some operations on the values. (Each of the mappers would also make other datastore queries for additional data, hence the 'bonus question' below)
Presumably I need to create a custom InputFormat class; however, I'm quite new to mapreduce/hadoop, so I was hoping someone had some examples?
Bonus question: is it "bad form" to use a DAO to load data in a mapper? Other distributed computing platforms I have worked with (e.g. DataSynapse) would require that you parcel up all inputs and provide them with the task, to prevent too much contention on a data server. However, with the App Engine HR datastore I presume this isn't a concern?
It's not currently possible to iterate over a subset of entities of a given kind in App Engine's mapreduce implementation. If the entities you want make up a large proportion of the data, you can simply iterate over everything and ignore the unwanted entities; if they make up only a small proportion, you will have to roll your own update procedure using the task queue.
Building on Nick Johnson's answer: you will need to retrieve your date range from the context using custom parameters. The mapper then filters out (ignores) any entity that falls outside the range before processing it.
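A minimal sketch of that filter-in-the-mapper approach, assuming the Hadoop-based appengine-mapreduce Java library where parameters configured in mapreduce.xml surface through context.getConfiguration(); the parameter names, date format, and property name here are my assumptions:

```java
import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.hadoop.io.NullWritable;

import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.tools.mapreduce.AppEngineMapper;

public class DateRangeMapper extends AppEngineMapper<Key, Entity, NullWritable, NullWritable> {
  private Date from, to;

  @Override
  public void setup(Context context) throws IOException, InterruptedException {
    // "from-date" / "to-date" are assumed custom parameters from mapreduce.xml.
    SimpleDateFormat fmt = new SimpleDateFormat("yyyy/MM/dd");
    try {
      from = fmt.parse(context.getConfiguration().get("from-date"));
      to = fmt.parse(context.getConfiguration().get("to-date"));
    } catch (ParseException e) {
      throw new IOException("Bad date parameter", e);
    }
  }

  @Override
  public void map(Key key, Entity entity, Context context) {
    Date date = (Date) entity.getProperty("date");
    // Ignore entities outside the requested range; process the rest.
    if (date == null || date.before(from) || date.after(to)) {
      return;
    }
    // ... operate on the entity's value here ...
  }
}
```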
But if you'd rather not map across all entities of a given kind, there is a workaround that, depending on your requirements, may or may not be feasible. Suppose you are pretty fixed on the date ranges (sounds unlikely, but maybe). Then for each expected range you create a corresponding child entity kind, with a parent key (or just a reference, but a parent key works better for consistency - think transactions across an entity group) pointing to the main entity.
Thus each entity in the range receives a child entity of the kind corresponding to that range. Then set up a mapper over the child entity kind for the range, and retrieve each child's parent to work on it.
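For concreteness, a hypothetical sketch with the low-level datastore API; the range-specific kind name is an assumption:

```java
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Key;

public class RangeMarkers {
  // Tag a main entity as belonging to the 2011/09/07 - 2011/09/09 range by
  // giving it a child marker entity of a range-specific kind.
  public static void markInRange(Entity mainEntity) {
    DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
    Entity marker = new Entity("Range_20110907_20110909", mainEntity.getKey());
    ds.put(marker);
  }

  // The mapper then runs over kind "Range_20110907_20110909" and loads the
  // parent to operate on it.
  public static Key mainEntityKey(Entity marker) {
    return marker.getKey().getParent();
  }
}
```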
I do something similar, but in the opposite direction and with a single child entity kind, when populating my data for the Relation Index Entity pattern. Hence the answer to your bonus question: go ahead, use a DAO or whatever your data layer consists of.
While the first approach is more sound, the latter may be feasible in cases where your ranges are not very dynamic and stay manageable. Given the schema-less nature of the datastore, creating new entity kinds is neither expensive nor bad practice.
Related
I have agriculture-related data that I need to model in Java. In a nutshell, the data is a collection of attributes that are collected whenever a new plant variety is produced. Currently there are around 135 plants whose data I want to model, grouped into 9 groups. The problem I am facing during modeling is that every plant has its own attributes that it doesn't share with others, has a few attributes similar to other plants, and there are also differences in the attributes of the same plant released in the same year, which makes it difficult to restrict the set of fields I tried to include in the class.
For example, the attributes shown in red might not be included in another variety, or might be included along with additional attributes; it is not possible to know exactly which attributes will be there. The other problem is that some attributes, like the one shown in the blue rectangle, have range values.
What I have tried is to list every possible attribute by going through the 24 books I have, turning the unique attributes into classes for every plant, and extracting values that have ranges into other classes with min and max values. I ended up with over 200 classes, which makes it very complicated to build a unified system that I can use to feed in the data and retrieve information from it.
What is your recommended approach to model this data? The database I planned to use is MongoDB, since some varieties may be missing values that other varieties have. I am ready to move to other programming languages like Python if I have to. Thanks.
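For concreteness, here is a hypothetical sketch of the range-extraction idea described above, generalized with an attribute map so that sparse attributes don't force one class per plant; all names are illustrative, not a recommendation from the thread:

```java
import java.util.HashMap;
import java.util.Map;

public class PlantModelSketch {
  // Attributes like the "blue rectangle" ones hold a min/max range.
  static class Range {
    final double min, max;
    Range(double min, double max) { this.min = min; this.max = max; }
  }

  // One class for all varieties: attributes are sparse map entries, so a
  // variety simply omits the attributes it does not have.
  static class PlantVariety {
    String name;
    int releaseYear;
    Map<String, Object> attributes = new HashMap<>();
  }

  public static void main(String[] args) {
    PlantVariety v = new PlantVariety();
    v.name = "Variety-X"; // illustrative name
    v.releaseYear = 2020;
    v.attributes.put("plantHeightCm", new Range(80, 120)); // range-valued attribute
    v.attributes.put("seedColor", "brown");                // simple attribute
  }
}
```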
Does it sound bad to have 180 unindexed properties (columns) of Integer/Long type per entity in the datastore?
I need to count 6 kinds of requests per user, saved by day, for analytics reasons, and I'm doing everything based on the sharding counters article and webcast:
https://cloud.google.com/appengine/articles/sharding_counters
So basically it's 6 values per day, incremented on every new request, so I'm thinking of having:
1 Kind per Month
6 types of analytics * days in the month = 180 properties
How many properties is too many for a Google Datastore entity?
Thank you
Probably not a good idea.
Keep in mind that every time you want to update a single property value, the entire entity has to be re-written (i.e. retrieved from the datastore, deserialized, updated, re-serialized and re-sent to the datastore). The bigger the entity, the slower the performance.
IMHO it's better to have multiple smaller entities than one big one in such a case. It is possible to split a single big entity into multiple smaller ones, efficiently related to each other - see re-using an entity's ID for other entities of different kinds - sane idea?
Along the same lines, I believe it's even possible to encode the day info and the user ID into unique custom key IDs for easy access: something like <userid>_YYMMDD, or just <userid>_DD.
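A quick sketch of that key encoding with the low-level API; the kind name and date formatting are assumptions:

```java
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.EntityNotFoundException;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;

public class DailyStatsKeys {
  // Build a key name like "<userid>_YYMMDD" so each day's counters can be
  // fetched directly by key, with no query.
  public static Key dayKey(String userId, String yymmdd) {
    return KeyFactory.createKey("DailyStats", userId + "_" + yymmdd);
  }

  public static Entity loadOrCreate(DatastoreService ds, String userId, String yymmdd) {
    Key key = dayKey(userId, yymmdd);
    try {
      return ds.get(key);                 // direct lookup by the custom key
    } catch (EntityNotFoundException e) {
      return new Entity(key);             // first request of the day
    }
  }
}
```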
I have a large object that I store using Objectify. I need a list of those objects with only a subset of the properties populated. How can this be done?
App Engine stores and retrieves entities as encoded Protocol Buffers. There's no way for the underlying infrastructure to store, update, or retrieve only part of an entity, so there's no point in a library doing it - hence Objectify, like other libraries, doesn't. If you regularly need to access only part of an entity, split those fields into a separate entity.
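A hypothetical sketch of that split, assuming Objectify 4+ annotations; class and field names are illustrative:

```java
import com.googlecode.objectify.annotation.Entity;
import com.googlecode.objectify.annotation.Id;

// The small, frequently listed fields live in one entity...
@Entity
class ArticleSummary {
  @Id Long id;
  String title;
}

// ...while the large, rarely needed fields live in a second entity sharing
// the same numeric ID, so each half can be loaded independently - e.g.
// ofy().load().type(ArticleSummary.class).list() for the light list view.
@Entity
class ArticleBody {
  @Id Long id;
  String fullText;
}
```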
It's not a good idea to split an entity in two in a NoSQL database: when you need to read a list of entries, you would be obliged to make n extra requests to get the second part of each entry (n x m if your data is split across more entities). This is simply because there are no joins in NoSQL databases.
What you can do is "cache": duplicate the needed subset in another entity to get the best performance. The disadvantage is that you are obliged to write twice when persisting the main entity (if a field of the subset has changed).
What I usually do is write a /** OPTIMIZE xxxx */ comment on the class that needs to read a subset, and come back to it when I need more performance.
I have an application in which there are Courses, Topics, and Tags. Each Topic can be in many Courses and have many Tags. I want to look up every Topic that has a specific Tag x and is in specific Course y.
Naively, I give each Topic a list of Course ids and Tag ids, so I can select * from Topic where tagIds = x && courseIds = y. I think this query would require an exploding index: with 30 courses and 30 tags we're looking at ~900 index entries, right? At 50 x 20 I'm well over the 5000-entry limit.
I could just select * from Topic where tagIds = x, and then use a for loop to go through the result, choosing only Topics whose courseIds.contain(y). This returns way more results than I'm interested in and spends a lot of time deserializing those results, but the index stays small.
I could select __KEY__ from Topic where tagIds = x AND select __KEY__ from Topic where courseIds = y and find the intersection in my application code. If the sets are small this might not be unreasonable.
I could make a sort of join table, TopicTagLookup with a tagId and courseId field. The parent key of these entities would point to the relevant Topic. Then I would need to make one of these TopicTagLookup entities for every combination of courseId x tagId x relevant topic id. This is effectively like creating my own index. It would still explode, but there would be no 5000-entry limit. Now, however, I need to write 5000 entities to the same entity group, which would run up against the entity-group write-rate limit!
I could precalculate each query. A TopicTagQueryCache entity would hold a tagId, courseId, and a List<TopicId>. Then the query looks like select * from TopicTagQueryCache where tagId=x && courseId = y, fetching the list of topic ids, and then using a getAllById call on the list. Similar to #3, but I only have one entity per courseId x tagId. There's no need for entity groups, but now I have this potentially huge list to maintain transactionally.
Appengine seems great for queries you can precalculate. I just don't quite see a way to precalculate this query efficiently. The question basically boils down to:
What's the best way to organize data so that we can do set operations like finding the Topics in the intersection of a Course and a Tag?
Your assessment of your options is correct. If you don't need any sort criteria, though, option 3 is more or less already done for you by the App Engine datastore, with the merge join strategy. Simply do a query as you detail in option 1, without any sorts or inequality filters, and App Engine will do a merge join internally in the datastore, and return only the relevant results.
Options 4 and 5 are similar to the relation index pattern documented in this talk.
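For reference, a minimal sketch of that merge-join query with the low-level API (no composite index is needed when both filters are equality filters and there is no sort); the property names come from the question:

```java
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Query;
import com.google.appengine.api.datastore.Query.CompositeFilterOperator;
import com.google.appengine.api.datastore.Query.FilterOperator;
import com.google.appengine.api.datastore.Query.FilterPredicate;

public class MergeJoinExample {
  public static Iterable<Entity> topicsFor(long tagId, long courseId) {
    DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
    // Two equality filters, no sort order: the datastore serves this with a
    // merge join over the built-in single-property indexes.
    Query q = new Query("Topic").setFilter(CompositeFilterOperator.and(
        new FilterPredicate("tagIds", FilterOperator.EQUAL, tagId),
        new FilterPredicate("courseIds", FilterOperator.EQUAL, courseId)));
    return ds.prepare(q).asIterable();
  }
}
```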
I like #5 - you are essentially creating your own (exploding) index. It will be fast to query.
The only downsides are that you have to maintain it manually (next paragraph), and that retrieving the Topic entities requires an extra round trip (first you query TopicTagQueryCache to get the topic IDs, and then you actually fetch the topics).
Updating the TopicTagQueryCache you suggested shouldn't be a problem either. I wouldn't worry about doing it transactionally - this "index" will just be stale for a short period after you update a Topic (at worst, your Topic will temporarily show up in results it should no longer appear in, and may take a moment before it shows up in new results where it should - this doesn't seem so bad). You can even do this update on the task queue (to make sure the potentially large number of datastore writes all succeed, and so you can finish the request quickly so your user isn't kept waiting).
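A sketch of option #5 with the low-level datastore API; the kind and property names come from the question, the rest is assumed (in particular, storing the topic list as a list of Keys so the batch get stands in for getAllById):

```java
import java.util.List;
import java.util.Map;

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.Query;
import com.google.appengine.api.datastore.Query.CompositeFilterOperator;
import com.google.appengine.api.datastore.Query.FilterOperator;
import com.google.appengine.api.datastore.Query.FilterPredicate;

public class TopicLookup {
  @SuppressWarnings("unchecked")
  public static Map<Key, Entity> topicsFor(long tagId, long courseId) {
    DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
    // One cache entity per (tagId, courseId) pair holds the precalculated list.
    Entity cache = ds.prepare(new Query("TopicTagQueryCache")
        .setFilter(CompositeFilterOperator.and(
            new FilterPredicate("tagId", FilterOperator.EQUAL, tagId),
            new FilterPredicate("courseId", FilterOperator.EQUAL, courseId))))
        .asSingleEntity();
    List<Key> topicKeys = (List<Key>) cache.getProperty("topicIds");
    return ds.get(topicKeys); // the batch "getAllById" step
  }
}
```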
As you said yourself, you should arrange your data to facilitate the scaling of your app. So, to the question "What's the best way to organize data so that we can do set operations like finding the Topics in the intersection of a Course and a Tag?":
You can maintain your own indexes of these sets by creating entities of kinds CourseRef and TopicRef which consist of a key only, with the ID portion being the actual key of the corresponding entity. These "Ref" entities live under a specific Tag, so there are no actual key duplicates. The structure for a given Tag is: Tag\CourseRef...\TopicRef...
This way, given a Tag and a Course, you construct the key Tag\CourseRef and run an ancestor query, which gets you a set of keys you can then fetch. This is extremely fast since it is effectively direct access, and it should handle large numbers of courses or topics without the issues of list properties.
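A hypothetical sketch of that layout with the low-level API; encoding the Topic key into the TopicRef key name is my assumption:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;
import com.google.appengine.api.datastore.Query;

public class TagCourseIndex {
  public static Map<Key, Entity> topicsFor(String tagName, Key courseKey) {
    DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
    // Build the Tag\CourseRef ancestor directly - no query needed to find it.
    Key tagKey = KeyFactory.createKey("Tag", tagName);
    Key courseRefKey = KeyFactory.createKey(tagKey, "CourseRef",
        KeyFactory.keyToString(courseKey));
    // Keys-only ancestor query over the TopicRefs below this CourseRef.
    Query q = new Query("TopicRef").setAncestor(courseRefKey).setKeysOnly();
    List<Key> topicKeys = new ArrayList<>();
    for (Entity ref : ds.prepare(q).asIterable()) {
      // Each TopicRef's key name holds the encoded key of the real Topic.
      topicKeys.add(KeyFactory.stringToKey(ref.getKey().getName()));
    }
    return ds.get(topicKeys);
  }
}
```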
This method will require you to use the DataStore API to some extent.
As you can see, this answers the specific question asked; the model will do no good for other types of set operations.
I'm using the new experimental task queue for Java App Engine, and I'm trying to create tasks that aggregate statistics in my datastore. I'm trying to count the number of UNIQUE values across all the entities (of a certain type) in my datastore. More concretely, say entities of type X have a field A. I want to count the NUMBER of unique values of A in my datastore.
My current approach is to create a task which queries for the first 10 entities of type X, builds a hashtable storing the unique values of A, and then passes this hashtable to the next task as the payload. The next task counts the next 10 entities, and so on, until I've gone through all the entities. During the execution of the last task, I count the number of keys in the hashtable (which has been passed from task to task all along) to find the total number of unique values of A.
This works for a small number of entities in my datastore. But I'm worried that this hashtable will get too big once I have a lot of unique values. What is the maximum allowable size for the payload of an App Engine task?
Can you suggest any alternative approaches?
Thanks.
According to the docs, the maximum task object size is 100K.
"Can you suggest any alternative approaches?".
Create an entity for each unique value, by constructing a key based on the value and using Model.get_or_insert. Then Query.count up the entities in batches of 1000 (or however many you can count before your request times out - more than 10), using the normal paging tricks.
Or use code similar to that given in the docs for get_or_insert to keep a count as you go - App Engine transactions can be run more than once, so a memcached count incremented in the transaction would be unreliable. There may be some trick around that, though, or you could keep the count in the datastore, provided you aren't doing anything too unpleasant with entity parents.
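Since the question is about Java, here is a rough Java analogue of the get_or_insert idea using the low-level datastore API; the kind name and the key-name-equals-value encoding are assumptions:

```java
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.EntityNotFoundException;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;
import com.google.appengine.api.datastore.Transaction;

public class UniqueValues {
  // Record one marker entity per unique value of A; the transactional
  // get-then-put makes the insert idempotent under task retries.
  public static void record(DatastoreService ds, String value) {
    Key key = KeyFactory.createKey("UniqueA", value); // key name = the value
    Transaction txn = ds.beginTransaction();
    try {
      try {
        ds.get(txn, key);                 // already recorded - nothing to do
      } catch (EntityNotFoundException e) {
        ds.put(txn, new Entity(key));     // first time we see this value
      }
      txn.commit();
    } finally {
      if (txn.isActive()) {
        txn.rollback();
      }
    }
  }
  // Counting distinct values of A then becomes a keys-only count over kind
  // "UniqueA", paged in batches as the answer describes.
}
```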
This may be too late, but perhaps it can be of use. First, any time there is a remote chance of wanting to walk serially through a set of entities, I suggest using an indexed, auto-updated date_created or date_modified field. From this point you can create a model with a TextProperty to store your hash table using json.dumps(). All you need to do is pass along the last date processed and the model ID of the hash-table entity. Do a query for date_created later than the last date, json.loads() the TextProperty, and accumulate the next 10 records. You could get a bit more sophisticated (e.g. handle date_created collisions by utilizing the parameters passed and a slightly different query approach). Add a 1-second countdown to the next task to avoid any issues with updating the hash-table entity too quickly. HTH, -stevep
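Since the question's code is Java, a hypothetical sketch of that task chaining with the taskqueue API; the worker URL, parameter names, and kind/property names are all assumptions:

```java
import java.util.Date;
import java.util.List;

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.FetchOptions;
import com.google.appengine.api.datastore.Query;
import com.google.appengine.api.datastore.Query.FilterOperator;
import com.google.appengine.api.datastore.Query.FilterPredicate;
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskOptions;

public class UniqueCountTask {
  // One step of the chain: process the next 10 entities created after
  // lastDate, then enqueue the follow-up task with a 1-second countdown.
  public static void step(Date lastDate, String stateId) {
    DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
    Query q = new Query("X")
        .setFilter(new FilterPredicate("dateCreated", FilterOperator.GREATER_THAN, lastDate))
        .addSort("dateCreated");
    List<Entity> batch = ds.prepare(q).asList(FetchOptions.Builder.withLimit(10));
    if (batch.isEmpty()) {
      return; // done - the stored hash table now holds every unique value of A
    }
    // ... load the state entity by stateId, merge each entity's A value into
    // the JSON-encoded set stored in its text property, and save it back ...
    Date newLast = (Date) batch.get(batch.size() - 1).getProperty("dateCreated");
    QueueFactory.getDefaultQueue().add(TaskOptions.Builder
        .withUrl("/tasks/countUnique")
        .param("lastDate", Long.toString(newLast.getTime()))
        .param("stateId", stateId)
        .countdownMillis(1000)); // 1s delay, per the advice above
  }
}
```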