Does it sound bad to have 180 unindexed properties (columns) of Integer/Long type per entity in the datastore?
For analytics reasons I need to count six kinds of requests per user per day, and I'm doing everything based on the sharding counters article and webcast:
https://cloud.google.com/appengine/articles/sharding_counters
So basically it's 6 values per day, incremented on every new request, so I'm thinking of having:
1 Kind per month
6 types of analytics * days in the month = 180 properties
How many properties is too many in Google Datastore?
Thank you
Probably not a good idea.
Keep in mind that every time you want to update a single property value the entire entity will have to be re-written (i.e. retrieved from the datastore, deserialized, updated, re-serialized and re-sent to the datastore). The bigger the entity, the slower the performance.
IMHO it's better to have multiple smaller entities than one big one in such a case. It is possible to split a single big entity into multiple smaller ones that are efficiently related to each other - see re-using an entity's ID for other entities of different kinds - sane idea?
Along the same lines, I believe it's even possible to encode the day info and the user ID into unique custom key IDs for easy access. Something like <userid>_YYMMDD or just <userid>_DD.
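For illustration, a rough sketch of that idea with the low-level Java API - the kind name, counter names, key scheme and the userId variable are my own assumptions, not anything prescribed:

DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();

// one small entity per user per day, addressed by a predictable key name
String keyName = userId + "_" + "110909";                  // <userid>_YYMMDD
Key key = KeyFactory.createKey("DailyUserStats", keyName);

Entity stats;
try {
    stats = datastore.get(key);
} catch (EntityNotFoundException notFound) {
    stats = new Entity("DailyUserStats", keyName);
    for (String counter : new String[] {"a", "b", "c", "d", "e", "f"}) {
        stats.setUnindexedProperty(counter, 0L);           // the six per-day counters
    }
}
// read-modify-write: wrap this in a transaction (or combine it with sharding)
// if a single user can issue many concurrent requests
stats.setUnindexedProperty("a", (Long) stats.getProperty("a") + 1L);
datastore.put(stats);

Each entity stays tiny, so the rewrite cost mentioned above stays small, and a day's stats can be fetched by key without any query.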
Related
We are developing an application in which entity ids for tables must be in incremental order, starting from 1 and so on, for each namespace.
We came across the allocateIdRange and allocateIds methods in the DatastoreService interface, but these ids must be assigned manually and will not be assigned by the DatastoreService itself. Assigning ids manually may lead to synchronization problems with multiple instances.
Can anyone provide suggestions to overcome this problem?
We are using objectify 3.0 for DatastoreService operations.
I agree with Tim Hoffman and tx802 when they say you should reconsider your design regarding sequential ids. However, a while ago I had to implement something very similar because the customer forced us to use sequential and uninterrupted numbers for order numbers (for unclear reasons). Regardless, we complied with the customer's wishes by using sharding counters (the link contains a full code sample) for the order numbers. Sharding counters work like this:
You create a couple of entities of the same kind in your datastore which are just counter values
The actual value is calculated by querying all entities of that kind and summing their values
When you wish to increase the value, one of the entities is randomly chosen and incremented
The current counter value may be cached in memcache for improved performance
Why does this work:
As you may know, there is a restriction/limitation of 1 transaction per second per entity group in the datastore. Therefore you shard the counter into multiple entities and avoid this limitation. The more traffic you expect, the more shards you're going to need. Luckily you can increase the number of shards at any time.
We also know that writes are slow in comparison to reads. Therefore building the sum of all shards (reads) is a fast operation, while increasing a single shard value (a write) is slow - which doesn't bother us when using sharding counters, because we have sufficient time for that.
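To make the mechanics concrete, here is a minimal Java sketch of such a counter using the low-level datastore API - it follows the idea described above rather than the article's exact sample, and the kind name, property names and shard count are my own:

import java.util.Random;
import com.google.appengine.api.datastore.*;

public class ShardedCounter {
    private static final int NUM_SHARDS = 20;   // raise this as traffic grows
    private static final Random RANDOM = new Random();
    private final DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
    private final String counterName;

    public ShardedCounter(String counterName) {
        this.counterName = counterName;
    }

    // Pick a random shard and increment it inside a transaction.
    public void increment() {
        int shard = RANDOM.nextInt(NUM_SHARDS);
        Key shardKey = KeyFactory.createKey("CounterShard", counterName + "_" + shard);
        Transaction tx = datastore.beginTransaction();
        try {
            Entity e;
            try {
                e = datastore.get(tx, shardKey);
            } catch (EntityNotFoundException notFound) {
                e = new Entity(shardKey);
                e.setProperty("name", counterName);          // indexed, so getCount() can find all shards
                e.setUnindexedProperty("count", 0L);
            }
            e.setUnindexedProperty("count", (Long) e.getProperty("count") + 1L);
            datastore.put(tx, e);
            tx.commit();
        } finally {
            if (tx.isActive()) {
                tx.rollback();
            }
        }
    }

    // Sum over all shards; in practice you would cache this value in memcache.
    public long getCount() {
        long total = 0;
        Query q = new Query("CounterShard").setFilter(
            new Query.FilterPredicate("name", Query.FilterOperator.EQUAL, counterName));
        for (Entity e : datastore.prepare(q).asIterable()) {
            total += (Long) e.getProperty("count");
        }
        return total;
    }
}

Each shard is its own root entity (its own entity group), which is what lets the writes spread out instead of hitting the 1-write-per-second limit of a single group.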
Summarized:
You can use sharding counters for sequential ids. If you can avoid the whole sequential-id dilemma, though, that would be the better solution.
We have a part of our application that needs to load a large set of data (>2000 entities) and perform computation on this set. The size of each entity is approximately 5 KB.
In our initial, naïve implementation, the bottleneck seems to be the time required to load all the entities (~40 seconds for 2000 entities), while the time required to perform the computation itself is very small (<1 second).
We have tried several strategies to speed up entity retrieval:
Splitting the retrieval request into several parallel instances and then merging the result: ~20 seconds for 2000 entities.
Storing the entities at an in-memory cache placed on a resident backend: ~5 seconds for 2000 entities.
The result needs to be computed dynamically, so precomputing it at write time and storing the result does not work in our case.
We are hoping to be able to retrieve ~2000 entities in just under one second. Is this within the capability of GAE/J? Any other strategies that we might be able to implement for this kind of retrieval?
UPDATE: Supplying additional information about our use case and parallelization result:
We have more than 200,000 entities of the same kind in the datastore and the operation is retrieval-only.
We experimented with 10 parallel worker instances, and a typical result that we obtained can be seen in this pastebin. It seems that the serialization and deserialization required when transferring the entities back to the master instance hampers the performance.
UPDATE 2: Giving an example of what we are trying to do:
Let's say that we have a StockDerivative entity that needs to be analyzed to know whether it's a good investment or not.
The analysis performed requires complex computations based on many factors both external (e.g. user's preference, market condition) and internal (i.e. from the entity's properties), and would output a single "investment score" value.
The user could request that the derivatives be sorted by their investment scores, and ask to be presented with the N highest-scored derivatives.
200,000 entities at 5 KB each is 1 GB. You could keep all of this in memory on the largest backend instance, or spread it over multiple instances. This would be the fastest solution - nothing beats memory.
Do you need the whole 5kb of each entity for computation?
Do you need all 200k entities when querying before computation? Do queries touch all entities?
Also, check out BigQuery. It might suit your needs.
Use Memcache. I cannot guarantee that it will be sufficient, but if it isn't you probably have to move to another platform.
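As a hedged sketch of what that could look like in Java (low-level API; the kind name and chunk size are assumptions): cache the entities individually in memcache, since each cached value must stay under memcache's ~1 MB limit (which 5 KB entities easily do), and fall back to a datastore batch get for the misses.

DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
MemcacheService cache = MemcacheServiceFactory.getMemcacheService();

// 1. a keys-only query is much cheaper than fetching the full entities
List<Key> keys = new ArrayList<Key>();
for (Entity e : datastore.prepare(new Query("StockDerivative").setKeysOnly())
                         .asIterable(FetchOptions.Builder.withChunkSize(1000))) {
    keys.add(e.getKey());
}

// 2. try memcache first (Key and Entity are both Serializable)
Map<Key, Object> cached = cache.getAll(keys);

// 3. batch-get whatever memcache did not have and write it back to memcache
List<Key> misses = new ArrayList<Key>();
for (Key k : keys) {
    if (!cached.containsKey(k)) {
        misses.add(k);
    }
}
Map<Key, Entity> fromDatastore = datastore.get(misses);
cache.putAll(fromDatastore);

// 4. assemble the working set for the in-memory computation
List<Entity> entities = new ArrayList<Entity>();
for (Key k : keys) {
    entities.add(cached.containsKey(k) ? (Entity) cached.get(k) : fromDatastore.get(k));
}

Whether this gets near one second depends mostly on the hit rate; a cold cache still pays the full datastore cost.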
This is very interesting, but yes, it's possible & I've seen some mind-boggling results.
I would have done the same; the map-reduce concept.
It would be great if you could provide more metrics: how many parallel instances do you use, & what are the results for each instance?
Also, does your process involve retrieval alone, or retrieval & storing?
How many elements do you have in your data store? 4000? 10000? The reason I ask is that you could cache it from the previous request.
Regards
In the end, it does not appear that we can retrieve >2000 entities from a single instance in under one second, so we are forced to use the in-memory caching placed on our backend instance, as described in the original question. If someone comes up with a better answer, or if we find a better strategy/implementation for this problem, I will change or update the accepted answer.
Our solution involves periodically reading entities in a background task and storing the result in a json blob. That way we can quickly return more than 100k rows. All filtering and sorting is done in javascript using SlickGrid's DataView model.
As someone has already commented, MapReduce is the way to go on GAE. Unfortunately the Java library for MapReduce is broken for me, so we're using a non-optimal task to do all the reading, but we're planning to get MapReduce going in the near future (and/or the Pipeline API).
Mind that, last time I checked, the Blobstore wasn't returning gzipped entities > 1 MB, so at the moment we're loading the content from a compressed entity and expanding it into memory; that way the final payload gets gzipped. I don't like that, as it introduces latency. I hope they fix the issues with GZIP soon!
I would like to use the appengine mapper to iterate over a range of dates (from-date and to-date passed as properties to the configuration). For each date in the range, I would retrieve the entities that have this date as a property and operate on this set.
For example, if I have the following set of entities:
Key Date Value
a 2011/09/09 323
b 2011/09/09 132
c 2011/09/08 354
d 2011/09/08 432
e 2011/09/08 234
f 2011/09/07 423
g 2011/09/07 543
I would like to specify a date range of 2011/09/09 - 2011/09/07 which would create three mapper instances, for 2011/09/09, 2011/09/08 and 2011/09/07. In turn these would query for entities a+b, c+d+e and f+g respectively, and perform some operations on the values. (Each of the mappers would also make other datastore queries for additional data, hence the 'bonus question' below)
Presumably I need to create a custom InputFormat class; however, I'm quite new to mapreduce/hadoop and was hoping someone had some examples?
Bonus question: is it "bad form" to use a DAO to load data in a mapper? Other distributed computing platforms I have worked with (e.g. DataSynapse) would require that you parcel all inputs up and provide them with the task, to prevent too much contention on a data server. However, with the appengine HR datastore I presume this isn't a concern?
It's not currently possible to iterate over a subset of entities of a given kind in App Engine's mapreduce implementation. If the entities make up a large proportion of the data, you can simply iterate over everything and ignore the unwanted entities; if they only make up a small proportion, you will have to roll your own update procedure using the task queue.
Based on Nick Johnson's answer, you will need to retrieve your date range from the context using custom parameters. The mapper then filters out (ignores) any entity that falls outside the range before processing it.
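A rough sketch of what that mapper could look like with the old Java appengine-mapreduce library (the AppEngineMapper base class); the parameter names and how they reach the job Configuration depend on your mapreduce.xml, so treat those details as assumptions rather than the library's documented API:

import java.util.Date;
import org.apache.hadoop.io.NullWritable;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.tools.mapreduce.AppEngineMapper;

public class DateRangeMapper extends AppEngineMapper<Key, Entity, NullWritable, NullWritable> {
    private long fromMillis;
    private long toMillis;

    @Override
    public void setup(Context context) {
        // custom job parameters, e.g. defined as properties in mapreduce.xml
        fromMillis = Long.parseLong(context.getConfiguration().get("fromDate"));
        toMillis = Long.parseLong(context.getConfiguration().get("toDate"));
    }

    @Override
    public void map(Key key, Entity value, Context context) {
        Date date = (Date) value.getProperty("Date");
        if (date == null || date.getTime() < fromMillis || date.getTime() > toMillis) {
            return;                       // outside the requested range: skip
        }
        // ... process the entity; additional datastore lookups (via a DAO or
        // otherwise) are fine here, per the bonus question ...
    }
}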
But if you insist on mapping across all entities of a given kind, there is a workaround that, depending on your requirements, may or may not be feasible. Suppose you are pretty fixed on the date ranges (sounds unlikely, but just maybe). Then for each expected range you create a corresponding child entity kind, with a parent key (or just a reference, but a parent key works better for consistency - think transactions across an entity group) pointing to the main entity.
Thus each entity in the range receives a child entity of the kind corresponding to that range. Then set up a mapper on the child entity kind corresponding to the range and retrieve its parent to work on it.
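A tiny sketch of that shape with the low-level API (the kind names are made up; datastore.get() throws the checked EntityNotFoundException, so handle or declare it in real code):

DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();

// at write time: tag the main entity with a child whose kind encodes the range
Entity marker = new Entity("InRange_20110907_20110909", mainEntity.getKey());
datastore.put(marker);

// in a mapper over "InRange_20110907_20110909": walk back to the parent and process it
Key parentKey = markerEntity.getKey().getParent();
Entity mainEntityToProcess = datastore.get(parentKey);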
I do something similar, but in the opposite direction and for a single child entity kind, when populating my data for the Relation Index Entity pattern. Hence the answer to your bonus question - go ahead and use a DAO or whatever your data layer consists of.
While the first approach is more sound, the latter may be feasible in cases where your ranges are not very dynamic and remain manageable. Given the schema-less nature of the datastore, creating new entity kinds is neither expensive nor a bad practice.
I'm using Google App Engine.
If a Long key field is generated by IdGeneratorStrategy.Identity and then the object is deleted from the datastore, is there any chance of the key being used again by a different object of the same class?
papercrane on reddit writes:
The documentation for GenerationType.IDENTITY says that it means the persistence provider (the database) will provide the unique ID. So it is entirely up to your database software if it decides to reuse IDs from deleted records. Without knowing anything else about your problem I'd say it is possible, but I can't think of any good reason for a database server to keep track of which IDs are in use and recycle old ones. That seems like a lot of overhead for very little benefit.
And Mark Ross on Google Groups writes on how GAE identities are generated:
Since the datastore in prod is comprised of multiple back-ends, we use a sharded counter approach to dole out IDs so that we don't have to worry about different back-ends handing out the same id. So, back-end A may be working from a pool of IDs ranging from 0 to 100 and back-end B may be working from a pool of IDs ranging from 101 to 200, and so on. If your inserts hit different datastore back-ends you'll get IDs that jump around a bit. You can depend on these IDs being unique, but not monotonically increasing.
I now think that it is very unlikely that Identity values are reused but it would still be good to have a clear definitive answer.
App Engine will never reuse IDs for a given kind and parent. In fact, I think you'll be hard pressed to find a database that does - keeping a simple counter is far, far simpler than trying to figure out which IDs are still in use, and with 64 bits, you're not going to run out of IDs.
I was wondering if anyone can help me with this problem.
We have an idea we'd like to implement, and we're currently unable to do this efficiently.
I've anonymised the data as best as possible, but the structure is the same.
We have two entities, Car and CarJourney. Each Car has 0 to many CarJourneys. Each CarJourney has (amongst other properties) a date associated with it - the date the journey was started.
I wish to query car journeys by time. I'll have two dates, a start date and an end date, where startDate <= endDate, and I want to receive the most recently started journey in that period.
So, if I had a particular car in mind, say car 123, I'd write a query that filters by Car.key and Journey.startDate, where Car.key == 123 and Journey.startDate >= startDate and Journey.startDate <= endDate, with an ordering on Journey.startDate descending and a limit of 1.
e.g. Car A has 3 journeys, taken on the 1st, 2nd and 3rd of the month. The query start date is the 1st and the query end date is the 2nd. The result of this query would be one car journey - the one from the 2nd.
Once the result of that query is returned, a very small amount of processing is done to return a result to the user.
That's the easy bit.
But, instead of just 1 car, I want a list of cars, where the list contains N keys to cars.
So, I want to run the above query N times, once for every car. And I want the latest journey for each car.
Because the time range is flexible (and thus can't be known beforehand) we can't implement a "isMostRecent" flag, because while it might be the most recent for now, it might not be the most recent for the specified date parameters.
We also need to ensure that this returns promptly (current queries are around the 3-5 second mark for a small set of data) as this goes straight back to the user. This means that we can't use task queues, and because the specified dates are arbitrary we can't implement mass indexing of "isWithinDate" fields.
We tried using an async query, but because the amount of processing is negligible the bottleneck is still the queries on the datastore (because the async api still sends the requests synchronously, it just doesn't block).
Ideally, we'd implement this as a select on car journeys ordered by startDate where the Car.key is distinct, but we can't seem to pull this off in GAE.
There are numerous small optimisations we can make (for example, some MemCaching of repeated queries) but none have made a significant dent in our query time. And MemCaching can only help for a maximum of 1-2 minutes (due to the inevitable forward march of time!)
Any ideas are most welcome and highly appreciated.
Thanks,
Ed
It sounds like the best option is to execute the many queries yourself. You say you tried asynchronous queries, but the bottleneck was sending the query. This seems extremely odd - you should be able to have many queries in flight at the same time, substantially cutting down your latency. How did you determine this?
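For example, with the low-level API you can kick off all the per-car queries before consuming any of them; as far as I remember, asIterable() returns immediately and prefetches results in the background, so the round trips overlap instead of running back to back. The kind and property names below, plus the carKeys/startDate/endDate variables, are assumptions, and the equality-plus-inequality filter with a descending sort will need a composite index:

// java.util and com.google.appengine.api.datastore imports omitted
AsyncDatastoreService datastore = DatastoreServiceFactory.getAsyncDatastoreService();

// fire one query per car up front...
Map<Key, Iterable<Entity>> pending = new LinkedHashMap<Key, Iterable<Entity>>();
for (Key carKey : carKeys) {
    Query q = new Query("CarJourney")
        .setFilter(Query.CompositeFilterOperator.and(
            new Query.FilterPredicate("car", Query.FilterOperator.EQUAL, carKey),
            new Query.FilterPredicate("startDate", Query.FilterOperator.GREATER_THAN_OR_EQUAL, startDate),
            new Query.FilterPredicate("startDate", Query.FilterOperator.LESS_THAN_OR_EQUAL, endDate)))
        .addSort("startDate", Query.SortDirection.DESCENDING);
    pending.put(carKey, datastore.prepare(q).asIterable(FetchOptions.Builder.withLimit(1)));
}

// ...then consume the results
Map<Key, Entity> latestJourneyByCar = new LinkedHashMap<Key, Entity>();
for (Map.Entry<Key, Iterable<Entity>> entry : pending.entrySet()) {
    Iterator<Entity> it = entry.getValue().iterator();
    if (it.hasNext()) {
        latestJourneyByCar.put(entry.getKey(), it.next());
    }
}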
First of all, I'd recommend using Objectify. JDO/JPA on appengine just fools people into thinking that the appengine datastore is just a SQL database which, as you have realized, is far from the truth.
If I understand correctly you have a Car which contains a List of CarJourneys?
List properties on appengine are limited to 5000 entries, and any time you access/change them they have to be serialized/deserialized in whole. So if you plan to have a lot of CarJourneys per Car, this will get slow. Also, because appengine creates an index entry for every value in the collection, this can lead to exploding indexes.
Instead, just create a Car property inside CarJourney that points to the Car that made the journey: a many-to-one relationship from CarJourney to Car. The type can be Key or just a string/long containing the id of the Car. When querying, just add a filter on the Car property.
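A hedged sketch of that shape with Objectify 3 (entity and field names are assumptions; the entity class has to be registered with ObjectifyService first, and carKey/startDate/endDate are assumed to be in scope):

import java.util.Date;
import javax.persistence.Id;
import com.googlecode.objectify.Key;
import com.googlecode.objectify.Objectify;
import com.googlecode.objectify.ObjectifyService;

public class CarJourney {
    @Id Long id;
    Key<Car> car;          // many journeys point at one Car
    Date startDate;
}

// the latest journey for one car inside a date range
Objectify ofy = ObjectifyService.begin();
CarJourney latest = ofy.query(CarJourney.class)
    .filter("car", carKey)
    .filter("startDate >=", startDate)
    .filter("startDate <=", endDate)
    .order("-startDate")
    .limit(1)
    .get();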
I suggest watching Brett Slatkin's video: Scalable, Complex Apps on App Engine.
You can also use one query and filter out distinct cars yourself: something like select CarJourney where startDate >= startDate and startDate <= endDate order by startDate, then iterate (+filter on your side) through this query until you find enough data to show.
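In low-level-API terms that could look roughly like this (kind/property names and the wantedCars set of car keys are assumptions), keeping only the first - i.e. latest - journey seen for each car:

DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
Query q = new Query("CarJourney")
    .setFilter(Query.CompositeFilterOperator.and(
        new Query.FilterPredicate("startDate", Query.FilterOperator.GREATER_THAN_OR_EQUAL, startDate),
        new Query.FilterPredicate("startDate", Query.FilterOperator.LESS_THAN_OR_EQUAL, endDate)))
    .addSort("startDate", Query.SortDirection.DESCENDING);

Map<Key, Entity> latestByCar = new HashMap<Key, Entity>();
for (Entity journey : datastore.prepare(q).asIterable()) {
    Key carKey = (Key) journey.getProperty("car");
    if (wantedCars.contains(carKey) && !latestByCar.containsKey(carKey)) {
        latestByCar.put(carKey, journey);
    }
    if (latestByCar.size() == wantedCars.size()) {
        break;                       // found the latest journey for every car
    }
}

The obvious drawback is that you may walk past a lot of journeys from cars you don't care about before the loop can stop.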
Denormalization should solve your problem - have a last_journey reference property on your Car, so every time you start a journey you also update the Car entity. This way you'd be able to query all cars and have their latest journey in the result set.
It's worth noting that when you access last_journey, a new get() will be issued to the datastore, so if you're listing a lot of cars, you could build a list with all the last_journey keys and fetch them all at once by passing that list to db.get().
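A rough Java low-level-API equivalent of that batch fetch (the last_journey property name is taken from the answer above, everything else is assumed):

DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
List<Key> journeyKeys = new ArrayList<Key>();
for (Entity car : cars) {
    journeyKeys.add((Key) car.getProperty("last_journey"));
}
// one batch round trip instead of one get() per car
Map<Key, Entity> latestJourneys = datastore.get(journeyKeys);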
Scalable, Complex Apps on App Engine is definitely a must-watch (sadly the sound is terrible on this video).
I faced the same kind of problem some time ago.
I tried some solutions (in-memory sorting and filtering, encoding things into keys, etc.) and benchmarked them for both latency and CPU cycles using test data of around 100K entities.
Another approach I took was encoding the date as an integer (days since the start of the epoch or days since the start of the year, and the same for hour of day or month, depending on how much detail you need in your output) and saving this into a property. This way you turn your date query filter into an equality-only filter, which doesn't even need a custom index; you can then sort or filter on other properties.
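A small sketch of that encoding (kind, property and variable names are assumptions):

// days since the Unix epoch, stored as a plain long property
long dayNumber = journeyDate.getTime() / (24L * 60 * 60 * 1000);

Entity journey = new Entity("CarJourney");
journey.setProperty("dayNumber", dayNumber);
// ... other properties ...

// a single-property equality filter is served by the built-in indexes;
// combining it with sorts/filters on other properties only needs a
// straightforward composite index, unlike an inequality date filter
Query q = new Query("CarJourney").setFilter(
    new Query.FilterPredicate("dayNumber", Query.FilterOperator.EQUAL, dayNumber));

Querying a multi-day range then becomes a handful of equality queries (one per day), which usually stays cheap as long as the range is small.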
Benchmarking the latter solution, I found that when the filtered result set is a small fraction of the original unfiltered set, it is an order of magnitude (or more) faster and more CPU-efficient. In the worst case, when filtering doesn't reduce the result set at all, the latency and CPU usage were comparable to the previous solutions.
Hope this helps, or did I miss something?
Happy coding :-)
You can also make these queries in parallel by calling them right from the client, using Ajax. I mean that you can return an almost empty HTML page to the user, just with the car definitions, and then make Ajax calls for the journeys of every car on that page.
As JB Nizet suggested, I am wondering whether the answer might be a single query, possibly with a temporary table or an anonymous intermediate table (I don't know what Google supports to this end), using a GROUP BY - thus eliminating the extra transfer of data and the need for Java to do the processing. I am thinking of something along the lines of:
CREATE TEMPORARY TABLE temp1 AS
SELECT * FROM car_journey
WHERE start_date >= ? AND
      start_date <= ?

SELECT t1.car_id, t1.journey_id
FROM temp1 t1, (
    SELECT car_id, MAX(start_date) AS start_date
    FROM temp1
    GROUP BY car_id
) t2
WHERE t1.car_id = t2.car_id AND
      t1.start_date = t2.start_date
With the temporary table you can greatly reduce the time for the secondary query, since theoretically the data will be much smaller than the full table.
Finally, again not knowing what Google supports, I would ask whether you have indices defined on the appropriate columns, which may help speed up the query.