DynamoDb has provisioned data throughput limits and based on the answer in this post it appears that if you exceed your throughput limits DynamoDb can start throwing exceptions, rather than just slowing your query down. If you run a query (e.g. as described here), how do you ensure the query doesn't exceed the throughputs?
A query will be automatically limited to 1 MB in which case you have to do some tricky re-querying using the LastEvaluatedKey as described in their docs (with no code example). However dependent on the query speed or similar even this 1 MB could presumably blow your provisioned limits?
You can also set the maximum number of items returned in the QuerySpec object so you then process the query in individual pages - which I'm guessing do separate queries under the hood. However if you don't know the size of your items in-advance, you can't limit it by number of KB as needed to stay within the throughput. Chicken and egg situation.
Am I missing something here? Is there any easier way to do this?
Your problem can be solved by following solutions or by combining the solutions
Solution 1: As you have stated in your question if you know your Item size you can Limit the number of records, pass the lastEvaluatedKey to get next set of records.
Again even this might be enough to throttle your limits, you need to have a sleep mechanism for a while so that you wont hit the table continuously.
Solution 2: By far the best method I have implement is that don't hit the DynamoDB table rather store your information in some form of cache(Memcahce, Elasticache, etc)
This will ensure that you wont require very high throughput rate.
Note: This will depend on your type of application as well if your data is updated very frequently than this might not be very helpful.
Related
I was testing running some simple queries (SELECT * from [myNodeType]) on a large node set of 100.000 up to > 1 million nodes. It seems like query performance is going down rather quickly on result sets larger than 100k hits.
Has someone experienced similar issues with jackrabbit? Am i hitting a design limitation? Is there any way to deal with such large result sets? (also memory usage seems quite excessive)
The total number of nodes is not the problem, we are running repositories with a similar size.
The architecture of your permission model and the queries itself have a bigger impact.
Jackrabbit will evaluate the permissions of all of your query hits based on the utilized session. The exception are administrative sessions which ar enot permision checked at all.
Due to the resource based access control concept, complexity of your permission concept e.g. by excessive usage of inheritance and the total number of hits will influence the response time of your query.
Generally speaking making the query more specific by introducing addional parameters that limit the size of your resultset and using a more simple permission architecture or bypassing access control completly (which is not recommended for security reasons in many use cases) increases the response time of your queries.
In my Google App Engine app I need to fetch and return a potentially large number of entities from a datastore query in response to a service call GET request. This call may return potentially thousands of entities and MBs of serialized data.
The first portion of the response packet communicates how many entities are in the serialized results, followed by all of the serialized entities. Currently I am iterating through all the entities in the query with a QueryResultIterator up to a maximum page size limit, after which I return a cursor from which can be used to continue fetching where the previous call left off (if the maximum was reached and there are still results in the query). As I iterate through the results, I save them in a list. Once I've either exhausted the query results or reached the maximum page size, I can then get the number of entities from the size of this list. But then I have to iterate through this list again to serialize each of the entities and write the results to the response output stream.
I don't know that this is the most efficient method to perform this operation. Is there a way I can get the number of entities in a query's results before actually iterating through them all or fetching them directly into a list? (The list method doesn't work anyway because I'm using cursors, which requires the use of QueryResultIterator).
QueryResultIterator has a method getIndexList(). Would this be a less costly way to get the number of entities in the query's results? I'm assuming this list would contain exactly one index object for each entity in the query's results. Also, I'd need this list to only contain the indexes for the entities after the current cursor position for the interator. Is my understanding correct or would this method not do what I think it would?
A list of just indexes would require much less memory than loading a list of whole entities. Although, I don't know if this list would be limited at all by the query's prefetch or chunk sizes, or if I'd want to use the query's limit parameter at all because I would only be interested in knowing how many entities were in the results up to the maximum page size plus one (to know there are still more results and provide a cursor to continue).
Currently I'm setting the prefetch and chunk size (to the size of my page limit), but I'm not using the limit or offset parameters since I'm using cursors instead. From what I understand cursors are preferable to offset/limit. Would setting the limit parameter affect continuing a query with a cursor?
Clearly I have quite a few questions as to how GAE datastore queries work and how they're affected by changing parameters. So any insights are appreciated. The documentation for App Engine APIs is often sparse, as in one sentence descriptions of methods stating pretty much what can be deduced from the method signature. They don't generally go into much detail otherwise. Maybe the way I'm doing it currently is just fine after all. It works as is, but I'm trying to optimize the service call to get the best response time possible for my client application.
UPDATE: By the way, I am using Objectify v3 in my app and to perform this query. There are several places I am required to use the low-level datastore API, including to do geo-location queries (with geomodel) and projection queries (which aren't support in Objectify v3). So if there is a good way to do this using Objectify, that would be ideal. Otherwise I can use the low-level API, but it's always messier this way.
Both the low-level api and Objectify have a count() method (look at the javadocs for details). However, counting can be a very expensive and lengthy operation - it costs 1 small op for every number returned. For example, count() returning 5000 costs 5000 small ops (plus 1 read for the query), and takes as long as it would take to do a keys-only scan of all 5000 (which is what GAE actually does).
If you absolutely must have an exact count, you probably need to aggregate this value yourself by incrementing/decrementing a (possibly sharded) counter. This gets very tricky when you are dealing with filtered queries.
There is no one right solution here. Google searches give you totals like "About 119,000,000 results" which are deliberately inexact and almost certainly precalculated. For smaller result sets, using count() can be acceptable - but you might want to apply a limit() so that you never break the bank. You can always say "More than 500 results..."
if you want to fetch no of record than you can use following code
com.google.appengine.api.datastore.Query qry = new com.google.appengine.api.datastore.Query("EntityName");
com.google.appengine.api.datastore.DatastoreService datastoreService = DatastoreServiceFactory.getDatastoreService();
int totalCount = datastoreService.prepare(qry).countEntities(FetchOptions.Builder.withDefaults());
anf if you want filter than you can used
qry.addFilter("firstName", FilterOperator.EQUAL, firstName);
i hope it will help you
We have some part of our application that need to load a large set of data (>2000 entities) and perform computation on this set. The size of each entity is approximately 5 KB.
On our initial, naïve, implementation, the bottleneck seems to be the time required to load all the entities (~40 seconds for 2000 entities), while the time required to perform the computation itself is very small (<1 second).
We had tried several strategies to speed up the entities retrieval:
Splitting the retrieval request into several parallel instances and then merging the result: ~20 seconds for 2000 entities.
Storing the entities at an in-memory cache placed on a resident backend: ~5 seconds for 2000 entities.
The computation needs to be dynamically computed, so doing a precomputation at write time and storing the result does not work in our case.
We are hoping to be able to retrieve ~2000 entities in just under one second. Is this within the capability of GAE/J? Any other strategies that we might be able to implement for this kind of retrieval?
UPDATE: Supplying additional information about our use case and parallelization result:
We have more than 200.000 entities of the same kind in the datastore and the operation is retrieval-only.
We experimented with 10 parallel worker instances, and a typical result that we obtained could be seen in this pastebin. It seems that the serialization and deserialization required when transferring the entities back to the master instance hampers the performance.
UPDATE 2: Giving an example of what we are trying to do:
Let's say that we have a StockDerivative entity that need to be analyzed to know whether it's a good investment or not.
The analysis performed requires complex computations based on many factors both external (e.g. user's preference, market condition) and internal (i.e. from the entity's properties), and would output a single "investment score" value.
The user could request the derivatives to be sorted based on its investment score and ask to be presented with N-number of highest-scored derivatives.
200.000 by 5kb is 1GB. You could keep all this in memory on the largest backend instance or have multiple instances. This would be the fastest solution - nothing beats memory.
Do you need the whole 5kb of each entity for computation?
Do you need all 200k entities when querying before computation? Do queries touch all entities?
Also, check out BigQuery. It might suit your needs.
Use Memcache. I cannot guarantee that it will be sufficient, but if it isn't you probably have to move to another platform.
This is very interesting, but yes, its possible & Iv seen some mind boggling results.
I would have done the same; map-reduce concept
It would be great if you would provide us more metrics on how many parallel instances do you use & what are the results of each instance?
Also, our process includes retrieval alone or retrieval & storing ?
How many elements do you have in your data store? 4000? 10000? Reason is because you could cache it up from the previous request.
regards
In the end, it does not appear that we could retrieve >2000 entities from a single instance in under one second, so we are forced to use in-memory caching placed on our backend instance, as described in the original question. If someone comes up with a better answer, or if we found a better strategy/implementation for this problem, I would change or update the accepted answer.
Our solution involves periodically reading entities in a background task and storing the result in a json blob. That way we can quickly return more than 100k rows. All filtering and sorting is done in javascript using SlickGrid's DataView model.
As someone has already commented, MapReduce is the way to go on GAE. Unfortunately the Java library for MapReduce is broken for me so we're using non optimal task to do all the reading but we're planning to get MapReduce going in the near future (and/or the Pipeline API).
Mind that, last time I checked, the Blobstore wasn't returning gzipped entities > 1MB so at the moment we're loading the content from a compressed entity and expanding it into memory, that way the final payload gets gzipped. I don't like that, it introduces latency, I hope they fix issues with GZIP soon!
After investigating a bit at work I noticed that the application I'm working on is using the default fetch size (which is 10 for Oracle from what I know). The problem is that in the majority of cases the users fetch large amount of data (ranging from few thousand to even hundreds of thousands) and that the default 10 is really a huge bottleneck.
So the obvious conclusion here would be to make the fetch size larger. At first I was thinking about setting the default to 100 and bumping it to a 1000 for several queries. But then I read on the net that the default is so small to prevent memory issues (i.e. when the JVM heap cannot handle so much data), should I be worried about it?
I haven't seen anywhere further explanation to this. Does it mean that a bigger fetch sizes means more overhead while fetching the result set? Or do they just mean that with the default I can fetch 10 records and then GC them and fetch another 10 and so on (whereas lets say fetching a 10000 all at once would result in an OutOfMemory exception)? In such case I wouldn't really care as I need all the records in the memory anyway. In the former case (where bigger result set means bigger memory overhead) I guess I should load test it first.
By setting the fetch size too, big you are risking OutOfMemoryError.
The fact that you need all these records anyway is probably not justifiable. More chances you need the entities reflected by the returned ResultSets... Setting the fetch size to 10000 means you're heaping 10000 records represented by JDBC classes. Of course, you don't pass these around through your application. You first transform them into your favorite business-logic-entities and then hand them to your business-logic-executor. This way, The records form the first fetch bulk are available for GC as soon as JDBC fetches the next fetch bulk.
Typically, this transformation is done a little bunch at a time exactly because of the memory threat aforementioned.
One thing you're absolutely right, though: you should test for performance with well-defined requirements before tweaking.
So the obvious conclusion here would be to make the fetch size larger.
Perhaps an equally obvious conclusion should be: "Let's see if we can cut down on the number of objects that users bring back." When Google returns results, it does so in batches of 25 or 50 sorted by greatest likelihood to be considered useful by you. If your users are bringing back thousands of objects, perhaps you need to think about how to cut down on that. Can the database do more of the work? Are there other operations that could be written to eliminate some of those objects? Could the objects themselves be smarter?
When calling DatastoreService.delete(keys) with 400 keys, I'm get this exception:
java.lang.IllegalArgumentException: datastore transaction or write too big.
I thought the limit on batch deletes was 500 so I am well under the limit. Am I missing something here?
Thanks,
Keyur
it looks like you're hitting the overall size limit for puts and deletes. you're right that batch puts and deletes have a limit of 500 entities, but there's also an overall size limit of roughly 10MB. i'm not sure if that's documented, but i'll check and have them add it if not.
so, try reducing the number of entities per delete call.
if you want to dig deeper, the size of a put or delete depends on many factors beyond the size of the individual entities, e.g. the size of the index rows that need to be updated. it's also generally based on the delta size of the update itself, not the size of the existing entities. this means it's not always intuitive or easy to calculate. these articles can help though:
http://code.google.com/appengine/articles/storage_breakdown.html
http://code.google.com/appengine/articles/life_of_write.html