I'm trying to implement an application using the Google Guice framework with a DynamoDB database.
I have implemented an API for finding documents by a range query, i.e. a time period. When I query by month it returns a limited number of documents (3695), and when I search again by start time and end time it returns the same number of documents, which does not include a newly created document.
Please suggest a way to implement the API that gets around this limitation of the application or of DynamoDB.
A DynamoDB response is limited to 1 MB per page, so when your result set is bigger you only get the first results, up to a response size of 1 MB.
The docs at
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/QueryAndScan.html#Pagination
describe how to use the metadata of the response to see the real number of results and the key to resume from, and how to query the whole result in batches / pages.
Important excerpt of the docs:
If you query or scan for specific attributes that match values that amount to more than 1 MB of data, you'll need to perform another Query or Scan request for the next 1 MB of data. To do this, take the LastEvaluatedKey value from the previous request, and use that value as the ExclusiveStartKey in the next request. This will let you progressively query or scan for new data in 1 MB increments.

When the entire result set from a Query or Scan has been processed, the LastEvaluatedKey is null. This indicates that the result set is complete (i.e. the operation processed the “last page” of data).
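A minimal sketch of that loop in Java, assuming the AWS SDK for Java v1; the QueryRequest (your table name and time-range key condition) is built by the caller, so the names here are only placeholders:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.QueryRequest;
import com.amazonaws.services.dynamodbv2.model.QueryResult;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class DynamoDbPagingExample {

    /** Collects every item of the given query by following LastEvaluatedKey until it is null. */
    public static List<Map<String, AttributeValue>> queryAllPages(QueryRequest request) {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();
        List<Map<String, AttributeValue>> items = new ArrayList<>();
        Map<String, AttributeValue> lastEvaluatedKey = null;
        do {
            // Resume where the previous 1 MB page stopped (null on the first request).
            request.setExclusiveStartKey(lastEvaluatedKey);
            QueryResult result = client.query(request);
            items.addAll(result.getItems());
            lastEvaluatedKey = result.getLastEvaluatedKey();
        } while (lastEvaluatedKey != null); // null means the "last page" has been processed
        return items;
    }
}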
I am searching my Elastic index from my Java backend using Elastic's high-level REST client for Java. I notice that it takes 700 to 800 milliseconds to receive the response from Elastic.
I checked the actual query time in Elastic and it is only 7 milliseconds.
I have built filters and aggregations into my query and am also returning many fields.
However, if I remove all filters and aggregations, limit the result set to a single document, and return only a single field, the time it takes my Java code to receive the response from Elastic is still > 700 ms. Why might this be? My server code is running in California. My Elastic index is served in North Virginia. Perhaps this explains the latency? What else could be the cause?
This is a multisearch containing two search queries.
Our data set has a lot of duplicate partition keys. We are using the TOKEN method to paginate through the data. If the rows with the duplicate keys are split across a page we don't get the remainder of the duplicates on the next call.
For example, assume we have the following keys: 1 2 3 5 5 5 6 7 8 and we have a limit of 5 rows per query. The first query "select * from table where TOKEN(id) > TOKEN('') limit 5;" returns 1 2 3 5 5 as expected. The second query "select * from table where TOKEN(id) > TOKEN('5') limit 5;" returns 6 7 8. This is not the desired behavior; we want the second query to return 5 6 7 8. Thinking about it, it is obvious why this happens: "TOKEN(id) > TOKEN('5')" fails if id == 5.
Are we doing something wrong, or is this just the way it works? We are using the latest Java driver, but I don't think this is a driver problem since the Golang driver also exhibits this behavior.
We've (mostly) worked around the problem by either dropping any duplicated records at the end of the row set (the 5 5 in the example) or dropping the last record (to cover the case where the last record is duplicated in the second record set). This fails if the record set is all duplicates. Obviously larger limits reduce this edge case, but it doesn't seem safe to use in a production environment.
* EDITED *
The TOKEN method is recommended in a lot of pages both here on Stackoverflow and elsewhere on the web. But obviously it doesn't work :-(
#alex:
Thanks for your reply. The example was just that, a simplified example of the issue. In reality we have 30 million rows and are using a limit of 1000. When the table was first designed years ago, the designer didn't understand how the partition key works, so they used the user ID as the partition key, giving us 30 million partitions. We believe this is at least contributing to our excessive repair times (currently 12 hours for the cluster). We need to copy the entire table into a new one with a different partition key (in a live production environment) to resolve the partition key issue. This page https://docs.datastax.com/en/developer/java-driver/2.1/manual/paging/ seems to be a better solution.
#Nadav:
Thanks for your reply. Simply removing the limit will cause the request to time out in multiple layers of our software. The DataStax page above seems to be the best solution for us.
You are mixing up several things - in Cassandra, data is organized into partitions, and you can get data by partition key or perform a range scan using the token function. The results of a query can be delivered to the application in pages - you specify the fetch size (although 5 is quite small), fetch one page, process it, fetch the next, process it, and so on until the result set is exhausted.
In your case, the page size doesn't match the result set size - you have 6 results there, and the next result set (for token(id) > token(5)) has only 3 rows. I don't know a solution that works out of the box (except select * from table, but it may time out if you have a lot of data). In your case I would go with bigger ranges (for example, a whole token range), page through the results inside it (without using limit), and then handle the case where you need to switch to the next token range while some rows are left over from the previous one.
I have an example of Java code that performs an effective scan of all the token ranges, similar to what the Spark connector does. The main trick there is to route each request to a node that holds the data, so it reads the data directly from that node without needing to reach other nodes (if you're reading with LOCAL_ONE, of course).
You shouldn't, and can't, use token ranges and LIMIT to page through results, and you found out yourself that it doesn't work - because LIMIT cuts off some of the result, and you have no way to continue.
Instead, Cassandra gives you a separate paging feature: You make a request, get the first 1000 (or whatever) rows and also a "cookie" with which you can resume the query to get the next page of results. Please refer to your favorite driver's documentation on the syntax of using Cassandra paging in your favorite language. It's not "LIMIT" - it's a separate feature.
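As a rough sketch with the DataStax Java driver 3.x (API names differ between driver versions; the keyspace, table, and fetch size below are placeholders), the "cookie" is the paging state:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PagingState;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class CassandraPagingExample {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("my_keyspace")) {          // hypothetical keyspace

            Statement stmt = new SimpleStatement("SELECT * FROM my_table"); // hypothetical table
            stmt.setFetchSize(1000);                                        // page size, not LIMIT

            ResultSet rs = session.execute(stmt);
            // The "cookie": save it so a later request can resume at the next page.
            PagingState pagingState = rs.getExecutionInfo().getPagingState();

            int remaining = rs.getAvailableWithoutFetching();
            for (Row row : rs) {
                process(row);
                if (--remaining == 0) {
                    break; // stop at the page boundary instead of fetching the next page
                }
            }

            // Later (even in a separate request), resume from the saved state:
            Statement resume = new SimpleStatement("SELECT * FROM my_table");
            resume.setFetchSize(1000);
            resume.setPagingState(pagingState);
            ResultSet nextPage = session.execute(resume);
        }
    }

    private static void process(Row row) { /* ... */ }
}

In driver 3.x the paging state can also be serialized (e.g. PagingState.toString() / fromString()) so a client can hand it back on its next call.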
Splitting up a large query into multiple token ranges still has its uses. For example, it allows you to query the different ranges in parallel, since different token ranges will often come from different nodes. But still, you need to query each range to completion, using paging, and cannot use "LIMIT" because you can't know how many results to expect from each range and need to read them all.
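A sketch of that per-range approach with the same driver (again assuming 3.x; keyspace and table names are hypothetical), where each range is paged to completion:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.TokenRange;

public class TokenRangeScanExample {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("my_keyspace")) {          // hypothetical keyspace

            PreparedStatement ps = session.prepare(
                    "SELECT * FROM my_table WHERE token(id) > ? AND token(id) <= ?"); // hypothetical table

            // Query each token range to completion; ranges could also be processed in parallel.
            for (TokenRange range : cluster.getMetadata().getTokenRanges()) {
                for (TokenRange subRange : range.unwrap()) {  // handle ranges that wrap around the ring
                    ResultSet rs = session.execute(ps.bind()
                            .setToken(0, subRange.getStart())
                            .setToken(1, subRange.getEnd()));
                    for (Row row : rs) {                      // the driver pages through the range for us
                        // process(row);
                    }
                }
            }
        }
    }
}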
I will fetch approximately 500,000 to 1,000,000 rows from BigQuery. We will limit the results with an offset and a maximum; in this case pageSize = max and startIndex = offset.
Our data will only be processed once a day and then uploaded to BigQuery.
The documentation recommended using pageToken instead of startIndex.
I have done some timing with pageToken and with startIndex and could not see any difference.
I found one answer here at StackOverflow:
"You should use the page token returned from the original query response or the previous jobs.getQueryResults() call to iterate through pages. This is generally more efficient and reliable than using index-based pagination"
But I'm not convinced why I should use pageToken, since then I need to store the token to use it when going back and forth. Time-wise, I could not see any difference.
But I'm not convinced why I should use "pageToken"
There are a few, but important, differences between the two:
Index-based pagination is good when you know how many records your query returns; it doesn't take the size of a record into account (which is important for a client-side application).
A page token identifies a specific page in the result set and requires no prior information, such as the size of the results, to access it.
So if you know how many results you have and you don't care about the page size, you can use index-based pagination; otherwise use a page token.
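For example, with the google-cloud-bigquery Java client (the query and table names are placeholders), the library follows the page tokens for you when you walk the pages:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FieldValueList;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

public class BigQueryPagingExample {
    public static void main(String[] args) throws Exception {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Hypothetical query; substitute your own dataset and table.
        QueryJobConfiguration config =
                QueryJobConfiguration.newBuilder("SELECT * FROM `my_dataset.my_table`").build();

        TableResult page = bigquery.query(config);
        while (page != null) {
            for (FieldValueList row : page.getValues()) {
                // process(row);
            }
            // page.getNextPageToken() is the token you would persist if pages are
            // requested in separate service calls instead of in one loop.
            page = page.hasNextPage() ? page.getNextPage() : null;
        }
    }
}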
In my Google App Engine app I need to fetch and return a potentially large number of entities from a datastore query in response to a service call GET request. This call may return potentially thousands of entities and MBs of serialized data.
The first portion of the response packet communicates how many entities are in the serialized results, followed by all of the serialized entities. Currently I am iterating through all the entities in the query with a QueryResultIterator up to a maximum page size limit, after which I return a cursor from which can be used to continue fetching where the previous call left off (if the maximum was reached and there are still results in the query). As I iterate through the results, I save them in a list. Once I've either exhausted the query results or reached the maximum page size, I can then get the number of entities from the size of this list. But then I have to iterate through this list again to serialize each of the entities and write the results to the response output stream.
I don't know that this is the most efficient method to perform this operation. Is there a way I can get the number of entities in a query's results before actually iterating through them all or fetching them directly into a list? (The list method doesn't work anyway because I'm using cursors, which requires the use of QueryResultIterator).
QueryResultIterator has a method getIndexList(). Would this be a less costly way to get the number of entities in the query's results? I'm assuming this list would contain exactly one index object for each entity in the query's results. Also, I'd need this list to only contain the indexes for the entities after the current cursor position of the iterator. Is my understanding correct, or would this method not do what I think it would?
A list of just indexes would require much less memory than loading a list of whole entities. Although, I don't know if this list would be limited at all by the query's prefetch or chunk sizes, or if I'd want to use the query's limit parameter at all because I would only be interested in knowing how many entities were in the results up to the maximum page size plus one (to know there are still more results and provide a cursor to continue).
Currently I'm setting the prefetch and chunk size (to the size of my page limit), but I'm not using the limit or offset parameters since I'm using cursors instead. From what I understand cursors are preferable to offset/limit. Would setting the limit parameter affect continuing a query with a cursor?
Clearly I have quite a few questions as to how GAE datastore queries work and how they're affected by changing parameters. So any insights are appreciated. The documentation for App Engine APIs is often sparse, as in one sentence descriptions of methods stating pretty much what can be deduced from the method signature. They don't generally go into much detail otherwise. Maybe the way I'm doing it currently is just fine after all. It works as is, but I'm trying to optimize the service call to get the best response time possible for my client application.
UPDATE: By the way, I am using Objectify v3 in my app and to perform this query. There are several places where I am required to use the low-level datastore API, including geo-location queries (with geomodel) and projection queries (which aren't supported in Objectify v3). So if there is a good way to do this using Objectify, that would be ideal. Otherwise I can use the low-level API, but it's always messier that way.
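For reference, a trimmed sketch of what the current approach looks like with the low-level API (names and the page size are illustrative, not my actual code):

import com.google.appengine.api.datastore.Cursor;
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.FetchOptions;
import com.google.appengine.api.datastore.PreparedQuery;
import com.google.appengine.api.datastore.Query;
import com.google.appengine.api.datastore.QueryResultIterator;

import java.util.List;

public class PagedFetch {

    private static final int PAGE_SIZE = 1000; // illustrative maximum page size

    /** Fills 'out' with one page of entities and returns a web-safe cursor to resume from, or null. */
    public static String fetchPage(String startCursorWebSafe, List<Entity> out) {
        DatastoreService ds = DatastoreServiceFactory.getDatastoreService();

        FetchOptions options = FetchOptions.Builder.withChunkSize(PAGE_SIZE).prefetchSize(PAGE_SIZE);
        if (startCursorWebSafe != null) {
            options = options.startCursor(Cursor.fromWebSafeString(startCursorWebSafe));
        }

        PreparedQuery pq = ds.prepare(new Query("MyEntity")); // hypothetical kind
        QueryResultIterator<Entity> it = pq.asQueryResultIterator(options);

        while (it.hasNext() && out.size() < PAGE_SIZE) {
            out.add(it.next());
        }
        // out.size() is the entity count to write first, followed by the serialized entities.
        return it.hasNext() ? it.getCursor().toWebSafeString() : null;
    }
}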
Both the low-level API and Objectify have a count() method (look at the javadocs for details). However, counting can be a very expensive and lengthy operation - it costs 1 small op per entity counted. For example, a count() returning 5000 costs 5000 small ops (plus 1 read for the query), and takes as long as a keys-only scan of all 5000 (which is what GAE actually does under the hood).
If you absolutely must have an exact count, you probably need to aggregate this value yourself by incrementing/decrementing a (possibly sharded) counter. This gets very tricky when you are dealing with filtered queries.
There is no one right solution here. Google searches give you totals like "About 119,000,000 results" which are deliberately inexact and almost certainly precalculated. For smaller result sets, using count() can be acceptable - but you might want to apply a limit() so that you never break the bank. You can always say "More than 500 results..."
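With the low-level API, a capped count might look like this (the kind name and the 501 cap are placeholders):

com.google.appengine.api.datastore.DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
// Counting stops at 501, so at most 501 small ops; a result of 501 means "More than 500 results..."
int count = ds.prepare(new com.google.appengine.api.datastore.Query("EntityName"))
        .countEntities(FetchOptions.Builder.withLimit(501));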
If you want to fetch the number of records, you can use the following code:
com.google.appengine.api.datastore.Query qry = new com.google.appengine.api.datastore.Query("EntityName");
com.google.appengine.api.datastore.DatastoreService datastoreService = DatastoreServiceFactory.getDatastoreService();
int totalCount = datastoreService.prepare(qry).countEntities(FetchOptions.Builder.withDefaults());
And if you want a filter, you can use:
qry.addFilter("firstName", FilterOperator.EQUAL, firstName);
I hope it helps you.
I'm developing a Google App Engine Java app where users can search business objects from database based on search criteria.
The search results (a list of records) should not include any of the records (certain number of records, say 100) from their past searches. I'm storing the past results in the User Profile for this reason.
Any suggestions on efficiently implementing this logic (without iterating over the collection multiple times)? I'm using JDO, and there are restrictions on using a 'NOT IN' condition in queries.
Here's a solution, assuming your goal is to get 200 keys that are not in the history already.
I will attempt to estimate the number of operations used as a proxy for "efficiency", since this is how we will be charged in the new pricing model
Fetch the User object and "history keys" (1 read operation)
Do a keys only query and fetch 300 records. (300 small operations)
In your code, subtract any of the history keys from the 300 records. (0 operations)
If you end up with less than 200 records after step 3, fetch another 100.(repeat if necessary) (100 small operations).
Once you have 200 keys not seen before, you can fetch the full business object entities if you need them, or display the keys to the user. (200 read operations if you fetch the entire objects)
If the datastore supported a native "NOT IN" operator, then we could shave off 100 small operations from step 2, and skip step 4. The largest cost here will be fetching the actual 200 entities, which would have to happen with or without the NOT IN operator. Ultimately, this method is not that inefficient compared to what a native NOT IN operator would do.
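A sketch of steps 2-4 with the low-level datastore API (the kind name, batch sizes, and helper are hypothetical; the repeat-fetch loop is reduced to a comment):

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.FetchOptions;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.Query;

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class UnseenResultsFetcher {

    private static final DatastoreService ds = DatastoreServiceFactory.getDatastoreService();

    /** Returns up to 'wanted' keys of entities that are not in the user's history. */
    public static List<Key> fetchUnseenKeys(Set<Key> historyKeys, int wanted) {
        List<Key> unseen = new ArrayList<>();
        int batchSize = wanted + 100;                        // over-fetch to cover history overlap
        // Keys-only query: each returned key costs only a small operation.
        Query q = new Query("BusinessObject").setKeysOnly(); // hypothetical kind name
        for (Entity e : ds.prepare(q).asIterable(FetchOptions.Builder.withLimit(batchSize))) {
            if (!historyKeys.contains(e.getKey())) {
                unseen.add(e.getKey());
                if (unseen.size() >= wanted) {
                    break;
                }
            }
        }
        // If unseen.size() < wanted here, repeat the query with a cursor (or a larger
        // limit) until enough keys are collected, then fetch the full entities if needed.
        return unseen;
    }
}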
Further optimizations:
If you don't need to display 200 keys all at once, then you can use cursors to only get N results at a time.
I am simply guessing when I suggest that you get 300 keys at first. You may need to get more or fewer. You can also probably get fewer than 100 on the second attempt.