BigQuery Pagination - Use pageToken or startIndex? - java

I will fetch approximately 500,000 to 1,000,000 rows from BigQuery. We will limit each request with an offset and a max; in this case pageSize = max and startIndex = offset.
Our data will only be processed once a day and then uploaded to BigQuery.
The documentation recommended using pageToken instead of startIndex.
I have done some timing tests using pageToken and startIndex and could not see any difference.
I found one answer here at StackOverflow:
"You should use the page token returned from the original query response or the previous jobs.getQueryResults() call to iterate through pages. This is generally more efficient and reliable than using index-based pagination"
But I'm not convinced that I should use pageToken, since then I need to store the token to use when paging back and forth. Timewise, I could not see any difference.

But I'm not convinced why I should use "pageToken"
There are a few important differences between the two:
index-based pagination - good when you know how many records your query returns; it doesn't take the size of a record into account (this is important for client-side applications).
page token - points to a specific page in the result set without requiring any prior information, such as the size of the results.
So if in your case you know how many results you have and you don't care about the page size, you can use index-based pagination; otherwise use a page token.
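For reference, a minimal Java sketch of token-based paging with the google-cloud-bigquery client; the table name and page size are placeholders, and the client follows the page token for you via getNextPage():

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FieldValueList;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

public class PageTokenPagination {
    public static void main(String[] args) throws InterruptedException {
        // Assumes application-default credentials; the table name is a placeholder.
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        QueryJobConfiguration config =
                QueryJobConfiguration.of("SELECT * FROM `my_dataset.my_table`");

        Job job = bigquery.create(JobInfo.of(config)).waitFor();
        // pageSize plays the role of "max"; no startIndex is needed.
        TableResult page = job.getQueryResults(BigQuery.QueryResultsOption.pageSize(1000));

        while (page != null) {
            for (FieldValueList row : page.getValues()) {
                // process one row here
            }
            // page.getNextPageToken() is what you would persist to resume later;
            // getNextPage() follows that token for you.
            page = page.hasNextPage() ? page.getNextPage() : null;
        }
    }
}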

Related

Change how Spring JPA does pagination

I want to know if it is possible to change how Spring does pagination, because the default PagingAndSortingRepository behaviour uses LIMIT with OFFSET, and on large datasets that becomes too slow.
The data I have does have an indexed column, so there is no problem writing logic like WHERE timestamp > x AND timestamp < y LIMIT 1000; and keeping track of the highest timestamp received. I am just wondering if this is already built into Spring JPA, so I could tell it to order by a column and use that rather than OFFSET.
https://electrictoolbox.com/mysql-limit-slow-large-offset/
There is no magic solution for this.
Yes, it can be slow. But I find that this is not an indicator of a problem with paging, but rather an indicator that one is using an improper tool or solution.
I find that using paging to navigate across a huge dataset is a bad approach. I'd suggest keeping to a relatively small number of pages, like 20. If you want to navigate across a bigger volume of data, consider proper splitting/structuring of the data in your requests. For instance, if your data can be filtered by time, use proper filters in your request (and corresponding GUI elements if it is triggered from a GUI).
Ask the user first to select a time period, like from/to years, from/to dates, or from/to hours, depending on how much data you have, and use paging for that query (which will return a relatively small number of pages).
Alternatively, no matter what filter/search criteria the user defines, check the number of pages in the first request. If it exceeds some limit (like 20 pages), throw an exception and ask the user to refine the filter/search criteria. Continue until the number of pages no longer exceeds your limit. Paging with such a filter can be much faster.
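For what it's worth, the keyset pattern the question describes maps directly onto a Spring Data derived query. A minimal sketch, assuming Spring Boot 3 (jakarta.persistence) and a hypothetical Event entity with an indexed timestamp column:

import jakarta.persistence.Entity;
import jakarta.persistence.GeneratedValue;
import jakarta.persistence.Id;
import java.time.Instant;
import java.util.List;
import org.springframework.data.jpa.repository.JpaRepository;

// Hypothetical entity; "timestamp" is assumed to be indexed.
@Entity
class Event {
    @Id @GeneratedValue Long id;
    Instant timestamp;
}

interface EventRepository extends JpaRepository<Event, Long> {
    // Derived query: WHERE timestamp > ?1 ORDER BY timestamp ASC LIMIT 1000
    List<Event> findTop1000ByTimestampGreaterThanOrderByTimestampAsc(Instant after);
}

class KeysetPager {
    // Walk the table without OFFSET by carrying the highest seen timestamp forward.
    void processAll(EventRepository repo) {
        Instant cursor = Instant.EPOCH;
        List<Event> page;
        while (!(page = repo.findTop1000ByTimestampGreaterThanOrderByTimestampAsc(cursor)).isEmpty()) {
            // ... handle the page here ...
            cursor = page.get(page.size() - 1).timestamp;
        }
    }
}

Each iteration issues an index-friendly range query instead of an ever-growing OFFSET, which is exactly the WHERE timestamp > x ... LIMIT 1000 logic from the question.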

Is there an easy way to get the Nth page of items from DynamoDB in Java?

I am now working on a web app backed by Amazon DynamoDB.
I want users to be able to jump directly to the Nth page to view item info.
I have been told that pagination in DynamoDB is based on the last evaluated key rather than limit/offset; it doesn't natively support offset (see DynamoDB Scan / Query Pagination).
Does that mean that if I want to get to the 10th page of items, I have to query the 9 pages ahead of it first? (That really doesn't seem like a good solution.)
Is there an easier way to do that?
You are right: DynamoDB doesn't support a numerical offset. The only way to paginate is to use the LastEvaluatedKey parameter when making a request. You still have some good options to achieve pagination by page number.
Fast Cursor
You can make fast pagination requests by discarding the full result and fetching only the keys. You are limited to 1 MB per request, which covers a large number of keys! Using this, you can move your cursor to the required position and then start reading full objects.
This solution is acceptable for small/medium datasets. You will run into performance and cost issues on large datasets.
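A sketch of that fast-cursor idea with the AWS SDK for Java v2; the table name, key attribute, and page size are placeholders:

import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.ScanRequest;
import software.amazon.awssdk.services.dynamodb.model.ScanResponse;

public class FastCursor {
    // Advance to the requested page with cheap keys-only scans, then read it in full.
    static ScanResponse fetchPage(DynamoDbClient ddb, String table, String keyAttr,
                                  int pageSize, int pageNumber) {
        Map<String, AttributeValue> startKey = null;
        for (int i = 1; i < pageNumber; i++) {
            ScanResponse keysOnly = ddb.scan(ScanRequest.builder()
                    .tableName(table)
                    .projectionExpression(keyAttr) // keys only: tiny items, far below 1 MB
                    .limit(pageSize)
                    .exclusiveStartKey(startKey)
                    .build());
            if (!keysOnly.hasLastEvaluatedKey()) {
                break; // fewer pages than requested; stop at the end of the table
            }
            startKey = keysOnly.lastEvaluatedKey();
        }
        // Read the full items for the page the cursor now points at.
        return ddb.scan(ScanRequest.builder()
                .tableName(table)
                .limit(pageSize)
                .exclusiveStartKey(startKey)
                .build());
    }
}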
Numerical index
You can also create a global secondary index on which to paginate your dataset. For example, you can add an offset property to all your objects and query this global index directly to get the desired page.
Obviously this only works if you don't use any custom filter, and you have to maintain this value when inserting/deleting/updating objects. So this solution is only good if you have an 'append only' dataset.
Cached Cursor
This solution builds on the first one, but instead of fetching keys every single time, you can cache the page positions and reuse them for other requests. Cache tools like Redis or Memcached can help you achieve that (a minimal cache sketch follows the steps below):
You check the cache to see if the page positions have already been calculated.
If not, you scan your dataset fetching only keys, then store the starting key of each page in your cache.
You request the desired page to fetch the full objects.
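As mentioned above, a minimal stand-in for that cache, assuming the AWS SDK for Java v2 key type (in production the map would live in Redis or Memcached rather than in process memory):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;

// Maps a page number to the ExclusiveStartKey that page begins at.
public class PageKeyCache {
    private final Map<Integer, Map<String, AttributeValue>> startKeys = new ConcurrentHashMap<>();

    // Record where a page begins while walking the keys-only scan.
    public void put(int page, Map<String, AttributeValue> key) {
        startKeys.put(page, key);
    }

    // Cached start key for a page, or null if it hasn't been computed yet;
    // callers then fall back to a keys-only scan from the nearest cached page.
    public Map<String, AttributeValue> get(int page) {
        return startKeys.get(page);
    }
}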
Choose the solution that fits your needs. I hope this will help you :)

Split a big Jira-Rest-Request

I'm looking for a way to split a big request like:
rest/api/2/search?jql=(project in (project1, project2, project3....project10)) AND issuetype = Bug AND (component not in (projectA, projectB) OR component = EMPTY). The result will contain > 500 bugs, so it's very, very slow. I want to fetch them with several requests (the method performing the request will be annotated with @Asynchronous), but the JQL needs to stay the same; I don't want to search separately for project1, project2, ... project10. It would be nice if someone had an idea to resolve my problem.
Thank you :)
You need to calculate the pagination yourself. First get the metadata:
rest/api/2/search?jql=[complete search query]&fields=*none&maxResults=0
You should get something like this:
{"startAt":0,"maxResults":0,"total":100,"issues":[]}
So: completely without fields, just pagination metadata.
Then create search URIs like this:
rest/api/2/search?jql=[complete search query]&startAt=0&maxResults=10
rest/api/2/search?jql=[complete search query]&startAt=10&maxResults=10
...etc.
Beware: the data may change between requests, so be prepared that you won't receive all of it, and the pagination metadata (especially "total") may not be present if calculating it is expensive. More: Paged API
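A rough sketch of that two-step flow using java.net.http; the base URL and JQL are placeholders, auth headers are omitted, and the "total" field is parsed naively where a real client would use a JSON library:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class JiraPager {
    static final String BASE = "https://jira.example.com/rest/api/2/search"; // placeholder
    static final String JQL = "project%20in%20(P1,P2)%20AND%20issuetype%20=%20Bug"; // placeholder, pre-encoded

    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();

        // Step 1: fetch pagination metadata only (no fields, zero results).
        HttpRequest meta = HttpRequest.newBuilder(
                URI.create(BASE + "?jql=" + JQL + "&fields=*none&maxResults=0")).build();
        String body = http.send(meta, HttpResponse.BodyHandlers.ofString()).body();
        int total = Integer.parseInt(body.replaceAll("(?s).*\"total\":(\\d+).*", "$1"));

        // Step 2: fire one async request per page of 100 issues.
        int pageSize = 100;
        List<CompletableFuture<String>> pages = new ArrayList<>();
        for (int startAt = 0; startAt < total; startAt += pageSize) {
            HttpRequest page = HttpRequest.newBuilder(URI.create(
                    BASE + "?jql=" + JQL + "&startAt=" + startAt + "&maxResults=" + pageSize)).build();
            pages.add(http.sendAsync(page, HttpResponse.BodyHandlers.ofString())
                          .thenApply(HttpResponse::body));
        }
        pages.forEach(f -> System.out.println(f.join().length() + " bytes received"));
    }
}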
Can you not break it into two parts? If you are displaying in a web page, display what you can without a performance hit. If it's a report, then fetch all objects gradually and show the result once completed.
Get the total count for the JQL, requesting just the minimum information needed for step 2 - assume it's 900.
Use the pagination feature (maxResults=100) and make multiple calls.
Work on each request as it completes.
If you don't want to run the requests all at once and need paging of bugs driven by user requests, you can:
Make a request with the 'maxResults' property set to how much you need.
On the next request, set the 'maxResults' property again and set 'startAt' to that same value.
If you need to fetch more data, make a new request with the same 'maxResults' but update 'startAt' to the count of bugs you fetched in the previous requests.

DynamoDB range query gives a limited number of results

I'm trying to implement an application using the Google Guice framework with a DynamoDB database.
I have implemented an API for finding documents by a range query, i.e. a time period. When I query by month it returns a limited number of documents (3,695), and when I search by start time and end time it returns the same number of documents, which does not contain a newly created document.
Please suggest a way to implement the API that works around this limitation of the application or DynamoDB.
A DynamoDB response is limited to 1 MB per page, so when your result set is bigger you only get the first results, up to that 1 MB.
The docs at
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/QueryAndScan.html#Pagination
describe how to use the response metadata to see the real number of results, the starting key, and so on, and how to query the whole result in batches/pages.
Important excerpt of the docs:
If you query or scan for specific attributes that match values that amount to more than 1 MB of data, you'll need to perform another Query or Scan request for the next 1 MB of data. To do this, take the LastEvaluatedKey value from the previous request, and use that value as the ExclusiveStartKey in the next request. This will let you progressively query or scan for new data in 1 MB increments.

When the entire result set from a Query or Scan has been processed, the LastEvaluatedKey is null. This indicates that the result set is complete (i.e. the operation processed the “last page” of data).
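A minimal sketch of that LastEvaluatedKey/ExclusiveStartKey loop with the AWS SDK for Java v2; the table, attribute names, and epoch-second bounds are placeholders:

import java.util.HashMap;
import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.QueryRequest;
import software.amazon.awssdk.services.dynamodb.model.QueryResponse;

public class RangeQueryAllPages {
    public static void main(String[] args) {
        DynamoDbClient ddb = DynamoDbClient.create();

        Map<String, AttributeValue> values = new HashMap<>();
        values.put(":id", AttributeValue.builder().s("doc-group-1").build());
        values.put(":from", AttributeValue.builder().n("1500000000").build());
        values.put(":to", AttributeValue.builder().n("1502000000").build());

        Map<String, AttributeValue> startKey = null;
        int total = 0;
        do {
            QueryResponse resp = ddb.query(QueryRequest.builder()
                    .tableName("Documents")
                    .keyConditionExpression("pk = :id AND createdAt BETWEEN :from AND :to")
                    .expressionAttributeValues(values)
                    .exclusiveStartKey(startKey) // null on the first request
                    .build());
            total += resp.count();
            // A non-null LastEvaluatedKey means there is another 1 MB page to fetch.
            startKey = resp.hasLastEvaluatedKey() ? resp.lastEvaluatedKey() : null;
        } while (startKey != null);

        System.out.println("Documents in range: " + total);
    }
}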

How to get the number of results in an App Engine query before actually iterating through them all

In my Google App Engine app I need to fetch and return a potentially large number of entities from a datastore query in response to a service call GET request. This call may return potentially thousands of entities and MBs of serialized data.
The first portion of the response packet communicates how many entities are in the serialized results, followed by all of the serialized entities. Currently I am iterating through all the entities in the query with a QueryResultIterator up to a maximum page size limit, after which I return a cursor that can be used to continue fetching where the previous call left off (if the maximum was reached and there are still results in the query). As I iterate through the results, I save them in a list. Once I've either exhausted the query results or reached the maximum page size, I can get the number of entities from the size of this list. But then I have to iterate through this list again to serialize each of the entities and write the results to the response output stream.
I don't know that this is the most efficient way to perform this operation. Is there a way I can get the number of entities in a query's results before actually iterating through them all or fetching them directly into a list? (The list method doesn't work anyway because I'm using cursors, which require the use of QueryResultIterator.)
QueryResultIterator has a method getIndexList(). Would this be a less costly way to get the number of entities in the query's results? I'm assuming this list would contain exactly one index object for each entity in the query's results. Also, I'd need this list to only contain the indexes for the entities after the current cursor position for the iterator. Is my understanding correct, or would this method not do what I think it would?
A list of just indexes would require much less memory than loading a list of whole entities. However, I don't know if this list would be limited at all by the query's prefetch or chunk sizes, or whether I'd want to use the query's limit parameter at all, because I would only be interested in knowing how many entities are in the results up to the maximum page size plus one (to know there are still more results, and to provide a cursor to continue).
Currently I'm setting the prefetch and chunk size (to the size of my page limit), but I'm not using the limit or offset parameters since I'm using cursors instead. From what I understand cursors are preferable to offset/limit. Would setting the limit parameter affect continuing a query with a cursor?
Clearly I have quite a few questions about how GAE datastore queries work and how they're affected by changing parameters, so any insights are appreciated. The documentation for the App Engine APIs is often sparse, as in one-sentence descriptions of methods that state little more than what can be deduced from the method signature; they don't generally go into much detail otherwise. Maybe the way I'm doing it currently is fine after all. It works as is, but I'm trying to optimize the service call to get the best response time possible for my client application.
UPDATE: By the way, I am using Objectify v3 in my app to perform this query. There are several places where I am required to use the low-level datastore API, including geo-location queries (with geomodel) and projection queries (which aren't supported in Objectify v3). So if there is a good way to do this using Objectify, that would be ideal. Otherwise I can use the low-level API, but it's always messier that way.
Both the low-level api and Objectify have a count() method (look at the javadocs for details). However, counting can be a very expensive and lengthy operation - it costs 1 small op for every number returned. For example, count() returning 5000 costs 5000 small ops (plus 1 read for the query), and takes as long as it would take to do a keys-only scan of all 5000 (which is what GAE actually does).
If you absolutely must have an exact count, you probably need to aggregate this value yourself by incrementing/decrementing a (possibly sharded) counter. This gets very tricky when you are dealing with filtered queries.
There is no one right solution here. Google searches give you totals like "About 119,000,000 results" which are deliberately inexact and almost certainly precalculated. For smaller result sets, using count() can be acceptable - but you might want to apply a limit() so that you never break the bank. You can always say "More than 500 results..."
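A small low-level sketch of that capped count; the kind name and limit are placeholders:

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.FetchOptions;
import com.google.appengine.api.datastore.Query;

public class CappedCount {
    // Count at most limit+1 entities so the cost stays bounded; if the cap is
    // hit, report "more than limit" instead of an exact total.
    static String describeResultSize(String kind, int limit) {
        DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
        int n = ds.prepare(new Query(kind))
                  .countEntities(FetchOptions.Builder.withLimit(limit + 1));
        return n > limit ? "More than " + limit + " results..." : n + " results";
    }
}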
If you want to fetch the number of records, you can use the following code:
// Counts all entities of the given kind; internally this is a keys-only scan.
com.google.appengine.api.datastore.Query qry = new com.google.appengine.api.datastore.Query("EntityName");
com.google.appengine.api.datastore.DatastoreService datastoreService = DatastoreServiceFactory.getDatastoreService();
int totalCount = datastoreService.prepare(qry).countEntities(FetchOptions.Builder.withDefaults());
And if you want a filter, you can use:
qry.addFilter("firstName", FilterOperator.EQUAL, firstName);
I hope it will help you.
