Spring Data Paging over combined Result of several Queries - java

We are using Spring Boot 2 with Spring Data and its PagingAndSortingRepository feature. This works well for single queries, but in one case we have to make three different queries and implement pagination for the combined result.
What is the best way to do it?
Here's what I have tried:
1) Write a UNION or JOIN query of sorts that already returns the combined result as a Page or Slice. However, this query takes almost 10 times as long as issuing three separate queries and doing the aggregation in Java. We are talking complex computations here (PostGIS backend).
2) Manually construct the pages/slices using the existing SliceImpl or PageImpl classes. This works fine for the initial request, but fails on the second request, when the user asks for something like: give me page 1 (page size == 10 items). The first page (page 0) may have had 4 items from the first query and 6 of 12 total items from the second query. Asking for page 1 then gives me 0 results from the first query and 2 (instead of 6) from the second, while filling up the rest from the third query. So clearly, this cannot work from a logical point of view.
Any other ideas?
Edit: we are planning to add Hibernate Search and caching, which might solve this problem externally by making option 1) fast enough. My question was meant to ask for an "internal" solution, i.e. some code I can write today, until we have the external solution in place.

As you have described in point 2), unless you always left join the queries together, nothing can guarantee that what you have retrieved with the first query is enough to build a page of 10 valid elements.
Implementing logic that keeps fetching elements until the page is complete is certainly more expensive than a single query, especially as the requested page number grows.
I think you have to combine all your queries into a single query.
A solution in this case could be to create a materialized view in your database and apply simpler filters to it.
A caching framework can help too.
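To make the materialized-view idea more concrete, here is a minimal sketch with Spring Data JPA; the view name combined_results, the entity CombinedResult, and the region_id filter are all hypothetical placeholders:
import org.springframework.data.domain.Page;
import org.springframework.data.domain.Pageable;
import org.springframework.data.jpa.repository.Query;
import org.springframework.data.repository.PagingAndSortingRepository;
import org.springframework.data.repository.query.Param;

// CombinedResult would be a read-only entity mapped onto the materialized view
public interface CombinedResultRepository extends PagingAndSortingRepository<CombinedResult, Long> {

    // Simple filter against the pre-computed view; the expensive PostGIS work
    // happens when the view is refreshed, not at query time.
    @Query(value = "SELECT * FROM combined_results WHERE region_id = :regionId",
           countQuery = "SELECT count(*) FROM combined_results WHERE region_id = :regionId",
           nativeQuery = true)
    Page<CombinedResult> findByRegion(@Param("regionId") long regionId, Pageable pageable);
}
The view would then be refreshed on a schedule (or after writes), which is essentially the caching mentioned above pushed down into the database.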

Related

Problem with Cassandra TOKEN paging mechanism

Our data set has a lot of duplicate partition keys. We are using the TOKEN method to paginate through the data. If the rows with the duplicate keys are split across a page we don't get the remainder of the duplicates on the next call.
For example, assume we have the following keys: 1 2 3 5 5 5 6 7 8 and a limit of 5 rows per query. The first query "select * from table where TOKEN(id) > TOKEN('') limit 5;" returns 1 2 3 5 5 as expected. The second query "select * from table where TOKEN(id) > TOKEN('5') limit 5;" returns 6 7 8. This is not the desired behavior; we want the second query to return 5 6 7 8. Thinking about it, it is obvious why this happens: TOKEN(id) > TOKEN('5') fails if id == 5.
Are we doing something wrong or is this just the way it works? We are using the latest Java driver, but I don't think this is a driver problem, since the Golang driver also exhibits this behavior.
We've (mostly) worked around the problem by either dropping any duplicated records at the end of the row set (the 5 5 in the example) or dropping the last record (to cover the case where the last record is duplicated in the second record set). This fails if the record set is all duplicates. Obviously larger limits reduce this edge case, but it doesn't seem safe to use in a production environment.
* EDITED *
The TOKEN method is recommended in a lot of pages both here on Stackoverflow and elsewhere on the web. But obviously it doesn't work :-(
#alex:
Thanks for your reply. The example was just that, a simplified example of the issue. In reality we have 30 million rows and are using a limit of 1000. When the table was first designed years ago, the designer didn't understand how the partition key works, so they used the user ID as the partition key, thus giving us 30 million partitions. We believe that this is at least contributing to our excessive repair times (currently at 12 hours for the cluster). We need to copy the entire table into a new one with a different partition key (in a live production environment) to resolve the partition key issue. This page https://docs.datastax.com/en/developer/java-driver/2.1/manual/paging/ seems to be a better solution.
#Nadav:
Thanks for your reply. Simply removing the limit will cause the request to time out in multiple layers of our software. The DataStax page above seems to be the best solution for us.
You are mixing up several things - in Cassandra data is organized inside partitions, and you can get data by partition key, or perform a range scan using the token function. The results of the query could be delivered to applications by pages - you can specify the fetch size (although 5 is quite small), fetch one page, process, fetch next, process, ..., until the result set is exhausted.
In your case, the page size doesn't match the result set size - you have 6 results there, and the next result set (for token(id) > token(5)) has only 3 rows. I don't know a solution that works out of the box (except select * from table, but it may time out if you have a lot of data). In your case I would rather go with bigger ranges (for example, the whole token range), page through the results inside it (without using limit), and then handle the case where you need to switch to the next token range while some rows are left over from the previous one.
I have an example of Java code that performs an effective scan of all token ranges, similar to what the Spark connector does. The main trick there is to route the request to the node that holds the data, so it reads the data directly from that node, without needing to reach other nodes (if you're reading with LOCAL_ONE, of course).
You shouldn't, and can't, use token ranges and LIMIT to page through results, and you found out yourself that it doesn't work - because LIMIT cuts off some of the result, and you have no way to continue.
Instead, Cassandra gives you a separate paging feature: You make a request, get the first 1000 (or whatever) rows and also a "cookie" with which you can resume the query to get the next page of results. Please refer to your favorite driver's documentation on the syntax of using Cassandra paging in your favorite language. It's not "LIMIT" - it's a separate feature.
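For illustration, here is a rough sketch of that paging feature with the DataStax Java driver (the same mechanism described on the paging documentation page linked in the question); my_table, the fetch size, and how the "cookie" is transported are placeholders:
import com.datastax.driver.core.PagingState;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

// session is an already-connected com.datastax.driver.core.Session
Statement stmt = new SimpleStatement("SELECT * FROM my_table");
stmt.setFetchSize(1000);                         // page size, not a LIMIT
ResultSet rs = session.execute(stmt);

int remaining = rs.getAvailableWithoutFetching();
for (Row row : rs) {
    // ... process the row ...
    if (--remaining == 0) break;                 // stop at the end of the current page
}

// the "cookie": null once the result set is exhausted
PagingState pagingState = rs.getExecutionInfo().getPagingState();
String cookie = (pagingState == null) ? null : pagingState.toString();

// on a later request, resume exactly where the previous page ended
if (cookie != null) {
    Statement next = new SimpleStatement("SELECT * FROM my_table");
    next.setFetchSize(1000);
    next.setPagingState(PagingState.fromString(cookie));
    ResultSet nextPage = session.execute(next);
}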
Splitting up a large query into multiple token ranges still has its uses. For example, it allows you to query the different ranges in parallel, since different token ranges will often come from different nodes. But still, you need to query each range to completion, using paging, and cannot use "LIMIT" because you can't know how many results to expect from each range and need to read them all.

Hibernate limit amount of result but check for more

As the title states, I want to retrieve a maximum of, for example, 1000 rows, but if the query's result would be 1001 rows, I would like to know that in some way. I have seen examples which check the number of rows in the result with a second query, but I would like to have it in the same query I use to get the 1000 rows. I am using Hibernate and Criteria to retrieve my results from my database. The database is MS SQL.
What you want is not possible in a generic way.
The two usual patterns for pagination are:
use two queries: a first one that counts, and a second one that gets a page of results
use only one query, in which you fetch one result more than what you show on the page
With the first pattern, your pagination has more functionality because you can display the total number of pages and allow the user to jump directly to the page he wants, but you get this possibility at the cost of an additional SQL query.
With the second pattern you can only tell the user whether there is one more page of data or not. The user can then just jump to the next page (or any previous page he already saw).
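A rough sketch of the second pattern with the classic Criteria API, assuming an open Hibernate Session named session, a zero-based page index named page, and a placeholder entity MyEntity:
import java.util.List;
import org.hibernate.Criteria;
import org.hibernate.criterion.Order;

int pageSize = 1000;
Criteria criteria = session.createCriteria(MyEntity.class)
        .addOrder(Order.asc("id"))            // a stable order is needed for consistent pages
        .setFirstResult(page * pageSize)      // offset of the requested page
        .setMaxResults(pageSize + 1);         // fetch one extra row to detect a next page
List<MyEntity> rows = criteria.list();
boolean hasMore = rows.size() > pageSize;
if (hasMore) {
    rows = rows.subList(0, pageSize);         // drop the sentinel row before returning the page
}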
You want two pieces of information that result from two distinct queries:
select count(*) from...
select col1, col2 from...
You cannot do both in a single executed Criteria or JPQL query.
But you can do it with a native SQL query (using a subquery), in a way that differs according to the DBMS used.
By doing that, you would make your code more complex and more dependent on a specific DBMS, and you would probably not really gain anything in terms of performance.
I think that you should rather use a count and a second query to get the rows.
And if later you want to exploit the result of the count to fetch the next results, you should favor the pagination mechanisms provided by Hibernate rather than doing it in a custom way.
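For completeness, a hedged sketch of that first pattern (count query plus page query) with the same Criteria API and the same placeholder names as above:
import java.util.List;
import org.hibernate.criterion.Order;
import org.hibernate.criterion.Projections;

// 1) count query: total number of matching rows
Number total = (Number) session.createCriteria(MyEntity.class)
        .setProjection(Projections.rowCount())
        .uniqueResult();

// 2) page query: one page of results
List<MyEntity> rows = session.createCriteria(MyEntity.class)
        .addOrder(Order.asc("id"))
        .setFirstResult(pageIndex * pageSize)
        .setMaxResults(pageSize)
        .list();

boolean hasMore = (long) (pageIndex + 1) * pageSize < total.longValue();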

How can paginated results from a hypermedia driven RESTful API be managed if the underlying data can change?

I'm creating a hypermedia driven RESTful API that will be used to query transactional data. The intention is that the results will be paginated.
Each API call will query an indexed database table. Since I don't want to keep the results server side due to memory considerations, I was thinking of retrieving the data based on rownum, depending on which page is requested, e.g. on page one, WHERE rownum <= 10; on page two, WHERE rownum BETWEEN 11 AND 20, etc.
However, the database in question is replicated from a production system and could potentially have records inserted into an area of the result set already requested, e.g. page one is requested -> 10 rows are returned -> a transaction is inserted at row 5. Now page two will include a record already displayed on page one, as the results are essentially pushed up by one rownum.
What would be a good way of achieving my objective of creating a hypermedia driven RESTful API that provides paginated transactional data from a database, without holding on to the result sets for the duration of the session?
This is a pretty common problem and there are actually not many approaches.
I can think of only three, actually:
You don't care and the result will change. This is the behaviour of stackoverflow: if you're on page 2 of the questions page and someone posts a new question, when clicking on page 3 you may get one or more of the questions that were already listed on page 2, because the index has shifted.
If you don't want to keep in memory the actual data, you're in for a lot of trouble. You could store the handler for the result set, instead of the results themselves, and loop over it fetching the number of rows that you actually need. E.g. you run the select, fetch 10 rows and store the handler of the resultset. Together with the rows, you return to the client a unique ID of the query. The problem will be when you have a range specified, because you can't really "rewind" a database cursor, and that would mean caching the results, which you may want to do anyway. But if you do it like that, sooner or later you're going to have all of the results in memory anyway.
You could still use some memory, but keep only some unique identifier of the rows, associated with a unique identifier of the query, as above. This could work, but only if the rows may be added, and not deleted or updated (if they're updated, they may not match the query any more).
Personally, I'd go with option 1.

How to get the number of results in an App Engine query before actually iterating through them all

In my Google App Engine app I need to fetch and return a potentially large number of entities from a datastore query in response to a service call GET request. This call may return potentially thousands of entities and MBs of serialized data.
The first portion of the response packet communicates how many entities are in the serialized results, followed by all of the serialized entities. Currently I am iterating through all the entities in the query with a QueryResultIterator up to a maximum page size limit, after which I return a cursor from which can be used to continue fetching where the previous call left off (if the maximum was reached and there are still results in the query). As I iterate through the results, I save them in a list. Once I've either exhausted the query results or reached the maximum page size, I can then get the number of entities from the size of this list. But then I have to iterate through this list again to serialize each of the entities and write the results to the response output stream.
I don't know that this is the most efficient method to perform this operation. Is there a way I can get the number of entities in a query's results before actually iterating through them all or fetching them directly into a list? (The list method doesn't work anyway because I'm using cursors, which require the use of QueryResultIterator.)
QueryResultIterator has a method getIndexList(). Would this be a less costly way to get the number of entities in the query's results? I'm assuming this list would contain exactly one index object for each entity in the query's results. Also, I'd need this list to only contain the indexes for the entities after the current cursor position for the iterator. Is my understanding correct or would this method not do what I think it would?
A list of just indexes would require much less memory than loading a list of whole entities. Although, I don't know if this list would be limited at all by the query's prefetch or chunk sizes, or if I'd want to use the query's limit parameter at all because I would only be interested in knowing how many entities were in the results up to the maximum page size plus one (to know there are still more results and provide a cursor to continue).
Currently I'm setting the prefetch and chunk size (to the size of my page limit), but I'm not using the limit or offset parameters since I'm using cursors instead. From what I understand cursors are preferable to offset/limit. Would setting the limit parameter affect continuing a query with a cursor?
Clearly I have quite a few questions as to how GAE datastore queries work and how they're affected by changing parameters. So any insights are appreciated. The documentation for App Engine APIs is often sparse, as in one sentence descriptions of methods stating pretty much what can be deduced from the method signature. They don't generally go into much detail otherwise. Maybe the way I'm doing it currently is just fine after all. It works as is, but I'm trying to optimize the service call to get the best response time possible for my client application.
UPDATE: By the way, I am using Objectify v3 in my app, including to perform this query. There are several places where I am required to use the low-level datastore API, including for geo-location queries (with geomodel) and projection queries (which aren't supported in Objectify v3). So if there is a good way to do this using Objectify, that would be ideal. Otherwise I can use the low-level API, but it's always messier this way.
Both the low-level api and Objectify have a count() method (look at the javadocs for details). However, counting can be a very expensive and lengthy operation - it costs 1 small op for every number returned. For example, count() returning 5000 costs 5000 small ops (plus 1 read for the query), and takes as long as it would take to do a keys-only scan of all 5000 (which is what GAE actually does).
If you absolutely must have an exact count, you probably need to aggregate this value yourself by incrementing/decrementing a (possibly sharded) counter. This gets very tricky when you are dealing with filtered queries.
There is no one right solution here. Google searches give you totals like "About 119,000,000 results" which are deliberately inexact and almost certainly precalculated. For smaller result sets, using count() can be acceptable - but you might want to apply a limit() so that you never break the bank. You can always say "More than 500 results..."
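For example, capping the count at 501 (an arbitrary cap) with the low-level API, assuming datastoreService was obtained from DatastoreServiceFactory and query is an already-built low-level Query:
// Count at most 501 entities so a huge result set doesn't cost thousands of small ops
int count = datastoreService.prepare(query)
        .countEntities(FetchOptions.Builder.withLimit(501));
boolean moreThan500 = count > 500;   // if true, display something like "More than 500 results..."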
If you want to fetch the number of records, you can use the following code:
com.google.appengine.api.datastore.Query qry = new com.google.appengine.api.datastore.Query("EntityName");
com.google.appengine.api.datastore.DatastoreService datastoreService = DatastoreServiceFactory.getDatastoreService();
int totalCount = datastoreService.prepare(qry).countEntities(FetchOptions.Builder.withDefaults());
And if you want a filter, you can use:
qry.addFilter("firstName", FilterOperator.EQUAL, firstName);
I hope it will help you.

Avoiding exploding indices and entity-group write-rate limits with appengine

I have an application in which there are Courses, Topics, and Tags. Each Topic can be in many Courses and have many Tags. I want to look up every Topic that has a specific Tag x and is in specific Course y.
1) Naively, I give each Topic a list of Course ids and Tag ids, so I can select * from Topic where tagIds = x && courseIds = y. I think this query would require an exploding index: with 30 courses and 30 tags we're looking at ~900 index entries, right? At 50 x 20 I'm well over the 5000-entry limit.
2) I could just select * from Topic where tagIds = x, and then use a for loop to go through the result, choosing only Topics whose courseIds.contain(y). This returns way more results than I'm interested in and spends a lot of time deserializing those results, but the index stays small.
3) I could select __KEY__ from Topic where tagIds = x AND select __KEY__ from Topic where courseIds = y and find the intersection in my application code. If the sets are small this might not be unreasonable.
4) I could make a sort of join table, TopicTagLookup, with a tagId and courseId field. The parent key of these entities would point to the relevant Topic. Then I would need to make one of these TopicTagLookup entities for every combination of courseId x tagId x relevant topic id. This is effectively like creating my own index. It would still explode, but there would be no 5000-entry limit. Now, however, I need to write 5000 entities to the same entity group, which would run up against the entity-group write-rate limit!
5) I could precalculate each query. A TopicTagQueryCache entity would hold a tagId, courseId, and a List<TopicId>. Then the query looks like select * from TopicTagQueryCache where tagId=x && courseId = y, fetching the list of topic ids, and then using a getAllById call on the list. Similar to #3, but I only have one entity per courseId x tagId. There's no need for entity groups, but now I have this potentially huge list to maintain transactionally.
Appengine seems great for queries you can precalculate. I just don't quite see a way to precalculate this query efficiently. The question basically boils down to:
What's the best way to organize data so that we can do set operations like finding the Topics in the intersection of a Course and a Tag?
Your assessment of your options is correct. If you don't need any sort criteria, though, option 3 is more or less already done for you by the App Engine datastore, with the merge join strategy. Simply do a query as you detail in option 1, without any sorts or inequality filters, and App Engine will do a merge join internally in the datastore, and return only the relevant results.
Options 4 and 5 are similar to the relation index pattern documented in this talk.
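A quick sketch of that merge-join-friendly query with the low-level datastore API; tagId, courseId, and datastoreService are placeholders:
// Two equality filters, no sort order and no inequality filter: the datastore
// can answer this with a zig-zag merge join over the built-in single-property
// indexes, so no exploding composite index is required.
Query q = new Query("Topic");
q.addFilter("tagIds", Query.FilterOperator.EQUAL, tagId);
q.addFilter("courseIds", Query.FilterOperator.EQUAL, courseId);
List<Entity> topics = datastoreService.prepare(q)
        .asList(FetchOptions.Builder.withDefaults());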
I like #5 - you are essentially creating your own (exploding) index. It will be fast to query.
The only downsides are that you have to manually maintain it (next paragraph), and retrieving the Topic entity will require an extra query (first you query TopicTagQueryCache to get the topic ID and then you need to actually retrieve the topic).
Updating the TopicTagQueryCache you suggested shouldn't be a problem either. I wouldn't worry about doing it transactionally - this "index" will just be stale for a short period of time when you update a Topic (at worst, your Topic will temporarily show up in results it should no longer show up in, and perhaps take a moment before it shows up in new results which it should show up in - this doesn't seem so bad). You can even do this update on the task queue (to make sure this potentially large number of database writes all succeed, and so that you can quickly finish the request so your user isn't waiting).
As you said yourself, you should arrange your data to facilitate the scaling of your app. So, to the question of "What's the best way to organize data so that we can do set operations like finding the Topics in the intersection of a Course and a Tag?":
You can hold your own indexes of these sets by creating CourseRef and TopicRef entities which consist of a Key only, with the ID portion being the actual Key of the corresponding entity. These "Ref" entities live under a specific Tag, so there are no actual Key duplicates. The structure for a given Tag is: Tag\CourseRef...\TopicRef...
This way, given a Tag and a Course, you construct the Key Tag\CourseRef and run an ancestor query, which gets you a set of keys you can fetch. This is extremely fast as it is effectively a direct access, and it should handle large lists of courses or topics without the issues of list properties.
This method will require you to use the DataStore API to some extent.
As you can see, this gives an answer to a specific question, and the model will do no good for other types of set operations.
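To illustrate, a hedged sketch of that lookup with the low-level datastore API; the kind names (Tag, CourseRef, TopicRef), the use of stringified keys as the Ref entity names, and the tagName, courseKey, and datastoreService variables are all assumptions:
// Build the ancestor path Tag -> CourseRef, where the CourseRef's name encodes
// the Course it points to.
Key tagKey = KeyFactory.createKey("Tag", tagName);
Key courseRefKey = KeyFactory.createKey(tagKey, "CourseRef", KeyFactory.keyToString(courseKey));

// Keys-only ancestor query over the TopicRef entities under Tag\CourseRef
Query q = new Query("TopicRef", courseRefKey);
q.setKeysOnly();

List<Key> topicKeys = new ArrayList<Key>();
for (Entity ref : datastoreService.prepare(q).asIterable()) {
    // each TopicRef's name is assumed to be the stringified Key of the Topic it points to
    topicKeys.add(KeyFactory.stringToKey(ref.getKey().getName()));
}
// Batch get of the actual Topic entities by key (a direct lookup, no index scan)
Map<Key, Entity> topics = datastoreService.get(topicKeys);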
