Our data set has a lot of duplicate partition keys. We are using the TOKEN method to paginate through the data. If rows with duplicate keys are split across a page boundary, we don't get the remainder of the duplicates on the next call.
For example, assume we have the following keys: 1 2 3 5 5 5 6 7 8, and we have a limit of 5 rows per query. The first query "select * from table where TOKEN(id) > TOKEN('') limit 5;" returns 1 2 3 5 5 as expected. The second query "select * from table where TOKEN(id) > TOKEN('5') limit 5;" returns 6 7 8. This is not the desired behavior; we want the second query to return 5 6 7 8. Thinking about this, it is obvious why it happens: "TOKEN(id) > TOKEN('5')" fails if id == 5.
Are we doing something wrong, or is this just the way it works? We are using the latest Java driver, but I don't think this is a driver problem, since the Golang driver also exhibits this behavior.
We've (mostly) worked around the problem by either dropping any duplicated records at the end of the row set (the 5 5 in the example) or dropping the last record (to cover the case where the last record is duplicated in the second record set). This fails if the record set is all duplicates. Obviously, larger limits reduce this edge case, but it doesn't seem safe to use in a production environment.
* EDITED *
The TOKEN method is recommended on a lot of pages, both here on Stack Overflow and elsewhere on the web, but obviously it doesn't work :-(
@alex:
Thanks for your reply. The example was just that, a simplified example of the issue. In reality we have 30 million rows and are using a limit of 1000. When the table was first designed years ago, the designer didn't understand how the partition key works, so they used the user ID as the partition key, giving us 30 million partitions. We believe this is at least contributing to our excessive repair times (currently 12 hours for the cluster). We need to copy the entire table into a new one with a different partition key (in a live production environment) to resolve the partition-key issue. This page https://docs.datastax.com/en/developer/java-driver/2.1/manual/paging/ seems to be a better solution.
@Nadav:
Thanks for your reply. Simply removing the limit will cause the request to time out in multiple layers of our software. The DataStax page above seems to be the best solution for us.
You are mixing up several things - in Cassandra, data is organized into partitions, and you can fetch data by partition key or perform a range scan using the token function. The results of a query can be delivered to the application in pages - you specify the fetch size (although 5 is quite small), fetch one page, process it, fetch the next, process it, ..., until the result set is exhausted.
In your case, the page size doesn't match the result-set size - you have 6 results there, and the next result set (for token(id) > token(5)) has only 3 rows. I don't know a solution that works out of the box (except select * from table, but it may time out if you have a lot of data). In your case I would rather go with bigger ranges (for example, the whole token range) and page through the results inside each range (without using limit), then handle the case where you switch to the next token range while some rows are left over from the previous one.
I have an example of Java code that performs an efficient scan of all token ranges, similar to what the Spark connector does. The main trick there is to route each request to a node that holds the data, so it reads the data directly from that node without needing to reach other nodes (if you're reading with LOCAL_ONE, of course).
You shouldn't, and can't, use token ranges and LIMIT to page through results, and you found out yourself that it doesn't work - LIMIT cuts off some of the results, and you have no way to continue.
Instead, Cassandra gives you a separate paging feature: you make a request, get the first 1000 (or however many) rows, and also a "cookie" with which you can resume the query to get the next page of results. Please refer to your favorite driver's documentation for the syntax of Cassandra paging in your favorite language. It's not "LIMIT" - it's a separate feature.
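With the 2.x/3.x DataStax Java driver (the API described on the paging page linked above), a minimal sketch could look like the following - the table, its columns, and the process() handler are placeholders, and session is an already-connected Session:

import com.datastax.driver.core.*;

Statement stmt = new SimpleStatement("SELECT id, data FROM my_table");
stmt.setFetchSize(1000);                          // page size, not a LIMIT

ResultSet rs = session.execute(stmt);
int remaining = rs.getAvailableWithoutFetching(); // rows in the current page
for (Row row : rs) {
    process(row);                                 // placeholder row handler
    if (--remaining == 0) break;                  // stop at the page boundary
}

// The "cookie": serialize it and return it to the client alongside the page.
PagingState cookie = rs.getExecutionInfo().getPagingState();

// On the next request, resume exactly where the previous page ended.
if (cookie != null) {
    stmt.setPagingState(PagingState.fromString(cookie.toString()));
    ResultSet nextPage = session.execute(stmt);
}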
Splitting a large query into multiple token ranges still has its uses. For example, it allows you to query the different ranges in parallel, since different token ranges will often come from different nodes. But you still need to query each range to completion, using paging, and you cannot use "LIMIT", because you can't know how many results to expect from each range and need to read them all.
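As an illustration of the range-by-range scan (again a sketch against the 3.x driver API; my_table, its columns, and process() are stand-ins, with session and cluster already in scope):

PreparedStatement ps = session.prepare(
    "SELECT id, data FROM my_table WHERE token(id) > ? AND token(id) <= ?");
for (TokenRange range : cluster.getMetadata().getTokenRanges()) {
    for (TokenRange sub : range.unwrap()) {        // split the wrap-around range
        ResultSet rs = session.execute(ps.bind()
            .setToken(0, sub.getStart())
            .setToken(1, sub.getEnd()));
        for (Row row : rs) {                       // driver paging drains the range
            process(row);
        }
    }
}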
Related
I want to get all data from offset to limit from a table with about 40 columns and 1,000,000 rows. I tried indexing the id column in Postgres and getting the result of my select query via Java and an EntityManager.
My query needs about 1 minute to return its results, which is a bit too long. I tried a different index and also limited my query to 100 rows, but it still takes that long. How can I fix this? Do I need a better index, or is something wrong with my code?
CriteriaQuery<Entity> q = entityManager.getCriteriaBuilder().createQuery(Entity.class);
q.select(q.from(Entity.class)); // select all Entity rows
TypedQuery<Entity> query = entityManager.createQuery(q);
List<Entity> entities = query.setFirstResult(offset).setMaxResults(limit).getResultList();
Right now you probably don't utilize the index at all. There is some ambiguity in how a Hibernate limit/offset translates to database operations (see this comment for the Postgres case). It may imply overhead, as described in detail in a reply to this post.
If you have a direct relationship between offset/limit and the values of the id column, you could use that in a query of the form
SELECT e
FROM Entity e
WHERE e.id >= offset AND e.id < offset + limit
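In JPA code that could look like the following sketch - the bounds are computed in Java and bound as parameters (Entity and id are stand-ins for your actual names):

long from = offset;
long to = offset + limit;
List<Entity> page = entityManager.createQuery(
        "SELECT e FROM Entity e WHERE e.id >= :from AND e.id < :to", Entity.class)
    .setParameter("from", from)
    .setParameter("to", to)
    .getResultList();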
Given that the number of records asked for is significantly smaller than the total number of records in the table, the database will use the index.
The next thing is that 40 columns is quite a lot. If you actually need significantly fewer for your purpose, you could define a restricted entity with just the required attributes and query for that one. This should take out some more overhead.
If you're still not within your performance requirements, you could choose to use a JDBC connection/query instead of Hibernate.
Btw. you could log the actual SQL issued by JPA/Hibernate and use it to get an execution plan from Postgres; this will show you what the query actually looks like and whether an index is utilized. Further, you could monitor the database's query execution times to get an idea of which fraction of the processing time is consumed by the database and which by your Java client plus data-transfer overhead.
There is also a technique to mimic offset+limit paging, using paging based on the key of each page's first record.
Map<Integer, String> mapPageTopRecNoToKey = new HashMap<>(); // page number -> key of that page's first record
Then search for records >= the page's key, and load page size + 1 records to find where the next page starts.
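A sketch of that lookup with JPA (entity and field names are assumptions):

int pageNo = 2;                                       // the requested page, for illustration
int pageSize = 100;
String pageKey = mapPageTopRecNoToKey.get(pageNo);    // first key of the requested page
List<Entity> rows = entityManager.createQuery(
        "SELECT e FROM Entity e WHERE e.key >= :k ORDER BY e.key", Entity.class)
    .setParameter("k", pageKey)
    .setMaxResults(pageSize + 1)                      // one extra row to locate the next page
    .getResultList();
if (rows.size() > pageSize) {                         // remember where the next page starts
    mapPageTopRecNoToKey.put(pageNo + 1, rows.get(pageSize).getKey());
    rows = rows.subList(0, pageSize);
}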
Going from page 1 to page 5 would take a bit more work but would still be fast.
This is of course a terrible kludge, but at the time the technique was indeed a speed improvement on some databases.
In your case it would also be worth specifying only the needed fields in JPQL: select e.a, e.b is considerably faster.
We are using Spring Boot 2 with Spring Data and its PagingAndSortingRepository feature. This works well for single queries, but in one case we have to make three different queries and implement pagination for the combined result.
What is the best way to do it?
Here's what I have tried:
1) Write a UNION or JOIN query of sorts that already returns the combined result as a Page or Slice. However, this query takes almost 10 times as long as firing three separate queries and doing the aggregation in Java. We are talking complex computations here (PostGIS backend).
2) Manually construct the pages/slices using the existing SliceImpl or PageImpl classes. This works fine for the initial request but fails on the second request, when the user asks for something like: give me page 1 (page size == 10 items). The first page (page 0) may have had 4 items from the first query and 6 of 12 total items from the second query. Asking for page 1 then gives me 0 results from the first query and 2 (instead of 6) from the second, while filling up the rest from the third query. So clearly, this cannot work from a logical point of view.
Any other ideas?
Edit: we are planning to add Hibernate Search and caching, which might solve this problem externally by making option 1) fast enough. My question was meant to ask for an "internal" solution, i.e. some code I can write today, until we have the external solution in place.
As you have described in point 2, unless you always do a left join between the queries, no one can guarantee that what you retrieved with the first query part is sufficient to generate a page of 10 valid elements.
Implementing logic that keeps fetching elements until the page is complete is certainly more expensive than the single query... especially as the requested page number grows.
I think you have to combine all your queries into a single query.
A solution in this case could be to create a materialized view in your database and apply simpler filters to it.
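As an illustration (PostgreSQL; the view and source-query names are made up, and the statements are issued here through Spring's JdbcTemplate, though a migration script works just as well):

// Pre-combine the three expensive queries once, then page over the view with simple filters.
jdbcTemplate.execute(
    "CREATE MATERIALIZED VIEW combined_results AS "
  + "SELECT id, geom, score FROM expensive_view_one "
  + "UNION ALL SELECT id, geom, score FROM expensive_view_two "
  + "UNION ALL SELECT id, geom, score FROM expensive_view_three");

// Refresh periodically, or after relevant writes.
jdbcTemplate.execute("REFRESH MATERIALIZED VIEW combined_results");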
Having a cache framework can help too.
I am trying to write an algorithm that records the frequency of data searches.
Let's say users can search for different combinations of two entities (Source-Destination). Each time a user searches, I want to store the combination with a count, and if they search for the same combination (Source-Destination) again, I will update the count.
In this case, say there are 1000 users, each user searches for a number of different combinations (Source-Destination), and the data is stored for 30 days.
So the total number of rows would be 100000*30*30 = 90,000,000 (90 million) rows. (Using MySQL.)
Please suggest if there is a better way to write this.
GOAL: I want to get the top 10 search combinations across users at any point in time.
1,000 users and 60,000 rows are nothing by today's standards. Don't even think about it; there is no performance concern whatsoever, so just focus on doing it properly instead of worrying about slowness. There will be no slowness.
The proper way of doing it is to create a table in which each row contains the search terms ([source, destination] in your case) and a count, with a unique index on the [source, destination] pair of columns - which is the same as making those two columns the primary key.
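A sketch with plain JDBC and MySQL's ON DUPLICATE KEY UPDATE (table and column names are assumptions):

// CREATE TABLE search_counts (
//   source      VARCHAR(64) NOT NULL,
//   destination VARCHAR(64) NOT NULL,
//   cnt         BIGINT      NOT NULL DEFAULT 1,
//   PRIMARY KEY (source, destination));
try (PreparedStatement ps = connection.prepareStatement(
        "INSERT INTO search_counts (source, destination, cnt) VALUES (?, ?, 1) "
      + "ON DUPLICATE KEY UPDATE cnt = cnt + 1")) {
    ps.setString(1, source);
    ps.setString(2, destination);
    ps.executeUpdate();
}
// Top 10 at any point in time:
// SELECT source, destination, cnt FROM search_counts ORDER BY cnt DESC LIMIT 10;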
If you had 100,000,000 rows, and performance was critical, and you also had a huge budget affording you the luxury of doing whatever weird thing it takes to make ends meet, then you would perhaps want to do something exotic, like appending each search to an indexless table (allowing the fastest appends possible) and then computing the sums in a nightly batch process. But with fewer than a million rows such an approach would be complete overkill.
Edit:
Aha, so the real issue is the OP's need for a "sliding window". Well, in that case, I cannot see any approach other than saving every single search along with the time it happened, and then, in a batch process, a) computing the sums and b) deleting entries older than the window.
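A sketch of that batch over a 30-day window (MySQL; table and column names are assumptions, and connection is an open JDBC connection):

// Raw events: searches(source, destination, searched_at DATETIME, KEY(searched_at))
try (Statement st = connection.createStatement()) {
    // a) the sums can be recomputed here, or on demand:
    //      SELECT source, destination, COUNT(*) AS cnt FROM searches
    //      GROUP BY source, destination ORDER BY cnt DESC LIMIT 10;
    // b) expire entries that have slid out of the window:
    st.executeUpdate("DELETE FROM searches WHERE searched_at < NOW() - INTERVAL 30 DAY");
}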
I'm creating a hypermedia driven RESTful API that will be used to query transactional data. The intention is that the results will be paginated.
Each API call will query an indexed database table. Since I don't want to keep the results server-side due to memory considerations, I was thinking of retrieving the data based on rownum, depending on which page is requested, e.g. on page one, WHERE rownum <= 10; on page two, WHERE rownum BETWEEN 11 AND 20; etc.
However, the database in question is replicated from a production system and could have records inserted into an area of the result set already requested, e.g. page one is requested -> 10 rows are returned -> a transaction is inserted at row 5. Now page two will include a record already displayed on page one, as the results are essentially shifted by one rownum.
What would be a good way of achieving my objective of creating a hypermedia driven RESTful API that provides paginated transactional data from a database, without holding on to the result sets for the duration of the session?
This is a pretty common problem, and there are not many approaches.
I can think of only three, actually:
1) You don't care, and the result will change. This is the behaviour of Stack Overflow: if you're on page 2 of the questions list and someone posts a new question, clicking on page 3 may show you one or more of the questions that were already listed on page 2, because the index has shifted.
2) If you don't want to keep the actual data in memory, you're in for a lot of trouble. You could store the handle of the result set instead of the results themselves, and loop over it, fetching the number of rows you actually need. E.g. you run the select, fetch 10 rows, and store the handle of the result set; together with the rows, you return a unique ID of the query to the client. The problem comes when a range is specified, because you can't really "rewind" a database cursor - that would mean caching the results, which you may want to do anyway. But if you do it like that, sooner or later you're going to have all of the results in memory anyway.
3) You could still use some memory, but keep only a unique identifier for each row, associated with a unique identifier of the query, as above. This could work, but only if rows may be added, not deleted or updated (if they're updated, they may no longer match the query).
Personally, I'd go with option 1.
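For completeness, a sketch of option 3 (snapshotStore is a hypothetical cache with a TTL; table and column names are assumptions):

// Snapshot the matching row IDs once, keyed by a query ID the client sends back.
List<Long> ids = jdbcTemplate.queryForList(
    "SELECT id FROM transactions ORDER BY id", Long.class);
String queryId = UUID.randomUUID().toString();
snapshotStore.put(queryId, ids);                       // e.g. a TTL cache

// Serving page n: stable regardless of rows inserted after the snapshot was taken.
int pageSize = 10;
int n = 1;                                             // requested page number
int from = n * pageSize;
int to = Math.min(from + pageSize, ids.size());
List<Long> pageIds = ids.subList(from, to);
// then: SELECT * FROM transactions WHERE id IN (:pageIds)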
I'm developing a Google App Engine Java app where users can search business objects from the database based on search criteria.
The search results (a list of records) should not include any of the records (up to a certain number, say 100) from their past searches. I'm storing the past results in the user profile for this reason.
Any suggestions on efficiently implementing this logic (without multiple collection iterations)? I'm using JDO, and there are restrictions on using a 'NOT IN' condition in queries.
Here's a solution, assuming your goal is to get 200 keys that are not already in the history.
I will attempt to estimate the number of operations used, as a proxy for "efficiency", since this is how we will be charged under the new pricing model.
Fetch the User object and "history keys" (1 read operation)
Do a keys-only query and fetch 300 records. (300 small operations)
In your code, subtract any of the history keys from the 300 records. (0 operations)
If you end up with fewer than 200 records after step 3, fetch another 100, and repeat if necessary. (100 small operations)
Once you have 200 keys not seen before, you can fetch the full business-object entities if you need them, or display the keys to the user; these steps are sketched below. (200 read operations if you fetch the entire objects)
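A rough sketch with the low-level datastore API (the kind name, the 200/300 figures, and loadHistoryKeys() are assumptions; the same idea works through JDO):

import com.google.appengine.api.datastore.*;

DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
Set<Key> history = new HashSet<>(loadHistoryKeys(user));           // step 1, one read

List<Key> fresh = new ArrayList<>();
Query q = new Query("BusinessObject").setKeysOnly();               // step 2: keys-only query
Iterator<Entity> it = ds.prepare(q).asIterator(FetchOptions.Builder.withChunkSize(300));
while (it.hasNext() && fresh.size() < 200) {                       // steps 3-4: filter in code,
    Key k = it.next().getKey();                                    // pulling more keys as needed
    if (!history.contains(k)) fresh.add(k);
}
Map<Key, Entity> fullEntities = ds.get(fresh);                     // final fetch, if required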
If the datastore supported a native "NOT IN" operator, then we could shave off 100 small operations from step 2, and skip step 4. The largest cost here will be fetching the actual 200 entities, which would have to happen with or without the NOT IN operator. Ultimately, this method is not that inefficient compared to what a native NOT IN operator would do.
Further optimizations:
If you don't need to display all 200 keys at once, then you can use cursors to get only N results at a time.
I am simply guessing when I suggest fetching 300 keys at first. You may need to fetch more or fewer. You can also probably fetch fewer than 100 on the second attempt.