I want to know if it is possible to change how Spring does pagination, because the default PagingAndSortingRepository behaviour uses LIMIT with OFFSET, which becomes too slow on large datasets.
My data has an indexed column, so there is no problem writing logic like WHERE timestamp > x AND timestamp < y LIMIT 1000; and keeping track of the highest timestamp received. I am just wondering if this is already built into Spring Data JPA, so I could tell it to order by a column and use that rather than OFFSET.
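For context, what the question describes is usually called keyset (or "seek") pagination. Newer Spring Data releases expose it as keyset scrolling (ScrollPosition/WindowIterator), but the underlying idea is simple enough to sketch in plain Java against an in-memory stand-in for the indexed table:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.LongStream;

// Sketch of keyset ("seek") pagination: instead of OFFSET, each page is
// fetched with WHERE timestamp > :lastSeen ORDER BY timestamp LIMIT :n.
// A sorted in-memory list stands in for the indexed table here.
public class KeysetPagination {

    // Equivalent of: SELECT ts FROM events WHERE ts > ? ORDER BY ts LIMIT ?
    static List<Long> nextPage(List<Long> sortedTimestamps, long lastSeen, int pageSize) {
        return sortedTimestamps.stream()
                .filter(ts -> ts > lastSeen)
                .limit(pageSize)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Long> data = LongStream.rangeClosed(1, 10).boxed().collect(Collectors.toList());

        long lastSeen = Long.MIN_VALUE;
        List<Long> page;
        while (!(page = nextPage(data, lastSeen, 4)).isEmpty()) {
            System.out.println(page);
            lastSeen = page.get(page.size() - 1); // track the highest key received
        }
    }
}
```

Because each page's WHERE clause can be answered directly from the index, the cost per page stays constant instead of growing with the offset.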
https://electrictoolbox.com/mysql-limit-slow-large-offset/
There is no magic solution for this.
Yes, it can be slow. But I find that this is not a sign of a problem with paging itself, but rather a sign that paging is the wrong tool for the job.
I find that using paging to navigate across a huge dataset is a bad approach. I'd suggest limiting results to a relatively small number of pages, say 20. If you want to navigate across a bigger volume of data, consider properly splitting/structuring the data in your requests. For instance, if your data can be filtered by time, add the appropriate filters to your request (and the corresponding GUI elements if it is triggered from a GUI).
Ask the user to first select a time period, like from/to years, dates, or hours, depending on how much data you have, and use paging for that query (which will return a relatively small number of pages).
Alternatively, no matter what filter/search criteria the user defines, check the number of pages on the first request. If it exceeds some limit (say 20 pages), throw an exception and ask the user to refine the criteria, repeating until the number of pages is within the limit. Paging with such a filter can be much faster.
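A minimal sketch of that guard, assuming a hypothetical PAGE_SIZE and MAX_PAGES, with the total count coming from a COUNT(*) that uses the same WHERE clause as the query:

```java
// Sketch of the "refuse overly broad queries" guard described above:
// count matches first, and reject the request if it would produce more
// pages than some cap, forcing the caller to narrow the filter.
public class PageLimitGuard {
    static final int PAGE_SIZE = 50;
    static final int MAX_PAGES = 20;

    // totalCount would come from a COUNT(*) with the same WHERE clause
    static int requirePageCount(long totalCount) {
        int pages = (int) ((totalCount + PAGE_SIZE - 1) / PAGE_SIZE); // ceiling division
        if (pages > MAX_PAGES) {
            throw new IllegalArgumentException(
                    "Query matches " + pages + " pages; please refine the filter");
        }
        return pages;
    }

    public static void main(String[] args) {
        System.out.println(requirePageCount(730)); // 15 pages: accepted
        try {
            requirePageCount(100_000);             // 2000 pages: rejected
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```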
I'm creating a web application with a React frontend and a Java backend. The frontend and backend communicate with each other via REST. On the UI I show a list of items, and I need to filter them by some parameters.
Option 1: filter logic on the frontend
In this case I just make a GET call to the backend and fetch all items.
After the user chooses a filter option, the filtering happens on the UI.
Pros: I don't need to send data to the backend and wait for a response, so refreshing the list should be faster.
Cons: if I later need multiple frontend clients, say a mobile app, then I have to implement the filters again in that app too.
Option 2: filter logic on the backend
In this case I get all list items when the app loads. After the user changes the filter options, I send a GET request with the filter params, wait for the response, and then update the list of items on the UI.
Pros: the filter logic is written only once.
Cons: it will probably be slower, because it takes time to send the request and get the result back.
Question: where should the filter logic live, the frontend or the backend? What is best practice?
Filter and limit on the back end. If you had a million records and a hundred thousand users trying to access those records at the same time, would you really want to send a million records to every user? It would kill your server and the user experience: waiting for a million records to come back from the back end, and then be processed on the front end, would take ages compared to fetching just 20-100 records and clicking a pagination button to retrieve the next 20-100. On top of that, filtering a million records on the front end would, again, take a very long time and ultimately not be very practical.
From a real-world standpoint, most websites have some sort of record limit: eBay shows 50-200 records, Amazon about 20, Target about 20, etc. This ensures quick server responses and a smooth user experience for every user.
This depends on the size of your data.
For example, if you have a large amount of data, it is better to implement the filter logic on the backend and let the database perform the operations.
If you have a small amount of data, you can do the filtering on the frontend after fetching it.
Let us understand this with an example.
Suppose you have an entity with 100,000 records and you want to show it in a grid.
In this case it is better to fetch 10 records per call and show them in the grid.
Any filter operation on this should be a query against the database on the backend.
If, on the other hand, your entity has just 1,000 records, it can be simpler to fetch all the data and do the filtering on the frontend.
Most likely begin with the frontend (unless you're dealing with huge amounts of data):
1. Implement filtering on the frontend (unless for some reason it's easier to do it on the backend, which I find unlikely).
2. Iterate until the filtering functionality is reasonably stable.
3. Analyze your traffic and see whether it makes sense to put the effort into backend filtering: what percentage of requests are actually filtered, and what savings would backend filtering bring?
4. Implement (or don't) backend filtering depending on the results of #3.
As a personal note, the accepted answer is terrible advice:
Regarding "If you had a million records, and a hundred thousand users trying to access those records at the same time": nothing forces those hundred thousand users to use filtering, so your system should be able to handle that doomsday scenario anyway. Backend filtering should be just an optimization, not the solution.
Once you do filtering on the backend, you'll probably want pagination as well; this is not a trivial feature if you want consistent results.
Backend filtering is likely to become much more complex than frontend filtering; be aware that you're going to spend a significant amount of time on it (not only the initial implementation but also ongoing maintenance) and ask yourself whether it's premature optimization.
TL;DR: do whichever is easier for you and don't worry about it until it makes sense to start optimizing.
It depends on the specific requirements of your application, but in my opinion the safer bet would be the back-end.
Considering you need filtering in the first place, I assume you have enough data that paging through it is required. In that case, the filtering needs to be on the back-end.
Let's say you have a page size of 20. After applying the filter, you would expect a page of 20 entities that match the filtering criteria in the UI. That can't be achieved if you fetch 20 entities, store them in the front-end, and apply the filter to them afterwards.
Also, if you have enough data, fetching all of it into the front-end will be impossible due to memory constraints.
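To make the page-size point above concrete, here is a small self-contained illustration (the "keep even numbers" filter and the numeric items are just placeholders for real entities and criteria):

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Illustrates why filtering a single fetched page on the client breaks
// paging: the user expects a full page of matches, but filtering after
// the fetch leaves only however many matches happened to be in that page.
public class PageFilterDemo {

    // Front-end style: fetch one page, then filter it (keep even numbers)
    static List<Integer> filterAfterFetch(List<Integer> all, int pageSize) {
        return all.subList(0, pageSize).stream()
                .filter(n -> n % 2 == 0).collect(Collectors.toList());
    }

    // Back-end style: filter the whole dataset first, then take a page
    static List<Integer> filterThenPage(List<Integer> all, int pageSize) {
        return all.stream()
                .filter(n -> n % 2 == 0).limit(pageSize).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> allItems = IntStream.rangeClosed(1, 1000).boxed().collect(Collectors.toList());
        System.out.println(filterAfterFetch(allItems, 20).size()); // 10: a short, surprising page
        System.out.println(filterThenPage(allItems, 20).size());   // 20: a full page
    }
}
```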
I am working on a web app backed by Amazon DynamoDB, and I want users to be able to jump directly to the Nth page of item info.
I have been told that pagination in DynamoDB is based on the last evaluated key rather than limit/offset; it doesn't natively support offset (see DynamoDB Scan / Query Pagination).
Does that mean that to get to the 10th page of items, I have to query through the 9 pages before it first? (That seems like a really poor solution.)
Is there an easier way to do it?
You are right: DynamoDB doesn't support a numerical offset. The only way to paginate is to use the LastEvaluatedKey parameter when making a request. You still have some good options for achieving page-number pagination.
Fast Cursor
You can make fast pagination requests by discarding the full result and fetching only the keys. You are limited to 1 MB per request, but that represents a large number of keys! Using this, you can move your cursor to the required position and then start reading full objects.
This solution is acceptable for small/medium datasets. You will run into performance and cost issues on large datasets.
Numerical index
You can also create a global secondary index to paginate your dataset. For example, you can add an offset property to all your objects and query the global index directly to get the desired page.
Obviously this only works if you don't use any custom filter, and you have to maintain the value when inserting/deleting/updating objects. So this solution is only good for an append-only dataset.
Cached Cursor
This solution builds on the first one, but instead of fetching the keys every single time, you cache the page positions and reuse them for other requests. A cache tool like Redis or Memcached can help you achieve that:
You check the cache to see if the pages have already been calculated.
If not, you scan your dataset fetching only keys, then store the starting key of each page in your cache.
You fetch the full objects for the desired page.
Choose the solution that fits your needs. I hope this will help you :)
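As a rough illustration of the fast/cached cursor ideas, here is an in-memory simulation; a real implementation would issue keys-only Scan/Query requests with ExclusiveStartKey and keep the page-start keys in Redis or Memcached:

```java
import java.util.HashMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// In-memory sketch of last-key pagination plus a "cached cursor":
// a cheap keys-only pass records the starting key of each page, so a
// later request can jump straight to page N with one full fetch.
public class CursorPagination {

    // Stand-in for a keys-only Scan: returns up to pageSize keys after startKey
    static List<String> keysAfter(List<String> sortedKeys, String startKey, int pageSize) {
        List<String> out = new ArrayList<>();
        for (String k : sortedKeys) {
            if (startKey == null || k.compareTo(startKey) > 0) {
                out.add(k);
                if (out.size() == pageSize) break;
            }
        }
        return out;
    }

    // Build the cache: page number -> exclusive start key for that page
    static Map<Integer, String> buildPageStartCache(List<String> sortedKeys, int pageSize) {
        Map<Integer, String> cache = new HashMap<>();
        cache.put(1, null); // page 1 starts from the beginning
        String last = null;
        int page = 1;
        List<String> keys;
        while (!(keys = keysAfter(sortedKeys, last, pageSize)).isEmpty()) {
            last = keys.get(keys.size() - 1); // the "LastEvaluatedKey"
            cache.put(++page, last);
        }
        return cache;
    }

    public static void main(String[] args) {
        List<String> keys = List.of("a", "b", "c", "d", "e", "f", "g", "h", "i");
        Map<Integer, String> cache = buildPageStartCache(keys, 3);
        // Jump straight to page 3 using the cached start key
        System.out.println(keysAfter(keys, cache.get(3), 3)); // [g, h, i]
    }
}
```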
I am trying to write an algorithm that records frequent searches.
Let's say a user can search different combinations of two entities (source-destination). Each time a user searches, I want to store the combination with a count, and if the same combination (source-destination) is searched again, I will update the count.
In this case, if there are 1,000 users, each searching different combinations (source-destination), and data is stored for 30 days,
the total number of rows will be 100000*30*30 = 13,500,000 rows (using MySQL).
Please suggest a better way to implement this.
GOAL: I want to get the top 10 search combinations of users at any point in time.
1,000 users and 60,000 rows are nothing by today's standards. Don't even think about it, there is no performance concern whatsoever, so just focus on doing it properly instead of worrying about slowness. There will be no slowness.
The proper way to do it is to create a table in which each row contains the search terms ([source, destination] in your case) and a sum, with a unique index on the [source, destination] pair of columns. Which is the same as making those two columns the primary key.
If you had 100,000,000 rows, and performance was critical, and you also had a huge budget affording you the luxury to do whatever weird thing it takes to make ends meet, then you would perhaps want to do something exotic, like appending each search to an indexless table (allowing the fastest appends possible) and then computing the sums in a nightly batch process. But with less than a million rows such an approach would be complete overkill.
Edit:
Aha, so the real issue is the OP's need for a "sliding window". In that case, I cannot see any approach other than saving every single search along with the time it happened, and then, in a batch process, (a) computing the sums and (b) deleting entries older than the window.
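A minimal in-memory sketch of the sum-per-pair idea; in MySQL this would be a table with a UNIQUE KEY on (source, destination) and an INSERT ... ON DUPLICATE KEY UPDATE cnt = cnt + 1, and the route names below are made up for illustration:

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// One entry per (source, destination) pair with a count bumped on every
// search, plus a "top N" query over the counts.
public class SearchCounter {
    private final Map<String, Long> counts = new HashMap<>();

    void recordSearch(String source, String destination) {
        counts.merge(source + "->" + destination, 1L, Long::sum);
    }

    // Equivalent of: SELECT ... ORDER BY cnt DESC LIMIT n
    List<String> top(int n) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        SearchCounter c = new SearchCounter();
        c.recordSearch("NYC", "LAX");
        c.recordSearch("NYC", "LAX");
        c.recordSearch("SFO", "SEA");
        System.out.println(c.top(2)); // [NYC->LAX, SFO->SEA]
    }
}
```

For the sliding-window variant you would additionally store a timestamp per search event and recompute the counts in a batch, as the edit above describes.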
There are more than 12,000 records fetched from the database and displayed page-wise using pagination. Now I have a search box in the UI which searches across all the records (about 12,000) across all pages, but this search sometimes takes a while.
Could you please help me make this search faster?
Consider these options:
Instead of searching after you have fetched all of them, write a smarter query that reduces the number of candidates. Maybe your situation is simple enough that the query can represent the full search?
Multithreaded searching.
Split your 12,000 records into as many blocks as there are available processor cores (using Runtime.getRuntime().availableProcessors()) and launch a thread for each block to do the work.
If your search is heavy per object, see whether there is a cheaper search method, for example by only looking at a couple of important fields, so the job can be done quickly. A deeper search could then run in another thread, adding results as they are found.
This option is rather hard to do, but you could implement a technique that searches for candidates while the user is still typing the search terms. Every three characters they type, filter the current set of candidates. This could narrow 12,000 down to maybe 4,000, then to only 100 after the next three characters, and so on. This, of course, depends on the situation.
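A sketch of the multithreaded option above, splitting the records into one block per core with an ExecutorService; the substring match stands in for whatever per-record search you actually do:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Split the records into one block per available core and search the
// blocks in parallel; results are joined back in block order.
public class ParallelSearch {

    static List<String> search(List<String> records, String term) throws Exception {
        int cores = Runtime.getRuntime().availableProcessors();
        int blockSize = (records.size() + cores - 1) / cores; // ceiling division
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (int start = 0; start < records.size(); start += blockSize) {
                List<String> block = records.subList(start, Math.min(start + blockSize, records.size()));
                futures.add(pool.submit(() -> block.stream()
                        .filter(r -> r.contains(term)) // stand-in for the real per-record search
                        .collect(Collectors.toList())));
            }
            List<String> hits = new ArrayList<>();
            for (Future<List<String>> f : futures) hits.addAll(f.get());
            return hits;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> records = IntStream.range(0, 12_000)
                .mapToObj(i -> "record-" + i).collect(Collectors.toList());
        System.out.println(search(records, "record-1199").size()); // 11 matches
    }
}
```

Note that for a dataset this small, the thread-pool overhead may outweigh the speedup unless the per-record check is genuinely expensive.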
There are many ways to do search:
1) SQL search: use the same SQL statement that fetched the 12,000 records and append a WHERE clause from Java code. Prefer searching on indexed fields, or add an index at the DB level on the searchable fields.
2) Full-text search: some technologies allow you to index your records with a full-text index; you can read more about this here.
In my Google App Engine app I need to fetch and return a potentially large number of entities from a datastore query in response to a service call GET request. This call may return potentially thousands of entities and MBs of serialized data.
The first portion of the response packet communicates how many entities are in the serialized results, followed by all of the serialized entities. Currently I am iterating through all the entities in the query with a QueryResultIterator, up to a maximum page size limit, after which I return a cursor that can be used to continue fetching where the previous call left off (if the maximum was reached and there are still results in the query). As I iterate through the results, I save them in a list. Once I've either exhausted the query results or reached the maximum page size, I can get the number of entities from the size of this list. But then I have to iterate through this list again to serialize each of the entities and write the results to the response output stream.
I don't know whether this is the most efficient way to perform this operation. Is there a way to get the number of entities in a query's results before actually iterating through them all or fetching them directly into a list? (The list method doesn't work anyway because I'm using cursors, which require the use of QueryResultIterator.)
QueryResultIterator has a method getIndexList(). Would this be a less costly way to get the number of entities in the query's results? I'm assuming this list would contain exactly one index object for each entity in the results. Also, I'd need the list to only contain the indexes for the entities after the current cursor position of the iterator. Is my understanding correct, or would this method not do what I think it would?
A list of just indexes would require much less memory than a list of whole entities. However, I don't know whether this list would be limited by the query's prefetch or chunk sizes, or whether I'd want to use the query's limit parameter at all, because I'm only interested in knowing how many entities are in the results up to the maximum page size plus one (to know there are still more results and that a cursor should be provided).
Currently I'm setting the prefetch and chunk size (to the size of my page limit), but I'm not using the limit or offset parameters since I'm using cursors instead. From what I understand cursors are preferable to offset/limit. Would setting the limit parameter affect continuing a query with a cursor?
Clearly I have quite a few questions as to how GAE datastore queries work and how they're affected by changing parameters. So any insights are appreciated. The documentation for App Engine APIs is often sparse, as in one sentence descriptions of methods stating pretty much what can be deduced from the method signature. They don't generally go into much detail otherwise. Maybe the way I'm doing it currently is just fine after all. It works as is, but I'm trying to optimize the service call to get the best response time possible for my client application.
UPDATE: By the way, I am using Objectify v3 in my app and to perform this query. There are several places where I am required to use the low-level datastore API, including geo-location queries (with geomodel) and projection queries (which aren't supported in Objectify v3). So if there is a good way to do this using Objectify, that would be ideal. Otherwise I can use the low-level API, but it's always messier this way.
Both the low-level api and Objectify have a count() method (look at the javadocs for details). However, counting can be a very expensive and lengthy operation - it costs 1 small op for every number returned. For example, count() returning 5000 costs 5000 small ops (plus 1 read for the query), and takes as long as it would take to do a keys-only scan of all 5000 (which is what GAE actually does).
If you absolutely must have an exact count, you probably need to aggregate this value yourself by incrementing/decrementing a (possibly sharded) counter. This gets very tricky when you are dealing with filtered queries.
There is no one right solution here. Google searches give you totals like "About 119,000,000 results" which are deliberately inexact and almost certainly precalculated. For smaller result sets, using count() can be acceptable - but you might want to apply a limit() so that you never break the bank. You can always say "More than 500 results..."
If you want to fetch the number of records, you can use the following code:
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.FetchOptions;
import com.google.appengine.api.datastore.Query;

Query qry = new Query("EntityName");
DatastoreService datastoreService = DatastoreServiceFactory.getDatastoreService();
int totalCount = datastoreService.prepare(qry).countEntities(FetchOptions.Builder.withDefaults());
And if you want a filter, you can use:
qry.addFilter("firstName", FilterOperator.EQUAL, firstName);
I hope it helps.