MS SQL SERVER Reading large data

MS SQL SERVER Reading large data - java

Hi, I use MS SQL Server 2016 and I need to read a large amount of data.
I want to use adaptive buffering to read, for example, the first 1000 rows.
My procedure returns 1 million rows, but I don't want my application to run out of memory, so I want to read the data with adaptive buffering.
I have only seen this described for large column values, but I want it for rows. Could you help me?

Your application and users don't want a million rows. Why would your procedure try to return them all? You want to use pagination.
For example, in SQL Server you can do this using OFFSET/FETCH:
CREATE PROCEDURE dbo.PageByPage
    @PageNumber int = 1,
    @PerPage int = 100
AS
BEGIN
    SET NOCOUNT ON;
    WITH the_keys AS
    (
        SELECT key_column
        FROM dbo.table_name
        ORDER BY key_column
        OFFSET @PerPage * (@PageNumber - 1) ROWS
        FETCH NEXT @PerPage ROWS ONLY
    )
    SELECT t.all_columns ...
    FROM dbo.table_name AS t
    INNER JOIN the_keys AS k
        ON t.key_column = k.key_column
    ORDER BY t.key_column;
END
GO
Now your application keeps track of which page you're on, and passes the next or previous page as the user navigates through pages.
The reason for the CTE is to make the initial search as narrow as possible, since your actual query probably is more complex (Aaron Bertrand talks about this approach here).
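On the Java side, calling the procedure one page at a time might look like the sketch below. It assumes that key_column is numeric and that it appears under that name in the procedure's result set (the post does not say either); the class and method names are made up for illustration.

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical client for dbo.PageByPage; all names here are illustrative only.
final class PageByPageClient {

    // Same arithmetic as the procedure's OFFSET clause: PerPage * (PageNumber - 1).
    static long offsetFor(int pageNumber, int perPage) {
        if (pageNumber < 1 || perPage < 1) {
            throw new IllegalArgumentException("pageNumber and perPage must be >= 1");
        }
        return (long) perPage * (pageNumber - 1);
    }

    // Fetches the keys of one page; the procedure returns at most perPage rows per call.
    // Assumption: the result set has a numeric column named key_column.
    static List<Long> fetchPageKeys(Connection conn, int pageNumber, int perPage)
            throws SQLException {
        List<Long> keys = new ArrayList<>(perPage);
        try (CallableStatement call = conn.prepareCall("{call dbo.PageByPage(?, ?)}")) {
            call.setInt(1, pageNumber);
            call.setInt(2, perPage);
            try (ResultSet rs = call.executeQuery()) {
                while (rs.next()) {
                    keys.add(rs.getLong("key_column"));
                }
            }
        }
        return keys;
    }
}
```

The application would call fetchPageKeys with pageNumber 1, 2, 3, and so on, stopping when a call returns fewer rows than perPage.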

Related

Select all records from offset to limit using a postgres index

I want to get all data from offset to limit from a table with about 40 columns and 1.000.000 rows. I tried to index the id column via postgres and get the result of my select query via java and an entitymanager.
My query needs about 1 minute to get my results, which is a bit too long. I tried to use a different index and also limited my query down to 100 but still it needs this time. How can i fix it up? Do I need a better index or is anything wrong with my code?
CriteriaQuery<T> q = entityManager.getCriteriaBuilder().createQuery(Entity.class);
TypedQuery<T> query = entityManager.createQuery(q);
List<T> entities = query.setFirstResult(offset).setMaxResults(limit).getResultList();

Right now you probably do not utilize the index at all. There is some ambiguity how a hibernate limit/offset will translate to database operations (see this comment in the case of postgres). It may imply overhead as described in detail in a reply to this post.
If you have a direct relationship of offset and limit to the values of the id column you could use that in a query of the form
SELECT e
FROM Entity e
WHERE e.id >= :offset AND e.id < :offset + :limit
Provided the number of records requested is significantly smaller than the total number of records in the table, the database will use the index.
The next thing is, that 40 columns is quite a bit. If you actually need significantly less for your purpose, you could define a restricted entity with just the attributes required and query for that one. This should take out some more overhead.
If you're still not within your performance requirements, you could choose to use a plain JDBC connection/query instead of Hibernate.
By the way, you could log the actual SQL issued by JPA/Hibernate and use it to get an execution plan from Postgres. This will show you what the query actually looks like and whether an index is used. Further, you could monitor the database's query execution times to get an idea of which fraction of the processing time is consumed by the database and which by your Java client plus data transfer overhead.

There is also a technique to mimic offset+limit paging, using paging based on the key of each page's first record.
Map<Integer, String> mapPageTopRecNoToKey = new HashMap<>();
Then search records >= page's key and load page size + 1 records to find the next page.
Going from page 1 to page 5 would take a bit more work but would still be fast.
This of course is a terrible kludge, but the technique at that time indeed was a speed improvement on some databases.
In your case it would be worth specifying the needed fields in jpql: select e.a, e.b is considerably faster.
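The key-based paging described above can be sketched in memory as follows. This is only an illustration under assumptions: a NavigableSet stands in for the indexed key column, the map is keyed by page number, and the class name KeysetPager is invented. Against a real database, the tailSet call would be a query of the form "key >= ? ORDER BY key" limited to page size + 1 rows.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NavigableSet;

// Hypothetical in-memory model of paging by the key of each page's first record.
final class KeysetPager {
    private final NavigableSet<String> keys;   // stand-in for an indexed key column
    private final int pageSize;
    // Page number -> key of the first record on that page.
    private final Map<Integer, String> mapPageTopRecNoToKey = new HashMap<>();

    KeysetPager(NavigableSet<String> keys, int pageSize) {
        if (pageSize < 1) {
            throw new IllegalArgumentException("pageSize must be >= 1");
        }
        this.keys = keys;
        this.pageSize = pageSize;
        if (!keys.isEmpty()) {
            mapPageTopRecNoToKey.put(1, keys.first());
        }
    }

    // Returns the rows of the requested page, or an empty list if it does not exist.
    List<String> page(int pageNo) {
        if (pageNo < 1) {
            throw new IllegalArgumentException("pageNo must be >= 1");
        }
        // Start from the nearest earlier page whose first key is already known.
        int known = pageNo;
        while (known > 1 && !mapPageTopRecNoToKey.containsKey(known)) {
            known--;
        }
        List<String> rows = Collections.emptyList();
        for (int p = known; p <= pageNo; p++) {
            String topKey = mapPageTopRecNoToKey.get(p);
            if (topKey == null) {
                return Collections.emptyList();   // walked past the last page
            }
            rows = loadFrom(p, topKey);
        }
        return rows;
    }

    // Loads pageSize + 1 records starting at topKey; the extra one starts the next page.
    private List<String> loadFrom(int pageNo, String topKey) {
        List<String> rows = new ArrayList<>(pageSize);
        for (String key : keys.tailSet(topKey, true)) {
            if (rows.size() == pageSize) {
                mapPageTopRecNoToKey.put(pageNo + 1, key);
                break;
            }
            rows.add(key);
        }
        return rows;
    }
}
```

Jumping several pages ahead walks through the intermediate pages once to learn their first keys, which is the "bit more work" mentioned above; later visits to those pages are direct.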

Efficient way to check whether a large number of strings exist in the database

I have a very large table in the database, the table has a column called
"unique_code_string", this table has almost 100,000,000 records.
Every 2 minutes, I will receive 100,000 code string, they are in an array and they are unique to each other. I need to insert them to the large table if they are all "good".
The meaning of "good" is this:
All 100,000 codes in the array never occur in the database large table.
If one or more codes already occur in the large table, the whole array is not used at all;
that is, none of the codes in the array are inserted into the large table.
Currently, I do it this way:
First, I loop over the array and check each code to see whether the same code already exists in the large table.
Second, if all codes are new, I do the real insert.
But this is very slow, and I must finish everything within 2 minutes.
I am thinking of other ways:
Join the 100,000 codes into a SQL IN clause. Each code is 32 characters long, and I doubt any database will accept an IN clause of 32*100,000 characters.
Use a database transaction: force the insert anyway and, if an error happens, roll the transaction back. This causes some performance issues.
Use a database temporary table. I am not good at writing SQL queries, so please give me an example if this idea may work.
Now, can any experts give me some advice or some solutions?
I am a non-English speaker, I hope you see the issue I am meeting.
Thank you very much.

Load the 100,000 rows into a table!
Create a unique index on the original table:
create unique index unq_bigtable_uniquecodestring on bigtable (unique_code_string);
Now, you have the tools you need. I think I would go for a transaction, something like this:
insert into bigtable ( . . . )
select . . .
from smalltable;
If any row fails (due to the unique index), then the transaction will fail and nothing is inserted. You can also be explicit:
insert into bigtable ( . . . )
select . . .
from smalltable
where not exists (select 1
from smalltable st join
bigtable bt
on st.unique_code_string = bt.unique_code_string
);
For this version, you should also have an index/unique constraint on smalltable(unique_code_string).

It's hard to find an optimal solution with so little information. Often this depends on the network latency between application and database server and hardware resources.
You can load the 100,000,000 unique_code_string from the database and use HashSet or TreeSet to de-duplicate in-memory before inserting into the database. If your database server is resource constrained or there is considerable network latency this might be faster.
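A minimal sketch of that in-memory check, assuming the existing codes fit in the JVM heap and that the caller performs the actual database insert when the batch is accepted. The class and method names are hypothetical.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

// Hypothetical all-or-nothing checker backed by a HashSet of known codes.
final class CodeBatchChecker {
    private final Set<String> knownCodes;

    CodeBatchChecker(Collection<String> existingCodes) {
        this.knownCodes = new HashSet<>(existingCodes);
    }

    // Returns true and remembers the whole batch only if none of its codes is known yet.
    // If any code already exists, nothing is added and false is returned.
    boolean acceptIfAllNew(Collection<String> batch) {
        for (String code : batch) {
            if (knownCodes.contains(code)) {
                return false;
            }
        }
        knownCodes.addAll(batch);
        return true;
    }

    int size() {
        return knownCodes.size();
    }
}
```

Each lookup is a constant-time hash probe on average, so checking 100,000 codes is cheap once the set is loaded; the cost is the memory needed to hold every existing code.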
Depending on how you receive the 100,000-record delta, you could load it into the database directly; for example, a CSV file can be read using an external table. If you can get the data efficiently into a temporary table and the database server is not overloaded, you can do it very efficiently with SQL or a stored procedure.
You should spend some time to understand how real-time the update has to be e.g. how many SQL queries are reading the 100,000,000 row table and can you allow some of these SQL queries to be cancelled or blocked while you update the rows. Often it's a good idea to create a shadow table:
Create new table as copy of the existing 100,000,000 rows table.
Disable the indexes on the new table
Load the delta rows to the new table
Rebuild the indexes on new table
Delete the existing table
Rename the new table to the existing 100,000,000 rows table
The approach here is database-specific. It will depend on how your database defines its indexes; for example, if you have a partitioned table, it might not be necessary.

Pagination of data issue: java + oracle sql

I have a requirement for pagination of data (items) retrieved from a database.
The UI also contains search options, and the amount and order of the data depend on the search criteria.
Let's say a client sends a request with some search criteria and gets 60 results. The client sees the items from 1 to maxPageSize (25 by default). If the 2nd page is requested, items 26-50 will be shown.
The problem is that at the moment I can't get the total number of results, so I can't display the number of the last page.
I see 2 solutions for this problem:
Query database second time with the same parameters but without
pagination and get count of the items.
Retrieve all items from DB,
filter them on backend code by search criteria and send to client.
The questions are:
1) Which of the operations is less expensive in general?
2) What else can be done to solve this kind of task if there is better solution?
P.S. The backend code is written in Java, and queries are sent via JDBC to an Oracle 11g DB.
---EDIT---
I've solved this problem this way:
WITH FINAL_RESULT AS
(SELECT SORTED_ITEMS.*,
ROWNUM RN
FROM (sorted basic query with searches))
SELECT FINAL_RESULT.*,
(SELECT COUNT(*) FROM FINAL_RESULT) ITEMS_COUNT
FROM FINAL_RESULT
WHERE RN BETWEEN ? AND ?

The second solution would be quite expensive if there is a large amount of data in the database.
However, the first solution is quite suitable with some tweaks. You don't need to query the database a second time with the same parameters; instead, the server should send the TOTAL_COUNT in each request and the value should be cached.
If the count hasn't changed, there would be no load to the Database because of caching.
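Once the total count is available (for example from the ITEMS_COUNT column in the question's query), the number of the last page and the two bounds for "RN BETWEEN ? AND ?" are simple arithmetic. The helper below is a sketch with invented names, assuming 1-based page numbers and 1-based row numbers as in the question.

```java
// Hypothetical helper; the names are illustrative and not taken from the post.
final class PageWindow {
    final long firstRow;   // lower bound for RN BETWEEN ? AND ?
    final long lastRow;    // upper bound for RN BETWEEN ? AND ?

    private PageWindow(long firstRow, long lastRow) {
        this.firstRow = firstRow;
        this.lastRow = lastRow;
    }

    // Number of pages needed for itemsCount items (ceiling division).
    static long pageCount(long itemsCount, int pageSize) {
        if (itemsCount < 0 || pageSize < 1) {
            throw new IllegalArgumentException("itemsCount must be >= 0 and pageSize >= 1");
        }
        return (itemsCount + pageSize - 1) / pageSize;
    }

    // Row-number bounds of a 1-based page.
    static PageWindow forPage(long pageNumber, int pageSize) {
        if (pageNumber < 1 || pageSize < 1) {
            throw new IllegalArgumentException("pageNumber and pageSize must be >= 1");
        }
        long first = (pageNumber - 1) * pageSize + 1;
        return new PageWindow(first, first + pageSize - 1);
    }
}
```

With the numbers from the question, 60 results at 25 per page give 3 pages, and page 2 covers rows 26 to 50.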

load ACTUAL data of SELECT query in chunks (without execution of same query again)

Suppose, if I execute one select query (HQL) and it gives 100K rows as result. I would like to know if there is a way to load in java(or any other language) those 100k in 1K of chunks After query is finished executing.
The reason for me to break it in chunks is - I do not know where exactly those 100K results will be stored, while i perform processing on them in java. But i would like to use lesser memory consumption.
execute query(hibernate criteria with hql) (suppose 100K row results)
pick first 1K in them (without loading other 99K in memory of JVM or somewhere, like lazy loading in hibernate)
process
pick next 1K
repeat from (2)
update- I do not want to hit the query again.
either i am not able to understand any of the answers or you people aren't able to understand my question

First, split your query into two queries.
The first one gets the count of the result set by changing the SELECT part to something like SELECT COUNT(o) FROM Object o.
The second is your existing query without changes.
Then run the count query first and request a single result. It will directly be a Long value holding the result size.
Then calculate your iterations:
long pages = (count + 999) / 1000; // ceiling division; Math.ceil(count/1000) would truncate first and does not return a long
Last but not least, iterate over the calculated pages and run your query, setting the offset and limit before getting the result.
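The loop over pages can be sketched generically as below. The PageFetcher interface is a made-up abstraction standing in for "run the existing query with this offset and limit" (with JPA that would be setFirstResult and setMaxResults); nothing here is specific to Hibernate.

```java
import java.util.List;
import java.util.function.Consumer;

// Hypothetical chunked reader; only one chunk is handed to the processor at a time.
final class ChunkedReader {

    // Stand-in for running the original query with an offset and a limit.
    interface PageFetcher<T> {
        List<T> fetch(long offset, int limit);
    }

    // Calls the fetcher once per chunk and returns the number of chunks processed.
    static <T> long forEachChunk(long count, int chunkSize,
                                 PageFetcher<T> fetcher, Consumer<List<T>> processor) {
        if (count < 0 || chunkSize < 1) {
            throw new IllegalArgumentException("count must be >= 0 and chunkSize >= 1");
        }
        long pages = (count + chunkSize - 1) / chunkSize;   // ceiling division
        for (long page = 0; page < pages; page++) {
            processor.accept(fetcher.fetch(page * chunkSize, chunkSize));
        }
        return pages;
    }
}
```

Note that this does run the query once per chunk, which the asker wanted to avoid; it bounds memory use, but it is not a single-execution streaming read.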

Display 100000 records on browser / multiple pages

I would like to display 100000 records on browser / multiple pages with minimal impact on memory. ie Per page 100 records.
I would like to move page back and forth. My doubts are
1. Can I maintain all the record inside the memory ? Is this good Idea ?
2) Can I make database connection/query for ever page ? If so how do write a query?
Could anyone please help me..

It's generally not a good idea to keep so many records in memory. If the application is accessed by several users at the same time, the memory impact will be huge.
I don't know which DBMS you are using, but in MySQL and several others, you can rely on the DB for pagination with a query such as:
SELECT * FROM MyTable
LIMIT 0, 100
The first number after limit is the offset (how many records it will skip) and the second is the number of records it will fetch.
Bear in mind that this SQL does not have the same syntax on every DB (some don't even support it).

I would not hold the data in memory (either in the browser or in the serving application). Instead I'd page through the results using SQL.
How you do this can be database-specific. See here for one example in MySql. Mechanisms will exist for other databases.

1) No, having all the records in memory kind of defeats the point of having a database. Look into having a scrollable result set, that way you can get the functionality you want without having to play with the SQL. You can also adjust how many records are fetched at a time so that you don't load more records than you need.
2) Db connections are expensive to create and destroy but any serious system will pool the connections so the impact on performance won't be that great.
If you want to get a bit more fancy you can do away with pages altogether and just load more records as the user scrolls through the list.

It would not be a good idea, as you are making the browser executable hold all of that.
When I had to do something like this, I used JavaScript to render the page and made Ajax calls to get the next page. There is a slight delay in displaying the next table while you fetch it, but users are used to that.
If you are showing 100 records per page, use JSON to pass the data from the server, as JavaScript can parse it quickly, and then use innerHTML to insert the HTML, as building tables through the DOM is much slower.

As mentioned by others here, it is not a good idea to store a large list of results in memory. Querying for the results of each page is certainly a much better approach. To do that you have two options. One is to use whatever database-specific features your DBMS provides for targeting a specific subsection of results from a query. The other approach is to use the generic methods provided by JDBC to achieve the same effect. This keeps your code from being tied to a specific database:
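// get a ResultSet from some query; it must be scrollable,
// because absolute() is not supported on forward-only result sets
ResultSet results = ...
if (count > 0) {
    results.setFetchSize(count + 1);
    results.setFetchDirection(ResultSet.FETCH_FORWARD);
    // position the cursor so that next() lands on the first row of the requested page
    results.absolute(count * beginIndex);
}
for (int rowNumber = 0; results.next(); ++rowNumber) {
    // stop after exactly count rows (rowNumber is zero-based)
    if (count > 0 && rowNumber >= count) {
        break;
    }
    // process the ResultSet below
    ...
}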
Using a library like Spring JDBC or Hibernate can make this even easier.

In many SQL dialects there is a notion of LIMIT (MySQL, ...) or OFFSET (MSSQL).
You can use this kind of clause to limit the rows per page.

Depends on the data. 100k ints might not be too bad if you are caching that.
T-SQL has SET ROWCOUNT 100 to limit the number of records returned.
But to do it right and return the total number of pages, you need a more advanced paging sproc.
It's a pretty hotly debated topic and there are many ways to do it.
Here's a sample of an old sproc I wrote
CREATE PROCEDURE Objects_GetPaged
(
    @sort VARCHAR(255),
    @Page INT,
    @RecsPerPage INT,
    @Total INT OUTPUT
)
AS
SET NOCOUNT ON
--Create a temporary table
CREATE TABLE #TempItems
(
    id INT IDENTITY,
    memberid int
)
INSERT INTO #TempItems (memberid)
SELECT Objects.id
FROM Objects
ORDER BY CASE @sort WHEN 'Alphabetical' THEN Objects.UserName ELSE NULL END ASC,
    CASE @sort WHEN 'Created' THEN Objects.Created ELSE NULL END DESC,
    CASE @sort WHEN 'LastLogin' THEN Objects.LastLogin ELSE NULL END DESC
SELECT @Total=COUNT(*) FROM #TempItems
-- Find out the first and last record we want
DECLARE @FirstRec int, @LastRec int
SELECT @FirstRec = (@Page - 1) * @RecsPerPage
SELECT @LastRec = (@Page * @RecsPerPage + 1)
SELECT *
FROM #TempItems
-- memberid holds Objects.id; #TempItems.id is only the row position
INNER JOIN Objects ON (Objects.id = #TempItems.memberid)
WHERE #TempItems.ID > @FirstRec AND #TempItems.ID < @LastRec
ORDER BY #TempItems.Id

I would recommend using a CachedRowSet.
A CachedRowSet object is a container for rows of data that caches its rows in memory, which makes it possible to operate without always being connected to its data source.
A CachedRowSet object is a disconnected rowset, which means that it makes use of a connection to its data source only briefly. It connects to its data source while it is reading data to populate itself with rows and again while it is propagating changes back to its underlying data source.
Because a CachedRowSet object stores data in memory, the amount of data that it can contain at any one time is determined by the amount of memory available. To get around this limitation, a CachedRowSet object can retrieve data from a ResultSet object in chunks of data, called pages. To take advantage of this mechanism, an application sets the number of rows to be included in a page using the method setPageSize. In other words, if the page size is set to five, a chunk of five rows of data will be fetched from the data source at one time. An application can also optionally set the maximum number of rows that may be fetched at one time. If the maximum number of rows is set to zero, or no maximum number of rows is set, there is no limit to the number of rows that may be fetched at a time.
After properties have been set, the CachedRowSet object must be populated with data using either the method populate or the method execute. The following lines of code demonstrate using the method populate. Note that this version of the method takes two parameters, a ResultSet handle and the row in the ResultSet object from which to start retrieving rows.
CachedRowSet crs = new CachedRowSetImpl();
crs.setMaxRows(20);
crs.setPageSize(4);
crs.populate(rsHandle, 10);
When this code runs, crs will be populated with four rows from rsHandle starting with the tenth row.
On the similar path, you could build upon a strategy to paginate your data on the JSP and so on and so forth.
