I would like to display 100,000 records in the browser across multiple pages (100 records per page) with minimal impact on memory.
I would like to be able to move back and forth between pages. My doubts are:
1. Can I keep all the records in memory? Is this a good idea?
2. Can I make a database connection/query for every page? If so, how do I write the query?
Could anyone please help me?
It's generally not a good idea to keep so many records in memory. If the application is accessed by several users at the same time, the memory impact will be huge.
I don't know which DBMS you are using, but in MySQL and several others you can rely on the DB for pagination with a query such as:
SELECT * FROM MyTable
LIMIT 0, 100
The first number after limit is the offset (how many records it will skip) and the second is the number of records it will fetch.
Bear in mind that this syntax is not the same on every DB (some don't even support it).
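To make the offset arithmetic concrete, here is a minimal Java sketch. It assumes 1-based page numbers and the MySQL `LIMIT offset, count` form shown above. The class and method names (`PageMath`, `offsetFor`, `pageCount`, `limitClause`) and the `ORDER BY id` column are made up for illustration.

```java
/** Hypothetical helper: page arithmetic for LIMIT-based paging (1-based page numbers assumed). */
final class PageMath {
    private PageMath() {}

    /** Number of rows to skip before the given page starts. */
    static long offsetFor(int page, int pageSize) {
        if (page < 1 || pageSize < 1) {
            throw new IllegalArgumentException("page and pageSize must be >= 1");
        }
        return (long) (page - 1) * pageSize;
    }

    /** Number of pages needed to show totalRows rows (ceiling division). */
    static long pageCount(long totalRows, int pageSize) {
        if (totalRows < 0 || pageSize < 1) {
            throw new IllegalArgumentException("totalRows must be >= 0 and pageSize >= 1");
        }
        return (totalRows + pageSize - 1) / pageSize;
    }

    /** MySQL-style clause; other databases use a different syntax. */
    static String limitClause(int page, int pageSize) {
        return "LIMIT " + offsetFor(page, pageSize) + ", " + pageSize;
    }

    public static void main(String[] args) {
        // The id column is an assumption; a stable ORDER BY keeps pages consistent.
        System.out.println("SELECT * FROM MyTable ORDER BY id " + limitClause(3, 100));
    }
}
```

With 100,000 records and 100 per page this gives 1000 pages, and each page request only asks the database for its own 100 rows.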
I would not hold the data in memory (either in the browser or in the serving application). Instead I'd page through the results using SQL.
How you do this can be database-specific. See here for one example in MySQL. Mechanisms will exist for other databases.
1) No, having all the records in memory kind of defeats the point of having a database. Look into having a scrollable result set, that way you can get the functionality you want without having to play with the SQL. You can also adjust how many records are fetched at a time so that you don't load more records than you need.
2) Db connections are expensive to create and destroy but any serious system will pool the connections so the impact on performance won't be that great.
If you want to get a bit more fancy you can do away with pages altogether and just load more records as the user scrolls through the list.
It would not be a good idea, as you would be making the browser hold all of that.
When I had something like this to do, I used JavaScript to render the page and just made Ajax calls to get the next page. There is a slight delay in displaying the next table while you fetch it, but users are used to that.
If you are showing 100 records/page, use JSON to pass the data from the server, as JavaScript can parse it quickly, and then use innerHTML to insert the HTML, as the DOM is much slower at rendering tables.
As mentioned by others here, it is not a good idea to store a large list of results in memory. Querying for the results of each page is certainly a much better approach. To do that you have two options. One is to use whatever database-specific features your DBMS provides for targeting a specific subsection of a query's results. The other is to use the generic methods provided by JDBC to achieve the same effect, which keeps your code from being tied to a specific database:
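// get a scrollable ResultSet from some query; absolute() is not
// supported on the default TYPE_FORWARD_ONLY result set
ResultSet results = ...
// count = rows per page, beginIndex = zero-based page index
if (count > 0) {
    results.setFetchSize(count + 1);
    results.setFetchDirection(ResultSet.FETCH_FORWARD);
    // position the cursor on the last row of the previous page
    results.absolute(count * beginIndex);
}
for (int rowNumber = 0; results.next(); ++rowNumber) {
    // stop once a full page has been processed
    if (count > 0 && rowNumber >= count) {
        break;
    }
    // process the ResultSet below
    ...
}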
Using a library like Spring JDBC or Hibernate can make this even easier.
In many SQL dialects you have a notion of LIMIT (MySQL, ...) or OFFSET (MSSQL).
You can use this kind of thing to limit the rows per page.
Depends on the data. 100k ints might not be too bad if you are caching that.
T-SQL has SET ROWCOUNT 100 to limit the number of records returned.
But to do it right and return the total number of pages, you need a more advanced paging sproc.
It's a pretty hotly debated topic and there are many ways to do it.
Here's a sample of an old sproc I wrote:
CREATE PROCEDURE Objects_GetPaged
(
    @sort VARCHAR(255),
    @Page INT,
    @RecsPerPage INT,
    @Total INT OUTPUT
)
AS
SET NOCOUNT ON
-- Create a temporary table; the IDENTITY column numbers the rows in sort order
CREATE TABLE #TempItems
(
    id INT IDENTITY,
    memberid INT
)
INSERT INTO #TempItems (memberid)
SELECT Objects.id
FROM Objects
ORDER BY CASE @sort WHEN 'Alphabetical' THEN Objects.UserName ELSE NULL END ASC,
    CASE @sort WHEN 'Created' THEN Objects.Created ELSE NULL END DESC,
    CASE @sort WHEN 'LastLogin' THEN Objects.LastLogin ELSE NULL END DESC
SELECT @Total = COUNT(*) FROM #TempItems
-- Find out the first and last record we want
DECLARE @FirstRec INT, @LastRec INT
SELECT @FirstRec = (@Page - 1) * @RecsPerPage
SELECT @LastRec = (@Page * @RecsPerPage + 1)
-- Join on memberid, which holds Objects.id; #TempItems.id is only the row number
SELECT *
FROM #TempItems
INNER JOIN Objects ON (Objects.id = #TempItems.memberid)
WHERE #TempItems.id > @FirstRec AND #TempItems.id < @LastRec
ORDER BY #TempItems.id
I would recommend using a CachedRowSet.
A CachedRowSet object is a container for rows of data that caches its rows in memory, which makes it possible to operate without always being connected to its data source.
A CachedRowSet object is a disconnected rowset, which means that it makes use of a connection to its data source only briefly. It connects to its data source while it is reading data to populate itself with rows and again while it is propagating changes back to its underlying data source.
Because a CachedRowSet object stores data in memory, the amount of data that it can contain at any one time is determined by the amount of memory available. To get around this limitation, a CachedRowSet object can retrieve data from a ResultSet object in chunks of data, called pages. To take advantage of this mechanism, an application sets the number of rows to be included in a page using the method setPageSize. In other words, if the page size is set to five, a chunk of five rows of data will be fetched from the data source at one time. An application can also optionally set the maximum number of rows that may be fetched at one time. If the maximum number of rows is set to zero, or no maximum number of rows is set, there is no limit to the number of rows that may be fetched at a time.
After properties have been set, the CachedRowSet object must be populated with data using either the method populate or the method execute. The following lines of code demonstrate using the method populate. Note that this version of the method takes two parameters, a ResultSet handle and the row in the ResultSet object from which to start retrieving rows.
CachedRowSet crs = new CachedRowSetImpl();
crs.setMaxRows(20);
crs.setPageSize(4);
crs.populate(rsHandle, 10);
When this code runs, crs will be populated with four rows from rsHandle starting with the tenth row.
Along similar lines, you could build a strategy to paginate your data in the JSP.
I want to get all data from offset to limit from a table with about 40 columns and 1,000,000 rows. I tried to index the id column in Postgres and get the result of my select query via Java and an EntityManager.
My query needs about 1 minute to return results, which is a bit too long. I tried a different index and also limited my query to 100 rows, but it still takes this long. How can I fix it? Do I need a better index, or is something wrong with my code?
// entityClass is assumed to be the Class<T> of the entity being queried
CriteriaQuery<T> q = entityManager.getCriteriaBuilder().createQuery(entityClass);
q.select(q.from(entityClass));
TypedQuery<T> query = entityManager.createQuery(q);
List<T> entities = query.setFirstResult(offset).setMaxResults(limit).getResultList();
Right now you probably do not utilize the index at all. There is some ambiguity about how a Hibernate limit/offset translates to database operations (see this comment in the case of Postgres). It may imply overhead as described in detail in a reply to this post.
If you have a direct relationship of offset and limit to the values of the id column you could use that in a query of the form
SELECT e
FROM Entity e
WHERE e.id >= :offset AND e.id < :offset + :limit
Given that the number of records asked for is significantly smaller than the total number of records in the table, the database will use the index.
The next thing is that 40 columns is quite a lot. If you actually need significantly fewer for your purpose, you could define a restricted entity with just the attributes required and query for that one. This should take out some more overhead.
If you're still not within performance requirements you could choose to use a plain JDBC connection/query instead of Hibernate.
By the way, you could log the actual SQL issued by JPA/Hibernate and use it to get an execution plan from Postgres; this will show you what the query actually looks like and whether an index is used. Further, you could monitor the database's query execution times to get an idea of which fraction of the processing time is consumed by the database and which by your Java client plus data transfer overhead.
There is also a technique to mimic offset+limit paging by keying each page on its first record's key.
Map<Integer, String> mapPageTopRecNoToKey = new HashMap<>();
Then search records >= page's key and load page size + 1 records to find the next page.
Going from page 1 to page 5 would take a bit more work but would still be fast.
This of course is a terrible kludge, but the technique at that time indeed was a speed improvement on some databases.
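As a rough illustration of that page-key technique, here is a minimal Java sketch. A `TreeSet` stands in for an indexed key column, and `fetchFrom` stands in for a query of the form `WHERE key >= ? ORDER BY key LIMIT ?`. The class name `KeyPager` and all method names are hypothetical, not part of any library.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical pager that remembers the first key of every page it has seen. */
final class KeyPager {
    private final TreeSet<Integer> table;   // stand-in for an indexed key column
    private final int pageSize;
    private final Map<Integer, Integer> pageTopKey = new HashMap<>(); // page number -> first key

    KeyPager(TreeSet<Integer> table, int pageSize) {
        this.table = table;
        this.pageSize = pageSize;
        if (!table.isEmpty()) {
            pageTopKey.put(1, table.first());
        }
    }

    /** Stand-in for: SELECT ... WHERE key >= ? ORDER BY key LIMIT ? */
    private List<Integer> fetchFrom(int key, int limit) {
        List<Integer> rows = new ArrayList<>();
        for (Integer k : table.tailSet(key, true)) {
            if (rows.size() == limit) {
                break;
            }
            rows.add(k);
        }
        return rows;
    }

    /** Returns the rows of a 1-based page, walking forward from the nearest known page. */
    List<Integer> page(int pageNo) {
        if (pageNo < 1) {
            throw new IllegalArgumentException("pageNo must be >= 1");
        }
        int known = pageNo;
        while (known > 1 && !pageTopKey.containsKey(known)) {
            known--;
        }
        if (!pageTopKey.containsKey(known)) {
            return new ArrayList<>();       // empty table
        }
        while (true) {
            // Load pageSize + 1 rows: the extra row is the next page's first key.
            List<Integer> rows = fetchFrom(pageTopKey.get(known), pageSize + 1);
            boolean hasNext = rows.size() > pageSize;
            if (hasNext) {
                pageTopKey.put(known + 1, rows.get(pageSize));
            }
            if (known == pageNo) {
                return hasNext ? new ArrayList<>(rows.subList(0, pageSize)) : rows;
            }
            if (!hasNext) {
                return new ArrayList<>();   // requested page is past the end
            }
            known++;
        }
    }
}
```

Moving one page forward or back costs a single indexed lookup; jumping several pages ahead walks the intermediate pages once and then remembers their keys.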
In your case it would be worth specifying the needed fields in jpql: select e.a, e.b is considerably faster.
I am going to generate a simple CSV report in Java using Hibernate and MySQL.
I am using Hibernate's native SQL support (the query is too complex for HQL or Criteria, which doesn't matter here) to fetch the data, and simply writing it out with a CSVWriter API (which one doesn't matter here either).
So far all is well, but this is where the problem starts.
Requirements:
The report can contain 5000K to 15000K records with 25 fields.
It can be run in real time.
There is one report column (let's say finalValue) that I want to sort by, and it is computed like this: (sum(b.quantity*c.unit_gross_price) - COALESCE(sum(pai.value),0)).
Problem:
MySQL indexing cannot be used for the finalValue column (mentioned above) because it is a complex combination of aggregate functions. So if I execute the query (with or without a limit) with sorting, it takes 40 sec; otherwise it takes 0.075 sec.
The Solutions:
These are some solutions I can think of, but each has limitations.
Sorting using java.util.TreeSet: it will throw an OutOfMemoryError, which is expected, as the heap space will be exceeded if I put 15000K heavy objects in it.
Using a limit in the MySQL query and writing the file on each iteration: it will take a long time, as every query takes about the same time (around 50 sec), since the limit cannot be applied without sorting first.
So the main problem is to balance two constraints: memory and time.
Any ideas or suggestions?
NOTE: I have not included any code snippets here, but that does not mean the question lacks detail. Code is not required here.
I think you can use a streaming ResultSet here, as documented on this page under the ResultSet section.
Here are the main points from the documentation.
By default, ResultSets are completely retrieved and stored in memory. In most cases this is the most efficient way to operate and, due to the design of the MySQL network protocol, is easier to implement. If you are working with ResultSets that have a large number of rows or large values and cannot allocate heap space in your JVM for the memory required, you can tell the driver to stream the results back one row at a time.
To enable this functionality, create a Statement instance in the following manner:
stmt = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
java.sql.ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);
The combination of a forward-only, read-only result set, with a fetch size of Integer.MIN_VALUE serves as a signal to the driver to stream result sets row-by-row. After this, any result sets created with the statement will be retrieved row-by-row.
There are some caveats with this approach. You must read all of the rows in the result set (or close it) before you can issue any other queries on the connection, or an exception will be thrown.
The earliest the locks these statements hold can be released (whether they be MyISAM table-level locks or row-level locks in some other storage engine such as InnoDB) is when the statement completes.
If using streaming results, process them as quickly as possible if you want to maintain concurrent access to the tables referenced by the statement producing the result set.
So, with a streaming result-set, write your order by query, and then start writing the results into your CSV file.
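The "write as you read" part can be sketched in Java as follows. This is only an illustration: the rows arrive through a plain `Iterator`, which in a real report would wrap the streaming `ResultSet`, and the names `CsvStreamer`, `escape` and `writeAll` are hypothetical. The quoting follows the usual CSV convention of doubling embedded quotes.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Iterator;
import java.util.List;

/** Hypothetical helper: writes rows to CSV one at a time, never holding the full result. */
final class CsvStreamer {
    private CsvStreamer() {}

    /** Quotes a field if it contains a comma, a quote or a line break. */
    static String escape(String field) {
        if (field == null) {
            return "";
        }
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        return needsQuotes ? "\"" + field.replace("\"", "\"\"") + "\"" : field;
    }

    /** Writes every row from the iterator and returns how many rows were written. */
    static long writeAll(Iterator<List<String>> rows, Writer out) {
        long count = 0;
        try {
            while (rows.hasNext()) {
                List<String> row = rows.next();   // only the current row is in memory
                for (int i = 0; i < row.size(); i++) {
                    if (i > 0) {
                        out.write(',');
                    }
                    out.write(escape(row.get(i)));
                }
                out.write("\r\n");
                count++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }
}
```

Because each row is written as soon as it is read, heap usage stays flat regardless of the report size. Wrapping the target file in a `BufferedWriter` keeps the many small writes cheap.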
This still probably doesn't solve the sorting issue, but I think if you can't pre-generate that value and put an index on it, the sorting is going to take some time.
However, there might be some server config variables that you can use to optimize the sorting performance.
From the MySQL Order-By optimization page
I think you can tune the read_rnd_buffer_size variable, about which the docs say:
Setting the variable to a large value can improve ORDER BY performance by a lot
Another one is sort_buffer_size, for which the docs say the following:
If you see many Sort_merge_passes per second in SHOW GLOBAL STATUS output, you can consider increasing the sort_buffer_size value to speed up ORDER BY or GROUP BY operations that cannot be improved with query optimization or improved indexing.
Another variable that can probably help is innodb_buffer_pool_size, which allows InnoDB to keep as much table data in memory as possible and avoid some disk seeks.
However, all of these variables require some tuning: some trial and error, and probably some kind of benchmarking, to get right.
There are some other suggestions on that MySQL Order-By optimization page as well.
Use a temporary table to store your select result with an index on finalValue. This will store and index your intermediate result.
CREATE TEMPORARY TABLE my_temp_table (INDEX my_index_name (finalValue))
SELECT ... -- your select
Note that complex expressions will require an alias in your SELECT to be used as a part of a CREATE TABLE SELECT. I assume that your SELECT has the alias finalValue (the column you mentioned).
Then select from the temporary table ordered by finalValue (the index will be used).
SELECT * FROM my_temp_table ORDER BY finalValue;
And finally drop the temporary table (or reuse it if you want, but remember that when client session terminates temporary data is automatically deleted).
Summary tables. (Let's see more details to be sure this is Data Warehouse type data.) Summary tables are augmented periodically with subtotals and counts. Then when the report is needed, the data is readily available almost directly from the summary table, rather than scanning lots of raw data and doing aggregates.
My blog on Summary Tables. Let's see your schema and report query; we can discuss this in more detail.
I am programming software in Java and using an Oracle DB.
Normally we obtain the values from the database using a loop like
ResultSet rt = (ResultSet) cs.getObject(1);
while (rt.next()) {
    ....
}
But it seems slow when fetching thousands of rows from the database.
My question is:
In the Oracle DB I created a procedure like this, which iterates the data and assigns it to the cursor:
procedure test_pro(info out sys_refcursor) as
begin
open info for select * from user_tbl ......
end test_pro;
In the Java code, as I mentioned before, I iterate the result set to obtain the values. But if the database side has already selected the values, why should I use a loop to get them?
(Another fact: the .NET Framework has a data binding concept. Is there any way in Java to bind database procedures like .NET does, without iterating?)
Depending on what you are going to do with that data and how frequently, a ref cursor might be a good or a bad choice. Ref cursors are intended to give programs that are not Oracle-aware a way to be passed data, for reporting purposes.
In your case, stick to the looping but don't forget to implement array fetching, because it has a tremendous effect on performance. The database passes blocks of rows to the JDBC buffer on the client and your code fetches rows from that buffer. When you hit the end of the buffer, the JDBC layer requests the next chunk of rows from the database, eliminating lots of network round trips. The default already fetches 10 rows at a time. For larger sets, use bigger numbers, if memory can provide the room.
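To get a feel for why the fetch size matters, here is a small back-of-the-envelope sketch in Java. It assumes every fetch returns a full block of rows and ignores the round trip that executes the statement, so treat the numbers as rough estimates. `FetchMath` and `roundTrips` are names invented for this example; in real code the fetch size is set with the standard JDBC `Statement.setFetchSize(int)`.

```java
/** Hypothetical helper: estimates how many fetch round trips a result set needs. */
final class FetchMath {
    private FetchMath() {}

    /** Ceiling of rows / fetchSize, i.e. the number of row blocks the driver requests. */
    static long roundTrips(long rows, int fetchSize) {
        if (rows < 0 || fetchSize < 1) {
            throw new IllegalArgumentException("rows must be >= 0 and fetchSize >= 1");
        }
        return (rows + fetchSize - 1) / fetchSize;
    }

    public static void main(String[] args) {
        long rows = 100_000;
        for (int fetchSize : new int[] {10, 100, 500}) {
            System.out.println("fetchSize=" + fetchSize
                    + " -> about " + roundTrips(rows, fetchSize) + " round trips");
        }
    }
}
```

Going from the default of 10 rows to 500 rows per fetch cuts the estimated round trips for 100,000 rows from 10,000 to 200, at the cost of a larger client-side buffer.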
See Oracle® Database JDBC Developer's Guide and Reference
If you know for sure there will always be exactly one result, like in this case, you can even skip the if and just call rs.next() once:
For example :
ResultSet resultset = statement.executeQuery("SELECT MAX (custID) FROM customer");
resultset.next(); // exactly one result so allowed
int max = resultset.getInt(1); // use indexed retrieval since the column has no name
Yes, you can call a procedure in Java.
http://www.mkyong.com/jdbc/jdbc-callablestatement-stored-procedure-out-parameter-example/
You can't avoid looping. For performance reasons you need to adjust the prefetch size on the Statement or ResultSet object (100 is a solid starting point).
Why is it done this way? It's similar to reading streams: you never know how big the data can be, so you read it chunk by chunk, one buffer after another.
I simply need to read each row in a table in my MySQL database using Hibernate and write a file based on it. But there are 90 million rows and they are pretty big. So it seemed like the following would be appropriate:
ScrollableResults results = session.createQuery("SELECT person FROM Person person")
.setReadOnly(true).setCacheable(false).scroll(ScrollMode.FORWARD_ONLY);
while (results.next())
storeInFile(results.get()[0]);
The problem is the above will try and load all 90 million rows into RAM before moving on to the while loop... and that will kill my memory with OutOfMemoryError: Java heap space exceptions :(.
So I guess ScrollableResults isn't what I was looking for? What is the proper way to handle this? I don't mind if this while loop takes days (well I'd love it to not).
I guess the only other way to handle this is to use setFirstResult and setMaxResults to iterate through the results and just use regular Hibernate results instead of ScrollableResults. That feels like it will be inefficient though and will start taking a ridiculously long time when I'm calling setFirstResult on the 89 millionth row...
UPDATE: setFirstResult/setMaxResults doesn't work, it turns out to take an unusably long time to get to the offsets like I feared. There must be a solution here! Isn't this a pretty standard procedure?? I'm willing to forgo Hibernate and use JDBC or whatever it takes.
UPDATE 2: the solution I've come up with which works ok, not great, is basically of the form:
select * from person where id > <offset> and <other_conditions> limit 1
Since I have other conditions, even all in an index, it's still not as fast as I'd like it to be... so still open for other suggestions..
Using setFirstResult and setMaxResults is your only option that I'm aware of.
Traditionally a scrollable resultset would only transfer rows to the client on an as required basis. Unfortunately the MySQL Connector/J actually fakes it, it executes the entire query and transports it to the client, so the driver actually has the entire result set loaded in RAM and will drip feed it to you (evidenced by your out of memory problems). You had the right idea, it's just shortcomings in the MySQL java driver.
I found no way to get around this, so went with loading large chunks using the regular setFirst/max methods. Sorry to be the bringer of bad news.
Just make sure to use a stateless session so there's no session level cache or dirty tracking etc.
EDIT:
Your UPDATE 2 is the best you're going to get unless you break out of MySQL Connector/J. Though there's no reason you can't up the limit on the query. Provided you have enough RAM to hold the index this should be a somewhat cheap operation. I'd modify it slightly, and grab a batch at a time, and use the highest id of that batch to grab the next batch.
Note: this will only work if other_conditions use equality (no range conditions allowed) and the index has id as its last column.
select *
from person
where id > <max_id_of_last_batch> and <other_conditions>
order by id asc
limit <batch_size>
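The batch loop around that query could look roughly like this in Java. It is a sketch under assumptions: `BatchSource.fetchAfter` is a hypothetical hook that would run the query above and return the ids in ascending order, rows are represented by their ids only to keep the example short, and ids are assumed to be greater than `Long.MIN_VALUE`.

```java
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical driver loop: each batch starts after the highest id of the previous one. */
final class KeysetBatcher {
    private KeysetBatcher() {}

    /** Stand-in for the query: up to batchSize ids greater than lastId, ascending. */
    interface BatchSource {
        List<Long> fetchAfter(long lastId, int batchSize);
    }

    /** Passes every batch to the handler and returns the number of non-empty batches. */
    static int forEachBatch(BatchSource source, int batchSize, Consumer<List<Long>> handler) {
        long lastId = Long.MIN_VALUE;
        int batches = 0;
        while (true) {
            List<Long> batch = source.fetchAfter(lastId, batchSize);
            if (batch.isEmpty()) {
                break;                               // nothing left
            }
            handler.accept(batch);
            batches++;
            lastId = batch.get(batch.size() - 1);    // highest id, because the batch is ascending
            if (batch.size() < batchSize) {
                break;                               // short batch means the table is exhausted
            }
        }
        return batches;
    }
}
```

Unlike an increasing OFFSET, every query here starts from an index lookup on id, so the cost per batch stays roughly constant however deep into the table you are.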
You should be able to use a ScrollableResults, though it requires a few magic incantations to get working with MySQL. I wrote up my findings in a blog post (http://www.numerati.com/2012/06/26/reading-large-result-sets-with-hibernate-and-mysql/) but I'll summarize here:
"The [JDBC] documentation says:
To enable this functionality, create a Statement instance in the following manner:
stmt = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
java.sql.ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);
This can be done using the Query interface (this should work for Criteria as well) in version 3.2+ of the Hibernate API:
Query query = session.createQuery(queryString); // queryString: the HQL text (name assumed)
query.setReadOnly(true);
// MIN_VALUE gives hint to JDBC driver to stream results
query.setFetchSize(Integer.MIN_VALUE);
ScrollableResults results = query.scroll(ScrollMode.FORWARD_ONLY);
// iterate over results
while (results.next()) {
Object row = results.get();
// process row then release reference
// you may need to evict() as well
}
results.close();
This allows you to stream over the result set, however Hibernate will still cache results in the Session, so you’ll need to call session.evict() or session.clear() every so often. If you are only reading data, you might consider using a StatelessSession, though you should read its documentation beforehand."
Set fetch size in query to an optimal value as given below.
Also, when caching is not required, it may be better to use StatelessSession.
ScrollableResults results = session.createQuery("SELECT person FROM Person person")
.setReadOnly(true)
.setFetchSize( 1000 ) // <<--- !!!!
.setCacheable(false).scroll(ScrollMode.FORWARD_ONLY);
With MySQL Connector/J the fetch size must be Integer.MIN_VALUE, otherwise it won't work.
This is taken literally from the official reference: https://dev.mysql.com/doc/connector-j/5.1/en/connector-j-reference-implementation-notes.html
Actually you could have gotten what you wanted -- low-memory scrollable results with MySQL -- if you had used the answer mentioned here:
Streaming large result sets with MySQL
Note that you will have problems with Hibernate lazy-loading because it will throw an exception on any queries performed before the scroll is finished.
With 90 million records, it sounds like you should be batching your SELECTs. I've done this with Oracle when doing the initial load into a distributed cache. Looking at the MySQL documentation, the equivalent seems to be using the LIMIT clause: http://dev.mysql.com/doc/refman/5.0/en/select.html
Here's an example:
SELECT * from Person
LIMIT 200, 100
This would return rows 201 through 300 of the Person table.
You'd need to get the record count from your table first and then divide it by your batch size and work out your looping and LIMIT parameters from there.
The other benefit of this would be parallelism - you can execute multiple threads in parallel on this for faster processing.
Processing 90 million records also doesn't sound like the sweet spot for using Hibernate.
The problem could be that Hibernate keeps references to all objects in the session until you close the session. That has nothing to do with query caching. Maybe it would help to evict() the objects from the session after you are done writing each object to the file. If they are no longer referenced by the session, the garbage collector can free the memory and you won't run out of memory anymore.
I propose more than sample code: a query template based on Hibernate that does this workaround for you (pagination, scrolling and clearing the Hibernate session).
It can also easily be adapted to use an EntityManager.
I've used the Hibernate scroll functionality successfully before without it reading the entire result set in. Someone said that MySQL does not do true scroll cursors, but it claims to based on the JDBC dmd.supportsResultSetType(ResultSet.TYPE_SCROLL_INSENSITIVE) and searching around it seems like other people have used it. Make sure it's not caching the Person objects in the session - I've used it on SQL queries where there was no entity to cache. You can call evict at the end of the loop to be sure or test with a sql query. Also play around with setFetchSize to optimize the number of trips to the server.
Recently I worked on a problem like this, and I wrote a blog post about how to approach it. It is very similar, and I hope it will be helpful to someone.
I use a lazy list approach with partial acquisition. I replaced the limit/offset pagination of the query with manual pagination.
In my example, the select returns 10 million records; I fetch them and insert them into a "temporary table":
create or replace function load_records ()
returns VOID as $$
BEGIN
drop sequence if exists temp_seq;
create temp sequence temp_seq;
insert into tmp_table
SELECT linea.*
FROM
(
select nextval('temp_seq') as ROWNUM,* from table1 t1
join table2 t2 on (t2.fieldpk = t1.fieldpk)
join table3 t3 on (t3.fieldpk = t2.fieldpk)
) linea;
END;
$$ language plpgsql;
After that, I can paginate without counting each row, using the assigned sequence instead:
select * from tmp_table where counterrow >= 9000000 and counterrow <= 9025000
From the Java perspective, I implemented this pagination through partial acquisition with a lazy list, that is, a list that extends AbstractList and implements the get() method. The get method can use a data access interface to fetch the next set of data and release the heap memory:
@Override
public E get(int index) {
    if (bufferParcial.size() <= (index - lastIndexRoulette))
    {
        lastIndexRoulette = index;
        bufferParcial.removeAll(bufferParcial);
        bufferParcial = new ArrayList<E>();
        bufferParcial.addAll(daoInterface.getBufferParcial());
        if (bufferParcial.isEmpty())
        {
            return null;
        }
    }
    return bufferParcial.get(index - lastIndexRoulette);
}
On the other hand, the data access interface uses a query to paginate and implements one method to iterate progressively, 25000 records at a time, until all are processed.
Results for this approach can be seen here:
http://www.arquitecturaysoftware.co/2013/10/laboratorio-1-iterar-millones-de.html
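For readers who want to experiment with the lazy-list idea, here is a self-contained Java sketch. It is a simplified variant, not the code from the blog post: `ChunkedList` and `ChunkLoader` are hypothetical names, the loader would in practice run the range query against the temporary table, and the total size is assumed to be known up front.

```java
import java.util.AbstractList;
import java.util.List;

/** Hypothetical read-only list that keeps only one chunk of rows in memory at a time. */
final class ChunkedList<E> extends AbstractList<E> {
    /** Stand-in for the data access interface: returns rows [offset, offset + limit). */
    interface ChunkLoader<E> {
        List<E> load(int offset, int limit);
    }

    private final ChunkLoader<E> loader;
    private final int size;
    private final int chunkSize;
    private int chunkStart = -1;   // index of the first buffered row, -1 while nothing is loaded
    private List<E> buffer;
    private int loadCount;         // number of loader calls, kept only for demonstration

    ChunkedList(ChunkLoader<E> loader, int size, int chunkSize) {
        if (size < 0 || chunkSize < 1) {
            throw new IllegalArgumentException("size must be >= 0 and chunkSize >= 1");
        }
        this.loader = loader;
        this.size = size;
        this.chunkSize = chunkSize;
    }

    @Override
    public E get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        boolean inBuffer = buffer != null
                && index >= chunkStart && index < chunkStart + buffer.size();
        if (!inBuffer) {
            // Replace the old chunk so it becomes eligible for garbage collection.
            chunkStart = (index / chunkSize) * chunkSize;
            buffer = loader.load(chunkStart, Math.min(chunkSize, size - chunkStart));
            loadCount++;
        }
        return buffer.get(index - chunkStart);
    }

    @Override
    public int size() {
        return size;
    }

    int loadCount() {
        return loadCount;
    }
}
```

Iterating such a list front to back triggers one load per chunk, and random access also works because the chunk is derived from the index rather than from the previous call.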
Another option if you're "running out of RAM" is to request just, say, one column instead of the entire object; see "How to use hibernate criteria to return only one element of an object instead the entire object?" (this also saves a lot of CPU time).
For me it worked properly when setting useCursors=true; otherwise the scrollable ResultSet ignores the configured fetch size. In my case it was 5000, but the scrollable ResultSet fetched millions of records at once, causing excessive memory usage. The underlying DB is MS SQL Server.
jdbc:jtds:sqlserver://localhost:1433/ACS;TDS=8.0;useCursors=true
I'm trying to create a java program to cleanup and merge rows in my table. The table is large, about 500k rows and my current solution is running very slowly. The first thing I want to do is simply get an in-memory array of objects representing all the rows of my table. Here is what I'm doing:
pick an increment of say 1000 rows at a time
use JDBC to fetch a resultset on the following SQL query
SELECT * FROM TABLE WHERE ID > 0 AND ID < 1000
add the resulting data to an in-memory array
continue querying all the way up to 500,000 in increments of 1000, each time adding results.
This is taking way too long. In fact it's not even getting past the second increment from 1000 to 2000. The query takes forever to finish (although when I run the same thing directly through a MySQL browser it's decently fast). It's been a while since I've used JDBC directly. Is there a faster alternative?
First of all, are you sure you need the whole table in memory? Maybe you should consider (if possible) selecting rows that you want to update/merge/etc. If you really have to have the whole table you could consider using a scrollable ResultSet. You can create it like this.
// make sure autocommit is off (postgres)
con.setAutoCommit(false);
Statement stmt = con.createStatement(
ResultSet.TYPE_SCROLL_INSENSITIVE, //or ResultSet.TYPE_FORWARD_ONLY
ResultSet.CONCUR_READ_ONLY);
ResultSet srs = stmt.executeQuery("select * from ...");
It enables you to move to any row you want by using 'absolute' and 'relative' methods.
One thing that helped me was Statement.setFetchSize(Integer.MIN_VALUE). I got this idea from Jason's blog. This cut down execution time by more than half. Memory consumed went down dramatically (as only one row is read at a time.)
This trick doesn't work for PreparedStatement, though.
Although it's probably not optimal, your solution seems like it ought to be fine for a one-off database cleanup routine. It shouldn't take that long to run a query like that and get the results (I'm assuming that since it's a one-off, a couple of seconds would be fine). Possible problems:
Is your network (or at least your connection to MySQL) very slow? If so, you could try running the process locally on the MySQL box, or on something better connected.
Is there something in the table structure that's causing it? Pulling down 10k of data for every row? 200 fields? Calculating the id values to get based on a non-indexed column? You could try finding a more db-friendly way of pulling the data (e.g. just the columns you need, have the db aggregate values, etc.)
If you're not getting through the second increment something is really wrong - efficient or not, you shouldn't have any problem dumping 2000, or 20,000 rows into memory on a running JVM. Maybe you're storing the data redundantly or extremely inefficiently?