Handling very large amount of data in MyBatis

Handling very large amount of data in MyBatis - java

My goal is actually to dump all the data of a database to an XML file. The database is not terribly big, it's about 300MB. The problem is that I have a memory limitation of 256MB (in JVM) only. So obviously I cannot just read everything into memory.
I managed to solve this problem using iBatis (yes I mean iBatis, not myBatis) by calling it's getList(... int skip, int max) multiple times, with incremented skip. That does solve my memory problem, but I'm not impressed with the speed. The variable names suggests that what the method does under the hood is to read the entire result-set skip then specified record. This sounds quite redundant to me (I'm not saying that's what the method is doing, I'm just guessing base on the variable name).
Now, I switched to myBatis 3 for the next version of my application. My question is: is there any better way to handle large amount of data chunk by chunk in myBatis? Is there anyway to make myBatis process first N records, return them to the caller while keeping the result set connection open so the next time the user calls the getList(...) it will start reading from the N+1 record without doing any "skipping"?

myBatis CAN stream results. What you need is a custom result handler. With this you can take each row separately and write it to your XML file. The overall scheme looks like this:
session.select(
"mappedStatementThatFindsYourObjects",
parametersForStatement,
resultHandler);
Where resultHandler is an instance of a class implementing the ResultHandler interface. This interface has just one method handleResult. This method provides you with a ResultContext object. From this context you can retrieve the row currently being read and do something with it.
handleResult(ResultContext context) {
Object result = context.getResultObject();
doSomething(result);
}

No, mybatis does not have full capability to stream results yet.
EDIT 1:
If you don't need nested result mappings then you could implement a custom result handler to stream results. on current released versions of MyBatis. (3.1.1) The current limitation is when you need to do complex result mapping. The NestedResultSetHandler does not allow custom result handlers. A fix is available, and it looks like is currently targeted for 3.2. See Issue 577.
In summary, to stream large result sets using MyBatis you'll need.
Implement your own ResultSetHandler.
Increase fetch size. (as noted below by Guillaume Perrot)
For Nested result maps, use the fix discussed on Issue 577. This fix also resolves some memory issues with large result sets.

I have successfully used MyBatis streaming with the Cursor. The Cursor has been implemented on MyBatis at this PR.
From the documentation it is described as
A Cursor offers the same results as a List, except it fetches data
lazily using an Iterator.
Besides, the code documentation says
Cursors are a perfect fit to handle millions of items queries that
would not normally fits in memory.
Here is an example of implementation I have done and which I was able to successfully use it:
import org.mybatis.spring.SqlSessionFactoryBean;
// You have your SqlSessionFactory somehow, if using Spring you can use
SqlSessionFactoryBean sqlSessionFactory = new SqlSessionFactoryBean();
Then you define your mapper, e.g., UserMapper with the SQL query that returns a Cursor of your target object, not a List. The whole idea is to not store all the elements in memory:
import org.apache.ibatis.annotations.Select;
import org.apache.ibatis.cursor.Cursor;
public interface UserMapper {
#Select(
"SELECT * FROM users"
)
Cursor<User> getAll();
}
Then you write the that code that will use an open SQL session from the factory and query using your mapper:
try(SqlSession sqlSession = sqlSessionFactory.openSession()) {
Iterator<User> iterator = sqlSession.getMapper(UserMapper.class)
.getAll()
.iterator();
while (iterator.hasNext()) {
doSomethingWithUser(iterator.next());
}
}

handleResult receives as many records as the query gets, no pause.
When there are too many records to process I used sqlSessionFactory.getSession().getConnection().
Then as, normal JDBC, get a Statement, get the Resultset, and process one by one the records. Don't forget to close the session.

If just dumping all the data without any ordering requirement from tables, why not directly do the pagination in SQL? Set a limit to the query statement, where specifying different record id as the offset, to separate the whole table into chunks, each of which could directly be read into memory if the rows limit is a reasonable number.
The sql could be something like:
SELECT * FROM resource
WHERE "ID" >= continuation_id LIMIT 300;
I think this could be viewed as an alternative solution for you to dump all the data by chunks, getting rid of the different feature problems in mybatis, or any Persistence layer, support.

Related

Mybatis batch processing

I have a table with 3+ millions records. And I need to read all those records from DB, send them for processing to kafka queue for other system to process. Then read results from output kafka queue and write back to the DB.
I need to read and write in sane portions otherwise I'm getting OOM exception at once.
What could be possible technical solutions to atchieve batch read and write operations with mybatis?
Neat working examples would be much appreciated.

I will write pseudo code as I don't know mush about Kafka.
First at read time, Mybatis default behavior is to return result in a List, but you don't want to load 3 millions of object into the memory. This must be overridden by using a custom implementation of org.apache.ibatis.session.ResultHandler<T>
public void handleResult(final ResultContext<YourType> context) {
addToKafkaQueue(context.getResultObject());
}
Also set the fetchSizeof the statement (When using annotation based mapper: #Option(fetchSize=500)) if there is no value defined in Mybatis global settings. If let unset, this option relies by default on driver value, different for every DB vendor. This defines how much records at once will be buffered in the result set. E.g: for Oracle this values 10: generally too low because involving to much read operation from the app to the DB; for Postgresql this is unlimited (whole result set), then too much. You will have to figure out the right balance between speed and memory usage.
For the update:
do {
YourType object = readFromKafkaQueue();
mybatisMapper.update(object);
} while (kafkaQueueHasMoreElements());
sqlSession.flushStatement(); // only when using ExecutorType.BATCH
The most important is the ExecutorType (this is argument in SessionFactory.openSession()) either ExecutorType.REUSE that will allow preparing the statement only once instead of at every iteration with default ExecutorType.SIMPLE or ExecutorType.BATCH that will stack the statements and actually execute them only on flush.
Now remains to think about the transactions: this might involve commit 3 millions updates, or it could be segmented.

You need to create separate instance of mapper for batch processing.
This might be helpful.
http://pretius.com/how-to-use-mybatis-effectively-perform-batch-db-operations/

Sorting funcationality Optimization using MySQL and Java

I am going to generate simple CSV file report in Java using Hibernate and MySQL.
I am using Native SQL (because query is too complex which is not possible with HQL or Criteria query and also this doesn't matter here) part of Hibernate to fetch the data and simply writing it using any of CSVWriter api (this doesn't matter here.)
As far all is well, but the problem starts now.
Requirements:
The report size can be with 5000K to 15000K records with 25 fields.
It can be run on real time.
There is one report column (let's say finalValue) for which I want sorting and it can be extract like this, (sum(b.quantity*c.unit_gross_price) - COALESCE(sum(pai.value),0)).
Problem:
MySQL Indexing can not be used for finalValue column (mentioned above) as it is complex combination of aggregate functions. So if execute the query (with or without limit) with sorting, it is taking 40sec, otherwise 0.075sec.
The Solutions:
These are the some solutions, that I can think but each have some limitations.
Sorting using java.util.TreeSet : It will throw the OutOfMemoryError, which is obvious as heap space will be exceed if I will put 15000K heavy objects.
Using limit in MySQL query and write file for each iteration : It will take much time as every query will take same time around 50sec as without sorting limit can't be use.
So the main problem here is to overcome two parameters : Memory and Time. I need to balance both of them.
Any ideas, suggestions?
NOTE: I am not given here any snaps of code that doesn't mean question details is not enough. Code doe's not require here.

I think you can use a streaming ResultSet here. As documeted on this page under the ResultSet section.
Here are the main points from the documentation.
By default, ResultSets are completely retrieved and stored in memory. In most cases this is the most efficient way to operate and, due to the design of the MySQL network protocol, is easier to implement. If you are working with ResultSets that have a large number of rows or large values and cannot allocate heap space in your JVM for the memory required, you can tell the driver to stream the results back one row at a time.
To enable this functionality, create a Statement instance in the following manner:
stmt = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
java.sql.ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);
The combination of a forward-only, read-only result set, with a fetch size of Integer.MIN_VALUE serves as a signal to the driver to stream result sets row-by-row. After this, any result sets created with the statement will be retrieved row-by-row.
There are some caveats with this approach. You must read all of the rows in the result set (or close it) before you can issue any other queries on the connection, or an exception will be thrown.
The earliest the locks these statements hold can be released (whether they be MyISAM table-level locks or row-level locks in some other storage engine such as InnoDB) is when the statement completes.
If using streaming results, process them as quickly as possible if you want to maintain concurrent access to the tables referenced by the statement producing the result set.
So, with a streaming result-set, write your order by query, and then start writing the results into your CSV file.
This still probably doesn't solve the sorting issue, but I think if you can't pre-generate that value and put an index on it, the sorting is going to take some time.
However, there might be some server config variables that you can use to optimize the sorting performance.
From the MySQL Order-By optimization page
I think you can set the read_rnd_buffer_size value, which, according to the docs, can:
Setting the variable to a large value can improve ORDER BY performance by a lot
Another one is sort_buffer_size, for which, the docs say the follwing:
If you see many Sort_merge_passes per second in SHOW GLOBAL STATUS output, you can consider increasing the sort_buffer_size value to speed up ORDER BY or GROUP BY operations that cannot be improved with query optimization or improved indexing.
Another variable that can probably help is the innodb_buffer_pool_size. Which allows innodb to keep as much table data in memory as possible and avoid some disk-seeks.
However, all of these variables require some tuning. Some trial-and-error and probably some kind of benchmarking to get right.
There are some other suggestions on that MySQL Order-By optimization page as well.

Use a temporary table to store your select result with an index on finalValue. This will store and index your intermediate result.
CREATE TEMPORARY TABLE my_temp_table (INDEX my_index_name (finalValue))
SELECT ... -- your select
Note that complex expressions will require an alias in your SELECT to be used as a part of a CREATE TABLE SELECT. I assume that your SELECT has the alias finalValue (the column you mentioned).
Then select the temporary table ordered by the finalValue (the index will be used).
SELECT * FROM my_temp_table ORDER BY finalValue;
And finally drop the temporary table (or reuse it if you want, but remember that when client session terminates temporary data is automatically deleted).

Summary tables. (Let's see more details to be sure this is Data Warehouse type data.) Summary tables are augmented periodically with subtotals and counts. Then when the report is needed, the data is readily available almost directly from the summary table, rather than scanning lots of raw data and doing aggregates.
My blog on Summary Tables. Let's see your schema and report query; we can discuss this in more detail.

How to get the number of results in an App Engine query before actually iterating through them all

In my Google App Engine app I need to fetch and return a potentially large number of entities from a datastore query in response to a service call GET request. This call may return potentially thousands of entities and MBs of serialized data.
The first portion of the response packet communicates how many entities are in the serialized results, followed by all of the serialized entities. Currently I am iterating through all the entities in the query with a QueryResultIterator up to a maximum page size limit, after which I return a cursor from which can be used to continue fetching where the previous call left off (if the maximum was reached and there are still results in the query). As I iterate through the results, I save them in a list. Once I've either exhausted the query results or reached the maximum page size, I can then get the number of entities from the size of this list. But then I have to iterate through this list again to serialize each of the entities and write the results to the response output stream.
I don't know that this is the most efficient method to perform this operation. Is there a way I can get the number of entities in a query's results before actually iterating through them all or fetching them directly into a list? (The list method doesn't work anyway because I'm using cursors, which requires the use of QueryResultIterator).
QueryResultIterator has a method getIndexList(). Would this be a less costly way to get the number of entities in the query's results? I'm assuming this list would contain exactly one index object for each entity in the query's results. Also, I'd need this list to only contain the indexes for the entities after the current cursor position for the interator. Is my understanding correct or would this method not do what I think it would?
A list of just indexes would require much less memory than loading a list of whole entities. Although, I don't know if this list would be limited at all by the query's prefetch or chunk sizes, or if I'd want to use the query's limit parameter at all because I would only be interested in knowing how many entities were in the results up to the maximum page size plus one (to know there are still more results and provide a cursor to continue).
Currently I'm setting the prefetch and chunk size (to the size of my page limit), but I'm not using the limit or offset parameters since I'm using cursors instead. From what I understand cursors are preferable to offset/limit. Would setting the limit parameter affect continuing a query with a cursor?
Clearly I have quite a few questions as to how GAE datastore queries work and how they're affected by changing parameters. So any insights are appreciated. The documentation for App Engine APIs is often sparse, as in one sentence descriptions of methods stating pretty much what can be deduced from the method signature. They don't generally go into much detail otherwise. Maybe the way I'm doing it currently is just fine after all. It works as is, but I'm trying to optimize the service call to get the best response time possible for my client application.
UPDATE: By the way, I am using Objectify v3 in my app and to perform this query. There are several places I am required to use the low-level datastore API, including to do geo-location queries (with geomodel) and projection queries (which aren't support in Objectify v3). So if there is a good way to do this using Objectify, that would be ideal. Otherwise I can use the low-level API, but it's always messier this way.

Both the low-level api and Objectify have a count() method (look at the javadocs for details). However, counting can be a very expensive and lengthy operation - it costs 1 small op for every number returned. For example, count() returning 5000 costs 5000 small ops (plus 1 read for the query), and takes as long as it would take to do a keys-only scan of all 5000 (which is what GAE actually does).
If you absolutely must have an exact count, you probably need to aggregate this value yourself by incrementing/decrementing a (possibly sharded) counter. This gets very tricky when you are dealing with filtered queries.
There is no one right solution here. Google searches give you totals like "About 119,000,000 results" which are deliberately inexact and almost certainly precalculated. For smaller result sets, using count() can be acceptable - but you might want to apply a limit() so that you never break the bank. You can always say "More than 500 results..."

if you want to fetch no of record than you can use following code
com.google.appengine.api.datastore.Query qry = new com.google.appengine.api.datastore.Query("EntityName");
com.google.appengine.api.datastore.DatastoreService datastoreService = DatastoreServiceFactory.getDatastoreService();
int totalCount = datastoreService.prepare(qry).countEntities(FetchOptions.Builder.withDefaults());
anf if you want filter than you can used
qry.addFilter("firstName", FilterOperator.EQUAL, firstName);
i hope it will help you

Reading Multiple Resultset using JPA

I am using JPA(Eclipselink) to execute SQL Server Stored Procedure which returns multiple Resultsets.
As per my knowledge, easiest way to call a SP is:
entityManager.createNativeQuery("exec sp_name").getResultList();
After executing the SP I can only read the single (or very first) ResultSet.
Can some one please suggest how do I retrieve the next ResultSets (or ResultLists())?

I can't answer for EclipseLink specifically, and I'm not sure what the JPA spec says, but most features of JPA took their cue from Hibernate, and Hibernate's limitations on stored procedures are:
The procedure must return a result set. Note that since these servers can return multiple result sets and update counts, Hibernate will iterate the results and take the first result that is a result set as its return value. Everything else will be discarded.
My guess is that JPA defines the same limitation.

EclipseLink has extended support for stored procedures through its StoreProcedureCall class and NamedStoredProcedureCallQuery annotation. You can create a JPA Query using a StoredProcedureCall using the JpaEntityManager interface createQuery(Call) API.
StoreProcedureCall provides additional support over JPA native SQL queries including support for in, out and intout parameters and cursored output parameters and typing. StoreProcedureCall supports calls with both a result set and output parameters, but does not currently support multiple result sets.
What is being returned in your second result set, and how do you want the result returned? You could subclass and customize your SQLServerPlatform in EclipseLink and overwrite the executeStoredProcedure() method to process multiple results sets. It should not be to hard, and you could contribute the code back to EclipseLink if successful. Or you could log and enhancement request for this feature. Looking at the code it should be simple enough to implement, the bigger issue is how to return the multiple result sets.

Using Hibernate's ScrollableResults to slowly read 90 million records

I simply need to read each row in a table in my MySQL database using Hibernate and write a file based on it. But there are 90 million rows and they are pretty big. So it seemed like the following would be appropriate:
ScrollableResults results = session.createQuery("SELECT person FROM Person person")
.setReadOnly(true).setCacheable(false).scroll(ScrollMode.FORWARD_ONLY);
while (results.next())
storeInFile(results.get()[0]);
The problem is the above will try and load all 90 million rows into RAM before moving on to the while loop... and that will kill my memory with OutOfMemoryError: Java heap space exceptions :(.
So I guess ScrollableResults isn't what I was looking for? What is the proper way to handle this? I don't mind if this while loop takes days (well I'd love it to not).
I guess the only other way to handle this is to use setFirstResult and setMaxResults to iterate through the results and just use regular Hibernate results instead of ScrollableResults. That feels like it will be inefficient though and will start taking a ridiculously long time when I'm calling setFirstResult on the 89 millionth row...
UPDATE: setFirstResult/setMaxResults doesn't work, it turns out to take an unusably long time to get to the offsets like I feared. There must be a solution here! Isn't this a pretty standard procedure?? I'm willing to forgo Hibernate and use JDBC or whatever it takes.
UPDATE 2: the solution I've come up with which works ok, not great, is basically of the form:
select * from person where id > <offset> and <other_conditions> limit 1
Since I have other conditions, even all in an index, it's still not as fast as I'd like it to be... so still open for other suggestions..

Using setFirstResult and setMaxResults is your only option that I'm aware of.
Traditionally a scrollable resultset would only transfer rows to the client on an as required basis. Unfortunately the MySQL Connector/J actually fakes it, it executes the entire query and transports it to the client, so the driver actually has the entire result set loaded in RAM and will drip feed it to you (evidenced by your out of memory problems). You had the right idea, it's just shortcomings in the MySQL java driver.
I found no way to get around this, so went with loading large chunks using the regular setFirst/max methods. Sorry to be the bringer of bad news.
Just make sure to use a stateless session so there's no session level cache or dirty tracking etc.
EDIT:
Your UPDATE 2 is the best you're going to get unless you break out of the MySQL J/Connector. Though there's no reason you can't up the limit on the query. Provided you have enough RAM to hold the index this should be a somewhat cheap operation. I'd modify it slightly, and grab a batch at a time, and use the highest id of that batch to grab the next batch.
Note: this will only work if other_conditions use equality (no range conditions allowed) and have the last column of the index as id.
select *
from person
where id > <max_id_of_last_batch> and <other_conditions>
order by id asc
limit <batch_size>

You should be able to use a ScrollableResults, though it requires a few magic incantations to get working with MySQL. I wrote up my findings in a blog post (http://www.numerati.com/2012/06/26/reading-large-result-sets-with-hibernate-and-mysql/) but I'll summarize here:
"The [JDBC] documentation says:
To enable this functionality, create a Statement instance in the following manner:
stmt = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
java.sql.ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);
This can be done using the Query interface (this should work for Criteria as well) in version 3.2+ of the Hibernate API:
Query query = session.createQuery(query);
query.setReadOnly(true);
// MIN_VALUE gives hint to JDBC driver to stream results
query.setFetchSize(Integer.MIN_VALUE);
ScrollableResults results = query.scroll(ScrollMode.FORWARD_ONLY);
// iterate over results
while (results.next()) {
Object row = results.get();
// process row then release reference
// you may need to evict() as well
}
results.close();
This allows you to stream over the result set, however Hibernate will still cache results in the Session, so you’ll need to call session.evict() or session.clear() every so often. If you are only reading data, you might consider using a StatelessSession, though you should read its documentation beforehand."

Set fetch size in query to an optimal value as given below.
Also, when caching is not required, it may be better to use StatelessSession.
ScrollableResults results = session.createQuery("SELECT person FROM Person person")
.setReadOnly(true)
.setFetchSize( 1000 ) // <<--- !!!!
.setCacheable(false).scroll(ScrollMode.FORWARD_ONLY)

FetchSize must be Integer.MIN_VALUE, otherwise it won't work.
It must be literally taken from the official reference: https://dev.mysql.com/doc/connector-j/5.1/en/connector-j-reference-implementation-notes.html

Actually you could have gotten what you wanted -- low-memory scrollable results with MySQL -- if you had used the answer mentioned here:
Streaming large result sets with MySQL
Note that you will have problems with Hibernate lazy-loading because it will throw an exception on any queries performed before the scroll is finished.

With 90 million records, it sounds like you should be batching your SELECTs. I've done with with Oracle when doing the initial load into a distrbuted cache. Looking at the MySQL documentation, the equivalent seems to be using the LIMIT clause: http://dev.mysql.com/doc/refman/5.0/en/select.html
Here's an example:
SELECT * from Person
LIMIT 200, 100
This would return rows 201 through 300 of the Person table.
You'd need to get the record count from your table first and then divide it by your batch size and work out your looping and LIMIT parameters from there.
The other benefit of this would be parallelism - you can execute multiple threads in parallel on this for faster processing.
Processing 90 million records also doesn't sound like the sweet spot for using Hibernate.

The problem could be, that Hibernate keeps references to all objests in the session until you close the session. That has nothing to do with query caching. Maybe it would help to evict() the objects from the session, after you are done writing the object to the file. If they are no longer references by the session, the garbage collector can free the memory and you won't run out of memory anymore.

I propose more than a sample code, but a query template based on Hibernate to do this workaround for you (pagination, scrolling and clearing Hibernate session).
It can also easily be adapted to use an EntityManager.

I've used the Hibernate scroll functionality successfully before without it reading the entire result set in. Someone said that MySQL does not do true scroll cursors, but it claims to based on the JDBC dmd.supportsResultSetType(ResultSet.TYPE_SCROLL_INSENSITIVE) and searching around it seems like other people have used it. Make sure it's not caching the Person objects in the session - I've used it on SQL queries where there was no entity to cache. You can call evict at the end of the loop to be sure or test with a sql query. Also play around with setFetchSize to optimize the number of trips to the server.

recently i worked over a problem like this, and i wrote a blog about how face that problem. is very like, i hope be helpfull for any one.
i use lazy list approach with partial adquisition. i Replaced the limit and offset or the pagination of query to a manual pagination.
In my example, the select returns 10 millions of records, i get them and insert them in a "temporal table":
create or replace function load_records ()
returns VOID as $$
BEGIN
drop sequence if exists temp_seq;
create temp sequence temp_seq;
insert into tmp_table
SELECT linea.*
FROM
(
select nextval('temp_seq') as ROWNUM,* from table1 t1
join table2 t2 on (t2.fieldpk = t1.fieldpk)
join table3 t3 on (t3.fieldpk = t2.fieldpk)
) linea;
END;
$$ language plpgsql;
after that, i can paginate without count each row but using the sequence assigned:
select * from tmp_table where counterrow >= 9000000 and counterrow <= 9025000
From java perspective, i implemented this pagination through partial adquisition with a lazy list. this is, a list that extends from Abstract list and implements get() method. The get method can use a data access interface to continue get next set of data and release the memory heap:
#Override
public E get(int index) {
if (bufferParcial.size() <= (index - lastIndexRoulette))
{
lastIndexRoulette = index;
bufferParcial.removeAll(bufferParcial);
bufferParcial = new ArrayList<E>();
bufferParcial.addAll(daoInterface.getBufferParcial());
if (bufferParcial.isEmpty())
{
return null;
}
}
return bufferParcial.get(index - lastIndexRoulette);<br>
}
by other hand, the data access interface use query to paginate and implements one method to iterate progressively, each 25000 records to complete it all.
results for this approach can be seen here
http://www.arquitecturaysoftware.co/2013/10/laboratorio-1-iterar-millones-de.html

Another option if you're "running out of RAM" is to just request say, one column instead of the entire object How to use hibernate criteria to return only one element of an object instead the entire object? (saves a lot of CPU process time to boot).

For me it worked properly when setting useCursors=true, otherwise The Scrollable Resultset ignores all the implementations of fetch size, in my case it was 5000 but Scrollable Resultset fetched millions of records at once causing excessive memory usage. underlying DB is MSSQLServer.
jdbc:jtds:sqlserver://localhost:1433/ACS;TDS=8.0;useCursors=true

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.