Mybatis batch processing - java

I have a table with 3+ million records. I need to read all those records from the DB, send them to a Kafka queue for another system to process, then read the results from the output Kafka queue and write them back to the DB.
I need to read and write in reasonably sized portions, otherwise I immediately get an OOM exception.
What are possible technical solutions to achieve batched read and write operations with MyBatis?
Neat working examples would be much appreciated.

I will write pseudo code as I don't know much about Kafka.
First, at read time, MyBatis' default behavior is to return the result in a List, but you don't want to load 3 million objects into memory. This must be overridden by using a custom implementation of org.apache.ibatis.session.ResultHandler<T>:
@Override
public void handleResult(final ResultContext<? extends YourType> context) {
    addToKafkaQueue(context.getResultObject());
}
Also set the fetchSize of the statement (when using an annotation-based mapper: @Options(fetchSize=500)) if there is no value defined in the MyBatis global settings. If left unset, this option falls back to the driver default, which differs for every DB vendor. It defines how many records are buffered at once in the result set. E.g. for Oracle the default is 10, generally too low because it involves too many read round trips from the app to the DB; for PostgreSQL it is unlimited (the whole result set), which is too much. You will have to find the right balance between speed and memory usage.
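A minimal sketch of how the read side fits together (the statement id "YourMapper.selectAll", YourType and the sendToKafkaQueue(...) helper are placeholders, not from the question):
import org.apache.ibatis.session.ResultHandler;
import org.apache.ibatis.session.SqlSession;
import org.apache.ibatis.session.SqlSessionFactory;

public void streamAllToKafka(SqlSessionFactory sqlSessionFactory) {
    try (SqlSession session = sqlSessionFactory.openSession()) {
        ResultHandler<YourType> handler =
                context -> sendToKafkaQueue(context.getResultObject()); // one row at a time, never a full List
        session.select("YourMapper.selectAll", handler); // fetchSize comes from @Options or the XML mapper
    }
}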
For the update:
do {
    YourType object = readFromKafkaQueue();
    mybatisMapper.update(object);
} while (kafkaQueueHasMoreElements());
sqlSession.flushStatements(); // only when using ExecutorType.BATCH
The most important thing is the ExecutorType (an argument of SqlSessionFactory.openSession()): either ExecutorType.REUSE, which prepares the statement only once instead of on every iteration as the default ExecutorType.SIMPLE does, or ExecutorType.BATCH, which stacks the statements and actually executes them only on flush.
It now remains to think about the transactions: this might mean committing 3 million updates at once, or it could be segmented into smaller commits.
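Putting the write side together, a sketch with ExecutorType.BATCH and segmented commits could look like this (YourType, YourTypeMapper and the batch size of 500 are assumptions):
import org.apache.ibatis.session.ExecutorType;
import org.apache.ibatis.session.SqlSession;
import org.apache.ibatis.session.SqlSessionFactory;

public void writeBack(SqlSessionFactory sqlSessionFactory, Iterable<YourType> resultsFromKafka) {
    try (SqlSession session = sqlSessionFactory.openSession(ExecutorType.BATCH, false)) {
        YourTypeMapper mapper = session.getMapper(YourTypeMapper.class);
        int count = 0;
        for (YourType result : resultsFromKafka) {
            mapper.update(result);
            if (++count % 500 == 0) {
                session.flushStatements(); // execute the stacked statements
                session.commit();          // segment the transaction instead of one huge commit
            }
        }
        session.flushStatements();
        session.commit();
    }
}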

You need to create a separate instance of the mapper for batch processing.
This might be helpful.
http://pretius.com/how-to-use-mybatis-effectively-perform-batch-db-operations/

Spring mongo thread safety

I am learning about Spring Boot & concurrency. I understand that when Spring Boot receives multiple requests, it will spin up multiple threads to handle them. I have a method here which accesses Mongo. This method will save a new SomeResult (probably with some new values set by the caller). My question is: if there are, say, 100 concurrent calls to my Spring Boot controller, and I'm getting the SomeResult object, setting the values, saving, etc., will the values be inconsistent?
public void upsert(SomeResult someResult) {
    String collection = this.SomeResultConfig.getCollectionSomeResultCollection();
    String queryStr = "{testingID : '%s'}";
    queryStr = String.format(queryStr, someResult.getTestingID());
    Query query = new BasicQuery(queryStr);
    List<SomeResult> someResultList = this.mongoOps.find(query, SomeResult.class, collection);
    if (!someResultList.isEmpty()) {
        this.mongoOps.findAllAndRemove(query, collection);
    }
    this.mongoOps.save(someResult, collection);
}
Yes, it is possible that 2 threads will read the result first, then each make a modification, and then each thread writes its modification to the database. The database ends up with only one of the values - the later one; the earlier write is lost. This is called the lost update phenomenon.
Apart from using transactions with a sufficient isolation level, you can also use optimistic locking.
The @Version annotation provides syntax similar to that of JPA in the context of MongoDB and makes sure updates are only applied to documents with a matching version. Therefore, the actual value of the version property is added to the update query in such a way that the update does not have any effect if another operation altered the document in the meantime. In that case, an OptimisticLockingFailureException is thrown. The following example shows these features:
https://docs.spring.io/spring-data/mongodb/docs/current/reference/html/#mongo-template.optimistic-locking
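A minimal sketch of such a versioned document, based on the SomeResult type from the question (the id and version fields are assumptions):
import org.springframework.data.annotation.Id;
import org.springframework.data.annotation.Version;
import org.springframework.data.mongodb.core.mapping.Document;

@Document
public class SomeResult {

    @Id
    private String id;

    private String testingID;

    @Version
    private Long version; // bumped on every save; a stale save throws OptimisticLockingFailureException
}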
In your code you can decide whether to retry the whole operation if the exception is thrown, or to report the error to the user.
MongoDB document level operations are atomic. Thus, if multiple threads modify the same document, the resulting document will be some interleaving of those update operations. If those update operations update common values, the resulting value will be what was set with the last update, which is nondeterministic.
If however your update operations are multi-step operations (like read a document, modify it in memory, and rewrite it back), then when you write the document there is no guarantee that you are not overwriting the changes made by another thread. For these kinds of updates, consider using mongo transactions so the read-update operation makes sure what you update is what you read.
If multiple threads are working on different documents you don't have any of these problems.
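For the upsert in the question specifically, one way to avoid the read-modify-write race without transactions is to express it as a single atomic upsert. A sketch, where the Update fields (getSomeValue / "someValue") are placeholders for whatever SomeResult actually carries:
import org.springframework.data.mongodb.core.MongoOperations;
import org.springframework.data.mongodb.core.query.Criteria;
import org.springframework.data.mongodb.core.query.Query;
import org.springframework.data.mongodb.core.query.Update;

public void upsert(SomeResult someResult, MongoOperations mongoOps, String collection) {
    Query query = Query.query(Criteria.where("testingID").is(someResult.getTestingID()));
    Update update = new Update().set("someValue", someResult.getSomeValue()); // hypothetical field
    mongoOps.upsert(query, update, collection); // a single document-level, atomic operation
}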

Does cursor data remain in Java heap memory or in the database for StoredProcedureItemReader and JdbcCursorItemReader?

In Spring Batch, normally when we use any ItemReader it fetches one row with every call of the read method, and once the number of items equals the commit interval, it sends the data to the writer.
I have read that StoredProcedureItemReader and JdbcCursorItemReader are cursor-based readers: once the query is executed the data is in a cursor, and the read method gets one row on each call.
However, my question is where this cursor resides, in Java memory or in the database, because:
1. If it is fetching all the data into Java memory in a single go, then what is the advantage of batching, as the application may become slow or run out of memory.
2. If it is somewhere in the database, then the stored procedure or JDBC connection itself will be held open (and may time out) until all the records are fetched and processed.
I tried to find the answer but didn't find it anywhere in the documentation, and I also don't know how to test this myself to know for sure.
It seems important to me, because anyone using these readers may need to increase the connection timeout or have more heap memory.
Yes, with a StoredProcedureItemReader, the whole batch is fetched. This is because many drivers (including some of Oracle's) do not support batching CallableStatements. Some will fail, and some will silently ignore you. Thus to be consistent, Spring Batch allows no batching for stored procedures at all.
Some database drivers, like the IBM Informix driver, allow automatic batching based on parameters in the connection string; in other words, it will batch if possible without having to explicitly control it, but there are caveats to that. If in your use case, you absolutely NEED to batch stored procedures and you don't want to write a custom ItemReader or ItemWriter, then this might be an option for you.
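For what it's worth, with JdbcCursorItemReader the cursor is held open on the database side for the duration of the step, and the driver only buffers fetchSize rows in the JVM, which is why the connection has to stay open (and may need a higher timeout). A minimal sketch of such a reader, with made-up table and mapping names:
import javax.sql.DataSource;
import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.jdbc.core.BeanPropertyRowMapper;

public JdbcCursorItemReader<Customer> customerReader(DataSource dataSource) {
    JdbcCursorItemReader<Customer> reader = new JdbcCursorItemReader<>();
    reader.setDataSource(dataSource);
    reader.setSql("SELECT id, name FROM customer"); // hypothetical table
    reader.setRowMapper(new BeanPropertyRowMapper<>(Customer.class));
    reader.setFetchSize(500); // rows buffered per round trip, not the whole result set
    return reader;
}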

Handling very large amount of data in MyBatis

My goal is actually to dump all the data of a database to an XML file. The database is not terribly big, it's about 300MB. The problem is that I have a memory limitation of 256MB (in JVM) only. So obviously I cannot just read everything into memory.
I managed to solve this problem using iBatis (yes, I mean iBatis, not myBatis) by calling its getList(... int skip, int max) multiple times with an incremented skip. That does solve my memory problem, but I'm not impressed with the speed. The variable names suggest that what the method does under the hood is read the entire result set and skip until the specified record. This sounds quite redundant to me (I'm not saying that's what the method is doing, I'm just guessing based on the variable names).
Now, I switched to myBatis 3 for the next version of my application. My question is: is there any better way to handle a large amount of data chunk by chunk in myBatis? Is there any way to make myBatis process the first N records, return them to the caller while keeping the result set connection open, so that the next time the user calls getList(...) it will start reading from record N+1 without doing any "skipping"?
myBatis CAN stream results. What you need is a custom result handler. With this you can take each row separately and write it to your XML file. The overall scheme looks like this:
session.select(
    "mappedStatementThatFindsYourObjects",
    parametersForStatement,
    resultHandler);
Where resultHandler is an instance of a class implementing the ResultHandler interface. This interface has just one method handleResult. This method provides you with a ResultContext object. From this context you can retrieve the row currently being read and do something with it.
@Override
public void handleResult(ResultContext context) {
    Object result = context.getResultObject();
    doSomething(result);
}
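For the XML-dump use case, a complete handler could look roughly like the sketch below, assuming the mapped statement uses resultType="map" and that an XMLStreamWriter is already open. Only the ResultHandler wiring is MyBatis; the rest is illustrative:
import java.util.Map;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;
import org.apache.ibatis.session.ResultContext;
import org.apache.ibatis.session.ResultHandler;

public class XmlDumpResultHandler implements ResultHandler<Map<String, Object>> {

    private final XMLStreamWriter xml;

    public XmlDumpResultHandler(XMLStreamWriter xml) {
        this.xml = xml;
    }

    @Override
    public void handleResult(ResultContext<? extends Map<String, Object>> context) {
        Map<String, Object> row = context.getResultObject();
        try {
            xml.writeStartElement("row"); // one row in, one element out - nothing accumulates in memory
            for (Map.Entry<String, Object> column : row.entrySet()) {
                xml.writeStartElement(column.getKey());
                xml.writeCharacters(String.valueOf(column.getValue()));
                xml.writeEndElement();
            }
            xml.writeEndElement();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Failed to write row to XML", e);
        }
    }
}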
No, mybatis does not have full capability to stream results yet.
EDIT 1:
If you don't need nested result mappings, then you can implement a custom result handler to stream results on currently released versions of MyBatis (3.1.1). The current limitation is when you need to do complex result mapping: the NestedResultSetHandler does not allow custom result handlers. A fix is available, and it looks like it is currently targeted for 3.2. See Issue 577.
In summary, to stream large result sets using MyBatis you'll need to:
Implement your own ResultHandler.
Increase fetch size. (as noted below by Guillaume Perrot)
For Nested result maps, use the fix discussed on Issue 577. This fix also resolves some memory issues with large result sets.
I have successfully used MyBatis streaming with the Cursor. The Cursor has been implemented in MyBatis in this PR.
From the documentation, it is described as:
A Cursor offers the same results as a List, except it fetches data
lazily using an Iterator.
Besides, the code documentation says
Cursors are a perfect fit to handle millions of items queries that
would not normally fits in memory.
Here is an example implementation which I was able to use successfully:
import org.apache.ibatis.session.SqlSessionFactory;
import org.mybatis.spring.SqlSessionFactoryBean;
// You need a SqlSessionFactory somehow; if using Spring you can build it with SqlSessionFactoryBean:
SqlSessionFactoryBean factoryBean = new SqlSessionFactoryBean();
factoryBean.setDataSource(dataSource); // a configured javax.sql.DataSource
SqlSessionFactory sqlSessionFactory = factoryBean.getObject();
Then you define your mapper, e.g., UserMapper with the SQL query that returns a Cursor of your target object, not a List. The whole idea is to not store all the elements in memory:
import org.apache.ibatis.annotations.Select;
import org.apache.ibatis.cursor.Cursor;

public interface UserMapper {
    @Select("SELECT * FROM users")
    Cursor<User> getAll();
}
Then you write the code that will use an open SQL session from the factory and query using your mapper:
try (SqlSession sqlSession = sqlSessionFactory.openSession()) {
    Iterator<User> iterator = sqlSession.getMapper(UserMapper.class)
            .getAll()
            .iterator();
    while (iterator.hasNext()) {
        doSomethingWithUser(iterator.next());
    }
}
handleResult receives as many records as the query gets, with no pause.
When there were too many records to process I used sqlSessionFactory.openSession().getConnection().
Then, as with normal JDBC, get a Statement, get the ResultSet, and process the records one by one. Don't forget to close the session.
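A rough sketch of that fallback; the table name, the fetch size and the processRow(...) helper are placeholders:
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import org.apache.ibatis.session.SqlSession;
import org.apache.ibatis.session.SqlSessionFactory;

void dumpWithPlainJdbc(SqlSessionFactory sqlSessionFactory) throws SQLException {
    try (SqlSession session = sqlSessionFactory.openSession();
         Statement statement = session.getConnection().createStatement()) {
        statement.setFetchSize(500); // keep the driver from buffering the whole result set
        try (ResultSet resultSet = statement.executeQuery("SELECT * FROM resource")) {
            while (resultSet.next()) {
                processRow(resultSet); // hypothetical: handle one record at a time
            }
        }
    } // closing the session also releases the connection
}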
If you are just dumping all the data from the tables without any ordering requirement, why not do the pagination directly in SQL? Put a limit on the query statement and use a different record id as the offset each time, to split the whole table into chunks, each of which can be read directly into memory if the row limit is a reasonable number.
The SQL could be something like:
SELECT * FROM resource
WHERE "ID" >= continuation_id
ORDER BY "ID"
LIMIT 300;
I think this can be viewed as an alternative way to dump all the data in chunks, sidestepping the differences in feature support between MyBatis and other persistence layers.
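A sketch of that chunked read driven from Java; the entity, mapper and writeToXml(...) names are made up, and the query uses "id > lastId" so the next chunk starts after the last row already processed:
import java.util.List;
import org.apache.ibatis.annotations.Param;
import org.apache.ibatis.annotations.Select;

public interface ResourceMapper {
    @Select("SELECT * FROM resource WHERE id > #{lastId} ORDER BY id LIMIT #{limit}")
    List<Resource> findChunk(@Param("lastId") long lastId, @Param("limit") int limit);
}

// caller:
long lastId = 0;
List<Resource> chunk;
do {
    chunk = resourceMapper.findChunk(lastId, 300); // at most 300 rows in memory at a time
    for (Resource resource : chunk) {
        writeToXml(resource);      // hypothetical consumer
        lastId = resource.getId(); // remember where the next chunk starts
    }
} while (!chunk.isEmpty());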

Hibernate Batch Insert. Would it ever use one insert instead of multiple inserts?

I've been looking around trying to determine some Hibernate behavior that I'm unsure about. In a scenario where Hibernate batching is properly set up, will it only ever use multiple insert statements when a batch is sent? Is it not possible to use a DB independent multi-insert statement?
I guess I'm trying to determine if I actually have the batching set up correctly. I see the multiple insert statements but then I also see the line "Executing batch size: 25."
There's a lot of code I could post but I'm trying to keep this general. So, my questions are:
1) What can you read in the logs to be certain that batching is being used?
2) Is it possible to make Hibernate use a multi-row insert versus multiple insert statements?
Hibernate uses multiple insert statements (one per entity to insert), but sends them to the database in batch mode (using Statement.addBatch() and Statement.executeBatch()). This is the reason you're seeing multiple insert statements in the log, but also "Executing batch size: 25".
The use of batched statements greatly reduces the number of roundtrips to the database, and I would be surprised if it were less efficient than executing a single statement with multiple inserts. Moreover, it also allows mixing updates and inserts, for example, in a single database call.
I'm pretty sure it's not possible to make Hibernate use multi-row inserts, but I'm also pretty sure it would be useless.
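To see the batching at work yourself, the usual pattern is to set hibernate.jdbc.batch_size in the configuration and flush/clear the session at the same interval. A sketch with illustrative names (sessionFactory, Customer) and a batch size of 25:
import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.hibernate.Transaction;

// hibernate.jdbc.batch_size=25 is assumed to be set in the Hibernate configuration
Session session = sessionFactory.openSession();
Transaction tx = session.beginTransaction();
for (int i = 0; i < 1000; i++) {
    session.save(new Customer("name-" + i)); // hypothetical entity
    if (i % 25 == 0) {
        session.flush(); // pushes the current batch via Statement.executeBatch()
        session.clear(); // detaches the inserted entities so the session stays small
    }
}
tx.commit();
session.close();
Note that with an IDENTITY id generator Hibernate disables JDBC insert batching, which is a common reason the logged batch size never grows.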
I know that this is an old question, but I had the same problem: I thought that Hibernate batching meant that Hibernate would combine multiple inserts into one statement, which it doesn't seem to do.
After some testing I found this answer saying that a batch of multiple inserts is just as good as a multi-row insert. I did a test inserting 1000 rows, once using Hibernate batching and once without. Both tests took about 20s, so there was no performance gain in using Hibernate batching.
To be sure, I tried using the rewriteBatchedStatements option from the MySQL Connector/J, which actually combines multiple inserts into one statement. It reduced the time to insert 1000 records down to 3s.
So after all, Hibernate batching seems to be useless and a real multi-row insert to be much better. Am I doing something wrong, or what causes my test results?
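For reference, the Connector/J option mentioned above is enabled in the JDBC URL (host, port and database name here are placeholders):
jdbc:mysql://localhost:3306/mydb?rewriteBatchedStatements=true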
Oracle bulk insert collects an array of entities and passes it to the database in a single block, associating a single insert/update/delete with the whole array.
It is the only way to really speed up network throughput.
Oracle suggests doing it by calling a stored procedure from Hibernate and passing it an array of data.
http://biemond.blogspot.it/2012/03/oracle-bulk-insert-or-select-from-java.html?m=1
It is not only a software problem but an infrastructural one!
The problem is network data flow optimization and TCP stack fragmentation.
MySQL has a similar feature.
You have to do something like what is described in that article.
Transferring the right volume of data over the network in one go is the solution.
You also have to check the network MTU and the Oracle SDU/TDU settings with respect to the data transferred between the application and the database.

General Transaction Question dealing with many records

I'm working with a code base that is new to me, and it uses iBatis.
I need to update or add to an existing table, and it may involve 20,000+ records.
The process will run once per day, and run in the middle of the night.
I'm getting the data from a web services call. I plan to get the data, then populate one model type object per record, and pass each model type object to some method that will read the data in the object, and update/insert the data into the table.
Example:
ArrayList records= new ArrayList();
Foo foo= new Foo();
foo.setFirstName("Homer");
foo.setLastName("Simpson");
records.add(foo);
//make more Foo objects, and put in ArrayList.
updateOrInsert(records); //this method then iterates over the list and calls some method that does the updating/inserting
My main question is how to handle all of the updating/inserting as a transaction. If the system goes down before all of the records are read and used to update/insert the table, I need to know, so I can go back to the web services call and try again when the system is OK.
I am using Java 1.4, and the db is Oracle.
I would highly recommend you consider using Spring Batch - http://static.springsource.org/spring-batch/
The framework provides a lot of the essential features required for batch processing: error reporting, transaction management, multi-threading, scaling, and input validation.
The framework is very well designed and very easy to use.
The approach you have listed might not perform very well, since you are waiting to read all objects, storing them all in memory and then inserting into the database.
You might want to consider designing the process as follows:
1. Create a cache capable of storing 200 objects.
2. Invoke the web service to fetch the data.
3. Create an instance of an object, validate it and store the data in the object's fields.
4. Add the object to the cache.
5. When the cache is full, perform a batch commit of the objects in the cache to the database.
6. Continue with step 2.
Spring Batch will allow you to perform batch commits, control the size of the batch commits, and perform error handling when reading input (in your case, retrying the request) as well as when writing the data to the database.
Have a look at it.
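For illustration, a chunk-oriented step in Spring Batch looks roughly like the sketch below; the bean names, the Foo item type and the chunk size of 200 are assumptions, not something prescribed by the framework:
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;

@Bean
public Step importRecordsStep(StepBuilderFactory steps,
                              ItemReader<Foo> webServiceReader,
                              ItemWriter<Foo> recordWriter) {
    return steps.get("importRecordsStep")
            .<Foo, Foo>chunk(200)     // commit interval: one transaction per 200 records
            .reader(webServiceReader) // e.g. wraps the web service call
            .writer(recordWriter)     // e.g. performs the batched updates/inserts
            .build();
}
If a chunk fails, the transaction for that chunk is rolled back and the job can be restarted from the last successfully committed chunk.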
