Spring mongo thread safety - java

I am learning about Spring Boot and concurrency. I understand that when Spring Boot receives multiple requests, it will spin up multiple threads to handle them. I have a method here which accesses Mongo. This method saves a new SomeResult (with some new values set by the caller). My question is: if there are, say, 100 concurrent calls to my Spring Boot controller, and I'm fetching the SomeResult object, setting values and saving it, will the values be inconsistent?
public void upsert(SomeResult someResult) {
    String collection = this.SomeResultConfig.getCollectionSomeResultCollection();
    String queryStr = String.format("{testingID : '%s'}", someResult.getTestingID());
    Query query = new BasicQuery(queryStr);
    List<SomeResult> someResultList = this.mongoOps.find(query, SomeResult.class, collection);
    if (!someResultList.isEmpty()) {
        this.mongoOps.findAllAndRemove(query, collection);
    }
    this.mongoOps.save(someResult, collection);
}

Yes, it is possible that two threads will read the document first, then each make a modification and write it back to the database. The database ends up with only one of the values, the later one; the earlier write is lost. This is called the lost update phenomenon.
Apart from using transactions with a sufficient isolation level, you can also use optimistic locking.
The @Version annotation provides syntax similar to that of JPA in the context of MongoDB and makes sure updates are only applied to documents with a matching version. Therefore, the actual value of the version property is added to the update query in such a way that the update does not have any effect if another operation altered the document in the meantime. In that case, an OptimisticLockingFailureException is thrown. The following reference shows these features:
https://docs.spring.io/spring-data/mongodb/docs/current/reference/html/#mongo-template.optimistic-locking
In your code you can decide whether you want to retry the whole operation when the exception is thrown or report the error to the user.
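For illustration, here is a minimal sketch of what this could look like for the SomeResult document from the question (every field other than testingID is an assumption):

import org.springframework.dao.OptimisticLockingFailureException;
import org.springframework.data.annotation.Id;
import org.springframework.data.annotation.Version;
import org.springframework.data.mongodb.core.mapping.Document;

@Document
public class SomeResult {

    @Id
    private String id;

    private String testingID;

    @Version // Spring Data adds this value to the update criteria on save
    private Long version;

    // getters and setters omitted
}

// On save, catch the conflict and decide whether to retry or report it:
try {
    mongoOps.save(someResult, collection);
} catch (OptimisticLockingFailureException e) {
    // another thread changed the document in the meantime:
    // re-read it, re-apply the changes and save again, or surface the error
}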

MongoDB document level operations are atomic. Thus, if multiple threads modify the same document, the resulting document will be some interleaving of those update operations. If those update operations update common values, the resulting value will be what was set with the last update, which is nondeterministic.
If, however, your update operations are multi-step operations (read a document, modify it in memory, and write it back), then when you write the document there is no guarantee that you are not overwriting the changes made by another thread. For these kinds of updates, consider using MongoDB transactions so that what you update is what you read.
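As a hedged sketch of that approach with Spring Data MongoDB (assuming a MongoDatabaseFactory is available, the MongoTemplate behind mongoOps is built on that same factory, and the server supports multi-document transactions, i.e. a replica set on MongoDB 4.0+), the read-check-write from the question could be wrapped like this:

import java.util.List;
import org.springframework.data.mongodb.MongoDatabaseFactory;
import org.springframework.data.mongodb.MongoTransactionManager;
import org.springframework.transaction.support.TransactionTemplate;

MongoTransactionManager txManager = new MongoTransactionManager(mongoDatabaseFactory);
TransactionTemplate txTemplate = new TransactionTemplate(txManager);

txTemplate.execute(status -> {
    // the whole read-modify-write happens against one session, so another
    // thread's write cannot slip in between the find and the save
    List<SomeResult> existing = mongoOps.find(query, SomeResult.class, collection);
    if (!existing.isEmpty()) {
        mongoOps.findAllAndRemove(query, collection);
    }
    mongoOps.save(someResult, collection);
    return null;
});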
If multiple threads are working on different documents you don't have any of these problems.

Related

Multithreading in Grails - Passing domain objects into each thread causes some fields to randomly be null

I am trying to speed up a process in a Grails application by introducing parallel programming. This particular process requires sifting through thousands of documents, gathering the necessary data from them and exporting it to an excel file.
After many hours of trying to track down why this process was going so slowly, I've determined that the process has to do a lot of work gathering specific parts of data from each domain object. (Example: The domain object has lists of data inside it, and this process takes each index in these lists and appends it to a string with commas to make a nice looking, sorted list in a cell of the excel sheet. There are more examples but those shouldn't be important.)
So anything that wasn't a simple data access (document.id, document.name, etc...) was causing this process to take a long time.
My idea was to use a thread per document to gather all this data asynchronously. When each thread finishes, its result can come back to the main thread and be placed into the excel sheet; by then everything is simple data access, because the thread has already gathered all the data.
This seems to be working, however I have a bug with the domain objects and the threads. Each thread is passed its corresponding document domain object, but for whatever reason the document domain objects will randomly have parts of their data changed to null.
For example: Before the document is passed into the thread, one part of the domain object will have a list that looks like this: [US, England, Wales], randomly at any point, the list will look like this in the thread: [US, null, Wales]. And this happens for any random part of the domain object, at any random time.
Generating the threads:
def docThreadPool = Executors.newFixedThreadPool(1)
def docThreadsResults = new Future<Map>[filteredDocs.size()]
def docCount = 0

filteredDocs.each {
    def final document = it
    def future = docThreadPool.submit(new DocumentExportCallable(document))
    docThreadsResults[docCount] = future
    docCount++
}
Getting the data back from the threads:
def count = 0
filteredDocs.each {
    def data = docThreadsResults[count].get()
    // build excel spreadsheet...
    count++
}
DocumentExportCallable class:
class DocumentExportCallable implements Callable {

    def final document

    DocumentExportCallable(document) {
        this.document = document
    }

    Map call() {
        def data = [:]
        // code to get all the data...
        return data
    }
}
EDIT:
As seen below, it would be useful if I could show you the domain object, but I am not able to do this. However, the fact that you asked about the domain object had me thinking that it might be where the problem lies. It turns out that every part of the domain object that randomly messes up in the threads is a property defined in the domain object's "mapping" block, which uses SQL joins to get the data for those properties. I've just been made aware of lazy vs. eager fetching in Grails, and I'm wondering if this is where the problem lies: by default these properties are lazily fetched, so the constant access to the DB by each thread might be where things are going wrong. I believe changing this to eager fetching might solve the problem.
I have the answer to why these null values were appearing randomly. Everything seems to be working now and my implementation is now performing much faster than the previous implementation!
It turns out I was unaware that Grails domain objects with 1-m relationships make separate SQL calls when you access those fields, even after you have fetched the object itself. This must have caused the threads to make non-thread-safe SQL calls, which created the random null values. Setting these 1-m properties to be eagerly fetched in this specific case fixed the issue.
For anyone reading later on, you'll want to read up on lazy vs eager fetching to get a better understanding.
As for the code:
These are the 1-m variables that were the issue in my domain object:
static hasMany = [propertyOne : OtherDomainObject, propertyTwo : OtherDomainObject, propertyThree : OtherDomainObject]
I added a flag to my database call which would enable this code for this specific case, as I didn't want these properties to always be eagerly fetched throughout the app:
if (isEager) {
fetchMode 'propertyOne', FetchMode.JOIN
fetchMode 'propertyTwo', FetchMode.JOIN
fetchMode 'propertyThree', FetchMode.JOIN
setResultTransformer Criteria.DISTINCT_ROOT_ENTITY
}
My apologies, but at the moment I do not remember why I had to put the setResultTransformer in the code above; without it there were issues. Maybe someone can explain this later on, otherwise I'm sure a Google search will explain it.
What is happening is that your Grails domain objects were detaching from the Hibernate session, thus hitting a LazyInitializationException when your thread attempted to load lazy properties.
It's good that switching to eager fetching worked for you, but it may not be an option for everyone. What you could also have done is use the Grails async task framework instead, as it has built-in session handling. See https://async.grails.org/latest/guide/index.html
However, even with Grails async tasks, passing an object between threads seems to detach it, as the new thread will have a newly bound session. The solutions I have found were to call either .attach() or .merge() on the new thread to re-bind the object to that thread's session.
I believe the optimal solution would be to have Hibernate load the object on the new thread, meaning that in your code snippet you would pass a document id and call Document.get(id) on the session-supported thread.

Mybatis batch processing

I have a table with 3+ million records. I need to read all those records from the DB, send them for processing to a Kafka queue for another system to process, then read the results from the output Kafka queue and write them back to the DB.
I need to read and write in reasonably sized portions, otherwise I get an OOM exception right away.
What could be possible technical solutions to achieve batch read and write operations with MyBatis?
Neat working examples would be much appreciated.
I will write pseudo code as I don't know much about Kafka.
First, at read time, MyBatis' default behavior is to return results in a List, but you don't want to load 3 million objects into memory. This must be overridden with a custom implementation of org.apache.ibatis.session.ResultHandler<T>:
@Override
public void handleResult(final ResultContext<? extends YourType> context) {
    addToKafkaQueue(context.getResultObject());
}
Also set the fetchSize of the statement (when using an annotation-based mapper: @Options(fetchSize = 500)) if no value is defined in the MyBatis global settings. If left unset, this option falls back to the driver default, which differs per DB vendor. It defines how many records at a time are buffered in the result set. E.g. for Oracle this value is 10, generally too low because it involves too many read round trips from the app to the DB; for PostgreSQL it is unlimited (the whole result set), which is too much. You will have to find the right balance between speed and memory usage.
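Putting the two pieces together with an annotation-based mapper could look roughly like this (YourType, YourTypeMapper, your_table and streamAll are placeholder names for the example, not anything from the question):

import org.apache.ibatis.annotations.Options;
import org.apache.ibatis.annotations.ResultType;
import org.apache.ibatis.annotations.Select;
import org.apache.ibatis.session.ResultHandler;

public interface YourTypeMapper {

    // void return type plus a ResultHandler parameter: rows are pushed to the
    // handler one by one instead of being collected into a List
    @Select("SELECT * FROM your_table")
    @Options(fetchSize = 500)
    @ResultType(YourType.class)
    void streamAll(ResultHandler<YourType> handler);
}

// usage:
// mapper.streamAll(context -> addToKafkaQueue(context.getResultObject()));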
For the update:
do {
    YourType object = readFromKafkaQueue();
    mybatisMapper.update(object);
} while (kafkaQueueHasMoreElements());
sqlSession.flushStatements(); // only when using ExecutorType.BATCH
The most important thing is the ExecutorType (an argument to SqlSessionFactory.openSession()): either ExecutorType.REUSE, which will prepare the statement only once instead of at every iteration as the default ExecutorType.SIMPLE does, or ExecutorType.BATCH, which will stack the statements and actually execute them only on flush.
Now it remains to think about the transactions: this might mean committing 3 million updates in one go, or it could be segmented.
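A sketch of what the batch variant could look like (readFromKafkaQueue(), kafkaQueueHasMoreElements() and YourTypeMapper are placeholders carried over from the pseudo code above; committing every 500 statements is just one possible segmentation):

import org.apache.ibatis.session.ExecutorType;
import org.apache.ibatis.session.SqlSession;

try (SqlSession session = sqlSessionFactory.openSession(ExecutorType.BATCH)) {
    YourTypeMapper mybatisMapper = session.getMapper(YourTypeMapper.class);
    int pending = 0;
    while (kafkaQueueHasMoreElements()) {
        mybatisMapper.update(readFromKafkaQueue());
        if (++pending % 500 == 0) {     // flush and commit in segments
            session.flushStatements();
            session.commit();
        }
    }
    session.flushStatements();          // execute the remaining stacked statements
    session.commit();
}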
You need to create a separate instance of the mapper for batch processing.
This might be helpful.
http://pretius.com/how-to-use-mybatis-effectively-perform-batch-db-operations/

Concurrency Control on my Code

I am working on an order capture and generation application. The application works fine with concurrent users working on different orders. The problem starts when two users from different systems/locations try to work on the same order. The business impact is that, for the same order, the application will generate duplicate data, since two users are working on that order simultaneously.
I have tried to synchronize the method where I generate the order, but that means that no other user can work on any new order, since synchronize puts a lock on that method. This blocks all users from generating a new order while one order is in progress, since they all hit the synchronized code.
I have also tried criteria initialization for an order, but with no success.
Can anyone please suggest a proper approach??
All suggestions/comments are welcome. Thanks in advance.
Instead of synchronizing at the method level, you can use block-level synchronization for the blocks of code which must be operated on by only one thread at a time. This way you increase the scope for parallel processing of different orders; a sketch of locking per order id follows below.
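A minimal sketch of that idea, assuming order generation can be keyed by an order id (note that this only serializes threads within a single JVM; for users on different systems you still need a database-level mechanism such as the optimistic locking described below):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class OrderGenerator {

    // one lock object per order id: only users working on the SAME order
    // serialize, different orders still proceed in parallel
    private final ConcurrentMap<String, Object> orderLocks = new ConcurrentHashMap<>();

    public void generateOrder(String orderId) {
        Object lock = orderLocks.computeIfAbsent(orderId, id -> new Object());
        synchronized (lock) {
            // critical section: check whether data for this order was already
            // generated and generate it only if it was not
        }
    }
}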
On a larger scale, if you are persisting your entities in a database, I would advise you to look at optimistic locking.
Add a version field to your order entity. When the order is first placed, the version is 1. Every update should then build on that version, so imagine two concurrent processes:
a -> Read data (version=1)
     Update data
     Store data (set version=2 if version=1)

b -> Read data (version=1)
     Update data
     Store data (set version=2 if version=1)
If these two run concurrently rather than serialized, you will notice that one of the processes fails to store its data. That is the losing user, who will have to retry his edits (this time reading version=2).
If you use JPA, optimistic locking is as easy as adding a @Version attribute to your model. If you use raw JDBC, you will need to add the version to the update condition:
update table set version=2, data=xyz where orderid=x and version=1
That is by far the best, and in fact the preferred, solution to your general problem.
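For the JPA route mentioned above, a minimal sketch (the entity and field names are made up for the example):

import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Version;

@Entity
public class Order {

    @Id
    private Long orderId;

    @Version // JPA increments this and adds "and version = ?" to the UPDATE
    private Long version;

    private String data;
}

When the losing process tries to store its stale copy, the persistence provider throws an OptimisticLockException, which is the signal to re-read the order and retry or report the conflict.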

Handling very large amount of data in MyBatis

My goal is actually to dump all the data of a database to an XML file. The database is not terribly big, it's about 300MB. The problem is that I have a memory limitation of 256MB (in JVM) only. So obviously I cannot just read everything into memory.
I managed to solve this problem using iBatis (yes, I mean iBatis, not MyBatis) by calling its getList(... int skip, int max) multiple times with an incremented skip. That does solve my memory problem, but I'm not impressed with the speed. The variable names suggest that what the method does under the hood is read the entire result set and skip up to the specified record. This sounds quite redundant to me (I'm not saying that's what the method is doing; I'm just guessing based on the variable names).
Now I have switched to MyBatis 3 for the next version of my application. My question is: is there any better way to handle large amounts of data chunk by chunk in MyBatis? Is there any way to make MyBatis process the first N records, return them to the caller while keeping the result set connection open, so the next time the user calls getList(...) it starts reading from record N+1 without doing any skipping?
MyBatis CAN stream results. What you need is a custom result handler. With this you can take each row separately and write it to your XML file. The overall scheme looks like this:
session.select(
"mappedStatementThatFindsYourObjects",
parametersForStatement,
resultHandler);
Where resultHandler is an instance of a class implementing the ResultHandler interface. This interface has just one method, handleResult, which provides you with a ResultContext object. From this context you can retrieve the row currently being read and do something with it:
@Override
public void handleResult(ResultContext context) {
    Object result = context.getResultObject();
    doSomething(result);
}
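To connect this back to the XML dump, here is a hedged sketch of such a handler (assuming a reasonably recent MyBatis where ResultHandler is generic; writeRowAsXml is a placeholder for however one record gets serialized):

import org.apache.ibatis.session.ResultContext;
import org.apache.ibatis.session.ResultHandler;

public class XmlExportResultHandler implements ResultHandler<Object> {

    @Override
    public void handleResult(ResultContext<? extends Object> context) {
        // each row is written out immediately, so only one row is held in memory
        Object row = context.getResultObject();
        writeRowAsXml(row);
    }

    private void writeRowAsXml(Object row) {
        // append the row to the XML output stream here
    }
}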
No, mybatis does not have full capability to stream results yet.
EDIT 1:
If you don't need nested result mappings, then you can implement a custom result handler to stream results on currently released versions of MyBatis (3.1.1). The limitation is when you need to do complex result mapping: the NestedResultSetHandler does not allow custom result handlers. A fix is available, and it looks like it is currently targeted for 3.2. See Issue 577.
In summary, to stream large result sets using MyBatis you'll need to:
Implement your own ResultHandler.
Increase the fetch size (as noted below by Guillaume Perrot).
For nested result maps, use the fix discussed in Issue 577. This fix also resolves some memory issues with large result sets.
I have successfully used MyBatis streaming with the Cursor. The Cursor was implemented in MyBatis in this PR.
From the documentation it is described as
A Cursor offers the same results as a List, except it fetches data lazily using an Iterator.
Besides, the code documentation says
Cursors are a perfect fit to handle millions of items queries that would not normally fit in memory.
Here is an example of an implementation I have done and was able to use successfully:
import org.mybatis.spring.SqlSessionFactoryBean;
// You have your SqlSessionFactory somehow; if using Spring you can build it from a SqlSessionFactoryBean:
SqlSessionFactoryBean sqlSessionFactoryBean = new SqlSessionFactoryBean();
// ... configure the data source, then obtain the actual factory:
SqlSessionFactory sqlSessionFactory = sqlSessionFactoryBean.getObject();
Then you define your mapper, e.g. UserMapper, with the SQL query that returns a Cursor of your target object instead of a List. The whole idea is to not store all the elements in memory:
import org.apache.ibatis.annotations.Select;
import org.apache.ibatis.cursor.Cursor;
public interface UserMapper {

    @Select("SELECT * FROM users")
    Cursor<User> getAll();
}
Then you write the code that will use an open SQL session from the factory and query using your mapper:
try(SqlSession sqlSession = sqlSessionFactory.openSession()) {
Iterator<User> iterator = sqlSession.getMapper(UserMapper.class)
.getAll()
.iterator();
while (iterator.hasNext()) {
doSomethingWithUser(iterator.next());
}
}
handleResult receives as many records as the query returns, with no pause.
When there were too many records to process I used sqlSessionFactory.openSession().getConnection().
Then, as with normal JDBC, get a Statement, get the ResultSet, and process the records one by one. Don't forget to close the session.
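A sketch of that plain-JDBC fallback (the SQL text and processRow are placeholders):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import org.apache.ibatis.session.SqlSession;

try (SqlSession session = sqlSessionFactory.openSession()) {
    Connection connection = session.getConnection();
    try (PreparedStatement statement = connection.prepareStatement("SELECT * FROM your_table")) {
        statement.setFetchSize(500); // keep only a window of rows buffered
        try (ResultSet resultSet = statement.executeQuery()) {
            while (resultSet.next()) {
                processRow(resultSet); // handle one record at a time
            }
        }
    }
} // closing the session also returns the connection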
If you are just dumping all the data without any ordering requirement, why not do the pagination directly in SQL? Set a limit on the query statement and specify a different record id as the offset each time, to split the whole table into chunks, each of which can be read directly into memory if the row limit is a reasonable number.
The sql could be something like:
SELECT * FROM resource
WHERE "ID" >= continuation_id
ORDER BY "ID"
LIMIT 300;
I think this could be viewed as an alternative solution for dumping all the data in chunks, independent of how well MyBatis, or any persistence layer, supports streaming.
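A hedged sketch of the driving loop for such chunked reads (Resource, mapper.findChunk and export are invented names; the mapped statement is assumed to be the SQL above with #{continuationId} and #{limit} parameters):

long continuationId = 0;
List<Resource> chunk;
do {
    // WHERE "ID" >= #{continuationId} ORDER BY "ID" LIMIT #{limit}
    chunk = mapper.findChunk(continuationId, 300);
    for (Resource row : chunk) {
        export(row);
        continuationId = row.getId() + 1; // next chunk starts after the last row seen
    }
} while (!chunk.isEmpty());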

General Transaction Question dealing with many records

I'm working with a code base that is new to me, and it uses iBatis.
I need to update or add to an existing table, and it may involve 20,000+ records.
The process will run once per day, and run in the middle of the night.
I'm getting the data from a web services call. I plan to get the data, then populate one model-type object per record, and pass each object to a method that will read the data in the object and update/insert it into the table.
Example:
ArrayList records = new ArrayList();
Foo foo = new Foo();
foo.setFirstName("Homer");
foo.setLastName("Simpson");
records.add(foo);
//make more Foo objects, and put in ArrayList.
updateOrInsert(records); //this method then iterates over the list and calls some method that does the updating/inserting
My main question is how to handle all of the updating/inserting as a transaction. If the system goes down before all of the records are read and used to update/insert the table, I need to know, so I can go back to the web services call and try again when the system is OK.
I am using Java 1.4, and the db is Oracle.
I would highly recommend you consider using spring batch - http://static.springsource.org/spring-batch/
The framework provides a lot of the essential features required for batch processing: error reporting, transaction management, multi-threading, scaling, and input validation.
The framework is very well designed and very easy to use.
The approach you have listed might not perform very well, since you are waiting to read all the objects, storing them all in memory, and then inserting into the database.
You might want to consider designing the process as follows (a sketch of the batching loop follows after this list):

1. Create a cache capable of storing 200 objects.
2. Invoke the web service to fetch the data.
3. Create an instance of an object, validate it, and store the data in the object's fields.
4. Add the object to the cache.
5. When the cache is full, perform a batch commit of the objects in the cache to the database.
6. Continue with step 1.
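A rough sketch of that loop in Java 1.4 style (fetchNextRecord, toFoo and insertBatch are placeholders for the web service call, the mapping step and the DAO's batch insert):

import java.util.ArrayList;
import java.util.List;

List cache = new ArrayList();
Object record;
while ((record = fetchNextRecord()) != null) {
    cache.add(toFoo(record));   // validate and map one record
    if (cache.size() >= 200) {  // cache is full: commit this chunk
        insertBatch(cache);     // one transaction per chunk
        cache.clear();
    }
}
if (!cache.isEmpty()) {
    insertBatch(cache);         // commit the final partial chunk
}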
Spring Batch will allow you to perform batch commits, control the size of the batch commits, and perform error handling when reading input (in your case, retrying the request) and when writing the data to the database.
Have a look at it.
