I would like to ask you about the performance of save() in CrudRepository.
First, an example of the code:
for (int i = 0; i < 5000; i++) {
    Example example = new Example(0, true, false, i, "example");
    example = exampleRepository.save(example);

    List<ChildExample> childExamples = new ArrayList<>();
    ChildExample childExample = new ChildExample(0, i, true, example);
    childExamples.add(childExample);
    childExampleRepository.saveAll(childExamples);
}
This is just an example, but everything has to stay where it is (e.g. creating a list of examples first and then using saveAll, using cascades, etc. is not allowed).
What have I observed? The first 2000 objects were saved very quickly, let's say in 10 minutes. But the next 2000 were saved in a much longer time, about 30 minutes. Why is that? Why does saving each subsequent object take longer? What if I use JpaRepository and saveAndFlush()? Will the process be shorter if I use saveAndFlush()?
When you call save(), which is the equivalent of entityManager.persist(), the persistence provider does not immediately perform an INSERT against the physical database. It simply stores the given entity in its persistence context: the entity becomes managed in the current session cache (first-level cache).
This is to avoid unnecessary overhead from individual CRUD operations. By default, the changes are flushed on commit of the current transaction (or upon reaching a certain threshold of managed entities, as in your case). An implicit flush may also be triggered when a SELECT is performed during the transaction that involves the persisted entities somewhere in its JOINs (which is not the case here, though).
When you flush, the persistence provider is obliged to write the pending changes to the physical database at that moment.
But will it improve performance? There is no clear answer to that question; it depends entirely on the scenario. It is an option, though, and you need to run a set of tests to find out.
You may also experiment with hibernate.jdbc.batch_size. You can gain a lot if you get this configuration right for your particular circumstances, as sketched below.
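As an illustration only, here is a minimal sketch of combining JDBC batching with a periodic flush-and-clear of the persistence context so that it does not keep growing; the injected EntityManager, the batch size of 50 and the property names in the comments are assumptions, not part of the original question.

// Assumed settings, e.g. in application.properties:
//   spring.jpa.properties.hibernate.jdbc.batch_size=50
//   spring.jpa.properties.hibernate.order_inserts=true

@PersistenceContext
private EntityManager entityManager; // assumed to be injectable next to the repositories

@Transactional
public void saveExamples() {
    int batchSize = 50; // should match hibernate.jdbc.batch_size
    for (int i = 0; i < 5000; i++) {
        Example example = new Example(0, true, false, i, "example");
        example = exampleRepository.save(example);

        List<ChildExample> childExamples = new ArrayList<>();
        childExamples.add(new ChildExample(0, i, true, example));
        childExampleRepository.saveAll(childExamples);

        if (i % batchSize == 0) {
            // push the pending INSERTs and detach the managed entities so that
            // dirty checking does not get slower as the loop progresses
            entityManager.flush();
            entityManager.clear();
        }
    }
}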
Related
I am trying to learn JPA with hibernate and use MySQL as the db.
From my understanding,
LockModeType.OPTIMISTIC: The entity version is checked towards the end of the currently running transaction.
REPEATABLE READ: All consistent reads within the same transaction read the snapshot established by the first such read in that transaction.
Is it true that LockModeType.OPTIMISTIC in hibernate does not work with MySQL's default isolation level?
Say I have the following code:
tx.begin();
EntityManager em = JPA.createEntityManager();
Item item = em.find(Item.class, 1, LockModeType.OPTIMISTIC);
// Assume the item here has version = 0.
// While its fields are being read, another transaction commits and bumps the item to version = 1.
tx.commit(); // Here Hibernate should execute a SELECT during flushing to check the version,
             // i.e. SELECT version FROM Item WHERE id = 1
em.close();
What I would expect is that, during flushing, Hibernate would throw an OptimisticLockException because the version of the item is no longer 0. However, due to the isolation level, within the same transaction Hibernate still sees the item at version = 0 and does not trigger an OptimisticLockException.
I tried to search, but it seems no one has raised this question before; hopefully someone can clear up my confusion about optimistic locking.
If your question is actually whether there is a flaw in the Hibernate implementation (or the JPA specification) related to the following statement:
If transaction T1 calls for a lock of type LockModeType.OPTIMISTIC on a versioned object, the entity manager must ensure that neither of the following phenomena can occur:
P1 (Dirty read): Transaction T1 modifies a row. Another transaction T2 then reads that row and obtains the modified value, before T1 has committed or rolled back. Transaction T2 eventually commits successfully; it does not matter whether T1 commits or rolls back and whether it does so before or after T2 commits.
P2 (Non-repeatable read): Transaction T1 reads a row. Another transaction T2 then modifies or deletes that row, before T1 has committed. Both transactions eventually commit successfully.
Lock modes must always prevent the phenomena P1 and P2.
then the answer is yes, you are correct: when you perform computations based on some entity's state but do not modify that state, Hibernate just issues select version from ... where id = ... at the end of the transaction, and hence it does not see changes from other transactions because of the REPEATABLE READ isolation level. However, I would not say that READ COMMITTED performs much better in this particular case: its behaviour is more correct from a technical perspective, but it is completely unreliable from a business perspective because it depends on timings. So just do not rely on LockModeType.OPTIMISTIC; it is unreliable by design. Use other techniques instead, such as:
store data from different domains in different entities
take advantage of the @OptimisticLock annotation to prevent the version from being incremented when that is not required (admittedly this pollutes your domain model with Hibernate annotations); see the sketch after this list
mark some properties as updatable=false and update them via JPQL update in order to prevent version increment
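For the @OptimisticLock technique, here is a minimal sketch, reusing the Item entity from the question as a placeholder; the annotation comes from org.hibernate.annotations, and excluded = true tells Hibernate not to bump the version when only that field changes:

import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Version;
import org.hibernate.annotations.OptimisticLock;

@Entity
public class Item {

    @Id
    private long id;

    @Version
    private long version;

    // changes to this field alone will not increment the version
    @OptimisticLock(excluded = true)
    private int viewCount;
}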
UPD.
Taking P2 as an example: if I really need T1 (which only reads the row) to fail when T2 (which modifies/deletes the row) commits first, the only workaround I can think of is LockModeType.OPTIMISTIC_FORCE_INCREMENT, so that when T1 commits it will try to update the version and fail. Can you elaborate on how the three points you gave at the end help with this situation if we keep using the REPEATABLE READ isolation level?
The short story:
LockModeType.OPTIMISTIC_FORCE_INCREMENT does not seem to be a good workaround, because it turns a reader into a writer, so incrementing the version will fail both writers and other readers. However, in your case it might be acceptable to use LockModeType.PESSIMISTIC_READ, which for some databases translates into select ... from ... for share / lock in share mode. That blocks only writers, while the current reader either blocks or fails, so you avoid the phenomenon we are talking about.
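A minimal sketch of that alternative, reusing the Item lookup from the question (the entity and id are placeholders):

tx.begin();
// acquires a shared lock (e.g. SELECT ... LOCK IN SHARE MODE on MySQL/InnoDB),
// so a concurrent writer cannot commit a conflicting change while we hold it
Item item = em.find(Item.class, 1, LockModeType.PESSIMISTIC_READ);
// ... read the item fields and do the computation ...
tx.commit();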
The long story:
Once we start thinking about "business consistency", the JPA specification is not our friend anymore. The problem is that it defines consistency in terms of "denied phenomena" and "someone must fail", but gives us no clues or APIs for controlling the behaviour in a way that is correct from a business perspective. Let's consider the following example:
class User {

    @Id
    long id;

    @Version
    long version;

    boolean locked;
    int failedAuthAttempts;
}
Our goal is to lock the user account when failedAuthAttempts exceeds some threshold value. The pure SQL solution to this problem is very simple and straightforward:
update user
set failed_auth_attempts = failed_auth_attempts + 1,
    locked = case when failed_auth_attempts + 1 >= :threshold_value then 1 else 0 end
where id = :user_id
but JPA complicates everything... At first glance our naive implementation would look like this:
void onAuthFailure(long userId) {
    User user = em.find(User.class, userId);
    int failedAuthAttempts = user.failedAuthAttempts + 1;
    user.failedAuthAttempts = failedAuthAttempts;
    if (failedAuthAttempts >= thresholdValue) {
        user.locked = true;
    }
    // the entity is managed, so the change is flushed on commit
    // (EntityManager has no save() method; no explicit call is needed here)
}
but that implementation has an obvious flaw: if someone actively brute-forces the user account, not all failed auth attempts get recorded, because concurrent updates overwrite each other (leaving aside that this might be acceptable, since sooner or later we will lock the account anyway). How do we resolve this issue? Can we write something like this:
void onAuthFailure(long userId) {
    User user = em.find(User.class, userId, LockModeType.PESSIMISTIC_WRITE);
    int failedAuthAttempts = user.failedAuthAttempts + 1;
    user.failedAuthAttempts = failedAuthAttempts;
    if (failedAuthAttempts >= thresholdValue) {
        user.locked = true;
    }
    // again, the managed entity is flushed on commit
}
? Actually, no. The problem is that for entities which are not yet present in the persistence context (i.e. "unknown" entities) Hibernate issues select ... from ... where id = :id for update, but for known entities it issues select ... from ... where id = :id and version = :version for update, which obviously fails on a version mismatch. So we have the following tricky options to make our code work "correctly":
spawn another transaction (I believe in most cases it is not a good option)
lock the entity via a select query, i.e. something like em.createQuery("select id from User where id = :id").setLockMode(LockModeType.PESSIMISTIC_WRITE).getSingleResult() (I believe that may not work in REPEATABLE READ mode; moreover, a subsequent refresh call loses data)
mark the properties as non-updatable and update them via a JPQL update (essentially the pure SQL solution); a sketch follows this list
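A minimal sketch of that last option, assuming the User entity above and that failedAuthAttempts and locked are mapped with updatable = false so that regular flushes never touch them; the JPQL mirrors the plain SQL shown earlier:

int updated = em.createQuery(
        "update User u " +
        "set u.failedAuthAttempts = u.failedAuthAttempts + 1, " +
        "    u.locked = case when u.failedAuthAttempts + 1 >= :threshold then true else false end " +
        "where u.id = :id")
    .setParameter("threshold", thresholdValue)
    .setParameter("id", userId)
    .executeUpdate();
// a bulk JPQL update bypasses the persistence context, so no version increment happens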
Now let's pretend we need to add more business data to our User entity, say "SO reputation". How are we supposed to update the new field, keeping in mind that someone might be brute-forcing our user? The options are the following:
continue to write "tricky code" (actually that might lead us to the counterintuitive idea that we always need to lock entity before updating it)
split data from different domains across different entities (sounds counterintuitive too)
use mixed techniques
I do believe this UPD will not help you much; its purpose, however, was to demonstrate that it is not worth discussing consistency in the JPA domain without knowledge of the target model.
To understand this, let's have a quick look at how Hibernate's optimistic locking works:
1: begin a new transaction
2: find an entity by ID (hibernate issues a SELECT ... WHERE id=xxx;), which e.g. could have a version count of 1
3: modify the entity
4: flush the changes to the DB (e.g. triggered automatically before committing a transaction):
4.1: hibernate issues an UPDATE ... SET ..., version=2 WHERE id=xxx AND version=1 which returns the number of updated rows
4.2: hibernate checks whether there was one row actually updated, throwing a StaleStateException if not
5: commit the transaction / rollback in case of the exception
With the REPEATABLE READ isolation level, the first SELECT establishes the state (snapshot) which subsequent SELECTs of the same transaction read. However, the key here is that the UPDATE does not operate on the established snapshot but on the committed state of the row (which might have been changed by other committed transactions in the meantime).
Therefore the UPDATE does not actually update any rows if the version counter has already been changed by another committed transaction in the meantime, and Hibernate can detect this.
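For reference, a minimal sketch of an entity participating in this mechanism; the Item name and fields are placeholders matching the question:

import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Version;

@Entity
public class Item {

    @Id
    private long id;

    // Hibernate adds this column to "UPDATE ... WHERE id = ? AND version = ?"
    // and increments it on every successful update
    @Version
    private long version;

    private String name;
}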
Also see:
https://dev.mysql.com/doc/refman/8.0/en/innodb-consistent-read.html
Repeatable Read isolation level SELECT vs UPDATE...WHERE
I have to process a huge amount of data distributed over 20 tables (~5 million records in total), and I need to load them efficiently.
I'm using Wildfly 14 and JPA/Hibernate.
Since in the end, every single record will be used by the business logic (in the same transaction), I decided to pre-load the entire content of the required tables into memory via simply:
em.createQuery("SELECT e FROM Entity e").getResultList().size();
After that, every object should be available in the transaction and thus be accessible via:
em.find(Entity.class, id);
But somehow this doesn't work, and there are still a lot of calls to the DB, especially for the relationships.
How can I efficiently load the whole content of the required tables, including the relationships, and make sure I got everything, i.e. that there will be no further DB calls?
What I already tried:
FetchMode.EAGER: Still too many single selects / object graph too complex
EntityGraphs: Same as FetchMode.EAGER
Join fetch statements: Best results so far, since it simultaneously populates the relationships to the referred entities
2nd Level / Query Cache: Not working, probably the same problem as em.find
One thing to note is that the data is immutable (at least for a specific time) and could also be used in other transactions.
Edit:
My plan is to load and manage the entire data set in a @Singleton bean. But I want to make sure I'm loading it in the most efficient way and that all of the data is loaded. There should be no further queries necessary when the business logic uses the data. After a specific time (EJB timer), I'm going to discard the entire data set and reload the current state from the DB (always whole tables).
Keep in mind that you'll likely need a 64-bit JVM and a large amount of memory. Take a look at the Hibernate 2nd-level cache. Some things to check for, since we don't have your code:
The @Cacheable annotation will clue Hibernate in so that the entity is cacheable.
Configure 2nd-level caching to use something like Ehcache, and set the maximum number of in-memory elements to something big enough to fit your working set (see the sketch after this list).
Make sure you're not accidentally using multiple sessions in your code.
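A minimal sketch of the first two points, assuming Hibernate with Ehcache as the 2nd-level cache provider; the property names and the entity are placeholders for your actual setup:

// Assumed persistence.xml properties:
//   hibernate.cache.use_second_level_cache=true
//   hibernate.cache.region.factory_class=org.hibernate.cache.ehcache.EhCacheRegionFactory

import javax.persistence.Cacheable;
import javax.persistence.Entity;
import javax.persistence.Id;
import org.hibernate.annotations.Cache;
import org.hibernate.annotations.CacheConcurrencyStrategy;

@Entity
@Cacheable                                          // JPA: this entity may be cached
@Cache(usage = CacheConcurrencyStrategy.READ_ONLY)  // Hibernate: a good fit for immutable data
public class SomeEntity {

    @Id
    private long id;

    // ...
}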
If you need to process things in this way, you may want to consider changing your design to not rely on having everything in memory, not using Hibernate/JPA, or not using an app server. That would give you more control over how things are executed, and it may even be a better fit for something like Hadoop. Without more information it's hard to say which direction would be best for you.
I understand what you're asking, but JPA/Hibernate isn't going to want to cache that much data for you, or at least I wouldn't expect a guarantee from it. Consider that you described 5 million records. What is the average length per record? At 100 bytes each that's 500 megabytes of memory, which will just crash your untweaked JVM. It's probably more like 5000 bytes on average, and that's 25 GB of memory. You need to think about what you're asking for.
If you want it cached you should do that yourself or better yet just use the results when you have them. If you want a memory based data access you should look at a technology specifically for that. http://www.ehcache.org/ seems popular but it's up to you and you should be sure you understand your use case first.
If you are trying to be database-efficient, then you should understand exactly what you're doing and design and test carefully.
Basically it should be a pretty easy task to load entire tables with one query per table and then link the objects, but JPA works differently, as this example shows.
The biggest problem is the @OneToMany/@ManyToMany relations:
@Entity
public class Employee {
    @Id
    @Column(name = "EMP_ID")
    private long id;
    ...
    @OneToMany(mappedBy = "owner")
    private List<Phone> phones;
    ...
}

@Entity
public class Phone {
    @Id
    private long id;
    ...
    @ManyToOne
    @JoinColumn(name = "OWNER_ID")
    private Employee owner;
    ...
}
FetchType.EAGER
If the relation is defined as FetchType.EAGER and the query is SELECT e FROM Employee e, Hibernate generates the SQL statement SELECT * FROM EMPLOYEE and, right after it, SELECT * FROM PHONE WHERE OWNER_ID=? for every single Employee loaded, commonly known as the 1+n problem.
I could avoid the 1+n problem by using the JPQL query SELECT e FROM Employee e JOIN FETCH e.phones, which results in something like SELECT * FROM EMPLOYEE LEFT OUTER JOIN PHONE ON EMP_ID = OWNER_ID.
The problem is that this won't work for a complex data model with ~20 tables involved.
FetchType.LAZY
If defined as FetchType.LAZY, the query SELECT e FROM Employee e will just load all Employees as proxies, loading the related Phones only when phones is accessed, which in the end leads to the 1+n problem as well.
To avoid this, the obvious idea is to load all the Phones into the same session with SELECT p FROM Phone p. But when phones is accessed, Hibernate will still execute SELECT * FROM PHONE WHERE OWNER_ID=?, because Hibernate doesn't know that all the Phones are already in its current session.
Even when using 2nd level cache, the statement will be executed on the DB because Phone is indexed by its primary key in the 2nd level cache and not by OWNER_ID.
Conclusion
There is no mechanism like "just load all data" in Hibernate.
It seems there is no other way than to keep the relationships transient and connect them manually, or even to just use plain old JDBC.
EDIT:
I just found a solution which works very well. I defined all relevant @ManyToMany and @OneToMany associations as FetchType.EAGER combined with @Fetch(FetchMode.SUBSELECT), and all @ManyToOne associations with @Fetch(FetchMode.JOIN), which results in an acceptable loading time. Next to adding javax.persistence.Cacheable(true) to all entities, I added org.hibernate.annotations.Cache to every relevant collection, which enables collection caching in the 2nd-level cache. I disabled 2nd-level cache timeout eviction, and I warm up the 2nd-level cache via a @Singleton EJB combined with @Startup on server start / deploy. Now I have 100% control over the cache: there are no further DB calls until I manually clear it. The mapping is sketched below.
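As an illustration, a minimal sketch of the Employee/Phone mapping with the combination described above (the annotation values are assumptions based on the description, not the author's actual code):

import java.util.List;
import javax.persistence.*;
import org.hibernate.annotations.Cache;
import org.hibernate.annotations.CacheConcurrencyStrategy;
import org.hibernate.annotations.Fetch;
import org.hibernate.annotations.FetchMode;

@Entity
@Cacheable(true)
@Cache(usage = CacheConcurrencyStrategy.READ_WRITE)
public class Employee {

    @Id
    @Column(name = "EMP_ID")
    private long id;

    // EAGER + SUBSELECT: the phones of all loaded employees are fetched with one extra query
    @OneToMany(mappedBy = "owner", fetch = FetchType.EAGER)
    @Fetch(FetchMode.SUBSELECT)
    @Cache(usage = CacheConcurrencyStrategy.READ_WRITE) // collection caching in the 2nd-level cache
    private List<Phone> phones;
}

@Entity
@Cacheable(true)
@Cache(usage = CacheConcurrencyStrategy.READ_WRITE)
public class Phone {

    @Id
    private long id;

    // JOIN: the owner is fetched in the same SQL statement as the phone
    @ManyToOne
    @JoinColumn(name = "OWNER_ID")
    @Fetch(FetchMode.JOIN)
    private Employee owner;
}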
I go round in a big loop retrieving a SongDiff Hibernate entity, processing it and then getting the next one.
So I am only using one SongDiff entity at a time. My question is: will Hibernate release those resources as I go round the loop, or could it hang onto them? Do I need to call session.flush() occasionally, or would that make no difference?
I am only retrieving data from Hibernate, I am not modifying any data
I ask because I got an OutOfMemoryError in this area of the code, and I wonder if this could be the problem.
try
{
    session = com.jthink.songlayer.hibernate.HibernateUtil.getSession();
    for (Integer recNo : recNos)
    {
        count++;
        // Get metadata changes
        SongDiff songDiff = SongChangesCache.getSongDiffFromDatabase(session, recNo);
        MetadataAllChanges mas = (MetadataAllChanges) SerializationHelper.deserialize(songDiff.getDiff());
        sr.writeDatatoXlsFile(recNo, mas);
    }
}
finally
{
    HibernateUtil.closeSession(session);
}
By default, as you say, Hibernate will build up entity instances in its StatefulPersistenceContext. It does this so that it can return the same instance if an entity is fetched again, and to track updates ready for the next commit or flush.
The most obvious option to avoid session build-up is to start a new session, but this isn't always practical, for example due to transactional scope.
The best solution within a single session is session.clear(), which will discard any entity instances the session has built up. If there are also updates happening in the loop, then you will need to do a session.flush() before the clear(), to push any changes so far to the database.
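A minimal sketch of the session.clear() approach applied to the loop above; the interval of 500 records is an arbitrary assumption:

int count = 0;
for (Integer recNo : recNos) {
    count++;
    SongDiff songDiff = SongChangesCache.getSongDiffFromDatabase(session, recNo);
    MetadataAllChanges mas = (MetadataAllChanges) SerializationHelper.deserialize(songDiff.getDiff());
    sr.writeDatatoXlsFile(recNo, mas);

    if (count % 500 == 0) {
        // the loop is read-only, so no flush() is needed before discarding
        // the entity instances accumulated in the persistence context
        session.clear();
    }
}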
The other option, at least within a pure Hibernate (non-JPA) environment, is a StatelessSession, which has no StatefulPersistenceContext, so it won't build up. I've tried it a couple of times, but hesitate to recommend it: it seems to be less widely used, and therefore harder to troubleshoot online.
We have an entire table of entities that we need to load during a hibernate session and the only way I know to load all entities is through an HQL query:
public <T> List<T> getAllEntities(final Class<T> entityClass) {
    if (null == entityClass)
        throw new IllegalArgumentException("entityClass can't be null");
    List<T> list = castResultList(createQuery(
            "select e from " + entityClass.getSimpleName() + " e").list());
    return list;
}
We use EHcache for 2nd level caching.
The problem is that this gets called hundreds of times in a given transaction/session and takes up a considerable portion of the total time. Is there any way to load all entities of a given type (load an entire table) and still benefit from the 1st-level session cache or the 2nd-level Ehcache?
We've been told to stay away from query caching because of their potential performance penalties relative to their gains.
* Hibernate Query Cache considered harmful
Although we're doing performance profiling right now, so it might be time to try turning on the query cache.
L1 and L2 cache can't help you much with the problem of "get an entire table."
The L1 cache is ill-equipped because if someone else inserted something, it's not there. (You may "know" that no one else would ever do so within the business rules of the system, but the Hibernate Session doesn't.) Hence you have to go look in the DB to be sure.
With the L2 cache, things may have been expired or flushed since the last time anybody put the table in there. This can be at the mercy of the cache provider or even done totally externally, maybe through a MBean. So Hibernate can't really know at any given time if what's in the cache for that type represents the entire contents of the table. Again, you have to look in the DB to be sure.
Since you have special knowledge about this entity (new ones are never created) that there is no practical way to impart to the L1 or L2 caches, you need to either use the tool Hibernate provides for when you have special business-rules-level knowledge about a result set (the query cache), or cache the info yourself.
--
If you really really want it in the L2 cache, you could in theory make all entities in the table members of a collection on some other bogus entity, then enable caching the collection and manage it secretly in the DAO. I don't think it could possibly be worth having that kind of bizarreness in your code though :)
Query cache is considered harmful if and only if the underlying table changes often. In your case the table is changed once a day. So the query would stay in cache for 24 hours. Trust me: use the query cache for it. It is a perfect use case for a query cache.
Example of a harmful query cache: if you have a user table and you use the query cache for "from User where username = ...", then this query will be evicted from the cache each time the user table is modified (e.g. another user changes or deletes his account). So ANY modification of this table triggers cache eviction. The only way to improve this situation is querying by natural id, but that is another story.
If you know your table will be modified only once a day as in your case, the query cache will only evict once a day!
But pay attention to your logic when modifying the table. If you do it via Hibernate, everything is fine. If you use a direct query, you have to tell Hibernate that you have modified the table (something like query.addSynchronizedEntity(..)). If you do it via a shell script, you need to adjust the time-to-live of the underlying cache region.
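For completeness, a minimal sketch of enabling the query cache for this kind of read; it assumes hibernate.cache.use_query_cache=true is set and uses a placeholder Template entity:

// Assumed property: hibernate.cache.use_query_cache=true
List<Template> templates = session
        .createQuery("from Template", Template.class)
        .setCacheable(true) // the id list goes into the query cache, the entities into the L1/L2 cache
        .list();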
Your answer is, by the way, reimplementing the query cache, as the query cache just caches the list of ids; the actual objects are looked up in the L1/L2 cache. So you still need to cache the entities when you use the query cache.
Please mark this as the correct answer for further reference.
We ended up solving this by storing in memory the primary keys of all the entities in the table we needed to load (they're template data, and no new templates are added or removed).
Then we could use this list of primary keys to look up each entity and utilize Hibernate's 1st- and 2nd-level caches.
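Roughly, a sketch of that approach; the Template entity, the id list and its loading are placeholders for the real code:

// loaded once and kept in memory, since templates are never added or removed
private static final List<Long> TEMPLATE_IDS = loadTemplateIdsOnce();

public List<Template> getAllTemplates(Session session) {
    List<Template> templates = new ArrayList<>();
    for (Long id : TEMPLATE_IDS) {
        // session.get() checks the 1st-level cache, then the 2nd-level cache,
        // and only falls back to a SELECT if the entity is in neither
        templates.add(session.get(Template.class, id));
    }
    return templates;
}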
I have a relatively simple object model:
ParentObject
Collection<ChildObject1>
ChildObject2
The MySQL operation when saving this object model does the following:
Update the ParentObject
Delete all previous items from the ChildObject1 table (about 10 rows)
Insert all new ChildObject1 (again, about 10 rows)
Insert ChildObject2
The objects / tables are unremarkable - no strings, rather mainly ints and longs.
MySQL is currently saving about 20-30 instances of the object model per second. When this goes into production it's going to be doing upwards of a million saves, which at current speeds is going to take 10+ hours, which is no good to me...
I am using Java and Spring. I have profiled my app, and the bottleneck is by far in the calls to MySQL.
How would you suggest I increase the throughput?
You can get some speedup by tracking a dirty flag on your objects (especially your collection of child objects). You only delete/update the dirty ones. Depending on what % of them change on each write, you might save a good chunk.
The other thing you can do is bulk writes via batch updating on the prepared statement (look at PreparedStatement.addBatch()). This can be an order of magnitude faster, but it can't be done record by record; it might look something like:
delete all dirty-flagged children as a single batch command
update all parents as a single batch command
insert all dirty-flagged children as a single batch command.
Note that since you're dealing with millions of records, you're probably not going to be able to load them all into a map and dump them at once; you'll have to stream them into a batch handler and dump the changes to the DB 1000 records or so at a time. Once you've done this, the actual speed is sensitive to the batch size, and you'll have to determine a good value by trial and error; the batching part is sketched below.
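A minimal sketch of the JDBC batching part, assuming a child_object1(id, parent_id, value) table, an open connection, and simple getters on ChildObject1; all of these names are placeholders:

String sql = "INSERT INTO child_object1 (id, parent_id, value) VALUES (?, ?, ?)";
try (PreparedStatement ps = connection.prepareStatement(sql)) {
    int pending = 0;
    for (ChildObject1 child : dirtyChildren) { // only the dirty-flagged ones
        ps.setLong(1, child.getId());
        ps.setLong(2, child.getParentId());
        ps.setInt(3, child.getValue());
        ps.addBatch();
        if (++pending == 1000) { // flush a batch every 1000 records
            ps.executeBatch();
            pending = 0;
        }
    }
    if (pending > 0) {
        ps.executeBatch(); // flush the remainder
    }
}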
Deleting all existing ChildObject1 records from the table and then inserting the ChildObject1 instances from the current state of your ParentObject seems unnecessary to me. Are the values of all of the child objects different from what was previously stored?
A better solution might involve only modifying the database when you need to, i.e. when there has been a change in state of the ChildObject1 instances.
Rolling your own persistence logic for this type of thing can be hard (your persistence layer needs to know the state of the ChildObject1 objects when they were retrieved to compare them with the versions of the objects at save-time). You might want to look into using an ORM like Hibernate for something like this, which does an excellent job of knowing when it needs to update the records in the database or not.