This is JPA2 running on Hibernate.
I want to retrieve multiple instances of the same entity type, given their ids. Many of them will already be in the persistence context and/or second-level cache.
I tried several approaches, but all seem to have their drawbacks:
When I iterate over the ids with entityManager.find(MyEntity.class, id), I get one query for each non-cached item, that is, too many queries.
With a query of the form SELECT e FROM MyEntity e WHERE e.id in (:ids), the cached entries will be reloaded from the db.
I can manually check the cache beforehand for each of the ids using entityManager.getEntityManagerFactory().getCache().contains(MyEntity.class, id). This works for the second-level cache, but it returns false for entries that are in the persistence context but not in the second-level cache.
What is the best way of doing this without choosing between loading inefficiently and loading too much?
You should* be able to speculatively pull an entity out of the session cache like this:
T obj = entityManager.getReference(entityClass, id);
boolean inSessionCache = entityManager.getEntityManagerFactory().getPersistenceUnitUtil().isLoaded(obj);
This still leaves you with a pretty gross solution, I admit.
(* might)
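For completeness, here is a rough sketch of how that check could be combined with an IN query so that only the ids missing from the persistence context hit the database. MyEntity, its Long id, and the findByIds method name are placeholders; entities that live only in the second-level cache are not detected by this check and will still be re-selected by the query.

import java.util.ArrayList;
import java.util.List;
import javax.persistence.EntityManager;
import javax.persistence.PersistenceUnitUtil;

public List<MyEntity> findByIds(EntityManager entityManager, List<Long> ids) {
    PersistenceUnitUtil unitUtil =
            entityManager.getEntityManagerFactory().getPersistenceUnitUtil();

    List<MyEntity> result = new ArrayList<>();
    List<Long> missing = new ArrayList<>();
    for (Long id : ids) {
        // getReference() does not hit the database; isLoaded() tells us whether
        // the instance is already initialized in this persistence context.
        MyEntity ref = entityManager.getReference(MyEntity.class, id);
        if (unitUtil.isLoaded(ref)) {
            result.add(ref);
        } else {
            missing.add(id);
        }
    }
    if (!missing.isEmpty()) {
        result.addAll(entityManager
                .createQuery("SELECT e FROM MyEntity e WHERE e.id IN (:ids)", MyEntity.class)
                .setParameter("ids", missing)
                .getResultList());
    }
    return result;
}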
Related
I ran the following JDO query; it is intended to get a list of entities that have an empty "flow id". However, when I checked the returned list, it contained an entity with a non-empty flow id.
PersistenceManager pm = pmf.getPersistenceManagerProxy();
String flowId = "";
Map<String, Object> params = new HashMap<>();
params.put("flowId", flowId);
List<MyEntity> entities = pm.newQuery(MyEntity.class, " this.flowId == :flowId ").setNamedParameters(params).executeList();
It doesn't always happen, but when it does, there is always an update to that entity from another process that clears the "flow id" at around the same time. However, the result I get from the above query contains that entity with a non-empty flow id. I also checked the JDO object state of the unexpectedly returned entity: it is persistent-clean. The query is run within an active transaction.
Here is the SQL compiled by JDOQLQuery:
SELECT 'com.example.MyEntity' AS "NUCLEUS_TYPE","A0"."CREATE_TIME","A0"."DATA_MAX_TIMESTAMP","A0"."DATA_MIN_TIMESTAMP","A0"."ID","A0"."OBSERVATION_ID","A0"."PARTITION_VALUE","A0"."PARTITION_CYCLE","A0"."PARTITION_TIMESTAMP","A0"."FLOW_ID","A0"."PROCESSING_STAGE","A0"."PROCESSING_STATUS","A0"."RECORD_COUNT","A0"."UPDATE_TIME" FROM "MY_ENTITIES" "A0" WHERE "A0"."FLOW_ID" = ?
Although I don't think it is relevant, the isolation level is read-committed, the entity is detachable, and the query above is running within a transaction. Please help, thanks!
Update
After I changed the isolation level to repeatable-read, it never happened again, so it is highly likely related to the isolation level. I'm not sure whether there is a bug or not. My DataNucleus version is 4.1.6. Any thoughts will help.
I turned off both the level 1 and level 2 caches. It seems caching in DataNucleus doesn't work correctly in some rare cases when multiple processes update the same row under the read-committed isolation level.
When the data store applies the filtering condition, it sees the latest update from another process (setting flow_id to empty in my example), so it returns that row. However, when DataNucleus looks up the fields using the row id, it checks the cache first, and a potentially stale value in the cache for that entity seems to be the root cause of this issue.
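For reference, this is roughly how both caches were turned off. The property names are the ones documented by DataNucleus; verify them against the docs for your version (4.1.x here) and keep your normal connection properties alongside them:

// sketch only: disable both cache levels when creating the PMF
java.util.Properties props = new java.util.Properties();
// ... your existing connection/datastore properties ...
props.setProperty("datanucleus.cache.level1.type", "none");
props.setProperty("datanucleus.cache.level2.type", "none");
javax.jdo.PersistenceManagerFactory pmf =
        javax.jdo.JDOHelper.getPersistenceManagerFactory(props);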
I have to process a huge amount of data distributed over 20 tables (~5 million records in total) and I need to load them efficiently.
I'm using Wildfly 14 and JPA/Hibernate.
Since in the end every single record will be used by the business logic (in the same transaction), I decided to simply pre-load the entire content of the required tables into memory via:
em.createQuery("SELECT e FROM Entity e").getResultList().size();
After that, every object should be available in the transaction and thus accessible via:
em.find(Entity.class, id);
But this doesn't work somehow and there are still a lot of calls to the DB, especially for the relationships.
How can I efficiently load the whole content of the required tables, including the relationships, and make sure I got everything, i.e. that there will be no further DB calls?
What I already tried:
FetchType.EAGER: Still too many single selects / object graph too complex
EntityGraphs: Same as FetchType.EAGER
Join fetch statements: Best results so far, since it simultaneously populates the relationships to the referred entities
2nd Level / Query Cache: Not working, probably the same problem as em.find
One thing to note is that the data is immutable (at least for a specific time) and could also be used in other transactions.
Edit:
My plan is to load and manage the entire data set in a @Singleton bean. But I want to make sure I'm loading it in the most efficient way and be sure the entire data set is loaded. There should be no further queries necessary when the business logic is using the data. After a specific time (EJB timer), I'm going to discard the entire data set and reload the current state from the DB (always whole tables).
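For illustration, the holder bean I have in mind looks roughly like this (only a sketch: the class name, the map, and the timer interval are placeholders, and lazy relations would of course have to be populated before the entities are handed out):

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.annotation.PostConstruct;
import javax.ejb.Schedule;
import javax.ejb.Singleton;
import javax.ejb.Startup;
import javax.persistence.EntityManager;
import javax.persistence.PersistenceContext;

@Singleton
@Startup
public class DataHolder {

    @PersistenceContext
    private EntityManager em;

    private Map<Long, Entity> entitiesById = new HashMap<>();

    @PostConstruct
    public void init() {
        reload();
    }

    // discard and reload the whole table periodically (interval is just an example)
    @Schedule(hour = "*/6", minute = "0", second = "0", persistent = false)
    public void reload() {
        List<Entity> all = em.createQuery("SELECT e FROM Entity e", Entity.class)
                             .getResultList();
        Map<Long, Entity> fresh = new HashMap<>();
        for (Entity e : all) {
            fresh.put(e.getId(), e);
        }
        entitiesById = fresh;
    }

    public Entity byId(Long id) {
        return entitiesById.get(id);
    }
}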
Keep in mind that you'll likely need a 64-bit JVM and a large amount of memory. Take a look at the Hibernate 2nd Level Cache. Some things to check for, since we don't have your code:
The @Cacheable annotation will clue Hibernate in so that the entity is cacheable (a minimal sketch follows this list)
Configure 2nd level caching to use something like ehcache, and set the maximum memory elements to something big enough to fit your working set into it
Make sure you're not accidentally using multiple sessions in your code.
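As a minimal sketch of what that looks like on the entity side (the concurrency strategy is just an example; sizing and eviction policies belong in ehcache.xml, and the exact property names can vary between Hibernate versions):

// persistence.xml (or hibernate.cfg.xml), roughly:
//   hibernate.cache.use_second_level_cache = true
//   hibernate.cache.region.factory_class = org.hibernate.cache.ehcache.EhCacheRegionFactory

@Entity
@Cacheable
@org.hibernate.annotations.Cache(usage = org.hibernate.annotations.CacheConcurrencyStrategy.READ_ONLY)
public class Entity {

    @Id
    private Long id;

    // ...
}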
If you need to process things in this way, you may want to consider changing your design to not rely on having everything in memory, to not use Hibernate/JPA, or to not use an app server. This will give you more control over how things are executed. This may even be a better fit for something like Hadoop. Without more information it's hard to say what direction would be best for you.
I understand what you're asking, but JPA/Hibernate isn't going to want to cache that much data for you, or at least I wouldn't expect a guarantee from it. Consider that you described 5 million records. What is the average length per record? 100 bytes gives 500 megabytes of memory that'll just crash your untweaked JVM. Probably more like 5000 bytes average, and that's 25 GB of memory. You need to think about what you're asking for.
If you want it cached you should do that yourself, or better yet just use the results when you have them. If you want memory-based data access you should look at a technology specifically for that. http://www.ehcache.org/ seems popular, but it's up to you, and you should be sure you understand your use case first.
If you are trying to be database-efficient then you should just understand what you're doing and design and test carefully.
Basically it should be a pretty easy task to load entire tables with one query per table and link the objects, but JPA works differently, as shown in this example.
The biggest problem is @OneToMany/@ManyToMany relations:
@Entity
public class Employee {
    @Id
    @Column(name="EMP_ID")
    private long id;
    ...
    @OneToMany(mappedBy="owner")
    private List<Phone> phones;
    ...
}

@Entity
public class Phone {
    @Id
    private long id;
    ...
    @ManyToOne
    @JoinColumn(name="OWNER_ID")
    private Employee owner;
    ...
}
FetchType.EAGER
If the relation is defined as FetchType.EAGER and the query is SELECT e FROM Employee e, Hibernate generates the SQL statement SELECT * FROM EMPLOYEE and, right after it, SELECT * FROM PHONE WHERE OWNER_ID=? for every single Employee loaded, commonly known as the n+1 problem.
I could avoid the n+1 problem by using the JPQL-query SELECT e FROM Employee e JOIN FETCH e.phones, which will result in something like SELECT * FROM EMPLOYEE LEFT OUTER JOIN PHONE ON EMP_ID = OWNER_ID.
The problem is, this won't work for a complex data model with ~20 tables involved.
FetchType.LAZY
If defined as FetchType.LAZY, the query SELECT e FROM Employee e will just load all Employees with proxies for their phones, loading the related Phones only when phones is accessed, which in the end leads to the n+1 problem as well.
To avoid this, the obvious idea is to load all the Phones into the same session with SELECT p FROM Phone p. But when accessing phones, Hibernate will still execute SELECT * FROM PHONE WHERE OWNER_ID=?, because Hibernate doesn't know that all the Phones are already in its current session.
Even when using 2nd level cache, the statement will be executed on the DB because Phone is indexed by its primary key in the 2nd level cache and not by OWNER_ID.
Conclusion
There is no mechanism like "just load all data" in Hibernate.
It seems there is no other way than to keep the relationships transient and connect them manually, or even to just use plain old JDBC.
EDIT:
I just found a solution which works very well. I defined all relevant @ManyToMany and @OneToMany relations as FetchType.EAGER combined with @Fetch(FetchMode.SUBSELECT), and all @ManyToOne relations with @Fetch(FetchMode.JOIN), which results in an acceptable loading time. Next to adding @javax.persistence.Cacheable(true) to all entities, I added @org.hibernate.annotations.Cache to every relevant collection, which enables collection caching in the 2nd level cache. I disabled 2nd level cache timeout eviction and "warm up" the 2nd level cache via a @Singleton EJB combined with @Startup on server start / deploy. Now I have 100% control over the cache: there are no further DB calls until I manually clear it.
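Roughly, the resulting mapping looks like this for the Employee/Phone example from above (a sketch; the cache concurrency strategy is whatever fits your data):

@Entity
@javax.persistence.Cacheable(true)
@org.hibernate.annotations.Cache(usage = org.hibernate.annotations.CacheConcurrencyStrategy.READ_ONLY)
public class Employee {

    @Id
    @Column(name="EMP_ID")
    private long id;

    // EAGER + SUBSELECT: the phones of all loaded Employees are fetched with
    // one additional query instead of one query per Employee
    @OneToMany(mappedBy="owner", fetch=FetchType.EAGER)
    @org.hibernate.annotations.Fetch(org.hibernate.annotations.FetchMode.SUBSELECT)
    @org.hibernate.annotations.Cache(usage = org.hibernate.annotations.CacheConcurrencyStrategy.READ_ONLY)
    private List<Phone> phones;
}

@Entity
@javax.persistence.Cacheable(true)
@org.hibernate.annotations.Cache(usage = org.hibernate.annotations.CacheConcurrencyStrategy.READ_ONLY)
public class Phone {

    @Id
    private long id;

    // to-one relations are resolved with a join instead of a separate select
    @ManyToOne
    @JoinColumn(name="OWNER_ID")
    @org.hibernate.annotations.Fetch(org.hibernate.annotations.FetchMode.JOIN)
    private Employee owner;
}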
I have a method which executes HQL using org.hibernate.Query.executeUpdate() and changes some rows in a database. If some of the rows affected by the query were previously loaded into the current session (e.g. using Session.get()), they are now stale and need refreshing.
But I want that method to be independent of any previous work with the session, and not to have to track all loaded objects that might be affected in order to refresh them afterwards. Is it possible with Hibernate to retrieve and iterate through the objects in the first-level cache?
I've found the following solution for the issue, which works for me:
Map.Entry<Object,EntityEntry>[] entities = ((SessionImplementor) session).getPersistenceContext().reentrantSafeEntityEntries();
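Building on that, here is a rough sketch of refreshing the affected entities after the bulk update (SessionImplementor, EntityEntry, and PersistenceContext come from org.hibernate.engine.spi; MyEntity stands for whatever entity type the HQL update touches):

PersistenceContext pc = ((SessionImplementor) session).getPersistenceContext();
for (Map.Entry<Object, EntityEntry> entry : pc.reentrantSafeEntityEntries()) {
    Object entity = entry.getKey();
    if (entity instanceof MyEntity) {   // only refresh the type touched by the update
        session.refresh(entity);
    }
}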
We have an entire table of entities that we need to load during a Hibernate session, and the only way I know to load all entities is through an HQL query:
public <T> List<T> getAllEntities(final Class<T> entityClass) {
if (null == entityClass)
throw new IllegalArgumentException("entityClass can't be null");
List<T> list = castResultList(createQuery(
"select e from " + entityClass.getSimpleName() + " e ").list());
return list;
}
We use EHcache for 2nd level caching.
The problem is that this gets called hundreds of times in a given transaction/session and takes up a considerable portion of the total time. Is there any way to load all entities of a given type (load an entire table) and still benefit from the 1st level session cache or the 2nd level ehcache?
We've been told to stay away from query caching because of its potential performance penalties relative to its gains.
* Hibernate Query Cache considered harmful
Although we're doing performance profiling right now, so it might be time to try turning on the query cache.
L1 and L2 cache can't help you much with the problem of "get an entire table."
The L1 cache is ill-equipped because if someone else inserted something, it's not there. (You may "know" that no one else would ever do so within the business rules of the system, but the Hibernate Session doesn't.) Hence you have to go look in the DB to be sure.
With the L2 cache, things may have been expired or flushed since the last time anybody put the table in there. This can be at the mercy of the cache provider or even done totally externally, maybe through an MBean. So Hibernate can't really know at any given time if what's in the cache for that type represents the entire contents of the table. Again, you have to look in the DB to be sure.
Since you have special knowledge about this Entity (new ones are never created) that there isn't a practical way to impart to the L1 or L2 caches, you need to either use the tool Hibernate provides for when you have business-rules-level knowledge about a result set (the query cache) or cache the info yourself.
--
If you really really want it in the L2 cache, you could in theory make all entities in the table members of a collection on some other bogus entity, then enable caching the collection and manage it secretly in the DAO. I don't think it could possibly be worth having that kind of bizarreness in your code though :)
Query cache is considered harmful if and only if the underlying table changes often. In your case the table is changed once a day. So the query would stay in cache for 24 hours. Trust me: use the query cache for it. It is a perfect use case for a query cache.
Example of a harmful query cache: if you have a user table and you use the query cache for "from User where username = ...", then this query will be evicted from the cache each time the user table is modified (another user changes/deletes his account). So ANY modification of this table triggers cache eviction. The only way to improve this situation is to query by natural-id, but this is another story.
If you know your table will be modified only once a day as in your case, the query cache will only evict once a day!
But pay attention to your logic when modifying the table. If you do it via Hibernate, everything is fine. If you use a direct query, you have to tell Hibernate that you have modified the table (something like query.addSynchronizedEntityClass(..)). If you do it via a shell script, you need to adjust the time-to-live of the underlying cache region.
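A short sketch of both sides (Template is a stand-in for your entity; the query cache itself must be enabled with hibernate.cache.use_query_cache=true):

// read side: repeated "load the whole table" calls are answered from the query cache
List<Template> templates = session.createQuery("from Template")
        .setCacheable(true)
        .list();

// write side: a native update has to tell Hibernate which entity it touches,
// otherwise the cached query results are not invalidated
session.createSQLQuery("update TEMPLATE set NAME = :name where ID = :id")
        .addSynchronizedEntityClass(Template.class)
        .setParameter("name", "new name")
        .setParameter("id", 42L)
        .executeUpdate();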
Your answer is, by the way, reimplementing the query cache, as the query cache just caches the list of ids. The actual objects are looked up in the L1/L2 cache, so you still need to cache the entities when you use the query cache.
Please mark this as the correct answer for further reference.
We ended up solving this by storing in memory the primary keys to all the entities in the table we needed to load (because they're template data and no new templates are added/removed).
Then we could use this list of primary keys to look up each entity and utilize Hibernate's 1st and 2nd level caches.
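In code the approach looks roughly like this (Template and the helper names are placeholders):

// filled once, e.g. at startup, since templates are never added or removed
private List<Long> templateIds;

public List<Template> getAllTemplates(Session session) {
    List<Template> result = new ArrayList<>(templateIds.size());
    for (Long id : templateIds) {
        // lookup by primary key, answered from the 1st/2nd level cache when possible
        result.add((Template) session.get(Template.class, id));
    }
    return result;
}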
I'm currently trying to get Hibernate caching working using the caching provider that comes with Hibernate:
net.sf.ehcache.hibernate.SingletonEhCacheProvider
I have a default cache and a class-specific cache enabled in the ehcache.xml which is referenced in my hibernate.cfg.xml file. The class/mapping-file-specific cache is defined to handle up to 20000 objects.
However, I'm seeing no performance gains since I turned on the cache mapping in one of the mapping files I'm testing this with.
My test is as follows.
Load 10000 objects of the particular mapping file I'm testing (this should hit the DB and be a bottleneck).
Next I load the same 10000 objects; at this point I would expect the cache to be hit and to see significant performance gains. I have tried using both "read-only" and "read-write" cache mappings in the Hibernate mapping XML file I'm testing with.
I'm wondering: is there anything I need to be doing to ensure the cache is being hit before the DB when loading objects?
Note that as part of the test I'm paging through these 10000 records using something similar to the code below (paging 1000 records at a time).
Criteria crit = HibernateUtil.getSession().createCriteria(persistentClass);
crit.setFirstResult(startIndex);
crit.setFetchSize(fetchSize);
return crit.list();
I have seen that Criteria has a cache mode setter (setCacheMode()), so is there something I should be doing with that?
Using the stats code below, I notice that there are 10000 objects (well, Hibernate dehydrated objects, I imagine?) in memory, but for some reason I'm getting 0 hits and, more worryingly, 0 misses, so it looks like it's not going to the cache at all when it does a lookup, even though the stats code seems to be telling me that there are 10000 objects in memory.
Any ideas on what I'm doing wrong? I take it the fact that I'm getting misses is good, as it means the cache is being used, but I can't figure out why I'm not getting any cache hits. Is it down to the fact that I'm using setFirstResult() and setFetchSize() with Criteria?
System.out.println("Cache Misses = " + stats.getSecondLevelCacheMissCount());
System.out.println("Cache Hits Count = " + stats.getSecondLevelCacheHitCount());
System.out.println("2nd level elements in mem "+ stats.getSecondLevelCacheStatistics("com.SomeTestEntity").getElementCountInMemory());
The second level cache works for "find by primary key". For other queries, you need to cache the query (provided the query cache is enabled), in your case using Criteria#setCacheable(boolean):
Criteria crit = HibernateUtil.getSession().createCriteria(persistentClass);
crit.setFirstResult(startIndex);
crit.setFetchSize(fetchSize);
crit.setCacheable(true); // Enable caching of this query result
return crit.list();
I suggest reading:
Hibernate: Truly Understanding the Second-Level and Query Caches
If I cache the query, are all the Hibernate entities from the query then available in the second-level cache?
Yes they will. This is explained black on white in the link I mentioned: "Note that the query cache does not cache the state of the actual entities in the result set; it caches only identifier values and results of value type. So the query cache should always be used in conjunction with the second-level cache". Did you read it?
As I was under the impression that using the query cache was entirely different from using the Hibernate 2nd level cache.
It is different (the "key" used for the cache entries is different). But the query cache relies on the L2 cache.
From your answer you seem to be suggesting that the query cache and second-level cache are the same, and that to generate cache hits I need to be using "find by primary key".
I'm just saying you need to cache the query since you're not "finding by primary key". I don't get what is not clear. Did you try to call setCacheable(true) on your query or criteria object? Sorry for insisting, but did you read the link I posted?