Are JPA entities that are not in use garbage collected and why? - java

While building a Spring application that fetches data from the web using an API, I repeatedly ran into OutOfMemoryError: GC overhead limit exceeded. After some profiling sessions I started to question my model, which looks something like this:
@Entity
class A {
    @Id
    private Integer id;
    private String name;
    @OneToMany
    private Set<B> b1;
    @OneToMany
    private Set<B> b2;
}
@Entity
class B {
    @Id
    private Integer id;
    @ManyToOne
    private A a1;
    @ManyToOne
    private A a2;
}
A CrudRepository is assigned to manage these entities (JPA + EclipseLink). Entity loading uses the defaults, which in this case means eager, AFAIK.
The program attempts to do the following:
// populates the set with 2500 A instances
Set<A> aCollection = fetchAFromWebAPI();
for (A a : aCollection) {
    // populates b1 and b2 of each A with 100 B instances each
    fetchBFromWebAPI(a);
    aRepository.save(a);
}
By the end of this process there would be 500k B instances, except it never reaches the end because of OutOfMemoryError: GC overhead limit exceeded. Now I could add more memory, but I want to understand why all these instances aren't garbage collected. Once an A is saved to the database, it could be forgotten. Is this because A instances hold B instances in their b1 or b2, which in turn reference those A instances?
Another observation: the process runs significantly more smoothly the first time, when there is no data in the database.
Is there something fundamentally wrong with this model or this process?

A JPA transaction has an associated session cache of all entities used in the transaction. By saving your entities you keep introducing more instances into that session cache. In your case I'd recommend calling EntityManager.clear() every n entities - that detaches the persisted entities from the session cache and makes them available for garbage collection.
If you want to learn more about the lifecycle of JPA entities you can refer to e.g.
http://www.objectdb.com/java/jpa/persistence/managed
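A minimal sketch of that approach (the injected EntityManager and the batch size of 50 are assumptions, not part of the original code):

// Sketch only: flush and clear the persistence context periodically so
// already-saved entities can be garbage collected.
int count = 0;
for (A a : aCollection) {
    fetchBFromWebAPI(a);
    entityManager.persist(a);
    if (++count % 50 == 0) {
        entityManager.flush();  // push pending inserts to the database
        entityManager.clear();  // detach managed entities from the session cache
    }
}
entityManager.flush();
entityManager.clear();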
Edit:
Additionally, BatScream's answer is also correct: you seem to accumulate more and more data in every iteration that is still referenced by the set. You might want to consider removing instances you have already processed from the set.

The data reachable from aCollection keeps growing with each iteration: every A instance gets populated with 200 B instances (100 in b1 and 100 in b2) as the loop progresses. Hence your heap space gets eaten up.
All the A instances in aCollection remain reachable whenever the garbage collector runs during this period, since you are not removing the just-saved A from the collection.
To avoid this, you can use the Set's Iterator to safely remove the just-processed A instance from the collection.
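A minimal sketch of that suggestion, reusing the loop from the question (the surrounding names are taken from the question, the rest is illustrative):

import java.util.Iterator;

// Sketch only: drop each A from the set once it has been persisted, so
// already-processed instances (and their B collections) become unreachable.
Iterator<A> it = aCollection.iterator();
while (it.hasNext()) {
    A a = it.next();
    fetchBFromWebAPI(a);
    aRepository.save(a);
    it.remove(); // safe removal while iterating
}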

Related

Hibernate-Search flushToIndexes causing java.lang.OutOfMemoryError (heap space)

I have a Spring application using Hibernate, connected to Elasticsearch through Hibernate-Search.
To simplify the example, I'll include only the required annotations and code.
I have an entity A, contained in multiple B entities (a lot of them, actually ~8000).
The B entities also contain a lot of embedded details (entities C, E, ...).
Those entities are all connected with the @IndexedEmbedded and @ContainedIn Hibernate-Search annotations (see the example below).
I've created a service that modifies a field of an A object and forces the flush through flushToIndexes.
On the flush, Hibernate-Search updates the A index and, because of the @ContainedIn, propagates to the 8000 B indexes.
But to update the B indexes, for some reason, Hibernate-Search loads all 8000 B objects linked to the A object at once,
along with every detail contained in those B objects (C, E, and so on).
All this takes a long time and ends in nothing more than java.lang.OutOfMemoryError: Java heap space.
@Entity
@Table(name = "A")
@Indexed
public class A {
    @ContainedIn
    @OneToMany(fetch = FetchType.LAZY, mappedBy = "a")
    private Set<B> bCollection;

    @Field
    @Column(name = "SOME_FIELD")
    private String someField; // Value updated in the service
}

@Entity
@Table(name = "B")
@Indexed
public class B {
    @IndexedEmbedded
    @ManyToOne(fetch = FetchType.LAZY)
    @JoinColumn(name = "A_ID")
    private A a;

    @IndexedEmbedded
    @OneToOne(fetch = FetchType.LAZY, mappedBy = "b")
    @Fetch(FetchMode.JOIN)
    private C c; // Some other details

    @IndexedEmbedded
    @OneToMany(fetch = FetchType.LAZY, mappedBy = "b")
    private Set<E> eCollection; // Some other details
}

// My service
aObject.setSomeField("some value");
fullTextSession.flushToIndexes();
Increasing the JVM allocated memory (from 8 GB to 24 GB, which is actually a lot for ~10000 objects) didn't solve anything.
So I presume the loading of the whole dataset requires more than 24 GB...
However, the problem seems more complicated than it looks.
Is that a bug? Is that common? What did I do wrong? How could I solve it?
Is there some hidden Hibernate-Search configuration to avoid this behaviour?
It is a limitation of Hibernate Search. @ContainedIn will perform relatively well only for small associations; large ones such as yours will indeed trigger the loading of all associated entities and will perform badly, or in the worst cases trigger an OOM.
It hasn't been fixed yet because the problem is rather complex. We would need to use queries instead of associations for the @ContainedIn (HSEARCH-1937), which would be rather simple. But more importantly we would need to perform chunking (periodic flush/clear), which would either have side effects on the user session or be performed outside of the user transaction (HSEARCH-2364), both of which may have nasty consequences.
The workaround is to not add the @ContainedIn on A.bCollection and to handle reindexing manually: https://docs.jboss.org/hibernate/search/5.11/reference/en-US/html_single/#manual-index-changes
Similarly to what I mentioned in another answer, you can adopt one of two strategies:
The easy path: reindex all the B entities periodically using the mass indexer, e.g. every night (a sketch follows below).
The hard path: whenever an A changes, save the information "this entity changed" somewhere (this could be as simple as storing a "last update date/time" on entity A, or adding a row in an event table). In parallel, have a periodic process inspect the changes, load the affected entities of type B, and reindex them. Preferably do that in batches of manageable size, one transaction per batch if you can (that will avoid some headaches).
The first solution is fairly simple, but has the big disadvantage that the B index will be up to 24 hours out of date. Depending on your use case, that may or may not be acceptable. It also may not be feasible if you have many entities of type B (read: millions) and full reindexing takes more than just a few minutes.
The second solution is prone to errors and you would basically be doing Hibernate Search's work, but it would work even for very large tables, and the delay between the database change and the reindexing would be much shorter.
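For the easy path, a rough sketch using the Hibernate Search 5 MassIndexer (the session variable, batch size, and thread count are illustrative assumptions):

// Sketch only: rebuild the whole B index, e.g. from a nightly job.
// Assumes Hibernate Search 5 (org.hibernate.search.Search) and an open
// Hibernate Session "session"; startAndWait() may throw InterruptedException.
FullTextSession fullTextSession = Search.getFullTextSession(session);
fullTextSession.createIndexer(B.class)
        .batchSizeToLoadObjects(25)  // load entities in small chunks
        .threadsToLoadObjects(4)     // parallel loading threads
        .startAndWait();             // blocks until reindexing completes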

Hibernate associations using too much memory

I have a table "class" which is linked to the tables "student" and "teacher".
A "class" is linked to multiple students and teachers via a foreign key relationship.
When I use Hibernate associations and fetch a large number of entities (tried with 5000), I see that it takes four times more memory than if I just use foreign key placeholders.
Is there something wrong with Hibernate associations?
Can I use a memory profiler to figure out what's using too much memory?
This is how the schema is:
class(id,className)
student(id,studentName,class_id)
teacher(id,teacherName,class_id)
class_id is the foreign key.
Case #1 - Hibernate Associations
1) In the Class entity, students and teachers are mapped as:
@Entity
@Table(name="class")
public class Class {
    private Integer id;
    private String className;
    private Set<Student> students = new HashSet<Student>();
    private Set<Teacher> teachers = new HashSet<Teacher>();

    @OneToMany(fetch = FetchType.EAGER, mappedBy = "classRef")
    @Cascade({ CascadeType.ALL })
    @Fetch(FetchMode.SELECT)
    @BatchSize(size=500)
    public Set<Student> getStudents() {
        return students;
    }
2) In Student and Teacher, the class is mapped as:
@Entity
@Table(name="student")
public class Student {
    private Integer id;
    private String studentName;
    private Class classRef;

    @ManyToOne
    @JoinColumn(name = "class_id")
    public Class getClassRef() {
        return classRef;
    }
Query used:
sessionFactory.openSession().createQuery("from Class where id<5000");
This, however, was taking a huge amount of memory.
Case #2 - Remove associations and fetch separately
1) No mapping in the Class entity
@Entity
@Table(name="class")
public class Class {
    private Integer id;
    private String className;
2) Only a placeholder for the foreign key in Student and Teacher
@Entity
@Table(name="student")
public class Student {
    private Integer id;
    private String studentName;
    private Integer class_id;
Queries used:
sessionFactory.openSession().createQuery("from Class where id<5000");
sessionFactory.openSession().createQuery("from Student where class_id = :classId");
sessionFactory.openSession().createQuery("from Teacher where class_id = :classId");
Note - only the important parts of the code are shown. I am measuring the memory usage of the fetched entities via the JAMM library.
I also tried marking the query as readOnly in case #1 as shown below, which does not improve memory usage very much, just a very little, so that's not the fix.
Query query = sessionFactory.openSession().
createQuery("from Class where id<5000");
query.setReadOnly(true);
List<Class> classList = query.list();
sessionFactory.getCurrentSession().close();
Below are the heap dump snapshots sorted by size. It looks like the entity state maintained by Hibernate is creating the problem.
Snapshot of Heapdump for hibernate associations program
Snapshot of heapdump for fetching using separate entities
You are doing an EAGER fetch with the annotation below. This will fetch all the students even without you accessing getStudents(). Make it lazy and it will fetch them only when needed.
From
@OneToMany(fetch = FetchType.EAGER, mappedBy = "classRef")
To
@OneToMany(fetch = FetchType.LAZY, mappedBy = "classRef")
When Hibernate loads a Class entity containing OneToMany relationships, it replaces the collections with its own custom version of them. In the case of a Set, it uses a PersistentSet. As can be seen on grepcode, this PersistentSet object contains quite a bit of stuff, much of it inherited from AbstractPersistentCollection, to help Hibernate manage and track things, particularly dirty checking.
Among other things, the PersistentSet contains a reference to the session, a boolean to track whether it's initialized, a list of queued operations, a reference to the Class object that owns it, a string describing its role (not sure what exactly that's for, just going by the variable name here), the string uuid of the session factory, and more. The biggest memory hog among the lot is probably the snapshot of the unmodified state of the set, which I would expect to approximately double memory consumption by itself.
There's nothing wrong here, Hibernate is just doing more than you realized, and in more complex ways. It shouldn't be a problem unless you are severely short on memory.
Note, incidentally, that when you save a new Class object that Hibernate previously was unaware of, Hibernate will replace the simple HashSet objects you created with new PersistentSet objects, storing the original HashSet wrapped inside the PersistentSet in its set field. All Set operations will be forwarded to the wrapped HashSet, while also triggering PersistentSet dirty tracking and queuing logic, etc. With that in mind, you should not keep and use any external references to the Set from before saving, and should instead fetch a new reference to Hibernate's PersistentSet instance and use that if you need to make any changes (to the set, not to the students or teachers within it) after the initial save.
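A small sketch of that advice (the setter, the session variable, and newStudent are assumptions; the point is only to re-read the collection from the managed entity after saving):

// Sketch only: after save(), Hibernate has wrapped the original HashSet,
// so obtain the collection again from the entity before modifying it.
Class cls = new Class();
cls.setStudents(new HashSet<Student>());
session.save(cls);                         // HashSet gets wrapped in a PersistentSet

Student newStudent = new Student();
Set<Student> managed = cls.getStudents();  // this is now Hibernate's PersistentSet
managed.add(newStudent);                   // changes are tracked by dirty checking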
Regarding the huge memory consumption you are noticing, one potential reason is that the Hibernate Session has to maintain the state of each entity it has loaded in the form of an EntityEntry object, i.e., one extra object, EntityEntry, per loaded entity. This is needed for Hibernate's automatic dirty checking mechanism during the flush stage, to compare the current state of the entity with its original state (the one stored as the EntityEntry).
Note that this EntityEntry is different from the object we access in our application code when we call session.load/get/createQuery/createCriteria. It is internal to Hibernate and stored in the first-level cache.
Quoting from the javadocs for EntityEntry:
We need an entry to tell us all about the current state of an object with respect to its persistent state.
Implementation Warning: Hibernate needs to instantiate a high amount of instances of this class, therefore we need to take care of its impact on memory consumption.
One option, assuming the intent is only to read and iterate through the data and not to change those entities, is to use a StatelessSession instead of a Session.
The advantage, as quoted from the javadocs for StatelessSession:
A stateless session does not implement a first-level cache nor
interact with any second-level cache, nor does it implement
transactional write-behind or automatic dirty checking
With no automatic dirty checking, there is no need for Hibernate to create an EntityEntry for each loaded entity as it did in the earlier case with Session. This should reduce the pressure on memory utilization.
That said, it does have its own set of limitations, as mentioned in the StatelessSession javadoc documentation.
One limitation worth highlighting is that it does not lazily load collections. If we are using a StatelessSession and want to load the associated collections, we should either join-fetch them using HQL or EAGER fetch them using Criteria.
Another is that it does not interact with a second-level cache, if one is configured.
So, given that it doesn't have the overhead of a first-level cache, you may want to try a StatelessSession and see whether it fits your requirements and helps reduce memory consumption.
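A rough sketch of that option (entity and property names taken from the question; the join fetch is needed because a StatelessSession does not lazily load collections):

// Sketch only: read-only iteration with a StatelessSession.
StatelessSession statelessSession = sessionFactory.openStatelessSession();
try {
    List<Class> classes = statelessSession
            .createQuery("select distinct c from Class c join fetch c.students where c.id < 5000")
            .list();
    for (Class c : classes) {
        // read c and c.getStudents() here; no EntityEntry, no dirty checking
    }
} finally {
    statelessSession.close();
}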
Yes, you can use a memory profiler, like visualvm or yourkit, to see what takes so much memory. One way is to get a heap dump and then load it in one of these tools.
However, you also need to make sure that you compare apples to apples. Your queries in case #2,
sessionFactory.openSession().createQuery("from Student where class_id = :classId");
sessionFactory.openSession().createQuery("from Teacher where class_id = :classId");
select students and teachers for only one class, while in case #1 you select far more. You need to use class_id <= :classId instead, so that you cover the same range of classes.
In addition, it is a little strange that each student and teacher record belongs to exactly one class. A teacher can teach more than one class and a student can attend more than one class. I don't know what exact problem you're solving, but if a student can indeed participate in many classes and a teacher can teach more than one class, you will probably need to design your tables differently.
Try @Fetch(FetchMode.JOIN); this generates only one query instead of multiple select queries. Also review the generated queries. I prefer using Criteria over HQL (just a thought).
For profiling, use free tools like VisualVM or JConsole. YourKit is good for advanced profiling, but it is not free. I believe there is a trial version of it.
You can take a heap dump of your application and analyze it with any memory analyzer tool to check for memory leaks.
BTW, I am not exactly sure about the memory usage in the current scenario.
It's likely the reason is the bidirectional link from Student to Class and Class to Students. When you fetch Class A (id 4500), the Class object must be hydrated; in turn this pulls in all the Student objects (and presumably Teachers) associated with this class. When this happens, each Student object must be hydrated, which causes the fetch of every class the student is a part of. So although you only wanted Class A, you end up with:
Fetch Class A (id 4900)
Returns Class A with reference to 3 students, Student A, B, C.
Student A has ref to Class A, B (id 5500)
Class B needs hydrating
Class B has reference to Students C,D
Student C needs hydrating
Student C only has reference to Class A and B
Student C hydration complete.
Student D needs hydrating
Student D only has reference to Class B
Student B hydration complete
Class B hydration complete
Student B needs hydrating (from original class load class A)
etc. With eager fetching, this continues until all links are hydrated. The point is that it's possible to end up with Classes in memory that you didn't actually want, or whose id is not less than 5000.
This could get worse fast.
Also, make sure you override the hashCode and equals methods. Otherwise you may be getting redundant objects, both in memory and in your set.
One way to improve this is either to change to LAZY loading, as others have mentioned, or to break the bidirectional links. If you know you will only ever access students per class, then don't keep the link from Student back to Class. For the student/class example it makes sense to have the bidirectional link, but maybe it can be avoided.
As you say you want "all" the collections, lazy loading won't help.
Do you need every field of every entity? If not, use a projection to get just the bits you want (see the sketch below, and "When to use Hibernate Projections").
Alternatively, consider having minimalist Teacher-Lite and Student-Lite entities that the full-fat versions extend.
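A small sketch of the projection idea (property names taken from the entities above; adjust to whatever fields you actually need):

// Sketch only: fetch scalar values instead of full entities.
List<Object[]> rows = session
        .createQuery("select s.id, s.studentName, s.classRef.id from Student s where s.classRef.id < 5000")
        .list();
for (Object[] row : rows) {
    Integer studentId = (Integer) row[0];
    String studentName = (String) row[1];
    Integer classId = (Integer) row[2];
    // work with the scalars; no managed entity instances are kept around
}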

Hibernate/JPA: update two fields on persist and read from only one

One quick question for Java Hibernate/JPA users.
I have two tables (entities), A and B, related as A has many B (one-to-many). Entity A has a Set of B values in Java.
Due to a read performance issue I want to implement master-detail denormalization, so I want to store the raw Set object (maybe serialized) directly in entity A, because the association costs me too much CPU time when read by JPA (updating is not an issue).
The problem is: can I achieve something like getBs always returning the denormalized object (so it's fast), while addB adds a new B to the Set and updates the denormalized object with new raw data prepared for faster reads?
It's an Oracle DB.
entity example:
class A {
    Long id;
    String name;
    Set<B> arrayOfBs;
    byte[] denormalizedArrayOfB;

    getArrayOfBs() {
        return (Set<B>) denormalizedArrayOfB;
    }

    addArrayOfBs(B b) {
        // persist b
        // update and persist denormalizedArray with new b
    }
    // getters and setters...
}

class B {
    Long id;
    A reference;
    String x;
    String y;
    // getters and setters...
}
That's complicated. There are better approaches to your problem:
You can simply replace the one-to-many association with a DAO query (see the sketch below). Whenever you fetch the parent entities you won't get the children collection (maybe there are far too many of them), but when you want a parent's children you simply run a DAO query, which is also easier to filter.
You leave the children collection, but use an in-memory cache to store the fully initialized object graph. This might sound like a natural choice, but most likely you're going to trade consistency for performance.
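A minimal sketch of the first approach (the EntityManager parameter and the "reference" property are taken from the question's pseudocode; the method name is illustrative):

// Sketch only: a DAO method that replaces the A -> B collection mapping.
public List<B> findBsForParent(EntityManager em, Long parentId) {
    return em.createQuery(
                "select b from B b where b.reference.id = :parentId", B.class)
             .setParameter("parentId", parentId)
             .getResultList();
}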

Grails. Hibernate lazy loading multiple objects

I'm having difficulties with proxied objects in Grails.
Assuming I've got the following
class Order {
    @ManyToMany(fetch = FetchType.EAGER)
    @JoinTable(name = "xxx", joinColumns = {@JoinColumn(name = "xxx")}, inverseJoinColumns = {@JoinColumn(name = "yyy")})
    @OrderBy("id")
    @Fetch(FetchMode.SUBSELECT)
    private List<OrderItem> items;
}

class Customer {
    @ManyToOne(cascade = CascadeType.ALL, fetch = FetchType.LAZY, optional = true)
    @JoinColumn(name = "xxx", insertable = false, nullable = false)
    private OrderItem lastItem;

    private Long lastOrderId;
}
And inside some controller class
//this all happens during one hibernate session.
def currentCustomer = Customer.findById(id)
//at this point currentCustomer.lastItem is a javassist proxy
def lastOrder = Order.findById(current.lastOrderId)
//lastOrder.items is a proxy
//Some sample actions to initialise collections
lastOrder.items.each { println "${it.id}"}
After the iteration lastOrder.items still contains a proxy of currentCustomer.lastItem. For example if there are 4 items in the lastOrder.items collection, it looks like this:
object
object
javassist proxy (all fields are null including id field). This is the same object as in currentCustomer.lastItem.
object
Furthermore, this proxy object has all properties set to null and it's not initialized when getters are invoked. I have to manually call GrailsHibernateUtils.unwrapIdProxy() on every single element inside lastOrder.items to ensure that there are no proxies inside (which basically leads to EAGER fetching).
This one proxy object leads to some really weird exceptions that are difficult to track down during the testing phase.
Interesting fact: if I change the ordering of the operations (load the order first and the customer second) every element inside lastOrder.items is initialized.
The question is: Is there a way to tell Hibernate that it should initialize the collections when they are touched, no matter if any elements from the collection is already proxied in the session?
I think what's happening here is an interesting interaction between the first level cache (stored in Hibernate's Session instance) and having different FetchType on related objects.
When you load Customer, it gets put in to the Session cache, along with any objects that are loaded with it. This includes a proxy object for the OrderItem object, because you've got FetchType.LAZY. Hibernate only allows one instance to be associated with any particular ID, so any further operations that would be acting on the OrderItem with that ID would always be using that proxy. If you asked the same Session to get that particular OrderItem in another way, as you are by loading an Order containing it, that Order would have the proxy, because of Session-level identity rules.
That's why it 'works' when you reverse the order. Load the Order first: its collection is FetchType.EAGER, and so it (and the first-level cache) holds fully realized instances of OrderItem. Now load a Customer whose lastItem is set to one of the already-loaded OrderItem instances and presto, you have a real OrderItem, not a proxy.
You can see the identity rules documented in the Hibernate manual:
For objects attached to a particular Session... JVM identity for database identity is guaranteed by Hibernate.
All that said, even if you get an OrderItem proxy, it should work fine as long as the associated Session is still active. I wouldn't necessarily expect the proxy's ID field to show up as populated in the debugger or similar, simply because the proxy handles things in a 'special' way (i.e., it's not a POJO). But it should respond to method calls the same way its base class would. So if you have an OrderItem.getId() method, it should certainly return the ID when called, and similarly for any other method. Because it's lazily initialized, though, some of those calls may require a database query.
It's possible that the only real problem here is simply that it's confusing to have it so that any particular OrderItem could be a proxy or not. Maybe you want to simply change the relationships so that they're either both lazy, or both eager?
For what it's worth, it's a bit odd that you've got the ManyToMany relationship as EAGER and the ManyToOne as LAZY. That's exactly the reverse of the usual settings, so I would at least think about changing it (although I obviously don't know your entire use case). One way to think about it: If an OrderItem is so expensive to fetch completely that it's a problem when querying for Customer, surely it's also too expensive to load all of them at once? Or conversely, if it's cheap enough to load all of them, surely it's cheap enough to just grab it when you get a Customer?
I think you can force eager loading this way, using a criteria query:
def lastOrder = Order.withCriteria(uniqueResult: true) {
    eq('id', current.lastOrderId)
    items {}
}
or using an HQL query with 'fetch all'.
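A rough sketch of the HQL variant run through the underlying Hibernate session (entity and property names are assumed from the mapping above):

// Sketch only: force the items collection to be initialised with a join fetch.
Order lastOrder = (Order) session
        .createQuery("select o from Order o left join fetch o.items where o.id = :orderId")
        .setParameter("orderId", currentCustomer.getLastOrderId())
        .uniqueResult();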

Ehcache - why are the entries so big?

I have a fairly simple data model like:
class MyParent {
    // 7 fields here, some numeric, some String, not longer than 50 chars total
    Set<MyChild> children;
}

class MyChild {
    int ownerId;
    // 3 more fields, numeric or dates
}
MyParent, MyChild and MyParent.children are all cached as read-only.
I have 40,000 instances of MyParent and 100,000 instances of MyChild. That yields 180,000 entries in the cache (if you add the 40,000 MyParent.children collections).
I want to cache everything, grouped by ownerId. Not wanting to reinvent the wheel, I wanted to use the query cache, like:
Query query = session.createQuery(
        "select distinct p from MyParent p join fetch p.children c where c.ownerId = :ownerId");
query.setParameter("ownerId", ownerId);
query.setCacheable(true);
query.setCacheRegion("MyRegion");
query.list();
For all 1,500 values of ownerId.
The cache works, but I noticed it's huge! Measured with Ehcache.calculateInMemorySize(), each entry is over one kilobyte on average. To cache ~180,000 entries I would need over 200 MB. That's outrageous, given that the entries themselves are much smaller.
Where does the overhead come from and how can I decrease it?
I'm not sure from the question which cache you used to do the math, but let me use the MyParent class as an example. Given what you explained about the class, on a 64-bit VM with compressedOops enabled, a MyParent instance would be a little below 500 bytes on the heap. And that is without the Set, for reasons I'll explain later (it'd be another 128 bytes on top otherwise). The cache also needs to hold the key for that entry, which adds to the calculation...
Hibernate doesn't directly use the primary key as the key for what it stores in the cache, but a CacheKey entry. That instance holds the pk of the entity the value represents as well as four other fields: type, the Hibernate type mapping; entityOrRoleName, the entity or collection-role name; tenantId, the tenant identifier associated with this data; and finally, the hashCode of the pk (see org.hibernate.type.Type.getHashCode).
Now sadly it doesn't end there: the value for that entry isn't the MyParent instance either, but a CacheEntry instance. This time, besides more metadata (subclass, the entity's name, which defaults to the FQCN; lazyPropertiesAreUnfetched, a boolean; and the optimistic locking value from the entity), that instance still doesn't hold the MyParent instance, but a disassembled representation of it. This representation is an array of the state (all properties) of the entity.
I guess that with this information, the "estimated" sizes of your Hibernate caches will make more sense. I'd like to stress that these are only estimations, and if I remember correctly how it is calculated, it is probably slightly above reality. Indeed, some information in the CacheKey, for instance, should probably be accounted for differently. As of Ehcache 2.5, you will be able to enable memory-based tuning on caches (and even at the CacheManager level). When that is done, cache entries are precisely measured and calculateInMemorySize() will give you the real measured size of the cache.
You can download the beta for 2.5 now from ehcache.org. Also note that when using byte-based sizing on your caches, the sizing engine will account for these shared instances across cached entries in Hibernate's cache types. You can read more on how this all works here: http://ehcache.org/documentation/configuration.html#Memory_Based_Cache_Sizing_Ehcache_2.5_and_higher
Hope that helps you make more sense out of it all...
Alex
