Hibernate really slow. How to make it faster? - java

In my app, I have Case, and each Case can have 0 to 2 Claims. If a Case has 0 claims it runs pretty fast, with 1 claim it slows down, and with 2 it is awfully slow. Any idea how to make this faster? I didn't know whether my Case and Claim were referencing each other back and forth and causing infinite recursion, so I added a JsonManagedReference and JsonBackReference, but that doesn't seem to help much with speed. Any ideas? Here is my Case.java:
@Entity
public class Case {
    @OneToMany(mappedBy="_case", fetch = FetchType.EAGER)
    @Fetch(FetchMode.JOIN)
    @JsonManagedReference(value = "case-claim")
    public Set<Claim> claims;
}
In Claim.java:
@Entity
public class Claim implements Cloneable {
    @ManyToOne(optional = true)
    @JoinColumn(name = "CASE_ID")
    @JsonBackReference(value = "case-claim")
    private Case _case;
}
output of 0 claims:
https://gist.github.com/elmatt/2cafbe7ecb1fa0b7f6a8
output of 2 claims:
https://gist.github.com/elmatt/b000bc28909453effc95

Your problem has nothing to do with the relationship between Case and Claim.
FYI: 300ms is not "pretty fast." Your problem is that you expect hibernate to magically and quickly deliver a complex object hierarchy to you, with no particular effort on your part. I view ORM as "The Big Lie" - it is super easy to use and works great on toy problems, but tends to fail miserably when you try to scale to interesting applications (like yours).
Don't abandon hibernate, but realize that you are going to need to work harder than you thought you would in order to make it work for you.
I happen to work in a similar data domain (post-adjudication healthcare claim analysis and processing). You should be able to select this kind of data in well under 10ms per claim (with all associated dimensions) using MySQL on modest hardware from a table with >1 billion claims and the DB hosted on a separate server from the app.
How do you get from where you are to where you should be?
1. Minimize the number of round-trips to the database by minimizing the number of separate queries that are executed.
2. Hand-craft your important queries to grab just the rows and joins that you actually need.
3. Use explain plan on every query to make sure that it hits the tables in the right order and every step is appropriately supported by an index.
4. Consider partitioning your big tables and include the partition criteria in your queries to enable partition-pruning to focus the query on the proper data.
5. Be very hesitant to let hibernate manage your relationships between your entities. I generally do not let hibernate deal with any relationships.
A few years ago, I worked on a product that is an iPhone app where the user walks through workflows (e.g., a nurse taking a patient's vitals) and each screen made a round-trip to the app server to execute the workflow step and get the data for the next screen. Think about how little data you can work with on an iPhone screen. Yet the DB portion of the round-trip generally took 2-5 seconds to execute. Everyone there took it for granted, because "That is how long it has always taken." I dug into the code and found that each step was pulling in a significant portion of the database (and then was not used by the business logic).
The only time they tweaked the default Hibernate behavior was when they got an exception due to too many joins (yes, MySQL has a limit of 61 tables in one join).
The approach of creating your Java data model and simply ORM'ing it into the database generally works just fine on configuration data and the like, but tends to perform terribly for complex data models involving your transactional data. This is what is biting you now.
Your problem is totally fixable, and can be attacked incrementally - you don't have to tear apart the whole application to start making things better.

Can you enable Hibernate logging and provide the output? It should show the SQL queries being executed against your DB. Information about which DB you are using would also be useful. Once you have those, I would recommend profiling the queries to ensure your DB is set up appropriately. It sounds like a non-indexed query.
Size of the datasets would be helpful in targeting possible issues as well - number of rows and so on.
I would also recommend timing the actual hibernate call (could be as crude as log statement immediately before / after) vs overall processing to identify whether it really is hibernate or some other processing. Without further information & context that is not clear here.
Now you've posted your queries we can see what is happening. It looks like the structure of your entities is more complex than the code snippet originally posted. There are references to Person, Activities, HealthPlan and others in there.
As others have commented your query is triggering a very large select of a lot of data due to the nature of your model.
I recommend creating Named Queries for claims, and then load those using the ID of Case.
You should also review your Hibernate model and switch to FetchType.LAZY; otherwise Hibernate will create large queries such as the one you have posted. The catch here is that if you try to access a related entity outside of the transaction you will get a LazyInitializationException. You will need to consider each use case and ensure you load the data you need. Two common mistakes with Hibernate are to use FetchType.EAGER everywhere or to start the transaction too early in order to avoid this. There is not one correct design approach, but I normally do the following:
JSP -> Controller -> [TX BOUNDARY] Service -> DAO
Your service method(s) should encapsulate the business logic needed to load the data you require, before passing it back to the controller.
Again, per the other answer, I think you're expecting too much of Hibernate. It is a powerful tool but you need to understand how it works to get the best from it.

Related

Hierarchical Data Model with JPA

Recently I came across a schema model like this:
The structure is exactly the same as the real one; I just renamed the entities to names like Table (*).
Starting from Table C, all the tables (C to L) have close to 200 columns.
I am posting this because I have never come across a structure like this before. If anyone has worked with something similar or more complex, please share your ideas.
Is a structure like this good or bad, and why?
Assume we need an API to save data for a table structure like this:
How should the API be designed?
How should transactions be managed across all these tables?
In the service code, there are a few cases where we might need to read data from these tables and transfer it to an external system.
The catch is that the external system accepts requests in a flattened structure, not in the hierarchy described above. If this data needs to be transferred to the external system, how can we manage marshalling and unmarshalling?
Last but not least, the API that manages this data may be called at least 2K times a day.
What are your thoughts on this? I don't know exactly why we need it; it needs a detailed discussion and we need to break things down.
If I use Spring Data JPA and Hibernate, what do I need to consider?
More importantly, the rows in all these tables are restricted by ownerId/tenantId, so the data needs to be consistent across all the tables.
I can not comment on the general aspect of the structure as that is pretty domain specific and one would need to know why this structure was chosen to be able to say if it's good or not. Either way, you probably can't change this anyway, so why bother asking if it's good or not?
Having said that, with such a model there are a few aspects that you should consider:
When updating data, it is pretty important to update only the columns that really changed, to avoid index thrashing and to allow the DB to use spare storage in pages. This is a performance concern that usually comes up when using Hibernate with such models, as Hibernate by default updates all "updatable" columns, not just the dirty ones. There is an option to do dynamic updates though (Hibernate's @DynamicUpdate annotation). Without dynamic updates, you might produce a few more IOs per update and thus hold locks for a longer time, which affects the overall scalability.
When reading data, it is very important not to use join fetching by default as that might result in a result set size explosion.

Load entire tables including relationships into memory with JPA

I have to process a huge amount of data distributed over 20 tables (~5 million records in total) and I need to load them efficiently.
I'm using Wildfly 14 and JPA/Hibernate.
Since in the end, every single record will be used by the business logic (in the same transaction), I decided to pre-load the entire content of the required tables into memory via simply:
em.createQuery("SELECT e FROM Entity e").getResultList().size();
After that, every object should be available in the transaction and thus be retrievable via:
em.find(Entity.class, id);
But this doesn't work somehow and there are still a lot of calls to the DB, especially for the relationships.
How can I efficiently load the whole content of the required tables including
the relationships and make sure I got everything / there will be no further DB calls?
What I already tried:
FetchType.EAGER: Still too many single selects / object graph too complex
EntityGraphs: Same as FetchType.EAGER
Join fetch statements: Best results so far, since it simultaneously populates the relationships to the referred entities
2nd Level / Query Cache: Not working, probably the same problem as em.find
One thing to note is that the data is immutable (at least for a specific time) and could also be used in other transactions.
Edit:
My plan is to load and manage the entire data in a @Singleton bean. But I want to make sure I'm loading it the most efficient way and be sure the entire data is loaded. There should be no further queries necessary when the business logic is using the data. After a specific time (ejb timer), I'm going to discard the entire data and reload the current state from the DB (always whole tables).
Keep in mind, that you'll likely need a 64-bit JVM and a large amount of memory. Take a look at Hibernate 2nd Level Cache. Some things to check for since we don't have your code:
The @Cacheable annotation tells Hibernate that the entity is cacheable
Configure 2nd level caching to use something like ehcache, and set the maximum memory elements to something big enough to fit your working set into it
Make sure you're not accidentally using multiple sessions in your code.
If you need to process things in this way, you may want to consider changing your design to not rely on having everything in memory, not using Hibernate/JPA, or not use an app server. This will give you more control of how things are executed. This may even be a better fit for something like Hadoop. Without more information it's hard to say what direction would be best for you.
I understand what you're asking, but JPA/Hibernate isn't going to want to cache that much data for you, or at least I wouldn't expect a guarantee from it. Consider that you described 5 million records. What is the average length per record? 100 bytes gives 500 megabytes of memory, which will just crash your untweaked JVM. It is probably more like 5000 bytes on average, and that's 25 GB of memory. You need to think about what you're asking for.
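The arithmetic behind those figures can be checked with a quick sketch. The class and method names are made up for illustration, and the per-record sizes are the guesses from the paragraph above. The result counts payload bytes only; real Java objects add headers, references and collection overhead on top, so these numbers are best read as lower bounds.

```java
// Hypothetical helper: rough lower bound for holding every record in memory.
class MemoryEstimate {

    // Payload bytes only; JVM object overhead is deliberately ignored.
    static long estimateBytes(long recordCount, long avgBytesPerRecord) {
        return Math.multiplyExact(recordCount, avgBytesPerRecord);
    }

    public static void main(String[] args) {
        long records = 5_000_000L;
        // Decimal units, matching the figures quoted in the text.
        System.out.println(estimateBytes(records, 100) / 1_000_000 + " MB");
        System.out.println(estimateBytes(records, 5_000) / 1_000_000_000 + " GB");
    }
}
```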
If you want it cached you should do that yourself or better yet just use the results when you have them. If you want a memory based data access you should look at a technology specifically for that. http://www.ehcache.org/ seems popular but it's up to you and you should be sure you understand your use case first.
If you are trying to be database efficient, then you should just understand what you're doing, and design and test carefully.
Basically it should be a pretty easy task to load entire tables with one query per table and link the objects, but JPA works differently, as this example shows.
The biggest problem is @OneToMany/@ManyToMany relations:
@Entity
public class Employee {
    @Id
    @Column(name="EMP_ID")
    private long id;
    ...
    @OneToMany(mappedBy="owner")
    private List<Phone> phones;
    ...
}
@Entity
public class Phone {
    @Id
    private long id;
    ...
    @ManyToOne
    @JoinColumn(name="OWNER_ID")
    private Employee owner;
    ...
}
FetchType.EAGER
If the relation is defined as FetchType.EAGER and the query is SELECT e FROM Employee e, Hibernate generates the SQL statement SELECT * FROM EMPLOYEE and right after it SELECT * FROM PHONE WHERE OWNER_ID=? for every single Employee loaded, commonly known as the 1+n problem.
I could avoid the n+1 problem by using the JPQL-query SELECT e FROM Employee e JOIN FETCH e.phones, which will result in something like SELECT * FROM EMPLOYEE LEFT OUTER JOIN PHONE ON EMP_ID = OWNER_ID.
The problem is, this won't work for a complex data model with ~20 tables involved.
FetchType.LAZY
If defined as FetchType.LAZY the query SELECT e FROM Employee e will just load all Employees as Proxies, loading the related Phones only when accessing phones, which in the end will lead into the 1+n problem as well.
To avoid this it is pretty obvious to just load all the Phones into the same session SELECT p FROM Phone p. But when accessing phones Hibernate will still execute SELECT * FROM PHONE WHERE OWNER_ID=?, because Hibernate doesn't know that there are already all Phones in its current session.
Even when using 2nd level cache, the statement will be executed on the DB because Phone is indexed by its primary key in the 2nd level cache and not by OWNER_ID.
Conclusion
There is no mechanism like "just load all data" in Hibernate.
It seems there is no other way than keep the relationships transient and connect them manually or even just use plain old JDBC.
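Connecting the relationships manually, as mentioned in the conclusion, could look roughly like the sketch below. It assumes each table has already been read with one query into plain row objects; EmployeeRow, PhoneRow and ManualLinker are hypothetical names, not part of JPA or Hibernate, and the loading step itself (JDBC or otherwise) is left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical plain row objects, e.g. filled from one query on EMPLOYEE
// and one query on PHONE; the ORM does not manage the relationship.
class EmployeeRow {
    final long id;
    final List<PhoneRow> phones = new ArrayList<>();
    EmployeeRow(long id) { this.id = id; }
}

class PhoneRow {
    final long id;
    final long ownerId;   // value of the OWNER_ID column
    EmployeeRow owner;    // wired up by hand in ManualLinker.link
    PhoneRow(long id, long ownerId) { this.id = id; this.ownerId = ownerId; }
}

class ManualLinker {
    // Connects every phone to its employee with one pass over each list.
    static void link(List<EmployeeRow> employees, List<PhoneRow> phones) {
        Map<Long, EmployeeRow> byId = new HashMap<>();
        for (EmployeeRow e : employees) {
            byId.put(e.id, e);
        }
        for (PhoneRow p : phones) {
            EmployeeRow owner = byId.get(p.ownerId);
            if (owner != null) {   // skip phones whose owner was not loaded
                p.owner = owner;
                owner.phones.add(p);
            }
        }
    }
}
```

With n employees and m phones this is two queries and O(n + m) work in memory, instead of one extra select per employee.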
EDIT:
I just found a solution which works very well. I defined all relevant @ManyToMany and @OneToMany as FetchType.EAGER combined with @Fetch(FetchMode.SUBSELECT) and all @ManyToOne with @Fetch(FetchMode.JOIN), which results in an acceptable loading time. Besides adding javax.persistence.Cacheable(true) to all entities, I added org.hibernate.annotations.Cache to every relevant collection, which enables collection caching in the 2nd level cache. I disabled 2nd level cache timeout eviction and "warm up" the 2nd level cache via a @Singleton EJB combined with @Startup on server start / deploy. Now I have 100% control over the cache; there are no further DB calls until I manually clear it.

Calculation on query vs programmatically

I'm working on Java EE projects using Hibernate as the ORM, and I have come to a phase where I have to perform some mathematical calculations on my classes, like SUM, COUNT, addition and division.
I have 2 solutions:
Select my classes and apply those operations programmatically in my code
Do the calculations in my named queries
In terms of performance and speed, which one is better?
Thank you.
If you are going to load the same entities that you want to do the aggregation on from the database in the same transaction, then the performance will be better if you do the calculation in Java.
It saves you one round-trip to the database, because in that case you already have the entities in memory.
Other benefits are:
Easier to unit-test the calculation because you can stick to a Java-based unit testing framework
Keeps the logic in one language
Will also work for collections of entities that haven't been persisted yet
But if you're not going to load the same set of entities that you want to do the calculation on, then you will get a performance improvement in almost any situation if you let the database do the calculation. The more entities are involved, the bigger the performance benefit.
Imagine doing a summation over all line items in this year's orders, perhaps several million of them.
It should be clear that having to load all these entities into the memory of the Java process across a TCP connection (even if it is within the same machine) first will take more time, and more memory, than letting the database perform the calculation.
And if your mapping requires additional queries per entity, then Hibernate would have at least one extra round-trip to the database for every entity, in which case the performance benefits of calculating things in SQL on the database would be even bigger.
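A minimal sketch of the two cases follows. LineItem and its amountCents field are hypothetical, standing in for whatever entity is being aggregated, and the query in the comment is an illustrative JPQL aggregate rather than one taken from the question.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical entity-like class; in a real mapping this would be a JPA entity.
class LineItem {
    final long amountCents;
    LineItem(long amountCents) { this.amountCents = amountCents; }
}

class OrderTotals {
    // Case 1: the line items are already loaded in this transaction,
    // so summing them in Java costs no additional round-trip.
    static long totalCents(List<LineItem> loadedItems) {
        return loadedItems.stream().mapToLong(item -> item.amountCents).sum();
    }

    // Case 2: the items are not otherwise needed in memory. An aggregate query
    // along the lines of
    //   SELECT SUM(li.amountCents) FROM LineItem li
    // lets the database return a single number instead of every row.

    public static void main(String[] args) {
        List<LineItem> items = Arrays.asList(new LineItem(1250), new LineItem(399));
        System.out.println(totalCents(items));
    }
}
```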
Are these calculations on the entities (or data)? If yes, then you can indeed go for queries (or, even faster, use SQL queries instead of HQL). From a performance perspective, IMO, stored procedures shine, but people don't use them so often with Hibernate.
Also, if you have some frequent repetitive calculation, try using caching in your application.

caching readonly data for java application

I have a database with around 150K records and a primary key on the table. Each record takes less than 1 kB. Constructing a POJO from a DB record takes about 1-2 seconds (there is some business logic that takes too much time). This is read-only data, so I'm planning to cache it. My idea is to load the data in subsets (200 records at a time) and create a thread that constructs the POJOs and keeps them in a Hashtable. While the cache is being loaded (when I start the application), the user will see a wait sign. If storing the data in a Hashtable is an issue, I'll store the processed data in another DB table instead (marshalling the POJO to XML).
I use a third-party API to load the data from the database. Once I load a record, I have to load the associations for it, and then the associations of the associations found at the top level. It's like loading a family tree.
I can't use Hibernate or any ORM framework, as I'm using a third-party API to load the data, which is shipped with the database itself (it's a product). Moreover, I don't think loading the data once is a big issue.
If there is a possibility to fine tune the business logic I wouldn't have asked this question here.
Caching the data on demand is an option, but I'm trying to see if I can do anything better.
Please suggest a better idea if you are aware of one. Thank you.
"Suggest me if there is a better idea that you are aware of."
Yes, fix the business logic so that it doesn't take 1 to 2 seconds per record. That's a ridiculously long time.
Before you do that, profile your application to make sure that it is really the business logic that is causing the slow record loading, and not something else. (For example, it could be a pathological data structure, or a database issue.)
Once you've fixed the root cause of the slow record loading, it is still a good idea to cache the read-only records, but you probably don't need to preload the cache. Instead, just load the records on demand.
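Loading records on demand and keeping them cached can be sketched with the standard library alone. OnDemandCache is a hypothetical name, and the loader function stands in for whatever call builds the POJO from the database; this assumes the records really are read-only, so nothing ever needs to be invalidated.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical on-demand cache for read-only records.
class OnDemandCache<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> loader;   // e.g. the slow "build POJO from DB" call

    OnDemandCache(Function<K, V> loader) {
        this.loader = loader;
    }

    // Returns the cached value, invoking the loader only on the first request
    // for a key. ConcurrentHashMap applies the function at most once per key.
    // The loader must not return null, since null results are not stored.
    V get(K key) {
        return cache.computeIfAbsent(key, loader);
    }

    int size() {
        return cache.size();
    }
}
```

The first request for a record pays the full construction cost and later requests are map lookups, so no startup wait screen is needed.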
It sounds like you are reinventing the wheel. I'd be looking to use Hibernate. Apart from simplifying the code to access the database, Hibernate has built-in caching and lazy loading of data, so it only creates objects as you request them. Ergo, a lot of what you describe above is already in place and you can concentrate on sorting out your business logic. I suspect that once you solve the business logic performance issue, there will be no need for such a complicated caching system and the Hibernate defaults will be sufficient.
As maximdim said in a comment, preloading the whole thing will take a lot of time. If your system is not very strange, the user won't need all data at once. Just cache on demand instead. I would also recommend using an established caching solution, such as EHCache, which has persistence via DiskStore -- the only issue is that whatever you cache in this case has to be Serializable. Since you can marshall it as XML, I'm betting you can serialize it too, which should be faster.
In a past project, we had to query a very busy, very sluggish service running on an off-site mainframe in order to assemble one of the entities. Average response times from our app were dominated by this query. Since the data we retrieved was mostly read-only, caching with EHCache solved our problems.
jdbm has a nice, persistent map implementation (http://code.google.com/p/jdbm2/) - that may help you do local caching - it would certainly be a lot faster than serializing your POJOs to XML and writing them back into a SQL database.
If your data is truly read-only, then I'd think that the best solution would be to treat the source database as an input queue that feeds your app database. Create a background process (heck, a service would be better), and have it monitor the source database and keep your app database synced.

Google app engine: Poor Performance with JDO + Datastore

I have a simple data model that includes
USERS: store basic information (key, name, phone # etc)
RELATIONS: describe, e.g. a friendship between two users (supplying a relationship_type + two user keys)
COMMENTS: posted by users (key, comment text, user_id)
I'm getting very poor performance, for instance, if I try to print the first names of all of a user's friends. Say the user has 500 friends: I can fetch the list of friend user_ids very easily in a single query. But then, to pull out first names, I have to do 500 back-and-forth trips to the Datastore, each of which seems to take on the order of 30 ms. If this were SQL, I'd just do a JOIN and get the answer out fast.
I understand there are rudimentary facilities for performing two-way joins across un-owned relations in a relaxed implementation of JDO (as described at http://gae-java-persistence.blogspot.com) but they sound experimental and non-standard (e.g. my code won't work in any other JDO implementation).
Worse yet, what if I want to pull out all the comments posted by a user's friends. Then I need to get from User --> Relation --> Comments, i.e. a three-way join, which isn't even supported experimentally. The overhead of 500 back-and-forths to get a friend list + another 500 trips to see if there are any comments from a user's friends is already enough to push runtime >30 seconds.
How do people deal with these problems in real-world datastore-backed JDO applications? (Or do they?)
Has anyone managed to extract satisfactory performance from JDO/Datastore in this kind of (very common) situation?
-Bosh
First of all, for objects that are frequently accessed (like users), I rely on memcache. This should speed up your application quite a bit.
If you have to go to the datastore, the right way to do this should be through getObjectsById(). Unfortunately, it looks like GAE doesn't optimize this call. However, a contains() query on keys is optimized to fetch all the objects in one trip to the datastore, so that's what you should use:
List myFriendKeys = fetchFriendKeys();
Query query = pm.newQuery(User.class, ":p.contains(key)");
query.execute(myFriendKeys);
You could also rely on the low-level API get() that accepts multiple keys, or do like me and use Objectify.
A totally different approach would be to use an equality filter on a list property. This will match if any item in the list matches. So if you have a friendOf list property in your user entity, you can issue a single Query friendOf == theUser. You might want to check this: http://www.scribd.com/doc/16952419/Building-scalable-complex-apps-on-App-Engine
You have to minimize DB reads. That must be a huge focus for any GAE project - anything else will cost you. To do that, pre-calculate as much as you can, especially oft-read information. To solve the issue of reading 500 friends' names, consider that you'll likely be changing the friend list far less than reading it, so on each change, store all names in a structure you can read with one get.
If you absolutely cannot then you have to tweak each case by hand, e.g. use the low-level API to do a batch get.
Also, rather optimize for speed and not data size. Use extra structures as indexes, save objects in multiple ways so you can read it as quickly as possible. Data is cheap, CPU time is not.
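The difference between per-key reads and a batch get can be illustrated with a toy model. FakeDatastore below is an in-memory stand-in that only counts round-trips; it is not the App Engine API, and all names in it are hypothetical. At roughly 30 ms per trip, as reported in the question, 500 single gets add up to about 15 seconds, while one batch get stays near the cost of a single trip.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory stand-in for a remote store that counts round-trips.
// This is NOT the App Engine datastore API, only a model of its cost.
class FakeDatastore {
    private final Map<Long, String> firstNameByUserId = new HashMap<>();
    int roundTrips = 0;

    void put(long userId, String firstName) {
        firstNameByUserId.put(userId, firstName);
    }

    // One round-trip per key.
    String get(long userId) {
        roundTrips++;
        return firstNameByUserId.get(userId);
    }

    // One round-trip for the whole batch; unknown keys are left out of the result.
    Map<Long, String> getAll(List<Long> userIds) {
        roundTrips++;
        Map<Long, String> found = new HashMap<>();
        for (Long id : userIds) {
            String name = firstNameByUserId.get(id);
            if (name != null) {
                found.put(id, name);
            }
        }
        return found;
    }
}
```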
Unfortunately Phillipe's suggestion
Query query = pm.newQuery(User.class, ":p.contains(key)");
is only optimized to make a single query when searching by primary key. Passing in a list of ten non-primary-key values, for instance, gives the following trace
[Image: query trace, http://img293.imageshack.us/img293/7227/slowquery.png]
I'd like to be able to bulk-fetch comments, for example, from all of a user's friends. If I do store a List on each user, this list can't be longer than 1000 elements (if it's an indexed property of the user), as described at: http://code.google.com/appengine/docs/java/datastore/overview.html .
Seems increasingly like I'm using the wrong toolset here.
-B
Facebook has 28 Terabytes of memory cache... However, making 500 trips to memcached isn't very cheap either. It can't be used to store a gazillion small items. "Denormalization" is the key. Such applications do not need to support ad-hoc queries. Compute and store the results directly for the few supported queries.
In your case, you probably have just 1 type of query: return the data of this, that and the others that should be displayed on a user page. You can precompute this big ball of mess, so that later one query based on userId can fetch it all.
When userA makes a comment to userB, you retrieve userB's big ball of mess, insert userA's comment in it, and save it.
Of course, there are a lot of problems with this approach. For giant internet companies, they probably don't have a choice, generic query engines just don't cut it. But for others? Wouldn't you be happier if you can just use the good old RDBMS?
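The precompute-on-write idea described above can be sketched as follows. UserPageStore is a hypothetical in-memory stand-in for a key-value store, and the page is reduced to a list of strings; a real page would be a richer structure serialized into one entity.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory stand-in for a key-value store; each map access models one get or put.
class UserPageStore {
    private final Map<String, List<String>> pageByUserId = new HashMap<>();

    // Write path: read the recipient's precomputed page, add the comment, save it back.
    void addComment(String fromUserId, String toUserId, String text) {
        List<String> page = new ArrayList<>(pageFor(toUserId));
        page.add(fromUserId + ": " + text);
        pageByUserId.put(toUserId, page);
    }

    // Read path: a single lookup by user id returns everything the page displays.
    List<String> pageFor(String userId) {
        return pageByUserId.getOrDefault(userId, new ArrayList<>());
    }
}
```

Writes become more expensive (read, modify, save), but every page view is a single get, which matches a workload that is read far more often than it is written.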
If it is a frequently used query, you can consider preparing indexes for the same.
http://code.google.com/appengine/articles/index_building.html
The indexed property limit is now raised to 5000.
However you can go even higher than that by using the method described in http://www.scribd.com/doc/16952419/Building-scalable-complex-apps-on-App-Engine
Basically just have a bunch of child entities for the User called UserFriends, thus splitting the big list and raising the limit to n*5000, where n is the number of UserFriends entities.
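Splitting the friend list across child entities amounts to chunking it. The sketch below shows only the chunking step; FriendListSplitter is a hypothetical name, each resulting chunk would become the list property of one UserFriends child entity, and persisting those entities is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits one long friend list into chunks that each fit
// under the per-entity indexed property limit.
class FriendListSplitter {
    static final int MAX_PER_ENTITY = 5000;

    static <T> List<List<T>> split(List<T> friendKeys, int maxPerEntity) {
        if (maxPerEntity <= 0) {
            throw new IllegalArgumentException("maxPerEntity must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int start = 0; start < friendKeys.size(); start += maxPerEntity) {
            int end = Math.min(start + maxPerEntity, friendKeys.size());
            // Copy the view so each chunk is independent of the source list.
            chunks.add(new ArrayList<>(friendKeys.subList(start, end)));
        }
        return chunks;
    }
}
```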
