I have to process a huge amount of data distributed over 20 tables (~5 million records in total) and I need to load it efficiently.
I'm using Wildfly 14 and JPA/Hibernate.
Since in the end, every single record will be used by the business logic (in the same transaction), I decided to pre-load the entire content of the required tables into memory via simply:
em.createQuery("SELECT e FROM Entity e").getResultList().size();
After that, every object should be available in the transaction and thus be reachable via:
em.find(Entity.class, id);
But this doesn't work somehow and there are still a lot of calls to the DB, especially for the relationships.
How can I efficiently load the whole content of the required tables, including the relationships, and make sure everything is loaded and no further DB calls are needed?
What I already tried:
FetchType.EAGER: Still too many single selects / object graph too complex
EntityGraphs: Same as FetchMode.EAGER
Join fetch statements: Best results so far, since they populate the relationships to the referenced entities at the same time
2nd Level / Query Cache: Not working, probably the same problem as em.find
One thing to note is that the data is immutable (at least for a specific time) and could also be used in other transactions.
Edit:
My plan is to load and manage the entire data in a @Singleton bean. But I want to make sure I'm loading it the most efficient way and be sure the entire data is loaded. There should be no further queries necessary when the business logic is using the data. After a specific time (ejb timer), I'm going to discard the entire data and reload the current state from the DB (always whole tables).
Keep in mind, that you'll likely need a 64-bit JVM and a large amount of memory. Take a look at Hibernate 2nd Level Cache. Some things to check for since we don't have your code:
The @Cacheable annotation tells Hibernate that the entity can be cached
Configure 2nd-level caching to use something like ehcache, and set the maximum number of in-memory elements high enough to hold your working set (a sketch of this setup follows this list)
Make sure you're not accidentally using multiple sessions in your code.
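A minimal sketch of that setup, assuming Hibernate with ehcache as the 2nd-level cache provider; the entity name is illustrative:

import javax.persistence.Cacheable;
import javax.persistence.Entity;
import javax.persistence.Id;
import org.hibernate.annotations.Cache;
import org.hibernate.annotations.CacheConcurrencyStrategy;

@Entity
@Cacheable                                          // JPA hint: this entity may go into the 2nd-level cache
@Cache(usage = CacheConcurrencyStrategy.READ_ONLY)  // the data is immutable for a given period
public class ReferenceData {

    @Id
    private long id;

    private String payload;

    // getters/setters omitted
}

Together with hibernate.cache.use_second_level_cache=true and an ehcache region whose maxElementsInMemory is big enough for the working set.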
If you need to process things in this way, you may want to consider changing your design to not rely on having everything in memory, not using Hibernate/JPA, or not use an app server. This will give you more control of how things are executed. This may even be a better fit for something like Hadoop. Without more information it's hard to say what direction would be best for you.
I understand what you're asking, but JPA/Hibernate isn't going to want to cache that much data for you, or at least I wouldn't expect a guarantee from it. Consider that you described 5 million records. What is the average length per record? At 100 bytes each that is 500 megabytes of memory, which will already crash an untweaked JVM. More likely it's around 5,000 bytes on average, and that is 25 GB of memory. You need to think about what you're asking for.
If you want it cached you should do that yourself or better yet just use the results when you have them. If you want a memory based data access you should look at a technology specifically for that. http://www.ehcache.org/ seems popular but it's up to you and you should be sure you understand your use case first.
If you are trying to be database efficient, then you should understand what you're doing and design and test carefully.
Basically it should be a pretty easy task to load entire tables with one query per table and then link the objects, but JPA works differently, as shown in this example.
The biggest problem is @OneToMany/@ManyToMany relations:
@Entity
public class Employee {
    @Id
    @Column(name = "EMP_ID")
    private long id;
    ...
    @OneToMany(mappedBy = "owner")
    private List<Phone> phones;
    ...
}
@Entity
public class Phone {
    @Id
    private long id;
    ...
    @ManyToOne
    @JoinColumn(name = "OWNER_ID")
    private Employee owner;
    ...
}
FetchType.EAGER
If the relation is defined as FetchType.EAGER and the query is SELECT e FROM Employee e, Hibernate generates the SQL statement SELECT * FROM EMPLOYEE and right after it SELECT * FROM PHONE WHERE OWNER_ID=? for every single Employee loaded, commonly known as the n+1 problem.
I could avoid the n+1 problem by using the JPQL-query SELECT e FROM Employee e JOIN FETCH e.phones, which will result in something like SELECT * FROM EMPLOYEE LEFT OUTER JOIN PHONE ON EMP_ID = OWNER_ID.
The problem is, this won't work for a complex data model with ~20 tables involved.
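For a single relationship, though, the JOIN FETCH version is straightforward (a minimal sketch; DISTINCT keeps Hibernate from returning duplicate Employee rows produced by the join):

List<Employee> employees = em.createQuery(
        "SELECT DISTINCT e FROM Employee e LEFT JOIN FETCH e.phones", Employee.class)
    .getResultList();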
FetchType.LAZY
If defined as FetchType.LAZY, the query SELECT e FROM Employee e will just load all Employees as proxies, loading the related Phones only when phones is accessed, which in the end leads to the n+1 problem as well.
To avoid this, it seems obvious to just load all the Phones into the same session with SELECT p FROM Phone p. But when phones is accessed, Hibernate will still execute SELECT * FROM PHONE WHERE OWNER_ID=?, because Hibernate doesn't know that all the Phones are already in its current session.
Even when using 2nd level cache, the statement will be executed on the DB because Phone is indexed by its primary key in the 2nd level cache and not by OWNER_ID.
Conclusion
There is no mechanism like "just load all data" in Hibernate.
It seems there is no other way than to keep the relationships transient and connect them manually, or to just use plain old JDBC.
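A rough sketch of the manual approach, assuming the Phone side keeps a plain ownerId column instead of a mapped @ManyToOne, and that the usual getId()/getOwnerId()/getPhones() accessors exist:

// one query per table, then wire the association in memory
Map<Long, Employee> employeesById = new HashMap<>();
for (Employee e : em.createQuery("SELECT e FROM Employee e", Employee.class).getResultList()) {
    employeesById.put(e.getId(), e);
}
for (Phone p : em.createQuery("SELECT p FROM Phone p", Phone.class).getResultList()) {
    Employee owner = employeesById.get(p.getOwnerId()); // transient lookup, no extra SQL
    owner.getPhones().add(p);                           // assumes phones is initialized to an empty list
}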
EDIT:
I just found a solution that works very well. I defined all relevant @ManyToMany and @OneToMany relations as FetchType.EAGER combined with @Fetch(FetchMode.SUBSELECT), and all @ManyToOne relations with @Fetch(FetchMode.JOIN), which results in an acceptable loading time. In addition to adding @Cacheable(true) (javax.persistence) to all entities, I added @Cache (org.hibernate.annotations) to every relevant collection, which enables collection caching in the 2nd-level cache. I disabled 2nd-level cache timeout eviction and "warm up" the 2nd-level cache via a @Singleton EJB combined with @Startup on server start/deploy. Now I have 100% control over the cache: there are no further DB calls until I manually clear it.
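A condensed sketch of that setup; the READ_ONLY concurrency strategy is my assumption (the data is immutable between reloads), and the cache provider still has to be configured with eviction disabled:

@Entity
@Cacheable
@Cache(usage = CacheConcurrencyStrategy.READ_ONLY)
public class Employee {

    @Id
    private long id;

    @OneToMany(mappedBy = "owner", fetch = FetchType.EAGER)
    @Fetch(FetchMode.SUBSELECT)                         // one extra SELECT per collection role, not per row
    @Cache(usage = CacheConcurrencyStrategy.READ_ONLY)  // collection caching in the 2nd-level cache
    private List<Phone> phones;
}

// separate file: warms the cache on deploy so the business logic never hits the DB
@Singleton
@Startup
public class CacheWarmer {

    @PersistenceContext
    private EntityManager em;

    @PostConstruct
    public void warmUp() {
        // touch every relevant table once; entities and collections land in the 2nd-level cache
        em.createQuery("SELECT e FROM Employee e", Employee.class).getResultList();
    }
}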
Related
In my app I have Case, and for each Case there can be 0 to 2 Claims. If a Case has 0 claims it runs pretty fast, with 1 claim it slows down, and with 2 it is awfully slow. Any idea how to make this faster? I didn't know if my Case and Claim were going back and forth causing infinite recursion, so I added a @JsonManagedReference and @JsonBackReference, but that doesn't seem to help much with speed. Any ideas? Here is my Case.java:
@Entity
public class Case {
    @OneToMany(mappedBy = "_case", fetch = FetchType.EAGER)
    @Fetch(FetchMode.JOIN)
    @JsonManagedReference(value = "case-claim")
    public Set<Claim> claims;
}
In Claim.java:
@Entity
public class Claim implements Cloneable {
    @ManyToOne(optional = true)
    @JoinColumn(name = "CASE_ID")
    @JsonBackReference(value = "case-claim")
    private Case _case;
}
output of 0 claims:
https://gist.github.com/elmatt/2cafbe7ecb1fa0b7f6a8
output of 2 claims:
https://gist.github.com/elmatt/b000bc28909453effc95
Your problem has nothing to do with the relationship between Case and Claim.
FYI: 300ms is not "pretty fast." Your problem is that you expect hibernate to magically and quickly deliver a complex object hierarchy to you, with no particular effort on your part. I view ORM as "The Big Lie" - it is super easy to use and works great on toy problems, but tends to fail miserably when you try to scale to interesting applications (like yours).
Don't abandon hibernate, but realize that you are going to need to work harder than you thought you would in order to make it work for you.
I happen to work in a similar data domain (post-adjudication healthcare claim analysis and processing). You should be able to select this kind of data in well under 10ms per claim (with all associated dimensions) using MySQL on modest hardware from a table with >1 billion claims and the DB hosted on a separate server from the app.
How do you get from where you are to where you should be?
1. Minimize the number of round-trips to the database by minimizing the number of separate queries that are executed.
2. Hand-craft your important queries to grab just the rows and joins that you actually need (see the sketch after this list).
3. Use explain plan on every query to make sure that it hits the tables in the right order and every step is appropriately supported by an index.
4. Consider partitioning your big tables and include the partition criteria in your queries to enable partition-pruning to focus the query on the proper data.
5. Be very hesitant to let hibernate manage your relationships between your entities. I generally do not let hibernate deal with any relationships.
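To make point 2 concrete, a sketch of a hand-crafted query for one Case and its Claims; the table and column names are assumptions, not the poster's schema, and em is assumed to be a JPA EntityManager:

@SuppressWarnings("unchecked")
List<Object[]> rows = em.createNativeQuery(
        "SELECT c.ID, c.STATUS, cl.ID, cl.AMOUNT "
      + "FROM CASE_TABLE c LEFT JOIN CLAIM cl ON cl.CASE_ID = c.ID "
      + "WHERE c.ID = :caseId")
    .setParameter("caseId", caseId)
    .getResultList();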
A few years ago, I worked on a product that is an iPhone app where the user walks through workflows (e.g., a nurse taking a patient's vitals) and each screen made a round-trip to the app server to execute the workflow step and get the data for the next screen. Think about how little data you can work with on an iPhone screen. Yet the DB portion of the round-trip generally took 2-5 seconds to execute. Everyone there took it for granted, because "That is how long it has always taken." I dug into the code and found that each step was pulling in a significant portion of the database (and then was not used by the business logic).
The only time they tweaked the default hibernate behavior was when they got an exception due to too many joins (yes, MySQL has a limit of something like 61 tables in one query).
The approach of creating your Java data model and simply ORM'ing it into the database generally works just fine on configuration data and the like, but tends to perform terribly for complex data models involving your transactional data. This is what is biting you now.
Your problem is totally fixable, and can be attacked incrementally - you don't have to tear apart the whole application to start making things better.
Can you enable hibernate logging and provide the output? It should show the SQL queries being executed against your DB. Information about which DB you are using would also be useful. Once you have those, I would recommend profiling the queries to make sure your DB is set up appropriately. It sounds like a non-indexed query.
Size of the datasets would be helpful in targeting possible issues as well - number of rows and so on.
I would also recommend timing the actual hibernate call (could be as crude as a log statement immediately before/after) versus the overall processing, to identify whether it really is hibernate or some other processing. Without further information and context that is not clear here.
Now that you've posted your queries, we can see what is happening. It looks like the structure of your entities is more complex than the code snippet originally posted; there are references to Person, Activities, HealthPlan and others in there.
As others have commented your query is triggering a very large select of a lot of data due to the nature of your model.
I recommend creating Named Queries for claims, and then load those using the ID of Case.
You should also review your hibernate model and switch to FetchType.LAZY; otherwise hibernate will create large queries such as the one you have posted. The catch is that if you try to access a related entity outside of the transaction you will get a LazyInitializationException. You will need to consider each use case and ensure you load the data you need. Two common mistakes with Hibernate are using FetchType.EAGER everywhere or opening the transaction too early to avoid this. There is no single correct design approach, but I normally do the following:
JSP -> Controller -> [TX BOUNDARY] Service -> DAO
Your service method(s) should encapsulate the business logic needed to load the data you require before passing it back to the controller.
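A minimal sketch of the named-query plus lazy-fetch approach described above; the query name and fields are illustrative:

@Entity
@NamedQuery(name = "Claim.findByCaseId",
            query = "SELECT c FROM Claim c WHERE c._case.id = :caseId")
public class Claim {

    @ManyToOne(fetch = FetchType.LAZY, optional = true)  // no automatic join when a Claim is loaded
    @JoinColumn(name = "CASE_ID")
    private Case _case;
}

// inside the service method, i.e. within the transaction boundary
List<Claim> claims = em.createNamedQuery("Claim.findByCaseId", Claim.class)
        .setParameter("caseId", caseId)
        .getResultList();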
Again, per the other answer, I think you're expecting too much of Hibernate. It is a powerful tool but you need to understand how it works to get the best from it.
I have a JPA entity (EclipseLink) and I'm developing a web application with JSF 2. Let's say I have this:
private String table;
#OneToMany(mappedBy = "NodeTypeID")
private Collection<NodeEntity> nodeEntityCollection;
That collection becomes very large because, of course, there are a lot of rows in that table in the database. I don't show all those entities on the web page because that's too much for one page, so I limit the collection to 150 objects.
I limit it after the 1,000+ entities are already in memory, so I guess creating all those instances has to be slow. So I just want to know: what would you do in this case? Just run a query that brings back only the 150 entities I want? Is there an annotation for that? Is it good practice to leave the process as it is?
In Hibernate Criteria there are a couple of methods to handle pagination, i.e. retrieve 150 rows at a time; the client has to keep track of the page number it is viewing and send it to the server. Storing 1500 rows on the server is usually not a big deal for a short duration.
setFirstResult(i*PAGE_SIZE)
setMaxResults(PAGE_SIZE)
ref : http://docs.jboss.org/hibernate/envers/3.6/javadocs/org/hibernate/Criteria.html#setMaxResults(int)
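Putting those two calls together, a short sketch (assuming a Hibernate Session named session is available and the page index comes from the client):

int pageNumber = 0;          // supplied by the client
final int PAGE_SIZE = 150;

@SuppressWarnings("unchecked")
List<NodeEntity> page = session.createCriteria(NodeEntity.class)
        .setFirstResult(pageNumber * PAGE_SIZE)  // skip the previous pages
        .setMaxResults(PAGE_SIZE)                // fetch at most one page
        .list();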
My team is writing an application with GAE (Java) that has led me to question the scalability of entity relationship modeling (specifically many-to-many) in object oriented databases like BigTable.
The preferred solution for modeling unowned one-to-many and many-to-many relationships in the App Engine Datastore (see Entity Relationships in JDO) seems to be list-of-keys. However, Google warns:
"There are a few limitations to implementing many-to-many
relationships this way. First, you must explicitly retrieve the values
on the side of the collection where the list is stored since all you
have available are Key objects. Another more important one is that you
want to avoid storing overly large lists of keys..."
Speaking of overly large lists of keys: if you attempt to model it this way and assume that you are storing one Long for each key, then with a per-entity limit of 1 MB the theoretical maximum number of relationships per entity is ~130k. For a platform whose primary advantage is scalability, that's really not that many relationships. So now we are looking at possibly sharding entities that require more than 130k relationships.
A different approach (Relationship Model) is outlined in the article Modeling Entity Relationships as part of the Mastering the datastore series in the AppEngine developer resources. However, even here Google warns about the performance of relational models:
"However, you need to be very careful because traversing the
connections of a collection will require more calls to the datastore.
Use this kind of many-to-many relationship only when you really need
to, and do so with care to the performance of your application."
So by now you are asking: 'Why do you need more than 130k relationships per-entity?' Well I'm glad you asked. Let's take, for example, a CMS application with say 1 million users (Hey I can dream right?!)
Users can upload content and share it with:
1. public
2. individuals
3. groups
4. any combination
Now someone logs in, and navigates to a dashboard that shows new uploads from people they are connected to in any group. This dashboard should include public content, and content shared specifically with this user or a group this user is a member of. Not too bad right? Let's dig into it.
public class Content {
private Long id;
private Long authorId;
private List<Long> sharedWith; //can be individual ids or group ids
}
Now my query to get everything an id is allowed to see might look like this:
List<Long> idsThatGiveMeAccess = new ArrayList<Long>();
idsThatGiveMeAccess.add(myId);
idsThatGiveMeAccess.add(publicId); //Let's say that sharing with 0L makes it public
for (Group g : groupsImIn)
idsThatGiveMeAccess.add(g.getId());
List<Long> authorIdsThatIWantToSee = new ArrayList<Long>();
//Add a bunch of authorIds
Query q = new Query("Content")
.addFilter("authorId", Query.FilterOperator.IN, authorIdsThatIWantToSee)
.addFilter("sharedWith", Query.FilterOperator.IN, idsThatGiveMeAccess);
Obviously I've already broken several rules. Namely, using two IN filters will blow up. Even a single IN filter at any size approaching the limits we are talking about would blow up. Aside from all that, let's say I want to limit and page through the results... no no! You can't do that if you use an IN filter. I can't think of any way to do this operation in a single query - which means you can't paginate it without extensive read-time processing and managing multiple cursors.
So here are the tools I can think of for doing this: denormalization, sharding, or relationship entities. However, even with these concepts I don't see how it is possible to model this data in a way that scales. Obviously it's possible; Google and others do it all the time. I just can't see how. Can anyone shed any light on how to model this, or point me toward good resources for CMS-style access control based on a NoSQL DB?
Storing a list of ids as a property won't scale.
Why not simply store a new object for each new relationship? (Like in sql).
That object will store two properties for your CMS: the id of the shared item and the user id. If it's shared with 1000 users you will have 1000 of these. Querying it for a given user is trivial, and listing permissions for a given item, or what has been shared with a given user, is easy too.
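A rough sketch with the low-level datastore API (com.google.appengine.api.datastore); the kind and property names are illustrative:

DatastoreService ds = DatastoreServiceFactory.getDatastoreService();

// record that content 42 is shared with user 1001: one entity per relationship
Entity share = new Entity("Share");
share.setProperty("contentId", 42L);
share.setProperty("sharedWithId", 1001L);
ds.put(share);

// "what can user 1001 see?" becomes a single-property query
Query q = new Query("Share")
        .addFilter("sharedWithId", Query.FilterOperator.EQUAL, 1001L);
for (Entity result : ds.prepare(q).asIterable()) {
    Long contentId = (Long) result.getProperty("contentId");
    // collect the ids here, then fetch the Content entities by key
}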
I have a simple domain model as follows
Driver - key(string), run-count, unique-track-count
Track - key(string), run-count, unique-driver-count, best-time
Run - key(?), driver-key, track-key, time, boolean-driver-update, boolean-track-updated
I need to be able to update a Run and a Driver in the same transaction, as well as a Run and a Track in the same transaction (obviously to make sure I don't update the statistics twice or miss incrementing a counter).
Now I have tried assigning as run key, a key made up of driver-key/track-key/run-key(string)
This will let me update in one transaction the Run entity and the Driver entity.
But if I try updating the Run and Track entities together, it complains that it cannot transact over multiple entity groups. It says that it has both the Driver and the Track in the transaction and it can't operate on both...
tx.begin();
run = pm.getObjectById(Run.class, runKey);       // pm: the current PersistenceManager
track = pm.getObjectById(Track.class, trackKey); // this is where it fails
incrementCounters();
updateUpdatedFlags();
tx.commit();
Strangely enough when I do a similar thing to update Run and Driver it works fine.
Any suggestions on how else I can map my domain model to achieve the same functionality?
With Google App Engine, all of the datastore operations must be on entities in the same entity group. This is because your data is usually stored across multiple tables, and Google App Engine cannot do transactions across multiple tables.
Entities with owned one-to-one and one-to-many relationships are automatically in the same entity group. So if an entity contains a reference to another entity, or a collection of entities, you can read or write to both in the same transactions. For entities that don't have an owner relationship, you can create an entity with an explicit entity group parent.
You could put all of the objects in the same entity group, but you might get some contention if too many users are trying to modify objects in an entity group at the same time. If every object is in its own entity group, you can't do any meaningful transactions. You want to do something in between.
One solution is to have Track and Run in the same entity group. You could do this by having Track contain a List of Runs (if you do this, then Track might not need run-count, unique-driver-count and best-time; they could be computed when needed). If you do not want Track to have a List of Runs, you can use an unowned one-to-many relationship and specify the entity group parent of the Run be its Track (see "Creating Entities With Entity Groups" on this page). Either way, if a Run is in the same entity group as its track, you could do transactions that involve a Run and some/all of its Tracks.
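A rough sketch of the entity-group-parent idea, shown with the low-level datastore API rather than JDO; the key names come from the question, everything else is illustrative:

DatastoreService ds = DatastoreServiceFactory.getDatastoreService();

Key trackKey = KeyFactory.createKey("Track", "Monza");
// the Run is created as a child of its Track, so both live in one entity group
Entity run = new Entity("Run", "Michael-Monza-20101010", trackKey);
run.setProperty("driverKey", KeyFactory.createKey("Driver", "Michael"));
run.setProperty("time", 148L);

Transaction tx = ds.beginTransaction();
try {
    ds.put(tx, run);
    Entity track = ds.get(tx, trackKey);  // same entity group, so this is allowed in the transaction
    track.setProperty("runCount", (Long) track.getProperty("runCount") + 1);
    ds.put(tx, track);
    tx.commit();
} catch (EntityNotFoundException e) {
    throw new IllegalStateException("Track must exist for this sketch", e);
} finally {
    if (tx.isActive()) {
        tx.rollback();
    }
}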
For many large systems, instead of using transactions for consistency, changes are done by making operations that are idempotent. For instance, if Driver and Run were not in the same entity group, you could update the run-count for a Driver by first doing a query to get the count of all runs before some date in the past, then, in a transaction, update the Driver with the new count and the date when it was last computed.
Keep in mind when using dates that machines can have some kind of a clock drift, which is why I suggested using a date in the past.
I think I found a lateral but still clean solution which still makes sense in my domain model.
The domain model changes slightly as follows:
Driver - key(string-id), driver-stats - ex. id="Michael", runs=17
Track - key(string-id), track-stats - ex. id="Monza", bestTime=157
RunData - key(string-id), stat-data - ex. id="Michael-Monza-20101010", time=148
TrackRun - key(Track/string-id), track-stats-updated - ex. id="Monza/Michael-Monza-20101010", track-stats-updated=false
DriverRun - key(Driver/string-id), driver-stats-updated - ex. id="Michael/Michael-Monza-20101010", driver-stats-updated=true
I can now update the statistics of a Track with the statistics from a Run atomically (and exactly once), either immediately or in my own time. (And the same for the Driver/Run statistics.)
So basically I have to expand a little bit the way I model my problem, in a non-conventional relational way. What do you think?
I realize this is late, but..
Have you seen this method for Bank Account transfers?
http://blog.notdot.net/2009/9/Distributed-Transactions-on-App-Engine
It seems to me that you could do something similar by breaking your increment counters out into two steps with an IncrementEntity, processing that, and picking up the pieces later if a transaction fails, etc.
From the blog:
In a transaction, deduct the required amount from the paying account, and create a Transfer child entity to record this, specifying the receiving account in the 'target' field, and leaving the 'other' field blank for now.
In a second transaction, add the required amount to the receiving account, and create a Transfer child entity to record this, specifying the paying account in the 'target' field, and the Transfer entity created in step 1 in the 'other' field.
Finally, update the Transfer entity created in step 1, setting the 'other' field to the Transfer we created in step 2.
The blog has code examples in Python, but it should be easy to adapt.
There's an interesting google io session on this topic http://www.google.com/events/io/2010/sessions/high-throughput-data-pipelines-appengine.html
I guess you could update the Run stats and then fire two tasks to update the Driver and the Track individually.
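Something along these lines with the task queue API (com.google.appengine.api.taskqueue); the handler URLs and the runKey parameter (assumed to be a String) are made up:

// after committing the Run update, enqueue the two follow-up stat updates
Queue queue = QueueFactory.getDefaultQueue();
queue.add(TaskOptions.Builder.withUrl("/tasks/updateDriverStats").param("runKey", runKey));
queue.add(TaskOptions.Builder.withUrl("/tasks/updateTrackStats").param("runKey", runKey));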
We have an entire table of entities that we need to load during a hibernate session and the only way I know to load all entities is through an HQL query:
public <T> List<T> getAllEntities(final Class<T> entityClass) {
if (null == entityClass)
throw new IllegalArgumentException("entityClass can't be null");
List<T> list = castResultList(createQuery(
"select e from " + entityClass.getSimpleName() + " e ").list());
return list;
}
We use EHcache for 2nd level caching.
The problem is that this gets called hundreds of times in a given transaction/session and takes up a considerable portion of the total time. Is there any way to load all entities of a given type (load an entire table) and still benefit from the 1st-level session cache or the 2nd-level ehcache?
We've been told to stay away from query caching because of their potential performance penalties relative to their gains.
* Hibernate Query Cache considered harmful
Although we're doing performance profiling right now so it might be time to try turning on query cache.
L1 and L2 cache can't help you much with the problem of "get an entire table."
The L1 cache is ill-equipped because if someone else inserted something, it's not there. (You may "know" that no one else would ever do so within the business rules of the system, but the Hibernate Session doesn't.) Hence you have to go look in the DB to be sure.
With the L2 cache, things may have been expired or flushed since the last time anybody put the table in there. This can be at the mercy of the cache provider or even done totally externally, maybe through a MBean. So Hibernate can't really know at any given time if what's in the cache for that type represents the entire contents of the table. Again, you have to look in the DB to be sure.
Since you have special knowledge about this entity (new ones are never created) that there is no practical way to impart to the L1 or L2 caches, you need to either use the tool Hibernate provides for exactly this kind of business-rules-level knowledge about a result set, the query cache, or cache the info yourself.
--
If you really really want it in the L2 cache, you could in theory make all entities in the table members of a collection on some other bogus entity, then enable caching the collection and manage it secretly in the DAO. I don't think it could possibly be worth having that kind of bizarreness in your code though :)
Query cache is considered harmful if and only if the underlying table changes often. In your case the table is changed once a day. So the query would stay in cache for 24 hours. Trust me: use the query cache for it. It is a perfect use case for a query cache.
Example of a harmful query cache: if you have a user table and you use the query cache for "from User where username = ...", then this query will be evicted from the cache each time the user table is modified (another user changes or deletes his account). So ANY modification of this table triggers cache eviction. The only way to improve this situation is to query by natural-id, but that is another story.
If you know your table will be modified only once a day as in your case, the query cache will only evict once a day!
But pay attention to your logic when modifying the table. If you do it via hibernate, everything is fine. If you use a direct query, you have to tell hibernate that you have modified the table (something like query.addSynchronizedEntity(..)). If you do it via a shell script, you need to adjust the time-to-live of the underlying cache region.
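A minimal sketch of turning it on, reusing the helper methods from the question's DAO; it also requires hibernate.cache.use_query_cache=true and cacheable entities:

public <T> List<T> getAllEntities(final Class<T> entityClass) {
    if (null == entityClass)
        throw new IllegalArgumentException("entityClass can't be null");
    return castResultList(createQuery(
            "select e from " + entityClass.getSimpleName() + " e")
            .setCacheable(true)   // the list of ids is stored in the query cache
            .list());
}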
By the way, your answer is reimplementing the query cache: the query cache just caches the list of ids, and the actual objects are looked up in the L1/L2 cache, so you still need to cache the entities when you use the query cache.
Please mark this as the correct answer for further reference.
We ended up solving this by storing in memory the primary keys of all the entities in the table we needed to load (they're template data, and no new templates are added or removed).
Then we could use this list of primary keys to look up each entity and take advantage of Hibernate's 1st- and 2nd-level caches.
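Roughly, the approach looks like this; Template is a stand-in for the actual entity, and em.find could equally be a Hibernate session.get:

// loaded once, since templates are never added or removed
private List<Long> templateIds;

public List<Template> getAllTemplates() {
    List<Template> templates = new ArrayList<>(templateIds.size());
    for (Long id : templateIds) {
        templates.add(em.find(Template.class, id));  // served from the 1st/2nd-level cache once populated
    }
    return templates;
}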