We have a legacy database where a single top-level table has many relationships and sub-relationships. We usually need only a few of them, so we map them all as lazy-loaded by default and then use joins in HQL to pre-fetch the ones a particular part of the code is going to need.
We've got a module where we need quite a few of these. We don't want to get into N+1, but we've hit a massive performance snafu with this approach where one record has almost 4000 children, and they in turn have varying numbers of children. We have tried lazy-loading as many as we can without getting into N+1, but it appears that the cross-product that the join is producing is just unrealistically large.
Is there a better way to approach this problem? It seems like what is needed is a way to break this joined query into multiple queries, and then piece the hibernate models together in their relationships as a second step. Like if there was a way to do the HQL to load tables A, B, and C, but then make the load of C's sub-detail a second step that hibernate applies to the hierarchy by key.
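The "piece the models together by key" step described above can be sketched in plain Java. This is only an illustration under assumptions: StitchParent, StitchChild and Stitcher are hypothetical names, and the two lists stand in for the results of two separate queries (for example, one for the parents and one for the children filtered with something like "where c.parent.id in (:ids)"). In Hibernate itself, a second HQL query in the same session that join-fetches the next level will typically initialize the collections on the instances that are already loaded, so manual stitching may not be needed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical parent row, loaded by the first query. */
final class StitchParent {
    final long id;
    final List<StitchChild> children = new ArrayList<>();

    StitchParent(long id) {
        this.id = id;
    }
}

/** Hypothetical child row, loaded by a second query that also selects the foreign key. */
final class StitchChild {
    final long id;
    final long parentId;

    StitchChild(long id, long parentId) {
        this.id = id;
        this.parentId = parentId;
    }
}

final class Stitcher {
    /** Attaches each child to its parent by key in one pass; children without a loaded parent are skipped. */
    static void attach(List<StitchParent> parents, List<StitchChild> children) {
        Map<Long, StitchParent> byId = new HashMap<>();
        for (StitchParent p : parents) {
            byId.put(p.id, p);
        }
        for (StitchChild c : children) {
            StitchParent p = byId.get(c.parentId);
            if (p != null) {
                p.children.add(c);
            }
        }
    }
}
```

With this shape, each level of the hierarchy costs one query and one pass over its rows, instead of multiplying rows across levels in a single join.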
In my current project we have multiple search pages in the system where we fetch a lot of data from the database to be shown in a large table element in the UI. We're using JPA for data access (our provider is Hibernate). The data for most of the pages is gathered from multiple database tables - around 10 in many cases - including some aggregate data from OneToMany relationships (e.g. "number of associated entities of type X"). In order to improve performance, we're using result set pagination with TypedQuery.setFirstResult() and TypedQuery.setMaxResults() to lazy-load additional rows from the database as the user scrolls the table. As the searches are very dynamic, we're using the JPA CriteriaQuery API to build the queries. However, we're currently somewhat suffering from the N+1 SELECT problem. It's pretty bad in some cases actually, as we might be iterating through 3 levels of nested OneToMany relationships, where on each level the data is lazy-loaded. We can't really declare those collections as eager loaded in the entity mappings, as we're only interested in them in some of our pages. I.e. we might fetch data from the same table in several different pages, but we're showing different data from the table and from different associated tables in different pages.
In order to alleviate this, we started experimenting with JPA entity graphs, and they seem to help a lot with the N+1 SELECT problem. However, when you use entity graphs, Hibernate apparently applies the pagination in-memory. I can somewhat understand why it does that, but this behavior negates a lot (if not all) of the benefits of the entity graphs in many cases. When we didn't use entity graphs, we could load data without applying any WHERE restrictions (i.e. considering the whole table as the result set), no matter how many millions of rows the table had, as only a very limited amount of rows were actually fetched due to the pagination. Now that the pagination is done in-memory, Hibernate basically fetches the whole database table (plus all relationships defined in the entity graph), and then applies the pagination in-memory, throwing the rest of the rows away. Not good.
So the question is, is there an efficient way to apply both pagination and entity graphs with JPA (Hibernate)? If JPA does not offer a solution to this, Hibernate-specific extensions are also acceptable. If that's not possible either, what are the other alternatives? Using database Views? Views would be a bit cumbersome, as we support several database vendors. Creating all of the necessary views for different vendors would increase development effort quite a bit.
Another idea I've had would be to apply both the entity graphs and pagination as we currently do, and simply not trigger any queries if they would return too many rows. I already need to do COUNT queries to get the lazy-loading of rows to work properly in the UI.
I'm not sure I fully understand your problem, but we faced something similar: we have paged lists of entities that may contain data from multiple joined entities. Those lists might be sorted and filtered (some of those sorts/filters have to be applied in memory due to missing capabilities in the DBMS, but that's just a side note), and the paging should be applied afterwards.
Keeping all that data in memory doesn't work well so we took the following approach (there might be better/more standard ones):
Use a query to load the primary keys (simple longs in our case) of the main entities. Join only what is needed for sorting and filtering to make the query as simple as possible.
In our case the query would actually load more data to apply sorts and filters in memory where necessary but that data is released asap and only the primary keys are kept.
When displaying a specific page we extract the corresponding primary keys for a page and use a second query to load everything that is to be displayed on that page. This second query might contain more joins and thus be more complex and slower than the one in step 1 but since we only load data for that page the actual burden on the system is quite low.
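The page-extraction step above can be sketched as follows. This is a minimal illustration, not our actual code: IdPager and pageOf are hypothetical names, and the id list is assumed to be already sorted and filtered by the first query. The returned ids would then be passed to the second, heavier query as an "in (:ids)" parameter; since an IN clause does not guarantee result order, the loaded entities usually have to be re-ordered to match the id list.

```java
import java.util.List;

final class IdPager {
    /**
     * Returns the primary keys for one page (zero-based index),
     * or an empty list when the page lies past the end.
     */
    static List<Long> pageOf(List<Long> sortedIds, int pageIndex, int pageSize) {
        if (pageIndex < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("pageIndex must be >= 0 and pageSize > 0");
        }
        long from = (long) pageIndex * pageSize; // long arithmetic avoids int overflow
        if (from >= sortedIds.size()) {
            return List.of();
        }
        int to = (int) Math.min(from + pageSize, sortedIds.size());
        return List.copyOf(sortedIds.subList((int) from, to));
    }
}
```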
I'm not sure if something special exists for this use case - but it felt like a case where someone was likely to have made some sort of useful structure/technique/design-pattern.
My Situation
I have a set of SQL commands executed from middle tier (Java) to insert/update/delete data to any of a set of very large tables via joins from a related staging table.
I have more SQL commands which update various derived tables based on the staging table/actual table contents. Different tables will interact with different derived tables via different queries (as usual). These commands may have to be interleaved with the first set depending on the use case - so, I can't necessarily execute set 1 then set 2 all at once.
My Question
So, I need to build a chain of commands that get executed sequentially, and I need to trigger a rollback if any of them fail. I'd like to do this in the most clear, documented way possible.
Does anyone know a standard way of coding this? I'm sure anyone migrating from stored procedure code to middle tier code has done this before and I don't want to reinvent the wheel if there are good options out there.
Additional Information
One of my main concerns is making everything clear. To elaborate, I'll have a set of queries specifically designed to:
Truncate staging table A' and populate it with primary keys targeting deletion records
Delete from actual table A based on join with A'
Truncate staging table A' and populate it with full data for upserts
Update/Insert records from A' to A based on joins
The same logic will apply to tables B, C, D, etc. Unfortunately, it can be the case where just A and C need an extra step, like syncing deletes to a certain derived table, to be done after the deletions but before the upserts.
I'd obviously like to group all the logic for updating a table, and I'd like to group all the logic for updating a derived table as well, but at execution time they have to be intelligently interleaved and this sounds messy to me.
Don't write such a thing yourself. This is what JTA was born for.
You can use either JPA or Spring to do it.
Annotate the unit of work as transactional and let the database and JDBC handle it.
If you must do it yourself, follow the aspect-oriented approach and make it a decorative "before & after" implementation.
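For the do-it-yourself case, here is a minimal sketch with plain JDBC, assuming all steps share one Connection. SqlStep and StepChain are hypothetical names, not part of any library. Each step (delete from A, sync a derived table, upsert into A, and so on) becomes one entry in an ordered list, so the interleaving is visible in one place. Be aware that on some databases (Oracle and MySQL, for example) TRUNCATE is DDL and commits implicitly, so a rollback would not undo it.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.util.List;

/** One unit of work in the chain (hypothetical interface). */
interface SqlStep {
    void run(Connection conn) throws SQLException;
}

final class StepChain {
    /** Runs the steps in order in one transaction; rolls back everything if any step fails. */
    static void runAll(Connection conn, List<SqlStep> steps) throws SQLException {
        boolean previousAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false);
        try {
            for (SqlStep step : steps) {
                step.run(conn);
            }
            conn.commit();
        } catch (SQLException | RuntimeException e) {
            conn.rollback(); // undo every step executed so far
            throw e;
        } finally {
            conn.setAutoCommit(previousAutoCommit);
        }
    }
}
```

A caller would build the list per use case, for instance adding the derived-table sync for A and C between their delete and upsert steps, and leaving it out for the other tables.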
I need to get data from several tables, so I used a query with N left outer joins. It seems to me that this may waste performance, since I get the cartesian product of lots of data. What is the preferable way to do this in order to achieve better performance? I'm thinking of doing N+1 little queries. Am I on the right track?
I know, this has little to do with JDBC specifics. I want to retrieve data from a single table, and make left outer joins to other N tables. The result set gets very big because I get a cartesian product. For example:
table1data1, table2data1, table3data1
table1data1, table2data2, table3data1
table1data1, table2data1, table3data2
table1data1, table2data2, table3data2
I know that if I make several queries to the database (in my example I get 1 record for table 1, 2 records for table 2 and 2 records for table 3), I'll make a lot of round trips to the database. But I've tested it this way and it looks a lot faster.
This really isn't JDBC specific. Generally speaking, depending on the amount of data being returned, you'll get better performance retrieving everything in a single result set. N+1 queries tend to make for a lot of round trips to the database. Does the result set contain fields you don't need? Can you trim the columns being returned? That would be a first step, if possible.
I think your current approach of getting a lot of data in one trip to the database is the right approach. However, if you find yourself executing the same query many times with different parameters, it is more performant to write it as a stored procedure using bind variables. But I would definitely shy away from breaking your JOIN into several smaller queries.
I currently have hibernate set up in my project. It works well for most things. However today I needed to have a query return a couple hundred thousand rows from a table. It was ~2/3s of the total rows in the table. The problem is the query is taking ~7 minutes. Using straight JDBC and executing what I assumed was an identical query, it takes < 20 seconds. Because of this I assume I am doing something completely wrong. I'll list some code below.
DetachedCriteria criteria = DetachedCriteria.forClass(MyObject.class);
criteria.add(Restrictions.eq("booleanFlag", false));
List<MyObject> list = getHibernateTemplate().findByCriteria(criteria);
Any ideas on why it would be slow and/or what I could do to change it?
You have probably answered your own question already: use straight JDBC.
Hibernate is creating at best an instance of some Object for every row, or worse, multiple Object instances for each row. Hibernate has some really degenerate code generation and instantiation behavior that can be difficult to control, especially with large data sets, and even worse if you have any of the caching options enabled.
Hibernate is not suited for large result sets, and processing hundreds of thousands of rows as objects isn't very performance-oriented either.
Raw JDBC gives you just that: raw column values for each row, which is orders of magnitude less data.
I'm not sure Hibernate is the right thing to use if you need to pull hundreds of thousands of records. The query execution time might be under 20 seconds, but the fetch time will be huge and will consume a lot of memory. After you get all those records, how do you output them? It's far more data than you could display to a user. Hibernate isn't really a good solution for data-warehouse-style data crunching.
You probably have several references to other classes in your MyObject class, and your mapping sets eager loading or something like that for them. It's very hard to find the issue from the code you posted, because the code itself looks fine.
Probably it will be better for you to use Hibernate Profiler - http://hibernateprofiler.com/ . It will show you all the problems with your mappings, configurations and queries.
I have trouble understanding how to avoid the N+1 select problem in JPA or Hibernate.
From what I've read, there's 'left join fetch', but I'm not sure whether it still works with more than one list (OneToMany).
Could someone explain it to me, or give me a link to a clear, complete explanation, please?
I'm sorry if this is a noob question, but I can't find a really clear article or doc on this issue.
Thanks
Apart from the join, you can also use subselect(s). This results in 2 queries being executed (or in general m + 1, if you have m lists), but it scales well for a large number of lists too, unlike join fetching.
With join fetching, if you fetch 2 tables (or lists) with your entity, you get a cartesian product, i.e. all combinations of pairs of rows from the two tables. If the tables are large, the result can be huge, e.g. if both tables have 100 rows, the cartesian product contains 10,000 rows!
A better alternative for such cases is to use subselects. In this case, you would issue 2 selects - one for each table - on top of the main select (which loads the parent entity), so altogether you load 1 + 100 + 100 rows with 3 queries.
For the record, the same with lazy loading would result in 201 separate selects, each loading a single row.
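To make the row counts above concrete, here is a small arithmetic sketch. FetchCost and its methods are hypothetical names introduced for illustration; the model assumes one parent row with several independent collections, joined with left outer joins.

```java
final class FetchCost {
    /** Rows returned when one parent is left-outer-joined to several independent collections. */
    static long joinedRows(long... collectionSizes) {
        long rows = 1;
        for (long size : collectionSizes) {
            // An empty collection still yields one row (with NULLs) in a left outer join.
            rows *= Math.max(1, size);
        }
        return rows;
    }

    /** Rows returned when the parent and each collection are loaded by separate queries. */
    static long separateRows(long... collectionSizes) {
        long rows = 1; // the parent row itself
        for (long size : collectionSizes) {
            rows += size;
        }
        return rows;
    }
}
```

For two lists of 100 rows each, joinedRows(100, 100) is 10,000 while separateRows(100, 100) is 201; the gap grows multiplicatively with every additional list.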
Update: here are some examples:
a tutorial: Tuning Lazy Fetching, with a section on subselects towards the end (btw it also explains the n+1 selects problem and all strategies to deal with it),
examples of HQL subqueries from the Hibernate reference,
just in case, the chapter on fetching strategies from the Hibernate reference - similar content as the first one, but much more thorough