I am using Hibernate Search built on top of Lucene indexing. If indexes are created against database table the performance will be good in returning the results.
My question is, once indexes are created, if we query for the results does Hibernate Search fetch results from the original database table using the created indexes? or does it not need to hit the database to fetch the results?
Thanks!
Unless you use Projections the indexes are used only to identify the set of primary keys matching the query, these are then used to load the entities from the Database.
There are many good reasons for this:
As you pointed out, we don't store all data in the index: a larger index is a slower index
Adding all needed metadata to the index would make indexing a very expensive operation
Value extraction from the index is not efficient at all: it's good at queries, no more
Relational databases are very good at loading data by primary key
If you DB isn't good enough, second level cache is excellent to load by primary key
By loading from the DB we guarantee consistency especially with async indexing
By loading from the DB you have entities participate in Transactions and isolation
That said, if you don't need fully managed entities you can use Projections to load the fields you annotated as Stored.YES. A common pattern is to provide preview of matches using projections, and then when the user clicks for details to load the full entity matching that result.
By default, every time an object is inserted, updated or deleted through Hibernate, Hibernate Search updates the according Lucene index as per documentation
Hence, the further searches will yeild the data through lucene indexes only.
Another Question explaining how Indexes work
Related
I am using MassIndexer with #Indexed interceptor and it works just fine I am able to filter the entities. but the problem is that I have thousands of soft-deleted records, I don't want these objects to be in the indexing process since they are not important anymore.
so, is it possible in Hibernate Search to predefined query or conditions before the indexing process?
You can't do indexing with predefined HQL. Rather you can intercept the indexing process and instruct indexer whether it should index, skip or remove index for entity.
Please refer to Conditional Indexing topic in Reference Guilde.
Under your conditions: when > 95% of data is not to be indexed I would suggest the following:
Consider manual reindexing by running query and pushing items to index as described in Manual index changes
Consider splitting full data table and only active data table. This is a bit of data duplication but should give you considerable performance gains when working with active records only.
I have to process an xml file and for that I need to get around ~4k objects using it's primary key from a single table . I am using EhCache. I have few queries as follows:
1) It is taking lot of time if I am querying row by row based on Id and saving it in Cache . Can I query at initial point of time and save whole table in EHCache and can query it using primary key later in the processing
2) I dont want to use Query cache. As I can't load 4k objects at a time and loop it for finding correct object.
I am looking for optimal solution as right now my process is taking around 2 hours (it involves other processing too)
Thank you for your kind help.
You can read the whole table and store it in a Map<primary-key, table-row> to reduce the overhead of the DB connection.
I guess a TreeMap is probably the best choice, it makes search for elements faster.
Ehcache is great to handle concurrence, but if you are reading the xml with a single process you don't even need it (just store the Map in memory).
In my current project we have multiple search pages in the system where we fetch a lot of data from the database to be shown in a large table element in the UI. We're using JPA for data access (our provider is Hibernate). The data for most of the pages is gathered from multiple database tables - around 10 in many cases - including some aggregate data from OneToMany relationships (e.g. "number of associated entities of type X"). In order to improve performance, we're using result set pagination with TypedQuery.setFirstResult() and TypedQuery.setMaxResults() to lazy-load additional rows from the database as the user scrolls the table. As the searches are very dynamic, we're using the JPA CriteriaQuery API to build the queries. However, we're currently somewhat suffering from the N+1 SELECT problem. It's pretty bad in some cases actually, as we might be iterating through 3 levels of nested OneToMany relationships, where on each level the data is lazy-loaded. We can't really declare those collections as eager loaded in the entity mappings, as we're only interested in them in some of our pages. I.e. we might fetch data from the same table in several different pages, but we're showing different data from the table and from different associated tables in different pages.
In order to alleviate this, we started experimenting with JPA entity graphs, and they seem to help a lot with the N+1 SELECT problem. However, when you use entity graphs, Hibernate apparently applies the pagination in-memory. I can somewhat understand why it does that, but this behavior negates a lot (if not all) of the benefits of the entity graphs in many cases. When we didn't use entity graphs, we could load data without applying any WHERE restrictions (i.e. considering the whole table as the result set), no matter how many millions of rows the table had, as only a very limited amount of rows were actually fetched due to the pagination. Now that the pagination is done in-memory, Hibernate basically fetches the whole database table (plus all relationships defined in the entity graph), and then applies the pagination in-memory, throwing the rest of the rows away. Not good.
So the question is, is there an efficient way to apply both pagination and entity graphs with JPA (Hibernate)? If JPA does not offer a solution to this, Hibernate-specific extensions are also acceptable. If that's not possible either, what are the other alternatives? Using database Views? Views would be a bit cumbersome, as we support several database vendors. Creating all of the necessary views for different vendors would increase development effort quite a bit.
Another idea I've had would be to apply both the entity graphs and pagination as we currently do, and simply not trigger any queries if they would return too many rows. I already need to do COUNT queries to get the lazy-loading of rows to work properly in the UI.
I'm not sure I fully understand your problem but we faced something similar: We have paged lists of entities that may contain data from multiple joined entities. Those lists might be sorted and filtered (some of those sorts/filters have to be applied in memory due missing capabilities in the dbms but that's just a side note) and the paging should be applied afterwards.
Keeping all that data in memory doesn't work well so we took the following approach (there might be better/more standard ones):
Use a query to load the primary keys (simple longs in our case) of the main entities. Join only what is needed for sorting and filtering to make the query as simple as possible.
In our case the query would actually load more data to apply sorts and filters in memory where necessary but that data is released asap and only the primary keys are kept.
When displaying a specific page we extract the corresponding primary keys for a page and use a second query to load everything that is to be displayed on that page. This second query might contain more joins and thus be more complex and slower than the one in step 1 but since we only load data for that page the actual burden on the system is quite low.
I have some objects and I have two ways of storing them: using hibernate-search (lucene) with indexes, or as hibernate entities (postgres records).
I don't need full text search per se, just searching by the properties values (exact match) and maybe range.
Is there any performance comparison available for these two use cases?
The question is like comparing apple and pears. If you need persistent storage and the ability to query and retrieve entities you need to use hibernate core.
Hibernate Search is primarily an extension to Hibernate Core offering (full text) search. Yes is is possible to store the entity values in the index as well and retrieve them via projections. In this case, however, you are not dealing with entities anymore, but with object arrays. You are missing out on the life cycle gurantees Hibernate offers.
My guess would be that you want Hibernate Core (ORM) plus Criteria queries.
I have a complex query that requires a full-text search on some fields and basic restrictions on other fields. Hibernate Search documentation strongly advises against adding database query restrictions to a full text search query and instead recommends putting all of the necessary fields into the full-text index. The problem I have with that is that the other fields are volatile; values can change every minute or so and those updates to the database may occur outside of the JVM doing the search, so there is a high likelihood that the local Lucene index would be out of date with respect to those fields.
Looking for strategy recommendations here. The best I've come up with so far is to join the results manually by first executing the database query (fetching only object IDs) and then execute the full text search. and somehow efficiently filter the Lucene results by the set of object IDs from the database. Of course, I don't know how many results I'll get from each separate query, so I'm worried about performance and memory. It could be tens of thousands of rows apiece in the worst case.
I am quite interested in other ideas for this as we have a very similar scenario.
We only needed to show 50 results rows as a maximum with a couple of lookups per row. We run the query against the lucene index with the db pk ids in the index and the pull the lookups out of the database per row. It's still performant for us.
As you seem to want to process more than a few rows and lookups I did consider an alternative. Timestamp any db row updates. This would allow us to query the DB for stale indexes and then iteratively call a reindex of related documents.
I have the same problem and do a separate Lucene and criteria query. If I first do the criteria query I will use the resulting ids to apply a custom IdFilter for Lucene search which checks whether the result is in the given Id collection from the first query. However this approach does not scale well because also in my case the number of results after the first query might be huge and the filter is limited to 1024 ids. I did not find a good solution but I change the order of my two queries depending on the number of the to be expected results. The first query should be the one which filters out most of the results.
You can do a scheduler index update base on the last modified date.