In an app using a Wicket+Spring+JPA/Hibernate stack, I have an Inbox/Search page that needs quite complex search capabilities: records saved in a database are filtered using a myriad of filtering options. So far I've used the JPA Criteria API to build the database query, but it's getting quite messy. From what I've read about Hibernate Search, I feel that producing the query might be a bit easier with it. Would it be a good fit for this even though I don't really need any full-text search capabilities?
Sorry, but Hibernate Search is based on Lucene. It is not just another query language.
Lucene does not search for entities in your database; it searches for attributes in the Lucene index.
Hibernate Search adds the functionality to connect the entities from your database to the Lucene index.
Hibernate Search and Lucene are great tools when you need advanced full-text search. But if you don't need it, they only add a lot of unnecessary work (and problems).
So, as long as you do not need Lucene, Hibernate Search does not fit your needs.
The primary use case for Hibernate Search is fulltext search. However, it can also be used to index/search simple attributes/criteria. Whether the syntax for writing the queries is simpler than a criteria query is a matter of taste.
If you are not using the fulltext search capabilities you have to consider that you are adding an additional step in your application. The search query will be run against the Lucene index which will return entity ids (unless projection is used). The matching entities will then be fetched from the database.
On the other hand, once you use Hibernate Search it is easy to "improve" your search by adding some fulltext search capabilities to some of your criteria (if possible).
Whether or not you are using Search, I think the key is to write some sort of framework which programmatically builds your queries - Search or Criteria queries.
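One way to read the "framework" suggestion above is a small builder that adds a restriction only when the corresponding filter is actually set. The sketch below is plain Java with no JPA dependency. The class name InboxQueryBuilder, the Message entity and its field names are all hypothetical, and it renders a JPQL string plus a parameter map instead of a Criteria query, so treat it as an illustration of the idea rather than a recommended design.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: collects optional filters and renders JPQL plus named parameters.
// It only supports simple field names (no dotted paths), since the field doubles as the parameter name.
class InboxQueryBuilder {
    private final List<String> conditions = new ArrayList<>();
    private final Map<String, Object> params = new LinkedHashMap<>();

    // Adds "m.<field> = :<field>" only when a value was supplied.
    InboxQueryBuilder eq(String field, Object value) {
        if (value != null) {
            conditions.add("m." + field + " = :" + field);
            params.put(field, value);
        }
        return this;
    }

    // Adds a case-insensitive "contains" restriction for non-blank text.
    InboxQueryBuilder contains(String field, String text) {
        if (text != null && !text.trim().isEmpty()) {
            conditions.add("lower(m." + field + ") like :" + field);
            params.put(field, "%" + text.trim().toLowerCase(Locale.ROOT) + "%");
        }
        return this;
    }

    // Renders the query; with no active filters there is no where clause.
    String jpql() {
        String base = "select m from Message m";
        return conditions.isEmpty() ? base : base + " where " + String.join(" and ", conditions);
    }

    Map<String, Object> parameters() {
        return params;
    }

    public static void main(String[] args) {
        InboxQueryBuilder b = new InboxQueryBuilder()
                .eq("status", "UNREAD")
                .eq("sender", null)               // filter not set, so it is skipped
                .contains("subject", " Invoice ");
        System.out.println(b.jpql());
        System.out.println(b.parameters());
    }
}
```

The string and the map could then be handed to EntityManager.createQuery and Query.setParameter. The same shape would work for a builder that emits Criteria predicates or Hibernate Search queries instead.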
Related
I have many MySQL tables to store different types of data such as goods, categories, brands, suppliers, etc. Each of them needs to implement full-text search via Lucene.
So I plan to build one Lucene Directory (and one IndexWriter + one IndexReader corresponding to this Directory) for each table, e.g.
// FSDirectory.open takes a java.io.File up to Lucene 4.x (a java.nio.file.Path from 5.0 on)
Map<String, Directory> directories = new HashMap<String, Directory>();
directories.put("goods", FSDirectory.open(new File(luceneDirRoot, "goods")));
directories.put("categories", FSDirectory.open(new File(luceneDirRoot, "categories")));
...
Is this a good practice to use Lucene?
Furthermore, how can I know how many directories Lucene has created, like the MySQL command "SHOW TABLES"? new File(luceneDirRoot).listFiles() could be an option, but I am not sure whether there are other non-Lucene folders.
I would implement one Lucene index per MySQL table, provided you do not need to search across several tables. The alternative would be to write everything into one index and add the table name to each Lucene document; that way you could limit the search to a particular table.
AFAIK Lucene does not support a SHOW TABLES equivalent the way you desire it, but you could easily do that yourself, e.g. by using a naming convention for the directories.
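A do-it-yourself "SHOW TABLES" could look roughly like the sketch below. It uses only java.io, and it relies on one assumption: that a committed Lucene index directory contains a file whose name starts with "segments" (for example segments_1). Subdirectories without such a file are treated as non-Lucene folders. The class and method names are made up for illustration.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch: treat a subdirectory as a Lucene index if it holds a "segments*" file.
class LuceneIndexLister {

    // Returns the names of all index-like subdirectories of root, sorted alphabetically.
    static List<String> listIndexNames(File root) {
        List<String> names = new ArrayList<>();
        File[] children = root.listFiles();
        if (children == null) {
            return names; // root does not exist or is not a directory
        }
        for (File child : children) {
            if (child.isDirectory() && hasSegmentsFile(child)) {
                names.add(child.getName());
            }
        }
        Collections.sort(names);
        return names;
    }

    // Assumption: a committed index always has at least one file named "segments...".
    private static boolean hasSegmentsFile(File dir) {
        String[] files = dir.list();
        if (files == null) {
            return false;
        }
        for (String name : files) {
            if (name.startsWith("segments")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        File root = new File(args.length > 0 ? args[0] : ".");
        System.out.println(listIndexNames(root));
    }
}
```

If the directories are named after the tables, as in the question, the returned names double as the list of indexed tables.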
I would recommend looking at Hibernate Search, which is a good match for your needs. It builds one index directory per table and lets you perform full-text search while handling the low-level Lucene issues for you. You just configure the index by annotating the JPA entities corresponding to your tables and then implement the full-text queries. This is much easier than using bare Lucene with data from MySQL on your own: Hibernate Search builds the index for you and integrates well with data from a relational DB such as MySQL.
I am using Hibernate Search along with Lucene to implement full-text search on my database. I want to know whether a Hibernate Search query or a Lucene query returns the top-ranked, most relevant results. The documentation says:
Apache Lucene provides a very flexible and powerful way to sort
results. While the default sorting (by relevance) is appropriate most
of the time
Link: http://docs.jboss.org/hibernate/search/4.2/reference/en-US/html_single/#search-query
Section: 5.1.3.3. Sorting
But I am very confused by the results, as they are always ordered by the IDs of the objects. I just need the top 100 most relevant records.
See Customizing Lucene's scoring formula
Sorting by relevance is affected by your Analyzer choices. If you are getting results in primary-key order, it is likely that they all have the same score. That is normally very unlikely, so my guess is that you're not enabling tokenization on any searched field.
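The symptom can be reproduced without Lucene. The sketch below is a plain-Java illustration, not Lucene's implementation: the Hit class is hypothetical, and the comparator orders by descending score and falls back to ascending document id on ties, which mirrors how Lucene's default relevance ordering resolves equal scores. When every hit has the same score, the "top 100 by relevance" is simply the first 100 ids.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class RelevanceOrderDemo {

    // Hypothetical stand-in for a search hit: document id plus relevance score.
    static final class Hit {
        final int docId;
        final float score;

        Hit(int docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    // Highest score first; equal scores fall back to ascending document id.
    static final Comparator<Hit> BY_RELEVANCE = (a, b) -> {
        int byScore = Float.compare(b.score, a.score);
        return byScore != 0 ? byScore : Integer.compare(a.docId, b.docId);
    };

    // Returns at most n hits in relevance order without modifying the input list.
    static List<Hit> top(List<Hit> hits, int n) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(BY_RELEVANCE);
        return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
    }

    static List<Integer> ids(List<Hit> hits) {
        List<Integer> out = new ArrayList<>();
        for (Hit h : hits) {
            out.add(h.docId);
        }
        return out;
    }

    public static void main(String[] args) {
        // All scores equal: the result order degenerates to id order.
        List<Hit> flat = Arrays.asList(new Hit(3, 1.0f), new Hit(1, 1.0f), new Hit(2, 1.0f));
        System.out.println(ids(top(flat, 100)));

        // Distinct scores: the order follows relevance.
        List<Hit> varied = Arrays.asList(new Hit(1, 0.2f), new Hit(2, 0.9f), new Hit(3, 0.5f));
        System.out.println(ids(top(varied, 100)));
    }
}
```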
Make sure you're tokenizing the fields used in the Query and they are using an appropriate Analyzer. To pick an appropriate one you'll have to experiment a bit as it depends on the language (if it's natural language) or on what kind of data you're indexing.
To actually debug the sort order applied by Relevance sort, see usage of Projections in the Hibernate Search documentation: both FullTextQuery.SCORE and FullTextQuery.EXPLANATION can be very useful to understand what's going on.
A handy utility for quickly experimenting with the effect of different Analyzers is org.hibernate.search.util.AnalyzerUtils. You can either write unit tests that create the Analyzer instance yourself, or retrieve analyzers by name using org.hibernate.search.engine.SearchFactory.getAnalyzer(String), or retrieve the base one used for a specific indexed entity by entity type: org.hibernate.search.engine.SearchFactory.getAnalyzer(Class).
I am using Hibernate Search built on top of Lucene indexing. If indexes are created against database table the performance will be good in returning the results.
My question is: once the indexes are created and we query for results, does Hibernate Search use the indexes to fetch the results from the original database table, or does it not need to hit the database at all?
Thanks!
Unless you use Projections, the indexes are used only to identify the set of primary keys matching the query; these are then used to load the entities from the database.
There are many good reasons for this:
As you pointed out, we don't store all data in the index: a larger index is a slower index
Adding all needed metadata to the index would make indexing a very expensive operation
Value extraction from the index is not efficient at all: it's good at queries, no more
Relational databases are very good at loading data by primary key
If your DB isn't good enough, the second-level cache is excellent at loading by primary key
By loading from the DB we guarantee consistency especially with async indexing
By loading from the DB you have entities participate in Transactions and isolation
That said, if you don't need fully managed entities you can use Projections to load the fields you annotated as Store.YES. A common pattern is to provide a preview of matches using projections and then, when the user clicks for details, to load the full entity matching that result.
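The two paths described above can be modelled in a few lines of plain Java. This is a toy model under stated assumptions, not Hibernate Search code: the Product class, the in-memory maps standing in for the index and the database, and all method names are hypothetical. The "index" only maps terms to primary keys and keeps one stored field (the name). Loading full entities goes through the "database", while the projection path answers from the stored field alone.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class IndexThenLoadDemo {

    // Hypothetical entity; only "name" is treated as a stored index field.
    static final class Product {
        final long id;
        final String name;
        final String description;

        Product(long id, String name, String description) {
            this.id = id;
            this.name = name;
            this.description = description;
        }
    }

    private final Map<String, Set<Long>> postings = new HashMap<>(); // term -> matching ids
    private final Map<Long, String> storedNames = new HashMap<>();   // stored field in the "index"
    private final Map<Long, Product> database = new HashMap<>();     // primary-key lookup

    // Writes the entity to the "database" and updates the "index" alongside it.
    void save(Product p) {
        database.put(p.id, p);
        storedNames.put(p.id, p.name);
        for (String term : p.description.toLowerCase(Locale.ROOT).split("\\s+")) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(p.id);
        }
    }

    // Step 1: the index only answers "which primary keys match?"
    List<Long> matchingIds(String term) {
        Set<Long> ids = postings.get(term.toLowerCase(Locale.ROOT));
        return ids == null ? new ArrayList<Long>() : new ArrayList<>(ids);
    }

    // Step 2a (default path): load full entities by primary key.
    List<Product> loadEntities(String term) {
        List<Product> out = new ArrayList<>();
        for (Long id : matchingIds(term)) {
            out.add(database.get(id));
        }
        return out;
    }

    // Step 2b (projection path): answer from stored fields, no database access.
    List<String> projectNames(String term) {
        List<String> out = new ArrayList<>();
        for (Long id : matchingIds(term)) {
            out.add(storedNames.get(id));
        }
        return out;
    }

    public static void main(String[] args) {
        IndexThenLoadDemo demo = new IndexThenLoadDemo();
        demo.save(new Product(1L, "Red bike", "fast red road bike"));
        demo.save(new Product(2L, "Blue car", "fast blue car"));
        System.out.println(demo.projectNames("fast"));                    // preview list
        System.out.println(demo.loadEntities("fast").get(0).description); // detail view
    }
}
```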
By default, every time an object is inserted, updated or deleted through Hibernate, Hibernate Search updates the corresponding Lucene index, as per the documentation.
Hence, further searches will yield the data through the Lucene indexes only.
Another Question explaining how Indexes work
I am working with the Lucene and Derby databases. Lucene contains the text index, and Derby has information regarding additional user data. For example, each document has a tag. For this purpose the Derby database has two tables
TAGS:
ID
Name
LUCENETAGS:
ID
LUCENEID (docID in Lucene, not a field)
TAGID
I want a user to be able to search something like:
very interesting text AND tag:fun
Changing the structure in a way that tag is a Lucene field is not an option.
Thank you!
I believe you'll simply have to perform your text search in Lucene, and then filter your results based on the result of a query against Derby.
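A minimal sketch of that post-filtering step, in plain Java: it assumes the Lucene search has already produced a ranked list of document ids, and that the ids for the requested tag were read from Derby with a query along the lines of the one in the comment (built from the TAGS and LUCENETAGS tables in the question). The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class TagPostFilter {

    // luceneDocIds: ids returned by the text search, in rank order.
    // taggedDocIds: LUCENEID values for the requested tag, e.g. from (assumed SQL):
    //   SELECT lt.LUCENEID FROM LUCENETAGS lt JOIN TAGS t ON t.ID = lt.TAGID WHERE t.Name = ?
    static List<Integer> filterByTag(List<Integer> luceneDocIds, Set<Integer> taggedDocIds) {
        List<Integer> kept = new ArrayList<>();
        for (Integer id : luceneDocIds) {
            if (taggedDocIds.contains(id)) {
                kept.add(id); // rank order of the text search is preserved
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Integer> hits = Arrays.asList(7, 3, 9, 4);
        Set<Integer> tagged = new HashSet<>(Arrays.asList(4, 7, 100));
        System.out.println(filterByTag(hits, tagged));
    }
}
```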
If few documents will match a particular tag, you could also query the database for the IDs to be queried, and rewrite the query like:
(very interesting text) AND id:(1 2 3 etc.)
Probably not feasible, but in the case that tags are pretty sparse, it might be worth considering.
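The rewrite itself is just string assembly. The sketch below assumes, as the example query does, that the index has a searchable id field; the class and method names are made up. It returns an empty Optional when no ids match the tag, because in that case no document can match and the search can be skipped entirely.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Optional;
import java.util.StringJoiner;

class TagQueryRewriter {

    // Wraps the user's text query and restricts it to the given ids.
    static Optional<String> restrictToIds(String textQuery, Collection<Long> ids) {
        if (ids.isEmpty()) {
            return Optional.empty(); // nothing is tagged, so nothing can match
        }
        StringJoiner idClause = new StringJoiner(" ", "id:(", ")");
        for (Long id : ids) {
            idClause.add(String.valueOf(id));
        }
        return Optional.of("(" + textQuery + ") AND " + idClause);
    }

    public static void main(String[] args) {
        System.out.println(restrictToIds("very interesting text", Arrays.asList(1L, 2L, 3L)).get());
    }
}
```

One practical limit to keep in mind: Lucene caps the number of clauses in a boolean query (1024 by default), so this only works while the id list stays short.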
I do wonder, though, why a field can't be added to the index, duplicating the stored value in the Derby Database. In any implementation you choose to get what you want from your stated structure, you will see much poorer performance, and more complexity for you to deal with, than if the data were available in the index as well.
I am developing a search component of a web application using Lucene. I would like to save the user queries to an index and use them to suggest alternate queries to users, and to keep query statistics (most often used queries, top scoring queries, ...).
To use this data for alternate query suggestions, I would analyze the queries to see which terms are most often used with one another and use that to create a suggestion to the user.
But I can't figure out in which form to index the data. I was thinking of simply adding the queries into the index, but in that way there could be a lot of redundant data since many documents in the index would have the same content. Does anyone have any ideas about the way this can be accomplished?
Thanks for the help.
"I was thinking of simply adding the queries into the index, but in that way there could be a lot of redundant data since many documents in the index would have the same content"
You can tell Lucene not to store document content, which means that the principal overhead will be the unique Terms, and the index itself. So, it might not be a large overhead to store each query as a unique Document...this way you will not be throwing away any information.
First, I believe that you should store the queries separately from the existing index. The problem is not redundant data but rather "watering down" your index - storing the queries in the same index may harm the relevance of your searches. Some options for this are:
Use a separate Lucene index.
Use Solr, with two separate cores, one for the documents and the other for the queries.
Use a query log. Store scores with the queries. Build query statistics using post-processing. As this is a web application, you can probably use the logs of a servlet container, such as Tomcat, for this.
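As a rough illustration of the post-processing option, the sketch below takes logged query strings and computes two statistics mentioned in the question: how often each query was issued, and how often two terms appear together in the same query. It is plain Java with hypothetical names, it splits on whitespace only, and it ignores scores; a real setup would probably reuse the analyzer applied to the documents.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

class QueryLogStats {

    // Lower-cases, trims and collapses whitespace so equivalent queries compare equal.
    static String normalize(String query) {
        return query.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
    }

    // Counts how often each normalized query appears in the log.
    static Map<String, Integer> queryCounts(List<String> log) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String raw : log) {
            String q = normalize(raw);
            if (!q.isEmpty()) {
                counts.merge(q, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Counts how often two distinct terms occur in the same query.
    // Keys have the form "a b" with a sorted before b, so each pair has one key.
    static Map<String, Integer> pairCounts(List<String> log) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String raw : log) {
            String q = normalize(raw);
            if (q.isEmpty()) {
                continue;
            }
            List<String> terms = new ArrayList<>(new TreeSet<>(Arrays.asList(q.split(" "))));
            for (int i = 0; i < terms.size(); i++) {
                for (int j = i + 1; j < terms.size(); j++) {
                    counts.merge(terms.get(i) + " " + terms.get(j), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList("red bike", "Red  Bike", "red car", "bike red helmet");
        System.out.println(queryCounts(log));
        System.out.println(pairCounts(log));
    }
}
```

A suggestion step could then look up the pairs that contain a term the user typed and propose the most frequent partner terms.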
Second, Auto-Suggest From Popular Queries Using EdgeNGrams suggests an alternative implementation of query suggestion using Solr.