How to reduce the result set in Hibernate Search (Lucene)? - java

I have 9 million products in the DB and I am trying to use Hibernate Search to suggest products as the user starts typing a product name in the search box on a website. It is an autocomplete feature. I have implemented a web service that gets the possible suggestions from the dataset using Hibernate Search.
The dataset has two fields:
Product Name
URL link
The data is stored in MySQL. I am using org.hibernate.search.jpa.FullTextQuery for the search.
Issues with the results:
The number of results is too large. I get 18K+ results when I search for "intel core".
This causes a performance problem in query response time: the above search took 2 seconds.
Is there a way to reduce the search results on my dataset to get a better query response time?

Use setMaxResults()
Doc here: https://docs.jboss.org/hibernate/orm/3.5/javadocs/org/hibernate/Query.html#setMaxResults(int).
You may want to tweak your query/data to get the "most relevant" records returned.
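With the Hibernate Search JPA API that might look like the sketch below (a minimal example, not your exact code; the Product entity and its indexed name field are assumptions):
import java.util.List;
import javax.persistence.EntityManager;
import org.hibernate.search.jpa.FullTextEntityManager;
import org.hibernate.search.jpa.FullTextQuery;
import org.hibernate.search.jpa.Search;
import org.hibernate.search.query.dsl.QueryBuilder;

// ...
FullTextEntityManager ftem = Search.getFullTextEntityManager(entityManager);

// Build a Lucene query against the indexed "name" field.
QueryBuilder qb = ftem.getSearchFactory()
        .buildQueryBuilder().forEntity(Product.class).get();
org.apache.lucene.search.Query luceneQuery = qb.keyword()
        .onField("name").matching("intel core").createQuery();

// Wrap it as a FullTextQuery and cap the number of hits for the autocomplete box.
FullTextQuery query = ftem.createFullTextQuery(luceneQuery, Product.class);
query.setMaxResults(10); // only the top 10 suggestions
List<Product> suggestions = query.getResultList();
For an autocomplete box you rarely need more than a handful of suggestions, so capping the result set this way also keeps the response payload small.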

Related

Is there a way within hibernate to retrieve fast non-blocking row counts?

The following query generated by hibernate takes 13+ seconds and locks the table:
SELECT COUNT(auditentit0_.audit_id) AS col_0_0_ FROM Audit auditentit0_ WHERE 1=1;
The growing Microsoft SQL Server database table contains 90+ million rows.
For Microsoft SQL Server, I have found an accurate metadata-based way of getting the same information very quickly.
However, I would rather not write custom code for Microsoft SQL Server and Oracle (the next database) if Hibernate has a way of getting this information.
Here is an example metadata query for Microsoft SQL Server that is accurate and almost instant:
SELECT SUM (row_count) FROM sys.dm_db_partition_stats WHERE object_id=OBJECT_ID('huge_audit_table') AND (index_id=0 or index_id=1);
Is there a way to have hibernate issue a similar query for a table row count?
One posted answer has indicated that a view could be of use. I'm investigating this post to see if it can solve the issue:
https://vladmihalcea.com/map-jpa-entity-to-view-or-sql-query-with-hibernate/
In Hibernate you should use projections, like in the link you provided, to guarantee that it works across multiple DBMSs:
protected Long countByCriteria(DetachedCriteria criteria) {
    Criteria crit = criteria.getExecutableCriteria(getSession());
    // Replace the selection with a COUNT(*) projection so only the row count is fetched.
    crit.setProjection(Projections.rowCount());
    return (Long) crit.uniqueResult();
}
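Example usage (a sketch; Audit here stands for whatever entity is mapped to the audit table):
// Count all rows of the Audit entity through the projection above.
DetachedCriteria criteria = DetachedCriteria.forClass(Audit.class);
Long total = countByCriteria(criteria);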
What engine are you using in MySQL? I never had a blocking problem with row counts in MySQL or Oracle. Maybe the following link will help you: Any way to select without causing locking in MySQL?
Also, after some quick reading, I see that SQL Server does indeed block on COUNT.
Maybe you could use a stored procedure or some other mechanism to push the problem down to the DBMS.
Edit:
Projections in Hibernate are used to select the columns to fetch, the columns to group elements by, and to use built-in aggregate functions (sum, count, avg, max, min, countDistinct).
It helps you keep your application database-agnostic. Remember that Hibernate supports around 30 databases.
In your case you have a specific problem with MS SQL Server: the COUNT blocks the table because it prioritizes accuracy, while using the system views is really quick because you only get an estimate, but it isn't standard.
You could encapsulate the problem into a DBMS-dependent view or stored procedure. Or maybe you could try a NOLOCK hint or READ UNCOMMITTED isolation in Hibernate (for a count of an audit table that should be acceptable).
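A minimal sketch of the NOLOCK route via a native query (SQL Server only; session is an open Hibernate Session, createNativeQuery assumes Hibernate 5.2+, and the count may be slightly off because uncommitted data is read):
// The NOLOCK table hint avoids the shared locks a normal COUNT would take.
Number count = (Number) session
        .createNativeQuery("SELECT COUNT(*) FROM huge_audit_table WITH (NOLOCK)")
        .uniqueResult();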
To solve this particular problem we stepped back and changed how the UI functions. Through a collaborative effort between UX and UI developers we agreed that unfiltered queries will NOT ask for total counts. The initial screen load shows only a page full of data. No "page 1 of 60,000" control will exist. Only when the user enters specific criteria does the total count come into play, and those queries should be very fast. Now... it is still possible for the user to set up a query that is just as bad as the original problem, but it should be the exception rather than the norm.
So there really is not a solid answer for the OP. If you are faced with this type of problem and you have control of the UI and API, then it is time to rethink the solution. Think of how Google handles paging from a UI perspective. The days of showing a "page 1 of (XX)" are gone IMHO.

Pagination of data issue: Java + Oracle SQL

I have a requirement to paginate data (items) retrieved from the database.
The UI also contains search options, and the amount and order of the data depend on the search criteria.
Let's say a client sends a request with some search criteria and gets 60 results. The client sees the items from 1 to maxPageSize (25 by default). If the 2nd page is requested, items 26-50 will be shown.
The problem is that at the moment I can't get the total number of results and so can't display the maximum page number.
I see 2 solutions for this problem:
1. Query the database a second time with the same parameters but without pagination and get the count of the items.
2. Retrieve all items from the DB, filter them in backend code by the search criteria and send them to the client.
The questions are:
1) Which of the operations is less expensive in general?
2) What else can be done to solve this kind of task if there is a better solution?
P.S. The backend code is written in Java; queries are sent via JDBC to an Oracle 11g DB.
---EDIT---
I've solved this problem this way:
WITH FINAL_RESULT AS
  (SELECT SORTED_ITEMS.*,
          ROWNUM RN
   FROM (sorted basic query with searches) SORTED_ITEMS)
SELECT FINAL_RESULT.*,
       (SELECT COUNT(*) FROM FINAL_RESULT) ITEMS_COUNT
FROM FINAL_RESULT
WHERE RN BETWEEN ? AND ?
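Binding the two ROWNUM bounds from Java over JDBC could look like this (a sketch; paginatedSql holds the query above, and the page arithmetic is an assumption):
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// ...
// pageNumber is 1-based; pageSize is e.g. 25
int first = (pageNumber - 1) * pageSize + 1;
int last = pageNumber * pageSize;

try (PreparedStatement ps = connection.prepareStatement(paginatedSql)) {
    ps.setInt(1, first);
    ps.setInt(2, last);
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            // Every row also carries ITEMS_COUNT, so the maximum page number is
            // ceil(ITEMS_COUNT / (double) pageSize), available from the first row.
            int itemsCount = rs.getInt("ITEMS_COUNT");
            // ... map the item columns ...
        }
    }
}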
The second solution would be quite expensive if there is a bulk amount of data in the database.
However, the first solution is quite suitable with some tweaks. You don't need to query the database a second time with the same parameters; instead, the server should send the TOTAL_COUNT with each request and the value should be cached.
If the count hasn't changed, there is no extra load on the database because of the caching.

Improve results of a "search" input field?

I have a database with 20,000 records. Each record has a name. When a user wants to view a record, he can visit a web app and type the name of the record in an input field. While typing, results from the database are shown/filtered, matching what the user typed. I would like to know the basic architecture/concepts for how to program this.
I'm using the following language stack:
frontend: html5/javascript (+ajax to make instant calls while user is typing)
backend: java + jdbc to connect to simple sql database
My initial idea is:
A user types text
Whenever a character is entered or removed in the inputfield, make an ajax request to the backend
The backend does a LIKE %input% query on the name field in the database
All data found by the query is sent as a JSON string to the frontend
The frontend processes the json string and displays whatever results it finds
My two concerns are the high number of AJAX requests to process, in conjunction with the possibly very heavy LIKE queries. What are ways to optimize this? Only search after every two characters typed/removed? Only query for the first ten results?
Do you know of websites that utilise these optimizations?
NOTE: assume the records are persons and names are like real people names, so some names are more common than others.
You can choose the SPA approach: load all 20,000 names/ids to the client side and then filter them in memory. It's supposed to be the fastest way, with minimal load on the database and backend.
Here are possible solutions:
Restrict search to prefix search - LIKE 'prefix%' can be executed efficiently using a BTREE-type index (see the sketch after this list).
Measure the performance of the naive LIKE '%str%' solution - if you are working on a B2B application, the database will likely keep that table in memory and run the queries fast enough.
Look at the documentation for your database - there could be special features for this, such as an inverted index.
As #Stepan Novikov suggested, load your data in memory and search it manually.
Use specialized search indexers like SOLR or ElasticSearch (likely overkill for only 20k records).
If you are feeling ninja, implement your own N-gram index.
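A minimal sketch of the first option (prefix search) over JDBC; the person table, name column, and the LIMIT syntax are assumptions that depend on your schema and database:
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

// ...
// The wildcard only at the end lets a BTREE index on "name" be used.
String sql = "SELECT name FROM person WHERE name LIKE ? ORDER BY name LIMIT 10";
List<String> suggestions = new ArrayList<>();
try (PreparedStatement ps = connection.prepareStatement(sql)) {
    ps.setString(1, userInput + "%"); // e.g. "Joh" -> "Joh%"
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            suggestions.add(rs.getString("name"));
        }
    }
}
// return "suggestions" as a JSON array to the frontend
Capping the result set at ten rows also addresses the "only query for the first ten results" idea from the question.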

Merging Hibernate Search results with relational database query

I have a complex query that requires a full-text search on some fields and basic restrictions on other fields. Hibernate Search documentation strongly advises against adding database query restrictions to a full text search query and instead recommends putting all of the necessary fields into the full-text index. The problem I have with that is that the other fields are volatile; values can change every minute or so and those updates to the database may occur outside of the JVM doing the search, so there is a high likelihood that the local Lucene index would be out of date with respect to those fields.
Looking for strategy recommendations here. The best I've come up with so far is to join the results manually by first executing the database query (fetching only object IDs), then executing the full-text search, and somehow efficiently filtering the Lucene results by the set of object IDs from the database. Of course, I don't know how many results I'll get from each separate query, so I'm worried about performance and memory. It could be tens of thousands of rows apiece in the worst case.
I am quite interested in other ideas for this as we have a very similar scenario.
We only needed to show a maximum of 50 result rows with a couple of lookups per row. We run the query against the Lucene index with the DB PK ids in the index and then pull the lookups out of the database per row. It's still performant for us.
As you seem to want to process more than a few rows and lookups, I did consider an alternative: timestamp any DB row updates. This would allow you to query the DB for stale index entries and then iteratively reindex the related documents.
I have the same problem and do a separate Lucene and criteria query. If I do the criteria query first, I use the resulting ids to apply a custom IdFilter to the Lucene search, which checks whether a result is in the id collection from the first query. However, this approach does not scale well, because in my case the number of results after the first query might also be huge, and the filter is limited to 1024 ids. I did not find a good solution, but I change the order of my two queries depending on the number of expected results: the first query should be the one which filters out most of the results.
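For reference, restricting a Lucene query to a set of ids can be sketched as below (recent Lucene API, 5.3+, is assumed; the "id" field name is an assumption, and BooleanQuery's default limit of 1024 clauses is the cap mentioned above):
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// ...
// Collect the ids returned by the criteria/database query into an id restriction.
BooleanQuery.Builder idFilter = new BooleanQuery.Builder();
for (Long id : idsFromDatabaseQuery) {
    idFilter.add(new TermQuery(new Term("id", id.toString())), BooleanClause.Occur.SHOULD);
}

// Documents must match the full-text query AND one of the ids.
Query combined = new BooleanQuery.Builder()
        .add(fullTextLuceneQuery, BooleanClause.Occur.MUST)
        .add(idFilter.build(), BooleanClause.Occur.FILTER)
        .build();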
You can do a scheduled index update based on the last modified date.

Keeping query statistics using lucene

I am developing a search component of a web application using Lucene. I would like to save the user queries to an index and use them to suggest alternate queries to users, and to keep query statistics (most often used queries, top scoring queries, ...).
To use this data for alternate query suggestions, I would analyze the queries to see which terms are most often used with one another and use that to create a suggestion to the user.
But I can't figure out in which form to index the data. I was thinking of simply adding the queries into the index, but in that way there could be a lot of redundant data since many documents in the index would have the same content. Does anyone have any ideas about the way this can be accomplished?
Thanks for the help.
"I was thinking of simply adding the queries into the index, but in that way there could be a lot of redundant data since many documents in the index would have the same content"
You can tell Lucene not to store document content, which means that the principal overhead will be the unique Terms, and the index itself. So, it might not be a large overhead to store each query as a unique Document...this way you will not be throwing away any information.
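A sketch of that idea with the Lucene 4+ API (the "query" field name and the writer instance are assumptions):
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;

// ...
// Index the query text without storing it: it stays searchable and countable,
// but only the inverted index (unique terms + postings) grows.
Document doc = new Document();
doc.add(new TextField("query", userQueryString, Field.Store.NO));
queryIndexWriter.addDocument(doc);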
First, I believe that you should store the queries separately from the existing index. The problem is not redundant data but rather "watering down" your index - storing the queries in the same index may harm the relevance of your searches. Some options for this are:
Use a separate Lucene index.
Use Solr, with two separate cores, one for the documents and the other for the queries.
Use a query log. Store scores with the queries. Build query statistics using post-processing. As this is a web application, you can probably use your servlet container's logs (such as Tomcat's) for this.
Second, Auto-Suggest From Popular Queries Using EdgeNGrams suggests an alternative implementation of query suggestion using Solr.
