I have a Solr index where each record is a page from a file. So for every record we have the full text, the page number and the file ID.
When we do a search, often a single file will overwhelm the results as it contains the search term repeatedly.
What I would like to do is to have the search query only return a maximum of two hits per document and then offer the user a "see more hits from this document" which would do another, more limited query. I.e. similar to how Google will only show you a handful of results from any given domain, with the option of seeing more from each.
Is there any way to structure a Solr query to accomplish this?
Which Solr version are you using? If it's 4.0 (i.e. a nightly build), then you can use collapsing on the filename field.
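For illustration, here is a rough SolrJ sketch of result grouping (field collapsing), assuming a Solr/SolrJ version with grouping support (3.3+/4.x); the URL and the fileId/text field names are placeholders, not your actual schema:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.Group;
import org.apache.solr.client.solrj.response.GroupCommand;
import org.apache.solr.client.solrj.response.QueryResponse;

public class GroupedSearch {
    public static void main(String[] args) throws Exception {
        // URL and field names (fileId, text) are assumptions; adjust to your schema.
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");

        SolrQuery query = new SolrQuery("text:searchterm");
        query.set("group", true);           // turn on result grouping (field collapsing)
        query.set("group.field", "fileId"); // collapse on the file a page belongs to
        query.set("group.limit", 2);        // at most two pages per file in the results

        QueryResponse rsp = solr.query(query);
        for (GroupCommand cmd : rsp.getGroupResponse().getValues()) {
            for (Group group : cmd.getValues()) {
                // group.getResult() holds up to group.limit pages; the total count can
                // drive a "see more hits from this document" link that re-queries with
                // fq=fileId:<value>.
                System.out.println("File " + group.getGroupValue() + ": "
                        + group.getResult().getNumFound() + " total hits");
            }
        }
    }
}

The same parameters can also be sent directly on the query string (group=true&group.field=fileId&group.limit=2) if you are not using SolrJ.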
What are some examples of efficiently searching through a directory as you're typing a person's name?
Say, for example, we have a database with 1 million users. We start typing in the search box: "sea", and it should display every user's name containing "sea" in a scrollable window (kind of like searching through a Skype directory). After changing a letter, the window should update immediately. All of this comes from a SQL database. What are a few efficient libraries or algorithms that can do this without much delay?
First consider changing the task from "name contains substring" to "name starts with substring". If this is possible, then add an index on the name column of your database table and use the query:
select name from table where name like :1 || '%'
Limit the number of returned rows using DBMS-specific syntax, for example, for Oracle add
and rownum < 20
This query should return your rows pretty fast.
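If it helps, here is a minimal JDBC sketch of that approach; the "users" table and "name" column are placeholder identifiers, and the SQL is the Oracle form shown above so the prefix LIKE can use the index:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

public class NameSuggester {

    // "users" and "name" are placeholder identifiers; adapt to your schema.
    public static List<String> suggest(Connection con, String prefix) throws Exception {
        String sql = "select name from users where name like ? || '%' and rownum < 20";
        List<String> names = new ArrayList<>();
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setString(1, prefix);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    names.add(rs.getString("name"));
                }
            }
        }
        return names;
    }
}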
If you really need "contains substring", then decide whether you want the search to be handled by the database or by an external text-indexing solution.
For a database-contained solution you'll have to use a different approach depending on the DBMS. Each of these solutions requires configuration steps not described here.
For Oracle you can use Oracle Text, see
http://www.oracle.com/technetwork/documentation/index-098492.html
The query will look like
select name from table where contains(name, :1) > 0
For Postgres you can use Full Text Search.
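As a rough sketch of the Postgres route from JDBC (again with placeholder table/column names; a GIN index on to_tsvector('simple', name) is what keeps it fast):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

public class PostgresNameSearch {

    // "users" and "name" are placeholder identifiers; adapt to your schema.
    public static List<String> search(Connection con, String term) throws Exception {
        String sql = "select name from users "
                   + "where to_tsvector('simple', name) @@ plainto_tsquery('simple', ?) "
                   + "limit 20";
        List<String> names = new ArrayList<>();
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setString(1, term);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    names.add(rs.getString("name"));
                }
            }
        }
        return names;
    }
}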
You can also use a solution that is not dependent on the database, for example, see Apache Solr:
http://lucene.apache.org/solr/
For example:
SELECT name
FROM Table
WHERE name LIKE '%sea%'
I have 9 million products in the DB and I am trying to use Hibernate Search to suggest products when a user starts typing a product name in the search box on a website. It is an autocomplete feature. I have implemented a web service that gets the possible suggestions from the dataset using Hibernate Search.
The dataset has two fields:
Product Name
Url link
The data is stored in MySQL. I am using org.hibernate.search.jpa.FullTextQuery for the search.
Issues with the results:
The number of results is too large: I get 18K+ results when I search for "intel core".
This causes a performance problem in query response time; the above search took 2 seconds.
Is there a way to reduce the search results on my dataset for a better query response time?
Use setMaxResults()
Doc here: https://docs.jboss.org/hibernate/orm/3.5/javadocs/org/hibernate/Query.html#setMaxResults(int).
You may want to tweak your query/data to get the "most relevant" records returned.
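For example, a sketch with org.hibernate.search.jpa.FullTextQuery; the "productName" field name is an assumption, and the mapped entity class is passed in by the caller:

import java.util.List;
import javax.persistence.EntityManager;

import org.hibernate.search.jpa.FullTextEntityManager;
import org.hibernate.search.jpa.FullTextQuery;
import org.hibernate.search.jpa.Search;
import org.hibernate.search.query.dsl.QueryBuilder;

public class ProductSuggester {

    // "productName" is an assumed indexed field on your mapped entity.
    public static List<?> suggest(EntityManager em, Class<?> productClass, String userInput) {
        FullTextEntityManager ftem = Search.getFullTextEntityManager(em);

        QueryBuilder qb = ftem.getSearchFactory()
                .buildQueryBuilder().forEntity(productClass).get();
        org.apache.lucene.search.Query luceneQuery =
                qb.keyword().onField("productName").matching(userInput).createQuery();

        FullTextQuery query = ftem.createFullTextQuery(luceneQuery, productClass);
        query.setMaxResults(10); // only the top 10 suggestions are ever fetched
        return query.getResultList();
    }
}

Since Lucene returns hits sorted by relevance by default, capping the result size this way pairs naturally with showing only the "most relevant" records.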
Actually, there are more than 12,000 records fetched from the database and displayed page-wise using pagination. Now I have a search box in the UI that searches across all of those records (roughly 12,000) on all pages, but searching through this many records takes some time.
Could you please help me make this search faster?
Consider these options:
Instead of searching after you have fetched all of them, write a smarter query that reduces the number of candidates. Maybe your situation is simple enough that the query itself can do the full search?
Multithreaded searching.
Split your 12,000 records into as many blocks as there are available processor cores (using Runtime.getRuntime().availableProcessors()) and launch a thread for each block to do the job; see the sketch after this list.
If your search is expensive per object, see whether there is a cheaper search method, perhaps one that looks only at a couple of important fields. Make sure that job can be done quickly. In another thread, you could do the deeper search and add results as they are found.
This option is rather hard to do, but you could implement a technique that searches for candidates while the user is still typing the search words. Every few characters they type, filter the current set of candidates. This could take you from 12,000 candidates down to maybe 4,000, and after the next few characters only 100 are left, etc. This, of course, depends on the situation.
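Here is a minimal sketch of the multithreaded option from the list above, assuming the 12,000 records are simple strings already loaded in memory (in real code they would be your domain objects):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSearch {

    // Split the in-memory records into one block per available core and
    // search the blocks concurrently, then merge the hits.
    public static List<String> search(List<String> records, String term) throws Exception {
        int cores = Runtime.getRuntime().availableProcessors();
        int blockSize = (records.size() + cores - 1) / cores;
        final String needle = term.toLowerCase();

        ExecutorService pool = Executors.newFixedThreadPool(cores);
        List<Future<List<String>>> futures = new ArrayList<>();

        for (int start = 0; start < records.size(); start += blockSize) {
            final List<String> block =
                    records.subList(start, Math.min(start + blockSize, records.size()));
            futures.add(pool.submit(new Callable<List<String>>() {
                public List<String> call() {
                    List<String> hits = new ArrayList<>();
                    for (String record : block) {
                        if (record.toLowerCase().contains(needle)) {
                            hits.add(record);
                        }
                    }
                    return hits;
                }
            }));
        }

        List<String> results = new ArrayList<>();
        for (Future<List<String>> f : futures) {
            results.addAll(f.get()); // wait for each block and merge its hits
        }
        pool.shutdown();
        return results;
    }
}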
There are many ways to do the search.
1) SQL search: you can use the same SQL statement that fetched the 12,000 records and append a WHERE clause to it using Java code. Here it is preferable to search on indexed fields, or you can add an index at the DB level on the searchable fields.
2) Full-text search: there are technologies that allow you to index your records with a full-text index; you can read more about this here
Our website needs to give out data to the world. This is open-source data that we have stored, and we want to make it publicly available. It's about 2 million records.
We've implemented the search of these records using Lucene, which is fine; however, we'd also like to show an individual record (say the user clicks on it after the search is done) and provide more detailed information for that record.
This more detailed information, however, isn't stored in the index directly... there are many-to-many relationships, and we use our relational database (MySQL) to provide it.
For example, a single record belongs to a category; we want the user to be able to click on that category and see the rest of the records within that category (there are lots more associations like this).
My question is: should we also use Lucene to store this sort of information and retrieve it through a simple search (category:apples), or should MySQL continue doing this job? Should I use Lucene only for the search part?
EDIT
I would like to point out that all of our records are pretty static; changes are made to this data only once a week or so.
Lucene's strength lies in rapidly building an index of a set of documents and allowing you to search over them. If this "detailed information" does not need to be indexed or searched over, then don't store it in Lucene.
Lucene is not a database, it's an index.
You want to use Lucene to store data? I think it's OK. I've used Solr (http://lucene.apache.org/solr/), which is built on top of Lucene, as a search engine that also stores extra data related to each record for front-end display. It worked with 500k records for me, and I think 2 million records should be fine.
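For illustration, a small SolrJ sketch of indexing a record together with display-only data; the URL and field names are made up, and whether a field is indexed and/or stored is controlled by your schema.xml (stored="true" indexed="false" for display-only fields):

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IndexRecord {
    public static void main(String[] args) throws Exception {
        // URL and field names are placeholders; adjust to your Solr setup and schema.
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "record-42");
        doc.addField("title", "Some record title");          // searched and displayed
        doc.addField("category", "apples");                   // enables category:apples queries
        doc.addField("detail_html", "<p>longer details</p>"); // stored only, for display

        solr.add(doc);
        solr.commit();
    }
}

Since the data changes only about once a week (per the edit above), re-indexing the denormalized documents on each update should be acceptable.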
My application indexes discussion threads. Each entry in the discussion is indexed as a separate Lucene document with a common_id field which can be used to group search hits into one discussion.
Currently, when the search is performed, if a thread has 3 entries, then 3 separate hits are returned. Even though this is correct, from the user's point of view the same entry appears in the search results multiple times.
Is there a way to tell Lucene to group its search results by the common_id field before returning them?
I believe what you are asking for is Field Collapsing, which is a feature of Solr (and I believe Elasticsearch as well).
If you want to roll your own, one possible way to do this is:
Add a "series id" field to each document that is a member of a series. You will have to ensure that this gets incremented for every new series.
Make an initial query to Lucene, and get a hit list.
For each hit, check whether it has a series id; if it does, make another query by that series id in order to retrieve all the members of the series (sketched below).
An alternative is to store the ids of all the series members in a field inside each member's document.
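A rough Lucene sketch of the roll-your-own approach (steps 2 and 3 above); the field name series_id is an assumption:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

public class SeriesLookup {

    // Fetch the rest of a series by its id; "series_id" is an assumed field name.
    public static TopDocs otherMembers(IndexSearcher searcher, String seriesId) throws Exception {
        Query bySeries = new TermQuery(new Term("series_id", seriesId));
        return searcher.search(bySeries, 100);
    }

    public static void expand(IndexSearcher searcher, Query userQuery) throws Exception {
        TopDocs hits = searcher.search(userQuery, 10);
        for (ScoreDoc hit : hits.scoreDocs) {
            String seriesId = searcher.doc(hit.doc).get("series_id");
            if (seriesId != null) {
                TopDocs series = otherMembers(searcher, seriesId);
                // render the series grouped under the original hit
                System.out.println("Series " + seriesId + " has " + series.totalHits + " entries");
            }
        }
    }
}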
There is nothing built into Lucene that collapses results based on a field. You will need to implement that yourself.
However, they've recently built this feature into Solr.
See http://www.lucidimagination.com/blog/2010/09/16/2446/
Since version 3.2, Lucene supports grouping search results based on a field.
http://lucene.apache.org/core/4_1_0/grouping/org/apache/lucene/search/grouping/package-summary.html
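For example, a sketch against the grouping module (Lucene 4.x API, as in the link above); for term grouping the common_id field needs to be indexed untokenized (or as doc values):

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.grouping.GroupDocs;
import org.apache.lucene.search.grouping.GroupingSearch;
import org.apache.lucene.search.grouping.TopGroups;
import org.apache.lucene.util.BytesRef;

public class GroupedByThread {

    // Group hits by the common_id field so each discussion appears once.
    public static void searchGrouped(IndexSearcher searcher, Query query) throws Exception {
        GroupingSearch groupingSearch = new GroupingSearch("common_id");
        groupingSearch.setGroupDocsLimit(3); // entries returned per discussion
        groupingSearch.setAllGroups(true);   // also compute the total number of discussions

        TopGroups<BytesRef> result = groupingSearch.search(searcher, query, 0, 10);
        System.out.println("Matching discussions: " + result.totalGroupCount);

        for (GroupDocs<BytesRef> group : result.groups) {
            String commonId = group.groupValue == null ? null : group.groupValue.utf8ToString();
            System.out.println("Discussion " + commonId + ": "
                    + group.totalHits + " matching entries");
        }
    }
}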