Different lucene search results using different search space size - java

I have an application that uses Lucene for searching. The search space is in the thousands of documents. Searching against these thousands, I get only a few results, around 20 (which is OK and expected).
However, when I reduce my search space to just those 20 entries (i.e. I indexed only those 20 entries and disregarded everything else, so that development would be easier), I get the same 20 results but in a different order (and with different scoring).
I tried disabling the norm factors via Field#setOmitNorms(true), but I still get different results.
What could be causing the difference in the scoring?
Thanks

Please see the scoring documentation in Lucene's Similarity API. My bet is on the difference in idf between the two cases (both numDocs and docFreq are different). To know for sure, use the explain() method to debug the scores.
Edit: A code fragment for getting explanations:
TopDocs hits = searcher.search(query, searchFilter, max);
ScoreDoc[] scoreDocs = hits.scoreDocs;
for (ScoreDoc scoreDoc : scoreDocs) {
    String explanation = searcher.explain(query, scoreDoc.doc).toString();
    Log.debug(explanation);
}

Scoring depends on all the documents in the index:
In general, the idea behind the Vector Space Model (VSM) is the more times a query term appears in a document relative to the number of times the term appears in all the documents in the collection, the more relevant that document is to the query.
Source: Apache Lucene - Scoring
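For reference, the classic Lucene similarity (TFIDFSimilarity / the old DefaultSimilarity) computes idf roughly as sketched below (shown for illustration only, not taken from your code); both numDocs and docFreq shrink when you re-index only the 20 documents, so every score changes even though the same documents match:
// Classic Lucene idf, for illustration: idf(t) = 1 + log(numDocs / (docFreq + 1))
static float idf(long docFreq, long numDocs) {
    return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
}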

Related

Hibernate search fuzzy more than 2

I have a Java backend with hibernate, lucene and hibernate-search. Now I want to do a fuzzy query, BUT instead of 0, 1, or 2, I want to allow more "differences" between the query and the expected result (to compensate, for example, for misspellings in long words). Is there any way to achieve this? The maximum number of allowed differences will later be calculated from the length of the query.
What I want this for is an autocomplete search with correction of wrong letters. This autocomplete should only search for missing characters BEHIND the given query, not in front of it. If characters in front of the query are missing compared to the entry, they should be counted as differences.
Examples:
Maximum allowed different characters in this example is 2.
fooo should match
fooo (no difference)
fooobar (only characters added -> autocomplete)
fouubar (characters added and misspelled -> autocomplete and spelling correction)
fooo should NOT match
barfooo (we only allow additional characters behind the query, but this example is less important)
fuuu (more than 2 differences)
This is my current code for the query:
FullTextEntityManager fullTextEntityManager = this.sqlService.getFullTextEntityManager();
QueryBuilder queryBuilder = fullTextEntityManager.getSearchFactory()
        .buildQueryBuilder()
        .forEntity(MY_CLASS.class)
        .overridesForField("name", "foo")
        .get();
Query query = queryBuilder.keyword()
        .fuzzy()
        .withEditDistanceUpTo(2)
        .onField("name")
        .matching("QUERY_TO_MATCH")
        .createQuery();
FullTextQuery fullTextQuery = fullTextEntityManager.createFullTextQuery(query, MY_CLASS.class);
List<MY_CLASS> results = fullTextQuery.getResultList();
Notes:
1. I use org.apache.lucene.analysis.ngram.EdgeNGramFilterFactory for indexing, but that should not make any difference.
2. This is using a custom framework which is not open source. You can just ignore the sqlService; it only provides the FullTextEntityManager and handles all the things around hibernate which do not require custom code each time.
3. This code does already work, but only with withEditDistanceUpTo(2), which means a maximum of 2 "differences" between QUERY_TO_MATCH and the matching entry in the database or index. Missing characters also count as differences.
4. withEditDistanceUpTo(2) does not accept values greater than 2.
Does anyone have any ideas to achieve that?
I am not aware of any solution where you would specify an exact number of changes that are allowed.
That approach has serious drawbacks, anyway: what does it mean to match "foo" with up to 3 changes? Just match anything? As you can see, a solution that works with varying term lengths might be better.
One solution is to index n-grams. I'm not talking about edge-ngrams, like you already do, but actual ngrams extracted from the whole term, not just the edges. So when indexing 2-grams of foooo, you would index:
fo
oo (occurring multiple times)
And when querying, the term fouuu would be transformed to:
fo
ou
uu
... and it would match the indexed document, since they have at least one term in common (fo).
Obviously there are some drawbacks. With 2-grams, the term fuuuu wouldn't match foooo, but the term barfooo would, because they have a 2-gram in common. So you would get false positives. The longer the grams, the less likely you are to get false positives, but the less fuzzy your search will be.
You can make these false positives go away by relying on scoring and on a sort by score to place the best matches first in the result list. For example, you could configure the ngram filter to preserve the original term, so that fooo will be transformed to [fooo, fo, oo] instead of just [fo, oo], and thus an exact search of fooo will have a better score for a document containing fooo than for a document containing barfooo (since there are more matches). You could also set up multiple separate fields: one without ngrams, one with 3-grams, one with 2-grams, and build a boolean query with one should clause per field: the more clauses are matched, the higher the score will be, and the higher you will find the document in the hits.
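As a rough sketch of that multi-field idea, reusing the queryBuilder from the question (the field names name_exact, name_3gram and name_2gram are hypothetical and would each need their own analyzer):
Query query = queryBuilder.bool()
        .should(queryBuilder.keyword().onField("name_exact").matching("fooo").createQuery())
        .should(queryBuilder.keyword().onField("name_3gram").matching("fooo").createQuery())
        .should(queryBuilder.keyword().onField("name_2gram").matching("fooo").createQuery())
        .createQuery();
Each keyword() clause analyzes the input with its field's analyzer, so the 2-gram and 3-gram fields see the query term broken into grams, and documents matching more clauses score higher.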
Also, I'd argue that fooo and similar are really artificial examples and you're unlikely to have these terms in a real-world dataset; you should try whatever solution you come up with against a real dataset and see if it works well enough. If you want fuzzy search, you'll have to accept some false positives: the question is not whether they exist, but whether they are rare enough that users can still easily find what they are looking for.
In order to use ngrams, apply the n-gram filter using org.apache.lucene.analysis.ngram.NGramFilterFactory. Apply it both when indexing and when querying. Use the parameters minGramSize/maxGramSize to configure the size of ngrams, and keepShortTerm (true/false) to control whether to preserve the original term or not.
You may keep the edge-ngram filter or not; see if it improves the relevance of your results. I suspect it may improve relevance slightly if you use keepShortTerm = true. In any case, make sure to apply the edge-ngram filter before the ngram filter.
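For example, an analyzer definition could look like the sketch below (assuming Hibernate Search 5.x annotation mapping; the analyzer name and the 2-3 gram sizes are arbitrary, and the exact parameter for preserving the original term depends on your Lucene version):
import javax.persistence.Entity;
import javax.persistence.Id;
import org.apache.lucene.analysis.core.LowerCaseFilterFactory;
import org.apache.lucene.analysis.ngram.NGramFilterFactory;
import org.apache.lucene.analysis.standard.StandardTokenizerFactory;
import org.hibernate.search.annotations.Analyzer;
import org.hibernate.search.annotations.AnalyzerDef;
import org.hibernate.search.annotations.Field;
import org.hibernate.search.annotations.Indexed;
import org.hibernate.search.annotations.Parameter;
import org.hibernate.search.annotations.TokenFilterDef;
import org.hibernate.search.annotations.TokenizerDef;

// Illustrative only: 2- to 3-grams over lowercased tokens, applied to the "name" field.
@Entity
@Indexed
@AnalyzerDef(name = "ngramAnalyzer",
    tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
    filters = {
        @TokenFilterDef(factory = LowerCaseFilterFactory.class),
        @TokenFilterDef(factory = NGramFilterFactory.class, params = {
            @Parameter(name = "minGramSize", value = "2"),
            @Parameter(name = "maxGramSize", value = "3")
        })
    })
public class MyEntity {

    @Id
    private Long id;

    // the same analyzer is used at query time, so the query term is n-grammed as well
    @Field(analyzer = @Analyzer(definition = "ngramAnalyzer"))
    private String name;
}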
OK, my friend and I found a solution.
We found a request in the Lucene changelog asking for the same feature, and we implemented a solution:
There is a SlowFuzzyQuery in Lucene's sandbox module. It is slower (obviously) but supports an edit distance greater than 2.
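For reference, a minimal sketch of that approach (assuming the lucene-sandbox jar matching your Lucene version is on the classpath; SlowFuzzyQuery is deprecated and was removed in later Lucene releases):
import org.apache.lucene.index.Term;
import org.apache.lucene.sandbox.queries.SlowFuzzyQuery;

// A value >= 1 is interpreted as an absolute edit distance rather than a 0..1 similarity,
// so this allows up to 3 edits on the "name" field.
SlowFuzzyQuery query = new SlowFuzzyQuery(new Term("name", "fooobar"), 3f);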

Lucene BooleanQuery with multiple FuzzyQuery is too slow

Each document holds the employee data of a company, with multiple fields like empName, empId, departmentId, etc.
Using a custom analyzer I have indexed around 4 million documents.
The search query has a list of employee names, and it is known that all employees in the list belong to the same department. There are multiple departments in the company.
So I want to do a fuzzy search on all the employee names under a given department id.
For this I am using boolean query which looks like:
Query termQuery = new TermQuery(new Term("departmentId", "1234"));

BooleanQuery.Builder bld = new BooleanQuery.Builder();
for (String str : employeeNameList) {
    bld.add(new FuzzyQuery(new Term("name", str)), BooleanClause.Occur.SHOULD);
}
BooleanQuery bq = bld.build();

BooleanQuery finalBooleanQuery = new BooleanQuery.Builder()
        .add(termQuery, BooleanClause.Occur.MUST)
        .add(bq, BooleanClause.Occur.MUST)
        .build();
Now I pass finalBooleanQuery to the search method of IndexSearcher and get the results.
The problem is that it takes too much time: when the size of employeeNameList is more than 50, the search takes around 500 ms.
How can I reduce the time from 500 ms to 50 ms?
Is there any other solution to this problem?
If you take a look at the other constructors for FuzzyQuery, you'll see some easy ways to improve performance. Each additional argument is there for you to reduce the amount of work the FuzzyQuery is going to do, and so improve performance.
First, and most important:
Prefix length: I strongly recommend setting this to a non-zero value. This is how many characters at the beginning of the term will not be subject to fuzzy matching. So, if searching for "abc" with a prefix of 1, "abb" and "acc" would be matched, but not "bbc". This allows lucene to work with the index when attempting to find matching terms, instead of having to scan the whole term dictionary. It's likely you will see the largest performance improvement here. Many seem to find 2 to be a good balance point between performance and meeting search demands.
The rest of the available arguments can also help:
maxEdits - 2 is the default, and the maximum. Setting this to 1 will match fewer terms and, as such, work faster.
maxExpansions - Under the hood, this query finds terms that match the fuzzy parameters, then performs a search for those terms. If you are searching for short terms, especially, this list of matching terms could turn out to be very long. Setting maxExpansions will prevent these extremely long lists of matches from occurring. Default is 50.
transpositions - Whether swapping two characters counts as a single edit. Default is true. Basically, this is the difference between Levenshtein and Damerau-Levenshtein distance. false means less work and fewer matches, so it will perform better. I don't know if the difference will be that big, though.
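Putting that together with the loop from the question, the full FuzzyQuery constructor could be used like this (a sketch; the exact values are only a starting point for tuning):
BooleanQuery.Builder bld = new BooleanQuery.Builder();
for (String str : employeeNameList) {
    // maxEdits = 2, prefixLength = 2, maxExpansions = 50, transpositions = true
    bld.add(new FuzzyQuery(new Term("name", str), 2, 2, 50, true), BooleanClause.Occur.SHOULD);
}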

Fuzzy String matching of Strings in Java

I have a very large list of Strings stored in a NoSQL DB. The incoming query is a string, and I want to check if this String is in the list or not. For an exact match this is very simple: the NoSQL DB can have the String as the primary key and I just check whether there is any record with that string as the primary key. But I need to check for fuzzy matches as well.
One approach is to traverse every String in the list and check the Levenshtein distance of the input String against each String in the list, but this approach has O(n) complexity, and the list is very large (10 million entries) and may grow further. This approach would result in high latency for my solution.
Is there a better way to solve this problem?
Fuzzy matching is complicated for the reasons you have discovered. Calculating a distance metric for every combination of search term against database term is impractical for performance reasons.
The solution to this is usually to use an n-gram index. This can either be used standalone to give a result, or as a filter to cut down the size of possible results so that you have fewer distance scores to calculate.
So basically, if you have the word "stack" you break it into n-grams (commonly trigrams), such as "sta", "tac" and "ack". You index those in your database against the database row. You then do the same for the input and look for the database rows that have matching n-grams.
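As an illustration of that step, here is a plain Java sketch of extracting character n-grams (independent of any particular database):
import java.util.ArrayList;
import java.util.List;

class NGramUtil {
    // Returns the character n-grams of a term, e.g. trigrams of "stack" -> [sta, tac, ack].
    static List<String> ngrams(String term, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= term.length(); i++) {
            grams.add(term.substring(i, i + n));
        }
        return grams;
    }
}
Index the grams of each stored String against its row, generate the grams of the incoming query the same way, and only compute Levenshtein distance for rows that share enough grams.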
This is all complicated, and your best option is to use an existing implementation such as Lucene/Solr which will do the n-gram stuff for you. I haven't used it myself as I work with proprietary solutions, but there is a stackoverflow question that might be related:
Return only results that match enough NGrams with Solr
Some databases seem to implement n-gram matching. Here is a link to a Sybase page that provides some discussion of that:
Sybase n-gram text index
Unfortunately, a full discussion of n-grams would be a long post and I don't have the time. It is probably discussed elsewhere on Stack Overflow and other sites; I suggest Googling the term and reading up on it.
First of all, if searching is what you're doing, then you should use a search engine (ElasticSearch is pretty much the default). They are good at this and you are not re-inventing wheels.
Second, the technique you are looking for is called stemming. Along with the original String, save a normalized string in your DB. Normalize the search query with the same mechanism. That way you will get much better search results. Obviously, this is one of the techniques a search engine uses under the hood.
Solr (or Lucene) could be a suitable solution for you:
Lucene supports fuzzy searches based on the Levenshtein Distance, or Edit Distance algorithm. To do a fuzzy search use the tilde, "~", symbol at the end of a Single word Term. For example to search for a term similar in spelling to "roam" use the fuzzy search:
roam~
This search will find terms like foam and roams.
Starting with Lucene 1.9 an additional (optional) parameter can specify the required similarity. The value is between 0 and 1, with a value closer to 1 only terms with a higher similarity will be matched. For example:
roam~0.8
https://lucene.apache.org/core/2_9_4/queryparsersyntax.html

How to improve the performance when working with wikipedia data and huge no. of webpages?

I am supposed to extract representative terms from an organisation's website using wikipedia's article-link data dump.
To achieve this I've -
Crawled & downloaded the organisation's webpages (~110,000).
Created a dictionary of wikipedia IDs and terms/titles (~40 million records).
Now, I'm supposed to process each of the webpages using the dictionary to recognise terms and track their term IDs & frequencies.
For the dictionary to fit in memory, I've split the dictionary into smaller files. Based on my experiment with a small data set, the processing time for the above will be around 75 days.
And this is just for 1 organisation. I have to do the same for more than 40 of them.
Implementation -
A HashMap for storing the dictionary in memory.
Looping through each map entry to search for the term in a webpage, using a Boyer-Moore search implementation.
Repeating the above for each webpage, and storing the results in a HashMap.
I've tried optimizing the code and tuning the JVM for better performance.
Can someone please advise on a more efficient way to implement the above, reducing the processing time to a few days.
Is Hadoop an option to consider?
Based on your question:
Number of documents = 110,000
Dictionary => list of [TermID, Title Terms] = 40 million entries
Size of documents = 110,000 * ~256KB per document on average ≈ 26.9GB
Size of dictionary = 40 million * 256 bytes per entry on average ≈ 9.5GB of raw data
How did you arrive at the 75 days estimate?
There are a number of questions that affect performance:
How are you storing the documents?
How are you storing/retrieving the dictionary? (Assuming not all of it is in memory, unless you can afford that.)
How many machines are you running it on?
Are you performing the dictionary lookups in parallel? (Of course, assuming the dictionary is immutable once you have already processed the whole of Wikipedia.)
Here is an outline of what I believe you are doing:
dictionary = read Wikipedia dictionary
documents  = a sequence of documents

documents.map { doc =>
  var docTermFreq = Map[String, Int]()
  for (term <- doc.terms if dictionary.contains(term)) {
    docTermFreq = docTermFreq + (term -> (docTermFreq.getOrElse(term, 0) + 1))
  }
  // store the docTermFreq map
}
What this essentially does is break each document up into tokens and then look up each token in the wikipedia dictionary.
This is exactly what a Lucene Analyzer does.
A Lucene Tokenizer converts a document into tokens. This happens before the terms are indexed into Lucene. So all you have to do is implement an Analyzer that looks up the Wikipedia dictionary to decide whether or not a token is in the dictionary.
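A minimal sketch of such an Analyzer (assuming a recent Lucene; package locations and constructors differ slightly across versions), using the stock KeepWordFilter to drop every token that is not in the dictionary:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.miscellaneous.KeepWordFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class WikipediaTermAnalyzer extends Analyzer {
    private final CharArraySet dictionary; // terms loaded from the Wikipedia dump

    public WikipediaTermAnalyzer(CharArraySet dictionary) {
        this.dictionary = dictionary;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();
        TokenStream stream = new LowerCaseFilter(source);
        // keep only tokens that are known Wikipedia terms, drop everything else
        stream = new KeepWordFilter(stream, dictionary);
        return new TokenStreamComponents(source, stream);
    }
}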
I would do it like this:
Take every document and prepare a token stream ( using an Analyzer described above )
Index the document terms.
At this point you will have wikipedia terms only, in the Lucene Index.
When you do this, you will have ready-made statistics from the Lucene Index such as:
Document Frequency of a Term
TermFrequencyVector ( exactly what you need )
and a ready-to-use inverted index! (For a quick introduction, see any material on inverted indexes and retrieval.)
There are lot of things you can do to improve the performance. For example:
Parallelize the document stream processing.
You can store the dictionary in a key-value database such as BerkeleyDB or Kyoto Cabinet, or even in an in-memory key-value store such as Redis or Memcached.
I hope that helps.
One way that uses only MapReduce (MR) is:
Assuming you already have N dictionaries of smaller size that fit in memory, you can:
Launch N "map only" jobs that scan all your data (each one with only one dictionary) and output something like {pageId, termId, occurrence, etc.} to the folder /your_tmp_folder/N/.
As a result you will have N*M files, where M is the number of mappers in each stage (it should be the same).
Then a second job will simply analyze your {pageId, termId, occurrence, etc.} objects and build stats per page id.
Map-only jobs should be very fast in your case. If not, please paste your code.
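A rough sketch of such a map-only job (Hadoop MapReduce; it assumes each input line is "pageId<TAB>pageText", that one dictionary shard fits in memory, and it leaves out loading the shard in setup()):
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TermLookupMapper extends Mapper<LongWritable, Text, Text, Text> {
    // term -> termId for one dictionary shard; populated in setup()
    private final Map<String, Long> dictionaryShard = new HashMap<>();

    @Override
    protected void setup(Context context) {
        // load the shard here, e.g. from the distributed cache (omitted)
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] parts = line.toString().split("\t", 2);
        if (parts.length < 2) {
            return;
        }
        String pageId = parts[0];
        for (String token : parts[1].toLowerCase().split("\\W+")) {
            Long termId = dictionaryShard.get(token);
            if (termId != null) {
                // emit {pageId, termId, occurrence}; a second job aggregates per page
                context.write(new Text(pageId), new Text(termId + "\t1"));
            }
        }
    }
}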

Lucene 4 Pagination

I am using Lucene 4.2 and am implementing result pagination.
IndexSearcher.searchAfter provides an efficient way of implementing "next page" functionality but what is the best way to go about implementing "previous page" or even "go to page" functionality? There is no IndexSearcher.searchBefore for example.
I was considering determining the total number of pages given the page size and keeping a ScoreDoc[] array to track the "after" ScoreDoc for each page (the array would be populated as results are paged in). This would allow me to use the "closest" ScoreDoc for use in IndexSearcher.searchAfter (or null in the worst case).
Does this make sense? Is there a better approach?
I've been using Lucene 4.8 and have been working on a REST interface which includes pagination.
My solution has been to use a TopScoreDocCollector and call the topDocs(int startIndex, int numberOfHits) method. The start index is calculated by multiplying the zero-based page number by the number of hits per page.
...
DirectoryReader reader = DirectoryReader.open(MMapDirectory.open(new java.io.File(indexFile)));
IndexSearcher searcher = new IndexSearcher(reader);
TopScoreDocCollector collector = TopScoreDocCollector.create(MAX_RESULTS, true); // MAX_RESULTS is just an int limiting the total number of hits
int startIndex = (page - 1) * hitsPerPage; // our page is 1-based, so convert to zero-based
Query query = new QueryParser(Version.LUCENE_48, "All", analyzer).parse(searchQuery);
searcher.search(query, collector);
TopDocs hits = collector.topDocs(startIndex, hitsPerPage);
...
So my REST interface accepts the page number and the number of hits per page as parameters.
Going forward or back is then as simple as submitting a new request with the appropriate value for the page.
I agree with the solution explained by Jaimie. But I want to point out another aspect you have to be aware of, which helps in understanding the general mechanism of a search engine.
With the TopScoreDocCollector you can define how many hits you want to collect that match your search query, before the result is sorted by score or other sort criteria.
See the following example:
collector = TopScoreDocCollector.create(9999, true);
searcher.search(parser.parse("Clone Warrior"), collector);
// get first page
topDocs = collector.topDocs(0, 10);
int resultSize=topDocs.scoreDocs.length; // 10 or less
int totalHits=topDocs.totalHits; // 9999 or less
We tell Lucene here to collect a maximum of 9999 documents containing the search phrase 'Clone Warrior'. This means that if the index contains more than 9999 documents containing this search phrase, the collector will stop after it has been filled up with 9999 hits!
This means that the greater you make MAX_RESULTS, the better your search results become. But this is only relevant if you expect a large number of hits.
On the other side, if you search for "luke skywalker" and expect only one hit, then MAX_RESULTS can also be set to 1.
So changing MAX_RESULTS can influence the returned scoreDocs, since the sorting is performed on the collected hits. In practice, set MAX_RESULTS to a size large enough that a human user cannot reasonably claim to be missing a specific document. This concept is totally contrary to the behaviour of a SQL database, which always considers the complete data pool.
But Lucene also supports another mechanism. Instead of defining MAX_RESULTS for the collector, you can alternatively define the amount of time you want to wait for the result set. For example, you can define that you always want to stop the collector after 300 ms. This is a good approach to protect your application from performance issues. But if you want to make sure that you count all relevant documents, then you have to set MAX_RESULTS or the maximum wait time to an effectively unlimited value.
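The time-based variant can be sketched with Lucene's TimeLimitingCollector wrapped around the collector from the example above (searcher, query and MAX_RESULTS as in the earlier snippet; the 300 ms budget is only approximate because of the timer resolution):
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.TimeLimitingCollector;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.util.Counter;

TopScoreDocCollector collector = TopScoreDocCollector.create(MAX_RESULTS, true);
Counter clock = TimeLimitingCollector.getGlobalCounter();
Collector timeLimited = new TimeLimitingCollector(collector, clock, 300); // stop after ~300 ms
try {
    searcher.search(query, timeLimited);
} catch (TimeLimitingCollector.TimeExceededException e) {
    // partial results: collector.topDocs(...) still returns whatever was gathered in time
}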
I am using Lucene 8.2.0. I have implemented paging using indexSearcher.searchAfter() as shown below. searchAfter() takes a ScoreDoc as its first parameter, so I need to create a ScoreDoc object. To create it you need to keep three things from the last ScoreDoc of the previous page's results: 1. doc, 2. score, 3. shardIndex. These are used to create the ScoreDoc object.
ScoreDoc scoreDoc = new ScoreDoc(53, 2.4933066f,0);
TopDocs hits3 = indexSearcher.searchAfter(scoreDoc,query3,10);
I have also used the above-mentioned answer, and it works fine using TopScoreDocCollector, but the performance of indexSearcher.searchAfter is 3 to 4 times better than the TopScoreDocCollector approach.
