Lucene custom scoring - Java

I have a document that is already indexed. At search time I must split that document into two parts: the first part consists of the first 100 words (tokens), and the rest of the document is the second part. The two parts must be weighted differently when scoring: 70% for the second part and 30% for the first.
EDIT 2: I tried creating a searcher that uses SpanPositionRangeQuery, but I must have misunderstood how SpanQuery is used, because I can't get any hits (I used lukeall to verify that the words I was searching for were indexed). Can someone give me a hand?
public static void search(String indexDir, String q) throws Exception
{
Directory dir = FSDirectory.open(new File(indexDir), null);
IndexSearcher is = new IndexSearcher(dir);
Term term = new Term("Field", q);
SpanPositionRangeQuery spanQuery = new SpanPositionRangeQuery(new SpanTermQuery(term), 0, 100);
spanQuery.setBoost(0.3f);
// remainder of a line that is truncated in the source: CustomRomanianAnalyzer(Version.LUCENE_35));
long start = System.currentTimeMillis();
TopDocs hits = is.search(spanQuery, 10);
//TopDocs hits = is.search(query, 10);
long end = System.currentTimeMillis();
System.err.println("I found " + hits.totalHits + " documents (in " +
(end - start) + " milliseconds) '" +
q + "':");
for (int i=0;i<hits.scoreDocs.length;i++)
{
ScoreDoc scoreDoc = hits.scoreDocs[i];
Document doc = is.doc(scoreDoc.doc);
System.out.println(doc.get("filename"));
}
is.close();
}
I don't know how to combine the query parser with SpanPositionRangeQuery to get what I need...

Yes, this can be done by setting the boost for each clause in a BooleanQuery. Using separate fields will work, but isn't strictly necessary. Lucene has a SpanPositionRangeQuery suitable for searching part of a document.
<SpanPositionRangeQuery: spanPosRange(field:term, 0, 100)^0.3>
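To make the 30/70 weighting concrete, here is a minimal plain-Java model of the idea. It is not Lucene code and does not reproduce Lucene's scoring formula; the names PositionWeightedScore, countInRange and score are hypothetical. It counts term occurrences at token positions before and after a split point, which is the restriction a position-range clause applies, and then combines the two counts using the weights from the question.

```java
import java.util.Arrays;
import java.util.List;

class PositionWeightedScore {

    // Counts occurrences of term at token positions in [start, end).
    static int countInRange(List<String> tokens, String term, int start, int end) {
        int hits = 0;
        int stop = Math.min(end, tokens.size());
        for (int pos = Math.max(0, start); pos < stop; pos++) {
            if (tokens.get(pos).equals(term)) {
                hits++;
            }
        }
        return hits;
    }

    // Weighted sum: 30% for the first part, 70% for the rest (weights from the question).
    static double score(List<String> tokens, String term, int split) {
        int first = countInRange(tokens, term, 0, split);
        int rest = countInRange(tokens, term, split, tokens.size());
        return 0.3 * first + 0.7 * rest;
    }

    public static void main(String[] args) {
        // A tiny document with the split after 2 tokens instead of 100.
        List<String> tokens = Arrays.asList("a", "b", "a", "c", "a");
        System.out.println(score(tokens, "a", 2));
    }
}
```

In Lucene itself the two parts would be two boosted clauses of a BooleanQuery, as the answer describes; the model only shows how the two contributions add up.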

Related

Add weights to documents Lucene 8

I am currently working on a small search engine for college using Lucene 8. I already built it before, but without applying any weights to documents.
I am now required to add the PageRanks of documents as a weight for each document, and I already computed the PageRank values. How can I add a weight to a Document object (not query terms) in Lucene 8? I looked up many solutions online, but they only work for older versions of Lucene. Example source
Here is my (updated) code that generates a Document object from a File object:
public static Document getDocument(File f) throws FileNotFoundException, IOException {
Document d = new Document();
//adding a field
FieldType contentType = new FieldType();
contentType.setStored(true);
contentType.setTokenized(true);
contentType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
contentType.setStoreTermVectors(true);
String fileContents = String.join(" ", Files.readAllLines(f.toPath(), StandardCharsets.UTF_8));
d.add(new Field("content", fileContents, contentType));
//adding other fields, then...
//the boost coefficient (updated):
double coef = 1.0 + ranks.get(path);
d.add(new DoubleDocValuesField("boost", coef));
return d;
}
The issue with my current approach is that I would need a CustomScoreQuery object to search the documents, but this is not available in Lucene 8. Also, I don't want to downgrade now to Lucene 7 after all the code I wrote in Lucene 8.
Edit:
After some (lengthy) research, I added a DoubleDocValuesField to each document holding the boost (see updated code above), and used a FunctionScoreQuery for searching, as advised by @EricLavault. However, now every document's score is exactly its boost, regardless of the query! How do I fix that? Here is my search function:
public static TopDocs search(String query, IndexSearcher searcher, String outputFile) {
try {
Query q_temp = buildQuery(query); //the original query, was working fine alone
Query q = new FunctionScoreQuery(q_temp, DoubleValuesSource.fromDoubleField("boost")); //the new query
q = q.rewrite(DirectoryReader.open(bm25IndexDir));
TopDocs results = searcher.search(q, 10);
ScoreDoc[] filterScoreDosArray = results.scoreDocs;
for (int i = 0; i < filterScoreDosArray.length; ++i) {
int docId = filterScoreDosArray[i].doc;
Document d = searcher.doc(docId);
//here, when printing, I see that the document's score is the same as its "boost" value. WHY??
System.out.println((i + 1) + ". " + d.get("path")+" Score: "+ filterScoreDosArray[i].score);
}
return results;
}
catch(Exception e) {
e.printStackTrace();
return null;
}
}
//function that builds the query, working fine
public static Query buildQuery(String query) {
try {
PhraseQuery.Builder builder = new PhraseQuery.Builder();
TokenStream tokenStream = new EnglishAnalyzer().tokenStream("content", query);
tokenStream.reset();
while (tokenStream.incrementToken()) {
CharTermAttribute charTermAttribute = tokenStream.getAttribute(CharTermAttribute.class);
builder.add(new Term("content", charTermAttribute.toString()));
}
tokenStream.end(); tokenStream.close();
builder.setSlop(1000);
PhraseQuery q = builder.build();
return q;
}
catch(Exception e) {
e.printStackTrace();
return null;
}
}
Starting from Lucene 6.5.0 :
Index-time boosts are deprecated. As a replacement,
index-time scoring factors should be indexed into a doc value field
and combined at query time using eg. FunctionScoreQuery. (Adrien
Grand)
The recommendation instead of using index time boost would be to encode scoring factors (ie. length normalization factors) into doc values fields instead. (cf. LUCENE-6819)
Regarding my edited problem (boost value completely replacing search score instead of boosting it), here is what the documentation says about FunctionScoreQuery (emphasis mine):
A query that wraps another query, and uses a DoubleValuesSource to replace or modify the wrapped query's score.
So, when does it replace, and when does it modify?
Turns out, the code I was using is for entirely replacing the score by the boost value:
Query q = new FunctionScoreQuery(q_temp, DoubleValuesSource.fromDoubleField("boost")); //the new query
What I needed to do instead was use the boostByValue function, which modifies the search score (by multiplying the score by the boost value):
Query q = FunctionScoreQuery.boostByValue(q_temp, DoubleValuesSource.fromDoubleField("boost"));
And now it works! Thanks @EricLavault for the help!
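The difference between the two forms can be sketched in plain Java. This is only a model of the behaviour described above, not Lucene code; the class BoostModes and its method names are hypothetical, and the scores are made-up numbers.

```java
class BoostModes {

    // Models the constructor form described above: the field value replaces the query score.
    static double replaced(double queryScore, double boost) {
        return boost;
    }

    // Models boostByValue as described above: the query score is multiplied by the field value.
    static double boosted(double queryScore, double boost) {
        return queryScore * boost;
    }

    public static void main(String[] args) {
        double[] queryScores = {2.0, 0.5}; // hypothetical relevance scores
        double boost = 1.25;               // hypothetical value of the "boost" field
        for (double s : queryScores) {
            System.out.println("query score " + s
                    + " -> replaced " + replaced(s, boost)
                    + ", boosted " + boosted(s, boost));
        }
    }
}
```

With replacement, two documents carrying the same boost end up with the same score whatever the query matched, which is the symptom reported in the edit above.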

Lucene search engine isn't accurate, can't figure out why

I am trying to create a search engine for the first time, and I'm using the library offered by Apache Lucene. Everything works fine, however when I search for more than one word, for example "computer science" the results that I get aren't accurate because I never get documents that contain both words. It searches the documents for each word separately (I get documents that contain either "computer" or "science" but never both).
I've been staring at my code for almost a week now and I can't figure out the problem. The query parsing seems to work perfectly, so I think the problem might be in the search, but I don't know what I'm doing wrong. If you can help me, I'll be grateful.
public static wikiPage[] index(String searchQuery) throws SQLException, IOException, ParseException {
String sql = "select * from Record";
ResultSet rs = db.runSql(sql);
StandardAnalyzer analyzer = new StandardAnalyzer();
Directory index = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(analyzer);
//1. Indexer
try (IndexWriter w = new IndexWriter(index, config)) {
while (rs.next()) {
String RecordID = rs.getString("RecordID");
String URL = rs.getString("URL");
String Title = rs.getString("Title");
String Info = rs.getString("Info");
addDoc(w, RecordID, URL, Info, Title);
}
}
catch (Exception e) {
System.out.print(e);
index.close();
}
//2. Query
MultiFieldQueryParser multipleQueryParser = new MultiFieldQueryParser(new String[]{"Title", "Info"}, new StandardAnalyzer());
Query q = multipleQueryParser.parse(searchQuery);
//3. Search
IndexReader reader = DirectoryReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
TopDocs results = searcher.search(q, 10000);
ScoreDoc[] hits = results.scoreDocs;
// 4. display results
wikiPage[] resultArray = new wikiPage[hits.length];
System.out.println("Found " + hits.length + " hits.");
for (int i = 0; i < hits.length; ++i) {
int docId = hits[i].doc;
Document d = searcher.doc(docId);
resultArray[i] = new wikiPage(d.get("URL"), d.get("Title"));
System.out.println((i + 1) + ". " + d.get("Title") + "\t" + d.get("URL"));
}
reader.close();
return resultArray;
}
private static void addDoc(IndexWriter w, String RecordID, String URL, String Info, String Title) throws IOException {
Document doc = new Document();
doc.add(new StringField("RecordID", RecordID, Field.Store.YES));
doc.add(new TextField("Title", Title, Field.Store.YES));
doc.add(new TextField("URL", URL, Field.Store.YES));
doc.add(new TextField("Info", Info, Field.Store.YES));
w.addDocument(doc);
}
This is the output of System.out.println(q.toString());
(Title:computer Info:computer) (Title:science Info:science)
If you want to search it as a phrase (that is, finding "computer" and "science" together), surround the query with quotes, so it should look like "computer science". In your code, you could do something like:
Query q = multipleQueryParser.parse("\"" + searchQuery + "\"");
If you just want to find docs that contain both terms somewhere in the document, but not necessarily together, the query should look like +computer +science. Probably the easiest way to do this is to change the default operator of your query parser:
multipleQueryParser.setDefaultOperator(QueryParser.Operator.AND);
Query q = multipleQueryParser.parse(searchQuery);
As per the doc, prefix required terms with + and use AND (and OR for readability).
Try this:
(Title:+computer OR Info:+computer) AND (Title:+science OR Info:+science)
Maybe build this string and use it directly.
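The difference between the default OR behaviour and the AND behaviour can be modelled in a few lines of plain Java. This is a simplified sketch, not Lucene: it splits on whitespace and lowercases instead of using an analyzer, and the names OperatorModel, matchesAny and matchesAll are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class OperatorModel {

    // Crude stand-in for an analyzer: lowercase and split on whitespace.
    static Set<String> tokens(String text) {
        return new HashSet<>(Arrays.asList(text.trim().toLowerCase(Locale.ROOT).split("\\s+")));
    }

    // OR semantics: a document matches if it contains at least one query term.
    static boolean matchesAny(String doc, String query) {
        Set<String> docTokens = tokens(doc);
        for (String term : tokens(query)) {
            if (docTokens.contains(term)) {
                return true;
            }
        }
        return false;
    }

    // AND semantics: a document matches only if it contains every query term.
    static boolean matchesAll(String doc, String query) {
        return tokens(doc).containsAll(tokens(query));
    }

    public static void main(String[] args) {
        String query = "computer science";
        for (String doc : new String[] {"computer graphics", "science of computer design"}) {
            System.out.println(doc + " | OR: " + matchesAny(doc, query)
                    + " | AND: " + matchesAll(doc, query));
        }
    }
}
```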

Why Lucene algorithm not working for Exact String in Java?

I am working with Lucene in Java.
We have 100K stop names in a MySQL database.
The stop names look like:
NEW YORK PENN STATION,
NEWARK PENN STATION,
NEWARK BROAD ST,
NEW PROVIDENCE
etc
When the user enters a search input like NEW YORK, we get the NEW YORK PENN STATION stop in the results, but when the user enters the exact name NEW YORK PENN STATION, the search returns zero results.
My code is:
public ArrayList<String> getSimilarString(ArrayList<String> source, String querystr)
{
ArrayList<String> arResult = new ArrayList<String>();
try
{
// 0. Specify the analyzer for tokenizing text.
// The same analyzer should be used for indexing and searching
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
// 1. create the index
Directory index = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
IndexWriter w = new IndexWriter(index, config);
for(int i = 0; i < source.size(); i++)
{
addDoc(w, source.get(i), "1933988" + (i + 1) + "z");
}
w.close();
// 2. query
// the "title" arg specifies the default field to use
// when no field is explicitly specified in the query.
Query q = new QueryParser(Version.LUCENE_40, "title", analyzer).parse(querystr + "*");
// 3. search
int hitsPerPage = 20;
IndexReader reader = DirectoryReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
// 4. Get results
for(int i = 0; i < hits.length; ++i)
{
int docId = hits[i].doc;
Document d = searcher.doc(docId);
arResult.add(d.get("title"));
}
// reader can only be closed when there
// is no need to access the documents any more.
reader.close();
}
catch(Exception e)
{
System.out.println("Exception (LuceneAlgo.getSimilarString()) : " + e);
}
return arResult;
}
private static void addDoc(IndexWriter w, String title, String isbn) throws IOException
{
Document doc = new Document();
doc.add(new TextField("title", title, Field.Store.YES));
// use a string field for isbn because we don't want it tokenized
doc.add(new StringField("isbn", isbn, Field.Store.YES));
w.addDocument(doc);
}
In this code, source is the list of stop names and querystr is the search input given by the user.
Does Lucene work on large strings?
Why does Lucene not work for an exact string?
Instead of
1) Query q = new QueryParser(Version.LUCENE_40, "title", analyzer).parse(querystr + "*");
Ex: "new york station" will be parsed to "title:new title:york title:station*".
This query will return all the docs containing any of the above terms.
Try this:
2) Query q = new QueryParser(Version.LUCENE_40, "title", analyzer).parse("+(" + querystr + ")");
Ex1: "new york" will be parsed to "+(title:new title:york)".
The '+' above marks the parenthesized group as a 'must' clause for the result document.
It will match both the docs containing "new york" and "new york station".
Ex2: "new york station" will be parsed to "+(title:new title:york title:station)".
Note that the terms inside the group are still optional relative to one another, so this query also matches documents that contain only some of them. To require every term, prefix each one with '+' (for example, +new +york +station).
Please make sure that the field name 'title' is what you're looking for.
Your questions:
Does Lucene work on large strings?
You've got to define what a large string is. Are you actually looking for phrase search? In general, yes, Lucene works for large strings.
Why does Lucene not work for an exact string?
Because parsing querystr + "*" generates individual term queries connected by the OR operator.
Ex: "new york*" will be parsed to "title:new OR title:york*".
If you are trying to find "new york station", the above wildcard query is not what you need. This is because the StandardAnalyzer you passed in while indexing tokenizes (breaks down) "new york station" into 3 terms.
So the query "york*" will find "york station" only because it has "york" in it, not because of the wildcard: "york" has no idea of "station", since they are different terms, i.e. different entries in the index.
What you actually need for finding an exact string is a PhraseQuery, for which the query string should be "new york" WITH the quotes.
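What a phrase match requires, compared with matching individual terms, can be sketched in plain Java. This is a simplified model and not Lucene's PhraseQuery: it lowercases and splits on whitespace, and the names PhraseModel, tokenize and containsPhrase are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

class PhraseModel {

    // Crude stand-in for an analyzer: lowercase and split on whitespace, keeping order.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.trim().toLowerCase(Locale.ROOT).split("\\s+"));
    }

    // A phrase matches only if its tokens appear consecutively and in the same order.
    static boolean containsPhrase(String doc, String phrase) {
        return Collections.indexOfSubList(tokenize(doc), tokenize(phrase)) >= 0;
    }

    public static void main(String[] args) {
        String[] stops = {"NEW YORK PENN STATION", "NEWARK PENN STATION", "NEW PROVIDENCE"};
        for (String stop : stops) {
            System.out.println(stop + " contains \"new york\": " + containsPhrase(stop, "new york"));
        }
    }
}
```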

Why Lucene does not return the results based on whole word match?

I am using Lucene to match keywords against a list of words within an application. The whole process is automated, without any human intervention. The best match (the top result, with the highest score) is picked from the result list returned by Lucene.
The following code demonstrates this functionality, and the results are printed on the console.
Problem:
Lucene searches for the keyword and returns as the top result a word that only partially matches it. A better, fuller match also exists in the index, but it is not ranked in first position.
For example, suppose I have a Lucene RAM index that contains the words 'Test' and 'Test Engineer'. If I search the index for 'AB4_Test Eng_AA0XY11', the results are:
Test
Test Engineer
Eng in 'AB4_Test Eng_AA0XY11' matched Engineer (that is why it is listed in the results), but 'Test Engineer' does not get the top position. I want to improve my solution so that 'Test Engineer' comes out on top, because it is the best match when the whole keyword is considered. Can anyone help me solve this problem?
public class LuceneTest {
private static void search(Set<String> keywords) {
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
try {
// 1. create the index
Directory luceneIndex = buildLuceneIndex(analyzer);
int hitsPerPage = 5;
IndexReader reader = IndexReader.open(luceneIndex);
for(String keyword : keywords) {
// Create query string. replace all underscore, hyphen, comma, ( , ), {, }, . with plus sign
StringBuilder querystr = new StringBuilder(128);
String [] splitName = keyword.split("[\\-_,/(){}:. ]");
// After tokenizing also add plus sign between each camel case word.
for (String token : splitName) {
querystr.append(token + "+");
}
// 3. search
IndexSearcher searcher = new IndexSearcher(reader);
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
Query q = new QueryParser(Version.LUCENE_36, "name", analyzer).parse(querystr.toString());
searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
System.out.println();
System.out.println(keyword);
System.out.println("----------------------");
for (ScoreDoc scoreDoc : hits) {
Document d = searcher.doc(scoreDoc.doc);
System.out.println("Found " + d.get("id") + " : " + d.get("name"));
}
// searcher can only be closed when there is no need to access the documents any more
searcher.close();
}
}catch (Exception e) {
e.printStackTrace();
}
}
/**
*
*/
private static Directory buildLuceneIndex(Analyzer analyzer) throws CorruptIndexException, LockObtainFailedException, IOException{
Map<Integer, String> map = new HashMap<Integer, String>();
map.put(1, "Test Engineer");
map.put(2, "Test");
Directory index = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, analyzer);
// 1. create the index
IndexWriter w = new IndexWriter(index, config);
for (Map.Entry<Integer, String> entry : map.entrySet()) {
try {
Document doc = new Document();
doc.add(new Field("id", entry.getKey().toString(), Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("name", entry.getValue() , Field.Store.YES, Field.Index.ANALYZED));
w.addDocument(doc);
}catch (Exception e) {
e.printStackTrace();
}
}
w.close();
return index;
}
public static void main(String[] args) {
Set<String> list = new TreeSet<String>();
list.add("AB4_Test Eng_AA0XY11");
list.add("AB4_Test Engineer_AA0XY11");
search(list);
}
}
You can have a look at the Lucene Query syntax rules to see how you can enforce the search for Test Engineer.
Basically, using a query such as
AB4_Test AND Eng_AA0XY11
could work, though I am not sure of it. The page linked above is quite concise, and you should quickly be able to find a query that fulfills your needs.
If these two results (test , test engineer) have the same ranking score, then you will see them in the order they came up.
You should try using the length filter and also boosting of the terms, and maybe then you can come up with the solution.
See also:
what is the best lucene setup for ranking exact matches as the highest

Extract term frequency for each word in a Lucene 5.2.1 index using Java

How can I extract the term frequency of each word from a Lucene 5.2.1 index using Java?
I have code that used to work with a previous Lucene version but does not work anymore. I think most code on the Internet is for previous versions of Lucene.
You can get the term frequency of a given term from IndexReader.totalTermFreq, such as:
Term myTerm = new Term("contentfield", "myterm");
long totaltf = myReader.totalTermFreq(myTerm);
If you want to iterate over all the terms in the index and get the frequency of each, you can use MultiFields for that:
Fields fields = MultiFields.getFields(reader);
Iterator<String> fieldsIter = fields.iterator();
while (fieldsIter.hasNext()) {
String fieldname = fieldsIter.next();
TermsEnum terms = fields.terms(fieldname).iterator();
BytesRef term;
while ((term = terms.next()) != null) {
System.out.println(fieldname + ":" + term.utf8ToString() + " ttf:" + terms.totalTermFreq());
//Or whatever else you want to do with it...
}
}
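For orientation, totalTermFreq is the total number of occurrences of a term across all documents, not the number of documents that contain it. The plain-Java sketch below models that count over a few in-memory strings. It is not Lucene code: it lowercases and splits on whitespace instead of using an analyzer, and the names TermFrequencyModel and totalTermFreq (as a static helper) are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class TermFrequencyModel {

    // Sums the occurrences of every token over all documents.
    static Map<String, Long> totalTermFreq(List<String> docs) {
        Map<String, Long> counts = new TreeMap<>();
        for (String doc : docs) {
            for (String token : doc.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1L, Long::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("new york", "New Jersey new");
        for (Map.Entry<String, Long> e : totalTermFreq(docs).entrySet()) {
            System.out.println(e.getKey() + " ttf:" + e.getValue());
        }
    }
}
```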
