How I can achieve scoring and sorting in lucene as per the start date.
Event which has latest start date should be shown first in search results. I am using lucene Version.LUCENE_44
I have retreived data from DB and stored in Lucene Document as,
public static Document createDoc(Event e) {
Document d = new Document();
//event id
d.add(new StoredField("id", e.getId()));
//event name
d.add(new StoredField("eventname", e.getEName());
TextField field = new TextField("enameSrch", e.getEName(), Store.NO);
field.setBoost(10.0f);
d.add(field);
//event owner
d.add(new StoredField("eventowner", e.getEOwner());
//event start date
d.add(new LongField("edateSort", Long.MAX_VALUE-e.getEStartTime(), Store.YES));
//event tags
if (e.eventTags()!=null) {
field = new TextField("eTagSrch", e.getTags(), Store.NO);
field.setBoost(5.0f);
d.add(field);
d.add(new StoredField("eTags", e.getTags()));
}
And while searching I am doing as,
public List search(String srchTxt){
PhraseQuery enameQuery = new PhraseQuery();
Term term = new Term("enameSrch", srchTxt.toLowerCase());
enameQuery .add(term);
PhraseQuery etagQuery = new PhraseQuery();
term = new Term("eTagSrch", srchTxt.toLowerCase());
etagQuery.add(term);
BooleanQuery b= new BooleanQuery();
b.add(enameQuery , Occur.SHOULD);
b.add(etagQuery , Occur.SHOULD);
SortField startField = new SortField("edateSort", Type.LONG);
SortField scoreField = SortField.FIELD_SCORE;
Sort sort = new Sort(scoreField, startField);
TopFieldDocs tfd = searcher.search(b, 10, sort);
ScoreDoc[] myscore= tfd.scoreDocs;
To rephrase: I want to sort Documents by date, which is stored as a Long field in my Document (see code above)
What your code does is sorts by score, then by date, since your scores coming back are not likely the same, they will almost always be by score anyways.
This is what I would do:
Sort sorter = new Sort(); // new sort object
String field = "fieldName"; // enter the field to sort by
Type type = Type.Long; // since your field is long type
boolean descending = false; // ascending by default
SortField sortField = new SortField(field, type, descending);
sorter.setSort(sortField); // now set the sort field
This will just sort by the field you specified. You can also do:
sorter.setSort(sortField, SortField.FIELD_SCORE); // this will sort by field, then by score
Related
I am currently working on a small search engine for college using Lucene 8. I already built it before, but without applying any weights to documents.
I am now required to add the PageRanks of documents as a weight for each document, and I already computed the PageRank values. How can I add a weight to a Document object (not query terms) in Lucene 8? I looked up many solutions online, but they only work for older versions of Lucene. Example source
Here is my (updated) code that generates a Document object from a File object:
public static Document getDocument(File f) throws FileNotFoundException, IOException {
Document d = new Document();
//adding a field
FieldType contentType = new FieldType();
contentType.setStored(true);
contentType.setTokenized(true);
contentType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
contentType.setStoreTermVectors(true);
String fileContents = String.join(" ", Files.readAllLines(f.toPath(), StandardCharsets.UTF_8));
d.add(new Field("content", fileContents, contentType));
//adding other fields, then...
//the boost coefficient (updated):
double coef = 1.0 + ranks.get(path);
d.add(new DoubleDocValuesField("boost", coef));
return d;
}
The issue with my current approach is that I would need a CustomScoreQuery object to search the documents, but this is not available in Lucene 8. Also, I don't want to downgrade now to Lucene 7 after all the code I wrote in Lucene 8.
Edit:
After some (lengthy) research, I added a DoubleDocValuesField to each document holding the boost (see updated code above), and used a FunctionScoreQuery for searching as advised by #EricLavault. However, now all my documents have a score of exactly their boost, regardless of the query! How do I fix that? Here is my searching function:
public static TopDocs search(String query, IndexSearcher searcher, String outputFile) {
try {
Query q_temp = buildQuery(query); //the original query, was working fine alone
Query q = new FunctionScoreQuery(q_temp, DoubleValuesSource.fromDoubleField("boost")); //the new query
q = q.rewrite(DirectoryReader.open(bm25IndexDir));
TopDocs results = searcher.search(q, 10);
ScoreDoc[] filterScoreDosArray = results.scoreDocs;
for (int i = 0; i < filterScoreDosArray.length; ++i) {
int docId = filterScoreDosArray[i].doc;
Document d = searcher.doc(docId);
//here, when printing, I see that the document's score is the same as its "boost" value. WHY??
System.out.println((i + 1) + ". " + d.get("path")+" Score: "+ filterScoreDosArray[i].score);
}
return results;
}
catch(Exception e) {
e.printStackTrace();
return null;
}
}
//function that builds the query, working fine
public static Query buildQuery(String query) {
try {
PhraseQuery.Builder builder = new PhraseQuery.Builder();
TokenStream tokenStream = new EnglishAnalyzer().tokenStream("content", query);
tokenStream.reset();
while (tokenStream.incrementToken()) {
CharTermAttribute charTermAttribute = tokenStream.getAttribute(CharTermAttribute.class);
builder.add(new Term("content", charTermAttribute.toString()));
}
tokenStream.end(); tokenStream.close();
builder.setSlop(1000);
PhraseQuery q = builder.build();
return q;
}
catch(Exception e) {
e.printStackTrace();
return null;
}
}
Starting from Lucene 6.5.0 :
Index-time boosts are deprecated. As a replacement,
index-time scoring factors should be indexed into a doc value field
and combined at query time using eg. FunctionScoreQuery. (Adrien
Grand)
The recommendation instead of using index time boost would be to encode scoring factors (ie. length normalization factors) into doc values fields instead. (cf. LUCENE-6819)
Regarding my edited problem (boost value completely replacing search score instead of boosting it), here is what the documentation says about FunctionScoreQuery (emphasis mine):
A query that wraps another query, and uses a DoubleValuesSource to replace or modify the wrapped query's score.
So, when does it replace, and when does it modify?
Turns out, the code I was using is for entirely replacing the score by the boost value:
Query q = new FunctionScoreQuery(q_temp, DoubleValuesSource.fromDoubleField("boost")); //the new query
What I needed to do instead was using the function boostByValue, that modifies the searching score (by multiplying the score by the boost value):
Query q = FunctionScoreQuery.boostByValue(q_temp, DoubleValuesSource.fromDoubleField("boost"));
And now it works! Thanks #EricLavault for the help!
The function below merge word MongoDB collection and map content like this:
Collection:
cat 3,
dog 5
Map:
dog 2,
zebra 1
Collection after merge:
cat 3,
dog 7,
zebra 1
We have empty collection and map with about 14000 elements.
Oracle PL/SQL procedure using one merge SQL running on 15k RPM HD do it in less then a second.
MongoBD on SSD disk needs about 53 seconds.
It looks like Oracle prepares in memory image of file operation
and saves result in one i/o operation.
MongoDB probably does 14000 i/o - it is about 4 ms for each insert. It is corresponds with performance of SSD.
If I do just 14000 inserts without search for documents existence as in case of merge everything works also fast - less then a second.
My questions:
Can the code be improved?
Maybe it necessary to do something with MongoDB configuration?
Function code:
public void addBookInfo(String bookTitle, HashMap<String, Integer> bookInfo)
{
// insert information to the book collection
Document d = new Document();
d.append("book_title", bookTitle);
book.insertOne(d);
// insert information to the word collection
// prepare collection of word info and book_word info documents
List<Document> wordInfoToInsert = new ArrayList<Document>();
List<Document> book_wordInfoToInsert = new ArrayList<Document>();
for (String key : bookInfo.keySet())
{
Document d1 = new Document();
Document d2 = new Document();
d1.append("word", key);
d1.append("count", bookInfo.get(key));
wordInfoToInsert.add(d1);
d2.append("book_title", bookTitle);
d2.append("word", key);
d2.append("count", bookInfo.get(key));
book_wordInfoToInsert.add(d2);
}
// this is collection of insert/update DB operations
List<WriteModel<Document>> updates = new ArrayList<WriteModel<Document>>();
// iterator for collection of words
ListIterator<Document> listIterator = wordInfoToInsert.listIterator();
// generate list of insert/update operations
while (listIterator.hasNext())
{
d = listIterator.next();
String wordToUpdate = d.getString("word");
int countToAdd = d.getInteger("count").intValue();
updates.add(
new UpdateOneModel<Document>(
new Document("word", wordToUpdate),
new Document("$inc",new Document("count", countToAdd)),
new UpdateOptions().upsert(true)
)
);
}
// perform bulk operation
// this is slowly
BulkWriteResult bulkWriteResult = word.bulkWrite(updates);
boolean acknowledge = bulkWriteResult.wasAcknowledged();
if (acknowledge)
System.out.println("Write acknowledged.");
else
System.out.println("Write was not acknowledged.");
boolean countInfo = bulkWriteResult.isModifiedCountAvailable();
if (countInfo)
System.out.println("Change counters avaiable.");
else
System.out.println("Change counters not avaiable.");
int inserted = bulkWriteResult.getInsertedCount();
int modified = bulkWriteResult.getModifiedCount();
System.out.println("inserted: " + inserted);
System.out.println("modified: " + modified);
// insert information to the book_word collection
// this is very fast
book_word.insertMany(book_wordInfoToInsert);
}
As the lucene migration guide mentioned, to set document level boost we should multiply all fields boost by boosting value. here is my code :
StringField nameField = new StringField("name", name, Field.Store.YES) ;
StringField linkField = new StringField("link", link, Field.Store.YES);
Field descField;
TextField reviewsField = new TextField("reviews", reviews_str, Field.Store.YES);
TextField authorsField = new TextField("authors", authors_str, Field.Store.YES);
FloatField scoreField = new FloatField("score", origScore,Field.Store.YES);
if (desc != null) {
descField = new TextField("desc", desc, Field.Store.YES);
} else {
descField = new TextField("desc", "", Field.Store.YES);
}
doc.add(nameField);
doc.add(linkField);
doc.add(descField);
doc.add(reviewsField);
doc.add(authorsField);
doc.add(scoreField);
nameField.setBoost(score);
linkField.setBoost(score);
descField.setBoost(score);
reviewsField.setBoost(score);
authorsField.setBoost(score);
scoreField.setBoost(score);
but I've got this exception when running code :
Exception in thread "main" java.lang.IllegalArgumentException: You cannot set an index-time boost on an unindexed field, or one that omits norms
I've searched google. but I've got no answers. would you please help me?
Index-time boosts are stored in the field's norm, and both StringField and FloatField omit norms by default. So, you'll need to turn them on before you set the boosts.
To turn norms on, you'll need to define your own field types:
//Start with a copy of the standard field type
FieldType myStringType = new FieldType(StringField.TYPE_STORED);
myStringType.setOmitNorms(false);
//StringField doesn't do anything special except have a customized fieldtype, so just use Field.
Field nameField = new Field("name", name, myStringType);
FieldType myFloatType = new FieldType(FloatField.TYPE_STORED);
myFloatType.setOmitNorms(false);
//For FloatField, use the appropriate FloatField ctor, instead of Field (similar for other numerics)
Field scoreField = new FloatField("score", origScore, myFloatType);
I am working on Lucene Algorithm in Java.
We have 100K stop names in MySQL Database.
The stop names are like
NEW YORK PENN STATION,
NEWARK PENN STATION,
NEWARK BROAD ST,
NEW PROVIDENCE
etc
When user gives a search input like NEW YORK, we get the NEW YORK PENN STATION stop in a result, but when user gives exact NEW YORK PENN STATION in a search input then it returns zero results.
My Code is -
public ArrayList<String> getSimilarString(ArrayList<String> source, String querystr)
{
ArrayList<String> arResult = new ArrayList<String>();
try
{
// 0. Specify the analyzer for tokenizing text.
// The same analyzer should be used for indexing and searching
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
// 1. create the index
Directory index = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
IndexWriter w = new IndexWriter(index, config);
for(int i = 0; i < source.size(); i++)
{
addDoc(w, source.get(i), "1933988" + (i + 1) + "z");
}
w.close();
// 2. query
// the "title" arg specifies the default field to use
// when no field is explicitly specified in the query.
Query q = new QueryParser(Version.LUCENE_40, "title", analyzer).parse(querystr + "*");
// 3. search
int hitsPerPage = 20;
IndexReader reader = DirectoryReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
// 4. Get results
for(int i = 0; i < hits.length; ++i)
{
int docId = hits[i].doc;
Document d = searcher.doc(docId);
arResult.add(d.get("title"));
}
// reader can only be closed when there
// is no need to access the documents any more.
reader.close();
}
catch(Exception e)
{
System.out.println("Exception (LuceneAlgo.getSimilarString()) : " + e);
}
return arResult;
}
private static void addDoc(IndexWriter w, String title, String isbn) throws IOException
{
Document doc = new Document();
doc.add(new TextField("title", title, Field.Store.YES));
// use a string field for isbn because we don't want it tokenized
doc.add(new StringField("isbn", isbn, Field.Store.YES));
w.addDocument(doc);
}
In this code source is list of Stop Names and query is user given search input.
Does Lucene algorithm work on Large String?
Why Lucene algorithm is not working on Exact String?
Instead of
1) Query q = new QueryParser(Version.LUCENE_40, "title", analyzer).parse(querystr + "*");
Ex: "new york station" will be parsed to "title:new title:york title:station".
This query will return all the docs containing any of the above terms.
Try this..
2) Query q = new QueryParser(Version.LUCENE_40, "title", analyzer).parse("+(" + querystr + ")");
Ex1: "new york" will be parsed to "+(title:new title:york)"
The above '+' indicates 'must' occurrence of the term in the result document.
It will match both the docs containing "new york" and "new york station"
Ex2: "new york station" will be parsed to +(title:new title:york title:station).
The query will match only "new york station" and not just "new york" since station is not present.
Please make sure that the field name 'title' is what you're looking for.
Your questions.
Does Lucene algorithm work on Large String?
You've got to define what a large string is. Are you actually looking for Phrase Search. In general, Yes, Lucene works for large strings.
Why Lucene algorithm is not working on Exact String?
Because parsing ("querystr" + "* ") will generate individual term queries with OR operator connecting them.
Ex: 'new york*' will be parsed to: "title:new OR title:york*
If you are looking forward to find "new york station", the above wildcard query is not what you should be looking for. This is because the StandardAnalyser you passed in, while indexing, will tokenize (break down terms) new york station to 3 terms.
So, the query "york*" will find "york station" only because it has "york" in it but not because of the wildcard since "york" has no idea of "station" as they are different terms, i.e. different entries in the Index.
What you actually need is a PhraseQuery for finding exact string, for which the query string should be "new york" WITH the quotes
I am using Lucene to match the keywords with list of words within an application. The whole process is automated without any human intervention. Best matched result (the one on the top and highest score) is picked from the results list returned from Lucene.
The following code demonstrates the above functionality and the results are printed on console.
Problem :
The problem is that lucene searches the keyword (word to be searched) and gives as a result a word that partially matches the keyword. On the other hand the full matched result also exists and does not get ranked in the first position.
For example, if I have lucene RAM index that contains words 'Test' and 'Test Engineer'. If i want to search index for 'AB4_Test Eng_AA0XY11' then results would be
Test
Test Engineer
Although Eng in 'AB4_Test Eng_AA0XY11' matched for Engineer (that is why it is listed in results). But it does not get the top position. I want to optimize my solution to bring the 'Test Engineer' on top because it the best match that considers whole keyword. Can any one help me in solving this problem?
public class LuceneTest {
private static void search(Set<String> keywords) {
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
try {
// 1. create the index
Directory luceneIndex = buildLuceneIndex(analyzer);
int hitsPerPage = 5;
IndexReader reader = IndexReader.open(luceneIndex);
for(String keyword : keywords) {
// Create query string. replace all underscore, hyphen, comma, ( , ), {, }, . with plus sign
StringBuilder querystr = new StringBuilder(128);
String [] splitName = keyword.split("[\\-_,/(){}:. ]");
// After tokenizing also add plus sign between each camel case word.
for (String token : splitName) {
querystr.append(token + "+");
}
// 3. search
IndexSearcher searcher = new IndexSearcher(reader);
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
Query q = new QueryParser(Version.LUCENE_36, "name", analyzer).parse(querystr.toString());
searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
System.out.println();
System.out.println(keyword);
System.out.println("----------------------");
for (ScoreDoc scoreDoc : hits) {
Document d = searcher.doc(scoreDoc.doc);
System.out.println("Found " + d.get("id") + " : " + d.get("name"));
}
// searcher can only be closed when there
searcher.close();
}
}catch (Exception e) {
e.printStackTrace();
}
}
/**
*
*/
private static Directory buildLuceneIndex(Analyzer analyzer) throws CorruptIndexException, LockObtainFailedException, IOException{
Map<Integer, String> map = new HashMap<Integer, String>();
map.put(1, "Test Engineer");
map.put(2, "Test");
Directory index = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, analyzer);
// 1. create the index
IndexWriter w = new IndexWriter(index, config);
for (Map.Entry<Integer, String> entry : map.entrySet()) {
try {
Document doc = new Document();
doc.add(new Field("id", entry.getKey().toString(), Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("name", entry.getValue() , Field.Store.YES, Field.Index.ANALYZED));
w.addDocument(doc);
}catch (Exception e) {
e.printStackTrace();
}
}
w.close();
return index;
}
public static void main(String[] args) {
Set<String> list = new TreeSet<String>();
list.add("AB4_Test Eng_AA0XY11");
list.add("AB4_Test Engineer_AA0XY11");
search(list);
}
}
You can have a look at the Lucene Query syntax rules to see how you can enforce the search for Test Engineer.
Basically, using a query such as
AB4_Test AND Eng_AA0XY11
could work, though I am not sure of it. The page pointed by the link above is quite concise and you will be able to find rapidly a query that can fulfill your needs.
If these two results (test , test engineer) have the same ranking score, then you will see them in the order they came up.
You should try using the length filter and also boosting of the terms, and maybe then you can come up with the solution.
See also:
what is the best lucene setup for ranking exact matches as the highest