Lucene searching over stored and unstored Fields concurrently - java

I'm working with Lucene 7.4 and have indexed a sample of txt files.
I have some Fields that have been stored, such as path and filename,
and a content Field, which was unstored before passing the doc to the IndexWriter.
Consequently, my content Field contains the processed (e.g. tokenized, stemmed) content of the file, while my filename Field contains the unprocessed filename as one entire String.
try (InputStream stream = Files.newInputStream(file)) {
    // create an empty document
    Document doc = new Document();
    // add the last modification time field
    Field lastModField = new StoredField(LuceneConstants.LAST_MODIFICATION_TIME,
            Files.getAttribute(file, "lastModifiedTime", LinkOption.NOFOLLOW_LINKS).toString());
    doc.add(lastModField);
    // add the path Field
    Field pathField = new StringField(LuceneConstants.FILE_PATH, file.toString(), Field.Store.YES);
    doc.add(pathField);
    // add the name Field
    doc.add(new StringField(LuceneConstants.FILE_NAME, file.getFileName().toString(), Field.Store.YES));
    // add the content
    doc.add(new TextField(LuceneConstants.CONTENTS, new BufferedReader(new InputStreamReader(stream))));
    System.out.println("adding " + file);
    writer.addDocument(doc);
}
Now, as far as I understand, I have to use two QueryParsers, since I need two different Analyzers for searching over the two fields, one for each.
I can't figure out how to combine them.
What I want is a TopDocs whose results are ordered by a relevance score that is some combination of the two relevance scores from the search over the filename Field and the search over the content Field.
Does Lucene 7.4 provide the means for an easy solution to this?
PS: This is my first post in a long time, if not ever. Please point out any formatting or content issues.
EDIT:
Analyzer used for indexing content Field and for searching content Field:
Analyzer myTxtAnalyzer = CustomAnalyzer.builder()
        .withTokenizer("standard")
        .addTokenFilter("lowercase")
        .addTokenFilter("stop")
        .addTokenFilter("porterstem")
        .build();
And I'm using the KeywordAnalyzer to search over the filename Field, which, to reiterate, is stored, so not analyzed.
My program is supposed to index files and search over that index, retrieving
a list of the most relevant documents. If a searchString, which may contain whitespace, exactly matches the fileName,
I'd like that to heavily impact my search results.
I'm a computer science student, and this is my first project with Lucene.
If there are no ready-made functions for this, that's all good; it's not a requirement for my task. I'm just pondering, and this feels like something a simple solution might already exist for. But I can't seem to find it, if it does exist.
EDIT 2:
I had a misconception about what happens when using Field.Store.YES/.NO.
My problem has nothing to do with it.
The String wasn't tokenized because it was in a StringField.
I assumed it was because it was stored.
However, my question remains.
Is there a way to search over tokenized and untokenized Fields concurrently?

As @andrewjames mentions, you don't need to use multiple analyzers in your example, because only the TextField gets analyzed; the StringFields do not. If you did have a situation where you needed different analyzers for different fields, Lucene can accommodate that. To do so, you use a PerFieldAnalyzerWrapper, which basically lets you specify a default Analyzer and then as many field-specific analyzers as you like (passed to PerFieldAnalyzerWrapper as a Map). When analyzing the doc, it will use the field-specific analyzer if one was specified, and otherwise fall back to the default analyzer you gave the PerFieldAnalyzerWrapper.
Whether you use a single analyzer or several via PerFieldAnalyzerWrapper, you only need one QueryParser; you pass that parser either the single analyzer or the PerFieldAnalyzerWrapper, which is itself an Analyzer that wraps several analyzers.
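For example, a rough sketch for your two fields (assuming the LuceneConstants field names and the myTxtAnalyzer from your edit):
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.queryparser.classic.QueryParser;

Map<String, Analyzer> perField = new HashMap<>();
perField.put(LuceneConstants.CONTENTS, myTxtAnalyzer);          // your custom analyzer from the edit
perField.put(LuceneConstants.FILE_NAME, new KeywordAnalyzer()); // treat the whole filename as one token
Analyzer wrapped = new PerFieldAnalyzerWrapper(myTxtAnalyzer, perField);

// one parser is enough; the wrapper picks the right analyzer per field
QueryParser parser = new QueryParser(LuceneConstants.CONTENTS, wrapped);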
The fact that some of your fields are stored and some are not stored has no impact on searching them. The only thing that matters for the search is that the field is indexed, and both StringFields and TextFields are always indexed.
You mention the following:
And I'm using the KeywordAnalyzer to search over the filename Field, which, to reiterate, is stored, so not analyzed.
Whether a field is stored or not has nothing to do with whether it's analyzed. For the filename field your code is using a StringField with Field.Store.YES. Because it's a StringField it will be indexed BUT not analyzed, and because you specified to store the field it will be stored. So since the field is NOT analyzed, it won't be using the KeywordAnalyzer or any other analyzer :-)
Is there a way to search over tokenized and untokenized Fields concurrently?
The real issue here isn't about searching tokenized and untokenized fields concurrently; it's really just about searching multiple fields concurrently. The fact that one is tokenized and one is not is of no consequence to Lucene. To search multiple fields at once, you can use a BooleanQuery, add one sub-query per field to it, and specify an AND (i.e. MUST) or an OR (i.e. SHOULD) relationship between the sub-queries.
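Something along these lines, as an untested sketch for your two fields (it assumes your LuceneConstants field names, the myTxtAnalyzer from your edit, an IndexSearcher called searcher, and a user-supplied searchString; QueryParser.parse() can throw a ParseException):
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

// exact match on the un-analyzed filename field: the whole stored string is a single term
Query nameQuery = new TermQuery(new Term(LuceneConstants.FILE_NAME, searchString));
// analyzed search over the content field, using the same analyzer as at index time
Query contentQuery = new QueryParser(LuceneConstants.CONTENTS, myTxtAnalyzer).parse(searchString);

Query combined = new BooleanQuery.Builder()
        .add(new BoostQuery(nameQuery, 5.0f), BooleanClause.Occur.SHOULD) // boost exact filename hits
        .add(contentQuery, BooleanClause.Occur.SHOULD)
        .build();

TopDocs hits = searcher.search(combined, 10); // scores combine the contributions of both clauses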
I hope this helps clear things up for you.

Related

MongoDB (Java): efficient update of multiple documents to different(!) values

I have a MongoDB database and the program I'm writing is meant to change the values of a single field for all documents in a collection. Now if I want them all to change to a single value, like the string value "mask", then I know that updateMany does the trick and it's quite efficient.
However, what I want is an efficient solution for updating to different new values; in fact, I want to pick the new value of the field in question for each document from a list, e.g. an ArrayList. But then something like this
collection.updateMany(new BasicDBObject(),
        new BasicDBObject("$set", new BasicDBObject(fieldName,
                listOfMasks.get(random.nextInt(size)))));
wouldn't work, since updateMany doesn't recompute the value the field should be set to; it evaluates the argument
listOfMasks.get(random.nextInt(size))
once and then uses that value for all the documents. So I don't think there's a solution to this problem that can actually employ updateMany, since it's simply not versatile enough.
But I was wondering if anyone has any ideas for at least making it faster than simply iterating through all the documents and calling updateOne for each one with a new value from the ArrayList (in a random order, but that's just a detail), like below?
// Loop until the MongoCursor is empty (until the search is complete)
try {
    while (cursor.hasNext()) {
        // Pick a random mask
        String mask = listOfMasks.get(random.nextInt(size));
        // Update this document
        collection.updateOne(cursor.next(), Updates.set("test_field", mask));
    }
} finally {
    cursor.close();
}
MongoDB provides the bulk write API to batch updates. This would be appropriate for your example of setting the value of a field to a random value (determined on the client) for each document.
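For example, a rough sketch with the synchronous Java driver (untested; it assumes a MongoCollection<Document> called collection and the listOfMasks/random/size variables from your snippet):
import com.mongodb.bulk.BulkWriteResult;
import com.mongodb.client.MongoCursor;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.UpdateOneModel;
import com.mongodb.client.model.Updates;
import com.mongodb.client.model.WriteModel;
import org.bson.Document;
import java.util.ArrayList;
import java.util.List;

List<WriteModel<Document>> ops = new ArrayList<>();
try (MongoCursor<Document> cursor = collection.find().iterator()) {
    while (cursor.hasNext()) {
        Document found = cursor.next();
        String mask = listOfMasks.get(random.nextInt(size));
        // queue one update per document, each with its own randomly chosen value
        ops.add(new UpdateOneModel<>(Filters.eq("_id", found.get("_id")),
                Updates.set("test_field", mask)));
    }
}
if (!ops.isEmpty()) {
    // the driver sends the queued updates in batches instead of one round trip per document
    BulkWriteResult result = collection.bulkWrite(ops);
}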
Alternatively, if there is a pattern to the changes needed, you could potentially use a find-and-modify operation with the available update operators.

Lucene: Searching multiple fields with default operator = AND

To allow users to search across multiple fields with Lucene 3.5, I currently create a QueryParser for each field to be searched and add the resulting queries to a DisjunctionMaxQuery. This works great when using OR as the default operator, but I now want to change the default operator to AND to get more accurate (and fewer) results.
Problem is, queryParser.setDefaultOperator(QueryParser.AND_OPERATOR) misses many documents, since all terms must then appear together in at least one field.
For example, consider the following data for a document: title field = "Programming Languages", body field = "Java, C++, PHP". If a user were to search for Java Programming, this particular document would not be included in the results, since neither the title nor the body field contains all terms in the query, although combined they do. I would want this document returned for the above query, but not for the query HTML Programming.
I've considered a catchall field, but I have a few problems with it. First, users frequently include per-field terms in their queries (author:bill), which is not possible with a catchall field. Also, I highlight certain fields with FastVectorHighlighter, which requires them to be indexed and stored. So by adding a catchall field I would have to index most of the same data twice, which is time and space consuming.
Any ideas?
Guess I should have done a little more research. Turns out MultiFieldQueryParser provides the exact functionality I was looking for. For whatever reason I was creating a QueryParser for each field I wanted to search like this:
String[] fields = {"title", "body", "subject", "author"};
QueryParser[] parsers = new QueryParser[fields.length];
for (int i = 0; i < parsers.length; i++)
{
    parsers[i] = new QueryParser(Version.LUCENE_35, fields[i], analyzer);
    parsers[i].setDefaultOperator(QueryParser.AND_OPERATOR);
}
This would result in a query like this:
(+title:java +title:programming) | (+body:java +body:programming)
...which is not what I was looking for. Now I create a single MultiFieldQueryParser like this:
MultiFieldQueryParser parser = new MultiFieldQueryParser(Version.LUCENE_35, new String[]{"title", "body", "subject"}, analyzer);
parser.setDefaultOperator(QueryParser.AND_OPERATOR);
This gives me the query I was looking for:
+(title:java body:java) +(title:programming body:programming)
Thanks to @seeta and @femtoRgon for the help!
Perhaps what you need is a combination of Boolean queries that capture the different combinations of fields and terms. In your given example, the query could be -
(title:Java AND body:programming) OR (title:programming AND body:Java).
I don't know if there's an existing Query class that generates this automatically for you, but I think that's what should be the ultimate query that's run on the index.
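Just as an illustration, building that combination by hand with the Lucene 3.x API might look roughly like this (terms are shown already lowercased, as your analyzer would produce them):
BooleanQuery javaTitleProgBody = new BooleanQuery();
javaTitleProgBody.add(new TermQuery(new Term("title", "java")), BooleanClause.Occur.MUST);
javaTitleProgBody.add(new TermQuery(new Term("body", "programming")), BooleanClause.Occur.MUST);

BooleanQuery progTitleJavaBody = new BooleanQuery();
progTitleJavaBody.add(new TermQuery(new Term("title", "programming")), BooleanClause.Occur.MUST);
progTitleJavaBody.add(new TermQuery(new Term("body", "java")), BooleanClause.Occur.MUST);

BooleanQuery either = new BooleanQuery();
either.add(javaTitleProgBody, BooleanClause.Occur.SHOULD); // at least one combination must match
either.add(progTitleJavaBody, BooleanClause.Occur.SHOULD);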
If you want to be able to search multiple fields with the same set of terms, then the query from your comment:
((title:java title:programming) | (body:java body:programming))~0.2
may not be the best implementation.
You're effectively getting either the score from the title, or the score from the body for the combined set of terms. The case where you hit java in the title and programming in the body would be given approx. equal weight to a hit on java in the body and no hit on programming.
I think a better structured query would be:
(title:java body:java)~0.2 (title:programming body:programming)~0.2
This makes more sense to me, since you want the dismax queries to limit score growth from multiple hits on the same term (in different fields), but you do want the score to grow for hits on different terms, I believe.
If that sort of query structure gets you better score results, limiting results to a certain minimum score (a percentage of the max score returned, rather than a simple hard-coded value) may be adequate to prevent too-weak results from being seen.
I also still wouldn't count out indexing an "all" field. It's an implementation I've used before, indexing BOTH the specific fields and the catchall field, thus allowing both general querying and specific single-field queries. Index storage tends to be pretty lean for unstored terms, and it will generally help performance if you find yourself having to create big, complicated queries to make up for not having it.
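As a rough illustration of that double indexing (Lucene 3.x API; the field names and the titleText/bodyText variables are just examples):
Document doc = new Document();
doc.add(new Field("title", titleText, Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("body", bodyText, Field.Store.YES, Field.Index.ANALYZED));
// unstored catchall field that simply concatenates the others, used for general queries
doc.add(new Field("all", titleText + " " + bodyText, Field.Store.NO, Field.Index.ANALYZED));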
If you really want to be sure that it takes minimal storage, you can even turn off TermVectors for that field:
new Field(name, value, Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.NO);
Although I don't know how much of a difference that would really make.

In Lucene using FieldInvertState to identify how many times field has been added to document

From the FieldInvertState passed to computeNorm() in the Similarity class, is there any way to find out how many times a particular field was added to a document, to aid me in my normalization calculation?
i.e. can it differentiate between
doc.add(new Field(fielda, "val1"));
doc.add(new Field(fielda, "val2"));
and
doc.add(new Field(fielda, "val1 val2")); // added once, but the analyzer breaks it into two terms
ideally returning a value of 2 in the first case and 1 in the second.
Also see the documentation in Similarity.
Since you yourself know how many 'things' you are adding to this field, you could put this count into a DocValues field and pull it in your Similarity: you don't need the indexer's help.
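A minimal sketch of that idea (assuming a recent Lucene where NumericDocValues is iterated with advanceExact()/longValue(); the field name "fielda_count" is made up for this example):
// index time: you already know how many times you added the field, so store that number alongside it
doc.add(new NumericDocValuesField("fielda_count", 2));

// search time, per segment: read the count back, e.g. from inside your scoring code
NumericDocValues counts = leafReader.getNumericDocValues("fielda_count");
if (counts != null && counts.advanceExact(docId)) {
    long timesAdded = counts.longValue(); // 2 in the first case, 1 in the second
}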
No, but you could use a custom attribute to specify that "val2" was added in a different way.

Lucene TermFrequenciesVector

What do I obtain if I call IndexReader.getTermFreqVector(...) on an index created with the TermVector.YES option?
The documentation already answers this, as Xodorap notes in a comment.
The TermFreqVector object returned lets you retrieve which terms (words produced by your analyzer) a field contains and how many times each of those terms occurs within that field.
You can cast the returned TermFreqVector to the interface TermPositionVector if you index the field using TermVector.WITH_OFFSETS, TermVector.WITH_POSITIONS or TermVector.WITH_POSITIONS_OFFSETS. This gives you access to getTermPositions, which allows you to check where in the field the term exists, and getOffsets, which allows you to check where in the original content the term originated from. The latter allows, combined with Store.YES, highlighting of matching terms in a search query.
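For example, roughly (Lucene 3.x API, classes from org.apache.lucene.index; reader, docId and the field name "contents" are just stand-ins for your own):
TermFreqVector vector = reader.getTermFreqVector(docId, "contents");
String[] terms = vector.getTerms();        // the distinct terms in that field
int[] freqs = vector.getTermFrequencies(); // parallel array of frequencies
for (int i = 0; i < terms.length; i++) {
    System.out.println(terms[i] + " occurs " + freqs[i] + " times in this field");
}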
There are different contributed highlighters available under Contrib area found at the Lucene homepage.
Or you can implement proximity or first-occurrence-type score contributions, which highlighting won't help you with at all.

Fuzzy Queries in Lucene

I am using Lucene in Java and indexing a table in our database based on company name. After indexing, I wish to do a fuzzy match (Levenshtein distance) on a value we wish to input into the database. The reason is that we do not want to be entering dupes because of spelling errors.
For example if I have the company name "Widget Makers XYZ" I don't want to insert "Widget Maker XYZ".
From what I've read, Lucene's fuzzy match algorithm should give me a number between 0 and 1. I want to do some testing and then determine an adequate value for us to decide what is valid or invalid.
The problem is I am stuck, and after searching what seems like everywhere on the internet, need the StackOverflow community's help.
Like I said I have indexed the database on company name, and then have the following code:
IndexSearcher searcher = new IndexSearcher(directory);
new QueryParser(Version.LUCENE_30, "company", analyzer);
Query fuzzy_query = new FuzzyQuery(new Term("company", "Center"));
I encounter the problem afterwards: basically, I do not know how to get the fuzzy match value. I know the code must look something like the following; however, no collectors seem to fit my needs. (As you can see, right now I am only able to count the number of matches, which is useless to me.)
TopScoreDocCollector collector = TopScoreDocCollector.create(10, true);
searcher.search(fuzzy_query, collector);
System.out.println("\ncollector.getTotalHits() = " + collector.getTotalHits());
Also I am unable to use the ComplexPhraseQueryParser class which is shown in the Lucene documentation. I am doing:
import org.apache.lucene.queryParser.*;
Does anybody have an idea as to why it's inaccessible or what I am doing wrong? Apologies for the length of the question.
You do not need Lucene to get the score. Take a look at the Simmetrics library; it is exceedingly simple to use. Just add the jar and use it like this:
Levenstein ld = new Levenstein();
float sim = ld.getSimilarity(string1, string2);
Also note that, depending on the type of data (e.g. longer strings, amount of whitespace, etc.), you might want to look at other algorithms such as Jaro-Winkler, Smith-Waterman, etc.
You could use the above to decide whether to collapse fuzzy duplicate strings into one "master" string and then index that.
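For example, a hypothetical threshold check before inserting a new company name (the 0.9 cutoff is only a placeholder to be tuned with your own testing):
Levenstein ld = new Levenstein();
float sim = ld.getSimilarity("Widget Makers XYZ", "Widget Maker XYZ");
if (sim >= 0.9f) {
    // similarity is close enough: treat the new name as a duplicate and skip the insert
}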
You can get the match values with:
TopDocs topDocs = collector.topDocs();
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    System.out.println(scoreDoc.score);
}
