Lucene TermFrequenciesVector - java

What do I obtain if I call IndexReader.getTermFrequenciesVector(...) on an index created with the TermVector.YES option?

The documentation already answers this, as Xodorap notes in a comment.
The TermFreqVector object returned can retrieve which terms (words produced by your analyzer) a field contains and how many times each of those terms exists within that field.
You can cast the returned TermFreqVector to the interface TermPositionVector if you index the field using TermVector.WITH_OFFSETS, TermVector.WITH_POSITIONS or TermVector.WITH_POSITIONS_OFFSETS. This gives you access to GetTermPositions, which allows you to check where in the field the term exists, and GetOffsets, which allows you to check where in the original content the term originated from. The latter, combined with Store.YES, allows highlighting of matching terms in a search query.
Several contributed highlighters are available in the Contrib area on the Lucene homepage.

You can also use the positions to implement proximity or first-occurrence score contributions, which highlighting won't help you with at all.
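To make the idea of a term vector concrete, here is a minimal sketch in plain Java that records, for one field, each term's frequency and token positions. It is a conceptual model only and does not use the Lucene API: the class TermVectorSketch, its methods, and the toy "analyzer" (lowercase, split on non-alphanumerics) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Conceptual sketch only: mimics the information a term vector holds.
// This is NOT Lucene code; class and method names are hypothetical.
class TermVectorSketch {

    // Stand-in for an analyzer: lowercase, then split on non-letter/digit runs.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Maps each term to its token positions in the field.
    // The size of a position list is that term's frequency.
    static Map<String, List<Integer>> positions(String fieldText) {
        Map<String, List<Integer>> result = new LinkedHashMap<>();
        List<String> tokens = analyze(fieldText);
        for (int pos = 0; pos < tokens.size(); pos++) {
            result.computeIfAbsent(tokens.get(pos), k -> new ArrayList<>()).add(pos);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> vector =
                positions("Lucene indexes text; Lucene searches text fast");
        vector.forEach((term, pos) ->
                System.out.println(term + " freq=" + pos.size() + " positions=" + pos));
    }
}
```

A proximity or first-occurrence score could be derived from the position lists, for example by comparing the smallest positions of two terms.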

Related

Lucene searching over stored and unstored Fields concurrently

I'm working with Lucene 7.4 and have indexed a sample of txt files.
I have some Fields that have been stored, such as path and filename,
and a content Field, which was unstored before passing the doc to the IndexWriter.
Consequently my content Field contains the processed (e.g. tokenized, stemmed) content data of the file, while my filename Field contains the unprocessed filename as one entire String.
try (InputStream stream = Files.newInputStream(file)) {
    // create empty document
    Document doc = new Document();
    // add the last modification time field
    Field lastModField = new StoredField(LuceneConstants.LAST_MODIFICATION_TIME, Files.getAttribute(file, "lastModifiedTime", LinkOption.NOFOLLOW_LINKS).toString());
    doc.add(lastModField);
    // add the path Field
    Field pathField = new StringField(LuceneConstants.FILE_PATH, file.toString(), Field.Store.YES);
    doc.add(pathField);
    // add the name Field
    doc.add(new StringField(LuceneConstants.FILE_NAME, file.getFileName().toString(), Field.Store.YES));
    // add the content
    doc.add(new TextField(LuceneConstants.CONTENTS, new BufferedReader(new InputStreamReader(stream))));
    System.out.println("adding " + file);
    writer.addDocument(doc);
}
Now, as far as I understand, I have to use 2 QueryParsers, since I need to use 2 different Analyzers for searching over both fields, one for each.
I can't figure out how to combine them.
What I want is a TopDocs wherein the results are ordered by a relevance score that is some combination of the 2 relevance scores from the search over the filename Field and the search over the content Field.
Does Lucene 7.4 provide you with the means for an easy solution to this?
PS: This is my first post in a long time, if not ever. Please remark any formatting or content issues.
EDIT:
Analyzer used for indexing content Field and for searching content Field:
Analyzer myTxtAnalyzer = CustomAnalyzer.builder()
    .withTokenizer("standard")
    .addTokenFilter("lowercase")
    .addTokenFilter("stop")
    .addTokenFilter("porterstem")
    .build();
And I'm using the KeywordAnalyzer to search over the filename Field, which, to reiterate, is stored, so not analyzed.
My program is supposed to index files and search over that index, retrieving
a list of the most relevant documents. If a searchString, which may contain whitespaces, exactly matches the fileName,
I'd like that to heavily impact my search results.
I'm a computer science student, and this is my first project with Lucene.
If there are no functions available, it's all good. What I'm asking for is not a requirement for my task. I'm just pondering and I feel like this is something there might already exist a simple solution for. But I can't seem to find it, if it exists.
EDIT 2:
I had a misconception about what happens when using Store.YES/.NO.
My problem has nothing to do with it.
The String wasn't tokenized, because it was in a StringField.
I assumed it was because it was stored.
However, my question remains.
Is there a way to search over tokenized and untokenized Fields concurrently?
As @andrewjames mentions, you don't need to use multiple analyzers in your example because only the TextField gets analyzed; the StringFields do not. If you had a situation where you did need to use different analyzers for different fields, Lucene can accommodate that. To do so you use a PerFieldAnalyzerWrapper, which basically lets you specify a default Analyzer and then as many field-specific analyzers as you like (passed to PerFieldAnalyzerWrapper as a Map from field name to Analyzer). Then, when analyzing the doc, it will use the field-specific analyzer if one was specified and, if not, the default analyzer you specified for the PerFieldAnalyzerWrapper.
Whether using a single analyzer or using multiple via PerFieldAnalyzerWrapper, you only need one QueryParser and you will pass that parser either the one analyzer or the PerFieldAnalyzerWrapper which is an analyzer that wraps several analyzers.
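The dispatch rule described above (field-specific analyzer if registered, otherwise the default) can be sketched in plain Java. This is a conceptual model, not the Lucene class: PerFieldDispatchSketch and its toy "analyzers" are hypothetical names, with a whitespace tokenizer standing in for a tokenizing analyzer and a single-token function standing in for a keyword analyzer.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Conceptual sketch of per-field analyzer dispatch. NOT the Lucene class;
// all names here are hypothetical stand-ins.
class PerFieldDispatchSketch {
    private final Function<String, List<String>> defaultAnalyzer;
    private final Map<String, Function<String, List<String>>> perField;

    PerFieldDispatchSketch(Function<String, List<String>> defaultAnalyzer,
                           Map<String, Function<String, List<String>>> perField) {
        this.defaultAnalyzer = defaultAnalyzer;
        this.perField = new HashMap<>(perField);
    }

    // Use the field-specific analyzer if one is registered, else the default.
    List<String> analyze(String field, String text) {
        return perField.getOrDefault(field, defaultAnalyzer).apply(text);
    }

    // Stand-in for a tokenizing analyzer: lowercase, split on whitespace.
    static List<String> tokenizing(String text) {
        return Arrays.asList(text.trim().toLowerCase(Locale.ROOT).split("\\s+"));
    }

    // Stand-in for a keyword analyzer: the whole input is one token.
    static List<String> keyword(String text) {
        return Collections.singletonList(text);
    }

    // Example wiring: "filename" gets the keyword stand-in, everything else is tokenized.
    static PerFieldDispatchSketch example() {
        Map<String, Function<String, List<String>>> perField = new HashMap<>();
        perField.put("filename", PerFieldDispatchSketch::keyword);
        return new PerFieldDispatchSketch(PerFieldDispatchSketch::tokenizing, perField);
    }

    public static void main(String[] args) {
        PerFieldDispatchSketch wrapper = example();
        System.out.println(wrapper.analyze("filename", "My Report.txt"));
        System.out.println(wrapper.analyze("contents", "My Report"));
    }
}
```

Because the wrapper exposes the same analyze operation as a single analyzer, one query parser can be handed the wrapper and still treat each field differently.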
The fact that some of your fields are stored and some are not stored has no impact on searching them. The only thing that matters for the search is that the field is indexed, and both StringFields and TextFields are always indexed.
You mention the following:
And I'm using the KeywordAnalyzer to search over the filename Field, which, to reiterate, is stored, so not analyzed.
Whether a field is stored or not has nothing to do with whether it's analyzed. For the filename field your code is using a StringField with Field.Store.YES. Because it's a StringField it will be indexed BUT not analyzed, and because you specified to store the field it will be stored. So since the field is NOT analyzed, it won't be using the KeywordAnalyzer or any other analyzer :-)
Is there a way to search over tokenized and untokenized Fields concurrently?
The real issue here isn't about searching tokenized and untokenized fields concurrently; it's really just about searching multiple fields concurrently. The fact that one is tokenized and one is not is of no consequence for Lucene. To search multiple fields at once you can use a BooleanQuery: you add multiple subqueries to it, one for each field, and specify an AND (MUST) or an OR (SHOULD) relationship between them.
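As a rough illustration of how MUST and SHOULD clauses combine, here is a plain-Java sketch of the matching rule only (no scoring). It is not Lucene code: BooleanCombineSketch, Clause, and the sample document are hypothetical, and the rule modeled is the default one, where every MUST clause has to match and, if there are no MUST clauses, at least one SHOULD clause has to match.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Conceptual sketch of MUST/SHOULD matching across fields. NOT the Lucene API.
class BooleanCombineSketch {
    enum Occur { MUST, SHOULD }

    // One single-term subquery against one field.
    static final class Clause {
        final String field;
        final String term;
        final Occur occur;

        Clause(String field, String term, Occur occur) {
            this.field = field;
            this.term = term;
            this.occur = occur;
        }

        boolean hits(Map<String, Set<String>> doc) {
            return doc.getOrDefault(field, Collections.emptySet()).contains(term);
        }
    }

    // All MUST clauses have to hit; with no MUST clauses, one SHOULD hit is enough.
    static boolean matches(List<Clause> clauses, Map<String, Set<String>> doc) {
        boolean hasMust = false;
        boolean anyShould = false;
        for (Clause c : clauses) {
            boolean hit = c.hits(doc);
            if (c.occur == Occur.MUST) {
                hasMust = true;
                if (!hit) {
                    return false;
                }
            } else if (hit) {
                anyShould = true;
            }
        }
        return hasMust || anyShould;
    }

    // Hypothetical document: untokenized filename, tokenized and stemmed contents.
    static Map<String, Set<String>> sampleDoc() {
        Map<String, Set<String>> doc = new HashMap<>();
        doc.put("filename", new HashSet<>(Arrays.asList("My Report.txt")));
        doc.put("contents", new HashSet<>(Arrays.asList("quarterli", "report")));
        return doc;
    }

    public static void main(String[] args) {
        List<Clause> either = Arrays.asList(
                new Clause("filename", "My Report.txt", Occur.SHOULD),
                new Clause("contents", "report", Occur.SHOULD));
        System.out.println(matches(either, sampleDoc()));
    }
}
```

In Lucene itself, a document matching several SHOULD clauses generally scores higher than one matching a single clause, which is what produces a combined relevance ranking across fields.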
I hope this helps clear things up for you.

Subtle difference when searching multi value fields in Solr

I have a very simple question but I don't understand exactly why it happens and what the difference is.
Take a simple Solr search on a multi value field:
field_name:ABC AND DEF
field_name:(ABC AND DEF)
They return quite different results. I understand the brackets are for grouping but I don't understand the difference. It seems quite subtle.
Many thanks.
The first query isn't doing what you think it's doing.
field_name:ABC AND DEF
This is parsed as:
field_name:ABC AND <default search field>:DEF
This is different from your second example, which is parsed as:
field_name:ABC AND field_name:DEF
In the first example the second part of your query is made against whatever field is defined as the default search field in your index (or in the query itself, if you've set df).
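The field-assignment rule can be mimicked with a toy Java function that qualifies each term of these two queries. It is a sketch under narrow assumptions, not Solr's parser: DefaultFieldSketch is a hypothetical name, it only understands whitespace-separated terms, AND/OR, and a single field:( ... ) group, and the default field name "text" is just an example value.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration of how a field prefix binds to terms. NOT Solr's query parser;
// it handles only the two query shapes discussed above.
class DefaultFieldSketch {

    // Returns each term as "field:term", using defaultField for unqualified terms.
    static List<String> qualify(String query, String defaultField) {
        List<String> clauses = new ArrayList<>();
        String groupField = null; // field that applies inside field:( ... )
        for (String token : query.trim().split("\\s+")) {
            if (token.equals("AND") || token.equals("OR")) {
                continue; // operators carry no field
            }
            boolean closesGroup = token.endsWith(")");
            if (closesGroup) {
                token = token.substring(0, token.length() - 1);
            }
            int open = token.indexOf(":(");
            if (open >= 0) {
                groupField = token.substring(0, open);
                token = token.substring(open + 2);
            }
            if (groupField != null) {
                clauses.add(groupField + ":" + token);   // inside a group
            } else if (token.contains(":")) {
                clauses.add(token);                      // explicitly qualified
            } else {
                clauses.add(defaultField + ":" + token); // falls back to default field
            }
            if (closesGroup) {
                groupField = null;
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        System.out.println(qualify("field_name:ABC AND DEF", "text"));
        System.out.println(qualify("field_name:(ABC AND DEF)", "text"));
    }
}
```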

Lucene: Searching multiple fields with default operator = AND

To allow users to search across multiple fields with Lucene 3.5 I currently create and add a QueryParser for each field to be searched to a DisjunctionMaxQuery. This works great when using OR as the default operator but I now want to change the default operator to AND to get more accurate (and fewer) results.
The problem is that queryParser.setDefaultOperator(QueryParser.AND_OPERATOR) misses many documents, since all terms must be in at least 1 field.
For example, consider the following data for a document: title field = "Programming Languages", body field = "Java, C++, PHP". If a user were to search for Java Programming, this particular document would not be included in the results, since neither the title nor the body field contains all terms in the query, although combined they do. I would want this document returned for the above query but not for the query HTML Programming.
I've considered a catchall field but I have a few problems with it. First, users frequently include per field terms in their queries (author:bill) which is not possible with a catchall field. Also, I highlight certain fields with FastVectorHighlighter which requires them to be indexed and stored. So by adding a catchall field I would have to index most of the same data twice which is time and space consuming.
Any ideas?
Guess I should have done a little more research. Turns out MultiFieldQueryParser provides the exact functionality I was looking for. For whatever reason I was creating a QueryParser for each field I wanted to search like this:
String[] fields = {"title", "body", "subject", "author"};
QueryParser[] parsers = new QueryParser[fields.length];
for (int i = 0; i < parsers.length; i++)
{
    parsers[i] = new QueryParser(Version.LUCENE_35, fields[i], analyzer);
    parsers[i].setDefaultOperator(QueryParser.AND_OPERATOR);
}
This would result in a query like this:
(+title:java +title:programming) | (+body:java +body:programming)
...which is not what I was looking for. Now I create a single MultiFieldQueryParser like this:
MultiFieldQueryParser parser = new MultiFieldQueryParser(Version.LUCENE_35, new String[]{"title", "body", "subject"}, analyzer);
parser.setDefaultOperator(QueryParser.AND_OPERATOR);
This gives me the query I was looking for:
+(title:java body:java) +(title:programming body:programming)
Thanks to @seeta and @femtoRgon for the help!
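The difference between the two query shapes can be checked against the example document from the question with a small plain-Java model. This is an illustrative sketch, not Lucene code: MultiFieldAndSketch and its methods are hypothetical, and a document is modeled simply as a map from field name to its set of lowercased terms.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Conceptual model of the two query shapes. NOT Lucene code; names are hypothetical.
class MultiFieldAndSketch {

    // The example document from the question, as field -> analyzed terms.
    static Map<String, Set<String>> sampleDoc() {
        Map<String, Set<String>> doc = new HashMap<>();
        doc.put("title", new HashSet<>(Arrays.asList("programming", "languages")));
        doc.put("body", new HashSet<>(Arrays.asList("java", "c++", "php")));
        return doc;
    }

    // Models (+title:a +title:b) | (+body:a +body:b):
    // some single field must contain every term.
    static boolean perFieldAnd(Map<String, Set<String>> doc, List<String> terms) {
        for (Set<String> fieldTerms : doc.values()) {
            if (fieldTerms.containsAll(terms)) {
                return true;
            }
        }
        return false;
    }

    // Models +(title:a body:a) +(title:b body:b):
    // every term must occur in at least one field.
    static boolean perTermAnd(Map<String, Set<String>> doc, List<String> terms) {
        for (String term : terms) {
            boolean found = false;
            for (Set<String> fieldTerms : doc.values()) {
                if (fieldTerms.contains(term)) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> doc = sampleDoc();
        System.out.println(perFieldAnd(doc, Arrays.asList("java", "programming")));
        System.out.println(perTermAnd(doc, Arrays.asList("java", "programming")));
        System.out.println(perTermAnd(doc, Arrays.asList("html", "programming")));
    }
}
```

Under this model the document is found for "java programming" only with the per-term shape, and it is rejected for "html programming" in both, which is the behavior asked for in the question.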
Perhaps what you need is a combination of Boolean queries that capture the different combinations of fields and terms. In your given example, the query could be -
(title:Java AND body:programming) OR (title:programming AND body:Java).
I don't know if there's an existing Query class that generates this automatically for you, but I think that's what should be the ultimate query that's run on the index.
If you want to be able to search multiple fields with the same set of terms, then the query from your comment:
((title:java title:programming) | (body:java body:programming))~0.2
may not be the best implementation.
You're effectively getting either the score from the title, or the score from the body for the combined set of terms. The case where you hit java in the title and programming in the body would be given approx. equal weight to a hit on java in the body and no hit on programming.
I think a better structured query would be:
(title:java body:java)~0.2 (title:programming body:programming)~0.2
This makes more sense to me, since you want the dismax queries to limit score growing on multiple queries of the same term (in different fields), but you do want scoring to grow for hits on different terms, I believe.
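The scoring argument can be made concrete with a simplified calculation. A DisjunctionMaxQuery scores a document as the maximum sub-score plus the tie-breaker times the other sub-scores, and optional boolean clauses add their scores. The sketch below applies just those two rules in plain Java; DisMaxScoreSketch and the per-term scores of 1.0 are hypothetical, and real Lucene scoring includes further factors (such as coord and norms) that are ignored here.

```java
// Simplified score arithmetic for the two query structures. NOT Lucene code;
// the sub-scores used in main are made-up illustrative values.
class DisMaxScoreSketch {

    // Max sub-score plus tieBreaker times the sum of the remaining sub-scores.
    static double disMax(double tieBreaker, double... subScores) {
        double max = 0.0;
        double sum = 0.0;
        for (double s : subScores) {
            max = Math.max(max, s);
            sum += s;
        }
        return max + tieBreaker * (sum - max);
    }

    // Optional boolean clauses: matching clause scores are added together.
    static double booleanSum(double... clauseScores) {
        double sum = 0.0;
        for (double s : clauseScores) {
            sum += s;
        }
        return sum;
    }

    // Structure A: ((title:java title:programming) | (body:java body:programming))~tie
    static double perFieldDisMax(double tie, double titleJava, double titleProg,
                                 double bodyJava, double bodyProg) {
        return disMax(tie, booleanSum(titleJava, titleProg), booleanSum(bodyJava, bodyProg));
    }

    // Structure B: (title:java body:java)~tie (title:programming body:programming)~tie
    static double perTermDisMax(double tie, double titleJava, double titleProg,
                                double bodyJava, double bodyProg) {
        return booleanSum(disMax(tie, titleJava, bodyJava), disMax(tie, titleProg, bodyProg));
    }

    public static void main(String[] args) {
        // Doc 1: "programming" in title, "java" in body. Doc 2: only "java" in body.
        System.out.println("A doc1=" + perFieldDisMax(0.2, 0, 1, 1, 0)
                + " doc2=" + perFieldDisMax(0.2, 0, 0, 1, 0));
        System.out.println("B doc1=" + perTermDisMax(0.2, 0, 1, 1, 0)
                + " doc2=" + perTermDisMax(0.2, 0, 0, 1, 0));
    }
}
```

With these illustrative inputs, structure A separates the two documents only by the tie-breaker contribution, while structure B gives the document that matches both terms roughly twice the score.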
If that sort of query structure gets you better score results, limiting results to a certain minimum score (a percentage of the max score returned, rather than a simple hard-coded value) may be adequate to prevent too-weak results from being seen.
I also still wouldn't count out indexing an all field. It's an implementation I've used before, while indexing BOTH the specific field and the catchall field, thus allowing both general querying and specific single-field queries. Index storage tends to be pretty lean for unstored terms, and it will generally help performance, if you find yourself having to create big, complicated queries to make up for not having it.
If you really want to be sure that it takes minimal storage, you can even turn off TermVectors for that field:
new Field(name, value, Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.NO);
Although I don't know how much of a difference that would really make.

Which is the best choice for indexing a Boolean value in Lucene?

I am indexing a Boolean value (true/false) in Lucene (no need to store it).
I want lower disk space usage and higher search performance.
doc.add(new Field("boolean","true",Field.Store.NO,Field.Index.NOT_ANALYZED_NO_NORMS));
//or
doc.add(new Field("boolean","1",Field.Store.NO,Field.Index.NOT_ANALYZED_NO_NORMS));
//or
doc.add(new NumericField("boolean",Integer.MAX_VALUE,Field.Store.NO,true).setIntValue(1));
Which should I choose? Or any other better way?
thanks a lot
An interesting question!
I don't think the third option (NumericField) is a good choice for a boolean field. I can't think of any use case for this.
The Lucene search index (leaving to one side stored data, which you aren't using anyway) is stored as an inverted index, leaving your first and second options as (theoretically) identical.
If I were faced with this choice, I think I would choose option one ("true" and "false" terms), if that influences the final decision.
Your choice of NOT_ANALYZED_NO_NORMS looks good, I think.
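A toy inverted index shows why the term text barely matters here: each distinct term appears once as a key, and the per-document data is the list of document ids behind it. The sketch below is plain Java, not Lucene's index format; BooleanFieldIndexSketch and its method are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy inverted index for a two-valued field. NOT Lucene's on-disk format.
class BooleanFieldIndexSketch {

    // Builds term -> ascending doc ids, using the given texts for true and false.
    static Map<String, List<Integer>> index(boolean[] flags, String trueTerm, String falseTerm) {
        Map<String, List<Integer>> postings = new TreeMap<>();
        for (int docId = 0; docId < flags.length; docId++) {
            String term = flags[docId] ? trueTerm : falseTerm;
            postings.computeIfAbsent(term, k -> new ArrayList<>()).add(docId);
        }
        return postings;
    }

    public static void main(String[] args) {
        boolean[] flags = {true, false, true, true};
        System.out.println(index(flags, "true", "false"));
        System.out.println(index(flags, "1", "0"));
    }
}
```

Both encodings produce the same posting lists; only the two dictionary keys differ.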
Lucene jumps through an elaborate set of hoops to make NumericField searchable by NumericRangeQuery, so definitely avoid it in all cases where your values don't represent quantities. For example, even if you index an integer, but only as a unique ID, you would still want to use a plain String field. Using "true"/"false" is the most natural way to index a boolean, while using "1"/"0" gives just a slight advantage by avoiding the possibility of case mismatch or typo. I'd say this advantage is not worth much and go for true/false.
Use Solr (a search server built on Lucene) - it indexes all basic Java types natively.
I've used it and it rocks.

In Lucene using FieldInvertState to identify how many times field has been added to document

From the FieldInvertState class passed to computeNorm() in the Similarity class, is there any way to find out how many times a particular field was added to a document, to aid me in my normalization calculation?
i.e. can it differentiate between
doc.add(new Field(fielda, "val1"));
doc.add(new Field(fielda, "val2"));
and
doc.add(new Field(fielda, "val1 val2")); // added once but analyzer breaks into two terms
ideally returning a value of 2 in the first case and 1 in the second
Also see the documentation in Similarity.
Since you yourself know how many 'things' you are adding to this field, you could put this count into a DocValues field and pull it in your Similarity: you don't need the indexer's help.
No, but you could use a custom attribute to specify that "val2" was added in a different way.
