Searching for Terms with whitespace using Lucene

Searching for Terms with whitespace using Lucene - java

I'm trying to use Lucene to add a search feature but can't seem to get an index to work with significant whitespace. I've got the following test case setup:
RAMDirectory directory = new RAMDirectory();
KeywordAnalyzer analyzer = new KeywordAnalyzer();
IndexWriterConfig config = new IndexWriterConfig(analyzer);
IndexWriter writer = new IndexWriter(directory, config);
Document doc = new Document();
doc.add(new TextField("content", "Bill Evans", Field.Store.NO));
writer.addDocument(doc);
writer.close();
IndexReader reader = DirectoryReader.open(directory);
IndexSearcher searcher = new IndexSearcher(reader);
QueryParser parser = new QueryParser("content", analyzer);
parser.setSplitOnWhitespace(false);
Query query = parser.parse("Bill E");
TopDocs docs = searcher.search(query, 1);
assertTrue(docs.totalHits > 0);
I'm using Lucene 6.6.0 and from what I understand the KeywordAnalyzer is what I'm looking for:
"Tokenizes" the entire stream as a single token. This is useful for data like zip codes, ids, and some product names.
But I can't seem to get any matching documents that contain whitespace.
Any ideas on how to solve this?

When you index, you have a single document with a single field and with a single term with value - Bill Evans
When you are going to search, TermQuery produced by QueryParser tries to search with term value - Bill E and that term obviously doesn't exist in index so you get zero hits.
if you replace your search string with - Bill Evans , you will get results.
Please refer this question too
First , you need to separate your indexing and searching concerns. You can only search what is indexed. If you are indexing full texts without breaking into tokens then at search times - you need to produce WildCardQuery , FuzzyQuery , PhraseQuery etc if your input string at search time is different than what in indexed. TermQuery searches for exact term values.
My suggestion would to be to store full text value ( without tokens - StringField would do that ) as well as generate additional tokens breaking on space using something like - SimpleAnalyzer .
So Something like,
doc.add(new TextField("content", "Bill Evans", Field.Store.NO));
doc.add(new StringField("storedcontent", "Bill Evans", Field.Store.YES));
Above code with SimpleAnalyzer , you will now have terms - bill & evans
( as well as full text as stored field ) and if you now search with same analyzer , your query would be like - content:bill content:e & you will get a result.
All in all - system is working the way you have coded it :)
So understand your requirements first as what you wish to index and what kind of queries you wish to perform on that index.

Related

Lucene Phrase Query not working

I have a String address = "456 SOME STREET";
which I have to search in Lucene, I have created the index for this
StringField address = new StringField(Constants.ORGANIZATION_ADDRESS, address,Field.Store.YES);
And I am using Phrase Query to search this String using below Code
String[] tokens = address.split("\\s+");
PhraseQuery addressQuery = new PhraseQuery(Constants.ORGANIZATION_ADDRESS, tokens);
finalQuery.add(addressQuery, BooleanClause.Occur.MUST);
But its not giving me any result,I have tried TermQuery as well but that is also not working. Would really appreciate any help because I have tried many options now and I am unable to figure out whats wrong
I have also tried below
For Indexing :
doc.add(new StringField(Constants.ORGANIZATION_ADDRESS, address,Field.Store.YES));
Search using Term Query :
fullAddressExact= fullAddressExact.toLowerCase();
TermQuery tq = new TermQuery(new Term(Constants.ORGANIZATION_ADDRESS,fullAddressExact));
finalQuery.add(tq, BooleanClause.Occur.MUST);
Even this doesnt give any result. My intention to get the exact match

You should probably use TextField, not StringField when indexing the documents.
StringField stores the string as is, without breaking it into tokens, so in your example the index will contain "456 SOME STREET". Only a TermQuery with this term will retrieve it (or a PrefixQuery).
TextField is the standard field when indexing text, it splits the text into tokens (using a Tokenizer) and indexes the words separately, in your example, 456, SOME, STREET can all be used to find the document.
Read more about it here (a bit old, but relevant).

How to get internal doc id set by lucene

I've indexed some documents in the index module. Intuitively, Lucene set IDs for any indexed document. These IDs may not have a specific order though. Concretely, the first doc ID is set to 127, the second one is set to 133 and so on...
In the search module, I have the document (which I want to process), But I'm trying to get these already-set docIDs (that was set by Lucene in index time) See the code below:
private long calculateProbabilityOfDocument(String topic, Document doc){
Terms termVector = iReader.getTermVector(DOCID, FIELD);
}
EDIT:
I think Lucene may not let me access the internal IDs. Is there any other approach?
Thanks in advance!

I finally could end up finding the solution.
I found out that lucene does not allow access to its internal document IDs. However, we can iterate through the documents and get their TermVector. Seems that it's the only possible way to get term vectors. I'm using the script below:
QueryParser parser = new QueryParser("Body", new EnglishAnalyzer());
Query query = parser.parse(topic);
TopDocs hits = iSearcher.search(query, 1000);
for (int i=0; i<hits.scoreDocs.length; i++){
Terms termVector = iSearcher.getIndexReader().getTermVector(hits.scoreDocs[i].doc, "Body");
Document doc = iSearcher.doc(hits.scoreDocs[i].doc);
documentsList.put(doc, termVector);
}

Lucene: Multiple words in a single term

Let's say I have a docs like
stringfield:123456
textfield:name website stackoverflow
and If I build a query in the following manner
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_42);
QueryParser luceneQueryParser = new QueryParser(Version.LUCENE_42, "", analyzer);
Query luceneSearchQuery = luceneQueryParser.parse("textfield:\"name website\"");
it will return the doc as expected, but if I build my query using Lucene QueryAPI
PhraseQuery firstNameQuery = new PhraseQuery();
firstNameQuery.add(new Term("textfield","name website"));
it will not give me any result, i will have to tokenize "name website" and add each token in phrasequery.
Is there any default way in QueryAPI to tokenize as it does while parsing a String Query.
Sure I can do that myself but reinvent the wheel if it's already implemented.

You are adding the entire query as a single term to your PhraseQuery. You are on the right track, but when tokenized, that will not be a single term, but rather two. That is, your index has the terms name, website, and stackoverflow, but your query only has one term, which matches none of those name website.
The correct way to use a PhraseQuery, is to add each term to the PhraseQuery separately.
PhraseQuery phrase = new PhraseQuery();
phrase.add(new Term("textfield", "name"));
phrase.add(new Term("textfield", "website"));

When you:
luceneQueryParser.parse("textfield:\"name website\"");
Lucene will tokenize the string "name website", and get 2 terms.
When you:
new Term("textfield","name website")
Lucene will not tokenize the string "name website", instead use the whole as a term.
As the result what you said, when you index the document, the field textfield MUST be Indexed and Tokenized.

Lucene - get document ids from term

In Lucene 4.1, I see you can use DirectoryReader.docFreq() to get the number of documents in an index containing a given term. Is there a way to actually get those documents? Either the objects or id numbers would be fine. I think AtomicReader.termDocsEnum() would be useful, but I'm not sure if I can use AtomicReader - I don't see how to create an AtomicReader instance on a given directory.

Why not just search for it?
IndexSearcher searcher = new IndexSearcher(directoryReader);
TermQuery query = new TermQuery(new Term("field", "term"));
TopDocs topdocs = searcher.query(query, numberToReturn);

How to index date field in lucene

I am new to lucene. I have to index date field.
i am using Following IndexWriter constructor in lucene 3.0.0.
IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir), new WhitespaceAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED)
my point is:
Why it needs a analyzer when date fields are not analyzed,while indexing I used Field.Index.NOT_ANALYZED.

You can store date field in this fashion..
Document doc = new Document();
doc.add(new Field("modified",
DateTools.timeToString(f.lastModified(), DateTools.Resolution.MINUTE),
Field.Store.YES, Field.Index.NOT_ANALYZED));
where f is a file object...
Now use the above document for indexwriter...
checkout the sample code comes with lucene... and the following link...
http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/document/DateTools.html
UPDATE
Field.Index NOT_ANALYZED
Index the field's value without using
an Analyzer, so it can be searched. As
no analyzer is used the value will be
stored as a single term. This is
useful for unique Ids like product
numbers.
As per lucene javadoc you don't need analyzer for fields using Field.Index NOT_ANALYZED but i think by design the IndexWriter expects an analyzer as indexing the exact replica of data is not efficient in terms of storage and searching.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.