Lucene - get document ids from term

Lucene - get document ids from term - java

In Lucene 4.1, I see you can use DirectoryReader.docFreq() to get the number of documents in an index containing a given term. Is there a way to actually get those documents? Either the objects or id numbers would be fine. I think AtomicReader.termDocsEnum() would be useful, but I'm not sure if I can use AtomicReader - I don't see how to create an AtomicReader instance on a given directory.

Why not just search for it?
IndexSearcher searcher = new IndexSearcher(directoryReader);
TermQuery query = new TermQuery(new Term("field", "term"));
TopDocs topdocs = searcher.query(query, numberToReturn);

Related

Searching for Terms with whitespace using Lucene

I'm trying to use Lucene to add a search feature but can't seem to get an index to work with significant whitespace. I've got the following test case setup:
RAMDirectory directory = new RAMDirectory();
KeywordAnalyzer analyzer = new KeywordAnalyzer();
IndexWriterConfig config = new IndexWriterConfig(analyzer);
IndexWriter writer = new IndexWriter(directory, config);
Document doc = new Document();
doc.add(new TextField("content", "Bill Evans", Field.Store.NO));
writer.addDocument(doc);
writer.close();
IndexReader reader = DirectoryReader.open(directory);
IndexSearcher searcher = new IndexSearcher(reader);
QueryParser parser = new QueryParser("content", analyzer);
parser.setSplitOnWhitespace(false);
Query query = parser.parse("Bill E");
TopDocs docs = searcher.search(query, 1);
assertTrue(docs.totalHits > 0);
I'm using Lucene 6.6.0 and from what I understand the KeywordAnalyzer is what I'm looking for:
"Tokenizes" the entire stream as a single token. This is useful for data like zip codes, ids, and some product names.
But I can't seem to get any matching documents that contain whitespace.
Any ideas on how to solve this?

When you index, you have a single document with a single field and with a single term with value - Bill Evans
When you are going to search, TermQuery produced by QueryParser tries to search with term value - Bill E and that term obviously doesn't exist in index so you get zero hits.
if you replace your search string with - Bill Evans , you will get results.
Please refer this question too
First , you need to separate your indexing and searching concerns. You can only search what is indexed. If you are indexing full texts without breaking into tokens then at search times - you need to produce WildCardQuery , FuzzyQuery , PhraseQuery etc if your input string at search time is different than what in indexed. TermQuery searches for exact term values.
My suggestion would to be to store full text value ( without tokens - StringField would do that ) as well as generate additional tokens breaking on space using something like - SimpleAnalyzer .
So Something like,
doc.add(new TextField("content", "Bill Evans", Field.Store.NO));
doc.add(new StringField("storedcontent", "Bill Evans", Field.Store.YES));
Above code with SimpleAnalyzer , you will now have terms - bill & evans
( as well as full text as stored field ) and if you now search with same analyzer , your query would be like - content:bill content:e & you will get a result.
All in all - system is working the way you have coded it :)
So understand your requirements first as what you wish to index and what kind of queries you wish to perform on that index.

How to get internal doc id set by lucene

I've indexed some documents in the index module. Intuitively, Lucene set IDs for any indexed document. These IDs may not have a specific order though. Concretely, the first doc ID is set to 127, the second one is set to 133 and so on...
In the search module, I have the document (which I want to process), But I'm trying to get these already-set docIDs (that was set by Lucene in index time) See the code below:
private long calculateProbabilityOfDocument(String topic, Document doc){
Terms termVector = iReader.getTermVector(DOCID, FIELD);
}
EDIT:
I think Lucene may not let me access the internal IDs. Is there any other approach?
Thanks in advance!

I finally could end up finding the solution.
I found out that lucene does not allow access to its internal document IDs. However, we can iterate through the documents and get their TermVector. Seems that it's the only possible way to get term vectors. I'm using the script below:
QueryParser parser = new QueryParser("Body", new EnglishAnalyzer());
Query query = parser.parse(topic);
TopDocs hits = iSearcher.search(query, 1000);
for (int i=0; i<hits.scoreDocs.length; i++){
Terms termVector = iSearcher.getIndexReader().getTermVector(hits.scoreDocs[i].doc, "Body");
Document doc = iSearcher.doc(hits.scoreDocs[i].doc);
documentsList.put(doc, termVector);
}

lucene BooleanQuery.Builder Build doesn't Work

Hello Guys i have a Question :)
I create a BooleanQuery Like this :
BooleanQuery.Builder qry = new BooleanQuery.Builder();
qry.add(new TermQuery(new Term("Name", "Anna")), BooleanClause.Occur.SHOULD);
And if i do a search like this now :
TopDocs docs = searcher.search(qry.build(), hitsPerPage);
it gets Zero Results ? But if I use this code :
TopDocs docs = searcher.search(parser.parse(qry.build().toString()), hitsPerPage);
Then I get the right results ? Can you explain me why I have to parse it again ?
I am using Version 5.5.0 and Name is a TextField

A TextField runs your data through an analyzer and will likely produce the term "anna" (lowercase). A TermQuery does not run anything through an analyzer, so it searches for "Anna" (uppercase) and this does not match. Create the TermQuery with the lowercased term and you should see results: new TermQuery(new Term("Name", "anna")).
The BooleanQuery has nothing to do with this, in fact, this particular query would rewrite itself to the underlying TermQuery, as this is the only subquery.
The parser takes the string "Name:Anna" (produced by the TermQuery), runs it through the analyzer and gives you a "Name:anna" TermQuery, that's why it works if you run the query through the parser – it involves the necessary analyzing step.

Lucene query object and search

I am upgrading from Lucene 3.6 to 5.3.0, but the search doesn't want to take my parameters when using 5.3.0.
This works in 3.6:
IndexSearcher searcher = new IndexSearcher(IndexReader.open(directory));
SimpleAnalyzer analyzer = new SimpleAnalyzer(Version.LUCENE_36);
QueryParser parser = new QueryParser(Version.LUCENE_36, "contents",
analyzer);
TopDocs topDocs = null;
Query query = parser.parse(queryString);
topDocs = searcher.search(query, 1000);
But in 5.3, the compiler is asking me to use SrndQuery, but I still get an error on the searcher.search method:
IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(directory));
Analyzer analyzer = new SimpleAnalyzer();
QueryParser parser = new QueryParser();
TopDocs topDocs = null;
SrndQuery query = QueryParser.parse(queryString);
topDocs = searcher.search(query, 1000);//**The method search(Query, int) in the type IndexSearcher is not applicable for the arguments (SrndQuery, int)**
Not sure what I am doing wrong here. Any ideas?
P.S. I am upgrading because I am not able to get Highlighted text from some PDFs I recently indexed.

It bears stating that you are using the Surround query parser, rather than the standard query parser (if you are intending to use the standard parser, you are importing the wrong one).
The problem you are running into is that a SrndQuery isn't really a lucene query, so you can't just run it into the searcher and get results. You need to transform it into lucene query to search with it. This is done via the SrndQuery.makeLuceneQueryField method. You'll need to create a BasicQueryFactory to pass into it, but they are easy to construct:
SrndQuery query = QueryParser.parse(queryString);
BasicQueryFactory factory = new BasicQueryFactory(1000 /*maxBasicQueries*/);
Query luceneQuery = query.makeLuceneQueryField("myDefaultField", factory);
topDocs = searcher.search(luceneQuery, 1000);
Somewhat Tangential Note: I kinda wondered if you should keep the BasicQueryFactory around, rather than creating a new one for every search, but appears to be unnecessary. Definitely nothing expensive going on in the ctor, and it looks like solr's SurroundQParserPlugin constructs a new one for each query it parses, so doing that should be fine.

Lucene: Multiple words in a single term

Let's say I have a docs like
stringfield:123456
textfield:name website stackoverflow
and If I build a query in the following manner
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_42);
QueryParser luceneQueryParser = new QueryParser(Version.LUCENE_42, "", analyzer);
Query luceneSearchQuery = luceneQueryParser.parse("textfield:\"name website\"");
it will return the doc as expected, but if I build my query using Lucene QueryAPI
PhraseQuery firstNameQuery = new PhraseQuery();
firstNameQuery.add(new Term("textfield","name website"));
it will not give me any result, i will have to tokenize "name website" and add each token in phrasequery.
Is there any default way in QueryAPI to tokenize as it does while parsing a String Query.
Sure I can do that myself but reinvent the wheel if it's already implemented.

You are adding the entire query as a single term to your PhraseQuery. You are on the right track, but when tokenized, that will not be a single term, but rather two. That is, your index has the terms name, website, and stackoverflow, but your query only has one term, which matches none of those name website.
The correct way to use a PhraseQuery, is to add each term to the PhraseQuery separately.
PhraseQuery phrase = new PhraseQuery();
phrase.add(new Term("textfield", "name"));
phrase.add(new Term("textfield", "website"));

When you:
luceneQueryParser.parse("textfield:\"name website\"");
Lucene will tokenize the string "name website", and get 2 terms.
When you:
new Term("textfield","name website")
Lucene will not tokenize the string "name website", instead use the whole as a term.
As the result what you said, when you index the document, the field textfield MUST be Indexed and Tokenized.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.