Why doesn't Lucene find any documents with this code? - java

I am working on this piece of code, which adds a single document to a Lucene (4.7) index and then tries to find it by querying a term that definitely exists in the document. But indexSearcher does not return any documents. What is wrong with my code? Thank you for your comments and feedback.
String indexDir = "/home/richard/luc_index_03";
try {
    Directory directory = new SimpleFSDirectory(new File(indexDir));
    Analyzer analyzer = new SimpleAnalyzer(Version.LUCENE_47);
    IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_47, analyzer);
    conf.setOpenMode(OpenMode.CREATE_OR_APPEND);
    conf.setRAMBufferSizeMB(256.0);
    IndexWriter indexWriter = new IndexWriter(directory, conf);

    Document doc = new Document();
    String title = "New York is an awesome city to live!";
    doc.add(new StringField("title", title, StringField.Store.YES));
    indexWriter.addDocument(doc);
    indexWriter.commit();
    indexWriter.close();
    directory.close();

    IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(indexDir)));
    IndexSearcher indexSearcher = new IndexSearcher(reader);
    String field = "title";
    SimpleQueryParser qParser = new SimpleQueryParser(analyzer, field);
    String queryText = "New York";
    Query query = qParser.parse(queryText);
    int hitsPerPage = 100;
    TopDocs results = indexSearcher.search(query, 5 * hitsPerPage);
    System.out.println("number of results: " + results.totalHits);
    ScoreDoc[] hits = results.scoreDocs;
    int numTotalHits = results.totalHits;
    for (ScoreDoc scoreDoc : hits) {
        Document docC = indexSearcher.doc(scoreDoc.doc);
        String path = docC.get("path");
        String titleC = docC.get("title");
        String ne = docC.get("ne");
        System.out.println(path + "\n" + titleC + "\n" + ne);
        System.out.println("---*****----");
    }
} catch (IOException e) {
    e.printStackTrace();
}
After running, I just get:
number of results: 0

This is because you use StringField. From the Javadoc:
A field that is indexed but not tokenized: the entire String value is indexed as a single token.
Just use TextField instead and you should be fine.
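For illustration, here is a minimal self-contained sketch of that fix against the same Lucene 4.7 APIs. A RAMDirectory stands in for the on-disk index, and the class and helper names are ours, not from the question:

```java
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.simple.SimpleQueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class TextFieldDemo {

    // Index one document with TextField and count the hits for "New York".
    static int countHits() throws IOException {
        Directory directory = new RAMDirectory(); // in-memory stand-in for the on-disk index
        Analyzer analyzer = new SimpleAnalyzer(Version.LUCENE_47);
        IndexWriter writer = new IndexWriter(directory,
                new IndexWriterConfig(Version.LUCENE_47, analyzer));
        Document doc = new Document();
        // TextField is analyzed, so the title is indexed as the separate terms
        // [new, york, is, an, awesome, city, to, live]
        doc.add(new TextField("title", "New York is an awesome city to live!", Field.Store.YES));
        writer.addDocument(doc);
        writer.close();

        IndexReader reader = DirectoryReader.open(directory);
        IndexSearcher searcher = new IndexSearcher(reader);
        Query query = new SimpleQueryParser(analyzer, "title").parse("New York");
        int hits = searcher.search(query, 10).totalHits;
        reader.close();
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // 1 with TextField; the same code with StringField prints 0
        System.out.println("number of results: " + countHits());
    }
}
```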

Related

TermQuery not giving expected result as QueryParser - Lucene 7.4.0

I am indexing 10 text documents using StandardAnalyzer.
public static void indexDoc(final IndexWriter writer, Path filePath, long timestamp) {
    try (InputStream iStream = Files.newInputStream(filePath)) {
        Document doc = new Document();
        Field pathField = new StringField("path", filePath.toString(), Field.Store.YES);
        Field flagField = new TextField("ashish", "i am stored", Field.Store.YES);
        LongPoint lastModified = new LongPoint("last_modified", timestamp);
        Field content = new TextField("content",
                new BufferedReader(new InputStreamReader(iStream, StandardCharsets.UTF_8)));
        doc.add(pathField);
        doc.add(lastModified);
        doc.add(content);
        doc.add(flagField);
        if (writer.getConfig().getOpenMode() == OpenMode.CREATE) {
            System.out.println("Adding " + filePath.toString());
            writer.addDocument(doc);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
The snippet above is the code used to index a document.
For testing purposes, I am searching a field called 'ashish'.
When I use QueryParser, Lucene gives the search results as expected.
public static void main(String[] args) throws Exception {
    String index = "E:\\Lucene\\Index";
    String field = "ashish";
    int hitsPerPage = 10;
    IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(index)));
    IndexSearcher searcher = new IndexSearcher(reader);
    Analyzer analyzer = new StandardAnalyzer();
    QueryParser parser = new QueryParser(field, analyzer);
    String line = "i am stored";
    Query query = parser.parse(line);
    // Query q = new TermQuery(new Term("ashish","i am stored"));
    System.out.println("Searching for: " + query.toString());
    TopDocs results = searcher.search(query, 5 * hitsPerPage);
    ScoreDoc[] hits = results.scoreDocs;
    int numTotalHits = Math.toIntExact(results.totalHits);
    System.out.println(numTotalHits + " total matching documents");
    for (int i = 0; i < numTotalHits; i++) {
        Document doc = searcher.doc(hits[i].doc);
        String path = doc.get("path");
        String content = doc.get("ashish");
        System.out.println(path + "\n" + content);
    }
}
The code above demonstrates the use of QueryParser to retrieve the desired field, and it works properly: it hits all 10 documents, since I store this field for all 10 documents. All good here.
However, when I use the TermQuery API, I do not get the desired result.
Here is the code change I made for TermQuery:
public static void main(String[] args) throws Exception {
    String index = "E:\\Lucene\\Index";
    String field = "ashish";
    int hitsPerPage = 10;
    IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(index)));
    IndexSearcher searcher = new IndexSearcher(reader);
    Analyzer analyzer = new StandardAnalyzer();
    // QueryParser parser = new QueryParser(field, analyzer);
    String line = "i am stored";
    // Query query = parser.parse(line);
    Query q = new TermQuery(new Term("ashish", "i am stored"));
    System.out.println("Searching for: " + q.toString());
    TopDocs results = searcher.search(q, 5 * hitsPerPage);
    ScoreDoc[] hits = results.scoreDocs;
    int numTotalHits = Math.toIntExact(results.totalHits);
    System.out.println(numTotalHits + " total matching documents");
    for (int i = 0; i < numTotalHits; i++) {
        Document doc = searcher.doc(hits[i].doc);
        String path = doc.get("path");
        String content = doc.get("ashish");
        System.out.println(path + "\n" + content);
        System.out.println("----------------------------------------------------------------------------------");
    }
}
I did some research on Stack Overflow itself (for example, Lucene TermQuery and QueryParser) but did not find a practical solution, and the Lucene versions in those examples were very old.
I would appreciate some help.
Thanks in advance!
I found the answer to my question in this post:
link that explains how TermQuery works
TermQuery searches for the entire string as-is. This behavior gives improper results, because data is usually tokenized during indexing.
In the posted code, I was passing the entire search string to TermQuery:
Query q = new TermQuery(new Term("ashish", "i am stored"));
In that case, Lucene looks for the literal term "i am stored", which it will never find, because that string was tokenized during indexing.
Instead, I searched for a single term: Query q = new TermQuery(new Term("ashish", "stored"));
This query gave me the expected results.
thanks,
Ashish
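To make this concrete, here is a hedged, self-contained sketch of the behavior described above: one document whose "ashish" field is analyzed into the terms [i, am, stored], queried two ways. The class and helper names are ours; a RAMDirectory replaces the on-disk index:

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class TermQueryDemo {

    // Index one document; the analyzed "ashish" field becomes the terms [i, am, stored].
    static IndexSearcher buildSearcher() throws IOException {
        Directory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));
        Document doc = new Document();
        doc.add(new TextField("ashish", "i am stored", Field.Store.YES));
        writer.addDocument(doc);
        writer.close();
        return new IndexSearcher(DirectoryReader.open(dir));
    }

    // TermQuery matches the given string against single indexed terms, verbatim.
    static long hits(IndexSearcher searcher, String term) throws IOException {
        return searcher.search(new TermQuery(new Term("ashish", term)), 10).totalHits;
    }

    public static void main(String[] args) throws IOException {
        IndexSearcher searcher = buildSearcher();
        System.out.println(hits(searcher, "i am stored")); // 0: no single term "i am stored" exists
        System.out.println(hits(searcher, "stored"));      // 1: "stored" is an indexed term
    }
}
```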
The real problem is that your query string is not being analyzed here. Use the same analyzer as at indexing time; the code below analyzes the query string (by re-parsing the TermQuery's string form) and then searches.
IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(index)));
IndexSearcher searcher = new IndexSearcher(reader);
QueryParser parser = new QueryParser("ashish", analyzer);
Query query = new TermQuery(new Term("ashish", "i am stored"));
query = parser.parse(query.toString());
ScoreDoc[] hits = searcher.search(query, 5).scoreDocs;
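If you want to see which single terms a TermQuery would need to match, you can run the query string through the same analyzer and inspect the emitted tokens. A small sketch (the class and method names are ours, assuming the default StandardAnalyzer stop-word set, which keeps "i" and "am"):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ShowTokens {

    // Run the text through the analyzer and collect the emitted terms.
    static List<String> tokens(String text) throws IOException {
        List<String> out = new ArrayList<>();
        try (Analyzer analyzer = new StandardAnalyzer();
             TokenStream ts = analyzer.tokenStream("ashish", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();                      // mandatory before incrementToken()
            while (ts.incrementToken()) {
                out.add(term.toString());
            }
            ts.end();
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(tokens("i am stored")); // [i, am, stored]
    }
}
```

Each element of the returned list is a term a TermQuery could match against this field.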

can't delete document index in luence [duplicate]

This question already has an answer here:
can't delete document with lucene IndexWriter.deleteDocuments(term)
(1 answer)
Closed 6 years ago.
I built a search index for Lucene like this:
Field idField = new Field("_id", "58369c7e0293a47b09d34605", Field.Store.YES, Field.Index.NO);
Field tagField = new Field("tag", joinListStr(gifModel.getTags()), Field.Store.YES, Field.Index.ANALYZED);
Field textField = new Field("text", gifModel.getText(), Field.Store.NO, Field.Index.ANALYZED);
doc.add(idField);
doc.add(tagField);
doc.add(textField);
iwriter.addDocument(doc);
I want to delete that document by a Term on the _id field, according to this article:
public Map<String, Object> deleteIndexByMongoId(String id) {
    try {
        Directory directory = FSDirectory.open(new File(GifMiaoMacro.LUCENE_INDEX_FILE));
        IndexReader indexReader = IndexReader.open(directory);
        Term term = new Term("_id", id);
        int num = indexReader.deleteDocuments(term);
        indexReader.close();
        return new ReturnMap(num);
    } catch (IOException e) {
        e.printStackTrace();
        return new ReturnMap(GifError.S_DELETE_INDEX_ERR, "delete index error");
    }
}
But num is always 0 here, and the search results show the document is still in the search index. What am I missing?
EDIT
Changing the IndexReader to an IndexWriter is still not working:
public Map<String, Object> deleteIndexByMongoId(String id) {
    try {
        Directory directory = FSDirectory.open(new File(GifMiaoMacro.LUCENE_INDEX_FILE));
        IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LUCENE_CURRENT,
                new SmartChineseAnalyzer(Version.LUCENE_CURRENT));
        IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);
        Term term = new Term("_id", id);
        indexWriter.deleteDocuments(term);
        indexWriter.close();
        return new ReturnMap(0);
    } catch (IOException e) {
        e.printStackTrace();
        return new ReturnMap(GifError.S_DELETE_INDEX_ERR, "delete index error");
    }
}
What version of Lucene are you using? IndexReader.deleteDocuments no longer exists; it was deprecated after Lucene 3.6. Either way, use the IndexWriter class and its deleteDocuments(term) method:
Directory directory = FSDirectory.open(new File(GifMiaoMacro.LUCENE_INDEX_FILE));
IndexWriterConfig indexWriterConfig = new IndexWriterConfig(new SimpleAnalyzer());
IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);
Term term = new Term("_id", id);
indexWriter.deleteDocuments(term);
Field idField = new Field("_id", "58369c7e0293a47b09d34605", Field.Store.YES, Field.Index.NO);
It seems you have made the _id field unindexed (Field.Index.NO), so it cannot be matched by a Term, even though it is stored. You will have to use a field that is searchable in the index.
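As a sketch of that fix with the newer API used in the answer above (StringField indexes the id as one untokenized term, so a delete-by-Term can find it; the class and helper names are ours, with a RAMDirectory for self-containment):

```java
import java.io.IOException;

import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class DeleteByIdDemo {

    // Index a document with an indexed id, delete it by Term, return the remaining doc count.
    static int remainingAfterDelete() throws IOException {
        Directory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new SimpleAnalyzer()));
        Document doc = new Document();
        // StringField is indexed as a single untokenized term, so a Term can match it
        doc.add(new StringField("_id", "58369c7e0293a47b09d34605", Field.Store.YES));
        writer.addDocument(doc);
        writer.commit();

        writer.deleteDocuments(new Term("_id", "58369c7e0293a47b09d34605"));
        writer.close(); // closing commits the deletion

        IndexReader reader = DirectoryReader.open(dir);
        int remaining = reader.numDocs();
        reader.close();
        return remaining;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(remainingAfterDelete()); // 0: the document was deleted
    }
}
```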

Auto Suggestion not working in Lucene after first search iteration

I am currently working on the auto-suggestion part of my application using Lucene. Auto-suggestion of words works fine in a console application, but now that I have integrated it into the web application it is not working the desired way.
The first time documents are searched with some keywords, both search and auto-suggestion work fine and show results. But when I search again, for some other keyword or the same one, neither the auto-suggestion nor the search result shows anything. I am not able to figure out why this weird result occurs.
The snippets for the auto-suggestion and the search are as follows:
final int HITS_PER_PAGE = 20;
final String RICH_DOCUMENT_PATH = "F:\\Sample\\SampleRichDocuments";
final String INDEX_DIRECTORY = "F:\\Sample\\LuceneIndexer";
String searchText = request.getParameter("search_text");

BooleanQuery.Builder booleanQuery = null;
Query textQuery = null;
Query fileNameQuery = null;
try {
    textQuery = new QueryParser("content", new StandardAnalyzer()).parse(searchText);
    fileNameQuery = new QueryParser("title", new StandardAnalyzer()).parse(searchText);
    booleanQuery = new BooleanQuery.Builder();
    booleanQuery.add(textQuery, BooleanClause.Occur.SHOULD);
    booleanQuery.add(fileNameQuery, BooleanClause.Occur.SHOULD);
} catch (ParseException e) {
    e.printStackTrace();
}

Directory index = FSDirectory.open(new File(INDEX_DIRECTORY).toPath());
IndexReader reader = DirectoryReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
TopScoreDocCollector collector = TopScoreDocCollector.create(HITS_PER_PAGE);
try {
    searcher.search(booleanQuery.build(), collector);
    ScoreDoc[] hits = collector.topDocs().scoreDocs;
    for (ScoreDoc hit : hits) {
        Document doc = reader.document(hit.doc);
    }

    // Auto suggestion of the data
    Dictionary dictionary = new LuceneDictionary(reader, "content");
    AnalyzingInfixSuggester analyzingSuggester = new AnalyzingInfixSuggester(index, new StandardAnalyzer());
    analyzingSuggester.build(dictionary);
    List<LookupResult> lookupResultList = analyzingSuggester.lookup(searchText, false, 10);
    System.out.println("Look up result size :: " + lookupResultList.size());
    for (LookupResult lookupResult : lookupResultList) {
        System.out.println(lookupResult.key + " --- " + lookupResult.value);
    }
    analyzingSuggester.close();
    reader.close();
} catch (IOException e) {
    e.printStackTrace();
}
For example, in the first iteration, if I search for the word "sample":
Auto-suggestion gives me: sample, samples, sampler, etc. (these are words in the documents)
Search result: sample
But if I search again with the same or different text, no results are shown and the LookupResult list size is zero.
I don't understand why this is happening. Please help.
Below is the updated code for creating the index from a set of documents.
final String INDEX_DIRECTORY = "F:\\Sample\\LuceneIndexer";
long startTime = System.currentTimeMillis();
List<ContentHandler> contentHandlerList = new ArrayList<ContentHandler>();
String fileNames = (String) request.getAttribute("message");
File file = new File("F:\\Sample\\SampleRichDocuments" + fileNames);
ArrayList<File> fileList = new ArrayList<File>();
fileList.add(file);
Metadata metadata = new Metadata();

// Parsing the rich document set with Apache Tika
ContentHandler handler = new BodyContentHandler(-1);
ParseContext context = new ParseContext();
Parser parser = new AutoDetectParser();
InputStream stream = new FileInputStream(file);
try {
    parser.parse(stream, handler, metadata, context);
    contentHandlerList.add(handler);
} catch (TikaException e) {
    e.printStackTrace();
} catch (SAXException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
} finally {
    try {
        stream.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

FieldType fieldType = new FieldType();
fieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
fieldType.setStoreTermVectors(true);
fieldType.setStoreTermVectorPositions(true);
fieldType.setStoreTermVectorPayloads(true);
fieldType.setStoreTermVectorOffsets(true);
fieldType.setStored(true);

Analyzer analyzer = new StandardAnalyzer();
Directory directory = FSDirectory.open(new File(INDEX_DIRECTORY).toPath());
IndexWriterConfig conf = new IndexWriterConfig(analyzer);
IndexWriter writer = new IndexWriter(directory, conf);

Iterator<ContentHandler> handlerIterator = contentHandlerList.iterator();
Iterator<File> fileIterator = fileList.iterator();
Date date = new Date();
while (handlerIterator.hasNext() && fileIterator.hasNext()) {
    Document doc = new Document();
    String text = handlerIterator.next().toString();
    String textFileName = fileIterator.next().getName();
    String fileName = textFileName.replaceAll("_", " ");
    fileName = fileName.replaceAll("-", " ");
    fileName = fileName.replaceAll("\\.", " ");
    String fileNameArr[] = fileName.split("\\s+");
    for (String contentTitle : fileNameArr) {
        Field titleField = new Field("title", contentTitle, fieldType);
        titleField.setBoost(2.0f);
        doc.add(titleField);
    }
    if (fileNameArr.length > 0) {
        fileName = fileNameArr[0];
    }
    String document_id = UUID.randomUUID().toString();
    FieldType documentFieldType = new FieldType();
    documentFieldType.setStored(false);
    Field idField = new Field("document_id", document_id, documentFieldType);
    Field fileNameField = new Field("file_name", textFileName, fieldType);
    Field contentField = new Field("content", text, fieldType);
    doc.add(idField);
    doc.add(contentField);
    doc.add(fileNameField);
    writer.addDocument(doc);
    analyzer.close();
}
writer.commit();
writer.deleteUnusedFiles();
long endTime = System.currentTimeMillis();
writer.close();
I have also observed that, from the second search iteration on, the files in the index directory are deleted and only the file with the segments prefix keeps changing (e.g. segments_a, segments_b, segments_c).
I don't know why this weird situation is happening.
Your code looks pretty straightforward, so I sense you might be facing this problem because something is going wrong with your indexes. Providing information about how you are building the indexes might help diagnose it.
But the exact code this time :)
I think your problem is with the writer.deleteUnusedFiles() call.
According to the Javadocs, this call can "delete unreferenced index commits".
Which commits are deleted is driven by the IndexDeletionPolicy. However, "the default deletion policy is KeepOnlyLastCommitDeletionPolicy, which always removes old commits as soon as a new commit is done (this matches the behavior before 2.2)."
The docs also mention "delete on last close", which means that once an index commit is used and closed (e.g. during a search), it will be deleted.
So all the index files that matched your first search result are deleted immediately.
Try this:
Try this:
IndexWriterConfig conf = new IndexWriterConfig(analyzer);
conf.setIndexDeletionPolicy(NoDeletionPolicy.INSTANCE);

How to index and query country codes with Lucene?

I'm creating a Lucene index for city names and country codes (which depend on each other). I want the country codes to be lowercase-searchable with exact match.
As a first step, I am trying to query a single country code and find all indexed elements that match that code. But my result is always empty.
//prepare
VERSION = Version.LUCENE_4_9;
IndexWriterConfig config = new IndexWriterConfig(VERSION, new SimpleAnalyzer());
//index
Document doc = new Document();
doc.add(new StringField("countryCode", countryCode, Field.Store.YES));
writer.addDocument(doc);
//lookup
Query query = new QueryParser(VERSION, "countryCode", new SimpleAnalyzer()).parse(countryCode);
Result: when I query country codes like "IT", "DE", "EN", etc., the result is always empty. Why? Is SimpleAnalyzer unsuitable for 2-letter country codes?
For a StringField, you can use a TermQuery instead of QueryParser:
Directory dir = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_9, new SimpleAnalyzer(Version.LUCENE_4_9));
IndexWriter writer = new IndexWriter(dir, config);
String countryCode = "DE";
// index
Document doc = new Document();
doc.add(new StringField("countryCode", countryCode, Store.YES));
writer.addDocument(doc);
writer.close();
IndexSearcher search = new IndexSearcher(DirectoryReader.open(dir));
//lookup
Query query = new TermQuery(new Term("countryCode", countryCode));
TopDocs docs = search.search(query, 1);
System.out.println(docs.totalHits);
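A likely reason the original query returns nothing: SimpleAnalyzer lowercases the query text at parse time, while StringField indexes "DE" verbatim, so the parsed query looks for the term "de". A self-contained sketch of both behaviors (the class and helper names are ours):

```java
import java.io.IOException;

import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class CountryCodeDemo {

    static final Version VERSION = Version.LUCENE_4_9;

    static IndexSearcher buildSearcher() throws IOException {
        Directory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(VERSION, new SimpleAnalyzer(VERSION)));
        Document doc = new Document();
        doc.add(new StringField("countryCode", "DE", Field.Store.YES)); // indexed verbatim as "DE"
        writer.addDocument(doc);
        writer.close();
        return new IndexSearcher(DirectoryReader.open(dir));
    }

    static int parsedHits() throws IOException, ParseException {
        // SimpleAnalyzer lowercases the query text, so the parser searches for "de"
        Query parsed = new QueryParser(VERSION, "countryCode",
                new SimpleAnalyzer(VERSION)).parse("DE");
        return buildSearcher().search(parsed, 1).totalHits;
    }

    static int termHits() throws IOException {
        // TermQuery bypasses analysis and matches the exact indexed term
        return buildSearcher().search(new TermQuery(new Term("countryCode", "DE")), 1).totalHits;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parsedHits()); // 0: the query term was lowercased to "de"
        System.out.println(termHits());   // 1: the exact term "DE" matches
    }
}
```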
I'm a bit confused here. I'll assume your index writer is initialized in some part of your code not shown, but why aren't you passing a Version into SimpleAnalyzer? There has been no no-arg constructor for SimpleAnalyzer since 3.x.
That's the only real issue I see. Here is a working example using your code:
private static Version VERSION;

public static void main(String[] args) throws IOException, ParseException {
    // prepare
    VERSION = Version.LUCENE_4_9;
    Directory dir = new RAMDirectory();
    IndexWriterConfig config = new IndexWriterConfig(VERSION, new SimpleAnalyzer(VERSION));
    IndexWriter writer = new IndexWriter(dir, config);
    String countryCode = "DE";
    // index
    Document doc = new Document();
    doc.add(new TextField("countryCode", countryCode, Field.Store.YES));
    writer.addDocument(doc);
    writer.close();
    IndexSearcher search = new IndexSearcher(DirectoryReader.open(dir));
    // lookup
    Query query = new QueryParser(VERSION, "countryCode", new SimpleAnalyzer(VERSION)).parse(countryCode);
    TopDocs docs = search.search(query, 1);
    System.out.println(docs.totalHits);
}

Lucene - search does not return anything

I am using Lucene to search. Here is the code:
RAMDirectory index = new RAMDirectory();
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_34);
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_34, analyzer);
IndexWriter w = new IndexWriter(index, config);
while (contentResultSet.next()) {
    System.out.println("Indexing Content no. (ID) " + contentResultSet.getString(1));
    Document doc = new Document();
    doc.add(new Field("uniquename", contentResultSet.getString(1), Store.YES, Index.ANALYZED));
    doc.add(new Field("type", contentResultSet.getString(2), Store.YES, Index.ANALYZED));
    doc.add(new Field("key", contentResultSet.getString(3), Store.YES, Index.ANALYZED));
    doc.add(new Field("value", contentResultSet.getString(4), Store.YES, Index.ANALYZED));
    w.addDocument(doc);
}
w.close();
contentResultSet.close();
statement.close();
connection.close();

Query q = new QueryParser(Version.LUCENE_34, "value", analyzer).parse("wordtosearch");
int hitsPerPage = 10;
IndexSearcher searcher = new IndexSearcher(index, true);
ScoreDoc[] topdocs = searcher.search(q, 1000).scoreDocs;
topdocs.length is 0. What is wrong above?
Also, how can I change the code above to store the index in a database instead of a RAMDirectory? Should I use JDBCDirectory?
