I am integrating Lucene into our Spring MVC based project, and it currently works well, except for searches containing numbers.
Whenever I search for something like 123Ab or 123, or anything with numbers in it, I get no results back.
As soon as I remove the numbers, though, it works fine.
Any suggestions? Thank you.
Code:
public List<Integer> searchLucene(String text, long groupId, boolean type) {
List<Integer> objectIds = new ArrayList<>();
if (text != null) {
//String specialChars = "+ - && || ! ( ) { } [ ] ^ \" ~ * ? : \\ /";
text = text.replace("+", "\\+");
text = text.replace("-", "\\-");
text = text.replace("&&", "\\&&");
text = text.replace("||", "\\||");
text = text.replace("!", "\\!");
text = text.replace("(", "\\(");
text = text.replace(")", "\\)");
text = text.replace("{", "\\}");
text = text.replace("{", "\\}");
text = text.replace("[", "\\[");
text = text.replace("^", "\\^");
// text = text.replace("\"","\\\"");
text = text.replace("~", "\\~");
text = text.replace("*", "\\*");
text = text.replace("?", "\\?");
text = text.replace(":", "\\:");
//text = text.replace("\\","\\\\");
text = text.replace("/", "\\/");
try {
Path path;
//Set system path code
Directory directory = FSDirectory.open(path);
IndexReader indexReader = DirectoryReader.open(directory);
IndexSearcher indexSearcher = new IndexSearcher(indexReader);
QueryParser queryParser = new QueryParser("contents", new SimpleAnalyzer());
Query query;
query = queryParser.parse(text+"*");
TopDocs topDocs = indexSearcher.search(query, 50);
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
org.apache.lucene.document.Document document = indexSearcher.doc(scoreDoc.doc);
objectIds.add(Integer.valueOf(document.get("id")));
System.out.println("");
System.out.println("id " + document.get("id"));
System.out.println("content " + document.get("contents"));
}
indexSearcher.getIndexReader().close();
directory.close();
return objectIds;
} catch (Exception ignored) {
}
}
return null;
}
Indexing code:
@Override
public void saveIndexes(String text, String tagFileName, String filePath, long groupId, boolean type, int objectId) {
try {
//indexing directory
Path path;
if (type) {
// System path code
Directory directory = org.apache.lucene.store.FSDirectory.open(path);
IndexWriterConfig config = new IndexWriterConfig(new SimpleAnalyzer());
IndexWriter indexWriter = new IndexWriter(directory, config);
org.apache.lucene.document.Document doc = new org.apache.lucene.document.Document();
if (filePath != null) {
File file = new File(filePath); // current directory
doc.add(new TextField("path", file.getPath(), Field.Store.YES));
}
doc.add(new StringField("id", String.valueOf(objectId), Field.Store.YES));
// doc.add(new TextField("id",String.valueOf(objectId),Field.Store.YES));
if (text == null) {
if (filePath != null) {
FileInputStream is = new FileInputStream(filePath);
BufferedReader reader = new BufferedReader(new InputStreamReader(is));
StringBuilder stringBuffer = new StringBuilder();
String line;
while ((line = reader.readLine()) != null) {
stringBuffer.append(line).append("\n");
}
stringBuffer.append("\n").append(tagFileName);
reader.close();
doc.add(new TextField("contents", stringBuffer.toString(), Field.Store.YES));
}
} else {
text = text + "\n" + tagFileName;
doc.add(new TextField("contents", text, Field.Store.YES));
}
indexWriter.addDocument(doc);
indexWriter.commit();
indexWriter.flush();
indexWriter.close();
directory.close();
} catch (Exception ignored) {
}
}
I have tried with and without the wildcard, i.e. *. Thank you.
The issue is in your indexing code.
Your contents field is a TextField and you are using a SimpleAnalyzer. If you look at the SimpleAnalyzer documentation, it says:
An Analyzer that filters LetterTokenizer with LowerCaseFilter
That means that if your field is tokenized, numbers will be removed.
Now look at the TextField code: a TextField is always tokenized, irrespective of whether it is TYPE_STORED or TYPE_NOT_STORED.
So if you wish to index letters and numbers together, you need to use a StringField instead of a TextField.
From the StringField documentation:
A field that is indexed but not tokenized: the entire String value is
indexed as a single token. For example this might be used for a
'country' field or an 'id' field, or any field that you intend to use
for sorting or access through the field cache.
A StringField is never tokenized, irrespective of whether it is TYPE_STORED or TYPE_NOT_STORED.
So after indexing, the numbers have been stripped from your contents field; the field is indexed without them, which is why you don't find those patterns while searching.
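To see this directly, here is a small self-contained sketch (the sample text is made up) that prints the tokens SimpleAnalyzer produces; the digits never make it into the token stream:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenCheck {
    public static void main(String[] args) throws Exception {
        try (Analyzer analyzer = new SimpleAnalyzer()) {
            TokenStream ts = analyzer.tokenStream("contents", "123Ab version42");
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term); // prints "ab" and "version" -- no digits
            }
            ts.end();
            ts.close();
        }
    }
}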
Instead of QueryParser and complicated searches, first use a query like the one below to verify your indexed terms:
Query wildcardQuery = new WildcardQuery(new Term("contents", searchString));
TopDocs hits = searcher.search(wildcardQuery, 20);
Also, to decide whether your debugging should focus on the indexer side or the searcher side, use the Luke tool to see whether the terms are created as you need. If the terms are there, you can focus on the searcher code.
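If you prefer checking from code rather than installing Luke, a minimal sketch along these lines (using the MultiFields API, available up to Lucene 7) lists the terms actually indexed for the contents field:
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

// directory is the same FSDirectory the index was written to
IndexReader reader = DirectoryReader.open(directory);
Terms terms = MultiFields.getTerms(reader, "contents");
if (terms != null) {
    TermsEnum termsEnum = terms.iterator();
    BytesRef term;
    while ((term = termsEnum.next()) != null) {
        System.out.println(term.utf8ToString());
    }
}
reader.close();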
I'm having difficulties indexing TREC in Lucene 7. Until now I only needed to index text files, which was easily achievable by using an InputStreamReader as described in the demo.
/** Indexes a single document */
static void indexDoc(IndexWriter writer, Path file, long lastModified) throws IOException {
try (InputStream stream = Files.newInputStream(file)) {
// make a new, empty document
Document doc = new Document();
Field pathField = new StringField("path", file.toString(), Field.Store.YES);
doc.add(pathField);
doc.add(new LongPoint("modified", lastModified));
doc.add(new TextField("contents", new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8))));
if (writer.getConfig().getOpenMode() == OpenMode.CREATE) {
System.out.println("adding " + file);
writer.addDocument(doc);
} else {
System.out.println("updating " + file);
writer.updateDocument(new Term("path", file.toString()), doc);
}
}
}
But TREC has different tags that store information not relevant to the search results, like Header, Title, DocNo and many more. How would I adjust this code to save specific tags in their own TextField with their appropriate content?
Answering my own question since I found a solution. It might not be the most optimal, and is by no means the best looking one.
My solution is to take the complete InputStream and read it step by step, doing the appropriate action when a certain tag is found. Here is a small example:
BufferedReader in = new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8));
String read;
while ((read = in.readLine()) != null) {
    String[] parts = read.split("\\s+");
    boolean text = false;
    for (String part : parts) {
        if (part.equals("<TEXT>")) {
            text = true;
        }
    }
}
This solution solves my problem, but I'm fairly certain there is a better looking one out there.
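Building on that, a rough sketch (assuming simple line-based TREC markup; any tag other than <TEXT> would be handled the same way by analogy) that collects a tagged section into its own field:
// Route the lines between <TEXT> and </TEXT> into a "contents" field.
BufferedReader in = new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8));
Document doc = new Document();
StringBuilder contents = new StringBuilder();
boolean inText = false;
String line;
while ((line = in.readLine()) != null) {
    String trimmed = line.trim();
    if (trimmed.equals("<TEXT>")) {
        inText = true;
    } else if (trimmed.equals("</TEXT>")) {
        inText = false;
    } else if (inText) {
        contents.append(line).append('\n');
    }
}
doc.add(new TextField("contents", contents.toString(), Field.Store.YES));
writer.addDocument(doc);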
Currently I am working on the auto-suggestion part using Lucene in my application. The auto-suggestion of words works fine in a console application, but now that I have integrated it into the web application it is not working the desired way.
When the documents are searched for the first time with some keyword, both search and auto-suggestion work fine and show results. But when I search again, for some other keyword or even the same one, neither the auto-suggestion nor the search shows any result. I am not able to figure out why this weird behavior is happening.
The snippets for the auto-suggestion as well as the search are as follows:
final int HITS_PER_PAGE = 20;
final String RICH_DOCUMENT_PATH = "F:\\Sample\\SampleRichDocuments";
final String INDEX_DIRECTORY = "F:\\Sample\\LuceneIndexer";
String searchText = request.getParameter("search_text");
BooleanQuery.Builder booleanQuery = null;
Query textQuery = null;
Query fileNameQuery = null;
try {
textQuery = new QueryParser("content", new StandardAnalyzer()).parse(searchText);
fileNameQuery = new QueryParser("title", new StandardAnalyzer()).parse(searchText);
booleanQuery = new BooleanQuery.Builder();
booleanQuery.add(textQuery, BooleanClause.Occur.SHOULD);
booleanQuery.add(fileNameQuery, BooleanClause.Occur.SHOULD);
} catch (ParseException e) {
e.printStackTrace();
}
Directory index = FSDirectory.open(new File(INDEX_DIRECTORY).toPath());
IndexReader reader = DirectoryReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
TopScoreDocCollector collector = TopScoreDocCollector.create(HITS_PER_PAGE);
try{
searcher.search(booleanQuery.build(), collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
for (ScoreDoc hit : hits) {
Document doc = reader.document(hit.doc);
}
// Auto Suggestion of the data
Dictionary dictionary = new LuceneDictionary(reader, "content");
AnalyzingInfixSuggester analyzingSuggester = new AnalyzingInfixSuggester(index, new StandardAnalyzer());
analyzingSuggester.build(dictionary);
List<LookupResult> lookupResultList = analyzingSuggester.lookup(searchText, false, 10);
System.out.println("Look up result size :: "+lookupResultList.size());
for (LookupResult lookupResult : lookupResultList) {
System.out.println(lookupResult.key+" --- "+lookupResult.value);
}
analyzingSuggester.close();
reader.close();
}catch(IOException e){
e.printStackTrace();
}
For example:
In the first iteration, if I search for the word "sample",
the auto-suggestion gives me: sample, samples, sampler etc. (these are words in the documents),
and the search result is: sample.
But if I search again, with the same or different text, it shows no result and the LookupResult list size is zero.
I am not getting why this is happening. Please help.
Below is the updated code for the index creation from a set of documents.
final String INDEX_DIRECTORY = "F:\\Sample\\LuceneIndexer";
long startTime = System.currentTimeMillis();
List<ContentHandler> contentHandlerList = new ArrayList<ContentHandler> ();
String fileNames = (String)request.getAttribute("message");
File file = new File("F:\\Sample\\SampleRichDocuments"+fileNames);
ArrayList<File> fileList = new ArrayList<File>();
fileList.add(file);
Metadata metadata = new Metadata();
// Parsing the rich document set with Apache Tika
ContentHandler handler = new BodyContentHandler(-1);
ParseContext context = new ParseContext();
Parser parser = new AutoDetectParser();
InputStream stream = new FileInputStream(file);
try {
parser.parse(stream, handler, metadata, context);
contentHandlerList.add(handler);
}catch (TikaException e) {
e.printStackTrace();
}catch (SAXException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
finally {
try {
stream.close();
} catch (IOException e) {
e.printStackTrace();
}
}
FieldType fieldType = new FieldType();
fieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
fieldType.setStoreTermVectors(true);
fieldType.setStoreTermVectorPositions(true);
fieldType.setStoreTermVectorPayloads(true);
fieldType.setStoreTermVectorOffsets(true);
fieldType.setStored(true);
Analyzer analyzer = new StandardAnalyzer();
Directory directory = FSDirectory.open(new File(INDEX_DIRECTORY).toPath());
IndexWriterConfig conf = new IndexWriterConfig(analyzer);
IndexWriter writer = new IndexWriter(directory, conf);
Iterator<ContentHandler> handlerIterator = contentHandlerList.iterator();
Iterator<File> fileIterator = fileList.iterator();
Date date = new Date();
while (handlerIterator.hasNext() && fileIterator.hasNext()) {
Document doc = new Document();
String text = handlerIterator.next().toString();
String textFileName = fileIterator.next().getName();
String fileName = textFileName.replaceAll("_", " ");
fileName = fileName.replaceAll("-", " ");
fileName = fileName.replaceAll("\\.", " ");
String fileNameArr[] = fileName.split("\\s+");
for(String contentTitle : fileNameArr){
Field titleField = new Field("title",contentTitle,fieldType);
titleField.setBoost(2.0f);
doc.add(titleField);
}
if(fileNameArr.length > 0){
fileName = fileNameArr[0];
}
String document_id= UUID.randomUUID().toString();
FieldType documentFieldType = new FieldType();
documentFieldType.setStored(false);
Field idField = new Field("document_id",document_id, documentFieldType);
Field fileNameField = new Field("file_name", textFileName, fieldType);
Field contentField = new Field("content",text,fieldType);
doc.add(idField);
doc.add(contentField);
doc.add(fileNameField);
writer.addDocument(doc);
analyzer.close();
}
writer.commit();
writer.deleteUnusedFiles();
long endTime = System.currentTimeMillis();
writer.close();
I have also observed that from the second search iteration onwards, the files in the index directory are getting deleted, and only the file with the .segment suffix keeps changing, like .segmenta, .segmentb, .segmentc etc.
I don't know why this weird situation is happening.
Your code looks pretty straightforward, so I sense that something may be going wrong with your indexes; providing information about how you build the indexes might help to diagnose it.
But post the exact code this time :)
I think your problem is with the writer.deleteUnusedFiles() call.
According to the Javadocs, this call can "delete unreferenced index commits".
Which commits are deleted is driven by the IndexDeletionPolicy.
However, "the default deletion policy is KeepOnlyLastCommitDeletionPolicy, which always removes old commits as soon as a new commit is done (this matches the behavior before 2.2)".
The documentation also talks about "delete on last close", which means that once an index commit is used and closed (e.g. during a search), that commit will be deleted.
So all the commits that matched your first search result were deleted immediately.
Try this:
IndexWriterConfig conf = new IndexWriterConfig(analyzer);
conf.setIndexDeletionPolicy(NoDeletionPolicy.INSTANCE);
I am working on this piece of code, which adds a single document to a Lucene (4.7) index and then tries to find it by querying a term that certainly exists in the document. But the IndexSearcher doesn't return any documents. What is wrong with my code? Thank you for your comments and feedback.
String indexDir = "/home/richard/luc_index_03";
try {
Directory directory = new SimpleFSDirectory(new File(
indexDir));
Analyzer analyzer = new SimpleAnalyzer(
Version.LUCENE_47);
IndexWriterConfig conf = new IndexWriterConfig(
Version.LUCENE_47, analyzer);
conf.setOpenMode(OpenMode.CREATE_OR_APPEND);
conf.setRAMBufferSizeMB(256.0);
IndexWriter indexWriter = new IndexWriter(
directory, conf);
Document doc = new Document();
String title="New York is an awesome city to live!";
doc.add(new StringField("title", title, StringField.Store.YES));
indexWriter.addDocument(doc);
indexWriter.commit();
indexWriter.close();
directory.close();
IndexReader reader = DirectoryReader
.open(FSDirectory.open(new File(
indexDir)));
IndexSearcher indexSearcher = new IndexSearcher(
reader);
String field="title";
SimpleQueryParser qParser = new SimpleQueryParser(analyzer, field);
String queryText="New York" ;
Query query = qParser.parse(queryText);
int hitsPerPage = 100;
TopDocs results = indexSearcher.search(query, 5 * hitsPerPage);
System.out.println("number of results: "+results.totalHits);
ScoreDoc[] hits = results.scoreDocs;
int numTotalHits = results.totalHits;
for (ScoreDoc scoreDoc:hits){
Document docC = indexSearcher.doc(scoreDoc.doc);
String path = docC.get("path");
String titleC = docC.get("title");
String ne = docC.get("ne");
System.out.println(path+"\n"+titleC+"\n"+ne);
System.out.println("---*****----");
}
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
After running I just get
number of results: 0
This is because you use StringField. From the javadoc:
A field that is indexed but not tokenized: the entire String value is indexed as a single token.
Just use TextField instead and you should be ok.
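For clarity, a minimal sketch of the change (same field name and store flag as in the question's code); with a TextField the title gets analyzed into individual terms, so a query for "New York" can match:
doc.add(new TextField("title", title, Field.Store.YES));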
I want to implement "Find in Files" similar to one in IDE's using lucene. Basically wants to search in source code files like .c,.cpp,.h,.cs and .xml. I tried the demo shown in apache website. It returns the list of files without line numbers and number of occurance in that file. I am sure there should be some ways to get it.
Is there anyway to get those details?
Can you please share the link of the demo shown in apache website?
Here I show you how to get the term frequency of a term given a set of documents:
public static void main(final String[] args) throws CorruptIndexException,
LockObtainFailedException, IOException {
// Create the index
final Directory directory = new RAMDirectory();
final Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
final IndexWriterConfig config = new IndexWriterConfig(
Version.LUCENE_36, analyzer);
final IndexWriter writer = new IndexWriter(directory, config);
// addDoc(writer, field, text);
addDoc(writer, "title", "foo");
addDoc(writer, "title", "buz qux");
addDoc(writer, "title", "foo foo bar");
// Search
final IndexReader reader = IndexReader.open(writer, false);
final IndexSearcher searcher = new IndexSearcher(reader);
final Term term = new Term("title", "foo");
final Query query = new TermQuery(term);
System.out.println("Query: " + query.toString() + "\n");
final int limitShow = 3;
final TopDocs td = searcher.search(query, limitShow);
final ScoreDoc[] hits = td.scoreDocs;
// Take IDs and frequencies
final int[] docIDs = new int[td.totalHits];
for (int i = 0; i < td.totalHits; i++) {
docIDs[i] = hits[i].doc;
}
final Map<Integer, Integer> id2freq = getFrequencies(reader, term,
docIDs);
// Show results
for (int i = 0; i < td.totalHits; i++) {
final int docNum = hits[i].doc;
final Document doc = searcher.doc(docNum);
System.out.println("\tposition " + i);
System.out.println("Title: " + doc.get("title"));
final int freq = id2freq.get(docNum);
System.out.println("Occurrences of \"" + term.text() + "\" in \""
+ term.field() + "\" = " + freq);
System.out.println("--------------------------------\n");
}
searcher.close();
reader.close();
writer.close();
}
Here we add the documents to the index:
private static void addDoc(final IndexWriter w, final String field,
final String text) throws CorruptIndexException, IOException {
final Document doc = new Document();
doc.add(new Field(field, text, Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field(field, text, Field.Store.YES, Field.Index.ANALYZED));
w.addDocument(doc);
}
This is an example of how to get the number of occurrences of a term in a doc:
public static Map<Integer, Integer> getFrequencies(
final IndexReader reader, final Term term, final int[] docIDs)
throws CorruptIndexException, IOException {
final Map<Integer, Integer> id2freq = new HashMap<Integer, Integer>();
final TermDocs tds = reader.termDocs(term);
if (tds != null) {
for (final int docID : docIDs) {
// Skip to the next docID
tds.skipTo(docID);
// Get its term frequency
id2freq.put(docID, tds.freq());
}
}
return id2freq;
}
If you put it all together and run it, you will obtain this output:
Query: title:foo
position 0
Title: foo
Occurrences of "foo" in "title" = 2
--------------------------------
position 1
Title: foo foo bar
Occurrences of "foo" in "title" = 4
--------------------------------
I tried many forums and the response was zero. So finally I got an idea from @Luca Mastrostefano's answer for how to get the line number details.
The Lucene searcher returns the file names, and I think that is sufficient to get the line numbers. A Lucene index stores analyzed terms rather than the original line layout of the content, so it is impossible to get the line number directly. Hence, I assume the only way is to take that path, read the file, and compute the line numbers:
public static void PrintLines(string filepath,string key)
{
int counter = 1;
string line;
// Read the file and display it line by line.
System.IO.StreamReader file = new System.IO.StreamReader(filepath);
while ((line = file.ReadLine()) != null)
{
if (line.Contains(key))
{
Console.WriteLine("\t"+counter.ToString() + ": " + line);
}
counter++;
}
file.Close();
}
Call this function with the file path obtained from the Lucene searcher.
I am trying to develop a log querying system using Apache Lucene. I have written demo code to index two files and then search for a query string.
The first file contains the data:
maclean
The second file contains the data:
pinto
Below is the code that I have used for indexing:
fis = new FileInputStream(file);
DataInputStream in = new DataInputStream(fis);
BufferedReader br = new BufferedReader(new InputStreamReader(in));
String strLine;
Document doc = new Document();
doc.add(new TextField("contents", new BufferedReader(new InputStreamReader(fis, "UTF-8"))));
doc.add(new StoredField("filename", file.getCanonicalPath()));
if (indexWriter.getConfig().getOpenMode() == OpenMode.CREATE) {
System.out.println("adding " + file);
indexWriter.addDocument(doc);
} else {
System.out.println("updating " + file);
indexWriter.updateDocument(new Term("path", file.getPath()), doc);
}
If I use this code then I get the proper result, but in the display I can show only the file name, since that is all I have stored.
So I modified the code and stored the file contents as well, using this code:
FileInputStream fis = null;
if (file.isHidden() || file.isDirectory() || !file.canRead() || !file.exists()) {
return;
}
if (suffix!=null && !file.getName().endsWith(suffix)) {
return;
}
System.out.println("Indexing file " + file.getCanonicalPath());
try {
fis = new FileInputStream(file);
} catch (FileNotFoundException fnfe) {
System.out.println("File Not Found"+fnfe);
}
DataInputStream in = new DataInputStream(fis);
BufferedReader br = new BufferedReader(new InputStreamReader(in));
String strLine;
String Data="";
while ((strLine = br.readLine()) != null)
{
Data=Data+strLine;
}
Document doc = new Document();
doc.add(new TextField("contents", Data, Field.Store.YES));
doc.add(new StoredField("filename", file.getCanonicalPath()));
if (indexWriter.getConfig().getOpenMode() == OpenMode.CREATE) {
System.out.println("adding " + file);
indexWriter.addDocument(doc);
} else {
System.out.println("updating " + file);
indexWriter.updateDocument(new Term("path", file.getPath()), doc);
}
According to my understanding I should get 1 result, and it should show the file name and the content of the file containing maclean.
But instead I get the result:
-----------------------Results--------------------------
0 total matching documents
Found 0
Is there anything wrong that I am doing in the code, or is there a logical explanation for this? Why does the first code work while the second doesn't?
Search query code:
try
{
Directory directory = FSDirectory.open(indexDir);
IndexReader reader = DirectoryReader.open(directory);
IndexSearcher searcher = new IndexSearcher(reader);
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_41);
QueryParser parser = new QueryParser(Version.LUCENE_41, "contents", analyzer);
Query query = parser.parse(queryStr);
System.out.println("Searching for: " + query.toString("contents"));
TopDocs results = searcher.search(query, maxHits);
ScoreDoc[] hits = results.scoreDocs;
int numTotalHits = results.totalHits;
System.out.println("\n\n\n-----------------------Results--------------------------\n\n\n");
System.out.println(numTotalHits + " total matching documents");
for (int i = 0; i < numTotalHits; i++) {
int docId = hits[i].doc;
Document d = searcher.doc(docId);
System.out.println(i+":File name is: "+d.get("filename"));
System.out.println(i+":File content is: "+d.get("contents"));
}
System.out.println("Found " + numTotalHits);
}
catch(Exception e)
{
System.out.println("Exception Was caused in SimpleSearcher");
e.printStackTrace();
}
Use StoredField instead of TextField:
doc.add(new StoredField("Data", Line));
When you use a TextField, the string gets tokenized, and as a result you will not be able to retrieve the original string as a whole. A StoredField stores the entire string without tokenizing it.
I think your exact problem is that, by the time you get to creating a BufferedReader for the indexed field, you have already read the whole file, and the stream is at the end of the file with nothing further to read. You should be able to fix that with a call to fis.reset();
However, you should not do that. Don't store the same data in two separate fields, one for indexing and one for storage. Instead, set the same field to both store and index the data. TextField has a constructor that lets you store the data as well as index it, something like:
doc.add(new TextField("contents", Data, Field.Store.YES));
I think there could be two problems with your code.
First, I notice that you did not use near-real-time (NRT) search, and did not commit the writer before reading either. Lucene's IndexReader takes a snapshot of the index: the committed version when NRT is not used, or both the committed and uncommitted versions when NRT is used. That could be the reason your IndexReader fails to see the change. As it seems you require concurrent reading and writing, I recommend you use NRT search (IndexReader reader = DirectoryReader.open(indexWriter);).
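For illustration, a minimal NRT sketch (variable names follow your snippet; in Lucene 4.x the overload takes an applyAllDeletes flag):
// Open a near-real-time reader directly from the writer:
// documents added through indexWriter become searchable without a commit.
IndexWriter indexWriter = new IndexWriter(directory, conf);
indexWriter.addDocument(doc);
IndexReader reader = DirectoryReader.open(indexWriter, true); // NRT snapshot
IndexSearcher searcher = new IndexSearcher(reader);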
The second problem could be that, as @femtoRgon said, the data you stored may not be what you expect. I notice that when you append the content of your file for storage, you lose the EOL characters. I suggest you use Luke to check your index: http://www.getopt.org/luke/
This works in Lucene 4.5: doc.add(new TextField("Data", Data, Field.Store.YES));