Is a 2-character search possible in Lucene - Java

Hi, I have a question about Lucene search.
Is it possible to search for a 2-character string in a file using Lucene?
For example, if there are names like "karthik test", is it possible to search for "ka" or "te" in Lucene? If so, kindly provide a piece of code.

Yes, this is possible using wildcards.
Feed your QueryParser with te*, and it will generate a prefix query that matches any term starting with te.
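For example, a minimal sketch (the field name "name" and the analyzer are assumptions, not taken from the question):
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

// Minimal sketch: search for terms starting with a two-character prefix such as "te" or "ka".
static TopDocs prefixSearch(IndexSearcher searcher, String prefix) throws Exception {
    QueryParser parser = new QueryParser("name", new StandardAnalyzer());
    // Leading wildcards ("*te*") are disabled by default; enable them only if you need a
    // "contains" match, as they can be slow on large indexes.
    parser.setAllowLeadingWildcard(true);
    Query q = parser.parse(prefix + "*");   // "te*" is parsed into a prefix query on "name"
    return searcher.search(q, 10);
}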

Maybe this will help you:
private List<String> search(String word, IndexSearcher searcher, Date fromDate, Date toDate, int skip, int noOfRecords) throws Exception {
    StandardAnalyzer analyzer = new StandardAnalyzer();
    BooleanQuery.Builder finalQuery = new BooleanQuery.Builder();
    List<String> results = new ArrayList<String>();
    for (String key : keyUtil.getAllKeys()) {
        // Skip date/time fields here; dates are handled by the range query below.
        if (!key.contains("Date") && !key.contains("Time")) {
            QueryParser queryParser = new QueryParser(key, analyzer);
            Query query = queryParser.parse(word);
            finalQuery.add(query, Occur.SHOULD);
        }
    }
    if (fromDate != null && toDate != null) {
        Query query = NumericDocValuesField.newSlowRangeQuery("StartDate", fromDate.getTime(), toDate.getTime());
        finalQuery.add(query, Occur.MUST);
    }
    TopDocs hits = searcher.search(finalQuery.build(), skip + noOfRecords);
    if (hits.totalHits.value > 0) {
        int count = 0;
        for (ScoreDoc sd : hits.scoreDocs) {
            if (count >= skip) {
                Document d = searcher.doc(sd.doc);
                results.add(d.get("storePath"));
            }
            count++;
        }
    }
    analyzer.close();
    return results;
}
You can always pass a RegEx/wildcard pattern as the "word" parameter, like *someWord*.
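If you prefer to build such a query programmatically instead of going through the QueryParser, a sketch with WildcardQuery or RegexpQuery (the field name "name" is again only an assumption) looks like this:
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.RegexpQuery;
import org.apache.lucene.search.WildcardQuery;

// Sketch: match any value that contains the two characters "te".
Query wildcard = new WildcardQuery(new Term("name", "*te*"));
// The same idea expressed as a regular expression query:
Query regexp = new RegexpQuery(new Term("name", ".*te.*"));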

Related

Lucene: FastVectorHighlighter returns null

Here's what I did:
String textField1 = fastVectorHighlighter.getBestFragment(fastVectorHighlighter.getFieldQuery(query), indexReader, docId, SearchItem.FIELD_TEXT_FIELD1, DEFAULT_FRAGMENT_LENGTH);
Here's the query:
((FIELD_TEXT_FIELD1:十五*)^4.0) (FIELD_TEXT_FIELD3:十五*)
The original text is correct (indexReader.document(docId).get(SearchItem.FIELD_TEXT_FIELD3) is correct) and definitely contains the characters in the query.
Here's how I index textField1:
Field textField1 = new TextField(SearchItem.FIELD_TEXT_FIELD1, "", Field.Store.YES);
Problem solved!
It turns out I needed to change
fastVectorHighlighter.getFieldQuery(query)
to
fastVectorHighlighter.getFieldQuery(query, indexReader)
Following the code into FieldQuery#flatten, we find that Lucene doesn't deal with PrefixQuery the normal way:
} else if (sourceQuery instanceof CustomScoreQuery) {
    final Query q = ((CustomScoreQuery) sourceQuery).getSubQuery();
    if (q != null) {
        flatten(applyParentBoost(q, sourceQuery), reader, flatQueries);
    }
} else if (reader != null) { // <<====== Here it is!
    Query query = sourceQuery;
    if (sourceQuery instanceof MultiTermQuery) {
        MultiTermQuery copy = (MultiTermQuery) sourceQuery.clone();
        copy.setRewriteMethod(new MultiTermQuery.TopTermsScoringBooleanQueryRewrite(MAX_MTQ_TERMS));
        query = copy;
    }
    Query rewritten = query.rewrite(reader);
    if (rewritten != query) {
        // only rewrite once and then flatten again - the rewritten query could have a speacial treatment
        // if this method is overwritten in a subclass.
        flatten(rewritten, reader, flatQueries);
    }
We can see it needs an IndexReader for PrefixQuery, FuzzyQuery, etc.
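Putting the fix together, a minimal sketch of the working call (the field and constant names are the ones from the question; everything else is an assumption). Note that the FastVectorHighlighter also requires the field to be indexed with term vectors including positions and offsets, otherwise getBestFragment returns null even for matching documents:
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.vectorhighlight.FastVectorHighlighter;
import org.apache.lucene.search.vectorhighlight.FieldQuery;

FastVectorHighlighter fvh = new FastVectorHighlighter();
// Pass the IndexReader so that PrefixQuery, FuzzyQuery, etc. get rewritten before flattening.
FieldQuery fieldQuery = fvh.getFieldQuery(query, indexReader);
String fragment = fvh.getBestFragment(fieldQuery, indexReader, docId,
        SearchItem.FIELD_TEXT_FIELD1, DEFAULT_FRAGMENT_LENGTH);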

Lucene search engine isn't accurate, can't figure out why

I am trying to create a search engine for the first time, and I'm using the library offered by Apache Lucene. Everything works fine; however, when I search for more than one word, for example "computer science", the results I get aren't accurate, because I never get documents that contain both words. It searches the documents for each word separately (I get documents that contain either "computer" or "science", but never both).
I've been staring at my code for almost a week now and I can't figure out the problem. The query parsing seems to work perfectly, so I think the problem might be in the search, but I don't know what I'm doing wrong. If you can help me, I'll be grateful.
public static wikiPage[] index(String searchQuery) throws SQLException, IOException, ParseException {
    String sql = "select * from Record";
    ResultSet rs = db.runSql(sql);
    StandardAnalyzer analyzer = new StandardAnalyzer();
    Directory index = new RAMDirectory();
    IndexWriterConfig config = new IndexWriterConfig(analyzer);
    // 1. Indexer
    try (IndexWriter w = new IndexWriter(index, config)) {
        while (rs.next()) {
            String RecordID = rs.getString("RecordID");
            String URL = rs.getString("URL");
            String Title = rs.getString("Title");
            String Info = rs.getString("Info");
            addDoc(w, RecordID, URL, Info, Title);
        }
    } catch (Exception e) {
        System.out.print(e);
        index.close();
    }
    // 2. Query
    MultiFieldQueryParser multipleQueryParser = new MultiFieldQueryParser(new String[]{"Title", "Info"}, new StandardAnalyzer());
    Query q = multipleQueryParser.parse(searchQuery);
    // 3. Search
    IndexReader reader = DirectoryReader.open(index);
    IndexSearcher searcher = new IndexSearcher(reader);
    TopDocs results = searcher.search(q, 10000);
    ScoreDoc[] hits = results.scoreDocs;
    // 4. Display results
    wikiPage[] resultArray = new wikiPage[hits.length];
    System.out.println("Found " + hits.length + " hits.");
    for (int i = 0; i < hits.length; ++i) {
        int docId = hits[i].doc;
        Document d = searcher.doc(docId);
        resultArray[i] = new wikiPage(d.get("URL"), d.get("Title"));
        System.out.println((i + 1) + ". " + d.get("Title") + "\t" + d.get("URL"));
    }
    reader.close();
    return resultArray;
}

private static void addDoc(IndexWriter w, String RecordID, String URL, String Info, String Title) throws IOException {
    Document doc = new Document();
    doc.add(new StringField("RecordID", RecordID, Field.Store.YES));
    doc.add(new TextField("Title", Title, Field.Store.YES));
    doc.add(new TextField("URL", URL, Field.Store.YES));
    doc.add(new TextField("Info", Info, Field.Store.YES));
    w.addDocument(doc);
}
This is the output of System.out.println(q.toString());
(Title:computer Info:computer) (Title:science Info:science)
If you want to search it as a phrase (that is, finding "computer" and "science" together), surround the query with quotes, so it should look like "computer science". In your code, you could do something like:
Query q = multipleQueryParser.parse("\"" + searchQuery + "\"");
If you just want to find docs that contain both terms somewhere in the document, but not necessarily together, the query should look like +computer +science. Probably the easiest way to do this is to change the default operator of your query parser:
multipleQueryParser.setDefaultOperator(QueryParser.Operator.AND);
Query q = multipleQueryParser.parse(searchQuery);
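A compact sketch of the AND-operator variant against the question's "Title" and "Info" fields, so you can compare the parsed query with the one printed above:
MultiFieldQueryParser multipleQueryParser =
        new MultiFieldQueryParser(new String[]{"Title", "Info"}, new StandardAnalyzer());
// Require every term instead of the default OR behaviour.
multipleQueryParser.setDefaultOperator(QueryParser.Operator.AND);
Query q = multipleQueryParser.parse("computer science");
// q.toString() should now be roughly:
// +(Title:computer Info:computer) +(Title:science Info:science)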
As per the query syntax docs, prefix required clauses with + or combine them with the AND operator (keeping OR for readability).
Try this:
(Title:computer OR Info:computer) AND (Title:science OR Info:science)
Maybe build this string and use it directly.

Lucene searchAfter returns same results as previous query

I just discovered that Lucene supports paging using the searchAfter method, so I added it to my implementation. Unfortunately it returns the same results as the previous query and I don't get why.
Basically what I'm doing is calling my search method and storing both the Query and the last ScoreDoc returned by collector.topDocs(). Then, when the user wants to load the next page, I call searcher.searchAfter(previousResult, previousQuery, MAX_HITS) but I receive the same results that I would receive by running searcher.search(query, collector).
Please note that I'm returning 15 documents per page and my query has 32 hits, so I'd expect it to have at least 2 pages.
Here's the code:
private Query previousQuery;
private ScoreDoc previousResult;
private float previousMaxScore;

private ScoreDoc[] search(String query) throws ParseException, IOException {
    Query q = parser.parse(query);
    TopScoreDocCollector collector = TopScoreDocCollector.create(MAX_HITS_PER_PAGE);
    searcher.search(q, collector);
    TopDocs result = collector.topDocs();
    // Let's normalize scores
    for (ScoreDoc scoreDoc : result.scoreDocs) {
        scoreDoc.score /= result.getMaxScore();
    }
    // If the query had some results then we save the last one to support paging
    if (collector.getTotalHits() > 0) {
        previousQuery = q;
        previousResult = result.scoreDocs[result.scoreDocs.length - 1];
        previousMaxScore = result.getMaxScore();
    }
    return result.scoreDocs;
}

private ScoreDoc[] searchNextPage() throws IOException {
    // previousResult is CORRECT, it's the last one from the previous query, so it's not part of the issue.
    TopDocs result = searcher.searchAfter(previousResult, previousQuery, MAX_HITS_PER_PAGE);
    // Let's normalize scores
    for (ScoreDoc scoreDoc : result.scoreDocs) {
        scoreDoc.score /= previousMaxScore;
    }
    // Let's update the previous result with the new one
    if (result.scoreDocs.length > 0) {
        previousResult = result.scoreDocs[result.scoreDocs.length - 1];
    }
    return result.scoreDocs;
}
What am I doing wrong?
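For comparison, here is a minimal sketch of the plain searchAfter paging pattern, with no score manipulation; the searcher, parser, and MAX_HITS_PER_PAGE setup from the question are assumed to be in place, and queryString is a placeholder for the user's input:
ScoreDoc after = null;
Query q = parser.parse(queryString);
while (true) {
    TopDocs page = (after == null)
            ? searcher.search(q, MAX_HITS_PER_PAGE)
            : searcher.searchAfter(after, q, MAX_HITS_PER_PAGE);
    if (page.scoreDocs.length == 0) {
        break;                                   // no more pages
    }
    for (ScoreDoc sd : page.scoreDocs) {
        Document d = searcher.doc(sd.doc);       // process each hit as needed
    }
    after = page.scoreDocs[page.scoreDocs.length - 1];  // last hit of this page drives the next one
}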

Why does Lucene not return results based on a whole-word match?

I am using Lucene to match keywords against a list of words within an application. The whole process is automated, without any human intervention. The best-matched result (the one at the top, with the highest score) is picked from the results list returned by Lucene.
The following code demonstrates the above functionality; the results are printed to the console.
Problem:
The problem is that Lucene searches for the keyword (the word to be searched) and returns a result that only partially matches the keyword, while a fully matched result also exists but does not get ranked in the first position.
For example, suppose I have a Lucene RAM index that contains the words 'Test' and 'Test Engineer'. If I search the index for 'AB4_Test Eng_AA0XY11', the results are
Test
Test Engineer
Although 'Eng' in 'AB4_Test Eng_AA0XY11' matches 'Engineer' (which is why it is listed in the results), it does not get the top position. I want to optimize my solution so that 'Test Engineer' comes out on top, because it is the best match when the whole keyword is considered. Can anyone help me solve this problem?
public class LuceneTest {

    private static void search(Set<String> keywords) {
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
        try {
            // 1. create the index
            Directory luceneIndex = buildLuceneIndex(analyzer);
            int hitsPerPage = 5;
            IndexReader reader = IndexReader.open(luceneIndex);
            for (String keyword : keywords) {
                // Create the query string. Replace all underscore, hyphen, comma, (, ), {, }, . with a plus sign.
                StringBuilder querystr = new StringBuilder(128);
                String[] splitName = keyword.split("[\\-_,/(){}:. ]");
                // After tokenizing, also add a plus sign between each camel case word.
                for (String token : splitName) {
                    querystr.append(token + "+");
                }
                // 3. search
                IndexSearcher searcher = new IndexSearcher(reader);
                TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
                Query q = new QueryParser(Version.LUCENE_36, "name", analyzer).parse(querystr.toString());
                searcher.search(q, collector);
                ScoreDoc[] hits = collector.topDocs().scoreDocs;
                System.out.println();
                System.out.println(keyword);
                System.out.println("----------------------");
                for (ScoreDoc scoreDoc : hits) {
                    Document d = searcher.doc(scoreDoc.doc);
                    System.out.println("Found " + d.get("id") + " : " + d.get("name"));
                }
                // searcher can only be closed when there
                searcher.close();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private static Directory buildLuceneIndex(Analyzer analyzer) throws CorruptIndexException, LockObtainFailedException, IOException {
        Map<Integer, String> map = new HashMap<Integer, String>();
        map.put(1, "Test Engineer");
        map.put(2, "Test");
        Directory index = new RAMDirectory();
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, analyzer);
        // 1. create the index
        IndexWriter w = new IndexWriter(index, config);
        for (Map.Entry<Integer, String> entry : map.entrySet()) {
            try {
                Document doc = new Document();
                doc.add(new Field("id", entry.getKey().toString(), Field.Store.YES, Field.Index.ANALYZED));
                doc.add(new Field("name", entry.getValue(), Field.Store.YES, Field.Index.ANALYZED));
                w.addDocument(doc);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        w.close();
        return index;
    }

    public static void main(String[] args) {
        Set<String> list = new TreeSet<String>();
        list.add("AB4_Test Eng_AA0XY11");
        list.add("AB4_Test Engineer_AA0XY11");
        search(list);
    }
}
You can have a look at the Lucene query syntax rules to see how you can enforce the search for Test Engineer.
Basically, a query such as
AB4_Test AND Eng_AA0XY11
could work, though I am not sure of it. The page linked above is quite concise, and you should quickly find a query that fulfills your needs.
If these two results (test, test engineer) have the same ranking score, then you will see them in the order they came up.
You should try using a length filter and boosting of the terms; maybe then you can come up with a solution.
See also:
what is the best lucene setup for ranking exact matches as the highest
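One way to express the boosting idea with the Lucene 3.6 API used in the question is to add a boosted PhraseQuery next to the individual term queries, so documents matching the whole phrase rank above partial matches. A sketch under those assumptions, not a drop-in fix:
BooleanQuery query = new BooleanQuery();
PhraseQuery phrase = new PhraseQuery();
phrase.add(new Term("name", "test"));        // StandardAnalyzer lowercases indexed terms
phrase.add(new Term("name", "engineer"));
phrase.setBoost(5f);                         // exact-phrase hits score higher
query.add(phrase, BooleanClause.Occur.SHOULD);
query.add(new TermQuery(new Term("name", "test")), BooleanClause.Occur.SHOULD);
query.add(new TermQuery(new Term("name", "engineer")), BooleanClause.Occur.SHOULD);
TopDocs hits = searcher.search(query, 5);    // "Test Engineer" should now rank first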

How can I get the list of unique terms from a specific field in Lucene?

I have an index built from a large corpus with several fields. Only one of these fields contains text.
I need to extract the unique words from the whole index based on this field.
Does anyone know how I can do that with Lucene in java?
If you are using the Lucene 4.0 API, you need to get the Fields out of the IndexReader. Fields then offers a way to get the terms for each field in the index. Here is an example of how to do that:
Fields fields = MultiFields.getFields(indexReader);
Terms terms = fields.terms("field");
TermsEnum iterator = terms.iterator(null);
BytesRef byteRef = null;
while ((byteRef = iterator.next()) != null) {
    String term = new String(byteRef.bytes, byteRef.offset, byteRef.length);
}
For newer versions of Lucene, you can get the string from the BytesRef by calling:
byteRef.utf8ToString();
instead of
new String(byteRef.bytes, byteRef.offset, byteRef.length);
If you want to get the document frequency, you can do :
int docFreq = iterator.docFreq();
You're looking for term vectors (a set of all the words that were in the field and the number of times each word was used, excluding stop words). You'll use IndexReader's getTermFreqVector(docid, field) for each document in the index, and populate a HashSet with them.
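A sketch of that term-vector approach with the pre-4.0 API (it assumes the field was indexed with term vectors enabled; "field_name" is a placeholder):
Set<String> uniqueTerms = new HashSet<String>();
for (int docId = 0; docId < reader.maxDoc(); docId++) {
    if (reader.isDeleted(docId)) {
        continue;                             // skip deleted documents
    }
    TermFreqVector tfv = reader.getTermFreqVector(docId, "field_name");
    if (tfv != null) {                        // null if the doc has no term vector for this field
        uniqueTerms.addAll(Arrays.asList(tfv.getTerms()));
    }
}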
The alternative would be to use terms() and pick only terms for the field you're interested in:
IndexReader reader = IndexReader.open(index);
TermEnum terms = reader.terms();
Set<String> uniqueTerms = new HashSet<String>();
while (terms.next()) {
    final Term term = terms.term();
    if (term.field().equals("field_name")) {
        uniqueTerms.add(term.text());
    }
}
This is not the optimal solution; you're reading and then discarding all the other fields. There is a Fields class in Lucene 4 that returns terms(field) for a single field only.
For the same result, just a little cleaner, you can use the LuceneDictionary from the lucene-suggest package. It takes care of a field that does not contain any terms by returning BytesRefIterator.EMPTY, which will save you an NPE :)
LuceneDictionary ld = new LuceneDictionary(indexReader, "field");
BytesRefIterator iterator = ld.getWordsIterator();
BytesRef byteRef = null;
while ((byteRef = iterator.next()) != null) {
    String term = byteRef.utf8ToString();
}
As of Lucene 7+, the above and some related links are obsolete.
Here's what's current:
// IndexReader has leaves, you'll iterate through those
int leavesCount = reader.leaves().size();
final String fieldName = "content";
for (int l = 0; l < leavesCount; l++) {
    System.out.println("l: " + l);
    // specify the field here ----------------------------->
    TermsEnum terms = reader.leaves().get(l).reader().terms(fieldName).iterator();
    // this stops at 20 just to sample the head
    for (int i = 0; i < 20; i++) {
        BytesRef next = terms.next();
        if (next == null) {
            break; // stop early if the leaf has fewer than 20 terms
        }
        // and to get it out, here -->
        final Term content = new Term(fieldName, BytesRef.deepCopyOf(next));
        System.out.println("i: " + i + ", term: " + content);
    }
}
The answers using TermsEnum and terms.next() have a subtle off-by-one bug. This is because the TermsEnum already points to the first term, so while (terms.next()) causes the first term to be skipped.
Instead, use a for loop:
TermEnum terms = reader.terms();
for (Term term = terms.term(); term != null; terms.next(), term = terms.term()) {
    // do something with the term
}
To modify the code from the accepted answer:
IndexReader reader = IndexReader.open(index);
TermEnum terms = reader.terms();
Set<String> uniqueTerms = new HashSet<String>();
for (Term term = terms.term(); term != null; terms.next(), term = terms.term()) {
    if (term.field().equals("field_name")) {
        uniqueTerms.add(term.text());
    }
}
Slightly different compared to the solution of #pokeRex110 (tested with Lucene 9.3.0):
Terms terms = MultiTerms.getTerms(indexReader, "title");
if (terms != null) {
    TermsEnum iter = terms.iterator();
    BytesRef byteRef = null;
    while ((byteRef = iter.next()) != null) {
        System.out.printf("%s (freq=%s)%n",
                byteRef.utf8ToString(),
                iter.docFreq());
    }
}
