Lucene BooleanQuery OR output not as wanted - java

I have an index with some documents: 593 documents contain the word computer, 51 documents contain the word science, and 596 documents contain computer OR science. I want to output those 596 docs. Here's my code:
public class Main
{
    public static void main(String args[]) throws IOException, ParseException {
        String[] champ = {"W", "A"};
        BooleanQuery q = new BooleanQuery();
        BooleanQuery qIntermediaire;
        qIntermediaire = new BooleanQuery();
        for (int i = 0; i < champ.length; i++) {
            qIntermediaire.add(new BooleanClause(new FuzzyQuery(new Term(champ[i], "computer"), 0), BooleanClause.Occur.SHOULD));
        }
        q.add(new BooleanClause(qIntermediaire, BooleanClause.Occur.MUST));
        qIntermediaire = new BooleanQuery();
        for (int i = 0; i < champ.length; i++) {
            qIntermediaire.add(new BooleanClause(new FuzzyQuery(new Term(champ[i], "science"), 0), BooleanClause.Occur.SHOULD));
        }
        q.add(new BooleanClause(qIntermediaire, BooleanClause.Occur.SHOULD));
        Path indexPath = Paths.get("MonIndex");
        Directory directory = FSDirectory.open(indexPath);
        DirectoryReader reader = DirectoryReader.open(directory);
        IndexSearcher iSearcher = new IndexSearcher(reader);
        TopDocs topdocs = iSearcher.search(q, 10000);
        ScoreDoc[] resultsList = topdocs.scoreDocs;
        System.out.println(resultsList.length);
    }
}
For some reason this gives me 461 documents.
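One thing that stands out in the posted code is that the "computer" sub-query is added with Occur.MUST while the "science" sub-query is added with Occur.SHOULD. In a BooleanQuery that has a MUST clause, SHOULD clauses are optional and only influence scoring, so the result cannot be the union of the two sets. The sketch below models this with plain Java sets instead of Lucene; the class name, method names, and document ids are made up for illustration. It does not by itself explain the 461 figure, which may also depend on which fields the counts were taken from.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

public class OccurSketch {
    // One MUST clause plus one SHOULD clause: only the MUST clause
    // restricts the result set; the SHOULD clause only affects scoring.
    static Set<Integer> mustPlusShould(Set<Integer> must, Set<Integer> should) {
        return new TreeSet<>(must);
    }

    // Two SHOULD clauses and no MUST clause: a document matches
    // if it matches either clause, i.e. the union (OR).
    static Set<Integer> shouldPlusShould(Set<Integer> a, Set<Integer> b) {
        Set<Integer> union = new TreeSet<>(a);
        union.addAll(b);
        return union;
    }

    public static void main(String[] args) {
        // Hypothetical document ids for each word.
        Set<Integer> computer = new TreeSet<>(Arrays.asList(1, 2, 3));
        Set<Integer> science = new TreeSet<>(Arrays.asList(3, 4));

        System.out.println(mustPlusShould(computer, science));   // [1, 2, 3]
        System.out.println(shouldPlusShould(computer, science)); // [1, 2, 3, 4]
    }
}
```

Under that reading, adding both sub-queries with Occur.SHOULD would be the way to express computer OR science.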

Related

Porting code from lucene to elasticsearch

I have the following simple code that I want to port from Lucene 6.5.x to Elasticsearch 5.3.x.
However, the scores are different, and I want to get the same scores as in Lucene.
For example, the idf:
Lucene's docFreq is 3 (3 docs contain the term "d") and docCount is 4 (documents with this field). Elasticsearch has a docFreq of 1 and a docCount of 2 (or 1 and 1). I am not sure how these values relate to each other in Elasticsearch...
The other difference in scoring is the avgFieldLength:
Lucene is right with 14 / 4 = 3.5. Elasticsearch gives a different value for each score result, but this should be the same for all documents...
Can you please tell me which settings/mapping I missed in Elasticsearch to get it to work like Lucene?
IndexingExample.java:
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.search.*;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.document.Field;
import java.io.IOException;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
public class IndexingExample {
    private static final String INDEX_DIR = "/tmp/lucene6idx";

    private IndexWriter createWriter() throws IOException {
        FSDirectory dir = FSDirectory.open(Paths.get(INDEX_DIR));
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        return new IndexWriter(dir, config);
    }

    private List<Document> createDocs() {
        List<Document> docs = new ArrayList<>();
        FieldType summaryType = new FieldType();
        summaryType.setIndexOptions(IndexOptions.DOCS_AND_FREQS);
        summaryType.setStored(true);
        summaryType.setTokenized(true);
        Document doc1 = new Document();
        doc1.add(new Field("title", "b c d d d", summaryType));
        docs.add(doc1);
        Document doc2 = new Document();
        doc2.add(new Field("title", "b c d d", summaryType));
        docs.add(doc2);
        Document doc3 = new Document();
        doc3.add(new Field("title", "b c d", summaryType));
        docs.add(doc3);
        Document doc4 = new Document();
        doc4.add(new Field("title", "b c", summaryType));
        docs.add(doc4);
        return docs;
    }

    private IndexSearcher createSearcher() throws IOException {
        Directory dir = FSDirectory.open(Paths.get(INDEX_DIR));
        IndexReader reader = DirectoryReader.open(dir);
        return new IndexSearcher(reader);
    }

    public static void main(String[] args) throws IOException, ParseException {
        // indexing
        IndexingExample app = new IndexingExample();
        IndexWriter writer = app.createWriter();
        writer.deleteAll();
        List<Document> docs = app.createDocs();
        writer.addDocuments(docs);
        writer.commit();
        writer.close();
        // search
        IndexSearcher searcher = app.createSearcher();
        Query q1 = new TermQuery(new Term("title", "d"));
        TopDocs hits = searcher.search(q1, 20);
        System.out.println(hits.totalHits + " docs found for the query \"" + q1.toString() + "\"");
        int num = 0;
        for (ScoreDoc sd : hits.scoreDocs) {
            Explanation expl = searcher.explain(q1, sd.doc);
            System.out.println(expl);
        }
    }
}
Elasticsearch:
DELETE twitter
PUT twitter/tweet/1
{
  "title" : "b c d d d"
}
PUT twitter/tweet/2
{
  "title" : "b c d d"
}
PUT twitter/tweet/3
{
  "title" : "b c d"
}
PUT twitter/tweet/4
{
  "title" : "b c"
}
POST /twitter/tweet/_search
{
  "explain": true,
  "query": {
    "term" : {
      "title" : "d"
    }
  }
}
Problem solved with the help of jimczy:
Don't forget that ES creates an index with 5 shards by default and
that docFreq and docCount are computed per shard. You can create an
index with 1 shard or use the dfs mode to compute distributed stats:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-search-type.html#dfs-query-then-fetch
This search query (dfs_query_then_fetch) worked as expected:
POST /twitter/tweet/_search?search_type=dfs_query_then_fetch
{
  "explain": true,
  "query": {
    "term" : {
      "title" : "d"
    }
  }
}
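To make the per-shard effect concrete, here is a small sketch of the arithmetic in plain Java. It assumes the BM25-style idf formula used by Lucene 6's BM25Similarity, log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)); the class and method names are illustrative and not part of Lucene or Elasticsearch.

```java
public class ShardStatsSketch {
    // BM25-style idf (assumed to be the similarity in use on both sides).
    static double idf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Average field length: total number of terms in the field
    // divided by the number of documents that have the field.
    static double avgFieldLength(long sumTotalTermFreq, long docCount) {
        return (double) sumTotalTermFreq / docCount;
    }

    public static void main(String[] args) {
        // Whole index, as Lucene sees it: docFreq = 3, docCount = 4.
        System.out.println("global idf:    " + idf(3, 4));
        // One shard, as reported in the question: docFreq = 1, docCount = 2.
        System.out.println("one-shard idf: " + idf(1, 2));
        // Field lengths 5 + 4 + 3 + 2 = 14 terms over 4 documents.
        System.out.println("avgFieldLength: " + avgFieldLength(14, 4));
    }
}
```

Because each shard only sees its own documents, both the idf and the average field length can differ from shard to shard, which matches the differing values in the explain output until the statistics are computed globally.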

Lucene search engine isn't accurate, can't figure out why

I am trying to create a search engine for the first time, and I'm using the library offered by Apache Lucene. Everything works fine, however when I search for more than one word, for example "computer science" the results that I get aren't accurate because I never get documents that contain both words. It searches the documents for each word separately (I get documents that contain either "computer" or "science" but never both).
I've been staring at my code for almost a week now and I can't figure out the problem. The query parsing seems to work perfectly, so I think the problem might be in the search, but I don't know what I'm doing wrong. If you can help me, I'll be grateful.
public static wikiPage[] index(String searchQuery) throws SQLException, IOException, ParseException {
    String sql = "select * from Record";
    ResultSet rs = db.runSql(sql);
    StandardAnalyzer analyzer = new StandardAnalyzer();
    Directory index = new RAMDirectory();
    IndexWriterConfig config = new IndexWriterConfig(analyzer);
    //1. Indexer
    try (IndexWriter w = new IndexWriter(index, config)) {
        while (rs.next()) {
            String RecordID = rs.getString("RecordID");
            String URL = rs.getString("URL");
            String Title = rs.getString("Title");
            String Info = rs.getString("Info");
            addDoc(w, RecordID, URL, Info, Title);
        }
    }
    catch (Exception e) {
        System.out.print(e);
        index.close();
    }
    //2. Query
    MultiFieldQueryParser multipleQueryParser = new MultiFieldQueryParser(new String[]{"Title", "Info"}, new StandardAnalyzer());
    Query q = multipleQueryParser.parse(searchQuery);
    //3. Search
    IndexReader reader = DirectoryReader.open(index);
    IndexSearcher searcher = new IndexSearcher(reader);
    TopDocs results = searcher.search(q, 10000);
    ScoreDoc[] hits = results.scoreDocs;
    // 4. display results
    wikiPage[] resultArray = new wikiPage[hits.length];
    System.out.println("Found " + hits.length + " hits.");
    for (int i = 0; i < hits.length; ++i) {
        int docId = hits[i].doc;
        Document d = searcher.doc(docId);
        resultArray[i] = new wikiPage(d.get("URL"), d.get("Title"));
        System.out.println((i + 1) + ". " + d.get("Title") + "\t" + d.get("URL"));
    }
    reader.close();
    return resultArray;
}

private static void addDoc(IndexWriter w, String RecordID, String URL, String Info, String Title) throws IOException {
    Document doc = new Document();
    doc.add(new StringField("RecordID", RecordID, Field.Store.YES));
    doc.add(new TextField("Title", Title, Field.Store.YES));
    doc.add(new TextField("URL", URL, Field.Store.YES));
    doc.add(new TextField("Info", Info, Field.Store.YES));
    w.addDocument(doc);
}
This is the output of System.out.println(q.toString());
(Title:computer Info:computer) (Title:science Info:science)
If you want to search it as a phrase (that is, finding "computer" and "science" together), surround the query with quotes, so it should look like "computer science". In your code, you could do something like:
Query q = multipleQueryParser.parse("\"" + searchQuery + "\"");
If you just want to find docs that contain both terms somewhere in the document, but not necessarily together, the query should look like +computer +science. Probably the easiest way to do this is to change the default operator of your query parser:
multipleQueryParser.setDefaultOperator(QueryParser.Operator.AND);
Query q = multipleQueryParser.parse(searchQuery);
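As a rough illustration of the two query-string shapes described above, the following plain-Java sketch builds them from raw user input. The helper names asPhrase and requireAll are hypothetical, and the sketch does not escape query-syntax special characters, so treat it as a starting point only.

```java
public class QueryStringSketch {
    // Wraps the input in double quotes so a query parser treats it as a phrase.
    static String asPhrase(String input) {
        return "\"" + input.trim() + "\"";
    }

    // Prefixes every whitespace-separated term with '+', marking each as required.
    static String requireAll(String input) {
        StringBuilder sb = new StringBuilder();
        for (String term : input.trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue; // happens only for blank input
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append('+').append(term);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(asPhrase("computer science"));   // "computer science"
        System.out.println(requireAll("computer science")); // +computer +science
    }
}
```

Setting the default operator on the parser, as shown above, avoids rewriting the string at all and is likely the simpler choice when every query should behave this way.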
As per the doc, prefix required terms with + and use AND (and OR for readability).
Try this:
(Title:+computer OR Info:+computer) AND (Title:+science OR Info:+science)
Maybe build this string and use it directly.

Is 2-character search possible in Lucene

Hi, I have a question about Lucene search.
Is it possible to search for 2 characters in a file using Lucene search?
For example, if there are names like "karthik test", is it possible to search for "ka" or "te" in Lucene? If so, kindly provide a code sample.
Yes, this is possible using wildcards.
Feed your QueryParser with te*, and it will generate a query that matches terms starting with the prefix te, followed by any suffix.
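The sketch below shows, in plain Java without Lucene, what a prefix such as ka or te would match. It assumes the text is lowercased and split on whitespace, which only approximates what an analyzer does; the class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class PrefixSketch {
    // Returns the terms of the text that start with the given prefix.
    // A prefix query works on indexed terms, not on arbitrary substrings.
    static List<String> termsWithPrefix(String text, String prefix) {
        List<String> matches = new ArrayList<>();
        for (String term : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!term.isEmpty() && term.startsWith(prefix)) {
                matches.add(term);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(termsWithPrefix("karthik test", "ka")); // [karthik]
        System.out.println(termsWithPrefix("karthik test", "te")); // [test]
        System.out.println(termsWithPrefix("karthik test", "ar")); // []
    }
}
```

The last call illustrates the limit of this approach: two characters from the middle of a word are not a prefix of any term, so te* style queries will not find them.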
Maybe this will help you:
private List<String> search(String word, IndexSearcher searcher, Date fromDate, Date toDate, int skip, int noOfRecords) throws Exception {
    StandardAnalyzer analyzer = new StandardAnalyzer();
    BooleanQuery.Builder finalQuery = new BooleanQuery.Builder();
    List<String> results = new ArrayList<>();
    // Search the word in every non-date, non-time field; a match in any one field is enough.
    for (String key : keyUtil.getAllKeys()) {
        // The original condition used ||, which is true for almost every key;
        // && is assumed to be the intent (skip date and time fields).
        if (!key.contains("Date") && !key.contains("Time")) {
            QueryParser queryParser = new QueryParser(key, analyzer);
            Query query = queryParser.parse(word);
            finalQuery.add(query, Occur.SHOULD);
        }
    }
    // Optionally restrict the results to a date range.
    if (fromDate != null && toDate != null) {
        Query query = NumericDocValuesField.newSlowRangeQuery("StartDate", fromDate.getTime(), toDate.getTime());
        finalQuery.add(query, Occur.MUST);
    }
    TopDocs hits = searcher.search(finalQuery.build(), skip + noOfRecords);
    if (hits.totalHits.value > 0) {
        int count = 0;
        for (ScoreDoc sd : hits.scoreDocs) {
            // Skip the first 'skip' hits (simple paging).
            if (count >= skip) {
                Document d = searcher.doc(sd.doc);
                results.add(d.get("storePath"));
            }
            count++;
        }
    }
    analyzer.close();
    return results;
}
You can always use a RegEx pattern with the attribute "word". Like * someWord *

Why Lucene does not return the results based on whole word match?

I am using Lucene to match keywords against a list of words within an application. The whole process is automated without any human intervention. The best-matched result (the one at the top with the highest score) is picked from the results list returned by Lucene.
The following code demonstrates the above functionality, and the results are printed to the console.
Problem:
The problem is that Lucene searches for the keyword (the word to be searched) and returns a word that only partially matches the keyword. The fully matching result also exists, but it does not get ranked in the first position.
For example, suppose I have a Lucene RAM index that contains the entries 'Test' and 'Test Engineer'. If I search the index for 'AB4_Test Eng_AA0XY11', the results are:
Test
Test Engineer
Eng in 'AB4_Test Eng_AA0XY11' matched Engineer (that is why it is listed in the results), but 'Test Engineer' does not get the top position. I want to optimize my solution to bring 'Test Engineer' to the top because it is the best match when the whole keyword is considered. Can anyone help me solve this problem?
public class LuceneTest {
    private static void search(Set<String> keywords) {
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
        try {
            // 1. create the index
            Directory luceneIndex = buildLuceneIndex(analyzer);
            int hitsPerPage = 5;
            IndexReader reader = IndexReader.open(luceneIndex);
            for (String keyword : keywords) {
                // Create query string. replace all underscore, hyphen, comma, ( , ), {, }, . with plus sign
                StringBuilder querystr = new StringBuilder(128);
                String[] splitName = keyword.split("[\\-_,/(){}:. ]");
                // After tokenizing also add plus sign between each camel case word.
                for (String token : splitName) {
                    querystr.append(token + "+");
                }
                // 3. search
                IndexSearcher searcher = new IndexSearcher(reader);
                TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
                Query q = new QueryParser(Version.LUCENE_36, "name", analyzer).parse(querystr.toString());
                searcher.search(q, collector);
                ScoreDoc[] hits = collector.topDocs().scoreDocs;
                System.out.println();
                System.out.println(keyword);
                System.out.println("----------------------");
                for (ScoreDoc scoreDoc : hits) {
                    Document d = searcher.doc(scoreDoc.doc);
                    System.out.println("Found " + d.get("id") + " : " + d.get("name"));
                }
                // searcher can only be closed when there
                searcher.close();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    /**
     *
     */
    private static Directory buildLuceneIndex(Analyzer analyzer) throws CorruptIndexException, LockObtainFailedException, IOException {
        Map<Integer, String> map = new HashMap<Integer, String>();
        map.put(1, "Test Engineer");
        map.put(2, "Test");
        Directory index = new RAMDirectory();
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, analyzer);
        // 1. create the index
        IndexWriter w = new IndexWriter(index, config);
        for (Map.Entry<Integer, String> entry : map.entrySet()) {
            try {
                Document doc = new Document();
                doc.add(new Field("id", entry.getKey().toString(), Field.Store.YES, Field.Index.ANALYZED));
                doc.add(new Field("name", entry.getValue(), Field.Store.YES, Field.Index.ANALYZED));
                w.addDocument(doc);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        w.close();
        return index;
    }

    public static void main(String[] args) {
        Set<String> list = new TreeSet<String>();
        list.add("AB4_Test Eng_AA0XY11");
        list.add("AB4_Test Engineer_AA0XY11");
        search(list);
    }
}
You can have a look at the Lucene Query syntax rules to see how you can enforce the search for Test Engineer.
Basically, using a query such as
AB4_Test AND Eng_AA0XY11
could work, though I am not sure of it. The page linked above is quite concise, and you should quickly be able to find a query that fulfills your needs.
If these two results (Test, Test Engineer) have the same ranking score, then you will see them in the order they came up.
You should try using the length filter and also boosting the terms; maybe then you can come up with a solution.
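To see why 'Test' can end up ahead of 'Test Engineer' for the first keyword, it may help to look at which whole terms the keyword and each indexed name actually share. The sketch below is plain Java, not Lucene: it reuses the split pattern from the question and lowercases tokens as a rough stand-in for StandardAnalyzer, and the class and method names are illustrative. How the question's "+"-joined query string is parsed may differ from this simplified model.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class TermOverlapSketch {
    // Splits on the same separators as the question's code and lowercases the tokens.
    static List<String> tokens(String s) {
        List<String> out = new ArrayList<>();
        for (String t : s.split("[\\-_,/(){}:. ]")) {
            if (!t.isEmpty()) {
                out.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }

    // Whole terms that the keyword and the indexed name have in common.
    static Set<String> sharedTerms(String keyword, String name) {
        Set<String> shared = new LinkedHashSet<>(tokens(keyword));
        shared.retainAll(tokens(name));
        return shared;
    }

    public static void main(String[] args) {
        System.out.println(sharedTerms("AB4_Test Eng_AA0XY11", "Test"));               // [test]
        System.out.println(sharedTerms("AB4_Test Eng_AA0XY11", "Test Engineer"));      // [test]
        System.out.println(sharedTerms("AB4_Test Engineer_AA0XY11", "Test Engineer")); // [test, engineer]
    }
}
```

In this model "eng" is not the same term as "engineer", so for the first keyword both names match only on "test"; with equal term overlap, the shorter field typically receives the higher score through length normalization. Matching "Eng" against "Engineer" would need something like a prefix query (eng*) or a different analysis chain.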
See also:
what is the best lucene setup for ranking exact matches as the highest

Search for '$' using RegexQuery (NOT any other) in a Lucene index

I have the following program:
public class RegexQueryExample {
    public static String[] terms = {
        "US $65M dollars",
        "USA",
        "$35",
        "355",
        "US $33",
        "U.S.A",
        "John Keates",
        "Tom Dick Harry",
        "Southeast' Asia"
    };
    private static Directory directory;

    public static void main(String[] args) throws CorruptIndexException, IOException {
        String searchString = ".*\\$.*";
        createIndex();
        searchRegexIndex(searchString);
    }

    /**
     * Creates an index for the files in the data directory.
     */
    private static void createIndex() throws CorruptIndexException, LockObtainFailedException, IOException {
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
        directory = new RAMDirectory();
        IndexWriter indexWriter = new IndexWriter(directory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
        for (String term : terms) {
            Document document = new Document();
            if (term.indexOf('$') >= 0) {
                document.add(new Field("type", "currency", Field.Store.YES, Field.Index.NOT_ANALYZED));
            } else {
                document.add(new Field("type", "simple_field", Field.Store.YES, Field.Index.NOT_ANALYZED));
            }
            document.add(new Field("term", term, Field.Store.YES, Field.Index.NOT_ANALYZED));
            indexWriter.addDocument(document);
        }
        indexWriter.close();
    }

    /**
     * searches for a regular expression satisfied by a file path.
     *
     * @param regexString the string to be searched.
     */
    private static void searchRegexIndex(String regexString) throws CorruptIndexException, IOException {
        IndexSearcher searcher = new IndexSearcher(directory);
        RegexQuery rquery = new RegexQuery(new Term("term", regexString));
        BooleanQuery queryin = new BooleanQuery();
        BooleanQuery query = new BooleanQuery();
        query.add(new TermQuery(new Term("type", "simple_field")), BooleanClause.Occur.MUST);
        query.add(rquery, BooleanClause.Occur.MUST);
        TopDocs hits = searcher.search(query, terms.length);
        ScoreDoc[] alldocs = hits.scoreDocs;
        for (int i = 0; i < alldocs.length; i++) {
            Document d = searcher.doc(alldocs[i].doc);
            System.out.println((i + 1) + ". " + d.get("term"));
        }
    }
}
The createIndex() function creates the Lucene index, while searchRegexIndex() performs a regex query. In the main() function I search for .*\\$.* expecting it to return the terms containing the $ sign, but it did not work. How do I make it work? Is this a problem with the Analyzer?
Edit:
My Lucene index snapshot from Luke:
You are using StandardAnalyzer, which removes the dollar signs from the tokens. E.g. "US $65M dollars" becomes three tokens: "us", "65m", "dollars". You need to use another analyzer that does not remove the dollar signs. Luke provides an excellent analyzer tool in which you can try out different analyzers and check their outputs.
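As a quick sanity check of the pattern itself, the sketch below applies .*\$.* to the question's sample values with java.util.regex, treating each value as a single untokenized term with the $ intact. This is plain Java rather than Lucene, and the class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class DollarRegexSketch {
    // Sample values copied from the question.
    static final String[] TERMS = {
        "US $65M dollars", "USA", "$35", "355", "US $33",
        "U.S.A", "John Keates", "Tom Dick Harry", "Southeast' Asia"
    };

    // Returns the values that the regex matches in full.
    static List<String> matching(String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> out = new ArrayList<>();
        for (String term : TERMS) {
            if (pattern.matcher(term).matches()) {
                out.add(term);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints [US $65M dollars, $35, US $33]
        System.out.println(matching(".*\\$.*"));
    }
}
```

So the pattern works as long as the indexed term still contains the $ sign. It may also be worth checking the boolean query in the posted code: values containing $ are indexed with type "currency", while the query requires type "simple_field" with Occur.MUST, so those two required clauses cannot both match the same document.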
