Search for '$' using RegexQuery (NOT any other) in a Lucene index

I have the following program:
public class RegexQueryExample {
    public static String[] terms = {
        "US $65M dollars",
        "USA",
        "$35",
        "355",
        "US $33",
        "U.S.A",
        "John Keates",
        "Tom Dick Harry",
        "Southeast' Asia"
    };
    private static Directory directory;
    public static void main(String[] args) throws CorruptIndexException, IOException {
        String searchString = ".*\\$.*";
        createIndex();
        searchRegexIndex(searchString);
    }
    /**
     * Creates an index for the files in the data directory.
     */
    private static void createIndex() throws CorruptIndexException, LockObtainFailedException, IOException {
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
        directory = new RAMDirectory();
        IndexWriter indexWriter = new IndexWriter(directory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
        for (String term : terms) {
            Document document = new Document();
            if (term.indexOf('$') >= 0) {
                document.add(new Field("type", "currency", Field.Store.YES, Field.Index.NOT_ANALYZED));
            } else {
                document.add(new Field("type", "simple_field", Field.Store.YES, Field.Index.NOT_ANALYZED));
            }
            document.add(new Field("term", term, Field.Store.YES, Field.Index.NOT_ANALYZED));
            indexWriter.addDocument(document);
        }
        indexWriter.close();
    }
    /**
     * searches for a regular expression satisfied by a file path.
     *
     * @param searchString the string to be searched.
     */
    private static void searchRegexIndex(String regexString) throws CorruptIndexException, IOException {
        regexString = regexString;
        IndexSearcher searcher = new IndexSearcher(directory);
        RegexQuery rquery = new RegexQuery(new Term("term", regexString));
        BooleanQuery queryin = new BooleanQuery();
        BooleanQuery query = new BooleanQuery();
        query.add(new TermQuery(new Term("type", "simple_field")), BooleanClause.Occur.MUST);
        query.add(rquery, BooleanClause.Occur.MUST);
        TopDocs hits = searcher.search(query, terms.length);
        ScoreDoc[] alldocs = hits.scoreDocs;
        for (int i = 0; i < alldocs.length; i++) {
            Document d = searcher.doc(alldocs[i].doc);
            System.out.println((i + 1) + ". " + d.get("term"));
        }
    }
}
The createIndex() function creates the Lucene index, while searchRegexIndex() performs a regex query. In main() I search for .*\\$.*, expecting it to return the terms that contain the $ sign, but it does not work. How do I make it work? Is this a problem with the Analyzer?
Edit:
My Lucene index snapshot from Luke:

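You are using StandardAnalyzer, which removes the dollar signs from the tokens. For example, "US $65M dollars" becomes three tokens: "us", "65m" and "dollars". You need to use another analyzer that does not remove the dollar signs. Luke provides an excellent analyzer tool in which you can try out different analyzers and check their output. Also note that the query in the question requires type "simple_field" (Occur.MUST), while createIndex() stores every term containing '$' with type "currency", so that clause alone excludes the documents you are looking for.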
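To make the token effect concrete, here is a minimal sketch that uses only java.util.regex and no Lucene at all. The token list is the one quoted in the answer above; the class and method names are made up for illustration, and the sketch assumes the regex is tested against each indexed term as a whole.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class DollarRegexSketch {
    // Same pattern as in the question: the Java literal ".*\\$.*" is the regex .*\$.*
    static final Pattern DOLLAR = Pattern.compile(".*\\$.*");

    /** Returns true if at least one term matches the pattern as a whole. */
    static boolean anyTermMatches(List<String> terms) {
        for (String term : terms) {
            if (DOLLAR.matcher(term).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The value kept as one single term (what an un-analyzed field would hold).
        List<String> untokenized = Arrays.asList("US $65M dollars");
        // The tokens the answer says StandardAnalyzer produces for the same value.
        List<String> tokens = Arrays.asList("us", "65m", "dollars");

        System.out.println("whole value matches: " + anyTermMatches(untokenized));
        System.out.println("any token matches:   " + anyTermMatches(tokens));
    }
}
```

Once the '$' is gone from every term, no regex that requires a literal '$' can match, whatever query class is used.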

Related

Lucene BooleanQuery OR output not as wanted

I have an index with some documents: 593 documents contain the word computer, 51 documents contain the word science, and 596 documents contain computer OR science. I want to output those 596 documents. Here's my code:
public class Main
{
    public static void main(String args[]) throws IOException, ParseException{
        String[] champ ={"W", "A"};
        BooleanQuery q = new BooleanQuery();
        BooleanQuery qIntermediaire;
        qIntermediaire = new BooleanQuery();
        for(int i=0;i<champ.length;i++){
            qIntermediaire.add(new BooleanClause(new FuzzyQuery(new Term(champ[i], "computer"), 0), BooleanClause.Occur.SHOULD));
        }
        q.add(new BooleanClause(qIntermediaire, BooleanClause.Occur.MUST));
        qIntermediaire = new BooleanQuery();
        for(int i=0;i<champ.length;i++){
            qIntermediaire.add(new BooleanClause(new FuzzyQuery(new Term(champ[i], "science"), 0), BooleanClause.Occur.SHOULD));
        }
        q.add(new BooleanClause(qIntermediaire, BooleanClause.Occur.SHOULD));
        Path indexPath = Paths.get("MonIndex");
        Directory directory = FSDirectory.open(indexPath);
        DirectoryReader reader = DirectoryReader.open(directory);
        IndexSearcher iSearcher = new IndexSearcher(reader);
        TopDocs topdocs = iSearcher.search(q, 10000);
        ScoreDoc[] resultsList = topdocs.scoreDocs;
        System.out.println(resultsList.length);
    }
}
For some reason this gives me 461 documents :(

Porting code from lucene to elasticsearch

I have the following simple code that I want to port from Lucene 6.5.x to Elasticsearch 5.3.x.
However, the scores are different, and I want to get the same scores as in Lucene.
For example, the idf:
Lucene's docFreq is 3 (3 docs contain the term "d") and docCount is 4 (documents with this field). Elasticsearch has a docFreq of 1 and a docCount of 2 (or 1 and 1). I am not sure how these values relate to each other in Elasticsearch...
The other difference in scoring is the avgFieldLength:
Lucene is right with 14 / 4 = 3.5. Elasticsearch reports a different value for each score result, but it should be the same for all documents...
Can you please tell me which settings/mappings I missed in Elasticsearch to get it to work like Lucene?
IndexingExample.java:
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.search.*;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.document.Field;
import java.io.IOException;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
public class IndexingExample {
    private static final String INDEX_DIR = "/tmp/lucene6idx";
    private IndexWriter createWriter() throws IOException {
        FSDirectory dir = FSDirectory.open(Paths.get(INDEX_DIR));
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        return new IndexWriter(dir, config);
    }
    private List<Document> createDocs() {
        List<Document> docs = new ArrayList<>();
        FieldType summaryType = new FieldType();
        summaryType.setIndexOptions(IndexOptions.DOCS_AND_FREQS);
        summaryType.setStored(true);
        summaryType.setTokenized(true);
        Document doc1 = new Document();
        doc1.add(new Field("title", "b c d d d", summaryType));
        docs.add(doc1);
        Document doc2 = new Document();
        doc2.add(new Field("title", "b c d d", summaryType));
        docs.add(doc2);
        Document doc3 = new Document();
        doc3.add(new Field("title", "b c d", summaryType));
        docs.add(doc3);
        Document doc4 = new Document();
        doc4.add(new Field("title", "b c", summaryType));
        docs.add(doc4);
        return docs;
    }
    private IndexSearcher createSearcher() throws IOException {
        Directory dir = FSDirectory.open(Paths.get(INDEX_DIR));
        IndexReader reader = DirectoryReader.open(dir);
        return new IndexSearcher(reader);
    }
    public static void main(String[] args) throws IOException, ParseException {
        // indexing
        IndexingExample app = new IndexingExample();
        IndexWriter writer = app.createWriter();
        writer.deleteAll();
        List<Document> docs = app.createDocs();
        writer.addDocuments(docs);
        writer.commit();
        writer.close();
        // search
        IndexSearcher searcher = app.createSearcher();
        Query q1 = new TermQuery(new Term("title", "d"));
        TopDocs hits = searcher.search(q1, 20);
        System.out.println(hits.totalHits + " docs found for the query \"" + q1.toString() + "\"");
        int num = 0;
        for (ScoreDoc sd : hits.scoreDocs) {
            Explanation expl = searcher.explain(q1, sd.doc);
            System.out.println(expl);
        }
    }
}
Elasticsearch:
DELETE twitter
PUT twitter/tweet/1
{
    "title" : "b c d d d"
}
PUT twitter/tweet/2
{
    "title" : "b c d d"
}
PUT twitter/tweet/3
{
    "title" : "b c d"
}
PUT twitter/tweet/4
{
    "title" : "b c"
}
POST /twitter/tweet/_search
{
    "explain": true,
    "query": {
        "term" : {
            "title" : "d"
        }
    }
}

Problem solved with the help of jimczy:
Don't forget that ES creates an index with 5 shards by default and
that docFreq and docCount are computed per shard. You can create an
index with 1 shard or use the dfs mode to compute distributed stats:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-search-type.html#dfs-query-then-fetch
This search query (dfs_query_then_fetch) worked as expected:
POST /twitter/tweet/_search?search_type=dfs_query_then_fetch
{
    "explain": true,
    "query": {
        "term" : {
            "title" : "d"
        }
    }
}
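The effect of per-shard statistics can be checked with a few lines of arithmetic. The sketch below is plain Java without Lucene or Elasticsearch. It assumes the default BM25 similarity, whose idf is ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)); the class and method names are illustrative only, and the inputs are the numbers quoted in the question.

```java
final class ShardStatsSketch {
    /** BM25-style idf (assumed default similarity): ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)). */
    static double idf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    /** Average field length: total number of terms in the field divided by the number of documents. */
    static double avgFieldLength(long totalTerms, long docCount) {
        return (double) totalTerms / docCount;
    }

    public static void main(String[] args) {
        // Whole index, as Lucene sees it: 3 of 4 documents contain "d".
        System.out.println("global idf      = " + idf(3, 4));
        // Statistics seen by a single shard, as reported in the question.
        System.out.println("shard idf (1,2) = " + idf(1, 2));
        System.out.println("shard idf (1,1) = " + idf(1, 1));
        // 5 + 4 + 3 + 2 = 14 terms spread over 4 documents.
        System.out.println("avgFieldLength  = " + avgFieldLength(14, 4));
    }
}
```

Under that assumption the three idf values come out at roughly 0.357, 0.693 and 0.288, so the same term gets a different score depending on which shard holds the document. With a single shard or with dfs_query_then_fetch, every document is scored with the global statistics.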

Lucene search engine isn't accurate, can't figure out why

I am trying to create a search engine for the first time, and I'm using the library offered by Apache Lucene. Everything works fine, however when I search for more than one word, for example "computer science" the results that I get aren't accurate because I never get documents that contain both words. It searches the documents for each word separately (I get documents that contain either "computer" or "science" but never both).
I've been staring at my code for almost a week now and I can't figure out the problem. The query parsing seems to work perfectly, so I think the problem might be in the search, but I don't know what I'm doing wrong. If you can help me, I'll be grateful.
public static wikiPage[] index(String searchQuery) throws SQLException, IOException, ParseException {
    String sql = "select * from Record";
    ResultSet rs = db.runSql(sql);
    StandardAnalyzer analyzer = new StandardAnalyzer();
    Directory index = new RAMDirectory();
    IndexWriterConfig config = new IndexWriterConfig(analyzer);
    //1. Indexer
    try (IndexWriter w = new IndexWriter(index, config)) {
        while (rs.next()) {
            String RecordID = rs.getString("RecordID");
            String URL = rs.getString("URL");
            String Title = rs.getString("Title");
            String Info = rs.getString("Info");
            addDoc(w, RecordID, URL, Info, Title);
        }
    }
    catch (Exception e) {
        System.out.print(e);
        index.close();
    }
    //2. Query
    MultiFieldQueryParser multipleQueryParser = new MultiFieldQueryParser(new String[]{"Title", "Info"}, new StandardAnalyzer());
    Query q = multipleQueryParser.parse(searchQuery);
    //3. Search
    IndexReader reader = DirectoryReader.open(index);
    IndexSearcher searcher = new IndexSearcher(reader);
    TopDocs results = searcher.search(q, 10000);
    ScoreDoc[] hits = results.scoreDocs;
    // 4. display results
    wikiPage[] resultArray = new wikiPage[hits.length];
    System.out.println("Found " + hits.length + " hits.");
    for (int i = 0; i < hits.length; ++i) {
        int docId = hits[i].doc;
        Document d = searcher.doc(docId);
        resultArray[i] = new wikiPage(d.get("URL"), d.get("Title"));
        System.out.println((i + 1) + ". " + d.get("Title") + "\t" + d.get("URL"));
    }
    reader.close();
    return resultArray;
}
private static void addDoc(IndexWriter w, String RecordID, String URL, String Info, String Title) throws IOException {
    Document doc = new Document();
    doc.add(new StringField("RecordID", RecordID, Field.Store.YES));
    doc.add(new TextField("Title", Title, Field.Store.YES));
    doc.add(new TextField("URL", URL, Field.Store.YES));
    doc.add(new TextField("Info", Info, Field.Store.YES));
    w.addDocument(doc);
}
This is the output of System.out.println(q.toString());
(Title:computer Info:computer) (Title:science Info:science)

If you want to search it as a phrase (that is, finding "computer" and "science" together), surround the query with quotes, so it should look like "computer science". In your code, you could do something like:
Query q = multipleQueryParser.parse("\"" + searchQuery + "\"");
If you just want to find docs that contain both terms somewhere in the document, but not necessarily together, the query should look like +computer +science. Probably the easiest way to do this is to change the default operator of your query parser:
multipleQueryParser.setDefaultOperator(QueryParser.Operator.AND);
Query q = multipleQueryParser.parse(searchQuery);
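If you would rather build the query string yourself, the two forms described above can be produced with a couple of small helpers. This is only a sketch in plain Java: the class and method names are hypothetical, and apart from quotes and backslashes inside a phrase it does not escape query-syntax special characters, so treat it as a starting point rather than a complete solution.

```java
final class QueryStringSketch {
    /** Wraps the input in double quotes (phrase search), escaping embedded backslashes and quotes. */
    static String asPhrase(String input) {
        String escaped = input.trim().replace("\\", "\\\\").replace("\"", "\\\"");
        return "\"" + escaped + "\"";
    }

    /** Prefixes every whitespace-separated word with '+' so that each word is required. */
    static String requireAll(String input) {
        StringBuilder sb = new StringBuilder();
        for (String word : input.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // happens only for blank input
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append('+').append(word);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(asPhrase("computer science"));   // phrase form
        System.out.println(requireAll("computer science")); // all-terms-required form
    }
}
```

The resulting string is what you would pass to multipleQueryParser.parse(...). Changing the default operator, as shown above, achieves the second form without any string manipulation.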

As per the doc, prefix required terms with + and use AND (and OR for readability).
Try this:
(Title:+computer OR Info:+computer) AND (Title:+science OR Info:+science)
Maybe build this string and use it directly.

Why Lucene algorithm not working for Exact String in Java?

I am working with Lucene in Java.
We have 100K stop names in a MySQL database.
The stop names look like
NEW YORK PENN STATION,
NEWARK PENN STATION,
NEWARK BROAD ST,
NEW PROVIDENCE
etc
When the user enters a search input like NEW YORK, we get the NEW YORK PENN STATION stop in the results, but when the user enters the exact name NEW YORK PENN STATION, it returns zero results.
My code is:
public ArrayList<String> getSimilarString(ArrayList<String> source, String querystr)
{
    ArrayList<String> arResult = new ArrayList<String>();
    try
    {
        // 0. Specify the analyzer for tokenizing text.
        // The same analyzer should be used for indexing and searching
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
        // 1. create the index
        Directory index = new RAMDirectory();
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
        IndexWriter w = new IndexWriter(index, config);
        for(int i = 0; i < source.size(); i++)
        {
            addDoc(w, source.get(i), "1933988" + (i + 1) + "z");
        }
        w.close();
        // 2. query
        // the "title" arg specifies the default field to use
        // when no field is explicitly specified in the query.
        Query q = new QueryParser(Version.LUCENE_40, "title", analyzer).parse(querystr + "*");
        // 3. search
        int hitsPerPage = 20;
        IndexReader reader = DirectoryReader.open(index);
        IndexSearcher searcher = new IndexSearcher(reader);
        TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
        searcher.search(q, collector);
        ScoreDoc[] hits = collector.topDocs().scoreDocs;
        // 4. Get results
        for(int i = 0; i < hits.length; ++i)
        {
            int docId = hits[i].doc;
            Document d = searcher.doc(docId);
            arResult.add(d.get("title"));
        }
        // reader can only be closed when there
        // is no need to access the documents any more.
        reader.close();
    }
    catch(Exception e)
    {
        System.out.println("Exception (LuceneAlgo.getSimilarString()) : " + e);
    }
    return arResult;
}
private static void addDoc(IndexWriter w, String title, String isbn) throws IOException
{
    Document doc = new Document();
    doc.add(new TextField("title", title, Field.Store.YES));
    // use a string field for isbn because we don't want it tokenized
    doc.add(new StringField("isbn", isbn, Field.Store.YES));
    w.addDocument(doc);
}
In this code, source is the list of stop names and querystr is the user's search input.
Does Lucene algorithm work on Large String?
Why Lucene algorithm is not working on Exact String?

Instead of
1) Query q = new QueryParser(Version.LUCENE_40, "title", analyzer).parse(querystr + "*");
Ex: "new york station" will be parsed to "title:new title:york title:station*".
This query will return all the docs containing any of the above terms.
Try this:
2) Query q = new QueryParser(Version.LUCENE_40, "title", analyzer).parse("+(" + querystr + ")");
Ex1: "new york" will be parsed to "+(title:new title:york)"
The above '+' indicates 'must' occurrence of the term in the result document.
It will match both the docs containing "new york" and "new york station"
Ex2: "new york station" will be parsed to +(title:new title:york title:station).
The query will match only "new york station" and not just "new york" since station is not present.
Please make sure that the field name 'title' is what you're looking for.
Your questions:
Does Lucene algorithm work on Large String?
You would have to define what a large string is. Are you actually looking for phrase search? In general, yes, Lucene works for large strings.
Why Lucene algorithm is not working on Exact String?
Because parsing (querystr + "*") generates individual term queries connected by the OR operator.
Ex: 'new york*' will be parsed to "title:new OR title:york*".
If you want to find "new york station", the wildcard query above is not what you need. The StandardAnalyzer you passed in tokenizes "new york station" at index time, i.e. breaks it into 3 terms.
So the query "york*" finds "york station" only because the document contains the term "york", not because of the wildcard: "york" knows nothing about "station", since they are different terms, i.e. different entries in the index.
What you actually need for finding an exact string is a PhraseQuery, for which the query string should be "new york" WITH the quotes.
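The difference between a per-term wildcard and a phrase can be seen in a toy inverted index. The sketch below is plain Java and not Lucene: the tokenizer (lower-case, split on whitespace) is only a stand-in for StandardAnalyzer, and all class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class TermIndexSketch {
    // term -> ids of the titles that contain that term
    private final Map<String, Set<Integer>> postings = new TreeMap<>();
    // tokens of each title, in order, indexed by title id
    private final List<String[]> tokenized = new ArrayList<>();

    /** Stand-in tokenizer: lower-case and split on whitespace. */
    static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).trim().split("\\s+");
    }

    void add(String title) {
        int id = tokenized.size();
        String[] tokens = tokenize(title);
        tokenized.add(tokens);
        for (String token : tokens) {
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(id);
        }
    }

    /** Ids of titles with at least one single term starting with the prefix (like york*). */
    Set<Integer> prefix(String prefix) {
        String p = prefix.toLowerCase(Locale.ROOT);
        Set<Integer> hits = new TreeSet<>();
        for (Map.Entry<String, Set<Integer>> entry : postings.entrySet()) {
            if (entry.getKey().startsWith(p)) {
                hits.addAll(entry.getValue());
            }
        }
        return hits;
    }

    /** Ids of titles that contain the words of the phrase as consecutive terms. */
    Set<Integer> phrase(String phrase) {
        String[] words = tokenize(phrase);
        Set<Integer> hits = new TreeSet<>();
        for (int id = 0; id < tokenized.size(); id++) {
            String[] doc = tokenized.get(id);
            for (int start = 0; start + words.length <= doc.length; start++) {
                if (Arrays.equals(Arrays.copyOfRange(doc, start, start + words.length), words)) {
                    hits.add(id);
                    break;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TermIndexSketch index = new TermIndexSketch();
        index.add("NEW YORK PENN STATION"); // id 0
        index.add("NEWARK PENN STATION");   // id 1
        index.add("NEWARK BROAD ST");       // id 2
        index.add("NEW PROVIDENCE");        // id 3
        System.out.println("york*          -> " + index.prefix("york"));
        System.out.println("new*           -> " + index.prefix("new"));
        System.out.println("\"new york\"     -> " + index.phrase("new york"));
        System.out.println("\"york station\" -> " + index.phrase("york station"));
    }
}
```

In this toy index the prefix new* also hits both NEWARK titles, because the prefix is applied to each term on its own, whereas the phrase "york station" hits nothing because the two terms are not adjacent in any title.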

Why Lucene does not return the results based on whole word match?

I am using Lucene to match the keywords with list of words within an application. The whole process is automated without any human intervention. Best matched result (the one on the top and highest score) is picked from the results list returned from Lucene.
The following code demonstrates the above functionality and the results are printed on console.
Problem :
The problem is that Lucene searches for the keyword (the word to be searched) and returns, as the top result, a word that only partially matches the keyword. The fully matching result also exists, but it is not ranked in the first position.
For example, suppose I have a Lucene RAM index that contains the words 'Test' and 'Test Engineer'. If I search the index for 'AB4_Test Eng_AA0XY11', the results are
Test
Test Engineer
Eng in 'AB4_Test Eng_AA0XY11' matched Engineer (that is why it is listed in the results), but it does not get the top position. I want to optimize my solution to bring 'Test Engineer' to the top, because it is the best match when the whole keyword is considered. Can anyone help me solve this problem?
public class LuceneTest {
    private static void search(Set<String> keywords) {
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
        try {
            // 1. create the index
            Directory luceneIndex = buildLuceneIndex(analyzer);
            int hitsPerPage = 5;
            IndexReader reader = IndexReader.open(luceneIndex);
            for(String keyword : keywords) {
                // Create query string. replace all underscore, hyphen, comma, ( , ), {, }, . with plus sign
                StringBuilder querystr = new StringBuilder(128);
                String [] splitName = keyword.split("[\\-_,/(){}:. ]");
                // After tokenizing also add plus sign between each camel case word.
                for (String token : splitName) {
                    querystr.append(token + "+");
                }
                // 3. search
                IndexSearcher searcher = new IndexSearcher(reader);
                TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
                Query q = new QueryParser(Version.LUCENE_36, "name", analyzer).parse(querystr.toString());
                searcher.search(q, collector);
                ScoreDoc[] hits = collector.topDocs().scoreDocs;
                System.out.println();
                System.out.println(keyword);
                System.out.println("----------------------");
                for (ScoreDoc scoreDoc : hits) {
                    Document d = searcher.doc(scoreDoc.doc);
                    System.out.println("Found " + d.get("id") + " : " + d.get("name"));
                }
                // searcher can only be closed when there
                searcher.close();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    /**
     *
     */
    private static Directory buildLuceneIndex(Analyzer analyzer) throws CorruptIndexException, LockObtainFailedException, IOException{
        Map<Integer, String> map = new HashMap<Integer, String>();
        map.put(1, "Test Engineer");
        map.put(2, "Test");
        Directory index = new RAMDirectory();
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, analyzer);
        // 1. create the index
        IndexWriter w = new IndexWriter(index, config);
        for (Map.Entry<Integer, String> entry : map.entrySet()) {
            try {
                Document doc = new Document();
                doc.add(new Field("id", entry.getKey().toString(), Field.Store.YES, Field.Index.ANALYZED));
                doc.add(new Field("name", entry.getValue() , Field.Store.YES, Field.Index.ANALYZED));
                w.addDocument(doc);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        w.close();
        return index;
    }
    public static void main(String[] args) {
        Set<String> list = new TreeSet<String>();
        list.add("AB4_Test Eng_AA0XY11");
        list.add("AB4_Test Engineer_AA0XY11");
        search(list);
    }
}

You can have a look at the Lucene query syntax rules to see how you can enforce the search for Test Engineer.
Basically, a query such as
AB4_Test AND Eng_AA0XY11
could work, though I am not sure of it. The query syntax page is quite concise, so you should quickly find a query that fulfils your needs.

If these two results (Test, Test Engineer) have the same ranking score, you will see them in the order in which they came up.
You should try using the length filter and boosting the terms; that may lead you to a solution.
See also:
what is the best lucene setup for ranking exact matches as the highest
