I'm trying to use Lucene (5.4.1) MoreLikeThis to tag(classify) texts. It's kind of working, but I'm getting poor results, and I think that the problem is related with the Query object.
The example bellow works, but the highest topdoc isn't the one that I expect. By debuging the query object, it shows only content:erro. From a complete portuguese phrase (see into the example) the query was constructed with just one word.
I'm not using stop words or any other kind of filter.
So why lucene is picking just erro as a query term?
To init main objects
Analyzer analyzer = new PortugueseAnalyzer();
Directory indexDir = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(analyzer);
config.setOpenMode(OpenMode.CREATE_OR_APPEND);
To index
try (IndexWriter indexWriter = new IndexWriter(indexDir, config)) {
FieldType type = new FieldType();
type.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
type.setStored(true);
type.setStoreTermVectors(true);
Document doc = new Document();
doc.add(new StringField("id", "880b2bbc", Store.YES));
doc.add(new Field("content", "erro", type));
doc.add(new Field("tag", "atag", type));
indexWriter.addDocument(doc);
indexWriter.commit();
}
To search
try (IndexReader idxReader = DirectoryReader.open(indexDir)) {
IndexSearcher indexSearcher = new IndexSearcher(idxReader);
MoreLikeThis mlt = new MoreLikeThis(idxReader);
mlt.setMinTermFreq(0);
mlt.setMinDocFreq(0);
mlt.setFieldNames(new String[] { "content" });
mlt.setAnalyzer(analyzer);
Reader sReader = new StringReader("Melhorias no controle de sessão no sistema qquercoisa quando expira, ao logar novamente no sistema é exibido o erro "xpto");
Query query = mlt.like("content", sReader);
TopDocs topDocs = indexSearcher.search(query, 3);
}
Well, I decided to take a look inside MoreLokeThis class and I found the answer.
The Query query = mlt.like("content", sReader); call the createQueue(Map<String, Int> words) method in MoreLokeThis class.
Inside it, the tokenized terms/words from sReader (that were converted to a Map) are checked against the index.
Only terms/words that are present into the index are used to create a query.
Using the example that I provided, since my index contains only a document with the word erro, this is the only word that is kept from the phrase that I passed.
Related
I am using Lucene 8.2.0 in Java 11.
I am trying to index a Long value so that I can filter by it using a range query, for example like so: +my_range_field:[1 TO 200]. However, any variant of that, even my_range_field:[* TO *], returns 0 results in this minimal example. As soon as I remove the + from it to make it an OR, I get 2 results.
So I am thinking I must make a mistake in how I index it, but I can't make out what it might be.
From the LongPoint JavaDoc:
An indexed long field for fast range filters. If you also need to store the value, you should add a separate StoredField instance.
Finding all documents within an N-dimensional shape or range at search time is efficient. Multiple values for the same field in one document is allowed.
This is my minimal example:
public static void main(String[] args) {
Directory index = new RAMDirectory();
StandardAnalyzer analyzer = new StandardAnalyzer();
try {
IndexWriter indexWriter = new IndexWriter(index, new IndexWriterConfig(analyzer));
Document document1= new Document();
Document document2= new Document();
document1.add(new LongPoint("my_range_field", 10));
document1.add(new StoredField("my_range_field", 10));
document2.add(new LongPoint("my_range_field", 100));
document2.add(new StoredField("my_range_field", 100));
document1.add(new TextField("my_text_field", "test content 1", Field.Store.YES));
document2.add(new TextField("my_text_field", "test content 2", Field.Store.YES));
indexWriter.deleteAll();
indexWriter.commit();
indexWriter.addDocument(document1);
indexWriter.addDocument(document2);
indexWriter.commit();
indexWriter.close();
QueryParser parser = new QueryParser("text", analyzer);
IndexSearcher indexSearcher = new IndexSearcher(DirectoryReader.open(index));
String luceneQuery = "+my_text_field:test* +my_range_field:[1 TO 200]";
Query query = parser.parse(luceneQuery);
System.out.println(indexSearcher.search(query, 10).totalHits.value);
} catch (IOException e) {
} catch (ParseException e) {
}
}
You need to first use StandardQueryParser, then provide the parser with a PointsConfig map, essentially hinting which fields are to be treated as Points. You'll now get 2 results.
// Change this line to the following
StandardQueryParser parser = new StandardQueryParser(analyzer);
IndexSearcher indexSearcher = new IndexSearcher(DirectoryReader.open(dir));
/* Added code */
PointsConfig longConfig = new PointsConfig(new DecimalFormat(), Long.class);
Map<String, PointsConfig> pointsConfigMap = new HashMap<>();
pointsConfigMap.put("my_range_field", longConfig);
parser.setPointsConfigMap(pointsConfigMap);
/* End of added code */
String luceneQuery = "+my_text_field:test* +my_range_field:[1 TO 200]";
// Change the query to the following
Query query = parser.parse(luceneQuery, "text");
I found the solution to my problem.
I was under the impression that the query parser could just parse any query string correctly. That doesn't seem to be the case.
Using
Query rangeQuery = LongPoint.newRangeQuery("my_range_field", 1L, 11L);
Query searchQuery = new WildcardQuery(new Term("my_text_field", "test*"));
Query build = new BooleanQuery.Builder()
.add(searchQuery, BooleanClause.Occur.MUST)
.add(rangeQuery, BooleanClause.Occur.MUST)
.build();
returned the correct result.
I'm building a Lucene index for Twitter User with their Tweets. My idea is to store info about User (name, description, ecc) with his tweets with the following code:
for (Map.Entry<Long, User> entry : users.entrySet()) {
User user = entry.getValue();
Document document = new Document();
document.add(new LongField("id", user.getId(), Field.Store.YES));
document.add(new StringField("name", user.getName(), Field.Store.YES));
document.add(new StringField("username", user.getUsername(), Field.Store.YES));
for (UserTweet t : user.getTweets()) {
document.add(new TextField("tweet", t.getText(), Field.Store.YES));
}
writer.addDocument(document);
}
Here a document can have a lot of tweets in the "tweet" field. The analyzer used for this field is the EnglishAnalyzer.
Is this method correct to store tweets?
My problem is when I set the Highlighter to retrieve the tweets that match. If I search a term that is present in ALL tweets of ALL stored users, as a result I get ALL users (correct!), but if I want to see all Tweets of a single user that match with the query (with Highlighter) I get only the first Tweet of every user and not all.
This is the code that I use to search:
BooleanQuery.Builder booleanQuery = new BooleanQuery.Builder();
QueryParser queryParserKeywords = new QueryParser("tweet", new EnglishAnalyzer());
String strQueryKeywords = "";
for (String s : c.getValue().split(" "))
strQueryKeywords += "tweet:"+ s +" OR ";
strQueryKeywords = strQueryKeywords.substring(0, strQueryKeywords.lastIndexOf("OR"));
Query queryKeywords = queryParserKeywords.parse(strQueryKeywords);
QueryScorer queryScorerKeywords = new QueryScorer(queryKeywords, "tweet");
Fragmenter fragment = new SimpleSpanFragmenter(queryScorerKeywords, 150);
keywordsHighlighter = new Highlighter(queryScorerKeywords);
keywordsHighlighter.setTextFragmenter(fragment);
booleanQuery.add(queryKeywords, BooleanClause.Occur.SHOULD);
... (other boolean clause over other fields)
searcher.search(booleanQuery.build(), collector);
...
for (ScoreDoc doc : collector.topDocs().scoreDocs) {
Document d = searcher.doc(doc.doc);
TokenStream tokenStream = new EnglishAnalyzer().tokenStream("",d.getField("tweet").stringValue());
TextFragment[] fragments = keywordsHighlighter.getBestTextFragments(tokenStream, d.getField("tweet").stringValue(), false, 10);
for (TextFragment fragment : fragments) {
System.out.println(" - " + fragment.toString());
}
}
What's wrong with my code?
At last, to search over multiple fields with different text (ex: City=New York, Keyword=Star Wars, ecc.), Is it correct to use the BooleanQuery or exist a better solution?
Thanks a lot.
I am doing lucene search on my resources.I have a case where I search for a particular product and I need to do it on a grouping search via 'keywords' field.I can get to know the total number of products grouped by keywords associated with it.How can I get all the documents related to this search, so that I can retrieve other needed fields from it. I tried using AbstractAllGroupHeadsCollector but couldnt find and got confused with its usage.Here is my code.
Thanks in advance.
Integer totalGroupCount = null;
IndexReader ir = DirectoryReader.open(indexLocation);
IndexSearcher is = new IndexSearcher(ir);
GroupingSearch groupingSearch = new GroupingSearch("keywords");
groupingSearch.setGroupSort(Sort.RELEVANCE);
groupingSearch.setFillSortFields(true);
groupingSearch.setCachingInMB(4.0, true);
groupingSearch.setAllGroups(true);
//TermQuery query = new TermQuery(new Term("products", "wfa packages"));
TopGroups<BytesRef> result = groupingSearch.search(is, query, 0, 10);
// Render groupsResult...
totalGroupCount = result.totalGroupCount; // The group count
GroupDocs<BytesRef>[] d=result.groups;
System.out.println("total groups="+result.totalGroupedHitCount);
You have your GroupDocs array, that's most of the way there already. You can then get the scoreDocs from each GroupDocs, and lookup the document with the doc id, from ScoreDoc.doc, like:
for (GroupDocs<BytesRef> group : d) {
for (ScoreDoc scoredoc : group.scoreDocs) {
Document doc = is.doc(scoredoc.doc);
//Do stuff
}
}
I'm pretty new to lucene index, so I apologize in advance if what I am trying to do is trivial.
I have an index where the documents contain (among other) two fields:
documentoId and employeeId.
Each employee can submit various documents. The structure is pretty much the same as in the bookstore example.
What I am trying to achieve, is to get all the newest documents matching a query, meaning with the highest documentoId for each employeeId.
In SQL, this would be something like:
select max(documentoId ), employeeId
from documents
where content like 'mySearchValue'
group by employeeId
I don't know if I should use facet API, or if this can be done with queries, or with the searchAfter method...I'm pretty lost with the documentation.
Any help would be greatly appreciated!
Thanks
Lucene supports grouping search; what you need to do is to define your group and how does it have to be sorted. In the example below, I group by documentoId and sort in descending order.
public static void main(String[] args) throws IOException, ParseException {
StandardAnalyzer standardAnalyzer = new StandardAnalyzer(Version.LUCENE_46);
RAMDirectory ramDirectory = new RAMDirectory();
IndexWriter indexWriter = new IndexWriter(ramDirectory, new IndexWriterConfig(Version.LUCENE_46, standardAnalyzer));
Document d0 = new Document();
d0.add(new TextField("employeeId", "foo", Field.Store.YES));
d0.add(new IntField("documentoId", 1, Field.Store.YES));
indexWriter.addDocument(d0);
Document d1 = new Document();
d1.add(new TextField("employeeId", "bar", Field.Store.YES));
d1.add(new IntField("documentoId", 20, Field.Store.YES));
indexWriter.addDocument(d1);
Document d2 = new Document();
d2.add(new TextField("employeeId", "baz", Field.Store.YES));
d2.add(new IntField("documentoId", 3, Field.Store.YES));
indexWriter.addDocument(d2);
indexWriter.commit();
GroupingSearch groupingSearch = new GroupingSearch("documentoId");
Sort groupSort = new Sort(new SortField("documentoId", SortField.Type.INT, true)); // in descending order
groupingSearch.setGroupSort(groupSort);
groupingSearch.setSortWithinGroup(groupSort);
IndexReader reader = DirectoryReader.open(ramDirectory);
IndexSearcher searcher = new IndexSearcher(reader);
TopGroups<?> groups = groupingSearch.search(searcher, new MatchAllDocsQuery(), 0, 10);
Document highestScoredDocument = reader.document(groups.groups[0].scoreDocs[0].doc);
System.out.println(
"Descending order, first document is " +
"employeeId:" + highestScoredDocument.get("employeeId") + " " +
"documentoId:" + highestScoredDocument.get("documentoId")
);
}
The above code detects that the d1 (middle document) scores at the top and prints the following:
Descending order, first document is employeeId:bar documentoId:20
Above code does not address content like 'mySearchValue' part, you would have to replace MatchAllDocsQuery with a relevant query to do that.
Custom sorting of the hits will do the trick. Google the search.sort parameter in Lucene.
For those in the same situation, I solved my problem using mindas comment and modifying it to use my group field:
GroupingSearch groupingSearch = new GroupingSearch("employeeId");
Sort groupSort = new Sort(new SortField("documentoId", SortField.Type.INT, true)); // in descending order
groupingSearch.setGroupSort(groupSort);
groupingSearch.setSortWithinGroup(groupSort);
int offset = 0;
int limitGroup = 50;
TopGroups<?> groups = groupingSearch.search(is,query, offset, limitGroup);
List<Document> result = new ArrayList();
for (int i=0; i<groups.groups.length; i++) {
ScoreDoc sdoc = groups.groups[i].scoreDocs[0]; // first result of each group
Document d = is.doc(sdoc.doc);
result.add(d);
}
I read How to incorporate multiple fields in QueryParser? but i didn't get it.
At the moment i have a very strange construction like:
parser = New QueryParser("bodytext", analyzer)
parser2 = New QueryParser("title", analyzer)
query = parser.Parse(strSuchbegriff)
query2 = parser.Parse(strSuchbegriff)
What can i do for something like:
parser = New QuerParser ("bodytext" , "title",analyzer)
query =parser.Parse(strSuchbegriff)
so the Parser looks for the searching word in the field "bodytext" an in the field "title".
There are 3 ways to do this.
The first way is to construct a query manually, this is what QueryParser is doing internally. This is the most powerful way to do it, and means that you don't have to parse the user input if you want to prevent access to some of the more exotic features of QueryParser:
IndexReader reader = IndexReader.Open("<lucene dir>");
Searcher searcher = new IndexSearcher(reader);
BooleanQuery booleanQuery = new BooleanQuery();
Query query1 = new TermQuery(new Term("bodytext", "<text>"));
Query query2 = new TermQuery(new Term("title", "<text>"));
booleanQuery.add(query1, BooleanClause.Occur.SHOULD);
booleanQuery.add(query2, BooleanClause.Occur.SHOULD);
// Use BooleanClause.Occur.MUST instead of BooleanClause.Occur.SHOULD
// for AND queries
Hits hits = searcher.Search(booleanQuery);
The second way is to use MultiFieldQueryParser, this behaves like QueryParser, allowing access to all the power that it has, except that it will search over multiple fields.
IndexReader reader = IndexReader.Open("<lucene dir>");
Searcher searcher = new IndexSearcher(reader);
Analyzer analyzer = new StandardAnalyzer();
MultiFieldQueryParser queryParser = new MultiFieldQueryParser(
new string[] {"bodytext", "title"},
analyzer);
Hits hits = searcher.Search(queryParser.parse("<text>"));
The final way is to use the special syntax of QueryParser see here.
IndexReader reader = IndexReader.Open("<lucene dir>");
Searcher searcher = new IndexSearcher(reader);
Analyzer analyzer = new StandardAnalyzer();
QueryParser queryParser = new QueryParser("<default field>", analyzer);
// <default field> is the field that QueryParser will search if you don't
// prefix it with a field.
string special = "bodytext:" + text + " OR title:" + text;
Hits hits = searcher.Search(queryParser.parse(special));
Your other option is to create new field when you index your content called bodytextandtitle, into which you can place the contents of both bodytext and title, then you only have to search one field.
We can not use
BooleanQuery booleanQuery = new BooleanQuery();
We have to use builder
BooleanQuery.Builder finalQuery = new BooleanQuery.Builder();
then we can use finalQuery.build(); to get query
more generic way to do this is
private static TopDocs search(Map filters, IndexSearcher searcher) throws Exception {
StandardAnalyzer analyzer = new StandardAnalyzer();
BooleanQuery.Builder finalQuery = new BooleanQuery.Builder();
for(String attribute : filters.keySet()) {
QueryParser queryParser = new QueryParser(attribute, analyzer);
Query query = queryParser.parse(filters.get(attribute));
finalQuery.add(query, Occur.MUST);
}
TopDocs hits = searcher.search(finalQuery.build(),10);
return hits;
}