retrieve documents after grouping search in lucene


I am running a Lucene search on my resources. I have a case where I search for a particular product and need to group the results by the 'keywords' field. I can get the total number of products grouped by their associated keywords. How can I get all the documents related to this search, so that I can retrieve other fields from them? I tried using AbstractAllGroupHeadsCollector but got confused by its usage. Here is my code.
Thanks in advance.
Integer totalGroupCount = null;
IndexReader ir = DirectoryReader.open(indexLocation);
IndexSearcher is = new IndexSearcher(ir);
GroupingSearch groupingSearch = new GroupingSearch("keywords");
groupingSearch.setGroupSort(Sort.RELEVANCE);
groupingSearch.setFillSortFields(true);
groupingSearch.setCachingInMB(4.0, true);
groupingSearch.setAllGroups(true);
TermQuery query = new TermQuery(new Term("products", "wfa packages"));
TopGroups<BytesRef> result = groupingSearch.search(is, query, 0, 10);
// Render groupsResult...
totalGroupCount = result.totalGroupCount; // The group count
GroupDocs<BytesRef>[] d=result.groups;
System.out.println("total grouped hits=" + result.totalGroupedHitCount);

You already have your GroupDocs array, so you are most of the way there. Get the scoreDocs from each GroupDocs and look up each document by its doc id (ScoreDoc.doc), like this:
for (GroupDocs<BytesRef> group : d) {
    for (ScoreDoc scoredoc : group.scoreDocs) {
        Document doc = is.doc(scoredoc.doc);
        //Do stuff
    }
}

Related

search with lucene's QueryParser while setting fieldName to "*" failed?

I want to search through all fields in the index with Lucene, and I learned to do this by writing code like this:
//Create a parser that searches through all fields
QueryParser queryParser = new QueryParser("*", new StandardAnalyzer());
//Specify the search terms
String queryString = "search terms";
//Create the query to search through all fields
Query query = queryParser.parse(queryString);
//Execute the query and get the results
IndexSearcher searcher = new IndexSearcher(indexReader);
TopDocs results = searcher.search(query, 100);
ScoreDoc[] hits = results.scoreDocs;
//Iterate through the results
for (ScoreDoc hit : hits) {
Document doc = searcher.doc(hit.doc);
//Process the document
}
When I set the first parameter of the QueryParser constructor to "*", as in the code above, I got nothing from the TopDocs: "results.totalHits" is 0. I had expected it to search through all fields of the documents I had written to the index and return all matching documents.
Can anyone tell me what is wrong with my code, or how to use QueryParser to search through all fields in the index?
Thanks!

Lucene LongPoint Range search doesn't work

I am using Lucene 8.2.0 in Java 11.
I am trying to index a Long value so that I can filter by it using a range query, for example like so: +my_range_field:[1 TO 200]. However, any variant of that, even my_range_field:[* TO *], returns 0 results in this minimal example. As soon as I remove the + from it to make it an OR, I get 2 results.
So I am thinking I must make a mistake in how I index it, but I can't make out what it might be.
From the LongPoint JavaDoc:
An indexed long field for fast range filters. If you also need to store the value, you should add a separate StoredField instance.
Finding all documents within an N-dimensional shape or range at search time is efficient. Multiple values for the same field in one document is allowed.
This is my minimal example:
public static void main(String[] args) {
    Directory index = new RAMDirectory();
    StandardAnalyzer analyzer = new StandardAnalyzer();
    try {
        IndexWriter indexWriter = new IndexWriter(index, new IndexWriterConfig(analyzer));
        Document document1 = new Document();
        Document document2 = new Document();
        document1.add(new LongPoint("my_range_field", 10));
        document1.add(new StoredField("my_range_field", 10));
        document2.add(new LongPoint("my_range_field", 100));
        document2.add(new StoredField("my_range_field", 100));
        document1.add(new TextField("my_text_field", "test content 1", Field.Store.YES));
        document2.add(new TextField("my_text_field", "test content 2", Field.Store.YES));
        indexWriter.deleteAll();
        indexWriter.commit();
        indexWriter.addDocument(document1);
        indexWriter.addDocument(document2);
        indexWriter.commit();
        indexWriter.close();
        QueryParser parser = new QueryParser("text", analyzer);
        IndexSearcher indexSearcher = new IndexSearcher(DirectoryReader.open(index));
        String luceneQuery = "+my_text_field:test* +my_range_field:[1 TO 200]";
        Query query = parser.parse(luceneQuery);
        System.out.println(indexSearcher.search(query, 10).totalHits.value);
    } catch (IOException | ParseException e) {
        // Report failures instead of silently swallowing them
        e.printStackTrace();
    }
}

You need to first use StandardQueryParser, then provide the parser with a PointsConfig map, essentially hinting which fields are to be treated as Points. You'll now get 2 results.
// Replace the QueryParser line with the following
StandardQueryParser parser = new StandardQueryParser(analyzer);
IndexSearcher indexSearcher = new IndexSearcher(DirectoryReader.open(index));
/* Added code */
PointsConfig longConfig = new PointsConfig(new DecimalFormat(), Long.class);
Map<String, PointsConfig> pointsConfigMap = new HashMap<>();
pointsConfigMap.put("my_range_field", longConfig);
parser.setPointsConfigMap(pointsConfigMap);
/* End of added code */
String luceneQuery = "+my_text_field:test* +my_range_field:[1 TO 200]";
// StandardQueryParser takes the default field as the second argument
// and throws QueryNodeException rather than ParseException
Query query = parser.parse(luceneQuery, "text");

I found the solution to my problem.
I was under the impression that the query parser could just parse any query string correctly. That doesn't seem to be the case.
Using
Query rangeQuery = LongPoint.newRangeQuery("my_range_field", 1L, 11L);
Query searchQuery = new WildcardQuery(new Term("my_text_field", "test*"));
Query build = new BooleanQuery.Builder()
    .add(searchQuery, BooleanClause.Occur.MUST)
    .add(rangeQuery, BooleanClause.Occur.MUST)
    .build();
returned the correct result.

Remove document from array in MongoDB Java

I got a JSON string that looks something like this:
String tmp = "[
{
"ID":"12",
"Date":"2018-02-02",
"ObjData":[
{
"Name":"AAA",
"Order":"12345",
"Extra1":{
"Temp":"3"
},
"Extra2":{
"Temp":"5"
}
},
{
"Name":"BBB",
"Order":"54321",
"Extra1":{
"Temp":"3"
},
"Extra2":{
"Temp":"5"
}
}
]
}
]"
I would like to remove, for example, the document where 'Order' equals "54321" from 'ObjData'. I have the following code:
Document doc = new Document();
doc = Document.parse(tmp);
Document fields = new Document("ID", "12")
.append("ObjData", Arrays.asList(new Document("Order", "54321")));
Document update = new Document("$pull", fields);
coll.updateOne(doc, update);
I am trying to use the 'pull' operator to remove the entire element from the array where 'Order' equals 54321, but for some reason it's not working; I am probably doing something wrong. Could someone point out the issue, please?
Also, what would be the best way to keep count of the documents within the array, so that once all of them are pulled the entire document is deleted from the database? Would it be good to add some kind of 'size' attribute, keep track of the size, and decrease it after each pull?

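To remove the element with Order "54321" from the embedded array without knowing the ID, you can use an empty filter. Note that updateOne modifies only the first document the filter matches; use updateMany to apply the $pull to every document: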
Document filter = new Document();
Document update = new Document("$pull", new Document("ObjData", new Document("Order", "54321")));
coll.updateOne(filter, update);

Updating records to remove values from ObjData array
The first parameter to the updateOne method is a query to find the document you want to update, not the full document.
So for your code, assuming ID is a unique value and that there's an item in your collection with an ID of "12":
// { ID: "12" }
Document query = new Document("ID", "12");
// { $pull: { ObjData: { Order: "54321" } } }
Document update = new Document("$pull",
    new Document("ObjData",
        new Document("Order", "54321")
    )
);
coll.updateOne(query, update);
Alternatively, if you want to remove the order from all documents in the collection, replace query with an empty Document and call updateMany instead of updateOne (updateOne modifies at most one document), i.e.:
// { <empty> }
Document query = new Document();
Deleting records with empty ObjData array
As for removing records when the size reaches zero, you can use a filter with $size:
db.myColl.deleteMany({ ObjData: { $size: 0 } })
This is also doable using the Java driver:
// { ObjData: { $size: 0 } }
Document query = new Document("ObjData",
new Document("$size", 0)
);
coll.deleteMany(query);
Note that for large collections (i.e. where myColl is large, not the ObjData array), this may not perform very well. If this is the case, then you may want to track the size separately (as you hinted at in your question) and index it to make it faster to search on since you can't create an index on array size in MongoDB.
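The two steps discussed in this answer (pull the matching array element, then delete documents whose array has become empty) can be modeled in plain Java to check the expected behavior. This is only an in-memory sketch, not MongoDB driver code: the class PullThenPrune and its methods are hypothetical names, and documents are represented as ordinary maps.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PullThenPrune {
    // Models { $pull: { ObjData: { Order: order } } } on a single document:
    // removes every ObjData element whose "Order" equals the given value.
    @SuppressWarnings("unchecked")
    static boolean pullOrder(Map<String, Object> doc, String order) {
        Object objData = doc.get("ObjData");
        if (!(objData instanceof List)) {
            return false;
        }
        return ((List<Map<String, Object>>) objData)
                .removeIf(item -> order.equals(item.get("Order")));
    }

    // Models deleteMany({ ObjData: { $size: 0 } }):
    // drops documents whose ObjData is an empty array.
    static void deleteEmpty(List<Map<String, Object>> collection) {
        collection.removeIf(doc -> {
            Object objData = doc.get("ObjData");
            return objData instanceof List && ((List<?>) objData).isEmpty();
        });
    }

    // Hypothetical helper that builds one array element.
    static Map<String, Object> item(String name, String order) {
        Map<String, Object> item = new HashMap<>();
        item.put("Name", name);
        item.put("Order", order);
        return item;
    }

    // One document shaped like the JSON in the question (Extra1/Extra2 omitted).
    static List<Map<String, Object>> sampleCollection() {
        List<Map<String, Object>> objData = new ArrayList<>();
        objData.add(item("AAA", "12345"));
        objData.add(item("BBB", "54321"));
        Map<String, Object> doc = new HashMap<>();
        doc.put("ID", "12");
        doc.put("ObjData", objData);
        List<Map<String, Object>> collection = new ArrayList<>();
        collection.add(doc);
        return collection;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> collection = sampleCollection();
        Map<String, Object> doc = collection.get(0);
        pullOrder(doc, "54321");
        deleteEmpty(collection);
        System.out.println(collection.size()); // 1: one array element is left
        pullOrder(doc, "12345");
        deleteEmpty(collection);
        System.out.println(collection.size()); // 0: the array became empty
    }
}
```

In the sketch the empty-array check runs right after each pull, which mirrors issuing the $size filter delete after each update.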
References
updateOne documentation for updating documents using the Java driver
deleteOne documentation for deleting documents using the Java driver
$pull documentation for removing documents from an array
$size documentation for filtering documents based on the size of an array

Lucene: BestTextFragments returns only the first document

I'm building a Lucene index of Twitter users and their tweets. My idea is to store info about each user (name, description, etc.) together with their tweets, using the following code:
for (Map.Entry<Long, User> entry : users.entrySet()) {
    User user = entry.getValue();
    Document document = new Document();
    document.add(new LongField("id", user.getId(), Field.Store.YES));
    document.add(new StringField("name", user.getName(), Field.Store.YES));
    document.add(new StringField("username", user.getUsername(), Field.Store.YES));
    for (UserTweet t : user.getTweets()) {
        document.add(new TextField("tweet", t.getText(), Field.Store.YES));
    }
    writer.addDocument(document);
}
Here a document can have a lot of tweets in the "tweet" field. The analyzer used for this field is the EnglishAnalyzer.
Is this a correct way to store tweets?
My problem arises when I use the Highlighter to retrieve the tweets that match. If I search for a term that is present in ALL tweets of ALL stored users, I get ALL users as a result (correct!), but when I want to see all the tweets of a single user that match the query (with the Highlighter), I get only the first tweet of every user and not all of them.
This is the code that I use to search:
BooleanQuery.Builder booleanQuery = new BooleanQuery.Builder();
QueryParser queryParserKeywords = new QueryParser("tweet", new EnglishAnalyzer());
String strQueryKeywords = "";
for (String s : c.getValue().split(" "))
strQueryKeywords += "tweet:"+ s +" OR ";
strQueryKeywords = strQueryKeywords.substring(0, strQueryKeywords.lastIndexOf("OR"));
Query queryKeywords = queryParserKeywords.parse(strQueryKeywords);
QueryScorer queryScorerKeywords = new QueryScorer(queryKeywords, "tweet");
Fragmenter fragment = new SimpleSpanFragmenter(queryScorerKeywords, 150);
keywordsHighlighter = new Highlighter(queryScorerKeywords);
keywordsHighlighter.setTextFragmenter(fragment);
booleanQuery.add(queryKeywords, BooleanClause.Occur.SHOULD);
... (other boolean clause over other fields)
searcher.search(booleanQuery.build(), collector);
...
for (ScoreDoc doc : collector.topDocs().scoreDocs) {
Document d = searcher.doc(doc.doc);
TokenStream tokenStream = new EnglishAnalyzer().tokenStream("",d.getField("tweet").stringValue());
TextFragment[] fragments = keywordsHighlighter.getBestTextFragments(tokenStream, d.getField("tweet").stringValue(), false, 10);
for (TextFragment fragment : fragments) {
System.out.println(" - " + fragment.toString());
}
}
What's wrong with my code?
Finally, to search over multiple fields with different text (e.g. City=New York, Keyword=Star Wars, etc.), is it correct to use a BooleanQuery, or is there a better solution?
Thanks a lot.
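As a side note on the query-string assembly in the search code above, the append-then-trim loop can be replaced by a joiner that never produces a trailing "OR". The following is a minimal sketch using only the standard library; the class name KeywordQueryString and the method build are hypothetical, and the sketch does not escape characters that have a special meaning in the query syntax.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

class KeywordQueryString {
    // Builds "field:a OR field:b" from whitespace-separated keywords.
    // Returns an empty string when no keywords are given.
    static String build(String field, String keywords) {
        return Arrays.stream(keywords.trim().split("\\s+"))
                .filter(k -> !k.isEmpty())
                .map(k -> field + ":" + k)
                .collect(Collectors.joining(" OR "));
    }

    public static void main(String[] args) {
        System.out.println(build("tweet", "star wars")); // tweet:star OR tweet:wars
    }
}
```

The resulting string can be passed to the parser in place of strQueryKeywords; an empty result signals that there is nothing to parse.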

Lucene: get newest document for category

I'm pretty new to lucene index, so I apologize in advance if what I am trying to do is trivial.
I have an index where the documents contain (among other) two fields:
documentoId and employeeId.
Each employee can submit various documents. The structure is pretty much the same as in the bookstore example.
What I am trying to achieve, is to get all the newest documents matching a query, meaning with the highest documentoId for each employeeId.
In SQL, this would be something like:
select max(documentoId ), employeeId
from documents
where content like 'mySearchValue'
group by employeeId
I don't know if I should use facet API, or if this can be done with queries, or with the searchAfter method...I'm pretty lost with the documentation.
Any help would be greatly appreciated!
Thanks

Lucene supports grouping search; you need to define the field to group on and how the groups should be sorted. In the example below, I group by documentoId and sort in descending order.
public static void main(String[] args) throws IOException, ParseException {
StandardAnalyzer standardAnalyzer = new StandardAnalyzer(Version.LUCENE_46);
RAMDirectory ramDirectory = new RAMDirectory();
IndexWriter indexWriter = new IndexWriter(ramDirectory, new IndexWriterConfig(Version.LUCENE_46, standardAnalyzer));
Document d0 = new Document();
d0.add(new TextField("employeeId", "foo", Field.Store.YES));
d0.add(new IntField("documentoId", 1, Field.Store.YES));
indexWriter.addDocument(d0);
Document d1 = new Document();
d1.add(new TextField("employeeId", "bar", Field.Store.YES));
d1.add(new IntField("documentoId", 20, Field.Store.YES));
indexWriter.addDocument(d1);
Document d2 = new Document();
d2.add(new TextField("employeeId", "baz", Field.Store.YES));
d2.add(new IntField("documentoId", 3, Field.Store.YES));
indexWriter.addDocument(d2);
indexWriter.commit();
GroupingSearch groupingSearch = new GroupingSearch("documentoId");
Sort groupSort = new Sort(new SortField("documentoId", SortField.Type.INT, true)); // in descending order
groupingSearch.setGroupSort(groupSort);
groupingSearch.setSortWithinGroup(groupSort);
IndexReader reader = DirectoryReader.open(ramDirectory);
IndexSearcher searcher = new IndexSearcher(reader);
TopGroups<?> groups = groupingSearch.search(searcher, new MatchAllDocsQuery(), 0, 10);
Document highestScoredDocument = reader.document(groups.groups[0].scoreDocs[0].doc);
System.out.println(
"Descending order, first document is " +
"employeeId:" + highestScoredDocument.get("employeeId") + " " +
"documentoId:" + highestScoredDocument.get("documentoId")
);
}
The above code detects that d1 (the middle document) sorts to the top and prints the following:
Descending order, first document is employeeId:bar documentoId:20
The above code does not address the content like 'mySearchValue' part; you would have to replace MatchAllDocsQuery with a relevant query to do that.

Custom sorting of the hits will do the trick. Google the search.sort parameter in Lucene.

For those in the same situation, I solved my problem by taking mindas's code and modifying it to group on my own field:
GroupingSearch groupingSearch = new GroupingSearch("employeeId");
Sort groupSort = new Sort(new SortField("documentoId", SortField.Type.INT, true)); // in descending order
groupingSearch.setGroupSort(groupSort);
groupingSearch.setSortWithinGroup(groupSort);
int offset = 0;
int limitGroup = 50;
TopGroups<?> groups = groupingSearch.search(is, query, offset, limitGroup);
List<Document> result = new ArrayList<>();
for (int i = 0; i < groups.groups.length; i++) {
    ScoreDoc sdoc = groups.groups[i].scoreDocs[0]; // first result of each group
    Document d = is.doc(sdoc.doc);
    result.add(d);
}
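To sanity-check what the grouped search above should return, the same "highest documentoId per employeeId" rule can be expressed in plain Java over the matching documents. This is a sketch of the intended semantics only, not Lucene code: the Doc class is a hypothetical stand-in for an indexed document, and the sample values are made up.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class NewestPerEmployee {
    // Hypothetical stand-in for an indexed document; not a Lucene type.
    static final class Doc {
        final String employeeId;
        final int documentoId;

        Doc(String employeeId, int documentoId) {
            this.employeeId = employeeId;
            this.documentoId = documentoId;
        }
    }

    // Keeps, for each employeeId, the document with the highest documentoId.
    static Map<String, Doc> newestPerEmployee(List<Doc> matches) {
        Map<String, Doc> newest = new TreeMap<>();
        for (Doc d : matches) {
            newest.merge(d.employeeId, d,
                    (a, b) -> a.documentoId >= b.documentoId ? a : b);
        }
        return newest;
    }

    public static void main(String[] args) {
        List<Doc> matches = Arrays.asList(
                new Doc("foo", 1), new Doc("foo", 7),
                new Doc("bar", 20), new Doc("bar", 4));
        // Prints "bar -> 20" then "foo -> 7" (TreeMap orders by employeeId)
        newestPerEmployee(matches).forEach((employee, doc) ->
                System.out.println(employee + " -> " + doc.documentoId));
    }
}
```

Comparing this output with the first ScoreDoc of each group is a quick way to confirm that the group field and the within-group sort are set as intended.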
