How to sort Numeric field in Lucene 6

I want to sort my search results by a numeric field.
In the following example code, I want to sort on the 'age' field.
I started from the answers to:
How to sort IntPont or LongPoint field in Lucene 6
But that sorts by SCORE; the ages are still not in ascending order.
I also tried:
Sorting search result in Lucene based on a numeric field
There, I changed SortField.Type.SCORE to SortField.Type.LONG in the search call.
But then I get:
unexpected docvalues type NONE for field 'age' (expected=NUMERIC)
Here is my code:
public class TestLongPointSort {
    public static void main(String[] args) throws Exception {
        String indexPath = "/tmp/testSort";
        Analyzer standardAnalyzer = new StandardAnalyzer();
        Directory indexDir = FSDirectory.open(Paths.get(indexPath));
        IndexWriterConfig iwc = new IndexWriterConfig(standardAnalyzer);
        iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
        IndexWriter masterIndex = new IndexWriter(indexDir, iwc);

        Document doc = new Document();
        String name = "bob";
        doc.add(new TextField("name", name, Field.Store.YES));
        doc.add(new SortedDocValuesField("name", new BytesRef(name)));
        doc.add(new SortedNumericDocValuesField("age", 20L));
        doc.add(new StoredField("age", 20L));
        long ts = System.currentTimeMillis();
        doc.add(new SortedNumericDocValuesField("ts", ts));
        doc.add(new StoredField("ts", ts));
        masterIndex.addDocument(doc);
        Thread.sleep(1);

        name = "max";
        doc = new Document();
        doc.add(new TextField("name", name, Field.Store.YES));
        doc.add(new SortedDocValuesField("name", new BytesRef(name)));
        doc.add(new SortedNumericDocValuesField("age", 19L));
        doc.add(new StoredField("age", 19L));
        ts = System.currentTimeMillis();
        doc.add(new SortedNumericDocValuesField("ts", ts));
        doc.add(new StoredField("ts", ts));
        masterIndex.addDocument(doc);
        Thread.sleep(1);

        name = "jim";
        doc = new Document();
        doc.add(new TextField("name", name, Field.Store.YES));
        doc.add(new SortedDocValuesField("name", new BytesRef(name)));
        doc.add(new SortedNumericDocValuesField("age", 21L));
        doc.add(new StoredField("age", 21L));
        ts = System.currentTimeMillis();
        doc.add(new SortedNumericDocValuesField("ts", ts));
        doc.add(new StoredField("ts", ts));
        masterIndex.addDocument(doc);
        masterIndex.commit();
        masterIndex.close();

        IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(indexPath)));
        IndexSearcher searcher = new IndexSearcher(reader);
        Analyzer analyzer = new KeywordAnalyzer();
        QueryParser queryParser = new QueryParser("message", analyzer);
        Sort sort;
        TopDocs docs;

        sort = new Sort(new SortField("name", SortField.Type.STRING));
        docs = searcher.search(new MatchAllDocsQuery(), 100, sort);
        System.out.println("Sorted by name");
        for (ScoreDoc scoreDoc : docs.scoreDocs) {
            Document doc2 = searcher.doc(scoreDoc.doc);
            System.out.println("Name:" + doc2.get("name") + " ; age:" + doc2.get("age") + " ; ts:" + doc2.get("ts"));
        }

        //docs = searcher.search(new MatchAllDocsQuery(), 100, new Sort(new SortField("age", SortField.Type.SCORE, true)));
        docs = searcher.search(new MatchAllDocsQuery(), 100, new Sort(new SortField("age", SortField.Type.LONG, true)));
        System.out.println("Sorted by age");
        for (ScoreDoc scoreDoc : docs.scoreDocs) {
            Document doc2 = searcher.doc(scoreDoc.doc);
            System.out.println("Name:" + doc2.get("name") + " ; age:" + doc2.get("age") + " ; ts:" + doc2.get("ts"));
        }
        reader.close();
    }
}
As we can see, sorting by STRING works, but I haven't figured out how to get my numbers (LONG) sorted.
What is the right way to sort numeric fields?
Thanks

To sort search results using a SortedNumericDocValuesField, you'll need to use a SortedNumericSortField:
Sort sort = new Sort(new SortedNumericSortField("age", SortField.Type.LONG, true));
TopDocs docs = searcher.search(new MatchAllDocsQuery(), 100, sort);
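A SortedNumericDocValuesField can hold several values per document, so the sort has to pick one value per document first. With the three-argument constructor used above, SortedNumericSortField sorts by the minimum value of each document. The following is a plain-Java sketch of that selection step only; it does not use Lucene, and the class and method names (MinSelectorSortSketch, minValue, sortByMin, sampleDocs) are made up for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; none of these names are Lucene classes.
class MinSelectorSortSketch {

    // Picks the per-document sort key the way a MIN selector would: the smallest value.
    static long minValue(long[] values) {
        return Arrays.stream(values).min().getAsLong();
    }

    // Returns document names ordered by their minimum value; reverse=true means descending.
    static List<String> sortByMin(Map<String, long[]> docs, boolean reverse) {
        Comparator<String> byMin = Comparator.comparingLong(name -> minValue(docs.get(name)));
        List<String> names = new ArrayList<>(docs.keySet());
        names.sort(reverse ? byMin.reversed() : byMin);
        return names;
    }

    // The three documents from the question, keyed by name, each with a single age value.
    static Map<String, long[]> sampleDocs() {
        Map<String, long[]> docs = new LinkedHashMap<>();
        docs.put("bob", new long[] {20L});
        docs.put("max", new long[] {19L});
        docs.put("jim", new long[] {21L});
        return docs;
    }

    public static void main(String[] args) {
        // reverse=true mirrors the third constructor argument in the answer above
        System.out.println(sortByMin(sampleDocs(), true));
    }
}
```

Note that passing true as the last argument gives descending order; for ascending ages, pass false.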

I would suggest storing the data from each Document in an ArrayList, rather than saving it to another document, and then using the ArrayList sort methods.
Please see these links for reference:
SO - how to sort arraylist
JAVA ArrayList sort method sample

Related

Lucene: BestTextFragments returns only the first document

I'm building a Lucene index of Twitter users with their tweets. My idea is to store information about each user (name, description, etc.) together with their tweets, using the following code:
for (Map.Entry<Long, User> entry : users.entrySet()) {
User user = entry.getValue();
Document document = new Document();
document.add(new LongField("id", user.getId(), Field.Store.YES));
document.add(new StringField("name", user.getName(), Field.Store.YES));
document.add(new StringField("username", user.getUsername(), Field.Store.YES));
for (UserTweet t : user.getTweets()) {
document.add(new TextField("tweet", t.getText(), Field.Store.YES));
}
writer.addDocument(document);
}
Here a document can have many tweets in the "tweet" field. The analyzer used for this field is the EnglishAnalyzer.
Is this a correct way to store tweets?
My problem appears when I use the Highlighter to retrieve the matching tweets. If I search for a term that is present in ALL tweets of ALL stored users, I get ALL users as a result (correct!). But if I want to see all tweets of a single user that match the query (with the Highlighter), I get only the first tweet of each user and not all of them.
This is the code that I use to search:
BooleanQuery.Builder booleanQuery = new BooleanQuery.Builder();
QueryParser queryParserKeywords = new QueryParser("tweet", new EnglishAnalyzer());
String strQueryKeywords = "";
for (String s : c.getValue().split(" "))
strQueryKeywords += "tweet:"+ s +" OR ";
strQueryKeywords = strQueryKeywords.substring(0, strQueryKeywords.lastIndexOf("OR"));
Query queryKeywords = queryParserKeywords.parse(strQueryKeywords);
QueryScorer queryScorerKeywords = new QueryScorer(queryKeywords, "tweet");
Fragmenter fragment = new SimpleSpanFragmenter(queryScorerKeywords, 150);
keywordsHighlighter = new Highlighter(queryScorerKeywords);
keywordsHighlighter.setTextFragmenter(fragment);
booleanQuery.add(queryKeywords, BooleanClause.Occur.SHOULD);
... (other boolean clause over other fields)
searcher.search(booleanQuery.build(), collector);
...
for (ScoreDoc doc : collector.topDocs().scoreDocs) {
Document d = searcher.doc(doc.doc);
TokenStream tokenStream = new EnglishAnalyzer().tokenStream("",d.getField("tweet").stringValue());
TextFragment[] fragments = keywordsHighlighter.getBestTextFragments(tokenStream, d.getField("tweet").stringValue(), false, 10);
for (TextFragment fragment : fragments) {
System.out.println(" - " + fragment.toString());
}
}
What's wrong with my code?
Finally, to search over multiple fields with different text (e.g. City=New York, Keyword=Star Wars, etc.), is it correct to use a BooleanQuery, or is there a better solution?
Thanks a lot.

need to display all the terms for a given index

I need to display all the terms for a given Lucene index.
public void addDocuments(IndexWriter indexWriter) throws IOException {
Document doc1 = new Document();
doc1.add(new TextField("title", "harrypotter", Field.Store.YES));
indexWriter.addDocument(doc1);
Document doc2 = new Document();
doc2.add(new TextField("title", "luceneinaction", Field.Store.YES));
indexWriter.addDocument(doc2);
Document doc3 = new Document();
doc3.add(new TextField("title", "harrypotter", Field.Store.YES));
indexWriter.addDocument(doc3);
}
I am trying this:
Fields fields = MultiFields.getFields(reader);
Terms terms = fields.terms("title");
TermsEnum iterator = terms.iterator(null);
BytesRef byteRef = null;
while((byteRef = iterator.next()) != null) {
System.out.println(byteRef.utf8ToString());
}
However this gives me only unique terms:
harrypotter
luceneinaction
Is there any way to get all the terms (duplicates as well)? Or are terms always unique?
Thanks.
PS: Lucene version is 4.0.

It will give you the unique terms. However, you can get the count of the documents containing the term in the following way:
while ((byteRef = iterator.next()) != null) {
System.out.println(byteRef.utf8ToString() + " - " + iterator.docFreq());
}

Lucene is an inverted index, so it stores the references to terms like this:
harrypotter -> doc1, doc3
luceneinaction -> doc2
Each term points to the documents that contain it, as shown above.
If you need the terms of each document, run each document's text through the desired analyzer separately.
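To make the inverted-index picture concrete, here is a toy sketch in plain Java. It is not how Lucene stores its index; the class InvertedIndexSketch and its methods are hypothetical names used only to show why enumerating terms yields each term once, while the document count per term is still available:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy inverted index, NOT Lucene's implementation: each unique term maps to
// the ids of the documents that contain it.
class InvertedIndexSketch {
    private final Map<String, Set<Integer>> postings = new TreeMap<>();

    // Registers that the given document contains the given (already analyzed) term.
    void add(int docId, String term) {
        postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
    }

    // Unique terms in sorted order, comparable to iterating a TermsEnum.
    List<String> terms() {
        return new ArrayList<>(postings.keySet());
    }

    // Number of documents containing the term, comparable to docFreq().
    int docFreq(String term) {
        Set<Integer> docs = postings.get(term);
        return docs == null ? 0 : docs.size();
    }

    // The three documents from the question.
    static InvertedIndexSketch sample() {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "harrypotter");
        index.add(2, "luceneinaction");
        index.add(3, "harrypotter");
        return index;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = sample();
        for (String term : index.terms()) {
            System.out.println(term + " - " + index.docFreq(term));
        }
    }
}
```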

Migrate from Lucene 3.0 to 4.9.0

I want to migrate an example from the book "Lucene in Action 2nd Edition", which is based on Lucene 3.0, to Lucene's current version. Here is the code that needs to be migrated:
public void testUpdate() throws IOException {
assertEquals(1, getHitCount("city", "Amsterdam"));
IndexWriter writer = getWriter();
Document doc = new Document();
doc.add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new Field("country", "Netherlands", Field.Store.YES, Field.Index.NO));
doc.add(new Field("contents", "Den Haag has a lot of museums", Field.Store.NO, Field.Index.ANALYZED));
doc.add(new Field("city", "Den Haag", Field.Store.YES, Field.Index.ANALYZED));
writer.updateDocument(new Term("id", "1"), doc);
writer.close();
assertEquals(0, getHitCount("city", "Amsterdam"));
assertEquals(1, getHitCount("city", "Den Haag"));
}
I'm trying to perform the migration according to the Lucene Migration Guide using the equivalents for the former Field constructors to create the Document object. The code for this looks as follows:
@Test
public void testUpdate() throws IOException
{
    assertEquals(1, getHitCount("city", "Amsterdam"));
    IndexWriter writer = getWriter();
    Document doc = new Document();
    FieldType ft = new FieldType(StringField.TYPE_STORED);
    ft.setOmitNorms(false);
    doc.add(new Field("id", "1", ft));
    doc.add(new StoredField("country", "Netherlands"));
    doc.add(new TextField("contents", "Den Haag has a lot of museums", Store.NO));
    doc.add(new Field("city", "Den Haag", TextField.TYPE_STORED));
    writer.updateDocument(new Term("id", "1"), doc);
    writer.close();
    assertEquals(0, getHitCount("city", "Amsterdam"));
    assertEquals(1, getHitCount("city", "Den Haag"));
}
The second assertion fails because the query doesn't find the string "Den Haag" (searching for only "Den" or only "Haag" works, though). If I use a StringField object instead, the test passes, since the "city" attribute is then not analyzed (i.e. tokenized) and is kept unchanged. But it is obviously not the intention of the example to treat this attribute like, e.g., an ID. I've read that the combination "Field.Store.YES / Field.Index.ANALYZED" is good for small textual content like an intro text, abstract or title, so it should also match multi-word strings like "Den Haag", or am I wrong? Could anyone please clarify?
The author uses a Term object to create the search string:
protected int getHitCount(String fieldName, String searchString) throws IOException {
DirectoryReader dr = DirectoryReader.open(directory);
IndexSearcher searcher = new IndexSearcher(dr);
Term t = new Term(fieldName, searchString);
Query query = new TermQuery(t);
int hitCount = TestUtil.hitCount(searcher, query);
return hitCount;
}
The TestUtil class only contains a single line of code
public static int hitCount(IndexSearcher searcher, Query query) {
return searcher.search(query, 1).totalHits;
}

Short explanation: make sure the tokenization setting (on/off) is the same at index time and at search time.
Long explanation: if you want your content to be analyzed, you should use not only TextField but also QueryParser, so that your query goes through the same analysis. In your case the query fails because with
doc.add(new Field("city", "Den Haag", TextField.TYPE_STORED));
the text is tokenized into "Den" and "Haag". Later, when you create the TermQuery, you search for the single term "Den Haag", which, of course, yields no results.
The code below shows how this could work for the non-tokenized case:
doc.add(new StringField("city", "Den Haag", Field.Store.YES));
...
PhraseQuery query = new PhraseQuery();
query.add(new Term("city", "Den Haag"));
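The mismatch described above can be reproduced without Lucene. The sketch below is a simplified model, assuming a whitespace-only tokenizer as a stand-in for the analyzer; TokenizationMismatchSketch, indexTerms and termQueryMatches are hypothetical names, not Lucene API:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Simplified model of index-time tokenization versus an exact term lookup.
class TokenizationMismatchSketch {

    // Tokenized fields are split on whitespace (assumption: a real analyzer may
    // also lowercase or stem); non-tokenized fields keep the whole value as one term.
    static Set<String> indexTerms(String value, boolean tokenized) {
        if (!tokenized) {
            return Collections.singleton(value);
        }
        return new HashSet<>(Arrays.asList(value.trim().split("\\s+")));
    }

    // Exact lookup of one term with no analysis applied, like a TermQuery.
    static boolean termQueryMatches(Set<String> indexedTerms, String queryTerm) {
        return indexedTerms.contains(queryTerm);
    }

    public static void main(String[] args) {
        Set<String> tokenized = indexTerms("Den Haag", true);
        Set<String> whole = indexTerms("Den Haag", false);
        System.out.println("tokenized, query 'Den Haag': " + termQueryMatches(tokenized, "Den Haag"));
        System.out.println("tokenized, query 'Den': " + termQueryMatches(tokenized, "Den"));
        System.out.println("not tokenized, query 'Den Haag': " + termQueryMatches(whole, "Den Haag"));
    }
}
```

In this model the exact lookup for "Den Haag" only succeeds when the field was indexed as a single term, which matches the behaviour reported in the question.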

Lucene: get newest document for category

I'm pretty new to Lucene, so I apologize in advance if what I am trying to do is trivial.
I have an index where the documents contain (among others) two fields:
documentoId and employeeId.
Each employee can submit various documents. The structure is pretty much the same as in the bookstore example.
What I am trying to achieve, is to get all the newest documents matching a query, meaning with the highest documentoId for each employeeId.
In SQL, this would be something like:
select max(documentoId ), employeeId
from documents
where content like 'mySearchValue'
group by employeeId
I don't know if I should use the facet API, or if this can be done with queries or with the searchAfter method... I'm pretty lost in the documentation.
Any help would be greatly appreciated!
Thanks

Lucene supports grouped search; you need to define the field to group by and how the groups should be sorted. In the example below, I group by documentoId and sort in descending order.
public static void main(String[] args) throws IOException, ParseException {
StandardAnalyzer standardAnalyzer = new StandardAnalyzer(Version.LUCENE_46);
RAMDirectory ramDirectory = new RAMDirectory();
IndexWriter indexWriter = new IndexWriter(ramDirectory, new IndexWriterConfig(Version.LUCENE_46, standardAnalyzer));
Document d0 = new Document();
d0.add(new TextField("employeeId", "foo", Field.Store.YES));
d0.add(new IntField("documentoId", 1, Field.Store.YES));
indexWriter.addDocument(d0);
Document d1 = new Document();
d1.add(new TextField("employeeId", "bar", Field.Store.YES));
d1.add(new IntField("documentoId", 20, Field.Store.YES));
indexWriter.addDocument(d1);
Document d2 = new Document();
d2.add(new TextField("employeeId", "baz", Field.Store.YES));
d2.add(new IntField("documentoId", 3, Field.Store.YES));
indexWriter.addDocument(d2);
indexWriter.commit();
GroupingSearch groupingSearch = new GroupingSearch("documentoId");
Sort groupSort = new Sort(new SortField("documentoId", SortField.Type.INT, true)); // in descending order
groupingSearch.setGroupSort(groupSort);
groupingSearch.setSortWithinGroup(groupSort);
IndexReader reader = DirectoryReader.open(ramDirectory);
IndexSearcher searcher = new IndexSearcher(reader);
TopGroups<?> groups = groupingSearch.search(searcher, new MatchAllDocsQuery(), 0, 10);
Document highestScoredDocument = reader.document(groups.groups[0].scoreDocs[0].doc);
System.out.println(
"Descending order, first document is " +
"employeeId:" + highestScoredDocument.get("employeeId") + " " +
"documentoId:" + highestScoredDocument.get("documentoId")
);
}
The above code finds that d1 (the middle document) sorts to the top and prints the following:
Descending order, first document is employeeId:bar documentoId:20
The code above does not address the content like 'mySearchValue' part; to do that, replace MatchAllDocsQuery with a relevant query.

Custom sorting of the hits will do the trick. Google the search.sort parameter in Lucene.

For those in the same situation, I solved my problem by taking mindas's code and modifying it to use my group field:
GroupingSearch groupingSearch = new GroupingSearch("employeeId");
Sort groupSort = new Sort(new SortField("documentoId", SortField.Type.INT, true)); // in descending order
groupingSearch.setGroupSort(groupSort);
groupingSearch.setSortWithinGroup(groupSort);
int offset = 0;
int limitGroup = 50;
TopGroups<?> groups = groupingSearch.search(is,query, offset, limitGroup);
List<Document> result = new ArrayList<Document>();
for (int i=0; i<groups.groups.length; i++) {
ScoreDoc sdoc = groups.groups[i].scoreDocs[0]; // first result of each group
Document d = is.doc(sdoc.doc);
result.add(d);
}
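For checking the grouped results, it can help to compute the expected outcome in memory. The sketch below is plain Java with no Lucene involved; it reproduces the SQL from the question (max documentoId per employeeId), and the class NewestPerEmployeeSketch and its sample data are made up for illustration:

```java
import java.util.Map;
import java.util.TreeMap;

// In-memory equivalent of: select max(documentoId), employeeId ... group by employeeId
class NewestPerEmployeeSketch {

    // employeeIds[i] and documentoIds[i] together describe one document.
    static Map<String, Integer> newestPerEmployee(String[] employeeIds, int[] documentoIds) {
        Map<String, Integer> newest = new TreeMap<>();
        for (int i = 0; i < employeeIds.length; i++) {
            // keep the larger documentoId seen so far for this employee
            newest.merge(employeeIds[i], documentoIds[i], Integer::max);
        }
        return newest;
    }

    public static void main(String[] args) {
        // hypothetical sample data
        String[] employees = {"foo", "bar", "foo", "baz", "bar"};
        int[] documents = {1, 20, 7, 3, 5};
        System.out.println(newestPerEmployee(employees, documents));
    }
}
```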

Lucene skips years when NumericRangeQuery on dates

We are running a Lucene query for the date range 20000101 to 20070531, but Lucene only returns documents with a publicationDate between 20000101-20000701 and 20070101-20070531. Lucene skips several years. When running different date sets the results are similar.
Full insert code:
Document doc = new Document();
doc.add(new Field("pageNumber", article.getPageNumber(), Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new NumericField("publicationDate", 8, Field.Store.YES, true).setIntValue(Integer.parseInt(article.getPublicationDate())));
doc.add(new Field("headline", article.getHeadline(), Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("text", article.getText(), Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("fileName", article.getFileName(), Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new Field("mediaType", article.getMediaType(), Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new Field("mediaSource", article.getMediaSource(), Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new Field("overLap", article.getMediaType(), Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new Field("status", article.getMediaType(), Field.Store.YES, Field.Index.NOT_ANALYZED));
indexWriter.addDocument(doc);
Document count code:
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
Directory index = new SimpleFSDirectory(new File(LUCENE_INDEX_DIRECTORY));
IndexReader reader = IndexReader.open(index);
Query sourceQuery = new TermQuery(new Term("mediaSource", source));
QueryParser queryParser = new QueryParser(Version.LUCENE_36, "text", analyzer);
Query textQuery = queryParser.parse(terms);
Query dateRangeQuery = NumericRangeQuery.newIntRange("publicationDate", startDate, endDate, true, true);
BooleanQuery booleanQuery = new BooleanQuery();
booleanQuery.add(sourceQuery, BooleanClause.Occur.MUST);
booleanQuery.add(textQuery, BooleanClause.Occur.MUST);
booleanQuery.add(dateRangeQuery, BooleanClause.Occur.MUST);
IndexSearcher searcher = new IndexSearcher(reader);
TotalHitCountCollector collector = new TotalHitCountCollector();
searcher.search(booleanQuery, collector);
System.out.println("start: " + startDate);
System.out.println("end: " + endDate);
System.out.println("total: " + collector.getTotalHits());
String hitCount = String.valueOf(collector.getTotalHits());
searcher.close();
reader.close();
analyzer.close();
return hitCount;
Full document list:
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
Directory index = new SimpleFSDirectory(new File(LUCENE_INDEX_DIRECTORY));
IndexReader reader = IndexReader.open(index);
Query sourceQuery = new TermQuery(new Term("mediaSource", source));
QueryParser queryParser = new QueryParser(Version.LUCENE_36, "text", analyzer);
Query textQuery = queryParser.parse(terms);
Query dateRangeQuery = NumericRangeQuery.newIntRange("publicationDate", startDate, endDate, true, true);
BooleanQuery booleanQuery = new BooleanQuery();
booleanQuery.add(sourceQuery, BooleanClause.Occur.MUST);
booleanQuery.add(textQuery, BooleanClause.Occur.MUST);
booleanQuery.add(dateRangeQuery, BooleanClause.Occur.MUST);
IndexSearcher searcher = new IndexSearcher(reader);
TotalHitCountCollector collector = new TotalHitCountCollector();
searcher.search(booleanQuery, collector);
Sort sort = new Sort(new SortField("publicationDate", SortField.INT));
if (collector.getTotalHits() > 0) {
TopDocs topDocs = searcher.search(booleanQuery, collector.getTotalHits(), sort);
int i = 0;
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
ArrayList<String> resultRow = new ArrayList<String>();
Document doc = searcher.doc(scoreDoc.doc);
resultRow.add(String.valueOf(i));
resultRow.add(doc.get("publicationDate"));
resultRow.add(doc.get("mediaSource"));
resultRow.add(doc.get("fileName"));
resultRow.add(doc.get("headline"));
resultRow.add(doc.get("pageNumber"));
ql.results.put(String.valueOf(i), resultRow);
i++;
}
} else {
ArrayList<String> resultRow = new ArrayList<String>();
resultRow.add("0");
resultRow.add("0");
resultRow.add("0");
resultRow.add("0");
resultRow.add("0");
resultRow.add("0");
ql.results.put("0", resultRow);
}
Truncated results (last 10 of 2058 documents):
20021231 Iraq Belongs on the Back Burner
20021231 With Missionaries Spreading, Muslims' Anger Is Following
20021231 WHITE HOUSE CUTS ESTIMATE OF COST OF WAR WITH IRAQ
20021231 Bring Back the Draft
20040101 Pakistani Leader's New Tactic: Persuasion
20040101 What We Will Do in 2004
20040101 Ethnic Morass Bogs Down Afghan Talks On Charter
20040101 U.S. Hunts Terror Clues in Case of 2 Brothers
20040101 Giving Up Those Weapons: After Libya, Who Is Next?
20040101 An Odd Sight in Iran as Aid Team Tents Go Up: The U.S. Flag

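The problem is that the NumericRangeQuery does not return the correct results here. Using a range query over string values works around the problem. A likely cause, not confirmed in this thread, is a precisionStep mismatch: the field is indexed with a precisionStep of 8, while NumericRangeQuery.newIntRange called without a precisionStep argument uses the default value, and the two need to match.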
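A range over string values behaves correctly for these dates because fixed-width yyyyMMdd strings sort lexicographically in chronological order. A minimal plain-Java sketch of that property (no Lucene involved; DateStringRangeSketch and inRange are hypothetical names):

```java
// Shows that fixed-width yyyyMMdd strings can be range-checked by plain string comparison.
class DateStringRangeSketch {

    // Inclusive range check using lexicographic comparison only.
    static boolean inRange(String date, String start, String end) {
        return date.compareTo(start) >= 0 && date.compareTo(end) <= 0;
    }

    public static void main(String[] args) {
        String start = "20000101";
        String end = "20070531";
        System.out.println(inRange("20021231", start, end));
        System.out.println(inRange("20040101", start, end));
        System.out.println(inRange("20070601", start, end));
    }
}
```

This only holds when every value has the same width; values such as "2004101" would break the ordering.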
