Lucene skips years when running a NumericRangeQuery on dates - java

We are running a Lucene query for the date range 20000101 to 20070531, but Lucene only returns documents whose publicationDate falls in 20000101-20000701 or 20070101-20070531, skipping several years in between. Other date ranges give similar results.
Full insert code:
Document doc = new Document();
doc.add(new Field("pageNumber", article.getPageNumber(), Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new NumericField("publicationDate", 8, Field.Store.YES, true).setIntValue(Integer.parseInt(article.getPublicationDate())));
doc.add(new Field("headline", article.getHeadline(), Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("text", article.getText(), Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("fileName", article.getFileName(), Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new Field("mediaType", article.getMediaType(), Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new Field("mediaSource", article.getMediaSource(), Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new Field("overLap", article.getMediaType(), Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new Field("status", article.getMediaType(), Field.Store.YES, Field.Index.NOT_ANALYZED));
indexWriter.addDocument(doc);
Document count code:
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
Directory index = new SimpleFSDirectory(new File(LUCENE_INDEX_DIRECTORY));
IndexReader reader = IndexReader.open(index);
Query sourceQuery = new TermQuery(new Term("mediaSource", source));
QueryParser queryParser = new QueryParser(Version.LUCENE_36, "text", analyzer);
Query textQuery = queryParser.parse(terms);
Query dateRangeQuery = NumericRangeQuery.newIntRange("publicationDate", startDate, endDate, true, true);
BooleanQuery booleanQuery = new BooleanQuery();
booleanQuery.add(sourceQuery, BooleanClause.Occur.MUST);
booleanQuery.add(textQuery, BooleanClause.Occur.MUST);
booleanQuery.add(dateRangeQuery, BooleanClause.Occur.MUST);
IndexSearcher searcher = new IndexSearcher(reader);
TotalHitCountCollector collector = new TotalHitCountCollector();
searcher.search(booleanQuery, collector);
System.out.println("start: " + startDate);
System.out.println("end: " + endDate);
System.out.println("total: " + collector.getTotalHits());
String hitCount = String.valueOf(collector.getTotalHits());
searcher.close();
reader.close();
analyzer.close();
return hitCount;
Full document list:
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
Directory index = new SimpleFSDirectory(new File(LUCENE_INDEX_DIRECTORY));
IndexReader reader = IndexReader.open(index);
Query sourceQuery = new TermQuery(new Term("mediaSource", source));
QueryParser queryParser = new QueryParser(Version.LUCENE_36, "text", analyzer);
Query textQuery = queryParser.parse(terms);
Query dateRangeQuery = NumericRangeQuery.newIntRange("publicationDate", startDate, endDate, true, true);
BooleanQuery booleanQuery = new BooleanQuery();
booleanQuery.add(sourceQuery, BooleanClause.Occur.MUST);
booleanQuery.add(textQuery, BooleanClause.Occur.MUST);
booleanQuery.add(dateRangeQuery, BooleanClause.Occur.MUST);
IndexSearcher searcher = new IndexSearcher(reader);
TotalHitCountCollector collector = new TotalHitCountCollector();
searcher.search(booleanQuery, collector);
Sort sort = new Sort(new SortField("publicationDate", SortField.INT));
if (collector.getTotalHits() > 0) {
TopDocs topDocs = searcher.search(booleanQuery, collector.getTotalHits(), sort);
int i = 0;
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
ArrayList<String> resultRow = new ArrayList<String>();
Document doc = searcher.doc(scoreDoc.doc);
resultRow.add(String.valueOf(i));
resultRow.add(doc.get("publicationDate"));
resultRow.add(doc.get("mediaSource"));
resultRow.add(doc.get("fileName"));
resultRow.add(doc.get("headline"));
resultRow.add(doc.get("pageNumber"));
ql.results.put(String.valueOf(i), resultRow);
i++;
}
} else {
ArrayList<String> resultRow = new ArrayList<String>();
resultRow.add("0");
resultRow.add("0");
resultRow.add("0");
resultRow.add("0");
resultRow.add("0");
resultRow.add("0");
ql.results.put("0", resultRow);
}
Truncated results (last 10 of 2058 documents):
20021231 Iraq Belongs on the Back Burner
20021231 With Missionaries Spreading, Muslims' Anger Is Following
20021231 WHITE HOUSE CUTS ESTIMATE OF COST OF WAR WITH IRAQ
20021231 Bring Back the Draft
20040101 Pakistani Leader's New Tactic: Persuasion
20040101 What We Will Do in 2004
20040101 Ethnic Morass Bogs Down Afghan Talks On Charter
20040101 U.S. Hunts Terror Clues in Case of 2 Brothers
20040101 Giving Up Those Weapons: After Libya, Who Is Next?
20040101 An Odd Sight in Iran as Aid Team Tents Go Up: The U.S. Flag

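In our case the NumericRangeQuery did not return the correct results; switching to a range query over string values fixed the problem. A likely cause, judging from the code above, is a precisionStep mismatch: the field is indexed with a precisionStep of 8, while NumericRangeQuery.newIntRange called without an explicit precisionStep uses the default of 4. Passing the same precisionStep to the query that was used at index time should also resolve it.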
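A string range can stand in for the numeric one here because zero-padded yyyyMMdd strings sort lexicographically in the same order as the dates they encode. The sketch below illustrates only that property with plain Java string comparison; the class and method names are hypothetical and it does not call Lucene.

```java
public class DateStringOrder {
    // True when date lies within [start, end], compared as plain strings.
    // Assumption: all three values are fixed-width, zero-padded yyyyMMdd.
    static boolean inRange(String date, String start, String end) {
        return date.compareTo(start) >= 0 && date.compareTo(end) <= 0;
    }

    public static void main(String[] args) {
        String start = "20000101";
        String end = "20070531";
        System.out.println(inRange("20021231", start, end)); // true
        System.out.println(inRange("20040101", start, end)); // true
        System.out.println(inRange("20070601", start, end)); // false
    }
}
```

The comparison breaks down as soon as the values are not fixed-width (for example "9" sorts after "10"), so the dates would have to be indexed as padded strings for this approach to hold.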

Related

How to sort Numeric field in Lucene 6

I want to sort my search result based on a numeric field.
In the following example code, I want to sort based on the 'age' field.
I started from the answers to:
How to sort IntPont or LongPoint field in Lucene 6
but that sorts by SCORE, and the ages are still not in ascending order.
Then, following
Sorting search result in Lucene based on a numeric field
I changed SortField.Type.SCORE to SortField.Type.LONG in the search function.
But now I get:
unexpected docvalues type NONE for field 'age' (expected=NUMERIC)
Here is my code:
public class TestLongPointSort {
public static void main(String[] args) throws Exception {
String indexPath = "/tmp/testSort";
Analyzer standardAnalyzer = new StandardAnalyzer();
Directory indexDir = FSDirectory.open(Paths.get(indexPath));
IndexWriterConfig iwc = new IndexWriterConfig(standardAnalyzer);
iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
IndexWriter masterIndex = new IndexWriter(indexDir, iwc);
Document doc = new Document();
String name = "bob";
doc.add(new TextField("name", name, Field.Store.YES));
doc.add(new SortedDocValuesField("name", new BytesRef(name)));
doc.add(new SortedNumericDocValuesField("age", 20L));
doc.add(new StoredField("age", 20L));
long ts = System.currentTimeMillis();
doc.add(new SortedNumericDocValuesField("ts", ts));
doc.add(new StoredField("ts", ts));
masterIndex.addDocument(doc);
Thread.sleep(1);
name = "max";
doc = new Document();
doc.add(new TextField("name", name, Field.Store.YES));
doc.add(new SortedDocValuesField("name", new BytesRef(name)));
doc.add(new SortedNumericDocValuesField("age", 19L));
doc.add(new StoredField("age", 19L));
ts = System.currentTimeMillis();
doc.add(new SortedNumericDocValuesField("ts", ts));
doc.add(new StoredField("ts", ts));
masterIndex.addDocument(doc);
Thread.sleep(1);
name = "jim";
doc = new Document();
doc.add(new TextField("name", name, Field.Store.YES));
doc.add(new SortedDocValuesField("name", new BytesRef(name)));
doc.add(new SortedNumericDocValuesField("age", 21L));
doc.add(new StoredField("age", 21L));
ts = System.currentTimeMillis();
doc.add(new SortedNumericDocValuesField("ts", ts));
doc.add(new StoredField("ts", ts));
masterIndex.addDocument(doc);
masterIndex.commit();
masterIndex.close();
IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(indexPath)));
IndexSearcher searcher = new IndexSearcher(reader);
Analyzer analyzer = new KeywordAnalyzer();
QueryParser queryParser = new QueryParser("message", analyzer);
Sort sort;
TopDocs docs;
sort = new Sort(new SortField("name", SortField.Type.STRING));
docs = searcher.search(new MatchAllDocsQuery(), 100, sort);
System.out.println("Sorted by name");
for (ScoreDoc scoreDoc : docs.scoreDocs) {
Document doc2 = searcher.doc(scoreDoc.doc);
System.out.println("Name:" + doc2.get("name") + " ; age:" + doc2.get("age") + " ; ts:" + doc2.get("ts"));
}
//docs = searcher.search(new MatchAllDocsQuery(), 100, new Sort(new SortField("age", SortField.Type.SCORE, true)));
docs = searcher.search(new MatchAllDocsQuery(), 100, new Sort(new SortField("age", SortField.Type.LONG, true)));
System.out.println("Sorted by age");
for (ScoreDoc scoreDoc : docs.scoreDocs) {
Document doc2 = searcher.doc(scoreDoc.doc);
System.out.println("Name:" + doc2.get("name") + " ; age:" + doc2.get("age") + " ; ts:" + doc2.get("ts"));
}
reader.close();
}
}
As we can see, sorting by STRING works, but I haven't figured out how to get my numbers (LONG) sorted.
What is the right way to sort Numeric fields?
Thanks
To sort search results on a SortedNumericDocValuesField, you need to use a SortedNumericSortField (the third argument, true, reverses the order to descending; pass false for ascending):
Sort sort = new Sort(new SortedNumericSortField("age", SortField.Type.LONG, true));
TopDocs docs = searcher.search(new MatchAllDocsQuery(), 100, sort);
Alternatively, I would suggest copying the data from each Document into an ArrayList, rather than saving it to another document, and then using the ArrayList sort methods.
Please visit these links for your reference.
SO - how to sort arraylist
JAVA ArrayList sort method sample
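The ArrayList suggestion above could look roughly like this. It is a minimal sketch in plain Java, assuming the name and age values have already been read from the search hits; the Row class and sortByAge method are hypothetical names, not part of Lucene.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SortRowsByAge {
    // Hypothetical holder for the values read back from each Lucene Document.
    static final class Row {
        final String name;
        final long age;

        Row(String name, long age) {
            this.name = name;
            this.age = age;
        }
    }

    // Returns a new list ordered by ascending age; the input list is left untouched.
    static List<Row> sortByAge(List<Row> rows) {
        List<Row> sorted = new ArrayList<>(rows);
        sorted.sort(Comparator.comparingLong(r -> r.age));
        return sorted;
    }

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row("bob", 20L));
        rows.add(new Row("max", 19L));
        rows.add(new Row("jim", 21L));
        for (Row r : sortByAge(rows)) {
            System.out.println("Name:" + r.name + " ; age:" + r.age);
        }
    }
}
```

Sorting in memory like this only covers the hits that were actually fetched, so it suits small result sets; for large ones, sorting inside the index is usually the better fit.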

Lucene: BestTextFragments returns only the first document

I'm building a Lucene index of Twitter users and their tweets. My idea is to store each user's info (name, description, etc.) together with their tweets, using the following code:
for (Map.Entry<Long, User> entry : users.entrySet()) {
User user = entry.getValue();
Document document = new Document();
document.add(new LongField("id", user.getId(), Field.Store.YES));
document.add(new StringField("name", user.getName(), Field.Store.YES));
document.add(new StringField("username", user.getUsername(), Field.Store.YES));
for (UserTweet t : user.getTweets()) {
document.add(new TextField("tweet", t.getText(), Field.Store.YES));
}
writer.addDocument(document);
}
Here a document can have many values in the "tweet" field. The analyzer used for this field is the EnglishAnalyzer.
Is this the correct way to store tweets?
My problem arises when I set up the Highlighter to retrieve the matching tweets. If I search for a term that is present in ALL tweets of ALL stored users, I get ALL users as the result (correct!), but when I want to see all the tweets of a single user that match the query (with the Highlighter), I get only the first tweet of every user rather than all of them.
This is the code that I use to search:
BooleanQuery.Builder booleanQuery = new BooleanQuery.Builder();
QueryParser queryParserKeywords = new QueryParser("tweet", new EnglishAnalyzer());
String strQueryKeywords = "";
for (String s : c.getValue().split(" "))
strQueryKeywords += "tweet:"+ s +" OR ";
strQueryKeywords = strQueryKeywords.substring(0, strQueryKeywords.lastIndexOf("OR"));
Query queryKeywords = queryParserKeywords.parse(strQueryKeywords);
QueryScorer queryScorerKeywords = new QueryScorer(queryKeywords, "tweet");
Fragmenter fragment = new SimpleSpanFragmenter(queryScorerKeywords, 150);
keywordsHighlighter = new Highlighter(queryScorerKeywords);
keywordsHighlighter.setTextFragmenter(fragment);
booleanQuery.add(queryKeywords, BooleanClause.Occur.SHOULD);
... (other boolean clause over other fields)
searcher.search(booleanQuery.build(), collector);
...
for (ScoreDoc doc : collector.topDocs().scoreDocs) {
Document d = searcher.doc(doc.doc);
TokenStream tokenStream = new EnglishAnalyzer().tokenStream("",d.getField("tweet").stringValue());
TextFragment[] fragments = keywordsHighlighter.getBestTextFragments(tokenStream, d.getField("tweet").stringValue(), false, 10);
for (TextFragment fragment : fragments) {
System.out.println(" - " + fragment.toString());
}
}
What's wrong with my code?
Finally, to search over multiple fields with different text (e.g. City=New York, Keyword=Star Wars, etc.), is it correct to use a BooleanQuery, or is there a better solution?
Thanks a lot.

Migrate from Lucene 3.0 to 4.9.0

I want to migrate an example from the book "Lucene in Action 2nd Edition", which is based on Lucene 3.0, to Lucene's current version. Here is the code that needs to be migrated:
public void testUpdate() throws IOException {
assertEquals(1, getHitCount("city", "Amsterdam"));
IndexWriter writer = getWriter();
Document doc = new Document();
doc.add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new Field("country", "Netherlands", Field.Store.YES, Field.Index.NO));
doc.add(new Field("contents", "Den Haag has a lot of museums", Field.Store.NO, Field.Index.ANALYZED));
doc.add(new Field("city", "Den Haag", Field.Store.YES, Field.Index.ANALYZED));
writer.updateDocument(new Term("id", "1"), doc);
writer.close();
assertEquals(0, getHitCount("city", "Amsterdam"));
assertEquals(1, getHitCount("city", "Den Haag"));
}
I'm trying to perform the migration according to the Lucene Migration Guide using the equivalents for the former Field constructors to create the Document object. The code for this looks as follows:
@Test
public void testUpdate() throws IOException
{
assertEquals(1, getHitCount("city", "Amsterdam"));
IndexWriter writer = getWriter();
Document doc = new Document();
FieldType ft = new FieldType(StringField.TYPE_STORED);
ft.setOmitNorms(false);
doc.add(new Field("id", "1", ft));
doc.add(new StoredField("country", "Netherlands"));
doc.add(new TextField("contents", "Den Haag has a lot of museums", Store.NO));
doc.add(new Field("city", "Den Haag", TextField.TYPE_STORED));
writer.updateDocument(new Term("id", "1"), doc);
writer.close();
assertEquals(0, getHitCount("city", "Amsterdam"));
assertEquals(1, getHitCount("city", "Den Haag"));
}
The final assertion fails because it does not find the string "Den Haag" (searching for only "Den" or "Haag" works, though). If I use a StringField instead, the test passes, since the "city" field is then not analyzed (i.e. tokenized) and is kept unchanged. But the example obviously does not intend to treat this field like, say, an ID. I've read that the combination Field.Store.YES / Field.Index.ANALYZED is good for small textual content such as an intro text, abstract or title, so it should also match multi-word strings like "Den Haag", or am I wrong? Could anyone clarify, please?
The author uses a Term object to create the search string:
protected int getHitCount(String fieldName, String searchString) throws IOException {
DirectoryReader dr = DirectoryReader.open(directory);
IndexSearcher searcher = new IndexSearcher(dr);
Term t = new Term(fieldName, searchString);
Query query = new TermQuery(t);
int hitCount = TestUtil.hitCount(searcher, query);
return hitCount;
}
The TestUtil class only contains a single line of code
public static int hitCount(IndexSearcher searcher, Query query) {
return searcher.search(query, 1).totalHits;
}
Short explanation: the tokenization setting (on/off) must be the same at index time and at search time.
Long explanation: if you want your content to be analyzed, you should use not only TextField but also QueryParser, so that your query goes through the same analysis. In your case the query fails because with
doc.add(new Field("city", "Den Haag", TextField.TYPE_STORED));
the text gets tokenized into "Den" and "Haag". Later, when you create a TermQuery, you search for the single term "Den Haag", which of course yields no results.
The code below shows how this could work for the non-tokenized case:
doc.add(new StringField("city", "Den Haag", Field.Store.YES));
...
PhraseQuery query = new PhraseQuery();
query.add(new Term("city", "Den Haag"));
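The mismatch described above can be reproduced without Lucene. The following is a rough sketch in plain Java, assuming a whitespace-only tokenizer as a stand-in for the analyzer; the class and method names are hypothetical, and real analyzers may do more (lowercasing, stop-word removal).

```java
import java.util.ArrayList;
import java.util.List;

public class TokenMismatchSketch {
    // Stand-in for analysis: split the text on whitespace, nothing else.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Exact term lookup: the term must equal one indexed token as a whole.
    static boolean termMatches(List<String> indexedTokens, String term) {
        return indexedTokens.contains(term);
    }

    public static void main(String[] args) {
        // Analyzed field: "Den Haag" is stored as two tokens.
        List<String> analyzed = analyze("Den Haag");
        // Non-analyzed field: the whole value is kept as one token.
        List<String> verbatim = new ArrayList<>();
        verbatim.add("Den Haag");

        System.out.println(termMatches(analyzed, "Den Haag")); // false
        System.out.println(termMatches(analyzed, "Haag"));     // true
        System.out.println(termMatches(verbatim, "Den Haag")); // true
    }
}
```

Either side can be adjusted to make the lookup succeed: keep the field untokenized and search for the whole value, or analyze the query text as well so it is split into the same tokens.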

Multiple attribute queries in Apache Lucene

The program below satisfies a query where the title contains both lucene and action. What if I want to search for a tuple where isbn is 1234 (assuming isbn is not unique) and the title contains both Lucene and dummies? Does Lucene provide a facility for that?
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
Directory index = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
IndexWriter w = new IndexWriter(index, config);
addDoc(w, "Lucene in Action", "193398817");
addDoc(w, "Lucene for Dummies", "55320055Z");
addDoc(w, "Managing Gigabytes", "55063554A");
addDoc(w, "The Art of Computer Science", "9900333X");
w.close();
private static void addDoc(IndexWriter w, String title, String isbn) throws IOException {
Document doc = new Document();
doc.add(new TextField("title", title, Field.Store.YES));
doc.add(new StringField("isbn", isbn, Field.Store.YES));
w.addDocument(doc);
}
String querystr = args.length > 0 ? args[0] : "lucene AND action";
Query q = new QueryParser(Version.LUCENE_40, "title", analyzer).parse(querystr);
Off the top of my head: your QueryParser is built with title as its default field, so to target both title and isbn you can either use a class like MultiFieldQueryParser with a query such as title:(lucene AND dummies) AND isbn:1234 (the field: prefix syntax is also understood by the plain QueryParser), or build the BooleanQuery by hand from multiple TermQuery objects, which is what the parser produces in the end anyway.
I hope this helps

Lucene: get newest document for category

I'm pretty new to lucene index, so I apologize in advance if what I am trying to do is trivial.
I have an index where the documents contain (among others) two fields:
documentoId and employeeId.
Each employee can submit various documents. The structure is pretty much the same as in the bookstore example.
What I am trying to achieve is to get the newest document matching a query for each employee, meaning the one with the highest documentoId for each employeeId.
In SQL, this would be something like:
select max(documentoId ), employeeId
from documents
where content like 'mySearchValue'
group by employeeId
I don't know if I should use facet API, or if this can be done with queries, or with the searchAfter method...I'm pretty lost with the documentation.
Any help would be greatly appreciated!
Thanks
Lucene supports grouped search; you need to define the grouping field and how the groups are sorted. In the example below, I group by documentoId and sort in descending order.
public static void main(String[] args) throws IOException, ParseException {
StandardAnalyzer standardAnalyzer = new StandardAnalyzer(Version.LUCENE_46);
RAMDirectory ramDirectory = new RAMDirectory();
IndexWriter indexWriter = new IndexWriter(ramDirectory, new IndexWriterConfig(Version.LUCENE_46, standardAnalyzer));
Document d0 = new Document();
d0.add(new TextField("employeeId", "foo", Field.Store.YES));
d0.add(new IntField("documentoId", 1, Field.Store.YES));
indexWriter.addDocument(d0);
Document d1 = new Document();
d1.add(new TextField("employeeId", "bar", Field.Store.YES));
d1.add(new IntField("documentoId", 20, Field.Store.YES));
indexWriter.addDocument(d1);
Document d2 = new Document();
d2.add(new TextField("employeeId", "baz", Field.Store.YES));
d2.add(new IntField("documentoId", 3, Field.Store.YES));
indexWriter.addDocument(d2);
indexWriter.commit();
GroupingSearch groupingSearch = new GroupingSearch("documentoId");
Sort groupSort = new Sort(new SortField("documentoId", SortField.Type.INT, true)); // in descending order
groupingSearch.setGroupSort(groupSort);
groupingSearch.setSortWithinGroup(groupSort);
IndexReader reader = DirectoryReader.open(ramDirectory);
IndexSearcher searcher = new IndexSearcher(reader);
TopGroups<?> groups = groupingSearch.search(searcher, new MatchAllDocsQuery(), 0, 10);
Document highestScoredDocument = reader.document(groups.groups[0].scoreDocs[0].doc);
System.out.println(
"Descending order, first document is " +
"employeeId:" + highestScoredDocument.get("employeeId") + " " +
"documentoId:" + highestScoredDocument.get("documentoId")
);
}
The above code detects that the d1 (middle document) scores at the top and prints the following:
Descending order, first document is employeeId:bar documentoId:20
The code above does not address the content like 'mySearchValue' part; to handle that, replace MatchAllDocsQuery with a relevant query.
Custom sorting of the hits will do the trick. Google the search.sort parameter in Lucene.
For those in the same situation: I solved my problem using mindas's example, modified to group by my own field:
GroupingSearch groupingSearch = new GroupingSearch("employeeId");
Sort groupSort = new Sort(new SortField("documentoId", SortField.Type.INT, true)); // in descending order
groupingSearch.setGroupSort(groupSort);
groupingSearch.setSortWithinGroup(groupSort);
int offset = 0;
int limitGroup = 50;
TopGroups<?> groups = groupingSearch.search(is, query, offset, limitGroup);
List<Document> result = new ArrayList<>();
for (int i=0; i<groups.groups.length; i++) {
ScoreDoc sdoc = groups.groups[i].scoreDocs[0]; // first result of each group
Document d = is.doc(sdoc.doc);
result.add(d);
}
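For comparison, the reduction that the grouping performs (one entry per employeeId, keeping the highest documentoId) can be written in plain Java collections. This is only an illustrative sketch of the expected result, assuming the matching rows are already in memory as two parallel arrays; the class and method names are hypothetical and no Lucene API is involved.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class NewestPerEmployee {
    // Keeps, for every employeeId, the largest documentoId seen so far.
    // Equivalent in spirit to: select max(documentoId), employeeId ... group by employeeId
    static Map<String, Integer> newest(String[] employeeIds, int[] documentoIds) {
        Map<String, Integer> best = new LinkedHashMap<>();
        for (int i = 0; i < employeeIds.length; i++) {
            best.merge(employeeIds[i], documentoIds[i], Integer::max);
        }
        return best;
    }

    public static void main(String[] args) {
        String[] employees = {"foo", "bar", "foo", "baz", "bar"};
        int[] documents = {1, 20, 3, 7, 5};
        // Prints one line per employee with that employee's highest documentoId.
        for (Map.Entry<String, Integer> e : newest(employees, documents).entrySet()) {
            System.out.println("employeeId:" + e.getKey() + " documentoId:" + e.getValue());
        }
    }
}
```

A small reference implementation like this can be handy for checking the grouped search results against a known data set.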
