I want to migrate an example from the book "Lucene in Action 2nd Edition", which is based on Lucene 3.0, to Lucene's current version. Here is the code that needs to be migrated:
public void testUpdate() throws IOException {
    assertEquals(1, getHitCount("city", "Amsterdam"));
    IndexWriter writer = getWriter();
    Document doc = new Document();
    doc.add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.add(new Field("country", "Netherlands", Field.Store.YES, Field.Index.NO));
    doc.add(new Field("contents", "Den Haag has a lot of museums", Field.Store.NO, Field.Index.ANALYZED));
    doc.add(new Field("city", "Den Haag", Field.Store.YES, Field.Index.ANALYZED));
    writer.updateDocument(new Term("id", "1"), doc);
    writer.close();
    assertEquals(0, getHitCount("city", "Amsterdam"));
    assertEquals(1, getHitCount("city", "Den Haag"));
}
I'm trying to perform the migration according to the Lucene Migration Guide, using the equivalents of the former Field constructors to create the Document object. The code for this looks as follows:
@Test
public void testUpdate() throws IOException {
    assertEquals(1, getHitCount("city", "Amsterdam"));
    IndexWriter writer = getWriter();
    Document doc = new Document();
    FieldType ft = new FieldType(StringField.TYPE_STORED);
    ft.setOmitNorms(false);
    doc.add(new Field("id", "1", ft));
    doc.add(new StoredField("country", "Netherlands"));
    doc.add(new TextField("contents", "Den Haag has a lot of museums", Store.NO));
    doc.add(new Field("city", "Den Haag", TextField.TYPE_STORED));
    writer.updateDocument(new Term("id", "1"), doc);
    writer.close();
    assertEquals(0, getHitCount("city", "Amsterdam"));
    assertEquals(1, getHitCount("city", "Den Haag"));
}
The second assertion fails because it doesn't find the string "Den Haag" (only "Den" or "Haag" works, though). If I use a StringField object instead, the test passes, since the "city" attribute is then not analyzed (i.e. tokenized) and is thus kept unchanged. But it is obviously not the intention of the example to treat this attribute like, e.g., an ID. I've read that the combination Field.Store.YES / Field.Index.ANALYZED is good for small textual content like an intro text, abstract or title, so it should also match multi-word strings like "Den Haag", or am I wrong? Could anyone clarify, please?
The author uses a Term object to create the search string:
protected int getHitCount(String fieldName, String searchString) throws IOException {
    DirectoryReader dr = DirectoryReader.open(directory);
    IndexSearcher searcher = new IndexSearcher(dr);
    Term t = new Term(fieldName, searchString);
    Query query = new TermQuery(t);
    int hitCount = TestUtil.hitCount(searcher, query);
    return hitCount;
}
The TestUtil class contains only a single line of code:
public static int hitCount(IndexSearcher searcher, Query query) throws IOException {
    return searcher.search(query, 1).totalHits;
}
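As an aside, on recent Lucene versions (8.x and later) this helper itself needs a small migration, since TopDocs.totalHits is no longer an int but a TotalHits object. A minimal sketch:
public static long hitCount(IndexSearcher searcher, Query query) throws IOException {
    // totalHits.value is a long; for large result sets it may be a lower
    // bound unless the search is configured to count hits exactly
    return searcher.search(query, 1).totalHits.value;
}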
Short explanation: you need to make sure the tokenization setting (on/off) is the same at index time and at search time.
Long explanation: if you want your content to be analyzed, you should not only use TextField but also QueryParser, so that your query goes through the same analysis. In your case the query fails because with
new Field("city", "Den Haag", TextField.TYPE_STORED)
the text gets tokenized into "Den" and "Haag". Later, when you create the TermQuery, you search for the single term "Den Haag", which, of course, yields no results.
The code below shows how this could work for the non-tokenized case:
doc.add(new StringField("city", "Den Haag", Field.Store.YES));
...
PhraseQuery query = new PhraseQuery();
query.add(new Term("city", "Den Haag"));
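And a minimal sketch for the tokenized case (assuming analyzer is the same Analyzer instance the IndexWriter used; the two-argument QueryParser constructor is the Lucene 5+ form):
// Index time: keep the field analyzed
doc.add(new TextField("city", "Den Haag", Field.Store.YES));
...
// Search time: the query text goes through the same analyzer; quoting it
// produces a phrase query that matches "den" and "haag" as adjacent tokens
QueryParser parser = new QueryParser("city", analyzer);
Query query = parser.parse("\"Den Haag\"");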
I want to sort my search result based on a numeric field.
In the following example code, I want to sort based on the 'age' field.
I started from the answer to:
How to sort IntPoint or LongPoint field in Lucene 6
but that sorts by SCORE; the ages still come back unsorted. I also tried:
Sorting search result in Lucene based on a numeric field
and changed SortField.Type.SCORE to SortField.Type.LONG in the search function, but then I get:
unexpected docvalues type NONE for field 'age' (expected=NUMERIC)
Here is my code:
public class TestLongPointSort {

    public static void main(String[] args) throws Exception {
        String indexPath = "/tmp/testSort";
        Analyzer standardAnalyzer = new StandardAnalyzer();
        Directory indexDir = FSDirectory.open(Paths.get(indexPath));
        IndexWriterConfig iwc = new IndexWriterConfig(standardAnalyzer);
        iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
        IndexWriter masterIndex = new IndexWriter(indexDir, iwc);

        Document doc = new Document();
        String name = "bob";
        doc.add(new TextField("name", name, Field.Store.YES));
        doc.add(new SortedDocValuesField("name", new BytesRef(name)));
        doc.add(new SortedNumericDocValuesField("age", 20L));
        doc.add(new StoredField("age", 20L));
        long ts = System.currentTimeMillis();
        doc.add(new SortedNumericDocValuesField("ts", ts));
        doc.add(new StoredField("ts", ts));
        masterIndex.addDocument(doc);

        Thread.sleep(1);

        name = "max";
        doc = new Document();
        doc.add(new TextField("name", name, Field.Store.YES));
        doc.add(new SortedDocValuesField("name", new BytesRef(name)));
        doc.add(new SortedNumericDocValuesField("age", 19L));
        doc.add(new StoredField("age", 19L));
        ts = System.currentTimeMillis();
        doc.add(new SortedNumericDocValuesField("ts", ts));
        doc.add(new StoredField("ts", ts));
        masterIndex.addDocument(doc);

        Thread.sleep(1);

        name = "jim";
        doc = new Document();
        doc.add(new TextField("name", name, Field.Store.YES));
        doc.add(new SortedDocValuesField("name", new BytesRef(name)));
        doc.add(new SortedNumericDocValuesField("age", 21L));
        doc.add(new StoredField("age", 21L));
        ts = System.currentTimeMillis();
        doc.add(new SortedNumericDocValuesField("ts", ts));
        doc.add(new StoredField("ts", ts));
        masterIndex.addDocument(doc);

        masterIndex.commit();
        masterIndex.close();

        IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(indexPath)));
        IndexSearcher searcher = new IndexSearcher(reader);
        Analyzer analyzer = new KeywordAnalyzer();
        QueryParser queryParser = new QueryParser("message", analyzer);

        Sort sort;
        TopDocs docs;

        sort = new Sort(new SortField("name", SortField.Type.STRING));
        docs = searcher.search(new MatchAllDocsQuery(), 100, sort);
        System.out.println("Sorted by name");
        for (ScoreDoc scoreDoc : docs.scoreDocs) {
            Document doc2 = searcher.doc(scoreDoc.doc);
            System.out.println("Name:" + doc2.get("name") + " ; age:" + doc2.get("age") + " ; ts:" + doc2.get("ts"));
        }

        //docs = searcher.search(new MatchAllDocsQuery(), 100, new Sort(new SortField("age", SortField.Type.SCORE, true)));
        docs = searcher.search(new MatchAllDocsQuery(), 100, new Sort(new SortField("age", SortField.Type.LONG, true)));
        System.out.println("Sorted by age");
        for (ScoreDoc scoreDoc : docs.scoreDocs) {
            Document doc2 = searcher.doc(scoreDoc.doc);
            System.out.println("Name:" + doc2.get("name") + " ; age:" + doc2.get("age") + " ; ts:" + doc2.get("ts"));
        }

        reader.close();
    }
}
As you can see, sorting by STRING works, but I haven't figured out how to get my numbers (LONG) sorted.
What is the right way to sort numeric fields?
Thanks
To sort search results on a SortedNumericDocValuesField, you need to use a SortedNumericSortField; a plain SortField with Type.LONG expects single-valued NUMERIC doc values, which is why you get the "unexpected docvalues type" error:
Sort sort = new Sort(new SortedNumericSortField("age", SortField.Type.LONG, true));
TopDocs docs = searcher.search(new MatchAllDocsQuery(), 100, sort);
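Alternatively, if each document only ever carries one age value, you could index it as a single-valued NumericDocValuesField instead (a sketch, not from the original answer), in which case the plain SortField from your code works as-is:
// Index time: single-valued numeric doc values for "age"
doc.add(new NumericDocValuesField("age", 20L));
doc.add(new StoredField("age", 20L)); // stored copy for display, as before

// Search time: SortField.Type.LONG now matches the doc values type
Sort sort = new Sort(new SortField("age", SortField.Type.LONG, true));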
I would suggest you store the data from the Documents in an ArrayList rather than saving it to another document, and then use the sort methods of ArrayList.
Please visit these links for reference:
SO - how to sort arraylist
JAVA ArrayList sort method sample
I'm trying to use Lucene (5.4.1) MoreLikeThis to tag (classify) texts. It kind of works, but I'm getting poor results, and I think the problem is related to the Query object.
The example below works, but the highest topdoc isn't the one I expect. Debugging the query object shows only content:erro; from a complete Portuguese phrase (see the example) the query was constructed with just one word.
I'm not using stop words or any other kind of filter.
So why is Lucene picking just "erro" as a query term?
To initialize the main objects:
Analyzer analyzer = new PortugueseAnalyzer();
Directory indexDir = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(analyzer);
config.setOpenMode(OpenMode.CREATE_OR_APPEND);
To index:
try (IndexWriter indexWriter = new IndexWriter(indexDir, config)) {
    FieldType type = new FieldType();
    type.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
    type.setStored(true);
    type.setStoreTermVectors(true);

    Document doc = new Document();
    doc.add(new StringField("id", "880b2bbc", Store.YES));
    doc.add(new Field("content", "erro", type));
    doc.add(new Field("tag", "atag", type));

    indexWriter.addDocument(doc);
    indexWriter.commit();
}
To search:
try (IndexReader idxReader = DirectoryReader.open(indexDir)) {
    IndexSearcher indexSearcher = new IndexSearcher(idxReader);

    MoreLikeThis mlt = new MoreLikeThis(idxReader);
    mlt.setMinTermFreq(0);
    mlt.setMinDocFreq(0);
    mlt.setFieldNames(new String[] { "content" });
    mlt.setAnalyzer(analyzer);

    Reader sReader = new StringReader("Melhorias no controle de sessão no sistema qquercoisa quando expira, ao logar novamente no sistema é exibido o erro \"xpto\"");
    Query query = mlt.like("content", sReader);
    TopDocs topDocs = indexSearcher.search(query, 3);
}
Well, I decided to take a look inside the MoreLikeThis class and I found the answer.
The Query query = mlt.like("content", sReader); call ends up in the createQueue(Map<String, Int> words) method of MoreLikeThis.
Inside it, the tokenized terms/words from sReader (converted to a Map) are checked against the index.
Only terms/words that are present in the index are used to create the query.
In the example I provided, since my index contains only one document whose content is the single word erro, that is the only word kept from the phrase I passed in.
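An easy way to see this in action is MoreLikeThis.retrieveInterestingTerms, which returns exactly the terms the query would be built from (a debugging sketch; text here stands for the phrase passed in above):
// Prints the terms MLT kept after checking them against the index
String[] interesting = mlt.retrieveInterestingTerms(new StringReader(text), "content");
for (String term : interesting) {
    System.out.println(term);
}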
I need to display all the terms for a given Lucene index.
public void addDocuments(IndexWriter indexWriter) throws IOException {
    Document doc1 = new Document();
    doc1.add(new TextField("title", "harrypotter", Field.Store.YES));
    indexWriter.addDocument(doc1);

    Document doc2 = new Document();
    doc2.add(new TextField("title", "luceneinaction", Field.Store.YES));
    indexWriter.addDocument(doc2);

    Document doc3 = new Document();
    doc3.add(new TextField("title", "harrypotter", Field.Store.YES));
    indexWriter.addDocument(doc3);
}
I am trying this:
Fields fields = MultiFields.getFields(reader);
Terms terms = fields.terms("title");
TermsEnum iterator = terms.iterator(null);
BytesRef byteRef = null;
while ((byteRef = iterator.next()) != null) {
    System.out.println(byteRef.utf8ToString());
}
However, this gives me only the unique terms:
harrypotter
luceneinaction
Is there any way to get all the terms (duplicates as well)? Or are terms always unique?
Thanks.
PS: Lucene version is 4.0.
It will only give you the unique terms. However, you can get the count of documents containing each term in the following way:
while ((byteRef = iterator.next()) != null) {
    System.out.println(byteRef.utf8ToString() + " - " + iterator.docFreq());
}
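Similarly, if you want occurrence counts rather than document counts, TermsEnum also exposes totalTermFreq() (a sketch; note it can return -1 when term frequencies are not indexed):
while ((byteRef = iterator.next()) != null) {
    // total number of occurrences of this term across all documents
    System.out.println(byteRef.utf8ToString() + " - " + iterator.totalTermFreq());
}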
Lucene is an inverted index, so it stores references from terms to the documents that contain them, like this:
harrypotter -> doc1, doc3
luceneinaction -> doc2
Each term points to its documents, as you can see above.
If you need the terms of each individual document, run its content through the desired analyzer separately, as in the sketch below.
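A minimal sketch of that re-analysis, assuming doc is a Document fetched from the searcher and the field was stored (Field.Store.YES above guarantees that):
// Re-analyze the stored field text to recover its tokens, duplicates included
try (TokenStream ts = analyzer.tokenStream("title", new StringReader(doc.get("title")))) {
    CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
        System.out.println(termAtt.toString());
    }
    ts.end();
}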
The program below satisfies a query where the title contains both lucene and action. If I want to search for a tuple where isbn (considering isbn is not unique) is 1234 and the title contains both lucene and dummies, does Lucene provide a facility for that?
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
Directory index = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
IndexWriter w = new IndexWriter(index, config);
addDoc(w, "Lucene in Action", "193398817");
addDoc(w, "Lucene for Dummies", "55320055Z");
addDoc(w, "Managing Gigabytes", "55063554A");
addDoc(w, "The Art of Computer Science", "9900333X");
w.close();
private static void addDoc(IndexWriter w, String title, String isbn) throws IOException {
    Document doc = new Document();
    doc.add(new TextField("title", title, Field.Store.YES));
    doc.add(new StringField("isbn", isbn, Field.Store.YES));
    w.addDocument(doc);
}
String querystr = args.length > 0 ? args[0] : "lucene AND action";
Query q = new QueryParser(Version.LUCENE_40, "title", analyzer).parse(querystr);
Off the top of my head: your QueryParser is built to query the title field only. To make a query that targets both the title and isbn fields, either use a class like MultiFieldQueryParser with a query like title:(lucene AND dummies) AND isbn:1234, or build the resulting BooleanQuery (this is what you end up with either way) by hand from multiple clauses, as sketched below.
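A minimal hand-built sketch (the isbn value 1234 is the hypothetical one from the question; this uses the Lucene 4.0-era API, where BooleanQuery is still mutable):
// Parsed clause for the analyzed title field
Query titleQuery = new QueryParser(Version.LUCENE_40, "title", analyzer)
        .parse("lucene AND dummies");
// Exact-match clause for the non-analyzed isbn StringField
Query isbnQuery = new TermQuery(new Term("isbn", "1234"));

BooleanQuery combined = new BooleanQuery();
combined.add(titleQuery, BooleanClause.Occur.MUST);
combined.add(isbnQuery, BooleanClause.Occur.MUST);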
I hope this helps
We are running a Lucene query for the date range 20000101 to 20070531, but Lucene only returns documents with a publicationDate between 20000101-20000701 and 20070101-20070531; several years in the middle of the range are skipped. Running different date ranges gives similar results.
Full insert code:
Document doc = new Document();
doc.add(new Field("pageNumber", article.getPageNumber(), Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new NumericField("publicationDate", 8, Field.Store.YES, true).setIntValue(Integer.parseInt(article.getPublicationDate())));
doc.add(new Field("headline", article.getHeadline(), Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("text", article.getText(), Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("fileName", article.getFileName(), Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new Field("mediaType", article.getMediaType(), Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new Field("mediaSource", article.getMediaSource(), Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new Field("overLap", article.getMediaType(), Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new Field("status", article.getMediaType(), Field.Store.YES, Field.Index.NOT_ANALYZED));
indexWriter.addDocument(doc);
Document count code:
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
Directory index = new SimpleFSDirectory(new File(LUCENE_INDEX_DIRECTORY));
IndexReader reader = IndexReader.open(index);
Query sourceQuery = new TermQuery(new Term("mediaSource", source));
QueryParser queryParser = new QueryParser(Version.LUCENE_36, "text", analyzer);
Query textQuery = queryParser.parse(terms);
Query dateRangeQuery = NumericRangeQuery.newIntRange("publicationDate", startDate, endDate, true, true);
BooleanQuery booleanQuery = new BooleanQuery();
booleanQuery.add(sourceQuery, BooleanClause.Occur.MUST);
booleanQuery.add(textQuery, BooleanClause.Occur.MUST);
booleanQuery.add(dateRangeQuery, BooleanClause.Occur.MUST);
IndexSearcher searcher = new IndexSearcher(reader);
TotalHitCountCollector collector = new TotalHitCountCollector();
searcher.search(booleanQuery, collector);
System.out.println("start: " + startDate);
System.out.println("end: " + endDate);
System.out.println("total: " + collector.getTotalHits());
String hitCount = String.valueOf(collector.getTotalHits());
searcher.close();
reader.close();
analyzer.close();
return hitCount;
Full document list:
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
Directory index = new SimpleFSDirectory(new File(LUCENE_INDEX_DIRECTORY));
IndexReader reader = IndexReader.open(index);
Query sourceQuery = new TermQuery(new Term("mediaSource", source));
QueryParser queryParser = new QueryParser(Version.LUCENE_36, "text", analyzer);
Query textQuery = queryParser.parse(terms);
Query dateRangeQuery = NumericRangeQuery.newIntRange("publicationDate", startDate, endDate, true, true);
BooleanQuery booleanQuery = new BooleanQuery();
booleanQuery.add(sourceQuery, BooleanClause.Occur.MUST);
booleanQuery.add(textQuery, BooleanClause.Occur.MUST);
booleanQuery.add(dateRangeQuery, BooleanClause.Occur.MUST);
IndexSearcher searcher = new IndexSearcher(reader);
TotalHitCountCollector collector = new TotalHitCountCollector();
searcher.search(booleanQuery, collector);
Sort sort = new Sort(new SortField("publicationDate", SortField.INT));
if (collector.getTotalHits() > 0) {
    TopDocs topDocs = searcher.search(booleanQuery, collector.getTotalHits(), sort);
    int i = 0;
    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
        ArrayList<String> resultRow = new ArrayList<String>();
        Document doc = searcher.doc(scoreDoc.doc);
        resultRow.add(String.valueOf(i));
        resultRow.add(doc.get("publicationDate"));
        resultRow.add(doc.get("mediaSource"));
        resultRow.add(doc.get("fileName"));
        resultRow.add(doc.get("headline"));
        resultRow.add(doc.get("pageNumber"));
        ql.results.put(String.valueOf(i), resultRow);
        i++;
    }
} else {
    ArrayList<String> resultRow = new ArrayList<String>();
    resultRow.add("0");
    resultRow.add("0");
    resultRow.add("0");
    resultRow.add("0");
    resultRow.add("0");
    resultRow.add("0");
    ql.results.put("0", resultRow);
}
Truncated results (last 10 of 2058 documents):
20021231 Iraq Belongs on the Back Burner
20021231 With Missionaries Spreading, Muslims' Anger Is Following
20021231 WHITE HOUSE CUTS ESTIMATE OF COST OF WAR WITH IRAQ
20021231 Bring Back the Draft
20040101 Pakistani Leader's New Tactic: Persuasion
20040101 What We Will Do in 2004
20040101 Ethnic Morass Bogs Down Afghan Talks On Charter
20040101 U.S. Hunts Terror Clues in Case of 2 Brothers
20040101 Giving Up Those Weapons: After Libya, Who Is Next?
20040101 An Odd Sight in Iran as Aid Team Tents Go Up: The U.S. Flag
The problem is the NumericRangeQuery. Switching to a string-based range query over the yyyymmdd values corrects the problem. A likely explanation: the field was indexed with a precisionStep of 8 (the second NumericField constructor argument), while NumericRangeQuery.newIntRange was called without a precisionStep and therefore used the default of 4. The precision step must match between index time and query time, and a mismatch produces exactly this kind of partial range result.
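A sketch of the numeric fix under that assumption, using the newIntRange overload that takes an explicit precisionStep:
// Pass the same precisionStep (8) that the NumericField was indexed with
Query dateRangeQuery = NumericRangeQuery.newIntRange(
        "publicationDate", 8, startDate, endDate, true, true);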