need to display all the terms for a given index - java

I need to display all the terms for a given Lucene index.
public void addDocuments(IndexWriter indexWriter) throws IOException {
Document doc1 = new Document();
doc1.add(new TextField("title", "harrypotter", Field.Store.YES));
indexWriter.addDocument(doc1);
Document doc2 = new Document();
doc2.add(new TextField("title", "luceneinaction", Field.Store.YES));
indexWriter.addDocument(doc2);
Document doc3 = new Document();
doc3.add(new TextField("title", "harrypotter", Field.Store.YES));
indexWriter.addDocument(doc3);
}
I am trying this:
Fields fields = MultiFields.getFields(reader);
Terms terms = fields.terms("title");
TermsEnum iterator = terms.iterator(null);
BytesRef byteRef = null;
while((byteRef = iterator.next()) != null) {
System.out.println(byteRef.utf8ToString());
}
However this gives me only unique terms:
harrypotter
luceneinaction
Is there any way to get all the terms, including duplicates? Or are terms always unique?
Thanks.
PS: Lucene version is 4.0.

It will give you the unique terms. However, you can get the count of the documents containing the term in the following way:
while ((byteRef = iterator.next()) != null) {
System.out.println(byteRef.utf8ToString() + " - " + iterator.docFreq());
}

Lucene is an inverted index, so it stores the references to terms like this:
harrypotter -> doc1, doc3
luceneinaction -> doc2
Each term points to documents as you can see above.
If you need the terms for each document, run each document's text through the desired analyzer separately.
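To make the inverted-index idea concrete, here is a minimal sketch in plain Java. It is a toy model, not Lucene's API: the class ToyInvertedIndex and its methods are made up for illustration. It stores each unique term once, together with the ids of the documents that contain it, which is why enumerating the terms never yields duplicates.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy model of an inverted index; hypothetical class, not part of Lucene.
public class ToyInvertedIndex {
    // term -> ids of the documents containing it, kept sorted by term
    private final Map<String, Set<Integer>> postings = new TreeMap<>();

    // Record that the given document contains the given term.
    public void add(int docId, String term) {
        postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
    }

    // Number of documents containing the term (the idea behind docFreq()).
    public int docFreq(String term) {
        Set<Integer> docs = postings.get(term);
        return docs == null ? 0 : docs.size();
    }

    // Number of distinct terms stored in the index.
    public int uniqueTermCount() {
        return postings.size();
    }

    // One line per unique term, formatted as "term -> [docIds]".
    public String dump() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Set<Integer>> e : postings.entrySet()) {
            sb.append(e.getKey()).append(" -> ").append(e.getValue()).append('\n');
        }
        return sb.toString();
    }

    // The three documents from the question.
    public static ToyInvertedIndex sample() {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add(1, "harrypotter");
        index.add(2, "luceneinaction");
        index.add(3, "harrypotter");
        return index;
    }

    public static void main(String[] args) {
        System.out.print(sample().dump());
    }
}
```

With the sample data, harrypotter is stored once with two document ids, so its document frequency is 2 even though the term list shows it only once.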


Lucene is limiting the query terms

I'm trying to use Lucene (5.4.1) MoreLikeThis to tag (classify) texts. It sort of works, but the results are poor, and I think the problem is related to the Query object.
The example below works, but the top-scoring document isn't the one I expect. When I debug the Query object, it shows only content:erro. From a complete Portuguese phrase (see the example), the query was built from just one word.
I'm not using stop words or any other kind of filter.
So why is Lucene picking only erro as a query term?
To init main objects
Analyzer analyzer = new PortugueseAnalyzer();
Directory indexDir = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(analyzer);
config.setOpenMode(OpenMode.CREATE_OR_APPEND);
To index
try (IndexWriter indexWriter = new IndexWriter(indexDir, config)) {
FieldType type = new FieldType();
type.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
type.setStored(true);
type.setStoreTermVectors(true);
Document doc = new Document();
doc.add(new StringField("id", "880b2bbc", Store.YES));
doc.add(new Field("content", "erro", type));
doc.add(new Field("tag", "atag", type));
indexWriter.addDocument(doc);
indexWriter.commit();
}
To search
try (IndexReader idxReader = DirectoryReader.open(indexDir)) {
IndexSearcher indexSearcher = new IndexSearcher(idxReader);
MoreLikeThis mlt = new MoreLikeThis(idxReader);
mlt.setMinTermFreq(0);
mlt.setMinDocFreq(0);
mlt.setFieldNames(new String[] { "content" });
mlt.setAnalyzer(analyzer);
Reader sReader = new StringReader("Melhorias no controle de sessão no sistema qquercoisa quando expira, ao logar novamente no sistema é exibido o erro \"xpto\"");
Query query = mlt.like("content", sReader);
TopDocs topDocs = indexSearcher.search(query, 3);
}
Well, I decided to take a look inside the MoreLikeThis class, and I found the answer.
The call Query query = mlt.like("content", sReader); invokes the createQueue(Map<String, Int> words) method in the MoreLikeThis class.
Inside it, the tokenized terms from sReader (which have been collected into a Map) are checked against the index.
Only terms that are present in the index are used to build the query.
In the example I provided, my index contains a single document with the word erro, so that is the only word kept from the phrase I passed.
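The effect can be reproduced with a small plain-Java sketch. This only mimics the behavior described above and is not the MoreLikeThis implementation: the class IndexedTermFilter, its method, and the naive tokenizer are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy illustration only; hypothetical class, not MoreLikeThis internals.
public class IndexedTermFilter {
    // Keep only the tokens of the input text that exist in the index vocabulary.
    public static List<String> keepIndexedTerms(String text, Set<String> indexedTerms) {
        List<String> kept = new ArrayList<>();
        // Naive tokenizer (assumption): lower-case, then split on non-letter characters.
        for (String token : text.toLowerCase().split("[^\\p{L}]+")) {
            if (!token.isEmpty() && indexedTerms.contains(token) && !kept.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // The index from the question holds a single term.
        Set<String> indexedTerms = new HashSet<>(Arrays.asList("erro"));
        // Tail of the phrase from the question; \u00e9 is the letter e with an acute accent.
        String text = "ao logar novamente no sistema \u00e9 exibido o erro \"xpto\"";
        // Every token except "erro" is dropped because the index has never seen it.
        System.out.println(keepIndexedTerms(text, indexedTerms));
    }
}
```

The more documents the index holds, the more words of the input survive this filter, so a one-document index is a poor basis for judging the quality of the generated query.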

Lucene: BestTextFragments returns only the first document

I'm building a Lucene index of Twitter users and their tweets. My idea is to store information about each user (name, description, etc.) together with his tweets, using the following code:
for (Map.Entry<Long, User> entry : users.entrySet()) {
User user = entry.getValue();
Document document = new Document();
document.add(new LongField("id", user.getId(), Field.Store.YES));
document.add(new StringField("name", user.getName(), Field.Store.YES));
document.add(new StringField("username", user.getUsername(), Field.Store.YES));
for (UserTweet t : user.getTweets()) {
document.add(new TextField("tweet", t.getText(), Field.Store.YES));
}
writer.addDocument(document);
}
Here a document can have many tweets in the "tweet" field. The analyzer used for this field is the EnglishAnalyzer.
Is this the correct way to store tweets?
My problem arises when I use the Highlighter to retrieve the matching tweets. If I search for a term that is present in ALL tweets of ALL stored users, I get ALL users as a result (correct!), but when I want to see all tweets of a single user that match the query (with the Highlighter), I get only the first tweet of every user, not all of them.
This is the code that I use to search:
BooleanQuery.Builder booleanQuery = new BooleanQuery.Builder();
QueryParser queryParserKeywords = new QueryParser("tweet", new EnglishAnalyzer());
String strQueryKeywords = "";
for (String s : c.getValue().split(" "))
strQueryKeywords += "tweet:"+ s +" OR ";
strQueryKeywords = strQueryKeywords.substring(0, strQueryKeywords.lastIndexOf("OR"));
Query queryKeywords = queryParserKeywords.parse(strQueryKeywords);
QueryScorer queryScorerKeywords = new QueryScorer(queryKeywords, "tweet");
Fragmenter fragment = new SimpleSpanFragmenter(queryScorerKeywords, 150);
keywordsHighlighter = new Highlighter(queryScorerKeywords);
keywordsHighlighter.setTextFragmenter(fragment);
booleanQuery.add(queryKeywords, BooleanClause.Occur.SHOULD);
... (other boolean clause over other fields)
searcher.search(booleanQuery.build(), collector);
...
for (ScoreDoc doc : collector.topDocs().scoreDocs) {
Document d = searcher.doc(doc.doc);
TokenStream tokenStream = new EnglishAnalyzer().tokenStream("",d.getField("tweet").stringValue());
TextFragment[] fragments = keywordsHighlighter.getBestTextFragments(tokenStream, d.getField("tweet").stringValue(), false, 10);
for (TextFragment fragment : fragments) {
System.out.println(" - " + fragment.toString());
}
}
What's wrong with my code?
Finally, to search over multiple fields with different text (e.g. City=New York, Keyword=Star Wars, etc.), is it correct to use a BooleanQuery, or is there a better solution?
Thanks a lot.

retrieve documents after grouping search in lucene

I am running a Lucene search on my resources. In one case I search for a particular product and need to group the results by the 'keywords' field. I can get the total number of products grouped by their associated keywords. How can I get all the documents related to this search, so that I can retrieve other fields from them? I tried using AbstractAllGroupHeadsCollector but got confused by its usage. Here is my code.
Thanks in advance.
Integer totalGroupCount = null;
IndexReader ir = DirectoryReader.open(indexLocation);
IndexSearcher is = new IndexSearcher(ir);
GroupingSearch groupingSearch = new GroupingSearch("keywords");
groupingSearch.setGroupSort(Sort.RELEVANCE);
groupingSearch.setFillSortFields(true);
groupingSearch.setCachingInMB(4.0, true);
groupingSearch.setAllGroups(true);
//TermQuery query = new TermQuery(new Term("products", "wfa packages"));
TopGroups<BytesRef> result = groupingSearch.search(is, query, 0, 10);
// Render groupsResult...
totalGroupCount = result.totalGroupCount; // The group count
GroupDocs<BytesRef>[] d=result.groups;
System.out.println("total groups="+result.totalGroupedHitCount);
You already have your GroupDocs array, so you are most of the way there. You can get the scoreDocs from each GroupDocs and look up each document by its doc id, from ScoreDoc.doc, like this:
for (GroupDocs<BytesRef> group : d) {
for (ScoreDoc scoredoc : group.scoreDocs) {
Document doc = is.doc(scoredoc.doc);
//Do stuff
}
}

Migrate from Lucene 3.0 to 4.9.0

I want to migrate an example from the book "Lucene in Action 2nd Edition", which is based on Lucene 3.0, to Lucene's current version. Here is the code that needs to be migrated:
public void testUpdate() throws IOException {
assertEquals(1, getHitCount("city", "Amsterdam"));
IndexWriter writer = getWriter();
Document doc = new Document();
doc.add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new Field("country", "Netherlands", Field.Store.YES, Field.Index.NO));
doc.add(new Field("contents", "Den Haag has a lot of museums", Field.Store.NO, Field.Index.ANALYZED));
doc.add(new Field("city", "Den Haag", Field.Store.YES, Field.Index.ANALYZED));
writer.updateDocument(new Term("id", "1"), doc);
writer.close();
assertEquals(0, getHitCount("city", "Amsterdam"));
assertEquals(1, getHitCount("city", "Den Haag"));
}
I'm trying to perform the migration according to the Lucene Migration Guide using the equivalents for the former Field constructors to create the Document object. The code for this looks as follows:
@Test
public void testUpdate() throws IOException
{
assertEquals(1, getHitCount("city", "Amsterdam"));
IndexWriter writer = getWriter();
Document doc = new Document();
FieldType ft = new FieldType(StringField.TYPE_STORED);
ft.setOmitNorms(false);
doc.add(new Field("id", "1", ft));
doc.add(new StoredField("country", "Netherlands"));
doc.add(new TextField("contents", "Den Haag has a lot of museums", Store.NO));
doc.add(new Field("city", "Den Haag", TextField.TYPE_STORED));
writer.updateDocument(new Term("id", "1"), doc);
writer.close();
assertEquals(0, getHitCount("city", "Amsterdam"));
assertEquals(1, getHitCount("city", "Den Haag"));
}
The second assertion fails because it doesn't find the string "Den Haag" (searching for only "Den" or "Haag" works, though). If I use a StringField object instead, the test passes, since the "city" attribute is not analyzed (i.e. tokenized) and is thus kept unchanged. But the example obviously does not intend to treat this attribute like, e.g., an ID. I've read that the combination "Field.Store.YES / Field.Index.ANALYZED" is good for small textual content like an intro text, abstract or title, so it should also match multi-word strings like "Den Haag", or am I wrong? Could anyone clarify, please?
The author uses a Term object to create the search string:
protected int getHitCount(String fieldName, String searchString) throws IOException {
DirectoryReader dr = DirectoryReader.open(directory);
IndexSearcher searcher = new IndexSearcher(dr);
Term t = new Term(fieldName, searchString);
Query query = new TermQuery(t);
int hitCount = TestUtil.hitCount(searcher, query);
return hitCount;
}
The TestUtil class only contains a single line of code
public static int hitCount(IndexSearcher searcher, Query query) {
return searcher.search(query, 1).totalHits;
}
Short explanation: you need to make sure the tokenization setting (on/off) is the same at index time and at search time.
Long explanation: if you want your content to be analyzed, you should use not only TextField but also QueryParser, so that your query goes through the same process. In your case the query fails because with
doc.add(new Field("city", "Den Haag", TextField.TYPE_STORED));
the text gets tokenized into "Den" and "Haag". Later, when you create the TermQuery, you search for the single term "Den Haag", which, of course, yields no results.
The code below shows how this could work for the non-tokenized case:
doc.add(new StringField("city", "Den Haag", Field.Store.YES));
...
PhraseQuery query = new PhraseQuery();
// The whole value is one term, matching the un-tokenized StringField.
query.add(new Term("city", "Den Haag"));
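The mismatch can be illustrated with a small plain-Java sketch. It is a toy model, not Lucene's analysis chain: the class TokenizationMismatch and its methods are hypothetical, and the "analyzer" is assumed to split on whitespace only.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of index-time vs. search-time tokenization; hypothetical class, not Lucene.
public class TokenizationMismatch {
    // "Analyzed" indexing (assumption: split on whitespace, nothing else).
    public static Set<String> indexAnalyzed(String value) {
        Set<String> terms = new HashSet<>();
        for (String token : value.split("\\s+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }

    // "Not analyzed" indexing: the whole value becomes a single term.
    public static Set<String> indexVerbatim(String value) {
        Set<String> terms = new HashSet<>();
        terms.add(value);
        return terms;
    }

    // A term query is an exact lookup; the search string itself is not analyzed.
    public static boolean termQueryMatches(Set<String> indexedTerms, String term) {
        return indexedTerms.contains(term);
    }

    public static void main(String[] args) {
        Set<String> analyzed = indexAnalyzed("Den Haag");
        Set<String> verbatim = indexVerbatim("Den Haag");
        System.out.println(termQueryMatches(analyzed, "Den Haag")); // false: no such single term
        System.out.println(termQueryMatches(analyzed, "Haag"));     // true: one of the tokens
        System.out.println(termQueryMatches(verbatim, "Den Haag")); // true: stored as one term
    }
}
```

In this model the only way to match "Den Haag" against the analyzed field is to tokenize the search string as well, which is the job a query parser does with the same analyzer.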

Lucene: get newest document for category

I'm pretty new to Lucene, so I apologize in advance if what I am trying to do is trivial.
I have an index where the documents contain (among others) two fields:
documentoId and employeeId.
Each employee can submit various documents. The structure is pretty much the same as in the bookstore example.
What I am trying to achieve is to get the newest document matching a query for each employee, meaning the one with the highest documentoId for each employeeId.
In SQL, this would be something like:
select max(documentoId ), employeeId
from documents
where content like 'mySearchValue'
group by employeeId
I don't know whether I should use the facet API, or whether this can be done with queries or with the searchAfter method... I'm pretty lost in the documentation.
Any help would be greatly appreciated!
Thanks
Lucene supports grouped search; what you need to do is define your group and how it should be sorted. In the example below, I group by documentoId and sort in descending order.
public static void main(String[] args) throws IOException, ParseException {
StandardAnalyzer standardAnalyzer = new StandardAnalyzer(Version.LUCENE_46);
RAMDirectory ramDirectory = new RAMDirectory();
IndexWriter indexWriter = new IndexWriter(ramDirectory, new IndexWriterConfig(Version.LUCENE_46, standardAnalyzer));
Document d0 = new Document();
d0.add(new TextField("employeeId", "foo", Field.Store.YES));
d0.add(new IntField("documentoId", 1, Field.Store.YES));
indexWriter.addDocument(d0);
Document d1 = new Document();
d1.add(new TextField("employeeId", "bar", Field.Store.YES));
d1.add(new IntField("documentoId", 20, Field.Store.YES));
indexWriter.addDocument(d1);
Document d2 = new Document();
d2.add(new TextField("employeeId", "baz", Field.Store.YES));
d2.add(new IntField("documentoId", 3, Field.Store.YES));
indexWriter.addDocument(d2);
indexWriter.commit();
GroupingSearch groupingSearch = new GroupingSearch("documentoId");
Sort groupSort = new Sort(new SortField("documentoId", SortField.Type.INT, true)); // in descending order
groupingSearch.setGroupSort(groupSort);
groupingSearch.setSortWithinGroup(groupSort);
IndexReader reader = DirectoryReader.open(ramDirectory);
IndexSearcher searcher = new IndexSearcher(reader);
TopGroups<?> groups = groupingSearch.search(searcher, new MatchAllDocsQuery(), 0, 10);
Document highestScoredDocument = reader.document(groups.groups[0].scoreDocs[0].doc);
System.out.println(
"Descending order, first document is " +
"employeeId:" + highestScoredDocument.get("employeeId") + " " +
"documentoId:" + highestScoredDocument.get("documentoId")
);
}
The above code finds that d1 (the middle document) ranks at the top and prints the following:
Descending order, first document is employeeId:bar documentoId:20
The above code does not address the content like 'mySearchValue' part; you would have to replace MatchAllDocsQuery with a relevant query to do that.
Custom sorting of the hits will do the trick. Google the search.sort parameter in Lucene.
For those in the same situation, I solved my problem using mindas's answer, modified to group by my own group field:
GroupingSearch groupingSearch = new GroupingSearch("employeeId");
Sort groupSort = new Sort(new SortField("documentoId", SortField.Type.INT, true)); // in descending order
groupingSearch.setGroupSort(groupSort);
groupingSearch.setSortWithinGroup(groupSort);
int offset = 0;
int limitGroup = 50;
TopGroups<?> groups = groupingSearch.search(is,query, offset, limitGroup);
List<Document> result = new ArrayList<>();
for (int i=0; i<groups.groups.length; i++) {
ScoreDoc sdoc = groups.groups[i].scoreDocs[0]; // first result of each group
Document d = is.doc(sdoc.doc);
result.add(d);
}
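For anyone who wants to check what this grouping is expected to return, the same "highest documentoId per employeeId" logic can be sketched in plain Java, independent of Lucene. The class NewestPerEmployee and its input arrays are hypothetical; they stand in for the hits that already matched the query.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java reference for "max(documentoId) group by employeeId"; hypothetical class, not Lucene.
public class NewestPerEmployee {
    // employeeIds[i] and documentoIds[i] describe the i-th matching document.
    public static Map<String, Integer> newestByEmployee(String[] employeeIds, int[] documentoIds) {
        Map<String, Integer> newest = new LinkedHashMap<>();
        for (int i = 0; i < employeeIds.length; i++) {
            // Keep the larger documentoId seen so far for this employee.
            newest.merge(employeeIds[i], documentoIds[i], Integer::max);
        }
        return newest;
    }

    public static void main(String[] args) {
        String[] employees = {"foo", "bar", "foo", "baz"};
        int[] documents = {1, 20, 7, 3};
        // One entry per employee, each holding that employee's highest documentoId.
        System.out.println(newestByEmployee(employees, documents));
    }
}
```

Comparing this map against the first scoreDoc of each group is a cheap sanity check that the group field and the within-group sort are set up as intended.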
