Lucene Index - single term and phrase querying - Java

I've read some documents and built a Lucene index which looks like this:
Documents:
id 1
keyword foo bar
keyword john
id 2
keyword foo
id 3
keyword john doe
keyword bar foo
keyword what the hell
I want to query Lucene in a way where I can combine single terms and phrases.
Let's say my query is
foo bar
should give back the doc ids 1, 2 and 3
The query
"foo bar"
should give back doc id 1
The query
john
should give back the doc ids 1 and 3
The query
john "foo bar"
should give back doc id 1
My implementation in Java is not working, and reading tons of documentation didn't help either.
When I query my index with
"foo bar"
I get 0 hits
When I query my index with
foo "john doe"
I get back the doc ids 1, 2 and 3 (I would expect only doc id 3, since the query is meant as foo AND "john doe"). The problem is that "john doe" gives back 0 hits, but foo gives back 3 hits.
My goal is to combine single terms and phrase terms. What am I doing wrong? I've also played around with the analyzers, with no luck.
My implementation looks like this:
Indexer
import ...
public class Indexer
{
private static final Logger LOG = LoggerFactory.getLogger(Indexer.class);
private final File indexDir;
private IndexWriter writer;
public Indexer(File indexDir)
{
this.indexDir = indexDir;
this.writer = null;
}
private IndexWriter createIndexWriter()
{
try
{
Directory dir = FSDirectory.open(indexDir);
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_34);
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_34, analyzer);
iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
iwc.setRAMBufferSizeMB(256.0);
IndexWriter idx = new IndexWriter(dir, iwc);
idx.deleteAll();
return idx;
} catch (IOException e)
{
throw new RuntimeException(String.format("Could create indexer on directory [%s]", indexDir.getAbsolutePath()), e);
}
}
public void index(TestCaseDescription desc)
{
if (writer == null)
writer = createIndexWriter();
Document doc = new Document();
addPathToDoc(desc, doc);
addLastModifiedToDoc(desc, doc);
addIdToDoc(desc, doc);
for (String keyword : desc.getKeywords())
addKeywordToDoc(doc, keyword);
updateIndex(doc, desc);
}
private void addIdToDoc(TestCaseDescription desc, Document doc)
{
Field idField = new Field(LuceneConstants.FIELD_ID, desc.getId(), Field.Store.YES, Field.Index.ANALYZED);
idField.setIndexOptions(IndexOptions.DOCS_ONLY);
doc.add(idField);
}
private void addKeywordToDoc(Document doc, String keyword)
{
Field keywordField = new Field(LuceneConstants.FIELD_KEYWORDS, keyword, Field.Store.YES, Field.Index.ANALYZED);
keywordField.setIndexOptions(IndexOptions.DOCS_ONLY);
doc.add(keywordField);
}
private void addLastModifiedToDoc(TestCaseDescription desc, Document doc)
{
NumericField modifiedField = new NumericField(LuceneConstants.FIELD_LAST_MODIFIED);
modifiedField.setLongValue(desc.getLastModified());
doc.add(modifiedField);
}
private void addPathToDoc(TestCaseDescription desc, Document doc)
{
Field pathField = new Field(LuceneConstants.FIELD_PATH, desc.getPath(), Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS);
pathField.setIndexOptions(IndexOptions.DOCS_ONLY);
doc.add(pathField);
}
private void updateIndex(Document doc, TestCaseDescription desc)
{
try
{
if (writer.getConfig().getOpenMode() == OpenMode.CREATE)
{
// New index, so we just add the document (no old document can be there):
LOG.debug(String.format("Adding testcase [%s] (%s)", desc.getId(), desc.getPath()));
writer.addDocument(doc);
} else
{
// Existing index (an old copy of this document may have been indexed) so
// we use updateDocument instead to replace the old one matching the exact
// path, if present:
LOG.debug(String.format("Updating testcase [%s] (%s)", desc.getId(), desc.getPath()));
writer.updateDocument(new Term(LuceneConstants.FIELD_PATH, desc.getPath()), doc);
}
} catch (IOException e)
{
throw new RuntimeException(String.format("Could not create or update index for testcase [%s] (%s)", desc.getId(),
desc.getPath()), e);
}
}
public void store()
{
try
{
writer.close();
} catch (IOException e)
{
throw new RuntimeException(String.format("Could not write index [%s]", writer.getDirectory().toString()));
}
writer = null;
}
}
Searcher:
import ...
public class Searcher
{
private static final Logger LOG = LoggerFactory.getLogger(Searcher.class);
private final Analyzer analyzer;
private final QueryParser parser;
private final File indexDir;
public Searcher(File indexDir)
{
this.indexDir = indexDir;
analyzer = new StandardAnalyzer(Version.LUCENE_34);
parser = new QueryParser(Version.LUCENE_34, LuceneConstants.FIELD_KEYWORDS, analyzer);
parser.setAllowLeadingWildcard(true);
}
public List<String> search(String searchString)
{
List<String> testCaseIds = new ArrayList<String>();
try
{
IndexSearcher searcher = getIndexSearcher(indexDir);
Query query = parser.parse(searchString);
LOG.info("Searching for: " + query.toString(parser.getField()));
AllDocCollector results = new AllDocCollector();
searcher.search(query, results);
LOG.info("Found [{}] hit", results.getHits().size());
for (ScoreDoc scoreDoc : results.getHits())
{
Document doc = searcher.doc(scoreDoc.doc);
String id = doc.get(LuceneConstants.FIELD_ID);
testCaseIds.add(id);
}
searcher.close();
return testCaseIds;
} catch (Exception e)
{
throw new RuntimeException(String.format("Could not search index [%s]", indexDir.getAbsolutePath()), e);
}
}
private IndexSearcher getIndexSearcher(File indexDir)
{
try
{
FSDirectory dir = FSDirectory.open(indexDir);
return new IndexSearcher(dir);
} catch (IOException e)
{
LOG.error(String.format("Could not open index directory [%s]", indexDir.getAbsolutePath()), e);
throw new RuntimeException(e);
}
}
}

Why are you using DOCS_ONLY?! If you only index doc ids, then you only have a basic inverted index with term->document mappings, but no proximity information. So that's why your phrase queries don't work.
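On the indexing side, the fix is to keep positional data for the keyword field. A minimal sketch against the Lucene 3.4 code in the question (field name and constants taken from the Indexer above); either drop the setIndexOptions call entirely, since positions are indexed by default for analyzed fields, or request them explicitly:
private void addKeywordToDoc(Document doc, String keyword)
{
    Field keywordField = new Field(LuceneConstants.FIELD_KEYWORDS, keyword, Field.Store.YES, Field.Index.ANALYZED);
    // Phrase queries need term positions; DOCS_ONLY strips them out.
    keywordField.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
    doc.add(keywordField);
}
The index has to be rebuilt after this change, because already-indexed documents carry no position data.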

I think you roughly want:
keyword:"foo bar"~1^2 OR keyword:"foo" OR keyword:"bar"
Which is to say, phrase match "foo bar" and boost it (prefer the full phrase), OR match "foo", OR match "bar".
The full query syntax is here: http://lucene.apache.org/core/old_versioned_docs/versions/3_0_0/queryparsersyntax.html
EDIT:
It looks like one thing you're missing is that the default operator is OR. So you probably want to do something like this:
+keyword:john AND +keyword:"foo bar"
The plus sign means "must contain". You put the AND explicitly so that the document must contain both (rather than the default, which translates to "must contain john OR must contain "foo bar").
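If you don't want users to have to type the operators themselves, two equivalent options (a sketch using the Lucene 3.x classes already present in the Searcher above): switch the parser's default operator to AND, or assemble the query programmatically with a BooleanQuery.
// Option 1: make the QueryParser treat bare clauses as AND-ed by default
parser.setDefaultOperator(QueryParser.AND_OPERATOR);
Query query = parser.parse("john \"foo bar\"");   // now means: john AND "foo bar"

// Option 2: build the combination by hand
PhraseQuery phrase = new PhraseQuery();
phrase.add(new Term(LuceneConstants.FIELD_KEYWORDS, "foo"));
phrase.add(new Term(LuceneConstants.FIELD_KEYWORDS, "bar"));
BooleanQuery combined = new BooleanQuery();
combined.add(new TermQuery(new Term(LuceneConstants.FIELD_KEYWORDS, "john")), BooleanClause.Occur.MUST);
combined.add(phrase, BooleanClause.Occur.MUST);
Note that the terms passed to Term must already be in the analyzed form (lower-cased by StandardAnalyzer), which "john", "foo" and "bar" are.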

The problem was solved by replacing
StandardAnalyzer
with
KeywordAnalyzer
for both the indexer and the searcher.
Since the StandardAnalyzer splits the input text into several words, I've replaced it with KeywordAnalyzer, so the input (which can consist of one or more words) remains untouched. It will recognize a term like
bla foo
as a single keyword.
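For reference, the swap amounts to one line on each side (a sketch against the code above). Keep in mind that with KeywordAnalyzer the whole field value, e.g. "foo bar", becomes a single token, so only queries matching the complete keyword value will hit:
// Indexer.createIndexWriter()
Analyzer analyzer = new KeywordAnalyzer();
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_34, analyzer);

// Searcher constructor
analyzer = new KeywordAnalyzer();
parser = new QueryParser(Version.LUCENE_34, LuceneConstants.FIELD_KEYWORDS, analyzer);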

Related

Elastic Search missing some documents while creating index from another index

I have created one index in Elasticsearch named documents_local and loaded data from an Oracle database via Logstash. This index contains the following values:
"FilePath" : "Path of the file",
"FileName" : "filename.pdf",
"Language" : "Language_name"
Then I want to index those file contents as well, so I have created one more index in Elasticsearch named document_attachment. Using the following Java code, I fetch FilePath & FileName from the index documents_local, retrieve the file (which is available on my local drive) with the help of the file path, and index the file contents using the ingest-attachment plugin processor.
Please find my Java code below, where I am indexing the files.
private final static String INDEX = "documents_local"; //Documents Table with file Path - Source Index
private final static String ATTACHMENT = "document_attachment"; // Documents with Attachment... -- Destination Index
private final static String TYPE = "doc";
public static void main(String args[]) throws IOException {
RestHighLevelClient restHighLevelClient = null;
Document doc=new Document();
try {
restHighLevelClient = new RestHighLevelClient(RestClient.builder(new HttpHost("localhost", 9200, "http"),
new HttpHost("localhost", 9201, "http")));
} catch (Exception e) {
System.out.println(e.getMessage());
}
//Fetching Id, FilePath & FileName from Document Index.
SearchRequest searchRequest = new SearchRequest(INDEX);
searchRequest.types(TYPE);
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
QueryBuilder qb = QueryBuilders.matchAllQuery();
searchSourceBuilder.query(qb);
searchSourceBuilder.size(3000);
searchRequest.source(searchSourceBuilder);
SearchResponse searchResponse = null;
try {
searchResponse = restHighLevelClient.search(searchRequest);
} catch (IOException e) {
e.getLocalizedMessage();
}
SearchHit[] searchHits = searchResponse.getHits().getHits();
long totalHits=searchResponse.getHits().totalHits;
int line=1;
String docpath = null;
Map<String, Object> jsonMap ;
for (SearchHit hit : searchHits) {
String encodedfile = null;
File file=null;
Map<String, Object> sourceAsMap = hit.getSourceAsMap();
doc.setId((int) sourceAsMap.get("id"));
doc.setLanguage(sourceAsMap.get("language"));
doc.setFilename(sourceAsMap.get("filename").toString());
doc.setPath(sourceAsMap.get("path").toString());
String filepath=doc.getPath().concat(doc.getFilename());
System.out.println("Line Number--> "+line+++"ID---> "+doc.getId()+"File Path --->"+filepath);
file = new File(filepath);
if(file.exists() && !file.isDirectory()) {
try {
FileInputStream fileInputStreamReader = new FileInputStream(file);
byte[] bytes = new byte[(int) file.length()];
fileInputStreamReader.read(bytes);
encodedfile = new String(Base64.getEncoder().encodeToString(bytes));
} catch (FileNotFoundException e) {
e.printStackTrace();
}
}
jsonMap = new HashMap<>();
jsonMap.put("id", doc.getId());
jsonMap.put("language", doc.getLanguage());
jsonMap.put("filename", doc.getFilename());
jsonMap.put("path", doc.getPath());
jsonMap.put("fileContent", encodedfile);
String id=Long.toString(doc.getId());
IndexRequest request = new IndexRequest(ATTACHMENT, "doc", id )
.source(jsonMap)
.setPipeline(ATTACHMENT);
try {
IndexResponse response = restHighLevelClient.index(request);
} catch(ElasticsearchException e) {
if (e.status() == RestStatus.CONFLICT) {
}
e.printStackTrace();
}
}
System.out.println("Indexing done...");
}
Please find my pipeline definition for the ingest-attachment plugin below (I created the pipeline first and then executed this Java code):
PUT _ingest/pipeline/document_attachment
{
"description" : "Extract attachment information",
"processors" : [
{
"attachment" : {
"field" : "fileContent"
}
}
]
}
But while doing this process, I am missing some of the documents.
My source index is documents_local, which has 2910 documents.
I am fetching the documents, attaching my PDFs (after converting them to Base64) and writing them into another index, document_attachment.
But the document_attachment index has only 118 documents when it should have 2910, so some of the documents are missing. Also, the indexing is taking a very long time.
I am not sure how the documents go missing in the second index (document_attachment). Is there any other way to improve this process?
Can we include a thread mechanism here? In future I have to index more than 100k PDFs in this same way.
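One direction worth checking (a hedged sketch, not a verified fix): the loop above swallows indexing failures, so documents can disappear silently, and for 100k+ PDFs a single search with a large size will no longer work anyway (the default index.max_result_window is 10,000). The usual pattern is to page through the source index with the scroll API and log every failed IndexRequest. The sketch below uses the same deprecated 6.x RestHighLevelClient methods as the question's code; buildAttachmentSource() is a hypothetical helper that builds the same jsonMap as above.
SearchRequest searchRequest = new SearchRequest(INDEX);
searchRequest.types(TYPE);
searchRequest.scroll(TimeValue.timeValueMinutes(2L));
searchRequest.source(new SearchSourceBuilder().query(QueryBuilders.matchAllQuery()).size(500));

SearchResponse response = restHighLevelClient.search(searchRequest);
String scrollId = response.getScrollId();
SearchHit[] hits = response.getHits().getHits();

while (hits != null && hits.length > 0) {
    for (SearchHit hit : hits) {
        Map<String, Object> jsonMap = buildAttachmentSource(hit); // hypothetical helper: same mapping logic as above
        IndexRequest request = new IndexRequest(ATTACHMENT, TYPE, hit.getId())
                .source(jsonMap)
                .setPipeline(ATTACHMENT);
        try {
            restHighLevelClient.index(request);
        } catch (Exception e) {
            // log and continue instead of dropping the document silently
            System.err.println("Failed to index " + hit.getId() + ": " + e.getMessage());
        }
    }
    SearchScrollRequest scrollRequest = new SearchScrollRequest(scrollId);
    scrollRequest.scroll(TimeValue.timeValueMinutes(2L));
    response = restHighLevelClient.searchScroll(scrollRequest);
    scrollId = response.getScrollId();
    hits = response.getHits().getHits();
}
Counting successes and failures at the end makes it easier to see whether the gap between 2910 and 118 comes from missing files on disk, indexing errors, or documents that were never fetched.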

Stanford NLP API for Java: how to get the name in full, not in parts

The aim of my code is to submit a document (be it a PDF or DOC file), get all the text in it, and give the text to Stanford NLP to be analysed. The code works just fine. But suppose there is a name in the document, e.g. "Pardeep Kumar". The output received for it is as follows:
Pardeep NNP PERSON
Kumar NNP PERSON
But I want it to be like this:
Pardeep Kumar NNP PERSON
How do I do that? How do I check whether two adjacently placed words actually make up one name, or anything similar? How do I keep them from being split into different words?
Here is my code:
public class readstuff {
public static void analyse(String data) {
// creates a StanfordCoreNLP object, with POS tagging, lemmatization, NER, parsing, and coreference resolution
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
// create an empty Annotation just with the given text
Annotation document = new Annotation(data);
// run all Annotators on this text
pipeline.annotate(document);
List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
// System.out.println("word"+"\t"+"POS"+"\t"+"NER");
for (CoreMap sentence : sentences) {
// traversing the words in the current sentence
// a CoreLabel is a CoreMap with additional token-specific methods
for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
// this is the text of the token
String word = token.get(CoreAnnotations.TextAnnotation.class);
// this is the POS tag of the token
String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
// this is the NER label of the token
String ne = token.get(CoreAnnotations.NamedEntityTagAnnotation.class);
if(ne.equals("PERSON") || ne.equals("LOCATION") || ne.equals("DATE") )
{
System.out.format("%32s%10s%16s",word,pos,ne);
System.out.println();
//System.out.println(word +" \t"+pos +"\t"+ne);
}
}
}
}
public static void main(String[] args) throws FileNotFoundException, IOException, TransformerConfigurationException{
JFileChooser window=new JFileChooser();
int a=window.showOpenDialog(null);
if(a==JFileChooser.APPROVE_OPTION){
String name=window.getSelectedFile().getName();
String extension = name.substring(name.lastIndexOf(".") + 1, name.length());
String data = null;
if(extension.equals("docx")){
XWPFDocument doc=new XWPFDocument(new FileInputStream(window.getSelectedFile()));
XWPFWordExtractor extract= new XWPFWordExtractor(doc);
//System.out.println("docx file reading...");
data=extract.getText();
//extract.getMetadataTextExtractor();
}
else if(extension.equals("doc")){
HWPFDocument doc=new HWPFDocument(new FileInputStream(window.getSelectedFile()));
WordExtractor extract= new WordExtractor(doc);
//System.out.println("doc file reading...");
data=extract.getText();
}
else if(extension.equals("pdf")){
//System.out.println(window.getSelectedFile());
PdfReader reader=new PdfReader(new FileInputStream(window.getSelectedFile()));
int n=reader.getNumberOfPages();
for(int i=1;i<n;i++)
{
//System.out.println(data);
data=data+PdfTextExtractor.getTextFromPage(reader,i );
}
}
else{
System.out.println("format not supported");
}
analyse(data);
}
}
}
You want to use the entitymentions annotator.
package edu.stanford.nlp.examples;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.util.*;
import java.util.*;
public class EntityMentionsExample {
public static void main(String[] args) {
Annotation document =
new Annotation("John Smith visited Los Angeles on Tuesday. He left Los Angeles on Wednesday.");
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,entitymentions");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
pipeline.annotate(document);
for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
for (CoreMap entityMention : sentence.get(CoreAnnotations.MentionsAnnotation.class)) {
System.out.println(entityMention);
System.out.println(entityMention.get(CoreAnnotations.EntityTypeAnnotation.class));
}
}
}
}
Somehow "You want to use the entitymentions annotator." didn't work for me the way I wanted.
For example, if a text contained a name such as 'Rodriguez Quinonez, Dora need a health checkup', it returned 'Rodriguez Quinonez' as a PERSON and 'Dora' as another PERSON. So the only way out, it seemed, was to apply some post-processing once the NERs come out of the Stanford engine. See below.
Once I got the entities out, I passed them through a method that did the grouping on the following basis:
If the adjacent NERs are PERSON (no matter whether there are 1, 2, 3 or more of them), you can group them together to form a single aggregated noun.
Here's my code
A class to hold the token text in the word attribute and the NER tag in the ner attribute:
public class NERData {
String word;
String ner;
....
}
Get the NERs out (if you are only interested in NERs):
public List<NERData> getNers(String data){
Annotation document = new Annotation(data);
pipeline.annotate(document);
List<CoreMap> sentences = document.get(SentencesAnnotation.class);
List<NERData> ret = new ArrayList<NERData>();
for(CoreMap sentence: sentences) {
for (CoreLabel token: sentence.get(TokensAnnotation.class)) {
String word = token.get(TextAnnotation.class);
String ne = token.get(NamedEntityTagAnnotation.class);
if(!(ne == null || ne.equals("O"))){
NERData d = new NERData(word, ne);
//System.out.println("word is "+word+" ner "+ne);
ret.add(d);
}
}
}
StanfordCoreNLP.clearAnnotatorPool();
return ret;
}
Now pass the list of NERs to a method that looks for adjacent PERSON entities and aggregates them into one group.
public List<List<NERData>> getGroups(List<NERData> data){
List<List<NERData>> groups = new ArrayList<List<NERData>>();
List<NERData> group= new ArrayList<NERData>();
NERData curr = null;
int count = 0;
for (NERData val : data) {
if (curr == null) {
curr = val;
count = 1;
group.add(curr);
}
else if (!curr.getNer().equalsIgnoreCase(val.getNer())) {
if(!groups.contains(group)){
groups.add(group);
}
curr = val;
count = 1;
group = new ArrayList<NERData>();
group.add(val);
}
else {
group.add(val);
if(!groups.contains(group)){
groups.add(group);
}
curr = val;
++count;
}
}
return groups;
}
As a result, you will get Pardeep Kumar NNP PERSON as the output.
Note - This may not work well if you have multiple person names in the same sentence not separated by any noun.
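To turn each group back into a readable full name, here is a small usage sketch on top of the methods above (getWord() and getNer() are assumed to be the obvious getters on NERData):
List<NERData> ners = getNers("Pardeep Kumar submitted the document.");
for (List<NERData> group : getGroups(ners)) {
    StringBuilder name = new StringBuilder();
    for (NERData token : group) {
        if (name.length() > 0) {
            name.append(" ");
        }
        name.append(token.getWord());
    }
    // prints e.g.: Pardeep Kumar PERSON
    System.out.println(name + " " + group.get(0).getNer());
}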

approach for creating lucene indexes

I am implementing a search feature for a news website. On that website, users submit news articles containing a title and text; currently these articles are inserted directly into a database. I heard that full-text searching inside a database containing long text would not be efficient.
So I tried using Lucene for indexing and searching. I am able to index the full database with it and also able to search the content, but I am not sure if I am using the best approach.
Here is my indexer class:
public class LuceneIndexer {
public static void indexNews(News p, IndexWriter indexWriter) throws IOException {
Document doc = new Document();
doc.add(new Field("id", p.getNewsId(), Field.Store.YES, Field.Index.NO));
doc.add(new Field("title", p.getTitle(), Field.Store.YES, Field.Index.TOKENIZED));
doc.add(new Field("text", p.getNewsRawText(), Field.Store.YES, Field.Index.UN_TOKENIZED));
String fullSearchableText = p.getTitle() + " " + p.getNewsRawText();
doc.add(new Field("content", fullSearchableText, Field.Store.NO, Field.Index.TOKENIZED));
indexWriter.addDocument(doc);
}
public static void rebuildIndexes() {
try {
System.out.println("started indexing");
IndexWriter w = getIndexWriter();
ArrayList<News> n = new GetNewsInfo().getLastPosts(0);
for (News news : n) {
indexNews(news,w );
}
closeIndexWriter(w);
System.out.println("indexing done");
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
public static IndexWriter getIndexWriter() throws CorruptIndexException, LockObtainFailedException, IOException {
IndexWriter indexWriter = new IndexWriter(GlobalData.LUCENE_INDEX_STOREAGE, new StandardAnalyzer(), true);
return indexWriter;
}
public static void closeIndexWriter(IndexWriter w) throws CorruptIndexException, IOException {
w.close();
}
}
Is the above code efficient?
I think I should add a document to the index when it is submitted by a user, instead of indexing the full database again.
Do I need to create a new IndexWriter every time an article is submitted?
Is it efficient to open and close an IndexWriter very frequently?
You are right that you don't need to re-add every document to the index; you only need to add new ones, and the rest will remain in the index.
But then you do need to create a new IndexWriter every time. If you prefer, you can use a service or something that keeps an IndexWriter alive; the opening and closing does not take much time, though. If you do reuse an IndexWriter, make sure that you call indexWriter.commit() after each add.
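A minimal sketch of the "keep an IndexWriter alive" variant, reusing the question's classes and constants (the wrapper class name is made up for illustration) and the same Lucene 2.x-era API as the code above; the writer is opened once in append mode and commit() is called after every add so new articles become searchable:
public class NewsIndexHolder {
    // one writer shared by the whole application; IndexWriter is thread-safe
    private static IndexWriter writer;

    public static synchronized IndexWriter getWriter() throws IOException {
        if (writer == null) {
            // false = append to the existing index instead of recreating it
            writer = new IndexWriter(GlobalData.LUCENE_INDEX_STOREAGE, new StandardAnalyzer(), false);
        }
        return writer;
    }

    public static void indexSubmittedNews(News news) throws IOException {
        IndexWriter w = getWriter();
        LuceneIndexer.indexNews(news, w);
        w.commit();  // make the new document visible without closing the writer
    }
}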
Do I need to create a new IndexWriter every time an article is submitted?
No
Is it efficient to open and close an IndexWriter very frequently?
Definitely not! You should read the guidelines for indexing here.

Java Lucene IndexReader not working correctly

The story is this: I want to mimic the behavior of a relational database using a Lucene index in Java. I need to be able to do searching (reading) and writing at the same time.
For example, I want to save Project information into an index. For simplicity, let's say that the project has 2 fields - id and name. Now, before adding a new project to the index, I'm searching if a project with a given id is already present. For this I'm using an IndexSearcher. This operation completes with success (namely the IndexSearcher returns the internal doc id for the document that contains the project id I'm looking for).
Now I want to actually read the value of this project ID, so I'm using now an IndexReader to get the indexed Lucene document from which I can extract the project id field.
The problem is that the IndexReader returns a Document that has all of the fields NULL. So, to repeat: IndexSearcher works correctly, IndexReader returns bogus stuff.
I'm thinking that somehow this has to do with the fact that the document field data does not get saved to the hard disk when the IndexWriter is flushed. The thing is that the first time I do this indexing operation, IndexReader works fine. However, after a restart of my application, the above-mentioned situation happens. So I'm thinking that the first time around the data floats in RAM but doesn't get flushed correctly (or completely, since IndexSearcher works) to the hard drive.
Maybe it will help if I give you the source code, so here it is (you can safely ignore the tryGetIdFromMemory part, I'm using that as a speed optimization trick):
public class ProjectMetadataIndexer {
private File indexFolder;
private Directory directory;
private IndexSearcher indexSearcher;
private IndexReader indexReader;
private IndexWriter indexWriter;
private Version luceneVersion = Version.LUCENE_31;
private Map<String, Integer> inMemoryIdHolder;
private final int memoryCapacity = 10000;
public ProjectMetadataIndexer() throws IOException {
inMemoryIdHolder = new HashMap<String, Integer>();
indexFolder = new File(ConfigurationSingleton.getInstance()
.getProjectMetaIndexFolder());
directory = FSDirectory.open(indexFolder);
IndexWriterConfig config = new IndexWriterConfig(luceneVersion,
new WhitespaceAnalyzer(luceneVersion));
indexWriter = new IndexWriter(directory, config);
indexReader = IndexReader.open(indexWriter, false);
indexSearcher = new IndexSearcher(indexReader);
}
public int getProjectId(String projectName) throws IOException {
int fromMemoryId = tryGetProjectIdFromMemory(projectName);
if (fromMemoryId >= 0) {
return fromMemoryId;
} else {
int projectId;
Term projectNameTerm = new Term("projectName", projectName);
TermQuery projectNameQuery = new TermQuery(projectNameTerm);
BooleanQuery query = new BooleanQuery();
query.add(projectNameQuery, Occur.MUST);
TopDocs docs = indexSearcher.search(query, 1);
if (docs.totalHits == 0) {
projectId = IDStore.getInstance().getProjectId();
indexMeta(projectId, projectName);
} else {
int internalId = docs.scoreDocs[0].doc;
indexWriter.close();
indexReader.close();
indexSearcher.close();
indexReader = IndexReader.open(directory);
Document document = indexReader.document(internalId);
List<Fieldable> fields = document.getFields();
System.out.println(document.get("projectId"));
projectId = Integer.valueOf(document.get("projectId"));
}
storeInMemory(projectName, projectId);
return projectId;
}
}
private int tryGetProjectIdFromMemory(String projectName) {
String key = projectName;
Integer id = inMemoryIdHolder.get(key);
if (id == null) {
return -1;
} else {
return id.intValue();
}
}
private void storeInMemory(String projectName, int projectId) {
if (inMemoryIdHolder.size() > memoryCapacity) {
inMemoryIdHolder.clear();
}
String key = projectName;
inMemoryIdHolder.put(key, projectId);
}
private void indexMeta(int projectId, String projectName)
throws CorruptIndexException, IOException {
Document document = new Document();
Field idField = new Field("projectId", String.valueOf(projectId),
Store.NO, Index.ANALYZED);
document.add(idField);
Field nameField = new Field("projectName", projectName, Store.NO,
Index.ANALYZED);
document.add(nameField);
indexWriter.addDocument(document);
}
public void close() throws CorruptIndexException, IOException {
indexReader.close();
indexWriter.close();
}
}
To be more precise, all the problems occur in this if:
if (docs.totalHits == 0) {
projectId = IDStore.getInstance().getProjectId();
indexMeta(projectId, projectName);
} else {
int internalId = docs.scoreDocs[0].doc;
Document document = indexReader.document(internalId);
List<Fieldable> fields = document.getFields();
System.out.println(document.get("projectId"));
projectId = Integer.valueOf(document.get("projectId"));
}
On the else branch...
I don't know what is wrong.
Do you store the respective fields? If not, the field values are "only" stored in the inverted index part, i.e. the field value is mapped to the document, but the stored document itself doesn't contain the field value.
The part of the code where you save the document might be helpful.
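In the indexMeta method above, both fields are created with Store.NO, which is exactly this situation: the values are searchable but not retrievable. Here is a sketch of the method with stored fields (and, as a side suggestion, the values kept as single un-analyzed tokens, which makes exact TermQuery lookups on them more predictable):
private void indexMeta(int projectId, String projectName)
        throws CorruptIndexException, IOException {
    Document document = new Document();
    // Store.YES keeps the original value, so document.get("projectId") returns it
    Field idField = new Field("projectId", String.valueOf(projectId), Store.YES, Index.NOT_ANALYZED);
    document.add(idField);
    Field nameField = new Field("projectName", projectName, Store.YES, Index.NOT_ANALYZED);
    document.add(nameField);
    indexWriter.addDocument(document);
}
Existing documents keep their old (unstored) form, so the index has to be rebuilt for the change to take effect.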
I had a hard time figuring out how to index/search numbers and I just wanted to say the following snippets of code really helped me out:
projectId = Integer.valueOf(document.get("projectId"));
////////////
Field idField = new Field("projectId", String.valueOf(projectId),
Store.NO, Index.ANALYZED);
document.add(idField);
Thanks!

Lucene 3.0.3 does not delete document

We use Lucene to index some internal documents. Sometimes we need to remove documents. These documents have a unique id and are represented by a class DocItem as follows (ALL THE CODE IS A SIMPLIFIED VERSION WITH ONLY SIGNIFICANT (I hope) PARTS):
public final class DocItem {
public static final String fID = "id";
public static final String fTITLE = "title";
private Document doc = new Document();
private Field id = new Field(fID, "", Field.Store.YES, Field.Index.ANALYZED);
private Field title = new Field(fTITLE, "", Field.Store.YES, Field.Index.ANALYZED);
public DocItem() {
doc.add(id);
doc.add(title);
}
... getters & setters
public Document getDoc() {
return doc;
}
}
So, to index a document, a new DocItem is created and passed to an indexer class as follows:
public static void index(DocItem docitem) {
File file = new File("indexdir");
Directory dir= new SimpleFSDirectory(file);
IndexWriter idxWriter = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_30), IndexWriter.MaxFieldLength.UNLIMITED);
idxWriter.addDocument(docitem.getDoc());
idxWriter.close();
}
We created an auxiliary method to iterate over the index directory:
public static void listAll() {
File file = new File("indexdir");
Directory dir = new SimpleFSDirectory(file);
IndexReader reader = IndexReader.open(dir);
for (int i = 0; i < reader.maxDoc(); i++) {
Document doc = reader.document(i);
System.out.println(doc.get(DocItem.fID));
}
}
Running listAll(), we can see that our docs are being indexed properly. At least, we can see the id and other attributes.
We retrieve the document using IndexSearcher as follows:
public static DocItem search(String id) {
File file = new File("indexdir");
Directory dir = new SimpleFSDirectory(file);
IndexSearcher searcher = new IndexSearcher(dir, true);
Query q = new QueryParser(Version.LUCENE_30, DocItem.fID, new StandardAnalyzer(Version.LUCENE_30)).parse(id);
TopDocs td = searcher.search(q, 1);
ScoreDoc[] hits = td.scoreDocs;
searcher.close();
return hits[0];
}
So after retrieving it, we are trying to delete it with:
public static void Delete(DocItem docitem) {
File file = new File("indexdir");
Directory dir= new SimpleFSDirectory(file);
IndexWriter idxWriter = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_30), IndexWriter.MaxFieldLength.UNLIMITED);
idxWriter.deleteDocuments(new Term(DocItem.fID, docitem.getId()));
idxWriter.commit();
idxWriter.close();
}
The problem is that it doesn't work. The document is never deleted. If I run listAll() after the deletion, the document is still there. We tried to use IndexReader, with no luck.
Based on this post and this post, we think that we are using it correctly.
What are we doing wrong? Any advice? We are using Lucene 3.0.3 and Java 1.6.0_24.
TIA,
Bob
I would suggest using IndexReader.deleteDocuments; it returns the number of documents deleted. This will help you narrow down whether the deletions happen in the first place.
The advantage of this over the IndexWriter method is that it returns the total number of documents deleted; if none, it returns 0.
Also see How do I delete documents from the index? and this post.
Edit: Also, I noticed you open the IndexReader in read-only mode. Can you change the listAll() IndexReader.open call to pass false as the second param? This will allow read-write access; perhaps that is the source of the error.
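A sketch of that suggestion against the Lucene 3.0 API (the reader has to be opened with readOnly=false, otherwise deletes are not allowed):
public static int delete(DocItem docitem) throws IOException {
    Directory dir = new SimpleFSDirectory(new File("indexdir"));
    IndexReader reader = IndexReader.open(dir, false);  // false = writable reader
    try {
        // returns how many documents matched the term and were marked as deleted
        return reader.deleteDocuments(new Term(DocItem.fID, docitem.getId()));
    } finally {
        reader.close();  // closing the reader commits the pending deletes
    }
}
If this returns 0, the term simply never matches; note that the id field above is indexed with Field.Index.ANALYZED through StandardAnalyzer, so the token stored in the index may differ from the raw id passed to new Term(...), and indexing the id as NOT_ANALYZED may be what actually makes the deletes match.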
I call IndexWriterConfig#setMaxBufferedDeleteTerms(1) during IndexWriter instantiation/configuration and all delete operations go to disk immediately. Maybe it's not correct design-wise, but it solves the problem explained here.
public static void Delete(DocItem docitem) {
File file = new File("indexdir");
Directory dir= new SimpleFSDirectory(file);
IndexWriter idxWriter = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_30), IndexWriter.MaxFieldLength.UNLIMITED);
idxWriter.deleteDocuments(new Term(DocItem.fID, docitem.getId()));
idxWriter.commit();
idxWriter.close();
}
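For completeness, the setting this answer refers to, applied to the 3.0.x API used above: IndexWriterConfig only exists from Lucene 3.1 on, so on 3.0.3 the equivalent setter lives on the IndexWriter itself.
File file = new File("indexdir");
Directory dir = new SimpleFSDirectory(file);
IndexWriter idxWriter = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_30), IndexWriter.MaxFieldLength.UNLIMITED);
// flush buffered delete terms to the index after every single delete
idxWriter.setMaxBufferedDeleteTerms(1);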
