We use Lucene to index some internal documents. Sometimes we need to remove documents. These documents have a unique id and are represented by a class DocItem as follows (all the code is a simplified version with only the significant (I hope) parts):
public final class DocItem {
public static final String fID = "id";
public static final String fTITLE = "title";
private Document doc = new Document();
private Field id = new Field(fID, "", Field.Store.YES, Field.Index.ANALYZED);
private Field title = new Field(fTITLE, "", Field.Store.YES, Field.Index.ANALYZED);
public DocItem() {
doc.add(id);
doc.add(title);
}
... getters & setters
public Document getDoc() {
return doc;
}
}
So, to index a document, a new DocItem is created and passed to an indexer class as follows:
public static void index(DocItem docitem) {
File file = new File("indexdir");
Directory dir= new SimpleFSDirectory(file);
IndexWriter idxWriter = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_30), IndexWriter.MaxFieldLength.UNLIMITED);
idxWriter.addDocument(docitem.getDoc());
idxWriter.close();
}
We created an auxiliary method to iterate over the index directory:
public static void listAll() {
File file = new File("indexdir");
Directory dir = new SimpleFSDirectory(file);
IndexReader reader = IndexReader.open(dir);
for (int i = 0; i < reader.maxDoc(); i++) {
Document doc = reader.document(i);
System.out.println(doc.get(DocItem.fID));
}
reader.close();
}
Running listAll(), we can see that our docs are being indexed properly; at least, we can see the id and the other attributes.
We retrieve the document using IndexSearcher as follows:
public static DocItem search(String id) {
File file = new File("indexdir");
Directory dir = new SimpleFSDirectory(file);
IndexSearcher searcher = new IndexSearcher(dir, true);
Query q = new QueryParser(Version.LUCENE_30, DocItem.fID, new StandardAnalyzer(Version.LUCENE_30)).parse(id);
TopDocs td = searcher.search(q, 1);
ScoreDoc[] hits = td.scoreDocs;
searcher.close();
return hits[0]; // simplified: converting this hit back into a DocItem is omitted here
}
So after retrieving it, we are trying to delete it with:
public static void Delete(DocItem docitem) {
File file = new File("indexdir");
Directory dir= new SimpleFSDirectory(file);
IndexWriter idxWriter = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_30), IndexWriter.MaxFieldLength.UNLIMITED);
idxWriter.deleteDocuments(new Term(DocItem.fID, docitem.getId()));
idxWriter.commit();
idxWriter.close();
}
The problem is that it doesn't work. The document is never deleted. If I run listAll() after the deletion, the document is still there. We tried to use IndexReader as well, with no luck.
Based on this post and this post, we think that we are using it correctly.
What are we doing wrong? Any advice? We are using Lucene 3.0.3 and Java 1.6.0_24.
TIA,
Bob
I would suggest using IndexReader's deleteDocuments: it returns the number of documents deleted, which will help you narrow down whether the deletions are happening at all.
The advantage of this over the IndexWriter method is that it returns the total number of documents deleted; if none were deleted, it returns 0.
Also see How do I delete documents from the index? and this post.
Edit: I also noticed you open the IndexReader in read-only mode. Can you change the IndexReader.open call in listAll() to pass false as the second parameter? That allows read-write access, and may be the source of the error.
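For illustration, here is a minimal sketch of a reader-based delete, assuming the same "indexdir" directory and DocItem class from the question (deleteById is just an example name):
import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.SimpleFSDirectory;
public static int deleteById(String id) throws Exception {
File file = new File("indexdir");
Directory dir = new SimpleFSDirectory(file);
// second argument false = open for read-write, which deletions require
IndexReader reader = IndexReader.open(dir, false);
// returns how many documents matched the term and were marked deleted
int deleted = reader.deleteDocuments(new Term(DocItem.fID, id));
reader.close(); // closing the reader commits the deletions
return deleted; // 0 means the term never matched anything
}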
I call IndexWriterConfig#setMaxBufferedDeleteTerms(1) during IndexWriter instantiation/configuration and all delete operations go to disk immediately. Maybe it's not correct design-wise, but it solves the problem explained here.
public static void Delete(DocItem docitem) {
File file = new File("indexdir");
Directory dir= new SimpleFSDirectory(file);
IndexWriter idxWriter = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_30), IndexWriter.MaxFieldLength.UNLIMITED);
idxWriter.deleteDocuments(new Term(DocItem.fID, docitem.getId()));
idxWriter.commit();
idxWriter.close();
}
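For reference, here is a minimal sketch of what that configuration could look like applied to the Delete method above. Note that IndexWriterConfig only exists in Lucene 3.1 and later (the question targets 3.0.3), so the version and analyzer choices below are assumptions:
import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.SimpleFSDirectory;
import org.apache.lucene.util.Version;
public static void delete(DocItem docitem) throws Exception {
Directory dir = new SimpleFSDirectory(new File("indexdir"));
// IndexWriterConfig requires Lucene 3.1 or later; the 3.0.x constructors do not expose this setting
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_31, new StandardAnalyzer(Version.LUCENE_31));
config.setMaxBufferedDeleteTerms(1); // flush buffered delete terms immediately
IndexWriter idxWriter = new IndexWriter(dir, config);
idxWriter.deleteDocuments(new Term(DocItem.fID, docitem.getId()));
idxWriter.commit();
idxWriter.close();
}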
I'm using Nutch to crawl some websites (as a process that runs separately from everything else), and I want to use a Java (Scala) program to analyse the websites' HTML data using Jsoup.
I got Nutch to work by following the tutorial (without the script, only executing the individual instructions worked), and I think it's saving the websites' HTML in the crawl/segments/<time>/content/part-00000 directory.
The problem is that I cannot figure out how to actually read the website data (URLs and HTML) in a Java/Scala program. I read this document, but find it a bit overwhelming since I've never used Hadoop.
I tried to adapt the example code to my environment, and this is what I arrived at (mostly by guesswork):
val reader = new MapFile.Reader(FileSystem.getLocal(new Configuration()), ".../apache-nutch-1.8/crawl/segments/20140711115438/content/part-00000", new Configuration())
var key = null
var value = null
reader.next(key, value) // test for a single value
println(key)
println(value)
However, I am getting this exception when I run it:
Exception in thread "main" java.lang.NullPointerException
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1873)
at org.apache.hadoop.io.MapFile$Reader.next(MapFile.java:517)
I am not sure how to work with a MapFile.Reader, specifically, what constructor parameters I am supposed to pass to it. What Configuration objects am I supposed to pass in? Is that the correct FileSystem? And is that the data file I'm interested in?
Scala:
val conf = NutchConfiguration.create()
val fs = FileSystem.get(conf)
val file = new Path(".../part-00000/data")
val reader = new SequenceFile.Reader(fs, file, conf)
val webdata = Stream.continually {
val key = new Text()
val content = new Content()
reader.next(key, content)
(key, content)
}
println(webdata.head)
Java:
public class ContentReader {
public static void main(String[] args) throws IOException {
Configuration conf = NutchConfiguration.create();
Options opts = new Options();
GenericOptionsParser parser = new GenericOptionsParser(conf, opts, args);
String[] remainingArgs = parser.getRemainingArgs();
FileSystem fs = FileSystem.get(conf);
String segment = remainingArgs[0];
Path file = new Path(segment, Content.DIR_NAME + "/part-00000/data");
SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
Text key = new Text();
Content content = new Content();
// Loop through sequence files
while (reader.next(key, content)) {
try {
System.out.write(content.getContent(), 0,
content.getContent().length);
} catch (Exception e) {
}
}
}
}
Alternatively, you can use org.apache.nutch.segment.SegmentReader (example).
I am using Apache Lucene to index HTML files, and I am storing the path of each HTML file in the Lucene index. The index is being written; I have checked it in Luke.
But when I search for a file's path, the number of documents returned is far too high. I want it to match the exact path as it was stored in the Lucene index.
I am using the following code for index creation:
try{
File indexDir = new File("d:/abc/");
IndexWriter indexWriter = new IndexWriter(
FSDirectory.open(indexDir),
new SimpleAnalyzer(),
true,
IndexWriter.MaxFieldLength.LIMITED);
indexWriter.setUseCompoundFile(false);
Document doc= new Document();
String path = f.getCanonicalPath(); // f is the HTML file being indexed
doc.add(new Field("fpath",path,
Field.Store.YES,Field.Index.ANALYZED));
indexWriter.addDocument(doc);
indexWriter.optimize();
indexWriter.close();
}
catch(Exception ex )
{
ex.printStackTrace();
}
The following is the code for searching the file path:
File indexDir = new File("d:/abc/");
int maxhits = 10000000;
int len = 0;
try {
Directory directory = FSDirectory.open(indexDir);
IndexSearcher searcher = new IndexSearcher(directory, true);
QueryParser parser = new QueryParser(Version.LUCENE_36,"fpath", new SimpleAnalyzer());
Query query = parser.parse(path);
query.setBoost((float) 1.5);
TopDocs topDocs = searcher.search(query, maxhits);
ScoreDoc[] hits = topDocs.scoreDocs;
len = hits.length;
JOptionPane.showMessageDialog(null,"items found"+len);
}
catch(Exception ex)
{
ex.printStackTrace();
}
It shows the number of documents found as the total number of documents in the index, even though the searched path exists in only one document.
You are analyzing the path, which will split it into separate terms. The root path term (like catalog in /catalog/products/versions) likely occurs in all documents, so any search that includes catalog without forcing all terms to be mandatory will return all documents.
You need a search query like (using the example above):
+catalog +products +versions
to force all terms to be present.
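With the QueryParser already used in the question, one way to get that behaviour is to make AND the default operator. A minimal sketch, assuming the same fpath field and path variable as in the question's search code (buildPathQuery is just an example name):
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;
static Query buildPathQuery(String path) throws ParseException {
QueryParser parser = new QueryParser(Version.LUCENE_36, "fpath", new SimpleAnalyzer(Version.LUCENE_36));
// every parsed term becomes mandatory, equivalent to writing +catalog +products +versions by hand
parser.setDefaultOperator(QueryParser.AND_OPERATOR);
return parser.parse(path);
}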
Note that this gets more complicated if the same set of terms can occur in different orders, like:
/catalog/products/versions
/versions/catalog/products/SKUs
In that case, you need to use a different Lucene tokenizer than the tokenizer in the Standard Analyzer.
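If what you actually need is an exact match on the whole path, another option (an alternative to the query-based approach above, based on my reading of the requirement) is to index the path un-analyzed and look it up with a TermQuery. A sketch reusing the doc and path variables from the question's code:
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
// index time: keep the whole path as a single un-analyzed token
doc.add(new Field("fpath", path, Field.Store.YES, Field.Index.NOT_ANALYZED));
// search time: look up exactly the stored value, no analyzer involved
Query exact = new TermQuery(new Term("fpath", path));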
I have the following XML file:
<?xml version="1.0" encoding="UTF-8"?>
<BatchOrders xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<BatchHeader>
<ServiceProvider>123456789</ServiceProvider>
</BatchHeader>
<OrderDetails>
<MessageType>HelloWorld</MessageType>
<IssueDateTime>22/01/2012 00:00:00</IssueDateTime>
<receivedDateTime>22/01/2012 00:00:00</receivedDateTime>
<Status>TestStatus</Status>
</OrderDetails>
</BatchOrders>
I want to read in the contents and set them to fields I have created. So I have the following code below (note: some of it is omitted; I have included only what I think I need to show). The code below is in a test class I created; I also have a writer as part of this class that writes an XML file to disk fine, as I expect. The problem I am facing is reading the file above and displaying the contents to the console, just for now.
File myFileRead = null;
FileReader myFileReader = null;
try {
myFileRead = new File("C:/Path/myfile.xml");
myRecord record = new myRecord();
myFileReader = new FileReader(myFileRead);
myXPathReader reader = new myXPathReader(myFileReader);
while (reader.hasNext())
{
record = reader.next();
// prints out to console
}
So, from the above, I have the myRecord class with the getters/setters for e.g. ServiceProvider, etc. I also have a class myXPathReader which does the following:
private Document document;
private List batchorders;
private Iterator iterator;
public myXPathReader (Reader myFileReader)
throws Exception
{
SAXBuilder builder = new SAXBuilder();
document = builder.build(myFileReader);
batchorders = new JDOMXPath("//BatchOrders").selectNodes(document);
iterator = batchorders.iterator();
}
public int getSize() { return batchorders.size(); }
public boolean hasNext() { return iterator.hasNext(); }
public myRecord next()
throws Exception {
Element element = (Element) iterator.next();
myRecord record = new myRecord();
record.setServiceProvider((new JDOMXPath("./ServiceProvider").stringValueOf(element)));
// some more setters, end of class, etc.
Now, if I debug the code, after the element is returned by iterator.next() I can see the file contents have been read in correctly. But on my console the ServiceProvider value, and in fact all the values, are getting set to the empty string "". Am I doing something incorrect with JDOMXPath when pulling the values from the XML?
In your example XML, ServiceProvider is not a child of BatchOrders; there's another level (BatchHeader) in between. So your second XPath expression should probably be
BatchHeader/ServiceProvider
instead of ./ServiceProvider
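Applied to the next() method from the question, the change would look roughly like this (a sketch reusing the element and record variables shown there):
import org.jaxen.jdom.JDOMXPath;
// go down through BatchHeader instead of expecting ServiceProvider directly under BatchOrders
record.setServiceProvider(new JDOMXPath("BatchHeader/ServiceProvider").stringValueOf(element));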
I am trying to set up Lucene to process some documents stored in the database. I started with this HelloWorld sample. However, the index that is created is not persisted anywhere and needs to be re-created each time the program is run. Is there a way to save the index that Lucene creates so that the documents do not need to be loaded into it each time the program starts up?
public class HelloLucene {
public static void main(String[] args) throws IOException, ParseException {
// 0. Specify the analyzer for tokenizing text.
// The same analyzer should be used for indexing and searching
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_35);
// 1. create the index
Directory index = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_35, analyzer);
IndexWriter w = new IndexWriter(index, config);
addDoc(w, "Lucene in Action");
addDoc(w, "Lucene for Dummies");
addDoc(w, "Managing Gigabytes");
addDoc(w, "The Art of Computer Science");
w.close();
// 2. query
String querystr = args.length > 0 ? args[0] : "lucene";
// the "title" arg specifies the default field to use
// when no field is explicitly specified in the query.
Query q = new QueryParser(Version.LUCENE_35, "title", analyzer).parse(querystr);
// 3. search
int hitsPerPage = 10;
IndexReader reader = IndexReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
// 4. display results
System.out.println("Found " + hits.length + " hits.");
for(int i=0;i<hits.length;++i) {
int docId = hits[i].doc;
Document d = searcher.doc(docId);
System.out.println((i + 1) + ". " + d.get("title"));
}
// searcher can only be closed when there
// is no need to access the documents any more.
searcher.close();
}
private static void addDoc(IndexWriter w, String value) throws IOException {
Document doc = new Document();
doc.add(new Field("title", value, Field.Store.YES, Field.Index.ANALYZED));
w.addDocument(doc);
}
}
You're creating the index in RAM:
Directory index = new RAMDirectory();
http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/store/RAMDirectory.html
IIRC, you just need to switch that to one of the filesystem based Directory implementations.
http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/store/Directory.html
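For example, a minimal change to the HelloLucene code above, assuming an arbitrary on-disk location ("lucene-index" is just an example path):
import java.io.File;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
// persist the index on disk instead of in RAM, so it survives program restarts
Directory index = FSDirectory.open(new File("lucene-index"));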
If you want to keep using RAMDirectory during searching (due to performance benefits) but don't want the index to be built from scratch every time, you can first create your index using a file-system-based directory like NIOFSDirectory (don't use it if you're on Windows). Then, come search time, open a copy of the original directory using the constructor RAMDirectory(Directory dir).
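A sketch of that combination (again, "lucene-index" is just an example path):
import java.io.File;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;
// the index is built and updated on disk once...
Directory diskIndex = FSDirectory.open(new File("lucene-index"));
// ...and loaded into RAM as a copy when it is time to search
Directory ramCopy = new RAMDirectory(diskIndex);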
The story is this: I want to mimic the behavior of a relational database using a Lucene index in Java. I need to be able to do searching (reading) and writing at the same time.
For example, I want to save Project information into an index. For simplicity, let's say that the project has 2 fields - id and name. Now, before adding a new project to the index, I'm searching if a project with a given id is already present. For this I'm using an IndexSearcher. This operation completes with success (namely the IndexSearcher returns the internal doc id for the document that contains the project id I'm looking for).
Now I want to actually read the value of this project ID, so I'm using now an IndexReader to get the indexed Lucene document from which I can extract the project id field.
The problem is that the IndexReader returns a Document that has all of the fields null. So, to repeat: IndexSearcher works correctly, but IndexReader returns bogus data.
I'm thinking that somehow this has to do with the document field data not being saved to the hard disk when the IndexWriter is flushed. The thing is that the first time I do this indexing operation, IndexReader works fine. However, after a restart of my application, the situation described above happens. So I'm thinking that the first time around the data sits in RAM but doesn't get flushed correctly (or completely, since IndexSearcher works) to the hard drive.
Maybe it will help if I give you the source code, so here it is (you can safely ignore the tryGetIdFromMemory part, I'm using that as a speed optimization trick):
public class ProjectMetadataIndexer {
private File indexFolder;
private Directory directory;
private IndexSearcher indexSearcher;
private IndexReader indexReader;
private IndexWriter indexWriter;
private Version luceneVersion = Version.LUCENE_31;
private Map<String, Integer> inMemoryIdHolder;
private final int memoryCapacity = 10000;
public ProjectMetadataIndexer() throws IOException {
inMemoryIdHolder = new HashMap<String, Integer>();
indexFolder = new File(ConfigurationSingleton.getInstance()
.getProjectMetaIndexFolder());
directory = FSDirectory.open(indexFolder);
IndexWriterConfig config = new IndexWriterConfig(luceneVersion,
new WhitespaceAnalyzer(luceneVersion));
indexWriter = new IndexWriter(directory, config);
indexReader = IndexReader.open(indexWriter, false);
indexSearcher = new IndexSearcher(indexReader);
}
public int getProjectId(String projectName) throws IOException {
int fromMemoryId = tryGetProjectIdFromMemory(projectName);
if (fromMemoryId >= 0) {
return fromMemoryId;
} else {
int projectId;
Term projectNameTerm = new Term("projectName", projectName);
TermQuery projectNameQuery = new TermQuery(projectNameTerm);
BooleanQuery query = new BooleanQuery();
query.add(projectNameQuery, Occur.MUST);
TopDocs docs = indexSearcher.search(query, 1);
if (docs.totalHits == 0) {
projectId = IDStore.getInstance().getProjectId();
indexMeta(projectId, projectName);
} else {
int internalId = docs.scoreDocs[0].doc;
indexWriter.close();
indexReader.close();
indexSearcher.close();
indexReader = IndexReader.open(directory);
Document document = indexReader.document(internalId);
List<Fieldable> fields = document.getFields();
System.out.println(document.get("projectId"));
projectId = Integer.valueOf(document.get("projectId"));
}
storeInMemory(projectName, projectId);
return projectId;
}
}
private int tryGetProjectIdFromMemory(String projectName) {
String key = projectName;
Integer id = inMemoryIdHolder.get(key);
if (id == null) {
return -1;
} else {
return id.intValue();
}
}
private void storeInMemory(String projectName, int projectId) {
if (inMemoryIdHolder.size() > memoryCapacity) {
inMemoryIdHolder.clear();
}
String key = projectName;
inMemoryIdHolder.put(key, projectId);
}
private void indexMeta(int projectId, String projectName)
throws CorruptIndexException, IOException {
Document document = new Document();
Field idField = new Field("projectId", String.valueOf(projectId),
Store.NO, Index.ANALYZED);
document.add(idField);
Field nameField = new Field("projectName", projectName, Store.NO,
Index.ANALYZED);
document.add(nameField);
indexWriter.addDocument(document);
}
public void close() throws CorruptIndexException, IOException {
indexReader.close();
indexWriter.close();
}
}
To be more precise, all the problems occur in this if:
if (docs.totalHits == 0) {
projectId = IDStore.getInstance().getProjectId();
indexMeta(projectId, projectName);
} else {
int internalId = docs.scoreDocs[0].doc;
Document document = indexReader.document(internalId);
List<Fieldable> fields = document.getFields();
System.out.println(document.get("projectId"));
projectId = Integer.valueOf(document.get("projectId"));
}
On the else branch...
I don't know what is wrong.
Do you store the respective fields? If not, the field values are "only" kept in the inverted index part, i.e. the field value is mapped to the document, but the stored document itself doesn't contain the field value.
The part of the code where you save the document might be helpful.
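For illustration, this is roughly what indexMeta would look like with stored fields, so that indexReader.document(...).get("projectId") can return the value. It is a sketch based on the question's code, reusing its indexWriter field and keeping its other field choices unchanged:
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
private void indexMeta(int projectId, String projectName) throws IOException {
Document document = new Document();
// Store.YES keeps the original value in the index so it can be read back from the Document
document.add(new Field("projectId", String.valueOf(projectId), Store.YES, Index.ANALYZED));
document.add(new Field("projectName", projectName, Store.YES, Index.ANALYZED));
indexWriter.addDocument(document);
}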
I had a hard time figuring out how to index/search numbers and I just wanted to say the following snippets of code really helped me out:
projectId = Integer.valueOf(document.get("projectId"));
////////////
Field idField = new Field("projectId", String.valueOf(projectId),
Store.NO, Index.ANALYZED);
document.add(idField);
Thanks!