I am trying to index text and Word files and then search for content in them. Searching for a specific string works fine, but when I try to search with a regular expression it no longer works. Below is the relevant code.
The index function:
// FileBean is the class that holds the file path,
// file content, and lastModified information
public void indexDoc(IndexWriter writer, FileBean t) throws Exception {
    Document doc = new Document();
    System.out.println(t.getPath());
    doc.add(new StringField(LuceneConstants.PATH, t.getPath(), Field.Store.YES));
    doc.add(new LongPoint(LuceneConstants.MODIFIED, t.getModified()));
    doc.add(new TextField(LuceneConstants.CONTENT, t.getContent(), Field.Store.NO));
    if (writer.getConfig().getOpenMode() == IndexWriterConfig.OpenMode.CREATE) {
        writer.addDocument(doc);
    } else {
        writer.updateDocument(new Term(LuceneConstants.PATH, t.getPath()), doc);
    }
}
I am using QueryParser to build the query; the query is a RegexpQuery such as '\d{16}' to match a number.
The search function:
public static TopDocs getResults(IndexSearcher searcher, Query query) throws IOException {
    TopDocs docs = searcher.search(query, 10);
    return docs;
}
TopDocs's totalHits is 0, which is not what I expect. It seems that no file matches, even though the content should satisfy the given regular expression.
I tried googling it but still have not found a valid solution. Can anyone suggest why totalHits is 0? Thanks.
Try taking away the '+', so it would be '\d{16}'.
OMG, I finally found the fix, though I don't know the deep reason: if I use '[0-9]' instead of '\d', it works!
If anyone could explain this, it would be wonderful!
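A likely explanation: Lucene does not use java.util.regex for these queries. RegexpQuery (and the /.../ regexp syntax in the classic QueryParser) is compiled by org.apache.lucene.util.automaton.RegExp, a dialect that, at least in many versions, has no Perl-style character classes; a backslash escape simply matches the escaped character literally, so '\d' looks for a literal 'd' while '[0-9]' behaves as expected. As a minimal sketch (reusing the question's LuceneConstants field name), the query can also be built directly:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.RegexpQuery;

class SixteenDigitQuery {
    // [0-9]{16} is valid in Lucene's automaton-based regexp dialect; \d is not.
    static Query build() {
        return new RegexpQuery(new Term(LuceneConstants.CONTENT, "[0-9]{16}"));
    }
}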
I am working on a task where I need to search text using Lucene, but the requirement is to use the segment files (.si, .cfe, and .cfs) already created by another application.
I am able to get those files, but when searching the text it shows me no results.
The code for the search is:
public void searchText(String indexPath, String searchString) {
    try {
        Analyzer analyzer = new StandardAnalyzer();
        File indexDirectory = new File(indexPath);
        Directory directory = FSDirectory.open(indexDirectory.toPath());
        IndexReader directoryReader = DirectoryReader.open(directory);
        IndexSearcher searcher = new IndexSearcher(directoryReader);
        QueryParser parser = new QueryParser("requiredtext", analyzer);
        Query query = parser.parse(searchString);
        System.out.println(query);
        // Run the parsed query and take the top 10 hits:
        ScoreDoc[] hits = searcher.search(query, 10).scoreDocs;
        // Iterate through the results:
        for (int i = 0; i < hits.length; i++) {
            Document hitDoc = searcher.doc(hits[i].doc);
        }
        analyzer.close();
        directoryReader.close();
        directory.close();
    } catch (Exception ex) {
        System.out.println("Exception - " + ex.getMessage());
    }
}
I am using Lucene 8.11.1 with Java 8.
The question is: is it possible in Lucene to read/find/search text in index files written by another application? If it is, then please provide pointers on how.
I found the issue and fixed it.
I was looking for data in the field "requiredtext", but the indexer didn't store the data for that field: while indexing, the property TextField.Store.YES was not set for it, and that is why I got no data back for the field I was looking for.
I did get the data for another field for which the property was set.
And to my question, whether it is possible to search data in files created by another application: the answer is yes. #andrewJames's answer helps to prove it.
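For illustration, a minimal sketch of the stored-vs-indexed distinction (the first field name is from the question, the second is invented): a field must be indexed to be searchable, but its value only comes back from searcher.doc(...) if it was stored with Field.Store.YES.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;

class StoredFieldExample {
    static Document buildDoc(String value) {
        Document doc = new Document();
        // Indexed and stored: searchable, and the value is returned with hits.
        doc.add(new TextField("requiredtext", value, Field.Store.YES));
        // Indexed but not stored: still searchable, but doc.get("bodyonly")
        // on a search hit returns null.
        doc.add(new TextField("bodyonly", value, Field.Store.NO));
        return doc;
    }
}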
I have a few Word templates, and my requirement is to replace some of the words/placeholders in the document based on user input, using Java. I have tried a lot of libraries, including 2-3 versions of docx4j, but nothing worked well; they all just didn't do anything!
I know this question has been asked before, but I have tried all the options I know. So, with which Java library can I "really" replace/edit these templates? My preference goes to the "easy to use / few lines of code" type of library.
I am using Java 8 and my MS Word templates are in MS Word 2007.
Update
This code is based on the code sample provided by SO member Joop Eggen:
public Main() throws URISyntaxException, IOException, ParserConfigurationException, SAXException {
    URI docxUri = new URI("C:/Users/Yohan/Desktop/yohan.docx");
    Map<String, String> zipProperties = new HashMap<>();
    zipProperties.put("encoding", "UTF-8");
    FileSystem zipFS = FileSystems.newFileSystem(docxUri, zipProperties);
    Path documentXmlPath = zipFS.getPath("/word/document.xml");
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true);
    DocumentBuilder builder = factory.newDocumentBuilder();
    Document doc = builder.parse(Files.newInputStream(documentXmlPath));
    byte[] content = Files.readAllBytes(documentXmlPath);
    String xml = new String(content, StandardCharsets.UTF_8);
    //xml = xml.replace("#DATE#", "2014-09-24");
    xml = xml.replace("#NAME#", StringEscapeUtils.escapeXml("Sniper"));
    content = xml.getBytes(StandardCharsets.UTF_8);
    Files.write(documentXmlPath, content);
}
However, this throws the error below:
java.nio.file.ProviderNotFoundException: Provider "C" not found
    at java.nio.file.FileSystems.newFileSystem(FileSystems.java:341)
    at java.nio.file.FileSystems.newFileSystem(FileSystems.java:276)
For a docx (a zip with XML and other files) one may use a Java zip file system together with XML or text processing.
URI docxUri = ... // "jar:file:/C:/... .docx"
Map<String, String> zipProperties = new HashMap<>();
zipProperties.put("encoding", "UTF-8");
try (FileSystem zipFS = FileSystems.newFileSystem(docxUri, zipProperties)) {
Path documentXmlPath = zipFS.getPath("/word/document.xml");
When using XML:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse(Files.newInputStream(documentXmlPath));
//Element root = doc.getDocumentElement();
You can then use XPath to find the places, and write the XML back again.
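For example, a sketch of the XPath step (using local-name() to avoid wiring up the w: namespace; doc is the Document parsed above, and the #NAME# placeholder is illustrative):

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;

XPath xpath = XPathFactory.newInstance().newXPath();
// All w:t elements hold the document's visible text.
NodeList textNodes = (NodeList) xpath.evaluate(
        "//*[local-name()='t']", doc, XPathConstants.NODESET);
for (int i = 0; i < textNodes.getLength(); i++) {
    String text = textNodes.item(i).getTextContent();
    textNodes.item(i).setTextContent(text.replace("#NAME#", "Sniper"));
}
// Afterwards, write the DOM back with a javax.xml.transform.Transformer.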
It might even be that you do not need XML parsing and can simply replace placeholders:
byte[] content = Files.readAllBytes(documentXmlPath);
String xml = new String(content, StandardCharsets.UTF_8);
xml = xml.replace("#DATE#", "2014-09-24");
xml = xml.replace("#NAME#", StringEscapeUtils.escapeXml("Sniper"));
...
content = xml.getBytes(StandardCharsets.UTF_8);
Files.delete(documentXmlPath);
Files.write(documentXmlPath, content);
For fast development, rename a copy of the .docx to a name with the .zip file extension, and inspect the files.
Files.write should already apply StandardOpenOption.TRUNCATE_EXISTING, but I have added Files.delete because some error occurred. See comments.
Try Apache POI. POI can work with both doc and docx, but docx is better documented and therefore better supported.
UPD: You can use XDocReport, which uses POI. I also recommend using xlsx for templates because it is more suitable and better documented.
I spent a few days on this issue, until I found that what makes the difference is the try-with-resources on the FileSystem instance, which appears in Joop Eggen's snippet but not in the question's snippet:
try (FileSystem zipFS = FileSystems.newFileSystem(docxUri, zipProperties))
Without such a try-with-resources block, the FileSystem resource is not closed (as explained in the Java tutorial), and the Word document is not modified.
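Putting both fixes together, a sketch of the question's code with the jar:file: URI scheme (which the ProviderNotFoundException above points to) and the try-with-resources block; the path and placeholder are illustrative:

import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

public class DocxReplace {
    public static void main(String[] args) throws Exception {
        // The zip file system provider is selected by the "jar:" scheme;
        // a plain "C:/..." URI fails with ProviderNotFoundException: Provider "C".
        URI docxUri = URI.create("jar:file:///C:/Users/Yohan/Desktop/yohan.docx");
        Map<String, String> zipProperties = new HashMap<>();
        zipProperties.put("encoding", "UTF-8");
        // Closing the zip file system is what writes the changes back to the docx.
        try (FileSystem zipFS = FileSystems.newFileSystem(docxUri, zipProperties)) {
            Path documentXmlPath = zipFS.getPath("/word/document.xml");
            String xml = new String(Files.readAllBytes(documentXmlPath), StandardCharsets.UTF_8);
            xml = xml.replace("#NAME#", "Sniper");
            Files.write(documentXmlPath, xml.getBytes(StandardCharsets.UTF_8));
        }
    }
}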
Stepping back a bit, there are about 4 different approaches for editing words/placeholders:
MERGEFIELD or DOCPROPERTY fields (if you are having problems with this in docx4j, then you have probably not set up your input docx correctly)
content control databinding
variable replacement on the document surface (either at the DOM/SAX level, or using a library)
do stuff as XHTML, then import that
Before choosing one, you should decide whether you also need to be able to handle:
repeating data (eg adding table rows)
conditional content (eg entire paragraphs which will either be present or absent)
adding images
If you need these, then MERGEFIELD or DOCPROPERTY fields are probably out (though you can also use IF fields, if you can find a library which supports them). And adding images makes DOM/SAX manipulation, as advocated in one of the other answers, messier and more error prone.
The other things to consider are:
your authors: how technical are they? What does that imply for the authoring UI?
the "user input" you mention for variable replacement, is this given, or is obtaining it part of the problem you are solving?
Please try this to edit or replace words in the document:
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;

public class UpdateDocument {
    public static void main(String[] args) throws IOException {
        UpdateDocument obj = new UpdateDocument();
        obj.updateDocument(
                "c:\\test\\template.docx",
                "c:\\test\\output.docx",
                "Piyush");
    }

    private void updateDocument(String input, String output, String name)
            throws IOException {
        try (XWPFDocument doc = new XWPFDocument(
                Files.newInputStream(Paths.get(input)))
        ) {
            List<XWPFParagraph> xwpfParagraphList = doc.getParagraphs();
            // Iterate over the paragraphs and check each run for replaceable text
            for (XWPFParagraph xwpfParagraph : xwpfParagraphList) {
                for (XWPFRun xwpfRun : xwpfParagraph.getRuns()) {
                    String docText = xwpfRun.getText(0);
                    if (docText == null) {
                        continue; // a run may hold no text (e.g. only a break)
                    }
                    // replace and write back at the same position
                    docText = docText.replace("${name}", name);
                    xwpfRun.setText(docText, 0);
                }
            }
            // save the document
            try (FileOutputStream out = new FileOutputStream(output)) {
                doc.write(out);
            }
        }
    }
}
I am using Apache Lucene to index HTML files, storing the path of each HTML file in the index. The index is written correctly; I have checked it all in Luke.
But when I search for the path of a file, it returns a very high number of documents. I want it to match the exact path as it was stored in the index.
I am using the following code for index creation:
try {
    File indexDir = new File("d:/abc/");
    IndexWriter indexWriter = new IndexWriter(
            FSDirectory.open(indexDir),
            new SimpleAnalyzer(),
            true,
            IndexWriter.MaxFieldLength.LIMITED);
    indexWriter.setUseCompoundFile(false);
    Document doc = new Document();
    String path = f.getCanonicalPath();
    doc.add(new Field("fpath", path,
            Field.Store.YES, Field.Index.ANALYZED));
    indexWriter.addDocument(doc);
    indexWriter.optimize();
    indexWriter.close();
} catch (Exception ex) {
    ex.printStackTrace();
}
The code for searching the file path:
File indexDir = new File("d:/abc/");
int maxhits = 10000000;
int len = 0;
try {
    Directory directory = FSDirectory.open(indexDir);
    IndexSearcher searcher = new IndexSearcher(directory, true);
    QueryParser parser = new QueryParser(Version.LUCENE_36, "fpath", new SimpleAnalyzer());
    Query query = parser.parse(path);
    query.setBoost((float) 1.5);
    TopDocs topDocs = searcher.search(query, maxhits);
    ScoreDoc[] hits = topDocs.scoreDocs;
    len = hits.length;
    JOptionPane.showMessageDialog(null, "items found" + len);
} catch (Exception ex) {
    ex.printStackTrace();
}
It shows the number of documents found equal to the total number of documents, while the file with the searched path exists only once.
You are analyzing the path, which will split it into separate terms. The root path term (like catalog in /catalog/products/versions) likely occurs in all documents, so any search that includes catalog without forcing all terms to be mandatory will return all documents.
You need a search query like (using the example above):
+catalog +products +versions
to force all terms to be present.
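One way to get that behavior from the question's QueryParser setup, sketched against the Lucene 3.6 API used in the question: set the default operator to AND so every parsed term becomes mandatory.

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

class MandatoryTermsQuery {
    // With AND as the default operator, a path that analyzes to
    // "catalog products versions" parses to +catalog +products +versions.
    static Query parsePath(String path) throws ParseException {
        QueryParser parser = new QueryParser(Version.LUCENE_36, "fpath", new SimpleAnalyzer());
        parser.setDefaultOperator(QueryParser.AND_OPERATOR);
        return parser.parse(path);
    }
}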
Note that this gets more complicated if the same set of terms can occur in different orders, like:
/catalog/products/versions
/versions/catalog/products/SKUs
In that case, you need to use a different Lucene tokenizer than the tokenizer in the Standard Analyzer.
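Alternatively, if only exact-path lookups are needed, a sketch (again using the question's Lucene 3.x API): index the path as a single un-analyzed term and search it with a TermQuery, bypassing the analyzer entirely.

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

class ExactPathSearch {
    // Store the whole path as one term instead of analyzing it into words.
    static Document pathDoc(String path) {
        Document doc = new Document();
        doc.add(new Field("fpath", path, Field.Store.YES, Field.Index.NOT_ANALYZED));
        return doc;
    }

    // Look the path up as an exact term instead of going through QueryParser.
    static TopDocs findExact(IndexSearcher searcher, String path, int maxHits) throws IOException {
        return searcher.search(new TermQuery(new Term("fpath", path)), maxHits);
    }
}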
I am trying to implement the simplest Lucene search. I followed this as my starting point.
I can understand the sample code:
public static void indexHotel(Hotel hotel) throws IOException {
    IndexWriter writer = (IndexWriter) getIndexWriter(false);
    Document doc = new Document();
    doc.add(new Field("id", hotel.getId(), Field.Store.YES,
            Field.Index.NO));
    doc.add(new Field("name", hotel.getName(), Field.Store.YES,
            Field.Index.TOKENIZED));
    doc.add(new Field("city", hotel.getCity(), Field.Store.YES,
            Field.Index.UN_TOKENIZED));
    doc.add(new Field("description", hotel.getDescription(),
            Field.Store.YES,
            Field.Index.TOKENIZED));
    String fullSearchableText = hotel.getName() + " " + hotel.getCity() + " " + hotel.getDescription();
    doc.add(new Field("content", fullSearchableText,
            Field.Store.NO,
            Field.Index.TOKENIZED));
    writer.addDocument(doc);
}
The thing I can't get in this code is what getIndexWriter(false) does; this method is nowhere mentioned in the post I followed. Moreover, in another code block:
public void rebuildIndexes() throws IOException {
    // Erase existing index
    getIndexWriter(true);
    // Index all hotel entries
    Hotel[] hotels = HotelDatabase.getHotels();
    for (Hotel hotel : hotels) {
        indexHotel(hotel);
    }
    // Don't forget to close the index writer when done
    closeIndexWriter();
}
Some undefined methods are used, which is a bit confusing for a beginner like me.
I want to create only one index. I think getIndexWriter(true) and closeIndexWriter() are utility methods just to obtain an IndexWriter, but I can't tell what the true in getIndexWriter(true) is used for.
Following some other posts has only made me more confused about the creation of the IndexWriter.
Can somebody please put me on the right path if I am doing anything wrong?
Well, depending on where your index lives (RAM or file system), you can open different IndexWriters.
Assuming that you are trying to write the index into the file system you should have something like this:
public static final Version luceneVersion = Version.LUCENE_40;

IndexWriter getIndexWriter() throws IOException {
    Directory indexDir = FSDirectory.open(new File(INDEX_PATH));
    IndexWriterConfig luceneConfig = new IndexWriterConfig(
            luceneVersion, new StandardAnalyzer(luceneVersion));
    return new IndexWriter(indexDir, luceneConfig);
}
Note the analyzer class StandardAnalyzer; you should choose the analyzer depending on the application requirements. I reckon StandardAnalyzer is good enough for what you want to do.
The boolean argument presumably indicates whether a new index should be created?
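If so, a sketch of how such a flag might map onto IndexWriterConfig.OpenMode; this is an assumption about the tutorial's helper, not its actual code, and the indexPath parameter is illustrative:

import java.io.File;
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

class WriterFactory {
    static final Version luceneVersion = Version.LUCENE_40;

    // create == true: wipe and rebuild the index; false: append to an existing one.
    static IndexWriter getIndexWriter(String indexPath, boolean create) throws IOException {
        Directory indexDir = FSDirectory.open(new File(indexPath));
        IndexWriterConfig config = new IndexWriterConfig(
                luceneVersion, new StandardAnalyzer(luceneVersion));
        config.setOpenMode(create
                ? IndexWriterConfig.OpenMode.CREATE
                : IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
        return new IndexWriter(indexDir, config);
    }
}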
I am implementing a search feature for a news website. On that website, users submit news articles containing a title and text; currently these articles are inserted directly into a database. I heard that full-text searching inside a database containing very long texts would not be efficient.
So I tried using Lucene for indexing and searching. I am able to index the full database with it and also to search the content, but I am not sure whether I am using the best approach.
Here is my indexer class:
public class LuceneIndexer {

    public static void indexNews(News p, IndexWriter indexWriter) throws IOException {
        Document doc = new Document();
        doc.add(new Field("id", p.getNewsId(), Field.Store.YES, Field.Index.NO));
        doc.add(new Field("title", p.getTitle(), Field.Store.YES, Field.Index.TOKENIZED));
        doc.add(new Field("text", p.getNewsRawText(), Field.Store.YES, Field.Index.UN_TOKENIZED));
        String fullSearchableText = p.getTitle() + " " + p.getNewsRawText();
        doc.add(new Field("content", fullSearchableText, Field.Store.NO, Field.Index.TOKENIZED));
        indexWriter.addDocument(doc);
    }

    public static void rebuildIndexes() {
        try {
            System.out.println("started indexing");
            IndexWriter w = getIndexWriter();
            ArrayList<News> n = new GetNewsInfo().getLastPosts(0);
            for (News news : n) {
                indexNews(news, w);
            }
            closeIndexWriter(w);
            System.out.println("indexing done");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static IndexWriter getIndexWriter() throws CorruptIndexException, LockObtainFailedException, IOException {
        IndexWriter indexWriter = new IndexWriter(GlobalData.LUCENE_INDEX_STOREAGE, new StandardAnalyzer(), true);
        return indexWriter;
    }

    public static void closeIndexWriter(IndexWriter w) throws CorruptIndexException, IOException {
        w.close();
    }
}
Is the above code efficient?
I think I should add a document to the index when it is submitted by a user, instead of indexing the full database again.
Do I need to create a new IndexWriter every time an article is submitted?
Is it efficient to open and close an IndexWriter very frequently?
You are right that you don't need to re-add every document to the index; you only need to add new ones, and the rest will remain in the index.
But then you do need to create a new IndexWriter every time. If you prefer, you can use a service or something that keeps an IndexWriter alive, but the opening and closing do not take much time. If you do reuse an IndexWriter, make sure you call indexWriter.commit() after each add.
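A minimal sketch of the keep-alive variant (class and method names are illustrative):

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

class NewsIndexService {
    private final IndexWriter writer;

    NewsIndexService(IndexWriter writer) {
        this.writer = writer;
    }

    // Reuse one writer for all submissions; commit() makes each newly
    // added article durable and visible to newly opened readers.
    void addArticle(Document doc) throws IOException {
        writer.addDocument(doc);
        writer.commit();
    }

    void shutdown() throws IOException {
        writer.close();
    }
}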
Do I need to create a new IndexWriter every time an article is submitted?
No.
Is it efficient to open and close an IndexWriter very frequently?
Definitely not! You should read the guidelines for indexing here.