Using index files created by another application to search text with Lucene - java

I am working on a task where I need to search text using Lucene. The requirement is to use the segments, .si, .cfe, and .cfs files already created by another application.
I am able to get those files, but when I search the text, no results are shown.
The code for the search is:
import java.io.File;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public void searchText(String indexPath, String searchString) {
    try {
        Analyzer analyzer = new StandardAnalyzer();
        File indexDirectory = new File(indexPath);
        Directory directory = FSDirectory.open(indexDirectory.toPath());
        IndexReader directoryReader = DirectoryReader.open(directory);
        IndexSearcher searcher = new IndexSearcher(directoryReader);

        QueryParser parser = new QueryParser("requiredtext", analyzer);
        Query query = parser.parse(searchString);
        System.out.println(query);

        // Run the query and keep the top 10 hits:
        ScoreDoc[] hits = searcher.search(query, 10).scoreDocs;

        // Iterate through the results and print the stored field
        // (this returns null unless the field was stored at index time):
        for (int i = 0; i < hits.length; i++) {
            Document hitDoc = searcher.doc(hits[i].doc);
            System.out.println(hitDoc.get("requiredtext"));
        }

        analyzer.close();
        directoryReader.close();
        directory.close();
    }
    catch (Exception ex) {
        System.out.println("Exception - " + ex.getMessage());
    }
}
I am using Lucene version 8.11.1 with Java 8.
The question is: is it possible in Lucene to read/find/search text in index files that were written by one application and searched by another? If it is, please provide some pointers on how.
Atul

I found the issue and fixed it.
I was searching the field "requiredtext", but the indexer never stored data for that field: at indexing time the field was not created with the TextField.Store.YES option, which is why I could not retrieve the data for the field I was looking at. I did get data back for another field for which that option was set.
And my question was: is it possible to search data in index files created by another application? The answer is yes; #andrewJames's answer helps to prove it.
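For reference, a minimal sketch of the distinction (the field name matches the question; the Store flag is the part that matters):

    // Indexed AND stored: searchable, and hitDoc.get("requiredtext") returns the text.
    doc.add(new TextField("requiredtext", "some text", Field.Store.YES));

    // Indexed but NOT stored: still searchable, but hitDoc.get("requiredtext") returns null.
    doc.add(new TextField("requiredtext", "some text", Field.Store.NO));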

Related

Upload documents into Watson's Retrieve & Rank service

I'm implementing a solution using Watson's Retrieve & Rank service.
When I use the tooling interface, I upload my documents and they appear as a list, where I can click on any of them to open up all the titles that are inside the document (the answer units), as you can see in Picture 1 and Picture 2.
When I try to upload documents via Java, it won't recognize the documents; they get uploaded in parts (answer units as documents), each part as a new document.
I would like to know how I can upload my documents as entire documents and not only parts of them.
Here's the code for the upload function in Java:
public Answers ConvertToUnits(File doc, String collection) throws ParseException, SolrServerException, IOException {
    DC.setUsernameAndPassword(USERNAME, PASSWORD);
    Answers response = DC.convertDocumentToAnswer(doc).execute();

    SolrInputDocument newdoc = new SolrInputDocument();
    WatsonProcessing wp = new WatsonProcessing();
    Collection<SolrInputDocument> newdocs = new ArrayList<SolrInputDocument>();

    for (int i = 0; i < response.getAnswerUnits().size(); i++) {
        String titulo = response.getAnswerUnits().get(i).getTitle();
        String id = response.getAnswerUnits().get(i).getId();
        newdoc.addField("title", titulo);
        for (int j = 0; j < response.getAnswerUnits().get(i).getContent().size(); j++) {
            String texto = response.getAnswerUnits().get(i).getContent().get(j).getText();
            newdoc.addField("body", texto);
        }
        wp.IndexDocument(newdoc, collection);
        newdoc.clear();
    }
    wp.ComitChanges(collection);
    return response;
}

public void IndexDocument(SolrInputDocument newdoc, String collection) throws SolrServerException, IOException {
    UpdateRequest update = new UpdateRequest();
    update.add(newdoc);
    UpdateResponse addResponse = solrClient.add(collection, newdoc);
}
You can specify config options in this line:
Answers response = DC.convertDocumentToAnswer(doc).execute();
I think something like this should do the trick:
String configAsString = "{ \"conversion_target\":\"answer_units\", \"answer_units\": { \"selector_tags\": [] } }";
JsonParser jsonParser = new JsonParser();
JsonObject customConfig = jsonParser.parse(configAsString).getAsJsonObject();
Answers response = DC.convertDocumentToAnswer(doc, null, customConfig).execute();
I've not tried it out, so I might not have the syntax exactly right, but hopefully this will get you on the right track.
Essentially, what I'm trying to do here is use the selector_tags option in the config (see https://www.ibm.com/watson/developercloud/doc/document-conversion/customizing.shtml#htmlau for doc on this) to specify which tags the document should be split on. By specifying an empty list with no tags in, it results in it not being split at all - and coming out in a single answer unit as you want.
(Note that you can do this through the tooling interface, too - by unticking the "Split my documents up into individual answers for me" option when you upload the document)

Lucene - Creating an Index using FSDirectory

First time posting; long time reader. I apologize ahead of time if this was already asked here (I'm new to Lucene as well!). I've done a lot of research and wasn't able to find a good explanation/example for my question.
First of all, I've used IKVM.NET to convert Lucene 4.9 from Java so I could include it in my .NET application. I chose to do this so I could use the most recent version of Lucene. No issues there.
I am trying to create a basic example to start learning Lucene and to apply it to my app. I've done countless Google searches and read lots of articles, Apache's website, etc. My code mostly follows the example here: http://www.lucenetutorial.com/lucene-in-5-minutes.html
My question is, I don't believe I want to use RAMDirectory, right? Since I will be indexing a database and allowing users to search it via the website, I opted for FSDirectory because I didn't think the index should all be stored in memory.
When the IndexWriter is created, it creates new files each time (.cfe, .cfs, .si, segments.gen, write.lock, etc.). It seems to me you would create these files once and then use them until the index needs to be rebuilt?
So how do I create an IndexWriter without recreating the index files?
Code:
StandardAnalyzer analyzer;
Directory directory;

protected void Page_Load(object sender, EventArgs e)
{
    var version = org.apache.lucene.util.Version.LUCENE_CURRENT;
    analyzer = new StandardAnalyzer(version);
    if (directory == null)
    {
        directory = FSDirectory.open(new java.io.File(HttpContext.Current.Request.PhysicalApplicationPath + "/indexes"));
    }
    IndexWriterConfig config = new IndexWriterConfig(version, analyzer);
    // I found that setting the open mode will overwrite the files, but it still creates new ones each time
    config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
    IndexWriter w = new IndexWriter(directory, config);
    addDoc(w, "test", "1234");
    addDoc(w, "test1", "1234");
    addDoc(w, "test2", "1234");
    addDoc(w, "test3", "1234");
    w.close();
}

private static void addDoc(IndexWriter w, String _keyword, String _keywordid)
{
    Document doc = new Document();
    doc.add(new TextField("Keyword", _keyword, Field.Store.YES));
    doc.add(new StringField("KeywordID", _keywordid, Field.Store.YES));
    w.addDocument(doc);
}

protected void searchButton_Click(object sender, EventArgs e)
{
    String querystr = "";
    String results = "";
    querystr = searchTextBox.Text.ToString();
    Query q = new QueryParser(org.apache.lucene.util.Version.LUCENE_4_0, "Keyword", analyzer).parse(querystr);
    int hitsPerPage = 100;
    DirectoryReader reader = DirectoryReader.open(directory);
    IndexSearcher searcher = new IndexSearcher(reader);
    TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
    searcher.search(q, collector);
    ScoreDoc[] hits = collector.topDocs().scoreDocs;
    if (hits.Length == 0)
    {
        label.Text = "Nothing was found.";
    }
    else
    {
        for (int i = 0; i < hits.Length; ++i)
        {
            int docID = hits[i].doc;
            Document d = searcher.doc(docID);
            results += "<br />" + (i + 1) + ". " + d.get("KeywordID") + "\t" + d.get("Keyword") + " Hit Score: " + hits[i].score.ToString() + "<br />";
        }
        label.Text = results;
    }
    // close the reader in both branches, not only when hits were found
    reader.close();
}
Yes, RAMDirectory is great for quick, on-the-fly tests and tutorials, but in production you will usually want to store your index on the file system through an FSDirectory.
The reason it's rewriting the index every time you open the writer is that you are setting the OpenMode to IndexWriterConfig.OpenMode.CREATE. CREATE means you want to remove any existing index at that location, and start from scratch. You probably want IndexWriterConfig.OpenMode.CREATE_OR_APPEND, which will open an existing index if one is found.
One minor note:
You shouldn't use LUCENE_CURRENT (it's deprecated); use a real version instead. You are also using LUCENE_4_0 in your QueryParser. Neither of these will likely cause major problems, but it's good to be consistent anyway.
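A minimal sketch combining both suggestions (the index path is illustrative; this is the Java Lucene 4.x API, which maps directly onto the IKVM-converted classes):

    // Pin an explicit version instead of the deprecated LUCENE_CURRENT.
    Version version = Version.LUCENE_4_9;
    Analyzer analyzer = new StandardAnalyzer(version);
    Directory directory = FSDirectory.open(new File("/path/to/indexes"));

    // CREATE_OR_APPEND reuses an existing index instead of wiping it.
    IndexWriterConfig config = new IndexWriterConfig(version, analyzer);
    config.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
    IndexWriter writer = new IndexWriter(directory, config);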
When we use RAMDirectory, it loads the whole index, or large parts of it, into “memory” that is really virtual memory. As physical memory is limited, the operating system may of course decide to swap out our large RAMDirectory. So RAMDirectory is not a good way to optimize index loading times.
On the other hand, if we don’t use RAMDirectory to buffer our index and use NIOFSDirectory or SimpleFSDirectory instead, we pay another price: our code has to do a lot of syscalls to the O/S kernel to copy blocks of data between the disk or filesystem cache and our buffers residing in the Java heap. And this has to be done on every search request, over and over again.
MMapDirectory resolves all of the above issues by using virtual memory and a kernel feature called “mmap” to access the disk files.
Check this link also.
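If you want the memory-mapped behavior explicitly, here is a minimal sketch (Lucene 4.x API; the path is illustrative). Note that on 64-bit JVMs, FSDirectory.open typically chooses MMapDirectory for you:

    // Memory-map the index files instead of copying blocks through the Java heap.
    Directory directory = new MMapDirectory(new File("/path/to/indexes"));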

Faster alternative to JsonPath in Java

JsonPath seems to be pretty slow for large JSON files.
In my project, I'd like a user to be able to pass an entire query as a string. I used JsonPath because it lets you run an entire query like $.store.book[3].price all at once by calling JsonPath.read(fileOrString, "$.store.book[3].price", new Filter[0]). Is there a faster way to interact with JSON files in Java? It would be ideal to be able to pass the entire query as a string, but I'll write a parser if I have to. Any ideas?
Even small optimizations would be helpful. For instance, I'm currently reading from the JSON file every time I query. Would it be better to copy the entire file into a string at the beginning and query the string instead?
EDIT: To those of you saying "this is JavaScript, not Java": it actually is Java. JsonPath is a JavaScript-like query language, but the file I am writing is most assuredly Java. Only the query uses the JavaScript-like syntax. Here's some info about JsonPath, and a snippet of code: https://code.google.com/p/json-path/
List toRet;
String query = "$.store.book[3].price";
try {
    // if output is a list, good
    toRet = (List) JsonPath.read(filestring_, query, new Filter[0]);
} catch (ClassCastException cce) {
    // if output isn't a list, put it in a list
    Object outObj = null;
    try {
        outObj = JsonPath.read(filestring_, query, new Filter[0]);
    } catch (Exception e) {
        throw new DataSourceException("Invalid file!\n", e, DataSourceException.UNKNOWN);
    }
    toRet = Collections.singletonList(outObj); // java.util.Collections
}
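On the small-optimizations point: rather than re-reading the file for every query, you can parse the JSON once and run each query against the in-memory result. A sketch, assuming the Jayway JsonPath API (method names may differ in the older code.google.com release):

    // Parse the JSON string once...
    Object document = Configuration.defaultConfiguration().jsonProvider().parse(jsonString);

    // ...then run as many queries as needed against the parsed tree.
    Object price = JsonPath.read(document, "$.store.book[3].price");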

apache lucene indexing and searching on the filepath

I am using Apache Lucene to index HTML files, and I am storing the path of each HTML file in the Lucene index. The index is being written; I have checked it all in Luke.
But when I search for the path of a file, it returns a very high number of documents. I want it to search for the exact path as it was stored in the Lucene index.
I am using the following code for index creation:
try {
    File indexDir = new File("d:/abc/");
    IndexWriter indexWriter = new IndexWriter(
            FSDirectory.open(indexDir),
            new SimpleAnalyzer(),
            true,
            IndexWriter.MaxFieldLength.LIMITED);
    indexWriter.setUseCompoundFile(false);

    Document doc = new Document();
    String path = f.getCanonicalPath();   // f is the HTML file being indexed (defined elsewhere)
    doc.add(new Field("fpath", path,
            Field.Store.YES, Field.Index.ANALYZED));
    indexWriter.addDocument(doc);

    indexWriter.optimize();
    indexWriter.close();
}
catch (Exception ex) {
    ex.printStackTrace();
}
And here is the code for searching the file path:
File indexDir = new File("d:/abc/");
int maxhits = 10000000;
int len = 0;
try {
    Directory directory = FSDirectory.open(indexDir);
    IndexSearcher searcher = new IndexSearcher(directory, true);
    QueryParser parser = new QueryParser(Version.LUCENE_36, "fpath", new SimpleAnalyzer());
    Query query = parser.parse(path);
    query.setBoost((float) 1.5);
    TopDocs topDocs = searcher.search(query, maxhits);
    ScoreDoc[] hits = topDocs.scoreDocs;
    len = hits.length;
    JOptionPane.showMessageDialog(null, "items found " + len);
}
catch (Exception ex) {
    ex.printStackTrace();
}
It shows the number of documents found as the total number of documents, while the file with the searched path exists only once.
You are analyzing the path, which will split it into separate terms. The root path term (like catalog in /catalog/products/versions) likely occurs in all documents, so any search that includes catalog without forcing all terms to be mandatory will return all documents.
You need a search query like (using the example above):
+catalog +products +versions
to force all terms to be present.
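One way to get that behavior without hand-writing the + operators (a sketch using the 3.x QueryParser from the question) is to make AND the default operator:

    QueryParser parser = new QueryParser(Version.LUCENE_36, "fpath", new SimpleAnalyzer());
    parser.setDefaultOperator(QueryParser.AND_OPERATOR); // every term becomes mandatory
    Query query = parser.parse(path);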
Note that this gets more complicated if the same set of terms can occur in different orders, like:
/catalog/products/versions
/versions/catalog/products/SKUs
In that case, you need to use a different Lucene tokenizer than the tokenizer in the Standard Analyzer.
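Alternatively, if the goal is an exact whole-path match, another option (a sketch against the Lucene 3.x API used in the question; this is an addition, not part of the answer above) is to index the path un-analyzed and look it up with a TermQuery:

    // Index the path as a single, un-analyzed term so it is never split.
    doc.add(new Field("fpath", path, Field.Store.YES, Field.Index.NOT_ANALYZED));

    // Search with a TermQuery for the exact stored path; no analyzer is involved.
    Query query = new TermQuery(new Term("fpath", path));
    TopDocs topDocs = searcher.search(query, 10);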

Parsing malformed/incomplete/invalid XML files [duplicate]

This question already has answers here: How to parse invalid (bad / not well-formed) XML? (4 answers). Closed 5 years ago.
I have a process that parses an XML file using JDOM and XPath, as shown below:
private static SAXBuilder builder = null;
private static Document doc = null;
private static XPath xpathInstance = null;

builder = new SAXBuilder();
Text list = null;

try {
    doc = builder.build(new StringReader(xmldocument));
} catch (JDOMException e) {
    throw new Exception(e);
}

try {
    xpathInstance = XPath.newInstance("//book[author='Neal Stephenson']/title/text()");
    list = (Text) xpathInstance.selectSingleNode(doc);
} catch (JDOMException e) {
    throw new Exception(e);
}
The above works fine. The XPath expressions are stored in a properties file, so they can be changed at any time. Now I have to process some more XML files that come from a legacy system that will only send the XML files in chunks of 4000 bytes. The existing processing reads the 4000-byte chunks and stores them in an Oracle database, with each chunk as one row in the database (making any changes to the legacy system, or to the processing that stores the chunks as rows in the database, is out of the question).
I can build the complete, valid XML document by extracting all the rows related to a specific XML document and merging them, and then use the existing processing (shown above) to parse the document.
The thing is, though, the data I need to extract from the XML document will always be in the first 4000 bytes. This chunk, of course, is not a valid XML document, as it will be incomplete, but it will contain all the data I need. I can't parse just the one chunk, as the JDOM builder will reject it.
I am wondering whether I can parse the malformed XML chunk without having to merge all the parts (which could get to be quite many) in order to get a valid XML document. This would save me several trips to the database to check whether a chunk is available, and I wouldn't have to merge hundreds of chunks just to be able to use the first 4000 bytes.
I know I could probably use Java's string functions to extract the relevant data, but is this possible using a parser or even XPath? Or do they both expect the XML document to be well-formed before they can parse it?
You could try to use JSoup to parse the invalid XML. By definition, XML should be well-formed; otherwise it's invalid and should not be used.
UPDATE - example:
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.parser.Parser;
import org.jsoup.parser.Tag;

public static void main(String[] args) {
    // Parse the incomplete fragment leniently; Jsoup does not require well-formed input.
    for (Node node : Parser.parseFragment("<test><author name=\"Vlad\"><book name=\"SO\"/>",
            new Element(Tag.valueOf("p"), ""),
            "")) {
        print(node, 0);
    }
}

// Recursively print each node's name and attributes, indented by depth.
public static void print(Node node, int offset) {
    for (int i = 0; i < offset; i++) {
        System.out.print(" ");
    }
    System.out.print(node.nodeName());
    for (Attribute attribute : node.attributes()) {
        System.out.print(", ");
        System.out.print(attribute.getKey() + "=" + attribute.getValue());
    }
    System.out.println();
    for (Node child : node.childNodes()) {
        print(child, offset + 4);
    }
}
