I want to build a keyword search. I have looked at the Google App Engine API and the Lucene API, but my problem is this: I have some articles, let's say 5000, and each article has a unique ID. If a user searches with a keyword, the program should return all the article IDs that contain this keyword.
Second, if a user searches with a keyword, for example "dress", then it should return the articles that contain the keywords dress, dressing, dressed, etc.
This is what the Search API is designed for.
While it has some limitations, for your basic use case it should suffice. If you want to use Lucene, you will need to run it on another platform (or heavily customise it) because it uses the file system.
For your requirement to find similar words, you can read about stemmed queries here.
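For illustration, here is a rough sketch of indexing and searching one article with the App Engine Search API; the index name "articles", the field name "text" and the article ID are made up for the example, and the ~ operator asks for stemmed (plural/variant) matching:

import com.google.appengine.api.search.*;

// Get (or lazily create) the index that holds the articles
Index index = SearchServiceFactory.getSearchService()
        .getIndex(IndexSpec.newBuilder().setName("articles").build());

// Index one article; the document id doubles as the article ID
String articleText = "She wore a red dress to the party.";
Document doc = Document.newBuilder()
        .setId("4711")
        .addField(Field.newBuilder().setName("text").setText(articleText))
        .build();
index.put(doc);

// "~dress" also matches stemmed variants such as "dresses"
Results<ScoredDocument> results = index.search("~dress");
for (ScoredDocument hit : results) {
    System.out.println(hit.getId());   // the article ID
}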
Use Lucene, which is a high-performance, full-featured text search engine library. Index each article in a separate Lucene document with a unique field article_id. Also index the article text in a field article_text. Apply StopFilter, PorterStemFilter, etc. to the field article_text. After indexing you are ready to search for keywords.
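A minimal sketch of that approach (not the only way to wire it up): it assumes a recent Lucene (5+) and uses EnglishAnalyzer instead of a hand-built StopFilter + PorterStemFilter chain, since EnglishAnalyzer already applies stop-word removal and Porter stemming; the field names, article ID and index path are examples only:

import java.nio.file.Paths;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class ArticleSearchSketch {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new EnglishAnalyzer();          // stop words + Porter stemming
        Directory dir = FSDirectory.open(Paths.get("article-index"));

        // Indexing: one Lucene document per article
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer));
        Document doc = new Document();
        doc.add(new StringField("article_id", "4711", Field.Store.YES)); // exact id, stored
        doc.add(new TextField("article_text", "She was dressing for dinner.", Field.Store.NO));
        writer.addDocument(doc);
        writer.close();

        // Searching: "dress" also matches "dressing"/"dressed" thanks to stemming
        IndexReader reader = DirectoryReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);
        Query query = new QueryParser("article_text", analyzer).parse("dress");
        for (ScoreDoc hit : searcher.search(query, 5000).scoreDocs) {
            System.out.println(searcher.doc(hit.doc).get("article_id"));
        }
        reader.close();
    }
}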
Is it possible to create a summary of a large document using some out-of-the-box search engines, like Lucene, Solr or Sphinx and search documents most relevant to a query?
I don't need to search inside the document or create a snippet. Just get 5 documents best matching the query.
Update. More specifically, I don't want the engine to keep the whole document, only its "summary" (you may call it index information or a TF-IDF representation).
To answer your updated question: Lucene/Solr fits your needs. For the 'summary', you have the option of not storing the original text by specifying:
org.apache.lucene.document.Field.Store.NO
By saving the 'summary' as an org.apache.lucene.document.TextField, the summary will be indexed and tokenized. The index keeps the TF-IDF information you need for searching.
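A small fragment to illustrate the point, assuming a recent Lucene (4+) where TextField exists; "doc_id", "summary" and summaryText are placeholders:

// stored, so the id comes back with the search results
doc.add(new StringField("doc_id", "123", Field.Store.YES));
// indexed and tokenized for scoring, but the original summary text is not kept
doc.add(new TextField("summary", summaryText, Field.Store.NO));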
Basically, if you want a summarization feature, there are plenty of ways to do it, for example TextRank (there is a big article on the wiki and plenty of implementations available in NLTK and elsewhere). However, it will not help you with the querying; you will need to index the result somewhere anyway.
I think you could achieve something like this using the feature called More Like This. It exists in Lucene, Solr and Elasticsearch. The idea behind it is that if you send a query (which is the raw text of the document), the search engine will find the most suitable documents by extracting the most relevant words from it (which reminds me of summarization) and then looking inside the inverted index for the top N similar documents. It will not discard the text, though; it does the "like" operation based on TF-IDF metrics.
References for MLT in Elasticsearch, Lucene, Solr
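To illustrate, a rough sketch of Lucene's MoreLikeThis (org.apache.lucene.queries.mlt) in a recent Lucene version; reader, analyzer and rawDocumentText are assumed to already exist, and the field name and tuning values are just examples:

MoreLikeThis mlt = new MoreLikeThis(reader);        // reader: an open IndexReader
mlt.setAnalyzer(analyzer);                          // same analyzer used at index time
mlt.setFieldNames(new String[] { "summary" });
mlt.setMinTermFreq(1);
mlt.setMinDocFreq(1);

// Build a "documents like this text" query from the raw document text...
Query query = mlt.like("summary", new StringReader(rawDocumentText));
// ...and fetch the 5 best-matching documents
TopDocs top = new IndexSearcher(reader).search(query, 5);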
but only its "summary" (you may call it index information or TF-IDF representation).
What you are looking for seems quite standard:
Apache Lucene [1], if you are looking for a library
Apache Solr or Elasticsearch, if you are looking for a production-ready enterprise search server.
The way a Lucene search engine works [2] is by building an inverted index of each field in your document (plus a set of additional data structures required by other features).
What you apparently don't want to do is store the content of a field, which means taking the text content and keeping it in full (compressed) in the index (to be retrieved later).
In Lucene and Solr this is a matter of configuration.
Summarisation is a completely different NLP task and is probably not what you need.
Cheers
[1] http://lucene.apache.org/index.html
[2] https://sease.io/2015/07/26/exploring-solr-internals-the-lucene-inverted-index/
I need to extract the keywords of a search query. For example, suppose one searches for "latest popular Nokia phones". I want to extract the keywords of this phrase. Are there any libraries written in Java to get this done?
AFAIK Apache Lucene does such a thing (removing words like a, an, the and so on). It also provides a Java-based indexing and search implementation, as well as spellchecking, hit highlighting and advanced analysis/tokenization capabilities, and much more.
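As a rough sketch (Lucene 5+ assumed), you can run the query text through an analyzer and collect the surviving tokens; EnglishAnalyzer lower-cases, drops English stop words and stems, so the exact output depends on the analyzer you choose:

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class KeywordExtractorSketch {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new EnglishAnalyzer();
        try (TokenStream stream =
                analyzer.tokenStream("q", new StringReader("latest popular Nokia phones"))) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                System.out.println(term.toString());   // one keyword per line
            }
            stream.end();
        }
    }
}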
I have a system where people input some words and, based on this, I have to search a database of products. The products belong to one category and have attributes such as brand, price and condition (new, old, used, ...).
Does someone know how to sort a list of results according to best match, i.e. so that those which match the words entered by the user appear first?
Maybe you could use Zend Lucene, you'll find a quick intro on this Symfony framework page.
Edit: as you are using Java, try the original Lucene library (Zend Lucene is actually a port to PHP).
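If you do use Lucene, relevance ordering comes for free: hits are returned sorted by score. A rough fragment, assuming the products were indexed with text fields named name, brand and condition (names invented for the example) and that searcher and analyzer already exist:

// Search several product fields at once; hits come back best match first
String[] fields = { "name", "brand", "condition" };
Query query = new MultiFieldQueryParser(fields, analyzer).parse(userInput);
TopDocs hits = searcher.search(query, 20);   // top 20 by relevance score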
Currently, I am using Lucene version 3.0.2 to create a search application that is similar to a dictionary. One of the things I want to display is a sort of "example", where Lucene would look for a word in a book and then display the sentences in which the word was used.
I've been reading the Lucene in Action book and it mentions something like this, but looking through it I can't find other mentions. Is this something you can do with Lucene? If it is, how can you do it?
I believe what you are looking for is a Highlighter.
One possibility is to use the lucene.search.highlight package, specifically the Highlighter.
Another option is to use the lucene.search.vectorhighlight package, specifically the FastVectorHighlighter.
Both classes search a text document, choose relevant snippets and display them with the matching terms highlighted. I have only used the first one, which worked fine for my use-case. If you can pre-divide the book into shorter parts, it would make highlighting faster.
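For reference, a rough fragment of the first option (org.apache.lucene.search.highlight), roughly as it looked in the Lucene 3.x line the question mentions; the field name, tags and fragment size are just examples, and query, analyzer and bookText are assumed to exist:

// Score candidate fragments against the user's query
QueryScorer scorer = new QueryScorer(query);
Highlighter highlighter = new Highlighter(new SimpleHTMLFormatter("<b>", "</b>"), scorer);
highlighter.setTextFragmenter(new SimpleSpanFragmenter(scorer, 100)); // ~sentence-sized pieces

// Pick the best snippet from the book text for this query
String snippet = highlighter.getBestFragment(analyzer, "content", bookText);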
At the moment I know that Compass may handle this work, but indexing with Compass looks pretty expensive. Are there any lighter alternatives?
To be honest, I don't know if Lucene will be lighter than Compass in terms of indexing (why would it be, doesn't Compass use Lucene for that?).
Anyway, because you asked for alternatives, there is GAELucene. I'm quoting its announcement below:
Enlightened by the discussion "Can I run Lucene in google app engine?", I implemented a google datastore based Lucene component, GAELucene, which can help you to run search applications on google app engine.
The main classes of GAELucene include:
GAEDirectory - a read-only Directory based on the google datastore.
GAEFile - stands for an index file; the file's byte content will be split into multiple GAEFileContent.
GAEFileContent - stands for a segment of an index file.
GAECategory - the identifier of different indices.
GAEIndexInput - a memory-resident IndexInput implementation like the RAMInputStream.
GAEIndexReader - wrapper for IndexReader that is cached in GAEIndexReaderPool.
GAEIndexReaderPool - pool for GAEIndexReader.
The following code snippet demonstrates the use of GAELucene to do searching:
Query queryObject = parserQuery(request);
GAEIndexReaderPool readerPool = GAEIndexReaderPool.getInstance();
GAEIndexReader indexReader = readerPool.borrowReader(INDEX_CATEGORY_DEMO);
IndexSearcher searcher = new IndexSearcher(indexReader);
Hits hits = searcher.search(queryObject);
readerPool.returnReader(indexReader);
I warmly recommend reading the whole discussion on Nabble; it is very informative.
Just in case, regarding Compass, Shay Banon wrote a blog entry detailing how to use Compass in App Engine here: http://www.kimchy.org/searchable-google-appengine-with-compass/
Apache Lucene is the de-facto choice for full-text indexing in Java. It looks like Compass Core contains "An implementation of Lucene Directory to store the index within a database (using Jdbc). It is separated from Compass code base and can be used with pure Lucene applications." plus tons of other stuff. You could try to separate out just the Lucene component, thereby stripping away several libs and making it more lightweight. Either that or ditch Compass altogether and use pure, unadorned Lucene.
For Google App Engine, the only indexing library I've seen is appengine-search, with a description of how to use it on this page. I haven't tried it out though.
I've used Lucene (which Compass is based on) and found it to work great with comparatively low expense. The indexing is a task that you can schedule at times that work for your app.
Some alternative indexing projects are mentioned in this SO thread, including Xapian and Minion. I haven't checked either of these out, though, since Lucene did everything I needed very well.
The Google App Engine internal search seems better, and even has support for synonyms:
https://developers.google.com/appengine/docs/java/search/
If you want to run Lucene on GAE you might also have a look at LuGAEne. It's an implementation of Lucene's Directory for GAE.
Usage is actually pretty simple: just replace one of Lucene's standard directories with GaeDirectory:
Directory directory = new GaeDirectory("MyIndex");
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_43);
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_43, analyzer);
IndexWriter writer = new IndexWriter(directory, config);
...
gaelucene seems to be in "maintenance mode" (no commit since Sep 2009) and lucene-appengine does not (yet) work when you're using Objectify version 4 in your application.
Disclaimer: I'm the author of LuGAEne.