How to extract keywords from a "search query"? - java

I need to extract the keywords of a search query. For example, suppose one searches for "latest popular Nokia phones". I want to extract the keywords of this phrase. Are there any libraries written in Java to get this done?

AFAIK Apache Lucene does exactly this (it removes stop words like "a", "an", "the", and so on). It also provides a Java-based indexing and search implementation, as well as spellchecking, hit highlighting, advanced analysis/tokenization capabilities, and much more.
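As a minimal sketch of that analysis step (assuming a recent Lucene version on the classpath; note that newer StandardAnalyzer constructors ship with an empty stop set, so the English stop list is passed in explicitly here, and the field name is arbitrary):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class KeywordExtractor {

    // Tokenize, lower-case, and stop-filter a query string.
    public static List<String> extractKeywords(String query) throws IOException {
        try (Analyzer analyzer = new StandardAnalyzer(EnglishAnalyzer.ENGLISH_STOP_WORDS_SET);
             TokenStream ts = analyzer.tokenStream("query", query)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            List<String> keywords = new ArrayList<>();
            ts.reset();
            while (ts.incrementToken()) {
                keywords.add(term.toString());
            }
            ts.end();
            return keywords;
        }
    }

    public static void main(String[] args) throws IOException {
        // "the" is a stop word, and all tokens are lower-cased.
        System.out.println(extractKeywords("the latest popular Nokia phones"));
    }
}
```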

Related

Full text search by summaries

Is it possible to create a summary of a large document using some out-of-the-box search engines, like Lucene, Solr or Sphinx and search documents most relevant to a query?
I don't need to search inside the document or create a snippet. Just get 5 documents best matching the query.
Update. More specifically, I don't want the engine to keep the whole document, only its "summary" (you may call it index information or a TF-IDF representation).
To answer your updated question: Lucene/Solr fit your needs. For the 'summary', you have the option not to store the original text by specifying:
org.apache.lucene.document.Field.Store.NO
By saving the 'summary' as an org.apache.lucene.document.TextField, the summary will be indexed and tokenized. It will store the TF-IDF information for you to search.
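A minimal sketch of that configuration (assuming a recent Lucene version; the field names and the in-memory ByteBuffersDirectory are just for illustration):

```java
import java.io.IOException;

import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class SummaryIndexer {

    // Index a summary field without storing it, then count hits for a term.
    public static long hits(String term) throws IOException {
        Directory dir = new ByteBuffersDirectory(); // in-memory, for illustration
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new EnglishAnalyzer()))) {
            Document doc = new Document();
            // Stored: the ID can be retrieved from search results.
            doc.add(new StringField("id", "doc-1", Field.Store.YES));
            // Not stored (Field.Store.NO): only the inverted index, i.e. the
            // term/frequency data that TF-IDF scoring needs, is kept.
            doc.add(new TextField("summary", "full text of the document", Field.Store.NO));
            writer.addDocument(doc);
        }
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            return searcher.search(new TermQuery(new Term("summary", term)), 10).totalHits.value;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(hits("text"));
    }
}
```

The field remains searchable, but retrieving the original text from the index is no longer possible.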
Basically, if you want a summarization feature, there are plenty of ways to do it, for example TextRank: there is a detailed article on Wikipedia, and plenty of implementations are available in NLTK and elsewhere. However, summarization alone will not help you with querying; you will still need to index the result somewhere.
I think you could achieve something like this using a feature called More Like This. It exists in Lucene, Solr, and Elasticsearch. The idea behind it is that if you send a query (which is the raw text of a document), the search engine will find the most suitable matches by extracting the most relevant words from it (which reminds me of summarization) and then looking inside the inverted index for the top N similar documents. It will not discard the text, though; it performs a "like" operation based on TF-IDF metrics.
References for MLT in Elasticsearch, Lucene, and Solr
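A sketch of the Lucene variant (assuming a recent Lucene with the queries module on the classpath; the documents, field names, and relaxed frequency thresholds are just for this toy example):

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queries.mlt.MoreLikeThis;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class MltDemo {

    // Index two toy documents, then ask MLT which one a raw text resembles.
    public static String bestMatch(String rawText) throws IOException {
        Directory dir = new ByteBuffersDirectory();
        EnglishAnalyzer analyzer = new EnglishAnalyzer();
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            add(writer, "phones", "nokia phones and smartphones with good cameras");
            add(writer, "baking", "recipes for chocolate cake and cookies");
        }
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            MoreLikeThis mlt = new MoreLikeThis(reader);
            mlt.setAnalyzer(analyzer);
            mlt.setFieldNames(new String[] {"summary"});
            mlt.setMinTermFreq(1); // defaults (2 and 5) are too strict for a toy index
            mlt.setMinDocFreq(1);
            // MLT extracts the most relevant terms from the raw text
            // and builds a TF-IDF-weighted query out of them.
            Query query = mlt.like("summary", new StringReader(rawText));
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs top = searcher.search(query, 5);
            return top.scoreDocs.length == 0 ? null
                : searcher.doc(top.scoreDocs[0].doc).get("id");
        }
    }

    private static void add(IndexWriter writer, String id, String text) throws IOException {
        Document doc = new Document();
        doc.add(new StringField("id", id, Field.Store.YES));
        doc.add(new TextField("summary", text, Field.Store.NO));
        writer.addDocument(doc);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(bestMatch("the latest nokia smartphones"));
    }
}
```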
What you are looking for seems quite standard:
Apache Lucene [1], if you are looking for a library
Apache Solr or Elasticsearch, if you are looking for a production-ready enterprise search server.
The way a Lucene search engine works [2] is by building an inverted index of each field in your document (plus a set of additional data structures required by other features).
What you apparently don't want to do is store the content of a field, which means taking the text content and storing it in full (compressed) in the index, to be retrieved later.
In Lucene and Solr this is a matter of configuration.
Summarisation is a completely different NLP task and is probably not what you need.
Cheers
[1] http://lucene.apache.org/index.html
[2] https://sease.io/2015/07/26/exploring-solr-internals-the-lucene-inverted-index/

Identifying all the names from a given text

I want to identify all the names written in any text; currently I am using IMDb movie reviews.
I am using the Stanford POS tagger and analysing all the proper nouns (as proper nouns are names of people, things, and places), but this is slow.
First I tag all the input lines, then I check all the words tagged NNP, which is a slow process.
Is there any efficient substitute to achieve this task? Any library (preferably in Java)?
Thanks.
Do you know the input language? If yes, you could match each word against a dictionary and flag the word as a proper noun if it is not in the dictionary. It would require a complete dictionary with all the declensions of each word of the language, and you would have to pay attention to numbers and other special cases.
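A naive sketch of that heuristic (the tiny dictionary and the extra "capitalized word" check are my own additions for illustration; a real dictionary would need every declension, as noted above):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class ProperNounFlagger {

    // Toy dictionary of known common words; a real one must be complete.
    private static final Set<String> DICTIONARY = new HashSet<>(Arrays.asList(
        "i", "watched", "a", "movie", "with", "in", "it", "was", "great"));

    // Flag any capitalized word whose lower-cased form is not in the dictionary.
    public static List<String> flag(String text) {
        List<String> names = new ArrayList<>();
        for (String raw : text.split("\\s+")) {
            String word = raw.replaceAll("[^\\p{L}]", ""); // strip punctuation
            if (word.isEmpty()) continue;
            boolean capitalized = Character.isUpperCase(word.charAt(0));
            boolean known = DICTIONARY.contains(word.toLowerCase(Locale.ROOT));
            if (capitalized && !known) names.add(word);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(flag("I watched a movie with Tom Hanks in it. It was great."));
    }
}
```

This illustrates the weakness too: any out-of-dictionary word (a typo, a brand name, a rare inflection) gets flagged, which is why trained NER models usually do better.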
EDIT: See also this answer in the official FAQ: have you tried to change the model used?
A (paid) web service called GlobalNLP can do it in multiple languages: https://nlp.linguasys.com/docs/services/54131f001c78d802f0f2b28f/operations/5429f9591c78d80a3cd66926

search by keyword mapping using Google app engine or Lucene

I want to build a keyword search. I saw the Google App Engine API and the Lucene API, but my problem is this: I have some articles, let's say 5000, each with a unique ID. If a user searches with a keyword, the program should return the IDs of all articles which contain that keyword.
Second, if a user searches with a keyword, for example "dress", it should return the articles which contain the keywords dress, dressing, dressed, etc.
This is what the Search API is designed for.
While it has some limitations, for your basic use case it should suffice. If you want to use Lucene, you will need to run it on another platform (or heavily customise it), because it uses the file system.
For your requirement to find similar words, you can read about stemmed queries here
Use Lucene, which is a high-performance, full-featured text search engine library. Index each article in a separate Lucene document with a unique article_id field. Also index the article text in an article_text field. Apply StopFilter, PorterStemFilter, etc. to the article_text field. After indexing you are ready to search for keywords.
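A sketch of both requirements together (assuming a recent Lucene with the queryparser module; the field names article_id/article_text follow the description above, and EnglishAnalyzer applies the stop-word and Porter stemming filters, so "dress" also matches "dressing"):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class ArticleSearch {

    // Index a few toy articles, then return the IDs matching a keyword.
    public static List<String> search(String keyword) throws Exception {
        Directory dir = new ByteBuffersDirectory();
        EnglishAnalyzer analyzer = new EnglishAnalyzer(); // stop words + Porter stemming
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            addArticle(writer, "1", "a red dress for the party");
            addArticle(writer, "2", "dressing well on a budget");
            addArticle(writer, "3", "nokia phone reviews");
        }
        List<String> ids = new ArrayList<>();
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // The same analyzer stems the query, so "dress" matches "dressing".
            Query q = new QueryParser("article_text", analyzer).parse(keyword);
            for (ScoreDoc sd : searcher.search(q, 10).scoreDocs) {
                ids.add(searcher.doc(sd.doc).get("article_id"));
            }
        }
        return ids;
    }

    private static void addArticle(IndexWriter w, String id, String text) throws IOException {
        Document doc = new Document();
        doc.add(new StringField("article_id", id, Field.Store.YES));
        doc.add(new TextField("article_text", text, Field.Store.NO));
        w.addDocument(doc);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(search("dress"));
    }
}
```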

How to build my own "common word" filter with Lucene

I know that Lucene uses a stop word (common word) filter for searching, and I also know that the Standard Analyzer or EnglishAnalyzer is responsible for this job. What if I want to add my own common words to the analyzer's filter? How could I add words like computer, internet, system, etc.?
I assume by "common words" you mean stopwords.
In order to add to the standard list, just use another constructor of StandardAnalyzer (which accepts stopwords as either a CharArraySet or a Reader). To get the standard stopword set, use StopAnalyzer.ENGLISH_STOP_WORDS_SET (EnglishAnalyzer.ENGLISH_STOP_WORDS_SET in recent versions).
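A minimal sketch (assuming a recent Lucene, where the stop set lives on EnglishAnalyzer; the example words are the ones from the question):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class CustomStopWords {

    // Tokenize text with a StandardAnalyzer whose stop set is the standard
    // English list plus our own domain-specific common words.
    public static List<String> tokens(String text) throws IOException {
        CharArraySet stopWords = new CharArraySet(EnglishAnalyzer.ENGLISH_STOP_WORDS_SET, true);
        stopWords.addAll(Arrays.asList("computer", "internet", "system"));
        List<String> out = new ArrayList<>();
        try (Analyzer analyzer = new StandardAnalyzer(stopWords);
             TokenStream ts = analyzer.tokenStream("f", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) out.add(term.toString());
            ts.end();
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        // "the" is a standard stop word; "computer" and "system" are ours.
        System.out.println(tokens("the computer system crashed again"));
    }
}
```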

Displaying sample text from the Lucene Search Results

Currently, I am using Lucene version 3.0.2 to create a search application that is similar to a dictionary. One of the things I want to display is a sort of "example", where Lucene would look for a word in a book and then display the sentences where the word was used.
I've been reading the Lucene in Action book and it mentions something like this, but looking through it I can't find other mentions. Is this something you can do with Lucene? If it is, how can you do it?
I believe what you are looking for is a Highlighter.
One possibility is to use the lucene.search.highlight package, specifically the Highlighter.
Another option is to use the lucene.search.vectorhighlight package, specifically the FastVectorHighlighter.
Both classes search a text document, choose relevant snippets, and display them with the matching terms highlighted. I have only used the first one, which worked fine for my use case. If you can pre-divide the book into shorter parts, it will make highlighting faster.
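A minimal sketch of the first option (assuming a recent Lucene with the highlighter module; the field name, markup tags, and sample text are just for illustration, though the same classes exist in the 3.x line the question mentions):

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;

public class SnippetDemo {

    // Return the best-scoring fragment of `text` for a single-term query,
    // with the matching word wrapped in <b> tags.
    public static String snippet(String term, String text)
            throws IOException, InvalidTokenOffsetsException {
        Query query = new TermQuery(new Term("content", term));
        Highlighter highlighter =
            new Highlighter(new SimpleHTMLFormatter("<b>", "</b>"), new QueryScorer(query));
        try (StandardAnalyzer analyzer = new StandardAnalyzer()) {
            return highlighter.getBestFragment(analyzer, "content", text);
        }
    }

    public static void main(String[] args) throws Exception {
        String book = "The latest Nokia phones are very popular. Many people buy them.";
        System.out.println(snippet("nokia", book));
    }
}
```

Note that the query term is lower-cased to match the analyzed tokens, while the returned fragment keeps the original casing of the source text.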
