Currently, I am using Lucene version 3.0.2 to create a search application that is similar to a dictionary. One of the things that I want to display is a sort of "example", where Lucene would look for a word in a book and then display the sentences in which that word was used.
I've been reading the Lucene in Action book and it mentions something like this, but looking through it I can't find other mentions. Is this something you can do with Lucene? If so, how can you do it?
I believe what you are looking for is a Highlighter.
One possibility is to use the lucene.search.highlight package, specifically the Highlighter.
Another option is to use the lucene.search.vectorhighlight package, specifically the FastVectorHighlighter.
Both classes search a text document, choose relevant snippets and display them with the matching terms highlighted. I have only used the first one, which worked fine for my use-case. If you can pre-divide the book into shorter parts, it would make highlighting faster.
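To make the first option concrete, here is a minimal, untested sketch against the Lucene 3.x highlighter API; the field name "contents", the query string, and the fragment size are placeholders you would adapt to your index:

```java
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.search.highlight.SimpleSpanFragmenter;
import org.apache.lucene.util.Version;

public class HighlightExample {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
        Query query = new QueryParser(Version.LUCENE_30, "contents", analyzer)
                .parse("dictionary");

        QueryScorer scorer = new QueryScorer(query);
        Highlighter highlighter =
                new Highlighter(new SimpleHTMLFormatter("<b>", "</b>"), scorer);
        // Fragment size roughly the length of a sentence; tune for your book text.
        highlighter.setTextFragmenter(new SimpleSpanFragmenter(scorer, 100));

        String text = "...the full text of the book chapter...";
        TokenStream stream = analyzer.tokenStream("contents", new StringReader(text));
        // Up to 3 snippets containing the query terms, wrapped in <b> tags.
        String[] snippets = highlighter.getBestFragments(stream, text, 3);
        for (String s : snippets) {
            System.out.println(s);
        }
    }
}
```

Pre-dividing the book into shorter documents (e.g. chapters or paragraphs) keeps the text passed to getBestFragments small, which is what makes highlighting faster.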
I'm trying to implement the Lesk algorithm for word sense disambiguation using WordNet and its Java API JWI. One of the steps requires building a bag of words from the gloss and example sentences of the target word. I can easily get the gloss from the method getGloss() in class ISynset, but I don't see a method to get the example sentences. I'm sure I'm missing something obvious, since JWI is described as "full-featured" on WordNet's site, but I can't find anything useful in the documentation or on the internet. How do I get those sentences?
It may not be there. Examples are attached to synsets (e.g. in the NLTK API they are a sibling function to getting lemmas and definitions), but the 2.4.0 JWI docs for synset only list getGloss() and getWords().
(If it turns out there is a way to get them from JWI, can someone leave me a comment, and I'll then delete this answer.)
Is it possible to create a summary of a large document using some out-of-the-box search engine, like Lucene, Solr or Sphinx, and search for the documents most relevant to a query?
I don't need to search inside the document or create a snippet. Just get 5 documents best matching the query.
Update. More specifically, I don't want an engine to keep the whole document, but only its "summary" (you may call it index information or TF-IDF representation).
To answer your updated question: Lucene/Solr fits your needs. For the "summary", you have the option of not storing the original text by specifying:
org.apache.lucene.document.Field.Store.NO
By indexing the "summary" as an org.apache.lucene.document.TextField, the summary will be indexed and tokenized. The index will hold the TF-IDF information you need to search.
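A hedged sketch of this against the Lucene 4 document API; the field name "summary", the sample text, and the writer setup (in-memory directory, Lucene 4.7 version constant) are illustrative choices, not requirements:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class SummaryIndexer {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig(
                Version.LUCENE_47, new StandardAnalyzer(Version.LUCENE_47));
        try (IndexWriter writer = new IndexWriter(new RAMDirectory(), config)) {
            Document doc = new Document();
            // Indexed and tokenized for search, but the original text is NOT stored:
            doc.add(new TextField("summary", "the summary text of the document",
                    Field.Store.NO));
            writer.addDocument(doc);
        }
    }
}
```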
Basically, if you want a summarization feature, there are plenty of ways to do it, for example TextRank (there is a big article on the wiki, and plenty of implementations are available in NLTK and elsewhere). However, it will not help you with querying; you will need to index the result somewhere anyway.
I think you could achieve something like this using a feature called More Like This. It exists in Lucene, Solr and Elasticsearch. The idea behind it is that if you send a query (which is the raw text of the document), the search engine will find the most suitable documents by extracting the most relevant words from it (which reminds me of summarization) and then looking inside the inverted index for the top N similar documents. It will not discard the text, though; it performs a "like" operation based on TF-IDF metrics.
References for MLT in Elasticsearch, Lucene, Solr
What you are looking for seems quite standard:
Apache Lucene [1], if you are looking for a library
Apache Solr or Elasticsearch, if you are looking for a production-ready enterprise search server
The way a Lucene search engine works [2] is by building an inverted index of each field in your document (plus a set of additional data structures required by other features).
What you apparently don't want to do is store the content of a field, which means taking the text content and storing it in full (compressed) in the index, to be retrieved later.
In Lucene and Solr this is a matter of configuration.
Summarisation is a completely different NLP task and is probably not what you need.
Cheers
[1] http://lucene.apache.org/index.html
[2] https://sease.io/2015/07/26/exploring-solr-internals-the-lucene-inverted-index/
I have a Java (Lucene 4) based application and a set of keywords fed into the application as a search query (the terms may include more than one word, e.g. "memory", "old house", "European Union law", etc.).
I need a way to get the list of matched keywords out of an indexed document and possibly also get keyword positions in the document (also for the multi-word keywords).
I tried the Lucene highlight package, but I need to get only the keywords without any surrounding portion of text, and it also returns multi-word keywords in separate fragments.
I would greatly appreciate any help.
There's a similar (possibly same) question here:
Get matched terms from Lucene query
Did you see this?
The solution suggested there is to disassemble a complicated query into simpler queries until you get down to a TermQuery, and then check each one via searcher.explain(query, docId) (because if it matches, you know the document contains that term).
I don't think it's very efficient, but it worked for me until I ran into SpanQueries. It might be enough for you.
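A sketch of that approach against the Lucene 4 API; as noted above it only handles the BooleanQuery/TermQuery case, and other query types would need branches of their own:

```java
import java.io.IOException;
import java.util.Set;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class MatchedTerms {
    /** Recursively disassemble `query`, collecting the terms that match document `docId`. */
    static void collect(Query query, IndexSearcher searcher, int docId,
                        Set<Term> matched) throws IOException {
        if (query instanceof BooleanQuery) {
            for (BooleanClause clause : ((BooleanQuery) query).getClauses()) {
                collect(clause.getQuery(), searcher, docId, matched);
            }
        } else if (query instanceof TermQuery) {
            // explain() tells us whether this single term matches the document.
            if (searcher.explain(query, docId).isMatch()) {
                matched.add(((TermQuery) query).getTerm());
            }
        }
        // PhraseQuery, SpanQuery, ... are not handled here.
    }
}
```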
Just wanted to know how you would do it.
I have a web service that lets me complete a user's address while they are typing it.
When the suggestions are shown, I'd like the part of the suggestion label that matches the user input to be surrounded with bold tags.
I want the "matching" to be clever, and not just a simple search/replace, since the web service we use is clever too (but I don't have that code).
For example:
Input: 3 OxFôr sTrE
Ws result: 3 Oxford Street
Formatted: <b>3 Oxford Stre</b>et
Formatted: [bold]3 Oxford Stre[/bold]et
I can do it in JS or Java.
I'd rather do it in JS but with Java perhaps Lucene can help?
Do you see how it can be handled?
Index your text using n-grams, with a search engine or a custom data structure. I am implementing auto-recommendation by indexing around 1 billion query words with n-grams, and when displaying I sort the suggestions by the frequency of each typed query. Lucene/Solr can help you here: the highlighting you asked for will be enclosed in tags by default, and you can also exploit the n-gram indexing feature that Lucene/Solr provides.
LinkedIn Engineering recently open-sourced Cleo (the open-source technology behind LinkedIn's typeahead search): Link.
Great stuff by LinkedIn. Check it out for the clever matching and highlighting you want.
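In plain Java (no Lucene), here is a minimal sketch of the accent- and case-insensitive prefix matching the question describes. It folds both strings with java.text.Normalizer, matches the input token by token against the suggestion, and assumes the suggestion itself contains no accented characters (so folded and original token lengths line up):

```java
import java.text.Normalizer;

public class MatchHighlighter {
    // Strip accents and lowercase, e.g. "OxFôr" -> "oxfor".
    static String fold(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{M}", "")
                .toLowerCase();
    }

    /** Wraps the matched prefix of `suggestion` in <b> tags, matching token by token. */
    static String highlight(String input, String suggestion) {
        String[] inTokens = fold(input).trim().split("\\s+");
        String[] sugTokens = suggestion.split("\\s+");
        int matchedEnd = 0; // end offset (exclusive) of the match in `suggestion`
        int pos = 0;        // current scan position in `suggestion`
        for (int i = 0; i < inTokens.length && i < sugTokens.length; i++) {
            int start = suggestion.indexOf(sugTokens[i], pos);
            if (!fold(sugTokens[i]).startsWith(inTokens[i])) break;
            // Assumes a 1:1 char mapping after folding (true for unaccented suggestions).
            matchedEnd = start + inTokens[i].length();
            pos = start + sugTokens[i].length();
        }
        if (matchedEnd == 0) return suggestion;
        return "<b>" + suggestion.substring(0, matchedEnd) + "</b>"
                + suggestion.substring(matchedEnd);
    }

    public static void main(String[] args) {
        // Reproduces the example from the question.
        System.out.println(highlight("3 OxFôr sTrE", "3 Oxford Street"));
        // -> <b>3 Oxford Stre</b>et
    }
}
```

The same normalize-then-prefix-match idea ports directly to JS with String.prototype.normalize.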
I need to extract the keywords of a search query. For example, suppose one searches for "latest popular Nokia phones". I want to extract the keywords of this phrase. Are there any libraries written in Java to get this done?
AFAIK Apache Lucene does such a thing (removing words like a, an, the and so on). It also provides a Java-based indexing and search implementation, as well as spellchecking, hit highlighting and advanced analysis/tokenization capabilities, and much more.
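To illustrate the idea, here is a tiny plain-Java sketch of the stop-word removal step; the stop list is a made-up sample, whereas Lucene's StandardAnalyzer does the same thing with a proper tokenizer and a fuller stop-word set:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class KeywordExtractor {
    // A tiny illustrative stop-word list, not Lucene's real one.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "the", "of", "for", "and", "or", "in", "on", "to"));

    /** Lowercase, split on non-word characters, and drop stop words. */
    static List<String> keywords(String query) {
        List<String> out = new ArrayList<>();
        for (String tok : query.toLowerCase().split("\\W+")) {
            if (!tok.isEmpty() && !STOP_WORDS.contains(tok)) out.add(tok);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(keywords("latest popular Nokia phones"));
        // -> [latest, popular, nokia, phones]
    }
}
```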