Is it possible to create a summary of a large document using an out-of-the-box search engine such as Lucene, Solr or Sphinx, and then search for the documents most relevant to a query?
I don't need to search inside the document or create a snippet. I just need the 5 documents that best match the query.
Update. More specifically, I don't want the engine to keep the whole document, only its "summary" (you may call it index information or a TF-IDF representation).
To answer your updated question: Lucene/Solr fit your needs. For the 'summary', you have the option of not storing the original text by specifying:
org.apache.lucene.document.Field.Store.NO
If you save 'summary' as an org.apache.lucene.document.TextField, it will be tokenized and indexed. The index keeps the TF-IDF information you need for searching.
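To make the "indexed but not stored" idea concrete, here is a minimal sketch in plain Python. It is not Lucene code: the class name TinyIndex and its scoring formula are assumptions made for illustration. It keeps only per-term counts, discards the original text, and still returns the best-matching document ids.

```python
import math
import re
from collections import Counter, defaultdict


class TinyIndex:
    """Toy inverted index (hypothetical): keeps term statistics, never the text."""

    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {doc_id: term frequency}
        self.doc_count = 0

    def add(self, doc_id, text):
        # Tokenize and count terms; the original text is not kept anywhere.
        counts = Counter(re.findall(r"\w+", text.lower()))
        for term, tf in counts.items():
            self.postings[term][doc_id] = tf
        self.doc_count += 1

    def search(self, query, top_n=5):
        # Simple TF-IDF sum per document (an assumed formula, not Lucene's scoring).
        scores = defaultdict(float)
        for term in re.findall(r"\w+", query.lower()):
            docs = self.postings.get(term, {})
            if not docs:
                continue
            idf = math.log(1 + self.doc_count / len(docs))
            for doc_id, tf in docs.items():
                scores[doc_id] += tf * idf
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        return [doc_id for doc_id, _ in ranked[:top_n]]
```

Calling add() for each document and then search("some query") returns up to five ids, which is all the question asks for; the text itself would have to come from wherever the documents originally live.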
Basically, if you want a summarization feature, there are plenty of ways to do it, for example TextRank (there is a big article on the wiki), with tons of implementations available in NLTK and others. However, it will not help you with querying; you will still need to index the result somewhere.
I think you could achieve something like this using a feature called More Like This, which exists in Lucene, Solr and Elasticsearch. The idea is that you send a query consisting of the raw text of a document; the search engine extracts the most relevant words from it (which is reminiscent of summarization) and then looks them up in the inverted index to find the top N similar documents. It will not discard the text, though; it performs a "like" operation based on TF-IDF metrics.
References for MLT in Elasticsearch, Lucene, Solr
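The two steps described above (pick the most relevant words of the input text, then look for documents that share them) can be sketched as follows. This is plain Python and only an illustration of the idea; the function name more_like_this, its parameters and the weighting are assumptions, not the MLT API of any of the three engines.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def more_like_this(source_text, corpus, max_query_terms=3, top_n=5):
    """corpus: dict of doc_id -> text. Returns ids of documents similar to source_text."""
    doc_terms = {doc_id: Counter(tokenize(text)) for doc_id, text in corpus.items()}
    n_docs = len(doc_terms)

    def idf(term):
        df = sum(1 for counts in doc_terms.values() if term in counts)
        return math.log(1 + n_docs / df) if df else 0.0

    # Step 1: keep only the highest-weighted terms of the source text.
    source_counts = Counter(tokenize(source_text))
    weighted = {t: tf * idf(t) for t, tf in source_counts.items()}
    query_terms = sorted(weighted, key=lambda t: (-weighted[t], t))[:max_query_terms]
    query_terms = [t for t in query_terms if weighted[t] > 0]

    # Step 2: score every document by the selected terms it contains.
    scores = {}
    for doc_id, counts in doc_terms.items():
        score = sum(counts[t] * idf(t) for t in query_terms)
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]
```

Terms that appear in no indexed document get a weight of zero and are dropped, so only words the index actually knows about drive the similarity.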
but only its "summary" (you may call it index information or a TF-IDF representation).
What you are looking for seems quite standard:
Apache Lucene [1], if you are looking for a library
Apache Solr or Elasticsearch, if you are looking for a production-ready enterprise search server.
A Lucene search engine works [2] by building an inverted index for each field in your document (plus a set of additional data structures required by other features).
What you apparently don't want to do is store the content of a field, which means taking the text content and keeping it in full (compressed) in the index so that it can be retrieved later.
In Lucene and Solr this is a matter of configuration.
Summarisation is a completely different NLP task and is probably not what you need.
Cheers
[1] http://lucene.apache.org/index.html
[2] https://sease.io/2015/07/26/exploring-solr-internals-the-lucene-inverted-index/
Related
I want to build a keyword search. I looked at the Google App Engine API and the Lucene API, but my problem is this: I have some articles, let's say 5000, and each article has a unique ID. If a user searches with a keyword, the program should return the IDs of all articles that contain this keyword.
Second, if a user searches with a keyword, e.g. dress, it should return the articles that contain the keywords dress, dressing, dressed, etc.
This is what the Search API is designed for.
While it has some limitations, for your basic use case it should suffice. If you want to use Lucene, you will need to run it on another platform (or heavily customise it) because it uses the file system.
For your requirement to find similar words, you can read about stemmed queries here
Use Lucene, which is a high-performance, full-featured text search engine library. Index each article as a separate Lucene document with a unique field article_id, and index the article text in a field article_text. Apply StopFilter, PorterStemFilter, etc. to the article_text field. After indexing, you are ready to search for keywords.
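As a rough illustration of what the stop-word and stemming filters buy you, here is a small sketch in plain Python. It is not Lucene: crude_stem is a deliberately naive suffix stripper standing in for a real Porter stemmer, and the stop-word list, function names and sample articles are all made up for the example.

```python
import re
from collections import defaultdict

STOP_WORDS = {"a", "an", "the", "is", "of", "and", "in", "to"}  # illustrative only


def crude_stem(word):
    # Naive suffix stripping; NOT the Porter algorithm, just enough for the example.
    for suffix in ("ing", "ed", "es"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def analyze(text):
    # Lowercase, split into words, drop stop words, then stem what is left.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def build_index(articles):
    """articles: dict of article_id -> article_text. Returns term -> set of ids."""
    index = defaultdict(set)
    for article_id, article_text in articles.items():
        for term in analyze(article_text):
            index[term].add(article_id)
    return index


def search(index, keyword):
    # The keyword goes through the same analysis as the indexed text.
    ids = set()
    for term in analyze(keyword):
        ids |= index.get(term, set())
    return sorted(ids)
```

Because both the articles and the query are reduced to the same stem, a search for dress also finds articles containing dressing or dressed, and the result is just the list of article ids.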
Imagine that I am building a hashtag search. My main indexed type is called Post, which has a list of Hashtag items, which are marked as IndexedEmbedded. Separately, every post has a list of Comment objects, each of which, again, contains a list of Hashtag objects.
On the search side, I am using a MultiFieldQueryParser, to which I pass a long list of possible search fields, including some nested fields like:
hashTags.value and
comments.hashTags.value
Now, the interesting thing happens when I want to search for something, say #architecture. I know where the hashtags are, so the simplest logical thing to do would be to convert a query of the type #architecture into one of the type hashTags.value:architecture or comments.hashTags.value:architecture. Although possible, this is very inflexible. What if I come up with yet another field that contains hashtags? I'd have to include that too.
Is there a general way to do this?
P.S. Please keep in mind that the root type I am searching for is Post, because this is the kind of result I'd like to get.
Hashtags are keywords, and you should let Lucene handle the text analysis to extract the hashtags from your main text and store them in a custom field.
You can do this very easily with Hibernate Search by indexing your text into two different @Field definitions (using the @Fields annotation). You could have one field named comments and another named commentsHashtags.
You then apply a custom Analyzer to commentsHashtags which does some standard tokenization and discards any term not starting with #; you can define one easily by taking the standard tokenizer and applying a custom filter.
When you run a query, you don't have to write custom code to look for hashtags in the query input: let it be processed by the same Analyzer (which is the default anyway) and target both fields. You can even boost the hashtags more if that makes sense.
With this solution you:
take advantage of the high efficiency of Search's text analysis
avoid entities and tables on the database containing the hashtags: useless overhead
avoid messing with free text extraction
It also gives you another strong advantage:
you can then open a raw IndexReader and load the term vectors from commentsHashtags to get both a list of all used tags and metrics about them. That is handy for some data mining, or just for visualizing a tag cloud.
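The hashtag-only analysis described above can be sketched in a few lines of plain Python. This is not Hibernate Search or Lucene code; hashtag_terms, index_posts, search_posts and tag_cloud are hypothetical names, and the choice to lowercase tags and drop the leading # is an assumption of the sketch.

```python
import re
from collections import Counter


def hashtag_terms(text):
    """Tokenize, then keep only tokens starting with '#', lowercased, without the '#'."""
    tokens = re.findall(r"#?\w+", text.lower())
    return [t[1:] for t in tokens if t.startswith("#")]


def index_posts(posts):
    """posts: dict of post_id -> list of free-text strings (post body and its comments)."""
    index = {}
    for post_id, texts in posts.items():
        for text in texts:
            for tag in hashtag_terms(text):
                index.setdefault(tag, set()).add(post_id)
    return index


def search_posts(index, query):
    # The query goes through the same analysis, so no special-case parsing is needed.
    ids = set()
    for tag in hashtag_terms(query):
        ids |= index.get(tag, set())
    return sorted(ids)


def tag_cloud(index):
    # Number of posts per tag: the kind of metric a tag cloud could be drawn from.
    return Counter({tag: len(ids) for tag, ids in index.items()})
```

Any new piece of text that may contain hashtags only needs to be passed through hashtag_terms at indexing time; the query side does not change, and the results are always Post ids.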
Instead of treating the fields as different and the top-level document as Post, why not store both Posts and Comments as Lucene documents? That way, you can just have a single field called "hashtags" that you search. You should also have a field called "type" or something to differentiate between comments and posts.
Search results may be either comments or posts. You can filter by type if users want to search only posts or only comments, or you can show them differently in your UI.
If you want to add another concept that also uses hashtags (like ... I dunno... splanks or whatever silly name we all give to Internet communications in the future), then you can add it alongside the existing Post and Comment documents simply by indexing your new type with a "hashtags" field. You'll have to do plenty of work to add the splanks anyway, so adding a handler for that new type of search result shouldn't be too much of an inconvenience.
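A minimal sketch of this layout in plain Python, assuming each indexed document carries just an id, a type and a set of hashtags (the Doc class and search_hashtag function are hypothetical, not a Lucene API):

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    doc_id: str
    type: str  # e.g. "post" or "comment"; new types need no schema change
    hashtags: set = field(default_factory=set)


def search_hashtag(docs, tag, doc_type=None):
    """Return ids of documents carrying the tag, optionally filtered by type."""
    tag = tag.lstrip("#").lower()
    return [
        d.doc_id
        for d in docs
        if tag in d.hashtags and (doc_type is None or d.type == doc_type)
    ]
```

The single hashtags field is the only thing the search touches, so a new document type only has to populate it to become searchable.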
I need to extract the keywords of a search query. For example, suppose one searches for "latest popular Nokia phones". I want to extract the keywords of this phrase. Are there any libraries written in Java to get this done?
AFAIK Apache Lucene does this kind of thing (removing words like a, an, the and so on). It also provides a Java-based indexing and search implementation, as well as spellchecking, hit highlighting, advanced analysis/tokenization capabilities and much more.
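The stop-word removal mentioned above boils down to something like the following sketch. It is plain Python rather than Lucene, and both the stop-word list and the function name extract_keywords are assumptions made for the example.

```python
import re

# A small illustrative stop-word list; real analyzers ship their own.
STOP_WORDS = frozenset({
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
})


def extract_keywords(query):
    # Lowercase, split into words and drop anything in the stop-word list.
    tokens = re.findall(r"\w+", query.lower())
    return [t for t in tokens if t not in STOP_WORDS]
```

Note that with a list like this one, "latest popular Nokia phones" comes back unchanged apart from lowercasing, because none of its words is a stop word; dropping words such as latest or popular would need a custom list.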
I have a search autocomplete on my site, and I'm using Solr to find matching documents. I am trying to get partial matches on page titles, so for example Java* would match Java, Javascript, etc. As of right now, the autocomplete is set up to give me partial matches on all of the text in the page, which gives some weird results, so I've decided to switch over to using the page title. However, when I try to switch the search term from text for the page text to title, the query suddenly does not pick up partial matches any more. Here is an example of my original query:
q=text:java^2+text:"java"
&hl=true&hl.snippets=1&hl.fragsize=25&hl.fl=title&start=0&rows=3
Unfortunately, the guy who set this up for me does not work with me any more, so I have little idea what's going on 'under the hood'. I'm using Spring/J2EE for my backend, if that makes any difference.
You need to make sure that the field is not a string-based field; you can check this in your schema.xml. If you search for Java* in a string field, it will match only titles that start with Java.
Another thing to be aware of is that wildcard queries are case sensitive (see this).
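Both points can be seen in a toy model. The sketch below is plain Python, not Solr; it assumes a string field keeps the whole title as a single term, that the tokenized field lowercases its terms, and that a prefix query is compared to the indexed terms exactly as typed. The function names are made up.

```python
import re


def index_as_string(title):
    # A string-type field keeps the whole value as one unanalyzed term.
    return [title]


def index_as_text(title):
    # A tokenized field splits into words; lowercasing is an assumed analyzer step.
    return re.findall(r"\w+", title.lower())


def prefix_match(terms, prefix):
    # The prefix is compared to indexed terms as-is, so the match is case sensitive.
    return any(term.startswith(prefix) for term in terms)


title = "Learning Javascript"
print(prefix_match(index_as_string(title), "Java"))  # whole title does not start with Java
print(prefix_match(index_as_text(title), "Java"))    # indexed terms are lowercase
print(prefix_match(index_as_text(title), "java"))    # lowercase prefix finds "javascript"
```

Under these assumptions, a title field that is tokenized and lowercased, combined with lowercasing the user's input before appending the *, gives the partial matching the question is after.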
It depends on how the title field was analyzed. Look at schema.xml to see what type the field is and how it is analyzed to create terms. An easy way to do that is to go to the Solr admin page at http://localhost:8983/solr/admin/analysis.jsp, choose the name option, type in the field name (I am guessing 'title'), and enter some sample text and a query to see which terms are created and matched.
Currently, I am using Lucene version 3.0.2 to create a search application that is similar to a dictionary. One of the objects that I want to display is a sort of "example", where Lucene would look for a word in a book and then the sentences where the words were used are displayed.
I've been reading the Lucene in Action book and it mentions something like this, but looking through it I can't find other mentions. Is this something you can do with Lucene? If it is, how can you do it?
I believe what you are looking for is a Highlighter.
One possibility is to use the lucene.search.highlight package, specifically the Highlighter.
Another option is to use the lucene.search.vectorhighlight package, specifically the FastVectorHighlighter.
Both classes search a text document, choose relevant snippets and display them with the matching terms highlighted. I have only used the first one, which worked fine for my use-case. If you can pre-divide the book into shorter parts, it would make highlighting faster.
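To show the kind of output a highlighter produces, here is a rough stand-in in plain Python. It does not use either Lucene class; the function name example_sentences, the naive sentence splitting and the <b> markers are assumptions of the sketch.

```python
import re


def example_sentences(text, word, max_examples=3):
    """Return up to max_examples sentences containing word, with matches marked."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    hits = []
    for sentence in sentences:
        if pattern.search(sentence):
            hits.append(pattern.sub(lambda m: "<b>" + m.group(0) + "</b>", sentence))
            if len(hits) == max_examples:
                break
    return hits
```

For a dictionary-style application, pre-splitting the book into sentences or paragraphs at indexing time, as suggested above, means only the short matching parts ever need to be scanned like this.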