I'm pretty new to Lucene, so forgive me in advance if some of my terminology is wrong.
Lucene offers different types of fields (keyword, text, unstored, unindexed), but it seems it also supports numeric fields such as NumericField, IntField and FloatField.
Now, I'm wondering if "the closer the better" functionality exists/or is easy to implement in Lucene:
I want to store the creation_date of a document as a Unix timestamp in a float field.
Then I want to be able to compare the Unix time given in a query with the indexed Unix time of each document.
Instead of a range query (which checks whether a value lies between particular bounds) or a boolean query (which checks whether the values are the same), I want the score to reflect similarity based on the gap between the Unix times: if the timespan is small, the document should end up with a higher score than if the timespan is large. Preferably this shouldn't be linear but, for example, exponential. So, as the title of this question says: the closer, the better.
I've noticed that Elasticsearch, which uses Lucene at its core, offers decay function scores. Is this the behaviour I'm looking for, and is it present in Lucene?
Lastly, I'm wondering: can one combine this type of scoring with the default tf-idf scoring used to query the body of the documents, so that the final score is a combination of the timespan score and the textual similarity of the bodies?
I don't think you get this out of the box like in Elasticsearch. You could always try to add it yourself as a module; these algorithms are widely available on the internet.
You could also use the boosting and negative-boosting systems in Lucene in combination with the existing ranking system and experiment to see whether that gives you the sort of results you want. I am doing that on Apache Solr and it's working like a charm :)
On your last point: a tf-idf module is available in Solr. If it isn't already in Lucene, just copy it from Solr, add it as a module in Lucene, and combine your own module with the tf-idf module to achieve a combined result.
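For what it's worth, newer Lucene versions (8+) ship an expressions module that can express this kind of decay directly, without writing a whole module. The following is a minimal sketch, not a definitive solution: the field name creation_date, the one-day scale and the exponential shape are all assumptions to adapt. FunctionScoreQuery.boostByValue multiplies the tf-idf score of the wrapped text query by the decay factor, which also gives you the combined score asked about in the last paragraph.

import org.apache.lucene.expressions.Expression;
import org.apache.lucene.expressions.SimpleBindings;
import org.apache.lucene.expressions.js.JavascriptCompiler;
import org.apache.lucene.queries.function.FunctionScoreQuery;
import org.apache.lucene.search.DoubleValuesSource;
import org.apache.lucene.search.Query;

public class DecayScoringSketch {
    public static Query decayBoosted(Query textQuery, long queryUnixTime) throws Exception {
        // Exponential decay: the factor shrinks as the gap between the document's
        // creation_date and the query time grows ("the closer, the better").
        Expression decay = JavascriptCompiler.compile(
                "exp(-abs(creation_date - query_time) / 86400)"); // scale of one day in seconds (assumed)
        SimpleBindings bindings = new SimpleBindings();
        // creation_date is assumed to be indexed as a long field holding Unix time;
        // a double field would use fromDoubleField instead.
        bindings.add("creation_date", DoubleValuesSource.fromLongField("creation_date"));
        bindings.add("query_time", DoubleValuesSource.constant(queryUnixTime));
        // Multiply the text query's tf-idf score by the decay factor.
        return FunctionScoreQuery.boostByValue(textQuery, decay.getDoubleValuesSource(bindings));
    }
}

Multiplication is only one way to combine the two signals; Elasticsearch's decay functions expose multiply/sum/avg modes, and you can get the same effect here by changing the expression.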
Related
I'm looking for a search engine in which I can use my own similarity measure and tokenization approaches. Lucene is often suggested as a good choice for this purpose, but I don't know much about it. I've searched the internet for tutorials on recent versions of Lucene, but most of the pages are from a few years ago. Some of my questions are as follows:
Is it possible to change the similarity measure, tokenization, and stemming approaches and use self-built classes in Lucene? If yes, how?
Is there any difference between how the text should be indexed for keyword search versus phrasal search? Should I build two different indexes, one for keyword search and one for phrasal search? (I think that if I remove stop words it will affect the results of phrasal search, and if I don't remove stop words it will affect the results of keyword search, won't it?)
Any information about this topic is appreciated.
This is possible, yes, and we do it in a couple of solutions at my workplace. Here is a reasonable tutorial on how to do this. The tutorial uses Solr, which is a good Lucene implementation. To answer your questions directly:
Yes, there is a way to do this by overriding interfaces and providing your own implementations (see the tutorial). Tokenization can often be handled within Solr's default configuration without overriding classes, depending on how exotic your tokenization needs to be.
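To make the override route concrete, here is a minimal sketch of a custom similarity, assuming the classic tf-idf scoring (ClassicSimilarity in recent Lucene); the flattened tf is a toy choice, just to show where your own measure plugs in:

import org.apache.lucene.search.similarities.ClassicSimilarity;

// Toy similarity: term frequency saturates immediately, so a term occurring
// ten times scores no higher than a term occurring once.
public class BinaryTfSimilarity extends ClassicSimilarity {
    @Override
    public float tf(float freq) {
        return freq > 0 ? 1.0f : 0.0f; // ClassicSimilarity's default is sqrt(freq)
    }
}

The same Similarity should be set on both sides, via IndexWriterConfig.setSimilarity at index time and IndexSearcher.setSimilarity at query time, so that index-time statistics match what the searcher expects. In Solr the equivalent is a similarity element in the schema.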
Yes. Making an index that returns accurate results is largely a matter of understanding how your users will search it. That said, a large part of the complexity in how queries behave comes from people wanting the best-matching results to float to the top of the results list, which is done via scoring. Since it sounds like you're looking to override the scoring, that may not matter for you. You should note, though, that by default Lucene will score a document with hits across multiple fields higher than a single exact match on one field. That means that if you store data across many fields (and search across many fields by default), your search will get less and less "accurate".
Full-text search against a single field tends to be pretty accurate for both phrase and word queries, but you'll end up with a pretty large index.
I am currently working on a Java project in NLP/IR and am fairly new to this.
The project involves a collection of around 1000 documents, where each document has about 100 words, structured as a bag of words with term frequencies. I want to find similar documents based on a given document from the collection.
My plan is to use TF-IDF: calculate tf-idf for the query (a given document) and for every other document in the collection, then compare these values as vectors using cosine similarity. Would this give some insight into their similarity? Or would it be unreasonable because the query (a whole document) is so big?
Are there any other similarity measures that could work better?
Thanks for the help
TF-IDF-based similarity, typically using cosine similarity to compare a vector representing the query terms to a set of vectors representing the TF-IDF values of the documents, is a common approach to calculating "similarity".
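As a concrete illustration of that computation (one common tf-idf weighting among several; Lucene's own formula adds further normalizations), here is a self-contained sketch over bag-of-words documents:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TfIdfCosine {
    // idf(t) = log(N / df(t)), computed over the whole collection.
    static Map<String, Double> idf(List<Map<String, Integer>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (Map<String, Integer> doc : docs)
            for (String term : doc.keySet()) df.merge(term, 1, Integer::sum);
        Map<String, Double> idf = new HashMap<>();
        df.forEach((t, n) -> idf.put(t, Math.log((double) docs.size() / n)));
        return idf;
    }

    // Weight each term's frequency by its idf.
    static Map<String, Double> tfidf(Map<String, Integer> doc, Map<String, Double> idf) {
        Map<String, Double> v = new HashMap<>();
        doc.forEach((t, tf) -> v.put(t, tf * idf.getOrDefault(t, 0.0)));
        return v;
    }

    // Cosine of the angle between two sparse vectors: 1.0 = identical direction.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Double> e : a.entrySet())
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
        for (double x : a.values()) na += x * x;
        for (double x : b.values()) nb += x * x;
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }
}

Using sparse maps instead of dense arrays also sidesteps the dimensionality concern for a collection of this size: each comparison only touches the terms that actually occur in the two documents being compared.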
Mind that "similarity" is a very generic term. In the IR domain you typically speak of "relevance" instead. Texts can be similar on many levels: written in the same language, using the same characters, using the same words, talking about the same people, using a similarly complex grammatical structure, and much more. Consequently, there are many, many measures. Search the web for text similarity to find many publications, but also open-source frameworks and libraries that implement different measures.
Today, "semantic similarity" is attracting more interest than the traditional keyword-based IR models. If this is your area of interest, you might look into the results of the SemEval shared tasks years 2012-2015.
If all you want is to compare two documents using TF-IDF, you can do that. Since you mention that each document contains about 100 words, in the worst case there might be 1000*100 unique words. So I'm assuming your vectors are built over all unique words (since all documents should be represented in the same dimension). If the number of unique words is too high, you could try a dimensionality-reduction technique (like PCA) to reduce the dimensions. But what you are trying to do is right; you can always compare documents like this to find the similarity between them.
If you want similarity more in the sense of semantics, you should look at LDA (topic modelling) type techniques.
I want to use Lucene to process several million news documents. I'm quite new to Lucene, so I'm trying to learn more and more about how it works.
Through several tutorials around the web I found the class TopScoreDocCollector to be highly relevant for querying a Lucene index.
You create it like this
int hitsPerPage = 10000;
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
and it will later collect the results of your query (only as many as you defined in hitsPerPage). I initially thought the results taken in would be more or less randomly distributed (say you have 100,000 documents that match your query and you just get a random 10,000 of them). I now think I was wrong.
After reading more and more about Lucene I came to the javadoc of the class (please see here). Here it says
A Collector implementation that collects the top-scoring hits,
returning them as a TopDocs
So it now seems to me that Lucene uses some very smart technique to return the top-scoring documents for my input query. But how does that Scorer work? What does it take into account? I've extended my research on this topic but have not yet found an answer I completely understood.
Can you explain to me how the Scorer in TopScoreDocCollector scores my news documents, and whether this can be of use to me?
Lucene uses an inverted index to produce an iterator over the list of doc IDs that match your query.
It then goes through each one of them and computes a score. By default that score is based on so-called tf-idf. In a nutshell, it takes into account how many times the terms of your query appear in the document, and also the number of documents that contain each term.
The idea is that if you search for (warehouse work), having the word work appear many times is not as significant as having the word warehouse appear many times.
Then, rather than sorting the whole set of matching documents, Lucene takes into account the fact that you really only need the top K documents or so. Using a heap (or priority queue), one can compute these top K with a complexity of O(N log K) instead of O(N log N).
That's the role of the TopScoreDocCollector.
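To illustrate the heap trick with a generic sketch (not Lucene's actual code): keep a min-heap of size K whose root is the weakest hit kept so far, and let every better hit evict it.

import java.util.PriorityQueue;

public class TopKSketch {
    // Returns the K largest scores in ascending order: O(N log K), not O(N log N).
    static float[] topK(float[] scores, int k) {
        PriorityQueue<Float> heap = new PriorityQueue<>(k); // min-heap: root = weakest kept hit
        for (float s : scores) {
            if (heap.size() < k) heap.add(s);
            else if (s > heap.peek()) { heap.poll(); heap.add(s); } // evict the weakest
        }
        float[] out = new float[heap.size()];
        for (int i = 0; i < out.length; i++) out[i] = heap.poll(); // polls in ascending order
        return out;
    }
}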
You can implement your own logic for a scorer (assigning a score to each document) or for a collector (aggregating the results).
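For the collector side, a minimal custom collector might look like this; this assumes Lucene 8.x APIs (SimpleCollector, Scorable, ScoreMode), and it merely counts hits and tracks the best score rather than keeping a top-K list:

import java.io.IOException;
import org.apache.lucene.search.Scorable;
import org.apache.lucene.search.ScoreMode;
import org.apache.lucene.search.SimpleCollector;

public class MaxScoreCollector extends SimpleCollector {
    private Scorable scorer;
    public int hits = 0;
    public float maxScore = Float.NEGATIVE_INFINITY;

    @Override
    public void setScorer(Scorable scorer) {
        this.scorer = scorer; // handed to us before collect() is called
    }

    @Override
    public void collect(int doc) throws IOException {
        hits++;
        maxScore = Math.max(maxScore, scorer.score());
    }

    @Override
    public ScoreMode scoreMode() {
        return ScoreMode.COMPLETE; // we need a score for every matching doc
    }
}

You pass it to IndexSearcher.search(query, collector) just like a TopScoreDocCollector.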
This might not be the best answer, since sooner or later someone will surely be available to explain Lucene's internal behaviour, but based on my days as a student there are two sides to "information retrieval": one is taking advantage of existing solutions such as Lucene, and the other is the whole theory behind it.
If you are interested in the latter, I recommend http://en.wikipedia.org/wiki/Information_retrieval as a starting point to get an overview and dig into the whole subject.
I personally think it is one of the most interesting fields, with huge potential, yet I never had the "academic hard skills" to really get into it.
To parametrize the available solutions it is crucial to have at least an overview of the theory. There are, for example, "challenges" in which information has been manually indexed and rated as a reference, so that the quality of a programmatic solution can be measured against it.
Based on such a challenge we managed to achieve slightly higher quality than "Lucene out of the box" after we fed Lucene with 4 different information bases (sorry, it's a few years back and I can barely remember, hence the missing keywords...), which were all produced by Lucene itself but with different parameters.
To come back to your question: I can answer none of it directly, but I hope to give you a basis for deciding whether you really need or want to know what's behind Lucene, or whether you'd rather just use it as a black box (and/or make it a grey box through parameterization).
Sorry if I got you totally wrong.
I am building an SMS-based application that retrieves railway schedules. The problem I am facing is that if the user types the wrong name of a particular station (suppose he writes 'Kolkta' instead of 'Kolkata'), my app will not be able to return the result of the query that is the nearest match. How can I do this? Is there an API in Java for it?
I guess Apache Lucene provides the support you want in Java.
Apache Lucene sounds promising, but if you want something more straightforward that you can cook at home very easily, you can try computing the minimal edit distance between the user input and the entire set of railway-station names. This is a measure of similarity between strings and can be computed very efficiently (especially in your case, where the strings are very short).
The link above contains a scary mathematical formula, but that is the nature of all formal representations. They are scary. Scroll a bit down and you will find the extremely short pseudocode for the algorithm (almost copy-paste).
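Transcribed to Java, that pseudocode becomes the classic dynamic-programming table below; the case folding and the brute-force scan over all stations are simplifying choices, fine for a station list of modest size:

public class NearestStation {
    // Levenshtein distance between a and b via the standard DP table.
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i; // i deletions
        for (int j = 0; j <= b.length(); j++) d[0][j] = j; // j insertions
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++) {
                int sub = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,   // delete
                                            d[i][j - 1] + 1),  // insert
                                   d[i - 1][j - 1] + sub);     // substitute
            }
        return d[a.length()][b.length()];
    }

    // Brute-force nearest match over the full station list.
    static String nearest(String input, String[] stations) {
        String best = null;
        int bestDist = Integer.MAX_VALUE;
        for (String s : stations) {
            int dist = editDistance(input.toLowerCase(), s.toLowerCase());
            if (dist < bestDist) { bestDist = dist; best = s; }
        }
        return best; // nearest("Kolkta", ...) -> "Kolkata" (distance 1)
    }
}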
So, I realize that this covers a wide array of topics, and pieces of it have been covered before on StackOverflow, such as in this question. Similarly, Partial String Matching and Approximate String Matching seem to be popular algorithmic discussions. However, using these ideas in conjunction for a problem that needs both seems highly inefficient. I'm looking for a way to combine the two problems into one solution, efficiently.
Right now, I'm using App Engine with Java and the persistent datastore. This is somewhat annoying, since it doesn't seem to allow any arithmetic in queries to make things easier, so I'm currently considering doing some precalculation and storing the result as an extra field in the database. Essentially, this is an idea a friend and I had for implementing a matching system, and I was more or less hoping for suggestions on how to make it more efficient. If it needs to be scrapped in favor of something better that already exists, I can handle that as well.
Let's start off with a basic example of what I'd look to do a search for. Consider the following nonsense sentence:
The isolating layer rackets the principal beneath your hypocritical rubbish.
If a user does a search for
isalatig pri
I would think that this would be a fairly good starting match for the string, and the value should be returned. The current method we are considering basically assigns each string a value used to test divisibility. Essentially, there is a table with the following data
A: 2 B: 3 C: 5
D: 7 E: 11 F: 13
...
with each character mapped to a prime number (repeated characters make no difference; only the first occurrence counts). If the value of the query string divides the value of a string in the database, that string is returned as a possible match.
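In code, the scheme described above amounts to a prime-product "signature" per string (a sketch of the idea as stated, using BigInteger since the products overflow fixed-size integers quickly):

import java.math.BigInteger;

public class PrimeSignature {
    static final int[] PRIMES = {2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
                                 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101};

    // Product of the primes for the distinct letters in s
    // (repeated letters count once, as described above).
    static BigInteger signature(String s) {
        boolean[] seen = new boolean[26];
        BigInteger sig = BigInteger.ONE;
        for (char c : s.toLowerCase().toCharArray())
            if (c >= 'a' && c <= 'z' && !seen[c - 'a']) {
                seen[c - 'a'] = true;
                sig = sig.multiply(BigInteger.valueOf(PRIMES[c - 'a']));
            }
        return sig;
    }

    public static void main(String[] args) {
        // The misspelled query's letters are a subset of the target word's letters,
        // so its signature divides the target's: flagged as a possible match.
        BigInteger doc = signature("isolating"), query = signature("isalatig");
        System.out.println(doc.mod(query).equals(BigInteger.ZERO)); // true
        // The weakness noted in the answer below: same letter set, same signature.
        System.out.println(signature("rental").equals(signature("antler"))); // true
    }
}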
After this, the keywords from the search string that aren't stopwords are checked to see whether they are prefixes of words in the possible match, within a given edit-distance threshold (currently the Levenshtein distance).
distance("isalatig", "isolating") == 2
distance("pri", "principal") == 0 // since principal has a starting
// substring of pri it passes
The total distance for each query is then ranked in ascending order, and the top n values are returned to the person doing the querying.
This is the basic idea behind the algorithm, though since this is my first time dealing with such a scenario, I realize I'm probably missing something very important (or my entire idea may be wrong). What is the best way to handle the situation I'm trying to implement? Similarly, if App Engine currently offers any utilities that address what I'm trying to do, please let me know.
First off, a clarification: App Engine doesn't allow arithmetic in queries because there's no efficient way to query on the result of an arbitrary arithmetic expression. When you do this in an SQL database, the planner is forced to select an inefficient query plan, which usually involves scanning all the candidate records one by one.
Your scheme will not work for the same reason: there's no way to index an integer such that you can efficiently query for all numbers divisible by your target number. Other potential issues include words that translate into numbers too large to store in a fixed-length integer, and being unable to distinguish between 'rental', 'learnt', and 'antler'.
If we set aside for the moment your requirement of matching arbitrary prefixes of strings, what you are looking for is full-text indexing, which is typically implemented using an inverted index and stemming. Support for full-text search is on the App Engine roadmap but hasn't been released yet; in the meantime your best option appears to be SearchableModel, or an external search engine such as Google Site Search.
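For intuition about what an inverted index buys you, here is a toy sketch, with none of the stemming, scoring or prefix handling a real engine adds: lookup cost depends on the term, not on the collection size.

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class TinyInvertedIndex {
    // term -> sorted set of document ids containing it (the postings list)
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    void add(int docId, String text) {
        for (String token : text.toLowerCase().split("\\W+"))
            if (!token.isEmpty())
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
    }

    // Lookup is a single hash probe instead of a scan over all records.
    Set<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) {
        TinyInvertedIndex idx = new TinyInvertedIndex();
        idx.add(0, "The isolating layer rackets the principal");
        idx.add(1, "the principal beneath your hypocritical rubbish");
        System.out.println(idx.lookup("principal")); // [0, 1]
    }
}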