Recursively determine similarity in Lucene (Java)

I have a collection of books in multiple languages. I need to link parts of each book to each other based on their similarity. I need to link books to similar books, chapters to similar chapters and subchapters to similar subchapters.
Preferably, the similarity measure would also take into account how similar the items at the next-highest level are, so that when I compare two chapters, it first checks how similar the books those chapters belong to are and uses that as a baseline. I suppose I will have to implement this part manually, but I'm wondering how to do the hierarchical linking effectively.
Is there a way to tell Lucene that the documents in an index follow a hierarchical structure, where books are composed of chapters and chapters are composed of subchapters (which are the actual documents to store)? If so, books and chapters could be constructed at runtime by combining the documents they are composed of. Does Lucene have a way to do this?
One simple alternative approach would be to create separate indices for each level of resolution, i.e. one for books, one for chapters and one for subchapters. But this seems inelegant, and I'm not sure it would work well, considering that I would get different inverse-document-frequency values in the different indices. This leads to a secondary question: is there a way to make Lucene only consider certain documents as the reference class for its tf-idf calculations?

If the number of levels is predetermined, you can use the grouping functionality: http://lucene.apache.org/core/4_9_0/grouping/org/apache/lucene/search/grouping/package-summary.html
I do not know whether it works with multi-valued fields; if it does, it could also address multiple levels of grouping (hierarchy similarity). You can, of course, use different fields for different levels.
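As a minimal sketch (assuming the 4.9-era grouping module; the field names "bookId" and "contents", the query and the limits are all invented for illustration), single-level grouping looks roughly like this:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.grouping.GroupDocs;
import org.apache.lucene.search.grouping.GroupingSearch;
import org.apache.lucene.search.grouping.TopGroups;
import org.apache.lucene.util.BytesRef;

static void searchGroupedByBook(IndexSearcher searcher) throws java.io.IOException {
    // group chapter/subchapter hits by the book they belong to
    GroupingSearch groupingSearch = new GroupingSearch("bookId");
    groupingSearch.setGroupSort(Sort.RELEVANCE);
    groupingSearch.setGroupDocsLimit(5); // keep the top 5 hits inside each book

    Query query = new TermQuery(new Term("contents", "whale"));

    // first 10 groups (books), each with its top-scoring member documents
    TopGroups<BytesRef> result = groupingSearch.search(searcher, query, 0, 10);
    for (GroupDocs<BytesRef> group : result.groups) {
        System.out.println("book: " + group.groupValue.utf8ToString()
            + ", hits in this book: " + group.totalHits);
    }
}

Grouping at a coarser level (say, a "collectionId" field) would then simply be a second GroupingSearch over the same index.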
There is also another approach I have used: grouping document IDs in a fast NoSQL database. When Lucene returns a document, you look up the IDs of similar documents in the NoSQL store and then go back to Lucene. However, such an approach may lead to other issues (e.g. the total hit count, because Lucene reports total hits even without returning the results, and grouping will most probably change that number).

Related

Pre-trained vectors, NLP, word2vec, word embeddings for a particular topic?

Is there any pre-trained vector set for a particular topic only? For example, for the topic "Java": I want a file of vectors related to Java, so that if I give the input 'inheritance', cosine similarity shows me 'polymorphism' and other related terms only.
I am using GoogleNews-vectors-negative300.bin and GloVe vectors as my corpus, but I am still not getting related words.
Not sure if I understand your question/problem statement, but if you want to work with a corpus of Java source code you can use code2vec, which provides pre-trained word-embedding models. Check it out: https://code2vec.org/
Yes, you can occasionally find other groups' pre-trained vectors for download, which may have better coverage of whatever problem domains they've been trained on: both more specialized words, and word-vectors matching the word sense in that domain.
For example, the GoogleNews word-vectors were trained on news articles circa 2012, so their vector for 'Java' may be dominated as much by stories about the island of Java in Indonesia as by the programming language. And many other vector sets are trained on Wikipedia text, which will be dominated by usages in that particular reference-style of writing. But there could be other sets that better emphasize the word-senses you need.
However, the best approach is often to train your own word-vectors, from a training corpus that closely matches the topics/documents you are concerned about. Then, the word-vectors are well-tuned to your domain-of-concern. As long as you have "enough" varied examples of a word used in context, the resulting vector will likely be better than generic vectors from someone else's corpus. ("Enough" has no firm definition, but is usually at least 5, and ideally dozens to hundreds, of representative, diverse uses.)
Let's consider your example goal – showing some similarity between the ideas of 'polymorphism' and 'input inheritance'. For that, you'd need a training corpus that discusses those concepts, ideally many times, from many authors, in many problem-contexts. (Textbooks, online articles, and Stack Overflow pages might be possible sources.)
You'd further need a tokenization strategy that manages to create a single word-token for the two-word concept 'input_inheritance' - which is a separate challenge, and might be tackled via (1) a hand-crafted glossary of multi-word-phrases that should be combined; (2) statistical analysis of word-pairs that seem to occur so often together, they should be combined; (3) more sophisticated grammar-aware phrase- and entity-detection preprocessing.
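To make option (1) concrete, here is a minimal sketch in plain Java (the glossary entries are invented) of merging known multi-word phrases into single underscore-joined tokens before training:

import java.util.*;

public class PhraseMerger {
    // hand-crafted glossary of two-word phrases that should become one token
    private final Set<String> glossary = new HashSet<>(Arrays.asList(
        "input inheritance", "design pattern", "garbage collection"));

    public List<String> tokenize(String text) {
        String[] words = text.toLowerCase().split("\\s+");
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            // greedily join the current word with the next if the pair is in the glossary
            if (i + 1 < words.length && glossary.contains(words[i] + " " + words[i + 1])) {
                tokens.add(words[i] + "_" + words[i + 1]);
                i++; // skip the word we just consumed
            } else {
                tokens.add(words[i]);
            }
        }
        return tokens;
    }
}

The statistical route (2) essentially replaces the fixed glossary with word pairs whose co-occurrence counts exceed a threshold.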
(The multi-word phrases in the GoogleNews set were created via a statistical algorithm which is also available in the gensim Python library as the Phrases class. But the exact parameters Google used have not, as far as I know, been revealed. And good results from this algorithm can require a lot of data and tuning, and it will still produce some combinations that a person would consider nonsense while missing others that a person would consider natural.)

Using self-built approaches in Lucene search engine

I'm looking for a search engine in which I can use my own similarity measure and tokenization approaches. The Lucene search engine is often recommended for this purpose, but I don't know much about it. I searched the internet for tutorials on recent versions of Lucene, but most of the pages are from a few years ago. Some of my questions are as follows:
Is it possible to change the similarity measure, tokenization and stemming approaches and use self-built classes in Lucene? If yes, how do I do that?
Is there any difference between how the text is indexed for keyword search versus phrase search? Should I build two different indices, one for keyword search and one for phrase search? (I think that if I remove stop words it will affect the results of phrase search, and if I don't remove stop words it will affect the results of keyword search, won't it?)
Any information about this topic is appreciated.
This is possible, yes, and we do it on a couple of solutions at my workplace. Here is a reasonable tutorial on how to do this. The tutorial uses Solr, a search server built on Lucene. To answer your questions directly:
Yes, there is a way to do this by overriding interfaces and providing your own implementation (see the tutorial). Tokenization can often be handled within Solr's default configuration without overriding any classes, depending on how funky you need to get with tokenization.
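As a minimal sketch (assuming a 4.x-era Lucene; the tf override is an arbitrary example, not a recommendation), a custom similarity can look like this:

import org.apache.lucene.search.similarities.DefaultSimilarity;

public class MySimilarity extends DefaultSimilarity {
    @Override
    public float tf(float freq) {
        // count a term once per document instead of the default sqrt(freq)
        return freq > 0 ? 1.0f : 0.0f;
    }
}

// register it at both index time and search time:
// indexWriterConfig.setSimilarity(new MySimilarity());
// indexSearcher.setSimilarity(new MySimilarity());
// (in Solr, the equivalent is declaring a similarity factory in schema.xml)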
Yes. Building an index that returns accurate results is a matter of understanding how your users will search it. That said, a large part of the complexity in how queries search comes from people wanting the best-matching results to float to the top of the results list, which is done via scoring. Since it sounds like you're looking to override the scoring, this may not matter for you. You should note, though, that by default Lucene scores hits across multiple fields higher than a single exact match on a single field. That means that if you store data across many fields (and by default search across many fields), your search will get less and less "accurate".
Full-text search against a single field tends to be pretty accurate for both phrases and individual words, but you'll end up with a pretty large index.

What does the TopScoreDocCollector in Lucene use for scoring by default?

I want to use Lucene to process several million news documents. I'm quite new to Lucene, so I'm trying to learn more and more about how it works.
In several tutorials around the web I found the class TopScoreDocCollector to be of high relevance for querying the Lucene index.
You create it like this:
int hitsPerPage = 10000;
// second argument (docsScoredInOrder): whether documents are collected in doc-id order
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
and it will later collect the results of your query (only the amount you defined in hitsPerPage). I initially thought the results taken in would be just randomly distributed or something (like you have 100,000 documents that match your query and just get a random 10,000 of them). I now think I was wrong.
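You then use it together with an IndexSearcher, roughly like this (a sketch; 'searcher', 'query' and the stored 'title' field are assumed, and exception handling is omitted):

searcher.search(query, collector);          // the collector keeps only the top hitsPerPage hits by score
ScoreDoc[] hits = collector.topDocs().scoreDocs;
for (ScoreDoc hit : hits) {
    Document doc = searcher.doc(hit.doc);   // load the stored fields of this hit
    System.out.println(hit.score + "\t" + doc.get("title"));
}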
After reading more and more about Lucene I came to the Javadoc of the class (please see here). There it says:
A Collector implementation that collects the top-scoring hits,
returning them as a TopDocs
So it now seems to me that Lucene uses some very smart technique to somehow return the top-scored documents for my input query. But how does that scorer work? What does it take into account? I've extended my research on this topic but could not find an answer I completely understood so far.
Can you explain to me how the scorer in TopScoreDocCollector scores my news documents, and whether this can be of use for me?
Lucene uses an inverted index to produce an iterator over the list of doc IDs that match your query.
It then goes through each one of them and computes a score. By default that score is based on so-called tf-idf. In a nutshell, it takes into account how many times the terms of your query appear in the document, and also the number of documents that contain each term.
The idea is that if you look for (warehouse work), having the word 'work' many times is not as significant as having the word 'warehouse' many times.
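Concretely, the classic defaults boil down to roughly these two functions (a sketch of just the tf and idf parts of DefaultSimilarity; the full score also involves norms and boosts):

// default term frequency: diminishing returns for repeated occurrences
float tf(float freq) {
    return (float) Math.sqrt(freq);
}

// default inverse document frequency: rarer terms get a higher weight
float idf(long docFreq, long numDocs) {
    return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
}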
Then, rather than sorting the whole set of matching documents, Lucene takes into account the fact that you really only need the top K documents or so. Using a heap (or priority queue), one can compute these top K with a complexity of O(N log K) instead of O(N log N).
That's the role of the TopScoreDocCollector.
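The top-K selection can be pictured with a bounded min-heap in plain Java (an illustration of the idea, not Lucene's actual internals; 'k' and the 'allScores' array are assumed):

java.util.PriorityQueue<Float> heap = new java.util.PriorityQueue<>(k);
for (float score : allScores) {
    if (heap.size() < k) {
        heap.offer(score);              // heap not full yet, just add
    } else if (score > heap.peek()) {
        heap.poll();                    // evict the current k-th best, O(log k)
        heap.offer(score);
    }
}
// heap now holds the top k scores, found in O(N log K) overall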
You can implement your own logic for a scorer (assigning a score to a document) or for a collector (aggregating results).
This might not be the best answer, since sooner or later someone will surely come along to explain the internal behaviour of Lucene, but based on my days as a student there are two sides to "information retrieval": one is taking advantage of existing solutions such as Lucene; the other is the whole theory behind it.
If you are interested in the latter, I recommend http://en.wikipedia.org/wiki/Information_retrieval as a starting point to get an overview and dig into the whole subject.
I personally think it is one of the most interesting fields, with huge potential, yet I never had the academic hard skills to really get into it.
To parameterize the available solutions, it is crucial to have at least an overview of the theory. There are, for example, "challenges" for which information has been manually indexed/rated as a reference, so that the quality of a programmatic solution can be compared against it.
Based on such a challenge, we managed to achieve slightly higher quality than "Lucene out of the box" after we fed Lucene with four different information bases (sorry, it was a few years back and I can barely remember, hence the missing keywords), all of which were outputs of Lucene itself but produced with different parameters.
To come back to your question: I cannot answer it directly, but I hope to give you some basis for deciding whether you really need/want to know what is behind Lucene, or whether you would rather just use it as a black box (and/or make it a grey box through parameterization).
Sorry if I got you totally wrong.

Mahout - Clustering - "naming" the cluster elements

I'm doing some research and I'm playing with Apache Mahout 0.6.
My purpose is to build a system which will name different categories of documents based on user input. The documents are not known in advance, and I also don't know which categories there are while collecting these documents. But I do know that all the documents in the model should belong to one of the predefined categories.
For example:
Let's say I've collected N documents that belong to 3 different groups:
Politics
Madonna (pop-star)
Science fiction
I don't know which document belongs to which category, but I know that each one of my N documents belongs to one of those categories (e.g. there are no documents about, say, basketball among these N docs).
So, I came up with the following idea:
Apply Mahout clustering (for example, k-means with k=3) to these documents.
This should divide the N documents into 3 groups. This will serve as my model to learn from. I still don't know which document really belongs to which group, but at least the documents are now clustered by group.
Ask the user to find any document on the web that should be about 'Madonna' (I can't show the user any of my N documents; that's a restriction). Then I want to measure the 'similarity' between this document and each of the 3 groups.
I expect the similarity between the user_doc and the documents in the Madonna group of the model to be higher than the similarity between the user_doc and the documents about politics.
I've managed to produce the clusters of documents using the 'Mahout in Action' book.
But I don't understand how I should use Mahout to measure the similarity between a 'ready' cluster of documents and one given document.
I thought about rerunning the clustering with k=3 on the N+1 documents with the same centroids (in terms of k-means clustering) and seeing where the new document falls, but maybe there is another way to do that?
Is it possible to do this with Mahout, or is my idea conceptually wrong? (An example in terms of the Mahout API would be really good.)
Thanks a lot, and sorry for the long question (I couldn't describe it better).
Any help is highly appreciated
P.S. This is not a home-work project :)
This might be possible, but a much easier solution (IMHO) would be to hand-label a few documents from each category, then use those to bootstrap k-means. I.e., compute the centroids of the hand-labeled politics/Madonna/sci-fi documents and let k-means take it from there.
(In fancy terms, you would be doing semi-supervised nearest-centroid classification.)
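The classification step itself is simple; here is a plain-Java sketch (assuming you already have tf-idf vectors for the user's document and for each hand-labeled centroid, e.g. produced by Mahout's seq2sparse):

import java.util.Map;

// cosine similarity between two dense vectors
static double cosine(double[] a, double[] b) {
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// return the label of the centroid most similar to the user's document
static String classify(double[] userDoc, Map<String, double[]> centroids) {
    String best = null;
    double bestSim = -1;
    for (Map.Entry<String, double[]> e : centroids.entrySet()) {
        double sim = cosine(userDoc, e.getValue());
        if (sim > bestSim) { bestSim = sim; best = e.getKey(); }
    }
    return best; // e.g. "politics", "madonna" or "scifi"
}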

Can Lucene return several search results from a single indexed file?

I am using Lucene to index and search a small number of large documents. Using the demo from the Lucene site, I have indexed the documents and am able to search them. However, a search result is not particularly useful, as it only points to the file of the document; with very large documents that doesn't narrow things down much.
I am wondering if Lucene can index these very large documents and create an abstraction over them which provides much more fine-grained results.
An example might better explain what I mean. Consider a very large book, such as the Bible. One file contains the entire text of the Bible, so with the demo, the result of searching for, say, 'Damascus' would point to the file. What I would like to do is retain the large document, but have searches return results pointing to a book, chapter or even something as precise as a verse. So a search for 'Damascus' could return (among others) Book 23, Chapter 7, Verse 8.
Is this possible (and best-practice in Lucene usage), or should I instead attempt to section the large document into many small files to index?
If it makes any difference, I am using Java Lucene 2.9.0 and am indexing HTML files approximately 1 MB - 4 MB in size. In terms of file size that is not large, but it is large relative to a person reading it.
I don't think I've explained this as well as I could. Here's another example.
Say I take my large HTML file, and (for argument's sake) the search term 'Damascus' appears 3 times: once on line 100 within a <div> tag, once on line 2000 within a <p> tag, and once on line 5000 within an <h1> tag. Is it possible to index with Lucene such that there will be 3 results, each pointing to the specific element the term was within?
I don't think I want a separate result for every occurrence of the term within the same element, though. So if the term 'Damascus' appeared twice within a specific <div>, there would be only one match.
It appears, from a comment by Kragen, that what I want to do is parse the HTML while Lucene is going through the indexing phase. Then I can decide which chunk to treat as one document from what the parser reads in. So if I see a div with a certain class, I can begin a new Lucene document, and it will be returned as a separate hit when a word within the div's content is searched for.
Does this sound like what I want to do, and is it possible?
Yes - Lucene records the offsets of matching terms in a file, so these can be used to figure out where in the indexed content you need to look for matches.
There is a Lucene.Highlight add-on that does this exact task for you - try this article; there are also a couple of questions on Stack Overflow concerning hit highlighting (many of these are tailored to use with web apps and so also do things like surrounding matching words with <b> tags).
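For reference, a minimal sketch with the contrib highlighter (2.9-era API; 'query', 'analyzer' and the raw field text 'text' are assumed, and exception handling is elided):

import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;

// the default formatter wraps matching terms in <B>...</B>
Highlighter highlighter = new Highlighter(new QueryScorer(query));
String[] fragments = highlighter.getBestFragments(analyzer, "contents", text, 3);
for (String fragment : fragments) {
    System.out.println(fragment);
}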
UPDATE: Depending on how you search your index, you might also find it's a good idea to split your large documents into smaller sections (for example, chapters) - however, this is more a question of how you want to organise, prioritise and present your results to the end user.
For example, suppose a user searches for "foo" and there are 2 books containing that term. The first book (book A) might contain 2 chapters, each with many references to "foo", while the term is barely mentioned in the rest of the book; the second book (book B) contains many references to "foo", but they are scattered around the whole book. If you index by book, you will probably find that book B is the first hit; indexing by chapter, you are likely to find that the 2 chapters from book A are the first 2 hits, followed by the chapters from book B.
Finally, the user will obviously be presented with 1 hit per matching document in your index - if you want to present your users with a list of matching books, index by book; if you find it more appropriate to present a list of matching chapters, index by chapter.
One way to do this is to create several documents out of a single book. The documents could represent books, chapters or verses. As the text need not be unique across documents, this is what I would do.
This way, the first verse of the first chapter of the book of Genesis will be indexed four times: as part of the whole Bible, as part of the book of Genesis, as part of the first chapter, and as the verse itself.
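A sketch of what that could look like with the 2.9 API (the field names, the 'level' convention and 'verseText' are invented for illustration; exception handling is elided):

// one Document per verse; chapters, books and the whole Bible get their own
// Documents with their concatenated text in "contents"
Document verseDoc = new Document();
verseDoc.add(new Field("level", "verse", Field.Store.YES, Field.Index.NOT_ANALYZED));
verseDoc.add(new Field("ref", "Genesis 1:1", Field.Store.YES, Field.Index.NOT_ANALYZED));
verseDoc.add(new Field("contents", verseText, Field.Store.YES, Field.Index.ANALYZED));
writer.addDocument(verseDoc);

At search time you can then restrict hits to one granularity, for example with a TermQuery on the "level" field inside a BooleanQuery.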
A subtlety here is the exact goal of retrieval:
Do you just want to display the search keywords in context to a user? In that case, consider using a Lucene highlighter. If the retrieved results need further processing (i.e. taking the retrieved pointer to a chapter or verse and doing some processing on that place in the text), I would go with the finer-grained documents described above.
