I'm doing some research and I'm playing with Apache Mahout 0.6
My purpose is to build a system which will name the different categories of documents based on user input. The documents are not known in advance, and I also don't know which categories I have while collecting these documents. But I do know that all the documents in the model should belong to one of the predefined categories.
For example:
Let's say I've collected N documents that belong to 3 different groups:
Politics
Madonna (pop-star)
Science fiction
I don't know which document belongs to which category, but I know that each one of my N documents belongs to one of those categories (e.g. there are no documents about, say, basketball among these N docs).
So, I came up with the following idea:
Apply Mahout clustering (for example k-means with k=3) on these documents.
This should divide the N documents into 3 groups. These groups become the model to learn from. I still don't know which document really belongs to which group, but at least the documents are now clustered by group.
Ask the user to find any document on the web that should be about 'Madonna' (I can't show the user any of my N documents; it's a restriction). Then I want to measure the 'similarity' between this document and each one of the 3 groups.
I expect the similarity measured between user_doc and the documents in the Madonna group of the model to be higher than the similarity between user_doc and the documents about politics.
I've managed to produce the clusters of documents using the 'Mahout in Action' book.
But I don't understand how I should use Mahout to measure the similarity between a 'ready' cluster of documents and one given document.
I thought about rerunning the clustering with k=3 on N+1 documents with the same centroids (in terms of k-means clustering) and seeing where the new document falls, but maybe there is another way to do that?
Is this possible to do with Mahout, or is my idea conceptually wrong? (An example in terms of the Mahout API would be really good.)
Thanks a lot, and sorry for the long question (I couldn't describe it better).
Any help is highly appreciated
P.S. This is not a home-work project :)
This might be possible, but a much easier solution (IMHO) would be to hand-label a few documents from each category, then use those to bootstrap k-means. I.e., compute the centroids of the hand-labeled politics/Madonna/scifi documents and let k-means take it from there.
(In fancy terms, you would be doing semisupervised nearest centroids classification.)
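If it helps, here is a minimal sketch of the nearest-centroid comparison against the Mahout 0.6 API. It assumes you have already loaded the k-means cluster centers (e.g. from the clusters-*-final output) and vectorized the user's document with the same dictionary and TF-IDF weighting as the N training documents; the loading code is left out because it differs between Mahout versions.

import java.util.List;

import org.apache.mahout.common.distance.CosineDistanceMeasure;
import org.apache.mahout.common.distance.DistanceMeasure;
import org.apache.mahout.math.Vector;

public class NearestCentroid {

    // Returns the index of the centroid closest to the user's document.
    // Mahout's CosineDistanceMeasure returns 1 - cosine similarity,
    // so a smaller distance means a more similar cluster.
    public static int closestCluster(List<Vector> centroids, Vector userDoc) {
        DistanceMeasure measure = new CosineDistanceMeasure();
        int best = -1;
        double bestDistance = Double.MAX_VALUE;
        for (int i = 0; i < centroids.size(); i++) {
            double d = measure.distance(centroids.get(i), userDoc);
            if (d < bestDistance) {
                bestDistance = d;
                best = i;
            }
        }
        return best;
    }
}

Whichever cluster wins is the group you would label with the user's hint (e.g. 'Madonna').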
I am currently conducting a Java project in NLP/IR, and am fairly new to this.
The project consists of a collection of around 1000 documents, where each document has about 100 words, structured as a bag of words with term frequencies. I want to find similar documents based on a given document (from the collection).
My plan: using TF-IDF, calculate tf-idf for the query (a given document) and for every other document in the collection, then compare these values as vectors with cosine similarity. Could this give some insight into their similarity? Or would it not be reasonable because of the big query (a whole document)?
Are there any other similarity measures that could work better?
Thanks for the help
TF-IDF-based similarity, typically using a cosine to compare a vector representing the query terms to a set of vectors representing the TF-IDF values of the documents, is a common approach to calculating "similarity".
Mind that "similarity" here is a very generic term. In the IR domain, you typically speak of "relevance" instead. Texts can be similar on many levels: written in the same language, using the same characters, using the same words, talking about the same people, using a similarly complex grammatical structure, and much more - consequently, there are many, many measures. Search the web for text similarity to find many publications, but also open-source frameworks and libraries that implement different measures.
Today, "semantic similarity" is attracting more interest than the traditional keyword-based IR models. If this is your area of interest, you might look into the results of the SemEval shared tasks years 2012-2015.
If all you want is to compare two documents using TF-IDF, you can do that. Since you mention that each doc contains about 100 words, in the worst case there might be 1000*100 unique words. So I'm assuming your vectors are built over all unique words (since all documents should be represented in the same dimension). If the number of unique words is too high, you could try using some dimensionality reduction technique (like PCA) to reduce the dimensions. But what you are trying to do is right; you can always compare documents like this to find the similarity between them.
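To make this concrete, here is a minimal sketch of cosine similarity over sparse TF-IDF vectors; the Map representation is just one choice, any sparse vector type works the same way.

import java.util.Map;

public class CosineSimilarity {

    // Cosine similarity between two sparse term -> TF-IDF weight maps.
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) {
                dot += e.getValue() * w;
            }
        }
        double normA = 0.0;
        double normB = 0.0;
        for (double w : a.values()) normA += w * w;
        for (double w : b.values()) normB += w * w;
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}

You would compute this between the query document's vector and every other document's vector, then rank by the resulting score.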
If you want similarity more in the sense of semantics you should look at using LDA (topic modelling) type techniques.
I want to use Lucene to process several million news documents. I'm quite new to Lucene, so I'm trying to learn more and more about how it works.
From several tutorials throughout the web I found the class TopScoreDocCollector to be of high relevance for querying the Lucene index.
You create it like this
int hitsPerPage = 10000;
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
and it will later collect the results of your query (only the amount you defined in hitsPerPage). I initially thought the results taken in would be just randomly distributed or something (like you have 100,000 documents that match your query and just get a random 10,000 of them). I now think I was wrong.
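In those tutorials the collector is then used roughly like this (a sketch from my side; the reader, query and "title" field are placeholders):

IndexSearcher searcher = new IndexSearcher(reader);
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
searcher.search(query, collector);

ScoreDoc[] hits = collector.topDocs().scoreDocs;
for (ScoreDoc hit : hits) {
    Document doc = searcher.doc(hit.doc);          // load the stored fields of the hit
    System.out.println(hit.score + " : " + doc.get("title"));
}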
After reading more and more about Lucene I came to the javadoc of the class (please see here). Here it says
A Collector implementation that collects the top-scoring hits,
returning them as a TopDocs
So it now seems to me that Lucene uses some very smart technique to somehow return the top-scored documents for my input query. But how does that Scorer work? What does it take into account? I've extended my research on this topic, but so far I could not find an answer I completely understood.
Can you explain to me how the Scorer in TopScoreDocCollector scores my news documents, and whether this can be of use for me?
Lucene uses an inverted index to produce an iterator over the list of doc ids that match your query.
It then goes through each one of them and computes a score. By default that score is based on so-called tf-idf. In a nutshell, it takes into account how many times the terms of your query appear in the document, and also the number of documents that contain each term.
The idea is that if you look for (warehouse work), having the word work many times is not as significant as having the word warehouse many times.
Then, rather than sorting the whole set of matching documents, Lucene takes into account the fact that you really only need the top K documents or so. Using a heap (or priority queue), one can compute these top K with a complexity of O(N log K) instead of O(N log N).
That's the role of the TopScoreDocCollector.
You can implement your own logic for a scorer (which assigns a score to a document) or a collector (which aggregates results).
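To illustrate the O(N log K) point, here is a plain-Java sketch of top-K selection with a priority queue. This is only the idea behind it, not Lucene's actual implementation.

import java.util.PriorityQueue;

public class TopK {

    static class Hit {
        final int docId;
        final float score;
        Hit(int docId, float score) { this.docId = docId; this.score = score; }
    }

    // Keeps only the k best-scoring hits while streaming over all N matches.
    public static PriorityQueue<Hit> collect(Iterable<Hit> matches, int k) {
        // Min-heap ordered by score: the weakest of the current top k sits on top.
        PriorityQueue<Hit> heap =
            new PriorityQueue<Hit>(k, (a, b) -> Float.compare(a.score, b.score));
        for (Hit hit : matches) {
            if (heap.size() < k) {
                heap.offer(hit);                       // O(log k)
            } else if (hit.score > heap.peek().score) {
                heap.poll();                           // drop the current weakest hit
                heap.offer(hit);
            }
        }
        return heap;                                   // total work: O(N log k)
    }
}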
This might not be the best answer, since sooner or later someone will definitely be available to explain the internal behaviour of Lucene, but based on my days as a student there are two sides to "information retrieval": one is taking advantage of existing solutions such as Lucene, the other is the whole theory behind it.
If you are interested in the latter, I recommend taking http://en.wikipedia.org/wiki/Information_retrieval as a starting point to get an overview and dig into the whole subject.
I personally think it is one of the most interesting fields with huge potential, yet I never had the "academic hard skills" to really get in touch with it.
To parametrize the available solutions it is crucial to at least have an overview of the theory - there are, for example, "challenges" where information has been manually indexed/valued as a reference, so that the quality of a programmatic solution can be compared against it.
Based on such a challenge we managed to achieve slightly higher quality than "Lucene out of the box" after we fed Lucene with 4 different information bases (sorry, it's a few years back and I can barely remember, hence the missing keywords), all of which were the result of Lucene itself but with different parameters.
To come back to your question: I can't answer it directly, but I hope to give you some basis for deciding whether you really need/want to know what's behind Lucene, or whether you'd rather just use it as a black box (and/or make it a grey box through parameterization).
Sorry if I got you totally wrong.
I have a collection of books in multiple languages. I need to link parts of each book to each other based on their similarity. I need to link books to similar books, chapters to similar chapters and subchapters to similar subchapters.
Preferably, the similarity measure would also take into account how similar the next-highest level is, so when I want to compare two chapters, it would first check how similar the books the chapters belong to are to each other and use that as a baseline. I suppose I will have to implement this part manually, but I'm wondering how to do the hierarchical linking effectively.
Is there a way to tell Lucene that the documents in an index follow a hierarchical structure, where books are composed of chapters and chapters are composed of subchapters (which are the actual documents to store)? If so, books and chapters could be constructed at runtime by combining the documents they are composed of. Does Lucene have a way to do this?
One simple alternative approach would be to create separate indices for each level of resolution, i.e. one for books, one for chapters and one for subchapters. But this seems inelegant, and I'm not sure it would work well, considering that I would get different inverse-document-frequency values in the different indices. This leads to a secondary question: is there a way to make Lucene only consider certain documents as the reference class for its tf-idf calculations?
If the number of levels is predetermined, then you may use the grouping functionality: http://lucene.apache.org/core/4_9_0/grouping/org/apache/lucene/search/grouping/package-summary.html
I do not know if it works with multi-valued fields; if it does, it could also address multiple levels of grouping (hierarchy similarity). You may of course use different fields for different levels.
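A rough sketch of what the subchapter documents could look like, with their ancestors stored as extra keyword fields that grouping (or plain filtering) can key on. The Lucene 4.x field API and the field names are assumptions on my part; depending on the grouping implementation you may also need a doc-values field for the group key.

import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;

public class HierarchyIndexer {

    // One Lucene document per subchapter, tagged with its book and chapter.
    public static void addSubchapter(IndexWriter writer, String bookId,
                                     String chapterId, String subchapterId,
                                     String text) throws IOException {
        Document doc = new Document();
        doc.add(new StringField("book", bookId, Field.Store.YES));
        doc.add(new StringField("chapter", chapterId, Field.Store.YES));
        doc.add(new StringField("subchapter", subchapterId, Field.Store.YES));
        doc.add(new TextField("content", text, Field.Store.NO));
        writer.addDocument(doc);
    }
}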
There is also another approach which I have tried: grouping document IDs in a fast NoSQL database. When Lucene returns a document, you look up the IDs of similar documents in the NoSQL store and then go back into Lucene. However, such an approach may lead to other issues (e.g. the number of total hits, because Lucene reports total hits even without returning the results, but grouping will most probably change that number).
I'm building a text classifier in java with Weka library.
First I remove stopwords, then I use a stemmer (e.g. converting 'cars' to 'car').
Right now I have 6 predefined categories. I train the classifier on 5 documents for every category. The lengths of the documents are similar.
The results are OK when the text to be classified is short, but when the text is longer than 100 words the results get stranger and stranger.
I return the probabilities for each category as follows:
Probability:
[0.0015560238056109177, 0.1808919321002592, 0.6657404531908249, 0.004793498469427115, 0.13253647895234325, 0.014481613481534815]
which is a pretty reliable classification.
But when I use texts longer than around 100 words I get results like:
Probability: [1.2863123678314889E-5, 4.3728547754744305E-5, 0.9964710903856974, 5.539960514402068E-5, 0.002993481218084141, 4.234371196414616E-4]
which is too good to be true.
Right now I'm using Naive Bayes Multinomial for classifying the documents. I have read about it and found out that it can act strangely on longer texts. Might that be my problem right now?
Does anyone have a good idea why this is happening?
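For reference, the probabilities come from something like this (simplified; the ARFF file names and filter options are placeholders, and the stopword removal/stemming steps are left out):

import weka.classifiers.bayes.NaiveBayesMultinomial;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class ClassifyText {
    public static void main(String[] args) throws Exception {
        // ARFF files with one string attribute (the text) and the class attribute.
        Instances train = DataSource.read("train.arff");
        Instances test = DataSource.read("test.arff");
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        // Bag-of-words representation with word counts.
        StringToWordVector filter = new StringToWordVector();
        filter.setOutputWordCounts(true);
        filter.setInputFormat(train);
        Instances vecTrain = Filter.useFilter(train, filter);
        Instances vecTest = Filter.useFilter(test, filter);   // same dictionary as training

        NaiveBayesMultinomial nb = new NaiveBayesMultinomial();
        nb.buildClassifier(vecTrain);

        // Per-category probabilities for the first test document.
        double[] dist = nb.distributionForInstance(vecTest.instance(0));
        System.out.println(java.util.Arrays.toString(dist));
    }
}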
There can be multiple factors behind this behavior. If your training and test texts are not from the same domain, this can happen. Also, I believe adding more documents for every category should do some good; 5 documents per category seems very few. If you do not have more training documents, or it is difficult to get more, then you can synthetically add positive and negative instances to your training set (see the SMOTE algorithm in detail). Keep us posted.
I am using Lucene to index and search a small number of large documents. Using the demo from the Lucene site I have indexed the documents and am able to search them. However, the search result is not particularly useful, as it points to the file of the document. With very large documents this doesn't narrow things down much.
I am wondering if Lucene can index these very large documents and create an abstraction over them which provides much more fine-grained results.
An example might better explain what I mean. Consider a very large book, such as the Bible. One file contains the entire text of the Bible, so with the demo, the result of searching for say, 'Damascus' would point to the file. What I would like to do is to retain the large document, but searches would return results pointing to a Book, Chapter or even as precise as a Verse. So a search for 'Damascus' could return (among others) Book 23, Chapter 7, Verse 8.
Is this possible (and best-practice in Lucene usage), or should I instead attempt to section the large document into many small files to index?
If it makes any difference, I am using Java Lucene 2.9.0 and am indexing HTML files approximately 1MB - 4MB in size. In terms of file size that is not large, but it is large relative to a person reading it.
I don't think I've explained this as well as I could. Here goes for another example.
Say I take my large HTML file, and (for argument's sake) the search term 'Damascus' appears 3 times: once on line 100 within a <div> tag, once on line 2000 within a <p> tag, and once on line 5000 within an <h1> tag. Is it possible to index with Lucene such that there will be 3 results, each pointing to the specific element the term was within?
I don't think I want to provide a different document result for the term. So if the term 'Damascus' appeared twice within a specific <div>, there would only be one match.
It appears from a comment from Kragen that what I would want to do is parse the HTML when Lucene is going through the indexing phase. Then I can decide the chunk I want to consider as one document from what is read in by the parser. So if I see a div with a certain class I can begin a new Lucene document and it will be returned as a separate hit when a word within the div content is searched on.
Does this sound like what I want to do, and is it possible?
Yes - Lucene records the offset of matching terms in a file, so that can be used to figure out where in the indexed content you need to look for matches.
There is a Lucene.Highlight add-on that does this exact task for you - try this article; there are also a couple of questions on StackOverflow concerning hit highlighting (many of these are tailored for use with web apps and so also do things like surrounding matching words with <b> tags).
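As a rough illustration of the highlighter approach (using the contrib highlighter from the Lucene 2.9/3.x era; the field name and fragment count are placeholders):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;

public class SnippetHelper {

    // Wraps matching terms in <b> tags and returns the best snippets
    // from the full text of a hit.
    public static String[] bestSnippets(Query query, Analyzer analyzer,
                                        String fieldName, String fullText)
            throws Exception {
        SimpleHTMLFormatter formatter = new SimpleHTMLFormatter("<b>", "</b>");
        Highlighter highlighter = new Highlighter(formatter, new QueryScorer(query));
        return highlighter.getBestFragments(analyzer, fieldName, fullText, 3);
    }
}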
UPDATE: Depending on how you search your index, you might also find that it's a good idea to split your large documents into smaller sections (for example chapters) as well - however, this is more a question of how you want to organise, prioritise and present your results to the end user.
For example, suppose a user searches for "foo" and there are 2 books containing that term. The first book (book A) might contain 2 chapters, each of which has many references to "foo", while the term is barely mentioned in the rest of the book; the second book (book B) contains many references to "foo", but they are scattered around the whole book. If you index by book, you will probably find that book B is the first hit; indexing by chapter, you are likely to find that the 2 chapters from book A are the first 2 hits, followed by the chapters from book B.
Finally, the user will obviously be presented with 1 hit per matching document in your index - if you want to present your users with a list of matching books, then index by book; if you find it more appropriate to present the user with a list of matching chapters, then index by chapter.
One way to do this is to create several documents out of a single book. The documents could represent books, chapters or verses. As the text need not be unique, this is what I would do.
This way, the first verse in the first chapter in the book of Genesis will be indexed four times: in the whole bible, in the book of Genesis, in the first chapter and as the verse.
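As a rough sketch with the Lucene 2.9-era Field API (the field names and the 'level' tag are placeholders), the finest-grained documents could look like this; the chapter-, book- and whole-bible-level documents would be added the same way from the concatenated text, with a different 'level' value:

import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class VerseIndexer {

    // One Lucene document per verse, tagged with where it sits in the hierarchy.
    public static void addVerse(IndexWriter writer, String book, int chapter,
                                int verse, String text) throws IOException {
        Document doc = new Document();
        doc.add(new Field("level", "verse", Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("book", book, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("chapter", Integer.toString(chapter),
                          Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("verse", Integer.toString(verse),
                          Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("content", text, Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
    }
}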
A subtlety here is the exact goal of retrieval:
Do you just want to display the search keywords in context to a user? In that case, consider using a Lucene highlighter. If the retrieval result needs to be used further (i.e. you take the retrieved pointer to a chapter or verse and do some processing at that place in the text), I would go with the finer-grained documents as I described before.