Comparing documents - document similarity - java

I am currently working on a Java project in NLP/IR, and I am fairly new to this.
The project consists of a collection of around 1000 documents, where each document has about 100 words, structured as a bag of words with term frequencies. I want to find documents similar to a given document from the collection.
My plan is to use TF-IDF: calculate tf-idf for the query (a given document) and for every other document in the collection, then compare these values as vectors with cosine similarity. Could this give some insight into their similarity? Or would it not be reasonable, because the query (a whole document) is so large?
Are there any other similarity measures that could work better?
Thanks for the help

TF-IDF-based similarity, typically using cosine similarity to compare a vector representing the query terms to a set of vectors representing the TF-IDF values of the documents, is a common approach to calculating "similarity".
Mind that "similarity" is a very generic term. In the IR domain, you typically speak of "relevance" instead. Texts can be similar on many levels: written in the same language, using the same characters, using the same words, talking about the same people, using a similarly complex grammatical structure, and much more; consequently, there are many, many measures. Search the web for text similarity to find many publications, as well as open-source frameworks and libraries that implement different measures.
Today, "semantic similarity" is attracting more interest than the traditional keyword-based IR models. If this is your area of interest, you might look into the results of the SemEval shared tasks from 2012-2015.

If all you want is to compare two documents using TF-IDF, you can do that. Since you mention that each document contains 100 words, in the worst case there might be 1000*100 unique words. So I'm assuming your vectors are built over all unique words (since all documents should be represented in the same dimension). If the number of unique words is too high, you could try a dimensionality reduction technique (such as PCA) to reduce the dimensions. But what you are trying to do is right; you can always compare documents like this to find the similarity between them.
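As a rough illustration, here is a minimal, self-contained sketch of that approach in plain Java. It assumes each document is already available as a bag-of-words map from term to raw count; the class name, helper names and the toy corpus are all made up for the example:

```java
import java.util.*;

/** Minimal sketch: TF-IDF vectors over a small corpus, compared with cosine similarity. */
public class TfIdfCosine {

    /** Document frequency: in how many documents each term appears. */
    static Map<String, Integer> documentFrequencies(List<Map<String, Integer>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (Map<String, Integer> doc : docs) {
            for (String term : doc.keySet()) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    /** Build a sparse TF-IDF vector for one document. */
    static Map<String, Double> tfIdfVector(Map<String, Integer> doc,
                                           Map<String, Integer> df,
                                           int numDocs) {
        Map<String, Double> vec = new HashMap<>();
        for (Map.Entry<String, Integer> e : doc.entrySet()) {
            double tf = e.getValue();
            double idf = Math.log((double) numDocs / df.get(e.getKey()));
            vec.put(e.getKey(), tf * idf);
        }
        return vec;
    }

    /** Cosine similarity of two sparse vectors. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            normA += e.getValue() * e.getValue();
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Toy corpus: each document is a bag of words with term frequencies.
        List<Map<String, Integer>> docs = List.of(
                Map.of("java", 3, "index", 2, "search", 1),
                Map.of("java", 1, "lucene", 2, "search", 2),
                Map.of("dog", 4, "breeding", 2));

        Map<String, Integer> df = documentFrequencies(docs);
        Map<String, Double> query = tfIdfVector(docs.get(0), df, docs.size());
        for (int i = 1; i < docs.size(); i++) {
            Map<String, Double> other = tfIdfVector(docs.get(i), df, docs.size());
            System.out.printf("similarity(doc0, doc%d) = %.3f%n", i, cosine(query, other));
        }
    }
}
```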
If you want similarity more in the sense of semantics, you should look at LDA (topic modelling) style techniques.

Related

Java: How to use TF-IDF to compute similarity of two documents?

My goal is to find a similarity value between two documents (collections of words). I have already found several answers, like this SO post or this SO post, which provide Python libraries that achieve this, but I have trouble understanding the approach and making it work for my use case.
If I understand correctly, TF-IDF of a document is computed with respect to a given term, right? That's how I interpret it from the Wikipedia article on this: "tf-idf...is a numerical statistic that is intended to reflect how important a word is to a document".
In my case, I don't have a specific search term which I want to compare to the document, but I have two different documents. I assume I need to first compute vectors for the documents, and then take the cosine between these vectors. But all the answers I found with respect to constructing these vectors always assume a search term, which I don't have in my case.
I can't wrap my head around this; any conceptual help or links to Java libraries that achieve this would be highly appreciated.
I suggest running terminology extraction first, together with the term frequencies. Note that stemming can also be applied to the extracted terms to avoid noise during the subsequent cosine similarity calculation. See the SO thread Java library for keywords extraction from input text for more help and ideas on that.
Then, as you yourself mention, for each of those terms, you will have to compute the TF-IDF values, get the vectors and compute the cosine similarity.
When calculating TF-IDF, mind that the formula 1 + log(N/n) (with N the total number of documents in the corpus and n the number of documents that contain the term) is preferable, since it avoids the case where TF is not 0 but IDF turns out to be 0, i.e. when a term occurs in every document.
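For concreteness, here is a small Java sketch of the difference between the plain IDF and the smoothed variant mentioned above (class and method names are just illustrative):

```java
/** Sketch: plain IDF vs. the smoothed 1 + log(N/n) variant. */
public class IdfVariants {

    /** Plain IDF: becomes 0 when a term occurs in every document (n == N). */
    static double idf(int totalDocs, int docsWithTerm) {
        return Math.log((double) totalDocs / docsWithTerm);
    }

    /** Smoothed IDF: keeps a non-zero weight for terms that occur everywhere. */
    static double smoothedIdf(int totalDocs, int docsWithTerm) {
        return 1.0 + Math.log((double) totalDocs / docsWithTerm);
    }

    public static void main(String[] args) {
        // A term that appears in every one of 100 documents:
        System.out.println(idf(100, 100));         // 0.0 -> the term vanishes from every vector
        System.out.println(smoothedIdf(100, 100)); // 1.0 -> the term still contributes via its TF
    }
}
```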

recursively determine similarity in lucene

I have a collection of books in multiple languages. I need to link parts of each book to each other based on their similarity. I need to link books to similar books, chapters to similar chapters and subchapters to similar subchapters.
Preferably, the similarity measure would also take into account how similar the items at the next-highest level are, so when I want to compare two chapters, it would first check how similar the books the chapters belong to are to each other and use that as a baseline. I suppose I will have to implement this part manually, but I'm wondering how to do the hierarchical linking effectively.
Is there a way to tell Lucene that the documents in an index follow a hierarchical structure, where books are composed of chapters and chapters are composed of subchapters (which are the actual documents to store)? If so, books and chapters could be constructed at runtime by combining the documents they are composed of. Does Lucene have a way to do this?
One simple alternative approach would be to create separate indices for each level of resolution, i.e. one for books, one for chapters and one for subchapters. But this seems inelegant, and I'm not sure it would work well, considering that I would get different inverse-document-frequency values in the different indices. This leads to a secondary question: is there a way to make Lucene only consider certain documents as a reference class for its tf-idf calculations?
If the number of levels is predetermined, then you may use the grouping functionality: http://lucene.apache.org/core/4_9_0/grouping/org/apache/lucene/search/grouping/package-summary.html
I do not know if it works with multi-valued fields; if it does, it could also address multiple levels of grouping (hierarchy similarity). You may of course use different fields for different levels.
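Purely as a sketch of what that grouping module can look like in code (the index path and the field names "bookId" and "text" are invented for illustration, and exact signatures vary somewhat between Lucene versions, so treat this as pseudocode against a 5.x/6.x-style API):

```java
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.grouping.GroupingSearch;
import org.apache.lucene.search.grouping.TopGroups;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

import java.nio.file.Paths;

/** Rough sketch: group subchapter documents by a "bookId" field (field names are made up). */
public class HierarchicalGrouping {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("/path/to/index"));
        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));

        // One group per book; within each group, the best-matching subchapters.
        GroupingSearch grouping = new GroupingSearch("bookId");
        grouping.setGroupSort(Sort.RELEVANCE);
        grouping.setGroupDocsLimit(5);          // top 5 subchapters per book

        Query query = new TermQuery(new Term("text", "similarity"));
        TopGroups<BytesRef> groups = grouping.search(searcher, query, 0, 10);

        System.out.println("matched groups: " + groups.groups.length);
    }
}
```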
There is also another approach which I have tried: grouping document IDs in a fast NoSQL database, and when Lucene returns a document, looking up the IDs of similar documents in the NoSQL store and then going back to Lucene. However, such an approach may lead to other issues (e.g. the total hit count, because Lucene reports total hits even without returning the results, and grouping will most probably change that number).

Best way to find document similarity

I'm new to NLP; I want to find the similarity between two documents.
I googled and found that there are some ways to do it, e.g.:
Shingling, and finding text resemblance
Cosine similarity or Lucene
tf-idf
What is the best way to do this (I'm open to other methods too) so that we get high precision? If there is some API in Java to do this, please also let me know.
The answer to your question is twofold: (a) syntactic and (b) semantic similarity.
Syntactic similarity
You have already discovered Shingling, so I will focus on other aspects. Recent approaches use latent variable models to describe syntactic patterns. The basic idea is to use conditional probability: P(f | f_c), where f is some feature and f_c is its context. The simplest example of such models is a Markov model with words as features and the previous words as context. These models answer the question: what is the probability of a word w_n, given that words w_1, ..., w_{n-1} occur before it in a document? This avenue will lead you to building language models, thereby measuring document similarity based on perplexity. For purely syntactic similarity measures, one may look at parse tree features instead of words.
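As a toy illustration of the simplest such model, here is a sketch of a bigram (first-order Markov) model with add-one smoothing; a low perplexity of one document under a model trained on another suggests similar word patterns. All class, method and variable names are made up for the example:

```java
import java.util.*;

/** Toy bigram (first-order Markov) language model with add-one smoothing. */
public class BigramModel {
    private final Map<String, Map<String, Integer>> bigramCounts = new HashMap<>();
    private final Map<String, Integer> contextCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();

    /** Train on a tokenized document. */
    public void train(List<String> tokens) {
        vocabulary.addAll(tokens);
        for (int i = 0; i + 1 < tokens.size(); i++) {
            String context = tokens.get(i), word = tokens.get(i + 1);
            contextCounts.merge(context, 1, Integer::sum);
            bigramCounts.computeIfAbsent(context, k -> new HashMap<>())
                        .merge(word, 1, Integer::sum);
        }
    }

    /** P(word | context) with add-one smoothing over the vocabulary. */
    public double probability(String word, String context) {
        int pair = bigramCounts.getOrDefault(context, Map.of()).getOrDefault(word, 0);
        int ctx = contextCounts.getOrDefault(context, 0);
        return (pair + 1.0) / (ctx + vocabulary.size());
    }

    /** Perplexity of a document under this model; lower means "more like" the training text. */
    public double perplexity(List<String> tokens) {
        double logProbSum = 0;
        int n = 0;
        for (int i = 0; i + 1 < tokens.size(); i++) {
            logProbSum += Math.log(probability(tokens.get(i + 1), tokens.get(i)));
            n++;
        }
        return Math.exp(-logProbSum / Math.max(n, 1));
    }

    public static void main(String[] args) {
        BigramModel model = new BigramModel();
        model.train(Arrays.asList("the quick brown fox jumps over the lazy dog".split(" ")));
        System.out.println(model.perplexity(Arrays.asList("the lazy dog jumps".split(" "))));
    }
}
```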
Semantic similarity
This is a much harder problem, of course. State-of-the-art in this direction involves understanding distributional semantics. Distributional semantics essentially says, "terms which occur in similar contexts over large amounts of data are bound to have similar meanings". This approach is data-intensive. The basic idea is to build vectors of "contexts", and then measure the similarity of these vectors.
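A very small sketch of that idea: count co-occurrences within a fixed window and compare the resulting context vectors with cosine similarity. The toy corpus, window size and names are invented for the example:

```java
import java.util.*;

/** Toy distributional-semantics sketch: co-occurrence context vectors compared by cosine. */
public class ContextVectors {

    /** For each word, count the words that appear within +/- window positions of it. */
    static Map<String, Map<String, Integer>> contextVectors(List<String> tokens, int window) {
        Map<String, Map<String, Integer>> vectors = new HashMap<>();
        for (int i = 0; i < tokens.size(); i++) {
            Map<String, Integer> vec =
                    vectors.computeIfAbsent(tokens.get(i), k -> new HashMap<>());
            int from = Math.max(0, i - window), to = Math.min(tokens.size() - 1, i + window);
            for (int j = from; j <= to; j++) {
                if (j != i) vec.merge(tokens.get(j), 1, Integer::sum);
            }
        }
        return vectors;
    }

    /** Cosine similarity of two sparse count vectors. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
            na += e.getValue() * e.getValue();
        }
        for (int v : b.values()) nb += v * v;
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        List<String> corpus = Arrays.asList(
                ("the cat sat on the mat the dog sat on the rug "
                 + "the cat chased the mouse the dog chased the cat").split(" "));
        Map<String, Map<String, Integer>> vectors = contextVectors(corpus, 2);
        // "cat" and "dog" occur in similar contexts, so their vectors should be close.
        System.out.println(cosine(vectors.get("cat"), vectors.get("dog")));
    }
}
```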
Measuring document similarity based on natural language is not easy, and an answer here will not do it justice, so I point you to this ACL paper, which, in my opinion, provides a pretty good picture.

most effective distance function for collaborative filtering in weka Java API

So I'm building a collaborative filtering system using the Java API of Weka's machine learning library...
I basically use the StringToWordVector filter to convert string objects into their word-occurrence decomposition...
Now I'm using the kNN algorithm to find the nearest neighbors to a target object...
My question is: what distance function should I use to compute the distance between two objects that have been filtered by the StringToWordVector filter? Which one would be most effective for this scenario?
the available options in Weka are:
AbstractStringDistanceFunction, ChebyshevDistance, EditDistance, EuclideanDistance, ManhattanDistance, NormalizableDistance
Yes, similarity metrics are good times. The short answer is that you should try them all and optimize with respect to RMSE, MAE, breadth of the return set, etc.
There seems to be a distinction between edit distance and the rest of these metrics, as I would expect an EditDistance algorithm to work on the strings themselves.
How does your StringToWordVector work? First answer this question, and then use that answer to fuel thoughts like: what do I want the similarity between two words to mean in my application (does semantic meaning outweigh word length, for instance)?
And as long as you're using StringToWordVector, it would seem you're free to consider more mainstream similarity metrics like log-likelihood, Pearson, and cosine. I think this is worth doing, as none of the similarity metrics you've listed are widely used or studied seriously in the literature, to my knowledge.
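If it helps, here is a rough sketch of wiring an explicit distance function into Weka's kNN on StringToWordVector output; the ARFF file name and attribute layout are placeholders, not part of your actual setup:

```java
import weka.classifiers.lazy.IBk;
import weka.core.EuclideanDistance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.core.neighboursearch.LinearNNSearch;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

/** Rough sketch: bag-of-words kNN in Weka with an explicit distance function. */
public class WekaKnnSketch {
    public static void main(String[] args) throws Exception {
        // "items.arff" is a placeholder: one string attribute plus a nominal class.
        Instances raw = new DataSource("items.arff").getDataSet();
        raw.setClassIndex(raw.numAttributes() - 1);

        // Turn the string attribute into word-occurrence (here TF/IDF-weighted) attributes.
        StringToWordVector s2wv = new StringToWordVector();
        s2wv.setTFTransform(true);
        s2wv.setIDFTransform(true);
        s2wv.setInputFormat(raw);
        Instances vectors = Filter.useFilter(raw, s2wv);

        // kNN with an explicit distance function on the vectorized data.
        LinearNNSearch search = new LinearNNSearch();
        search.setDistanceFunction(new EuclideanDistance()); // try Manhattan/Chebyshev too
        IBk knn = new IBk(5);
        knn.setNearestNeighbourSearchAlgorithm(search);
        knn.buildClassifier(vectors);
    }
}
```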
May the similarity be with you!

Need help in latent semantic indexing

I am sorry, if my question sounds stupid :)
Can you please recommend any pseudocode or a good algorithm for an LSI implementation in Java?
I am not a math expert. I tried to read some articles on Wikipedia and other websites about LSI (latent semantic indexing), but they were full of math.
I know LSI is full of math, but if I see some source code or an algorithm, I understand things more easily. That's why I asked here, because so many gurus are here!
Thanks in advance
The idea of LSA is based on one assumption: the more often two words occur in the same documents, the more similar they are. Indeed, we can expect that the words "programming" and "algorithm" will occur in the same documents much more often than, say, "programming" and "dog-breeding".
The same goes for documents: the more common/similar words two documents share, the more similar the documents themselves are. So you can express the similarity of documents by frequencies of words, and vice versa.
Knowing this, we can construct a co-occurrence matrix, where columns represent documents, rows represent words, and each cell[i][j] holds the frequency of word words[i] in document documents[j]. Frequency may be computed in many ways; IIRC, the original LSA uses the tf-idf weight.
Having such a matrix, you can find the similarity of two documents by comparing the corresponding columns. How to compare them? Again, there are several ways. The most popular is cosine distance. You may remember from school maths that a matrix may be treated as a bunch of vectors, so each column is just a vector in some multidimensional space. That's why this model is called the "Vector Space Model". More on VSM and cosine distance here.
But we have one problem with such a matrix: it is big. Very, very big. Working with it is too computationally expensive, so we have to reduce it somehow. LSA uses the SVD technique to keep only the most "important" vectors. After the reduction, the matrix is ready to use.
So, the algorithm for LSA will look something like this:
Collect all documents and all unique words from them.
Extract frequency information and build the co-occurrence matrix.
Reduce the matrix with SVD.
If you're going to write an LSA library yourself, a good place to start is the Lucene search engine, which will make steps 1 and 2 much easier, plus some implementation of high-dimensional matrices with SVD capability, like Parallel Colt or UJMP.
Also pay attention to other techniques that grew out of LSA, like Random Indexing. RI uses the same idea and shows approximately the same results, but doesn't require building the full matrix and is completely incremental, which makes it much more computationally efficient.
This may be a bit late, but I always liked Sujit Pal's blog http://sujitpal.blogspot.com/2008/09/ir-math-with-java-tf-idf-and-lsi.html and I have written a bit on my site if you are interested.
The process is way less complicated than it is often written up as. And really all you need is a library that can do singular value decomposition of a matrix.
If you are interested, I can explain it in a couple of short take-away bits:
1) You create a matrix/dataset/etc. with the word counts of the various documents: the different documents will be your columns and the rows the distinct words.
2) Once you've created the matrix, you use a library like Jama (for Java) or SmartMathLibrary (for C#) and run the singular value decomposition. All this does is take your original matrix and break it up into three different parts/matrices that essentially represent your documents, your words, and a kind of multiplier (sigma); these are called the vectors.
3) Once you have the word, document, and sigma vectors, you shrink them equally (to some rank k) by just copying smaller parts of each vector/matrix and then multiplying them back together. Shrinking them kind of normalizes your data, and this is LSI.
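A small sketch of steps 2) and 3) using Jama; the term-document counts below are toy numbers and k = 2 is an arbitrary choice for the example:

```java
import Jama.Matrix;
import Jama.SingularValueDecomposition;

/** Small sketch of the SVD + truncation ("shrink to k") step with Jama. */
public class LsiSketch {
    public static void main(String[] args) {
        // Toy term-document matrix: rows = distinct words, columns = documents.
        // (Jama's SVD expects rows >= columns, which holds when words outnumber documents.)
        double[][] counts = {
                {2, 0, 1},
                {1, 1, 0},
                {0, 3, 1},
                {0, 1, 2},
                {1, 0, 0}
        };
        Matrix a = new Matrix(counts);

        // 2) Break the matrix into U (words), S (sigma), V (documents).
        SingularValueDecomposition svd = a.svd();
        Matrix u = svd.getU();
        Matrix s = svd.getS();
        Matrix v = svd.getV();

        // 3) Keep only the first k singular values/vectors and multiply back together.
        int k = 2;
        Matrix uk = u.getMatrix(0, u.getRowDimension() - 1, 0, k - 1);
        Matrix sk = s.getMatrix(0, k - 1, 0, k - 1);
        Matrix vk = v.getMatrix(0, v.getRowDimension() - 1, 0, k - 1);

        // Rank-k approximation of the original term-document matrix.
        Matrix reduced = uk.times(sk).times(vk.transpose());
        reduced.print(8, 3);
    }
}
```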
Here are some fairly clear resources:
http://puffinwarellc.com/index.php/news-and-articles/articles/30-singular-value-decomposition-tutorial.html
http://lsa.colorado.edu/papers/JASIS.lsi.90.pdf
http://www.soe.ucsc.edu/classes/cmps290c/Spring07/proj/Flynn_talk.pdf
Hope this helps you out a bit.
Eric
