I'm new to NLP and I want to find the similarity between two documents.
I googled and found that there are some ways to do it, e.g.:
Shingling, and finding text resemblance
Cosine similarity, or Lucene
tf-idf
What is the best way to do this (I'm open to other methods too) so that we get high precision? If there is an API in Java to do this, please also let me know.
The answer to your question is twofold: (a) syntactic and (b) semantic similarity.
Syntactic similarity
You have already discovered Shingling, so I will focus on other aspects. Recent approaches use latent variable models to describe syntactic patterns. The basic idea is to use conditional probability: P(f | f_c), where f is some feature and f_c is its context. The simplest example of such models is a Markov model with words as features and the previous words as context. These models answer the question: what is the probability of a word w_n, given that the words w_1, ..., w_(n-1) occur before it in a document? This avenue will lead you to building language models, and thereby to measuring document similarity based on perplexity. For purely syntactic similarity measures, one may look at parse-tree features instead of words.
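To make the idea concrete, here is a rough bigram (first-order Markov) sketch in Java. The class and method names are made up for illustration, and the smoothing is the simplest possible (add-one); it is not a production language model.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Toy bigram (first-order Markov) language model; illustrative only. */
public class BigramModel {
    private final Map<String, Map<String, Integer>> bigramCounts = new HashMap<>();
    private final Map<String, Integer> contextCounts = new HashMap<>();

    /** Count bigrams from a whitespace-tokenized document. */
    public void train(String[] tokens) {
        for (int i = 1; i < tokens.length; i++) {
            String previous = tokens[i - 1];
            String current = tokens[i];
            contextCounts.merge(previous, 1, Integer::sum);
            bigramCounts.computeIfAbsent(previous, k -> new HashMap<>())
                        .merge(current, 1, Integer::sum);
        }
    }

    /** P(word | previous) with add-one smoothing over a vocabulary of size vocabularySize. */
    public double probability(String previous, String word, int vocabularySize) {
        int context = contextCounts.getOrDefault(previous, 0);
        int joint = bigramCounts.getOrDefault(previous, Collections.<String, Integer>emptyMap())
                                .getOrDefault(word, 0);
        return (joint + 1.0) / (context + vocabularySize);
    }

    /** Perplexity of a tokenized document under this model: exp(-average log P). */
    public double perplexity(String[] tokens, int vocabularySize) {
        double logProbabilitySum = 0.0;
        int predictions = 0;
        for (int i = 1; i < tokens.length; i++) {
            logProbabilitySum += Math.log(probability(tokens[i - 1], tokens[i], vocabularySize));
            predictions++;
        }
        return Math.exp(-logProbabilitySum / Math.max(predictions, 1));
    }
}
```

Train it on one document and measure the perplexity of another: a lower perplexity suggests the second document "looks like" the first, syntactically.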
Semantic similarity
This is a much harder problem, of course. State-of-the-art in this direction involves understanding distributional semantics. Distributional semantics essentially says, "terms which occur in similar contexts over large amounts of data are bound to have similar meanings". This approach is data-intensive. The basic idea is to build vectors of "contexts", and then measure the similarity of these vectors.
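As a toy illustration of that idea (my own sketch, not a real distributional-semantics toolkit), you can build a simple window-based context vector per term and compare those vectors with cosine:

```java
import java.util.HashMap;
import java.util.Map;

/** Toy distributional-semantics sketch: window-based context vectors compared with cosine. */
public class ContextVectors {

    /** Count the words appearing within +/- window positions of each occurrence of the target term. */
    public static Map<String, Integer> contextVector(String[] tokens, String target, int window) {
        Map<String, Integer> vector = new HashMap<>();
        for (int i = 0; i < tokens.length; i++) {
            if (!tokens[i].equals(target)) continue;
            int from = Math.max(0, i - window);
            int to = Math.min(tokens.length - 1, i + window);
            for (int j = from; j <= to; j++) {
                if (j != i) vector.merge(tokens[j], 1, Integer::sum);
            }
        }
        return vector;
    }

    /** Cosine similarity of two sparse count vectors. */
    public static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> entry : a.entrySet()) {
            int av = entry.getValue();
            int bv = b.getOrDefault(entry.getKey(), 0);
            dot += (double) av * bv;
            normA += (double) av * av;
        }
        for (int bv : b.values()) normB += (double) bv * bv;
        return (normA == 0 || normB == 0) ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Real systems replace the raw counts with weighted, dimensionality-reduced representations (LSA, word embeddings, etc.), but the "compare context vectors" core is the same.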
Measuring document similarity based on natural language is not easy, and an answer here will not do justice, so I point you to this ACL paper, which, in my opinion, provides a pretty good picture.
Related
Is there any pretrained vector set for a particular topic only? For example "java": I want vectors related to Java in a file. Meaning, if I give input inheritance, then cosine similarity should show me polymorphism and other related stuff only!
I am using GoogleNews-vectors-negative300.bin and GloVe vectors as the corpus, and I'm still not getting related words.
Not sure if I understand your question/problem statement, but if you want to work with a corpus of Java source code you can use code2vec, which provides pre-trained word-embedding models. Check it out: https://code2vec.org/
Yes, you can occasionally find other groups' pre-trained vectors for download, which may have better coverage of whatever problem domains they've been trained on: both more specialized words, and word-vectors matching the word sense in that domain.
For example, the GoogleNews word-vectors were trained on news articles circa 2012, so its vector for 'Java' may be dominated by stories about the Java island of Indonesia as much as by the programming language. And many other vector sets are trained on Wikipedia text, which will be dominated by usages in that particular reference style of writing. But there could be other sets that better emphasize the word senses you need.
However, the best approach is often to train your own word-vectors, from a training corpus that closely matches the topics/documents you are concerned about. Then, the word-vectors are well-tuned to your domain-of-concern. As long as you have "enough" varied examples of a word used in context, the resulting vector will likely be better than generic vectors from someone else's corpus. ("Enough" has no firm definition, but is usually at least 5, and ideally dozens to hundreds, of representative, diverse uses.)
Let's consider your example goal – showing some similarity between the ideas of 'polymorphism' and 'input inheritance'. For that, you'd need a training corpus that discusses those concepts, ideally many times, from many authors, in many problem-contexts. (Textbooks, online articles, and Stack Overflow pages might be possible sources.)
You'd further need a tokenization strategy that manages to create a single word-token for the two-word concept 'input_inheritance' - which is a separate challenge, and might be tackled via (1) a hand-crafted glossary of multi-word-phrases that should be combined; (2) statistical analysis of word-pairs that seem to occur so often together, they should be combined; (3) more sophisticated grammar-aware phrase- and entity-detection preprocessing.
(The multiword phrases in the GoogleNews set were created via a statistical algorithm which is also available in the gensim Python library as the Phrases class. But, the exact parameters Google used have not, as far as I know, been revealed. And good results from this algorithm can require a lot of data and tuning, and still produce some combinations that a person would consider nonsense, while missing others that a person would consider natural.)
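If you want the flavor of that statistical approach without gensim, here is a rough Java sketch of the same scoring idea (count-based bigram scoring with a discount and a threshold). The class name, discount, and threshold are illustrative; this is not the exact GoogleNews/gensim algorithm.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Rough sketch of statistical phrase detection: join word pairs that co-occur unusually often. */
public class PhraseCombiner {

    /** score(a, b) = (count(a b) - discount) * totalTokens / (count(a) * count(b)) */
    public static List<String> combine(List<String[]> sentences, int discount, double threshold) {
        Map<String, Integer> unigrams = new HashMap<>();
        Map<String, Integer> bigrams = new HashMap<>();
        long totalTokens = 0;
        for (String[] sentence : sentences) {
            for (int i = 0; i < sentence.length; i++) {
                unigrams.merge(sentence[i], 1, Integer::sum);
                totalTokens++;
                if (i > 0) bigrams.merge(sentence[i - 1] + " " + sentence[i], 1, Integer::sum);
            }
        }
        List<String> phrases = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : bigrams.entrySet()) {
            String[] parts = entry.getKey().split(" ");
            double score = (entry.getValue() - discount) * (double) totalTokens
                    / ((double) unigrams.get(parts[0]) * unigrams.get(parts[1]));
            // Pairs scoring above the threshold become single tokens like "input_inheritance".
            if (score > threshold) phrases.add(parts[0] + "_" + parts[1]);
        }
        return phrases;
    }
}
```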
My goal is to find a similarity value between two documents (collections of words). I have already found several answers, like this SO post or this SO post, which provide Python libraries that achieve this, but I have trouble understanding the approach and making it work for my use case.
If I understand correctly, TF-IDF of a document is computed with respect to a given term, right? That's how I interpret it from the Wikipedia article on this: "tf-idf...is a numerical statistic that is intended to reflect how important a word is to a document".
In my case, I don't have a specific search term which I want to compare to the document, but I have two different documents. I assume I need to first compute vectors for the documents, and then take the cosine between these vectors. But all the answers I found with respect to constructing these vectors always assume a search term, which I don't have in my case.
Can't wrap my head around this, any conceptual help or links to Java libraries that achieve this would be highly appreciated.
I suggest running terminology extraction first, together with the term frequencies. Note that stemming can also be applied to the extracted terms to avoid noise during the subsequent cosine-similarity calculation. See the "Java library for keywords extraction from input text" SO thread for more help and ideas on that.
Then, as you yourself mention, for each of those terms, you will have to compute the TF-IDF values, get the vectors and compute the cosine similarity.
When calculating TF-IDF, mind that the 1 + log(N/n) formula (N standing for the total number of documents and n standing for the number of documents that contain the term) is better, since it avoids the issue where TF is not 0 but IDF turns out to be 0.
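To make the pipeline concrete, here is a minimal, library-free Java sketch of the steps described above (TF-IDF vectors with the 1 + log(N/n) smoothing, then cosine similarity). The class and method names are just for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Sketch: TF-IDF vectors for documents (bags of terms) and cosine similarity between two of them. */
public class TfIdfSimilarity {

    /** IDF with the smoothed formula 1 + log(N / n) so weights never collapse to zero. */
    public static Map<String, Double> idf(List<Set<String>> docTermSets) {
        Map<String, Integer> documentFrequency = new HashMap<>();
        for (Set<String> terms : docTermSets) {
            for (String term : terms) documentFrequency.merge(term, 1, Integer::sum);
        }
        Map<String, Double> idf = new HashMap<>();
        int totalDocuments = docTermSets.size();
        for (Map.Entry<String, Integer> entry : documentFrequency.entrySet()) {
            idf.put(entry.getKey(), 1.0 + Math.log((double) totalDocuments / entry.getValue()));
        }
        return idf;
    }

    /** TF-IDF vector for one document given its raw term frequencies. */
    public static Map<String, Double> tfIdf(Map<String, Integer> termFrequency, Map<String, Double> idf) {
        Map<String, Double> vector = new HashMap<>();
        for (Map.Entry<String, Integer> entry : termFrequency.entrySet()) {
            vector.put(entry.getKey(), entry.getValue() * idf.getOrDefault(entry.getKey(), 0.0));
        }
        return vector;
    }

    /** Cosine similarity of two sparse TF-IDF vectors. */
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> entry : a.entrySet()) {
            dot += entry.getValue() * b.getOrDefault(entry.getKey(), 0.0);
            normA += entry.getValue() * entry.getValue();
        }
        for (double value : b.values()) normB += value * value;
        return (normA == 0 || normB == 0) ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Note that the IDF is computed over the whole collection of documents, not over a search term: the two documents you want to compare are simply two of the vectors you feed to cosine().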
I am currently conducting a Java project in NLP/IR, and I am fairly new to this.
The project consists of a collection of around 1000 documents, where each document has about 100 words, structured as a bag of words with term frequencies. I want to find similar documents based on a document from the collection.
Using TF-IDF, I would calculate tf-idf for the query (a given document) and every other document in the collection, then compare these values as vectors with cosine similarity. Could this give some insight into their similarity? Or would it not be reasonable because of the big query (document)?
Are there any other similarity measures that could work better?
Thanks for the help
TF-IDF-based similarity, typically using a cosine to compare a vector representing the query terms to a set of vectors representing the TF-IDF values of the documents, is a common approach to calculating "similarity".
Mind that "similarity" is a very generic term. In the IR domain, you typically speak of "relevance" instead. Texts can be similar on many levels: written in the same language, using the same characters, using the same words, talking about the same people, using a similarly complex grammatical structure, and much more; consequently, there are many, many measures. Search the web for text similarity to find many publications, but also open-source frameworks and libraries that implement different measures.
Today, "semantic similarity" is attracting more interest than the traditional keyword-based IR models. If this is your area of interest, you might look into the results of the SemEval shared tasks years 2012-2015.
If all you want is to compare two documents using TF-IDF, you can do that. Since you mention that each document contains about 100 words, in the worst case there might be 1000 * 100 unique words. So I'm assuming your vectors are built over all unique words (since all documents should be represented in the same dimension). If the number of unique words is too high, you could try using a dimensionality-reduction technique (like PCA) to reduce the dimensions. But what you are trying to do is right; you can always compare documents like this to find the similarity between documents.
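As a rough illustration of the "same dimension" point (my own sketch, not taken from a particular library), you can lay out both bag-of-words documents over their union vocabulary before taking the cosine:

```java
import java.util.Map;
import java.util.TreeSet;

/** Sketch: represent two bag-of-words documents in the same dimension (union vocabulary). */
public class UnionVocabulary {

    /** Returns two dense vectors of equal length, one per document, ordered by the shared vocabulary. */
    public static double[][] toDenseVectors(Map<String, Integer> doc1, Map<String, Integer> doc2) {
        TreeSet<String> vocabulary = new TreeSet<>(doc1.keySet());
        vocabulary.addAll(doc2.keySet());
        double[] vector1 = new double[vocabulary.size()];
        double[] vector2 = new double[vocabulary.size()];
        int i = 0;
        for (String term : vocabulary) {
            vector1[i] = doc1.getOrDefault(term, 0);
            vector2[i] = doc2.getOrDefault(term, 0);
            i++;
        }
        return new double[][] { vector1, vector2 };
    }
}
```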
If you want similarity more in the sense of semantics you should look at using LDA (topic modelling) type techniques.
so I'm building this collaborative filtering system using Weka's machine learning library JAVA API...
I basically use the StringToWordVector filter to convert string objects into their word-occurrence decomposition...
So now I'm using the kNN algorithm to find the nearest neighbours of a target object...
My question is: what distance function should I use to compute the distance between two objects that have been filtered by the StringToWordVector filter? Which one would be most effective for this scenario?
the available options in Weka are:
AbstractStringDistanceFunction, ChebyshevDistance, EditDistance, EuclideanDistance, ManhattanDistance, NormalizableDistance
Yes, similarity metrics are good times. The short answer is that you should try them all and optimize with respect to RMSE, MAE, breadth of the return set, etc.
There seems to be a distinction between EditDistance and the rest of these metrics, as I would expect an EditDistance algorithm to work on the strings themselves.
How does your StringToWordVector work? First answer this question, and then use that answer to fuel thoughts like: what do I want the similarity between two words to mean in my application (does semantic meaning outweigh word length, for instance)?
And as long as you're using StringToWordVector, it would seem you're free to consider more mainstream similarity metrics like log-likelihood, Pearson correlation, and cosine similarity. I think this is worth doing, as none of the similarity metrics you've listed are widely used or studied seriously in the literature, to my knowledge.
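For the cosine route: Weka (at least the versions I've used) doesn't ship a cosine DistanceFunction out of the box, but after StringToWordVector it's easy to compute yourself. A rough sketch, assuming Weka 3.7+ and a single string attribute; the class name and the two example documents are made up.

```java
import java.util.ArrayList;

import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instance;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

/** Sketch: vectorize two documents with StringToWordVector, then compare them with cosine. */
public class WekaCosineSketch {

    public static void main(String[] args) throws Exception {
        // One string attribute holding the raw text of each document.
        ArrayList<Attribute> attributes = new ArrayList<>();
        attributes.add(new Attribute("text", (ArrayList<String>) null));
        Instances data = new Instances("docs", attributes, 0);

        for (String text : new String[] { "the quick brown fox", "the lazy brown dog" }) {
            double[] values = new double[1];
            values[0] = data.attribute(0).addStringValue(text);
            data.add(new DenseInstance(1.0, values));
        }

        // Turn the string attribute into word-occurrence (count) attributes.
        StringToWordVector filter = new StringToWordVector();
        filter.setLowerCaseTokens(true);
        filter.setOutputWordCounts(true);
        filter.setInputFormat(data);
        Instances vectors = Filter.useFilter(data, filter);

        System.out.println("cosine = " + cosine(vectors.instance(0), vectors.instance(1)));
    }

    /** Cosine similarity over the filtered attribute values. */
    static double cosine(Instance a, Instance b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.numAttributes(); i++) {
            dot += a.value(i) * b.value(i);
            normA += a.value(i) * a.value(i);
            normB += b.value(i) * b.value(i);
        }
        return (normA == 0 || normB == 0) ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

If you need kNN with IBk specifically, you would still have to plug in one of the listed DistanceFunctions (EuclideanDistance on normalized vectors is the closest stand-in for cosine), so this manual comparison is mostly useful for evaluating which notion of similarity behaves best on your data.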
May the similarity be with you!
I am sorry if my question sounds stupid :)
Can you please recommend me some pseudo code or a good algorithm for an LSI implementation in Java?
I am not a math expert. I tried to read some articles on Wikipedia and other websites about LSI (latent semantic indexing), and they were full of math. I know LSI is full of math, but if I see some source code or an algorithm, I understand things more easily. That's why I asked here, because so many gurus are here!
Thanks in advance
The idea of LSA is based on one assumption: the more often two words occur in the same documents, the more similar they are. Indeed, we can expect that the words "programming" and "algorithm" will occur in the same documents much more often than, say, "programming" and "dog-breeding".
The same holds for documents: the more common/similar words two documents have, the more similar they themselves are. So, you can express the similarity of documents through the frequencies of their words, and vice versa.
Knowing this, we can construct a co-occurrence matrix, where the column names represent documents, the row names represent words, and each cell[i][j] holds the frequency of word words[i] in document documents[j]. Frequency may be computed in many ways; IIRC, the original LSA uses the tf-idf index.
Having such a matrix, you can find the similarity of two documents by comparing the corresponding columns. How do you compare them? Again, there are several ways. The most popular is the cosine distance. You may remember from school maths that a matrix can be treated as a bunch of vectors, so each column is just a vector in some multidimensional space. That's why this model is called the "Vector Space Model". More on VSM and cosine distance here.
But we have one problem with such a matrix: it is big. Very, very big. Working with it is too computationally expensive, so we have to reduce it somehow. LSA uses the SVD technique to keep the most "important" vectors. After reduction, the matrix is ready to use.
So, the algorithm for LSA will look something like this:
Collect all documents and all unique words from them.
Extract frequency information and build co-occurrence matrix.
Reduce matrix with SVD.
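As a rough illustration of steps 1 and 2 (my own sketch, using raw counts rather than tf-idf and no Lucene):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of LSA steps 1 and 2: collect the vocabulary and build a term-document count matrix. */
public class TermDocumentMatrix {

    /** rows = words, columns = documents; cell[i][j] = count of word i in document j. */
    public static double[][] build(List<String[]> documents, List<String> vocabularyOut) {
        Map<String, Integer> wordIndex = new HashMap<>();
        for (String[] document : documents) {
            for (String word : document) {
                if (!wordIndex.containsKey(word)) {
                    wordIndex.put(word, wordIndex.size());
                    vocabularyOut.add(word);
                }
            }
        }
        double[][] matrix = new double[wordIndex.size()][documents.size()];
        for (int j = 0; j < documents.size(); j++) {
            for (String word : documents.get(j)) {
                matrix[wordIndex.get(word)][j] += 1.0;  // raw counts; swap in tf-idf weighting if preferred
            }
        }
        return matrix;
    }
}
```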
If you're going to write an LSA library yourself, a good starting point is the Lucene search engine, which will make steps 1 and 2 much easier, together with some implementation of high-dimensional matrices with SVD capability, like Parallel Colt or UJMP.
Also pay attention to other techniques that grew out of LSA, like Random Indexing. RI uses the same idea and shows approximately the same results, but doesn't use the full-matrix stage and is completely incremental, which makes it much more computationally efficient.
This may be a bit late, but I always liked Sujit Pal's blog http://sujitpal.blogspot.com/2008/09/ir-math-with-java-tf-idf-and-lsi.html, and I have written a bit on my site if you are interested.
The process is way less complicated than it is often written up as. Really, all you need is a library that can do singular value decomposition of a matrix.
If you are interested, I can explain it in a couple of short take-away bits:
1) You create a matrix/dataset/etc. with the word counts of the various documents: the different documents will be your columns and the rows the distinct words.
2) Once you've created the matrix, you use a library like Jama (for Java) or SmartMathLibrary (for C#) and run the singular value decomposition. All this does is take your original matrix and break it up into three different parts/matrices that essentially represent your documents, your words, and a kind of multiplier (sigma); these are called the vectors.
3) Once you have the word, document, and sigma vectors, you shrink them equally (to some rank k) by just copying smaller parts of the vectors/matrices and then multiplying them back together. Shrinking them kind of normalizes your data, and this is LSI.
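Here is a rough sketch of steps 2 and 3 with Jama (assumed to be on the classpath); note that Jama's SVD expects at least as many rows as columns, which is usually the case when the rows are words.

```java
import Jama.Matrix;
import Jama.SingularValueDecomposition;

/** Sketch: rank-k reduction of a term-document matrix with Jama's SVD (the LSI step). */
public class LsiSketch {

    /** Returns the rank-k approximation U_k * S_k * V_k^T of the term-document matrix. */
    public static Matrix reduce(double[][] termDocumentCounts, int k) {
        // Jama's SVD assumes rows >= columns (more words than documents), which is typical here.
        Matrix a = new Matrix(termDocumentCounts);
        SingularValueDecomposition svd = a.svd();

        // Keep only the first k columns of U and V, and the top-left k x k block of S.
        Matrix uk = svd.getU().getMatrix(0, a.getRowDimension() - 1, 0, k - 1);
        Matrix sk = svd.getS().getMatrix(0, k - 1, 0, k - 1);
        Matrix vk = svd.getV().getMatrix(0, a.getColumnDimension() - 1, 0, k - 1);

        // Multiplying back gives the smoothed ("latent semantic") matrix; compare its
        // columns with cosine similarity to compare documents.
        return uk.times(sk).times(vk.transpose());
    }
}
```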
Here are some fairly clear resources:
http://puffinwarellc.com/index.php/news-and-articles/articles/30-singular-value-decomposition-tutorial.html
http://lsa.colorado.edu/papers/JASIS.lsi.90.pdf
http://www.soe.ucsc.edu/classes/cmps290c/Spring07/proj/Flynn_talk.pdf
Hope this helps you out a bit.
Eric