I am looking for a simple Java class that can compute a TF-IDF calculation. I want to do a similarity test on 2 documents. I have found so many BIG APIs that use a TF-IDF class, but I do not want to use a big jar file just to do my simple test. Please help!
Or at least, if someone can tell me how to find TF and IDF, I will calculate the results myself :)
OR
If you can tell me about a good Java tutorial for this.
Please do not tell me to look on Google; I already did for 3 days and couldn't find anything :(
Please also do not refer me to Lucene :(
Term Frequency is the square root of the number of times a term occurs in a particular document.
Inverse Document Frequency is the log of (the total number of documents divided by the number of documents containing the term), plus one. If a term might appear in zero documents, add one to the denominator as well so you don't end up dividing by zero.
If it isn't clear from that answer, there is a TF per term per document, and an IDF per term.
And then TF-IDF(term, document) = TF(term, document) * IDF(term)
Finally, you use the vector space model to compare documents, where each term is a new dimension and the "length" of the part of the vector pointing in that dimension is the TF-IDF calculation. Each document is a vector, so compute the two vectors and then compute the distance between them.
So to do this in Java, read the file in one line at a time with a FileReader or something, and split on spaces or whatever other delimiters you want to use - each word is a term. Count the number of times each term appears in each file, and the number of files each term appears in. Then you have everything you need to do the above calculations.
And since I have nothing else to do, I looked up the vector distance formula. Here you go:
D = sqrt((x2 - x1)^2 + (y2 - y1)^2 + ... + (n2 - n1)^2)
For this purpose, x1 is the TF-IDF for term x in document 1.
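If it helps, here is a minimal sketch of that distance computation in plain Java, under the assumption that each document has already been reduced to a map from term to its TF-IDF weight (a term missing from a document simply contributes 0 in that dimension):

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class VectorDistance {

    // Euclidean distance between two documents in the term vector space.
    public static double distance(Map<String, Double> doc1, Map<String, Double> doc2) {
        Set<String> allTerms = new HashSet<String>(doc1.keySet());
        allTerms.addAll(doc2.keySet());

        double sum = 0.0;
        for (String term : allTerms) {
            double a = doc1.containsKey(term) ? doc1.get(term) : 0.0;
            double b = doc2.containsKey(term) ? doc2.get(term) : 0.0;
            sum += (a - b) * (a - b);
        }
        return Math.sqrt(sum);
    }
}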
Edit: in response to your question about how to count the words in a document:
Read the file in line by line with a reader, like new BufferedReader(new FileReader(filename)) - you can call BufferedReader.readLine() in a while loop, checking for null each time.
For each line, call line.split("\\s+") - that will split your line on runs of whitespace and give you an array of all of the words.
For each word, add 1 to the word's count for the current document. This could be done using a HashMap.
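For illustration, a small sketch of that counting step (the split pattern and any lower-casing or punctuation stripping are up to you):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class WordCounter {

    // Returns a map from each term to the number of times it occurs in the file.
    public static Map<String, Integer> countTerms(String filename) throws IOException {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        BufferedReader reader = new BufferedReader(new FileReader(filename));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                for (String word : line.split("\\s+")) {
                    if (word.isEmpty()) {
                        continue;
                    }
                    Integer current = counts.get(word);
                    counts.put(word, current == null ? 1 : current + 1);
                }
            }
        } finally {
            reader.close();
        }
        return counts;
    }
}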
Now, after computing D for each document, you will have X values, where X is the number of documents. Comparing all documents against each other takes only X^2 comparisons - that shouldn't take particularly long for 10,000. Remember that two documents are MORE similar if the absolute value of the difference between their D values is lower. So then you could compute the difference between the Ds of every pair of documents and store that in a priority queue or some other sorted structure so that the most similar documents bubble up to the top. Make sense?
agazerboy, Sujit Pal's blog post gives a thorough description of calculating TF and IDF.
With regard to verifying results, I suggest you start with a small corpus (say 100 documents) so that you can easily see whether you are correct. For 10,000 documents, using Lucene begins to look like a really rational choice.
While you specifically asked not to be referred to Lucene, please allow me to point you to the exact class. The class you are looking for is DefaultSimilarity. It has an extremely simple API to calculate TF and IDF. See the Java code here, or you could just implement it yourself as specified in the DefaultSimilarity documentation.
TF = sqrt(freq)
and
IDF = log(numDocs/(docFreq+1)) + 1.
The log and sqrt functions are used to damp the actual values. Using the raw values can skew results dramatically.
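If you just want those two damped formulas in plain Java, without pulling in Lucene (numDocs and docFreq here are whatever counts you gather from your own corpus), a sketch could be:

public class TfIdf {

    public static double tf(int freq) {
        return Math.sqrt(freq);
    }

    public static double idf(int numDocs, int docFreq) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    public static double tfIdf(int freq, int numDocs, int docFreq) {
        return tf(freq) * idf(numDocs, docFreq);
    }
}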
Every example I've seen for Encog neural nets has involved XOR or something very simple. I have around 10,000 sentences and each word in the sentence has some type of tag. The input layer needs to take 2 inputs, the previous word and the current word. If there is no previous word, then the 1st input is not activated at all. I need to go through each sentence like this. Each word is contingent on the previous word, so I can't just have an array that looks similar to the XOR example. Furthermore, I don't really want to load all the words from 10,000+ sentences into an array, I'd rather scan one sentence at a time and once I reach EOF, start back at the beginning.
How should I go about doing this? I'm not super comfortable with Encog because all the examples I've seen have either been XOR or extremely complicated.
There are 2 inputs... Each input consists of 30 neurons. The chance of the word being a certain tag is used as the input values. So most of the neurons get 0, and the others get probability inputs like .5, .3, and .2. When I say 'aren't activated' I just mean that all the neurons are set to 0. The output layer represents all the possible tags, so it's 30. Whichever output neuron has the highest number is the tag that is chosen.
From the Encog demos I've seen, I'm not sure how to go through all 10,000 sentences, look up each word in each sentence for the inputs, and activate that input.
It seems that the networks are trained with a single array holding all training data, and that is looped through until the network is trained. I would like to train the network with many different arrays (an array per sentence) and then look through them all again.
This format is clearly not going to work for what I'm doing:
do {
    train.iteration();
    System.out.println("Epoch #" + epoch + " Error:" + train.getError());
    epoch++;
} while (train.getError() > 0.01);
So, I'm not sure how to tell you this, but that's not how a neural net works. You can't just use a word as an input, and you can't just "not activate" an input either. At a very basic level, this is what you need to run a neural network on a problem:
A fixed-length input vector (whatever you are feeding in, it must be represented numerically with a fixed length. Each entry in the vector is a single number)
A set of labels (each input vector must correspond to a single, fixed-length output vector)
Once you have those two, the neural net classifies an example, then edits itself to get as close as possible to the labels.
If you're looking to work with words and a deep learning framework, you should map your words to an existing vector representation (I would highly recommend GloVe, but word2vec is decent as well) and then learn on top of that representation.
After getting a deeper understanding of what you're attempting here, I think the issue is that you're dealing with 60 inputs, not one. These inputs are the concatenation of the existing predictions for both words (in the case with no previous word, the first 30 entries are 0). You should take care of the mapping yourself (it should be very straightforward), and then just treat it as trying to predict 30 numbers from 60 numbers.
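As a rough sketch of that concatenation (plain Java, not Encog API; the two 30-entry tag-probability arrays are assumed to come from your own lookup code):

public class InputBuilder {

    private static final int TAGS = 30;

    // prevProbs may be null when there is no previous word; those 30 entries stay 0.
    public static double[] buildInput(double[] prevProbs, double[] currProbs) {
        double[] input = new double[2 * TAGS];
        if (prevProbs != null) {
            System.arraycopy(prevProbs, 0, input, 0, TAGS);
        }
        System.arraycopy(currProbs, 0, input, TAGS, TAGS);
        return input;
    }
}

Each such 60-entry array becomes one training row, paired with a 30-entry output row for the correct tag.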
I feel obliged to tell you that, the way you've framed the problem, you will see awful performance. When dealing with a sparse (mostly zeros) vector and such a small dataset, deep learning techniques will show VERY poor performance compared to other methods. You are better off using GloVe + SVM or a random forest model on your existing data.
You can use other implementations of MLDataSet besides BasicMLDataSet.
I ran into a similar problem with windows of DNA sequences. Building an array of all the windows would not have been scalable.
Instead, I implemented my own VersatileDataSource, and wrapped it in a VersatileMLDataSet.
VersatileDataSource has just a few methods to implement:
public interface VersatileDataSource {
    String[] readLine();
    void rewind();
    int columnIndex(String name);
}
For each readLine(), you could return the inputs for the previous/current word, and advance the position to the next word.
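A rough sketch of such a source, assuming the interface shown above (Encog 3.3) and that you have already flattened your sentences into one String[] row per word; the class and field names here are just placeholders:

import java.util.List;

import org.encog.ml.data.versatile.sources.VersatileDataSource;

public class SentenceDataSource implements VersatileDataSource {

    private final List<String[]> rows; // one row per word, holding previous-word and current-word columns
    private int position = 0;

    public SentenceDataSource(List<String[]> rows) {
        this.rows = rows;
    }

    @Override
    public String[] readLine() {
        if (position >= rows.size()) {
            return null; // end of data; Encog calls rewind() before the next pass
        }
        return rows.get(position++);
    }

    @Override
    public void rewind() {
        position = 0;
    }

    @Override
    public int columnIndex(String name) {
        return -1; // no named columns in this source
    }
}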
I want the Lucene scoring function to have no bias based on the length of the document. This is really a follow-up question to "Calculate the score only based on the documents have more occurance of term in lucene".
I was wondering how Field.setOmitNorms(true) works? I see that there are two factors that make short documents get a high score:
"boost" that shorter length posts - using doc.getBoost()
"lengthNorm" in the definition of norm(t,d)
Here is the documentation
I was wondering - if I wanted no bias towards shorter documents, is Field.setOmitNorms(true) enough?
Using BM25Similarity you could reduce b to 0f:
@param b Controls to what degree document length normalizes tf values
or
@param k1 Controls non-linear term frequency normalization (saturation).
Both params will affect SimWeight
indexSearcher.setSimilarity(new BM25Similarity(1.2f,0f));
More explanation can be found here : http://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/
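A short sketch of wiring that in (assuming a recent Lucene version, 5.x or later; the Directory and IndexReader are whatever you already have, and using the same similarity at index time and search time keeps the scoring consistent):

import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.store.Directory;

public class NoLengthBiasSetup {

    private static final BM25Similarity NO_LENGTH_BIAS = new BM25Similarity(1.2f, 0f); // b = 0

    public static IndexWriter newWriter(Directory directory) throws IOException {
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        config.setSimilarity(NO_LENGTH_BIAS); // index-time
        return new IndexWriter(directory, config);
    }

    public static IndexSearcher newSearcher(IndexReader reader) {
        IndexSearcher searcher = new IndexSearcher(reader);
        searcher.setSimilarity(NO_LENGTH_BIAS); // search-time
        return searcher;
    }
}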
Shorter docs are meant to be more relevant when you use TF-IDF scoring.
You can use your own custom scoring functions in Lucene. It's easy to customize the scoring algorithm: subclass DefaultSimilarity and override the method you want to customize.
There's a code sample here that will help you implement it
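For instance, a minimal sketch that removes the length factor by overriding lengthNorm (this assumes the Lucene 3.x signature; the method name and arguments differ in later versions):

import org.apache.lucene.search.DefaultSimilarity;

public class NoLengthNormSimilarity extends DefaultSimilarity {

    @Override
    public float lengthNorm(String fieldName, int numTerms) {
        return 1.0f; // ignore document length entirely
    }
}

Then set it on your searcher with searcher.setSimilarity(new NoLengthNormSimilarity()), and on the IndexWriter as well if you re-index.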
I have indexed a set of text files by lucene. Also, I have stored TermVectors. But I want to know the frequency of some terms in some documents in O(1). Is it possible?
I mean, is there a function(Term term, Integer docNum) that returns the frequency of term in document docNum ?
There is no ready-made function, you'll have to write some code. First use IndexReader.termDocs(Term). That will give you a TermDocs instance which is, typically of Lucene, a Cursor-like object. Now call TermDocs.skipTo(int), then TermDocs.next(), then TermDocs.freq(). If you are sure at the outset that your document contains your term, this is it; otherwise check after each step whether you can proceed. The Javadocs are well-written for each step involved.
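A minimal sketch of that lookup against the Lucene 3.x API described above (skipTo alone already positions the cursor at the first document >= the one you asked for, so checking doc() tells you whether the term actually occurs there):

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

public class TermFrequencyLookup {

    // Returns how often term occurs in document docNum, or 0 if it does not occur there.
    public static int freq(IndexReader reader, Term term, int docNum) throws IOException {
        TermDocs termDocs = reader.termDocs(term);
        try {
            if (termDocs.skipTo(docNum) && termDocs.doc() == docNum) {
                return termDocs.freq();
            }
            return 0;
        } finally {
            termDocs.close();
        }
    }
}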
I'm doing a Java project where I have to make a text similarity program. I want it to take 2 text documents, compare them with each other, and report how similar they are.
I'll later add a database that can find synonyms for the words and go through the text to see whether one of the document writers just swapped words for synonyms while the text is otherwise the same. Same thing with moving paragraphs up or down.
Yes, as if it were a plagiarism program...
I want to hear from you people what kind of algorithms you would recommend.
I've found Levenshtein and cosine similarity by looking here and other places. Both of them seem to be mentioned a lot. Hamming distance is another my teacher told me about.
I have some questions related to those, since I'm not really getting Wikipedia. Could someone explain these things to me?
Levenshtein: This algorithm changes a word by substitutions, additions and eliminations and sees how close it is to the other word. But how can that be used on a whole text file? I can see how it can be used on a word, but not on a sentence or a whole text document compared to another.
Cosine: It's a measure of similarity between two vectors, obtained by measuring the cosine of the angle between them. What I don't understand here is how two texts can become 2 vectors, and what happens to the words/sentences in them?
Hamming: This distance seems to work better than Levenshtein, but it only applies to strings of equal length. Why is it relevant when 2 documents, and even the sentences in them, aren't strings of equal length?
Wikipedia should make sense, but it doesn't to me. I'm sorry if the questions sound stupid, but this is dragging me down and I think there are people here who are quite capable of explaining it so that even newcomers to the field can get it.
Thanks for your time.
Levenshtein: in theory you could use it for a whole text file, but it's really not very suitable for the task. It's really intended for single words or (at most) a short phrase.
Cosine: You start by simply counting the unique words in each document. The answers to a previous question cover the computation once you've done that.
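As a rough sketch of that computation, once each document has been reduced to a map from word to its count:

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class CosineSimilarity {

    // Cosine of the angle between the two count vectors:
    // 1.0 means the same direction, 0.0 means no words in common.
    public static double cosine(Map<String, Integer> doc1, Map<String, Integer> doc2) {
        Set<String> allWords = new HashSet<String>(doc1.keySet());
        allWords.addAll(doc2.keySet());

        double dot = 0.0;
        double norm1 = 0.0;
        double norm2 = 0.0;
        for (String word : allWords) {
            int a = doc1.containsKey(word) ? doc1.get(word) : 0;
            int b = doc2.containsKey(word) ? doc2.get(word) : 0;
            dot += a * b;
            norm1 += a * a;
            norm2 += b * b;
        }
        if (norm1 == 0.0 || norm2 == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(norm1) * Math.sqrt(norm2));
    }
}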
I've never used Hamming distance for this purpose, so I can't say much about it.
I would add TFIDF (Term Frequency * Inverse Document Frequency) to the list. It's fairly similar to Cosine distance, but 1) tends to do a better job on shorter documents, and 2) does a better job of taking into account what words are extremely common in an entire corpus rather than just the ones that happen to be common to two particular documents.
One final note: for any of these to produce useful results, you pretty much need to screen out stop words before you try to compute the degree of similarity (though TFIDF seems to do better than the others if you skip this). At least in my experience, it's extremely helpful to stem the words (remove suffixes) as well. When I've done it, I used the Porter stemmer algorithm.
For your purposes, you probably want to use what I've dubbed an inverted thesaurus, which lets you look up a word, and for each word substitute a single canonical word for that meaning. I tried this on one project, and didn't find it as useful as expected, but it sounds like for your project it would probably be considerably more useful.
The basic idea of comparing the similarity between two documents, which is a topic in information retrieval, is to extract some fingerprint from each and judge whether they share information based on that fingerprint.
Just a hint: Winnowing: Local Algorithms for Document Fingerprinting may be a good choice and a good starting point for your problem.
Consider the example on wikipedia for Levenshtein distance:
For example, the Levenshtein distance between "kitten" and "sitting" is 3, since the following three edits change one into the other, and there is no way to do it with fewer than three edits:
1. kitten → sitten (substitution of 's' for 'k')
2. sitten → sittin (substitution of 'i' for 'e')
3. sittin → sitting (insertion of 'g' at the end).
Now, replace "kitten" with "text from first paper", and "sitting" with "text from second paper".
Paper[] papers = getPapers();
for (int i = 0; i < papers.length - 1; i++) {
    for (int j = i + 1; j < papers.length; j++) {
        Paper first = papers[i];
        Paper second = papers[j];
        int dist = compareSimilarities(first.text, second.text);
        System.out.println(first.name + "'s paper compares to " + second.name
                + "'s paper with a similarity score of " + dist);
    }
}
Compare those results and peg the kids with the lowest distance scores.
In your compareSimilarities method, you could use any or all of the comparison algorithms. Another one you could incorporate into the formula is "longest common substring" (which is a good method of finding plagiarism).
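If you want a self-contained compareSimilarities to start from, here is a plain dynamic-programming Levenshtein distance (note it is O(n*m) in the lengths of the two texts, so very long papers will be slow):

public class Levenshtein {

    // Minimum number of single-character insertions, deletions and substitutions
    // needed to turn a into b.
    public static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}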
First of all, thanks for reading my question.
I used TF/IDF, and then on those values I calculated cosine similarity to see how similar the documents are. You can see the result in the following matrix. Column names are doc1, doc2, doc3, ... and row names are the same (doc1, doc2, doc3, etc.). With the help of the matrix below, I can see that doc1 and doc4 have 72% similarity (0.722711142). That is correct; when I look at both documents, they are similar. I have 1000 documents, and I can look up each document's values in the matrix to see how many of them are similar.
I used different clustering methods, like k-means and AGNES (hierarchical), to combine them. They produced clusters. For example, Cluster1 contains (doc4, doc5, doc3) because their values (0.722711142, 0.602301766, 0.69912109) are close, respectively. But when I check manually whether these 3 documents are really the same, they are NOT. :( What am I doing wrong, or should I use something other than clustering?
        doc1          doc2          doc3          doc4          doc5
doc1    1             0.067305859   -0.027552299  0.602301766   0.722711142
doc2    0.067305859   1             0.048492904   0.029151952   -0.034714695
doc3    -0.027552299  0.748492904   1             0.610617214   0.010912109
doc4    0.602301766   0.029151952   -0.061617214  1             0.034410392
doc5    0.722711142   -0.034714695  0.69912109    0.034410392   1
P.S: The values can be wrong, it is just to give you an idea.
If you have any question please do ask.
Thanks
I'm not familiar with TF/IDF, but the process can go wrong in many stages generally:
1. Did you remove stopwords?
2. Did you apply stemming? The Porter stemmer, for example.
3. Did you normalize frequencies for document length? (Maybe the TFIDF thing has a solution for that, I don't know.)
4. Clustering is a discovery method but not a holy grail. The documents it retrieves as a group may be related more or less, but that depends on the data, tuning, clustering algorithm, etc.
What do you want to achieve? What is your setup?
Good luck!
My approach would be not to use pre-calculated similarity values at all, because the similarity between docs should be found by the clustering algorithm itself. I would simply set up a feature space with one column per term in the corpus, so that the number of columns equals the size of the vocabulary (minus stop words, if you want). Each feature value contains the relative frequency of the respective term in that document. I guess you could use tf*idf values as well, although I wouldn't expect that to help too much. Depending on the clustering algorithm you use, the discriminating power of a particular term should be found automatically, i.e. if a term appears in all documents with a similar relative frequency, then that term does not discriminate well between the classes and the algorithm should detect that.
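A small sketch of building that feature space, assuming you already have one term-count map per document (for instance, produced by the word-counting approach discussed earlier in this thread):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

public class FeatureSpace {

    // One row per document, one column per vocabulary term; each cell holds the
    // term's relative frequency within that document.
    public static double[][] build(List<Map<String, Integer>> termCounts) {
        TreeSet<String> vocabulary = new TreeSet<String>();
        for (Map<String, Integer> doc : termCounts) {
            vocabulary.addAll(doc.keySet());
        }
        List<String> columns = new ArrayList<String>(vocabulary);

        double[][] features = new double[termCounts.size()][columns.size()];
        for (int row = 0; row < termCounts.size(); row++) {
            Map<String, Integer> doc = termCounts.get(row);
            double total = 0.0;
            for (int count : doc.values()) {
                total += count;
            }
            for (int col = 0; col < columns.size(); col++) {
                Integer count = doc.get(columns.get(col));
                features[row][col] = (count == null || total == 0.0) ? 0.0 : count / total;
            }
        }
        return features;
    }
}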