Text similarity Algorithms

Text similarity Algorithms - java

I'm doing a Java project where I have to make a text similarity program. I want it to take 2 text documents, then compare them with each other and get the similarity of it. How similar they are to each other.
I'll later put an already database which can find the synonyms for the words and go through the text to see if one of the text document writers just changed the words to other synonyms while the text is exactly the same. Same thing with moving paragrafs up or down.
Yes, as was it a plagarism program...
I want to hear from you people what kind of algoritms you would recommend.
I've found Levenstein and Cosine similarity by looking here and other places. Both of them seem to be mentioned a lot. Hamming distance is another my teacher told me about.
I got some questions related to those since I'm not really getting Wikipedia. Could someone explain those things to me?
Levenstein: This algorithm changed by sub, add and elimination the word and see how close it is to the other word in the text document. But how can that be used on a whole text file? I can see how it can be used on a word, but not on a sentence or a whole text document from one to another.
Cosine: It's measure of similarity between two vectors by measuring the cosine of the angle between them. What I don't understand here how two text can become 2 vectors and what about the words/sentence in those?
Hamming: This distance seems to work better than Levenstein but it's only on equal strings. How come it's important when 2 documents and even the sentences in those aren't two strings of equal length?
Wikipedia should make sense but it's not. I'm sorry if the questions sound too stupid but it's hanging me down and I think there's people in here who's quite capeable to explain it so even newbeginners into this field can get it.
Thanks for your time.

Levenstein: in theory you could use it for a whole text file, but it's really not very suitable for the task. It's really intended for single words or (at most) a short phrase.
Cosine: You start by simply counting the unique words in each document. The answers to a previous question cover the computation once you've done that.
I've never used Hamming distance for this purpose, so I can't say much about it.
I would add TFIDF (Term Frequency * Inverted Document Frequency) to the list. It's fairly similar to Cosine distance, but 1) tends to do a better job on shorter documents, and 2) does a better job of taking into account what words are extremely common in an entire corpus rather than just the ones that happen to be common to two particular documents.
One final note: for any of these to produce useful results, you nearly need to screen out stop words before you try to compute the degree of similarity (though TFIDF seems to do better than the others if yo skip this). At least in my experience, it's extremely helpful to stem the words (remove suffixes) as well. When I've done it, I used Porter's stemmer algorithm.
For your purposes, you probably want to use what I've dubbed an inverted thesaurus, which lets you look up a word, and for each word substitute a single canonical word for that meaning. I tried this on one project, and didn't find it as useful as expected, but it sounds like for your project it would probably be considerably more useful.

Basic idea of comparing similarity between two documents, which is a topic in information retrieval, is extracting some fingerprint and judge whether they share some information based on the fingerprint.
Just some hints, the Winnowing: Local Algorithms for Document Fingerprinting maybe a choice and a good start point to your problem.

Consider the example on wikipedia for Levenshtein distance:
For example, the Levenshtein distance between "kitten" and "sitting" is 3, since the following three edits change one into the other, and there is no way to do it with fewer than three edits:
1. kitten → sitten (substitution of 's' for 'k')
2. sitten → sittin (substitution of 'i' for 'e')
3. sittin → sitting (insertion of 'g' at the end).
Now, replace "kitten" with "text from first paper", and "sitting" with "text from second paper".
Paper[] papers = getPapers();
for(int i = 0; i < papers.length - 1; i++) {
for(int j = i + 1; j < papers.length; j++) {
Paper first = papers[i];
Paper second = papers[j];
int dist = compareSimilarities(first.text,second.text);
System.out.println(first.name + "'s paper compares to " + second.name + "'s paper with a similarity score of " + dist);
}
}
Compare those results and peg the kids with the lowest distance scores.
In your compareSimilarities method, you could use any or all of the comparison algorithms. Another one you could incorporate in to the formula is "longest common substring" (which is a good method of finding plagerism.)

Related

Encog Neural Net - How to structure training data?

Every example I've seen for Encog neural nets has involved XOR or something very simple. I have around 10,000 sentences and each word in the sentence has some type of tag. The input layer needs to take 2 inputs, the previous word and the current word. If there is no previous word, then the 1st input is not activated at all. I need to go through each sentence like this. Each word is contingent on the previous word, so I can't just have an array that looks similar to the XOR example. Furthermore, I don't really want to load all the words from 10,000+ sentences into an array, I'd rather scan one sentence at a time and once I reach EOF, start back at the beginning.
How should I go about doing this? I'm not super comfortable with Encog because all the examples I've seen have either been XOR or extremely complicated.
There are 2 inputs... Each input consists of 30 neurons. The chance of the word being a certain tag is used as inputs. So, most of the neurons get 0, the others get probability inputs like .5, .3, and .2. When I say 'aren't activated' I just mean that all the neurons are set to 0. The output layer represents all the possible tags, so, its 30. Whatever one of the output neurons has the highest number is the tag that is chosen.
I'm not sure how to go through all 10,000 sentences and look-up each word in each sentence (for the inputs and activate that input) in the 'demos' of Encog that I've seen.)
It seems that the networks are trained with a single array holding all training data, and that is looped through until the network is trained. I would like to train the network with many different arrays (an array per sentence) and then look through them all again.
This format is clearly not going to work for what I'm doing:
do {
train.iteration();
System.out.println(
"Epoch #" + epoch + " Error:" + train.getError());
epoch++;
} while(train.getError() > 0.01);

So, I'm not sure how to tell you this, but that's not how a neural net works. You can't just use a word as an input, and you can't just "not activate" an input either. At a very basic level, this is what you need to run a neural network on a problem:
A fixed-length input vector (whatever you are feeding in, it must be represented numerically with a fixed length. Each entry in the vector is a single number)
A set of labels (each input vector must correspond to a single, fixed-length output vector)
Once you have those two, the neural net classifies an example, then edits itself to get as close as possible to the labels.
If you're looking to work with words and a deep learning framework, you should map your words to an existing vector representation (I would highly recommend glove, but word2vec is decent as well) and then learn on top of that representation.
After having a deeper understanding of what you're attempting here I think the issue is that you're dealing with 60 inputs, not one. These inputs are the concatenation of the existing predictions for both words (in the case with no first word the first 30 entries are 0). You should take care of the mapping yourself (should be very straightforward), and then just treat it as trying to predict 30 numbers with 60 numbers.
I feel obliged to tell you that the way you've framed the problem you will see awful performance. When dealing with a sparse (mostly zeros) vector and such a small dataset deep learning techniques will show VERY poor performance compared to other methods. You are better off using glove + svm or a random forest model on your existing data.

You can use other implementations of MLDataSet besides BasicMLDataSet.
I ran into a similar problem with windows of DNA sequences. Building an array of all the windows would not have been scalable.
Instead, I implemented my own VersatileDataSource, and wrapped it in a VersatileMLDataSet.
VersatileDataSource has just a few methods to implement:
public interface VersatileDataSource {
String[] readLine();
void rewind();
int columnIndex(String name);
}
For each readLine(), you could return the inputs for the previous/current word, and advance the position to the next word.

Fast string retrieval based on metric distance in Java

Given an arbitrary string s, I would like a method to quickly retrieve all strings S ⊆ M from a large set of strings M (where |M| > 1 million), where all strings of S have minimal edit distance < t (some minimum threshold) from s.
At worst, S may be empty if no strings in M match this criteria, and at best, S = {s} (an exact match). For any case in between, I completely expect that S may be quite large.
In general, I expect to have the maximum edit distance threshold fixed (e.g., 2), and need to perform this operation very many times over arbitrary strings s, thus the need for an efficient method, as naively iterating and testing all strings would be too expensive.
While I have used edit distance as an example metric, I would like to use other metrics as well, such as the Jaccard index.
Can anyone make a suggestion about an existing Java implementation which can achieve this, or point me to the right algorithms and data structures for solving this problem?
UPDATE #1
I have since learned that Metric trees are precisely the kind of structure I am after, which exploits the distance metric to organise subsets of strings in M based on their distance from each other with the metric. Both Vantage-Point, BK and other similar metric tree data structures and algorithms seem ideal for this kind of problem. Now, to find easy-to-use implementations in Java...
UPDATE #2
Using a combination of this bk-tree and this Levenshtein distance implementation, I'm successfully able to retrieve subsets against arbitrary strings from a set (M) of one million strings with retrieval times of around 10ms.

BK trees are designed for such a case. It works with metric distance, such as Levenshtein or Jaccard index.

Although I never tried it myself, it might be worth looking at a Levenshtein Automaton. I once bookmarked this article, which looks rather elaborate and provides several code snippets:
Damn Cool Algorithms: Levenshtein Automata
As already mentioned by H W you will not be able to avoid checking each word in your dictionary. However, the automaton will speed up calculating the distance. Combine this with an efficient data structure for your dictionary (e.g. a Trie, as mentioned in the Wikipedia article), and you might be able to accelerate you current approach.

What is the best algorithm for matching two string containing less than 10 words in latin script

I'm comparing song titles, using Latin script (although not always), my aim is an algorithm that gives a high score if the two song titles seem to be the same same title and a very low score if they have nothing in common.
Now I already had to code (Java) to write this using Lucene and a RAMDirectory - however using Lucene simply to compare two strings is too heavyweight and consequently too slow. I've now moved to using https://github.com/nickmancol/simmetrics which has many nice algorithms for comparing two strings:
https://github.com/nickmancol/simmetrics/tree/master/src/main/java/uk/ac/shef/wit/simmetrics/similaritymetrics
BlockDistance
ChapmanLengthDeviation
ChapmanMatchingSoundex
ChapmanMeanLength
ChapmanOrderedNameCompoundSimilarity
CosineSimilarity
DiceSimilarity
EuclideanDistance
InterfaceStringMetric
JaccardSimilarity
Jaro
JaroWinkler
Levenshtein
MatchingCoefficient
MongeElkan
NeedlemanWunch
OverlapCoefficient
QGramsDistance
SmithWaterman
SmithWatermanGotoh
SmithWatermanGotohWindowedAffine
Soundex
but I'm not well versed in these algorithms and what would be a good choice ?
I think Lucene uses CosineSimilarity in some form, so that is my starting point but I think there might be something better.
Specifically, the algorithm should work on short strings and should understand the concept of words, i.e spaces should be treated specially. Good matching of Latin script is most important, but good matching of other scripts such as Korean and Chinese is relevant as well but I expect would need different algorithm because of the way they treat spaces.

They're all good. They work on different properties of strings and have different matching properties. What works best for you depends on what you need.
I'm using the JaccardSimilarity to match names. I chose the JaccardSimilarity because it was reasonably fast and for short strings excelled in matching names with common typo's while quickly degrading the score for anything else. Gives extra weight to spaces. It is also insensitive to word order. I needed this behavior because the impact of a false positive was much much higher then that off a false negative, spaces could be typos but not often and word order was not that important.
Note that this was done in combination with a simplifier that removes non-diacritics and a mapper that maps the remaining characters to the a-z range. This is passed through a normalizes that standardizes all word separator symbols to a single space. Finally the names are parsed to pick out initials, pre- inner- and suffixes. This because names have a structure and format to them that is rather resistant to just string comparison.
To make your choice you need to make a list of what criteria you want and then look for an algorithm that satisfied those criteria. You can also make a reasonably large test set and run all algorithms on that test set too see what the trade offs are with respect to time, number of positives, false positives, false negatives and negatives, the classes of errors your system should handle, ect, ect.
If you are still unsure of your choice, you can also setup your system to switch the exact comparison algorithms at run time. This allows you to do an A-B test and see which algorithm works best in practice.
TLDR; which algorithm you want depends on what you need, if you don't know what you need make sure you can change it later on and run tests on the fly.

You are likely need to solve a string-to-string correction problem. Levenshtein distance algorithm is implemented in many languages. Before running it I'd remove all spaces from string, because they don't contain any sensitive information, but may influence two strings difference. For string search prefix trees are also useful, you can have a look in this direction as well. For example here or here. Was already discussed on SO. If spaces are so much significant in your case, just assign a greater weight to them.

Each algorithm is going to focus on a similar, but slightly different aspect of the two strings. Honestly, it depends entirely on what you are trying to accomplish. You say that the algorithm needs to understand words, but should it also understand interactions between those words? If not, you can just break up each string according to spaces, and compare each word in the first string to each word in the second. If they share a word, the commonality factor of the two strings would need to increase.
In this way, you could create your own algorithm that focused only on what you were concerned with. If you want to test another algorithm that someone else made, you can find examples online and run your data through to see how accurate the estimated commonality is with each.
I think http://jtmt.sourceforge.net/ would be a good place to start.

Interesting. Have you thought about a radix sort?
http://en.wikipedia.org/wiki/Radix_sort
The concept behind the radix sort is that it is a non-comparative integer sorting algorithm that sorts data with integer keys by grouping keys by the individual digits. If you convert your string into an array of characters, which will be a number no greater than 3 digits, then your k=3(maximum number of digits) and you n = number of string to compare. This will sort the first digits of all your strings. Then you will have another factor s=the length of the longest string. your worst case scenario for sorting would be 3*n*s and the best case would be (3 + n) * s. Check out some radix sort examples for strings here:
http://algs4.cs.princeton.edu/51radix/LSD.java.html
http://users.cis.fiu.edu/~weiss/dsaajava3/code/RadixSort.java

Did you take a look at the levenshtein distance ?
int org.apache.commons.lang.StringUtils.getLevenshteinDistance(String s, String t)
Find the Levenshtein distance between two Strings.
This is the number of changes needed to change one String into
another, where each change is a single character modification
(deletion, insertion or substitution).
The previous implementation of the Levenshtein distance algorithm was
from http://www.merriampark.com/ld.htm
Chas Emerick has written an implementation in Java, which avoids an
OutOfMemoryError which can occur when my Java implementation is used
with very large strings. This implementation of the Levenshtein
distance algorithm is from http://www.merriampark.com/ldjava.htm
Anyway, I'm curious to know what do you choose in this case !

The most efficient way to search for an array of strings in another string

I have a large arrray of strings that looks something like this:
String temp[] = new String[200000].
I have another String, let's call it bigtext. What I need to do is iterate through each entry of temp, checking to see if that entry is found in bigtext and then do some work based on it. So, the skeletal code looks something like this:
for (int x = 0; x < temp.length; x++) {
if (bigtext.indexOf(temp[x]) > -1 {
//do some stuff
} else continue;
}
Because there are so many entries in temp and there are many instances of bigtext as well, I want to do this in the most efficient way. I am wondering if what I've outlined is the most efficient way to iterate through this search of if there are better ways to do this.
Thanks,
Elliott

I think you're looking for an algorithm like Rabin-Karp or Aho–Corasick which are designed to search in parallel for a large number of sub-strings in a text.

Note that your current complexity is O(|S1|*n), where |S1| is the length of bigtext and n is the number of elements in your array, since each search is actually O(|S1|).
By building a suffix tree from bigtext, and iterating on elements in the array, you could bring this complexity down to O(|S1| + |S2|*n), where |S2| is the length of the longest string in the array. Assuming |S2| << |S1|, it could be much faster!
Building a suffix tree is O(|S1|), and each search is O(|S2|). You don't have to go through bigtext to find it, just on the relevant piece of the suffix tree. Since it is done n times, you get total of O(|S1| + n*|S2|), which is asymptotically better then the naive implementation.

If you have additional information about temp, you can maybe improve the iteration.
You can also reduce the time spent, if you parallelize the iteration.

Efficency depends heavily on what is valuable to you.
Are you willing to increase memory for reduced time? Are you willing to increase time for efficent handling of large data sets? Are you willing to increase contention for CPU cores? Are you willing to do pre-processing (perhaps one or more forms of indexing) to reduce the lookup time in a critical section.
With you offering, you indicate the entire portion you want made more efficent, but that means you have excluded any portion of the code or system where the trade-off can be made. This forces one to imagine what you care about and what you don't care about. Odds are excellent that all the posted answers are both correct and incorrect depending on one's point of view.

An alternative approach would be to tokenize the text - let's say split by common punctuation. Then put these tokens in to a Set and then find the intersect with the main container.
Instead of an array, hold the words in a Set too. The intersection can be calculated by simply doing
bidTextSet.retainAll(mainWordsSet);
What remains will be the words that occur in bigText that are in your "dictionary".

Use a search algorithm like Boyer-Moore. Google Boyer Moore, and it has lots of links which explain how it works. For instance, there is a Java example.

I'm afraid it's not efficient at all in any case!
To pick the right algorithm, you need to provide some answers:
What can be computed off-line? That is, is bigText known in advance? I guess temp is not, from its name.
Are you actually searching for words? If so, index them. Bloom filter can help, too.
If you need a bit of fuzziness, may stem or soundex can do the job?
Sticking to strict inclusion test, you might build a trie from your temp array. It would prevent searching the same sub-string several times.

That is a very efficient approach. You can improve it slightly by only evaluating temp.length once
for(int x = 0, len = temp.length; x < len; x++)
Although you don't provide enough detail of your program, it's quite possible you can find a more efficent approach by redesigning your program.

The best way to store and access 120,000 words in java

I'm programming a java application that reads strictly text files (.txt). These files can contain upwards of 120,000 words.
The application needs to store all +120,000 words. It needs to name them word_1, word_2, etc. And it also needs to access these words to perform various methods on them.
The methods all have to do with Strings. For instance, a method will be called to say how many letters are in word_80. Another method will be called to say what specific letters are in word_2200.
In addition, some methods will compare two words. For instance, a method will be called to compare word_80 with word_2200 and needs to return which has more letters. Another method will be called to compare word_80 with word_2200 and needs to return what specific letters both words share.
My question is: Since I'm working almost exclusively with Strings, is it best to store these words in one large ArrayList? Several small ArrayLists? Or should I be using one of the many other storage possibilities, like Vectors, HashSets, LinkedLists?
My two primary concerns are 1.) access speed, and 2.) having the greatest possible number of pre-built methods at my disposal.
Thank you for your help in advance!!
Wow! Thanks everybody for providing such a quick response to my question. All your suggestions have helped me immensely. I’m thinking through and considering all the options provided in your feedback.
Please forgive me for any fuzziness; and let me address your questions:
Q) English?
A) The text files are actually books written in English. The occurrence of a word in a second language would be rare – but not impossible. I’d put the percentage of non-English words in the text files at .0001%
Q) Homework?
A) I’m smilingly looking at my question’s wording now. Yes, it does resemble a school assignment. But no, it’s not homework.
Q) Duplicates?
A) Yes. And probably every five or so words, considering conjunctions, articles, etc.
Q) Access?
A) Both random and sequential. It’s certainly possible a method will locate a word at random. It’s equally possible a method will want to look for a matching word between word_1 and word_120000 sequentially. Which leads to the last question…
Q) Iterate over the whole list?
A) Yes.
Also, I plan on growing this program to perform many other methods on the words. I apologize again for my fuzziness. (Details do make a world of difference, do they not?)
Cheers!

I would store them in one large ArrayList and worry about (possibly unnecessary) optimisations later on.
Being inherently lazy, I don't think it's a good idea to optimise unless there's a demonstrated need. Otherwise, you're just wasting effort that could be better spent elsewhere.
In fact, if you can set an upper bound to your word count and you don't need any of the fancy List operations, I'd opt for a normal (native) array of string objects with an integer holding the actual number. This is likely to be faster than a class-based approach.
This gives you the greatest speed in accessing the individual elements whilst still retaining the ability to do all that wonderful string manipulation.
Note I haven't benchmarked native arrays against ArrayLists. They may be just as fast as native arrays, so you should check this yourself if you have less blind faith in my abilities than I do :-).
If they do turn out to be just as fast (or even close), the added benefits (expandability, for one) may be enough to justify their use.

Just confirming pax assumptions, with a very naive benchmark
public static void main(String[] args)
{
int size = 120000;
String[] arr = new String[size];
ArrayList al = new ArrayList(size);
for (int i = 0; i < size; i++)
{
String put = Integer.toHexString(i).toString();
// System.out.print(put + " ");
al.add(put);
arr[i] = put;
}
Random rand = new Random();
Date start = new Date();
for (int i = 0; i < 10000000; i++)
{
int get = rand.nextInt(size);
String fetch = arr[get];
}
Date end = new Date();
long diff = end.getTime() - start.getTime();
System.out.println("array access took " + diff + " ms");
start = new Date();
for (int i = 0; i < 10000000; i++)
{
int get = rand.nextInt(size);
String fetch = (String) al.get(get);
}
end = new Date();
diff = end.getTime() - start.getTime();
System.out.println("array list access took " + diff + " ms");
}
and the output:
array access took 578 ms
array list access took 907 ms
running it a few times the actual times seem to vary some, but generally array access is between 200 and 400 ms faster, over 10,000,000 iterations.

If you will access these Strings sequentially, the LinkedList would be the best choice.
For random access, ArrayLists have a nice memory usage/access speed tradeof.

My take:
For a non-threaded program, an Arraylist is always fastest and simplest.
For a threaded program, a java.util.concurrent.ConcurrentHashMap<Integer,String> or java.util.concurrent.ConcurrentSkipListMap<Integer,String> is awesome. Perhaps you would later like to allow threads so as to make multiple queries against this huge thing simultaneously.

If you're going for fast traversal as well as compact size, use a DAWG (Directed Acyclic Word Graph.) This data structure takes the idea of a trie and improves upon it by finding and factoring out common suffixes as well as common prefixes.
http://en.wikipedia.org/wiki/Directed_acyclic_word_graph

Use a Hashtable? This will give you your best lookup speed.

ArrayList/Vector if order matters (it appears to, since you are calling the words "word_xxx"), or HashTable/HashMap if it doesn't.
I'll leave the exercise of figuring out why you would want to use an ArrayList vs. a Vector or a HashTable vs. a HashMap up to you since I have a sneaking suspicion this is your homework. Check the Javadocs.
You're not going to get any methods that help you as you've asked for in the examples above from your Collections Framework class, since none of them do String comparison operations. Unless you just want to order them alphabetically or something, in which case you'd use one of the Tree implementations in the Collections framework.

How about a radix tree or Patricia trie?
http://en.wikipedia.org/wiki/Radix_tree

The only advantage of a linked list over an array or array list would be if there are insertions and deletions at arbitrary places. I don't think this is the case here: You read in the document and build the list in order.
I THINK that when the original poster talked about finding "word_2200", he meant simply the 2200th word in the document, and not that there are arbitrary labels associated with each word. If so, then all he needs is indexed access to all the words. Hence, an array or array list. If there really is something more complex, if one word might be labeled "word_2200" and the next word is labeled "foobar_42" or some such, then yes, he'd need a more complex structure.
Hey, do you want to give us a clue WHY you want to do any of this? I'm hard pressed to remember the last time I said to myself, "Hey, I wonder if the 1,237th word in this document I'm reading is longer or shorter than the 842nd word?"

Depends on what the problem is - speed or memory.
If it's memory, the minimum solution is to write a function getWord(n) which scans the whole file each time it runs, and extracts word n.
Now - that's not a very good solution. A better solution is to decide how much memory you want to use: lets say 1000 items. Scan the file for words once when the app starts, and store a series of bookmarks containing the word number and the position in the file where it is located - do this in such a way that the bookmarks are more-or-less evenly spaced through the file.
Then, open the file for random access. The function getWord(n) now looks at the bookmarks to find the biggest word # <= n (please use a binary search), does a seek to get to the indicated location, and scans the file, counting the words, to find the requested word.
An even quicker solution, using rather more memnory, is to build some sort of cache for the blocks - on the basis that getWord() requests usually come through in clusters. You can rig things up so that if someone asks for word # X, and its not in the bookmarks, then you seek for it and put it in the bookmarks, saving memory by consolidating whichever bookmark was least recently used.
And so on. It depends, really, on what the problem is - on what kind of patterns of retreival are likely.

I don't understand why so many people are suggesting Arraylist, or the like, since you don't mention ever having to iterate over the whole list. Further, it seems you want to access them as key/value pairs ("word_348"="pedantic").
For the fastest access, I would use a TreeMap, which will do binary searches to find your keys. Its only downside is that it's unsynchronized, but that's not a problem for your application.
http://java.sun.com/javase/6/docs/api/java/util/TreeMap.html

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.