Which pattern matching algorithm fits my case? - java

I have a project that needs to compare two text documents and find the similarity rate between every pair of sentences, as well as the overall similarity of the texts.
I did some preprocessing on the texts, such as lowercasing all words, deleting duplicate words, and deleting punctuation except full stops. After these operations I had two ArrayLists containing the sentences, with the words separated. It looks like:
[["hello","world"],["welcome","here"]]
Then I sorted every sentence alphabetically. After all this, I compare the words one by one, doing a linear search, but if the word I'm searching for is greater than the one I'm comparing against (by the ASCII value of the first character, e.g. world > burger), I skip the remaining part and jump to the next word. It may seem complicated, but all I need is an answer to: "Is there a faster, more efficient common algorithm for this, such as Boyer-Moore or hashing?" I'm not asking for any piece of code; I just need some theoretical advice. Thank you.
EDIT:
I should have explained the main purpose of the project. It is essentially a plagiarism detector. There are two txt files, main.txt and sub.txt. The program will compare them and give output something like this:
Output:
Similarity rate of two texts is: X%
{The most similar sentence}
{The most similar 2nd sentence}
{The most similar 3rd sentence}
{The most similar 4th sentence}
{The most similar 5th sentence}
So I need to find out the similarity rate of sub.txt to main.txt. I figured I need to compare all the sentences in the two files with each other.
For instance, if main.txt has 10 sentences and sub.txt has 5 sentences,
there will be 50 comparisons, and 50 similarity rates will be calculated
and stored.
Finally, I sort the similarity rates and print the 5 most similar sentences. I have actually finished the project, but it is not efficient. It has 4 nested for loops and compares all the words countless times, so the complexity becomes something like O(n^4) (maybe not quite that much), which is really huge. I found the Levenshtein distance algorithm and cosine similarity, but I'm not sure about them. Thanks for any suggestion!
EDIT2:
In my case the similarity between 2 sentences is defined like this:
main_sentence: "Hello dude how are you doing?"
sub_sentence: "Hello i'm fine dude."
Since the intersection is 2 words, ["hello","dude"],
the similarity is: (number of intersecting words) * 100 / (number of words in the main sentence)
For this case it is: 2 * 100 / 6 = 33.3%
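For reference, a minimal Java sketch of this metric (a sketch only; the method and variable names are mine, not part of the original project):

import java.util.*;

// Similarity as defined above: |intersection| * 100 / |main sentence|.
// Assumes sentences are already lowercased lists of words.
static double similarity(List<String> mainSentence, List<String> subSentence) {
    Set<String> common = new HashSet<>(mainSentence);
    common.retainAll(new HashSet<>(subSentence));   // the word intersection
    return common.size() * 100.0 / mainSentence.size();
}

// similarity(Arrays.asList("hello","dude","how","are","you","doing"),
//            Arrays.asList("hello","i'm","fine","dude"))  returns 33.33...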

As a suggestion, and even if this is not a "complete answer" to your issue: comparing Strings is usually a "heavy" operation (even if you first check their lengths, which, in fact, is one of the first things the equals() method already does when comparing Strings).
What I suggest is the following: create a dummy hashCode()-like method. It won't be a real hashCode(), but simply the number corresponding to the order in which each word was read by your code. Something like a cryptographic method, but much simpler.
Note that String.hashCode() itself wouldn't serve here: although the word "Hello" from the first document does return the same hash code as the word "Hello" from the second document (it is computed from the characters), the values are arbitrary and sparse, whereas the scheme below relies on compact IDs that encode reading order.
Data "Warming" - PreConversion
Imagine you have a shared HashMap<String,Integer> (myMap), whose key is a String and whose value is an Integer. Note that HashMap hashing in Java with String keys shorter than 10 characters (which English words usually are) is incredibly fast. Without any check, just put each word with its counter value:
myMap.put(yourString, ++counter);
Let's say you have 2 documents:
1.txt- Welcome mate what are you doing here
2.txt- Mate I was here before are you dumb
I assume you have already lowercased all the words and removed the duplicates.
You start reading the first document, assigning each word a number. The map would look like:
KEY VALUE
welcome 1
mate 2
what 3
are 4
you 5
doing 6
here 7
Now with the second document. If a key is repeated, the put() method will update its value. So:
KEY VALUE
welcome 1
mate 8
what 3
are 13
you 14
doing 6
here 11
I 9
was 10
before 12
dumb 15
Once complete, you create another HashMap<Integer,String> (reverseMap), mapping the other way around:
KEY VALUE
1 welcome
8 mate
3 what
13 are
14 you
6 doing
11 here
9 I
10 was
12 before
15 dumb
You then convert both documents into Lists of Integers, so they go from:
1.txt- Welcome mate what are you doing here
2.txt- Mate I was here before are you dumb
to:
listOne - [1, 8, 3, 13, 14, 6, 11]
listTwo - [8, 9, 10, 11, 12, 13, 14, 15]
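A compact sketch of this warm-up phase, assuming both documents are already cleaned (the class and variable names are mine):

import java.util.*;

public class WordIdDemo {
    public static void main(String[] args) {
        String[] doc1 = "welcome mate what are you doing here".split(" ");
        String[] doc2 = "mate i was here before are you dumb".split(" ");

        // Pass 1: assign increasing IDs; a repeated word is simply
        // overwritten with a fresh ID, which is exactly what put() does.
        Map<String, Integer> myMap = new HashMap<>();
        int counter = 0;
        for (String w : doc1) myMap.put(w, ++counter);
        for (String w : doc2) myMap.put(w, ++counter);

        // Reverse map, ID -> word, for printing results later.
        Map<Integer, String> reverseMap = new HashMap<>();
        for (Map.Entry<String, Integer> e : myMap.entrySet())
            reverseMap.put(e.getValue(), e.getKey());

        // Pass 2: convert both documents using the final map.
        List<Integer> listOne = new ArrayList<>();
        for (String w : doc1) listOne.add(myMap.get(w));
        List<Integer> listTwo = new ArrayList<>();
        for (String w : doc2) listTwo.add(myMap.get(w));

        System.out.println(listOne); // [1, 8, 3, 13, 14, 6, 11]
        System.out.println(listTwo); // [8, 9, 10, 11, 12, 13, 14, 15]
    }
}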
Duplicate words, positions and sequences
To find the duplicated within both documents:
First, make a copy of one of the lists, for example listTwo; call it listDuplicates, as that will be its objective. (Integer is immutable, so copying the references is enough; there is no need for new Integer(i).)
List<Integer> listDuplicates = new ArrayList<>(listTwo);
Call retainAll:
listDuplicates.retainAll(listOne);
The result would be:
listDuplicates- [8,11,13,14]
So, from a total of listOne.size() + listTwo.size() = 15 words found in the 2 documents, there are 11 distinct words: 4 of them are duplicates (they appear in both documents) and 7 are unique to one document.
In order to get the converted values, just call:
for (Integer i : listDuplicates)
    System.out.println(reverseMap.get(i)); // mate, here, are, you
Now that the duplicates are identified, listOne and listTwo can also be used to:
Identify the position of each duplicate in each list, so we can compute the difference between the positions of consecutive duplicates (diff = previous position - current position). By convention, the first duplicate gets a -1, as it has no previous duplicate to diff against; that doesn't necessarily mean it is adjacent to any other duplicate (it is just the first one).
If a later element also has a -1 diff, it sits right after the previous duplicate. In the example, [8] and [11] are not consecutive, but [13] and [14] are:

        doc1   doc2   difDoc1    difDoc2
[8]     2      1      -1 (0-1)   -1 (0-1)
[11]    7      4      -5 (2-7)   -3 (1-4)
[13]    4      6       3 (7-4)   -2 (4-6)
[14]    5      7      -1 (4-5)   -1 (6-7)

In this case, the distance shown for [14] relative to its previous duplicate (the diff between [13] and [14]) is the same in both documents, -1: that means these words are not only duplicates, but are also placed consecutively in both documents.
Hence, we've found not only duplicate words, but also a duplicate sequence of two words between those lines:
[13][14] -- are you
The same mechanism (spotting a diff of -1 for the same element in both documents) also helps to find complete duplicate sequences of 2 or more words. If all the duplicates show a diff of -1 in both documents, we've found a complete duplicate line:
The following example shows this more clearly:
doc1 - "here i am" [4,5,6]
doc2 - "here i am" [4,5,6]
listDuplicates - [4,5,6]

       doc1   doc2   difDoc1    difDoc2
[4]    1      1      -1 (0-1)   -1 (0-1)
[5]    2      2      -1 (1-2)   -1 (1-2)
[6]    3      3      -1 (2-3)   -1 (2-3)

All the diffs are -1 for the same element in both documents -> all duplicates are next to each other in both documents -> the sentence is exactly the same in both documents. So, this time, we've found a complete duplicate line of 3 words:
[4][5][6] -- here i am
Apart from this duplicate-sequence search, the difference table would also be helpful when calculating the variance, median, etc. of the duplicate positions, in order to get some kind of "similarity" factor (a basic indicative value of how alike the documents are; by no means definitive, but somewhat helpful).
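A rough sketch of that diff idea (my own helper, assuming the listOne, listTwo and listDuplicates from above; it collects the IDs that sit right after the previous duplicate in both documents):

import java.util.*;

// Returns the duplicate IDs whose position diff to the previous duplicate
// is 1 in both documents, i.e. the building blocks of shared sequences.
static List<Integer> consecutiveDuplicates(List<Integer> listOne,
                                           List<Integer> listTwo,
                                           List<Integer> listDuplicates) {
    List<Integer> sequence = new ArrayList<>();
    for (int k = 1; k < listDuplicates.size(); k++) {
        int prev = listDuplicates.get(k - 1);
        int cur  = listDuplicates.get(k);
        // adjacent in both documents = a diff of -1 in both diff columns
        if (listOne.indexOf(cur) - listOne.indexOf(prev) == 1
                && listTwo.indexOf(cur) - listTwo.indexOf(prev) == 1) {
            if (!sequence.contains(prev)) sequence.add(prev);
            sequence.add(cur);
        }
    }
    return sequence;   // [13, 14] for the example above -> "are you"
}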
Unique values - helpful as a non-similarity indicator?
Similar mechanisms would be used to get those unique values. For example, by removing the duplicates from the reverseMap:
for (Integer i : listDuplicates)
    reverseMap.remove(i);
Now the reverseMap only contains unique values: reverseMap.size() = 7.
KEY VALUE
1 welcome
3 what
6 doing
9 I
10 was
12 before
15 dumb
In order to get the unique words:
reverseMap.values() = {welcome,what,doing,I,was,before,dumb}
If you need to know which unique words come from which document, you can use the reverseMap (as the Lists may have been altered after executing methods such as retainAll on them):
Count the number of words from the 1st document. This time, 7.
If the key of the reverseMap is <=7, that unique word comes from the 1st document. {welcome,what,doing}
If the key is >7, that unique word comes from the 2nd document. {I,was,before,dumb}
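For instance (a tiny sketch, reusing the reverseMap left over after the removal above, with the 7-word threshold of this example):

// IDs 1..7 were assigned while reading the 1st document; any word the
// 2nd document re-used was re-put with a higher ID and was removed above.
for (Map.Entry<Integer, String> e : reverseMap.entrySet())
    System.out.println((e.getKey() <= 7 ? "1st" : "2nd") + " doc: " + e.getValue());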
The uniqueness factor could also be another indicator, this time a negative one (as we are searching for similarities here). It could still be really helpful.
equals and hashCode - avoid
Note that, contrary to a common belief, String.hashCode() does return the same value for two equal words (it is computed from the characters, not from the object reference); the real problem is that two different words can collide on the same hash code, so hashCode() alone can never prove equality. The String.equals() method works by comparing the chars one by one (it also checks the lengths first, as you do manually), which becomes total overkill when repeated over and over on big documents:
// String.equals() as implemented in the JDK:
public boolean equals(Object anObject) {
    if (this == anObject) {
        return true;
    }
    if (anObject instanceof String) {
        String anotherString = (String) anObject;
        int n = value.length;
        if (n == anotherString.value.length) {
            char v1[] = value;
            char v2[] = anotherString.value;
            int i = 0;
            while (n-- != 0) {
                if (v1[i] != v2[i])
                    return false;
                i++;
            }
            return true;
        }
    }
    return false;
}
My opinion is to avoid repeated String comparisons like this as much as possible; in particular, the == operator should never be used to compare String contents:

String one = "hello";
String two = "hello";

Here one == two happens to be true, but only because the compiler interns identical String literals, making both variables point to the same cached object. Once you load thousands of strings at runtime (e.g. from your txt files), interning no longer applies, and == will return false even if the Strings hold the same value (as usual, because it compares memory addresses). Their hash codes, on the other hand, are computed from the characters, so

one.hashCode() == two.hashCode() --> true

holds whenever the contents are equal; but since two different strings can collide on the same hash code, a hash match still has to be confirmed with equals(). That is why the compact sequential IDs described above are the cheaper and safer comparison key.

The essential technique is to see this as a multi-stage process. The key is that you're not trying to compare every document with every other document; rather, you have a first pass that identifies small clusters of likely matches in essentially a one-pass process:
(1) Index or cluster the documents in a way that will allow candidate matches to be identified;
(2) Identify candidate documents that may be a match based on those indexes/clusters;
(3) For each cluster or index match, have a scoring algorithm that scores the similarity of a given pair of documents.
There are a number of ways to solve (1) and (3), depending on the nature and number of the documents. Options to consider:
For certain datasets, (1) could be as simple as indexing on unusual words/combinations of words
For more complex documents and/or larger datasets, you will need to do something sometimes called 'dimension reduction': rather than clustering on shared combinations of words, you cluster on combinations of shared features, where each feature is identified by a set of words. Look at a feature extraction technique sometimes referred to as "latent semantic indexing": essentially, you model the documents mathematically as a matrix of "words per feature" multiplied by "features per document"; then, by factorising the matrix, you arrive at an approximation of a set of features, along with a weighted list of which words make up each feature
Then, once you have a means of identifying a set of words/features to index on, you need some kind of indexing function that will mean that candidate document matches have identical/similar index keys. Look at cosine similarity and "locality-sensitive hashing" such as SimHash.
Then for (3), given a small set of candidate documents (or documents that cluster together in your hashing system), you need a similarity metric. Again, which method is appropriate depends on your data, but conceptually, one way to see this is: "for each sentence in document X, find the most similar sentence in document Y and score its similarity; the 'plagiarism score' is the sum of these values". There are various ways to define the 'similarity score' between two strings: e.g. longest common subsequence, edit distance, number of common word pairs/sequences...
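As an illustration only, here is a minimal sketch of that last idea, plugging in the asker's word-intersection metric as the per-pair score (the names and the metric choice are mine):

import java.util.*;

// For each sentence in X, take its best match in Y; sum those best scores.
static double plagiarismScore(List<List<String>> docX, List<List<String>> docY) {
    double total = 0;
    for (List<String> x : docX) {
        double best = 0;
        for (List<String> y : docY)
            best = Math.max(best, similarity(x, y)); // similarity() as sketched earlier
        total += best;
    }
    return total;
}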
As you can probably imagine from all of this, there's no single algorithm that will hand you exactly what you need on a plate. (That's why entire companies and research departments are dedicated to this problem...) But hopefully the above will give you some pointers.

Related

Best approach to solve Word Chain

I am trying to solve this problem in CodeEval.
In this challenge we suggest you to play in the known game "Word
chain" in which players come up with words that begin with the letter
that the previous word ended with. The challenge is to determine the
maximum length of a chain that can be created from a list of words.
Example:
Input:
soup,sugar,peas,rice
Output:
4
Explanation: We can form a chain of 4 words like this: "soup->peas->sugar->rice".
Constraints:
The length of a list of words is in range [4, 35].
A word in a list of words is represented by a random lowercase ascii string with the length of [3, 7] letters.
There are no repeating words in the list of words.
My attempt: My approach is to model the words as a graph, such that each word in the input represents a node, and there is a (directed) edge from word_i to word_j if the last character of word_i is equal to the first character of word_j.
After that I run BFS from each node and compute the length of the farthest node from that node. The final result is the maximum value over all nodes.
But this approach is not giving me a full score. Hence, my question is: how do I solve this problem correctly and efficiently?
My reputation is less than 50, so I can't make a comment...
If the total number of words is less than 20, we can solve it using dynamic programming and bitmasks:
make dp[20][1<<20], where dp[i][j] means you are currently at word i and the bitmask j records which words have been visited.
For numbers bigger than 20 I still don't have a good idea; maybe we need to use some randomized algorithm, perhaps...
My idea is to use DFS and add some optimization, because 35 is not too big. I think that is enough to solve the problem, as the sketch below shows.
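For what it's worth, a sketch of that DFS idea over the word graph (exhaustive, hence exponential in the worst case, but workable for small inputs; all names are mine):

import java.util.*;

public class WordChain {
    static String[] words;
    static List<List<Integer>> adj;
    static int best = 0;

    public static void main(String[] args) {
        words = "soup,sugar,peas,rice".split(",");
        int n = words.length;
        adj = new ArrayList<>();
        for (int i = 0; i < n; i++) adj.add(new ArrayList<>());
        // Edge i -> j when word i's last letter equals word j's first letter.
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                if (i != j && words[i].charAt(words[i].length() - 1) == words[j].charAt(0))
                    adj.get(i).add(j);
        for (int i = 0; i < n; i++) dfs(i, 1L << i, 1);
        System.out.println(best); // 4: soup->peas->sugar->rice
    }

    // Exhaustive DFS over simple paths; 'visited' is the bitmask of used words
    // (a long covers the constraint of at most 35 words).
    static void dfs(int cur, long visited, int len) {
        best = Math.max(best, len);
        for (int nxt : adj.get(cur))
            if ((visited & (1L << nxt)) == 0)
                dfs(nxt, visited | (1L << nxt), len + 1);
    }
}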
See the solution mentioned here: Detecting when matrix multiplication is possible
The solution to your problem is pretty much the same. Create a directed graph such that for every word you add an edge from its first letter to its last letter.
Then find an Euler path ( http://en.wikipedia.org/wiki/Euler_path ) in that graph.
EDIT: I see that you are not required to use all the words, so you actually need the longest path in the graph ( http://en.wikipedia.org/wiki/Longest_path_problem ). This problem is NP-complete.
See the solution mentioned at word chain in core java.
The page gives a solution in core Java that follows this process:
Load the dictionary items into memory for a given word length
Get the next eligible list of words from memory for the given word
There is another approach using the Hadoop map/reduce framework, which is described in detail in word chain using map-reduce.

What's the best way to iterate through all combinations of a multi-dimensional array of unknown sizes without repeating any combination?

ArrayList<ArrayList<ArrayList<String>>> one = new ArrayList<ArrayList<ArrayList<String>>>();
one would look something like this with some example values:
[
[
["A","B","C",...],
["G","E","J",...],
...
],
[
["1","2",...],
["8","5","12","7",...],
...
],
...
]
Assume that there will always be one base case: at least one letter ArrayList (e.g. ["A","B","C"]), but there could be more (e.g. ["X","Y","Z"]), and there may be number ArrayLists of any size, maybe none at all, but possibly hundreds (e.g. ["1","2","3"],...,["997","998","999"]). Also, there could be more types of ArrayLists (e.g. ["#","#","$"]) of any size. So really the only thing that is definitive is that ALWAYS:
one.size()>=1
one.get(0).size()>=1
one.get(0).get(0).size()>=1
So the problem is: how can I best get every combination of each category, without knowing how large each ArrayList will be and without any repeats, assuming that one.get(0).get(0) is valid? E.g. ["A","B","C",...] ["1","2",...] ..., ["A","B","C",...] ["8","5","12","7",...] .... I'm currently using Java in my project, but I can convert over any algorithm that works. I apologize if this is not clear; I'm having a hard time putting it into words, which is probably part of why I can't think of a solution.
I know two solutions to this, the recursive and the non-recursive. Here's the non-recursive (similar to the answer at How to get 2D array possible combinations):
1) Multiply the lengths of all the arrays together. This is the number of possible combinations you can make. Call this totalcombinations.
2) Set up an int[] array called counters. It should be as long as the number of arrays, with all entries initialized to 0.
3a) totalcombinations times, concatenate the counters[0]th entry in arrays[0], the counters[1]th entry in arrays[1], etc., and add the result to the list of all results.
3b) Then set j = 0 and increment counters[j]. If this causes counters[j] >= arrays[j].length, then set counters[j] = 0, do ++j, and increment the new counters[j] (i.e. repeat 3b)) until you no longer get such an overflow.
If you imagine counters as being like the tumblers of a suitcase lock - when you overflow the first digit from 9 to 0, the next one ticks over - then you should get the strategy here.
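A minimal Java sketch of this odometer approach, treating each inner list as one of the answer's arrays (the sample data and names are mine):

import java.util.*;

public class Combinations {
    public static void main(String[] args) {
        List<List<String>> arrays = Arrays.asList(
                Arrays.asList("A", "B", "C"),
                Arrays.asList("1", "2"),
                Arrays.asList("#", "$"));

        long totalcombinations = 1;
        for (List<String> a : arrays) totalcombinations *= a.size();
        int[] counters = new int[arrays.size()];

        for (long c = 0; c < totalcombinations; c++) {
            // Emit the current combination.
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < arrays.size(); i++)
                sb.append(arrays.get(i).get(counters[i])).append(' ');
            System.out.println(sb.toString().trim());

            // Tick the odometer: an overflow carries into the next counter.
            int j = 0;
            while (j < counters.length && ++counters[j] >= arrays.get(j).size()) {
                counters[j] = 0;
                j++;
            }
        }
    }
}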

Generate deletions, insertions, substitutions, transpositions for a string

I am implementing a spell checker algorithm. I have constructed a Trie that stores my words for quick searching.
When a given input string is passed what I want to do is generate potential deletions, insertions, substitutions and transpositions for that string with an edit distance of 1. Using this super set I can then try to find the word in my Trie and offer the user "did you mean?" type results.
I have looked online and most solutions mention calculating the Levenshtein distance. That only works if you already know the two strings and want to find the edit distance between them.
Suggestions?
I would use a 2-pass algorithm:
Pass 1
Look at, and calculate the distance for, all words starting with the same letter as the word to spell-check. This will be fast. You can stop the depth search when the number of chars is greater than the spell word's length + 2 (then it is obviously another word).
Display the results of pass 1, e.g. by marking the word with a red underline.
Pass 2
Look at all words and stop at length + 3 or 4.
Update the results found in pass 1.
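For the candidate-generation step the question asks about, here is a minimal Norvig-style sketch (my own names) that produces every string at edit distance 1, each of which can then be looked up in the Trie:

import java.util.*;

// All edits at distance 1: deletions, transpositions, substitutions, insertions.
static Set<String> editsDistanceOne(String word) {
    final String alphabet = "abcdefghijklmnopqrstuvwxyz";
    Set<String> edits = new HashSet<>();
    for (int i = 0; i <= word.length(); i++) {
        String left = word.substring(0, i), right = word.substring(i);
        if (!right.isEmpty())
            edits.add(left + right.substring(1));                    // deletion
        if (right.length() > 1)
            edits.add(left + right.charAt(1) + right.charAt(0)
                    + right.substring(2));                           // transposition
        for (char c : alphabet.toCharArray()) {
            if (!right.isEmpty())
                edits.add(left + c + right.substring(1));            // substitution
            edits.add(left + c + right);                             // insertion
        }
    }
    return edits;
}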

Returning a Subset of Strings from 10000 ascii strings

My college is almost over, so I have started preparing for interviews to get a job, and I came across this interview question while preparing:
You have a set of 10000 ascii strings (loaded from a file)
A string is input from stdin.
Write a pseudocode that returns (to stdout) a subset of strings in (1) that contain the same distinct characters (regardless of order) as
input in (2). Optimize for time.
Assume that this function will need to be invoked repeatedly. Initializing the string array once and storing in memory is okay .
Please avoid solutions that require looping through all 10000 strings.
Can anyone give me a general pseudocode/algorithm for how to solve this problem? I am scratching my head thinking about the solution. I am mostly familiar with Java.
Here is an O(1) algorithm!
Initialization:
For each string, sort its characters, removing duplicates - e.g. "trees" becomes "erst"
Load the sorted word into a trie using the sorted characters, adding a reference to the original word to the list of words stored at each node traversed
Search:
Sort the input string the same way as the source strings
Follow the trie using the sorted characters; at the end node, return all the words referenced there
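A simpler stand-in for the same idea (my own variant, keyed on the sorted distinct-character signature with a HashMap instead of a trie):

import java.util.*;

public class SignatureIndex {
    // "trees" -> "erst": sorted characters with duplicates removed.
    static String signature(String s) {
        char[] chars = s.toCharArray();
        Arrays.sort(chars);
        StringBuilder sb = new StringBuilder();
        for (char c : chars)
            if (sb.length() == 0 || sb.charAt(sb.length() - 1) != c)
                sb.append(c);
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] dictionary = {"trees", "reset", "steer", "hello"};

        // One-time initialization over the 10000 strings.
        Map<String, List<String>> index = new HashMap<>();
        for (String word : dictionary)
            index.computeIfAbsent(signature(word), k -> new ArrayList<>()).add(word);

        // Each query is then a single map lookup; no loop over all strings.
        System.out.println(index.getOrDefault(signature("rest"),
                Collections.emptyList())); // [trees, reset, steer]
    }
}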
They say optimise for time, so I guess we're safe to abuse space as much as we want.
In that case, you could do an initial pass over the 10000 strings and build a mapping from each of the unique characters present in them to the set of indices of the strings containing that character. That way you can ask the mapping: which sets contain character 'x'? Call this mapping M (order: O(nm), where n is the number of strings and m is their maximum length).
To optimise for time again, you could reduce the stdin input string to its unique characters and put them in a queue, Q (order O(p), where p is the length of the input string).
Start a new disjoint set, say S. Then let S = Q.extractNextItem.
Now you can loop over the rest of the unique characters and find which sets contain all of them:
While (Q is not empty) (loops O(p)) {
    S = S intersect Q.extractNextItem (close to O(1), depending on your implementation of disjoint sets)
}
Voila, return S.
Total time: O(mn + p + p*1) = O(mn + p)
As Bohemian says, a trie is definitely the way to go!
This sounds like the way an address book lookup would work on a phone. Start punching digits in, and then filter the address book based on the number representation as well as any of the three (or actually more if using international chars) letters that number would represent.

How to get the following output in most efficient way?

Read a file of this format:
japan
usa
japan
russia
usa
japan
japan
australia
Print the output in the following format:
<country> : <count>
So for above file output would be:
japan : 4
usa : 2
australia : 1
russia : 1
Note that since australia and russia both have a count of 1, the names are sorted, 'a' before 'r'. Do it in the most efficient way.
Here is what I tried:
Read the entire file and insert into a HashMap.
We will have pairs like <japan, 4> in there.
Now read the HashMap and insert the pairs into another TreeMap<Integer, List<String>>.
Iterate over the TreeMap using a Comparator, which will iterate in reverse-sorted order.
Sort each value (which will be a List<String>) and print the result.
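A compact sketch of these steps (the sample data is inlined here; in practice you would read it from the file; all names are mine):

import java.util.*;

public class CountryCount {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList("japan", "usa", "japan", "russia",
                "usa", "japan", "japan", "australia");

        // Steps 1-2: count occurrences, e.g. <japan, 4>.
        Map<String, Integer> counts = new HashMap<>();
        for (String country : lines)
            counts.merge(country, 1, Integer::sum);

        // Steps 3-4: group by count, highest count first.
        TreeMap<Integer, List<String>> byCount = new TreeMap<>(Comparator.reverseOrder());
        for (Map.Entry<String, Integer> e : counts.entrySet())
            byCount.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());

        // Step 5: alphabetical tie-break within each count.
        for (Map.Entry<Integer, List<String>> e : byCount.entrySet()) {
            Collections.sort(e.getValue());
            for (String c : e.getValue())
                System.out.println(c + " : " + e.getKey()); // japan : 4 ...
        }
    }
}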
This can be done in O(n*S) (n is the number of input strings, S is the biggest string size). I'll give you a general algorithm in pseudocode; the Java would be a bit messy...
arr <- HashSet<String>[NumberOfElements]
map <- HashMap<String,int>
for each country:
    if country in map.keySet():
        count <- map.get(country)
        arr[count].del(country)
        map.delete(country)
        count <- count + 1
    else:
        count <- 1
    arr[count].add(country)
    map.put(country, count)
for i = arr.length-1; i >= 0; i--:
    sorted <- radixSort(arr[i])
    for each country in sorted:
        print country, i
arr here is a "histogram": since on every iteration a country's count increases by at most 1, we can use it to store the data.
complexity explanations:
this algorithm uses radix sort, where a 'digit' is actually a character; it is O(n), and using it avoids the O(nlogn) of other sort algorithms or of using a TreeSet
we iterate over the array, which is at most of size n (if every country appears only once)
a tricky point is the sort inside a loop: it is still O(n) because overall you sort at most n elements (not n elements per iteration!), so it is O(2n) = O(n)
we can pre-find NumberOfElements with a single iteration
overall: it is O(n*S), where n is the number of inputs (for populating arr), and S is the biggest string size (since we need to read the strings...)
A java.util.Map should get you on track.
The most efficient way in terms of coding time would be to forget Java and use sort | uniq -c | sort -n (which is, incidentally, one of my favorite shell snippets). Follow that with awk if you really need the formatting as depicted. The runtime won't even be that bad for large inputs (since those are fairly efficient programs) but startup time would dominate on your example list. Of course you could run it somewhere on the order of 10,000 times before you could launch Eclipse.
