I have been searching the internet for a while now for steps to find all the anagrams of a string (word) (e.g. "team" produces the word "tame") using a hashtable and a trie. All I have found here on SO is how to verify that two words are anagrams. I would like to take it a step further and find an algorithm, in English, so that I can program it in Java.
For example,
Loop through all the characters.
For each unique character insert into the hashtable.
and so forth.
I don't want a complete program. Yes, I am practicing for an interview. If this question comes up, then I will know it and know how to explain it, not just memorize it.
The most succinct answer, due to someone quoted in the book "Programming Pearls", is (paraphrasing):
"sort it this way (waves hand horizontally left to right), and then that way (waves hand vertically top to bottom)"
This means: starting from a one-column table (word), create a two-column table (sorted_word, word), then sort it on the first column.
Now, to find the anagrams of a word, first compute its sorted form, do a binary search for its first occurrence in the first column of the table, and read off the second-column values while the first column stays the same.
input (does not need to be sorted):
mate
tame
mote
team
tome
sorted "this way" (horizontally):
aemt, mate
aemt, tame
emot, mote
aemt, team
emot, tome
sorted "that way" (vertically):
aemt, mate
aemt, tame
aemt, team
emot, mote
emot, tome
lookup "team" -> "aemt"
aemt, mate
aemt, tame
aemt, team
As far as hashtables/tries go, they only come into the picture if you want a slightly speedier lookup. Using hash tables you can partition the two-column, vertically sorted table into k partitions based on the hash of the first column; this gives you a constant-factor speedup because you only have to binary-search within one partition. Tries are a different way of optimizing: they help you avoid doing too many string comparisons, by hanging the index of the first row of the appropriate section of the table off each terminal node in the trie.
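For completeness, here is a minimal Java sketch of the two-column idea described above (the class and method names are my own, not from any library): build (sorted_word, word) pairs, sort them by the first column, and binary-search on the sorted form of the query word.
import java.util.*;

public class AnagramTable {
    // Each entry is a (sortedWord, word) pair; the list is kept sorted by sortedWord.
    private final List<String[]> table = new ArrayList<>();
    private static final Comparator<String[]> BY_SORTED_WORD = Comparator.comparing((String[] e) -> e[0]);

    public AnagramTable(List<String> words) {
        for (String w : words) {
            table.add(new String[] { sortLetters(w), w });   // sort "this way" (horizontally)
        }
        table.sort(BY_SORTED_WORD);                          // sort "that way" (vertically)
    }

    private static String sortLetters(String w) {
        char[] cs = w.toCharArray();
        Arrays.sort(cs);
        return new String(cs);
    }

    public List<String> anagramsOf(String word) {
        String key = sortLetters(word);
        List<String> result = new ArrayList<>();
        int i = Collections.binarySearch(table, new String[] { key, "" }, BY_SORTED_WORD);
        if (i < 0) return result;                             // no anagrams at all
        while (i > 0 && table.get(i - 1)[0].equals(key)) i--; // back up to the first occurrence
        for (; i < table.size() && table.get(i)[0].equals(key); i++) {
            result.add(table.get(i)[1]);                      // read off the second column
        }
        return result;
    }

    public static void main(String[] args) {
        AnagramTable t = new AnagramTable(Arrays.asList("mate", "tame", "mote", "team", "tome"));
        System.out.println(t.anagramsOf("team"));             // [mate, tame, team]
    }
}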
Hash tables are not the best solution, so I doubt you would be required to use them.
The simplest approach to finding anagram pairs (that I know of) is as follows:
Map characters as follows:
a -> 2
b -> 3
c -> 5
d -> 7
and so on, such that letters a..z are mapped to the first 26 primes.
Multiply the character values for each character in the word; let's call the product the "anagram number". It's pretty easy to see that TEAM and TAME will produce the same number. Indeed, the anagram numbers of two different words are equal if and only if the words are anagrams.
Thus the problem of finding anagrams between two lists reduces to finding anagram numbers that appear on both lists. This is easily done by sorting each list by anagram number and stepping through to find common values, in O(n log n) time.
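A small sketch of this prime-product idea, assuming lowercase a-z input; BigInteger is used here because the product can overflow a long for longer words:
import java.math.BigInteger;

public class AnagramNumber {
    // The first 26 primes, one per letter a..z.
    private static final int[] PRIMES = {
        2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
        43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101
    };

    // Two words get the same anagram number iff they are anagrams of each other.
    static BigInteger anagramNumber(String word) {
        BigInteger n = BigInteger.ONE;
        for (char c : word.toLowerCase().toCharArray()) {
            n = n.multiply(BigInteger.valueOf(PRIMES[c - 'a']));
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(anagramNumber("team").equals(anagramNumber("tame"))); // true
    }
}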
Convert the String to a char[]
sort the char[]
generate a String from the sorted char[]
use it as the key into a HashMap<String, List<String>>
insert the current original String into the associated list of values
For example, for car, acr, rca, abc it would have:
acr: car, acr, rca
abc: abc
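A minimal sketch of those steps (the class and method names are mine):
import java.util.*;

public class AnagramGroups {
    public static Map<String, List<String>> group(List<String> words) {
        Map<String, List<String>> groups = new HashMap<>();
        for (String word : words) {
            char[] cs = word.toCharArray();
            Arrays.sort(cs);                          // sort the characters
            String key = new String(cs);              // the sorted form is the key
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(word);
        }
        return groups;
    }

    public static void main(String[] args) {
        // {acr=[car, acr, rca], abc=[abc]}
        System.out.println(group(Arrays.asList("car", "acr", "rca", "abc")));
    }
}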
I have a project in which I need to compare two text documents and find the similarity rate between every single sentence as well as the overall similarity of the texts.
I did some transforming on the texts, like lowercasing all words, deleting duplicate words, and deleting punctuation except full stops. After doing these operations, I had 2 ArrayLists which contain the sentences with all their words separated. It looks like
[["hello","world"],["welcome","here"]]
Then I sorted every sentence alphabetically. After all this, I compare all the words one by one, doing a linear search, but if the word I'm searching for is greater than the one I'm looking at (by the ASCII value of the first character, e.g. world > burger), I skip the remaining part and jump to the next word. It seems complicated, but what I need is an answer to "Is there any faster, more efficient common algorithm, like Boyer-Moore, hashing, or something else?". I'm not asking for any piece of code, but I need some theoretical advice. Thank you.
EDIT:
I should have explained the main purpose of the project. It is actually a kind of plagiarism detector. There are two txt files, main.txt and sub.txt. The program will compare them and give an output something like this:
Output:
Similarity rate of two texts is: %X
{The most similar sentence}
{The most similar 2nd sentence}
{The most similar 3rd sentence}
{The most similar 4th sentence}
{The most similar 5th sentence}
So I need to find out sub.txt's similarity rate to the main.txt file. I thought that I need to compare all the sentences in the two files with each other.
For instance, if main.txt has 10 sentences and sub.txt has 5 sentences,
there will be 50 comparisons, and 50 similarity rates will be calculated
and stored.
Finally I sort the similarity rates and print the 5 most similar sentences. I have actually finished the project, but it's not efficient. It has 4 nested for loops and compares all the words countless times, so the complexity becomes something like O(n^4) (maybe not quite that much), but it's really huge even in the worst case. I found the Levenshtein distance algorithm and cosine similarity, but I'm not sure about them. Thanks for any suggestions!
EDIT2:
For my case, the similarity between 2 sentences is like:
main_sentence:"Hello dude how are you doing?"
sub_sentence:"Hello i'm fine dude."
Since the intersection is 2 words ["hello","dude"],
the similarity is: (number of intersected words) * 100 / (number of words in the main sentence)
For this case it's: 2*100/6 = 33.3%
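For reference, that formula as a small Java sketch, assuming the sentences are already lowercased and split into words (names are mine):
import java.util.*;

public class SentenceSimilarity {
    // similarity = (number of shared words) * 100 / (number of words in the main sentence)
    static double similarity(List<String> mainSentence, List<String> subSentence) {
        Set<String> shared = new HashSet<>(mainSentence);
        shared.retainAll(new HashSet<>(subSentence));
        return shared.size() * 100.0 / mainSentence.size();
    }

    public static void main(String[] args) {
        List<String> main = Arrays.asList("hello", "dude", "how", "are", "you", "doing");
        List<String> sub = Arrays.asList("hello", "i'm", "fine", "dude");
        System.out.println(similarity(main, sub)); // 33.33...
    }
}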
As a suggestion, and even if this is not a "complete answer" to your issue: comparing Strings is usually a "heavy" operation (even if you first check their lengths, which, in fact, is one of the first things the equals() method already does when comparing Strings).
What I suggest is the following: create a dummy hashcode()-like method. It won't be a real hashCode(), but simply the number associated with the order in which that word was read by your code. Something like a hashing scheme, but much simpler.
Note that this is not String.hashCode(): the point is simply that the word "Hello" from the first document and the word "Hello" from the second document end up mapped to the same small Integer, so later comparisons are cheap integer comparisons instead of String comparisons.
Data "Warming" - PreConversion
Imagine you have a shared HashMap<String, Integer> (myMap), whose key is a String and whose value is an Integer. Note that HashMap's hashing in Java with String keys shorter than 10 characters (which English words usually are) is incredibly fast. Without any check, just put each word with its counter value:
myMap.put(yourString, ++counter);
Let's say you have 2 documents:
1.txt- Welcome mate what are you doing here
2.txt- Mate I was here before are you dumb
I assume you have already lowercased all words and removed duplicates.
You start reading the first document, assigning each word a number. The map would look like:
KEY VALUE
welcome 1
mate 2
what 3
are 4
you 5
doing 6
here 7
Now for the second document. If a key is repeated, the put() method will update its value. So:
KEY VALUE
welcome 1
mate 8
what 3
are 13
you 14
doing 6
here 11
I 9
was 10
before 12
dumb 15
Once complete, you create another HashMap<Integer, String> (reverseMap), mapping the other way around:
KEY VALUE
1 welcome
8 mate
3 what
13 are
14 you
6 doing
11 here
9 I
10 was
12 before
15 dumb
You convert both documents into Lists of Integers, so they go from:
1.txt- Welcome mate what are you doing here
2.txt- Mate I was here before are you dumb
to:
listOne - [1, 8, 3, 13, 14, 6, 11]
listTwo - [8, 9, 10, 11, 12, 13, 14, 15]
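A compact sketch of this pre-conversion step, assuming the documents are already lowercased and de-duplicated (the class and method names are mine):
import java.util.*;

public class PreConversion {
    final Map<String, Integer> myMap = new HashMap<>();      // word -> id
    final Map<Integer, String> reverseMap = new HashMap<>(); // id -> word
    private int counter = 0;

    // First pass: assign a number to every word of every document.
    // A repeated word gets its value overwritten by put(), as described above.
    void index(List<String> documentWords) {
        for (String w : documentWords) {
            myMap.put(w, ++counter);
        }
    }

    // Build the reverse view once indexing is complete.
    void buildReverseMap() {
        for (Map.Entry<String, Integer> e : myMap.entrySet()) {
            reverseMap.put(e.getValue(), e.getKey());
        }
    }

    // Second pass: convert a document into its list of Integer ids.
    List<Integer> toIds(List<String> documentWords) {
        List<Integer> ids = new ArrayList<>();
        for (String w : documentWords) {
            ids.add(myMap.get(w));
        }
        return ids;
    }
}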
Duplicate words, positions and sequences
To find the duplicates within both documents:
First, create a copy of one of the lists, for example listTwo (Integers are immutable, so a plain copy of a List<Integer> is enough). Call it listDuplicates, as that will be its purpose.
List<Integer> listDuplicates = new ArrayList<>(listTwo);
Call retainAll:
listDuplicates.retainAll(listOne);
The result would be:
listDuplicates- [8,11,13,14]
So, from a total of listOne.size() + listTwo.size() = 15 words found across the 2 documents, 4 distinct words are duplicated and 7 are unique.
In order to get the words back from these values, just call:
for (Integer i : listDuplicates)
System.out.println(reverseMap.get(i)); // mate , here, are, you
Now that the duplicates are identified, listOne and listTwo can also be used in order to:
Identify the position of each duplicate in each list (so we can get the difference in the positions of these words). The first duplicate gets a -1 value, as it is the first one and has no previous duplicate to diff against; that won't necessarily mean it is consecutive with any other (it is just the first duplicate).
If the next element also had a -1 value, that would mean that [8] and [11] were also consecutive:
doc1 doc2 difDoc1 difDoc2
[8] 2 1 -1 (0-1) -1 (0-1)
[11] 7 4 -5 (2-7) -3 (1-4)
[13] 4 6 3 (7-4) -2 (4-6)
[14] 5 7 -1 (4-5) -1 (6-7)
In this case, the distance shown for [14] relative to its previous duplicate (the diff between [13] and [14]) is the same in both documents: -1. That means that not only are they duplicates, but they are also placed consecutively in both documents.
Hence, we've found not only duplicate words, but also a duplicate sequence of two words between those lines:
[13][14]--are you
The same mechanism (identifying a diff of -1 for the same entry in both documents) would also help to find a complete duplicate sequence of 2 or more words. If all the duplicates show a diff of -1 in both documents, that means we've found a completely duplicated line.
This example shows it more clearly:
doc1- "here i am" [4,5,6]
doc2- "here i am" [4,5,6]
listDuplicates - [4,5,6]
doc1 doc2 difDoc1 difDoc2
[4] 1 1 -1 (0-1) -1 (0-1)
[5] 2 2 -1 (1-2) -1 (1-2)
[6] 3 3 -1 (2-3) -1 (2-3)
All the diffs are -1 for the same entries in both documents -> all duplicates are next to each other in both documents -> the sentence is exactly the same in both documents. So, this time, we've found a complete duplicate line of 3 words.
[4][5][6] -- here i am
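A rough sketch of the position-diff check described above, flagging duplicates that are consecutive in both documents (a simplification of the full difference table; the names are mine):
import java.util.*;

public class DuplicateSequences {
    // Returns the ids of duplicates whose distance to the previous duplicate is -1 in both lists,
    // i.e. duplicates that are placed consecutively in both documents.
    static List<Integer> consecutiveDuplicates(List<Integer> listOne, List<Integer> listTwo,
                                               List<Integer> listDuplicates) {
        List<Integer> consecutive = new ArrayList<>();
        for (int i = 1; i < listDuplicates.size(); i++) {
            int prev = listDuplicates.get(i - 1);
            int cur = listDuplicates.get(i);
            int diffOne = listOne.indexOf(prev) - listOne.indexOf(cur);
            int diffTwo = listTwo.indexOf(prev) - listTwo.indexOf(cur);
            if (diffOne == -1 && diffTwo == -1) {
                consecutive.add(cur);   // cur follows prev immediately in both documents
            }
        }
        return consecutive;
    }
}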
Apart from this duplicate-sequence search, the difference table would also be helpful when calculating the variance, median, ... of the duplicate words, in order to get some kind of "similarity" factor (something like a basic indicative value of how alike the documents are; by no means definitive, but somewhat helpful).
Unique values - helpful in order to get a dissimilarity indicator?
A similar mechanism would be used to get the unique values. For example, by removing the duplicates from the reverseMap:
for (Integer i: listDuplicates)
reverseMap.remove(i);
Now the reverseMap only contains unique values. reverseMap.size() = 7
KEY VALUE
1 welcome
3 what
6 doing
9 I
10 was
12 before
15 dumb
In order to get the unique words:
reverseMap.values() = {welcome,what,doing,I,was,before,dumb}
If you need to know which unique words come from which document, you can use the reverseMap (as the Lists may have been altered after executing methods such as retainAll on them):
Count the number of words from the 1st document. This time, 7.
If the key in the reverseMap is <= 7, the unique word comes from the 1st document: {welcome, what, doing}
If the key is > 7, the unique word comes from the 2nd document: {I, was, before, dumb}
The uniqueness factor could also be another indicator, this time a negative one (as we are searching for similarities here). It could still be really helpful.
equals and hashCode - avoid
The String.equals() method works by comparing the chars one by one (after first checking the length, as you do manually), which would be total overkill if done over and over for big documents:
public boolean equals(Object anObject) {
if (this == anObject) {
return true;
}
if (anObject instanceof String) {
String anotherString = (String) anObject;
int n = value.length;
if (n == anotherString.value.length) {
char v1[] = value;
char v2[] = anotherString.value;
int i = 0;
while (n-- != 0) {
if (v1[i] != v2[i])
return false;
i++;
}
return true;
}
}
return false;
}
My opinion is to avoid repeated String comparisons as much as possible, and in particular never to use the == operator to compare String contents:
String one = "hello";
String two = new String("hello");
one.equals(two) --> true (same contents)
one == two --> false (different objects)
There is an exception to this, but only when the compiler interns String literals; in those cases where both String objects reference the same cached memory address, this will also be true:
one == two --> true
But strings built at runtime (such as the words read from your documents) are not interned, so == will return false even if the Strings hold the same value (as usual, because it works by comparing their memory addresses). Note that, unlike ==, String.hashCode() is computed from the characters, so two equal words do return the same hash code; it just still has to walk the characters (as equals() does), which is exactly the per-word cost the sequential Integer ids above avoid.
The essential technique is to see this is as a multi-stage process. The key is that you're not trying to compare every document with every other document, but rather, you have a first pass that identifies small clusters of likely matches in essentially a one-pass process:
(1) Index or cluster the documents in a way that will allow candidate matches to be identified;
(2) Identify candidate documents that may be a match based on those indexes/clusters;
(3) For each cluster or index match, have a scoring algorithm that scores the similarity of a given pair of documents.
There are a number of ways to solve (1) and (3), depending on the nature and number of the documents. Options to consider:
For certain datasets, (1) could be as simple as indexing on unusual words/combinations of words
For more complex documents and/or larger datasets, you will need to do something sometimes called 'dimension reduction': rather than clustering on shared combinations of words, you'll need to cluster on combinations of shared features, where each feature is identified by a set of words. Look at a feature-extraction technique sometimes referred to as "latent semantic indexing": essentially, you model the documents mathematically as a matrix of "words per feature" multiplied by "features per document", and then by factorising the matrix you arrive at an approximation of a set of features, along with a weighted list of which words make up each feature
Then, once you have a means of identifying a set of words/features to index on, you need some kind of indexing function that will mean that candidate document matches have identical/similar index keys. Look at cosine similarity and "locality-sensitive hashing" such as SimHash.
Then for (3), given a small set of candidate documents (or documents that cluster together in your hashing system), you need a similarity metric. Again, which method is appropriate depends on your data, but conceptually, one way you could see this is: "for each sentence in document X, find the most similar sentence in document Y and score its similarity; obtain a 'plagiarism score' that is the sum of these values". There are various ways to define a 'similarity score' between two strings: e.g. longest common subsequence, edit distance, number of common word pairs/sequences...
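As one concrete example of such a similarity score, here is a hedged sketch of cosine similarity between the word-count vectors of two sentences; it is only one of the options listed above, not the method this answer prescribes:
import java.util.*;

public class CosineSimilarity {
    static Map<String, Integer> counts(List<String> words) {
        Map<String, Integer> c = new HashMap<>();
        for (String w : words) c.merge(w, 1, Integer::sum);  // word frequency vector
        return c;
    }

    // Cosine similarity between the word-count vectors of two sentences, in [0, 1].
    static double cosine(List<String> a, List<String> b) {
        Map<String, Integer> ca = counts(a), cb = counts(b);
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : ca.entrySet()) {
            dot += e.getValue() * cb.getOrDefault(e.getKey(), 0);
            normA += e.getValue() * e.getValue();
        }
        for (int v : cb.values()) normB += v * v;
        return (normA == 0 || normB == 0) ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}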
As you can probably imagine from all of this, there's no single algorithm that will hand you exactly what you need on a plate. (That's why entire companies and research departments are dedicated to this problem...) But hopefully the above will give you some pointers.
Problem:
Essentially, my goal is to build an ArrayList of IndexEntry objects from a text file. An IndexEntry has the following fields: String word, representing this unique word in the text file, and ArrayList numsList, a list containing the lines of the text file in which word occurs.
The ArrayList I build must keep the IndexEntries sorted so that their word fields are in alphabetical order. However, I want to do this in the fastest way possible. Currently, I visit each word as it appears in the text file and use binary search to determine if an IndexEntry for that word already exists in order to add the current line number to its numsList. In the case of an IndexEntry not existing I create a new one in the appropriate spot in order to maintain alphabetical order.
Example:
_
One

Two

One

Three
_
Would yield an ArrayList of IndexEntries whose output as a String (in the order of word, numsList) is:
One [1, 5], Three [7], Two [3]
Keep in mind that I am working with much larger text files, with many occurrences of the same word.
Question:
Is binary search the fastest way to approach this problem? I am still a novice at programming in Java, and am curious about searching algorithms that might perform better in this scenario or the relative time complexity of using a Hash Table when compared with my current solution.
You could try a TreeMap or a ConcurrentSkipListMap which will keep your index sorted.
However, if you only need a sorted list at the end of your indexing, good old HashMap<String, List> is the way to go (an ArrayList as the value is probably a safe bet as well).
When you are done, take the map's entries and sort them once by key (the word).
Should be good enough for a couple hundred megabytes of text files.
If you are on Java 8, use the neat computeIfAbsent and computeIfPresent methods.
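A minimal sketch of that approach using computeIfAbsent, with the sort done once at the end (the names are mine):
import java.util.*;

public class WordIndexer {
    public static Map<String, List<Integer>> buildIndex(List<String> lines) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (int lineNo = 0; lineNo < lines.size(); lineNo++) {
            for (String word : lines.get(lineNo).split("\\s+")) {
                if (word.isEmpty()) continue;
                // computeIfAbsent creates the list the first time the word is seen.
                index.computeIfAbsent(word, k -> new ArrayList<>()).add(lineNo + 1);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> index = buildIndex(Arrays.asList("One", "Two", "One", "Three"));
        // Sort once at the end, by word.
        new TreeMap<>(index).forEach((word, nums) -> System.out.println(word + " " + nums));
    }
}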
My college is almost over, so I have started preparing for interviews to get a job, and I came across this interview question while preparing:
You have a set of 10000 ascii strings (loaded from a file)
A string is input from stdin.
Write a pseudocode that returns (to stdout) a subset of strings in (1) that contain the same distinct characters (regardless of order) as
input in (2). Optimize for time.
Assume that this function will need to be invoked repeatedly. Initializing the string array once and storing it in memory is okay.
Please avoid solutions that require looping through all 10000 strings.
Can anyone provide me a general pseudocode/algorithm kind of thing how to solve this problem? I am scratching my head thinking about the solution. I am mostly familiar with Java.
Here is an O(1) algorithm!
Initialization:
For each string, sort its characters and remove duplicates - e.g. "trees" becomes "erst"
load the sorted form into a trie, using the sorted characters as the path, and add a reference to the original word to the list of words stored at the final node
Search:
sort the input string the same way the source strings were sorted during initialization
follow the trie using those characters; at the end node, return all the words referenced there
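A hedged Java sketch of this trie approach; the class and method names are mine, and words are stored only at the final node of their sorted, de-duplicated key:
import java.util.*;

public class DistinctCharIndex {
    // Trie node keyed on sorted, de-duplicated characters; words are stored at the terminal node.
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        final List<String> words = new ArrayList<>();
    }

    private final Node root = new Node();

    // Sort characters and remove duplicates, e.g. "trees" -> "erst".
    private static String key(String s) {
        TreeSet<Character> set = new TreeSet<>();
        for (char c : s.toLowerCase().toCharArray()) set.add(c);
        StringBuilder sb = new StringBuilder();
        for (char c : set) sb.append(c);
        return sb.toString();
    }

    void add(String word) {
        Node node = root;
        for (char c : key(word).toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.words.add(word);
    }

    List<String> lookup(String input) {
        Node node = root;
        for (char c : key(input).toCharArray()) {
            node = node.children.get(c);
            if (node == null) return Collections.emptyList();
        }
        return node.words;
    }

    public static void main(String[] args) {
        DistinctCharIndex index = new DistinctCharIndex();
        for (String s : Arrays.asList("trees", "steer", "stack")) index.add(s);
        System.out.println(index.lookup("reset"));   // [trees, steer] - same distinct characters {e,r,s,t}
    }
}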
They say optimise for time, so I guess we're safe to abuse space as much as we want.
In that case, you could do an initial pass over the 10000 strings and build a mapping from each unique character present in the 10000 strings to the set of indices of the strings that contain it. That way you can ask the mapping the question: which strings contain character 'x'? Call this mapping M. (Order: O(nm), where n is the number of strings and m is their maximum length.)
To optimise for time again, you could reduce the stdin input string to its unique characters and put them in a queue, Q. (Order: O(p), where p is the length of the input string.)
Start a new set, say S, and let S = M[Q.extractNextItem].
Now you could loop over the rest of the unique characters and find which sets contain all of them.
While (Q is not empty) (loops O(p)) {
S = S intersect M[Q.extractNextItem] (close to O(1) per element, depending on your set implementation)
}
voila, return S.
Total time: O(mn + p + p*1) = O(mn + p)
(Still early in the morning here, I hope that time analysis was right)
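A rough Java sketch of this idea, using a Map<Character, Set<Integer>> for M and retainAll for the set intersection (the names are mine). As in the pseudocode above, it returns candidates containing all of the input's distinct characters; a final check that a candidate has no extra distinct characters would still be needed for an exact match.
import java.util.*;

public class CharIndex {
    private final Map<Character, Set<Integer>> m = new HashMap<>(); // M: char -> indices of strings containing it

    CharIndex(List<String> strings) {
        for (int i = 0; i < strings.size(); i++) {
            for (char c : strings.get(i).toCharArray()) {
                m.computeIfAbsent(c, k -> new HashSet<>()).add(i);
            }
        }
    }

    // Indices of all strings that contain every distinct character of the input.
    Set<Integer> candidates(String input) {
        Set<Character> distinct = new LinkedHashSet<>();
        for (char c : input.toCharArray()) distinct.add(c);   // Q: the input's unique characters

        Set<Integer> s = null;                                // S
        for (char c : distinct) {
            Set<Integer> withC = m.getOrDefault(c, Collections.emptySet());
            if (s == null) s = new HashSet<>(withC);          // S = M[first character]
            else s.retainAll(withC);                          // S = S intersect M[next character]
            if (s.isEmpty()) break;
        }
        return s == null ? Collections.emptySet() : s;
    }
}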
As Bohemian says, a trie tree is definitely the way to go!
This sounds like the way an address book lookup would work on a phone. Start punching digits in, and then filter the address book based on the number representation as well as any of the three (or actually more if using international chars) letters that number would represent.
I am searching WordNet for synonyms for a big list of words. The way I have done it, when a word has more than one synonym, the results are returned in alphabetical order. What I need is to have them ordered by their probability of occurrence, and I would take just the top 1 synonym.
I have used the prolog wordnet database and Syns2Index to convert it into Lucene type index for querying synonyms. Is there a way to get them ordered by their probabilities in this way, or I should use another approach?
Speed is not important; this synonym lookup will not be done online.
In case someone stumbles upon this thread, this was the way to go (at least for what I needed):
http://lyle.smu.edu/~tspell/jaws/doc/edu/smu/tspell/wordnet/impl/file/ReferenceSynset.html#getTagCount%28java.lang.String%29
The tagCount method gives the most likely synset group for every word. The problem, again, is that the synset with the highest probability can still contain several words, but I guess there's no way to avoid this.
I think that you should do another step (provided that speed is not important).
From the Lucene index, you should build another dictionary in which each word is mapped to a small object that contains the single synonym whose meaning has the highest probability of appearance, that meaning, and the probability of appearance. I.e., given this code:
class Synonym {
    public String name;
    public double probability;
    public String meaning;
}
Map<String, Synonym> m = new HashMap<String, Synonym>();
... you just have to fill it from the Lucene index.
I'm making a boggle-like word game. The user is given a grid of letters like this:
O V Z W X
S T A C K
Y R F L Q
The user picks out a word using any adjacent chains of letters, like the word "STACK" across the middle line. The letters used are then replaced by the machine e.g. (new letters in lowercase):
O V Z W X
z e x o p
Y R F L Q
Notice you can now spell "OVeRFLoW" by using the new letters. My problem is: what algorithm can I use to pick new letters that maximizes the number of long words the user can spell? I want the game to be fun and sometimes involve spelling, e.g., 6-letter words, but if bad letters are picked, games involve the user just spelling 3-letter words and never getting a chance to find longer words.
For example:
You could just randomly pick new letters from the alphabet. This does not work well.
Likewise, I found picking randomly but using the letter frequencies from Scrabble didn't work well. This works better in Scrabble I think as you are less constrained about the order you use the letters in.
I tried having a set of lists, each representing one of the dice from the Boggle game, and each letter would be picked from a random die side (I also wonder whether I can legally use this data in a product). I didn't notice this working well. I imagine the Boggle dice sides were chosen in some sensible manner, but I cannot find how this was done.
Some ideas I've considered:
Make a table of how often letter pairs occur together in the dictionary. For the sake of argument, say E is seen next to A 30% of the time. When picking a new letter, I would randomly pick a letter based on the frequency of that letter occurring next to a randomly chosen adjacent letter on the grid. For example, if the neighboring letter were E, the new letter would be "A" 30% of the time. This should mean there are lots of decent pairs to use scattered around the map. I could maybe improve this by making probability tables of a letter occurring between two other letters.
Somehow do a search for what words can be spelt on the current grid, taking the new letters to be wildcards. I would then replace the wildcards with letters that allowed the biggest words to be spelt. I'm not sure how you would do this efficiently however.
Any other ideas are appreciated. I wonder if there is a common way to solve this problem and what other word games use.
Edit: Thanks for the great answers so far! I forgot to mention, I'm really aiming for low memory/CPU requirements if possible. I'm probably going to use the SOWPODS dictionary (about 250,000 words) and my grid will be about 6 x 6.
Here's a simple method:
Write a fast solver for the game using the same word list that the player will use. Generate say 100 different possible boards at random (using letter frequencies is probably a good idea here, but not essential). For each board calculate all the words that can be generated and score the board based on the number of words found or the count weighted by word length (i.e. the total sum of word lengths of all words found). Then just pick the best scoring board from the 100 possibilities and give that to the player.
Also instead of always picking the highest scoring board (i.e. the easiest board) you could have different score thresholds to make the game more difficult for experts.
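A hedged sketch of that generate-and-score loop; the solver itself is assumed to exist already (the Solver interface below is just a placeholder for it), since this answer presupposes a fast solver for the game:
import java.util.*;

public class BoardPicker {
    // Assumed to exist: a fast solver returning every dictionary word spellable on the board.
    interface Solver {
        Set<String> findAllWords(char[][] board);
    }

    static char[][] pickBestBoard(Solver solver, Random rng, int size, int tries) {
        char[][] best = null;
        int bestScore = -1;
        for (int t = 0; t < tries; t++) {
            char[][] board = randomBoard(rng, size);
            // Score: sum of the lengths of all spellable words (favors boards with long words).
            int score = solver.findAllWords(board).stream().mapToInt(String::length).sum();
            if (score > bestScore) {
                bestScore = score;
                best = board;
            }
        }
        return best;
    }

    static char[][] randomBoard(Random rng, int size) {
        char[][] board = new char[size][size];
        for (char[] row : board)
            for (int c = 0; c < size; c++)
                row[c] = (char) ('a' + rng.nextInt(26)); // could use letter frequencies instead
        return board;
    }
}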
A minor variation on the letter-pair approach: use the frequency of letter pairs in long words - say 6 letters or longer - since that's your objective. You could also develop a weighting that included all adjacent letters, not just a random one.
This wordgame I slapped up a while back, which behaves very similarly to what you describe, uses English frequency tables to select letters, but decides first whether to generate a vowel or consonant, allowing me to ensure a given rate of vowels on the board. This seems to work reasonably well.
You should look up n-grams and Markov models.
Your first idea is very loosely related to Markov algorithms.
Basically, if you have a large text corpus, say of 1000 words, what you can do is analyse each letter and create a table with the probability of a certain letter following the current letter.
For example, I know that the letter Q, from my 1000 words (4000 letters in total), is used only 40 times. Then I calculate which letters probably follow it using my Markov hash table.
For example,
QU happens 100% of the time, so I know that should Q be randomly chosen by your application, I need to make sure that the letter U is also included.
Then, the letter "I" is used 50% of the time, "A" 25% of the time, and "O" 25% of the time.
It's actually really complicated to explain, and I bet there are other explanations out there which are much better than this.
But the idea is that, given a legitimately large text corpus, you can create a chain of X letters which is probably consistent with the English language and thus should be easy for users to make words out of.
You can choose how far to look ahead with the value of the n-gram; the higher the number, the easier you can make your game. For example, an n-gram of two would probably make it very hard to create words over 6 letters, but an n-gram of 4 would make it very easy.
The Wikipedia article explains it really badly, so I wouldn't follow that.
Take a look at this Markov generator:
http://www.haykranen.nl/projects/markov/demo/
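A rough sketch of a bigram (n-gram of two) table built from a word list and used to pick a letter conditioned on an adjacent one; smoothing and dictionary loading are left out, and all names are mine:
import java.util.*;

public class BigramPicker {
    // counts[a][b] = how often letter b follows letter a in the word list.
    private final int[][] counts = new int[26][26];
    private final Random rng = new Random();

    BigramPicker(List<String> words) {
        for (String w : words) {
            String s = w.toLowerCase();
            for (int i = 0; i + 1 < s.length(); i++) {
                char a = s.charAt(i), b = s.charAt(i + 1);
                if (a >= 'a' && a <= 'z' && b >= 'a' && b <= 'z') {
                    counts[a - 'a'][b - 'a']++;
                }
            }
        }
    }

    // Sample a letter proportionally to how often it follows 'previous' in the word list.
    char pickAfter(char previous) {
        int[] row = counts[Character.toLowerCase(previous) - 'a'];
        int total = Arrays.stream(row).sum();
        if (total == 0) return (char) ('a' + rng.nextInt(26)); // unseen letter: fall back to uniform
        int r = rng.nextInt(total);
        for (int b = 0; b < 26; b++) {
            r -= row[b];
            if (r < 0) return (char) ('a' + b);
        }
        return 'e'; // unreachable
    }
}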
I do not know about a precanned algorithm for this, but...
There is a dictionary file on UNIX, and I imagine there is something similar available on other platforms (maybe even in the Java libraries? Google it). Anyway, use the files the spell checker uses.
After they spell a word and it drops out, you have existing letters and blank spaces.
1) From each existing letter, go right, left, up, and down (you will need to understand recursive algorithms). As long as the string you have built so far is found at the start of words, or backwards from the end of words, in the dictionary file, continue. When you come across a blank space, count the frequency of the letters you would need next. Use the most frequent letters.
It will not guarantee a word as you have not checked the corresponding ending or beginning, but I think it would be much easier to implement than an exhaustive search and get pretty good results.
I think this will get you a step closer to your destination: http://en.wikipedia.org/wiki/Levenshtein_distance
You might look at this Java implementation of the Jumble algorithm to find sets of letters that permute to multiple dictionary words:
$ java -jar dist/jumble.jar | sort -nr | head
11 Orang Ronga angor argon goran grano groan nagor orang organ rogan
10 Elaps Lepas Pales lapse salep saple sepal slape spale speal
9 ester estre reest reset steer stere stree terse tsere
9 caret carte cater crate creat creta react recta trace
9 Easter Eastre asteer easter reseat saeter seater staree teaser
9 Canari Carian Crania acinar arnica canari carina crania narica
8 leapt palet patel pelta petal plate pleat tepal
8 laster lastre rastle relast resalt salter slater stelar
8 Trias arist astir sitar stair stria tarsi tisar
8 Trema armet mater metra ramet tamer terma trame
...