Strings & possible anagrams of those strings - Java

I'm doing a wee project (in Java) while uni is out just to test myself and I've hit a stumbling block.
I'm trying to write a program that will read in a text version of a dictionary and store it in a ds (data structure). It will then ask the user for a random string (preferably a nonsense string; only letters and -'s, no numbers or other punctuation - I'm not interested in anything else), find all the anagrams of the inputted string, compare them against the dictionary ds, and return a list of all the possible anagrams that are in the dictionary.
Okay, for steps 1 and 2 (reading in the dictionary): I stored everything in a Map, where the keys are the letters of the alphabet and the values are ArrayLists storing all the words beginning with that letter.
I'm stuck at finding all the anagrams. I (proudly) figured out how to calculate the number of possible permutations recursively, but I'm not sure how to go about actually doing the rearranging.
Is it better to break the string up into chars and play with it that way, or to split it up and keep the pieces as String elements? I've seen sample code on different sites, but I don't want to see code; I'd like to know the kind of approach/ideas behind developing a solution for this, as I'm kinda stuck on how to even begin :(
I mean, I think I know how I'm going to go about the comparison to the dictionary ds once I've generated all permutations.
Any advice would be helpful, but not code if that'd be alright, just ideas.
P.S. If you're wanting to see my code so far (for whatever reason), I'll post what I've got.

import java.util.ArrayList;

public class Permutations {

    public static String str = "overflow";
    public static ArrayList<String> possibilities = new ArrayList<String>();

    public static void main(String[] args) {
        permu(new boolean[str.length()], "");
    }

    // Build each permutation character by character; used[a] marks which
    // positions of str are already taken in the current prefix.
    public static void permu(boolean[] used, String cur) {
        if (cur.length() == str.length()) {
            possibilities.add(cur);
            return;
        }
        for (int a = 0; a < str.length(); a++) {
            if (!used[a]) {
                used[a] = true;
                permu(used, cur + str.charAt(a));
                used[a] = false;
            }
        }
    }
}
Simple, with a really horrible run-time, but it will get the job done.
EDIT: The more advanced version of this is something called a dictionary trie. Basically, it's a tree in which each node has 26 child nodes, one for each letter of the alphabet. Each node also has a boolean telling whether or not it is the end of a word. With this you can easily insert words into the dictionary and easily check whether you are even on a correct path for creating a word.
I will paste the code if you would like.

Computing the permutations really seems like a bad idea in this case. The word "overflow", for instance, has 40320 permutations.
A better way to find out if one word is a permutation of another is to count how many times each letter occurs (giving a 26-tuple) and compare these tuples against each other.
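As a sketch of that counting idea (class and method names are my own, and it assumes lower-case letters only):

```java
import java.util.Arrays;

class AnagramCheck {
    // Count occurrences of 'a'..'z'; non-letter characters are ignored.
    static int[] letterCounts(String s) {
        int[] counts = new int[26];
        for (char c : s.toCharArray()) {
            if (c >= 'a' && c <= 'z') {
                counts[c - 'a']++;
            }
        }
        return counts;
    }

    // Two words are anagrams exactly when their 26-entry count vectors match.
    static boolean isAnagram(String a, String b) {
        return Arrays.equals(letterCounts(a), letterCounts(b));
    }
}
```

This costs O(length) per comparison instead of enumerating all n! permutations.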

It might be helpful if you gave an example to clarify the problem. As I understand it, you are saying that if the user typed in, say, "islent", the program would reply with "listen", "silent", and "enlist".
I think the easiest solution would be to take each word in your dictionary and store it with both the word as entered, and with the word with the letters re-arranged into alphabetical order. Let's call this the "canonical value". Index on the canonical value. Then convert the input into the canonical value, and do a straight search for matches.
To pursue the above example: when we build the dictionary and see the word "listen", we translate it to "eilnst" and store "eilnst -> listen". We'd also store "eilnst -> silent" and "eilnst -> enlist". Then we take the input string, convert it to "eilnst", do a search, and immediately find the three hits.
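A minimal sketch of that canonical-value index (class and method names are mine, not from the question):

```java
import java.util.*;

class AnagramIndex {
    private final Map<String, List<String>> index = new HashMap<>();

    // Sort the letters of a word to get its canonical form, e.g. "listen" -> "eilnst".
    static String canonical(String word) {
        char[] letters = word.toLowerCase().toCharArray();
        Arrays.sort(letters);
        return new String(letters);
    }

    // Store each dictionary word under its canonical form.
    void add(String word) {
        index.computeIfAbsent(canonical(word), k -> new ArrayList<>()).add(word);
    }

    // Look up all dictionary words sharing the input's canonical form.
    List<String> anagramsOf(String input) {
        return index.getOrDefault(canonical(input), Collections.emptyList());
    }
}
```

Building the index is a one-time cost; each query is then a single hash lookup.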

Related

Sentence Trie/Tree/dictionary/corpus

I am hoping to build a tree in which each node is an English word and a branch of leaves forms a sentence: a sentence tree.
I was thinking of using a Trie, but I am having trouble inserting the nodes. I am not sure how to determine the level of the nodes. In a Trie all the nodes are characters, so it's possible to use a fixed-size array of children. Having whole words is different.
Does that make sense? I am open to other data structures as well. The goal is to create a dictionary/corpus which stores a bunch of English sentences, where users can use the first couple of words to look up the whole sentence. I am most proficient in Java, but I also know Python and R, so those would work too if they are easier for my purposes.
Thank you!
class TrieNode {
    TrieNode[] children = new TrieNode[26];
    boolean isEndOfWord;
}

void insert(String key) {
    TrieNode pCrawl = root;
    for (int level = 0; level < key.length(); level++) {
        int index = key.charAt(level) - 'a';
        if (pCrawl.children[index] == null)
            pCrawl.children[index] = new TrieNode();
        pCrawl = pCrawl.children[index];
    }
    // mark the last node as the end of a word
    pCrawl.isEndOfWord = true;
}
A little late, but maybe I can help a bit even now.
A trie sorts each level by unique key. Traditionally this is a character from a string, and the value stored at the final location is the string itself.
Tries can be much more than this. If I understand you correctly then you wish to sort sentences by their constituent words.
At each level of your trie you look at the next word and seek its position in the list of children, rather than looking at the next character. Unfortunately all the traditional implementations show sorting by character.
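A minimal sketch of a trie keyed by words instead of characters, using a sorted map for the children (this is illustrative only, not the repo's code; names are mine):

```java
import java.util.*;

class SentenceTrie {
    // TreeMap keeps siblings sorted by word, so traversal yields sorted sentences.
    private final Map<String, SentenceTrie> children = new TreeMap<>();
    private boolean endOfSentence;

    // Each level of the trie consumes one word of the sentence.
    void insert(String sentence) {
        SentenceTrie node = this;
        for (String word : sentence.split("\\s+")) {
            node = node.children.computeIfAbsent(word, w -> new SentenceTrie());
        }
        node.endOfSentence = true;
    }

    boolean contains(String sentence) {
        SentenceTrie node = this;
        for (String word : sentence.split("\\s+")) {
            node = node.children.get(word);
            if (node == null) return false;
        }
        return node.endOfSentence;
    }
}
```

The only real change from a character trie is that the child lookup is by word, which is why a map works better than a fixed 26-slot array.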
I have a solution for you, or rather two. The first is to use my java source code trie. This sorts any object (in your case the string containing your sentence) by an Enumeration of integers. You would need to map your words to integers (store the words in a trie, giving each a unique number), and then write an enumerator that returns the word integers for a sentence. That would work. (Do not use a hash for the word -> integer conversion, as two words can give the same hash.)
The second solution is to take my code and instead of comparing integers compare the words as strings. This would take more work, but looks entirely feasible. In fact, I have had a suspicion that my solution can be made more generic by replacing Enumeration of Integer with an Enumeration of Comparable. If you wish to do this, or collaborate in doing this I would be interested. Heck, I may even do it myself for the fun of it.
The resultant trie would have generic type
Trie<K extends Comparable, T>
and would store instances of T against a sequence of K. The coder would need to define a method
Iterator<K> getIterator(T t)
EDIT: It was actually remarkably easy to generalise my code to use Comparable instead of Integer, although there are plenty of warnings because I am using the raw type Comparable rather than Comparable<T>. Maybe I will sort those out another day.
SentenceSorter sorter = new SentenceSorter();
sorter.add("This is a sentence.");
sorter.add("This is another sentence.");
sorter.add("A sentence that should come first.");
sorter.add("Ze last sentence");
sorter.add("This is a sentence that comes somewhere in the middle.");
sorter.add("This is another sentence entirely.");
Then listing sentences by:
Iterator<String> it = sorter.iterator();
while (it.hasNext()) {
    System.out.println(it.next());
}
gives
A sentence that should come first.
This is a sentence that comes somewhere in the middle.
This is a sentence.
This is another sentence entirely.
This is another sentence.
Note that the sentence split is including the full stop with the word, and that is affecting the sort. You could improve upon this.
We can show that we are sorting by words rather than characters:
it = sorter.sentencesWithPrefix("This is a").iterator();
while (it.hasNext()) {
    System.out.println(it.next());
}
gives
This is a sentence that comes somewhere in the middle.
This is a sentence.
whereas
it = sorter.sentencesWithPrefix("This is another").iterator();
while (it.hasNext()) {
    System.out.println(it.next());
}
gives
This is another sentence entirely.
This is another sentence.
Hope that helps - the code is all up on the above mentioned repo, and freely available under Apache2.

Accessing a full dictionary java

I'm coding an AI that plays the game word chain. If you don't know what that is, here's a wikipedia link.
I'd like to make my AI better by giving it the ability to access an entire dictionary and search through words based on my parameters. How would I be able to access an entire dictionary in Java, using Eclipse?
Given that the main game rule is that the next word must start with the last letter of the previous word, you definitely want to prepare the data structure up front and then access it in O(1). Therefore, I would recommend using an array of the alphabet's size (e.g. 26 for English) whose elements are HashSet instances representing the bag of words starting with the corresponding letter.
HashSet<String>[] words;
In fact, given the array, you can immediately access the set of words starting with a given letter (position 0 -> A, position 1 -> B, ...). As an alternative to the array, you can use a HashMap whose keys are letters and whose values are again HashSets of possible words.
HashMap<Character, HashSet<String>> words;
So, access is still granted in O(1).
Concerning the HashSets: you want both constant access time and constant remove time, because words cannot be repeated during the game, so after using a word you want to drop it from its HashSet.
If your dictionary is small enough (or, from another viewpoint, you have enough resources), you can entirely prefetch the dictionary. In case you don't, the proposed structures are still adaptable: in fact, the array (or the HashMap) is not going to change, while the HashSet also offers constant add time. So you may plan on refilling the HashSet from time to time (e.g. after a given amount of remove actions).
In all cases, you can always take the first element of a HashSet or introduce some randomization; keep in mind that, as a general rule, HashSet elements are not kept in any particular order.
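A small sketch of the structure described above (class and method names are mine); the take method also removes the word, since words cannot be repeated during the game:

```java
import java.util.*;

class WordChainDictionary {
    // One bag of words per letter; index 0 -> a, 1 -> b, ...
    @SuppressWarnings("unchecked")
    private final HashSet<String>[] words = new HashSet[26];

    WordChainDictionary() {
        for (int i = 0; i < 26; i++) words[i] = new HashSet<>();
    }

    void add(String word) {
        String w = word.toLowerCase();
        words[w.charAt(0) - 'a'].add(w);
    }

    // Pick any unused word starting with the given letter and remove it
    // from the bag so it cannot be played again. Returns null if none left.
    String takeWordStartingWith(char letter) {
        HashSet<String> bag = words[letter - 'a'];
        Iterator<String> it = bag.iterator();
        if (!it.hasNext()) return null;
        String word = it.next();
        it.remove();
        return word;
    }
}
```

Both `add` and `takeWordStartingWith` are O(1) on average, which is what the game loop needs.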
The source for reading the words can easily be a file or, even better, a bag of files, one per alphabet letter, so that you always know where to read and you can open all of them and tear down the overhead of opening, closing or seeking into the file: you open and close once each file and your seeking is just linear.
Finally, in case you also want to propose only words belonging to a given category, you may want to filter out the words not belonging to that category during dictionary prefetching (assuming you have the categories each word belongs to).
If your problem also includes looking for a "related" word at runtime, then you may want to use Feature vectors, so that you can still have acceptable computation time to calculate correlations during the game.
Try googling "word list". Here's a good one: http://wordlist.aspell.net/
Save one of these as a file and load it into memory with java. I would be more specific, but you will load it into memory differently depending on how you want to search the words.
What kind of AI are you trying to build? Is it a learning agent?
From what I understand of 'search through words based on the parameter', I suppose you mean you want to put words into different categories, so that your AI will be able to generate a list of words that can be played.
To create a word domain, you can always store your list of words in a HashMap with the 'parameter' as your key. Since you are trying to store the entire dictionary, why don't you store the information in a non-relational database (if applicable), so that you don't have to prepare the AI every time you start the game.
A non-relational database can be easily used from Java. One that I know is easy to configure is Riak. You can see the description and tutorial here: http://basho.com/riak/. Using a non-relational database is similar to searching for things with a 'key word'.
Hope that's what you are asking.
In this case I think you need to communicate with a DICT server to get access to an entire dictionary.
This is my code
import java.net.*;
import java.io.*;
import java.util.*;

public class Dictionary {
    public static void main(String[] args) {
        String host = "dict.org";
        try {
            Socket soc = new Socket(host, 2628);
            OutputStream out = soc.getOutputStream();
            // DICT protocol commands must be terminated with CRLF
            String request = "DEFINE ! yourwordhere\r\n";
            out.write(request.getBytes());
            out.flush();
            soc.shutdownOutput();
            InputStream in = soc.getInputStream();
            Scanner s = new Scanner(in);
            while (s.hasNextLine())
                System.out.println(s.nextLine());
            soc.close();
        } catch (UnknownHostException e) {
            System.out.println("Cannot find the host " + host);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
If so, you don't have to do any searching yourself, which could reduce the program's execution time.

Optimising checking for Strings in a word list (Java)

I have a text file containing ~30,000 words in alphabetical order each on a separate line.
I also have a Set<String> set containing ~10 words.
I want to check if any of the words in my set are in the word list (text file).
So far my method has been to:
Open the word list text file
Read a line/word
Check if set contains that word
Repeat to the end of the word list file
This seems badly optimised. For example, if I'm checking a word in my set that begins with the letter b, I see no point in checking words in the text file beginning with a, or with c, d, etc.
My proposed solution would be to separate the text file into 26 files, one file for words which start with each letter of the alphabet. Is there a more efficient solution than this?
Note: I know 30,000 words isn't that large a word list but I have to do this operation many times on a mobile device so performance is key.
You can extend your approach of using hash sets to the entire wordlist file. String comparisons are expensive, so it's better to create a HashSet of Integer. You should read the wordlist (assuming the words will not grow from 30,000 to something like 3 million) once in its entirety and save all the words in an Integer HashSet. When adding to the Integer HashSet, use:
wordListHashSet.add(mycurrentword.hashCode());
You have mentioned that you have a set of 10 words that must be checked against the wordlist. Again, instead of a String set, create an Integer HashSet.
Create an iterator of this Integer Hash Set.
Iterator<Integer> it = myTenWordsHashSet.iterator();
Iterate over this in a loop and check for the following condition:
wordListHashSet.contains(it.next());
If this is true, then you have the word in the wordlist.
Using Integer hash sets is a good idea when performance is what you are looking for. Internally, Java computes the hash of each string and stores it so that repeated access is blazingly fast, dropping from the O(log n) of a binary search to almost O(1) per lookup. (One caveat: two different strings can share the same hash code, so strictly a hit should be confirmed against the actual word.)
Hope that helps!
It's probably not worth the hassle for 30,000 words, but let's just say you have a lot more, like say 300,000,000 words, and still only 10 words to look for.
In that case, you could do a binary search in the large file for each of the search words, using Random Access Files.
Obviously, each search step would require you first to find the beginning of a word (or of the next word, implementation dependent), which makes it a lot more difficult, and handling all the corner cases exceeds the limit of code one could provide here. But it could still be done, and would surely be faster than reading through all 300,000,000 words once.
You might consider iterating through your 10-word set (maybe parse it from the file into an array) and, for each entry, using a binary search algorithm to see if it's contained in the larger list. Binary search only takes O(log N), so in this case log(30,000), which is significantly faster than 30,000 steps.
Since you'll repeat this step once for every word in your set, it should take 10 * log(30k).
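A sketch of that idea using `Collections.binarySearch` on the already-sorted word list (the class and method names are mine):

```java
import java.util.*;

class WordListSearch {
    // sortedWordList must already be in alphabetical order
    // (the question says the file is).
    static List<String> wordsPresent(List<String> sortedWordList, Set<String> candidates) {
        List<String> found = new ArrayList<>();
        for (String word : candidates) {
            // binarySearch returns a non-negative index iff the word is present.
            if (Collections.binarySearch(sortedWordList, word) >= 0) {
                found.add(word);
            }
        }
        return found;
    }
}
```

Since the file is already alphabetical, no extra sorting pass is needed before searching.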
You can make some improvements depending on your needs.
If, for example, the file remains unchanged but your 10-word Set changes regularly, then you can load the file into another Set (a HashSet). Now you just need to search for a match in this new Set. This way each lookup is O(1).
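A sketch of that approach (names are mine): load the word list into a HashSet once; every membership check afterwards is O(1) on average.

```java
import java.io.*;
import java.util.*;

class WordListSet {
    // Read every line of the word list into a HashSet once.
    static Set<String> load(BufferedReader reader) {
        Set<String> words = new HashSet<>();
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                words.add(line.trim());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }
}
```

With a real file you would call it as `WordListSet.load(new BufferedReader(new FileReader("wordlist.txt")))`, where the path is a placeholder.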

Returning a Subset of Strings from 10000 ascii strings

My college term is getting over, so I have started preparing for job interviews, and I came across this interview question while preparing.
You have a set of 10000 ascii strings (loaded from a file)
A string is input from stdin.
Write a pseudocode that returns (to stdout) a subset of strings in (1) that contain the same distinct characters (regardless of order) as
input in (2). Optimize for time.
Assume that this function will need to be invoked repeatedly. Initializing the string array once and storing it in memory is okay.
Please avoid solutions that require looping through all 10000 strings.
Can anyone provide a general pseudocode/algorithm for how to solve this problem? I am scratching my head thinking about the solution. I am most familiar with Java.
Here is an O(1) algorithm!
Initialization:
For each string, sort its characters and remove duplicates - e.g. "trees" becomes "erst"
Load the sorted word into a trie using its sorted characters, adding a reference to the original word to the list of words stored at each node traversed
Search:
sort the input string the same way as the source strings
follow the trie using the sorted characters; at the end node, return all the words referenced there
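To keep the sketch short, the same idea can be shown with a HashMap keyed by the sorted distinct characters instead of a full trie (names are mine; the trie additionally lets you abandon dead paths early):

```java
import java.util.*;

class DistinctCharIndex {
    private final Map<String, List<String>> index = new HashMap<>();

    // "trees" -> "erst": sort the characters and drop duplicates.
    static String key(String s) {
        char[] chars = s.toCharArray();
        Arrays.sort(chars);
        StringBuilder sb = new StringBuilder();
        for (char c : chars) {
            if (sb.length() == 0 || sb.charAt(sb.length() - 1) != c) sb.append(c);
        }
        return sb.toString();
    }

    void add(String word) {
        index.computeIfAbsent(key(word), k -> new ArrayList<>()).add(word);
    }

    // All stored strings with exactly the same distinct characters as the input.
    List<String> query(String input) {
        return index.getOrDefault(key(input), Collections.emptyList());
    }
}
```

After the one-time initialization over the 10000 strings, each query is a single hash lookup, with no loop over the full set.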
They say optimise for time, so I guess we're safe to abuse space as much as we want.
In that case, you could do an initial pass over the 10000 strings and build a mapping from each unique character present in them to the set of indices of the strings containing it. That way you can ask the mapping: which sets contain the character 'x'? Call this mapping M. (Order: O(nm), where n is the number of strings and m is their maximum length.)
To optimise the time again, you could reduce the stdin input string to its unique characters and put them in a queue, Q (order O(p), where p is the length of the input string).
Start a new set, say S, and initialise it with the index set of the first unique character: S = M(Q.extractNextItem).
Now you could loop over the rest of the unique characters and find which sets contain all of them.
While (Q is not empty) {                  // loops O(p) times
    S = S intersect M(Q.extractNextItem)  // close to O(1), depending on your set implementation
}
voila, return S.
Total time: O(mn + p + p*1) = O(mn + p)
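The intersection loop above can be sketched with `retainAll` (names are mine); note that, like the approach it implements, this finds strings containing all of the input's characters rather than exactly the same distinct set:

```java
import java.util.*;

class CharIndexIntersect {
    // M: for each character, the set of indices of strings containing it.
    static Map<Character, Set<Integer>> buildIndex(String[] strings) {
        Map<Character, Set<Integer>> m = new HashMap<>();
        for (int i = 0; i < strings.length; i++) {
            for (char c : strings[i].toCharArray()) {
                m.computeIfAbsent(c, k -> new HashSet<>()).add(i);
            }
        }
        return m;
    }

    // Intersect the index sets of every character in the input.
    static Set<Integer> query(Map<Character, Set<Integer>> m, String input) {
        Set<Integer> result = null;
        for (char c : input.toCharArray()) {
            Set<Integer> s = m.getOrDefault(c, Collections.emptySet());
            if (result == null) result = new HashSet<>(s);
            else result.retainAll(s);
        }
        return result == null ? Collections.emptySet() : result;
    }
}
```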
(Still early in the morning here, I hope that time analysis was right)
As Bohemian says, a trie tree is definitely the way to go!
This sounds like the way an address book lookup would work on a phone. Start punching digits in, and then filter the address book based on the number representation as well as any of the three (or actually more if using international chars) letters that number would represent.

Building an inverted index in Java - logic

I have a collection of around 1500 documents. I parsed through each document and extracted tokens. These tokens are stored in a HashMap (as keys), and the total number of times each occurs in the collection (i.e. its frequency) is stored as the value.
I have to extend this to build an inverted index. That is, for each term (key): the number of documents it occurs in, and for each such document its DocNo and the term's frequency in that document. For example,
Term       DocFreq   DocNum   TermFreq
data       3         1        12
                     23       31
                     100      17
customer   2         22       43
                     19       2
Currently, I have the following in Java,
HashMap<String, Integer> map
for (each document)
{
    extract line
    for (each line)
    {
        extract word
        for (each word)
        {
            perform some operations
            get value for word from the hashmap and increment it by one
        }
    }
}
I have to build on this code. I can't really think of a good way to implement an inverted index.
So far, I thought of making value a 2D array. So the term would be the key and the value(i.e 2D array) would store the docId and termFreq.
Please let me know if my logic is correct.
I would do it by using a Map<String, TermFrequencies>. This map would maintain a TermFrequencies object for each term found. The TermFrequencies object would have the following methods:
void addOccurrence(String documentId);
int getTotalNumberOfOccurrences();
Set<String> getDocumentIds();
int getNumberOfOccurrencesInDocument(String documentId);
It would use a Map<String, Integer> internally to associate each document the term occurs in with the number of occurrences of the term in the document.
The algorithm would be extremely simple:
for (each document) {
    extract line
    for (each line) {
        extract word
        for (each word) {
            TermFrequencies termFrequencies = map.get(word);
            if (termFrequencies == null) {
                termFrequencies = new TermFrequencies(word);
                map.put(word, termFrequencies);  // don't forget to store the new entry
            }
            termFrequencies.addOccurrence(document);
        }
    }
}
The addOccurrence() method would simply increment a counter for the total number of occurrences, and would insert or update the number of occurrences in the internal map.
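A minimal sketch of such a TermFrequencies class (the internal map and the running counter are assumptions about one reasonable implementation of the interface above):

```java
import java.util.*;

class TermFrequencies {
    private final String term;
    // documentId -> number of occurrences of the term in that document
    private final Map<String, Integer> countsByDocument = new HashMap<>();
    private int totalOccurrences;

    TermFrequencies(String term) { this.term = term; }

    void addOccurrence(String documentId) {
        totalOccurrences++;
        countsByDocument.merge(documentId, 1, Integer::sum);
    }

    String getTerm() { return term; }

    int getTotalNumberOfOccurrences() { return totalOccurrences; }

    Set<String> getDocumentIds() { return countsByDocument.keySet(); }

    int getNumberOfOccurrencesInDocument(String documentId) {
        return countsByDocument.getOrDefault(documentId, 0);
    }
}
```

The document frequency from the table in the question is then just `getDocumentIds().size()`.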
I think it is best to have two structures: a Map<docnum, Map<term,termFreq>> and a Map<term, Set<docnum>>. Your docFreqs can be read off as set.size in the values of the second map. This solution involves no custom classes and allows a quick retrieval of everything needed.
The first map contains all the information, and the second one is a derivative that allows quick lookup by term. As you process a document, you fill the first map. You can derive the second map afterwards, but it is also easy to do in one pass.
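A small sketch of that two-map structure filled in one pass (class and method names are mine):

```java
import java.util.*;

class TwoMapIndex {
    // docnum -> (term -> frequency of the term in that document)
    final Map<Integer, Map<String, Integer>> byDocument = new HashMap<>();
    // term -> set of docnums it occurs in
    final Map<String, Set<Integer>> byTerm = new HashMap<>();

    // Call once per token occurrence; both maps stay in sync.
    void addOccurrence(int docNum, String term) {
        byDocument.computeIfAbsent(docNum, d -> new HashMap<>())
                  .merge(term, 1, Integer::sum);
        byTerm.computeIfAbsent(term, t -> new HashSet<>()).add(docNum);
    }

    // DocFreq is just the size of the term's document set.
    int documentFrequency(String term) {
        return byTerm.getOrDefault(term, Collections.emptySet()).size();
    }
}
```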
I once implemented what you're asking for. The problem with your approach is that it is not abstract enough. You should model Terms, Documents and their relationships using objects. In a first run, create the term index and document objects and iterate over all terms in the documents while populating the term index. Afterwards, you have a representation in memory that you can easily transform into the desired output.
Do not start by thinking about 2d-arrays in an object oriented language. Unless you want to solve a mathematical problem or optimize something it's not the right approach most of the time.
I don't know if this is still a hot question, but I would recommend doing it like this:
Run over all your documents, giving each an id in increasing order. For each document, run over all its words.
Now you have a HashMap that maps Strings (your words) to arrays of DocTermObjects. A DocTermObject contains a docId and a termFrequency.
For each word in a document, look it up in your HashMap. If there is no array of DocTermObjects for it yet, create one; otherwise look only at its very LAST element (this is important for the runtime; think about why). If that element has the docId of the document you are currently processing, increase its termFrequency. Otherwise (or if the array is empty), add a new DocTermObject with the current docId and set its termFrequency to 1.
Later you can use this data structure to compute scores, for example. The scores could also be saved in the DocTermObjects, of course.
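A sketch of that last-element trick (class and field names are mine); it relies on documents being processed in increasing docId order:

```java
import java.util.*;

class DocTermObject {
    final int docId;
    int termFrequency;
    DocTermObject(int docId) { this.docId = docId; this.termFrequency = 1; }
}

class LastElementIndex {
    final Map<String, List<DocTermObject>> index = new HashMap<>();

    // Because docIds only increase, only the LAST entry of a word's list
    // can belong to the document currently being processed.
    void addOccurrence(String word, int docId) {
        List<DocTermObject> postings = index.computeIfAbsent(word, w -> new ArrayList<>());
        DocTermObject last = postings.isEmpty() ? null : postings.get(postings.size() - 1);
        if (last != null && last.docId == docId) {
            last.termFrequency++;
        } else {
            postings.add(new DocTermObject(docId));
        }
    }
}
```

Checking only the last element keeps each occurrence update O(1) instead of scanning the whole posting list.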
Hope it helped :)
