I'm coding an AI that plays the game Word Chain. If you don't know what that is, here's a Wikipedia link.
So I'd like to make my AI better by giving it the ability to access an entire dictionary and search through words based on my parameters. How would I be able to access an entire dictionary in Java, using Eclipse?
Given that the main game rule is that the next word must start with the last letter of the previous word, you definitely want to prepare the data structure upfront and then access it in O(1). Therefore, I would recommend using an array of the alphabet's size (e.g. 26 for English) whose elements are HashSet instances, each representing the bag of words starting with the corresponding letter.
HashSet<String>[] words;
Given such an array, you can immediately access the set of words starting with a given letter (position 0 -> A, position 1 -> B, ...). As an alternative to the array, you can use a HashMap whose keys are letters and whose values are, again, HashSets of possible words.
HashMap<Character, HashSet<String>> words;
So, access is still granted in O(1).
Concerning the HashSets, you want both constant access time and constant removal time: words cannot be repeated during the game, so after using a word you want to drop it from its HashSet.
If your dictionary is small enough (or, from another viewpoint, if you have enough resources), you can prefetch the entire dictionary. If you can't, the proposed structures still adapt well: the array (or the HashMap) is not going to change, and the HashSet also offers constant add time, so you can plan on refilling the HashSets from time to time (e.g. after a given number of remove actions).
In all cases, you can simply take the first element of the HashSet or introduce some randomization; keep in mind that, as a general rule, a HashSet's iteration order is not guaranteed to follow any particular pattern.
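A minimal sketch of this structure, assuming the dictionary arrives as a List<String> (all names here are illustrative):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;

// One HashSet of words per starting letter: O(1) lookup, O(1) removal.
class WordBag {
    private final HashMap<Character, HashSet<String>> words = new HashMap<>();

    WordBag(List<String> dictionary) {
        for (String word : dictionary) {
            char first = Character.toLowerCase(word.charAt(0));
            words.computeIfAbsent(first, k -> new HashSet<>()).add(word);
        }
    }

    // Returns (and removes) some unused word starting with the last letter
    // of the previous word, or null if none is left.
    String nextWord(String previous) {
        char needed = Character.toLowerCase(previous.charAt(previous.length() - 1));
        HashSet<String> candidates = words.get(needed);
        if (candidates == null || candidates.isEmpty()) return null;
        Iterator<String> it = candidates.iterator();
        String word = it.next(); // iteration order is effectively arbitrary
        it.remove();             // words cannot be repeated during the game
        return word;
    }
}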
The source for reading the words can easily be a file or, even better, a bag of files, one per alphabet letter, so that you always know where to read and can cut the overhead of opening, closing, and seeking: you open and close each file exactly once, and your reads are purely sequential.
Finally, in case you also want to propose only words belonging to a given category, you may want to filter out the words not belonging to that category during dictionary prefetching (assuming you have the categories each word belongs to).
If your problem also includes looking for a "related" word at runtime, then you may want to use feature vectors, so that you can still compute correlations in acceptable time during the game.
Try googling "word list". Here's a good one: http://wordlist.aspell.net/
Save one of these as a file and load it into memory with Java. I would be more specific, but how you load it into memory depends on how you want to search the words.
What kind of AI are you trying to build? Is it a learning agent?
From what I understand of 'search through words based on my parameters', I suppose you mean you want to put words into different categories, so that your AI will be able to generate a list of candidate words.
To create a word domain, you can always store your list of words in a HashMap, with the 'parameter' as your key. Since you are trying to store the entire dictionary, why not store the information in a non-relational database (if applicable), so that you don't have to prepare the AI every time you start the game?
Non-relational databases are easy to use from Java. One that I know is easy to configure is Riak. You can see the description and tutorial here: http://basho.com/riak/. Using a non-relational database is similar to searching for things with a 'key word'.
Hope that's what you are asking.
In this case I think you need to communicate with a DICT server (RFC 2229) to get access to an entire dictionary.
This is my code:
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.net.UnknownHostException;
import java.util.Scanner;

public class Dictionary {
    public static void main(String[] args) {
        String host = "dict.org";
        try {
            // DICT servers listen on port 2628
            Socket soc = new Socket(host, 2628);
            OutputStream out = soc.getOutputStream();
            // protocol commands must be terminated by CRLF
            String request = "DEFINE ! yourwordhere\r\n";
            out.write(request.getBytes());
            out.flush();
            soc.shutdownOutput();
            // print the server's response line by line
            InputStream in = soc.getInputStream();
            Scanner s = new Scanner(in);
            while (s.hasNextLine())
                System.out.println(s.nextLine());
            soc.close();
        } catch (UnknownHostException e) {
            System.out.println("Cannot find the host at " + host);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
This way you don't have to do any searching yourself, which reduces the program's execution time.
I have a text file containing ~30,000 words in alphabetical order each on a separate line.
I also have a Set<String> set containing ~10 words.
I want to check if any of the words in my set are in the word list (text file).
So far my method has been to:
Open the word list text file
Read a line/word
Check if set contains that word
Repeat to the end of the word list file
This seems badly optimised. For example, if I'm checking a word in my set that begins with the letter b, I see no point in checking words in the text file beginning with a or c, d, etc.
My proposed solution would be to separate the text file into 26 files, one file for words which start with each letter of the alphabet. Is there a more efficient solution than this?
Note: I know 30,000 words isn't that large a word list but I have to do this operation many times on a mobile device so performance is key.
You can extend your approach of using hash sets to the entire wordlist file. String comparisons are expensive, so it's better to create a HashSet of Integer. You should read the wordlist (assuming the words will not grow from 30,000 to something like 3 million) once in its entirety and save all the words in an Integer HashSet. When adding to the Integer HashSet, use:
wordListHashSet.add(mycurrentword.hashCode());
You have mentioned that you have a set of 10 words that must be checked against the wordlist. Again, instead of a String set, create an Integer HashSet.
Create an iterator of this Integer Hash Set.
Iterator<Integer> it = myTenWordsHashSet.iterator();
Iterate over this in a loop and check for the following condition:
wordListHashSet.contains(it.next());
If this is true, then you have the word in the wordlist.
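A minimal sketch of this approach, with illustrative names; note that two distinct strings can share a hashCode, so this trades a small false-positive risk for memory:

import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Build an Integer set of hash codes for the wordlist, then test each of
// the ten words against it. A hit is a probable (not certain) match.
static Set<String> findMatches(List<String> wordList, Set<String> tenWords) {
    Set<Integer> wordListHashSet = new HashSet<>();
    for (String word : wordList) {
        wordListHashSet.add(word.hashCode());
    }
    Set<String> matches = new HashSet<>();
    for (String word : tenWords) {
        if (wordListHashSet.contains(word.hashCode())) {
            matches.add(word); // subject to the collision caveat above
        }
    }
    return matches;
}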
Using Integer hash sets is a good idea when performance is what you are looking for. Internally, Java computes the hash of each string once and caches it, so repeated access is blazingly fast: almost O(1) per lookup, versus the O(log n) of a binary search over the wordlist.
Hope that helps!
It's probably not worth the hassle for 30,000 words, but let's just say you have a lot more, say 300,000,000 words, and still only 10 words to look for.
In that case, you could do a binary search in the large file for each of the search words, using a RandomAccessFile.
Obviously, each search step would first require you to find the beginning of a word (or of the next word, implementation dependent), which makes it a lot more difficult, and handling all the corner cases exceeds the limit of code one could provide here. But it could still be done, and it would surely be faster than reading through all 300,000,000 words once.
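For what it's worth, here is a simplified sketch of the idea, assuming ASCII text, one word per line, sorted ascending, and glossing over exactly those corner cases (encodings, odd line terminators):

import java.io.IOException;
import java.io.RandomAccessFile;

public class FileWordSearch {
    // Reads the first complete line starting after 'pos' (from the top of
    // the file when pos <= 0). Returns null past the end of the file.
    static String wordAfter(RandomAccessFile f, long pos) throws IOException {
        if (pos <= 0) {
            f.seek(0);
        } else {
            f.seek(pos);
            f.readLine(); // skip the (possibly partial) line we landed in
        }
        return f.readLine();
    }

    // Binary search over byte offsets in the sorted word file.
    static boolean contains(RandomAccessFile f, String target) throws IOException {
        long low = -1, high = f.length();
        while (high - low > 1) {
            long mid = (low + high) / 2;
            String word = wordAfter(f, mid);
            if (word == null || word.compareTo(target) >= 0) {
                high = mid; // first word after mid is >= target (or EOF)
            } else {
                low = mid;  // first word after mid is < target
            }
        }
        // The first word at or after 'high' is the smallest word >= target.
        return target.equals(wordAfter(f, high));
    }
}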
You might consider iterating through your 10-word set (maybe parse it from the file into an array), and for each entry using a binary search algorithm to see if it's contained in the larger list. Binary search only takes O(log n), so in this case log(30,000), which is significantly faster than 30,000 steps.
Since you'll repeat this step once for every word in your set, the whole check takes 10 * log(30k) steps.
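A sketch of that, assuming the file has already been read into an array (it is in alphabetical order already, so no extra sort is needed):

import java.util.Arrays;

// One O(log n) lookup per word; Arrays.binarySearch requires sorted input.
static boolean inWordList(String[] sortedWords, String word) {
    return Arrays.binarySearch(sortedWords, word) >= 0;
}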
You can make some improvements depending on your needs.
If, for example, the file remains unchanged but your 10-word Set changes regularly, then you can load the file into another Set (a HashSet). Now you just need to search for a match in this new Set. This way each search is O(1).
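A minimal sketch of that, with the file path as a parameter:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

// Load the 30,000-word file once; afterwards each contains() call is O(1).
static Set<String> loadWordList(String path) throws IOException {
    Set<String> words = new HashSet<>();
    try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
        String line;
        while ((line = reader.readLine()) != null) {
            words.add(line.trim());
        }
    }
    return words;
}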
I'm doing a wee project (in Java) while uni is out just to test myself and I've hit a stumbling block.
I'm trying to write a program that will read in a text version of a dictionary, store it in a ds (data structure), then ask the user for a random string (preferably a nonsense string, but only letters and -'s, no numbers or other punctuation; I'm not interested in anything else), find all the anagrams of the inputted string, compare them to the dictionary ds, and return a list of all the possible anagrams that are in the dictionary.
Okay, for steps 1 and 2 (reading from the dictionary): when I read everything in, I store it in a Map, where the keys are the letters of the alphabet and the values are ArrayLists storing all the words beginning with that letter.
I'm stuck at finding all the anagrams. I figured out how to calculate the number of possible permutations recursively (proudly), but I'm not sure how to go about actually doing the rearranging.
Is it better to break the string up into chars and play with it that way, or to split it up and keep the elements as strings? I've seen sample code online on different sites, but I don't want to see code; I would like to know the kind of approach/ideas behind developing the solution, as I'm kinda stuck on how to even begin :(
I mean, I think I know how I'm going to go about the comparison to the dictionary ds once I've generated all permutations.
Any advice would be helpful, but not code if that'd be alright, just ideas.
P.S. If you're wanting to see my code so far (for whatever reason), I'll post what I've got.
import java.util.ArrayList;

public class Permutations {
    static String str = "overflow";
    static ArrayList<String> possibilities = new ArrayList<String>();

    public static void main(String[] args) {
        permu(new boolean[str.length()], "");
    }

    // Recursively extends cur with every character of str that hasn't been
    // used yet; complete permutations are collected in possibilities.
    public static void permu(boolean[] used, String cur) {
        if (cur.length() == str.length()) {
            possibilities.add(cur);
            return;
        }
        for (int a = 0; a < str.length(); a++) {
            if (!used[a]) {
                used[a] = true;
                permu(used, cur + str.charAt(a));
                used[a] = false;
            }
        }
    }
}
Simple with a really horrible run-time but it will get the job done.
EDIT: The more advanced version of this is something called a dictionary trie. Basically it's a tree in which each node has 26 children, one for each letter of the alphabet. Each node also has a boolean telling whether or not it is the end of a word. With this you can easily insert words into the dictionary and easily check whether you are even on a correct path for creating a word.
I will paste the code if you would like
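In the meantime, here is a minimal sketch of such a trie, assuming lowercase a-z input:

// Each node has 26 children (one per letter) and an end-of-word flag.
class TrieNode {
    TrieNode[] children = new TrieNode[26];
    boolean isWord;
}

class Trie {
    private final TrieNode root = new TrieNode();

    // Walk (and create) the path for each letter, then mark the last node.
    void insert(String word) {
        TrieNode node = root;
        for (char c : word.toCharArray()) {
            int i = c - 'a';
            if (node.children[i] == null) node.children[i] = new TrieNode();
            node = node.children[i];
        }
        node.isWord = true;
    }

    // True if some dictionary word starts with this prefix; this is what
    // lets the permutation search abandon dead branches early.
    boolean isPrefix(String prefix) {
        TrieNode node = root;
        for (char c : prefix.toCharArray()) {
            node = node.children[c - 'a'];
            if (node == null) return false;
        }
        return true;
    }
}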
Computing the permutations really seems like a bad idea in this case. The word "overflow", for instance, has 8! = 40320 permutations.
A better way to find out whether one word is a permutation of another is to count how many times each letter occurs (giving a 26-tuple) and compare these tuples against each other.
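A sketch of that check, assuming lowercase a-z words:

// Two words are anagrams iff their 26-entry letter counts match; a single
// count array suffices: increment for one word, decrement for the other.
static boolean isAnagram(String a, String b) {
    if (a.length() != b.length()) return false;
    int[] counts = new int[26];
    for (char c : a.toCharArray()) counts[c - 'a']++;
    for (char c : b.toCharArray()) {
        if (--counts[c - 'a'] < 0) return false;
    }
    return true;
}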
It might be helpful if you gave an example to clarify the problem. As I understand it, you are saying that if the user typed in, say, "islent", the program would reply with "listen", "silent", and "enlist".
I think the easiest solution would be to take each word in your dictionary and store it with both the word as entered, and with the word with the letters re-arranged into alphabetical order. Let's call this the "canonical value". Index on the canonical value. Then convert the input into the canonical value, and do a straight search for matches.
To pursue the above example: when we build the dictionary and see the word "listen", we translate it to "eilnst" and store "eilnst -> listen". We'd also store "eilnst -> silent" and "eilnst -> enlist". Then we take the input string, convert it to "eilnst", do a search, and immediately find the three hits.
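A minimal sketch of this index; dictionaryWords is a placeholder for your word source:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// The "canonical value": the word's letters in alphabetical order.
static String canonical(String word) {
    char[] letters = word.toLowerCase().toCharArray();
    Arrays.sort(letters); // "listen" -> "eilnst"
    return new String(letters);
}

// Index every dictionary word under its canonical value.
static Map<String, List<String>> buildIndex(List<String> dictionaryWords) {
    Map<String, List<String>> index = new HashMap<>();
    for (String word : dictionaryWords) {
        index.computeIfAbsent(canonical(word), k -> new ArrayList<>()).add(word);
    }
    return index;
}

// Lookup: index.get(canonical("islent")) -> [listen, silent, enlist]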
Lets say I have 500 words:
Martin
Hopa
Dunam
Golap
Hugnog
Foo
... + 494 more words
I have following text that is about 85KB in total:
Marting went and got him self stuff
from Hopa store and now he is looking
to put it into storage with his best
friend Dunam. They are planing on
using Golap lock that they found in
Hugnog shop in Foo town. >... text continues into several pages
I would like to produce following text:
------- went and got him self stuff
from ---- store and now he is looking
to put it into storage with his best
friend ----. They are planing on
using ---- lock that they found in
------ shop in --- town. >... text continues into several pages
Currently I'm using the Commons method:
String[] words = // all 500 words
String[] masks = // the generated mask for each word
String filteredText = StringUtils.replaceEach(textToBeFiltered, words, masks);
Is there another way to do this that could be more efficient in terms of memory and CPU usage?
What is the best storage for the 500 words: a file, List, enum, array, ...?
How would you gather statistics, such as how many words were replaced and which ones, and, for each word, how many times it was replaced?
I wouldn't care much about CPU and memory usage. They should be relatively small for such a problem and such a volume of text.
What I would do is:
have a Map containing all the words as keys, with the number of times each has been found in the text as its value (initially 0)
read the text word by word, using a StringTokenizer or the String.split() method
for each word, check if the map contains it (an O(1) operation, very quick)
if it does, add "----" to a StringBuilder and increment the value stored for the word in the map (see the sketch below)
else, add the word itself (with a space before it unless it's the first word of the text)
At the end of the process, the StringBuilder contains the result, and the map contains the number of times each word was used as a replacement.
Make sure to initialize the StringBuilder with the length of the original text, in order to avoid too many reallocations.
Should be simple and efficient.
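A sketch of this loop, glossing over punctuation and line breaks (words and text stand in for the 500 words and the 85KB input):

import java.util.HashMap;
import java.util.Map;

// One pass over the text: mask known words and count each replacement.
static String mask(String[] words, String text, Map<String, Integer> counts) {
    for (String w : words) counts.put(w, 0);
    StringBuilder result = new StringBuilder(text.length());
    boolean first = true;
    for (String token : text.split("\\s+")) {
        if (!first) result.append(' ');
        if (counts.containsKey(token)) {
            result.append("----");                    // the mask
            counts.put(token, counts.get(token) + 1); // replacement statistics
        } else {
            result.append(token);
        }
        first = false;
    }
    return result.toString();
}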
I wouldn't care about memory much, but in case you do: a trie is your friend. It's memory efficient for large sets and it allows very efficient matching. You may want to implement it in a compressed fashion.
If I understand the problem correctly, you need to read the 85KB of text and parse out every word (use split or a StringTokenizer). For every word, you need to know whether it is in the set of 500 words, and if so, replace it with the corresponding mask.
If you know you have about 500 words, I'd suggest storing the 500 words and their masks in a HashMap with an initial capacity of about 650 (the JDK docs say hashing is most efficient with a load factor of 0.75). Put the word-mask pairs into the HashMap with a for loop.
The biggest bang for the buck with the HashMap is that the get/put operations (searching for the key) are done in constant time, which is better than the O(n) of an array and even the O(log n) of binary search on a sorted array.
Armed with the HashMap, you can build up a StringBuffer while filtering those 85KB of text.
Return the StringBuffer's toString() from your method and you are done! Regards, - M.S.
PS If you are building the map on a server and doing the filtering somewhere else (on a client) and need to transport the dictionary, note that both HashMap and Hashtable implement Serializable, so either can be shipped across; if everything runs on the same machine, HashMap is the more memory-efficient choice. Later, - M.S.
I am currently trying to create a method that uses a binary tree that finds anagrams of a word inputted by the user.
If the tree does not contain any other anagram for the word (i.e., if the key was not in the tree, or the only element in the associated linked list was the word provided by the user), the message "no anagram found" gets printed.
For example, if key "opst" appears in the tree with an associated linked list containing the words "spot", "pots", and "tops", and the user gave the word "spot", the program should print "pots" and "tops" (but not spot).
public boolean find(K thisKey, T thisElement) {
    return find(root, thisKey, thisElement);
}

// Standard BST lookup: descend left or right according to the key
// comparison, then search the linked list stored at the matching node.
public boolean find(Node current, K thisKey, T thisElement) {
    if (current == null)
        return false;
    int comp = current.key.compareTo(thisKey);
    if (comp > 0)
        return find(current.left, thisKey, thisElement);
    else if (comp < 0)
        return find(current.right, thisKey, thisElement);
    else
        return current.item.find(thisElement);
}
While I created this method to find if the element provided is in the tree (and the associated key), I was told not to reuse this code for finding anagrams.
K is a generic type that extends Comparable and represents the Key, T is a generic type that represents an item in the list.
If extra methods I've done are required, I can edit this post, but I am absolutely lost. (Just need a pointer in the right direction)
It's a little unclear what exactly is tripping you up (beyond "I've written a nice find method but am not allowed to use it."), so I think the best thing to do is start from the top.
I think you will find that once you get your data structured in just the right way, the actual algorithms will follow relatively easily (many computer science problems share this feature.)
You have three things:
1) Many linked lists, each of which contains the set of anagrams of some set of letters. I am assuming you can generate these lists as you need to.
2) A binary tree that maps Strings (keys) to lists of anagrams generated from those strings. Again, I'm assuming that you are able to perform basic operations on these trees: adding elements, finding elements by key, etc.
3) A user-inputted String.
Insight: The anagrams of a group of letters form an equivalence class. This means that any member of an anagram list can be used as the key associated with the list. Furthermore, it means that you don't need to store in your tree multiple keys that point to the same list (provided that we are a bit clever about structuring our data; see below).
In concrete terms, there is no need to have both "spot" and "opts" as keys in the tree pointing to the same list, because once you can find the list using any anagram of "spot", you get all the anagrams of "spot".
Structuring your data cleverly: Given our insight, assume that our tree contains exactly one key for each unique set of anagrams. So "opts" maps to {"opts", "pots", "spot", etc.}. What happens if our user gives us a String that we're not using as the key for its set of anagrams? How do we figure out that if the user types "spot", we should find the list that is keyed by "opts"?
The answer is to normalize the data stored in our data structures. This is a computer-science-y way of saying that we enforce arbitrary rules about how we store the data. (Normalizing data is a useful technique that appears repeatedly in many different computer science domains.) The first rule is that we only ever have ONE key in our tree that maps to a given linked list. Second, what if we make sure that each key we actually store is predictable; that is, we know that we should search for "opts" even if the user types "spot"?
There are many ways to achieve this predictability; one simple way is to make sure that the letters of every key are in alphabetical order. Then we know that every set of anagrams will be keyed by the (unique!) member of the set that comes first in alphabetical order. Consistently enforcing this rule makes it easy to search the tree: we know that no matter what string the user gives us, the key we want is the string formed by alphabetizing the user's input.
Putting it together: I'll provide the high-level algorithm here to make this a little more concrete (a small code sketch follows the steps).
1) Get a String from the user (hold on to this String, we'll need it later)
2) Turn this string into a search key that follows our normalization scheme
(You can do this in the constructor of your "K" class, which ensures that you will never have a non-normalized key anywhere in your program.)
3) Search the tree for that key, and get the linked list associated with it. This list contains every anagram of the user's input String.
4) Print every item in the list that isn't the user's original string (see why we kept the string handy?)
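A minimal sketch of those four steps; a Map stands in for the binary tree here, so swap in whatever lookup your tree class exposes:

import java.util.Arrays;
import java.util.List;
import java.util.Map;

static void printAnagrams(String input, Map<String, List<String>> tree) {
    String key = normalize(input);          // 2) normalized search key
    List<String> anagrams = tree.get(key);  // 3) every anagram of the input
    boolean found = false;
    if (anagrams != null) {
        for (String word : anagrams) {      // 4) print all but the input itself
            if (!word.equals(input)) {
                System.out.println(word);
                found = true;
            }
        }
    }
    if (!found) System.out.println("no anagram found");
}

// The normalization scheme: letters in alphabetical order ("spot" -> "opst").
static String normalize(String s) {
    char[] letters = s.toCharArray();
    Arrays.sort(letters);
    return new String(letters);
}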
Takeaways:
Frequently, your data will have some special features that allow you to be clever. In this case it is the fact that any member of an anagram list can be the sole key we store for that list.
Normalizing your data gives you predictability and allows you to reason about it effectively. How much more difficult would the "find" algorithm be if each key could be an arbitrary member of its anagram list?
Corollary: Getting your data structures exactly right (What am I storing? How are the pieces connected? How is it represented?) will make it much easier to write your algorithms.
What about sorting the characters in the words, and then comparing those?
I'm programming a java application that reads strictly text files (.txt). These files can contain upwards of 120,000 words.
The application needs to store all 120,000+ words. It needs to name them word_1, word_2, etc. And it also needs to access these words to perform various methods on them.
The methods all have to do with Strings. For instance, a method will be called to say how many letters are in word_80. Another method will be called to say what specific letters are in word_2200.
In addition, some methods will compare two words. For instance, a method will be called to compare word_80 with word_2200 and needs to return which has more letters. Another method will be called to compare word_80 with word_2200 and needs to return what specific letters both words share.
My question is: Since I'm working almost exclusively with Strings, is it best to store these words in one large ArrayList? Several small ArrayLists? Or should I be using one of the many other storage possibilities, like Vectors, HashSets, LinkedLists?
My two primary concerns are 1.) access speed, and 2.) having the greatest possible number of pre-built methods at my disposal.
Thank you for your help in advance!!
Wow! Thanks everybody for providing such a quick response to my question. All your suggestions have helped me immensely. I’m thinking through and considering all the options provided in your feedback.
Please forgive me for any fuzziness; and let me address your questions:
Q) English?
A) The text files are actually books written in English. The occurrence of a word in a second language would be rare – but not impossible. I’d put the percentage of non-English words in the text files at .0001%
Q) Homework?
A) I’m smilingly looking at my question’s wording now. Yes, it does resemble a school assignment. But no, it’s not homework.
Q) Duplicates?
A) Yes. And probably every five or so words, considering conjunctions, articles, etc.
Q) Access?
A) Both random and sequential. It’s certainly possible a method will locate a word at random. It’s equally possible a method will want to look for a matching word between word_1 and word_120000 sequentially. Which leads to the last question…
Q) Iterate over the whole list?
A) Yes.
Also, I plan on growing this program to perform many other methods on the words. I apologize again for my fuzziness. (Details do make a world of difference, do they not?)
Cheers!
I would store them in one large ArrayList and worry about (possibly unnecessary) optimisations later on.
Being inherently lazy, I don't think it's a good idea to optimise unless there's a demonstrated need. Otherwise, you're just wasting effort that could be better spent elsewhere.
In fact, if you can set an upper bound on your word count and you don't need any of the fancy List operations, I'd opt for a normal (native) array of String objects with an integer holding the actual count. This is likely to be faster than a class-based approach.
This gives you the greatest speed in accessing the individual elements whilst still retaining the ability to do all that wonderful string manipulation.
Note I haven't benchmarked native arrays against ArrayLists. They may be just as fast as native arrays, so you should check this yourself if you have less blind faith in my abilities than I do :-).
If they do turn out to be just as fast (or even close), the added benefits (expandability, for one) may be enough to justify their use.
Just confirming pax's assumptions, with a very naive benchmark:
import java.util.ArrayList;
import java.util.Random;

public class AccessBenchmark {
    public static void main(String[] args) {
        int size = 120000;
        String[] arr = new String[size];
        ArrayList<String> al = new ArrayList<String>(size);
        // fill both structures with the same dummy strings
        for (int i = 0; i < size; i++) {
            String put = Integer.toHexString(i);
            al.add(put);
            arr[i] = put;
        }

        Random rand = new Random();

        // time 10,000,000 random reads from the native array
        long start = System.currentTimeMillis();
        for (int i = 0; i < 10000000; i++) {
            String fetch = arr[rand.nextInt(size)];
        }
        System.out.println("array access took "
                + (System.currentTimeMillis() - start) + " ms");

        // time 10,000,000 random reads from the ArrayList
        start = System.currentTimeMillis();
        for (int i = 0; i < 10000000; i++) {
            String fetch = al.get(rand.nextInt(size));
        }
        System.out.println("array list access took "
                + (System.currentTimeMillis() - start) + " ms");
    }
}
and the output:
array access took 578 ms
array list access took 907 ms
Running it a few times, the actual times vary somewhat, but generally array access is between 200 and 400 ms faster, over 10,000,000 iterations.
If you will access these Strings sequentially, the LinkedList would be the best choice.
For random access, ArrayLists have a nice memory usage/access speed tradeoff.
My take:
For a non-threaded program, an ArrayList is always fastest and simplest.
For a threaded program, a java.util.concurrent.ConcurrentHashMap<Integer,String> or java.util.concurrent.ConcurrentSkipListMap<Integer,String> is awesome. Perhaps you would later like to allow threads so as to make multiple queries against this huge thing simultaneously.
If you're going for fast traversal as well as compact size, use a DAWG (Directed Acyclic Word Graph.) This data structure takes the idea of a trie and improves upon it by finding and factoring out common suffixes as well as common prefixes.
http://en.wikipedia.org/wiki/Directed_acyclic_word_graph
Use a Hashtable? This will give you your best lookup speed.
ArrayList/Vector if order matters (it appears to, since you are calling the words "word_xxx"), or HashTable/HashMap if it doesn't.
I'll leave the exercise of figuring out why you would want to use an ArrayList vs. a Vector or a HashTable vs. a HashMap up to you since I have a sneaking suspicion this is your homework. Check the Javadocs.
None of the Collections Framework classes will give you methods that help with the examples you've asked about above, since none of them do String comparison operations. Unless you just want to order the words alphabetically or something, in which case you'd use one of the Tree implementations in the Collections Framework.
How about a radix tree or Patricia trie?
http://en.wikipedia.org/wiki/Radix_tree
The only advantage of a linked list over an array or array list would be if there are insertions and deletions at arbitrary places. I don't think this is the case here: You read in the document and build the list in order.
I THINK that when the original poster talked about finding "word_2200", he meant simply the 2200th word in the document, and not that there are arbitrary labels associated with each word. If so, then all he needs is indexed access to all the words. Hence, an array or array list. If there really is something more complex, if one word might be labeled "word_2200" and the next word is labeled "foobar_42" or some such, then yes, he'd need a more complex structure.
Hey, do you want to give us a clue WHY you want to do any of this? I'm hard pressed to remember the last time I said to myself, "Hey, I wonder if the 1,237th word in this document I'm reading is longer or shorter than the 842nd word?"
Depends on what the problem is - speed or memory.
If it's memory, the minimum solution is to write a function getWord(n) which scans the whole file each time it runs, and extracts word n.
Now, that's not a very good solution. A better solution is to decide how much memory you want to use: let's say 1000 items. Scan the file for words once when the app starts, and store a series of bookmarks containing the word number and the position in the file where it is located; do this in such a way that the bookmarks are more or less evenly spaced through the file.
Then, open the file for random access. The function getWord(n) now looks at the bookmarks to find the biggest word number <= n (please use a binary search), seeks to the indicated location, and scans the file, counting words, to find the requested word.
An even quicker solution, using rather more memory, is to build some sort of cache for the blocks, on the basis that getWord() requests usually come in clusters. You can rig things up so that if someone asks for word #X and it's not in the bookmarks, you seek to it and add it to the bookmarks, saving memory by discarding whichever bookmark was least recently used.
And so on. It depends, really, on what the problem is and on what kinds of retrieval patterns are likely.
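A sketch of the bookmark scheme, assuming whitespace-separated words; all names here are illustrative:

import java.io.IOException;
import java.io.RandomAccessFile;

class BookmarkedWordFile {
    private final RandomAccessFile file;
    private final int[] bookmarkWordNum; // word number at bookmark i
    private final long[] bookmarkPos;    // file offset where that word starts

    BookmarkedWordFile(RandomAccessFile file, int[] nums, long[] positions) {
        this.file = file;
        this.bookmarkWordNum = nums;
        this.bookmarkPos = positions;
    }

    String getWord(int n) throws IOException {
        // Binary search for the largest bookmark with word number <= n.
        int lo = 0, hi = bookmarkWordNum.length - 1, best = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (bookmarkWordNum[mid] <= n) { best = mid; lo = mid + 1; }
            else hi = mid - 1;
        }
        file.seek(bookmarkPos[best]);
        // Scan forward from the bookmark, counting words, until word n.
        int count = bookmarkWordNum[best];
        StringBuilder word = new StringBuilder();
        int c;
        while ((c = file.read()) != -1) {
            if (Character.isWhitespace(c)) {
                if (word.length() > 0) {
                    if (count == n) return word.toString();
                    count++;
                    word.setLength(0);
                }
            } else {
                word.append((char) c);
            }
        }
        return (word.length() > 0 && count == n) ? word.toString() : null;
    }
}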
I don't understand why so many people are suggesting an ArrayList or the like, since you don't mention ever having to iterate over the whole list. Furthermore, it seems you want to access the words as key/value pairs ("word_348" -> "pedantic").
For the fastest access, I would use a TreeMap, which will do binary searches to find your keys. Its only downside is that it's unsynchronized, but that's not a problem for your application.
http://java.sun.com/javase/6/docs/api/java/util/TreeMap.html