How to remove duplicate letters while preserving the smallest lexicographical order - java

I have a task to remove duplicates from a given string (a classic interview question), but this one is a bit different: the end result should be the lexicographically smallest among all possible results. For example, cbacdcbc => acdb, bcabc => abc. I saw several related problems on SO, but I could not find the answer.
Edit: Here is my code so far (not working properly):
public static String removeDuplicateCharsAlphbetically(String str) {
int len = str.length();
if (len<2) return str;
char[] letters = str.toCharArray();
int[] counts = new int[26];
for (char c : letters) {
counts[c-97]++;
}
StringBuilder sb = new StringBuilder();
for (int i=0;i<len-1;i++) {
if (letters[i]==letters[i+1]) continue;
if (counts[letters[i]-97]==1) {
sb.append(letters[i]);
} else if (counts[letters[i]-97] != 0) {
if (letters[i]<letters[i+1] && counts[letters[i]-97] == 1) {
sb.append(letters[i]);
counts[letters[i]-97]=0;
} else {
counts[letters[i]-97]--;
}
}
}
return sb.toString();
}
EDIT2: I am sorry for not putting link of the question earlier. here is the link:

This can be done in O(N) time, where N is the length of the string.
Time complexity: O(N), a single scan of the string.
Space complexity: O(26)
Solution:
Create an array, indexed by letter, that stores pointers to doubly linked list nodes.
ListNode* array[26]; // Initialized with NULL value.
Create an empty linked list, which will represent the solution string at any point in time, in reverse order.
Scan the string and, for each letter ltr, check array[ltr-'a']:
a.) If it is NULL, this is the first occurrence: add the letter to the linked list.
b.) If it points to a node listNodeLtr, the letter is already in the result.
i. Check the value of the node preceding listNodeLtr in the linked list.
If listNodeLtr->prev->val < listNodeLtr->val, removing the node from this position will make the result lexicographically smaller.
So remove listNodeLtr from its current position in the linked list and add it to the end.
Else, the current position of ltr is fine; continue.
cbacdcbc
[a]->[b]->[c]
cbacdcbc
[c]->[a]->[b]
cbacdcbc
[d]->[c]->[a]->[b]
cbacdcbc
[b]->[d]->[c]->[a]
cbacdcbc
[b]->[d]->[c]->[a]
print linklist in reverse order : acdb.

First, let's create a set of all distinct letters of the string s. The size of this set is the length of the answer and the number of steps in our algorithm.
We will add one letter to the answer on each step with the following greedy approach:
On every step iterate through the remaining letters in alphabetical order and for every letter l:
Find the first occurrence of l in s. Let's name it lpos.
If the substring s[lpos, end] contains all remaining letters then add l to the result, replace s with s[lpos+1, end] and go to the next step with reduced remaining letters set.
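The steps above can also be written down directly, without any optimizations. The following is a minimal sketch under that reading of the description; the class name GreedyDedup and the helper containsAll are hypothetical names chosen for illustration, and the code favors readability over speed.

```java
import java.util.TreeSet;

class GreedyDedup {
    // Unoptimized, step-by-step version of the greedy approach described above.
    static String removeDuplicateLetters(String s) {
        // Remaining letters, iterated in alphabetical order.
        TreeSet<Character> remaining = new TreeSet<>();
        for (char c : s.toCharArray()) {
            remaining.add(c);
        }
        StringBuilder result = new StringBuilder();
        while (!remaining.isEmpty()) {
            char chosen = 0;
            for (char l : remaining) {
                int lpos = s.indexOf(l); // first occurrence of l in s
                if (containsAll(s.substring(lpos), remaining)) {
                    chosen = l;
                    s = s.substring(lpos + 1); // continue to the right of lpos
                    break;
                }
            }
            result.append(chosen);
            remaining.remove(chosen);
        }
        return result.toString();
    }

    // True if every letter in the set occurs somewhere in the given suffix.
    static boolean containsAll(String suffix, TreeSet<Character> letters) {
        for (char c : letters) {
            if (suffix.indexOf(c) < 0) {
                return false;
            }
        }
        return true;
    }
}
```

Each step rescans the remaining string for every candidate letter, so this version does more work per step than necessary.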
Implementation with some optimizations to achieve better time complexity:
public String removeDuplicateLetters(String s) {
StringBuilder result = new StringBuilder();
int[] subsets = new int[s.length()];
int subset = 0;
for (int i = s.length() - 1; i >= 0; i--) {
char ch = s.charAt(i);
subset = addToSet(subset, ch);
subsets[i] = subset;
}
int curPos = 0;
while (subset != 0) {
for (char ch = 'a'; ch <= 'z'; ++ch) {
if (inSet(subset, ch)) {
int chPos = s.indexOf(ch, curPos);
if (includes(subsets[chPos], subset)) {
result.append(ch);
subset = removeFromSet(subset, ch);
curPos = chPos + 1;
break;
}
}
}
}
return result.toString();
}
private boolean inSet(int set, char ch) {
return (set & (1 << (ch - 'a'))) != 0;
}
private boolean includes(int set, int subset) {
return (set | subset) == set;
}
private int addToSet(int set, char ch) {
return set | (1 << (ch - 'a'));
}
private int removeFromSet(int set, char ch) {
return set & ~(1 << (ch - 'a'));
}
Runnable version: https://ideone.com/wIKi3x

Observation 1: the first letter of the output is the least letter such that all other letters appear to the right of its leftmost appearance in the string.
Observation 2: the remaining letters of the output are a subsequence of the letters to the right of the leftmost appearance of the first letter.
This suggests a recursive algorithm.
def rem_dups_lex_least(s):
    if not s:
        return ''
    n = len(set(s))  # number of distinct letters in s
    seen = set()  # distinct letters seen while scanning right to left
    for j in range(len(s) - 1, -1, -1):  # len(s)-1 down to 0
        seen.add(s[j])
        if len(seen) == n:  # all letters seen
            first = min(s[:j+1])
            i = s.index(first)  # leftmost appearance
            return first + rem_dups_lex_least(''.join(c for c in s[i+1:] if c != first))

Build the result by going backwards from end of the input to start. On each step:
If new letter is encountered, prepend it to result.
If duplicate is encountered, then compare it with the head of result. If head is greater, remove duplicate from the result and prepend it instead.
LinkedHashSet is good both for storing result set and its internal order.
public static String unduplicate(String input) {
Character head = null;
Set<Character> set = new LinkedHashSet<>();
for (int i = input.length() - 1; i >= 0; --i) {
Character c = input.charAt(i);
if (set.add(c))
head = c;
else if (c.compareTo(head) < 0) {
set.remove(c);
set.add(head = c);
}
}
StringBuilder result = new StringBuilder(set.size());
for (Character c: set)
result.append(c);
return result.reverse().toString();
}

Related

How to loop over arraylist and sorting based on starting character

This was my assignment:
Write and test a method that takes a List words, which contains Strings of alphabetic characters, throws them into 26 "buckets", according to the first letter, and returns the ArrayList of buckets. Each bucket should be represented by an ArrayList. The first bucket should contain all the Strings from words that start with an 'a', in the same order as they appear in words; the second bucket should contain all the Strings that start with a 'b'; and so on. Your method should traverse the list words only once and leave it unchanged.
So far my code looks like this:
public static ArrayList<ArrayList<String>> bucketSort(ArrayList<String> arr) {
// Generate 26 bucket list
ArrayList<ArrayList<String>> bucket = new ArrayList<ArrayList<String>>(26);
for (int i = 0; i < 26; i++) {
bucket.add(new ArrayList<String>());
}
// Sort by char
for (String v : arr) {
for (int c=97; c <=122; c++) {
if (v.startsWith(String.valueOf((char) c))) {
bucket.get(97-c).add(v.toUpperCase());
}
}
}
return bucket;
}
The only problem is that I'm getting an IndexOutOfBounds error and I'm not sure why. The 97 and 122 refer to the chars 'a' - 'z'. I think that's correct, because (int) 'a' gives me 97 and (int) 'z' gives me 122. I'm confused about why I'm getting index out of bounds.
The IndexOutOfBounds is because you have your subtraction backwards by using 97-c instead of c-97.
You could also iterate by chars to simplify your code:
public static ArrayList<ArrayList<String>> bucketSort(ArrayList<String> arr) {
// Generate 26 bucket list
ArrayList<ArrayList<String>> bucket = new ArrayList<ArrayList<String>>(26);
for (int i = 0; i < 26; i++) {
bucket.add(new ArrayList<String>());
}
// Sort by char
for (String v : arr) {
for (char c = 'a'; c <= 'z'; c++) {
if (v.startsWith(String.valueOf(c)) || v.startsWith(String.valueOf(Character.toUpperCase(c)))) {
bucket.get(c - 'a').add(v.toUpperCase());
}
}
}
return bucket;
}
You are getting the error because of the line bucket.get(97-c).add(v.toUpperCase());. In this line you are trying to access the ArrayList at index 97-c, but if the char is 'b', the value of c is 98 and 97-c resolves to 97-98 = -1, which throws an IndexOutOfBoundsException.
You will have to modify that line to
bucket.get(c-97).add(v.toUpperCase());
This way, if the starting letter is "a", c-97 resolves to 97-97, which equals 0, and you get bucket.get(0); if the starting letter is "f", it resolves to 102-97 and you get bucket.get(5).
Try swapping c and 97 in the subtraction.
If c=97 then 97-97=0,
If c=122 then 122-97=25.
Also, you do not process words which begin with an uppercase letter. If you want only 26 case-insensitive buckets, convert to lower case before searching:
public static ArrayList<ArrayList<String>> bucketSort(ArrayList<String> arr) {
// Generate 26 bucket list
ArrayList<ArrayList<String>> bucket = new ArrayList<ArrayList<String>>(26);
for (int i = 0; i < 26; i++) {
bucket.add(new ArrayList<String>());
}
// Sort by char
for (String v : arr) {
for (int c=97; c <=122; c++) {
if (v.toLowerCase().startsWith(String.valueOf((char) c))) {
bucket.get(c-97).add(v.toUpperCase());
}
}
}
return bucket;
}
for (String v : arr) {
for (int c=97; c <=122; c++) {
if (v.startsWith(String.valueOf((char) c))) {
bucket.get(97-c).add(v.toUpperCase());
}
}
}
There is no point in looping over all characters here: a string starts with one character, a fixed character; so just check that.
for (String v : arr) {
if (v.isEmpty()) continue;
char c = v.charAt(0);
if (c < 'a' || c > 'z') continue;
int bucketIndex = c - 'a';
bucket.get(bucketIndex).add(v.toUpperCase());
}
Note that bucketIndex is c - 'a' (or c - 97), not 97 - c, since the bucket for 'z' needs to be 25 ('z' - 'a'), not -25.
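Putting the single-pass check together with the bucket setup gives a complete method. This is a sketch under two assumptions that the assignment text does not settle: the first letter is matched case-insensitively, and the words are stored unchanged rather than upper-cased. The names FirstLetterBuckets and bucketByFirstLetter are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;

class FirstLetterBuckets {
    // Traverses `words` exactly once and never modifies it.
    static List<List<String>> bucketByFirstLetter(List<String> words) {
        List<List<String>> buckets = new ArrayList<>(26);
        for (int i = 0; i < 26; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String w : words) {
            if (w.isEmpty()) {
                continue; // nothing to classify
            }
            // Assumption: 'A' and 'a' go into the same bucket.
            char c = Character.toLowerCase(w.charAt(0));
            if (c < 'a' || c > 'z') {
                continue; // skip words that do not start with a letter a-z
            }
            buckets.get(c - 'a').add(w); // 'a' -> 0, ..., 'z' -> 25
        }
        return buckets;
    }
}
```

Because each word is appended to its bucket as it is encountered, the order inside every bucket matches the order in the input list.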

Generating all permutations of a certain length

Suppose we have an alphabet "abcdefghiklimnop". How can I recursively generate permutations with repetition of this alphabet in groups of FIVE in an efficient way?
I have been struggling with this a few days now. Any feedback would be helpful.
Essentially this is the same as: Generating all permutations of a given string
However, I just want the permutations in lengths of FIVE of the entire string. And I have not been able to figure this out.
So, for all substrings of length 5 of "abcdefghiklimnop", find the permutations of the substring. For example, if the substring was abcdef, I would want all of the permutations of that, or if the substring was defli, I would want all of the permutations of that substring. The code below gives me all permutations of a string, but I would like to use it to find all permutations of all substrings of size 5 of a string.
public static void permutation(String str) {
permutation("", str);
}
private static void permutation(String prefix, String str) {
int n = str.length();
if (n == 0) System.out.println(prefix);
else {
for (int i = 0; i < n; i++)
permutation(prefix + str.charAt(i), str.substring(0, i) + str.substring(i+1, n));
}
}
In order to pick five characters from a string recursively, follow a simple algorithm:
Your method should get a portion filled in so far, and the first position in the five-character permutation that needs a character
If the first position that needs a character is above five, you are done; print the combination that you have so far, and return
Otherwise, put each character into the current position in the permutation, and make a recursive call
This is a lot shorter in Java:
private static void permutation(char[] perm, int pos, String str) {
if (pos == perm.length) {
System.out.println(new String(perm));
} else {
for (int i = 0 ; i < str.length() ; i++) {
perm[pos] = str.charAt(i);
permutation(perm, pos+1, str);
}
}
}
The caller controls the desired length of permutation by changing the number of elements in perm:
char[] perm = new char[5];
permutation(perm, 0, "abcdefghiklimnop");
All permutations of five characters will be contained in the set of the first five characters of every permutation. For example, if you want all two character permutations of a four character string 'abcd' you can obtain them from all permutations:
'abcd', 'abdc', 'acbd','acdb' ... 'dcba'
So instead of printing them in your method you can store them to a list after checking to see if that permutation is already stored. The list can either be passed in to the function or a static field, depending on your specification.
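A minimal sketch of that idea, assuming k is no larger than the string length: the question's recursive method is kept, but each finished permutation is truncated to its first k characters and stored in a set, which takes care of the "already stored" check. The names PermutationPrefixes, prefixesOfPermutations and collect are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class PermutationPrefixes {
    // Returns the distinct k-character prefixes of all permutations of str.
    // Assumption: 0 <= k <= str.length().
    static Set<String> prefixesOfPermutations(String str, int k) {
        Set<String> out = new LinkedHashSet<>();
        collect("", str, k, out);
        return out;
    }

    private static void collect(String prefix, String rest, int k, Set<String> out) {
        if (rest.isEmpty()) {
            // A full permutation is complete; keep only its first k characters.
            out.add(prefix.substring(0, k));
            return;
        }
        for (int i = 0; i < rest.length(); i++) {
            collect(prefix + rest.charAt(i),
                    rest.substring(0, i) + rest.substring(i + 1),
                    k, out);
        }
    }
}
```

This still enumerates every full permutation, so the running time grows factorially with the string length; it is only practical for short inputs such as 'abcd'.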
class StringPermutationOfKLength
{
// The main recursive method
// to print all possible
// strings of length k
static void printAllKLengthRec(char[] set,String prefix,
int n, int k)
{
// Base case: k is 0,
// print prefix
if (k == 0)
{
System.out.println(prefix);
return;
}
// One by one add all characters
// from set and recursively
// call for k equals to k-1
for (int i = 0; i < n; i++)
{
// Next character of input added
String newPrefix = prefix + set[i];
// k is decreased, because
// we have added a new character
printAllKLengthRec(set, newPrefix,
n, k - 1);
}
}
// Driver Code
public static void main(String[] args)
{
System.out.println("First Test");
char[] set1 = {'a', 'b','c', 'd'};
int k = 2;
printAllKLengthRec(set1, "", set1.length, k);
System.out.println("\nSecond Test");
char[] set2 = {'a', 'b', 'c', 'd'};
k = 1;
printAllKLengthRec(set2, "", set2.length, k);
}
}
This can be easily done using bit manipulation.
private void getPermutation(String str, int length)
{
if(str==null)
return;
Set<String> StrList = new HashSet<String>();
StringBuilder strB= new StringBuilder();
for(int i = 0;i < (1 << str.length()); ++i)
{
strB.setLength(0); //clear the StringBuilder
if(getNumberOfOnes(i)==length){
for(int j = 0;j < str.length() ;++j){
if((i & (1 << j))>0){ // to check whether jth bit is set (is 1 or not)
strB.append(str.charAt(j));
}
}
StrList.add(strB.toString());
}
}
System.out.println(Arrays.toString(StrList.toArray()));
}
private int getNumberOfOnes (int n) // to count how many numbers of 1 in binary representation of n
{
int count=0;
while( n>0 )
{
n = n&(n-1);
count++;
}
return count;
}

How to create an array that contains 26 english letters but the order to be set by an input

I want to know how to create an array that contains 26 english letters but the order of them to be: e.g.
INPUT: problem
and the array would be:
'p','r','o','b','l','e','m','a','c','d','f','g','h','i','j','k','n','q','s','t','u','v','w','x','y','z'.
I tried to do it but I couldn't.
My code is here:
import javax.swing.JOptionPane;
public class TabelaEShkronjave {
public static void main(String[] args) {
char[][] square = new char[26][26];
/*char[] fjalaKyqe = {'p','r','o','b','l','e','m','a','c','d','f',
'g','h','i','j','k','n','q','s','t','u','v','w','x','y','z'};
*/
String word = JOptionPane.showInputDialog("Write a word: ");
char[] wordArray = word.toCharArray();
char[] alphabet = "abcdefghijklmnopqrstuvwxyz".toCharArray();
}
}
First off, you need to guard against bad input, such as non-letters, repeated letters, and uppercase vs. lowercase letters.
One way to build the desired result is to rely on behavior of LinkedHashSet, which will ignore duplicate inserts, so if we first add the letters of the input text, then all letters of alphabet, duplicates will be eliminated for us. The main problem is that the Set has to work with boxed Character objects, not plain char values.
private static char[] wordPrefixedAlphabet(String word) {
Set<Character> letters = new LinkedHashSet<>();
for (char c : word.toLowerCase().toCharArray())
if (c >= 'a' && c <= 'z')
letters.add(c);
for (char c = 'a'; c <= 'z'; c++)
letters.add(c);
char[] alphabet = new char[26];
int i = 0;
for (char c : letters)
alphabet[i++] = c;
return alphabet;
}
Another way is to keep track of which letters have already been added, using a boolean[26]:
private static char[] wordPrefixedAlphabet(String word) {
boolean[] used = new boolean[26];
char[] alphabet = new char[26];
int i = 0;
for (char c : word.toLowerCase().toCharArray())
if (c >= 'a' && c <= 'z' && ! used[c - 'a']) {
used[c - 'a'] = true;
alphabet[i++] = c;
}
for (char c = 'a'; c <= 'z'; c++)
if (! used[c - 'a'])
alphabet[i++] = c;
return alphabet;
}
Testing both with the input "That is NOT a problem!!" produces:
[t, h, a, i, s, n, o, p, r, b, l, e, m, c, d, f, g, j, k, q, u, v, w, x, y, z]
You could:
create an array, the size of the alphabet
copy into the array the characters of the word
selectively copy the characters of the alphabet not yet used
Something like this:
char[] wordArray = word.toCharArray();
char[] alphabet = "abcdefghijklmnopqrstuvwxyz".toCharArray();
char[] target = new char[alphabet.length];
System.arraycopy(wordArray, 0, target, 0, wordArray.length);
boolean[] used = new boolean[alphabet.length];
for (char c : wordArray) {
used[c - 'a'] = true;
}
for (int k = 0, t = wordArray.length; t < target.length; ++k) {
char c = alphabet[k];
int pos = c - 'a';
if (!used[pos]) {
target[t++] = c;
}
}
Pseudocode:
Get the length of the input string
Does the input string contain any duplicate characters?
Tip: Use stringName.charAt(i) and a for loop to test individual characters in string
If it doesn't contain any duplicate characters
For loop through string length (i)
For loop through alphabet length (j)
Find array position of stringName.charAt(i) in the alphabet array (j)
Swap this array character with current loop position (i).
(so 'a' and 'p' in problem swap)
Break
For loop through the alphabet array swapped out starting at string length (i)
if the (int) character is less than the next (int) character in loop
swap them and set (i) to string length to restart the for loop
Else
Print an error saying they don't have all unique characters in input
While this isn't the most efficient way to do it, it doesn't require any outside classes besides what you're already given, so it is useful if you are a beginner and need more practice in array manipulation and for loops.
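One possible Java rendering of the pseudocode above is sketched here. It deviates from the pseudocode in three labeled ways: the duplicate check uses a boolean[26] instead of nested charAt loops, an exception is thrown instead of printing an error, and the hand-written swap loop that sorts the tail is replaced by Arrays.sort on that sub-range. The names KeyedAlphabet and keyedAlphabet are made up for this example.

```java
import java.util.Arrays;

class KeyedAlphabet {
    static char[] keyedAlphabet(String word) {
        char[] alphabet = "abcdefghijklmnopqrstuvwxyz".toCharArray();

        // Reject input with repeated or non-lowercase characters.
        boolean[] seen = new boolean[26];
        for (char c : word.toCharArray()) {
            if (c < 'a' || c > 'z' || seen[c - 'a']) {
                throw new IllegalArgumentException(
                        "input must consist of unique lowercase letters");
            }
            seen[c - 'a'] = true;
        }

        // Swap each letter of the word into position i of the alphabet array.
        for (int i = 0; i < word.length(); i++) {
            for (int j = i; j < alphabet.length; j++) {
                if (alphabet[j] == word.charAt(i)) {
                    char tmp = alphabet[i];
                    alphabet[i] = alphabet[j];
                    alphabet[j] = tmp;
                    break;
                }
            }
        }

        // The swaps scramble the tail, so restore alphabetical order there.
        Arrays.sort(alphabet, word.length(), alphabet.length);
        return alphabet;
    }
}
```

The inner search can start at j = i because positions before i already hold earlier letters of the word, which are all different from the current one.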
After you read the input and got an array of chars from the input you can use the following logic to achieve desired result:
boolean contains(char[] arr, char value) {
if (arr == null) {
return false;
}
for (char c : arr) {
if (c == value) {
return true;
}
}
return false;
}
...
char[] myOrderedAlphabet = new char[26];
int alphabetPosition = 0;
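for (char c : wordArray) {
if (!contains(myOrderedAlphabet, c)) {
myOrderedAlphabet[alphabetPosition] = c;
alphabetPosition++;
}
}
for (char c = 'a'; c <= 'z'; c++) {
if (!contains(myOrderedAlphabet, c)) {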
myOrderedAlphabet[alphabetPosition] = c;
alphabetPosition++;
}
}
Please note that this snippet does not check whether the characters from wordArray are alphabetic. Hence, if the input contains characters that are not lowercase Latin letters, this code can throw an ArrayIndexOutOfBoundsException. You might want to add some additional checks to the code above to prevent errors.
you could do it like this:
String word = JOptionPane.showInputDialog("Write a word: ");
char[] alphabet = "abcdefghijklmnopqrstuvwxyz".toCharArray();
char[] result = new char[alphabet.length];
int start = 0;
for (; start < word.length(); start++) {
result[start] = word.toCharArray()[start];
}
for (int i = 0; i < alphabet.length; i++) {
if(word.indexOf(alphabet[i]) == -1) {
result[start++] = alphabet[i];
}
}

Java: Testing algorithms: all possible combinations

I want to exhaustively test a String matching algorithm, named myAlgo(Char[] a, Char[] b)
The exhaustive test takes a number of different characters, an alphabet "l", in an array of length "n". The test then computes all combinations, comparing each with all combinations of another array with similar properties (like truth tables), e.g.
I have not been able to write something that generates every combination of an array of size n over alphabet l, nor have I been able to write code that turns the computation into iterative test cases (testing all combinations of the two arrays against each other), though given code that generates the combinations, a nested for-loop should do the required testing.
My goal is to break my algorithm by making it compute something it should not compute.
Test(char[] l, int n)
l = [a;b] //a case could be
n = 2 //a case could be
myAlgo([a;a],[a;a]); //loops over my algorithm in the following way
myAlgo([a;b],[a;a]);
myAlgo([b;a],[a;a]);
myAlgo([b;b],[a;a]);
myAlgo([a;a],[a;b]);
myAlgo([a;b],[a;b]);
myAlgo([b;a],[a;b]);
myAlgo([b;b],[a;b]);
myAlgo([a;a],[b;a]);
myAlgo([a;b],[b;a]);
...
myAlgo([b;b],[b;b]);
My own solution (it only works for a finite set of "l") also starts printing weird outputs on later iterations.
public class Test {
//aux function to format chars
public static String concatChar(char [] c){
String s = "";
for(char cc : c){
s += cc;
}
return s;
}
public static void main(String[] args) {
String ss1 = "AA"; //TestCases, n = 2
String ss2 = "AA";
char[] test1 = ss1.toCharArray();
char[] test2 = ss2.toCharArray();
Fordi fordi = new Fordi(); //my algorithm
TestGenerator tGen = new TestGenerator(); //my testGenerator
for(int i=0; i<Math.pow(4.0, 2.0);i++){ //to test all different cases
for(int j=0; j<Math.pow(4.0, 2.0);j++){
int k = fordi.calculate(test1, test2); //my algorithm
String mys1 = concatChar(test1); //to print result
String mys2 = concatChar(test2); //to print result
System.out.println(mys1 + " - " + mys2);
System.out.println(k);
test2 = tGen.countArray(test2); //"flip" one number
}
test2 = ss1.toCharArray();
test1 = tGen.countArray(test1); //"flip"
}
}
}
My arrayflipper code:
public char[] countArray(char[] a){
int i=0;
while(i<a.length){
switch (a[i]){
case 'A':
a[i]='B';
clearBottom(a,i);
return a;
case 'B':
a[i]='C';
clearBottom(a,i);
return a;
case 'C':
a[i]='D';
clearBottom(a,i);
return a;
case 'D':
i++;
break;
default:
System.out.println("Something went terribly wrong!");
}
}
return a;
}
public char[] clearBottom(char [] a, int i){
while(i >0){
i--;
a[i] = 'A';
}
return a;
}
As I understand it, your goal is to create all n-character long strings (stored individually as elements in an array) consisting of letters in the L letter alphabet?
One way to accomplish this is to order your letters (A=0, B=1, C=2, etc). Then you can, from a starting string of AAA...AAA (n-characters long) just keep adding 1. Essentially you implement an addition algorithm. Adding 1 would turn an A=0 into a B=1. For example, n=3 and L=3:
start: AAA (0,0,0).
Adding 1 becomes AAB (0,0,1)
Adding 1 again becomes AAC (0, 0, 2)
Adding 1 again (since we are out of letters, now we carry a bit over) ABA (0, 1, 0).
You can boil the process down to looking for the right-most number that is not maxed out and add 1 to it (then all digits to the right of that digit go back to zero). So in the string ABCCC, the B digit is the right-most not maxed out digit, it goes up by 1 and becomes a C, then all the maxed out digits to the right go back to 0 (A) leaving ACAAA as the next string.
Your algorithm just repeatedly adds 1 until all the elements in the string are maxed out.
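The "add 1 with carry" step can be sketched as a small helper. This is an illustration only: the class name StringCounter and the methods increment and indexOf are hypothetical, and the code assumes every character of the word comes from the given alphabet.

```java
class StringCounter {
    // Advances `word` in place to the next string over `alphabet`.
    // Returns false when every position was maxed out and the word
    // wrapped around to all-first-letters again.
    static boolean increment(char[] word, char[] alphabet) {
        for (int pos = word.length - 1; pos >= 0; pos--) {
            int digit = indexOf(alphabet, word[pos]);
            if (digit < alphabet.length - 1) {
                word[pos] = alphabet[digit + 1]; // right-most digit not maxed out
                return true;
            }
            word[pos] = alphabet[0]; // maxed out: reset to "zero" and carry left
        }
        return false;
    }

    // Position of c in the alphabet, i.e. its numeric value (A=0, B=1, ...).
    static int indexOf(char[] alphabet, char c) {
        for (int i = 0; i < alphabet.length; i++) {
            if (alphabet[i] == c) {
                return i;
            }
        }
        throw new IllegalArgumentException("character not in alphabet: " + c);
    }
}
```

Starting from a word made only of the first letter and calling increment until it returns false visits every string of that length exactly once, so two nested loops of this kind would cover all pairs of inputs.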
Instead of using a switch statement, I recommend putting every character you want to test (A, B, C, D) into an array, and then using the modulo and integer division operations to calculate the index of each character from the iteration number in a manner similar to the following:
char[] l = new char[]{'A','B','C','D'};
int n = 2;
char[] test1 = new char[n];
char[] test2 = new char[n];
int max = (int)Math.pow(l.length, n);
for (int i = 0; i < max; i++) {
for (int k = 0; k < n; k++) {
test2[k] = l[(i % (int)Math.pow(l.length, k + 1)) / (int)Math.pow(l.length, k)];
}
for (int j = 0; j < max; j++) {
for (int k = 0; k < n; k++) {
test1[k] = l[(j % (int)Math.pow(l.length, k + 1)) / (int)Math.pow(l.length, k)];
}
int k = fordi.calculate(test1, test2);
System.out.println(new String(test1) + "-" + new String(test2));
System.out.println(k);
}
}
You can add more characters to l as well as increase n and it should still work. Of course, this can be further optimized, but you should get the idea. Hope this answer helps!

How to retrieve a random word of a given length from a Trie

I have a simple Trie that I'm using to store about 80k words of length 2 - 15. It works great for checking to see if a string is a word; However, now I need a way of getting a random word of a given length. In other words, I need "getRandomWord(5)" to return a 5 letter word, with all 5 letter words having an equal chance of being returned.
The only way I can think of is to pick a random number and traverse the tree breadth-first until I've passed that many words of the desired length. Is there a better way to do this?
Possibly unnecessary, but here's the code for my trie.
class TrieNode {
private TrieNode[] c;
private Boolean end = false;
public TrieNode() {
c = new TrieNode[26];
}
protected void insert(String word) {
int n = word.charAt(0) - 'A';
if (c[n] == null)
c[n] = new TrieNode();
if (word.length() > 1) {
c[n].insert(word.substring(1));
} else {
c[n].end = true;
}
}
public Boolean isThisAWord(String word) {
if (word.length() == 0)
return false;
int n = word.charAt(0) - 'A';
if (c[n] != null && word.length() > 1)
return c[n].isThisAWord(word.substring(1));
else if (c[n] != null && c[n].end && word.length() == 1)
return true;
else
return false;
}
}
Edit: The marked answer worked well; I'll add my implementation here for posterity, in case it helps anyone with a similar problem.
First, I made a helper class to hold metadata about the TrieNodes I'm using in the search:
class TrieBranch {
TrieNode node;
int letter;
int depth;
public TrieBranch(TrieNode n, int l, int d) {
letter = l; node = n; depth = d;
}
}
This is the class that holds the Trie and implements the search for the random word. I'm kind of a beginner so there may be better ways to do this, but I tested this a bit and it seems to work. No error handling, so caveat emptor.
class Dict {
final static int maxWordLength = 13;
final static int lettersInAlphabet = 26;
TrieNode trie;
int lengthFrequencyByLetter[][];
int totalLengthFrequency[];
public Dict() {
trie = new TrieNode();
lengthFrequencyByLetter = new int[lettersInAlphabet][maxWordLength + 1];
totalLengthFrequency = new int[maxWordLength + 1];
}
public String getRandomWord(int length) {
// Returns a random word of the specified length from the trie
// First, pick a random number from 0 to [number of words with this length]
Random r = new Random();
int wordIndex = r.nextInt(totalLengthFrequency[length]);
// figure out what the first letter of this word would be
int firstLetter = -1, totalSoFar = 0;
while (totalSoFar <= wordIndex) {
firstLetter++;
totalSoFar += lengthFrequencyByLetter[firstLetter][length];
}
wordIndex -= (totalSoFar - lengthFrequencyByLetter[firstLetter][length]);
// traverse the (firstLetter)'th node of trie depth-first to find the word (wordIndex)'th word
int[] result = new int[length + 1];
Stack<TrieBranch> stack = new Stack<TrieBranch>();
stack.push(new TrieBranch(trie.getBranch(firstLetter), firstLetter, 1));
while (!stack.isEmpty()) {
TrieBranch n = stack.pop();
result[n.depth] = n.letter;
// examine the current node
if (n.depth == length && n.node.isEnd()) {
wordIndex--;
if (wordIndex < 0) {
// search is over
String sResult = "";
for (int i = 1; i <= length; i++) {
sResult += (char)(result[i] + 'a');
}
return sResult;
}
}
// handle child nodes unless they're deeper than target length
if (n.depth < length) {
for (int i = 25; i >= 0; i--) {
if (n.node.getBranch(i) != null)
stack.push(new TrieBranch(n.node.getBranch(i), i, n.depth + 1));
}
}
}
return "failure of some sort";
}
}
Using a casual dictionary (80k words, max length 12), each call to getRandomWord() takes about 0.2 ms, and using a more thorough dictionary (250K words, max length 24), each call takes about 1 ms.
To make sure you have an even chance of getting each 5-letter word, you need to know how many 5-letter words there are in your tree. So as you construct the tree, you record the length of each word you add in two counters: an overall frequency counter, and a by-letter frequency counter:
int lengthFrequencyByLetter[letterIndex][maxWordLength-1]
int totalLengthFrequency[maxWordLength-1]
So if you have 4000 5-letter words, and 213 of them start with "d", then
lengthFrequencyByLetter[3][4] = 213
and
totalLengthFrequency[4] = 4000
after you're done adding everything to your tree. (The letter "a" is 0, "b" is 1, ... "z" is 25.)
From here, you can do a search for the nth word of a given length, where n is a random integer picked from a uniform random distribution, in the range (0, totalLengthFrequency[length-1]).
Let's say you have 4000 5-letter words in your structure. You pick random number 1234. Now you can check
lengthFrequencyByLetter[0][4]
lengthFrequencyByLetter[1][4]
lengthFrequencyByLetter[2][4]
lengthFrequencyByLetter[3][4]
in turn, until you exceed a total of 1234. Then you know quickly what the start letter of the 1234th 5-letter word is, and then search there. You don't have to search every word in the tree from the beginning each time.
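A small sketch of the two counters and of the first-letter lookup is given below. It uses the answer's indexing (length - 1) and assumes lowercase words made of a-z with lengths between 1 and maxWordLength; the class LengthCounters and the methods locate and locateRandom are hypothetical helpers, and the search inside the chosen letter's subtree is left out.

```java
import java.util.List;
import java.util.Random;

class LengthCounters {
    final int[][] lengthFrequencyByLetter; // [first letter][length - 1]
    final int[] totalLengthFrequency;      // [length - 1]

    LengthCounters(List<String> words, int maxWordLength) {
        lengthFrequencyByLetter = new int[26][maxWordLength];
        totalLengthFrequency = new int[maxWordLength];
        for (String w : words) {
            // Count the word once per counter, keyed by its length.
            lengthFrequencyByLetter[w.charAt(0) - 'a'][w.length() - 1]++;
            totalLengthFrequency[w.length() - 1]++;
        }
    }

    // Maps the n-th word (0-based) of the given length to
    // {first letter index, offset among words starting with that letter}.
    // Assumption: 0 <= n < totalLengthFrequency[length - 1].
    int[] locate(int n, int length) {
        int letter = 0;
        while (n >= lengthFrequencyByLetter[letter][length - 1]) {
            n -= lengthFrequencyByLetter[letter][length - 1];
            letter++;
        }
        return new int[] {letter, n};
    }

    // Picks n uniformly at random, so every word of that length is equally likely.
    int[] locateRandom(int length, Random random) {
        return locate(random.nextInt(totalLengthFrequency[length - 1]), length);
    }
}
```

The returned offset tells the subtree search how many words of the target length to skip before returning one.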
