I have a scrambled String as follows: "artearardreardac".
I have a text file which contains English dictionary words close to 300,000 of them. I need to find the English words and be able to form a word as follows:
C A R D
A R E A
R E A R
D A R T
My intention was to initially loop through the scrambled String and make query to that text file each time n try to match 4 characters each time to see if its a valid word.
Problem with this is checking it against 300,000 words per loop.. Going to take ages. I looped through only the first letter 16 times and that itself take a significant time. The amount of possibilities coming from this method seems endless. Even if I dismiss the efficiency for now, I could end up finding English words which may not form a Word.
My guess is I have to resolve and find words while maintaining the letter formation correctly from the start somehow? At it for hours and gone from fun to frustration. Can I just get some guidance please. Looking for similar questions but found none.
Note: This is an example and I am trying to keep it open for a longer string or a square of different size. (The example is 4x4. The user can decide to go with a 5x5 square with a string of length 25).
My Code
public static void main(String[] args){
String result = wordSquareCreator(4, "artearardreardac");
System.out.println(result);
}
static String wordSquareCreator(int dimension, String letter){
String sortedWord = "";
String temp;
int front = 0;
int firstLetterFront = 0;
int back = dimension;
//Looping through first 4 letters and only changing the first letter 16 times to try a match.
for (int j = 0; j < letter.length(); j++) {
String a = letter.substring(firstLetterFront, j+1) + letter.substring(front+1, back);
temp = readFile(dimension, a);
if(temp != null){
sortedWord+= temp;
}
firstLetterFront++;
}
return sortedWord;
}
static String readFile(int dimension, String word){
//dict text file contains 300,00 English words
File file = new File("dict.txt");
BufferedReader reader = null;
try {
reader = new BufferedReader(new FileReader(file));
String text;
while ((text = reader.readLine()) != null) {
if(text.length() == dimension) {
if(text.equals(word)){
//found a valid English word
return text;
}
}
}
}catch (Exception e){
e.printStackTrace();
}
finally {
try {
if(reader != null)
reader.close();
} catch (IOException e) {
e.printStackTrace();
}
}
return null;
}
You can greatly cut down your search space if you organize your dictionary properly. (Which can be done as you read it in, you don't need to modify the file on disk.)
Break it up into one list per word length, then sort each list.
Now, to reduce your search space--note that singletons can only occur on the diagonal from the top left to the bottom right. You have an odd number of C, T, R and A--those 4 letters make up this diagonal. (Note that you will not always be able to do this as they aren't guaranteed unique.) Your search space is now one set of 4 with 4 options (24 options) and one set of 6 (720 options except there are duplicates that cut this down.) 17k possible boards and under 1k words (edit: I originally said 5k but you can restrict the space to words starting with the correct letter and since it's a sorted list you don't need to consider the others at all) to try and you're already under 20 million possibilities to examine. You can cut this considerably by first filtering your word list to those that contain only the letters that are used.
At this point an exhaustive search isn't going to be prohibitive.
Since it seems that you want to create a word square out of those letters that you take in as a parameter to your function, you know that the absolute word length in your square is sqrt(amountOfLetters). In your examplecode that would be sqrt(16) = 4. You can also disqualify a lot of words directly from your dictionary:
discard a word if it does not start with a letter in your "alphabet" (i.e. "A", "C", "D", "E", "R", "T")
discard a word if it is not equal to your wordlength (i.e. 4)
discard a word if it has a letter not in your alphabet
The amount of words that you want to "write" in your square is wordlength * 2 (since the words can only start from the upper-row or from the left-column)
You could actually first start by going through your dictionary and copying only valid words into new file. Then compare your square into this new shorter dictionary.
With building up the square, I think there are 2 possibilities to choose between.
The first one is to randomly organize the square from the letters and make checks if the letters form up correct words
The second one is to randomly choose "correct" words from the dictionary, and write them into your square. After that you check if the words use a correct amount and setting of letters
Related
I am working on a Comp Sci assignment. In the end, the program will determine whether a file is written in English or French. Right now, I'm struggling with the method that counts the frequency of words that appears in a .txt file.
I have a set of text files in both English and French in their respective folders labeled 1-20. The method asks for a directory (which in this case is "docs/train/eng/" or "docs/train/fre/") and for how many files that the program should go through (there are 20 files in each folder). Then it reads that file, splits all the words apart (I don't need to worry about capitalization or punctuation), and puts every word in a HashMap along with how many times they were in the file. (Key = word, Value = frequency).
This is the code I came up with for the method:
public static HashMap<String, Integer> countWords(String directory, int nFiles) {
// Declare the HashMap
HashMap<String, Integer> wordCount = new HashMap();
// this large 'for' loop will go through each file in the specified directory.
for (int k = 1; k < nFiles; k++) {
// Puts together the string that the FileReader will refer to.
String learn = directory + k + ".txt";
try {
FileReader reader = new FileReader(learn);
BufferedReader br = new BufferedReader(reader);
// The BufferedReader reads the lines
String line = br.readLine();
// Split the line into a String array to loop through
String[] words = line.split(" ");
int freq = 0;
// for loop goes through every word
for (int i = 0; i < words.length; i++) {
// Case if the HashMap already contains the key.
// If so, just increments the value
if (wordCount.containsKey(words[i])) {
wordCount.put(words[i], freq++);
}
// Otherwise, puts the word into the HashMap
else {
wordCount.put(words[i], freq++);
}
}
// Catching the file not found error
// and any other errors
}
catch (FileNotFoundException fnfe) {
System.err.println("File not found.");
}
catch (Exception e) {
System.err.print(e);
}
}
return wordCount;
}
The code compiles. Unfortunately, when I asked it to print the results of all the word counts for the 20 files, it printed this. It's complete gibberish (though the words are definitely there) and is not at all what I need the method to do.
If anyone could help me debug my code, I would greatly appreciate it. I've been at it for ages, conducting test after test and I'm ready to give up.
Let me combine all the good answers here.
1) Split up your methods to handle one thing each. One to read the files into strings[], one to process the strings[], and one to call the first two.
2) When you split think deeply about how you want to split. As #m0skit0 suggest you should likely split with \b for this problem.
3) As #jas suggested you should first check if your map already has the word. If it does increment the count, if not add the word to the map and set it's count to 1.
4) To print out the map in the way you likely expect, take a look at the below:
Map test = new HashMap();
for (Map.Entry entry : test.entrySet()){
System.out.println(entry.getKey() + " " + entry.getValue());
}
I would have expected something more like this. Does it make sense?
if (wordCount.containsKey(words[i])) {
int n = wordCount.get(words[i]);
wordCount.put(words[i], ++n);
}
// Otherwise, puts the word into the HashMap
else {
wordCount.put(words[i], 1);
}
If the word is already in the hashmap, we want to get the current count, add 1 to that and replace the word with the new count in the hashmap.
If the word is not yet in the hashmap, we simply put it in the map with a count of 1 to start with. The next time we see the same word we'll up the count to 2, etc.
If you split by space only, then other signs (parenthesis, punctuation marks, etc...) will be included in the words. For example: "This phrase, contains... funny stuff", if you split it by space you get: "This" "phrase," "contains..." "funny" and "stuff".
You can avoid this by splitting by word boundary (\b) instead.
line.split("\\b");
Btw your if and else parts are identical. You're always incrementing freq by one, which doesn't make much sense. If the word is already in the map, you want to get the current frequency, add 1 to it, and update the frequency in the map. If not, you put it in the map with a value of 1.
And pro tip: always print/log the full stacktrace for the exceptions.
Let's say I have this String
String myText="I think that stackoverflow is a very great website";
If i want to divide it in 2 lines i would have something like
I think that stackoverflow
is a very great website.
So the String will be now ("I think that stackoverflow\nis a very great website"
If I want it to divide in 3 lines it will be like
I think that
stackoverflow is a
very great website
What I've tried was just dividing the text, every line would have total number of words / n (n is the number of lines that i want to divide my text).
But this is a bad thing, i would have a result like
String myText="I me him is veryverylong wordvery longestwordever thisisevenlonger"
And the result would be (if i want to divide it in 2 lines) something like
"i you me is\nveryverylong wordvery longestwordever thisisevenlonger"
What do you guys suggest for me to try?
I've tried the common apache algorithm
http://pastebin.com/68zycavf
But my output text will be every word separated by \n ..if i use wrap(text,2)..
As Eran noted in his answer, you want to split at approximately the line length divided by the desired number of lines, but have to adjust for that being in the middle of a word.
I think his solution won't quite always give the best solution though, as it might sometimes be best to split before the word instead of after as he's doing.
A divide-and-conquer approach would be a recursive algorithm roughly as follows:
Let N be the desired number of lines and LENGTH be the number of characters in the input string (normalizing to single-spaces first).
If the character at LENGTH/N is a space, make the first cut there, and recursively call to split the remainder into N-1 lines, otherwise find the spaces at each end of the word containing this character and make trial cuts at both points with recursive calls again tom complete both cuts. Score the results somehow and choose the better.
I have implemented this as follows. For the scoring function, I chose to minimize the maximum length of lines in the split. A more complex scoring function might possibly improve the results, but this seems to work for all your cases.
public class WordWrapper {
public String wrapWords(String input, int lines) {
return splitWords(input.replaceAll("\\s+", " "), lines);
}
private String splitWords(String input, int lines) {
if (lines <= 1) {
return input;
}
int splitPointHigh = findSplit(input, lines, 1);
String splitHigh = input.substring(0, splitPointHigh).trim() + "\n" + splitWords(input.substring(splitPointHigh).trim(), lines - 1);
int splitPointLow = findSplit(input, lines, -1);
String splitLow = input.substring(0, splitPointLow).trim() + "\n" + splitWords(input.substring(splitPointLow).trim(), lines - 1);
if (maxLineLength(splitLow) < maxLineLength(splitHigh))
return splitLow;
else return splitHigh;
}
private int maxLineLength(String split) {
return maxLength(split.split("\n"));
}
private int maxLength(String[] lines) {
int maxLength = 0;
for (String line: lines) {
if (line.length() > maxLength)
maxLength = line.length();
}
return maxLength;
}
private int findSplit(String input, int lines, int dir) {
int result = input.length() / lines;
while (input.charAt(result) != ' ')
result+= dir;
return result;
}
}
I didn't actually bother with the special case of the lucky situation of the simple split landing on a space, and adding special handling for that might make it a little quicker. This code will in that case generate two identical "trial splits" and "choose one".
You might want to make all these methods static of course, and the recursion might give you a stack overflow for large inputs and large line counts.
I make no claim that this is the best algorithm, but it seems to work.
You can split based on the number of characters divided by n.
Then, for each line, you should add the end of the last word (which is the beginning of the next line, if the current line doesn't end with a space and the next line doesn't begin with a space), so that no words are split in the middle.
So if you have :
I me him is veryverylong wordvery longestwordever thisisevenlonger
And you wish to split it to two lines, you get :
I me him is veryverylong wordvery
longestwordever thisisevenlonger
In this case the second line already starts with a space, so we know that no word was split in the middle, and we are done.
If you split it to three lines, you first get :
I me him is veryverylo
ng wordvery longestwor
dever thisisevenlonger
Here some words were split, so you move "ng" to the first line, and then move "dever" to the second line.
I me him is veryverylong
wordvery longestwordever
thisisevenlonger
This is my solution using the split() function.
public class Textcut {
public static void main(String arg[]) {
String myText="I think that stackoverflow is a very great website";
int n = 2;
String[] textSplit = myText.split(" ");
int wordNumber = textSplit.length;
int cutIndex = wordNumber/n;
int i = cutIndex;
int j = 0;
while(i <= wordNumber) {
for(; j < i; j++) {
System.out.print(textSplit[j] + " ");
}
System.out.println("\n");
i = i+cutIndex;
}
}
}
Aside: I am using the penn.txt file for the problem. The link here is to my Dropbox but it is also available in other places such as here. However, I've not checked whether they are exactly the same.
Problem statement: I would like to do some word processing on each line of the penn.txt file which contains some words and syntactic categories. The details are not relevant.
Actual "problem" faced: I suspect that the file has some consecutive blank lines (which should ideally not be present), which I think the code verifies but I have not verified it by eye, because the number of lines is somewhat large (~1,300,000). So I would like my Java code and conclusions checked for correctness.
I've used slightly modified version of the code for converting file to String and counting number of lines in a string. I'm not sure about efficiency of splitting but it works well enough for this case.
File file = new File("final_project/penn.txt"); //location
System.out.println(file.exists());
//converting file to String
byte[] encoded = null;
try {
encoded = Files.readAllBytes(Paths.get("final_project/penn.txt"));
} catch (IOException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
}
String mystr = new String(encoded, StandardCharsets.UTF_8);
//splitting and checking "consecutiveness" of \n
for(int j=1; ; j++){
String split = new String();
for(int i=0; i<j; i++){
split = split + "\n";
}
if(mystr.split(split).length==1) break;
System.out.print("("+mystr.split(split).length + "," + j + ") ");
}
//counting using Scanner
int count=0;
try {
Scanner reader = new Scanner(new FileInputStream(file));
while(reader.hasNext()){
count++;
String entry = reader.next();
//some word processing here
}
reader.close();
} catch (FileNotFoundException e) {
e.printStackTrace();
}
System.out.println(count);
The number of lines in Gedit--if I understand correctly--matched the number of \n characters found at 1,283,169. I have verified (separately) that the number of \r and \r\n (combined) characters is 0 using the same splitting idea. The total splitting output is shown below:
(1283169,1) (176,2) (18,3) (13,4) (11,5) (9,6) (8,7) (7,8) (6,9) (6,10) (5,11) (5,12) (4,13) (4,14) (4,15) (4,16) (3,17) (3,18) (3,19) (3,20) (3,21) (3,22) (3,23) (3,24) (3,25) (2,26) (2,27) (2,28) (2,29) (2,30) (2,31) (2,32) (2,33) (2,34) (2,35) (2,36) (2,37) (2,38) (2,39) (2,40) (2,41) (2,42) (2,43) (2,44) (2,45) (2,46) (2,47) (2,48) (2,49) (2,50)
Please answer whether the following statements are correct or not:
From this, what I understand is that there is one instance of 50 consecutive \n characters and because of that there are exactly two instances of 25 consecutive \n characters and so on.
The last count (using Scanner) reading gives 1,282,969 which is an exact difference of 200. In my opinion, what this means is that there are exactly 200 (or 199?) empty lines floating about somewhere in the file.
Is there any way to separately verify this "discrepancy" of 200? (something like a set-theoretic counting of intersections maybe)
A partial answer to question (the last part) is as follows:
(Assuming the two statements in the question are true)
If instead of printing number of split parts, if you print no. of occurrences of \n j times, you'll get (simply doing a -1):
(1283168,1) (175,2) (17,3) (12,4) (10,5) (8,6) (7,7) (6,8) (5,9) (5,10) (4,11) (4,12) (3,13) (3,14) (3,15) (3,16) (2,17) (2,18) (2,19) (2,20) (2,21) (2,22) (2,23) (2,24) (2,25) (1,26) (1,27) (1,28) (1,29) (1,30) (1,31) (1,32) (1,33) (1,34) (1,35) (1,36) (1,37) (1,38) (1,39) (1,40) (1,41) (1,42) (1,43) (1,44) (1,45) (1,46) (1,47) (1,48) (1,49) (1,50)
Note that for j>3, product of both numbers is <=50, which is your maximum. What this means is that there is a place with 50 consecutive \n characters and all the hits you are getting from 4 to 49 are actually part of the same.
However for 3, the maximum multiple of 3 less than 50 is 48 which gives 16 while you have 17 occurrences here. So there is an extra \n\n\n somewhere with non-\n character on both its 'sides'.
Now for 2 (\n\n), we can subtract 25 (coming from the 50 \ns) and 1 (coming from the separate \n\n\n) to obtain 175-26 = 149.
Counting for the discrepancy, we should sum (2-1)*149 + (3-1)*1 + (50-1)*1, the -1 coming because first \n in each of these is accounted for in the Scanner counting. This sum is 200.
Basically I want to create a program which simulates the 'Countdown' game on Channel 4. In effect a user must input 9 letters and the program will search for the largest word in the dictionary that can be made from these letters.I think a tree structure would be better to go with rather than hash tables. I already have a file which contains the words in the dictionary and will be using file io.
This is my file io class:
public static void main(String[] args){
FileIO reader = new FileIO();
String[] contents = reader.load("dictionary.txt");
}
This is what I have so far in my Countdown class
public static void main(String[] args) throws IOException{
Scanner scan = new Scanner(System.in);
letters = scan.NextLine();
}
I get totally lost from here. I know this is only the start but I'm not looking for answers. I just want a small bit of help and maybe a pointer in the right direction. I'm only new to java and found this question in an interview book and thought I should give it a .
Thanks in advance
welcome to the world of Java :)
The first thing I see there that you have two main methods, you don't actually need that. Your program will have a single entry point in most cases then it does all its logic and handles user input and everything.
You're thinking of a tree structure which is good, though there might be a better idea to store this. Try this: http://en.wikipedia.org/wiki/Trie
What your program has to do is read all the words from the file line by line, and in this process build your data structure, the tree. When that's done you can ask the user for input and after the input is entered you can search the tree.
Since you asked specifically not to provide answers I won't put code here, but feel free to ask if you're unclear about something
There are only about 800,000 words in the English language, so an efficient solution would be to store those 800,000 words as 800,000 arrays of 26 1-byte integers that count how many times each letter is used in the word, and then for an input 9 characters you convert to similar 26 integer count format for the query, and then a word can be formed from the query letters if the query vector is greater than or equal to the word-vector component-wise. You could easily process on the order of 100 queries per second this way.
I would write a program that starts with all the two-letter words, then does the three-letter words, the four-letter words and so on.
When you do the two-letter words, you'll want some way of picking the first letter, then picking the second letter from what remains. You'll probably want to use recursion for this part. Lastly, you'll check it against the dictionary. Try to write it in a way that means you can re-use the same code for the three-letter words.
I believe, the power of Regular Expressions would come in handy in your case:
1) Create a regular expression string with a symbol class like: /^[abcdefghi]*$/ with your letters inside instead of "abcdefghi".
2) Use that regular expression as a filter to get a strings array from your text file.
3) Sort it by length. The longest word is what you need!
Check the Regular Expressions Reference for more information.
UPD: Here is a good Java Regex Tutorial.
A first approach could be using a tree with all the letters present in the wordlist.
If one node is the end of a word, then is marked as an end-of-word node.
In the picture above, the longest word is banana. But there are other words, like ball, ban, or banal.
So, a node must have:
A character
If it is the end of a word
A list of children. (max 26)
The insertion algorithm is very simple: In each step we "cut" the first character of the word until the word has no more characters.
public class TreeNode {
public char c;
private boolean isEndOfWord = false;
private TreeNode[] children = new TreeNode[26];
public TreeNode(char c) {
this.c = c;
}
public void put(String s) {
if (s.isEmpty())
{
this.isEndOfWord = true;
return;
}
char first = s.charAt(0);
int pos = position(first);
if (this.children[pos] == null)
this.children[pos] = new TreeNode(first);
this.children[pos].put(s.substring(1));
}
public String search(char[] letters) {
String word = "";
String w = "";
for (int i = 0; i < letters.length; i++)
{
TreeNode child = children[position(letters[i])];
if (child != null)
w = child.search(letters);
//this is not efficient. It should be optimized.
if (w.contains("%")
&& w.substring(0, w.lastIndexOf("%")).length() > word
.length())
word = w;
}
// if a node its end-of-word we add the special char '%'
return c + (this.isEndOfWord ? "%" : "") + word;
}
//if 'a' returns 0, if 'b' returns 1...etc
public static int position(char c) {
return ((byte) c) - 97;
}
}
Example:
public static void main(String[] args) {
//root
TreeNode t = new TreeNode('R');
//for skipping words with "'" in the wordlist
Pattern p = Pattern.compile(".*\\W+.*");
int nw = 0;
try (BufferedReader br = new BufferedReader(new FileReader(
"files/wordsEn.txt")))
{
for (String line; (line = br.readLine()) != null;)
{
if (p.matcher(line).find())
continue;
t.put(line);
nw++;
}
// line is not visible here.
br.close();
System.out.println("number of words : " + nw);
String res = null;
// substring (1) because of the root
res = t.search("vuetsrcanoli".toCharArray()).substring(1);
System.out.println(res.replace("%", ""));
}
catch (Exception e)
{
// TODO Auto-generated catch block
e.printStackTrace();
}
}
Output:
number of words : 109563
counterrevolutionaries
Notes:
The wordlist is taken from here
the reading part is based on another SO question : How to read a large text file line by line using Java?
I need to read in a file that contains 2 sentences to compare and return a number between 0 and 1. If the sentences are exactly the same it should return a 1 for true and if they are totally opposite it should return a 0 for false. If the sentences are similar but words are changed to synonyms or something close it should return a .25 .5 or .75. The text file is formatted like this:
______________________________________
Text: Sample
Text 1: It was a dark and stormy night. I was all alone sitting on a red chair. I was not completely alone as I had three cats.
Text 20: It was a murky and stormy night. I was all alone sitting on a crimson chair. I was not completely alone as I had three felines
// Should score high point but not 1
Text 21: It was a murky and tempestuous night. I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines
// Should score lower than text20
Text 22: I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines. It was a murky and tempestuous night.
// Should score lower than text21 but NOT 0
Text 24: It was a dark and stormy night. I was not alone. I was not sitting on a red chair. I had three cats.
// Should score a 0!
________________________________________________
I have a file reader, but I am not sure the best way to store each line so I can compare them. For now I have the file being read and then being printed out on the screen. What is the best way to store these and then compare them to get my desired number?
import java.io.*;
public class implement
{
public static void main(String[] args)
{
try
{
FileInputStream fstream = new FileInputStream("textfile.txt");
DataInputStream in = new DataInputStream (fstream);
BufferedReader br = new BufferedReader (new InputStreamReader(in));
String strLine;
while ((strLine = br.readLine()) != null)
{
System.out.println (strLine);
}
in.close();
}
catch (Exception e)
{
System.err.println("Error: " + e.getMessage());
}
}
}
Save them in an array list.
ArrayList list = new ArrayList();
//Read File
//While loop
list.add(strLine)
To check each variable in a sentence simply remove punctuation then delimit by spaces and search for each word in the sentence you are comparing. I would suggest ignoring words or 2 or 3 characters. it is up to your digression
then save the lines to the array and compare them however you wanted to.
To compare similar words you will need a database to efficiently check words. Aka a hash table. Once you have this you can search words in a database semiquickly. Next this hash table of works will need a thesaurus linked to each word for similar words. Then take the similar words for the key words in each sentence and run a search for these words on the sentence you are comparing. Obviously before you search for the similar words you would want to compare the two actually sentences. In the end you will need an advanced datastucture you will have to build yourself to do more than direct comparisons.