I have a large number of documents (over a million) which I need to regularly scan and match against about 100 "multi-word keywords" (i.e. not just single keywords like "movies" but also phrases like "north american"). I have the following code, which works fine with single-word keywords (e.g. "book"):
/**
 * Scan a text for certain keywords
 * @param keywords the list of keywords we are searching for
 * @param text the text we will be scanning
 * @return a list of any keywords from the list which we could find in the text
 */
public static List<String> scanWords(List<String> keywords, String text) {
    // prepare the BreakIterator
    BreakIterator wb = BreakIterator.getWordInstance();
    wb.setText(text);
    List<String> results = new ArrayList<String>();
    // iterate word by word
    int start = wb.first();
    for (int end = wb.next(); end != BreakIterator.DONE; start = end, end = wb.next()) {
        String word = text.substring(start, end);
        if (!StringUtils.isEmpty(word) && keywords.contains(word)) {
            // we have this word in our keywords so return it
            results.add(word);
        }
    }
    return results;
}
Note: I need this code to be as efficient as possible as the number of documents is very large.
My current code fails to find any of the two-word keywords. Any idea how to fix this? I am also fine with a completely different approach.
Scanning every document does not scale at all. Better to index your documents in an inverted index, or, as suggested in the comments, use Lucene.
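For what it's worth, a minimal sketch of that idea (the class and method names here are mine, not from the post): tokenize each document once, record which documents each word occurs in, and intersect the posting sets for the words of a multi-word keyword. Documents that survive the intersection still need a positional or substring check to confirm the words are actually adjacent.
import java.text.BreakIterator;
import java.util.*;

public class InvertedIndex {
    private final Map<String, Set<Integer>> index = new HashMap<>();

    // tokenize a document once and record which document each word appears in
    public void add(int docId, String text) {
        BreakIterator wb = BreakIterator.getWordInstance();
        wb.setText(text);
        int start = wb.first();
        for (int end = wb.next(); end != BreakIterator.DONE; start = end, end = wb.next()) {
            String word = text.substring(start, end).trim().toLowerCase();
            if (!word.isEmpty()) {
                index.computeIfAbsent(word, k -> new HashSet<>()).add(docId);
            }
        }
    }

    // a multi-word keyword can only match documents that contain all of its words
    public Set<Integer> candidates(String keyword) {
        Set<Integer> result = null;
        for (String word : keyword.toLowerCase().split("\\s+")) {
            Set<Integer> docs = index.getOrDefault(word, Collections.emptySet());
            if (result == null) result = new HashSet<>(docs);
            else result.retainAll(docs);
        }
        return result == null ? Collections.emptySet() : result;
    }
}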
I believe creating an instance of Scanner would work for this. The Scanner class has a method that allows you to search text for a pattern, which in your case would be the keywords.
Scanner scanner = new Scanner(text);
while (scanner.hasNextLine()) {
    String match = scanner.findInLine(pattern); // pattern: the keyword to look for
    if (match != null) {
        // keyword found on this line
    }
    scanner.nextLine();
}
The Scanner class is good for doing stuff like this, and I believe it would work fine for what you need.
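As a hedged illustration of that suggestion, findInLine also accepts multi-word patterns; the text and keyword literals below are made up for the example:
import java.util.Scanner;

public class FindInLineDemo {
    public static void main(String[] args) {
        String text = "A survey of north american box office trends.\nMostly about movies.";
        Scanner scanner = new Scanner(text);
        while (scanner.hasNextLine()) {
            // findInLine returns the match, or null if the pattern is absent from this line
            String match = scanner.findInLine("north american");
            if (match != null) {
                System.out.println("found: " + match);
            }
            scanner.nextLine();
        }
        scanner.close();
    }
}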
This is the problem:
I'm given a text with certain words and characters that I am to omit. I am to create a method that returns an array of words, taking two file arguments like this:
public Word[] keyWordsList(File inputFile1, File inputFile2)
inputFile2 contains a bunch of words and punctuations that I am to ignore from what is contained in inputFile1.
inputFile2 contains the following:
of the a in on , . ?
inputFile1 contains a whole paragraph of text, and I am to push each of those words into the Word[].
I just need help in understanding how I can accurately collect all the words while omitting the ones from inputFile2, because inputFile2 contains both strings and single punctuation characters.
This is my code to solve the issue (which has been fairly successful), but I just don't know how to handle the case where the punctuation comes right after a word.
Scanner ignore = new Scanner(inputFile2);
Scanner important = new Scanner(inputFile1);
ArrayList<String> ignoreArray = new ArrayList<>();
while (ignore.hasNext()) {
    String ignoreWord = ignore.next();
    ignoreArray.add(ignoreWord);
}
ArrayList<String> importantWords = new ArrayList<>();
while (important.hasNext()) {
    String word = important.next();
    if (ignoreArray.contains(word))
        continue;
    else
        importantWords.add(word);
}
I'm getting results like this:
[This, is, input, file, to, create, key, words, list., If, we, ever, want, to, know, how, background, job, works,, fastest, way, to, find, k, smallest, elements, an, array,,] etc.etc.
From this:
This is the input file to create key words list.
If we ever want to know how background job works, fastest way to find k smallest elements in an array,
I would appreciate any help. Thank you!
My program loads two text files: one with a list of English words, another with jumbled words (really just random strings, though most can be unscrambled into words). It then determines which words can be made from the jumbled ones and prints (or at least is supposed to print) each word with its jumbled version next to it. My program effectively finds which words the jumbled strings can make; the problem is that not every jumbled word with an unjumbled equivalent gets that word printed next to it. I also need the unscrambled words to appear on the right. Here is some example output (commas separate lines, i.e. if two words appear before a comma they were printed next to each other):
addej,
ahicryrhe hierarchy,
alvan naval,
annaab banana,
baltoc,
braney nearby,
public class Lab4 {
    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            Error();
        }
        BufferedReader jumbledW = new BufferedReader(new FileReader(args[1]));
        BufferedReader words = new BufferedReader(new FileReader(args[0]));
        List<String> jumbledWList = new ArrayList<>();
        List<String> wList = new ArrayList<>();
        long initialTime = System.currentTimeMillis();
        while (jumbledW.ready()) {
            String jumble = jumbledW.readLine();
            jumbledWList.add(jumble);
        }
        Collections.sort(jumbledWList);
        while (words.ready()) {
            String word = words.readLine();
            wList.add(word);
        }
        Collections.sort(wList);
        for (String jumble : jumbledWList) {
            System.out.print(jumble + " ");
            for (String word : wList) {
                if (toConnical(jumble).equals(toConnical(word)))
                    System.out.print(word);
            }
            System.out.println();
        }
        long finalTime = System.currentTimeMillis();
        long time = finalTime - initialTime;
        System.out.println("The time taken for this program to run is " + time / 1000.0 + " seconds");
    }

    private static void Error() {
        System.out.println("\nError: You have to pass the name of the input files on the command line");
        System.exit(0);
    }

    private static String toConnical(String word) {
        char[] arr = word.toCharArray();
        Arrays.sort(arr);
        String connical = new String(arr);
        return connical;
    }
}
While skipping through old unanswered posts I came across this particular question, which quite frankly is somewhat unclear as to what the actual problem is. The way I read the post is this:
A file name is passed to this application through the command line; the file consists of several jumbled character words. It is not certain, however, whether each line within the file holds a single jumbled word OR multiple jumbled words separated by one or more white-spaces (or perhaps even tabs). With this in mind, we need to cover either scenario.
Yet another file name is also passed to this application through the command line; this file consists of several valid language words. This file is to be considered the Word List, whereby any single jumbled word could correspond to one (or more) of those words within the list once unscrambled (the word within the jumbled list might not even be scrambled).
As the jumbled list is processed, every Word List word found to match (by comparing its sorted characters to the sorted characters of the jumbled word) is printed to the console window.
The required console output is a comma-delimited string consisting of each jumbled word followed by space-delimited matching words from the Word List:
addej jaded, ahicryrhe hierarchy, alvan alvan naval, annaab banana,...etc
This, however, now appears to contradict your comment:
"the program is given a txt file of a great deal of english words youd
find in a dictionary and another txt file with jumbled words such as
cra which could make car or rat. The output i desire is the in
reference to the example of cra would be "car cra" (on one line)."
There, space-delimited Word List words come first, then the jumbled word, with each pair consuming a single console line. Which format is desired? By the way, rat cannot be made from cra.
In reality your code works as expected; however, since you are using a BufferedReader object and a FileReader object, your code will need to be in a try/catch block to handle exceptions such as FileNotFoundException and IOException. This is a requirement and cannot be excluded.
Below is your code slightly modified to accommodate your first desired output format:
try {
    BufferedReader jumbledW = new BufferedReader(new FileReader(args[1]));
    BufferedReader words = new BufferedReader(new FileReader(args[0]));
    List<String> jumbledWList = new ArrayList<>();
    List<String> wList = new ArrayList<>();
    long initialTime = System.currentTimeMillis();
    while (jumbledW.ready()) {
        String jumble = jumbledW.readLine();
        jumbledWList.add(jumble);
    }
    Collections.sort(jumbledWList);
    while (words.ready()) {
        String word = words.readLine();
        wList.add(word);
    }
    Collections.sort(wList);
    String resultString = "";
    for (int i = 0; i < jumbledWList.size(); i++) {
        String jumble = jumbledWList.get(i);
        resultString += jumble + ":";
        for (String word : wList) {
            if (toConnical(jumble).equals(toConnical(word))) {
                resultString += " " + word;
            }
        }
        if (i != (jumbledWList.size() - 1)) {
            resultString += ", ";
        }
    }
    System.out.println(resultString);
    long finalTime = System.currentTimeMillis();
    long time = finalTime - initialTime;
    System.out.println("The time taken for this program to run is " + time / 1000.0 + " seconds");
}
catch (FileNotFoundException ex) { ex.printStackTrace(); }
catch (IOException ex) { ex.printStackTrace(); }
The jumbled word list file contained the following jumbled strings:
addej
ahicryrhe
alvan
annaab
baltoc
braney
cra
htis
The console output looked like this after running the above list of jumbled words against my 370,101-word Word List file. It took 0.740 seconds to process on my system:
addej: jaded, ahicryrhe: hierarchy, alvan: alvan naval, annaab: banana, baltoc: cobalt, braney: barney nearby, cra: arc car, htis: hist hits isth shit sith this tshi
All words shown above were in my Word List file.
I have a scrambled String as follows: "artearardreardac".
I have a text file which contains English dictionary words, close to 300,000 of them. I need to find the English words and be able to form a word square as follows:
C A R D
A R E A
R E A R
D A R T
My intention was to initially loop through the scrambled String and query that text file each time, trying to match 4 characters at a time to see if they form a valid word.
The problem with this is checking against 300,000 words on every loop iteration: it is going to take ages. I looped through only the first letter 16 times and that by itself took a significant time. The number of possibilities coming from this method seems endless. Even if I dismiss the efficiency for now, I could end up finding English words which do not form a word square.
My guess is I have to find words while maintaining the letter formation correctly from the start somehow? I've been at it for hours and gone from fun to frustration. Can I get some guidance please? I looked for similar questions but found none.
Note: This is an example and I am trying to keep it open for a longer string or a square of a different size. (The example is 4x4. The user could decide to go with a 5x5 square and a string of length 25.)
My Code
public static void main(String[] args) {
    String result = wordSquareCreator(4, "artearardreardac");
    System.out.println(result);
}

static String wordSquareCreator(int dimension, String letter) {
    String sortedWord = "";
    String temp;
    int front = 0;
    int firstLetterFront = 0;
    int back = dimension;
    // Looping through first 4 letters and only changing the first letter 16 times to try a match.
    for (int j = 0; j < letter.length(); j++) {
        String a = letter.substring(firstLetterFront, j + 1) + letter.substring(front + 1, back);
        temp = readFile(dimension, a);
        if (temp != null) {
            sortedWord += temp;
        }
        firstLetterFront++;
    }
    return sortedWord;
}

static String readFile(int dimension, String word) {
    // dict text file contains 300,000 English words
    File file = new File("dict.txt");
    BufferedReader reader = null;
    try {
        reader = new BufferedReader(new FileReader(file));
        String text;
        while ((text = reader.readLine()) != null) {
            if (text.length() == dimension) {
                if (text.equals(word)) {
                    // found a valid English word
                    return text;
                }
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        try {
            if (reader != null)
                reader.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    return null;
}
You can greatly cut down your search space if you organize your dictionary properly. (Which can be done as you read it in, you don't need to modify the file on disk.)
Break it up into one list per word length, then sort each list.
Now, to reduce your search space: note that letters occurring an odd number of times can only sit on the diagonal from the top left to the bottom right. You have an odd number of C, T, R and A, so those 4 letters make up this diagonal. (Note that you will not always be able to pin the diagonal down like this, as the odd-count letters aren't guaranteed to be unique.) Your search space is now one set of 4 diagonal letters with 24 orderings and one set of 6 letter pairs with 720 orderings (fewer in practice, since duplicates cut this down). That's about 17k possible boards and under 1k candidate words (edit: I originally said 5k, but you can restrict the space to words starting with the correct letter, and since it's a sorted list you don't need to consider the others at all), so you're already under 20 million possibilities to examine. You can cut this considerably by first filtering your word list down to words that contain only the letters that are used.
At this point an exhaustive search isn't going to be prohibitive.
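As a rough sketch of the organizing step (assuming one word per line in the dictionary file, whose name is a placeholder here):
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

public class DictionaryByLength {
    public static void main(String[] args) throws IOException {
        // one sorted list per word length; the square search then only ever
        // consults the list whose length matches the square's dimension
        Map<Integer, List<String>> byLength = new HashMap<>();
        for (String word : Files.readAllLines(Paths.get("dict.txt"))) {
            byLength.computeIfAbsent(word.length(), k -> new ArrayList<>()).add(word);
        }
        for (List<String> list : byLength.values()) {
            Collections.sort(list);
        }
        // sorted lists allow binary search, e.g. for exact 4-letter lookups:
        List<String> fourLetter = byLength.getOrDefault(4, Collections.emptyList());
        System.out.println("card? " + (Collections.binarySearch(fourLetter, "card") >= 0));
    }
}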
Since it seems that you want to create a word square out of the letters you take in as a parameter, you know that the word length in your square is sqrt(amountOfLetters). In your example code that would be sqrt(16) = 4. You can also disqualify a lot of words directly from your dictionary:
discard a word if it does not start with a letter in your "alphabet" (i.e. "A", "C", "D", "E", "R", "T")
discard a word if its length is not equal to your word length (i.e. 4)
discard a word if it has a letter not in your alphabet
The number of words that you want to "write" in your square is wordlength * 2 (since the words can only start from the upper row or from the left column).
You could actually start by going through your dictionary and copying only the valid words into a new file, then compare your square against this new, shorter dictionary; a sketch of that filtering pass follows.
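Here is a hypothetical version of that pass (file names are placeholders); it checks letter counts, which is slightly stronger than the alphabet-membership rules above:
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

public class DictionaryFilter {
    public static void main(String[] args) throws IOException {
        String letters = "artearardreardac";
        int wordLength = (int) Math.sqrt(letters.length()); // 4 in the example

        // count how many of each letter is available
        int[] available = new int[26];
        for (char c : letters.toCharArray()) {
            available[c - 'a']++;
        }

        List<String> valid = new ArrayList<>();
        for (String word : Files.readAllLines(Paths.get("dict.txt"))) {
            if (word.length() != wordLength) continue; // wrong length
            int[] needed = new int[26];
            boolean ok = true;
            for (char c : word.toCharArray()) {
                // reject letters outside a-z, or used more often than available
                if (c < 'a' || c > 'z' || ++needed[c - 'a'] > available[c - 'a']) {
                    ok = false;
                    break;
                }
            }
            if (ok) valid.add(word);
        }
        Files.write(Paths.get("dict-filtered.txt"), valid);
    }
}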
For building up the square, I think there are 2 possibilities to choose between.
The first one is to randomly arrange the letters into a square and check whether they form correct words.
The second one is to randomly choose "correct" words from the dictionary and write them into your square. After that you check whether the words use a correct amount and arrangement of letters.
Basically I want to create a program which simulates the 'Countdown' game on Channel 4. In effect, a user must input 9 letters and the program will search for the largest word in the dictionary that can be made from these letters. I think a tree structure would be better to go with than hash tables. I already have a file which contains the words in the dictionary and will be using file IO.
This is my file io class:
public static void main(String[] args) {
    FileIO reader = new FileIO();
    String[] contents = reader.load("dictionary.txt");
}
This is what I have so far in my Countdown class
public static void main(String[] args) throws IOException {
    Scanner scan = new Scanner(System.in);
    String letters = scan.nextLine();
}
I get totally lost from here. I know this is only the start, but I'm not looking for answers; I just want a small bit of help and maybe a pointer in the right direction. I'm new to Java and found this question in an interview book, and thought I should give it a go.
Thanks in advance
welcome to the world of Java :)
The first thing I see is that you have two main methods; you don't actually need that. Your program will have a single entry point in most cases, and then it does all its logic and handles user input and everything.
You're thinking of a tree structure, which is good, though there might be a better structure to store this. Try this: http://en.wikipedia.org/wiki/Trie
What your program has to do is read all the words from the file line by line, and in the process build your data structure, the tree. When that's done you can ask the user for input, and after the input is entered you can search the tree.
Since you asked specifically not to be given answers I won't put code here, but feel free to ask if you're unclear about something
There are only about 800,000 words in the English language, so an efficient solution would be to store those 800,000 words as 800,000 arrays of 26 one-byte integers that count how many times each letter is used in the word. For an input of 9 characters, you convert to the same 26-integer count format for the query; a word can then be formed from the query letters if the query vector is greater than or equal to the word vector component-wise. You could easily process on the order of 100 queries per second this way.
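A minimal sketch of that idea (the 9 input letters below are a placeholder; the dictionary file name comes from the question):
import java.nio.file.*;
import java.util.*;

public class LetterCounts {
    // counts[i] is how often letter ('a' + i) occurs in the word
    static byte[] toCounts(String word) {
        byte[] counts = new byte[26];
        for (char c : word.toLowerCase().toCharArray()) {
            if (c >= 'a' && c <= 'z') counts[c - 'a']++;
        }
        return counts;
    }

    // a word can be formed iff the query has at least as many of every letter
    static boolean fits(byte[] query, byte[] word) {
        for (int i = 0; i < 26; i++) {
            if (word[i] > query[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        List<String> dictionary = Files.readAllLines(Paths.get("dictionary.txt"));
        byte[] query = toCounts("ratelspin"); // the user's 9 letters (made up here)
        String best = "";
        for (String word : dictionary) {
            if (word.length() > best.length() && fits(query, toCounts(word))) {
                best = word;
            }
        }
        System.out.println("Longest word: " + best);
    }
}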
I would write a program that starts with all the two-letter words, then does the three-letter words, the four-letter words and so on.
When you do the two-letter words, you'll want some way of picking the first letter, then picking the second letter from what remains. You'll probably want to use recursion for this part. Lastly, you'll check the result against the dictionary. Try to write it in a way that lets you re-use the same code for the three-letter words.
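A rough sketch of that recursive enumeration, with the dictionary stubbed out as a small HashSet for illustration (in practice you would load it from the word file):
import java.util.*;

public class PickLetters {
    static Set<String> dictionary = new HashSet<>(Arrays.asList("at", "tea", "eat"));
    static List<String> found = new ArrayList<>();

    // prefix: letters picked so far; remaining: letters still available
    static void search(String prefix, String remaining, int targetLength) {
        if (prefix.length() == targetLength) {
            if (dictionary.contains(prefix)) found.add(prefix);
            return;
        }
        for (int i = 0; i < remaining.length(); i++) {
            // pick letter i, then recurse on what's left
            search(prefix + remaining.charAt(i),
                   remaining.substring(0, i) + remaining.substring(i + 1),
                   targetLength);
        }
    }

    public static void main(String[] args) {
        String letters = "tae"; // placeholder input letters
        // the same code handles every word length: 2, then 3, and so on
        for (int len = 2; len <= letters.length(); len++) {
            search("", letters, len);
        }
        System.out.println(found); // [at, tea, eat]
    }
}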
I believe the power of regular expressions would come in handy in your case:
1) Create a regular expression with a character class like ^[abcdefghi]*$, with your letters inside instead of "abcdefghi".
2) Use that regular expression as a filter to get an array of strings from your text file.
3) Sort it by length. The longest word is what you need!
Check the Regular Expressions Reference for more information.
UPD: Here is a good Java Regex Tutorial.
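A small sketch of that filtering idea in Java (note that the character class alone lets a letter be reused more often than it appears in the input, so strictly speaking you would still verify letter counts afterwards):
import java.nio.file.*;
import java.util.*;

public class RegexFilter {
    public static void main(String[] args) throws Exception {
        String letters = "abcdefghi"; // the user's 9 letters (placeholder)
        String regex = "^[" + letters + "]*$";

        // keep only words built from the allowed letters, then take the longest
        String longest = "";
        for (String word : Files.readAllLines(Paths.get("dictionary.txt"))) {
            if (word.matches(regex) && word.length() > longest.length()) {
                longest = word;
            }
        }
        System.out.println("Longest candidate: " + longest);
    }
}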
A first approach could be using a tree with all the letters present in the word list.
If a node is the end of a word, it is marked as an end-of-word node.
For example, in a tree containing ball, ban, banal, and banana, the longest word is banana, but the other words share parts of the same path.
So, a node must have:
A character
A flag for whether it is the end of a word
A list of children (max 26)
The insertion algorithm is very simple: In each step we "cut" the first character of the word until the word has no more characters.
public class TreeNode {
    public char c;
    private boolean isEndOfWord = false;
    private TreeNode[] children = new TreeNode[26];

    public TreeNode(char c) {
        this.c = c;
    }

    public void put(String s) {
        if (s.isEmpty()) {
            this.isEndOfWord = true;
            return;
        }
        char first = s.charAt(0);
        int pos = position(first);
        if (this.children[pos] == null)
            this.children[pos] = new TreeNode(first);
        this.children[pos].put(s.substring(1));
    }

    public String search(char[] letters) {
        String word = "";
        String w = "";
        for (int i = 0; i < letters.length; i++) {
            TreeNode child = children[position(letters[i])];
            if (child != null)
                w = child.search(letters);
            // this is not efficient. It should be optimized.
            if (w.contains("%") && w.substring(0, w.lastIndexOf("%")).length() > word.length())
                word = w;
        }
        // if a node is end-of-word we add the special char '%'
        return c + (this.isEndOfWord ? "%" : "") + word;
    }

    // if 'a' returns 0, if 'b' returns 1... etc
    public static int position(char c) {
        return ((byte) c) - 97;
    }
}
Example:
public static void main(String[] args) {
    // root
    TreeNode t = new TreeNode('R');
    // for skipping words with "'" in the wordlist
    Pattern p = Pattern.compile(".*\\W+.*");
    int nw = 0;
    try (BufferedReader br = new BufferedReader(new FileReader("files/wordsEn.txt"))) {
        for (String line; (line = br.readLine()) != null;) {
            if (p.matcher(line).find())
                continue;
            t.put(line);
            nw++;
        }
        // line is not visible here.
        br.close();
        System.out.println("number of words : " + nw);
        String res = null;
        // substring(1) because of the root
        res = t.search("vuetsrcanoli".toCharArray()).substring(1);
        System.out.println(res.replace("%", ""));
    }
    catch (Exception e) {
        e.printStackTrace();
    }
}
Output:
number of words : 109563
counterrevolutionaries
Notes:
The wordlist is taken from here
The reading part is based on another SO question: How to read a large text file line by line using Java?
I have about 500 sentences from which I would like to compile a set of ngrams. I am having trouble removing the stop words. I tried adding the Lucene StandardFilter and StopFilter but I still have the same problem. Here is my code:
for (String curS : Sentences) {
    reader = new StringReader(curS);
    tokenizer = new StandardTokenizer(Version.LUCENE_36, reader);
    tokenizer = new StandardFilter(Version.LUCENE_36, tokenizer);
    tokenizer = new StopFilter(Version.LUCENE_36, tokenizer, stopWords);
    tokenizer = new ShingleFilter(tokenizer, 2, 3);
    charTermAttribute = tokenizer.addAttribute(CharTermAttribute.class);
    while (tokenizer.incrementToken()) {
        curNGram = charTermAttribute.toString();
        nGrams.add(curNGram); // store each token into an ArrayList
    }
}
For example, the first phrase I am testing is: "For every person that listens to". In this example curNGram is set to "For", which is a stop word in my list stopWords. Also, in this example "every" is a stop word, so "person" should be the first ngram.
Why are stop words being added to my list when I am using the StopFiler?
All help is appreciated!
What you've posted looks okay to me, so I suspect that stopWords isn't providing the information you want to the filter.
Try something like:
// Let's say we read the stop words into a list (a simple array, or any List implementation, should be fine)
List<String> words = new ArrayList<>();
// ... read the file into words ...
Set stopWords = StopFilter.makeStopSet(Version.LUCENE_36, words, true);
Assuming the list of stopwords you generated (the one I've named 'words') looks like you think it does, this should put them into a format usable by the StopFilter.
Were you already generating stopWords like that?