Java Lucene Stop Words Filter

I have about 500 sentences from which I would like to compile a set of n-grams. I am having trouble removing the stop words. I tried adding the Lucene StandardFilter and StopFilter, but I still have the same problem. Here is my code:
for (String curS : Sentences)
{
    reader = new StringReader(curS);
    tokenizer = new StandardTokenizer(Version.LUCENE_36, reader);
    tokenizer = new StandardFilter(Version.LUCENE_36, tokenizer);
    tokenizer = new StopFilter(Version.LUCENE_36, tokenizer, stopWords);
    tokenizer = new ShingleFilter(tokenizer, 2, 3);
    charTermAttribute = tokenizer.addAttribute(CharTermAttribute.class);
    while (tokenizer.incrementToken())
    {
        curNGram = charTermAttribute.toString();
        nGrams.add(curNGram); // store each token in an ArrayList
    }
}
For example, the first phrase I am testing is: "For every person that listens to". In this example curNGram is set to "For" which is a stop word in my list stopWords. Also, in this example "every" is a stop word and so "person" should be the first ngram.
Why are stop words being added to my list when I am using the StopFilter?
All help is appreciated!

What you've posted looks okay to me, so I suspect that stopWords isn't providing the information you want to the filter.
Try something like:
//Let's say we read the stop words into a list (a simple array, or any List implementation, should be fine)
List<String> words = new ArrayList<String>();
//Read the file into words.
Set stopWords = StopFilter.makeStopSet(Version.LUCENE_36, words, true);
Assuming the list of stop words you generated (the one I've named 'words') looks like you think it does, this should put them into a format usable by the StopFilter.
Were you already generating stopWords like that?

Related

Reading from two textfiles: one array to keep, one array to omit from textfile

This is the problem:
I'm given a text with certain words and characters that I am to omit. I am to create a method that will return an array of words with two file arguments like this:
public Word[] keyWordsList(File inputFile1, File inputFile2)
inputFile2 contains a bunch of words and punctuations that I am to ignore from what is contained in inputFile1.
inputFile2 contains the following:
of the a in on , . ?
inputFile1 contains a whole paragraph of text, and I am to push each of those words into the Word[].
I just need help understanding how I can accurately keep all the words while omitting the ones from inputFile2, since inputFile2 contains both whole words and punctuation characters.
This is my code to solve the problem (which has been fairly successful), but I don't know how to handle the case where the punctuation comes right after a word.
Scanner ignore = new Scanner(inputFile2);
Scanner important = new Scanner(inputFile1);

ArrayList<String> ignoreArray = new ArrayList<>();
while (ignore.hasNext()) {
    String ignoreWord = ignore.next();
    ignoreArray.add(ignoreWord);
}

ArrayList<String> importantWords = new ArrayList<>();
while (important.hasNext()) {
    String word = important.next();
    if (ignoreArray.contains(word))
        continue;
    else
        importantWords.add(word);
}
I'm getting results like this:
[This, is, input, file, to, create, key, words, list., If, we, ever, want, to, know, how, background, job, works,, fastest, way, to, find, k, smallest, elements, an, array,,] etc.etc.
From this:
This is the input file to create key words list.
If we ever want to know how background job works, fastest way to find k smallest elements in an array,
I would appreciate any help. Thank you!
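In case it helps, one way to handle punctuation stuck to a word is to strip leading and trailing punctuation before the ignore-list check, so "list." compares equal to "list". A minimal sketch under that assumption (class and method names here are illustrative, and the sample words are hard-coded in place of the file reads):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class WordFilter {
    // Strip leading/trailing punctuation so "list." matches the ignore word "list"
    static String stripPunctuation(String word) {
        return word.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
    }

    // Keep every word from 'input' whose punctuation-stripped form is not in the ignore set
    static List<String> keepWords(List<String> input, Set<String> ignore) {
        List<String> kept = new ArrayList<>();
        for (String raw : input) {
            String word = stripPunctuation(raw);
            if (!word.isEmpty() && !ignore.contains(word)) {
                kept.add(word);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> ignore = new HashSet<>(Arrays.asList("of", "the", "a", "in", "on"));
        List<String> input = Arrays.asList("This", "is", "the", "input", "file,", "a", "list.");
        System.out.println(keepWords(input, ignore)); // [This, is, input, file, list]
    }
}
```

With the real files, the `input` list would be the words pulled from `inputFile1` with `important.next()`, exactly as in the question's loop.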

Scanning a large number of documents for tens of words

I have a large number of documents (over a million) which I need to regularly scan and match to about 100 "multi-word keyword" (i.e not just keywords like "movies" but also "north american"). I have the following code which works fine with single words keywords (i.e "book"):
/**
 * Scan a text for certain keywords
 * @param keywords the list of keywords we are searching for
 * @param text the text we will be scanning
 * @return a list of any keywords from the list which we could find in the text
 */
public static List<String> scanWords(List<String> keywords, String text) {
    // prepare the BreakIterator
    BreakIterator wb = BreakIterator.getWordInstance();
    wb.setText(text);
    List<String> results = new ArrayList<String>();
    // iterate word by word
    int start = wb.first();
    for (int end = wb.next(); end != BreakIterator.DONE; start = end, end = wb.next()) {
        String word = text.substring(start, end);
        if (!StringUtils.isEmpty(word) && keywords.contains(word)) {
            // we have this word in our keywords so return it
            results.add(word);
        }
    }
    return results;
}
Note: I need this code to be as efficient as possible as the number of documents is very large.
My current code fails to find any of the two-word keywords. Any ideas on how to fix this? I am also open to a completely different approach.
Scanning every document does not scale at all. Better to index your documents in an inverted index.
Or, as suggested in the comments, use Lucene.
I believe creating an instance of Scanner would work for this. The Scanner class has a method that lets you search text for a pattern, which would be the keywords in your case.
Scanner scanner = new Scanner(text);
while (scanner.hasNext()) {
    scanner.findInLine(pattern); // pattern is the keyword string you are searching for
    scanner.next();
}
The Scanner class is good for this sort of thing, and I believe it would work fine for what you need.
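Another possible approach: since BreakIterator yields one word at a time, a multi-word phrase can never match a single token. Searching the text for each phrase with a case-insensitive, word-boundary regex sidesteps that. A hedged sketch (class name and sample keywords are illustrative, not from the question):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class PhraseScanner {
    // Return every keyword (single- or multi-word) that appears in the text,
    // matching on word boundaries and ignoring case
    static List<String> scanPhrases(List<String> keywords, String text) {
        List<String> results = new ArrayList<>();
        for (String keyword : keywords) {
            Pattern p = Pattern.compile("\\b" + Pattern.quote(keyword) + "\\b",
                    Pattern.CASE_INSENSITIVE);
            if (p.matcher(text).find()) {
                results.add(keyword);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> keywords = Arrays.asList("movies", "north american", "book");
        String text = "A survey of North American movies and television.";
        System.out.println(scanPhrases(keywords, text)); // [movies, north american]
    }
}
```

For over a million documents, the patterns should at least be compiled once and reused across documents; beyond that, an inverted index (or Lucene, as suggested above) is the scalable route.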

Splitting a line from a textfile into two separate Arrays

I must apologize in advance if this question has been answered, I have not been able to find an answer to this problem on the forum thus far.
I am working on a Project Euler Problem, Problem 54 to be exact, and I am required to read in 1000 lines of data, each line contains two randomly generated Poker hands, the first half of the line being Player 1's hand, the second being Player 2's.
Below is an example of the layout of the document:
8CTSKC9H4S 7D2S5D3SAC
5CAD5DAC9C 7C5H8DTDKS
3H7H6SKCJS QHTDJC2D8S
TH8H5CQSTC 9H4DJCKSJS
What I have done thus far is manually make two documents, copying the second hand from each line into a new document. However, I did that twice and came up with two different results, so I think I am making an error somewhere with all the copying.
Below is an example of the approach I used:
ArrayList<String> player1 = new ArrayList<String>();
ArrayList<String> player2 = new ArrayList<String>();

String file1 = "FilePath1";
String file2 = "FilePath2";

Scanner input1 = new Scanner(new File(file1));
while (input1.hasNext()) {
    player1.add(input1.nextLine());
}

Scanner input2 = new Scanner(new File(file2));
while (input2.hasNext()) {
    player2.add(input2.nextLine());
}

input1.close();
input2.close();
What I would like to know is how I could read only the first hand into one ArrayList, and only the second hand into another, without having to create two separate documents and risk compromising the data. I am not sure how I would use the split function here, if that is the way to go. I am fairly new to programming and trying to teach myself, so I apologize if this is an overly simple problem.
Thank you very much in advance
You can split on a space (or on any whitespace).
For example:
String currLine = input1.nextLine();
// This is only needed if you are not sure whether the input will have leading/trailing space
currLine = currLine.trim();
String[] split = currLine.split("\\s+");
// Ensure the line read was properly split
if (split.length == 2) {
    // split[0] will have the first hand
    player1.add(split[0]);
    // split[1] will have the second hand
    player2.add(split[1]);
}

How to implement Lists of Hashmap/ArrayList

Hi everyone, I am having a problem trying to get this to work. Basically, I want to read a text file containing data similar to the sample below and count the frequency of each character appearing on each line. The real data can contain any ASCII character from 0-255.
An examples is:
Hi this is john.
We are going .4%2) &,.! m#ll
What I wanted to have is something like this, implemented as a List of Maps:
{H=3, i=3, ' '=3, t=1, h=2, s=2,... until the end of the line },
{W=1, e=2, ' '=4, a=1, r=1, g=2, o=1, i=1, n=1, .=2, 4=1, %=1.... until the end of line},
so it's a List of Maps.
I have tried to research on similar questions but the closest I can do in coding it is this.
List<Map<String, Integer>> storeListsofMaps = new ArrayList<Map<String, Integer>>();
ArrayList<String> storePerLine = new ArrayList<String>();
String getBuf;
try {
    FileReader rf = new FileReader("simpleTextCharDist.txt");
    BufferedReader encapRF = new BufferedReader(rf);
    getBuf = encapRF.readLine();
    while (getBuf != null) {
        storePerLine.add(getBuf);
        getBuf = encapRF.readLine();
    }
    for (String index : storePerLine) {
        Map<String, Integer> storeCharAndCount = new HashMap<String, Integer>();
        Integer count = storeCharAndCount.get(index);
        storeCharAndCount.put(index, (count == null) ? 1 : count + 1);
        storeListsofMaps.add(storeCharAndCount);
    }
    System.out.println("StoreListsofMaps: " + storeListsofMaps);
    encapRF.close();
} catch (IOException e) {
    e.printStackTrace();
}
I know this code does not do what I described; I am stuck at this part. The code shown counts each whole line as a key rather than each letter in the string. I tried iterating over each character by converting the string into a char[] and back to a String again, but it was very inefficient and produced a lot of errors. I hope someone will be kind enough to help.
Here is a pseudo-algorithm to achieve this:
1. Using file I/O, create a list containing one line per element.
2. Write a small helper function which will:
   - take a String (an element from the list created in step 1)
   - iterate through the line
   - create a map of char to count; this map should be the return type.
3. Create a Map<String, Map<Character, Integer>>, where the outer String key is "Line1", "Line2", etc., and the inner map is the one returned from step 2.
This should work.
Think about what you are trying to do. Write down your algorithm in text form. Think about when you have to create your variables, and which types your variables need to have. Compare your written algorithm with your actual code.
Example algorithm:
Open file
Create a list of maps of characters to integers (ArrayList<Map<Character, Integer>>)
Read all lines; for each line:
    Create a map char -> int for that line (HashMap<Character, Integer>)
    For each character c in the line:
        update the count in the map
    Store the map for that line in the list of maps
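That algorithm can be sketched as follows. The lines are supplied as an in-memory list rather than read from the file, just to keep the example self-contained; the BufferedReader loop from the question would fill the list the same way:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CharFrequency {
    // Build one character-count map per line
    static List<Map<Character, Integer>> countPerLine(List<String> lines) {
        List<Map<Character, Integer>> maps = new ArrayList<>();
        for (String line : lines) {
            Map<Character, Integer> counts = new HashMap<>();
            for (char c : line.toCharArray()) {
                // Increment the count for this character, starting at 1 if unseen
                Integer count = counts.get(c);
                counts.put(c, count == null ? 1 : count + 1);
            }
            maps.add(counts);
        }
        return maps;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("Hi this is john.", "We are going .4%2) &,.! m#ll");
        System.out.println(countPerLine(lines));
    }
}
```

The key change from the question's code is that the map key is the character `c` inside the line, not the whole line `index`; `toCharArray()` once per line is cheap and avoids the char-to-String round trips.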

How to calculate total hits

I am parsing several log files and searching for a particular String in them.
I look through each line; once I find the string, I create a Map with the string and a label as the key,
like: Map result = new HashMap(); result.put("Report Page", line.substring(60));
I then add these Maps to a list, iterate through the list, and display my table.
What I want is to output the number of times the string occurred in the files.
Desired output:
Name    Value    Occurrences
...
...
...
Could someone please help?
(Note :This is not a homework project.)
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
String line;
while ((line = reader.readLine()) != null) {
    Map result = new HashMap();
    if (line.contains("Parm Name/Value:REPORT_PAGE")) {
        result.put("Report Page", line.substring(60));
    }
    rows.add(result);
}
The question is a bit unclear; I hope I understood you correctly.
You are currently mapping a fixed label to the substring itself.
It also seems, for some reason, that you create a map for each line, even for lines that don't match.
Are you sure that is what you want to do?
Anyway, what I think you want is a hash map which maps strings to integers.
Please paste a more complete code...
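As a sketch of that idea: keep one Map from the extracted value to an Integer count and increment it for each matching line. The marker string is taken from the question, but using the text after the marker as the key is illustrative, since the fixed offset 60 is specific to the asker's log format; the lines are supplied in-memory so the example runs as-is:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HitCounter {
    // Count how many times each report-page value occurs across the log lines
    static Map<String, Integer> countHits(List<String> lines) {
        Map<String, Integer> occurrences = new HashMap<>();
        for (String line : lines) {
            if (line.contains("Parm Name/Value:REPORT_PAGE")) {
                // Illustrative key extraction; the question used line.substring(60)
                String value = line.substring(line.indexOf("REPORT_PAGE"));
                Integer count = occurrences.get(value);
                occurrences.put(value, count == null ? 1 : count + 1);
            }
        }
        return occurrences;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "2012-01-01 Parm Name/Value:REPORT_PAGE=home",
                "2012-01-01 Parm Name/Value:REPORT_PAGE=home",
                "2012-01-02 Parm Name/Value:REPORT_PAGE=search");
        System.out.println(countHits(lines));
    }
}
```

The occurrence count for each value then comes straight out of the map when you render the table, instead of being recomputed from the row list.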
Check http://guava-libraries.googlecode.com/svn/tags/release09/javadoc/index.html. This is the right choice for your use case.
