How to remove stop words in java? - java

I want to remove stop words in Java.
I read the stop words from a text file
and store them in a Set:
Set<String> stopWords = new LinkedHashSet<String>();
BufferedReader br = new BufferedReader(new FileReader("stopwords.txt"));
String words = null;
while( (words = br.readLine()) != null) {
stopWords.add(words.trim());
}
br.close();
Then I read another text file.
I want to remove the strings in that file that duplicate entries in the stop-word set.
How can I do that?

Use a Set for the stop words:
Set<String> stopWords = new LinkedHashSet<String>();
BufferedReader SW= new BufferedReader(new FileReader("StopWord.txt"));
for(String line;(line = SW.readLine()) != null;)
stopWords.add(line.trim());
SW.close();
and an ArrayList for the input text file:
BufferedReader br = new BufferedReader(new FileReader("txt_file.txt"));
// build your ArrayList of words here
// deletStopWord() removes every word that appears in your "stopword.txt"
public ArrayList<String> deletStopWord(Set<String> stopWords, ArrayList<String> arraylist) {
    ArrayList<String> newList = new ArrayList<String>();
    // Start at index 0 so that no words are skipped
    for (int i = 0; i < arraylist.size(); i++) {
        // Keep only the words that are not in the stop-word set
        if (!stopWords.contains(arraylist.get(i))) {
            newList.add(arraylist.get(i));
        }
    }
    System.out.println(newList);
    return newList;
}
arraylist=deletStopWord(stopWords,arraylist);

If you want to remove duplicate words from a file, the high-level logic is:
Read the file
Loop through the file content (i.e. one line at a time)
Tokenize each line on spaces
Add each token to your set. This ensures that you have only one entry per word.
Close the file
Now you have a set that contains all the unique words of the file.
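The steps above could be sketched roughly as follows. This is only an illustration: the class and method names (UniqueWords, uniqueWords) are made up, and the lines are passed in as a List rather than read from a file so that the example is self-contained (Files.readAllLines could supply them in practice).

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class UniqueWords {

    /** Collects every distinct whitespace-separated word, in first-seen order. */
    public static Set<String> uniqueWords(List<String> lines) {
        Set<String> unique = new LinkedHashSet<>();
        for (String line : lines) {
            // split("\\s+") tokenizes on runs of whitespace
            for (String token : line.trim().split("\\s+")) {
                if (!token.isEmpty()) {   // a blank line yields one empty token
                    unique.add(token);    // a Set silently ignores repeats
                }
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("the cat and the dog", "the dog barks");
        System.out.println(uniqueWords(lines)); // prints [the, cat, and, dog, barks]
    }
}
```

A LinkedHashSet is used here so the words keep their original order; a HashSet would work too if order does not matter.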

Using an ArrayList may be easier.
public ArrayList<String> removeDuplicates(ArrayList<String> source) {
    ArrayList<String> newList = new ArrayList<String>();
    for (int i = 0; i < source.size(); i++) {
        String s = source.get(i);
        // Add each string only the first time it is seen
        if (!newList.contains(s)) {
            newList.add(s);
        }
    }
    return newList;
}
Hope this helps.
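For long lists, a possible alternative is to let a LinkedHashSet do the duplicate check, since List.contains scans the whole list on every call. The sketch below is only a suggestion; the class name Dedupe is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

public class Dedupe {

    /** Returns a copy of 'source' without duplicates, keeping first occurrences in order. */
    public static List<String> removeDuplicates(List<String> source) {
        // LinkedHashSet drops repeats and remembers insertion order
        return new ArrayList<>(new LinkedHashSet<>(source));
    }

    public static void main(String[] args) {
        System.out.println(removeDuplicates(Arrays.asList("a", "b", "a", "c", "b"))); // prints [a, b, c]
    }
}
```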

If you simply want to remove a certain set of words from the words in a file, you can do it however you want. But if you are dealing with a problem involving natural language processing, you should use a library.
For example, using Lucene for tokenizing will seem more complicated at first, but it will deal with myriad complications that you will overlook, and allow for great flexibility should you change your mind on the specific stopwords, on how you are tokenizing, whether you care about case, etc.

You should try using StringTokenizer.
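A minimal sketch of what that might look like, assuming the stop words are stored in lower case and the text is split on whitespace only (the class name StopWordTokenizer and the sample words are made up for illustration):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.StringTokenizer;

public class StopWordTokenizer {

    /** Returns the words of 'line' that are not in 'stopWords' (compared in lower case). */
    public static List<String> removeStopWords(String line, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        // With no delimiter argument, StringTokenizer splits on whitespace
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            String token = tokenizer.nextToken();
            if (!stopWords.contains(token.toLowerCase(Locale.ROOT))) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "is", "in"));
        System.out.println(removeStopWords("The quick fox is in the box", stopWords));
        // prints [quick, fox, box]
    }
}
```

Punctuation attached to a word (for example "box.") is not stripped by this sketch, so real text would need extra delimiters or cleanup.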

Related

complete indexing of text file java

I'm trying to read a text file, sort the words in it alphabetically, and display the line numbers on which those words appear.
I'm new to Java, so I'm not sure what the most efficient approach is.
My plan so far is to:
-use a Scanner to parse the file into one string
-string.split
-lineCount++
-(somehow sort those split strings alphabetically)
-print the sorted words with line numbers next to them
Is that the best way of going about this? I'm not sure whether Java has some sort of ordered dictionary I could use.
A Scanner is fine, as you could scan per word, not even needing a split.
A BufferedReader would be for line-wise reading, and there exists a LineNumberReader for your goal: counting lines.
I would also indicate the encoding of the file.
SortedMap<String, SortedSet<Integer>> linenosPerWord = new TreeMap<>();
// A BufferedReader with a line number counter:
try (LineNumberReader in = new LineNumberReader(new InputStreamReader(
        new FileInputStream(file), StandardCharsets.UTF_8))) {
    for (;;) {
        String line = in.readLine();
        if (line == null) {
            break;
        }
        int lineno = in.getLineNumber();
        // Split on runs of characters that are neither letters nor combining marks
        String[] words = line.split("[^\\p{L}\\p{M}]+");
        for (String word : words) {
            word = word.toLowerCase(); // Possibly with a Locale
            SortedSet<Integer> linenos = linenosPerWord.get(word);
            if (linenos == null) {
                linenos = new TreeSet<>();
                linenosPerWord.put(word, linenos);
            }
            linenos.add(lineno);
        }
    }
}
linenosPerWord.remove(""); // Remove a possibly found empty word, like in "-Hello"
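As a rough, self-contained variant of the same idea, the sketch below builds the index from lines already held in memory and prints it in alphabetical order. The class name WordIndex and the sample lines are assumptions made for the example; computeIfAbsent (Java 8+) replaces the explicit null check.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

public class WordIndex {

    /** Maps each lower-cased word to the sorted set of 1-based line numbers it occurs on. */
    public static SortedMap<String, SortedSet<Integer>> buildIndex(List<String> lines) {
        SortedMap<String, SortedSet<Integer>> index = new TreeMap<>();
        int lineno = 0;
        for (String line : lines) {
            lineno++;
            for (String word : line.split("[^\\p{L}\\p{M}]+")) {
                if (word.isEmpty()) {
                    continue; // a leading non-letter produces one empty token
                }
                index.computeIfAbsent(word.toLowerCase(Locale.ROOT), k -> new TreeSet<>())
                     .add(lineno);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        SortedMap<String, SortedSet<Integer>> index =
                buildIndex(Arrays.asList("Hello world", "-hello again"));
        // TreeMap iterates its keys in sorted order
        for (Map.Entry<String, SortedSet<Integer>> e : index.entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```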

sending all read lines to string array

I have a concept, but I'm not sure how to go about it. I would like to parse a website and use regex to find certain parts, then store these parts in a string array. Afterwards I would like to do the same again and find the differences between before and after.
The plan:
1. Parse/regex, add the lines found to the array before.
2. Refresh the website, parse/regex, add the lines found to the array after.
3. Compare all strings in before with all strings in after. println any new ones.
4. Copy all after strings to the before strings.
5. Then repeat from 2. forever.
Basically it's just checking a website for updated code and telling me what's updated.
Firstly, is this doable?
Here's my code for part 1.
String before[] = {};
int i = 0;
while ((line = br.readLine()) != null) {
Matcher m = p.matcher(line);
if (m.find()) {
before[i]=line;
System.out.println(before[i]);
i++;
}
}
It doesn't work and I am not sure why.
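The array before is created with length zero ({}), so before[i] = line throws an ArrayIndexOutOfBoundsException; a List grows as needed. You could do something like this, assuming you're reading from a file: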
List<String> oldLines = new ArrayList<String>();
List<String> newLines = new ArrayList<String>();
Scanner s = new Scanner(new File("oldLinesFilePath"));
while (s.hasNextLine()) {
    oldLines.add(s.nextLine());
}
s.close(); // close the first scanner before reusing the variable
s = new Scanner(new File("newLinesFilePath"));
while (s.hasNextLine()) {
    newLines.add(s.nextLine());
}
s.close();
// Print every line that appears only in the new file
for (int i = 0; i < newLines.size(); i++) {
    if (!oldLines.contains(newLines.get(i))) {
        System.out.println(newLines.get(i));
    }
}
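If the lists get large, one option is to copy the old lines into a HashSet first, so each lookup does not scan the whole list. This is a sketch under that assumption; the class and method names (LineDiff, addedLines) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class LineDiff {

    /** Returns the lines of 'newLines' that do not occur anywhere in 'oldLines'. */
    public static List<String> addedLines(List<String> oldLines, List<String> newLines) {
        Set<String> seen = new HashSet<>(oldLines); // constant-time membership test
        List<String> added = new ArrayList<>();
        for (String line : newLines) {
            if (!seen.contains(line)) {
                added.add(line);
            }
        }
        return added;
    }

    public static void main(String[] args) {
        List<String> before = Arrays.asList("a", "b");
        List<String> after = Arrays.asList("b", "c", "a", "d");
        System.out.println(addedLines(before, after)); // prints [c, d]
    }
}
```

After printing the differences, the "after" list can simply become the "before" list for the next round.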

Searching a text file in java and Listing the results

I've really searched around for ideas on how to go about this, and so far nothing's turned up.
I need to search a text file via keywords entered in a JTextField and present the search results to the user in rows, like Google does. The text file has a lot of content, about 22,000 lines of text. I want to filter out the lines that contain none of the words specified in the JTextField and present only the lines containing at least one of those words, with each row of the results being a line from the text file.
Does anyone have any ideas on how to go about this? I would really appreciate any kind of help. Thank you in advance.
You can read the file line by line and search in every line for your keywords. If you find one, store the line in an array.
But first split your text box String on whitespace and create the array:
String[] keyWords = yourTextBoxString.split(" ");
ArrayList<String> results = new ArrayList<String>();
Reading the file line by line:
void readFileLineByLine(File file) throws IOException {
    // try-with-resources closes the reader even if reading fails
    try (BufferedReader br = new BufferedReader(new FileReader(file))) {
        String line;
        while ((line = br.readLine()) != null) {
            processOneLine(line);
        }
    }
}
Processing the line:
void processOneLine(String line) {
for (String currentKey : keyWords) {
if (line.contains(currentKey)) {
results.add(line);
break;
}
}
}
I have not tested this, but it should give you an overview of how you can do it.
If you need more speed, you can also use a regular expression to search for the keywords, so you don't need this for loop.
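One way the regular-expression variant might look is sketched below. The class and method names (KeywordSearch, compileKeywords, search) are made up, the lines are passed in as a List to keep the example self-contained, and whether this is actually faster than the loop would need measuring. Pattern.quote is used so that keywords containing regex metacharacters are matched literally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class KeywordSearch {

    /** Builds one pattern that matches any of the whitespace-separated keywords. */
    public static Pattern compileKeywords(String userInput) {
        String alternation = Arrays.stream(userInput.trim().split("\\s+"))
                .filter(k -> !k.isEmpty())
                .map(Pattern::quote)              // treat each keyword as literal text
                .collect(Collectors.joining("|"));
        if (alternation.isEmpty()) {
            return Pattern.compile("(?!)");       // no keywords: a pattern that never matches
        }
        return Pattern.compile(alternation);
    }

    /** Returns the lines that contain at least one keyword. */
    public static List<String> search(List<String> lines, String userInput) {
        Pattern pattern = compileKeywords(userInput);
        List<String> results = new ArrayList<>();
        for (String line : lines) {
            if (pattern.matcher(line).find()) {
                results.add(line);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("red apple", "green pear", "blue sky");
        System.out.println(search(lines, "apple sky")); // prints [red apple, blue sky]
    }
}
```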
Read in the file, as per the Oracle tutorial, http://docs.oracle.com/javase/tutorial/essential/io/file.html#textfiles. Iterate through each line and search for your keyword(s) using String's contains method. If the line contains the search phrase, place the line and line number in a results List. When you've finished, you can display the results list to the user.
You need a method as follows:
List<String> searchFile(String path, String match) {
    List<String> linesToPresent = new ArrayList<String>();
    Pattern p = Pattern.compile(match); // compile once, outside the loop
    try (BufferedReader br = new BufferedReader(new FileReader(path))) {
        String line;
        // readLine() returns null at end of file, so check before matching
        while ((line = br.readLine()) != null) {
            if (p.matcher(line).find()) {
                linesToPresent.add(line);
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    return linesToPresent;
}
It searches a file line by line and checks with regex if a line contains a "match" String. If you have many Strings to check you can change the second parameter to String[] match and with a foreach loop check for each String match.
You can use:
FileUtils
Its readLines method (for example, in Apache Commons IO) reads each line and returns a List<String>.
You can iterate over this List<String> and check whether each String contains the word entered by the user; if it does, add it to another List<String>. At the end you will have a second List<String> holding all the lines that contain the word entered by the user. You can iterate over this List<String> and display the result to the user.

Java Scanner multiple delimiter

I know this question is asked a lot, but I can't find an appropriate answer for my problem. I have to write an application that reads from a .TXT file like this
Real:Atelti
Alcorcon:getafe
Barcelona:Sporting
My question is: how can I tell Java that I want the Strings before : in one ArrayList and the Strings after : in another ArrayList? I guess it involves the delimiter method, but I don't know how to use it in this case.
Sorry for my poor English, I have to improve it, I guess. Thanks
Use Java's split function.
Steps:
1. Declare two ArrayLists, l1 and l2.
2. Read each line.
3. Split each line on ":"; for your input this returns an array of length 2.
4. l1.add(array[0]), l2.add(array[1])
Try it yourself, and post your code if you need help :)
Check here for the use of the split function, though you can find many different examples through Google.
Split the string using ":" as delimiter. Add the odd entries from the result to one list and even to another list.
If your text is like this:
Real:Atelti
Alcorcon:getafe
Barcelona:Sporting
You can achieve what you want by using:
Scanner scanner = new Scanner(new FileInputStream(fFileName), encoding); // try "UTF-8" for 'encoding'
try {
    while (scanner.hasNextLine()) {
        String line = scanner.nextLine();
        String[] parts = line.split(":"); // split once and reuse the result
        String before = parts[0];
        String after = parts[1];
        // do something with 'before' and 'after', e.g. add them to lists
    }
}
finally {
    scanner.close();
}
Scanner scanner = new Scanner(new FileInputStream("YOUR_FILE_PATH"));
List<String> firstList = new ArrayList<String>();
List<String> secondList = new ArrayList<String>();
while(scanner.hasNextLine()) {
String currentLine = scanner.nextLine();
String[] tokenizedString = currentLine.split(":");
firstList.add(tokenizedString[0]);
secondList.add(tokenizedString[1]);
}
scanner.close();
Enumerating firstList and secondList will get you the desired result.
1. Use ":" as the delimiter.
2. Then store the parts in a String[] using the split() function.
3. Try using BufferedReader instead of Scanner.
E.g.:
File f = new File("d:\\Mytext.txt");
FileReader fr = new FileReader(f);
BufferedReader br = new BufferedReader(fr);
ArrayList<String> s1 = new ArrayList<String>();
ArrayList<String> s2 = new ArrayList<String>();
String line;
// Call readLine() exactly once per iteration so that no line is skipped
while ((line = br.readLine()) != null) {
    String[] parts = line.split(":");
    s1.add(parts[0]);
    s2.add(parts[1]);
}
br.close();
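All of the snippets above assume every line contains a ":". A slightly more defensive sketch, with made-up names (PairSplitter, splitPairs) and the lines passed in as a List for brevity, could skip lines that have no delimiter and use split(":", 2) so that any further ":" stays in the second part:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PairSplitter {

    /** Appends the text before the first ':' to 'before' and the text after it to 'after'. */
    public static void splitPairs(List<String> lines, List<String> before, List<String> after) {
        for (String line : lines) {
            String[] parts = line.split(":", 2); // at most two parts
            if (parts.length < 2) {
                continue;                         // blank line or no ':' at all
            }
            before.add(parts[0].trim());
            after.add(parts[1].trim());
        }
    }

    public static void main(String[] args) {
        List<String> home = new ArrayList<>();
        List<String> away = new ArrayList<>();
        splitPairs(Arrays.asList("Real:Atelti", "", "Alcorcon:getafe", "Barcelona:Sporting"),
                   home, away);
        System.out.println(home); // prints [Real, Alcorcon, Barcelona]
        System.out.println(away); // prints [Atelti, getafe, Sporting]
    }
}
```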

How to read a file from end to the beginning?

How can I read a file from the end to the beginning? My code:
try
{
String strpath="/var/date.log";
FileReader fr = new FileReader(strpath);
BufferedReader br = new BufferedReader(fr);
String ch;
String[] Arr;
do
{
ch = br.readLine();
if (ch != null)
out.print(ch+"<br/>");
}
while (ch != null);
fr.close();
}
catch(IOException e){
out.print(e.getMessage());
}
You can use RandomAccessFile class. Either just read in loop from file length to 0, or use some convenience 3rd party wraps like http://mattfleming.com/node/11
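A rough sketch of the "read in a loop from the file length down to 0" idea is shown below. It is an illustration only: the class name ReverseLineReader is hypothetical, the file is assumed to be UTF-8 with "\n" or "\r\n" line endings, and seeking once per byte is slow on large files (a real implementation would read backwards in blocks).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ReverseLineReader {

    /** Returns the lines of the file at 'path', last line first. */
    public static List<String> readLinesBackwards(String path) throws IOException {
        List<String> lines = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            long length = raf.length();
            ByteArrayOutputStream buf = new ByteArrayOutputStream(); // current line, bytes reversed
            for (long pos = length - 1; pos >= 0; pos--) {
                raf.seek(pos);
                int b = raf.read();
                if (b == '\n') {
                    // A newline at the very end of the file does not start a new line
                    if (pos != length - 1) {
                        lines.add(decodeReversed(buf));
                        buf.reset();
                    }
                } else if (b != '\r') {   // carriage returns are dropped
                    buf.write(b);
                }
            }
            if (length > 0) {
                lines.add(decodeReversed(buf)); // the first line of the file
            }
        }
        return lines;
    }

    /** Reverses the collected bytes back into reading order and decodes them as UTF-8. */
    private static String decodeReversed(ByteArrayOutputStream buf) {
        byte[] bytes = buf.toByteArray();
        for (int i = 0, j = bytes.length - 1; i < j; i++, j--) {
            byte tmp = bytes[i];
            bytes[i] = bytes[j];
            bytes[j] = tmp;
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

Calling ReverseLineReader.readLinesBackwards("/var/date.log") would then give the lines in the order needed for printing, without holding a second reversed copy.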
If you need to print the lines in reverse order:
Read all lines to list
Reverse them
Write them back
Code:
List<String> lines = new ArrayList<String>();
String curLine;
while ( (curLine= br.readLine()) != null) {
lines.add(curLine);
}
Collections.reverse(lines);
for (String line : lines) {
System.out.println(line);
}
If you don't want to use temporary data (for reversing the file), you should use the RandomAccessFile class.
Otherwise, you can read and store the whole file in memory and then reverse its contents.
List<String> data = new LinkedList<String>();
If you need the lines in reverse order, instead of:
out.print(ch+"<br/>");
do
data.add(ch);
And after reading the whole file you can use
Collections.reverse(data);
If you need every character to be in reverse order, you can use the type Character instead of String and read one character at a time rather than a whole line.
After that, simply reverse your data.
P.S. To print (to the system output stream, for example), you should iterate over each item in the collection.
for (String line : data) {
out.println(line);
}
If you just use out.print(data), this calls data.toString() and prints its result. The standard toString() of a list will not produce what you expect here: it returns all elements on a single line, in the form [line1, line2, ...], rather than one line per entry.
