I am trying to go over a bunch of files, read each of them, and remove all stopwords from a specified list with such words. The result is a disaster - the content of the whole file copied over and over again.
What I tried:
- Saving the file as String and trying to look with regex
- Saving the file as String and going over line by line and comparing tokens to the stopwords that are stored in a LinkedHashSet, I can also store them in a file
- tried to twist the logic below in multiple ways, getting more and more ridiculous output.
- tried looking into text / line with the .contains() method, but no luck
My general logic is as follows:
for every word in the stopwords set:
while(file has more lines):
save current line into String
while (current line has more tokens):
assign current token into String
compare token with current stopword:
if(token equals stopword):
write in the output file "" + " "
else: write in the output file the token as is
Tried what's in this question and many other SO questions, but just can't achieve what I need.
Real code below:
private static void removeStopWords(File fileIn) throws IOException {
File stopWordsTXT = new File("stopwords.txt");
System.out.println("[Removing StopWords...] FILE: " + fileIn.getName() + "\n");
// create file reader and go over it to save the stopwords into the Set data structure
BufferedReader readerSW = new BufferedReader(new FileReader(stopWordsTXT));
Set<String> stopWords = new LinkedHashSet<String>();
for (String line; (line = readerSW.readLine()) != null; readerSW.readLine()) {
// trim() eliminates leading and trailing spaces
stopWords.add(line.trim());
}
File outp = new File(fileIn.getPath().substring(0, fileIn.getPath().lastIndexOf('.')) + "_NoStopWords.txt");
FileWriter fOut = new FileWriter(outp);
Scanner readerTxt = new Scanner(new FileInputStream(fileIn), "UTF-8");
while(readerTxt.hasNextLine()) {
String line = readerTxt.nextLine();
System.out.println(line);
Scanner lineReader = new Scanner(line);
for (String curSW : stopWords) {
while(lineReader.hasNext()) {
String token = lineReader.next();
if(token.equals(curSW)) {
System.out.println("---> Removing SW: " + curSW);
fOut.write("" + " ");
} else {
fOut.write(token + " ");
}
}
}
fOut.write("\n");
}
fOut.close();
}
What happens most often is that it looks for the first word from the stopWords set and that's it. The output contains all the other words even if I manage to remove the first one. And the first will be there in the next appended output in the end.
Part of my stopword list
about
above
after
again
against
all
am
and
any
are
as
at
With tokens I mean words, i.e. getting every word from the line and comparing it to the current stopword
After a while of debugging I believe I have found the solution. The problem is tricky because you have to combine several scanners and file readers. Here is what I did:
I changed how you added to your StopWords set, as it wasn't adding them correctly. I used a buffered reader to read each line, then a scanner to read each word, then added it to the set.
Then when you compared them I got rid of one of your loops as you can easily use the .contains() method to check if the word was a stopWord.
I left you to do the part of writing to the file to take out the stop words, as I'm sure you can figure that out now that everything else is working.
-My sample stop words txt file:
Stop words
Words
-My sample input file was exactly the same, so it should catch all three words.
The code:
// create file reader and go over it to save the stopwords into the Set data structure
BufferedReader readerSW = new BufferedReader(new FileReader("stopWords.txt"));
Set<String> stopWords = new LinkedHashSet<String>();
String stopWordsLine = readerSW.readLine();
while (stopWordsLine != null) {
// a Scanner splits the line into whitespace-separated words
Scanner words = new Scanner(stopWordsLine);
while (words.hasNext()) { // hasNext() first, so a blank line does not throw
stopWords.add(words.next()); // add each stop word to the set
}
words.close();
stopWordsLine = readerSW.readLine();
}
readerSW.close();
BufferedReader outp = new BufferedReader(new FileReader("Words.txt"));
String line = outp.readLine();
while(line != null) {
Scanner lineReader = new Scanner(line);
while(lineReader.hasNext()) { // loop over every word on the line
String token = lineReader.next();
if(stopWords.contains(token)) {
System.out.println("removing " + token);
}
}
lineReader.close();
line = outp.readLine();
}
outp.close();
Output:
removing Stop
removing words
removing Words
Let me know if I can elaborate any more on my code or why I did something!
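The part left open above, dropping the stop words instead of only printing them, could be sketched roughly like this. The class and method names are hypothetical, and the example works on an in-memory string so it runs without any files; in the real program each returned line would be written to the output file followed by a newline.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class; the names are illustrative, not from the original post.
class StopWordFilter {

    // Returns the line with every token found in stopWords dropped.
    // Tokens are split on whitespace and compared case-sensitively.
    static String removeStopWords(String line, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            // split() yields one empty token for a blank line, so skip empties
            if (!token.isEmpty() && !stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        Set<String> stopWords = new LinkedHashSet<>(Arrays.asList("about", "all", "and", "are"));
        System.out.println(removeStopWords("cats and dogs are all about fun", stopWords));
    }
}
```

The key difference from the code in the question is that each token is checked once against the whole set, instead of looping over the set and re-reading the line for every stop word.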
I have a question. I have a text file with some names and numbers arranged like this:
Cheese;10;12
Borat;99;55
I want to read the chars and integers from the file until the ";" symbol, println them, then continue, read the next one, println, etc. Like this:
Cheese -> println, 10 -> println, 12 -> println, and on to the next line and continue.
I tried using :
BufferedReader flux_in = new BufferedReader (
new InputStreamReader (
new FileInputStream ("D:\\test.txt")));
while ((line = flux_in.readLine())!=null &&
line.contains(terminator)==true)
{
text = line;
System.out.println(String.valueOf(text));
}
But it reads the entire line and doesn't stop at the ";" symbol. Setting the 'contains' condition to false does not read the line at all.
EDIT: Partially solved, I managed to write this code:
StringBuilder sb = new StringBuilder();
// while ((line = flux_in.readLine())!=null)
int c;
String terminator_char = ";";
while((c = flux_in.read()) != -1) {
{
char character = (char) c;
if (String.valueOf(character).contains(terminator_char)==false)
{
// System.out.println(String.valueOf(character) + " : Char");
sb.append(character);
}
else
{
continue;
}
}
}
System.out.println(String.valueOf(sb) );
Which returns a new string formed out of the characters from the read one, but without the ";". Still need a way to make it stop on the first ";", println the string and continue.
This simple code does the trick, thanks to Stefan Vasilica for the idea:
Scanner scan = new Scanner(new File("D:\\testfile.txt"));
// Printing the delimiter used
scan.useDelimiter(";");
System.out.println("Delimiter:" + scan.delimiter());
// Printing the tokenized Strings
while (scan.hasNext()) {
System.out.println(scan.next());
}
// closing the scanner stream
scan.close();
- Read the characters from the file one by one
- Delete the 'contains' condition
- Use a StringBuilder to build the strings one at a time
- Each StringBuilder stops when it reaches a ';' (say you use an if clause)
I didn't test it because I'm on my phone. Hope this helps
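The steps above might look roughly like the following sketch. The class and method names are made up for illustration, a StringReader stands in for the file, and line breaks are treated as field separators as well, which is an assumption about the desired output.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the original answers.
class FieldReader {

    // Reads characters one at a time and cuts a field at every ';' or line break.
    // Empty fields (for example between "\r" and "\n") are skipped.
    static List<String> readFields(Reader in) {
        List<String> fields = new ArrayList<>();
        StringBuilder sb = new StringBuilder();
        try {
            int c;
            while ((c = in.read()) != -1) {
                char ch = (char) c;
                if (ch == ';' || ch == '\n' || ch == '\r') {
                    if (sb.length() > 0) {
                        fields.add(sb.toString());
                        sb.setLength(0);        // start the next field
                    }
                } else {
                    sb.append(ch);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (sb.length() > 0) {                  // last field when there is no trailing newline
            fields.add(sb.toString());
        }
        return fields;
    }

    public static void main(String[] args) {
        // A FileReader wrapped in a BufferedReader would work the same way.
        for (String field : readFields(new StringReader("Cheese;10;12\nBorat;99;55"))) {
            System.out.println(field);
        }
    }
}
```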
I have to write a program that will parse baseball player info and hits, outs, walks, etc. from a txt file. For example, the txt file may look something like this:
Sam Slugger,h,h,o,s,w,w,h,w,o,o,o,h,s
Jill Jenks,o,o,s,h,h,o,o
Will Jones,o,o,w,h,o,o,o,o,w,o,o
I know how to parse the file and can get that code running perfectly. The only problem I am having is that we should only be printing the name of each player and 3 of their plays. For example:
Sam Slugger hit,hit,out
Jill Jenks out, out, sacrifice fly
Will Jones out, out, walk
I am not sure how to limit this and every time I try to cut it off at 3 I always get the first person working fine but it breaks the loop and doesn't do anything for all the other players.
This is what I have so far:
import java.util.Scanner;
import java.io.*;
public class ReadBaseBall{
public static void main(String args[]) throws IOException{
int count=0;
String playerData;
Scanner fileScan, urlScan;
String fileName = "C:\\Users\\Crust\\Documents\\java\\TeamStats.txt";
fileScan = new Scanner(new File(fileName));
while(fileScan.hasNext()){
playerData = fileScan.nextLine();
fileScan.useDelimiter(",");
//System.out.println("Name: " + playerData);
urlScan = new Scanner(playerData);
urlScan.useDelimiter(",");
for(urlScan.hasNext(); count<4; count++)
System.out.print(" " + urlScan.next() + ",");
System.out.println();
}
}
}
This prints out:
Sam Slugger, h, h, o,
but then the other players are voided out. I need help to get the other ones printing as well.
Here, try this one using FileReader
Assuming your file content format is like this
Sam Slugger,h,h,o,s,w,w,h,w,o,o,o,h,s
Jill Johns,h,h,o,s,w,w,h,w,o,o,o,h,s
with each player on his/her own line, then this can work for you
BufferedReader reader;
try {
reader = new BufferedReader(new FileReader(new File("file.txt")));
String line = "";
while ((line = reader.readLine()) != null) {
String[] values_per_line = line.split(",");
System.out.println("Name:" + values_per_line[0] + " "
+ values_per_line[1] + " " + values_per_line[2] + " "
+ values_per_line[3]);
// no extra readLine() here: the loop condition already advances to the next line
}
reader.close();
} catch (IOException e) {
e.printStackTrace();
}
Otherwise, if the players are all on one line (which would not make much sense), modify the sample like this:
Sam Slugger,h,h,o,s,w,w,h,w,o,o,o,h,s| John Slugger,h,h,o,s,w,w,h,w,o,o,o,h,s
BufferedReader reader;
try {
reader = new BufferedReader(new FileReader(new File("file.txt")));
String line = "";
while ((line = reader.readLine()) != null) {
// players are separated by '|'; escape it because split() takes a regex
String[] data = line.trim().split("\\|");
for (int i = 0; i < data.length; i++) {
String[] values = data[i].trim().split(",");
System.out.println("Name:" + values[0] + " " + values[1] + " "
+ values[2] + " " + values[3]);
}
}
reader.close();
} catch (IOException e) {
e.printStackTrace();
}
You need to reset your count variable in the while loop:
while(fileScan.hasNext()){
count = 0;
...
}
First Problem
Change while(fileScan.hasNext()) to while(fileScan.hasNextLine()). Not a breaking problem, but with a Scanner you usually pair each has* check with the matching read call, here hasNextLine() with nextLine().
Second Problem
Remove the line fileScan.useDelimiter(","). It has no effect on nextLine(), but it replaces the default delimiter, so the scanner no longer splits on whitespace. That doesn't matter while you only call Scanner.nextLine, but it can have some nasty side effects later on.
Third Problem
Change this line for(urlScan.hasNext(); count<4; count++) to while(urlScan.hasNext()). The hasNext() call in the initializer slot is evaluated once and its result is discarded, so the loop is controlled by count alone and only reads the first 4 tokens from the scanner.
If you want to limit the amount processed for each line you can replace it with
for( int count = 0; count < limit && urlScan.hasNext( ); count++ )
This will limit the amount read to limit while still handling lines that have less data than the limit.
Make sure that each of your data sets is separated by a line otherwise the output might not make much sense.
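Putting the three fixes together, a minimal sketch could look like this. The class and method names are hypothetical, and the sample lines from the question are hard-coded instead of being read from TeamStats.txt, so the example runs on its own.

```java
import java.util.Scanner;

// Hypothetical helper class used only for this illustration.
class PlayLimiter {

    // Returns the player's name followed by at most `limit` plays.
    static String firstPlays(String playerData, int limit) {
        Scanner lineScan = new Scanner(playerData);
        lineScan.useDelimiter(",");
        StringBuilder out = new StringBuilder();
        if (lineScan.hasNext()) {
            out.append(lineScan.next());        // the first token is the name
        }
        // count is declared in the loop header, so it starts at 0 for every line
        for (int count = 0; count < limit && lineScan.hasNext(); count++) {
            out.append(count == 0 ? " " : ",").append(lineScan.next());
        }
        lineScan.close();
        return out.toString();
    }

    public static void main(String[] args) {
        String[] lines = {
            "Sam Slugger,h,h,o,s,w,w,h,w,o,o,o,h,s",
            "Jill Jenks,o,o,s,h,h,o,o",
            "Will Jones,o,o,w,h,o,o,o,o,w,o,o"
        };
        for (String line : lines) {
            System.out.println(firstPlays(line, 3));
        }
    }
}
```

Because the counter lives inside the per-line method, it cannot carry over from one player to the next, which is what stopped the original loop after the first player.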
You shouldn't have multiple scanners on this - assuming the format you posted in your question you can use regular expressions to do this.
This demonstrates a regular expression to match a player and to use as a delimiter for the scanner. I fed the scanner in my example a string, but the technique is the same regardless of source.
int count = 0;
Pattern playerPattern = Pattern.compile("\\w+\\s\\w+(?:,\\w){1,3}");
Scanner fileScan = new Scanner("Sam Slugger,h,h,o,s,w,w,h,w,o,o,o,h,s Jill Jenks,o,o,s,h,h,o,o Will Jones,o,o,w,h,o,o,o,o,w,o,o");
fileScan.useDelimiter("(?<=,\\w)\\s");
while (fileScan.hasNext()){
String player = fileScan.next();
Matcher m = playerPattern.matcher(player);
if (m.find()) {
player = m.group(0);
} else {
throw new InputMismatchException("Players data not in expected format on string: " + player);
}
System.out.println(player);
count++;
}
System.out.printf("%d players found.", count);
Output:
Sam Slugger,h,h,o
Jill Jenks,o,o,s
Will Jones,o,o,w
The call to Scanner.useDelimiter() sets the delimiter to use for retrieving tokens. The regex (?<=,\\w)\\s:
(?<= // positive lookbehind
,\w // literal comma, word character
)
\s // whitespace character
This delimits the players by the space between their entries without consuming anything but that space, and it does not match the space inside a name, because that space is not preceded by a comma and a word character.
The regular expression used to extract up to 3 plays per player is \\w+\\s\\w+(?:,\\w){1,3}:
\w+ // one or more word characters (first name)
\s // whitespace character
\w+ // one or more word characters (last name)
(?: // begin non-capturing group
,\w // literal comma, word character
){1,3} // match non-capturing group 1 - 3 times
Yo, I got this code that prints all the string from a text file to another text file, line by line. It works perfectly but I'm afraid we're restricted on using .isEmpty(). Is there any other condition for the 2nd while statement other than .isEmpty()? like counting the line's size and decrementing it by the size of the word every loop? I tried line.length > 0, declared int size = line.length() - 1 and decrementing by size -= word.length(); but I still got errors having infinite loop.
Here's my code,
import java.io.*;
class fileStr{
public static void main(String args[]){
try{
BufferedReader rw = new BufferedReader(new FileReader("inStr.txt"));
PrintWriter sw = new PrintWriter(new FileOutputStream("outStr.txt"));
String line = rw.readLine();
while(line!=null){
String word = line.substring(0,line.indexOf(" "));
while(!word.isEmpty()){
sw.println(word);
line = line.substring(line.indexOf(" ") + 1) + " ";
word = line.substring(0,line.indexOf(" "));
}
line = rw.readLine();
}
rw.close();
sw.close();
}catch(Exception e){
System.out.println("\n\tFILE NOT FOUND!");
}
}
}
Please help, thanks.
According to the docs isEmpty() returns true if and only if the String's length is zero. You can perform this exact same check pretty easily without using isEmpty():
while(word.length() > 0){
....
As for the failed attempts to use length(): to help you there, you should post some missing details, such as the content of the file you're reading and the line on which it fails.
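For what it's worth, here is one way the word loop could be written so that it neither needs isEmpty() nor runs forever. This is only a sketch under the assumption that words are separated by single or repeated spaces; the class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not taken from the question.
class WordSplitter {

    // Splits a line into words using only indexOf, substring and length().
    static List<String> words(String line) {
        List<String> result = new ArrayList<>();
        int start = 0;
        while (start < line.length()) {
            int space = line.indexOf(' ', start);
            if (space == -1) {
                space = line.length();          // last word: no trailing space
            }
            if (space - start > 0) {            // length check instead of isEmpty()
                result.add(line.substring(start, space));
            }
            start = space + 1;                  // always advances, so the loop terminates
        }
        return result;
    }

    public static void main(String[] args) {
        for (String word : words("print every word  on its own line")) {
            System.out.println(word);
        }
    }
}
```

Tracking a start index rather than repeatedly cutting the line and appending a space avoids the case where the remaining text never becomes empty.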
I have a txt file with over a thousand lines of text, each starting with some integers.
Like:
22Ahmedabad, AES Institute of Computer Studies
526Ahmedabad, Indian Institute of Managment
561Ahmedabad, Indus Institute of Technology & Engineering
745Ahmedabad, Lalbhai Dalpatbhai College of Engineering
I want to store all the lines in another file without the integers.
The code I've written is:
while (s.hasNextLine()){
String sentence=s.nextLine();
int l=sentence.length();
c++;
try{//printing P
FileOutputStream ffs = new FileOutputStream ("ps.txt",true);
PrintStream p = new PrintStream ( ffs );
for (int i=0;i<l;i++){
if ((int)sentence.charAt(i)<=48 && (int)sentence.charAt(i)>=57){
p.print(sentence.charAt(i));
}
}
p.close();
}
catch(Exception e){}
}
But it outputs a blank file.
There are a couple of things in your code that should be improved:
- Don't re-open the output file with every line. Just keep it open the whole time.
- You are removing all numbers, not just numbers at the beginning - is that your intention?
- Do you know any number that is both <= 48 and >= 57 at the same time?
- Scanner.nextLine() does not include line returns, so you'll need a call to p.println() after every line.
Try this:
// open the file once
FileOutputStream ffs = new FileOutputStream ("ps.txt");
PrintStream p = new PrintStream ( ffs );
while (s.hasNextLine()){
String sentence=s.nextLine();
int l=sentence.length();
c++;
try{//printing P
for (int i=0;i<l;i++){
// check "< 48 || > 57", which is non-numeric range
if ((int)sentence.charAt(i)<48 || (int)sentence.charAt(i)>57){
p.print(sentence.charAt(i));
}
}
// move to next line in output file
p.println();
}
catch(Exception e){}
}
p.close();
You can apply this regular expression to each line that you read from the file:
String str = ... // read the next line from the file
str = str.replaceAll("^[0-9]+", "");
The regular expression ^[0-9]+ matches one or more digits at the beginning of the line. The replaceAll method replaces the match with an empty string.
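As a small self-contained illustration of that call (the class and method names are placeholders, and the sample line comes from the question):

```java
// Hypothetical wrapper class for the replaceAll call shown above.
class LeadingDigits {

    // Removes a run of digits at the start of the line; digits elsewhere are kept.
    static String strip(String line) {
        return line.replaceAll("^[0-9]+", "");
    }

    public static void main(String[] args) {
        System.out.println(strip("22Ahmedabad, AES Institute of Computer Studies"));
    }
}
```

Since the pattern is anchored with ^, only the leading digits are removed, unlike the character-by-character loop, which drops every digit in the line.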
On top of mellamokb's comments, you should avoid "magic numbers". There's no guarantee that the digits will fall within the expected range of ASCII codes.
You can simply detect if a character is a digit using Character.isDigit
String value = "22Ahmedabad, AES Institute of Computer Studies";
int index = 0;
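// the bounds check keeps charAt() from throwing when the string is all digits
while (index < value.length() && Character.isDigit(value.charAt(index))) {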
index++;
}
if (index < value.length()) {
System.out.println(value.substring(index));
} else {
System.out.println("Nothing but numbers here");
}
(NB: dasblinkenlight has posted an excellent regular expression, which would probably be easier to use, but if you're like me, regexps turn your brain inside out :P)