complete indexing of text file java


I'm trying to read a text file, sort the words in it alphabetically, and display the line numbers on which each word appears.
I'm new to Java, so I'm not sure what the most efficient approach is.
My plan so far is to:
-use a Scanner to parse the file into one string
-string.split
-lineCount++
-(somehow sort those split strings alphabetically)
-print sorted words with line number next to them
Is that the best way of going about this? I'm not sure whether Java has some sort of ordered dictionary I could use.

A Scanner is fine, as you could scan per word without even needing a split.
A BufferedReader is meant for line-wise reading, and there is a LineNumberReader that fits your goal of counting lines.
I have indicated the encoding of the file explicitly. A TreeMap is the ordered dictionary you are asking about: it keeps its keys sorted.
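import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.LineNumberReader;
import java.nio.charset.StandardCharsets;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// 'file' is the File (or file name) to index.
SortedMap<String, SortedSet<Integer>> linenosPerWord = new TreeMap<>();
// A BufferedReader with a line number counter:
try (LineNumberReader in = new LineNumberReader(new InputStreamReader(
        new FileInputStream(file), StandardCharsets.UTF_8))) {
    for (;;) {
        String line = in.readLine();
        if (line == null) {
            break;
        }
        int lineno = in.getLineNumber();
        // Split on runs of characters that are neither letters nor combining marks (accents)
        String[] words = line.split("[^\\p{L}\\p{M}]+");
        for (String word : words) {
            word = word.toLowerCase(); // Possible with Locale
            SortedSet<Integer> linenos = linenosPerWord.get(word);
            if (linenos == null) {
                linenos = new TreeSet<>();
                linenosPerWord.put(word, linenos); // the map stores the set of line numbers
            }
            linenos.add(lineno);
        }
    }
}
linenosPerWord.remove(""); // Remove a possibly found empty word, like in "-Hello"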
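The last step of the question's plan, printing each word with its line numbers, is not covered by the snippet above. The following is a minimal sketch of the whole idea working on lines already held in memory, so it can be tried without a file. The class name WordIndexSketch and the methods buildIndex and format are hypothetical names chosen for illustration, and the "word: 1, 3" output layout is an assumption, not something the question specifies.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.stream.Collectors;

// Hypothetical helper class, for illustration only.
class WordIndexSketch {

    // Builds a map from lower-cased word to the sorted set of line numbers it occurs on.
    static SortedMap<String, SortedSet<Integer>> buildIndex(List<String> lines) {
        SortedMap<String, SortedSet<Integer>> index = new TreeMap<>();
        int lineno = 0;
        for (String line : lines) {
            lineno++; // line numbers start at 1
            for (String word : line.split("[^\\p{L}\\p{M}]+")) {
                if (word.isEmpty()) {
                    continue; // split() can yield a leading empty string, e.g. for "-Hello"
                }
                index.computeIfAbsent(word.toLowerCase(Locale.ROOT), k -> new TreeSet<>())
                     .add(lineno);
            }
        }
        return index;
    }

    // Produces one "word: n1, n2" line per entry; the TreeMap iterates in alphabetical order.
    static String format(SortedMap<String, SortedSet<Integer>> index) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, SortedSet<Integer>> entry : index.entrySet()) {
            String numbers = entry.getValue().stream()
                    .map(Object::toString)
                    .collect(Collectors.joining(", "));
            sb.append(entry.getKey()).append(": ").append(numbers).append('\n');
        }
        return sb.toString();
    }
}
```

With a real file, the list of lines could come from java.nio.file.Files.readAllLines, and System.out.print(WordIndexSketch.format(index)) would display the listing.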

Related

How can I remove specific elements from a linkedlist in java based on user input?

I'm very new (6 weeks into java) trying to remove elements from a csv file that lists a set of students as such (id, name, grades) each on a new line.
Each student id is numbered in ascending value. I want to try and remove a student by entering the id number and I'm not sure how I can do this.
So far I've just tried, in a while loop, to reduce the value the user inputs so that it matches the list index, since the students are listed by number. However, each iteration does not take the removal from the previous user input into account, and I think I need a way to search for the value of the id itself and remove the entire line from the csv file.
Have only tried to include the pertinent code. Reading previous stack questions has shown me a bunch of answers related to nodes, which make no sense to me since I don't have whatever prerequisite knowledge is required to understand it, and I'm not sure the rest of my code is valid for those methods.
Any ideas that are relatively simple?
Student.txt (each on a new line)
1,Frank,West,98,95,87,78,77,80
2,Dianne,Greene,78,94,88,87,95,92
3,Doug,Lei,78,94,88,87,95,92
etc....
Code:
public static boolean readFile(String filename) {
File file = new File("C:\\Users\\me\\eclipse-workspace\\studentdata.txt");
try {
Scanner scanner = new Scanner(file);
while(scanner.hasNextLine()) {
String[] words=scanner.nextLine().split(",");
int id = Integer.parseInt(words[0]);
String firstName = words[1];
String lastName = words[2];
int mathMark1 = Integer.parseInt(words[3]);
int mathMark2 = Integer.parseInt(words[4]);
int mathMark3 = Integer.parseInt(words[5]);
int englishMark1 = Integer.parseInt(words[6]);
int englishMark2 = Integer.parseInt(words[7]);
int englishMark3 = Integer.parseInt(words[8]);
addStudent(id,firstName,lastName,mathMark1,mathMark2,mathMark3,englishMark1,englishMark2,englishMark3);
}scanner.close();
}catch (FileNotFoundException e) {
System.out.println("Failed to readfile.");
private static void removeStudent() {
String answer = "Yes";
while(answer.equals("Yes") || answer.equals("yes")) {
System.out.println("Do you wish to delete a student?");
answer = scanner.next();
if (answer.equals("Yes") || answer.equals("yes")) {
System.out.println("Please enter the ID of the student to be removed.");
//tried various things here: taking userInput and passing through linkedlist.remove() but has never worked.

This solution may not be optimal or pretty, but it works. It reads in an input file line by line, writing each line out to a temporary output file. Whenever it encounters a line that matches what you are looking for, it skips writing that one out. It then renames the output file. I have omitted error handling, closing of readers/writers, etc. from the example. I also assume there is no leading or trailing whitespace in the line you are looking for. Change the code around trim() as needed so you can find a match.
File inputFile = new File("myFile.txt");
File tempFile = new File("myTempFile.txt");
BufferedReader reader = new BufferedReader(new FileReader(inputFile));
BufferedWriter writer = new BufferedWriter(new FileWriter(tempFile));
String lineToRemove = "bbb";
String currentLine;
while((currentLine = reader.readLine()) != null) {
// trim newline when comparing with lineToRemove
String trimmedLine = currentLine.trim();
if(trimmedLine.equals(lineToRemove)) continue;
writer.write(currentLine + System.getProperty("line.separator"));
}
writer.close();
reader.close();
boolean successful = tempFile.renameTo(inputFile);
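For the student file in the question, the same skip-the-matching-line idea can key on the id field rather than on the whole line. This is a minimal sketch, assuming the id is always the first comma-separated field; RemoveByIdSketch and removeById are hypothetical names, and the method works on a list of lines so that reading and writing the file stay separate concerns.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, for illustration only.
class RemoveByIdSketch {

    // Returns a copy of the lines without any record whose first
    // comma-separated field equals the given id.
    static List<String> removeById(List<String> lines, int id) {
        String wanted = String.valueOf(id);
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            // Only the first field matters, so split into at most two parts.
            String firstField = line.split(",", 2)[0].trim();
            if (!firstField.equals(wanted)) {
                kept.add(line);
            }
        }
        return kept;
    }
}
```

The lines could be loaded with java.nio.file.Files.readAllLines(path) and the result written back with java.nio.file.Files.write(path, kept). Comparing the id value, rather than using it as a list index, means earlier deletions do not shift what a later id refers to.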

Remove stop words from file - going over it multiple times causes content duplication and does not remove the words

I am trying to go over a bunch of files, read each of them, and remove all stopwords from a specified list with such words. The result is a disaster - the content of the whole file copied over and over again.
What I tried:
- Saving the file as String and trying to look with regex
- Saving the file as String and going over line by line and comparing tokens to the stopwords that are stored in a LinkedHashSet, I can also store them in a file
- tried to twist the logic below in multiple ways, getting more and more ridiculous output.
- tried looking into text / line with the .contains() method, but no luck
My general logic is as follows:
for every word in the stopwords set:
    while (file has more lines):
        save current line into String
        while (current line has more tokens):
            assign current token into String
            compare token with current stopword:
                if (token equals stopword):
                    write in the output file "" + " "
                else: write in the output file the token as is
Tried what's in this question and many other SO questions, but just can't achieve what I need.
Real code below:
private static void removeStopWords(File fileIn) throws IOException {
File stopWordsTXT = new File("stopwords.txt");
System.out.println("[Removing StopWords...] FILE: " + fileIn.getName() + "\n");
// create file reader and go over it to save the stopwords into the Set data structure
BufferedReader readerSW = new BufferedReader(new FileReader(stopWordsTXT));
Set<String> stopWords = new LinkedHashSet<String>();
for (String line; (line = readerSW.readLine()) != null; readerSW.readLine()) {
// trim() eliminates leading and trailing spaces
stopWords.add(line.trim());
}
File outp = new File(fileIn.getPath().substring(0, fileIn.getPath().lastIndexOf('.')) + "_NoStopWords.txt");
FileWriter fOut = new FileWriter(outp);
Scanner readerTxt = new Scanner(new FileInputStream(fileIn), "UTF-8");
while(readerTxt.hasNextLine()) {
String line = readerTxt.nextLine();
System.out.println(line);
Scanner lineReader = new Scanner(line);
for (String curSW : stopWords) {
while(lineReader.hasNext()) {
String token = lineReader.next();
if(token.equals(curSW)) {
System.out.println("---> Removing SW: " + curSW);
fOut.write("" + " ");
} else {
fOut.write(token + " ");
}
}
}
fOut.write("\n");
}
fOut.close();
}
What happens most often is that it looks for the first word from the stopWords set and that's it. The output contains all the other words even if I manage to remove the first one. And the first will be there in the next appended output in the end.
Part of my stopword list
about
above
after
again
against
all
am
and
any
are
as
at
With tokens I mean words, i.e. getting every word from the line and comparing it to the current stopword

After a while of debugging, I believe I have found the solution. This problem is tricky because you have to use several different scanners and file readers. Here is what I did:
I changed how you add to your stopWords set, as it wasn't adding the words correctly. I used a BufferedReader to read each line, then a Scanner to read each word, and added each word to the set.
Then, for the comparison, I got rid of one of your loops, since you can simply use the set's .contains() method to check whether a word is a stop word.
I left you to do the part of writing to the file to take out the stop words, as I'm sure you can figure that out now that everything else is working.
-My sample stop words txt file:
Stop words
Words
-My sample input file was exactly the same, so it should catch all three words.
The code:
// Read the stop words into a Set; a line may hold several words.
Set<String> stopWords = new LinkedHashSet<String>();
try (BufferedReader readerSW = new BufferedReader(new FileReader("stopWords.txt"))) {
    String stopWordsLine;
    while ((stopWordsLine = readerSW.readLine()) != null) {
        Scanner words = new Scanner(stopWordsLine);
        while (words.hasNext()) { // hasNext() also copes with blank lines
            stopWords.add(words.next()); // add the stop word to the set
        }
        words.close();
    }
}
// Read the input file and report every token that is a stop word.
try (BufferedReader input = new BufferedReader(new FileReader("Words.txt"))) {
    String line;
    while ((line = input.readLine()) != null) {
        Scanner lineReader = new Scanner(line);
        while (lineReader.hasNext()) { // while the line has another word
            String token = lineReader.next();
            if (stopWords.contains(token)) {
                System.out.println("removing " + token);
            }
        }
        lineReader.close();
    }
}
Output:
removing Stop
removing words
removing Words
Let me know if I can elaborate any more on my code or why I did something!
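The writing step that this answer leaves open can be sketched as follows. It is only an illustration under stated assumptions: tokens are separated by whitespace, matching is case-sensitive (as with the .contains() check above), and StopWordFilterSketch and removeStopWords are hypothetical names. The point is that each line is tokenized once and every token is checked against the whole set, instead of looping over the stop words on the outside.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper class, for illustration only.
class StopWordFilterSketch {

    // Returns the line with every token that is in stopWords left out.
    // Remaining tokens are joined with single spaces.
    static String removeStopWords(String line, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            // split() on an empty line yields one empty token; skip it
            if (!token.isEmpty() && !stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return String.join(" ", kept);
    }
}
```

Inside the line-reading loop, the result of removeStopWords(line, stopWords) would be written to the output file followed by a newline, so each input line produces exactly one output line.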

Java compare strings from two places and exclude any matches

I'm trying to end up with a results.txt minus any matching items, having successfully compared some string inputs against another .txt file. Been staring at this code for way too long and I can't figure out why it isn't working. New to coding so would appreciate it if I could be steered in the right direction! Maybe I need a different approach? Apologies in advance for any loud tutting noises you may make. Using Java8.
//Sending a String[] into 'searchFile', contains around 8 small strings.
//Example of input: String[]{"name1","name2","name 3", "name 4.zip"}
^ This is my exclusions list.
public static void searchFile(String[] arr, String separator)
{
StringBuilder b = new StringBuilder();
for(int i = 0; i < arr.length; i++)
{
if(i != 0) b.append(separator);
b.append(arr[i]);
String findME = arr[i];
searchInfo(MyApp.getOptionsDir()+File.separator+"file-to-search.txt",findME);
}
}
^This works fine. I'm then sending the results to 'searchInfo' and trying to match and remove any duplicate (complete, not part) strings. This is where I am currently failing. Code runs but doesn't produce my desired output. It often finds part strings rather than complete ones. I think the 'results.txt' file is being overwritten each time...but I'm not sure tbh!
file-to-search.txt contains: "name2","name.zip","name 3.zip","name 4.zip" (text file is just a single line)
public static String searchInfo(String fileName, String findME)
{
StringBuffer sb = new StringBuffer();
try {
BufferedReader br = new BufferedReader(new FileReader(fileName));
String line = null;
while((line = br.readLine()) != null)
{
if(line.startsWith("\""+findME+"\""))
{
sb.append(line);
//tried various replace options with no joy
line = line.replaceFirst(findME+"?,", "");
//then goes off with results to create a txt file
FileHandling.createFile("results.txt",line);
}
}
} catch (Exception e) {
e.printStackTrace();
}
return sb.toString();
}
What I'm trying to end up with is a result file MINUS any matching complete strings (not part strings):
e.g. results.txt to end up with: "name.zip","name 3.zip"

OK, going by the information I have, you can do this:
List<String> result = new ArrayList<>();
String content = FileUtils.readFileToString(file, "UTF-8");
for (String s : content.split(", ")) {
if (!s.equals(findME)) { // assuming both have string quotes added already
result.add(s);
}
}
FileUtils.write(newFile, String.join(", ", result), "UTF-8");
This uses FileUtils from Apache Commons IO for ease. Add or remove the space after the comma in split and join to match your file.
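If pulling in a library is not an option, the same filtering can be sketched with the JDK alone. The version below also checks against the whole exclusions array at once and compares complete names only, which is what the question asks for. It assumes that items are separated by commas, wrapped in double quotes, and contain no commas themselves; ExcludeMatchesSketch and excludeMatches are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class, for illustration only.
class ExcludeMatchesSketch {

    // Returns the comma-separated content without any item whose
    // unquoted text equals one of the exclusions exactly.
    static String excludeMatches(String content, String[] exclusions) {
        Set<String> excluded = new HashSet<>(Arrays.asList(exclusions));
        List<String> kept = new ArrayList<>();
        for (String item : content.split(",")) {
            String quoted = item.trim();
            // Strip one leading and one trailing double quote for the comparison.
            String bare = quoted.replaceAll("^\"|\"$", "");
            if (!excluded.contains(bare)) {
                kept.add(quoted); // keep the item as it appeared, quotes included
            }
        }
        return String.join(",", kept);
    }
}
```

With the question's data, the exclusions "name1", "name2", "name 3" and "name 4.zip" applied to the single line in file-to-search.txt leave "name.zip","name 3.zip", because "name 3" only matches part of "name 3.zip" and is therefore not removed.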

Searching a text file in java and Listing the results

I've really searched around for ideas on how to go about this, and so far nothing's turned up.
I need to search a text file via keywords entered in a JTextField and present the search results to a user in an array of columns, like how google does it. The text file has a lot of content, about 22,000 lines of text. I want to be able to sift through lines not containing the words specified in the JTextField and only present lines containing at least one of the words in the JTextField in rows of search results, each row being a line from the text file.
Does anyone have any ideas on how to go about this? I would really appreciate any kind of help. Thank you in advance.

You can read the file line by line and search every line for your keywords. If you find one, store the line in a list.
But first split your text box String on whitespace to create the keyword array:
String[] keyWords = yourTextBoxString.split(" ");
ArrayList<String> results = new ArrayList<String>();
Reading the file line by line:
void readFileLineByLine(File file) throws IOException {
    // try-with-resources closes the reader even if reading fails
    try (BufferedReader br = new BufferedReader(new FileReader(file))) {
        String line;
        while ((line = br.readLine()) != null) {
            processOneLine(line);
        }
    }
}
Processing the line:
void processOneLine(String line) {
    for (String currentKey : keyWords) {
        if (line.contains(currentKey)) {
            results.add(line);
            break; // one matching keyword is enough
        }
    }
}
I have not tested this, but it should give you an overview of how you can do it.
If you need more speed, you can also use a regular expression to search for the keywords, so you don't need this for loop.
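One way to read the regular-expression suggestion is to join all keywords into a single alternation and test each line once. This is a sketch rather than a tested drop-in: KeywordSearchSketch and search are hypothetical names, the case-insensitive flag is an added assumption (the contains() version is case-sensitive), and whether it is actually faster would need measuring on the real 22,000-line file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper class, for illustration only.
class KeywordSearchSketch {

    // Returns the lines that contain at least one of the whitespace-separated keywords.
    static List<String> search(List<String> lines, String query) {
        List<String> results = new ArrayList<>();
        if (query.trim().isEmpty()) {
            return results; // no keywords, no results
        }
        // Pattern.quote() keeps characters such as '+' or '.' literal.
        String alternation = Arrays.stream(query.trim().split("\\s+"))
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        Pattern pattern = Pattern.compile(alternation, Pattern.CASE_INSENSITIVE);
        for (String line : lines) {
            if (pattern.matcher(line).find()) {
                results.add(line);
            }
        }
        return results;
    }
}
```

The query string would come from the JTextField's getText(), and the returned list can feed whatever component displays the rows of results.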

Read in the file, as per the Oracle tutorial, http://docs.oracle.com/javase/tutorial/essential/io/file.html#textfiles. Iterate through each line and search for your keyword(s) using String's contains method. If the line contains the search phrase, place the line and line number in a results List. When you've finished, you can display the results list to the user.

You need a method as follows:
List<String> searchFile(String path, String match) {
    List<String> linesToPresent = new ArrayList<String>();
    Pattern p = Pattern.compile(match); // compile once, outside the loop
    try (BufferedReader br = new BufferedReader(new FileReader(path))) {
        String line;
        // readLine() returns null at end of file, so check before matching
        while ((line = br.readLine()) != null) {
            Matcher m = p.matcher(line);
            if (m.find()) {
                linesToPresent.add(line);
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    return linesToPresent;
}
It searches a file line by line and checks with regex if a line contains a "match" String. If you have many Strings to check you can change the second parameter to String[] match and with a foreach loop check for each String match.

You can use :
FileUtils
This will read each line and return you a List<String>.
You can iterate over this List<String> and check whether each String contains the word entered by the user; if it does, add it to another List<String>. At the end you will have a second List<String> holding all the lines that contain the word entered by the user. You can iterate over this List<String> and display the result to the user.

How to remove stop words in java?

I want to remove stop words in Java.
So I read the stop words from a text file
and store them in a Set:
Set<String> stopWords = new LinkedHashSet<String>();
BufferedReader br = new BufferedReader(new FileReader("stopwords.txt"));
String words = null;
while( (words = br.readLine()) != null) {
stopWords.add(words.trim());
}
br.close();
Then I read another text file,
and I want to remove the duplicate strings from that text file.
How can I do that?

Using a Set for the stop words:
Set<String> stopWords = new LinkedHashSet<String>();
BufferedReader SW = new BufferedReader(new FileReader("StopWord.txt"));
for (String line; (line = SW.readLine()) != null;)
    stopWords.add(line.trim());
SW.close();
and an ArrayList for the input text file:
BufferedReader br = new BufferedReader(new FileReader("txt_file.txt"));
// build your ArrayList of words here
// deletStopWord() returns a copy of the list without any word found in the stop-word set
public ArrayList<String> deletStopWord(Set<String> stopWords, ArrayList<String> arraylist) {
    ArrayList<String> newList = new ArrayList<String>();
    for (String word : arraylist) {
        if (!stopWords.contains(word)) {
            newList.add(word);
        }
    }
    System.out.println(newList);
    return newList;
}
arraylist = deletStopWord(stopWords, arraylist);

You want to remove duplicate words from a file; below is the high-level logic for that.
Read the file
Loop through the file content (i.e. one line at a time)
Use a string tokenizer on that line, splitting on spaces
Add each token to your set. This makes sure that you have only one entry per word.
Close the file
Now you have a set that contains all the unique words of the file.
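The steps above can be sketched as follows. The sketch swaps the string tokenizer for String.split on whitespace and uses a LinkedHashSet so that the words keep their order of first appearance; both are choices made for this illustration, and UniqueWordsSketch and uniqueWords are hypothetical names. The file reading itself is left out: the method takes the lines as a list.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class, for illustration only.
class UniqueWordsSketch {

    // Collects every whitespace-separated token once, in order of first appearance.
    static Set<String> uniqueWords(List<String> lines) {
        Set<String> unique = new LinkedHashSet<>();
        for (String line : lines) {
            for (String token : line.trim().split("\\s+")) {
                if (!token.isEmpty()) { // an empty line yields one empty token
                    unique.add(token);  // a Set silently ignores repeats
                }
            }
        }
        return unique;
    }
}
```

A TreeSet in place of the LinkedHashSet would give the words in alphabetical order instead.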

Using an ArrayList may be easier.
public ArrayList<String> removeDuplicates(ArrayList<String> source) {
    ArrayList<String> newList = new ArrayList<String>();
    for (int i = 0; i < source.size(); i++) {
        String s = source.get(i);
        // keep only the first occurrence of each string
        if (!newList.contains(s)) {
            newList.add(s);
        }
    }
    return newList;
}
Hope this helps.

If you simply want to remove a certain set of words from the words in a file, you can do it however you want. But if you are dealing with a problem involving natural language processing, you should use a library.
For example, using Lucene for tokenizing will seem more complicated at first, but it will deal with myriad complications that you will overlook, and allow for great flexibility should you change your mind on the specific stopwords, on how you are tokenizing, whether you care about case, etc.

You should try using StringTokenizer.
