I have a text file that is sorted alphabetically, with around 94,000 lines of names (one name per line, text only, no punctuation).
Example:
Alice
Bob
Simon
Simon
Tom
Each line takes the same form, first letter is capitalized, no accented letters.
My code:
try {
    BufferedReader br = new BufferedReader(new FileReader("orderedNames.txt"));
    PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("sortedNoDuplicateNames.txt", true)));
    ArrayList<String> textToTransfer = new ArrayList();
    String previousLine = "";
    String current = "";

    //Load first line into previous line
    previousLine = br.readLine();

    //Add first line to the transfer list
    textToTransfer.add(previousLine);

    while ((current = br.readLine()) != previousLine && current != null) {
        textToTransfer.add(current);
        previousLine = current;
    }

    int index = 0;
    for (int i = 0; i < textToTransfer.size(); i++) {
        out.println(textToTransfer.get(i));
        System.out.println(textToTransfer.get(i));
        index++;
    }
    System.out.println(index);
} catch (Exception e) {
    e.printStackTrace();
}
From what I understand: the first line of the file is read and loaded into the previousLine variable as I intended; current is set to the second line of the file we're reading from; current is then compared against the previous line and against null; and if it's not the same as the last line and not null, we add it to the ArrayList.
previousLine is then set to current's value, so the next readLine can replace the value of current and the comparison in the while loop can continue.
I cannot see what is wrong with this.
If a duplicate is found, surely the loop should break?
Sorry in advance if it turns out to be something stupid.
Use a TreeSet instead of an ArrayList.
Set<String> textToTransfer = new TreeSet<>();
The TreeSet is sorted and does not allow duplicates.
Don't reinvent the wheel!
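For example, here is a minimal sketch of that approach, reusing the file names from the question (Files.readAllLines and Files.write are just one convenient way to do the I/O):

import java.io.IOException;
import java.nio.file.*;
import java.util.*;

public class DedupNames {
    public static void main(String[] args) throws IOException {
        // The TreeSet keeps the names sorted and silently drops duplicates as they are added
        Set<String> names = new TreeSet<>(Files.readAllLines(Paths.get("orderedNames.txt")));
        // Write the sorted, de-duplicated names back out
        Files.write(Paths.get("sortedNoDuplicateNames.txt"), names);
    }
}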
If you don't want duplicates, you should consider using a Collection that doesn't allow duplicates. The easiest way to remove repeated elements is to add the contents to a Set, which will not allow duplicates:
import java.util.*;
import java.util.stream.*;

public class RemoveDups {
    public static void main(String[] args) {
        Set<String> dist = Arrays.asList(args).stream().collect(Collectors.toSet());
    }
}
Another way is to remove the duplicates from the text file before the Java code reads it, for example on Linux (far quicker than doing it in Java):
sort myFileWithDuplicates.txt | uniq > myFileWithoutDuplicates.txt
While, like the others, I recommend using a collection that does not allow repeated entries, I think I can identify what is wrong with your function. The way you are trying to compare strings (which is what you are doing, of course) in your while loop is incorrect in Java. The == operator (and its counterpart !=) determines whether two references point to the same object, which is not the same as determining whether their values are the same. Luckily, Java's String class provides the equals() method for comparing values. You may want something like this:
while ((current = br.readLine()) != null && !current.equals(previousLine)) {
Keep in mind that breaking your While loop here will force your file reading to stop, which may or may not be what you intended.
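If you would rather keep your read-and-compare approach, here is a minimal sketch of a loop that skips duplicates instead of breaking, reusing the br and textToTransfer variables from your code (it replaces the read-first-line and while parts):

String previousLine = null;
String current;
while ((current = br.readLine()) != null) {
    // only keep the line if it differs from the one immediately before it
    if (!current.equals(previousLine)) {
        textToTransfer.add(current);
    }
    previousLine = current;
}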
I'm coding in Java and I want to split my string. I want to split it at:
/* sort */
Yes, I plan to split a .java file that I have read in as a string, so I need the split to include "/* sort */". I'm writing code that sorts arrays that are predefined in a Java class file.
Exactly that, and then do another split at
}
and then I wanted help with how to go about splitting up the array, since that is what I'll be left with.
An example would be this:
final static String[] ANIMALS = new String[] /* sort */ { "eland", "antelope", "hippopotamus"};
My goal would be to sort that Array inside a .java file and replace it. This is my current code
private void editFile() throws IOException {
    //Loads the whole Text or java file into a String
    try (BufferedReader br = new BufferedReader(new FileReader(fileChoice()))) {
        StringBuilder sb = new StringBuilder();
        String line = br.readLine();
        while (line != null) {
            sb.append(line);
            sb.append(System.lineSeparator());
            line = br.readLine();
        }
        everything = sb.toString();
    }
    arrayCutOff = everything.split("////* sort *////");
    for (int i = 0; i < arrayCutOff.length; i++) {
        System.out.println(arrayCutOff[i]);
    }
}
This basically reads the whole .txt or .java file completely with the exact same formatting into one string. I planned to split it at /* sort */ and sort the array inside but I realized if I did that I probably can't replace it.
Considering you're using Java 8, you might go in this direction:
private void editFile() throws IOException {
    List<String> lines = Files.readAllLines(Paths.get(fileChoice()));
    String content = lines.stream().collect(Collectors.joining(System.lineSeparator()));
    Stream.of(content.split(Pattern.quote("/* sort */"))).forEach(System.out::println);
}
However, the trick you're asking for is Pattern.quote, which dates back to Java 5. It quotes a string so it can be used as a literal inside a regex, and it is a bit more convenient (and, I think, more reliable) than wrestling with backslashes...
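As a small self-contained illustration of what Pattern.quote buys you (the sample string here is just an invented stand-in for your file contents):

import java.util.regex.Pattern;

public class QuoteDemo {
    public static void main(String[] args) {
        String source = "final static String[] ANIMALS = new String[] /* sort */ { \"eland\", \"antelope\" };";
        // Pattern.quote wraps the marker in \Q...\E so '*' and '/' lose their regex meaning
        String[] parts = source.split(Pattern.quote("/* sort */"));
        for (String part : parts) {
            System.out.println(part.trim());
        }
    }
}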
public String compWord() throws IOException, ClassNotFoundException
{
    // Local constants
    final int MAX_COUNT = 8;

    // Local variables
    BufferedReader reader = new BufferedReader(new FileReader("dictionary.txt")); // Create a new BufferedReader, looking for dictionary.txt
    List<String> lines = new ArrayList<String>(); // New ArrayList to keep track of the lines
    String line;                                  // Current line
    Random rand = new Random();                   // New random object
    String word;                                  // The computer's word

    /********************* Start compWord *********************/

    // Start reading the txt file
    line = reader.readLine();

    // WHILE the line isn't null
    while (line != null)
    {
        // Add the line to the lines list
        lines.add(line);

        // Go to the next line
        line = reader.readLine();
    }

    // Set the computer's word to a random word in the list
    word = lines.get(rand.nextInt(lines.size()));

    if (word.length() > MAX_COUNT)
        compWord();

    // Return the computer's word
    return word;
}
From what I understand, it should only be returning words no longer than 8 characters? Any idea what I am doing wrong? The if statement should call compWord again until the word is short enough, but for some reason I'm still getting words of 10-15 characters.
Look at this code:
if(word.length() > MAX_COUNT)
compWord();
return word;
If the word that is picked is longer than your limit, you're calling compWord recursively - but ignoring the return value, and just returning the "too long" word anyway.
Personally I would suggest that you avoid the recursion, and instead just use a do/while loop:
String word;
do
{
    word = lines.get(rand.nextInt(lines.size()));
} while (word.length() > MAX_COUNT);
return word;
Alternatively, filter earlier while you read the lines:
while (line != null) {
    if (line.length() <= MAX_COUNT) {
        lines.add(line);
    }
    line = reader.readLine();
}
return lines.get(rand.nextInt(lines.size()));
That way you're only picking out of the valid lines to start with.
Note that using Files.readAllLines is a rather simpler way of reading all the lines from a text file, by the way - and currently you're not closing the file afterwards...
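A rough sketch of putting those pieces together, assuming the same dictionary.txt and limit of 8 as in the question:

import java.io.IOException;
import java.nio.file.*;
import java.util.*;

public class RandomShortWord {
    private static final int MAX_COUNT = 8;
    private static final Random RAND = new Random();

    public static String compWord() throws IOException {
        // Files.readAllLines opens and closes the file for us
        List<String> shortWords = new ArrayList<>();
        for (String line : Files.readAllLines(Paths.get("dictionary.txt"))) {
            if (line.length() <= MAX_COUNT) {   // keep only words within the limit
                shortWords.add(line);
            }
        }
        return shortWords.get(RAND.nextInt(shortWords.size()));
    }
}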
If the word is longer than 8 characters, you simply call your method again, continue, and nothing changes.
So:
You are getting all the words from the file,
Then getting a random word from the List, and putting it in the word String,
And if the word is longer than 8 characters, the method runs again.
But, at the end, it will always return the word it picked first. The problem is that you call the method recursively and do nothing with the return value. You are calling a method, it does its work, and then the caller continues and, in this case, returns its own word. It does not matter that the method being called happens to be recursive.
Instead, I would recommend a non-recursive solution, as Skeet suggested, or learning a bit more about recursion and how to use the values it returns.
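If you do keep the recursion, the key is to use the value the recursive call returns; a minimal sketch of that change to the end of compWord:

if (word.length() > MAX_COUNT) {
    return compWord();   // try again, and return the word that attempt picks
}
return word;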
I am working on a Comp Sci assignment. In the end, the program will determine whether a file is written in English or French. Right now, I'm struggling with the method that counts the frequency of words that appears in a .txt file.
I have a set of text files in both English and French in their respective folders, labeled 1-20. The method asks for a directory (which in this case is "docs/train/eng/" or "docs/train/fre/") and for how many files the program should go through (there are 20 files in each folder). It then reads each file, splits all the words apart (I don't need to worry about capitalization or punctuation), and puts every word in a HashMap along with how many times it appeared in the file (key = word, value = frequency).
This is the code I came up with for the method:
public static HashMap<String, Integer> countWords(String directory, int nFiles) {
    // Declare the HashMap
    HashMap<String, Integer> wordCount = new HashMap();

    // this large 'for' loop will go through each file in the specified directory.
    for (int k = 1; k < nFiles; k++) {
        // Puts together the string that the FileReader will refer to.
        String learn = directory + k + ".txt";
        try {
            FileReader reader = new FileReader(learn);
            BufferedReader br = new BufferedReader(reader);
            // The BufferedReader reads the lines
            String line = br.readLine();
            // Split the line into a String array to loop through
            String[] words = line.split(" ");
            int freq = 0;
            // for loop goes through every word
            for (int i = 0; i < words.length; i++) {
                // Case if the HashMap already contains the key.
                // If so, just increments the value
                if (wordCount.containsKey(words[i])) {
                    wordCount.put(words[i], freq++);
                }
                // Otherwise, puts the word into the HashMap
                else {
                    wordCount.put(words[i], freq++);
                }
            }
            // Catching the file not found error
            // and any other errors
        }
        catch (FileNotFoundException fnfe) {
            System.err.println("File not found.");
        }
        catch (Exception e) {
            System.err.print(e);
        }
    }
    return wordCount;
}
The code compiles. Unfortunately, when I asked it to print the results of all the word counts for the 20 files, it printed this. It's complete gibberish (though the words are definitely there) and is not at all what I need the method to do.
If anyone could help me debug my code, I would greatly appreciate it. I've been at it for ages, conducting test after test and I'm ready to give up.
Let me combine all the good answers here.
1) Split up your methods to handle one thing each. One to read the files into strings[], one to process the strings[], and one to call the first two.
2) When you split, think deeply about how you want to split. As #m0skit0 suggests, you should likely split with \b for this problem.
3) As #jas suggested, you should first check if your map already has the word. If it does, increment the count; if not, add the word to the map and set its count to 1.
4) To print out the map in the way you likely expect, take a look at the below:
Map<String, Integer> test = new HashMap<>();
for (Map.Entry<String, Integer> entry : test.entrySet()) {
    System.out.println(entry.getKey() + " " + entry.getValue());
}
I would have expected something more like this. Does it make sense?
if (wordCount.containsKey(words[i])) {
    int n = wordCount.get(words[i]);
    wordCount.put(words[i], ++n);
}
// Otherwise, puts the word into the HashMap
else {
    wordCount.put(words[i], 1);
}
If the word is already in the hashmap, we want to get the current count, add 1 to that and replace the word with the new count in the hashmap.
If the word is not yet in the hashmap, we simply put it in the map with a count of 1 to start with. The next time we see the same word we'll up the count to 2, etc.
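For what it's worth, on Java 8+ the same update can be collapsed into one line per word with getOrDefault (this fragment assumes the wordCount map and words array from the question):

for (String w : words) {
    // start from 0 if the word is new, otherwise from its current count
    wordCount.put(w, wordCount.getOrDefault(w, 0) + 1);
}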
If you split by space only, then other characters (parentheses, punctuation marks, etc.) will be included in the words. For example, if you split "This phrase, contains... funny stuff" by space, you get: "This", "phrase,", "contains...", "funny" and "stuff".
You can avoid this by splitting by word boundary (\b) instead.
line.split("\\b");
By the way, your if and else parts are identical: you're always incrementing freq by one, which doesn't make much sense. If the word is already in the map, you want to get the current frequency, add 1 to it, and update the frequency in the map. If not, you put it in the map with a value of 1.
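Putting both points together, here is a small self-contained sketch; note that split("\\b") also returns the runs of spaces and punctuation between words as tokens, so they are filtered out before counting:

import java.util.*;

public class WordCountDemo {
    public static void main(String[] args) {
        String line = "This phrase, contains... funny stuff";
        Map<String, Integer> wordCount = new HashMap<>();
        for (String token : line.split("\\b")) {
            // skip the separator tokens (spaces, punctuation) that \b splitting also returns
            if (!token.matches("\\w+")) {
                continue;
            }
            if (wordCount.containsKey(token)) {
                wordCount.put(token, wordCount.get(token) + 1);   // already seen: bump the count
            } else {
                wordCount.put(token, 1);                          // first occurrence: start at 1
            }
        }
        System.out.println(wordCount);
    }
}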
And pro tip: always print/log the full stacktrace for the exceptions.
I have the following issue: I am trying to parse a .csv file in Java and store 3 specific columns of it in a 2-dimensional array. The code for the method looks like this:
public static void parseFile(String filename) throws IOException {
    FileReader readFile = new FileReader(filename);
    BufferedReader buffer = new BufferedReader(readFile);
    String line;
    String[][] result = new String[10000][3];
    String[] b = new String[6];

    for (int i = 0; i < 10000; i++) {
        while ((line = buffer.readLine()) != null) {
            b = line.split(";", 6);
            System.out.println("ID: " + b[0] + " Title: " + b[3] + " Description: " + b[4]); // Here is where the outofbounds exception occurs...
            result[i][0] = b[0];
            result[i][1] = b[3];
            result[i][2] = b[4];
        }
    }
    buffer.close();
}
I feel like I have to specify this: the .csv file is HUGE. It has 32 columns and (almost) 10,000 entries (!).
When Parsing, I keep getting the following:
XXXXX CHUNKS OF SUCCESFULLY EXTRACTED CODE
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException:3
at ParseCSV.parseFile(ParseCSV.java:24)
at ParseCSV.main(ParseCSV.java:41)
However, I realized that SOME of the content in the file has a strange format, e.g. some of the text in it contains line breaks even though no newline character appears to be involved. If I delete those blank lines manually, the output generated (before the error message appears) adds the data to the array up until the next blank line...
Does anyone have an idea how to fix this? Any help would be greatly appreciated...
Your first problem is that you probably have at least one blank line in your csv file. You need to replace:
b = line.split(";", 6);
with
b = line.split(";");
if (b.length < 5) {
    System.err.println("Warning, line has only " + b.length +
            " entries, so skipping it:\n" + line);
    continue;
}
If your input can legitimately have new lines or embedded semi-colons within your entries, that is a more complex parsing problem, and you are probably better off using a third-party parsing library, as there are several very good ones.
If your input is not supposed to have new lines in it, the problem probably is \r. Windows uses \r\n to represent a new line, while most other systems just use \n. If multiple people/programs edited your text file, it is entirely possible to end up with stray \r by themselves, which are not easily handled by most parsers.
An easy way to check whether that's your problem is to do the following before you split your line:
line = line.replace("\r", "");
If this is a process you are repeating many times, you might need to consider using a Scanner (or library) instead to get more efficient text processing. Otherwise, you can make do with this.
When you have newlines inside your CSV file, after this line
while((line = buffer.readLine()) != null){
the line variable will hold not a full CSV record but just some text without a ';'.
For example, if you have the file
column1;column2;column
3 value
after the first iteration the line variable will hold
column1;column2;column
and after the second iteration it will hold
3 value
When you call "3 value".split(";", 6), it returns an array with one element, and later, when you access b[3], it throws an exception.
The CSV format has many small details that take a lot of time to implement correctly. This is a good article about all the possible CSV cases:
http://en.wikipedia.org/wiki/Comma-separated_values#Basic_rules_and_examples
I would recommend using a ready-made CSV parser, like this one:
https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/CSVParser.html
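A rough sketch of what that could look like with commons-csv on this file (the file name here is made up, the column indices are the ones from the question, and exact method names can differ a little between library versions):

import java.io.Reader;
import java.nio.file.*;
import org.apache.commons.csv.*;

public class ParseWithCommonsCsv {
    public static void main(String[] args) throws Exception {
        try (Reader reader = Files.newBufferedReader(Paths.get("myFile.csv"));
             CSVParser parser = CSVFormat.DEFAULT.withDelimiter(';').parse(reader)) {
            for (CSVRecord record : parser) {
                // quoted fields may legitimately contain ';' or embedded newlines;
                // the parser handles those cases instead of blowing up
                System.out.println("ID: " + record.get(0) + " Title: " + record.get(3)
                        + " Description: " + record.get(4));
            }
        }
    }
}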
String's split(pattern, limit) method returns an array sized to the number of tokens found, up to the number specified by the limit parameter. The limit is the maximum, not the minimum, number of array elements returned.
"1,2,3" split with (",", 6) will return an array of 3 elements: "1", "2" and "3".
"1,2,3,4,5,6,7" will return 6 elements: "1", "2", "3", "4", "5" and "6,7". The last element looks goofy because the split method stopped splitting after five splits and returned the rest of the source string as the sixth element.
An empty line is represented as an empty string (""). Splitting "" will return an array of 1 element, the empty string.
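For instance, a tiny self-contained check of those three cases:

public class SplitLimitDemo {
    public static void main(String[] args) {
        System.out.println("1,2,3".split(",", 6).length);           // 3 -> "1", "2", "3"
        System.out.println("1,2,3,4,5,6,7".split(",", 6).length);   // 6 -> last element is "6,7"
        System.out.println("".split(",", 6).length);                // 1 -> the single empty string ""
    }
}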
In your case, the string array created here
String[] b = new String[6];
and assigned to b is replaced by the array returned by
b = line.split(";",6);
and meets its ultimate fate at the hands of the garbage collector, unseen and unloved.
Worse, in the case of the empty lines, it's replaced by a one element array, so
System.out.println("ID: "+b[0]+" Title: "+b[3]+ "Description: "+b[4]);
blows up when trying to access b[3].
Suggested solution is to either
while ((line = buffer.readLine()) != null) {
    if (line.length() != 0)
    {
        b = line.split(";", 6);
        System.out.println("ID: " + b[0] + " Title: " + b[3] + " Description: " + b[4]); // Here is where the outofbounds exception occurs...
        ...
    }
or (better because the previous could trip over a malformed line)
while ((line = buffer.readLine()) != null) {
    b = line.split(";", 6);
    if (b.length == 6)
    {
        System.out.println("ID: " + b[0] + " Title: " + b[3] + " Description: " + b[4]); // Here is where the outofbounds exception occurs...
        ...
    }
You might also want to think about the for loop around the while. I don't think it's doing you any good.
while((line = buffer.readLine()) != null)
is going to read every line in the file, so
for(int i = 0; i<10000; i++){
while((line = buffer.readLine()) != null){
is going to read every line in the file the first time through. Then it will make 9,999 more attempts to read the file, find nothing new, and exit the while loop immediately each time.
You are also not protected from reading more than 10,000 elements, because the while loop will happily read a 10,001st line and overrun your array if there are more than 10,000 lines in the file. Look into replacing the big array with an ArrayList or Vector, as they will grow to fit your file.
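A sketch of that shape, keeping the blank-line guard from the first answer (it is still not a substitute for a real CSV parser if fields can contain ';' or newlines):

import java.io.*;
import java.util.*;

public class ParseCSV {
    // Collect only the three wanted columns, letting the list grow with the file
    public static List<String[]> parseFile(String filename) throws IOException {
        List<String[]> result = new ArrayList<>();
        try (BufferedReader buffer = new BufferedReader(new FileReader(filename))) {
            String line;
            while ((line = buffer.readLine()) != null) {
                String[] b = line.split(";");
                if (b.length > 4) {                        // skip blank or malformed lines
                    result.add(new String[] { b[0], b[3], b[4] });
                }
            }
        }
        return result;
    }
}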
Please check b.length (here it must be greater than 4) before accessing b[3] and b[4].
I'm working on a Java method that will parse a text file line by line inside an eternal for-loop.
As you can see, I'm assigning the content read by a BufferedReader to a list:
BufferedReader br = new BufferedReader(new FileReader("C:/feed.txt"));
String strLine;
ArrayList list = new ArrayList();
while ((strLine = br.readLine()) != null) {
    list.add(strLine);
}
This works perfectly, and the content of feed.txt is fully assigned to the ArrayList with its 18,238 lines.
But when I tried to use and process the content of the list with an iterator in a for-loop (the following code), there is a problem:
Iterator itr;
for (itr = list.iterator(); itr.hasNext();) {
    String str = itr.next().toString();
}
The instructions (business logic) in the loop work perfectly until line number 5175, where the program stops its iteration. Is the problem linked to the iterator?
I think it is about the iterator, because there isn't anything special about that line; even after deleting it, the problem persists.
Does the iterator have a limitation? How can I raise it?
I'm trying to parse a file with this number of lines, but I'm also supposed to build into my project an eternal, never-ending loop that receives lines all the time.
Can you help me, please?
A few things here:
First of all, use
List<String> list = new ArrayList<String>();
Then you can just iterate using
for (String str : list) {
// do something
}
Does that solve the problem?
You asked "how do I combine"? Here is a simple example. Note it does NOT use an iterator - so you will just be able to see that you are able to do something with all the lines in the input file. It's not really answering your question, but it should help you narrow down where the problem lies.
BufferedReader br = new BufferedReader(new FileReader("C:/feed.txt"));
String strLine, str;
int numLines;
ArrayList<String> list = new ArrayList<String>();

numLines = 0;
while ((strLine = br.readLine()) != null) {
    list.add(strLine);
    System.out.println(list.get(numLines));
    numLines++;
    // do whatever you were going to do with the iterator here
    str = strLine.toString();
}
System.out.printf("Read in %d lines; the last line read is \n%s\n", numLines, list.get(numLines - 1));
While this is not exactly what you were asking, when you run this code and see how it fails you will be a step closer to solving your stated problem.