How do I get rid of these empty strings? - java

My constructor takes a filename of a text file and converts it to an ArrayList of all the words in lowercase, without punctuation or white space. These specs, along with the constructor's argument are specified by my homework assignment, so don't suggest I change them.
private ArrayList<String> list;
public Tokenizer(String file) throws IOException {
list = new ArrayList<>();
String thisLine;
BufferedReader br = new BufferedReader(new FileReader(file));
while ((thisLine = br.readLine()) != null)
list.addAll(Arrays.asList(thisLine.replaceAll("\\p{Punct}+"," ").toLowerCase().split("\\s+")));
}
My problem is that there are many empty strings that appear. I've tried using "-1" as the second argument in "split", but it doesn't seem to do anything.
My other question is whether it's inefficient to use Arrays.asList, or whether I should just create an iterator. Also, is there anything else I'm doing wrong? E.g., is there another way to pass a filename to the BufferedReader?
Thanks
Edit 1:
Below is the test I used for an online book that I found on Project Gutenberg (it is a text file and there are no problems with the text file). I also get similar results when using a text file that I personally created, so I don't think it's a problem with the text file itself.
In fact, I'll just reproduce my entire code since it's pretty simple:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.*;
public class Tokenizer {
private ArrayList<String> list;
public Tokenizer(String file) throws IOException {
list = new ArrayList<>();
String thisLine;
BufferedReader br = new BufferedReader(new FileReader(file));
while ((thisLine = br.readLine()) != null)
list.addAll(Arrays.asList(thisLine.replaceAll("\\p{Punct}+"," ").toLowerCase().trim().split("\\s+")));
}
public ArrayList<String> wordList() {
return list;
}
public static void main(String[] args) throws IOException {
Tokenizer T = new Tokenizer("C:\\...\\1898amongmyb00loweuoft_djvu.txt");
ArrayList<String> array = T.wordList();
for(int i = 0; i < 20; i++) {
System.out.println(array.get(i));
}
}
}
And here is my output:
i
9
digitized
by
the
internet
archive
in
2007
with
funding
from
microsoft
corporation
No, those empty lines are not white space. They are empty strings. As in, "". I hope I am as clear as possible.
Since it will probably cause confusion, no that is not the actual argument I use for the path name of the file. The ellipsis (the "...") is just a shorthand, so I don't have to reveal my computer directories to the internet.
Also, yes there is another empty string at the end, but this website's interface will not let me put it there.
Edit 2:
I always forget something, here is the first few lines of the text file:
I 9
Digitized by the Internet Archive
in 2007 with funding from
Microsoft Corporation
http://www.archive.org/details/1898amongmyb00loweuoft
James Ettsscll Lotocll.
COMPLETE POETICAL AND PROSE WORKS. Riverside
Edition, n vols, crown 8vo, gilt top, each, $ 1.50 ; the set,
$ 1 6. 50.
1-4. Literary Essays (including My Study Windows, Among
My Books, Fireside Travels) ; 5. Political Essays ; 6. Literary
and Political Addresses ; 7. Latest Literary Essays and Ad-
dresses, The Old English Dramatists ; 8-1 1. Poems.
PROSE WORKS. Riverside Edition. With Portraits. 7 vols,
crown 8vo, gilt top, $10.50.
POEMS. Riverside Edition. With Portraits. 4 vols, crown
8vo, gilt top, $6.00.
COMPLETE POETICAL WORKS. Cambridge Edition.
Printed from clear type on opaque paper, and attractively
bound. With a Portrait and engraved Title-page, and a
Vignette of Lowell's Home, Elmwood. Large crown 8vo, $2.00.
Household Edition. With Portrait and Illustrations. Crown
8vo, $1.50.
Cabinet Edition. i8
I think I now see the problem. The empty strings correspond to the empty lines.
Edit 3:
So I ended up answering my own problem. I ended up doing this:
while ((thisLine = br.readLine()) != null) {
ArrayList<String> newList = new ArrayList<>(Arrays.asList(thisLine.replaceAll("\\p{Punct}+"," ").toLowerCase().split("\\s+")));
while(newList.remove(""));
list.addAll(newList);
}
I did try using an if statement, but then you are comparing the line before the split. This could be problematic because the split may produce some empty strings you would then miss. Therefore, I built the list I was going to add to my main list, but before adding it, I went through it and deleted every instance of the empty string.
I don't really know if this is the most efficient way of doing things... if it's not, let me know!

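Your problem is most likely whitespace at the beginning of the thisLine read from the file, which is very common in text documents. If you call split on \s+ and the line begins with whitespace, the very first element will be an empty string. (Whitespace at the end is harmless, because split drops trailing empty strings.) A completely empty line also produces a single empty string.
To fix the first case, I would suggest adding a trim on your string before you do the split. Blank lines still yield one empty string after trimming, so those need to be skipped separately.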
Using your code change it to:
list.addAll(Arrays.asList(thisLine.replaceAll("\\p{Punct}+"," ").toLowerCase().trim().split("\\s+")));
Try that and see if it doesn't get rid of most if not all of your empty strings. Also, you should consider breaking this statement up into multiple operations so that it is easier to read.

How about replacing
while ((thisLine = br.readLine()) != null)
list.addAll(Arrays.asList(thisLine.replaceAll("\\p{Punct}+"," ").toLowerCase().trim().split("\\s+")));
with:
while ((thisLine = br.readLine()) != null)
if (thisLine.length() > 0)
list.addAll(Arrays.asList(thisLine.replaceAll("\\p{Punct}+", " ").toLowerCase().trim().split("\\s+")));
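Putting the two suggestions together, here is a minimal sketch of a per-line helper that never returns an empty string, even for lines that are blank or contain only punctuation. The class and method names (LineTokenizer, tokenizeLine, tokenize) are made up for illustration, and the reader is fed from a string here rather than a file so the example is self-contained.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of the original Tokenizer class.
final class LineTokenizer {

    // Splits one line into lowercase words with punctuation removed.
    static List<String> tokenizeLine(String line) {
        String cleaned = line.replaceAll("\\p{Punct}+", " ").toLowerCase().trim();
        if (cleaned.isEmpty()) {
            // "".split("\\s+") would return [""], so return nothing instead.
            return Collections.emptyList();
        }
        // After trim() there is no leading whitespace, so no leading "" token.
        return Arrays.asList(cleaned.split("\\s+"));
    }

    // Reads every line from the reader and collects all of its words.
    static List<String> tokenize(BufferedReader reader) throws IOException {
        List<String> words = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            words.addAll(tokenizeLine(line));
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        String text = "I 9\n\nDigitized by the Internet Archive\n...\n";
        // try-with-resources closes the reader even if reading fails.
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            System.out.println(tokenize(reader));
        }
    }
}
```

For a real file, the StringReader would be swapped for a FileReader on the filename; the check on the cleaned string is what covers the punctuation-only lines that a plain length check on the raw line lets through.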


Why is my collections.sort leading to different outputs in 2 arraylists with the same data [closed]

I am having some trouble sorting 2 ArrayLists of mine that hold exactly the same data. One is received through my API and the other is parsed from in-game text fields. For some reason they are sorted differently even though their contents look identical.
public ArrayList<String> names = new ArrayList<>();
AbowAPI api = new AbowAPI();
@Override
public void onStart() throws InterruptedException {
try {
names = api.getTargets();
Collections.sort(names);
} catch (IOException e) {
log(e.getStackTrace());
}
}
@Override
public int onLoop() throws InterruptedException {
if(tabs.getOpen() != Tab.FRIENDS) {
tabs.open(Tab.FRIENDS);
} else {
ArrayList<String> friendList = getFriendNames();
Collections.sort(friendList);
log(friendList);
}
This here is the resulting output
[INFO][Bot #1][05/06 07:59:51 em]: [abc, abow42069, adam, bad, bl ack, blood, blue, bye, dead, dog, google, her, him, john wick, light, lol, mad, red]
[INFO][Bot #1][05/06 07:59:51 em]: [abc, abow42069, adam, bad, blood, blue, bl ack, bye, dog, google, her, him, john wick, light, lol, mad, red]
When I compare the 2 ArrayLists they are also not equal. I need them to be sorted the same way so they match, but I'm having trouble with it. Any idea why they sort in different ways?
This is my API call to get the targets, maybe this is what is causing the weird bug?
URL url = new URL("http://127.0.0.1:5000/snipes");
HttpURLConnection con = (HttpURLConnection) url.openConnection();
con.setRequestMethod("GET");
BufferedReader in = new BufferedReader(
new InputStreamReader(con.getInputStream()));
String inputLine;
ArrayList<String> content = new ArrayList<>();
while ((inputLine = in.readLine()) != null) {
content.add(inputLine);
}
in.close();
return content;
The third character in "bl ack" is in one case a character that sorts before the lower-case alphabetics, and in the other case a character that sorts after them.
I would go with the hunch that says the first one is the "normal" space, and the second one is some other space (of which there are a few), and therefore the second one will have a character code greater than 127, i.e., outside the usual ASCII range, which is possible since Java strings are UTF-16 rather than ASCII.
To debug, I'd insert code like this:
for (String f : friendList)
for (int k=0; k<f.length(); k++) {
char c = f.charAt(k);
if (c > 127)
System.out.printf("String '%s' char %d has code %d (%x)%n",
f, k, c, c);
}
It's not pretty but it'll get the job done. It'll work equally well if I'm wrong that the 'funny' character is a form of space, but is just something the logger is replacing by space. Armed with the character code in hex, you can look it up at unicode.org.
Replace 'printf' with anything more suitable to your development environment, if appropriate.
(For the purists, I'm guessing it's not going to be a surrogate pair, thus I'm using char rather than codepoint).
Once you know what you're dealing with, you can devise a handling strategy, which might be "replace the funny character with plain old space".
Edited since we now know the character is non-breaking space, 00a0 in hex.
You could, for example, change this:
ArrayList<String> friendList = getFriendNames();
Collections.sort(friendList);
to this:
ArrayList<String> friendList = new ArrayList<>();
ArrayList<String> temp = getFriendNames();
for (String t : temp) {
friendList.add(t.replaceAll("\\h", " "));
}
Collections.sort(friendList);
The \h represents any horizontal whitespace in a Java regular expression Pattern, which includes the non-breaking space and others. So we're going above and beyond the observed problem, normalizing all possible "spaces".
It would probably be better to make the same replacement in the getFriendNames method, but I don't think you've shown that code. Nevertheless, I hope you get the idea.
(Code typed in, not tested).
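To make the effect concrete, here is a small self-contained sketch. The class and method names are invented for the example, and the sample names are taken from the log output in the question; it assumes the odd character is U+00A0 as established above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration only.
final class SpaceNormalizer {

    // Replaces every horizontal whitespace character (including U+00A0)
    // with a plain space, then sorts the resulting names.
    static List<String> normalizeAndSort(List<String> names) {
        List<String> result = new ArrayList<>();
        for (String name : names) {
            result.add(name.replaceAll("\\h", " "));
        }
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) {
        List<String> raw = new ArrayList<>(Arrays.asList("blue", "bl\u00A0ack", "blood"));

        // Unnormalized: U+00A0 (160) is greater than any lowercase letter,
        // so the name containing it sorts after "blue".
        Collections.sort(raw);
        System.out.println(raw);

        // Normalized: a plain space (32) sorts before the letters.
        System.out.println(normalizeAndSort(raw));
    }
}
```

The same normalization would have to be applied to both lists (or at the point where the names are parsed) for equals() on the two lists to succeed.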

Removing duplicate lines from a text file

I have a text file that is sorted alphabetically, with around 94,000 lines of names (one name per line, text only, no punctuation).
Example:
Alice
Bob
Simon
Simon
Tom
Each line takes the same form, first letter is capitalized, no accented letters.
My code:
try{
BufferedReader br = new BufferedReader(new FileReader("orderedNames.txt"));
PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("sortedNoDuplicateNames.txt", true)));
ArrayList<String> textToTransfer = new ArrayList<>();
String previousLine = "";
String current = "";
//Load first line into previous line
previousLine = br.readLine();
//Add first line to the transfer list
textToTransfer.add(previousLine);
while((current = br.readLine()) != previousLine && current != null){
textToTransfer.add(current);
previousLine = current;
}
int index = 0;
for(int i=0; i<textToTransfer.size(); i++){
out.println(textToTransfer.get(i));
System.out.println(textToTransfer.get(i));
index ++;
}
System.out.println(index);
}catch(Exception e){
e.printStackTrace();
}
From what I understand, the first line of the file is read into the previousLine variable as I intended, and current is set to the second line of the file. current is then compared against the previous line and against null; if it's not the same as the previous line and it's not null, we add it to the ArrayList.
previousLine is then set to current's value so that the next readLine can overwrite current and the comparison continues in the while loop.
I cannot see what is wrong with this.
If a duplicate is found, surely the loop should break?
Sorry in advance when it turns out to be something stupid.
Use a TreeSet instead of an ArrayList.
Set<String> textToTransfer = new TreeSet<>();
The TreeSet is sorted and does not allow duplicates.
Don't reinvent the wheel!
If you don't want duplicates, you should consider using a Collection that doesn't allow duplicates. The easiest way to remove repeated elements is to add the contents to a Set, which will not allow duplicates:
import java.util.*;
import java.util.stream.*;
public class RemoveDups {
public static void main(String[] args) {
Set<String> dist = Arrays.asList(args).stream().collect(Collectors.toSet());
}
}
Another way is to remove duplicates from the text file before the Java code reads it, for example on Linux (often far quicker than doing it in Java code):
sort -u myFileWithDuplicates.txt > myFileWithoutDuplicates.txt
While, like the others, I recommend using a collection that does not allow repeated entries, I think I can identify what is wrong with your code. The way you compare strings in your while loop is incorrect in Java. The == operator (and its counterpart !=) determines whether two references point to the same object, which is not the same as determining whether their values are the same. Luckily, Java's String class has an instance method for comparing contents, equals(). Check for null first, because calling equals() on the null that readLine() returns at the end of the file would throw a NullPointerException. You may want something like this:
while((current = br.readLine()) != null && !current.equals(previousLine)){
Keep in mind that breaking your While loop here will force your file reading to stop, which may or may not be what you intended.
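If the goal is to skip duplicates rather than stop at the first one, the comparison can move inside the loop body. Here is a minimal sketch under the assumption that the lines are already sorted and have been read into memory (for instance with Files.readAllLines); the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only.
final class SortedDeduplicator {

    // Keeps the first line of every run of identical adjacent lines.
    // Only removes all duplicates if the input is sorted.
    static List<String> dropAdjacentDuplicates(Iterable<String> sortedLines) {
        List<String> unique = new ArrayList<>();
        String previous = null;
        for (String current : sortedLines) {
            // equals() compares contents; == would compare references.
            if (!current.equals(previous)) {
                unique.add(current);
            }
            previous = current;
        }
        return unique;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("Alice", "Bob", "Simon", "Simon", "Tom");
        System.out.println(dropAdjacentDuplicates(names));
    }
}
```

Unlike the original while condition, this keeps reading after a duplicate, so "Tom" still makes it into the result.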

How to search for name in file and extract value

I have a file that looks like this:
Dwarf remains:0
Toolkit:1
Cannonball:2
Nulodion's notes:3
Ammo mould:4
Instruction manual:5
Cannon base:6
Cannon base noted:7
Cannon stand:8
Cannon stand noted:9
Cannon barrels:10
...
What is the easiest way to open this file, search for name and return the value of the field? I cannot use any external libraries.
What I have tried (is this OK?):
public String item(String name) throws IOException{
String line;
FileReader in = new FileReader("C:/test.txt");
BufferedReader br = new BufferedReader(in);
while ((line = br.readLine()) != null) {
if(line.contains(name)){
String[] parts = line.split(":");
return parts[1];
}
}
return null;
}
As a follow-up to your code: it compiles and works OK. Be aware, though, that the native path separator on Windows is \ rather than /, although Java's file APIs also accept / there. You could've created the path using, for example: Paths.get("C:", "test.txt").toString(). The platform separator is also defined in File.separator.
The task can be easily achieved using basic Java capabilities. Firstly, you need to open the file and read its lines. That can be done with Files.lines(Paths.get("path/to/file")). Secondly, you need to iterate through all the lines returned by that call. If you do not know the Stream API, you can convert the value returned by Files.lines(...) from a Stream to an array using String[] lines = Files.lines(Paths.get("path/to/file")).toArray(a -> new String[a]);. Now the lines variable holds all the lines from the input file.
You have to then split each line into two parts (String.split) and see whether first part equals (String.equals) what you're looking for. Then simply return the second one.
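A rough sketch of that split-and-compare step follows. It assumes the lines have already been read (for example with Files.readAllLines) and that the value is whatever follows the last colon on the line; ItemLookup and findValue are names invented for the example.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only.
final class ItemLookup {

    // Returns the value for an exact name match, or null if the name is absent.
    static String findValue(Iterable<String> lines, String name) {
        for (String line : lines) {
            int colon = line.lastIndexOf(':');
            if (colon < 0) {
                continue; // not a "name:value" line
            }
            // equals() avoids the false positives that contains() allows,
            // e.g. "Cannon" matching "Cannonball".
            if (line.substring(0, colon).equals(name)) {
                return line.substring(colon + 1).trim();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Cannonball:2", "Cannon base:6", "Cannon base noted:7");
        System.out.println(findValue(lines, "Cannon base noted"));
        System.out.println(findValue(lines, "Cannon"));
    }
}
```

With the contains() check from the question, a search for "Cannon" would return the value on the "Cannonball" line; the exact comparison returns null instead.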

IndexOf(), String index out of bounds: -1

I have no idea what is happening. I have a list of products along with a number separated with a tab. When I use indexOf() to find the tab, I get a String index out of bounds error, and it says the index is -1. Here's the code:
package taxes;
import java.util.*;
import java.io.*;
public class Taxes {
public static void main(String[] args) throws IOException {
//File aFile = new File("H:\\java\\PrimeNumbers\\build\\classes\\primenumbers\\priceList.txt");
File aFile = new File("C:\\Users\\Tim\\Documents\\NetBeansProjects\\Taxes\\src\\taxes\\priceList.txt");
priceChange(aFile);
}
static void priceChange(File inFile) throws IOException {
Scanner scan = new Scanner("priceList.txt");
char tab = '\t';
while (scan.hasNextLine()) {
String line = scan.nextLine();
int a = line.indexOf(tab);
String productName = line.substring(0,a);
String priceTag = line.substring(a);
}
}
}
And here's the input:
Plyer set 10
Jaw Locking Plyers 10
Cable Cutter 7
16 oz. Hammer 5
64 oz. Dead Blow Hammer 12
Sledge Hammer 20
Cordless Drill 22
Hex Impact Driver 50
Drill Bit Set 30
Miter Saw 200
Circular Saw 40
Scanner scan = new Scanner("priceList.txt");
This line of code is wrong. This Scanner instance will scan the String "priceList.txt". It doesn't contain a tab, therefore indexOf returns -1.
Change it to:
Scanner scan = new Scanner(inFile);
to use the method argument, that is the desired file instance of your priceList.txt.
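A quick way to see the difference is to ask the Scanner for its first line. This is only a demonstration sketch; the class and method names are made up.

```java
import java.util.Scanner;

// Hypothetical demo class for illustration only.
final class ScannerSourceDemo {

    // Returns the first line the scanner produces, or null if there is none.
    static String firstLineOf(Scanner scanner) {
        return scanner.hasNextLine() ? scanner.nextLine() : null;
    }

    public static void main(String[] args) {
        // Scanner(String) scans the characters of the string itself,
        // so the "file name" is the only line it ever returns.
        String first = firstLineOf(new Scanner("priceList.txt"));
        System.out.println(first);
        System.out.println(first.indexOf('\t'));
    }
}
```

Passing a File (or a Path) to the Scanner constructor is what makes it read the file's contents instead.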
String.indexOf(char) will return -1 if an instance isn't found.
You need to check before proceeding that a isn't negative.
The Javadoc for String.indexOf documents this behavior.
Because you are checking int a = line.indexOf(tab) in every iteration of the while loop, there has to be a tab in every single line of your document in order for the error to be prevented.
When your while (scan.hasNextLine()) loop runs into a line with no tab in it, the index is going to be -1, and you get the StringIndexOutOfBoundsException when trying to get line.substring(0,a), with a being -1.
while (scan.hasNextLine()) {
String line = scan.nextLine();
int a = line.indexOf(tab);
if(a!=-1) {
String productName = line.substring(0,a);
String priceTag = line.substring(a);
}
}
If you look very carefully at the input lines you have posted, you'll see
Jaw Locking Plyers 10
...
Cordless Drill 22
Hex Impact Driver 50
Drill Bit Set 30
that the "Hex Impact Driver" line has the price two characters to the right of the one in the lines before and after. This is an indication that "50" does not start at a tab position whereas "10" is at such a position, the next after the one for "22" and "30".
The Q&A editor does preserve TABs, so your editor preserves them as well, and your program should be able to recognize a TAB in the input lines.
That said, a TAB entered by hand (!) is a very poor choice for a separator. As you have experienced, text file presentation doesn't show it. It would be much better to use a special character that does not occur in the product names. Plausible choices are '|', '#', and '\'.
Another good way would be to use pattern matching to find the numeric price at the end of a line - the product name is what remains after removing the price and calling trim() on the remaining string.
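A possible sketch of the pattern-matching idea, assuming every price is a whole number at the end of the line and is separated from the name by at least one whitespace character (tab or space). PriceLineParser and parse are names invented for the example.

```java
import java.util.AbstractMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
final class PriceLineParser {

    // Group 1: product name (lazy); group 2: trailing run of digits.
    private static final Pattern LINE = Pattern.compile("^(.*?)\\s+(\\d+)\\s*$");

    // Returns (name, price), or null if the line has no trailing number.
    static Map.Entry<String, Integer> parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        // Assumes the price fits in an int.
        return new AbstractMap.SimpleImmutableEntry<>(
                m.group(1).trim(), Integer.parseInt(m.group(2)));
    }

    public static void main(String[] args) {
        System.out.println(parse("16 oz. Hammer\t5"));
        System.out.println(parse("Hex Impact Driver  50"));
        System.out.println(parse("no price on this line"));
    }
}
```

Because the separator is matched as \s+, it no longer matters whether the editor stored a tab or a run of spaces.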
Since it has been verified that indexOf(tab) returns -1, the question is why the line of text does not contain a tab when you seem certain that it does.
The answer is most likely the settings on your IDE. For instance, I usually configure Netbeans to convert a tab to three spaces. So if you typed this input file yourself within an IDE, the tab-to-space conversion is likely the problem.
Workarounds:
If we copy/paste some text into Netbeans that includes tabs, the tabs do not get converted to spaces.
The text file could be created with notepad or any other simple text editor to avoid the problem.
Change the settings on your IDE, at least for this project.

Comparing Sentences From a Read-In File - Java

I need to read in a file that contains 2 sentences to compare and return a number between 0 and 1. If the sentences are exactly the same it should return a 1 for true and if they are totally opposite it should return a 0 for false. If the sentences are similar but words are changed to synonyms or something close it should return a .25 .5 or .75. The text file is formatted like this:
______________________________________
Text: Sample
Text 1: It was a dark and stormy night. I was all alone sitting on a red chair. I was not completely alone as I had three cats.
Text 20: It was a murky and stormy night. I was all alone sitting on a crimson chair. I was not completely alone as I had three felines
// Should score high point but not 1
Text 21: It was a murky and tempestuous night. I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines
// Should score lower than text20
Text 22: I was all alone sitting on a crimson cathedra. I was not completely alone as I had three felines. It was a murky and tempestuous night.
// Should score lower than text21 but NOT 0
Text 24: It was a dark and stormy night. I was not alone. I was not sitting on a red chair. I had three cats.
// Should score a 0!
________________________________________________
I have a file reader, but I am not sure the best way to store each line so I can compare them. For now I have the file being read and then being printed out on the screen. What is the best way to store these and then compare them to get my desired number?
import java.io.*;
public class implement
{
public static void main(String[] args)
{
try
{
FileInputStream fstream = new FileInputStream("textfile.txt");
DataInputStream in = new DataInputStream (fstream);
BufferedReader br = new BufferedReader (new InputStreamReader(in));
String strLine;
while ((strLine = br.readLine()) != null)
{
System.out.println (strLine);
}
in.close();
}
catch (Exception e)
{
System.err.println("Error: " + e.getMessage());
}
}
}
Save them in an array list.
ArrayList<String> list = new ArrayList<>();
// read the file
// inside the while loop:
list.add(strLine);
To check each word in a sentence, simply remove punctuation, split on spaces, and search for each word in the sentence you are comparing against. I would suggest ignoring words of 2 or 3 characters; that is up to your discretion.
Then save the lines to the list and compare them however you want.
To compare similar words you will need a way to look words up efficiently, i.e. a hash table. Once you have this you can look words up fairly quickly. Next, this hash table of words will need a thesaurus entry linked to each word, listing similar words. Then take the similar words for the key words in each sentence and search for them in the sentence you are comparing against. Obviously, before you search for similar words you would want to compare the two actual sentences directly. In the end you will need a more advanced data structure, which you will have to build yourself, to do more than direct comparisons.
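As a starting point, the direct comparison (before any thesaurus lookup) could be sketched like this. It is only a plain word-overlap baseline: it knows nothing about synonyms, word order, or negation, so it will not produce the 0 the question expects for the negated text. The class and method names are made up, and the "longer than 3 characters" cutoff is the suggestion from above.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical baseline for illustration only.
final class WordOverlap {

    // Lowercases, strips punctuation, and keeps words longer than 3 characters.
    static List<String> keyWords(String sentence) {
        List<String> words = new ArrayList<>();
        String cleaned = sentence.replaceAll("\\p{Punct}+", " ").toLowerCase();
        for (String word : cleaned.split("\\s+")) {
            if (word.length() > 3) {
                words.add(word);
            }
        }
        return words;
    }

    // Fraction of the key words of 'a' that also occur in 'b' (0.0 to 1.0).
    static double score(String a, String b) {
        List<String> wordsA = keyWords(a);
        if (wordsA.isEmpty()) {
            return 0.0;
        }
        Set<String> wordsB = new HashSet<>(keyWords(b)); // hash lookup per word
        int found = 0;
        for (String word : wordsA) {
            if (wordsB.contains(word)) {
                found++;
            }
        }
        return (double) found / wordsA.size();
    }

    public static void main(String[] args) {
        String original = "It was a dark and stormy night.";
        String variant = "It was a murky and stormy night.";
        System.out.println(score(original, original));
        System.out.println(score(original, variant));
    }
}
```

A thesaurus step would slot in where wordsB.contains(word) is called, by also accepting any listed synonym of the word.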
