Issue removing punctuation and capital letters - java

I am trying to read a text file and create a hash map with unique words and their frequency. I searched for a method of removing punctuation and tried implementing it, but it doesn't seem to be working.
I tried using the following in the fourth line of code: line = line.replaceAll("\\p{Punct}+", "");
Am I missing something?
try (BufferedReader br = new BufferedReader(new FileReader("Book 1 A_Tale_of_Two_Cities_T.txt"))) {
StringBuilder sb = new StringBuilder();
String line = br.readLine();
line = line.replaceAll("\\p{Punct}+", "");
while (line != null) {
String[] words = line.split(" ");//those are your word
for (int i = 0; i < words.length; i++) {
if (m1.get(words[i]) == null) {
m1.put(words[i], 1);
} else {
int newValue = Integer.valueOf(String.valueOf(m1.get(words[i])));
newValue++;
m1.put(words[i], newValue);
}
}
sb.append(System.lineSeparator());
line = br.readLine();
}
}
Map<String, String> sorted = new TreeMap<>(m1);
for (Object key : sorted.keySet()) {
System.out.println("Word: " + key + "\tCounts: " + m1.get(key));
}
The output I am expecting looks like this:
Word: there Counts: 279
Word: thereupon Counts: 1
Word: these Counts: 156
The issue is that I am also getting this as output:
Word: these, Counts: 3
Word: these. Counts: 2
Word: these.’ Counts: 1
I would like the punctuation removed from the end (and beginning) of the words and have them added to the count of "these", etc.
Thanks for your help!

You are running your replaceAll after reading the first line:
String line = br.readLine();
line = line.replaceAll("\\p{Punct}+", "");
So the first line will not have any punctuation. But then, you go into this while loop:
while (line != null) {
...
line = br.readLine();
}
So there is no replaceAll inside the loop. At its end you read another line. And then you loop back to the while. Since there is no replacement inside the loop, the second line and those that follow it will retain the punctuation.
The replacement should be done inside the loop. Moreover, it shouldn't be done right after you read the first line, because that very first line might be null in theory (if the file is empty).
So what you should do is do it inside the loop after you verify that the line is not null:
String line = br.readLine();
while (line != null) {
line = line.replaceAll("\\p{Punct}+", "");
...
line = br.readLine();
}
Now, it tests if the line is null, and then replaces the punctuation in it. And since the replacement is done inside the while, it will also be applied to the second line and those that follow it.
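Putting the pieces of this answer together, the full counting loop could look roughly like the sketch below. The class name WordFrequency, the method countWords, and the sample text are made up for illustration; the lowercasing step is an assumption based on the question's title, and a StringReader stands in for the real file so the example runs on its own.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.TreeMap;

public class WordFrequency {

    // Counts the words supplied by the reader, ignoring ASCII punctuation and case.
    static Map<String, Integer> countWords(BufferedReader br) {
        Map<String, Integer> counts = new TreeMap<>(); // TreeMap keeps the keys sorted
        try {
            String line = br.readLine();
            while (line != null) {
                // The cleanup sits inside the loop, so it is applied to every line.
                line = line.replaceAll("\\p{Punct}+", "").toLowerCase();
                for (String word : line.split("\\s+")) {
                    if (!word.isEmpty()) {
                        counts.merge(word, 1, Integer::sum);
                    }
                }
                line = br.readLine();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sample text standing in for the book file.
        String sample = "These, and these.\nThere were THESE!";
        countWords(new BufferedReader(new StringReader(sample)))
                .forEach((word, n) -> System.out.println("Word: " + word + "\tCounts: " + n));
    }
}
```

To read the actual book, the StringReader would be replaced by a FileReader opened in a try-with-resources block, as in the question.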

As RealSkeptic pointed out, you need to put the regex replace inside the loop.
There are several other "problems" with your code, but the main problem is there's just so much of it.
Here's how you can do it in one (albeit long) line:
Files.lines(Paths.get("Book 1 A_Tale_of_Two_Cities_T.txt"))
.map(s -> s.replaceAll("\\p{Punct}", "").toLowerCase())
.flatMap(s -> Arrays.stream(s.split("\\s+")))
.collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
.entrySet().stream()
.sorted(Map.Entry.comparingByKey())
.forEach(e -> System.out.println("Word: " + e.getKey() + "\tCounts: " + e.getValue()));
Disclaimer: Code may not compile or work as it was thumbed in on my phone (but there's a reasonable chance it will work)
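One detail worth checking against the question's sample output: the entry "these.’" ends in a typographic quote (U+2019), and by default Java's \p{Punct} only covers ASCII punctuation. The sketch below compares it with the Unicode punctuation category \p{P}; the class and method names are made up for illustration.

```java
public class StripPunctuation {

    // \p{Punct} matches only ASCII punctuation unless UNICODE_CHARACTER_CLASS is set.
    static String stripAscii(String s) {
        return s.replaceAll("\\p{Punct}+", "");
    }

    // \p{P} is the Unicode "punctuation" category, which includes curly quotes.
    static String stripUnicode(String s) {
        return s.replaceAll("\\p{P}+", "");
    }

    public static void main(String[] args) {
        String word = "these.\u2019"; // "these" followed by '.' and a right single quotation mark
        System.out.println(stripAscii(word));   // the curly quote is still there
        System.out.println(stripUnicode(word)); // only the letters remain
    }
}
```

Note that \p{P} does not cover symbols such as $ or +, which fall under \p{S}; a character class like [\\p{P}\\p{S}] would remove both groups.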

Related

How to print specific numbers from txt file?

I have a text file with the following contents:
18275440:Annette Nguyen:98
93840989:Mary Rochetta:87
23958632:Antoine Yung:79
23658231:Claire Coin:78
23967548:Emma Chung:69
23921664:Jung Kim:98
23793215:Harry Chiu:98
I want to extract the two-digit number at the end of each line. This is the code I wrote:
for (int i = 3; i < 25; i++) {
line = inFile.nextLine();
String[] split = line.split(":");
System.out.println(split[2]);
}
And I am getting a runtime error.
Update the reading loop: if you are using a Scanner, you can check whether there are more lines left before reading the next one.
while(inFile.hasNextLine()) {
line = inFile.nextLine();
String[] split = line.split(":");
System.out.println(split[2]);
}
Why the complexity of the for loop specification? You don't use i, so why bother with all that. Don't you just want to read lines until there aren't any more? If you do that, assuming that inFile will let you read lines from it, your code to actually parse each line and extract the number at the end seems right. Here's a complete (minus the class definition) example that uses your parsing logic:
public static void main(String[] args) throws IOException {
// Open the input data file
BufferedReader inFile = new BufferedReader(new FileReader("/tmp/data.txt"));
while(true) {
// Read the next line
String line = inFile.readLine();
// Break out of our loop if we've run out of lines
if (line == null)
break;
// Strip off any whitespace on the beginning and end of the line
line = line.strip();
// If the line is empty, skip it
if (line.isEmpty())
continue;
// Parse the line, and print out the third component, the two digit number at the end of the line
String[] split = line.strip().split(":");
System.out.println(split[2]);
}
}
If there's a file named /tmp/data.txt with the contents you provide in your question, this is the output you get from this code:
98
87
79
78
69
98
98
Don't be so explicit with your loop criteria. Use a counter to acquire the data you want from the file, for example:
int lineCounter = 0;
String line;
while (inFile.hasNextLine()) {
line = inFile.nextLine();
lineCounter++;
if (lineCounter >=3 && lineCounter <= 24) {
String[] split = line.trim().split(":");
System.out.println(split[2]);
}
}
I don't know why your code gives an error. If you have unwanted lines at the beginning (I see you skip 3 such lines in your code), just run an empty scanner over them, i.e. read and discard them.
Scanner scanner = new Scanner(new File("E:\\file.txt"));
String[] split;
// run an empty scanner
for (int i = 1; i <= 3; i++) scanner.nextLine();
while (scanner.hasNextLine()) {
split = scanner.nextLine().split(":");
System.out.println(split[2]);
}
In case you don't know about such lines in advance and they don't follow the format of the other lines, you could use try...catch to skip them. I'm catching a simple exception here, but you could also throw your own exception when a line doesn't meet your conditions.
Suppose your file looks like this:
1
2
3
18275440:Annette Nguyen:98
93840989:Mary Rochetta:87
23958632:Antoine Yung:79
bleh bleh bleh
23658231:Claire Coin:78
23967548:Emma Chung:69
23921664:Jung Kim:98
23793215:Harry Chiu:98
Then your code would be
Scanner scanner = new Scanner(new File("E:\\file.txt"));
String[] split;
// run an empty scanner
// for (int i = 1; i <= 3; i++) scanner.nextLine();
while (scanner.hasNextLine()) {
split = scanner.nextLine().split(":");
try {
System.out.println(split[2]);
} catch (ArrayIndexOutOfBoundsException e) {
// the line has no third field, so skip it
}
}
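As an alternative to catching the exception, the length of the split array can be checked before indexing into it. This is only a sketch: the class ScoreExtractor and the helper thirdField are hypothetical names, and the sample lines are taken from the file shown above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class ScoreExtractor {

    // Returns the third colon-separated field, or an empty Optional
    // when the line has fewer than three fields.
    static Optional<String> thirdField(String line) {
        String[] parts = line.trim().split(":");
        return parts.length >= 3 ? Optional.of(parts[2]) : Optional.empty();
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "1",
                "18275440:Annette Nguyen:98",
                "bleh bleh bleh",
                "23658231:Claire Coin:78");
        for (String line : lines) {
            // Malformed lines yield an empty Optional and are silently skipped.
            thirdField(line).ifPresent(System.out::println);
        }
    }
}
```

With this shape no exception is used for ordinary control flow, and the caller decides what to do with lines that do not match.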
Assuming you're using Java 8, you can take a simpler, less imperative approach by using BufferedReader's lines method, which returns a Stream:
BufferedReader reader = new BufferedReader(new FileReader("/tmp/data.txt"));
reader.lines()
.map(line -> line.split(":")[2])
.forEach(System.out::println);
But, come to think of it, you could avoid BufferedReader by using Files from Java's NIO API:
Files.lines(Paths.get("/tmp/data.txt"))
.map(line -> line.split(":")[2])
.forEach(System.out::println);
You can split on \d+:[\p{L}\s]+: and take the second element from the resulting array. The regex pattern, \d+:[\p{L}\s]+: means a string of digits (\d+) followed by a : which in turn is followed by a string of any combinations of letters and space which in turn is followed by a :
public class Main {
public static void main(String[] args) {
String line = "18275440:Annette Nguyen:98";
String[] split = line.split("\\d+:[\\p{L}\\s]+:");
String n = "";
if (split.length == 2) {
n = split[1].trim();
}
System.out.println(n);
}
}
Output:
98
Note that \p{L} specifies a letter.

how to find i of a token in an array[i]

So, I've found a word in a document and print the line in which the word is present like this:
say example file contains : "The quick brown fox jumps over the lazy dog.Jackdaws love my big sphinx of quartz."
FileInputStream fstream = new FileInputStream(file);
BufferedReader br = new BufferedReader(new InputStreamReader(fstream));
String strLine;
//Read File Line By Line
while((strLine = br.readLine()) != null){
//check to see whether testWord occurs at least once in the line of text
check = strLine.toLowerCase().contains(testWord.toLowerCase());
if(check){
//get the line, and parse its words into a String array
String[] lineWords = strLine.split("\\s+");
for(int i=0;i<lineWords.length;i++){
System.out.print(lineWords[i]+ ' ');
}
And if I search for 'fox', then lineWords[] will contain the tokens from the first sentence, and lineWords[3] = fox. To print the color of the fox, I need lineWords[2].
I was wondering how we can get the 'i' of a token in lineWords[i], because I want the output to be lineWords[i-1].
You could use a HashMap that stores each word together with a list of its indices.
HashMap<String, List<Integer>> indices = new HashMap<>();
So in the for loop you fill the HashMap:
for (int i = 0; i < lineWords.length; i++) {
    String word = lineWords[i];
    if (!indices.containsKey(word)) {
        indices.put(word, new ArrayList<>());
    }
    indices.get(word).add(i);
}
To get all the indices of a specific word call:
List<Integer> indicesForWord = indices.get("fox"); // null if the word never occurred
And to get the i - 1 word call:
for (int i = 0; i < indicesForWord.size(); i++) {
    int index = indicesForWord.get(i) - 1;
    if (index >= 0 && index < lineWords.length) {
        System.out.println(lineWords[index]);
    }
}
If you are using Java 8, it is straightforward:
List<String> words = Files.lines(Paths.get("files/input.txt"))
.flatMap(line -> Arrays.stream(line.split("\\s+")))
.collect(Collectors.toList());
int index = words.indexOf("fox");
System.out.println(index);
if(index>0)
System.out.println(words.get(index-1));
This solution also works when the word you are searching for is the first word in a line. I hope it helps!
If you need to find all occurrences, you can use the indexOfAll method from this post.
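The linked post is not reproduced here, so the following is only a guess at what such a helper might look like, not the code from that post. The names PrecedingWords, indexOfAll, and wordsBefore are hypothetical, and the sample sentence is invented so that "fox" occurs twice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PrecedingWords {

    // Collects every index at which target occurs in the list.
    static <T> List<Integer> indexOfAll(T target, List<T> list) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < list.size(); i++) {
            if (target.equals(list.get(i))) {
                result.add(i);
            }
        }
        return result;
    }

    // Returns the word in front of each occurrence of target;
    // an occurrence at index 0 has no predecessor and is skipped.
    static List<String> wordsBefore(String target, List<String> words) {
        List<String> before = new ArrayList<>();
        for (int i : indexOfAll(target, words)) {
            if (i > 0) {
                before.add(words.get(i - 1));
            }
        }
        return before;
    }

    public static void main(String[] args) {
        List<String> words =
                Arrays.asList("The quick brown fox jumps over the lazy fox".split("\\s+"));
        System.out.println(wordsBefore("fox", words)); // prints [brown, lazy]
    }
}
```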
That can be done by traversing the array and, when you find your word, printing the one before it. Here's how:
if (lineWords[0].equals(testWord))
    return; // no preceding word
for (int i = 1; i < lineWords.length; i++) {
    if (lineWords[i].equals(testWord)) {
        System.out.println(lineWords[i - 1]);
        break;
    }
}

How to read a string of characters from a text file until a certain character, print them in the console, then continue?

I have a question. I have a text file with some names and numbers arranged like this:
Cheese;10;12
Borat;99;55
I want to read the chars and integers from the file until the ";" symbol, println them, then continue, read the next one, println etc. Like this :
Cheese -> println , 10-> println, 99 -> println , and on to the next line and continue.
I tried using :
BufferedReader flux_in = new BufferedReader (
new InputStreamReader (
new FileInputStream ("D:\\test.txt")));
while ((line = flux_in.readLine())!=null &&
line.contains(terminator)==true)
{
text = line;
System.out.println(String.valueOf(text));
}
But it reads the entire line; it doesn't stop at the ";" symbol. Setting the 'contains' condition to false does not read the line at all.
EDIT: Partially solved. I managed to write this code:
StringBuilder sb = new StringBuilder();
// while ((line = flux_in.readLine())!=null)
int c;
String terminator_char = ";";
while((c = flux_in.read()) != -1) {
{
char character = (char) c;
if (String.valueOf(character).contains(terminator_char)==false)
{
// System.out.println(String.valueOf(character) + " : Char");
sb.append(character);
}
else
{
continue;
}
}
}
System.out.println(String.valueOf(sb) );
Which returns a new string formed out of the characters from the read one, but without the ";". Still need a way to make it stop on the first ";", println the string and continue.
This simple code does the trick, thanks to Stefan Vasilica for the idea:
Scanner scan = new Scanner(new File("D:\\testfile.txt"));
// Printing the delimiter used
scan.useDelimiter(";");
System.out.println("Delimiter:" + scan.delimiter());
// Printing the tokenized Strings
while (scan.hasNext()) {
System.out.println(scan.next());
}
// closing the scanner stream
scan.close();
Read the characters from the file one by one.
Delete the 'contains' condition.
Use a StringBuilder to build the strings one at a time.
Each StringBuilder stops when it meets a ';' (say, with an if clause).
I didn't test it because I'm on my phone. Hope this helps.
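The steps above can be sketched roughly as follows. The class TokenReader and the method readTokens are hypothetical names, a StringReader stands in for the file so the example is self-contained, and treating line breaks as terminators alongside ';' is an assumption based on the sample data.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class TokenReader {

    // Reads character by character and closes a token at each ';' or line break.
    // Empty tokens (for example between ";;") are skipped.
    static List<String> readTokens(Reader in) {
        List<String> tokens = new ArrayList<>();
        StringBuilder sb = new StringBuilder();
        try {
            int c;
            while ((c = in.read()) != -1) {
                char ch = (char) c;
                if (ch == ';' || ch == '\n' || ch == '\r') {
                    if (sb.length() > 0) {
                        tokens.add(sb.toString());
                        sb.setLength(0); // reuse the builder for the next token
                    }
                } else {
                    sb.append(ch);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (sb.length() > 0) {
            tokens.add(sb.toString()); // last token when the input has no trailing terminator
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : readTokens(new StringReader("Cheese;10;12\nBorat;99;55"))) {
            System.out.println(token);
        }
    }
}
```

For a real file, the StringReader would be replaced by a BufferedReader over a FileReader, since reading one character at a time from an unbuffered stream is slow.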

How to take first word of new paragraph into consideration?

I'm trying to build a program that takes in files and outputs the number of words in the file. It works perfectly when everything is in one paragraph. However, when there are multiple paragraphs, it doesn't take into account the first word of each new paragraph. For example, if a file reads "My name is John", the program will output "4 words". However, if a file reads "My Name Is John" with each word in its own paragraph, the program will output "1 word". I know it must be something about my if statement, but I assumed that there are spaces before the new paragraph that would take the first word in a new paragraph into account.
Here is my code in general:
import java.io.*;
public class HelloWorld
{
public static void main(String[]args)
{
try{
// Open the file that is the first
// command line parameter
FileInputStream fstream = new FileInputStream("health.txt");
// Use DataInputStream to read binary NOT text.
BufferedReader br = new BufferedReader(new InputStreamReader(fstream));
String strLine;
int word2 =0;
int word3 =0;
//Read File Line By Line
while ((strLine = br.readLine()) != null) {
// Print the content on the console
;
int wordLength = strLine.length();
System.out.println(strLine);
for(int i = 0 ; i < wordLength -1 ; i++)
{
Character a = strLine.charAt(i);
Character b= strLine.charAt(i + 1);
if(a == ' ' && b != '.' && b != '?' && b != '!' && b != ' ' )
{
word2++;
//doesnt take into account 1st character of new paragraph
}
}
word3 = word2 + 1;
}
System.out.println("There are " + word3 + " "
+ "words in your file.");
//Close the input stream
in.close();
}catch (Exception e){//Catch exception if any
System.err.println("Error: " + e.getMessage());
}
}
}
I've tried adjusting the if statement multiple times, but it does not seem to make a difference. Does anyone know where I'm messing up?
I'm a pretty new user and asked a similar question a couple of days back, with people accusing me of demanding too much of users, so hopefully this narrows my question a bit. I am just really confused about why it's not taking into account the first word of a new paragraph. Please let me know if you need any more information. Thanks!!
Firstly, your counting logic is incorrect. Consider:
word3 = word2 + 1;
Think about what this does. Every time through your loop, when you read a line, you essentially count the words in that line, then reset the total count to word2 + 1. Hint: If you want to count the total number in the file, you'd want to increment word3 each time, rather than replace it with the current line's word count.
Secondly, your word parsing logic is slightly off. Consider the case of a blank line. You would see no words in it, but you treat the word count in the line as word2 + 1, which means you are incorrectly counting a blank line as 1 word. Hint: If the very first character on the line is a letter, then the line starts with a word.
Your approach is reasonable although your implementation is slightly flawed. As an alternate option, you may want to consider String.split() on each line. The number of elements in the resulting array is the number of words on the line.
By the way, you can increase readability of your code, and make debugging easier, if you use meaningful names for your variables (e.g. totalWords instead of word3).
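A minimal sketch of the String.split() suggestion, with a running total across lines, might look like this. The class WordCount and the method countWords are illustrative names, and a StringReader replaces health.txt so the example can run on its own.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class WordCount {

    // Adds up the words on every line; blank lines contribute nothing.
    static int countWords(BufferedReader br) {
        int totalWords = 0;
        try {
            String line;
            while ((line = br.readLine()) != null) {
                String trimmed = line.trim();
                if (!trimmed.isEmpty()) {
                    // Splitting on runs of whitespace yields one element per word.
                    totalWords += trimmed.split("\\s+").length;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return totalWords;
    }

    public static void main(String[] args) {
        // Each word is on its own line, as in the question's failing case.
        BufferedReader br = new BufferedReader(new StringReader("My\nName\nIs\nJohn"));
        System.out.println("There are " + countWords(br) + " words in your file."); // 4 words
    }
}
```

Because the total is accumulated rather than overwritten, the first word of each new paragraph is counted like any other.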
If a paragraph does not start with whitespace, your if condition won't count its first word.
For "My name is John" the program prints "4 words", but only by accident: the loop misses the first word and the + 1 afterwards makes up for it.
Try this for each line you read:
strLine = strLine.trim(); // remove leading and trailing whitespace
String[] words = strLine.split("\\s+"); // split on runs of whitespace
int numOfWords = strLine.isEmpty() ? 0 : words.length; // a blank line has no words
I personally prefer a regular Scanner with token-based scanning for this sort of thing. How about something like this:
int words = 0;
Scanner lineScan = new Scanner(new File("fileName.txt"));
while (lineScan.hasNextLine()) {
    // Scan the tokens of the current line
    Scanner tokenScan = new Scanner(lineScan.nextLine());
    while (tokenScan.hasNext()) {
        tokenScan.next();
        words++;
    }
}
lineScan.close();
This iterates through every line in the file. And for every line in the file, it iterates through every token (in this case words) and increments the word count.
I am not sure what you mean by "paragraph", but I tried your example with capital letters as you suggested and it worked fine. I used the Apache Commons IO library and split on any whitespace, so line breaks between words count as separators too:
package Project1;

import java.io.File;
import java.nio.charset.StandardCharsets;
import org.apache.commons.io.FileUtils;

public class HelloWorld
{
    public static void main(String[] args)
    {
        try {
            File f = new File("c:\\TestFile\\test.txt");
            // Read the whole file into one string (UTF-8 is assumed here)
            String fileStr = FileUtils.readFileToString(f, StandardCharsets.UTF_8).trim();
            // Split on runs of whitespace: spaces, tabs, and line breaks
            String[] tokens = fileStr.isEmpty() ? new String[0] : fileStr.split("\\s+");
            System.out.println("Words in file : " + tokens.length);
        } catch (Exception e) { // Catch exception if any
            System.err.println("Error: " + e.getMessage());
        }
    }
}

Split text file into Strings on empty line

I want to read a local txt file and read the text in this file. After that i want to split this whole text into Strings like in the example below .
Example :
Lets say file contains-
abcdef
ghijkl
aededd
ededed
ededfe
efefeef
efefeff
......
......
I want to split this text in to Strings
s1 = abcdef+"\n"+ghijkl;
s2 = aededd+"\n"+ededed;
s3 = ededfe+"\n"+efefeef+"\n"+efefeff;
........................
I mean I want to split text on empty line.
I do know how to read a file. I want help in splitting the text in to strings
you can split a string to an array by
String.split();
if you want it by new lines it will be
String.split("\\n\\n");
UPDATE:
If I understand what you are saying, John, then your code will essentially be:
BufferedReader in
= new BufferedReader(new FileReader("foo.txt"));
List<String> allStrings = new ArrayList<String>();
String str ="";
while(true)
{
String tmp = in.readLine();
if(tmp.isEmpty())
{
if(!str.isEmpty())
{
allStrings.add(str);
}
str= "";
}
else if(tmp==null)
{
break;
}
else
{
if(str.isEmpty())
{
str = tmp;
}
else
{
str += "\\n" + tmp;
}
}
}
That might be what you are trying to parse, where allStrings is a list of all of your strings.
The below code would work even if there are more than 2 empty lines between useful data.
import java.util.regex.*;
// read your file and store it in a string named str_file_data
Pattern p = Pattern.compile("\\n[\\n]+"); /*if your text file has \r\n as the newline character then use Pattern p = Pattern.compile("\\r\\n[\\r\\n]+");*/
String[] result = p.split(str_file_data);
(I did not test the code so there could be typos.)
I would suggest a more general regexp:
text.split("(?m)^\\s*$");
In this case it would work correctly with any end-of-line convention, and it would also treat empty lines and whitespace-only lines the same way.
It may depend on how the file is encoded, so I would likely do the following:
String.split("(?>\\r\\n|\\n|\\r){2}");
Some text files encode newlines as "\r\n" while others use simply "\n". Two newlines in a row mean you have an empty line. The atomic group (?>...) keeps a single "\r\n" from being counted as two separate line breaks.
Godwin was on the right track, but I think we can make this work a bit better. Using '[ ]' in a regex is an or, so in his example a \r\n would be just one new line, not an empty line, yet the regular expression would split on both the \r and the \n. I believe the example was looking for an empty line, which would require either a \n\r\n\r, a \r\n\r\n, a \n\r\r\n, a \r\n\n\r, a \n\n, or a \r\r.
So first we want to look for either \n\r or \r\n twice, with any combination of the two being possible.
String.split("((\\n\\r)|(\\r\\n)){2}");
Next we need to look for \r without a \n after it:
String.split("\\r{2}");
Lastly, let's do the same for \n:
String.split("\\n{2}");
And all together that should be
String.split("((\\n\\r)|(\\r\\n)){2}|(\\r){2}|(\\n){2}");
Note, this works only on the very specific example of using new lines and carriage returns. In Ruby you can do the following, which would encompass more cases. I don't know if there is an equivalent in Java.
.match($^$)
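One way to sidestep the line-ending combinations discussed above is to normalize them first and then split on a single, simpler pattern. The sketch below does that; the class BlankLineSplitter and the method splitOnBlankLines are made-up names, and treating lines that contain only spaces or tabs as empty is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BlankLineSplitter {

    // Splits text into blocks separated by one or more blank lines.
    static List<String> splitOnBlankLines(String text) {
        // Turn "\r\n" and lone "\r" into "\n", then drop leading and trailing whitespace.
        String normalized = text.replaceAll("\\r\\n?", "\n").trim();
        if (normalized.isEmpty()) {
            return new ArrayList<>();
        }
        // A separator is a line break followed by at least one line
        // that holds nothing but spaces or tabs.
        return Arrays.asList(normalized.split("\\n(?:[ \\t]*\\n)+"));
    }

    public static void main(String[] args) {
        String text = "abcdef\r\nghijkl\r\n\r\naededd\r\nededed\r\n\r\n\r\nededfe\nefefeef\nefefeff\n";
        for (String block : splitOnBlankLines(text)) {
            System.out.println("[" + block.replace("\n", "\\n") + "]");
        }
    }
}
```

Within each returned block the lines are joined by "\n", which matches the s1, s2, s3 layout asked for in the question.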
@Kevin's code works fine apart from a few bugs, which is understandable since he mentioned that it was not tested. Here are the 3 changes required:
1. The if check for (tmp==null) should come first, otherwise there will be a null pointer exception.
2. The code leaves out the last set of lines, which never gets added to the ArrayList. To make sure the last one gets added, we have to include this code after the while loop: if(!str.isEmpty()) { allStrings.add(str); }
3. The line str += "\\n" + tmp; should be changed to use \n instead of \\n. I have added the entire corrected code below so that it can help:
BufferedReader in
= new BufferedReader(new FileReader("foo.txt"));
List<String> allStrings = new ArrayList<String>();
String str ="";
while(true)
{
String tmp = in.readLine();
if(tmp==null)
{
break;
}else if(tmp.isEmpty())
{
if(!str.isEmpty())
{
allStrings.add(str);
}
str= "";
}else
{
if(str.isEmpty())
{
str = tmp;
}
else
{
str += "\n" + tmp;
}
}
}
if(!str.isEmpty())
{
allStrings.add(str);
}
