Split paragraphs into sentences - a special case - java

I am a newbie to programming in Java. I want to split the paragraphs in one file into sentences and write them in a different file. Also there should be mechanism to identify which sentence comes from which paragraph.The code I have used so far is mentioned below. But this code breaks:
Former Secretary of Finance Dr. P.B. Jayasundera is being questioned by the police Financial Crime Investigation Division.
into
Former Secretary of Finance Dr.
P.B.
Jayasundera is being questioned by the police Financial Crime Investigation Division.
How can I correct it? Thanks in advance.
import java.io.*;
class trial4{
public static void main(String args[]) throws IOException
{
FileReader fr = new FileReader("input.txt");
BufferedReader br = new BufferedReader(fr);
String s;
OutputStream out = new FileOutputStream("output10.txt");
String token[];
while((s = br.readLine()) != null)
{
token = s.split("(?<=[.!?])\\s* ");
for(int i=0;i<token.length;i++)
{
byte buf[]=token[i].getBytes();
for(int j=0;j<buf.length;j=j+1)
{
out.write(buf[j]);
if(j==buf.length-1)
out.write('\n');
}
}
}
fr.close();
}
}
I referenced all the similar questions posted on StackOverFlow. But those answers couldn't help me solve this.

How about using a negative-lookbehind in conjunction with a replace. Simply said: Replace all line endings that don't have "something special" before them with the line end followed by newline.
A list of "known abbreviations" will be needed. There's no guarantee as to how long those can be or how short a word at the end of a line might be. (See? 'be' if quite short already!)
class trial4{
public static void main(String args[]) throws IOException {
FileReader fr = new FileReader("input.txt");
BufferedReader br = new BufferedReader(fr);
PrintStream out = new PrintStream(new FileOutputStream("output10.txt"));
String s = br.readLine();
while(s != null) {
out.print( //Prints newline after each line in any case
s.replaceAll("(?i)" //Make the match case insensitive
+ "(?<!" //Negative lookbehind
+ "(\\W\\w)|" //Single non-word followed by word character (P.B.)
+ "(\\W\\d{1,2})|" //one or two digits (dates!)
+ "(\\W(dr|mr|mrs|ms))" //List of known abbreviations
+ ")" //End of lookbehind
+"([!?\\.])" //Match end-ofsentence
, "$5" //Replace with end-of-sentence found
+System.lineSeparator())); //Add newline if found
s = br.readLine();
}
}
}

As mentioned in the comment "it will be reasonable hard" to break text into paragraphs without formalizing the requirements. Take a look at BreakIterator - especially SentenceInstance. You might roll out your own BreakIterator since it breaks the same as you get with regexp, except that it is more abstract. Or try to find a 3rd party solution like http://deeplearning4j.org/sentenceiterator.html which can be trained to tokenize your input.
Example with BreakIterator:
String str = "Former Secretary of Finance Dr. P.B. Jayasundera is being questioned by the police Financial Crime Investigation Division.";
BreakIterator bilus = BreakIterator.getSentenceInstance(Locale.US);
bilus.setText(str);
int last = bilus.first();
int count = 0;
while (BreakIterator.DONE != last)
{
int first = last;
last = bilus.next();
if (BreakIterator.DONE != last)
{
String sentence = str.substring(first, last);
System.out.println("Sentence:" + sentence);
count++;
}
}
System.out.println("" + count + " sentences found.");

Related

Replace specific string by another - String#replaceAll()

I'm actually developping a parser and I'm stuck on a method.
I need to clean specifics words in some sentences, meaning replacing those by a whitespace or a nullcharacter.
For now, I came up with this code:
private void clean(String sentence)
{
try {
FileInputStream fis = new FileInputStream(
ConfigHandler.getDefault(DictionaryType.CLEANING).getDictionaryFile());
BufferedReader bis = new BufferedReader(new InputStreamReader(fis));
String read;
List<String> wordList = new ArrayList<String>();
while ((read = bis.readLine()) != null) {
wordList.add(read);
}
}
catch (IOException e) {
e.printStackTrace();
}
for (String s : wordList) {
if (StringUtils.containsIgnoreCase(sentence, s)) { // this comes from Apache Lang
sentence = sentence.replaceAll("(?i)" + s + "\\b", " ");
}
}
cleanedList.add(sentence);
}
But when I look at the output, I got all of the occurences of the word to be replaced in my sentence replaced by a whitespace.
Does anybody can help me out on replacing only the exact words to be replaced on my sentence?
Thanks in advance !
There are two problems in your code:
You are missing the \b before the string
You will run into issues if any of the words from the file has special characters
To fix this problem construct your regex as follows:
sentence = sentence.replaceAll("(?i)\\b\\Q" + s + "\\E\\b", " ");
or
sentence = sentence.replaceAll("(?i)\\b" + Pattern.quote(s) + "\\b", " ");

How to split a file into several tokens

I was trying to tokenize an input file from sentences into tokens(words).
For example,
"This is a test file." into five words "this" "is" "a" "test" "file", omitting the punctuations and the white spaces. And store them into an arraylist.
I tried to write some codes like this:
public static ArrayList<String> tokenizeFile(File in) throws IOException {
String strLine;
String[] tokens;
//create a new ArrayList to store tokens
ArrayList<String> tokenList = new ArrayList<String>();
if (null == in) {
return tokenList;
} else {
FileInputStream fStream = new FileInputStream(in);
DataInputStream dataIn = new DataInputStream(fStream);
BufferedReader br = new BufferedReader(new InputStreamReader(dataIn));
while (null != (strLine = br.readLine())) {
if (strLine.trim().length() != 0) {
//make sure strings are independent of capitalization and then tokenize them
strLine = strLine.toLowerCase();
//create regular expression pattern to split
//first letter to be alphabetic and the remaining characters to be alphanumeric or '
String pattern = "^[A-Za-z][A-Za-z0-9'-]*$";
tokens = strLine.split(pattern);
int tokenLen = tokens.length;
for (int i = 1; i <= tokenLen; i++) {
tokenList.add(tokens[i - 1]);
}
}
}
br.close();
dataIn.close();
}
return tokenList;
}
This code works fine except I found out that instead of make a whole file into several words(tokens), it made a whole line into a token. "area area" becomes a token, instead of "area" appeared twice. I don't see the error in my codes. I believe maybe it's something wrong with my trim().
Any valuable advices is appreciated. Thank you so much.
Maybe I should use scanner instead?? I'm confused.
I think Scanner is more approprate for this task. As to this code, you should fix regex, try "\\s+";
Try pattern as String pattern = "[^\\w]"; in the same code

Parsing txt file

I have to write a program that will parse baseball player info and hits,out,walk,ect from a txt file. For example the txt file may look something like this:
Sam Slugger,h,h,o,s,w,w,h,w,o,o,o,h,s
Jill Jenks,o,o,s,h,h,o,o
Will Jones,o,o,w,h,o,o,o,o,w,o,o
I know how to parse the file and can get that code running perfect. The only problem I am having is that we should only be printing the name for each player and 3 or their plays. For example:
Sam Slugger hit,hit,out
Jill Jenks out, out, sacrifice fly
Will Jones out, out, walk
I am not sure how to limit this and every time I try to cut it off at 3 I always get the first person working fine but it breaks the loop and doesn't do anything for all the other players.
This is what I have so far:
import java.util.Scanner;
import java.io.*;
public class ReadBaseBall{
public static void main(String args[]) throws IOException{
int count=0;
String playerData;
Scanner fileScan, urlScan;
String fileName = "C:\\Users\\Crust\\Documents\\java\\TeamStats.txt";
fileScan = new Scanner(new File(fileName));
while(fileScan.hasNext()){
playerData = fileScan.nextLine();
fileScan.useDelimiter(",");
//System.out.println("Name: " + playerData);
urlScan = new Scanner(playerData);
urlScan.useDelimiter(",");
for(urlScan.hasNext(); count<4; count++)
System.out.print(" " + urlScan.next() + ",");
System.out.println();
}
}
}
This prints out:
Sam Slugger, h, h, o,
but then the other players are voided out. I need help to get the other ones printing as well.
Here, try this one using FileReader
Assuming your file content format is like this
Sam Slugger,h,h,o,s,w,w,h,w,o,o,o,h,s
Jill Johns,h,h,o,s,w,w,h,w,o,o,o,h,s
with each player in the his/her own line then this can work for you
BufferedReader reader;
try {
reader = new BufferedReader(new FileReader(new File("file.txt")));
String line = "";
while ((line = reader.readLine()) != null) {
String[] values_per_line = line.split(",");
System.out.println("Name:" + values_per_line[0] + " "
+ values_per_line[1] + " " + values_per_line[2] + " "
+ values_per_line[3]);
line = reader.readLine();
}
reader.close();
} catch (IOException e) {
e.printStackTrace();
}
otherwise if they are lined all in like one line which would not make sense then modify this sample.
Sam Slugger,h,h,o,s,w,w,h,w,o,o,o,h,s| John Slugger,h,h,o,s,w,w,h,w,o,o,o,h,s
BufferedReader reader;
try {
reader = new BufferedReader(new FileReader(new File("file.txt")));
String line = "";
while ((line = reader.readLine()) != null) {
// token identifier is a space
String[] data = line.trim().split("|");
for (int i = 0; i < data.length; i++)
System.out.println("Name:" + data[0].split(",")[0] + " "
+ data[1].split(",")[1] + " "
+ data[2].split(",")[2] + " "
+ data[3].split(",")[3]);
line = reader.readLine();
}
reader.close();
} catch (IOException e) {
e.printStackTrace();
}
You need to reset your count car in the while loop:
while(fileScan.hasNext()){
count = 0;
...
}
First Problem
Change while(fileScan.hasNext())) to while(fileScan.hasNextLine()). Not a breaking problem but when using scanner you usually put sc.* right after a sc.has*.
Second Problem
Remove the line fileScan.useDelimiter(","). This line doesn't do anything in this case but replaces the default delimiter so the scanner no longer splits on whitespace. Which doesn't matter when using Scanner.nextLine, but can have some nasty side effects later on.
Third Problem
Change this line for(urlScan.hasNext(); count<4; count++) to while(urlScan.hasNext()). Honestly I'm surprised that line even compiled and if it did it only read the first 4 from the scanner.
If you want to limit the amount processed for each line you can replace it with
for( int count = 0; count < limit && urlScan.hasNext( ); count++ )
This will limit the amount read to limit while still handling lines that have less data than the limit.
Make sure that each of your data sets is separated by a line otherwise the output might not make much sense.
You shouldn't have multiple scanners on this - assuming the format you posted in your question you can use regular expressions to do this.
This demonstrates a regular expression to match a player and to use as a delimiter for the scanner. I fed the scanner in my example a string, but the technique is the same regardless of source.
int count = 0;
Pattern playerPattern = Pattern.compile("\\w+\\s\\w+(?:,\\w){1,3}");
Scanner fileScan = new Scanner("Sam Slugger,h,h,o,s,w,w,h,w,o,o,o,h,s Jill Jenks,o,o,s,h,h,o,o Will Jones,o,o,w,h,o,o,o,o,w,o,o");
fileScan.useDelimiter("(?<=,\\w)\\s");
while (fileScan.hasNext()){
String player = fileScan.next();
Matcher m = playerPattern.matcher(player);
if (m.find()) {
player = m.group(0);
} else {
throw new InputMismatchException("Players data not in expected format on string: " + player);
}
System.out.println(player);
count++;
}
System.out.printf("%d players found.", count);
Output:
Sam Slugger,h,h,o
Jill Jenks,o,o,s
Will Jones,o,o,w
The call to Scanner.delimiter() sets the delimiter to use for retrieving tokens. The regex (?<=,\\w)\\s:
(?< // positive lookbehind
,\w // literal comma, word character
)
\s // whitespace character
Which delimits the players by the space between their entries without matching anything but that space, and fails to match the space between the names.
The regular expression used to extract up to 3 plays per player is \\w+\\s\\w+(?:,\\w){1,3}:
\w+ // matches one to unlimited word characters
(?: // begin non-capturing group
,\w // literal comma, word character
){1,3} // match non-capturing group 1 - 3 times

How to take first word of new paragraph into consideration?

I'm trying to build a program that takes in files and outputs the number of words in the file. It works perfectly when everything is under one whole paragraph. However, when there are multiple paragraphs, it doesn't take into account the first word of the new paragraph. For example, if a file reads "My name is John" , the program will output "4 words". However, if a file read"My Name Is John" with each word being a new paragraph, the program will output "1 word". I know it must be something about my if statement, but I assumed that there are spaces before the new paragraph that would take the first word in a new paragraph into account.
Here is my code in general:
import java.io.*;
public class HelloWorld
{
public static void main(String[]args)
{
try{
// Open the file that is the first
// command line parameter
FileInputStream fstream = new FileInputStream("health.txt");
// Use DataInputStream to read binary NOT text.
BufferedReader br = new BufferedReader(new InputStreamReader(fstream));
String strLine;
int word2 =0;
int word3 =0;
//Read File Line By Line
while ((strLine = br.readLine()) != null) {
// Print the content on the console
;
int wordLength = strLine.length();
System.out.println(strLine);
for(int i = 0 ; i < wordLength -1 ; i++)
{
Character a = strLine.charAt(i);
Character b= strLine.charAt(i + 1);
**if(a == ' ' && b != '.' &&b != '?' && b != '!' && b != ' ' )**
{
word2++;
//doesnt take into account 1st character of new paragraph
}
}
word3 = word2 + 1;
}
System.out.println("There are " + word3 + " "
+ "words in your file.");
//Close the input stream
in.close();
}catch (Exception e){//Catch exception if any
System.err.println("Error: " + e.getMessage());
}
}
}
I've tried adjusting the if statement multiple teams, but it does not seem to make a difference. Does anyone know where I'm messing up?
I'm a pretty new user and asked a similar question a couple days back with people accusing me of demanding too much of users, so hopefully this narrows my question a bit. I just am really confused on why its not taking into account the first word of a new paragraph. Please let me know if you need any more information. Thanks!!
Firstly, your counting logic is incorrect. Consider:
word3 = word2 + 1;
Think about what this does. Every time through your loop, when you read a line, you essentially count the words in that line, then reset the total count to word2 + 1. Hint: If you want to count the total number in the file, you'd want to increment word3 each time, rather than replace it with the current line's word count.
Secondly, your word parsing logic is slightly off. Consider the case of a blank line. You would see no words in it, but you treat the word count in the line as word2 + 1, which means you are incorrectly counting a blank line as 1 word. Hint: If the very first character on the line is a letter, then the line starts with a word.
Your approach is reasonable although your implementation is slightly flawed. As an alternate option, you may want to consider String.split() on each line. The number of elements in the resulting array is the number of words on the line.
By the way, you can increase readability of your code, and make debugging easier, if you use meaningful names for your variables (e.g. totalWords instead of word3).
if your paragraph is not started by whitespace, then your if condition won't count the first word.
"My name is John" , the program will output "4 words", this is not correct, because you miss the first word but add one after.
Try this:
String strLine;
strLine = strLine.trime();//remove leading and trailing whitespace
String[] words = strLine.split(" ");
int numOfWords = words.length;
I personally prefer a regular Scanner with token-based scanning for this sort of thing. How about something like this:
int words = 0;
Scanner lineScan = new Scanner(new File("fileName.txt"));
while (lineScan.hasNext()) {
Scanner tokenScan = new Scanner(lineScan.Next());
while (tokenScan.hasNext()) {
tokenScan.Next();
words++;
}
}
This iterates through every line in the file. And for every line in the file, it iterates through every token (in this case words) and increments the word count.
I am not sure what you mean by "paragraph", however I tried to use capital letters as you suggested and it worked perfectly fine. I used Appache Commons IO library
package Project1;
import java.io.*;
import org.apache.commons.io.*;
public class HelloWorld
{
private static String fileStr = "";
private static String[] tokens;
public static void main(String[]args)
{
try{
// Open the file that is the first
// command line parameter
try {
File f = new File("c:\\TestFile\\test.txt");
fileStr = FileUtils.readFileToString(f);
tokens = fileStr.split(" ");
System.out.println("Words in file : " + tokens.length);
}
catch(Exception ex){
System.out.println(ex);
}
}catch (Exception e){//Catch exception if any
System.err.println("Error: " + e.getMessage());
}
}
}

Preserving line breaks and spacing in file IO

I am workig on a pretty neat problem challenge that involves reading words from a .txt file. The program must allow for ANY .txt file to be read, ergo the program cannot predict what words it will be dealing with.
Then, it takes the words and makes them their "Pig Latin" counterpart, and writes them into a new file. There are a lot more requirements to this problem but siffice to say, I have every part solved save one...when printng to the new file I am unable to perserve the line spacing. That is to say, if line 1 has 5 words and then there is a break and line 2 has 3 words and a break...the same must be true for the new file. As it stands now, it all works but all the converted words are all listed one after the other.
I am interested in learning this so I am OK if you all wish to play coy in your answers. Although I have been at this for 9 hours so "semi-coy" will be appreaciated as well :) Please pay close attention to the "while" statements in the code that is where the file IO action is happening. I am wondering if I need to utilize the nextLine() commands from the scanner and then make a string off that...then make substrings off the nextLine() string to convert the words one at a time. The substrings could be splits or tokens, or something else - I am unclear on this part and token attempts are giving me compiler arrors exceptions "java.util.NoSuchElementException" - I do not seem to understand the correct call for a split command. I tried something like String a = scan.nextLine() where "scan" is my scanner var. Then tried String b = a.split() no go. Anyway here is my code and see if you can figure out what I am missing.
Here is code and thank you very much in advance Java gods....
import java.util.*;
import javax.swing.*;
import java.io.*;
import java.text.*;
public class PigLatinTranslator
{
static final String ay = "ay"; // "ay" is added to the end of every word in pig latin
public static void main(String [] args) throws IOException
{
File nonPiggedFile = new File(...);
String nonPiggedFileName = nonPiggedFile.getName();
Scanner scan = new Scanner(nonPiggedFile);
nonPiggedFileName = ...;
File pigLatinFile = new File(nonPiggedFileName + "-pigLatin.txt"); //references a file that may or may not exist yet
pigLatinFile.createNewFile();
FileWriter newPigLatinFile = new FileWriter(nonPiggedFileName + "-pigLatin.txt", true);
PrintWriter PrintToPLF = new PrintWriter(newPigLatinFile);
while (scan.hasNext())
{
boolean next;
while (next = scan.hasNext())
{
String nonPig = scan.next();
nonPig = nonPig.toLowerCase();
StringBuilder PigLatWord = new StringBuilder(nonPig);
PigLatWord.insert(nonPig.length(), nonPig.charAt(0) );
PigLatWord.insert(nonPig.length() + 1, ay);
PigLatWord.deleteCharAt(0);
String plw = PigLatWord.toString();
if (plw.contains("!") )
{
plw = plw.replace("!", "") + "!";
}
if (plw.contains(".") )
{
plw = plw.replace(".", "") + ".";
}
if (plw.contains("?") )
{
plw = plw.replace("?", "") + "?";
}
PrintToPLF.print(plw + " ");
}
PrintToPLF.close();
}
}
}
Use BufferedReader, not Scanner. http://docs.oracle.com/javase/6/docs/api/java/io/BufferedReader.html
I leave that part of it as an exercise for the original poster, it's easy once you know the right class to use! (And hopefully you learn something instead of copy-pasting my code).
Then pass the entire line into functions like this: (note this does not correctly handle quotes as it puts all non-apostrophe punctuation at the end of the word). Also it assumes that punctuation is supposed to go at the end of the word.
private static final String vowels = "AEIOUaeiou";
private static final String punct = ".,!?";
public static String pigifyLine(String oneLine) {
StringBuilder pigified = new StringBuilder();
boolean first = true;
for (String word : oneLine.split(" ")) {
if (!first) pigified.append(" ");
pigified.append(pigify(word));
first = false;
}
return pigified.toString();
}
public static String pigify(String oneWord) {
char[] chars = oneWord.toCharArray();
StringBuilder consonants = new StringBuilder();
StringBuilder newWord = new StringBuilder();
StringBuilder punctuation = new StringBuilder();
boolean consDone = false; // set to true when the first consonant group is done
for (int i = 0; i < chars.length; i++) {
// consonant
if (vowels.indexOf(chars[i]) == -1) {
// punctuation
if (punct.indexOf(chars[i]) > -1) {
punctuation.append(chars[i]);
consDone = true;
} else {
if (!consDone) { // we haven't found the consonants
consonants.append(chars[i]);
} else {
newWord.append(chars[i]);
}
}
} else {
consDone = true;
// vowel
newWord.append(chars[i]);
}
}
if (consonants.length() == 0) {
// vowel words are "about" -> "aboutway"
consonants.append("w");
}
consonants.append("ay");
return newWord.append(consonants).append(punctuation).toString();
}
You could try to store the count of words per line in a separate data structure, and use that as a guide for when to move on to the next line when writing the file.
I purposely made this semi-vague for you, but can elaborate on request.

Categories

Resources