How can I split paragraphs into proper sentences in java using split()? - java

I want to split paragraphs into sentences. For example, given "Mary had a little lamb. Its fleece was white.", I want to split it into:
"Mary had a little lamb."
"Its fleece was white."
Currently I use text.split("[.]") and get:
"Mary had a little lamb" (the full stop is missing, and I need it)
" Its fleece was white" (there is a leading space, and again no full stop)
What I intend to do is split the paragraph into proper sentences and put them into an array.
String text = sc.nextLine();
String[] sentence = text.split("[.]");
Please help!

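You can just append the full stop '.' to each string after splitting.
Something like:
String[] splitString = theString.split("[.]");
for (int i = 0; i < splitString.length; i++) {
    // Assign back into the array; reassigning a for-each variable would not change the element.
    splitString[i] = splitString[i].trim() + ".";
}
Something in that direction.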

Assuming that there is at least a chance that sentence splitting is not the last bit of natural language processing required, you should consider using a natural language processing (NLP) library like OpenNLP. You can try out OpenNLP through a web interface thanks to the Gate project, who have made an English language processing pipeline available as a web page. Make sure to use the "Customise Annotations" button to get to see the sentence structure.

Something like this should work:
public class Main {
    public static void main(String[] args) {
        String paragraph = "Mary had a little lamb. Its fleece was white.";
        String[] sentences = paragraph.split("[.]");
        for (String sentence : sentences) {
            // split() drops the full stop and leaves a leading space, so restore both.
            System.out.println(sentence.trim() + ".");
        }
    }
}
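A variant that keeps the full stops without re-appending them is to split on the whitespace that follows a terminator, using a lookbehind. This is only a sketch: the class and method names are made up for illustration, and it will still split after abbreviations such as "Mr.".

```java
class LookbehindSplit {
    // Splits on whitespace that is preceded by '.', '!' or '?'.
    // The lookbehind is zero-width, so each sentence keeps its terminator,
    // and the whitespace itself is consumed as the delimiter.
    static String[] splitSentences(String paragraph) {
        return paragraph.trim().split("(?<=[.!?])\\s+");
    }

    public static void main(String[] args) {
        String paragraph = "Mary had a little lamb. Its fleece was white.";
        for (String sentence : splitSentences(paragraph)) {
            System.out.println(sentence);
        }
    }
}
```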

Consider this input:
String sResult = "This is a test. This is a T.L.A. test.";
Splitting on every period would break up the abbreviation, so it is better to split only on a period that follows a lowercase letter and is followed by whitespace:
String[] sSentence = sResult.split("(?<=[a-z])\\.\\s+");
Result:
This is a test
This is a T.L.A. test.
Note that there are abbreviations that do not end with capital letters, such as "abbrev." and "Mr.", and there are also sentences that don't end in periods!
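If a full NLP library is more than you need, the JDK's own java.text.BreakIterator has a sentence mode that also handles '?' and '!'. A rough sketch follows; the class and method names are hypothetical, and how it treats abbreviations depends on the JDK's built-in rules, so test it on your own text.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class BreakIteratorSentences {
    // Walks the sentence boundaries reported by BreakIterator and
    // collects the trimmed text between consecutive boundaries.
    static List<String> split(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        for (String s : split("Mary had a little lamb. Its fleece was white.")) {
            System.out.println(s);
        }
    }
}
```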

Related

Write a regular expression to count sentences

I have a String :
"Hello world... I am here. Please respond."
and I would like to count the number of sentences within the String. I had an idea to use a Scanner as well as the useDelimiter method to split any String into sentences.
Scanner in = new Scanner(file);
in.useDelimiter("insert here");
I'd like to create a regular expression that can go through the String shown above and identify it as having two sentences. I initially tried the delimiter:
[^?.]
It gets hung up on the ellipsis.
You could use a regular expression that checks for a non end of sentence, followed by an end of sentence like:
[^?!.][?!.]
Although, as @Gabe Sechan points out, a regular expression may not be accurate when the sentence includes abbreviated words such as Dr., Rd., St., etc.
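To make that concrete, here is a small sketch that counts matches of [^?!.][?!.] with a Matcher instead of a Scanner delimiter. The class and method names are invented for the example. Because the first character must not be a terminator, the ellipsis contributes only one match, but abbreviations like "Dr." are still counted as sentence ends.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SentenceCounter {
    // One non-terminator character followed by one terminator character.
    private static final Pattern SENTENCE_END = Pattern.compile("[^?!.][?!.]");

    // Counts how many times the pattern occurs in the text.
    static int count(String text) {
        Matcher matcher = SENTENCE_END.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(count("Hello world... I am here. Please respond."));
    }
}
```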
This could help:
public int getNumSentences()
{
List<String> tokens = getTokens( "[^!?.]+" );
return tokens.size();
}
You can also add the newline as a separator and make it independent of your OS with the following line of code:
String pattern = System.getProperty("line.separator") + " ";
You can find more about matching newlines here: Java regex: newline + white space
Hence the method finally becomes:
public int getNumSentences()
{
List<String> tokens = getTokens( "[^!?.]+" + pattern + "+" );
return tokens.size();
}
Hope this helps :)
A regular expression probably isn't the right tool for this. English is not a regular language, so regular expressions get hung up a lot. For one thing, you can't even be sure a period in the middle of the text is the end of a sentence: abbreviations (like Mr.), acronyms with periods, and initials will trip you up as well. It's not the right tool.
For your sentence: "Hello world... I am here. Please respond."
The code will be:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class JavaRegex {
    public static void main(String[] args) {
        int count = 0;
        String sentence = "Hello world... I am here. Please respond.";
        // A run of '.', '!' or '?' followed by whitespace or the end of the text counts as one sentence end.
        Pattern pattern = Pattern.compile("[.!?]+(?=\\s|$)");
        Matcher matcher = pattern.matcher(sentence);
        while (matcher.find()) {
            count++;
        }
        System.out.println("Number of sentences = " + count);
    }
}

Splitting a string in Java only when a delimiter is surrounded by quote marks

Let's say I have the following string:
"John Doe","IT,SI","foo, bar"
And I would like to split it into:
["John Doe", "IT,SI", "foo, bar"]
I was thinking of implementing something like this:
String line = "\"John Doe\",\"IT,SI\",\"foo, bar\"";
String[] lineItems = line.split("\",\"");
for (int i = 0; i < lineItems.length; i++) {
    // Strip the remaining quote marks and store the result back in the array.
    lineItems[i] = lineItems[i].replace("\"", "");
}
It does the job, but it doesn't seem state of the art. Is there a better solution?
Below regex works well for this case.
String[] lineItems = line.split("(?<=\"),(?=\")");
The following should solve the issue
line.split("(?<=\"),(?=\")")
You want the words between the double quotes, so instead of splitting the string you can use the following regex to extract them:
/"([^"]+)"/g
See demo https://regex101.com/r/hC3cW2/1
(\b[^"]+)
Or you may also name the group:
(?P<names>\b[^"]+)
or
(?<names>\b[^"]+)
Java supports the (?<names>...) form (since Java 7) but not the (?P<names>...) form.
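In Java, the extraction approach could look roughly like the following. This is a sketch with made-up class and method names; it uses "([^"]*)" so that empty fields are also matched, and it assumes the fields never contain escaped quote marks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedFields {
    // Group 1 captures everything between a pair of double quotes.
    private static final Pattern QUOTED = Pattern.compile("\"([^\"]*)\"");

    // Returns the contents of every quoted field, without the quote marks.
    static List<String> extract(String line) {
        List<String> fields = new ArrayList<>();
        Matcher matcher = QUOTED.matcher(line);
        while (matcher.find()) {
            fields.add(matcher.group(1));
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "\"John Doe\",\"IT,SI\",\"foo, bar\"";
        System.out.println(extract(line));
    }
}
```

For real CSV input with escaped quotes or unquoted fields, a dedicated CSV parser is likely a safer choice than a regex.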

Remove space except tab in a text line

I am reading an extremely messy tab-delimited text file line by line and trying to get the unique column names out of it.
The problem is that it uses tabs as the field separator, but some column names contain spaces! I am using
String[] cols = line.split("\\t");
which does not seem to work properly, since in some cases it treats the spaces as separators! Is using a regex a good solution? If so, could you advise which regex removes spaces from a string but keeps the tabs?
Data is like:
Sever ID Name
12221 zxsz
A tab in a string literal is just "\t". "\\t" is a literal backslash followed by a "t", which the regex engine itself interprets as a tab, so both work as an argument to split(). Either method works for me:
public class Scratch2 {
public static void main(String[] args) {
String welk = "anna one\tanna two\tanna three";
System.out.println("\\t");
String[] annas = welk.split("\t");
for (String anna : annas) {
System.out.println(anna);
}
System.out.println("\\\\t");
annas = welk.split("\\t");
for (String anna : annas) {
System.out.println(anna);
}
}
}
Output:
\t
anna one
anna two
anna three
\\t
anna one
anna two
anna three
The simplest explanation is that your input strings don't contain the whitespace characters you think they do.
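One way to check that is to print each line with its whitespace made visible. A minimal sketch, with class and method names invented for the example:

```java
class ShowWhitespace {
    // Replaces every whitespace or space-separator character with its
    // code point, e.g. <U+0009> for a tab and <U+0020> for a plain space.
    static String reveal(String line) {
        StringBuilder sb = new StringBuilder();
        for (char c : line.toCharArray()) {
            if (Character.isWhitespace(c) || Character.isSpaceChar(c)) {
                sb.append(String.format("<U+%04X>", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(reveal("Sever ID\tName"));
    }
}
```

If the header line shows <U+0020> where you expected <U+0009>, the file is separated by spaces at that point and no choice of split() argument will recover the columns.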

Checking if the word is uppercase, and changing only the first letter

I've searched for a bit on this forum and all I can seem to find are questions on how to make the first letter of every word uppercase, which is not what I'm looking for.
I'm looking for something that will check through all of the words in the String, and if they're uppercase, will change the letters to lowercase EXCEPT for the first one.
Like, let's say the string is:
"HI STACKOVERFLOW"
It would change it to:
"Hi Stackoverflow"
Or:
"I'M ASKING A QUESTION ON stackoverflow dot com"
It would change it to:
"I'm Asking a Question On stackoverflow dot com"
I would use the StringTokenizer class to break the string up into the separate words. Then you can get each token as a separate String and compare:
String line = "A BIG Thing that Something";
StringTokenizer st = new StringTokenizer(line);
while (st.hasMoreTokens()) {
    String a = st.nextToken();
    // A word that equals its own uppercase form is entirely uppercase.
    if (a.equals(a.toUpperCase())) {
        System.out.println(a.charAt(0) + a.substring(1).toLowerCase());
    } else {
        System.out.println(a);
    }
}
Something like that... You'll need to remember to import StringTokenizer, it's part of the java.util package.
public static void main(String[] args) {
    String org = "HI STACKOVERFLOW";
    String[] temp = org.split(" ");
    for (int i = 0; i < temp.length; i++) {
        // Only touch non-empty words that are entirely uppercase.
        if (!temp[i].isEmpty() && temp[i].equals(temp[i].toUpperCase())) {
            temp[i] = temp[i].substring(0, 1) + temp[i].substring(1).toLowerCase();
        }
        System.out.print(temp[i] + " ");
    }
}
Output: Hi Stackoverflow
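If you need the result as a String rather than printed output, the same idea can be wrapped in a method. This is a sketch with hypothetical names; it assumes words are separated by single spaces and leaves single-letter words such as "A" unchanged.

```java
import java.util.Locale;

class UppercaseWords {
    // Lowercases everything after the first character of each word that
    // is longer than one character and written entirely in uppercase.
    static String normalize(String text) {
        String[] words = text.split(" ", -1);
        for (int i = 0; i < words.length; i++) {
            String w = words[i];
            boolean allUpper = w.length() > 1
                    && w.equals(w.toUpperCase(Locale.ROOT))
                    && !w.equals(w.toLowerCase(Locale.ROOT));
            if (allUpper) {
                words[i] = w.charAt(0) + w.substring(1).toLowerCase(Locale.ROOT);
            }
        }
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        System.out.println(normalize("HI STACKOVERFLOW"));
    }
}
```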
If you're open to incorporate a library in your project, I'm quite sure that Apache Commons-lang StringUtils has the type of functionality you need.

Regular expression for finding two words in a string

Here is my basic problem: I am reading some lines in from a file. The format of each line in the file is this:
John Doe 123
There is a tab between Doe and 123.
I'm looking for a regex such that I can "pick off" the John Doe. Something like scanner.next(regular expression) that would give me the John Doe.
This is probably very simple, but I can't seem to get it to work. Also, I'm trying to figure this out without having to rely on the tab being there.
I've looked here: Regular Expression regex to validate input: Two words with a space between. But none of these answers worked. I kept getting runtime errors.
Some Code:
while (inFile.hasNextLine()) {
    String s = inFile.nextLine();
    Scanner string = new Scanner(s);
    System.out.println(s); // check to make sure I got the string
    System.out.println(string.next("[A-Za-z]+ [A-Za-z]+")); // This doesn't work for me
    System.out.println(string.next("\\b[A-Za-z ]+\\b")); // Nor does this
}
Are you required to use regex for this? You could simply use a split method across \t on each line and just grab the first or second element (I'm not sure which you meant by 'pick off' john doe).
It would help if you provided the code you're trying that is giving you runtime errors.
You could use regex:
[A-Za-z]+ [A-Za-z]+
if you always knew your name was going to be two words.
You could also try
\b[A-Za-z ]+\b
which matches any number of words (containing only letters), making sure it captures whole words (that's what the '\b' is for), so it returns "John Doe" instead of "John Doe " (with the trailing space). Don't forget that backslashes need to be escaped in Java.
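A likely cause of the runtime errors is that Scanner.next(pattern) only tests the next whitespace-delimited token ("John" here), so a pattern containing a space can never match and an InputMismatchException is thrown. Scanner.findInLine ignores the delimiter and searches the rest of the line instead. A small sketch, with class and method names made up for illustration:

```java
import java.util.Scanner;

class FindName {
    // Returns the first "word whitespace word" match in the line,
    // or null if the line contains no such match.
    static String firstTwoWords(String line) {
        try (Scanner scanner = new Scanner(line)) {
            return scanner.findInLine("[A-Za-z]+\\s+[A-Za-z]+");
        }
    }

    public static void main(String[] args) {
        System.out.println(firstTwoWords("John Doe\t123"));
    }
}
```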
This basically works to isolate John Doe from the rest...
public String isolateAndTrim( String candidate ) {
// This pattern isolates "John Doe" as a group...
Pattern pattern = Pattern.compile( "(\\w+\\s+\\w+)\\s+\\d*" );
Matcher matcher = pattern.matcher( candidate );
String clean = "";
if ( matcher.matches() ) {
clean = matcher.group( 1 );
// This replace all reduces away extraneous whitespace...
clean = clean.replaceAll( "\\s+", " " );
}
return clean;
}
The grouping parentheses allow you to "pick off" the name portion from the digit portion: "John Doe", "Jane Austin", whatever. You should learn how grouping works in regex, as it works great for problems just like this one.
The trick to remove the extra whitespace comes from How to remove duplicate white spaces in string using Java?
Do you prefer simplicity and readability? If so, consider the following solution
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;
public class MyLineScanner
{
public static void readLine(String source_file) throws FileNotFoundException
{
File source = new File(source_file);
Scanner line_scanner = new Scanner(source);
while(line_scanner.hasNextLine())
{
String line = line_scanner.nextLine();
// check that the line was read
System.out.println(line);
// this works for me
Scanner words_scanner = new Scanner(line);
words_scanner.useDelimiter("\t");
while (words_scanner.hasNext())
{
System.out.format("word : %s %n", words_scanner.next());
}
}
}
public static void main(String[] args) throws FileNotFoundException
{
readLine("source.txt");
}
}
