Write a regular expression to count sentences

Write a regular expression to count sentences - java

I have a String :
"Hello world... I am here. Please respond."
and I would like to count the number of sentences within the String. I had an idea to use a Scanner as well as the useDelimiter method to split any String into sentences.
Scanner in = new Scanner(file);
in.useDelimiter("insert here");
I'd like to create a regular expression which can go through the String I have shown above and identify it to have two sentences. I initially tried using the delimiter:
[^?.]
It gets hung up on the ellipses.

You could use a regular expression that checks for a non end of sentence, followed by an end of sentence like:
[^?!.][?!.]
Although, as @Gabe Sechan points out, a regular expression may not be accurate when the sentence includes abbreviations such as Dr., Rd., St., etc.
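As a rough sketch of how the pattern above could be applied, the class below counts its matches with Matcher.find(). The class and method names are made up for illustration. The second method is only an assumption about what the question wants: it counts runs of terminators but skips runs of two or more periods, so that an ellipsis does not end a sentence.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the original answers.
class SentenceCounter {
    // The pattern from the answer: a non-terminator followed by a terminator.
    private static final Pattern NON_END_THEN_END = Pattern.compile("[^?!.][?!.]");
    // Assumption: a run of terminators such as "?!" marks one sentence end.
    private static final Pattern TERMINATOR_RUN = Pattern.compile("[?!.]+");

    // Counts every place where a non-terminator is followed by a terminator.
    static int countWithAnswerPattern(String text) {
        Matcher m = NON_END_THEN_END.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Counts terminator runs, ignoring runs of two or more periods (ellipses).
    static int countSkippingEllipses(String text) {
        Matcher m = TERMINATOR_RUN.matcher(text);
        int count = 0;
        while (m.find()) {
            if (!m.group().matches("\\.{2,}")) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "Hello world... I am here. Please respond.";
        System.out.println(countWithAnswerPattern(text));
        System.out.println(countSkippingEllipses(text));
    }
}
```

On the sample string the first method treats the ellipsis as a sentence end, while the second reports the two sentences the question asks for. Neither handles abbreviations such as Dr.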

this could help :
public int getNumSentences()
{
List<String> tokens = getTokens( "[^!?.]+" );
return tokens.size();
}
You can also add the line break as a separator, independent of your OS, with the following line of code:
String pattern = System.getProperty("line.separator") + " ";
You can find more about matching a line break here: Java regex: newline + white space
so the method finally becomes:
public int getNumSentences()
{
List<String> tokens = getTokens( "[^!?.]+" + pattern + "+" );
return tokens.size();
}
hope this could help :) !

A regular expression probably isn't the right tool for this. English is not a regular language, so regular expressions get hung up a lot. For one thing, you can't even be sure a period in the middle of the text is the end of a sentence: abbreviations (like Mr.), acronyms with periods, and initials will trip you up as well.

For your sentence : "Hello world... I am here. Please respond."
The code will be :
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class JavaRegex {
public static void main(String[] args) {
int count=0;
String sentence = "Hello world... I am here. Please respond.";
Pattern pattern = Pattern.compile("\\..");
Matcher matcher = pattern.matcher(sentence);
while(matcher.find()) {
count++;
}
System.out.println("No. of sentence = "+count);
}
}

Related

Extracting hashtags from user input

I have been asking similar questions before so this may be taken down but I feel like the code I have now should work but it doesn't.
String post [] = new String [100];
System.out.println("\nType your post");
String userPost = input.nextLine();
post[0] = userPost;
String hashtags ="";
for (int i = 0; i<post.length && post[i]!=null;i++){
String[]words = post[i].split(" ");
for(int j=0;j<words.length;j++){
if(words[j].trim().startsWith("#")){
hashtags+=words[j].trim() + " ";
}
}
}
if(hashtags.trim().isEmpty())
System.out.println("No hashtags were typed");
else
System.out.println("Hashtags found: " + hashtags );
I feel like this should work but when running this code, it skips asking for user input and immediately prints No hashtags were typed.

What you should do is use the Java regular expression API to extract all hashtags from the provided String with a suitable regexp:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
class HashtagsFinder {
public static void main(String[] args) {
String input = "#HASHTAGS #HASHTAG #ANOTHER_HASHTAG BUT#THISISNOTAHASHTAG";
// A '#' that does not directly follow a word character, then one or more word characters.
Matcher matcher = Pattern.compile("\\B#\\w+").matcher(input);
while (matcher.find()) {
System.out.println(matcher.group());
}
}
}
A regular expression passed as the Pattern.compile() argument matches expressions in line with the following rules:
\B - ensure that there is no word boundary at the beginning.
# - ensure that there is exactly one hash char at the beginning of the word - with the previous part, it ensures that the hash char is always at the beginning of the word.
\w+ - match one or more word characters (alphanumeric and underscore), so we won't collect expressions consisting of just the hash char.
Output of the code pasted above:
#HASHTAGS
#HASHTAG
#ANOTHER_HASHTAG
Also worth considering: instead of the standard java.util.regex package, you could use https://github.com/google/re2j, which guarantees matching in time linear in the size of the input. The standard Java regular expression library matches by backtracking, which on some patterns and inputs can take exponential time or exceed the available stack space.

Need help in Regex to exclude splitting string within "

I need to split a String using a comma as the separator, but if part of the string is enclosed in double quotes ("), that portion must not be split, from its opening " to its closing ", even if it contains commas.
Can anyone please help me solve this using a regex with lookaround?

Resurrecting this question because it had a simple regex solution that wasn't mentioned. This situation sounds very similar to "regex-match a pattern unless...":
\"[^\"]*\"|(,)
The left side of the alternation matches complete double-quoted strings. We will ignore these matches. The right side matches and captures commas to Group 1, and we know they are the right ones because they were not matched by the expression on the left.
Here is working code:
import java.util.regex.*;
import java.util.List;
class Program {
public static void main (String[] args) {
String subject = "\"Messages,Hello\",World,Hobbies,Java\",Programming\"";
Pattern regex = Pattern.compile("\"[^\"]*\"|(,)");
Matcher m = regex.matcher(subject);
StringBuffer b = new StringBuffer();
while (m.find()) {
if(m.group(1) != null) m.appendReplacement(b, "SplitHere");
else m.appendReplacement(b, m.group(0));
}
m.appendTail(b);
String replaced = b.toString();
String[] splits = replaced.split("SplitHere");
for (String split : splits)
System.out.println(split);
} // end main
} // end Program
Reference
How to match pattern except in situations s1, s2, s3

Please try this:
(?<!\G\s*"[^"]*),
If you put this regex in your program, it should be:
String regex = "(?<!\\G\\s*\"[^\"]*),";
But 2 things are not clear:
Does the " only start next to a , or can it start in the middle of the content, such as AAA, BB"CC,DD" ? The regex above only deals with a " that starts near a , .
If the content itself contains ", how is it escaped? With "" or \"? The regex above does not handle any escaped " format.
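Since the question asks for lookaround, here is a minimal sketch of another commonly used approach: split on a comma only when it is followed by an even number of double quotes up to the end of the input. It assumes the quotes are balanced and never escaped, which is exactly the open point raised above; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; assumes balanced, unescaped double quotes.
class QuoteAwareSplitter {
    // A comma followed by an even number of double quotes up to the end of input,
    // which means the comma itself is not inside a quoted section.
    private static final String COMMA_OUTSIDE_QUOTES =
            ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    static List<String> split(String line) {
        // The limit -1 keeps trailing empty fields instead of dropping them.
        return Arrays.asList(line.split(COMMA_OUTSIDE_QUOTES, -1));
    }

    public static void main(String[] args) {
        for (String part : split("\"Messages,Hello\",World,Hobbies,\"Java,Programming\"")) {
            System.out.println(part);
        }
    }
}
```

The lookahead rescans the rest of the line for every comma, so this is convenient for short lines rather than fast for very long ones.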

Regular expression for finding two words in a string

Here is my basic problem: I am reading some lines in from a file. The format of each line in the file is this:
John Doe 123
There is a tab between Doe and 123.
I'm looking for a regex such that I can "pick off" the John Doe. Something like scanner.next(regular expression) that would give me the John Doe.
This is probably very simple, but I can't seem to get it to work. Also, I'm trying to figure this out without having to rely on the tab being there.
I've looked here: Regular Expression regex to validate input: Two words with a space between. But none of these answers worked. I kept getting runtime errors.
Some Code:
while(inFile.hasNextLine()){
String s = inFile.nextLine();
Scanner string = new Scanner(s);
System.out.println(s); // check to make sure I got the string
System.out.println(string.next("[A-Za-z]+ [A-Za-z]+")); //This
//doesn't work for me
System.out.println(string.next("\\b[A-Za-z ]+\\b"));//Nor does
//this
}

Are you required to use regex for this? You could simply use a split method across \t on each line and just grab the first or second element (I'm not sure which you meant by 'pick off' john doe).

It would help if you provided the code you're trying that is giving you runtime errors.
You could use regex:
[A-Za-z]+ [A-Za-z]+
if you always knew your name was going to be two words.
You could also try
\b[A-Za-z ]+\b
which matches any number of words (made up of letters), making sure it captures whole words (that's what the '\b' is for), so it returns "John Doe" instead of "John Doe " (with the trailing space). Don't forget that backslashes need to be escaped in Java.
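A minimal sketch of applying the first pattern with Matcher.find() on the whole line rather than with Scanner.next(...). It rests on the assumption that the runtime errors come from Scanner testing the pattern against a single whitespace-delimited token such as "John", which a two-word pattern can never match. The class and method names are hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the original answers.
class NameExtractor {
    // Two runs of letters separated by a single space.
    private static final Pattern TWO_WORDS = Pattern.compile("[A-Za-z]+ [A-Za-z]+");

    // Returns the first "word word" pair found anywhere in the line, if any.
    static Optional<String> firstTwoWords(String line) {
        Matcher m = TWO_WORDS.matcher(line);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(firstTwoWords("John Doe\t123").orElse("(no match)"));
    }
}
```

Because find() searches inside the line, this does not depend on the tab being present.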

This basically works to isolate John Doe from the rest...
public String isolateAndTrim( String candidate ) {
// This pattern isolates "John Doe" as a group...
Pattern pattern = Pattern.compile( "(\\w+\\s+\\w+)\\s+\\d*" );
Matcher matcher = pattern.matcher( candidate );
String clean = "";
if ( matcher.matches() ) {
clean = matcher.group( 1 );
// This replace all reduces away extraneous whitespace...
clean = clean.replaceAll( "\\s+", " " );
}
return clean;
}
The grouping parentheses allow you to "pick off" the name portion from the digit portion. "John Doe", "Jane Austin", whatever. You should learn grouping in RegEx, as it works great for problems just like this one.
The trick to remove the extra whitespace comes from How to remove duplicate white spaces in string using Java?

Do you prefer simplicity and readability? If so, consider the following solution
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;
public class MyLineScanner
{
public static void readLine(String source_file) throws FileNotFoundException
{
File source = new File(source_file);
Scanner line_scanner = new Scanner(source);
while(line_scanner.hasNextLine())
{
String line = line_scanner.nextLine();
// check to make sure the line exists
System.out.println(line);
// this works for me
Scanner words_scanner = new Scanner(line);
words_scanner.useDelimiter("\t");
while (words_scanner.hasNext())
{
System.out.format("word : %s %n", words_scanner.next());
}
}
}
public static void main(String[] args) throws FileNotFoundException
{
readLine("source.txt");
}
}

How to find a whole word in a String in Java?

I have a String that I have to parse for different keywords.
For example, I have the String:
"I will come and meet you at the 123woods"
And my keywords are
'123woods'
'woods'
I should report whenever I have a match and where. Multiple occurrences should also be accounted for.
However, for this one, I should get a match only on '123woods', not on 'woods'. This eliminates using String.contains() method. Also, I should be able to have a list/set of keywords and check at the same time for their occurrence. In this example, if I have '123woods' and 'come', I should get two occurrences. Method execution should be somewhat fast on large texts.
My idea is to use StringTokenizer but I am unsure if it will perform well. Any suggestions?

The example below is based on your comments. It uses a List of keywords, which will be searched in a given String using word boundaries. It uses StringUtils from Apache Commons Lang to build the regular expression and print the matched groups.
String text = "I will come and meet you at the woods 123woods and all the woods";
List<String> tokens = new ArrayList<String>();
tokens.add("123woods");
tokens.add("woods");
String patternString = "\\b(" + StringUtils.join(tokens, "|") + ")\\b";
Pattern pattern = Pattern.compile(patternString);
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
System.out.println(matcher.group(1));
}
If you are looking for more performance, you could have a look at StringSearch: high-performance pattern matching algorithms in Java.
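The question also asks where each match occurs. A sketch of the same word-boundary idea using only the JDK is shown below: Matcher.start() gives the index of each hit, and Pattern.quote is added on the assumption that keywords might contain regex metacharacters. The class and method names, and the "keyword@index" result format, are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not part of the original answers.
class KeywordFinder {
    // Returns one "keyword@index" entry per whole-word occurrence, in text order.
    static List<String> findWholeWords(String text, List<String> keywords) {
        // Quote each keyword so it is matched literally, then join with "|".
        String alternation = keywords.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        Matcher m = Pattern.compile("\\b(?:" + alternation + ")\\b").matcher(text);
        List<String> hits = new ArrayList<>();
        while (m.find()) {
            hits.add(m.group() + "@" + m.start());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "I will come and meet you at the 123woods";
        System.out.println(findWholeWords(text, Arrays.asList("123woods", "woods", "come")));
    }
}
```

Note that \b only behaves as a word boundary when the keywords start and end with word characters (letters, digits, or underscore).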

Use regex + word boundaries as others answered.
"I will come and meet you at the 123woods".matches(".*\\b123woods\\b.*");
will be true.
"I will come and meet you at the 123woods".matches(".*\\bwoods\\b.*");
will be false.

Hope this works for you:
String string = "I will come and meet you at the 123woods";
String keyword = "123woods";
Boolean found = Arrays.asList(string.split(" ")).contains(keyword);
if(found){
System.out.println("Keyword matched the string");
}
http://codigounico.blogspot.com/

How about something like Arrays.asList(String.split(" ")).contains("xx")?
See String.split() and How can I test if an array contains a certain value.

Got a way to match Exact word from String in Android:
String full = "Hello World. How are you ?";
String one = "Hell";
String two = "Hello";
String three = "are";
String four = "ar";
boolean is1 = isContainExactWord(full, one);
boolean is2 = isContainExactWord(full, two);
boolean is3 = isContainExactWord(full, three);
boolean is4 = isContainExactWord(full, four);
Log.i("Contains Result", is1+"-"+is2+"-"+is3+"-"+is4);
Result: false-true-true-false
Function for match word:
private boolean isContainExactWord(String fullString, String partWord){
String pattern = "\\b"+partWord+"\\b";
Pattern p=Pattern.compile(pattern);
Matcher m=p.matcher(fullString);
return m.find();
}
Done

public class FindTextInLine {
String match = "123woods";
String text = "I will come and meet you at the 123woods";
public void findText () {
if (text.contains(match)) {
System.out.println("Keyword matched the string" );
}
}
}

Try to match using regular expressions. Match for "\b123woods\b"; \b is a word boundary.

The solution seems to be long accepted, but the solution could be improved, so if someone has a similar problem:
This is a classical application for multi-pattern-search-algorithms.
Java pattern search (with Matcher.find) is not well suited to this. Searching for exactly one keyword is optimized in Java, but searching for an or-expression uses the regex nondeterministic automaton, which backtracks on mismatches. In the worst case, each character of the text will be processed l times (where l is the sum of the pattern lengths).
Single-pattern search is better, but not well suited either. One has to restart the whole search for every keyword pattern. In the worst case, each character of the text will be processed p times, where p is the number of patterns.
Multi-pattern search processes each character of the text exactly once. Algorithms suitable for such a search are Aho-Corasick, Wu-Manber, or Set Backwards Oracle Matching. These can be found in libraries like Stringsearchalgorithms or byteseek.
// example with StringSearchAlgorithms
AhoCorasick stringSearch = new AhoCorasick(asList("123woods", "woods"));
CharProvider text = new StringCharProvider("I will come and meet you at the woods 123woods and all the woods", 0);
StringFinder finder = stringSearch.createFinder(text);
List<StringMatch> all = finder.findAll();

A much simpler way to do this is to use split():
String match = "123woods";
String text = "I will come and meet you at the 123woods";
String[] sentence = text.split(" ");
for(String word: sentence)
{
if(word.equals(match))
return true;
}
return false;
This is a simpler, less elegant way to do the same thing without using tokens, etc.

You can use regular expressions.
Use Matcher and Pattern methods to get the desired output

You can also use regex matching with the \b anchor (word boundary).

To match "123woods" instead of "woods", use atomic grouping in the regular expression.
One thing to note: once the atomic group has matched "123woods" at a position, the engine does not go back into the group to try the other alternative at that same position.
\b(?>123woods|woods)\b
It tries 123woods first; once that alternative has matched, the search at that position is finished.

Looking back at the original question, we need to find some given keywords in a given sentence, count the number of occurrences and know something about where. I don't quite understand what "where" means (is it an index in the sentence?), so I'll pass that one... I'm still learning java, one step at a time, so I'll see to that one in due time :-)
Note that ordinary sentences (like the one in the original question) can contain repeated keywords, so the search cannot just ask whether a given keyword "exists or not" and count it as 1 if it does. There can be more than one occurrence of the same keyword. For example:
// Base sentence (added punctuation, to make it more interesting):
String sentence = "Say that 123 of us will come by and meet you, "
+ "say, at the woods of 123woods.";
// Split it (punctuation taken in consideration, as well):
java.util.List<String> strings =
java.util.Arrays.asList(sentence.split(" |,|\\."));
// My keywords:
java.util.ArrayList<String> keywords = new java.util.ArrayList<>();
keywords.add("123woods");
keywords.add("come");
keywords.add("you");
keywords.add("say");
By looking at it, the expected result would be 5 for "Say" + "come" + "you" + "say" + "123woods", counting "say" twice if we go lowercase. If we don't, then the count should be 4, "Say" being excluded and "say" included. Fine. My suggestion is:
// Set... ready...?
int counter = 0;
// Go!
for(String s : strings)
{
// Ask whether each word of the sentence is in the keywords, not the other
// way around, so that repeated keywords in the sentence are counted.
Boolean found = keywords.contains(s.toLowerCase());
if(found)
{
counter ++;
System.out.println("Found: " + s);
}
}
// Statistics:
if (counter > 0)
{
System.out.println("In sentence: " + sentence + "\n"
+ "Count: " + counter);
}
And the results are:
Found: Say
Found: come
Found: you
Found: say
Found: 123woods
In sentence: Say that 123 of us will come by and meet you, say, at the woods of 123woods.
Count: 5

If you want to identify a whole word in a string and change the content of that word, you can do it this way. The final string stays the same except for the word you treated. In this case "not" becomes "'not'" in the final string.
StringBuilder sb = new StringBuilder();
String[] splited = value.split("\\s+");
if(ArrayUtils.isNotEmpty(splited)) {
for(String valor : splited) {
sb.append(" ");
if("not".equals(valor.toLowerCase())) {
sb.append("'").append(valor).append("'");
} else {
sb.append(valor);
}
}
}
return sb.toString();

Parsing CSV input with a RegEx in java

I know, now I have two problems. But I'm having fun!
I started with this advice not to try and split, but instead to match on what is an acceptable field, and expanded from there to this expression.
final Pattern pattern = Pattern.compile("\"([^\"]*)\"|(?<=,|^)([^,]*)(?=,|$)");
The expression looks like this without the annoying escaped quotes:
"([^"]*)"|(?<=,|^)([^,]*)(?=,|$)
This is working well for me - either it matches on "two quotes and whatever is between them", or "something between the start of the line or a comma and the end of the line or a comma". Iterating through the matches gets me all the fields, even if they are empty. For instance,
the quick, "brown, fox jumps", over, "the",,"lazy dog"
breaks down into
the quick
"brown, fox jumps"
over
"the"
"lazy dog"
Great! Now I want to drop the quotes, so I added the lookahead and lookbehind non-capturing groups like I was doing for the commas.
final Pattern pattern = Pattern.compile("(?<=\")([^\"]*)(?=\")|(?<=,|^)([^,]*)(?=,|$)");
again the expression is:
(?<=")([^"]*)(?=")|(?<=,|^)([^,]*)(?=,|$)
Instead of the desired result
the quick
brown, fox jumps
over
the
lazy dog
now I get this breakdown:
the quick
"brown
fox jumps"
,over,
"the"
,,
"lazy dog"
What am I missing?

Operator precedence. Basically there is none. It's all left to right. So the or (|) is applying to the closing quote lookahead and the comma lookahead
Try:
(?:(?<=")([^"]*)(?="))|(?<=,|^)([^,]*)(?=,|$)

(?:^|,)\s*(?:(?:(?=")"([^"].*?)")|(?:(?!")(.*?)))(?=,|$)
This should do what you want.
Explanation:
(?:^|,)\s*
The pattern should start with a , or beginning of string. Also, ignore all whitespace at the beginning.
Lookahead and see if the rest starts with a quote
(?:(?=")"([^"].*?)")
If it does, then match non-greedily till next quote.
(?:(?!")(.*?))
If it does not begin with a quote, then match non-greedily till next comma or end of string.
(?=,|$)
The pattern should end with a comma or end of string.

When I started to understand what I had done wrong, I also started to understand how convoluted the lookarounds were making this. I finally realized that I didn't want all the matched text, I wanted specific groups inside of it. I ended up using something very similar to my original RegEx except that I didn't do a lookahead on the closing comma, which I think should be a little more efficient. Here is my final code.
package regex.parser;
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class CSVParser {
/*
* This Pattern will match on either quoted text or text between commas, including
* whitespace, and accounting for beginning and end of line.
*/
private final Pattern csvPattern = Pattern.compile("\"([^\"]*)\"|(?<=,|^)([^,]*)(?:,|$)");
private ArrayList<String> allMatches = null;
private Matcher matcher = null;
private String match = null;
private int size;
public CSVParser() {
allMatches = new ArrayList<String>();
matcher = null;
match = null;
}
public String[] parse(String csvLine) {
matcher = csvPattern.matcher(csvLine);
allMatches.clear();
String match;
while (matcher.find()) {
match = matcher.group(1);
if (match!=null) {
allMatches.add(match);
}
else {
allMatches.add(matcher.group(2));
}
}
size = allMatches.size();
if (size > 0) {
return allMatches.toArray(new String[size]);
}
else {
return new String[0];
}
}
public static void main(String[] args) {
String lineinput = "the quick,\"brown, fox jumps\",over,\"the\",,\"lazy dog\"";
CSVParser myCSV = new CSVParser();
System.out.println("Testing CSVParser with: \n " + lineinput);
for (String s : myCSV.parse(lineinput)) {
System.out.println(s);
}
}
}

I know this isn't what the OP wants, but for other readers, one of the String.replace methods could be used to strip the quotes from each element in the result array of the OPs current regex.
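A minimal sketch of that idea, assuming each field carries at most one leading and one trailing double quote. It uses String.replaceAll with an anchored pattern so quotes in the middle of a field are left alone; the class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not part of the original answers.
class QuoteStripper {
    // Removes one leading and one trailing double quote from each field, if present.
    static String[] stripQuotes(String[] fields) {
        String[] result = new String[fields.length];
        for (int i = 0; i < fields.length; i++) {
            result[i] = fields[i].replaceAll("^\"|\"$", "");
        }
        return result;
    }

    public static void main(String[] args) {
        String[] fields = {"the quick", "\"brown, fox jumps\"", "over", "\"the\"", "", "\"lazy dog\""};
        System.out.println(Arrays.toString(stripQuotes(fields)));
    }
}
```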
