How to scan for words in Java excluding punctuation - java

I'm trying to use the Scanner class to parse all the words in a file. The file contains ordinary text, but I only want the words, excluding all punctuation.
The solution I have so far is incomplete, and it is already giving me a problem:
Scanner fileScan= new Scanner(file);
String word;
while(fileScan.hasNext("[^ ,!?.]+")){
word= fileScan.next();
this.addToIndex(word, filename);
}
Now if I use this on a sentence like "hi my name is mario!", it returns just "hi", "my", "name" and "is". It doesn't match "mario!" (obviously), but it also doesn't match "mario", which I think it should.
Can you explain why that is, and help me find a better solution if you have one?
Thank you

This works:
import java.util.*;
class S {
public static void main(String[] args) {
Scanner fileScan= new Scanner("hi my name is mario!").useDelimiter("[ ,!?.]+");
String word;
while(fileScan.hasNext()){
word= fileScan.next();
System.out.println(word);
}
} // end of main()
}
javac -g S.java && java S
hi
my
name
is
mario
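As for why the original loop stops early: hasNext(String pattern) does not search inside the input. It checks whether the next complete token, as split by the current delimiter (whitespace by default), matches the pattern as a whole. The last token is "mario!", which contains "!", so the check fails and the loop ends before "mario" is ever returned. A small sketch comparing the two approaches (the class and method names are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class HasNextDemo {
    // The approach from the question: default whitespace delimiter,
    // with the pattern passed to hasNext().
    static List<String> withHasNextPattern(String text) {
        List<String> words = new ArrayList<>();
        Scanner scan = new Scanner(text);
        // True only if the WHOLE next whitespace-separated token matches.
        while (scan.hasNext("[^ ,!?.]+")) {
            words.add(scan.next());
        }
        return words;
    }

    // The approach from the answer: runs of spaces and punctuation are the delimiter.
    static List<String> withDelimiter(String text) {
        List<String> words = new ArrayList<>();
        Scanner scan = new Scanner(text).useDelimiter("[ ,!?.]+");
        while (scan.hasNext()) {
            words.add(scan.next());
        }
        return words;
    }

    public static void main(String[] args) {
        String text = "hi my name is mario!";
        System.out.println(withHasNextPattern(text)); // [hi, my, name, is]
        System.out.println(withDelimiter(text));      // [hi, my, name, is, mario]
    }
}
```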

Since you want to get rid of the punctuation, you can simply replace all punctuation marks before adding to the index:
word = word.replaceAll("\\p{Punct}", "");
In the case of hyphens or other isolated punctuation marks, just check word.isEmpty() before adding.
Of course, you'd have to get rid of your custom delimiter.
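Put together, that idea might look like the sketch below. It assumes the default whitespace delimiter and the POSIX class \p{Punct} (ASCII punctuation only); the class and method names are hypothetical. Note that it also strips apostrophes and hyphens inside words, so "don't" becomes "dont".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class StripPunctuation {
    // Reads whitespace-separated tokens and removes ASCII punctuation from each one.
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        Scanner scan = new Scanner(text); // default delimiter: whitespace
        while (scan.hasNext()) {
            String word = scan.next().replaceAll("\\p{Punct}", "");
            // A token such as "-" becomes empty after stripping, so skip it.
            if (!word.isEmpty()) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("hi, my name - is mario!")); // [hi, my, name, is, mario]
    }
}
```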

Related

Space in list despite using delimiters and how to remove it?

I am solving project Euler problem 22, wherein the program reads a text file having text format as follows and then tries to alphabetically sort it:
"MARY","PATRICIA","LINDA","BARBARA","ELIZABETH","JENNIFER",
"MARIA","SUSAN","MARGARET","DOROTHY","LISA", etc...
I use delimiter to eliminate both "" and ",", however when the ArrayList is sorted, it gives first element blank and sort result is like this:
<I get blank space here>,ANNALISA, ANNAMAE, ANNAMARIA, ANNAMARIE,
ANNE, ANNELIESE, ANNELLE, ANNEMARIE, ANNETT, ANNETTA, ANNETTE,
ANNICE, ANNIE, ANNIKA, ANNIS, ANNITA, ANNMARIE, ANTHONY,
ANTIONE, ANTIONETTE, ANTOINE, ANTOINETTE, etc...
My code is
public class Problem22 {
public static void main(String[] args) throws FileNotFoundException {
Scanner scan = new Scanner (new File("file.txt"));
scan.useDelimiter(",|\"| ");
String name = null;
ArrayList<String> names = new ArrayList<>();
while(scan.hasNext()) {
name = scan.next();
names.add(name);
}
scan.close();
Collections.sort(names);
System.out.println(names);
}
}
I need help understanding why I get the blank entry. I also tried to remove it but was unable to.
Pattern b = Pattern.compile("\\|"+"\r\n");
scan.useDelimiter(b);
I changed the regex.
To understand regular expressions (regex):
1: https://docs.oracle.com/javase/7/docs/api/java/util/Scanner.html
2: https://regexone.com/ - practice online
When I ran your code I actually had multiple empty strings in the result. Your mistake is your delimiter regex. ,|\"| (with a trailing space) means "split at each comma, double quote, or space", not "split at sequences of commas, double quotes, and spaces".
That means that "aaa", "bbb" will be split into ["", "aaa", "", "", "", "", "bbb", ""].
Change your regex accordingly and it'll work. I used \\W+ (meaning "sequences of non-word characters"), which also dealt with line breaks nicely. If you need more control, use something like [, \"]+.
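To see the difference, here is a minimal sketch that runs both delimiters over a short sample in the same format as the names file. The sample string and the helper name tokens are assumptions for illustration, not part of the original program.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class NameDelimiters {
    // Collects every token the Scanner produces for the given delimiter regex.
    static List<String> tokens(String text, String delimiterRegex) {
        List<String> result = new ArrayList<>();
        Scanner scan = new Scanner(text).useDelimiter(delimiterRegex);
        while (scan.hasNext()) {
            result.add(scan.next());
        }
        scan.close();
        return result;
    }

    public static void main(String[] args) {
        String sample = "\"MARY\",\"PATRICIA\",\"LINDA\"";
        // One delimiter character at a time: empty tokens appear between adjacent delimiters.
        System.out.println(tokens(sample, ",|\"| "));
        // Runs of delimiter characters: no empty tokens.
        System.out.println(tokens(sample, "[, \"]+")); // [MARY, PATRICIA, LINDA]
    }
}
```

The empty tokens sort before every name, which is why the sorted list starts with a blank entry.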

Write a regular expression to count sentences

I have a String :
"Hello world... I am here. Please respond."
and I would like to count the number of sentences within the String. I had an idea to use a Scanner as well as the useDelimiter method to split any String into sentences.
Scanner in = new Scanner(file);
in.useDelimiter("insert here");
I'd like to create a regular expression which can go through the String I have shown above and identify it to have two sentences. I initially tried using the delimiter:
[^?.]
It gets hung up on the ellipses.
You could use a regular expression that checks for a non end of sentence, followed by an end of sentence like:
[^?!.][?!.]
Although, as @Gabe Sechan points out, a regular expression may not be accurate when the sentence includes abbreviated words such as Dr., Rd., St., etc.
This could help:
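A minimal sketch of counting with that pattern, using Matcher.find() rather than a Scanner delimiter (the class and method names are made up for illustration). Each match is one non-terminator character followed by one terminator, so a run like "..." is counted once. The second call shows the abbreviation problem.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SentenceCounter {
    // A non-terminator character immediately followed by a terminator.
    private static final Pattern SENTENCE_END = Pattern.compile("[^?!.][?!.]");

    static int count(String text) {
        Matcher matcher = SENTENCE_END.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(count("Hello world... I am here. Please respond.")); // 3
        // Abbreviations are counted as sentence ends too:
        System.out.println(count("Dr. Smith lives on Main St. in town."));      // 3
    }
}
```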
public int getNumSentences()
{
List<String> tokens = getTokens( "[^!?.]+" );
return tokens.size();
}
You can also add the line break as a separator, independent of your OS, with the following line of code:
String pattern = System.getProperty("line.separator") + " ";
You can find more about this here: Java regex: newline + white space
Finally, the method becomes:
public int getNumSentences()
{
List<String> tokens = getTokens( "[^!?.]+" + pattern + "+" );
return tokens.size();
}
Hope this helps :) !
A regular expression probably isn't the right tool for this. English is not a regular language, so regular expressions get hung up a lot. For one thing, you can't even be sure a period in the middle of the text is the end of a sentence: abbreviations (like Mr.), acronyms with periods, and initials will trip you up as well. It's not the right tool.
For your sentence : "Hello world... I am here. Please respond."
The code will be :
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class JavaRegex {
public static void main(String[] args) {
int count=0;
String sentence = "Hello world... I am here. Please respond.";
Pattern pattern = Pattern.compile("\\..");
Matcher matcher = pattern.matcher(sentence);
while(matcher.find()) {
count++;
}
System.out.println("No. of sentence = "+count);
}
}

Split a sentence ignoring characters in Java

I want to write a program that reads one line of input text and breaks it up into words.
The words should be output one per line. A word is defined to be a sequence of letters.
Any characters in the input that are not letters should be discarded.
For example, if the user inputs the line:
He said, "That’s not a good idea."
then the output of the program should be:
He
said
That
‘s
not
a
good
idea
Simply use a regex
Pattern pattern = Pattern.compile("[\\w'’]+");
Matcher matcher = pattern.matcher("He said, \"That’s not a good idea.\"");
while (matcher.find())
System.out.println(matcher.group());
Try this:
public class Main {
public static void main(String[] args) {
Scanner stdIn = new Scanner(System.in); // user input
String line = stdIn.nextLine(); // read line
String[] words = line.split("[^a-zA-Z]+"); // split by all non-alphabetic characters (a regex)
for (String word : words) { // iterate through the words
System.out.println(word); // print word with a newline
}
}
}
It won't include the apostrophe in the token 's, but I don't know why you included that. It's not a letter, after all, and I read your first bold sentence. I hope the comments help explain how it works. Note that split drops trailing empty strings, but if the line starts with a non-letter the first element is an empty string, so an empty line is printed first; that should be easy for you to fix if you really need to.
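One way to handle the empty element is to skip empty pieces after splitting. A short sketch, assuming the same [^a-zA-Z]+ split (the class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

class LetterWords {
    // Splits on runs of non-letters and drops any empty pieces.
    static List<String> words(String line) {
        List<String> result = new ArrayList<>();
        for (String piece : line.split("[^a-zA-Z]+")) {
            // split() yields a leading "" when the line starts with a non-letter.
            if (!piece.isEmpty()) {
                result.add(piece);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The line starts with a quotation mark, so split() returns "" first.
        for (String word : words("\"That's not a good idea.\"")) {
            System.out.println(word);
        }
    }
}
```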

Getting scanner to read text file

I am trying to use a scanner to read a text file pulled with JFileChooser. The wordCount is working correctly, so I know it is reading. However, I cannot get it to search for instances of the user inputted word.
public static void main(String[] args) throws FileNotFoundException {
String input = JOptionPane.showInputDialog("Enter a word");
JFileChooser fileChooser = new JFileChooser();
fileChooser.showOpenDialog(null);
File fileSelection = fileChooser.getSelectedFile();
int wordCount = 0;
int inputCount = 0;
Scanner s = new Scanner (fileSelection);
while (s.hasNext()) {
String word = s.next();
if (word.equals(input)) {
inputCount++;
}
wordCount++;
}
You'll have to look for punctuation such as , ; . ! ? in each word. The next() method returns everything up to the next whitespace.
It will read "hi, how are you?" as the tokens "hi,", "how", "are", "you?".
You can use the method indexOf(String) to find these characters. You can also use replaceAll(String regex, String replacement) to replace characters. You can remove each character individually, or you can use a single regex, though those are usually harder to understand.
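//this removes each listed character by replacing it with an empty string
//replaceAll takes a regex, so the dot must be escaped; an unescaped "." matches every character
word = word.replaceAll("\\.","");
word = word.replaceAll(",","");
word = word.replaceAll("!","");
//etc.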
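Be careful with the dot: replaceAll treats its first argument as a regular expression, where an unescaped "." matches any character. A small sketch of the three variants (the class and method names are only for illustration):

```java
class ReplaceDemo {
    // Unescaped dot: the regex matches every character, so nothing is left.
    static String regexDot(String word) {
        return word.replaceAll(".", "");
    }

    // Escaped dot: only literal periods are removed.
    static String escapedDot(String word) {
        return word.replaceAll("\\.", "");
    }

    // String.replace takes literal text, so no escaping is needed.
    static String literalDot(String word) {
        return word.replace(".", "");
    }

    public static void main(String[] args) {
        System.out.println("[" + regexDot("hello.") + "]");   // []
        System.out.println("[" + escapedDot("hello.") + "]"); // [hello]
        System.out.println("[" + literalDot("hello.") + "]"); // [hello]
    }
}
```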
Read more about this method:
http://docs.oracle.com/javase/6/docs/api/java/lang/String.html#replaceAll%28java.lang.String,%20java.lang.String%29
Here's a Regex example:
//NOTE: This example will not work for you. It's just a simple example for seeing a Regex.
//Removes whitespace between a word character and . or ,
String pattern = "(\\w)(\\s+)([\\.,])";
word = word.replaceAll(pattern, "$1$3");
Source:
http://www.vogella.com/articles/JavaRegularExpressions/article.html
Here is a good Regex example that may help you:
Regex for special characters in java
Parse and remove special characters in java regex
Remove all non-"word characters" from a String in Java, leaving accented characters?
If the user-entered text differs in case, try using equalsIgnoreCase().
In addition to blackpanthers answer, you should also use trim() to account for whitespace, as
"abc" is not equal to "abc "
You should take a look at matches().
equals will not help you, since next() doesn't return the file word by word,
but rather whitespace (not comma, semicolon, etc.) separated token by token (as others mentioned).
Here is the Javadoc: String#matches(java.lang.String)
...and a little example.
input = ".*" + input + ".*";
...
boolean foundWord = word.matches(input);
. is the regex wildcard and stands for any character. .* stands for zero or more arbitrary characters. So you get a match if input appears anywhere in word.
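The suggestions in these answers (strip punctuation, trim the input, ignore case) can be combined into one sketch. It reads from a String instead of the chosen file so that it is self-contained, and the names are hypothetical. Unlike the .* approach, it does not count "High" as a match for "hi". If you do build a regex from user input, Pattern.quote(input) keeps characters such as "." or "*" in the input from being treated as regex syntax.

```java
import java.util.Scanner;

class WordSearch {
    // Counts tokens equal to the search word, ignoring case and ASCII punctuation.
    static int countOccurrences(String text, String input) {
        String wanted = input.trim(); // "hi " and "hi" should behave the same
        int count = 0;
        Scanner s = new Scanner(text);
        while (s.hasNext()) {
            // next() returns whitespace-separated tokens such as "you?"
            String word = s.next().replaceAll("\\p{Punct}", "");
            if (word.equalsIgnoreCase(wanted)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countOccurrences("Hi, how are you? hi! High.", "hi ")); // 2
    }
}
```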

Regular expression for finding two words in a string

Here is my basic problem: I am reading some lines in from a file. The format of each line in the file is this:
John Doe 123
There is a tab between Doe and 123.
I'm looking for a regex such that I can "pick off" the John Doe. Something like scanner.next(regular expression) that would give me the John Doe.
This is probably very simple, but I can't seem to get it to work. Also, I'm trying to figure this out without having to rely on the tab being there.
I've looked here: Regular Expression regex to validate input: Two words with a space between. But none of these answers worked. I kept getting runtime errors.
Some Code:
while(inFile.hasNextLine()){
String s = inFile.nextLine();
Scanner string = new Scanner(s);
System.out.println(s); // check to make sure I got the string
System.out.println(string.next("[A-Za-z]+ [A-Za-z]+")); // This doesn't work for me
System.out.println(string.next("\\b[A-Za-z ]+\\b")); // Nor does this
}
Are you required to use regex for this? You could simply use a split method across \t on each line and just grab the first or second element (I'm not sure which you meant by 'pick off' john doe).
It would help if you provided the code you're trying that is giving you runtime errors.
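For reference, the split approach could be as short as this sketch. It assumes every line really does contain a tab before the number, which the question says it would rather not rely on; the class and method names are made up.

```java
class TabSplit {
    // Returns everything before the first tab, or the whole line if there is no tab.
    static String nameOf(String line) {
        return line.split("\t")[0];
    }

    public static void main(String[] args) {
        System.out.println(nameOf("John Doe\t123")); // John Doe
    }
}
```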
You could use regex:
[A-Za-z]+ [A-Za-z]+
if you always knew your name was going to be two words.
You could also try
\b[A-Za-z ]+\b
which matches any number of words (made up of letters), making sure it captures whole words (that's what the '\b' is), so it returns "John Doe" instead of "John Doe " (with the trailing space). Don't forget that backslashes need to be escaped in Java.
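The runtime error in the question most likely comes from how Scanner.next(pattern) works rather than from the regex itself. next(pattern) tests the next whitespace-delimited token, here "John", against the whole pattern, and a token can never contain a space, so the first call throws InputMismatchException. Scanner.findInLine(pattern) ignores delimiters and searches the rest of the line instead. A minimal sketch (the class and method names are hypothetical):

```java
import java.util.InputMismatchException;
import java.util.Scanner;

class FindInLineDemo {
    // Searches the line for two words separated by one space; null if absent.
    static String name(String line) {
        Scanner scanner = new Scanner(line);
        return scanner.findInLine("[A-Za-z]+ [A-Za-z]+");
    }

    // Shows that next(pattern) fails: the token "John" cannot match a pattern needing a space.
    static boolean nextWithPatternFails(String line) {
        try {
            new Scanner(line).next("[A-Za-z]+ [A-Za-z]+");
            return false;
        } catch (InputMismatchException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(name("John Doe\t123"));                 // John Doe
        System.out.println(nextWithPatternFails("John Doe\t123")); // true
    }
}
```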
This basically works to isolate John Doe from the rest...
public String isolateAndTrim( String candidate ) {
// This pattern isolates "John Doe" as a group...
Pattern pattern = Pattern.compile( "(\\w+\\s+\\w+)\\s+\\d*" );
Matcher matcher = pattern.matcher( candidate );
String clean = "";
if ( matcher.matches() ) {
clean = matcher.group( 1 );
// This replace all reduces away extraneous whitespace...
clean = clean.replaceAll( "\\s+", " " );
}
return clean;
}
The grouping parenthesis will allow you to "pick off" the name portion from the digit portion. "John Doe", "Jane Austin", whatever. You should learn the grouping stuff in RegEx as it works great for problems just like this one.
The trick to remove the extra whitespace comes from How to remove duplicate white spaces in string using Java?
Do you prefer simplicity and readability? If so, consider the following solution
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;
public class MyLineScanner
{
public static void readLine(String source_file) throws FileNotFoundException
{
File source = new File(source_file);
Scanner line_scanner = new Scanner(source);
while(line_scanner.hasNextLine())
{
String line = line_scanner.nextLine();
// check to make sure the line was read
System.out.println(line);
// this works for me
Scanner words_scanner = new Scanner(line);
words_scanner.useDelimiter("\t");
while (words_scanner.hasNext())
{
System.out.format("word : %s %n", words_scanner.next());
}
}
}
public static void main(String[] args) throws FileNotFoundException
{
readLine("source.txt");
}
}
