Split a sentence ignoring characters in Java - java

I want to write a program that reads one line of input text and breaks it up into words.
The words should be output one per line. A word is defined to be a sequence of letters.
Any characters in the input that are not letters should be discarded.
For example, if the user inputs the line:
He said, "That’s not a good idea."
then the output of the program should be:
He
said
That
‘s
not
a
good
idea

Simply use a regex
Pattern pattern = Pattern.compile("[\\w'’]+");
Matcher matcher = pattern.matcher("He said, \"That’s not a good idea.\"");
while (matcher.find())
System.out.println(matcher.group());

Try this:
import java.util.Scanner;
public class Main {
    public static void main(String[] args) {
        Scanner stdIn = new Scanner(System.in); // user input
        String line = stdIn.nextLine(); // read line
        String[] words = line.split("[^a-zA-Z]+"); // split by all non-alphabetic characters (a regex)
        for (String word : words) { // iterate through the words
            System.out.println(word); // print word with a newline
        }
    }
}
It won't include the apostrophe in the token 's, but I don't know why you included that; it's not a letter, after all, and I read your first bold sentence. I hope the comments help explain how it works. If the line starts with a non-letter, split produces an empty first token, which prints as a blank line, but that should be easy for you to fix if you really need to.
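An alternative to splitting is to match the words directly, which never yields empty tokens. The following is a minimal sketch, assuming a word means a run of ASCII letters as in the question; the class and method names (WordExtractor, extractWords) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordExtractor {
    // A "word" is a maximal run of ASCII letters, per the question's definition.
    private static final Pattern WORD = Pattern.compile("[a-zA-Z]+");

    // Hypothetical helper: collects every run of letters in the line, in order.
    static List<String> extractWords(String line) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(line);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // Prints each word of the sample sentence on its own line.
        for (String word : extractWords("He said, \"That's not a good idea.\"")) {
            System.out.println(word);
        }
    }
}
```

Because find() only ever returns non-empty runs of letters, leading punctuation such as an opening quote needs no special handling.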

Related

Regex matching failing while attempting to read from file (SIC assembler in java)

I'm currently working on a SIC assembler and scanning lines from the following file:
begin START 0
main LDX zero
copy LDCH str1, x
STCH str2, x
TIX eleven
JLT copy
str1 BYTE C'TEST STRING'
str2 RESB 11
zero WORD 0
eleven WORD 11
END main
I'm using, as you might have already guessed, a regex to extract the fields from each line of code. Right now, I'm just testing whether the lines match the regex (as they're supposed to). If they do, the program prints them. The problem is that it recognizes only the first line and ignores the rest (i.e., from the second line on, they do not match the regex).
Here's the code so far:
public static void main(String args[]) throws FileNotFoundException {
Scanner scan = new Scanner(new File("/home/daniel/test.asm"));
Pattern std = Pattern.compile("(^$|[a-z0-9\\-\\_]*)(\\s+)([A-Z]+)(\\s+)([a-z0-9\\-\\_]*)");
String lineFromFile;
lineFromFile = scan.nextLine();
Matcher standard = std.matcher(lineFromFile);
while (standard.find()) {
System.out.println(lineFromFile);
lineFromFile = scan.nextLine();
}
}
It prints just the first line:
begin START 0
The weird thing comes here: if I copy the second line directly from the file, and declare a String object with it, and test it manually, it does work! And the same with the rest of the other lines. Something like:
public static void main(String args[]) throws FileNotFoundException {
Scanner scan = new Scanner(new File("/home/daniel/test.asm"));
Pattern std = Pattern.compile("(^$|[a-z0-9\\-\\_]*)(\\s+)([A-Z]+)(\\s+)([a-z0-9\\-\\_]*)");
String lineFromFile;
lineFromFile = "main LDX zero";
Matcher standard = std.matcher(lineFromFile);
if (standard.find())
System.out.println(lineFromFile);
}
And it does print it!
main LDX zero
I don't know if it has something to do with the regex or the file. I'd really appreciate it if any of you could help me find the error.
Thanks for your time! :)
Note: I am assuming your regex is correct.
You need to update the Matcher object for every line you read from input. (For demonstration, I have just updated your code to read line by line from console and not file.)
Java Code
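String pattern = "(^$|[a-z0-9\\-\\_]*)(\\s+)([A-Z]+)(\\s+)([a-z0-9\\-\\_]*)";
Pattern r = Pattern.compile(pattern);
Scanner tmp = new Scanner(System.in); // reads from the console for this demonstration
Matcher m;
// nextLine() throws NoSuchElementException at end of input instead of returning null,
// so test hasNextLine() before reading
while (tmp.hasNextLine()) {
    String line = tmp.nextLine();
    m = r.matcher(line); // new Matcher for every line
    while (m.find()) {
        System.out.println(m.group(1) + m.group(2) + m.group(3) + m.group(4) + m.group(5));
    }
}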
An if is sufficient here, though, unless a single line can contain multiple matches:
if(m.find()) {
System.out.println(m.group(1) + m.group(2)+ m.group(3)+ m.group(4)+ m.group(5));
}
EDIT
Assuming there are only three parts in your input, you can use this regex instead:
^((?:\w+)?\s+)(\w+\s+)(.*)$
Your regex does appear to be incorrect, but that's not your immediate problem. Your while loop has to iterate through all lines, not just the ones that match. If you're using a Scanner, the test condition is the hasNextLine() method. You do the matching inside the loop. You can still create the Matcher ahead of time and apply it to each line using the reset() method:
Scanner sc = new Scanner(new File("test.asm"));
Pattern p = Pattern.compile("^([a-z0-9_-]*)\\s+([A-Z]+)\\s+(.*)");
Matcher m = p.matcher("");
while (sc.hasNextLine()) {
String lineFromFile = sc.nextLine();
if (m.reset(lineFromFile).find()) {
System.out.printf("%-8s %-6s %s%n", m.group(1), m.group(2), m.group(3));
}
}
As for your regex, the last part seemed to be too restrictive--it doesn't match your sample data, anyway. I changed it to consume everything after the second whitespace gap. I also simplified the first part and got rid of the unnecessary groups.
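Put together as a self-contained program, the loop above might look like the sketch below. It reads from an in-memory string instead of test.asm so it can run anywhere, and it assumes unlabeled lines begin with whitespace; the class and method names (SicLineParser, parse) are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SicLineParser {
    // Optional lowercase label, whitespace, uppercase opcode, whitespace, rest of the line.
    private static final Pattern LINE = Pattern.compile("^([a-z0-9_-]*)\\s+([A-Z]+)\\s+(.*)");

    // Hypothetical helper: returns {label, opcode, operand} for every line that matches.
    static List<String[]> parse(String source) {
        List<String[]> fields = new ArrayList<>();
        Matcher m = LINE.matcher(""); // one Matcher, reset for each line
        try (Scanner sc = new Scanner(source)) {
            while (sc.hasNextLine()) { // visit every line, matching or not
                String line = sc.nextLine();
                if (m.reset(line).find()) {
                    fields.add(new String[] { m.group(1), m.group(2), m.group(3) });
                }
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Assumed sample input; the unlabeled line starts with spaces.
        String source = "begin START 0\n"
                + "copy  LDCH  str1, x\n"
                + "      STCH  str2, x\n"
                + "str1  BYTE  C'TEST STRING'\n";
        for (String[] f : parse(source)) {
            System.out.printf("%-8s %-6s %s%n", f[0], f[1], f[2]);
        }
    }
}
```

Lines without an operand field (an opcode with nothing after it) would not match this pattern, so a real assembler would likely need to make the last part optional.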

Write a regular expression to count sentences

I have a String :
"Hello world... I am here. Please respond."
and I would like to count the number of sentences within the String. I had an idea to use a Scanner as well as the useDelimiter method to split any String into sentences.
Scanner in = new Scanner(file);
in.useDelimiter("insert here");
I'd like to create a regular expression which can go through the String I have shown above and identify it to have two sentences. I initially tried using the delimiter:
[^?.]
It gets hung up on the ellipses.
You could use a regular expression that checks for a non end of sentence, followed by an end of sentence like:
[^?!.][?!.]
Although, as @Gabe Sechan points out, a regular expression may not be accurate when the sentence includes abbreviated words such as Dr., Rd., St., etc.
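As a rough sketch, that pattern can be applied with a Matcher and the matches counted. The class and method names (SentenceCounter, countSentenceEnds) are hypothetical. Note that for the sample string this counts three endings, because the ellipsis is treated as ending a sentence; whether it should is a judgment call the pattern cannot make.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SentenceCounter {
    // One non-terminator character immediately followed by a terminator (? ! or .).
    private static final Pattern SENTENCE_END = Pattern.compile("[^?!.][?!.]");

    // Hypothetical helper: counts places where ordinary text runs into a terminator.
    static int countSentenceEnds(String text) {
        Matcher m = SENTENCE_END.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // "world.", "here." and "respond." each match once; the extra dots of "..." do not.
        System.out.println(countSentenceEnds("Hello world... I am here. Please respond."));
        // Abbreviations are miscounted: "Dr." is taken as a sentence end.
        System.out.println(countSentenceEnds("Dr. Smith left."));
    }
}
```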
This could help:
public int getNumSentences()
{
List<String> tokens = getTokens( "[^!?.]+" );
return tokens.size();
}
You can also add the Enter key (line break) as a separator, independent of your OS, with the following line of code:
String pattern = System.getProperty("line.separator") + " ";
You can find more about matching the line break here: Java regex: newline + white space
and hence finally the method becomes:
public int getNumSentences()
{
List<String> tokens = getTokens( "[^!?.]+" + pattern + "+" );
return tokens.size();
}
Hope this helps :)
A regular expression probably isn't the right tool for this. English is not a regular language, so regular expressions get hung up a lot. For one thing, you can't even be sure that a period in the middle of the text is the end of a sentence: abbreviations (like Mr.), acronyms with periods, and initials will trip you up as well. It's not the right tool.
For your sentence "Hello world... I am here. Please respond." the code will be:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class JavaRegex {
    public static void main(String[] args) {
        int count = 0;
        String sentence = "Hello world... I am here. Please respond.";
        // Matches a period followed by any one character; a period at the
        // very end of the text has no character after it and is not matched.
        Pattern pattern = Pattern.compile("\\..");
        Matcher matcher = pattern.matcher(sentence);
        while (matcher.find()) {
            count++;
        }
        System.out.println("No. of sentence = " + count);
    }
}

Java Regex all non word characters except whitespace

This has probably been asked before, but I want to split a string at every non-word character except whitespace in Java. I do not have experience with regex in general, and the wiki doesn't really help.
I've tried it with this: "[\\W][^\\s]" but that did not help.
Edit: how the String is read out of the file
StringBuilder sb = new StringBuilder();
Scanner sc = new Scanner(getResources().openRawResource(R.raw.answers));
try
{
while (sc.hasNext())
{
sb.append(sc.next());
}
} finally
{
sc.close();
}
You can split using this regex:
String[] tok = input.split( "[\\W&&\\S]+" );
This splits on any run of characters that are both non-word and non-space, so whitespace is never used as a split point.
Check Character classes in Java Pattern reference.
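For illustration, here is a small runnable sketch of the same idea. It uses the negated class [^\w\s], which is a swapped-in, equivalent way of writing "neither a word character nor whitespace", not the intersection syntax from the answer. The class and method names are invented.

```java
import java.util.Arrays;

class PunctuationSplitter {
    // Hypothetical helper: splits on runs of characters that are
    // neither word characters (\w) nor whitespace (\s).
    static String[] splitOnPunctuation(String input) {
        return input.split("[^\\w\\s]+");
    }

    public static void main(String[] args) {
        // Spaces survive inside the tokens because they are not split points.
        System.out.println(Arrays.toString(splitOnPunctuation("Hello, world! foo-bar baz")));
    }
}
```

The tokens keep their surrounding spaces (for example " world"), so call trim() on each one if that matters.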

Getting scanner to read text file

I am trying to use a scanner to read a text file pulled with JFileChooser. The wordCount is working correctly, so I know it is reading. However, I cannot get it to search for instances of the user inputted word.
public static void main(String[] args) throws FileNotFoundException {
String input = JOptionPane.showInputDialog("Enter a word");
JFileChooser fileChooser = new JFileChooser();
fileChooser.showOpenDialog(null);
File fileSelection = fileChooser.getSelectedFile();
int wordCount = 0;
int inputCount = 0;
Scanner s = new Scanner (fileSelection);
while (s.hasNext()) {
String word = s.next();
if (word.equals(input)) {
inputCount++;
}
wordCount++;
}
You'll have to strip punctuation such as , ; . ! ? from each word. The next() method returns everything up to the next whitespace,
so it will read "hi, how are you?" as the tokens "hi,", "how", "are", "you?".
You can use the method indexOf(String) to find these characters. You can also use replaceAll(String regex, String replacement) to replace characters. You can remove each character individually, or you can use a single regex, but those are usually harder to understand.
//this replaces each listed character with an empty string, i.e. removes it
word = word.replaceAll("\\.",""); //the dot must be escaped; an unescaped . matches any character
word = word.replaceAll(",","");
word = word.replaceAll("!","");
//etc.
Read more about this method:
http://docs.oracle.com/javase/6/docs/api/java/lang/String.html#replaceAll%28java.lang.String,%20java.lang.String%29
Here's a Regex example:
//NOTE: This example will not work for you. It's just a simple example for seeing a Regex.
//Removes whitespace between a word character and . or ,
String pattern = "(\\w)(\\s+)([\\.,])";
word = word.replaceAll(pattern, "$1$3");
Source:
http://www.vogella.com/articles/JavaRegularExpressions/article.html
Here are some good regex examples that may help you:
Regex for special characters in java
Parse and remove special characters in java regex
Remove all non-"word characters" from a String in Java, leaving accented characters?
If the text the user entered differs in case, you should try using equalsIgnoreCase().
In addition to blackpanther's answer, you should also use trim() to account for whitespace, as
"abc" is not equal to "abc ".
You should take a look at matches().
equals will not help you, since next() doesn't return the file word by word,
but rather token by token, separated by whitespace (not commas, semicolons, etc.), as others mentioned.
Here is the Javadoc: String#matches(java.lang.String)
...and a little example.
input = ".*" + input + ".*";
...
boolean foundWord = word.matches(input);
. is the regex wildcard and stands for any character. .* stands for zero or more arbitrary characters. So you get a match if input occurs anywhere in word.
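The suggestions above (strip punctuation, ignore case, trim the input) can be combined in a short sketch. It scans an in-memory string rather than a file chosen with JFileChooser so that it runs on its own, and it assumes words consist of ASCII letters only; WordOccurrenceCounter and countOccurrences are hypothetical names.

```java
import java.util.Scanner;

class WordOccurrenceCounter {
    // Hypothetical helper: counts whitespace-separated tokens that equal the
    // target word once punctuation is stripped and case is ignored.
    static int countOccurrences(String text, String target) {
        String wanted = target.trim(); // "abc " should still match "abc"
        int count = 0;
        try (Scanner s = new Scanner(text)) {
            while (s.hasNext()) {
                // Drop everything that is not an ASCII letter, e.g. "you?" becomes "you".
                String word = s.next().replaceAll("[^a-zA-Z]", "");
                if (word.equalsIgnoreCase(wanted)) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countOccurrences("hi, how are you? Hi!", "hi "));
    }
}
```

Unlike the matches() approach with ".*", this compares whole words, so searching for "hi" does not count "high".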

Recognizing empty lines in a text?

I'm taking input from a separate file which currently has one paragraph. I'm storing every word in the paragraph in a list and then iterating over each of them using this:
for (String word: words)
However, this loop goes over each WORD. If I have two paragraphs in my input file which are separated by an empty line, how do I recognize that empty line inside this for loop? My thinking is that iterating over words is obviously different from going over lines, so I'm not sure.
An empty line follows the pattern:
\n\n
\r\n\r\n
\n -whitespace- \n
etc
A word follows the pattern
-whitespace-nonwhitespace-whitespace-
These are very different patterns, so looping over something using the definition of a word will never work.
You can use a Java Scanner to read a file line by line.
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
public class LineScanner {
    public List<String> eliminateEmptyLines(String input) {
        Scanner scanner = new Scanner(input);
        ArrayList<String> output = new ArrayList<>();
        while (scanner.hasNextLine()) {
            String line = scanner.nextLine();
            // true for lines that are empty or contain only whitespace
            boolean isEmpty = line.matches("^\\s*$");
            if (!isEmpty) {
                output.add(line);
            }
        }
        return output;
    }
}
Here's how the regex in String.matches works: How to check if a line is blank using regex
Here's the javadoc on Scanner: http://docs.oracle.com/javase/1.5.0/docs/api/java/util/Scanner.html
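To keep the paragraph structure rather than just dropping the empty lines, the same line-by-line scan can start a new word list whenever it meets a blank line. This is only a sketch of that idea; ParagraphReader and readParagraphs are made-up names, and the input is passed as a string rather than read from the asker's file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Scanner;

class ParagraphReader {
    // Hypothetical helper: returns one list of words per paragraph, where
    // paragraphs are separated by one or more blank (or whitespace-only) lines.
    static List<List<String>> readParagraphs(String text) {
        List<List<String>> paragraphs = new ArrayList<>();
        List<String> current = new ArrayList<>();
        try (Scanner sc = new Scanner(text)) {
            while (sc.hasNextLine()) {
                String line = sc.nextLine().trim();
                if (line.isEmpty()) {
                    // Blank line: close the current paragraph if it has any words.
                    if (!current.isEmpty()) {
                        paragraphs.add(current);
                        current = new ArrayList<>();
                    }
                } else {
                    current.addAll(Arrays.asList(line.split("\\s+")));
                }
            }
        }
        if (!current.isEmpty()) {
            paragraphs.add(current); // last paragraph has no blank line after it
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        System.out.println(readParagraphs("one two\nthree\n\nfour five\n"));
    }
}
```

The existing for (String word : words) loop can then run once per paragraph, with the outer loop marking where each empty line was.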
