Java Regex all non word characters except whitespace


This has probably been asked before, but I want to split a string at every non-word character except whitespace in Java. I have no experience with regex in general, and the wiki doesn't really help.
I've tried "[\\W][^\\s]", but that did not help.
Edit: this is how the String is read from the file:
StringBuilder sb = new StringBuilder();
Scanner sc = new Scanner(getResources().openRawResource(R.raw.answers));
try
{
while (sc.hasNext())
{
sb.append(sc.next());
}
} finally
{
sc.close();
}

You can split using this regex:
String[] tok = input.split( "[\\W&&\\S]+" );
This splits on runs of characters that are both non-word and non-whitespace, so whitespace is never treated as a delimiter.
See the character classes section of the Java Pattern reference.
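As a rough illustration of the split above, here is a minimal sketch. The class name, the helper method and the sample input are made up for this example; only the regex comes from the answer.

```java
import java.util.Arrays;

public class SplitKeepSpaces {
    // Splits on runs of characters that are neither word characters nor whitespace.
    // The helper name is hypothetical; the regex is the one from the answer.
    static String[] splitKeepingSpaces(String input) {
        return input.split("[\\W&&\\S]+");
    }

    public static void main(String[] args) {
        String[] tok = splitKeepingSpaces("one two,three;four five!");
        // Spaces stay inside the tokens; the trailing "!" leaves no empty token
        // because split() drops trailing empty strings.
        System.out.println(Arrays.toString(tok)); // [one two, three, four five]
    }
}
```

Note that the Scanner loop in the question uses sc.next(), which already discards whitespace between tokens, so the spaces would have to be preserved while reading for this split to keep them.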

Related

.split() and [\\W] creates an additional empty string?

I'm creating a small program to split a string into tokens (consecutive English alphabet characters), then output the number of tokens as well as the actual tokens. The problem is that an extra empty string element is created wherever there is a comma followed by a space.
I've researched into regular expressions and understand that \W is anything that is not a word character.
String str = sc.nextLine();
// creating an array of tokens
String tokens[] = str.split("[\\W]");
int len = tokens.length;
System.out.println(len);
for (int i = 0; i < len; i++) {
System.out.println(tokens[i]);
}
Input:
Hello, World.
Expected output:
2
Hello
World
Actual output:
3
Hello
(empty line)
World
Note: this is my first stack overflow post, if I've done anything wrong please let me know, thanks

Try str.split("\\W+").
It means one or more non-word characters.
\W matches only one character, so the split breaks at the comma and then breaks again at the space.
That's why it gives you back an extra empty string.
\W+ matches ", " as a single delimiter, so the string is split only once there and you get back only the tokens. (It works for any number of tokens, not just two, so "hello, world, again" gives you [hello, world, again].)
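To make the difference concrete, here is a small sketch that runs both patterns on the input from the question. The class and method names are invented for the example.

```java
import java.util.Arrays;

public class NonWordSplitDemo {
    // One delimiter per non-word character.
    static String[] splitSingle(String s) {
        return s.split("\\W");
    }

    // One delimiter per run of non-word characters.
    static String[] splitRuns(String s) {
        return s.split("\\W+");
    }

    public static void main(String[] args) {
        String str = "Hello, World.";
        // The empty string sits between the comma and the space.
        System.out.println(Arrays.toString(splitSingle(str))); // [Hello, , World]
        System.out.println(Arrays.toString(splitRuns(str)));   // [Hello, World]
    }
}
```

In both cases the final "." produces no extra element, because split() removes trailing empty strings.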

If you use .split("\\W") you will get empty items if:
non-word chars appear at the start of the string
non-word chars appear in succession, one after another: \W matches one non-word char and breaks the string, and then the next non-word char breaks it again, producing empty strings.
There are two ways out.
Either remove all non-word chars at the start and then split with \W+:
String tokens[] = str.replaceFirst("^\\W+", "").split("\\W+");
Or, match the chunks of word chars with \w+ pattern:
Pattern p = Pattern.compile("\\w+");
Matcher m = p.matcher(" abc=-=123");
List<String> tokens = new ArrayList<>();
while(m.find()) {
tokens.add(m.group());
}
System.out.println(tokens);
See the online demo.
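The following sketch puts the leading-delimiter case and both workarounds side by side, using the same sample string as the Matcher example. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LeadingDelimiterDemo {
    // Plain split: a delimiter at index 0 yields a leading empty string.
    static String[] plainSplit(String s) {
        return s.split("\\W+");
    }

    // Workaround 1: strip leading non-word chars, then split.
    static String[] stripThenSplit(String s) {
        return s.replaceFirst("^\\W+", "").split("\\W+");
    }

    // Workaround 2: collect the runs of word chars instead of splitting.
    static List<String> matchWords(String s) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("\\w+").matcher(s);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String s = " abc=-=123";
        System.out.println(Arrays.toString(plainSplit(s)));     // [, abc, 123]
        System.out.println(Arrays.toString(stripThenSplit(s))); // [abc, 123]
        System.out.println(matchWords(s));                      // [abc, 123]
    }
}
```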

Try this
Scanner inputter = new Scanner(System.in);
System.out.print("Please enter your thoughts : ");
final String words = inputter.nextLine();
final String[] tokens = words.split("\\W+");
Arrays.stream(tokens).forEach(System.out::println);

Split a sentence ignoring characters in Java

I want to write a program that reads one line of input text and breaks it up into words.
The words should be output one per line. A word is defined to be a sequence of letters.
Any characters in the input that are not letters should be discarded.
For example, if the user inputs the line:
He said, "That’s not a good idea."
then the output of the program should be:
He
said
That
‘s
not
a
good
idea

Simply use a regex
Pattern pattern = Pattern.compile("[\\w'’]+");
Matcher matcher = pattern.matcher("He said, \"That’s not a good idea.\"");
while (matcher.find())
System.out.println(matcher.group());

Try this:
import java.util.Scanner;

public class Main {
    public static void main(String[] args) {
        Scanner stdIn = new Scanner(System.in); // user input
        String line = stdIn.nextLine(); // read line
        String[] words = line.split("[^a-zA-Z]+"); // split by all non-alphabetic characters (a regex)
        for (String word : words) { // iterate through the words
            System.out.println(word); // print word with a newline
        }
    }
}
It won't include the apostrophe in the token 's, but I don't know why you included that. It's not a letter, after all, and I read your first bold sentence. I hope the comments help explain how it works. Note that split() drops trailing empty strings, but if the line starts with a non-letter the first element is an empty string and prints as an empty line; that should be easy for you to fix if you really need to.
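If the empty first element matters, one option is to match the letter runs directly instead of splitting. This is only a sketch under the question's definition of a word (a sequence of letters, taken here as ASCII letters); the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LetterWords {
    private static final Pattern LETTERS = Pattern.compile("[a-zA-Z]+");

    // Returns every maximal run of ASCII letters in the line, in order.
    static List<String> words(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = LETTERS.matcher(line);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // \u2019 is the typographic apostrophe used in the question's input.
        String line = "He said, \"That\u2019s not a good idea.\"";
        for (String word : words(line)) {
            System.out.println(word); // He, said, That, s, not, a, good, idea (one per line)
        }
    }
}
```

Because nothing is split, a line that begins with a quote or other punctuation produces no empty element.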

Getting scanner to read text file

I am trying to use a scanner to read a text file pulled with JFileChooser. The wordCount is working correctly, so I know it is reading. However, I cannot get it to search for instances of the user inputted word.
public static void main(String[] args) throws FileNotFoundException {
String input = JOptionPane.showInputDialog("Enter a word");
JFileChooser fileChooser = new JFileChooser();
fileChooser.showOpenDialog(null);
File fileSelection = fileChooser.getSelectedFile();
int wordCount = 0;
int inputCount = 0;
Scanner s = new Scanner (fileSelection);
while (s.hasNext()) {
String word = s.next();
if (word.equals(input)) {
inputCount++;
}
wordCount++;
}

You'll have to look for punctuation such as
, ; . ! ? etc.
in each word. The next() method grabs everything up to the next whitespace.
It will read "hi, how are you?" as the tokens "hi,", "how", "are", "you?".
You can use the method indexOf(String) to find these characters. You can also use replaceAll(String regex, String replacement) to replace characters. You can remove each character individually or you can use a more general Regex, but those are usually more complex to understand.
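//this removes each of these characters by replacing it with an empty string
//the dot must be escaped: an unescaped "." is the regex wildcard and would delete every character
word = word.replaceAll("\\.","");
word = word.replaceAll(",","");
word = word.replaceAll("!","");
//etc.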
Read more about this method:
http://docs.oracle.com/javase/6/docs/api/java/lang/String.html#replaceAll%28java.lang.String,%20java.lang.String%29
Here's a Regex example:
//NOTE: This example will not work for you. It's just a simple example for seeing a Regex.
//Removes whitespace between a word character and . or ,
String pattern = "(\\w)(\\s+)([\\.,])";
word = word.replaceAll(pattern, "$1$3");
Source:
http://www.vogella.com/articles/JavaRegularExpressions/article.html
Here is a good Regex example that may help you:
Regex for special characters in java
Parse and remove special characters in java regex
Remove all non-"word characters" from a String in Java, leaving accented characters?
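As a sketch of how the stripping could fit into the counting loop from the question: the version below removes everything that is not a letter or digit from each token before comparing. The class and method names are invented, the text is passed as a String instead of a File to keep the example self-contained, and the case-insensitive comparison is an assumption about what is wanted.

```java
import java.util.Scanner;

public class WordCounter {
    // Counts tokens that equal target once punctuation has been stripped.
    static int countOccurrences(String text, String target) {
        int count = 0;
        try (Scanner s = new Scanner(text)) {
            while (s.hasNext()) {
                // Keep only letters and decimal digits, so "you?" becomes "you".
                String word = s.next().replaceAll("[^\\p{L}\\p{Nd}]", "");
                if (word.equalsIgnoreCase(target)) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countOccurrences("hi, how are you? Hi!", "hi")); // 2
    }
}
```

This also removes apostrophes and hyphens inside words ("don't" becomes "dont"), which may or may not be acceptable for a given text.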

If the user's input may differ in case from the text, you should try using equalsIgnoreCase().

In addition to blackpanther's answer, you should also use trim() to account for whitespace, as
"abc" is not equal to "abc ".

You should take a look at matches().
equals will not help you, since next() doesn't return the file word by word,
but token by token, separated by whitespace only (not by commas, semicolons, etc.), as others mentioned.
Here is the Javadoc: String#matches(java.lang.String)
...and a little example.
input = ".*" + input + ".*";
...
boolean foundWord = word.matches(input);
. is the regex wildcard and stands for any character. .* stands for zero or more arbitrary characters. So you get a match if input occurs anywhere in word.
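One caveat with building a regex from user input is that characters such as . or + in the input are interpreted as regex syntax. A possible guard, sketched below with made-up class and method names, is to wrap the input in Pattern.quote(); for a plain substring test, String.contains() gives the same answer without any regex.

```java
import java.util.regex.Pattern;

public class ContainsDemo {
    // Regex-based check; Pattern.quote makes the input match literally.
    static boolean containsViaMatches(String word, String input) {
        return word.matches(".*" + Pattern.quote(input) + ".*");
    }

    // Equivalent literal substring check without a regex.
    static boolean containsDirect(String word, String input) {
        return word.contains(input);
    }

    public static void main(String[] args) {
        System.out.println(containsViaMatches("you?", "you")); // true
        System.out.println(containsViaMatches("axb", "a.b"));  // false, the dot is literal
        System.out.println("axb".matches(".*a.b.*"));          // true, unquoted dot is a wildcard
        System.out.println(containsDirect("there", "the"));    // true
    }
}
```

As the last line shows, any substring check also counts "there" as a hit for "the", so it answers a slightly different question than whole-word equality.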

Convert a string to an array of strings

If I have:
Scanner input = new Scanner(System.in);
System.out.println("Enter an infixed expression:");
String expression = input.nextLine();
String[] tokens;
How do I scan the infix expression one token at a time, from left to right, ignoring spaces, and put it into an array of strings? Here a token is defined as an operand, an operator, or a parenthesis symbol.
Example: "3 + (9-2)" ==> tokens = [3][+][(][9][-][2][)]

String test = "13 + (9-2)";
List<String> allMatches = new ArrayList<String>();
Matcher m = Pattern.compile("\\d+|\\(|\\)|\\+|\\*|-|/")
.matcher(test);
while (m.find()) {
allMatches.add(m.group());
}
Can someone test this please?
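Here is a self-contained version of the snippet above that can be run as a check. The regex is unchanged; the class name, the method name and the conversion to String[] (which the question asks for) are additions for this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InfixTokenizer {
    // A token is a run of digits, a parenthesis, or one of + * - /.
    private static final Pattern TOKEN = Pattern.compile("\\d+|\\(|\\)|\\+|\\*|-|/");

    static String[] tokenize(String expression) {
        List<String> allMatches = new ArrayList<>();
        Matcher m = TOKEN.matcher(expression);
        while (m.find()) {
            allMatches.add(m.group());
        }
        return allMatches.toArray(new String[0]);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokenize("13 + (9-2)"))); // [13, +, (, 9, -, 2, )]
    }
}
```

Anything the pattern does not recognise, such as spaces or letters, is silently skipped, so invalid input is not reported.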

I think it would be easiest to read the line into one string, and then split on spaces. There is a handy String method, split, that does this for you. Note that this only separates tokens that have a space between them.
String[] tokens = expression.split(" ");

It's probably overkill for your example, but in case it gets more complex, take a look at JavaCC, the Java Compiler Compiler. JavaCC allows you to create a parser in Java based on a grammar definition.
Be aware that it is not an easy tool to get started with. However, the grammar definition will be much easier to read than the corresponding regular expressions.

if tokens[] must be String you can use this
String ex="3 + (9-2)";
String tokens[];
StringTokenizer tok=new StringTokenizer(ex);
String line="";
while(tok.hasMoreTokens())line+=tok.nextToken();
tokens=new String[line.length()];
for(int i=1;i<line.length()+1;i++)tokens[i-1]=line.substring(i-1,i);
tokens can be a charArray so:
String ex="3 + (9-2)";
char tokens[];
StringTokenizer tok=new StringTokenizer(ex);
String line="";
while(tok.hasMoreTokens())line+=tok.nextToken();
tokens=line.toCharArray();

This (IMHO elegant) single line of code works (tested):
String[] tokens = input.split("(?<=[^ ])(?<!\\B) *");
This regex also caters for input containing multi-digit numbers (e.g. 123), which would be split into separate characters were it not for the negative look-behind for a non-word boundary, (?<!\\B).
The first look-behind, (?<=[^ ]), prevents an initial blank string at the start of the input.
The final part of the regex, " *", ensures that spaces are consumed.

reading line in bufferedReader

From the javadoc
public String readLine()
throws IOException
Read a line of text. A line is considered to be terminated by any one of a line feed ('\n'), a carriage return ('\r'), or a carriage return followed immediately by a linefeed.
I have following kind of text :
Now the earth was formless and empty. Darkness was on the surface
of the deep. God's Spirit was hovering over the surface
of the waters.
I am reading lines as:
while(buffer.readLine() != null){
}
But the problem is that it treats everything up to a newline as one line. I would like a line to end where the text ends with a period (.). How would I do it?

You can use a Scanner and set your own delimiter using useDelimiter(Pattern).
Note that the delimiter is a regex, so you will need to provide the regex \. (written "\\." in Java source), because an unescaped . has a special meaning in regex.

You can read a character at a time, and copy the data to a StringBuilder
Reader reader = ...;
StringBuilder sb = new StringBuilder();
int ch;
while((ch = reader.read()) >= 0) {
if(ch == '.') break;
sb.append((char) ch);
}
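The loop above stops at the first period. A possible extension that keeps reading and collects every sentence is sketched below; the class and method names are made up, a StringReader stands in for the real file reader, and collapsing line breaks into single spaces is an assumption about the desired output.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class SentenceReader {
    // Reads character by character and closes a sentence at every '.'.
    static List<String> readSentences(Reader reader) {
        List<String> sentences = new ArrayList<>();
        StringBuilder sb = new StringBuilder();
        try {
            int ch;
            while ((ch = reader.read()) >= 0) {
                sb.append((char) ch);
                if (ch == '.') {
                    addIfNotBlank(sentences, sb);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        addIfNotBlank(sentences, sb); // text after the last period, if any
        return sentences;
    }

    // Collapses whitespace (including line breaks), stores the text, resets the buffer.
    private static void addIfNotBlank(List<String> out, StringBuilder sb) {
        String s = sb.toString().replaceAll("\\s+", " ").trim();
        if (!s.isEmpty()) {
            out.add(s);
        }
        sb.setLength(0);
    }

    public static void main(String[] args) {
        String text = "Now the earth was formless and empty. Darkness was on the surface\n"
                + "of the deep. God's Spirit was hovering over the surface\nof the waters.\n";
        for (String sentence : readSentences(new StringReader(text))) {
            System.out.println(sentence);
        }
    }
}
```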

Use a java.util.Scanner instead of a buffered reader, and set the delimiter to "\\." with Scanner.useDelimiter().
(but be aware that the delimiter is consumed, so you'll have to add it again!)
or read the raw string and split it on each .
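A minimal sketch of the Scanner approach follows. The class and method names are hypothetical, the text is passed as a String to keep the example self-contained, and whitespace is normalised on the assumption that line breaks inside a sentence should become single spaces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class ScannerSentences {
    // Uses "." as the token delimiter and re-appends it to each sentence.
    static List<String> sentences(String text) {
        List<String> out = new ArrayList<>();
        try (Scanner sc = new Scanner(text).useDelimiter("\\.")) {
            while (sc.hasNext()) {
                String s = sc.next().replaceAll("\\s+", " ").trim();
                if (!s.isEmpty()) {
                    out.add(s + "."); // the delimiter was consumed, so add it back
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Now the earth was formless and empty. Darkness was on the surface\n"
                + "of the deep. God's Spirit was hovering over the surface\nof the waters.\n";
        sentences(text).forEach(System.out::println);
    }
}
```

One limitation of this sketch is that trailing text without a final period also gets a period appended.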

You could split the whole text by every .:
String text = "Your test.";
String[] lines = text.split("\\.");
After you split the text you get an array of lines. You could also use a regex if you want more control, e.g. to split the text also by : or ;. Just google it.
PS: You may have to replace the line breaks with spaces first, so that words on adjacent lines are not glued together, with something like:
text = text.replaceAll("\\r?\\n", " ");
