Why are my character and word counts off?

Given the following string:
String text = "The woods are\nlovely,\t\tdark and deep.";
I want all whitespace treated as a single character. So for instance, the \n is 1 char. The \t\t should also be 1 char. With that logic, I count 36 characters and 7 words. But when I run this through the following code:
String text = "The woods are\nlovely,\t\tdark and deep.";
int numNewCharacters = 0;
for (int i = 0; i < text.length(); i++) {
    if (!Character.isWhitespace(text.charAt(i))) {
        numNewCharacters++;
    }
}
int numNewWords = text.split("\\s").length;
// Prints "30"
System.out.println("Chars:" + numNewCharacters);
// Prints "8"
System.out.println("Words:" + numNewWords);
It's telling me that there are 30 characters and 8 words. Any ideas as to why? Thanks in advance.

You are matching on individual whitespaces. Instead you could match on one or more:
text.split("\\s+")

The first loop counts only non-whitespace characters, so spaces, tabs and newlines are not counted at all, and 30 is the right answer for that loop. As for the second, split("\\s") treats each whitespace character as its own delimiter, so there is an empty string between the two tabs, and that empty string is counted as a word.
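To make the extra "word" visible, here is a small sketch that prints both arrays. The class and method names (SplitDemo, splitSingle, splitRuns) are made up for illustration.

```java
import java.util.Arrays;

public class SplitDemo {
    // Splits on every single whitespace character; adjacent whitespace yields empty tokens.
    static String[] splitSingle(String text) {
        return text.split("\\s");
    }

    // Splits on runs of one or more whitespace characters.
    static String[] splitRuns(String text) {
        return text.split("\\s+");
    }

    public static void main(String[] args) {
        String text = "The woods are\nlovely,\t\tdark and deep.";
        // The empty token between the two tabs shows up as ", ," in the first line.
        System.out.println(Arrays.toString(splitSingle(text))); // 8 elements
        System.out.println(Arrays.toString(splitRuns(text)));   // 7 elements
    }
}
```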

Reimueus has already solved your word count problem:
text.split("\\s+")
And your character count is correct: Character.isWhitespace treats newlines (\n) and tabs (\t) as whitespace, so the loop skips them. If you want them counted, you can implement your own isWhitespace function.

Here is the complete solution to counting words and characters:
System.out.println("Characters: " + text.replaceAll("\\s+", " ").length());
Matcher m = Pattern.compile("[^\\s]+", Pattern.MULTILINE).matcher(text);
int wordCount = 0;
while (m.find()) {
wordCount ++;
}
System.out.println("Words: "+ wordCount);
The character count works by replacing each group of whitespace with a single space and taking the length of the resulting string.
For the word count we create a pattern that matches any run of characters containing no whitespace. You could use the \\w+ pattern here, but it matches only alphanumeric characters and underscores, so a token such as "don't" would count as two words. Note that Pattern.MULTILINE only changes how ^ and $ match, so it has no effect on this pattern and can be omitted.
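If you would rather stay with the Character.isWhitespace loop from the question, the same two counts can be sketched without regex. This is only one possible approach, and the names CollapsedCount, countChars and countWords are hypothetical.

```java
public class CollapsedCount {
    // Counts characters, treating each run of whitespace as a single character.
    static int countChars(String text) {
        int count = 0;
        boolean previousWasWhitespace = false;
        for (int i = 0; i < text.length(); i++) {
            boolean ws = Character.isWhitespace(text.charAt(i));
            // Count every non-whitespace char, and only the first char of a whitespace run.
            if (!ws || !previousWasWhitespace) {
                count++;
            }
            previousWasWhitespace = ws;
        }
        return count;
    }

    // Counts words as maximal runs of non-whitespace characters.
    static int countWords(String text) {
        int count = 0;
        boolean inWord = false;
        for (int i = 0; i < text.length(); i++) {
            boolean ws = Character.isWhitespace(text.charAt(i));
            if (!ws && !inWord) {
                count++; // a new word starts here
            }
            inWord = !ws;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "The woods are\nlovely,\t\tdark and deep.";
        System.out.println("Chars: " + countChars(text)); // Chars: 36
        System.out.println("Words: " + countWords(text)); // Words: 7
    }
}
```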

Related

Is there a regex to the String.replaceAll method that only keeps letters and white spaces

I have made a program that counts the frequency of a word in a very long string. My problem is that the program is counting for example "*it" (consider * a quotation mark) and "it" as different words and therefore putting them in different categories.
I tried to replace all the punctuation marks I know of with the following code:
text = text.replace("\n", " ");
text = text.replaceAll("\\p{Punct}", " ");
text = text.replace("\"", "");
text = text.replace("–", "");
text = text.replace("\t", "");
Unfortunately, the code didn't work, and I think it is because Unicode has many different quotation marks that I can't tell apart. Is there a way to remove all Unicode characters except letters and whitespace with the String.replaceAll method, or do I have to make a char array and continue from there?
Thanks a lot, any help would be appreciated.

I think this might do it
text = text.replaceAll("[^a-zA-Z0-9 ]", "");
which removes every character that is not alphanumeric or a space.
EDIT:
As suggested by @npinti, \\p{L} also keeps letters outside the ASCII range:
text = text.replaceAll("[^\\p{L}0-9 ]", "");

This will remove all non-letter/digit characters and squish the spaces so you don't get multiple consecutive spaces:
text = text.replaceAll("[^\\p{L}\\d]+", " ");

This removes every character that is not a letter or whitespace.
text = text.replaceAll("[^\\sa-zA-Z]", "");
Legend:
^ - negates the character class, so the characters listed after it are kept
\\s - all whitespace (\n, \t, ' ')
a-zA-Z - all ASCII letters
Example:
String in="12ASxA sdr5%";
System.out.println(in.replaceAll("[^\\sa-zA-Z]", "")); // ASxA sdr
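Putting the answers together for the original word-frequency problem, a rough sketch could look like this. It assumes that case should be ignored and that punctuation should act as a word separator; the class name WordFrequency and the sample sentence are invented for the example.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class WordFrequency {
    // Counts how often each word occurs, ignoring case and any non-letter characters.
    static Map<String, Integer> count(String text) {
        // Replace each run of characters that are neither letters nor whitespace with a space.
        String cleaned = text.replaceAll("[^\\p{L}\\s]+", " ").toLowerCase(Locale.ROOT);
        Map<String, Integer> freq = new TreeMap<>();
        for (String word : cleaned.trim().split("\\s+")) {
            if (!word.isEmpty()) { // an empty input still yields one empty token
                freq.merge(word, 1, Integer::sum);
            }
        }
        return freq;
    }

    public static void main(String[] args) {
        // \u201C and \u201D are curly quotation marks, \u2013 is an en dash.
        String text = "\u201CIt\u201D was late, but it was \u2013 it seemed \u2013 fine.";
        System.out.println(count(text)); // {but=1, fine=1, it=3, late=1, seemed=1, was=2}
    }
}
```

Because \\p{L} matches letters from any script, the curly quotes are stripped without having to list each Unicode quotation mark by hand.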

.split() and [\\W] creates an additional empty string?

I'm creating a small program to split a string into tokens (consecutive English alphabet characters), then output the number of tokens as well as the actual tokens. The problem is that an extra empty string element is created wherever there is a comma followed by a space.
I've researched into regular expressions and understand that \W is anything that is not a word character.
String str = sc.nextLine();
// creating an array of tokens
String tokens[] = str.split("[\\W]");
int len = tokens.length;
System.out.println(len);
for (int i = 0; i < len; i++) {
System.out.println(tokens[i]);
}
Input:
Hello, World.
Expected output:
2
Hello
World
Actual output:
3
Hello
World
Note: this is my first stack overflow post, if I've done anything wrong please let me know, thanks

Try str.split("\\W+").
It means one or more non-word characters.
\W matches only one character, so it splits at the comma and then splits again at the space.
That's why you get an extra empty string.
\W+ matches ", " as a single delimiter, so it splits only once and you get back only the tokens. (It works for any number of tokens, not just two, so "hello, world, again" gives you [hello, world, again].)
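A quick way to see the difference is to print both arrays side by side. The class and method names below are just placeholders for this sketch.

```java
import java.util.Arrays;

public class NonWordSplit {
    // One delimiter per non-word character: ", " produces an empty token in between.
    static String[] splitPerChar(String str) {
        return str.split("\\W");
    }

    // One delimiter per run of non-word characters.
    static String[] splitPerRun(String str) {
        return str.split("\\W+");
    }

    public static void main(String[] args) {
        String str = "Hello, World.";
        System.out.println(Arrays.toString(splitPerChar(str))); // [Hello, , World]
        System.out.println(Arrays.toString(splitPerRun(str)));  // [Hello, World]
    }
}
```

The empty string after the final "." does not appear in either array because split drops trailing empty strings by default.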

If you use .split("\\W") you will get empty items if:
non-word char(s) appear(s) at the start of the string
non-word chars appear in succession: \W matches one non-word char and splits the string, then the next non-word char splits it again, producing empty strings.
There are two ways out.
Either remove all non-word chars at the start and then split with \W+:
String tokens[] = str.replaceFirst("^\\W+", "").split("\\W+");
Or, match the chunks of word chars with \w+ pattern:
Pattern p = Pattern.compile("\\w+");
Matcher m = p.matcher(" abc=-=123");
List<String> tokens = new ArrayList<>();
while(m.find()) {
tokens.add(m.group());
}
System.out.println(tokens);

Try this
Scanner inputter = new Scanner(System.in);
System.out.print("Please enter your thoughts : ");
final String words = inputter.nextLine();
final String[] tokens = words.split("\\W+");
Arrays.stream(tokens).forEach(System.out::println);

Java: how to represent characters not containing white spaces as words?

If I have a file containing random characters e.g:
sdo8kd oko ala la654
"sdo8kd", "oko", "ala" and "la654" would be considered words.
How can I represent a word as a run of non-whitespace characters, specifically using the method Character.isWhitespace(c), where c is the character being checked?

You can use split(regex) from String and put into an array, after that do what you want with it.
String sentence = "sdo8kd oko ala la654";
String[] words = sentence.split("\\s+");
for(String word : words){
System.out.println("'" + word + "'"); //'sdo8kd', 'oko', 'ala', 'la654'
}
System.out.println(words.length); //4

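Try the regex "[a-zA-Z0-9]+":
boolean isAlphaNumeric = s.matches("[a-zA-Z0-9]+");
a-zA-Z matches all Latin letters (lower and upper case)
0-9 matches all digits
[a-zA-Z0-9]+ matches one or more of the characters inside the brackets. Because matches() tests the whole string, this is true only when s consists entirely of those characters.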

If you want to split the words no matter how many spaces separate them, you can use
"sdo8kd oko ala la654".split(" +");
This returns a String[] with the values "sdo8kd", "oko", "ala" and "la654".
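Since the question asks specifically about Character.isWhitespace(c), here is a sketch that builds the words character by character instead of using split. The class name WhitespaceTokenizer and the method words are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class WhitespaceTokenizer {
    // Collects maximal runs of non-whitespace characters as words.
    static List<String> words(String input) {
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                // Whitespace ends the current word, if there is one.
                if (current.length() > 0) {
                    result.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        // Do not forget the last word when the input does not end in whitespace.
        if (current.length() > 0) {
            result.add(current.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("sdo8kd oko ala la654")); // [sdo8kd, oko, ala, la654]
    }
}
```

When reading from a file, the same loop can be applied to each line in turn.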

RegEx to match lines consisting of whitespace only

I've gone through multiple expressions for hours but can't quite get one to match what I need exactly.
If I have the following input:
Hi
This
Is
A
Test
I am trimming it to:
Hi
This
Is
A
Test
All is good when the blank lines have length 0 (an empty String). However, some inputs contain a few spaces (" ") within those blank lines, so I would like to check whether a string consists of zero or more whitespace characters and nothing else (simply a blank line).
ArrayList<Integer> listOfBlanks = new ArrayList<>();
for(int i = 0; i < arrayList.size(); i++) {
if(arrayList.get(i).isEmpty()) {
if(arrayList.get(i+1).isEmpty())
listOfBlanks.add(i+1);
}
}
String#isEmpty is only good when there are no whitespaces

Use string.trim().isEmpty() to check for length 0 after trimming leading and trailing whitespaces.

The regular expression to do so would be s.matches("\\s*"). \\s* matches zero or more whitespace characters.
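Applied to the list from the question, the check could be used like this. The sketch assumes the goal is to keep at most one blank line between non-blank lines, which the question does not state outright; BlankLines, isBlank and collapseBlankRuns are made-up names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BlankLines {
    // True when the line is empty or contains only whitespace.
    static boolean isBlank(String line) {
        return line.trim().isEmpty();
    }

    // Assumed goal: drop every blank line that directly follows another blank line.
    static List<String> collapseBlankRuns(List<String> lines) {
        List<String> out = new ArrayList<>();
        boolean previousBlank = false;
        for (String line : lines) {
            boolean blank = isBlank(line);
            if (!(blank && previousBlank)) {
                out.add(line);
            }
            previousBlank = blank;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("Hi", "", "   ", "This", " \t", "Is");
        System.out.println(collapseBlankRuns(lines).size()); // 5
    }
}
```

Building a new list avoids the index arithmetic of the original loop, where arrayList.get(i+1) would run past the end on the last element.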

How can I find repeated characters with a regex in Java?

Can anyone give me a Java regex to identify repeated characters in a string? I am only looking for characters that are repeated immediately and they can be letters or digits.
Example:
abccde <- looking for this (immediately repeating c's)
abcdce <- not this (c's separated by another character)

Try "(\\w)\\1+"
The \\w matches any word character (letter, digit, or underscore) and the \\1+ matches whatever was in the first set of parentheses, one or more times. So you wind up matching any occurrence of a word character, followed immediately by one or more of the same word character again.
(Note that I gave the regex as a Java string, i.e. with the backslashes already doubled for you)

String stringToMatch = "abccdef";
Pattern p = Pattern.compile("(\\w)\\1+");
Matcher m = p.matcher(stringToMatch);
if (m.find()) {
    System.out.println("Duplicate character " + m.group(1));
}
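The snippet above reports only the first repeat. To collect every run, the same pattern can be used in a while loop, roughly like this (RepeatedRuns and find are hypothetical names):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RepeatedRuns {
    // A word character followed by one or more copies of itself.
    private static final Pattern REPEAT = Pattern.compile("(\\w)\\1+");

    // Returns every run of an immediately repeated character, in order of appearance.
    static List<String> find(String input) {
        List<String> runs = new ArrayList<>();
        Matcher m = REPEAT.matcher(input);
        while (m.find()) {
            runs.add(m.group()); // the whole run, e.g. "eee"
        }
        return runs;
    }

    public static void main(String[] args) {
        System.out.println(find("aabccdeee")); // [aa, cc, eee]
        System.out.println(find("abcdce"));    // []
    }
}
```

Note that \\w also matches the underscore; if only letters and digits should count, [a-zA-Z0-9] could be used in its place.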

Regular Expressions are expensive. You would probably be better off just storing the last character and checking to see if the next one is the same.
Something along the lines of:
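String s = "abccde";
char previous = s.charAt(0);
for (int i = 1; i < s.length(); i++) {
    char current = s.charAt(i);
    if (current == previous) {
        // current repeats the character immediately before it
        System.out.println("Duplicate character " + current);
    }
    previous = current;
}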
