Count number of words in the given string [duplicate] - java

What regex pattern would need I to pass to java.lang.String.split() to split a String into an Array of substrings using all whitespace characters (' ', '\t', '\n', etc.) as delimiters?

Something in the lines of
myString.split("\\s+");
This groups all white spaces as a delimiter.
So if I have the string:
"Hello[space character][tab character]World"
This should yield the strings "Hello" and "World" and omit the empty space between the [space] and the [tab].
As VonC pointed out, the backslash should be escaped, because Java would first try to escape the string to a special character, and send that to be parsed. What you want, is the literal "\s", which means, you need to pass "\\s". It can get a bit confusing.
The \\s is equivalent to [ \\t\\n\\x0B\\f\\r].

In most regex dialects there are a set of convenient character summaries you can use for this kind of thing - these are good ones to remember:
\w - Matches any word character.
\W - Matches any nonword character.
\s - Matches any white-space character.
\S - Matches anything but white-space characters.
\d - Matches any digit.
\D - Matches anything except digits.
A search for "Regex Cheatsheets" should reward you with a whole lot of useful summaries.

To get this working in Javascript, I had to do the following:
myString.split(/\s+/g)

"\\s+" should do the trick

Also you may have a UniCode non-breaking space xA0...
String[] elements = s.split("[\\s\\xA0]+"); //include uniCode non-breaking

String string = "Ram is going to school";
String[] arrayOfString = string.split("\\s+");

Apache Commons Lang has a method to split a string with whitespace characters as delimiters:
StringUtils.split("abc def")
http://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/StringUtils.html#split(java.lang.String)
This might be easier to use than a regex pattern.

All you need is to split using the one of the special character of Java Ragex Engine,
and that is- WhiteSpace Character
\d Represents a digit: [0-9]
\D Represents a non-digit: [^0-9]
\s Represents a whitespace character including [ \t\n\x0B\f\r]
\S Represents a non-whitespace character as [^\s]
\v Represents a vertical whitespace character as [\n\x0B\f\r\x85\u2028\u2029]
\V Represents a non-vertical whitespace character as [^\v]
\w Represents a word character as [a-zA-Z_0-9]
\W Represents a non-word character as [^\w]
Here, the key point to remember is that the small leter character \s represents all types of white spaces including a single space [ ] , tab characters [ ] or anything similar.
So, if you'll try will something like this-
String theString = "Java<a space><a tab>Programming"
String []allParts = theString.split("\\s+");
You will get the desired output.
Some Very Useful Links:
Split() method Best Examples
Regexr
split-Java 11
RegularExpInfo
PatternClass
Hope, this might help you the best!!!

To split a string with any Unicode whitespace, you need to use
s.split("(?U)\\s+")
^^^^
The (?U) inline embedded flag option is the equivalent of Pattern.UNICODE_CHARACTER_CLASS that enables \s shorthand character class to match any characters from the whitespace Unicode category.
If you want to split with whitespace and keep the whitespaces in the resulting array, use
s.split("(?U)(?<=\\s)(?=\\S)|(?<=\\S)(?=\\s)")
See the regex demo. See Java demo:
String s = "Hello\t World\u00A0»";
System.out.println(Arrays.toString(s.split("(?U)\\s+"))); // => [Hello, World, »]
System.out.println(Arrays.toString(s.split("(?U)(?<=\\s)(?=\\S)|(?<=\\S)(?=\\s)")));
// => [Hello, , World, , »]

Since it is a regular expression, and i'm assuming u would also not want non-alphanumeric chars like commas, dots, etc that could be surrounded by blanks (e.g. "one , two" should give [one][two]), it should be:
myString.split(/[\s\W]+/)

you can split a string by line break by using the following statement :
String textStr[] = yourString.split("\\r?\\n");
you can split a string by Whitespace by using the following statement :
String textStr[] = yourString.split("\\s+");

String str = "Hello World";
String res[] = str.split("\\s+");

Study this code.. good luck
import java.util.*;
class Demo{
public static void main(String args[]){
Scanner input = new Scanner(System.in);
System.out.print("Input String : ");
String s1 = input.nextLine();
String[] tokens = s1.split("[\\s\\xA0]+");
System.out.println(tokens.length);
for(String s : tokens){
System.out.println(s);
}
}
}

Related

Java Regexp to match words only (', -, space)

What is the Java Regular expression to match all words containing only :
From a to z and A to Z
The ' - Space Characters but they must not be in the beginning or the
end.
Examples
test'test match
test' doesn't match
'test doesn't match
-test doesn't match
test- doesn't match
test-test match
You can use the following pattern: ^(?!-|'|\\s)[a-zA-Z]*(?!-|'|\\s)$
Below are the examples:
String s1 = "abc";
String s2 = " abc";
String s3 = "abc ";
System.out.println(s1.matches("^(?!-|'|\\s)[a-zA-Z]*(?!-|'|\\s)$"));
System.out.println(s2.matches("^(?!-|'|\\s)[a-zA-Z]*(?!-|'|\\s)$"));
System.out.println(s3.matches("^(?!-|'|\\s)[a-zA-Z]*(?!-|'|\\s)$"));
When you mean the whitespace char it is: [a-zA-Z ]
So it checks if your string contains a-z(lowercase) and A-Z(uppercase) chars and the whitespace chars. If not, the test will fail
Here's my solution:
/(\w{2,}(-|'|\s)\w{2,})/g
You can take it for a spin on Regexr.
It is first checking for a word with \w, then any of the three qualifiers with "or" logic using |, and then another word. The brackets {} are making sure the words on either end are at least 2 characters long so contractions like don't aren't captured. You could set that to any value to prevent longer words from being captured or omit them entirely.
Caveat: \w also looks for _ underscores. If you don't want that you could replace it with [a-zA-Z] like so:
/([a-zA-Z]{2,}(-|'|\s)[a-zA-Z]{2,})/g

Split the string by regular expression [duplicate]

What regex pattern would need I to pass to java.lang.String.split() to split a String into an Array of substrings using all whitespace characters (' ', '\t', '\n', etc.) as delimiters?
Something in the lines of
myString.split("\\s+");
This groups all white spaces as a delimiter.
So if I have the string:
"Hello[space character][tab character]World"
This should yield the strings "Hello" and "World" and omit the empty space between the [space] and the [tab].
As VonC pointed out, the backslash should be escaped, because Java would first try to escape the string to a special character, and send that to be parsed. What you want, is the literal "\s", which means, you need to pass "\\s". It can get a bit confusing.
The \\s is equivalent to [ \\t\\n\\x0B\\f\\r].
In most regex dialects there are a set of convenient character summaries you can use for this kind of thing - these are good ones to remember:
\w - Matches any word character.
\W - Matches any nonword character.
\s - Matches any white-space character.
\S - Matches anything but white-space characters.
\d - Matches any digit.
\D - Matches anything except digits.
A search for "Regex Cheatsheets" should reward you with a whole lot of useful summaries.
To get this working in Javascript, I had to do the following:
myString.split(/\s+/g)
"\\s+" should do the trick
Also you may have a UniCode non-breaking space xA0...
String[] elements = s.split("[\\s\\xA0]+"); //include uniCode non-breaking
String string = "Ram is going to school";
String[] arrayOfString = string.split("\\s+");
Apache Commons Lang has a method to split a string with whitespace characters as delimiters:
StringUtils.split("abc def")
http://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/StringUtils.html#split(java.lang.String)
This might be easier to use than a regex pattern.
All you need is to split using the one of the special character of Java Ragex Engine,
and that is- WhiteSpace Character
\d Represents a digit: [0-9]
\D Represents a non-digit: [^0-9]
\s Represents a whitespace character including [ \t\n\x0B\f\r]
\S Represents a non-whitespace character as [^\s]
\v Represents a vertical whitespace character as [\n\x0B\f\r\x85\u2028\u2029]
\V Represents a non-vertical whitespace character as [^\v]
\w Represents a word character as [a-zA-Z_0-9]
\W Represents a non-word character as [^\w]
Here, the key point to remember is that the small leter character \s represents all types of white spaces including a single space [ ] , tab characters [ ] or anything similar.
So, if you'll try will something like this-
String theString = "Java<a space><a tab>Programming"
String []allParts = theString.split("\\s+");
You will get the desired output.
Some Very Useful Links:
Split() method Best Examples
Regexr
split-Java 11
RegularExpInfo
PatternClass
Hope, this might help you the best!!!
To split a string with any Unicode whitespace, you need to use
s.split("(?U)\\s+")
^^^^
The (?U) inline embedded flag option is the equivalent of Pattern.UNICODE_CHARACTER_CLASS that enables \s shorthand character class to match any characters from the whitespace Unicode category.
If you want to split with whitespace and keep the whitespaces in the resulting array, use
s.split("(?U)(?<=\\s)(?=\\S)|(?<=\\S)(?=\\s)")
See the regex demo. See Java demo:
String s = "Hello\t World\u00A0»";
System.out.println(Arrays.toString(s.split("(?U)\\s+"))); // => [Hello, World, »]
System.out.println(Arrays.toString(s.split("(?U)(?<=\\s)(?=\\S)|(?<=\\S)(?=\\s)")));
// => [Hello, , World, , »]
Since it is a regular expression, and i'm assuming u would also not want non-alphanumeric chars like commas, dots, etc that could be surrounded by blanks (e.g. "one , two" should give [one][two]), it should be:
myString.split(/[\s\W]+/)
you can split a string by line break by using the following statement :
String textStr[] = yourString.split("\\r?\\n");
you can split a string by Whitespace by using the following statement :
String textStr[] = yourString.split("\\s+");
String str = "Hello World";
String res[] = str.split("\\s+");
Study this code.. good luck
import java.util.*;
class Demo{
public static void main(String args[]){
Scanner input = new Scanner(System.in);
System.out.print("Input String : ");
String s1 = input.nextLine();
String[] tokens = s1.split("[\\s\\xA0]+");
System.out.println(tokens.length);
for(String s : tokens){
System.out.println(s);
}
}
}

Java String Split() Method

I was wondering what the following line would do:
String parts = inputLine.split("\\s+");
Would this simply split the string at any spaces in the line? I think this a regex, but I've never seen them before.
Yes, as documentation states split takes regex as argument.
In regex \s represents character class of containing whitespace characters like:
tab \t,
space " ",
line separators \n \r
more...
+ is quantifier which can be read as "once or more" which makes \s+ representing text build from one or more whitespaces.
We need to write this regex as "\\s+ (with two backslashes) because in String \ is considered special character which needs escaping (with another backslash) to produce \ literal.
So split("\\s+") will produce array of tokens separated by one or more whitespaces. BTW trailing empty elements are removed so "a b c ".split("\\s+") will return array ["a", "b", "c"] not ["a", "b", "c", ""].
Yes, though actually any number of space meta-characters (including tabs, newlines etc). See the Java documentation on Patterns.
It will split the string on one (or more) consecutive white space characters. The Pattern Javadoc describes the Predefined character classes (of which \s is one) as,
Predefined character classes
. Any character (may or may not match line terminators)
\d A digit: [0-9]
\D A non-digit: [^0-9]
\s A whitespace character: [ \t\n\x0B\f\r]
\S A non-whitespace character: [^\s]
\w A word character: [a-zA-Z_0-9]
\W A non-word character: [^\w]
Note that the \\ is to escape the back-slash as required to embed it in a String.
Yes, and it splits both tab and space:
String t = "test your function aaa";
for(String s : t.split("\\s+"))
System.out.println(s);
Output:
test
your
function
aaa

Weird Java replaceFirst

What does this line mean?
str = str.replaceFirst("^\\s+", "")
I believe ^ must be a symbol for the beginning of line, but what is \\s+ ? Is this just a string of characters or have some pattern meaning?
I am trying to convert this line to JavaScript.
str = str.replaceFirst("^\\s+", "")
is removing all the spaces at the begining of the string.
^ is used for beginning of the line
\\s+ is used for several white spaces
\s matches any whitespace characters (spaces, tabs, etc.)
+ means one or more of the previous character or character class.
So the pattern str.replaceFirst("^\\s+", "") matches any whitespace at the beginning of the string, and removes it.
The equivalent JavaScript would be str.replace(/^\s+/, '').
\\s+ is a regular expression.
To convert it to JavaScript is really simple.
str = str.replace(/^\s+/, '');

split a string by any symbol

What is the regex that I should pass with String.split() in order to split the string by any symbol?
Now, by any symbol I mean any of the following:
`~`, `!`, `#`, `#`, ...
Basically any non-letter and non-digit printable character.
You should use a non word i.e \W
\W is inverse of \w
\W is similar to [^a-zA-Z0-9_] and so would match any non-word character except _
OR
you can simply use [^a-zA-Z0-9]
You can try using this: -
str.split("[^a-zA-Z0-9]");
This will not include an underscore.
\W is equivalent to: - "[a-zA-Z0-9_]"
You could either be specific like Spring.split("[~!#$]") or list the values you do not want to split upon Spring.split("[^\\w]")
You may want to use \W or ^\w. You may find more details here: Regex: Character classes
String str = "a#v$d!e";
String[] splitted = str.split("\\W");
System.out.println(splitted.length); //<--print 4
or
String str = "a#v$d!e";
String[] splitted = str.split("[^\\w]");
System.out.println(splitted.length); //<--print 4

Categories

Resources