split a string by any symbol - java

What is the regex that I should pass with String.split() in order to split the string by any symbol?
Now, by any symbol I mean any of the following:
`~`, `!`, `#`, `#`, ...
Basically any non-letter and non-digit printable character.

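Use the non-word character class, \W.
\W is the inverse of \w.
\W is equivalent to [^a-zA-Z0-9_], so it matches any non-word character; the underscore counts as a word character and is not matched.
OR
if the underscore should be treated as a symbol too, you can simply use [^a-zA-Z0-9]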

You can try using this:
str.split("[^a-zA-Z0-9]");
This class does not list the underscore, so the string is split on underscores as well.
\W is equivalent to "[^a-zA-Z0-9_]", which does not split on underscores.

You could either be specific, like str.split("[~!#$]"), or list the characters you do not want to split on, like str.split("[^\\w]").

You may want to use \W or [^\w]. You may find more details here: Regex: Character classes
String str = "a#v$d!e";
String[] splitted = str.split("\\W");
System.out.println(splitted.length); //<--print 4
or
String str = "a#v$d!e";
String[] splitted = str.split("[^\\w]");
System.out.println(splitted.length); //<--print 4
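A minimal sketch of the difference between the two patterns discussed above, assuming the goal is to split on runs of symbols. The class and method names are made up for illustration. The + quantifier is an addition: without it, two adjacent symbols such as "#$" would produce an empty string between them.

```java
import java.util.Arrays;

class SymbolSplitDemo {
    // Splits on runs of characters that are neither ASCII letters nor digits
    // (the underscore is treated as a delimiter).
    static String[] splitOnSymbols(String input) {
        return input.split("[^a-zA-Z0-9]+");
    }

    // Splits on runs of non-word characters; the underscore is a word
    // character, so it stays inside the tokens.
    static String[] splitOnNonWord(String input) {
        return input.split("\\W+");
    }

    public static void main(String[] args) {
        String sample = "foo_bar#$baz!42";
        System.out.println(Arrays.toString(splitOnSymbols(sample))); // [foo, bar, baz, 42]
        System.out.println(Arrays.toString(splitOnNonWord(sample))); // [foo_bar, baz, 42]
    }
}
```

Note that split() still returns a leading empty string when the input itself starts with a delimiter.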

Related

Count number of words in the given string [duplicate]

What regex pattern would I need to pass to java.lang.String.split() to split a String into an Array of substrings using all whitespace characters (' ', '\t', '\n', etc.) as delimiters?
Something along the lines of
myString.split("\\s+");
This groups all white spaces as a delimiter.
So if I have the string:
"Hello[space character][tab character]World"
This should yield the strings "Hello" and "World" and omit the empty space between the [space] and the [tab].
As VonC pointed out, the backslash should be escaped, because the Java compiler first processes escape sequences in the string literal, and only the result is sent to the regex engine. What the regex engine needs to receive is the literal \s, which means you need to write "\\s" in source code. It can get a bit confusing.
The \\s is equivalent to [ \\t\\n\\x0B\\f\\r].
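The escaping rule can be checked with a small sketch like the one below. The class and method names are hypothetical, and the trim() call is an added assumption so that leading whitespace does not yield an empty first token.

```java
import java.util.Arrays;

class WhitespaceSplitDemo {
    static String[] splitOnWhitespace(String input) {
        // "\\s+" in Java source is the three-character regex \s+ at runtime.
        // trim() drops leading/trailing ASCII whitespace before splitting.
        return input.trim().split("\\s+");
    }

    public static void main(String[] args) {
        // The literal "\\s" holds two characters: a backslash and an 's'.
        System.out.println("\\s".length()); // 2
        System.out.println(Arrays.toString(splitOnWhitespace("Hello \tWorld"))); // [Hello, World]
    }
}
```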
In most regex dialects there is a set of convenient character class shorthands you can use for this kind of thing - these are good ones to remember:
\w - Matches any word character.
\W - Matches any nonword character.
\s - Matches any white-space character.
\S - Matches anything but white-space characters.
\d - Matches any digit.
\D - Matches anything except digits.
A search for "Regex Cheatsheets" should reward you with a whole lot of useful summaries.
To get this working in Javascript, I had to do the following:
myString.split(/\s+/g)
"\\s+" should do the trick
Also, you may have a Unicode non-breaking space (\u00A0), which \s does not match by default:
String[] elements = s.split("[\\s\\xA0]+"); // include the Unicode non-breaking space
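To illustrate why the extra \xA0 matters, here is a small sketch comparing the two patterns on a string that contains a non-breaking space. The class and method names are invented for the example.

```java
import java.util.Arrays;

class NbspSplitDemo {
    // Default \s covers only [ \t\n\x0B\f\r], so U+00A0 is not a delimiter here.
    static String[] splitAsciiWhitespace(String s) {
        return s.split("\\s+");
    }

    // Adding \xA0 to the class makes the non-breaking space a delimiter too.
    static String[] splitWithNbsp(String s) {
        return s.split("[\\s\\xA0]+");
    }

    public static void main(String[] args) {
        String s = "one\u00A0two three";
        System.out.println(splitAsciiWhitespace(s).length); // 2
        System.out.println(Arrays.toString(splitWithNbsp(s))); // [one, two, three]
    }
}
```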
String string = "Ram is going to school";
String[] arrayOfString = string.split("\\s+");
Apache Commons Lang has a method to split a string with whitespace characters as delimiters:
StringUtils.split("abc def")
http://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/StringUtils.html#split(java.lang.String)
This might be easier to use than a regex pattern.
All you need is to split using one of the special character classes of the Java regex engine,
and that is the whitespace character class.
\d Represents a digit: [0-9]
\D Represents a non-digit: [^0-9]
\s Represents a whitespace character including [ \t\n\x0B\f\r]
\S Represents a non-whitespace character as [^\s]
\v Represents a vertical whitespace character as [\n\x0B\f\r\x85\u2028\u2029]
\V Represents a non-vertical whitespace character as [^\v]
\w Represents a word character as [a-zA-Z_0-9]
\W Represents a non-word character as [^\w]
Here, the key point to remember is that the lowercase \s represents all types of whitespace, including a single space, a tab character, or anything similar.
So, if you try something like this:
String theString = "Java \tProgramming";
String[] allParts = theString.split("\\s+");
You will get the desired output.
To split a string with any Unicode whitespace, you need to use
s.split("(?U)\\s+")
         ^^^^
The (?U) embedded flag is the inline equivalent of Pattern.UNICODE_CHARACTER_CLASS; it makes the \s shorthand character class match any character with the Unicode White_Space property.
If you want to split with whitespace and keep the whitespaces in the resulting array, use
s.split("(?U)(?<=\\s)(?=\\S)|(?<=\\S)(?=\\s)")
See the regex demo. See Java demo:
String s = "Hello\t World\u00A0»";
System.out.println(Arrays.toString(s.split("(?U)\\s+"))); // => [Hello, World, »]
System.out.println(Arrays.toString(s.split("(?U)(?<=\\s)(?=\\S)|(?<=\\S)(?=\\s)")));
// => [Hello, , World, , »]
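The same behaviour can also be obtained with the compile-time flag instead of the inline (?U). The following is only a sketch; the class, constant, and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class UnicodeWhitespaceDemo {
    // Pattern.UNICODE_CHARACTER_CLASS is the flag form of the inline (?U).
    private static final Pattern UNICODE_WS =
            Pattern.compile("\\s+", Pattern.UNICODE_CHARACTER_CLASS);

    static String[] splitUnicodeWhitespace(String input) {
        return UNICODE_WS.split(input);
    }

    public static void main(String[] args) {
        // \u00A0 is a non-breaking space, \u00BB is the » character.
        String s = "Hello\t World\u00A0\u00BB";
        System.out.println(Arrays.toString(splitUnicodeWhitespace(s)));
    }
}
```

Precompiling the pattern also avoids recompiling the regex on every call, which String.split() may otherwise do.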
Since the argument is a regular expression, and assuming you also want to drop non-alphanumeric characters such as commas and dots that may be surrounded by blanks (e.g. "one , two" should give [one][two]), it should be:
myString.split("[\\s\\W]+")
You can split a string by line breaks by using the following statement:
String textStr[] = yourString.split("\\r?\\n");
You can split a string by whitespace by using the following statement:
String textStr[] = yourString.split("\\s+");
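A short sketch of the line-break case, with invented class and method names. The second method uses \R, the linebreak matcher available since Java 8, as an alternative that is not part of the answer above.

```java
import java.util.Arrays;

class LineSplitDemo {
    // Handles Unix (\n) and Windows (\r\n) line endings.
    static String[] splitLines(String text) {
        return text.split("\\r?\\n");
    }

    // \R (Java 8+) matches any Unicode linebreak sequence, treating \r\n as one unit.
    static String[] splitLinesAnyBreak(String text) {
        return text.split("\\R");
    }

    public static void main(String[] args) {
        String text = "a\r\nb\nc";
        System.out.println(Arrays.toString(splitLines(text)));         // [a, b, c]
        System.out.println(Arrays.toString(splitLinesAnyBreak(text))); // [a, b, c]
    }
}
```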
String str = "Hello World";
String res[] = str.split("\\s+");
Study this code. Good luck.
import java.util.*;

class Demo {
    public static void main(String args[]) {
        Scanner input = new Scanner(System.in);
        System.out.print("Input String : ");
        String s1 = input.nextLine();
        // Split on runs of whitespace, including the non-breaking space (\xA0)
        String[] tokens = s1.split("[\\s\\xA0]+");
        System.out.println(tokens.length);
        for (String s : tokens) {
            System.out.println(s);
        }
    }
}

Replacing characters in String using Meta characters or character classes

I am writing code to remove all non-alphanumeric characters from a String that contains only lowercase letters.
I am using the replaceAll function and have looked at a few regexes
My reference is from: https://www.vogella.com/tutorials/JavaRegularExpressions/article.html which shows that
\s : A whitespace character, short for [ \t\n\x0b\r\f]
\W : A non-word character [^\w]
I tried the following in Java, but it didn't remove the spaces or symbols:
lowercased = lowercased.replaceAll("\\W\\s", "");
output:
amanaplanac analp anam a
May I know what is wrong?
Regex \W\s means "a non-word character followed by a whitespace character".
If you want to replace any character that is one of those, use one of these:
\W|\s where | means or
[\W\s] where [ ] is a character class that in this case merges the built-in special character classes \W and \s, because that's what those are.
Of the two, I recommend using the second.
Of course, having \s there is redundant, because \s means whitespace character, and \W means non-word character, and since whitespaces are not word characters, using \W alone is enough.
lowercased = lowercased.replaceAll("\\W+", "");
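The difference between the sequence \W\s and the single class \W can be seen in a sketch like this one. The class and method names, and the sample sentence, are made up for illustration.

```java
class NonWordStripDemo {
    // Removes only two-character sequences: a non-word character followed by whitespace.
    static String stripSequence(String s) {
        return s.replaceAll("\\W\\s", "");
    }

    // Removes every run of non-word characters, which already includes whitespace.
    static String stripNonWord(String s) {
        return s.replaceAll("\\W+", "");
    }

    public static void main(String[] args) {
        String lowercased = "a man, a plan, a canal: panama";
        System.out.println(stripSequence(lowercased)); // a mana plana canalpanama
        System.out.println(stripNonWord(lowercased));  // amanaplanacanalpanama
    }
}
```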
The regex \W matches characters that are not digits (0-9), letters (A-Z and a-z) or the underscore (_), and \s matches whitespace.
As \W already matches every non-alphanumeric character (except the underscore), there is no need to add \s.
So if you use \W alone, underscores (_) are kept along with the alphanumeric characters.
Use the following to remove underscores as well.
lowercased = lowercased.replaceAll("\\W|_", "");
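A brief sketch of the underscore point, using hypothetical names. The class form [\W_] is used here as an equivalent spelling of the alternation \W|_.

```java
class UnderscoreStripDemo {
    // \W alone leaves underscores in place, because _ is a word character.
    static String stripNonWord(String s) {
        return s.replaceAll("\\W", "");
    }

    // [\W_] matches the same characters as \W|_ and removes underscores too.
    static String stripNonAlphanumeric(String s) {
        return s.replaceAll("[\\W_]", "");
    }

    public static void main(String[] args) {
        String sample = "snake_case value!";
        System.out.println(stripNonWord(sample));         // snake_casevalue
        System.out.println(stripNonAlphanumeric(sample)); // snakecasevalue
    }
}
```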
Use | (the or operator), as in \W|\s, since \W and \s are independent cases that you want to replace. And since whitespace characters are not word characters, \W alone is enough.
lowercased = lowercased.replaceAll("\\W|\\s", "");

How to use regex in matches() to look for letters, dots, and apostrophe?

In the following code, I tried many expressions to check if the string str has only letters, dots, or apostrophe by using matches() method.
However, it does not return true for a string such as "One o'clock.":
String str = "One o'clock.";
System.out.println(str.matches("[^a-zA-Z'. ]"));
You have two problems:
your regex is using negated character class [^...]
your regex can match strings with only one character.
So use the standard character class [a-zA-Z'. ] and, to let your regex match strings of one or more characters, add the + quantifier.
System.out.println(str.matches("[a-zA-Z'. ]+"));
You can try this one [\p{Alpha}\s'.]*
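Because matches() must match the entire string, the quantifier is what makes the check work. A minimal sketch, with an invented class and method name:

```java
class MatchesDemo {
    // True only when every character is a letter, apostrophe, dot, or space,
    // and the string has at least one character.
    static boolean onlyLettersDotsApostrophes(String s) {
        return s.matches("[a-zA-Z'. ]+");
    }

    public static void main(String[] args) {
        String str = "One o'clock.";
        // Negated class without a quantifier: one disallowed character, whole string.
        System.out.println(str.matches("[^a-zA-Z'. ]"));           // false
        System.out.println(onlyLettersDotsApostrophes(str));        // true
        System.out.println(onlyLettersDotsApostrophes("9 o'clock")); // false
    }
}
```

With * instead of +, as in the last suggestion above, the empty string would also be accepted.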

Java Replacing Characters

This is pretty simple but how would I create a regex to strip anything but
letters a-Z,
numbers 0-9
and commas?
I think the regex expression for the first two is [^a-zA-Z_0-9] but how could I add commas to it.
Also, would it be the following?
"string".replaceAll("expression", null);
First of all, you cannot use null as the replacement value; it will throw a java.lang.NullPointerException. You must pass a string there, for example the empty string "".
About the regex, if you need to add anything to your character class [], just put it inside. For example [^a-z,*.]
Furthermore, your a-zA-Z_0-9 can be replaced with \\w
[^\\w,]
You can simply add comma to your negated character class
[^a-zA-Z0-9,]
           ^ add this
Also Strings are immutable so replaceAll will not affect original string but create new one with replaced characters so you need to store it somewhere (maybe in reference to original String).
Last thing is that you need to pass empty string "" as replacement, not null.
So try with
yourString = yourString.replaceAll("[^a-zA-Z0-9,]","");
Another thing is that the regex you are currently using also prevents _ from being removed. If that was intentional, then instead of _ a-z A-Z 0-9 you can simply use the predefined character class \w (which in a Java string literal needs to be written as "\\w" because \ needs to be escaped), so your code can look like
yourString = yourString.replaceAll("[^\\w,]","");
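The two negated classes differ only in how they treat the underscore. A small sketch with made-up names and sample input:

```java
class KeepAlnumCommaDemo {
    // Keeps ASCII letters, digits, and commas; everything else is removed.
    static String keepAlnumAndCommas(String s) {
        return s.replaceAll("[^a-zA-Z0-9,]", "");
    }

    // Same idea with \w, which additionally keeps the underscore.
    static String keepWordCharsAndCommas(String s) {
        return s.replaceAll("[^\\w,]", "");
    }

    public static void main(String[] args) {
        String input = "a_b, c-d;1,2";
        // Strings are immutable, so the result has to be captured.
        System.out.println(keepAlnumAndCommas(input));     // ab,cd1,2
        System.out.println(keepWordCharsAndCommas(input)); // a_b,cd1,2
    }
}
```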
No, you should do:
value = "string".replaceAll("[\\W_&&[^,]]", "");
This pattern lists what to remove (\W and _) and uses class intersection (&&[^,]) to exclude the comma, since \W on its own also matches commas.
You should replace with an empty string and not null, and you have to assign the result to your string, as strings are immutable.
Otherwise just add , to your negated character class.
[\w,]+ is the regex which matches alphanumeric, underscore and comma.
Here \w is equivalent to [A-Za-z0-9_]
[^\w,]+ is the regex which matches everything except alphanumeric, underscore and comma.
Here \W - Matches any character that is not a word character (alphanumeric & underscore) which is equivalent to [^A-Za-z0-9_]
