Java regex splitting, but only removing one whitespace - java

I have this code:
String[] parts = sentence.split("\\s");
and a sentence like: "this is a whitespace   and I want to split it" (note there are 3 whitespaces after "whitespace")
I want to split it so that only the last whitespace of each run is removed, keeping the rest of the original message intact. The output should be
"[this], [is], [a], [whitespace  ], [and], [I], [want], [to], [split], [it]"
(two whitespaces after the word "whitespace")
Can I do this with regex and if not, is there even a way?
I removed the + from \\s+ to only remove one whitespace

You can use
String[] parts = sentence.split("\\s(?=\\S)");
That will split on a whitespace char that is immediately followed by a non-whitespace char.
See the regex demo. Details:
\s - a whitespace char
(?=\S) - a positive lookahead that requires a non-whitespace char to appear immediately to the right of the current location.
To make it fully Unicode-aware in Java, add the (?U) (Pattern.UNICODE_CHARACTER_CLASS option equivalent) embedded flag option: .split("(?U)\\s(?=\\S)").
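As a quick check of the lookahead approach, here is a minimal sketch that applies the pattern to the sentence from the question. The class and method names are made up for illustration.

```java
import java.util.Arrays;

class SingleWhitespaceSplit {
    // Splits on a whitespace char only when a non-whitespace char follows,
    // so any extra whitespace in a run stays attached to the preceding token.
    static String[] splitKeepingExtraSpaces(String sentence) {
        return sentence.split("\\s(?=\\S)");
    }

    public static void main(String[] args) {
        String sentence = "this is a whitespace   and I want to split it";
        // prints [this, is, a, whitespace  , and, I, want, to, split, it]
        System.out.println(Arrays.toString(splitKeepingExtraSpaces(sentence)));
    }
}
```

Only the third of the three spaces after "whitespace" is followed by a non-whitespace char, so it is the only one consumed by the split.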

Related

Count number of words in the given string [duplicate]

What regex pattern would I need to pass to java.lang.String.split() to split a String into an Array of substrings using all whitespace characters (' ', '\t', '\n', etc.) as delimiters?
Something along the lines of
myString.split("\\s+");
This groups all white spaces as a delimiter.
So if I have the string:
"Hello[space character][tab character]World"
This should yield the strings "Hello" and "World" and omit the empty space between the [space] and the [tab].
As VonC pointed out, the backslash should be escaped, because Java would first try to interpret \s as an escape sequence inside the string literal, before the regex engine ever sees it. What you want is the literal "\s", which means you need to pass "\\s". It can get a bit confusing.
The \\s is equivalent to [ \\t\\n\\x0B\\f\\r].
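To make the escaping point concrete, the sketch below shows that the Java literal "\\s+" is just the three-character regex \s+ at runtime, and that it splits on a mixed space/tab run. The class and constant names are illustrative only.

```java
import java.util.Arrays;

class EscapedWhitespaceSplit {
    // In source code the backslash is doubled; the resulting String holds \s+ (3 chars).
    static final String WHITESPACE_RUN = "\\s+";

    static String[] splitOnWhitespace(String text) {
        return text.split(WHITESPACE_RUN);
    }

    public static void main(String[] args) {
        System.out.println(WHITESPACE_RUN.length()); // prints 3
        // A space followed by a tab is treated as one delimiter.
        System.out.println(Arrays.toString(splitOnWhitespace("Hello \tWorld"))); // prints [Hello, World]
    }
}
```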
In most regex dialects there is a set of convenient character class shorthands you can use for this kind of thing. These are good ones to remember:
\w - Matches any word character.
\W - Matches any nonword character.
\s - Matches any white-space character.
\S - Matches anything but white-space characters.
\d - Matches any digit.
\D - Matches anything except digits.
A search for "Regex Cheatsheets" should reward you with a whole lot of useful summaries.
To get this working in Javascript, I had to do the following:
myString.split(/\s+/g)
"\\s+" should do the trick
Also, you may have a Unicode non-breaking space (\xA0) in the input, which plain \s does not match:
String[] elements = s.split("[\\s\\xA0]+"); // include the Unicode non-breaking space
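A small sketch of why the extra \xA0 matters: by default \s in Java covers only the ASCII whitespace set, so a non-breaking space is not treated as a delimiter. The helper name countTokens is invented for this example; the (?U) variant is the embedded flag discussed elsewhere in this thread.

```java
class NonBreakingSpaceSplit {
    // Returns how many tokens a given regex produces for the text.
    static int countTokens(String text, String regex) {
        return text.split(regex).length;
    }

    public static void main(String[] args) {
        String text = "one\u00A0two three"; // NBSP between "one" and "two"
        System.out.println(countTokens(text, "\\s+"));        // prints 2 (NBSP not matched)
        System.out.println(countTokens(text, "[\\s\\xA0]+")); // prints 3
        System.out.println(countTokens(text, "(?U)\\s+"));    // prints 3
    }
}
```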
String string = "Ram is going to school";
String[] arrayOfString = string.split("\\s+");
Apache Commons Lang has a method to split a string with whitespace characters as delimiters:
StringUtils.split("abc def")
http://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/StringUtils.html#split(java.lang.String)
This might be easier to use than a regex pattern.
All you need is to split using one of the predefined character classes of the Java regex engine,
namely the whitespace character class:
\d Represents a digit: [0-9]
\D Represents a non-digit: [^0-9]
\s Represents a whitespace character including [ \t\n\x0B\f\r]
\S Represents a non-whitespace character as [^\s]
\v Represents a vertical whitespace character as [\n\x0B\f\r\x85\u2028\u2029]
\V Represents a non-vertical whitespace character as [^\v]
\w Represents a word character as [a-zA-Z_0-9]
\W Represents a non-word character as [^\w]
Here, the key point to remember is that the lowercase \s represents all types of whitespace, including a single space, a tab character, or anything similar.
So, if you try something like this:
String theString = "Java \tProgramming";
String[] allParts = theString.split("\\s+");
you will get the desired output.
To split a string with any Unicode whitespace, you need to use
s.split("(?U)\\s+")
The (?U) inline embedded flag option is the equivalent of Pattern.UNICODE_CHARACTER_CLASS, which makes the \s shorthand character class match any Unicode whitespace character.
If you want to split with whitespace and keep the whitespaces in the resulting array, use
s.split("(?U)(?<=\\s)(?=\\S)|(?<=\\S)(?=\\s)")
See the regex demo. See Java demo:
String s = "Hello\t World\u00A0»";
System.out.println(Arrays.toString(s.split("(?U)\\s+"))); // => [Hello, World, »]
System.out.println(Arrays.toString(s.split("(?U)(?<=\\s)(?=\\S)|(?<=\\S)(?=\\s)")));
// => [Hello, , World, , »]
Since it is a regular expression, and I'm assuming you would also not want non-alphanumeric chars like commas, dots, etc. that could be surrounded by blanks (e.g. "one , two" should give [one][two]), it should be:
myString.split(/[\s\W]+/)
You can split a string by line break using the following statement:
String textStr[] = yourString.split("\\r?\\n");
You can split a string by whitespace using the following statement:
String textStr[] = yourString.split("\\s+");
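For illustration, a minimal sketch of the line-break variant on input that mixes Windows (\r\n) and Unix (\n) line endings. The class and method names are assumptions for the example.

```java
import java.util.Arrays;

class LineSplit {
    // \r?\n accepts an optional carriage return before the newline,
    // so both \r\n and \n act as line separators.
    static String[] splitLines(String text) {
        return text.split("\\r?\\n");
    }

    public static void main(String[] args) {
        String text = "first\r\nsecond\nthird";
        System.out.println(Arrays.toString(splitLines(text))); // prints [first, second, third]
    }
}
```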
String str = "Hello World";
String res[] = str.split("\\s+");
Study this code. Good luck.
import java.util.*;
class Demo{
    public static void main(String args[]){
        Scanner input = new Scanner(System.in);
        System.out.print("Input String : ");
        String s1 = input.nextLine();
        String[] tokens = s1.split("[\\s\\xA0]+");
        System.out.println(tokens.length);
        for(String s : tokens){
            System.out.println(s);
        }
    }
}

Matching pound (#) or empty line comments with regex

As a start, I am using Java, if this influences the regex.
I am trying to match the contents of a line that starts with any number of whitespace characters but nothing else, followed by any number of pounds (#), followed by any characters, and ending with a newline.
Or, a fully empty line with only either whitespace or new line.
I tried finding the first part myself but it doesn't seem to match any of the comments:
^(?!.+)#+.*$
It doesn't work even if I include \r*\n* on the end
In your regexr example you have selected JavaScript and enabled the s flag, which makes the dot match a newline.
If you want to match all lines, you can enable the multiline and global flag instead, and use
^[^\S\r\n]*(?:#.*)?\r?\n
Regex demo
In Java, you might use
^\h*(?:#.*)?\R
With doubled backslashes in the Java string literal:
String regex = "^\\h*(?:#.*)?\\R";
The pattern matches:
^ Start of string
\h* Match optional horizontal whitespace chars
(?:#.*)? Optionally match # followed by the rest of the line
\R Match any Unicode newline sequence
Regex demo
If you want to match the whole line, and instead of matching a newline you want to assert the end of the string you can use an anchor $ instead of \R
^\h*(?:#.*)?$
Regex demo
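One way to try the \R variant in Java is to strip comment and blank lines from a multi-line string. This is only a sketch: the (?m) flag is an addition here so that ^ matches at the start of every line rather than only at the start of the input, and the class and method names are hypothetical.

```java
class CommentLineStripper {
    // (?m) makes ^ match at each line start; the pattern then consumes
    // optional horizontal whitespace, an optional #-comment, and the line break.
    static String stripCommentAndBlankLines(String text) {
        return text.replaceAll("(?m)^\\h*(?:#.*)?\\R", "");
    }

    public static void main(String[] args) {
        String text = "# c\nkey=1\n\n  # d\nval=2\n";
        // Prints the two remaining lines: key=1 and val=2
        System.out.print(stripCommentAndBlankLines(text));
    }
}
```

A line such as "key=1" is left alone because, after the optional parts match nothing, \R still requires a line break at that position.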

Replace white spaces only in part of the string

I have a String like
"This is apple tree"
I want to remove the white spaces up to the word apple. After the change it will be like
"Thisisapple tree"
I need to achieve this in single replace command combined with regular expressions.
For now it looks like you may be looking for
String s = "This is apple tree";
System.out.println(s.replaceAll("\\G(\\S+)(?<!(?<!\\S)apple)\\s", "$1"));
Output: Thisisapple tree.
Explanation:
\G represents either end of previous match or start of input (^) if there was no previous match yet (when we are attempting to find first match)
\S+ represents one or more non-whitespace characters (to match words, including non-alphabetic characters like ' or punctuation)
(?<!(?<!\\S)apple)\\s negative-look-behind will prevent accepting whitespace which has apple before it (I added another negative-look-behind before apple to make sure that it doesn't have any non-whitespace which ensures that this is not part of some other word)
$1 in replacement represents match from group 1 (the one from (\S+)) which represents word. So we are replacing word and spaces with only word (effectively removing spaces)
WARNING: This solution assumes that
sentence doesn't start with space,
words can be separated with only one space.
If we want to get rid of these assumptions we would need something like:
System.out.println(s.replaceAll("^\\s+|\\G(\\S+)(?<!(?<!\\S)apple)\\s+", "$1"));
^\s+ will allow us to match spaces at beginning of string (and replace them with content of group 1 (word) which in this case will be empty, so we will simply remove these whitespaces)
\s+ at the end allows us to match word and one or more spaces after it (to remove them)
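The two patterns above can be compared side by side in a small sketch. The method names strict and tolerant are invented for this example; the regexes are the ones from the answer.

```java
class RemoveSpacesBeforeWord {
    // Assumes no leading space and exactly one space between words.
    static String strict(String s) {
        return s.replaceAll("\\G(\\S+)(?<!(?<!\\S)apple)\\s", "$1");
    }

    // Also handles leading whitespace and runs of several whitespace chars.
    static String tolerant(String s) {
        return s.replaceAll("^\\s+|\\G(\\S+)(?<!(?<!\\S)apple)\\s+", "$1");
    }

    public static void main(String[] args) {
        System.out.println(strict("This is apple tree"));       // prints Thisisapple tree
        System.out.println(tolerant("  This  is apple tree"));  // prints Thisisapple tree
        // "pineapple" is not the standalone word, so matching continues past it.
        System.out.println(strict("a pineapple b apple c"));    // prints apineapplebapple c
    }
}
```

Because \G only matches where the previous match ended, the chain of replacements stops for good at the first word that fails the lookbehind, which is why nothing after "apple" is touched.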
A single replace() is unlikely to solve your problem. You could do something like this..
String s[] = "This is an apple tree, not an orange tree".split("apple");
System.out.println(new StringBuilder(s[0].replace(" ","")).append("apple").append(s[1]));
This is achieved via a lookahead assertion, like this:
String str = "This is an apple tree";
System.out.println(str.replaceAll(" (?=.*apple)", ""));
It means: replace every space that has the word apple somewhere after it.
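One consequence of the lookahead approach is worth checking: when the word occurs more than once, spaces are removed up to the last occurrence, not the first. A minimal sketch, with an invented class and method name:

```java
class LookaheadSpaceRemoval {
    // Removes every space that still has "apple" somewhere later on the same line.
    static String removeSpacesBeforeApple(String s) {
        return s.replaceAll(" (?=.*apple)", "");
    }

    public static void main(String[] args) {
        System.out.println(removeSpacesBeforeApple("This is an apple tree"));     // prints Thisisanapple tree
        System.out.println(removeSpacesBeforeApple("an apple and an apple pie")); // prints anappleandanapple pie
    }
}
```

Note also that the dot does not match line terminators by default, so this sketch only looks ahead within the current line.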
If you want to use a regular expression you could try:
Matcher matcher = Pattern.compile("^(.*?\\bapple\\b)(.*)$").matcher("This is an apple but this apple is an orange");
System.out.println((!matcher.matches()) ? "No match" : matcher.group(1).replaceAll(" ", "") + matcher.group(2));
This checks that "apple" is an individual word and not just part of another word such as "snapple". It also splits at the first use of "apple".

Java String Split() Method

I was wondering what the following line would do:
String[] parts = inputLine.split("\\s+");
Would this simply split the string at any spaces in the line? I think this is a regex, but I've never seen them before.
Yes, as the documentation states, split takes a regex as its argument.
In regex, \s represents a character class containing whitespace characters like:
tab \t,
space " ",
line separators \n \r
more...
+ is a quantifier which can be read as "once or more", so \s+ represents text built from one or more whitespace characters.
We need to write this regex as "\\s+" (with two backslashes) because in a String literal \ is a special character which needs escaping (with another backslash) to produce a literal \.
So split("\\s+") will produce an array of tokens separated by one or more whitespaces. BTW, trailing empty elements are removed, so "a b c ".split("\\s+") will return the array ["a", "b", "c"], not ["a", "b", "c", ""].
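The trailing-element behaviour can be seen in a short sketch. The two-argument split(regex, limit) overload with a negative limit keeps trailing empty strings; the class and method names here are illustrative.

```java
import java.util.Arrays;

class TrailingEmptyTokens {
    // Default split: trailing empty strings are dropped.
    static String[] splitDefault(String text) {
        return text.split("\\s+");
    }

    // A negative limit keeps trailing empty strings.
    static String[] splitKeepTrailing(String text) {
        return text.split("\\s+", -1);
    }

    public static void main(String[] args) {
        System.out.println(splitDefault("a b c ").length);      // prints 3
        System.out.println(splitKeepTrailing("a b c ").length); // prints 4
        // A leading delimiter still produces a leading empty string.
        System.out.println(Arrays.toString(splitDefault(" a b"))); // prints [, a, b]
    }
}
```

If leading whitespace is possible, calling trim() on the input before splitting is one way to avoid the empty first element.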
Yes, though actually any number of space meta-characters (including tabs, newlines etc). See the Java documentation on Patterns.
It will split the string on one (or more) consecutive white space characters. The Pattern Javadoc describes the Predefined character classes (of which \s is one) as,
Predefined character classes
. Any character (may or may not match line terminators)
\d A digit: [0-9]
\D A non-digit: [^0-9]
\s A whitespace character: [ \t\n\x0B\f\r]
\S A non-whitespace character: [^\s]
\w A word character: [a-zA-Z_0-9]
\W A non-word character: [^\w]
Note that the \\ is to escape the back-slash as required to embed it in a String.
Yes, and it splits both tab and space:
String t = "test your function aaa";
for(String s : t.split("\\s+"))
    System.out.println(s);
Output:
test
your
function
aaa
