Confusing output from String.split - java

I do not understand the output of this code:
public class StringDemo{
public static void main(String args[]) {
String blank = "";
String comma = ",";
System.out.println("Output1: "+blank.split(",").length);
System.out.println("Output2: "+comma.split(",").length);
}
}
And got the following output:
Output1: 1
Output2: 0

Documentation:
For: System.out.println("Output1: "+blank.split(",").length);
The array returned by this method contains each substring of this string that is terminated by another substring that matches the given expression or is terminated by the end of the string. The substrings in the array are in the order in which they occur in this string. If the expression does not match any part of the input then the resulting array has just one element, namely this string.
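Since "," does not occur in the blank string, split simply returns the entire string as the only element, which is why the length is 1.
For the second case, String.split splits "," into two empty strings and then discards both of them as trailing empty strings, so the result is an empty array.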
String.split silently discards trailing separators
see guava StringsExplained too

Everything happens according to plan, but let's do it step by step (I hope you have some time).
According to documentation (and source code) of split(String regex) method:
This method works as if by invoking the two-argument split method with the given expression and a limit argument of zero.
So when you invoke
split(String regex)
you are actually getting result from the split(String regex, int limit) method which is invoked in a way:
split(regex, 0)
So here limit is set to 0.
You need to know a few things about this parameter:
If limit is positive you are limiting length of result array to a positive number you specified, so "axaxaxaxa".split("x",2) will return an array, ["a", "axaxaxa"], not ["a","a","a","a","a"].
If limit is 0 then you are not limiting the length of the result array. But it also means that any trailing empty strings will be removed. For example:
"fooXbarX".split("X")
will at start generate an array which will look like:
["foo", "bar", ""]
("barX" split on "X" generates "bar" and ""), but since split removes all trailing empty string, it will return
["foo", "bar"]
Behaviour of negative value of limit is similar to behaviour where limit is set to 0 (it will not limit length of result array). The only difference is that it will not remove empty strings from the end of the result array. In other words
"fooXbarX".split("X",-1)
will return ["foo", "bar", ""]
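The three limit cases described above can be checked side by side. This is only a small sketch; the class name SplitLimitDemo and the helper show are made up for illustration and are not part of the String API.

```java
import java.util.Arrays;

// Illustrative demo class; the names here are hypothetical.
final class SplitLimitDemo {
    // Formats input.split(regex, limit) so that empty strings stay visible.
    static String show(String input, String regex, int limit) {
        return Arrays.toString(input.split(regex, limit));
    }

    public static void main(String[] args) {
        // Positive limit: at most 2 elements, the rest stays unsplit.
        System.out.println(show("axaxaxaxa", "x", 2));  // [a, axaxaxa]
        // Zero limit: unlimited length, trailing empty strings removed.
        System.out.println(show("fooXbarX", "X", 0));   // [foo, bar]
        // Negative limit: unlimited length, trailing empty strings kept.
        System.out.println(show("fooXbarX", "X", -1));  // [foo, bar, ]
    }
}
```

The last line prints "[foo, bar, ]" because Arrays.toString renders the trailing empty string as nothing after the final comma.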
Let's take a look at the case,
",".split(",").length
which (as explained earlier) is same as
",".split(",", 0).length
This means that we are using a version of split which will not limit the length of the result array, but will remove all trailing empty strings, "". You need to understand that when we split one thing we are always getting two things.
In other words, if we split "abc" in place of b, we will get "a" and "c".
The tricky part is to understand that if we split "abc" in c we will get "ab" and "" (empty string).
Using this logic, if we split "," on , we will get "" and "" (two empty strings).
You can check it using split with negative limit:
for (String s: ",".split(",", -1)){
System.out.println("\""+s+"\"");
}
which will print
""
""
So as we see result array here is at first ["", ""].
But since by default we are using limit set to 0, all trailing empty strings will be removed. In this case, the result array contains only trailing empty strings, so all of them will be removed, leaving you with empty array [] which has length 0.
To answer the case with
"".split(",").length
you need to understand that removing trailing empty strings makes sense only if such trailing empty strings were the result of splitting (and most probably are not needed).
So if there were no places at which we could split, there is no chance that empty strings were created, so there is no point in running this "cleaning" process.
This information is mentioned in documentation of split(String regex, int limit) method where you can read:
If the expression does not match any part of the input then the resulting array has just one element, namely this string.
You can also see this behaviour in source code of this method (from Java 8):
public String[] split(String regex, int limit) {
    /* fastpath if the regex is a
     (1)one-char String and this character is not one of the
        RegEx's meta characters ".$|()[{^?*+\\", or
     (2)two-char String and the first char is the backslash and
        the second is not the ascii digit or ascii letter.
     */
    char ch = 0;
    if (((regex.value.length == 1 &&
         ".$|()[{^?*+\\".indexOf(ch = regex.charAt(0)) == -1) ||
         (regex.length() == 2 &&
          regex.charAt(0) == '\\' &&
          (((ch = regex.charAt(1))-'0')|('9'-ch)) < 0 &&
          ((ch-'a')|('z'-ch)) < 0 &&
          ((ch-'A')|('Z'-ch)) < 0)) &&
        (ch < Character.MIN_HIGH_SURROGATE ||
         ch > Character.MAX_LOW_SURROGATE))
    {
        int off = 0;
        int next = 0;
        boolean limited = limit > 0;
        ArrayList<String> list = new ArrayList<>();
        while ((next = indexOf(ch, off)) != -1) {
            if (!limited || list.size() < limit - 1) {
                list.add(substring(off, next));
                off = next + 1;
            } else {    // last one
                //assert (list.size() == limit - 1);
                list.add(substring(off, value.length));
                off = value.length;
                break;
            }
        }
        // If no match was found, return this
        if (off == 0)
            return new String[]{this};
        // Add remaining segment
        if (!limited || list.size() < limit)
            list.add(substring(off, value.length));
        // Construct result
        int resultSize = list.size();
        if (limit == 0) {
            while (resultSize > 0 && list.get(resultSize - 1).length() == 0) {
                resultSize--;
            }
        }
        String[] result = new String[resultSize];
        return list.subList(0, resultSize).toArray(result);
    }
    return Pattern.compile(regex).split(this, limit);
}
where you can find
if (off == 0)
return new String[]{this};
fragment which means
if (off == 0) - if off (position from which method should start searching for next possible match for regex passed as split argument) is still 0 after iterating over entire string, we didn't find any match, so the string was not split
return new String[]{this}; - in that case let's just return an array with original string (represented by this).
Since "," couldn't be found in "" even once, "".split(",") must return an array with one element (empty string on which you invoked split). This means that the length of this array is 1.
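To see both cases from the question next to each other, here is a minimal sketch. The class name BlankVersusCommaDemo and the helper parts are hypothetical, introduced only for this illustration.

```java
// Illustrative demo class; the names here are hypothetical.
final class BlankVersusCommaDemo {
    // Number of elements produced by splitting input on "," with the given limit.
    static int parts(String input, int limit) {
        return input.split(",", limit).length;
    }

    public static void main(String[] args) {
        // No match at all: the original (empty) string is returned as the only element.
        System.out.println(parts("", 0));   // 1
        // One match: ["", ""] is produced, then both trailing empty strings are removed.
        System.out.println(parts(",", 0));  // 0
        // A negative limit keeps the trailing empty strings.
        System.out.println(parts(",", -1)); // 2
        // With no match the limit makes no difference.
        System.out.println(parts("", -1));  // 1
    }
}
```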
By the way, Java 8 introduced another mechanism: it removes a leading empty string (if one was created during splitting) when the match at the start is zero-length, as with the regex "" or a look-around such as (?<!x). More info at: Why in Java 8 split sometimes removes empty strings at start of result array?

From the Java 1.7 Documentation
Splits the string around matches of the given regular expression.
split() method works as if by invoking the two-argument split method with the given expression and a limit argument of zero. Trailing empty strings are therefore not included in the resulting array.
In Case 1, "," does not match any part of blank, so the resulting array has just one element, namely the String itself.
split returns the entire (empty) String as that single element, so the length is 1.
In Case 2, comma.split(",") returns an empty array.
split() treats its argument as a regex and splits around each match; here both resulting pieces are empty strings, and they are discarded as trailing empty strings.
So the length is 0.
For Example(Documentation)
The string "boo:and:foo", yields the following results with these expressions:
Regex Result
: { "boo", "and", "foo" }
o { "b", "", ":and:f" }
Parameters:
regex - the delimiting regular expression
Returns:
the array of strings computed by splitting this string around matches of the given regular expression
Throws:
PatternSyntaxException - if the regular expression's syntax is invalid

From String class javadoc for the public String[] split(String regex) method:
Splits this string around matches of the given regular expression.
This method works as if by invoking the two-argument split method with the given expression and a limit argument of zero. Trailing empty strings are therefore not included in the resulting array.
In the first case, the expression does not match any part of the input so we got an array with only one element - the input.
In the second case, the expression matches input and split should return two empty strings; but, according to javadoc, they are discarded (because they are trailing and empty).

We can take a look into the source code of java.util.regex.Pattern which is behind String.split. Way down the rabbit hole the method
public String[] split(CharSequence input, int limit)
is invoked.
Input ""
For input "" this method is called as
String[] parts = split("", 0);
The interesting part of this method is:
int index = 0;
boolean matchLimited = limit > 0;
ArrayList<String> matchList = new ArrayList<>();
Matcher m = matcher(input);
while(m.find()) {
// Tichodroma: this will not happen for our input
}
// If no match was found, return this
if (index == 0)
return new String[] {input.toString()};
And that is what happens: new String[] {input.toString()} is returned.
Input ","
For input "," the interesting part is
// Construct result
int resultSize = matchList.size();
if (limit == 0)
while (resultSize > 0 && matchList.get(resultSize-1).equals(""))
resultSize--;
String[] result = new String[resultSize];
return matchList.subList(0, resultSize).toArray(result);
Here limit == 0, so the while loop drops both empty strings in matchList, resultSize ends up as 0, and new String[0] is returned.

From JDK 1.7
public String[] split(String regex, int limit) {
/* fastpath if the regex is a
(1)one-char String and this character is not one of the
RegEx's meta characters ".$|()[{^?*+\\", or
(2)two-char String and the first char is the backslash and
the second is not the ascii digit or ascii letter.
*/
char ch = 0;
if (((regex.count == 1 &&
".$|()[{^?*+\\".indexOf(ch = regex.charAt(0)) == -1) ||
(regex.length() == 2 &&
regex.charAt(0) == '\\' &&
(((ch = regex.charAt(1))-'0')|('9'-ch)) < 0 &&
((ch-'a')|('z'-ch)) < 0 &&
((ch-'A')|('Z'-ch)) < 0)) &&
(ch < Character.MIN_HIGH_SURROGATE ||
ch > Character.MAX_LOW_SURROGATE))
{
int off = 0;
int next = 0;
boolean limited = limit > 0;
ArrayList<String> list = new ArrayList<>();
while ((next = indexOf(ch, off)) != -1) {
if (!limited || list.size() < limit - 1) {
list.add(substring(off, next));
off = next + 1;
} else { // last one
//assert (list.size() == limit - 1);
list.add(substring(off, count));
off = count;
break;
}
}
// If no match was found, return this
if (off == 0)
return new String[] { this };
// Add remaining segment
if (!limited || list.size() < limit)
list.add(substring(off, count));
// Construct result
int resultSize = list.size();
if (limit == 0)
while (resultSize > 0 && list.get(resultSize-1).length() == 0)
resultSize--;
String[] result = new String[resultSize];
return list.subList(0, resultSize).toArray(result);
}
return Pattern.compile(regex).split(this, limit);
}
So for this case, the regex will be handled by the first if.
For the first case blank.split(",")
// If no match was found, return this
if (off == 0)
return new String[] { this };
So this function returns an array containing a single element when there is no match.
For the second case comma.split(",")
List<String> list = new ArrayList<>();
//...
int resultSize = list.size();
if (limit == 0)
while (resultSize > 0 && list.get(resultSize-1).length() == 0)
resultSize--;
String[] result = new String[resultSize];
return list.subList(0, resultSize).toArray(result);
As you can see, the final while loop removes all the empty elements at the end of the list, so resultSize is 0.

String blank = "";
String comma = ",";
System.out.println("Output1: "+blank.split(",").length); // case 1
System.out.println("Output2: "+comma.split(",").length); // case 2
case 1 - blank.split(",") returns an array containing just "": since there is no , in blank, you get the original string back, so the length is 1.
case 2 - comma.split(",") returns an empty array: both pieces around the , are empty strings and are dropped as trailing empty strings, so the length is 0. Escaping the comma makes no difference; use comma.split(",", -1) if you want the empty strings to be kept (length 2).
split() expects a regex as its argument and returns the substrings found around the matches of that regex. From the documentation:
The array returned by this method contains each substring of this
string that is terminated by another substring that matches the given
expression or is terminated by the end of the string.
Else
If the expression does not match any part of the input then the
resulting array has just one element, namely this string.

The API for the split method states that "If the expression does not match any part of the input then the resulting array has just one element, namely this string."
So, as the String blank doesn't contain a ",", a String[] with one element (i.e. blank itself) is returned.
For the String comma, "nothing" is left of the original string thus an empty array is returned.
This seems to be the best solution if you want to process the returned result, e. g.
String[] splits = aString.split(",");
for(String split: splits) {
// do something
}
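As a rough sketch of what such a processing loop sees: for "," the loop body never runs, while for "" it runs once with an empty string, so a guard against empty elements is often useful. The class SplitLoopDemo and the method fields are hypothetical names used only for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative demo class; the names here are hypothetical.
final class SplitLoopDemo {
    // Collects the comma-separated fields of aString, skipping empty ones.
    static List<String> fields(String aString) {
        List<String> result = new ArrayList<>();
        for (String split : aString.split(",")) {
            // "" yields one empty element and ",a" yields a leading one; drop them.
            if (!split.isEmpty()) {
                result.add(split);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(fields("a,b")); // [a, b]
        System.out.println(fields(","));   // []
        System.out.println(fields(""));    // []
    }
}
```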

Related

java programming on a Mac split [duplicate]

Before Java 8 when we split on empty string like
String[] tokens = "abc".split("");
split mechanism would split in places marked with |
|a|b|c|
because empty space "" exists before and after each character. So as result it would generate at first this array
["", "a", "b", "c", ""]
and later will remove trailing empty strings (because we didn't explicitly provide negative value to limit argument) so it will finally return
["", "a", "b", "c"]
In Java 8 split mechanism seems to have changed. Now when we use
"abc".split("")
we will get ["a", "b", "c"] array instead of ["", "a", "b", "c"].
My first guess was that maybe now leading empty strings are also removed just like trailing empty strings.
But this theory fails, since
"abc".split("a")
returns ["", "bc"], so leading empty string was not removed.
Can someone explain what is going on here? How rules of split have changed in Java 8?
The behavior of String.split (which calls Pattern.split) changes between Java 7 and Java 8.
Documentation
Comparing between the documentation of Pattern.split in Java 7 and Java 8, we observe the following clause being added:
When there is a positive-width match at the beginning of the input sequence then an empty leading substring is included at the beginning of the resulting array. A zero-width match at the beginning however never produces such empty leading substring.
The same clause is also added to String.split in Java 8, compared to Java 7.
Reference implementation
Let us compare the code of Pattern.split of the reference implementation in Java 7 and Java 8. The code is retrieved from grepcode, for version 7u40-b43 and 8-b132.
Java 7
public String[] split(CharSequence input, int limit) {
int index = 0;
boolean matchLimited = limit > 0;
ArrayList<String> matchList = new ArrayList<>();
Matcher m = matcher(input);
// Add segments before each match found
while(m.find()) {
if (!matchLimited || matchList.size() < limit - 1) {
String match = input.subSequence(index, m.start()).toString();
matchList.add(match);
index = m.end();
} else if (matchList.size() == limit - 1) { // last one
String match = input.subSequence(index,
input.length()).toString();
matchList.add(match);
index = m.end();
}
}
// If no match was found, return this
if (index == 0)
return new String[] {input.toString()};
// Add remaining segment
if (!matchLimited || matchList.size() < limit)
matchList.add(input.subSequence(index, input.length()).toString());
// Construct result
int resultSize = matchList.size();
if (limit == 0)
while (resultSize > 0 && matchList.get(resultSize-1).equals(""))
resultSize--;
String[] result = new String[resultSize];
return matchList.subList(0, resultSize).toArray(result);
}
Java 8
public String[] split(CharSequence input, int limit) {
int index = 0;
boolean matchLimited = limit > 0;
ArrayList<String> matchList = new ArrayList<>();
Matcher m = matcher(input);
// Add segments before each match found
while(m.find()) {
if (!matchLimited || matchList.size() < limit - 1) {
if (index == 0 && index == m.start() && m.start() == m.end()) {
// no empty leading substring included for zero-width match
// at the beginning of the input char sequence.
continue;
}
String match = input.subSequence(index, m.start()).toString();
matchList.add(match);
index = m.end();
} else if (matchList.size() == limit - 1) { // last one
String match = input.subSequence(index,
input.length()).toString();
matchList.add(match);
index = m.end();
}
}
// If no match was found, return this
if (index == 0)
return new String[] {input.toString()};
// Add remaining segment
if (!matchLimited || matchList.size() < limit)
matchList.add(input.subSequence(index, input.length()).toString());
// Construct result
int resultSize = matchList.size();
if (limit == 0)
while (resultSize > 0 && matchList.get(resultSize-1).equals(""))
resultSize--;
String[] result = new String[resultSize];
return matchList.subList(0, resultSize).toArray(result);
}
The addition of the following code in Java 8 excludes the zero-length match at the beginning of the input string, which explains the behavior above.
if (index == 0 && index == m.start() && m.start() == m.end()) {
// no empty leading substring included for zero-width match
// at the beginning of the input char sequence.
continue;
}
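The effect of that added check can be observed directly. The following is a small sketch that assumes it runs on Java 8 or later; the class name ZeroWidthSplitDemo and the helper show are hypothetical.

```java
import java.util.Arrays;

// Illustrative demo class; the names here are hypothetical.
final class ZeroWidthSplitDemo {
    // Formats input.split(regex) so that empty strings stay visible.
    static String show(String input, String regex) {
        return Arrays.toString(input.split(regex));
    }

    public static void main(String[] args) {
        // Zero-width match at index 0: no leading empty string (Java 8+).
        System.out.println(show("abc", ""));  // [a, b, c]
        // Positive-width match at index 0: the leading empty string is kept.
        System.out.println(show("abc", "a")); // [, bc]
    }
}
```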
Maintaining compatibility
Following behavior in Java 8 and above
To make split behave consistently across versions and be compatible with the behavior in Java 8:
If your regex can match zero-length string, just add (?!\A) at the end of the regex and wrap the original regex in non-capturing group (?:...) (if necessary).
If your regex can't match zero-length string, you don't need to do anything.
If you don't know whether the regex can match zero-length string or not, do both the actions in step 1.
(?!\A) checks that the match does not end at the beginning of the string; a match ending there could only be an empty match at the beginning of the string.
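A minimal sketch of the wrapping described in step 1 follows. The helper withoutLeadingEmptyMatch is a hypothetical name, not a library method, and the pattern x* is chosen here only because it can match the empty string.

```java
import java.util.Arrays;

// Illustrative demo class; the names here are hypothetical.
final class LeadingMatchCompat {
    // Wraps regex in a non-capturing group and appends (?!\A), so that
    // a zero-width match at the very beginning of the input is rejected.
    static String withoutLeadingEmptyMatch(String regex) {
        return "(?:" + regex + ")(?!\\A)";
    }

    public static void main(String[] args) {
        String regex = "x*"; // can match the empty string at every position
        String wrapped = withoutLeadingEmptyMatch(regex);
        System.out.println(wrapped);                                 // (?:x*)(?!\A)
        System.out.println(Arrays.toString("abc".split(wrapped)));   // [a, b, c]
    }
}
```

Because the wrapped pattern can no longer match at index 0 with zero width, the result does not depend on whether the runtime applies the Java 8 rule for leading empty substrings.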
Following behavior in Java 7 and prior
There is no general solution to make split backward-compatible with Java 7 and prior, short of replacing all instances of split to point to your own custom implementation.
This has been specified in the documentation of split(String regex, int limit):
When there is a positive-width match at the beginning of this string
then an empty leading substring is included at the beginning of the
resulting array. A zero-width match at the beginning however never
produces such empty leading substring.
In "abc".split("") you got a zero-width match at the beginning so the leading empty substring is not included in the resulting array.
However in your second snippet when you split on "a" you got a positive width match (1 in this case), so the empty leading substring is included as expected.
(Removed irrelevant source code)
There was a slight change in the docs for split() from Java 7 to Java 8. Specifically, the following statement was added:
When there is a positive-width match at the beginning of this string then an empty leading substring is included at the beginning of the resulting array. A zero-width match at the beginning however never produces such empty leading substring.
(emphasis mine)
The empty string split generates a zero-width match at the beginning, so an empty string is not included at the start of the resulting array in accordance with what is specified above. By contrast, your second example which splits on "a" generates a positive-width match at the start of the string, so an empty string is in fact included at the start of the resulting array.

Counting the vowels included between two consonants

I'm trying to find, in a sentence, the words that contain two vowels between two r's, using Java. So I read in the sentence and then have to find the words that match the criterion described above. For instance, given a string such as "roar soccer roster reader", the method matches should return true for the words "roar" and "roster".
This is the method I come up with, which is doing the job
public boolean matches(String singleWord)
{
// set count to -1. it will increase to 2 if a 'r' is found, it decreases for each vowel
int count = -1;
// loop through a single word
for (int i=0; i<singleWord.length(); i++){
// if a 'r' is found set the count to two
if(singleWord.charAt(i) == 'r'){
// when count it's 0 exit loop
if (count == 0)
return true;
count = 2;}
// if I find a vowel count decreases
else if(isVowel(singleWord.charAt(i))){
count--;}
}
return false;
}
but it seems a bit clumsy... any suggestion on how to improve it or make it simpler? thanx!!!
just in case, this is the isVowel method
private boolean isVowel(char c)
{
String s = c + "";
return "aeiou".contains(s);
}
You can do this using a straightforward algorithm without loops:
Find the index of the first 'r'
Find the index of the last 'r'
Cut the substring in between the two
Return true if removing all vowels from the substring shortens it by exactly two characters.
Here is how you can implement it:
boolean matches(String singleWord) {
int from = singleWord.indexOf('r');
int to = singleWord.lastIndexOf('r');
if (from < 0 || from == to) return false;
String sub = singleWord.substring(from+1, to);
return (sub.length() - sub.replaceAll("[aeiou]", "").length()) == 2;
}
Here is how it works step by step, using the word "roadster" as an example:
from = 0, to = 7
sub = "oadste"; length is 6
sub after replacement is "dst"; length is 3
The difference 6 - 3 is 3, not 2, so false is returned.
EDIT : The sequence must contain exactly two vowels, with no intervening 'r's.
This makes a problem slightly different, because the trick with the first and the last index no longer applies. However, a regex to match the desired sequence can be constructed relatively easily - here it is:
"r[^raeiou]*[aeiou][^raeiou]*[aeiou][^raeiou]*r"
In order to understand this regexp, all you need to know is that [...] matches any character inside brackets, [^...] matches any character except the ones in brackets, and * matches the preceding subexpression zero or more times.
The expression is lengthy, but it is composed of trivial pieces. It matches as follows:
An initial r
Zero or more non-vowels except r
The first vowel
Zero or more non-vowels except r
The second vowel
Zero or more non-vowels except r
The closing r
Here is a simple implementation:
boolean matches(String singleWord) {
return singleWord
.replaceAll("r[^raeiou]*[aeiou][^raeiou]*[aeiou][^raeiou]*r", "")
.length() != singleWord.length();
}
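The same regex can also be applied with a precompiled Pattern and Matcher.find(), which avoids building a new string just to compare lengths. This is only a sketch of that variant; the class name TwoVowelsBetweenR and the constant TWO_VOWELS are hypothetical.

```java
import java.util.regex.Pattern;

// Illustrative demo class; the names here are hypothetical.
final class TwoVowelsBetweenR {
    // r, then exactly two vowels separated by any non-vowel, non-r characters, then r.
    private static final Pattern TWO_VOWELS =
            Pattern.compile("r[^raeiou]*[aeiou][^raeiou]*[aeiou][^raeiou]*r");

    // True if the sequence occurs anywhere inside singleWord.
    static boolean matches(String singleWord) {
        return TWO_VOWELS.matcher(singleWord).find();
    }

    public static void main(String[] args) {
        for (String word : "roar soccer roster reader rarar".split(" ")) {
            System.out.println(word + ":" + matches(word));
        }
    }
}
```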
You can use a regular expression:
public static boolean matches(final String singleWord) {
return singleWord.matches(".*r([^aeiour]*[aeiou]){2}[^aeiour]*r.*");
}
Here is the test code:
for (String word: "roar soccer roster reader rarar".split(" "))
System.out.println(word+":"+matches(word));
And here is the output:
roar:true
soccer:false
roster:true
reader:false
rarar:false
You could also use a regular expression:
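java.util.regex.Pattern.matches("\\w*r\\w*([aeiou]\\w*){2}r\\w*", "roster");
Note that the backslashes must be doubled in a Java string literal, that Pattern.matches has to match the entire input and therefore must be called once per word rather than on the whole sentence, and that \w also matches vowels and r, so this pattern accepts words such as "reader" as well.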

Java split by newline when string is all newlines

When I have a string like \n\n\n, and I split by \\n, I get 0. Why is this?
public class Test {
public static void main(String []args){
String str = "\n\n\n";
String[] lines = str.split("\\n");
System.out.println(lines.length);
}
}
You can copy & paste the code into CompileOnline.
The token that you split on is not part of the result. Since there is nothing else, there is no item to put in the array.
This is different when you add another character to your base string though. When you do that, it will include the empty entries after all.
This can be explained by looking at the source code in java.lang.String:2305.
Consider the following excerpt:
// Construct result
int resultSize = list.size();
if (limit == 0)
while (resultSize > 0 && list.get(resultSize - 1).length() == 0)
resultSize--;
String[] result = new String[resultSize];
return list.subList(0, resultSize).toArray(result);
If you have 3 empty entries as in your case, resultSize will count down to 0 and essentially return an empty array.
If you have 3 empty entries and one filled one (with the random character you added to the end), resultSize will not move from 4 and thus you will get an array of 4 items where the first 3 are empty.
Basically it will remove all the trailing empty values.
String str = "\n\n\n"; // Returns length 0
String str = "\n\n\nb"; // Returns length 4
String str = "\n\n\nb\n\n"; // Returns length 4
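The three lengths listed above can be reproduced with a short runnable sketch. The class name NewlineSplitDemo and the helper lineCount are hypothetical names chosen for this example.

```java
// Illustrative demo class; the names here are hypothetical.
final class NewlineSplitDemo {
    // Number of elements returned by splitting str on newlines with the default limit of 0.
    static int lineCount(String str) {
        return str.split("\\n").length;
    }

    public static void main(String[] args) {
        // Only delimiters: every piece is an empty trailing string and is removed.
        System.out.println(lineCount("\n\n\n"));      // 0
        // The final "b" is not empty, so the three leading empty strings are kept.
        System.out.println(lineCount("\n\n\nb"));     // 4
        // Empty strings after "b" are trailing and removed; those before it stay.
        System.out.println(lineCount("\n\n\nb\n\n")); // 4
    }
}
```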
As said in the String javadoc:
This method works as if by invoking the two-argument split method with the given expression and a limit argument of zero. Trailing empty strings are therefore not included in the resulting array.
So, when you split() a String made entirely of delimiters (whatever the delimiter is), you will get only empty Strings, the delimiter not being included in the result, and, thus, they will all be considered as trailing empty strings, and not be included in the resulting array.
If you want to get everything, including the empty strings, you have two choices:
add something that is not a delimiter at the end of the String:
String str = "\n\n\ne";
String[] lines = str.split("\\n");
System.out.println(lines.length); // prints "4"
use the two-argument split method with a negative limit:
String str = "\n\n\n";
String[] lines = str.split("\\n", -1);
System.out.println(lines.length); // prints "4"
Because your string contains nothing but \n characters.
str.split("\n") removes the delimiters and is left with only empty strings. All of them are trailing, so they are discarded. The resulting lines[] array is therefore empty (not null), and its length is 0.

How to remove leading and trailing whitespace from the string in Java?

I want to remove the leading and trailing whitespace from string:
String s = " Hello World ";
I want the result to be like:
s == "Hello World";
s.trim()
see String#trim()
Without any internal method, use regex like
s.replaceAll("^\\s+", "").replaceAll("\\s+$", "")
or
s.replaceAll("^\\s+|\\s+$", "")
or just use pattern in pure form
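import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = " Hello World ";
Pattern trimmer = Pattern.compile("^\\s+|\\s+$");
Matcher m = trimmer.matcher(s);
StringBuffer out = new StringBuffer();
// Replace each match (leading or trailing whitespace) with nothing
while (m.find())
    m.appendReplacement(out, "");
m.appendTail(out);
System.out.println(out + "!"); // prints "Hello World!"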
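If the pattern is going to be reused, one possible variation is to compile it once and let Matcher.replaceAll do the loop. This is a sketch rather than part of the original answer; the class name RegexTrim and the constant EDGES are assumptions made for the example.

```java
import java.util.regex.Pattern;

public class RegexTrim {
    // Compiled once and reused; matches a run of whitespace at the start or at the end.
    private static final Pattern EDGES = Pattern.compile("^\\s+|\\s+$");

    // Removes leading and trailing whitespace, leaving inner whitespace alone.
    static String trim(String s) {
        return EDGES.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(trim(" Hello World ") + "!"); // Hello World!
    }
}
```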
String s="Test ";
s= s.trim();
I prefer not to use regular expressions for trivial problems. This would be a simple option:
public static String trim(final String s) {
final StringBuilder sb = new StringBuilder(s);
while (sb.length() > 0 && Character.isWhitespace(sb.charAt(0)))
sb.deleteCharAt(0); // delete from the beginning
while (sb.length() > 0 && Character.isWhitespace(sb.charAt(sb.length() - 1)))
sb.deleteCharAt(sb.length() - 1); // delete from the end
return sb.toString();
}
Use the String class trim method. It will remove all leading and trailing whitespace.
http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/String.html
String s=" Hello World ";
s = s.trim();
Simply use trim(). It only removes the excess whitespace at the start and end of a string.
String fav = " I like apple ";
fav = fav.trim();
System.out.println(fav);
Output:
I like apple //no extra space at start and end of the string
String.trim() answers the question but was not an option for me.
As stated here :
it simply regards anything up to and including U+0020 (the usual space character) as whitespace, and anything above that as non-whitespace.
This results in it trimming the U+0020 space character and all “control code” characters below U+0020 (including the U+0009 tab character), but not the control codes or Unicode space characters that are above that.
I am working with Japanese text, which contains full-width characters; the full-width space (U+3000) would not be trimmed by String.trim().
I therefore made a function which, like xehpuk's snippet, uses Character.isWhitespace().
However, this version is not using a StringBuilder and instead of deleting characters, finds the 2 indexes it needs to take a trimmed substring out of the original String.
public static String trimWhitespace(final String stringToTrim) {
int endIndex = stringToTrim.length();
// Return the string if it's empty
if (endIndex == 0) return stringToTrim;
int firstIndex = -1;
// Find first character which is not a whitespace, if any
// (increment from beginning until either first non whitespace character or end of string)
while (++firstIndex < endIndex && Character.isWhitespace(stringToTrim.charAt(firstIndex))) { }
// If firstIndex did not reach end of string, Find last character which is not a whitespace,
// (decrement from end until last non whitespace character)
while (--endIndex > firstIndex && Character.isWhitespace(stringToTrim.charAt(endIndex))) { }
// Return substring using indexes
return stringToTrim.substring(firstIndex, endIndex + 1);
}
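The difference between the two definitions of whitespace can be checked directly. The following is a minimal sketch; FullWidthSpaceDemo and trimRemoves are hypothetical names introduced for this example only.

```java
public class FullWidthSpaceDemo {
    // Hypothetical helper: true if String.trim() strips c from both ends of a string.
    static boolean trimRemoves(char c) {
        return ("" + c + "x" + c).trim().equals("x");
    }

    public static void main(String[] args) {
        char fullWidthSpace = '\u3000'; // IDEOGRAPHIC SPACE
        System.out.println(trimRemoves(' '));                       // true
        System.out.println(trimRemoves('\t'));                      // true
        System.out.println(trimRemoves(fullWidthSpace));            // false
        System.out.println(Character.isWhitespace(fullWidthSpace)); // true
    }
}
```

So a trim built on Character.isWhitespace() handles the full-width space, while String.trim() leaves it in place.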
s = s.trim();
More info:
http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#trim()
Why do you not want to use predefined methods? They are usually most efficient.
See String#trim() method
Since Java 11, the String class has a strip() method, which returns a string whose value is this string with all leading and trailing white space removed. It was introduced to overcome the limitations of trim(), which only treats characters up to U+0020 as white space.
Docs: https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/String.html#strip()
Example:
String str = " abc ";
// public String strip()
str = str.strip(); // Returns abc
There are two more useful methods in Java 11+ String class:
stripLeading() : Returns a string whose value is this string,
with all leading white space removed.
stripTrailing() : Returns a string whose value is this string,
with all trailing white space removed.
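A short sketch of the three methods side by side, assuming Java 11 or later. The class name StripDemo and the helper show are made up for this example.

```java
public class StripDemo {
    // Hypothetical helper: brackets make any remaining white space visible.
    static String show(String s) {
        return "[" + s + "]";
    }

    public static void main(String[] args) {
        String s = "  abc  ";
        System.out.println(show(s.strip()));         // [abc]
        System.out.println(show(s.stripLeading()));  // [abc  ]
        System.out.println(show(s.stripTrailing())); // [  abc]

        // A full-width space (U+3000) is removed by strip() but not by trim()
        String wide = "\u3000abc\u3000";
        System.out.println(wide.strip().length());   // 3
        System.out.println(wide.trim().length());    // 5
    }
}
```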
While @xehpuk's method is good if you want to avoid regex, it has O(n^2) worst-case time complexity, because each deleteCharAt(0) shifts all remaining characters. The following solution also avoids regex, but is O(n):
// Wrapped in a method so the snippet compiles; note that only ' ' (U+0020) counts as whitespace here.
public static String trim(final String s) {
    if (s.length() == 0)
        return "";
    char left = s.charAt(0);
    char right = s.charAt(s.length() - 1);
    int leftWhitespace = 0;  // spaces counted from the start
    int rightWhitespace = 0; // spaces counted from the end
    boolean leftBeforeRight = leftWhitespace < s.length() - 1 - rightWhitespace;
    while ((left == ' ' || right == ' ') && leftBeforeRight) {
        if (left == ' ') {
            leftWhitespace++;
            left = s.charAt(leftWhitespace);
        }
        if (right == ' ') {
            rightWhitespace++;
            right = s.charAt(s.length() - 1 - rightWhitespace);
        }
        leftBeforeRight = leftWhitespace < s.length() - 1 - rightWhitespace;
    }
    String result = s.substring(leftWhitespace, s.length() - rightWhitespace);
    // An all-space string of odd length leaves a single space behind
    return result.equals(" ") ? "" : result;
}
This counts the whitespace characters at the beginning and at the end of the string, until either the "left" and "right" indices derived from those counts meet, or both indices have reached a non-whitespace character. Afterwards, we return either the substring obtained using the whitespace counts, or the empty string if the result is a single whitespace character (needed to account for all-whitespace strings with an odd number of characters).
