Java split by newline when string is all newlines - java

When I have a string like \n\n\n and I split it by \\n, the resulting array has length 0. Why is this?
public class Test {
public static void main(String []args){
String str = "\n\n\n";
String[] lines = str.split("\\n");
System.out.println(lines.length);
}
}
You can copy & paste the code into CompileOnline.

The token that you split on is not part of the result. Since the string contains nothing else, only empty entries remain, and those are dropped, so there is no item to put in the array.
This is different when you add another character to the end of your base string, though. When you do that, the empty entries before it are included after all.
This can be explained by looking at the source code in java.lang.String:2305.
Consider the following excerpt:
// Construct result
int resultSize = list.size();
if (limit == 0)
while (resultSize > 0 && list.get(resultSize - 1).length() == 0)
resultSize--;
String[] result = new String[resultSize];
return list.subList(0, resultSize).toArray(result);
If you have 3 empty entries as in your case, resultSize will count down to 0 and essentially return an empty array.
If you have 3 empty entries and one filled one (with the random character you added to the end), resultSize will not move from 4 and thus you will get an array of 4 items where the first 3 are empty.
Basically it will remove all the trailing empty values.
System.out.println("\n\n\n".split("\\n").length); // prints 0
System.out.println("\n\n\nb".split("\\n").length); // prints 4
System.out.println("\n\n\nb\n\n".split("\\n").length); // prints 4

As said in the String javadoc:
This method works as if by invoking the two-argument split method with the given expression and a limit argument of zero. Trailing empty strings are therefore not included in the resulting array.
So when you split() a String made entirely of delimiters (whatever the delimiter is), you get only empty Strings, because the delimiter itself is not included in the result. All of them count as trailing empty strings, so none is included in the resulting array.
If you want to get everything, including the empty strings, you have two choices:
add something that is not a delimiter at the end of the String:
String str = "\n\n\ne";
String[] lines = str.split("\\n");
System.out.println(lines.length); // prints "4"
use the two-argument split method with a negative limit:
String str = "\n\n\n";
String[] lines = str.split("\\n", -1);
System.out.println(lines.length); // prints "4"

Because your string contains just \n
str.split("\\n") finds only empty strings: one before each \n and one after the last. With the default limit of 0 they are all trailing empty strings and are discarded. Therefore you obtain 0, because lines[] is an empty array (it is not null).
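A quick way to check what split actually returns for the string from the question. This is only a minimal sketch; the class and method names are made up for illustration.

```java
class AllDelimitersDemo {
    // One-argument split on the all-newline string from the question.
    static String[] splitDefault() {
        return "\n\n\n".split("\\n");
    }

    public static void main(String[] args) {
        String[] lines = splitDefault();
        System.out.println(lines == null);  // false: an array is returned
        System.out.println(lines.length);   // 0: every piece was a trailing empty string
    }
}
```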

Related

Java, splitting string into array

I am trying to split a string into string array. And I have stumbled to something strange to me. I don't understand why it works like this.
String one, two;
one = "";
two = ":";
String[] devided1 = one.trim().split(":");
String[] devided2 = two.trim().split(":");
System.out.println("size: "+ devided1.length);
System.out.println("size: "+ devided2.length);
I get output:
size: 1
size: 0
Why is empty string giving me size of one, but string that only has the delimiter gives my array size of 0?
I saw more confusing things like: that size of "::" is 0, but size of ": :" is 2, not 3.
Can someone please explain it to me?
See the doc comment in source code or documentation for public String[] split(String regex, int limit) method.
Case 1:
String one = "";
String[] devided1 = one.trim().split(":");
The resulting array will have 1 element, the original string (String[1] [""]), because the expression ":" did not match any part of the input string.
According to documentation:
If the expression does not match any part of the input then the resulting array
has just one element, namely this string.
Case 2:
String two = ":";
String[] devided2 = two.trim().split(":");
split(":") uses the default limit = 0, which means that trailing empty strings are removed from the resulting array. So the method splits the ":" string into an array with two empty strings and then removes them, and as a result we get an empty array.
According to documentation:
If limit is zero then the pattern will be applied as many times as
possible, the array can have any length, and trailing empty strings
will be discarded.
Case 3:
String two = ":";
String[] devided2 = two.trim().split(":", -1);
We will get an array with two empty strings.
According to documentation:
If limit is non-positive then the pattern will be applied as many
times as possible and the array can have any length
Case 4:
String two = "::";
String[] devided2 = two.trim().split(":");
We will get an empty array. It is the same as Case 2: the split produces three empty strings, all of them trailing, and all are removed.
Case 5:
String one = ": :";
String[] devided1 = one.trim().split(":");
The method splits the string into three array elements ["", " ", ""] and then removes the empty strings from the end of the array, because limit = 0. We get String[2] ["", " "].
According to documentation:
If limit is zero then the pattern will be applied as many times as
possible, the array can have any length, and trailing empty strings
will be discarded.
This link is helpful:
https://konigsberg.blogspot.com/2009/11/final-thoughts-java-puzzler-splitting.html
Basically, it is for perl compatibility.
You can use split(":", -1) here if you don't want that behavior.
Otherwise, split(":") defaults to split(":", 0), and the difference is:
https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/String.html#split(java.lang.String,int)
If the limit is zero then the pattern will be applied as many times as possible, the array can have any length, and trailing empty strings will be discarded.
If the limit is negative then the pattern will be applied as many times as possible and the array can have any length.
When ":" is split, the result would be {"", ""}, but trailing empty strings are discarded, so an empty array is returned.
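To see all of the cases from the question side by side, here is a small sketch that prints each result with limit 0 and limit -1. The class name and the show helper are hypothetical; the elements are quoted so that empty strings stay visible.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

class ColonSplitDemo {
    // Splits on ":" with the given limit and renders every element in quotes.
    static String show(String input, int limit) {
        return Arrays.stream(input.split(":", limit))
                .map(part -> "\"" + part + "\"")
                .collect(Collectors.joining(", ", "[", "]"));
    }

    public static void main(String[] args) {
        for (String input : new String[] {"", ":", "::", ": :"}) {
            System.out.println("\"" + input + "\"  limit 0: " + show(input, 0)
                    + "  limit -1: " + show(input, -1));
        }
    }
}
```

With limit 0, ":" and "::" both come out as [], while ": :" keeps its leading empty string and the space.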

Java String split inconsistency

If I split "hello|" and "|hello" with "|" character, then I get one value for the first and two values for the second version.
String[] arr1 = new String("hello|").split("\\|");
String[] arr2 = new String("|hello").split("\\|");
System.out.println("arr1 length: " + arr1.length + "\narr2 length: " + arr2.length);
This prints out:
arr1 length: 1
arr2 length: 2
Why is this?
According to the Java docs, split creates an empty String if the first character is the separator, but doesn't create an empty String (or empty Strings) if the last character (or consecutive characters) is the separator. You will get the same behavior regardless of the separator you use.
Trailing empty Strings are not included in the array; see the following statement from the documentation.
String#split
This method works as if by invoking the two-argument split method with
the given expression and a limit argument of zero. Trailing empty
strings are therefore not included in the resulting array.
String#split always returns the array of strings computed by splitting this string around matches of the given regular expression.
Check the source code for the answer: http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/util/regex/Pattern.java#Pattern.compile%28java.lang.String%29
The last lines contain the answer:
int resultSize = matchList.size();
if (limit == 0)
while (resultSize > 0 && matchList.get(resultSize-1).equals(""))
resultSize--;
String[] result = new String[resultSize];
So the end will not be included if it is empty.
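The asymmetry between a leading and a trailing separator can be reproduced with a few lines. This is just an illustrative sketch; PipeSplitDemo and count are names invented here.

```java
class PipeSplitDemo {
    // Number of pieces produced by splitting on a literal '|' with the given limit.
    static int count(String input, int limit) {
        return input.split("\\|", limit).length;
    }

    public static void main(String[] args) {
        System.out.println(count("hello|", 0));   // 1: the trailing "" is dropped
        System.out.println(count("|hello", 0));   // 2: the leading "" is kept
        System.out.println(count("hello|", -1));  // 2: a negative limit keeps the trailing ""
    }
}
```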

Confusing output from String.split

I do not understand the output of this code:
public class StringDemo{
public static void main(String args[]) {
String blank = "";
String comma = ",";
System.out.println("Output1: "+blank.split(",").length);
System.out.println("Output2: "+comma.split(",").length);
}
}
And got the following output:
Output1: 1
Output2: 0
Documentation:
For: System.out.println("Output1: "+blank.split(",").length);
The array returned by this method contains each substring of this string that is terminated by another substring that matches the given expression or is terminated by the end of the string. The substrings in the array are in the order in which they occur in this string. If the expression does not match any part of the input then the resulting array has just one element, namely this string.
It will simply return the entire string that's why it returns 1.
For the second case, String.split discards the two empty strings on either side of the , (both are trailing), so the result is an empty array.
String.split silently discards trailing separators
see guava StringsExplained too
Everything happens according to plan, but let's do it step by step (I hope you have some time).
According to documentation (and source code) of split(String regex) method:
This method works as if by invoking the two-argument split method with the given expression and a limit argument of zero.
So when you invoke
split(String regex)
you are actually getting result from the split(String regex, int limit) method which is invoked in a way:
split(regex, 0)
So here limit is set to 0.
You need to know a few things about this parameter:
If limit is positive you are limiting length of result array to a positive number you specified, so "axaxaxaxa".split("x",2) will return an array, ["a", "axaxaxa"], not ["a","a","a","a","a"].
If limit is 0 then you are not limiting the length of the result array. But it also means that any trailing empty strings will be removed. For example:
"fooXbarX".split("X")
will at start generate an array which will look like:
["foo", "bar", ""]
("barX" split on "X" generates "bar" and ""), but since split removes all trailing empty string, it will return
["foo", "bar"]
Behaviour of negative value of limit is similar to behaviour where limit is set to 0 (it will not limit length of result array). The only difference is that it will not remove empty strings from the end of the result array. In other words
"fooXbarX".split("X",-1)
will return ["foo", "bar", ""]
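The three kinds of limit described above can be compared directly. The following is a minimal sketch using the same inputs as the text; the LimitDemo class and its split wrapper are only there for illustration.

```java
import java.util.Arrays;

class LimitDemo {
    // Thin wrapper so the three limit cases can be compared side by side.
    static String[] split(String input, String regex, int limit) {
        return input.split(regex, limit);
    }

    public static void main(String[] args) {
        // Positive limit: at most 2 elements, the remainder is left unsplit.
        System.out.println(Arrays.toString(split("axaxaxaxa", "x", 2)));  // [a, axaxaxa]
        // Limit 0: no cap on the length, trailing empty strings are removed.
        System.out.println(split("fooXbarX", "X", 0).length);             // 2
        // Negative limit: no cap on the length, trailing empty strings are kept.
        System.out.println(split("fooXbarX", "X", -1).length);            // 3
    }
}
```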
Let's take a look at the case,
",".split(",").length
which (as explained earlier) is same as
",".split(",", 0).length
This means that we are using a version of split which will not limit the length of the result array, but will remove all trailing empty strings, "". You need to understand that when we split one thing we are always getting two things.
In other words, if we split "abc" on b, we will get "a" and "c".
The tricky part is to understand that if we split "abc" on c we will get "ab" and "" (empty string).
Using this logic, if we split "," on , we will get "" and "" (two empty strings).
You can check it using split with negative limit:
for (String s: ",".split(",", -1)){
System.out.println("\""+s+"\"");
}
which will print
""
""
So as we see result array here is at first ["", ""].
But since by default we are using limit set to 0, all trailing empty strings will be removed. In this case, the result array contains only trailing empty strings, so all of them will be removed, leaving you with empty array [] which has length 0.
To answer the case with
"".split(",").length
you need to understand that removing trailing empty strings makes sense only if such trailing empty strings were the result of splitting (and most probably are not needed).
So if there were no places at which we could split, there is no chance that empty strings were created, so there is no point in running this "cleaning" process.
This information is mentioned in documentation of split(String regex, int limit) method where you can read:
If the expression does not match any part of the input then the resulting array has just one element, namely this string.
You can also see this behaviour in source code of this method (from Java 8):
public String[] split(String regex, int limit) {
    /* fastpath if the regex is a
     (1)one-char String and this character is not one of the
        RegEx's meta characters ".$|()[{^?*+\\", or
     (2)two-char String and the first char is the backslash and
        the second is not the ascii digit or ascii letter.
     */
    char ch = 0;
    if (((regex.value.length == 1 &&
         ".$|()[{^?*+\\".indexOf(ch = regex.charAt(0)) == -1) ||
         (regex.length() == 2 &&
          regex.charAt(0) == '\\' &&
          (((ch = regex.charAt(1))-'0')|('9'-ch)) < 0 &&
          ((ch-'a')|('z'-ch)) < 0 &&
          ((ch-'A')|('Z'-ch)) < 0)) &&
        (ch < Character.MIN_HIGH_SURROGATE ||
         ch > Character.MAX_LOW_SURROGATE))
    {
        int off = 0;
        int next = 0;
        boolean limited = limit > 0;
        ArrayList<String> list = new ArrayList<>();
        while ((next = indexOf(ch, off)) != -1) {
            if (!limited || list.size() < limit - 1) {
                list.add(substring(off, next));
                off = next + 1;
            } else {    // last one
                //assert (list.size() == limit - 1);
                list.add(substring(off, value.length));
                off = value.length;
                break;
            }
        }
        // If no match was found, return this
        if (off == 0)
            return new String[]{this};

        // Add remaining segment
        if (!limited || list.size() < limit)
            list.add(substring(off, value.length));

        // Construct result
        int resultSize = list.size();
        if (limit == 0) {
            while (resultSize > 0 && list.get(resultSize - 1).length() == 0) {
                resultSize--;
            }
        }
        String[] result = new String[resultSize];
        return list.subList(0, resultSize).toArray(result);
    }
    return Pattern.compile(regex).split(this, limit);
}
where you can find
if (off == 0)
return new String[]{this};
fragment which means
if (off == 0) - if off (position from which method should start searching for next possible match for regex passed as split argument) is still 0 after iterating over entire string, we didn't find any match, so the string was not split
return new String[]{this}; - in that case let's just return an array with original string (represented by this).
Since "," couldn't be found in "" even once, "".split(",") must return an array with one element (empty string on which you invoked split). This means that the length of this array is 1.
BTW, Java 8 introduced another mechanism: it removes a leading empty string (if one was created by the splitting process) when the match at the start of the string is zero-width, for example when splitting with "" or with a look-around such as (?<!x). More info at: Why in Java 8 split sometimes removes empty strings at start of result array?
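A small sketch of that zero-width case, assuming it is run on Java 8 or later (older versions returned an extra empty string at the start). The class and method names are invented for this example.

```java
import java.util.Arrays;

class ZeroWidthSplitDemo {
    // Splitting on the empty regex matches between every pair of characters,
    // including a zero-width match at position 0.
    static String[] splitEveryChar(String input) {
        return input.split("");
    }

    public static void main(String[] args) {
        // On Java 8 and later the zero-width match at the start adds no leading "".
        System.out.println(Arrays.toString(splitEveryChar("abc")));  // [a, b, c]
    }
}
```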
From the Java 1.7 Documentation
Splits the string around matches of the given regular expression.
split() method works as if by invoking the two-argument split method with the given expression and a limit argument of zero. Trailing empty strings are therefore not included in the resulting array.
In Case 1, blank.split(",") does not match any part of the input, so the resulting array has just one element, namely this String.
It returns the entire String, so the length is 1.
In Case 2, comma.split(",") returns an empty array.
split() takes a regex as its argument and splits the string around its matches; here that yields only empty strings, which are trailing and therefore discarded.
So the length is 0.
For Example(Documentation)
The string "boo:and:foo", yields the following results with these expressions:
Regex    Result
:        { "boo", "and", "foo" }
o        { "b", "", ":and:f" }
Parameters:
regex - the delimiting regular expression
Returns:
the array of strings computed by splitting this string around matches of the given regular expression
Throws:
PatternSyntaxException - if the regular expression's syntax is invalid
From String class javadoc for the public String[] split(String regex) method:
Splits this string around matches of the given regular expression.
This method works as if by invoking the two-argument split method with the given expression and a limit argument of zero. Trailing empty strings are therefore not included in the resulting array.
In the first case, the expression does not match any part of the input so we got an array with only one element - the input.
In the second case, the expression matches input and split should return two empty strings; but, according to javadoc, they are discarded (because they are trailing and empty).
We can take a look into the source code of java.util.regex.Pattern which is behind String.split. Way down the rabbit hole the method
public String[] split(CharSequence input, int limit)
is invoked.
Input ""
For input "" this method is called as
String[] parts = split("", 0);
The interesting part of this method is:
int index = 0;
boolean matchLimited = limit > 0;
ArrayList<String> matchList = new ArrayList<>();
Matcher m = matcher(input);
while(m.find()) {
// Tichodroma: this will not happen for our input
}
// If no match was found, return this
if (index == 0)
return new String[] {input.toString()};
And that is what happens: new String[] {input.toString()} is returned.
Input ","
For input "," the interesting part is
// Construct result
int resultSize = matchList.size();
if (limit == 0)
while (resultSize > 0 && matchList.get(resultSize-1).equals(""))
resultSize--;
String[] result = new String[resultSize];
return matchList.subList(0, resultSize).toArray(result);
Here limit == 0 and matchList holds two empty strings, so resultSize counts down from 2 to 0 and new String[0] is returned.
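The same two inputs can also be fed to java.util.regex.Pattern directly, which shows that String.split and Pattern.split agree. This is a sketch for illustration; PatternSplitDemo and count are hypothetical names.

```java
import java.util.regex.Pattern;

class PatternSplitDemo {
    // Calls Pattern.split(CharSequence, int) with "," as the pattern.
    static int count(String input, int limit) {
        return Pattern.compile(",").split(input, limit).length;
    }

    public static void main(String[] args) {
        System.out.println(count("", 0));    // 1: no match, the input itself is returned
        System.out.println(count(",", 0));   // 0: two empty strings, both trimmed
        System.out.println(count(",", -1));  // 2: a negative limit skips the trimming
    }
}
```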
From JDK 1.7
public String[] split(String regex, int limit) {
/* fastpath if the regex is a
(1)one-char String and this character is not one of the
RegEx's meta characters ".$|()[{^?*+\\", or
(2)two-char String and the first char is the backslash and
the second is not the ascii digit or ascii letter.
*/
char ch = 0;
if (((regex.count == 1 &&
".$|()[{^?*+\\".indexOf(ch = regex.charAt(0)) == -1) ||
(regex.length() == 2 &&
regex.charAt(0) == '\\' &&
(((ch = regex.charAt(1))-'0')|('9'-ch)) < 0 &&
((ch-'a')|('z'-ch)) < 0 &&
((ch-'A')|('Z'-ch)) < 0)) &&
(ch < Character.MIN_HIGH_SURROGATE ||
ch > Character.MAX_LOW_SURROGATE))
{
int off = 0;
int next = 0;
boolean limited = limit > 0;
ArrayList<String> list = new ArrayList<>();
while ((next = indexOf(ch, off)) != -1) {
if (!limited || list.size() < limit - 1) {
list.add(substring(off, next));
off = next + 1;
} else { // last one
//assert (list.size() == limit - 1);
list.add(substring(off, count));
off = count;
break;
}
}
// If no match was found, return this
if (off == 0)
return new String[] { this };
// Add remaining segment
if (!limited || list.size() < limit)
list.add(substring(off, count));
// Construct result
int resultSize = list.size();
if (limit == 0)
while (resultSize > 0 && list.get(resultSize-1).length() == 0)
resultSize--;
String[] result = new String[resultSize];
return list.subList(0, resultSize).toArray(result);
}
return Pattern.compile(regex).split(this, limit);
}
So for this case, the regex will be handled by the first if.
For the first case blank.split(",")
// If no match was found, return this
if (off == 0)
return new String[] { this };
So this function returns an array containing one element (the string itself) if there is no match.
For the second case comma.split(",")
List<String> list = new ArrayList<>();
//...
int resultSize = list.size();
if (limit == 0)
while (resultSize > 0 && list.get(resultSize-1).length() == 0)
resultSize--;
String[] result = new String[resultSize];
return list.subList(0, resultSize).toArray(result);
As you can see, the last while loop removes all empty elements at the end of the list, so resultSize is 0.
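To look at the trimming step in isolation, the loop can be lifted into a tiny helper and applied to lists that stand in for the pieces collected before trimming. This is a simplified sketch, not the JDK code itself; TrailingTrimSketch and trimmedSize are invented names.

```java
import java.util.Arrays;
import java.util.List;

class TrailingTrimSketch {
    // Same counting-down loop as in the excerpt: skip empty strings at the end only.
    static int trimmedSize(List<String> pieces) {
        int resultSize = pieces.size();
        while (resultSize > 0 && pieces.get(resultSize - 1).isEmpty()) {
            resultSize--;
        }
        return resultSize;
    }

    public static void main(String[] args) {
        System.out.println(trimmedSize(Arrays.asList("", "")));            // 0: pieces of ","
        System.out.println(trimmedSize(Arrays.asList("foo", "bar", "")));  // 2: pieces of "foo,bar,"
        System.out.println(trimmedSize(Arrays.asList("", "b")));           // 2: a leading "" is not touched
    }
}
```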
String blank = "";
String comma = ",";
System.out.println("Output1: "+blank.split(",").length); // case 1
System.out.println("Output2: "+comma.split(",").length); // case 2
case 1 - blank.split(",") returns an array containing "" itself, because there is no , in blank, so the length is 1.
case 2 - comma.split(",") returns an empty array: the split yields two empty strings, both are trailing, and both are discarded, so the length is 0. Escaping the comma does not change this; use comma.split(",", -1) if you want both empty strings kept (length 2).
Again, split() expects a regex as its argument and returns the substrings around the matches of that regex:
The array returned by this method contains each substring of this
string that is terminated by another substring that matches the given
expression or is terminated by the end of the string.
Else
If the expression does not match any part of the input then the
resulting array has just one element, namely this string.
The API for the split method states that "If the expression does not match any part of the input then the resulting array has just one element, namely this string."
So, as the String blank doesn't contain a ",", a String[] with one element (i.e. blank itself) is returned.
For the String comma, "nothing" is left of the original string thus an empty array is returned.
This seems to be the best solution if you want to process the returned result, e. g.
String[] splits = aString.split(",");
for(String split: splits) {
// do something
}

Why does String.split(regex) in Java return an array with fewer elements than are actually present?

The number of elements returned is less than what I'd expected when I run String.split()
Example:- The actual string is "country,12345,2,1,,1,,", so 8 elements were expected in array returned, but the size of array was "6"
Code:-
String line1 = "country,12345,2,1,,1,,";
String data1[] = line1.split(",");
System.out.println("Length : "+data1.length);
Output:-
Length : 6
Why is it so?
Because the single-argument split method drops trailing empty fields. If you want to preserve them use the two-argument version, with a negative limit parameter.
String data1[] = line1.split(",", -1);
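A short sketch comparing both calls on the line from the question; the class and helper names are made up for this example.

```java
class TrailingFieldsDemo {
    // Number of fields when splitting a comma-separated line with the given limit.
    static int fieldCount(String line, int limit) {
        return line.split(",", limit).length;
    }

    public static void main(String[] args) {
        String line1 = "country,12345,2,1,,1,,";
        System.out.println(fieldCount(line1, 0));   // 6: the two trailing empty fields are dropped
        System.out.println(fieldCount(line1, -1));  // 8: all fields are kept
    }
}
```

Note that the empty field in the middle (between 1 and 1) is kept in both cases; only empty fields at the end are affected.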
Thanks #Ian Roberts
The split() method drops trailing empty fields, so they are not counted.
For a better understanding, see the code below:
Case 1:
String line1 = "country,12345,2,1,,1, ,";
String data1[] = line1.split(",");
System.out.println("Length : "+data1.length);
Output : 7
Here the field between the last two commas is a space, which is not empty, so split drops only the final empty field.
Case 2:
String line1 = "country,12345,2,1,,1,, ";
String data1[] = line1.split(",");
System.out.println("Length : "+data1.length);
Output : 8
Here the last field, after the final comma, is a space, which is not empty, so split keeps all the fields.
Changing the regex does not help here: a comma is not a regex metacharacter, so escaping it still gives the same 6 elements:
String data1[] = line1.split("\\,");

Why is String.split behaving like this?

My code is
public class Main
{
public static void main(String[] args)
{
String inputString = "#..#...##";
String[] abc = inputString.trim().split("#+");
for (int i = 0; i < abc.length; i++)
{
System.out.println(abc[i]);
}
System.out.println(abc.length);
}
}
The output abc is an array of length 3, with abc[0] being an empty string. The other two elements in abc are .. and ...
If my inputString is "..##...", I don't get an empty string in the array returned by the split function. The input string doesn't have trailing whitespace in either case.
Can someone explain why I get an extra space in the code shown above?
You don't get an extra space, you get the empty string (with length 0). It says so in the javadoc:
When there is a positive-width match at the beginning of this
string then an empty leading substring is included at the beginning
of the resulting array. A zero-width match at the beginning however
never produces such empty leading substring.
When you split by #+ and the first character of the input string is #, the input is split right at the beginning, so the first element of the resulting array is an empty string. Nothing lies between the start of the string and the first match, and that empty substring is what ends up in the array.
From the Javadoc:
This method works as if by invoking the two-argument split method with the given expression and a limit argument of zero. Trailing empty strings are therefore not included in the resulting array.
And Javadoc:
If n is zero then the pattern will be applied as many times as possible, the array can have any length, and trailing empty strings will be discarded.
Whenever you call .split on a String, it splits the String at every place where the pattern matches.
So when you say
String inputString = "#..#...##";
and your pattern for splitting is #+, the value before the first occurrence of # is empty, so abc[0] holds the empty string. abc.length is therefore 3: abc[0] is the empty string, abc[1] is .. and abc[2] is ... (the empty string after the trailing ## is discarded).
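If the leading empty string is not wanted, one possible approach (an assumption on my part, not something the Javadoc prescribes) is to strip the leading delimiters before splitting. A minimal sketch with hypothetical names:

```java
import java.util.Arrays;

class LeadingDelimiterDemo {
    // Plain split: a positive-width match at position 0 yields a leading "".
    static String[] splitKeepingLeading(String input) {
        return input.split("#+");
    }

    // Removes a leading run of '#' first, so no empty string is produced at the start.
    static String[] splitDroppingLeading(String input) {
        return input.replaceFirst("^#+", "").split("#+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitKeepingLeading("#..#...##")));   // [, .., ...]
        System.out.println(Arrays.toString(splitDroppingLeading("#..#...##")));  // [.., ...]
    }
}
```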
