Java String split inconsistency

If I split "hello|" and "|hello" on the "|" character, I get one value for the first string and two values for the second.
String[] arr1 = new String("hello|").split("\\|");
String[] arr2 = new String("|hello").split("\\|");
System.out.println("arr1 length: " + arr1.length + "\narr2 length: " + arr2.length);
This prints out:
arr1 length: 1
arr2 length: 2
Why is this?

According to the Java docs, split creates an empty String if the first character is the separator, but does not create an empty String (or empty Strings) if the last character (or a run of consecutive characters at the end) is the separator. You will get the same behavior regardless of the separator you use.

Trailing empty Strings are not included in the array. See the following statement in the documentation:
String#split
This method works as if by invoking the two-argument split method with
the given expression and a limit argument of zero. Trailing empty
strings are therefore not included in the resulting array.

String#split always returns the array of strings computed by splitting this string around matches of the given regular expression.

Check the source code for the answer: http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/util/regex/Pattern.java#Pattern.compile%28java.lang.String%29
The last lines contain the answer:
int resultSize = matchList.size();
if (limit == 0)
    while (resultSize > 0 && matchList.get(resultSize-1).equals(""))
        resultSize--;
String[] result = new String[resultSize];
So the end will not be included if it is empty.
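To see the effect of that loop from the outside, here is a minimal sketch that splits the two strings from the question with limit 0 and with a negative limit. The class and method names (TrailingEmptyDemo, splitOnPipe) are made up for this illustration.

```java
import java.util.Arrays;

class TrailingEmptyDemo {
    // Hypothetical helper: splits on a literal '|' with the given limit.
    static String[] splitOnPipe(String input, int limit) {
        return input.split("\\|", limit);
    }

    public static void main(String[] args) {
        // limit 0 is what the one-argument split uses: trailing empty strings are dropped
        System.out.println(Arrays.toString(splitOnPipe("hello|", 0)));  // [hello]
        // a leading empty string is kept
        System.out.println(Arrays.toString(splitOnPipe("|hello", 0)));  // [, hello]
        // a negative limit keeps the trailing empty string as well
        System.out.println(Arrays.toString(splitOnPipe("hello|", -1))); // [hello, ]
    }
}
```

With a negative limit both inputs produce two elements, so the asymmetry from the question disappears.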

Related

Java, splitting string into array

I am trying to split a string into a string array, and I have stumbled on something that seems strange to me. I don't understand why it works like this.
String one, two;
one = "";
two = ":";
String[] devided1 = one.trim().split(":");
String[] devided2 = two.trim().split(":");
System.out.println("size: "+ devided1.length);
System.out.println("size: "+ devided2.length);
I get output:
size: 1
size: 0
Why does an empty string give me a size of one, while a string that contains only the delimiter gives me an array of size 0?
I saw more confusing things, too: the size for "::" is 0, but the size for ": :" is 2, not 3.
Can someone please explain it to me?
See the doc comment in the source code or the documentation for the public String[] split(String regex, int limit) method.
Case 1:
String one = "";
String[] devided1 = one.trim().split(":");
The resulting array will have one element, the original string (a String[1] containing ""), because the expression ":" does not match any part of the input string.
According to documentation:
If the expression does not match any part of the input then the resulting array has just one element, namely this string.
Case 2:
String two = ":";
String[] devided2 = two.trim().split(":");
split(":") uses the default limit of 0, which means that trailing empty strings are removed from the resulting array. So the method splits the string ":" into an array with two empty strings and then removes them, and as a result we get an empty array.
According to documentation:
If limit is zero then the pattern will be applied as many times as
possible, the array can have any length, and trailing empty strings
will be discarded.
Case 3:
String two = ":";
String[] devided2 = two.trim().split(":", -1);
We will get an array with two empty strings.
According to documentation:
If limit is non-positive then the pattern will be applied as many
times as possible and the array can have any length
Case 4:
String two = "::";
String[] devided2 = two.trim().split(":");
We will get an empty array, for the same reason as in Case 2: the split produces three empty strings, and all of them are trailing.
Case 5:
String one = ": :";
String[] devided1 = one.trim().split(":");
The method will split the string into three array elements ["", " ", ""] and then remove the empty strings from the end of the array, because limit = 0. We will get a String[2] containing ["", " "].
According to documentation:
If limit is zero then the pattern will be applied as many times as
possible, the array can have any length, and trailing empty strings
will be discarded.
This link is helpful:
https://konigsberg.blogspot.com/2009/11/final-thoughts-java-puzzler-splitting.html
Basically, it is for Perl compatibility.
You can use split(":", -1) here if you don't want that behavior.
Otherwise, split(":") defaults to split(":", 0), and the difference is:
https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/String.html#split(java.lang.String,int)
If the limit is zero then the pattern will be applied as many times as possible, the array can have any length, and trailing empty strings will be discarded.
If the limit is negative then the pattern will be applied as many times as possible and the array can have any length.
When ":" is split, the result would be {"", ""}, but trailing empty strings are discarded, so an empty array is returned.
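The cases discussed above can be checked side by side. This is only a small sketch; the class name SplitCases and the helper count are hypothetical names chosen for the example.

```java
class SplitCases {
    // Hypothetical helper: number of elements produced by splitting on ":" with the given limit.
    static int count(String input, int limit) {
        return input.split(":", limit).length;
    }

    public static void main(String[] args) {
        String[] inputs = {"", ":", "::", ": :"};
        for (String s : inputs) {
            System.out.println("\"" + s + "\" -> limit 0: " + count(s, 0)
                    + ", limit -1: " + count(s, -1));
        }
        // "" -> limit 0: 1, limit -1: 1     (no match, the input itself is returned)
        // ":" -> limit 0: 0, limit -1: 2
        // "::" -> limit 0: 0, limit -1: 3
        // ": :" -> limit 0: 2, limit -1: 3
    }
}
```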

Java difference between "split(regEx)" and "split(regEx, 0)"?

Is there any difference between using split(regEx) and split(regEx, 0)?
Because the output is the same for the cases I tested. Example:
String myS = "this is stack overflow";
String[] mySA = myS.split(" ");
results in mySA being {"this", "is", "stack", "overflow"}
And
String myS = "this is stack overflow";
String[] mySA = myS.split(" ", 0);
also results in mySA being {"this", "is", "stack", "overflow"}
Is there something "hidden" going on here? Or something else which needs to be known about the .split(regEx, 0)?
They are essentially the same.
Quoted from String.split(String regex) documentation:
This method works as if by invoking the two-argument split method with the given expression and a limit argument of zero. Trailing empty strings are therefore not included in the resulting array.
To answer the question: yes, they are the same.
Here is the split method of the String class, which internally calls split(regex, 0):
public String[] split(String regex) {
    return split(regex, 0);
}
The limit parameter controls the number of times the pattern is
applied and therefore affects the length of the resulting array. If
the limit n is greater than zero then the pattern will be applied at
most n - 1 times, the array's length will be no greater than n, and
the array's last entry will contain all input beyond the last matched
delimiter. If n is non-positive then the pattern will be applied as
many times as possible and the array can have any length. If n is zero
then the pattern will be applied as many times as possible, the array
can have any length, and trailing empty strings will be discarded.
For example the following code can give you some insight.
String myS = "this is stack overflow";
String[] mySA = myS.split(" ", 2);
String[] withOutLimit = myS.split(" ");
System.out.println(mySA.length); // prints 2
System.out.println(withOutLimit.length); // prints 4
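The difference between a positive, zero and negative limit is easier to see on an input that ends with delimiters. The following sketch uses an input of my own choosing ("a,b,,") and a hypothetical helper named show.

```java
import java.util.Arrays;

class LimitDemo {
    // Hypothetical helper: splits on "," with the given limit and formats the result.
    static String show(String input, int limit) {
        return Arrays.toString(input.split(",", limit));
    }

    public static void main(String[] args) {
        // positive limit: at most 2 elements, the last one holds the rest of the input
        System.out.println(show("a,b,,", 2));  // [a, b,,]
        // zero: split everywhere, then discard trailing empty strings
        System.out.println(show("a,b,,", 0));  // [a, b]
        // negative: split everywhere and keep everything
        System.out.println(show("a,b,,", -1)); // [a, b, , ]
    }
}
```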

Java split by newline when string is all newlines

When I have a string like \n\n\n and I split by \\n, I get an array of length 0. Why is this?
public class Test {
    public static void main(String[] args) {
        String str = "\n\n\n";
        String[] lines = str.split("\\n");
        System.out.println(lines.length);
    }
}
You can copy & paste the code into CompileOnline.
The token that you split on is not part of the result. Since there is nothing else, there is no item to put in the array.
This changes when you append a non-delimiter character to your base string, though. When you do that, the empty entries before it are included after all.
This can be explained by looking at the source code in java.lang.String:2305.
Consider the following excerpt:
// Construct result
int resultSize = list.size();
if (limit == 0)
    while (resultSize > 0 && list.get(resultSize - 1).length() == 0)
        resultSize--;
String[] result = new String[resultSize];
return list.subList(0, resultSize).toArray(result);
If you have 3 empty entries as in your case, resultSize will count down to 0 and essentially return an empty array.
If you have 3 empty entries and one filled one (with the random character you added to the end), resultSize will not move from 4 and thus you will get an array of 4 items where the first 3 are empty.
Basically it will remove all the trailing empty values.
String str = "\n\n\n"; // Returns length 0
String str = "\n\n\nb"; // Returns length 4
String str = "\n\n\nb\n\n"; // Returns length 4
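As a rough illustration of that trimming step, the sketch below applies the same "count down from the end" loop to the pieces returned by a negative-limit split. It is not the JDK code itself; TrimTrailingSketch and dropTrailingEmpty are names invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TrimTrailingSketch {
    // Mirrors the limit == 0 step: drop empty strings from the end only.
    static List<String> dropTrailingEmpty(List<String> parts) {
        int size = parts.size();
        while (size > 0 && parts.get(size - 1).isEmpty()) {
            size--;
        }
        return new ArrayList<>(parts.subList(0, size));
    }

    public static void main(String[] args) {
        // A negative limit returns every piece, so nothing has been discarded yet.
        List<String> all = Arrays.asList("\n\n\nb\n\n".split("\\n", -1));
        System.out.println(all.size());                    // 6
        // Only the two empty strings after "b" are removed; the three before it stay.
        System.out.println(dropTrailingEmpty(all).size()); // 4
    }
}
```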
As said in the String javadoc:
This method works as if by invoking the two-argument split method with the given expression and a limit argument of zero. Trailing empty strings are therefore not included in the resulting array.
So, when you split() a String made entirely of delimiters (whatever the delimiter is), you will get only empty Strings, the delimiter not being included in the result, and, thus, they will all be considered as trailing empty strings, and not be included in the resulting array.
If you want to get everything, including the empty strings, you have two choices:
add something that is not a delimiter at the end of the String:
String str = "\n\n\ne";
String[] lines = str.split("\\n");
System.out.println(lines.length); // prints "4"
use the two-argument split method with a negative limit:
String str = "\n\n\n";
String[] lines = str.split("\\n", -1);
System.out.println(lines.length); // prints "4"
Because your string contains only \n characters,
str.split("\\n") finds nothing but empty strings between the delimiters (empty strings, not null). All of them are trailing empty strings, so they are discarded and lines ends up with length 0.

Unexpected behavior of Java String split( )

I am trying to split a string using String split function, here's an example:
String[] list = " Hello ".split("\\s+");
System.out.println("String length: " + list.length);
for (String s : list) {
System.out.println("----");
System.out.println(s);
}
Here's the output:
String length: 2
----
----
Hello
As you can see, the leading whitespace becomes an empty element in the String array, but the trailing whitespace does not.
Does anyone know why?
You need to use the other split method, which takes a limit, and specify a limit of -1
String[] list = " Hello ".split("\\s+", -1);
to preserve the trailing empty string. The default behavior is to omit trailing empty strings, as per the javadoc.
Edit (answer for comment):
To trim the leading space, you can strip off the leading space before splitting the String
String str = " Hello ".replaceAll("^\\s+", "");
String[] list = str.split("\\s+", -1);
From split documentation
This method works as if by invoking the two-argument split method with the given expression and a limit argument of zero. Trailing empty strings are therefore not included in the resulting array.
so in reality split(regex) is the same as using
split(regex, 0);
and its documentation says
The limit parameter controls the number of times the pattern is applied and therefore affects the length of the resulting array. If the limit n is greater than zero then the pattern will be applied at most n - 1 times, the array's length will be no greater than n, and the array's last entry will contain all input beyond the last matched delimiter. If n is non-positive then the pattern will be applied as many times as possible and the array can have any length. If n is zero then the pattern will be applied as many times as possible, the array can have any length, and trailing empty strings will be discarded.
So if you want to include trailing empty strings, you just have to use a non-zero value like
split("\\s+",10);
but this will also limit the result array to at most 10 elements. To get rid of this problem, use some negative number like
split("\\s+",-1);
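If what you actually want is just the words, with no empty element at either end, one possible approach is to trim before splitting. This is a sketch under that assumption; WhitespaceTokens and tokens are hypothetical names, and the empty check is there because splitting an empty string returns a one-element array containing "".

```java
import java.util.Arrays;

class WhitespaceTokens {
    // Hypothetical helper: trims first so neither end yields an empty element.
    static String[] tokens(String input) {
        String trimmed = input.trim();
        if (trimmed.isEmpty()) {
            // "".split(...) would return [""], so handle blank input explicitly
            return new String[0];
        }
        return trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokens(" Hello ")));        // [Hello]
        System.out.println(Arrays.toString(tokens("  Hello  World "))); // [Hello, World]
        System.out.println(tokens("   ").length);                      // 0
    }
}
```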

String.split() returning a "" unexpectedly

I have a simple method splitting a string into an array. It splits it where there are non-letter characters. The line I am using right now is as follows:
String[] words = str.split("[^a-zA-Z]");
So this should split the string so that only the alphabetical parts remain. But the problem is that it works for some inputs, but not all. For example:
String str = "!!day--yaz!!";
String[] words = str.split("[^a-zA-Z]");
String result = "";
for (int i = 0; i < words.length; i++) {
    result += words[i] + "1 ";
}
return result;
I added the 1 in there to see where the split takes place, because I was getting errors on null values. Anyway, when I run this code I get an output of:
1 1 day1 1 yaz1
Why is it splitting between the first two !'s and after one of the -'s, but not after the last two !'s? Why is it even splitting there at all? Any help on this would be great!
It doesn't split before or after the matches; it splits ON the matches. Therefore you get an empty String between the two dashes and between the two bangs.
This doesn't apply to the trailing bangs, because trailing empty Strings are omitted as described in the javadoc
Trailing empty strings are therefore not included in the resulting
array.
This happens because it indeed uses every non-letter character as a delimiter. It means that the string "!" would be split into 2 empty strings, one to the left and one to the right of the exclamation mark.
Your problem can be solved in 2 steps.
Use "[^a-zA-Z]+" instead of "[^a-zA-Z]". The + will help you avoid the empty string between the 2 dashes.
Remove leading and trailing non-letter characters before splitting. This will remove the leading and trailing empty strings: str.replaceFirst("^[^a-zA-Z]+", "").replaceFirst("[^a-zA-Z]+$", "")
Finally your split will look like:
String[] words = str.replaceFirst("^[^a-zA-Z]+", "").replaceFirst("[^a-zA-Z]+$", "").split("[^a-zA-Z]+");
If you want to get rid of some of the extra splits, use split("[^a-zA-Z]+") instead of split("[^a-zA-Z]"). This will match a continuous part of the String that matches the pattern.
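Another way to get only the words, sketched here as an alternative rather than as the approach from the answers above, is to split on runs of non-letters and then filter out any empty piece that is left. It assumes Java 8 or later for streams; WordExtractor and words are hypothetical names.

```java
import java.util.Arrays;

class WordExtractor {
    // Hypothetical helper: split on runs of non-letters, then drop any empty pieces.
    static String[] words(String input) {
        return Arrays.stream(input.split("[^a-zA-Z]+"))
                .filter(w -> !w.isEmpty())
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        // The split alone yields ["", "day", "yaz"]; the filter removes the leading "".
        System.out.println(Arrays.toString(words("!!day--yaz!!"))); // [day, yaz]
    }
}
```

The filter also covers the empty input case, where split returns a single empty string.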
