Why is String.split behaving like this? - java

My code is
public class Main
{
    public static void main(String[] args)
    {
        String inputString = "#..#...##";
        String[] abc = inputString.trim().split("#+");
        for (int i = 0; i < abc.length; i++)
        {
            System.out.println(abc[i]);
        }
        System.out.println(abc.length);
    }
}
The output abc is an array of length 3, with abc[0] being an empty string. The other two elements in abc are .. and ...
If my inputString is "..##...", I don't get an empty string in the array returned by split. The input string has no leading or trailing whitespace in either case.
Can someone explain why I get an extra space in the code shown above?

You don't get an extra space, you get the empty string (with length 0). It says so in the javadoc:
* <p> When there is a positive-width match at the beginning of this
* string then an empty leading substring is included at the beginning
* of the resulting array. A zero-width match at the beginning however
* never produces such empty leading substring

When you split by #+ and the first character of the input string is #, the input is split right at the beginning, so the first element of the resulting array is an empty string. The only thing to the left of the first # is the start-of-input anchor ^, which is to say nothing at all, and that empty substring is what ends up in the array.

From the Javadoc:
This method works as if by invoking the two-argument split method with the given expression and a limit argument of zero. Trailing empty strings are therefore not included in the resulting array.
And Javadoc:
If n is zero then the pattern will be applied as many times as possible, the array can have any length, and trailing empty strings will be discarded.

Whenever you call .split on a String, it splits the String at every place where the pattern matches.
So when you say
String inputString = "#..#...##";
and your pattern for splitting is #+, the value before the first occurrence of # is empty, so abc[0] holds the empty string. Therefore abc.length is 3: abc[0] is the empty string, abc[1] is .. and abc[2] is ... The empty string after the final ## is a trailing empty string, so it is discarded.
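To see the leading and trailing cases side by side, here is a minimal sketch using the inputs from the question. The class name LeadingEmptyDemo and the helper splitOnHashes are made up for illustration; only String.split and Arrays.toString are real library calls.

```java
import java.util.Arrays;

class LeadingEmptyDemo {
    // Hypothetical helper: splits on runs of '#' with the default limit of 0.
    static String[] splitOnHashes(String input) {
        return input.split("#+");
    }

    public static void main(String[] args) {
        // Leading delimiter: the empty string before the first '#' is kept,
        // while the empty string after the final "##" is dropped.
        System.out.println(Arrays.toString(splitOnHashes("#..#...##"))); // [, .., ...]

        // No leading delimiter: no empty first element.
        System.out.println(Arrays.toString(splitOnHashes("..##...")));   // [.., ...]

        // A negative limit keeps the trailing empty string as well.
        System.out.println(Arrays.toString("#..#...##".split("#+", -1))); // [, .., ..., ]
    }
}
```

If the leading empty element is unwanted, one option is to strip the leading delimiters before splitting.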

Related

Strings: .replaceAll and .substring: Practical example explanation?

What steps are actually happening within the following two methods? I have a rough understanding of what the methods are to do, but I do not know how.
Method 1:
public String processDiscardedLetters(String name) {
    return name.substring(0, 1)
        + name.substring(1).replaceAll("[aeihouwy]", "");
}
Method 2:
public String processEquivalentLetters(String name) {
    name = name.replaceAll("[aeiou]", "a");
    name = name.replaceAll("[cgjkqsxyz]", "c");
    name = name.replaceAll("[bfpvw]", "b");
    name = name.replaceAll("[dt]", "d");
    name = name.replaceAll("[mn]", "m");
    return name;
}
Example 1:
input: here are my
output: hr r m //replaced with '' (nothing)
Example 2 (first line):
input: cake is king xoxo
output: caka as kang xaxa // vowels replaced with 'a'
As you can see, every character listed inside [] is replaced with the string you pass as the second parameter.
Both of the examples use regular expressions to transform input.
The bracket syntax [] is a character class, which means that any character inside it will match with a part of the input.
In particular, doing a string replace with [aeihouwy] will replace all occurrences of any of the letters with the replacement string.
Method 1: Preserves the first letter regardless of what it is, and removes all occurrences of the characters aeihouwy from the rest of the string.
To be more specific, Method 1 is doing the following steps.
Separate the original string into two parts: the first character and the rest of the String. This uses the substring method with arguments 0 and 1 to pull out the first character, and with the single argument 1 to extract the part of the string that starts at the second character and runs to the end.
Use replaceAll to eliminate any of the characters aeihouwy from the second part of the string.
Join the strings together again.
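Method 2: Taking the first line as an example
name = name.replaceAll("[aeiou]", "a"); // Replace any of the letters `aeiou` with `a` wherever they occur.
The remaining lines work the same way, each mapping one group of letters to a single representative letter.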
String's substring(startIndex, endIndex) method takes the part of the string from the start index (inclusive) to the end index (exclusive). A String is a CharSequence, that is, a sequence of characters indexed like an array, from 0 to N - 1. The one-argument substring(startIndex) takes everything from the given index to the end of the string.
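The index rules can be checked with a short sketch. The sample input "robert" and the helpers head and tail are hypothetical names chosen for illustration.

```java
class SubstringDemo {
    // Hypothetical helper: the first character (index 0 inclusive to 1 exclusive).
    static String head(String s) {
        return s.substring(0, 1);
    }

    // Hypothetical helper: everything from index 1 to the end of the string.
    static String tail(String s) {
        return s.substring(1);
    }

    public static void main(String[] args) {
        String name = "robert"; // made-up sample input
        System.out.println(head(name));           // r
        System.out.println(tail(name));           // obert
        System.out.println(name.substring(1, 3)); // ob (indices 1 and 2)
    }
}
```

Note that substring(0, 1) throws a StringIndexOutOfBoundsException on an empty string, so callers would need to guard against that case.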
The replaceAll(String regularExpression, String replacement) method replaces every portion of the string that matches the regular expression with the replacement string.
The characters between [ and ] form a character class. It tells the regular expression engine to match exactly one character out of that set. So with [aeihouwy], a word like grey contains two matches, the e and the y (gr[e]y and gre[y]), and each matched character is replaced with the empty string "", leaving gr.
I hope this helps.
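To trace both methods on the example inputs, here is a small runnable sketch. The two method bodies are copied from the question; the wrapper class LetterProcessingDemo and the static modifiers are assumptions added so the snippet runs on its own.

```java
class LetterProcessingDemo {
    // Method 1 from the question: keep the first character, then drop
    // every a, e, i, h, o, u, w, y from the remainder.
    static String processDiscardedLetters(String name) {
        return name.substring(0, 1)
            + name.substring(1).replaceAll("[aeihouwy]", "");
    }

    // Method 2 from the question: map each group of letters to one letter.
    static String processEquivalentLetters(String name) {
        name = name.replaceAll("[aeiou]", "a");
        name = name.replaceAll("[cgjkqsxyz]", "c");
        name = name.replaceAll("[bfpvw]", "b");
        name = name.replaceAll("[dt]", "d");
        name = name.replaceAll("[mn]", "m");
        return name;
    }

    public static void main(String[] args) {
        System.out.println(processDiscardedLetters("here are my"));         // hr r m
        // Only the first replacement of Method 2:
        System.out.println("cake is king xoxo".replaceAll("[aeiou]", "a")); // caka as kang xaxa
        // All five replacements applied in order:
        System.out.println(processEquivalentLetters("cake is king xoxo"));  // caca ac camc caca
    }
}
```

The order of the replacements in Method 2 matters: y is not in the first class, so it survives step one and is only turned into c by the second line.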

Java split by newline when string is all newlines

When I have a string like \n\n\n, and I split by \\n, I get an array of length 0. Why is this?
public class Test {
    public static void main(String[] args) {
        String str = "\n\n\n";
        String[] lines = str.split("\\n");
        System.out.println(lines.length);
    }
}
You can copy & paste the code into CompileOnline.
The token that you split on is not part of the result. Since there is nothing else, there is no item to put in the array.
This is different when you add another, non-delimiter character to the end of your base string though. The empty entries in front of it are then no longer trailing, so they are included after all.
This can be explained by looking at the source code in java.lang.String:2305.
Consider the following excerpt:
// Construct result
int resultSize = list.size();
if (limit == 0)
    while (resultSize > 0 && list.get(resultSize - 1).length() == 0)
        resultSize--;
String[] result = new String[resultSize];
return list.subList(0, resultSize).toArray(result);
If you have 3 empty entries as in your case, resultSize will count down to 0 and essentially return an empty array.
If you have 3 empty entries and one filled one (with the random character you added to the end), resultSize will not move from 4 and thus you will get an array of 4 items where the first 3 are empty.
Basically it will remove all the trailing empty values.
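The trimming step can be tried in isolation. The following is a simplified paraphrase of the excerpt above, not the JDK source itself; the class TrailingTrimSketch and the method dropTrailingEmpty are hypothetical names, and the input lists stand in for the pieces that split collects before trimming.

```java
import java.util.Arrays;
import java.util.List;

class TrailingTrimSketch {
    // Mirrors the limit == 0 branch: walk back from the end while the
    // last remaining piece is empty, then copy what is left.
    static String[] dropTrailingEmpty(List<String> pieces) {
        int resultSize = pieces.size();
        while (resultSize > 0 && pieces.get(resultSize - 1).isEmpty()) {
            resultSize--;
        }
        return pieces.subList(0, resultSize).toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Four empty pieces, as produced by three delimiters and nothing else.
        System.out.println(dropTrailingEmpty(Arrays.asList("", "", "", "")).length);          // 0
        // A non-empty last piece protects the empty ones before it.
        System.out.println(dropTrailingEmpty(Arrays.asList("", "", "", "b")).length);         // 4
        // Only the empties after "b" are removed.
        System.out.println(dropTrailingEmpty(Arrays.asList("", "", "", "b", "", "")).length); // 4
    }
}
```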
String str = "\n\n\n"; // Returns length 0
String str = "\n\n\nb"; // Returns length 4
String str = "\n\n\nb\n\n"; // Returns length 4
As said in the String javadoc:
This method works as if by invoking the two-argument split method with the given expression and a limit argument of zero. Trailing empty strings are therefore not included in the resulting array.
So, when you split() a String made entirely of delimiters (whatever the delimiter is), every resulting piece is an empty String, because the delimiters themselves are not included in the result. All of those pieces count as trailing empty strings, so none of them is included in the resulting array.
If you want to get everything, including the empty strings, you have two choices:
add something that is not a delimiter at the end of the String:
String str = "\n\n\ne";
String[] lines = str.split("\\n");
System.out.println(lines.length); // prints "4"
use the two-argument split method with a negative limit:
String str = "\n\n\n";
String[] lines = str.split("\\n", -1);
System.out.println(lines.length); // prints "4"
Because your string contains nothing but \n, str.split("\\n") finds only empty strings between the delimiters. All of them are trailing empty strings and are discarded, so lines is an empty array (length 0), not null.

Unexpected behavior of Java String split( )

I am trying to split a string using String split function, here's an example:
String[] list = " Hello ".split("\\s+");
System.out.println("String length: " + list.length);
for (String s : list) {
    System.out.println("----");
    System.out.println(s);
}
Here's the output:
String length: 2
----
----
Hello
As you can see, the leading whitespace becomes an empty element in the String array, but the trailing whitespace does not.
Does anyone know why?
You need to use the other split method, which takes a limit, and specify a limit of -1
String[] list = " Hello ".split("\\s+", -1);
to preserve the trailing empty string. The default behavior is to omit trailing empty strings, as per the javadoc.
Edit (answer for comment):
To trim the leading space, you can strip off the leading space before splitting the String
String str = " Hello ".replaceAll("^\\s+", "");
String[] list = str.split("\\s+", -1);
From the split documentation:
This method works as if by invoking the two-argument split method with the given expression and a limit argument of zero. Trailing empty strings are therefore not included in the resulting array.
so in reality split(regex) is the same as using
split(regex, 0);
and its documentation says
The limit parameter controls the number of times the pattern is applied and therefore affects the length of the resulting array. If the limit n is greater than zero then the pattern will be applied at most n - 1 times, the array's length will be no greater than n, and the array's last entry will contain all input beyond the last matched delimiter. If n is non-positive then the pattern will be applied as many times as possible and the array can have any length. If n is zero then the pattern will be applied as many times as possible, the array can have any length, and trailing empty strings will be discarded.
so if you want to include trailing empty strings, you just have to use a non-zero value like
split("\\s+",10);
but this will also limit the result array to at most 10 elements. To avoid that restriction, use a negative number like
split("\\s+",-1);
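The effect of the three kinds of limit can be compared on the string from the question. This is a small sketch; the class SplitLimitDemo and the helper show are hypothetical names.

```java
import java.util.Arrays;

class SplitLimitDemo {
    // Hypothetical helper: split on whitespace runs with the given limit
    // and render the result so that empty elements are visible.
    static String show(String input, int limit) {
        return Arrays.toString(input.split("\\s+", limit));
    }

    public static void main(String[] args) {
        String input = " Hello ";
        System.out.println(show(input, 0));  // [, Hello]   trailing empty string dropped
        System.out.println(show(input, -1)); // [, Hello, ] trailing empty string kept
        System.out.println(show(input, 2));  // [, Hello ]  pattern applied once, rest left untouched
        // Trimming first removes both the leading and the trailing element.
        System.out.println(Arrays.toString(input.trim().split("\\s+"))); // [Hello]
    }
}
```

With a limit of 2 the second element still ends with the original trailing space, because the pattern is no longer applied after the first split.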

Splitting a String up to the nth delimiter in Java

String s = "10.226.18.158:10.226.17.183:ABCD :AAAA";
My requirement is to split the string only up to the 3rd : or up to the 2nd :, i.e.
something like String sa[] = s.split(), but with the regex splitting only up to the 3rd or 2nd occurrence.
sa[0] = "10.226.18.158"
sa[1] = "10.226.17.183"
sa[2] = "ABCD :AAAA"
According to the String#split() javadoc, you can pass a second argument that limits the length of the resulting array.
s.split(":", 3);
Edit: as melwil mentions, this returns an array with at most as many elements as the number passed in.
So in your example of splitting up to the 2nd : you would need to pass in 3.
s.split(":", 3) returns the output
sa[0] = "10.226.18.158"
sa[1] = "10.226.17.183"
sa[2] = "ABCD :AAAA"
Relevant section quoted from the javadoc about how the second argument (limit) works:
The limit parameter controls the number of times the pattern is
applied and therefore affects the length of the resulting array. If
the limit n is greater than zero then the pattern will be applied at
most n - 1 times, the array's length will be no greater than n, and
the array's last entry will contain all input beyond the last matched
delimiter. If n is non-positive then the pattern will be applied as
many times as possible and the array can have any length. If n is zero
then the pattern will be applied as many times as possible, the array
can have any length, and trailing empty strings will be discarded.
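Applied to the string from the question, the limit works out as in the sketch below. The class LimitedColonSplit and the helper splitUpTo are hypothetical names introduced for illustration.

```java
class LimitedColonSplit {
    // Hypothetical helper: split on ':' into at most maxParts elements,
    // so the pattern is applied at most maxParts - 1 times.
    static String[] splitUpTo(String s, int maxParts) {
        return s.split(":", maxParts);
    }

    public static void main(String[] args) {
        String s = "10.226.18.158:10.226.17.183:ABCD :AAAA";

        // Limit 3: split at the 1st and 2nd colon only.
        for (String part : splitUpTo(s, 3)) {
            System.out.println("[" + part + "]");
        }
        // [10.226.18.158]
        // [10.226.17.183]
        // [ABCD :AAAA]

        // Limit 2: split at the 1st colon only.
        System.out.println(splitUpTo(s, 2)[1]); // 10.226.17.183:ABCD :AAAA
    }
}
```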
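You can also split only on a colon that is preceded by a non-whitespace character. Use a lookbehind, (?<=\S), so that the preceding character is checked but not consumed as part of the delimiter:
String sa[] = s.split("(?<=\\S):");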
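One thing to watch with a pattern that matches a non-whitespace character plus the colon is that everything the pattern matches is treated as the delimiter and removed. The sketch below compares a consuming pattern with a lookbehind variant; the class ColonAfterNonSpace and both helper names are made up for illustration.

```java
import java.util.Arrays;

class ColonAfterNonSpace {
    // The matched non-whitespace character is part of the delimiter,
    // so it disappears from the result.
    static String[] consuming(String s) {
        return s.split("\\S:");
    }

    // The lookbehind only checks the preceding character; the delimiter
    // itself is just the colon.
    static String[] lookbehind(String s) {
        return s.split("(?<=\\S):");
    }

    public static void main(String[] args) {
        String s = "10.226.18.158:10.226.17.183:ABCD :AAAA";
        System.out.println(Arrays.toString(consuming(s)));
        // [10.226.18.15, 10.226.17.18, ABCD :AAAA]
        System.out.println(Arrays.toString(lookbehind(s)));
        // [10.226.18.158, 10.226.17.183, ABCD :AAAA]
    }
}
```

Both variants leave the last colon alone only because it happens to follow a space in this particular input, so the limit argument is the more general way to stop after the nth delimiter.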

String.split() returning a "" unexpectedly

I have a simple method splitting a string into an array. It splits it where there are non-letter characters. The line I am using right now is as follows:
String[] words = str.split("[^a-zA-Z]");
So this should split the string at every non-letter character, leaving only the alphabetical parts. But the problem is that the split works for some of the delimiters, but not all. For example:
String str = "!!day--yaz!!";
String[] words = str.split("[^a-zA-Z]");
String result = "";
for (int i = 0; i < words.length; i++) {
    result += words[i] + "1 ";
}
return result;
I added the 1 in there to see where the split takes place, because I was getting errors on null values. Anyway, when I run this code I get an output of:
1 1 day1 1 yaz1
Why is it splitting between the first two !'s and after one of the -'s, but not after the last two !'s? Why is it even splitting there at all? Any help on this would be great!
It doesn't split before or after the matches; it splits ON the matches. Therefore you get an empty String before the first bang, another between the two leading bangs, and one between the two dashes.
This doesn't apply to the trailing bangs, because trailing empty Strings are omitted, as described in the javadoc:
Trailing empty strings are therefore not included in the resulting
array.
This happens because it indeed uses every single non-letter character as a delimiter. Conceptually, the string "!" is split into 2 empty strings, one to the left and one to the right of the exclamation mark (trailing empty strings are then dropped from the result).
Your problem can be solved in 2 steps.
Use "[^a-zA-Z]+" instead of "[^a-zA-Z]". The + avoids the empty string between 2 dashes.
Remove leading and trailing non-letter characters before splitting. This removes the leading and trailing empty strings: str.replaceFirst("^[^a-zA-Z]+", "").replaceFirst("[^a-zA-Z]+$", "")
Finally your split will look like:
String[] words = str.replaceFirst("^[^a-zA-Z]+", "").replaceFirst("[^a-zA-Z]+$", "").split("[^a-zA-Z]+");
If you want to get rid of some of the extra splits, use split("[^a-zA-Z]+") instead of split("[^a-zA-Z]"). This will match a continuous part of the String that matches the pattern.
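The three variants discussed in this thread can be compared directly on the input from the question. This is a minimal sketch; the class WordSplitDemo and the helper words are hypothetical names, and the helper assumes that stripping non-letters at both ends before splitting is acceptable.

```java
import java.util.Arrays;

class WordSplitDemo {
    // Hypothetical helper: strip non-letters from both ends, then split
    // on runs of non-letters so no empty strings appear in between.
    static String[] words(String str) {
        return str.replaceFirst("^[^a-zA-Z]+", "")
                  .replaceFirst("[^a-zA-Z]+$", "")
                  .split("[^a-zA-Z]+");
    }

    public static void main(String[] args) {
        String str = "!!day--yaz!!";
        // Every single non-letter is a delimiter.
        System.out.println(Arrays.toString(str.split("[^a-zA-Z]")));  // [, , day, , yaz]
        // Runs of non-letters count as one delimiter; the leading empty string remains.
        System.out.println(Arrays.toString(str.split("[^a-zA-Z]+"))); // [, day, yaz]
        // Stripping both ends first removes the leading empty string too.
        System.out.println(Arrays.toString(words(str)));              // [day, yaz]
    }
}
```

One edge case to keep in mind: if the input contains no letters at all, the stripped string is empty, and splitting an empty string yields an array with a single empty element rather than an empty array.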
