I am trying to split a string that has two numbers and possibly a letter that will look similar to:
(2,3) (2,6) p (8,5) p (5,6)
I am trying:
String[] inputTokens = input.split("[(),\\s]");
but that leaves me with a bunch of empty strings in the tokens array. How do I stop them from appearing in the first place?
For clarification: By empty string I mean a string containing nothing, not even a space
Add the "one or more times" greediness quantifier to your character class:
String[] inputTokens = input.split("[(),\\s]+");
This leaves one leading empty String, which is unavoidable when split() matches a delimiter at the very start of the String; otherwise there are no empty Strings.
String inputTokens[] = input.split("[(),\\s]+");
This treats each run of delimiters, whitespace included, as a single separator, so there will be no empty entries between the tokens in your array.
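To make the effect of the quantifier concrete, here is a minimal, self-contained sketch. The class name CoordinateTokens and the helper tokens() are made up for illustration, and dropping the leading empty string with Arrays.copyOfRange is just one possible way to handle it.

```java
import java.util.Arrays;

class CoordinateTokens {
    // Hypothetical helper: splits on runs of '(', ')', ',' or whitespace
    // and drops the single leading empty token, if there is one.
    static String[] tokens(String input) {
        String[] parts = input.split("[(),\\s]+");
        if (parts.length > 0 && parts[0].isEmpty()) {
            return Arrays.copyOfRange(parts, 1, parts.length);
        }
        return parts;
    }

    public static void main(String[] args) {
        // prints [2, 3, 2, 6, p, 8, 5, p, 5, 6]
        System.out.println(Arrays.toString(tokens("(2,3) (2,6) p (8,5) p (5,6)")));
    }
}
```

The trailing ")" does not need special handling, because split() with the default limit already discards trailing empty strings.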
I'm trying to split a string on single or multiple occurrences of "O"; all other characters will be dots. I'm wondering why this produces an empty string first.
String row = ".....O.O.O";
String[] arr = row.split("\\.+");
This produces:
["", "O", "O", "O"]
You just need to make sure that any trailing or leading dots are removed.
So one solution is:
row.replaceAll("^\\.+|\\.+$", "").split("\\.+");
For this pattern you can use replaceFirst() to drop the leading dots and then split on the remaining dots:
String[] arr = row.replaceFirst("^\\.+", "").split("\\.+");
Output will be
["O","O","O"]
The "+" quantifier collapses each run of separators into a single one, so your split is essentially splitting the following string on ".":
.O.O.O
This, of course, means that your first field is empty. Hence the result you get.
To avoid this, strip all leading separators from the string before splitting it. Rather than type some examples on how to do this, here's a thread with a few suggestions.
Java - Trim leading or trailing characters from a string?
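For reference, a minimal sketch of the "strip first, then split" idea discussed in this thread. The class DotSplit and the helper fields() are hypothetical names, and returning an empty array for an all-dots row is my own assumption about what the caller would want.

```java
import java.util.Arrays;

class DotSplit {
    // Hypothetical helper: removes leading and trailing runs of dots,
    // then splits on the remaining runs of dots.
    static String[] fields(String row) {
        String trimmed = row.replaceAll("^\\.+|\\.+$", "");
        // Assumption: a row made only of dots should yield no fields at all.
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\.+");
    }

    public static void main(String[] args) {
        // prints [O, O, O]
        System.out.println(Arrays.toString(fields(".....O.O.O")));
    }
}
```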
First thing first, here is my code:
String line = "Events|1005435529|7021370073||PAGELOAD|2017-06-19T12:04:40||JI||ServerHostName|ServerIPAddress|9P2_D2jB9Toct7PDTJ7zwLUmWfEYz6Y4akyOKn2g4CepveMH4wr3!46548593!1497854077121|||||||||||";
int offset = line.indexOf("Events");
String zeroIn = line.substring(offset);
String[] jsonElements = zeroIn.split("\\|");
System.out.println(Arrays.asList(jsonElements));
Output:
[Events, 1005435529, 7021370073, , PAGELOAD, 2017-06-19T12:04:40, , JI, , ServerHostName, ServerIPAddress, 9P2_D2jB9Toct7PDTJ7zwLUmWfEYz6Y4akyOKn2g4CepveMH4wr3!46548593!1497854077121]
I also notice spaces added to each array element at the beginning.
My question: the line ends with about ten empty pipe-delimited fields. The empty fields at the first, second and third occurrences are respected, but the ones at the end are dropped and do not show up in the array. What am I missing?
split(java.lang.String regex) calls split(java.lang.String regex, int limit) with a limit of 0.
If n is zero then the pattern will be applied as many times as
possible, the array can have any length, and trailing empty strings
will be discarded.
You may call this method yourself with a positive limit (large enough to be sure to include all tokens) to prevent trailing empty tokens from being discarded:
String[] jsonElements = zeroIn.split("\\|", zeroIn.length());
Note: as pointed out in the comments, using a negative value is indeed a better way to do this:
String[] jsonElements = zeroIn.split("\\|", -1);
If n is non-positive then the pattern will be applied as many times as
possible and the array can have any length.
From String class and split method doc:
Trailing empty strings are therefore not included in the resulting array.
So everything after the last non-empty string is left out of the array.
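A small sketch of how the limit argument changes the result. The input "a|b||c|||" is a shortened stand-in for the log line in the question, and TrailingEmpties and fields() are hypothetical names.

```java
import java.util.Arrays;

class TrailingEmpties {
    // Hypothetical helper: splits on a literal pipe with the given limit.
    static String[] fields(String line, int limit) {
        return line.split("\\|", limit);
    }

    public static void main(String[] args) {
        String line = "a|b||c|||"; // assumed sample: three empty fields at the end

        // limit 0 (what split(regex) uses): trailing empty strings are discarded
        System.out.println(Arrays.toString(fields(line, 0)));  // prints [a, b, , c]

        // negative limit: every field is kept, including the trailing empty ones
        System.out.println(Arrays.toString(fields(line, -1))); // prints [a, b, , c, , , ]
    }
}
```

Note that the empty field in the middle survives in both cases; only the empty strings at the end are affected by the limit.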
The accepted answer explains the limitations you observed when splitting on a single-character delimiter. I thought I would offer this answer in case you need to retain empty tokens in your output. If you split using a lookaround, e.g. a lookbehind, you end up with distinct entries even when two pipes have nothing between them (each entry keeps its trailing pipe, since the lookbehind itself consumes nothing):
String line = "Events|1005435529|7021370073||PAGELOAD|2017-06-19T12:04:40||JI||ServerHostName|ServerIPAddress|9P2_D2jB9Toct7PDTJ7zwLUmWfEYz6Y4akyOKn2g4CepveMH4wr3!46548593!1497854077121|||||||||||";
String[] parts = line.split("(?<=\\|)");
for (String part : parts) {
    System.out.println(part);
}
Demo here:
Rextester
I'm practicing reading input and then tokenizing it.
For example, if I have [882,337] I want to just get the numbers 882 and 337. I tried using the following code:
String test = "[882,337]";
String[] tokens = test.split("\\[|\\]|,");
System.out.println(tokens[0]);
System.out.println(tokens[1]);
System.out.println(tokens[2]);
It kind of works, the output is:
(blank line)
882
337
What I don't understand is why tokens[0] is empty. I would expect there to be only two tokens, where tokens[0] = 882 and tokens[1] = 337.
I checked out some links but didn't find the answer.
Thanks for the help!
Split splits the given String. If you split "[882,337]" on "[" or "," or "]" then you actually have:
nothing
882
337
nothing
But, as you have called String.split(delimiter), this calls String.split(delimiter, limit) with a limit of zero.
From the documentation:
The limit parameter controls the number of times the pattern is applied and therefore affects the length of the resulting array. If the limit n is greater than zero then the pattern will be applied at most n - 1 times, the array's length will be no greater than n, and the array's last entry will contain all input beyond the last matched delimiter. If n is non-positive then the pattern will be applied as many times as possible and the array can have any length. If n is zero then the pattern will be applied as many times as possible, the array can have any length, and trailing empty strings will be discarded.
(emphasis mine)
So in this configuration the final, empty, strings are discarded. You are therefore left with exactly what you have.
Usually, to tokenize something like this, one would go for a combination of replaceAll and split:
final String[] tokens = input.replaceAll("^\\[|\\]$", "").split(",");
This will first strip off the start (^[) and end (]$) brackets and then split on ,. This way you don't have to have somewhat obtuse program logic where you start looping from an arbitrary index.
As an alternative, for more complex tokenizations, one can use Pattern - might be overkill here, but worth bearing in mind before you get into writing multiple replaceAll chains.
First we need to define, in Regex, the tokens we want (rather than those we're splitting on) - in this case it's simple, it's just digits so \d.
So, in order to extract all digit-only (no thousands/decimal separators) values from an arbitrary String one would do the following:
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final List<Integer> tokens = new ArrayList<>();       // to hold the tokens
final Pattern pattern = Pattern.compile("\\d++");     // the compiled regex
final Matcher matcher = pattern.matcher(input);       // the matcher on input
while (matcher.find()) {                              // for each matched token
    tokens.add(Integer.parseInt(matcher.group()));    // parse as an int and store
}
N.B.: I have used a possessive quantifier (++) for efficiency.
So, you see, the above code is somewhat more complex than the simple replaceAll().split(), but it is much more extensible. You can use an arbitrarily complex regex to tokenize almost any input.
The symbols where the string is split are here:
String test = "[882,337]";
               ^   ^   ^
Because the first char matches your delimiter, everything to the left of it becomes the first result. To the left of the first char there is nothing, so the result is the empty string.
One could expect the same behaviour for the end, since the last symbol also matches the delimiter. But:
Trailing empty strings are therefore not included in the resulting array.
See Javadoc.
Splitting creates two (or more) things from one thing. For instance if you split a,b by , you will get a and b.
But in case of ",b" you will get "" and "b". You can think of it this way:
"" exists at start, end and even in-between all characters of string:
""+","+"b" -> ",b" so if we split on this "," we are getting left and right part: "" and "b"
Something similar happens in the case of "a,": at first the result array is ["a",""], but here the split method removes trailing empty strings and returns only ["a"] (you can turn off this clearing mechanism by using split(",", -1)).
So in case of
String test = "[882,337]";
String[] tokens = test.split("\\[|\\]|,");
you are splitting:
""+"["+"882"+","+"337"+"]"+""
    ^         ^         ^   (split here)
which at first creates the array ["", "882", "337", ""], but then the trailing empty string is removed and you finally receive:
["", "882", "337"]
The only case where an empty string is removed from the start of the result array is when both of these hold:
you are using Java 8 (or newer) and splitting on a zero-length regex such as split("") or, say, before each x with split("(?=x)") (more info at: Why in Java 8 split sometimes removes empty strings at start of result array?)
and this empty string was produced by the split itself. For instance "".split("") will not remove "", more info here: https://stackoverflow.com/a/25058091/1393766
That's because each delimiter has a "before" and "after" result, even if it is empty. Consider
882,337
You expect that to produce two results.
Similarly, you expect
882,337,
to produce three, with the last one being empty (assuming your limit is big enough, or assuming you're using almost any other language / implementation of split()). Extending that logically,
,882,337,
must produce four, with the first and last results being empty. This is exactly the case you have, except you have multiple delimiters.
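Putting the answers in this thread together, here is a short sketch that shows the three results side by side. BracketSplit, DELIMS and numbers() are names I made up for the example; numbers() uses the strip-then-split approach suggested above.

```java
import java.util.Arrays;

class BracketSplit {
    // The delimiter regex from the question: '[' or ']' or ','.
    static final String DELIMS = "\\[|\\]|,";

    // Hypothetical helper: strips one leading '[' and one trailing ']',
    // then splits the rest on commas.
    static String[] numbers(String input) {
        return input.replaceAll("^\\[|\\]$", "").split(",");
    }

    public static void main(String[] args) {
        String test = "[882,337]";

        // default limit 0: leading empty string kept, trailing one discarded
        System.out.println(Arrays.toString(test.split(DELIMS)));     // prints [, 882, 337]

        // negative limit: both empty strings are kept
        System.out.println(Arrays.toString(test.split(DELIMS, -1))); // prints [, 882, 337, ]

        // strip the brackets first: only the two numbers remain
        System.out.println(Arrays.toString(numbers(test)));          // prints [882, 337]
    }
}
```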
I have a string like
String myString = "hello world~~hello~~world";
I am using the split method like this
String[] temp = myString.split("~|~~|~~~");
I want the array temp to contain only the strings separated by ~, ~~ or ~~~.
However, the temp array thus created has length 5, the 2 additional 'strings' being empty strings.
I want it to ONLY contain my non-empty string. Please help. Thank you!
You should use a quantifier with your character:
String[] temp = myString.split("~+");
String#split() takes a regex. ~+ will match 1 or more ~, so it will split on ~, or ~~, or ~~~, and so on.
Also, if you just want to split on ~, ~~, or ~~~, then you can limit the repetition by using {m,n} quantifier, which matches a pattern from m to n times:
String[] temp = myString.split("~{1,3}");
When you split it the way you are doing, it will split a~~b twice on ~, and thus the middle element will be an empty string.
You could also have solved the problem by reversing the order of your delimiter like this:
String[] temp = myString.split("~~~|~~|~");
That will first try to split on ~~~, then on ~~, and only then on ~, and will work fine. But you should use the first approach.
Just turn the pattern around:
String myString = "hello world~~hello~~world";
String[] temp = myString.split("~~~|~~|~");
Try this:
myString.split("~~~|~~|~");
It will definitely work. In your code, the first ~ encountered is always treated as a separator on its own and the string is split at that point, so the ~~ and ~~~ alternatives never get a chance to match even though they occur in the string. Like:
[hello world]~[]~[hello]~[]~[world]
The square brackets mark the 5 separate string values produced by the split.
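The difference between the three patterns from this thread can be checked with a few lines. This is only a sketch; the class TildeSplit and the helper count() are hypothetical.

```java
import java.util.Arrays;

class TildeSplit {
    static final String INPUT = "hello world~~hello~~world";

    // Hypothetical helper: number of pieces INPUT is split into by the regex.
    static int count(String regex) {
        return INPUT.split(regex).length;
    }

    public static void main(String[] args) {
        // Alternation tries the branches left to right, so "~" always wins here.
        System.out.println(Arrays.toString(INPUT.split("~|~~|~~~"))); // prints [hello world, , hello, , world]

        // Longest alternative first: "~~" is consumed as one separator.
        System.out.println(Arrays.toString(INPUT.split("~~~|~~|~"))); // prints [hello world, hello, world]

        // Quantifier: any run of one or more '~' is one separator.
        System.out.println(Arrays.toString(INPUT.split("~+")));       // prints [hello world, hello, world]
    }
}
```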
I don't see why the following output makes sense.
The String split method on an empty String returns an array of String with length 1:
String[] split = "".split(",");
System.out.println(split.length);
Returns array of String with length 1
String[] split = "Java".split(",");
System.out.println(split.length);
Returns array of String with length 1
How can I differentiate between the two cases?
From the documentation:
The array returned by this method contains each substring of this string that is terminated by another substring that matches the given expression or is terminated by the end of the string.
To answer your question, it does what it is expected to do: the returned substring is terminated by the end of the input string (as there was no , to be found). The documentation also states:
If the expression does not match any part of the input then the resulting array has just one element, namely this string.
Note that this is a consequence of the first statement. It is not an additional circumstance that the Java developers added in case the search string could not be found.
I hit this, too. What it's returning is the string up to but not including the split character. If you want to get no strings, use StringTokenizer:
StringTokenizer st = new StringTokenizer(someString, ",");
int numberOfSubstrings = st.countTokens();
It's returning the original string (which in this case is the empty string) since there was no , to split on.
It returns one because you are measuring the size of the split array, which contains one element: an empty string.
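To differentiate the two cases in practice, one option is to check for the empty string before splitting. The sketch below assumes that an empty input should count as zero fields; EmptySplit and fieldCount() are hypothetical names, not anything from the JDK.

```java
import java.util.StringTokenizer;

class EmptySplit {
    // Hypothetical helper: number of comma-separated fields,
    // treating the empty string as having no fields (an assumption).
    static int fieldCount(String s) {
        return s.isEmpty() ? 0 : s.split(",").length;
    }

    public static void main(String[] args) {
        System.out.println("".split(",").length);      // prints 1 (the array holds one empty string)
        System.out.println("Java".split(",").length);  // prints 1 (the array holds "Java")

        System.out.println(fieldCount(""));            // prints 0
        System.out.println(fieldCount("Java"));        // prints 1

        // StringTokenizer never returns empty tokens, so it reports 0 here.
        System.out.println(new StringTokenizer("", ",").countTokens()); // prints 0
    }
}
```

Keep in mind that StringTokenizer also skips empty tokens between delimiters, so it is not a drop-in replacement for split() when empty fields matter.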