String.split() - matching leading empty String prior to first delimiter? - java

I need to be able to split an input String by commas, semi-colons or white-space (or a mix of the three). I would also like to treat multiple consecutive delimiters in the input as a single delimiter. Here's what I have so far:
String regex = "[,;\\s]+";
return input.split(regex);
This works, except for when the input string starts with one of the delimiter characters, in which case the first element of the result array is an empty String. I do not want my result to have empty Strings, so that something like, ",,,,ZERO; , ;;ONE ,TWO;," returns just a three element array containing the capitalized Strings.
Is there a better way to do this than stripping out any leading characters that match my reg-ex prior to invoking String.split?
Thanks in advance!

No, there isn't. You can only ignore trailing delimiters by providing 0 as a second parameter to String's split() method:
return input.split(regex, 0);
but for leading delimiters, you'll have to strip them first:
return input.replaceFirst("^"+regex, "").split(regex, 0);

If by "better" you mean higher performance then you might want to try creating a regular expression that matches what you want to match and using Matcher.find in a loop and pulling out the matches as you find them. This saves modifying the string first. But measure it for yourself to see which is faster for your data.
If by "better" you mean simpler, then no I don't think there is a simpler way than the way you suggested: removing the leading separators before applying the split.
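The Matcher.find approach mentioned above could look roughly like the sketch below. It matches the tokens themselves (runs of non-delimiter characters) instead of the delimiters, so leading and trailing delimiters never produce empty strings. The class and method names (TokenFinder, tokens) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TokenFinder {
    // One token = a run of characters that are NOT comma, semicolon or whitespace.
    private static final Pattern TOKEN = Pattern.compile("[^,;\\s]+");

    // Hypothetical helper: collects every token found in the input, in order.
    static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens(",,,,ZERO; , ;;ONE ,TWO;,")); // prints [ZERO, ONE, TWO]
    }
}
```

An input made up only of delimiters yields an empty list here, whereas the strip-then-split variant returns a one-element array containing an empty String in that case.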

Pretty much all splitting facilities built into the JDK are broken one way or another. You'd be better off using a third-party class such as Guava's Splitter, which is both flexible and correct in how it handles empty tokens and whitespace:
Splitter.on(CharMatcher.anyOf(";,").or(CharMatcher.WHITESPACE))
.omitEmptyStrings()
.split(",,,ZERO;,ONE TWO");
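will yield an Iterable<String> containing "ZERO", "ONE", "TWO". (Newer Guava releases replace the CharMatcher.WHITESPACE constant with the CharMatcher.whitespace() method.)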

You could also potentially use StringTokenizer to build the list, depending what you need to do with it:
StringTokenizer st = new StringTokenizer(",,,ZERO;,ONE TWO", ",; ", false);
while(st.hasMoreTokens()) {
String str = st.nextToken();
//add to list, process, etc...
}
As a caveat, however, you'll need to define each potential whitespace character separately in the second argument to the constructor.

Related

Confused with using split with multiple delimiters

I'm practicing reading input and then tokenizing it.
For example, if I have [882,337] I want to just get the numbers 882 and 337. I tried using the following code:
String test = "[882,337]";
String[] tokens = test.split("\\[|\\]|,");
System.out.println(tokens[0]);
System.out.println(tokens[1]);
System.out.println(tokens[2]);
It kind of works, the output is:
(blank line)
882
337
What I don't understand is why tokens[0] is empty. I would expect there to be only two tokens, where tokens[0] = 882 and tokens[1] = 337.
I checked out some links but didn't find the answer.
Thanks for the help!
Split splits the given String. If you split "[882,337]" on "[" or "," or "]" then you actually have:
nothing
882
337
nothing
But, as you have called String.split(delimiter), this calls String.split(delimiter, limit) with a limit of zero.
From the documentation:
The limit parameter controls the number of times the pattern is applied and therefore affects the length of the resulting array. If the limit n is greater than zero then the pattern will be applied at most n - 1 times, the array's length will be no greater than n, and the array's last entry will contain all input beyond the last matched delimiter. If n is non-positive then the pattern will be applied as many times as possible and the array can have any length. If n is zero then the pattern will be applied as many times as possible, the array can have any length, and trailing empty strings will be discarded.
(emphasis mine)
So in this configuration the final, empty, strings are discarded. You are therefore left with exactly what you have.
Usually, to tokenize something like this, one would go for a combination of replaceAll and split:
final String[] tokens = input.replaceAll("^\\[|\\]$", "").split(",");
This will first strip off the start (^[) and end (]$) brackets and then split on ,. This way you don't have to have somewhat obtuse program logic where you start looping from an arbitrary index.
As an alternative, for more complex tokenizations, one can use Pattern - might be overkill here, but worth bearing in mind before you get into writing multiple replaceAll chains.
First we need to define, as a regex, the tokens we want (rather than those we're splitting on). In this case it's simple: a run of digits, so \d+.
So, in order to extract all digit-only (no thousands/decimal separators) values from an arbitrary String, one would do the following:
final List<Integer> tokens = new ArrayList<>(); // to hold the tokens
final Pattern pattern = Pattern.compile("\\d++"); // the compiled regex
final Matcher matcher = pattern.matcher(input); // the matcher on input
while (matcher.find()) { // for each matched token
    tokens.add(Integer.parseInt(matcher.group())); // parse as an int and store
}
N.B: I have used a possessive regex pattern for efficiency
So, you see, the above code is somewhat more complex than the simple replaceAll().split(), but it is much more extensible. You can use arbitrarily complex regexes to tokenize almost any input.
The symbols where the string is split are here:
String test = "[882,337]";
^ ^ ^
Because the first char matches your delimiter, everything to the left of it becomes the first result. There is nothing to the left of the first character, so that result is the empty string.
One could expect the same behaviour for the end, since the last symbol also matches the delimiter. But:
Trailing empty strings are therefore not included in the resulting array.
See Javadoc.
Splitting creates two (or more) things from one thing. For instance if you split a,b by , you will get a and b.
But in case of ",b" you will get "" and "b". You can think of it this way:
"" exists at start, end and even in-between all characters of string:
""+","+"b" -> ",b" so if we split on this "," we are getting left and right part: "" and "b"
A similar thing happens in the case of "a,": at first the result array is ["a",""], but here the split method removes trailing empty strings and returns only ["a"] (you can turn off this clearing mechanism by using split(",", -1)).
So in case of
String test = "[882,337]";
String[] tokens = test.split("\\[|\\]|,");
you are splitting:
""+"["+"882"+","+"337"+"]"+""
here: ^ ^ ^
which at first creates array ["", "882", "337", ""] but then trailing empty string is removed and finally you are receiving:
["", "882", "337"]
The only case where an empty string is removed from the start of the result array is when both of the following hold:
you are using Java 8 (or newer) and splitting on a regex match that is zero-length, like split("") or, say, before each x with split("(?=x)") (more info at: Why in Java 8 split sometimes removes empty strings at start of result array?)
and this empty string was produced by the split itself. For instance, "".split("") will not remove "", more info here: https://stackoverflow.com/a/25058091/1393766
That's because each delimiter has a "before" and "after" result, even if it is empty. Consider
882,337
You expect that to produce two results.
Similarly, you expect
882,337,
to produce three, with the last one being empty (assuming your limit is big enough, or assuming you're using almost any other language / implementation of split()). Extending that logically,
,882,337,
must produce four, with the first and last results being empty. This is exactly the case you have, except you have multiple delimiters.
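The leading and trailing behaviour described in the answers above can be checked with a small sketch like the following. It only uses String.split and String.replaceAll from the JDK; the class and method names (SplitLimitDemo, defaultSplit, keepAll, stripThenSplit) are invented for this illustration.

```java
import java.util.Arrays;

final class SplitLimitDemo {
    // Same delimiter regex as in the question: '[' or ']' or ','.
    static final String DELIMS = "\\[|\\]|,";

    // Default split (limit 0): trailing empty strings are dropped, the leading one is kept.
    static String[] defaultSplit(String s) {
        return s.split(DELIMS);
    }

    // Negative limit: no empty strings are discarded.
    static String[] keepAll(String s) {
        return s.split(DELIMS, -1);
    }

    // Strip the enclosing brackets first, then split on commas only.
    static String[] stripThenSplit(String s) {
        return s.replaceAll("^\\[|\\]$", "").split(",");
    }

    public static void main(String[] args) {
        String test = "[882,337]";
        System.out.println(Arrays.toString(defaultSplit(test)));   // prints [, 882, 337]
        System.out.println(Arrays.toString(keepAll(test)));        // prints [, 882, 337, ]
        System.out.println(Arrays.toString(stripThenSplit(test))); // prints [882, 337]
    }
}
```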

Tokenizing a string using negations

So I have the following problem:
I have to tokenize a string using String.split() and the tokens must be in the form 07dd ddd ddd, where d is a digit. I thought of using the following regex: ^(07\\d{2}\\s\\d{3}\\d{3}) and passing it as an argument to String.split(). But for some reason, although I do have substrings in that form, it outputs the whole initial string and doesn't tokenize it.
I initially thought that it was using an empty string as a splitter, as an empty string indeed matches that regex, but even after I added & (.)+ to the regex in order to ensure that the splitter doesn't have length 0, it still outputs the whole initial string.
I know that I could have used Patterns and Matchers to solve it much faster, but I have to use String.split(). Any ideas why this happens?
A Few Pointers
Your pattern ^(07\d{2}\s\d{3}\d{3}) is missing a space between the two last groups of digits
The reason you get the whole string back is that this pattern was never found in the first place: there is no split
If you split on this pattern (once fixed), the resulting array will be strings that are in-between this pattern (these tokens are actually removed)
If you want to use this pattern (once fixed), you need a Match All, not a Split. In JavaScript notation this would look like arrayOfMatches = yourString.match(/pattern/g); in Java the equivalent is a Matcher.find() loop.
If you want to split, you need to use a delimiter that is present between the token (this delimiter could in fact just be a zero-width position asserted by the 07 about to follow)
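The last pointer can be sketched as below, assuming the input is simply a sequence of numbers in the 07dd ddd ddd form separated by whitespace (the question does not show its actual input, so that layout is an assumption). The delimiter is the whitespace that is immediately followed by a complete number, checked with a lookahead. PhoneSplitter and splitNumbers are hypothetical names.

```java
import java.util.Arrays;

final class PhoneSplitter {
    // Whitespace that is directly followed by a full "07dd ddd ddd" token.
    // The lookahead (?=...) asserts the token without consuming it.
    private static final String BEFORE_NUMBER = "\\s+(?=07\\d{2}\\s\\d{3}\\s\\d{3})";

    // Hypothetical helper: splits a whitespace-separated run of numbers into tokens.
    static String[] splitNumbers(String input) {
        return input.split(BEFORE_NUMBER);
    }

    public static void main(String[] args) {
        String input = "0712 345 678 0798 765 432";
        System.out.println(Arrays.toString(splitNumbers(input)));
        // prints [0712 345 678, 0798 765 432]
    }
}
```

The spaces inside each number are not split on, because the text after them does not start a complete 07dd ddd ddd token.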
Further Reading
Match All and Split are Two Sides of the Same Coin

Remove all other trailing whitespace characters except tab in Java

I know that removing whitespaces is as easy as String.trim(). But my string contains tab (\t) characters which I would like to keep.
Example:
"teststring\t\t\t ".trimSpaceNotTab() => "teststring\t\t\t"
My current implementation is to use split();
String[] arr = tabbedString.split("\t");
Then joining them somewhere as a string.
I find this implementation slow and ugly.
Is there a better way in Java where I can retain the tabs?
How about
tabbedString.replaceAll("[ \\n\\x0B\\f\\r]","")
Function used - String.replaceAll()
In case you'd like to also go for tabs and remove them, use a predefined character class \s
Pattern Summary
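Go through the string and ask each char whether it is whitespace using Character.isSpaceChar (note that it reports only Unicode space separators, so tab and newline are not counted; Character.isWhitespace covers those).
Use a regular expression that replaces all whitespace except tab (\t).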
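Since the question asks only about trailing whitespace, a variant anchored to the end of the input may be closer to what is wanted. The sketch below is one reading of the requirement: it removes the final run of non-tab whitespace and stops at the first tab or non-whitespace character it meets from the right. TrailingTrim is a made-up class name, and trimSpaceNotTab is borrowed from the question's wishful example.

```java
import java.util.regex.Pattern;

final class TrailingTrim {
    // A run of space, newline, vertical tab, form feed or carriage return
    // that reaches the very end of the input (\z). Tab is deliberately excluded.
    private static final Pattern TRAILING = Pattern.compile("[ \\n\\x0B\\f\\r]+\\z");

    // Hypothetical helper: drops trailing non-tab whitespace, keeps everything else.
    static String trimSpaceNotTab(String s) {
        return TRAILING.matcher(s).replaceFirst("");
    }

    public static void main(String[] args) {
        String trimmed = trimSpaceNotTab("teststring\t\t\t ");
        System.out.println(trimmed.equals("teststring\t\t\t")); // prints true
    }
}
```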

Regular expression to select all whitespace that isn't in quotes?

I'm not very good at RegEx, can someone give me a regex (to use in Java) that will select all whitespace that isn't between two quotes? I am trying to remove all such whitespace from a string, so any solution to do so will work.
For example:
(this is a test "sentence for the regex")
should become
(thisisatest"sentence for the regex")
Here's a single regex-replace that works:
\s+(?=([^"]*"[^"]*")*[^"]*$)
which will replace:
(this is a test "sentence for the regex" foo bar)
with:
(thisisatest"sentence for the regex"foobar)
Note that if the quotes can be escaped, the even more verbose regex will do the trick:
\s+(?=((\\[\\"]|[^\\"])*"(\\[\\"]|[^\\"])*")*(\\[\\"]|[^\\"])*$)
which replaces the input:
(this is a test "sentence \"for the regex" foo bar)
with:
(thisisatest"sentence \"for the regex"foobar)
(note that it also works with escaped backslashes: (thisisatest"sentence \\\"for the regex"foobar))
Needless to say (?), this really shouldn't be used to perform such a task: it makes one's eyes bleed, and it performs its task in quadratic time, while a simple linear solution exists.
EDIT
A quick demo:
String text = "(this is a test \"sentence \\\"for the regex\" foo bar)";
String regex = "\\s+(?=((\\\\[\\\\\"]|[^\\\\\"])*\"(\\\\[\\\\\"]|[^\\\\\"])*\")*(\\\\[\\\\\"]|[^\\\\\"])*$)";
System.out.println(text.replaceAll(regex, ""));
// output: (thisisatest"sentence \"for the regex"foobar)
Here is the regex which works for both single & double quotes (assuming that all strings are delimited properly)
\s+(?=(?:[^\'"]*[\'"][^\'"]*[\'"])*[^\'"]*$)
It won't work with strings that have quotes inside them.
This just isn't something regexes are good at. Search-and-replace functions with regexes are always a bit limited, and any sort of nesting/containment at all becomes difficult and/or impossible.
I'd suggest an alternate approach: Split your string on quote characters. Go through the resulting array of strings, and strip the spaces from every other substring (whether you start with the first or second depends on whether you string started with a quote or not). Then join them back together, using quotes as separators. That should produce the results you're looking for.
Hope that helps!
PS: Note that this won't handle nested strings, but since you can't make nested strings with the ASCII double-quote character, I'm gonna assume you don't need that behaviour.
PPS: Once you're dealing with your substrings, then it's a good time to use regexes to kill those spaces - no containing quotes to worry about. Just remember to make it a global replacement and not just the first match: the /.../g modifier in languages that have it, or String.replaceAll in Java, which already replaces every match.
Groups of whitespace outside of quotes are separated by stuff that's a) not whitespace, or b) inside quotes.
Perhaps something like:
(\s+)([^ "]+|"[^"]*")*
The first part matches a sequence of spaces; the second part matches non-spaces (and non-quotes), or some stuff in quotes, either repeated any number of times. The second part is the separator.
This will give you two groups for each item in the result; just ignore the second element. (We need the parentheses for precedence rather than match grouping there.) Or, you could, say, concatenate all the second elements -- though you need to match the first non-space word too, or in this example, make the spaces optional:
StringBuffer b = new StringBuffer();
Pattern p = Pattern.compile("(\\s+)?([^ \"]+|\"[^\"]*\")*");
Matcher m = p.matcher("this is \"a test\"");
while (m.find()) {
if (m.group(2) != null)
b.append(m.group(2));
}
System.out.println(b.toString());
(I haven't done much regex in Java so expect bugs.)
Finally, this is how I'd do it if regexes were compulsory. ;-)
As well as Xavier's technique, you could simply do it the way you'd do it in C: just iterate over the input characters, and copy each to the new string if either it's non-space, or you've counted an odd number of quotes up to that point.
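The character-by-character approach described above might look like this in Java. It is a minimal sketch that assumes quotes are plain double quotes and are never escaped; QuoteAwareStripper and stripOutsideQuotes are illustrative names, not part of any library.

```java
final class QuoteAwareStripper {
    // Copies every character except whitespace that sits outside double quotes.
    // Assumption: quotes are balanced and there are no escaped quotes (\").
    static String stripOutsideQuotes(String input) {
        StringBuilder out = new StringBuilder(input.length());
        boolean insideQuotes = false; // true after an odd number of quotes
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '"') {
                insideQuotes = !insideQuotes;
            }
            if (insideQuotes || !Character.isWhitespace(c)) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "(this is a test \"sentence for the regex\" foo bar)";
        System.out.println(stripOutsideQuotes(text));
        // prints (thisisatest"sentence for the regex"foobar)
    }
}
```

This runs in a single pass over the input, which is the linear alternative to the lookahead regex mentioned earlier in the thread.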
If there is only one set of quotes, you can do this:
String s = "(this is a test \"sentence for the regex\") a b c";
Matcher matcher = Pattern.compile("^[^\"]+|[^\"]+$").matcher(s);
while (matcher.find())
{
String group = matcher.group();
s = s.replace(group, group.replaceAll("\\s", ""));
}
System.out.println(s); // (thisisatest"sentence for the regex")abc
This isn't an exact solution, but you can accomplish your goal by doing the following:
STEP 1: Match the two segments
\(([a-zA-Z ]*)"([a-zA-Z ]*)"\)
STEP 2: remove spaces
temp = $1 replace " " with ""
STEP 3: rebuild your string
(temp"$2")

Matcher returns matches on a regex pattern, but split() fails to find a match on the same regex?

I can't see a reason why the Matcher would return a match on the pattern, but split will return a zero length array on the same regex pattern. It should return something -- in this example I'm looking for a return of 2 separate strings containing "param/value".
public class MyClass {
protected Pattern regEx = "(([a-z])+/{1}([a-z0-9])+/?)*";
public void someMethod() {
String qs = "param/value/param/value";
Matcher matcherParamsRegEx = this.regEx.matcher(qs);
if (matcherParamsRegEx.matches()) { // This finds a match.
String[] parameterValues = qs.split(this.regEx.pattern()); // No matches... zero length array.
}
}
}
The pattern can match the entire string. split() doesn't return the match, only what's in between. Since the pattern matches the whole string that only leaves an empty string to return. I think you might be under a misconception as to what split() does.
For example:
String qs = "param/value/param/value";
String[] pieces = qs.split("/");
will return an array of 4 elements: param, value, param, value.
Notice that what you search on ("/") isn't returned.
Your regex is somewhat over-complicated. For one thing you're using {1}, which is unnecessary. Second, when you do ([a-z])+ you will capture exactly one letter (the last one encountered). Compare that to ([a-z]+), which will capture the entire match. Also, you don't even need to capture for this. The pattern can be simplified to:
protected Pattern regEx = Pattern.compile("[a-z]+/([a-z0-9]+/?)*");
Technically this:
protected Pattern regEx = "(([a-z])+/{1}([a-z0-9])+/?)*";
is a compiler error, so what you actually ran versus what you posted could be anything.
The problem here is that split splits around matches of your regex. You have two consecutive matches with nothing else in between, so there is nothing left for split to return.
I can't see any way for you to get what you want from that string using split, but if you can use a different delimiter to separate pairs than you do to separate name and value, that will help a lot.
Otherwise, you might split on slashes and take alternating results as names and values, but this is error-prone.
The regex is matching--if it weren't, you would get a one-element array, that element being the whole original string. You just have the wrong idea about how split() works. On the first match attempt it finds "param/value/" and stores everything preceding that match as the first token: an empty string. The second attempt finds "param/value" and stores whatever lay between it and the first match as the next token: another empty string. The third match attempt fails, so whatever was between the second match and the end of the string becomes the final token: yet another empty string.
Having stored all the tokens, split() iterates through them in reverse, checking for trailing empty tokens. The third token is indeed empty, so it deletes that one. The second one is also empty, so it deletes that one. You see where this is going? You can force split() to preserve trailing empty matches by passing a negative integer as the second argument, but that obviously doesn't do you any good. You need to rethink your problem (whatever it is) in terms of how the regex package actually works.
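To make the difference concrete, here is a small sketch contrasting the two calls. It assumes the goal is the two "param/value" pairs the question mentions and uses a simplified pair pattern of my own choosing ([a-z]+/[a-z0-9]+), not the asker's original regex, for the find loop. PairFinder and pairs are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PairFinder {
    // The regex as posted in the question (used here only with split).
    static final String POSTED = "(([a-z])+/{1}([a-z0-9])+/?)*";

    // Assumed simplification: one name, a slash, one value.
    private static final Pattern PAIR = Pattern.compile("[a-z]+/[a-z0-9]+");

    // Collects the matches themselves, which is what split() never returns.
    static List<String> pairs(String qs) {
        List<String> result = new ArrayList<>();
        Matcher m = PAIR.matcher(qs);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String qs = "param/value/param/value";
        System.out.println(qs.split(POSTED).length); // prints 0
        System.out.println(pairs(qs));               // prints [param/value, param/value]
    }
}
```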
