Splitting string with character sequence as a delimiter

The requirement is to split strings in Java so that the following
"this#{s}is#{s}a#{s}string"
would result in the following array
["this","is","a","string"]
As you can see here the delimiter is the character sequence "#{s}".
What is the fastest, most efficient way of doing this using existing tools?
Am I right to assume that using regex (String.split()) is a bit wasteful, given that we are splitting on a static string?
I got the assumption from here: http://www.javamex.com/tutorials/regular_expressions/splitting_tokenisation_performance.shtml
But I cannot use StringTokenizer, since the delimiter is a sequence of characters.
Note: currently I'm using String.split() and have no problem with that. This is pure curiosity.

Faster than using String.split is Pattern.split: i.e., precompile the pattern and store that for subsequent use. If you use the same pattern all the time, and do a lot of splitting using that pattern, it may be worth putting that pattern into a static field or something.
Also, if your pattern contains no regex metacharacters, you can pass in Pattern.LITERAL when creating the pattern. This is something you can't do with String.split. :-P
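A minimal sketch of that suggestion, using the delimiter and input from the question. The class and method names (LiteralSplit, split) are made up for illustration; only Pattern.compile, Pattern.LITERAL and Pattern.split come from the JDK.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class LiteralSplit {
    // Compiled once and reused. Pattern.LITERAL makes "#{s}" plain text,
    // so none of its characters are interpreted as regex syntax.
    private static final Pattern DELIMITER = Pattern.compile("#{s}", Pattern.LITERAL);

    // Hypothetical helper name; it simply delegates to Pattern.split.
    static String[] split(String input) {
        return DELIMITER.split(input);
    }

    public static void main(String[] args) {
        String[] parts = split("this#{s}is#{s}a#{s}string");
        System.out.println(Arrays.toString(parts)); // prints [this, is, a, string]
    }
}
```

Whether this is measurably faster than String.split() for a given workload is something to benchmark rather than assume.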

Related

Tokenizing a string using negations

So I have the following problem:
I have to tokenize a string using String.split(), and the tokens must be in the form 07dd ddd ddd, where d is a digit. I thought of using the regex ^(07\\d{2}\\s\\d{3}\\d{3}) and passing it as the argument to String.split(). But for some reason, although I do have substrings of that form, it outputs the whole initial string and doesn't tokenize it.
I initially thought that it was using an empty string as a splitter, as an empty string indeed matches that regex, but even after I added & (.)+ to the regex to ensure that the splitter doesn't have length 0, it still outputs the whole initial string.
I know that I could have used Patterns and Matchers to solve it much faster, but I have to use String.split(). Any ideas why this happens?

A Few Pointers
Your pattern ^(07\d{2}\s\d{3}\d{3}) is missing a space between the two last groups of digits
The reason you get the whole string back is that this pattern was never found in the first place: there is no split
If you split on this pattern (once fixed), the resulting array will be strings that are in-between this pattern (these tokens are actually removed)
If you want to use this pattern (once fixed), you need a Match All, not a Split. In JavaScript this looks like arrayOfMatches = yourString.match(/pattern/g); in Java, the equivalent is a loop over Matcher.find()
If you want to split, you need to use a delimiter that is present between the tokens (this delimiter could in fact just be a zero-width position asserted by the 07 about to follow)
Further Reading
Match All and Split are Two Sides of the Same Coin
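The two options from the pointers above can be sketched in Java as follows. This assumes the input consists only of tokens of the form 07dd ddd ddd separated by whitespace; the class name, method names and sample numbers are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PhoneTokens {
    // Token shape from the question, with the missing \s restored: 07dd ddd ddd
    private static final String TOKEN = "07\\d{2}\\s\\d{3}\\s\\d{3}";

    // Split: the delimiter is whitespace, but only where a full token follows
    // (the lookahead is zero-width, so the token itself is not consumed).
    static String[] bySplit(String input) {
        return input.split("\\s+(?=" + TOKEN + ")");
    }

    // Match all: collect every occurrence of the token instead of splitting.
    static List<String> byMatch(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile(TOKEN).matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "0712 345 678 0798 765 432";
        System.out.println(Arrays.toString(bySplit(input))); // prints [0712 345 678, 0798 765 432]
        System.out.println(byMatch(input));                  // prints [0712 345 678, 0798 765 432]
    }
}
```

The split variant satisfies the "must use String.split()" constraint; the match variant is more robust if other text can appear between tokens.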

Pattern for Guava Splitter

I need to split a String by comma, dot or backslash:
Pattern stringPattern = Pattern.compile("\\s+|,|\\\\|");
Splitter.on(stringPattern).omitEmptyStrings().split(description);
but this pattern doesn't work. What is wrong?

Why not use a CharMatcher?
Splitter.on(CharMatcher.anyOf(",.\\")).omitEmptyStrings().split(description);
Given your simple problem, I don't think you need the regular expressions.

The correct regex for comma or dot or backslash is [.,\\], so in Java that's
Pattern.compile("[.,\\\\]")
I do like Olivier's suggestion of CharMatcher though.
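For readers without Guava on the classpath, here is a JDK-only sketch of the same idea: split on the character class [.,\\] and drop empty strings, which is roughly what omitEmptyStrings() does in the Guava version. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class SeparatorSplit {
    // Regex [.,\\] : one comma, dot or backslash. In a Java string literal
    // each regex backslash is doubled, hence the four backslashes.
    private static final Pattern SEPARATORS = Pattern.compile("[.,\\\\]");

    static List<String> split(String description) {
        return Arrays.stream(SEPARATORS.split(description))
                .filter(part -> !part.isEmpty()) // drop empties from adjacent separators
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // The input below is the text: a,b..c\d,
        System.out.println(split("a,b..c\\d,")); // prints [a, b, c, d]
    }
}
```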

I'd use String.split with a regular expression. The following should work (I have not tried it):
description.split("[,.\\\\]")
Then filter out the empty strings yourself (Splitter has extra API for that, omitEmptyStrings()).
Patterns are useful for identifying "groups". Any regex-based splitting can equally be done by passing the regex as a String (instead of a Pattern). That is not to discourage you from using Guava!

String.replaceAll is considerably slower than doing the job yourself

I have an old piece of code that performs find and replace of tokens within a string.
It receives a map of from and to pairs, iterates over them and for each of those pairs, iterates over the target string, looks for the from using indexOf(), and replaces it with the value of to. It does all the work on a StringBuffer and eventually returns a String.
I replaced that code with this line: replaceAll("[,. ]*", "");
And I ran some comparative performance tests.
When comparing for 1,000,000 iterations, I got this:
Old Code: 1287ms
New Code: 4605ms
3 times longer!
I then tried replacing it with 3 calls to replace:
replace(",", "");
replace(".", "");
replace(" ", "");
This resulted with the following results:
Old Code: 1295
New Code: 3524
2 times longer!
Any idea why replace and replaceAll are so inefficient? Can I do something to make it faster?
Edit: Thanks for all the answers - the main problem was indeed that [,. ]* did not do what I wanted it to do. Changing it to [,. ]+ brought the performance almost level with the non-regex solution.
Using a pre-compiled regex helped, but only marginally. (It is a solution very applicable to my problem.)
Test code:
Replace string with Regex: [,. ]*
Replace string with Regex: [,. ]+
Replace string with Regex: [,. ]+ and Pre-Compiled Pattern

While using regular expressions does carry some performance cost, it should not be this bad.
Note that using String.replaceAll() will compile the regular expression each time you call it.
You can avoid that by explicitly using a Pattern object:
Pattern p = Pattern.compile("[,. ]+");
// repeat only the following part:
String output = p.matcher(input).replaceAll("");
Note also that using + instead of * avoids replacing empty strings and therefore might also speed up the process.
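To make the difference between * and + concrete, the following sketch counts how many matches each pattern produces on a short sample string. The class name, helper name and sample input are assumptions for illustration; the two patterns are the ones discussed in this thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StarVersusPlus {
    // Both patterns are compiled once and reused.
    static final Pattern STAR = Pattern.compile("[,. ]*");
    static final Pattern PLUS = Pattern.compile("[,. ]+");

    // Counts how many matches (and therefore replacements) a pattern produces.
    static int countMatches(Pattern p, String input) {
        Matcher m = p.matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String input = "a, b. c";
        System.out.println(PLUS.matcher(input).replaceAll("")); // prints abc
        System.out.println(STAR.matcher(input).replaceAll("")); // prints abc
        System.out.println(countMatches(PLUS, input));          // prints 2
        System.out.println(countMatches(STAR, input));          // prints 6
    }
}
```

Both patterns give the same result here, but the * version also performs a replacement at every empty position it matches.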

replace and replaceAll use regex internally, which in most cases carries a serious performance cost compared to e.g. StringUtils.replace(..).
String.replaceAll():
public String replaceAll(String regex, String replacement) {
    return Pattern.compile(regex).matcher(this).replaceAll(replacement);
}
String.replace() uses Pattern.compile underneath as well:
public String replace(CharSequence target, CharSequence replacement) {
    return Pattern.compile(target.toString(), Pattern.LITERAL)
            .matcher(this)
            .replaceAll(Matcher.quoteReplacement(replacement.toString()));
}
Also see Replace all occurrences of substring in a string - which is more efficient in Java?

As I put in a comment, [,. ]* matches the empty String "". So every "space" between characters matches the pattern. The performance cost shows up because you are replacing a lot of "" with "".
Try doing this:
Pattern p = Pattern.compile("[,. ]*");
// quoteReplacement is needed because a bare $ in a replacement string is a group reference
System.out.println(p.matcher("Hello World!").replaceAll(Matcher.quoteReplacement("$$$")));
It returns:
$$$H$$$e$$$l$$$l$$$o$$$$$$W$$$o$$$r$$$l$$$d$$$!$$$
No wonder it is slower than doing it "by hand"! You should try with [,. ]+

When it comes to replaceAll("[,. ]*", "") it's not that big of a surprise, since it relies on regular expressions. The regex engine creates an automaton which it runs over the input. Some overhead is expected.
The second approach (replace(",", "")...) also uses regular expressions internally. Here, however, the given pattern is compiled using Pattern.LITERAL, so the regular expression overhead should be negligible. In this case the slowdown is probably due to the fact that Strings are immutable (however small a change you make, you create a new string), which makes them less efficient than StringBuffers, which manipulate the string in place.
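For comparison, a hand-written version that builds the result in a single pass might look like the sketch below. This is not the asker's original code (which worked from a map of from/to pairs); it is a simplified illustration for the specific case of deleting ',', '.' and ' ', with invented names.

```java
class ManualStrip {
    // Copies every character except ',', '.' and ' ' into one StringBuilder,
    // so only a single intermediate buffer and one final String are created.
    static String strip(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c != ',' && c != '.' && c != ' ') {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("a, b. c")); // prints abc
    }
}
```

By contrast, three chained replace() calls each produce a new intermediate String.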

Refactor Regex Pattern - Java

I have the following string to match, aaaa_bb_cc, and have written a regex pattern like
\\w{4}+\\_\\w{2}\\_\\w{2} and it works. Is there a simpler regex that can do the same?

You don't need to escape the underscores:
\w{4}+_\w{2}_\w{2}
And you can collapse the last two parts, if you don't capture them anyway:
\w{4}+(?:_\w{2}){2}
Doesn't get shorter, though.
(Note: Re-add the needed backslashes for Java's strings, if you like; I prefer to omit them while talking about regular expressions :))

I sometimes do what I call "meta-regexing" as follows:
String pattern = "x{4}_x{2}_x{2}".replace("x", "[a-z]");
System.out.println(pattern); // prints "[a-z]{4}_[a-z]{2}_[a-z]{2}"
Note that this doesn't use \w, which can match an underscore. That is, your original pattern would match "__________".
If x really needs to be replaced with [a-zA-Z0-9], then just do it in the one place (instead of 3 places).
Other examples
Regex for metamap in Java
How do I convert CamelCase into human-readable names in Java?

Yes, you can use just \\w{4}_\\w{2}_\\w{2} or maybe \\w{4}(_\\w{2}){2}.

Looks like your \w does not need to match underscore, so you can use [a-zA-Z0-9] instead
[a-zA-Z0-9]{4}_[a-zA-Z0-9]{2}_[a-zA-Z0-9]{2}
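The point about \w also matching underscores can be checked with a short sketch. The class and field names are made up; the two patterns are the ones suggested in the answers above, with the second one built using the "meta-regexing" trick.

```java
import java.util.Collections;
import java.util.regex.Pattern;

class UnderscoreCheck {
    // \w matches letters, digits and the underscore itself.
    static final Pattern WORD = Pattern.compile("\\w{4}_\\w{2}_\\w{2}");
    // Same shape, but each placeholder x expands to an alphanumeric-only class.
    static final Pattern ALNUM =
            Pattern.compile("x{4}_x{2}_x{2}".replace("x", "[a-zA-Z0-9]"));

    public static void main(String[] args) {
        String underscores = String.join("", Collections.nCopies(10, "_"));
        System.out.println(WORD.matcher("aaaa_bb_cc").matches());  // prints true
        System.out.println(ALNUM.matcher("aaaa_bb_cc").matches()); // prints true
        System.out.println(WORD.matcher(underscores).matches());   // prints true
        System.out.println(ALNUM.matcher(underscores).matches());  // prints false
    }
}
```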

Escaping a String from getting regex parsed in Java

In Java, suppose I have a String variable S, and I want to search for it inside of another String T, like so:
if (T.matches(S)) ...
(note: the above line was T.contains() until a few posts pointed out that that method does not use regexes. My bad.)
But now suppose S may have unsavory characters in it. For instance, let S = "[hi". The left square bracket is going to cause the regex to fail. Is there a function I can call to escape S so that this doesn't happen? In this particular case, I would like it to be transformed to "\[hi".

String.contains does not use regex, so there isn't a problem in this case.
Where a regex is required, rather than rejecting strings with regex special characters, use java.util.regex.Pattern.quote to escape them.

As Tom Hawtin said, you need to quote the pattern. You can do this in two ways (edit: actually three ways, as pointed out by #diastrophism):
Surround the string with "\Q" and "\E", like:
if (T.matches("\\Q" + S + "\\E"))
Use Pattern instead. The code would be something like this:
Pattern sPattern = Pattern.compile(S, Pattern.LITERAL);
if (sPattern.matcher(T).matches()) { /* do something */ }
This way, you can cache the compiled Pattern and reuse it. If you are using the same regex more than once, you almost certainly want to do it this way.
Note that if you are using regular expressions to test whether a string is inside a larger string, you should put .* at the start and end of the expression. But this will not work if you are quoting the pattern, since it will then be looking for actual dots. So, are you absolutely certain you want to be using regular expressions?

Try Pattern.quote(String). It will fix up anything that has special meaning in the string.
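A small sketch of Pattern.quote applied to the "[hi" example from the question. It uses Matcher.find() rather than matches(), since the goal is to search for S inside T; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class QuoteSearch {
    // Pattern.quote turns the needle into a literal pattern, so characters
    // such as '[' lose their special meaning. find() looks for it anywhere in text.
    static boolean containsLiteral(String text, String needle) {
        return Pattern.compile(Pattern.quote(needle)).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsLiteral("say [hi there", "[hi")); // prints true
        System.out.println(containsLiteral("say hi there", "[hi"));  // prints false
        // For a plain substring test, String.contains gives the same answer without regex.
        System.out.println("say [hi there".contains("[hi"));         // prints true
    }
}
```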

Any particular reason not to use String.indexOf() instead? That way it will always be interpreted as a regular string rather than a regex.

Regex uses the backslash character '\' to escape a literal. Given that Java string literals also use the backslash as an escape character, you need to use a double backslash, like:
String S = "\\[hi";
That will become the String:
\[hi
which will be passed to the regex.
Or, if you only care about a literal String and don't need a regex, you could do the following:
if (T.indexOf("[hi") != -1) { /* found */ }

T.contains() (according to javadoc : http://java.sun.com/javase/6/docs/api/java/lang/String.html) does not use regexes. contains() delegates to indexOf() only.
So, there are NO regexes used here. Were you thinking of some other String method ?
