String.replaceAll is considerably slower than doing the job yourself - java

I have an old piece of code that performs find and replace of tokens within a string.
It receives a map of from and to pairs, iterates over them and for each of those pairs, iterates over the target string, looks for the from using indexOf(), and replaces it with the value of to. It does all the work on a StringBuffer and eventually returns a String.
I replaced that code with this line: replaceAll("[,. ]*", "");
And I ran some comparative performance tests.
When comparing for 1,000,000 iterations, I got this:
Old Code: 1287ms
New Code: 4605ms
3 times longer!
I then tried replacing it with 3 calls to replace:
replace(",", "");
replace(".", "");
replace(" ", "");
This resulted with the following results:
Old Code: 1295
New Code: 3524
2 times longer!
Any idea why replace and replaceAll are so inefficient? Can I do something to make it faster?
Edit: Thanks for all the answers - the main problem was indeed that [,. ]* did not do what I wanted it to do. Changing it to be [,. ]+ almost equaled the performance of the non-Regex based solution.
Using a pre-compiled regex helped, but was marginal. (It is a solution very applicable for my problem.
Test code:
Replace string with Regex: [,. ]*
Replace string with Regex: [,. ]+
Replace string with Regex: [,. ]+ and Pre-Compiled Pattern

While using regular expressions imparts some performance impact, it should not be as terrible.
Note that using String.replaceAll() will compile the regular expression each time you call it.
You can avoid that by explicitly using a Pattern object:
Pattern p = Pattern.compile("[,. ]+");
// repeat only the following part:
String output = p.matcher(input).replaceAll("");
Note also that using + instead of * avoids replacing empty strings and therefore might also speed up the process.

replace and replaceAll uses regex internally which in most cases gives a serious performance impact compared to e.g., StringUtils.replace(..).
String.replaceAll():
public String replaceAll(String regex, String replacement) {
return Pattern.compile(regex).matcher(this ).replaceAll(
replacement);
}
String.replace() uses Pattern.compile underneath.
public String replace(CharSequence target, CharSequence replacement) {
return Pattern.compile(target.toString(), Pattern.LITERAL)
.matcher(this ).replaceAll(
Matcher.quoteReplacement(replacement.toString()));
}
Also see Replace all occurrences of substring in a string - which is more efficient in Java?

As I have put in a comment [,. ]* matches the empty String "". So, every "space" between characters matches the pattern. It is only noted in performance because you are replacing a lot of "" by "".
Try doing this:
Pattern p = Pattern.compile("[,. ]*");
System.out.println(p.matcher("Hello World").replaceAll("$$$");
It returns:
H$$$e$$$l$$$o$$$$$$W$$$o$$$r$$$l$$$d$$$!$$$
No wonder it is slower that doing it "by hand"! You should try with [,. ]+

When it comes to replaceAll("[,. ]*", "") it's not that big of a surprise since it relies on regular expressions. The regex engine creates an automaton which it runs over the input. Some overhead is expected.
The second approach (replace(",", "")...) also uses regular expressions internally. Here the given pattern is however compiled using Pattern.LITERAL so the regular expression overhead should be negligable.) In this case it is probably due to the fact that Strings are immutable (however small change you do, you will create a new string) and thus not as efficient as StringBuffers which manipulate the string in-place.

Related

Trimming trailing whitespace from every line in a multi-line string using replaceAll

Please explain why in this expression, the replacement only occurs once, at the end of the string.
"1 \n3 \n5 ".replaceAll(" +$", "x") // => "1 \n3 \n5x"
According to the Java 7 docs for Pattern,
$ is supposed to match the end of a line, and \z match the end of input (the string).
My goal is to replace trailing whitespace at the end of every line in a string. The "x" replacement is merely a way to better visualize what is being replaced.
I would think a regular expression and String.replaceAll ought to be able to do this. If replaceAll is not able to perform this operation, please suggest a concise alternative.
To replace the trailing whitespace of each line, you need to use the (?m) (multi-line) modifier.
"1 \n3 \n5 ".replaceAll("(?m) +$", "x") //=> "1x\n3x\n5x"
Note: This modifier makes the ^ and $ match at the start and end of each line.
hwnd's answer is indeed the most concise way to do it. However, especially if the pattern is to be matched many, many times, I'd like to suggest an alternative that is faster. Yes, it's called precompiling the pattern! (In contrast, String#replaceAll recompiles the pattern on each invocation.)
private static final Pattern LINE_END_SPACES = Pattern.compile(" +$", Pattern.MULTILINE);
public static String stripLineEndSpaces(String str) {
return LINE_END_SPACES.matcher(str).replaceAll("");
}
Yes, it looks more clunky, and that's because Java doesn't have regexp literals. I wish it did, but then people will want JSON literals, XML literals (think E4X), XPath expression literals, printf format string literals, SQL statement literals, etc., and then people will be asking when the madness will stop. :-)

How can I write a regex in Java that will perform a .replaceFirst on a group that is not in a comment?

So I need to return modified String where it replaces the first instance of a token with another token while skipping comments. Here's an example of what I'm talking about:
This whole quote is one big String
-- I don't want to replace this ##
But I want to replace this ##!
Being a former .NET developer, I thought this was easy. I'd just do a negative lookbehind like this:
(?<!--.*)##
But then I learned Java can't do this. So upon learning that the curly braces are okay, I tried this:
(?<!--.{0,9001})##
That didn't throw an exception, but it did match the ## in the comment.
When I test this regex with a Java regex tester, it works as expected. About the only thing I can think of is that I'm using Java 1.5. Is it possible that Java 1.5 has a bug in its regex engine? Assuming it does, how do I get Java 1.5 to do what I want it to do without breaking up my string and reassembling it?
EDIT I changed the # to the -- operator since it looks like the regex will be more complex with two chars instead of one. I originally did not reveal that I was modifying a query in order to avoid off topic discussion on "Well you shouldn't modify queries that way!" I have a very good reason for doing this. Please don't discuss query modification good practices. Thanks
You really don't need a negative look-behind here. You can do it without that too.
It would be like this:
String str = "I don't want to replace this ##";
str = str.replaceAll("^([^#].*?)##", "$1");
So, it replaces first occurrence of ## in the string that does not start with # with the part of the string before ##. So, ## is removed. Here replaceAll works because it uses a reluctant quantifier - .*?. So, it will automatically stop at the first ##.
As correctly pointed out by #nhahtdh in the comment, that this might fail, if your comment is at the end of the line. So, you can rather use this one:
String str = "I don't want to # replace this ##";
str = str.replaceAll("^([^#]*?)##", "$1");
This one will work for any case. And in the given example case, it won't replace the ##, as it is a part of the comment.
If your comment start is denoted by two characters, then negated character class won't work. You would need to use negative look-ahead like this:
String str = "This whole quote ## is one big String -- asdf ##\n" +
"-- I don't want to replace this ##\n" +
"But I want to replace this ##!";
str = str.replaceAll("(?m)^(((?!--).)*?)##", "$1");
System.out.println(str);
Output:
This whole quote is one big String -- asdf ##
-- I don't want to replace this ##
But I want to replace this !
(?m) at the beginning of the pattern is used to enable MULTILINE mode of matching, so the ^ will match the start of each line, rather than the start of the entire expression.
You can use something like this:
String string = "This whole quote is one big String\n" +
"# I don't want to replace this ##\n" +
"And I also # don't want to replace this ##\n" +
"But I want to replace this ##!\n" +
"But not this ##!";
Matcher m =
Pattern.compile (
"^((?:[^##]|#[^#]|#[^\n]*)*)##", Pattern.MULTILINE).
matcher (string);
StringBuffer result = new StringBuffer ();
if (m.find ())
m.appendReplacement (result, "$1FOO");
m.appendTail (result);
System.out.println (result.toString ());

Regular expression to select all whitespace that isn't in quotes?

I'm not very good at RegEx, can someone give me a regex (to use in Java) that will select all whitespace that isn't between two quotes? I am trying to remove all such whitespace from a string, so any solution to do so will work.
For example:
(this is a test "sentence for the regex")
should become
(thisisatest"sentence for the regex")
Here's a single regex-replace that works:
\s+(?=([^"]*"[^"]*")*[^"]*$)
which will replace:
(this is a test "sentence for the regex" foo bar)
with:
(thisisatest"sentence for the regex"foobar)
Note that if the quotes can be escaped, the even more verbose regex will do the trick:
\s+(?=((\\[\\"]|[^\\"])*"(\\[\\"]|[^\\"])*")*(\\[\\"]|[^\\"])*$)
which replaces the input:
(this is a test "sentence \"for the regex" foo bar)
with:
(thisisatest"sentence \"for the regex"foobar)
(note that it also works with escaped backspaces: (thisisatest"sentence \\\"for the regex"foobar))
Needless to say (?), this really shouldn't be used to perform such a task: it makes ones eyes bleed, and it performs its task in quadratic time, while a simple linear solution exists.
EDIT
A quick demo:
String text = "(this is a test \"sentence \\\"for the regex\" foo bar)";
String regex = "\\s+(?=((\\\\[\\\\\"]|[^\\\\\"])*\"(\\\\[\\\\\"]|[^\\\\\"])*\")*(\\\\[\\\\\"]|[^\\\\\"])*$)";
System.out.println(text.replaceAll(regex, ""));
// output: (thisisatest"sentence \"for the regex"foobar)
Here is the regex which works for both single & double quotes (assuming that all strings are delimited properly)
\s+(?=(?:[^\'"]*[\'"][^\'"]*[\'"])*[^\'"]*$)
It won't work with the strings which has quotes inside.
This just isn't something regexes are good at. Search-and-replace functions with regexes are always a bit limited, and any sort of nesting/containment at all becomes difficult and/or impossible.
I'd suggest an alternate approach: Split your string on quote characters. Go through the resulting array of strings, and strip the spaces from every other substring (whether you start with the first or second depends on whether you string started with a quote or not). Then join them back together, using quotes as separators. That should produce the results you're looking for.
Hope that helps!
PS: Note that this won't handle nested strings, but since you can't make nested strings with the ASCII double-qutoe character, I'm gonna assume you don't need that behaviour.
PPS: Once you're dealing with your substrings, then it's a good time to use regexes to kill those spaces - no containing quotes to worry about. Just remember to use the /.../g modifier to make sure it's a global replacement and not just the first match.
Groups of whitespace outside of quotes are separated by stuff that's a) not whitespace, or b) inside quotes.
Perhaps something like:
(\s+)([^ "]+|"[^"]*")*
The first part matches a sequence of spaces; the second part matches non-spaces (and non-quotes), or some stuff in quotes, either repeated any number of times. The second part is the separator.
This will give you two groups for each item in the result; just ignore the second element. (We need the parentheses for precidence rather than match grouping there.) Or, you could say, concatenate all the second elements -- though you need to match the first non-space word too, or in this example, make the spaces optional:
StringBuffer b = new StringBuffer();
Pattern p = Pattern.compile("(\\s+)?([^ \"]+|\"[^\"]*\")*");
Matcher m = p.matcher("this is \"a test\"");
while (m.find()) {
if (m.group(2) != null)
b.append(m.group(2));
}
System.out.println(b.toString());
(I haven't done much regex in Java so expect bugs.)
Finally This is how I'd do it if regexes were compulsory. ;-)
As well as Xavier's technique, you could simply do it the way you'd do it in C: just iterate over the input characters, and copy each to the new string if either it's non-space, or you've counted an odd number of quotes up to that point.
If there is only one set of quotes, you can do this:
String s = "(this is a test \"sentence for the regex\") a b c";
Matcher matcher = Pattern.compile("^[^\"]+|[^\"]+$").matcher(s);
while (matcher.find())
{
String group = matcher.group();
s = s.replace(group, group.replaceAll("\\s", ""));
}
System.out.println(s); // (thisisatest"sentence for the regex")abc
This isn't an exact solution, but you can accomplish your goal by doing the following:
STEP 1: Match the two segments
\\(([a-zA-Z ]\*)"([a-zA-Z ]\*)"\\)
STEP 2: remove spaces
temp = $1 replace " " with ""
STEP 3: rebuild your string
(temp"$2")

Splitting string with character sequence as a delimiter

The requirement is to split strings in Java so that the following
"this#{s}is#{s}a#{s}string"
would result in the following array
["this","is","a","string"]
As you can see here the delimiter is the character sequence "#{s}".
What is the fastest and efficient way of doing this using existing tools?
Am I right to assume that using regex (String.split()) is a bit of wasting because we are splitting using static string?
I got the assumption from here http://www.javamex.com/tutorials/regular_expressions/splitting_tokenisation_performance.shtml .
But I cannot use StringTokenizer since the delimiter is a sequence of char.
Note: currently I'm using String.split() and have no problem with that. This is pure curiosity.
Faster than using String.split is Pattern.split: i.e., precompile the pattern and store that for subsequent use. If you use the same pattern all the time, and do a lot of splitting using that pattern, it may be worth putting that pattern into a static field or something.
Also, if your pattern contains no regex metacharacters, you can pass in Pattern.LITERAL when creating the pattern. This is something you can't do with String.split. :-P

Escaping a String from getting regex parsed in Java

In Java, suppose I have a String variable S, and I want to search for it inside of another String T, like so:
if (T.matches(S)) ...
(note: the above line was T.contains() until a few posts pointed out that that method does not use regexes. My bad.)
But now suppose S may have unsavory characters in it. For instance, let S = "[hi". The left square bracket is going to cause the regex to fail. Is there a function I can call to escape S so that this doesn't happen? In this particular case, I would like it to be transformed to "\[hi".
String.contains does not use regex, so there isn't a problem in this case.
Where a regex is required, rather rejecting strings with regex special characters, use java.util.regex.Pattern.quote to escape them.
As Tom Hawtin said, you need to quote the pattern. You can do this in two ways (edit: actually three ways, as pointed out by #diastrophism):
Surround the string with "\Q" and "\E", like:
if (T.matches("\\Q" + S + "\\E"))
Use Pattern instead. The code would be something like this:
Pattern sPattern = Pattern.compile(S, Pattern.LITERAL);
if (sPattern.matcher(T).matches()) { /* do something */ }
This way, you can cache the compiled Pattern and reuse it. If you are using the same regex more than once, you almost certainly want to do it this way.
Note that if you are using regular expressions to test whether a string is inside a larger string, you should put .* at the start and end of the expression. But this will not work if you are quoting the pattern, since it will then be looking for actual dots. So, are you absolutely certain you want to be using regular expressions?
Try Pattern.quote(String). It will fix up anything that has special meaning in the string.
Any particular reason not to use String.indexOf() instead? That way it will always be interpreted as a regular string rather than a regex.
Regex uses the backslash character '\' to escape a literal. Given that java also uses the backslash character you would need to use a double bashslash like:
String S = "\\[hi"
That will become the String:
\[hi
which will be passed to the regex.
Or if you only care about a literal String and don't need a regex you could do the following:
if (T.indexOf("[hi") != -1) {
T.contains() (according to javadoc : http://java.sun.com/javase/6/docs/api/java/lang/String.html) does not use regexes. contains() delegates to indexOf() only.
So, there are NO regexes used here. Were you thinking of some other String method ?

Categories

Resources