I have a class that is doing a lot of text processing. For each string, which is anywhere from 100->2000 characters long, I am performing 30 different string replacements.
Example:
string modified;
for(int i = 0; i < num_strings; i++){
modified = runReplacements(strs[i]);
//do stuff
}
public runReplacements(String str){
str = str.replace("foo","bar");
str = str.replace("baz","beef");
....
return str;
}
'foo', 'baz', and all other "targets" are only expected to appear once and are string literals (no need for an actual regex).
As you can imagine, I am concerned about performance :)
Given this,
replaceFirst() seems a bad choice because it won't use Pattern.LITERAL and will do extra processing that isn't required.
replace() seems a bad choice because it will traverse the entire string looking for multiple instances to be replaced.
Additionally, since my replacement texts are the same everytime, it seems to make sense for me to write my own code otherwise String.replaceFirst() or String.replace() will be doing a Pattern.compile every single time in the background. Thinking that I should write my own code, this is my thought:
Perform a Pattern.compile() only once for each literal replacement desired (no need to recompile every single time) (i.e. p1 - p30)
Then do the following for each pX: p1.matcher(str).replaceFirst(Matcher.quoteReplacement("desiredReplacement"));
This way I abandon ship on the first replacement (instead of traversing the entire string), and I am using literal vs. regex, and I am not doing a re-compile every single iteration.
So, which is the best for performance?
So, which is the best for performance?
Measure it! ;-)
ETA: Since a two word answer sounds irretrievably snarky, I'll elaborate slightly. "Measure it and tell us..." since there may be some general rule of thumb about the performance of the various approaches you cite (good ones, all) but I'm not aware of it. And as a couple of the comments on this answer have mentioned, even so, the different approaches have a high likelihood of being swamped by the application environment. So, measure it in vivo and focus on this if it's a real issue. (And let us know how it goes...)
First, run and profile your entire application with a simple match/replace. This may show you that:
your application already runs fast enough, or
your application is spending most of its time doing something else, so optimizing the match/replace code is not worthwhile.
Assuming that you've determined that match/replace is a bottleneck, write yourself a little benchmarking application that allows you to test the performance and correctness of your candidate algorithms on representative input data. It's also a good idea to include "edge case" input data that is likely to cause problems; e.g. for the substitutions in your example, input data containing the sequence "bazoo" could be an edge case. On the performance side, make sure that you avoid the traps of Java micro-benchmarking; e.g. JVM warmup effects.
Next implement some simple alternatives and try them out. Is one of them good enough? Done!
In addition to your ideas, you could try concatenating the search terms into a single regex (e.g. "(foo|baz)" ), use Matcher.find(int) to find each occurrence, use a HashMap to lookup the replacement strings and a StringBuilder to build the output String from input string substrings and replacements. (OK, this is not entirely trivial, and it depends on Pattern/Matcher handling alternates efficiently ... which I'm not sure is the case. But that's why you should compare the candidates carefully.)
In the (IMO unlikely) event that a simple alternative doesn't cut it, this wikipedia page has some leads which may help you to implement your own efficient match/replacer.
Isn't if frustrating when you ask a question and get a bunch of advice telling you to do a whole lot of work and figure it out for yourself?!
I say use replaceAll();
(I have no idea if it is, indeed, the most efficient, I just don't want you to feel like you wasted your money on this question and got nothing.)
[edit]
PS. After that, you might want to measure it.
[edit 2]
PPS. (and tell us what you found)
Related
I've learned that String's replaceAll() method takes a regex as an input parameter and it can cause considerable performance impact. But once time, I read this blog with a small program assert that (according my understanding):
Process the replaceAll() method slow for the first time but faster for the next time.
This is test result:
regex replace time taken: 14.09 milliseconds
manual replace time taken: 2.371 seconds
-----
regex replace time taken: 9.498 milliseconds
manual replace time taken: 2.406 seconds
-----
regex replace time taken: 2.184 milliseconds
manual replace time taken: 2.360 seconds
-----
What is the optimization mechanism behind this result?
Usually it doesn't cause a meaningful performance impact unless used in weird and unusual ways. In a normal use case (let's say web request) it will disappear under things like network latency and other things that take way more time. Only if you were to use replaceAll in a very hot loop, would it become necessary to consider using the Pattern and Matcher classes directly, which could help with performance.
The linked tutorial site seems questionable (and there are a lot of them, which is why you should be careful what you read). For one, it compares replaceAll with a manual replace method that is poorly written (that's why you get the seconds vs. milliseconds difference). Then it draws conclusions based on that.
So there's no optimization mechanism behind the result in the link. The reason behind the result is the badly written manual replacement method which concatenates a lot of Strings, making it slow compared to replaceAll.
The following is taken from the official1 OpenJDK 11 source code2.
Starting with the String.replaceAll method itself.
public String replaceAll(String regex, String replacement) {
return Pattern.compile(regex).matcher(this).replaceAll(replacement);
}
No caching here. Next the Pattern.compile
public static Pattern compile(String regex) {
return new Pattern(regex, 0);
}
No caching there either. And not in the private Pattern constructor either.
The Pattern constructor uses an internal compile() method to do the working of compiling the regex to the internal form. It takes steps to avoid a Pattern being compiled twice. But as you can see from the above, each replaceAll call is generating a single use Pattern object.
So why are you seeing a speedup in those performance figures?
They could be using an old version (before Java 6) of Pattern that might have3 cached compiled patterns.
The most likely explanation is that this just a JVM warmup effect. A well written benchmark should account for that, but the benchmark that is used in that blog is not doing proper warmup.
In short, the speedup that you think is caused by some "optimization" is apparently just the result of JVM warmup effects such as JIT compilation of the Pattern, Matcher and related classes.
1 - The OpenJDK source code for Java 6 onwards is can be downloaded from https://openjdk.java.net/
2 - The OpenJDK 6 source code is doing the same thing: no caching.
3 - I have not checked, but it is moot. Performance benchmarks based on EOL versions of Java are not instructive for current versions of Java. Nobody should still be using Java 5. If they are, performance of replaceAll is the least of their worries.
Well, if you execute replaceAll over the same expression again and again, then the matched chars are indeed less frequents, so fewer replacements.
Example : /(.)(.*?)\1/ matches strings with a repeated char, and replaceAll matches with $2 will remove these duplicate chars, but it needs to be execute multiple times to process all the duplicates (there can be some in $2 for example)
ABCDEFABCBCDEF for example gives BCDEFCCDEF
BCDEFCCDEF ==> BDEFCDEF
BDEFCDEF ==> BEFCEF
BEFCEF ==> BFCF
BFCF ==> BC
The processed string is smaller and smaller, the match count gets lower and lower, so processing time decreases as well.
I admit that this answer is really specific to a use-case, but I cannot reproduce the behavior you mentioned so it has to be somehow specific to a use-case.
Caching can also be a reason - I first read 14 seconds instead of milliseconds - but mixing units like this isn't a brilliant idea either...
Well, replaceAll uses regex internally will compile each time you call it, but possible if we execute replaceAll over the same expression again and again java use some internal mechanism so that same expression will not compile again and make the repetitive replacement faster or just a JIT effect.
is there any difference in performance between these two idioms ?
String firstStr = "Hello ";
String secStr = "world";
String third = firstStr + secStr;
and
String firstStr = "Hello ";
String secStr = "world";
String third = String.format("%s%s",firstStr , secStr);
I know that concatenation with + operator is bad for performance specially if the operation is done a lot of times, but what about String.format() ? is it the same or it can help to improve performance?
The second one will be even slower (if you look at the source code of String.format() you will see why). It is just because String.format() executes much more code than the simple concatenation. And at the end of the day, both code versions create 3 instances of String. There are other reasons, not performance related, to use String.format(), as others already pointed out.
First of all, let me just put a disclaimer against premature optimization. Unless you are reasonably sure this is going to be a hotspot in your program, just choose the construct that fits your program the best.
If you are reasonably sure, however, and want to have good control over the concatenation, just use a StringBuilder directly. That's what the built-in concatenation operation does anyway, and there's no reason to assume that it's slow. As long as you keep the same StringBuilder and keep appending to it, rather than risking creating several in a row (which would have to be "initialized" with the previously created data), you'll have proper O(n) performance. Especially so if you make sure to initialize the StringBuilder with proper capacity.
That also said, however, a StringBuilder is, as mentioned, what the built-in concatenation operation uses anyway, so if you just keep all your concatenations "inline" -- that is, use A + B + C + D, rather than something like e = A + B followed by f = C + D followed by e + f (this way, the same StringBuilder is used and appended to throughout the entire operation) -- then there's no reason to assume it would be slow.
EDIT: In reply to your comment, I'd say that String.format is always slower. Even if it appends optimally, it can't do it any faster than StringBuilder (and therefore also the concatenation operation) does anyway, but it also has to create a Formatter object, do parsing of the input string, and so on. So there's more to it, but it still can't do the basic operation any faster.
Also, if you look internally in how the Formatter works, you'll find that it also (by default) uses a StringBuilder, just as the concatenation operation does. Therefore, it does the exact same basic operation -- that is, feeding a StringBuilder with the strings you give it. It just does it in a far more roundabout way.
As described in this excellent answer , you would rather use String.format, but mainly because of localization concerns.
Suppose that you had to supply different text for different languages, in that case then using String.format - you can just plug in the new languages(using resource files). But concatenation leaves messy code.
See : Is it better practice to use String.format over string Concatenation in Java?
I have a .toUpperCase() happening in a tight loop and have profiled and shown it is impacting application performance. Annoying thing is it's being called on strings already in capital letters. I'm considering just dropping the call to .toUpperCase() but this makes my code less safe for future use.
This level of Java performance optimization is past my experience thus far. Is there any way to do a pre-compilation, set an annotation, etc. to skip the call to toUpperCase on already upper case strings?
What you need to do if you can is call .toUpperCase() on the string once, and store it so that when you go through the loop you won't have to do it each time.
I don't believe there is a pre-compilation situation - you can't know in advance what data the code will be handling. If anyone can correct me on this, it's be pretty awesome.
If you post some example code, I'd be able to help a lot more - it really depends on what kind of access you have to the data before you get to the loop. If your loop is actually doing the data access (e.g., reading from a file) and you don't have control over where those files come from, your hands are a lot more tied than if the data is hardcoded.
Any many cases there's an easy answer, but in some, there's not much you can do.
You can try equalsIgnoreCase, too. It doesn't make a new string.
No you cannot do this using an annotation or pre-compilation because your input is given during the runtime and the annotation and pre-compilation are compile time constructions.
If you would have known the input in advance then you could simply convert it to uppercase before running the application, i.e. before you compile your application.
Note that there are many ways to optimize string handling but without more information we cannot give you any tailor made solution.
You can write a simple function isUpperCase(String) and call it before calling toUpperCase():
if (!isUpperCase(s)) {
s = s.toUpperCase()
}
It might be not significantly faster but at least this way less garbage will be created. If a majority of the strings in your input are already upper case this is very valid optimization.
isUpperCase function will look roughly like this:
boolean isUpperCase(String s) {
for (int i = 0; i < s.length; i++) {
if (Character.isLowerCase(s.charAt(i)) {
return false;
}
}
return true;
}
you need to do an if statement that conditions those letters out of it. the ideas good just have a condition. Then work with ascii codes so convert it using (int) then find the ascii numbers for uppercase which i have no idea what it is, and then continue saying if ascii whatever is true then ignore this section or if its for specific letters in a line then ignore it for charAt(i)
sorry its a rough explanation
Suppose you need to perform some kind of comparison amongst 2 files. You only need to do it when it makes sense, in other words, you wouldn't want to compare JSON file with Property file or .txt file with .jar file
Additionally suppose that you have a mechanism in place to sort all of these things out and what it comes down to now is the actual file name. You would want to compare "myFile.txt" with "myFile.txt", but not with "somethingElse.txt". The goal is to be as close to "apples to apples" rules as possible.
So here we are, on one side you have "myFile.txt" and on another side you have "_myFile.txt", "_m_y_f_i_l_e.txt" and "somethingReallyClever.txt".
Task is to pick the closest name to later compare. Unfortunately, identical name is not found.
Looking at the character composition, it is not hard to figure out what the relationship is. My algo says:
_myFile.txt to _m_y_f_i_l_e.txt 0.312
_myFile.txt to somethingReallyClever.txt 0.16
So _m_y_f_i_l_e.txt is closer to_myFile.txt then somethingReallyClever.txt. Fantastic. But also says that ist is only 2 times closer, where as in reality we can look at the 2 files and would never think to compare somethingReallyClever.txt with _myFile.txt.
Why?
What logic would you suggest i apply to not only figure out likelihood by having chars on the same place, but also test whether determined weight makes sense?
In my example, somethingReallyClever.txt should have had a weight of 0.0
I hope i am being clear.
Please share your experience and thoughts on this.
(whatever approach you suggest should not depend on number of characters filename consists out of)
Possibly helpful previous question which highlights several possible algorithms:
Word comparison algorithm
These algorithms are based on how many changes would be needed to get from one string to the other - where a change is adding a character, deleting a character, or replacing a character.
Certainly any sensible metric here should have a low score as meaning close (think distance between the two strings) and larger scores as meaning not so close.
Sounds like you want the Levenshtein distance, perhaps modified by preconverting both words to the same case and normalizing spaces (e.g. replace all spaces and underscores with empty string)
I am writing code in Java where I branch off based on whether a string starts with certain characters while looping through a dataset and my dataset is expected to be large.
I was wondering whether startsWith is faster than indexOf. I did experiment with 2000 records but not found any difference.
startsWith only needs to check for the presence at the very start of the string - it's doing less work, so it should be faster.
My guess is that your 2000 records finished in a few milliseconds (if that). Whenever you want to benchmark one approach against another, try to do it for enough time that differences in timing will be significant. I find that 10-30 seconds is long enough to show significant improvements, but short enough to make it bearable to run the tests multiple times. (If this were a serious investigation I'd probably try for longer times. Most of my benchmarking is for fun.)
Also make sure you've got varied data - indexOf and startsWith should have roughly the same running time in the case where indexOf returns 0. So if all your records match the pattern, you're not really testing correctly. (I don't know whether that was the case in your tests of course - it's just something to watch out for.)
In general, the golden rule of micro-optimization applies here:
"Measure, don't guess".
As with all optimizations of this type, the difference between the two calls almost certainly won't matter unless you are checking millions of strings that are each tens of thousands of characters long.
Run a profiler over your code, and only optimize this call when you can measure that it's slowing you down. Till then, go with the more readable options (startsWith, in this case). Once you know that this block is slowing you down, try both and use whichever is faster. Rinse. Repeat ;-)
Academically, my guess is that startsWith will likely be implemented using indexOf. Check the source code and see if you're interested. (Turns out that startsWith does not call indexOf)
Even without looking into the sources, it should be clear that startsWith() is faster at least for large strings and short pattern.
The running time of a.startsWith(b) is bound be the length of b. After at most the first b characters are checked, the search finished.
The running time of a.indexOf(b) is larger (depending on the actual algorithm). Every algorithm has at least a running time depending on the length of a. Roughly, you can say, that you have to look at each character once to check if the pattern starts at that position.
However, as always, it depends on the actual use case if you really see a difference in practice. Measuring the difference in real life is never bad.
Probably, if it doesn't match it can stop looking whereas indexOf needs to look for occurrences later in the string.
startsWith is clearer than indexOf == 0.
Have you identified the test as a performance bottleneck for which you need to sacrifice readability?
public class Test
{
public static void main(String args[]) {
long value1 = System.currentTimeMillis();
for(long i=0;i<100000000;i++)
{
"abcd".indexOf("a");
}
long value2 = System.currentTimeMillis();
System.out.println(value2-value1);
value1 = System.currentTimeMillis();
for(long i=0;i<100000000;i++)
{
"abcd".startsWith("a");
}
value2 = System.currentTimeMillis();
System.out.println(value2-value1);
}
}
Tested it with this piece of code and perf for startsWith seems to be better, for obvious reason that it doesn't have to traverse through string. But in best case scenario both should perform close while in a worst case scenario startsWith will always perform better than indexOf
You mentioned the dataset is expected to be large. So i will bet a lot of performanve will go into access this dataset and handle it in memory. That means use one or the other will not change the perfomance significant. But if this is important to you you may write your own startWith method that could be significant faster than standard library methods or at least you know exactly what is done.
Unfortunate, statsWith is not working as supposed to! it uses "indexOf" behind the sene (lazy developpers :D) so indexOf is 10x faster than implemented statsWith