I am writing code in Java where I branch off based on whether a string starts with certain characters while looping through a dataset and my dataset is expected to be large.
I was wondering whether startsWith is faster than indexOf. I did experiment with 2000 records but not found any difference.
startsWith only needs to check for the presence at the very start of the string - it's doing less work, so it should be faster.
My guess is that your 2000 records finished in a few milliseconds (if that). Whenever you want to benchmark one approach against another, try to do it for enough time that differences in timing will be significant. I find that 10-30 seconds is long enough to show significant improvements, but short enough to make it bearable to run the tests multiple times. (If this were a serious investigation I'd probably try for longer times. Most of my benchmarking is for fun.)
Also make sure you've got varied data - indexOf and startsWith should have roughly the same running time in the case where indexOf returns 0. So if all your records match the pattern, you're not really testing correctly. (I don't know whether that was the case in your tests of course - it's just something to watch out for.)
In general, the golden rule of micro-optimization applies here:
"Measure, don't guess".
As with all optimizations of this type, the difference between the two calls almost certainly won't matter unless you are checking millions of strings that are each tens of thousands of characters long.
Run a profiler over your code, and only optimize this call when you can measure that it's slowing you down. Till then, go with the more readable options (startsWith, in this case). Once you know that this block is slowing you down, try both and use whichever is faster. Rinse. Repeat ;-)
Academically, my guess is that startsWith will likely be implemented using indexOf. Check the source code and see if you're interested. (Turns out that startsWith does not call indexOf)
Even without looking into the sources, it should be clear that startsWith() is faster at least for large strings and short pattern.
The running time of a.startsWith(b) is bound be the length of b. After at most the first b characters are checked, the search finished.
The running time of a.indexOf(b) is larger (depending on the actual algorithm). Every algorithm has at least a running time depending on the length of a. Roughly, you can say, that you have to look at each character once to check if the pattern starts at that position.
However, as always, it depends on the actual use case if you really see a difference in practice. Measuring the difference in real life is never bad.
Probably, if it doesn't match it can stop looking whereas indexOf needs to look for occurrences later in the string.
startsWith is clearer than indexOf == 0.
Have you identified the test as a performance bottleneck for which you need to sacrifice readability?
public class Test
{
public static void main(String args[]) {
long value1 = System.currentTimeMillis();
for(long i=0;i<100000000;i++)
{
"abcd".indexOf("a");
}
long value2 = System.currentTimeMillis();
System.out.println(value2-value1);
value1 = System.currentTimeMillis();
for(long i=0;i<100000000;i++)
{
"abcd".startsWith("a");
}
value2 = System.currentTimeMillis();
System.out.println(value2-value1);
}
}
Tested it with this piece of code and perf for startsWith seems to be better, for obvious reason that it doesn't have to traverse through string. But in best case scenario both should perform close while in a worst case scenario startsWith will always perform better than indexOf
You mentioned the dataset is expected to be large. So i will bet a lot of performanve will go into access this dataset and handle it in memory. That means use one or the other will not change the perfomance significant. But if this is important to you you may write your own startWith method that could be significant faster than standard library methods or at least you know exactly what is done.
Unfortunate, statsWith is not working as supposed to! it uses "indexOf" behind the sene (lazy developpers :D) so indexOf is 10x faster than implemented statsWith
Related
I've learned that String's replaceAll() method takes a regex as an input parameter and it can cause considerable performance impact. But once time, I read this blog with a small program assert that (according my understanding):
Process the replaceAll() method slow for the first time but faster for the next time.
This is test result:
regex replace time taken: 14.09 milliseconds
manual replace time taken: 2.371 seconds
-----
regex replace time taken: 9.498 milliseconds
manual replace time taken: 2.406 seconds
-----
regex replace time taken: 2.184 milliseconds
manual replace time taken: 2.360 seconds
-----
What is the optimization mechanism behind this result?
Usually it doesn't cause a meaningful performance impact unless used in weird and unusual ways. In a normal use case (let's say web request) it will disappear under things like network latency and other things that take way more time. Only if you were to use replaceAll in a very hot loop, would it become necessary to consider using the Pattern and Matcher classes directly, which could help with performance.
The linked tutorial site seems questionable (and there are a lot of them, which is why you should be careful what you read). For one, it compares replaceAll with a manual replace method that is poorly written (that's why you get the seconds vs. milliseconds difference). Then it draws conclusions based on that.
So there's no optimization mechanism behind the result in the link. The reason behind the result is the badly written manual replacement method which concatenates a lot of Strings, making it slow compared to replaceAll.
The following is taken from the official1 OpenJDK 11 source code2.
Starting with the String.replaceAll method itself.
public String replaceAll(String regex, String replacement) {
return Pattern.compile(regex).matcher(this).replaceAll(replacement);
}
No caching here. Next the Pattern.compile
public static Pattern compile(String regex) {
return new Pattern(regex, 0);
}
No caching there either. And not in the private Pattern constructor either.
The Pattern constructor uses an internal compile() method to do the working of compiling the regex to the internal form. It takes steps to avoid a Pattern being compiled twice. But as you can see from the above, each replaceAll call is generating a single use Pattern object.
So why are you seeing a speedup in those performance figures?
They could be using an old version (before Java 6) of Pattern that might have3 cached compiled patterns.
The most likely explanation is that this just a JVM warmup effect. A well written benchmark should account for that, but the benchmark that is used in that blog is not doing proper warmup.
In short, the speedup that you think is caused by some "optimization" is apparently just the result of JVM warmup effects such as JIT compilation of the Pattern, Matcher and related classes.
1 - The OpenJDK source code for Java 6 onwards is can be downloaded from https://openjdk.java.net/
2 - The OpenJDK 6 source code is doing the same thing: no caching.
3 - I have not checked, but it is moot. Performance benchmarks based on EOL versions of Java are not instructive for current versions of Java. Nobody should still be using Java 5. If they are, performance of replaceAll is the least of their worries.
Well, if you execute replaceAll over the same expression again and again, then the matched chars are indeed less frequents, so fewer replacements.
Example : /(.)(.*?)\1/ matches strings with a repeated char, and replaceAll matches with $2 will remove these duplicate chars, but it needs to be execute multiple times to process all the duplicates (there can be some in $2 for example)
ABCDEFABCBCDEF for example gives BCDEFCCDEF
BCDEFCCDEF ==> BDEFCDEF
BDEFCDEF ==> BEFCEF
BEFCEF ==> BFCF
BFCF ==> BC
The processed string is smaller and smaller, the match count gets lower and lower, so processing time decreases as well.
I admit that this answer is really specific to a use-case, but I cannot reproduce the behavior you mentioned so it has to be somehow specific to a use-case.
Caching can also be a reason - I first read 14 seconds instead of milliseconds - but mixing units like this isn't a brilliant idea either...
Well, replaceAll uses regex internally will compile each time you call it, but possible if we execute replaceAll over the same expression again and again java use some internal mechanism so that same expression will not compile again and make the repetitive replacement faster or just a JIT effect.
I have a .toUpperCase() happening in a tight loop and have profiled and shown it is impacting application performance. Annoying thing is it's being called on strings already in capital letters. I'm considering just dropping the call to .toUpperCase() but this makes my code less safe for future use.
This level of Java performance optimization is past my experience thus far. Is there any way to do a pre-compilation, set an annotation, etc. to skip the call to toUpperCase on already upper case strings?
What you need to do if you can is call .toUpperCase() on the string once, and store it so that when you go through the loop you won't have to do it each time.
I don't believe there is a pre-compilation situation - you can't know in advance what data the code will be handling. If anyone can correct me on this, it's be pretty awesome.
If you post some example code, I'd be able to help a lot more - it really depends on what kind of access you have to the data before you get to the loop. If your loop is actually doing the data access (e.g., reading from a file) and you don't have control over where those files come from, your hands are a lot more tied than if the data is hardcoded.
Any many cases there's an easy answer, but in some, there's not much you can do.
You can try equalsIgnoreCase, too. It doesn't make a new string.
No you cannot do this using an annotation or pre-compilation because your input is given during the runtime and the annotation and pre-compilation are compile time constructions.
If you would have known the input in advance then you could simply convert it to uppercase before running the application, i.e. before you compile your application.
Note that there are many ways to optimize string handling but without more information we cannot give you any tailor made solution.
You can write a simple function isUpperCase(String) and call it before calling toUpperCase():
if (!isUpperCase(s)) {
s = s.toUpperCase()
}
It might be not significantly faster but at least this way less garbage will be created. If a majority of the strings in your input are already upper case this is very valid optimization.
isUpperCase function will look roughly like this:
boolean isUpperCase(String s) {
for (int i = 0; i < s.length; i++) {
if (Character.isLowerCase(s.charAt(i)) {
return false;
}
}
return true;
}
you need to do an if statement that conditions those letters out of it. the ideas good just have a condition. Then work with ascii codes so convert it using (int) then find the ascii numbers for uppercase which i have no idea what it is, and then continue saying if ascii whatever is true then ignore this section or if its for specific letters in a line then ignore it for charAt(i)
sorry its a rough explanation
I have a variable that gets read and updated thousands of times a second. It needs to be reset regularly. But "half" the time, the value is already the reset value. Is it a good idea to check the value first (to see if it needs resetting) before resetting (a write operaion), or I should just reset it regardless? The main goal is to optimize the code for performance.
To illustrate:
Random r = new Random();
int val = Integer.MAX_VALUE;
for (int i=0; i<100000000; i++) {
if (i % 2 == 0)
val = Integer.MAX_VALUE;
else
val = r.nextInt();
if (val != Integer.MAX_VALUE) //skip check?
val = Integer.MAX_VALUE;
}
I tried to use the above program to test the 2 scenarios (by un/commenting the 2nd "if" line), but any difference is masked by the natural variance of the run duration time.
Thanks.
Don't check it.
It's more execution steps = more cycles = more time.
As an aside, you are breaking one of the basic software golden rules: "Don't optimise early". Unless you have hard evidence that this piece if code is a performance problem, you shouldn't be looking at it. (Note that doesn't mean you code without performance in mind, you still follow normal best practice, but you don't add any special code whose only purpose is "performance related")
The check has no actual performance impact. We'd be talking about a single clock cycle or something, which is usually not relevant in a Java program (as hard-core number crunching usually isn't done in Java).
Instead, base the decision on readability. Think of the maintainer who's going to change this piece of code five years on.
In the case of your example, using my rationale, I would skip the check.
Most likely the JIT will optimise the code away because it doesn't do anything.
Rather than worrying about performance, it is usually better to worry about what it
simpler to understand
cleaner to implement
In both cases, you might remove the code as it doesn't do anything useful and it could make the code faster as well.
Even if it did make the code a little slower it would be very small compared to the cost of calling r.nextInt() which is not cheap.
I have a Double that I want to knock the extra digits after the decimal place off of (I'm not too concerned about accuracy but feel free to mention it in your answer) prior to conversion into a String.
I was wondering whether it would be better to cast to an int or to use a DecimalFormat and call format(..) . Also, is it then more efficient to specify String.valueOf() or leave it as it is and let the compiler figure it out?
Sorry if I sound a bit ignorant, I'm genuinely curious to learn more of the technical details.
For reference, i'm drawing text to and android canvas:
c.drawText("FPS: " + String.valueOf((int)lastFps), xPos, yPos, paint);
Casting will probably be more efficient. This is implemented as native code while using a method will have to go through the java code. Also it's much more readable.
For the string.valueof, I expect the performance to be strictly the same. I find it more readable to just do "string" + intValue than "string" + String.valueof(intValue)
I made a program that used System.nanoTime() to calculate the execution time of these two methods:
public static void cast() {
for (int i=0; i<1000000; i++) {
int x= (int)Math.random();
}
}
public static void format() {
for (int i=0; i< 1000000; i++) {
DecimalFormat df = new DecimalFormat("#");
df.format(Math.random());
}
}
Here are the respective results:
80984944
6048075593
Granted my tests probably aren't perfect examples. I'm just using math.random(), which generates a number that will always cast to 0, which might affect results. However, these results do make sense - casting should be cheap, since it likely doesn't operate on the bits at all - the JVM just treats the bits differently.
Edit: If I pull out the instantiation of the formatter for the second example, the program runs in 3155165182ns. If I multiply the random numbers by Integer.MAX_VALUE in both cases (with the instantiation pulled out), the results are: 82100170 and 4174558079. Looks like casting is the way to go.
This is a job for Math.floor().
Generally speaking, function/method calls come at the cost of performance overhead. My vote is that typecasting would be faster, but as #Zefiryn suggested, the best way is to create a loop and do each action a multitude of times and measure the performance with a timer.
I'm not sure about the efficiency of either, but here's a third option that could be interesting to compare:
String.valueOf(doubleValue).substring(0, endInt)
which would give a set number of characters rather than decimals/numbers, and would skip the typecasting but make two function calls instead.
EDIT: Was too curious so I tried running each option:
integerNumberFormat.format(number)
String.valueOf(doubleValue).substring(0, endInt)
String.valueOf((int)doubleValue)
10^6 cycles with the results being ~800 ms, ~300 ms and ~40 ms, respectively. I guess my results won't be immediately translatable to your situation but they could give a hint that the last one is indeed, as the previous posters suggested, the fastest one.
I have a class that is doing a lot of text processing. For each string, which is anywhere from 100->2000 characters long, I am performing 30 different string replacements.
Example:
string modified;
for(int i = 0; i < num_strings; i++){
modified = runReplacements(strs[i]);
//do stuff
}
public runReplacements(String str){
str = str.replace("foo","bar");
str = str.replace("baz","beef");
....
return str;
}
'foo', 'baz', and all other "targets" are only expected to appear once and are string literals (no need for an actual regex).
As you can imagine, I am concerned about performance :)
Given this,
replaceFirst() seems a bad choice because it won't use Pattern.LITERAL and will do extra processing that isn't required.
replace() seems a bad choice because it will traverse the entire string looking for multiple instances to be replaced.
Additionally, since my replacement texts are the same everytime, it seems to make sense for me to write my own code otherwise String.replaceFirst() or String.replace() will be doing a Pattern.compile every single time in the background. Thinking that I should write my own code, this is my thought:
Perform a Pattern.compile() only once for each literal replacement desired (no need to recompile every single time) (i.e. p1 - p30)
Then do the following for each pX: p1.matcher(str).replaceFirst(Matcher.quoteReplacement("desiredReplacement"));
This way I abandon ship on the first replacement (instead of traversing the entire string), and I am using literal vs. regex, and I am not doing a re-compile every single iteration.
So, which is the best for performance?
So, which is the best for performance?
Measure it! ;-)
ETA: Since a two word answer sounds irretrievably snarky, I'll elaborate slightly. "Measure it and tell us..." since there may be some general rule of thumb about the performance of the various approaches you cite (good ones, all) but I'm not aware of it. And as a couple of the comments on this answer have mentioned, even so, the different approaches have a high likelihood of being swamped by the application environment. So, measure it in vivo and focus on this if it's a real issue. (And let us know how it goes...)
First, run and profile your entire application with a simple match/replace. This may show you that:
your application already runs fast enough, or
your application is spending most of its time doing something else, so optimizing the match/replace code is not worthwhile.
Assuming that you've determined that match/replace is a bottleneck, write yourself a little benchmarking application that allows you to test the performance and correctness of your candidate algorithms on representative input data. It's also a good idea to include "edge case" input data that is likely to cause problems; e.g. for the substitutions in your example, input data containing the sequence "bazoo" could be an edge case. On the performance side, make sure that you avoid the traps of Java micro-benchmarking; e.g. JVM warmup effects.
Next implement some simple alternatives and try them out. Is one of them good enough? Done!
In addition to your ideas, you could try concatenating the search terms into a single regex (e.g. "(foo|baz)" ), use Matcher.find(int) to find each occurrence, use a HashMap to lookup the replacement strings and a StringBuilder to build the output String from input string substrings and replacements. (OK, this is not entirely trivial, and it depends on Pattern/Matcher handling alternates efficiently ... which I'm not sure is the case. But that's why you should compare the candidates carefully.)
In the (IMO unlikely) event that a simple alternative doesn't cut it, this wikipedia page has some leads which may help you to implement your own efficient match/replacer.
Isn't if frustrating when you ask a question and get a bunch of advice telling you to do a whole lot of work and figure it out for yourself?!
I say use replaceAll();
(I have no idea if it is, indeed, the most efficient, I just don't want you to feel like you wasted your money on this question and got nothing.)
[edit]
PS. After that, you might want to measure it.
[edit 2]
PPS. (and tell us what you found)