I have solved a simple problem on CodeEval in several ways; its specification can be found here (it is only a few lines long).
I have made 3 working versions (one of them in Scala), and I don't understand the performance of my last Java version, which I expected to be the best in both time and memory.
I also compared this to a code found on Github. Here are the performance stats returned by CodeEval :
. Version 1 is the version found on Github
. Version 2 is my Scala solution :
object Main extends App {
  val p = Pattern.compile("\\d+")
  scala.io.Source.fromFile(args(0)).getLines
    .filter(!_.isEmpty)
    .map(line => {
      val dists = new TreeSet[Int]
      val m = p.matcher(line)
      while (m.find) dists += m.group.toInt
      val list = dists.toList
      list.zip(0 +: list).map { case (x,y) => x - y }.mkString(",")
    })
    .foreach(println)
}
. Version 3 is my Java solution which I expected to be the best :
public class Main {
    public static void main(String[] args) throws IOException {
        Pattern p = Pattern.compile("\\d+");
        File file = new File(args[0]);
        BufferedReader br = new BufferedReader(new FileReader(file));
        String line;
        while ((line = br.readLine()) != null) {
            Set<Integer> dists = new TreeSet<Integer>();
            Matcher m = p.matcher(line);
            while (m.find()) dists.add(Integer.parseInt(m.group()));
            Iterator<Integer> it = dists.iterator();
            int prev = 0;
            StringBuilder sb = new StringBuilder();
            while (it.hasNext()) {
                int curr = it.next();
                sb.append(curr - prev);
                sb.append(it.hasNext() ? "," : "");
                prev = curr;
            }
            System.out.println(sb);
        }
        br.close();
    }
}
Version 4 is the same as version 3, except that it doesn't use a StringBuilder to build the output and instead prints as version 1 does
Here is how I interpreted those results :
version 1 is too slow because it makes too many System.out.print calls. Moreover, using split on very large lines (which is the case in the tests performed) uses a lot of memory.
version 2 seems slow too, but that is mainly because of an "overhead" on running Scala code on CodeEval: even very efficient code runs slowly on it
version 2 uses unnecessary memory to build a list from the set, which also takes some time but should not be too significant. Writing more efficient Scala would probably be like writing it in Java, so I preferred elegance to performance
version 3 should not use that much memory in my opinion. The use of a StringBuilder has the same impact on memory as calling mkString in version 2
version 4 proves that the calls to System.out.println are slowing down the program
Does anyone see an explanation for those results?
I conducted some tests.
There is a baseline for every language. I code in Java and JavaScript. For JavaScript, here are my test results:
Rev 1: Default empty boilerplate for JS with a message to standard output
Rev 2: Same without file reading
Rev 3: Just a message to the standard output
You can see that no matter what, there will be at least 200 ms of runtime and about 5 MB of memory usage. This baseline also depends on the load of the servers! There was a time when CodeEval was heavily overloaded, which made it impossible to run anything within the maximum time (10 s).
Check this out, a totally different challenge than the previous:
Rev4: My solution
Rev5: The same code submitted again now. Scored 8000 more ranking points. :D
Conclusion: I would not worry too much about CPU and memory usage and rank. It is clearly not reliable.
Your Scala solution is slow not because of "overhead on CodeEval", but because you are building an immutable TreeSet and adding elements to it one by one. Replacing it with something like
val regex = """\d+""".r // in the beginning, instead of your Pattern.compile
...
.map { line =>
val dists = regex.findAllIn(line).map(_.toInt).toIndexedSeq.sorted
...
Should shave about 30-40% off your execution time.
The same approach (build a list, then sort) will probably help your memory utilization in "version 3" (Java sets are real memory hogs). It is also a good idea to give your list an initial size while you are at it (otherwise it'll grow by 50% every time it runs out of capacity, which is wasteful in both memory and performance). 600 sounds like a good number, since that's the upper bound for the number of cities from the problem description.
Now, since we know the upper bound, an even faster and slimmer approach is to do away with lists and boxed Integers, and just do int dists[] = new int[600];.
If you wanted to get really fancy, you'd also make use of the "route length" range that's mentioned in the description. For example, instead of throwing ints into an array and sorting (or keeping a treeset), make an array of 20,000 bits (or even 20K bytes for speed), and set those that you see in input as you read it ... That would be both faster and more memory efficient than any of your solutions.
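The bit-array idea can be sketched roughly as follows. This is only an illustration, not a tuned submission: the class name, the `MAX_DISTANCE` constant (taken from the "20,000" figure above) and the sample line in the `City,distance;` format are assumptions.

```java
import java.util.BitSet;

class BitSetDistances {
    // Assumed upper bound on a route length, based on the "20,000" figure above.
    static final int MAX_DISTANCE = 20_000;

    // Returns the comma-separated gaps between consecutive sorted distances.
    static String gaps(String line) {
        BitSet seen = new BitSet(MAX_DISTANCE + 1);
        for (String entry : line.split(";")) {
            int comma = entry.lastIndexOf(',');
            if (comma < 0) continue; // skip blank fragments
            seen.set(Integer.parseInt(entry.substring(comma + 1).trim()));
        }
        StringBuilder sb = new StringBuilder();
        int prev = 0;
        // nextSetBit walks the set bits in ascending order, so no sort is needed.
        for (int d = seen.nextSetBit(0); d >= 0; d = seen.nextSetBit(d + 1)) {
            if (sb.length() > 0) sb.append(',');
            sb.append(d - prev);
            prev = d;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical input line in the assumed format.
        System.out.println(gaps("Rkbs,5453; Wdqiz,1245; Rwds,3890; Ujma,5589; Tbzmo,1303;"));
    }
}
```

Like the TreeSet versions, this collapses duplicate distances, and the output is built once per line rather than printed piece by piece.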
I tried solving this question and figured that you don't need the names of the cities, just the distances in a sorted array.
It has much better runtime of 738ms, and memory of 4513792 with this.
Although this may not help improve your piece of code, it seems like a better way to approach the question. Any suggestions to improve the code further are welcome.
import java.io.*;
import java.util.*;
public class Main {
    public static void main (String[] args) throws IOException {
        File file = new File(args[0]);
        BufferedReader buffer = new BufferedReader(new FileReader(file));
        String line;
        while ((line = buffer.readLine()) != null) {
            line = line.trim();
            String out = new Main().getDistances(line);
            System.out.println(out);
        }
    }
    public String getDistances(String s){
        //split the string
        String[] arr = s.split(";");
        //create an array to hold the distances as integers
        int[] distances = new int[arr.length];
        for(int i=0; i<arr.length; i++){
            //find the index of , - get the characters after that - convert to integer - add to distances array
            distances[i] = Integer.parseInt(arr[i].substring(arr[i].lastIndexOf(",")+1));
        }
        //sort the array
        Arrays.sort(distances);
        String output = "";
        output += distances[0]; //append the distance to the closest city to the string
        for(int i=0; i<arr.length-1; i++){
            //get distance between current element(city) and next
            int distance_between = distances[i+1] - distances[i];
            //append the distance to the string
            output += "," + distance_between;
        }
        return output;
    }
}
I wrote a little program that tries to find a connection between two equal length English words. Word A will transform into Word B by changing one letter at a time, each newly created word has to be an English word.
For example:
Word A = BANG
Word B = DUST
Result:
BANG -> BUNG -> BUNT -> DUNT -> DUST
My process:
Load an English wordlist (consisting of 109582 words) into a Map<Integer, List<String>> _wordMap = new HashMap();, where the key is the word length.
The user enters 2 words.
createGraph creates a graph.
Calculate the shortest path between those 2 nodes.
Print out the result.
Everything works perfectly fine, but I am not satisfied with the time it takes in step 3.
See:
Completely loaded 109582 words!
CreateMap took: 30 milsecs
CreateGraph took: 17417 milsecs
(HOISE : HORSE)
(HOISE : POISE)
(POISE : PRISE)
(ARISE : PRISE)
(ANISE : ARISE)
(ANILE : ANISE)
(ANILE : ANKLE)
The wholething took: 17866 milsecs
I am not satisfied with the time it takes to create the graph in step 3. Here's my code for it (I am using JGraphT for the graph):
private List<String> _wordList = new ArrayList(); // list of all 109582 English words
private Map<Integer, List<String>> _wordMap = new HashMap(); // Map grouping all the words by their length()
private UndirectedGraph<String, DefaultEdge> _wordGraph =
    new SimpleGraph<String, DefaultEdge>(DefaultEdge.class); // Graph used to calculate the shortest path from one node to the other.
private void createGraph(int wordLength){
    long before = System.currentTimeMillis();
    List<String> words = _wordMap.get(wordLength);
    for(String word:words){
        _wordGraph.addVertex(word); // adds a node
        for(String wordToTest : _wordList){
            if (isSimilar(word, wordToTest)) {
                _wordGraph.addVertex(wordToTest); // adds another node
                _wordGraph.addEdge(word, wordToTest); // connecting 2 nodes if they are one letter off from each other
            }
        }
    }
    System.out.println("CreateGraph took: " + (System.currentTimeMillis() - before)+ " milsecs");
}
private boolean isSimilar(String wordA, String wordB) {
    if(wordA.length() != wordB.length()){
        return false;
    }
    int matchingLetters = 0;
    if (wordA.equalsIgnoreCase(wordB)) {
        return false;
    }
    for (int i = 0; i < wordA.length(); i++) {
        if (wordA.charAt(i) == wordB.charAt(i)) {
            matchingLetters++;
        }
    }
    if (matchingLetters == wordA.length() - 1) {
        return true;
    }
    return false;
}
My question:
How can I improve my algorithm in order to speed up the process?
For any redditors that are reading this, yes I created this after seeing the thread from /r/askreddit yesterday.
Here's a starting thought:
Create a Map<String, List<String>> (or a Multimap<String, String> if you're using Guava), and for each word, "blank out" one letter at a time, and add the original word to the list for that blanked-out word. So you'd end up with:
.ORSE => NORSE, HORSE, GORSE (etc)
H.RSE => HORSE
HO.SE => HORSE, HOUSE (etc)
At that point, given a word, you can very easily find all the words it's similar to - just go through the same process again, but instead of adding to the map, just fetch all the values for each "blanked out" version.
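A minimal sketch of that index using only the JDK collections might look like the following. The class and method names are made up for illustration, and it assumes all words have already been normalised to one case.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class WildcardIndex {
    // Maps a pattern such as "H.RSE" to every word that matches it.
    private final Map<String, List<String>> buckets = new HashMap<>();

    WildcardIndex(Iterable<String> words) {
        for (String word : words) {
            for (String pattern : patterns(word)) {
                buckets.computeIfAbsent(pattern, k -> new ArrayList<>()).add(word);
            }
        }
    }

    // All versions of the word with exactly one letter replaced by '.'.
    private static List<String> patterns(String word) {
        List<String> result = new ArrayList<>(word.length());
        char[] chars = word.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            char saved = chars[i];
            chars[i] = '.';
            result.add(new String(chars));
            chars[i] = saved;
        }
        return result;
    }

    // Words that differ from the given word in exactly one position.
    Set<String> neighbours(String word) {
        Set<String> result = new TreeSet<>();
        for (String pattern : patterns(word)) {
            result.addAll(buckets.getOrDefault(pattern, Collections.<String>emptyList()));
        }
        result.remove(word); // a word is not its own neighbour
        return result;
    }

    public static void main(String[] args) {
        WildcardIndex index = new WildcardIndex(
                Arrays.asList("HORSE", "HOUSE", "NORSE", "GORSE", "MOUSE"));
        System.out.println(index.neighbours("HORSE")); // [GORSE, HOUSE, NORSE]
    }
}
```

Building the index costs one map insertion per letter of every word, and each lookup touches only the buckets for that word, so the all-pairs comparison disappears.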
You probably need to run it through a profiler to see where most of the time is taken, especially since you are using library classes - otherwise you might put in a lot of effort but see no significant improvement.
You could lowercase all the words before you start, to avoid the equalsIgnoreCase() on every comparison. In fact, this is an inconsistency in your code - you use equalsIgnoreCase() initially, but then compare chars in a case-sensitive way: if (wordA.charAt(i) == wordB.charAt(i)). It might be worth eliminating the equalsIgnoreCase() check entirely, since this is doing essentially the same thing as the following charAt loop.
You could change the comparison loop so it finishes early when it finds more than one different letter, rather than comparing all the letters and only then checking how many are matching or different.
(Update: this answer is about optimizing your current code. I realize, reading your question again, that you may be asking about alternative algorithms!)
You can keep the words of the same length in a sorted list and then use nested loops of the form for (int i = 0; i < n; ++i) for (int j = i + 1; j < n; ++j) { }, so that each pair is compared only once.
And in isSimilar, count the differences and return false as soon as the count reaches 2.
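Those two tweaks (an early exit in the comparison, and visiting each pair once) could look roughly like this. It is a sketch with hypothetical names, and it assumes the words are already in a single case so that no equalsIgnoreCase() call is needed.

```java
import java.util.Arrays;
import java.util.List;

class OneLetterApart {
    // True when the words have the same length and differ in exactly one position.
    static boolean isSimilar(String a, String b) {
        if (a.length() != b.length()) return false;
        int differences = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i) && ++differences > 1) {
                return false; // second mismatch found: stop scanning
            }
        }
        return differences == 1; // identical words (0 differences) are not similar
    }

    // Visits each unordered pair once (j starts at i + 1) and counts the similar ones.
    static int countPairs(List<String> words) {
        int pairs = 0;
        for (int i = 0; i < words.size(); i++) {
            for (int j = i + 1; j < words.size(); j++) {
                if (isSimilar(words.get(i), words.get(j))) pairs++;
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // HORSE-HOUSE and HOUSE-MOUSE are one letter apart, so this prints 2.
        System.out.println(countPairs(Arrays.asList("HORSE", "HOUSE", "MOUSE")));
    }
}
```

In the graph-building code, the place where `pairs++` happens is where the edge would be added; this is still quadratic, just with a smaller constant.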
So, I've written a spellchecker in Java and things work as they should. The only problem is that if I use a word where the max allowed distance of edits is too large (like say, 9) then my code runs out of memory. I've profiled my code and dumped the heap into a file, but I don't know how to use it to optimize my code.
Can anyone offer any help? I'm more than willing to put up the file/use any other approach that people might have.
-Edit-
Many people asked for more details in the comments. I figured that other people would find them useful, and they might get buried in the comments. Here they are:
I'm using a Trie to store the words themselves.
In order to improve time efficiency, I don't compute the Levenshtein Distance upfront, but I calculate it as I go. What I mean by this is that I keep only two rows of the LD table in memory. Since a Trie is a prefix tree, it means that every time I recurse down a node, the previous letters of the word (and therefore the distance for those words) remains the same. Therefore, I only calculate the distance with that new letter included, with the previous row remaining unchanged.
The suggestions that I generate are stored in a HashMap. The rows of the LD table are stored in ArrayLists.
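For readers unfamiliar with the two-row technique mentioned above, here is a standalone sketch for a plain pair of strings. It is not the Trie traversal itself, and the class and method names are hypothetical.

```java
class TwoRowLevenshtein {
    // Levenshtein distance that keeps only the previous and current rows of the table.
    static int distance(String a, String b) {
        int[] previous = new int[b.length() + 1];
        int[] current = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            current[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                current[j] = Math.min(
                        Math.min(current[j - 1] + 1, previous[j] + 1),
                        previous[j - 1] + cost);
            }
            // The row just computed becomes the previous row for the next letter.
            int[] tmp = previous;
            previous = current;
            current = tmp;
        }
        return previous[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting")); // 3
    }
}
```

In the Trie version, each recursion level plays the role of one iteration of the outer loop, which is why a fresh row is allocated per node visited.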
Here's the code of the function in the Trie that leads to the problem. Building the Trie is pretty straightforward, so I haven't included that code here.
/*
 * @param letter: the letter that is currently being looked at in the trie
 *        word: the word that we are trying to find matches for
 *        previousRow: the previous row of the Levenshtein Distance table
 *        suggestions: all the suggestions for the given word
 *        maxd: max distance a word can be from the query and still be returned as a suggestion
 *        suggestion: the current suggestion being constructed
 */
public void get(char letter, ArrayList<Character> word, ArrayList<Integer> previousRow, HashSet<String> suggestions, int maxd, String suggestion){
    // the new row of the Levenshtein Distance table that is to be computed.
    ArrayList<Integer> currentRow = new ArrayList<Integer>(word.size()+1);
    currentRow.add(previousRow.get(0)+1);
    int insert = 0;
    int delete = 0;
    int swap = 0;
    int d = 0;
    for(int i=1;i<word.size()+1;i++){
        delete = currentRow.get(i-1)+1;
        insert = previousRow.get(i)+1;
        if(word.get(i-1)==letter)
            swap = previousRow.get(i-1);
        else
            swap = previousRow.get(i-1)+1;
        d = Math.min(delete, Math.min(insert, swap));
        currentRow.add(d);
    }
    // if this node represents a word and the distance so far is <= maxd, then add this word as a suggestion
    if(isWord==true && d<=maxd){
        suggestions.add(suggestion);
    }
    // if any of the entries in the current row are <=maxd, it means we can still find possible solutions.
    // recursively search all the branches of the trie
    for(int i=0;i<currentRow.size();i++){
        if(currentRow.get(i)<=maxd){
            for(int j=0;j<26;j++){
                if(children[j]!=null){
                    children[j].get((char)(j+97), word, currentRow, suggestions, maxd, suggestion+String.valueOf((char)(j+97)));
                }
            }
            break;
        }
    }
}
Here's some code I quickly crafted showing one way to generate the candidates and to then "rank" them.
The trick is: you never "test" a non-valid candidate.
To me, your "I run out of memory when I've got an edit distance of 9" screams "combinatorial explosion".
Of course, to dodge a combinatorial explosion you don't try to generate all the words that are at a distance of 9 from your misspelled word. You start from the misspelled word and generate (quite a lot of) possible candidates, but you refrain from creating too many, for then you'd run into trouble.
(Also note that it doesn't make much sense to compute up to a Levenshtein Edit Distance of 9, because technically any word of fewer than 10 letters can be transformed into any other word of fewer than 10 letters in at most 9 transformations.)
Here's why you simply cannot test all words up to a distance of 9 without either getting an OutOfMemory error or having a program that never terminates:
generating all the LED up to 1 for the word "ptmizing", by only adding one letter (from a to z), already generates 9*26 variations (i.e. 234 variations) [there are 9 positions where you can insert one out of 26 letters]
generating all the LED up to 2, by only adding one letter to what we now have, already generates 10*26*234 variations (60 840)
generating all the LED up to 3 gives: 17 400 240 variations
And that is only by considering the case where we add one, two or three letters (we're not counting deletions, swaps, etc.). And that is on a misspelled word that is only eight characters long. On "real" words, it explodes even faster.
Sure, you could get "smart" and generate this in a way that avoids too many dupes etc., but the point stays: it's a combinatorial explosion that explodes quickly.
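The arithmetic behind those counts can be checked with a few lines. The helper below is purely illustrative (hypothetical names) and counts insertion-only variations with duplicates included, exactly as in the estimate above: an 8-letter word has 9 insertion positions, so one insertion gives 9 × 26 = 234 strings, and each further insertion multiplies by (positions + 1) × 26.

```java
class InsertionCount {
    // Number of strings obtained by inserting one of 26 letters at any position,
    // `edits` times in a row, counting duplicates.
    static long variations(int wordLength, int edits) {
        long total = 1;
        for (int k = 0; k < edits; k++) {
            // A word of length n has n + 1 insertion positions.
            total *= (long) (wordLength + k + 1) * 26;
        }
        return total;
    }

    public static void main(String[] args) {
        int length = "ptmizing".length(); // 8
        for (int edits = 1; edits <= 3; edits++) {
            System.out.println(edits + " insertion(s): " + variations(length, edits));
        }
    }
}
```

For "ptmizing" this yields 234, 60840 and 17400240 for one, two and three insertions.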
Anyway... Here's an example. I'm simply passing the dictionary of valid words (containing only four words in this case) to the corresponding method to keep this short.
You'll obviously want to replace the call to the LED with your own LED implementation.
The double-metaphone is just an example: in a real spellchecker, words that "sound alike" despite a larger LED should be considered "more correct" and hence often be suggested first. For example, "optimizing" and "aupteemising" are quite far apart from a LED point of view, but using the double-metaphone you should get "optimizing" as one of the first suggestions.
(disclaimer: following was cranked in a few minutes, it doesn't take into account uppercase, non-english words, etc.: it's not a real spell-checker, just an example)
@Test
public void spellCheck() {
    final String src = "misspeled";
    final Set<String> validWords = new HashSet<String>();
    validWords.add("boing");
    validWords.add("Yahoo!");
    validWords.add("misspelled");
    validWords.add("stackoverflow");
    final List<String> candidates = findNonSortedCandidates( src, validWords );
    final SortedMap<Integer,String> res = computeLevenhsteinEditDistanceForEveryCandidate(candidates, src);
    for ( final Map.Entry<Integer,String> entry : res.entrySet() ) {
        System.out.println( entry.getValue() + " # LED: " + entry.getKey() );
    }
}
private SortedMap<Integer, String> computeLevenhsteinEditDistanceForEveryCandidate(
        final List<String> candidates,
        final String mispelledWord
) {
    // note: the distance is the map key, so candidates with the same distance overwrite each other
    final SortedMap<Integer, String> res = new TreeMap<Integer, String>();
    for ( final String candidate : candidates ) {
        res.put( dynamicProgrammingLED(candidate, mispelledWord), candidate );
    }
    return res;
}
private int dynamicProgrammingLED( final String candidate, final String misspelledWord ) {
    return Levenhstein.getLevenshteinDistance(candidate,misspelledWord);
}
Here you generate all possible candidates using several methods. I've only implemented one such method (and quickly so it may be bogus but that's not the point ; )
private List<String> findNonSortedCandidates( final String src, final Set<String> validWords ) {
    final List<String> res = new ArrayList<String>();
    res.addAll( allCombinationAddingOneLetter(src, validWords) );
    // res.addAll( allCombinationRemovingOneLetter(src) );
    // res.addAll( allCombinationInvertingLetters(src) );
    return res;
}
private List<String> allCombinationAddingOneLetter( final String src, final Set<String> validWords ) {
    final List<String> res = new ArrayList<String>();
    for (char c = 'a'; c <= 'z'; c++) { // <= so that 'z' is tried as well
        for (int i = 0; i < src.length(); i++) {
            final String candidate = src.substring(0, i) + c + src.substring(i, src.length());
            if ( validWords.contains(candidate) ) {
                res.add(candidate); // only adding candidates we know are valid words
            }
        }
        if ( validWords.contains(src+c) ) {
            res.add( src + c );
        }
    }
    return res;
}
One thing you could try is increasing the Java heap size in order to overcome the "out of memory" error.
The following article explains how to increase the heap size in Java:
http://viralpatel.net/blogs/2009/01/jvm-java-increase-heap-size-setting-heap-size-jvm-heap.html
But I think the better approach to your problem is to find a better algorithm than the current one.
Well without more Information on the topic there is not much the community could do for you... You can start with the following:
Look at what your Profiler says (after it has run a little while): Does anything pile up? Are there a lot of Objects - this should normally give you a hint on what is wrong with your code.
Publish your saved dump somewhere and link it in your question, so someone else could take a look at it.
Tell us which profiler you are using, then somebody can give you hints on where to look for valuable information.
After you have narrowed down your problem to a specific part of your Code, and you cannot figure out why there are so many objects of $FOO in your memory, post a snippet of the relevant part.
Which one of the following is a better practice to check if a string is float?
try{
    Double.parseDouble(strVal);
}catch(NumberFormatException e){
    //My Logic
}
or
if(!strVal.matches("[-+]?\\d*\\.?\\d+")){
    //My Logic
}
In terms of performance, maintenance and readability?
And yeah, I would like to know which one is good coding practice?
Personal opinion: of the code I've seen, I would expect most developers to tend towards the try-catch block. The try-catch is in a sense also more readable, and it assumes that in most cases the string will contain a valid number. But there are a number of things to consider with your examples which may affect which one you choose.
How often do you expect the string not to contain a valid number?
Note that for bulk processing you should create a Pattern object outside of the loop. This will stop the code from having to recompile the pattern every time.
As a general rule you should never use exceptions for logic flow. Your try-catch indicates logic for when it's not a number, whereas your regex indicates logic for when it is a number. So it wasn't obvious what the context of the code is.
If you choose the regex technique, you are still probably going to have to convert at some point, so in effect it may be a waste of effort.
And finally, are the performance requirements of the application important enough to warrant analysis at this level? Again, generally speaking, I'd recommend keeping things as simple as possible and making it work; then, if there are performance problems, use some code analysis tools to find the bottlenecks and tune them out.
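To make the "create the Pattern outside of the loop" point concrete, here is a small sketch that puts both checks side by side. The class and method names are hypothetical; the expression is the one from the question.

```java
import java.util.regex.Pattern;

class FloatCheck {
    // Compiled once and reused; String.matches() would recompile it on every call.
    private static final Pattern FLOAT = Pattern.compile("[-+]?\\d*\\.?\\d+");

    static boolean matchesPattern(String s) {
        return FLOAT.matcher(s).matches();
    }

    // The try/catch variant; note that a null argument would throw NullPointerException instead.
    static boolean parses(String s) {
        try {
            Double.parseDouble(s);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        for (String s : new String[] {"3.14", "-42", "abc"}) {
            System.out.println(s + " regex=" + matchesPattern(s) + " parse=" + parses(s));
        }
    }
}
```

Whichever check is used, the value still has to be converted afterwards, which is part of why the regex route can end up doing the work twice.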
Performance: Exceptions are slow, and so is exception-based logic, so second would be faster.
Maintenance / Reliability: The first one is crystal clear and will stay updated with updates to the Java Framework.
That being said, I would personally prefer the first. Performance is something you want to consider as a whole in your architecture, your data structure design, etc. not line by line. Measure for performance and optimize what is actually slow, not what you think might be slow.
The first one is going to perform better than the regex when the string is a valid double. For one, parsing is very fast when the recognizer is hard-coded, as it is in Double.parseDouble(). Also, there's nothing to maintain: a valid string is whatever Java defines a Double to be. Not to mention that Double.parseDouble() is easier to read.
The other solution isn't precompiled, so the first thing it has to do is parse and compile the regular expression; then it has to run that expression; and then you'll have to execute Double.parseDouble() to get the value into a double. And that's going to be done for every number passed to it. You might be able to optimize it with Pattern.compile(), but executing the expression is still going to be slower, especially since you have to run Double.parseDouble() afterwards to get the value into a double.
Yes, exceptions are not super fast, but you only pay that price when the parse fails. If you don't plan on seeing lots of errors, then I don't think you'll notice the slowdown from gathering the stack trace on the throw (which is why exceptions perform poorly). If you're only going to encounter a handful of exceptions, then performance isn't going to be a problem. The problem is that you expected a double and it wasn't one, probably because of some configuration mistake, so tell the user and quit, or pick a suitable default and continue. That's all you can do in those cases.
If you use parseDouble, you will end up with what Mark said, but in a more readable way, and might profit from performance improvements and bug fixes.
Since exceptions are only costly when they are thrown, there is only need to look for a different strategy if you
expect wrong formats to happen often
expect them to fall in a specific pattern which you can catch faster and beforehand
In the end you will call parseDouble either way, and therefore it is considered all right to use it that way.
Note that your pattern rejects 7. as a Double, while Java and C/C++ don't, as well as scientific notation like 4.2e8.
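A quick way to see that mismatch, using the pattern from the question (the class name and the `PATTERN` constant are just for this sketch):

```java
class PatternVersusParse {
    // The expression from the question.
    static final String PATTERN = "[-+]?\\d*\\.?\\d+";

    public static void main(String[] args) {
        for (String s : new String[] {"7.", "4.2e8", "7.5"}) {
            boolean regexOk = s.matches(PATTERN);
            double parsed = Double.parseDouble(s); // all three parse without an exception
            System.out.println(s + " -> regex: " + regexOk + ", parseDouble: " + parsed);
        }
    }
}
```

Only "7.5" passes the regex, while all three are accepted by Double.parseDouble(), so the two checks do not define the same set of valid inputs.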
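Maybe you can also try it this way. Note that this is a generic check for a string containing a valid number.
import java.text.NumberFormat;
import java.text.ParsePosition;

public static boolean isNumeric(String str)
{
    // sample inputs: "2.3452342323423424E8", "21414124.12412412412412", "123123"
    // note: NumberFormat.getInstance() uses the default locale's number format
    NumberFormat formatter = NumberFormat.getInstance();
    ParsePosition pos = new ParsePosition(0);
    formatter.parse(str, pos);
    // numeric only if the string is non-empty and the parser consumed all of it
    return !str.isEmpty() && str.length() == pos.getIndex();
}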
And yeah, I would like to know which one is good coding practice?
Either can be good coding practice, depending on the context.
If bad numbers are unlikely (i.e. it is an "exceptional" situation), then the exception-based solution is fine. (Indeed, if the probability of bad numbers is small enough, exceptions might even be faster on average. It depends on the relative speed of Double.parseDouble() and a compiled regex for typical input strings. That would need to be measured ...)
If bad numbers are reasonably (or very) likely (i.e. it is NOT an "exceptional" situation), then the regex-based solution is probably better.
If the code path that does the test is infrequently executed, then it really makes no difference which approach you use.
Below is a performance test comparing a regular expression with try/catch for validating that a string is numeric.
The table below shows the timings for a list (100k) at three mixes (90%, 70%, 50%) of good data (float values), with the remainder being bad data (strings). The times are in nanoseconds (measured with System.nanoTime()).

| | 90% - 10% | 70% - 30% | 50% - 50% |
|---|---|---|---|
| **Try Catch** | 87234580 | 122297750 | 143470144 |
| **Regular Expression** | 202700266 | 192596610 | 162166308 |

The performance of try/catch is better (unless the bad data is over 50%), even though try/catch may have some impact on performance. That impact arises because try/catch prevents the JVM from doing some optimizations. Joshua Bloch, in "Effective Java," said the following:
• Placing code inside a try-catch block inhibits certain optimizations that modern JVM implementations might otherwise perform.
import java.util.ArrayList;
import java.util.Random;

public class PerformanceStats {
    static final String regularExpr = "([0-9]*[.])?[0-9]+";
    public static void main(String[] args) {
        PerformanceStats ps = new PerformanceStats();
        ps.statsFinder();
        //System.out.println("123".matches(regularExpr));
    }
    private void statsFinder() {
        int count = 200000;
        int ncount = 200000;
        ArrayList<String> ar = getList(count, ncount);
        System.out.println("count = " + count + " ncount = " + ncount);
        long t1 = System.nanoTime();
        validateWithCatch(ar);
        long t2 = System.nanoTime();
        validateWithRegularExpression(ar);
        long t3 = System.nanoTime();
        System.out.println("time taken with Exception " + (t2 - t1) );
        System.out.println("time taken with Regular Expression " + (t3 - t2) );
    }
    private ArrayList<String> getList(int count, int noiseCount) {
        Random rand = new Random();
        ArrayList<String> list = new ArrayList<String>();
        for (int i = 0; i < count; i++) {
            list.add((String) ("" + Math.abs(rand.nextFloat())));
        }
        // adding noise
        for (int i = 0; i < (noiseCount); i++) {
            list.add((String) ("sdss" + rand.nextInt() ));
        }
        return list;
    }
    private void validateWithRegularExpression(ArrayList<String> list) {
        ArrayList<Float> ar = new ArrayList<>();
        for (String s : list) {
            if (s.matches(regularExpr)) {
                ar.add(Float.parseFloat(s));
            }
        }
        System.out.println("the size is in regular expression " + ar.size());
    }
    private void validateWithCatch(ArrayList<String> list) {
        ArrayList<Float> ar = new ArrayList<>();
        for (String s : list) {
            try {
                float e = Float.parseFloat(s);
                ar.add(e);
            } catch (Exception e) {
            }
        }
        System.out.println("the size is in catch block " + ar.size());
    }
}
We have to build Strings all the time for log output and so on. Over the JDK versions we have learned when to use StringBuffer (many appends, thread safe) and StringBuilder (many appends, non-thread-safe).
What's the advice on using String.format()? Is it efficient, or are we forced to stick with concatenation for one-liners where performance is important?
e.g. ugly old style,
String s = "What do you get if you multiply " + varSix + " by " + varNine + "?";
vs. tidy new style (String.format, which is possibly slower),
String s = String.format("What do you get if you multiply %d by %d?", varSix, varNine);
Note: my specific use case is the hundreds of 'one-liner' log strings throughout my code. They don't involve a loop, so StringBuilder is too heavyweight. I'm interested in String.format() specifically.
I took hhafez's code and added a memory test:
private static void test() {
    Runtime runtime = Runtime.getRuntime();
    long memory;
    ...
    memory = runtime.freeMemory();
    // for loop code
    memory = memory-runtime.freeMemory();
I run this separately for each approach, the '+' operator, String.format and StringBuilder (calling toString()), so the memory used will not be affected by other approaches.
I added more concatenations, making the string as "Blah" + i + "Blah"+ i +"Blah" + i + "Blah".
The results are as follows (average of 5 runs each):

| Approach | Time (ms) | Memory allocated (long) |
|---|---|---|
| + operator | 747 | 320,504 |
| String.format | 16484 | 373,312 |
| StringBuilder | 769 | 57,344 |
We can see that String + and StringBuilder are practically identical time-wise, but StringBuilder is much more efficient in memory use.
This is very important when we have many log calls (or any other statements involving strings) in a time interval short enough that the garbage collector won't get to clean up the many string instances resulting from the + operator.
And a note, BTW, don't forget to check the logging level before constructing the message.
Conclusions:
I'll keep on using StringBuilder.
I have too much time or too little life.
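The note about checking the logging level first can be sketched with java.util.logging. The choice of logging framework, the logger name and the `built` counter are assumptions made for this illustration; other frameworks offer equivalent guards.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

class GuardedLogging {
    private static final Logger LOG = Logger.getLogger("example"); // hypothetical logger name
    static int built = 0; // counts how often the message string was actually constructed

    static String expensiveMessage(int a, int b) {
        built++;
        return "What do you get if you multiply " + a + " by " + b + "?";
    }

    public static void main(String[] args) {
        LOG.setLevel(Level.INFO); // FINE messages are disabled

        // Explicit guard: the string is only built when FINE is enabled.
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine(expensiveMessage(6, 9));
        }

        // Supplier overload (Java 8+): the lambda is only invoked when FINE is enabled.
        LOG.fine(() -> expensiveMessage(6, 9));

        System.out.println("messages built: " + built);
    }
}
```

With the level set to INFO, neither call constructs the string, so the choice between +, StringBuilder and String.format only matters for messages that are actually emitted.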
I wrote a small class to test which of the two has the better performance, and + comes out ahead of format by a factor of 5 to 6.
Try it yourself:
import java.io.*;
import java.util.Date;
public class StringTest{
    public static void main( String[] args ){
        int i = 0;
        long prev_time = System.currentTimeMillis();
        long time;
        for( i = 0; i< 100000; i++){
            String s = "Blah" + i + "Blah";
        }
        time = System.currentTimeMillis() - prev_time;
        System.out.println("Time after for loop " + time);
        prev_time = System.currentTimeMillis();
        for( i = 0; i<100000; i++){
            String s = String.format("Blah %d Blah", i);
        }
        time = System.currentTimeMillis() - prev_time;
        System.out.println("Time after for loop " + time);
    }
}
Running the above for different N shows that both behave linearly, but String.format is 5-30 times slower.
The reason is that in the current implementation String.format first parses the input with regular expressions and then fills in the parameters. Concatenation with plus, on the other hand, gets optimized by javac (not by the JIT) and uses StringBuilder.append directly.
All the benchmarks presented here have some flaws, thus results are not reliable.
I was surprised that nobody used JMH for benchmarking, so I did.
Results:
Benchmark Mode Cnt Score Error Units
MyBenchmark.testOld thrpt 20 9645.834 ± 238.165 ops/s // using +
MyBenchmark.testNew thrpt 20 429.898 ± 10.551 ops/s // using String.format
Units are operations per second, the more the better. Benchmark source code. OpenJDK IcedTea 2.5.4 Java Virtual Machine was used.
So, old style (using +) is much faster.
Your old ugly style is automatically compiled by JAVAC 1.6 as :
StringBuilder sb = new StringBuilder("What do you get if you multiply ");
sb.append(varSix);
sb.append(" by ");
sb.append(varNine);
sb.append("?");
String s = sb.toString();
So there is absolutely no difference between this and using a StringBuilder.
String.format is a lot more heavyweight, since it creates a new Formatter, parses your input format string, creates a StringBuilder, appends everything to it and calls toString().
Java's String.format works like so:
it parses the format string, exploding into a list of format chunks
it iterates the format chunks, rendering into a StringBuilder, which is basically an array that resizes itself as necessary by copying into a new array. This is necessary because we don't yet know how large to allocate the final String
StringBuilder.toString() copies its internal buffer into a new String
If the final destination for this data is a stream (e.g. rendering a webpage or writing to a file), you can assemble the format chunks directly into your stream:
new PrintStream(outputStream, autoFlush, encoding).format("hello %s", "world");
I speculate that the optimizer will optimize away the format string processing. If so, you're left with equivalent amortized performance to manually unrolling your String.format into a StringBuilder.
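A runnable version of the stream variant might look like this. The ByteArrayOutputStream merely stands in for a real file or socket stream, and the class name is hypothetical; note that java.util.Formatter placeholders use the `%s` style.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class StreamFormat {
    // Formats straight into the stream instead of building an intermediate String first.
    static String render(String name) throws UnsupportedEncodingException {
        ByteArrayOutputStream out = new ByteArrayOutputStream(); // stand-in for a file or socket
        try (PrintStream ps = new PrintStream(out, true, "UTF-8")) {
            ps.format("hello %s", name);
        }
        return out.toString("UTF-8");
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        System.out.println(render("world")); // hello world
    }
}
```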
To expand/correct on the first answer above, it's not translation that String.format would help with, actually.
What String.format will help with is when you're printing a date/time (or a numeric format, etc.), where there are localization (l10n) differences (i.e., some countries will print 04Feb2009 and others will print Feb042009).
With translation, you're just talking about moving any externalizable strings (like error messages and what-not) into a property bundle so that you can use the right bundle for the right language, using ResourceBundle and MessageFormat.
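To make the distinction concrete, here is a small sketch. The class and method names are hypothetical, and the patterns are passed in directly, whereas in a real application they would come from a ResourceBundle. It uses String.format with an explicit Locale for number formatting, and MessageFormat for a translatable message.

```java
import java.text.MessageFormat;
import java.util.Locale;

// Hypothetical demo class; in real code the patterns would be loaded from a ResourceBundle.
final class LocaleFormatDemo {
    // Localization: the same number is rendered with locale-specific separators.
    static String amount(Locale locale, double value) {
        return String.format(locale, "%,.2f", value);
    }

    // Translation: the whole sentence is one externalizable pattern with a placeholder.
    static String greeting(String pattern, String name) {
        return MessageFormat.format(pattern, name);
    }

    public static void main(String[] args) {
        System.out.println(amount(Locale.US, 1234.5));
        System.out.println(amount(Locale.GERMANY, 1234.5));
        System.out.println(greeting("Hello, {0}!", "world"));
    }
}
```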
Looking at all the above, I'd say that performance-wise, String.format vs. plain concatenation comes down to what you prefer. If you prefer looking at calls to .format over concatenation, then by all means, go with that.
After all, code is read a lot more than it's written.
In your example, performance probably isn't too different, but there are other issues to consider, namely memory fragmentation. Each concatenation operation creates a new string, even if it's temporary (it takes time to GC it, and it's more work). String.format() is just more readable and it involves less fragmentation.
Also, if you're using a particular format a lot, don't forget you can use the Formatter class directly (all String.format() does is instantiate a single-use Formatter instance).
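A minimal sketch of reusing one Formatter, with hypothetical class and method names. As far as I know this saves only the repeated Formatter instantiation; the format string is still parsed on every call.

```java
import java.util.Formatter;
import java.util.Locale;

// Hypothetical demo class showing one Formatter reused for many format calls.
final class ReusedFormatterDemo {
    static String joinLines(int count) {
        StringBuilder sb = new StringBuilder();
        // The Formatter appends into sb; closing it does not affect the StringBuilder.
        try (Formatter formatter = new Formatter(sb, Locale.ROOT)) {
            for (int i = 0; i < count; i++) {
                formatter.format("Blah %d Blah\n", i);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(joinLines(3));
    }
}
```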
Also, something else you should be aware of: be careful of using substring(). For example:
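String getSmallString() {
    String largeString = loadLargeString(); // hypothetical helper: load from file; say 2M in size
    return largeString.substring(100, 300);
}
That large string is still in memory because, in JDKs before 7u6, substring() shares the parent string's backing char array. Newer JDKs copy the characters, so the workaround below is unnecessary there. A better version is: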
return new String(largeString.substring(100, 300));
or
return String.format("%s", largeString.substring(100, 300));
The second form is probably more useful if you're doing other stuff at the same time.
Generally you should use String.format because it's relatively fast and it supports globalization (assuming you're actually trying to write something that is read by the user). It also makes it easier to globalize if you're trying to translate one string versus 3 or more per statement (especially for languages that have drastically different grammatical structures).
Now, if you never plan on translating anything, then either rely on Java's built-in conversion of + operators into StringBuilder, or use Java's StringBuilder explicitly.
Another perspective, from a logging point of view only.
I see a lot of discussion related to logging on this thread, so I thought of adding my experience as an answer. Maybe someone will find it useful.
I guess the motivation for logging with a formatter comes from avoiding string concatenation. Basically, you do not want the overhead of a string concat if you are not going to log the message.
You do not really need to concat/format unless you want to log. Let's say I define a method like this:
public void logDebug(Throwable t, String... args) { // varargs must be the last parameter
    if (debugOn) {
        // call concat methods for all args
        // log the final debug message
    }
}
In this approach the concat/formatter code is not called at all if it's a debug message and debugOn = false.
Though it will still be better to use StringBuilder instead of formatter here. The main motivation is to avoid any of that.
At the same time, I do not like adding an "if" block for each logging statement, since:
It affects readability.
It reduces coverage on my unit tests, which is confusing when you want to make sure every line is tested.
Therefore I prefer to create a logging utility class with methods like above and use it everywhere without worrying about performance hit and any other issues related to it.
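A sketch of such a utility class follows. The class name, the Object... signature and the in-memory sink are assumptions made for illustration; a real version would delegate to an actual logging framework. Note that the arguments are still evaluated and a varargs array is still allocated at each call site; only the concatenation itself is skipped when debugging is off.

```java
// Hypothetical logging helper; the StringBuilder sink stands in for a real logger.
final class DebugLog {
    private final boolean debugOn;
    private final StringBuilder sink = new StringBuilder();

    DebugLog(boolean debugOn) {
        this.debugOn = debugOn;
    }

    // The message parts are concatenated only when debug logging is enabled.
    void logDebug(Object... parts) {
        if (!debugOn) {
            return;
        }
        StringBuilder sb = new StringBuilder();
        for (Object part : parts) {
            sb.append(part);
        }
        sink.append(sb).append('\n');
    }

    // Exposes what was logged so the behavior can be inspected.
    String captured() {
        return sink.toString();
    }
}
```

Call sites then stay on one line, e.g. log.logDebug("x=", x, ", y=", y), with no surrounding "if" block.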
I just modified hhafez's test to include StringBuilder. StringBuilder is 33 times faster than String.format using jdk 1.6.0_10 client on XP. Using the -server switch lowers the factor to 20.
public class StringTest {
    public static void main( String[] args ) {
        test();
        test();
    }

    private static void test() {
        int i = 0;
        long prev_time = System.currentTimeMillis();
        long time;
        for ( i = 0; i < 1000000; i++ ) {
            String s = "Blah" + i + "Blah";
        }
        time = System.currentTimeMillis() - prev_time;
        System.out.println("Time after for loop " + time);
        prev_time = System.currentTimeMillis();
        for ( i = 0; i < 1000000; i++ ) {
            String s = String.format("Blah %d Blah", i);
        }
        time = System.currentTimeMillis() - prev_time;
        System.out.println("Time after for loop " + time);
        prev_time = System.currentTimeMillis();
        for ( i = 0; i < 1000000; i++ ) {
            new StringBuilder("Blah").append(i).append("Blah");
        }
        time = System.currentTimeMillis() - prev_time;
        System.out.println("Time after for loop " + time);
    }
}
While this might sound drastic, I consider it to be relevant only in rare cases, because the absolute numbers are pretty low: 4 s for 1 million simple String.format calls is sort of ok - as long as I use them for logging or the like.
Update: As pointed out by sjbotha in the comments, the StringBuilder test is invalid, since it is missing a final .toString().
The correct speed-up factor from String.format(.) to StringBuilder is 23 on my machine (16 with the -server switch).
Here is a modified version of hhafez's entry. It includes a StringBuilder option.
public class BLA
{
    public static final String BLAH = "Blah ";
    public static final String BLAH2 = " Blah";
    public static final String BLAH3 = "Blah %d Blah";

    public static void main(String[] args) {
        int i = 0;
        long prev_time = System.currentTimeMillis();
        long time;
        int numLoops = 1000000;
        for( i = 0; i< numLoops; i++){
            String s = BLAH + i + BLAH2;
        }
        time = System.currentTimeMillis() - prev_time;
        System.out.println("Time after for loop " + time);
        prev_time = System.currentTimeMillis();
        for( i = 0; i<numLoops; i++){
            String s = String.format(BLAH3, i);
        }
        time = System.currentTimeMillis() - prev_time;
        System.out.println("Time after for loop " + time);
        prev_time = System.currentTimeMillis();
        for( i = 0; i<numLoops; i++){
            StringBuilder sb = new StringBuilder();
            sb.append(BLAH);
            sb.append(i);
            sb.append(BLAH2);
            String s = sb.toString();
        }
        time = System.currentTimeMillis() - prev_time;
        System.out.println("Time after for loop " + time);
    }
}
Time after for loop 391
Time after for loop 4163
Time after for loop 227
The answer to this depends very much on how your specific Java compiler optimizes the bytecode it generates. Strings are immutable and, theoretically, each "+" operation can create a new one. But your compiler almost certainly optimizes away the interim steps in building long strings. It's entirely possible that both lines of code above generate the exact same bytecode.
The only real way to know is to test the code iteratively in your current environment. Write a quick-and-dirty app that concatenates strings both ways in a loop and see how they time out against each other.
Consider using "hello".concat("world!") when concatenating a small number of strings. It could perform even better than the other approaches.
If you have more than 3 strings, then consider using StringBuilder, or just String, depending on the compiler that you use.
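A small sketch of both options, with hypothetical class and method names. One difference worth knowing: concat throws a NullPointerException for a null argument, whereas + would render it as "null".

```java
// Hypothetical demo class contrasting String.concat with StringBuilder.
final class ConcatMethodDemo {
    // Two strings: concat creates the result in one step.
    static String twoParts(String a, String b) {
        return a.concat(b);
    }

    // More strings: a single StringBuilder avoids one intermediate String per step.
    static String manyParts(String... parts) {
        StringBuilder sb = new StringBuilder();
        for (String part : parts) {
            sb.append(part);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(twoParts("hello", "world!"));
        System.out.println(manyParts("a", "b", "c", "d"));
    }
}
```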