Java : Searching Ids from hashset or String - java

I have a large number of IDs, which I can store either in a HashSet or in a String,
i.e.
String strIds=",1,2,3,4,5,6,7,8,.,.,.,.,.,.,.,1000,";
Or
HashSet<String> setOfids = new HashSet<String>();
setOfids.add("1");
setOfids.add("2");
.
.
.
setOfids.add("1000");
Furthermore, I want to search these IDs.
Which should I use for better performance (faster and more memory efficient)?
1) strIds.indexOf("someId");
or
2) setOfids.contains("someId");
Please also tell me if there is any other way to do the same.
Thanks for looking here :)

A hash table lookup is "constant time", i.e., it does not grow with the number of ids.
But a compact String holding all the IDs requires the least memory.
So, make up your mind: fastest retrieval or a minimum of storage!

Set will be the better choice. Reasons:
Search will be O(1) on average in case of Set. In case of String it will be O(N).
Performance will not degrade as data grows.
String will use more memory if you want to do any kind of data manipulation (add or remove IDs), since every change creates a new String.
indexOf might give you a false positive as well.
Say 1000 is present but 100 is not; indexOf will return the location of 1000, as 100 is a substring of 1000.
Simple POC code for the performance:
import java.util.HashSet;
import java.util.Set;
public class TimeComputationTest {
public static void main(String[] args) {
String strIds = null;
Set<String> setOfids = new HashSet<String>();
StringBuffer sb = new StringBuffer();
for (int i = 1;i <= 1000;i++) {
setOfids.add(String.valueOf(i));
if (sb.length() != 0) {
sb.append(",");
}
sb.append(i);
}
strIds = sb.toString();
testTime(strIds, setOfids, "1");
testTime(strIds, setOfids, "100");
testTime(strIds, setOfids, "500");
testTime(strIds, setOfids, "1000");
}
private static void testTime(String strIds, Set<String> setOfids, String string) {
long startTime = System.nanoTime();
strIds.indexOf(string);
long endTime = System.nanoTime();
System.out.println("String search time for (" + string + ") is " + (endTime - startTime));
startTime = System.nanoTime();
setOfids.contains(string);
endTime = System.nanoTime();
System.out.println("HashSet search time for (" + string + ") is " + (endTime - startTime));
}
}
The output will be (approx.):
String search time for (1) is 3000
HashSet search time for (1) is 7000
String search time for (100) is 6000
HashSet search time for (100) is 2000
String search time for (500) is 33000
HashSet search time for (500) is 2000
String search time for (1000) is 71000
HashSet search time for (1000) is 1000

Besides performance, you shouldn't use a String like that. Although it is creative, it is not made for indexing like that. What would happen if you want to change the format of the ids?
To improve the performance and reduce the memory use of the HashSet, you could of course use
HashSet<Integer> instead of HashSet<String>

I assume HashSet is the better option to go with.
There are two advantages:
It doesn't allow duplicates
HashSet is internally backed by a HashMap, hence retrieval is fast.

This will work faster, and it matches only whole IDs:
String strIds=",1,2,3,4,5,6,7,8,.,.,.,.,.,.,.,1000,";
String searchStr = "9";
boolean searchFound = strIds.contains(","+searchStr +",");
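A minimal sketch of why the delimiters matter, next to the set-based lookup. A bare indexOf finds 100 inside 1000, while wrapping the search term in commas does not. The class and method names here are made up for illustration, and the ID string is assumed to start and end with a comma, as in the question.

```java
import java.util.HashSet;
import java.util.Set;

class IdSearchSketch {
    // Naive search: matches any substring, so "100" is "found" inside "1000".
    static boolean containsNaive(String ids, String id) {
        return ids.indexOf(id) >= 0;
    }

    // Delimited search: assumes ids looks like ",1,2,3," (comma at both ends).
    static boolean containsDelimited(String ids, String id) {
        return ids.contains("," + id + ",");
    }

    // Builds a set from the same comma-separated string, skipping empty tokens.
    static Set<String> toSet(String ids) {
        Set<String> set = new HashSet<>();
        for (String token : ids.split(",")) {
            if (!token.isEmpty()) {
                set.add(token);
            }
        }
        return set;
    }
}
```

With ids = ",1,2,1000,", containsNaive(ids, "100") is true (a false positive), while containsDelimited(ids, "100") and toSet(ids).contains("100") are both false. The delimited search is still a linear scan of the whole string.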

Related

What is the best way to find common elements from 2 sets?

Recently I had an interview and I was asked one question.
I have 2 sets with around 1 Million records each.
I have to find the common element in 2 sets.
My response:
I will create a new empty Set. I gave him the solution below, but he was not happy with it. He said that with 1 million records the solution won't be good.
public Set<Integer> commonElements(Set<Integer> s1, Set<Integer> s2) {
Set<Integer> res = new HashSet<>();
for (Integer temp : s1) {
if(s2.contains(temp)) {
res.add(temp);
}
}
return res;
}
What is the better way to solve this problem then?
First of all: in order to determine the intersection of two sets, you absolutely have to look at all entries of at least one of the two sets (to figure out whether each is in the other set). There is no magic around that would tell you that in less than O(min(size(s1), size(s2))). Period.
The next thing to tell the interviewer: "1 million entries. You must be kidding. It is 2019. Any decent piece of hardware crunches two 1-million sets in less than a second". (Of course: that only applies for objects that are cheap to compare, like here for Integer instances. If oneRecord.equals(anotherRecord) is a super expensive operation, then 1 million entries could still be a problem in 2022).
Then you briefly mention that there are various built-in ways to solve this, as well as various 3rd party libraries. But you avoid the mistake that the other two answers make: pointing to a library that does compute the intersect is not at all something you sell as "solution" to this question.
You see, regarding coding: the Java Set interface has an easy solution to that: s1.retainAll(s2) computes the intersection of the two sets, as it removes all elements from s1 that aren't in s2.
Obviously, you have to mention within the interview that this will modify s1.
In case the requirement is to not modify s1 or s2, your solution is a viable way to go, and there isn't anything one can do about the runtime cost. If at all, you could call size() for both sets and iterate the one that has fewer entries.
Alternatively, you can do
Set<Integer> result = new HashSet<>(s1);
result.retainAll(s2); // retainAll returns a boolean, not the set
return result;
but in the end, you have to iterate one set and for each element determine whether it is in the second set.
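As a rough sketch of the "iterate the smaller set" idea mentioned above, assuming neither input may be modified. The class and method names are illustrative only.

```java
import java.util.HashSet;
import java.util.Set;

class IntersectionSketch {
    // Returns a new set; neither argument is modified.
    static <T> Set<T> intersection(Set<T> s1, Set<T> s2) {
        // Iterate the smaller set and probe the larger one.
        Set<T> smaller = s1.size() <= s2.size() ? s1 : s2;
        Set<T> larger = (smaller == s1) ? s2 : s1;
        Set<T> result = new HashSet<>();
        for (T element : smaller) {
            if (larger.contains(element)) {
                result.add(element);
            }
        }
        return result;
    }
}
```

The number of contains() calls is then bounded by the size of the smaller set, which only matters when the two sets differ a lot in size.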
But of course, the real answer to such questions is always always always to show the interviewer that you are able to dissect the problem into its different aspects. You outline basic constraints, you outline different solutions and discuss their pros and cons. Me for example, I would expect you to sit down and maybe write a program like this:
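import java.util.HashSet;
import java.util.Random;
import java.util.Set;
import java.util.stream.Collectors;
import org.junit.Before;
import org.junit.Test;
public class Numbers {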
private final static int numberOfEntries = 20_000_000;
private final static int maxRandom = numberOfEntries;
private Set<Integer> s1;
private Set<Integer> s2;
@Before
public void setUp() throws Exception {
Random random = new Random(42);
s1 = fillWithRandomEntries(random, numberOfEntries);
s2 = fillWithRandomEntries(random, numberOfEntries);
}
private static Set<Integer> fillWithRandomEntries(Random random, int entries) {
Set<Integer> rv = new HashSet<>();
for (int i = 0; i < entries; i++) {
rv.add(random.nextInt(maxRandom));
}
return rv;
}
@Test
public void classic() {
long start = System.currentTimeMillis();
HashSet<Integer> intersection = new HashSet<>();
s1.forEach((i) -> {
if (s2.contains(i))
intersection.add(i);
});
long end = System.currentTimeMillis();
System.out.println("foreach duration: " + (end-start) + " ms");
System.out.println("intersection.size() = " + intersection.size());
}
@Test
public void retainAll() {
long start = System.currentTimeMillis();
s1.retainAll(s2);
long end = System.currentTimeMillis();
System.out.println("Retain all duration: " + (end-start) + " ms");
System.out.println("intersection.size() = " + s1.size());
}
@Test
public void streams() {
long start = System.currentTimeMillis();
Set<Integer> intersection = s1.stream().filter(i -> s2.contains(i)).collect(Collectors.toSet());
long end = System.currentTimeMillis();
System.out.println("streaming: " + (end-start) + " ms");
System.out.println("intersection.size() = " + intersection.size());
}
@Test
public void parallelStreams() {
long start = System.currentTimeMillis();
Set<Integer> intersection = s1.parallelStream().filter(i -> s2.contains(i)).collect(Collectors.toSet());
long end = System.currentTimeMillis();
System.out.println("parallel streaming: " + (end-start) + " ms");
System.out.println("intersection.size() = " + intersection.size());
}
}
The first observation here: I decided to run with 20 million entries. I started with 2 million, but all three tests would run well below 500 ms. Here is the print out for 20 million on my Mac Book Pro:
foreach duration: 9304 ms
intersection.size() = 7990888
streaming: 9356 ms
intersection.size() = 7990888
Retain all duration: 685 ms
intersection.size() = 7990888
parallel streaming: 6998 ms
intersection.size() = 7990888
As expected: all intersects have the same size (because I seeded the random number generator to get to comparable results).
And surprise: modifying s1 in place ... is by far the cheapest option. It beats streaming by a factor of 10. Also note: the parallel streaming is quicker here. When running with 1 million entries, the sequential stream was faster.
That is why I initially suggested mentioning that "1 million entries is not a performance problem". That is a very important statement, as it tells the interviewer that you are not one of those people wasting hours to micro-optimize non-existing performance issues.
You can use CollectionUtils from Apache Commons Collections:
CollectionUtils.intersection(Collection a, Collection b)
The answer is:
s1.retainAll(s2);
Ref. https://www.w3resource.com/java-exercises/collection/java-collection-hash-set-exercise-11.php

Why do we convert string to charArray when we do operations on a string?

When I do algorithm exercises, I find that many people like to convert a string to a char array before operating on it.
I don't understand why we bother to do that. I mean, I can use string.charAt(), so why use string.toCharArray() and then charArray[i]? It's the same, and the char array even uses O(n) extra memory.
Can anyone explain that to me?
There are several reasons why people prefer char[] over String and StringBuffer:
String is immutable. This means that if you want to manipulate a String without using any utility class, you'll wind up copying the String pretty often, which results in extremely inefficient code.
Accessing characters in a char[] is way faster than using charAt (though it takes some time to convert a String to a char[], which should be considered as well when optimizing):
class Test{
public static void main(String[] args){
String s = "a";
char[] c = new char[]{'a'};
StringBuffer buffer = new StringBuffer("a");
char x;
long time = System.nanoTime();
for(int i = 0 ; i < 1000 ; i++)
x = s.charAt(0);
time = System.nanoTime() - time;
System.out.println("string: " + time);
time = System.nanoTime();
for(int i = 0 ; i < 1000 ; i++)
x = c[0];
time = System.nanoTime() - time;
System.out.println("[]: " + time);
time = System.nanoTime();
for(int i = 0 ; i < 1000 ; i++)
x = buffer.charAt(0);
time = System.nanoTime() - time;
System.out.println("buffer: " + time);
}
}
Running this simple class results in the following output (on my machine, using javaSE 1.8.0 build b132):
string: 37895
[]: 18948
buffer: 85659
So obviously access via char[] is way faster than using a String or StringBuffer.
Using an object to manipulate single characters results in code stuffed with charAt() and setCharAt(), which might be considered ugly.
Security: String is rather insecure, if the code handles sensitive data, since the immutable String will be stored in memory. This means that the String containing sensitive data will be accessible until the GC removes it from the memory. char[] on the other hand can be simply overwritten at any time and thus remove the sensitive data from memory.
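Two of the points above can be sketched briefly: in-place manipulation without creating intermediate String objects, and overwriting sensitive data once it is no longer needed. The class and method names are illustrative only.

```java
import java.util.Arrays;

class CharArraySketch {
    // Reverses the characters in place; no intermediate String objects are created.
    static void reverseInPlace(char[] chars) {
        for (int i = 0, j = chars.length - 1; i < j; i++, j--) {
            char tmp = chars[i];
            chars[i] = chars[j];
            chars[j] = tmp;
        }
    }

    // Overwrites the array contents so the sensitive data no longer sits in this array.
    static void wipe(char[] secret) {
        Arrays.fill(secret, '\0');
    }
}
```

Doing the same reversal through charAt() and string concatenation would create a new String on every step. Wiping only clears this particular array; any String the data was copied from or into is unaffected.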

Parse and Extract unique value from a text file efficiently

I have two tsv files to parse and extract values from. Each line may have 4-5 attributes. The content of both files is as below:
1 44539 C T 19.44
1 44994 A G 4.62
1 45112 TATGG 0.92
2 43635 Z Q 0.87
3 5672 AAS 0.67
There are some records in each file that have the same first 3 or 4 attributes but a different value. I want to retain the higher value of such records and prepare a new file with all unique records. For example:
1 44539 C T 19.44
1 44539 C T 25.44
I need to retain the one with the higher value, in the above case the record with value 25.44.
I have drafted code for this; however, after a few minutes the program runs slowly. I am reading each record from a file, forming a key-value pair with the first 3 or 4 attributes as the key and the last attribute as the value, storing it in a HashMap, and then using that to write to a file. Is there a better solution?
Also, how can I test whether my code is giving me the correct output in the file?
One file is of size 498 MB with 23822225 records and other is of 515 MB with 24500367 records.
I get Exception in thread "main" java.lang.OutOfMemoryError: Java heap space error for the file with size 515 MB.
Is there a better way to code this so the program executes efficiently without increasing the heap size?
I might have to deal with larger files in the future; what would be the trick to solve such problems?
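import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
import java.math.BigDecimal;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Scanner;
public class UniqueExtractor {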
private int counter = 0;
public static void main(String... aArgs) throws IOException {
UniqueExtractor parser = new UniqueExtractor("/Users/xxx/Documents/xyz.txt");
long startTime = System.currentTimeMillis();
parser.processLineByLine();
parser.writeToFile();
long endTime = System.currentTimeMillis();
long total_time = endTime - startTime;
System.out.println("done in " + total_time/1000 + "seconds ");
}
public void writeToFile()
{
System.out.println("writing to a file");
try {
PrintWriter writer = new PrintWriter("/Users/xxx/Documents/xyz_unique.txt", "UTF-8");
Iterator it = map.entrySet().iterator();
StringBuilder sb = new StringBuilder();
while (it.hasNext()) {
sb.setLength(0);
Map.Entry pair = (Map.Entry)it.next();
sb.append(pair.getKey());
sb.append(pair.getValue());
writer.println(sb.toString());
writer.flush();
it.remove();
}
}
catch(Exception e)
{
e.printStackTrace();
}
}
public UniqueExtractor(String fileName)
{
fFilePath = fileName;
}
private HashMap<String, BigDecimal> map = new HashMap<String, BigDecimal>();
public final void processLineByLine() throws IOException {
try (Scanner scanner = new Scanner(new File(fFilePath))) {
while (scanner.hasNextLine())
{
//System.out.println("ha");
System.out.println(++counter);
processLine(scanner.nextLine());
}
}
}
protected void processLine(String aLine)
{
StringBuilder sb = new StringBuilder();
String[] split = aLine.split(" ");
BigDecimal bd = null;
BigDecimal bd1= null;
for (int i=0; i < split.length-1; i++)
{
//System.out.println(i);
//System.out.println();
sb.append(split[i]);
sb.append(" ");
}
bd= new BigDecimal((split[split.length-1]));
//System.out.print("key is" + sb.toString());
//System.out.println("value is "+ bd);
if (map.containsKey(sb.toString()))
{
bd1 = map.get(sb.toString());
int res = bd1.compareTo(bd);
if (res == -1)
{
System.out.println("replacing ...."+ sb.toString() + bd1 + " with " + bd);
map.put(sb.toString(), bd);
}
}
else
{
map.put(sb.toString(), bd);
}
sb.setLength(0);
}
private String fFilePath;
}
There are a couple of main things you may want to consider to improve the performance of this program.
Avoid BigDecimal
While BigDecimal is very useful, it has a lot of overhead, both in speed and space requirements. According to your examples, you don't have very much precision to worry about, so I would recommend switching to plain floats or doubles. These would take a mere fraction of the space (so you could process larger files) and would probably be faster to work with.
Avoid StringBuilder
This is not a general rule, but applies in this case: you appear to be parsing and then rebuilding aLine in processLine. This is very expensive, and probably unnecessary. You could, instead, use aLine.lastIndexOf('\t') and aLine.substring to cut up the String with much less overhead.
These two should significantly improve the performance of your code, but don't address the overall algorithm.
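A minimal sketch of both suggestions combined, assuming the fields are separated by tabs and the last field always parses as a double. The class name and the choice to skip lines without a tab are assumptions for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class MaxPerKeySketch {
    private final Map<String, Double> best = new HashMap<>();

    // Splits at the last tab: everything before it is the key, the rest is the value.
    void processLine(String line) {
        int cut = line.lastIndexOf('\t');
        if (cut < 0) {
            return; // assumption: lines without a tab are skipped
        }
        String key = line.substring(0, cut);
        double value = Double.parseDouble(line.substring(cut + 1));
        // Keep the larger of the stored value and the new one.
        best.merge(key, value, Math::max);
    }

    Map<String, Double> result() {
        return best;
    }
}
```

Compared with the original processLine, this avoids the split/rebuild of the key, the BigDecimal allocation, and the separate containsKey/get/put calls. Note that the code in the question splits on a space, so the delimiter would need to match the actual file.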
Dataset splitting
You're trying to handle enough data that you might want to consider not keeping all of it in memory at once.
For example, you could split up your data set into multiple files based on the first field, run your program on each of the files, and then rejoin the files into one. You can do this with more than one field if you need more splitting. This requires less memory usage because the splitting program does not have to keep more than a single line in memory at once, and the latter programs only need to keep a chunk of the original data in memory at once, not the entire thing.
You may want to try the specific optimizations outlined above, and then see if you need more efficiency, in which case try to do dataset splitting.
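The splitting step could look roughly like the sketch below. It assumes tab-separated lines, a first field that is safe to use in a file name, and few enough distinct first-field values that one open writer per value is acceptable. The part_<field>.tsv naming scheme and the class name are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

class SplitByFirstFieldSketch {
    // Streams the input once and writes each line to a file named after its first field.
    // Only one line is held in memory at a time (plus one open writer per distinct first field).
    static void split(Path input, Path outputDir) throws IOException {
        Map<String, BufferedWriter> writers = new HashMap<>();
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                int cut = line.indexOf('\t');
                String first = cut < 0 ? line : line.substring(0, cut);
                BufferedWriter writer = writers.get(first);
                if (writer == null) {
                    // Hypothetical naming scheme: part_<first field>.tsv
                    writer = Files.newBufferedWriter(
                            outputDir.resolve("part_" + first + ".tsv"), StandardCharsets.UTF_8);
                    writers.put(first, writer);
                }
                writer.write(line);
                writer.newLine();
            }
        } finally {
            // Close every partition file, whether or not reading succeeded.
            for (BufferedWriter writer : writers.values()) {
                writer.close();
            }
        }
    }
}
```

Each part file can then be de-duplicated on its own with a much smaller map, and the results concatenated. Records with the same key always land in the same part file because the first field is part of the key.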

java regular expressions: performance and alternative

Recently I have had to search a number of string values to see which ones match a certain pattern. Neither the number of string values nor the pattern itself is known until a search term has been entered by the user. The problem is that I have noticed that each time my application runs the following line:
if (stringValue.matches(regexPattern))
{
// do something so simple
}
it takes about 40 microseconds. Needless to say, when the number of string values exceeds a few thousand, it'll be too slow.
The pattern is something like:
"A*B*C*D*E*F*"
where A~F are just examples here, but the pattern is something like the above. Please note that the pattern actually changes per search. For example "A*B*C*" may change to "W*D*G*A*".
I wonder if there is a better substitution for the above pattern or, more generally, an alternative for java regular expressions.
Regular expressions in Java are compiled into an internal data structure. This compilation is the time-consuming process. Each time you invoke the method String.matches(String regex), the specified regular expression is compiled again.
So you should compile your regular expression only once and reuse it:
Pattern pattern = Pattern.compile(regexPattern);
for(String value : values) {
Matcher matcher = pattern.matcher(value);
if (matcher.matches()) {
// your code here
}
}
Consider the following (quick and dirty) test:
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Test3 {
// time that tick() was called
static long tickTime;
// called at start of operation, for timing
static void tick () {
tickTime = System.nanoTime();
}
// called at end of operation, prints message and time since tick().
static void tock (String action) {
long mstime = (System.nanoTime() - tickTime) / 1000000;
System.out.println(action + ": " + mstime + "ms");
}
// generate random strings of form AAAABBBCCCCC; a random
// number of characters each randomly repeated.
static List<String> generateData (int itemCount) {
Random random = new Random();
List<String> items = new ArrayList<String>();
long mean = 0;
for (int n = 0; n < itemCount; ++ n) {
StringBuilder s = new StringBuilder();
int characters = random.nextInt(7) + 1;
for (int k = 0; k < characters; ++ k) {
char c = (char)(random.nextInt('Z' - 'A') + 'A');
int rep = random.nextInt(95) + 5;
for (int j = 0; j < rep; ++ j)
s.append(c);
mean += rep;
}
items.add(s.toString());
}
mean /= itemCount;
System.out.println("generated data, average length: " + mean);
return items;
}
// match all strings in items to regexStr, do not precompile.
static void regexTestUncompiled (List<String> items, String regexStr) {
tick();
int matched = 0, unmatched = 0;
for (String item:items) {
if (item.matches(regexStr))
++ matched;
else
++ unmatched;
}
tock("uncompiled: regex=" + regexStr + " matched=" + matched +
" unmatched=" + unmatched);
}
// match all strings in items to regexStr, precompile.
static void regexTestCompiled (List<String> items, String regexStr) {
tick();
Matcher matcher = Pattern.compile(regexStr).matcher("");
int matched = 0, unmatched = 0;
for (String item:items) {
if (matcher.reset(item).matches())
++ matched;
else
++ unmatched;
}
tock("compiled: regex=" + regexStr + " matched=" + matched +
" unmatched=" + unmatched);
}
// test all strings in items against regexStr.
static void regexTest (List<String> items, String regexStr) {
regexTestUncompiled(items, regexStr);
regexTestCompiled(items, regexStr);
}
// generate data and run some basic tests
public static void main (String[] args) {
List<String> items = generateData(1000000);
regexTest(items, "A*");
regexTest(items, "A*B*C*");
regexTest(items, "E*C*W*F*");
}
}
Strings are random sequences of 1-7 characters with each character occurring 5-99 consecutive times (e.g. "AAAAAAGGGGGDDFFFFFF"). I guessed based on your expressions.
Granted this might not be representative of your data set, but the timing estimates for applying those regular expressions to 1 million randomly generated strings of average length 208 each on my modest 2.3 GHz dual-core i5 were:
Regex Uncompiled Precompiled
A* 0.564 sec 0.126 sec
A*B*C* 1.768 sec 0.238 sec
E*C*W*F* 0.795 sec 0.275 sec
Actual output:
generated data, average length: 208
uncompiled: regex=A* matched=6004 unmatched=993996: 564ms
compiled: regex=A* matched=6004 unmatched=993996: 126ms
uncompiled: regex=A*B*C* matched=18677 unmatched=981323: 1768ms
compiled: regex=A*B*C* matched=18677 unmatched=981323: 238ms
uncompiled: regex=E*C*W*F* matched=25495 unmatched=974505: 795ms
compiled: regex=E*C*W*F* matched=25495 unmatched=974505: 275ms
Even without the speedup of precompiled expressions, and even considering that the results vary wildly depending on the data set and regular expression (and even considering that I broke a basic rule of proper Java performance tests and forgot to prime HotSpot first), this is very fast, and I still wonder if the bottleneck is truly where you think it is.
After switching to precompiled expressions, if you still are not meeting your actual performance requirements, do some profiling. If you find your bottleneck is still in your search, consider implementing a more optimized search algorithm.
For example, assuming your data set is like my test set above: If your data set is known ahead of time, reduce each item in it to a smaller string key by removing repetitive characters, e.g. for "AAAAAAABBBBCCCCCCC", store it in a map of some sort keyed by "ABC". When a user searches for "ABC*" (presuming your regexes are in that particular form), look for "ABC" items. Or whatever. It highly depends on your scenario.
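A rough sketch of that key-reduction idea, assuming the items look like the generated test data (runs of repeated characters). The class and method names are illustrative only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RunKeyIndexSketch {
    // Collapses each run of repeated characters: "AAAABBBCC" -> "ABC".
    static String collapseRuns(String s) {
        StringBuilder key = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (i == 0 || c != s.charAt(i - 1)) {
                key.append(c);
            }
        }
        return key.toString();
    }

    // Groups the items by their collapsed key so a search becomes a map lookup.
    static Map<String, List<String>> buildIndex(List<String> items) {
        Map<String, List<String>> index = new HashMap<>();
        for (String item : items) {
            index.computeIfAbsent(collapseRuns(item), k -> new ArrayList<>()).add(item);
        }
        return index;
    }
}
```

One caveat: a pattern such as A*B*C* also matches items whose collapsed key is a subsequence of "ABC" ("", "A", "AB", "AC", "BC", and so on), because each star allows zero repetitions. A lookup that should agree with the regex would therefore have to probe those keys as well, not just "ABC".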

Should I use Java's String.format() if performance is important?

We have to build Strings all the time for log output and so on. Over the JDK versions we have learned when to use StringBuffer (many appends, thread safe) and StringBuilder (many appends, non-thread-safe).
What's the advice on using String.format()? Is it efficient, or are we forced to stick with concatenation for one-liners where performance is important?
e.g. ugly old style,
String s = "What do you get if you multiply " + varSix + " by " + varNine + "?";
vs. tidy new style (String.format, which is possibly slower),
String s = String.format("What do you get if you multiply %d by %d?", varSix, varNine);
Note: my specific use case is the hundreds of 'one-liner' log strings throughout my code. They don't involve a loop, so StringBuilder is too heavyweight. I'm interested in String.format() specifically.
I took hhafez's code and added a memory test:
private static void test() {
Runtime runtime = Runtime.getRuntime();
long memory;
...
memory = runtime.freeMemory();
// for loop code
memory = memory-runtime.freeMemory();
I ran this separately for each approach, the '+' operator, String.format and StringBuilder (calling toString()), so the memory used will not be affected by other approaches.
I added more concatenations, making the string as "Blah" + i + "Blah"+ i +"Blah" + i + "Blah".
The results are as follows (average of 5 runs each):
Approach        Time (ms)   Memory allocated (long)
+ operator      747         320,504
String.format   16484       373,312
StringBuilder   769         57,344
We can see that String + and StringBuilder are practically identical time-wise, but StringBuilder is much more efficient in memory use.
This is very important when we have many log calls (or any other statements involving strings) in a time interval short enough that the Garbage Collector won't get to clean up the many string instances resulting from the + operator.
And a note, BTW, don't forget to check the logging level before constructing the message.
Conclusions:
I'll keep on using StringBuilder.
I have too much time or too little life.
I wrote a small class to test which of the two performs better, and + comes out ahead of format by a factor of 5 to 6.
Try it yourself:
import java.io.*;
import java.util.Date;
public class StringTest{
public static void main( String[] args ){
int i = 0;
long prev_time = System.currentTimeMillis();
long time;
for( i = 0; i< 100000; i++){
String s = "Blah" + i + "Blah";
}
time = System.currentTimeMillis() - prev_time;
System.out.println("Time after for loop " + time);
prev_time = System.currentTimeMillis();
for( i = 0; i<100000; i++){
String s = String.format("Blah %d Blah", i);
}
time = System.currentTimeMillis() - prev_time;
System.out.println("Time after for loop " + time);
}
}
Running the above for different N shows that both behave linearly, but String.format is 5-30 times slower.
The reason is that in the current implementation String.format first parses the input with regular expressions and then fills in the parameters. Concatenation with plus, on the other hand, gets optimized by javac (not by the JIT) and uses StringBuilder.append directly.
All the benchmarks presented here have some flaws, thus results are not reliable.
I was surprised that nobody used JMH for benchmarking, so I did.
Results:
Benchmark Mode Cnt Score Error Units
MyBenchmark.testOld thrpt 20 9645.834 ± 238.165 ops/s // using +
MyBenchmark.testNew thrpt 20 429.898 ± 10.551 ops/s // using String.format
Units are operations per second, the more the better. Benchmark source code. OpenJDK IcedTea 2.5.4 Java Virtual Machine was used.
So, old style (using +) is much faster.
Your old ugly style is automatically compiled by JAVAC 1.6 as :
StringBuilder sb = new StringBuilder("What do you get if you multiply ");
sb.append(varSix);
sb.append(" by ");
sb.append(varNine);
sb.append("?");
String s = sb.toString();
So there is absolutely no difference between this and using a StringBuilder.
String.format is a lot more heavyweight since it creates a new Formatter, parses your input format string, creates a StringBuilder, appends everything to it and calls toString().
Java's String.format works like so:
it parses the format string, exploding into a list of format chunks
it iterates the format chunks, rendering into a StringBuilder, which is basically an array that resizes itself as necessary, by copying into a new array. this is necessary because we don't yet know how large to allocate the final String
StringBuilder.toString() copies its internal buffer into a new String
if the final destination for this data is a stream (e.g. rendering a webpage or writing to a file), you can assemble the format chunks directly into your stream:
new PrintStream(outputStream, autoFlush, encoding).format("hello %s", "world");
I speculate that the optimizer will optimize away the format string processing. If so, you're left with equivalent amortized performance to manually unrolling your String.format into a StringBuilder.
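A small self-contained sketch of formatting straight into a stream. A ByteArrayOutputStream stands in for the real destination (a file or a response body), and the class and method names are made up for illustration. Note that PrintStream.format takes java.util.Formatter placeholders such as %s, not MessageFormat-style {0}.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class StreamFormatSketch {
    // Formats straight into the stream instead of building an intermediate String first.
    static String render(String name) throws UnsupportedEncodingException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (PrintStream stream = new PrintStream(out, true, "UTF-8")) {
            // Formatter syntax: %s is replaced by the argument.
            stream.format("hello %s", name);
        }
        // Only for demonstration: read back what was written to the stream.
        return out.toString("UTF-8");
    }
}
```

Whether this is measurably cheaper than String.format followed by a write depends on the stream and the JVM; the format string still has to be parsed on every call.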
To expand/correct on the first answer above, it's not translation that String.format would help with, actually.
What String.format will help with is when you're printing a date/time (or a numeric format, etc), where there are localization(l10n) differences (ie, some countries will print 04Feb2009 and others will print Feb042009).
With translation, you're just talking about moving any externalizable strings (like error messages and what-not) into a property bundle so that you can use the right bundle for the right language, using ResourceBundle and MessageFormat.
Looking at all the above, I'd say that performance-wise, String.format vs. plain concatenation comes down to what you prefer. If you prefer looking at calls to .format over concatenation, then by all means, go with that.
After all, code is read a lot more than it's written.
In your example, performance probably isn't too different, but there are other issues to consider: namely memory fragmentation. Every concatenation creates a new string, even if it's temporary (it takes time to GC it and it's more work). String.format() is just more readable and it involves less fragmentation.
Also, if you're using a particular format a lot, don't forget you can use the Formatter() class directly (all String.format() does is instantiate a one use Formatter instance).
Also, something else you should be aware of: be careful of using substring(). For example:
String getSmallString() {
String largeString = // load from file; say 2M in size
return largeString.substring(100, 300);
}
That large string is still in memory on JDKs before 7u6, where substring() shared the parent string's backing array (newer JDKs copy the characters). A better version is:
return new String(largeString.substring(100, 300));
or
return String.format("%s", largeString.substring(100, 300));
The second form is probably more useful if you're doing other stuff at the same time.
Generally you should use String.format because it's relatively fast and it supports globalization (assuming you're actually trying to write something that is read by the user). It also makes it easier to globalize if you're trying to translate one string versus 3 or more per statement (especially for languages that have drastically different grammatical structures).
Now if you never plan on translating anything, then either rely on Java's built-in conversion of + operators into StringBuilder, or use Java's StringBuilder explicitly.
Another perspective, from a logging point of view only.
I see a lot of discussion related to logging in this thread, so I thought of adding my experience as an answer. Maybe someone will find it useful.
I guess the motivation for logging with a formatter comes from avoiding string concatenation. Basically, you do not want the overhead of string concatenation if you are not going to log the message.
You do not really need to concat/format unless you want to log. Let's say I define a method like this (the varargs parameter has to come last):
public void logDebug(Throwable t, String... args) {
if(debugOn) {
// call concat methods for all args
//log the final debug message
}
}
In this approach the concat/formatter is not called at all if it's a debug message and debugOn = false.
Though it will still be better to use StringBuilder instead of a formatter here. The main motivation is to avoid any of that.
At the same time, I do not like adding an "if" block for each logging statement, since:
It affects readability.
It reduces coverage in my unit tests, which is confusing when you want to make sure every line is tested.
Therefore I prefer to create a logging utility class with methods like above and use it everywhere without worrying about performance hit and any other issues related to it.
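One possible shape for such a utility, sketched with a Supplier<String> rather than the varargs signature above, so that even the message arguments are not evaluated when debug logging is off. The class name and the StringBuilder used as a stand-in log target are assumptions for illustration, not part of any logging library.

```java
import java.util.function.Supplier;

class DebugLogSketch {
    private final boolean debugOn;
    private final StringBuilder sink = new StringBuilder(); // stand-in for a real log target

    DebugLogSketch(boolean debugOn) {
        this.debugOn = debugOn;
    }

    // The message is only built if debug logging is enabled.
    void logDebug(Supplier<String> message) {
        if (debugOn) {
            sink.append(message.get()).append('\n');
        }
    }

    String output() {
        return sink.toString();
    }
}
```

A call site then reads log.logDebug(() -> "value is " + expensiveCall()); and the lambda body runs only when debugOn is true, so no "if" block is needed around each statement.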
I just modified hhafez's test to include StringBuilder. StringBuilder is 33 times faster than String.format using jdk 1.6.0_10 client on XP. Using the -server switch lowers the factor to 20.
public class StringTest {
public static void main( String[] args ) {
test();
test();
}
private static void test() {
int i = 0;
long prev_time = System.currentTimeMillis();
long time;
for ( i = 0; i < 1000000; i++ ) {
String s = "Blah" + i + "Blah";
}
time = System.currentTimeMillis() - prev_time;
System.out.println("Time after for loop " + time);
prev_time = System.currentTimeMillis();
for ( i = 0; i < 1000000; i++ ) {
String s = String.format("Blah %d Blah", i);
}
time = System.currentTimeMillis() - prev_time;
System.out.println("Time after for loop " + time);
prev_time = System.currentTimeMillis();
for ( i = 0; i < 1000000; i++ ) {
new StringBuilder("Blah").append(i).append("Blah");
}
time = System.currentTimeMillis() - prev_time;
System.out.println("Time after for loop " + time);
}
}
While this might sound drastic, I consider it to be relevant only in rare cases, because the absolute numbers are pretty low: 4 s for 1 million simple String.format calls is sort of ok - as long as I use them for logging or the like.
Update: As pointed out by sjbotha in the comments, the StringBuilder test is invalid, since it is missing a final .toString().
The correct speed-up factor from String.format(.) to StringBuilder is 23 on my machine (16 with the -server switch).
Here is modified version of hhafez entry. It includes a string builder option.
public class BLA
{
public static final String BLAH = "Blah ";
public static final String BLAH2 = " Blah";
public static final String BLAH3 = "Blah %d Blah";
public static void main(String[] args) {
int i = 0;
long prev_time = System.currentTimeMillis();
long time;
int numLoops = 1000000;
for( i = 0; i< numLoops; i++){
String s = BLAH + i + BLAH2;
}
time = System.currentTimeMillis() - prev_time;
System.out.println("Time after for loop " + time);
prev_time = System.currentTimeMillis();
for( i = 0; i<numLoops; i++){
String s = String.format(BLAH3, i);
}
time = System.currentTimeMillis() - prev_time;
System.out.println("Time after for loop " + time);
prev_time = System.currentTimeMillis();
for( i = 0; i<numLoops; i++){
StringBuilder sb = new StringBuilder();
sb.append(BLAH);
sb.append(i);
sb.append(BLAH2);
String s = sb.toString();
}
time = System.currentTimeMillis() - prev_time;
System.out.println("Time after for loop " + time);
}
}
Time after for loop 391
Time after for loop 4163
Time after for loop 227
The answer to this depends very much on how your specific Java compiler optimizes the bytecode it generates. Strings are immutable and, theoretically, each "+" operation can create a new one. But, your compiler almost certainly optimizes away interim steps in building long strings. It's entirely possible that both lines of code above generate the exact same bytecode.
The only real way to know is to test the code iteratively in your current environment. Write a quick-and-dirty app that concatenates strings both ways iteratively and see how they time out against each other.
Consider using "hello".concat( "world!" ) for a small number of strings in a concatenation. It could be even better for performance than other approaches.
If you have more than 3 strings, then consider using StringBuilder, or just String, depending on the compiler that you use.
