String .contains VS Set<String> .contains VS Regex String.matches()


I have two sets of strings, neither very long (200~500 words), in two files that look like this:
File1 File2
this window
that good
word work
java fine
book home
All words are unique.
First I read the strings from each file (line by line) and store them in either:
Set<String> set1 and Set<String> set2, which may look like this: [this, that, word, java, book] and [window, good, work, fine, home]
Or
String str1 and String str2, which may look like this: str1: thisthatwordjava and str2: windowgoodworkfinehome, or comma-separated, e.g. str1: this,that,word,java.
Now there are three ways to check which Set or String contains the word home:
Use set1/2.contains("home")
Use str1/2.contains("home")
Use str1/2.matches("home")
All of the above will work fine, but which one is the BEST?
Note: I ask because these checks will run very frequently.

Don't Make Performance Assumptions
What makes you think that String.contains will have "better performance"?
It won't, except for very simple cases, that is if:
your list of strings is short,
the strings to compare are short,
you want to do a one-time lookup.
For all other cases, the Set approach will scale and work better. Sure you'll have a memory overhead for the Set as opposed to a single string, but the O(1) lookups will remain constant even if you want to store millions of strings and compare long strings.
The Right Data-Structure and Algorithm for the Right Job
Use the safer and more robust design, especially as the solution is not difficult to implement here. And since you mention that you will check frequently, a Set approach is definitely better for you.
Also, String.contains is unsafe here: if your list holds strings that are substrings of one another, lookups can return false positives. As kennytm said in a comment, with the "java" string from your example in the list, looking up "ava" will match it, which you apparently don't want.
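To make the difference concrete, here is a minimal sketch using the sample words from the question. The class and method names (LookupDemo, inSet, inString, matchesWhole) are made up for illustration. It also shows that String.matches() only succeeds when the regex matches the entire string, so it does not behave like a word lookup on a joined string.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class LookupDemo {
    // Sample words taken from File1 in the question.
    static final Set<String> WORDS =
            new HashSet<>(Arrays.asList("this", "that", "word", "java", "book"));

    // The same words joined with commas, as in the question's String variant.
    static final String JOINED = "this,that,word,java,book";

    static boolean inSet(String word) {
        return WORDS.contains(word);   // exact, whole-word lookup
    }

    static boolean inString(String word) {
        return JOINED.contains(word);  // substring search, so partial words hit too
    }

    static boolean matchesWhole(String word) {
        return JOINED.matches(word);   // the regex must match the ENTIRE string
    }

    public static void main(String[] args) {
        System.out.println(inSet("java"));        // true
        System.out.println(inSet("ava"));         // false
        System.out.println(inString("ava"));      // true (a false positive)
        System.out.println(matchesWhole("java")); // false
    }
}
```

Only the Set gives whole-word semantics without extra work; the String variants would need delimiters or a carefully written pattern to avoid partial matches.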
Pick the Right Set
You may want something other than a plain HashSet, or at least to tweak its settings (initial capacity, load factor). For instance, you could consider a Guava ImmutableSet if your set will be created only once but checked very often.
Examples
Here's what I'd do, assuming you want an immutable set (as you say you read the list of strings from a file). This is off-hand and without verification so forgive the lack of ceremonies.
Using Java 8 + Guava
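import com.google.common.base.Charsets;
import com.google.common.base.Splitter;
import com.google.common.collect.ImmutableSet;
import com.google.common.io.Files;
import java.io.File;
import java.util.Set;

// Files.asCharSource(...).read() throws IOException; handle or declare it.
final Set<String> lookupTable = ImmutableSet.copyOf(
    Splitter.on(',')
        .trimResults()
        .omitEmptyStrings()
        .split(Files.asCharSource(new File("YOUR_FILE_PATH"), Charsets.UTF_8).read())
);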
Season to taste with correct path, correct charset, and with or without trimming if you want to allow spaces and an empty string.
Using Only Java 8
If you don't want to use Guava and only vanilla Java, then simply do something like this in Java 8 (again, apologies, untested):
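// Needs java.nio.file.Files, java.nio.file.Paths, java.util.Arrays, java.util.stream.Collectors.
// In real code, wrap Files.lines(...) in try-with-resources so the file gets closed.
final Set<String> lookupTable =
    Files.lines(Paths.get("YOUR_FILE_PATH"))
        .flatMap(line -> Arrays.stream(line.split(",+")))  // flatten each line's words into one stream
        .collect(Collectors.toSet());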
Using Java < 8
If you have Java < 8, read the file with the usual FileInputStream (wrapped in a reader), use String.split() or StringTokenizer to extract the entries, and finally add them to a Set.
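A rough sketch of that pre-Java 8 approach follows. The class name LegacyLookupTable and the method load are hypothetical, and the input is taken as a Reader so that a StringReader can stand in for the file; with a real file you would pass something like new InputStreamReader(new FileInputStream(path), "UTF-8"). It avoids lambdas, streams and try-with-resources on purpose.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

class LegacyLookupTable {
    // Reads words separated by commas and/or line breaks into an unmodifiable set.
    static Set<String> load(Reader source) throws IOException {
        Set<String> words = new HashSet<String>();
        BufferedReader reader = new BufferedReader(source);
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split(",");
                for (int i = 0; i < parts.length; i++) {
                    String word = parts[i].trim();
                    if (word.length() > 0) {  // skip empty entries
                        words.add(word);
                    }
                }
            }
        } finally {
            reader.close();  // always release the underlying stream
        }
        return Collections.unmodifiableSet(words);
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for the real file here.
        Set<String> table = load(new StringReader("this, that,,word\njava\nbook"));
        System.out.println(table.contains("java")); // true
        System.out.println(table.size());           // 5
    }
}
```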

I guess you read the line(s) of the file into a String anyway, so splitting it and storing the substrings in a set isn't more optimal if you plan only one query.

A Set should take more memory but less execution time, provided the words are stored without commas (which a simple split takes care of).
But what I really think is the best way is experimental proof, e.g. timing with System.currentTimeMillis().

If you want to know about performance differences, simply measure them. Here is a test setup for you.
final int WORDS = 10000;
final int SEARCHES = 1000000;
Set<String> strSet = new TreeSet<String>();
String strStr = "";
int[] searches = new int[SEARCHES];
Random randomGenerator = new Random();
// filling set and string
for (int i = 0; i < WORDS; i++) {
    strSet.add(String.valueOf(i));
    strStr += "," + String.valueOf(i);
}
// creating searches
for (int i = 0; i < SEARCHES; i++)
    searches[i] = randomGenerator.nextInt(WORDS);
// measure set
long startTime = System.currentTimeMillis();
for (int i = 0; i < SEARCHES; i++)
    strSet.contains(String.valueOf(searches[i]));
System.out.println("set result " + (System.currentTimeMillis() - startTime));
// measure string
startTime = System.currentTimeMillis();
for (int i = 0; i < SEARCHES; i++)
    strStr.contains(String.valueOf(searches[i]));
System.out.println("string result " + (System.currentTimeMillis() - startTime));
For me the output is meaningful proof that you should stay with a Set:
set result 350
string result 14197

Related

Fixed Array?/ StringBuilder?/ String? which is best way to create a string, if 84 strings to be appended

I know in advance that 84 strings will be joined with a comma separator to create one string.
Which would be better: a fixed array, Strings, or a StringBuilder?

If by "best" you mean "most memory and/or runtime efficient" then you're probably best off with a StringBuilder you pre-allocate. (Having looked at the implementation of String.join in the JDK, it uses StringJoiner, which uses a StringBuilder with the default initial capacity [16 chars] with no attempt to avoid reallocation and copying.)
You'd sum up the lengths of your 84 strings, add in the number of commas, create a StringBuilder with that length, add them all, and call toString on it. E.g.:
int length = 0;
for (int i = 0; i < strings.length; ++i) {
    length += strings[i].length();
}
length += strings.length - 1; // For the commas
StringBuilder sb = new StringBuilder(length);
sb.append(strings[0]);
for (int i = 1; i < strings.length; ++i) {
    sb.append(',');
    sb.append(strings[i]);
}
String result = sb.toString();

There are a lot of ways of doing that.
My preferred way of doing it (which may or may not be the best) would be to convert the 84 strings into a stream (with Arrays.stream() or list.stream(), depending on how the strings are actually stored) and then collect with Collectors.joining(",").
That said, if you already have an array, String.join(",", array) will do the trick as well, as noted in another answer.
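Both variants can be sketched in a few lines. The class and method names below (JoinDemo, joinArray, joinList) are only for illustration, and Java 8 or newer is assumed.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class JoinDemo {
    // For an array (or any Iterable), String.join is the shortest option.
    static String joinArray(String[] parts) {
        return String.join(",", parts);
    }

    // For a collection that is already being streamed, Collectors.joining fits naturally.
    static String joinList(List<String> parts) {
        return parts.stream().collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        String[] parts = {"a", "b", "c"};
        System.out.println(joinArray(parts));                // a,b,c
        System.out.println(joinList(Arrays.asList(parts)));  // a,b,c
    }
}
```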

You could also use StringJoiner to build the String. It works like StringBuilder, but you don't need to worry about the commas (and you can even add a prefix and a suffix if you want).
This is mainly useful when you're building the result in parts, or when you may omit some elements. Otherwise it offers no benefits vs. Collectors.joining() or String.join() (which internally uses StringJoiner anyway).
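As an illustration of the "omit some elements" case, here is a small sketch. The names JoinerDemo and joinNonEmpty are hypothetical, and the square brackets are just an example prefix and suffix.

```java
import java.util.StringJoiner;

class JoinerDemo {
    // Joins only the non-empty entries, wrapping the result in [ and ].
    static String joinNonEmpty(String[] parts) {
        StringJoiner joiner = new StringJoiner(",", "[", "]");
        for (String part : parts) {
            if (part != null && !part.isEmpty()) {
                joiner.add(part);  // the delimiter is inserted automatically
            }
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        System.out.println(joinNonEmpty(new String[] {"a", "", null, "b"})); // [a,b]
    }
}
```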

Correct coding practices with strings, variables, loops - Java (Android)

I am an Android developer and not new to Java, but I have some questions about best practices for performance. I'll give some examples from my code so you can decide.
String concatenation
url = "http://www.myserver." + domain + "/rss.php?"
+ rawType + rawCathegory + rawSubCathegory + rawLocality
+ rawRadius + rawKeyword + rawPriceFrom + rawPriceto;
As far as I know, this would create 11 String objects before my url variable is assigned, right?
I've been taught to use StringBuilder, but my question is: what's the minimum number of strings to concatenate for it to pay off? I think it wouldn't make much sense to use it to concatenate two strings, right?
Local variables
Sometimes I try to "chain" method calls like so
FilterData.getInstance(context).getFilter(position).setActivated(isActivated);
to naively avoid variable allocation, but is it any faster than this?
FilterData filterData = FilterData.getInstance(context);
Filter filter = filterData.getFilter(position);
filter.setActivated(isActivated);
I believe it should as I save myself a local variable, but it becomes unreadable if the method names are long, etc.
Loops
http://developer.android.com/training/articles/perf-tips.html says that the enhanced for loop is 3x faster than the regular for loop. That's great, and it's easier to write anyway, but what if I need the index? As far as I know, in an enhanced for loop I need to keep track of it myself, like this:
int index = 0;
for (Object obj : objects) {
    // do stuff
    index++;
}
Is this still faster than the regular loop?
for (int i = 0; i < objects.size(); i++) {
    // do stuff
}
I think the enhanced for loop may optimise the bounds check, so what if the size() call were hoisted like this:
int size = objects.size();
for (int i = 0; i < size; i++) {
    // do stuff
}
How would that stand?
Thanks. I know this might be nitpicking and not make much of a difference, but I'd rather learn such common tasks the right way.

Strings:
Unless there's a loop involved, the compiler is clever enough to do the concatenation for you in the best way.
When you're looping, use StringBuilder or StringBuffer.
Local Variables:
The two examples you give are identical. The memory still needs to be allocated even if you never give it a name.
Loops:
Depending on the type of loop, using enhanced loops can give a massive or negligible improvement, it's best to read up on the one you're using.

What ways can you create a string with 2000 "spaces"

For various reasons I am trying to set a string to 2000 spaces. Currently I am using:
String s = String.format("%1$-2000s"," ");
This is great for Java 5, however, some of the developers in our department are using 1.4 and this does not work.
I was wondering, are any other ways of achieving the same result? I know I can do things like a for loop adding a space at a time, but I am looking for something simple like the format option.
For those that may be interested in why I need this, it is because we have an XML type on a dataobject that on insert into the DB is null. It then gets updated with the XML string, usually around 2000 characters in size. In Oracle pre-reserving this space can prevent row migration, therefore, increasing performance.
Thanks!

char[] spacesArray = new char[2000];
Arrays.fill(spacesArray, ' ');
String spaces = new String(spacesArray);

the simplest answer: (scroll to see all the codes)
String s = " "; // 2000 spaces

You can use lpad(' ',2000,' ') in the insert statement directly to tell Oracle to create the value you want.
In fact, you can set the field in question to have this as the default, which could prevent you from needing to change it in multiple places (if your code is explicitly sending null as the value for the field, that will override the default).

Use a StringBuffer, append a space 2000 times in a loop, and call toString() afterwards. I don't think there are any "simpler" ways to do it that don't end up doing this anyway under the covers.
If you do this a lot, it would make a good library function.
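Such a library function might look like the sketch below. The class name Padding and the method name spaces are made up, and the argument check is an added assumption about how you would want negative counts handled. It sticks to StringBuffer so that it also compiles on Java 1.4.

```java
class Padding {
    // Returns a string consisting of 'count' spaces.
    static String spaces(int count) {
        if (count < 0) {
            throw new IllegalArgumentException("count must not be negative: " + count);
        }
        // Pre-size the buffer so it never has to grow while appending.
        StringBuffer buffer = new StringBuffer(count);
        for (int i = 0; i < count; i++) {
            buffer.append(' ');
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        System.out.println(spaces(2000).length()); // 2000
    }
}
```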

A random function I found in my personal library:
public static String whiteSpace2(int l) {
    if (l==0) return "";
    String half=whiteSpace2(l/2);
    if ((l&1)!=0) {
        return half+" "+half;
    } else {
        return half+half;
    }
}
Not claiming it is the fastest possible way to generate whitespace, but it works :-)

StringUtils.repeat(" ", 2000) (from commons-lang)
However, I'm not sure whether such micro-optimizations should be made at the cost of code that requires a 5-line comment to explain why it is needed. If you do it, be sure to add an extensive comment; otherwise imagine the reaction of those reading your code.

If nothing else works:
StringBuilder sb = new StringBuilder();
for(int i = 0; i < 2000; ++i)
sb.append(" ");
String str = new String(sb);

See this other question.
Can I multiply strings in Java to repeat sequences?
Both Apache Commons StringUtils and Google Guava libraries have commands to multiply (repeat) strings.
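For readers who are not tied to old Java versions: Java 11 added String.repeat to the standard library, so no third-party dependency is needed there. The sketch below assumes Java 11 or newer, which does not help the 1.4 users in the question; the class name RepeatDemo is made up.

```java
class RepeatDemo {
    // Requires Java 11+: String.repeat(int) returns the string concatenated 'count' times.
    static String spaces(int count) {
        return " ".repeat(count);
    }

    public static void main(String[] args) {
        System.out.println(spaces(2000).length()); // 2000
    }
}
```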

The best alternative for String flyweight implementation in Java

My application is multithreaded with intensive String processing. We are experiencing excessive memory consumption and profiling has demonstrated that this is due to String data. I think that memory consumption would benefit greatly from using some kind of flyweight pattern implementation or even cache (I know for sure that Strings are often duplicated, although I don't have any hard data in that regard).
I have looked at Java Constant Pool and String.intern, but it seems that it can provoke some PermGen problems.
What would be the best alternative for implementing application-wide, multithreaded pool of Strings in java?
EDIT: Also see my previous, related question: How does java implement flyweight pattern for string under the hood?

Note: This answer uses examples that might not be relevant in modern runtime JVM libraries. In particular, the substring example is no longer an issue in OpenJDK/Oracle 7+.
I know it goes against what people often tell you, but sometimes explicitly creating new String instances can be a significant way to reduce your memory.
Because Strings are immutable, several methods leverage that fact and share the backing character array to save memory. However, occasionally this can actually increase the memory by preventing garbage collection of unused parts of those arrays.
For example, assume you were parsing the message IDs of a log file to extract warning IDs. Your code would look something like this:
//Format:
//ID: [WARNING|ERROR|DEBUG] Message...
String testLine = "5AB729: WARNING Some really really really long message";
Matcher matcher = Pattern.compile("([A-Z0-9]*): WARNING.*").matcher(testLine);
if ( matcher.matches() ) {
String id = matcher.group(1);
//...do something with id...
}
But look at the data actually being stored:
//...
String id = matcher.group(1);
Field valueField = String.class.getDeclaredField("value");
valueField.setAccessible(true);
char[] data = ((char[])valueField.get(id));
System.out.println("Actual data stored for string \"" + id + "\": " + Arrays.toString(data) );
It's the whole test line, because the matcher just wraps a new String instance around the same character data. Compare the results when you replace String id = matcher.group(1); with String id = new String(matcher.group(1));.

This is already done at the JVM level. You only need to ensure that you aren't creating new Strings every time, either explicitly or implicitly.
I.e. don't do:
String s1 = new String("foo");
String s2 = new String("foo");
This would create two instances in the heap. Rather do so:
String s1 = "foo";
String s2 = "foo";
This will create one instance in the heap and both variables will refer to the same one (as evidence, s1 == s2 will return true here).
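The identity behaviour described above can be checked directly. The following sketch (class and method names are made up) compares references with == on purpose, and also shows String.intern(), which the question mentions, mapping a separately created String back to the pooled instance.

```java
class InternDemo {
    // Two identical literals refer to the same pooled instance.
    static boolean literalsShareInstance() {
        String s1 = "foo";
        String s2 = "foo";
        return s1 == s2;
    }

    // new String(...) always creates a distinct object.
    static boolean newStringsShareInstance() {
        String s1 = new String("foo");
        String s2 = new String("foo");
        return s1 == s2;
    }

    // intern() returns the pooled instance with the same contents.
    static boolean internedSharesWithLiteral() {
        String s = new String("foo").intern();
        return s == "foo";
    }

    public static void main(String[] args) {
        System.out.println(literalsShareInstance());     // true
        System.out.println(newStringsShareInstance());   // false
        System.out.println(internedSharesWithLiteral()); // true
    }
}
```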
Also don't use += to concatenate strings (in a loop):
String s = "";
for (/* some loop condition */) {
s += "new";
}
The += implicitly creates a new String in the heap every time. Rather do so:
StringBuilder sb = new StringBuilder();
for (/* some loop condition */) {
sb.append("new");
}
String s = sb.toString();
If you can, rather use StringBuilder or its synchronized brother StringBuffer instead of String for "intensive String processing". It offers useful methods for exactly those purposes, such as append(), insert(), delete(), etc. Also see its javadoc.

Java 7/8
If you follow the accepted answer on Java 7 or newer, you are not getting the effect it describes.
The implementation of substring() has changed.
Never write code that relies on an implementation detail that can change drastically; relying on the old behavior might make things worse.
For reference, the older implementation, which shared the backing array through offset and count, looked like this:
public String substring(int beginIndex, int endIndex) {
    if (beginIndex < 0) {
        throw new StringIndexOutOfBoundsException(beginIndex);
    }
    if (endIndex > count) {
        throw new StringIndexOutOfBoundsException(endIndex);
    }
    if (beginIndex > endIndex) {
        throw new StringIndexOutOfBoundsException(endIndex - beginIndex);
    }
    return ((beginIndex == 0) && (endIndex == count)) ? this :
        new String(offset + beginIndex, endIndex - beginIndex, value);
}
So if you use the accepted answer with Java 7 or newer you are creating twice as much memory usage and garbage that needs to be collected.

Efficiently pack Strings in memory! I once wrote a hyper memory-efficient Set class, where Strings were stored as a tree. If a leaf was reached by traversing the letters, the entry was contained in the set. Fast to work with, too, and ideal for storing a large dictionary.
And don't forget that Strings are often the largest part of memory in nearly every app I have profiled, so do take care of them if you need them.
Illustration:
You have 3 Strings: Beer, Beans and Blood. You can create a tree structure like this:
B
+-e
+-er
+-ans
+-lood
Very efficient for e.g. a list of street names, this is obviously most reasonable with a fixed dictionary, because insert cannot be done efficiently. In fact the structure should be created once, then serialized and afterwards just loaded.
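To show the lookup idea only, here is a minimal trie-backed set. It is not the answerer's class: the name TrieSet and its layout are assumptions, it keeps one character per node instead of merging runs such as "lood", and the HashMap per node makes it far less compact than the packed, serialized structure described above. It also marks word ends with a flag rather than relying on leaves, so that a word can be a prefix of another.

```java
import java.util.HashMap;
import java.util.Map;

class TrieSet {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal;  // true if a stored word ends at this node
    }

    private final Node root = new Node();

    // Walks (and creates, where missing) one node per character of the word.
    void add(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.terminal = true;
    }

    // A word is contained only if the walk succeeds and ends on a terminal node.
    boolean contains(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.get(word.charAt(i));
            if (node == null) {
                return false;
            }
        }
        return node.terminal;
    }

    public static void main(String[] args) {
        TrieSet set = new TrieSet();
        set.add("Beer");
        set.add("Beans");
        set.add("Blood");
        System.out.println(set.contains("Beans")); // true
        System.out.println(set.contains("Bee"));   // false (only a prefix)
    }
}
```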

First, decide how much your application and developers would suffer if you eliminated some of that parsing. A faster application does you no good if you double your employee turnover rate in the process! I think based on your question we can assume you passed this test already.
Second, if you can't eliminate creating an object, then your next goal should be to ensure it doesn't survive Eden collection. And parse-lookup can solve that problem. However, a cache "implemented properly" (I disagree with that basic premise, but I won't bore you with the attendant rant) usually brings thread contention. You'd be replacing one kind of memory pressure for another.
There's a variation of the parse-lookup idiom that suffers less from the sort of collateral damage you usually get from full-on caching, and that's a simple precalculated lookup table (see also "memoization"). The Pattern you usually see for this is the Type Safe Enumeration (TSE). With the TSE, you parse the String, pass it to the TSE to retrieve the associated enumerated type, and then you throw the String away.
Is the text you're processing free-form, or does the input have to follow a rigid specification? If a lot of your text renders down to a fixed set of possible values, then a TSE could help you here, and serves a greater master: Adding context/semantics to your information at the point of creation, instead of at the point of use.
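A small sketch of that precalculated lookup table, written with a Java enum as the type-safe enumeration. The names LogLevel, parse and LogLevelDemo are hypothetical; the three values are borrowed from the log format used in another answer on this page. The map is filled once during class initialization and only read afterwards, so lookups need no locking.

```java
import java.util.HashMap;
import java.util.Map;

class LogLevelDemo {
    public static void main(String[] args) {
        String line = "5AB729: WARNING Some really really really long message";
        String token = line.split(" ")[1];       // short-lived String, discarded after lookup
        LogLevel level = LogLevel.parse(token);  // shared enum constant
        System.out.println(level);               // WARNING
    }
}

enum LogLevel {
    WARNING, ERROR, DEBUG;

    // Precalculated lookup table from text to constant, built once at class initialization.
    private static final Map<String, LogLevel> BY_NAME = new HashMap<>();

    static {
        for (LogLevel level : values()) {
            BY_NAME.put(level.name(), level);
        }
    }

    // Returns the matching constant, or null if the text is not a known level.
    static LogLevel parse(String text) {
        return BY_NAME.get(text);
    }
}
```

After the lookup, only the enum constant needs to be kept; the parsed token can be collected while still young.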

Is there a "fastest way" to construct Strings in Java?

I normally create a String in Java the following way:
String foo = "123456";
However, my lecturer has insisted that forming a String using the format method, like so:
String foo = String.format("%s", 123456);
is much faster.
Also, he says that using the StringBuilder class is even faster.
StringBuilder sb = new StringBuilder();
String foo = sb.append(String.format("%s", 123456)).toString();
Which is the fastest method to create a String, if there even is one?
These examples may not be 100% accurate, as I might not remember them fully.

If there is only one string then:
String foo = "123456";
is fastest. You'll notice that the String.format line has the format string "%s" in it, which must be parsed at run time, so I don't see how the lecturer could possibly think that was faster. Plus you've got a method call on top of it.
However, if you're building a string over time, such as in a for-loop, then you'll want to use a StringBuilder. If you were to just use += then you're building a brand new string every time the += line is called. StringBuilder is much faster since it holds a buffer and appends to that every time you call append.

Slightly off-topic, but I wish that the whole "must-not-use-plus-to-concatenate-strings-in-Java" myth would go away. While it might have been true in early versions of Java that StringBuffer was faster and "+ was evil", it is certainly not true in modern JVMs that are taking care of a lot of optimisations.
For example, which is faster?
String s = "abc" + "def";
or
StringBuffer buf = new StringBuffer();
buf.append("abc");
buf.append("def");
String s = buf.toString();
The answer is the former. The compiler recognises that this is a string constant and will actually put "abcdef" in the string pool, whereas the "optimised stringbuffer" version will cause an extra StringBuffer object to be built.
Another optimisation is
String s = onestring + " concat " + anotherstring;
Here the compiler works out the best way of concatenating. In JDK 5, this means a StringBuilder is used internally, which is faster than using a StringBuffer.
But as other answers have said, the "123456" constant in your question is certainly the fastest way and your lecturer should go back to being a student :-)
And yes, I've been sad enough to verify this by looking at the Java bytecode...
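If you'd rather not read bytecode, the constant folding can also be observed from plain Java. This is a minimal sketch (the class and method names are made up) that relies on reference comparison: a compile-time constant expression ends up as the same pooled instance as the equivalent literal, while a concatenation computed at run time produces a new String.

```java
class ConstantFoldingDemo {
    // "abc" + "def" is a constant expression, folded by the compiler into "abcdef".
    static boolean foldedAtCompileTime() {
        String s = "abc" + "def";
        return s == "abcdef";
    }

    // 'left' is not a constant variable (not final), so this concatenation runs at run time.
    static boolean builtAtRuntime() {
        String left = "abc";
        String s = left + "def";
        return s == "abcdef";
    }

    public static void main(String[] args) {
        System.out.println(foldedAtCompileTime()); // true
        System.out.println(builtAtRuntime());      // false
    }
}
```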

This whole discussion is moot. Please read this article by Jeff, i.e., the guy who created Stack Overflow.
The Sad Tragedy of Micro-Optimization Theater
Please refer your instructor to this post and ask them to stop ruining their students' brains with useless information. Algorithmic optimizations are where your code will live or die, not which method you use to construct strings. In any case, StringBuilder and String.format have to execute ACTUAL CODE with REAL MEMORY. If you just use a string literal, it gets set aside at compile time and is ready to be used when you need it; in essence it has 0 run-time cost, while the other options have real cost, since code actually needs to be executed.

String foo = "some string literal";
Is certainly the fastest way to make a String. It's embedded in the .class file and is a simple memory look-up to retrieve.
Using String.format when you have nothing to really format just looks ugly and might cause junior developers to cry.
If the String is going to be modified, then StringBuilder is the best since Strings are immutable.

In your second example, using:
String foo = String.format("%s", 123456);
doesn't buy you anything; 123456 is already a constant value, so why not just assign foo = "123456"? For constant strings, there's no better way.
If you're creating a string from multiple parts being appended together at runtime, use StringBuffer or StringBuilder (the former being thread-safe).

If your string is known at compile-time, then using a literal is best: String foo = "123456";.
If your string is not known at compile-time and is composed of an aggregation of smaller strings, StringBuilder is usually the way to go (but beware thread-safety!).
Using String foo = String.format("%s", 123456); could reduce your .class' size and make class-loading it a tiny bit faster, but that would be extremely aggressive (extreme) memory tuning there ^^.

As has been pointed out, if you're just building a single string with no concatenation, just use String.
For concatenating multiple bits into one big string, StringBuffer is slower than StringBuilder because StringBuffer is synchronized. If you don't need synchronization, use StringBuilder.

Are you 100% certain that the instructor was not talking about something like:
String foo = "" + 123456;
I see my students do that type of thing "all the time" (a handful will do that each term). The reason that they do it is that some book showed them how to do it that way. Shakes head and fist at lazy book writers!

The first example you gave is the fastest and the simplest. Use that.
Each piece of code you added in those examples makes it significantly slower and more difficult to read.
I would suggest example 2 is at least 10-100x slower than example 1 and example 3 is about 2x slower than example 2.
Did your professor provide any justification for this assertion?
BTW: Your first example doesn't construct a String at all (which is why it is fastest), it just hands you a String sitting in the String constant pool.

How about measuring dynamic strings so that VM cannot optimise it:
public static void measureConcats(long lim) {
    double sum = 0;
    long start = System.currentTimeMillis();
    for (long a = 0; a < lim; ++a) {
        sum += Math.random();
    }
    long end = System.currentTimeMillis();
    System.out.println("Sum:" + sum);
    System.out.println("Double creations time:" + (end - start));
    String res = "";
    Double sad = 0.0;
    start = System.currentTimeMillis();
    for (long b = 0; b < lim; ++b) {
        sad = Math.random();
        String sa = sad.toString();
        res += sa;
    }
    end = System.currentTimeMillis();
    System.out.println("Pure string concat time:" + (end - start));
    System.out.println("len:" + res.length());
    StringBuffer sbf = new StringBuffer();
    start = System.currentTimeMillis();
    for (long c = 0; c < lim; ++c) {
        sad = Math.random();
        String sa = sad.toString();
        sbf.append(sa);
    }
    end = System.currentTimeMillis();
    System.out.println("StringBuffer concat time:" + (end - start));
    System.out.println("len:" + sbf.length());
}
My result for 10000 concats is 364ms for String+=String and 14ms for StringBuffer append.
I was very surprised about this result.
