Is there an equivalent to Java's String intern function in Go?
I am parsing a lot of text input that has repeating patterns (tags). I would like to be memory efficient about it and store pointers to a single string for each tag, instead of multiple strings for each occurrence of a tag.
No such function exists that I know of. However, you can make your own very easily using maps. The string type itself is a uintptr and a length. So, a string assigned from another string takes up only two words. Therefore, all you need to do is ensure that there are no two strings with redundant content.
Here is an example of what I mean.
type Interner map[string]string
func NewInterner() Interner {
return Interner(make(map[string]string))
}
func (m Interner) Intern(s string) string {
if ret, ok := m[s]; ok {
return ret
}
m[s] = s
return s
}
This code will deduplicate redundant strings whenever you do the following:
str = interner.Intern(str)
EDIT: As jnml mentioned, my answer could pin memory depending on the string it is given. There are two ways to solve this problem. Both of these should be inserted before m[s] = s in my previous example. The first copies the string twice, the second uses unsafe. Neither are ideal.
Double copy:
b := []byte(s)
s = string(b)
Unsafe (use at your own risk. Works with current version of gc compiler):
b := []byte(s)
s = *(*string)(unsafe.Pointer(&b))
I think that for example Pool and GoPool may fulfill your needs. That code solves one thing which Stephen's solution ignores. In Go, a string value may be a slice of a bigger string. Scenarios are where it doesn't matter and scenarios are where that is a show stopper. The linked functions attempt to be on the safe side.
Related
Using the StringBuilder.insert method like:
builder.insert(0, str)
seems a reasonable way to prepend a String instance before another string, but after having a look a the source code of the insert method I am not sure if its an efficient way of doing so, because the insert method has to assert sufficient capacity and perform some shifting:
public AbstractStringBuilder insert(int offset, String str) {
...
ensureCapacityInternal(count + len);
shift(offset, len);
...
}
and both relies on using System.arraycopy.
I wrote a StringCombiner class which just adds two strings:
public StringCombiner prepend(String pre){
this.string = pre + this.string;
return this;
}
After running some performance tests the insert method still seems to be much faster as the creation of the new string seems to be a much bigger performance penalty.
So I wanted to know if there is another more performant way to prepend a string before another.
Thx.
Your prepend() method uses the + string operator, which internally creates a StringBuilder, appends both strings, and converts the result to a String. So, the + operator can't help you.
If it's just adding two strings, there's no way to improve that.
If it's inside a loop, you might get some improvement by reversing the loop, so you start from the part of the result string that gets placed at the very beginning of the result, and instead of continuously prepending (and thus repeatedly moving ever-growing accumulated character sequences), you'll end up appending strings and keep the majority of characters in place.
So, improvements are possible only considering the bigger context of this string operation.
In a code base I'm working with I'm seeing this idiom being used.Can someone explain it for me?
new String("" + number) // `i` is an instance of Integer
For some context, this is approximately what the method looks like:
public String someMethod(String numberString) {
Integer number = new Integer(numberString);
// other stuff happens...
return new String("" + number);
}
It's most likely nothing but an inexperienced Java programmers attempt at converting a number to a String.
Whether converting numbers to Strings using "" + number is good practice is debateable. I for one find String.valueOf(number) to be more clear (although it's semantically equivalent).
It's completely unnecessary to wrap the result in new String(...) unless you (for some unimaginable reason) really need a new string, i.e. one that's referentially distinct from other strings.
This ideom is the easiest way of converting any primitive or Object type to a string.
It is quite easy: Typing "" + number is less work than typing String.valueOf(number) (or Integer.toString(number). The latter is the usual way of transforming an int into a String. Which one is clearer and to be preferred is a matter of taste. You will find advocates for both versions.
But note three things:
Calling String.valueOf(number) is faster, because "" + number will become a string concatenation, it will be compiled to
new StringBuilder().append("").append(number).toString()
However, unless this statement is in a hot place in your code, the difference will not matter at all. Even in a hot place, the difference might still be negligible
Simply calling number.toString() is also an option since number is a boxed integer. However, if number is null, this will trigger a null pointer exception so be careful!
The last line of your example, i.e., new String("" + number); is simply crap! It wraps the result into a newly built string. This creates an unnecessary copy and costs precious extra keystrokes. It also generates unnecessary boilerplate noise in your code. This last line compiles to:
new String(new StringBuilder().append("").append(number).toString())
That is quite some work for transforming int to String. You first create a StringBuilder, then a built String and then a copy of this built String. This creates three objects instead of one.
This idiom is used to automatically convert the integer number to a string.
The String constructor expects an argument of type String. The expression new String(number) would not compile. Concatenting an empty string is a little trick to turn the Integer into a String.
You could of course use new String(number.toString()) or String.valueOf(number) instead.
Note: As Jon Skeet mentions in the comments there is no need to use new Sring() at all. You could just return number.toString(); or return String.valueOf(number)
In an application a String is a often used data type. What we know, is that the mutation of a String uses lots of memory. So what we can do is to use a StringBuilder/StringBuffer.
But at what point should we change to StringBuilder?
And what should we do, when we have to split it or to remplace characters in there?
eg:
//original:
String[] split = string.split("?");
//better? :
String[] split = stringBuilder.toString().split("?);
or
//original:
String replacedString = string.replace("l","st");
//better? :
String replacedString = stringBuilder.toString().replace("l","st");
//or
StringBuilder replacedStringBuilder = new StringBuilder(stringBuilder.toString().replace("l","st);
In your examples, there are no benefits in using a StringBuilder, since you use the toString method to create an immutable String out of your StringBuilder.
You should only copy the contents of a StringBuilder into a String after you are done appending it (or modifying it in some other way).
The problem with Java's StringBuilder is that it lacks some methods you get when using a plain string (check this thread, for example: How to implement StringBuilder.replace(String, String)).
What we know, is that a String uses lots of memory.
Actually, to be precise, a String uses less memory than a StringBuilder with equivalent contents. A StringBuilder class has some additional constant overhead, and usually has a preallocated buffer to store more data than needed at any given moment (to reduce allocations). The issue with Strings is that they are immutable, which means Java needs to create a new instance whenever you need to change its contents.
To conclude, StringBuilder is not designed for the operations you mentioned (split and replace), and it won't yield much better performance in any case. A split method cannot benefit from StringBuilder's mutability, since it creates an array of immutable strings as its output anyway. A replace method still needs to iterate through the entire string, and do a lot of copying if replaced string is not the same size as the searched one.
If you need to do a lot of appending, then go for a StringBuilder. Since it uses a "mutable" array of characters under the hood, adding data to the end will be especially efficient.
This article compares the performance of several StringBuilder and String methods (although I would take the Concatenation part with reserve, because it doesn't mention dynamic string appending at all and concentrates on a single Join operation only).
What we know, is that the mutation of a String uses lots of memory.
That is incorrect. Strings cannot be mutated. They are immutable.
What you are actually talking about is building a String from other strings. That can use a lot more memory than is necessary, but it depends how you build the string.
So what we can do is to use a StringBuilder/StringBuffer.
Using a StringBuilder will help in some circumstances:
String res = "";
for (String s : ...) {
res = res + s;
}
(If the loop iterates many times then optimizing the above to use a StringBuilder could be worthwhile.)
But in other circumstances it is a waste of time:
String res = s1 + s2 + s3 + s4 + s5;
(It is a waste of time to optimize the above to use a StringBuilder because the Java compiler will automatically translate the expression into code that creates and uses a StringBuilder.)
You should only ever use a StringBuffer instead of a StringBuilder when the string needs to be accessed and/or updated by more than one thread; i.e. when it needs to be thread-safe.
But at what point should we change to StringBuilder?
The simple answer is to only do it when the profiler tells you that you have a performance problem in your string handling / processing.
Generally speaking, StringBuilders are used for building strings rather as the primary representation of the strings.
And what should we do, when we have to split it or to replace characters in there?
Then you have to review your decision to use a StringBuilder / StringBuffer as your primary representation at that point. And if it is still warranted you have to figure out how to do the operation using the API you have chosen. (This may entail converting to a String, performing the operation and then creating a new StringBuilder from the result.)
If you frequently modify the string, go with StringBuilder. Otherwise, if it's immutable anyway, go with String.
To answer your question on how to replace characters, check this out: http://download.oracle.com/javase/tutorial/java/data/buffers.html. StringBuilder operations is what you want.
Here's another good write-up on StringBuilder: http://www.yoda.arachsys.com/csharp/stringbuilder.html
If you need to lot of alter operations on your String, then you can go for StringBuilder. Go for StringBuffer if you are in multithreaded application.
Both a String and a StringBuilder use about the same amount of memory. Why do you think it is “much”?
If you have measured (for example with jmap -histo:live) that the classes [C and java.lang.String take up most of the memory in the heap, only then should you think further in this direction.
Maybe there are multiple strings with the same value. Then, since Strings are immutable, you could intern the duplicate strings. Don't use String.intern for it, since it has bad performance characteristics, but Google Guava's Interner.
My application is multithreaded with intensive String processing. We are experiencing excessive memory consumption and profiling has demonstrated that this is due to String data. I think that memory consumption would benefit greatly from using some kind of flyweight pattern implementation or even cache (I know for sure that Strings are often duplicated, although I don't have any hard data in that regard).
I have looked at Java Constant Pool and String.intern, but it seems that it can provoke some PermGen problems.
What would be the best alternative for implementing application-wide, multithreaded pool of Strings in java?
EDIT: Also see my previous, related question: How does java implement flyweight pattern for string under the hood?
Note: This answer uses examples that might not be relevant in modern runtime JVM libraries. In particular, the substring example is no longer an issue in OpenJDK/Oracle 7+.
I know it goes against what people often tell you, but sometimes explicitly creating new String instances can be a significant way to reduce your memory.
Because Strings are immutable, several methods leverage that fact and share the backing character array to save memory. However, occasionally this can actually increase the memory by preventing garbage collection of unused parts of those arrays.
For example, assume you were parsing the message IDs of a log file to extract warning IDs. Your code would look something like this:
//Format:
//ID: [WARNING|ERROR|DEBUG] Message...
String testLine = "5AB729: WARNING Some really really really long message";
Matcher matcher = Pattern.compile("([A-Z0-9]*): WARNING.*").matcher(testLine);
if ( matcher.matches() ) {
String id = matcher.group(1);
//...do something with id...
}
But look at the data actually being stored:
//...
String id = matcher.group(1);
Field valueField = String.class.getDeclaredField("value");
valueField.setAccessible(true);
char[] data = ((char[])valueField.get(id));
System.out.println("Actual data stored for string \"" + id + "\": " + Arrays.toString(data) );
It's the whole test line, because the matcher just wraps a new String instance around the same character data. Compare the results when you replace String id = matcher.group(1); with String id = new String(matcher.group(1));.
This is already done at the JVM level. You only need to ensure that you aren't creating new Strings everytime, either explicitly or implicitly.
I.e. don't do:
String s1 = new String("foo");
String s2 = new String("foo");
This would create two instances in the heap. Rather do so:
String s1 = "foo";
String s2 = "foo";
This will create one instance in the heap and both will refer the same (as evidence, s1 == s2 will return true here).
Also don't use += to concatenate strings (in a loop):
String s = "";
for (/* some loop condition */) {
s += "new";
}
The += implicitly creates a new String in the heap everytime. Rather do so
StringBuilder sb = new StringBuilder();
for (/* some loop condition */) {
sb.append("new");
}
String s = sb.toString();
If you can, rather use StringBuilder or its synchronized brother StringBuffer instead of String for "intensive String processing". It offers useful methods for exactly those purposes, such as append(), insert(), delete(), etc. Also see its javadoc.
Java 7/8
If you are doing what the accepted answer says and using Java 7 or newer you are not doing what it says you are.
The implementation of subString() has changed.
Never write code that relies on an implementation that can change drastically and might make things worse if you are relying on the old behavior.
1950 public String substring(int beginIndex, int endIndex) {
1951 if (beginIndex < 0) {
1952 throw new StringIndexOutOfBoundsException(beginIndex);
1953 }
1954 if (endIndex > count) {
1955 throw new StringIndexOutOfBoundsException(endIndex);
1956 }
1957 if (beginIndex > endIndex) {
1958 throw new StringIndexOutOfBoundsException(endIndex - beginIndex);
1959 }
1960 return ((beginIndex == 0) && (endIndex == count)) ? this :
1961 new String(offset + beginIndex, endIndex - beginIndex, value);
1962 }
So if you use the accepted answer with Java 7 or newer you are creating twice as much memory usage and garbage that needs to be collected.
Effeciently pack Strings in memory! I once wrote a hyper memory efficient Set class, where Strings were stored as a tree. If a leaf was reached by traversing the letters, the entry was contained in the set. Fast to work with, too, and ideal to store a large dictionary.
And don't forget that Strings are often the largest part in memory in nearly every app I profiled, so don't care for them if you need them.
Illustration:
You have 3 Strings: Beer, Beans and Blood. You can create a tree structure like this:
B
+-e
+-er
+-ans
+-lood
Very efficient for e.g. a list of street names, this is obviously most reasonable with a fixed dictionary, because insert cannot be done efficiently. In fact the structure should be created once, then serialized and afterwards just loaded.
First, decide how much your application and developers would suffer if you eliminated some of that parsing. A faster application does you no good if you double your employee turnover rate in the process! I think based on your question we can assume you passed this test already.
Second, if you can't eliminate creating an object, then your next goal should be to ensure it doesn't survive Eden collection. And parse-lookup can solve that problem. However, a cache "implemented properly" (I disagree with that basic premise, but I won't bore you with the attendant rant) usually brings thread contention. You'd be replacing one kind of memory pressure for another.
There's a variation of the parse-lookup idiom that suffers less from the sort of collateral damage you usually get from full-on caching, and that's a simple precalculated lookup table (see also "memoization"). The Pattern you usually see for this is the Type Safe Enumeration (TSE). With the TSE, you parse the String, pass it to the TSE to retrieve the associated enumerated type, and then you throw the String away.
Is the text you're processing free-form, or does the input have to follow a rigid specification? If a lot of your text renders down to a fixed set of possible values, then a TSE could help you here, and serves a greater master: Adding context/semantics to your information at the point of creation, instead of at the point of use.
I'm reading from a binary file and want to convert the bytes to US ASCII strings. Is there any way to do this without calling new on String to avoid multiple semantically equal String objects being created in the string literal pool? I'm thinking that it is probably not possible since introducing String objects using double quotes is not possible here. Is this correct?
private String nextString(DataInputStream dis, int size)
throws IOException
{
byte[] bytesHolder = new byte[size];
dis.read(bytesHolder);
return new String(bytesHolder, Charset.forName("US-ASCII")).trim();
You'd have to have a cache mapping byte arrays to strings, then search through the cache for any equal values before creating a new string.
You can intern existing strings with intern() as Yishai posted - that won't stop you from creating more strings, but it'll make all but the first one (for any char sequence) very short lived. On the other hand, it'll make all the distinct strings live for a very long time indeed.
You can have "pseudo-interning" by using a Map<String, String>:
String tmp = new String(bytesHolder, Charset.forName("US-ASCII")).trim();
String cached = cache.get(tmp);
if (cached == null)
{
cached = tmp;
cache.put(tmp, tmp);
}
return cached;
You could even put a bit more effort in and end up with an LRU cache so that it'll keep the N most recently fetched strings, discarding others when it needs to.
None of that reduces the number of strings created in the first place, as I say - but is that likely to be a problem in your situation? GCs have been tuned to make it very cheap to create short-lived objects.
You can call the intern() method on the string to ensure one for the whole JVM.
String s = new String(bytes, "US-ASCII").intern();
You won't avoid creating the initial string again, but you will save on the storage.
That being said, interned strings have a limited storage space, so use with caution. A better option may be to implement a HashMap with the string as the key and value and check if the string already exists and get it if it does, insert it if it doesn't. That way you won't have such memory limitations.
You shouldn’t be concerned about it—unless you profiled your application and have determined the String creation to be the exact source of your problem.
If you find out that the String creation is the source of your problem I would recommend what Jon Skeet proposed, i.e. a mapping from byte[] to String. That has about the same effect as interning your Strings while not hogging up valuable memory until you restart the VM.