String.intern() vs manual string-to-identifier mapping? - java

I recall seeing a couple of string-intensive programs that do a lot of string comparison but relatively little string manipulation, and that use a separate table to map strings to identifiers for efficient equality checks and a lower memory footprint, e.g.:
public class Name {
public static Map<String, Name> names = new SomeMap<String, Name>();
public static Name from(String s) {
Name n = names.get(s);
if (n == null) {
n = new Name(s);
names.put(s, n);
}
return n;
}
private final String str;
private Name(String str) { this.str = str; }
@Override public String toString() { return str; }
// equals() and hashCode() are not overridden!
}
I'm pretty sure one of these programs was javac from OpenJDK, so not some toy application. Of course the actual class was more complex (and I think it also implemented CharSequence), but you get the idea - the entire program was littered with Name in any location where you would expect String, and in the rare cases where string manipulation was needed, it converted to strings and then cached them again, conceptually like:
Name newName = Name.from(name.toString().substring(5));
I think I understand the point of this - especially when there are a lot of identical strings all around and a lot of comparisons - but couldn't the same be achieved by just using regular strings and interning them? The documentation for String.intern() explicitly says:
...
When the intern method is invoked, if the pool already contains a string equal to this String object as determined by the equals(Object) method, then the string from the pool is returned. Otherwise, this String object is added to the pool and a reference to this String object is returned.
It follows that for any two strings s and t, s.intern() == t.intern() is true if and only if s.equals(t) is true.
...
So, what are the advantages and disadvantages of manually managing a Name-like class vs using intern()?
What I've thought about so far was:
Manually managing the map means using regular heap, intern() uses the permgen.
When manually managing the map you enjoy type-checking that can verify something is a Name, while an interned string and a non-interned string share the same type so it's possible to forget interning in some places.
Relying on intern() means reusing an existing, optimized, tried-and-tested mechanism without coding any extra classes.
Manually managing the map results in code that is more confusing to new users, and string operations become more cumbersome.
... but I feel like I'm missing something else here.

Unfortunately, String.intern() can be slower than a simple synchronized HashMap. It doesn't need to be so slow, but as of today in Oracle's JDK, it is slow (probably due to JNI).
Another thing to consider: you are writing a parser; you collected some chars in a char[], and you need to make a String out of them. Since the string is probably common and can be shared, we'd like to use a pool.
String.intern() uses such a pool; yet to look up, you'll need a String to begin with. So we need to new String(char[],offset,length) first.
We can avoid that overhead in a custom pool, where lookup can be done directly based on a char[],offset,length. For example, the pool is a trie. The string most likely is in the pool, so we'll get the String without any memory allocation.
If we don't want to write our own pool, but use the good old HashMap, we'll still need to create a key object that wraps char[],offset,length (something like CharSequence). This is still cheaper than a new String, since we don't copy chars.
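The custom-pool idea above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses a small chained hash table instead of a trie, the class name CharRangePool and its methods are hypothetical, the table never resizes, and it is not thread-safe.

```java
/** Hypothetical pool that looks strings up directly from a char range (not thread-safe). */
final class CharRangePool {
    private static final class Entry {
        final String value;
        final int hash;
        final Entry next;

        Entry(String value, int hash, Entry next) {
            this.value = value;
            this.hash = hash;
            this.next = next;
        }
    }

    // Fixed number of buckets for brevity; a real pool would resize.
    private final Entry[] buckets = new Entry[1024];

    /** Returns the pooled String equal to buf[off, off + len), allocating only on a miss. */
    String intern(char[] buf, int off, int len) {
        int h = 0;
        for (int i = 0; i < len; i++) {
            h = 31 * h + buf[off + i];
        }
        int index = (h & 0x7FFFFFFF) % buckets.length;
        for (Entry e = buckets[index]; e != null; e = e.next) {
            if (e.hash == h && matches(e.value, buf, off, len)) {
                return e.value; // hit: no String and no key object is allocated
            }
        }
        String s = new String(buf, off, len); // miss: copy the chars once
        buckets[index] = new Entry(s, h, buckets[index]);
        return s;
    }

    // Compares a pooled String with the char range without building a new String.
    private static boolean matches(String s, char[] buf, int off, int len) {
        if (s.length() != len) {
            return false;
        }
        for (int i = 0; i < len; i++) {
            if (s.charAt(i) != buf[off + i]) {
                return false;
            }
        }
        return true;
    }
}
```

A parser would call pool.intern(buffer, tokenStart, tokenLength) for each token; on a hit it gets the shared String back without any allocation.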

I would always go with the Map. intern() has to search the JVM's internal string pool, which in HotSpot is a native hash table rather than a linear list, but that lookup has historically been slower than a plain Map. If you intern very often, a Map is likely to be more efficient, since it is made for fast lookups.
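A map-based interner of the kind this answer prefers might look like the sketch below. The class name MapInterner is hypothetical, and the choice of ConcurrentHashMap (to avoid explicit locking) is my assumption, not something the answer specifies. Note that this pool never evicts anything.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical map-based replacement for String.intern(). */
final class MapInterner {
    private static final ConcurrentMap<String, String> POOL = new ConcurrentHashMap<>();

    private MapInterner() {}

    /** Returns the canonical instance for s, registering s itself if none exists yet. */
    static String intern(String s) {
        String existing = POOL.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }
}
```

After this, MapInterner.intern(a) == MapInterner.intern(b) holds whenever a.equals(b), just as with String.intern(), but the pooled strings are not the same objects as the JVM's interned literals.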

what are the advantages and disadvantages of manually managing a Name-like class vs using intern()
Type checking is a major concern, but invariant preservation is also a significant concern.
Adding a simple check to the Name constructor
Name(String s) {
if (!isValidName(s)) { throw new IllegalArgumentException(s); }
...
}
can ensure* that there exist no Name instances corresponding to invalid names like "12#blue,," which means that methods that take Names as arguments and that consume Names returned by other methods don't need to worry about where invalid Names might creep in.
To generalize this argument, imagine your code is a castle with walls designed to protect it from invalid inputs. You want some inputs to get through so you install gates with guards that check inputs as they come through. The Name constructor is an example of a guard.
The difference between String and Name is that Strings can't be guarded against. Any piece of code, malicious or naive, inside or outside the perimeter, can create any string value. Buggy String manipulation code is analogous to a zombie outbreak inside the castle. The guards can't protect the invariants because the zombies don't need to get past them. The zombies just spread and corrupt data as they go.
That a value "is a" String satisfies fewer useful invariants than that a value "is a" Name.
See stringly typed for another way to look at the same topic.
* - usual caveat re deserializing of Serializable allowing bypass of constructor.
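Putting the guard and the lookup table together could look like this sketch. The class name ValidatedName is hypothetical, and the rule inside isValidName (a name must be a Java identifier) is purely an assumption for illustration; the real rule depends on your domain.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical Name-like class whose only entry point validates its input. */
final class ValidatedName {
    private static final Map<String, ValidatedName> NAMES = new HashMap<>();

    private final String str;

    // Private constructor: the guard at the castle gate.
    private ValidatedName(String s) {
        if (!isValidName(s)) {
            throw new IllegalArgumentException(s);
        }
        this.str = s;
    }

    /** Returns the unique ValidatedName for s, or throws if s is not a valid name. */
    static synchronized ValidatedName from(String s) {
        ValidatedName n = NAMES.get(s);
        if (n == null) {
            n = new ValidatedName(s); // invalid input never reaches the table
            NAMES.put(s, n);
        }
        return n;
    }

    // Assumption for this sketch: a valid name is a Java identifier.
    static boolean isValidName(String s) {
        if (s == null || s.isEmpty() || !Character.isJavaIdentifierStart(s.charAt(0))) {
            return false;
        }
        for (int i = 1; i < s.length(); i++) {
            if (!Character.isJavaIdentifierPart(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    @Override
    public String toString() {
        return str;
    }
}
```

Any method that accepts a ValidatedName can rely on the invariant without re-checking it, which is not possible for a parameter of type String.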

String.intern() in Java 5.0 & 6 uses the perm gen space which usually has a low maximum size. It can mean you run out of space even though there is plenty of free heap.
Java 7 uses the regular heap to store intern()ed Strings.
String comparison is pretty fast, and I don't imagine there is much advantage in cutting comparison times when you consider the overhead.
Another reason this might be done is if there are many duplicate strings. If there is enough duplication, this can save a lot of memory.
A simpler way to cache Strings is to use an LRU cache such as LinkedHashMap:
private static final int MAX_SIZE = 10000;
private static final Map<String, String> STRING_CACHE = new LinkedHashMap<String, String>(MAX_SIZE*10/7, 0.70f, true) {
@Override
protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
return size() > MAX_SIZE;
}
};
public static String intern(String s) {
// s2 is a String equal to s, or null if it's not there.
String s2 = STRING_CACHE.get(s);
if (s2 == null) {
// put the string in the map if it's not there already.
s2 = s;
STRING_CACHE.put(s2,s2);
}
return s2;
}
Here is an example of how it works.
public static void main(String... args) {
String lo = "lo";
for (int i = 0; i < 10; i++) {
String a = "hel" + lo + " " + (i & 1);
String b = intern(a);
System.out.println("String \"" + a + "\" has an id of "
+ Integer.toHexString(System.identityHashCode(a))
+ " after interning is has an id of "
+ Integer.toHexString(System.identityHashCode(b))
);
}
System.out.println("The cache contains "+STRING_CACHE);
}
prints
String "hello 0" has an id of 237360be after interning is has an id of 237360be
String "hello 1" has an id of 5736ab79 after interning is has an id of 5736ab79
String "hello 0" has an id of 38b72ce1 after interning is has an id of 237360be
String "hello 1" has an id of 64a06824 after interning is has an id of 5736ab79
String "hello 0" has an id of 115d533d after interning is has an id of 237360be
String "hello 1" has an id of 603d2b3 after interning is has an id of 5736ab79
String "hello 0" has an id of 64fde8da after interning is has an id of 237360be
String "hello 1" has an id of 59c27402 after interning is has an id of 5736ab79
String "hello 0" has an id of 6d4e5d57 after interning is has an id of 237360be
String "hello 1" has an id of 2a36bb87 after interning is has an id of 5736ab79
The cache contains {hello 0=hello 0, hello 1=hello 1}
This ensures the cache of intern()ed Strings is limited in size.
A faster but less effective way is to use a fixed array.
private static final int MAX_SIZE = 10191;
private static final String[] STRING_CACHE = new String[MAX_SIZE];
public static String intern(String s) {
int hash = (s.hashCode() & 0x7FFFFFFF) % MAX_SIZE;
String s2 = STRING_CACHE[hash];
if (!s.equals(s2))
STRING_CACHE[hash] = s2 = s;
return s2;
}
The test above works the same except you need
System.out.println("The cache contains "+ new HashSet<String>(Arrays.asList(STRING_CACHE)));
to print out the contents, which shows the following, including one null for the empty entries.
The cache contains [null, hello 1, hello 0]
The advantage of this approach is speed, and that it can safely be used by multiple threads without locking, i.e. it doesn't matter if different threads have different views of STRING_CACHE.

So, what are the advantages and disadvantages of manually managing a
Name-like class vs using intern()?
One advantage is:
It follows that for any two strings s and t, s.intern() == t.intern()
is true if and only if s.equals(t) is true.
In a program where many small strings must be compared often, this may pay off.
Also, it saves space in the end. Consider a source program that uses names like AbstractSyntaxTreeNodeItemFactorySerializer quite often. With intern(), this string will be stored once and that is it. Everything else is just references to it, and you have the references anyway.
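The quoted guarantee is easy to check with a small sketch; the class and method names here are hypothetical.

```java
/** Hypothetical demo of the documented guarantee: s.intern() == t.intern() iff s.equals(t). */
final class InternIdentityDemo {
    private InternIdentityDemo() {}

    /** True when both strings resolve to the same pooled instance. */
    static boolean sameAfterIntern(String s, String t) {
        return s.intern() == t.intern();
    }

    public static void main(String[] args) {
        // new String(...) forces two distinct objects with equal content.
        String s = new String("AbstractSyntaxTreeNodeItemFactorySerializer");
        String t = new String("AbstractSyntaxTreeNodeItemFactorySerializer");

        System.out.println(s == t);                      // false: two separate objects
        System.out.println(sameAfterIntern(s, t));       // true: one pooled instance
        System.out.println(sameAfterIntern(s, "Other")); // false: different content
    }
}
```

If the program keeps only the interned references, the two temporary copies become garbage and a single pooled instance remains.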

Related

String objects which are not literal not requiring new keyword?

So I know there are other similar questions to this, such as this one and this other one. But their answer seems to be that because the strings are literals and part of some pool of immutable literal constants, they will remain available. This sort of makes sense to me, but then why do non-literals also work fine? When do I ever have to use the "new" keyword when dealing with strings? In the example below, I use strings to do a few things, but everything works fine and I never use the "new" keyword (correction: I never use it with a String type object).
import java.util.*;
class teststrings{
public static void main(String[] args){
Scanner in = new Scanner(System.in);
String nonew;
String nonew2;
String literally= "old";
literally= "new"; //does the word "old" get garbage collected here?
nonew = in.nextLine(); //this does not use the new keyword, but it works, why?
System.out.println("nonew:"+nonew);
System.out.println("literally:"+literally);
nonew2 = in.nextLine();
System.out.println("nonew:"+nonew); //the data is still preserved here
System.out.println("nonew2:"+nonew2);
//I didn't use the new keyword at all, but everything worked
//So, when do I need to use it?
}
}
A couple of points:
"Does the word "old" get garbage collected here?"
Chances are your compiler realises it's never used and just skips it altogether.
Scanner::nextLine returns a String, and the value returned by the method is used for the assignment.
As for when to use new for Strings... Well, rarely would probably be best. The only time I've ever seen it used would be for internal constants. For example
public class MatchChecker {
private static final String ANY_VALUE = new String("*");
private final Map<Object, String> map = new HashMap<Object, String>();
public void addMatch(Object object, String string) {
map.put(object, string);
}
public void addAnyValueMatch(Object object) {
map.put(object, ANY_VALUE);
}
public boolean matches(Object object, String string) {
if (!map.containsKey(object)) {
return false;
}
if (map.get(object) == ANY_VALUE || map.get(object).equals(string)) {
return true;
}
return false;
}
}
Which would mean only those Objects added via addAnyValueMatch would match any value (as it's tested with ==), even if the user used "*" as the string in addMatch.
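The sentinel trick relies on new String("*") never being the same object as the pooled literal "*". A stripped-down sketch of just that point, with hypothetical class and method names:

```java
/** Hypothetical demo: a sentinel created with new String is distinct from the equal literal. */
final class SentinelDemo {
    // A fresh object that equals "*" but is never identical to the pooled literal.
    static final String ANY_VALUE = new String("*");

    private SentinelDemo() {}

    /** Identity check: only the sentinel object itself passes. */
    static boolean isSentinel(String s) {
        return s == ANY_VALUE;
    }

    public static void main(String[] args) {
        System.out.println(isSentinel(ANY_VALUE));     // true: same object
        System.out.println(isSentinel("*"));           // false: the pooled literal is another object
        System.out.println("*".equals(ANY_VALUE));     // true: the contents are still equal
    }
}
```

The sentinel must never be passed through intern(), because that would return the pooled "*" and the identity check would stop distinguishing the two.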
Strings are treated specially in Java. The JVM makes use of a cache-like structure called the String pool.
Unlike other objects, when you create a literal String like this: String mystring = "Hello"; Java will first check to see if the String "Hello" already exists in the String pool. If not, it will add it to be cached and reused if referenced again.
So, when you assign a variable to "Hello" the first time, it gets added to the pool:
String s1 = "Hello";
String s2 = "Hello";
String s3 = s1;
s1 = "SomethingElse";
In the code above, when s1 is assigned "Hello" the JVM will see it is not stored in the pool and create/add it to the pool.
For s2, you are again referencing "Hello". The JVM will see it in the pool and assign s2 to the same String stored in the pool. s3 is simply assigned the reference held by s1, i.e. the same string "Hello". Finally, s1 is reassigned to another String, which doesn't exist yet in the pool, so it is added. Also, s1 no longer points to "Hello", yet "Hello" will not be garbage collected, for two reasons: 1: it is being stored in the String pool, and 2: s2 also points to the same string.
With Strings, you should never use the new keyword for creating literal strings. If you do, you are not taking advantage of the String pool reuse and could cause multiple instances of the same String to exist in memory, which is a waste.

How does this Java code snippet work? (String pool and reflection) [duplicate]

This question already has answers here:
Is a Java string really immutable?
(16 answers)
Closed 7 years ago.
The Java string pool coupled with reflection can produce some unimaginable results:
import java.lang.reflect.Field;
class MessingWithString {
public static void main (String[] args) {
String str = "Mario";
toLuigi(str);
System.out.println(str + " " + "Mario");
}
public static void toLuigi(String original) {
try {
Field stringValue = String.class.getDeclaredField("value");
stringValue.setAccessible(true);
stringValue.set(original, "Luigi".toCharArray());
} catch (Exception ex) {
// Ignore exceptions
}
}
}
Above code will print:
"Luigi Luigi"
What happened to Mario?
What happened to Mario ??
You changed it, basically. Yes, with reflection you can violate the immutability of strings... and due to string interning, that means any use of "Mario" (other than in a larger string constant expression, which would have been resolved at compile-time) will end up as "Luigi" in the rest of the program.
This kind of thing is why reflection requires security permissions...
Note that the expression str + " " + "Mario" does not perform any compile-time concatenation, due to the left-associativity of +. It's effectively (str + " ") + "Mario", which is why you still see Luigi Luigi. If you change the code to:
System.out.println(str + (" " + "Mario"));
... then you'll see Luigi Mario as the compiler will have interned " Mario" to a different string to "Mario".
It was set to Luigi. Strings in Java are immutable; thus, the compiler can interpret all mentions of "Mario" as references to the same String constant pool item (roughly, "memory location"). You used reflection to change that item; so all "Mario" in your code are now as if you wrote "Luigi".
To explain the existing answers a bit more, let's take a look at your generated byte code (Only the main() method here).
Now, any changes to the contents of that location will affect both references (and any others you create, too).
String literals are stored in the string pool and their canonical value is used. Both "Mario" literals aren't just strings with the same value, they are the same object. Manipulating one of them (using reflection) will modify "both" of them, as they are just two references to the same object.
You just changed the String constant pool entry Mario to Luigi. That entry was referenced by multiple Strings, so every literal referencing Mario is now Luigi.
Field stringValue = String.class.getDeclaredField("value");
You have fetched the char[] field named value from class String.
stringValue.setAccessible(true);
Make it accessible.
stringValue.set(original, "Luigi".toCharArray());
You changed the value field of the original String to Luigi. But original is the String literal Mario, and literals belong to the String pool and are all interned, which means all literals that have the same content refer to the same memory address.
String a = "Mario";//Created in String pool
String b = "Mario";//Refers to the same Mario of String pool
a == b//TRUE
//You changed 'a' to Luigi and 'b' doesn't know that
//'a' has been internally changed and
//'b' still refers to the same address.
Basically, you have changed the Mario of the String pool, and the change is reflected in all the referencing fields. If you create a String object (i.e. new String("Mario")) instead of using a literal, you will not face this behavior, because then you will have two different Marios.
The other answers adequately explain what's going on. I just wanted to add the point that this only works if there is no security manager installed. When running code from the command line there is none by default, and you can do things like this. However, in an environment where trusted code is mixed with untrusted code, such as an application server in a production environment or an applet sandbox in a browser, there would typically be a security manager present and you would not be allowed these kinds of shenanigans, so this is less of a terrible security hole than it seems.
Another related point: you can make use of the constant pool to improve the performance of string comparisons in some circumstances, by using the String.intern() method.
That method returns the instance of String from the String constant pool that has the same contents as the String on which it is invoked, adding it if it is not yet present. In other words, after using intern(), all Strings with the same contents are guaranteed to be the same String instance as each other and as any String constants with those contents, meaning you can then use the equality operator (==) on them.
This is just an example which is not very useful on its own, but it illustrates the point:
class Key {
Key(String keyComponent) {
this.keyComponent = keyComponent.intern();
}
public boolean equals(Object o) {
// String comparison using the equals operator allowed due to the
// intern() in the constructor, which guarantees that all values
// of keyComponent with the same content will refer to the same
// instance of String:
return (o instanceof Key) && (keyComponent == ((Key) o).keyComponent);
}
public int hashCode() {
return keyComponent.hashCode();
}
boolean isSpecialCase() {
// String comparison using equals operator valid due to use of
// intern() in constructor, which guarantees that any keyComponent
// with the same contents as the SPECIAL_CASE constant will
// refer to the same instance of String:
return keyComponent == SPECIAL_CASE;
}
private final String keyComponent;
private static final String SPECIAL_CASE = "SpecialCase";
}
This little trick isn't worth designing your code around, but it is worth keeping in mind for the day when you notice a little more speed could be eked out of some bit of performance sensitive code by using the == operator on a string with judicious use of intern().

strings and memory allocation in java?

One thing that I always wondered: if I have a method like this:
String replaceStuff (String plainText) {
return plainText.replaceAll("&", "&amp;");
}
will it create new String objects every time for the "&" and the "&amp;", which get destroyed by the GC and then recreated on the next call?
E.g.
would it in theory be better to do something like this
final String A ="&";
final String AMP ="&amp;";
String replaceStuff (String plainText) {
return plainText.replaceAll(A, AMP);
}
I think this is probably more of a theoretical question than a real-life problem; I am just curious how memory management is handled in this respect.
No. String literals are interned. Even if you use an equal literal (or other constant) from elsewhere, you'll still refer to the same object:
Object x = "hello";
Object y = "he" + "llo";
System.out.println(x == y); // Guaranteed to print true.
EDIT: The JLS guarantees this in section 3.10.5
String literals-or, more generally, strings that are the values of constant expressions (§15.28)-are "interned" so as to share unique instances, using the method String.intern.
Section 15.28 shows the + operator being included as an operation which can produce a new constant from two other constants.
Nope, they're literals and therefore automatically interned to the constant pool.
The only way you'd create new strings each time would be to do:
String replaceStuff (String plainText) {
return plainText.replaceAll(new String("&"), new String("&amp;"));
}
Strings are handled a little differently from normal objects by the GC.
For example, given
String a = "aaa";
String a1 = "aaa";
Now both a and a1 will point to the same String object in memory until one of the variables is reassigned. Hence there will be only 1 object in memory.
Also, if we change 'a' and 'a1' to point to any other string, the value "aaa" is still left in the string pool and will be used later by the JVM if required. The string is not GC'd.

Are Strings also static: String creation within Methods

I know that when a String literal is created at compile time, that String will be THE string used by any objects with that particular value.
String s = "foo"; // any other identical string literals will simply be references to this object
Does this hold for strings created during methods at runtime? I have some code where an object holds a piece of string data. The original code is something like
for(datum :data){
String a = datum.getD(); //getD is not doing anything but returning a field
String toAppend = new StringBuffer(a).append(stuff).toString();
someData = someObject.getMethod(a);
//do stuff
}
Since the String was already created in data, it seems better to just call datum.getD() instead of creating a string on every iteration of the loop.
Unless there's something I'm missing?
String instances are shared when they are the result of a compile-time constant expression. As a result, in the example below a and c will point to the same instance, but b will be a different instance, even though they all represent the same value:
String a = "hello";
String b = hell() + o();
String c = "hell" + "o";
public String hell() {
return "hell";
}
public String o() {
return "o";
}
You can explicitly intern the String however:
String b = (hell() + o()).intern();
In which case they'll all point to the same object.
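A related detail that I'm adding here, not taken from the answer itself: a final local variable initialized with a constant is itself a constant variable, so a concatenation that uses it is still folded at compile time. The class and method names in this sketch are hypothetical.

```java
/** Hypothetical demo of which concatenations are compile-time constants. */
final class ConstantExpressionDemo {
    private ConstantExpressionDemo() {}

    static boolean foldedWithFinal() {
        final String hell = "hell"; // constant variable: final, initialized with a constant
        String c = hell + "o";      // constant expression, so it refers to the pooled "hello"
        return c == "hello";
    }

    static boolean concatenatedAtRunTime() {
        String hell = "hell";       // not final, so not a constant variable
        String b = hell + "o";      // evaluated at run time: a newly created String
        return b == "hello";
    }

    static boolean internedAtRunTime() {
        String hell = "hell";
        String b = (hell + "o").intern(); // explicitly mapped to the pooled instance
        return b == "hello";
    }

    public static void main(String[] args) {
        System.out.println(foldedWithFinal());       // true
        System.out.println(concatenatedAtRunTime()); // false
        System.out.println(internedAtRunTime());     // true
    }
}
```

The only difference between the first two methods is the final modifier, which is what decides whether the compiler may treat the expression as a constant.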
The line
String a = datum.getD();
means, assign the result of evaluating datum.getD() to the reference a . It doesn't create a new String.
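You are correct that strings are immutable, so one string object can safely be shared by many references. Note, though, that equal values are only guaranteed to be the same object for compile-time constants and interned strings.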
As far as being static, I do not think Strings are static in the way you describe. The Class class is like that, but I think it is the only object that does that.
I think it would be better to just call datum.getD(), since pulling it out into its own String variable gains you nothing.
If you do use datum.getD() several times in the loop, then it might make sense to pull the value into a local String variable, because holding the reference costs less than calling the getD() function multiple times.

What is String pool in Java? [duplicate]

This question already has answers here:
What is the Java string pool and how is "s" different from new String("s")? [duplicate]
(5 answers)
Closed 9 years ago.
I am confused about StringPool in Java. I came across this while reading the String chapter in Java. Please help me understand, in layman terms, what StringPool actually does.
This prints true (even though we don't use the equals method, which is the correct way to compare strings):
String s = "a" + "bc";
String t = "ab" + "c";
System.out.println(s == t);
When the compiler optimizes your string literals, it sees that both s and t have the same value, and thus you need only one string object. It's safe because String is immutable in Java.
As a result, both s and t point to the same object and a little memory is saved.
The name 'string pool' comes from the idea that all already-defined strings are stored in some 'pool', and before creating a new String object the compiler checks whether such a string is already defined.
I don't think it actually does much; it looks like it's just a cache for string literals. If you have multiple Strings whose values are the same, they'll all point to the same string literal in the string pool.
String s1 = "Arul"; //case 1
String s2 = "Arul"; //case 2
In case 1, the literal "Arul" is newly created and kept in the pool. In case 2, s2 refers to the same pooled object as s1; no new one is created.
if(s1 == s2) System.out.println("equal"); //Prints equal.
String n1 = new String("Arul");
String n2 = new String("Arul");
if(n1 == n2) System.out.println("equal"); //No output.
http://p2p.wrox.com/java-espanol/29312-string-pooling.html
Let's start with a quote from the virtual machine spec:
Loading of a class or interface that contains a String literal may create a new String object (§2.4.8) to represent that literal. This may not occur if a String object has already been created to represent a previous occurrence of that literal, or if the String.intern method has been invoked on a String object representing the same string as the literal.
This may not occur - this is a hint that there's something special about String objects. Usually, invoking a constructor will always create a new instance of the class. This is not the case with Strings, especially when String objects are 'created' with literals. Those Strings are stored in a global store (pool), or at least the references are kept in a pool, and whenever a new instance of an already known String is needed, the VM returns a reference to the object from the pool. In pseudo code, it may go like this:
1: a := "one"
--> if(pool[hash("one")] == null) // true
pool[hash("one") --> "one"]
return pool[hash("one")]
2: b := "one"
--> if(pool[hash("one")] == null) // false, "one" already in pool
pool[hash("one") --> "one"]
return pool[hash("one")]
So in this case, variables a and b hold references to the same object, and (a == b) && (a.equals(b)) is true.
This is not the case if we use the constructor:
1: a := "one"
2: b := new String("one")
Again, "one" is created in the pool, but then we create a new instance from the same literal. In this case a == b is false while a.equals(b) is still true, so (a == b) && (a.equals(b)) is false.
So why do we have a String pool? Strings, and especially String literals, are widely used in typical Java code. And they are immutable. Being immutable allows Strings to be cached to save memory and increase performance (less effort for creation, less garbage to be collected).
As programmers we don't have to care much about the String pool, as long as we keep in mind:
(a == b) && (a.equals(b)) may be true or false (always use equals to compare Strings)
Don't use reflection to change the backing char[] of a String (as you don't know who is actually using that String)
When the JVM loads classes, or otherwise sees a literal string, or some code interns a string, it adds the string to a mostly-hidden lookup table that has one copy of each such string. If another copy is added, the runtime arranges it so that all the literals refer to the same string object. This is called "interning". If you say something like
String s = "test";
return (s == "test");
it'll return true, because the first and second "test" are actually the same object. Comparing interned strings this way can be much, much faster than String.equals, as there's a single reference comparison rather than a bunch of char comparisons.
You can add a string to the pool by calling String.intern(), which will give you back the pooled version of the string (which could be the same string you're interning, but you'd be crazy to rely on that -- you often can't be sure exactly what code has been loaded and run up until now and interned the same string). The pooled version (the string returned from intern) will be equal to any identical literal. For example:
String s1 = "test";
String s2 = new String("test"); // "new String" guarantees a different object
System.out.println(s1 == s2); // should print "false"
s2 = s2.intern();
System.out.println(s1 == s2); // should print "true"
