Does the String pool in Java behave like an LRU cache?

Strings are immutable and are managed in the String pool. I would like to know how this pool is managed. If an application uses a large number of String literals (I understand that StringBuilder should be used when modifications such as append or replace are frequent), the pool improves performance by reusing the String objects already present in it instead of creating new ones again and again. This is possible because Strings are immutable, so sharing them has no ill effect.
My question is how this String pool is managed. Suppose some 'k' Strings are used very frequently, while a few other String objects are created once and never used again. Newer String literals may also come into use.
In cases like these, does the String pool behave like an LRU cache, holding
the references to the most recently used literals and removing the older,
unused strings from the pool?
Does the String pool have a size, and can we control it in our application?
Edit:
We usually give a size to the custom object pools we implement. I wonder why a feature like LRU is not there for String pools. With it, even large Strings would not be a problem. I assume it is absent for some valid reason, i.e. that having it would have caused some ill effects. If someone could shed some light on those ill effects, that would be good.

The String pool is not an LRU cache, since entries aren't removed unless they are garbage collected.
There are two ways to get entries into the String pool. String literals go there automatically, and new entries can be added with String.intern(); if an equal String already exists in the pool, a reference to the pooled one is returned instead.
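The two entry paths can be observed with a reference comparison. The following is a minimal sketch; the class and method names are made up for illustration.

```java
final class PoolEntryDemo {
    /** Returns true when both references point at the very same object. */
    static boolean sameInstance(String a, String b) {
        return a == b;
    }

    public static void main(String[] args) {
        String literal = "pooled";               // literal: enters the pool automatically
        String copy = new String("pooled");      // explicit copy: a distinct object on the heap
        String interned = copy.intern();         // equal entry already pooled: that one is returned

        System.out.println(sameInstance(literal, copy));      // false
        System.out.println(sameInstance(literal, interned));  // true
        System.out.println(literal.equals(copy));             // true
    }
}
```

Only the reference returned by intern() is the pooled one; the copy itself stays an ordinary heap object.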
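Entries are garbage collected once there are no more references to them. For String literals (e.g. String constants) that is harder to reach than for Strings that were intern()ed, since a literal stays referenced by the class that declares it for as long as that class is loaded.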
The implementation has changed a lot between Java 6 and Java 8 (and even between minor versions). The default size of the String pool is apparently 1009, but it can be changed with the -XX:StringTableSize=N parameter (since Java 7). This size is the table size of an internal hash table, so it can be tuned higher if you call intern() a lot (for String literals alone, it should be plenty). The size affects only the speed of the intern() call, not the number of Strings you can intern.
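To check whether a running JVM was started with an explicit table size, the startup arguments can be scanned. This is only a sketch: StringTableSizeFlag and find are hypothetical names, values with a unit suffix are not handled, and keeping the last occurrence of a repeated flag is an assumption.

```java
import java.lang.management.ManagementFactory;
import java.util.List;
import java.util.OptionalLong;

final class StringTableSizeFlag {
    private static final String PREFIX = "-XX:StringTableSize=";

    /** Scans JVM arguments for -XX:StringTableSize=N and keeps the last plain numeric value found. */
    static OptionalLong find(List<String> jvmArgs) {
        OptionalLong result = OptionalLong.empty();
        for (String arg : jvmArgs) {
            if (arg.startsWith(PREFIX)) {
                try {
                    result = OptionalLong.of(Long.parseLong(arg.substring(PREFIX.length())));
                } catch (NumberFormatException ignored) {
                    // Not a plain number (e.g. a unit suffix); this sketch skips such values.
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The arguments the current JVM was started with.
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        OptionalLong size = find(jvmArgs);
        System.out.println(size.isPresent()
                ? "StringTableSize explicitly set to " + size.getAsLong()
                : "StringTableSize not set explicitly; the JVM default applies");
    }
}
```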
Basically unless you're using intern() heavily (presumably for a good reason), there's very little reason to worry about the String pool. Especially since it's no longer stored in PermGen, so it can't cause OutOfMemoryErrors very easily anymore.
Source.

Related

When will the new String() object in memory get cleared after invoking the intern() method?

List<String> list = new ArrayList<>();
for (int i = 0; i < 1000; i++)
{
    StringBuilder sb = new StringBuilder();
    String string = sb.toString();   // new String object on the heap
    string = string.intern();        // replace it with the canonical instance
    list.add(string);
}
In the above sample, after invoking string.intern() method, when will the 1000 objects created in heap (sb.toString) be cleared?
Edit 1:
If there is no guarantee that these objects will be cleared, and assuming GC hasn't run, is it pointless to use string.intern() at all (in terms of memory usage)?
Is there any way to reduce memory usage / object creation while using intern() method?

Your example is a bit odd, as it creates 1000 empty strings. If you want to get such a list while consuming minimal memory, you should use
List<String> list = Collections.nCopies(1000, "");
instead.
If we assume that there is something more sophisticated going on, not creating the same string in every iteration, well, then there is no benefit in calling intern(). What will happen, is implementation dependent. But when calling intern() on a string that is not in the pool, it will be just added to the pool in the best case, but in the worst case, another copy will be made and added to the pool.
At this point, we have no savings yet, but potentially created additional garbage.
Interning at this point can only save you some memory, if there are duplicates somewhere. This implies that you construct duplicate strings first, to look up their canonical instance via intern() afterwards, so having the duplicate string in memory until garbage collected, is unavoidable. But that’s not the real problem with interning:
In older JVMs, interned strings received special treatment that could result in worse garbage collection performance or even in running out of resources (i.e. the fixed-size “PermGen” space).
In HotSpot, the string pool holding the interned strings is a fixed-size hash table, which yields hash collisions, and hence poor performance, when it references significantly more strings than the table size.
Before Java 7, update 40, the default size was about 1,000, not even sufficient to hold all string constants for any nontrivial application without hash collisions, not to speak of manually added strings. Later versions use a default size of about 60,000, which is better, but still a fixed size that should discourage you from adding an arbitrary number of strings.
The string pool has to obey the inter-thread semantics mandated by the language specification (as it is used for string literals), hence it needs to perform thread-safe updates that can degrade performance.
Keep in mind that you pay the price of the disadvantages named above, even in the cases that there are no duplicates, i.e. there is no space saving. Also, the acquired reference to the canonical string has to have a much longer lifetime than the temporary object used to look it up, to have any positive effect on the memory consumption.
The latter touches your literal question. The temporary instances are reclaimed when the garbage collector runs the next time, which will be when the memory is actually needed. There is no need to worry about when this will happen, but well, yes, up to that point, acquiring a canonical reference had no positive effect, not only because the memory hasn’t been reused up to that point, but also, because the memory was not actually needed until then.
This is the place to mention the new String Deduplication feature. This does not change string instances, i.e. the identity of these objects, as that would change the semantics of the program, but changes identical strings to use the same char[] array. Since these character arrays are the biggest payload, this may still achieve great memory savings, without the performance disadvantages of using intern(). Since this deduplication is done by the garbage collector, it will only be applied to strings that survived long enough to make a difference. Also, this implies that it will not waste CPU cycles when there is still plenty of free memory.
However, there might be cases, where manual canonicalization might be justified. Imagine, we’re parsing a source code file or XML file, or importing strings from an external source (Reader or data base) where such canonicalization will not happen by default, but duplicates may occur with a certain likelihood. If we plan to keep the data for further processing for a longer time, we might want to get rid of duplicate string instances.
In this case, one of the best approaches is to use a local map, not being subject to thread synchronization, dropping it after the process, to avoid keeping references longer than necessary, without having to use special interaction with the garbage collector. This implies that occurrences of the same strings within different data sources are not canonicalized (but still being subject to the JVM’s String Deduplication), but it’s a reasonable trade-off. By using an ordinary resizable HashMap, we also do not have the issues of the fixed intern table.
E.g.
static List<String> parse(CharSequence input) {
    List<String> result = new ArrayList<>();
    Matcher m = TOKEN_PATTERN.matcher(input);
    CharBuffer cb = CharBuffer.wrap(input);
    HashMap<CharSequence,String> cache = new HashMap<>();
    while(m.find()) {
        result.add(
            cache.computeIfAbsent(cb.subSequence(m.start(), m.end()), Object::toString));
    }
    return result;
}
Note the use of the CharBuffer here: it wraps the input sequence and its subSequence method returns another wrapper with different start and end index, implementing the right equals and hashCode method for our HashMap, and computeIfAbsent will only invoke the toString method, if the key was not present in the map before. So, unlike using intern(), no String instance will be created for already encountered strings, saving the most expensive aspect of it, the copying of the character arrays.
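To try the method out, it needs a concrete TOKEN_PATTERN, which the snippet above leaves undefined. The runnable sketch below assumes a simple word pattern (\w+) purely for illustration; the class name is hypothetical too.

```java
import java.nio.CharBuffer;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LocalCacheParseDemo {
    // Assumed token definition; the original snippet does not say what TOKEN_PATTERN is.
    static final Pattern TOKEN_PATTERN = Pattern.compile("\\w+");

    static List<String> parse(CharSequence input) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN_PATTERN.matcher(input);
        CharBuffer cb = CharBuffer.wrap(input);
        Map<CharSequence, String> cache = new HashMap<>();   // local cache, dropped on return
        while (m.find()) {
            // toString() runs only for tokens not seen before.
            result.add(cache.computeIfAbsent(cb.subSequence(m.start(), m.end()), Object::toString));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = parse("to be or not to be");
        System.out.println(tokens);                           // [to, be, or, not, to, be]
        System.out.println(tokens.get(0) == tokens.get(4));   // true: both "to" share one instance
        System.out.println(tokens.get(1) == tokens.get(5));   // true: both "be" share one instance
    }
}
```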
If we have a really high likelihood of duplicates, we may even save the creation of wrapper instances:
static List<String> parse(CharSequence input) {
    List<String> result = new ArrayList<>();
    Matcher m = TOKEN_PATTERN.matcher(input);
    CharBuffer cb = CharBuffer.wrap(input);
    HashMap<CharSequence,String> cache = new HashMap<>();
    while(m.find()) {
        cb.limit(m.end()).position(m.start());
        String s = cache.get(cb);
        if(s == null) {
            s = cb.toString();
            cache.put(CharBuffer.wrap(s), s);
        }
        result.add(s);
    }
    return result;
}
This creates only one wrapper per unique string, but also has to perform one additional hash lookup for each unique string when putting. Since the creation of a wrapper is quite cheap, you really need a significantly large number of duplicate strings, i.e. a small number of unique strings compared to the total number, to benefit from this trade-off.
As said, these approaches are very efficient, because they use a purely local cache that is just dropped afterwards. With this, we don’t have to deal with thread safety nor interact with the JVM or garbage collector in a special way.

You can open JMC and, under the Memory tab inside the MBean Server view of the particular JVM, check when GC ran and how much it cleared. Still, there is no guarantee about when it will be called. You can trigger a GC from Diagnostic Commands on a specific JVM.
Hope it helps.

String Pool management

Strings are immutable objects and are stored in the String Pool. Suppose that in an application none of the strings are created using the new operator. Is it still necessary in this case to use the equals method instead of == for String equality checks?
I feel the answer to the above question is probably yes and it has something to do with String Pool size.
How is the String Pool managed? Memory is limited, so I feel the String pool also has a definite size. Does it work like an LRU cache, discarding the least used Strings when the pool is full?
Please provide your valuable inputs.
My question is not about the size of the string pool. My question is whether, if none of the strings are created using the new operator, using == will always be safe. Is this statement correct, or can it happen that even in this case two string references having the same characters compare as false? I know that design-wise I should always use the equals method, but I just want to know what the language specification says.

Strings are immutable objects and are stored in the String Pool. Suppose that in an application none of the strings are created using the new operator. Is it still necessary in this case to use the equals method instead of == for String equality checks?
If you always use equals() you never need to worry about the answer to this question, but unless you only plan on comparing string literals the situation can never possibly arise.
I feel the answer of above question is probably yes
Correct.
and it has something to do with String Pool size.
No.
How is the String Pool managed? Memory is limited so I feel String pool also has a definite size.
No.
Does it work like LRU cache, discarding the least used Strings when the pool is full?
No, but Strings that have been intern()-ed can be garbage-collected from the pool.
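A small sketch of why == is unsafe even when the source never uses new: a String produced by run-time concatenation is a fresh object, while only literals and compile-time constant expressions are pooled automatically. The class and method names are illustrative.

```java
final class EqualityWithoutNew {
    /** Run-time concatenation: the result is not a compile-time constant. */
    static String join(String left, String right) {
        return left + right;
    }

    public static void main(String[] args) {
        String literal = "ab";
        String constant = "a" + "b";        // constant expression, folded by the compiler
        String runtime = join("a", "b");    // built at run time, no `new` in the source

        System.out.println(literal == constant);          // true
        System.out.println(literal == runtime);           // false
        System.out.println(literal.equals(runtime));      // true
        System.out.println(literal == runtime.intern());  // true
    }
}
```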

Java String intern By Default

In C# I would have to explicitly call String.Intern(string) in order to add a string to the intern pool.
Does Java have the same idea conceptually? Is the expectation that those dealing with frequently repeated strings use the intern pool for accessing and resolving strings?

Java makes short lived objects pretty cheap. Java 8 can eliminate them entirely. Interning them is fairly expensive and could slow down an application if not used with care.
For longer-lived objects there are plans to make the char[] which the String refers to "interned" on a GC. The String object itself cannot be interned automagically, as this might change behaviour.

String Constant Pool memory sector and garbage collection

I read this question on the site, How is the java memory pool divided?, and I was wondering to which of these sectors the "String Constant Pool" belongs.
Also, do the String literals in the pool ever get GCed?
The intern() method returns the reference to the String literal held in the pool.
If the pool does get GCed, wouldn't that be counter-productive to the idea of the string pool? New String literals would be created again, nullifying the GC.
(This assumes that only a specific set of literals exists in the pool, that they never become obsolete, and that sooner or later they will be needed again.)

As far as I know String literals end up in the "Perm Gen" part of non-Heap JVM memory. Perm Gen space is only examined during Full GC runs (not Partials).
In early JVMs (and I confess I had to look this up because I wasn't sure), String literals in the String Pool never got GC'ed. In the newer JVMs, WeakReferences are used to reference the Strings in the pool, so interned Strings can actually get GC'ed, but only during Full Garbage collections.

Reading the JavaDoc for String.intern() doesn't give hints to the implementation, but according to this page, the interned strings are held by a weak reference. This means that if the GC detects that there are no references to the interned string except for the repository that holds interned strings then it is allowed to collect them. Of course this is transparent to external code so unless you are using weak references of your own you'll never know about the garbage collection.

String pooling
String pooling (sometimes also called string canonicalisation) is a
process of replacing several String objects with equal value but
different identity with a single shared String object. You can achieve
this goal by keeping your own Map (with possibly soft
or weak references depending on your requirements) and using map
values as canonicalised values. Or you can use String.intern() method
which is provided to you by JDK.
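A minimal sketch of the "own Map with weak references" option, assuming a WeakHashMap whose values are wrapped in WeakReference as well (a strongly held value would keep its own key alive). WeakStringPool and canonicalize are hypothetical names, not part of the JDK.

```java
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.WeakHashMap;

final class WeakStringPool {
    // Keys are held weakly by WeakHashMap; the value is weak too, because it is the same object as the key.
    private final Map<String, WeakReference<String>> pool = new WeakHashMap<>();

    /** Returns the canonical instance equal to s, registering s if none is known yet. */
    synchronized String canonicalize(String s) {
        WeakReference<String> ref = pool.get(s);
        String canonical = (ref == null) ? null : ref.get();
        if (canonical == null) {
            canonical = s;
            pool.put(s, new WeakReference<>(s));
        }
        return canonical;
    }

    public static void main(String[] args) {
        WeakStringPool pool = new WeakStringPool();
        String a = new String("token");
        String b = new String("token");
        System.out.println(a == b);                                         // false
        System.out.println(pool.canonicalize(a) == pool.canonicalize(b));   // true
    }
}
```

Entries become collectable once the program no longer holds the canonical instance, so the pool does not pin strings the way a plain HashMap would.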
In the days of Java 6, using String.intern() was forbidden by many
standards because of the high risk of an OutOfMemoryError if
pooling went out of control. The Oracle Java 7 implementation of string
pooling was changed considerably. You can look for details in
http://bugs.sun.com/view_bug.do?bug_id=6962931 and
http://bugs.sun.com/view_bug.do?bug_id=6962930.
String.intern() in Java 6
In those good old days all interned strings were stored in the PermGen
– the fixed size part of heap mainly used for storing loaded classes
and string pool. Besides explicitly interned strings, PermGen string
pool also contained all literal strings earlier used in your program
(the important word here is used – if a class or method was never
loaded/called, any constants defined in it will not be loaded).
The biggest issue with such string pool in Java 6 was its location –
the PermGen. PermGen has a fixed size and can not be expanded at
runtime. You can set it using -XX:MaxPermSize=96m option. As far as I
know, the default PermGen size varies between 32M and 96M depending on
the platform. You can increase its size, but its size will still be
fixed. Such limitation required very careful usage of String.intern –
you’d better not intern any uncontrolled user input using this method.
That’s why string pooling at times of Java 6 was mostly implemented in
the manually managed maps.
String.intern() in Java 7
Oracle engineers made an extremely important change to the string
pooling logic in Java 7 – the string pool was relocated to the heap.
It means that you are no longer limited by a separate fixed size
memory area. All strings are now located in the heap, as most of other
ordinary objects, which allows you to manage only the heap size while
tuning your application. Technically, this alone could be a sufficient
reason to reconsider using String.intern() in your Java 7 programs.
But there are other reasons.
String pool values are garbage collected
Yes, all strings in the JVM string pool are eligible for garbage
collection if there are no references to them from your program roots.
It applies to all discussed versions of Java. It means that if your
interned string went out of scope and there are no other references to
it – it will be garbage collected from the JVM string pool.
Being eligible for garbage collection and residing in the heap, the JVM
string pool seems to be the right place for all your strings, doesn't it?
In theory it is true: unused strings will be garbage collected from
the pool, and used strings will let you save memory whenever you
get an equal string from the input. Seems to be a perfect memory
saving strategy? Nearly so. You must know how the string pool is
implemented before making any decisions.
source.

String literals don't get created into the pool at runtime. I don't know for sure if they get GC'd or not, but I suspect that they do not for two reasons:
It would be immensely complex to detect in the general case when a literal will not be used anymore
There is likely a static code segment where it is stored for performance. The rest of the data is likely built around it, where the boundaries are also static

Strings, even though they are immutable, are still objects like any other in Java. Objects are created on the heap and Strings are no exception. So, Strings that are part of the "String Literal Pool" still live on the heap, but they have references to them from the String Literal Pool.
For more, please refer to this link:
http://www.javaranch.com/journal/200409/ScjpTipLine-StringsLiterally.html
Edit:
public class ImmutableStrings
{
    public static void main(String[] args)
    {
        String one = "someString";
        String two = new String("someString");
        one = two = null;
    }
}
Just before the main method ends, how many objects are available for garbage collection? 0? 1? 2?
The answer is 1. Unlike most objects, String literals always have a reference to them from the String Literal Pool, so they are not eligible for garbage collection.
Although neither of our local variables, one or two, refers to the literal's String object any longer, there is still a reference to it from the String Literal Pool. Therefore, the object is not eligible for garbage collection. The object is always reachable through use of the intern() method.

Why is string.intern() so slow?

Before anyone questions the fact of using string.intern() at all, let me say that I need it in my particular application for memory and performance reasons. [1]
So, until now I used String.intern() and assumed it was the most efficient way to do it. However, I have noticed for ages that it is a bottleneck in the software. [2]
Then, just recently, I tried to replace String.intern() with a huge map in which I put/get the strings in order to obtain a unique instance each time. I expected this would be slower... but it was exactly the opposite! It was tremendously faster! Replacing intern() with putting into and getting from a map (which achieves exactly the same) made it more than one order of magnitude faster.
The question is: why is intern() so slow?!? Why isn't it then simply backed up by a map (or actually, just a customized set) and would be tremendously faster? I'm puzzled.
[1]: For the unconvinced ones: it is natural language processing that has to process gigabytes of text, so it needs to avoid many instances of the same string to keep memory from blowing up, and it needs referential string comparison to be fast enough.
[2]: without it (normal strings) it is impossible, with it, this particular step remains the most computation intensive one
EDIT:
Due to the surprising interest in this post, here is some code to test it out:
http://pastebin.com/4CD8ac69
And the results of interning a bit more than 1 million strings:
HashMap: 4 seconds
String.intern(): 54 seconds
To rule out warm-up, OS I/O caching and similar effects, the experiment was repeated with the order of the two benchmarks inverted:
String.intern(): 69 seconds
HashMap: 3 seconds
As you can see, the difference is very noticeable, more than tenfold. (Using OpenJDK 1.6.0_22 64 bits ...but using the Sun one gave similar results, I think.)

This article discusses the implementation of String.intern(). In Java 6 and 7, the implementation used a fixed-size hash table (1009 buckets), so as the number of entries grew, lookups degraded to O(n). The fixed size can be changed using -XX:StringTableSize=N. Apparently, in Java 8 the default size is larger, but the issue remains.
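A back-of-the-envelope calculation shows why a fixed table degrades: with n entries spread evenly over b buckets, each lookup walks a chain of about n / b entries. The sketch below uses the roughly one million strings from the question and the 1009-bucket default; the 60,000 figure is the approximate later default quoted elsewhere on this page, and the even spread is an idealising assumption.

```java
final class StringTableLoad {
    /** Average entries per bucket, assuming hashes spread evenly over a fixed number of buckets. */
    static double averageChainLength(long entries, long buckets) {
        return (double) entries / buckets;
    }

    public static void main(String[] args) {
        long entries = 1_000_000;
        System.out.printf("1009 buckets:  %.1f entries per bucket%n", averageChainLength(entries, 1009));
        System.out.printf("60000 buckets: %.1f entries per bucket%n", averageChainLength(entries, 60_000));
    }
}
```

That is roughly 991 entries per bucket against roughly 17, which is why raising the table size helps but does not remove the linear growth.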

Most likely reason for the performance difference: String.intern() is a native method, and calling a native method incurs massive overhead.
So why is it a native method? Probably because it uses the constant pool, which is a low-level VM construct.

@Michael Borgwardt said this in a comment:
intern() is not synchronized, at least at the Java language level.
I think that you mean that the String.intern() method is not declared as synchronized in the sourcecode of the String class. And indeed, that is a true statement.
However:
Declaring intern() as synchronized would only lock the current String instance, because it is an instance method, not a static method. So they couldn't implement string pool synchronization that way.
If you step back and think about it, the string pool has to perform some kind of internal synchronization. If it didn't it would be unusable in a multi-threaded application, because there is simply no practical way for all code that uses the intern() method to do external synchronization.
So, the internal synchronization that the string pool performs could be a bottleneck in multi-threaded application that uses intern() heavily.

I can't speak from any great experience with it, but from the String docs:
"When the intern method is invoked, if the pool already contains a string equal to this String object as determined by the {@link #equals(Object)} method, then the string from the pool is returned. Otherwise, this String object is added to the pool and a reference to this String object is returned."
When dealing with large numbers of objects, any solution involving hashing will outperform one that doesn't. I think you're just seeing the result of misusing a Java language feature. Interning isn't there to act as a Map of strings for your use. You should use a Map for that (or Set, as appropriate). The String table is for optimization at the language level, not the app level.
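An application-level interner backed by an ordinary Map could look like the sketch below. MapInterner is a hypothetical name; this version is not thread-safe and holds strong references, so every string stays reachable until the interner itself is dropped.

```java
import java.util.HashMap;
import java.util.Map;

final class MapInterner {
    private final Map<String, String> canonical = new HashMap<>();

    /** Returns the first instance seen for each distinct value. */
    String intern(String s) {
        String existing = canonical.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    int size() {
        return canonical.size();
    }

    public static void main(String[] args) {
        MapInterner interner = new MapInterner();
        String[] words = "the cat saw the other cat".split(" ");
        for (int i = 0; i < words.length; i++) {
            words[i] = interner.intern(words[i]);
        }
        System.out.println(words[0] == words[3]);   // true: both "the" are now one instance
        System.out.println(interner.size());        // 4 distinct words
    }
}
```

Because HashMap resizes as it fills, lookups stay close to constant time regardless of how many strings are added.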
