When to avoid string interning


I've started looking into string interning, and it seems like a great feature. However, I haven't found a good reason why you would want to create a string using the String constructor. After some digging I came up with the following. Could someone confirm (or deny) whether this is a valid reason to create a string with new?
Say you have 2 strings:
String novel = "The contents of a very long novel...";
String page = new String("The contents of a single page...");
By default all string literals are stored in the string pool (such as with String novel) and by default all sub-strings of novel will be interned (assuming they are created as a string literal) to optimize memory allocation. Creating a string using the new keyword results in the string being created on the heap rather than in string pool. A particular case when you may want to avoid interning is if you wanted to create a string that is a sub-string of a very large string literal (such as page).
For example, say you had a very large string literal (e.g. the contents of a novel) of which you only wanted to process a portion (e.g. a single page). It may be beneficial to use the string constructor (via the new keyword) when creating the string that contains only a single page of the novel. That way the very large string may be freed from the string pool sooner, keeping only the string that contains the page on the heap. In contrast, if you created a string literal that is an interned sub-string of the entire novel, a larger part of the novel may be kept alive in the string pool even though only a small portion of it is needed.

TL;DR: There is no good / valid reason to new a String in a modern JVM, or to call String.intern() explicitly.
Your question contains false statements of fact, and that means that the conclusions that you are drawing are incorrect.
By default all string literals are stored in the string pool (such as with String novel)
That is correct, though it is not "by default". (It is like saying "by default a square has 4 sides". Squares have 4 sides, period. There are no exceptions. And no defaults.)
and by default all sub-strings of novel will be interned (assuming they are created as a string literal) to optimize memory allocation.
Incorrect.
A String created by the String.substring() method is NOT interned. Not in current Java releases, or (AFAIK) in any previous release. (But see below.)
Creating a string using the new keyword results in the string being created on the heap rather than in string pool.
Correct.
A particular case when you may want to avoid interning is if you wanted to create a string that is a sub-string of a very large string literal (such as page).
Incorrect.
I think you are confusing "interning" with something else.
Actually, in a modern JVM you always want to avoid interning. It is expensive, and it causes string objects to be (artificially) kept for longer than they need to be.
In fact, the only real reason that interning is still a thing is that it is necessary to guarantee certain semantic properties specified in the JLS about compile-time constant strings.
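A modern JVM can deduplicate strings in the garbage collector for strings that live long enough (for example, the G1 collector's string deduplication, available since Java 8u20 and enabled with -XX:+UseStringDeduplication). This happens transparently to the application ... and in cases where it is likely to be beneficial.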
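To make the points above concrete, here is a minimal sketch that compares object identity for a literal, a new String, a substring and an explicitly interned string. The class name InternDemo, the helper sameObject and the sample text are made up for illustration. The identity result for substring reflects what HotSpot-based JVMs do in practice rather than something the specification promises.

```java
final class InternDemo {
    // Returns true when both references point at the very same object.
    static boolean sameObject(String a, String b) {
        return a == b;
    }

    public static void main(String[] args) {
        String novel = "The contents of a very long novel...";
        String literal = "The";                       // compile-time constant: pooled
        String viaNew = new String("The");            // always a fresh heap object
        String viaSubstring = novel.substring(0, 3);  // computed at run time: not interned

        System.out.println(sameObject(literal, "The"));                 // same pooled object
        System.out.println(sameObject(literal, viaNew));                // different objects
        System.out.println(sameObject(literal, viaSubstring));          // different objects
        System.out.println(sameObject(literal, viaSubstring.intern())); // intern() hands back the pooled one
        System.out.println(literal.equals(viaSubstring));               // contents are equal regardless
    }
}
```

In everyday code you would compare strings with equals(); the == comparisons here exist only to expose which objects are shared.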
Historic note.
In some old JVMs, there used to be a good reason to call new String in conjunction with substring. The problem was that the substring method had a "clever optimization" whereby it created substrings that shared the backing char[] with the original string [1]. This meant that references to (small) substrings could keep the (large) backing array reachable. It was a subtle kind of memory leak. You could avoid the leak by using new.
However:
The optimization was NOT interning. The substrings were created in the regular heap, and they did not have the semantics of interned strings.
The problem only affected certain String use-cases. And in practice they didn't involve large String literals.
The problem was solved long ago. The String.substring now creates a new String with its own backing array.
In summary, using new String might have been a good idea in some cases with old Java versions, but it isn't anymore. It was fixed in Java 7.
1 - Interestingly, the source code for String describes this as a speed optimization rather than a space optimization.

Related

How many Strings are formed? [duplicate]

String a="hello";
String b=a+"Bye";
How many Strings are formed?
From my understanding of Java.
What happens in this code is:
String a="hello"; // hello is created in string pool
String b=a+"bye"; // new StringBuilder(a).append("bye")
So totally 2 strings are to be created, right?
1.Hello
2.HelloBye (In the Heap)
Or does Java create 3?
1.Hello
2.Bye
3.HelloBye
If this is the case, does append method create the appending strings in the string pool?

String a = "hello";
JVM will create one string in the string pool. (FIRST STRING IN POOL)
Now, here comes the tricky part:
b = a + "bye";
Internally, the + operator uses a StringBuilder for concatenating strings.
String b= new StringBuilder(a).append("bye").toString(); (The toString() method of StringBuilder is returning a new String which will be definitely in the Heap since it is created with new String(...). So "bye" will be SECOND STRING IN POOL.)
Now,
b="hellobye" ("hellobye" will be THIRD STRING IN POOL)

First string "hello" is created and added to the string pool.
Next, the String "Bye" is created and added to the string pool.
The concatenation of a and "Bye" results in a new String "helloBye",
which is also added to the string pool.
A total of 3 Strings will be created in the pool: "hello", "Bye",
and "helloBye".
When you create a new StringBuilder and append a string to it, the resulting string will not be added to the string pool. Instead, a new String object will be created in the heap memory to represent the combined string.
So, the code new StringBuilder(a).append("bye") will create one new String object in the heap memory to represent the combined string and one string in pool for "a".

The only part of your question that can be answered with complete certainty is this:
Does append method create the appending strings in the string pool?
The answer is No. The result of a string concatenation that is not a constant expression is not placed in the string pool. At least not in any implementation of mainstream Java to date. However, there is no specification that actually guarantees this.
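A small experiment can illustrate the difference between a run-time concatenation and a constant expression. This is only a sketch: the class name ConcatDemo and the helper concatAtRuntime are invented for the example, and the identity checks show what current mainstream JVMs do.

```java
final class ConcatDemo {
    // The parameter is not a compile-time constant, so this concatenation happens at run time.
    static String concatAtRuntime(String prefix) {
        return prefix + "Bye";
    }

    public static void main(String[] args) {
        String a = "hello";
        String b = concatAtRuntime(a);

        final String c = "hello"; // a constant variable
        String d = c + "Bye";     // constant expression: folded by the compiler into "helloBye"

        System.out.println(b.equals("helloBye")); // contents are equal
        System.out.println(b == "helloBye");      // identity check against the pooled literal
        System.out.println(d == "helloBye");      // constant expression refers to the pooled literal
        System.out.println(b.intern() == d);      // explicit interning yields the pooled object
    }
}
```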
There are a couple of reasons why we don't know for sure how many strings are "formed".
We don't know when the String objects corresponding to the literals are actually created. In some Java implementations they will be created (and interned) when the code is loaded. In others, the string creation could occur the first time this code is run.
We don't know whether one or both of those literals are used by another class ... and hence whether this code is "forming" them.
Depending on the Java implementation, interning a string (to put it in the string pool) may result in a new String object being created. So you might get a scenario where two String objects get "formed" for each literal.
In short there is enough ambiguity that we cannot be 100% sure of the precise number of strings that are created during the execution of that code.
Does it matter that we don't know for sure?
Frankly, no. It should make zero difference to the way that you write your code [1]. Let the Java compiler and runtime take care of it ... and use a recent version of Java to get the benefit of the work they have done on optimizing this.
1 - But it is still wise to avoid string concatenation loops. I don't know if they can be optimized.
In your commented version you wrote:
String a = "hello"; // hello is created in string pool
String b = a + "bye"; // new StringBuilder(a).append("bye")
Both of those comments are questionable:
The "hello is created in string pool" comment is questionable for reasons that I gave above.
The new StringBuilder(a).append("bye") pseudo-code is questionable because that is an implementation detail. In Java 9 and later, expressions that involve string concatenations are translated to an invokedynamic bytecode. The JIT compiler generates native instructions directly. See How much does Java optimize string concatenation with +? for more information.

Is there any scenario where character array is better than Strings in Java

I feel strings can replace character array in all the scenarios. Even considering the immutability characteristic of Strings, declaration of strings in appropriate scope and java's garbage collection feature should help us avoid any memory leaks. I want to know if there is any corner case where character array should be used instead of Strings in Java.

Character arrays have a slight advantage over plain strings when it comes to storing security-sensitive data. There are many resources on that, for example this question: Why is char[] preferred over String for passwords? (with an answer by Jon Skeet himself).
In general it boils down to two things:
You have very little influence on how long a String stays in memory. Because of that you might leak sensitive data through a memory dump.
Accidentally leaking sensitive data as clear text in application logs is much more likely with plain strings.
More reading:
Why we read password from console in char array instead of String
https://www.codebyamir.com/blog/use-character-arrays-to-store-sensitive-data-java
https://www.geeksforgeeks.org/use-char-array-string-storing-passwords-java/amp/
https://www.baeldung.com/java-storing-passwords
https://javarevisited.blogspot.com/2012/03/why-character-array-is-better-than.html
https://javainsider.wordpress.com/2012/12/10/character-array-is-better-than-string-for-storing-password-in-java/amp/

String is a class, not a built-in type. It most likely does what it does by using a char array underneath, but there is no guarantee; we don't care how it is implemented. It has methods that make sense for strings, like comparing strings. Comparing arrays doesn't really make sense in the same way. You could check whether they are equal, sure, but less than or greater than?
Back to the point. One scenario is that you want to operate on chars, not on a string. For example, you have letters of the alphabet and want to sort them, or grades in an A-F system and you want to sort them. Generally, this applies where the chars are not connected to form some meaning together (as they are in a message string or a text message). You would not generally need to sort the chars of a text message, would you? So you use an array.
To sort, you can take advantage of the Arrays.sort() method, for example, while I don't think there is a method that does it for strings. Perhaps third-party libraries offer one.
On another note (unrelated to the question), you can use StringBuilder if you want to modify strings often. It performs better.
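As a rough illustration of the sorting case, the sketch below sorts a char[] of grades directly and also shows the round trip through toCharArray() that you need when you start from a String. The class and method names are made up for the example.

```java
import java.util.Arrays;

final class SortChars {
    // Returns a new string containing the characters of the input in ascending order.
    static String sorted(String s) {
        char[] chars = s.toCharArray(); // copy of the string's characters
        Arrays.sort(chars);             // sorts the array in place
        return new String(chars);
    }

    public static void main(String[] args) {
        char[] grades = {'C', 'A', 'F', 'B'};
        Arrays.sort(grades);                    // no String involved at all
        System.out.println(new String(grades)); // prints ABCF
        System.out.println(sorted("dcba"));     // prints abcd
    }
}
```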

You don't have to look much further than at methods in the JDK core API that use char[].
Such as this one (java.io.Reader):
public int read(char[] cbuf)
throws IOException
Reads characters into an array. This method will block until some input is available, an I/O error occurs, or the end of the stream is reached.
Parameters:
cbuf - Destination buffer
Returns:
The number of characters read, or -1 if the end of the stream has been reached
Throws:
IOException - If an I/O error occurs
Instead of returning a String they ask you to pass in a char[] to use as a buffer to write the result into. The reason is efficiency.
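A minimal sketch of how such a buffer is typically reused is shown below. The class name BufferedCount, the buffer size of 8192 and the counting task are arbitrary choices for the example; the point is only that a single char[] serves every call to read.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

final class BufferedCount {
    // Counts occurrences of a character, reusing one buffer for every read.
    static int count(Reader in, char wanted) throws IOException {
        char[] buf = new char[8192]; // allocated once, outside the loop
        int total = 0;
        int n;
        while ((n = in.read(buf)) != -1) { // n is the number of chars actually read
            for (int i = 0; i < n; i++) {
                if (buf[i] == wanted) {
                    total++;
                }
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        try (Reader in = new StringReader("line one\nline two\n")) {
            System.out.println(count(in, '\n')); // prints 2
        }
    }
}
```

No String objects are created while the input is processed; only the first n elements of the buffer are valid after each read.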

You may already know that String is immutable and that substring could cause a memory leak in Java.
Since Strings are immutable in Java, a password stored as plain text stays in memory until the garbage collector clears it. And since Strings may end up in the string pool for reuse, there is a good chance that it will remain in memory for a long time, which poses a security threat: anyone who has access to a memory dump can find the password in clear text. Since Strings are immutable, their contents cannot be changed; any change produces a new String. With a char[] you can still set every element to blank or zero. So storing a password in a character array mitigates the risk of the password being stolen.
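The zeroing step could look roughly like the sketch below. The class PasswordWipe and the check method are hypothetical stand-ins (a real application would verify the password against a stored hash); only Arrays.fill is the actual technique being shown.

```java
import java.util.Arrays;

final class PasswordWipe {
    // Hypothetical check used only for illustration.
    static boolean check(char[] password) {
        return password.length >= 8;
    }

    // Uses the password, then overwrites it so the clear text does not linger in this array.
    static boolean checkAndWipe(char[] password) {
        try {
            return check(password);
        } finally {
            Arrays.fill(password, '\0'); // runs even if check() throws
        }
    }

    public static void main(String[] args) {
        char[] password = {'s', '3', 'c', 'r', 'e', 't', '!', '!'};
        System.out.println(checkAndWipe(password));
        System.out.println(password[0] == '\0'); // the array no longer holds the password
    }
}
```

This only clears the one array you control. Any copies made elsewhere (for example by converting to a String) are not affected.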

Why char[] performs better than String ?- Java

In reference to the link: File IO Tuning, last section titled "Further Tuning" where the author suggests using char[] to avoid generating String objects for n lines in the file, I need to understand how does
char[] arr = new char[]{'a','u','t','h','o','r'};
differ from
String s = "author";
in terms of memory consumption or any other performance factor? Isn't String object internally stored as a character array? I feel silly since I never thought of this before. :-)

In older releases of Oracle's JDK (before Java 7 update 6), a String has four instance-level fields:
A character array
An integral offset
An integral character count
An integral hash value
That means that each String introduces an extra object reference (the String itself), and three integers in addition to the character array itself. (The offset and character count are there to allow sharing of the character array among String instances produced through the String#substring() methods, a design choice that some other Java library implementers have eschewed.) Beyond the extra storage cost, there's also one more level of access indirection, not to mention the bounds checking with which the String guards its character array.
If you can get away with allocating and consuming just the basic character array, there's space to be saved there. It's certainly not idiomatic to do so in Java though; judicious comments would be warranted to justify the choice, preferably with mention of evidence from having profiled the difference.

In the example you've referred to, it's because there's only a single character array being allocated for the whole loop. It's repeatedly reading into that same array, and processing it in place.
Compare that with using readLine which needs to create a new String instance on each iteration. Each String instance will contain a few int fields and a reference to a char[] containing the actual data - so it would need two new instances per iteration.
I'd usually expect the differences to be insignificant (with a decent GC throwing away unused "young" objects very efficiently) compared with the IO involved in reading the data - assuming it's from disk - but I believe that's the point the author was trying to make.

The author didn't get the reason right. The real overhead in in.readLine() is the copying of a char[] buffer when making a String out of it. The additional copying is the most damning cost when dealing with large data.
It is possible to optimize this within JDK so that the additional copying is not needed.

Here are a few reasons why a character array can be a better choice than a String in Java.
Take storing a password, for example.
1) Since Strings are immutable in Java, a password stored as plain text stays in memory until the garbage collector clears it. And since Strings may end up in the string pool for reuse, there is a good chance that it will remain in memory for a long time, which poses a security threat.
Anyone who has access to a memory dump can find the password in clear text, which is another reason to always use an encrypted password rather than plain text.
Since Strings are immutable, their contents cannot be changed; any change produces a new String. With a char[] you can still set every element to blank or zero. So storing a password in a character array mitigates the risk of the password being stolen.
2) Java itself recommends using the getPassword() method of JPasswordField, which returns a char[], and has deprecated the getText() method, which returns the password in clear text, citing security reasons. It's good to follow the advice of the Java team and adhere to the standard rather than go against it.
3) With a String there is always a risk of printing plain text to a log file or the console. If you use an array, its contents are not printed; its type and hash code are printed instead. This is not the main reason, but it still makes sense.
For this simple program
String strPassword="Unknown";
char[] charPassword= new char[]{'U','n','k','n','o','w','n'};
System.out.println("String password: " + strPassword);
System.out.println("Character password: " + charPassword);
Output:
String password: Unknown
Character password: [C@110b053
That's all on why a character array is a better choice than a String for storing passwords in Java. Using char[] alone is not enough, though; you also need to erase its contents to be more secure.
Hope this will help.

My answer is going to focus on other Stack Overflow questions along similar lines; others have already posted more direct answers.
There have been other questions similar to this one, and the advice seems to go along the lines of using StringBuilder.
If you're concerned with string concatenation, have a look at the performance of three different implementations as described here. Another Stack Overflow post gives some additional pointers and examples that you could try yourself to see the performance.

When should we change a String to a Stringbuilder?

In an application, String is an often-used data type. What we know is that the mutation of a String uses lots of memory. So what we can do is use a StringBuilder/StringBuffer.
But at what point should we change to StringBuilder?
And what should we do when we have to split it or replace characters in it?
eg:
//original:
String[] split = string.split("\\?"); // "?" is a regex metacharacter, so it must be escaped
//better? :
String[] split = stringBuilder.toString().split("\\?");
or
//original:
String replacedString = string.replace("l","st");
//better? :
String replacedString = stringBuilder.toString().replace("l","st");
//or
StringBuilder replacedStringBuilder = new StringBuilder(stringBuilder.toString().replace("l","st"));

In your examples, there are no benefits in using a StringBuilder, since you use the toString method to create an immutable String out of your StringBuilder.
You should only copy the contents of a StringBuilder into a String after you are done appending it (or modifying it in some other way).
The problem with Java's StringBuilder is that it lacks some methods you get when using a plain string (check this thread, for example: How to implement StringBuilder.replace(String, String)).
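For illustration, a replace-all helper for StringBuilder could be sketched as follows. The class BuilderReplace and the method replaceAll are hypothetical and not part of the JDK; the sketch relies only on the existing StringBuilder.indexOf(String, int) and StringBuilder.replace(int, int, String) methods, and it is not necessarily the approach taken in the linked thread.

```java
final class BuilderReplace {
    // Replaces every occurrence of target in sb with replacement, modifying sb in place.
    static StringBuilder replaceAll(StringBuilder sb, String target, String replacement) {
        if (target.isEmpty()) {
            throw new IllegalArgumentException("target must not be empty");
        }
        int idx = sb.indexOf(target);
        while (idx != -1) {
            sb.replace(idx, idx + target.length(), replacement);
            // Continue searching after the text just inserted, so it is never rescanned.
            idx = sb.indexOf(target, idx + replacement.length());
        }
        return sb;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder("hello");
        System.out.println(replaceAll(sb, "l", "st")); // prints heststo
    }
}
```

Each call to replace may shift the remaining characters, so this is not automatically faster than String.replace; it merely avoids converting back and forth.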
What we know, is that a String uses lots of memory.
Actually, to be precise, a String uses less memory than a StringBuilder with equivalent contents. A StringBuilder class has some additional constant overhead, and usually has a preallocated buffer to store more data than needed at any given moment (to reduce allocations). The issue with Strings is that they are immutable, which means Java needs to create a new instance whenever you need to change its contents.
To conclude, StringBuilder is not designed for the operations you mentioned (split and replace), and it won't yield much better performance in any case. A split method cannot benefit from StringBuilder's mutability, since it creates an array of immutable strings as its output anyway. A replace method still needs to iterate through the entire string, and do a lot of copying if replaced string is not the same size as the searched one.
If you need to do a lot of appending, then go for a StringBuilder. Since it uses a "mutable" array of characters under the hood, adding data to the end will be especially efficient.
This article compares the performance of several StringBuilder and String methods (although I would take the Concatenation part with some reservation, because it doesn't mention dynamic string appending at all and concentrates on a single Join operation only).

What we know, is that the mutation of a String uses lots of memory.
That is incorrect. Strings cannot be mutated. They are immutable.
What you are actually talking about is building a String from other strings. That can use a lot more memory than is necessary, but it depends how you build the string.
So what we can do is to use a StringBuilder/StringBuffer.
Using a StringBuilder will help in some circumstances:
String res = "";
for (String s : ...) {
res = res + s;
}
(If the loop iterates many times then optimizing the above to use a StringBuilder could be worthwhile.)
But in other circumstances it is a waste of time:
String res = s1 + s2 + s3 + s4 + s5;
(It is a waste of time to optimize the above to use a StringBuilder because the Java compiler will automatically translate the expression into code that creates and uses a StringBuilder.)
You should only ever use a StringBuffer instead of a StringBuilder when the string needs to be accessed and/or updated by more than one thread; i.e. when it needs to be thread-safe.
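The loop case above could be rewritten along these lines. This is a sketch with made-up names (JoinLoop, concatAll); whether it is worth doing depends on how many iterations the loop really performs.

```java
import java.util.Arrays;
import java.util.List;

final class JoinLoop {
    // Concatenates all parts using a single StringBuilder instead of repeated String +.
    static String concatAll(List<String> parts) {
        StringBuilder sb = new StringBuilder();
        for (String s : parts) {
            sb.append(s); // appends into the builder's buffer; no intermediate String per iteration
        }
        return sb.toString(); // one final copy into an immutable String
    }

    public static void main(String[] args) {
        System.out.println(concatAll(Arrays.asList("a", "b", "c"))); // prints abc
    }
}
```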
But at what point should we change to StringBuilder?
The simple answer is to only do it when the profiler tells you that you have a performance problem in your string handling / processing.
Generally speaking, StringBuilders are used for building strings rather than as the primary representation of the strings.
And what should we do, when we have to split it or to replace characters in there?
Then you have to review your decision to use a StringBuilder / StringBuffer as your primary representation at that point. And if it is still warranted you have to figure out how to do the operation using the API you have chosen. (This may entail converting to a String, performing the operation and then creating a new StringBuilder from the result.)

If you frequently modify the string, go with StringBuilder. Otherwise, if it's immutable anyway, go with String.
To answer your question on how to replace characters, check this out: http://download.oracle.com/javase/tutorial/java/data/buffers.html. StringBuilder operations are what you want.
Here's another good write-up on StringBuilder: http://www.yoda.arachsys.com/csharp/stringbuilder.html

If you need to do a lot of alterations on your String, then you can go for StringBuilder. Go for StringBuffer if the string is shared between threads in a multithreaded application.

Both a String and a StringBuilder use about the same amount of memory. Why do you think it is “much”?
If you have measured (for example with jmap -histo:live) that the classes [C and java.lang.String take up most of the memory in the heap, only then should you think further in this direction.
Maybe there are multiple strings with the same value. Then, since Strings are immutable, you could intern the duplicate strings. Don't use String.intern for that, since it has bad performance characteristics; use Google Guava's Interner instead.

Avoid creating 'new' String objects when converting a byte[] to String using a specific charset

I'm reading from a binary file and want to convert the bytes to US ASCII strings. Is there any way to do this without calling new on String to avoid multiple semantically equal String objects being created in the string literal pool? I'm thinking that it is probably not possible since introducing String objects using double quotes is not possible here. Is this correct?
private String nextString(DataInputStream dis, int size)
throws IOException
{
byte[] bytesHolder = new byte[size];
// readFully blocks until the whole array is filled; a plain read() may return fewer bytes
dis.readFully(bytesHolder);
return new String(bytesHolder, Charset.forName("US-ASCII")).trim();
}

You'd have to have a cache mapping byte arrays to strings, then search through the cache for any equal values before creating a new string.
You can intern existing strings with intern() as Yishai posted - that won't stop you from creating more strings, but it'll make all but the first one (for any char sequence) very short lived. On the other hand, it'll make all the distinct strings live for a very long time indeed.
You can have "pseudo-interning" by using a Map<String, String>:
String tmp = new String(bytesHolder, Charset.forName("US-ASCII")).trim();
String cached = cache.get(tmp);
if (cached == null)
{
cached = tmp;
cache.put(tmp, tmp);
}
return cached;
You could even put a bit more effort in and end up with an LRU cache so that it'll keep the N most recently fetched strings, discarding others when it needs to.
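One possible shape for such an LRU cache is sketched below, using LinkedHashMap in access order with removeEldestEntry. The class name LruStringCache, the method canonical and the capacity parameter are assumptions made for this example, and the class is not thread-safe as written.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class LruStringCache {
    private final Map<String, String> cache;

    LruStringCache(final int capacity) {
        // accessOrder = true: iteration order runs from least to most recently used.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > capacity; // drop the least recently used entry when over capacity
            }
        };
    }

    // Returns the cached instance equal to s, caching s itself if none is present.
    String canonical(String s) {
        String cached = cache.get(s); // a hit also marks the entry as recently used
        if (cached == null) {
            cache.put(s, s);
            cached = s;
        }
        return cached;
    }

    public static void main(String[] args) {
        LruStringCache cache = new LruStringCache(2);
        String first = cache.canonical(new String("abc"));
        String second = cache.canonical(new String("abc"));
        System.out.println(first == second); // both calls return the same instance
    }
}
```

Unlike String.intern(), entries that fall out of the cache become eligible for garbage collection as soon as nothing else references them.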
None of that reduces the number of strings created in the first place, as I say - but is that likely to be a problem in your situation? GCs have been tuned to make it very cheap to create short-lived objects.

You can call the intern() method on the string to ensure one for the whole JVM.
String s = new String(bytes, "US-ASCII").intern();
You won't avoid creating the initial string again, but you will save on the storage.
That being said, interned strings have a limited storage space, so use with caution. A better option may be to implement a HashMap with the string as the key and value and check if the string already exists and get it if it does, insert it if it doesn't. That way you won't have such memory limitations.

You shouldn’t be concerned about it—unless you profiled your application and have determined the String creation to be the exact source of your problem.
If you find out that the String creation is the source of your problem I would recommend what Jon Skeet proposed, i.e. a mapping from byte[] to String. That has about the same effect as interning your Strings while not hogging up valuable memory until you restart the VM.
