String intern in equals method - java

Is it good practice to use String#intern() in the equals method of a class? Suppose we have a class:
public class A {
    private String field;
    private int number;
    @Override
    public boolean equals(Object obj) {
        if (obj == null) {
            return false;
        }
        if (getClass() != obj.getClass()) {
            return false;
        }
        final A other = (A) obj;
        if ((this.field == null) ? (other.field != null) : !this.field.equals(other.field)) {
            return false;
        }
        if (this.number != other.number) {
            return false;
        }
        return true;
    }
}
Would it be faster to use field.intern() != other.field.intern() instead of !this.field.equals(other.field)?

No! Using String.intern() implicitly like this is not a good idea:
It will not be faster. As a matter of fact it will be slower due to the use of a hash table in the background. A get() operation in a hash table contains a final equality check, which is what you want to avoid in the first place. Used like this, intern() will be called each and every time you call equals() for your class.
String.intern() has a lot of memory/GC implications that you should not implicitly force on users of this class.
If you want to avoid full blown equality checks when possible, consider the following avenues:
If you know that the set of strings is limited and you have repeated equality checks, you can use intern() for the field at object creation, so that any subsequent equality checks will come down to an identity comparison.
Use an explicit HashMap or WeakHashMap instead of intern() to avoid storing strings in the GC permanent generation - this was an issue in older JVMs, not sure if it is still a valid concern.
Keep in mind that if the set of strings is unbounded, you will have memory issues.
That said, all this sounds like premature optimization to me. String.equals() is pretty fast in the general case, since it compares the string lengths before comparing the strings themselves. Have you profiled your code?
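As a rough illustration of the explicit-map avenue mentioned above, here is a minimal sketch of a hand-rolled canonicalizing pool. The class name StringPoolSketch and the method canonical are hypothetical, and a ConcurrentHashMap is used only to keep the example short: it holds strong references, so the same warning about unbounded sets of strings applies.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper: an application-level pool used instead of String.intern().
class StringPoolSketch {
    // Strong references: entries are never evicted in this sketch.
    private static final Map<String, String> POOL = new ConcurrentHashMap<>();

    // Returns the pooled instance equal to s, adding s to the pool if it is new.
    static String canonical(String s) {
        String existing = POOL.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    public static void main(String[] args) {
        String a = canonical(new String("test"));
        String b = canonical(new String("test"));
        System.out.println(a == b);      // true: both refer to the pooled instance
        System.out.println(a.equals(b)); // true
    }
}
```

Note that the lookup itself still hashes the string and performs an equality check, so this only pays off if canonicalization happens once (for example at object creation) and comparisons happen many times.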

Good practice : Nope. You're doing something tricky, and that makes for brittle, less readable code. Unless this equals() method needs to be crazy performant (and your performance tests validate that it is in fact faster), it's not worth it.
Faster : Could be. But don't forget that you can have unintended side effects from using the intern() method: http://www.onkarjoshi.com/blog/213/6-things-to-remember-about-saving-memory-with-the-string-intern-method/

Any benefit gained by performing an identity comparison on the interned Strings is likely to be outweighed by the associated cost of interning the Strings.
In the above case you could consider interning the String when you instantiate the class, providing the field is constant (in which case you should also mark it as final). You could also check for null on instantiation to avoid having to check on each call to equals (assuming you disallow null Strings).
However, in general these types of micro-optimisation offer little gain in performance.

Let's go through this one step at a time...
The idea here is that if you use String#intern, you'll be given a canonical representation of that String. A pool of Strings is kept internally and each entry is guaranteed to be unique for that pool with regard to equals. If you call intern() on a String, then either a previously pooled identical String is going to be returned, or the String you called intern on is going to be pooled and returned.
So if we have two Strings s1 and s2 and we assume neither is null, then the following two lines of code are considered equivalent:
s1.equals(s2);
s1.intern() == s2.intern();
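To make the comparison concrete, here is a small sketch (the class and method names are made up for illustration) that forces two distinct String instances with the same content and compares them in both ways:

```java
// Hypothetical demo class comparing value equality with identity of interned references.
class InternEquivalenceDemo {
    static boolean byEquals(String s1, String s2) {
        return s1.equals(s2);
    }

    static boolean byIntern(String s1, String s2) {
        return s1.intern() == s2.intern();
    }

    public static void main(String[] args) {
        // new String(...) always creates a fresh object, so s1 and s2 are distinct instances.
        String s1 = new String("test");
        String s2 = new String("test");
        System.out.println(s1 == s2);         // false: different objects
        System.out.println(byEquals(s1, s2)); // true: same characters
        System.out.println(byIntern(s1, s2)); // true: same pooled instance
    }
}
```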
Let's investigate two assumptions we've made now:
s1.intern() and s2.intern() really will return the same object if s1.equals(s2) evaluates to true.
Using the == operator on two interned references to the same String will be more efficient than using the equals method.
The first assumption is probably the most dangerous of all. The JavaDoc for the intern method tells us that using this method will return a canonical representation for an internally kept pool of Strings. But it doesn't tell us anything about that pool. Once an entry has been added to the pool, can it ever be removed again? Will the pool keep growing indefinitely or will entries occasionally be culled to make it act as a limited-size cache? You'd have to check the actual specifications of the Java Language and Virtual Machine to get any certainty, if they offer it at all. Having to check specs for a limited optimization is usually a big warning sign. Checking the source code for Sun's JDK 7, I see that intern is specified as a native method. So not only is the implementation likely to be vendor-specific, it might vary across platforms as well for VMs from the same vendor. All bets are off regarding stuff that's not in the spec.
On to our second assumption. Let's consider for a moment what it would take to intern a String... First of all, we'll need to check if the String is already in the pool. We'll assume they've tried to get an O(1) complexity going there to keep this fast by using some hashing scheme. But that's assuming we've got a hash of the String. Since this is a native method, I'm not certain what would be used... Some hash of the native representation or simply what hashCode() returns. I know from the source code of Sun's JDK that a String instance caches its hash code. It'll only be calculated the first time the method is called, and after that the calculated value will be returned. So a hash must be calculated at least once if we're to use that. Getting a reliable hash of a String will probably involve arithmetic on each and every character, which can be expensive for lengthy values. Even once we have the hash and thus a set of Strings that are candidates for being matches in the interned pool, we'd still have to verify if one of these really is an exact match, which would involve... an equality check. Meaning going through each and every character of the Strings and seeing if they match, unless trivial cases like unequal length can be ruled out first. Worse still, we might have to do this for more than one other String, unlike with a regular equals, since multiple Strings in the pool might have the same hash or end up in the same hash bucket.
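To show why hashing alone already touches every character, here is a sketch that recomputes the hash formula documented for String.hashCode() by hand. The class and method names are hypothetical, and whether the native intern() implementation uses this particular hash is an assumption, as noted above.

```java
// Hypothetical sketch: the documented String.hashCode() formula, computed manually.
class HashCostSketch {
    // One multiply-and-add per character: the work grows linearly with the length.
    static int manualHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        String value = "test";
        // Matches the library result, since the formula is part of the String contract.
        System.out.println(manualHash(value) == value.hashCode()); // true
    }
}
```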
So, that stuff we need to do to find out if a String was already interned or not sounds suspiciously like what equals would need to do. Basically, we've gained nothing and might even have made our equals implementation more expensive. At least, if we're going to call intern each and every time. So maybe we should intern the String right away and simply always use that interned reference. Let's check how class A would look if that were the case. I'm assuming the String field is initialized on construction:
public class A {
private final String field;
public A(final String s) {
field = s.intern();
}
}
That's looking a little more sensible. Any Strings that are passed to the constructor and are equal will end up being the same reference. Now we can safely use == between the field field of A instances for equality checks, right?
Well, it'd be useless. Why? If you check the source for equals in class String, you'll find that any implementation made by someone with half a brain will first do a == check to catch the trivial case where the instance and the argument are the same reference first. That could save a potentially heavy char-by-char comparison. I know the JDK 7 source I'm using for reference does this. So you're still better off using equals because it does that reference check anyway.
The second reason this'd be a bad idea is that first point way up above... We simply don't know if the instances are going to be kept in the pool indefinitely. Check this scenario, which may or may not occur depending on JVM implementation:
String s1 = ... //Somehow gets passed a non-interned "test" value
A a1 = new A(s1);
//Lots of time passes... winter comes and goes and spring returns the land to a lush green...
String s2 = ... //Somehow gets passed a non-interned "test" value
A a2 = new A(s2);
a1.equals(a2); //Totally returns the wrong result
What happened? Well, if it turns out the interned String pool will sometimes be culled of certain entries, then that first construction of an A could have s1 interned, only to see it being removed from the pool, to have it later replaced by that s2 instance. Since s1 and s2 are conceivably different instances, the == check fails. Can this happen? I've got no idea. I certainly won't go check the specs and native code to find out. Will the programmer that's going through your code with a debugger to find out why the hell "test" is not considered the same as "test"?
It's no problem if we're using equals. It'll catch the same instance case early for optimal results, which will benefit us when we've interned our Strings, but we won't have to worry about cases where the instances still end up being different because then equals is gonna do the classic compare work. It just goes to show that it's best not to second-guess the actual runtime implementation or compiler, because these things were made by people who know the specs like the back of their hands and really worry about performance.
So String interning manually can be of benefit when you make sure that...
you're not interning each and every time, but just intern a String once, like when initializing a field, and then keep using that interned instance;
you still use equals to make sure implementation details won't ruin your day and your code doesn't actually rely on that interning, instead relying on the implementation of the method to catch the trivial cases.
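Putting those two conditions together, a version of the class could look roughly like the sketch below. The name InternedA is hypothetical, and disallowing null for the field is an assumption made to keep equals short.

```java
import java.util.Objects;

// Hypothetical variant of class A: interns once at construction, still compares with equals.
final class InternedA {
    private final String field;
    private final int number;

    InternedA(String field, int number) {
        // Assumption: null is not allowed, so equals() needs no null handling for the field.
        this.field = Objects.requireNonNull(field, "field").intern();
        this.number = number;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) {
            return true;
        }
        if (obj == null || getClass() != obj.getClass()) {
            return false;
        }
        InternedA other = (InternedA) obj;
        // String.equals checks reference identity first, so interned fields hit the fast path,
        // but correctness does not depend on the interning.
        return number == other.number && field.equals(other.field);
    }

    @Override
    public int hashCode() {
        return 31 * field.hashCode() + number;
    }

    public static void main(String[] args) {
        InternedA a1 = new InternedA(new String("test"), 1);
        InternedA a2 = new InternedA(new String("test"), 1);
        System.out.println(a1.equals(a2)); // true
    }
}
```

Removing the intern() call from the constructor leaves the behaviour of equals and hashCode unchanged, which is the property argued for above.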
After keeping this in mind, surely it's worth using intern()? Well, we still don't know how expensive intern() is. It's a native method so it might be really fast. But we're not sure unless we check the code for our target platform and JVM implementation. We've also had to make sure we understand exactly what interning does and what assumptions we've made about it. Are you sure the next person reading your code will have the same level of understanding? They might be bewildered about this weird method they've never seen before that dabbles in JVM internals and might spend an hour reading the same gibberish I'm typing right now, instead of getting work done.
That's the problem right there... Before, it was simple. You used equals and were done. Now, you've added another little thing that can nestle itself in your mind and cause you to wake up screaming one night because you've just realized that oh my God you've forgotten to take out one of the == uses and that piece of code is used in a routine controlling the killer bots' appraisal of citizen disobedience and you've heard its JVM isn't too solid!
The following quote is famously attributed to Donald Knuth...
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil"
Knuth was clever enough to add in that 97% detail. Sometimes, thoroughly micro-optimizing a small portion of code can make a big difference. Say, if that piece of code takes up 30% of the program's runtime execution. The problem with micro-optimizations is that they tend to work on assumptions. When you start using intern() and believe that from then on it'll be safe to make reference equality checks, you've made a hell of a lot of assumptions. And even if you go down to implementation level to check if they're right, are you sure they will be in the next JRE version?
I myself have used intern() manually. Did it in some piece of code where the same handful of Strings are gonna end up in hundreds if not thousands of object instances as fields. Those fields are gonna be used as keys in HashMaps and are frequently used while doing some validation over those instances. I figured interning was worth it for two purposes: reducing memory overhead by making all those equal Strings one single instance and speeding up the map lookups, since they're using hashCode() and equals. But I've made damn sure that you can take all those intern() calls out of the code and everything will still work fine. The interning is just some icing on the cake in this case, a little extra that may or may not make a bit of difference along the road. But it's not an essential part of my code's correctness.
Long post, eh? Why'd I go through the trouble of typing all of this up? To show you that if you make micro-optimizations, you'd better know damn well what you're doing and be willing to document it so thoroughly that you might as well not have bothered.

This is hard to say given that you have not specified hardware. Timing tests are difficult to get right and are not universal. Have you done a timing test yourself?
My feeling is that the intern pattern would not be faster as each string would need to be matched to a possible string in a dictionary of all interned strings.

Related

Does equality test order affect performance in Java?

I commonly find myself writing code like this:
private List<Foo> fooList = new ArrayList<Foo>();
public Foo findFoo(FooAttr attr) {
    for (Foo foo : fooList) {
        if (foo.getAttr().equals(attr)) {
            return foo;
        }
    }
    return null; // no match found
}
However, assuming I properly guard against null input, I could also express the loop like this:
for (Foo foo : fooList) {
    if (attr.equals(foo.getAttr())) {
        return foo;
    }
}
I'm wondering if one of the above forms has a performance advantage over the other. I'm well aware of the dangers of premature optimization, but in this case, I think the code is equally legible either way, so I'm looking for a reason to prefer one form over another, so I can build my coding habits to favor that form. I think given a large enough list, even a small performance advantage could amount to a significant amount of time.
In particular, I'm wondering if the second form might be more performant because the equals() method is called repeatedly on the same object, instead of different objects? Maybe branch prediction is a factor?
I would offer 2 pieces of advice here:
Measure it
If nothing else points you in any given direction, prefer the form which makes most sense and sounds most natural when you say it out loud (or in your head!)
I think that considering branch prediction is worrying about efficiency at too low a level. However, I find the second example of your code more readable because you put the consistent object first. Similarly, if you were comparing this to some other object, I would put this first.
Of course, equals is defined by the programmer so it could be asymmetric. You should make equals an equivalence relation so this shouldn't be the case. Even if you have an equivalence relation, the order could matter. Suppose that attr is a superclass of the various foo.getAttr and the first test of your equals method checks if the other object is an instance of the same class. Then attr.equals(foo.getAttr()) will pass the first check but foo.getAttr().equals(attr) will fail the first check.
However, worrying about efficiency at this level seldom has benefits.
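The superclass scenario described above can be sketched as follows. The classes BaseAttr and DerivedAttr are hypothetical, and their equals methods are deliberately flawed (asymmetric) to show how the call order can change the result:

```java
// Hypothetical demo of an asymmetric equals: the outcome depends on which object is the receiver.
class EqualsOrderDemo {
    public static void main(String[] args) {
        BaseAttr attr = new BaseAttr("id");
        DerivedAttr fooAttr = new DerivedAttr("id", 7);
        System.out.println(attr.equals(fooAttr)); // true: fooAttr passes "instanceof BaseAttr"
        System.out.println(fooAttr.equals(attr)); // false: attr fails "instanceof DerivedAttr"
    }
}

class BaseAttr {
    final String name;

    BaseAttr(String name) {
        this.name = name;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof BaseAttr)) {
            return false;
        }
        return name.equals(((BaseAttr) o).name);
    }

    @Override
    public int hashCode() {
        return name.hashCode();
    }
}

class DerivedAttr extends BaseAttr {
    final int extra;

    DerivedAttr(String name, int extra) {
        super(name);
        this.extra = extra;
    }

    @Override
    public boolean equals(Object o) {
        // Fails immediately for plain BaseAttr arguments, which breaks symmetry.
        if (!(o instanceof DerivedAttr)) {
            return false;
        }
        DerivedAttr d = (DerivedAttr) o;
        return name.equals(d.name) && extra == d.extra;
    }

    @Override
    public int hashCode() {
        return 31 * name.hashCode() + extra;
    }
}
```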
This depends on the implementation of the equals methods. In this situation I assume that both objects are instances of the same class. So that would mean that the methods are equal. This makes no performance difference.
If both objects are of the same type, then they should perform the same. If not, then you can't really know in advance what's going to happen, but usually it will be stopped quite quickly (with an instanceof or something else).
For myself, I usually start the method with a non-null check on the given parameter and I then use the attr.equals(foo.getAttr()) since I don't have to check for null in the loop. Just a question of preference I guess.
The only thing which does affect performance is code which does something.
In some cases you have code which is much the same or the difference is so small it just doesn't matter. This is the case here.
Where it is useful to swap the .equals() around is when you have a known value which cannot be null (this doesn't appear to be the case here) or when the type you are using is known.
e.g.
Object o = (Integer) 123;
String s = "Hello";
o.equals(s); // the type of equals is unknown and a virtual table lookup might be required
s.equals(o); // the type of equals is known and the class is final.
The difference is so small I wouldn't worry about it.
DEVENTER (n) A decision that's very hard to make because so little depends on it, such as which way to walk around a park
-- The Deeper Meaning of Liff by Douglas Adams and John Lloyd.
The performance should be the same, but in terms of safety, it's usually best to have the left operand be something that you are sure is not null, and have your equals method deal with null values.
Take for instance:
String s1 = null;
s1.equals("abc");
"abc".equals(s1);
The two calls to equals are not equivalent as one would issue a NullPointerException (the first one), and the other would return false.
The latter form is generally preferred for comparing with string constants for exactly this reason.
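When neither operand is a constant, the same null safety can be had from java.util.Objects.equals (available since Java 7). A short sketch, with a hypothetical class name:

```java
import java.util.Objects;

// Hypothetical demo of null-safe string comparisons.
class NullSafeEqualsDemo {
    // Constant on the left: never throws, returns false for null.
    static boolean isAbc(String s) {
        return "abc".equals(s);
    }

    // Both operands may be null: Objects.equals handles the null checks.
    static boolean same(String a, String b) {
        return Objects.equals(a, b);
    }

    public static void main(String[] args) {
        String s1 = null;
        System.out.println(isAbc(s1));        // false, no NullPointerException
        System.out.println(same(s1, "abc"));  // false
        System.out.println(same(null, null)); // true
    }
}
```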

I don't understand this ("string" == "string") example

I found this java code on a java tutorial page:
if ("progress" == evt.getPropertyName())
http://download.oracle.com/javase/tutorial/uiswing/examples/components/index.html
How could this work? I thought we HAVE TO use the equals() method for this situation (string.equals("bla"))? Could we use equals() here too? Would it be better? Any ideas?
Edit: So IF equals() would be better, then I really don't understand why a serious oracle tutorial page didn't use it? Also, I don't understand why it's working because I thought a string is an object. If I say object == object, then that's a big problem.
Yes, equals() would definitely be better and correct. In Java, a pool of string constants is maintained and reused intelligently for performance. So this can work, but it is only guaranteed if evt.getPropertyName() is assured to return constants.
Also, the more correct version would be "progress".equals(evt.getPropertyName()), in case evt.getPropertyName() is null. Note that the implementation of String.equals starts with using == as a first test before doing char-by-char comparison, so performance will not be much affected versus the original code.
Which demo are we looking at?
This explains equals() vs ==
http://www.java-samples.com/showtutorial.php?tutorialid=221
It is important to understand that the equals( ) method and the == operator perform two different operations. As just explained, the equals( ) method compares the characters inside a String object. The == operator compares two object references to see whether they refer to the same instance. The following program shows how two different String objects can contain the same characters, but references to these objects will not compare as equal:
So in your particular example, it is comparing the reference to see if they are the same reference, not to see if the string chars match I believe.
The correct version of this code should be:
if ("progress".equals(evt.getPropertyName()))
This could work because of the way that the JVM handles string constants. Each string constant is intern()ed. So if evt.getPropertyName() is returning a reference to a string constant, then using == will work. But it is bad form and in general it will not work.
This only would work if evt.getPropertyName() returns a constant string of value "progress".
With constant string, I mean evaluated at compile-time.
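The difference between a compile-time constant and a string built at runtime can be seen in a sketch like this one (the class and method names are hypothetical; runtimeValue stands in for any method that assembles its result at runtime):

```java
// Hypothetical demo: == against a literal only holds for compile-time constants (or interned strings).
class ConstantStringDemo {
    // Builds "progress" at runtime; StringBuilder.toString() creates a new String object.
    static String runtimeValue() {
        return new StringBuilder("pro").append("gress").toString();
    }

    public static void main(String[] args) {
        final String prefix = "pro";       // constant variable
        String folded = prefix + "gress";  // constant expression, folded by the compiler
        String built = runtimeValue();

        System.out.println("progress" == folded);     // true: same pooled constant
        System.out.println("progress" == built);      // false: distinct object
        System.out.println("progress".equals(built)); // true: same characters
    }
}
```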
In most cases, when comparing Strings, using equals is best. However, if you know you'll be comparing the exact same String objects (not just two strings that have the same content), or if you're dealing entirely with constant Strings and you really care about performance, using == will be somewhat faster than using equals. You should normally use equals since you normally don't care about performance sufficiently to think about all the other prerequisites for using ==.
In this case, the author of the progress demo should probably have used equals - that code isn't especially performance-critical. However, in this particular case, the code will be dealing entirely with constant strings, so whilst it's probably not the best choice, especially for a demo, it is a valid choice.

Why doesn't Java warn about a == "something"?

This might sound stupid, but why doesn't the Java compiler warn about the expression in the following if statement:
String a = "something";
if (a == "something") {
    System.out.println("a is equal to something");
} else {
    System.out.println("a is not equal to something");
}
I realize why the expression is untrue, but AFAIK, a can never be equal to the String literal "something". The compiler should realize this and at least warn me that I'm an idiot who is coding way too late at night.
Clarification
This question is not about comparing two String object variables, it is about comparing a String object variable to a String literal. I realize that the following code is useful and would produce different results than .equals():
String a = iReturnAString();
String b = iReturnADifferentString();
if (a == b) {
    System.out.println("a is equal to b");
} else {
    System.out.println("a is not equal to b");
}
In this case a and b might actually point to the same area in memory, even if it's not because of interning. In the first example though, the only reason it would be true is if Java is doing something behind the scenes which is not useful to me, and which I can't use to my advantage.
Follow up question
Even if a and the string-literal both point to the same area in memory, how is that useful for me in an expression like the one above. If that expression returns true, there isn't really anything useful I could do with that knowledge, is there? If I was comparing two variables, then yes, that info would be useful, but with a variable and a literal it's kinda pointless.
Actually they can indeed be the same reference if Java chooses to intern the string. String interning is the notion of having only one value for a distinct string at runtime.
http://en.wikipedia.org/wiki/String_intern_pool
Java notes about string interning
http://javatechniques.com/blog/string-equality-and-interning/
Compiler warnings tend to be about things that are either blatantly wrong (conditionals that can never be true or false) or unsafe (unchecked casts). The use of == is valid, and in some rare cases intentional.
I believe all of Checkstyle, FindBugs and PMD will warn about this, and optionally a lot of other bad practices we tend to have when half asleep or otherwise incapacitated ;).
Because:
you might actually want to use ==, if working with constants and interned strings
the compiler should make an exception only for String, and no other type. What I mean is - whenever the compiler encounters == it should check if the operands are Strings in order to issue a warning. What if the arguments are Strings, but are referred to as Object or CharSequence ?
The rationale given by checkstyle for issuing an error is that novice programmers often do this. And if you are a novice, it'd be hard to configure checkstyle (or pmd), or even to know about them.
Another thing is the actual scenario when strings are compared and there is a literal as one of the operands. First, it would be better to use a constant (static final) instead of a literal. And where would the other operand come from? It is likely that it will come from the same constant / literal, somewhere else in the code. So == would work.
Depending on the context, both identity comparisons and value comparisons can be legitimate.
I can think of very few queries where there is a deterministic automated algorithm to figure out unambiguously that one of them is an error.
Therefore, there's no attempt to do this automatically.
If you think about things like caching, then there are situations where you would want to do this test.
Actually, it may sometimes be true, depending on if Java takes an existing String from its internal String cache, creating the first declaration and then storing it, or taking it for both string declarations.
The compiler doesn't care that you're trying to do identity comparison against a literal. It could also be argued that it's not the compiler's job to be a code nanny. Look for a lint-like tool if you want to catch situations like this.
"I realize why the expression is untrue, but AFAIK, a can never be equal to the String literal "something"."
To clarify, in the example given, the expression is always TRUE: a can be both == and equals() to the String literal, and in the example given it always is.
It is ironic that you appear to have given the rare counterexample to your own argument.
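This is easy to check by wrapping the snippet from the question in a runnable class (the class and method names here are hypothetical):

```java
// Hypothetical wrapper around the snippet from the question.
class LiteralComparisonDemo {
    static boolean sameReference() {
        String a = "something";
        // Both occurrences of the literal refer to the same pooled String instance.
        return a == "something";
    }

    public static void main(String[] args) {
        if (sameReference()) {
            System.out.println("a is equal to something"); // this branch is taken
        } else {
            System.out.println("a is not equal to something");
        }
    }
}
```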
There are cases where you actually care whether you're dealing with exactly the same object rather than whether two objects are equal. In such cases, you need == rather than equals(). The compiler has no way of knowing whether you really wanted to compare the references for equality or the objects that they point to.
Now, it's far less likely that you're going to want == for strings than it would be for a user-defined type, but that doesn't guarantee that you wouldn't want it, and even if it did, that means that the compiler would have to special-case strings and specifically check to make sure that you didn't use == on them.
In addition, because strings are immutable, the JVM is free to make strings which would be equal per equals() share the same instance (to save memory), in which case they would also be equal per ==. So, depending on what the JVM does, == could very well return true. And the example that you gave is actually one where there's a decent chance of it because they're both string literals, so it would be fairly easy for the JVM to make them the same string, and it probably would. And, of course, if you want to see whether the JVM is making two strings share the same instance, you would have to use == rather than equals(), so there's a legitimate reason to want to use == on strings right there.
So, the compiler has no way of knowing enough of what you're doing to know that using == instead of equals() should be an error. This can lead to bugs if you're not careful (especially if you're used to a language like C++ which overloads == instead of having a separate equals() function), but the compiler can only do so much for you. There are legitimate reasons for using == instead of equals(), so the compiler isn't going to flag it as an error.
There exist tools that will warn you about these constructs; feel free to use them. However there are valid cases when you want to use == on Strings, and it is much worse language design to warn a user about a perfectly valid construct than to fail to warn them. When you have been using Java a year or so (and I will bet good money that you haven't reached that stage yet) you will find avoiding constructs like this is second nature.

Am I correctly interning my Strings?

I want to make sure I don't pummel the permgen space, so I'm carefully interning my strings.
Are these two statements equivalent ?
String s1 = ( "hello" + "world" ).intern();
String s2 = "hello".intern() + "world".intern();
UPDATE
How I framed my question was totally different from the actual application. Here's the method where I am using intern.
public String toAddress( Transport transport )
{
    Constraint.NonNullArgument.check( transport, "transport" );
    switch( transport )
    {
        case GOOGLE:
        case MSN:
            return ( transport.code() + PERIOD + _domain ).intern();
        case YAHOO:
        default:
            return _domain;
    }
}
private String _domain; // is initialized during constructor
private static final String PERIOD = ".";
The best advice I can think of is: don't bother. Statically declared Strings will be in the constant pool anyhow, so unless you are dynamically creating a String that is...errr no, I can't think of a reason.
I've been programming using Java since 97 and I've never actually used String.intern().
EDIT: After seeing your update I really am of the opinion that you shouldn't be using intern(). Your method looks perfectly normal and there is little or no reason to use intern().
My reason for this is that it is in fact an optimisation, and potentially a premature one at that; you are second-guessing the garbage collector. If the output of your method is short-lived, then the resulting string will die in the young generation very shortly afterwards, in the next minor GC, and if it isn't, it'll be interned (for want of a better word) in the mature generation anyhow.
I guess the only time this could be a good idea is if you spend a bit of time with a profiler and prove that it makes a large difference to the performance of your application.
As jensgram says, the two statements are not equivalent. Two important rules:
Concatenating string literals in code ends up with a string constant, so these two statements are exactly equivalent (they'll produce identical bytecode):
String x = "foo" + "bar";
String x = "foobar";
String constants are interned automatically, you don't need to do it explicitly
Now, this concentrates on literals - are you actually calling intern on literals, or is your real use case somewhat different (e.g. interning values fetched from a database which you'll see frequently)? If so, please give us more details.
EDIT: Okay, based on the question edit: this could save some memory if you end up storing the return value of toAddress() somewhere that it'll stick around for a long time and you'll end up with the same address multiple times. If those aren't the case, interning will actually probably make things worse. I don't know for sure whether interned strings stick around forever, but it's quite possible.
This looks to me like it's unlikely to be a good use of interning, and may well be making things worse instead. You mention trying to save permgen space - why do you believe interning will help there? The concatenated strings won't end up in permgen anyway, unless I'm much mistaken.
No. Adding two interned strings together does not give you an interned string.
That said, it's pretty rare that one needs to "carefully intern one's strings". Unless you're dealing with huge numbers of identical strings, it's more trouble than it's worth.
I would say no. s1 adds "helloworld" to the pool, whereas s2 is made up of the two pooled strings "hello" and "world".
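The difference between the two statements from the question can be observed directly. In this sketch the class and method names are hypothetical; the two methods mirror s1 and s2:

```java
// Hypothetical demo: concatenating two interned strings does not yield an interned string.
class ConcatInternDemo {
    // Mirrors s1: the concatenation of literals is a compile-time constant, already pooled.
    static String internedWhole() {
        return ("hello" + "world").intern();
    }

    // Mirrors s2: the operands are pooled, but the runtime concatenation creates a new object.
    static String concatOfInterned() {
        return "hello".intern() + "world".intern();
    }

    public static void main(String[] args) {
        String s1 = internedWhole();
        String s2 = concatOfInterned();
        System.out.println(s1 == "helloworld"); // true: s1 is the pooled constant
        System.out.println(s1 == s2);           // false: s2 is a fresh String
        System.out.println(s1.equals(s2));      // true: same characters
        System.out.println(s1 == s2.intern());  // true: interning s2 returns the pooled one
    }
}
```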
More info will help us to understand your query... Anyway...
If you manually want to intern HelloWorld then go with the first statement, as in the second statement you are interning hello and world separately. The two statements are not identical at all.
You might want to have some form of proof (via profiling perhaps) that you are "pummelling the permgen space" before you write all your code like that.
Otherwise you may just be doing "premature optimisation" which is generally frowned upon.
See http://en.wikipedia.org/wiki/Optimization_(computer_science)#When_to_optimize for more details on why this may be a bad thing.
In many cases, "carefully interning" your strings gives you nothing but some time wasted.
Consider the following case:
void foobar(int x) {
String s1 = someMethod(x).intern();
...
...
}
So s1 is interned, no (heap) space wasted? Wrong! Most likely, the intermediary result of someMethod(x) still exists somewhere on the heap and needs to be garbage collected. That's because someMethod() somehow constructed the string, and (unless it returns a literal) it did that on the heap. But then... better look up what the permgen space is used for. It's used for metadata about classes and (ooops) the String.intern table. By interning all your strings, you are doing exactly what you wanted to avoid: Pummel the permgen space.
More information here: http://www.thesorensens.org/2006/09/09/java-permgen-space-stringintern-xml-parsing/
The number of strings you're using has no effect on the Permanent Generation of the JVM, since we're still talking about one class.
interning strings is basically a memory leak waiting to happen :(
Unless you have a very, very good reason[1] don't do it, but leave it to the JVM.
[1] As in, "Dear Boss, Please don't fire me. I have this profiling data to support my decision to use intern" :)

Why can't strings be mutable in Java and .NET?

Why is it that they decided to make String immutable in Java and .NET (and some other languages)? Why didn't they make it mutable?
According to Effective Java, chapter 4, page 73, 2nd edition:
"There are many good reasons for this: Immutable classes are easier to
design, implement, and use than mutable classes. They are less prone
to error and are more secure.
[...]
"Immutable objects are simple. An immutable object can be in
exactly one state, the state in which it was created. If you make sure
that all constructors establish class invariants, then it is
guaranteed that these invariants will remain true for all time, with
no effort on your part.
[...]
Immutable objects are inherently thread-safe; they require no synchronization. They cannot be corrupted by multiple threads
accessing them concurrently. This is far and away the easiest approach
to achieving thread safety. In fact, no thread can ever observe any
effect of another thread on an immutable object. Therefore,
immutable objects can be shared freely
[...]
Other small points from the same chapter:
Not only can you share immutable objects, but you can share their internals.
[...]
Immutable objects make great building blocks for other objects, whether mutable or immutable.
[...]
The only real disadvantage of immutable classes is that they require a separate object for each distinct value.
There are at least two reasons.
First - security http://www.javafaq.nu/java-article1060.html
The main reason why String made immutable was security. Look at this example: We have a file open method with login check. We pass a String to this method to process authentication which is necessary before the call will be passed to OS. If String was mutable it was possible somehow to modify its content after the authentication check before OS gets request from program then it is possible to request any file. So if you have a right to open text file in user directory but then on the fly when somehow you manage to change the file name you can request to open "passwd" file or any other. Then a file can be modified and it will be possible to login directly to OS.
Second - Memory efficiency http://hikrish.blogspot.com/2006/07/why-string-class-is-immutable.html
JVM internally maintains the "String Pool". To achieve the memory efficiency, JVM will refer the String object from pool. It will not create the new String objects. So, whenever you create a new string literal, JVM will check in the pool whether it already exists or not. If already present in the pool, just give the reference to the same object or create the new object in the pool. There will be many references point to the same String objects, if someone changes the value, it will affect all the references. So, Sun decided to make it immutable.
Actually, the reasons strings are immutable in Java don't have much to do with security. The two main reasons are the following:
Thread Safety:
Strings are an extremely widely used type of object, so they are more or less guaranteed to be used in a multi-threaded environment. Strings are immutable to make sure that it is safe to share them among threads. Immutability ensures that when thread A passes a string to thread B, thread B cannot unexpectedly modify thread A's string.
Not only does this help simplify the already pretty complicated task of multi-threaded programming, but it also helps with performance of multi-threaded applications. Access to mutable objects must somehow be synchronized when they can be accessed from multiple threads, to make sure that one thread doesn't attempt to read the value of your object while it is being modified by another thread. Proper synchronization is both hard to do correctly for the programmer, and expensive at runtime. Immutable objects cannot be modified and therefore do not need synchronization.
Performance:
While String interning has been mentioned, it only represents a small gain in memory efficiency for Java programs. By default, only string literals (and other compile-time constant strings) are interned; anything else is interned only if the program calls intern() explicitly. This means that only the strings which are the same in your source code will automatically share the same String object. If your program dynamically creates strings that are equal, they will be represented by different objects.
More importantly, immutability allows strings to share their internal data. For many string operations, this means that the underlying array of characters does not need to be copied. For example, say you want to take the first five characters of a String. In Java, you would call myString.substring(0,5). In the classic implementation (Sun/Oracle JDKs up to Java 6 and early Java 7 builds), substring() simply creates a new String object that shares myString's underlying char[] but knows that it starts at index 0 and ends at index 5 of that char[]. To put this in graphical form, you would end up with the following:
| myString |
v v
"The quick brown fox jumps over the lazy dog" <-- shared char[]
^ ^
| | myString.substring(0,5)
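This makes this kind of operation extremely cheap: it is O(1), since it depends neither on the length of the original string nor on the length of the substring being extracted. This behavior also has some memory benefits, since many strings can share their underlying char[]. Note that Oracle's JDK changed this in Java 7 update 6: substring() now copies the characters, because a small substring could otherwise keep a huge char[] alive.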
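The offset-and-length sharing described in this answer can be sketched with a small stand-in class. Slice and sharesStorageWith() are hypothetical names invented for illustration; this is not the real java.lang.String, only a sketch of the technique under the assumption that the backing array is never written after construction:

```java
public class SharedSliceDemo {
    // Hypothetical immutable view over a shared char[] (not java.lang.String).
    static final class Slice {
        private final char[] value; // shared backing array, never modified
        private final int offset;   // index of the first character of this slice
        private final int count;    // number of characters in this slice

        Slice(String s) {
            this(s.toCharArray(), 0, s.length());
        }

        private Slice(char[] value, int offset, int count) {
            this.value = value;
            this.offset = offset;
            this.count = count;
        }

        // O(1): only the offset and length change, no characters are copied.
        Slice substring(int begin, int end) {
            if (begin < 0 || end > count || begin > end) {
                throw new IndexOutOfBoundsException();
            }
            return new Slice(value, offset + begin, end - begin);
        }

        boolean sharesStorageWith(Slice other) {
            return value == other.value;
        }

        @Override
        public String toString() {
            return new String(value, offset, count);
        }
    }

    public static void main(String[] args) {
        Slice whole = new Slice("The quick brown fox jumps over the lazy dog");
        Slice head = whole.substring(0, 5);
        System.out.println(head);                          // The q
        System.out.println(head.sharesStorageWith(whole)); // true
    }
}
```

Sharing is only safe here because nothing can write to the array; with a mutable string, every slice would need its own copy.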
Thread safety and performance. If a string cannot be modified it is safe and quick to pass a reference around among multiple threads. If strings were mutable, you would always have to copy all of the bytes of the string to a new instance, or provide synchronization. A typical application will read a string 100 times for every time that string needs to be modified. See wikipedia on immutability.
One should really ask, "why should X be mutable?" It's better to default to immutability, because of the benefits already mentioned by Princess Fluff. It should be an exception that something is mutable.
Unfortunately, most current programming languages default to mutability, but hopefully future languages will default to immutability (see A Wish List for the Next Mainstream Programming Language).
Wow! I can't believe the misinformation here. Strings being immutable has nothing to do with security. If someone already has access to the objects in a running application (which would have to be assumed if you are trying to guard against someone 'hacking' a String in your app), there would certainly be plenty of other opportunities available for hacking.
It's quite a novel idea that the immutability of String is addressing threading issues. Hmmm ... I have an object that is being changed by two different threads. How do I resolve this? Synchronize access to the object? Naawww ... let's not let anyone change the object at all -- that'll fix all of our messy concurrency issues! In fact, let's make all objects immutable, and then we can remove the synchronized construct from the Java language.
The real reason (pointed out by others above) is memory optimization. It is quite common in any application for the same string literal to be used repeatedly. It is so common, in fact, that decades ago, many compilers made the optimization of storing only a single instance of a String literal. The drawback of this optimization is that runtime code that modifies a String literal introduces a problem, because it is modifying the instance for all other code that shares it. For example, it would not be good for a function somewhere in an application to change the String literal "dog" to "cat". A printf("dog") would result in "cat" being written to stdout. For that reason, there needed to be a way of guarding against code that attempts to change String literals (i.e., make them immutable). Some compilers (with support from the OS) would accomplish this by placing String literals into a special read-only memory segment that would cause a memory fault if a write attempt was made.
In Java this is known as interning. The Java compiler here is just following a standard memory optimization done by compilers for decades. And to address the same issue of these String literals being modified at runtime, Java simply makes the String class immutable (i.e., gives you no setters that would allow you to change the String content). Strings would not have to be immutable if interning of String literals did not occur.
String is not a primitive type, yet you normally want to use it with value semantics, i.e. like a value.
A value is something you can trust won't change behind your back.
If you write: String str = someExpr();
You don't want it to change unless YOU do something with str.
As an Object, String naturally has pointer semantics; to get value semantics as well, it needs to be immutable.
One factor is that, if Strings were mutable, objects storing Strings would have to be careful to store copies, lest their internal data change without notice. Given that Strings are a fairly primitive type like numbers, it is nice when one can treat them as if they were passed by value, even if they are passed by reference (which also helps to save on memory).
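A small Java sketch of that concern, using StringBuilder as a stand-in for a mutable string. LeakyHolder and SafeHolder are hypothetical classes made up for this illustration:

```java
public class DefensiveCopyDemo {
    // Hypothetical holder that keeps a reference to a mutable buffer without copying it.
    static final class LeakyHolder {
        private final StringBuilder name;
        LeakyHolder(StringBuilder name) { this.name = name; }
        String name() { return name.toString(); }
    }

    // Hypothetical holder that keeps an immutable String: no defensive copy is needed.
    static final class SafeHolder {
        private final String name;
        SafeHolder(String name) { this.name = name; }
        String name() { return name; }
    }

    public static void main(String[] args) {
        StringBuilder buffer = new StringBuilder("alice");
        LeakyHolder leaky = new LeakyHolder(buffer);
        SafeHolder safe = new SafeHolder(buffer.toString());

        // The caller reuses its buffer after handing it over.
        buffer.setLength(0);
        buffer.append("mallory");

        System.out.println(leaky.name()); // mallory: internal state changed without notice
        System.out.println(safe.name());  // alice: the stored String cannot change
    }
}
```

With a mutable type, LeakyHolder would have to copy the buffer in its constructor to be safe; with String, storing the reference is enough.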
I know this is a bump, but...
Are they really immutable?
Consider the following.
public static unsafe void MutableReplaceIndex(string s, char c, int i)
{
fixed (char* ptr = s)
{
*((char*)(ptr + i)) = c;
}
}
...
string s = "abc";
MutableReplaceIndex(s, '1', 0);
MutableReplaceIndex(s, '2', 1);
MutableReplaceIndex(s, '3', 2);
Console.WriteLine(s); // Prints 123
You could even make it an extension method.
public static class Extensions
{
public static unsafe void MutableReplaceIndex(this string s, char c, int i)
{
fixed (char* ptr = s)
{
*((char*)(ptr + i)) = c;
}
}
}
Which makes the following work
s.MutableReplaceIndex('1', 0);
s.MutableReplaceIndex('2', 1);
s.MutableReplaceIndex('3', 2);
Conclusion: Strings are immutable only as far as the compiler and the safe language rules are concerned. Of course, the above only applies to .NET strings, as Java doesn't have pointers. However, a string can be made entirely mutable using pointers in unsafe C# code. This is not how pointers are intended to be used, it has no practical use, and it is not safe; it is nevertheless possible, which bends the whole "immutable" rule. You normally cannot modify a character of a string in place, and this is the only way. This could be prevented by disallowing pointers to strings or by making a copy when a string is pointed to, but neither is done, which makes strings in C# not entirely immutable.
For most purposes, a "string" is (used/treated as/thought of/assumed to be) a meaningful atomic unit, just like a number.
Asking why the individual characters of a string are not mutable is therefore like asking why the individual bits of an integer are not mutable.
You should know why. Just think about it.
I hate to say it, but unfortunately we're debating this because our language sucks, and we're trying to use a single word, string, to describe a complex, contextually situated concept or class of object.
We perform calculations and comparisons with "strings" similar to how we do with numbers. If strings (or integers) were mutable, we'd have to write special code to lock their values into immutable local forms in order to perform any kind of calculation reliably. Therefore, it is best to think of a string like a numeric identifier, but instead of being 16, 32, or 64 bits long, it could be hundreds of bits long.
When someone says "string", we all think of different things. Those who think of it simply as a set of characters, with no particular purpose in mind, will of course be appalled that someone just decided that they should not be able to manipulate those characters. But the "string" class isn't just an array of characters. It's a STRING, not a char[]. There are some basic assumptions about the concept we refer to as a "string", and it can generally be described as a meaningful, atomic unit of coded data, like a number. When people talk about "manipulating strings", perhaps they're really talking about manipulating characters to build strings, and a StringBuilder is great for that. Just think a bit about what the word "string" truly means.
Consider for a moment what it would be like if strings were mutable. The following API function could be tricked into returning information for a different user if the mutable username string is intentionally or unintentionally modified by another thread while this function is using it:
string GetPersonalInfo( string username, string password )
{
    string stored_password = DBQuery.GetPasswordFor( username );
    if (password == stored_password)
    {
        // another thread modifies the mutable 'username' string here
        return DBQuery.GetPersonalInfoFor( username );
    }
    return null; // authentication failed
}
Security isn't just about 'access control', it's also about 'safety' and 'guaranteeing correctness'. If a method can't be easily written and depended upon to perform a simple calculation or comparison reliably, then it's not safe to call it, but it would be safe to call into question the programming language itself.
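The check-then-use problem above can be reproduced deterministically in Java. This is only a sketch under stated assumptions: StringBuilder stands in for a mutable string, the two maps are hypothetical stand-ins for the DBQuery calls, and the hook parameter simulates another thread running between the check and the use:

```java
import java.util.Map;

public class CheckThenUseDemo {
    // Hypothetical in-memory stand-ins for the database lookups.
    static final Map<String, String> PASSWORDS = Map.of("alice", "pw-a", "bob", "pw-b");
    static final Map<String, String> INFO = Map.of("alice", "alice's data", "bob", "bob's data");

    // Mutable username: the hook can change it between the check and the use.
    static String infoForMutable(StringBuilder username, String password, Runnable hook) {
        String stored = PASSWORDS.get(username.toString());
        if (password.equals(stored)) {
            hook.run(); // stands in for another thread scheduled at this point
            return INFO.get(username.toString());
        }
        return null;
    }

    // Immutable username: the value checked is the value used.
    static String infoFor(String username, String password, Runnable hook) {
        String stored = PASSWORDS.get(username);
        if (password.equals(stored)) {
            hook.run();
            return INFO.get(username);
        }
        return null;
    }

    public static void main(String[] args) {
        StringBuilder name = new StringBuilder("alice");
        String leaked = infoForMutable(name, "pw-a", () -> name.replace(0, name.length(), "bob"));
        System.out.println(leaked); // bob's data

        String[] caller = {"alice"};
        String ok = infoFor(caller[0], "pw-a", () -> caller[0] = "bob");
        System.out.println(ok); // alice's data
    }
}
```

In the second call the caller can rebind its own variable, but the String that was authenticated is still the one used for the lookup.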
Immutability is not so closely tied to security. For that, at least in .NET, you get the SecureString class.
Later edit: In Java you will find GuardedString, a similar implementation.
The decision to make strings mutable in C++ causes a lot of problems; see this excellent article by Kevlin Henney about Mad COW Disease.
COW = Copy On Write.
It's a trade off. Strings go into the String pool and when you create multiple identical Strings they share the same memory. The designers figured this memory saving technique would work well for the common case, since programs tend to grind over the same strings a lot.
The downside is that concatenations create a lot of extra Strings that are only transitional and immediately become garbage, which actually harms memory performance. Java has StringBuffer and StringBuilder (StringBuilder also exists in .NET) to save memory in these cases.
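A short Java sketch of the difference, with hypothetical method names. Both methods return the same text; the first creates a new intermediate String on every iteration, while the second appends into one growing buffer:

```java
public class ConcatDemo {
    // Each += produces a new String, so every earlier intermediate result becomes garbage.
    static String joinWithConcat(String[] parts) {
        String result = "";
        for (String part : parts) {
            result += part;
        }
        return result;
    }

    // One mutable buffer is filled, and a single String is created at the end.
    static String joinWithBuilder(String[] parts) {
        StringBuilder sb = new StringBuilder();
        for (String part : parts) {
            sb.append(part);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] parts = {"a", "b", "c"};
        System.out.println(joinWithConcat(parts));  // abc
        System.out.println(joinWithBuilder(parts)); // abc
    }
}
```

For a handful of concatenations the difference rarely matters; it shows up in long loops.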
Strings in Java are not truly immutable; you can change their values using reflection and/or class loading. You should not depend on that property for security.
For examples see: Magic Trick In Java
Immutability is good. See Effective Java. If you had to copy a String every time you passed it around, then that would be a lot of error-prone code. You also have confusion as to which modifications affect which references. In the same way that Integer has to be immutable to behave like int, Strings have to behave as immutable to act like primitives. In C++ passing strings by value does this without explicit mention in the source code.
There is an exception to nearly every rule:
using System;
using System.Runtime.InteropServices;
namespace Guess
{
class Program
{
static void Main(string[] args)
{
const string str = "ABC";
Console.WriteLine(str);
Console.WriteLine(str.GetHashCode());
var handle = GCHandle.Alloc(str, GCHandleType.Pinned);
try
{
Marshal.WriteInt16(handle.AddrOfPinnedObject(), 4, 'Z');
Console.WriteLine(str);
Console.WriteLine(str.GetHashCode());
}
finally
{
handle.Free();
}
}
}
}
It's largely for security reasons. It's much harder to secure a system if you can't trust that your Strings are tamperproof.
