String secret="foo";
WhatILookFor.securelyWipe(secret);
And I need to know that it will not be removed by java optimizer.
A String cannot be "wiped". It is immutable, and short of some really dirty and dangerous tricks you cannot alter that.
So the safest solution is to not put the data into a string in the first place. Use a StringBuilder or an array of characters instead, or some other representation that is not immutable. (And then clear it when you are done.)
For the record, there are a couple of ways that you can change the contents of a String's backing array. For example, you can use reflection to fish out a reference to the String's backing array, and overwrite its contents. However, this involves doing things that the JLS states have unspecified behaviour so you cannot guarantee that the optimizer won't do something unexpected.
My personal take on this is that you are better off locking down your application platform so that unauthorized people can't gain access to the memory / memory dump in the first place. After all, if the platform is not properly secured, the "bad guys" may be able to get hold of the string contents before you erase it. Steps like this might be warranted for small amounts of security critical state, but if you've got a lot of "confidential" information to process, it is going to be a major hassle to not be able to use normal strings and string handling.
You would need direct access to the memory.
You really wouldn't be able to do this with String, since you don't have reliable access to the string, and don't know if it's been interned somewhere, or if an object was created that you don't know about.
If you really needed to this, you'd have to do something like
public class SecureString implements CharSequence {
char[] data;
public void wipe() {
for(int i = 0; i < data.length; i++) data[i] = '.'; // random char
}
}
That being said, if you're worried about data still being in memory, you have to realize that if it was ever in memory at one point, than an attacker probably already got it. The only thing you realistically protect yourself from is if a core dump is flushed to a log file.
Regarding the optimizer, I incredibly doubt it will optimize away the operation. If you really needed it to, you could do something like this:
public int wipe() {
// wipe the array to a random value
java.util.Arrays.fill(data, (char)(rand.nextInt(60000));
// compute hash to force optimizer to do the wipe
int hash = 0;
for(int i = 0; i < data.length; i++) {
hash = hash * 31 + (int)data[i];
}
return hash;
}
This will force the compiler to do the wipe. It makes it roughly twice as long to run, but it's a pretty fast operation as it is, and doesn't increase the order of complexity.
Store the data off-heap using the "Unsafe" methods. You can then zero over it when done and be certain that it won't be pushed around the heap by the JVM.
Here is a good post on Unsafe:
http://highlyscalable.wordpress.com/2012/02/02/direct-memory-access-in-java/
If you're going to use a String then I think you are worried about it appearing in a memory dump. I suggest using String.replace() on key-characters so that when the String is used at run-time it will change and then go out of scope after it is used and won't appear correctly in a memory dump. However, I strongly recommend that you not use a String for sensitive data.
So a number of variations of a question exist here on stackoverflow that ask how to measure the size of an object (for example this one). And the answers point to the fact, without much elaboration, that it is not possible. Can somebody explain in length why is it not possible or why does it not make sense to measure object sizes?
I guess from the tags that you are asking about measurements of object sizes in Java and C#. Don't know much about C# therefore the following only pertains to Java.
Also there is a difference between the shallow and detained size of a single object and I suppose you are asking about the shallow size (which would be the base to derive the detained size).
I also interpret your term managed environment that you only want to know the size of an object at runtime in a specific JVM (not for instance calculating the size looking only at source code).
My short answers first:
Does it make sense to measure object sizes? Yes it does. Any developer of an application which runs under memory constraints is happy to know the memory implications of class layouts and object allocations.
Is it impossible to measure in managed environments? No it is not. The JVM must know about the size of its objects and therefore must be able to report the size of an object. If we only had a way to ask for it.
Long answer:
There are plenty of reasons why the object size cannot be derived from the class definition alone, for example:
The Java language spec only gives lowerbound memory requirements for primitive types. A int consumes at least 4 bytes, but the real size is up to the VM.
Not sure what the language spec tells about the size of references. Is there any constraint on the number of possible objects in a JVM (which would have implications for the size of internal storage for object references)? Today's JVMs use 4 bytes for a reference pointer.
JVMs may (and do) pad the object bytes to align at some boundary which may extend the object size. Todays JVMs usually align object memory at a 8 byte boundary.
But all these reasons do not apply to a JVM runtime which uses actual memory layouts, eventually allows its generational garbage collector to push objects around, and must therefore be able to report object sizes.
So how do we know about object sizes at runtime?
In Java 1.5 we got java.lang.instrument.Instrumentation#getObjectSize(Object).
The Javadoc says:
Returns an implementation-specific approximation of the amount of
storage consumed by the specified object. The result may include some
or all of the object's overhead, and thus is useful for comparison
within an implementation but not between implementations. The estimate
may change during a single invocation of the JVM.
Reading with a grain of salt this tells me that there is a reasonable way to get the exact shallow size of an object during one point at runtime.
Getting size of object is easily possible.
Getting object size may have little overhead if the object is large and we use IO streams to get the size.
If you have to get size of larger objects very frequently, you have to be careful.
Have a look at below code.
import java.io.*;
class ObjectData implements Serializable{
private int id=1;;
private String name="sunrise76";
private String city = "Newyork";
private int dimensitons[] = {20,45,789};
}
public class ObjectSize{
public static void main(String args[]){
try{
ObjectData data = new ObjectData();
ByteArrayOutputStream b = new ByteArrayOutputStream();
ObjectOutputStream oos = new ObjectOutputStream(b);
oos.writeObject(data);
System.out.println("Size:"+b.toByteArray().length);
}catch(Exception err){
err.printStackTrace();
}
}
}
I tested the performance of a Java ray tracer I'm writing on with VisualVM 1.3.7 on my Linux Netbook. I measured with the profiler.
For fun I tested if there's a difference between using getters and setters and accessing the fields directly. The getters and setters are standard code with no addition.
I didn't expected any differences. But the directly accessing code was slower.
Here's the sample I tested in Vector3D:
public float dot(Vector3D other) {
return x * other.x + y * other.y + z * other.z;
}
Time: 1542 ms / 1,000,000 invocations
public float dot(Vector3D other) {
return getX() * other.getX() + getY() * other.getY() + getZ() * other.getZ();
}
Time: 1453 ms / 1,000,000 invocations
I didn't test it in a micro-benchmark, but in the ray tracer. The way I tested the code:
I started the program with the first code and set it up. The ray tracer isn't running yet.
I started the profiler and waited a while after initialization was done.
I started a ray tracer.
When VisualVM showed enough invocations, I stopped the profiler and waited a bit.
I closed the ray tracer program.
I replaced the first code with the second and repeated the steps above after compiling.
I did at least run 20,000,000 invocations for both codes. I closed any program I didn't need.
I set my CPU on performance, so my CPU clock was on max. all the time.
How is it possible that the second code is 6% faster?
I did done some micro benchmarking with lots of JVM warm up and found the two approaches take the exact same amount of execution time.
This happens because the JIT compiler is in-lining the getter method with a direct access to the field thus making them identical bytecode.
Thank you all for helping me answering this question. In the end, I found the answer.
First, Bohemian is right: With PrintAssembly I checked the assumption that the generated assembly codes are identical. And yes, although the bytecodes are different, the generated codes are identical.
So masterxilo is right: The profiler have to be the culprit. But masterxilo's guess about timing fences and more instrumentation code can't be true; both codes are identical in the end.
So there's still the question: How is it possible that the second code seems to be 6% faster in the profiler?
The answer lies in the way how VisualVM measures: Before you start profiling, you need calibration data. This is used for removing the overhead time caused by the profiler.
Although the calibration data is correct, the final calculation of the measurement is not. VisualVM sees the method invocations in the bytecode. But it doesn't see that the JIT compiler removes these invocations while optimizing.
So it removes non-existing overhead time. And that's how the difference appears.
In case you have not taken a course in Statistics, there is always variance in program performance no matter how well that it is written. The reason why these two methods seem to run at approximately the same rate is because the accessor fields only do one thing: They return a particular field. Because nothing else happens in the accessor method, both tactics pretty much do the same thing; however, in case you know not about encapsulation, which is how well that a programmer hides the data (fields or attributes) from the user, a major rule of encapsulation is not to reveal internal data to the user. Modifying a field as public means that any other class can access those fields, and that can be very dangerous to the user. That is why I always recommend Java programmers to use accessor and mutator methods so that the fields will not get into the wrong hands.
In case you were curious about how to access a private field, you can use reflection, which actually accesses the data of a particular class so that you can mutate it if you really must do so. As a frivolous example, suppose that you knew that the java.lang.String class contains a private field of type char[] (that is, a char array). It is hidden from the user, so you cannot access the field directly. (By the way, the method java.lang.String.toCharArray() accesses the field for you.) If you wanted to access each character consecutively and store each character into a collection (for the sake of simplicity, why not a java.util.List?), then here is how to use reflection in this case:
/**
This method iterates through each character in a <code>String</code> and places each of them into a <code>java.util.List</code> of type <code>Character</code>.
#param str The <code>String</code> to extract from.
#param list The list to store each character into. (This is necessary because the compiler knows not which <code>List</code> to use, so it will automatically clear the list anyway.)
*/
public static void extractStringData(String str, List<Character> list) throws IllegalAccessException, NoSuchFieldException
{
java.lang.reflect.Field value = String.class.getDeclaredField("value");
value.setAccessible(true);
char[] data = (char[]) value.get(str);
for(char ch : data) list.add(ch);
}
As a sidenote, note that reflection takes a lot of performance out of your program. If there is a field, method, or inner or nested class that you must access for whatever reason (which is highly unlikely anyway), then you should use reflection. The main reason why reflection takes away precious performance is because of the relatively innumerable exceptions that it throws. I am glad to have helped!
I recently had the following thoughts: when you define your object and override toString method, that might be called multiple times during the program execution. I am not sure about how certain UI components refresh themselves (does JTable upon refresh calls each cell-member toString method) or whether debugger calls toString every time when you step on instruction that modifies the object etc. Anyway, I was thinking whether it would be beneficial to define a lazily evaluated String as our to String definition if our structure is IMMUTABLE:
private String toString;
//.. definitions of many components, sets, lists which won't change
public String toString(){
if (toString == null) // instantiate
return toString;
}
Does the above is worth doing?
If the string creation process is long and the created string is often used, you should store it. If not, it depends on how many times you will call the toString() method (always the same old war between memory and CPU time)
If your classe is immutable and you know that the toString() method will be called at least once, you should instanciate the string in the constructor (or in the factory), instead of always checking.
The standard rules of optimisation come into play here.
Don't do it.
(For experts only) Don't do it yet - at least not until you've got profiling information illustrating why it's a problem that needs solving.
You're losing clarity and maintainability for performance reasons, so quantify what those performance reasons are and whether they're worth the cost.
You should not need to worry about this due to Javas String Literal Pool. Assuming that you are not creating lots of unique strings Java will traverse your code on loading of the class file or initialization and find string, adding them to the String Literal Pool.
This is a great article and overview on this with nice diagrams - String Literal Pool Overview
Other suggestions are to use StringBuffers as the independent parts would not added to the buffer only the output string.
Making StringBuffer static would also avoid creating a new object each time, but it really depends on how often you call toString().
If this is really a bottle neck running a profiler as Tom Johnson suggested would be the next direction to take.
Why is it that they decided to make String immutable in Java and .NET (and some other languages)? Why didn't they make it mutable?
According to Effective Java, chapter 4, page 73, 2nd edition:
"There are many good reasons for this: Immutable classes are easier to
design, implement, and use than mutable classes. They are less prone
to error and are more secure.
[...]
"Immutable objects are simple. An immutable object can be in
exactly one state, the state in which it was created. If you make sure
that all constructors establish class invariants, then it is
guaranteed that these invariants will remain true for all time, with
no effort on your part.
[...]
Immutable objects are inherently thread-safe; they require no synchronization. They cannot be corrupted by multiple threads
accessing them concurrently. This is far and away the easiest approach
to achieving thread safety. In fact, no thread can ever observe any
effect of another thread on an immutable object. Therefore,
immutable objects can be shared freely
[...]
Other small points from the same chapter:
Not only can you share immutable objects, but you can share their internals.
[...]
Immutable objects make great building blocks for other objects, whether mutable or immutable.
[...]
The only real disadvantage of immutable classes is that they require a separate object for each distinct value.
There are at least two reasons.
First - security http://www.javafaq.nu/java-article1060.html
The main reason why String made
immutable was security. Look at this
example: We have a file open method
with login check. We pass a String to
this method to process authentication
which is necessary before the call
will be passed to OS. If String was
mutable it was possible somehow to
modify its content after the
authentication check before OS gets
request from program then it is
possible to request any file. So if
you have a right to open text file in
user directory but then on the fly
when somehow you manage to change the
file name you can request to open
"passwd" file or any other. Then a
file can be modified and it will be
possible to login directly to OS.
Second - Memory efficiency http://hikrish.blogspot.com/2006/07/why-string-class-is-immutable.html
JVM internally maintains the "String
Pool". To achive the memory
efficiency, JVM will refer the String
object from pool. It will not create
the new String objects. So, whenever
you create a new string literal, JVM
will check in the pool whether it
already exists or not. If already
present in the pool, just give the
reference to the same object or create
the new object in the pool. There will
be many references point to the same
String objects, if someone changes the
value, it will affect all the
references. So, sun decided to make it
immutable.
Actually, the reasons string are immutable in java doesn't have much to do with security. The two main reasons are the following:
Thead Safety:
Strings are extremely widely used type of object. It is therefore more or less guaranteed to be used in a multi-threaded environment. Strings are immutable to make sure that it is safe to share strings among threads. Having an immutable strings ensures that when passing strings from thread A to another thread B, thread B cannot unexpectedly modify thread A's string.
Not only does this help simplify the already pretty complicated task of multi-threaded programming, but it also helps with performance of multi-threaded applications. Access to mutable objects must somehow be synchronized when they can be accessed from multiple threads, to make sure that one thread doesn't attempt to read the value of your object while it is being modified by another thread. Proper synchronization is both hard to do correctly for the programmer, and expensive at runtime. Immutable objects cannot be modified and therefore do not need synchronization.
Performance:
While String interning has been mentioned, it only represents a small gain in memory efficiency for Java programs. Only string literals are interned. This means that only the strings which are the same in your source code will share the same String Object. If your program dynamically creates string that are the same, they will be represented in different objects.
More importantly, immutable strings allow them to share their internal data. For many string operations, this means that the underlying array of characters does not need to be copied. For example, say you want to take the five first characters of String. In Java, you would calls myString.substring(0,5). In this case, what the substring() method does is simply to create a new String object that shares myString's underlying char[] but who knows that it starts at index 0 and ends at index 5 of that char[]. To put this in graphical form, you would end up with the following:
| myString |
v v
"The quick brown fox jumps over the lazy dog" <-- shared char[]
^ ^
| | myString.substring(0,5)
This makes this kind of operations extremely cheap, and O(1) since the operation neither depends on the length of the original string, nor on the length of the substring we need to extract. This behavior also has some memory benefits, since many strings can share their underlying char[].
Thread safety and performance. If a string cannot be modified it is safe and quick to pass a reference around among multiple threads. If strings were mutable, you would always have to copy all of the bytes of the string to a new instance, or provide synchronization. A typical application will read a string 100 times for every time that string needs to be modified. See wikipedia on immutability.
One should really ask, "why should X be mutable?" It's better to default to immutability, because of the benefits already mentioned by Princess Fluff. It should be an exception that something is mutable.
Unfortunately most of the current programming languages default to mutability, but hopefully in the future the default is more on immutablity (see A Wish List for the Next Mainstream Programming Language).
Wow! I Can't believe the misinformation here. Strings being immutable have nothing with security. If someone already has access to the objects in a running application (which would have to be assumed if you are trying to guard against someone 'hacking' a String in your app), they would certainly be a plenty of other opportunities available for hacking.
It's a quite novel idea that the immutability of String is addressing threading issues. Hmmm ... I have an object that is being changed by two different threads. How do I resolve this? synchronize access to the object? Naawww ... let's not let anyone change the object at all -- that'll fix all of our messy concurrency issues! In fact, let's make all objects immutable, and then we can removed the synchonized contruct from the Java language.
The real reason (pointed out by others above) is memory optimization. It is quite common in any application for the same string literal to be used repeatedly. It is so common, in fact, that decades ago, many compilers made the optimization of storing only a single instance of a String literal. The drawback of this optimization is that runtime code that modifies a String literal introduces a problem because it is modifying the instance for all other code that shares it. For example, it would be not good for a function somewhere in an application to change the String literal "dog" to "cat". A printf("dog") would result in "cat" being written to stdout. For that reason, there needed to be a way of guarding against code that attempts to change String literals (i. e., make them immutable). Some compilers (with support from the OS) would accomplish this by placing String literal into a special readonly memory segment that would cause a memory fault if a write attempt was made.
In Java this is known as interning. The Java compiler here is just following an standard memory optimization done by compilers for decades. And to address the same issue of these String literals being modified at runtime, Java simply makes the String class immutable (i. e, gives you no setters that would allow you to change the String content). Strings would not have to be immutable if interning of String literals did not occur.
String is not a primitive type, yet you normally want to use it with value semantics, i.e. like a value.
A value is something you can trust won't change behind your back.
If you write: String str = someExpr();
You don't want it to change unless YOU do something with str.
String as an Object has naturally pointer semantics, to get value semantics as well it needs to be immutable.
One factor is that, if Strings were mutable, objects storing Strings would have to be careful to store copies, lest their internal data change without notice. Given that Strings are a fairly primitive type like numbers, it is nice when one can treat them as if they were passed by value, even if they are passed by reference (which also helps to save on memory).
I know this is a bump, but...
Are they really immutable?
Consider the following.
public static unsafe void MutableReplaceIndex(string s, char c, int i)
{
fixed (char* ptr = s)
{
*((char*)(ptr + i)) = c;
}
}
...
string s = "abc";
MutableReplaceIndex(s, '1', 0);
MutableReplaceIndex(s, '2', 1);
MutableReplaceIndex(s, '3', 2);
Console.WriteLine(s); // Prints 1 2 3
You could even make it an extension method.
public static class Extensions
{
public static unsafe void MutableReplaceIndex(this string s, char c, int i)
{
fixed (char* ptr = s)
{
*((char*)(ptr + i)) = c;
}
}
}
Which makes the following work
s.MutableReplaceIndex('1', 0);
s.MutableReplaceIndex('2', 1);
s.MutableReplaceIndex('3', 2);
Conclusion: They're in an immutable state which is known by the compiler. Of couse the above only applies to .NET strings as Java doesn't have pointers. However a string can be entirely mutable using pointers in C#. It's not how pointers are intended to be used, has practical usage or is safely used; it's however possible, thus bending the whole "mutable" rule. You can normally not modify an index directly of a string and this is the only way. There is a way that this could be prevented by disallowing pointer instances of strings or making a copy when a string is pointed to, but neither is done, which makes strings in C# not entirely immutable.
For most purposes, a "string" is (used/treated as/thought of/assumed to be) a meaningful atomic unit, just like a number.
Asking why the individual characters of a string are not mutable is therefore like asking why the individual bits of an integer are not mutable.
You should know why. Just think about it.
I hate to say it, but unfortunately we're debating this because our language sucks, and we're trying to using a single word, string, to describe a complex, contextually situated concept or class of object.
We perform calculations and comparisons with "strings" similar to how we do with numbers. If strings (or integers) were mutable, we'd have to write special code to lock their values into immutable local forms in order to perform any kind of calculation reliably. Therefore, it is best to think of a string like a numeric identifier, but instead of being 16, 32, or 64 bits long, it could be hundreds of bits long.
When someone says "string", we all think of different things. Those who think of it simply as a set of characters, with no particular purpose in mind, will of course be appalled that someone just decided that they should not be able to manipulate those characters. But the "string" class isn't just an array of characters. It's a STRING, not a char[]. There are some basic assumptions about the concept we refer to as a "string", and it generally can be described as meaningful, atomic unit of coded data like a number. When people talk about "manipulating strings", perhaps they're really talking about manipulating characters to build strings, and a StringBuilder is great for that. Just think a bit about what the word "string" truly means.
Consider for a moment what it would be like if strings were mutable. The following API function could be tricked into returning information for a different user if the mutable username string is intentionally or unintentionally modified by another thread while this function is using it:
string GetPersonalInfo( string username, string password )
{
string stored_password = DBQuery.GetPasswordFor( username );
if (password == stored_password)
{
//another thread modifies the mutable 'username' string
return DBQuery.GetPersonalInfoFor( username );
}
}
Security isn't just about 'access control', it's also about 'safety' and 'guaranteeing correctness'. If a method can't be easily written and depended upon to perform a simple calculation or comparison reliably, then it's not safe to call it, but it would be safe to call into question the programming language itself.
Immutability is not so closely tied to security. For that, at least in .NET, you get the SecureString class.
Later edit: In Java you will find GuardedString, a similar implementation.
The decision to have string mutable in C++ causes a lot of problems, see this excellent article by Kelvin Henney about Mad COW Disease.
COW = Copy On Write.
It's a trade off. Strings go into the String pool and when you create multiple identical Strings they share the same memory. The designers figured this memory saving technique would work well for the common case, since programs tend to grind over the same strings a lot.
The downside is that concatenations make a lot of extra Strings that are only transitional and just become garbage, actually harming memory performance. You have StringBuffer and StringBuilder (in Java, StringBuilder is also in .NET) to use to preserve memory in these cases.
Strings in Java are not truly immutable, you can change their value's using reflection and or class loading. You should not be depending on that property for security.
For examples see: Magic Trick In Java
Immutability is good. See Effective Java. If you had to copy a String every time you passed it around, then that would be a lot of error-prone code. You also have confusion as to which modifications affect which references. In the same way that Integer has to be immutable to behave like int, Strings have to behave as immutable to act like primitives. In C++ passing strings by value does this without explicit mention in the source code.
There is an exception for nearly almost every rule:
using System;
using System.Runtime.InteropServices;
namespace Guess
{
class Program
{
static void Main(string[] args)
{
const string str = "ABC";
Console.WriteLine(str);
Console.WriteLine(str.GetHashCode());
var handle = GCHandle.Alloc(str, GCHandleType.Pinned);
try
{
Marshal.WriteInt16(handle.AddrOfPinnedObject(), 4, 'Z');
Console.WriteLine(str);
Console.WriteLine(str.GetHashCode());
}
finally
{
handle.Free();
}
}
}
}
It's largely for security reasons. It's much harder to secure a system if you can't trust that your Strings are tamperproof.