java string optimizations - load-in-place algorithm - java

I need to optimize the actual loading/parsing of a csv file (strings). The best way I know is the load-in-place algorithms and I successfully used it using JNI and a C++ dll that loads the data directly from a file made out of the parsed csv data.
It would have been fine if it stopped there but using that scheme only made it 15% faster (no more parsing of the data). One of the reason it is not as fast as I first thought it would be is because the java client uses jstring so I need to convert the actual data again from char* to jstring.
The best would be to ignore that conversion step and load-in-place the data directly into the jstring objects (no more conversion). So instead of duplicating the data based on the loaded-in-place data, the jstring would be pointing directly into the chunk of memory (note that the data would be made of jchars instead of chars). The real bad thing is that we would need to make sure the garbage collector doesn't collect that data (by keeping a reference to it maybe?) but it should be feasible.. no?
I think I have two options to do that:
1- Load the data in java (no more jni) and use chars that are pointing to the loaded data to create the strings.. but I need to find a way to prevent the duplicating of the data when creating a String.
2- Continue using jni to "manually" create and set the jstring variable and make sure that the garbage collector options are set properly to prevent it from doing anything to it. For instance:
jstring str;
str.data = loadedinplacedata; // assign data pointer
return str;
Not sure if that's possible but I wouldn't mind just save the jstring directly into the file and reload it like that:
jstring * str = (jstring *)&loadedinplacedata[someoffset];
return * str;
I'm aware that this is not the usual Java thing, but I'm pretty sure Java is extensible enough to be able to do that. And it's not like I really have a choice in the matter... the project is already 3 years old and it needs to work. =S
This is the JNI code (C++):
const jchar * data = GetData(id, row, col); // get pointer of the string ends w/ \0
unsigned int len = wcslen( (wchar_t*)data );
// The best would be to prevent this function to duplicate the data.
jstring str = env->NewString( data, len );
return str;
Note: The code above made it 20% faster (instead of 15) by using unicode data instead of UTF8 (NewString instead of NewStringUTF). This shows that if I can remove that step or optimize it, I'd get quite the good performance increase.

I've never worked with JNI, but... does it make any sense to have it return a custom class implementing CharSequence, and maybe a few other interfaces like Comparable< CharSequence >, instead of a String? It seems like you'd be less likely to have data corruption problems that way.

I think first you have to understand why the C++ version runs 15% faster, and why that performance improvement is not directly translatable into Java. Why can't you write the code 15% faster in Java?
Lets look at your problem. You've eliminated the parsing by using a C++ dll. (Why could this not have been done in Java?). And then as I understand it:
You're proposing to manipulate the contents of the jstrings directly
You want to prevent the garbage collector from touching these modified jstrings (by keeping a reference to them), and therefore potentially modifying the behaviour of the JVM and screwing with the garbage collector when it does eventually garbage collect.
Will you 'fix' these references before you allow them to be garbage collected?
If you propose doing your own memory management, why are you using java at all? Why not just do it in pure C++?
Assuming that you wish to continue in Java, when you create a String, it the String itself is a new Object, but the data that it's pointing to is not necessarily. You can test this by calling String.intern(). Using the following code:
public static void main(String[] args) {
String s3 = "foofoo";
String s1 = call("foo");
String s2 = call("foo");
System.out.println("s1 == s2=" + (s1 == s2));
System.out.println("s1.intern() == s2.intern()=" + (s1.intern() == s2.intern()));
System.out.println("s1.intern() == s3.intern()=" + (s1.intern() == s3.intern()));
System.out.println("s1.substring(3) == s2.substring(3)=" + (s1.substring(3) == s2.substring(3)));
System.out.println("s1.substring(3).intern() == s2.substring(3).intern()=" + (s1.substring(3).intern() == s2.substring(3).intern()));
}
public static String call(String s) {
return s + "foo";
}
This produces:
s1 == s2=false
s1.intern() == s2.intern()=true
s1.intern() == s3.intern()=true
s1.substring(3) == s2.substring(3)=false
s1.substring(3).intern() == s2.substring(3).intern()=true
So you can see that although the String objects are different, the data, the actual bytes aren't. So your modifications may not actually be that relevant, the JVM may already be doing it for you. And it's worth saying that if you start modifying the internals of jstrings, this may well screw this up.
My suggestion would be to find out what you can do in terms of algorithms. Development with pure java is always quicker that Java & JNI combined. You've got a much better chance of finding a better solution with pure Java.

Well... seems like what I wanted to do is not "supported" by Java unless I hack it.. I believe it would be possible to do so by using GetStringCritical to get the actual char array address and then find out the number of characters and such but this is way beyond "safe" programming.
The best work around I found was to create a hash table in java and use an unique identifier processed while creating my data file (acting similar to .intern()). if the string was not in the hash table, it would query it through the dll and save it in the hash table.
data file:
numrow,numcols,
for each cell, add a integer value (in my case the offset in memory pointing to the string)
for each cell, add string ending with \0
By using the offset value, I can somewhat minimize the number of strings creation and string queries. I tried using globalref to keep the string inside the dll but that made it 4 times slower.

Related

Why would it be better to store sensitive data in a char than String? [duplicate]

In Swing, the password field has a getPassword() (returns char[]) method instead of the usual getText() (returns String) method. Similarly, I have come across a suggestion not to use String to handle passwords.
Why does String pose a threat to security when it comes to passwords?
It feels inconvenient to use char[].
Strings are immutable. That means once you've created the String, if another process can dump memory, there's no way (aside from reflection) you can get rid of the data before garbage collection kicks in.
With an array, you can explicitly wipe the data after you're done with it. You can overwrite the array with anything you like, and the password won't be present anywhere in the system, even before garbage collection.
So yes, this is a security concern - but even using char[] only reduces the window of opportunity for an attacker, and it's only for this specific type of attack.
As noted in the comments, it's possible that arrays being moved by the garbage collector will leave stray copies of the data in memory. I believe this is implementation-specific - the garbage collector may clear all memory as it goes, to avoid this sort of thing. Even if it does, there's still the time during which the char[] contains the actual characters as an attack window.
While other suggestions here seem valid, there is one other good reason. With plain String you have much higher chances of accidentally printing the password to logs, monitors or some other insecure place. char[] is less vulnerable.
Consider this:
public static void main(String[] args) {
Object pw = "Password";
System.out.println("String: " + pw);
pw = "Password".toCharArray();
System.out.println("Array: " + pw);
}
Prints:
String: Password
Array: [C#5829428e
To quote an official document, the Java Cryptography Architecture guide says this about char[] vs. String passwords (about password-based encryption, but this is more generally about passwords of course):
It would seem logical to collect and store the password in an object
of type java.lang.String. However, here's the caveat: Objects of
type String are immutable, i.e., there are no methods defined that
allow you to change (overwrite) or zero out the contents of a String
after usage. This feature makes String objects unsuitable for
storing security sensitive information such as user passwords. You
should always collect and store security sensitive information in a
char array instead.
Guideline 2-2 of the Secure Coding Guidelines for the Java Programming Language, Version 4.0 also says something similar (although it is originally in the context of logging):
Guideline 2-2: Do not log highly sensitive information
Some information, such as Social Security numbers (SSNs) and
passwords, is highly sensitive. This information should not be kept
for longer than necessary nor where it may be seen, even by
administrators. For instance, it should not be sent to log files and
its presence should not be detectable through searches. Some transient
data may be kept in mutable data structures, such as char arrays, and
cleared immediately after use. Clearing data structures has reduced
effectiveness on typical Java runtime systems as objects are moved in
memory transparently to the programmer.
This guideline also has implications for implementation and use of
lower-level libraries that do not have semantic knowledge of the data
they are dealing with. As an example, a low-level string parsing
library may log the text it works on. An application may parse an SSN
with the library. This creates a situation where the SSNs are
available to administrators with access to the log files.
Character arrays (char[]) can be cleared after use by setting each character to zero and Strings not. If someone can somehow see the memory image, they can see a password in plain text if Strings are used, but if char[] is used, after purging data with 0's, the password is secure.
Some people believe that you have to overwrite the memory used to store the password once you no longer need it. This reduces the time window an attacker has to read the password from your system and completely ignores the fact that the attacker already needs enough access to hijack the JVM memory to do this. An attacker with that much access can catch your key events making this completely useless (AFAIK, so please correct me if I am wrong).
Update
Thanks to the comments I have to update my answer. Apparently there are two cases where this can add a (very) minor security improvement as it reduces the time a password could land on the hard drive. Still I think it's overkill for most use cases.
Your target system may be badly configured or you have to assume it is and you have to be paranoid about core dumps (can be valid if the systems are not managed by an administrator).
Your software has to be overly paranoid to prevent data leaks with the attacker gaining access to the hardware - using things like TrueCrypt (discontinued), VeraCrypt, or CipherShed.
If possible, disabling core dumps and the swap file would take care of both problems. However, they would require administrator rights and may reduce functionality (less memory to use) and pulling RAM from a running system would still be a valid concern.
I don't think this is a valid suggestion, but, I can at least guess at the reason.
I think the motivation is wanting to make sure that you can erase all trace of the password in memory promptly and with certainty after it is used. With a char[] you could overwrite each element of the array with a blank or something for sure. You can't edit the internal value of a String that way.
But that alone isn't a good answer; why not just make sure a reference to the char[] or String doesn't escape? Then there's no security issue. But the thing is that String objects can be intern()ed in theory and kept alive inside the constant pool. I suppose using char[] forbids this possibility.
The answer has already been given, but I'd like to share an issue that I discovered lately with Java standard libraries. While they take great care now of replacing password strings with char[] everywhere (which of course is a good thing), other security-critical data seems to be overlooked when it comes to clearing it from memory.
I'm thinking of e.g. the PrivateKey class. Consider a scenario where you would load a private RSA key from a PKCS#12 file, using it to perform some operation. Now in this case, sniffing the password alone wouldn't help you much as long as physical access to the key file is properly restricted. As an attacker, you would be much better off if you obtained the key directly instead of the password. The desired information can be leaked manifold, core dumps, a debugger session or swap files are just some examples.
And as it turns out, there is nothing that lets you clear the private information of a PrivateKey from memory, because there's no API that lets you wipe the bytes that form the corresponding information.
This is a bad situation, as this paper describes how this circumstance could be potentially exploited.
The OpenSSL library for example overwrites critical memory sections before private keys are freed. Since Java is garbage-collected, we would need explicit methods to wipe and invalidate private information for Java keys, which are to be applied immediately after using the key.
As Jon Skeet states, there is no way except by using reflection.
However, if reflection is an option for you, you can do this.
public static void main(String[] args) {
System.out.println("please enter a password");
// don't actually do this, this is an example only.
Scanner in = new Scanner(System.in);
String password = in.nextLine();
usePassword(password);
clearString(password);
System.out.println("password: '" + password + "'");
}
private static void usePassword(String password) {
}
private static void clearString(String password) {
try {
Field value = String.class.getDeclaredField("value");
value.setAccessible(true);
char[] chars = (char[]) value.get(password);
Arrays.fill(chars, '*');
} catch (Exception e) {
throw new AssertionError(e);
}
}
when run
please enter a password
hello world
password: '***********'
Note: if the String's char[] has been copied as a part of a GC cycle, there is a chance the previous copy is somewhere in memory.
This old copy wouldn't appear in a heap dump, but if you have direct access to the raw memory of the process you could see it. In general you should avoid anyone having such access.
There is nothing that char array gives you vs String unless you clean it up manually after use, and I haven't seen anyone actually doing that. So to me the preference of char[] vs String is a bit exaggerated.
Take a look at the widely used Spring Security library here and ask yourself - are Spring Security guys incompetent or char[] passwords just don't make much sense. When some nasty hacker grabs memory dumps of your RAM be sure s/he'll get all the passwords even if you use sophisticated ways to hide them.
However, Java changes all the time, and some scary features like String Deduplication feature of Java 8 might intern String objects without your knowledge. But that's a different conversation.
Edit: Coming back to this answer after a year of security research, I realize it makes the rather unfortunate implication that you would ever actually compare plaintext passwords. Please don't. Use a secure one-way hash with a salt and a reasonable number of iterations. Consider using a library: this stuff is hard to get right!
Original answer: What about the fact that String.equals() uses short-circuit evaluation, and is therefore vulnerable to a timing attack? It may be unlikely, but you could theoretically time the password comparison in order to determine the correct sequence of characters.
public boolean equals(Object anObject) {
if (this == anObject) {
return true;
}
if (anObject instanceof String) {
String anotherString = (String)anObject;
int n = value.length;
// Quits here if Strings are different lengths.
if (n == anotherString.value.length) {
char v1[] = value;
char v2[] = anotherString.value;
int i = 0;
// Quits here at first different character.
while (n-- != 0) {
if (v1[i] != v2[i])
return false;
i++;
}
return true;
}
}
return false;
}
Some more resources on timing attacks:
A Lesson In Timing Attacks
A discussion about timing attacks over on Information Security Stack Exchange
And of course, the Timing Attack Wikipedia page
Strings are immutable and cannot be altered once they have been created. Creating a password as a string will leave stray references to the password on the heap or on the String pool. Now if someone takes a heap dump of the Java process and carefully scans through he might be able to guess the passwords. Of course these non used strings will be garbage collected but that depends on when the GC kicks in.
On the other side char[] are mutable as soon as the authentication is done you can overwrite them with any character like all M's or backslashes. Now even if someone takes a heap dump he might not be able to get the passwords which are not currently in use. This gives you more control in the sense like clearing the Object content yourself vs waiting for the GC to do it.
String is immutable and it goes to the string pool. Once written, it cannot be overwritten.
char[] is an array which you should overwrite once you used the password and this is how it should be done:
char[] passw = request.getPassword().toCharArray()
if (comparePasswords(dbPassword, passw) {
allowUser = true;
cleanPassword(passw);
cleanPassword(dbPassword);
passw = null;
}
private static void cleanPassword (char[] pass) {
Arrays.fill(pass, '0');
}
One scenario where the attacker could use it is a crashdump—when the JVM crashes and generates a memory dump—you will be able to see the password.
That is not necessarily a malicious external attacker. This could be a support user that has access to the server for monitoring purposes. He/she could peek into a crashdump and find the passwords.
The short and straightforward answer would be because char[] is mutable while String objects are not.
Strings in Java are immutable objects. That is why they can't be modified once created, and therefore the only way for their contents to be removed from memory is to have them garbage collected. It will be only then when the memory freed by the object can be overwritten, and the data will be gone.
Now garbage collection in Java doesn't happen at any guaranteed interval. The String can thus persist in memory for a long time, and if a process crashes during this time, the contents of the string may end up in a memory dump or some log.
With a character array, you can read the password, finish working with it as soon as you can, and then immediately change the contents.
Case String:
String password = "ill stay in StringPool after Death !!!";
// some long code goes
// ...Now I want to remove traces of password
password = null;
password = "";
// above attempts wil change value of password
// but the actual password can be traced from String pool through memory dump, if not garbage collected
Case CHAR ARRAY:
char[] passArray = {'p','a','s','s','w','o','r','d'};
// some long code goes
// ...Now I want to remove traces of password
for (int i=0; i<passArray.length;i++){
passArray[i] = 'x';
}
// Now you ACTUALLY DESTROYED traces of password form memory
A string in Java is immutable. So whenever a string is created, it will remain in memory until it is garbage collected. So anyone who has access to the memory can read the value of the string.
If the value of the string is modified then it will end up creating a new string. So both the original value and the modified value stay in the memory until it is garbage collected.
With the character array, the contents of the array can be modified or erased once the purpose of the password is served. The original contents of the array will not be found in memory after it is modified and even before the garbage collection kicks in.
Because of the security concern it is better to store password as a character array.
It is debatable as to whether you should use String or use Char[] for this purpose because both have their advantages and disadvantages. It depends on what the user needs.
Since Strings in Java are immutable, whenever some tries to manipulate your string it creates a new Object and the existing String remains unaffected. This could be seen as an advantage for storing a password as a String, but the object remains in memory even after use. So if anyone somehow got the memory location of the object, that person can easily trace your password stored at that location.
Char[] is mutable, but it has the advantage that after its usage the programmer can explicitly clean the array or override values. So when it's done being used it is cleaned and no one could ever know about the information you had stored.
Based on the above circumstances, one can get an idea whether to go with String or to go with Char[] for their requirements.
A lot of the previous answers are great. There is another point which I am assuming (please correct me if I am wrong).
By default Java uses UTF-16 for storing strings. Using character arrays, char[] array, facilitates use of Unicode, regional characters, etc. This technique allows all character set to be respected equally for storing the passwords and henceforth will not initiate certain crypto issues due to character set confusion. Finally, using the character array, we can convert the password array to our desired character set string.

Android - deletion of resources in Java [duplicate]

String secret="foo";
WhatILookFor.securelyWipe(secret);
And I need to know that it will not be removed by java optimizer.
A String cannot be "wiped". It is immutable, and short of some really dirty and dangerous tricks you cannot alter that.
So the safest solution is to not put the data into a string in the first place. Use a StringBuilder or an array of characters instead, or some other representation that is not immutable. (And then clear it when you are done.)
For the record, there are a couple of ways that you can change the contents of a String's backing array. For example, you can use reflection to fish out a reference to the String's backing array, and overwrite its contents. However, this involves doing things that the JLS states have unspecified behaviour so you cannot guarantee that the optimizer won't do something unexpected.
My personal take on this is that you are better off locking down your application platform so that unauthorized people can't gain access to the memory / memory dump in the first place. After all, if the platform is not properly secured, the "bad guys" may be able to get hold of the string contents before you erase it. Steps like this might be warranted for small amounts of security critical state, but if you've got a lot of "confidential" information to process, it is going to be a major hassle to not be able to use normal strings and string handling.
You would need direct access to the memory.
You really wouldn't be able to do this with String, since you don't have reliable access to the string, and don't know if it's been interned somewhere, or if an object was created that you don't know about.
If you really needed to this, you'd have to do something like
public class SecureString implements CharSequence {
char[] data;
public void wipe() {
for(int i = 0; i < data.length; i++) data[i] = '.'; // random char
}
}
That being said, if you're worried about data still being in memory, you have to realize that if it was ever in memory at one point, than an attacker probably already got it. The only thing you realistically protect yourself from is if a core dump is flushed to a log file.
Regarding the optimizer, I incredibly doubt it will optimize away the operation. If you really needed it to, you could do something like this:
public int wipe() {
// wipe the array to a random value
java.util.Arrays.fill(data, (char)(rand.nextInt(60000));
// compute hash to force optimizer to do the wipe
int hash = 0;
for(int i = 0; i < data.length; i++) {
hash = hash * 31 + (int)data[i];
}
return hash;
}
This will force the compiler to do the wipe. It makes it roughly twice as long to run, but it's a pretty fast operation as it is, and doesn't increase the order of complexity.
Store the data off-heap using the "Unsafe" methods. You can then zero over it when done and be certain that it won't be pushed around the heap by the JVM.
Here is a good post on Unsafe:
http://highlyscalable.wordpress.com/2012/02/02/direct-memory-access-in-java/
If you're going to use a String then I think you are worried about it appearing in a memory dump. I suggest using String.replace() on key-characters so that when the String is used at run-time it will change and then go out of scope after it is used and won't appear correctly in a memory dump. However, I strongly recommend that you not use a String for sensitive data.

"Immutable" strings in Java - actually, it's a lie [duplicate]

We all know that String is immutable in Java, but check the following code:
String s1 = "Hello World";
String s2 = "Hello World";
String s3 = s1.substring(6);
System.out.println(s1); // Hello World
System.out.println(s2); // Hello World
System.out.println(s3); // World
Field field = String.class.getDeclaredField("value");
field.setAccessible(true);
char[] value = (char[])field.get(s1);
value[6] = 'J';
value[7] = 'a';
value[8] = 'v';
value[9] = 'a';
value[10] = '!';
System.out.println(s1); // Hello Java!
System.out.println(s2); // Hello Java!
System.out.println(s3); // World
Why does this program operate like this? And why is the value of s1 and s2 changed, but not s3?
String is immutable* but this only means you cannot change it using its public API.
What you are doing here is circumventing the normal API, using reflection. The same way, you can change the values of enums, change the lookup table used in Integer autoboxing etc.
Now, the reason s1 and s2 change value, is that they both refer to the same interned string. The compiler does this (as mentioned by other answers).
The reason s3 does not was actually a bit surprising to me, as I thought it would share the value array (it did in earlier version of Java, before Java 7u6). However, looking at the source code of String, we can see that the value character array for a substring is actually copied (using Arrays.copyOfRange(..)). This is why it goes unchanged.
You can install a SecurityManager, to avoid malicious code to do such things. But keep in mind that some libraries depend on using these kind of reflection tricks (typically ORM tools, AOP libraries etc).
*) I initially wrote that Strings aren't really immutable, just "effective immutable". This might be misleading in the current implementation of String, where the value array is indeed marked private final. It's still worth noting, though, that there is no way to declare an array in Java as immutable, so care must be taken not to expose it outside its class, even with the proper access modifiers.
As this topic seems overwhelmingly popular, here's some suggested further reading: Heinz Kabutz's Reflection Madness talk from JavaZone 2009, which covers a lot of the issues in the OP, along with other reflection... well... madness.
It covers why this is sometimes useful. And why, most of the time, you should avoid it. :-)
In Java, if two string primitive variables are initialized to the same literal, it assigns the same reference to both variables:
String Test1="Hello World";
String Test2="Hello World";
System.out.println(test1==test2); // true
That is the reason the comparison returns true. The third string is created using substring() which makes a new string instead of pointing to the same.
When you access a string using reflection, you get the actual pointer:
Field field = String.class.getDeclaredField("value");
field.setAccessible(true);
So change to this will change the string holding a pointer to it, but as s3 is created with a new string due to substring() it would not change.
You are using reflection to circumvent the immutability of String - it's a form of "attack".
There are lots of examples you can create like this (eg you can even instantiate a Void object too), but it doesn't mean that String is not "immutable".
There are use cases where this type of code may be used to your advantage and be "good coding", such as clearing passwords from memory at the earliest possible moment (before GC).
Depending on the security manager, you may not be able to execute your code.
You are using reflection to access the "implementation details" of string object. Immutability is the feature of the public interface of an object.
Visibility modifiers and final (i.e. immutability) are not a measurement against malicious code in Java; they are merely tools to protect against mistakes and to make the code more maintainable (one of the big selling points of the system). That is why you can access internal implementation details like the backing char array for Strings via reflection.
The second effect you see is that all Strings change while it looks like you only change s1. It is a certain property of Java String literals that they are automatically interned, i.e. cached. Two String literals with the same value will actually be the same object. When you create a String with new it will not be interned automatically and you will not see this effect.
#substring until recently (Java 7u6) worked in a similar way, which would have explained the behaviour in the original version of your question. It didn't create a new backing char array but reused the one from the original String; it just created a new String object that used an offset and a length to present only a part of that array. This generally worked as Strings are immutable - unless you circumvent that. This property of #substring also meant that the whole original String couldn't be garbage collected when a shorter substring created from it still existed.
As of current Java and your current version of the question there is no strange behaviour of #substring.
String immutability is from the interface perspective. You are using reflection to bypass the interface and directly modify the internals of the String instances.
s1 and s2 are both changed because they are both assigned to the same "intern" String instance. You can find out a bit more about that part from this article about string equality and interning. You might be surprised to find out that in your sample code, s1 == s2 returns true!
Which version of Java are you using? From Java 1.7.0_06, Oracle has changed the internal representation of String, especially the substring.
Quoting from Oracle Tunes Java's Internal String Representation:
In the new paradigm, the String offset and count fields have been removed, so substrings no longer share the underlying char [] value.
With this change, it may happen without reflection (???).
There are really two questions here:
Are strings really immutable?
Why is s3 not changed?
To point 1: Except for ROM there is no immutable memory in your computer. Nowadays even ROM is sometimes writable. There is always some code somewhere (whether it's the kernel or native code sidestepping your managed environment) that can write to your memory address. So, in "reality", no they are not absolutely immutable.
To point 2: This is because substring is probably allocating a new string instance, which is likely copying the array. It is possible to implement substring in such a way that it won't do a copy, but that doesn't mean it does. There are tradeoffs involved.
For example, should holding a reference to reallyLargeString.substring(reallyLargeString.length - 2) cause a large amount of memory to be held alive, or only a few bytes?
That depends on how substring is implemented. A deep copy will keep less memory alive, but it will run slightly slower. A shallow copy will keep more memory alive, but it will be faster. Using a deep copy can also reduce heap fragmentation, as the string object and its buffer can be allocated in one block, as opposed to 2 separate heap allocations.
In any case, it looks like your JVM chose to use deep copies for substring calls.
To add to the #haraldK's answer - this is a security hack which could lead to a serious impact in the app.
First thing is a modification to a constant string stored in a String Pool. When string is declared as a String s = "Hello World";, it's being places into a special object pool for further potential reusing. The issue is that compiler will place a reference to the modified version at compile time and once the user modifies the string stored in this pool at runtime, all references in code will point to the modified version. This would result into a following bug:
System.out.println("Hello World");
Will print:
Hello Java!
There was another issue I experienced when I was implementing a heavy computation over such risky strings. There was a bug which happened in like 1 out of 1000000 times during the computation which made the result undeterministic. I was able to find the problem by switching off the JIT - I was always getting the same result with JIT turned off. My guess is that the reason was this String security hack which broke some of the JIT optimization contracts.
According to the concept of pooling, all the String variables containing the same value will point to the same memory address. Therefore s1 and s2, both containing the same value of “Hello World”, will point towards the same memory location (say M1).
On the other hand, s3 contains “World”, hence it will point to a different memory allocation (say M2).
So now what's happening is that the value of S1 is being changed (by using the char [ ] value). So the value at the memory location M1 pointed both by s1 and s2 has been changed.
Hence as a result, memory location M1 has been modified which causes change in the value of s1 and s2.
But the value of location M2 remains unaltered, hence s3 contains the same original value.
The reason s3 does not actually change is because in Java when you do a substring the value character array for a substring is internally copied (using Arrays.copyOfRange()).
s1 and s2 are the same because in Java they both refer to the same interned string. It's by design in Java.
String is immutable, but through reflection you're allowed to change the String class. You've just redefined the String class as mutable in real-time. You could redefine methods to be public or private or static if you wanted.
Strings are created in permanent area of the JVM heap memory. So yes, it's really immutable and cannot be changed after being created.
Because in the JVM, there are three types of heap memory:
1. Young generation
2. Old generation
3. Permanent generation.
When any object are created, it goes into the young generation heap area and PermGen area reserved for String pooling.
Here is more detail you can go and grab more information from:
How Garbage Collection works in Java .
[Disclaimer this is a deliberately opinionated style of answer as I feel a more "don't do this at home kids" answer is warranted]
The sin is the line field.setAccessible(true); which says to violate the public api by allowing access to a private field. Thats a giant security hole which can be locked down by configuring a security manager.
The phenomenon in the question are implementation details which you would never see when not using that dangerous line of code to violate the access modifiers via reflection. Clearly two (normally) immutable strings can share the same char array. Whether a substring shares the same array depends on whether it can and whether the developer thought to share it. Normally these are invisible implementation details which you should not have to know unless you shoot the access modifier through the head with that line of code.
It is simply not a good idea to rely upon such details which cannot be experienced without violating the access modifiers using reflection. The owner of that class only supports the normal public API and is free to make implementation changes in the future.
Having said all that the line of code is really very useful when you have a gun held you your head forcing you to do such dangerous things. Using that back door is usually a code smell that you need to upgrade to better library code where you don't have to sin. Another common use of that dangerous line of code is to write a "voodoo framework" (orm, injection container, ...). Many folks get religious about such frameworks (both for and against them) so I will avoid inviting a flame war by saying nothing other than the vast majority of programmers don't have to go there.
String is immutable in nature Because there is no method to modify String object.
That is the reason They introduced StringBuilder and StringBuffer classes
This is a quick guide to everything
// Character array
char[] chr = {'O', 'K', '!'};
// this is String class
String str1 = new String(chr);
// this is concat
str1 = str1.concat("another string's ");
// this is format
System.out.println(String.format(str1 + " %s ", "string"));
// this is equals
System.out.println(str1.equals("another string"));
//this is split
for(String s: str1.split(" ")){
System.out.println(s);
}
// this is length
System.out.println(str1.length());
//gives an score of the total change in the length
System.out.println(str1.compareTo("OK!another string string's"));
// trim
System.out.println(str1.trim());
// intern
System.out.println(str1.intern());
// character at
System.out.println(str1.charAt(5));
// substring
System.out.println(str1.substring(5, 12));
// to uppercase
System.out.println(str1.toUpperCase());
// to lowerCase
System.out.println(str1.toLowerCase());
// replace
System.out.println(str1.replace("another", "hello"));
// output
// OK!another string's string
// false
// OK!another
// string's
// 20
// 7
// OK!another string's
// OK!another string's
// o
// other s
// OK!ANOTHER STRING'S
// ok!another string's
// OK!hello string's

How to detect whether String.substring copies the character data

I know that for Oracle Java 1.7 update 6 and newer, when using String.substring,
the internal character array of the String is copied, and for older versions, it is shared.
But I found no offical API that would tell me the current behavior.
Use Case
My use case is:
In a parser, I like to detect whether String.substring copies or shares the underlying character array.
The problem is, if the character array is shared, then my parser needs to explicitly "un-share" using new String(s) to avoid
memory problems. However, if String.substring anyway copies the data, then this is not necessary, and explicitly copying the data in the parser could be avoided. Use case:
// possibly the query is very very large
String query = "select * from test ...";
// the identifier is used outside of the parser
String identifier = query.substring(14, 18);
// avoid if possible for speed,
// but needed if identifier internally
// references the large query char array
identifier = new String(identifier);
What I Need
Basically, I would like to have a static method boolean isSubstringCopyingForSure() that would detect if new String(..) is not needed. I'm OK if detection doesn't work if there is a SecurityManager. Basically, the detection should be conservative (to avoid memory problems, I'd rather use new String(..) even if not necessary).
Options
I have a few options, but I'm not sure if they are reliable, specially for non-Oracle JVMs:
Checking for the String.offset field
/**
* #return true if substring is copying, false if not or if it is not clear
*/
static boolean isSubstringCopyingForSure() {
if (System.getSecurityManager() != null) {
// we can not reliably check it
return false;
}
try {
for (Field f : String.class.getDeclaredFields()) {
if ("offset".equals(f.getName())) {
return false;
}
}
return true;
} catch (Exception e) {
// weird, we do have a security manager?
}
return false;
}
Checking the JVM version
static boolean isSubstringCopyingForSure() {
// but what about non-Oracle JREs?
return System.getProperty("java.vendor").startsWith("Oracle") &&
System.getProperty("java.version").compareTo("1.7.0_45") >= 0;
}
Checking the behavior
There are two options, both are rather complicated. One is create a string using custom charset, then create a new string b using substring, then modify the original string and check whether b is also changed. The second options is create huge string, then a few substrings, and check the memory usage.
Right, indeed this change was made in 7u6. There is no API change for this, as this change is strictly an implementation change, not an API change, nor is there an API to detect which behavior the running JDK has. However, it is certainly possible for applications to notice a difference in performance or memory utilization because of the change. In fact, it's not difficult to write a program that works in 7u4 but fails in 7u6 and vice-versa. We expect that the tradeoff is favorable for the majority of applications, but undoubtedly there are applications that will suffer from this change.
It's interesting that you're concerned about the case where string values are shared (prior to 7u6). Most people I've heard from have the opposite concern, where they like the sharing and the 7u6 change to unshared values is causing them problems (or, they're afraid it will cause problems).
In any case the thing to do is measure, not guess!
First, compare the performance of your application between similar JDKs with and without the change, e.g. 7u4 and 7u6. Probably you should be looking at GC logs or other memory monitoring tools. If the difference is acceptable, you're done!
Assuming that the shared string values prior to 7u6 cause a problem, the next step is to try the simple workaround of new String(s.substring(...)) to force the string value to be unshared. Then measure that. Again, if the performance is acceptable on both JDKs, you're done!
If it turns out that in the unshared case, the extra call to new String() is unacceptable, then probably the best way to detect this case and make the "unsharing" call conditional is to reflect on a String's value field, which is a char[], and get its length:
int getValueLength(String s) throws Exception {
Field field = String.class.getDeclaredField("value");
field.setAccessible(true);
return ((char[])field.get(s)).length;
}
Consider a string resulting from a call to substring() that returns a string shorter than the original. In the shared case, the substring's length() will differ from the length of the value array retrieved as shown above. In the unshared case, they'll be the same. For example:
String s = "abcdefghij".substring(2, 5);
int logicalLength = s.length();
int valueLength = getValueLength(s);
System.out.printf("%d %d ", logicalLength, valueLength);
if (logicalLength != valueLength) {
System.out.println("shared");
else
System.out.println("unshared");
On JDKs older than 7u6, the value's length will be 10, whereas on 7u6 or later, the value's length will be 3. In both cases, of course, the logical length will be 3.
This is not a detail you need to care about. No really! Just call identifier = new String(identifier) in both cases (JDK6 and JDK7). Under JDK6 it will create a copy (as desired). Under JDK7, because the substring is already a unique string the constructor is essentially a no-op (no copy is performed -- read the code). Sure there is a slight overhead of object creation, but because of object reuse in the Younger generation, I challenge you to qualify a performance difference.
In older Java versions, String.substring(..) will use the same char array as the original, with a different offset and count.
In the latest Java versions (according to the comment by Thomas Mueller: since 1.7 Update 6), this has changed, and substrings are now be created with a new char array.
If you parse lots of sources, the best way to deal with it is to avoid checking the Strings' internals, but anticipate this effect and always create new Strings where you need them (as in the first code block in your question).
String identifier = query.substring(14, 18);
// older Java versions: backed by same char array, different offset and count
// newer Java versions: copy of the desired run of the original char array
identifier = new String(identifier);
// older Java versions: when the backed char array is larger than count, a copy of the desired run will be made
// newer Java versions: trivial operation, create a new String instance which is backed by the same char array, no copy needed.
That way, you end up with the same result with both variants, without having to distinguish them and without unnecessary array copy overhead.
Are you sure, that making string copy is really expensive? I belive that JVM optimizer has intrinsics about strings and avoids unnecessary copies. Also large texts are parsed with one-pass algorithms such as LALR automata, generated by compiler compilers. So, the parser input usually be an java.io.Reader or another streaming interface, not a solid String. Parsing is usially expensive itself, still not as expensive as type checking. I don't think that copying strings is a real bottleneck. You better experience with profiler and with microbenchmarks before your assumptions.

How can I securely wipe a confidential data in memory in java with guarantee it will not be 'optimized'?

String secret="foo";
WhatILookFor.securelyWipe(secret);
And I need to know that it will not be removed by java optimizer.
A String cannot be "wiped". It is immutable, and short of some really dirty and dangerous tricks you cannot alter that.
So the safest solution is to not put the data into a string in the first place. Use a StringBuilder or an array of characters instead, or some other representation that is not immutable. (And then clear it when you are done.)
For the record, there are a couple of ways that you can change the contents of a String's backing array. For example, you can use reflection to fish out a reference to the String's backing array, and overwrite its contents. However, this involves doing things that the JLS states have unspecified behaviour so you cannot guarantee that the optimizer won't do something unexpected.
My personal take on this is that you are better off locking down your application platform so that unauthorized people can't gain access to the memory / memory dump in the first place. After all, if the platform is not properly secured, the "bad guys" may be able to get hold of the string contents before you erase it. Steps like this might be warranted for small amounts of security critical state, but if you've got a lot of "confidential" information to process, it is going to be a major hassle to not be able to use normal strings and string handling.
You would need direct access to the memory.
You really wouldn't be able to do this with String, since you don't have reliable access to the string, and don't know if it's been interned somewhere, or if an object was created that you don't know about.
If you really needed to this, you'd have to do something like
public class SecureString implements CharSequence {
char[] data;
public void wipe() {
for(int i = 0; i < data.length; i++) data[i] = '.'; // random char
}
}
That being said, if you're worried about data still being in memory, you have to realize that if it was ever in memory at one point, than an attacker probably already got it. The only thing you realistically protect yourself from is if a core dump is flushed to a log file.
Regarding the optimizer, I incredibly doubt it will optimize away the operation. If you really needed it to, you could do something like this:
public int wipe() {
// wipe the array to a random value
java.util.Arrays.fill(data, (char)(rand.nextInt(60000));
// compute hash to force optimizer to do the wipe
int hash = 0;
for(int i = 0; i < data.length; i++) {
hash = hash * 31 + (int)data[i];
}
return hash;
}
This will force the compiler to do the wipe. It makes it roughly twice as long to run, but it's a pretty fast operation as it is, and doesn't increase the order of complexity.
Store the data off-heap using the "Unsafe" methods. You can then zero over it when done and be certain that it won't be pushed around the heap by the JVM.
Here is a good post on Unsafe:
http://highlyscalable.wordpress.com/2012/02/02/direct-memory-access-in-java/
If you're going to use a String then I think you are worried about it appearing in a memory dump. I suggest using String.replace() on key-characters so that when the String is used at run-time it will change and then go out of scope after it is used and won't appear correctly in a memory dump. However, I strongly recommend that you not use a String for sensitive data.

Categories

Resources