In reference to the link: File IO Tuning, last section titled "Further Tuning" where the author suggests using char[] to avoid generating String objects for n lines in the file, I need to understand how does
char[] arr = new char[]{'a','u','t','h','o','r'};
differ from
String s = "author";
in terms of memory consumption or any other performance factor? Isn't String object internally stored as a character array? I feel silly since I never thought of this before. :-)
In Oracle's JDK a String has four instance-level fields:
A character array
An integral offset
An integral character count
An integral hash value
That means that each String introduces an extra object reference (the String itself), and three integers in addition to the character array itself. (The offset and character count are there to allow sharing of the character array among String instances produced through the String#substring() methods, a design choice that some other Java library implementers have eschewed.) Beyond the extra storage cost, there's also one more level of access indirection, not to mention the bounds checking with which the String guards its character array.
If you can get away with allocating and consuming just the basic character array, there's space to be saved there. It's certainly not idiomatic to do so in Java though; judicious comments would be warranted to justify the choice, preferably with mention of evidence from having profiled the difference.
In the example you've referred to, it's because there's only a single character array being allocated for the whole loop. It's repeatedly reading into that same array, and processing it in place.
Compare that with using readLine which needs to create a new String instance on each iteration. Each String instance will contain a few int fields and a reference to a char[] containing the actual data - so it would need two new instances per iteration.
I'd usually expect the differences to be insignificant (with a decent GC throwing away unused "young" objects very efficiently) compared with the IO involved in reading the data - assuming it's from disk - but I believe that's the point the author was trying to make.
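To make the comparison concrete, here is a minimal sketch of the two approaches. It is not the linked article's code: the class and method names are made up for illustration, and a StringReader stands in for the file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class LineCounting {
    // Counts '\n' characters by refilling one reusable char[];
    // no String is allocated per line.
    static int countWithBuffer(String text) {
        char[] buf = new char[8192];
        int lines = 0;
        try (Reader in = new StringReader(text)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                for (int i = 0; i < n; i++) {
                    if (buf[i] == '\n') {
                        lines++;
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    // Counts lines with readLine(), which hands back a new String per line.
    static int countWithReadLine(String text) {
        int lines = 0;
        try (BufferedReader in = new BufferedReader(new StringReader(text))) {
            while (in.readLine() != null) {
                lines++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        String text = "one\ntwo\nthree\n";
        System.out.println(countWithBuffer(text));
        System.out.println(countWithReadLine(text));
    }
}
```

Both methods agree on input whose last line ends with a newline. Whether the difference matters for a real file is something only a profiler can tell you.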
The author didn't get the reason right. The real overhead in in.readLine() is the copying of a char[] buffer when making a String out of it. That additional copying is the most damning cost when dealing with large data.
It is possible to optimize this within JDK so that the additional copying is not needed.
Here are a few reasons why a character array can be a better choice than a String in Java:
Say, for storing a password:
1) Since Strings are immutable in Java, a password stored as plain text stays in memory until the garbage collector clears it, and since Strings may be kept in the String pool for reuse, there is a fairly high chance that it will remain in memory for a long time, which poses a security threat.
Anyone who has access to a memory dump can find the password in clear text, which is another reason you should always use an encrypted password rather than plain text.
Since Strings are immutable, there is no way to change their contents, because any change produces a new String, whereas with a char[] you can still set all of its elements to blank or zero. So storing a password in a character array mitigates the risk of the password being stolen.
2) Java itself recommends using the getPassword() method of JPasswordField, which returns a char[], and has deprecated the getText() method, which returns the password in clear text, citing security reasons. It's good to follow the advice of the Java team and adhere to the standard rather than go against it.
3) With a String there is always a risk of printing plain text to a log file or the console, but if you use an array, its contents are not printed; its type and hash code are printed instead. This is not a real reason, but it still makes sense.
For this simple program
String strPassword="Unknown";
char[] charPassword= new char[]{'U','n','k','n','o','w','n'};
System.out.println("String password: " + strPassword);
System.out.println("Character password: " + charPassword);
Output:
String password: Unknown
Character password: [C@110b053
That's all on why a character array is a better choice than a String for storing passwords in Java. Using a char[] alone is not enough, though; you also need to erase its contents to be more secure.
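Erasing the contents could look like the sketch below. The class and its methods are hypothetical names, and the plain-text comparison is only there to keep the example short; a real application would compare salted hashes.

```java
import java.util.Arrays;

class PasswordHandling {
    // Hypothetical check, for illustration only.
    static boolean matches(char[] attempt, char[] expected) {
        return Arrays.equals(attempt, expected);
    }

    // Overwrites every element so the clear text no longer sits in this array.
    static void wipe(char[] secret) {
        Arrays.fill(secret, '\0');
    }

    public static void main(String[] args) {
        char[] password = {'U', 'n', 'k', 'n', 'o', 'w', 'n'};
        char[] expected = {'U', 'n', 'k', 'n', 'o', 'w', 'n'};
        try {
            System.out.println(matches(password, expected));
        } finally {
            // Wipe in finally so the arrays are cleared even if the check throws.
            wipe(password);
            wipe(expected);
        }
        System.out.println(password[0] == '\0');
    }
}
```

Note that this only clears the arrays you control; any copy made elsewhere (for example by a library that converts to String) is unaffected.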
Hope this will help.
My answer focuses on other Stack Overflow questions along similar lines, since others have already posted more direct answers.
There have been other questions similar to this one, and the advice seems to go along the lines of using StringBuilder.
If you're concerned with string concatenation, have a look at the performance comparison described here between three different implementations. Another Stack Overflow post gives some additional pointers and examples you could try yourself to see the performance.
Related
I've started looking into string interning and it seems like a great feature. However, I haven't found a good reason why you would want to create a string using the string constructor. After some digging I came up with this; could someone confirm (or deny) whether this is a valid reason to create a string with new?
Say you have 2 strings:
String novel = "The contents of a very long novel...";
String page = new String("The contents of a single page...");
By default all string literals are stored in the string pool (such as with String novel) and by default all sub-strings of novel will be interned (assuming they are created as a string literal) to optimize memory allocation. Creating a string using the new keyword results in the string being created on the heap rather than in string pool. A particular case when you may want to avoid interning is if you wanted to create a string that is a sub-string of a very large string literal (such as page).
For example, say you had a very large string literal (e.g. the contents of a novel) of which you wanted to process only a portion (e.g. a single page). It may be beneficial to use the string constructor (via the new keyword) when creating the string that contains only a single page of the novel. That way the very large string may be freed from the string pool sooner, keeping only the string that contains the contents of a page on the heap. In contrast, if you created a string literal that is an interned sub-string of an entire novel, a larger amount of the novel may be kept alive in the string pool despite only a small portion of the novel string being needed.
TL;DR: There is no good / valid reason to new a String in a modern JVM, or to call String.intern() explicitly.
Your question contains false statements of fact, and that means that the conclusions that you are drawing are incorrect.
By default all string literals are stored in the string pool (such as with String novel)
That is correct, though it is not "by default". (It is like saying "by default a square has 4 sides". Squares have 4 sides, period. There are no exceptions. And no defaults.)
and by default all sub-strings of novel will be interned (assuming they are created as a string literal) to optimize memory allocation.
Incorrect.
A String created by the String.substring() method is NOT interned. Not in current Java releases, or (AFAIK) in any previous release. (But see below.)
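You can check this yourself with reference comparison. A minimal sketch (the class and method names are made up for illustration):

```java
class InterningDemo {
    static final String NOVEL = "The contents of a very long novel...";

    // substring() builds a new String at run time, so the result is not
    // the pooled "The" literal even though the characters are equal.
    static boolean substringIsPooled() {
        String word = NOVEL.substring(0, 3);
        return word == "The"; // deliberate reference comparison
    }

    // Only an explicit intern() call hands back the pooled instance.
    static boolean internedSubstringIsPooled() {
        String word = NOVEL.substring(0, 3).intern();
        return word == "The";
    }

    public static void main(String[] args) {
        System.out.println(substringIsPooled());         // false
        System.out.println(internedSubstringIsPooled()); // true
    }
}
```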
Creating a string using the new keyword results in the string being created on the heap rather than in string pool.
Correct.
A particular case when you may want to avoid interning is if you wanted to create a string that is a sub-string of a very large string literal (such as page).
Incorrect.
I think you are confusing "interning" with something else.
Actually, in a modern JVM you always want to avoid interning. It is expensive, and it causes string objects to be (artificially) kept for longer than they need to be.
In fact, the only real reason that interning is still a thing is that it is necessary to guarantee certain semantic properties specified in the JLS about compile-time constant strings.
A modern JVM can perform string deduplication in the garbage collector for strings that live long enough (in HotSpot this is the G1 feature introduced in Java 8u20 and enabled with -XX:+UseStringDeduplication). This happens transparently ... and in cases where it is likely to be beneficial.
Historic note.
In some old JVMs, there used to be a good reason to call new String in conjunction with substring. The problem was that the substring method had a "clever optimization" whereby it created the substrings to share the backing char[] with the original string [1]. This had the problem that references to (small) substrings could keep the (large) backing array reachable. It was a subtle kind of memory leak. You could avoid the leak by using new.
However:
The optimization was NOT interning. The substrings were created in the regular heap, and they did not have the semantics of interned strings.
The problem only affected certain String use-cases. And in practice they didn't involve large String literals.
The problem was solved long ago. The String.substring now creates a new String with its own backing array.
In summary, using new String might have been a good idea in some cases with old Java versions, but it isn't anymore. It was fixed in Java 7.
[1] Interestingly, the source code for String describes this as a speed optimization rather than a space optimization.
I feel strings can replace character array in all the scenarios. Even considering the immutability characteristic of Strings, declaration of strings in appropriate scope and java's garbage collection feature should help us avoid any memory leaks. I want to know if there is any corner case where character array should be used instead of Strings in Java.
Character arrays have a slight advantage over plain strings when it comes to storing security-sensitive data. There are a lot of resources on that, for example this question: Why is char[] preferred over String for passwords? (with an answer by Jon Skeet himself).
In general it boils down to two things:
You have very little influence on how long a String stays in memory. Because of that you might leak sensitive data through a memory dump.
Leaking sensitive data accidentally in application logs as clear text is much more likely with plain strings
More reading:
Why we read password from console in char array instead of String
https://www.codebyamir.com/blog/use-character-arrays-to-store-sensitive-data-java
https://www.geeksforgeeks.org/use-char-array-string-storing-passwords-java/amp/
https://www.baeldung.com/java-storing-passwords
https://javarevisited.blogspot.com/2012/03/why-character-array-is-better-than.html
https://javainsider.wordpress.com/2012/12/10/character-array-is-better-than-string-for-storing-password-in-java/amp/
String is a class, not a built-in type. It most likely does what it does by using a char array underneath, but there is no guarantee. "We don't care how it is implemented". It has methods that make sense for strings, like comparing strings. Comparing arrays? Hmm. It doesn't really make sense to do that. You could check whether they are equal, sure, but less than or greater than...
Back to the point. One scenario is that you want to operate on chars, not on a string. For example, you have letters of the alphabet and want to sort them. Or grades in an A-F system and you want to sort them. Generally, this is where you have chars that are not connected so as to mean something together (as they are in a message string or a text message). You would not generally need to sort the chars of a text message now, would you? So, you use an array.
To sort, you can take advantage of the Arrays.sort() method, for example, while I don't think there is a method that does it for strings. Perhaps third-party libraries have one.
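A small sketch of the grades example using Arrays.sort() on a char[] (the class and method names are only for illustration). Sorting the characters of a String goes through a char[] anyway:

```java
import java.util.Arrays;

class CharSorting {
    // Returns a sorted copy and leaves the caller's array untouched.
    static char[] sortedCopy(char[] input) {
        char[] copy = input.clone();
        Arrays.sort(copy);
        return copy;
    }

    // A String has no sort method; convert to char[], sort, convert back.
    static String sortChars(String s) {
        char[] chars = s.toCharArray();
        Arrays.sort(chars);
        return new String(chars);
    }

    public static void main(String[] args) {
        char[] grades = {'C', 'A', 'F', 'B', 'D'};
        System.out.println(new String(sortedCopy(grades))); // ABCDF
        System.out.println(sortChars("author"));            // ahortu
    }
}
```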
On another note (unrelated to the question), you can use StringBuilder if you want to modify strings often. It performs better.
You don't have to look much further than at methods in the JDK core API that use char[].
Such as this one (java.io.Reader):
public int read(char[] cbuf)
throws IOException
Reads characters into an array. This method will block until some input is available, an I/O error occurs, or the end of the stream is reached.
Parameters:
cbuf - Destination buffer
Returns:
The number of characters read, or -1 if the end of the stream has been reached
Throws:
IOException - If an I/O error occurs
Instead of returning a String they ask you to pass in a char[] to use as a buffer to write the result into. The reason is efficiency.
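A typical way to call that method is to allocate the buffer once and keep refilling it. The following is only a sketch: the class name, the readAll helper and its bufferSize parameter are assumptions made for the example, while Reader.read(char[]) and StringBuilder.append(char[], int, int) are the real JDK calls.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class ReaderBuffering {
    // Drains the reader through a single reusable char[] buffer.
    static String readAll(Reader in, int bufferSize) {
        if (bufferSize <= 0) {
            throw new IllegalArgumentException("bufferSize must be positive");
        }
        char[] cbuf = new char[bufferSize];
        StringBuilder out = new StringBuilder();
        try {
            int n;
            // read() reports how many chars it stored, or -1 at end of stream.
            while ((n = in.read(cbuf)) != -1) {
                out.append(cbuf, 0, n); // only the first n chars are valid
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A tiny buffer forces several read() calls into the same array.
        System.out.println(readAll(new StringReader("hello world"), 4));
    }
}
```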
You may know that String is immutable, and how substring could cause a memory leak in Java.
Since Strings are immutable in Java, a password stored as plain text stays in memory until the garbage collector clears it, and since Strings may be kept in the String pool for reuse, there is a fairly high chance that it will remain in memory for a long time, which poses a security threat. Anyone who has access to a memory dump can find the password in clear text. Since Strings are immutable, there is no way to change their contents, because any change produces a new String, whereas with a char[] you can still set all of its elements to blank or zero. So storing a password in a character array mitigates the risk of the password being stolen.
How can we create our own O(1) substring function in Java, as it was in JDK 6? Is there any way to use the JDK 6 substring() on later versions of the JDK?
The O(1) substring was possible because the underlying character array of the string could be shared between objects. Hence substring simply required creating an object with a pointer to the original string along with an offset and length. There was no copying of the actual data itself, which had the annoying effect that taking a small substring of a huge string, then deleting the huge one, didn't actually free up memory. This led to code such as:
String newstr = new String(oldStr.substring(5,9));
rather than the more sensible-looking:
String newstr = oldStr.substring(5,9);
Since strings no longer share data (Update 6 of Java 7 is where I think this happened), that's not possible so, if you want to get back that O(1) performance, you'll basically have to construct your own string class to do it.
Just be aware that you may be worrying about something that's not so important. Unless your strings are very large, the extra cost (in space and time) of copying the data for them may be inconsequential.
And the extra effort in converting your O1String into String for every function that needs the latter, as well as the less than perfect integration with literal strings, may well make it even worse.
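If you still want to try it, such a class could be sketched as a CharSequence view over the original String. This is an illustration under assumptions, not how JDK 6 implemented it: it borrows the O1String name used above and shares the source String rather than a raw char[].

```java
final class O1String implements CharSequence {
    private final String source; // shared, never copied
    private final int start;     // inclusive
    private final int end;       // exclusive

    O1String(String source, int start, int end) {
        if (start < 0 || end > source.length() || start > end) {
            throw new IndexOutOfBoundsException("start " + start + ", end " + end);
        }
        this.source = source;
        this.start = start;
        this.end = end;
    }

    @Override
    public int length() {
        return end - start;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length()) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return source.charAt(start + index);
    }

    // O(1): creates a new view with adjusted offsets, no character copying.
    @Override
    public CharSequence subSequence(int from, int to) {
        if (from < 0 || to > length() || from > to) {
            throw new IndexOutOfBoundsException("from " + from + ", to " + to);
        }
        return new O1String(source, start + from, start + to);
    }

    // Converting back to a real String copies the characters.
    @Override
    public String toString() {
        return source.substring(start, end);
    }

    public static void main(String[] args) {
        O1String view = new O1String("hello world", 6, 11);
        System.out.println(view);                   // world
        System.out.println(view.subSequence(1, 3)); // or
    }
}
```

Like the JDK 6 behaviour, every view keeps the whole source string reachable, so the old memory retention problem comes back with it.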
Here you can view how it was implemented in Java 6
Open JDK
After reading this beautiful question: Why is char[] preferred over String for passwords?, I'm curious as to how this applies to servlet based web applications. Say your UI has some input field for the password, the password will be retrievable with request.getParameter("passwordFieldName") which returns a String. Even if you then convert it to a char[], you have to wait for GC to clear the String object.
Also, many of the Encryption/Hashing libraries I'm looking into using for password hashing have a method such as checkPassword(plaintext, hashed) that takes two Strings and returns true if the entered plain text string gives a hash equal to hashed. With this, even if you had a char[], you would still need to convert the array to a String with the new String(char[]) constructor.
It seems to me like you can't really avoid having your password as a String for comparing it to its stored representation. If you are worried about attacks during that small window, how do you protect yourself?
This is an overreaction and really just "security theater". There is really no scenario in which having a long String as a password in a Java application would be at all desirable to an attacker. If a memory exhaustion attack is a concern, then don't use Strings anywhere.
That being said CWE-521 states that passwords must have a max size. Strings don't really have a max size, and using a char[x] is a good way of enforcing a max size.
In an application, String is an often-used data type. What we know is that the mutation of a String uses lots of memory. So what we can do is use a StringBuilder/StringBuffer.
But at what point should we change to StringBuilder?
And what should we do when we have to split it or replace characters in it?
eg:
//original:
String[] split = string.split("\\?"); // split() takes a regex, so '?' must be escaped
//better? :
String[] split = stringBuilder.toString().split("\\?");
or
//original:
String replacedString = string.replace("l","st");
//better? :
String replacedString = stringBuilder.toString().replace("l","st");
//or
StringBuilder replacedStringBuilder = new StringBuilder(stringBuilder.toString().replace("l","st"));
In your examples, there are no benefits in using a StringBuilder, since you use the toString method to create an immutable String out of your StringBuilder.
You should only copy the contents of a StringBuilder into a String after you are done appending it (or modifying it in some other way).
The problem with Java's StringBuilder is that it lacks some methods you get when using a plain string (check this thread, for example: How to implement StringBuilder.replace(String, String)).
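One way to fill that gap is a small helper built on the methods StringBuilder does have, indexOf(String, int) and replace(int, int, String). This is only a sketch; the class and the replaceAll helper are hypothetical names, not JDK API.

```java
class BuilderReplace {
    // Replaces every occurrence of target in place, scanning left to right.
    static void replaceAll(StringBuilder sb, String target, String replacement) {
        if (target.isEmpty()) {
            throw new IllegalArgumentException("target must not be empty");
        }
        int idx = sb.indexOf(target);
        while (idx != -1) {
            sb.replace(idx, idx + target.length(), replacement);
            // Continue after the inserted text so the replacement is not rescanned.
            idx = sb.indexOf(target, idx + replacement.length());
        }
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder("hello world");
        replaceAll(sb, "l", "st");
        System.out.println(sb); // heststo worstd
    }
}
```

Each replace call may shift the tail of the buffer, so this is not automatically faster than String.replace; it just avoids the round trip through toString().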
What we know, is that a String uses lots of memory.
Actually, to be precise, a String uses less memory than a StringBuilder with equivalent contents. A StringBuilder class has some additional constant overhead, and usually has a preallocated buffer to store more data than needed at any given moment (to reduce allocations). The issue with Strings is that they are immutable, which means Java needs to create a new instance whenever you need to change its contents.
To conclude, StringBuilder is not designed for the operations you mentioned (split and replace), and it won't yield much better performance in any case. A split method cannot benefit from StringBuilder's mutability, since it creates an array of immutable strings as its output anyway. A replace method still needs to iterate through the entire string, and do a lot of copying if replaced string is not the same size as the searched one.
If you need to do a lot of appending, then go for a StringBuilder. Since it uses a "mutable" array of characters under the hood, adding data to the end will be especially efficient.
This article compares the performance of several StringBuilder and String methods (although I would treat the Concatenation part with some reservation, because it doesn't mention dynamic string appending at all and concentrates on a single Join operation only).
What we know, is that the mutation of a String uses lots of memory.
That is incorrect. Strings cannot be mutated. They are immutable.
What you are actually talking about is building a String from other strings. That can use a lot more memory than is necessary, but it depends how you build the string.
So what we can do is to use a StringBuilder/StringBuffer.
Using a StringBuilder will help in some circumstances:
String res = "";
for (String s : ...) {
res = res + s;
}
(If the loop iterates many times then optimizing the above to use a StringBuilder could be worthwhile.)
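For reference, the StringBuilder version of that loop could look like this. It is a sketch: the class name, method name and the List parameter are assumptions, since the snippet above leaves out what is being iterated.

```java
import java.util.Arrays;
import java.util.List;

class Concatenation {
    // Appends into one growing buffer instead of creating a new String per iteration.
    static String concat(List<String> parts) {
        StringBuilder res = new StringBuilder();
        for (String s : parts) {
            res.append(s);
        }
        return res.toString(); // one final copy into an immutable String
    }

    public static void main(String[] args) {
        System.out.println(concat(Arrays.asList("a", "b", "c"))); // abc
    }
}
```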
But in other circumstances it is a waste of time:
String res = s1 + s2 + s3 + s4 + s5;
(It is a waste of time to optimize the above to use a StringBuilder because the Java compiler will automatically translate the expression into code that creates and uses a StringBuilder or, since Java 9, into an equivalent invokedynamic-based concatenation.)
You should only ever use a StringBuffer instead of a StringBuilder when the string needs to be accessed and/or updated by more than one thread; i.e. when it needs to be thread-safe.
But at what point should we change to StringBuilder?
The simple answer is to only do it when the profiler tells you that you have a performance problem in your string handling / processing.
Generally speaking, StringBuilders are used for building strings rather than as the primary representation of the strings.
And what should we do, when we have to split it or to replace characters in there?
Then you have to review your decision to use a StringBuilder / StringBuffer as your primary representation at that point. And if it is still warranted you have to figure out how to do the operation using the API you have chosen. (This may entail converting to a String, performing the operation and then creating a new StringBuilder from the result.)
If you frequently modify the string, go with StringBuilder. Otherwise, if it's immutable anyway, go with String.
To answer your question on how to replace characters, check this out: http://download.oracle.com/javase/tutorial/java/data/buffers.html. The StringBuilder operations are what you want.
Here's another good write-up on StringBuilder: http://www.yoda.arachsys.com/csharp/stringbuilder.html
If you need to perform a lot of alter operations on your String, then you can go for StringBuilder. Go for StringBuffer if you are in a multithreaded application.
Both a String and a StringBuilder use about the same amount of memory. Why do you think it is “much”?
If you have measured (for example with jmap -histo:live) that the classes [C and java.lang.String take up most of the memory in the heap, only then should you think further in this direction.
Maybe there are multiple strings with the same value. Then, since Strings are immutable, you could intern the duplicate strings. Don't use String.intern for it, since it has bad performance characteristics; use Google Guava's Interner instead.
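To show the idea without pulling in a library, here is a minimal map-based stand-in. It is not Guava's Interner, just a hypothetical SimpleInterner class that returns one canonical instance per distinct value.

```java
import java.util.HashMap;
import java.util.Map;

class SimpleInterner {
    private final Map<String, String> pool = new HashMap<>();

    // Returns the first instance seen with this value, so duplicates can be dropped.
    String intern(String s) {
        String existing = pool.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    public static void main(String[] args) {
        SimpleInterner interner = new SimpleInterner();
        String a = new String("duplicate");
        String b = new String("duplicate");
        System.out.println(a == b);                                   // false
        System.out.println(interner.intern(a) == interner.intern(b)); // true
    }
}
```

This version is not thread-safe and keeps every string strongly reachable for as long as the interner lives, so it only pays off when the same values really do recur many times.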