I've seen many primitive examples describing how String intern()'ing works, but I have yet to see a real-life use-case that would benefit from it.
The only situation that I can dream up is having a web service that receives a considerable amount of requests, each being very similar in nature due to a rigid schema. By intern()'ing the request field names in this case, memory consumption can be significantly reduced.
Can anyone provide an example of using intern() in a production environment with great success? Maybe an example of it in a popular open source offering?
Edit: I am referring to manual interning, not the guaranteed interning of String literals, etc.
Interning can be very beneficial if you have N strings that can take only K different values, where N far exceeds K. Now, instead of storing N strings in memory, you will only be storing up to K.
For example, you may have an ID type which consists of 5 digits. Thus, there can only be 10^5 different values. Suppose you're now parsing a large document that has many references/cross-references to ID values. Let's say this document has 10^9 references in total (obviously some references are repeated in other parts of the document).
So N = 10^9 and K = 10^5 in this case. If you are not interning the strings, you will be storing 10^9 strings in memory, where many of those strings are equal (by the pigeonhole principle). If you intern() the ID string you get when you're parsing the document, and you don't keep any reference to the uninterned strings you read from the document (so they can be garbage collected), then you will never need to store more than 10^5 strings in memory.
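As a rough illustration of that pattern, here is a minimal sketch (the randomly generated 5-digit IDs stand in for whatever parser you actually use); only the interned copy of each ID is retained:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Minimal sketch: N references drawn from only K = 10^5 possible 5-digit IDs.
// Interning each parsed ID means at most K distinct String instances are retained.
public class InternDemo {
    public static void main(String[] args) {
        Random random = new Random(42);
        List<String> references = new ArrayList<>();
        for (int i = 0; i < 1_000_000; i++) {
            // Stand-in for "parsing" an ID out of the document:
            String id = String.format("%05d", random.nextInt(100_000));
            // Keep only the canonical (interned) copy; the freshly
            // allocated 'id' above becomes garbage-collectable.
            references.add(id.intern());
        }
        System.out.println("references stored: " + references.size());
    }
}

In a heap dump you would then see at most around 10^5 distinct String instances retained by the list, instead of one per reference.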
We had a production system that processes literally millions of pieces of data at a time, many of which have string fields. We should have been interning strings, but there was a bug which meant we were not. By fixing the bug we avoided having to do a very costly (at least 6 figures, possibly 7) server upgrade.
Examples where interning will be beneficial involve a large number of strings where:
the strings are likely to survive multiple GC cycles, and
there are likely to be multiple copies of a large percentage of the Strings.
Typical examples involve splitting / parsing a text into symbols (words, identifiers, URIs) and then attaching those symbols to long-lived data structures. XML processing, programming language compilation and RDF / OWL triple stores spring to mind as applications where interning is likely to be beneficial.
But interning is not without its problems, especially if it turns out that the assumptions above are not correct:
the pool data structure used to hold the interned strings takes extra space,
interning takes time, and
interning doesn't prevent the creation of the duplicate string in the first place.
Finally, interning potentially increases GC overheads by increasing the number of objects that need to be traced and copied, and by increasing the number of weak references that need to be dealt with. This increase in overheads has to be balanced against the decrease in GC overheads that results from effective interning.
Not a complete answer but additional food for thought (found here):
Therefore, the primary benefit in this case is that using the == operator for internalized strings is a lot faster than using the equals() method [for non-internalized Strings]. So, use the intern() method if you're going to be comparing strings more than a time or three.
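A trivial sketch of that point:

public class InternCompareDemo {
    public static void main(String[] args) {
        // Two equal strings constructed at runtime are distinct objects...
        String a = new StringBuilder("request").append("Id").toString();
        String b = new StringBuilder("request").append("Id").toString();
        System.out.println(a == b);                   // false: different instances
        System.out.println(a.equals(b));              // true: same characters

        // ...but after interning, both resolve to one canonical instance,
        // so a reference comparison is enough.
        System.out.println(a.intern() == b.intern()); // true
    }
}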
Never, ever, use intern on user-supplied data, as that can cause denial of service attacks (as intern()ed strings are never freed). You can do validation on the user-supplied strings, but then again you've done most of the work needed for intern().
Related
I have a ResultSet with a list of stock exchanges and the countries in which they reside. However, in my database not every exchange has a country_id, so when creating Exchange objects, a bunch of them have null values for country_id and country_title. For memory optimization I planned to intern all duplicate Strings (countries, currencies, etc.), but noticed that I get a NullPointerException, which is logical. Is there a workaround to avoid duplicate strings with intern() and also not get an NPE? Thank you.
Some options are:
Given there are less than 200 countries, and less than that many exchanges (there are only 60 major exchanges globally), it would be trivial to provide the missing data to your exchanges.
Provide a default value programmatically, either in Java or via your query, e.g. assign 0 to country_id and "" to country_title when they are null in the database (see the sketch after this list).
Don't bother interning - with so few Strings, such a micro optimisation would have no measurable effect.
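If you do end up wanting option 2 together with interning, a minimal null-safe sketch (the helper name and the setter call are made up for illustration) could look like this:

// Hypothetical helper: substitute a default before interning, so null
// values coming from the database never reach String.intern().
static String internOrDefault(String value, String defaultValue) {
    return (value == null ? defaultValue : value).intern();
}

// Illustrative usage while building Exchange objects (field names assumed):
// exchange.setCountryTitle(internOrDefault(rs.getString("country_title"), ""));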
Thank you guys, there are many more strings used in the app; countries and exchanges were just an example. In total there are around 500k Strings, of which 50k are unique, i.e. around 30 MB wasted. Not a big deal indeed.
After some research, I will not intern strings, given that the app should run on well-equipped PCs :)
Given the code:
long i = 0;
while (i++ < MILLIONS) {
    String justHex = UUID.randomUUID().toString().replaceAll("-", "");
    System.out.println(justHex);
}
This will produce lots of unique strings, which the GC will ultimately have to clean up. And doing replaceAll on each string will create even more unique strings (twice as many?).
Is this (the replaceAll) significant overhead for the GC in a small application?
Should a programmer worry about such things?
The strings are temporary strings, and will not be referenced anymore in the next iteration, so I expect them to be quickly garbage collected. Unless benchmarks indicate that the loop is a performance bottleneck, don't worry too much about it and focus on functional correctness.
A bigger impact on both memory usage and performance will be the fact that you use replaceAll, which expects a regular expression as first argument. If you don't need a regular expression, it's better to use replace, which also replaces all occurrences, but does not have the regular expression overhead.
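For instance, the snippet above could drop the regex machinery entirely; a minimal sketch:

import java.util.UUID;

public class ReplaceDemo {
    public static void main(String[] args) {
        // replace() treats "-" as a plain literal: no Pattern compilation,
        // no Matcher allocation, but it still replaces every occurrence.
        String justHex = UUID.randomUUID().toString().replace("-", "");
        System.out.println(justHex);
    }
}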
I was under the impression that StringBuffer is the fastest way to concatenate strings, but I saw this Stack Overflow post saying that concat() is the fastest method. I tried the 2 given examples in Java 1.5, 1.6 and 1.7 but I never got the results they did. My results are almost identical to this
Can somebody explain what I don't understand here? What is truly the fastest way to concatenate strings in Java?
Is there a different answer when one seeks the fastest way to concatenate two strings and when concatenating multiple strings?
String.concat is faster than the + operator if you are concatenating two strings... although this can be fixed at any time and may even have been fixed in Java 8, as far as I know.
The thing you missed in the first post you referenced is that the author is concatenating exactly two strings, and the fast methods are the ones where the size of the new character array is calculated in advance as str1.length() + str2.length(), so the underlying character array only needs to be allocated once.
Using StringBuilder() without specifying the final size, which is also how + works internally, will often need to do more allocations and copying of the underlying array.
If you need to concatenate a bunch of strings together, then you should use a StringBuilder. If it's practical, then precompute the final size so that the underlying array only needs to be allocated once.
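A minimal sketch of that precompute-the-size idea (the names are illustrative):

public class ConcatDemo {
    public static void main(String[] args) {
        String[] parts = { "alpha", "beta", "gamma", "delta" };

        // First pass: compute the exact final length.
        int total = 0;
        for (String part : parts) {
            total += part.length();
        }

        // Second pass: the backing array is allocated once and never resized.
        StringBuilder sb = new StringBuilder(total);
        for (String part : parts) {
            sb.append(part);
        }
        System.out.println(sb.toString());
    }
}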
What I understood from the other answers is the following:
If you need thread safety, use StringBuffer
If you do not need thread safety:
If the strings are known beforehand and the same code needs to be run multiple times, use '+', as the compiler will optimize it and handle the concatenation at compile time itself.
If only two strings need to be concatenated, use concat(), as it will not require a StringBuilder/StringBuffer object to be created. Credits to #nickb.
If multiple strings need to be concatenated, use StringBuilder.
Joining very long lists of strings by naively appending them from start to end is very slow: the backing buffer grows incrementally and is reallocated again and again, making additional copies (and putting a lot of pressure on the garbage collector).
The most efficient way to join long lists is to always start by joining the pair of adjacent strings whose total length is the smallest among all candidate pairs; however, this would require a complex lookup to find the optimal pair (similar to the classic optimal-merge problem), and finding it only to reduce the number of copies to the strict minimum would slow things down.
What you need is a smart "divide and conquer" recursive algorithm with a good heuristic that comes very close to this optimum (a sketch follows these steps):
1. If you have no strings to join, return the empty string.
2. If you have only one string to join, just return it.
3. Otherwise, if you have only two strings to join, join them and return the result.
4. Compute the total length of the final result.
5. Then determine how many strings to take from the left until their accumulated length reaches half of this total; this gives the "divide" point splitting the set of strings into two non-empty parts (each part must contain at least one string, so the division point cannot fall before the first or after the last string of the set).
6. Join the smaller part if it has at least two strings to join, otherwise join the other part (using this algorithm recursively).
7. Loop back to step 1 to complete the remaining joins.
Note that empty strings in the collection have to be ignored as if they were not part of the set.
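As a rough illustration, here is a minimal sketch of a simplified, balanced variant of this idea; it splits on element count rather than on accumulated length, so it only approximates the heuristic described above:

import java.util.Arrays;
import java.util.List;

public class DivideAndConquerJoin {
    // Join parts[from, to) by recursively joining the two halves of the range,
    // so intermediate strings stay roughly balanced in size.
    static String join(List<String> parts, int from, int to) {
        int count = to - from;
        if (count == 0) return "";
        if (count == 1) return parts.get(from);
        int mid = from + count / 2;
        return join(parts, from, mid).concat(join(parts, mid, to));
    }

    public static void main(String[] args) {
        List<String> parts = Arrays.asList("a", "bb", "", "ccc", "dddd");
        System.out.println(join(parts, 0, parts.size()));  // empty strings are harmless here
    }
}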
Many default implementations of String.join(list of strings, optional separator) found in various libraries are slow, as they use naive incremental joining from left to right; the divide-and-conquer algorithm above will outperform them when you need to join MANY small strings to generate a very large string.
Such situations are not exceptional; they occur in text preprocessors and generators, or in HTML processing (e.g. in "Element.getInnerText()" when the element is a large document containing many text elements separated or contained by many named elements).
The strategy above works when the source strings are all (or almost all) able to be garbage collected, keeping only the final result. If the result is kept alive together with the list of source strings, the best alternative is to allocate the final large buffer for the result only once, for its total length, and then copy the source strings into it from left to right.
In both cases, this requires a first pass on all strings to compute their total length.
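A minimal sketch of that allocate-once, left-to-right approach (equivalent in effect to giving a StringBuilder its exact capacity up front):

import java.util.Arrays;
import java.util.List;

public class AllocateOnceJoin {
    static String join(List<String> parts) {
        // First pass: total length of the result.
        int total = 0;
        for (String part : parts) {
            total += part.length();
        }
        // Single allocation, then copy each source string left to right.
        char[] buffer = new char[total];
        int pos = 0;
        for (String part : parts) {
            part.getChars(0, part.length(), buffer, pos);
            pos += part.length();
        }
        return new String(buffer);
    }

    public static void main(String[] args) {
        System.out.println(join(Arrays.asList("one", "-", "two", "-", "three")));
    }
}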
If you use a reallocatable "string buffer", this does not work well if the buffer reallocates constantly. However, the string buffer may be useful during the first pass to pre-join some short strings that fit into it, using a reasonable (medium) size (e.g. 4 KB, one page of memory): once it is full, replace that subset of strings by the content of the string buffer, and allocate a new one.
This can considerably reduce the number of small strings in the source set, and after the first pass you have the total length for the final buffer to allocate for the result, into which you'll copy incrementally all the remaining medium-size strings collected in the first pass. This works very well when the list of source strings comes from a parser or generator, where the total length is not fully known before the end of parsing/generation: you'll use only intermediate string buffers of medium size, and at the end you'll fill the final buffer without reparsing the input (to get the many incremental fragments again) and without calling the generator repeatedly (which would be slow, or would not work at all for some generators, or when the input of the parser has been consumed and cannot be replayed from the start).
Note that these remarks apply not just to joining strings, but also to file I/O: writing a file incrementally also suffers from reallocation and fragmentation, so you should try to precompute the total final length of the generated file. Otherwise you need a classic buffer (implemented in most file I/O libraries, and usually sized at about one memory page of 4 KB), but you should allocate more, because file I/O is considerably slower, and fragmentation becomes a performance problem for later file accesses when file fragments are allocated incrementally in units of just one "cluster". Using a buffer of about 1 MB avoids most performance problems caused by fragmented allocation on the file system, as fragments will be considerably larger; a filesystem like NTFS is optimized to support fragments up to 64 MB, above which fragmentation is no longer a noticeable problem. The same is true for Unix/Linux filesystems, which tend to defragment only up to a maximum fragment size and can efficiently handle the allocation of small fragments using "pools" of free clusters organized by minimum size of 1 cluster, 2 clusters, 4 clusters, 8 clusters... in powers of two, so that defragmenting these pools is straightforward, not very costly, and can be done asynchronously in the background when there is a low level of I/O activity.
And in all modern OSes, memory management is correlated with disk storage management, using memory-mapped files for handling caches: memory is backed by storage managed by the virtual memory manager (which means you can allocate more dynamic memory than you have physical RAM; the rest is paged out to disk when needed). So the strategy you use for managing RAM for very large buffers tends to be correlated with the I/O performance of paging: using a memory-mapped file is a good solution, and everything that worked with file I/O can now be done in a very large (virtual) memory space.
Let's say I have a list of very long strings (40-1000 characters). A user needs to be able to enter a term into the list and the list will report whether the term exists.
Barring storage, is it more efficient to store a hash alongside the long strings, and then when a user attempts a lookup it hashes the input and compares it to a list of hashes?
There are similar answers here, but they aren't quite generalized enough.
Assuming that the data fits in the heap (i.e., in memory), your best bet is to use a Set (or Map if there is data associated with each string). Either change your storage from a List to a Set (using HashSet) or maintain a separate Set if you also really need a List.
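A minimal sketch of that change (the variable names are illustrative):

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class LookupDemo {
    public static void main(String[] args) {
        List<String> longStrings = Arrays.asList("first very long string...",
                                                 "second very long string...");
        // Build the Set once, up front.
        Set<String> index = new HashSet<>(longStrings);

        // Each lookup hashes the query (cost proportional to its length) and then
        // jumps straight to the matching bucket; it does not scan the whole collection.
        String userInput = "second very long string...";
        System.out.println(index.contains(userInput));   // true
    }
}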
The time to compute the hashcode() of a string is proportional to the length of the string. The time to look for the string is constant with respect to the number of strings in the collection (once the hashcode has been computed), assuming a properly-implemented hashcode() and properly-sized Set.
If instead you use equals() on an unsorted list, your lookup time will probably be proportional to the number of items in the list. If you keep the list sorted, you could do binary search with the number of comparisons to lookup one string proportional to the log of the number of items in the list (and each comparison will have to compare characters until a difference is found).
In essence, the Set is sort of like keeping the hashcode of the strings handy, but it goes one step further and stores the data in such a way that it is very quick to jump straight to the elements of the collection that have that hashcode value.
Note that an equals comparison of two strings can bail out as soon as a difference is found, but might have to compare every character in the two strings (when they are equal). If your strings have similar, long prefixes it can hurt performance. Sometimes, you can benefit (performance-wise) from knowledge of the content of your data types. For example, if all your strings begin with the same 1K prefix and only differ in the end, you could benefit from overriding the equals() implementation to compare from the end to the start, so you find differences earlier.
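As a rough sketch of that last idea, here is a hypothetical wrapper type (String itself is final, so its equals() cannot be overridden) that compares from the end first:

// Hypothetical key type for strings that share long common prefixes
// but differ near the end.
public final class SuffixFirstKey {
    private final String value;

    public SuffixFirstKey(String value) {
        this.value = value;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof SuffixFirstKey)) return false;
        String other = ((SuffixFirstKey) o).value;
        if (other.length() != value.length()) return false;
        // Compare from the end, where these particular strings differ first.
        for (int i = value.length() - 1; i >= 0; i--) {
            if (value.charAt(i) != other.charAt(i)) return false;
        }
        return true;
    }

    @Override
    public int hashCode() {
        return value.hashCode();   // consistent with equals for equal contents
    }
}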
Your question is not specific enough.
First, I assume you mean "I have a set of very long strings", because a list is a very inefficient structure for presence lookups.
Some ideas:
Depending on the properties of your set of strings (i.e. the domain), a prefix tree (trie) could turn out to be dramatically more efficient, in both memory and speed, than any sort of hash table. A prefix tree means character comparisons, not hash computation (a minimal sketch follows these points).
Otherwise, you will end up using some sort of hash table, which means you have to compute a hash code anyway, at least once for each string. In that case, it seems reasonable to store the hash codes along with the strings. But for strict correctness, in the end you should probably compare the strings by content anyway, because hash collisions are possible.
Theoretically, the maximum speed of well-distributed hash functions is 3-4 bytes per clock cycle (i.e. the hash function consumes 3-4 bytes of input per CPU cycle).
The speed of a sequential (stream) comparison depends on the conditions and on how your code is compiled; there are instructions on modern CPUs that allow comparing up to 16 bytes per cycle. Interestingly, the Arrays.equals methods are intrinsified, but there is no "raw" memory comparison method in the sun.misc.Unsafe class.
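As promised above, a minimal sketch of the prefix-tree idea, using a plain HashMap per node (a production trie would normally use a more compact node representation):

import java.util.HashMap;
import java.util.Map;

public class Trie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal;   // true if a stored string ends at this node
    }

    private final Node root = new Node();

    public void add(String s) {
        Node node = root;
        for (int i = 0; i < s.length(); i++) {
            node = node.children.computeIfAbsent(s.charAt(i), c -> new Node());
        }
        node.terminal = true;
    }

    public boolean contains(String s) {
        Node node = root;
        for (int i = 0; i < s.length(); i++) {
            node = node.children.get(s.charAt(i));
            if (node == null) return false;
        }
        return node.terminal;
    }

    public static void main(String[] args) {
        Trie trie = new Trie();
        trie.add("alpha");
        trie.add("alphabet");
        System.out.println(trie.contains("alpha"));     // true
        System.out.println(trie.contains("alphabe"));   // false: only a prefix
    }
}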
I am running a NetBeans profile of a recursive operation that includes creating a class with a java.lang.String field. In the classes list in the profile heap dump, the number of String fields corresponds to the number of classes created, as expected; however, there is also a similar number of char[] instances. The char arrays account for nearly 70% of the memory usage(!) whilst the String field accounts for about 7%.
What is going on here? And how can I reduce the number of char[] instances?
Thanks
Take a look at the String source code. The String object itself contains a cached hash code, a count of the number of characters (again, for optimisation purposes), an offset (since String.substring() points at the original string's data) and the character array, which contains the actual string data. Hence your metrics showing that String itself consumes relatively little, while the majority of the memory is taken by the underlying character arrays.
The char arrays account for nearly 70% of the memory usage(!) whilst the String field accounts for about 7%
This is a subtlety of memory profiling known as "retained size" versus "shallow size":
Shallow size refers to how much memory is taken up by an object, not including any child objects it contains. Basically, this means primitive fields.
Retained size is the shallow size plus the size of the other objects referred to by the object, but only those other objects which are referred to only by this object (tricky to explain, simple concept).
String is the perfect example. It contains a handful of primitive fields, plus the char[]. The char[] accounts for the vast majority of the memory usage. The shallow size of String is very small, but its retained size is much larger, since that includes the char[].
The NetBeans profiler is probably giving you the shallow size, which isn't a very useful figure, but is easy to calculate. The retained size would incorporate the char[] memory usage into the String memory usage, but calculating the retained size is computationally expensive, and so profilers won't work that out until explicitly asked to.
The String class in the Sun's Java implementation uses a char[] to store the character data.
I believe this can be verified without looking at the source code by using a debugger to look at the contents of a String, or by using reflection to look at the internals of the String object.
Therefore, it would be difficult to reduce the number of char[] which are being created, unless the number of String instances which are being created were reduced.
Strings are backed by char arrays, so I don't think you can reduce the number of char[] instances without reducing your Strings.
Have you tried removing some Strings to see if the char[]s go down as well?