Given the code:
long i = 0;
while (i++ < MILLIONS) {
    String justHex = UUID.randomUUID().toString().replaceAll("-", "");
    System.out.println(justHex);
}
This will produce lots of unique strings, which the GC will ultimately have to clean up. And doing replaceAll on each string will create even more unique strings (twice as many?).
Is this (the replaceAll) a significant overhead for the GC in a small application?
Should a programmer worry about such things?
The strings are temporary strings, and will not be referenced anymore in the next iteration, so I expect them to be quickly garbage collected. Unless benchmarks indicate that the loop is a performance bottleneck, don't worry too much about it and focus on functional correctness.
A bigger impact on both memory usage and performance will be the fact that you use replaceAll, which expects a regular expression as first argument. If you don't need a regular expression, it's better to use replace, which also replaces all occurrences, but does not have the regular expression overhead.
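For illustration, a minimal sketch of the same loop using replace instead of replaceAll; the MILLIONS value and the main wrapper are just assumptions added to make the example runnable:
import java.util.UUID;

public class HexUuids {
    // Assumed value for the example; the original snippet leaves MILLIONS undefined.
    private static final long MILLIONS = 1_000_000L;

    public static void main(String[] args) {
        long i = 0;
        while (i++ < MILLIONS) {
            // replace works on plain character sequences, so no regex Pattern is involved.
            String justHex = UUID.randomUUID().toString().replace("-", "");
            System.out.println(justHex);
        }
    }
}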
This is regarding identifying the time complexity of a Java program. If I have iterations like for or while loops, we can identify the complexity. But if I use a Java API to do some task and it iterates internally, I think we should include that as well. If so, how do I do that?
Example:
String someString = "some example text"; // non-null example value; calling contains on null would throw a NullPointerException
for (int i = 0; i < someLength; i++) {
    someString.contains("something"); // internal iteration happens here; how do I account for its time complexity?
}
Operations in the Java APIs have their own time complexity, determined by their implementation. For example, the contains method of String runs in roughly linear time, depending on the length of your someString variable.
In short: you should check how the inner operations work and take them into consideration when calculating complexity.
For your code in particular, the time complexity is something like O(N*K), where N is the number of loop iterations (someLength) and K is the length of your someString variable.
You are correct in that the internal iterations will add to your complexity. However, except in a fairly small number of cases, the complexity of API methods is not well documented. Many collection operations come with an upper bound requirement for all implementations, but even in such cases there is no guarantee that the actual code doesn't have lower complexity than required. For cases like String.contains() an educated guess is almost certain to be correct, but again there is no guarantee.
Your best bet for a consistent metric is to look at the source code for the particular API implementation you are using and attempt to figure out the complexity from that. Another good approach would be to run benchmarks on the methods you care about with a wide range of input sizes and types and simply estimate the complexity from the shape of the resulting graph. The latter approach will probably yield better results for cases where the code is too complex to analyze directly.
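As a rough illustration of the benchmarking approach, here is a naive timing sketch for String.contains over growing input sizes (assumes Java 11+ for String.repeat; for trustworthy numbers a harness such as JMH is preferable to this kind of hand-rolled loop):
public class ContainsBenchmark {
    public static void main(String[] args) {
        for (int length = 1_000; length <= 1_000_000; length *= 10) {
            // Build a haystack of the given length with the needle at the end.
            String haystack = "a".repeat(length) + "something";
            long start = System.nanoTime();
            boolean found = false;
            for (int i = 0; i < 1_000; i++) {
                found |= haystack.contains("something");
            }
            long elapsed = System.nanoTime() - start;
            System.out.println("length=" + length + " found=" + found + " ns=" + elapsed);
        }
    }
}
Plotting the measured time against the input length gives a rough picture of whether the growth looks linear, quadratic, and so on.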
I was under the impression that StringBuffer is the fastest way to concatenate strings, but I saw this Stack Overflow post saying that concat() is the fastest method. I tried the two given examples in Java 1.5, 1.6 and 1.7, but I never got the results they did. My results are almost identical to this.
Can somebody explain what I don't understand here? What is truly the fastest way to concatenate strings in Java?
Is there a different answer when one seeks the fastest way to concatenate two strings and when concatenating multiple strings?
String.concat is faster than the + operator if you are concatenating two strings... although this could change at any time, and as far as I know it may even have changed in Java 8.
The thing you missed in the first post you referenced is that the author is concatenating exactly two strings, and the fast methods are the ones where the size of the new character array is calculated in advance as str1.length() + str2.length(), so the underlying character array only needs to be allocated once.
Using StringBuilder() without specifying the final size, which is also how + works internally, will often need to do more allocations and copying of the underlying array.
If you need to concatenate a bunch of strings together, then you should use a StringBuilder. If it's practical, then precompute the final size so that the underlying array only needs to be allocated once.
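A minimal sketch of that presizing idea, with placeholder data:
public class PresizedConcat {
    public static void main(String[] args) {
        String[] parts = {"foo", "bar", "baz"};   // placeholder strings

        // First pass: compute the final length so the backing array is allocated once.
        int totalLength = 0;
        for (String part : parts) {
            totalLength += part.length();
        }

        // Second pass: append into a StringBuilder sized up front.
        StringBuilder sb = new StringBuilder(totalLength);
        for (String part : parts) {
            sb.append(part);
        }
        System.out.println(sb.toString());   // foobarbaz
    }
}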
What I understood from the other answers is the following:
If you need thread safety, use StringBuffer
If you do not need thread safety:
If the strings are known beforehand and for some reason the same code needs to be run multiple times, use '+', as the compiler will optimize it and handle the concatenation at compile time itself.
If only two strings need to be concatenated, use concat(), as it will not require a StringBuilder/StringBuffer object to be created. Credit to #nickb.
If multiple strings need to be concatenated, use StringBuilder.
Joining very long lists of strings by naively adding them from start to end is very slow: the backing buffer grows incrementally and is reallocated again and again, making additional copies (and putting a lot of pressure on the garbage collector).
The most efficient way to join long lists is to always start by joining the pair of adjacent strings whose total length is the smallest of ALL candidate pairs; however, this would require a complex lookup to find the optimal pair (similar to the well-known Towers of Hanoi problem), and finding it only to reduce the number of copies to the strict minimum would slow things down.
What you need is a smarter approach: a recursive "divide and conquer" algorithm with a good heuristic that comes very close to this optimum (a sketch follows the steps below):
1. If you have no strings to join, return the empty string.
2. If you have only one string to join, just return it.
3. Otherwise, if you have only two strings to join, join them and return the result.
4. Compute the total length of the final result.
5. Then determine the number of strings to join from the left until their cumulative length reaches half of this total, to find the "divide" point splitting the set of strings into two non-empty parts (each part must contain at least one string; the division point cannot fall at the very first or very last string of the set to join).
6. Join the smaller part if it has at least two strings to join, otherwise join the other part (using this algorithm recursively).
7. Loop back to the beginning (step 1) to complete the remaining joins.
Note that empty strings in the collection have to be ignored as if they were not part of the set.
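A minimal Java sketch of this divide-and-conquer heuristic; it is simplified (it recursively joins both halves rather than reproducing the exact loop structure above, and it does not filter out empty strings), and the class and method names are placeholders:
import java.util.List;

public class DivideAndConquerJoin {

    // Joins strings[from..to) by splitting near half of the total character count,
    // joining each half recursively, and concatenating the two results.
    static String join(List<String> strings, int from, int to) {
        int count = to - from;
        if (count == 0) return "";
        if (count == 1) return strings.get(from);
        if (count == 2) return strings.get(from).concat(strings.get(from + 1));

        long total = 0;
        for (int i = from; i < to; i++) {
            total += strings.get(i).length();
        }

        // Take strings from the left until we pass half the total length,
        // while leaving at least one string on each side of the split.
        long leftLength = 0;
        int split = from;
        while (split < to - 1 && leftLength + strings.get(split).length() <= total / 2) {
            leftLength += strings.get(split).length();
            split++;
        }
        if (split == from) split = from + 1; // each part must contain at least one string

        return join(strings, from, split).concat(join(strings, split, to));
    }

    public static void main(String[] args) {
        // Java 9+ for List.of; placeholder data.
        List<String> parts = List.of("Hello", ", ", "world", "!", " How", " are", " you", "?");
        System.out.println(join(parts, 0, parts.size()));
    }
}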
Many default implementations of String.join(array of strings, optional separator) found in various libraries are slow because they use naive incremental joining from left to right; the divide-and-conquer algorithm above will outperform them when you need to join MANY small strings to generate a very large string.
Such a situation is not exceptional; it occurs in text preprocessors and generators, or in HTML processing (e.g. in "Element.getInnerText()" when the element is a large document containing many text elements separated or contained by many named elements).
The strategy above works when the source strings are all (or almost all) to be garbage collected, keeping only the final result. If the result is kept alongside the list of source strings, the best alternative is to allocate the final large buffer for the result only once, at its total length, then copy the source strings into it from left to right.
In both cases, this requires a first pass on all strings to compute their total length.
If you use a reallocatable "string buffer", this does not work well if the "string buffer" reallocates constantly. However, the string buffer may be useful when performing the first pass, to pre-join some short strings that fit in it, using a reasonable (medium) size (e.g. 4KB, one page of memory): once it is full, replace that subset of strings by the content of the string buffer, and allocate a new one.
This can considerably reduce the number of small strings in the source set, and after the first pass you know the total length of the final buffer to allocate for the result, into which you incrementally copy all the remaining medium-size strings collected in the first pass. This works very well when the list of source strings comes from a parser or generator, where the total length is not fully known before the end of parsing/generation: you use only intermediate string buffers of medium size, and at the end you fill the final buffer without reparsing the input again (to get many incremental fragments) and without calling the generator repeatedly (which would be slow, or would not work for some generators, or when the parser's input is consumed and cannot be replayed from the start).
Note that this remark applies not just to joining strings, but also to file I/O: writing a file incrementally also suffers from reallocation and fragmentation, so you should try to precompute the total final length of the generated file. Otherwise you need a classic buffer (implemented in most file I/O libraries, and usually sized at about one 4KB memory page), but you should allocate more, because file I/O is considerably slower and fragmentation becomes a performance problem for later file accesses when file fragments are allocated incrementally in units of just one "cluster". Using a buffer of about 1MB avoids most performance problems caused by fragmented allocation on the file system, as fragments will be considerably larger. A filesystem like NTFS is optimized to support fragments up to 64MB, above which fragmentation is no longer a noticeable problem; the same is true for Unix/Linux filesystems, which tend to defragment only up to a maximum fragment size and can efficiently handle allocation of small fragments using "pools" of free clusters organized by minimum size of 1 cluster, 2 clusters, 4 clusters, 8 clusters... in powers of two, so that defragmenting these pools is straightforward, not very costly, and can be done asynchronously in the background when there is a low level of I/O activity.
And in all modern OSes, memory management is correlated with disk storage management, using memory-mapped files to handle caches: memory is backed by storage and managed by the virtual memory manager (which means you can allocate more dynamic memory than you have physical RAM; the rest is paged out to disk when needed). The strategy you use for managing RAM for very large buffers therefore tends to correlate with the I/O performance of paging: using a memory-mapped file is a good solution, and everything that worked with file I/O can now be done in a very large (virtual) memory.
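As a small illustration of the large-buffer advice, a sketch that writes through a 1 MB buffered stream (the output file name and data are just examples):
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class BufferedWriteSketch {
    public static void main(String[] args) throws IOException {
        byte[] chunk = "some generated text\n".getBytes();
        // A 1 MB buffer means the underlying file grows in large chunks
        // instead of many tiny increments.
        try (OutputStream out = new BufferedOutputStream(
                new FileOutputStream("output.dat"), 1024 * 1024)) {
            for (int i = 0; i < 100_000; i++) {
                out.write(chunk);
            }
        }
    }
}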
Am I right in saying that the time complexity in big O notation would just be O(1)?
public boolean size() {
    return (size == 0);
}
Am I right in saying that the time complexity in big O notation would just be O(1)?
No.
This is such a common misconception among students/pupils that I can only keep repeating it:
Big-O notation is meant to give the complexity of something, with respect to a certain measure, as a function of some other quantity:
For example, saying:
"The algorithm for in-place FFT has a space requirement of O(n), with n being the number of FFT bins"
says something about how much the FFT will need in memory, observed for different lengths of the FFT.
So, you don't specify:
1. What is the thing you're actually observing? Is it the time between calling and returning from your method? Is it the comparison alone? Is "time" measured in Java bytecode instructions, or real machine cycles?
2. What do you vary? The number of calls to your method? The variable size?
3. What is it that you actually want to know?
I'd like to stress point 3: computer science students often think that they know how something will behave if they just know the theoretical time complexity of an algorithm. In reality, these numbers tend to mean nothing. And I mean that. A single fetch of a variable that is not in the CPU cache can take as long as 100-10000 additions in the CPU. Calling a method just to see whether something is 0 will take a few dozen instructions if directly compiled, and might take a lot more if you're using something that is (semi-)interpreted like Java; however, in Java, the next time you call that same method, it might already be there as precompiled machine code...
Then, if your compiler is very smart, it might not only inline the function, eliminating the stack save/restore and call/return instructions, but possibly even merge the result into whatever instructions you were conditioning on that return value, which in essence means that this function, in an extreme case, might not take a single cycle to execute.
So, no matter how you put it, you cannot say "time complexity in big O" of something that is a language-specific feature without saying what you vary and exactly what your platform is.
I am trying to solve an algorithmic task where speed is of primary importance. In the algorithm, I am using a DFS search in a graph and in every step, I add a char and a String. I am not sure whether this is the bottleneck of my algorithm (probably not) but I am curious what is the fastest and most efficient way to do this.
At the moment, I use this:
transPred.symbol + word
I think there might be a better alternative to the "+" operator, but most String methods only work with other Strings (would converting my char into a String and using one of them make a difference?).
EDIT:
for (Transition transPred : state.transtitionsPred) {
    walk(someParameters, transPred.symbol + word);
}
transPred.symbol is a char and word is a string
A very common problem / concern.
Bear in mind that each String in Java is immutable. Thus, if you modify a string, it actually creates a new object. This results in one new object for each concatenation you're doing above. That isn't great, as it's simply creating garbage that will have to be collected at some point.
If your graph is very large, this could add up during your traversal logic, and it may slow down your algorithm.
To avoid creating a new String for each concatenation, use the StringBuilder. You can declare one outside your loop and then append each character with StringBuilder.append(char). This does not incur a new object creation for each append() operation.
After your loop you can use StringBuilder.toString(), this will create a new object (the String) but it will only be one for your entire loop.
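A minimal sketch of that pattern, using placeholder data instead of the asker's graph types:
public class AppendSketch {
    public static void main(String[] args) {
        char[] symbols = {'d', 'c', 'b', 'a'};   // stand-ins for transPred.symbol values

        StringBuilder sb = new StringBuilder();  // declared once, outside the loop
        for (char symbol : symbols) {
            sb.append(symbol);                   // appending a char creates no new String
        }
        String word = sb.toString();             // a single String created at the end
        System.out.println(word);                // prints dcba
    }
}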
Since you replace one char in the string at each iteration I don't think that there is anything faster than a simple + append operation. As mentioned, Strings are immutable, so when you append a char to it, you will get a new String object, but this seems to be unavoidable in your case since you need a new string at each iteration.
If you really want to optimize this part, consider using something mutable like an array of chars. This would allow you to replace the first character without any excessive object creation.
Also, I think you're right when you say that this probably isn't your bottleneck. And remember that premature optimization is the root of all evil etc. (Don't mind the irony that the most popular example of good optimization is avoiding excessive string concatenation).
I've seen many primitive examples describing how String intern()'ing works, but I have yet to see a real-life use-case that would benefit from it.
The only situation that I can dream up is having a web service that receives a considerable amount of requests, each being very similar in nature due to a rigid schema. By intern()'ing the request field names in this case, memory consumption can be significantly reduced.
Can anyone provide an example of using intern() in a production environment with great success? Maybe an example of it in a popular open source offering?
Edit: I am referring to manual interning, not the guaranteed interning of String literals, etc.
Interning can be very beneficial if you have N strings that can take only K different values, where N far exceeds K. Now, instead of storing N strings in memory, you will only be storing up to K.
For example, you may have an ID type which consists of 5 digits. Thus, there can only be 10^5 different values. Suppose you're now parsing a large document that has many references/cross-references to ID values. Let's say this document has 10^9 references in total (obviously some references are repeated in other parts of the document).
So N = 10^9 and K = 10^5 in this case. If you are not interning the strings, you will be storing 10^9 strings in memory, many of which are equal (by the pigeonhole principle). If you intern() the ID string you get while parsing the document, and you don't keep any reference to the un-interned strings you read from the document (so they can be garbage collected), then you will never need to store more than 10^5 strings in memory.
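A minimal sketch of the idea; the parsing itself is stubbed out with generated placeholder IDs:
import java.util.ArrayList;
import java.util.List;

public class InternSketch {
    public static void main(String[] args) {
        // Placeholder for parsed references: many occurrences drawn from few distinct 5-digit IDs.
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < 1_000_000; i++) {
            String parsedId = String.valueOf(10_000 + (i % 100)); // a fresh String per "parsed" reference
            ids.add(parsedId.intern());  // store only the canonical copy; the temporary can be GC'd
        }
        System.out.println("references stored: " + ids.size());
    }
}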
We had a production system that processes literally millions of pieces of data at a time, many of which have string fields. We should have been interning strings, but there was a bug which meant we were not. By fixing the bug we avoided having to do a very costly (at least 6 figures, possibly 7) server upgrade.
Examples where interning will be beneficial involve a large number of strings where:
the strings are likely to survive multiple GC cycles, and
there are likely to be multiple copies of a large percentage of the Strings.
Typical examples involve splitting / parsing a text into symbols (words, identifiers, URIs) and then attaching those symbols to long-lived data structures. XML processing, programming language compilation and RDF / OWL triple stores spring to mind as applications where interning is likely to be beneficial.
But interning is not without its problems, especially if it turns out that the assumptions above are not correct:
the pool data structure used to hold the interned strings takes extra space,
interning takes time, and
interning doesn't prevent the creation of the duplicate string in the first place.
Finally, interning potentially increases GC overheads by increasing the number of objects that need to be traced and copied, and by increasing the number of weak references that need to be dealt with. This increase in overheads has to be balanced against the decrease in GC overheads that results from effective interning.
Not a complete answer but additional food for thought (found here):
Therefore, the primary benefit in this case is that using the == operator for internalized strings is a lot faster than using the equals() method [for non-internalized Strings]. So, use the intern() method if you're going to be comparing strings more than a time or three.
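For illustration, a tiny sketch of why == becomes usable once strings are interned:
public class InternCompareSketch {
    public static void main(String[] args) {
        String a = new String("request-id");   // a distinct object with the same characters
        String b = "request-id";               // the canonical, interned literal

        System.out.println(a == b);            // false: different objects
        System.out.println(a.intern() == b);   // true: intern() returns the canonical copy
        System.out.println(a.equals(b));       // true, but compares character by character
    }
}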
Never, ever, use intern on user-supplied data, as that can cause denial of service attacks (as intern()ed strings are never freed). You can do validation on the user-supplied strings, but then again you've done most of the work needed for intern().