Java 8 String deduplication vs. String.intern()

I am reading about the feature in Java 8 update 20 for String deduplication (more info) but I am not sure if this basically makes String.intern() obsolete.
I know that this JVM feature needs the G1 garbage collector, which might not be an option for many, but assuming one is using G1GC, is there any difference/advantage/disadvantage of the automatic deduplication done by the JVM vs manually having to intern your strings (one obvious one is the advantage of not having to pollute your code with calls to intern())?
This is especially interesting considering that Oracle might make G1GC the default GC in Java 9.

With this feature, if you have 1000 distinct String objects, all with the same content "abc", JVM could make them share the same char[] internally. However, you still have 1000 distinct String objects.
With intern(), you will have just one String object. So if memory saving is your concern, intern() would be better. It'll save space, as well as GC time.
However, the performance of intern() isn't that great, last time I heard. You might be better off by having your own string cache, even using a ConcurrentHashMap ... but you need to benchmark it to make sure.
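To make the difference concrete, here is a minimal Java sketch of both options. The names StringCacheSketch and canonical are made up for illustration, and the cache shown holds strong references forever, which a real cache would probably not want.

```java
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: class and method names are hypothetical.
class StringCacheSketch {
    // Home-grown canonicalizing cache; entries are never evicted here.
    private static final ConcurrentHashMap<String, String> CACHE = new ConcurrentHashMap<>();

    // Returns the first instance seen with this content.
    static String canonical(String s) {
        String previous = CACHE.putIfAbsent(s, s);
        return previous != null ? previous : s;
    }

    public static void main(String[] args) {
        String a = new String("abc");
        String b = new String("abc");
        System.out.println(a == b);                       // two distinct String objects
        System.out.println(a.intern() == b.intern());     // one shared object after intern()
        System.out.println(canonical(a) == canonical(b)); // one shared object via the cache
    }
}
```

With G1 string deduplication alone, a and b would stay two String objects and only their backing arrays could be shared, so there is nothing to observe from Java code.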

As a comment references, do see: http://java-performance.info/string-intern-in-java-6-7-8/. It is a very insightful reference and I learned a lot; however, I'm not sure its conclusions are necessarily "one size fits all". Each aspect depends on the needs of your own application, so taking measurements with realistic input data is highly recommended!
The main factor probably depends on what you are in control over:
Do you have full control over the choice of GC? In a GUI application, for example, there is still a strong case to be made for using Serial GC (far lower total memory footprint for the process, think 400 MB vs ~1 GB for a moderately complex app, and much more willing to release memory, e.g. after a transient spike in usage). So you might pick that or give your users the option. (If the heap remains small, the pauses should not be a big deal.)
Do you have full control over the code? The G1GC option is great for 3rd party libraries (and applications!) which you can't edit.
The second consideration (as per @ZhongYu's answer) is that String.intern can de-duplicate the String objects themselves, whereas G1GC can only de-duplicate their private char[] field.
A third consideration may be CPU usage, say if impact on laptop battery life might be of concern to your users. G1GC will run an extra thread dedicated to de-duplicating the heap. For example, I played with this to run Eclipse and found it caused an initial period of increased CPU activity after starting up (think 1 - 2 minutes), but it settled on a smaller heap "in-use" and no obvious (just eye-balling the task manager) CPU overhead or slow-down thereafter. So I imagine a certain % of a CPU core will be taken up by de-duplication (during? after?) periods of high memory churn. (Of course there may be a comparable overhead if you call String.intern everywhere, which would also run serially, but then...)
You probably don't need string de-duplication everywhere. There are probably only certain areas of code that:
really impact long-term heap usage, and
create a high proportion of duplicate strings
By using String.intern selectively, other parts of the code (which may create temporary or semi-temporary strings) don't pay the price.
And finally, a quick plug for the Guava utility: Interner, which:
Provides equivalent behavior to String.intern() for other immutable types
You can also use that for Strings. Memory probably is (and should be) your top performance concern, so this probably doesn't apply often: however when you need to squeeze every drop of speed out of some hot-spot area, my experience is that Java-based weak-reference HashMap solutions do run slightly but consistently faster than the JVM's C++ implementation of String.intern(), even after tuning the jvm options. (And bonus: you don't need to tune the JVM options to scale to different input.)
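As a rough idea of what a Java-based weak-reference solution can look like without any library, here is a minimal sketch. WeakInternerSketch is a hypothetical name, this is not how Guava implements its interners, and the single lock is chosen for brevity rather than speed.

```java
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.WeakHashMap;

// Illustrative only: a tiny weak interner for Strings.
class WeakInternerSketch {
    // Keys are held weakly by WeakHashMap; the value must not hold the
    // key strongly either, hence the WeakReference wrapper.
    private final Map<String, WeakReference<String>> pool = new WeakHashMap<>();

    synchronized String intern(String s) {
        WeakReference<String> ref = pool.get(s);
        String cached = (ref != null) ? ref.get() : null;
        if (cached != null) {
            return cached;           // an equal string is already pooled
        }
        pool.put(s, new WeakReference<>(s));
        return s;                    // this instance becomes the canonical one
    }

    public static void main(String[] args) {
        WeakInternerSketch interner = new WeakInternerSketch();
        String a = new String("abc");
        String b = new String("abc");
        System.out.println(interner.intern(a) == interner.intern(b));
    }
}
```

Entries disappear once no other code references the canonical string, so the pool does not pin memory the way a plain HashMap cache would.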

I want to introduce another decision factor regarding the targeted audience:
For a system integrator with a system composed of many different libraries/frameworks, and with little capacity to influence those libraries' internal development, string deduplication could be a quick win if memory is a problem. It will affect all the Strings in the JVM, but G1 will use only spare time to do it. You may even tweak when deduplication is performed by using another parameter (StringDeduplicationAgeThreshold).
For a developer profiling their own code, String.intern could be more interesting. Thoughtful review of the domain model is necessary to decide whether to call intern, and when. As a rule of thumb you may use intern when you know the String will contain a limited set of values, like a kind of enumerated set (e.g. country name, month, day of week...).

Related

Does it still make sense to avoid creating objects for the sake of garbage collection?

For example in a service adapter you might:
a. have an input data model and an output data model, maybe even immutable, with different classes and use Object Mappers to transform between classes and create some short-lived objects along the way
b. have a single data model, some of the classes might be mutable, but the same object that was created for the input is also sent as output
There are other use-cases when you'd have to choose between clear code with many objects and less clear code with less objects and I would like to know if Garbage Collection still has a weight in this decision.

I should make this a comment as IMO it does not qualify as an answer, but it will not fit.
Even if the answer(s) are going to most probably be - do whatever makes your code more readable (and to be honest I still follow that all the time); we have faced this issue of GC in our code base.
Suppose that you want to create a graph of users (we had to, around half a million) and load all their properties in memory and do some aggregations and filtering on them, etc. (it was not my decision). Because these graph objects were pretty heavy, once loaded, even with 16GB of heap, the JVM would fail with OOM or GC would take huge pauses. And it's understandable: lots of data requires lots of memory, you can't run away from it. The solution proposed, and that actually worked, was to model that with simple BitSets, where each bit would be a property and a potential linkage to some other data; this is by far not readable and extremely complicated to maintain to this day. Lots of shifts, lots of intricacies of the data: you have to know at all times what bit 3 means, for example; there's no getter for usernameIncome, let's say; you have to do quite a lot of shifts and map that to a search table, etc. But it would keep the GC pretty low, at least in the ranges where we were OK with that.
So unless you can prove that GC is taking that much of your app's time, you are probably even safer simply adding more RAM and increasing the heap (unless you have a leak). I would still go for clear code 99.(99)% of the time.
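To give a feel for the BitSet style described above, here is a small Java sketch. The layout (three boolean flags per user) and all names are invented for illustration and are far simpler than the real model; the point is only that the data lives in one BitSet rather than in half a million objects.

```java
import java.util.BitSet;

// Hypothetical layout: three boolean properties per user, packed into
// one shared BitSet instead of one object per user.
class UserFlagsSketch {
    static final int FLAGS_PER_USER = 3;
    static final int ACTIVE = 0;
    static final int PREMIUM = 1;
    static final int VERIFIED = 2;

    private final int userCount;
    private final BitSet bits;

    UserFlagsSketch(int userCount) {
        this.userCount = userCount;
        this.bits = new BitSet(userCount * FLAGS_PER_USER);
    }

    // The caller has to know the bit layout; there are no named getters.
    void set(int user, int flag, boolean value) {
        bits.set(user * FLAGS_PER_USER + flag, value);
    }

    boolean get(int user, int flag) {
        return bits.get(user * FLAGS_PER_USER + flag);
    }

    // Simple aggregation: how many users have the given flag set.
    int countWith(int flag) {
        int n = 0;
        for (int user = 0; user < userCount; user++) {
            if (get(user, flag)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        UserFlagsSketch flags = new UserFlagsSketch(500_000);
        flags.set(42, PREMIUM, true);
        flags.set(7, PREMIUM, true);
        System.out.println(flags.get(42, PREMIUM));
        System.out.println(flags.countWith(PREMIUM));
    }
}
```

The GC sees a single long[] inside the BitSet, which is why this kind of encoding keeps collection work low, at the price of readability.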

Newer versions of Java have quite sophisticated mechanisms to handle very short-lived objects, so it's not as bad as it was in the past. With a modern JVM I'd say that you don't need to worry about garbage collection times if you create many objects, which is a good thing since there are now many more of them being created on the fly than was the case with older versions of Java.
What's still valid is to keep the number of created objects low if creation comes with high costs, e.g. accessing a database to retrieve data, network operations, etc.

As other people have said I think it's better to write your code to solve the problem in an optimum way for that problem rather than thinking about what the garbage collector (GC) will do.
The key to working with the GC is to look at the lifespan of your objects. The heap is (typically) divided into two main regions called generations to signify how long objects have been alive (thus young and old generations). To minimise the impact of GC you want your objects to become eligible for collection while they are still in the young generation (either in the Eden space or a survivor space, but preferably Eden space). Collection of objects in the Eden space is effectively free, as the GC does nothing with them, it just ignores them and resets the allocation pointer(s) when a minor GC is finished.
Rather than explicitly calling the GC via System.gc(), it's much better to tune your heap. For example, you can set the size of the young generation using command line options like -XX:NewRatio=n, where n is the ratio of the old generation to the young generation (e.g. setting it to 3 makes new:old 1:3, so the young generation will be one quarter of the heap). Alternatively, you can set the size explicitly using -XX:NewSize=n and -XX:MaxNewSize=m. The GC may resize the young generation during collections, so setting these two values to be the same will keep it at a fixed size.
You can profile your code to establish the rate of object creation and how long your objects typically live for. This will give you the information to (ideally) configure your heap to minimise the number of objects being promoted into the old generation. What you really don't want is objects being promoted and then becoming garbage shortly thereafter.
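A full profiler gives the real picture, but as a first rough look you can read collector counts and times from inside the JVM. The following Java sketch uses the standard java.lang.management API; the class name GcStatsSketch and the allocation loop are purely illustrative, and the numbers depend entirely on the JVM, collector and heap settings in use.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Illustrative only: prints how often each collector has run so far.
class GcStatsSketch {
    // Sums collection counts over all collectors; a count of -1 means
    // "undefined" for that collector and is skipped.
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long before = totalCollections();
        long checksum = 0;
        // Create many short-lived objects that die in the young generation.
        for (int i = 0; i < 5_000_000; i++) {
            byte[] shortLived = new byte[64];
            checksum += shortLived.length;
        }
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + ": count=" + gc.getCollectionCount()
                    + ", timeMs=" + gc.getCollectionTime());
        }
        System.out.println("collections during loop: " + (totalCollections() - before)
                + " (checksum " + checksum + ")");
    }
}
```

Comparing the young and old collector counts before and after a realistic workload hints at whether objects are being promoted and then dying shortly afterwards.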
Alternatively, you may want to look at the Zing JVM from Azul (full disclosure, I work for them). This uses a different GC algorithm, called C4, which enables compaction of the heap concurrently with application threads and so eliminates most of the impact of the GC on application latency.

"Give a rough estimate of the overhead incurred by each system call." - what? [duplicate]

I am a student in Computer Science and I am hearing the word "overhead" a lot when it comes to programs and sorts. What does this mean exactly?

It's the resources required to set up an operation. The setup might seem unrelated to the operation itself, but it is necessary.
It's like when you need to go somewhere, you might need a car. But, it would be a lot of overhead to get a car to drive down the street, so you might want to walk. However, the overhead would be worth it if you were going across the country.
In computer science, sometimes we use cars to go down the street because we don't have a better way, or it's not worth our time to "learn how to walk".

The meaning of the word can differ a lot with context. In general, it's resources (most often memory and CPU time) that are used, which do not contribute directly to the intended result, but are required by the technology or method that is being used. Examples:
Protocol overhead: Ethernet frames, IP packets and TCP segments all have headers, TCP connections require handshake packets. Thus, you cannot use the entire bandwidth the hardware is capable of for your actual data. You can reduce the overhead by using larger packet sizes and UDP has a smaller header and no handshake.
Data structure memory overhead: A linked list requires at least one pointer for each element it contains. If the elements are the same size as a pointer, this means a 50% memory overhead, whereas an array can potentially have 0% overhead.
Method call overhead: A well-designed program is broken down into lots of short methods. But each method call requires setting up a stack frame, copying parameters and a return address. This represents CPU overhead compared to a program that does everything in a single monolithic function. Of course, the added maintainability makes it very much worth it, but in some cases, excessive method calls can have a significant performance impact.
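The first two examples can be turned into simple arithmetic. This Java sketch assumes 8-byte pointers and elements for the linked list, and minimal 20-byte IPv4 and 20-byte TCP headers (no options, Ethernet framing ignored) with a 1460-byte payload; the class and method names are made up for illustration.

```java
// Illustrative only: overhead expressed as a fraction of total bytes used.
class OverheadSketch {
    static double overheadFraction(long payloadBytes, long extraBytes) {
        return (double) extraBytes / (payloadBytes + extraBytes);
    }

    public static void main(String[] args) {
        // Linked list node: 8-byte element plus one 8-byte "next" pointer.
        System.out.println("linked list: " + overheadFraction(8, 8));
        // TCP over IPv4: 1460 bytes of data plus 40 bytes of headers.
        System.out.println("TCP/IPv4:    " + overheadFraction(1460, 40));
    }
}
```

The linked list case comes out at one half, matching the 50% figure above, while the header overhead for a full-size TCP segment is under 3%, which shrinks further as packets get larger.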

You're tired and can't do any more work. You eat food. The energy spent looking for food, getting it and actually eating it is overhead!
Overhead is something wasted in order to accomplish a task. The goal is to make overhead very very small.
In computer science, let's say you want to print a number; that's your task. But storing the number, setting up the display to print it, calling routines to print it, and then accessing the number from a variable are all overhead.

Wikipedia has us covered:
In computer science, overhead is generally considered any combination of excess or indirect computation time, memory, bandwidth, or other resources that are required to attain a particular goal. It is a special case of engineering overhead.

Overhead typically refers to the amount of extra resources (memory, processor, time, etc.) that different programming algorithms take.
For example, the overhead of inserting into a balanced binary tree could be much larger than the same insert into a simple linked list (the insert will take longer and use more processing power to balance the tree, which results in a longer perceived operation time for the user).

For a programmer, overhead refers to those system resources which are consumed by your code when it's running on a given platform on a given set of input data. Usually the term is used in the context of comparing different implementations or possible implementations.
For example, we might say that a particular approach might incur considerable CPU overhead while another might incur more memory overhead and yet another might be weighted toward network overhead (and entail an external dependency, for example).
Let's give a specific example: Compute the average (arithmetic mean) of a set of numbers.
The obvious approach is to loop over the inputs, keeping a running total and a count. When the last number is encountered (signaled by "end of file" EOF, or some sentinel value, or some GUI button, whatever) then we simply divide the total by the number of inputs and we're done.
This approach incurs almost no overhead in terms of CPU, memory or other resources. (It's a trivial task).
Another possible approach is to "slurp" the input into a list, iterate over the list to calculate the sum, then divide that by the number of valid items in the list.
By comparison this approach might incur arbitrary amounts of memory overhead.
In a particularly bad implementation we might perform the sum operation using recursion but without tail-call elimination. Now, in addition to the memory overhead for our list, we're also introducing stack overhead (which is a different sort of memory and is often a more limited resource than other forms of memory).
Yet another (arguably more absurd) approach would be to post all of the inputs to some SQL table in an RDBMS. Then simply calling the SQL SUM function on that column of that table. This shifts our local memory overhead to some other server, and incurs network overhead and external dependencies on our execution. (Note that the remote server may or may not have any particular memory overhead associated with this task --- it might shove all the values immediately out to storage, for example).
Hypothetically we might consider an implementation over some sort of cluster (possibly to make the averaging of trillions of values feasible). In this case any necessary encoding and distribution of the values (mapping them out to the nodes) and the collection/collation of the results (reduction) would count as overhead.
We can also talk about the overhead incurred by factors beyond the programmer's own code. For example, compilation of some code for 32- or 64-bit processors might entail greater overhead than one would see for an old 8-bit or 16-bit architecture. This might involve larger memory overhead (alignment issues) or CPU overhead (where the CPU is forced to adjust bit ordering or use non-aligned instructions, etc.) or both.
Note that the disk space taken up by your code and its libraries, etc. is not usually referred to as "overhead" but rather is called "footprint." Also the base memory your program consumes (without regard to any data set that it's processing) is called its "footprint" as well.
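The running-total and "slurp" approaches to the averaging example can be sketched in Java as follows. The class and method names are hypothetical; both methods return the same mean, but the second one keeps every input in memory first, which is exactly the memory overhead discussed above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Illustrative only: two ways to compute an arithmetic mean.
class AverageSketch {
    // Constant memory: only a running total and a count are kept.
    static double streamingMean(Iterator<Double> input) {
        double total = 0;
        long count = 0;
        while (input.hasNext()) {
            total += input.next();
            count++;
        }
        return count == 0 ? Double.NaN : total / count;
    }

    // Memory grows with the input: everything is collected into a list first.
    static double slurpedMean(Iterator<Double> input) {
        List<Double> all = new ArrayList<>();
        while (input.hasNext()) {
            all.add(input.next());
        }
        double total = 0;
        for (double value : all) {
            total += value;
        }
        return all.isEmpty() ? Double.NaN : total / all.size();
    }

    public static void main(String[] args) {
        List<Double> data = Arrays.asList(1.0, 2.0, 3.0, 6.0);
        System.out.println(streamingMean(data.iterator()));
        System.out.println(slurpedMean(data.iterator()));
    }
}
```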

Overhead is simply extra time consumed during program execution. For example, when we call a function, control is passed to where the function is defined, its body is executed, and control is passed back to the caller. The CPU therefore runs through a longer process (jumping to another place in memory, executing there, then returning to the former position), which costs execution time; that cost is overhead. One way to reduce it is to declare the function inline, which copies the body of the function to the call site so that control does not pass to some other location but continues in a line, hence "inline".

You could use a dictionary. The definition is the same. But to save you time, Overhead is work required to do the productive work. For instance, an algorithm runs and does useful work, but requires memory to do its work. This memory allocation takes time, and is not directly related to the work being done, therefore is overhead.

You can check Wikipedia. But mainly it means that more actions or resources are used than the task itself needs. For example, if you are familiar with .NET, you have value types and reference types there. Reference types have memory overhead as they require more memory than value types.

A concrete example of overhead is the difference between a "local" procedure call and a "remote" procedure call.
For example, with classic RPC (and many other remote frameworks, like EJB), a function or method call looks the same to a coder whether it's a local, in-memory call or a distributed, network call.
For example:
service.function(param1, param2);
Is that a normal method, or a remote method? From what you see here you can't tell.
But you can imagine that the difference in execution time between the two calls is dramatic.
So, while the core implementation will "cost the same", the "overhead" involved is quite different.

Think about the overhead as the time required to manage the threads and coordinate among them. It is a burden if the threads do not have enough work to do. In such a case the overhead cost outweighs the time saved through threading, and the code takes more time than the sequential version.

To answer you, I would give you an analogy of cooking Rice, for example.
Ideally when we want to cook, we want everything to be available: the pots already clean, rice available in sufficient quantity. If this is true, then we take less time to cook our rice (less overhead).
On the other hand, let's say you don't have clean water available immediately and you don't have rice, so you first need to go buy rice from the shops and fetch clean water from the tap outside your house. These extra tasks are not a standard part of cooking rice; you don't necessarily have to spend so much time gathering your ingredients. Ideally, your ingredients are already present when you want to cook.
So the time spent going to buy rice from the shops and fetching water from the tap is overhead on cooking rice. These are costs that we can avoid or minimize, compared to the standard way of cooking rice (everything is around you, and you don't have to waste time gathering your ingredients).
The time wasted in collecting ingredients is what we call the overhead.
In computer science, for example in multithreading, communication overhead among threads arises when threads have to take turns accessing a certain resource or pass information or data to each other. Overhead also arises from context switching. Even though this coordination is crucial to the threads, it costs time (CPU cycles) compared to traditional single-threaded programming, where no time is ever spent on communication. A single-threaded program does the work straight away.

It's anything other than the data itself, i.e. TCP flags, headers, CRC, FCS, etc.

List of performance improvement features that we can implement in java

Maybe this is a well-known question, but I didn't find the best reference for this ques...
What is the formula to calculate and assign the default u-limit, verbose (for GC) and max heap memory value?
If there is no specific formula, what are the criteria to specify this for a particular machine?
If possible, could anyone please explain these concepts as well?
Are there any other concepts we need to consider for performance improvement?
How to tune the JVM for better performance?

Stop what you're doing right now.
Tuning the JVM is probably the last thing you should worry about. Until you've gone through every other performance trick in the book, the default settings should be just fine.
Firstly you need to profile your application and find out where the bottlenecks are. Specifically, you will want to know:
What functions /methods are consuming the majority of CPU time?
Where are all the memory allocations happening?
What kind of objects are taking up most space on the heap?
Then you should apply targeted optimisations to the areas that are causing problems. There are thousands of valid techniques, but here are the ones that I find are most useful:
Improve algorithms - anything that is taking up a decent chunk of CPU time and has complexity of O(n^2) or worse is probably a good candidate for improvement. Try to get it to O(n log n) or better.
Share immutable data - if you have a lot of copies of the same data then it makes sense to turn these into immutable objects and share a single instance. This can save a lot of memory (and has the nice effect of improving thread safety / concurrency)
Use primitive types - replace Integer with int etc. This saves memory and makes numerical operations faster.
Be lazy - don't compute things until they are definitely needed.
Cache things - if something is expensive to compute but frequently requested, store it in a cache after the first request. Use a cache backed by a SoftHashMap so that the memory can still be released if needed.
Offload work - Can you make use of multiple cores? Can the client application do some of the work for you?
After making any changes you then need to profile again. At the very least, you will want to confirm that your optimisations actually helped. Additionally, fixing one bottleneck will usually move the bottleneck to another part of the application. So you will need to identify the new place to focus next.
Repeat until your application is fast enough (as defined by your own or your customers' requirements).
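As an illustration of the "cache things" advice, here is a minimal Java sketch of a cache whose values are held through soft references. The JDK ships WeakHashMap but no SoftHashMap class, so this sketch builds on ConcurrentHashMap and SoftReference instead; SoftCacheSketch is a hypothetical name, and cleared entries are not pruned from the map here, which a real cache would need to handle.

```java
import java.lang.ref.SoftReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Illustrative only: values may be reclaimed by the GC under memory
// pressure and are then recomputed on the next request.
class SoftCacheSketch<K, V> {
    private final Map<K, SoftReference<V>> map = new ConcurrentHashMap<>();
    private final Function<K, V> loader;

    SoftCacheSketch(Function<K, V> loader) {
        this.loader = loader;
    }

    V get(K key) {
        SoftReference<V> ref = map.get(key);
        V value = (ref != null) ? ref.get() : null;
        if (value == null) {
            // Not cached yet, or the GC cleared it: compute and store again.
            value = loader.apply(key);
            map.put(key, new SoftReference<>(value));
        }
        return value;
    }

    public static void main(String[] args) {
        SoftCacheSketch<Integer, String> cache =
                new SoftCacheSketch<>(n -> "expensive result for " + n);
        System.out.println(cache.get(7)); // computed by the loader
        System.out.println(cache.get(7)); // normally served from the cache
    }
}
```

Two threads may occasionally compute the same value at once; for an idempotent loader that is usually an acceptable trade for the simpler code.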

Are there performance issues from using large numbers of objects in Java

I am currently working on a system where performance is an important consideration. It is going to be used for processing large quantities of data (some of the object types are in millions) with non-trivial algorithms (think about Integer Programming problems etc.). At the moment I have a working solution which creates all these data points as Objects.
Is there any performance increase to be gained, by treating them as arrays for example? Are there any best practices for working with large numbers of objects in Java (should it be avoided?).

I suggest you start by using a commercial CPU and memory profiler. This will give you a good idea of where your bottlenecks are.
Reducing garbage and making your memory more compact helps more once you have optimised the code to the point that your profilers cannot suggest anything.
You might like to consider which structures fit in your CPU caches better, as this can improve performance by up to 2-5x. E.g. your L3 cache might be 8 MB, and more than 5x faster than main memory. The more you can condense your working set to fit into it the better.
BTW Your L1 cache is 32 KB and ~10x faster again.
This all assumes that the time to perform a GC doesn't bother you. If you create enough objects you can see multi-second, even multi-minute GC stop-the-world pauses.

Arrays or ArrayLists have similar performance although arrays are faster (up to 25% depending on what you do with them). Where you can find a significant performance gain is by avoiding boxed primitives for calculations, in which case the only solution is to use an array.
Apart from that, creating many short lived objects incurs little performance cost, apart from the fact that GC will run more often (but the cost of running minor GC depends on the number of reachable objects, not on unreachable ones).
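The boxed-versus-primitive point can be seen in a small Java sketch. Both methods compute the same sum; the difference is that the List version stores one Integer object per element and unboxes on every step, while the array version works on plain ints. Names are made up for illustration and no timings are claimed here, so measure with your own data.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: same computation over primitives and over boxed values.
class BoxingSketch {
    static long sumPrimitive(int[] values) {
        long total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    static long sumBoxed(List<Integer> values) {
        long total = 0;
        for (Integer v : values) {
            total += v; // unboxes an Integer object on each iteration
        }
        return total;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        int[] primitive = new int[n];
        List<Integer> boxed = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            primitive[i] = i;
            boxed.add(i); // autoboxing: most values become separate objects
        }
        System.out.println(sumPrimitive(primitive));
        System.out.println(sumBoxed(boxed));
    }
}
```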

Premature optimization is evil. As Richard says in comments, write your code, see if it's slow, then improve it. If you have suspicions, write an example to simulate high load. The time spent up front to determine this is worth it.
But as for your question...
Yes, creating objects is more expensive compared to creating primitives. It also occupies more heap space (memory.) Also if you are using objects for only a short time the garbage collector will have to run more often which will eat some CPU.
Again, only worry about this if you really need speed improvement.

Prototype key parts of your algorithms, test them in isolation, find the slowest, improve, repeat. Stay single-threaded for as long as possible, but always make a note of what can be done in parallel.
In the end your bottleneck may be any of the below:
CPU because of algorithm computational complexity => try finding a better algorithm (or run on multiple CPUs in parallel if you are just slightly below the target; if you are far below, then parallel processing won't help)
CPU because of excessive GC => profile memory, use low/zero-GC collections (trove4j etc.) or even arrays of primitive types, or even direct memory buffers from NIO, experiment
Memory - optimize data proximity (use chunked arrays matching cache sizes, etc).
Contentions on concurrent objects => revert to single threaded design, try lock-free synchronization primitives, etc.

Why Java and Python garbage collection methods are different?

Python uses the reference count method to handle object life time. So an object that has no more use will be immediately destroyed.
But in Java, the GC (garbage collector) destroys objects which are no longer used at some later, unspecified time.
Why does Java choose this strategy and what is the benefit from this?
Is this better than the Python approach?

There are drawbacks of using reference counting. One of the most mentioned is circular references: Suppose A references B, B references C and C references B. If A were to drop its reference to B, both B and C will still have a reference count of 1 and won't be deleted with traditional reference counting. CPython (reference counting is not part of python itself, but part of the C implementation thereof) catches circular references with a separate garbage collection routine that it runs periodically...
Another drawback: Reference counting can make execution slower. Each time an object is referenced and dereferenced, the interpreter/VM must check to see if the count has gone down to 0 (and then deallocate if it did). Garbage Collection does not need to do this.
Also, Garbage Collection can be done in a separate thread (though it can be a bit tricky). On machines with lots of RAM and for processes that use memory only slowly, you might not want to be doing GC at all! Reference counting would be a bit of a drawback there in terms of performance...
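The circular reference problem described above can be simulated with a toy reference counter written in Java. This is only a model with made-up names (RcNode, pointTo, release), not how CPython or any real runtime is implemented; each node has a single outgoing reference to keep the sketch short.

```java
// Illustrative only: a hand-rolled reference count on each node.
class RcNode {
    final String name;
    int refCount = 0;
    RcNode target;          // the one object this node references
    boolean freed = false;

    RcNode(String name) {
        this.name = name;
    }

    // Re-points this node's reference and adjusts both counts.
    void pointTo(RcNode newTarget) {
        if (newTarget != null) {
            newTarget.refCount++;
        }
        RcNode old = target;
        target = newTarget;
        if (old != null) {
            old.release();
        }
    }

    // Called when one reference to this node goes away.
    void release() {
        refCount--;
        if (refCount == 0) {
            freed = true;
            pointTo(null);  // a freed node drops its own outgoing reference
        }
    }

    public static void main(String[] args) {
        RcNode a = new RcNode("A");
        RcNode b = new RcNode("B");
        RcNode c = new RcNode("C");
        a.pointTo(b);
        b.pointTo(c);
        c.pointTo(b);       // B and C now form a cycle
        a.pointTo(null);    // A drops its reference to B
        System.out.println(b.name + " count=" + b.refCount + " freed=" + b.freed);
        System.out.println(c.name + " count=" + c.refCount + " freed=" + c.freed);
    }
}
```

After A lets go, B and C each still have a count of 1 and are never freed, even though nothing outside the cycle can reach them. A tracing collector, which starts from the roots, would not find them and would reclaim both.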

Actually reference counting and the strategies used by the Sun JVM are all different types of garbage collection algorithms.
There are two broad approaches for tracking down dead objects: tracing and reference counting. In tracing, the GC starts from the "roots" (things like stack references) and traces all reachable (live) objects. Anything that can't be reached is considered dead. In reference counting, each time a reference is modified the objects involved have their counts updated. Any object whose reference count drops to zero is considered dead.
With basically all GC implementations there are trade-offs, but tracing is usually good for high throughput (i.e. fast) operation but has longer pause times (larger gaps where the UI or program may freeze up). Reference counting can operate in smaller chunks but will be slower overall. It may mean fewer freezes but poorer performance overall.
Additionally a reference counting GC requires a cycle detector to clean up any objects in a cycle that won't be caught by their reference count alone. Perl 5 didn't have a cycle detector in its GC implementation and could leak memory that was cyclic.
Research has also been done to get the best of both worlds (low pause times, high throughput):
http://cs.anu.edu.au/~Steve.Blackburn/pubs/papers/urc-oopsla-2003.pdf

Darren Thomas gives a good answer. However, one big difference between the Java and Python approaches is that with reference counting in the common case (no circular references) objects are cleaned up immediately rather than at some indeterminate later date.
For example, I can write sloppy, non-portable code in CPython such as
def parse_some_attrs(fname):
    return open(fname).read().split("~~~")[2:4]
and the file descriptor for that file I opened will be cleaned up immediately because as soon as the reference to the open file goes away, the file is garbage collected and the file descriptor is freed. Of course, if I run Jython or IronPython or possibly PyPy, then the garbage collector won't necessarily run until much later; possibly I'll run out of file descriptors first and my program will crash.
So you SHOULD be writing code that looks like
def parse_some_attrs(fname):
    with open(fname) as f:
        return f.read().split("~~~")[2:4]
but sometimes people like to rely on reference counting to always free up their resources because it can sometimes make your code a little shorter.
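The same principle applies on the Java side, where the garbage collector gives no promise about when an unreachable stream is cleaned up. A rough Java counterpart of the second function, using try-with-resources so that the reader is closed deterministically, might look like this. The class name is hypothetical, and the method takes an already opened Reader (which it closes) so that the sketch runs without touching the file system.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Illustrative only: deterministic cleanup without relying on the GC.
class ParseAttrsSketch {
    // Reads everything from the reader, closes it, and returns fields 2 and 3
    // of the "~~~"-separated text (like the Python slice [2:4]).
    static String[] parseSomeAttrs(Reader source) {
        StringBuilder text = new StringBuilder();
        try (Reader reader = source) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Limit -1 keeps trailing empty fields, as Python's split does.
        String[] parts = text.toString().split("~~~", -1);
        int from = Math.min(2, parts.length);
        int to = Math.min(4, parts.length);
        return Arrays.copyOfRange(parts, from, to);
    }

    public static void main(String[] args) {
        String[] attrs = parseSomeAttrs(new StringReader("a~~~b~~~c~~~d~~~e"));
        System.out.println(Arrays.toString(attrs)); // prints [c, d]
    }
}
```

For a real file you would pass something like Files.newBufferedReader(path); the reader is closed when the try block exits, whether or not an exception is thrown.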
I'd say that the best garbage collector is the one with the best performance, which currently seems to be the Java-style generational garbage collectors that can run in a separate thread and have all these crazy optimizations, etc. The differences in how you write your code should be negligible and ideally non-existent.

I think the article "Java theory and practice: A brief history of garbage collection" from IBM should help explain some of the questions you have.

One big disadvantage of Java's tracing GC is that from time to time it will "stop the world" and freeze the application for a relatively long time to do a full GC. If the heap is big and the object tree complex, it will freeze for a few seconds. Also each full GC visits the whole object tree over and over again, something that is probably quite inefficient. Another drawback of the way Java does GC is that you have to tell the JVM what heap size you want (if the default is not good enough); the JVM derives from that value several thresholds that will trigger the GC process when there is too much garbage stacking up in the heap.
I presume that this is actually the main cause of the jerky feeling of Android (based on Java), even on the most expensive cellphones, in comparison with the smoothness of iOS (based on Objective-C, and using RC).
I'd love to see a jvm option to enable RC memory management, and maybe keeping GC only to run as a last resort when there is no more memory left.

Garbage collection is faster (more time efficient) than reference counting, if you have enough memory. For example, a copying gc traverses the "live" objects and copies them to a new space, and can reclaim all the "dead" objects in one step by marking a whole memory region. This is very efficient, if you have enough memory. Generational collections use the knowledge that "most objects die young"; often only a few percent of objects have to be copied.
[This is also the reason why gc can be faster than malloc/free]
Reference counting is much more space efficient than garbage collection, since it reclaims memory the very moment it gets unreachable. This is nice when you want to attach finalizers to objects (e.g. to close a file once the File object gets unreachable). A reference counting system can work even when only a few percent of the memory is free. But the management cost of having to increment and decrement counters upon each pointer assignment cost a lot of time, and some kind of garbage collection is still needed to reclaim cycles.
So the trade-off is clear: if you have to work in a memory-constrained environment, or if you need precise finalizers, use reference counting. If you have enough memory and need the speed, use garbage collection.

Reference counting is particularly difficult to do efficiently in a multi-threaded environment. I don't know how you'd even start to do it without getting into hardware assisted transactions or similar (currently) unusual atomic instructions.
Reference counting is easy to implement. JVMs have had a lot of money sunk into competing implementations, so it shouldn't be surprising that they implement very good solutions to very difficult problems. However, it's becoming increasingly easy to target your favourite language at the JVM.

The latest Sun Java VM actually has multiple GC algorithms which you can tweak. The Java VM specification intentionally omits specifying actual GC behaviour to allow different (and multiple) GC algorithms for different VMs.
For example, for all the people who dislike the "stop-the-world" approach of the default Sun Java VM GC behaviour, there are VMs such as IBM's WebSphere Real Time which allow real-time applications to run on Java.
Since the Java VM spec is publicly available, there is (theoretically) nothing stopping anyone from implementing a Java VM that uses CPython's GC algorithm.

Late in the game, but I think one significant rationale for RC in Python is its simplicity. See this email by Alex Martelli, for example.
(I could not find a link outside Google cache; the email dates from 13 October 2005 on the Python list.)
