Can you determine the memory footprint of a collection? - java

I have an in-memory collection that I want to flush to disk once it has reached either a certain size (count wise) or memory footprint.
How can I determine how much memory a collection is using?
It is going to be some sort of Dictionary/Map.

You can't, easily. For example, consider an ArrayList<String> with a backing array of size 256 and an "in use" size of 200, where each string is 20 characters long, backed by a 30 character backing array.
It sounds like you could easily work out how much memory that's taking - but if every element in the array is actually a reference to the same string, then obviously it takes a lot less memory. That's just for String, which is a class which is relatively straightforward to analyze. For classes with various mixtures of definitely-distinct and possibly-shared references, it becomes even more complicated.
You could serialize it - but that only shows you how much space it takes up when serialized, not in memory.
I suggest you experiment and find some appropriate "average" size, derive a maximum count that makes sense, and just go on that basis.

Related

Most efficient way to concatenate Strings

I was under the impression that StringBuffer is the fastest way to concatenate strings, but I saw this Stack Overflow post saying that concat() is the fastest method. I tried the 2 given examples in Java 1.5, 1.6 and 1.7 but I never got the results they did. My results are almost identical to this
Can somebody explain what I don't understand here? What is truly the fastest way to concatenate strings in Java?
Is there a different answer when one seeks the fastest way to concatenate two strings and when concatenating multiple strings?
String.concat is faster than the + operator if you are concatenating two strings... Although this can be fixed at any time and may even have been fixed in java 8 as far as I know.
The thing you missed in the first post you referenced is that the author is concatenating exactly two strings, and the fast methods are the ones where the size of the new character array is calculated in advance as str1.length() + str2.length(), so the underlying character array only needs to be allocated once.
Using StringBuilder() without specifying the final size, which is also how + works internally, will often need to do more allocations and copying of the underlying array.
If you need to concatenate a bunch of strings together, then you should use a StringBuilder. If it's practical, then precompute the final size so that the underlying array only needs to be allocated once.
What I understood from others answer is following:
If you need thread safety, use StringBuffer
If you do not need thread safety:
If strings are known before hand and for some reasons multiple time same code needs to be run, use '+' as compiler will optimize and handle it during compile time itself.
if only two strings need to be concatenated, use concat() as it will not require StringBuilder/StringBuffer objects to be created. Credits to #nickb
If multiple strings need to be concatenated, use StringBuilder.
Joining very long lists os strings by naively addinging them from start to end is very slow: the padded buffer grows incrementally, and is reallocated again and again, making additional copies (and sollicitating a lot the garbage collector).
The most efficient way to join long lists is to always start by joining pairs of adjascent strings whose total length is the smallest from ALL other candidate pairs; however this would require a complex lookup to find the optimal pair (similar to the wellknown problem of Hanoi towers), and finding it only to reduce the numebr of copies to the strict minimum would slow down things.
What you need a smart algorithm using a "divide and conquer" recursive algorithm with a good heuristic which is very near from this optimum:
If you have no string to join, return the empty string.
If you have only 1 string to join, just return it.
Otherwise if you have only 2 strings to join, join them and return the result.
Compute the total length of the final result.
Then determine the number of strings to join from the left until it reaches half of this total to determine the "divide" point splitting the set of strings in two non-empty parts (each part must contain at least 1 string, the division point cannot be the 1st or last string from te set to join).
Join the smallest part if it has at least 2 strings to join, otherwise join the other part (using this algorithm recursively).
Loop back to the beginning (1.) to complete the other joins.
Note that empty strings in the collection have to be ignored as if they were not part of the set.
Many default implementations of String.join(table of string, optional separator) found in various libraries are slow as they are using naive incremental joinining from left to right; the divide-and-conquer algorithm above will outperform it, when you need to join MANY small string to generate a very large string.
Such situation is not exceptional, it occurs in text preprocessors and generators, or in HTML processing (e.g. in "Element.getInnerText()" when the element is a large document containing many text elements separated or contained by many named elements).
The strategy above works when the source strings are all (or almost all to be garbage collected to keep only the final result. If th result is kept together as long as the list of source strings, the best alternative is to allocate the final large buffer of the result only once for its total length, then copy source strings from left to right.
In both cases, this requires a first pass on all strings to compute their total length.
If you usse a reallocatable "string buffer", this does not work well if the "string buffer" reallocates constantly. However, the string buffer may be useful when performing the first pass, to prejoin some short strings that can fit in it, with a reasonnable (medium) size (e.g. 4KB for one page of memory): once it is full, replace the subset of strings by the content of the string buffer, and allocate a new one.
This can considerably reduce the number of small strings in the source set, and after the first pass, you have the total length for the final buffer to allocate for the result, where you'll copy incrementally all the remaining medium-size strings collected in the first pass This works very well when the list of source strings come from a parser function or generator, where the total length is not fully known before the end of parsing/generation: you'll use only intermediate stringbuffers with medium size, and finally you'll generate the final buffer without reparsing again the input (to get many incremental fragments) or without calling the generator repeatedly (this would be slow or would not work for some generators or if the input of the parser is consumed and not recoverable from the start).
Note that this remarks also applies not just to joinind strings, but also to file I/O: writing the file incrementally also suffers from reallocation or fragmentation: you should be able to precompute the total final length of the generated file. Otherwise you need a classig buffer (implemented in most file I/O libraries, and usually sized in memory at about one memory page of 4KB, but you should allocate more because file I/O is considerably slower, and fragmentation becomes a performance problem for later file accesses when file fragments are allocated incrementalyy by too small units of just one "cluster"; using a buffer at about 1MB will avoid most pperformance problems caused by fragmented allocation on the file system as fragments will be considerably larger; a filesystem like NTFS is optimized to support fragments up to 64MB, above which fragmentation is no longer a noticeable problem; the same is true for Unix/Linux filesystems, which rend to defragment only up to a maximum fragment size, and can efficiently handle allocation of small fragments using "pools" of free clusters organized by mimum size of 1 cluster, 2 cluster, 4 clusters, 8 clusters... in powers of two, so that defragmenting these pools is straightforward and not very costly, and can be done asynchornously in the background when there's lo level of I/O activity).
An in all modern OSes, memory management is correlated with disk storage management, using memory mapped files for handling caches: the memory is backed by a storage, managed by the virtual memory manager (which means that you can allocate more dynamic memory than you have physical RAM, the rest will be paged out to the disk when needed): the straegy you use for managing RAM for very large buffers tends to be correlated to the performance of I/O for paging: using a memory mapped file is a good solution, and everything that worked with file I/O can be done now in a very large (virtual) memory.

Memory saving in multi-dimentional array

Hi i need a multidimentional array to store big number but i am getting heap space error. I have 4gb ram.
double array[][] = new double[100000][100000]
I know it would need a lot of memory, can any one help me tackling this issue? Thanks for helping
If the array is sparse (more empty array cells than filled ones), you could look at using a hash map instead.
For the hash map, use the index of the array as the key.
example:
{ 23: 'foo', 23945: 'bar' }
this will be much more memory efficient!
If you absolutely need that much memory, you will need at minimum 100000*100000*8 B, or 80 GB of RAM (plus whatever overhead there is for the array organization itself). I will go as far as to say that you probably cannot work with that much data in RAM.
The only real way you'll be able to have an array that big is to write the parts of the array that aren't in use at a given moment in time out to disk. This will take some work to do though because you have to track where you are in the array to push and pull the array's data on the disk.

Java Memory Saving Techniques?

I have this 4 Dimensional array to store String values which are used to create a map which is then displayed on screen with the paintComponent. I have read many articles saying that using huge arrays is very inefficient. (Especially since the array dimensions are 16x16x3x3) I was wondering if there was any way to store the string values (I use them as ID values) differently to save memory or reduce fetching time. If you have any ideas or methods I would appreciate it. Thanks!
Well, if your matrix is full, that is every element contains data, then I believe an array is optimally efficient. But if your matrix is sparse, you could look into using more Linking-based data types.
the first thing I would do is to not use strings as IDs, use ints. It'll reduce the size of your structure a lot.
Also, that array really isn't that big, I wouldn't worry about efficiency if that's the only data structure you have. It's only 2304 elements large.
First off, 16*16*3*3 = 2304 - quite modest really. At this size I'd be more worried about the confusion likely to be caused by a 4D array than the size it is taking!
As others have said, if it fully populated, arrays are ok. If it has gaps, an ArrayList or similar would be better.
If the Strings are just IDs, why not store an enum (or even Integers) instead of a string?
Keep in mind that the String values are separate from the array. The array itself takes the same memory space regardless of what string values it links to. Accessing a specific address in your array will take the same amount of time regardless of what type of object you have saved there, or what the value of that object is.
However, if you find that many of your string values represent exactly the same string, you can avoid having multiple copies of the same string by leveraging String.intern(). If you store the interned string, and you don't have any other references to the non-interned string, that frees the non-interned string up to be garbage-collected. Your array will then have multiple entries that point to the same memory space, rather than different memory addresses with equivalent string objects.
See also:
Is it good practice to use java.lang.String.intern()?
http://www.codeinstructions.com/2009/01/busting-javalangstringintern-myths.html
Depending on the requirements of your IDs, you may also want to look into using a different data structure than strings. For example, while the array itself would be the same size, storing int values would avoid the need to allocate extra space for each individual entry.
Also, a 4-dimensional array may not be the best data structure for your needs in the first place. Can you describe why you've chosen this data structure for what you're trying to represent?
The Strings only take up the space of a reference in each array element. There could be a savings if the strings come from a very small set of values. A more important question is are your 4-dimensional arrays sparse or mostly filled? If you have very few values actually specified then you might have a big savings replacing the 4-d array with a Map from the indicies to the String. Let me know if you want a code sample.
Do you actually have a 4D array of 16x16x3x3 (i.e. 2k) string objects? That doesn't sound that big to me. An array is the most efficient way to store a collection of objects, in terms of memory. An ArrayList can be slightly less efficient (up to 50% wasted space).
The only other way I can think of is to store the Strings end-to-end in one giant String and then use substring() to get the bit you need from that, but you would still need to store the indexes somewhere.
Are you running out of memory? If so check that the Strings in your array are the size you think they are - the backing array of a String instance in Java can be much larger than the string itself. If you do subString() on a 1 GB string, the returned string instance shares the 1 GB array of the first string so will keep it from being GC'd longer than you might expect.

How to define the concept of capacity in ArrayLists?

I understand that capacity is the number of elements or available spaces in an ArrayList that may or may not hold a value referencing an object. I am trying to understand more about the concept of capacity.
So I have three questions:
1) What are some good ways to define what capacity represents from a memory standpoint?
...the (contiguous?) memory allocated to the ArrayList?
...the ArrayLists’s memory footprint on the (heap?)?
2) Then if the above is true, changing capacity requires some manner of memory management overhead?
3) Anyone have an example where #2 was or could be a performance concern? Aside from maybe a large number of large ArrayLists having their capacities continually adjusted?
The class is called ArrayList because it's based on an array. The capacity is the size of the array, which requires a block of contiguous heap memory. However, note that the array itself contains only references to the elements, which are separate objects on the heap.
Increasing the capacity requires allocating a new, larger array and copying all the references from the old array to the new one, after which the old one becomes eligible for garbage collection.
You've cited the main case where performance could be a concern. In practice, I've never seen it actually become a problem, since the element objects usually take up much more memory (and possibly CPU time) than the list.
ArrayList is implemented like this:
class ArrayList {
private Object[] elements;
}
the capacity is the size of that array.
Now, if your capacity is 10, and you're adding 11-th element, ArrayList will do this:
Object[] newElements = new Object[capacity * 1.5];
System.arraycopy(this.elements, newElements);
this.elements = newElements;
So if you start off with a small capacity, ArrayList will end up creating a bunch of arrays and copying stuff around for you as you keep adding elements, which isn't good.
On the other hand, if you specify a capacity of 1,000,000 and add only 3 elements to ArrayList, that also is kinda bad.
Rule of thumb: if you know the capacity, specify it. If you aren't sure but know the upper bound, specify that. If you just aren't sure, use the defaults.
Capacity is as you described it -- the contiguous memory allocated to an ArrayList for storage of values. ArrayList stores all values in an array, and automatically resizes the array for you. This incurs memory management overhead when resizing.
If I remember correctly, Java increases the size of an ArrayList's backing array from size N to size 2N + 2 when you try to add one more element than the capacity can take. I do not know what size it increases to when you use the insert method (or similar) to insert at a specific position beyond the end of the capacity, or even whether it allows this.
Here is an example to help you think about how it works. Picture each space between the |s as a cell in the backing array:
| | |
size = 0 (contains no elements), capacity = 2 (can contain 2 elements).
|1| |
size = 1 (contains 1 element), capacity = 2 (can contain 2 elements).
|1|2|
size = 2, capacity = 2. Adding another element:
|1|2|3| | | |
size increased by 1, capacity increased to 6 (2 * 2 + 2). This can be expensive with large arrays, as allocating a large contiguous memory region can require a bit of work (as opposed to a LinkedList, which allocates many small pieces of memory) because the JVM needs to search for an appropriate location, and may need to ask the OS for more memory. It is also expensive to copy a large number of values from one place to another, which would be done once such a region was found.
My rule of thumb is this: If you know the capacity you will require, use an ArrayList because there will only be one allocation and access is very fast. If you do not know your required capacity, use a LinkedList because adding a new value always takes the same amount of work, and there is no copying involved.
1) What are some good ways to define what capacity represents from a memory standpoint?
...the (contiguous?) memory allocated to the ArrayList?
Yes, an ArrayList is backed up by an array, to that represents the internal array size.
...the ArrayLists’s memory footprint on the (heap?)?
Yes, the larget the array capacity, the more footprint used by the arraylist.
2) Then if the above is true, changing capacity requires some manner of memory management overhead?
It is. When the list grows large enough, a larger array is allocated and the contents copied. The previous array maybe discarded and marked for garbage collection.
3) Anyone have an example where #2 was or could be a performance concern? Aside from maybe a large number of large ArrayLists having their capacities continually adjusted?
Yes, if you create the ArrayList with initial capacity of 1 ( for instance ) and your list grows way beyond that. If you know upfront the number of elements to store, you better request an initial capacity of that size.
However I think this should be low in your list of priorities, while array copy may happen very often, it is optimized since the early stages of Java, and should not be a concern. Better would be to choose a right algorithm, I think. Remember: Premature optimization is the root of all evil
See also: When to use LinkedList over ArrayList

Why does the profiler show large numbers of char[] instances when I am not creating any?

I am running a NetBeans profile of a recursive operation that includes creating a class with a java.lang.String field. In the classes list, in the profile heap dump, the number of String fields corresponds to the number of classes created as expected, however there are also a similar number of char[] instances. The char arrays account for nearly 70% of the memory usage(!) whilst the String field accounts for about 7%.
What is going on here? And how can I reduce the number of char[] instances?
Thanks
Take a look at the String source code. The String object itself contains a cached hash code, a count of the number of characters (again, for optimisation purposes), an offset (since a String.substr() points to the original string data) and the character array, which contains the actual string data. Hence your metrics showing that String consumes relatively little, but the majority of memory is taken by the underlying character arrays.
The char arrays account for nearly 70%
of the memory usage(!) whilst the
String field accounts for about 7%
This is subtlety of memory profiling known as "retained size" and "shallow size":
Shallow size refers to how much memory is taken up by an object, not including any child objects it contains. Basically, this means primitive fields.
Retained size is the shallow size plus the size of the other objects referred to by the object, but only those other objects which are referred to only by this object (tricky to explain, simple concept).
String is the perfect example. It contains a handful of primitive fields, plus the char[]. The char[] accounts for the vast majority of the memory usage. The shallow size of String is very small, but it's retained size is much larger, since that includes the char[].
The NetBeans profiler is probably giving you the shallow size, which isn't a very useful figure, but is easy to calculate. The retained size would incorporate the char[] memory usage into the String memory usage, but calculating the retained size is computationally expensive, and so profilers won't work that out until explicitly asked to.
The String class in the Sun's Java implementation uses a char[] to store the character data.
I believe this can be verified without looking at the source code by using a debugger to look at the contents of a String, or by using reflection to look at the internals of the String object.
Therefore, it would be difficult to reduce the number of char[] which are being created, unless the number of String instances which are being created were reduced.
Strings are backed by char arrays, so I don't think you can reduce the number of char[] instances without reducing your Strings.
Have you tried removing some Strings to see if the char[]s go down as well?

Categories

Resources