I need to store a very large number of instances of my class, and since I have a pretty terrible computer with only 2 GB of RAM, I need it to run with as little memory usage as possible. So can anyone tell me which is more memory-efficient: a ton of fields, or an array? I don't care about the "best way" to do it; I need the way that uses the least RAM. So yeah, an array or many fields?
Your question is a little unclear, but basically the class
public class SomeClass {
int var1;
int var2;
...
int var100;
}
will take about as much space as an int[100] array. There might be a slight difference depending on the platform, but no more than 16 bytes total, and it could go either way. (And you can substitute any other data type for int and the same will be true.)
But, just to be clear, either of the above takes up much less space than 100 objects, each containing one int.
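If you want to verify this on your own JVM, one option is the OpenJDK JOL (Java Object Layout) tool. A minimal sketch, assuming the org.openjdk.jol:jol-core dependency is on the classpath (the class here is a hypothetical stand-in for SomeClass above):

import org.openjdk.jol.info.ClassLayout;

public class LayoutCheck {
    // Hypothetical stand-in for the many-fields class above
    static class ManyFields {
        int var1;
        int var2;
        int var3;
    }

    public static void main(String[] args) {
        // Prints field offsets, padding, and total instance size
        System.out.println(ClassLayout.parseClass(ManyFields.class).toPrintable());
        // Prints the header and slot layout of the equivalent primitive array
        System.out.println(ClassLayout.parseInstance(new int[3]).toPrintable());
    }
}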
An array doesn't condense the objects in any way; it just orders them. So fields or an array would have roughly the same memory overhead.
That said, having an array of objects (or a List) is a better way to keep your objects collected together.
The first version is:
int[] a = new int[1000];
int[] b = new int[1000];
The second version is:
class Helper{
int a;
int b;
}
Helper[] c = new Helper[1000];
My intuition tells me that the second one is better, but I can't really back that up with reasoning....
Can anyone compare the time complexity and space complexity of these two structures for me? For example, do these two versions cost the same space, or does the second one cost less?
Thank you!
The real question you should be asking is what the relation between a[i] and b[i] is. If a[i] and b[i] are properties of the same object (one with a more meaningful name than "Helper"), you should definitely put them in a class instead of using multiple primitive arrays. After all, Java is an object-oriented language.
You shouldn't care about the performance difference; it will be insignificant. The important thing is that your code makes sense to whoever reads it.
While Eran is right, I'll add a few more points.
Sure, chances are you're not going to bother about the performance - the difference is insignificant for a general-purpose application. Readability is what matters.
Still, in terms of technical details:
Space complexity is the same (both are linear in the number of elements), but the absolute value in bytes is different.
The array of pairs will cost you more - each Java object carries an overhead of several bytes (its object header). In the case of two arrays, there are only a few bytes of overhead per array.
Also, with two arrays, the values of each array reside next to each other in memory - reading all the values of one array is more effective in terms of CPU cache and memory layout. With an array of objects, you have an array of references to those objects, and you must first read the reference before you can access the actual value of a field.
These are just general points so you can get a feeling of what's going on. In practice it all depends on how you want to work with those structures.
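To make the locality point concrete, here is a minimal sketch (not a rigorous benchmark; for real measurements use a harness such as JMH):

class Helper {
    int a;
    int b;
}

public class AccessPatternDemo {
    public static void main(String[] args) {
        int[] a = new int[1000];
        Helper[] c = new Helper[1000];
        for (int i = 0; i < c.length; i++) {
            c[i] = new Helper();
        }

        // Two-array layout: summing a[] walks one contiguous block of memory
        long sum1 = 0;
        for (int i = 0; i < a.length; i++) {
            sum1 += a[i];
        }

        // Array-of-objects layout: each element access follows a reference first,
        // so the int values may be scattered across the heap
        long sum2 = 0;
        for (int i = 0; i < c.length; i++) {
            sum2 += c[i].a;
        }

        System.out.println(sum1 + " " + sum2);
    }
}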
Out of interest: Recently, I encountered a situation in one of my Java projects where I could store some data either in a two-dimensional array or make a dedicated class for it whose instances I would put into a one-dimensional array. So I wonder whether there is any canonical design advice on this topic in terms of performance (runtime, memory consumption)?
Setting aside design patterns (extremely simplified situation), let's say I could store data like
class MyContainer {
public double a;
public double b;
...
}
and then
MyContainer[] myArray = new MyContainer[10000];
for(int i = myArray.length; (--i) >= 0;) {
myArray[i] = new MyContainer();
}
...
versus
double[][] myData = new double[10000][2];
...
I somehow think that the array-based approach should be more compact (memory) and faster (access). Then again, maybe it is not, arrays are objects too, and array access needs to check indexes while object member access does not (?). The allocation of the object array would probably (?) take longer, as I need to create the instances in a loop, and my code would be bigger due to the additional class.
Thus, I wonder whether the designs of the common JVMs provide advantages for one approach over the other, in terms of access speed and memory consumption?
Many thanks.
Then again, maybe it is not, arrays are objects too
That's right. So I think this approach will not buy you anything.
If you want to go down that route, you could flatten this out into a one-dimensional array (each of your "objects" then takes two slots). That would give you immediate access to all fields in all objects, without having to follow pointers, and the whole thing is just one big memory allocation: since your component type is primitive, there is just one object as far as memory allocation is concerned (the container array itself).
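A minimal sketch of that flattening, reusing the dimensions from the question (the class and accessor names are just illustrative):

public class FlatContainers {
    static final int FIELDS = 2; // a and b

    // One allocation holds all 10000 "objects", two slots each
    final double[] data = new double[10000 * FIELDS];

    double getA(int i) { return data[i * FIELDS]; }
    double getB(int i) { return data[i * FIELDS + 1]; }

    void setA(int i, double v) { data[i * FIELDS] = v; }
    void setB(int i, double v) { data[i * FIELDS + 1] = v; }
}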
This is one of the motivations for people wanting to have structs and value types in Java, and similar considerations drive the development of specialized high-performance data structure libraries (which get rid of unnecessary object wrappers).
I would not worry about it until you really have a huge data structure, though. Only then will the overhead of the object-oriented way matter.
I somehow think that the array-based approach should be more compact (memory) and faster (access)
It won't. You can easily confirm this by using Java Management interfaces:
import java.lang.management.ManagementFactory;

// com.sun.management.ThreadMXBean extends the standard ThreadMXBean
// with HotSpot-specific per-thread allocation counters
com.sun.management.ThreadMXBean b =
        (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();
long selfId = Thread.currentThread().getId();
long memoryBefore = b.getThreadAllocatedBytes(selfId);
// <-- Put measured code here
long memoryAfter = b.getThreadAllocatedBytes(selfId);
System.out.println(memoryAfter - memoryBefore); // bytes allocated in between
Put new double[0] and new Object() in the measured section and you will see that both allocations require exactly the same amount of memory.
It might be that the JVM/JIT treats arrays in a special way which could make them faster to access in one way or another.
The JIT does some vectorization of array operations in for-loops, but that is more about the speed of arithmetic operations than the speed of access. Beyond that, I can't think of anything.
The canonical advice that I've seen in this situation is that premature optimisation is the root of all evil. Following that means that you should stick with the code that is easiest to write / maintain / get past your code quality regime, and then look at optimisation if you have a measurable performance issue.
In your examples the memory consumption is similar, because in the object case you have 10,000 references plus two doubles per referenced object, and in the 2D array case you have 10,000 references (the first dimension) to little arrays containing two doubles each. So both are one base reference plus 10,000 references plus 10,000 object/array headers plus 20,000 doubles.
A more efficient representation would be two arrays, where you'd have two base references plus 20,000 doubles.
double[] a = new double[10000];
double[] b = new double[10000];
When designing Java classes, what are the recommendations for achieving CPU cache friendliness?
What I have learned so far is that one should use POD as much as possible (e.g. int instead of Integer). Thus, the data will be allocated consecutively when the containing object is allocated. E.g.
class Local
{
private int data0;
private int data1;
// ...
}
is more cache friendly than
class NoSoLocal
{
private Integer data0;
private Integer data1;
//...
}
The latter will require two separate allocations for the Integer objects, which can end up at arbitrary locations in memory, especially after a GC run. OTOH the first approach might lead to duplication of data in cases where the data can be reused.
Is there a way to have them located close to each other in memory, so that the parent object and its contained elements will be in the CPU cache at once and not distributed arbitrarily over the whole memory, and so that the GC will keep them together?
You cannot force the JVM to place related objects close to each other (though the JVM tries to do it automatically). But there are certain tricks to make Java programs more cache-friendly.
Let me show you some examples from the real-life projects.
BEWARE! This is not a recommended way to code in Java!
Do not adopt the following techniques unless you are absolutely sure why you are doing it.
Inheritance over composition. You've definitely heard the contrary principle, "favor composition over inheritance". But with composition you have an extra reference to follow, which is not good for cache locality and also requires more memory. The classic example of inheritance over composition is the JDK 8 adder and accumulator classes (LongAdder, LongAccumulator, etc.), which extend the utility class Striped64.
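An illustrative sketch of the difference (the class names here are hypothetical):

// Composition: reading 'hits' must first follow the 'stats' reference,
// and the Stats object may live anywhere on the heap
class Stats {
    long hits;
}
class CounterByComposition {
    Stats stats = new Stats();
}

// Inheritance: 'hits' is laid out inline in the very same object,
// so no extra reference has to be followed
class StatsBase {
    long hits;
}
class CounterByInheritance extends StatsBase {
}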
Transform arrays of structures into a structure of arrays. This again helps to save memory and speeds up bulk operations on a single field, e.g. key lookups:
class Entry {
long key;
Object value;
}
Entry[] entries;
will be replaced with
long[] keys;
Object[] values;
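A lookup now scans only the densely packed keys array and touches values only on a hit. A sketch, assuming keys and values are kept index-aligned:

class KeyScan {
    // Returns the value for 'key', or null if absent; only the long[] is
    // traversed, and it is a single contiguous block of memory
    static Object find(long[] keys, Object[] values, long key) {
        for (int i = 0; i < keys.length; i++) {
            if (keys[i] == key) {
                return values[i];
            }
        }
        return null;
    }
}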
Flatten data structures by inlining. My favorite example is inlining a 160-bit SHA-1 hash represented by a byte[]. The code before:
class Blob {
long offset;
int length;
byte[] sha1_hash;
}
The code after:
class Blob {
long offset;
int length;
int hash0, hash1, hash2, hash3, hash4;
}
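The conversion between the two representations is mechanical. A sketch using java.nio.ByteBuffer (the helper name is made up for illustration):

import java.nio.ByteBuffer;

class Sha1Packer {
    // Packs a 20-byte SHA-1 digest into five ints (illustrative helper)
    static int[] packSha1(byte[] digest) {
        ByteBuffer buf = ByteBuffer.wrap(digest); // 20 bytes = 5 * 4 bytes
        int[] words = new int[5];
        for (int i = 0; i < 5; i++) {
            words[i] = buf.getInt(); // consumes 4 big-endian bytes per call
        }
        return words;
    }
}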
Replace String with char[]. You know, String in Java contains a char[] under the hood (at least up to JDK 8; newer JDKs use a byte[]). Why pay the performance penalty for an extra reference?
Avoid linked lists. Linked lists are very cache-unfriendly; hardware works best with linear structures. A LinkedList can often be replaced with an ArrayList, and a classic HashMap may be replaced with an open-addressing hash table.
Use primitive collections. Trove is a high-performance library containing specialized lists, maps, sets, etc. for primitive types.
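For example, assuming Trove 3.x (package gnu.trove) is on the classpath:

import gnu.trove.list.array.TIntArrayList;

// Stores ints directly in a backing int[] - no Integer boxing,
// no per-element object headers
TIntArrayList list = new TIntArrayList();
list.add(42);
list.add(7);
int first = list.get(0); // a plain int comes back, no unboxing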
Build your own data layouts on top of arrays or ByteBuffers. A byte array is a perfect linear structure. To achieve the best cache locality you can pack an object data manually into a single byte array.
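A minimal sketch of such a manual layout (the record shape and offsets are made up for illustration):

import java.nio.ByteBuffer;

// Each record is 12 bytes: a long key at offset 0 and an int count at offset 8,
// with all records packed back-to-back in a single allocation
class PackedRecords {
    static final int RECORD_SIZE = 12;
    final ByteBuffer buf;

    PackedRecords(int capacity) {
        buf = ByteBuffer.allocate(capacity * RECORD_SIZE);
    }

    long key(int i)  { return buf.getLong(i * RECORD_SIZE); }
    int count(int i) { return buf.getInt(i * RECORD_SIZE + 8); }

    void set(int i, long key, int count) {
        buf.putLong(i * RECORD_SIZE, key);
        buf.putInt(i * RECORD_SIZE + 8, count);
    }
}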
the first approach might lead to duplication of data in cases where the data can be reused.
But not in the case you mention. An int is 4 bytes and a reference is typically also 4 bytes, so you don't gain anything by using Integer. For a more complex type, however, it can make a big difference.
Is there a way to have them located close to each other in memory, so that the parent object and its contained elements will be in the CPU cache at once and not distributed arbitrarily over the whole memory, and so that the GC will keep them together?
The GC will do this anyway, provided the objects are only used in one place. If the objects are used in multiple places, they will be close to one reference.
Note: this is not guaranteed to be the case; however, when objects are allocated they will typically be contiguous in memory, as this is the simplest allocation strategy. When copying retained objects, the HotSpot GC will copy them in reverse order of discovery, i.e. they are still together but in reverse order.
Note 2: Using 4 bytes for an int is still going to be more efficient than using 28 bytes for an Integer (4 bytes for reference, 16 bytes for object header, 4 bytes for value and 4 bytes for padding)
Note 3: Above all, you should favour clarity over performance, unless you have measured that you need a more performant solution. In this case, an int cannot be null but an Integer can be. If you want a value which should never be null, use int - not for performance but for clarity.
I have an ArrayList that is not very memory-intensive; it stores only two fields.
public class ExampleObject{
private String string;
private Integer integer;
public ExampleObject(String stringInbound, Integer integerInbound){
string = stringInbound;
integer = integerInbound;
}
}
I will fill an ArrayList with these objects:
ArrayList<ExampleObject> list = new ArrayList<ExampleObject>();
For raw, hardcore performance, is it much better to use a HashSet for this? If my ArrayList grows to a very large number of items, with indexes in the hundreds, will I notice a huge difference between the ArrayList of objects and the HashSet?
Although they are both Collections, I suggest that you read about the differences between a Set and a List.
They are not used for the same purpose, so choose the one that meets your implementation requirements before thinking about performance.
It all depends on what you're doing. How is data added? How is it accessed? How often is it removed?
For example, in some situations it might even be better to have parallel arrays of String[] and int[] - you avoid the collection-class overhead and the boxing of int to Integer (see the sketch below). Depending on exactly what you're doing, that might be really great or incredibly dumb.
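A minimal sketch of that parallel-array layout, mirroring the two fields of ExampleObject above (sizes and names are illustrative):

// Element i of each array together forms one logical record -
// no wrapper object and no Integer boxing per element
String[] strings = new String[1000];
int[] integers = new int[1000];

strings[0] = "first";
integers[0] = 42;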
Memory consumption can have a strong effect on performance as data sets get larger. A couple of IBM researchers did a neat presentation on Building Memory-efficient Java Applications a few years back that everyone who is concerned about performance should read.
I have to store millions of X/Y double pairs for reference in my Java program. I'd like to keep memory consumption as low as possible, as well as the number of object references. So after some thinking I decided that holding the two coordinates in a tiny double array might be a good idea; its setup looks like this:
double[] node = new double[2];
node[0] = x;
node[1] = y;
I figured using the array would avoid the link between the class and the X and Y variables I would otherwise use in a class, as follows:
class Node {
public double x, y;
}
However, after reading about the way public fields in classes are stored, it dawned on me that fields may not actually be accessed through pointer-like structures; perhaps the JVM simply stores these values in contiguous memory and knows how to find them without an address, thus making the class representation of my point smaller than the array.
So the question is, which has a smaller memory footprint? And why?
I'm particularly interested in whether class fields use a pointer, and thus incur a 32-bit overhead, or not.
The latter has the smaller footprint.
Primitive types are stored inline in the containing class. So your Node requires one object header and two 64-bit slots. The array you specify uses one array header (>= an object header) plus two 64-bit slots.
If you're going to allocate 100 variables this way, then it doesn't matter so much, as it is just the header sizes which are different.
Caveat: all of this is somewhat speculative as you did not specify the JVM - some of these details may vary by JVM.
I don't think your biggest problem is going to be storing the data, I think it's going to be retrieving, indexing, and manipulating it.
However, an array, fundamentally, is the way to go. If you want to save on pointers, use a one-dimensional array, as sketched below. (Someone has already said that.)
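For the X/Y pairs from the question, a one-dimensional, interleaved layout could look like this (a sketch; the index math is the only bookkeeping needed):

int n = 1_000_000;                   // number of X/Y pairs
double[] points = new double[2 * n]; // one allocation, no per-pair objects

// x of pair i lives at index 2*i, y at index 2*i + 1
int i = 0;
points[2 * i] = 1.5;     // x
points[2 * i + 1] = 2.5; // y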
First, it must be stated that the actual space usage depends on the JVM you are using. It is strictly implementation specific. The following is for a typical mainstream JVM.
So the question is, which has a smaller memory footprint? And why?
The 2nd version is smaller. An array has the overhead of the 32-bit field in the object header that holds the array's length. For a non-array object, the size is implicit in the class and does not need to be represented separately.
But note that this is a fixed overhead per array object. The larger the array, the less important the overhead is in practical terms. The flip side of using a class rather than an array is that indexing won't work, and your code may be more complicated (and slower) as a result.
A Java 2D array is actually an array of 1D arrays (etcetera), so you can apply the same analysis to arrays of higher dimensionality. The larger the size an array has in any dimension, the less impact the overhead has. The overhead in a 2x10 array will be less than in a 10x2 array. (Think it through ... 1 array of length 2 + 2 arrays of length 10, versus 1 array of length 10 + 10 arrays of length 2. The overhead is proportional to the number of arrays; see the sketch below.)
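To make the count concrete, here is the same comparison in code (shapes taken from the example above):

double[][] wide = new double[2][10]; // 1 outer array + 2 inner arrays = 3 array headers
double[][] tall = new double[10][2]; // 1 outer array + 10 inner arrays = 11 array headers

Both store the same 20 doubles; only the number of array objects (and thus headers) differs.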
I'm particularly interested in whether class fields use a pointer, and thus incur a 32-bit overhead, or not.
(You are actually talking about instance fields, not class fields. These fields are not static ...)
Fields whose type is a primitive type are stored directly in the heap node of the object without any references. There is no pointer overhead in this case.
However, if the field types were wrapper types (e.g. Double rather than double), then there could be the overhead of a reference AND the overhead of the object header for the Double object.