I'm new to java world from C++ background. I'd like to port some C++ code to Java.
The code uses Sparse vectors:
struct Feature{
int index;
double value;
};
typedef std::vector<Feature> featvec_t;
As I understand it, each object in Java carries some per-object memory overhead.
So a naive implementation of Feature will add significant overhead when there are 10-100 million Features in a featvec_t.
How can I represent this structure memory-efficiently in Java?
If memory is really your bottleneck, try storing your data in two separate arrays:
int[] index and double[] value.
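A minimal sketch of that two-array layout (the class and field names are my own):

class SparseVector {
    // parallel arrays: index[i] and value[i] together describe the i-th non-zero feature
    final int[] index;
    final double[] value;

    SparseVector(int[] index, double[] value) {
        this.index = index;
        this.value = value;
    }

    int size() { return index.length; }
}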
But in most cases with such big structures, performance (time) will be the main issue. Depending on the operations you mostly perform on your data (insert, delete, get, etc.), you need to choose an appropriate data structure to store your Feature objects.
Start your exploration with the java.util.Collection interface, its subinterfaces (List, Set, etc.) and their implementations provided in the java.util package.
To avoid memory overhead for each entry, you could write a java.util.List<Feature> implementation that wraps arrays of int and double, and builds Feature objects on demand.
To have it resize automatically, you could use TIntArrayList and TDoubleArrayList from GNU Trove.
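A rough sketch of such a wrapper (the Feature class is translated from the question's struct; everything else is assumed). An AbstractList only needs get and size:

import java.util.AbstractList;

class Feature {
    final int index;
    final double value;
    Feature(int index, double value) { this.index = index; this.value = value; }
}

class FeatureList extends AbstractList<Feature> {
    private final int[] indices;    // parallel arrays hold the actual data
    private final double[] values;  // (Trove's TIntArrayList/TDoubleArrayList could replace these for resizing)

    FeatureList(int[] indices, double[] values) {
        this.indices = indices;
        this.values = values;
    }

    @Override
    public Feature get(int i) {
        // a Feature object is built on demand; nothing is stored per entry
        return new Feature(indices[i], values[i]);
    }

    @Override
    public int size() {
        return indices.length;
    }
}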
Is the question about space for the struct itself or the sparse vector? Since others have answered the former, I'll shoot for the latter...
There aren't any sparse lists/matrices in the standard Java collections to my knowledge.
You could build an equivalent using a TreeMap keyed on the index.
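For example (a hedged sketch; keys and values are boxed, so this trades some memory for convenience):

import java.util.TreeMap;

TreeMap<Integer, Double> sparse = new TreeMap<>();
sparse.put(42, 3.14);                          // set element 42
double v = sparse.getOrDefault(7, 0.0);        // missing entries read as 0.0
sparse.forEach((i, x) -> System.out.println(i + " -> " + x));  // non-zero entries in index order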
An object in Java (I guess) has:
sizeof(index)
sizeof(value)
sizeof(Class*) <-- pointer to concrete class
So the difference is four bytes for the class pointer. If your struct is 4+8=12 bytes, that's a 33% overhead... but I can't think of a better way to do it.
I need to re-write some C++ code in Java that I wrote years ago.
In C++, I used a std::vector<unsigned char> buffer to implement a buffer which I would retrieve from a socket connection.
With this buffer I do all kinds of operations:
Have an iterator to buffer element and advance it as required.
Search it for certain tokens.
Get a range of byte elements (sub-buffer) from buffer using indexes / iterators.
Based on these requirements, I could use ArrayList<Byte> and have a custom iterator-like class to keep track of current element and work using indexes.
In what ways would it be inefficient to use List<Byte>?
Somewhere else I have seen ByteBuffer being recommended but my requirements don't call for it.
Perhaps because I need indexed access, searching, etc., and my buffer won't change or be modified once created.
All I need is to read the buffer and perform above operations.
Should I better just have a wrapper around byte[] with my own iterator/pointer class?
This really depends on your requirements. The huge problem with List<Byte> is the fact that you are using Byte objects rather than primitive byte values!
Not only does that affect memory consumption, it also means that there might be a lot of boxing/unboxing going on.
That may cause a lot of objects to be generated† - leading to constant churn for the garbage collector.
So if you intend to do intensive computations, or you are working with massive amounts of data, then the ByteBuffer class has various performance advantages here.
† from Java Language Specification §5.1.1
The rule above is a pragmatic compromise, requiring that certain common values always be boxed into indistinguishable objects. The implementation may cache these, lazily or eagerly.
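As an illustrative sketch (not the only option), wrapping the received bytes in a ByteBuffer keeps them as primitives while still providing indexed access and a cursor to advance; the data and token value here are made up:

import java.nio.ByteBuffer;

byte[] received = {0x01, 0x02, 0x7F, 0x03};   // e.g. data read from the socket
ByteBuffer buf = ByteBuffer.wrap(received);    // no copy, no boxing

byte first = buf.get();       // relative read advances the position (iterator-like)
byte third = buf.get(2);      // absolute read by index
int pos = buf.position();     // current "cursor"

// simple linear search for a token
int found = -1;
for (int i = 0; i < buf.limit(); i++) {
    if (buf.get(i) == 0x7F) { found = i; break; }
}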
List is not a class, it's an interface, so you need an actual class behind it to provide the implementation. You can declare your variable as List<Byte> in your code to hide the implementation details of the class you are actually using, but if you sometimes need it to behave as an array (index access) and at other times as a List (iterators and such), there is no need to abandon ArrayList<Byte>: keep using that type so both behaviours stay available. If you narrow the variable to List<Byte>, you will end up doing unnecessary casts to reach the array-oriented features of ArrayList, which costs CPU for the runtime type checks and blurs the code.
You also have alternatives in ByteBuffer or CharBuffer, as suggested by other answers. Note that these classes are abstract; you obtain instances through factory methods such as ByteBuffer.wrap(byte[]) or ByteBuffer.allocate(int) rather than instantiating or subclassing them yourself.
We use a HashMap<Integer, SomeType>() with more than a million entries. I consider that large.
But integers are their own hash code. Couldn't we save memory with, say, an IntegerHashMap<Integer, SomeType>() that uses a special Map.Entry, storing an int directly instead of a pointer to an Integer object? In our case, that would save the memory required for a million Integer objects.
Any faults in my line of thought? Too special to be of general interest? (At least there is an EnumMap.)
Addendum 1: The first generic parameter of IntegerHashMap is only there to keep it closely similar to the other Map implementations. It could be dropped, of course.
Addendum 2: The same should be possible for other maps and collections, for example ToIntegerHashMap<KeyType, Integer>, IntegerHashSet<Integer>, etc.
What you're looking for is a "primitive collections" library. These usually offer much better memory usage and performance. One of the oldest and most popular libraries was "Trove"; however, it is a bit outdated now. The main actively maintained libraries are:
Goldman Sachs Collections (now Eclipse Collections)
fastutil
Koloboke
See Benchmarks Here
Some words of caution:
"integers are their own hash code" I'd be very careful with this statement. Depending on the integers you have, the distribution of keys may be anything from optimal to terrible. Ideally, I'd design the map so that you can pass in a custom IntFunction as hashing strategy. You can still default this to (i) -> i if you want, but you probably want to introduce a modulo factor, or your internal array will be enormous. You may even want to use an IntBinaryOperator, where one param is the int and the other is the number of buckets.
I would drop the first generic param. You probably don't want to implement Map<Integer, SomeType>, because then you will have to box / unbox in all your methods, and you will lose all your optimizations (except space). Trying to make a primitive collection compatible with an object collection will make the whole exercise pointless.
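As a small sketch of what a primitive-keyed map looks like in practice (using fastutil's Int2ObjectOpenHashMap; the keys and values here are made up):

import it.unimi.dsi.fastutil.ints.Int2ObjectOpenHashMap;

Int2ObjectOpenHashMap<String> map = new Int2ObjectOpenHashMap<>();
map.put(42, "forty-two");     // primitive int key, no Integer boxing
String s = map.get(42);       // lookup also takes a primitive int
System.out.println(s);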
Out of interest: recently I encountered a situation in one of my Java projects where I could store some data either in a two-dimensional array, or in a dedicated class whose instances I would put into a one-dimensional array. So I wonder whether there is any canonical design advice on this topic in terms of performance (runtime, memory consumption)?
Without regard of design patterns (extremely simplified situation), let's say I could store data like
class MyContainer {
public double a;
public double b;
...
}
and then
MyContainer[] myArray = new MyContainer[10000];
for(int i = myArray.length; (--i) >= 0;) {
myArray[i] = new MyContainer();
}
...
versus
double[][] myData = new double[10000][2];
...
I somehow think that the array-based approach should be more compact (memory) and faster (access). Then again, maybe it is not: arrays are objects too, and array access needs to check indexes while object member access does not (?). Allocating the object array would probably (?) take longer, as I need to create the instances in a loop, and my code would be bigger due to the additional class.
Thus, I wonder whether the designs of the common JVMs provide advantages for one approach over the other, in terms of access speed and memory consumption?
Many thanks.
Then again, maybe it is not, arrays are objects too
That's right. So I think this approach will not buy you anything.
If you want to go down that route, you could flatten this out into a one-dimensional array (each of your "objects" then takes two slots). That would give you immediate access to all fields in all objects, without having to follow pointers, and the whole thing is just one big memory allocation: since your component type is primitive, there is just one object as far as memory allocation is concerned (the container array itself).
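A minimal sketch of that flattening (the names and sizes are mine; each logical object occupies two consecutive slots):

double[] data = new double[10000 * 2];   // 10,000 logical objects, 2 fields each

int i = 1234;                  // index of one logical object
data[2 * i]     = 1.0;         // field "a" of object i
data[2 * i + 1] = 2.0;         // field "b" of object i

double a = data[2 * i];
double b = data[2 * i + 1];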
This is one of the motivations for people wanting to have structs and value types in Java, and similar considerations drive the development of specialized high-performance data structure libraries (that get rid of unneccessary object wrappers).
I would not worry about it, until you really have a huge datastructure, though. Only then will the overhead of the object-oriented way matter.
I somehow think that the array-based approach should be more compact (memory) and faster (access)
It won't. You can easily confirm this by using Java Management interfaces:
import java.lang.management.ManagementFactory;

// com.sun.management.ThreadMXBean extends the standard ThreadMXBean
// with per-thread allocation counters (HotSpot-specific)
com.sun.management.ThreadMXBean b =
        (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();
long selfId = Thread.currentThread().getId();
long memoryBefore = b.getThreadAllocatedBytes(selfId);
// <-- put measured code here
long memoryAfter = b.getThreadAllocatedBytes(selfId);
System.out.println(memoryAfter - memoryBefore);
In the measured section, put new double[0] and then new Object(), and you will see that both allocations require exactly the same amount of memory.
It might be that the JVM/JIT treats arrays in a special way which could make them faster to access in one way or another.
The JIT does some vectorization of array operations in for-loops, but that is more about the speed of arithmetic operations than the speed of access. Besides that, I can't think of anything.
The canonical advice that I've seen in this situation is that premature optimisation is the root of all evil. Following that means that you should stick with the code that is easiest to write / maintain / get past your code quality regime, and then look at optimisation if you have a measurable performance issue.
In your examples the memory consumption is similar because in the object case you have 10,000 references plus two doubles per reference, and in the 2D array case you have 10,000 references (the first dimension) to little arrays containing two doubles each. So both are one base reference plus 10,000 references plus 20,000 doubles.
A more efficient representation would be two arrays, where you'd have two base references plus 20,000 doubles.
double[] a = new double[10000];
double[] b = new double[10000];
When designing java classes, what are the recommendations for achieving CPU cache friendliness?
What I have learned so far is that one should use POD-like fields as much as possible (e.g. int instead of Integer). That way, the data will be allocated consecutively when the containing object is allocated. E.g.
class Local
{
private int data0;
private int data1;
// ...
};
is more cache friendly than
class NoSoLocal
{
private Integer data0;
private Integer data1;
//...
};
The latter will require two separate allocations for the Integer objects that can be at arbitrary locations in memory, esp. after a GC run. OTOH the first approach might lead to duplication of data in cases where the data can be reused.
Is there a way to have them located close to each other in memory, so that the parent object and its contained elements will be in the CPU cache at once and not distributed arbitrarily over the whole memory, and so that the GC will keep them together?
You cannot force the JVM to place related objects close to each other (though the JVM tries to do this automatically). But there are certain tricks that make Java programs more cache-friendly.
Let me show you some examples from the real-life projects.
BEWARE! This is not a recommended way to code in Java!
Do not adopt the following techniques unless you are absolutely sure why you are doing it.
Inheritance over composition. You've definitely heard the opposite principle, "Favor composition over inheritance". But with composition you have an extra reference to follow, which is not good for cache locality and also requires more memory. The classic example of inheritance over composition is the JDK 8 Adder and Accumulator classes, which extend the utility class Striped64.
Transform arrays of structures into a structure of arrays. This again helps to save memory and to speed-up bulk operations on a single field, e.g. key lookups:
class Entry {
long key;
Object value;
}
Entry[] entries;
will be replaced with
long[] keys;
Object[] values;
Flatten data structures by inlining. My favorite example is inlining 160-bit SHA1 hash represented by byte[]. The code before:
class Blob {
long offset;
int length;
byte[] sha1_hash;
}
The code after:
class Blob {
long offset;
int length;
int hash0, hash1, hash2, hash3, hash4;
}
Replace String with char[]. You know, String in Java contains a char[] object under the hood. Why pay a performance penalty for the extra reference?
Avoid linked lists. Linked lists are very cache-unfriendly; hardware works best with linear structures. A LinkedList can often be replaced with an ArrayList, and a classic HashMap may be replaced with an open-addressing hash table.
Use primitive collections. Trove is a high-performance library containing specialized lists, maps, sets, etc. for primitive types.
Build your own data layouts on top of arrays or ByteBuffers. A byte array is a perfect linear structure. To achieve the best cache locality you can pack an object data manually into a single byte array.
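A small sketch of that last idea, packing fixed-size records of a long and an int into one backing ByteBuffer (the layout and field names are my own):

import java.nio.ByteBuffer;

final int RECORD_SIZE = 12;                        // 8 bytes "offset" + 4 bytes "length"
ByteBuffer buf = ByteBuffer.allocate(1000 * RECORD_SIZE);

int i = 7;                                         // record index
buf.putLong(i * RECORD_SIZE, 123456789L);          // write the "offset" field
buf.putInt(i * RECORD_SIZE + 8, 42);               // write the "length" field

long offset = buf.getLong(i * RECORD_SIZE);        // read them back
int length  = buf.getInt(i * RECORD_SIZE + 8);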
the first approach might lead to duplication of data in cases where the data can be reused.
But not in the case you mention. An int is 4 bytes and a reference is typically also 4 bytes, so you don't gain anything by using Integer. For a more complex type, however, it can make a big difference.
Is there a way to have them located close to each other in memory, so that the parent object and its contained elements will be in the CPU cache at once and not distributed arbitrarily over the whole memory, and so that the GC will keep them together?
The GC will do this anyway, provided the objects are only used in one place. If the objects are used in multiple places, they will be close to one reference.
Note: this is not guaranteed to be the case; however, when objects are allocated they will typically be contiguous in memory, as this is the simplest allocation strategy. When copying retained objects, the HotSpot GC copies them in reverse order of discovery, i.e. they are still together but in reverse order.
Note 2: Using 4 bytes for an int is still going to be more efficient than using 28 bytes for an Integer (4 bytes for reference, 16 bytes for object header, 4 bytes for value and 4 bytes for padding)
Note 3: Above all, you should favour clarity over performance, unless you have measured that you need the performance and have a faster solution. In this case, an int cannot be null but an Integer can be. If you want a value which must never be null, use int, not for performance but for clarity.
I am using a java.util.List in Java to store the results of my calculation.
How can I store values at indexes larger than the maximum int value (i.e. for very large result sets)?
The short answer is, you will not be using the java.util.List interface. You will have to implement something else.
If it was my program, and if it was not tailored for some specific supercomputing environment, then I would seriously consider using a database instead of trying to store more than two billion objects in RAM.
The fastutil library, which specializes in huge data structures, has a BigList interface whose implementations use arrays of arrays under the hood. Getters and setters take long parameters for indices.
With fastutil 6, a new set of classes makes it possible to handle very
large collections: in particular, collections whose size exceeds 2^31.
Big arrays are arrays-of-arrays handled by a wealth of static methods
that act on them as if they were monodimensional arrays with 64-bit
indices, and big lists provide 64-bit list access.
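A rough usage sketch (assuming fastutil's DoubleBigArrayBigList; check the exact class names against the fastutil version you use):

import it.unimi.dsi.fastutil.doubles.DoubleBigArrayBigList;

DoubleBigArrayBigList results = new DoubleBigArrayBigList();
results.add(3.14);                  // grows like a normal list
double x = results.getDouble(0L);   // element access with a 64-bit (long) index
long n = results.size64();          // size as a long, may exceed Integer.MAX_VALUE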