I am trying to implement a cache-like collection of objects. The purpose is to have fast access to these objects through locality in memory, since I'll likely be reading multiple objects at a time. I currently just store the objects in a Java collection like Vector or a Deque, but I don't believe this makes use of contiguous memory.
I know this can be done in C, but can it be done in Java? These objects may be of varying lengths (since they may contain strings). Is there a way to allocate contiguous memory through Java? Is there a Java collection that would?
Please let me know.
Thanks,
jbu
You can't force it. If you allocate all the objects in quick succession they're likely to be contiguous - but if you're storing them in a collection, there's no guarantee that the collection will be local to the actual values. (The collection will have references to the objects, rather than containing the objects themselves.)
In addition, GC compaction will move values around in memory.
Have you actually profiled your app and found this is a bottleneck? In most cases I'd expect other optimisations could help you in a more reliable way.
No, you can't guarantee this locality of reference.
By allocating a byte array, or using a mapped byte buffer from the nio packages, you could get a chunk of contiguous memory, from which you can decode the data you want (effectively deserializing the objects of interest from this chunk of memory). However, if you repeatedly access the same objects, the deserialization overhead would likely defeat the purpose.
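As a rough illustration (the class name and record layout below are my own assumptions, not from the answer), fixed-layout records can be packed into one contiguous ByteBuffer and decoded field by field on access:

import java.nio.ByteBuffer;

// Hypothetical sketch: pack records of (int id, double value) into a single
// contiguous ByteBuffer and decode fields lazily when they are read.
public class PackedRecords {
    private static final int RECORD_SIZE = Integer.BYTES + Double.BYTES; // 12 bytes per record
    private final ByteBuffer buffer;

    public PackedRecords(int capacity) {
        // allocateDirect gives one contiguous off-heap chunk of memory
        this.buffer = ByteBuffer.allocateDirect(capacity * RECORD_SIZE);
    }

    public void put(int index, int id, double value) {
        int offset = index * RECORD_SIZE;
        buffer.putInt(offset, id);
        buffer.putDouble(offset + Integer.BYTES, value);
    }

    public int getId(int index) {
        return buffer.getInt(index * RECORD_SIZE); // decode on every access
    }

    public double getValue(int index) {
        return buffer.getDouble(index * RECORD_SIZE + Integer.BYTES);
    }
}

Note that every read decodes from the buffer, which is the deserialization overhead mentioned above; for variable-length data such as strings the bookkeeping gets considerably more involved.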
Have you written this code yet in Java? And if so, have you profiled it? I would argue that you probably don't need to worry about the objects being in contiguous memory - the JVM is better at memory management than you are in a garbage collected environment.
If you're really concerned about performance, maybe Java isn't the right tool for the job, but my gut instinct is to tell you that you're worrying about optimization too early, and that a Java version of your code, working with non-contiguous memory, will probably suit your needs.
I suggest using a HashMap (single-threaded) or Hashtable (thread-safe) for your cache. Both store their entries in an array in the Sun JVM. Since Java objects are always accessed through references, that array is effectively an array of pointers, as it would be in C. My bet is that you are performing premature optimization.
If you absolutely must have this, you have two options:
1) Use JNI and write it in c.
2) Get a BIG byte buffer and use ObjectOutputStream to write objects to it. This will probably be VERY SLOW compared to using a hash table.
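A minimal sketch of option 2, assuming the cached objects implement Serializable (the class and method names are just for illustration); the per-access serialization work is exactly why this tends to be very slow:

import java.io.*;
import java.util.List;

// Hypothetical sketch: serialize objects into one big in-memory byte array,
// then deserialize them again when they are needed.
public class SerializedCache {

    public static byte[] writeAll(List<? extends Serializable> items) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            for (Serializable item : items) {
                out.writeObject(item);
            }
        }
        return bytes.toByteArray(); // one contiguous byte[] holding every object
    }

    public static Object readFirst(byte[] buffer) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(buffer))) {
            return in.readObject(); // deserialization cost paid on every read
        }
    }
}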
I need to implement a large octree in Java. The tree will be very large so I will need to use a paging mechanism and split the tree into many small files for long-term storage.
My concern is that Java objects have a very high space overhead. If I were using C, I would be able to store the 8 reference pointers and any other data with only a byte or so of overhead to store the node type.
Is there any way that I can approach this level of overhead in Java?
I would be tempted to just use a single byte array per file. I could then use offsets in place of pointers (this is how I plan on storing the files). However, even when limiting the max file size, that would easily leave me with arrays too large to fit in contiguous memory, particularly if that memory becomes heavily fragmented. It would also add a large time overhead when adding new nodes, since the entire space would need to be reallocated. A ByteBuffer might resolve the first problem (I am not entirely sure of this); however, it would not solve the second, since the size of a ByteBuffer is fixed once allocated.
For the moment I will just stick to using node objects. If anyone knows a more space efficient solution with a low time cost, please let me know.
More specifically, how do arrays in Java allow users to access each element in constant time? I understand that Java allocates memory equivalent to the specified size at initialization, but what about their structure allows such rapid lookup? The reason I am asking is that it's something not available in any other data structure, save those that use array representations of the data.
Another (possibly silly) question is why can't other data structures provide such quick lookup of the stored data? I guess this question can/will be answered by the answer to the first question (how Arrays are implemented).
An array is just stored as a big contiguous block of memory.
Accessing an element is a matter of:
Finding where the data starts
Validating the index
Multiplying the index by the element size, and adding the result to the start location
Accessing that memory location
Note that this is all done in the JVM, not in byte code. If arrays didn't exist at a JVM level, you couldn't fake them up within pure Java. Aside from anything else, arrays are the only objects in Java which logically vary in size between instances within the same JVM. (It's possible that some JVMs have clever systems to use compressed references in some cases and not in others, but that's an implementation detail rather than a logical difference.)
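As a rough illustration of that arithmetic (my own example, not part of the answer above), you can do the same "start + index * element size" computation by hand when flattening a 2D grid into one contiguous array:

// Hypothetical illustration: a 2D grid stored in a single flat array, where
// an element's offset is computed the same way the JVM computes an array
// element's address (start location + index * element size).
public class FlatGrid {
    private final int width;
    private final double[] cells; // one contiguous block

    public FlatGrid(int width, int height) {
        this.width = width;
        this.cells = new double[width * height];
    }

    public double get(int row, int col) {
        return cells[row * width + col]; // constant-time offset computation
    }

    public void set(int row, int col, double value) {
        cells[row * width + col] = value;
    }
}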
In C/C++ we have realloc which will efficiently allocate additional space for an existing collection. I guess it is sub linear (or even constant) in complexity.
Is there a way to achieve the same in Java? Here are the items that I looked at,
Array resize is not possible,
Copying an array to another of bigger size is linear in complexity. Looked at both System.arraycopy and Arrays.copyOf
ArrayList must be the same as point 2 above.
Note: my requirement is to expand an already extremely large array even further.
realloc is likely to be O(n) in practice, since it sometimes/often involves memory copying. In that sense, it's equivalent in theoretical complexity to allocating a new array in Java.
Now Java always zeros the newly allocated memory which may take it a bit longer, but OTOH the GC has insanely fast memory allocations so it may even be faster than realloc in some cases. I'd expect a strategy that involves allocating new arrays in Java to be overall roughly comparable in speed to realloc. Possibly Java is better for smaller arrays, C/C++ would have the edge for big arrays, but YMMV. You'd have to benchmark in your particular implementation and workload to be sure.
So overall:
Don't worry about it, just reallocate new arrays in Java
If you do this a lot, be sure to allocate new arrays with more space than you currently need, so that you don't have to reallocate for every single element added (this is what Java's ArrayList does internally; a sketch of that growth strategy follows below).
Final but important point: unless you are writing very low level code, you probably shouldn't be worrying about this anyway. Just use one of the fine collection classes that already exist (Java Collections, Google Collections, Trove etc.) and let them handle all of this stuff for you.
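A minimal sketch of that growth strategy (simplified; ArrayList itself grows by roughly 1.5x rather than 2x):

import java.util.Arrays;

// Hypothetical sketch: double the capacity whenever the backing array is
// full, so appends are amortized O(1) even though each individual resize
// copies the whole array.
public class GrowableIntArray {
    private int[] data = new int[16];
    private int size = 0;

    public void add(int value) {
        if (size == data.length) {
            data = Arrays.copyOf(data, data.length * 2); // grow by ~2x, not by one element
        }
        data[size++] = value;
    }

    public int get(int index) {
        if (index >= size) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + size);
        }
        return data[index];
    }
}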
Right now I need to load a huge amount of data from a database into a Vector, but when I have loaded 38,000 rows the program throws an OutOfMemoryError.
What can I do to handle this?
I think there may be a memory leak in my program; what are good methods to detect it? Thanks.
Provide more memory to your JVM (usually using -Xmx/-Xms) or don't load all the data into memory.
For many operations on huge amounts of data there are algorithms which don't need access to all of it at once. One class of such algorithms are divide and conquer algorithms.
If you must have all the data in memory, try caching commonly appearing objects. For example, if you are looking at employee records and they all have a job title, use a HashMap when loading the data and reuse the job titles already found. This can dramatically lower the amount of memory you're using.
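A minimal sketch of that technique (the class and method names are assumptions for illustration):

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: reuse one String instance per distinct job title
// while loading rows, instead of keeping a separate copy per employee.
public class TitleInterner {
    private final Map<String, String> seen = new HashMap<>();

    public String canonical(String title) {
        // returns the first instance seen for this value, storing it if new
        return seen.computeIfAbsent(title, t -> t);
    }
}

Each employee record then holds the shared instance returned by canonical(...) rather than its own copy of the title.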
Also, before you do anything, use a profiler to see where memory is being wasted, and to check if things that can be garbage collected have no references floating around. Again, String is a common example, since if for example you're using the first 10 chars of a 2000 char string, and you have used substring instead of allocating a new String, what you actually have is a reference to a char[2000] array, with two indices pointing at 0 and 10. Again, a huge memory waster.
You can try increasing the heap size:
java -Xms<initial heap size> -Xmx<maximum heap size>
Default is
java -Xms32m -Xmx128m
Do you really need to have such a large object stored in memory?
Depending of what you have to do with that data you might want to split it in lesser chunks.
Load the data section by section. This will not let you work on all data at the same time, but you won't have to change the memory provided to the JVM.
You could run your code using a profiler to understand how and why the memory is being eaten up. Debug your way through the loop and watch what is being instantiated. There are any number of them; JProfiler, Java Memory Profiler, see the list of profilers here, and so forth.
Maybe optimize your data classes? I've seen a case where someone was using Strings in place of primitive types such as int or double for every class member, which caused an OutOfMemoryError when storing a relatively small number of data objects in memory. Also check that you aren't duplicating your objects. And, of course, increase the heap size:
java -Xmx512M (or whatever you deem necessary)
Let your program use more memory or, much better, rethink the strategy. Do you really need so much data in memory?
I know you are trying to read the data into a Vector - otherwise, if you were trying to display it, I would have suggested NatTable. It is designed for reading huge amounts of data into a table.
I believe it might come in handy for another reader here.
Use a memory mapped file. Memory mapped files can basically grow as big as you want, without hitting the heap. It does require that you encode your data in a decoding-friendly way. (Like, it would make sense to reserve a fixed size for every row in your data, in order to quickly skip a number of rows.)
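A minimal sketch of that approach, assuming fixed-size rows (the row size, class name, and layout are my own illustration, not from the answer):

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: memory-map a file of fixed-size rows so the data
// lives outside the Java heap and any row can be reached by offset.
public class MappedRows {
    private static final int ROW_SIZE = 64; // assumed fixed row size in bytes
    private final MappedByteBuffer buffer;

    public MappedRows(Path file, long rowCount) throws IOException {
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, rowCount * ROW_SIZE);
        }
    }

    public void readRow(int rowIndex, byte[] dest) {
        buffer.position(rowIndex * ROW_SIZE); // skip straight to the row's offset
        buffer.get(dest, 0, ROW_SIZE);        // dest must be at least ROW_SIZE bytes
    }
}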
Preon allows you to deal with that easily. It's a framework that aims to do for binary encoded data what Hibernate has done for relational databases, and JAXB/XStream/XmlBeans for XML.
How do you optimize the heap size usage of an application that has a lot (millions) of long-lived objects? (big cache, loading lots of records from a db)
Use the right data type
Avoid java.lang.String to represent other data types
Avoid duplicated objects
Use enums if the values are known in advance (see the sketch after this list)
Use object pools
String.intern() (good idea?)
Load/keep only the objects you need
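Regarding the "use enums" and "avoid java.lang.String" items above, a minimal illustration (the field names are assumptions):

// Hypothetical illustration: a fixed set of known values stored as an enum,
// so every record shares one of a handful of instances instead of carrying
// its own String loaded from the database or file.
enum Status { ACTIVE, SUSPENDED, CLOSED }

class Account {
    long id;
    Status status;   // reference to one of three shared instances
    // String status; // a value read per row would be a separate object per record
}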
I am looking for general programming or Java specific answers. No funky compiler switch.
Edit:
Optimize the memory representation of a POJO that can appear millions of times in the heap.
Use cases
Load a huge CSV file into memory (converted into POJOs)
Use Hibernate to retrieve millions of records from a database
Summary of answers:
Use flyweight pattern
Copy on write
Instead of loading 10M objects with 3 properties, is it more efficient to have 3 arrays (or other data structure) of size 10M? (Could be a pain to manipulate data but if you are really short on memory...)
I suggest you use a memory profiler, see where the memory is being consumed and optimise that. Without quantitative information you could end up changing things which either have no effect or actually make things worse.
You could look at changing the representation of your data, especially if your objects are small.
For example, you could represent a table of data as a series of columns with object arrays for each column, rather than one object per row. This can save a significant amount of per-object overhead if you don't need to represent an individual row: a table with 12 columns and 10,000,000 rows could use 12 column objects (one per column) rather than 10 million row objects (one per row).
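A minimal sketch of that column-per-array layout (the field names are assumptions for illustration):

// Hypothetical sketch: instead of 10,000,000 row objects with three fields
// each, keep three parallel arrays and address a "row" by its index.
public class PersonTable {
    private final int[] ages;
    private final long[] salaries;
    private final String[] names;

    public PersonTable(int rowCount) {
        ages = new int[rowCount];
        salaries = new long[rowCount];
        names = new String[rowCount];
    }

    public void setRow(int row, int age, long salary, String name) {
        ages[row] = age;
        salaries[row] = salary;
        names[row] = name;
    }

    public int ageOf(int row) {
        return ages[row]; // no per-row object header or field padding
    }
}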
You don't say what sort of objects you're looking to store, so it's a little difficult to offer detailed advice. However some (not exclusive) approaches, in no particular order, are:
Use a flyweight pattern wherever possible.
Caching to disc. There are numerous cache solutions for Java.
There is some debate as to whether String.intern is a good idea. See here for a question re. String.intern(), and the amount of debate around its suitability.
Make use of soft or weak references to store data that you can recreate/reload on demand. See here for how to use soft references with caching techniques.
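For the soft-reference point, a minimal sketch (a real implementation would also clean up cleared entries, for example via a ReferenceQueue):

import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: cache values behind SoftReferences so the GC may
// reclaim them under memory pressure, and reload them on demand.
public class SoftCache<K, V> {
    private final Map<K, SoftReference<V>> cache = new HashMap<>();

    public V get(K key, Function<K, V> loader) {
        SoftReference<V> ref = cache.get(key);
        V value = (ref == null) ? null : ref.get();
        if (value == null) {                       // never cached, or already collected
            value = loader.apply(key);             // recreate/reload on demand
            cache.put(key, new SoftReference<>(value));
        }
        return value;
    }
}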
Knowing more about the internals and lifetime of the objects you're storing would result in a more detailed answer.
Ensure good normalization of your object model, don't duplicate values.
Ahem, and, if it's only millions of objects I think I'd just go for a decent 64 bit VM and lots of ram ;)
Normal "profilers" won't help you much, because you need an overview of all your "live" objects. You need heap dump analyzer. I recommend the Eclipse Memory analyzer.
Check for duplicated objects, starting with Strings.
Check whether you can apply patterns like flyweight, copy-on-write, or lazy initialization (Google will be your friend).
Take a look at this presentation linked from here. It lays out the memory use of common java object and primitives and helps you understand where all the extra memory goes.
Building Memory-efficient Java Applications: Practices and Challenges
You could just store fewer objects in memory. :) Use a cache that spills to disk or use Terracotta to cluster your heap (which is virtual) allowing unused parts to be flushed out of memory and transparently faulted back in.
I want to add something to the point Peter already made (can't comment on his answer): it's always better to use a memory profiler (check Java Memory Profiler) than to go by intuition. 80% of the time it's routine code that we ignore that turns out to have the problem. Collection classes are also more prone to memory leaks.
If you have millions of Integers and Floats etc. then see if your algorithms allow for representing the data in arrays of primitives. That means fewer references and lower CPU cost of each garbage collection.
A fancy one: keep most data compressed in RAM, and only expand the current working set. If your data has good locality, that can work nicely.
Use better data structures. The standard collections in java are rather memory intensive.
[what is a better data structure]
If you take a look at the source for the collections, you'll see that if you restrict yourself in how you access the collection, you can save space per element.
The way the collections handle growing is not good for large collections - too much copying. For large collections, you need a block-based structure, like a B-tree.
Spend some time getting acquainted with and tuning the VM command line options, especially those concerning garbage collection. While this won't change the memory used by your objects, it can have a big impact on performance with memory-intensive apps on machines with a lot of RAM.
Assign null to variables that are no longer used, so the objects they referenced become eligible for garbage collection.
De-reference (or clear) collections once you are done with them; otherwise the GC cannot reclaim their contents.
1) Use the right data types wherever possible
class Person {
    int age;
    int status;
}
Here we can use smaller types to save memory for each Person object:
class Person {
    short age;
    byte status;
}
2) Instead of returning new ArrayList<>() from a method, you can return Collections.emptyList(), which is a single shared, immutable empty list, instead of allocating a new ArrayList with a default capacity of 10.
For e.g.
public List getResults(){
    .....
    if(failedOperation)
        return new ArrayList<>();
}
//Use this
public List getResults(){
    if(failedOperation)
        return Collections.emptyList();
}
3) Where possible, create objects as local variables inside methods rather than holding them in long-lived static fields; the objects still live on the heap, but they become unreachable as soon as the method returns and can be garbage-collected promptly.
4) Use binary formats such as Protocol Buffers, Thrift, Avro, or MessagePack instead of JSON or XML to reduce the size of data exchanged between services.