I’m trying to find a way to minimize memory allocation and garbage collection in a Java OpenGL (JOGL) application. I am porting some C/C++/C# OpenGL projects to Java as a learning exercise. One thing I keep running into is the lack of pointers in Java and the constant allocation (and subsequent GC) of objects as the application runs. In C/C++/C# I can get the applications to start up and simply run without allocating any additional memory or objects by passing references around, but in Java it seems my designs are incompatible.
As these designs have evolved they use higher-level objects: in C they were structs for vectors and matrices, in C++/C# they were classes. These all essentially boil down to arrays of bytes in memory, which are then cast in one way or another to float[] for OpenGL calls, or treated as object arrays internally so I can use object-based operations like operator overloading (add, multiply) or property access. Anyone working with OpenGL probably sees what I am doing: I allocate everything on load and simply pass the data around.
Java has thrown me for a loop. It appears I cannot cast data back and forth, so I keep creating lots of objects and the GC comes by and does its work. This shows up as resources being consumed and cleaned up, and as a noticeable stutter while the application runs. I have alleviated some of this by creating FloatBuffers alongside my VectorXf arrays for geometry data and passing the FloatBuffer down to OpenGL, but whenever I need to update vector data I have to copy it back into the FloatBuffer. This also means I am storing the data twice and incurring the overhead of refilling the FloatBuffer.
I’d like to hear how others are dealing with these issues. I’d like to keep the higher-order objects for their built-in functionality but still be able to pass the data to OpenGL. Are my designs simply incompatible with Java? Do I need to move to FloatBuffers exclusively? How does one pass component data into a higher-order object without the penalty of object creation? So many OpenGL applications exist in Java that I suspect there is some ‘magic’ to either use the same buffer as both float[] and Object[], or allocate a contiguous block for object data and pass a reference to OpenGL.
The driving force in managing your OpenGL data is that you don't want to be responsible for the memory containing the geometry or textures. The use of float[] or even FloatBuffers should only be for the purpose of transferring geometry data into OpenGL buffer objects. Once you've created an OpenGL buffer and copied the data to it, you no longer need to keep a copy in your JVM. On virtually all modern hardware this will cause the data to be retained on the video card itself, completely outside of the JVM.
Ideally, if most of your geometry is static, you can copy it into OpenGL buffers at startup time and never have to manage it directly again. If you're dealing with a lot of dynamic geometry, then you're still going to have to transfer data back and forth to the OpenGL driver. In that case, you probably want to maintain a pool of FloatBuffers that act as ferries for moving data between the code that generates or discovers the changing geometry and the driver. FloatBuffers are unavoidable because OpenGL expects data in a specific layout, which is generally different from the internal representation of the data in the JVM, but at the very least you don't need to keep a separate FloatBuffer around for every set of data you have.
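A minimal sketch of that "ferry" idea, assuming JOGL 2.x and its Buffers helper (the GeometryFerry class and upload method are names made up for this example):

    import java.nio.FloatBuffer;
    import com.jogamp.common.nio.Buffers;
    import com.jogamp.opengl.GL;

    /** One reusable "ferry" buffer for moving changing geometry into a VBO. */
    final class GeometryFerry {
        private FloatBuffer scratch = Buffers.newDirectFloatBuffer(1024);

        /** Copies the given vertex data into the (reused) scratch buffer and uploads it. */
        void upload(GL gl, int vbo, float[] vertices, int usageHint) {
            if (scratch.capacity() < vertices.length) {
                // Grow once; the same buffer is then reused for every later upload.
                scratch = Buffers.newDirectFloatBuffer(vertices.length);
            }
            scratch.clear();
            scratch.put(vertices, 0, vertices.length);
            scratch.flip();

            gl.glBindBuffer(GL.GL_ARRAY_BUFFER, vbo);
            gl.glBufferData(GL.GL_ARRAY_BUFFER,
                            (long) vertices.length * Buffers.SIZEOF_FLOAT,
                            scratch, usageHint);   // e.g. GL.GL_STATIC_DRAW or GL.GL_DYNAMIC_DRAW
        }
    }

For static geometry you would call something like this once at startup and then let the scratch buffer be reused (or discarded); only the OpenGL buffer object keeps the data after that.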
My experience:
I was using FloatBuffers only for transferring data, but I found that this was really killing performance for dynamic meshes because I had to convert my Vec arrays to FloatBuffers every time I changed a mesh. I have since gotten rid of the Vec arrays and keep the data in FloatBuffers persistently inside my mesh classes; they are less elegant to work with, but much faster. So I would advise you to keep and update all your geometry data in FloatBuffers.
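For illustration, a rough sketch of what keeping geometry directly in a FloatBuffer can look like, assuming JOGL's Buffers helper (the Mesh class and setVertex method are just illustrative names):

    import java.nio.FloatBuffer;
    import com.jogamp.common.nio.Buffers;

    /** Mesh that stores its vertices directly in a direct FloatBuffer (x, y, z per vertex). */
    final class Mesh {
        private final FloatBuffer vertices;

        Mesh(int vertexCount) {
            vertices = Buffers.newDirectFloatBuffer(vertexCount * 3);
        }

        /** Updates one vertex in place; no temporary Vec object, no per-frame copy. */
        void setVertex(int index, float x, float y, float z) {
            int base = index * 3;
            vertices.put(base, x);
            vertices.put(base + 1, y);
            vertices.put(base + 2, z);
        }

        /** The buffer handed to glBufferData / glBufferSubData when the mesh changes. */
        FloatBuffer buffer() {
            return vertices;
        }
    }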
Related
I'm developing a software package which makes heavy use of arrays (ArrayLists). Instructions are put into an array acting as a queue, processed, and then deleted from the array once used. The same goes for drawing on a plot: data is placed into an array queue, which is read to plot the data, and the oldest data is eventually deleted as new data comes in. We are talking about thousands of instructions over an hour and maybe 200,000 points plotted at any time, with the arrays continually growing and shrinking.
After some time, the software begins to slow down and the instructions are processed more slowly. Nothing about the workload really changes; the system is stable in terms of how much data is plotted and what instructions are being processed, just working off similar incoming data time after time.
Is there some memory issue going on with the "abuse" of the variable-sized (not a defined size, add/delete as needed) arrays/queues that could be causing eventual slowing?
Is there a better way than the String ArrayList to act as a queue?
Thanks!
Yes, you are most likely using the wrong data structure for the job. An ArrayList is a list with a backing array, so get() is fast, but removing an element from the front is O(n) because every remaining element has to be shifted down.
The Java runtime library has a very rich set of data structures, so you can get a well-written, debugged implementation with the characteristics you need out of the box. You most likely should be using one or more Queues instead.
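A minimal sketch of swapping the ArrayList-as-queue for an ArrayDeque (the instruction strings are just placeholders):

    import java.util.ArrayDeque;
    import java.util.Queue;

    public class InstructionQueueDemo {
        public static void main(String[] args) {
            // ArrayDeque gives O(1) add at the tail and O(1) remove at the head,
            // unlike ArrayList, where removing index 0 shifts every element.
            Queue<String> instructions = new ArrayDeque<>();

            instructions.add("MOVE 10 20");   // enqueue as instructions arrive
            instructions.add("DRAW POINT");

            while (!instructions.isEmpty()) {
                String next = instructions.poll();  // dequeue oldest first
                System.out.println("processing: " + next);
            }
        }
    }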
My guess is that you are forgetting to null out values in your ArrayList, so the JVM has to keep all of them around. This is a memory leak.
To confirm, use a profiler to see where your memory and CPU go. VisualVM is a nice standalone tool, and NetBeans includes one as well.
The use of VisualVM helped. It showed heavy use of a "message" form that I was dumping incoming data into and had forgotten existed; since I never limited its size, it was holding about a million characters by the time the sluggishness became apparent.
I'm working on a project where I need to plot some data. At the moment I keep all the data in one object and hand the graphs a reference to that object. But the data can change dynamically, and when it does I need to change the data the graphs get. So here is my question:
Should I create a new array every time I edit the data and then change the references in the graphs, or should I just change the data within the original array and repaint the graphs?
Using immutable data results in a cleaner, more predictable API. If you mutate the array that is currently used by the graph API, nasty interactions are lurking just around the corner. This may lead to the graph API defensively copying the array internally; at that point you lose, because you end up with more copying than you would have needed had you started with an immutable approach up front.
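As an illustration, a sketch under the assumption of a hypothetical Graph interface standing in for whatever plotting API is in use:

    import java.util.Arrays;

    public class SnapshotDemo {
        /** Hypothetical stand-in for the plotting API's data setter. */
        interface Graph {
            void setData(double[] values);
        }

        private double[] current = new double[0];

        /** Build a new array for each change and hand the graph the new reference. */
        void appendPoint(Graph graph, double value) {
            double[] next = Arrays.copyOf(current, current.length + 1);
            next[next.length - 1] = value;
            current = next;          // the old array is never mutated...
            graph.setData(current);  // ...so the graph can keep or read it safely
        }
    }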
Keeping one single model is the preferred approach, especially from a memory-performance point of view. However, it may depend: if you use the same model somewhere else, then you must ponder a little more.
A while ago I published a game written entirely in Java on the Android platform. Currently I'm trying to squeeze out as much performance as possible. It seems that the problem in my game's case is that I use the ArrayList container far too often, in places where a Map might be better suited. I did it because I was afraid of the dynamic memory allocations that would be triggered behind the scenes (Map/Tree structures on Android). Maybe there is some sort of structure on the Android/Java platform that I don't know about which would give me fast lookups and additionally would not dynamically allocate extra memory when adding new elements?
UPDATE:
For example, I'm using an ArrayList for holding most of my game's particles. Of course, removing them individually (not sequentially) is a pain in the b**t, as the system needs to iterate through the whole container just to remove one entity object (in the worst case).
I wouldn't worry about slowdowns from memory allocation unless you specifically find it to be an issue. Allocation itself isn't really the cause of slowdowns in Android games; it's when the GC runs that you usually see the problem. Unless you are inserting into and deleting from the Map very often, you might not have to worry about the allocations.
Update:
Instead of using a Map, you might want to consider just marking particles as "dead" when you no longer need them and using that flag to skip over them in your update iteration. Store the references to the dead particles in a new deadParticles ArrayList, and just take one out from that list when you need a new one. That way you have instant access to a free particle when you need one.
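A minimal sketch of that idea (the Particle fields and method names are made up for the example):

    import java.util.ArrayList;

    public class ParticlePool {
        /** Illustrative particle; real fields would be position, velocity, colour, etc. */
        static final class Particle {
            float x, y;
            boolean dead;
        }

        private final ArrayList<Particle> particles = new ArrayList<>();
        private final ArrayList<Particle> deadParticles = new ArrayList<>();

        /** Reuse a dead particle if one is available; allocate only as a last resort. */
        Particle spawn(float x, float y) {
            Particle p = deadParticles.isEmpty()
                    ? newParticle()
                    : deadParticles.remove(deadParticles.size() - 1);
            p.x = x;
            p.y = y;
            p.dead = false;
            return p;
        }

        private Particle newParticle() {
            Particle p = new Particle();
            particles.add(p);
            return p;
        }

        /** Instead of removing from the main list, flag it and remember it for reuse. */
        void kill(Particle p) {
            p.dead = true;
            deadParticles.add(p);
        }

        void update() {
            for (Particle p : particles) {
                if (p.dead) continue;  // skip dead particles in the update loop
                // ...move, fade, etc.
            }
        }
    }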
Are you preallocating your element objects, and reusing the empties rather than allocating new ones?
For ultra-fast code it is essential that we maintain locality of reference: keep as much of the data that is used closely together as possible in the CPU cache:
http://en.wikipedia.org/wiki/Locality_of_reference
What techniques are there to achieve this? Could people give examples?
I'm interested in Java and C/C++ examples. It would be interesting to know what approaches people use to avoid lots of cache swapping.
Greetings
This is probably too generic to have a clear answer. The approaches in C or C++ will differ quite a bit from those in Java (the way each language lays out objects differs).
The basic idea is to keep data that will be accessed in tight loops together. If your loop operates on type T, which has members m1...mN, but only m1...m4 are used in the critical path, consider breaking T into a T1 that contains m1...m4 and a T2 that contains m5...mN. You might want to add to T1 a pointer that refers to T2. Try to avoid objects that are misaligned with respect to cache boundaries (very platform dependent).
Use contiguous containers (a plain old array in C, vector in C++) and try to arrange the iterations to go up or down, rather than jumping randomly all over the container. Linked lists are killers for locality: two consecutive nodes in a list might be at completely different, random locations.
Object containers (and generics) in Java are also a killer: while in a Vector the references are contiguous, the actual objects are not (there is an extra level of indirection). In Java there is also a lot of extra overhead: if you new two objects one right after the other, the objects will probably end up in almost contiguous memory locations, but there will be some object-management data (usually a couple of header words) in between. The GC will move objects around, but hopefully won't make things much worse than they were before it ran.
If you are focusing on Java, create compact data structures: if you have an object that has a position and is accessed in a tight loop, consider holding x and y as primitive fields inside your object rather than creating a Point and holding a reference to it. Reference types need to be newed, and that means a separate allocation, an extra indirection, and less locality.
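A small sketch of the difference (the class names are purely illustrative):

    // Poor locality: each entity holds a reference to a separately allocated Point,
    // so the hot loop chases a pointer for every position read.
    final class Point {
        double x, y;
    }

    final class EntityWithReference {
        Point position = new Point();
    }

    // Better locality: the coordinates live inside the object itself,
    // so iterating over entities touches fewer, closer memory locations.
    final class EntityInline {
        double x, y;

        void move(double dx, double dy) {
            x += dx;   // no indirection through a Point reference
            y += dy;
        }
    }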
Two common techniques include:
Minimalism (of data size and/or code size/paths)
Use cache oblivious techniques
Example of minimalism: in ray tracing (a 3D graphics rendering paradigm), it is a common approach to use Kd-trees with 8-byte nodes to store static scene data. The traversal algorithm fits in just a few lines of code. The Kd-tree is also often built in a manner that minimizes the number of traversal steps by having large, empty nodes near the top of the tree ("Surface Area Heuristic" by Havran).
Mispredictions typically have a probability of 50%, but their cost is minor, because a great many nodes fit in a cache line (consider that you get 128 nodes per KiB!), and one of the two child nodes is always a direct neighbour in memory.
Example of a cache-oblivious technique: Morton array indexing, also known as Z-order curve indexing. This kind of indexing might be preferred if you usually access nearby array elements in unpredictable directions. It can be valuable for large image or voxel data where pixels may be 32 or even 64 bytes big and there are millions of them (a typical compact camera measures in megapixels, right?), or even thousands of billions of them in scientific simulations.
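As a rough sketch, a 16-bit-per-axis Morton (Z-order) index can be computed by interleaving the bits of x and y (class and method names are made up for the example):

    public final class Morton {
        /** Spreads the lower 16 bits of v so that they occupy every other bit position. */
        private static int spreadBits(int v) {
            v &= 0x0000FFFF;
            v = (v | (v << 8)) & 0x00FF00FF;
            v = (v | (v << 4)) & 0x0F0F0F0F;
            v = (v | (v << 2)) & 0x33333333;
            v = (v | (v << 1)) & 0x55555555;
            return v;
        }

        /** Interleaves x and y into a single Z-order index; nearby (x, y) pairs stay nearby. */
        public static int index(int x, int y) {
            return spreadBits(x) | (spreadBits(y) << 1);
        }

        public static void main(String[] args) {
            // Neighbouring pixels map to nearby indices, which improves cache behaviour
            // when a 2D image is stored in a 1D array in Morton order.
            System.out.println(index(3, 5));  // prints 39
        }
    }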
However, both techniques have one thing in common: keep the most frequently accessed stuff nearby; the less frequently used things can be further away, spanning the whole range from L1 cache over main memory to hard disk, then other computers in the same room, the next room, the same country, worldwide, other planets.
Some random tricks that come to my mind, and which some of them I used recently:
Rethink your algorithm. For example, say you have an image containing a shape and a processing algorithm that looks for the shape's corners. Instead of operating on the image data directly, you can preprocess it, save all of the shape's pixel coordinates in a list, and then operate on the list. That way you avoid jumping randomly around the image
Shrink data types. Regular int will take 4 bytes, and if you manage to use e.g. uint16_t you will cache 2x more stuff
Sometimes you can use bitmaps; I used one for processing a binary image. I stored one pixel per bit, so I could fit 8*32 pixels in a single cache line. It really boosted the performance (see the sketch after this list)
From Java, you can use JNI (it's not difficult) and implement your critical code in C to control memory layout yourself
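A minimal sketch of the bit-packing trick in Java (the BitImage class is made up for the example):

    /** Binary image stored one pixel per bit, packed into a long[] (64 pixels per word). */
    public final class BitImage {
        private final int width;
        private final long[] bits;

        public BitImage(int width, int height) {
            this.width = width;
            this.bits = new long[(width * height + 63) / 64];
        }

        public void set(int x, int y, boolean value) {
            int i = y * width + x;
            if (value) {
                bits[i >>> 6] |= 1L << (i & 63);
            } else {
                bits[i >>> 6] &= ~(1L << (i & 63));
            }
        }

        public boolean get(int x, int y) {
            int i = y * width + x;
            return (bits[i >>> 6] & (1L << (i & 63))) != 0;
        }
    }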
In the Java world the JIT is going to be working hard to achieve this, and trying to second guess this is likely to be counterproductive. This SO question addresses Java-specific issues more fully.
I am developing a tile-based physics game like Falling Sand Game. I am currently using a static VBO for the vertices and a dynamic VBO for the colors associated with each block type. In this type of game the data in the color VBO changes very frequently (every block change). Currently I am calling glBufferSubDataARB for each block change. I have found this to work, yet it doesn't scale well with resolution (it gets much slower with each increase). I was hoping that I could double my current playable resolution (256x256).
Should I call BufferSubData very frequently or BufferData once a frame? Should I drop the VBO and go with vertex array?
What can be done about video cards that do not support VBOs?
(Note: Each block is larger than one pixel)
First of all, you should stop using both functions. Buffer objects have been core OpenGL functionality since OpenGL 1.5 (2003); there is no reason to use the extension form of them. You should be using glBufferData and glBufferSubData, not the ARB versions.
Second, if you want high-performance buffer object streaming, tips can be found on the OpenGL wiki. But in general, calling glBufferSubData many times per frame on the same memory isn't helpful. It would likely be better to map the buffer and modify it directly.
To your last question, I would say this: why should you care? As previously stated, buffer objects are old. It's like asking what you should do for hardware that only supports D3D 5.0.
Ignore it; nobody will care.
You should preferably update the frequently changing color information in your own copy in RAM and hand the data to the GL in one operation, once per frame, preferably at the end of the frame, just before swapping buffers (this means you need to do it once out of line for the very first frame).
glBufferSubData can be faster than glBufferData since it does not reallocate the memory on the server, and since it possibly transfers less data. In your case, however, it is likely slower, because it needs to be synchronized with data that is still being drawn. Also, since data could change in any random location, the gains from only uploading a subrange won't be great, and uploading the whole buffer once per frame should be no trouble bandwidth-wise.
The best strategy would be to call glDraw(Elements|Arrays|Whatever) followed by glBufferData(..., NULL). This tells OpenGL that you don't care about the buffer contents any more, and it can throw them away as soon as it's done drawing (when you map this buffer or copy into it now, OpenGL will secretly use a new block of storage without telling you; that way you can work on the new storage while the old one has not finished drawing, which avoids a stall).
Now you run your physics simulation, and modify your color data any way you want. Once you are done, either glMapBuffer, memcpy, and glUnmapBuffer, or simply use glBufferData (mapping is sometimes better, but in this case it should make little or no difference). This is the data you will draw the next frame. Finally, swap buffers.
That way, the driver has time to do the transfer while the card is still processing the last draw call. Also, if vsync is enabled and your application blocks waiting for vsync, this time is available to the driver for data transfers. You are thus practically guaranteed that whenever a draw call is made (the next frame), the data is ready.
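In JOGL terms, the per-frame sequence described above might look roughly like this (a sketch assuming JOGL 2.x; the class, field, and method names are placeholders, not part of any API):

    import java.nio.FloatBuffer;
    import com.jogamp.common.nio.Buffers;
    import com.jogamp.opengl.GL;
    import com.jogamp.opengl.GL2;
    import com.jogamp.opengl.GL2ES2;

    /** Illustrative per-frame streaming of a dynamic color VBO. */
    final class ColorStreamer {
        private final int colorVbo;          // created elsewhere with glGenBuffers
        private final FloatBuffer colorData; // direct buffer holding this frame's colors
        private final long sizeInBytes;

        ColorStreamer(int colorVbo, int floatCount) {
            this.colorVbo = colorVbo;
            this.colorData = Buffers.newDirectFloatBuffer(floatCount);
            this.sizeInBytes = (long) floatCount * Buffers.SIZEOF_FLOAT;
        }

        /** Called after the frame's draw calls, just before (or around) the buffer swap. */
        void endOfFrame(GL2 gl) {
            // 1. Orphan the buffer: the driver may give us fresh storage while the
            //    old contents are still being read by in-flight draw calls.
            gl.glBindBuffer(GL.GL_ARRAY_BUFFER, colorVbo);
            gl.glBufferData(GL.GL_ARRAY_BUFFER, sizeInBytes, null, GL2ES2.GL_STREAM_DRAW);

            // 2. ...run the physics simulation here and refill colorData...

            // 3. Upload the new colors for the next frame in a single call
            //    (glMapBuffer/glUnmapBuffer would work here as well).
            colorData.rewind();
            gl.glBufferSubData(GL.GL_ARRAY_BUFFER, 0, sizeInBytes, colorData);
        }
    }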
As for cards that do not support VBOs: these do not really exist (well, they do, but not really). A VBO is more a programming model than a hardware feature. If you use plain vertex arrays, the driver still has to transfer a block of data to the card somehow, eventually. The only difference is that you own a vertex array, whereas the driver owns a VBO.
This means that with a VBO the driver does not need to ask you when to do what. With vertex arrays, it can only rely on the data being valid at the exact time you call glDrawElements. With a VBO, it always knows the data is valid, because you can only modify it via an interface controlled by the driver. This means it can manage memory and transfers much more optimally, and can pipeline drawing better.
There do of course exist implementations that don't support VBOs, but those would need to be truly old (like 10+ years old) drivers. It's not something to worry about, realistically.