For ultra-fast code it is essential that we maintain locality of reference: keep as much of the data that is used closely together in the CPU cache:
http://en.wikipedia.org/wiki/Locality_of_reference
What techniques are there to achieve this? Could people give examples?
I am interested in Java and C/C++ examples. It would be interesting to know of ways people use to prevent lots of cache swapping.
Greetings
This is probably too generic to have a clear answer. The approaches in C or C++ compared to Java will differ quite a bit (the way the languages lay out objects differs).
The basic rule would be: keep data that will be accessed in tight loops together. If your loop operates on type T, and T has members m1...mN, but only m1...m4 are used in the critical path, consider breaking T into T1, which contains m1...m4, and T2, which contains m5...mN. You might want to add to T1 a pointer that refers to T2. Try to avoid objects that are unaligned with respect to cache boundaries (very platform dependent).
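A minimal Java sketch of this hot/cold split (the type and field names are invented for illustration):

    // "Hot" part: only the fields the critical loop actually touches.
    class ParticleHot {
        float x, y, z, velocity;   // m1...m4: read on every iteration
        ParticleCold cold;         // one reference to the rarely used rest
    }

    // "Cold" part: m5...mN, touched only outside the critical path.
    class ParticleCold {
        String name;
        long creationTime;
        byte[] debugInfo;
    }

The tight loop now walks over small ParticleHot objects, so many more of them fit in each cache line.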
Use contiguous containers (a plain old array in C, vector in C++) and try to arrange the iterations to go up or down, not jumping randomly all over the container. Linked lists are killers for locality: two consecutive nodes in a list might be at completely different random locations.
Object containers (and generics) in Java are also a killer: while in a Vector the references are contiguous, the actual objects are not (there is an extra level of indirection). In Java there is a lot of extra overhead: if you new two objects one right after the other, the objects will probably end up in almost contiguous memory locations, but there will be some extra object-management data (usually two or three header words) in between. The GC will move objects around, but hopefully won't make things much worse than they were before it ran.
If you are focusing on Java, create compact data structures. If you have an object that has a position, and that position is accessed in a tight loop, consider holding x and y as primitive fields inside your object rather than creating a Point and holding a reference to it. Reference types need to be allocated with new, and that means a separate allocation, an extra indirection and less locality.
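For instance, a hedged sketch of that flattening (the Entity classes are invented):

    class Point { int x, y; }    // a separately allocated object

    // Poor locality: every access to pos.x chases a reference to a
    // second object somewhere else on the heap.
    class EntityBoxed {
        Point pos = new Point();
    }

    // Better locality: x and y live inside the Entity object itself,
    // so the tight loop touches one object instead of two.
    class EntityFlat {
        int x, y;
    }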
Two common techniques include:
Minimalism (of data size and/or code size/paths)
Cache-oblivious techniques
Example for minimalism: in ray tracing (a 3D graphics rendering paradigm), it is a common approach to use Kd-trees with 8-byte nodes to store static scene data. The traversal algorithm fits in just a few lines of code. Then, the Kd-tree is often compiled in a manner that minimizes the number of traversal steps by having large, empty nodes at the top of the tree ("Surface Area Heuristics" by Havran).
Mispredictions typically have a probability of 50%, but they cost little, because a great many nodes fit in a cache line (consider that you get 128 such nodes per KiB!), and one of the two child nodes is always a direct neighbour in memory.
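As a hedged illustration of such 8-byte nodes in Java (the bit layout is an assumption; real ray tracers usually do this in C/C++):

    // Hypothetical packed Kd-tree: one 8-byte long per node in a flat array.
    // Assumed layout: low 2 bits = split axis (3 marks a leaf), next 30 bits
    // = index of the first child, high 32 bits = split position as float bits.
    final class PackedKdTree {
        final long[] nodes;

        PackedKdTree(long[] nodes) { this.nodes = nodes; }

        int axis(int i)        { return (int) (nodes[i] & 0x3); }
        boolean isLeaf(int i)  { return axis(i) == 3; }
        int firstChild(int i)  { return (int) ((nodes[i] >>> 2) & 0x3FFFFFFF); }
        float splitPos(int i)  { return Float.intBitsToFloat((int) (nodes[i] >>> 32)); }
    }

Because each node is a single long, eight of them share one 64-byte cache line and the traversal loop never chases object references.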
Example for cache-oblivious techniques: Morton array indexing, also known as Z-order-curve indexing. This kind of indexing might be preferred if you usually access nearby array elements from unpredictable directions. It can be valuable for large image or voxel data where pixels might be 32 or even 64 bytes each, and there are millions of them (a typical compact camera's measure is megapixels, right?) or even thousands of billions for scientific simulations.
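A minimal sketch of 2D Morton indexing by bit interleaving (a standard bit trick, shown here in Java):

    final class Morton {
        // Interleave the low 16 bits of x and y into a 32-bit Z-order index,
        // so spatially nearby (x, y) pairs get nearby indices.
        static int index(int x, int y) {
            return spread(x) | (spread(y) << 1);
        }

        // Spread 16 bits out to the even bit positions (classic magic numbers).
        private static int spread(int v) {
            v &= 0xFFFF;
            v = (v | (v << 8)) & 0x00FF00FF;
            v = (v | (v << 4)) & 0x0F0F0F0F;
            v = (v | (v << 2)) & 0x33333333;
            v = (v | (v << 1)) & 0x55555555;
            return v;
        }
    }

Storing pixel (x, y) at offset Morton.index(x, y) keeps small 2D neighbourhoods close together in memory regardless of access direction.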
However, both techniques have one thing in common: keep the most frequently accessed stuff nearby; the less frequently used things can be further away, spanning the whole range from L1 cache over main memory to hard disk, then other computers in the same room, the next room, the same country, worldwide, other planets.
Some random tricks that come to my mind, some of which I used recently:
Rethink your algorithm. For example, you have an image with a shape and a processing algorithm that looks for corners of the shape. Instead of operating on the image data directly, you can preprocess it, save all the shape's pixel coordinates in a list and then operate on the list. You avoid jumping randomly around the image (a sketch follows this list).
Shrink data types. A regular int takes 4 bytes; if you manage to use e.g. uint16_t, you will fit twice as much in the cache.
Sometimes you can use bitmaps. I used one for processing a binary image: I stored one pixel per bit, so I could fit 8*32 pixels in a single cache line. It really boosted the performance (see the second sketch after this list).
From Java, you can use JNI (it's not difficult) and implement your critical code in C to control the memory layout.
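Two hedged sketches of the tricks above (all names are invented; the details of my actual code differed):

    // Sketch for the "rethink your algorithm" trick: one sequential sweep
    // collects the shape's pixel coordinates, so the corner search later
    // runs over a compact list instead of jumping around the image.
    static int[] collectShapePixels(byte[] image, int width, int height) {
        int[] coords = new int[width * height * 2];  // worst case: all pixels
        int n = 0;
        for (int y = 0; y < height; y++)
            for (int x = 0; x < width; x++)
                if (image[y * width + x] != 0) {     // pixel is part of the shape
                    coords[n++] = x;
                    coords[n++] = y;
                }
        return java.util.Arrays.copyOf(coords, n);
    }

    // Sketch for the bitmap trick: one pixel per bit, so a 32-byte cache
    // line holds the 8*32 pixels mentioned above.
    final class BitImage {
        final long[] bits;
        final int width;

        BitImage(int width, int height) {
            this.width = width;
            this.bits = new long[(width * height + 63) / 64];
        }

        void set(int x, int y) {
            int i = y * width + x;
            bits[i >>> 6] |= 1L << (i & 63);
        }

        boolean get(int x, int y) {
            int i = y * width + x;
            return (bits[i >>> 6] & (1L << (i & 63))) != 0;
        }
    }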
In the Java world the JIT is going to be working hard to achieve this, and trying to second-guess it is likely to be counterproductive. This SO question addresses Java-specific issues more fully.
Related
I'm creating a graph in Java using JGraphT and adding vertices and edges from a list using a stream.
My question is:
Can I use stream().parallel() to add them faster?
No, at least not as far as I'm aware. Essentially, adding a vertex or edge boils down to two steps: (a) check whether the edge/vertex already exists and, if not, (b) add the edge/vertex. Depending on the type of graph, step (b) involves adding the object to the appropriate container that stores the edges/vertices. I'm not an expert on concurrent programming, but I don't see how a parallel stream could do the above faster.
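For reference, a minimal sequential sketch using the core JGraphT API (a SimpleGraph of String vertices is an assumption; your vertex and edge types may differ):

    import org.jgrapht.Graph;
    import org.jgrapht.graph.DefaultEdge;
    import org.jgrapht.graph.SimpleGraph;
    import java.util.List;

    public class BuildGraph {
        // edges given as two-element {source, target} arrays
        static Graph<String, DefaultEdge> build(List<String> vertices,
                                                List<String[]> edges) {
            Graph<String, DefaultEdge> g = new SimpleGraph<>(DefaultEdge.class);
            vertices.forEach(g::addVertex);             // steps (a) and (b) per vertex
            edges.forEach(e -> g.addEdge(e[0], e[1]));  // steps (a) and (b) per edge
            return g;
        }
    }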
I don't know exactly what your use case is, or what you are trying to accomplish. There are, however, some optimized, special graph types in the jgrapht-opt package that might benefit you. The graph functionality doesn't change (i.e. you can run the same algorithms on them); only the way the graph is stored changes. Some storage mechanisms are more memory efficient, allowing you to store massive graphs using little memory. Other graphs, such as the sparse graphs, can be created more quickly and their access operations are also quicker, but these graphs are typically immutable: once created, they cannot be changed. What you need really depends on your use case.
I need to implement a large octree in Java. The tree will be very large so I will need to use a paging mechanism and split the tree into many small files for long-term storage.
My concern is that Java objects have a very high space overhead. If I were using C, I would be able to store the 8 reference pointers and any other data with only a byte or so of overhead to store the node type.
Is there any way that I can approach this level of overhead in Java?
I would be tempted to just use a single byte array per file. I could then use offsets in place of pointers (this is how I plan on storing the files). However, even when limiting the max file size, that would easily leave me with arrays too large to fit in contiguous memory, particularly if that memory becomes heavily fragmented. This would also lead to large time overheads when adding new nodes, as the entire space would need to be reallocated. A ByteBuffer might resolve the first problem (I am not entirely sure of this); however, it would not solve the second, as the size of a ByteBuffer is fixed.
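For what it's worth, a hedged sketch of that offset idea (the 9-ints-per-node layout is an assumption, not a recommendation):

    // Hypothetical flat octree: each node occupies 9 ints in one array,
    // 1 for the node type and 8 for child offsets (-1 = no child).
    // Offsets index into the same array, standing in for Java references.
    final class FlatOctree {
        static final int NODE_INTS = 9;
        final int[] data;
        int nodeCount;

        FlatOctree(int maxNodes) { data = new int[maxNodes * NODE_INTS]; }

        int allocNode(int type) {
            int off = nodeCount++ * NODE_INTS;
            data[off] = type;
            java.util.Arrays.fill(data, off + 1, off + NODE_INTS, -1);
            return off;
        }

        int child(int node, int i)            { return data[node + 1 + i]; }
        void setChild(int node, int i, int c) { data[node + 1 + i] = c; }
    }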
For the moment I will just stick to using node objects. If anyone knows a more space efficient solution with a low time cost, please let me know.
I'm programming something in Java; for context, see this question: Markov Model decision process in Java
I have two options:
byte[][] mypatterns = new byte[MAX][4];
or
ArrayList<byte[]> mypatterns = new ArrayList<>();
I can use a Java ArrayList and append new arrays whenever I create them, or use a static array by calculating all possible data combinations, then looping through to see which indexes are 'on or off'.
Essentially, I'm wondering if I should allocate a large block that may contain uninitialized values, or use the dynamic array.
I'm running at a fixed frame rate, so looping through 200 elements every frame could be very slow, especially because I will have multiple instances of this loop.
Based on theory and what I have heard, dynamic arrays are very inefficient.
My question is: Would looping through an array of, say, 200 elements be faster than appending an object to a dynamic array?
Edit>>>
More information:
I will know the max length of the array if it is static.
The items in the array will frequently change, but their sizes are constant, therefore I can easily change them.
Allocating it statically would work like a memory pool
Some instances may have more of the data initialized than others
You're right, really; I should use a profiler first, but I'm also just curious about the question 'in theory'.
The "theory" is too complicated. There are too many alternatives (different ways to implement this) to analyse. On top of that, the actual performance for each alternative will depend on the the hardware, JIT compiler, the dimensions of the data structure, and the access and update patterns in your (real) application on (real) inputs.
And the chances are that it really doesn't matter.
In short, nobody can give you an answer that is well founded in theory. The best we can give is recommendations based on intuition about performance, and/or on software engineering common sense:
simpler code is easier to write and to maintain,
a compiler is a more consistent1 optimizer than a human being,
time spent on optimizing code that doesn't need to be optimized is wasted time.
1 - Certainly over a large code-base. Given enough time and patience, a human can do a better job for some problems, but that is not sustainable over a large code-base, and it doesn't take into account the facts that 1) compilers are always being improved, 2) optimal code can depend on things that a human cannot take into account, and 3) a compiler doesn't get tired and make mistakes.
The fastest way to iterate over bytes is as a single array. A faster way to process them is as int or long values, since processing 4-8 bytes at a time is faster than processing one byte at a time; however, it rather depends on what you are doing. Note: a byte[4] is actually 24 bytes on a 64-bit JVM, which means you are not making efficient use of your CPU cache. If you don't know the exact size you need, you might be better off creating a buffer larger than necessary, even if you don't use all of it. I.e. in the case of the byte[][], you are already using 6x the memory you really need.
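A hedged sketch of that idea: pack each 4-byte pattern into one int of a flat array instead of a separate byte[4] object (names are invented):

    // Each pattern becomes one int: no per-array object header, and MAX
    // patterns sit contiguously in a single int[MAX].
    static int pack(byte a, byte b, byte c, byte d) {
        return (a & 0xFF) | (b & 0xFF) << 8 | (c & 0xFF) << 16 | (d & 0xFF) << 24;
    }

    static byte unpack(int pattern, int index) {  // index 0..3
        return (byte) (pattern >>> (index * 8));
    }

    // usage: int[] patterns = new int[MAX]; patterns[i] = pack(b0, b1, b2, b3);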
Any performance difference will not be visible if you set initialCapacity on the ArrayList. You say that your collection's size can never change, but what if this logic changes?
Using ArrayList you get access to a lot of methods such as contains.
As other people have said already, use ArrayList unless performance benchmarks say it is a bottleneck.
What do you use when you need an immutable list with the fastest access/update? A LinkedList can be slow if you have to access an element in the middle, and it's prohibitive to create and repopulate it. Binary trees? Quadtrees?
If updating is very rare (or the collection is small), an array which you don't write to after initialization is worthwhile: the much lower constant factors (in both time and space) outweigh the linear-time update in these cases.
Apart from that, there are a number of purely functional data structures which provide better bounds for these cases. 2-3 Finger Trees (the data structure behind Haskell's Data.Sequence) are one example. Another option are Clojure's vectors and related data structures (e.g. Relaxed Radix-Balanced Trees), which use trees with high fan-out (32 or more) to keep reads cheap and structural sharing to avoid too many copies.
All of these are moderately tricky to implement manually though, especially if performance is important, and I'm not aware of existing implementations (I don't think Clojure's vectors are easy or convenient to use from Java).
I'm not sure I understand what you're looking for but I'll try to give a couple of pointers based on some things I've seen in the standard classes:
CopyOnWriteArrayList is a mutable yet threadsafe list because it duplicates the internal array on updates. Perhaps you could adapt some ideas from that, although it's obviously not efficient for large lists.
ConcurrentHashMap implements similar ideas on a much more complicated structure. It divides the internal hash table into separate partitions, so that changes only need to lock access to the relevant partition.
For an immutable list you could do something similar: divide the list's internal array into several partitions and treat them all as immutable. When you need to change the list, you only need to clone one partition and the index of the partitions, which will be cheaper than duplicating the whole list.
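A rough sketch of that partition idea (entirely hypothetical, not an existing API):

    // Immutable int list split into fixed-size blocks; an "update" clones
    // only the affected block plus the small array of block references.
    final class BlockList {
        private static final int BLOCK = 32;  // assumed block size
        private final int[][] blocks;

        private BlockList(int[][] blocks) { this.blocks = blocks; }

        static BlockList of(int[] values) {
            int n = (values.length + BLOCK - 1) / BLOCK;
            int[][] b = new int[n][];
            for (int i = 0; i < n; i++)
                b[i] = java.util.Arrays.copyOfRange(values, i * BLOCK,
                        Math.min(values.length, (i + 1) * BLOCK));
            return new BlockList(b);
        }

        int get(int i) { return blocks[i / BLOCK][i % BLOCK]; }

        BlockList with(int i, int value) {    // copies O(BLOCK), not O(n)
            int[][] nb = blocks.clone();      // clone only the block index
            nb[i / BLOCK] = blocks[i / BLOCK].clone();
            nb[i / BLOCK][i % BLOCK] = value;
            return new BlockList(nb);
        }
    }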
AWTEventMulticaster achieves similar goals, but duplicates the absolute minimum. It's a clever binary tree. See the source.
With a smaller internal partition or block size, you get faster updates but slower access in general. With a larger block (e.g., the entire array) you get slower updates but faster access.
If you really need fastest access and update, you have to use a mutable array.
If one took, say, 1000 lines of computer code and, instead of the variables being declared independently, they were grouped together (obviously this depends on the variable sizes being used) into classes and structs, would this directly increase the cache's spatial locality (and therefore reduce the cache miss rate)?
I was under the impression that by associating the variables within a class/struct they would be assigned contiguous memory addresses?
If you are talking about method-local variables, they are already contiguous on the stack, or strictly speaking in activation records which are all but invariably on the stack. If you are talking about references to Java objects, or pointers to dynamically allocated C++ objects, putting them into containing classes won't make any difference for another reason: the objects concerned will still be in arbitrary positions in the heap.
Answering this question is not possible without making some quite unreasonable assumptions. Spatial locality is as much about algorithms as it is about data structures, so grouping logically related data elements together may be of no consequence or even worse based on an algorithm that you use.
For example, consider a representation of 100 points in 3D space. You could put them in three separate arrays, or create a 3-tuple struct/class, and make an array of these.
If your algorithm must get all three coordinates of each point at once on each step, the tuple representation wins. However, think about what would happen if you wanted to build an algorithm that operates on each dimension independently, and parallelize it three ways among three independent threads. In this case three separate arrays would win hands down, because that layout avoids false sharing and improves spatial locality as far as the one-dimension-at-a-time algorithm is concerned.
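A minimal sketch of the two layouts (class names are invented; shown in Java for concreteness, though note that the AoS variant in Java stores references, so its locality benefit is weaker than in C/C++):

    // Array-of-structures: best when each step needs x, y and z together.
    class PointsAoS {
        static class Point3 { double x, y, z; }
        final Point3[] points = new Point3[100];
    }

    // Structure-of-arrays: best for one-dimension-at-a-time processing;
    // three threads can each own one array without false sharing.
    class PointsSoA {
        final double[] xs = new double[100];
        final double[] ys = new double[100];
        final double[] zs = new double[100];
    }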
This example shows that there is no "one size fits all" solution. Spatial locality should always be considered in the context of a specific algorithm; a good solution in one case could turn bad in other seemingly similar cases.
If you are asking whether to group local variables into explicitly defined structures, there is not going to be an advantage. Local variables are implemented in terms of activation records, which are usually closely related to the implementation of class structures, for any language that has both.
So, local variables should already have good spatial locality, unless the language implementation is doing something weird to screw it up.
You might improve locality by isolating large chunks of local state which isn't used during recursion into separate non-recursing functions. This would be a micro-optimization, so you need to inspect the machine code first to be sure it's not a waste of time. Anyway, it's unrelated to moving locals into a class.