If one took, say, 1000 lines of computer code and, instead of declaring the variables independently, grouped them together into classes and structs (obviously this depends on the variable sizes being used), would this directly increase cache spatial locality (and therefore reduce the cache miss rate)?
I was under the impression that by associating the variables within a class/struct they would be assigned contiguous memory addresses.
If you are talking about method-local variables, they are already contiguous on the stack, or strictly speaking in activation records which are all but invariably on the stack. If you are talking about references to Java objects, or pointers to dynamically allocated C++ objects, putting them into containing classes won't make any difference for another reason: the objects concerned will still be in arbitrary positions in the heap.
Answering this question is not possible without making some quite unreasonable assumptions. Spatial locality is as much about algorithms as it is about data structures, so grouping logically related data elements together may be of no consequence or even worse based on an algorithm that you use.
For example, consider a representation of 100 points in 3D space. You could put them in three separate arrays, or create a 3-tuple struct/class, and make an array of these.
If your algorithm must get all three coordinates of each point at once on each step, the tuple representation wins. However, consider what would happen if you wanted to build an algorithm that operates on each dimension independently, and parallelize it three-way among three independent threads. In this case three separate arrays would win hands down, because that layout would avoid false sharing, and improve spatial locality as far as the one-dimension-at-a-time algorithm is concerned.
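A minimal Java sketch of the two layouts (class and field names are mine; note that in C/C++ the array-of-structs variant would be truly contiguous, whereas in Java an array of objects holds references):

    // Array-of-structures layout: all coordinates of one point together.
    final class Point3 {
        double x, y, z;
    }

    final class PointLayouts {
        static final int N = 100;

        // Structure-of-arrays: each dimension is contiguous, which favors
        // one-dimension-at-a-time algorithms and avoids false sharing
        // between threads working on different dimensions.
        double[] xs = new double[N];
        double[] ys = new double[N];
        double[] zs = new double[N];

        // Array-of-structures: favors algorithms that touch whole points.
        // In Java this is an array of references, so the Point3 objects
        // themselves may still be scattered on the heap.
        Point3[] points = new Point3[N];
    }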
This example shows that there is no "one size fits all" solution. Spatial locality should always be considered in the context of a specific algorithm; a good solution in one case could turn bad in other seemingly similar cases.
If you are asking whether to group local variables into explicitly defined structures, there is not going to be an advantage. Local variables are implemented in terms of activation records, which are usually closely related to the implementation of class structures, for any language that has both.
So, local variables should already have good spatial locality, unless the language implementation is doing something weird to screw it up.
You might improve locality by isolating large chunks of local state which isn't used during recursion into separate non-recursing functions. This would be a micro-optimization, so you need to inspect the machine code first to be sure it's not a waste of time. Anyway, it's unrelated to moving locals into a class.
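A hypothetical Java sketch of that refactoring (names are mine): the working state is declared in a non-recursing helper, so it occupies one activation record at a time rather than one copy per recursion level:

    final class TreeWalk {
        static final class Node {
            Node left, right;
            int value;
        }

        // The recursive method keeps its frame small: just the parameter.
        void visit(Node n) {
            if (n == null) return;
            process(n);
            visit(n.left);
            visit(n.right);
        }

        // Non-recursing helper: these locals exist in exactly one
        // frame at a time, not once per recursion level.
        private void process(Node n) {
            long sum = 0, min = Long.MAX_VALUE, max = Long.MIN_VALUE;
            sum += n.value;
            min = Math.min(min, n.value);
            max = Math.max(max, n.value);
            // ... hypothetical per-node work using sum/min/max ...
        }
    }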
Related
I'm programming something in Java; for context, see this question: Markov Model descision process in Java
I have two options:
byte[][] mypatterns = new byte[MAX][4];
or
ArrayList<byte[]> mypatterns
I can use a Java ArrayList and append new arrays whenever I create them, or use a static array by calculating all possible data combinations up front and then looping through to see which indexes are 'on' or 'off'.
Essentially, I'm wondering if I should allocate a large block that may contain uninitialized values, or use the dynamic array.
This runs every frame, so looping through 200 elements each frame could be very slow, especially because I will have multiple instances of this loop.
Based on theory and what I have heard, dynamic arrays are very inefficient.
My question is: Would looping through an array of say, 200 elements be faster than appending an object to a dynamic array?
Edit>>>
More information:
I will know the maximum length of the array, if it is static.
The items in the array will change frequently, but their sizes are constant, so I can easily change them in place.
Allocating it statically would be akin to a memory pool.
Some instances may have more or less of the data initialized than others.
You're right, really; I should use a profiler first, but I'm also just curious about the question 'in theory'.
The "theory" is too complicated. There are too many alternatives (different ways to implement this) to analyse. On top of that, the actual performance for each alternative will depend on the the hardware, JIT compiler, the dimensions of the data structure, and the access and update patterns in your (real) application on (real) inputs.
And the chances are that it really doesn't matter.
In short, nobody can give you an answer that is well founded in theory. The best we can give is recommendations based on intuition about performance and/or on software-engineering common sense:
simpler code is easier to write and to maintain,
a compiler is a more consistent[1] optimizer than a human being,
time spent on optimizing code that doesn't need to be optimized is wasted time.
[1] Certainly over a large code-base. Given enough time and patience, a human can do a better job on some problems, but that is not sustainable over a large code-base, and it doesn't take account of the facts that 1) compilers are always being improved, 2) optimal code can depend on things that a human cannot take into account, and 3) a compiler doesn't get tired and make mistakes.
The fastest way to iterate over bytes is as a single array. Faster still is to process them as int or long values, since handling 4-8 bytes at a time beats handling one byte at a time, though it rather depends on what you are doing. Note: a byte[4] is actually 24 bytes on a 64-bit JVM, which means you are not making efficient use of your CPU cache. If you don't know the exact size you need, you might be better off creating a buffer larger than you need, even if you don't use all of it; in the case of the byte[][] you are already using 6x the memory you really need.
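A hedged sketch of the flat layout this points at (class and method names are mine; MAX = 200 comes from the question's estimate):

    final class Patterns {
        static final int MAX = 200;

        // One contiguous byte[] instead of byte[MAX][4]: no per-row
        // object headers (24 bytes each), no extra indirection.
        final byte[] data = new byte[MAX * 4];   // row i at data[i*4 .. i*4+3]

        byte get(int row, int col) {
            return data[row * 4 + col];
        }

        void set(int row, int col, byte value) {
            data[row * 4 + col] = value;
        }
    }

Since each pattern is exactly 4 bytes, an int[] of length MAX could alternatively hold one whole pattern per element, letting you compare or copy a pattern in a single operation.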
Any performance difference will not be visible when you set initialCapacity on the ArrayList. You say that your collection's size can never change, but what if this logic changes?
Using ArrayList you get access to a lot of methods such as contains.
As other people have said already, use ArrayList unless performance benchmarks say it is a bottleneck.
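For what it's worth, a minimal sketch of the initialCapacity suggestion above (the capacity of 200 is the question's own estimate, not a measured figure):

    import java.util.ArrayList;
    import java.util.List;

    class PatternsDemo {
        public static void main(String[] args) {
            // Sizing the list up front avoids the internal array
            // re-allocations that make "dynamic" growth look costly.
            List<byte[]> mypatterns = new ArrayList<>(200);
            mypatterns.add(new byte[4]);   // no resize until capacity exceeded
            System.out.println(mypatterns.size());
        }
    }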
If I have to store 3 integer values and would just like to retrieve them again, with no calculation required, which one of the following would be the better option?
int i,j,k;
or
int [] arr = new int[3];
The array would allocate 3 contiguous blocks of memory (after the JVM allocates the space), whereas the separate variables would be assigned arbitrary memory locations (which I guess would take the JVM less time than the array).
Apologies if the question is too trivial.
The answer is: It depends.
You shouldn't think too much about the performance implications in this case; the performance difference between the two is not big enough to notice.
What you really need to be on the lookout for is readability and maintainability.
If i, j, and k all essentially mean the same thing, you're going to use them the same way, and you might want to iterate over them, then it makes sense to use an array, so that you can iterate over them more easily.
If they're different values with different meanings, and you're going to use them differently, then it does not make sense to put them in an array. They should each have their own identity and their own descriptive variable name.
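A small illustration of that distinction (names and values are mine):

    class ThreeValues {
        static void process(int v) {
            System.out.println(v);   // hypothetical per-value work
        }

        public static void main(String[] args) {
            // Same meaning, same handling: an array lets you iterate.
            int[] coords = {1, 2, 3};
            for (int c : coords) {
                process(c);
            }

            // Different meanings: distinct, descriptive names win.
            int width = 640, height = 480, depth = 32;
            System.out.println(width * height * depth);
        }
    }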
Choose whichever makes most sense semantically:
If these variables are three for a fundamental reason (maybe they are coordinates in the 3D space of a 3D game engine), then use three separate variables (because making, say, a 4D game engine is not a trivial change).
If these variables are three now but they could be trivially changed to be four tomorrow, it's reasonable to consider an array (or, better yet, a new type that contains them).
In terms of performance, local variables are traditionally faster than arrays. Under specific circumstances the array may be allocated on the stack, and under specific circumstances the bounds checks can be removed.
But don't make decisions based on performance unless you have done everything else correctly first, you have thorough tests, and this particular piece of code is a performance-critical hot spot that you're sure is the bottleneck of your application at the moment.
It depends on how you would access them. An array adds overhead, of course, because you first calculate a reference to a value and then fetch it. So if these values are totally unrelated, an array is bad, and it may even count as code obfuscation. But naming variables i, j, k is a sort of obfuscation, too. Obfuscation is better done automatically at the build stage; there are tools like ProGuard™ that can do it.
The two are not the same at all and serve different purposes.
In the first example you gave, int i,j,k;, you are pushing the values onto the stack.
The stack is for short term use and small data sizes i.e. function call arguments and iterator states.
In the second example you gave, int[] arr = new int[3];, the new keyword allocates actual memory from the heap that was given to the process by the operating system.
The stack is optimized for short-term use, and most CPUs have registers dedicated to pointing at the stack top and base, making the stack a great place for small, short-lived variables. The stack is also limited in size; by default it is typically only a few megabytes.
The heap, on the other hand, is for proper memory allocation of large data types and proper memory management.
So, the two may be used for the same thing, but that does not mean it's right.
Arrays/objects/dicts go in memory allocated from the heap; function arguments (and usually iterator indexes) go on the stack.
It depends, but most probably, using distinct variables is the way to go.
In general, don't do micro-optimizations. Nobody will ever notice any difference in performance. Readable and maintainable code is what really matters in high-level languages.
See this article on micro-optimizations.
What do you use when you need an immutable list with the fastest access/update? LinkedList can be slow if you have to access an element in the middle, and it's prohibitive to create and repopulate it. Binary trees? Quadtrees?
If updating is very rare (or the collection is small), an array which you don't write to after initialization is worthwhile. The much lower constant factors (both in time and space) outweigh the linear-time update in these cases.
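A rough sketch of that approach (class name is mine): O(1) reads, and linear-time updates that return a new instance:

    import java.util.Arrays;

    final class ImmutableIntList {
        private final int[] values;

        private ImmutableIntList(int[] owned) {
            this.values = owned;              // takes ownership; callers clone
        }

        static ImmutableIntList of(int... values) {
            return new ImmutableIntList(values.clone());
        }

        int get(int index) {
            return values[index];             // plain array read: very cheap
        }

        ImmutableIntList with(int index, int newValue) {
            int[] copy = Arrays.copyOf(values, values.length);
            copy[index] = newValue;           // O(n) copy, but low constants
            return new ImmutableIntList(copy);
        }
    }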
Apart from that, there are a number of purely functional data structures which provide better bounds for these cases. 2-3 finger trees (the data structure behind Haskell's Data.Sequence) are one example. Another option is Clojure's vectors and related data structures (e.g., Relaxed Radix-Balanced Trees), which use trees with a high fan-out (32 or more) to keep reads cheap and structural sharing to avoid too many copies.
All of these are moderately tricky to implement manually though, especially if performance is important, and I'm not aware of existing implementations (I don't think Clojure's vectors are easy or convenient to use from Java).
I'm not sure I understand what you're looking for but I'll try to give a couple of pointers based on some things I've seen in the standard classes:
CopyOnWriteArrayList is a mutable yet threadsafe list because it duplicates the internal array on updates. Perhaps you could adapt some ideas from that, although it's obviously not efficient for large lists.
ConcurrentHashMap implements similar ideas on a much more complicated structure. It divides the internal hash table into separate partitions, so that changes only need to lock access to the relevant partition.
For an immutable list you could do something similar: divide the list's internal array into several partitions and treat them all as immutable. When you need to change the list, you only need to clone the affected partition and the (small) index of partitions, which is cheaper than duplicating the whole list (a sketch follows below).
AWTEventMulticaster achieves similar goals, but duplicates the absolute minimum. It's a clever binary tree. See the source.
With a smaller internal partition or block size you get faster updates but generally slower access. With a larger block (e.g., the entire array) you get slower updates but faster access.
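Here is a hedged sketch of that partitioned idea (all names and the block size are mine): an update clones one block plus the small block index, not the whole list:

    import java.util.Arrays;

    final class ChunkedImmutableList {
        private static final int BLOCK = 32;   // tuning knob per the trade-off above
        private final int[][] blocks;          // blocks are never mutated in place

        private ChunkedImmutableList(int[][] blocks) {
            this.blocks = blocks;
        }

        static ChunkedImmutableList of(int[] values) {
            int nBlocks = (values.length + BLOCK - 1) / BLOCK;
            int[][] blocks = new int[nBlocks][];
            for (int b = 0; b < nBlocks; b++) {
                int from = b * BLOCK;
                int to = Math.min(from + BLOCK, values.length);
                blocks[b] = Arrays.copyOfRange(values, from, to);
            }
            return new ChunkedImmutableList(blocks);
        }

        int get(int index) {
            return blocks[index / BLOCK][index % BLOCK];
        }

        ChunkedImmutableList with(int index, int newValue) {
            int[][] newBlocks = blocks.clone();              // copies only the block index
            int[] newBlock = blocks[index / BLOCK].clone();  // copies only one block
            newBlock[index % BLOCK] = newValue;
            newBlocks[index / BLOCK] = newBlock;
            return new ChunkedImmutableList(newBlocks);
        }
    }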
If you really need fastest access and update, you have to use a mutable array.
I am working with some relatively large arrays of instances of a single data structure. Each instance consists of about half a dozen fields. The arrays take up a lot of space, and I'm finding that my development environment dies even when running with a VM using 7 gigabytes of heap space. Although I can move to a larger machine, I am also exploring ways to economize on space without taking an enormous hit in performance. On inspecting the data I've noticed a great deal of redundancy: for about 80 percent of the data, four of the six fields have identical values.
This gave me the idea that I can segregate the instances that have redundant information and put them in a specialized form of the data structure (an extension of the original data structure) with static fields for the four fields that contain the identical information. My assumption is that the static fields will only be instantiated in memory once, and so even though this information is shared by, say, 100K objects, these fields take up the same memory as they would if only one data structure was instantiated. I therefore should be able to realize a significant memory savings.
Is this a correct assumption?
Thank you,
Elliott
I don't know your specific data structure or a possible algorithm to build a flyweight, but I would suggest one:
http://en.wikipedia.org/wiki/Flyweight_pattern
The pattern is quite close to the solution you are thinking about, and gives you a good separation of "how to get the data."
How about maintaining the redundant fields in a map and just keeping references to those values in the arrays? That could save space by reducing the size of the individual data structure.
Try using a HashMap for your storage. That's the way to quickly find equal objects.
You need to think about how to define the hashCode function of your objects.
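A minimal sketch of that interning/flyweight approach (all names are mine; the four int fields stand in for the question's redundant ones). The equals and hashCode overrides are what let the HashMap find an existing equal object:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Objects;

    // The shared, interned part: 100K instances can reference one copy
    // instead of each storing the four redundant fields.
    final class SharedPart {
        final int a, b, c, d;

        SharedPart(int a, int b, int c, int d) {
            this.a = a; this.b = b; this.c = c; this.d = d;
        }

        @Override public boolean equals(Object o) {
            if (!(o instanceof SharedPart)) return false;
            SharedPart s = (SharedPart) o;
            return a == s.a && b == s.b && c == s.c && d == s.d;
        }

        @Override public int hashCode() {
            return Objects.hash(a, b, c, d);   // required for HashMap lookup
        }
    }

    final class SharedPartPool {
        private final Map<SharedPart, SharedPart> pool = new HashMap<>();

        // Returns the canonical instance for these field values.
        SharedPart intern(SharedPart candidate) {
            SharedPart existing = pool.putIfAbsent(candidate, candidate);
            return existing != null ? existing : candidate;
        }
    }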
For ultra-fast code it is essential to maintain locality of reference: keep as much of the data that is used together as possible in the CPU cache:
http://en.wikipedia.org/wiki/Locality_of_reference
What techniques are there to achieve this? Could people give examples?
I am interested in Java and C/C++ examples. It would be interesting to know of ways people use to prevent lots of cache swapping.
Greetings
This is probably too generic to have a clear answer. The approaches in C or C++ compared to Java will differ quite a bit (the way the languages lay out objects differs).
The basic rule would be: keep data that will be accessed in tight loops together. If your loop operates on type T, and it has members m1...mN, but only m1...m4 are used in the critical path, consider breaking T into T1, which contains m1...m4, and T2, which contains m5...mN. You might want to add to T1 a pointer that refers to T2. Try to avoid objects that are unaligned with respect to cache boundaries (very platform dependent).
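In Java terms that hot/cold split might look like this (names are mine; note that in Java the cold part is a separate heap object behind a reference in any case, whereas in C/C++ T1 could embed a raw pointer to T2):

    final class Particle {           // "T1": the hot part, scanned every frame
        float x, y, vx, vy;          // m1..m4: used in the critical path
        ColdData cold;               // single reference to the rest ("T2")
    }

    final class ColdData {           // "T2": touched only occasionally
        String label;
        long createdAtNanos;
        // ... m5..mN ...
    }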
Use contiguous containers (a plain old array in C, vector in C++) and try to arrange the iterations to go up or down, not randomly jumping all over the container. Linked lists are killers for locality: two consecutive nodes in a list might be at completely different random locations.
Object containers (and generics) in Java are also a killer: while in a Vector the references are contiguous, the actual objects are not (there is an extra level of indirection). In Java there is a lot of extra overhead: if you new two objects one right after the other, the objects will probably end up in almost contiguous memory locations, but there will still be some object-management data (usually two or three words of header) in between. The GC will move objects around, but hopefully won't make things much worse than they were before it ran.
If you are focusing on Java, create compact data structures: if you have an object that has a position, and it is to be accessed in a tight loop, consider holding x and y primitive fields inside your object rather than creating a Point and holding a reference to it. Reference types need to be newed, and that means a separate allocation, an extra indirection, and less locality.
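A small sketch of the difference (class names are mine):

    // Less cache-friendly: reading the position chases a reference
    // to a separately allocated Point object.
    final class Point {
        int x, y;
    }

    final class EntityIndirect {
        Point position;              // extra allocation + extra indirection
    }

    // More cache-friendly: the coordinates live inside the object itself
    // and are loaded together with its other fields.
    final class EntityCompact {
        int x, y;                    // primitives inlined into the object
    }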
Two common techniques include:
Minimalism (of data size and/or code size/paths)
Cache-oblivious techniques
Example for minimalism: in ray tracing (a 3D graphics rendering paradigm), a common approach is to use 8-byte kd-tree nodes to store static scene data. The traversal algorithm fits in just a few lines of code. The kd-tree is then often compiled in a manner that minimizes the number of traversal steps by having large, empty nodes at the top of the tree (the "Surface Area Heuristic" by Havran).
Mispredictions typically have a probability of 50%, but are of minor cost, because a great many nodes fit in a cache line (consider that you get 128 nodes per KiB!), and one of the two child nodes is always a direct neighbour in memory.
Example for cache-oblivious techniques: Morton array indexing, also known as Z-order-curve indexing. This kind of indexing might be preferred if you usually access nearby array elements in unpredictable directions. It can be valuable for large image or voxel data where you might have pixels as big as 32 or even 64 bytes, and then millions of them (a typical compact camera measures in megapixels, right?) or even thousands of billions for scientific simulations.
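A hedged sketch of 2D Morton indexing (method names are mine): the bits of x and y are interleaved so that points close together in 2D tend to stay close together in the 1D array, whichever direction you step in:

    final class Morton {
        // Spread the lower 16 bits of v so they occupy even bit positions.
        static int part1By1(int v) {
            v &= 0x0000FFFF;
            v = (v | (v << 8)) & 0x00FF00FF;
            v = (v | (v << 4)) & 0x0F0F0F0F;
            v = (v | (v << 2)) & 0x33333333;
            v = (v | (v << 1)) & 0x55555555;
            return v;
        }

        // Interleaved index: neighbours in x or y land at nearby indices.
        static int encode(int x, int y) {
            return part1By1(x) | (part1By1(y) << 1);
        }
    }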
However, both techniques have one thing in common: keep the most frequently accessed stuff nearby; less frequently used things can be further away, spanning the whole range from L1 cache over main memory to hard disk, then other computers in the same room, the next room, the same country, worldwide, other planets.
Some random tricks that come to my mind, some of which I used recently:
Rethink your algorithm. For example, you have an image with a shape and a processing algorithm that looks for corners of the shape. Instead of operating on the image data directly, you can preprocess it, save all the shape's pixel coordinates in a list, and then operate on the list. You avoid randomly jumping around the image.
Shrink data types. A regular int takes 4 bytes; if you can manage with e.g. uint16_t, you will fit 2x more stuff in the cache.
Sometimes you can use bitmaps; I used one for processing a binary image. I stored one pixel per bit, so I could fit 8*32 pixels in a single cache line. It really boosted the performance (a sketch follows this list).
From Java, you can use JNI (it's not difficult) and implement your critical code in C to control the memory layout.
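A rough sketch of the bitmap trick mentioned in the list above (class name is mine): one pixel per bit means 32 pixels per int and 512 pixels per 64-byte cache line:

    final class BitImage {
        final int width, height;
        final int[] bits;                       // packed pixels, 32 per int

        BitImage(int width, int height) {
            this.width = width;
            this.height = height;
            this.bits = new int[(width * height + 31) / 32];
        }

        boolean get(int x, int y) {
            int i = y * width + x;
            return (bits[i >>> 5] & (1 << (i & 31))) != 0;
        }

        void set(int x, int y, boolean on) {
            int i = y * width + x;
            if (on) bits[i >>> 5] |= (1 << (i & 31));
            else    bits[i >>> 5] &= ~(1 << (i & 31));
        }
    }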
In the Java world the JIT is going to be working hard to achieve this, and trying to second-guess it is likely to be counterproductive. This SO question addresses Java-specific issues more fully.