I am working on a fractal rendering software.
The basic setup is that I have a big 2-dimensional array (picture), where values are incremented.
The simple rendering process is
while( iteration < maxIteration ) {
update some pixel in array;
}
This is stupidly simple to parallelize; just have several threads do this simultaneously,
since each thread will (very likely) work with different pixels at the same time,
and even if there is an update collision in the array, this is fine.
The array is shared among the threads!
However, to keep track of the total number of iterations done, I need iteration
to be volatile, which I suspect slows down the code a little.
What baffles me is that I get virtually the same speed for 4 threads and 16 threads,
and I run this on a 64-core machine, which is verified by Runtime.getRuntime().availableProcessors().
One issue is that I have no control over where in the array the threads work; hence, the issue might be a bad case of cache misses? The array is the size of a full-HD image: 1920x1080x4 longs.
Thus, I seek possible issues, and solutions to them, since I think this might be a common type of problem.
Edit: The code I am trying to optimize is available here (sourceforge).
The class ThreadComputator represents one thread, and all these do iterations.
The number of iterations done is stored in the shared variable currentIteration,
which (in the current code) is incremented in a synchronized block.
All threads write to the Histogram object, which essentially is a big array of doubles.
Writing to this does not need to be atomic, as overwrites will be rare, and the error is tolerated.
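For concreteness, the structure is roughly this (a simplified sketch with made-up names, not the actual sourceforge code):

class ThreadComputator implements Runnable {
    static final Object lock = new Object();
    static long currentIteration = 0;                        // shared iteration counter
    static final long maxIteration = 100_000_000L;
    static final double[] histogram = new double[1920 * 1080]; // shared; colliding writes are tolerated

    public void run() {
        while (true) {
            synchronized (lock) {                            // the synchronized counter update
                if (currentIteration >= maxIteration) return;
                currentIteration++;
            }
            int pixel = nextPixel();                         // chaos game: depends on the current pixel
            histogram[pixel] += 1.0;                         // un-synchronized write into the shared array
        }
    }

    private int nextPixel() { return 0; }                    // chaos-game step, omitted here
}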
I think you've answered your own question.
Because I implement the chaos game algorithm. This means that the next pixel
I need to work on depends non-deterministically on current pixel.
And you have a memory system on your computer that is functionally random access, but the fastest performance is only possible if you have localized (within the cache pages) reads and writes.
I'd re-implement your algorithm like so:
Get all of your desired writes for a "time instant", wrap them in a class / data structure such that they can be ordered and grouped by memory page / cache line.
Generate the list of memory pages requiring access.
Randomly assign a to-be-accessed memory page to a thread.
Run all the updates for that page before that thread works on another memory page.
Yes, it won't be 100% random anymore; however you can mitigate that by counting the "write time" and assuming that all writes in the same write time occurred simultaneously. It will still thrash your memory pretty badly, but at least it will thrash it somewhat less.
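A minimal sketch of what I mean (the region size and names are assumptions; tune them for your CPU):

import java.util.*;
import java.util.concurrent.*;

class BatchedWriter {
    static final int REGION = 8 * 1024;                 // roughly one cache-friendly chunk of doubles

    final double[] histogram;
    BatchedWriter(double[] histogram) { this.histogram = histogram; }

    void applyBatch(List<Integer> pendingIndices, ExecutorService pool) throws InterruptedException {
        // group the writes of one "time instant" by region so each task touches one contiguous chunk
        Map<Integer, List<Integer>> byRegion = new HashMap<>();
        for (int idx : pendingIndices) {
            byRegion.computeIfAbsent(idx / REGION, r -> new ArrayList<>()).add(idx);
        }
        List<Callable<Void>> tasks = new ArrayList<>();
        for (List<Integer> regionWrites : byRegion.values()) {
            tasks.add(() -> {
                for (int idx : regionWrites) histogram[idx] += 1.0;   // one thread drains one region
                return null;
            });
        }
        pool.invokeAll(tasks);                           // waits until the whole batch has been applied
    }
}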
Related
In numerous articles, YouTube videos, etc., I have seen Java's volatile keyword explained as a problem of cache memory, where declaring a variable volatile ensures that reads/writes are forced to main memory, and not cache memory.
It has always been my understanding that in modern CPUs, cache memory implements coherency protocols that guarantee that reads/writes are seen equally by all processors, cores, hardware threads, etc, at all levels of the cache architecture. Am I wrong?
jls-8.3.1.4 simply states
A field may be declared volatile, in which case the Java Memory Model ensures that all threads see a consistent value for the variable (§17.4).
It says nothing about caching. As long as cache coherency is in effect, volatile variables simply need to be written to a memory address, as opposed to being stored locally in a CPU register, for example. There may also be other optimizations that need to be avoided to guarantee the contract of volatile variables, and thread visibility.
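To make the question concrete, this is the kind of minimal example I have in mind (a sketch; whether a stale read would come from a cache, a register, or JIT hoisting is exactly what I'm asking about):

class StopFlagDemo {
    static volatile boolean stop = false;   // remove 'volatile' and the loop below may spin forever

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            while (!stop) { /* busy work */ }
            System.out.println("worker observed stop");
        });
        worker.start();
        Thread.sleep(100);
        stop = true;        // with volatile, this write is guaranteed to become visible to the worker
        worker.join();
    }
}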
I am simply amazed at the number of people who imply that CPUs do not implement cache coherency, such that I have been forced to go to StackOverflow because I am doubting my sanity. These people go to a great deal of effort, with diagrams, animated diagrams, etc. to imply that cache memory is not coherent.
jls-8.3.1.4 really is all that needs to be said, but if people are going to explain things in more depth, wouldn't it make more sense to talk about CPU Registers (and other optimizations) than blame CPU Cache Memory?
CPUs are very, very fast. That memory is physically a few centimeters away. Let's say 15 centimeters.
The speed of light is 300,000 kilometers per second, give or take. That's 30,000,000,000 centimeters every second. The speed of light in a medium is not as fast as in vacuum, but it's close, so let's ignore that part. That means sending a single signal from the CPU to the memory, even if the CPU and memory can both process it instantly, is already limiting you to 1,000,000,000 round trips per second, or 1 GHz (you need to cover 30 centimeters to get from the core to the memory and back, so you can do that 1,000,000,000 times every second. If you can do it any faster, you're travelling backwards in time. Or some such. You get a Nobel Prize if you figure out how to manage that one).
Processors are about that fast! We measure core speeds in GHz these days, as in, in the time it takes the signal to travel, the CPU's clock has already ticked. In practice of course that memory controller is not instantaneous either, nor is the CPU pipelining system.
Thus:
It has always been my understanding that in modern CPUs, cache memory implements coherency protocols that guarantee that reads/writes are seen equally by all processors, cores, hardware threads, etc, at all levels of the cache architecture. Am I wrong?
Yes, you are wrong. QED.
I don't know why you think that or where you read that. You misremember, or you misunderstood what was written, or whatever was written was very very wrong.
In fact, an actual update to 'main memory' takes on the order of a thousand cycles! A CPU is just sitting there, twiddling its thumbs, doing nothing, in a time window where it could roll through a thousand, on some cores multiple thousands of, instructions; memory is that slow. Molasses level slow.
The fix is not registers, you are missing about 20 years of CPU improvement. There aren't 2 layers (registers, then main memory), no. There are more like 5: Registers, on-die cache in multiple hierarchical levels, and then, eventually, main memory. To make it all very very fast these things are very very close to the core. So close, in fact, that each core has their own, and, drumroll here - modern CPUs cannot read main memory. At all. They are entirely incapable of it.
Instead what happens is that the CPU sees you write or read to main memory and translates that, as it can't actually do any of that, by figuring out which 'page' of memory that is trying to read/write to (each chunk of e.g. 64k worth of memory is a page; actual page size depends on hardware). The CPU then checks if any of the pages loaded in its on-die cache is that page. If yes, great, and it's all mapped to that. Which does mean that, if 2 cores both have that page loaded, they both have their own copy, and obviously anything that one core does to its copy is entirely invisible to the other core.
If the CPU does -not- find this page in its own on-die cache you get what's called a cache miss, and the CPU will then check which of its loaded pages is least used, and will purge this page. Purging is 'free' if the CPU hasn't modified it, but if that page is 'dirty', it will first send a ping to the memory controller followed by blasting the entire sequence of 64k bytes into it (because sending a burst is way, way faster than waiting for the signal to bounce back and forth or to try to figure out which part of that 64k block is dirty), and the memory controller will take care of it. Then, that same CPU pings the controller to blast the correct page to it and overwrites the space that was just purged out. Now the CPU 'retries' the instruction, and this time it does work, as that page IS now in 'memory', in the sense that the part of the CPU that translates the memory location to cachepage+offset now no longer throws a CacheMiss.
And whilst all of that is going on, THOUSANDS of cycles can pass, because it's all very very slow. Cache misses suck.
This explains many things:
It explains why volatile is slow and synchronized is slower. Dog slow. In general if you want big speed, you want processes that run [A] independent (do not need to share memory between cores, except at the very start and very end perhaps to load in the data needed to operate on, and to send out the result of the complex operation), and [B] fit all memory needs to perform the calculation in 64k or so, depending on CPU cache sizes and how many pages of L1 cache it has.
It explains why one thread can observe a field having value A and another thread observes the same field having a different value for DAYS on end if you're unlucky. If the cores aren't doing all that much and the threads checking the values of those fields do it often enough, that page is never purged, and the 2 cores go on their merry way with their local core value for days. A CPU doesn't sync pages for funsies. It only does this if that page is the 'loser' and gets purged.
It explains why Spectre happened.
It explains why LinkedList is slower than ArrayList even in cases where basic fundamental informatics says it should be faster (big-O notation, analysing computational complexity). Because as long as the arraylist's stuff is limited to a single page you can more or less consider it all virtually instant - it takes about the same order of magnitude to fly through an entire page of on-die cache as it takes for that same CPU to wait around for a single cache miss. And LinkedList is horrible on this front: Every .add on it creates a tracker object (the linkedlist has to store the 'next' and 'prev' pointers somewhere!) so for every item in the linked list you have to read 2 objects (the tracker and the actual object), instead of just the one (as the arraylist's array is in contiguous memory, that page is worst-case scenario read into on-die once and remains active for your entire loop), and it's very easy to end up with the tracker object and the actual object being on different pages.
It explains the Java Memory Model rules: Any line of code may or may not observe the effect of any other line of code on the value of any field. Unless you have established a happens-before/happens-after relationship using any of the many rules set out in the JMM to establish these. That's to give the JVM the freedom to, you know, not run literally 1000x slower than necessary, because guaranteeing consistent reads/writes can only be done by flushing memory on every read, and that is 1000x slower than not doing that.
NB: I have massively oversimplified things. I do not have the skill to fully explain ~20 years of CPU improvements in a mere SO answer. However, it should explain a few things, and it is a marvellous thing to keep in mind as you try to analyse what happens when multiple java threads try to write/read to the same field and you haven't gone out of your way to make very very sure you have an HB/HA relationship between the relevant lines. If you're scared now, good. You shouldn't be attempting to communicate between 2 threads often, or even via fields, unless you really, really know what you are doing. Toss it through a message bus, use designs where the data flow is bounded to the start and end of the entire thread's process (make a job, initialize the job with the right data, toss it in an ExecutorPool queue, set up that you get notified when it's done, read out the result, don't ever share anything whatsoever with the actual thread that runs it), or talk to each other via the database.
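For illustration, that job pattern looks roughly like this (a sketch; nothing here is specific to your code):

import java.util.concurrent.*;

class JobPatternDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);

        // the job gets all of its input at construction time...
        Callable<long[]> job = () -> {
            long[] result = new long[1024];
            for (int i = 0; i < result.length; i++) result[i] = (long) i * i;
            return result;                       // ...and hands its output back exactly once
        };

        Future<long[]> done = pool.submit(job);  // submit/get establish the happens-before edges for you
        long[] result = done.get();              // safe to read: completion happens-before get() returns
        System.out.println(result[10]);
        pool.shutdown();
    }
}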
I have a problem where I need to run a complex function on a large 3d array. For each array row, I will execute anywhere from 100 to 1000 instructions, and depending on the data on that row some instructions will or not be executed.
This array is large but would still fit inside a GPU shared memory (around 2GB in size). I could execute these instructions on separate parts of the array given that they don't need to be processed in order, so I'm thinking executing on the GPU could be a good option. I'm not entirely sure because the instructions executed will change depending on the data itself (lots of if/then/else in there) and I've read branching could be an issue.
These instructions are an abstract syntax tree representing a short program that operates over the array row and returns a value.
Does this look like an appropriate problem to be tackled by the GPU?
What other info would be needed to determine that?
I'm thinking to write this in Java and use JCuda.
It Depends. How big is your array, i.e. how many parallel tasks does your array provide (in your case it sounds like the number of rows is the number of parallel tasks you're going to execute)? If you have few rows (ASTs) but many columns (commands), then maybe it's not worth it. The other way round would work better, because more work can be parallelized.
Branching can indeed be an issue if you're unaware. You can do some optimizations though to mitigate that cost - after you got your initial prototype running and can do some comparison measurements.
The issue with branching is that all streaming multiprocessors in one "Block" need to execute the same instruction. If one core does not need that instruction, it sleeps. So if you have two ASTs, each with 100 distinct commands, the multiprocessors will take 200 commands to complete the calculation, with some of the SMs sleeping while the others execute their commands.
If you have 1000 commands max and some only use a subset, the processor will take as many commands as the AST with the most commands has - in the optimal case. E.g. a set of (100, 240, 320, 1, 990) will run for at least 990 commands, even though one of the ASTs only uses one command. And if that command isn't in the set of 990 commands from the last AST, it even runs for 991 commands.
You can mitigate this (after you have the prototype working and can do actual measurements) by optimizing the array you send to the GPU, so that one set of Streaming Multiprocessors (Block) has a similar set of instructions to do. As different SMs don't interfere with each other on the execution level, they don't need to wait on each other. The size of the blocks is also configurable when you execute the code, so you can adjust it somewhat here.
For even more optimization - only 32 (NVidia "Warp")/64 (AMD "Wavefront") of the threads in a block are executed at the same time, so if you organize your array to exploit this, you can even gain a bit more.
How much of a difference those optimizations make is dependent on how sparse / dense / mixed your command array will be. Also not all optimizations actually optimize your execution time. Testing and comparing is key here. Another source of optimization is your memory layout, but with your described use case it shouldn't be a problem. You can look up Memory Coalescing for more info on that.
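As a sketch of that reordering idea on the host side (the names and the astId mapping are assumptions about your data, not a JCuda recipe):

import java.util.*;

class RowGrouping {
    // astId[row] = index of the short program (AST) that applies to that row -- an assumed representation
    static int[] rowOrderGroupedByAst(int[] astId) {
        Integer[] order = new Integer[astId.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingInt(r -> astId[r]));  // rows with the same AST end up adjacent
        int[] result = new int[order.length];
        for (int i = 0; i < order.length; i++) result[i] = order[i];
        return result;  // upload/launch rows in this order so each block sees (mostly) one instruction set
    }
}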
I am developing an application that allows users to set the maximum data set size they want me to run their algorithm against.
It has become apparent that array sizes around 20,000,000 cause an 'out of memory' error. Because I am invoking this via reflection, there is not really a great deal I can do about this.
I was just wondering, is there any way I can check / calculate what the maximum array size could be based on the users heap space settings and therefore validate user entry before running the application?
If not, are there any better solutions?
Use Case:
The user provides a data size they want to run their algorithm against, we generate a scale of numbers to test it against up to the limit they provided.
We record the time it takes to run and measure the values (in order to work out the Big-O notation).
We need to somehow limit the users' input so as not to hit this error. Ideally we want to measure n^2 algorithms on array sizes as big as we can (which could last, in terms of runtime, for days), therefore we really don't want it running for 2 days and then failing, as that would have been a waste of time.
You can use the result of Runtime.getRuntime().freeMemory() to estimate the amount of available memory. However, it might be that a lot of memory is actually occupied by unreachable objects, which will be reclaimed by GC soon, so you might actually be able to use more memory than this. You can try invoking the GC beforehand, but this is not guaranteed to do anything.
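A rough sketch of that estimate (only a heuristic, for the reasons below):

class HeapEstimate {
    static long roughlyAvailableBytes() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();   // committed heap minus currently free part
        return rt.maxMemory() - used;                     // optimistic upper bound on further allocation
    }
}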
The second difficulty is to estimate the amount of memory needed for a number given by the user. While it is easy to calculate the size of an ArrayList with so many entries, this might not be all. For example, which objects are stored in this list? I would expect that there is at least one object per entry, so you need to add this memory too. Calculating the size of an arbitrary Java object is much more difficult (and in practice only possible if you know the data structures and algorithms behind the objects). And then there might be a lot of temporary objects created during the run of the algorithm (for example boxed primitives, iterators, StringBuilders etc.).
Third, even if the available memory is theoretically sufficient for running a given task, it might be practically insufficient. Java programs can get very slow if the heap is repeatedly filled with objects, then some are freed, some new ones are created and so on, due to a large amount of Garbage Collection.
So in practice, what you want to achieve is very difficult and probably next to impossible. I suggest just try running the algorithm and catch the OutOfMemoryError.
Usually, catching errors is something you should not do, but this seems like an occasion where it's ok (I do this in some similar cases). You should make sure that as soon as the OutOfMemoryError is thrown, some memory becomes reclaimable for GC. This is usually not a problem, as the algorithm aborts, the call stack is unwound and some (hopefully a lot of) objects are not reachable anymore. In your case, you should probably ensure that the large list is part of these objects which immediately become unreachable in the case of an OOM. Then you have a good chance of being able to continue your application after the error.
However, note that this is not a guarantee. For example, if you have multiple threads working and consuming memory in parallel, the other threads might also receive an OutOfMemoryError and not be able to cope with this. Also the algorithm needs to support the fact that it might get interrupted at any arbitrary point. So it should make sure that the necessary cleanup actions are executed nevertheless (and of course you are in trouble if those need a lot of memory!).
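A minimal sketch of the try-and-catch approach (names are illustrative; the reflection call is elided):

class BenchmarkRunner {
    // Returns the runtime in nanoseconds, or null if this size did not fit in the heap.
    Long runOrNull(int dataSize) {
        try {
            long[] data = new long[dataSize];          // the large allocation under test
            long start = System.nanoTime();
            // ... invoke the user's algorithm on 'data' via reflection here ...
            return System.nanoTime() - start;
        } catch (OutOfMemoryError oom) {
            // nothing from this attempt stays reachable, so the heap can recover
            System.err.println("Skipping size " + dataSize + ": not enough heap");
            return null;
        }
    }
}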
I am running a complicated multithread java game, which is working perfectly with the exception of this function. This function, when called, approximately 0.01% of the time, will cause a thread hitch of a quarter of a second. Through a series of debug lines and time measurements, it's absolutely down to this function (and three others almost exactly like it).
The usage of this function is to provide the light level of a nearby block within a voxel engine game. It is only run when a section of the world is updated, which can happen alongside rendering.
Please note:
This function works accurately 100% of the time for its intended function.
This function causes a thread hitch of approximately a quarter second 0.01% of the time.
The variables are not synchronized.
The function is never called more than once at a time in the program.
All variables are valid fields of a larger, non-synchronized class.
All variables are integers.
The array light[][][] is a byte[][][].
This method is never called more than once at a time, as it is synchronized by a larger method on a wide interval.
I'm pretty sure external synchronization is not the issue.
What part(s) of this function may be causing an issue with thread synchronization, CPU overuse, or stack filling, and how can I go about improving performance to get rid of these render hitches?
public byte nsleftlighting(int[] coords) {
    if (coords[0] < 0) return 16;
    difx = chunkedx - chunks[coords[0]].X;
    difz = chunkedz - chunks[coords[0]].Z;
    if (coords[1] == 0) {
        if (-difx <= -(chunklimit)) {
            return 16;
        } else if (-difx == 0) {
            if (-difz >= 0) {
                proz = 0;
                specialz = -difz;
            } else {
                specialz = difz - 1;
                proz = 1;
            }
            if (chunks[chunkxyid[1][proz][0][specialz]].loaded) {
                return chunks[chunkxyid[1][proz][0][specialz]].light[15][coords[2]][coords[3]];
            } else {
                return 16;
            }
        } else {
            if (-difz >= 0) {
                proz = 0;
                specialz = -difz;
            } else {
                specialz = difz - 1;
                proz = 1;
            }
            if (-difx > 0) {
                prox = 0;
                specialx = -difx - 1;
            } else {
                specialx = difx;
                prox = 1;
            }
            if (chunks[chunkxyid[prox][proz][specialx][specialz]].loaded) {
                return chunks[chunkxyid[prox][proz][specialx][specialz]].light[15][coords[2]][coords[3]];
            } else {
                return 16;
            }
        }
    }
    if (coords[1] > 0) {
        return chunks[coords[0]].light[coords[1] - 1][coords[2]][coords[3]];
    }
    return 16;
}
Multidimensional arrays in Java are not guaranteed to be laid out contiguously in memory (I'm not sure if 1-dimensional arrays are specifically guaranteed to be contiguous either, but in practice they are). Therefore, depending on how you access the elements, the CPU cache might have to be updated quite often, as opposed to accessing successive or nearby elements in a 1-dimensional array, which is quite fast because the whole array, or at least a contiguous block of it, can be loaded into cache at once. Additionally, newer JVM implementations can optimize index bound checks away in some simple (but not complex) cases such as loops, which makes array access almost as fast as it can be in any language (C). What exactly happens depends on the JVM implementation and the memory manager. See this for a reference.
So, using multidimensional arrays as opposed to manually mapped 1-dimensional arrays is generally a performance penalty, but would hardly account for quarter-second delays in this case. If the arrays are really big, could it be swapping to disk?
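For illustration, a manually mapped 1-dimensional array would look something like this (a sketch; the dimensions and names are made up):

class FlatLight {
    final int sy, sz;
    final byte[] light;                            // one contiguous block instead of nested arrays

    FlatLight(int sx, int sy, int sz) {
        this.sy = sy;
        this.sz = sz;
        this.light = new byte[sx * sy * sz];
    }

    byte get(int x, int y, int z)         { return light[(x * sy + y) * sz + z]; }
    void set(int x, int y, int z, byte v) { light[(x * sy + y) * sz + z] = v; }
}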
I don't see anything in here that would cause a performance problem -- at least not with that high a variance. Array accesses should be extremely fast -- even if they are 4 dimensional arrays. [[Nice effort on that.]]
A quarter second is not a huge amount of time which makes me wonder if the profiler is lying to you about the source of the problem. It may be responding poorly to the multi-dimensional arrays or some other attribute of this method that is not immediately apparent -- to me at least.
One possibility, however remote, is that your program is swapped and these arrays are pretty big. If they aren't accessed very often is there any chance you are seeing some IO as some memory pages are swapped in?
You commented that you are using wall-clock timers to determine that the routine takes 250ms. Are you sure that the CPU was actually executing that method for that time period? Could this be a thread contention issue that is taking over CPU in some other part of the program? Can you see if you see CPU spikes every so often when this method takes a long time?
Any chance you are seeing a GC heap lock and it's affecting the array accesses more than other routines? Can you watch the memory graphs to see if you see a correlation? Does giving the program more heap affect the timing or the frequency of the problem? This is going to be more an issue if you are running Java <= 1.5.
I wrote two matrix classes in Java just to compare the performance of their matrix multiplications. One class (Mat1) stores a double[][] A member where row i of the matrix is A[i]. The other class (Mat2) stores A and T where T is the transpose of A.
Let's say we have a square matrix M and we want the product of M.mult(M). Call the product P.
When M is a Mat1 instance the algorithm used was the straightforward one:
P[i][j] += M.A[i][k] * M.A[k][j]
for k in range(0, M.A.length)
In the case where M is a Mat2 I used:
P[i][j] += M.A[i][k] * M.T[j][k]
which is the same algorithm because T[j][k]==A[k][j]. On 1000x1000 matrices the second algorithm takes about 1.2 seconds on my machine, while the first one takes at least 25 seconds. I was expecting the second one to be faster, but not by this much. The question is, why is it this much faster?
My only guess is that the second one makes better use of the CPU caches, since data is pulled into the caches in chunks larger than 1 word, and the second algorithm benefits from this by traversing only rows, while the first ignores the data pulled into the caches by going immediately to the row below (which is ~1000 words in memory, because arrays are stored in row major order), none of the data for which is cached.
I asked someone and he thought it was because of friendlier memory access patterns (i.e. that the second version would result in fewer TLB soft faults). I didn't think of this at all but I can sort of see how it results in fewer TLB faults.
So, which is it? Or is there some other reason for the performance difference?
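For reference, the two loops written out in full (a sketch following the pseudo-code above; class and method names are mine):

class MatMulSketch {
    // Mat1-style: the A[k][j] access jumps to a different row (~1000 doubles away) on every k
    static double[][] multNaive(double[][] A) {
        int n = A.length;
        double[][] P = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    P[i][j] += A[i][k] * A[k][j];
        return P;
    }

    // Mat2-style: T[j] is column j of A stored as a row, so both operands walk memory sequentially
    static double[][] multTransposed(double[][] A, double[][] T) {
        int n = A.length;
        double[][] P = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    P[i][j] += A[i][k] * T[j][k];
        return P;
    }
}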
This is because of the locality of your data.
In RAM a matrix, although two-dimensional from your point of view, is of course stored as a contiguous array of bytes. The only difference from a 1D array is that the offset is calculated by combining the two indices that you use.
This means that if you access element at position x,y it will calculate x*row_length + y and this will be the offset used to reference to the element at position specified.
What happens is that a big matrix isn't stored in just one page of memory (this is how your OS manages the RAM, by splitting it into chunks), so it has to load the correct page into the CPU cache if you try to access an element that is not already present.
As long as you go through your multiplication contiguously you don't create any problems, since you mostly use all the coefficients of a page and then switch to the next one. But if you invert the indices, every single element may be contained in a different memory page, so it needs to ask RAM for a different page almost for every single multiplication you do. This is why the difference is so pronounced.
(I have rather simplified the whole explanation; it's just to give you the basic idea around this problem.)
In any case I don't think this is caused by the JVM itself. It may be related to how your OS manages the memory of the Java process.
The cache and TLB hypotheses are both reasonable, but I'd like to see the complete code of your benchmark ... not just pseudo-code snippets.
Another possibility is that the performance difference is a result of your application using 50% more memory for the data arrays in the version with the transpose. If your JVM's heap size is small, it is possible that this is causing the GC to run too often. This could well be a result of using the default heap size. (Three lots of 1000 x 1000 x 8 bytes is ~24MB.)
Try setting the initial and max heap sizes to (say) double the current max size. If that makes no difference, then this is not a simple heap size issue.
It's easy to guess that the problem might be locality, and maybe it is, but that's still a guess.
It's not necessary to guess. Two techniques might give you the answer - single stepping and random pausing.
If you single-step the slow code you might find out that it's doing a lot of stuff you never dreamed of. Such as, you ask? Try it and find out. What you should see it doing, at the machine-language level, is efficiently stepping through the inner loop with no waste motion.
If it actually is stepping through the inner loop with no waste motion, then random pausing will give you information. Since the slow one is taking 20 times longer than the fast one, that implies 95% of the time it is doing something it doesn't have to. So see what it is. Each time you pause it, the chance is 95% that you will see what that is, and why.
If in the slow case, the instructions it is executing appear just as efficient as the fast case, then cache locality is a reasonable guess of why it is slow. I'm sure, once you've eliminated any other silliness that may be going on, that cache locality will dominate.
You might try comparing performance between JDK6 and OpenJDK7, given this set of results...