HashSet performance Issues in fractal renderer - java

Use Case:
I'm trying to improve my application to render the Mandelbrot Set. I'm using HashSets to detect periodicity in the orbit of a point in the set.
For instance, the orbit of -1 is 0, -1, 0, -1... If I put each number I reach into a HashSet, I can detect infinite loops by just comparing the size of the HashSet to the iteration count. Doing this makes the render orders of magnitude faster.
Current Implementation:
As my program stands, the method that performs iterations receives a HashSet (of type Integer) constructed with the default constructor. This is what Java Mission Control shows me (typical output, regardless of the complexity or depth of the render):
The runtime of the iteration method is small, almost always less than 0.1 ms (at high zooms, sometimes 10 ms). As a result, a LOT of these HashSets are created, filled with ~10-100k entries, and then immediately dumped. This creates a lot of overhead, since each HashSet has to be resized quite frequently.
Things I've tried that don't work:
Making one HashSet and clearing it: The O(n) iteration through the backing map absolutely kills performance.
Making the HashSet large enough to contain the iterations, using the initialCapacity argument: I tried every power of 2 from 1024 to 524288, and all of them make the program slower. My conjecture as to why is the following: since we have so many HashSets, Java runs out of large blocks for the new sets more quickly, so we trigger very frequent GC, or hit some similar issue.
Ideally, I would want the best of both worlds: Make one object that's large enough, and then clear it. However, I can't seem to locate such a data structure. What's the best approach to storing this data?

In my experience, periodic sequences are rather short, so you should rather use an array and do a sequential search backwards. You can do some experiments: for instance, if you only compare against the element from 60 iterations ago, you already capture every period that divides 60 (1, 2, 3, 4, 5, 6, 10, 12, 15, 20, 30 and 60). Calculating one element of the Mandelbrot orbit is usually much faster than finding (and also inserting) an element in a HashSet.
In this case it is also easier to tune the method further, e.g. ignore the first n elements or use some epsilon to avoid rounding errors. As I understand it, your method checks double values for strict equality.
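The fixed-lag comparison described above could be sketched roughly as follows. This is only an illustration, not the asker's code: the class name PeriodCheck, the lag of 60, the tolerance EPS and the two return codes are all assumptions chosen for the example. A small ring buffer keeps the last 60 orbit values so that each new value can be compared with the one from exactly 60 iterations earlier.

```java
/** Sketch: Mandelbrot iteration that compares each orbit value with the one from LAG steps earlier. */
final class PeriodCheck {
    static final int LAG = 60;          // assumed lag: catches every period that divides 60
    static final double EPS = 1e-12;    // assumed tolerance for rounding errors
    static final int PERIODIC = -1;     // hypothetical return code: orbit repeats
    static final int NOT_ESCAPED = -2;  // hypothetical return code: maxIter reached

    /** Returns the index of the first orbit value with |z| > 2, or one of the codes above. */
    static int iterate(double cr, double ci, int maxIter) {
        double[] histR = new double[LAG];   // ring buffer of real parts
        double[] histI = new double[LAG];   // ring buffer of imaginary parts
        double zr = 0.0, zi = 0.0;          // z_0 = 0
        for (int n = 0; n < maxIter; n++) {
            int slot = n % LAG;
            // Once n >= LAG, this slot still holds z_(n-LAG); compare before overwriting it.
            if (n >= LAG
                    && Math.abs(zr - histR[slot]) < EPS
                    && Math.abs(zi - histI[slot]) < EPS) {
                return PERIODIC;
            }
            histR[slot] = zr;
            histI[slot] = zi;
            double nextR = zr * zr - zi * zi + cr;  // z = z^2 + c
            zi = 2.0 * zr * zi + ci;
            zr = nextR;
            if (zr * zr + zi * zi > 4.0) {
                return n + 1;                       // z_(n+1) has escaped
            }
        }
        return NOT_ESCAPED;
    }
}
```

The two small arrays could also be allocated once per thread and reused between calls, since stale entries are never read before they are overwritten.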
Good luck!

Related

Putting a number to the efficiency of an algorithm

I have been developing with Java for some time now, and always strive to do things in the most efficient way. Until now I have mostly been trying to reduce the number of lines of code I have. But since starting to work with 2D rendering, it is more about how long it takes to compute a certain piece of code, as it is called many times a second.
My question:
Is there some way to measure how long it takes to compute a certain piece of code in Eclipse, Java, ... ?
First, some nitpicking. You title this question ...
Putting a number to the efficiency of an algorithm
There is no practical quantifiable measure of "efficiency" for an algorithm. Efficiency (as normally conceived) is a measure of "something" relative to an ideal / perfect; e.g. a hypothetical 100% efficient steam engine would convert all of the energy in the coal being burned into useful "work". But for software, there is no ideal to measure against. (Or if there is, we can't be sure that it is the ideal.) Hence "efficiency" is the wrong term.
What you actually mean is a measure of the "performance" of ...
Algorithms are an abstract concept, and their performance cannot be measured.
What you actually want is a measure of the performance of a specific implementation of an algorithm; i.e. some actual code.
So how do you quantify performance?
Well, ultimately there is only one sound way to quantify performance. You measure it, empirically. (How you do that ... and the limitations ... are a matter I will come to.)
But what about theoretical approaches?
A common theoretical approach is to analyse the algorithm to give you a measure of computational complexity. The classic measure is Big-O Complexity. This is a very useful measure, but unfortunately Big-O Complexity does not actually measure performance at all. Rather, it is a way of characterizing the behaviour of an algorithm as the problem size scales up.
To illustrate, consider these algorithms for adding N numbers together:
// Version 1
int sum(int[] input) {
    int sum = 0;
    for (int i = 0; i < input.length; i++) {
        sum += input[i];
    }
    return sum;
}
// Version 2
int sum(int[] input) {
    int tmp = p(1000); // calculates the 1000th prime number
    int sum = 0;
    for (int i = 0; i < input.length; i++) {
        sum += input[i];
    }
    return sum;
}
We can prove that both versions of sum have a complexity of O(N), according to the accepted mathematical definitions. Yet it is obvious that the first one will be faster than the second one ... because the second one does a large (and pointless) calculation as well.
In short: Big-O Complexity is NOT a measure of Performance.
What about theoretical measures of Performance?
Well, as far as I'm aware, there are none that really work. The problem is that real performance (as in time taken to complete) depends on various complicated things in the compilation of code to executables AND the way that real execution platforms (hardware) behaves. It is too complicated to do a theoretical analysis that will reliably predict actual performance.
So how do you measure performance?
The naive answer is to benchmark like this:
Take a clock measurement
Run the code
Take a second clock measurement
Subtract the first measurement from the second ... and that is your answer.
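In Java, the four naive steps above might look like the following sketch. The class NaiveTiming and its workload work(n) are made up for the illustration; System.nanoTime() is the standard clock call for elapsed-time measurements.

```java
/** Sketch of the naive approach: one clock reading before the code, one after. */
final class NaiveTiming {
    /** Hypothetical workload: sums the integers 0 .. n-1. */
    static long work(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += i;
        }
        return sum;
    }

    /** Times a single call of work(n) and returns the elapsed nanoseconds. */
    static long timeOnce(int n) {
        long start = System.nanoTime();            // first clock measurement
        long result = work(n);                     // run the code
        long elapsed = System.nanoTime() - start;  // second measurement minus the first
        if (result < 0) {
            // Uses the result so the call is not trivially removable as dead code.
            throw new IllegalStateException("unexpected negative sum");
        }
        return elapsed;
    }

    public static void main(String[] args) {
        System.out.println("elapsed ns: " + timeOnce(1_000_000));
    }
}
```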
But it doesn't work. Or more precisely, the answer you get may be wildly different from the performance that the code exhibits when you use it in a real world context.
Why?
There may be other things happening on the machine ... or that have happened ... that influence the code's execution time. Another program might be running. You may have files pre-loaded into the file system cache. You may get hit by CPU clock scaling ... or a burst of network traffic.
Compilers and compiler flags can often make a lot of difference to how fast a piece of code runs.
The choice of inputs can often make a big difference.
If the compiler is smart, it might deduce that some or all of your benchmarked code does nothing "useful" (in the context) ... and optimize it away entirely.
And for languages like Java and C#, there are other important issues:
Implementations of these languages typically do a lot of work during startup to load and link the code.
Implementations of these languages are typically JIT compiled. This means that the language runtime system does the final translation of the code (e.g. bytecodes) to native code at runtime. The performance characteristics of your code after JIT compilation change drastically, and the time taken to do the compilation may be significant ... and may distort your time measurements.
Implementations of these languages typically rely on a garbage collected heap for memory management. The performance of a heap is often uneven, especially at startup.
These things (and possibly others) contribute to something that we call (in Java) JVM warmup overheads; particularly JIT compilation. If you don't take account of these overheads in your methodology, then your results are liable to be distorted.
So what is the RIGHT way to measure performance of Java code?
It is complicated, but the general principle is to run the benchmark code lots of times in the same JVM instance, measuring each iteration. The first few measurements (during JVM warmup) should be discarded, and the remaining measurements should be averaged.
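As a rough, hand-rolled sketch of that principle (not a substitute for a real benchmarking framework), the loop below runs a task repeatedly in one JVM, discards the first runs, and averages the rest. The class name WarmupTiming and the parameters warmup and measured are assumptions made for the example.

```java
import java.util.function.LongSupplier;

/** Sketch: repeat a task in one JVM, discard warm-up runs, average the remainder. */
final class WarmupTiming {
    static volatile long sink;  // written once at the end so the task results stay "used"

    /** Returns the mean time in nanoseconds of the last `measured` runs. */
    static double averageNanos(LongSupplier task, int warmup, int measured) {
        long total = 0;
        long results = 0;
        for (int i = 0; i < warmup + measured; i++) {
            long start = System.nanoTime();
            results += task.getAsLong();
            long elapsed = System.nanoTime() - start;
            if (i >= warmup) {          // the first `warmup` timings are thrown away
                total += elapsed;
            }
        }
        sink = results;
        return (double) total / measured;
    }

    public static void main(String[] args) {
        double mean = averageNanos(() -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += i;
            }
            return sum;
        }, 20, 100);
        System.out.println("mean ns per run: " + mean);
    }
}
```

Even with warm-up handled this way, the other distortions listed above (dead-code elimination, GC pauses, machine noise) still apply.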
These days, the recommended way to do Java benchmarking is to use a reputable benchmarking framework. The two prime candidates are Caliper and JMH (the Java Microbenchmark Harness, part of the OpenJDK project).
And what are the limitations of performance measurements that you mentioned?
Well I have alluded to them above.
Performance measurements can be distorted by various environmental factors on the execution platform.
Performance can be dependent on inputs ... and this may not be revealed by simple measurement.
Performance (e.g. of C / C++ code) can be dependent on the compiler and compiler switches.
Performance can be dependent on hardware; e.g. processor speed, number of cores, memory architecture, and so on.
These factors can make it difficult to make general statements about the performance of a specific piece of code, and to make general comparisons between alternative versions of the same code. As a rule, we can only make limited statements like "on system X, with compiler Y and input set Z the performance measures are P, Q, R".
The number of lines has very little correlation with the execution speed of a program.
Your program will look completely different after it's processed by the compiler. In general, large compilers perform many optimizations, such as loop unrolling, getting rid of variables that are not used, getting rid of dead code, and hundreds more.
So instead of trying to "squeeze" the last bit of performance/memory out of your program by using short instead of int, char[] instead of String, or whichever trick you think will "optimize" your program (premature optimization), just write it using objects and types that make sense to you, so it will be easier to maintain. Your compiler, interpreter or VM should take care of the rest. Only if it doesn't do you start looking for bottlenecks and playing with hacks.
So what makes programs fast then? Algorithmic efficiency (at least it tends to make the biggest difference if the algorithm/data structure was not designed right). This is what computer scientists study.
Let's say you're given 2 data structures. An array, and a singly linked list.
An array stores things in a block, one after the other.
+-+-+-+-+-+-+-+
|1|3|2|7|4|6|1|
+-+-+-+-+-+-+-+
To retrieve the element at index 3, you simply just go to the 4th square and retrieve it. You know where it is because you know it's 3 after the first square.
A singly linked list will store things in a node, which may not be stored contiguously in memory, but each node will have a tag (pointer, reference) on it telling you where the next item in the list is.
+-+    +-+    +-+    +-+    +-+    +-+    +-+
|1| -> |3| -> |2| -> |7| -> |4| -> |6| -> |1|
+-+    +-+    +-+    +-+    +-+    +-+    +-+
To retrieve the element at index 3, you have to start at the first node (index 0), follow its link to the node at index 1, then to the node at index 2, and only then do you arrive at index 3. You don't know where the nodes are, so you have to follow the path to them.
Now say you have an Array and an SLL, both containing the same data, with the length n, which one would be faster? Depends on how you use it.
Let's say you do a lot of insertions at the end of the list. The algorithms (pseudocode) would be:
Array:
    array[list.length] = element to add
    increment length field
SLL:
    currentNode = first element of SLL
    while currentNode has next node:
        currentNode = currentNode's next element
    currentNode's next element = new Node(element to add)
    increment length field
As you can see, in the array algorithm, it doesn't matter what the size of the array is. It always takes a constant number of operations. Say that indexing array[list.length] takes 1 operation, assigning to it takes another, and incrementing the length field and writing it back to memory takes 2 more. That is 4 operations every time. But if you look at the SLL algorithm, it takes at least list.length operations just to find the last element in the list. In other words, the time it takes to add an element to the end of an SLL increases linearly with the size of the SLL, t(n) = n, whereas for the array it's more like t(n) = 4.
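To make the difference concrete, here is a small Java sketch of the two append operations. The names AppendCost, Node, appendToList and appendToArray are invented for the illustration, and the array version assumes the array already has spare capacity. The list version returns how many nodes it had to visit, which grows with the list length.

```java
/** Sketch: appending to a pre-sized array versus a singly linked list without a tail pointer. */
final class AppendCost {
    static final class Node {
        final int value;
        Node next;

        Node(int value) {
            this.value = value;
        }
    }

    /** Walks from head to the last node, appends, and returns the number of nodes visited. */
    static int appendToList(Node head, int value) {
        int visited = 1;
        Node current = head;
        while (current.next != null) {   // one step per existing node
            current = current.next;
            visited++;
        }
        current.next = new Node(value);
        return visited;
    }

    /** Writes at index `length` (assumes spare capacity) and returns the new length. */
    static int appendToArray(int[] array, int length, int value) {
        array[length] = value;           // same cost whatever the length is
        return length + 1;
    }
}
```

Keeping a reference to the last node would make the list append constant-time as well; the sketch deliberately follows the pseudocode above, which has no such reference.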
I suggest reading the free book written by my data structures professor. Even has working code in C++ and Java
Generally speaking, lines of code are not an effective measure of speed, since speed depends heavily on your hardware and your compiler. There is something called Big O notation, which gives a picture of how the running time of an algorithm grows as the size of the input increases.
For example, if your algorithm is O(n), then the time it takes to run scales linearly with the input size. If your algorithm is O(1), then the time it takes to run is constant.
I found this way of thinking about performance useful because you learn that it's not really the number of lines that affects speed; it's your code's design. Code that handles the problem in a more efficient way can be faster than less efficient code with a tenth as many lines.

Java multithreaded rendering, how to optimize

I am working on a fractal rendering software.
The basic setup is that I have a big 2-dimensional array (picture), where values are incremented.
The simple rendering process is
while( iteration < maxIteration ) {
update some pixel in array;
}
This is stupidly simple to parallelize; just have several threads do this simultaneously,
since each thread will (very likely) work with different pixels at the same time,
and even if there is an update collision in the array, this is fine.
The array is shared among the threads!
However, to keep track of the total number of iterations done, I need iteration
to be volatile, which I suspect slows down the code a little.
What baffles me is that I get virtually the same speed for 4 threads and 16 threads,
and I run this on a 64-core machine, which is verified by Runtime.getRuntime().availableProcessors().
One issue is that I have no control over where in the array the threads work, hence the issue might be a bad case of cache misses. The array is the size of a full HD image: 1920x1080x4 longs.
Thus, I seek possible issues, and solutions to them, since I think this might be a common type of problem.
Edit: The code I am trying to optimize is available here (sourceforge).
The class ThreadComputator represents one thread, and all these do iterations.
The number of iterations done is stored in the shared variable currentIteration,
which (in the current code) is incremented in a synchronized block.
All threads write to the Histogram object, which essentially is a big array of doubles.
Writing to this does not need to be atomic, as overwrites will be rare, and the error is tolerated.
I think you've answered your own question.
Because I implement the chaos game algorithm. This means that the next pixel
I need to work on depends non-deterministically on current pixel.
And you have a memory system on your computer that is functionally random access, but the fastest performance is only possible if your reads and writes are localized (within the cache pages).
I'd re-implement your algorithm like so:
Get all of your desired writes for a "time instant", wrap them in a class / data structure such that they can be ordered and grouped by memory page / cache line.
Generate the list of memory pages requiring access.
Randomly assign a to-be-accessed memory page to a thread.
Run all the updates for that page before that thread works on another memory page.
Yes, it won't be 100% random anymore; however, you can mitigate that by counting the "write time" and assuming that all writes in the same write time occurred simultaneously. It will still thrash your memory pretty badly, but at least it will thrash it somewhat less.
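A minimal sketch of the grouping steps above might look like this. It is not taken from the project in question: the class GroupedWrites, the method applyGrouped and the block size of 512 longs (a 4 KiB page of 8-byte values) are all assumptions. The pending target indices for one "time instant" are bucketed by block, and each block is then handed to a single worker, so no two threads write to the same block at once.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Sketch: bucket pending increments by memory block, then process one block per task. */
final class GroupedWrites {
    static final int LONGS_PER_BLOCK = 512;  // assumption: 4 KiB page / 8-byte long

    /** Increments histogram[t] once for every t in targets, one block at a time. */
    static void applyGrouped(long[] histogram, int[] targets) {
        // Group all pending writes by the block their index falls into.
        Map<Integer, List<Integer>> byBlock = Arrays.stream(targets)
                .boxed()
                .collect(Collectors.groupingBy(index -> index / LONGS_PER_BLOCK));

        // Each block is processed by exactly one task, so the writes inside a
        // block never race with each other.
        byBlock.values().parallelStream().forEach(indices -> {
            for (int index : indices) {
                histogram[index]++;
            }
        });
    }
}
```

The boxing and the intermediate map cost time of their own, so whether this wins over scattered writes would have to be measured; a production version would more likely sort a primitive index array instead.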

test the speed of quicksort

I read the book Algorithms, 4th Edition (Princeton) and watched the online course videos. I have found two interesting things.
It was said in the video, if we use a cutoff like this in quicksort, we will speed up the program by 10~20%:
if(hi - lo < CUTOFF) Insertion.sort(a);
It suggested that when we recursively divide the array a into subarrays and sort each subarray, we can switch to insertion sort once the size of a subarray is smaller than CUTOFF. However, when I tested it with CUTOFF sizes of 3, 7 and 10, this was not the case. It was about 10 times slower on my test data set, an array of 5000 random numbers. So I guess we'd better not use insertion sort for small arrays.
While trying to measure the running time of my code and compare it to the standard code from the course (the algs4.jar library), I found that my time was longer, even after I changed my code to match the standard code. Finally, I realized that even if we quicksort the same array twice (copied as a1 and a2), the running time of the second sort is always around half the running time of the first sort, i.e. (pseudocode):
stopWatch sw1 = new stopWatch();
quicksort(a1);
print sw1.elapsedTime();
stopWatch sw2 = new stopWatch();
quicksort(a2);
print sw2.elapsedTime();
The second one takes about half the time, even though both runs use the same algorithm and sort the same array. I don't know why this happens. It's a very interesting phenomenon.
In theory it could be faster, but the result may differ depending on the language, compiler, system and CPU you are using. I'll use your second point as an example. The CPU has a cache that holds frequently used data to increase speed. It is very small but very fast, much faster than RAM. The first time you ran the code, the array was in main memory and was brought into the cache through cache misses. When you run the same code a second time, the data is already in the cache, so there is no need to look it up in RAM and there are no cache misses, which makes it much faster than the first run. If you want accurate results, you may have to clear the cache, shut down any other programs you are running, etc.
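On the first point, one thing that may be worth checking (this is a guess, since the full code is not shown) is whether the cutoff branch sorts only the small subarray. A call like Insertion.sort(a) sorts the whole array each time a small subarray is reached, which could by itself explain a large slowdown. The sketch below is a self-contained illustration with made-up names (CutoffQuick, insertionSort, partition), not the algs4 implementation; it applies insertion sort only to the range lo..hi.

```java
/** Sketch: quicksort that hands small subarrays (and only those) to insertion sort. */
final class CutoffQuick {
    static final int CUTOFF = 10;  // assumed threshold; worth tuning by measurement

    static void sort(int[] a) {
        sort(a, 0, a.length - 1);
    }

    private static void sort(int[] a, int lo, int hi) {
        if (hi - lo < CUTOFF) {
            insertionSort(a, lo, hi);  // sorts a[lo..hi] only, then stops recursing
            return;
        }
        int p = partition(a, lo, hi);
        sort(a, lo, p - 1);
        sort(a, p + 1, hi);
    }

    /** Insertion sort restricted to the inclusive range lo..hi. */
    private static void insertionSort(int[] a, int lo, int hi) {
        for (int i = lo + 1; i <= hi; i++) {
            int value = a[i];
            int j = i - 1;
            while (j >= lo && a[j] > value) {
                a[j + 1] = a[j];
                j--;
            }
            a[j + 1] = value;
        }
    }

    /** Lomuto partition using the middle element as pivot; returns the pivot's final index. */
    private static int partition(int[] a, int lo, int hi) {
        swap(a, lo + (hi - lo) / 2, hi);
        int pivot = a[hi];
        int boundary = lo;
        for (int j = lo; j < hi; j++) {
            if (a[j] < pivot) {
                swap(a, boundary++, j);
            }
        }
        swap(a, boundary, hi);
        return boundary;
    }

    private static void swap(int[] a, int i, int j) {
        int tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }
}
```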

Java: ConcurrencyLevel value for ConcurrentHashMap

Is there some optimal value for ConcurrencyLevel beyond which ConcurrentHashMap's performance starts degrading?
If yes, what's that value, and what's the reason for the performance degradation? (This question originates from trying to find out any practical limitations that a ConcurrentHashMap may have.)
The Javadoc offers pretty detailed guidance:
The allowed concurrency among update operations is guided by the optional concurrencyLevel constructor argument (default 16), which is used as a hint for internal sizing.
The table is internally partitioned to try to permit the indicated number of concurrent updates without contention. Because placement in hash tables is essentially random, the actual concurrency will vary. Ideally, you should choose a value to accommodate as many threads as will ever concurrently modify the table. Using a significantly higher value than you need can waste space and time, and a significantly lower value can lead to thread contention. But overestimates and underestimates within an order of magnitude do not usually have much noticeable impact. A value of one is appropriate when it is known that only one thread will modify and all others will only read.
To summarize: the optimal value depends on the number of expected concurrent updates. A value within an order of magnitude of that should work well. Values outside that range can be expected to lead to performance degradation.
You have to ask yourself two questions
how many cpus do I have?
what percentage of the time will a useful program be accessing the same map?
The first question tells you the maximum number of threads which can access the map at once. You can have 10000 threads, but if you have only 4 cpus, at most 4 will be running at once.
The second question tells you the most any of those threads will be accessing the map AND doing something useful. You can optimise the map to do something useless (e.g. a micro-benchmark) but there is no point tuning for this IMHO. Say you have a useful program which uses the map a lot. It might be spending 90% of the time doing something else e.g. IO, accessing other maps, building keys or values, doing something with the values it gets from the map.
Say you spend 10% of the time accessing a map on a machine with 4 CPUs. This means that on average 0.4 threads will be accessing the map at any moment (or one thread about 40% of the time). In this case a concurrency level of 1-4 is fine.
In any case, making the concurrency level higher than the number of cpus you have is likely to be unnecessary, even for a micro-benchmark.
As of Java 8, ConcurrentHashMap's constructor parameter for concurrencyLevel is effectively unused, and remains primarily for backwards-compatibility. The implementation was re-written to use the first node within each hash bin as the lock for that bin, rather than a fixed number of segments/stripes as was the case in earlier versions.
In short, starting in Java 8, don't worry about setting the concurrencyLevel parameter, as long as you set a positive (non-zero, non-negative) value, per the API contract.
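For reference, the parameter is passed through the three-argument constructor. The sketch below uses a made-up helper class, ConcurrencyLevelDemo, and arbitrary values of 16 and 0.75f for the other two arguments; it shows a map being created with a given concurrencyLevel and a non-positive value being rejected with IllegalArgumentException.

```java
import java.util.concurrent.ConcurrentHashMap;

/** Sketch: passing concurrencyLevel to ConcurrentHashMap's three-argument constructor. */
final class ConcurrencyLevelDemo {
    /** Creates a map with initial capacity 16, load factor 0.75 and the given level. */
    static ConcurrentHashMap<String, Integer> newMap(int concurrencyLevel) {
        return new ConcurrentHashMap<>(16, 0.75f, concurrencyLevel);
    }

    /** Returns true if the constructor rejects the given concurrencyLevel. */
    static boolean rejects(int concurrencyLevel) {
        try {
            newMap(concurrencyLevel);
            return false;
        } catch (IllegalArgumentException e) {
            return true;   // thrown for zero or negative values
        }
    }

    public static void main(String[] args) {
        ConcurrentHashMap<String, Integer> map = newMap(4);
        map.merge("hits", 1, Integer::sum);
        System.out.println(map + ", rejects(0) = " + rejects(0));
    }
}
```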

Why is the performance of these matrix multiplications so different?

I wrote two matrix classes in Java just to compare the performance of their matrix multiplications. One class (Mat1) stores a double[][] A member where row i of the matrix is A[i]. The other class (Mat2) stores A and T where T is the transpose of A.
Let's say we have a square matrix M and we want the product of M.mult(M). Call the product P.
When M is a Mat1 instance the algorithm used was the straightforward one:
P[i][j] += M.A[i][k] * M.A[k][j]
for k in range(0, M.A.length)
In the case where M is a Mat2 I used:
P[i][j] += M.A[i][k] * M.T[j][k]
which is the same algorithm because T[j][k]==A[k][j]. On 1000x1000 matrices the second algorithm takes about 1.2 seconds on my machine, while the first one takes at least 25 seconds. I was expecting the second one to be faster, but not by this much. The question is, why is it this much faster?
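Written out in full, the two variants could look like the sketch below. The class and method names (MatMul, multNaive, multTransposed, transpose) are placeholders rather than the original Mat1/Mat2 code, and square matrices are assumed. The only difference between the two multiplications is whether the inner loop walks down a column of the right-hand operand or along a row of its transpose.

```java
/** Sketch: the same matrix product with column-wise and row-wise inner-loop access. */
final class MatMul {
    /** Returns the transpose of a square matrix. */
    static double[][] transpose(double[][] a) {
        int n = a.length;
        double[][] t = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                t[j][i] = a[i][j];
            }
        }
        return t;
    }

    /** P[i][j] += a[i][k] * b[k][j]: the inner loop jumps to a different row of b each step. */
    static double[][] multNaive(double[][] a, double[][] b) {
        int n = a.length;
        double[][] p = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                for (int k = 0; k < n; k++) {
                    p[i][j] += a[i][k] * b[k][j];
                }
            }
        }
        return p;
    }

    /** P[i][j] += a[i][k] * bT[j][k]: the inner loop reads both operands along a row. */
    static double[][] multTransposed(double[][] a, double[][] bT) {
        int n = a.length;
        double[][] p = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                for (int k = 0; k < n; k++) {
                    p[i][j] += a[i][k] * bT[j][k];
                }
            }
        }
        return p;
    }
}
```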
My only guess is that the second one makes better use of the CPU caches, since data is pulled into the caches in chunks larger than 1 word, and the second algorithm benefits from this by traversing only rows, while the first ignores the data pulled into the caches by going immediately to the row below (which is ~1000 words in memory, because arrays are stored in row major order), none of the data for which is cached.
I asked someone and he thought it was because of friendlier memory access patterns (i.e. that the second version would result in fewer TLB soft faults). I didn't think of this at all but I can sort of see how it results in fewer TLB faults.
So, which is it? Or is there some other reason for the performance difference?
This is because of the locality of your data.
In RAM a matrix, although two-dimensional from your point of view, is of course stored as a contiguous array of bytes. The only difference from a 1D array is that the offset is calculated by combining both indices that you use. (Strictly speaking, a Java double[][] is an array of separately allocated row arrays, but each row is itself contiguous, so the same reasoning applies within a row.)
This means that if you access the element at position x,y, it will calculate x*row_length + y, and this will be the offset used to reference the element at that position.
What happens is that a big matrix isn't stored in just one page of memory (this is how your OS manages the RAM, by splitting it into chunks), so the correct page has to be loaded into the CPU cache if you try to access an element that is not already present.
As long as you go through the data contiguously while multiplying, you don't create any problems, since you mainly use all the coefficients of a page and then switch to the next one. But if you invert the indices, every single element may be contained in a different memory page, so a different page has to be requested from RAM for almost every single multiplication you do. This is why the difference is so marked.
(I have rather simplified the whole explanation; it's just to give you the basic idea behind this problem.)
In any case I don't think this is caused by the JVM itself. It may be related to how your OS manages the memory of the Java process.
The cache and TLB hypotheses are both reasonable, but I'd like to see the complete code of your benchmark ... not just pseudo-code snippets.
Another possibility is that the performance difference is a result of your application using 50% more memory for the data arrays in the version with the transpose. If your JVM's heap size is small, it is possible that this is causing the GC to run too often. This could well be a result of using the default heap size. (Three lots of 1000 x 1000 x 8 bytes is ~24 MB.)
Try setting the initial and max heap sizes (the -Xms and -Xmx options) to, say, double the current max size. If that makes no difference, then this is not a simple heap size issue.
It's easy to guess that the problem might be locality, and maybe it is, but that's still a guess.
It's not necessary to guess. Two techniques might give you the answer - single stepping and random pausing.
If you single-step the slow code you might find out that it's doing a lot of stuff you never dreamed of. Such as, you ask? Try it and find out. What you should see it doing, at the machine-language level, is efficiently stepping through the inner loop with no waste motion.
If it actually is stepping through the inner loop with no waste motion, then random pausing will give you information. Since the slow one is taking 20 times longer than the fast one, that implies 95% of the time it is doing something it doesn't have to. So see what it is. Each time you pause it, the chance is 95% that you will see what that is, and why.
If in the slow case, the instructions it is executing appear just as efficient as the fast case, then cache locality is a reasonable guess of why it is slow. I'm sure, once you've eliminated any other silliness that may be going on, that cache locality will dominate.
You might try comparing performance between JDK6 and OpenJDK7, given this set of results...
