Runtime of sorting algorithms gets faster (in Java) - java

Sorting algorithms get faster (in Java)?!
I have implemented some sorting algorithms and a method getNanoTime which returns the execution time of a sorting algorithm in nanoseconds.
I wanted to calculate an average value. I noticed that the average times differ from the time I measure when testing the algorithm only once.
I thought I did something wrong.
But then I found it.
When doing:
int length = 5000;
int bereich = 1000;
long time;
time = Bubblesort.getNanoTime(length, bereich);
System.out.println("BUBBLESORT: " + (1.0 * time / 1_000_000) + " ms");
time = Insertionsort.getNanoTime(length, bereich);
System.out.println("INSERTIONSORT: " + (1.0 * time / 1_000_000) + " ms");
time = Mergesort.getNanoTime(length, bereich);
System.out.println("MERGESORT: " + (1.0 * time / 1_000_000) + " ms");
time = Quicksort.getNanoTime(length, bereich);
System.out.println("QUICKSORT: " + (1.0 * time / 1_000_000) + " ms");
time = Selectionsort.getNanoTime(length, bereich);
System.out.println("SELECTIONSORT: " + (1.0 * time / 1_000_000) + " ms");
I got:
BUBBLESORT: 75.7835 ms
INSERTIONSORT: 27.250875 ms
MERGESORT: 17.450083 ms
QUICKSORT: 7.092709 ms
SELECTIONSORT: 967.638792 ms
But when doing for example:
for (int i = 0; i < 20; i++) {
    System.out.println(1.0 * Bubblesort.getNanoTime(5000, 1000) / 1_000_000);
}
I got:
85.473625 ms
62.681959 ms
68.866542 ms
48.737333 ms
47.402708 ms
47.368708 ms
47.567792 ms
47.018042 ms
45.1795 ms
47.871416 ms
49.570208 ms
50.285875 ms
56.37975 ms
50.342917 ms
50.262833 ms
50.036959 ms
50.286542 ms
51.752708 ms
50.342458 ms
51.511541 ms
The first time is always high (here the first time is 85 ms) and the times after the first run are lower.
So I think the machine learns and becomes faster.
Could that be?
Do you know more?

I think the machine learns and becomes faster
Yup.
Look up Just-in-time compilation and while you're at it, spend a few weeks becoming a rocket scientist so that you can fully understand how CPU caches work.
Or, if you don't want to spend the next 10 weeks studying but you do want a much better idea of how any of this works, read the rest of this answer, then go look at this talk by Douglas Hawkins about JVM performance puzzlers. I bet after watching those 40 minutes you will feel fully enlightened about this conundrum.
Two things are going on here (a JIT warm-up effect and a CPU cache effect), possibly more:
The JIT is 'warming up': the way Java works is that it first runs your class-file code in the dumbest, slowest, stupidest way possible, and in addition wastes even more time maintaining a boatload of bookkeeping on the code, such as 'how often is this if block entered vs. how often is it skipped?', for no apparent reason. Everything runs slow as molasses. Intentionally so, really.
But then... because of all that bookkeeping the JVM at some point goes: Huh. Literally (and I'm not overstating the case here, this is very common!) 99% of the CPU's time is spent on this 0.1% of the entire code base.
It then takes some time to analyse the daylights out of that 0.1%, creating a very finely tuned machine-code version that is exactly right for the actual CPU you're running on. This takes a lot of time, uses all that bookkeeping (which is, after all, not so pointless!) to do things such as re-order code so that the most often taken 'branch' in an if/else block is the one that can run without code jumps (which are slow due to pipeline resets), and will even turn currently-observed-truths-which-may-not-hold-later into assumptions. That is, the code is 'compiled' into machine code that will straight up not work if those assumptions (which so far, thanks to all that bookkeeping, have always been observed to be true) end up being false; hooks are added throughout the entire VM so that if any code happens to break an assumption, the finely crafted machine code is flagged as invalid and will no longer be used.

For example, if you have a class that is not final, it can be extended. And Java is always dynamic dispatch: if you invoke foo.hello(), you get the hello() implementation of the actual type of the object the foo variable is pointing at, not of the type of the foo expression itself. Class loading in Java is inherently dynamic (classes can be loaded at any time; a JVM never knows that it is 'done loading classes'). This means a lookup table must be involved. A pricey annoyance! But the HotSpot optimizer gets around it and eliminates the table: if the optimizer figures out that a non-final class is nevertheless not currently extended, or that all extensions do not override the implementation in question, then it can omit the lookup table and link a method call straight to the implementation. It also adds hooks to the classloader so that if any class is ever loaded that extends the targeted class (and changes the implementation of the relevant method), the machine code that jumps straight to the implementation is invalidated. That method's performance takes a nosedive again, as the JVM goes back to the slow-as-molasses way. If it's still being run a lot, no worries: HotSpot will do another pass, this time taking into account that there are multiple implementations.
Once this machine code is available, all calls to this method are redirected to run using this finely tuned machine code. Which is incredibly fast; usually faster than -O3 compiled C code, in fact, because the JVM gets the benefit of taking runtime behaviour into account which a C compiler can never do.
Nevertheless usually only about 1% of all code in a VM is actually running in this mode. The simple truth is that almost all code of any app is irrelevant performance wise. It doesn't do anything complicated, isn't running during 'hot' moments, and it just. doesn't. matter. The smartypants analysis is left solely for the 1% or so that is actually going to run lots.
And that's what likely explains a large chunk of that difference: A bunch of your sort algorithm's loops are running in dogslow (not-hotspotted) mode, whereas once the hotspotting is done during your first sort run, the next sort runs get the benefit of the hotspotted code from the start.
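As a concrete illustration of the devirtualization-and-deoptimization dance described above, here is a small sketch of my own (not part of the answer; the class and method names are made up):

class Greeter {
    String hello() { return "hi"; }
}

class LoudGreeter extends Greeter {            // once this class is loaded, the assumption breaks
    @Override String hello() { return "HI"; }
}

public class DevirtualizationDemo {
    static String call(Greeter g) {
        return g.hello();                      // dynamic dispatch in the bytecode...
    }

    public static void main(String[] args) {
        Greeter g = new Greeter();
        for (int i = 0; i < 1_000_000; i++) {
            call(g);                           // ...but HotSpot may bind this straight to Greeter.hello()
        }
        // Using LoudGreeter afterwards forces HotSpot to discard that assumption
        // and deoptimize call() before recompiling it for multiple implementations.
        System.out.println(call(new LoudGreeter()));
    }
}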
Secondarily, data needs to be in a cache page for the CPU to really do it quickly. Often repeating a calculation means the first run through gets the penalty of the CPU having to swap in a bunch of cache pages, whereas all future runs don't need to pay that price as the relevant parts of memory are already in the caches.
The conclusion is simple: micro-benchmarking like this is incredibly complicated; you can't just time it with System.nanoTime. JVMs are incredibly complicated, CPUs are incredibly complicated, and even the seasoned engineers who spend their days writing the JVM itself are on record as stating they are far too stupid to guess performance like this. So you stand absolutely no chance whatsoever.
The solution is, fortunately, also very simple. Those same JVM engineers want to know how fast stuff runs, so they wrote an entire framework that lets you micro-benchmark: it actively checks for HotSpot warm-up, does a bunch of dry runs, and ensures that the optimizer doesn't optimize away your entire algorithm (which can happen: if you sort a list and then toss the list in the garbage, the optimizer might just figure out that the entire sort op is best optimized by skipping it entirely, because, hey, if nobody actually cares about the results of the sort, why sort, right? You need a 'sink' to ensure that the optimizer doesn't conclude it can junk the whole thing because the data is discarded!). It is called JMH. Rewrite your benchmark in JMH and let it rip. You'll find that it times consistently, and those times are generally meaningful (vs. what you wrote, which means mostly nothing).
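To give an idea of what that looks like, here is a minimal JMH sketch of my own (not from the original post). It assumes a hypothetical Bubblesort.sort(int[]) entry point, since the question only shows getNanoTime, and assumes JMH is on the classpath:

import java.util.Random;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Level;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
public class SortBenchmark {

    private int[] data;

    @Setup(Level.Invocation)  // fresh, unsorted input for every benchmark call
    public void setUp() {
        data = new Random(42).ints(5000, 0, 1000).toArray();
    }

    @Benchmark
    public int[] bubbleSort() {
        Bubblesort.sort(data);  // hypothetical entry point; adapt to your own API
        return data;            // returning the result is the 'sink' that prevents dead-code elimination
    }
}

JMH then handles the warm-up iterations and measurement forks for you, so the numbers it reports are comparable across the different sort implementations.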

Related

Execution time of junit test case varies every time. Why?

I have a set of 196 test methods. The execution time of these test cases varies every time I run them. They have been run in a controlled environment; for example, for garbage collection, I set references to null in teardown().
Every time before executing the tests, I also make sure CPU usage, memory usage, disk space, and system load are the same for every start.
Also, the time variation is not in any particular order. I need to know why we don't get a stable execution time while executing the same test cases again.
I made 93 cases stable by including a warm-up period in the class. The other cases are related to database connections (reading data from or updating data in a database). Is it possible to have the same execution time every time I run these test cases? (Execution time refers to the JUnit test case execution time.)
Two primary things come to mind with Java performance:
You need to warm up the JVM; your tests are being interpreted as bytecode and are at the mercy of the JVM. That means executing the same test upwards of thousands of times during the same run.
JUnit tests are not measured with much accuracy. In fact, it's pretty much impossible to get an exact performance reading, even with libraries built specifically for this. This is why taking an average of multiple samples is generally suggested.
Still, these and the other causes suggested by Reto are just what could be producing variance on the Java side, where a variance of several milliseconds is to be expected. For an example of this, create a unit test that puts a thread to sleep for 10 ms. Watch as you're given results anywhere from 7 ms to 13 ms to 17 ms or more. It just isn't a reliable way to measure things.
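As a quick illustration of that variance (a sketch of my own, not part of the original answer), timing Thread.sleep(10) with System.nanoTime() already wanders by several milliseconds:

public class SleepTimingDemo {
    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < 10; i++) {
            long start = System.nanoTime();
            Thread.sleep(10);                                   // nominally 10 ms
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println("Sleep took ~" + elapsedMs + " ms");
        }
    }
}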
If you're connecting to a network, uploading data to a database, etc., I can't speak to that, but you need to take into account the variance of those systems as well.
I would suggest breaking your tests with the greatest variance into smaller blocks. Try to isolate where your biggest bottleneck is, then concentrate on optimizing that operation or set of operations. I would think that connecting to the database takes the greatest amount of time, and next to that would likely be executing the query. But you should isolate the measurement of these operations to make sure of that.

Why are floating point operations much faster with a warmup phase?

I initially wanted to test something different with floating-point performance optimisation in Java, namely the performance difference between division by 5.0f and multiplication by 0.2f (multiplication seems to be slower without a warm-up but faster with one, by a factor of about 1.5 in each case).
After studying the results I noticed that I had forgotten to add a warm-up phase, as suggested so often when doing performance optimisations, so I added it. And, to my utter surprise, it turned out to be about 25 times faster on average over multiple test runs.
I tested it with the following code:
public static void main(String args[])
{
    float[] test = new float[10000];
    float[] test_copy;

    //warmup
    for (int i = 0; i < 1000; i++)
    {
        fillRandom(test);
        test_copy = test.clone();

        divideByTwo(test);
        multiplyWithOneHalf(test_copy);
    }

    long divisionTime = 0L;
    long multiplicationTime = 0L;

    for (int i = 0; i < 1000; i++)
    {
        fillRandom(test);
        test_copy = test.clone();

        divisionTime += divideByTwo(test);
        multiplicationTime += multiplyWithOneHalf(test_copy);
    }

    System.out.println("Divide by 5.0f: " + divisionTime);
    System.out.println("Multiply with 0.2f: " + multiplicationTime);
}

public static long divideByTwo(float[] data)
{
    long before = System.nanoTime();
    for (float f : data)
    {
        f /= 5.0f;
    }
    return System.nanoTime() - before;
}

public static long multiplyWithOneHalf(float[] data)
{
    long before = System.nanoTime();
    for (float f : data)
    {
        f *= 0.2f;
    }
    return System.nanoTime() - before;
}

public static void fillRandom(float[] data)
{
    Random random = new Random();
    for (float f : data)
    {
        f = random.nextInt() * random.nextFloat();
    }
}
Results without warm-up phase:
Divide by 5.0f: 382224
Multiply with 0.2f: 490765
Results with warm-up phase:
Divide by 5.0f: 22081
Multiply with 0.2f: 10885
Another interesting change that I cannot explain is the switch in which operation is faster (division vs. multiplication). As mentioned earlier, without the warm-up the division seems to be a tad faster, while with the warm-up it seems to be twice as slow.
I tried adding an initialization block setting the values to something random, but it didn't affect the results, and neither did adding multiple warm-up phases. The numbers on which the methods operate are the same, so that cannot be the reason.
What is the reason for this behaviour? What is this warm-up phase and how does it influence the performance? Why are the operations so much faster with a warm-up phase, and why is there a switch in which operation is faster?
Before the warm-up, Java will be running the byte codes via an interpreter; think how you would write a program that could execute Java byte codes in Java. After warm-up, HotSpot will have generated native assembler for the CPU that you are running on, making use of that CPU's feature set. There is a significant performance difference between the two: the interpreter will run many, many CPU instructions for a single byte code, whereas HotSpot generates native assembler code just as gcc does when compiling C code. That is, the difference between the time to divide and the time to multiply will ultimately come down to the CPU that one is running on, where each will be just a single CPU instruction.
The second part of the puzzle is that HotSpot also records statistics that measure the runtime behaviour of your code; when it decides to optimise the code, it will use those statistics to perform optimisations that are not necessarily possible at compilation time. For example, it can reduce the cost of null checks, branch mispredictions and polymorphic method invocation.
In short, one must discard the results pre-warmup.
Brian Goetz wrote a very good article here on this subject.
========
APPENDED: overview of what 'JVM Warm-up' means
JVM 'warm-up' is a loose phrase, and is no longer, strictly speaking, a single phase or stage of the JVM. People tend to use it to refer to the point where JVM performance stabilizes after compilation of the JVM byte codes to native code. In truth, when one starts to scratch under the surface and delve deeper into the JVM internals, it is difficult not to be impressed by how much HotSpot is doing for us. My goal here is just to give you a better feel for what HotSpot can do in the name of performance; for more details I recommend reading articles by Brian Goetz, Doug Lea, John Rose, Cliff Click and Gil Tene (amongst many others).
As already mentioned, the JVM starts by running your Java through its interpreter. While not strictly 100% correct, one can think of an interpreter as a large switch statement inside a loop that iterates over every JVM byte code (command). Each case within the switch statement is a JVM byte code, such as add two values together, invoke a method, invoke a constructor, and so forth. The overhead of the iteration and of jumping around between the commands is very large. Thus executing a single command will typically use over 10x as many assembly instructions, which means > 10x slower, as the hardware has to execute so many more instructions and the caches get polluted by this interpreter code, which ideally we would rather have focused on our actual program. Think back to the early days of Java, when Java earned its reputation of being very slow; this is because it was originally a fully interpreted language.
Later on, JIT compilers were added to Java; these compilers would compile Java methods to native CPU instructions just before the methods were invoked. This removed all of the overhead of the interpreter and allowed the execution of code to be performed in hardware. While execution in hardware is much faster, this extra compilation created a stall on startup for Java. And this is partly where the terminology of a 'warm-up phase' took hold.
The introduction of HotSpot to the JVM was a game changer. Now the JVM would start up faster because it would start life running Java programs with its interpreter, and individual Java methods would be compiled in a background thread and swapped in on the fly during execution. The generation of native code could also be done at differing levels of optimisation, sometimes using very aggressive optimisations that are strictly speaking incorrect, and then de-optimising and re-optimising on the fly when necessary to ensure correct behaviour.

For example, class hierarchies imply a large cost in figuring out which method will be called, as HotSpot has to search the hierarchy and locate the target method. HotSpot can become very clever here: if it notices that only one class has been loaded, it can assume that will always be the case and optimise and inline methods accordingly. Should another class get loaded that tells HotSpot there is actually a decision between two methods to be made, it will remove its previous assumptions and recompile on the fly. The full list of optimisations that can be made under different circumstances is very impressive, and is constantly changing. HotSpot's ability to record information and statistics about the environment it is running in, and the workload it is currently experiencing, makes the optimisations it performs very flexible and dynamic. In fact it is very possible that, over the lifetime of a single Java process, the code for that program will be regenerated many times over as the nature of its workload changes. Arguably this gives HotSpot a large advantage over more traditional static compilation, and is largely why a lot of Java code can be considered just as fast as C code.

It also makes understanding microbenchmarks a lot harder; in fact it makes the JVM code itself much more difficult for the maintainers at Oracle to understand, work with and diagnose problems in. Take a minute to raise a pint to those guys: HotSpot and the JVM as a whole is a fantastic engineering triumph that rose to the fore at a time when people were saying that it could not be done. It is worth remembering that, because after a decade or so it is quite a complex beast ;)
So, given that context, in summary: warming up a JVM in a microbenchmark means running the target code over 10k times and throwing the results away, so as to give the JVM a chance to collect statistics and to optimise the 'hot regions' of the code. 10k is a magic number because the Server HotSpot implementation waits for that many method invocations or loop iterations before it starts to consider optimisations. I would also advise having method calls between the core test runs, because while HotSpot can do 'on-stack replacement' (OSR), it is not common in real applications and it does not behave exactly the same as swapping out whole implementations of methods.
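As a rough sketch of that advice (my own illustration, not from the answer; the counts are arbitrary), a hand-rolled harness would run well over 10k throwaway invocations first and then time separate method calls, so that whole compiled methods are exercised rather than relying on OSR:

public class WarmupHarness {

    static long timedRun(float[] data) {
        long before = System.nanoTime();
        for (int i = 0; i < data.length; i++) {
            data[i] *= 0.2f;          // the work under test
        }
        return System.nanoTime() - before;
    }

    public static void main(String[] args) {
        float[] data = new float[10_000];   // contents don't matter for this illustration

        // Warm-up: enough invocations for HotSpot to compile timedRun; results are discarded.
        for (int i = 0; i < 20_000; i++) {
            timedRun(data);
        }

        // Measurement: each run is a fresh method call on (hopefully) compiled code.
        long total = 0;
        for (int i = 0; i < 1_000; i++) {
            total += timedRun(data);
        }
        System.out.println("Average ns per run: " + (total / 1_000));
    }
}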
You aren't measuring anything useful "without a warmup phase"; you're measuring the speed of interpreted code plus however long it takes for the on-stack replacement to be generated. Maybe divisions cause compilation to kick in earlier.
There are sets of guidelines and various packages for building microbenchmarks that don't suffer from these sorts of issues. I would suggest that you read the guidelines and use the ready-made packages if you intend to continue doing this sort of thing.

Why is performance of executing Mockito mocks so erratic?

Would anyone have an explanation, or even better a suggested fix, for why the time taken to execute Mockito mocks is so erratic? The simplest SSCCE I could come up with for this is below:
import static org.mockito.Mockito.mock;

public class TestSimpleMockTiming
{
    public static final void main (final String args [])
    {
        final Runnable theMock = mock (Runnable.class);

        int tookShort = 0;
        int tookMedium = 0;
        int tookLong = 0;
        int tookRidiculouslyLong = 0;
        long longest = 0;

        for (int n = 0; n < 2000000; n++)
        {
            final long startTime = System.nanoTime ();
            theMock.run ();
            final long duration = System.nanoTime () - startTime;

            if (duration < 1000000)             // 0.001 seconds
                tookShort++;
            else if (duration < 100000000)      // 0.1 seconds
                tookMedium++;
            else if (duration < 1000000000)     // 1 second !!!
                tookLong++;
            else
                tookRidiculouslyLong++;

            longest = Math.max (longest, duration);
        }

        System.out.println (tookShort + ", " + tookMedium + ", " + tookLong + ", " + tookRidiculouslyLong);
        System.out.println ("Longest duration was " + longest + " ns");
    }
}
If I run this (from within Eclipse, using JDK 1.7.45 on Win 7 x64) typical output looks like:
1999983, 4, 9, 4
Longest duration was 5227445252 ns
So, while in the majority of situations the mock executes very fast, there are several executions that take even longer than 1 second. That's an eternity for a method that does nothing. From my experimenting with this, I don't believe the issue is the accuracy of System.nanoTime(); I think the mock really does take that long to execute. Is there anything I can do to improve on this and make the timing behave more consistently?
(FYI, why this is an issue is that I have a Swing app which contains various frames, and I try to write JUnit tests for the frames so that I can test that the layoutManagers behave correctly without having to fire up the whole app and navigate to the correct screen. In one such test, the screen uses a javax.swing.Timer to implement scrolling, so the display will pan around an area when the mouse is held near the edge of the frame. I noticed the behaviour of this was very erratic, and the scrolling, while usually fine, would periodically freeze for up to a second, which looked dreadful. I wrote an SSCCE around this, thinking the problem was that Swing Timers can't be depended on to fire at a consistent rate, and in the SSCCE it worked perfectly.
After hours of tearing my hair out and trying to spot differences between my real code and the scrolling demo SSCCE, I started putting nano timers around blocks of code that ran repeatedly, noticed the time taken by my paintComponent method was very erratic, and eventually narrowed it down to a mock call. Testing the screen by running the real app, the scrolling behaves smoothly; it's only a problem from the JUnit test because of the mock call, which led me to test a simple mock in isolation with the SSCCE posted above.)
Many thanks!
This test is flawed in multiple ways. If you want to benchmark properly, I strongly suggest using JMH; it is written by someone, Alexey Shipilev, who is much smarter than us and definitely more knowledgeable about the JVM than most people doing Java on our beloved planet.
Here are the most notable ways the test is flawed.
The test ignores what the JVM is doing, like the warm-up phase, the C1 and C2 compiler threads, GC, threading issues (even though this code is not multi-threaded, the JVM/OS may have to do something else), etc.
The test also seems to ignore whether the actual OS/JVM/CPU combination offers proper resolution down to the nanosecond.
Even though there is System.nanoTime(), are you sure the JVM and the OS offer the proper resolution? On Windows, for example, the JVM doesn't have access to real nanoseconds, but instead to some counter, not wall-clock time. The javadoc states this; here's a snippet:
This method can only be used to measure elapsed time and is not related to any other notion of system or wall-clock time. The value returned represents nanoseconds since some fixed but arbitrary origin time (perhaps in the future, so values may be negative). The same origin is used by all invocations of this method in an instance of a Java virtual machine; other virtual machine instances are likely to use a different origin.
This method provides nanosecond precision, but not necessarily nanosecond resolution (that is, how frequently the value changes) - no guarantees are made except that the resolution is at least as good as that of currentTimeMillis().
The test also ignores how Mockito works.
Mockito stores every invocation in its own model in order to be able to verify these calls after executing the scenario. So on every iteration of the loop Mockito stores another invocation, up to 2M invocations, which will impact the JVM (maybe the mock instance will survive several GC generations and get promoted to the tenured space, which is definitely more costly for the GC). That means the more iterations there are, the more this code stresses the JVM, not Mockito.
I believe it's not yet released (there are dev binaries on JCenter, however), but Mockito will offer a setting that makes a mock stub-only, so that it does not store invocations, which may allow Mockito to fit well in a scenario like this one.
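For reference, this exists in current Mockito releases as MockSettings.stubOnly(). A minimal sketch of my own (not from the answer), assuming a Mockito version that provides it:

import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.withSettings;

public class StubOnlyMockDemo
{
    public static void main (final String args [])
    {
        // stubOnly() tells Mockito not to record invocations, so the 2M calls
        // don't pile up in memory (at the price of not being able to verify them).
        final Runnable theMock = mock (Runnable.class, withSettings ().stubOnly ());

        for (int n = 0; n < 2000000; n++)
            theMock.run ();
    }
}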
The test lacks proper statistical analysis.
Interestingly enough, the code of the test has a pseudo-percentile approach, which is good! Although it doesn't quite work like that, and in this case it cannot catch the big issue. Instead it should record every measurement in order to extract how the time Mockito spends evolves as the iteration count advances.
And if you want, it's a good idea to store every recorded measurement, so they can be fed to a proper statistical analysis tool like R in order to extract a graph, percentile data, etc.
On that statistical matter it would certainly be interesting to use HdrHistogram. Outside a microbenchmark, of course, as it will impact memory and alter the result of the microbenchmark. Let's keep that for JMH.
Both points 1 and 2 can be addressed if you change the code to use JMH.
Hope that helps.
A JVM is a very complex thing that does a lot of optimization at runtime (including caching and byte code optimization). Thus, when measuring the execution time of Java programs, you should first of all do a warm-up phase before doing your actual benchmark.
I expect that your first four runs took the longest time and that, afterwards, the execution time became better and better.
Execute your benchmark a few hundred or thousand times before you actually start profiling. Afterwards, I expect your measurement results will become more stable.

Java - Measuring Method Execution Time

I am trying to measure the complexity of an algorithm using a timer to measure the execution time, whilst changing the size of the input array.
The code I have at the moment is rather simple:
private long start;   // holds the timestamp set by start() and read by stop()

public void start() {
    start = System.nanoTime();
}

public long stop() {
    long time = System.nanoTime() - start;
    start = 0;
    return time;
}
It appears to work fine, up until the size of the array becomes very large, and what I expect to be an O(n) complexity algorithm turns out appearing to be O(n^2). I believe that this is due to the threading on the CPU, with other processes cutting in for more time during the runs with larger values for n.
Basically, I want to measure how much time my process has been running for, rather than how long it has been since I invoked the algorithm. Is there an easy way to do this in Java?
Measuring execution time is a really interesting, but also complicated topic. To do it right in Java, you have to know a little bit about how the JVM works. Here is a good article from developerWorks about benchmarking and measuring. Read it, it will help you a lot.
The author also provides a small framework for doing benchmarks. You can use this framework. It will give you exactly what you need: the CPU time consumed, instead of just two timestamps from before and after. The framework will also handle the JVM warm-up and will keep track of just-in-time compilation.
You can also use a performance monitor like this one for Eclipse. The problem with such a performance monitor is that it doesn't perform a benchmark. It just tracks the time, memory and similar things that your application currently uses. But that's not a real measurement - it's just a snapshot at a specific time.
Benchmarking in Java is a hard problem, not least because the JIT can have weird effects as your method gets more and more heavily optimized. Consider using a purpose-built tool like Caliper. Examples of how to use it and to measure performance on different input sizes are here.
If you want the actual CPU time of the current thread (or indeed, any arbitrary thread) rather than the wall clock time then you can get this via ThreadMXBean. Basically, do this at the start:
ThreadMXBean thx = ManagementFactory.getThreadMXBean();
thx.setThreadCpuTimeEnabled(true);
Then, whenever you want to get the elapsed CPU time for the current thread:
long cpuTime = thx.getCurrentThreadCpuTime();
You'll see that ThreadMXBean has calls to get CPU time and other info for arbitrary threads too.
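To tie the pieces together, here is a small self-contained sketch of my own (not from the answer), comparing wall-clock time with the current thread's CPU time for the same piece of work:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class CpuTimeDemo {
    public static void main(String[] args) {
        ThreadMXBean thx = ManagementFactory.getThreadMXBean();
        if (thx.isThreadCpuTimeSupported()) {
            thx.setThreadCpuTimeEnabled(true);
        }

        long wallStart = System.nanoTime();
        long cpuStart = thx.getCurrentThreadCpuTime();

        double sum = 0;
        for (int i = 1; i <= 10_000_000; i++) {
            sum += Math.sqrt(i);              // some CPU-bound work to time
        }

        long cpuTime = thx.getCurrentThreadCpuTime() - cpuStart;
        long wallTime = System.nanoTime() - wallStart;

        System.out.println("sum = " + sum);   // use the result so the loop can't be optimized away
        System.out.println("Wall time: " + wallTime + " ns, CPU time: " + cpuTime + " ns");
    }
}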
Other comments about the complexities of timing also apply. The timing of the individual invocation of a piece of code can depend among other things on the state of the CPU and on what the JIT compiler decides to do at that particular moment. The overall scalability behaviour of an algorithm is generally a trend that emerges across a number of invocations and you will always need to be prepared for some "outliers" in your timings.
Also, remember that just because a particular timing is expressed in nanoseconds (or indeed milliseconds) does not mean that the timing actually has that granularity.

Create quick/reliable benchmark with java?

I'm trying to create a benchmark test with java. Currently I have the following simple method:
public static long runTest(int times) {
    long start = System.nanoTime();
    String str = "str";
    for (int i = 0; i < times; i++) {
        str = "str" + i;
    }
    return System.nanoTime() - start;
}
I'm currently running this loop multiple times within another loop that also runs multiple times, and getting the min/max/avg time it takes to run this method. Then I am starting some activity on another thread and testing again. Basically I just want to get consistent results... It seems pretty consistent if I have the runTest loop 10 million times:
Number of times ran: 5
The max time was: 1231419504 (102.85% of the average)
The min time was: 1177508466 (98.35% of the average)
The average time was: 1197291937
The difference between the max and min is: 4.58%
Activated thread activity.
Number of times ran: 5
The max time was: 3872724739 (100.82% of the average)
The min time was: 3804827995 (99.05% of the average)
The average time was: 3841216849
The difference between the max and min is: 1.78%
Running with thread activity took 320.83% as much time as running without.
But this seems a bit much, and takes some time... if I try a lower number (100000) in the runTest loop... it starts to become very inconsistent:
Number of times ran: 5
The max time was: 34726168 (143.01% of the average)
The min time was: 20889055 (86.02% of the average)
The average time was: 24283026
The difference between the max and min is: 66.24%
Activated thread activity.
Number of times ran: 5
The max time was: 143950627 (148.83% of the average)
The min time was: 64780554 (66.98% of the average)
The average time was: 96719589
The difference between the max and min is: 122.21%
Running with thread activity took 398.3% as much time as running without.
Is there a way that I can do a benchmark like this that is both consistent and efficient/fast?
I'm not testing the code that is between the start and end times, by the way. I'm testing the CPU load in a way (see how I'm starting some thread activity and retesting). So I think what I'm looking for is something to substitute for the code I have in runTest that will yield quicker and more consistent results.
Thanks
In short:
(Micro-)benchmarking is very complex, so use a tool like the Benchmarking framework http://www.ellipticgroup.com/misc/projectLibrary.zip - and still be skeptical about the results ("Put micro-trust in a micro-benchmark", Dr. Cliff Click).
In detail:
There are a lot of factors that can strongly influence the results:
The accuracy and precision of System.nanoTime: in the worst case it is as bad as that of System.currentTimeMillis.
code warmup and class loading
mixed mode: JVMs JIT compile (see Edwin Buck's answer) only after a code block is called sufficiently often (1500 or 1000 times)
dynamic optimizations: deoptimization, on-stack replacement, dead code elimination (you should use the result you computed in your loop, e.g. print it)
resource reclamation: garbage collection (see Michael Borgwardt's answer) and object finalization
caching: I/O and CPU
your operating system on the whole: screen saver, power management, other processes (indexer, virus scan, ...)
Brent Boyer's article "Robust Java benchmarking, Part 1: Issues" ( http://www.ibm.com/developerworks/java/library/j-benchmark1/index.html) is a good description of all those issues and whether/what you can do against them (e.g. use JVM options or call ProcessIdleTask beforehand).
You won't be able to eliminate all these factors, so doing statistics is a good idea. But:
instead of computing the difference between the max and min, you should put in the effort to compute the standard deviation (the result set {1, 1000 times 2, 3} is different from {501 times 1, 501 times 3}).
The reliability is taken into account by producing confidence intervals (e.g. via bootstrapping).
The above mentioned Benchmark framework ( http://www.ellipticgroup.com/misc/projectLibrary.zip) uses these techniques. You can read about them in Brent Boyer's article "Robust Java benchmarking, Part 2: Statistics and solutions" ( https://www.ibm.com/developerworks/java/library/j-benchmark2/).
Your code ends up testing mainly garbage collection performance because appending to a String in a loop ends up creating and immediately discarding a large number of increasingly large String objects.
This is something that inherently leads to wildly varying measurements and is influenced strongly by multi-threaded activity.
I suggest you do something else in your loop that has more predictable performance, like mathematical calculations.
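For instance (a sketch of my own, not part of the answer), an allocation-free arithmetic loop keeps the garbage collector out of the measurement; it could be dropped in as a hypothetical replacement for runTest:

public static long runTest(int times) {
    long start = System.nanoTime();
    double acc = 0;
    for (int i = 0; i < times; i++) {
        acc += Math.sqrt(i) * 0.5;    // predictable, allocation-free work
    }
    // Use the result so the JIT cannot treat the loop as dead code.
    if (acc == -1) {
        System.out.println(acc);
    }
    return System.nanoTime() - start;
}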
In the 10 million times run, odds are good the HotSpot compiler detected a "heavily used" piece of code and compiled it into machine native code.
JVM bytecode is interpreted, which leaves it susceptible to more interruptions from other background processes occurring in the JVM (like garbage collection).
Generally speaking, these kinds of benchmarks are rife with assumptions that don't hold. You cannot believe that a micro-benchmark really proves what it set out to prove without a lot of evidence showing that the measurement (time) is actually covering your task and not some other background tasks as well. If you don't attempt to control for background tasks, then the measurement is much less useful.
