I'm attempting to test some benchmarking tools by running them against a simple program which increments a variable as many times as possible for 1000 milliseconds.
How many incrementations of a single 64-bit number should I expect to be able to perform on an Intel i7 chip on the JDK for Mac OS X?
My current methodology is:
start a thread (t2) that continually increments "i" in an infinite loop (for(;;)).
let the main thread (call it t1) sleep for 1000 milliseconds.
have t1 interrupt (or stop, since this deprecated method works on Apple's JDK 6) t2.
Currently, I am reproducibly getting about 2E8 incrementations (tabulated below: each value shown is the value printed when the incrementing thread is interrupted after a 1000-millisecond sleep() in the calling thread).
217057470
223302277
212337757
215177075
214785738
213849329
215645992
215651712
215363726
216135710
How can I know whether this benchmark is reasonable or not, i.e., what is the theoretical fastest speed at which an i7 chip should be able to increment a single 64-bit number? The code, which runs on the JVM, is below:
package net.rudolfcode.jvm;

/**
 * How many instructions can the JVM execute in a second?
 * @author jayunit100
 */
public class Example3B {

    public static void main(String[] args) {
        for (int i = 0; i < 10; i++) {
            Thread addThread = createThread();
            runForASecond(addThread, 1000);
        }
    }

    private static Thread createThread() {
        Thread addThread = new Thread() {
            Long i = 0L;

            public void run() {
                boolean t = true;
                for (;;) {
                    try {
                        i++;
                    }
                    catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            }

            @Override
            public void interrupt() {
                System.out.println(i);
                super.interrupt();
            }
        };
        return addThread;
    }

    private static void runForASecond(Thread addThread, int milli) {
        addThread.start();
        try {
            Thread.sleep(milli);
        }
        catch (Exception e) {
        }
        addThread.interrupt();
        // stop() works on some JVMs...
        addThread.stop();
    }
}
Theoretically, making some assumptions which are probably not valid:
Assume that a number can be incremented in 1 instruction (probably not, because you're running in a JVM and not natively)
Assume that a 2.5 GHz processor can execute 2,500,000,000 instructions per second (but in reality, it's more complicated than that)
Then you could say that 2,500,000,000 increments in 1 second is a "reasonable" upper bound based on the simplest possible back-of-the-envelope estimation.
How far off is that from your measurement?
2,500,000,000 is O(1,000,000,000)
2E8 is O(100,000,000)
So we're only off by 1 order of magnitude. Given the wildly unfounded assumptions – sounds reasonable to me.
First of all, beware of JVM optimisations! You must be sure you measure exactly what you think you're measuring. Since Long i = 0L; is not volatile and is effectively useless (nothing is done with the intermediate values), the JIT can do pretty nasty stuff.
As for the estimate, you can expect no more than X*10^9 operations per second on an X GHz machine. You can safely divide that value by 10, because instructions probably aren't mapped 1:1.
So you're pretty close :)
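To illustrate the point about measuring what you think you measure, here is a minimal sketch (the class and field names are mine, not from the post) that makes the counter volatile so the JIT cannot elide the updates, and that checks the interrupt flag instead of relying on the deprecated stop():

```java
public class VolatileIncrement {
    // volatile stops the JIT from hoisting the counter into a register
    // or eliding the loop body as dead code
    static volatile long i = 0L;

    public static void main(String[] args) throws InterruptedException {
        Thread addThread = new Thread(() -> {
            // exit cleanly when interrupted instead of relying on stop()
            while (!Thread.currentThread().isInterrupted()) {
                i++;
            }
        });
        addThread.start();
        Thread.sleep(1000);
        addThread.interrupt();
        addThread.join();
        System.out.println(i); // printing the value also keeps it observably "live"
    }
}
```

Expect the volatile version to report noticeably fewer increments per second, since every update must be written back to memory rather than kept in a register.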
Related
Currently working on a university assessment, so I won't share specifics, and I'm not asking for an explanation that will solve the main problem for me. I've already solved the problem, but my solution might be considered a little messy.
Basically, we're working with concurrency and semaphores. There is some shared resource that up to X (where X > 1) number of threads can access at a time and an algorithm which makes it a little more complicated than just acquiring and releasing access. Threads come at a certain time, use the resource for a certain time and then leave. We are to assume that no time is wasted when arriving, accessing, releasing and leaving the resource. The goal is to demonstrate that the algorithm we have written works by outputting the times a thread arrives, accesses the resource and leaves for each thread.
I'm using a semaphore with X number of permits to govern access. And it's all working fine, but I think the way I arrive at the expected output might be a bit janky. Here's something like what I have currently:
@Override
public void run() {
    long alive = System.currentTimeMillis();
    try { Thread.sleep(arrivalTime * 1000); }
    catch (InterruptedException e) {} // no interrupts implemented
    long actualArriveTime = System.currentTimeMillis() - alive;
    boolean accessed = false;
    while (!accessed) accessed = tryAcquire();
    long actualAccessTime = System.currentTimeMillis() - alive;
    try { Thread.sleep(useTime * 1000); }
    catch (InterruptedException e) {} // no interrupts implemented
    release();
    long actualDepartTime = System.currentTimeMillis() - alive;
    System.out.println(actualArriveTime);
    System.out.println(actualAccessTime);
    System.out.println(actualDepartTime);
}
I do it this way because, where the expected output might be:
Thread Arrival Access Departure
A 0 0 3
B 0 0 5
C 2 2 6
... ... ... ...
My output looks something like:
Thread Arrival Access Departure
A 0 0 3006
B 0 0 5008
C 2 2 6012
... ... ... ...
I'm essentially making the time period much larger so that if the computer takes a few milliseconds to acquire(), for example, it doesn't affect the number much. Then I can round to the nearest second to get the expected output. My algorithm works, but there are issues with this. A: it's slow; B: with enough threads, the milliseconds of delay may accumulate so that I round to the wrong number.
I need something more like this:
public static void main(String[] args) {
    int clock = 0;
    while (threadsWaiting) {
        clock++;
    }
}

@Override
public void run() {
    Thread.waitUntil(clock == arrivalTime);
    boolean accessed = false;
    while (!accessed) accessed = tryAcquire();
    int accessTime = clock;
    int departureTime = accessTime + useTime;
    Thread.waitUntil(clock == departureTime);
    release();
    System.out.println(arrivalTime);
    System.out.println(accessTime);
    System.out.println(departureTime);
}
Hopefully that's clear. Any help is appreciated.
Thanks!
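One way to get the waitUntil-style behavior sketched in the pseudocode above is a shared monitor that worker threads block on until the tick reaches their target. This is only a sketch under my own naming (LogicalClock, awaitTick, and tick are not from the assignment): the main thread advances the clock in whole ticks, and each worker re-checks its own target whenever it is woken.

```java
public class LogicalClock {
    private int now = 0;

    // Block until the clock reaches at least 'target'.
    public synchronized void awaitTick(int target) throws InterruptedException {
        while (now < target) {
            wait();
        }
    }

    public synchronized void tick() {
        now++;
        notifyAll(); // wake every waiter; each one re-checks its own target
    }

    public synchronized int now() {
        return now;
    }

    public static void main(String[] args) throws InterruptedException {
        LogicalClock clock = new LogicalClock();
        Thread worker = new Thread(() -> {
            try {
                clock.awaitTick(3); // "arrive" at logical time 3
                System.out.println("arrived at tick " + clock.now());
            } catch (InterruptedException ignored) { }
        });
        worker.start();
        for (int t = 0; t < 5; t++) {
            clock.tick(); // the main thread is the only one advancing time
        }
        worker.join();
    }
}
```

Because time is now a counter rather than wall-clock milliseconds, the output is exact integers and there is no rounding drift, no matter how many threads you add.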
I tried to compile the example from Thinking in Java by Bruce Eckel:
import java.util.concurrent.*;

public class SimplePriorities implements Runnable {
    private int countDown = 5;
    private volatile double d; // No optimization
    private int priority;

    public SimplePriorities(int priority) {
        this.priority = priority;
    }

    public String toString() {
        return Thread.currentThread() + ": " + countDown;
    }

    public void run() {
        Thread.currentThread().setPriority(priority);
        while (true) {
            // An expensive, interruptable operation:
            for (int i = 1; i < 100000; i++) {
                d += (Math.PI + Math.E) / (double) i;
                if (i % 1000 == 0)
                    Thread.yield();
            }
            System.out.println(this);
            if (--countDown == 0) return;
        }
    }

    public static void main(String[] args) {
        ExecutorService exec = Executors.newCachedThreadPool();
        for (int i = 0; i < 5; i++)
            exec.execute(new SimplePriorities(Thread.MIN_PRIORITY));
        exec.execute(new SimplePriorities(Thread.MAX_PRIORITY));
        exec.shutdown();
    }
}
According to the book, the output has to look like:
Thread[pool-1-thread-6,10,main]: 5
Thread[pool-1-thread-6,10,main]: 4
Thread[pool-1-thread-6,10,main]: 3
Thread[pool-1-thread-6,10,main]: 2
Thread[pool-1-thread-6,10,main]: 1
Thread[pool-1-thread-3,1,main]: 5
Thread[pool-1-thread-2,1,main]: 5
Thread[pool-1-thread-1,1,main]: 5
...
But in my case the 6th thread doesn't execute its task first, and the threads are disordered. Could you please explain to me what's wrong? I just copied the source and didn't add any lines of code.
The code is working fine and with the output from the book.
Your IDE probably has a console window with a scroll bar; just scroll up and you'll see the 6th thread doing its job first.
However, the results may differ depending on OS / JVM version. This code runs as expected for me on Windows 10 / JVM 8.
There are two issues here:
If two threads with the same priority want to write output, which one goes first?
The order of threads (with the same priority) is undefined, therefore the order of output is undefined. It is likely that a single thread is allowed to write several outputs in a row (because that's how most thread schedulers work), but it could also be completely random, or anything in between.
How many threads will a cached thread pool create?
That depends on your system. If you run on a dual-core system, creating more than 4 threads is pointless, because there will hardly be any CPU available to execute those threads. In this scenario further tasks will be queued and executed only after earlier tasks are completed.
Hint: there is also a fixed-size thread pool, experimenting with that should change the output.
In summary, there is nothing wrong with your code; it is just wrong to assume that threads are executed in any particular order. It is even technically possible (although very unlikely) that the first task is already completed before the last task is even started. If your book says that the above order is "correct", then the book is simply wrong. On an average system that might be the most likely output, but, as above, with threads there is never any order unless you enforce it.
One way to enforce it are thread priorities - higher priorities will get their work done first - you can find other concepts in the concurrent package.
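Following the hint about thread pools: a single-thread executor is the simplest way to make the ordering deterministic, since one worker runs tasks strictly in submission order regardless of priorities. A small sketch (class and field names are mine, for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class OrderedTasks {
    // Filled only by the single worker thread, so no extra locking is needed.
    static final List<Integer> order = new ArrayList<>();

    public static void main(String[] args) throws InterruptedException {
        // One worker thread => tasks execute strictly in submission order.
        ExecutorService exec = Executors.newSingleThreadExecutor();
        for (int i = 0; i < 5; i++) {
            final int id = i;
            exec.execute(() -> order.add(id));
        }
        exec.shutdown();
        exec.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println(order); // always [0, 1, 2, 3, 4]
    }
}
```

With a fixed pool of size greater than one, or a cached pool as in the book's example, the interleaving becomes scheduler-dependent again.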
I am trying out the executor service in Java, and wrote the following code to run Fibonacci (yes, the massively recursive version, just to stress out the executor service).
Surprisingly, it will run faster if I set the nThreads to 1. It might be related to the fact that the size of each "task" submitted to the executor service is really small. But still it must be the same number also if I set nThreads to 1.
To see if the access to the shared Atomic variables can cause this issue, I commented out the three lines with the comment "see text", and looked at the system monitor to see how long the execution takes. But the results are the same.
Any idea why this is happening?
BTW, I wanted to compare it with the similar implementation with Fork/Join. It turns out to be way slower than the F/J implementation.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class MainSimpler {
    static int N = 35;
    static AtomicInteger result = new AtomicInteger(0), pendingTasks = new AtomicInteger(1);
    static ExecutorService executor;

    public static void main(String[] args) {
        int nThreads = 2;
        System.out.println("Number of threads = " + nThreads);
        executor = Executors.newFixedThreadPool(nThreads);
        Executable.inQueue = new AtomicInteger(nThreads);
        long before = System.currentTimeMillis();
        System.out.println("Fibonacci " + N + " is ... ");
        executor.submit(new FibSimpler(N));
        waitToFinish();
        System.out.println(result.get());
        long after = System.currentTimeMillis();
        System.out.println("Duration: " + (after - before) + " milliseconds\n");
    }

    private static void waitToFinish() {
        while (0 < pendingTasks.get()) {
            try {
                Thread.sleep(1000);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
        executor.shutdown();
    }
}

class FibSimpler implements Runnable {
    int N;

    FibSimpler(int n) { N = n; }

    @Override
    public void run() {
        compute();
        MainSimpler.pendingTasks.decrementAndGet(); // see text
    }

    void compute() {
        int n = N;
        if (n <= 1) {
            MainSimpler.result.addAndGet(n); // see text
            return;
        }
        MainSimpler.executor.submit(new FibSimpler(n - 1));
        MainSimpler.pendingTasks.incrementAndGet(); // see text
        N = n - 2;
        compute(); // similar to the F/J counterpart
    }
}
Runtime (approximately):
1 thread : 11 seconds
2 threads: 19 seconds
4 threads: 19 seconds
Update:
I notice that even if I use one thread inside the executor service, the whole program will use all four cores of my machine (each core around 80% usage on average). This could explain why using more threads inside the executor service slows down the whole process, but now, why does this program use 4 cores if only one thread is active inside the executor service?
It might be related to the fact that the size of each "task" submitted
to the executor service is really small.
This is certainly the case, and as a result you are mainly measuring the overhead of context switching. When nThreads == 1, there is no context switching and thus the performance is better.
But still it must be the same number also if I set nThreads to 1.
I'm guessing you meant 'to higher than 1' here.
You are running into the problem of heavy lock contention. When you have multiple threads, the lock on the result is contended all the time. Threads have to wait for each other before they can update the result and that slows them down. When there is only a single thread, the JVM probably detects that and performs lock elision, meaning it doesn't actually perform any locking at all.
You may get better performance if you don't divide the problem into N tasks, but rather divide it into N/nThreads tasks, which can be handled simultaneously by the threads (assuming you choose nThreads to be at most the number of physical cores/threads available). Each thread then does its own work, calculating its own total and only adding that to a grand total when the thread is done. Even then, for fib(35) I expect the costs of thread management to outweigh the benefits. Perhaps try fib(1000).
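The per-thread partial-total idea can be sketched like this (class name, worker count, and iteration counts are illustrative, not from the question): each worker accumulates into an uncontended local variable and touches the shared total exactly once, so the contended atomic write happens nThreads times instead of once per increment.

```java
import java.util.concurrent.atomic.AtomicLong;

public class PartitionedSum {
    static long count(int nThreads, long perThread) throws InterruptedException {
        final AtomicLong grandTotal = new AtomicLong();
        Thread[] workers = new Thread[nThreads];
        for (int t = 0; t < nThreads; t++) {
            workers[t] = new Thread(() -> {
                long local = 0;              // uncontended local accumulator
                for (long i = 0; i < perThread; i++) {
                    local++;
                }
                grandTotal.addAndGet(local); // one contended write per thread
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            w.join();
        }
        return grandTotal.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(count(4, 10_000_000L)); // prints 40000000
    }
}
```

The same shape applies to the Fibonacci example: have each task compute a subtree total locally and publish it once, rather than hitting the shared AtomicInteger at every leaf.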
I'm trying to understand synchronization of multiple threads in Java more fully. I understand the high level idea behind the use of the synchronized keyword, and how it provides mutual exclusion among threads.
The only thing is that most of the examples I read online and in my textbook still work correctly even if you remove the synchronized keyword which is making this topic more confusing than I think it needs to be.
Can anyone provide me with a concrete example of when not including the synchronized keyword will produce erroneous results? Any information would be much appreciated.
You can usually trigger a race condition by increasing the number of iterations. Here's a simple example that works with 100 and 1,000 iterations but fails (at least on my quad-core box) at 10,000 iterations (sometimes).
public class Race
{
    static final int ITERATIONS = 10000;
    static int counter;

    public static void main(String[] args) throws InterruptedException {
        System.out.println("start");
        Thread first = new Thread(new Runnable() {
            @Override
            public void run() {
                for (int i = 0; i < ITERATIONS; i++) {
                    counter++;
                }
            }
        });
        Thread second = new Thread(new Runnable() {
            @Override
            public void run() {
                for (int i = 0; i < ITERATIONS; i++) {
                    counter++;
                }
            }
        });
        first.start();
        second.start();
        first.join();
        second.join();
        System.out.println("Counter " + counter + " should be " + (2 * ITERATIONS));
    }
}
>>> Counter 12325 should be 20000
This example fails because access to counter is not properly synchronized. It can fail in two ways, possibly both in the same run:
One thread fails to see that the other has incremented the counter because it doesn't see the new value.
One thread increments the counter between the other thread reading the current value and writing the new value. This is because the increment and decrement operators are not atomic.
The fix for this simple program would be to use an AtomicInteger. Using volatile isn't enough due to the problem with increment, but AtomicInteger provides atomic operations for increment, get-and-set, etc.
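As a sketch of the suggested fix, here is the same two-thread counter using AtomicInteger (the class name NoRace is mine): incrementAndGet() performs the read-modify-write as a single atomic operation, so the final count is always exactly 2 * ITERATIONS.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class NoRace {
    static final int ITERATIONS = 10000;
    static final AtomicInteger counter = new AtomicInteger();

    public static void main(String[] args) throws InterruptedException {
        Runnable work = () -> {
            for (int i = 0; i < ITERATIONS; i++) {
                counter.incrementAndGet(); // atomic, unlike counter++
            }
        };
        Thread first = new Thread(work);
        Thread second = new Thread(work);
        first.start();
        second.start();
        first.join();
        second.join();
        System.out.println("Counter " + counter.get() + " is always " + (2 * ITERATIONS));
    }
}
```

Run it with any ITERATIONS value and the result is deterministic, which is exactly what the broken version cannot guarantee.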
The thing about race conditions is that they don't necessarily happen if you don't do proper synchronization -- indeed, quite frequently it'll work just fine -- but then one year later, in the middle of the night, your code will crash with a completely unpredictable bug that you can't reproduce, because the bug only appears randomly.
Race conditions are so insidious precisely because they don't always make your program crash, and they trigger more or less randomly.
I have a favorite C# program similar to the one below that shows that if two threads share the same memory address for counting (one thread incrementing n times, one thread decrementing n times), you can get a final result other than zero. As long as n is reasonably large, it's pretty easy to get C# to display some non-zero value between [-n, n]. However, I can't get Java to produce a non-zero result even when increasing the number of threads to 1000 (500 up, 500 down). Is there some memory model or specification difference with respect to C# that guarantees this program will always yield 0 despite the scheduling or number of cores? Would we agree that this program could produce a non-zero value even if we cannot prove that experimentally?
(Note: I found this exact question over here, but when I run that topic's code I also get zero.)
public class Counter
{
    private int _counter = 0;

    Counter() throws Exception
    {
        final int limit = Integer.MAX_VALUE;
        Thread add = new Thread()
        {
            public void run()
            {
                for (int i = 0; i < limit; i++)
                {
                    _counter++;
                }
            }
        };
        Thread sub = new Thread()
        {
            public void run()
            {
                for (int i = 0; i < limit; i++)
                {
                    _counter--;
                }
            }
        };
        add.run();
        sub.run();
        add.join();
        sub.join();
        System.out.println(_counter);
    }

    public static void main(String[] args) throws Exception
    {
        new Counter();
    }
}
The code you've given only runs on a single thread, so will always give a result of 0. If you actually start two threads, you can indeed get a non-zero result:
// Don't call run(), which is a synchronous call, which doesn't start any threads
// Call start(), which starts a new thread and calls run() *in that thread*.
add.start();
sub.start();
On my box in a test run that gave -2146200243.
Assuming you really meant start, not run.
On most common platforms it will very likely produce a non-zero result, because ++/-- are not atomic operations in the case of multiple cores. On a single core/single CPU you will most likely get 0, because ++/-- are atomic if compiled to one instruction (add/inc), but that part depends on the JVM.
Check result here: http://ideone.com/IzTT2
The problem with your program is that you are not creating an OS thread, so your program is essentially single-threaded. In Java you must call Thread.start() to create a new OS thread, not Thread.run(). This stems from a regrettable mistake in the initial Java API: the designers made Thread implement Runnable.
add.start();
sub.start();
add.join();
sub.join();