Multiple cores of my Quad-Core Xeon are the violinists that play music to my ears. With 4 cores one should expect a 4x speedup at best, right? And even that only for embarrassingly parallel problems. Wrong!!!
Let's start by including a few headers in our guinea pig.
#import <Cocoa/Cocoa.h>
#import <iostream>
#import <libkern/OSAtomic.h>
Ok..ok.. Objective-C++… Guilty. Let us set up a function that does some dummy work but toils to the dawn with it.
using namespace std;
void myLongComputation(double& d)
{
for (int j=0;j<100000;j++) { d++; }
}
Now we grab a concurrent dispatch queue. Notice the __block qualifier in front of our accumulator variable. It lets our forthcoming blocks accumulate into the shared stack variable itself rather than into a private copy captured by each block. It will all become clear in a moment.
dispatch_queue_t q = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
size_t iterations = 1000*100;
__block int32_t total=0;
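For readers more at home in plain C++, here is a rough analogy for what __block buys us. This is only a sketch: C++ lambdas are not Objective-C blocks, and the helper names countIntoCopy and countIntoShared are made up for illustration. A by-value capture behaves like a block without __block (each closure works on its own copy), while a by-reference capture behaves like the __block variable (the closure writes to the original).

```cpp
#include <cstdint>

// Hypothetical helper: the lambda captures a private copy of total,
// much like a block capturing a variable that lacks __block.
int32_t countIntoCopy(int32_t n) {
    int32_t total = 0;
    auto bump = [total]() mutable { total++; };  // increments the copy only
    for (int32_t i = 0; i < n; i++) { bump(); }
    return total;  // the original was never touched
}

// Hypothetical helper: the lambda captures total by reference,
// much like a block writing into a __block variable.
int32_t countIntoShared(int32_t n) {
    int32_t total = 0;
    auto bump = [&total]() { total++; };  // increments the original
    for (int32_t i = 0; i < n; i++) { bump(); }
    return total;
}
```

One difference worth keeping in mind: without __block, a block captures a const copy, so the total++ in our blocks would not even compile.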
Now we enqueue a bunch of blocks (in parallel) that invoke our long computation, each incrementing the value of our shared total. The following pass does it in a non-atomic way. We also time our operations in a naive way; however, it should be sufficient for our purposes.
CFAbsoluteTime startTime = CFAbsoluteTimeGetCurrent();
dispatch_apply(iterations, q, ^(size_t sz) { total++; double d = 0; myLongComputation(d); } );
CFAbsoluteTime finishTime = CFAbsoluteTimeGetCurrent();
cerr << "total (parallel+naive) = " << total << endl;
cerr << "program took " << finishTime-startTime << " seconds" << endl;
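The timing above is naive partly because CFAbsoluteTimeGetCurrent reads the wall clock, which is not guaranteed to be monotonic if the system time gets adjusted mid-run. A minimal sketch of a sturdier stopwatch in standard C++ follows; secondsTaken is a hypothetical helper name, not something from the program above.

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper: runs work() once and returns the elapsed time in
// seconds, measured on a monotonic clock that never jumps backwards.
template <typename Work>
double secondsTaken(Work&& work) {
    auto start = std::chrono::steady_clock::now();
    std::forward<Work>(work)();
    auto finish = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(finish - start).count();
}
```

Wrapping each pass in a lambda and handing it to secondsTaken would replace the startTime/finishTime pairs. For runs lasting several seconds, like ours, the difference hardly matters.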
The second pass is a copy of the first but utilizes atomic increments.
total = 0;
startTime = CFAbsoluteTimeGetCurrent();
dispatch_apply(iterations, q, ^(size_t sz) { OSAtomicAdd32(1, &total); double d = 0; myLongComputation(d); } );
finishTime = CFAbsoluteTimeGetCurrent();
cerr << "total (parallel+atomics) = " << total << endl;
cerr << "program took " << finishTime-startTime << " seconds" << endl;
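The same idea can be sketched in portable C++ for those without OSAtomic at hand. This is an illustration under assumptions, not a translation of the GCD code: it uses plain std::thread workers instead of a dispatch queue, and countAtomically is a made-up name. Each worker bumps a shared std::atomic counter, so no increment can be lost.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical helper: `workers` threads each add 1 to a shared counter
// `perWorker` times. fetch_add is an atomic read-modify-write, so the
// final total is always workers * perWorker.
int32_t countAtomically(unsigned workers, int32_t perWorker) {
    std::atomic<int32_t> total{0};
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; w++) {
        pool.emplace_back([&total, perWorker] {
            for (int32_t i = 0; i < perWorker; i++) {
                total.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& t : pool) { t.join(); }  // wait for every worker to finish
    return total.load();
}
```

Swapping the atomic for a plain int32_t and total++ would reintroduce the race from the first pass. In C++ that is a data race with undefined behavior, so the sketch deliberately leaves it out.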
Finally, we simply loop over the computations in the main thread as follows.
total = 0;
startTime = CFAbsoluteTimeGetCurrent();
void (^myBlock)(size_t) = ^(size_t sz) { total++; double d = 0; myLongComputation(d); };
for (size_t j=0;j<iterations;j++) { myBlock(j); }
finishTime = CFAbsoluteTimeGetCurrent();
cerr << "total (serial) = " << total << endl;
cerr << "program took " << finishTime-startTime << " seconds" << endl;
It seems the mercurial Hyper-Threading actually works, since the tin-headed (aluminum-headed?… no.) box under my desk spits out:
total (parallel+naive) = 99991
program took 5.27613 seconds
total (parallel+atomics) = 100000
program took 5.26177 seconds
total (serial) = 100000
program took 32.9869 seconds
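with a roughly 6x speedup… hmm… Note also that the naive pass lost 9 of its 100000 increments to the race on total, while the atomic pass lost none.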
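To put numbers on that, here is a tiny sketch of the arithmetic. The function names are made up for illustration. A quad-core part with Hyper-Threading presents 8 logical cores to the scheduler, which is the likely reason the figure can exceed 4.

```cpp
#include <thread>

// Hypothetical helper: how many times faster the parallel run was.
double speedup(double serialSeconds, double parallelSeconds) {
    return serialSeconds / parallelSeconds;
}

// Number of hardware threads (logical cores) the runtime reports;
// the standard allows this to be 0 when the value cannot be determined.
unsigned logicalCores() {
    return std::thread::hardware_concurrency();
}
```

With the timings above, speedup(32.9869, 5.27613) comes out at about 6.25: more than the 4 physical cores would suggest, and still short of the 8 logical ones.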