Coding on Intel MIC co-processors (Xeon Phi)
Now that Stampede is in production mode, we have access to the new
(old?) fanciness of the Intel co-processors - MIC (Many Integrated Core, pronounced "Mike"). This is
a card that contains roughly 60 cores (Stampede has a special 61-core version;
retail cards will have 60 cores). These cores run at 1.1 GHz but can process
multiple operations every clock cycle, thanks to their 512-bit SIMD
registers. Here are the things to keep in mind when coding for the MICs:
- The compiler has to be icc; MIC binaries are produced with the -mmic flag on the
compile command line. Note that native CPU binaries cannot be run on the MICs (and vice versa).
- Memory stride is important: declare separate arrays x[N], y[N] (a structure-of-arrays layout). Normally, you might use structures
to encapsulate the data, but an array of structures produces strided memory accesses that defeat vectorization and reduce performance.
- For hybrid programming, data needs to be copied over to the MIC. However, only bitwise-copyable
data can be transferred; i.e., in certain cases structures may have to be unpacked into
individual fields, copied over, and then re-assembled manually on the other side (if desired).
- OpenMP seems to be the easiest way to harness the MIC. Check out this
presentation. In
particular, look at page 26 for the optimal way to offload tasks onto the MIC. According to the workshop
I attended at TACC, pthreads will not work for offloading to the MIC since the MIC libraries are different;
OpenMP is aware of the different libraries and picks the correct one depending on the execution core.
emcee on TACC Stampede
Here are a few things I found out about using emcee (emcee link).
Load-balancing emcee
- By default, the MPIPool class in emcee does not do any load balancing. This is not much
of a problem if the variance in run-time between the walkers is small; however, when
the runtimes vary wildly, load balancing can become an issue. I re-factored some
of the code from here
and added a load-balancing option (set to False by default). Download this utils.py,
replace the emcee utils.py, build emcee again, and you will be all set.
- While on the topic of load balancing: if you have a sense of the run-time based on one (or some) of your parameters,
then you can improve load balancing further by sorting the tasks in descending order of expected runtime, i.e.,
the most expensive evaluations should come first. You can pass this runtime-sorting routine as an argument
when you instantiate the EnsembleSampler class. The code you will need for this functionality is in ensemble.py;
an example that benchmarks these load-balancing routines over MPI is loadbalance.py.
Using these load-balancing options typically improved runtimes by 20-30%; YMMV.
Note: the runtime sorter will improve run-time scaling even without the load-balancing option.