Coding on Intel MIC co-processors (Xeon Phi)

Now that Stampede is in production mode, we have access to the new (old?) fanciness of the Intel co-processor, the MIC (pronounced "Mike"). This is a card containing roughly 60 cores; Stampede has a special 61-core version, while retail cards will have 60. The cores run at 1.1 GHz but can process multiple operations every clock cycle, thanks to their 512-bit SIMD registers. Here are the things to keep in mind when coding for the MICs:
  • The compiler has to be icc; MIC binaries are produced with the -mmic flag on the compile command line. Note that native CPU binaries cannot be run on the MIC (and vice versa).
  • Memory stride is important: declare separate flat arrays, e.g., x[N], y[N]. (Normally you might use structures to encapsulate the data, but such structures will reduce performance on the MIC.)
  • For hybrid programming, data needs to be copied over to the MIC. However, only bit-wise copyable data can be transferred; i.e., in certain cases structures may have to be unpacked into their individual fields, copied over, and then re-assembled manually on the MIC (if desired).
  • OpenMP seems to be the easiest way to harness the MIC. Check out this presentation; in particular, look at page 26 for the optimal way to offload tasks onto the MIC. According to the workshop I attended at TACC, pthreads will not work for offloading to the MIC since the MIC libraries are different; OpenMP is aware of the different libraries and picks the correct one depending on the execution core.
 

emcee on TACC Stampede

Here are a few things I found out about using emcee (emcee link):
  • emcee needs mpi4py to run on multiple nodes. mpi4py is compiled against the MVAPICH2 modules and crashes on runs with more than 64 cores. To get around this, you need to use the python-mpi executable located under the mpi4py directory. Mine is located here:

    /opt/apps/intel13/mvapich2_1_9/mpi4py/1.3/lib/python/mpi4py/bin/python-mpi

  • If you need a unique identifier per process, use the comm.Get_rank() method to get the rank of the process (see the sketch after this list). Since emcee actually spawns new processes over MPI, the process id will (likely) be different for every chi-square evaluation.
  • Right now, there is no easy way to use the MIC co-processors. You will have to compile Python and mpi4py for both the CPU and the MIC, and then use symmetric mode with the TACC script ibrun to run on both the CPU and the MIC.
  • Be aware that there is roughly a 50% overhead for Python, i.e., the actual time spent in the chi-square routines will be roughly half the total queue time.
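To make the above concrete, here is a minimal sketch of an emcee run over MPI that uses comm.Get_rank() as the per-process identifier, with the launch line noted in a comment. It assumes the emcee 2.x API (emcee.utils.MPIPool); the log-file name and the toy chi-square are purely illustrative.

    # Minimal sketch, assuming emcee 2.x (emcee.utils.MPIPool) and mpi4py.
    # Launch with something like:
    #   ibrun /path/to/mpi4py/bin/python-mpi this_script.py
    import sys
    import numpy as np
    from mpi4py import MPI
    import emcee
    from emcee.utils import MPIPool

    # Unique per-process identifier: the MPI rank is fixed for the lifetime
    # of the process, unlike the pid of whatever evaluates the chi-square.
    rank = MPI.COMM_WORLD.Get_rank()
    logfile = "chisq_rank{0:04d}.log".format(rank)   # illustrative file name

    def lnprob(theta):
        # Toy chi-square; replace with the real likelihood.
        chisq = np.sum(theta ** 2)
        with open(logfile, "a") as f:
            f.write("{0:e}\n".format(chisq))
        return -0.5 * chisq

    ndim, nwalkers, nsteps = 4, 32, 100

    pool = MPIPool()
    if not pool.is_master():
        # Worker ranks wait here for tasks from the master.
        pool.wait()
        sys.exit(0)

    p0 = np.random.randn(nwalkers, ndim)
    sampler = emcee.EnsembleSampler(nwalkers, ndim, lnprob, pool=pool)
    sampler.run_mcmc(p0, nsteps)
    pool.close()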
 

Loadbalancing emcee

  • By default, the MPIPool class in emcee does not have any load balancing. This is not much of a problem if the variance in the runtimes of the walkers is small; however, when the runtimes vary wildly, load balancing can become an issue. I re-factored some of the code from here and added a load-balancing option (set to False by default). Download this utils.py, replace the emcee utils.py, build emcee again, and you will be all set.
  • While on the topic of load balancing, if you have a sense of the runtime based on one (or some) of your parameters, then you can improve load balancing further by sorting the tasks in descending order of expected runtime, i.e., the most expensive evaluations should come first. You can pass this runtime-sorting routine as an argument when you instantiate the EnsembleSampler class (see the sketch at the end of this post). The code you will need for this functionality is ensemble.py. An example that benchmarks these load-balancing routines is loadbalance.py (for benchmarking over MPI).
Using these load-balancing options typically improved my runtimes by 20-30%; YMMV. Note that the runtime sorter will also improve the runtime scaling even without load balancing.
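Putting both options together, here is a rough sketch of how I expect the patched code to be used. The keyword names loadbalance and runtime_sortingfn, and the calling convention of the sorting routine, are my assumptions; check the patched utils.py and ensemble.py for the actual interface. The runtime proxy (the first parameter) is purely illustrative.

    # A sketch, assuming the patched utils.py adds a 'loadbalance' keyword to
    # MPIPool and the patched ensemble.py accepts a 'runtime_sortingfn' keyword
    # that is called as "pos, idx = runtime_sortingfn(pos)". Check the
    # downloaded files for the real names and conventions.
    import sys
    import numpy as np
    import emcee
    from emcee.utils import MPIPool

    def sort_by_expected_runtime(pos):
        # Return walker positions sorted so that the most expensive
        # evaluations come first, plus the indices needed to undo the sort.
        # Here the first parameter is pretended to be a good proxy for cost;
        # replace with whatever you actually know about your model's runtime.
        pos = np.atleast_2d(pos)
        cost = pos[:, 0]                      # assumed runtime proxy
        idx = np.argsort(cost)[::-1]          # descending expected cost
        return pos[idx], idx

    def lnprob(theta):
        # Toy likelihood; expensive, highly variable models benefit the most.
        return -0.5 * np.sum(theta ** 2)

    ndim, nwalkers, nsteps = 4, 64, 50

    pool = MPIPool(loadbalance=True)          # assumed keyword (patched utils.py)
    if not pool.is_master():
        pool.wait()
        sys.exit(0)

    p0 = np.random.randn(nwalkers, ndim)
    sampler = emcee.EnsembleSampler(nwalkers, ndim, lnprob, pool=pool,
                                    runtime_sortingfn=sort_by_expected_runtime)
    sampler.run_mcmc(p0, nsteps)
    pool.close()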