MKLAndOpen-MPIOptimizations

In this page we present a way to (heavily) optimize your configuration so that the models in the toolkit compile and run faster. The toolkit is already optimized for general DMRG and matrix product calculations, but for some specific-purpose applications it may be possible to optimize it further. There is therefore no guarantee that these optimizations will work in your case. To be more specific, we have successfully tested the methods below, including linking against the Intel® Math Kernel Library (MKL), for the Toric-Code, the J1-J2 Heisenberg model on the square, triangular, and kagome lattices, plus some similar calculations on frustrated magnets; in these cases the calculations ran up to 4 times faster than with the default configuration.

Why do we need these optimizations at all?

If you have already done some experiments with the toolkit, you may have noticed that the wall-clock time for compiling and running new models and their dependents is sometimes considerably long, especially for bigger codes and larger lattices. One occasionally needs to put hundreds of these jobs into a single script. So if there is any way to improve the processing speed and efficiency, it is definitely worth including.

Note 1: This method, where it can be applied, may only improve the speed of compiling and building new models and NOT necessarily the total CPU time of a DMRG job; however, MKL linking and multi-threading usually lead to a significant speedup of the latter too. It is also quite possible that the MKL optimizations speed up all calculations that depend on the mp-tools.

If you need to speed up calculations even further, probably the best way is to run them in parallel on a grid of processes using a library such as MPI. However, as MPI support in the toolkit source code is still under development, you would need to add MPI code blocks to all the basic programs yourself, which could be tedious, and there is no guarantee that DMRG calculations will get faster if the active cores cannot communicate with each other at certain parts of the code.

An introduction to MKL, OpenMP, and some other basic optimizations for the toolkit

Here we suggest some applications of MKL, OpenMP, and some other basic optimizations that may reduce the running time of calculations with the toolkit.

To get a bit familiar with MKL, we repeat what is stated on the Intel® MKL webpage:

"Intel® Math Kernel Libraries (MKL) accelerates math processing routines that increase application performance and reduce development time. [...] MKL includes highly vectorized and threaded Linear Algebra functions. [...] The easiest way to take advantage of all of that processing power is to use a carefully optimized computing math library. Even the best compiler can’t compete with the level of performance possible from a hand-optimized library."

The general idea behind this optimization is therefore mainly to link the toolkit's BLAS and LAPACK routines against some Intel® Math Kernel Library (MKL) libraries, which are suitable for multi-threaded computations. Additional optimizations include setting the number of threads and the compiler flags. You need to make sure that the cluster or personal machine you are using already has all the default libraries of the toolkit, as explained in the Install page. MKL packages are NOT all free, so make sure that you have access to them through some account.

Step 1: Compiler Flags

In our experience, if you compile a model code in the toolkit and debugging is NOT a consideration, the best customized set of compiler flags is as below:

CXX = g++
CXXFLAGS = -DNDEBUG -O3 -pthread -march=native -flto
F77 = gfortran
FFLAGS = -g -O2

A commercial compiler may make things run even faster than the free and open-source gcc. One suggestion is to use the Intel C/C++ compiler, i.e. icc.
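
As a rough illustration of the g++ flag choice in Step 1, the sketch below picks a flag set by build type. The function name select_cxxflags and the debug flag set (-g -O0 -pthread) are assumptions made for this example and are not part of the toolkit; only the release set is taken from this page.

```shell
# Hypothetical helper (not part of the toolkit): print a CXXFLAGS string
# for the requested build type.
select_cxxflags() {
    case "$1" in
        release)
            # -DNDEBUG       disables assert() checks
            # -O3            aggressive optimization
            # -pthread       POSIX threads support
            # -march=native  generate code for the CPU of the build machine
            # -flto          link-time optimization
            printf '%s\n' "-DNDEBUG -O3 -pthread -march=native -flto"
            ;;
        debug)
            # Assumed debug set: debug symbols, no optimization.
            printf '%s\n' "-g -O0 -pthread"
            ;;
        *)
            printf 'unknown build type: %s\n' "$1" >&2
            return 1
            ;;
    esac
}

# Use the release set for the subsequent configure/make run.
export CXXFLAGS="$(select_cxxflags release)"
```

Note that -march=native targets the machine on which the code is compiled, so on a cluster with mixed hardware the resulting binaries may not run on nodes with older CPUs than the build node.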

Step 2: Linking a hand-picked set of MKL packages

The current list and guides for the available 64-bit MKL libraries of version 10.x can be found on this Intel® webpage. Obviously one doesn't need all of these libraries for a specific-purpose calculation, and linking some of them could even slow down your calculations. In our experience of doing DMRG with the toolkit, the best set of MKL libraries to link (to the already existing BLAS and LAPACK libraries) is as below:

BLAS_LIBS = -L/opt/intel/composerxe/mkl/lib/intel64 -lmkl_lapack95_lp64 -lmkl_blas95_lp64 -lmkl_gf_lp64 -lmkl_gnu_thread -lmkl_core
LIBS = -lpthread -lgomp
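
Before running configure, it can save time to confirm that the libraries named in the link line actually exist on your machine. The following is a minimal sketch under the assumption that each -lNAME corresponds to a file libNAME.so or libNAME.a in the MKL directory; the helper name missing_mkl_libs is hypothetical, and the directory shown in the example call is simply the one used on this page, which may differ on your system.

```shell
# Hypothetical helper: print the name of every listed library that is
# missing from the given directory (one name per line, nothing if all exist).
missing_mkl_libs() {
    mkl_dir=$1
    shift
    for lib in "$@"; do
        # Accept either the shared (.so) or the static (.a) form.
        if [ ! -e "$mkl_dir/lib$lib.so" ] && [ ! -e "$mkl_dir/lib$lib.a" ]; then
            printf '%s\n' "$lib"
        fi
    done
}

# Example call with the path and library names from the link line above:
# missing_mkl_libs /opt/intel/composerxe/mkl/lib/intel64 \
#     mkl_lapack95_lp64 mkl_blas95_lp64 mkl_gf_lp64 mkl_gnu_thread mkl_core
```

An empty output means that every library was found; any printed name points to a library that the linker would also fail to find in that directory.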

Step 3: Multi-threaded calculations with OpenMP

After completing step 2, you can efficiently run a multi-threaded calculation using your desired MKL and OpenMP (OMP) settings. This uses multiple CPU slots to run a job faster on the cluster. To do so, you first need to request a parallel environment in the queuing system that is suitable for multi-threaded (OMP) calculations. A typical example is to add this header line to your script,

#$ -pe threaded 8

Then you need to set the number of threads for your MKL and OMP settings:

export MKL_NUM_THREADS="8"
export OMP_NUM_THREADS="8"
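
To keep the thread count consistent with the number of slots requested in the header, the value can be read from the scheduler instead of being typed twice. This is a minimal sketch, assuming the queuing system is Grid Engine (as the "#$" header prefix suggests), which sets the NSLOTS variable inside a job submitted with "-pe"; the function name set_thread_counts is hypothetical.

```shell
# Hypothetical helper: use the slot count granted by the scheduler for both
# MKL and OpenMP. Outside a job NSLOTS is unset, so fall back to one thread.
set_thread_counts() {
    threads=${NSLOTS:-1}
    export MKL_NUM_THREADS="$threads"
    export OMP_NUM_THREADS="$threads"
}

set_thread_counts
```

With this in the job script, changing the number in the "-pe" line is enough; the two exports follow automatically.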

Note 2: Usually, when the number of jobs you want to run is larger than the total number of cores you have access to, it is best to run all calculations single-core instead.
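
The rule in Note 2 can be written as a small calculation. The sketch below returns one thread per job when there are more jobs than cores; dividing the cores evenly among the jobs in the opposite case is our assumption for this example and is not a tested recommendation. The function name threads_per_job is hypothetical, and the number of jobs is assumed to be at least 1.

```shell
# Hypothetical helper: suggest a thread count per job.
#   $1 = total number of cores available
#   $2 = number of jobs to run (assumed >= 1)
threads_per_job() {
    total_cores=$1
    n_jobs=$2
    # Integer division; the result is 0 when there are more jobs than cores.
    threads=$(( total_cores / n_jobs ))
    if [ "$threads" -lt 1 ]; then
        # More jobs than cores: run every job single-core (Note 2).
        threads=1
    fi
    printf '%s\n' "$threads"
}

# Example: 16 cores shared by 4 jobs.
export OMP_NUM_THREADS="$(threads_per_job 16 4)"
export MKL_NUM_THREADS="$OMP_NUM_THREADS"
```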

How to do all of this at once?

Assuming that you have made all the default settings and performed the toolkit's setup steps correctly, you can apply all of the optimizations mentioned above in a single set of commands. A typical example looks like,

make clean
export CXXFLAGS="-DNDEBUG -O3 -pthread -march=native -flto"
export LIBS="-lpthread -lgomp"
./trunk/configure --bindir=/data/<user-specified path>/bin --with-blas="-L/opt/intel/composerxe/mkl/lib/intel64 -lmkl_lapack95_lp64 -lmkl_blas95_lp64 -lmkl_gf_lp64 -lmkl_gnu_thread -lmkl_core"
make
make install
export MKL_NUM_THREADS="4"
export OMP_NUM_THREADS="4"
export MKL_DISABLE_FAST_MM="1"

The last command, i.e. "export MKL_DISABLE_FAST_MM=1", may be crucial for stabilizing your calculations. It disables MKL's fast memory management, as there are known issues with memory leaking during calculations with MKL (cf. https://software.intel.com/en-us/node/528564).

Feel free to send us your feedback, questions, and comments.

Copyrights of this article: Seyed S. Saadatmand and Ian P. McCulloch @ 2016

Page last modified on January 25, 2016, at 10:01 AM