README.md

# README - QCD Accelerator Benchmarksuite Part 2
##   2017 -  Jacob Finkenrath - CaSToRC - The Cyprus Institute  (j.finkenrath@cyi.ac.cy)

The QCD Accelerator Benchmark suite Part 2 consists of two kernels,
the QUDA [1] and the QPhix library [2]. The library QUDA is based on CUDA and optimize for running on NVIDIA GPUs (https://lattice.github.io/quda/). The QPhix library consists of routines which are optimize to use INTEL intrinsic functions of multiple vector length, including optimized routines for KNC and KNL (http://jeffersonlab.github.io/qphix/).
The benchmark code is used the provided Conjugated Gradient benchmark functions of the libraries.


[1] R. Babbich, M. Clark and B. Joo, “Parallelizing the QUDA Library for Multi-GPU Calculations
in Lattice Quantum Chromodynamics” SC 10 (Supercomputing 2010)

[2] B. Joo, D. D. Kalamkar, K. Vaidyanathan, M. Smelyanskiy, K. Pamnany, V. W. Lee, P. Dubey,
W. Watson III, “Lattice QCD on Intel Xeon Phi”, International Supercomputing Conference (ISC’13), 2013

##  Table of Contents

GPU - BENCHMARK SUITE (QUDA)
```
1. Compile and Run the GPU-Benchmark Suite
1.1. Compile
1.2. Run
1.2.1. Main-script: "run_ana.sh"
1.2.2. Main-script: "prepare_submit_job.sh"
1.2.3. Main-script: "submit_job.sh.template"
1.3. Example Benchmark results
```

XEONPHI - BENCHMARK SUITE (QPHIX)
```
2. Compile and Run the XeonPhi-Benchmark Suite
2.1. Compile
2.1.1. Example compilation on PRACE machines
2.1.1.1. BSC - Marenostrum III Hybrid partitions
2.1.1.2. CINES - Frioul
2.2. Run
2.2.1. Main-script: "run_ana.sh"
2.2.2. Main-script: "prepare_submit_job.sh"
2.2.3. Main-script: "submit_job.sh.template"
2.3. Example Benchmark Results
```


##   GPU - BENCHMARK SUITE
### 1. Compile and Run the GPU-Benchmark Suite

#### 1.1 Compile

Download Cmake and Quda

General information how to build QUDA with cmake can be found under:
"https://github.com/lattice/quda/wiki/Building-QUDA-with-cmake"
Here we just give a short overview:

Build Cmake: (./QCD_Accelerator_Benchmarksuite_Part2/GPUs/src/cmake-3.7.0.tar.gz)

Cmake can be downloaded from the source with the URL: https://cmake.org/download/
In this guide the version cmake-3.7.0 is used. The build instruction can be found
in the main directory under README.rst . Use the configure file `./configure` .
Then run `gmake`.

Build Quda: (./QCD_Accelerator_Benchmarksuite_Part2/GPUs/src/quda.tar.gz)

Download quda for example by using `git clone https://github.com/lattice/quda.git`.
Create a build-folder. Execute the executable `cmake` in the build-folder which
is located in the cmake/bin.
Execute:

``` shell
./$PATH2CMAKE/cmake $PATH2QUDA -DQUDA_GPU_ARCH=sm_XX -DQUDA_DIRAC_WILSON=ON -DQUDA_DIRAC_TWISTED_MASS=OFF
-DQUDA_DIRACR_DOMAIN_WALL=OFF -DQUDA_HISQ_LINK=OFF -DQUDA_GAUGE_FORCE=OFF -DQUDA_HISQ_FORCE=OFF -DQUDA_MPI=ON
```

with

``` shell
  PATH2CMAKE= path to the cmake-executable
  PAT2QUDA= path to the home dir of QUDA
```

Set `-DQUDA_GPU_ARCH=sm_XX` to the GPU Architecture (`sm_60` for Pascals, `sm_35` for Keplers)

If Cmake or the compilation fails library paths and options can be set by the cmake provided function "ccmake".
Use `./PATH2CMAKE/ccmake PATH2BUILD_DIR` to edit and to see the availble options.
Cmake generates the Makefiles. Run them by use `make`.
Now in the folder /test one can find the needed Quda executable "invert_".

####  1.2 Run


The Accelerator QCD-Benchmarksuite Part 2 provides bash-scripts located in the folder ./QCD_Accelerator_Benchmarksuite_Part2/GPUs/scripts" to setup the benchmark runs
on the target machines. This bash-scripts are:

 - `run_ana.sh`              :   Main-script, set up the benchmark mode and submit the jobs (analyse the results)
 - `prepare_submit_job.sh`   :   Generate the job-scripts
 - `submit_job.sh.template`  :   Template for submit script

##### 1.2.1 Main-script: "run_ana.sh"

The path to the executable has to be set by $PATH2EXE .
QUDA automaticaly tune the GPU-kernels. The optimal setup will be saved in
the folder which one declares by the variable `QUDA_RESOURCE_PATH`. Set it to
folder where the tuning data should be saved.
Different scaling modes can be choose from Strong-scaling to Weak scaling
by using the variables sca_mode (="Strong" or ="Weak").
The lattice sizes can be set by "gx" and "gt".
Choose mode="Run" for run mode while mode="Analysis" for extracting the GFLOPS.
Note that the submition is done here by "sbatch", match this to the queing system on
your target machine.

##### 1.2.2 Main-script: "prepare_submit_job.sh"

Add additional option if necessary.

##### 1.2.3 Main-script: "submit_job.sh.template"

The submit-template will be edit by `prepare_submit_job.sh` to generate
the final submit-script. The header should be matched to the queing system
of the target machine.

The Accelerator QCD-Benchmarksuite Part 2 provides bash-scripts to setup the benchmark runs
on the target machines. This bash-scripts are:

#### 1.3 Example Benchmark results

Here are shown the benchmark results on PizDaint located in Switzerland at CSCS
and the GPGPU-partition of Cartesius at Surfsara based in Netherland, Amsterdam. The runs are performed by using the
provided bash-scripts. PizDaint has one Pascal-GPU per node and two different testcases are shown,
the "Strong-Scaling mode with a random lattice configuration of size 32^3x96 and
a "Weak-Scaling" mode with a configuration of local lattice size 48^3x24.
The GPGPU nodes of Cartesius has two Kepler-GPU per node and the "Strong-Scaling" test is shown for the case
that one card per node and two cards per node are used.
The benchmark are done by using the Conjugated Gradient solver which
solve a linear equation, D * x = b, for the unknown solution "x" based on the clover improved Wilson Dirac operator
"D" and a known right hand side "b".

```
---------------------
  PizDaint - Pascal
---------------------
Strong - Scaling:
global lattice size (32x32x32x96)

sloppy-precision: single
       precision: single

GPUs     GFLOPS      sec
1    786.520000 4.569600
2   1522.410000 3.086040
4   2476.900000 2.447180
8   3426.020000 2.117580
16  5091.330000 1.895790
32  8234.310000 1.860760
64  8276.480000 1.869230

sloppy-precision: double
       precision: double

GPUs     GFLOPS      sec
1    385.965000 6.126730
2    751.227000 3.846940
4   1431.570000 2.774470
8   1368.000000 2.367040
16  2304.900000 2.071160
32  4965.480000 2.095180
64  2308.850000 2.005110


Weak - Scaling:
local lattice size (48x48x48x24)

sloppy-precision: single
       precision: single

GPUs     GFLOPS      sec
1     765.967000 3.940280
2    1472.980000 4.004630
4    2865.600000 4.044360
8    5421.270000 4.056410
16   9373.760000 7.396590
32  17995.100000 4.243390
64  27219.800000 4.535410

sloppy-precision: double
       precision: double

GPUs    GFLOPS      sec
 1   376.611000 5.108900
 2   728.973000 5.190880
 4  1453.500000 5.144160
 8  2884.390000 5.207090
16  5004.520000 5.362020
32  8744.090000 5.623290
64  14053.00000 5.910520
```

```
---------------------
  SurfSara - Kepler
---------------------
##
## 1 GPU per Node
##

Strong - Scaling:
global lattice size (32x32x32x96)

sloppy-precision: single
       precision: single
GPUs    GFLOPS      sec
1      243.084000 4.030000
2      478.179000 2.630000
4      939.953000 2.250000
8     1798.240000 1.570000
16    3072.440000 1.730000
32    4365.320000 1.310000

sloppy-precision: double
       precision: double

GPUs    GFLOPS      sec
1      119.786000 6.060000
2      234.179000 3.290000
4      463.594000 2.250000
8      898.090000 1.960000
16    1604.210000 1.480000
32    2420.130000 1.630000

##
## 2 GPU per Node
##

Strong - Scaling:
global lattice size (32x32x32x96)

sloppy-precision: single
       precision: single

GPUs    GFLOPS      sec
2      463.041000 2.720000
4      896.707000 1.940000
8     1672.080000 1.680000
16    2518.240000 1.420000
32    3800.970000 1.460000
64    4505.440000 1.430000

sloppy-precision: double
       precision: double

GPUs    GFLOPS      sec
2     229.579000 3.380000
4     450.425000 2.280000
8     863.117000 1.830000
16   1348.760000 1.510000
32   1842.560000 1.550000
64   2645.590000 1.480000
```

##   XEONPHI - BENCHMARK SUITE
### 2. Compile and Run the XeonPhi-Benchmark Suite

Unpack the provided source tar-file located in `./QCD_Accelerator_Benchmarksuite_Part2/XeonPhi/src` or
clone the actual git-hub branches of the code
packages QMP:

``` shell
git clone https://github.com/usqcd-software/qmp
```

and for QPhix

``` shell
git clone https://github.com/JeffersonLab/qphix
```

Note that for running on Skylake chips it is recommended to utilize
the branch develop of QPhix which needs additional packages
like qdp++ (Status 04/2019).

#### 2.1 Compile

The QPhix library is based on QMP communication functions.
For that QMP has to be setup first.

``` shell
./configure --prefix=$QMP_INSTALL_DIR CC=mpiicc CFLAGS=" -mmic/-xAVX512 -std=c99" --with-qmp-comms-type=MPI --host=x86_64-linux-gnu --build=none-none-none
```

Create the Install folder and link with `$QMP_INSTALL_DIR` to it.
Use the compilerflag  `-mmic` for the compilation for KNC's
while use `-xAVX512` for the compilation for KNL's.
Then use
``` shell
make
make install
```

to compile and setup the necessary source files in `$QMP_INSTALL_DIR`.

The QPhix executable can be compiled by using:
for KNC's

``` shell
./configure --enable-parallel-arch=parscalar --enable-proc=MIC --enable-soalen=8 --enable-clover --enable-openmp --enable-cean --enable-mm-malloc CXXFLAGS="-openmp -mmic -vec-report -restrict -mGLOB_default_function_attrs=\"use_gather_scatter_hint=off\" -g -O2 -finline-functions -fno-alias -std=c++0x" CFLAGS="-mmic -vec-report -restrict -mGLOB_default_function_attrs=\"use_gather_scatter_hint=off\" -openmp -g  -O2 -fno-alias -std=c9l9" CXX=mpiicpc CC=mpiicc --host=x86_64-linux-gnu --build=none-none-none --with-qmp=$QMP_INSTALL_DIR
```

or for KNL's

``` shell
./configure --enable-parallel-arch=parscalar --enable-proc=AVX512 --enable-soalen=8 --enable-clover --enable-openmp --enable-cean --enable-mm-malloc CXXFLAGS="-qopenmp -xMIC-AVX512 -g -O3 -std=c++14" CFLAGS="-xMIC-AVX512 -qopenmp -O3 -std=c99" CXX=mpiicpc CC=mpiicc --host=x86_64-linux-gnu --build=none-none-none --with-qmp=$QMP_INSTALL_DIR
```

by using the previous variable `QMP_INSTALL_DIR` which links to the install-folder
of QMP. The executable `time_clov_noqdp` can be found now in the subfolder `./qphix/test`.


Note for the develop branch the package qdp++ has to be compiled.
QDP++ can be configure using (here for skylake chip)

``` shell
./configure --with-qmp=$QMP_INSTALL_DIR --enable-parallel-arch=parscalar CC=mpiicc CFLAGS="-xCORE-AVX512 -mtune=skylake-avx512 -std=c99" CXX=mpiicpc CXXFLAGS="-axCORE-AVX512 -mtune=skylake-avx512 -std=c++14 -qopenmp" --enable-openmp --host=x86_64-linux-gnu --build=none-none-none --prefix=$QDPXX_INSTALL_DIR
```

Now QPhix executable can be compiled by using:


``` shell
cmake -DQDPXX_DIR=$QDP_INSTALL_DIR -DQMP_DIR=$QMP_INSTALL_DIR -Disa=avx512 -Dparallel_arch=parscalar -Dhost_cxx=mpiicpc -Dhost_cxxflags="-std=c++17 -O3 -axCORE-AVX512 -mtune=skylake-avx512" -Dtm_clover=ON -Dtwisted_mass=ON -Dtesting=ON -DCMAKE_CXX_COMPILER=mpiicpc -DCMAKE_CXX_FLAGS="-std=c++17 -O3 -axCORE-AVX512 -mtune=skylake-avx512" -DCMAKE_C_COMPILER=mpiicc -DCMAKE_C_FLAGS="-std=c99 -O3 -axCORE-AVX512 -mtune=skylake-avx512" ..
```

The executable `time_clov_noqdp` can be found now in the subfolder `./qphix/test`.

##### 2.1.1 Example compilation on PRACE machines

In the subsection we provide some example compilation on PRACE machines
which where used to develop the QCD Benchmarksuite 2.

###### 2.1.1.1 BSC - Marenostrum III Hybrid partitions

The Hybrid partition on Marenostrum are equiped with KNC's.
First following modules were loaded

``` shell
module unload openmpi
module load impi
```

and the necessary links are set with

``` shell
source /opt/intel/impi/4.1.1.036/bin64/mpivars.sh
source /opt/intel/2013.5.192/composer_xe_2013.5.192/bin/compilervars.sh intel64
export I_MPI_MIC=enable
export I_MPI_HYDRA_BOOTSTRAP=ssh
```

The QMP-library was configured and compiled with

``` shell
./configure --prefix=$QMP_INSTALL_DIR CC=mpiicc CFLAGS="-mmic -std=c99" --with-qmp-comms-type=MPI --host=x86_64-linux-gnu --build=none-none-none
make
make install
```

Now the package QPhix is compilled with

``` shell
./configure --enable-parallel-arch=parscalar --enable-proc=MIC --enable-soalen=8 --enable-clover --enable-openmp --enable-cean --enable-mm-malloc CXXFLAGS="-openmp -mmic -vec-report -restrict -mGLOB_default_function_attrs=\"use_gather_scatter_hint=off\" -g -O2 -finline-functions -fno-alias -std=c++0x" CFLAGS="-mmic -vec-report -restrict -mGLOB_default_function_attrs=\"use_gather_scatter_hint=off\" -openmp -g  -O2 -fno-alias -std=c9l9" CXX=mpiicpc CC=mpiicc --host=x86_64-linux-gnu --build=none-none-none --with-qmp=$QMP_INSTALL_DIR
make
```

###### 2.1.1.2 CINES - Frioul

On a test cluster at the CINES-side the Benchmarksuite was tested on KNL's.
The steps are similar to BSC. First the libraries paths are set with

``` shell
source /opt/software/intel/composer_xe_2015/bin/compilervars.sh intel64
source /opt/software/intel/impi_5.0.3/bin64/mpivars.sh
```

The QMP was compiled by using:

``` shell
./configure --prefix=$QMP_INSTALL_DIR CC=mpiicc CFLAGS="-xMIC-AVX512 -mGLOB_default_function_attrs="use_gather_scatter_hint=off" -openmp -g  -O2 -fno-alias -std=c99"  --with-qmp-comms-type=MPI --host=x86_64-linux-gnu --build=none-none-none
make
make install
```

The QPhix was configured and compiled by using

``` shell
./configure --enable-parallel-arch=parscalar --enable-proc=AVX512 --enable-soalen=8 --enable-clover --enable-openmp --enable-cean --enable-mm-malloc CXXFLAGS="-qopenmp -xMIC-AVX512 -g -O3 -std=c++14" CFLAGS="-xMIC-AVX512 -qopenmp -O3 -std=c99" CXX=mpiicpc CC=mpiicc --host=x86_64-linux-gnu --build=none-none-none --with-qmp=/home/finkenrath/benchmark/qmp/install
make
```

####  2.2 Run

The Accelerator QCD-Benchmarksuite Part 2 provides bash-scripts to setup the benchmark runs
on the target machines. This bash-scripts are:

 - `run_ana.sh`              :   Main-script, set up the bechmark mode and submit the jobs (analyse the results)
 - `prepare_submit_job.sh`   :   Generate the job-scripts
 - `submit_job.sh.template`  :   Template for submit script

##### 2.2.1 Main-script: "run_ana.sh"

The path to the executable has to be set by $PATH2EXE .
Different scaling modes can be choose from Strong-scaling to Weak scaling
by using the variables sca_mode (="Strong" or ="Weak").
The lattice sizes can be set by "gx" and "gt".
Choose mode="Run" for run mode while mode="Analysis" for extracting the GFLOPS.
Note that the submition is done by "sbatch" match this to the queing system on
your target machine.

##### 2.2.2 Main-script: "prepare_submit_job.sh"

Add additional option if necessary.

##### 2.2.3 Main-script: "submit_job.sh.template"

The submit-template will be edit by `prepare_submit_job.sh` to generate
the final submit-script. The header should be matched to the quening system
of the target machine.


#### 2.3 Example Benchmark Results

The benchmark results for the XeonPhi benchmark suite are performed on
Frioul, a test cluster at CINES, and the hybrid partion on MareNostrum III at BSC.
Frioul has one KNL-card per node while the hybrid partion of MareNostrum III is
equiped with two KNCs per node. The data on Frioul are generated by using
the bash-scripts provided by the QCD-Accelerator Benchmarksute Part 2
and are done for the two test cases "Strong-Scaling" with a lattice size
of 32^3x96 and "Weak-scaling" with a local lattice size of 48^3x24 per
card. In case of the data generated at MareNostrum, data for the "Strong-Scaling"
mode on a 32^3x96 lattice are shown. The Benchmark is using a random gauge configuration and uses the
Conjugated Gradient solver to solve a linear equation involving the clover Wilson Dirac operator.

```
---------------------
  Frioul - KNLs
---------------------
Strong - Scaling:
global lattice size (32x32x32x96)

precision: single

KNLs     GFLOPS
1       340.75
2       627.612
4      1111.13
8      1779.34
16     2410.8

precision: double

KNLs     GFLOPS
1      328.149
2      616.467
4      1047.79
8      1616.37

Weak - Scaling:
local lattice size (48x48x48x24)

precision: single

KNLs   GFLOPS
1       348.304
2       616.697
4      1214.82
8      2425.45
16     4404.63

precision: double

KNLs   GFLOPS
 1      172.303
 2      320.761
 4      629.79
 8     1228.77
16     2310.63
```

```
---------------------
  MareNostrum III - KNC's
---------------------

Strong - Scaling:
global lattice size (32x32x32x96)

precision: single - 1 Cards per Node

KNCs  GFLOPS
2    103.561
4    200.159
8    338.276
16   534.369
32   815.896

precision: single - 2 Cards per Node

KNCs  GFLOPS
4    118.995
8    212.558
16   368.196
32   605.882
64   847.566
```