# README - QCD UEABS Part 2

**2021 - Jacob Finkenrath - CaSToRC - The Cyprus Institute (j.finkenrath@cyi.ac.cy)**

Part 2 of the QCD kernels of the Unified European Applications Benchmark Suite (UEABS), http://www.prace-ri.eu/ueabs/, was developed under the accelerator benchmark suite task within the 4th implementation phase of PRACE and has been part of the UEABS kernels since PRACE 5IP as UEABS QCD part 2.

Part 2 consists of two kernels: one based on QUDA [^quda] and one based on the QPhiX library [^qphix].

[^quda]: R. Babich, M. Clark and B. Joo, "Parallelizing the QUDA Library for Multi-GPU Calculations in Lattice Quantum Chromodynamics".

[^qphix]: B. Joo, D. D. Kalamkar, K. Vaidyanathan, M. Smelyanskiy, K. Pamnany, V. W. Lee, P. Dubey et al., "Lattice QCD on Intel Xeon Phi Coprocessors".

The QUDA library is based on CUDA and optimized for running on NVIDIA GPUs (https://lattice.github.io/quda/). Currently, a HIP version and a generic version are under development, which will make the library usable on AMD GPUs and, once ready, on CPU architectures such as ARM. The generic QUDA kernel might replace the computational kernel of QPhiX in the future. The QPhiX library consists of routines developed for the Intel Xeon Phi architecture and also runs on general x86 architectures. QPhiX is optimized to use Intel intrinsics for several vector lengths, including AVX-512 (http://jeffersonlab.github.io/qphix/). In general, the benchmark kernels apply the Conjugate Gradient solver to the Wilson Dirac operator, a four-dimensional stencil.

## Table of Contents

[TOC]

## GPU - Kernel

### 1. Compile and Run the GPU-Benchmark Suite

#### 1.1 Compile

Clone `quda` via

```shell
git clone https://github.com/lattice/quda.git
```

and build `quda` via

```shell
cd quda
mkdir build
cd build
# configure with CMake; set the GPU architecture as described below
cmake -DQUDA_GPU_ARCH=sm_60 ..
```

Set `-DQUDA_GPU_ARCH=sm_XX` to the GPU architecture (`sm_60` for Pascal, `sm_35` for Kepler). If CMake or the compilation fails, library paths and options can be set with the CMake-provided tool `ccmake`. Use `./PATH2CMAKE/ccmake PATH2BUILD_DIR` to edit and to see the available options. CMake generates the Makefiles; run them with `make`. The needed executable, `invert_test`, can then be found in the folder `tests`.

#### 1.2 Run

The Accelerator QCD-Benchmarksuite Part 2 provides bash scripts, located in the folder `./QCD_Accelerator_Benchmarksuite_Part2/GPUs/scripts`, to set up the benchmark runs on the target machines. These bash scripts are:

- `run_ana.sh` : Main script, sets up the benchmark mode and submits the jobs (and analyses the results)
- `prepare_submit_job.sh` : Generates the job scripts
- `submit_job.sh.template` : Template for the submit script

##### 1.2.1 Main script: "run_ana.sh"

The path to the executable has to be set via `$PATH2EXE`. QUDA automatically tunes the GPU kernels; the optimal setup is saved in the folder declared by the variable `QUDA_RESOURCE_PATH`. Set it to the folder where the tuning data should be saved. Different scaling modes can be chosen, from strong scaling to weak scaling, via the variable `sca_mode` (`="Strong"` or `="Weak"`). The lattice sizes can be set by `gx` and `gt`. Choose `mode="Run"` for the run mode and `mode="Analysis"` for extracting the GFLOPS. Note that the submission is done here via `sbatch`; match this to the queuing system on your target machine.

##### 1.2.2 Job-script generator: "prepare_submit_job.sh"

Add additional options if necessary.

##### 1.2.3 Submit template: "submit_job.sh.template"

The submit template is edited by `prepare_submit_job.sh` to generate the final submit script. The header should be matched to the queuing system of the target machine.
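As an illustration, a minimal SLURM-style header of the kind such a template may contain is sketched below. The job name, partition, resource values and tuning-cache path are placeholders and must be adapted to the target machine and scheduler.

```shell
#!/bin/bash
#SBATCH --job-name=qcd_gpu_bench      # placeholder job name
#SBATCH --nodes=1                     # adjusted per scaling step by prepare_submit_job.sh
#SBATCH --ntasks-per-node=1           # e.g. one MPI rank per GPU
#SBATCH --time=00:30:00
#SBATCH --partition=gpu               # placeholder: machine-specific partition name

# folder where QUDA stores its kernel-tuning data (see section 1.2.1)
export QUDA_RESOURCE_PATH=$PWD/tune_cache
```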
#### 1.3 Example Benchmark Results

Shown here are benchmark results from Piz Daint, located at CSCS in Switzerland, and from the GPGPU partition of Cartesius at SURFsara in Amsterdam, Netherlands. The runs were performed using the provided bash scripts. Piz Daint has one Pascal GPU per node, and two test cases are shown: a "Strong-Scaling" mode with a random lattice configuration of size 32x32x32x96 and a "Weak-Scaling" mode with a configuration of local lattice size 48x48x48x24. The GPGPU nodes of Cartesius have two Kepler GPUs per node, and the "Strong-Scaling" test is shown for one card per node and for two cards per node. The benchmarks use the Conjugate Gradient solver, which solves the linear equation D * x = b for the unknown solution x, with the clover-improved Wilson Dirac operator D and a known right-hand side b.

```
---------------------
PizDaint - Pascal
---------------------

Strong - Scaling: global lattice size (32x32x32x96)

sloppy-precision: single
precision:        single

GPUs    GFLOPS          sec
1       786.520000      4.569600
2       1522.410000     3.086040
4       2476.900000     2.447180
8       3426.020000     2.117580
16      5091.330000     1.895790
32      8234.310000     1.860760
64      8276.480000     1.869230

sloppy-precision: double
precision:        double

GPUs    GFLOPS          sec
1       385.965000      6.126730
2       751.227000      3.846940
4       1431.570000     2.774470
8       1368.000000     2.367040
16      2304.900000     2.071160
32      4965.480000     2.095180
64      2308.850000     2.005110

Weak - Scaling: local lattice size (48x48x48x24)

sloppy-precision: single
precision:        single

GPUs    GFLOPS          sec
1       765.967000      3.940280
2       1472.980000     4.004630
4       2865.600000     4.044360
8       5421.270000     4.056410
16      9373.760000     7.396590
32      17995.100000    4.243390
64      27219.800000    4.535410

sloppy-precision: double
precision:        double

GPUs    GFLOPS          sec
1       376.611000      5.108900
2       728.973000      5.190880
4       1453.500000     5.144160
8       2884.390000     5.207090
16      5004.520000     5.362020
32      8744.090000     5.623290
64      14053.00000     5.910520
```

```
---------------------
SurfSara - Kepler
---------------------

##
## 1 GPU per Node
##

Strong - Scaling: global lattice size (32x32x32x96)

sloppy-precision: single
precision:        single

GPUs    GFLOPS          sec
1       243.084000      4.030000
2       478.179000      2.630000
4       939.953000      2.250000
8       1798.240000     1.570000
16      3072.440000     1.730000
32      4365.320000     1.310000

sloppy-precision: double
precision:        double

GPUs    GFLOPS          sec
1       119.786000      6.060000
2       234.179000      3.290000
4       463.594000      2.250000
8       898.090000      1.960000
16      1604.210000     1.480000
32      2420.130000     1.630000

##
## 2 GPU per Node
##

Strong - Scaling: global lattice size (32x32x32x96)

sloppy-precision: single
precision:        single

GPUs    GFLOPS          sec
2       463.041000      2.720000
4       896.707000      1.940000
8       1672.080000     1.680000
16      2518.240000     1.420000
32      3800.970000     1.460000
64      4505.440000     1.430000

sloppy-precision: double
precision:        double

GPUs    GFLOPS          sec
2       229.579000      3.380000
4       450.425000      2.280000
8       863.117000      1.830000
16      1348.760000     1.510000
32      1842.560000     1.550000
64      2645.590000     1.480000
```

## Xeon(Phi) Kernel

### 2. Compile and Run the Xeon(Phi) Part

QPhiX currently requires additional third-party libraries, namely the USQCD libraries `qmp`, `qdpxx`, `qio`, `xpath_reader`, `c-lime` and the XML library `libxml2`. The USQCD libraries can be found under

```shell
https://github.com/usqcd-software
```

and `libxml2` under

```shell
http://xmlsoft.org/
```

while the repository of QPhiX is hosted by Jefferson Lab.
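For convenience, the listed USQCD repositories can be fetched in one pass. The following is only a sketch of where the sources live; the actual build order and the expected directory layout (e.g. cloning `qio` and `c-lime` into the `other_libs` subfolders) are described in section 2.1.

```shell
# Sketch only: clone the USQCD dependencies named above from the
# usqcd-software organisation. libxml2 is distributed separately
# (http://xmlsoft.org/). See section 2.1 for where qdpxx expects
# qio, xpath_reader and c-lime to be placed.
for repo in qmp qdpxx qio xpath_reader c-lime; do
    git clone "https://github.com/usqcd-software/${repo}.git"
done
```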
The QPhiX sources can be cloned via

```shell
git clone https://github.com/JeffersonLab/qphix
```

#### 2.1 Compile

The QPhiX library is based on the QMP communication functions, so QMP has to be set up first. Note that you might have to regenerate the configure script using `autoreconf` from Autotools. QMP can be configured using:

```shell
./configure --prefix=$QMP_INSTALL_DIR CC=mpiicc CFLAGS="-xHOST -std=c99" \
    --with-qmp-comms-type=MPI --host=x86_64-linux-gnu --build=none-none-none
```

Create the install folder and point `$QMP_INSTALL_DIR` to it. Then use

```shell
make
make install
```

to compile QMP and install the necessary files in `$QMP_INSTALL_DIR`.

The current master branch of QPhiX requires the package `qdp++` (`qdpxx`), which has sub-dependencies given by `qio`, `xpath_reader`, `c-lime` and `libxml2`. QDP++ can be configured using (here for a Skylake chip)

```shell
./configure --with-qmp=$QMP_INSTALL_DIR --enable-parallel-arch=parscalar \
    CC=mpiicc CFLAGS="-xCORE-AVX512 -mtune=skylake-avx512 -std=c99" \
    CXX=mpiicpc CXXFLAGS="-axCORE-AVX512 -mtune=skylake-avx512 -std=c++14 -qopenmp" \
    --enable-openmp --host=x86_64-linux-gnu --build=none-none-none \
    --prefix=$QDPXX_INSTALL_DIR --disable-filedb
```

The configure script searches for the additional USQCD libraries in the subfolder `./other_libs`. Clone the required libraries there, for example

```shell
git clone https://github.com/usqcd-software/qio.git
```

and reconfigure. Note that on systems like JSC's JUWELS, `libxml2` has to be compiled additionally; its path can be passed to the configuration of `qdpxx` via

```shell
--with-libxml2=$LIBXML2_INSTALL_DIR
```

Now the QPhiX benchmark kernels can be compiled via CMake. Create a build folder and run

```shell
mkdir build
cd build
cmake -DQDPXX_DIR=$QDPXX_INSTALL_DIR -DQMP_DIR=$QMP_INSTALL_DIR \
    -Disa=avx512 -Dparallel_arch=parscalar \
    -Dhost_cxx=mpiicpc -Dhost_cxxflags="-std=c++17 -O3 -axCORE-AVX512 -mtune=skylake-avx512" \
    -Dtm_clover=ON -Dtwisted_mass=ON -Dtesting=ON \
    -DCMAKE_CXX_COMPILER=mpiicpc -DCMAKE_CXX_FLAGS="-std=c++17 -O3 -axCORE-AVX512 -mtune=skylake-avx512" \
    -DCMAKE_C_COMPILER=mpiicc -DCMAKE_C_FLAGS="-std=c99 -O3 -axCORE-AVX512 -mtune=skylake-avx512" ..
```

The executable `time_clov_noqdp` can then be found in the subfolder `tests`. Note that in the current version the compilation of other test kernels can fail. If that is the case, directly compile the needed executable via

```shell
cd tests
make time_clov_noqdp
```

QPhiX was developed to utilize the computational potential of Intel's Xeon Phi architecture, which has been discontinued. Earlier versions of QPhiX for the Intel Xeon Phi architecture can be compiled without `qdpxx` using a configure-based build.
Namely, for KNCs:

```shell
./configure --enable-parallel-arch=parscalar --enable-proc=MIC --enable-soalen=8 \
    --enable-clover --enable-openmp --enable-cean --enable-mm-malloc \
    CXXFLAGS="-openmp -mmic -vec-report -restrict -mGLOB_default_function_attrs=\"use_gather_scatter_hint=off\" -g -O2 -finline-functions -fno-alias -std=c++0x" \
    CFLAGS="-mmic -vec-report -restrict -mGLOB_default_function_attrs=\"use_gather_scatter_hint=off\" -openmp -g -O2 -fno-alias -std=c99" \
    CXX=mpiicpc CC=mpiicc --host=x86_64-linux-gnu --build=none-none-none --with-qmp=$QMP_INSTALL_DIR
```

or for KNLs:

```shell
./configure --enable-parallel-arch=parscalar --enable-proc=AVX512 --enable-soalen=8 \
    --enable-clover --enable-openmp --enable-cean --enable-mm-malloc \
    CXXFLAGS="-qopenmp -xMIC-AVX512 -g -O3 -std=c++14" CFLAGS="-xMIC-AVX512 -qopenmp -O3 -std=c99" \
    CXX=mpiicpc CC=mpiicc --host=x86_64-linux-gnu --build=none-none-none --with-qmp=$QMP_INSTALL_DIR
```

using the variable `QMP_INSTALL_DIR` introduced above, which points to the install folder of QMP. The executable `time_clov_noqdp` can then be found in the subfolder `./qphix/tests`.

##### 2.1.1 Example compilations on PRACE machines

In this subsection we provide some example compilations on PRACE machines that were used to develop the QCD Benchmark Suite Part 2.

###### 2.1.1.1 JSC - JUWELS

JUWELS (Cluster Module) at the Juelich Supercomputing Centre is equipped with Intel Skylake chips, namely 2× Intel Xeon Platinum 8168 CPUs (2× 24 cores, 2.7 GHz) per compute node. The compilation was done using the following software setup (status 07/21):

```shell
ml Intel/2020.2.254-GCC-9.3.0 IntelMPI/2019.8.254 imkl/2020.4.304 Autotools CMake
```

(`ml` is a shortcut for `module load`). `qmp` was built via

```shell
git clone https://github.com/usqcd-software/qmp.git
cd qmp
autoreconf
./configure --prefix=${PWD}/install CC=mpiicc CFLAGS="-xHOST -std=c99 -O3" \
    --with-qmp-comms-type=MPI --host=x86_64-linux-gnu
make
make install
```

`qdpxx` requires `libxml2`, which was built via

```shell
git clone https://gitlab.gnome.org/GNOME/libxml2.git
cd libxml2
./autogen.sh
./configure --prefix=${PWD}/install
make
make install
```

For `qdpxx` the additionally required libraries were added to the subfolder `other_libs` by

```shell
git clone https://github.com/usqcd-software/qdpxx.git
cd qdpxx/other_libs
git clone https://github.com/usqcd-software/xpath_reader.git
cd xpath_reader/
autoreconf
cd ..
git clone https://github.com/usqcd-software/qio.git
cd qio
autoreconf
cd other_libs/
git clone https://github.com/usqcd-software/c-lime.git
# note that the path of c-lime is ./qdpxx/other_libs/qio/other_libs/c-lime
cd c-lime/
autoreconf
```

Now, `qdpxx` was compiled via

```shell
./configure --with-libxml2=${PWD}/../libxml2/install \
    --with-qmp=/p/project/cecy00/ecy00a/src_bench/qdpxx/../qmp/install \
    --enable-openmp --enable-parallel-arch=parscalar \
    CC=mpiicc CFLAGS="-xHOST -std=c99 -qopenmp" CXX=mpiicpc CXXFLAGS="-xHOST -std=c++11 -qopenmp" \
    --prefix=/p/project/cecy00/ecy00a/src_bench/qdpxx/install --disable-filedb
make
make install
```

Note that `configure` might require `xml2-config`, for which the path can be set via `export PATH+=:$PATH_2_LIBXML2`.

Finally, the computational kernels of QPhiX can be built via

```shell
git clone https://github.com/JeffersonLab/qphix.git
cd qphix/
mkdir build
cd build
cmake -DQDPXX_DIR=${PWD}/../../qdpxx/install -DQMP_DIR=${PWD}/../../qmp/install \
    -Disa=avx512 -Dparallel_arch=parscalar \
    -Dhost_cxx=icpc -Dhost_cxxflags="-std=c++17 -O3 -xHOST" \
    -Dtm_clover=ON -Dtwisted_mass=ON -Dtesting=ON \
    -DCMAKE_CXX_COMPILER=mpiicpc -DCMAKE_CXX_FLAGS="-std=c++17 -O3 -xHOST" \
    -DCMAKE_C_COMPILER=mpiicc -DCMAKE_C_FLAGS="-std=c99 -O3 -xHOST" \
    -DCMAKE_INSTALL_PREFIX=${PWD}/../install ..
cd tests
make time_clov_noqdp
```

###### 2.1.1.2 BSC - MareNostrum III Hybrid partition

The Hybrid partition of MareNostrum III was equipped with KNCs (status 2016). First, the following modules were loaded

```shell
module unload openmpi
module load impi
```

and the necessary links were set with

```shell
source /opt/intel/impi/4.1.1.036/bin64/mpivars.sh
source /opt/intel/2013.5.192/composer_xe_2013.5.192/bin/compilervars.sh intel64
export I_MPI_MIC=enable
export I_MPI_HYDRA_BOOTSTRAP=ssh
```

The QMP library was configured and compiled with

```shell
./configure --prefix=$QMP_INSTALL_DIR CC=mpiicc CFLAGS="-mmic -std=c99" \
    --with-qmp-comms-type=MPI --host=x86_64-linux-gnu --build=none-none-none
make
make install
```

Now the QPhiX package is compiled with

```shell
./configure --enable-parallel-arch=parscalar --enable-proc=MIC --enable-soalen=8 \
    --enable-clover --enable-openmp --enable-cean --enable-mm-malloc \
    CXXFLAGS="-openmp -mmic -vec-report -restrict -mGLOB_default_function_attrs=\"use_gather_scatter_hint=off\" -g -O2 -finline-functions -fno-alias -std=c++0x" \
    CFLAGS="-mmic -vec-report -restrict -mGLOB_default_function_attrs=\"use_gather_scatter_hint=off\" -openmp -g -O2 -fno-alias -std=c99" \
    CXX=mpiicpc CC=mpiicc --host=x86_64-linux-gnu --build=none-none-none --with-qmp=$QMP_INSTALL_DIR
make
```

###### 2.1.1.3 CINES - Frioul

On a test cluster at CINES, the benchmark suite was tested on KNLs (status 2018). The steps are similar to those on BSC.
First, the library paths are set with

```shell
source /opt/software/intel/composer_xe_2015/bin/compilervars.sh intel64
source /opt/software/intel/impi_5.0.3/bin64/mpivars.sh
```

QMP was compiled using:

```shell
./configure --prefix=$QMP_INSTALL_DIR \
    CC=mpiicc CFLAGS="-xMIC-AVX512 -mGLOB_default_function_attrs=\"use_gather_scatter_hint=off\" -openmp -g -O2 -fno-alias -std=c99" \
    --with-qmp-comms-type=MPI --host=x86_64-linux-gnu --build=none-none-none
make
make install
```

QPhiX was configured and compiled using

```shell
./configure --enable-parallel-arch=parscalar --enable-proc=AVX512 --enable-soalen=8 \
    --enable-clover --enable-openmp --enable-cean --enable-mm-malloc \
    CXXFLAGS="-qopenmp -xMIC-AVX512 -g -O3 -std=c++14" CFLAGS="-xMIC-AVX512 -qopenmp -O3 -std=c99" \
    CXX=mpiicpc CC=mpiicc --host=x86_64-linux-gnu --build=none-none-none \
    --with-qmp=/home/finkenrath/benchmark/qmp/install
make
```

#### 2.2 Run

The Accelerator QCD-Benchmarksuite Part 2 provides bash scripts to set up the benchmark runs on the target machines. These bash scripts are:

- `run_ana.sh` : Main script, sets up the benchmark mode and submits the jobs (and analyses the results)
- `prepare_submit_job.sh` : Generates the job scripts
- `submit_job.sh.template` : Template for the submit script

##### 2.2.1 Main script: "run_ana.sh"

The path to the executable has to be set via `$PATH2EXE`. Different scaling modes can be chosen, from strong scaling to weak scaling, via the variable `sca_mode` (`="Strong"` or `="Weak"`). The lattice sizes can be set by `gx` and `gt`. Choose `mode="Run"` for the run mode and `mode="Analysis"` for extracting the GFLOPS. Note that the submission is done via `sbatch`; match this to the queuing system on your target machine.

##### 2.2.2 Job-script generator: "prepare_submit_job.sh"

Add additional options if necessary.

##### 2.2.3 Submit template: "submit_job.sh.template"

The submit template is edited by `prepare_submit_job.sh` to generate the final submit script. The header should be matched to the queuing system of the target machine.

#### 2.3 Example Benchmark Results

The benchmark results for the Xeon Phi benchmark suite were obtained on Frioul, a test cluster at CINES, and on the Hybrid partition of MareNostrum III at BSC. Frioul has one KNL card per node, while the Hybrid partition of MareNostrum III is equipped with two KNCs per node. The data on Frioul were generated using the bash scripts provided by the QCD Accelerator Benchmark Suite Part 2 for the two test cases: "Strong-Scaling" with a lattice size of 32^3x96 and "Weak-Scaling" with a local lattice size of 48^3x24 per card. For the data generated on MareNostrum III, results for the "Strong-Scaling" mode on a 32^3x96 lattice are shown. The benchmark uses a random gauge configuration and the Conjugate Gradient solver to solve a linear equation involving the clover-improved Wilson Dirac operator.
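For orientation, the two test cases map onto the `run_ana.sh` variables described in section 2.2.1 roughly as follows; this is a sketch only, and the exact variable syntax may differ in the script.

```shell
# Strong scaling on a 32^3 x 96 global lattice (sketch, see section 2.2.1)
sca_mode="Strong"
gx=32
gt=96
mode="Run"        # rerun with mode="Analysis" to extract the GFLOPS

# Weak scaling with a 48^3 x 24 local lattice per card (sketch)
# sca_mode="Weak"; gx=48; gt=24
```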
```
---------------------
Frioul - KNLs
---------------------

Strong - Scaling: global lattice size (32x32x32x96)

precision: single

KNLs    GFLOPS
1       340.75
2       627.612
4       1111.13
8       1779.34
16      2410.8

precision: double

KNLs    GFLOPS
1       328.149
2       616.467
4       1047.79
8       1616.37

Weak - Scaling: local lattice size (48x48x48x24)

precision: single

KNLs    GFLOPS
1       348.304
2       616.697
4       1214.82
8       2425.45
16      4404.63

precision: double

KNLs    GFLOPS
1       172.303
2       320.761
4       629.79
8       1228.77
16      2310.63
```

```
---------------------
MareNostrum III - KNC's
---------------------

Strong - Scaling: global lattice size (32x32x32x96)

precision: single - 1 Cards per Node

KNCs    GFLOPS
2       103.561
4       200.159
8       338.276
16      534.369
32      815.896

precision: single - 2 Cards per Node

KNCs    GFLOPS
4       118.995
8       212.558
16      368.196
32      605.882
64      847.566
```