# README - QCD Accelerator Benchmarksuite Part 2
##   2017 -  Jacob Finkenrath - CaSToRC - The Cyprus Institute  (j.finkenrath@cyi.ac.cy)
The QCD Accelerator Benchmark suite Part 2 consists of two kernels, based on
the QUDA [1] and the QPhix [2] libraries. QUDA is based on CUDA and is optimized for running on NVIDIA GPUs (https://lattice.github.io/quda/). The QPhix library consists of routines which are optimized to use Intel intrinsic functions of multiple vector lengths, including optimized routines for KNC and KNL (http://jeffersonlab.github.io/qphix/).
The benchmark code uses the Conjugate Gradient benchmark functions provided by the libraries.


[1] R. Babich, M. Clark and B. Joo, “Parallelizing the QUDA Library for Multi-GPU Calculations
in Lattice Quantum Chromodynamics”, SC 10 (Supercomputing 2010)

[2] B. Joo, D. D. Kalamkar, K. Vaidyanathan, M. Smelyanskiy, K. Pamnany, V. W. Lee, P. Dubey,
W. Watson III, “Lattice QCD on Intel Xeon Phi”, International Supercomputing Conference (ISC’13), 2013

##  Table of Contents

GPU - BENCHMARK SUITE (QUDA)
```
1. Compile and Run the GPU-Benchmark Suite
1.1. Compile
1.2. Run
1.2.1. Main-script: "run_ana.sh"
1.2.2. Main-script: "prepare_submit_job.sh"
1.2.3. Main-script: "submit_job.sh.template"
1.3. Example Benchmark results
```

XEONPHI - BENCHMARK SUITE (QPHIX)
```
2. Compile and Run the XeonPhi-Benchmark Suite
2.1. Compile
2.1.1. Example compilation on PRACE machines
2.1.1.1. BSC - Marenostrum III Hybrid partitions
2.1.1.2. CINES - Frioul
2.2. Run
2.2.1. Main-script: "run_ana.sh"
2.2.2. Main-script: "prepare_submit_job.sh"
2.2.3. Main-script: "submit_job.sh.template"
2.3. Example Benchmark Results
```
##   GPU - BENCHMARK SUITE
### 1. Compile and Run the GPU-Benchmark Suite
#### 1.1 Compile

Download CMake and QUDA.

General information on how to build QUDA with CMake can be found at:
"https://github.com/lattice/quda/wiki/Building-QUDA-with-cmake"
Here we just give a short overview:

Build Cmake: (./QCD_Accelerator_Benchmarksuite_Part2/GPUs/src/cmake-3.7.0.tar.gz)

CMake can be downloaded as source from https://cmake.org/download/
This guide uses version cmake-3.7.0. The build instructions can be found
in README.rst in the main directory. Run the configure script `./configure`,
then run `gmake`.

Build Quda: (./QCD_Accelerator_Benchmarksuite_Part2/GPUs/src/quda.tar.gz)

Download QUDA, for example with `git clone https://github.com/lattice/quda.git`.
Create a build folder and, from inside it, run the `cmake` executable, which
is located in cmake/bin:

``` shell
$PATH2CMAKE/cmake $PATH2QUDA -DQUDA_GPU_ARCH=sm_XX -DQUDA_DIRAC_WILSON=ON -DQUDA_DIRAC_TWISTED_MASS=OFF \
  -DQUDA_DIRAC_DOMAIN_WALL=OFF -DQUDA_HISQ_LINK=OFF -DQUDA_GAUGE_FORCE=OFF -DQUDA_HISQ_FORCE=OFF -DQUDA_MPI=ON
```
``` shell
  PATH2CMAKE= path to the cmake-executable
  PATH2QUDA= path to the home dir of QUDA
```
Set `-DQUDA_GPU_ARCH=sm_XX` to the GPU architecture (`sm_60` for Pascal GPUs, `sm_35` for Kepler GPUs).
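
As a small illustration, the mapping from GPU generation to the architecture flag could be wrapped in a helper. This is only a sketch: the function name `quda_gpu_arch` is hypothetical and not part of the benchmark suite, and only the two generations named in this README are covered.

```shell
# Hypothetical helper: map a GPU generation name to the value of -DQUDA_GPU_ARCH.
# Only the two generations mentioned in this README are handled.
quda_gpu_arch() {
  case "$1" in
    pascal|Pascal) echo "sm_60" ;;
    kepler|Kepler) echo "sm_35" ;;
    *) echo "unknown GPU generation: $1" >&2; return 1 ;;
  esac
}

# Print the option that would be passed to cmake for a Pascal card.
echo "-DQUDA_GPU_ARCH=$(quda_gpu_arch Pascal)"
```

For other GPU generations the matching `sm_XX` value has to be looked up in the NVIDIA documentation.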

If CMake or the compilation fails, library paths and options can be set with the CMake-provided tool "ccmake".
Use `./PATH2CMAKE/ccmake PATH2BUILD_DIR` to see and edit the available options.
CMake generates the Makefiles. Run them with `make`.
The needed QUDA executable "invert_" can then be found in the folder /test.


####  1.2 Run


The Accelerator QCD-Benchmarksuite Part 2 provides bash scripts, located in the folder `./QCD_Accelerator_Benchmarksuite_Part2/GPUs/scripts`, to set up the benchmark runs
on the target machines. These bash scripts are:

 - `run_ana.sh`              :   Main script, sets up the benchmark mode and submits the jobs (analyses the results)
 - `prepare_submit_job.sh`   :   Generates the job scripts
 - `submit_job.sh.template`  :   Template for the submit script

##### 1.2.1 Main-script: "run_ana.sh"

The path to the executable has to be set with `$PATH2EXE`.
QUDA automatically tunes the GPU kernels. The optimal setup is saved in
the folder given by the variable `QUDA_RESOURCE_PATH`; set it to the
folder where the tuning data should be saved.
The scaling mode can be chosen between strong scaling and weak scaling
with the variable `sca_mode` (="Strong" or ="Weak").
The lattice sizes can be set with "gx" and "gt".
Choose mode="Run" to run the benchmark and mode="Analysis" to extract the GFLOPS.
Note that the submission is done here with "sbatch"; match this to the queueing system of
the target machine.
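
The settings described above might look as follows. This is a sketch with illustrative values, not the contents of `run_ana.sh`: the paths are placeholders, the reading of `gx` as spatial and `gt` as temporal extent is an assumption, and `check_settings` is a hypothetical helper added only for this example.

```shell
# Illustrative values only; none of these are defaults of the suite.
PATH2EXE="/path/to/executable"                          # placeholder path
export QUDA_RESOURCE_PATH="${TMPDIR:-/tmp}/quda_tune"   # folder for the tuning data
sca_mode="Strong"   # "Strong" or "Weak"
gx=32               # assumed: spatial lattice extent
gt=96               # assumed: temporal lattice extent
mode="Run"          # "Run" or "Analysis"

# Hypothetical sanity check, not part of run_ana.sh.
check_settings() {
  case "$sca_mode" in
    Strong|Weak) ;;
    *) echo "bad sca_mode: $sca_mode" >&2; return 1 ;;
  esac
  case "$mode" in
    Run|Analysis) ;;
    *) echo "bad mode: $mode" >&2; return 1 ;;
  esac
}

# Create the tuning folder up front so it exists before the first run.
mkdir -p "$QUDA_RESOURCE_PATH"
check_settings && echo "settings ok: $sca_mode scaling, ${gx}^3 x ${gt}, mode=$mode"
```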
##### 1.2.2 Main-script: "prepare_submit_job.sh"

Add additional options if necessary.

##### 1.2.3 Main-script: "submit_job.sh.template"
The submit template is edited by `prepare_submit_job.sh` to generate
the final submit script. The header should be matched to the queueing system
of the target machine.
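
The general idea of turning a template into a submit script can be sketched with `sed`. The template below is hypothetical: the placeholders `@NODES@`, `@TIME@` and `@EXE@` are assumptions made for this example, and the real `submit_job.sh.template` may use different placeholders and a different header.

```shell
# Hypothetical template written to a scratch directory for illustration.
workdir=$(mktemp -d)
cat > "$workdir/submit_job.sh.template" <<'EOF'
#!/bin/bash
#SBATCH --nodes=@NODES@
#SBATCH --time=@TIME@
srun @EXE@
EOF

# Replace each placeholder to obtain the final submit script.
sed -e "s|@NODES@|4|" \
    -e "s|@TIME@|00:30:00|" \
    -e "s|@EXE@|/path/to/executable|" \
    "$workdir/submit_job.sh.template" > "$workdir/submit_job.sh"

cat "$workdir/submit_job.sh"
```

On a machine with a different queueing system, only the `#SBATCH` header lines and the launcher in the template would need to change.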

#### 1.3 Example Benchmark results

Shown here are the benchmark results on PizDaint, located at CSCS in Switzerland,
and on the GPGPU partition of Cartesius at Surfsara in Amsterdam, the Netherlands. The runs were performed using the
provided bash scripts. PizDaint has one Pascal GPU per node and two different test cases are shown:
the "Strong-Scaling" mode with a random lattice configuration of size 32^3x96 and
the "Weak-Scaling" mode with a configuration of local lattice size 48^3x24.
The GPGPU nodes of Cartesius have two Kepler GPUs per node, and the "Strong-Scaling" test is shown for the cases
where one card per node and two cards per node are used.
The benchmarks use the Conjugate Gradient solver, which
solves a linear equation, D * x = b, for the unknown solution "x" based on the clover-improved Wilson Dirac operator
"D" and a known right-hand side "b".

```
---------------------
  PizDaint - Pascal
---------------------
Strong - Scaling:
global lattice size (32x32x32x96)

sloppy-precision: single
       precision: single

GPUs     GFLOPS      sec
1    786.520000 4.569600
2   1522.410000 3.086040
4   2476.900000 2.447180
8   3426.020000 2.117580
16  5091.330000 1.895790
32  8234.310000 1.860760
64  8276.480000 1.869230

sloppy-precision: double
       precision: double

GPUs     GFLOPS      sec
1    385.965000 6.126730
2    751.227000 3.846940
4   1431.570000 2.774470
8   1368.000000 2.367040
16  2304.900000 2.071160
32  4965.480000 2.095180
64  2308.850000 2.005110


Weak - Scaling:
local lattice size (48x48x48x24)

sloppy-precision: single
       precision: single

GPUs     GFLOPS      sec
1     765.967000 3.940280
2    1472.980000 4.004630
4    2865.600000 4.044360
8    5421.270000 4.056410
16   9373.760000 7.396590
32  17995.100000 4.243390
64  27219.800000 4.535410

sloppy-precision: double
       precision: double

GPUs    GFLOPS      sec
 1   376.611000 5.108900
 2   728.973000 5.190880
 4  1453.500000 5.144160
 8  2884.390000 5.207090
16  5004.520000 5.362020
32  8744.090000 5.623290
64  14053.00000 5.910520
```
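
One way to read such tables is the parallel efficiency E(N) = GFLOPS(N) / (N * GFLOPS per GPU of the first row). The following is a minimal sketch, assuming the input consists of two columns, "GPUs GFLOPS"; the function name `scaling_efficiency` is hypothetical and not part of the provided scripts.

```shell
# Hypothetical helper: read "GPUs GFLOPS" rows and print "GPUs efficiency",
# normalised to the per-GPU rate of the first row.
scaling_efficiency() {
  awk 'NF == 2 && $1 > 0 {
         if (!seen) { base = $2 / $1; seen = 1 }
         printf "%d %.3f\n", $1, $2 / ($1 * base)
       }'
}

# First three rows of the single-precision strong-scaling table on PizDaint.
printf '%s\n' "1 786.52" "2 1522.41" "4 2476.90" | scaling_efficiency
# prints:
# 1 1.000
# 2 0.968
# 4 0.787
```

Because the baseline is taken from the first row, the same helper also works for tables that start at more than one GPU.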
```
---------------------
  SurfSara - Kepler
---------------------
##
## 1 GPU per Node
##

Strong - Scaling:
global lattice size (32x32x32x96)

sloppy-precision: single
       precision: single
GPUs    GFLOPS      sec
1      243.084000 4.030000
2      478.179000 2.630000
4      939.953000 2.250000
8     1798.240000 1.570000
16    3072.440000 1.730000
32    4365.320000 1.310000

sloppy-precision: double
       precision: double

GPUs    GFLOPS      sec
1      119.786000 6.060000
2      234.179000 3.290000
4      463.594000 2.250000
8      898.090000 1.960000
16    1604.210000 1.480000
32    2420.130000 1.630000

##
## 2 GPU per Node
##

Strong - Scaling:
global lattice size (32x32x32x96)

sloppy-precision: single
       precision: single

GPUs    GFLOPS      sec
2      463.041000 2.720000
4      896.707000 1.940000
8     1672.080000 1.680000
16    2518.240000 1.420000
32    3800.970000 1.460000
64    4505.440000 1.430000

sloppy-precision: double
       precision: double

GPUs    GFLOPS      sec
2     229.579000 3.380000
4     450.425000 2.280000
8     863.117000 1.830000
16   1348.760000 1.510000
32   1842.560000 1.550000
64   2645.590000 1.480000
```
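
To put the lattice sizes of the two test cases into perspective, the number of lattice sites can be computed directly. This is a plain arithmetic sketch; the function name `lattice_sites` is hypothetical.

```shell
# Number of lattice sites of an L^3 x T lattice.
lattice_sites() { echo $(( $1 * $1 * $1 * $2 )); }

global=$(lattice_sites 32 96)       # strong scaling: fixed global lattice
per_gpu_weak=$(lattice_sites 48 24) # weak scaling: fixed lattice per GPU

echo "strong scaling, global sites: $global"
echo "weak scaling, sites per GPU:  $per_gpu_weak"

# In strong scaling the work per GPU shrinks with the number of GPUs.
for n in 1 8 64; do
  echo "strong scaling on $n GPUs: $(( global / n )) sites per GPU"
done
```

With 64 GPUs the strong-scaling case leaves 49152 sites per GPU, compared with 2654208 sites per GPU in the weak-scaling case.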

##   XEONPHI - BENCHMARK SUITE
### 2. Compile and Run the XeonPhi-Benchmark Suite

Unpack the provided source tar file located in `./QCD_Accelerator_Benchmarksuite_Part2/XeonPhi/src` or
clone the current GitHub branches of the code
packages QMP:

``` shell
git clone https://github.com/usqcd-software/qmp
```
and QPhix:
``` shell
git clone https://github.com/JeffersonLab/qphix
```

Note that for running on Skylake chips it is recommended to use
the develop branch of QPhix, which needs additional packages
such as qdp++ (status 04/2019).
#### 2.1 Compile
The QPhix library is based on QMP communication functions,
so QMP has to be set up first. Configure it with:
``` shell
./configure --prefix=$QMP_INSTALL_DIR CC=mpiicc CFLAGS="-mmic -std=c99" --with-qmp-comms-type=MPI --host=x86_64-linux-gnu --build=none-none-none
```
Create the install folder and point `$QMP_INSTALL_DIR` to it.
Use the compiler flag `-mmic` when compiling for KNCs
and `-xMIC-AVX512` when compiling for KNLs. Then run
``` shell
make
make install
```
to compile and install the necessary files in `$QMP_INSTALL_DIR`.
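
The choice of architecture flag can be sketched as a small helper. The function name `qmp_arch_flag` is hypothetical; the KNL flag `-xMIC-AVX512` is the one used in the CINES example in section 2.1.1.2.

```shell
# Hypothetical helper: pick the Intel compiler flag for the target card.
qmp_arch_flag() {
  case "$1" in
    KNC) echo "-mmic" ;;
    KNL) echo "-xMIC-AVX512" ;;
    *) echo "unknown target: $1" >&2; return 1 ;;
  esac
}

target=KNL   # or KNC
# Print the CFLAGS setting that would be passed to ./configure.
echo "CFLAGS=\"$(qmp_arch_flag "$target") -std=c99\""
```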

The QPhix executable can be configured as follows.
For KNCs:

``` shell
./configure --enable-parallel-arch=parscalar --enable-proc=MIC --enable-soalen=8 --enable-clover --enable-openmp --enable-cean --enable-mm-malloc CXXFLAGS="-openmp -mmic -vec-report -restrict -mGLOB_default_function_attrs=\"use_gather_scatter_hint=off\" -g -O2 -finline-functions -fno-alias -std=c++0x" CFLAGS="-mmic -vec-report -restrict -mGLOB_default_function_attrs=\"use_gather_scatter_hint=off\" -openmp -g  -O2 -fno-alias -std=c99" CXX=mpiicpc CC=mpiicc --host=x86_64-linux-gnu --build=none-none-none --with-qmp=$QMP_INSTALL_DIR
```
and for KNLs:
``` shell
./configure --enable-parallel-arch=parscalar --enable-proc=AVX512 --enable-soalen=8 --enable-clover --enable-openmp --enable-cean --enable-mm-malloc CXXFLAGS="-qopenmp -xMIC-AVX512 -g -O3 -std=c++14" CFLAGS="-xMIC-AVX512 -qopenmp -O3 -std=c99" CXX=mpiicpc CC=mpiicc --host=x86_64-linux-gnu --build=none-none-none --with-qmp=$QMP_INSTALL_DIR
```
Here `QMP_INSTALL_DIR` is the variable set above, which points to the install folder
of QMP. After running `make`, the executable `time_clov_noqdp` can be found in the subfolder `./qphix/test`.


Note that for the develop branch the package qdp++ has to be compiled as well.
QDP++ can be configured using (here for a Skylake chip):

``` shell
./configure --with-qmp=$QMP_INSTALL_DIR --enable-parallel-arch=parscalar CC=mpiicc CFLAGS="-xCORE-AVX512 -mtune=skylake-avx512 -std=c99" CXX=mpiicpc CXXFLAGS="-axCORE-AVX512 -mtune=skylake-avx512 -std=c++14 -qopenmp" --enable-openmp --host=x86_64-linux-gnu --build=none-none-none --prefix=$QDPXX_INSTALL_DIR
```

Now the QPhix executable can be configured using:


``` shell
cmake -DQDPXX_DIR=$QDPXX_INSTALL_DIR -DQMP_DIR=$QMP_INSTALL_DIR -Disa=avx512 -Dparallel_arch=parscalar -Dhost_cxx=mpiicpc -Dhost_cxxflags="-std=c++17 -O3 -axCORE-AVX512 -mtune=skylake-avx512" -Dtm_clover=ON -Dtwisted_mass=ON -Dtesting=ON -DCMAKE_CXX_COMPILER=mpiicpc -DCMAKE_CXX_FLAGS="-std=c++17 -O3 -axCORE-AVX512 -mtune=skylake-avx512" -DCMAKE_C_COMPILER=mpiicc -DCMAKE_C_FLAGS="-std=c99 -O3 -axCORE-AVX512 -mtune=skylake-avx512" ..
```

The executable `time_clov_noqdp` can be found now in the subfolder `./qphix/test`.
##### 2.1.1 Example compilation on PRACE machines

In this subsection we provide some example compilations on PRACE machines
which were used to develop the QCD Benchmarksuite 2.

###### 2.1.1.1 BSC - Marenostrum III Hybrid partitions

The hybrid partitions of MareNostrum III are equipped with KNCs.
First, the following modules were loaded:
``` shell
module unload openmpi
module load impi
```

and the necessary links are set with

``` shell
source /opt/intel/impi/4.1.1.036/bin64/mpivars.sh
source /opt/intel/2013.5.192/composer_xe_2013.5.192/bin/compilervars.sh intel64
export I_MPI_MIC=enable
export I_MPI_HYDRA_BOOTSTRAP=ssh
```

The QMP-library was configured and compiled with

``` shell
./configure --prefix=$QMP_INSTALL_DIR CC=mpiicc CFLAGS="-mmic -std=c99" --with-qmp-comms-type=MPI --host=x86_64-linux-gnu --build=none-none-none
make
make install
```

Now the package QPhix is compiled with

``` shell
./configure --enable-parallel-arch=parscalar --enable-proc=MIC --enable-soalen=8 --enable-clover --enable-openmp --enable-cean --enable-mm-malloc CXXFLAGS="-openmp -mmic -vec-report -restrict -mGLOB_default_function_attrs=\"use_gather_scatter_hint=off\" -g -O2 -finline-functions -fno-alias -std=c++0x" CFLAGS="-mmic -vec-report -restrict -mGLOB_default_function_attrs=\"use_gather_scatter_hint=off\" -openmp -g  -O2 -fno-alias -std=c99" CXX=mpiicpc CC=mpiicc --host=x86_64-linux-gnu --build=none-none-none --with-qmp=$QMP_INSTALL_DIR
make
```
###### 2.1.1.2 CINES - Frioul

On a test cluster at the CINES site the Benchmarksuite was tested on KNLs.
The steps are similar to those at BSC. First the library paths are set with

``` shell
source /opt/software/intel/composer_xe_2015/bin/compilervars.sh intel64
source /opt/software/intel/impi_5.0.3/bin64/mpivars.sh
```

QMP was configured and compiled using:

``` shell
./configure --prefix=$QMP_INSTALL_DIR CC=mpiicc CFLAGS="-xMIC-AVX512 -mGLOB_default_function_attrs=\"use_gather_scatter_hint=off\" -openmp -g  -O2 -fno-alias -std=c99"  --with-qmp-comms-type=MPI --host=x86_64-linux-gnu --build=none-none-none
make
make install
```
QPhix was configured and compiled using:
``` shell
./configure --enable-parallel-arch=parscalar --enable-proc=AVX512 --enable-soalen=8 --enable-clover --enable-openmp --enable-cean --enable-mm-malloc CXXFLAGS="-qopenmp -xMIC-AVX512 -g -O3 -std=c++14" CFLAGS="-xMIC-AVX512 -qopenmp -O3 -std=c99" CXX=mpiicpc CC=mpiicc --host=x86_64-linux-gnu --build=none-none-none --with-qmp=/home/finkenrath/benchmark/qmp/install
```
####  2.2 Run

The Accelerator QCD-Benchmarksuite Part 2 provides bash scripts to set up the benchmark runs
on the target machines. These bash scripts are:

 - `run_ana.sh`              :   Main script, sets up the benchmark mode and submits the jobs (analyses the results)
 - `prepare_submit_job.sh`   :   Generates the job scripts
 - `submit_job.sh.template`  :   Template for the submit script

##### 2.2.1 Main-script: "run_ana.sh"

The path to the executable has to be set with `$PATH2EXE`.
The scaling mode can be chosen between strong scaling and weak scaling
with the variable `sca_mode` (="Strong" or ="Weak").
The lattice sizes can be set with "gx" and "gt".
Choose mode="Run" to run the benchmark and mode="Analysis" to extract the GFLOPS.
Note that the submission is done with "sbatch"; match this to the queueing system of
the target machine.
##### 2.2.2 Main-script: "prepare_submit_job.sh"

Add additional options if necessary.

##### 2.2.3 Main-script: "submit_job.sh.template"
The submit template is edited by `prepare_submit_job.sh` to generate
the final submit script. The header should be matched to the queueing system
of the target machine.


#### 2.3 Example Benchmark Results

The benchmark results for the XeonPhi benchmark suite were obtained on
Frioul, a test cluster at CINES, and on the hybrid partition of MareNostrum III at BSC.
Frioul has one KNL card per node, while the hybrid partition of MareNostrum III is
equipped with two KNCs per node. The data on Frioul were generated using
the bash scripts provided by the QCD Accelerator Benchmarksuite Part 2
for the two test cases "Strong-Scaling" with a lattice size
of 32^3x96 and "Weak-Scaling" with a local lattice size of 48^3x24 per
card. For MareNostrum, data for the "Strong-Scaling"
mode on a 32^3x96 lattice are shown. The benchmark uses a random gauge configuration and the
Conjugate Gradient solver to solve a linear equation involving the clover Wilson Dirac operator.

```
---------------------
  Frioul - KNLs
---------------------
Strong - Scaling:
global lattice size (32x32x32x96)

precision: single

KNLs     GFLOPS
1       340.75
2       627.612
4      1111.13
8      1779.34
16     2410.8

precision: double

KNLs     GFLOPS
1      328.149
2      616.467
4      1047.79
8      1616.37

Weak - Scaling:
local lattice size (48x48x48x24)

precision: single

KNLs   GFLOPS
1       348.304
2       616.697
4      1214.82
8      2425.45
16     4404.63

precision: double

KNLs   GFLOPS
 1      172.303
 2      320.761
 4      629.79
 8     1228.77
16     2310.63
```
```
---------------------
  MareNostrum III - KNC's
---------------------

Strong - Scaling:
global lattice size (32x32x32x96)

precision: single - 1 Cards per Node

KNCs  GFLOPS
2    103.561
4    200.159
8    338.276
16   534.369
32   815.896

precision: single - 2 Cards per Node

KNCs  GFLOPS
4    118.995
8    212.558
16   368.196
32   605.882
64   847.566
```