**2021 - Jacob Finkenrath - CaSToRC - The Cyprus Institute (j.finkenrath@cyi.ac.cy)**
Part 2 of the QCD kernels of the Unified European Applications Benchmark Suite (UEABS), http://www.prace-ri.eu/ueabs/, was developed under the accelerator benchmark suite task within the 4th implementation phase of PRACE and has been part of the UEABS since PRACE 5IP as UEABS QCD part 2. Part 2 consists of two kernels, based on QUDA[^1] and on QPhiX[^2].
[^1]: R. Babich, M. Clark and B. Joo, “Parallelizing the QUDA Library for Multi-GPU Calculations
[^2]: B. Joo, D. D. Kalamkar, K. Vaidyanathan, M. Smelyanskiy, K. Pamnany, V. W. Lee, P. Dubey,
The library QUDA is based on CUDA and is optimized for running on NVIDIA GPUs (https://lattice.github.io/quda/). Currently, a HIP version and a generic version are under development, which can be used for AMD GPUs and, once ready, for CPU architectures such as ARM. The generic QUDA kernel might replace the computational kernel of QPhiX in the future. The QPhiX library consists of routines developed for the Intel Xeon Phi architecture and also performs well on general x86 architectures. QPhiX is optimized to use Intel intrinsic functions for multiple vector lengths, including AVX512 (http://jeffersonlab.github.io/qphix/). In general, the benchmark kernels apply the Conjugate Gradient solver to the Wilson Dirac operator, a four-dimensional stencil. The QUDA source can be obtained via
```shell
git clone https://github.com/lattice/quda.git
```
Set `-DQUDA_GPU_ARCH=sm_XX` to the GPU architecture of the target machine (`sm_60` for Pascal, `sm_35` for Kepler).
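For illustration, a minimal out-of-source build for a Pascal GPU could look like the following sketch (the build directory layout and the `sm_60` value are examples, adjust them to your system):
```shell
cd quda
mkdir build && cd build
# configure for a Pascal GPU; change -DQUDA_GPU_ARCH to match your hardware
cmake -DQUDA_GPU_ARCH=sm_60 ..
make -j 8
```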
If CMake or the compilation fails, library paths and options can be adjusted with `ccmake`, the interactive tool provided by CMake.
Use `./PATH2CMAKE/ccmake PATH2BUILD_DIR` to view and edit the available options.
After compilation, the needed executable `invert_` can be found in the folder `tests`.
The Accelerator QCD-Benchmarksuite Part 2 provides bash scripts, located in the folder `./QCD_Accelerator_Benchmarksuite_Part2/GPUs/scripts`, to set up the benchmark runs on the target machines. These bash scripts are:
- `run_ana.sh` : Main script, sets up the benchmark mode and submits the jobs (and analyses the results)
- `prepare_submit_job.sh` : Generates the job scripts
- `submit_job.sh.template` : Template for the submit script
The path to the executable has to be set via `$PATH2EXE`. QUDA automatically tunes the GPU kernels; the optimal setup is saved in the folder declared by the variable `QUDA_RESOURCE_PATH`, so set it to the folder where the tuning data should be stored. Different scaling modes, from strong scaling to weak scaling, can be chosen via the variable `sca_mode` (="Strong" or ="Weak"). The lattice sizes can be set by `gx` and `gt`. Choose `mode="Run"` for the run mode and `mode="Analysis"` for extracting the GFLOPS. Note that submission is done here via `sbatch`; adapt this to the queueing system of your target machine. Add additional options if necessary.
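For illustration, the relevant variables in `run_ana.sh` might be set as follows (only the variable names are taken from the description above; the values are placeholders):
```shell
PATH2EXE=$HOME/quda/build/tests            # folder containing the QUDA invert executable
export QUDA_RESOURCE_PATH=$HOME/quda_tune  # folder where QUDA stores its kernel tuning data
sca_mode="Strong"                          # "Strong" or "Weak" scaling
gx=32                                      # spatial lattice extent
gt=96                                      # temporal lattice extent
mode="Run"                                 # "Run" to submit jobs, "Analysis" to extract GFLOPS
```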
The submit template will be edited by `prepare_submit_job.sh` to generate the final submit script. The header should be matched to the queueing system of the target machine.
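As an illustration, a SLURM header in `submit_job.sh.template` could look like the following sketch (job name, node count, time limit, partition and account are placeholders for your site):
```shell
#!/bin/bash
#SBATCH --job-name=qcd_bench        # placeholder job name
#SBATCH --nodes=4                   # number of nodes for this benchmark point
#SBATCH --ntasks-per-node=1         # e.g. one MPI rank per GPU
#SBATCH --time=00:30:00             # adjust to the expected run time
#SBATCH --partition=gpu             # placeholder partition name
#SBATCH --account=myproject         # placeholder account
```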
The following benchmark results were obtained on Piz Daint, located at CSCS in Switzerland, and on the GPGPU partition of Cartesius at SURFsara in Amsterdam, Netherlands. The runs were performed using the provided bash scripts. Piz Daint has one Pascal GPU per node and two different test cases are shown:
the "Strong-Scaling" mode with a random lattice configuration of size 32x32x32x96 and a "Weak-Scaling" mode with a configuration of local lattice size 48x48x48x24. The GPGPU nodes of Cartesius have two Kepler GPUs per node, and the "Strong-Scaling" test is shown for the cases of one card per node and two cards per node. The benchmarks use the Conjugate Gradient solver, which solves a linear equation, D * x = b, for the unknown solution "x", with the clover-improved Wilson Dirac operator "D" and a known right-hand side "b".
```
---------------------
PizDaint - Pascal
---------------------
Strong - Scaling:
global lattice size (32x32x32x96)
sloppy-precision: single
precision: single
GPUs GFLOPS sec
1 786.520000 4.569600
2 1522.410000 3.086040
4 2476.900000 2.447180
8 3426.020000 2.117580
16 5091.330000 1.895790
32 8234.310000 1.860760
64 8276.480000 1.869230
sloppy-precision: double
precision: double
GPUs GFLOPS sec
1 385.965000 6.126730
2 751.227000 3.846940
4 1431.570000 2.774470
8 1368.000000 2.367040
16 2304.900000 2.071160
32 4965.480000 2.095180
64 2308.850000 2.005110
Weak - Scaling:
local lattice size (48x48x48x24)
sloppy-precision: single
precision: single
GPUs GFLOPS sec
1 765.967000 3.940280
2 1472.980000 4.004630
4 2865.600000 4.044360
8 5421.270000 4.056410
16 9373.760000 7.396590
32 17995.100000 4.243390
64 27219.800000 4.535410
sloppy-precision: double
precision: double
GPUs GFLOPS sec
1 376.611000 5.108900
2 728.973000 5.190880
4 1453.500000 5.144160
8 2884.390000 5.207090
16 5004.520000 5.362020
32 8744.090000 5.623290
---------------------
SurfSara - Kepler
---------------------
##
## 1 GPU per Node
##
Strong - Scaling:
global lattice size (32x32x32x96)
sloppy-precision: single
precision: single
GPUs GFLOPS sec
1 243.084000 4.030000
2 478.179000 2.630000
4 939.953000 2.250000
8 1798.240000 1.570000
16 3072.440000 1.730000
32 4365.320000 1.310000
sloppy-precision: double
precision: double
GPUs GFLOPS sec
1 119.786000 6.060000
2 234.179000 3.290000
4 463.594000 2.250000
8 898.090000 1.960000
16 1604.210000 1.480000
32 2420.130000 1.630000
##
## 2 GPU per Node
##
Strong - Scaling:
global lattice size (32x32x32x96)
sloppy-precision: single
precision: single
GPUs GFLOPS sec
2 463.041000 2.720000
4 896.707000 1.940000
8 1672.080000 1.680000
16 2518.240000 1.420000
32 3800.970000 1.460000
64 4505.440000 1.430000
sloppy-precision: double
precision: double
GPUs GFLOPS sec
2 229.579000 3.380000
4 450.425000 2.280000
8 863.117000 1.830000
16 1348.760000 1.510000
32 1842.560000 1.550000
64 2645.590000 1.480000
```
## Xeon(Phi) Kernel
### 2. Compile and Run the Xeon(Phi)-Part
QPhiX currently requires additional third-party libraries, such as the USQCD libraries `qmp`, `qdpxx`, `qio`, `xpath_reader`, `c-lime`, and the XML library `libxml2`. The USQCD libraries can be found at
```shell
https://github.com/usqcd-software
```
and for `libxml2`
```shell
http://xmlsoft.org/
```
while the repository of QPhiX is hosted by Jefferson Lab:
```shell
git clone https://github.com/JeffersonLab/qphix
```
#### 2.1 Compile
The QPhiX library is based on QMP communication functions, so QMP has to be set up first. Note that you might have to regenerate the configure script using `autoreconf` from Autotools. QMP can be configured using
```shell
./configure --prefix=$QMP_INSTALL_DIR CC=mpiicc CFLAGS=" -xHOST -std=c99" --with-qmp-comms-type=MPI --host=x86_64-linux-gnu --build=none-none-none
```
Create the install folder and point `$QMP_INSTALL_DIR` to it, then run `make` and `make install` to compile and install the necessary files in `$QMP_INSTALL_DIR`.
For the current master branch of `QPhiX` it is required to provide the package `qdp++`, which has sub-dependencies given by `qio`, `xpath_reader`, `c-lime` and `libxml2`. QDP++ can be configured using (here for a Skylake chip)
```shell
./configure --with-qmp=$QMP_INSTALL_DIR --enable-parallel-arch=parscalar CC=mpiicc CFLAGS="-xCORE-AVX512 -mtune=skylake-avx512 -std=c99" CXX=mpiicpc CXXFLAGS="-axCORE-AVX512 -mtune=skylake-avx512 -std=c++14 -qopenmp" --enable-openmp --host=x86_64-linux-gnu --build=none-none-none --prefix=$QDPXX_INSTALL_DIR --disable-filedb
```
The configure script searches for additional USQCD libraries in the subfolder `./other_libs`. Clone the required library files, e.g.
```shell
git clone https://github.com/usqcd-software/qio.git
```
and reconfigure. Note that on systems like JSC's JUWELS, `libxml2` has to be compiled additionally and its path can be added to the configuration of `qdpxx` via
```shell
--with-libxml2=$LIBXML2_INSTALL_DIR
```
Now the QPhiX benchmark kernels can be compiled via `cmake`. Create a build folder and run
```shell
mkdir build
cd build
cmake -DQDPXX_DIR=$QDPXX_INSTALL_DIR -DQMP_DIR=$QMP_INSTALL_DIR -Disa=avx512 -Dparallel_arch=parscalar -Dhost_cxx=mpiicpc -Dhost_cxxflags="-std=c++17 -O3 -axCORE-AVX512 -mtune=skylake-avx512" -Dtm_clover=ON -Dtwisted_mass=ON -Dtesting=ON -DCMAKE_CXX_COMPILER=mpiicpc -DCMAKE_CXX_FLAGS="-std=c++17 -O3 -axCORE-AVX512 -mtune=skylake-avx512" -DCMAKE_C_COMPILER=mpiicc -DCMAKE_C_FLAGS="-std=c99 -O3 -axCORE-AVX512 -mtune=skylake-avx512" ..
```
The executable `time_clov_noqdp` can now be found in the sub-folder `tests`. Note that in the current version the compilation of other test kernels can fail. If that is the case, directly compile the needed executable via
```
cd tests
make time_clov_noqdp
```
QPhiX was developed to utilize the computational potential of Intel's Xeon Phi architecture, which has since been discontinued. Earlier versions of QPhiX for the Intel Xeon Phi architecture can be compiled without `qdpxx` using a configure-based build, namely for KNCs:
```shell
./configure --enable-parallel-arch=parscalar --enable-proc=MIC --enable-soalen=8 --enable-clover --enable-openmp --enable-cean --enable-mm-malloc CXXFLAGS="-openmp -mmic -vec-report -restrict -mGLOB_default_function_attrs=\"use_gather_scatter_hint=off\" -g -O2 -finline-functions -fno-alias -std=c++0x" CFLAGS="-mmic -vec-report -restrict -mGLOB_default_function_attrs=\"use_gather_scatter_hint=off\" -openmp -g -O2 -fno-alias -std=c99" CXX=mpiicpc CC=mpiicc --host=x86_64-linux-gnu --build=none-none-none --with-qmp=$QMP_INSTALL_DIR
```
or for KNLs:
``` shell
./configure --enable-parallel-arch=parscalar --enable-proc=AVX512 --enable-soalen=8 --enable-clover --enable-openmp --enable-cean --enable-mm-malloc CXXFLAGS="-qopenmp -xMIC-AVX512 -g -O3 -std=c++14" CFLAGS="-xMIC-AVX512 -qopenmp -O3 -std=c99" CXX=mpiicpc CC=mpiicc --host=x86_64-linux-gnu --build=none-none-none --with-qmp=$QMP_INSTALL_DIR
```
Both configurations use the previously defined variable `QMP_INSTALL_DIR`, which points to the install folder of QMP. The executable `time_clov_noqdp` can then be found in the subfolder `./qphix/tests`.
In the following subsections we provide some example compilations on PRACE machines which were used to develop the QCD Benchmark suite Part 2.
###### 2.1.1.1 JSC - JUWELS
JUWELS (Cluster Module) at Juelich Supercomputing Centre is equipped with Intel Skylake chips, namely 2× Intel Xeon Platinum 8168 CPUs (2× 24 cores, 2.7 GHz) per compute node. The compilation was done using the following software setup (status 07/21):
```shell
ml Intel/2020.2.254-GCC-9.3.0 IntelMPI/2019.8.254 imkl/2020.4.304 Autotools CMake
```
(`ml` is a shortcut for `module load`).
`qmp` was built via
```
git clone https://github.com/usqcd-software/qmp.git
cd qmp
autoreconf
./configure --prefix=${PWD}/install CC=mpiicc CFLAGS="-xHOST -std=c99 -O3" --with-qmp-comms-type=MPI --host=x86_64-linux-gnu
make
make install
```
`qdpxx` requires `libxml2`, which was built via
```shell
git clone https://gitlab.gnome.org/GNOME/libxml2.git
cd libxml2
./autogen.sh
./configure --prefix=${PWD}/install
make
make install
```
For `qdpxx` the additionally required libraries were added to the sub-folder `other_libs` via
```shell
git clone https://github.com/usqcd-software/qdpxx.git
cd qdpxx/other_libs
git clone https://github.com/usqcd-software/xpath_reader.git
cd xpath_reader/
autoreconf
cd ..
git clone https://github.com/usqcd-software/qio.git
cd qio
autoreconf
cd other_libs/
git clone https://github.com/usqcd-software/c-lime.git
# note that the path of c-lime is ./qdpxx/other_libs/qio/other_libs/c-lime
cd c-lime/
autoreconf
```
Now, `qdpxx` was compiled via
```shell
./configure --with-libxml2=${PWD}/../libxml2/install --with-qmp=/p/project/cecy00/ecy00a/src_bench/qdpxx/../qmp/install --enable-openmp --enable-parallel-arch=parscalar CC=mpiicc CFLAGS="-xHOST -std=c99 -qopenmp" CXX=mpiicpc CXXFLAGS="-xHOST -std=c++11 -qopenmp" --prefix=/p/project/cecy00/ecy00a/src_bench/qdpxx/install --disable-filedb
make
make install
```
Note that `configure` might require `xml2-config`; its path can be added via `export PATH+=:$PATH_2_LIBXML2`.
Finally, the computational kernels of `QPhiX` can be built via
```shell
git clone https://github.com/JeffersonLab/qphix.git
cd qphix/
mkdir build
cd build
cmake -DQDPXX_DIR=${PWD}/../../qdpxx/install -DQMP_DIR=${PWD}/../../qmp/install -Disa=avx512 -Dparallel_arch=parscalar -Dhost_cxx=icpc -Dhost_cxxflags="-std=c++17 -O3 -xHOST" -Dtm_clover=ON -Dtwisted_mass=ON -Dtesting=ON -DCMAKE_CXX_COMPILER=mpiicpc -DCMAKE_CXX_FLAGS="-std=c++17 -O3 -xHOST" -DCMAKE_C_COMPILER=mpiicc -DCMAKE_C_FLAGS="-std=c99 -O3 -xHOST" -DCMAKE_INSTALL_PREFIX=${PWD}/../install ..
cd tests
make time_clov_noqdp
```
###### 2.1.1.2 BSC - Marenostrum III Hybrid partitions
The hybrid partition of MareNostrum III is equipped with KNCs (status 2016). The Intel MPI environment is loaded with
```shell
module unload openmpi
module load impi
```
and the necessary links are set with
```shell
source /opt/intel/impi/4.1.1.036/bin64/mpivars.sh
source /opt/intel/2013.5.192/composer_xe_2013.5.192/bin/compilervars.sh intel64
export I_MPI_MIC=enable
export I_MPI_HYDRA_BOOTSTRAP=ssh
```
The QMP library was configured and compiled with
```shell
./configure --prefix=$QMP_INSTALL_DIR CC=mpiicc CFLAGS="-mmic -std=c99" --with-qmp-comms-type=MPI --host=x86_64-linux-gnu --build=none-none-none
make
make install
```
Now the package QPhiX is compiled with
```shell
./configure --enable-parallel-arch=parscalar --enable-proc=MIC --enable-soalen=8 --enable-clover --enable-openmp --enable-cean --enable-mm-malloc CXXFLAGS="-openmp -mmic -vec-report -restrict -mGLOB_default_function_attrs=\"use_gather_scatter_hint=off\" -g -O2 -finline-functions -fno-alias -std=c++0x" CFLAGS="-mmic -vec-report -restrict -mGLOB_default_function_attrs=\"use_gather_scatter_hint=off\" -openmp -g -O2 -fno-alias -std=c99" CXX=mpiicpc CC=mpiicc --host=x86_64-linux-gnu --build=none-none-none --with-qmp=$QMP_INSTALL_DIR
make
```
On a test cluster at CINES the benchmark suite was tested on KNLs (status 2018). The steps are similar to those at BSC. First, the library paths are set with
```shell
source /opt/software/intel/composer_xe_2015/bin/compilervars.sh intel64
source /opt/software/intel/impi_5.0.3/bin64/mpivars.sh
```
QMP was compiled using:
```shell
./configure --prefix=$QMP_INSTALL_DIR CC=mpiicc CFLAGS="-xMIC-AVX512 -mGLOB_default_function_attrs=\"use_gather_scatter_hint=off\" -openmp -g -O2 -fno-alias -std=c99" --with-qmp-comms-type=MPI --host=x86_64-linux-gnu --build=none-none-none
make
make install
```
QPhiX was then configured via
```shell
./configure --enable-parallel-arch=parscalar --enable-proc=AVX512 --enable-soalen=8 --enable-clover --enable-openmp --enable-cean --enable-mm-malloc CXXFLAGS="-qopenmp -xMIC-AVX512 -g -O3 -std=c++14" CFLAGS="-xMIC-AVX512 -qopenmp -O3 -std=c99" CXX=mpiicpc CC=mpiicc --host=x86_64-linux-gnu --build=none-none-none --with-qmp=/home/finkenrath/benchmark/qmp/install
```
The Accelerator QCD-Benchmarksuite Part 2 provides bash scripts to set up the benchmark runs on the target machines. These bash scripts are:
- `run_ana.sh` : Main script, sets up the benchmark mode and submits the jobs (and analyses the results)
- `prepare_submit_job.sh` : Generates the job scripts
- `submit_job.sh.template` : Template for the submit script

The path to the executable has to be set via `$PATH2EXE`. Different scaling modes, from strong scaling to weak scaling, can be chosen via the variable `sca_mode` (="Strong" or ="Weak"). The lattice sizes can be set by `gx` and `gt`. Choose `mode="Run"` for the run mode and `mode="Analysis"` for extracting the GFLOPS. Note that submission is done via `sbatch`; adapt this to the queueing system of your target machine. Add additional options if necessary. The submit template will be edited by `prepare_submit_job.sh` to generate the final submit script. The header should be matched to the queueing system of the target machine.
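Analogously to the GPU kernel, the relevant variables in `run_ana.sh` could, for example, be set as follows (only the variable names come from the description above; the values are placeholders):
```shell
PATH2EXE=$HOME/qphix/build/tests/time_clov_noqdp  # QPhiX benchmark executable
sca_mode="Weak"                                   # "Strong" or "Weak" scaling
gx=48                                             # spatial lattice extent
gt=24                                             # temporal lattice extent
mode="Run"                                        # switch to "Analysis" to collect the GFLOPS
```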
The benchmark results for the Xeon Phi benchmark suite were obtained on Frioul, a test cluster at CINES, and on the hybrid partition of MareNostrum III at BSC. Frioul has one KNL card per node, while the hybrid partition of MareNostrum III is equipped with two KNCs per node. The data on Frioul were generated using the bash scripts provided by the QCD Accelerator Benchmark suite Part 2 and cover the two test cases "Strong-Scaling" with a lattice size of 32^3x96 and "Weak-Scaling" with a local lattice size of 48^3x24 per card. For the data generated on MareNostrum, results for the "Strong-Scaling" mode on a 32^3x96 lattice are shown. The benchmark uses a random gauge configuration and the Conjugate Gradient solver to solve a linear equation involving the clover-improved Wilson Dirac operator.
```
---------------------
Frioul - KNLs
---------------------
Strong - Scaling:
global lattice size (32x32x32x96)
precision: single
KNLs GFLOPS
1 340.75
2 627.612
4 1111.13
8 1779.34
16 2410.8
precision: double
KNLs GFLOPS
1 328.149
2 616.467
4 1047.79
8 1616.37
Weak - Scaling:
local lattice size (48x48x48x24)
precision: single
KNLs GFLOPS
1 348.304
2 616.697
4 1214.82
8 2425.45
16 4404.63
precision: double
KNLs GFLOPS
1 172.303
2 320.761
4 629.79
8 1228.77
16 2310.63
---------------------
MareNostrum III - KNCs
---------------------
Strong - Scaling:
global lattice size (32x32x32x96)
precision: single - 1 Cards per Node
KNCs GFLOPS
2 103.561
4 200.159
8 338.276
16 534.369
32 815.896
precision: single - 2 Cards per Node
KNCs GFLOPS
4 118.995
8 212.558
16 368.196
32 605.882
64 847.566
```