# README - QCD UEABS Part 2
**2017 - Jacob Finkenrath - CaSToRC - The Cyprus Institute (j.finkenrath@cyi.ac.cy)**
The QCD Accelerator Benchmark suite Part 2 consists of two kernels: the QUDA library [^quda] and the QPhiX library [^qphix]. QUDA is based on CUDA and is optimized for running on NVIDIA GPUs (https://lattice.github.io/quda/). The QPhiX library consists of routines that are optimized to use Intel intrinsic functions for multiple vector lengths, including AVX512, with optimized routines for KNC and KNL (http://jeffersonlab.github.io/qphix/). The benchmark kernels use the Conjugate Gradient solvers provided by these libraries.

[^quda]: R. Babich, M. Clark and B. Joo, “Parallelizing the QUDA Library for Multi-GPU Calculations
[^qphix]: B. Joo, D. D. Kalamkar, K. Vaidyanathan, M. Smelyanskiy, K. Pamnany, V. W. Lee, P. Dubey,
## GPU Kernel
### 1. Compile and Run the GPU-Part
Download CMake and QUDA:
General information on how to build QUDA with CMake can be found at
https://github.com/lattice/quda/wiki/Building-QUDA-with-cmake .
Here we just give a short overview.
Build CMake (the source is provided in `./QCD_Accelerator_Benchmarksuite_Part2/GPUs/src/cmake-3.7.0.tar.gz`).
CMake can also be downloaded from https://cmake.org/download/ .
In this guide version cmake-3.7.0 is used. The build instructions can be found in CMake's main directory in README.rst: run `./configure`, then run `gmake`, as sketched below.
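A minimal sketch of these steps, assuming the tarball has been copied from the suite's `src` directory into the working directory:

``` shell
# Unpack and build the bundled CMake 3.7.0
tar -xzf cmake-3.7.0.tar.gz
cd cmake-3.7.0
./configure      # configure the CMake build
gmake            # build; the cmake binaries end up in ./bin
```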
Build QUDA (the source is provided in `./QCD_Accelerator_Benchmarksuite_Part2/GPUs/src/quda.tar.gz`).
QUDA can also be downloaded with `git clone https://github.com/lattice/quda.git`.
Create a build folder and run the `cmake` executable (located in `cmake/bin`) from within the build folder.
Execute:
``` shell
$PATH2CMAKE/cmake $PATH2QUDA -DQUDA_GPU_ARCH=sm_XX -DQUDA_DIRAC_WILSON=ON -DQUDA_DIRAC_TWISTED_MASS=OFF \
  -DQUDA_DIRAC_DOMAIN_WALL=OFF -DQUDA_HISQ_LINK=OFF -DQUDA_GAUGE_FORCE=OFF -DQUDA_HISQ_FORCE=OFF -DQUDA_MPI=ON
```
where
- `PATH2CMAKE` = path to the cmake executable
- `PATH2QUDA` = path to the home directory of QUDA

Set `-DQUDA_GPU_ARCH=sm_XX` to the GPU architecture (`sm_60` for Pascal, `sm_35` for Kepler).
If CMake or the compilation fails, library paths and options can be set with the `ccmake` tool provided by CMake.
Use `$PATH2CMAKE/ccmake $PATH2BUILD_DIR` to view and edit the available options.
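Putting the steps together, a minimal build sequence might look like the following sketch. The paths and the `sm_60` architecture (for a Pascal GPU) are placeholders, and the final `make` is the standard CMake build step:

``` shell
# Assumed locations of the cmake binary and of the QUDA sources
PATH2CMAKE=$HOME/cmake-3.7.0/bin
PATH2QUDA=$HOME/quda

mkdir -p quda_build && cd quda_build
$PATH2CMAKE/cmake $PATH2QUDA -DQUDA_GPU_ARCH=sm_60 \
  -DQUDA_DIRAC_WILSON=ON -DQUDA_DIRAC_TWISTED_MASS=OFF -DQUDA_DIRAC_DOMAIN_WALL=OFF \
  -DQUDA_HISQ_LINK=OFF -DQUDA_GAUGE_FORCE=OFF -DQUDA_HISQ_FORCE=OFF -DQUDA_MPI=ON
make -j 8          # build QUDA and its test executables
```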
After a successful build, the required QUDA executable `invert_test` can be found in the `tests` folder of the build directory.
The Accelerator QCD Benchmark suite Part 2 provides bash scripts, located in the folder `./QCD_Accelerator_Benchmarksuite_Part2/GPUs/scripts`, to set up the benchmark runs on the target machines. These bash scripts are:
- `run_ana.sh` : main script; sets up the benchmark mode and submits the jobs (and analyses the results)
- `prepare_submit_job.sh` : generates the job scripts
- `submit_job.sh.template` : template for the submit script
The path to the executable has to be set via `$PATH2EXE`. QUDA automatically tunes its GPU kernels; the optimal setup is saved in the folder declared by the variable `QUDA_RESOURCE_PATH`, so set it to the folder where the tuning data should be stored. Different scaling modes, from strong scaling to weak scaling, can be chosen via the variable `sca_mode` (`"Strong"` or `"Weak"`). The lattice sizes are set by `gx` and `gt`. Choose `mode="Run"` for the run mode and `mode="Analysis"` for extracting the GFLOPS. Note that submission is done here with `sbatch`; adapt this to the queueing system of your target machine. Add additional options if necessary (see the sketch below).
The submit template is edited by `prepare_submit_job.sh` to generate the final submit script. Its header should be adapted to the queueing system of the target machine.
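The following sketch shows the kind of settings involved; the variable names are those described above, while the concrete values, paths, and layout of `run_ana.sh` are only assumptions:

``` shell
# Inside run_ana.sh (example values, adapt to your system)
PATH2EXE=$HOME/quda_build/tests              # path to the QUDA benchmark executable
export QUDA_RESOURCE_PATH=$HOME/quda_tuning  # folder where QUDA stores its kernel-tuning data
sca_mode="Strong"                            # "Strong" or "Weak" scaling
gx=32                                        # spatial lattice extent
gt=96                                        # temporal lattice extent
mode="Run"                                   # "Run" submits the jobs, "Analysis" extracts the GFLOPS
```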
The benchmark results shown here were obtained on Piz Daint, located at CSCS in Switzerland, and on the GPGPU partition of Cartesius at SURFsara in Amsterdam, the Netherlands. The runs were performed with the provided bash scripts. Piz Daint has one Pascal GPU per node and two test cases are shown:
a "Strong-Scaling" mode with a random lattice configuration of size 32x32x32x96 and a "Weak-Scaling" mode with a local lattice size of 48x48x48x24 per GPU. The GPGPU nodes of Cartesius have two Kepler GPUs per node, and the "Strong-Scaling" test is shown for one card per node and for two cards per node. The benchmarks use the Conjugate Gradient solver, which solves the linear equation D * x = b for the unknown solution "x", where "D" is the clover-improved Wilson Dirac operator and "b" a known right-hand side.
```
---------------------
PizDaint - Pascal
---------------------
Strong - Scaling:
global lattice size (32x32x32x96)
sloppy-precision: single
precision: single
GPUs GFLOPS sec
1 786.520000 4.569600
2 1522.410000 3.086040
4 2476.900000 2.447180
8 3426.020000 2.117580
16 5091.330000 1.895790
32 8234.310000 1.860760
64 8276.480000 1.869230
sloppy-precision: double
precision: double
GPUs GFLOPS sec
1 385.965000 6.126730
2 751.227000 3.846940
4 1431.570000 2.774470
8 1368.000000 2.367040
16 2304.900000 2.071160
32 4965.480000 2.095180
64 2308.850000 2.005110
Weak - Scaling:
local lattice size (48x48x48x24)
sloppy-precision: single
precision: single
GPUs GFLOPS sec
1 765.967000 3.940280
2 1472.980000 4.004630
4 2865.600000 4.044360
8 5421.270000 4.056410
16 9373.760000 7.396590
32 17995.100000 4.243390
64 27219.800000 4.535410
sloppy-precision: double
precision: double
GPUs GFLOPS sec
1 376.611000 5.108900
2 728.973000 5.190880
4 1453.500000 5.144160
8 2884.390000 5.207090
16 5004.520000 5.362020
32 8744.090000 5.623290
---------------------
SurfSara - Kepler
---------------------
##
## 1 GPU per Node
##
Strong - Scaling:
global lattice size (32x32x32x96)
sloppy-precision: single
precision: single
GPUs GFLOPS sec
1 243.084000 4.030000
2 478.179000 2.630000
4 939.953000 2.250000
8 1798.240000 1.570000
16 3072.440000 1.730000
32 4365.320000 1.310000
sloppy-precision: double
precision: double
GPUs GFLOPS sec
1 119.786000 6.060000
2 234.179000 3.290000
4 463.594000 2.250000
8 898.090000 1.960000
16 1604.210000 1.480000
32 2420.130000 1.630000
##
## 2 GPU per Node
##
Strong - Scaling:
global lattice size (32x32x32x96)
sloppy-precision: single
precision: single
GPUs GFLOPS sec
2 463.041000 2.720000
4 896.707000 1.940000
8 1672.080000 1.680000
16 2518.240000 1.420000
32 3800.970000 1.460000
64 4505.440000 1.430000
sloppy-precision: double
precision: double
GPUs GFLOPS sec
2 229.579000 3.380000
4 450.425000 2.280000
8 863.117000 1.830000
16 1348.760000 1.510000
32 1842.560000 1.550000
64 2645.590000 1.480000
```
## x86 Kernel
### 2. Compile and Run the x86-Part
Unpack the provided source tar-files located in `./QCD_Accelerator_Benchmarksuite_Part2/XeonPhi/src` or
clone the current GitHub branches of the code.
For the package QMP:
``` shell
git clone https://github.com/usqcd-software/qmp
```
and for QPhiX:
``` shell
git clone https://github.com/JeffersonLab/qphix
```
Note that for running on Skylake chips it is recommended to use
the `develop` branch of QPhiX, which needs additional packages
such as QDP++ (status 04/2019).
The QPhiX library is based on QMP communication functions,
so QMP has to be set up first:
``` shell
./configure --prefix=$QMP_INSTALL_DIR CC=mpiicc CFLAGS=" -mmic/-xAVX512 -std=c99" --with-qmp-comms-type=MPI --host=x86_64-linux-gnu --build=none-none-none
```
Create the install folder and point `$QMP_INSTALL_DIR` to it.
Use the compiler flag `-mmic` when compiling for KNCs
and `-xAVX512` when compiling for KNLs.
Then compile and install the necessary files into `$QMP_INSTALL_DIR` as sketched below.
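The compile and install step follows the usual pattern (it is shown again in the machine-specific examples below):

``` shell
make
make install        # installs the QMP headers and libraries into $QMP_INSTALL_DIR
```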
The QPhiX executable can be compiled by using, for KNCs,
``` shell
./configure --enable-parallel-arch=parscalar --enable-proc=MIC --enable-soalen=8 --enable-clover --enable-openmp --enable-cean --enable-mm-malloc CXXFLAGS="-openmp -mmic -vec-report -restrict -mGLOB_default_function_attrs=\"use_gather_scatter_hint=off\" -g -O2 -finline-functions -fno-alias -std=c++0x" CFLAGS="-mmic -vec-report -restrict -mGLOB_default_function_attrs=\"use_gather_scatter_hint=off\" -openmp -g -O2 -fno-alias -std=c99" CXX=mpiicpc CC=mpiicc --host=x86_64-linux-gnu --build=none-none-none --with-qmp=$QMP_INSTALL_DIR
```
or, for KNLs,
``` shell
./configure --enable-parallel-arch=parscalar --enable-proc=AVX512 --enable-soalen=8 --enable-clover --enable-openmp --enable-cean --enable-mm-malloc CXXFLAGS="-qopenmp -xMIC-AVX512 -g -O3 -std=c++14" CFLAGS="-xMIC-AVX512 -qopenmp -O3 -std=c99" CXX=mpiicpc CC=mpiicc --host=x86_64-linux-gnu --build=none-none-none --with-qmp=$QMP_INSTALL_DIR
```
where the previously defined variable `$QMP_INSTALL_DIR` points to the install folder
of QMP. After running `make`, the executable `time_clov_noqdp` can be found in the subfolder `./qphix/test`.
Note that for the develop branch the package QDP++ has to be compiled first.
QDP++ can be configured using (shown here for a Skylake chip):
``` shell
./configure --with-qmp=$QMP_INSTALL_DIR --enable-parallel-arch=parscalar CC=mpiicc CFLAGS="-xCORE-AVX512 -mtune=skylake-avx512 -std=c99" CXX=mpiicpc CXXFLAGS="-axCORE-AVX512 -mtune=skylake-avx512 -std=c++14 -qopenmp" --enable-openmp --host=x86_64-linux-gnu --build=none-none-none --prefix=$QDPXX_INSTALL_DIR
```
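QDP++ is then built and installed into `$QDPXX_INSTALL_DIR`; this is assumed to follow the same `make`/`make install` pattern as QMP:

``` shell
make -j 8
make install        # installs QDP++ into $QDPXX_INSTALL_DIR
```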
Now the QPhiX executable can be compiled using:
``` shell
cmake -DQDPXX_DIR=$QDPXX_INSTALL_DIR -DQMP_DIR=$QMP_INSTALL_DIR -Disa=avx512 -Dparallel_arch=parscalar -Dhost_cxx=mpiicpc -Dhost_cxxflags="-std=c++17 -O3 -axCORE-AVX512 -mtune=skylake-avx512" -Dtm_clover=ON -Dtwisted_mass=ON -Dtesting=ON -DCMAKE_CXX_COMPILER=mpiicpc -DCMAKE_CXX_FLAGS="-std=c++17 -O3 -axCORE-AVX512 -mtune=skylake-avx512" -DCMAKE_C_COMPILER=mpiicc -DCMAKE_C_FLAGS="-std=c99 -O3 -axCORE-AVX512 -mtune=skylake-avx512" ..
```
After building with `make`, the executable `time_clov_noqdp` can be found in the subfolder `./qphix/test`.
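For a quick check outside the benchmark scripts, the test driver can also be launched by hand. The sketch below uses the flag names of the QPhiX test drivers at the time of writing (lattice dimensions `-x/-y/-z/-t`, cache blocking `-by/-bz`, cores and SMT threads `-c/-sy/-sz`, MPI process geometry `-geom`, precision `-prec`); treat them as assumptions and verify them against the usage output of your build:

``` shell
# Hypothetical single-process run on a 32x32x32x96 lattice in single precision
mpirun -np 1 ./time_clov_noqdp \
  -x 32 -y 32 -z 32 -t 96 \
  -by 4 -bz 4 -c 64 -sy 1 -sz 2 \
  -pxy 1 -pxyz 0 -minct 1 \
  -geom 1 1 1 1 -prec f
```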
In this subsection we provide some example compilations on PRACE machines
which were used to develop the QCD Benchmark suite Part 2.
The hybrid partition of MareNostrum III at BSC is equipped with KNCs.
There the Intel MPI module is loaded with
``` shell
module unload openmpi
module load impi
```
and the necessary links are set with
``` shell
source /opt/intel/impi/4.1.1.036/bin64/mpivars.sh
source /opt/intel/2013.5.192/composer_xe_2013.5.192/bin/compilervars.sh intel64
export I_MPI_MIC=enable
export I_MPI_HYDRA_BOOTSTRAP=ssh
```
The QMP library was configured and compiled with
``` shell
./configure --prefix=$QMP_INSTALL_DIR CC=mpiicc CFLAGS="-mmic -std=c99" --with-qmp-comms-type=MPI --host=x86_64-linux-gnu --build=none-none-none
make
make install
```
Now the package QPhiX is compiled with
``` shell
./configure --enable-parallel-arch=parscalar --enable-proc=MIC --enable-soalen=8 --enable-clover --enable-openmp --enable-cean --enable-mm-malloc CXXFLAGS="-openmp -mmic -vec-report -restrict -mGLOB_default_function_attrs=\"use_gather_scatter_hint=off\" -g -O2 -finline-functions -fno-alias -std=c++0x" CFLAGS="-mmic -vec-report -restrict -mGLOB_default_function_attrs=\"use_gather_scatter_hint=off\" -openmp -g -O2 -fno-alias -std=c99" CXX=mpiicpc CC=mpiicc --host=x86_64-linux-gnu --build=none-none-none --with-qmp=$QMP_INSTALL_DIR
make
```
On a test cluster at CINES the benchmark suite was tested on KNLs.
The steps are similar to those at BSC. First the library paths are set with
``` shell
source /opt/software/intel/composer_xe_2015/bin/compilervars.sh intel64
source /opt/software/intel/impi_5.0.3/bin64/mpivars.sh
```
QMP was compiled using:
``` shell
./configure --prefix=$QMP_INSTALL_DIR CC=mpiicc CFLAGS="-xMIC-AVX512 -mGLOB_default_function_attrs=\"use_gather_scatter_hint=off\" -openmp -g -O2 -fno-alias -std=c99" --with-qmp-comms-type=MPI --host=x86_64-linux-gnu --build=none-none-none
make
make install
```
QPhiX was then configured with
``` shell
./configure --enable-parallel-arch=parscalar --enable-proc=AVX512 --enable-soalen=8 --enable-clover --enable-openmp --enable-cean --enable-mm-malloc CXXFLAGS="-qopenmp -xMIC-AVX512 -g -O3 -std=c++14" CFLAGS="-xMIC-AVX512 -qopenmp -O3 -std=c99" CXX=mpiicpc CC=mpiicc --host=x86_64-linux-gnu --build=none-none-none --with-qmp=/home/finkenrath/benchmark/qmp/install
```
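This is followed, as in the MareNostrum example, by building the library:

``` shell
make
```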
The Accelerator QCD Benchmark suite Part 2 provides bash scripts to set up the benchmark runs
on the target machines. These bash scripts are:
- `run_ana.sh` : main script; sets up the benchmark mode and submits the jobs (and analyses the results)
- `prepare_submit_job.sh` : generates the job scripts
- `submit_job.sh.template` : template for the submit script
The path to the executable has to be set via `$PATH2EXE`.
Different scaling modes, from strong scaling to weak scaling,
can be chosen via the variable `sca_mode` (`"Strong"` or `"Weak"`).
The lattice sizes are set by `gx` and `gt`.
Choose `mode="Run"` for the run mode and `mode="Analysis"` for extracting the GFLOPS.
Note that submission is done with `sbatch`; adapt this to the queueing system of
your target machine.
Add additional options if necessary.
The submit template is edited by `prepare_submit_job.sh` to generate
the final submit script. Its header should be adapted to the queueing system
of the target machine, as sketched below.
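Since the jobs are submitted with `sbatch`, the template header is a standard SLURM preamble. A hypothetical example is given below; all directives and values are placeholders that must be adapted to the target machine:

``` shell
#!/bin/bash
#SBATCH --job-name=qcd_ueabs_part2   # hypothetical job name
#SBATCH --nodes=4                    # number of nodes for this scaling point
#SBATCH --ntasks-per-node=1          # MPI ranks per node (e.g. one per KNL card)
#SBATCH --time=00:30:00              # wall-clock limit
```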
The benchmark results for the Xeon Phi part were obtained on
Frioul, a test cluster at CINES, and on the hybrid partition of MareNostrum III at BSC.
Frioul has one KNL card per node, while the hybrid partition of MareNostrum III is
equipped with two KNCs per node. The data on Frioul were generated using
the bash scripts provided by the QCD Accelerator Benchmark suite Part 2
for the two test cases: "Strong-Scaling" with a lattice size
of 32^3x96 and "Weak-Scaling" with a local lattice size of 48^3x24 per
card. For MareNostrum, data for the "Strong-Scaling"
mode on a 32^3x96 lattice are shown. The benchmark uses a random gauge configuration and the
Conjugate Gradient solver to solve a linear equation involving the clover-improved Wilson Dirac operator.
```
---------------------
Frioul - KNLs
---------------------
Strong - Scaling:
global lattice size (32x32x32x96)
precision: single
KNLs GFLOPS
1 340.75
2 627.612
4 1111.13
8 1779.34
16 2410.8
precision: double
KNLs GFLOPS
1 328.149
2 616.467
4 1047.79
8 1616.37
Weak - Scaling:
local lattice size (48x48x48x24)
precision: single
KNLs GFLOPS
1 348.304
2 616.697
4 1214.82
8 2425.45
16 4404.63
precision: double
KNLs GFLOPS
1 172.303
2 320.761
4 629.79
8 1228.77
16 2310.63
---------------------
MareNostrum III - KNCs
---------------------
Strong - Scaling:
global lattice size (32x32x32x96)
precision: single - 1 Cards per Node
KNCs GFLOPS
2 103.561
4 200.159
8 338.276
16 534.369
32 815.896
precision: single - 2 Cards per Node
KNCs GFLOPS
4 118.995
8 212.558
16 368.196
32 605.882
64 847.566
```