Set -DQUDA_GPU_ARCH=sm_XX to match the GPU architecture of the target machine (sm_60 for Pascal, sm_35 for Kepler).
If cmake or the compilation fails, library paths and options can be
set via the text user interface of cmake, "ccmake". Run
"./PATH2CMAKE/ccmake PATH2BUILD_DIR" to see and edit the available
options. After successfully configuring the build, run "make". The
QUDA executables needed for the benchmark, which begin with
"invert_", can then be found in the folder test/.
##
## 1.2 Run
##
The Accelerator QCD-Benchmarksuite Part 2 provides bash scripts,
located in the folder
"./QCD_Accelerator_Benchmarksuite_Part2/GPUs/scripts", to set up the
benchmark runs on the target machines. These bash scripts are:
run_ana.sh : Main script; sets up the benchmark mode and submits the jobs (or analyses the results)
prepare_submit_job.sh : Generates the job scripts
submit_job.sh.template : Template for the submit script
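Assuming the scripts are used in place, a benchmark campaign is started by editing the variables described in the following subsections and then running the main script:

  cd ./QCD_Accelerator_Benchmarksuite_Part2/GPUs/scripts
  ./run_ana.sh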
##
## 1.2.1 Main-script: "run_ana.sh"
##
The path to the executable has to be set via $PATH2EXE. Upon first
run, QUDA automatically tunes its GPU kernels by sweeping the number
of threads per block. The optimal setup is saved in the folder
pointed to by the environment variable "QUDA_RESOURCE_PATH". You
must set this variable, otherwise the tuning data will be lost and
performance will be sub-optimal. Set it to the folder where the tuning
data should be saved. Strong scaling or weak scaling can be chosen via
the variable sca_mode (="Strong" or ="Weak"). The lattice sizes
can be set by "gx" and "gt". Choose mode="Run" for run mode or
mode="Analysis" for extracting the GFLOPS. Note that the script
assumes Slurm is used as the job scheduler. If not, change the line
which contains the "sbatch" command accordingly.
##
## 1.2.2 Script: "prepare_submit_job.sh"
##
This script generates the final job script from the template; add additional options here if necessary.
##
## 1.2.3 Template: "submit_job.sh.template"
##
The submit template is edited by "prepare_submit_job.sh" to
generate the final submit script. The first lines (beginning with
"#SBATCH") depend on the queuing system of the target machine, which
in this case is assumed to be Slurm. They should be adapted in case of
a different queuing system.
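For reference, the Slurm header of a generated submit script could look roughly like this sketch (the option names are standard Slurm, but the concrete values, partition and launch line are placeholders that prepare_submit_job.sh or the user has to fill in):

  #!/bin/bash
  #SBATCH --job-name=quda_bench        # illustrative job name
  #SBATCH --nodes=4                    # filled in per benchmark point
  #SBATCH --ntasks-per-node=1          # e.g. one MPI rank per GPU
  #SBATCH --time=00:30:00
  #SBATCH --partition=gpu              # machine-specific partition name

  srun $PATH2EXE ...                   # the actual invert_ options are set by the scripts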
##
## 1.3 Example Benchmark results
##
Shown here are the benchmark results on Piz Daint, located at CSCS in
Switzerland, and on the GPGPU partition of Cartesius at SURFsara in
Amsterdam, the Netherlands. The runs were performed using the provided
bash scripts. Piz Daint has one Pascal GPU per node, and two test cases
are shown: a "Strong-Scaling" mode with a random lattice configuration
of size 32^3x96 and a "Weak-Scaling" mode with a configuration of local
lattice size 48^3x24. The GPGPU nodes of Cartesius have two Kepler GPUs
per node, and the "Strong-Scaling" test is shown for the cases that one
card per node and two cards per node are used.
The benchmarks use the Conjugate Gradient solver, which solves a linear
equation, D * x = b, for the unknown solution "x", based on the
clover-improved Wilson Dirac operator "D" and a known right-hand side "b".
---------------------
Piz Daint - Pascal P100
---------------------
Strong - Scaling:
global lattice size (32x32x32x96)
sloppy-precision: single
precision: single
GPUs GFLOPS sec
1 786.520000 4.569600
2 1522.410000 3.086040
4 2476.900000 2.447180
8 3426.020000 2.117580
16 5091.330000 1.895790
32 8234.310000 1.860760
64 8276.480000 1.869230
sloppy-precision: double
precision: double
GPUs GFLOPS sec
1 385.965000 6.126730
2 751.227000 3.846940
4 1431.570000 2.774470
8 1368.000000 2.367040
16 2304.900000 2.071160
32 4965.480000 2.095180
64 2308.850000 2.005110
Weak - Scaling:
local lattice size (48x48x48x24)
sloppy-precision: single
precision: single
GPUs GFLOPS sec
1 765.967000 3.940280
2 1472.980000 4.004630
4 2865.600000 4.044360
8 5421.270000 4.056410
16 9373.760000 7.396590
32 17995.100000 4.243390
64 27219.800000 4.535410
sloppy-precision: double
precision: double
GPUs GFLOPS sec
1 376.611000 5.108900
2 728.973000 5.190880
4 1453.500000 5.144160
8 2884.390000 5.207090
16 5004.520000 5.362020
32 8744.090000 5.623290
64 14053.000000 5.910520
---------------------
SURFsara - Kepler K20m
---------------------
##
## 1 GPU per Node
##
Strong - Scaling:
global lattice size (32x32x32x96)
sloppy-precision: single
precision: single
GPUs GFLOPS sec
1 243.084000 4.030000
2 478.179000 2.630000
4 939.953000 2.250000
8 1798.240000 1.570000
16 3072.440000 1.730000
32 4365.320000 1.310000
sloppy-precision: double
precision: double
GPUs GFLOPS sec
1 119.786000 6.060000
2 234.179000 3.290000
4 463.594000 2.250000
8 898.090000 1.960000
16 1604.210000 1.480000
32 2420.130000 1.630000
##
## 2 GPUs per Node
##
Strong - Scaling:
global lattice size (32x32x32x96)
sloppy-precision: single
precision: single
GPUs GFLOPS sec
2 463.041000 2.720000
4 896.707000 1.940000
8 1672.080000 1.680000
16 2518.240000 1.420000
32 3800.970000 1.460000
64 4505.440000 1.430000
sloppy-precision: double
precision: double
GPUs GFLOPS sec
2 229.579000 3.380000
4 450.425000 2.280000
8 863.117000 1.830000
16 1348.760000 1.510000
32 1842.560000 1.550000
64 2645.590000 1.480000
###
###
### XEONPHI - BENCHMARK SUITE
###
###
##
## 2. Compile and Run the XeonPhi-Benchmark Suite
##
Unpack the provided source tar-file located in
"./QCD_Accelerator_Benchmarksuite_Part2/XeonPhi/src" or clone the
current GitHub branches of the code packages, QMP:
"git clone https://github.com/usqcd-software/qmp"
and, for QPhiX,
"git clone https://github.com/JeffersonLab/qphix"
Note that the AVX512 instructions, which are needed for an optimal run
on KNLs, are not yet part of the main branch. The AVX512 instructions
are available in the avx512 branch ("git checkout avx512"). The
provided source tar-file uses the avx512 branch (status as of 01/2017).
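Putting the above together, obtaining the sources from GitHub could look like this (the working-directory layout is an assumption; the repository URLs and the avx512 branch name are taken from the text above):

  git clone https://github.com/usqcd-software/qmp
  git clone https://github.com/JeffersonLab/qphix
  cd qphix
  git checkout avx512    # AVX512 branch, needed for KNL (status as of 01/2017)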
##
## 2.1 Compile
##
The QPhiX library must be built on top of QMP, a thin communication layer