# README - QCD UEABS Part 2
**2017 -  Jacob Finkenrath - CaSToRC - The Cyprus Institute  (j.finkenrath@cyi.ac.cy)**

The QCD Accelerator Benchmark suite Part 2 consists of two kernels: one based on the QUDA library [^quda] and one based on the QPhiX library [^qphix]. QUDA is based on CUDA and is optimized for running on NVIDIA GPUs (https://lattice.github.io/quda/). The QPhiX library consists of routines that are optimized using Intel intrinsic functions for multiple vector lengths, including AVX512, and provides optimized routines for KNC and KNL (http://jeffersonlab.github.io/qphix/). The benchmark kernels use the Conjugate Gradient benchmark functions provided by the libraries.

[^quda]: R. Babbich, M. Clark and B. Joo, “Parallelizing the QUDA Library for Multi-GPU Calculations

[^qphix]: B. Joo, D. D. Kalamkar, K. Vaidyanathan, M. Smelyanskiy, K. Pamnany, V. W. Lee, P. Dubey,

## Table of Contents

[TOC]


## GPU - Kernel
### 1. Compile and Run the GPU-Benchmark Suite

#### 1.1 Compile

Download CMake and QUDA.

General information on how to build QUDA with CMake can be found at:
https://github.com/lattice/quda/wiki/Building-QUDA-with-cmake
Here we give only a short overview.

Build CMake: (./QCD_Accelerator_Benchmarksuite_Part2/GPUs/src/cmake-3.7.0.tar.gz)

The CMake source can be downloaded from https://cmake.org/download/
This guide uses version cmake-3.7.0. The build instructions can be found in README.rst in the main directory of the CMake source. Run the configure script `./configure`, then run `gmake`.

Build QUDA: (./QCD_Accelerator_Benchmarksuite_Part2/GPUs/src/quda.tar.gz)

Download QUDA, for example with `git clone https://github.com/lattice/quda.git`.
Create a build folder. In the build folder, run the `cmake` executable, which
is located in cmake/bin:

``` shell
$PATH2CMAKE/cmake $PATH2QUDA -DQUDA_GPU_ARCH=sm_XX -DQUDA_DIRAC_WILSON=ON -DQUDA_DIRAC_TWISTED_MASS=OFF \
  -DQUDA_DIRAC_DOMAIN_WALL=OFF -DQUDA_HISQ_LINK=OFF -DQUDA_GAUGE_FORCE=OFF -DQUDA_HISQ_FORCE=OFF -DQUDA_MPI=ON
```

with

``` shell
  PATH2CMAKE= path to the cmake-executable
  PATH2QUDA= path to the home dir of QUDA
```

Set `-DQUDA_GPU_ARCH=sm_XX` to the GPU architecture (`sm_60` for Pascal, `sm_35` for Kepler).

If CMake or the compilation fails, library paths and options can be set with the CMake tool `ccmake`.
Use `$PATH2CMAKE/ccmake PATH2BUILD_DIR` to view and edit the available options.
CMake generates the Makefiles. Run them with `make`.
The required QUDA executable "invert_" can then be found in the folder /test.
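
The mapping from GPU generation to `-DQUDA_GPU_ARCH` can be wrapped in a small helper so that the configure line is not edited by hand. The following is only a sketch: the function name `quda_cmake_args` and the labels `pascal` and `kepler` are hypothetical, only a subset of the options listed above is shown, and the command is printed rather than executed.

```shell
#!/usr/bin/env bash
# Sketch: assemble part of the QUDA configure command for a GPU generation.
# The function name and the generation labels are hypothetical.

quda_cmake_args() {
  local gpu="$1" arch
  case "$gpu" in
    pascal) arch=sm_60 ;;   # value named in this README for Pascal
    kepler) arch=sm_35 ;;   # value named in this README for Kepler
    *) echo "unknown GPU generation: $gpu" >&2; return 1 ;;
  esac
  # One option per line, so callers can filter or extend the list.
  printf '%s\n' \
    "-DQUDA_GPU_ARCH=$arch" \
    "-DQUDA_DIRAC_WILSON=ON" \
    "-DQUDA_DIRAC_TWISTED_MASS=OFF" \
    "-DQUDA_MPI=ON"
}

# Dry run: print the command instead of executing it.
echo "cmake \$PATH2QUDA $(quda_cmake_args pascal | tr '\n' ' ')"
```

Printing the command first makes it easy to check the options before starting a long build.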

#### 1.2 Run

The Accelerator QCD-Benchmarksuite Part 2 provides bash scripts, located in the folder `./QCD_Accelerator_Benchmarksuite_Part2/GPUs/scripts`, to set up the benchmark runs on the target machines. These bash scripts are:

 - `run_ana.sh`              :   Main script, sets up the benchmark mode and submits the jobs (or analyses the results)
 - `prepare_submit_job.sh`   :   Generates the job scripts
 - `submit_job.sh.template`  :   Template for the submit script

##### 1.2.1 Main-script: "run_ana.sh"

The path to the executable has to be set with `$PATH2EXE`. QUDA automatically tunes the GPU kernels. The optimal setup is saved in the folder given by the variable `QUDA_RESOURCE_PATH`; set it to the folder where the tuning data should be stored. The scaling mode is selected with the variable `sca_mode` (`"Strong"` for strong scaling or `"Weak"` for weak scaling). The lattice sizes can be set with `gx` and `gt`. Choose `mode="Run"` to run the benchmark and `mode="Analysis"` to extract the GFLOPS. Note that jobs are submitted with `sbatch`; adapt this to the queuing system of your target machine.
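
To make the two scaling modes concrete, the sketch below computes how many lattice sites each GPU works on. It assumes that `gx` is the spatial extent and `gt` the temporal extent, so that the lattice is gx^3 x gt, and that the GPU count divides the global volume evenly. The function names are hypothetical and are not part of `run_ana.sh`.

```shell
#!/usr/bin/env bash
# Sketch of the lattice-size arithmetic behind the two scaling modes.
# Assumption: gx is the spatial extent, gt the temporal extent (gx^3 x gt).

lattice_sites() {          # total number of sites of a gx^3 x gt lattice
  local gx="$1" gt="$2"
  echo $(( gx * gx * gx * gt ))
}

sites_per_gpu() {          # sites handled by one GPU
  local sca_mode="$1" gx="$2" gt="$3" ngpus="$4"
  case "$sca_mode" in
    Strong) echo $(( $(lattice_sites "$gx" "$gt") / ngpus )) ;;  # global size is fixed
    Weak)   lattice_sites "$gx" "$gt" ;;                         # local size is fixed
    *) echo "sca_mode must be Strong or Weak" >&2; return 1 ;;
  esac
}

for n in 1 8 64; do
  echo "Strong 32^3x96 on $n GPUs: $(sites_per_gpu Strong 32 96 "$n") sites per GPU"
done
```

In strong scaling the work per GPU shrinks as GPUs are added, while in weak scaling it stays constant and the global lattice grows with the number of GPUs.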

##### 1.2.2 Job-script generator: "prepare_submit_job.sh"

Add additional options if necessary.

##### 1.2.3 Template: "submit_job.sh.template"

The submit template is edited by `prepare_submit_job.sh` to generate the final submit script. The header should be adapted to the queuing system
of the target machine.
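
The template mechanism can be pictured as a plain text substitution. The sketch below fills a small stand-in template with `sed`. The placeholder names `@NODES@` and `@EXE@`, the template body and the executable path are all hypothetical; the real `submit_job.sh.template` may use different markers.

```shell
#!/usr/bin/env bash
# Sketch: fill a submit-script template with sed.
# The placeholders (@NODES@, @EXE@) and the template body are hypothetical;
# the real submit_job.sh.template may use different markers.

workdir="$(mktemp -d)"
cat > "$workdir/submit_job.sh.template" <<'EOF'
#!/bin/bash
#SBATCH --nodes=@NODES@
srun @EXE@
EOF

fill_template() {
  local template="$1" nodes="$2" exe="$3"
  # '|' is used as the sed delimiter because paths contain '/'
  sed -e "s|@NODES@|$nodes|g" -e "s|@EXE@|$exe|g" "$template"
}

fill_template "$workdir/submit_job.sh.template" 4 ./my_benchmark_exe > "$workdir/submit_job.sh"
cat "$workdir/submit_job.sh"
```

When porting to another queuing system, only the header lines of the template need to change; the substitution step stays the same.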


#### 1.3 Example Benchmark results

Shown here are benchmark results on PizDaint, located at CSCS in Switzerland, and on the GPGPU partition of Cartesius at Surfsara in Amsterdam, Netherlands. The runs were performed with the provided bash scripts. PizDaint has one Pascal GPU per node, and two test cases are shown:
the "Strong-Scaling" mode with a random lattice configuration of size 32x32x32x96 and the "Weak-Scaling" mode with a configuration of local lattice size 48x48x48x24. The GPGPU nodes of Cartesius have two Kepler GPUs per node, and the "Strong-Scaling" test is shown for the cases where one card per node and two cards per node are used. The benchmarks use the Conjugate Gradient solver, which solves a linear equation, D * x = b, for the unknown solution "x", based on the clover-improved Wilson Dirac operator "D" and a known right-hand side "b".
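
For orientation, the textbook Conjugate Gradient iteration for a Hermitian positive-definite matrix $A$ is sketched below. Since the Wilson Dirac operator $D$ is not Hermitian, CG is commonly applied to the normal equations, with $A = D^\dagger D$ and $b' = D^\dagger b$. Whether the benchmark executables use exactly this form is an assumption here and is not stated in this README.

```latex
\begin{aligned}
& r_0 = b' - A x_0, \qquad p_0 = r_0, \\
& \alpha_k = \frac{r_k^\dagger r_k}{p_k^\dagger A p_k}, \\
& x_{k+1} = x_k + \alpha_k p_k, \\
& r_{k+1} = r_k - \alpha_k A p_k, \\
& \beta_k = \frac{r_{k+1}^\dagger r_{k+1}}{r_k^\dagger r_k}, \\
& p_{k+1} = r_{k+1} + \beta_k p_k .
\end{aligned}
```

Each iteration is dominated by one application of $A$, which is why the GFLOPS of the solver largely reflect the performance of the Dirac operator kernel.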

```
---------------------
  PizDaint - Pascal
---------------------
Strong - Scaling:
global lattice size (32x32x32x96)

sloppy-precision: single
       precision: single

GPUs     GFLOPS      sec
1    786.520000 4.569600
2   1522.410000 3.086040
4   2476.900000 2.447180
8   3426.020000 2.117580
16  5091.330000 1.895790
32  8234.310000 1.860760
64  8276.480000 1.869230

sloppy-precision: double
       precision: double

GPUs     GFLOPS      sec
1    385.965000 6.126730
2    751.227000 3.846940
4   1431.570000 2.774470
8   1368.000000 2.367040
16  2304.900000 2.071160
32  4965.480000 2.095180
64  2308.850000 2.005110


Weak - Scaling:
local lattice size (48x48x48x24)

sloppy-precision: single
       precision: single

GPUs     GFLOPS      sec
1     765.967000 3.940280
2    1472.980000 4.004630
4    2865.600000 4.044360
8    5421.270000 4.056410
16   9373.760000 7.396590
32  17995.100000 4.243390
64  27219.800000 4.535410

sloppy-precision: double
       precision: double

GPUs    GFLOPS      sec
 1   376.611000 5.108900
 2   728.973000 5.190880
 4  1453.500000 5.144160
 8  2884.390000 5.207090
16  5004.520000 5.362020
32  8744.090000 5.623290
64  14053.00000 5.910520
```

```
---------------------
  SurfSara - Kepler
---------------------
##
## 1 GPU per Node
##

Strong - Scaling:
global lattice size (32x32x32x96)

sloppy-precision: single
       precision: single

GPUs    GFLOPS      sec
1      243.084000 4.030000
2      478.179000 2.630000
4      939.953000 2.250000
8     1798.240000 1.570000
16    3072.440000 1.730000
32    4365.320000 1.310000

sloppy-precision: double
       precision: double

GPUs    GFLOPS      sec
1      119.786000 6.060000
2      234.179000 3.290000
4      463.594000 2.250000
8      898.090000 1.960000
16    1604.210000 1.480000
32    2420.130000 1.630000

##
## 2 GPU per Node
##

Strong - Scaling:
global lattice size (32x32x32x96)

sloppy-precision: single
       precision: single

GPUs    GFLOPS      sec
2      463.041000 2.720000
4      896.707000 1.940000
8     1672.080000 1.680000
16    2518.240000 1.420000
32    3800.970000 1.460000
64    4505.440000 1.430000

sloppy-precision: double
       precision: double

GPUs    GFLOPS      sec
2     229.579000 3.380000
4     450.425000 2.280000
8     863.117000 1.830000
16   1348.760000 1.510000
32   1842.560000 1.550000
64   2645.590000 1.480000
```
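
One common way to read such strong-scaling tables is the parallel efficiency, defined here as GFLOPS(N) / ((N / N0) * GFLOPS(N0)), where N0 is the smallest device count in the table. The sketch below computes it with `awk` for rows in the "GPUs GFLOPS sec" layout used above. The function name is hypothetical and the input rows are copied from the PizDaint single-precision strong-scaling table.

```shell
#!/usr/bin/env bash
# Sketch: strong-scaling parallel efficiency from "GPUs GFLOPS sec" rows.
# efficiency(N) = GFLOPS(N) / ((N / N0) * GFLOPS(N0)), N0 = first row.

scaling_efficiency() {
  # Reads rows from stdin, prints "<GPUs> <efficiency>" per row.
  LC_ALL=C awk 'NR == 1 { n0 = $1; g0 = $2 }
       { printf "%d %.3f\n", $1, $2 / ($1 / n0 * g0) }'
}

# PizDaint, strong scaling, single precision (values copied from the table)
scaling_efficiency <<'EOF'
1    786.520000 4.569600
2   1522.410000 3.086040
4   2476.900000 2.447180
8   3426.020000 2.117580
EOF
```

Because the reference is the first row, the same helper also works for tables that start at two devices.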

## x86 Kernel
### 2. Compile and Run the x86-Part

Unpack the provided source tar-file located in `./QCD_Accelerator_Benchmarksuite_Part2/XeonPhi/src` or
clone the current GitHub branches of the code
packages. For QMP:

``` shell
git clone https://github.com/usqcd-software/qmp
```

and for QPhiX:

``` shell
git clone https://github.com/JeffersonLab/qphix
```

Note that for running on Skylake chips it is recommended to use
the `develop` branch of QPhiX, which needs additional packages
such as QDP++ (status 04/2019).

#### 2.1 Compile

The QPhiX library is based on QMP communication functions.
Therefore QMP has to be set up first.

``` shell
# "-mmic/-xAVX512" is a placeholder: use -mmic for KNC or -xAVX512 for KNL
./configure --prefix=$QMP_INSTALL_DIR CC=mpiicc CFLAGS=" -mmic/-xAVX512 -std=c99" --with-qmp-comms-type=MPI --host=x86_64-linux-gnu --build=none-none-none
```

Create the install folder and set `$QMP_INSTALL_DIR` to point to it.
Use the compiler flag `-mmic` when compiling for KNCs
and `-xAVX512` when compiling for KNLs.
Then use
``` shell
make
make install
```

to compile QMP and install the necessary files in `$QMP_INSTALL_DIR`.
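
The choice between the two architecture flags can be made explicit in a small helper. This is only a sketch: the function name `qmp_arch_flag` and the example install location are hypothetical, and the configure command is printed instead of executed. Note that the site-specific examples further below use `-xMIC-AVX512` for KNL, so the exact flag may depend on the compiler version.

```shell
#!/usr/bin/env bash
# Sketch: pick the architecture flag for the QMP build and print the
# configure command (dry run). The function name is hypothetical.

qmp_arch_flag() {
  case "$1" in
    KNC) echo "-mmic" ;;      # flag named in this README for KNCs
    KNL) echo "-xAVX512" ;;   # flag named in this README for KNLs
    *) echo "unknown target: $1 (expected KNC or KNL)" >&2; return 1 ;;
  esac
}

target=KNL
QMP_INSTALL_DIR="${QMP_INSTALL_DIR:-$HOME/qmp/install}"   # example location only

echo ./configure --prefix="$QMP_INSTALL_DIR" CC=mpiicc \
  CFLAGS="\"$(qmp_arch_flag "$target") -std=c99\"" \
  --with-qmp-comms-type=MPI --host=x86_64-linux-gnu --build=none-none-none
```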

The QPhiX executable can be compiled with the following command
for KNCs:

``` shell
./configure --enable-parallel-arch=parscalar --enable-proc=MIC --enable-soalen=8 --enable-clover --enable-openmp --enable-cean --enable-mm-malloc CXXFLAGS="-openmp -mmic -vec-report -restrict -mGLOB_default_function_attrs=\"use_gather_scatter_hint=off\" -g -O2 -finline-functions -fno-alias -std=c++0x" CFLAGS="-mmic -vec-report -restrict -mGLOB_default_function_attrs=\"use_gather_scatter_hint=off\" -openmp -g  -O2 -fno-alias -std=c99" CXX=mpiicpc CC=mpiicc --host=x86_64-linux-gnu --build=none-none-none --with-qmp=$QMP_INSTALL_DIR
```

or for KNL's

``` shell
./configure --enable-parallel-arch=parscalar --enable-proc=AVX512 --enable-soalen=8 --enable-clover --enable-openmp --enable-cean --enable-mm-malloc CXXFLAGS="-qopenmp -xMIC-AVX512 -g -O3 -std=c++14" CFLAGS="-xMIC-AVX512 -qopenmp -O3 -std=c99" CXX=mpiicpc CC=mpiicc --host=x86_64-linux-gnu --build=none-none-none --with-qmp=$QMP_INSTALL_DIR
```

Both commands use the variable `QMP_INSTALL_DIR` defined above, which points to the install folder
of QMP. The executable `time_clov_noqdp` can then be found in the subfolder `./qphix/test`.


Note that for the `develop` branch the package QDP++ has to be compiled first.
QDP++ can be configured with (here for a Skylake chip):

``` shell
./configure --with-qmp=$QMP_INSTALL_DIR --enable-parallel-arch=parscalar CC=mpiicc CFLAGS="-xCORE-AVX512 -mtune=skylake-avx512 -std=c99" CXX=mpiicpc CXXFLAGS="-axCORE-AVX512 -mtune=skylake-avx512 -std=c++14 -qopenmp" --enable-openmp --host=x86_64-linux-gnu --build=none-none-none --prefix=$QDPXX_INSTALL_DIR
```

Now the QPhiX executable can be compiled with:


``` shell
cmake -DQDPXX_DIR=$QDPXX_INSTALL_DIR -DQMP_DIR=$QMP_INSTALL_DIR -Disa=avx512 -Dparallel_arch=parscalar -Dhost_cxx=mpiicpc -Dhost_cxxflags="-std=c++17 -O3 -axCORE-AVX512 -mtune=skylake-avx512" -Dtm_clover=ON -Dtwisted_mass=ON -Dtesting=ON -DCMAKE_CXX_COMPILER=mpiicpc -DCMAKE_CXX_FLAGS="-std=c++17 -O3 -axCORE-AVX512 -mtune=skylake-avx512" -DCMAKE_C_COMPILER=mpiicc -DCMAKE_C_FLAGS="-std=c99 -O3 -axCORE-AVX512 -mtune=skylake-avx512" ..
```

The executable `time_clov_noqdp` can now be found in the subfolder `./qphix/test`.

##### 2.1.1 Example compilation on PRACE machines

In this subsection we provide some example compilations on PRACE machines
which were used to develop the QCD Benchmarksuite 2.

###### 2.1.1.1 BSC - MareNostrum III Hybrid partitions

The hybrid partition of MareNostrum III is equipped with KNCs.
First, the following modules were loaded

``` shell
module unload openmpi
module load impi
```

and the necessary links are set with

``` shell
source /opt/intel/impi/4.1.1.036/bin64/mpivars.sh
source /opt/intel/2013.5.192/composer_xe_2013.5.192/bin/compilervars.sh intel64
export I_MPI_MIC=enable
export I_MPI_HYDRA_BOOTSTRAP=ssh
```

The QMP-library was configured and compiled with

``` shell
./configure --prefix=$QMP_INSTALL_DIR CC=mpiicc CFLAGS="-mmic -std=c99" --with-qmp-comms-type=MPI --host=x86_64-linux-gnu --build=none-none-none
make
make install
```

Now the package QPhiX is compiled with

``` shell
./configure --enable-parallel-arch=parscalar --enable-proc=MIC --enable-soalen=8 --enable-clover --enable-openmp --enable-cean --enable-mm-malloc CXXFLAGS="-openmp -mmic -vec-report -restrict -mGLOB_default_function_attrs=\"use_gather_scatter_hint=off\" -g -O2 -finline-functions -fno-alias -std=c++0x" CFLAGS="-mmic -vec-report -restrict -mGLOB_default_function_attrs=\"use_gather_scatter_hint=off\" -openmp -g  -O2 -fno-alias -std=c99" CXX=mpiicpc CC=mpiicc --host=x86_64-linux-gnu --build=none-none-none --with-qmp=$QMP_INSTALL_DIR
make
```

###### 2.1.1.2 CINES - Frioul

On a test cluster at the CINES site, the Benchmarksuite was tested on KNLs.
The steps are similar to those at BSC. First the library paths are set with

``` shell
source /opt/software/intel/composer_xe_2015/bin/compilervars.sh intel64
source /opt/software/intel/impi_5.0.3/bin64/mpivars.sh
```

QMP was compiled using:

``` shell
./configure --prefix=$QMP_INSTALL_DIR CC=mpiicc CFLAGS="-xMIC-AVX512 -mGLOB_default_function_attrs="use_gather_scatter_hint=off" -openmp -g  -O2 -fno-alias -std=c99"  --with-qmp-comms-type=MPI --host=x86_64-linux-gnu --build=none-none-none
make
make install
```

QPhiX was configured and compiled using

``` shell
./configure --enable-parallel-arch=parscalar --enable-proc=AVX512 --enable-soalen=8 --enable-clover --enable-openmp --enable-cean --enable-mm-malloc CXXFLAGS="-qopenmp -xMIC-AVX512 -g -O3 -std=c++14" CFLAGS="-xMIC-AVX512 -qopenmp -O3 -std=c99" CXX=mpiicpc CC=mpiicc --host=x86_64-linux-gnu --build=none-none-none --with-qmp=/home/finkenrath/benchmark/qmp/install
make
```

#### 2.2 Run

The Accelerator QCD-Benchmarksuite Part 2 provides bash scripts to set up the benchmark runs
on the target machines. These bash scripts are:

 - `run_ana.sh`              :   Main script, sets up the benchmark mode and submits the jobs (or analyses the results)
 - `prepare_submit_job.sh`   :   Generates the job scripts
 - `submit_job.sh.template`  :   Template for the submit script

##### 2.2.1 Main-script: "run_ana.sh"

The path to the executable has to be set with `$PATH2EXE`.
The scaling mode is selected with the variable `sca_mode`
(`"Strong"` for strong scaling or `"Weak"` for weak scaling).
The lattice sizes can be set with `gx` and `gt`.
Choose `mode="Run"` to run the benchmark and `mode="Analysis"` to extract the GFLOPS.
Note that jobs are submitted with `sbatch`; adapt this to the queuing system of
your target machine.
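
One way to keep the submission step portable across queuing systems is to put the submit command into a variable and to offer a dry-run mode. The sketch below illustrates the idea; the variable names `SUBMIT_CMD` and `DRY_RUN`, the function `submit_all` and the job-script naming scheme are hypothetical and are not taken from `run_ana.sh`.

```shell
#!/usr/bin/env bash
# Sketch: loop over device counts and submit one job per count.
# SUBMIT_CMD, DRY_RUN and the job-script names are hypothetical.

SUBMIT_CMD="${SUBMIT_CMD:-sbatch}"   # e.g. qsub on a PBS system
DRY_RUN="${DRY_RUN:-1}"              # 1: only print the commands

submit_all() {
  local n job
  for n in "$@"; do
    job="submit_job_${n}.sh"
    if [ "$DRY_RUN" = "1" ]; then
      echo "$SUBMIT_CMD $job"        # show what would be submitted
    else
      "$SUBMIT_CMD" "$job"
    fi
  done
}

submit_all 1 2 4 8 16
```

With the default `DRY_RUN=1` the loop only prints the commands, which is a cheap check before queue time is spent.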

##### 2.2.2 Job-script generator: "prepare_submit_job.sh"

Add additional options if necessary.

##### 2.2.3 Template: "submit_job.sh.template"

The submit template is edited by `prepare_submit_job.sh` to generate
the final submit script. The header should be adapted to the queuing system
of the target machine.


#### 2.3 Example Benchmark Results

The benchmark results for the XeonPhi benchmark suite were obtained on
Frioul, a test cluster at CINES, and on the hybrid partition of MareNostrum III at BSC.
Frioul has one KNL card per node, while the hybrid partition of MareNostrum III is
equipped with two KNCs per node. The data on Frioul were generated using
the bash scripts provided by the QCD-Accelerator Benchmarksuite Part 2
for the two test cases "Strong-Scaling" with a lattice size
of 32^3x96 and "Weak-Scaling" with a local lattice size of 48^3x24 per
card. For MareNostrum, data for the "Strong-Scaling"
mode on a 32^3x96 lattice are shown. The benchmark uses a random gauge configuration and the
Conjugate Gradient solver to solve a linear equation involving the clover Wilson Dirac operator.

```
---------------------
  Frioul - KNLs
---------------------
Strong - Scaling:
global lattice size (32x32x32x96)

precision: single

KNLs     GFLOPS
1       340.75
2       627.612
4      1111.13
8      1779.34
16     2410.8

precision: double

KNLs     GFLOPS
1      328.149
2      616.467
4      1047.79
8      1616.37

Weak - Scaling:
local lattice size (48x48x48x24)

precision: single

KNLs   GFLOPS
1       348.304
2       616.697
4      1214.82
8      2425.45
16     4404.63

precision: double

KNLs   GFLOPS
 1      172.303
 2      320.761
 4      629.79
 8     1228.77
16     2310.63
```

```
---------------------
  MareNostrum III - KNC's
---------------------

Strong - Scaling:
global lattice size (32x32x32x96)

precision: single - 1 Cards per Node

KNCs  GFLOPS
2    103.561
4    200.159
8    338.276
16   534.369
32   815.896

precision: single - 2 Cards per Node

KNCs  GFLOPS
4    118.995
8    212.558
16   368.196
32   605.882
64   847.566
```
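
To compare the two MareNostrum III setups at equal card counts, the two tables can be joined on the number of KNCs. The sketch below does this with `awk`; the function name `compare_tables` and the temporary file names are hypothetical, and the numbers are copied from the table above.

```shell
#!/usr/bin/env bash
# Sketch: compare two "devices GFLOPS" tables at equal device counts.
# Prints: devices, GFLOPS of table A, GFLOPS of table B, ratio B/A.

compare_tables() {
  # $1, $2: files with two whitespace-separated columns (devices, GFLOPS)
  LC_ALL=C awk 'NR == FNR { a[$1] = $2; next }   # first file: remember GFLOPS
       ($1 in a) { printf "%d %s %s %.2f\n", $1, a[$1], $2, $2 / a[$1] }' "$1" "$2"
}

tmp="$(mktemp -d)"
# MareNostrum III, strong scaling, single precision (values from the table above)
printf '%s\n' "2 103.561" "4 200.159" "8 338.276" "16 534.369" "32 815.896" > "$tmp/one_card.txt"
printf '%s\n' "4 118.995" "8 212.558" "16 368.196" "32 605.882" "64 847.566" > "$tmp/two_cards.txt"

compare_tables "$tmp/one_card.txt" "$tmp/two_cards.txt"
```

Only card counts that appear in both tables are printed, so the rows for 2 and 64 KNCs are skipped.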