### README - QCD Accelerator Benchmarksuite Part 2
###
### 2017 - Jacob Finkenrath - CaSToRC - The Cyprus Institute (j.finkenrath@cyi.ac.cy)
###
The QCD Accelerator Benchmark suite Part 2 consists of two kernels, based on
the QUDA [1] and the QPhix [2] libraries. QUDA is built on CUDA and optimized for running on NVIDIA GPUs (https://lattice.github.io/quda/). QPhix consists of routines optimized with Intel intrinsic functions for several vector lengths, including optimized routines for KNC and KNL (http://jeffersonlab.github.io/qphix/).
The benchmark code uses the conjugate gradient benchmark functions provided by the libraries.
[1] R. Babich, M. Clark and B. Joo, “Parallelizing the QUDA Library for Multi-GPU Calculations
in Lattice Quantum Chromodynamics” SC 10 (Supercomputing 2010)
[2] B. Joo, D. D. Kalamkar, K. Vaidyanathan, M. Smelyanskiy, K. Pamnany, V. W. Lee, P. Dubey,
W. Watson III, “Lattice QCD on Intel Xeon Phi”, International Supercomputing Conference (ISC’13), 2013
###
### GPU - BENCHMARK SUITE - QUDA
###
The GPU benchmark results of the second implementation were obtained on PizDaint, located at CSCS in Switzerland, and on the GPU partition of Cartesius at Surfsara in Amsterdam, the Netherlands. The runs were performed with the provided bash scripts. PizDaint is equipped with one P100 Pascal GPU per node. Two test cases are shown for it, both in "strong-scaling" mode with a random lattice configuration, of size 32x32x32x96 and 64x64x64x128. The GPU nodes of Cartesius have two Kepler K40m GPUs per node, and the "strong-scaling" test is shown for one card per node and for two cards per node. The benchmark kernel uses the conjugate gradient solver, which solves the linear system D * x = b for the unknown solution "x", where "D" is the clover-improved Wilson Dirac operator and "b" is a known right-hand side.
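As an illustration of the solver named above, the following is a minimal conjugate gradient sketch in plain Python. It is not the QUDA or QPhix implementation. The plain CG iteration assumes a symmetric positive definite matrix, which the Wilson Dirac operator itself is not; lattice codes commonly apply CG to the normal equations instead. Here a small 2x2 matrix "A" stands in for the operator, and the names conjugate_gradient, matvec and dot are hypothetical helpers introduced only for this example.

```python
def matvec(A, x):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(u, v):
    """Euclidean inner product of two real vectors."""
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive definite A (assumption).

    Returns the approximate solution and the number of iterations used.
    Stops when the relative residual norm drops below tol.
    """
    x = [0.0] * len(b)
    r = list(b)              # residual b - A x for the start vector x = 0
    p = list(r)              # first search direction
    rr = dot(r, r)
    b_norm2 = dot(b, b)
    if b_norm2 == 0.0:
        return x, 0
    for it in range(1, max_iter + 1):
        Ap = matvec(A, p)
        alpha = rr / dot(p, Ap)                       # step length
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rr_new = dot(r, r)
        if rr_new <= tol * tol * b_norm2:
            return x, it
        beta = rr_new / rr                            # direction update
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x, max_iter


# Toy stand-in for the operator and right-hand side (not lattice data).
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x, iterations = conjugate_gradient(A, b)
print(x, iterations)
```

The benchmark libraries run the same kind of iteration on vectors distributed over many devices, so every operator application and inner product involves communication between them.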
Figures:
surfsara_K20m.png:
The figure shows strong scaling of the conjugate gradient solver on K40m GPUs on Cartesius. The lattice size is 32x32x32x96, which is a moderate lattice size nowadays. The test is performed with a mixed-precision CG in double-double mode (red) and half-double mode (blue). The runs use one GPU per node (filled) and two GPUs per node (non-filled).
pizdaitn_P100.png
The figure shows strong scaling of the conjugate gradient solver on P100 GPUs on PizDaint. The lattice size is 32x32x32x96, as in the strong scaling run on the K40m on Cartesius. The test is performed with a mixed-precision CG in double-double mode (red) and half-double mode (blue).
pizdaint_P100_lV128x64c.png
The figure shows strong scaling of the conjugate gradient solver on P100 GPUs on PizDaint. The lattice size is increased to 64x64x64x128, which is a large lattice nowadays. With the larger lattice, the scaling test shows that the conjugate gradient solver has very good strong scaling up to 64 GPUs.
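One way to see why the larger lattice scales further is to count the lattice sites each GPU holds in strong-scaling mode. The sketch below divides the global volume evenly over the devices. This is an assumption for illustration: the README does not state the actual process grid or domain decomposition, and sites_per_device is a hypothetical helper.

```python
from math import prod


def sites_per_device(global_lattice, devices):
    """Average number of lattice sites per device for a fixed global lattice."""
    return prod(global_lattice) / devices


moderate = (32, 32, 32, 96)    # global lattice of the first test case
large = (64, 64, 64, 128)      # global lattice of the second test case

for n in (1, 8, 64):
    print(n, sites_per_device(moderate, n), sites_per_device(large, n))
```

At any fixed GPU count the larger lattice leaves roughly ten times more sites on each GPU, so each device has more local work per communication step.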
---------------------
PizDaint - Pascal P100
---------------------
Strong - Scaling:
global lattice size (32x32x32x96)
sloppy-precision: single
precision: single
GPUs GFLOPS sec
1 786.520000 4.569600
2 1522.410000 3.086040
4 2476.900000 2.447180
8 3426.020000 2.117580
16 5091.330000 1.895790
32 8234.310000 1.860760
64 8276.480000 1.869230
sloppy-precision: double
precision: double
GPUs GFLOPS sec
1 385.965000 6.126730
2 751.227000 3.846940
4 1431.570000 2.774470
8 1368.000000 2.367040
16 2304.900000 2.071160
32 4965.480000 2.095180
64 2308.850000 2.005110
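The strong-scaling numbers above can be turned into a speedup and a parallel efficiency. The sketch below does this for the single-precision PizDaint values, copied from the table with trailing zeros dropped. It measures speedup as the ratio of sustained GFLOPS relative to the smallest run, which is one possible convention and an assumption of this example; strong_scaling is a hypothetical helper.

```python
# GPUs -> GFLOPS, single precision, global lattice 32x32x32x96 (from the table).
pizdaint_single = {
    1: 786.52,
    2: 1522.41,
    4: 2476.90,
    8: 3426.02,
    16: 5091.33,
    32: 8234.31,
    64: 8276.48,
}


def strong_scaling(gflops):
    """Return (devices, speedup, efficiency) rows relative to the smallest run."""
    base_n = min(gflops)
    base = gflops[base_n]
    rows = []
    for n in sorted(gflops):
        speedup = gflops[n] / base
        efficiency = speedup / (n / base_n)   # 1.0 would be ideal scaling
        rows.append((n, speedup, efficiency))
    return rows


rows = strong_scaling(pizdaint_single)
for n, speedup, efficiency in rows:
    print(f"{n:3d} GPUs  speedup {speedup:6.2f}  efficiency {efficiency:5.2f}")
```

The same function can be applied to the other strong-scaling tables by replacing the dictionary.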
Weak - Scaling:
local lattice size (48x48x48x24)
sloppy-precision: single
precision: single
GPUs GFLOPS sec
1 765.967000 3.940280
2 1472.980000 4.004630
4 2865.600000 4.044360
8 5421.270000 4.056410
16 9373.760000 7.396590
32 17995.100000 4.243390
64 27219.800000 4.535410
sloppy-precision: double
precision: double
GPUs GFLOPS sec
1 376.611000 5.108900
2 728.973000 5.190880
4 1453.500000 5.144160
8 2884.390000 5.207090
16 5004.520000 5.362020
32 8744.090000 5.623290
64 14053.00000 5.910520
---------------------
SurfSara - Kepler K20m
---------------------
##
## 1 GPU per Node
##
Strong - Scaling:
global lattice size (32x32x32x96)
sloppy-precision: single
precision: single
GPUs GFLOPS sec
1 243.084000 4.030000
2 478.179000 2.630000
4 939.953000 2.250000
8 1798.240000 1.570000
16 3072.440000 1.730000
32 4365.320000 1.310000
sloppy-precision: double
precision: double
GPUs GFLOPS sec
1 119.786000 6.060000
2 234.179000 3.290000
4 463.594000 2.250000
8 898.090000 1.960000
16 1604.210000 1.480000
32 2420.130000 1.630000
##
## 2 GPU per Node
##
Strong - Scaling:
global lattice size (32x32x32x96)
sloppy-precision: single
precision: single
GPUs GFLOPS sec
2 463.041000 2.720000
4 896.707000 1.940000
8 1672.080000 1.680000
16 2518.240000 1.420000
32 3800.970000 1.460000
64 4505.440000 1.430000
sloppy-precision: double
precision: double
GPUs GFLOPS sec
2 229.579000 3.380000
4 450.425000 2.280000
8 863.117000 1.830000
16 1348.760000 1.510000
32 1842.560000 1.550000
64 2645.590000 1.480000
###
### XEONPHI - BENCHMARK SUITE
###
The results for the XeonPhi benchmark suite were obtained on Frioul at CINES and on the hybrid partition of MareNostrum III at BSC. Frioul has one KNL card per node, while the hybrid partition of MareNostrum III is equipped with two KNCs per node. The data on Frioul were generated with the bash scripts provided by the second implementation of QCD, for the two "strong-scaling" test cases with lattice sizes of 32x32x32x96 and 64x64x64x128. For MareNostrum, data for the "strong-scaling" mode on a 32x32x32x96 lattice are shown. The benchmark kernel uses a random gauge configuration and the conjugate gradient solver to solve a linear equation involving the clover Wilson Dirac operator.
MareNostrum_KNC.png
The figure shows strong scaling of the conjugate gradient solver on KNCs on the hybrid partition of MareNostrum III. The lattice size is 32x32x32x96, which is a moderate lattice size nowadays. The test is performed with a conjugate gradient solver in single precision, using the native mode and 60 OpenMP threads per MPI process. The runs use one KNC per node (filled) and two KNCs per node (non-filled).
Frioul_KNL.png
The figure shows strong scaling results of the conjugate gradient solver on KNLs on Frioul. The lattice size is 32x32x32x96, as in the strong scaling run on the KNCs on MareNostrum III. The run is performed in quadrant cache mode with 68 OpenMP threads per KNL. The test is performed with a conjugate gradient solver in single precision.
Frioul_KNL_lV128x64c.png
The figure shows strong scaling of the conjugate gradient solver on KNLs on Frioul. The lattice size is increased to 64x64x64x128, which is a commonly used large lattice nowadays. With the larger lattice, the scaling test shows that the conjugate gradient solver has very good strong scaling up to 16 KNLs.
---------------------
Frioul - KNLs
---------------------
Strong - Scaling:
global lattice size (32x32x32x96)
precision: single
KNLs GFLOPS
1 340.75
2 627.612
4 1111.13
8 1779.34
16 2410.8
precision: double
KNLs GFLOPS
1 328.149
2 616.467
4 1047.79
8 1616.37
Weak - Scaling:
local lattice size (48x48x48x24)
precision: single
KNLs GFLOPS
1 348.304
2 616.697
4 1214.82
8 2425.45
16 4404.63
precision: double
KNLs GFLOPS
1 172.303
2 320.761
4 629.79
8 1228.77
16 2310.63
---------------------
MareNostrum III - KNC's
---------------------
Strong - Scaling:
global lattice size (32x32x32x96)
precision: single - 1 Card per Node
KNCs GFLOPS
2 103.561
4 200.159
8 338.276
16 534.369
32 815.896
precision: single - 2 Cards per Node
KNCs GFLOPS
4 118.995
8 212.558
16 368.196
32 605.882
64 847.566
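The two MareNostrum III tables above can be compared at equal KNC counts. The sketch below takes the "KNCs" column at face value as the total number of cards in the run, which is an assumption, and computes the ratio of the two-cards-per-node throughput to the one-card-per-node throughput. The values are copied from the tables; the variable names are hypothetical.

```python
# KNCs -> GFLOPS, single precision, global lattice 32x32x32x96 (from the tables).
one_card_per_node = {2: 103.561, 4: 200.159, 8: 338.276, 16: 534.369, 32: 815.896}
two_cards_per_node = {4: 118.995, 8: 212.558, 16: 368.196, 32: 605.882, 64: 847.566}

# Only card counts that appear in both tables can be compared directly.
common = sorted(set(one_card_per_node) & set(two_cards_per_node))

# Ratio below 1.0 means the two-cards-per-node run sustained fewer GFLOPS.
ratios = {n: two_cards_per_node[n] / one_card_per_node[n] for n in common}

for n in common:
    print(f"{n:3d} KNCs  two cards / one card = {ratios[n]:.3f}")
```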
#########################################################
Results from PRACE 5IP (see White paper for more details)
Nodes Irene SKL Juwels Marconi-KNL MareNostrum PizDaint Davide Frioul Deep Mont-Blanc 3
1 134,382 132,26 101,815 142,336 387,659 392,763 184,729 41,7832 99,6378
2 240,853 245,599 145,608 263,355 755,308 773,901 269,705 40,7721 214,549
4 460,044 456,228 202,135 480,516 1400,06 1509,46 441,534 59,6317 410,902
8 754,657 864,959 223,082 895,277 1654,21 2902,83 614,466 67,3355 715,699
16 1366,21 1700,95 214,705 1632,87 2145,69 5394,16 644,303 91,5139 1,17E+03
32 2603,9 3199,98 183,327 2923,7 2923,98 9650,91 937,755
64 4122,76 5167,48 232,788 4118,7 2332,71 800,514
128 4703,46 7973,9 37,8003 4050,41
256 -- 3130,42
512 -- 3421,25
Qphix Qphix Qphix Qphix QUDA QUDA Qphix Qphix Grid
Skylake Skylake KNL Skylake P100 P100 KNL Xeons ARM
Results in GFLOP/s