#        Results - QCD UEABS Part 2  
**2017 -  Jacob Finkenrath - CaSToRC - The Cyprus Institute  (j.finkenrath@cyi.ac.cy)**



The QCD UEABS Part 2 consists of two kernels: one based on the QUDA library [^1] and one based on the QPhix library [^2]. The QUDA library is based on CUDA and optimized for running on NVIDIA GPUs (https://lattice.github.io/quda/). The QPhix library consists of routines that are optimized to use Intel intrinsic functions for multiple vector lengths, including optimized routines for KNC and KNL (http://jeffersonlab.github.io/qphix/). The benchmark code uses the conjugate gradient benchmark functions provided by the libraries.

[^1]: R. Babich, M. Clark and B. Joo, “Parallelizing the QUDA Library for Multi-GPU Calculations

[^2]: B. Joo, D. D. Kalamkar, K. Vaidyanathan, M. Smelyanskiy, K. Pamnany, V. W. Lee, P. Dubey,

###   GPU - BENCHMARK SUITE - QUDA
The GPU benchmark results of the second implementation were obtained on PizDaint, located at CSCS in Switzerland, and on the GPU partition of Cartesius at SURFsara in Amsterdam, the Netherlands. The runs were performed using the provided bash scripts. PizDaint is equipped with one Pascal P100 GPU per node. Two test cases are shown, both in "strong-scaling" mode with a random lattice configuration, of size 32x32x32x96 and 64x64x64x128 respectively. The GPU nodes of Cartesius have two Kepler K40m GPUs per node, and the "strong-scaling" test is shown for one card per node and for two cards per node. The benchmark kernel uses the conjugate gradient solver, which solves the linear system D * x = b for the unknown solution "x", given the clover-improved Wilson Dirac operator "D" and a known right-hand side "b".
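
To make the structure of the solver concrete, the following is a minimal sketch of a conjugate gradient iteration in plain Python. It is not the QUDA or QPhix implementation: the small dense symmetric positive definite matrix `A` is a hypothetical stand-in for the lattice operator, and the helper names (`matvec`, `dot`, `conjugate_gradient`) are chosen for illustration only. Since the Wilson Dirac operator itself is not Hermitian positive definite, CG is commonly applied to the normal equations instead; how the benchmark libraries arrange this is not covered here.

```python
# Minimal conjugate gradient (CG) sketch for A x = b with A symmetric positive definite.
# Assumption: a tiny dense matrix stands in for the large, sparse lattice operator.

def matvec(A, v):
    """Dense matrix-vector product."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in A]

def dot(u, v):
    """Euclidean inner product."""
    return sum(u_i * v_i for u_i, v_i in zip(u, v))

def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Return (x, iterations) with |b - A x| <= tol * |b| if converged."""
    x = [0.0] * len(b)
    r = list(b)              # residual r = b - A x for the initial guess x = 0
    p = list(r)              # first search direction
    rr = dot(r, r)
    b_norm2 = dot(b, b)
    for iteration in range(max_iter):
        if rr <= tol * tol * b_norm2:
            return x, iteration
        Ap = matvec(A, p)
        alpha = rr / dot(p, Ap)                                # step length along p
        x = [x_i + alpha * p_i for x_i, p_i in zip(x, p)]
        r = [r_i - alpha * Ap_i for r_i, Ap_i in zip(r, Ap)]
        rr_new = dot(r, r)
        beta = rr_new / rr                                     # keeps directions A-conjugate
        p = [r_i + beta * p_i for r_i, p_i in zip(r, p)]
        rr = rr_new
    return x, max_iter

# Hypothetical 3x3 test system (symmetric, positive definite).
A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]

x, iterations = conjugate_gradient(A, b)
residual = [b_i - Ax_i for b_i, Ax_i in zip(b, matvec(A, x))]
print("iterations:", iterations)
print("max |residual|:", max(abs(v) for v in residual))
```

Each iteration needs one operator application and a few inner products, which is why the sustained GFLOPS of the operator application dominates the benchmark numbers reported below.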


Figures:
surfsara_K20m.png:
The figure shows strong scaling of the conjugate gradient solver on K40m GPUs on Cartesius. The lattice size is 32x32x32x96, which is a moderate lattice size by current standards. The test was performed with a mixed-precision CG in double-double mode (red) and half-double mode (blue). The runs used one GPU per node (filled symbols) and two GPUs per node (open symbols).


pizdaitn_P100.png
The figure shows strong scaling of the conjugate gradient solver on P100 GPUs on PizDaint. The lattice size is 32x32x32x96, the same as in the strong-scaling run on the K40m GPUs on Cartesius. The test was performed with a mixed-precision CG in double-double mode (red) and half-double mode (blue).

pizdaint_P100_lV128x64c.png
The figure shows strong scaling of the conjugate gradient solver on P100 GPUs on PizDaint. The lattice size is increased to 64x64x64x128, which is a large lattice by current standards. With the larger lattice, the scaling test shows that the conjugate gradient solver has very good strong scaling up to 64 GPUs.
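
One way to see why the larger lattice can scale further is to look at how many lattice sites remain on each GPU as the GPU count grows. The sketch below assumes the global lattice is split evenly across GPUs; the actual domain decomposition used in the runs is not stated here, and the function name `sites_per_gpu` is hypothetical.

```python
from math import prod

def sites_per_gpu(global_lattice, n_gpus):
    """Average number of lattice sites per GPU, assuming an even split."""
    return prod(global_lattice) / n_gpus

moderate = (32, 32, 32, 96)     # moderate test case
large = (64, 64, 64, 128)       # large test case

for n_gpus in (1, 8, 64):
    print(n_gpus,
          sites_per_gpu(moderate, n_gpus),
          sites_per_gpu(large, n_gpus))

# Ratio of local volumes at the same GPU count.
ratio = prod(large) / prod(moderate)
print("large / moderate volume ratio:", ratio)
```

At a fixed GPU count the large lattice leaves roughly ten times more sites per GPU than the moderate one, so each GPU has more local work relative to the data it has to exchange with its neighbours.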


---------------------
#### PizDaint - Pascal  P100
###### Strong - Scaling:

global lattice size (32x32x32x96)

sloppy-precision: single
precision: single

GPUs     GFLOPS      sec
1    786.520000 4.569600
2   1522.410000 3.086040
4   2476.900000 2.447180
8   3426.020000 2.117580
16  5091.330000 1.895790
32  8234.310000 1.860760
64  8276.480000 1.869230

sloppy-precision: double
       precision: double

GPUs     GFLOPS      sec
1    385.965000 6.126730
2    751.227000 3.846940
4   1431.570000 2.774470
8   1368.000000 2.367040
16  2304.900000 2.071160
32  4965.480000 2.095180
64  2308.850000 2.005110
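
As a reading aid for the strong-scaling tables, the following sketch derives speedup and parallel efficiency from the sustained GFLOPS of the single-precision PizDaint run above. Defining speedup as the GFLOPS ratio relative to the smallest run is an assumption made here for illustration, not a metric defined by the benchmark suite, and the function name `strong_scaling` is hypothetical.

```python
# Sustained GFLOPS from the PizDaint strong-scaling table (single precision).
gflops = {1: 786.52, 2: 1522.41, 4: 2476.90, 8: 3426.02,
          16: 5091.33, 32: 8234.31, 64: 8276.48}

def strong_scaling(gflops_by_gpus):
    """Return (gpus, speedup, efficiency) rows relative to the smallest run."""
    base_gpus = min(gflops_by_gpus)
    base = gflops_by_gpus[base_gpus]
    rows = []
    for n in sorted(gflops_by_gpus):
        speedup = gflops_by_gpus[n] / base        # throughput gain over the base run
        efficiency = speedup / (n / base_gpus)    # 1.0 would be ideal scaling
        rows.append((n, speedup, efficiency))
    return rows

rows = strong_scaling(gflops)
for n, speedup, efficiency in rows:
    print(f"{n:3d} GPUs: speedup {speedup:6.2f}, efficiency {efficiency:6.1%}")
```

The same function can be applied to the other strong-scaling tables by replacing the dictionary; for runs that start at two devices, the smallest device count is used as the baseline.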

###### Weak - Scaling:

local lattice size (48x48x48x24)

sloppy-precision: single
precision: single

GPUs     GFLOPS      sec
1     765.967000 3.940280
2    1472.980000 4.004630
4    2865.600000 4.044360
8    5421.270000 4.056410
16   9373.760000 7.396590
32  17995.100000 4.243390
64  27219.800000 4.535410

sloppy-precision: double
       precision: double

GPUs    GFLOPS      sec
 1   376.611000 5.108900
 2   728.973000 5.190880
 4  1453.500000 5.144160
 8  2884.390000 5.207090
16  5004.520000 5.362020
32  8744.090000 5.623290
64  14053.00000 5.910520 


---------------------
#### SurfSara - Kepler  K40m
##### 1 GPU per Node
###### Strong - Scaling:

global lattice size (32x32x32x96)

sloppy-precision: single
precision: single

GPUs    GFLOPS      sec
1      243.084000 4.030000 
2      478.179000 2.630000 
4      939.953000 2.250000 
8     1798.240000 1.570000 
16    3072.440000 1.730000 
32    4365.320000 1.310000

sloppy-precision: double
precision: double

GPUs    GFLOPS      sec
1      119.786000 6.060000 
2      234.179000 3.290000 
4      463.594000 2.250000 
8      898.090000 1.960000 
16    1604.210000 1.480000 
32    2420.130000 1.630000

##### 2 GPU per Node
###### Strong - Scaling:

global lattice size (32x32x32x96)

sloppy-precision: single
precision: single

GPUs    GFLOPS      sec
2      463.041000 2.720000 
4      896.707000 1.940000 
8     1672.080000 1.680000 
16    2518.240000 1.420000 
32    3800.970000 1.460000 
64    4505.440000 1.430000

sloppy-precision: double
precision: double

GPUs    GFLOPS      sec
2     229.579000 3.380000 
4     450.425000 2.280000 
8     863.117000 1.830000 
16   1348.760000 1.510000 
32   1842.560000 1.550000 
64   2645.590000 1.480000 

###   XEONPHI - BENCHMARK SUITE
The benchmark results for the XeonPhi benchmark suite were obtained on Frioul at CINES and on the hybrid partition of MareNostrum III at BSC. Frioul has one KNL card per node, while the hybrid partition of MareNostrum III is equipped with two KNCs per node. The data on Frioul were generated using the bash scripts provided by the second implementation of QCD, for the two "strong-scaling" test cases with lattice sizes of 32x32x32x96 and 64x64x64x128. For MareNostrum, data for the "strong-scaling" mode on a 32x32x32x96 lattice are shown. The benchmark kernel uses a random gauge configuration and the conjugate gradient solver to solve a linear equation involving the clover Wilson Dirac operator.

MareNostrum_KNC.png
The figure shows strong scaling of the conjugate gradient solver on KNCs on the hybrid partition of MareNostrum III. The lattice size is 32x32x32x96, which is a moderate lattice size by current standards. The test was performed with a conjugate gradient solver in single precision, using native mode and 60 OpenMP threads per MPI process. The runs used one KNC per node (filled symbols) and two KNCs per node (open symbols).

Frioul_KNL.png
The figure shows strong scaling results of the conjugate gradient solver on KNLs on Frioul. The lattice size is 32x32x32x96, the same as in the strong-scaling run on the KNCs on MareNostrum III. The runs were performed in quadrant cache mode with 68 OpenMP threads per KNL. The test was performed with a conjugate gradient solver in single precision.

Frioul_KNL_lV128x64c.png
The figure shows strong scaling of the conjugate gradient solver on KNLs on Frioul. The lattice size is increased to 64x64x64x128, which is a commonly used large lattice by current standards. With the larger lattice, the scaling test shows that the conjugate gradient solver has very good strong scaling up to 16 KNLs.

#### Frioul - KNLs
###### Strong - Scaling:

global lattice size (32x32x32x96)

precision: single

KNLs     GFLOPS  
1       340.75
2       627.612
4      1111.13
8      1779.34
16     2410.8

precision: double

KNLs     GFLOPS    
1      328.149
2      616.467
4      1047.79
8      1616.37

###### Weak - Scaling:

local lattice size (48x48x48x24)

precision: single

KNLs   GFLOPS  
1       348.304
2       616.697
4      1214.82
8      2425.45
16     4404.63

precision: double

KNLs   GFLOPS    
 1      172.303
 2      320.761
 4      629.79
 8     1228.77
16     2310.63



#### MareNostrum III - KNC's 

###### Strong - Scaling:

global lattice size (32x32x32x96)

precision: single - 1 Card per Node

KNCs  GFLOPS
2    103.561
4    200.159
8    338.276
16   534.369
32   815.896

precision: single - 2 Cards per Node

KNCs  GFLOPS
4    118.995
8    212.558
16   368.196
32   605.882
64   847.566

#### Results from PRACE 5IP

(See the white paper for more details.)

Results in GFLOP/s for V=96x32x32x32
Nodes   Irene SKL    Juwels       Marconi-KNL  MareNostrum  PizDaint     Davide       Frioul       Deep         Mont-Blanc 3
1       134.382      132.26       101.815      142.336      387.659      392.763      184.729      41.7832      99.6378
2       240.853      245.599      145.608      263.355      755.308      773.901      269.705      40.7721      214.549
4       460.044      456.228      202.135      480.516      1400.06      1509.46      441.534      59.6317      410.902
8       754.657      864.959      223.082      895.277      1654.21      2902.83      614.466      67.3355      715.699
16      1366.21      1700.95      214.705      1632.87      2145.69      5394.16      644.303      91.5139      1.17E+03
32      2603.9       3199.98      183.327      2923.7       2923.98      9650.91      937.755
64      4122.76      5167.48      232.788      4118.7       2332.71                   800.514
128     4703.46      7973.9       37.8003      4050.41
256     --           3130.42
512     --           3421.25
Library QPhix        QPhix        QPhix        QPhix        QUDA         QUDA         QPhix        QPhix        Grid
Arch.   Skylake      Skylake      KNL          Skylake      P100         P100         KNL          Xeons        ARM

Results in GFLOP/s for V=128x64x64x64
Nodes   Irene SKL    Juwels       Marconi-KNL  MareNostrum  PizDaint
1       141.306      134.972      64.2657      144.32
2       267.278      263.636      153.008      280.68
4       503.041      496.465      420.936      514.956
8       922.187      954.659      783.39       930.95       2694
16      1607.92      1787.43      1109.95      1778.23      5731.56
32      3088.02      3289.02      1486.79      2635.74      7779.29
64      4787.89      5952.8       1087.01      5264.16      10607.2
128     5750.35      10315.3      601.615      7998.56      13560.5
256     15370.9      18177.2
512                  26972.6
Library QPhix        QPhix        QPhix        QPhix        QUDA
Arch.   Skylake      Skylake      KNL          Skylake      P100