PRACE QCD Accelerator Benchmark 1
=================================

This benchmark is part of the QCD section of the Accelerator Benchmarks Suite developed as part of a PRACE EU funded project 
(http://www.prace-ri.eu).

The suite is derived from the Unified European Applications Benchmark Suite (UEABS) http://www.prace-ri.eu/ueabs/

This specific component is a direct port of "QCD kernel E" from the UEABS, which is based on the MILC code suite (http://www.physics.utah.edu/~detar/milc/). The performance-portable targetDP model has been used to allow the benchmark to utilise NVIDIA GPUs, Intel Xeon Phi manycore CPUs and traditional multi-core CPUs. The use of MPI (in conjunction with targetDP) allows multiple nodes to be used in parallel.

For full details of this benchmark, and for results on NVIDIA GPU and Intel Knights Corner Xeon Phi architectures (in addition to regular CPUs), please see:

**********************************************************************
Gray, Alan, and Kevin Stratford. "A lightweight approach to
performance portability with targetDP." The International Journal of
High Performance Computing Applications (2016): 1094342016682071. Also
available at https://arxiv.org/abs/1609.01479
**********************************************************************

To Build
--------

Choose a configuration file from the "config" directory that best matches your platform, and copy to "config.mk" in this (the top-level) directory. Then edit this file, if necessary, to properly set the compilers and paths on your system.

If you are building for a GPU system and the TARGETCC variable in the configuration file is set to the NVIDIA compiler nvcc, the build process automatically builds the GPU version. Otherwise, it builds the threaded CPU version, which can run on Xeon Phi manycore CPUs or regular multi-core CPUs.

Then, build the targetDP performance-portable library:

```
 cd targetDP
 make clean
 make
 cd ..
```

Finally, build the benchmark code:

```
 cd src
 make clean
 make
 cd ..
```




To Validate
-----------

After building, an executable "bench" will exist in the src directory. To run the default validation (64x64x64x8, 1 iteration) case:

```
cd src
./bench
```

The code automatically self-validates by comparing its output with the appropriate reference file for this case in output_ref, and prints the result to stdout, e.g.

Validating against output_ref/kernel_E.output.nx64ny64nz64nt8.i1.t1:
VALIDATION PASSED

The benchmark time is also printed to stdout, e.g.

******BENCHMARK TIME 1.6767786769196391e-01 seconds****** 

(This time was measured on an NVIDIA K40 GPU.)
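
When running many cases it can be convenient to pull the validation status and the benchmark time out of a saved log. The following is a minimal sketch, assuming the stdout format shown above. The temporary log file is a stand-in for real output (for example from `./bench > bench.log`, a hypothetical file name), and the label "not-passed" simply means the marker line was absent, since the failure message is not documented here.

```shell
# Stand-in log file; its content is copied from the example output above.
log=$(mktemp)
cat > "$log" <<'EOF'
Validating against output_ref/kernel_E.output.nx64ny64nz64nt8.i1.t1:
VALIDATION PASSED
******BENCHMARK TIME 1.6767786769196391e-01 seconds******
EOF

# Validation status: "passed" only if the marker line is present.
if grep -q "VALIDATION PASSED" "$log"; then
    status=passed
else
    status=not-passed
fi

# Benchmark time: the token between "BENCHMARK TIME" and "seconds".
bench_time=$(sed -n 's/.*BENCHMARK TIME \([^ ]*\) seconds.*/\1/p' "$log")

echo "validation: $status, time: $bench_time s"
rm -f "$log"
```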



To Run Different Cases
---------------------

You can edit the input file 

```
 src/kernel_E.input
```

if you want to change the default system size or number of iterations, or run using more than 1 MPI task. For example, replacing

```
 totnodes 1 1 1 1
```

with

```
 totnodes 2 1 1 1
```

will run with 2 MPI tasks rather than 1, where the domain is decomposed in 
the "X" direction.
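
The number of MPI tasks implied by a totnodes line can be checked before launching. The following is a minimal sketch, assuming that the task count is the product of the four entries. That is consistent with the `1 1 1 1` and `2 1 1 1` examples above, but it is an assumption rather than something stated in this README. The temporary file is a stand-in for src/kernel_E.input.

```shell
# Stand-in for src/kernel_E.input, containing only the line of interest.
input=$(mktemp)
echo " totnodes 2 1 1 1" > "$input"

# Multiply the four entries that follow the "totnodes" keyword
# (assumed here to give the total number of MPI tasks).
ntasks=$(awk '$1 == "totnodes" { print $2 * $3 * $4 * $5 }' "$input")

echo "MPI tasks expected: $ntasks"
rm -f "$input"
```

The resulting value is the task count to request from your system's MPI launcher. The launcher and its options are system-specific and are not prescribed here.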


To Run using a Script
---------------------

The "run" directory contains an example script which
 - sets up a temporary scratch directory
 - copies in the input file, plus some reference output files
 - sets the number of OpenMP threads (for a multi/many core CPU run)
 - runs the code (which will automatically validate if an 
   appropriate output reference file exists)

In the run directory, copy "run_example.sh" to "run.sh" and customise the copy for your system.


Known Issues
------------

The quantity used for validation (see congrad.C) becomes very small after a few iterations. Therefore, only a small number of iterations should be used for validation. This is not an issue specific to this port of the benchmark, but is also true of the original version (see above), with which this version is designed to be consistent.


Performance Results for Reference
--------------------------------

Here are some performance timings obtained using this benchmark.


From the paper cited above:

64x64x32x8, 1000 iterations, single chip

Chip                            Time (s)

Intel Ivy-Bridge 12-core CPU    361.55
Intel Haswell 8-core CPU        376.08
AMD Opteron 16-core CPU         618.19
Intel KNC Xeon Phi              139.94
NVIDIA K20X GPU                 96.84
NVIDIA K40 GPU                  90.90


Multi-node scaling:		

Titan GPU (one K20X per node)	
Titan CPU (one 16-core Interlagos per node)
ARCHER CPU (two 12-core Ivy-Bridge per node)

All times in seconds.

Small Case: 64x64x32x8, 1000 iterations

Nodes   Titan GPU	Titan CPU       ARCHER CPU

1	9.64E+01	6.01E+02	1.86E+02
2	5.53E+01	3.14E+02	9.57E+01
4	3.30E+01	1.65E+02	5.22E+01
8	2.18E+01	8.33E+01	2.60E+01
16	1.35E+01	4.02E+01	1.27E+01
32	8.80E+00	2.06E+01	6.49E+00
64	6.54E+00	9.90E+00	2.36E+00
128	5.13E+00	4.31E+00	1.86E+00
256	4.25E+00	2.95E+00	1.96E+00
			

Large Case: 64x64x64x192, 1000 iterations

Nodes   Titan GPU	Titan CPU       ARCHER CPU

64	1.36E+02	5.19E+02	1.61E+02
128	8.23E+01	2.75E+02	8.51E+01
256	6.70E+01	1.61E+02	4.38E+01
512	3.79E+01	8.80E+01	2.18E+01
1024	2.41E+01	5.72E+01	1.46E+01
2048	1.81E+01	3.88E+01	7.35E+00
4096	1.56E+01	2.28E+01	6.53E+00
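
The scaling tables above can be turned into speedup and parallel efficiency figures. The following is a minimal sketch using awk, with the Titan GPU column of the Large Case table copied in by hand. The definitions used (speedup relative to the smallest node count listed, efficiency as speedup divided by the increase in node count) are a common convention assumed here, not something this benchmark prescribes.

```shell
# Input columns: nodes, time in seconds (Titan GPU, Large Case table above).
# Output columns: nodes, speedup, parallel efficiency.
result=$(awk '
    NR == 1 { n0 = $1; t0 = $2 }          # baseline: smallest node count
    {
        speedup = t0 / $2                 # relative to the baseline run
        efficiency = speedup * n0 / $1    # 1.00 would be ideal scaling
        printf "%d %.2f %.2f\n", $1, speedup, efficiency
    }
' <<'EOF'
64   1.36E+02
128  8.23E+01
256  6.70E+01
512  3.79E+01
1024 2.41E+01
2048 1.81E+01
4096 1.56E+01
EOF
)
echo "$result"
```

For the 128-node row this gives a speedup of about 1.65 and an efficiency of about 0.83 relative to the 64-node run.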


Preliminary results on new Pascal GPU and Intel KNL architectures:

Single chip, 64x64x64x8, 1000 iterations

Chip				Time (s)

12-core Intel Ivy-Bridge	7.24E+02
Intel KNL Xeon Phi 	   	9.72E+01	
NVIDIA P100 GPU			5.60E+01

**********************************************************************

PRACE 5IP results (see the White Paper for more):

        Irene KNL  Irene SKL  Juwels  Marconi-KNL  MareNostrum  PizDaint  Davide  Frioul  Deep    Mont-Blanc 3
1       148.68     219.6      182.49  133.38       186.40       53.73     53.4    151     656.41  206.17
2       79.35      114.22     91.83   186.14       94.63        32.38     113     86.9    432.93  93.48
4       48.07      58.11      46.58   287.17       47.22        19.13     21.4    52.7    277.67  49.95
8       28.42      32.09      25.37   533.49       25.86        12.78     14.8    36.5    189.83  25.19
16      17.08      14.35      11.77   1365.72      11.64        9.20      10.1    17.8    119.14  12.55
32      10.56      7.28       5.43    2441.29      5.59         6.35      6.94    15.6
64      9.01       4.18       2.65                 2.65         6.41              11.7
128     5.08                  1.39                 2.48         5.95
256                           1.38                              5.84
512                           0.89

Results in seconds for V=8x64x64x64.
