PRACE QCD Accelerator Benchmark 1
=================================

This benchmark is part of the QCD section of the Accelerator Benchmarks
Suite developed as part of a PRACE EU-funded project
(http://www.prace-ri.eu). The suite is derived from the Unified European
Applications Benchmark Suite (UEABS): http://www.prace-ri.eu/ueabs/

This specific component is a direct port of "QCD kernel E" from the
UEABS, which is based on the MILC code suite
(http://www.physics.utah.edu/~detar/milc/). The performance-portable
targetDP model has been used to allow the benchmark to utilise NVIDIA
GPUs, Intel Xeon Phi manycore CPUs and traditional multi-core CPUs. The
use of MPI (in conjunction with targetDP) allows multiple nodes to be
used in parallel.

For full details of this benchmark, and for results on NVIDIA GPU and
Intel Knights Corner Xeon Phi architectures (in addition to regular
CPUs), please see:

**********************************************************************
Gray, Alan, and Kevin Stratford. "A lightweight approach to performance
portability with targetDP." The International Journal of High
Performance Computing Applications (2016): 1094342016682071.
Also available at https://arxiv.org/abs/1609.01479
**********************************************************************

To Build
--------

Choose a configuration file from the "config" directory that best
matches your platform, and copy it to "config.mk" in this (the
top-level) directory. Then edit this file, if necessary, to properly
set the compilers and paths on your system.

Note that if you are building for a GPU system and the TARGETCC
variable in the configuration file is set to the NVIDIA compiler nvcc,
then the build process will automatically build the GPU version.
Otherwise, the threaded CPU version will be built, which can run on
Xeon Phi manycore CPUs or regular multi-core CPUs.

Then, build the targetDP performance-portable library:

  cd targetDP
  make clean
  make
  cd ..

And finally, build the benchmark code:

  cd src
  make clean
  make
  cd ..

To Validate
-----------

After building, an executable "bench" will exist in the src directory.
To run the default validation case (64x64x64x8, 1 iteration):

  cd src
  ./bench

The code will automatically self-validate by comparing with the
appropriate output reference file for this case, which exists in
output_ref, and will print to stdout, e.g.

  Validating against output_ref/kernel_E.output.nx64ny64nz64nt8.i1.t1:
  VALIDATION PASSED

The benchmark time is also printed to stdout, e.g.

  ******BENCHMARK TIME 1.6767786769196391e-01 seconds******

(where this example time was obtained on an NVIDIA K40 GPU).

To Run Different Cases
----------------------

You can edit the input file src/kernel_E.input if you want to deviate
from the default system size or number of iterations, and/or to run
using more than 1 MPI task. For example, replacing

  totnodes 1 1 1 1

with

  totnodes 2 1 1 1

will run with 2 MPI tasks rather than 1, where the domain is
decomposed in the "X" direction.

To Run using a Script
---------------------

The "run" directory contains an example script which

- sets up a temporary scratch directory
- copies in the input file, plus some reference output files
- sets the number of OpenMP threads (for a multi/many-core CPU run)
- runs the code (which will automatically validate if an appropriate
  output reference file exists)

So, in the run directory, you should copy "run_example.sh" to run.sh,
and customise it for your system (an illustrative sketch of such a
script is given below).
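For illustration only, a customised run.sh might look like the
following minimal sketch. The benchmark location, scratch path, thread
count and the location of the output_ref directory are placeholders or
assumptions (not taken from the distribution) and should be adapted to
your system; run_example.sh remains the template to start from.

  #!/bin/bash
  # Sketch of a customised run.sh -- all paths and settings below are
  # placeholders; adapt them to your own system.

  BENCH=$HOME/qcd-accelerator-benchmark    # top-level benchmark directory (placeholder)
  SCRATCH=/scratch/$USER/qcd_bench.$$      # temporary scratch directory (placeholder)

  mkdir -p $SCRATCH
  cp $BENCH/src/kernel_E.input $SCRATCH    # copy in the input file
  cp -r $BENCH/src/output_ref $SCRATCH     # copy in reference outputs for self-validation
                                           # (adjust this path to wherever output_ref
                                           # lives in your checkout)
  cd $SCRATCH

  export OMP_NUM_THREADS=12                # OpenMP threads for a multi/many-core CPU run

  $BENCH/src/bench                         # runs the code; it validates automatically if
                                           # a matching reference output file is found

Running such a script from the run directory should reproduce the
default validation case described above, with the output left in the
scratch directory.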
Known Issues
------------

The quantity used for validation (see congrad.C) becomes very small
after a few iterations. Therefore, only a small number of iterations
should be used for validation. This is not an issue specific to this
port of the benchmark, but is also true of the original version (see
above), with which this version is designed to be consistent.

Performance Results for Reference
---------------------------------

Here are some performance timings obtained using this benchmark.

From the paper cited above: 64x64x64x32x8, 1000 iterations, single chip.

  Chip                            Time (s)
  Intel Ivy-Bridge 12-core CPU      361.55
  Intel Haswell 8-core CPU          376.08
  AMD Opteron 16-core CPU           618.19
  Intel KNC Xeon Phi                139.94
  NVIDIA K20X GPU                    96.84
  NVIDIA K40 GPU                     90.90

Multi-node scaling, for the following systems:

  Titan GPU  (one K20X per node)
  Titan CPU  (one 16-core Interlagos per node)
  ARCHER CPU (two 12-core Ivy-Bridge per node)

All times in seconds.

Small Case: 64x64x32x8, 1000 iterations

  Nodes   Titan GPU   Titan CPU   ARCHER CPU
      1    9.64E+01    6.01E+02     1.86E+02
      2    5.53E+01    3.14E+02     9.57E+01
      4    3.30E+01    1.65E+02     5.22E+01
      8    2.18E+01    8.33E+01     2.60E+01
     16    1.35E+01    4.02E+01     1.27E+01
     32    8.80E+00    2.06E+01     6.49E+00
     64    6.54E+00    9.90E+00     2.36E+00
    128    5.13E+00    4.31E+00     1.86E+00
    256    4.25E+00    2.95E+00     1.96E+00

Large Case: 64x64x64x192, 1000 iterations

  Nodes   Titan GPU   Titan CPU   ARCHER CPU
     64    1.36E+02    5.19E+02     1.61E+02
    128    8.23E+01    2.75E+02     8.51E+01
    256    6.70E+01    1.61E+02     4.38E+01
    512    3.79E+01    8.80E+01     2.18E+01
   1024    2.41E+01    5.72E+01     1.46E+01
   2048    1.81E+01    3.88E+01     7.35E+00
   4096    1.56E+01    2.28E+01     6.53E+00

Preliminary results on new Pascal GPU and Intel KNL architectures:
single chip, 64x64x64x8, 1000 iterations.

  Chip                        Time (s)
  12-core Intel Ivy-Bridge    7.24E+02
  Intel KNL Xeon Phi          9.72E+01
  NVIDIA P100 GPU             5.60E+01

**********************************************************************
PRACE 5IP results (see the White Paper for more), by number of nodes:

  Nodes  Irene KNL  Irene SKL  Juwels  Marconi-KNL  MareNostrum  PizDaint  Davide  Frioul    Deep  Mont-Blanc 3
      1     148.68     219.68  182.49       133.38       186.40     53.73    53.4     151  656.41        206.17
      2      79.35     114.22   91.83       186.14        94.63     32.38     113    86.9  432.93         93.48
      4      48.07      58.11   46.58       287.17        47.22     19.13    21.4    52.7  277.67         49.95
      8      28.42      32.09   25.37       533.49        25.86     12.78    14.8    36.5  189.83         25.19
     16      17.08      14.35   11.77      1365.72        11.64      9.20    10.1    27.8  119.14         12.55
     32      10.56       7.28    5.43      2441.29         5.59      6.35    6.94    15.6
     64       9.01       4.18    2.65           --         2.65      6.41      --    11.7
    128       5.08         --    1.39           --         2.48      5.95
    256         --         --    1.38           --           --      5.84
    512         --         --    0.89           --
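As an illustrative companion to the scaling tables above, the following
is a minimal sketch of how a 2-node run might be submitted on a SLURM
system. The scheduler directives, walltime, thread count, task layout
and launcher (srun) are assumptions to be adapted to your machine, and
kernel_E.input must contain "totnodes 2 1 1 1" as described in "To Run
Different Cases" above.

  #!/bin/bash
  #SBATCH --nodes=2                 # example: 2 nodes, one MPI task per node (assumed layout)
  #SBATCH --ntasks-per-node=1
  #SBATCH --time=00:20:00           # placeholder walltime

  cd /path/to/benchmark/src         # placeholder: the src directory of your build,
                                    # so kernel_E.input (and the reference outputs,
                                    # if present) are found as in the examples above

  export OMP_NUM_THREADS=12         # OpenMP threads per MPI task (CPU / Xeon Phi runs)

  # kernel_E.input should contain "totnodes 2 1 1 1" so that the domain
  # is decomposed over the 2 MPI tasks
  srun -n 2 ./bench                 # or mpirun/aprun, as appropriate for your system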