| <br>[- Bench](https://repository.prace-ri.eu/git/UEABS/ueabs/-/tree/r2.2-dev/qcd/part_1)<br>[- Summary](https://repository.prace-ri.eu/git/UEABS/ueabs/-/blob/r2.2-dev/qcd/part_1/README) | Lattice Quantum Chromodynamics - Part 1 | C | yes | yes | yes (CUDA) | -- | Accelerator-enabled kernel E of the UEABS QCD CPU part, using the targetDP model. Test case A - 8x64x64x64. Conjugate Gradient solver involving the Wilson Dirac stencil. Domain decomposition, memory bandwidth, strong scaling, MPI latency. |
| <br>[- Source](https://lattice.github.io/quda/)<br>[- Bench](https://repository.prace-ri.eu/git/UEABS/ueabs/-/tree/r2.2-dev/qcd/part_2)<br>[- Summary](https://repository.prace-ri.eu/git/UEABS/ueabs/-/blob/r2.2-dev/qcd/part_2/README) | Lattice Quantum Chromodynamics - Part 2 - QUDA | C++ | yes | yes | yes (CUDA) | -- | Part 2, GPU: uses a QUDA kernel to run on NVIDIA GPUs. [Test case A - 96x32x32x32] Small problem size. CG solver. Domain decomposition, memory bandwidth, strong scaling, MPI latency. [Test case B - 128x64x64x64] Moderate problem size. CG solver on the Wilson Dirac stencil. Bandwidth bound. |
| <br>[- Source](http://jeffersonlab.github.io/qphix/)<br>[- Bench](https://repository.prace-ri.eu/git/UEABS/ueabs/-/tree/r2.2-dev/qcd/part_2)<br>[- Summary](https://repository.prace-ri.eu/git/UEABS/ueabs/-/blob/r2.2-dev/qcd/part_2/README) | Lattice Quantum Chromodynamics - Part 2 - QPHIX | C++ | yes | yes | no | -- | Part 2, Xeon (Phi): uses a QPhiX kernel optimized to run on x86, in particular Intel Xeon (Phi). [Test case A - 96x32x32x32] Small problem size. CG solver involving the Wilson Dirac stencil. Domain decomposition, memory bandwidth, strong scaling, MPI latency. [Test case B - 128x64x64x64] Moderate problem size. CG solver on the Wilson Dirac stencil. Bandwidth bound. |
| <br>[- Source](https://repository.prace-ri.eu/ueabs/QCD/1.3/QCD_Source_TestCaseA.tar.gz)<br>[- Bench](https://repository.prace-ri.eu/git/UEABS/ueabs/-/tree/r2.2-dev/qcd/part_cpu)<br>[- Summary](https://repository.prace-ri.eu/git/UEABS/ueabs/-/blob/r2.2-dev/qcd/part_cpu/README) | Lattice Quantum Chromodynamics - CPU Part - legacy UEABS | C/Fortran | yes | yes/no | no | -- | CPU part based on the legacy UEABS QCD benchmark kernels (last updated 2017), comprising 5 benchmark applications representative of the European lattice QCD community (see the documentation for more details). |
## Part 1:
This benchmark is part of the QCD section of the Accelerator Benchmarks Suite developed as part of a PRACE EU funded project (http://www.prace-ri.eu); since PRACE 5IP it has been integrated into the Unified European Applications Benchmark Suite (UEABS, http://www.prace-ri.eu/ueabs/) and named UEABS QCD part 1. It is a direct port of "QCD kernel E" from the UEABS, which is based on the MILC code suite (http://www.physics.utah.edu/~detar/milc/). The performance-portable targetDP model has been used to allow the benchmark to utilise NVIDIA GPUs, Intel Xeon Phi manycore CPUs and traditional multi-core CPUs. The use of MPI (in conjunction with targetDP) allows multiple nodes to be used in parallel.
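The solver at the heart of kernel E is a Conjugate Gradient iteration. As a reference for what the benchmark actually times, here is a minimal, self-contained CG sketch in C; the 1D Laplacian-plus-mass stand-in operator is an illustrative assumption, not the benchmark's 4D Wilson Dirac stencil.

```c
/* Minimal CG sketch: the iteration the QCD kernels benchmark. The simple
 * operator apply_A() below is a stand-in assumption; the real kernels apply
 * the Wilson Dirac normal operator on a 4D lattice. */
#include <stdio.h>
#include <math.h>

#define N 64

/* Stand-in operator: y = A x, with A a 1D Laplacian plus a small mass term,
 * symmetric positive definite as CG requires. */
static void apply_A(const double *x, double *y)
{
    for (int i = 0; i < N; i++) {
        double xm = (i > 0)     ? x[i - 1] : 0.0;
        double xp = (i < N - 1) ? x[i + 1] : 0.0;
        y[i] = (2.0 + 0.1) * x[i] - xm - xp;
    }
}

static double dot(const double *a, const double *b)
{
    double s = 0.0;
    for (int i = 0; i < N; i++) s += a[i] * b[i];
    return s;
}

int main(void)
{
    double b[N], x[N] = {0}, r[N], p[N], Ap[N];
    for (int i = 0; i < N; i++) b[i] = 1.0;      /* constant source */

    /* With x = 0 the initial residual r = b - A x equals b; p = r. */
    for (int i = 0; i < N; i++) { r[i] = b[i]; p[i] = r[i]; }
    double rr = dot(r, r);

    for (int iter = 0; iter < 1000 && sqrt(rr) > 1e-10; iter++) {
        apply_A(p, Ap);                           /* dominant cost per step */
        double alpha = rr / dot(p, Ap);
        for (int i = 0; i < N; i++) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        double rr_new = dot(r, r);
        double beta = rr_new / rr;
        for (int i = 0; i < N; i++) p[i] = r[i] + beta * p[i];
        rr = rr_new;
        printf("iter %3d  |r| = %g\n", iter, sqrt(rr));
    }
    return 0;
}
```

Each iteration is dominated by one operator application plus a few vector operations, which is why the benchmark characteristics above emphasize memory bandwidth rather than arithmetic throughput.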
## Part 2:
The QCD Accelerator Benchmark suite Part 2 consists of two kernels, based on the QUDA and QPhiX libraries. QUDA is built on CUDA and optimized for running on NVIDIA GPUs (https://lattice.github.io/quda/). QPhiX consists of routines that are optimized to use Intel intrinsic functions for multiple vector lengths, including optimized routines for KNC and KNL (http://jeffersonlab.github.io/qphix/). The benchmark kernels use the Conjugate Gradient benchmark functions provided by these libraries.
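To give a feel for how such a solve is driven through QUDA's C interface (quda.h), here is a hedged, single-GPU sketch. The lattice extents, kappa, precisions and solver settings are illustrative assumptions, not the benchmark's configuration, and some parameter fields required by newer QUDA releases are omitted.

```c
/* Hedged sketch (not the benchmark source): a CG solve through QUDA's C
 * interface. Single GPU; all numeric choices below are assumptions. */
#include <quda.h>
#include <stdlib.h>

int main(void)
{
  const int L = 8, T = 16;          /* assumed local lattice extents */
  const int V = L * L * L * T;      /* local volume */

  initQuda(0);                      /* attach to CUDA device 0 */

  QudaGaugeParam gp = newQudaGaugeParam();
  gp.X[0] = gp.X[1] = gp.X[2] = L;
  gp.X[3] = T;
  gp.anisotropy  = 1.0;
  gp.type        = QUDA_WILSON_LINKS;
  gp.gauge_order = QUDA_QDP_GAUGE_ORDER;
  gp.t_boundary  = QUDA_ANTI_PERIODIC_T;
  gp.cpu_prec    = QUDA_DOUBLE_PRECISION;
  gp.cuda_prec   = QUDA_DOUBLE_PRECISION;
  gp.reconstruct = QUDA_RECONSTRUCT_NO;
  gp.cuda_prec_sloppy   = QUDA_SINGLE_PRECISION;
  gp.reconstruct_sloppy = QUDA_RECONSTRUCT_NO;
  gp.ga_pad = 0;

  /* Unit gauge field: 4 directions, V sites, one 3x3 complex link each. */
  double *u[4];
  for (int mu = 0; mu < 4; mu++) {
    u[mu] = calloc((size_t)V * 18, sizeof(double));
    for (int s = 0; s < V; s++)
      for (int c = 0; c < 3; c++)
        u[mu][s * 18 + c * 8] = 1.0;   /* real part of diagonal entries */
  }
  loadGaugeQuda((void *)u, &gp);

  QudaInvertParam ip = newQudaInvertParam();
  ip.dslash_type   = QUDA_WILSON_DSLASH;
  ip.inv_type      = QUDA_CG_INVERTER;     /* CG on the normal operator */
  ip.solution_type = QUDA_MAT_SOLUTION;
  ip.solve_type    = QUDA_NORMOP_SOLVE;
  ip.kappa         = 0.135;                /* assumed hopping parameter */
  ip.tol           = 1e-7;
  ip.maxiter       = 5000;
  ip.reliable_delta = 0.01;
  ip.cpu_prec  = QUDA_DOUBLE_PRECISION;
  ip.cuda_prec = QUDA_DOUBLE_PRECISION;
  ip.cuda_prec_sloppy = QUDA_SINGLE_PRECISION;
  ip.preserve_source  = QUDA_PRESERVE_SOURCE_YES;
  ip.dirac_order = QUDA_DIRAC_ORDER;
  ip.gamma_basis = QUDA_DEGRAND_ROSSI_GAMMA_BASIS;
  ip.mass_normalization = QUDA_KAPPA_NORMALIZATION;
  ip.input_location  = QUDA_CPU_FIELD_LOCATION;
  ip.output_location = QUDA_CPU_FIELD_LOCATION;
  ip.verbosity = QUDA_VERBOSE;

  /* Point source b, solution x: V sites, 4 spins x 3 colors, complex. */
  double *b = calloc((size_t)V * 24, sizeof(double));
  double *x = calloc((size_t)V * 24, sizeof(double));
  b[0] = 1.0;

  invertQuda(x, b, &ip);   /* QUDA reports iterations, residual, GFLOPS */

  endQuda();
  return 0;
}
```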
**2017 - Jacob Finkenrath - CaSToRC - The Cyprus Institute (j.finkenrath@cyi.ac.cy)**
Part 2 of the QCD kernels of the Unified European Applications Benchmark Suite (UEABS, http://www.prace-ri.eu/ueabs/) was developed in PRACE 4IP under the task of developing an accelerator benchmark suite and has been part of the UEABS kernels since PRACE 5IP as UEABS QCD part 2. It consists of two kernels, based on QUDA [^1] and on the QPhiX library [^2].

[^1]: R. Babich, M. A. Clark and B. Joo, "Parallelizing the QUDA Library for Multi-GPU Calculations in Lattice Quantum Chromodynamics", Proc. SC'10 (2010).
[^2]: B. Joo, D. D. Kalamkar, K. Vaidyanathan, M. Smelyanskiy, K. Pamnany, V. W. Lee, P. Dubey and W. Watson III, "Lattice QCD on Intel Xeon Phi Coprocessors", ISC 2013.
...
```
GPUs  GFLOPS       sec
...
64    2645.590000  1.480000
```
## Xeon(Phi) Kernel
### 2. Compile and Run the Xeon(Phi)-Part
Unpack the provided source tar-file located in `./QCD_Accelerator_Benchmarksuite_Part2/XeonPhi/src`, or obtain the current QPhiX sources from the upstream project (http://jeffersonlab.github.io/qphix/).
The benchmark results for the Xeon(Phi) suite were obtained on Frioul at CINES and on the hybrid partition of MareNostrum III at BSC. Frioul has one KNL card per node, while the hybrid partition of MareNostrum III is equipped with two KNCs per node. The Frioul data were generated with the bash scripts provided with Part 2 of the QCD benchmark suite, for the two "strong-scaling" test cases with lattice sizes of 32x32x32x96 and 64x64x64x128. For MareNostrum, data for the "strong-scaling" mode on a 32x32x32x96 lattice are shown. The benchmark kernel uses a random gauge configuration and the conjugate gradient solver to solve a linear equation involving the clover Wilson Dirac operator.
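For reference, the Wilson Dirac operator behind this stencil can be written, in the hopping-parameter (kappa) normalization, as below; the clover variant used here adds the Sheikholeslami-Wohlert term with coefficient $c_{SW}$, whose sign and normalization conventions vary between codes.

```latex
D_W\,\psi(x) = \psi(x)
  - \kappa \sum_{\mu=1}^{4}
    \Big[ (1-\gamma_\mu)\, U_\mu(x)\, \psi(x+\hat\mu)
        + (1+\gamma_\mu)\, U_\mu^\dagger(x-\hat\mu)\, \psi(x-\hat\mu) \Big]
```

Each application of the operator couples a lattice site only to its eight nearest neighbours through the links $U_\mu$, so the CG solve performs little arithmetic per byte loaded; this is why the test cases above are characterized as bandwidth bound.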