@@ -227,16 +227,18 @@ The application codes that constitute the UEABS are:
<td>PFARM uses an R-matrix ab-initio approach to calculate electron-atom and electron-molecule collision data for a wide range of applications, including astrophysics and nuclear fusion. It is written in modern Fortran/MPI/OpenMP and exploits highly optimised dense linear algebra numerical library routines.</td>
</tr>
<tr>
<td>QCD
<li><a href='qcd/README.md'>see for more details</a></li>
</td>
<td>100,000+</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td></td>
<td></td>
<td>X</td>
<td>X</td>
<td>The QCD benchmark is, unlike the other benchmarks in the PRACE application benchmark suite, not a full application but a set of 3 parts which are representative of some of the most compute-intensive parts of QCD calculations. The main computational load of the different parts is a Conjugate Gradient solver involving the Wilson Dirac stencil in 4 dimensions. Keywords of the QCD benchmark kernels are: Domain Decomposition, memory bandwidth, strong scaling, MPI latency.</td>
</tr>
<tr>
<td>Quantum ESPRESSO
...
...
@@ -304,6 +306,7 @@ The application codes that constitute the UEABS are:
</tbody>
</table>
_**TODO for all BCOs: move all information below this line either into the Table above (using a short version in "Code Description/Notes") or into your top-level README / description. (And when done, remove the relevant text below.)**_
| <br>[- Bench](https://repository.prace-ri.eu/git/UEABS/ueabs/-/tree/r2.2-dev/qcd/part_1)<br>[- Summary](https://repository.prace-ri.eu/git/UEABS/ueabs/-/blob/r2.2-dev/qcd/part_1/README.md) | lattice Quantum Chromodynamics Part 1 | C | yes | yes | yes (CUDA) | -- | Accelerator-enabled kernel E of the UEABS QCD CPU part, using the targetDP model. Test case A - 8x64x64x64. Conjugate Gradient solver involving the Wilson Dirac stencil. Domain Decomposition, memory bandwidth, strong scaling, MPI latency. |
| <br>[- Source](https://lattice.github.io/quda/)<br>[- Bench](https://repository.prace-ri.eu/git/UEABS/ueabs/-/tree/r2.2-dev/qcd/part_2)<br>[- Summary](https://repository.prace-ri.eu/git/UEABS/ueabs/-/blob/r2.2-dev/qcd/part_2/README.md) | lattice Quantum Chromodynamics Part 2 - QUDA | C++ | yes | yes | yes (CUDA) | -- | Part 2 (GPU) uses a QUDA kernel for running on NVIDIA GPUs. [Test case A - 96x32x32x32] Small problem size. CG solver. Domain Decomposition, memory bandwidth, strong scaling, MPI latency. [Test case B - 126x64x64x64] Moderate problem size. CG solver on the Wilson Dirac stencil. Bandwidth bound. |
| <br>[- Source](http://jeffersonlab.github.io/qphix/)<br>[- Bench](https://repository.prace-ri.eu/git/UEABS/ueabs/-/tree/r2.2-dev/qcd/part_2)<br>[- Summary](https://repository.prace-ri.eu/git/UEABS/ueabs/-/blob/r2.2-dev/qcd/part_2/README.md) | lattice Quantum Chromodynamics Part 2 - QPHIX | C++ | yes | yes | no | -- | Part 2 (Xeon (Phi)) uses a QPhiX kernel which is optimised for x86, in particular Intel Xeon (Phi). [Test case A - 96x32x32x32] Small problem size. CG solver involving the Wilson Dirac stencil. Domain Decomposition, memory bandwidth, strong scaling, MPI latency. [Test case B - 126x64x64x64] Moderate problem size. CG solver on the Wilson Dirac stencil. Bandwidth bound. |
| <br>[- Source](https://repository.prace-ri.eu/ueabs/QCD/1.3/QCD_Source_TestCaseA.tar.gz)<br>[- Bench](https://repository.prace-ri.eu/git/UEABS/ueabs/-/tree/r2.2-dev/qcd/part_cpu)<br>[- Summary](https://repository.prace-ri.eu/git/UEABS/ueabs/-/blob/r2.2-dev/qcd/part_cpu/README.md) | lattice Quantum Chromodynamics - CPU Part - legacy UEABS | C/Fortran | yes | yes/no | no | -- | CPU part based on the legacy UEABS QCD CPU benchmark kernels (last updated 2017). Based on 5 different benchmark applications representative of the European Lattice QCD community (see the documentation for more details). |
# QCD - Overview
The QCD benchmark is, unlike the other benchmarks in the PRACE application benchmark suite, not a full application but a set of 3 parts which are representative of some of the most compute-intensive parts of QCD calculations.
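All three parts ultimately benchmark a Conjugate Gradient (CG) inversion of the Wilson Dirac operator. The sketch below shows a plain CG iteration in C with the (even/odd preconditioned) Dirac operator abstracted behind a `matvec_fn` callback; the function and variable names are illustrative and do not correspond to the actual UEABS sources. In the distributed codes, the two dot products per iteration become global reductions (e.g. `MPI_Allreduce`), which is why MPI latency is one of the benchmark keywords.

```c
#include <stddef.h>

/* Illustrative only: the operator stands in for the (even/odd preconditioned)
 * Wilson Dirac normal operator; vectors are flattened lattice fields. */
typedef void (*matvec_fn)(double *out, const double *in, size_t n);

static double dot(const double *a, const double *b, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) s += a[i] * b[i];
    return s;
}

/* Plain conjugate gradient: solves A x = b for a symmetric positive-definite A.
 * Returns the iteration count on convergence, -1 otherwise. */
int cg_solve(matvec_fn A, double *x, const double *b, size_t n,
             double *r, double *p, double *Ap,   /* caller-provided workspace */
             double tol, int max_iter)
{
    A(Ap, x, n);
    for (size_t i = 0; i < n; ++i) { r[i] = b[i] - Ap[i]; p[i] = r[i]; }
    double rr = dot(r, r, n);
    const double target = tol * tol * dot(b, b, n);

    for (int k = 0; k < max_iter; ++k) {
        if (rr <= target) return k;              /* converged: |r| <= tol * |b| */
        A(Ap, p, n);                             /* dominant cost: the Dirac stencil */
        double alpha = rr / dot(p, Ap, n);
        for (size_t i = 0; i < n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        double rr_new = dot(r, r, n);
        double beta = rr_new / rr;
        for (size_t i = 0; i < n; ++i) p[i] = r[i] + beta * p[i];
        rr = rr_new;
    }
    return -1;                                   /* not converged */
}
```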
## Part 1:
The QCD Accelerator Benchmark suite Part 1 is a direct port of "QCD kernel E" from the CPU part, which is based on the MILC code suite (http://www.physics.utah.edu/~detar/milc/). The performance-portable targetDP model has been used to allow the benchmark to utilise NVIDIA GPUs, Intel Xeon Phi manycore CPUs and traditional multi-core CPUs. The use of MPI (in conjunction with targetDP) allows multiple nodes to be used in parallel.
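The domain decomposition mentioned above can be pictured with plain MPI: the global lattice is split over a Cartesian grid of ranks, each rank keeps a fixed local sub-lattice and exchanges halo sites with its forward and backward neighbour in every direction before each stencil application. The sketch below is illustrative only (it assumes the process grid divides the 8x64x64x64 test case A extents evenly); Part 1 itself handles the decomposition internally via targetDP plus MPI.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int nranks, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Global lattice of Part 1, test case A: T x Z x Y x X = 8 x 64 x 64 x 64. */
    const int global[4] = {8, 64, 64, 64};

    /* Let MPI factorise the rank count into a 4D process grid. */
    int dims[4] = {0, 0, 0, 0};
    int periods[4] = {1, 1, 1, 1};            /* periodic boundary conditions */
    MPI_Dims_create(nranks, 4, dims);

    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 4, dims, periods, 1, &cart);

    /* Each rank owns a fixed local sub-lattice (assumes exact divisibility);
     * coords identify which block of the global lattice this rank holds. */
    int local[4], coords[4];
    MPI_Cart_coords(cart, rank, 4, coords);
    for (int mu = 0; mu < 4; ++mu) local[mu] = global[mu] / dims[mu];

    /* Neighbour ranks in each direction, used for halo exchange of boundary
     * sites before every application of the Dirac stencil. */
    int fwd[4], bwd[4];
    for (int mu = 0; mu < 4; ++mu)
        MPI_Cart_shift(cart, mu, 1, &bwd[mu], &fwd[mu]);

    if (rank == 0) {
        printf("process grid %dx%dx%dx%d, local lattice %dx%dx%dx%d\n",
               dims[0], dims[1], dims[2], dims[3],
               local[0], local[1], local[2], local[3]);
        printf("rank 0 neighbours in t-direction: -%d / +%d\n", bwd[0], fwd[0]);
    }

    MPI_Finalize();
    return 0;
}
```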
## Part 2:
The QCD Accelerator Benchmark suite Part 2 consists of two kernels based on the QUDA and QPhiX libraries. The QUDA library is based on CUDA and optimised for NVIDIA GPUs (https://lattice.github.io/quda/). The QPhiX library consists of routines optimised to use Intel intrinsic functions for multiple vector lengths, including optimised routines for KNC and KNL (http://jeffersonlab.github.io/qphix/). The benchmark kernels use the Conjugate Gradient benchmark functions provided by these libraries.
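The per-test-case notes describe these solvers as bandwidth bound. A rough back-of-envelope estimate shows why: the commonly quoted operation count for the Wilson Dslash is about 1320 floating-point operations per lattice site, while in single precision each site streams in 8 gauge links (3x3 complex matrices) and 8 neighbour spinors (4x3 complex) and writes one spinor back. Ignoring any cache reuse, that gives an arithmetic intensity of roughly 0.9 flop/byte, well below the compute/bandwidth balance of current CPUs and GPUs. The snippet below just reproduces that arithmetic; it is an estimate, not a measurement of QUDA or QPhiX.

```c
#include <stdio.h>

int main(void)
{
    /* Back-of-envelope arithmetic intensity of the Wilson Dslash, single precision.
     * Assumes no cache reuse of neighbour data -- a worst-case streaming estimate. */
    const double flops_per_site = 1320.0;                    /* commonly quoted count */

    const double bytes_per_complex = 2 * 4;                  /* re + im, float */
    const double gauge_bytes  = 8 * 9  * bytes_per_complex;  /* 8 links, 3x3 matrices */
    const double spinor_bytes = 8 * 12 * bytes_per_complex;  /* 8 neighbour spinors   */
    const double output_bytes = 1 * 12 * bytes_per_complex;  /* result spinor         */
    const double bytes_per_site = gauge_bytes + spinor_bytes + output_bytes;

    printf("bytes/site : %.0f\n", bytes_per_site);                    /* ~1440 */
    printf("flops/byte : %.2f\n", flops_per_site / bytes_per_site);   /* ~0.9  */
    return 0;
}
```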
## Part CPU:
The CPU part of the QCD benchmark is not a full application but a set of 5 kernels which are
representative of some of the most compute-intensive parts of QCD calculations.
Each of the 5 kernels comes with its own test cases:
Kernel A is derived from BQCD (Berlin Quantum ChromoDynamics program), a hybrid Monte-Carlo code which simulates Quantum Chromodynamics with dynamical standard Wilson fermions. The computations take place on a four-dimensional regular grid with periodic boundary conditions. The kernel is a standard conjugate gradient solver with even/odd pre-conditioning. The default lattice size is 16x16x16x16 for the small test case and 32x32x64x64 for the medium test case.
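Even/odd (red-black) preconditioning splits the lattice into two halves by site parity; the nearest-neighbour hopping term only couples even sites to odd ones, so the solver can work on a single parity. A minimal illustration of the site classification (the layout is illustrative, not BQCD's):

```c
/* Parity of a 4D lattice site: 0 = even, 1 = odd.  Every nearest neighbour of
 * an even site is odd and vice versa, which is what makes the even/odd
 * (red-black) preconditioned CG of Kernel A possible. */
static inline int site_parity(int x, int y, int z, int t)
{
    return (x + y + z + t) & 1;
}
```

On the 16x16x16x16 small test case each parity class holds 32768 of the 65536 sites.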
Kernel B is derived from SU3_AHiggs, a lattice quantum chromodynamics (QCD) code intended for computing the conditions of the Early Universe. Instead of “full QCD”, the code applies an effective field theory, which is valid at high temperatures. In the effective theory, the lattice is 3D. The default lattice size is 64x64x64 for the small test case and 256x256x256 for the medium test case.
Kernel C is based on the software package openQCD. Kernel C is built to run in a weak scaling mode, where each CPU stores the same local lattice size, regardless of the number of CPUs. Ideal scaling for this kernel therefore corresponds to constant execution time, and performance is simply the reciprocal of the execution time. The local lattice size is 8x8x8x8.
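Because the local lattice per process is fixed, the global problem size grows with the number of processes and the figure of merit is the reciprocal of the wall time, as stated above. A small helper making that convention explicit (names and numbers are illustrative, not part of the benchmark harness):

```c
#include <stdio.h>

/* Weak-scaling figure of merit used for Kernel C: performance = 1 / time.
 * Ideal scaling means constant run time as ranks increase, i.e. constant
 * performance under this definition. */
static double weak_scaling_performance(double wall_time_seconds)
{
    return 1.0 / wall_time_seconds;
}

int main(void)
{
    const int local[4] = {8, 8, 8, 8};   /* fixed local lattice per process */
    const int nprocs = 1024;             /* illustrative rank count */
    long global_sites = (long)nprocs * local[0] * local[1] * local[2] * local[3];

    printf("global sites: %ld\n", global_sites);                         /* grows with nprocs */
    printf("performance : %.3f 1/s\n", weak_scaling_performance(42.0));  /* example time      */
    return 0;
}
```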
Kernel D consists of the core matrix-vector multiplication routine for standard Wilson fermions based on the software package tmLQCD. The default lattice size is 16x16x16x16 for the small test case and 64x64x64x64 for the medium test case.
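The core of such a matrix-vector routine is a nearest-neighbour stencil on the 4D lattice: each output site combines its eight neighbours, each projected with a gamma matrix and parallel-transported with an SU(3) gauge link. The sketch below keeps only the memory-access pattern, on a scalar field with periodic boundaries; it drops the spin and colour structure entirely and is not the tmLQCD routine.

```c
#include <stddef.h>

/* Flattened index of site (x,y,z,t) on an Lx x Ly x Lz x Lt lattice. */
static size_t idx(int x, int y, int z, int t, int Lx, int Ly, int Lz)
{
    return ((size_t)((t * Lz + z) * Ly + y) * Lx) + x;
}

/* Scalar stand-in for the Wilson hopping term: out(x) = sum over the 8
 * nearest neighbours of in, with periodic boundaries.  The real Kernel D
 * additionally multiplies each neighbour by an SU(3) link matrix and a
 * spin projector, but the access pattern is the same. */
void hop4d(double *out, const double *in, int Lx, int Ly, int Lz, int Lt)
{
    for (int t = 0; t < Lt; ++t)
    for (int z = 0; z < Lz; ++z)
    for (int y = 0; y < Ly; ++y)
    for (int x = 0; x < Lx; ++x) {
        double acc = 0.0;
        acc += in[idx((x + 1) % Lx, y, z, t, Lx, Ly, Lz)];
        acc += in[idx((x - 1 + Lx) % Lx, y, z, t, Lx, Ly, Lz)];
        acc += in[idx(x, (y + 1) % Ly, z, t, Lx, Ly, Lz)];
        acc += in[idx(x, (y - 1 + Ly) % Ly, z, t, Lx, Ly, Lz)];
        acc += in[idx(x, y, (z + 1) % Lz, t, Lx, Ly, Lz)];
        acc += in[idx(x, y, (z - 1 + Lz) % Lz, t, Lx, Ly, Lz)];
        acc += in[idx(x, y, z, (t + 1) % Lt, Lx, Ly, Lz)];
        acc += in[idx(x, y, z, (t - 1 + Lt) % Lt, Lx, Ly, Lz)];
        out[idx(x, y, z, t, Lx, Ly, Lz)] = acc;
    }
}
```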
Kernel E consists of a full conjugate gradient solution using Wilson fermions. The default lattice size is 16x16x16x16 for the small test case and 64x64x64x32 for the medium test case.