......@@ -65,7 +65,7 @@ The equations are solved iteratively using time-marching algorithms, and most of
- Build and Run instructions: [code_saturne/Code_Saturne_Build_Run_5.3_UEABS.pdf](code_saturne/Code_Saturne_Build_Run_5.3_UEABS.pdf)
- Test Case A: https://repository.prace-ri.eu/ueabs/Code_Saturne/2.1/CS_5.3_PRACE_UEABS_CAVITY_13M.tar.gz
- Test Case B: https://repository.prace-ri.eu/ueabs/Code_Saturne/2.1/CS_5.3_PRACE_UEABS_CAVITY_111M.tar.gz
# CP2K <a name="cp2k"></a>
CP2K is a freely available quantum chemistry and solid-state physics software package that can perform atomistic simulations of solid state, liquid, molecular, periodic, material, crystal, and biological systems. CP2K provides a general framework for different modelling methods such as DFT using the mixed Gaussian and plane waves approaches GPW and GAPW. Supported theory levels include DFTB, LDA, GGA, MP2, RPA, semi-empirical methods (AM1, PM3, PM6, RM1, MNDO, ...), and classical force fields (AMBER, CHARMM, ...). CP2K can do simulations of molecular dynamics, metadynamics, Monte Carlo, Ehrenfest dynamics, vibrational analysis, core level spectroscopy, energy minimisation, and transition state optimisation using NEB or dimer method.
......@@ -236,52 +236,12 @@ Accelerator-based implementations have been implemented for EXDIG, using off-loa
# QCD <a name="qcd"></a>
The QCD benchmark is, unlike the other benchmarks in the PRACE application benchmark suite,
not a full application but a set of 3 parts which are representative of some of the most compute-intensive parts of QCD calculations.
Part 1:
The QCD Accelerator Benchmark suite Part 1 is a direct port of "QCD kernel E" from the
CPU part, which is based on the MILC code suite
(http://www.physics.utah.edu/~detar/milc/). The performance-portable
targetDP model has been used to allow the benchmark to utilise NVIDIA
GPUs, Intel Xeon Phi manycore CPUs and traditional multi-core
CPUs. The use of MPI (in conjunction with targetDP) allows multiple
nodes to be used in parallel.
Part 2:
The QCD Accelerator Benchmark suite Part 2 consists of two kernels,
the QUDA and the QPhiX library. The QUDA library is based on CUDA and optimized for running on NVIDIA GPUs (https://lattice.github.io/quda/).
The QPhiX library consists of routines which are optimized to use Intel intrinsic functions of multiple vector lengths, including optimized routines
for KNC and KNL (http://jeffersonlab.github.io/qphix/).
The benchmark kernels use the Conjugated Gradient benchmark functions provided by the libraries.
Part CPU:
The CPU part of the QCD benchmark is not a full application but a set of 5 kernels which are
representative of some of the most compute-intensive parts of QCD calculations.
Each of the 5 kernels has one test case:
Kernel A is derived from BQCD (Berlin Quantum ChromoDynamics program), a hybrid Monte-Carlo code which simulates Quantum Chromodynamics with dynamical standard Wilson fermions. The computations take place on a four-dimensional regular grid with periodic boundary conditions. The kernel is a standard conjugate gradient solver with even/odd pre-conditioning. The default lattice size is 16x16x16x16 for the small test case and 32x32x64x64 for the medium test case.
Kernel B is derived from SU3_AHiggs, a lattice quantum chromodynamics (QCD) code intended for computing the conditions of the Early Universe. Instead of "full QCD",
the code applies an effective field theory, which is valid at high temperatures. In the effective theory, the lattice is 3D. The default lattice size is 64x64x64 for the small test case
and 256x256x256 for the medium test case.
Kernel C is based on the software package openQCD. Kernel C is built to run in a weak scaling mode, where each CPU stores the same local lattice size, regardless of the number of CPUs. Ideal scaling for this kernel therefore corresponds to constant execution time, and performance is simply the reciprocal of the execution time. The local lattice size is 8x8x8x8.
Kernel D consists of the core matrix-vector multiplication routine for standard Wilson fermions based on the software package tmLQCD.
The default lattice size is 16x16x16x16 for the small test case and 64x64x64x64 for the medium test case.
Kernel E consists of a full conjugate gradient solution using
Wilson fermions. The default lattice size is 16x16x16x16 for the
small test case and 64x64x64x32 for the medium test case.
- Code download: https://repository.prace-ri.eu/ueabs/QCD/1.3/QCD_Source_TestCaseA.tar.gz
- Build instructions: https://repository.prace-ri.eu/git/UEABS/ueabs/blob/r1.3/qcd/QCD_Build_README.txt
- Test Case A: included with source download
- Run instructions: https://repository.prace-ri.eu/git/UEABS/ueabs/blob/r1.3/qcd/QCD_Run_README.txt
| **General information** | **Scientific field** | **Language** | **MPI** | **OpenMP** | **GPU** | **LoC** | **Code description** |
|------------------|----------------------|--------------|---------|------------|---------------------|---------|-------------------------------------------------------------------------------------------------------------------------------------------------------|
| <br>[- Bench](https://repository.prace-ri.eu/git/UEABS/ueabs/-/tree/r2.2-dev/qcd/part_1) <br>[- Summary](https://repository.prace-ri.eu/git/UEABS/ueabs/-/blob/r2.2-dev/qcd/part_1/README) | lattice Quantum Chromodynamics Part 1 | C | yes | yes | yes (CUDA) | -- | Accelerator enabled kernel E of UEABS QCD CPU part using targetDP model. Test case A - 8x64x64x64. Conjugate Gradient solver involving Wilson Dirac stencil. Domain Decomposition, Memory bandwidth, strong scaling, MPI latency. |
| <br>[- Source](https://lattice.github.io/quda/) <br>[- Bench](https://repository.prace-ri.eu/git/UEABS/ueabs/-/tree/r2.2-dev/qcd/part_2) <br>[- Summary](https://repository.prace-ri.eu/git/UEABS/ueabs/-/blob/r2.2-dev/qcd/part_2/README) | lattice Quantum Chromodynamics Part 2 - QUDA | C++ | yes | yes | yes (CUDA) | -- | Part 2: GPU is using a QUDA kernel for running on NVIDIA GPUs. [Test case A - 96x32x32x32] Small problem size. CG solver. Domain Decomposition, Memory bandwidth, strong scaling, MPI latency. [Test case B - 126x64x64x64] Moderate problem size. CG solver on Wilson Dirac stencil. Bandwidth bounded |
| <br>[- Source](http://jeffersonlab.github.io/qphix/) <br>[- Bench](https://repository.prace-ri.eu/git/UEABS/ueabs/-/tree/r2.2-dev/qcd/part_2) <br>[- Summary](https://repository.prace-ri.eu/git/UEABS/ueabs/-/blob/r2.2-dev/qcd/part_2/README) | lattice Quantum Chromodynamics Part 2 - QPHIX | C++ | yes | yes | no | -- | Part 2: Xeon(Phi) is using a QPhiX kernel which is optimized to run on x86, in particular Intel Xeon (Phi). [Test case A - 96x32x32x32] Small problem size. CG solver involving Wilson Dirac stencil. Domain Decomposition, Memory bandwidth, strong scaling, MPI latency. [Test case B - 126x64x64x64] Moderate problem size. CG solver on Wilson Dirac stencil. Bandwidth bounded |
| <br>[- Source](https://repository.prace-ri.eu/ueabs/QCD/1.3/QCD_Source_TestCaseA.tar.gz) <br>[- Bench](https://repository.prace-ri.eu/git/UEABS/ueabs/-/tree/r2.2-dev/qcd/part_cpu) <br>[- Summary](https://repository.prace-ri.eu/git/UEABS/ueabs/-/blob/r2.2-dev/qcd/part_cpu/README) | lattice Quantum Chromodynamics - CPU Part - legacy UEABS | C/Fortran | yes | yes/no | No | -- | CPU part based on UEABS QCD CPU part (legacy) benchmark kernels (last update 2017). Based on 5 different Benchmark applications representative for the European Lattice QCD community (see doc for more details). |
# Quantum Espresso <a name="espresso"></a>
......
# QCD - Overview
The QCD benchmark is, unlike the other benchmarks in the PRACE application benchmark suite, not a full application but a set of 3 parts which are representative of some of the most compute-intensive parts of QCD calculations.
## Part 1:
The QCD Accelerator Benchmark suite Part 1 is a direct port of "QCD kernel E" from the CPU part, which is based on the MILC code suite (http://www.physics.utah.edu/~detar/milc/). The performance-portable targetDP model has been used to allow the benchmark to utilise NVIDIA GPUs, Intel Xeon Phi manycore CPUs and traditional multi-core CPUs. The use of MPI (in conjunction with targetDP) allows multiple nodes to be used in parallel.
## Part 2:
The QCD Accelerator Benchmark suite Part 2 consists of two kernels, the QUDA and the QPhiX library. The QUDA library is based on CUDA and optimized for running on NVIDIA GPUs (https://lattice.github.io/quda/). The QPhiX library consists of routines which are optimized to use Intel intrinsic functions of multiple vector lengths, including optimized routines for KNC and KNL (http://jeffersonlab.github.io/qphix/). The benchmark kernels use the Conjugated Gradient benchmark functions provided by the libraries.
## Part CPU:
The CPU part of the QCD benchmark is not a full application but a set of 5 kernels which are
representative of some of the most compute-intensive parts of QCD calculations.
Each of the 5 kernels has one test case:
Kernel A is derived from BQCD (Berlin Quantum ChromoDynamics program), a hybrid Monte-Carlo code which simulates Quantum Chromodynamics with dynamical standard Wilson fermions. The computations take place on a four-dimensional regular grid with periodic boundary conditions. The kernel is a standard conjugate gradient solver with even/odd pre-conditioning. The default lattice size is 16x16x16x16 for the small test case and 32x32x64x64 for the medium test case.
Kernel B is derived from SU3_AHiggs, a lattice quantum chromodynamics (QCD) code intended for computing the conditions of the Early Universe. Instead of "full QCD", the code applies an effective field theory, which is valid at high temperatures. In the effective theory, the lattice is 3D. The default lattice size is 64x64x64 for the small test case and 256x256x256 for the medium test case.
Kernel C is based on the software package openQCD. Kernel C is built to run in a weak scaling mode, where each CPU stores the same local lattice size, regardless of the number of CPUs. Ideal scaling for this kernel therefore corresponds to constant execution time, and performance is simply the reciprocal of the execution time. The local lattice size is 8x8x8x8.
Kernel D consists of the core matrix-vector multiplication routine for standard Wilson fermions based on the software package tmLQCD. The default lattice size is 16x16x16x16 for the small test case and 64x64x64x64 for the medium test case.
Kernel E consists of a full conjugate gradient solution using Wilson fermions. The default lattice size is 16x16x16x16 for the small test case and 64x64x64x32 for the medium test case.
- Code download: https://repository.prace-ri.eu/ueabs/QCD/1.3/QCD_Source_TestCaseA.tar.gz
- Build instructions: https://repository.prace-ri.eu/git/UEABS/ueabs/blob/r1.3/qcd/QCD_Build_README.txt
- Test Case A: included with source download
- Run instructions: https://repository.prace-ri.eu/git/UEABS/ueabs/blob/r1.3/qcd/QCD_Run_README.txt
Description and Building of the QCD Benchmark
=============================================
Description
===========
The QCD benchmark is, unlike the other benchmarks in the PRACE application benchmark suite, not a full application but a set of 7 kernels, which are representative of some of the most compute-intensive parts of lattice QCD calculations for different architectures.
The QCD benchmark suite consists of three main parts: part_cpu is based on 5 kernels from major QCD applications and codes developed during DEISA and used since PRACE 2IP. They remain as legacy kernels, while performance tracking has shifted towards the newer computational kernels available in **part-2**. The applications contained in **part-1** and **part-2** are suitable for benchmarking HPC machines equipped with accelerators like Nvidia GPUs or Intel Xeon Phi processors.
The following describes the build instructions for the CPU part; descriptions of the other parts can be found in their respective subdirectories.
## Part CPU: Test cases
Each of the 5 kernels has one test case to be used for Tier-0 and Tier-1:
**Kernel A** is derived from BQCD (Berlin Quantum ChromoDynamics program), a hybrid Monte-Carlo code that simulates Quantum Chromodynamics with dynamical standard Wilson fermions. The computations take place on a four-dimensional regular grid with periodic boundary conditions. The kernel is a standard conjugate gradient solver with even/odd pre-conditioning. Lattice size is 32x32x64x64.
**Kernel B** is derived from SU3_AHiggs, a lattice quantum chromodynamics (QCD) code intended for computing the conditions of the Early Universe. Instead of "full QCD", the code applies an effective field theory, which is valid at high temperatures. In the effective theory, the lattice is 3D. Lattice size is 256x256x256.
**Kernel C** is based on benchmark kernels of the Community package openQCD, used mainly by the CLS consortium. Lattice size is 8x8x8x8. Note that Kernel C can only be run in a weak scaling mode, where each CPU stores the same local lattice size, regardless of the number of CPUs. Ideal scaling for this kernel therefore corresponds to constant execution time, and performance per peak TFlop/s is simply the reciprocal of the execution time.
**Kernel D** is based on the benchmark kernel application of tmLQCD, the community package of the Extended Twisted Mass collaboration. It consists of the core matrix-vector multiplication routine for standard Wilson fermions. The lattice size is 64x64x64x64.
**Kernel E** consists of a full conjugate gradient solution using Wilson fermions, based on a MILC routine. The source code is also used in Part-1. The standard test case has a lattice size of 64x64x64x64.
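As a worked form of the weak-scaling figure of merit quoted for Kernel C (the symbols T(N) and E_weak are notation introduced here for illustration only):

```latex
% Weak scaling for Kernel C: the 8x8x8x8 local lattice per MPI process is fixed,
% so the global lattice grows with the number of processes N.
P(N) = \frac{1}{T(N)}, \qquad E_{\mathrm{weak}}(N) = \frac{T(1)}{T(N)}
```

Ideal weak scaling corresponds to constant execution time T(N), i.e. E_weak(N) = 1, so the reported performance P(N) (optionally normalised per peak TFlop/s) is simply the reciprocal of the execution time.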
### Building the QCD Benchmark CPU PART in the JuBE Framework
The QCD benchmark: Part CPU is integrated in the JuBE Benchmarking Environment (www.fz-juelich.de/jsc/jube). JuBE also includes all steps to build the application.
Unpack the QCD_Source_TestCaseA.tar.gz into a directory of your choice.
After unpacking the Benchmark the following directory structure is available:
PABS/
applications/
bench/
doc/
platform/
skel/
LICENCE
The applications/ subdirectory contains the QCD benchmark applications.
The bench/ subdirectory contains the benchmark environment scripts.
The doc/ subdirectory contains the overall documentation of the framework and a tutorial.
The platform/ subdirectory holds the platform definitions as well as job submission script templates for each defined platform.
The skel/ subdirectory contains templates for analysis patterns for text output of different measurement tools.
##### Configuration
Definition files are already prepared for many platforms. If you are running on a defined platform just skip this part and go forward to QCD_Run_README.txt ("Execution").
##### The platform
A platform is defined through a set of variables in the platform.xml file, which can be found in the platform/ directory. To create a new platform entry, copy an existing platform description and modify it to fit your local setup. The variables defined here will be used by the individual applications in the later process. Best practice for the platform nomenclature would be: <vendor>-<system type>-<system name|site>. Additionally, you have to create a template batch submission script, which should be placed in a subdirectory of the platform/ directory of the same name as the platform itself. Although this nomenclature is not required by the benchmarking environment, it helps keep track of your templates, and minimises the amount of adaptation necessary for the individual application configurations.
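A minimal sketch of these two steps is shown below; the platform name, the source directory that is copied, and the template file name are examples only, and the actual variables to adapt are those already present in platform.xml.

```bash
cd PABS/platform

# 1. Add a new platform entry to platform.xml by copying an existing one
#    and adapting its variables (compilers, paths, launcher) to your site.
$EDITOR platform.xml                         # new entry e.g. MyVendor-cluster-MySite

# 2. Create the matching batch-template directory, named like the platform,
#    by copying an existing one and adjusting its job header.
cp -r IBM-BGQ-JUQUEEN MyVendor-cluster-MySite   # source directory name is an example
$EDITOR MyVendor-cluster-MySite/*.job.in        # template file name is an example
```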
##### The applications
Once a platform is defined, each individual application that should be used in the benchmark (in this case the QCD application) needs to be configured for this platform. In order to configure an individual application, copy an existing top-level configuration file (e.g. prace-scaling-juqueen.xml) to the file prace-<yourplatform>.xml. Then open an editor of your choice, to adapt the file to your needs. Change the settings of the platform parameter to the name of your defined platform. The platform name can then be referenced throughout the benchmarking environment by the $platform variable.
Do the same for compile.xml, execute.xml, analyse.xml. You can find a step by step tutorial also in doc/JuBETutorial.pdf.
The compilation is part of the run of the application. Please continue with the QCD_Run_README.txt to finalize the build and to run the benchmark.
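Put together, the configuration and run steps described above and in QCD_Run_README.txt amount to roughly the following sequence (prace-myplatform.xml is a placeholder name):

```bash
cd PABS/applications/QCD

# Configure the QCD application for your platform (placeholder file name):
cp prace-scaling-juqueen.xml prace-myplatform.xml
# ...edit prace-myplatform.xml, compile.xml, execute.xml and analyse.xml so
#    that the platform parameter matches your entry in platform/platform.xml.

# Compile and submit (the compilation is part of the run):
perl ../../bench/jube prace-myplatform.xml

# After the jobs have finished, gather the performance data, where <ID> is
# the benchmark ID that JuBE assigned to this run:
perl ../../bench/jube -update -result <ID>
```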
Description and Building of the QCD Benchmark
=============================================
Description
===========
The QCD benchmark is, unlike the other benchmarks in the PRACE
application benchmark suite, not a full application but a set of 7
kernels which are representative of some of the most compute-intensive
parts of QCD calculations for different architectures.
The QCD benchmark suite consists of three main parts. The part_cpu
part is based on 5 kernels from major QCD applications and codes. The
applications contained in part_1 and part_2 are suitable for
benchmarking HPC machines equipped with accelerators
like Nvidia GPUs or Intel Xeon Phi processors.
The following describes the build instructions for the part_cpu part;
descriptions of the other parts can be found in the different
subdirectories.
============================================
PART_CPU:
Test Cases
----------
Each of the 5 kernels has one test case to be used for Tier-0 and
Tier-1:
Kernel A is derived from BQCD (Berlin Quantum ChromoDynamics program),
a hybrid Monte-Carlo code that simulates Quantum Chromodynamics with
dynamical standard Wilson fermions. The computations take place on a
four-dimensional regular grid with periodic boundary conditions. The
kernel is a standard conjugate gradient solver with even/odd
pre-conditioning. Lattice size is 32^2 x 64^2.
Kernel B is derived from SU3_AHiggs, a lattice quantum chromodynamics
(QCD) code intended for computing the conditions of the Early
Universe. Instead of "full QCD", the code applies an effective field
theory, which is valid at high temperatures. In the effective theory,
the lattice is 3D. Lattice size is 256^3.
Kernel C: Lattice size is 8^4. Note that Kernel C can only be run in a
weak scaling mode, where each CPU stores the same local lattice size,
regardless of the number of CPUs. Ideal scaling for this kernel
therefore corresponds to constant execution time, and performance per
peak TFlop/s is simply the reciprocal of the execution time.
Kernel D consists of the core matrix-vector multiplication routine for
standard Wilson fermions. The lattice size is 64^4 .
Kernel E consists of a full conjugate gradient solution using Wilson
fermions. Lattice size is 64^4.
Building the QCD Benchmark in the JuBE Framework
================================================
The QCD benchmark is integrated in the JuBE Benchmarking Environment
(www.fz-juelich.de/jsc/jube).
JuBE also includes all steps to build the application.
Unpack the QCD_Source_TestCaseA.tar.gz into a directory of your
choice.
After unpacking the Benchmark the following directory structure is available:
PABS/
applications/
bench/
doc/
platform/
skel/
LICENCE
The applications/ subdirectory contains the QCD benchmark
applications.
The bench/ subdirectory contains the benchmark environment scripts.
The doc/ subdirectory contains the overall documentation of the
framework and a tutorial.
The platform/ subdirectory holds the platform definitions as well as
job submission script templates for each defined platform.
The skel/ subdirectory contains templates for analysis patterns for
text output of different measurement tools.
Configuration
-------------
Definition files are already prepared for many platforms. If you are
running on a defined platform just skip this part and go forward to
QCD_Run_README.txt ("Execution").
The platform
------------
A platform is defined through a set of variables in the platform.xml
file, which can be found in the platform/ directory. To create a new
platform entry, copy an existing platform description and modify it to
fit your local setup. The variables defined here will be used by the
individual applications in the later process. Best practice for the
platform nomenclature would be: <vendor>-<system type>-<system
name|site>. Additionally, you have to create a template batch
submission script, which should be placed in a subdirectory of the
platform/ directory of the same name as the platform itself. Although
this nomenclature is not required by the benchmarking environment, it
helps keep track of your templates, and minimises the amount of
adaptation necessary for the individual application configurations.
The applications
----------------
Once a platform is defined, each individual application that should be
used in the benchmark (in this case the QCD application) needs to be
configured for this platform. In order to configure an individual
application, copy an existing top-level configuration file
(e.g. prace-scaling-juqueen.xml) to the file prace-<yourplatform>.xml.
Then open an editor of your choice, to adapt the file to your
needs. Change the settings of the platform parameter to the name of
your defined platform. The platform name can then be referenced
throughout the benchmarking environment by the $platform variable.
Do the same for compile.xml, execute.xml, analyse.xml.
You can find a step by step tutorial also in doc/JuBETutorial.pdf.
The compilation is part of the run of the application. Please continue
with the QCD_Run_README.txt to finalize the build and to run the
benchmark.
......@@ -11,62 +11,44 @@ The folder ./part_cpu contains following subfolders
skel/
LICENCE
The applications/ subdirectory contains the QCD benchmark applications.
The bench/ subdirectory contains the benchmark environment scripts.
The doc/ subdirectory contains the overall documentation of the framework and a tutorial.
The platform/ subdirectory holds the platform definitions as well as job submission script templates for each defined platform.
The skel/ subdirectory contains templates for analysis patterns for text output of different measurement tools.
Configuration
=============
Definition files are already prepared for many platforms. If you are running on a defined platform just go forward, otherwise please have a look at QCD_Build_README.txt.
Execution
=========
Assuming the Benchmark Suite is installed in a directory that can be used during execution, a typical run of a benchmark application will
contain two steps.
1. Compiling and submitting the benchmark to the system scheduler.
2. Verifying, analysing and reporting the performance data.
Compiling and submitting
------------------------
If configured correctly, the application benchmark can be compiled and submitted on the system (e.g. the IBM BlueGene/Q at Jülich) with the commands:
>> cd PABS/applications/QCD
>> perl ../../bench/jube prace-scaling-juqueen.xml
The benchmarking environment will then compile the binary for all node/task/thread combinations defined, if those parameters need to be compiled into the binary. It creates a so-called sandbox subdirectory for each job, ensuring conflict free operation of the individual applications at runtime. If any input files are needed, those are prepared automatically as defined.
Each active benchmark in the application’s top-level configuration file will receive an ID, which is used as a reference by JUBE later on.
Verifying, analysing and reporting
----------------------------------
After the benchmark jobs have run, an additional call to jube will gather the performance data. For this, the options -update and -result are used.
>> cd DEISA_BENCH/application/QCD
>> perl ../../bench/jube -update -result <ID>
The ID is the reference number the benchmarking environment has assigned to this run. The performance data will then be output to stdout, and can be post-processed from there.
PRACE QCD Accelerator Benchmark 1
Part 1: UEABS QCD
=================================
This benchmark is part of the QCD section of the Accelerator Benchmarks Suite developed as part of a PRACE EU funded project (http://www.prace-ri.eu); since PRACE 5IP it has been integrated in the Unified European Applications Benchmark Suite (UEABS, http://www.prace-ri.eu/ueabs/) and named UEABS QCD part 1.
This specific component is a direct port of "QCD kernel E" from the UEABS, which is based on the MILC code suite (http://www.physics.utah.edu/~detar/milc/). The performance-portable targetDP model has been used to allow the benchmark to utilise NVIDIA GPUs, Intel Xeon Phi manycore CPUs and traditional multi-core CPUs. The use of MPI (in conjunction with targetDP) allows multiple nodes to be used in parallel.
For full details of this benchmark, and for results on NVIDIA GPU and Intel Knights Corner Xeon Phi architectures (in addition to regular CPUs), please see:
**********************************************************************
Gray, Alan, and Kevin Stratford. "A lightweight approach to
......@@ -30,42 +18,42 @@ available at https://arxiv.org/abs/1609.01479
To Build
--------
Choose a configuration file from the "config" directory that best matches your platform, and copy to "config.mk" in this (the top-level) directory. Then edit this file, if necessary, to properly set the compilers and paths on your system.
Note that if you are building for a GPU system, and the TARGETCC variable in the configuration file is set to the NVIDIA compiler nvcc, then the build process will automatically build the GPU version. Otherwise, the threaded CPU version will be built which can run on Xeon Phi manycore CPUs or regular multi-core CPUs.
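For example, the platform selection described above boils down to something like the following sketch (the file name under config/ is a placeholder; pick whichever entry matches your system best):

```bash
# From the top-level directory of the benchmark:
cp config/<closest-match>.mk config.mk   # file name is a placeholder

# Edit config.mk to set compilers and paths for your system, in particular:
#   TARGETCC = nvcc              -> the GPU (CUDA) version is built
#   TARGETCC = <C/C++ compiler>  -> the threaded CPU / Xeon Phi version is built
$EDITOR config.mk
```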
Then, build the targetDP performance-portable library:
```
cd targetDP
make clean
make
cd ..
```
And finally build the benchmark code
```
cd src
make clean
make
cd ..
```
To Validate
-----------
After building, an executable "bench" will exist in the src directory. To run the default validation (64x64x64x8, 1 iteration) case:
```
cd src
./bench
```
The code will automatically self-validate by comparing with the appropriate output reference file for this case which exists in output_ref, and will print to stdout, e.g.
Validating against output_ref/kernel_E.output.nx64ny64nz64nt8.i1.t1:
VALIDATION PASSED
......@@ -83,16 +71,21 @@ To Run Different Cases
You can edit the input file
```
src/kernel_E.input
```
if you want to deviate from the default system size, number of iterations and/or run using more than 1 MPI task. E.g. replacing
```
totnodes 1 1 1 1
```
with
```
totnodes 2 1 1 1
```
will run with 2 MPI tasks rather than 1, where the domain is decomposed in
the "X" direction.
......@@ -108,18 +101,13 @@ The "run" directory contains an example script which
- runs the code (which will automatically validate if an
appropriate output reference file exists)
So, in the run directory, you should copy "run_example.sh" to run.sh, which you can customise for your system.
Known Issues
------------
The quantity used for validation (see congrad.C) becomes very small after a few iterations. Therefore, only a small number of iterations should be used for validation. This is not an issue specific to this port of the benchmark, but is also true of the original version (see above), with which this version is designed to be consistent.
Performance Results for Reference
......
......@@ -3,25 +3,16 @@ targetDP
Copyright 2015 The University of Edinburgh
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
About
----------
targetDP (target Data Parallel) is a lightweight programming abstraction layer, designed to allow the same application source code to be able to target multiple architectures, e.g. NVIDIA GPUs and multicore/manycore CPUs, in a performance portable manner.
See:
......@@ -46,10 +37,11 @@ Compiling the targetDP libraries
to create the library libtarget.a:
Edit the Makefile to set CC to the desired compiler (and CFLAGS to the desired compiler flags). Then type
```
make
```
If CC is a standard C/C++ compiler, then the CPU version of targetDP.a
will be created. If CC=nvcc, then the GPU version will be created.
......@@ -64,8 +56,6 @@ See targetDPspec.pdf in the doc directory, and the example (below).
Example
----------
See simpleExample.c (which has compilation instructions for CPU and GPU near the top of the file).
# README - QCD UEABS Part 2
**2017 - Jacob Finkenrath - CaSToRC - The Cyprus Institute (j.finkenrath@cyi.ac.cy)**
Part 2 of the QCD kernels of the Unified European Applications Benchmark Suite (UEABS, http://www.prace-ri.eu/ueabs/) was developed in PRACE 4IP under the task for developing an accelerator benchmark suite and has been part of the UEABS since PRACE 5IP as UEABS QCD part 2. It consists of two kernels, based on the QUDA [1] and the QPhiX [2] libraries. QUDA is based on CUDA and optimized for running on NVIDIA GPUs (https://lattice.github.io/quda/). QPhiX consists of routines which are optimized to use Intel intrinsic functions of multiple vector lengths, including AVX512, with optimized routines for KNC and KNL (http://jeffersonlab.github.io/qphix/). The benchmark kernels use the Conjugated Gradient benchmark functions provided by the libraries.

[1] R. Babbich, M. Clark and B. Joo, "Parallelizing the QUDA Library for Multi-GPU Calculations in Lattice Quantum Chromodynamics", SC 10 (Supercomputing 2010)

[2] B. Joo, D. D. Kalamkar, K. Vaidyanathan, M. Smelyanskiy, K. Pamnany, V. W. Lee, P. Dubey, W. Watson III, "Lattice QCD on Intel Xeon Phi", International Supercomputing Conference (ISC'13), 2013
## Table of Contents
GPU - BENCHMARK SUITE (QUDA)
```
1. Compile and Run the GPU-Benchmark Suite
1.1. Compile
1.2. Run
1.2.1. Main-script: "run_ana.sh"
1.2.2. Main-script: "prepare_submit_job.sh"
1.2.3. Main-script: "submit_job.sh.template"
1.3. Example Benchmark results
```
XEONPHI - BENCHMARK SUITE (QPHIX)
```
2. Compile and Run the XeonPhi-Benchmark Suite
2.1. Compile
2.1.1. Example compilation on PRACE machines
2.1.1.1. BSC - Marenostrum III Hybrid partitions
2.1.1.2. CINES - Frioul
2.2. Run
2.2.1. Main-script: "run_ana.sh"
2.2.2. Main-script: "prepare_submit_job.sh"
2.2.3. Main-script: "submit_job.sh.template"
2.3. Example Benchmark Results
```
## GPU - Kernel
### 1. Compile and Run the GPU-Benchmark Suite
#### 1.1 Compile
......@@ -54,9 +30,10 @@ Here we just give a short overview:
Build Cmake: (./QCD_Accelerator_Benchmarksuite_Part2/GPUs/src/cmake-3.7.0.tar.gz)
Cmake can be downloaded from the source with the URL: https://cmake.org/download/
In this guide the version cmake-3.7.0 is used. The build instructions can be found in the main directory under README.rst. Use the configure file `./configure`, then run `gmake`.
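A possible unpack-and-build sequence for the bundled CMake is sketched below; the install prefix is an example, and the install step is optional (see README.rst inside the tarball).

```bash
tar xzf QCD_Accelerator_Benchmarksuite_Part2/GPUs/src/cmake-3.7.0.tar.gz
cd cmake-3.7.0
./configure --prefix=$HOME/opt/cmake-3.7.0   # prefix is an example
gmake
gmake install                                # optional
export PATH=$HOME/opt/cmake-3.7.0/bin:$PATH  # make the new cmake visible
cd ..
```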
Build Quda: (./QCD_Accelerator_Benchmarksuite_Part2/GPUs/src/quda.tar.gz)
......@@ -87,8 +64,7 @@ Now in the folder /test one can find the needed Quda executable "invert_".
#### 1.2 Run
The Accelerator QCD-Benchmarksuite Part 2 provides bash scripts located in the folder ./QCD_Accelerator_Benchmarksuite_Part2/GPUs/scripts to set up the benchmark runs on the target machines. These bash scripts are:
- `run_ana.sh` : Main-script, set up the benchmark mode and submit the jobs (analyse the results)
- `prepare_submit_job.sh` : Generate the job-scripts
......@@ -96,16 +72,8 @@ on the target machines. This bash-scripts are:
##### 1.2.1 Main-script: "run_ana.sh"
The path to the executable has to be set via $PATH2EXE. QUDA automatically tunes the GPU kernels; the optimal setup will be saved in the folder declared by the variable `QUDA_RESOURCE_PATH`. Set it to the folder where the tuning data should be saved. Different scaling modes can be chosen, from strong scaling to weak scaling, using the variable sca_mode (="Strong" or ="Weak"). The lattice sizes can be set by "gx" and "gt". Choose mode="Run" for run mode and mode="Analysis" for extracting the GFLOPS. Note that submission is done here with "sbatch"; match this to the queuing system of your target machine.
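Concretely, this amounts to editing a handful of variables at the top of run_ana.sh, along the lines of the sketch below; only the variable names come from the script as described above, all values are examples.

```bash
PATH2EXE=$HOME/quda/build/tests            # path to the QUDA executable (example)
export QUDA_RESOURCE_PATH=$HOME/quda_tune  # folder for QUDA's kernel-tuning data

sca_mode="Strong"     # "Strong" or "Weak" scaling
gx=32                 # spatial lattice extent (example)
gt=96                 # temporal lattice extent (example)
mode="Run"            # "Run" to submit jobs, "Analysis" to extract the GFLOPS

# Jobs are submitted with sbatch; adapt this to the queuing system of your machine.
```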
##### 1.2.2 Main-script: "prepare_submit_job.sh"
......@@ -113,25 +81,15 @@ Add additional option if necessary.
##### 1.2.3 Main-script: "submit_job.sh.template"
The submit template will be edited by `prepare_submit_job.sh` to generate the final submit script. The header should be matched to the queuing system
of the target machine.
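For a SLURM-based machine (the scripts submit with sbatch), the header of submit_job.sh.template would look roughly like the sketch below; partition, time and node counts are placeholders, and the exact tokens substituted by prepare_submit_job.sh depend on the script.

```bash
#!/bin/bash
#SBATCH --job-name=qcd_part2_quda
#SBATCH --nodes=<NODES>            # placeholder, filled in by prepare_submit_job.sh
#SBATCH --ntasks-per-node=1        # e.g. one MPI task per GPU (example)
#SBATCH --time=00:30:00
#SBATCH --partition=<partition>    # adapt to the target machine

# ...the benchmark command line generated by prepare_submit_job.sh follows here
```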
#### 1.3 Example Benchmark results
Shown here are the benchmark results on PizDaint, located in Switzerland at CSCS, and on the GPGPU partition of Cartesius at SURFsara in Amsterdam, the Netherlands. The runs are performed using the provided bash scripts. PizDaint has one Pascal GPU per node and two different test cases are shown: a "Strong-Scaling" mode with a random lattice configuration of size 32x32x32x96 and a "Weak-Scaling" mode with a configuration of local lattice size 48x48x48x24. The GPGPU nodes of Cartesius have two Kepler GPUs per node, and the "Strong-Scaling" test is shown for one card per node and for two cards per node. The benchmarks are done using the Conjugated Gradient solver, which solves a linear equation, D * x = b, for the unknown solution "x", based on the clover improved Wilson Dirac operator "D" and a known right hand side "b".
```
---------------------
......@@ -255,8 +213,8 @@ GPUs GFLOPS sec
64 2645.590000 1.480000
```
## Xeon(Phi) Kernel
### 2. Compile and Run the Xeon(Phi)-Part
Unpack the provided source tar-file located in `./QCD_Accelerator_Benchmarksuite_Part2/XeonPhi/src` or clone the current GitHub branches of the code
......
# Results - QCD UEABS Part 2
**2017 - Jacob Finkenrath - CaSToRC - The Cyprus Institute (j.finkenrath@cyi.ac.cy)**
The QCD UEABS Part 2 consists of two kernels, based on the QUDA [1] and the QPhiX [2] libraries. QUDA is based on CUDA and optimized for running on NVIDIA GPUs (https://lattice.github.io/quda/). QPhiX consists of routines which are optimized to use Intel intrinsic functions of multiple vector lengths, including optimized routines for KNC and KNL (http://jeffersonlab.github.io/qphix/). The benchmark code uses the Conjugated Gradient benchmark functions provided by the libraries.

[1] R. Babbich, M. Clark and B. Joo, "Parallelizing the QUDA Library for Multi-GPU Calculations in Lattice Quantum Chromodynamics", SC 10 (Supercomputing 2010)

[2] B. Joo, D. D. Kalamkar, K. Vaidyanathan, M. Smelyanskiy, K. Pamnany, V. W. Lee, P. Dubey, W. Watson III, "Lattice QCD on Intel Xeon Phi", International Supercomputing Conference (ISC'13), 2013
### GPU - BENCHMARK SUITE - QUDA
The GPU benchmark results of the second implementation are done on PizDaint located in Switzerland at CSCS and the GPU-partition of Cartesius at Surfsara based in Netherland, Amsterdam. The runs are performed by using the provided bash-scripts. PizDaint is equipped with one P100 Pascal-GPU per node. Two different test-cases are depicted, the "strong-scaling" mode with a random lattice configuration of size 32x32x32x96 and 64x64x64x128. The GPU nodes of Cartesius have two Kepler-GPU K40m per node and the "strong-scaling" test is shown for one card per node and for two cards per node. The benchmark kernel is using the conjugated gradient solver which solve a linear equation system given by D * x = b, for the unknown solution "x" based on the clover improved Wilson Dirac operator "D" and a known right hand side "b".
......@@ -36,13 +31,13 @@ The figure shows strong scaling of the conjugate gradient solver on P100 GPU on
#### PizDaint - Pascal P100
###### Strong - Scaling:
global lattice size (32x32x32x96)
sloppy-precision: single
precision: single
GPUs GFLOPS sec
1 786.520000 4.569600
......@@ -65,12 +60,12 @@ GPUs GFLOPS sec
32 4965.480000 2.095180
64 2308.850000 2.005110
###### Weak - Scaling:
local lattice size (48x48x48x24)
sloppy-precision: single
precision: single
GPUs GFLOPS sec
1 765.967000 3.940280
......@@ -95,17 +90,14 @@ GPUs GFLOPS sec
#### SurfSara - Kepler K20m
##### 1 GPU per Node
###### Strong - Scaling:
global lattice size (32x32x32x96)
sloppy-precision: single
precision: single
GPUs GFLOPS sec
1 243.084000 4.030000
2 478.179000 2.630000
......@@ -115,7 +107,7 @@ GPUs GFLOPS sec
32 4365.320000 1.310000
sloppy-precision: double
precision: double
GPUs GFLOPS sec
1 119.786000 6.060000
......@@ -125,15 +117,13 @@ GPUs GFLOPS sec
16 1604.210000 1.480000
32 2420.130000 1.630000
##### 2 GPU per Node
###### Strong - Scaling:
global lattice size (32x32x32x96)
sloppy-precision: single
precision: single
GPUs GFLOPS sec
2 463.041000 2.720000
......@@ -144,7 +134,7 @@ GPUs GFLOPS sec
64 4505.440000 1.430000
sloppy-precision: double
precision: double
GPUs GFLOPS sec
2 229.579000 3.380000
......@@ -154,11 +144,7 @@ GPUs GFLOPS sec
32 1842.560000 1.550000
64 2645.590000 1.480000
### Xeon(Phi) - BENCHMARK SUITE
The benchmark results for the XeonPhi benchmark suite were obtained on Frioul at CINES and on the hybrid partition of MareNostrum III at BSC. Frioul has one KNL card per node, while the hybrid partition of MareNostrum III is equipped with two KNCs per node. The data on Frioul were generated using the bash scripts provided by the second implementation of QCD, for two "strong-scaling" test cases with lattice sizes of 32x32x32x96 and 64x64x64x128. For the data generated at MareNostrum, results for the "strong-scaling" mode on a 32x32x32x96 lattice are shown. The benchmark kernel uses a random gauge configuration and the conjugated gradient solver to solve a linear equation involving the clover Wilson Dirac operator.
MareNostrum_KNC.png
......@@ -168,11 +154,12 @@ Frioul_KNL.png
The figure shows strong scaling results of the conjugate gradient solver on KNLs on Frioul. The lattice size is 32x32x32x96, which is similar to the strong scaling run on the KNCs on MareNostrum III. The run is performed in quadrant cache mode with 68 OpenMP processes per KNL. The test is performed with a conjugate gradient solver in single precision.
Frioul_KNL_lV128x64c.png
The figure shows strong scaling of the conjugate gradient solver on KNLs on Frioul. The lattice size is increased to 64x64x64x128, which is a commonly used large lattice nowadays. By increasing the lattice size, the scaling test shows that the conjugate gradient solver has very good strong scaling up to 16 KNLs.
#### Frioul - KNLs
###### Strong - Scaling:
global lattice size (32x32x32x96)
precision: single
......@@ -203,7 +190,7 @@ KNLs GFLOPS
4 1214.82
8 2425.45
16 4404.63
precision: double
KNLs GFLOPS
......@@ -213,11 +200,12 @@ KNLs GFLOPS
8 1228.77
16 2310.63
#### MareNostrum III - KNC's
###### Strong - Scaling:
global lattice size (32x32x32x96)
precision: single - 1 Cards per Node
......@@ -238,20 +226,19 @@ KNCs GFLOPS
32 605.882
64 847.566
#### Results from PRACE 5IP
(see White paper for more details)
Results in GFLOP/s for V=96x32x32x32
Nodes Irene SKL Juwels Marconi-KNL MareNostrum PizDaint Davide Frioul Deep Mont-Blanc 3
1 134,382 132,26 101,815 142,336 387,659 392,763 184,729 41,7832 99,6378
2 240,853 245,599 145,608 263,355 755,308 773,901 269,705 40,7721 214,549
4 460,044 456,228 202,135 480,516 1400,06 1509,46 441,534 59,6317 410,902
8 754,657 864,959 223,082 895,277 1654,21 2902,83 614,466 67,3355 715,699
16 1366,21 1700,95 214,705 1632,87 2145,69 5394,16 644,303 91,5139 1,17E+03
32 2603,9 3199,98 183,327 2923,7 2923,98 9650,91 937,755
64 4122,76 5167,48 232,788 4118,7 2332,71 800,514
128 4703,46 7973,9 37,8003 4050,41
256 -- 3130,42
512 -- 3421,25
......