Commit 20dda510 authored by Jacob Finkenrath

Impr readme - first Version

parent 04a109f2
...@@ -236,52 +236,12 @@ Accelerator-based implementations have been implemented for EXDIG, using off-loa
# QCD <a name="qcd"></a>
| **General information** | **Scientific field** | **Language** | **MPI** | **OpenMP** | **GPU** | **LoC** | **Code description** |
|-------------------------|----------------------|--------------|---------|------------|---------|---------|----------------------|
| <br>[- Bench](https://repository.prace-ri.eu/git/UEABS/ueabs/-/tree/r2.2-dev/qcd/part_1) <br>[- Summary](https://repository.prace-ri.eu/git/UEABS/ueabs/-/blob/r2.2-dev/qcd/part_1/README) | Lattice Quantum Chromodynamics - Kernel 1 | C | yes | yes | yes (CUDA) | -- | Based on UEABS kernel E, a CG solver using the Wilson Dirac operator. The targetDP model is used so that the benchmark can utilise NVIDIA GPUs, Intel Xeon Phi manycore CPUs and traditional multi-core CPUs. [Test case A - 8x64x64x64] Small problem size suitable for a single KNC. Conjugate Gradient solver involving the Wilson Dirac stencil. Domain decomposition, memory bandwidth, strong scaling, MPI latency. |
| <br>[- Source](https://lattice.github.io/quda/) <br>[- Bench](https://repository.prace-ri.eu/git/UEABS/ueabs/-/tree/r2.2-dev/qcd/part_2) <br>[- Summary](https://repository.prace-ri.eu/git/UEABS/ueabs/-/blob/r2.2-dev/qcd/part_2/README) | Lattice Quantum Chromodynamics - Kernel 2 - QUDA | C++ | yes | yes | yes (CUDA) | -- | The QUDA library is based on CUDA and optimized for NVIDIA GPUs. [Test case A - 96x32x32x32] Small problem size. Conjugate Gradient solver involving the Wilson Dirac stencil. Domain decomposition, memory bandwidth, strong scaling, MPI latency. [Test case B - 126x64x64x64] Moderate problem size. Conjugate Gradient solver involving the Wilson Dirac stencil. Bandwidth bound. |
| <br>[- Source](http://jeffersonlab.github.io/qphix/) <br>[- Bench](https://repository.prace-ri.eu/git/UEABS/ueabs/-/tree/r2.2-dev/qcd/part_2) <br>[- Summary](https://repository.prace-ri.eu/git/UEABS/ueabs/-/blob/r2.2-dev/qcd/part_2/README) | Lattice Quantum Chromodynamics - Kernel 2 - QPhiX | C++ | yes | yes | no | -- | The QPhiX library consists of routines optimized with Intel intrinsic functions for multiple vector lengths, including optimized routines for KNC and KNL. [Test case A - 96x32x32x32] Small problem size. Conjugate Gradient solver involving the Wilson Dirac stencil. Domain decomposition, memory bandwidth, strong scaling, MPI latency. [Test case B - 126x64x64x64] Moderate problem size. Conjugate Gradient solver involving the Wilson Dirac stencil. Bandwidth bound. |
| <br>[- Source](https://repository.prace-ri.eu/ueabs/QCD/1.3/QCD_Source_TestCaseA.tar.gz) <br>[- Bench](https://repository.prace-ri.eu/git/UEABS/ueabs/-/tree/r2.2-dev/qcd/part_cpu) <br>[- Summary](https://repository.prace-ri.eu/git/UEABS/ueabs/-/blob/r2.2-dev/qcd/part_cpu/README) | Lattice Quantum Chromodynamics - legacy UEABS kernels | C/Fortran | yes | yes/no | no | -- | The legacy UEABS QCD benchmark kernels are based on 5 different benchmark applications used by the European lattice QCD community (see the documentation for more details). |

The QCD benchmark is, unlike the other benchmarks in the PRACE application benchmark suite, not a full application but a set of 3 parts which are representative of some of the most compute-intensive parts of QCD calculations.
Part 1:
The QCD Accelerator Benchmark suite Part 1 is a direct port of "QCD kernel E" from the CPU part, which is based on the MILC code suite
(http://www.physics.utah.edu/~detar/milc/). The performance-portable
targetDP model has been used to allow the benchmark to utilise NVIDIA
GPUs, Intel Xeon Phi manycore CPUs and traditional multi-core
CPUs. The use of MPI (in conjunction with targetDP) allows multiple
nodes to be used in parallel.
Part 2:
The QCD Accelerator Benchmark suite Part 2 consists of two kernels,
the QUDA and the QPhiX library. The QUDA library is based on CUDA and optimized for running on NVIDIA GPUs (https://lattice.github.io/quda/).
The QPhiX library consists of routines which are optimized to use Intel intrinsic functions of multiple vector lengths, including optimized routines
for KNC and KNL (http://jeffersonlab.github.io/qphix/).
The benchmark kernels use the Conjugate Gradient benchmark functions provided by the libraries.
Part CPU:
The CPU part of QCD benchmark is not a full application but a set of 5 kernels which are
representative of some of the most compute-intensive parts of QCD calculations.
Each of the 5 kernels has one test case:
Kernel A is derived from BQCD (Berlin Quantum ChromoDynamics program), a hybrid Monte-Carlo code which simulates Quantum Chromodynamics with dynamical standard Wilson fermions. The computations take place on a four-dimensional regular grid with periodic boundary conditions. The kernel is a standard conjugate gradient solver with even/odd pre-conditioning. The default lattice size is 16x16x16x16 for the small test case and 32x32x64x64 for the medium test case.
Kernel B is derived from SU3_AHiggs, a lattice quantum chromodynamics (QCD) code intended for computing the conditions of the Early Universe. Instead of "full QCD",
the code applies an effective field theory, which is valid at high temperatures. In the effective theory, the lattice is 3D. The default lattice size is 64x64x64 for the small test case
and 256x256x256 for the medium test case.
Kernel C is based on the software package openQCD. Kernel C is built to run in a weak scaling mode, where each CPU stores the same local lattice size, regardless of the number of CPUs. Ideal scaling for this kernel therefore corresponds to constant execution time, and performance is simply the reciprocal of the execution time. The local lattice size is 8x8x8x8.
Kernel D consists of the core matrix-vector multiplication routine for standard Wilson fermions based on the software package tmLQCD.
The default lattice size is 16x16x16x16 for the small test case and 64x64x64x64 for the medium test case.
Kernel E consists of a full conjugate gradient solution using
Wilson fermions. The default lattice size is 16x16x16x16 for the
small test case and 64x64x64x32 for the medium test case.
- Code download: https://repository.prace-ri.eu/ueabs/QCD/1.3/QCD_Source_TestCaseA.tar.gz
- Build instructions: https://repository.prace-ri.eu/git/UEABS/ueabs/blob/r1.3/qcd/QCD_Build_README.txt
- Test Case A: included with source download
- Run instructions: https://repository.prace-ri.eu/git/UEABS/ueabs/blob/r1.3/qcd/QCD_Run_README.txt
# Quantum Espresso <a name="espresso"></a>
# QCD - Overview
The QCD benchmark is, unlike the other benchmarks in the PRACE application benchmark suite, not a full application but a set of 3 parts which are representative of some of the most compute-intensive parts of QCD calculations.
## Part 1:
The QCD Accelerator Benchmark suite Part 1 is a direct port of "QCD kernel E" from the CPU part, which is based on the MILC code suite (http://www.physics.utah.edu/~detar/milc/). The performance-portable targetDP model has been used to allow the benchmark to utilise NVIDIA GPUs, Intel Xeon Phi manycore CPUs and traditional multi-core CPUs. The use of MPI (in conjunction with targetDP) allows multiple nodes to be used in parallel.
## Part 2:
The QCD Accelerator Benchmark suite Part 2 consists of two kernels, the QUDA and the QPhiX library. The QUDA library is based on CUDA and optimized for running on NVIDIA GPUs (https://lattice.github.io/quda/). The QPhiX library consists of routines which are optimized to use Intel intrinsic functions of multiple vector lengths, including optimized routines for KNC and KNL (http://jeffersonlab.github.io/qphix/). The benchmark kernels use the Conjugate Gradient benchmark functions provided by the libraries.
## Part CPU:
The CPU part of QCD benchmark is not a full application but a set of 5 kernels which are
representative of some of the most compute-intensive parts of QCD calculations.
Each of the 5 kernels has one test case:
Kernel A is derived from BQCD (Berlin Quantum ChromoDynamics program), a hybrid Monte-Carlo code which simulates Quantum Chromodynamics with dynamical standard Wilson fermions. The computations take place on a four-dimensional regular grid with periodic boundary conditions. The kernel is a standard conjugate gradient solver with even/odd pre-conditioning. The default lattice size is 16x16x16x16 for the small test case and 32x32x64x64 for the medium test case.
Kernel B is derived from SU3_AHiggs, a lattice quantum chromodynamics (QCD) code intended for computing the conditions of the Early Universe. Instead of "full QCD", the code applies an effective field theory, which is valid at high temperatures. In the effective theory, the lattice is 3D. The default lattice size is 64x64x64 for the small test case and 256x256x256 for the medium test case.
Kernel C is based on the software package openQCD. Kernel C is built to run in a weak scaling mode, where each CPU stores the same local lattice size, regardless of the number of CPUs. Ideal scaling for this kernel therefore corresponds to constant execution time, and performance is simply the reciprocal of the execution time. The local lattice size is 8x8x8x8.
Kernel D consists of the core matrix-vector multiplication routine for standard Wilson fermions based on the software package tmLQCD. The default lattice size is 16x16x16x16 for the small test case and 64x64x64x64 for the medium test case.
Kernel E consists of a full conjugate gradient solution using Wilson fermions. The default lattice size is 16x16x16x16 for the small test case and 64x64x64x32 for the medium test case.
- Code download: https://repository.prace-ri.eu/ueabs/QCD/1.3/QCD_Source_TestCaseA.tar.gz
- Build instructions: https://repository.prace-ri.eu/git/UEABS/ueabs/blob/r1.3/qcd/QCD_Build_README.txt
- Test Case A: included with source download
- Run instructions: https://repository.prace-ri.eu/git/UEABS/ueabs/blob/r1.3/qcd/QCD_Run_README.txt
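As a quick start, the Test Case A source archive linked above can be fetched and unpacked with standard tools (a minimal sketch; the working directory is only an example):
```
# Download and unpack the CPU benchmark sources (Test Case A is included)
mkdir -p $HOME/ueabs-qcd && cd $HOME/ueabs-qcd
wget https://repository.prace-ri.eu/ueabs/QCD/1.3/QCD_Source_TestCaseA.tar.gz
tar xzf QCD_Source_TestCaseA.tar.gz
```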
...@@ -3,64 +3,33 @@ Description and Building of the QCD Benchmark
Description
===========
The QCD benchmark is, unlike the other benchmarks in the PRACE application benchmark suite, not a full application but a set of 7 kernels which are representative of some of the most compute-intensive parts of lattice QCD calculations for different architectures.

The QCD benchmark suite consists of three main parts. part_cpu is based on 5 kernels taken from major QCD applications and community codes developed during DEISA and used since PRACE-2IP; they remain as legacy kernels, while performance tracking has shifted towards the newer computational kernels available in **part-2**. The applications contained in **part-1** and **part-2** are suitable for running benchmarks on HPC machines equipped with accelerators such as NVIDIA GPUs or Intel Xeon Phi processors.

In the following, the build instructions for the CPU part are described; see the respective subdirectories for descriptions of the other parts.

## Part CPU: Test cases

Each of the 5 kernels has one test case to be used for Tier-0 and Tier-1:

**Kernel A** is derived from BQCD (Berlin Quantum ChromoDynamics program), a hybrid Monte-Carlo code that simulates Quantum Chromodynamics with dynamical standard Wilson fermions. The computations take place on a four-dimensional regular grid with periodic boundary conditions. The kernel is a standard conjugate gradient solver with even/odd pre-conditioning. Lattice size is 32x32x64x64.

**Kernel B** is derived from SU3_AHiggs, a lattice quantum chromodynamics (QCD) code intended for computing the conditions of the Early Universe. Instead of "full QCD", the code applies an effective field theory, which is valid at high temperatures. In the effective theory, the lattice is 3D. Lattice size is 256x256x256.

**Kernel C** is based on benchmark kernels of the community package openQCD, used mainly by the CLS consortium. Lattice size is 8x8x8x8. Note that Kernel C can only be run in a weak scaling mode, where each CPU stores the same local lattice size, regardless of the number of CPUs. Ideal scaling for this kernel therefore corresponds to constant execution time, and performance per peak TFlop/s is simply the reciprocal of the execution time.

**Kernel D** is based on the benchmark kernel application of tmLQCD, the community package of the Extended Twisted Mass collaboration. It consists of the core matrix-vector multiplication routine for standard Wilson fermions. The lattice size is 64x64x64x64.

**Kernel E** consists of a full conjugate gradient solution using Wilson fermions, based on a MILC routine. The source code is also used in Part-1. The standard test case has a lattice size of 64x64x64x64.

### Building the QCD Benchmark CPU part in the JuBE Framework

The QCD benchmark Part CPU is integrated in the JuBE Benchmarking Environment (www.fz-juelich.de/jsc/jube). JuBE also includes all steps to build the application.

Unpack QCD_Source_TestCaseA.tar.gz into a directory of your choice.
After unpacking the Benchmark the following directory structure is available:

PABS/
...@@ -71,54 +40,23 @@ After unpacking the Benchmark the following directory structure is available:
skel/
LICENCE
The applications/ subdirectory contains the QCD benchmark applications.
The bench/ subdirectory contains the benchmark environment scripts.
The doc/ subdirectory contains the overall documentation of the framework and a tutorial.
The platform/ subdirectory holds the platform definitions as well as job submission script templates for each defined platform.
The skel/ subdirectory contains templates for analysis patterns for text output of different measurement tools.

##### Configuration

Definition files are already prepared for many platforms. If you are running on a defined platform, just skip this part and go forward to QCD_Run_README.txt ("Execution").

##### The platform

A platform is defined through a set of variables in the platform.xml file, which can be found in the platform/ directory. To create a new platform entry, copy an existing platform description and modify it to fit your local setup. The variables defined here will be used by the individual applications in the later process. Best practice for the platform nomenclature is: <vendor>-<system type>-<system name|site>. Additionally, you have to create a template batch submission script, which should be placed in a subdirectory of the platform/ directory with the same name as the platform itself. Although this nomenclature is not required by the benchmarking environment, it helps to keep track of your templates and minimises the amount of adaptation necessary for the individual application configurations.

##### The applications

Once a platform is defined, each individual application that should be used in the benchmark (in this case the QCD application) needs to be configured for this platform. In order to configure an individual application, copy an existing top-level configuration file (e.g. prace-scaling-juqueen.xml) to the file prace-<yourplatform>.xml. Then open an editor of your choice to adapt the file to your needs. Change the setting of the platform parameter to the name of your defined platform. The platform name can then be referenced throughout the benchmarking environment by the $platform variable. Do the same for compile.xml, execute.xml, analyse.xml, as sketched below. You can find a step-by-step tutorial in doc/JuBETutorial.pdf.
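For example, configuring the QCD application for a newly defined platform could look like this (a minimal sketch; "myplatform" stands for the platform name you defined in platform.xml):
```
cd PABS/applications/QCD
# copy an existing top-level configuration as a starting point
cp prace-scaling-juqueen.xml prace-myplatform.xml
# edit prace-myplatform.xml (and compile.xml, execute.xml, analyse.xml)
# so that the platform parameter matches the newly defined platform name
```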
The compilation is part of the run of the application. Please continue with QCD_Run_README.txt to finalize the build and to run the benchmark.
...@@ -11,62 +11,44 @@ The folder ./part_cpu contains following subfolders
skel/
LICENCE
The applications/ subdirectory contains the QCD benchmark applications.
The bench/ subdirectory contains the benchmark environment scripts.
The doc/ subdirectory contains the overall documentation of the framework and a tutorial.
The platform/ subdirectory holds the platform definitions as well as job submission script templates for each defined platform.
The skel/ subdirectory contains templates for analysis patterns for text output of different measurement tools.
Configuration
=============
Definition files are already prepared for many platforms. If you are running on a defined platform just go forward, otherwise please have a look at QCD_Build_README.txt.
Execution
=========
Assuming the Benchmark Suite is installed in a directory that can be used during execution, a typical run of a benchmark application will contain two steps.
1. Compiling and submitting the benchmark to the system scheduler.
2. Verifying, analysing and reporting the performance data.
Compiling and submitting
------------------------
If configured correctly, the application benchmark can be compiled and submitted on the system (e.g. the IBM BlueGene/Q at Jülich) with the commands:
>> cd PABS/applications/QCD
>> perl ../../bench/jube prace-scaling-juqueen.xml
The benchmarking environment will then compile the binary for all node/task/thread combinations defined, if those parameters need to be compiled into the binary. It creates a so-called sandbox subdirectory for each job, ensuring conflict-free operation of the individual applications at runtime. If any input files are needed, those are prepared automatically as defined.
Each active benchmark in the application’s top-level configuration file will receive an ID, which is used as a reference by JUBE later on.
Verifying, analysing and reporting
----------------------------------
After the benchmark jobs have run, an additional call to jube will gather the performance data. For this, the options -update and -result are used.
>> cd DEISA_BENCH/application/QCD
>> perl ../../bench/jube -update -result <ID>
The ID is the reference number the benchmarking environment has assigned to this run. The performance data will then be output to stdout, and can be post-processed from there.
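Putting the two steps together, a complete benchmark cycle looks roughly as follows (a sketch only; it assumes the PABS directory layout described above, a hypothetical platform configuration file and results file name; <ID> is the reference number assigned to your run):
```
cd PABS/applications/QCD
perl ../../bench/jube prace-myplatform.xml                       # compile and submit the jobs
# ... wait until the scheduled jobs have finished ...
perl ../../bench/jube -update -result <ID> > qcd_results.txt    # gather the performance data
```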
PRACE QCD Accelerator Benchmark 1
=================================
This benchmark is part of the QCD section of the Accelerator Benchmarks Suite developed as part of a PRACE EU funded project (http://www.prace-ri.eu).
The suite is derived from the Unified European Applications Benchmark Suite (UEABS) http://www.prace-ri.eu/ueabs/
This specific component is a direct port of "QCD kernel E" from the UEABS, which is based on the MILC code suite (http://www.physics.utah.edu/~detar/milc/). The performance-portable targetDP model has been used to allow the benchmark to utilise NVIDIA GPUs, Intel Xeon Phi manycore CPUs and traditional multi-core CPUs. The use of MPI (in conjunction with targetDP) allows multiple nodes to be used in parallel.
For full details of this benchmark, and for results on NVIDIA GPU and Intel Knights Corner Xeon Phi architectures (in addition to regular CPUs), please see:
**********************************************************************
Gray, Alan, and Kevin Stratford. "A lightweight approach to
...@@ -30,42 +20,42 @@ available at https://arxiv.org/abs/1609.01479
To Build
--------
Choose a configuration file from the "config" directory that best matches your platform, and copy it to "config.mk" in this (the top-level) directory. Then edit this file, if necessary, to properly set the compilers and paths on your system.
Note that if you are building for a GPU system, and the TARGETCC variable in the configuration file is set to the NVIDIA compiler nvcc, then the build process will automatically build the GPU version. Otherwise, the threaded CPU version will be built, which can run on Xeon Phi manycore CPUs or regular multi-core CPUs.
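For instance, a GPU build would contain a line like the following in config.mk (an illustrative fragment only; the remaining variables depend on the configuration file you started from):
```
# config.mk fragment: selecting nvcc as the target compiler triggers the GPU
# build; a standard C/C++ compiler gives the threaded CPU version instead.
TARGETCC = nvcc
```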
Then, build the targetDP performance-portable library:
```
cd targetDP
make clean
make
cd ..
```
And finally build the benchmark code:
```
cd src
make clean
make
cd ..
```
To Validate
-----------
After building, an executable "bench" will exist in the src directory. To run the default validation (64x64x64x8, 1 iteration) case:
```
cd src
./bench
```
The code will automatically self-validate by comparing with the appropriate output reference file for this case, which exists in output_ref, and will print to stdout, e.g.
Validating against output_ref/kernel_E.output.nx64ny64nz64nt8.i1.t1:
VALIDATION PASSED
...@@ -83,16 +73,21 @@ To Run Different Cases
You can edit the input file
```
src/kernel_E.input
```
if you want to deviate from the default system size, number of iterations and/or run using more than 1 MPI task. E.g. replacing
```
totnodes 1 1 1 1
```
with
```
totnodes 2 1 1 1
```
will run with 2 MPI tasks rather than 1, where the domain is decomposed in the "X" direction.
...@@ -108,18 +103,13 @@ The "run" directory contains an example script which
- runs the code (which will automatically validate if an appropriate output reference file exists)
So, in the run directory, you should copy "run_example.sh" to run.sh, which you can customise for your system.
Known Issues
------------
The quantity used for validation (see congrad.C) becomes very small after a few iterations. Therefore, only a small number of iterations should be used for validation. This is not an issue specific to this port of the benchmark, but is also true of the original version (see above), with which this version is designed to be consistent.
Performance Results for Reference
...@@ -3,25 +3,16 @@ targetDP
Copyright 2015 The University of Edinburgh
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
About
----------
targetDP (target Data Parallel) is a lightweight programming abstraction layer, designed to allow the same application source code to be able to target multiple architectures, e.g. NVIDIA GPUs and multicore/manycore CPUs, in a performance portable manner.
See:
...@@ -46,10 +37,11 @@ Compiling the targetDP libraries
to create the library libtarget.a:
Edit the Makefile to set CC to the desired compiler (and CFLAGS to the desired compiler flags). Then type
```
make
```
If CC is a standard C/C++ compiler, then the CPU version of targetDP.a will be created. If CC=nvcc, then the GPU version will be created.
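Alternatively, with GNU make the compiler can usually be overridden on the command line instead of editing the Makefile (a minimal sketch):
```
# build the GPU version of the targetDP library without editing the Makefile
make clean
make CC=nvcc
```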
...@@ -64,8 +56,6 @@ See targetDPspec.pdf in the doc directory, and the example (below).
Example
----------
See simpleExample.c (which has compilation instructions for CPU and GPU near the top of the file).
# README - QCD UEABS Part 2
**2017 - Jacob Finkenrath - CaSToRC - The Cyprus Institute (j.finkenrath@cyi.ac.cy)**
The QCD Accelerator Benchmark suite Part 2 consists of two kernels, based on the QUDA [1] and the QPhiX [2] libraries. The QUDA library is based on CUDA and optimized for running on NVIDIA GPUs (https://lattice.github.io/quda/). The QPhiX library consists of routines which are optimized to use Intel intrinsic functions of multiple vector lengths, including AVX512, with optimized routines for KNC and KNL (http://jeffersonlab.github.io/qphix/). The benchmark kernel uses the Conjugate Gradient benchmark functions provided by the libraries.

[1] R. Babich, M. Clark and B. Joo, "Parallelizing the QUDA Library for Multi-GPU Calculations in Lattice Quantum Chromodynamics", SC 10 (Supercomputing 2010)
[2] B. Joo, D. D. Kalamkar, K. Vaidyanathan, M. Smelyanskiy, K. Pamnany, V. W. Lee, P. Dubey, W. Watson III, "Lattice QCD on Intel Xeon Phi", International Supercomputing Conference (ISC'13), 2013
## Table of Contents
[TOC]

GPU - BENCHMARK SUITE (QUDA)
```
1. Compile and Run the GPU-Benchmark Suite
1.1. Compile
1.2. Run
1.2.1. Main-script: "run_ana.sh"
1.2.2. Main-script: "prepare_submit_job.sh"
1.2.3. Main-script: "submit_job.sh.template"
1.3. Example Benchmark results