Commit 20dda510 authored by Jacob Finkenrath

Impr readme - first Version

parent 04a109f2
......@@ -236,52 +236,12 @@ Accelerator-based implementations have been implemented for EXDIG, using off-loa
# QCD <a name="qcd"></a>
The QCD benchmark is, unlike the other benchmarks in the PRACE application benchmark suite,
not a full application but a set of 3 parts which are representative of some of the most compute-intensive parts of QCD calculations.
Part 1:
The QCD Accelerator Benchmark suite Part 1 is a direct port of "QCD kernel E" from the
CPU part, which is based on the MILC code suite
(http://www.physics.utah.edu/~detar/milc/). The performance-portable
targetDP model has been used to allow the benchmark to utilise NVIDIA
GPUs, Intel Xeon Phi manycore CPUs and traditional multi-core
CPUs. The use of MPI (in conjunction with targetDP) allows multiple
nodes to be used in parallel.
Part 2:
The QCD Accelerator Benchmark suite Part 2 consists of two kernels,
based on the QUDA and the QPhiX libraries. The QUDA library is based on CUDA and optimized for running on NVIDIA GPUs (https://lattice.github.io/quda/).
The QPhiX library consists of routines that are optimized to use Intel intrinsic functions for multiple vector lengths, including optimized routines
for KNC and KNL (http://jeffersonlab.github.io/qphix/).
The benchmark kernels use the Conjugate Gradient benchmark functions provided by the libraries.
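All three parts exercise a Conjugate Gradient solver. As a point of reference, a minimal generic CG iteration can be sketched as follows; this is a plain NumPy sketch for a symmetric positive-definite system, not the lattice Dirac solver the libraries provide, and all names here are illustrative:

```python
import numpy as np

# Generic conjugate-gradient sketch for a symmetric positive-definite
# system Ax = b. NOT the benchmark code: the suite inverts the
# (even/odd preconditioned) Wilson Dirac operator, but the iteration
# structure is the same.
def cg(A, b, tol=1e-10, max_iter=1000):
    x = np.zeros_like(b)
    r = b - A @ x                     # initial residual
    p = r.copy()                      # initial search direction
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs / (p @ Ap)         # step length
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:     # converged on the residual norm
            break
        p = r + (rs_new / rs) * p     # new conjugate direction
        rs = rs_new
    return x
```

In the benchmarks, the operator is applied as a stencil sweep over the lattice rather than as an explicit matrix, which is why the kernels are memory-bandwidth bound.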
Part CPU:
The CPU part of QCD benchmark is not a full application but a set of 5 kernels which are
representative of some of the most compute-intensive parts of QCD calculations.
Each of the 5 kernels has one test case:
Kernel A is derived from BQCD (Berlin Quantum ChromoDynamics program), a hybrid Monte-Carlo code which simulates Quantum Chromodynamics with dynamical standard Wilson fermions. The computations take place on a four-dimensional regular grid with periodic boundary conditions. The kernel is a standard conjugate gradient solver with even/odd pre-conditioning. The default lattice size is 16x16x16x16 for the small test case and 32x32x64x64 for the medium test case.
Kernel B is derived from SU3_AHiggs, a lattice quantum chromodynamics (QCD) code intended for computing the conditions of the Early Universe. Instead of “full QCD”,
the code applies an effective field theory, which is valid at high temperatures. In the effective theory, the lattice is 3D. The default lattice size is 64x64x64 for the small test case
and 256x256x256 for the medium test case.
Kernel C is based on the software package openQCD. Kernel C is built to run in a weak scaling mode, where each CPU stores the same local lattice size, regardless of the number of CPUs. Ideal scaling for this kernel therefore corresponds to constant execution time, and performance is simply the reciprocal of the execution time. The local lattice size is 8x8x8x8.
Kernel D consists of the core matrix-vector multiplication routine for standard Wilson fermions based on the software package tmLQCD.
The default lattice size is 16x16x16x16 for the small test case and 64x64x64x64 for the medium test case.
Kernel E consists of a full conjugate gradient solution using
Wilson fermions. The default lattice size is 16x16x16x16 for the
small test case and 64x64x64x32 for the medium test case.
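For Kernel C's weak-scaling mode, the metric described above (performance as the reciprocal of execution time, ideal scaling as constant time) can be sketched as follows; the function name and input format are illustrative, not part of the benchmark:

```python
# Illustrative only: execution times (seconds) per CPU count for a
# weak-scaling run, as used for Kernel C.
def weak_scaling_report(times_by_ncpu):
    """Map {ncpus: execution time} -> {ncpus: (performance, efficiency)}."""
    base = min(times_by_ncpu)                    # smallest CPU count is the baseline
    return {
        n: (1.0 / t, times_by_ncpu[base] / t)    # perf = 1/t; ideal efficiency = 1.0
        for n, t in sorted(times_by_ncpu.items())
    }

report = weak_scaling_report({1: 10.0, 8: 12.5, 64: 12.5})
# e.g. report[1] == (0.1, 1.0): constant time would keep efficiency at 1.0
```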
- Code download: https://repository.prace-ri.eu/ueabs/QCD/1.3/QCD_Source_TestCaseA.tar.gz
- Build instructions: https://repository.prace-ri.eu/git/UEABS/ueabs/blob/r1.3/qcd/QCD_Build_README.txt
- Test Case A: included with source download
- Run instructions: https://repository.prace-ri.eu/git/UEABS/ueabs/blob/r1.3/qcd/QCD_Run_README.txt
| **General information** | **Scientific field** | **Language** | **MPI** | **OpenMP** | **GPU** | **LoC** | **Code description** |
|------------------|----------------------|--------------|---------|------------|---------------------|---------|-------------------------------------------------------------------------------------------------------------------------------------------------------|
| <br>[- Bench](https://repository.prace-ri.eu/git/UEABS/ueabs/-/tree/r2.2-dev/qcd/part_1) <br>[- Summary](https://repository.prace-ri.eu/git/UEABS/ueabs/-/blob/r2.2-dev/qcd/part_1/README) | lattice QuantumChromodynamics - Kernel 1 | C | yes | yes | yes (CUDA) | -- | Based on UEABS kernel E, CG solver using the Wilson Dirac operator. The targetDP model has been used to allow the benchmark to utilise NVIDIA GPUs, Intel Xeon Phi manycore CPUs and traditional multi-core CPUs. [Test case A - 8x64x64x64] Small problem size suitable to run on one KNC. Conjugate Gradient solver involving Wilson Dirac stencil. Domain Decomposition, Memory bandwidth, strong scaling, MPI latency. |
| <br>[- Source](https://lattice.github.io/quda/) <br>[- Bench](https://repository.prace-ri.eu/git/UEABS/ueabs/-/tree/r2.2-dev/qcd/part_2) <br>[- Summary](https://repository.prace-ri.eu/git/UEABS/ueabs/-/blob/r2.2-dev/qcd/part_2/README) | lattice QuantumChromodynamics - Kernel 2 - QUDA | C++ | yes | yes | yes (CUDA) | -- | The QUDA library is based on CUDA and optimized for running on NVIDIA GPUs. [Test case A - 96x32x32x32] Small problem size. Conjugate Gradient solver involving Wilson Dirac stencil. Domain Decomposition, Memory bandwidth, strong scaling, MPI latency. [Test case B - 126x64x64x64] Moderate problem size. Conjugate Gradient solver involving Wilson Dirac stencil. Bandwidth bounded |
| <br>[- Source](http://jeffersonlab.github.io/qphix/) <br>[- Bench](https://repository.prace-ri.eu/git/UEABS/ueabs/-/tree/r2.2-dev/qcd/part_2) <br>[- Summary](https://repository.prace-ri.eu/git/UEABS/ueabs/-/blob/r2.2-dev/qcd/part_2/README) | lattice QuantumChromodynamics - Kernel 2 - QPHIX | C++ | yes | yes | no | -- | The QPhiX library consists of routines that are optimized to use Intel intrinsic functions for multiple vector lengths, including optimized routines for KNC and KNL. [Test case A - 96x32x32x32] Small problem size. Conjugate Gradient solver involving Wilson Dirac stencil. Domain Decomposition, Memory bandwidth, strong scaling, MPI latency. [Test case B - 126x64x64x64] Moderate problem size. Conjugate Gradient solver involving Wilson Dirac stencil. Bandwidth bounded |
| <br>[- Source](https://repository.prace-ri.eu/ueabs/QCD/1.3/QCD_Source_TestCaseA.tar.gz) <br>[- Bench](https://repository.prace-ri.eu/git/UEABS/ueabs/-/tree/r2.2-dev/qcd/part_cpu) <br>[- Summary](https://repository.prace-ri.eu/git/UEABS/ueabs/-/blob/r2.2-dev/qcd/part_cpu/README) | lattice QuantumChromodynamics - legacy UEABS Kernels | C/Fortran | yes | yes/no | No | -- | Legacy UEABS QCD benchmark kernels are based on 5 different Benchmark applications which are used by the European Lattice QCD community (see doc for more details). |
# Quantum Espresso <a name="espresso"></a>
......
# QCD - Overview
The QCD benchmark is, unlike the other benchmarks in the PRACE application benchmark suite, not a full application but a set of 3 parts which are representative of some of the most compute-intensive parts of QCD calculations.
## Part 1:
The QCD Accelerator Benchmark suite Part 1 is a direct port of "QCD kernel E" from the CPU part, which is based on the MILC code suite (http://www.physics.utah.edu/~detar/milc/). The performance-portable targetDP model has been used to allow the benchmark to utilise NVIDIA GPUs, Intel Xeon Phi manycore CPUs and traditional multi-core CPUs. The use of MPI (in conjunction with targetDP) allows multiple nodes to be used in parallel.
## Part 2:
The QCD Accelerator Benchmark suite Part 2 consists of two kernels, based on the QUDA and the QPhiX libraries. The QUDA library is based on CUDA and optimized for running on NVIDIA GPUs (https://lattice.github.io/quda/). The QPhiX library consists of routines that are optimized to use Intel intrinsic functions for multiple vector lengths, including optimized routines for KNC and KNL (http://jeffersonlab.github.io/qphix/). The benchmark kernels use the Conjugate Gradient benchmark functions provided by the libraries.
## Part CPU:
The CPU part of QCD benchmark is not a full application but a set of 5 kernels which are
representative of some of the most compute-intensive parts of QCD calculations.
Each of the 5 kernels has one test case:
Kernel A is derived from BQCD (Berlin Quantum ChromoDynamics program), a hybrid Monte-Carlo code which simulates Quantum Chromodynamics with dynamical standard Wilson fermions. The computations take place on a four-dimensional regular grid with periodic boundary conditions. The kernel is a standard conjugate gradient solver with even/odd pre-conditioning. The default lattice size is 16x16x16x16 for the small test case and 32x32x64x64 for the medium test case.
Kernel B is derived from SU3_AHiggs, a lattice quantum chromodynamics (QCD) code intended for computing the conditions of the Early Universe. Instead of “full QCD”, the code applies an effective field theory, which is valid at high temperatures. In the effective theory, the lattice is 3D. The default lattice size is 64x64x64 for the small test case and 256x256x256 for the medium test case.
Kernel C is based on the software package openQCD. Kernel C is built to run in a weak scaling mode, where each CPU stores the same local lattice size, regardless of the number of CPUs. Ideal scaling for this kernel therefore corresponds to constant execution time, and performance is simply the reciprocal of the execution time. The local lattice size is 8x8x8x8.
Kernel D consists of the core matrix-vector multiplication routine for standard Wilson fermions based on the software package tmLQCD. The default lattice size is 16x16x16x16 for the small test case and 64x64x64x64 for the medium test case.
Kernel E consists of a full conjugate gradient solution using Wilson fermions. The default lattice size is 16x16x16x16 for the small test case and 64x64x64x32 for the medium test case.
- Code download: https://repository.prace-ri.eu/ueabs/QCD/1.3/QCD_Source_TestCaseA.tar.gz
- Build instructions: https://repository.prace-ri.eu/git/UEABS/ueabs/blob/r1.3/qcd/QCD_Build_README.txt
- Test Case A: included with source download
- Run instructions: https://repository.prace-ri.eu/git/UEABS/ueabs/blob/r1.3/qcd/QCD_Run_README.txt
......@@ -3,64 +3,33 @@ Description and Building of the QCD Benchmark
Description
===========
The QCD benchmark is, unlike the other benchmarks in the PRACE application benchmark suite, not a full application but a set of 7 kernels, which are representative of some of the most compute-intensive parts of lattice QCD calculations for different architectures.
The QCD benchmark suite consists of three main parts. part_cpu is based on 5 kernels from major QCD applications and codes developed during DEISA and used since PRACE-2IP. They remain as legacy kernels, while performance tracking has shifted towards the newer computational kernels available in **part-2**. The applications contained in **part-1** and **part-2** are suitable for running benchmarks on HPC machines equipped with accelerators like NVIDIA GPUs or Intel Xeon Phi processors.
In the following, the build instructions for the CPU part are described; see the subdirectories for descriptions of the other parts.
## Part CPU: Test cases
Each of the 5 kernels has one test case to be used for Tier-0 and Tier-1:
**Kernel A** is derived from BQCD (Berlin Quantum ChromoDynamics program), a hybrid Monte-Carlo code that simulates Quantum Chromodynamics with dynamical standard Wilson fermions. The computations take place on a four-dimensional regular grid with periodic boundary conditions. The kernel is a standard conjugate gradient solver with even/odd pre-conditioning. Lattice size is 32x32x64x64.
**Kernel B** is derived from SU3_AHiggs, a lattice quantum chromodynamics (QCD) code intended for computing the conditions of the Early Universe. Instead of "full QCD", the code applies an effective field theory, which is valid at high temperatures. In the effective theory, the lattice is 3D. Lattice size is 256x256x256.
**Kernel C** is based on benchmark kernels of the Community package openQCD, used mainly by the CLS consortium. Lattice size is 8x8x8x8. Note that Kernel C can only be run in a weak scaling mode, where each CPU stores the same local lattice size, regardless of the number of CPUs. Ideal scaling for this kernel therefore corresponds to constant execution time, and performance per peak TFlop/s is simply the reciprocal of the execution time.
**Kernel D** is based on the benchmark kernel application of tmLQCD, the community package of the Extended Twisted Mass collaboration. It consists of the core matrix-vector multiplication routine for standard Wilson fermions. The lattice size is 64x64x64x64.
**Kernel E** consists of a full conjugate gradient solution using Wilson fermions, based on a MILC routine. The source code is also used in Part-1. The standard test case has a lattice size of 64x64x64x64.
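The Wilson-type kernels above are dominated by 4D nearest-neighbour stencil sweeps over a periodic lattice. A deliberately simplified sketch of that access pattern (a scalar field with unit couplings, i.e. no gauge links, spinor indices, or Dirac structure):

```python
import numpy as np

# Simplified sketch of the 4D periodic nearest-neighbour access pattern
# behind the Wilson-type kernels: a scalar field with unit couplings,
# not the actual Dirac operator.
def stencil_4d(phi):
    out = -8.0 * phi                  # diagonal term (2 neighbours per dimension)
    for axis in range(4):
        # np.roll implements the periodic boundary conditions
        out += np.roll(phi, 1, axis) + np.roll(phi, -1, axis)
    return out

lat = np.zeros((8, 8, 8, 8))
lat[0, 0, 0, 0] = 1.0                 # a single point source
res = stencil_4d(lat)
# the source site receives -8.0 and each of its 8 neighbours +1.0
```

Each lattice site touches 8 neighbours and is touched 8 times, which is why these kernels are memory-bandwidth bound rather than compute bound.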
### Building the QCD Benchmark CPU PART in the JuBE Framework
The CPU part of the QCD benchmark is integrated in the JuBE Benchmarking Environment (www.fz-juelich.de/jsc/jube). JuBE also includes all steps to build the application.
Unpack the QCD_Source_TestCaseA.tar.gz into a directory of your choice.
After unpacking the Benchmark the following directory structure is available:
PABS/
......@@ -71,54 +40,23 @@ After unpacking the Benchmark the following directory structure is available:
skel/
LICENCE
The applications/ subdirectory contains the QCD benchmark applications.
The bench/ subdirectory contains the benchmark environment scripts.
The doc/ subdirectory contains the overall documentation of the framework and a tutorial.
The platform/ subdirectory holds the platform definitions as well as job submission script templates for each defined platform.
The skel/ subdirectory contains templates for analysis patterns for text output of different measurement tools.
##### Configuration
Definition files are already prepared for many platforms. If you are running on a defined platform just skip this part and go forward to QCD_Run_README.txt ("Execution").
##### The platform
A platform is defined through a set of variables in the platform.xml file, which can be found in the platform/ directory. To create a new platform entry, copy an existing platform description and modify it to fit your local setup. The variables defined here will be used by the individual applications in the later process. Best practice for the platform nomenclature would be: <vendor>-<system type>-<system name|site>. Additionally, you have to create a template batch submission script, which should be placed in a subdirectory of the platform/ directory with the same name as the platform itself. Although this nomenclature is not required by the benchmarking environment, it helps you keep track of your templates and minimises the amount of adaptation necessary for the individual application configurations.
##### The applications
Once a platform is defined, each individual application that should be used in the benchmark (in this case the QCD application) needs to be configured for this platform. In order to configure an individual application, copy an existing top-level configuration file (e.g. prace-scaling-juqueen.xml) to the file prace-<yourplatform>.xml. Then open an editor of your choice, to adapt the file to your needs. Change the settings of the platform parameter to the name of your defined platform. The platform name can then be referenced throughout the benchmarking environment by the $platform variable.
Do the same for compile.xml, execute.xml, analyse.xml. You can find a step by step tutorial also in doc/JuBETutorial.pdf.
The compilation is part of the run of the application. Please continue with the QCD_Run_README.txt to finalize the build and to run the benchmark.
\ No newline at end of file
......@@ -11,62 +11,44 @@ The folder ./part_cpu contains following subfolders
skel/
LICENCE
The applications/ subdirectory contains the QCD benchmark applications.
The bench/ subdirectory contains the benchmark environment scripts.
The doc/ subdirectory contains the overall documentation of the framework and a tutorial.
The platform/ subdirectory holds the platform definitions as well as job submission script templates for each defined platform.
The skel/ subdirectory contains templates for analysis patterns for text output of different measurement tools.
Configuration
=============
Definition files are already prepared for many platforms. If you are running on a defined platform just go forward, otherwise please have a look at QCD_Build_README.txt.
Execution
=========
Assuming the Benchmark Suite is installed in a directory that can be used during execution, a typical run of a benchmark application will
contain two steps.
1. Compiling and submitting the benchmark to the system scheduler.
2. Verifying, analysing and reporting the performance data.
Compiling and submitting
------------------------
If configured correctly, the application benchmark can be compiled and submitted on the system (e.g. the IBM BlueGene/Q at Jülich) with the commands:
>> cd PABS/applications/QCD
>> perl ../../bench/jube prace-scaling-juqueen.xml
The benchmarking environment will then compile the binary for all node/task/thread combinations defined, if those parameters need to be compiled into the binary. It creates a so-called sandbox subdirectory for each job, ensuring conflict free operation of the individual applications at runtime. If any input files are needed, those are prepared automatically as defined.
Each active benchmark in the application’s top-level configuration file will receive an ID, which is used as a reference by JUBE later on.
Verifying, analysing and reporting
----------------------------------
After the benchmark jobs have run, an additional call to jube will gather the performance data. For this, the options -update and -result are used.
>> cd DEISA_BENCH/application/QCD
>> perl ../../bench/jube -update -result <ID>
The ID is the reference number the benchmarking environment has assigned to this run. The performance data will then be output to stdout, and can be post-processed from there.
PRACE QCD Accelerator Benchmark 1
=================================
This benchmark is part of the QCD section of the Accelerator Benchmarks Suite developed as part of a PRACE EU funded project
(http://www.prace-ri.eu).
The suite is derived from the Unified European Applications Benchmark Suite (UEABS) http://www.prace-ri.eu/ueabs/
This specific component is a direct port of "QCD kernel E" from the UEABS, which is based on the MILC code suite (http://www.physics.utah.edu/~detar/milc/). The performance-portable targetDP model has been used to allow the benchmark to utilise NVIDIA GPUs, Intel Xeon Phi manycore CPUs and traditional multi-core CPUs. The use of MPI (in conjunction with targetDP) allows multiple nodes to be used in parallel.
For full details of this benchmark, and for results on NVIDIA GPU and Intel Knights Corner Xeon Phi architectures (in addition to regular CPUs), please see:
**********************************************************************
Gray, Alan, and Kevin Stratford. "A lightweight approach to
......@@ -30,42 +20,42 @@ available at https://arxiv.org/abs/1609.01479
To Build
--------
Choose a configuration file from the "config" directory that best matches your platform, and copy to "config.mk" in this (the top-level) directory. Then edit this file, if necessary, to properly set the compilers and paths on your system.
Note that if you are building for a GPU system, and the TARGETCC variable in the configuration file is set to the NVIDIA compiler nvcc, then the build process will automatically build the GPU version. Otherwise, the threaded CPU version will be built which can run on Xeon Phi manycore CPUs or regular multi-core CPUs.
Then, build the targetDP performance-portable library:
```
cd targetDP
make clean
make
cd ..
```
And finally build the benchmark code
```
cd src
make clean
make
cd ..
```
To Validate
-----------
After building, an executable "bench" will exist in the src directory. To run the default validation (64x64x64x8, 1 iteration) case:
```
cd src
./bench
```
The code will automatically self-validate by comparing with the appropriate output reference file for this case which exists in output_ref, and will print to stdout, e.g.
Validating against output_ref/kernel_E.output.nx64ny64nz64nt8.i1.t1:
VALIDATION PASSED
......@@ -83,16 +73,21 @@ To Run Different Cases
You can edit the input file
```
src/kernel_E.input
```
if you want to deviate from the default system size, number of iterations and/or run using more than 1 MPI task. E.g. replacing
```
totnodes 1 1 1 1
```
with
```
totnodes 2 1 1 1
```
will run with 2 MPI tasks rather than 1, where the domain is decomposed in
the "X" direction.
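Assuming the four `totnodes` values give the MPI task grid along the four lattice dimensions (as the "X" decomposition example above suggests) and the global lattice is divided evenly, the resulting task count and per-task local lattice can be sketched as:

```python
from math import prod

# Sketch only: assumes the four totnodes values form the MPI task grid
# along the four lattice dimensions and that the global lattice divides
# evenly among the tasks.
def decompose(global_lattice, totnodes):
    ntasks = prod(totnodes)                              # total MPI tasks
    local = [g // n for g, n in zip(global_lattice, totnodes)]
    return ntasks, local

# Default validation lattice (64x64x64x8), decomposed in X as above:
ntasks, local = decompose((64, 64, 64, 8), (2, 1, 1, 1))
# ntasks == 2, local == [32, 64, 64, 8]
```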
......@@ -108,18 +103,13 @@ The "run" directory contains an example script which
- runs the code (which will automatically validate if an
appropriate output reference file exists)
So, in the run directory, you should copy "run_example.sh" to run.sh, which you can customise for your system.
Known Issues
------------
The quantity used for validation (see congrad.C) becomes very small after a few iterations. Therefore, only a small number of iterations should be used for validation. This is not an issue specific to this port of the benchmark, but is also true of the original version (see above), with which this version is designed to be consistent.
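The effect can be illustrated with a toy example (hypothetical numbers, not benchmark output): once the validated quantity decays to the size of a fixed rounding noise, a relative-tolerance comparison can no longer pass.

```python
import math

# Toy illustration (hypothetical numbers, not benchmark output): a fixed
# absolute rounding noise of ~1e-12 against a reference quantity that
# decays by ~1e-3 per iteration.
def passes(value, reference, rel_tol=1e-4, abs_tol=1e-30):
    return math.isclose(value, reference, rel_tol=rel_tol, abs_tol=abs_tol)

reference, noise = 1e-3, 1e-12
results = []
for iteration in range(1, 5):
    reference *= 1e-3                 # the validated quantity keeps shrinking
    results.append(passes(reference + noise, reference))
# results == [True, False, False, False]: only early iterations validate
```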
Performance Results for Reference
......
......@@ -3,25 +3,16 @@ targetDP
Copyright 2015 The University of Edinburgh
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
About
----------
targetDP (target Data Parallel) is a lightweight programming abstraction layer, designed to allow the same application source code to be able to target multiple architectures, e.g. NVIDIA GPUs and multicore/manycore CPUs, in a performance portable manner.
See:
Compiling the targetDP libraries
to create the library libtarget.a:
Edit the Makefile to set CC to the desired compiler (and CFLAGS to the desired compiler flags). Then type
```
make
```
If CC is a standard C/C++ compiler, then the CPU version of targetDP.a
will be created. If CC=nvcc, then the GPU version will be created.
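As a sketch of the two build variants described above (compiler names and flags are illustrative; the README only prescribes editing CC in the Makefile, but make command-line variables override Makefile assignments):

```shell
# CPU version: a standard C/C++ compiler produces the CPU build of libtarget.a
make CC=gcc CFLAGS="-O2"

# GPU version: setting CC=nvcc produces the GPU build instead
make clean
make CC=nvcc
```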
See targetDPspec.pdf in the doc directory, and the example (below).
Example
----------
See simpleExample.c (which has compilation instructions for CPU and GPU near the top of the file).
# README - QCD UEABS Part 2
**2017 - Jacob Finkenrath - CaSToRC - The Cyprus Institute (j.finkenrath@cyi.ac.cy)**
The QCD Accelerator Benchmark suite Part 2 consists of two kernels, the QUDA [1] and the QPhiX [2] libraries. The QUDA library is based on CUDA and optimized for running on NVIDIA GPUs (https://lattice.github.io/quda/). The QPhiX library consists of routines optimized to use Intel intrinsic functions of multiple vector lengths, including AVX512, with optimized routines for KNC and KNL (http://jeffersonlab.github.io/qphix/). The benchmark kernels use the Conjugate Gradient benchmark functions provided by the libraries.

[1] R. Babbich, M. Clark and B. Joo, "Parallelizing the QUDA Library for Multi-GPU Calculations in Lattice Quantum Chromodynamics", SC 10 (Supercomputing 2010)

[2] B. Joo, D. D. Kalamkar, K. Vaidyanathan, M. Smelyanskiy, K. Pamnany, V. W. Lee, P. Dubey, W. Watson III, "Lattice QCD on Intel Xeon Phi", International Supercomputing Conference (ISC'13), 2013
## Table of Contents
[TOC]
## GPU - Kernel
### 1. Compile and Run the GPU-Benchmark Suite
#### 1.1 Compile
......@@ -54,9 +30,10 @@ Here we just give a short overview:
Build Cmake: (./QCD_Accelerator_Benchmarksuite_Part2/GPUs/src/cmake-3.7.0.tar.gz)
Cmake can be downloaded from the source with the URL: https://cmake.org/download/
In this guide the version cmake-3.7.0 is used. The build instructions can be found in the main directory under README.rst. Run `./configure`, then run `gmake`.
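The CMake build steps described above amount to (directory name assumed from the tarball):

```shell
# Unpack the bundled CMake sources and build them
tar xzf cmake-3.7.0.tar.gz
cd cmake-3.7.0
./configure
gmake
```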
Build Quda: (./QCD_Accelerator_Benchmarksuite_Part2/GPUs/src/quda.tar.gz)
Now in the folder /test one can find the needed Quda executable "invert_".
#### 1.2 Run
The Accelerator QCD-Benchmarksuite Part 2 provides bash scripts, located in the folder `./QCD_Accelerator_Benchmarksuite_Part2/GPUs/scripts`, to set up the benchmark runs on the target machines. These bash scripts are:
- `run_ana.sh` : Main-script, set up the benchmark mode and submit the jobs (analyse the results)
- `prepare_submit_job.sh` : Generate the job-scripts
##### 1.2.1 Main-script: "run_ana.sh"
The path to the executable has to be set via `$PATH2EXE`. QUDA automatically tunes the GPU kernels; the optimal setup is saved in the folder declared by the variable `QUDA_RESOURCE_PATH`, so set it to the folder where the tuning data should be stored. Different scaling modes, from strong scaling to weak scaling, can be chosen via the variable `sca_mode` (="Strong" or ="Weak"). The lattice sizes can be set via `gx` and `gt`. Choose `mode="Run"` for the run mode and `mode="Analysis"` for extracting the GFLOPS. Note that submission is done here with `sbatch`; match this to the queueing system of your target machine.
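A hypothetical excerpt showing how these variables might be set inside `run_ana.sh` (variable names follow the script; all values are illustrative):

```shell
PATH2EXE=$HOME/quda/tests                  # path to the QUDA benchmark executable
export QUDA_RESOURCE_PATH=$HOME/quda_tune  # QUDA writes its kernel-tuning data here
sca_mode="Strong"                          # "Strong" or "Weak" scaling
gx=32                                      # global spatial lattice extent
gt=96                                      # global temporal lattice extent
mode="Run"                                 # "Run" to submit jobs, "Analysis" to extract GFLOPS
```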
##### 1.2.2 Main-script: "prepare_submit_job.sh"
Add additional options if necessary.
##### 1.2.3 Main-script: "submit_job.sh.template"
The submit template is edited by `prepare_submit_job.sh` to generate the final submit script. The header should be matched to the queueing system of the target machine.
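For instance, a SLURM-style header in `submit_job.sh.template` could look like the following (all directives illustrative; adapt them to your machine):

```shell
#!/bin/bash
#SBATCH --job-name=qcd_part2   # job name shown in the queue
#SBATCH --nodes=1              # replaced per run by prepare_submit_job.sh
#SBATCH --time=00:30:00        # wall-time limit
# ... benchmark invocation filled in by prepare_submit_job.sh ...
```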
#### 1.3 Example Benchmark results
Shown here are benchmark results on Piz Daint, located at CSCS in Switzerland, and on the GPGPU partition of Cartesius at SURFsara in Amsterdam, the Netherlands. The runs were performed using the provided bash scripts. Piz Daint has one Pascal GPU per node; two test cases are shown, a "Strong-Scaling" mode with a random lattice configuration of size 32x32x32x96 and a "Weak-Scaling" mode with a configuration of local lattice size 48x48x48x24. The GPGPU nodes of Cartesius have two Kepler GPUs per node; the "Strong-Scaling" test is shown for one card per node and for two cards per node. The benchmarks use the Conjugate Gradient solver, which solves the linear equation D * x = b for the unknown solution "x", with "D" the clover-improved Wilson Dirac operator and "b" a known right-hand side.
```
---------------------
GPUs    GFLOPS        sec
64 2645.590000 1.480000
```
## x86 Kernel
### 2. Compile and Run the x86-Part
Unpack the provided source tar file located in `./QCD_Accelerator_Benchmarksuite_Part2/XeonPhi/src` or clone the current GitHub branches of the code
......