The application codes that constitute the UEABS are:
- [GROMACS](#gromacs)
- [NAMD](#namd)
- [NEMO](#nemo)
- [PFARM](#pfarm)
- [QCD](#qcd)
- [Quantum Espresso](#espresso)
- [SHOC](#shoc)
# NEMO <a name="nemo"></a>
NEMO (Nucleus for European Modelling of the Ocean) [22] is a mathematical modelling framework for research activities and prediction services in ocean and climate sciences, developed by a European consortium. It is intended to be a tool for studying the ocean and its interaction with the other components of the Earth's climate system over a wide range of space and time scales. It comprises the core engines OPA (ocean dynamics and thermodynamics), SI3 (sea-ice dynamics and thermodynamics), TOP (oceanic tracers) and PISCES (biogeochemical processes).
Prognostic variables in NEMO are the three-dimensional velocity field, a linear or non-linear sea surface height, the temperature and the salinity.
In the horizontal direction, the model uses a curvilinear orthogonal grid; in the vertical direction, it uses a full or partial step z-coordinate, an s-coordinate, or a mixture of the two. Variables are distributed on a three-dimensional Arakawa C-type grid in most cases.
The model is implemented in Fortran 90, with preprocessing via the C preprocessor. It is optimized for vector computers and parallelized by domain decomposition with MPI. It supports modern C/C++ and Fortran compilers. All input and output is handled by third-party software called XIOS, which depends on NetCDF (Network Common Data Format) and HDF5. NEMO is highly scalable and well suited to measuring supercomputer performance in terms of compute capacity, memory subsystem, I/O and interconnect performance.
### Test Case Description
The GYRE configuration has been built to model the seasonal cycle of a double-gyre box model. It consists of an idealized domain over which seasonal forcing is applied, allowing a large number of interactions and their combined contribution to the large-scale circulation to be studied.
The domain geometry is a rectangular basin bounded by vertical walls and a flat bottom. The configuration is meant to represent an idealized North Atlantic or North Pacific basin. The circulation is forced by analytical profiles of wind and buoyancy fluxes.
The wind stress is zonal and its curl changes sign at 22°N and 36°N. It forces a subpolar gyre in the north, a subtropical gyre in the wider part of the domain, and a small recirculation gyre in the southern corner. The net heat flux takes the form of a restoring toward a zonal apparent air temperature profile.
A portion of the net heat flux, which comes from the solar radiation, is allowed to penetrate within the water column. The fresh-water flux is also prescribed and varies zonally. It is determined such that, at each time step, the basin-integrated flux is zero.
The basin is initialized at rest, with vertical profiles of temperature and salinity applied uniformly over the whole domain. The GYRE configuration is set through the namelist_cfg file.
The horizontal resolution is determined by setting jp_cfg as follows:
`jpiglo = 30 x jp_cfg + 2`
`jpjglo = 20 x jp_cfg + 2`
In this configuration we use the default value of 30 ocean levels (jpk=31). The GYRE configuration is an ideal case for benchmark tests, as it is very simple to increase the resolution and perform both weak and strong scalability experiments using the same input files (a quick cross-check of the resulting grid sizes is given at the end of this section). We use the following two configurations:
**Test Case A**:
* jp_cfg = 128, suitable for up to 1,000 cores
* Number of days: 20
* Number of time steps: 1440
* Time step size: 20 mins
* Number of seconds per time step: 1200
**Test Case B**:
* jp_cfg = 256, suitable for up to 20,000 cores
* Number of days (real): 80
* Number of time steps: 4320
* Time step size (real): 20 mins
* Number of seconds per time step: 1200
* Web site: <http://www.nemo-ocean.eu/>
* Download, Build and Run Instructions: <https://repository.prace-ri.eu/git/UEABS/ueabs/tree/master/nemo>
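For a quick cross-check of the resolution formulas above, the following shell snippet (illustrative only, not part of NEMO) computes the global grid dimensions implied by the two test cases:

```bash
# Grid sizes implied by jpiglo = 30*jp_cfg + 2 and jpjglo = 20*jp_cfg + 2
for jp_cfg in 128 256; do           # Test Case A and Test Case B
  echo "jp_cfg=${jp_cfg}: jpiglo=$(( 30*jp_cfg + 2 )), jpjglo=$(( 20*jp_cfg + 2 ))"
done
# Output:
# jp_cfg=128: jpiglo=3842, jpjglo=2562
# jp_cfg=256: jpiglo=7682, jpjglo=5122
```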
# PFARM <a name="pfarm"></a>
PFARM is part of a suite of programs based on the ‘R-matrix’ ab-initio approach to the variational solution of the many-electron Schrödinger
equation for electron-atom and electron-ion scattering. The package has been used to calculate electron collision data for astrophysical
applications (such as: the interstellar medium, planetary atmospheres) with, for example, various ions of Fe and Ni and neutral O, plus
other applications such as data for plasma modelling and fusion reactor impurities. The code has recently been adapted to form a compatible
interface with the UKRmol suite of codes for electron (positron) molecule collisions, thus enabling large-scale parallel ‘outer-region’
calculations for molecular systems as well as atomic systems.
The PFARM outer-region application code EXDIG is dominated by the assembly of sector Hamiltonian matrices and their subsequent eigensolutions.
The code is written in Fortran 2003 (or Fortran 2003-compliant Fortran 95), is parallelised using MPI and OpenMP, and is designed to take
advantage of highly optimised numerical library routines. Hybrid MPI/OpenMP parallelisation has also been introduced into the code via
shared-memory-enabled numerical library kernels.
Accelerator-based versions of EXDIG have been implemented, using off-loading (via MKL or cuBLAS/cuSolver) for the standard (dense) eigensolver calculations that dominate the overall run-time.
- Code download: https://repository.prace-ri.eu/UEABS/ueabs/pfarm/pfarm.tar.gz
- Build & Run instructions: https://repository.prace-ri.eu/git/UEABS/ueabs/blob/r2.1-dev/pfarm/PFARM_Build_Run_README.txt
- Test Case A: https://repository.prace-ri.eu/UEABS/ueabs/pfarm/PFARM_TestCaseA.tar.bz2
- Test Case B: https://repository.prace-ri.eu/UEABS/ueabs/pfarm/PFARM_TestCaseB.tar.bz2
# QCD <a name="qcd"></a>
The QCD benchmark is, unlike the other benchmarks in the PRACE application benchmark suite, not a full application but a set of 5 kernels which are representative of some of the most compute-intensive parts of QCD calculations.
========================================================================
README file for PRACE Accelerator Benchmark Code PFARM (stage EXDIG, program RMX95)
========================================================================
Author: Andrew Sunderland (andrew.sunderland@stfc.ac.uk).
The code download should contain the following directories:
benchmark/RMX_MPI_OMP: RMX source files for running on Host or KNL (using serial or threaded LAPACK or MKL)
benchmark/RMX_MAGMA_GPU: RMX source for running on CPU/GPU nodes using MAGMA
benchmark/run: Run directory with input files
benchmark/xdr: XDR library src files and static XDR library file
benchmark/data: Data files for the benchmark test cases
The code uses the eXternal Data Representation (XDR) library for cross-platform
compatibility of unformatted data files. The XDR source files are provided with this code bundle
and can also be obtained from various sources, including:
http://meteora.ucsd.edu/~pierce/fxdr_home_page.html
http://people.redhat.com/rjones/portablexdr/
----------------------------------------------------------------------------
* Installing (MAGMA GPU Only)
Download MAGMA (current version magma-2.2.0) from http://icl.utk.edu/magma/
Install MAGMA: Modify the make.inc file to indicate your C/C++
compiler and Fortran compiler, and to specify where CUDA, CPU BLAS, and
LAPACK are installed on your system. Refer to the MAGMA documentation for further details.
----------------------------------------------------------------------------
* Install XDR
Build XDR library:
Update the DEFS file for your compiler and environment
$> cd xdr
$> make
(ignore warnings related to float/double type mismatches in xdr_rmat64.c - this is not relevant for this benchmark)
The validity of the XDR library can be tested by running test_xdr
$> ./test_xdr
----------------------------------------------------------------------------
* Install RMX_MPI_OMP
$> cd RMX_MPI_OMP
Update the DEFS file for your setup, ensuring that you link against a LAPACK or MKL library (or equivalent).
This is usually achieved by, for example, compiling with -mkl=parallel (Intel compiler) or loading the appropriate library modules.
$> make
* Install RMX_MAGMA_GPU
Set the MAGMADIR and CUDADIR environment variables to point to your MAGMA and CUDA installations
$> cd RMX_MAGMA_GPU
Update the DEFS file for your setup
$> make
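For example, the MAGMA and CUDA environment variables above might be set as follows (the paths are placeholders for your local installations):
$> export MAGMADIR=/path/to/magma-2.2.0
$> export CUDADIR=/path/to/cuda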
----------------------------------------------------------------------------
* Run RMX
==========
The RMX application is run via the executable "rmx95".
For the FEIII dataset, the program requires the following input files from the data directory linked to the run directory:
phzin.ctl
XJTARMOM
HXJ030
These files are located in benchmark/run
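If the test-case data files are not already present in the run directory, they can be linked in from the data directory, for example (illustrative only; adjust the relative path to your layout):
$> cd run
$> ln -s ../data/* .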
A guide to each of the variables in the namelist in phzin.ctl can be found at:
https://hpcforge.org/plugins/mediawiki/wiki/pfarm/images/9/99/Phz_rep.pdf
However, it is recommended that these inputs are not changed for the benchmark runs; problem size, runtime, etc. are
better controlled via the environment variables listed below.
* Run on CPUs (or KNLs)
=======================
A typical PBS script to run the RMX_MPI_OMP benchmark on 4 KNL nodes (4 MPI tasks with 64 threads per MPI task) is listed below:
Settings will vary according to your local environment.
#PBS -N rmx95_4x64
#PBS -l select=4
#PBS -l walltime=01:00:00
#PBS -A my_account_id
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=64
# Set some code-specific environment variables (for details see below) e.g.:
export RMX_NSECT_FINE=4
export RMX_NSECT_COARSE=4
export RMX_NL_FINE=12
export RMX_NL_COARSE=6
# Run on 4 nodes with 1 MPI task per node and 64 OpenMP threads per node
aprun -N 1 -n 4 -d $OMP_NUM_THREADS ./rmx95
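For systems that use SLURM rather than PBS, an equivalent job script might look like the following (an illustrative sketch; account and scheduler settings are placeholders and should be adjusted to your site):
#SBATCH --job-name=rmx95_4x64
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=64
#SBATCH --time=01:00:00
#SBATCH --account=my_account_id
cd $SLURM_SUBMIT_DIR
export OMP_NUM_THREADS=64
# Set some code-specific environment variables (for details see below) e.g.:
export RMX_NSECT_FINE=4
export RMX_NSECT_COARSE=4
export RMX_NL_FINE=12
export RMX_NL_COARSE=6
# Run on 4 nodes with 1 MPI task per node and 64 OpenMP threads per task
srun ./rmx95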
* Run on CPU/GPU nodes
======================
A typical PBS script to run the RMX_MAGMA_GPU benchmark on 4 CPU/GPU nodes (with 2 GPUs per node) is listed below:
Settings will vary according to your local environment.
#PBS -N rmx95_4MPIx2GPU
#PBS -l select=4
#PBS -l walltime=01:00:00
#PBS -A my_account_id
cd $PBS_O_WORKDIR
# Set number of GPUs per node to use (MAGMA auto-parallelises over these)
export RMX_NGPU=2
# Set some code-specific environment variables (for details see below) e.g.:
export RMX_NSECT_FINE=4
export RMX_NSECT_COARSE=4
export RMX_NL_FINE=12
export RMX_NL_COARSE=6
# Run on 4 nodes with 1 MPI task per node and 2 GPUs per node
aprun -N 1 -n 4 ./rmx95
----------------------------------------------------------------------------
* Run-time environment variable settings
The following environment variables, which can for example be set inside the job script, allow the dimensions of the sector Hamiltonian matrices
and the number of sectors to be changed easily when undertaking benchmarks.
These can be adapted by the user to suit benchmark load requirements, e.g. short vs long runs.
Each MPI task picks up a sector calculation, which is then distributed amongst the available threads per node (for CPU and KNL) or offloaded (for GPU).
The distribution of sectors among MPI tasks is simple round-robin.
RMX_NGPU : refers to the number of shared GPUs per node (only for RMX_MAGMA_GPU)
RMX_NSECT_FINE : sets the number of sectors for the Fine region (it is recommended to set this to a low number if the sector Hamiltonian matrix dimension is large).
RMX_NSECT_COARSE : sets the number of sectors for the Coarse region (it is recommended to set this to a low number if the sector Hamiltonian matrix dimension is large).
RMX_NL_FINE : sets the number of basis functions for the Fine region sector calculations (this will determine the size of the sector Hamiltonian matrix).
RMX_NL_COARSE : sets the number of basis functions for the Coarse region sector calculations (this will determine the size of the sector Hamiltonian matrix).
Hint: To aid scaling across nodes, the number of MPI tasks in the job script should ideally be a factor of RMX_NSECT_FINE.
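As an illustration of the round-robin distribution and of the hint above (a sketch only, not PFARM source code; the shell variable names are invented for this example):
# Round-robin mapping of sector calculations to MPI ranks
NSECT=4       # e.g. RMX_NSECT_FINE=4, as in the example job scripts
NTASKS=4      # number of MPI tasks in the job
for (( isect=1; isect<=NSECT; isect++ )); do
  echo "sector ${isect} -> MPI rank $(( (isect-1) % NTASKS ))"
done
# When NTASKS is a factor of NSECT, every rank receives the same number of sectors,
# which is why MPI task counts that divide RMX_NSECT_FINE are recommended.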
For representative test cases:
RMX_NL_FINE should take values in the range 6 to 25.
RMX_NL_COARSE should take values in the range 5 to 10.
For accuracy reasons, RMX_NL_FINE should always be greater than RMX_NL_COARSE.
The following value pairs for RMX_NL_FINE and RMX_NL_COARSE provide representative calculations:
12,6
14,8
16,10
18,10
20,10
25,10
If RMX_NSECT and RMX_NL variables are not set, the benchmark code defaults to calculating NL and NSECT, giving:
RMX_NSECT_FINE=5
RMX_NSECT_COARSE=20
RMX_NL_FINE=12
RMX_NL_COARSE=6
* Results
One AMPF file will be created for each fine-region sector and one AMPC file for each coarse-region sector.
All output AMPF files will be the same size (in bytes), and likewise all output AMPC files.
The Hamiltonian matrix dimension will be output along with the wallclock time taken by each individual DSYEVD call.
Performance is measured in wallclock time and is displayed on the screen or in the output log at the end of the run.
----------------------------------------------------------------------------
# PFARM in the United European Applications Benchmark Suite (UEABS)
## Document Author: Andrew Sunderland (andrew.sunderland@stfc.ac.uk), STFC, UK.
## Introduction
PFARM is part of a suite of programs based on the ‘R-matrix’ ab-initio approach to the variational solution of the many-electron Schrödinger equation for electron-atom and electron-ion scattering. The package has been used to calculate electron collision data for astrophysical applications (such as: the interstellar medium, planetary atmospheres) with, for example, various ions of Fe and Ni and neutral O, plus other applications such as data for plasma modelling and fusion reactor impurities. The code has recently been adapted to form a compatible interface with the UKRmol suite of codes for electron (positron) molecule collisions, thus enabling large-scale parallel ‘outer-region’ calculations for molecular systems as well as atomic systems.
In this README we give information relevant for its use in the UEABS.
### Standard CPU version
The PFARM outer-region application code EXDIG is dominated by the assembly of sector Hamiltonian matrices and their subsequent eigensolutions. The code is written in Fortran 2003 (or Fortran 2003-compliant Fortran 95), is parallelised using MPI and OpenMP, and is designed to take advantage of highly optimised numerical library routines. Hybrid MPI/OpenMP parallelisation has also been introduced into the code via shared-memory-enabled numerical library kernels.
### GPU version
Accelerator-based versions of EXDIG have been implemented, using off-loading (via MKL or cuBLAS/cuSolver) for the standard (dense) eigensolver calculations that dominate the overall run-time.