In order to build ALYA (Alya.x), please follow these steps:
- Go to: Thirdparties/metis-4.0 and build the Metis library (libmetis.a) using 'make'
- Go to the directory: Executables/unix
- Adapt the file: configure-marenostrum-mpi.txt to your own MPI wrappers and paths
- Execute:
./configure -x -f=configure-marenostrum-mpi.txt nastin parall
Data sets
The parameters used in the datasets are chosen to represent typical industrial runs as closely as possible, in order to obtain representative speedups. For example, the iterative
solvers are never converged to machine accuracy, because the system solution lies inside a non-linear loop.
The datasets represent the solution of the cavity flow at Re=100. A small mesh of 10M elements should be used for Tier-1 supercomputers, while a 30M element mesh
is specifically designed to run on Tier-0 supercomputers.
However, the number of elements can be increased by using the mesh multiplication option in the file *.ker.dat (DIVISION=0,2,3...). Mesh multiplication is
carried out in parallel, and the number of elements is multiplied by 8 at each of these levels. "0" means no mesh multiplication.
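The effect of mesh multiplication on the element count can be sketched with a quick calculation. This is a minimal, POSIX-shell illustration; the starting element count and DIVISION value are examples, and the factor of 8 per level is taken from the description above:

```shell
# Element count after mesh multiplication: N multiplied by 8 per level.
ELEMENTS=10000000   # cavity10_tetra starting mesh (illustrative)
DIVISION=2          # value set in *.ker.dat (illustrative)
i=0
while [ "$i" -lt "$DIVISION" ]; do
    ELEMENTS=$(( ELEMENTS * 8 ))   # each level multiplies the count by 8
    i=$(( i + 1 ))
done
echo "$ELEMENTS"
```

With DIVISION=2 the 10M-element mesh grows to 640M elements, so choose the level to match the memory of the target machine.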
The different datasets are:
cavity10_tetra ... 10M tetrahedra mesh
cavity30_tetra ... 30M tetrahedra mesh
How to execute Alya with a given dataset
In order to run ALYA, you need at least the following input files per execution:
In our case, there are 2 different inputs, so X={cavity10_tetra,cavity30_tetra}
To execute a simulation, you must be inside the input directory and you should submit a job like:
mpirun Alya.x cavity10_tetra
mpirun Alya.x cavity30_tetra
How to measure the speedup
1. Edit the fensap.nsi.cvg file
2. You will see ten rows; each one corresponds to one simulation timestep
3. Go to the second row; it starts with the number 2
4. Take the last number of this row, which is the elapsed CPU time of this timestep
5. Use this value to measure the speedup
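The steps above can be automated with awk. This sketch assumes a whitespace-separated fensap.nsi.cvg whose rows start with the timestep number and end with the elapsed CPU time; the sample file written here is mock data for illustration only:

```shell
# Mock convergence file: timestep ... elapsed-CPU-time (layout assumed).
printf '1 0.10 12.3\n2 0.20 11.8\n3 0.30 11.9\n' > fensap.nsi.cvg
# Print the last field of the row whose first field is 2 (timestep 2).
awk '$1 == 2 { print $NF }' fensap.nsi.cvg
```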
If you have any questions regarding the runs, please feel free to contact Guillaume Houzeaux:
Build instructions for CP2K.
===== 1. Get the code =====
Download a CP2K release or follow the instructions to check out the relevant branch of the CP2K SVN
repository. These build instructions and the accompanying benchmark run instructions have been
tested with release 4.1.
===== 2. Prerequisites & Libraries =====
GNU make and Python 2.x are required for the build process, as are a Fortran 2003 compiler and a
matching C compiler, e.g. gcc/gfortran (gcc >= 4.6 works; a later version is recommended).
CP2K can benefit from a number of external libraries for improved performance. It is advised to use
vendor-optimized versions of these libraries. If these are not available on your machine, freely
available implementations exist, including but not limited to those listed below.
The minimum set of libraries required to build a CP2K executable that will run the UEABS benchmarks
is:
1. LAPACK & BLAS, as provided by, for example:
netlib : &
MKL : part of your Intel MKL installation, if available
LibSci : installed on Cray platforms
OpenBLAS :
clBLAS :
2. SCALAPACK & BLACS, as provided by, for example:
netlib :
MKL : part of your Intel MKL installation, if available
LibSci : installed on Cray platforms
3. LIBINT, available from
(see build instructions in section 2.1 below)
The following libraries are optional but give a significant performance benefit:
4. FFTW3, available from or provided as an interface by MKL
5. ELPA, available from
6. libgrid, available from inside the distribution at cp2k/tools/autotune_grid
7. libxsmm, available from
More information can be found in the INSTALL file in the CP2K distribution.
2.1 Building LIBINT
The following commands will uncompress and install the LIBINT library required for the UEABS
benchmarks:
tar xzf libint-1.1.4.tar.gz
cd libint-1.1.4
./configure CC=cc CXX=CC --prefix=some_path_other_than_this_directory
make install
The environment variables CC and CXX are optional and can be used to specify the C and C++ compilers
to use for the build (the example above is configured to use the compiler wrappers cc and CC used on
Cray systems). By default the build process only creates static libraries (ending in .a). If you
want to be able to link dynamically to LIBINT when building CP2K you can pass the flag
--enable-shared to ./configure in order to produce shared libraries (ending in .so). In that case
you will need to ensure that the library is located in a place that is accessible at runtime and that
the LD_LIBRARY_PATH environment variable includes the LIBINT installation directory.
For more build options see ./configure --help.
===== 3. Building CP2K =====
If you have downloaded a tarball of the release, uncompress the file by running
tar xf cp2k-4.1.tar.bz2.
If necessary you can find additional information about building CP2K in the INSTALL file located in
the root directory of the CP2K distribution.
==== 3.1 Create or choose an arch file ====
Before compiling, the choice of compilers, the library locations, and the compilation and linker flags
need to be specified. This is done in an arch (architecture) file. Example arch files for a number
of common architectures can be found inside cp2k/arch. The names of these files match the
pattern architecture.version (e.g., Linux-x86-64-gfortran.sopt). The case "version=psmp" corresponds
to the hybrid MPI + OpenMP version that you should build to run the UEABS benchmarks.
In most cases you need to create a custom arch file, either from scratch or by modifying an existing
one that roughly fits the cpu type, compiler, and installation paths of libraries on your
system. You can also consult, which provides sample arch files as part of
the testing reports for some platforms (click on the status field for a platform, and search for
'ARCH-file' in the resulting output).
As a guided example, the following should be included in your arch file if you are compiling with
GNU compilers:
(a) Specification of which compiler and linker commands to use:
CC = gcc
FC = mpif90
LD = mpif90
CP2K is primarily a Fortran code, so only the Fortran compiler needs to be MPI-enabled.
(b) Specification of the DFLAGS variable, which should include:
-D__parallel (to build parallel CP2K executable)
-D__SCALAPACK (to link to ScaLAPACK)
-D__LIBINT (to link to LIBINT)
-D__MKL (if relying on MKL to provide ScaLAPACK and/or an FFTW interface)
-D__HAS_NO_SHARED_GLIBC (for convenience on HPC systems, see INSTALL file)
Additional DFLAGS needed to link to performance libraries, such as -D__FFTW3 to link to FFTW3,
are listed in the INSTALL file.
(c) Specification of compiler flags:
Required (for gfortran):
FCFLAGS = $(DFLAGS) -ffree-form -fopenmp
Recommended additional flags (for gfortran):
FCFLAGS += -O3 -ffast-math -funroll-loops
If you want to link any libraries containing header files you should pass the path to the
directory containing these to FCFLAGS in the format -I/path_to_include_dir.
(d) Specification of linker flags:
(e) Specification of libraries to link to:
Required (LIBINT):
-L/home/z01/z01/UEABS/CP2K/libint/1.1.4/lib -lderiv -lint
If you use MKL to provide ScaLAPACK and/or an FFTW interface the LIBS variable should be used
to pass the relevant flags provided by the MKL Link Line Advisor,
which you should use carefully in order to generate the right options for your system.
(f) AR = ar -r
As an example, a simple arch file is shown below for ARCHER, a Cray system
that uses compiler wrappers cc and ftn to compile C and Fortran code respectively, and which has
LIBINT installed in /home/z01/z01/user/cp2k/libs/libint/1.1.4. On Cray systems the compiler wrappers
automatically link in Cray's LibSci library which provides ScaLAPACK, hence there is no need for
explicit specification of the library location and library names in LIBS or relevant include paths
in FCFLAGS. This would not be the case if MKL was used instead.
# Ensure the following environment modules are loaded before starting the build:
# PrgEnv-gnu
# cray-libsci
CC = cc
FC = ftn
LD = ftn
AR = ar -r
DFLAGS = -D__parallel \
         -D__SCALAPACK \
         -D__LIBINT
FCFLAGS = $(DFLAGS) -ffree-form -fopenmp
FCFLAGS += -O3 -ffast-math -funroll-loops
LIBS = -L/home/z01/z01/user/cp2k/libint/1.1.4/lib -lderiv -lint
==== 3.2 Compile ====
Change directory to cp2k-4.1/makefiles
There is no configure stage. If the arch file for your machine is called
SomeArchitecture_SomeCompiler.psmp, then issue the following command to compile:
make ARCH=SomeArchitecture_SomeCompiler VERSION=psmp
or, if you are able to build in parallel with N processes:
make -j N ARCH=SomeArchitecture_SomeCompiler VERSION=psmp
There is also no "make install" stage. If everything goes well, you will find the executable
cp2k.psmp in the directory cp2k-4.1/exe/SomeArchitecture_SomeCompiler.
CP2K can be downloaded from :
It is free for all users under the GPL license;
see the Obtaining CP2K section on the download page.
In UEABS (2IP) the 2.3 branch was used, which can be downloaded from :
Data files are compatible with at least the 2.4 branch.
Tier-0 data set requires the libint-1.1.4 library. If libint version 1
is not available on your machine, libint can be downloaded from :
Run instructions for CP2K.
After building the hybrid MPI+OpenMP version of CP2K you have an executable
called cp2k.psmp. The general way to run the benchmarks is:
parallel_launcher launcher_options path_to_cp2k.psmp -i inputfile -o logfile
o The parallel_launcher is mpirun, mpiexec, or some variant such as aprun on
Cray systems or srun when using Slurm.
o The launcher_options include the parallel placement in terms of total numbers
of nodes, MPI ranks/tasks, tasks per node, and OpenMP threads per task (which
should be equal to the value given to OMP_NUM_THREADS)
You can try any combination of tasks per node and OpenMP threads per task to
investigate absolute performance and scaling on the machine of interest.
For tier-1 systems the best performance is usually obtained with pure MPI, while
for tier-0 systems the best performance is typically obtained using 1 MPI task
per node with the number of threads being equal to the number of cores per node.
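One way to keep the placement consistent is to derive OMP_NUM_THREADS from the node geometry. A minimal sketch follows; the core and task counts are example values, not a recommendation for any particular machine:

```shell
# Example node geometry; adjust to your machine and chosen decomposition.
CORES_PER_NODE=24
TASKS_PER_NODE=1        # 1 MPI task per node, as suggested for tier-0
# Threads per task = cores per node divided by MPI tasks per node.
export OMP_NUM_THREADS=$(( CORES_PER_NODE / TASKS_PER_NODE ))
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```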
More information in the form of a README and an example job script is included
in each benchmark tar file.
The run walltime is reported near the end of logfile:
grep "CP2K " logfile | awk -F ' ' '{print $7}'
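To see what the extraction above picks out, here is a mock timing-report line; the field layout is assumed from the awk command, so check your own logfile for the exact format:

```shell
# Mock CP2K timing-report line; field 7 is taken as the total walltime.
printf ' CP2K  1  1.0  0.015  0.016  245.335  245.336\n' > logfile
grep "CP2K " logfile | awk -F ' ' '{print $7}'
```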
1. Install FFTW-2, available at
2. Install GSL, available at
3. Install HDF5, available at
4. Go to Gadget3/
5. Edit Makefile, set:
6. make
1. Creation of input
mpirun -np 128 ./N-GenIC ics_medium.param
ics_medium.param is in N-GenIC directory
2. Run calculation
mpirun -np 128 ./Gadget3 param-medium.txt
param-medium.txt is in Gadget3 directory
This is the README file for the GENE application benchmark,
distributed with the Unified European Application Benchmark Suite.
GENE readme
1. General description
2. Code structure
3. Parallelization
4. Building
5. Execution
6. Data
1. General description
The gyrokinetic plasma turbulence code GENE (this acronym stands for
Gyrokinetic Electromagnetic Numerical Experiment) is a software package
dedicated to solving the nonlinear gyrokinetic Integro-Differential system
of equations in either flux-tube domain or in a radially nonlocal domain.
GENE has been developed by a team of people (the GENE Development Team,
led by F. Jenko, Max Planck Institute for Plasma Physics) over the last
several years.
For further documentation of the code see:
2. Code structure
Each particle species is described by a time-dependent distribution function
in a five-dimensional phase space.
This results in six-dimensional arrays, which have the following coordinates:
x y z three space coordinates
v parallel velocity
w perpendicular velocity
spec species of particles
GENE is written entirely in Fortran 90, with some language constructs taken
from the Fortran 2003 standard. It also contains preprocessing directives.
3. Parallelization
Parallelization is done by domain decomposition of all 6 coordinates using MPI.
x, y, z 3 space coordinates
v parallel velocity
w perpendicular velocity
spec species of particles
4. Building
The source code (Fortran 90) resides in the directory src.
The compilation of GENE will be done by JuBE.
Compilation will be done automatically if a new executable for the
benchmark runs is needed.
5. Execution
A very brief description of the datasets:
A small data set for test purposes. Needs only 8 cores to run.
Global simulation of ion-scale turbulence in Asdex-Upgrade,
needs 200-500GB total memory, runs from 256 to 4096 cores
Global simulation of ion-scale turbulence in JET,
needs 3.5-7TB total memory, runs from 4096 to 16384 cores
For running the benchmark for GENE, please follow the instructions for
using JuBE.
For each benchmark run, JuBE generates a run directory, creates the input
file 'parameters' from a template input file, and stores it in the run
directory. A job submission script is created and submitted as well.
6. Data
The only input file is 'parameters'. It has the format of a Fortran 90 namelist.
The following output files are stored in the run directory.
nrg.dat The content of this file is used to verify the correctness
of the benchmark run.
stdout is redirected by JuBE.
It contains logging information,
especially the result of the time measurement.
Instructions for obtaining GPAW and its test set for PRACE benchmarking
GPAW is licensed under the GPL, so there are no license issues.
NOTE: This benchmark uses version 0.11 of GPAW. For instructions on installing the
latest version, please visit:
Software requirements
* Python
* version 2.6-3.5 required
* this benchmark uses version 2.7.9
* NumPy
* this benchmark uses version 1.11.0
* ASE (Atomic Simulation Environment)
* this benchmark uses 3.9.0
* LibXC
* this benchmark uses version 2.0.1
* BLAS and LAPACK libraries
* this benchmark uses Intel MKL from Intel Composer Studio 2015
* MPI library (optional, for increased performance using parallel processes)
* this benchmark uses Intel MPI from Intel Composer Studio 2015
* FFTW (optional, for increased performance)
* this benchmark uses Intel MKL from Intel Composer Studio 2015
* BLACS and ScaLAPACK (optional, for increased performance)
* this benchmark uses Intel MKL from Intel Composer Studio 2015
* HDF5 (optional, library for parallel I/O and for saving files in HDF5 format)
* this benchmark uses 1.8.14
Obtaining the source code
* The specific version of GPAW used in this benchmark can be obtained from:
* Installation instructions can be found at:
* For platform specific instructions, please refer to:
* Help regarding the benchmark can be requested from
This benchmark set contains scaling tests for electronic structure simulation software GPAW.
More information on GPAW can be found at
Small Scaling Test:
A ground state calculation for a (6-6-10) carbon nanotube, requiring 30 SCF iterations.
The calculations under ScaLAPACK are parallelized using a 4/4/64 partitioning scheme.
This system scales reasonably well up to 512 cores, running to completion in under two minutes on a 2015-era x86 architecture cluster.
For scalability testing, the relevant timer in the text output 'out_nanotube_hXXX_kYYY_pZZZ' (where XXX denotes grid spacing, YYY denotes Brillouin-zone sampling and ZZZ denotes number of cores utilized) is 'Total Time'.
Medium Scaling Test: and C60_Pb100_POSCAR
A ground state calculation for fullerene on a Pb(100) surface, requiring ~100 SCF iterations.
In this example, the parameters of the parallelization scheme for ScaLAPACK calculations are chosen automatically (using the keyword 'sl_auto: True').
This system scales reasonably well up to 1024 cores, running to completion in under thirteen minutes on a 2015-era x86 architecture cluster.
For scalability testing, the relevant timer in the text output 'out_C60_Pb100_hXXX_kYYY_pZZZ' (where XXX denotes grid spacing, YYY denotes Brillouin-zone sampling and ZZZ denotes number of cores utilized) is 'Total Time'.
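The relevant timer can be pulled out with grep. The line below is mock data, and the hXXX/kYYY/pZZZ values in the filename are placeholders; the exact label and spacing in the real out_* files may differ:

```shell
# Mock output file containing the timer line of interest (format assumed).
printf 'Total Time:   712.345 s\n' > out_C60_Pb100_h020_k222_p512
grep 'Total Time' out_C60_Pb100_h020_k222_p512
```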
How to run
* Download and build the source code following the instructions in GPAW_Build_README.txt
* Benchmarks do not need any special command line options and can be run
just as e.g. :
mpirun -np 256 gpaw-python
mpirun -np 512 gpaw-python
Complete Build instructions:
A typical build procedure looks like :
tar -zxf gromacs-2016.tar.gz
cd gromacs-2016
mkdir build
cd build
cmake \
-DCMAKE_INSTALL_PREFIX=$HOME/Packages/gromacs/2016 \
-DCMAKE_C_COMPILER=`which mpicc` \
-DCMAKE_CXX_COMPILER=`which mpicxx` \
-DGMX_GPU=off \
-DGMX_MPI=on \
-DGMX_X11=off \
..
make (or make -j ##)
make install
You probably need to adjust
1. The CMAKE_INSTALL_PREFIX to point to a different path
2. GMX_SIMD : You may omit this completely if your compile nodes and compute nodes have the same architecture (for example Haswell).
If they differ, you should specify what fits your compute nodes.
For a complete and up-to-date list of possible choices, refer to the official GROMACS build instructions.
Gromacs can be downloaded from :
The UEABS benchmark cases require the 4.6 branch or newer;
the latest 4.6.x version is suggested.
There are two data sets in UEABS for Gromacs.
1. ion_channel, which uses PME for electrostatics, for Tier-1 systems
2. lignocellulose-rf, which uses a reaction field for electrostatics, for Tier-0 systems. Reference :
The input data file for each benchmark is the corresponding .tpr file, produced using
tools from a complete GROMACS installation and a series of ASCII data files
(atom coordinates/velocities, force field, run control).
If you happen to run the Tier-0 case on BG/Q, use lignocellulose-rf.BGQ.tpr
instead of lignocellulose-rf.tpr. It is the same as lignocellulose-rf.tpr,
but created on a BG/Q system.
The general way to run gromacs benchmarks is :
WRAPPER WRAPPER_OPTIONS PATH_TO_GMX mdrun -s CASENAME.tpr -maxh 0.50 -resethway -noconfout -nsteps 10000 -g logfile
CASENAME is one of ion_channel or lignocellulose-rf
maxh : Terminate after 0.99 times this time (in hours), i.e. gracefully terminate after ~30 min
resethway : Reset timer counters halfway through the run. This means that the reported
walltime and performance refer to the last
half of the simulation steps.
noconfout : Do not save output coordinates/velocities at the end.
nsteps : Run this number of steps, no matter what is requested in the input file
logfile : The output filename. If the extension .log is omitted,
it is automatically appended. Obviously, it should be different
for different runs.
WRAPPER and WRAPPER_OPTIONS depend on system, batch system etc.
A few common pairs are :
Curie : ccc_mrun with no options - obtained from batch system
Juqueen : runjob --np TASKS --ranks-per-node TASKSPERNODE --exp-env OMP_NUM_THREADS
Slurm : srun with no options, obtained from slurm if the variables below are set.
#SBATCH --ntasks-per-node=TASKSPERNODE
#SBATCH --cpus-per-task=THREADSPERTASK
The best performance is usually obtained using pure MPI i.e. THREADSPERTASK=1.
You can check other hybrid MPI/OMP combinations.
The execution time is reported at the end of logfile : grep Time: logfile | awk -F ' ' '{print $3}'
NOTE : This is the wall time for the last half number of steps.
For sufficiently large nsteps, this is half of the total wall time.
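As a concrete illustration of the extraction, here is a mock 'Time:' line; the three numeric fields are assumed to be core time, wall time and percentage, matching the $3 selection above:

```shell
# Mock log line; awk field 3 is taken as the wall time of the timed steps.
printf 'Time:  3600.000  1800.000  200.0\n' > logfile
grep Time: logfile | awk -F ' ' '{print $3}'
```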
Build instructions for NAMD.
In order to run the benchmarks, the memopt build with SMP support is mandatory.
NAMD may be compiled in an experimental memory-optimized mode that utilizes a compressed version of the molecular structure and also supports parallel I/O.
In addition to reducing per-node memory requirements, the compressed structure greatly reduces startup times compared to reading a psf file.
In order to build this version, your MPI needs to provide the thread support level MPI_THREAD_FUNNELED.
You need NAMD version 2.11 or newer.
1. Uncompress/tar the source.
2. cd NAMD_Source_BASE (the directory name depends on how the source was obtained;
typically : namd2 or NAMD_2.11_Source )
3. Untar the charm-VERSION.tar found there. If you obtained the NAMD source via
cvs, you need to download charm separately.
4. cd to charm-VERSION directory
5. configure and compile charm :