\section{Benchmark Suite Description\label{sec:codes}}

This section covers each code in turn, presenting its relevance to the scientific community as well as the test cases defined for the benchmarks.

Since this suite is an extension of the UEABS, most of the codes presented here are also part of the latter. The exceptions are PFARM, which comes from PRACE-2IP \cite{ref-0023}, and SHOC \cite{ref-0026}, a synthetic benchmark suite.

\begin{table}[htbp]
\centering
\caption{Codes and corresponding APIs: green cells indicate an available implementation, red cells an unavailable one}
\label{table:avail-api}
\begin{tabular}{l|l|l|l|}
\cline{2-4}
                                       & OpenMP                   & OpenCL                   & CUDA                     \\ \hline
\multicolumn{1}{|l|}{Alya}             & \cellcolor[HTML]{009901} & \cellcolor[HTML]{CB0000} & \cellcolor[HTML]{009901} \\ \hline
\multicolumn{1}{|l|}{Code\_Saturne}    & \cellcolor[HTML]{009901} & \cellcolor[HTML]{CB0000} & \cellcolor[HTML]{009901} \\ \hline
\multicolumn{1}{|l|}{CP2K}             & \cellcolor[HTML]{009901} & \cellcolor[HTML]{009901} & \cellcolor[HTML]{009901} \\ \hline
\multicolumn{1}{|l|}{GPAW}             & \cellcolor[HTML]{009901} & \cellcolor[HTML]{CB0000} & \cellcolor[HTML]{009901} \\ \hline
\multicolumn{1}{|l|}{GROMACS}          & \cellcolor[HTML]{009901} & \cellcolor[HTML]{CB0000} & \cellcolor[HTML]{009901} \\ \hline
\multicolumn{1}{|l|}{NAMD}             & \cellcolor[HTML]{009901} & \cellcolor[HTML]{CB0000} & \cellcolor[HTML]{009901} \\ \hline
\multicolumn{1}{|l|}{PFARM}            & \cellcolor[HTML]{009901} & \cellcolor[HTML]{CB0000} & \cellcolor[HTML]{009901} \\ \hline
\multicolumn{1}{|l|}{QCD}              & \cellcolor[HTML]{009901} & \cellcolor[HTML]{CB0000} & \cellcolor[HTML]{009901} \\ \hline
\multicolumn{1}{|l|}{QUANTUM ESPRESSO} & \cellcolor[HTML]{009901} & \cellcolor[HTML]{CB0000} & \cellcolor[HTML]{009901} \\ \hline
\multicolumn{1}{|l|}{SHOC}             & \cellcolor[HTML]{CB0000} & \cellcolor[HTML]{009901} & \cellcolor[HTML]{009901} \\ \hline
\multicolumn{1}{|l|}{SPECFEM3D}        & \cellcolor[HTML]{009901} & \cellcolor[HTML]{009901} & \cellcolor[HTML]{009901} \\ \hline
\end{tabular}
\end{table}

Table~\ref{table:avail-api} lists the codes presented in the following sections together with the implementations available for each of them. It should be noted that OpenMP can be used on the Intel Xeon Phi architecture, while CUDA targets NVIDIA GPU cards. OpenCL has been considered as a third alternative usable on both architectures: it was available on the first generation of Xeon Phi (KNC) but has not been ported to the second one (KNL). SHOC is the only code affected by this; the problem is addressed in Sect.~\ref{sec:results:shoc}.

\subsection{Alya}

Alya is a high-performance computational mechanics code that can solve different coupled mechanics problems: incompressible/compressible flows, solid mechanics, chemistry, excitable media, heat transfer and Lagrangian particle transport. It is a single code, with no separate parallel or platform-specific versions; modules, services and kernels can be compiled individually and used \`{a} la carte. The main discretisation technique employed in Alya is the variational multiscale finite element method, which assembles the governing equations into algebraic systems. These systems can be solved with solvers such as GMRES, Deflated Conjugate Gradient or pipelined CG, together with preconditioners such as SSOR or Restricted Additive Schwarz. The coupling between physics solved in different computational domains (such as fluid-structure interaction) is carried out in a multi-code way, using different instances of the same executable. Asynchronous coupling can be achieved in the same way in order to transport Lagrangian particles.

\subsubsection{Code Description.}

The code is parallelised with MPI and OpenMP. Two OpenMP strategies are available: with and without a colouring strategy that avoids ATOMIC operations during the assembly step. A CUDA version is also available for the different solvers, and Alya has also been compiled for MIC (Intel Xeon Phi).

Alya is written in Fortran 95 and the incompressible fluid module, included in the benchmark suite, is freely available. This module solves the Navier-Stokes equations using an Orthomin(1) method \cite{ref-0029} for the pressure Schur complement. This method is an algebraic split strategy which converges to the monolithic solution. At each linearisation step, the momentum equation is solved twice and the continuity equation is solved once or twice, depending on whether the momentum-preserving or the continuity-preserving algorithm is selected.
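As a schematic illustration (standard notation, up to signs and stabilisation terms; not the exact formulation implemented in Alya), the discretised incompressible Navier-Stokes system and the pressure Schur complement on which the Orthomin(1) iteration acts can be written as
\[
\left(\begin{array}{cc} A & B^{T} \\ B & 0 \end{array}\right)
\left(\begin{array}{c} u \\ p \end{array}\right)
=
\left(\begin{array}{c} f \\ 0 \end{array}\right),
\qquad
S\,p = B A^{-1} f , \qquad S = B A^{-1} B^{T},
\]
where $A$ is the momentum matrix, $B$ the discrete divergence operator, $u$ the velocity unknowns and $p$ the pressure. Each application of $S$ involves a momentum solve, which is consistent with the momentum equation being solved twice at each linearisation step.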
\subsubsection{Test Cases Description.}
\paragraph{Cavity-Hexahedra Elements (10M Elements).}
This test is the classical lid-driven cavity. The problem geometry is a cube of dimensions $1\times1\times1$. The fluid properties are $\mbox{density}=1.0$ and $\mbox{viscosity}=0.01$. Dirichlet boundary conditions are applied on all sides, with three no-slip walls and one moving wall with velocity equal to $1.0$, which corresponds to a Reynolds number of $100$. The Reynolds number is low, so the regime is laminar and turbulence modelling is not necessary. The domain is discretised into $9\,800\,344$ hexahedral elements. The solvers used are GMRES for the momentum equations and the Deflated Conjugate Gradient for the continuity equation. This test case can be run using pure MPI parallelisation or the hybrid MPI/OpenMP strategy.
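For reference, taking the cavity edge as the characteristic length, the stated parameters indeed give
\[
Re = \frac{\rho U L}{\mu} = \frac{1.0 \times 1.0 \times 1.0}{0.01} = 100 .
\]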
\paragraph{Cavity-Hexahedra Elements (30M Elements).}

This is the same cavity test as before but with 30M elements. Note that a mesh multiplication strategy makes it possible to multiply the number of elements by powers of 8, simply by activating the corresponding option in the ker.dat file.

\paragraph{Cavity-Hexahedra Elements -- GPU Version (10M Elements).}

This is the same test as the first test case (10M elements), but using the pure MPI parallelisation strategy with the algebraic solvers accelerated on GPU.

\subsection{Code\_Saturne\label{ref-0058}}

Code\_Saturne is a CFD software package developed by EDF R\&D since 1997 and open-source since 2007. The Navier-Stokes equations are discretised following a finite volume method approach. The code can handle any type of mesh built with any type of cell/grid structure. Incompressible and compressible flows can be simulated, with or without heat transfer, and a range of turbulence models is available. The code can also be coupled with itself or other software to model some multi-physics problems (fluid-structure, fluid-conjugate heat transfer, for instance).

\subsubsection{Code Description.\label{ref-0059}}

Parallelism is handled by distributing the domain over the processors (several partitioning tools are available, either internally, i.e. space-filling curve (SFC) partitioning using Hilbert or Morton orderings, or through external libraries, i.e. serial METIS, ParMETIS, serial Scotch and PT-SCOTCH). Communications between subdomains are handled by MPI. Hybrid MPI/OpenMP parallelism has recently been optimised for improved multicore performance.

For incompressible simulations, most of the time is spent computing the pressure by solving Poisson equations; the corresponding matrices are very sparse. PETSc has recently been linked to the code to offer alternatives to the internal solvers for the pressure. The developer's version of PETSc supports CUDA and is the one used in this benchmark suite.
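As a schematic illustration (a generic projection-type formulation, not necessarily the exact scheme implemented in Code\_Saturne), the pressure is obtained from a Poisson-like equation of the form
\[
\nabla \cdot \left( \frac{\Delta t}{\rho} \, \nabla p \right) = \nabla \cdot \tilde{u},
\]
where $\tilde{u}$ is the predicted velocity field; discretising this equation over the finite volume mesh yields the large sparse linear systems mentioned above.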

Code\_Saturne is written in C, Fortran 95 and Python. It is freely available under the GPL license.

\subsubsection{Test Cases Description.\label{ref-0060}}

Two test cases are considered, one with a mesh made of hexahedral cells and one with a mesh made of tetrahedral cells. Both configurations are meant for incompressible laminar flows. The hexahedral-cell test case is run on KNL in order to test the performance of the code when a node is always completely filled, using 64 MPI tasks per node combined with either 1, 2 or 4 OpenMP threads per task, or alternatively 1, 2 or 4 MPI tasks per core, in order to investigate the effect of hyper-threading. In this case, the pressure is computed using the code's native Algebraic Multigrid (AMG) algorithm as a solver. The tetrahedral-cell test case is run on KNL and GPU. In this configuration, the pressure equation is solved using the conjugate gradient (CG) algorithm from the PETSc library (the developer's version, which supports GPU), and tests are run on KNL as well as on CPU+GPU. PETSc is built with the CUSP library and the CUSP matrix format is used.
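Assuming a 64-core KNL node with 4 hardware threads per core, these combinations correspond to the following node occupancies:
\[
64 \times 1 = 64, \qquad 64 \times 2 = 128, \qquad 64 \times 4 = 256
\]
hardware threads per node, i.e. 1-, 2- and 4-way hyper-threading respectively.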

Note that, in Code\_Saturne, computing the pressure with a CG algorithm has always been slower than using the native AMG algorithm. The second test is therefore meant to compare results obtained on KNL and GPU using CG only, not to compare the time to solution of CG and AMG.

\paragraph{Flow in a 3-D Lid-driven Cavity (Tetrahedral Cells).}

The geometry is very simple, i.e. a cube, but the mesh is built using tetrahedral cells only. The Reynolds number is set to 100, and symmetry boundary conditions are applied in the spanwise direction. The case is modular and the mesh size can easily be varied. The largest mesh has about 13 million cells and is used to obtain first comparisons with Code\_Saturne linked to the developer's version of the PETSc library, in order to make use of the GPU.

\paragraph{3-D Taylor-Green Vortex Flow (Hexahedral Cells).}

The Taylor-Green vortex flow is traditionally used to assess the accuracy of the numerical schemes of CFD codes. Periodicity is used in the three directions. The quantities of interest are the time evolution of the total kinetic energy (integral of the velocity) and of the enstrophy (integral of the vorticity). Code\_Saturne is set to second-order time and spatial schemes. The mesh size is $256^3$ cells.
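For reference, the monitored quantities are commonly defined as (a standard definition over the periodic box $\Omega$, assuming the usual single-mode Taylor-Green initial condition)
\[
E_k(t) = \frac{1}{2|\Omega|} \int_{\Omega} \vec{u}\cdot\vec{u} \, \mathrm{d}V,
\qquad
\varepsilon(t) = \frac{1}{2|\Omega|} \int_{\Omega} \vec{\omega}\cdot\vec{\omega} \, \mathrm{d}V,
\qquad
\vec{\omega} = \nabla \times \vec{u},
\]
and their decay in time is typically compared against reference solutions.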

\subsection{CP2K\label{ref-0061}}

CP2K is a quantum chemistry and solid-state physics software package that can perform atomistic simulations of solid-state, liquid, molecular, periodic, material, crystal, and biological systems. It can perform molecular dynamics, metadynamics, Quantum Monte Carlo, Ehrenfest dynamics, vibrational analysis, core level spectroscopy, energy minimisation, and transition state optimisation using NEB or the dimer method.

CP2K provides a general framework for different modelling methods, such as density functional theory (DFT) using the mixed Gaussian and plane waves approach (GPW) and its Gaussian and augmented plane waves extension (GAPW). Supported theory levels include DFTB, LDA, GGA, MP2, RPA, semi-empirical methods (AM1, PM3, PM6, RM1, MNDO, {\ldots}), and classical force fields (AMBER, CHARMM, {\ldots}).

\subsubsection{Code Description.\label{ref-0062}}

Parallelisation is achieved using a combination of OpenMP-based multi-threading and MPI.

Offloading for accelerators is implemented through CUDA and OpenCL for GPU and through OpenMP for MIC (Intel Xeon Phi).

CP2K is written in Fortran 2003 and freely available under the GPL license.

\subsubsection{Test Cases Description.\label{ref-0063}}

\paragraph{LiH-HFX.}

This is a single-point energy calculation for a particular configuration of a 216-atom lithium hydride crystal with 432 electrons in a $12.3\ \mbox{\AA{}}^3$ (Angstroms cubed) cell. The calculation is performed using a DFT algorithm with GAPW under the hybrid Hartree-Fock exchange (HFX) approximation. These calculations are generally around one hundred times more computationally expensive than a standard local DFT calculation, although the cost of the latter can be reduced by using the Auxiliary Density Matrix Method (ADMM). Using OpenMP is of particular benefit here, as the HFX implementation requires a large amount of memory to store partial integrals. By using several threads, fewer MPI processes share the available memory on the node, and thus enough memory is available to avoid recomputing any integrals on the fly, which improves performance.
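As a purely illustrative example (the node size below is hypothetical, not taken from the benchmark systems): on a node with 96 GB of memory,
\[
\frac{96\ \mathrm{GB}}{24\ \mathrm{MPI\ ranks}} = 4\ \mathrm{GB\ per\ rank}
\qquad \mbox{versus} \qquad
\frac{96\ \mathrm{GB}}{4\ \mathrm{MPI\ ranks}} = 24\ \mathrm{GB\ per\ rank},
\]
so running 4 ranks of 6 OpenMP threads each instead of 24 single-threaded ranks leaves six times more memory per rank for storing the partial integrals.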

This test case is expected to scale efficiently to 1000+ nodes.

\paragraph{H2O-DFT-LS.}

This is a single-point energy calculation for 2048 water molecules in a $39\ \mbox{\AA{}}^3$ box using linear-scaling DFT. A local-density approximation (LDA) functional is used to compute the exchange-correlation energy, in combination with a DZVP MOLOPT basis set and a 300 Ry cutoff. For large systems, the linear-scaling approach for solving the Self-Consistent-Field equations should be much cheaper computationally than standard DFT, and allows scaling up to 1 million atoms for simple systems. The linear-scaling cost results from the fact that the algorithm is based on an iteration on the density matrix. The cubically-scaling orthogonalisation step of standard DFT is avoided, and the key operations are sparse matrix-matrix multiplications, whose number of non-zero entries scales linearly with system size. These are implemented efficiently in CP2K's DBCSR library.
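A short complexity argument (a generic sketch, independent of the details of DBCSR): if locality keeps the number of non-zero entries per row of the density matrix bounded by a constant $c$ that does not grow with the number of basis functions $N$, then a sparse matrix-matrix product costs
\[
O(c^{2} N) = O(N)
\]
operations, instead of the $O(N^{3})$ of the dense orthogonalisation step it replaces, which is the origin of the linear scaling with system size.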

This test case is expected to scale efficiently to 4000+ nodes.

\subsection{GPAW\label{ref-0064}}

GPAW is a DFT program for ab-initio electronic structure calculations using the projector augmented wave method. It uses a uniform real-space grid representation of the electronic wavefunctions, which allows for excellent computational scalability and systematic convergence properties.

\subsubsection{Code Description.\label{ref-0065}}

GPAW is written mostly in Python, but also includes computational kernels written in C and leverages external libraries such as NumPy, BLAS and ScaLAPACK. Parallelisation is based on message passing using MPI, with no threading. Development branches for GPU and MIC include support for offloading to accelerators using CUDA or pyMIC, respectively. GPAW is freely available under the GPL license.

\subsubsection{Test Cases Description.\label{ref-0066}}

\paragraph{Carbon Nanotube.}

This test case is a ground state calculation for a carbon nanotube in vacuum. By default, it uses a $6-6-10$ nanotube with 240 atoms (freely adjustable) and serial LAPACK with an option to use ScaLAPACK.

This benchmark is aimed at smaller systems, with an intended scaling range of up to 10 nodes.

\paragraph{Copper Filament.}

This test case is a ground state calculation for a copper filament in vacuum. By default, it uses a $2\times2\times3$ FCC lattice with 71 atoms (freely adjustable) and ScaLAPACK for parallelisation.

This benchmark is aimed at larger systems, with an intended scaling range of up to 100 nodes. A lower limit on the number of nodes may be imposed by the amount of memory required, which can be adjusted to some extent with the run parameters (e.g. lattice size or grid spacing).

\subsection{GROMACS\label{ref-0067}}

GROMACS is a versatile package to perform molecular dynamics, i.e. simulate the Newtonian equations of motion for systems with hundreds to millions of particles.

It is primarily designed for biochemical molecules like proteins, lipids and nucleic acids that have a lot of complicated bonded interactions, but since GROMACS is extremely fast at calculating the nonbonded interactions (that usually dominate simulations) many groups are also using it for research on non-biological systems, e.g. polymers.

GROMACS supports all the usual algorithms expected from a modern molecular dynamics implementation, along with a number of additional features.

GROMACS provides very high performance compared to other programs. Many algorithmic optimisations have been introduced in the code; for instance, the calculation of the virial has been extracted from the innermost loops over pairwise interactions, and dedicated software routines are used to calculate the inverse square root. In GROMACS 4.6 and later, on almost all common computing platforms, the innermost loops are written in C using intrinsic functions that the compiler transforms to SIMD machine instructions, in order to exploit the available instruction-level parallelism. These kernels are available in both single and double precision, and support all the different kinds of SIMD instructions found in x86-family (and other) processors.
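As an illustration of this style of kernel (a minimal sketch using SSE intrinsics, not code taken from GROMACS), an approximate inverse square root over four packed floats can be computed from the hardware estimate refined by one Newton-Raphson step:

\begin{verbatim}
#include <xmmintrin.h>   /* SSE intrinsics */

/* Approximate 1/sqrt(x) for four packed floats:
   hardware estimate plus one Newton-Raphson refinement step. */
static inline __m128 inv_sqrt4(__m128 x)
{
    const __m128 half  = _mm_set1_ps(0.5f);
    const __m128 three = _mm_set1_ps(3.0f);
    __m128 y   = _mm_rsqrt_ps(x);          /* ~12-bit estimate        */
    __m128 y2  = _mm_mul_ps(y, y);
    __m128 xy2 = _mm_mul_ps(x, y2);
    /* y <- 0.5 * y * (3 - x*y*y), roughly doubling the accurate bits */
    return _mm_mul_ps(_mm_mul_ps(half, y), _mm_sub_ps(three, xy2));
}
\end{verbatim}

Production kernels of this kind are of course far more involved and also target wider vector units (AVX, AVX-512, etc.).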

\subsubsection{Code Description.\label{ref-0068}}

Parallelisation is achieved using combined OpenMP and MPI.

Offloading for accelerators is implemented through CUDA for GPU and through OpenMP for MIC (Intel Xeon Phi).

GROMACS is written in C/C++ and freely available under the GPL license.

\subsubsection{Test Cases Description.\label{ref-0069}}

\paragraph{GluCL Ion Channel.}

The ion channel system is the membrane protein GluCl, a pentameric chloride channel embedded in a lipid bilayer. The GluCl ion channel was embedded in a DOPC membrane and solvated in TIP3P water. This system contains $142\times10^3$ atoms and is quite a challenging parallelisation case due to its small size. However, it is likely one of the most sought-after system sizes for biomolecular simulations, given the importance of these proteins for pharmaceutical applications. It is particularly challenging because the membrane environment is highly inhomogeneous and anisotropic, which makes load balancing with domain decomposition hard.

This test case was used as the ``Small'' test case in previous 2IP and 3IP PRACE phases. It is included in the package's version 5.0 benchmark cases. It is reported to scale efficiently up to 1000+ cores on x86 based systems.

\paragraph{Lignocellulose.}

This test case is a model of cellulose and lignocellulosic biomass in an aqueous solution \cite{ref-0024}. This inhomogeneous system contains 3.3 million atoms and uses reaction-field electrostatics instead of PME, and therefore scales well on x86. It was used as the ``Large'' test case in the previous PRACE 2IP and 3IP projects, where it was reported to scale efficiently up to 10000+ x86 cores.
\subsection{NAMD\label{ref-0070}}

NAMD is a widely used molecular dynamics application designed to simulate bio-molecular systems on a wide variety of compute platforms. NAMD is developed by the ``Theoretical and Computational Biophysics Group'' at the University of Illinois at Urbana-Champaign. In the design of NAMD, particular emphasis has been placed on scalability when utilising a large number of processors. The application can read a wide variety of file formats commonly used in bio-molecular science, for example force fields and protein structures. A NAMD license can be applied for on the developer's website free of charge. Once the license has been obtained, binaries for a number of platforms and the source code can be downloaded from the website. Deployment areas of NAMD include pharmaceutical research by academic and industrial users. NAMD is particularly suitable when the interaction between a number of proteins or between proteins and other chemical substances is of interest. Typical examples are vaccine research and transport processes through cell membrane proteins.

\subsubsection{Code Description.\label{ref-0071}}

NAMD is written in C++ and parallelised using Charm++ parallel objects, which are implemented on top of MPI, supporting both pure MPI and hybrid parallelisation \cite{ref-0025}.

Offloading for accelerators is implemented for both GPU and MIC (Intel Xeon Phi).

\subsubsection{Test Cases Description.\label{ref-0072}}

The datasets are based on the original ``Satellite Tobacco Mosaic Virus'' (STMV) dataset from the official NAMD site. The memory-optimised build of the package and the corresponding datasets are used for benchmarking; the data are converted to the binary format required by the memory-optimised build.

\paragraph{STMV.1M.}

This is the original STMV dataset from the official NAMD site. The system contains roughly 1 million atoms. This data set scales efficiently up to 1000+ x86 Ivy Bridge cores.

\paragraph{STMV.8M.}

This is a $2\times2\times2$ replication of the original STMV dataset from the official NAMD site. The system contains roughly 8 million atoms. This data set scales efficiently up to 6000 x86 Ivy Bridge cores.
\paragraph{STMV.28M.}

This is a $3\times3\times3$ replication of the original STMV dataset from the official NAMD site. The system contains roughly 28 million atoms. This data set also scales efficiently up to 6000 x86 Ivy Bridge cores.
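The atom counts follow directly from the replication factors, since the original STMV system contains slightly more than one million atoms:
\[
2\times2\times2 = 8 \ \mbox{replicas} \;\Rightarrow\; \approx 8 \ \mbox{million atoms},
\qquad
3\times3\times3 = 27 \ \mbox{replicas} \;\Rightarrow\; \approx 28 \ \mbox{million atoms}.
\]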
\subsection{PFARM\label{ref-0073}}

PFARM is part of a suite of programs based on the ``R-matrix'' ab-initio approach to the variational solution of the many-electron Schr\"{o}dinger equation for electron-atom and electron-ion scattering. The package has been used to calculate electron collision data for astrophysical applications (such as the interstellar medium and planetary atmospheres) with, for example, various ions of Fe and Ni and neutral O, as well as for other applications such as plasma modelling and fusion reactor impurities. The code has recently been adapted to form a compatible interface with the UKRmol suite of codes for electron (positron) molecule collisions, thus enabling large-scale parallel `outer-region' calculations for molecular as well as atomic systems.
\subsubsection{Code Description.\label{ref-0074}}

In order to enable efficient computation, the external region calculation takes place in two distinct stages, named EXDIG and EXAS, with intermediate files linking the two. EXDIG is dominated by the assembly of sector Hamiltonian matrices and their subsequent eigensolutions. EXAS uses a combined functional/domain decomposition approach, where good load balancing is essential to maintain efficient parallel performance. Each of the main stages in the calculation is written in Fortran 2003 (or Fortran 2003-compliant Fortran 95), is parallelised using MPI, and is designed to take advantage of highly optimised numerical library routines. Hybrid MPI/OpenMP parallelisation has also been introduced into the code via shared-memory enabled numerical library kernels.

Accelerator-based versions have been developed for both EXDIG and EXAS. EXDIG uses offloading via MAGMA (or MKL) for the sector Hamiltonian diagonalisations on Intel Xeon Phi and GPU accelerators. EXAS uses combined MPI and OpenMP to distribute the scattering energy calculations efficiently both across and within Intel Xeon Phi co-processors.

\subsubsection{Test Cases Description.\label{ref-0075}}

External region R-matrix propagations take place over the outer partition of configuration space, including the region where long-range potentials remain important. The radius of this region is determined from the user input and the program decides upon the best strategy for dividing this space into multiple sub-regions (or sectors). Generally, a choice of larger sector lengths requires the application of larger numbers of basis functions (and therefore larger Hamiltonian matrices) in order to maintain accuracy across the sector and vice-versa. Memory limits on the target hardware may determine the final preferred configuration for each test case.

\paragraph{Iron, $\mathrm{FeIII}$.}

This is an electron-ion scattering case with 1181 channels. Hamiltonian assembly in the coarse energy region applies 10 Legendre functions, leading to Hamiltonian matrix diagonalisations of order 11810. In the `fine energy region' up to 30 Legendre functions may be applied, leading to Hamiltonian matrices of order up to 35430. The number of sector calculations is likely to range from about 15 to over 30, depending on the user specifications. Several thousand scattering energies are used in the calculation.
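The matrix orders quoted above are simply the product of the number of channels and the number of Legendre basis functions per channel:
\[
1181 \times 10 = 11810, \qquad 1181 \times 30 = 35430 .
\]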

\paragraph{Methane, $\mathrm{CH}_4$.}

The dataset is an electron-molecule calculation with 1361 channels. Hamiltonian dimensions are therefore estimated to lie between 13610 and $\sim40000$. A procedure in the code which splits the constituent channels according to spin can be used to approximately halve the Hamiltonian size (whilst doubling the overall number of Hamiltonian matrices). As eigensolvers generally require $O(N^3)$ operations, spin splitting leads to a saving in both memory requirements and operation count. The final radius of the external region required is relatively large, leading to more numerous sector calculations (estimated at between 20 and 30). The calculation will require many thousands of scattering energies.
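The saving can be quantified with a simple estimate: replacing one diagonalisation of order $N$ by two of order $N/2$ changes the operation count from
\[
O(N^{3}) \quad \longrightarrow \quad 2\, O\!\left( (N/2)^{3} \right) = \frac{1}{4}\, O(N^{3}),
\]
while the memory needed for each dense matrix drops from $N^{2}$ to $(N/2)^{2}$ entries, i.e. a factor of four per matrix (a factor of two overall, since there are twice as many matrices).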

In the current model, parallelism in EXDIG is limited to the number of sector calculations, i.e. a maximum of around 30 accelerator nodes.

Methane is a relatively new dataset which has not yet been run at very large scale on novel technology platforms, so this is somewhat a step into the unknown. We also rely on collaborative partners who are not associated with PRACE to continue developing and fine-tuning the accelerator-based EXAS program for this proposed work. Access to suitable hardware, with throughput suited to development cycles, is also a necessity if adequate progress is to be ensured.

\subsection{QCD\label{sec:codes-qcd}}

Matter consists of atoms, which in turn consist of nuclei and electrons. The nuclei consist of neutrons and protons, which are themselves made up of quarks bound together by gluons.

The theory of how quarks and gluons interact to form nucleons and other elementary particles is called Quantum Chromo Dynamics (QCD). For most problems of interest, it is not possible to solve QCD analytically, and instead numerical simulations must be performed. Such ``Lattice QCD'' calculations are very computationally intensive, and occupy a significant percentage of all HPC resources worldwide.

\subsubsection{Code Description.\label{ref-0077}}

The QCD benchmark comprises two different implementations, described below.

\paragraph{First Implementation.}

The MILC code is a freely-available suite for performing Lattice QCD simulations, developed over many years by a collaboration of researchers \cite{ref-0030}.

The benchmark used here is derived from the MILC code (v6) and consists of a full conjugate gradient solution using Wilson fermions. The benchmark is consistent with ``QCD kernel E'' of the full UEABS, and has been adapted so that it can efficiently use accelerators as well as traditional CPUs.

The implementation for accelerators has been achieved using the ``targetDP'' programming model \cite{ref-0031}, a lightweight abstraction layer designed to allow the same application source code to target multiple architectures, e.g. NVIDIA GPU and multicore/manycore CPU, in a performance-portable manner. The targetDP syntax maps, at compile time, to either NVIDIA CUDA (for execution on GPU) or OpenMP plus vectorisation (for execution on multi/manycore CPU, including Intel Xeon Phi). The base language of the benchmark is C, and MPI is used for node-level parallelism.
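The following fragment sketches the general idea of such an abstraction layer (purely illustrative pseudo-macros, not the actual targetDP API): the same loop body is either expanded into an OpenMP-parallel, vectorised loop or, in a CUDA build, emitted as a kernel launch.

\begin{verbatim}
/* Illustrative only -- not the real targetDP syntax.  A single macro
   hides how the loop over lattice sites is parallelised.             */
#if defined(TARGET_CUDA)
/* A CUDA build would expand the body into a __global__ kernel plus a
   <<<grid, block>>> launch; omitted here to keep the sketch short.   */
#define PARALLEL_LOOP(i, n) for (int i = 0; i < (n); i++)  /* stub */
#else
/* CPU build: thread-level parallelism plus vectorisation via OpenMP. */
#define PARALLEL_LOOP(i, n) \
    _Pragma("omp parallel for simd") \
    for (int i = 0; i < (n); i++)
#endif

void scale_field(double *phi, double a, int nsites)
{
    PARALLEL_LOOP(i, nsites) {
        phi[i] *= a;            /* identical loop body on every target */
    }
}
\end{verbatim}

The benefit is that performance-critical loops are written once and specialised for each architecture at compile time.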
\paragraph{Second Implementation.}

The second implementation (Part 2 of the QCD Accelerator Benchmark suite) consists of two kernels based on the QUDA \cite{ref-0027} and QPhiX \cite{ref-0028} libraries. QUDA is based on CUDA and optimised for NVIDIA GPUs \cite{ref-0032}. QPhiX consists of routines optimised with Intel intrinsic functions for multiple vector lengths, including routines tuned for KNC and KNL \cite{ref-0033}. In both cases, the benchmark kernel uses the conjugate gradient solvers implemented within the libraries.
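For reference, the structure of the conjugate gradient iteration exercised by both kernels is sketched below (a generic, dense-matrix toy version for a symmetric positive-definite system $Ax=b$; the actual libraries apply the same iteration to the lattice Dirac operator with highly optimised matrix-vector products):

\begin{verbatim}
#include <stdlib.h>
#include <math.h>

/* y <- A x for a dense symmetric positive-definite n x n matrix A. */
static void matvec(int n, const double *A, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double s = 0.0;
        for (int j = 0; j < n; j++) s += A[i * n + j] * x[j];
        y[i] = s;
    }
}

static double dot(int n, const double *a, const double *b)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) s += a[i] * b[i];
    return s;
}

/* Plain conjugate gradient; returns the number of iterations used. */
int cg_solve(int n, const double *A, const double *b, double *x,
             double tol, int maxit)
{
    double *r  = malloc(n * sizeof *r);
    double *p  = malloc(n * sizeof *p);
    double *Ap = malloc(n * sizeof *Ap);
    int k;

    matvec(n, A, x, Ap);                       /* r = b - A x0        */
    for (int i = 0; i < n; i++) { r[i] = b[i] - Ap[i]; p[i] = r[i]; }
    double rr = dot(n, r, r);

    for (k = 0; k < maxit && sqrt(rr) > tol; k++) {
        matvec(n, A, p, Ap);
        double alpha = rr / dot(n, p, Ap);     /* step length         */
        for (int i = 0; i < n; i++) { x[i] += alpha * p[i];
                                      r[i] -= alpha * Ap[i]; }
        double rr_new = dot(n, r, r);
        double beta = rr_new / rr;             /* direction update    */
        for (int i = 0; i < n; i++) p[i] = r[i] + beta * p[i];
        rr = rr_new;
    }
    free(r); free(p); free(Ap);
    return k;
}
\end{verbatim}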
\subsubsection{Test Cases Description.\label{ref-0078}}

Lattice QCD involves the discretisation of space-time into a lattice of points, where the extent of the lattice in each of the three spatial dimensions and the temporal dimension can be chosen. This makes the benchmark very flexible: the lattice size can either be varied with the size of the computing system in use (weak scaling) or kept fixed (strong scaling). For testing on a single node, $64\times64\times32\times8$ is a reasonable size, since it fits on a single Intel Xeon Phi or a single GPU. For larger numbers of nodes, the lattice extents can be increased accordingly, keeping the geometric shape roughly similar. The test cases for the second implementation are given by a strong-scaling mode with lattice sizes of $32\times32\times32\times96$ and $64\times64\times64\times128$, and a weak-scaling mode with a local lattice size of $48\times48\times48\times24$.
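To give a sense of the problem sizes, the number of lattice sites for the single-node lattice and the two strong-scaling lattices is
\[
64\times64\times32\times8 \approx 1.0\times10^{6}, \qquad
32\times32\times32\times96 \approx 3.1\times10^{6}, \qquad
64\times64\times64\times128 \approx 3.4\times10^{7},
\]
with several complex degrees of freedom stored per site, so the largest case is roughly 30 times larger than the single-node one.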
\subsection{QUANTUM ESPRESSO\label{ref-0079}}

QUANTUM ESPRESSO is an integrated suite of computer codes for electronic-structure calculations and materials modelling, based on density-functional theory, plane waves, and pseudopotentials (norm-conserving, ultrasoft, and projector-augmented wave). QUANTUM ESPRESSO stands for \textit{opEn Source Package for Research in Electronic Structure, Simulation, and Optimisation}. It is freely available to researchers around the world under the terms of the GNU General Public License. QUANTUM ESPRESSO builds upon newly restructured electronic-structure codes that have been developed and tested by some of the original authors of novel electronic-structure algorithms and applied in the last twenty years by some of the leading materials modelling groups worldwide. Innovation and efficiency are still its main focus, with special attention paid to massively parallel architectures, and a great effort being devoted to user friendliness. QUANTUM ESPRESSO is evolving towards a distribution of independent and inter-operable codes in the spirit of an open-source project, where researchers active in the field of electronic-structure calculations are encouraged to participate in the project by contributing their own codes or by implementing their own ideas into existing codes.

QUANTUM ESPRESSO is written mostly in Fortran 90, is parallelised using MPI and OpenMP, and is released under the GPL license.

\subsubsection{Code Description.\label{ref-0080}}

During 2011, a GPU-enabled version of QUANTUM ESPRESSO was publicly released. The code is currently developed and maintained by Filippo Spiga at the High Performance Computing Service, University of Cambridge (United Kingdom), and Ivan Girotto at the International Centre for Theoretical Physics (Italy). The initial work was supported by the EC-funded PRACE project and by an SFI grant (Science Foundation Ireland, grant 08/HEC/I1450). At the time of writing, the project is self-sustained thanks to the dedication of the people involved and to NVIDIA's support in providing hardware and expertise in GPU programming.

The current public version of QE-GPU is 14.10.0, which is the last version maintained as a plug-in working with all QE 5.x versions. QE-GPU uses phiGEMM (external) for CPU+GPU GEMM computations, MAGMA (external) to accelerate eigensolvers, and explicit CUDA kernels to accelerate compute-intensive routines. FFT capabilities on GPU are available only for serial computation, due to the difficulty of managing accelerators in the parallel distributed 3D-FFT portion of the code, where communication is the dominant factor limiting scalability beyond hundreds of MPI ranks.

A version for Intel Xeon Phi (MIC) accelerators is not currently available; the standard x86 version has been used on KNL.
\subsubsection{Test Cases Description.\label{ref-0081}}

\paragraph{PW-IRMOF\_M11.}

This is a full SCF calculation of a Zn-based isoreticular metal--organic framework (130 atoms in total) at a single k-point. Benchmarks run in 2012 demonstrated speed-ups due to GPU (NVIDIA K20s, with respect to non-accelerated nodes) in the range 1.37--1.87, depending on node count (with a maximum of 8 accelerators). Runs with current hardware and an updated version of the code are expected to exhibit higher speed-ups (probably 2--3x) and to scale up to a couple of hundred nodes.

\paragraph{PW-SiGe432.}

This is an SCF calculation of a silicon-germanium crystal with 430 atoms. Being a fairly large system, parallel scalability up to several hundred, perhaps a thousand nodes is expected, with accelerated speed-ups likely to be around 2--3x.

\subsection{Synthetic benchmarks -- SHOC\label{ref-0082}}

The Accelerator Benchmark Suite also includes a series of synthetic benchmarks. For this purpose, we chose the Scalable HeterOgeneous Computing (SHOC) benchmark suite, augmented with a series of benchmark examples developed internally. SHOC is a collection of benchmark programs testing the performance and stability of systems using computing devices with non-traditional architectures for general-purpose computing. Its initial focus was on systems containing GPUs and multi-core processors and on the OpenCL programming standard, but CUDA and OpenACC versions have since been added. Moreover, a subset of the benchmarks is optimised for the Intel Xeon Phi coprocessor. SHOC can be used on clusters as well as on individual hosts.

The SHOC benchmark suite currently contains benchmark programs categorised by complexity. Some measure low-level ``feeds and speeds'' behaviour (Level 0), some measure the performance of higher-level operations such as the Fast Fourier Transform (FFT) (Level 1), and the others measure real application kernels (Level 2).

The SHOC benchmark suite has been selected to evaluate the performance of accelerators on synthetic benchmarks, mostly because SHOC provides CUDA / OpenCL / Offload / OpenACC variants of the benchmarks. This allowed us to evaluate NVIDIA GPUs (with CUDA / OpenCL / OpenACC), the Intel Xeon Phi KNC (with both Offload and OpenCL), and also Intel host CPUs (with OpenCL / OpenACC). However, on the latest Xeon Phi processor (codenamed KNL) none of these four models is supported. Benchmarks therefore cannot be run on the KNL architecture at this point, and there is no indication that Intel will support OpenCL on the KNL. However, work is in progress on the PGI compiler to support the KNL as a target; this support is expected during 2017 and will allow us to compile and run the OpenACC benchmarks on the KNL. Alternatively, the OpenACC benchmarks will be ported to OpenMP and executed on the KNL.
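For simple kernels the port is largely mechanical. As a hedged illustration (a generic loop, not code from an actual SHOC benchmark), an OpenACC parallel loop and a plausible OpenMP equivalent for self-hosted execution on KNL could look as follows:

\begin{verbatim}
/* Generic vector-triad loop, shown only to illustrate the mapping
   between OpenACC and OpenMP directives.                            */

void triad_acc(int n, const float *a, const float *b, float s, float *c)
{
#pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
    for (int i = 0; i < n; i++)
        c[i] = a[i] + s * b[i];
}

/* On a self-hosted KNL there is no separate device to offload to, so
   plain OpenMP threading plus SIMD vectorisation is usually enough.  */
void triad_omp(int n, const float *a, const float *b, float s, float *c)
{
#pragma omp parallel for simd
    for (int i = 0; i < n; i++)
        c[i] = a[i] + s * b[i];
}
\end{verbatim}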
\subsubsection{Code Description.\label{ref-0083}}

All benchmarks are MPI-enabled. Some will report aggregate metrics over all MPI ranks, others will only perform work for specific ranks.

Offloading for accelerators is implemented through CUDA and OpenCL for GPU and through OpenMP for MIC (Intel Xeon Phi). For selected benchmarks OpenACC implementations are provided for GPU. Multi-node parallelisation is achieved using MPI.

SHOC is written in C++ and is open-source and freely available.

\subsubsection{Test Cases Description.\label{ref-0084}}

The benchmarks contained in SHOC currently feature four problem sizes (numbered 1 to 4) for increasingly large systems. The size convention is as follows:

\begin{itemize}
\item Size 1: CPU / debugging
\item Size 2: Mobile/integrated GPU
\item Size 3: Discrete GPU (e.g. GeForce or Radeon series)
\item Size 4: HPC-focused or large-memory GPU (e.g. Tesla or FireStream series)
\end{itemize}

In order to go to even larger scales, we plan to add a fifth size level aimed at massive supercomputers.

\subsection{SPECFEM3D\_GLOBE\label{ref-0085}}

The software package SPECFEM3D\_GLOBE simulates three-dimensional global and regional seismic wave propagation based upon the spectral-element method (SEM). All SPECFEM3D\_GLOBE software is written in Fortran 90 with full portability in mind, conforms strictly to the Fortran 95 standard, and uses no obsolete or obsolescent features of Fortran 77. The package uses parallel programming based upon the Message Passing Interface (MPI).

The SEM was originally developed in computational fluid dynamics and has been successfully adapted to address problems in seismic wave propagation. It is a continuous Galerkin technique, which can easily be made discontinuous; it is then close to a particular case of the discontinuous Galerkin technique, with optimised efficiency because of its tensorised basis functions. In particular, it can accurately handle very distorted mesh elements. It has very good accuracy and convergence properties. The spectral-element approach admits spectral rates of convergence and allows exploiting hp-convergence schemes. It is also very well suited to parallel implementation on very large supercomputers as well as on clusters of GPU-accelerated nodes. Tensor products inside each element can be optimised to reach very high efficiency, and mesh point and element numbering can be optimised to reduce processor cache misses and improve cache reuse. The SEM can also handle triangular (in 2D) or tetrahedral (in 3D) elements as well as mixed meshes, although with increased cost and reduced accuracy in these elements, as in the discontinuous Galerkin method.

In many geological models in the context of seismic wave propagation studies (except, for instance, for fault dynamic rupture studies, in which very high frequencies of supershear rupture need to be modelled near the fault), a continuous formulation is sufficient because material property contrasts are not drastic and thus conforming mesh-doubling bricks can efficiently handle mesh size variations. This is particularly true at the scale of the full Earth. Effects due to lateral variations in compressional-wave speed, shear-wave speed, density, a 3D crustal model, ellipticity, topography and bathymetry, the oceans, rotation, and self-gravitation are included. The package can accommodate full 21-parameter anisotropy as well as lateral variations in attenuation. Adjoint capabilities and finite-frequency kernel simulations are also included.

\subsubsection{Test Cases Description.\label{ref-0086}}

Both test cases will use the same input data. A 3D shear-wave speed model (S362ANI) will be used to benchmark the code.

Here is an explanation of the simulation parameters that are used to size the test cases:

\begin{itemize}
\item \verb+NCHUNKS+, the number of faces of the cubed sphere included in the simulation (always 6 here);
\item \verb+NPROC_XI+, the number of slices along one side of a chunk of the cubed sphere (this also determines the number of processes used per chunk, as shown below);
\item \verb+NEX_XI+, the number of spectral elements along one side of a chunk;
\item \verb+RECORD_LENGTH_IN_MINUTES+, the length of the simulated seismograms; the simulation time should vary linearly with this parameter.
\end{itemize}
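Assuming the standard SPECFEM3D\_GLOBE setup for global runs, in which each chunk is decomposed into \verb+NPROC_XI+ $\times$ \verb+NPROC_ETA+ slices with \verb+NPROC_ETA+ equal to \verb+NPROC_XI+, the total number of MPI tasks is
\[
N_{\mathrm{MPI}} = \mathtt{NCHUNKS} \times \mathtt{NPROC\_XI} \times \mathtt{NPROC\_ETA},
\]
which gives $6\times2\times2=24$ tasks for the small test case below and $6\times5\times5=150$ tasks for the bigger one.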

\paragraph{Small Test Case.}

It runs with 24 MPI tasks and has the following mesh characteristics:

\begin{itemize}
\item \verb+NCHUNKS = 6+
\item \verb+NPROC_XI = 2+
\item \verb+NEX_XI = 80+
\item \verb+RECORD_LENGTH_IN_MINUTES = 2.0+
\end{itemize}

\paragraph{Bigger Test Case.}

It runs with 150 MPI tasks and has the following mesh characteristics:

\begin{itemize}
\item \verb+NCHUNKS = 6+
\item \verb+NPROC_XI = 5+
\item \verb+NEX_XI = 80+
\item \verb+RECORD_LENGTH_IN_MINUTES = 2.0+
\end{itemize}