% docx2tex --- ``Garbage In, Garbage Out'' 
% 
% docx2tex is Open Source and 
% you can download it on GitHub: 
% https://github.com/transpect/docx2tex 
% 
\documentclass{scrbook} 
\usepackage{graphicx} 
\usepackage{hyperref} 
\usepackage{multirow} 
\usepackage{tabularx} 
\usepackage{color} 
\usepackage{amsmath} 
\usepackage{amssymb} 
\usepackage{amsfonts} 
\usepackage{amsxtra} 
\usepackage{wasysym} 
\usepackage{isomath} 
\usepackage{mathtools} 
\usepackage{txfonts} 
\usepackage{upgreek} 
\usepackage{enumerate} 
\usepackage{tensor} 
\usepackage{pifont} 
 
 
 
 
 
\usepackage[english]{babel}
\definecolor{color-1}{rgb}{0.91,0.9,0.9}
\definecolor{color-2}{rgb}{1,1,1}
\definecolor{color-3}{rgb}{0.85,0.85,0.85}
\definecolor{color-4}{rgb}{1,0,0}
\begin{document}
\includegraphics[width=1\textwidth]{d7.5_4IP_1.0.docx.tmp/word/media/image1.jpeg}

\textbf{E-Infrastructures}

\textbf{H2020-EINFRA-2014-2015}

\textbf{EINFRA-4-2014: Pan-European High Performance Computing}

\textbf{Infrastructure and Services}

\textbf{PRACE-4IP}

\textbf{PRACE Fourth Implementation Phase Project}\label{ref-0001}

\textbf{Grant Agreement Number: \label{ref-0002}EINFRA-653838}

\textbf{D7.5}\label{ref-0003}

\textbf{Application performance on accelerators}\label{ref-0004}

\textbf{\textit{Final }} \label{ref-0005}

Version: \label{ref-0006}1.0

Author(s): \label{ref-0007}Victor Cameo Ponz, CINES

Date: 24.03.2017

\textbf{Project and Deliverable Information Sheet\label{ref-0008}}

\begin{table}
\begin{tabularx}{\textwidth}{
p{\dimexpr 0.23\linewidth-2\tabcolsep}
p{\dimexpr 0.29\linewidth-2\tabcolsep}
p{\dimexpr 0.48\linewidth-2\tabcolsep}}
\multirow{7}{*}{\textbf{PRACE Project}}& \multicolumn{2}{l}{\textbf{Project Ref. №:} \textbf{{\hyperref[ref-0002]{EINFRA-653838}}}} \\
\cline{2-2}\cline{3-3} & \multicolumn{2}{l}{\textbf{Project Title:} \textbf{{\hyperref[ref-0001]{PRACE Fourth Implementation Phase Project}}}} \\
\cline{2-2}\cline{3-3} & \multicolumn{2}{l}{\textbf{Project Web Site:} \href{http://www.prace-project.eu}{http://www.prace-project.eu}} \\
\cline{2-2}\cline{3-3} & \multicolumn{2}{l}{\textbf{Deliverable ID:} {\textless} \textbf{{\hyperref[ref-0003]{D7.5}}}{\textgreater}} \\
\cline{2-2}\cline{3-3} & \multicolumn{2}{l}{\textbf{Deliverable Nature:} {\textless}DOC\_TYPE: Report / Other{\textgreater}} \\
\cline{2-2}\cline{3-3} & \multirow{1}{*}{\textbf{Dissemination Level:}\par PU}& \textbf{Contractual Date of Delivery:}\par 31/03/2017 \\
\cline{3-3} & & \textbf{Actual Date of Delivery:}\par DD / Month / YYYY \\
\cline{2-2}\cline{3-3} & \multicolumn{2}{l}{\textbf{EC Project Officer:} \textbf{Leonardo Flores A\~{n}over}} \\

\end{tabularx}

\end{table}

* -- The dissemination levels are indicated as follows: \textbf{PU} -- Public, \textbf{CO} -- Confidential, only for members of the consortium (including the Commission Services), \textbf{CL} -- Classified, as referred to in Commission Decision 2001/844/EC.

\textbf{Document Control Sheet\label{ref-0009}}

\begin{table}
\begin{tabularx}{\textwidth}{
p{\dimexpr 0.23\linewidth-2\tabcolsep}
p{\dimexpr 0.29\linewidth-2\tabcolsep}
p{\dimexpr 0.48\linewidth-2\tabcolsep}}
\multirow{5}{*}{\textbf{Document}}& \multicolumn{2}{l}{\textbf{Title:} \textbf{{\hyperref[ref-0004]{Application performance on accelerators}}}} \\
\cline{2-2}\cline{3-3} & \multicolumn{2}{l}{\textbf{ID:} \textbf{{\hyperref[ref-0003]{D7.5}}} } \\
\cline{2-2}\cline{3-3} & \textbf{Version:} {\textless}{\hyperref[ref-0006]{1.0}}{\textgreater} & \textbf{Status:} \textbf{\textit{{\hyperref[ref-0005]{Final}}}} \\
\cline{2-2}\cline{3-3} & \multicolumn{2}{l}{\textbf{Available at:} \href{http://www.prace-project.eu}{http://www.prace-project.eu}} \\
\cline{2-2}\cline{3-3} & \multicolumn{2}{l}{\textbf{Software Tool:} Microsoft Word 2010} \\
\cline{2-2}\cline{3-3} & \multicolumn{2}{l}{\textbf{File(s):} d7.5\_4IP\_1.0.docx} \\
\multirow{3}{*}{\textbf{Authorship}}& \textbf{Written by:} & {\hyperref[ref-0007]{Victor Cameo Ponz,}}CINES \\
\cline{2-2}\cline{3-3} & \textbf{Contributors:} & Adem Tekin, ITU\par Alan Grey, EPCC\par Andrew Emerson, CINECA\par Andrew Sunderland, STFC\par Arno Proeme, EPCC\par Charles Moulinec, STFC\par Dimitris Dellis, GRNET\par Fiona Reid, EPCC\par Gabriel Hautreux, INRIA\par Jacob Finkenrath, CyI\par James Clark, STFC\par Janko Strassburg, BSC\par Jorge Rodriguez, BSC\par Martti Louhivuori, CSC\par Philippe Segers, GENCI\par Valeriu Codreanu, SURFSARA \\
\cline{2-2}\cline{3-3} & \textbf{Reviewed by:} & Filip Stanek, IT4I\par Thomas Eickermann, FZJ \\
\cline{2-2}\cline{3-3} & \textbf{Approved by:} & MB/TB \\

\end{tabularx}

\end{table}

\textbf{Document Status Sheet\label{ref-0010}}

\begin{table}
\begin{tabularx}{\textwidth}{
p{\dimexpr 0.24\linewidth-2\tabcolsep}
p{\dimexpr 0.24\linewidth-2\tabcolsep}
p{\dimexpr 0.24\linewidth-2\tabcolsep}
p{\dimexpr 0.29\linewidth-2\tabcolsep}}
\textbf{Version} & \textbf{Date} & \textbf{Status} & \textbf{Comments} \\
0.1 & 13/03/2017 & Draft & First revision \\
0.2 & 15/03/2017 & Draft & Includes remarks from the first review + new figures \\
1.0 & 24/03/2017 & Final version & Improved the application performance section \\

\end{tabularx}

\end{table}

\textbf{Document Keywords \label{ref-0011}}

\begin{table}
\begin{tabularx}{\textwidth}{
p{\dimexpr 0.24\linewidth-2\tabcolsep}
p{\dimexpr 0.76\linewidth-2\tabcolsep}}
\textbf{Keywords:} & PRACE, HPC, Research Infrastructure, Accelerators, GPU, Xeon Phi, Benchmark suite \\

\end{tabularx}

\end{table}

\textbf{Disclaimer}

This deliverable has been prepared by the responsible Work Package of the Project in accordance with the Consortium Agreement and the Grant Agreement n$^{\circ}$ {\hyperref[ref-0002]{EINFRA-653838}}. It solely reflects the opinion of the parties to such agreements on a collective basis in the context of the Project and to the extent foreseen in such agreements. Please note that even though all participants to the Project are members of PRACE AISBL, this deliverable has not been approved by the Council of PRACE AISBL and therefore does not emanate from it nor should it be considered to reflect PRACE AISBL's individual opinion.

\begin{table}
\begin{tabularx}{\textwidth}{
p{\dimexpr 1\linewidth-2\tabcolsep}}
\textbf{Copyright notices}\par {\textcopyright} 2016 PRACE Consortium Partners. All rights reserved. This document is a project document of the PRACE project. All contents are reserved by default and may not be disclosed to third parties without the written consent of the PRACE partners, except as mandated by the European Commission contract {\hyperref[ref-0002]{EINFRA-653838}} for reviewing and dissemination purposes. \par All trademarks and other rights on third party products mentioned in this document are acknowledged as own by the respective holders. \\

\end{tabularx}

\end{table}

\textbf{Table of Contents\label{ref-0012}}

\textbf{Project and Deliverable Information Sheet \pageref{ref-0008}}

\textbf{Document Control Sheet \pageref{ref-0009}}

\textbf{Document Status Sheet \pageref{ref-0010}}

\textbf{Document Keywords \pageref{ref-0011}}

\textbf{Table of Contents \pageref{ref-0012}}

\textbf{List of Figures \pageref{ref-0013}}

\textbf{List of Tables \pageref{ref-0014}}

\textbf{References and Applicable Documents \pageref{ref-0015}}

\textbf{List of Acronyms and Abbreviations \pageref{ref-0036}}

\textbf{List of Project Partner Acronyms \pageref{ref-0037}}

\textbf{Executive Summary \pageref{ref-0038}}

\textbf{1 Introduction \pageref{ref-0039}}

\textbf{2 Targeted architectures \pageref{ref-0041}}

\textbf{2.1 Co-processor description \pageref{ref-0042}}

\textbf{2.2 Systems description \pageref{ref-0045}}

\textit{2.2.1 Cartesius K40 \pageref{ref-0047}}

\textit{2.2.2 MareNostrum KNC \pageref{ref-0048}}

\textit{2.2.3 Ouessant P100 \pageref{ref-0049}}

\textit{2.2.4 Frioul KNL \pageref{ref-0050}}

\textbf{3 Benchmark suite description \pageref{ref-0052}}

\textbf{3.1 Alya \pageref{ref-0055}}

\textit{3.1.1 Code description \pageref{ref-0056}}

\textit{3.1.2 Test cases description \pageref{ref-0057}}

\textbf{3.2 Code\_Saturne \pageref{ref-0058}}

\textit{3.2.1 Code description \pageref{ref-0059}}

\textit{3.2.2 Test cases description \pageref{ref-0060}}

\textbf{3.3 CP2K \pageref{ref-0061}}

\textit{3.3.1 Code description \pageref{ref-0062}}

\textit{3.3.2 Test cases description \pageref{ref-0063}}

\textbf{3.4 GPAW \pageref{ref-0064}}

\textit{3.4.1 Code description \pageref{ref-0065}}

\textit{3.4.2 Test cases description \pageref{ref-0066}}

\textbf{3.5 GROMACS \pageref{ref-0067}}

\textit{3.5.1 Code description \pageref{ref-0068}}

\textit{3.5.2 Test cases description \pageref{ref-0069}}

\textbf{3.6 NAMD \pageref{ref-0070}}

\textit{3.6.1 Code description \pageref{ref-0071}}

\textit{3.6.2 Test cases description \pageref{ref-0072}}

\textbf{3.7 PFARM \pageref{ref-0073}}

\textit{3.7.1 Code description \pageref{ref-0074}}

\textit{3.7.2 Test cases description \pageref{ref-0075}}

\textbf{3.8 QCD \pageref{ref-0076}}

\textit{3.8.1 Code description \pageref{ref-0077}}

\textit{3.8.2 Test cases description \pageref{ref-0078}}

\textbf{3.9 Quantum Espresso \pageref{ref-0079}}

\textit{3.9.1 Code description \pageref{ref-0080}}

\textit{3.9.2 Test cases description \pageref{ref-0081}}

\textbf{3.10 Synthetic benchmarks -- SHOC \pageref{ref-0082}}

\textit{3.10.1 Code description \pageref{ref-0083}}

\textit{3.10.2 Test cases description \pageref{ref-0084}}

\textbf{3.11 SPECFEM3D \pageref{ref-0085}}

\textit{3.11.1 Test cases definition \pageref{ref-0086}}

\textbf{4 Applications performances \pageref{ref-0088}}

\textbf{4.1 Alya \pageref{ref-0089}}

\textbf{4.2 Code\_Saturne \pageref{ref-0094}}

\textbf{4.3 CP2K \pageref{ref-0101}}

\textbf{4.4 GPAW \pageref{ref-0104}}

\textbf{4.5 GROMACS \pageref{ref-0110}}

\textbf{4.6 NAMD \pageref{ref-0113}}

\textbf{4.7 PFARM \pageref{ref-0116}}

\textbf{4.8 QCD \pageref{ref-0123}}

\textit{4.8.1 First implementation \pageref{ref-0124}}

\textit{4.8.2 Second implementation \pageref{ref-0131}}

\textbf{4.9 Quantum Espresso \pageref{ref-0142}}

\textbf{4.10 Synthetic benchmarks (SHOC) \pageref{ref-0152}}

\textbf{4.11 SPECFEM3D \pageref{ref-0155}}

\textbf{5 Conclusion and future work \pageref{ref-0158}}

\textbf{List of Figures\label{ref-0013}}

{\hyperref[ref-0090]{Figure 1 Shows the matrix construction part of Alya that is parallelised with OpenMP and benefits significantly from the many cores available on KNL.}} {\hyperref[ref-0090]{ }}

{\hyperref[ref-0091]{Figure 2 Demonstrates the scalability of the code. As expected Haswell cores with K80 GPU are high-performing while the KNL port is currently being optimized further.}} {\hyperref[ref-0091]{ }}

{\hyperref[ref-0093]{Figure 3 Best performance is achieved with GPU in combination with powerful CPU cores. Single thread performance has a big impact on the speedup, both threading and vectorization are employed for additional performance.}} {\hyperref[ref-0093]{ }}

{\hyperref[ref-0096]{Figure 4 Code\_Saturne's performance on KNL. AMG is used as a solver in V4.2.2.}} {\hyperref[ref-0096]{ }}

{\hyperref[ref-0103]{Figure 5 Test case 1 of CP2K on the ARCHER cluster}} {\hyperref[ref-0103]{ }}

{\hyperref[ref-0109]{Figure 6 Relative performance (t$_{0}$ / t) of GPAW is shown for parallel jobs using an increasing number of CPU (blue) or Xeon Phi KNC (red). Single CPU SCF-cycle runtime (t$_{0}$) was used as the baseline for the normalisation. Ideal scaling is shown as a linear dashed line for comparison. Case 1 (Carbon Nanotube) is shown with square markers and Case 2 (Copper Filament) is shown with round markers.}} {\hyperref[ref-0109]{ }}

{\hyperref[ref-0111]{Figure 7 Scalability for GROMACS test case GluCL Ion Channel}} {\hyperref[ref-0111]{ }}

{\hyperref[ref-0112]{Figure 8 Scalability for GROMACS test case Lignocellulose}} {\hyperref[ref-0112]{ }}

{\hyperref[ref-0114]{Figure 9 Scalability for NAMD test case STMV.8M}} {\hyperref[ref-0114]{ }}

{\hyperref[ref-0115]{Figure 10 Scalability for NAMD test case STMV.28M}} {\hyperref[ref-0115]{ }}

{\hyperref[ref-0118]{Figure 11 Eigensolver performance on KNL and GPU}} {\hyperref[ref-0118]{ }}

{\hyperref[ref-0126]{Figure 12 Small test case results for QCD, first implementation}} {\hyperref[ref-0126]{ }}

{\hyperref[ref-0128]{Figure 13 Large test case results for QCD, first implementation}} {\hyperref[ref-0128]{ }}

{\hyperref[ref-0130]{Figure 14 shows the time taken by the full MILC 64x64x64x8 test cases on traditional CPU, Intel Knights Landing Xeon Phi and NVIDIA P100 (Pascal) GPU architectures.}} {\hyperref[ref-0130]{ }}

{\hyperref[ref-0133]{Figure 15 Result of second implementation of QCD on K40m GPU}} {\hyperref[ref-0133]{ }}

{\hyperref[ref-0135]{Figure 16 Result of second implementation of QCD on P100 GPU}} {\hyperref[ref-0135]{ }}

{\hyperref[ref-0137]{Figure 17 Result of second implementation of QCD on P100 GPU on larger test case}} {\hyperref[ref-0137]{ }}

{\hyperref[ref-0139]{Figure 18 Result of second implementation of QCD on KNC}} {\hyperref[ref-0139]{ }}

{\hyperref[ref-0141]{Figure 19 Result of second implementation of QCD on KNL}} {\hyperref[ref-0141]{ }}

{\hyperref[ref-0144]{Figure 20 Scalability of Quantum Espresso on GPU for test case 1}} {\hyperref[ref-0144]{ }}

{\hyperref[ref-0146]{Figure 21 Scalability of Quantum Espresso on GPU for test case 2}} {\hyperref[ref-0146]{ }}

{\hyperref[ref-0148]{Figure 22 Scalability of Quantum Espresso on KNL for test case 1}} {\hyperref[ref-0148]{ }}

{\hyperref[ref-0150]{Figure 23 Quantum Espresso - KNL vs BDW vs BGQ (at scale)}} {\hyperref[ref-0150]{ }}

\textbf{List of Tables\label{ref-0014}}

Table 1 Main co-processors specifications \pageref{ref-0044}

Table 2 Codes and corresponding APIs available (in green) \pageref{ref-0054}

Table 3 Performance of Code\_Saturne + PETSc on 1 node of the POWER8 clusters. Comparison between 2 different nodes, using different types of CPU and GPU. PETSc is built on LAPACK. The speedup is computed at the ratio between the time to solution on the CPU for a given number of MPI tasks and the time to solution on the CPU/GPU for the same number of MPI tasks. \pageref{ref-0098}

Table 4 Performance of Code\_Saturne and PETSc on 1 node of KNL. PETSc is built on the MKL library \pageref{ref-0100}

Table 5 GPAW runtimes (in seconds) for the smaller benchmark (Carbon Nanotube) measured on several architectures when using n sockets (i.e. processors or accelerators). \pageref{ref-0106}

Table 6 GPAW runtimes (in seconds) for the larger benchmark (Copper Filament) measured on several architectures when using n sockets (i.e. processors or accelerators). *Due to memory limitations on the GPU the grid spacing was increased from 0.22 to 0.28 to have a sparser grid. To account for this in the comparison, the K40 and K80 runtimes have been scaled up using a corresponding CPU runtime as a yardstick (scaling factor q=2.1132). \pageref{ref-0108}

Table 7 Overall EXDIG runtime performance on various accelerators (runtime, secs) \pageref{ref-0120}

Table 8 Overall EXDIG runtime parallel performance using MPI-GPU version \pageref{ref-0122}

Table 9 Synthetic benchmarks results on GPU and Xeon Phi \pageref{ref-0154}

Table 10 SPECFEM 3D GLOBE results (run time in second) \pageref{ref-0156}

\textbf{References and Applicable Documents\label{ref-0015}}

\begin{enumerate}[1]

\item \href{http://www.prace-ri.eu}{\label{ref-0016}http://www.prace-ri.eu} 

\item The Unified European Application Benchmark Suite -- \href{http://www.prace-ri.eu/ueabs/}{http://www.prace-ri.eu/ueabs/}\label{ref-0017}

\item D7.4 Unified European Applications Benchmark Suite -- Mark Bull et al. -- 2013\label{ref-0018}

\item \href{http://www.nvidia.com/object/quadro-design-and-manufacturing.html}{\label{ref-0019}http://www.nvidia.com/object/quadro-design-and-manufacturing.html}

\item \href{https://userinfo.surfsara.nl/systems/cartesius/description}{https://userinfo.surfsara.nl/systems/cartesius/description}\label{ref-0020}

\item MareNostrum III User's Guide Barcelona Supercomputing Center -- \href{https://www.bsc.es/support/MareNostrum3-ug.pdf}{https://www.bsc.es/support/MareNostrum3-ug.pdf}\label{ref-0021}

\item \href{http://www.idris.fr/eng/ouessant/}{http://www.idris.fr/eng/ouessant/}\label{ref-0022}

\item PFARM reference -- \href{https://hpcforge.org/plugins/mediawiki/wiki/pracewp8/images/3/34/Pfarm_long_lug.pdf}{https://hpcforge.org/plugins/mediawiki/wiki/pracewp8/images/3/34/Pfarm\_long\_lug.pdf}\label{ref-0023}

\item Solvent-Driven Preferential Association of Lignin with Regions of Crystalline Cellulose in Molecular Dynamics Simulation -- Benjamin Lindner et al. -- Biomacromolecules, 2013\label{ref-0024}

\item NAMD website -- \href{http://www.ks.uiuc.edu/Research/namd/}{http://www.ks.uiuc.edu/Research/namd/}\label{ref-0025}

\item SHOC source repository -- \href{https://github.com/vetter/shoc}{https://github.com/vetter/shoc}\label{ref-0026}

\item Parallelizing the QUDA Library for Multi-GPU Calculations in Lattice Quantum Chromodynamics -- R. Babbich, M. Clark and B. Joo -- SC 10 (Supercomputing 2010)\label{ref-0027}

\item Lattice QCD on Intel Xeon Phi -- B. Joo, D. D. Kalamkar, K. Vaidyanathan, M. Smelyanskiy, K. Pamnany, V. W. Lee, P. Dubey, W. Watson III -- International Supercomputing Conference (ISC'13), 2013\label{ref-0028}

\item Extension of fractional step techniques for incompressible flows: The preconditioned Orthomin(1) for the pressure Schur complement -- G. Houzeaux, R. Aubry, and M. V\'{a}zquez -- Computers \& Fluids, 44:297-313, 2011\label{ref-0029}

\item MIMD Lattice Computation (MILC) Collaboration -- \href{http://physics.indiana.edu/~sg/milc.html}{http://physics.indiana.edu/\textasciitilde{}sg/milc.html}\label{ref-0030}

\item targetDP -- \href{https://ccpforge.cse.rl.ac.uk/svn/ludwig/trunk/targetDP/README}{https://ccpforge.cse.rl.ac.uk/svn/ludwig/trunk/targetDP/README}\label{ref-0031}

\item QUDA: A library for QCD on GPU -- \href{https://lattice.github.io/quda/}{https://lattice.github.io/quda/}\label{ref-0032}

\item QPhiX, QCD for Intel Xeon Phi and Xeon processors -- \href{http://jeffersonlab.github.io/qphix/}{http://jeffersonlab.github.io/qphix/}\label{ref-0033}

\item KNC MaxFlops issue (both SP and DP) -- \href{https://github.com/vetter/shoc/issues/37}{https://github.com/vetter/shoc/issues/37}\label{ref-0034}

\item \label{ref-0035}KNC SpMV issue -- \href{https://github.com/vetter/shoc/issues/24}{https://github.com/vetter/shoc/issues/24}, \href{https://github.com/vetter/shoc/issues/23}{https://github.com/vetter/shoc/issues/23}

\end{enumerate}

\textbf{List of Acronyms and Abbreviations\label{ref-0036}}

\begin{description}
\item[aisbl]Association International Sans But Lucratif \newline
 (legal form of the PRACE-RI)

\item[BCO]Benchmark Code Owner 

\item[CoE]Center of Excellence
\item[CPU]Central Processing Unit

\item[CUDA]Compute Unified Device Architecture (NVIDIA)

\item[DARPA]Defense Advanced Research Projects Agency

\item[DEISA]Distributed European Infrastructure for Supercomputing Applications EU project by leading national HPC centres

\item[DoA]Description of Action (formerly known as DoW)
\item[EC]European Commission

\item[EESI]European Exascale Software Initiative

\item[EoI]Expression of Interest
\item[	ESFRI]European Strategy Forum on Research Infrastructures 

\item[GB]Giga (= 2$^{\mathrm{30}}$ \textasciitilde{} 10$^{\mathrm{9}}$) Bytes (= 8 bits), also GByte

\item[Gb/s]Giga (= 10$^{\mathrm{9}}$) bits per second, also Gbit/s

\item[GB/s]Giga (= 10$^{\mathrm{9}}$) Bytes (= 8 bits) per second, also GByte/s
\item[	G\'{E}ANT]Collaboration between National Research and Education Networks to build a multi-gigabit pan-European network. The current EC-funded project as of 2015 is GN4.

\item[GFlop/s]Giga (= 10$^{\mathrm{9}}$) Floating point operations (usually in 64-bit, i.e. DP) per second, also GF/s

\item[GHz]Giga (= 10$^{\mathrm{9}}$) Hertz, frequency = 10$^{\mathrm{9}}$ periods or clock cycles per second
\item[GPU]Graphic Processing Unit

\item[	HET]High Performance Computing in Europe Taskforce. Taskforce by representatives from European HPC community to shape the European HPC Research Infrastructure. Produced the scientific case and valuable groundwork for the PRACE project.

\item[HMM]Hidden Markov Model

\item[HPC]High Performance Computing; Computing at a high performance level at any given time; often used synonym with Supercomputing

\item[HPL]High Performance LINPACK 

\item[	ISC]International Supercomputing Conference; European equivalent to the US based SCxx conference. Held annually in Germany.

\item[KB]Kilo (= 2$^{\mathrm{10}}$ \textasciitilde{}10$^{\mathrm{3}}$) Bytes (= 8 bits), also KByte

\item[LINPACK]Software library for Linear Algebra

\item[MB]Management Board (highest decision making body of the project)

\item[MB]Mega (= 2$^{\mathrm{20}}$ \textasciitilde{} 10$^{\mathrm{6}}$) Bytes (= 8 bits), also MByte

\item[MB/s]Mega (= 10$^{\mathrm{6}}$) Bytes (= 8 bits) per second, also MByte/s

\item[MFlop/s]Mega (= 10$^{\mathrm{6}}$) Floating point operations (usually in 64-bit, i.e. DP) per second, also MF/s

\item[MooC]Massively open online Course

\item[MoU]Memorandum of Understanding
\item[MPI]Message Passing Interface

\item[	NDA]Non-Disclosure Agreement. Typically signed between vendors and customers working together on products prior to their general availability or announcement.

\item[	PA]Preparatory Access (to PRACE resources)

\item[	PATC]PRACE Advanced Training Centres

\item[PRACE]Partnership for Advanced Computing in Europe; Project Acronym

\item[PRACE 2]The upcoming next phase of the PRACE Research Infrastructure following the initial five year period.

\item[PRIDE]Project Information and Dissemination Event

\item[RI]Research Infrastructure

\item[TB]Technical Board (group of Work Package leaders)

\item[TB]Tera (= 2$^{\mathrm{40}}$ \textasciitilde{} 10$^{\mathrm{12}}$) Bytes (= 8 bits), also TByte

\item[TCO]Total Cost of Ownership. Includes recurring costs (e.g. personnel, power, cooling, maintenance) in addition to the purchase cost.

\item[TDP]Thermal Design Power

\item[TFlop/s]Tera (= 10$^{\mathrm{12}}$) Floating-point operations (usually in 64-bit, i.e. DP) per second, also TF/s

\item[Tier-0]Denotes the apex of a conceptual pyramid of HPC systems. In this context the Supercomputing Research Infrastructure would host the Tier-0 systems; national or topical HPC centres would constitute Tier-1

\item[UNICORE]Uniform Interface to Computing Resources. Grid software for seamless access to distributed resources.

\end{description}

\textbf{List of Project Partner Acronyms\label{ref-0037}}

\begin{description}
\item[	BADW-LRZ]Leibniz-Rechenzentrum der Bayerischen Akademie der Wissenschaften, Germany (3$^{\mathrm{rd}}$ Party to GCS)

\item[	BILKENT]Bilkent University, Turkey (3$^{\mathrm{rd}}$ Party to UYBHM)

\item[	BSC]Barcelona Supercomputing Center - Centro Nacional de Supercomputacion, Spain 

\item[	CaSToRC]Computation-based Science and Technology Research Center, Cyprus

\item[	CCSAS]Computing Centre of the Slovak Academy of Sciences, Slovakia

\item[	CEA]Commissariat \`{a} l'Energie Atomique et aux Energies Alternatives, France (3$^{\mathrm{ rd}}$ Party to GENCI)

\item[	CESGA]Fundacion Publica Gallega Centro Tecnol\'{o}gico de Supercomputaci\'{o}n de Galicia, Spain, (3$^{\mathrm{rd}}$ Party to BSC)

\item[	CINECA]CINECA Consorzio Interuniversitario, Italy

\item[	CINES]Centre Informatique National de l'Enseignement Sup\'{e}rieur, France (3$^{\mathrm{ rd}}$ Party to GENCI)

\item[	CNRS]Centre National de la Recherche Scientifique, France (3$^{\mathrm{ rd}}$ Party to GENCI)

\item[	CSC]CSC Scientific Computing Ltd., Finland

\item[	CSIC]Spanish Council for Scientific Research (3$^{\mathrm{rd}}$ Party to BSC)

\item[	CYFRONET]Academic Computing Centre CYFRONET AGH, Poland (3rd party to PNSC)

\item[	EPCC]EPCC at The University of Edinburgh, UK 

\item[	ETHZurich (CSCS)]Eidgen\"{o}ssische Technische Hochschule Z\"{u}rich -- CSCS, Switzerland

\item[	FIS]FACULTY OF INFORMATION STUDIES, Slovenia (3$^{\mathrm{rd}}$ Party to ULFME)

\item[	GCS]Gauss Centre for Supercomputing e.V.

\item[	GENCI]Grand Equipement National de Calcul Intensiv, France

\item[	GRNET]Greek Research and Technology Network, Greece

\item[	INRIA]Institut National de Recherche en Informatique et Automatique, France (3$^{\mathrm{ rd}}$ Party to GENCI)

\item[	IST]Instituto Superior T\'{e}cnico, Portugal (3rd Party to UC-LCA)

\item[	IUCC]INTER UNIVERSITY COMPUTATION CENTRE, Israel

\item[	JKU]Institut fuer Graphische und Parallele Datenverarbeitung der Johannes Kepler Universitaet Linz, Austria

\item[	JUELICH]Forschungszentrum Juelich GmbH, Germany

\item[	KTH]Royal Institute of Technology, Sweden (3$^{\mathrm{ rd}}$ Party to SNIC)

\item[	LiU]Linkoping University, Sweden (3$^{\mathrm{ rd}}$ Party to SNIC)

\item[	NCSA]NATIONAL CENTRE FOR SUPERCOMPUTING APPLICATIONS, Bulgaria

\item[	NIIF]National Information Infrastructure Development Institute, Hungary

\item[	NTNU]The Norwegian University of Science and Technology, Norway (3$^{\mathrm{rd}}$ Party to SIGMA)

\item[	NUI-Galway]National University of Ireland Galway, Ireland

\item[	PRACE]Partnership for Advanced Computing in Europe aisbl, Belgium

\item[	PSNC]Poznan Supercomputing and Networking Center, Poland

\item[	RISCSW]RISC Software GmbH

\item[	RZG]Max Planck Gesellschaft zur F\"{o}rderung der Wissenschaften e.V., Germany (3$^{\mathrm{ rd}}$ Party to GCS)

\item[	SIGMA2]UNINETT Sigma2 AS, Norway

\item[	SNIC]Swedish National Infrastructure for Computing (within the Swedish Science Council), Sweden

\item[	STFC]Science and Technology Facilities Council, UK (3$^{\mathrm{rd}}$ Party to EPSRC)

\item[	SURFsara]Dutch national high-performance computing and e-Science support center, part of the SURF cooperative, Netherlands

\item[	UC-LCA]Universidade de Coimbra, Labotat\'{o}rio de Computa\c{c}\~{a}o Avan\c{c}ada, Portugal

\item[	UCPH]K\o{}benhavns Universitet, Denmark

\item[	UHEM]Istanbul Technical University, Ayazaga Campus, Turkey

\item[	UiO]University of Oslo, Norway (3$^{\mathrm{rd}}$ Party to SIGMA)

\item[	ULFME]UNIVERZA V LJUBLJANI, Slovenia

\item[	UmU]Umea University, Sweden (3$^{\mathrm{ rd}}$ Party to SNIC)

\item[	UnivEvora]Universidade de \'{E}vora, Portugal (3rd Party to UC-LCA)

\item[	UPC]Universitat Polit\`{e}cnica de Catalunya, Spain (3rd Party to BSC)

\item[	UPM/CeSViMa]Madrid Supercomputing and Visualization Center, Spain (3$^{\mathrm{rd}}$ Party to BSC)

\item[	USTUTT-HLRS]Universitaet Stuttgart -- HLRS, Germany (3rd Party to GCS)

\item[	VSB-TUO]VYSOKA SKOLA BANSKA - TECHNICKA UNIVERZITA OSTRAVA, Czech Republic

\item[	WCNS]Politechnika Wroclawska, Poland (3rd party to PNSC)

\end{description}

\textbf{Executive Summary\label{ref-0038}}

This document describes an accelerator benchmark suite, a set of 11 codes comprising 1 synthetic benchmark and 10 commonly used applications. The key focus of this task has been exploiting accelerators or co-processors to improve the performance of real applications. It aims at providing a set of scalable, currently relevant and publicly available codes and datasets.

This work has been undertaken by Task 7.2B ``Accelerator Benchmarks'' in the PRACE Fourth Implementation Phase (PRACE-4IP) project.

Most of the selected applications are a subset of the Unified European Applications Benchmark Suite (UEABS) {\hyperref[ref-0017]{[2]}}{\hyperref[ref-0018]{[3]}}. One application and a synthetic benchmark have been added.

As a result, the selected codes are: Alya, Code\_Saturne, CP2K, GROMACS, GPAW, NAMD, PFARM, QCD, Quantum Espresso, SHOC and SPECFEM3D.

For each code, two or more test case datasets have been selected. These are described in this document, along with a brief introduction to the application codes themselves. For each code, some sample results are presented from first runs on leading-edge systems and prototypes.

\section{1 Introduction\label{ref-0039}}

The work produced within this task is an extension of the UEABS for accelerators. This document will cover each code, presenting the code as well as the test cases defined for the benchmarks and the first results that have been recorded on various accelerator systems.

Like the UEABS, this suite aims to present results for many scientific fields that can use accelerated HPC resources. Hence, it will help the European scientific communities to make informed decisions about the infrastructures they may procure in the near future. We focus on Intel Xeon Phi co-processors and NVIDIA GPU cards for benchmarking, as they are the two most widespread accelerated resources available today.

Section {\hyperref[ref-0040]{2}} presents both types of accelerator systems, Xeon Phi and GPU cards, along with example architectures. Section {\hyperref[ref-0051]{3}} gives a description of each of the selected applications, together with the test case datasets, while section {\hyperref[ref-0087]{4}} presents some sample results. Section {\hyperref[ref-0157]{5}} outlines further work on, and using, the suite.

\section{2 Targeted architectures\label{ref-0040}\label{ref-0041}}

This suite targets accelerator cards, more specifically the Intel Xeon Phi and NVIDIA GPU architectures. This section briefly describes them and presents the four machines the benchmarks were run on.

\subsection{2.1 Co-processor description\label{ref-0042}}

Scientific computing using co-processors has gained popularity in recent years. First, the utility of GPUs was demonstrated and evaluated in several application domains {\hyperref[ref-0019]{[4]}}. In response to NVIDIA's dominance in this field, Intel designed the Xeon Phi cards.

Architectures and programming models of co-processors may differ from those of CPUs and vary among co-processor types. The main challenges are the high degree of parallelism required from the software and the fact that code may have to be offloaded to the accelerator card.

{\hyperref[ref-0043]{Table 1}} illustrates this:

\begin{table}
\begin{tabularx}{\textwidth}{
p{\dimexpr 0.24\linewidth-2\tabcolsep}
p{\dimexpr 0.18\linewidth-2\tabcolsep}
p{\dimexpr 0.2\linewidth-2\tabcolsep}
p{\dimexpr 0.2\linewidth-2\tabcolsep}
p{\dimexpr 0.19\linewidth-2\tabcolsep}}
 & \multicolumn{2}{l}{Intel Xeon Phi} & \multicolumn{2}{l}{NVIDIA GPU} \\
 & 5110P (KNC) & 7250 (KNL) & K40m & P100 \\
public availability date & Nov-12 & Jun-16 & Jun-13 & May-16 \\
theoretical peak perf & 1,011 GF/s & 3,046 GF/s & 1,430 GF/s & 5,300 GF/s \\
offload required & possible & not possible & required & required \\
max number of threads / CUDA cores & 240 & 272 & 2880 & 3584 \\

\end{tabularx}

\end{table}

\textbf{Table 1 Main co-processors specifications\label{ref-0043}\label{ref-0044}}

\subsection{2.2 Systems description\label{ref-0045}}

The benchmark suite has been officially granted access to 4 different machines hosted by PRACE partners. Most results presented in this report were obtained on these machines, but some of the simulations were run on similar ones. This section covers the specifications of the 4 official systems mentioned above, while the few other systems are presented along with the corresponding results.

As can be seen in the previous section, these leading-edge architectures have only become available recently and some codes could not be run on them yet. Results will be completed in the near future and delivered with an update of the benchmark suite. Still, the performance presented here is a good indicator of the potential efficiency of the codes on both Xeon Phi and NVIDIA GPU platforms.

As for the future, the PRACE-3IP PCP is in its third and last phase and will be a good candidate to provide access to bigger machines. The following suppliers have been awarded a contract: ATOS/Bull SAS (France), E4 Computer Engineering (Italy) and Maxeler Technologies (UK), providing pilots using Xeon Phi, OpenPOWER and FPGA technologies. During this final phase, which started in October 2016, the contractors have to deploy pilot systems with a compute capability of around 1 PFlop/s, to demonstrate the technology readiness of the proposed solutions and the progress in terms of energy efficiency, using high-frequency monitoring designed for this purpose. These results will be evaluated on a subset of applications from the UEABS (NEMO, SPECFEM3D, Quantum Espresso, BQCD). Access to these systems is foreseen to be open to PRACE partners, with a special interest for the 4IP-WP7 task on accelerated benchmarks.

\subsubsection{\raisebox{-0pt}{2.2.1} Cartesius K40\label{ref-0046}\label{ref-0047}}

The SURFsara institute in the Netherlands granted access to Cartesius, which has a GPU island (installed May 2014) with the following specifications {\hyperref[ref-0020]{[5]}}:

\begin{itemize}
\item 66 Bullx B515 GPU-accelerated nodes:

\begin{itemize}
\item 2x 8-core 2.5 GHz Intel Xeon E5-2450 v2 (Ivy Bridge) CPUs per node

\item 2x NVIDIA Tesla K40m GPUs per node

\item 96 GB DDR3-1600 RAM per node

\end{itemize}

\item Total theoretical peak performance (Ivy Bridge + K40m), 1,056 cores + 132 GPUs: 210 TF/s

\end{itemize}
The interconnect has a fully non-blocking fat-tree topology. Every node has two ConnectX-3 InfiniBand FDR adapters: one per GPU.

\subsubsection{\raisebox{-0pt}{2.2.2} MareNostrum KNC\label{ref-0048}}

The Barcelona Supercomputing Center (BSC) in Spain granted access to MareNostrum III, which features KNC nodes (upgraded June 2013). The description of this partition is as follows {\hyperref[ref-0021]{[6]}}:

\begin{itemize}
\item 42 hybrid nodes, each containing:

\begin{itemize}
\item Sandy Bridge-EP E5-2670 host processors (2 x 8 cores)

\item 8x 8 GB DDR3-1600 DIMMs (4 GB/core), total: 64 GB/node

\item 2x Xeon Phi 5110P accelerators

\end{itemize}

\item Interconnection networks:

\begin{itemize}
\item InfiniBand Mellanox FDR10: high-bandwidth network used by parallel application communications (MPI)

\item Gigabit Ethernet: 10 Gbit Ethernet network used by the GPFS filesystem.

\end{itemize}

\end{itemize}

\subsubsection{\raisebox{-0pt}{2.2.3} Ouessant P100\label{ref-0049}}
GENCI granted access to the Ouessant prototype at IDRIS in France (installed September 2016). It is composed of 12 IBM Minsky compute nodes, each containing {\hyperref[ref-0022]{[7]}}:

\begin{itemize}
\item Compute nodes:

\begin{itemize}
\item 2x POWER8+ sockets, 10 cores each, 8 threads per core (i.e. 160 threads per node)

\item 128 GB of DDR4 memory (bandwidth {\textgreater} 9 GB/s per core)

\item 4x NVIDIA new-generation Pascal P100 GPUs, each with 16 GB of HBM2 memory

\end{itemize}

\item Interconnect:

\begin{itemize}
\item 4 NVLink interconnects (40 GB/s of bi-directional bandwidth per interconnect); each GPU card is connected to a CPU with 2 NVLink interconnects and to another GPU with the 2 remaining interconnects

\item A Mellanox EDR InfiniBand CAPI interconnect network (1 interconnect per node)

\end{itemize}

\end{itemize}

\subsubsection{\raisebox{-0pt}{2.2.4} Frioul KNL\label{ref-0050}}
GENCI also granted access to the Frioul prototype at CINES in France (installed December 2016). It is composed of 48 Intel KNL compute nodes each containing:

\begin{itemize}
\item Compute nodes:

\begin{itemize}
\item Xeon Phi 7250 (KNL), 68 cores, 4 threads per core

\item 192 GB of DDR4 memory

\item 16 GB of MCDRAM

\end{itemize}

\item Interconnect:

\begin{itemize}
\item A Mellanox EDR 4x InfiniBand network

\end{itemize}

\end{itemize}
\section{3 Benchmark suite description\label{ref-0051}\label{ref-0052}}

This section covers each code, presenting its interest for the scientific community as well as the test cases defined for the benchmarks.

Since this suite is an extension of the UEABS, most of the codes presented here are also included in the latter. The exceptions are PFARM, which comes from PRACE-2IP {\hyperref[ref-0023]{[8]}}, and SHOC {\hyperref[ref-0026]{[11]}}, a synthetic benchmark suite.

\includegraphics[width=0.5\textwidth]{d7.5_4IP_1.0.docx.tmp/word/media/image2.emf}

\textbf{Table 2 Codes and corresponding APIs available (in green)\label{ref-0053}\label{ref-0054}}

{\hyperref[ref-0053]{Table 2}} lists the codes that will be presented in the next sections as well as the implementations available for each. It should be noted that OpenMP can be used with the Intel Xeon Phi architecture, while CUDA is used for NVIDIA GPU cards. OpenCL has been considered as a third alternative that can be used on both architectures; it was available on the first generation of Xeon Phi (KNC) but has not been ported to the second one (KNL). SHOC is the only code that is impacted; this problem is addressed in section {\hyperref[ref-0151]{4.10}}.
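
As an illustration of the programming models listed in {\hyperref[ref-0053]{Table 2}}, the short C sketch below (an illustrative assumption, not taken from any of the benchmark codes) expresses a simple kernel with OpenMP 4.x directives: on a self-hosted KNL the parallel loop runs natively on the many cores, while on a KNC card used in offload mode the \texttt{target} construct ships the loop and the mapped arrays to the co-processor. On a GPU, the same loop would instead be written as a CUDA kernel.

\begin{verbatim}
#include <stdio.h>
#include <stdlib.h>

/* Illustrative sketch: y = y + a*x.  On a self-hosted KNL the
 * "omp target" line is unnecessary (the parallel loop runs natively);
 * on a KNC card in offload mode it moves the loop and the mapped
 * arrays to the co-processor. */
static void daxpy(int n, double a, const double *x, double *y)
{
#pragma omp target map(to: x[0:n]) map(tofrom: y[0:n])
#pragma omp parallel for simd
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
}

int main(void)
{
    const int n = 1 << 20;
    double *x = malloc(n * sizeof *x);
    double *y = malloc(n * sizeof *y);
    for (int i = 0; i < n; ++i) { x[i] = 1.0; y[i] = 2.0; }

    daxpy(n, 3.0, x, y);
    printf("y[0] = %f (expected 5.0)\n", y[0]);

    free(x);
    free(y);
    return 0;
}
\end{verbatim}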

\subsection{3.1 Alya\label{ref-0055}}

Alya is a high performance computational mechanics code that can solve different coupled mechanics problems: incompressible/compressible flows, solid mechanics, chemistry, excitable media, heat transfer and Lagrangian particle transport. It is one single code; there are no separate parallel or platform-specific versions. Modules, services and kernels can be compiled individually and used \`{a} la carte. The main discretisation technique employed in Alya is based on the variational multiscale finite element method to assemble the governing equations into algebraic systems. These systems can be solved using solvers like GMRES, the Deflated Conjugate Gradient or pipelined CG, together with preconditioners like SSOR, Restricted Additive Schwarz, etc. The coupling between physics solved in different computational domains (like fluid-structure interactions) is carried out in a multi-code way, using different instances of the same executable. Asynchronous coupling can be achieved in the same way in order to transport Lagrangian particles.

\subsubsection{\raisebox{-0pt}{3.1.1} Code description\label{ref-0056}}

The code is parallelised with MPI and OpenMP. Two OpenMP strategies are available: with and without a colouring strategy that avoids ATOMIC operations during the assembly step. A CUDA version is also available for the different solvers. Alya has also been compiled for MIC (Intel Xeon Phi).

Alya is written in Fortran 95 and the incompressible fluid module, present in the benchmark suite, is freely available. This module solves the Navier-Stokes equations using an Orthomin(1) {\hyperref[ref-0029]{[14]}} method for the pressure Schur complement. This method is an algebraic split strategy which converges to the monolithic solution. At each linearisation step, the momentum is solved twice and the continuity equation is solved once or twice, depending on whether the momentum-preserving or the continuity-preserving algorithm is selected.
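
The two OpenMP assembly strategies mentioned above can be pictured with the schematic C kernels below (an illustrative sketch only; Alya itself is written in Fortran). In the first variant, ATOMIC updates protect nodes shared by elements assigned to different threads; in the second, elements are pre-sorted into colours such that elements of the same colour share no node, so each colour can be assembled in parallel without ATOMICs.

\begin{verbatim}
/* Illustrative element-to-node assembly kernels (not Alya's code).
 * conn[e*nen + k] is the global node of local node k of element e. */

/* Strategy 1: single parallel loop over elements, ATOMIC updates. */
void assemble_atomic(int nelem, int nen, const int *conn,
                     const double *elem_rhs, double *rhs)
{
#pragma omp parallel for
    for (int e = 0; e < nelem; ++e)
        for (int k = 0; k < nen; ++k) {
            int node = conn[e * nen + k];
#pragma omp atomic
            rhs[node] += elem_rhs[e * nen + k];
        }
}

/* Strategy 2: elements grouped by colour; elements of one colour
 * share no node, so no ATOMICs are needed inside a colour. */
void assemble_coloured(int ncolour, const int *colour_start,
                       const int *elem_of, int nen, const int *conn,
                       const double *elem_rhs, double *rhs)
{
    for (int c = 0; c < ncolour; ++c) {
#pragma omp parallel for
        for (int i = colour_start[c]; i < colour_start[c + 1]; ++i) {
            int e = elem_of[i];
            for (int k = 0; k < nen; ++k)
                rhs[conn[e * nen + k]] += elem_rhs[e * nen + k];
        }
    }
}
\end{verbatim}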

\subsubsection{\raisebox{-0pt}{3.1.2} Test cases description\label{ref-0057}}

\textit{Cavity-hexahedra elements (10M elements)}

This test is the classical lid-driven cavity. The problem geometry is a cube of dimensions 1x1x1. The fluid properties are density = 1.0 and viscosity = 0.01. Dirichlet boundary conditions are applied on all sides, with three no-slip walls and one moving wall with velocity equal to 1.0, which corresponds to a Reynolds number of 100. The Reynolds number is low, so the regime is laminar and turbulence modelling is not necessary. The domain is discretised into 9,800,344 hexahedra elements. The solvers are the GMRES method for the momentum equations and the Deflated Conjugate Gradient for the continuity equation. This test case can be run using pure MPI parallelisation or the hybrid MPI/OpenMP strategy.
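
For reference, with the properties given above and the cavity edge length taken as the characteristic length, the quoted Reynolds number follows directly from its definition:

\[
\mathrm{Re} = \frac{\rho \, U \, L}{\mu} = \frac{1.0 \times 1.0 \times 1.0}{0.01} = 100 .
\]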

\textit{Cavity-hexahedra elements (30M elements)}

This is the same cavity test as before but with 30M elements. Note that a mesh multiplication strategy enables one to multiply the number of elements by powers of 8, by simply activating the corresponding option in the ker.dat file.

\textit{Cavity-hexahedra elements -- GPU version (10M elements)}

This is the same test as Test case 1, but using the pure MPI parallelisation strategy with acceleration of the algebraic solvers using GPU.

\subsection{3.2 Code\_Saturne\label{ref-0058}}

Code\_Saturne is a CFD software package developed by EDF R\&D since 1997 and open-source since 2007. The Navier-Stokes equations are discretised following a finite volume method approach. The code can handle any type of mesh built with any type of cell/grid structure. Incompressible and compressible flows can be simulated, with or without heat transfer, and a range of turbulence models is available. The code can also be coupled with itself or other software to model some multi-physics problems (fluid-structure, fluid-conjugate heat transfer, for instance).

\subsubsection{\raisebox{-0pt}{3.2.1} Code description\label{ref-0059}}

Parallelism is handled by distributing the domain over the processors (several partitioning tools are available, either internally, i.e. SFC Hilbert and Morton, or through external libraries, i.e. METIS Serial, ParMETIS, Scotch Serial, PT-SCOTCH). Communications between subdomains are handled by MPI. Hybrid parallelism using MPI/OpenMP has recently been optimised for improved multicore performance.

For incompressible simulations, most of the time is spent during the computation of the pressure through Poisson equations. The matrices are very sparse. PETSc has recently been linked to the code to offer alternatives to the internal solvers to compute the pressure. The developer's version of PETSc supports CUDA and is used in this benchmark suite.

Code\_Saturne is written in C, F95 and Python. It is freely available under the GPL license.
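
The type of pressure solve delegated to PETSc can be illustrated with the minimal C example below (a sketch under the assumption of a standard PETSc installation, not code taken from Code\_Saturne): a small symmetric positive-definite system is assembled and solved with the conjugate gradient method. In CUSP/CUDA-enabled PETSc builds, the matrix and vector types can be switched to their GPU variants at run time through PETSc options.

\begin{verbatim}
/* Minimal PETSc CG sketch (illustrative only).  Error checking with
 * CHKERRQ/PetscCall is omitted for brevity. */
#include <petscksp.h>

int main(int argc, char **argv)
{
    Mat A; Vec x, b; KSP ksp; PC pc;
    PetscInt i, Istart, Iend, n = 100;

    PetscInitialize(&argc, &argv, NULL, NULL);

    /* Assemble a 1-D Laplacian (tridiagonal SPD matrix). */
    MatCreate(PETSC_COMM_WORLD, &A);
    MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
    MatSetFromOptions(A);   /* type can be overridden at run time */
    MatSetUp(A);
    MatGetOwnershipRange(A, &Istart, &Iend);
    for (i = Istart; i < Iend; ++i) {
        if (i > 0)     MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES);
        if (i < n - 1) MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES);
        MatSetValue(A, i, i, 2.0, INSERT_VALUES);
    }
    MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

    MatCreateVecs(A, &x, &b);
    VecSet(b, 1.0);

    /* Conjugate gradient with a simple Jacobi preconditioner. */
    KSPCreate(PETSC_COMM_WORLD, &ksp);
    KSPSetOperators(ksp, A, A);
    KSPSetType(ksp, KSPCG);
    KSPGetPC(ksp, &pc);
    PCSetType(pc, PCJACOBI);
    KSPSetFromOptions(ksp);
    KSPSolve(ksp, b, x);

    KSPDestroy(&ksp); MatDestroy(&A); VecDestroy(&x); VecDestroy(&b);
    PetscFinalize();
    return 0;
}
\end{verbatim}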

\subsubsection{\raisebox{-0pt}{3.2.2} Test cases description\label{ref-0060}}

Two test cases are dealt with, the former with a mesh made of hexahedral cells and the latter with a mesh made of tetrahedral cells. Both configurations are meant for incompressible laminar flows. The first test case is run on KNL in order to test the performance of the code when completely filling up a node, using 64 MPI tasks and then either 1, 2 or 4 OpenMP threads, or 1, 2 or 4 extra MPI tasks, to investigate the effect of hyper-threading. In this case, the pressure is computed using the code's native Algebraic Multigrid (AMG) algorithm as a solver. The second test case is run on KNL and GPU. In this configuration, the pressure equation is solved using the conjugate gradient (CG) algorithm from the PETSc library (the version of PETSc is the developer's version, which supports GPU) and tests are run on KNL as well as on CPU+GPU. PETSc is built with the CUSP library and the CUSP format is used.

Note that, in Code\_Saturne, computing the pressure using a CG algorithm has always been slower than using the native AMG algorithm. The second test is therefore meant to compare the results obtained on KNL and GPU using CG only, and not to compare CG and AMG time to solution.

\textit{Flow in a 3-D lid-driven cavity (tetrahedral cells)}

The geometry is very simple, i.e. a cube, but the mesh is built using tetrahedral cells only. The Reynolds number is set to 100, and symmetry boundary conditions are applied in the spanwise direction. The case is modular and the mesh size can easily be varied. The largest mesh has about 13 million cells and is used to get some first comparisons using Code\_Saturne linked to the developer's PETSc library, in order to make use of the GPU.

\textit{3-D Taylor-Green vortex flow (hexahedral cells)}

The Taylor-Green vortex flow is traditionally used to assess the accuracy of the numerical schemes of CFD codes. Periodicity is used in the 3 directions. The evolutions in time of the total kinetic energy (the volume integral of the square of the velocity) and of the enstrophy (the volume integral of the square of the vorticity) are examined. Code\_Saturne is set to 2nd-order time and spatial schemes. The mesh size is $256^{3}$ cells.
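
For reference, the classical Taylor-Green vortex is initialised with a velocity field of the form

\[
u = U_{0}\,\sin(kx)\cos(ky)\cos(kz), \qquad
v = -U_{0}\,\cos(kx)\sin(ky)\cos(kz), \qquad
w = 0,
\]

(the exact constants used in the benchmark set-up are not specified in this document), from which the kinetic energy and enstrophy histories mentioned above are monitored.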

\subsection{3.3 CP2K\label{ref-0061}}

CP2K is a quantum chemistry and solid state physics software package that can perform atomistic simulations of solid state, liquid, molecular, periodic, material, crystal, and biological systems. It can perform molecular dynamics, metadynamics, Quantum Monte Carlo, Ehrenfest dynamics, vibrational analysis, core level spectroscopy, energy minimisation, and transition state optimisation using NEB or dimer method.

CP2K provides a general framework for different modelling methods such as density functional theory (DFT) using the mixed Gaussian and plane waves (GPW) and Gaussian and augmented plane waves (GAPW) approaches. Supported theory levels include DFTB, LDA, GGA, MP2, RPA, semi-empirical methods (AM1, PM3, PM6, RM1, MNDO, {\ldots}), and classical force fields (AMBER, CHARMM, {\ldots}).

\subsubsection{\raisebox{-0pt}{3.3.1} Code description\label{ref-0062}}

Parallelisation is achieved using a combination of OpenMP-based multi-threading and MPI.

Offloading for accelerators is implemented through CUDA and OpenCL for GPU and through OpenMP for MIC (Intel Xeon Phi).

CP2K is written in Fortran 2003 and freely available under the GPL license.

\subsubsection{\raisebox{-0pt}{3.3.2} Test cases description\label{ref-0063}}

\textit{LiH-HFX}

This is a single-point energy calculation for a particular configuration of a 216 atom Lithium Hydride crystal with 432 electrons in a 12.3 \AA{}$^{\mathrm{3}}$ (Angstroms cubed) cell. The calculation is performed using a DFT algorithm with GAPW under the hybrid Hartree-Fock exchange (HFX) approximation. These types of calculations are generally around one hundred times the computational cost of a standard local DFT calculation, although the cost of the latter can be reduced by using the Auxiliary Density Matrix Method (ADMM). Using OpenMP is of particular benefit here, as the HFX implementation requires a large amount of memory to store partial integrals. By using several threads, fewer MPI processes share the available memory on the node and thus enough memory is available to avoid recomputing any integrals on-the-fly, improving performance.

This test case is expected to scale efficiently to 1000+ nodes.

\textit{H2O-DFT-LS}

This is a single-point energy calculation for 2048 water molecules in a 39 \AA{}$^{\mathrm{3}}$ box using linear-scaling DFT. A local-density approximation (LDA) functional is used to compute the Exchange-Correlation energy in combination with a DZVP MOLOPT basis set and a 300 Ry cutoff. For large systems, the linear-scaling approach for solving Self-Consistent-Field equations should be much cheaper computationally than using standard DFT, and allow scaling up to 1 million atoms for simple systems. The linear scaling cost results from the fact that the algorithm is based on an iteration on the density matrix. The cubically-scaling orthogonalisation step of standard DFT is avoided and key operations are sparse matrix-matrix multiplications, which have a number of non-zero entries that scale linearly with system size. These are implemented efficiently in CP2K's DBCSR library.

This test case is expected to scale efficiently to 4000+ nodes.
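
To make the key kernel concrete, the following self-contained C sketch (an illustrative stand-in; CP2K's actual DBCSR library is blocked, threaded and distributed) implements a serial CSR sparse-matrix times sparse-matrix product using Gustavson's row-by-row algorithm, the type of operation whose cost grows with the number of non-zero entries rather than with the cube of the matrix dimension.

\begin{verbatim}
#include <stdio.h>
#include <stdlib.h>

/* Serial CSR sparse matrix-matrix product, C = A * B (n x n).
 * Columns within each output row are left unsorted. */
typedef struct { int n, nnz; int *rowptr, *col; double *val; } csr;

static csr spmm(const csr *A, const csr *B)
{
    int n = A->n, cap = 4 * A->nnz + 4;
    csr C = { n, 0, malloc((n + 1) * sizeof(int)),
              malloc(cap * sizeof(int)), malloc(cap * sizeof(double)) };
    double *acc  = calloc(n, sizeof(double)); /* dense row accumulator  */
    int    *mark = calloc(n, sizeof(int));    /* columns touched so far */
    int    *cols = malloc(n * sizeof(int));

    C.rowptr[0] = 0;
    for (int i = 0; i < n; ++i) {
        int len = 0;
        for (int p = A->rowptr[i]; p < A->rowptr[i + 1]; ++p) {
            int k = A->col[p]; double a = A->val[p];
            for (int q = B->rowptr[k]; q < B->rowptr[k + 1]; ++q) {
                int j = B->col[q];
                if (!mark[j]) { mark[j] = 1; cols[len++] = j; }
                acc[j] += a * B->val[q];
            }
        }
        if (C.nnz + len > cap) {              /* grow output arrays */
            cap = 2 * (C.nnz + len);
            C.col = realloc(C.col, cap * sizeof(int));
            C.val = realloc(C.val, cap * sizeof(double));
        }
        for (int t = 0; t < len; ++t) {       /* flush accumulator */
            int j = cols[t];
            C.col[C.nnz] = j; C.val[C.nnz] = acc[j]; C.nnz++;
            acc[j] = 0.0; mark[j] = 0;
        }
        C.rowptr[i + 1] = C.nnz;
    }
    free(acc); free(mark); free(cols);
    return C;
}

int main(void)
{
    /* A = 2x2 identity in CSR; A * A is again the identity. */
    int rp[3] = {0, 1, 2}, ci[2] = {0, 1};
    double v[2] = {1.0, 1.0};
    csr A = {2, 2, rp, ci, v};
    csr C = spmm(&A, &A);
    for (int i = 0; i < C.n; ++i)
        for (int p = C.rowptr[i]; p < C.rowptr[i + 1]; ++p)
            printf("C(%d,%d) = %g\n", i, C.col[p], C.val[p]);
    free(C.rowptr); free(C.col); free(C.val);
    return 0;
}
\end{verbatim}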

\subsection{3.4 GPAW\label{ref-0064}}

GPAW is a DFT program for ab-initio electronic structure calculations using the projector augmented wave method. It uses a uniform real-space grid representation of the electronic wavefunctions, which allows for excellent computational scalability and systematic convergence properties.

\subsubsection{\raisebox{-0pt}{3.4.1} Code description\label{ref-0065}}

GPAW is written mostly in Python, but also includes computational kernels written in C, and it leverages external libraries such as NumPy, BLAS and ScaLAPACK. Parallelisation is based on message passing using MPI, with no threading. Development branches for GPUs and MICs include support for offloading to accelerators using CUDA or pyMIC, respectively. GPAW is freely available under the GPL license.

\subsubsection{\raisebox{-0pt}{3.4.2} Test cases description\label{ref-0066}}

\textit{Carbon Nanotube}

This test case is a ground state calculation for a carbon nanotube in vacuum. By default, it uses a 6-6-10 nanotube with 240 atoms (freely adjustable) and serial LAPACK, with an option to use ScaLAPACK.

This benchmark is aimed at smaller systems, with an intended scaling range of up to 10 nodes.

\textit{Copper Filament}

This test case is a ground state calculation for a copper filament in vacuum. By default, it uses a 2x2x3 FCC lattice with 71 atoms (freely adjustable) and ScaLAPACK for parallelisation.

This benchmark is aimed at larger systems, with an intended scaling range of up to 100 nodes. A lower limit on the number of nodes may be imposed by the amount of memory required, which can be adjusted to some extent with the run parameters (e.g. lattice size or grid spacing).

\subsection{3.5 GROMACS\label{ref-0067}}

GROMACS is a versatile package to perform molecular dynamics, i.e. simulate the Newtonian equations of motion for systems with hundreds to millions of particles.

It is primarily designed for biochemical molecules like proteins, lipids and nucleic acids that have a lot of complicated bonded interactions, but since GROMACS is extremely fast at calculating the nonbonded interactions (that usually dominate simulations) many groups are also using it for research on non-biological systems, e.g. polymers.

GROMACS supports all the usual algorithms you expect from a modern molecular dynamics implementation, and some additional features.

GROMACS provides extremely high performance compared to all other programs. A lot of algorithmic optimisations have been introduced in the code; for instance, the calculation of the virial has been extracted from the innermost loops over pairwise interactions, and we use our own software routines to calculate the inverse square root. In GROMACS 4.6 and up, on almost all common computing platforms, the innermost loops are written in C using intrinsic functions that the compiler transforms to SIMD machine instructions, to utilise the available instruction-level parallelism. These kernels are available in both single and double precision, and support all different kinds of SIMD support found in x86-family (and other) processors.
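
As an illustration of this kind of kernel (a sketch only, not GROMACS source; the real kernels are generated for many SIMD flavours and precisions), the following C snippet computes approximate inverse square roots for four packed single-precision values using the SSE reciprocal-square-root estimate followed by one Newton-Raphson refinement step.

\begin{verbatim}
#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics */

/* 1/sqrt(x) for four packed floats: hardware estimate (~12 bits)
 * refined with one Newton-Raphson step, y <- 0.5*y*(3 - x*y*y). */
static __m128 inv_sqrt4(__m128 x)
{
    const __m128 half  = _mm_set1_ps(0.5f);
    const __m128 three = _mm_set1_ps(3.0f);
    __m128 y  = _mm_rsqrt_ps(x);
    __m128 yy = _mm_mul_ps(y, y);
    __m128 t  = _mm_sub_ps(three, _mm_mul_ps(x, yy));
    return _mm_mul_ps(half, _mm_mul_ps(y, t));
}

int main(void)
{
    float in[4] = { 1.0f, 4.0f, 9.0f, 16.0f };
    float out[4];
    _mm_storeu_ps(out, inv_sqrt4(_mm_loadu_ps(in)));
    for (int i = 0; i < 4; ++i)
        printf("1/sqrt(%g) ~ %g\n", in[i], out[i]);
    return 0;
}
\end{verbatim}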

\subsubsection{\raisebox{-0pt}{3.5.1} Code description\label{ref-0068}}

Parallelisation is achieved using combined OpenMP and MPI.

Offloading for accelerators is implemented through CUDA for GPU and through OpenMP for MIC (Intel Xeon Phi).

GROMACS is written in C/C++ and freely available under the GPL license.

\subsubsection{\raisebox{-0pt}{3.5.2} Test cases description\label{ref-0069}}

\textit{GluCL Ion Channel}

The ion channel system is the membrane protein GluCl, which is a pentameric chloride channel embedded in a lipid bilayer. The GluCl ion channel was embedded in a DOPC membrane and solvated in TIP3P water. This system contains 142k atoms, and is a quite challenging parallelisation case due to the small size. However, it is likely one of the most wanted target sizes for biomolecular simulations due to the importance of these proteins for pharmaceutical applications. It is particularly challenging due to a highly inhomogeneous and anisotropic environment in the membrane, which poses hard challenges for load balancing with domain decomposition.

This test case was used as the ``Small'' test case in previous 2IP and 3IP PRACE phases. It is included in the package's version 5.0 benchmark cases. It is reported to scale efficiently up to 1000+ cores on x86 based systems.

\textit{Lignocellulose}

A model of cellulose and lignocellulosic biomass in an aqueous solution {\hyperref[ref-0024]{[9]}}. This system of 3.3 million atoms is inhomogeneous. This system uses reaction-field electrostatics instead of PME and therefore scales well on x86. This test case was used as the ``Large'' test case in previous PRACE 2IP and 3IP projects. It is reported in previous PRACE projects to scale efficiently up to 10000+ x86 cores.

\subsection{3.6 NAMD\label{ref-0070}}

NAMD is a widely used molecular dynamics application designed to simulate bio-molecular systems on a wide variety of compute platforms. NAMD is developed by the ``Theoretical and Computational Biophysics Group'' at the University of Illinois at Urbana-Champaign. In the design of NAMD, particular emphasis has been placed on scalability when utilising a large number of processors. The application can read a wide variety of different file formats, for example force fields and protein structures, which are commonly used in bio-molecular science. A NAMD license can be applied for on the developer's website free of charge. Once the license has been obtained, binaries for a number of platforms and the source can be downloaded from the website. Deployment areas of NAMD include pharmaceutical research by academic and industrial users. NAMD is particularly suitable when the interaction between a number of proteins or between proteins and other chemical substances is of interest. Typical examples are vaccine research and transport processes through cell membrane proteins.

\subsubsection{\raisebox{-0pt}{3.6.1} Code description\label{ref-0071}}

NAMD is written in C++ and parallelised using Charm++ parallel objects, which are implemented on top of MPI, supporting both pure MPI and hybrid parallelisation {\hyperref[ref-0025]{[10]}}.

Offloading for accelerators is implemented for both GPU and MIC (Intel Xeon Phi).

\subsubsection{\raisebox{-0pt}{3.6.2} Test cases description\label{ref-0072}}

The datasets are based on the original ``Satellite Tobacco Mosaic Virus (STMV)'' dataset from the official NAMD site. The memory-optimised build of the package and data sets are used in benchmarking. Data are converted to the appropriate binary format used by the memory-optimised build.

\textit{STMV.1M}

This is the original STMV dataset from the official NAMD site. The system contains roughly 1 million atoms. This data set scales efficiently up to 1000+ x86 Ivy Bridge cores.

\textit{STMV.8M}

This is a 2x2x2 replication of the original STMV dataset from the official NAMD site. The system contains roughly 8 million atoms. This data set scales efficiently up to 6000 x86 Ivy Bridge cores.

\textit{STMV.28M}

This is a 3x3x3 replication of the original STMV dataset from the official NAMD site. The system contains roughly 28 million atoms. This data set also scales efficiently up to 6000 x86 Ivy Bridge cores.

\subsection{3.7 PFARM\label{ref-0073}}

PFARM is part of a suite of programs based on the `R-matrix' ab-initio approach to the variational solution of the many-electron Schr\"{o}dinger equation for electron-atom and electron-ion scattering. The package has been used to calculate electron collision data for astrophysical applications (such as the interstellar medium and planetary atmospheres) with, for example, various ions of Fe and Ni and neutral O, plus other applications such as data for plasma modelling and fusion reactor impurities. The code has recently been adapted to form a compatible interface with the UKRmol suite of codes for electron (positron) molecule collisions, thus enabling large-scale parallel `outer-region' calculations for molecular systems as well as atomic systems.

\subsubsection{\raisebox{-0pt}{3.7.1} Code description\label{ref-0074}}

In order to enable efficient computation, the external region calculation takes place in two distinct stages, named EXDIG and EXAS, with intermediate files linking the two. EXDIG is dominated by the assembly of sector Hamiltonian matrices and their subsequent eigensolutions. EXAS uses a combined functional/domain decomposition approach where good load-balancing is essential to maintain efficient parallel performance. Each of the main stages in the calculation is written in Fortran 2003 (or Fortran 2003-compliant Fortran 95), is parallelised using MPI and is designed to take advantage of highly optimised, numerical library routines. Hybrid MPI / OpenMP parallelisation has also been introduced into the code via shared memory enabled numerical library kernels.

Accelerator-based implementations exist for both EXDIG and EXAS. EXDIG uses offloading via MAGMA (or MKL) for the sector Hamiltonian diagonalisations on Intel Xeon Phi and GPU accelerators. EXAS uses combined MPI and OpenMP to distribute the scattering energy calculations efficiently both across and within Intel Xeon Phi co-processors.
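
The eigensolver step at the heart of EXDIG can be pictured with the toy C example below (illustrative only, and assuming a LAPACKE installation): a small real symmetric matrix, standing in for a sector Hamiltonian, is diagonalised with LAPACKE\_dsyev. The accelerated versions replace this call with the corresponding MKL or MAGMA routine on the Xeon Phi or GPU.

\begin{verbatim}
#include <stdio.h>
#include <stdlib.h>
#include <lapacke.h>

int main(void)
{
    const lapack_int n = 4;
    /* Row-major symmetric test matrix (stand-in for a sector
     * Hamiltonian, whose order would be in the tens of thousands). */
    double a[16] = {
        4.0, 1.0, 0.0, 0.0,
        1.0, 3.0, 1.0, 0.0,
        0.0, 1.0, 2.0, 1.0,
        0.0, 0.0, 1.0, 1.0
    };
    double w[4];  /* eigenvalues in ascending order */

    lapack_int info = LAPACKE_dsyev(LAPACK_ROW_MAJOR, 'V', 'U', n, a, n, w);
    if (info != 0) {
        fprintf(stderr, "dsyev failed: info = %d\n", (int)info);
        return EXIT_FAILURE;
    }
    for (lapack_int i = 0; i < n; ++i)
        printf("lambda[%d] = %f\n", (int)i, w[i]);
    return 0;
}
\end{verbatim}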

\subsubsection{\raisebox{-0pt}{3.7.2} Test cases description\label{ref-0075}}

External region R-matrix propagations take place over the outer partition of configuration space, including the region where long-range potentials remain important. The radius of this region is determined from the user input and the program decides upon the best strategy for dividing this space into multiple sub-regions (or sectors). Generally, a choice of larger sector lengths requires the application of larger numbers of basis functions (and therefore larger Hamiltonian matrices) in order to maintain accuracy across the sector and vice-versa. Memory limits on the target hardware may determine the final preferred configuration for each test case.

\textit{Iron, FeIII}

This is an electron-ion scattering case with 1181 channels. Hamiltonian assembly in the coarse region applies 10 Legendre functions leading to Hamiltonian matrix diagonalisations of order 11810. In the `fine energy region' up to 30 Legendre functions may be applied leading to Hamiltonian matrices of up to order 35430. The number of sector calculations is likely to range from about 15 to over 30 depending on the user specifications. Several thousand scattering energies are used in the calculation. 

\textit{Methane, CH4}

The dataset is an electron-molecule calculation with 1361 channels. Hamiltonian dimensions are therefore estimated at between 13610 and \textasciitilde{}40000. A process in the code which splits the constituent channels according to spin can be used to approximately halve the Hamiltonian size (whilst doubling the overall number of Hamiltonian matrices). As eigensolvers generally require $O(N^{3})$ operations, spin splitting leads to a saving in both memory requirements and operation count. The final radius of the external region required is relatively long, leading to more numerous sector calculations (estimated at between 20 and 30). The calculation will require many thousands of scattering energies.
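
As a rough estimate based on the $O(N^{3})$ cost quoted above, halving the Hamiltonian dimension while doubling the number of matrices reduces the eigensolver work by about a factor of four,

\[
2 \times \left(\frac{N}{2}\right)^{3} = \frac{N^{3}}{4},
\]

in addition to reducing the memory needed to hold each individual matrix.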

In the current model, parallelism in EXDIG is limited to the number of sector calculations, i.e. a maximum of around 30 accelerator nodes.

Methane is a relatively new dataset which has not been calculated on novel technology platforms at very large scale to date, so this is somewhat of a step into the unknown. We are also somewhat reliant on collaborative partners that are not associated with PRACE for continuing to develop and fine-tune the accelerator-based EXAS program for this proposed work. Access to suitable hardware with throughput suited to development cycles is also a necessity if suitable progress is to be ensured.

\subsection{3.8 QCD\label{ref-0076}}

Matter consists of atoms, which in turn consist of nuclei and electrons. The nuclei consist of neutrons and protons, which are themselves made up of quarks bound together by gluons.

The theory of how quarks and gluons interact to form nucleons and other elementary particles is called Quantum Chromo Dynamics (QCD). For most problems of interest, it is not possible to solve QCD analytically, and instead numerical simulations must be performed. Such ``Lattice QCD'' calculations are very computationally intensive, and occupy a significant percentage of all HPC resources worldwide.

\subsubsection{\raisebox{-0pt}{3.8.1} Code description\label{ref-0077}}

The QCD benchmark comprises two different implementations, described below.

\textit{First implementation}