\documentclass{llncs}
\usepackage[table,xcdraw]{xcolor}
\usepackage{hyperref}
\usepackage{graphicx}
\usepackage{tabularx}
\definecolor{color-4}{rgb}{1,0,0}

\begin{document}

\title{PRACE European Benchmark Suite: Application Performances on Accelerators}

\author{Victor Cameo Ponz\inst{1}
\and Adem Tekin\inst{2}
\and Alan Grey\inst{3}
\and Andrew Emerson\inst{4}
\and Andrew Sunderland\inst{5}
\and Arno Proeme\inst{3}
\and Charles Moulinec\inst{5}
\and Dimitris Dellis\inst{6}
% \and Fiona Reid\inst{3}
\and Jacob Finkenrath\inst{7}
% \and James Clark\inst{5}
\and Janko Strassburg\inst{8}
\and Jorge Rodriguez\inst{8}
\and Martti Louhivuori\inst{9}
\and Valeriu Codreanu\inst{10}}

\institute{Centre Informatique National de l'Enseignement Supérieur (CINES), Montpellier, France% 1
\and Istanbul Technical University (ITU), Istanbul, Turkey% 2
\and Edinburgh Parallel Computing Centre (EPCC), Edinburgh, United Kingdom% 3
\and Cineca, Bologna, Italy% 4
\and Science and Technology Facilities Council (STFC) Daresbury Laboratory, Daresbury, United Kingdom% 5
\and Greek Research and Technology Network (GRNET), Athens, Greece% 6
\and Cyprus Institute (CyI), Nicosia, Cyprus% 7
\and Barcelona Supercomputing Center (BSC), Barcelona, Spain% 8
\and CSC – IT Center for Science Ltd, Helsinki, Finland% 9
\and SurfSARA, Amsterdam, Netherlands} % 10

\maketitle

\begin{abstract}
Increasing interest is being shown in the use of graphics and many-core processors on the road towards HPC exascale machines. Radically different hardware implies that codes, some of them quite old, will have to evolve to remain efficient. This porting effort can be huge and obviously varies from one software package to another. Users of HPC systems will have to choose among the software packages adapted to the hardware of their new cluster, and may have to change their habits completely. Likewise, those procuring machines need good insight into the efficiency of each platform for their target communities. This leads to the need for a performance overview of various software packages on different hardware stacks. This is a common need in the HPC field, and benchmark suites have existed for a long time. The objective of PRACE with this document is to target leading-edge technologies such as OpenPOWER coupled with GPUs. It describes an accelerator benchmark suite, a set of 11 codes that includes 1 synthetic benchmark and 10 commonly used applications. It aims at providing a set of scalable, currently relevant and publicly available codes and datasets. For each code, two or more test case datasets have been selected. These are described in this document, along with a brief introduction to the application codes themselves. For each code, sample results on OpenPOWER+GPU are presented.
\end{abstract}

\section{Introduction\label{sec:intro}}

The work presented here has been carried out as an extension of the Unified European Application Benchmark Suite (UEABS) \cite{ref-0017,ref-0018} towards accelerators. This document covers each code, presenting the application itself as well as the test cases defined for the benchmarks and the results recorded on the targeted systems. Like the UEABS, this suite aims to present results for the many scientific fields that can make use of accelerated HPC resources. It will thus help the European scientific communities decide which infrastructures to procure in the near future. We focus on Intel Xeon Phi coprocessors and NVIDIA GPU cards for benchmarking, as they are the two most widespread accelerated resources available today.
Section \ref{sec:hardware} presents the architecture on which the codes were run. Section \ref{sec:codes} gives a description of each of the selected applications, together with the test case datasets, while Section \ref{sec:results} presents the results. Section \ref{sec:conclusion} outlines further work on, and using, the suite.

\section{Hardware Platform Available\label{sec:hardware}}

This suite targets accelerator cards, more specifically the Intel Xeon Phi and NVIDIA GPU architectures. This section briefly describes the OpenPOWER system on which the benchmarks were run.

GENCI (Grand Equipement National de Calcul Intensif) granted access to the \emph{Ouessant} prototype at IDRIS in France (installed September 2016). It is composed of 12 IBM OpenPOWER Minsky compute nodes, each containing \cite{ref-0022}:
\begin{itemize}
\item Compute node
  \begin{itemize}
  \item 2 POWER8+ sockets with 10 cores each and 8 threads per core (160 threads per node)
  \item 128 GB of DDR4 memory (bandwidth of 9 GB/s per core)
  \item 4 NVIDIA Pascal P100 GPUs, each with 16 GB of HBM2 memory
  \end{itemize}
\item Interconnect
  \begin{itemize}
  \item 4 NVLink links (40 GB/s of bi-directional bandwidth per link); each GPU card is connected to a CPU with 2 NVLink links and to another GPU with the remaining 2
  \item A Mellanox EDR InfiniBand CAPI interconnect (1 per node)
  \end{itemize}
\end{itemize}
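As an illustration only, and not part of the benchmark suite itself, the following minimal CUDA sketch queries the number of GPUs in a node, their memory sizes and their pairwise peer-to-peer capabilities, which can be compared against the specification listed above. It relies solely on standard CUDA runtime API calls (\texttt{cudaGetDeviceCount}, \texttt{cudaGetDeviceProperties}, \texttt{cudaDeviceCanAccessPeer}); the file name in the build comment is arbitrary.

\begin{verbatim}
/* Minimal sketch: inspect the GPU layout of a compute node.      */
/* Example build (file name illustrative): nvcc -o nodeinfo nodeinfo.cu */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int n = 0;
    cudaGetDeviceCount(&n);              /* expect 4 P100s on Ouessant */
    for (int i = 0; i < n; ++i) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, i);
        printf("GPU %d: %s, %.1f GB\n", i, p.name,
               p.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
        for (int j = 0; j < n; ++j) {    /* pairwise peer access hints  */
            if (i == j) continue;        /* at direct GPU-GPU links     */
            int ok = 0;
            cudaDeviceCanAccessPeer(&ok, i, j);
            printf("  peer access %d -> %d: %s\n", i, j, ok ? "yes" : "no");
        }
    }
    return 0;
}
\end{verbatim}

On the Ouessant nodes described above, one would expect four devices, each reporting roughly 16 GB of memory.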
\input{app.tex}
\input{results.tex}

\section{Conclusion and Future Work\label{sec:conclusion}}

The work presented here stands as a first step towards application benchmarking on accelerators. Most codes have been selected from the main Unified European Application Benchmark Suite. This paper describes each of them, as well as their implementations, their relevance to the European scientific community and their test cases. We have presented the results available on OpenPOWER systems. The suite will be publicly available in the PRACE CodeVault git repository \cite{ref-0036}, where links for downloading the sources and test cases will be published along with compilation and run instructions.

This PRACE-4IP task set out to design a benchmark suite for accelerators. The work has been carried out with the aim of integrating it into the main UEABS, so that both can be maintained and evolve together. As the PCP (PRACE-3IP) machines will soon be available, it will be very interesting to run the benchmark suite on them: first because these machines will be larger, but also because they will feature energy consumption probes.

\section{Acknowledgements}

This work was financially supported by the PRACE project, funded in part by the EU's Horizon 2020 research and innovation programme (2014--2020) under grant agreement 653838.

\begin{thebibliography}{} % (do not forget {})
% \bibitem{ref-0016} PRACE web site -- \url{http://www.prace-ri.eu}
\bibitem{ref-0017} The Unified European Application Benchmark Suite -- \url{http://www.prace-ri.eu/ueabs/}
\bibitem{ref-0018} D7.4 Unified European Applications Benchmark Suite -- Mark Bull et al. (2013)
% \bibitem{ref-0019} Accelerate productivity with the power of NVIDIA -- \url{http://www.nvidia.com/object/quadro-design-and-manufacturing.html}
% \bibitem{ref-0020} Description of the Cartesius system -- \url{https://userinfo.surfsara.nl/systems/cartesius/description}
% \bibitem{ref-0021} MareNostrum III User's Guide, Barcelona Supercomputing Center -- \url{https://www.bsc.es/support/MareNostrum3-ug.pdf}
\bibitem{ref-0022} Installation of an OpenPOWER prototype at IDRIS -- \url{http://www.idris.fr/eng/ouessant/}
\bibitem{ref-0023} PFARM reference -- \url{https://hpcforge.org/plugins/mediawiki/wiki/pracewp8/images/3/34/Pfarm_long_lug.pdf}
\bibitem{ref-0024} Solvent-Driven Preferential Association of Lignin with Regions of Crystalline Cellulose in Molecular Dynamics Simulation -- Benjamin Lindner et al. -- Biomacromolecules, 2013
\bibitem{ref-0025} NAMD website -- \url{http://www.ks.uiuc.edu/Research/namd/}
\bibitem{ref-0026} SHOC source repository -- \url{https://github.com/vetter/shoc}
\bibitem{ref-0027} Parallelizing the QUDA Library for Multi-GPU Calculations in Lattice Quantum Chromodynamics -- R. Babich, M. Clark and B. Joo -- Supercomputing 2010 (SC10)
\bibitem{ref-0028} Lattice QCD on Intel Xeon Phi -- B. Joo, D. D. Kalamkar, K. Vaidyanathan, M. Smelyanskiy, K. Pamnany, V. W. Lee, P. Dubey, W. Watson III -- International Supercomputing Conference (ISC'13), 2013
\bibitem{ref-0029} Extension of fractional step techniques for incompressible flows: The preconditioned Orthomin(1) for the pressure Schur complement -- G. Houzeaux, R. Aubry, and M. V\'{a}zquez -- Computers \& Fluids, 44:297--313, 2011
\bibitem{ref-0030} MIMD Lattice Computation (MILC) Collaboration -- \url{http://physics.indiana.edu/~sg/milc.html}
\bibitem{ref-0031} targetDP -- \url{https://ccpforge.cse.rl.ac.uk/svn/ludwig/trunk/targetDP/README}
\bibitem{ref-0032} QUDA: A library for QCD on GPUs -- \url{https://lattice.github.io/quda/}
\bibitem{ref-0033} QPhiX, QCD for Intel Xeon Phi and Xeon processors -- \url{http://jeffersonlab.github.io/qphix/}
\bibitem{ref-0034} KNC MaxFlops issue (both SP and DP) -- \url{https://github.com/vetter/shoc/issues/37}
\bibitem{ref-0035} KNC SpMV issue -- \url{https://github.com/vetter/shoc/issues/24}, \url{https://github.com/vetter/shoc/issues/23}
\bibitem{ref-0036} PRACE CodeVault repository -- \url{https://gitlab.com/PRACE-4IP/CodeVault}
\end{thebibliography}

\end{document}