\section{Application Performance\label{sec:results}}

This section presents sample results on OpenPOWER + GPU systems. Note that some of the codes presented in Sect. \ref{sec:codes} -- namely Alya, CP2K, GPAW, PFARM and QUANTUM ESPRESSO -- have only been run on x86 + GPU platforms, so their results are not shown here. Nevertheless, instructions to compile and run those codes are available and can be used to target OpenPOWER-based systems.

\subsection{Code\_Saturne\label{ref-0094}}

\subsubsection{Description of the Runtime Architecture.}

Code\_Saturne ran on two POWER8 nodes: an S822LC (2x POWER8 10-core + 2x K80, with 2 GK210 GPUs per K80) and an S824L (2x POWER8 12-core + 2x K40, with 1 GK180 GPU per K40). The compiler is at/8.0 (IBM Advance Toolchain), the MPI distribution is openmpi/1.8.8 and the CUDA compiler version is 7.5.

\subsubsection{Flow in a 3-D Lid-driven Cavity (Tetrahedral Cells)}

The following runtime options are used for PETSc: \verb+-ksp_type cg+, \verb+-vec_type cusp+, \verb+-mat_type aijcusp+ and \verb+-pc_type jacobi+.
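These options are read from the PETSc options database at runtime, so they need not be hard-coded. The following minimal C sketch (ours, not Code\_Saturne's actual coupling code) solves a stand-in SPD system and picks up all four options through \verb+MatSetFromOptions+, \verb+VecSetFromOptions+ and \verb+KSPSetFromOptions+:

\begin{verbatim}
#include <petscksp.h>

int main(int argc, char **argv)
{
  Mat      A;
  Vec      x, b;
  KSP      ksp;
  PetscInt i, n = 1000;

  PetscInitialize(&argc, &argv, NULL, NULL);

  /* Stand-in SPD matrix (1-D Laplacian); Code_Saturne supplies its
     own pressure-equation matrix at this point. */
  MatCreate(PETSC_COMM_WORLD, &A);
  MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
  MatSetFromOptions(A);              /* honours -mat_type aijcusp */
  MatSetUp(A);
  for (i = 0; i < n; i++) {
    if (i > 0)     MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES);
    if (i < n - 1) MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES);
    MatSetValue(A, i, i, 2.0, INSERT_VALUES);
  }
  MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
  MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

  VecCreate(PETSC_COMM_WORLD, &b);
  VecSetSizes(b, PETSC_DECIDE, n);
  VecSetFromOptions(b);              /* honours -vec_type cusp */
  VecDuplicate(b, &x);
  VecSet(b, 1.0);

  KSPCreate(PETSC_COMM_WORLD, &ksp);
  KSPSetOperators(ksp, A, A);
  KSPSetFromOptions(ksp);            /* honours -ksp_type cg and
                                        -pc_type jacobi */
  KSPSolve(ksp, b, x);

  KSPDestroy(&ksp); MatDestroy(&A);
  VecDestroy(&x);   VecDestroy(&b);
  PetscFinalize();
  return 0;
}
\end{verbatim}

Such a program is then launched with the four options above appended to the \verb+mpirun+ command line.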

\begin{table}
\caption{Performance of Code\_Saturne + PETSc on 1 node of the POWER8 clusters. Comparison between 2 different nodes, using different types of CPU and GPU. PETSc is built on LAPACK. The speedup is computed as the ratio between the time to solution on the CPU for a given number of MPI tasks and the time to solution on the CPU/GPU for the same number of MPI tasks.\label{table:cs-results}}
\includegraphics[width=1\textwidth]{media/image7.pdf}
\end{table}
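In symbols, the speedup quoted in Table \ref{table:cs-results} is
\[
S(n) = \frac{T_{\mathrm{CPU}}(n)}{T_{\mathrm{CPU/GPU}}(n)},
\]
where $n$ is the number of MPI tasks and $T$ the time to solution, so $S(n) > 1$ means that the GPU-accelerated run is faster.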

Table \ref{table:cs-results} shows the results obtained using the POWER8 CPU alone and the CPU/GPU combination. Focusing first on the POWER8 results, a speedup is observed on each of the two nodes when the same number of MPI tasks and GPUs is used. However, when the nodes are fully populated (20 and 24 MPI tasks, respectively), it is faster to run on the CPU only than on the CPU/GPU combination. This could be explained by the fact that the same overall amount of data is transferred, but the overheads (system administration costs, latency, and the asynchronicity of transfers split into 20 (S822LC) or 24 (S824L) slices) might become prohibitive.

\subsection{GROMACS\label{ref-0110}}

GROMACS was successfully compiled and run on IDRIS Ouessant (IBM POWER8 + dual P100), presented in Sect. \ref{sec:hardware}.

In all accelerated runs, a speedup of 2--2.6x with respect to CPU-only runs was achieved with GPUs, as shown in Figs. \ref{fig:7} and \ref{fig:8}.
\begin{figure}
\caption{Scalability for GROMACS test case GluCL Ion Channel\label{ref-0111}}
\includegraphics[width=1\textwidth]{media/image13.png}
\label{fig:7}
\end{figure}
\begin{figure}
\caption{Scalability for GROMACS test case Lignocellulose\label{ref-0112}}
\includegraphics[width=1\textwidth]{media/image14.png}
\label{fig:8}
\end{figure}

\subsection{NAMD\label{ref-0113}}

NAMD was successfully compiled and run on IDRIS Ouessant (IBM POWER8 + dual P100), presented in Sect. \ref{sec:hardware}.

In all accelerated runs, a speedup of 5--6x with respect to CPU-only runs was achieved with GPUs, as shown in Figs. \ref{fig:9} and \ref{fig:10}.

\begin{figure}
\caption{Scalability for NAMD test case STMV.8M\label{ref-0114}}
\includegraphics[width=1\textwidth]{media/image15.png}
\label{fig:9}
\end{figure}

\begin{figure}
\caption{ Scalability for NAMD test case STMV.28M\label{ref-0115}}
\includegraphics[width=1\textwidth]{media/image16.png}
\label{fig:10}
\end{figure}


\subsection{QCD\label{ref-0123}}

As stated in Sect. \ref{sec:codes-qcd}, the QCD benchmark has two implementations.

\subsubsection{First Implementation.\label{ref-0124}}

\begin{figure}
\caption{Time taken by the full MILC $64\times64\times64\times8$ test case on traditional CPU, Intel Knights Landing Xeon Phi and NVIDIA P100 (Pascal) GPU architectures.\label{ref-0129}\label{ref-0130}}
\includegraphics[width=1\textwidth]{media/image21.pdf}
\label{fig:14}
\end{figure}
In Fig. \ref{fig:14} we present preliminary results on the latest-generation Intel Knights Landing (KNL) and NVIDIA Pascal architectures, which offer very high-bandwidth stacked memory, together with the same traditional Intel Ivy Bridge CPU used in previous sections. Note that these results are not directly comparable with those presented earlier, since they are for a different test case size (larger, since we are no longer limited by the small memory of the Knights Corner) and for a slightly updated version of the benchmark. The KNL is the 64-core 7210 model, available within a test and development platform provided as part of the ARCHER service. The Pascal is an NVIDIA P100 GPU provided as part of the ``Ouessant'' IBM service at IDRIS, where the host CPU is an IBM POWER8+.

It can be seen that the KNL is 7.5x faster than the Ivy Bridge; the Pascal is 13x faster than the Ivy Bridge; and the OpenPOWER + Pascal is 1.7x faster than the KNL.
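The last ratio follows from the first two: both accelerators are measured against the same Ivy Bridge baseline, so the expected factor is $13/7.5 \approx 1.7$.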

\subsection{Synthetic Benchmarks (SHOC)\label{sec:results:shoc}}

The SHOC benchmark has been run on Cartesius, Ouessant and MareNostrum. Table \ref{table:result:shoc} presents the results; the POWER8 results are the ones shown in the P100 CUDA column.

\begin{table}
\caption{Synthetic benchmark results on NVIDIA GPUs, Intel Xeon Phi and Intel Xeon (Haswell)\label{table:result:shoc}}
\begin{tabularx}{\textwidth}{
p{\dimexpr 0.25\linewidth-2\tabcolsep}
p{\dimexpr 0.13\linewidth-2\tabcolsep}
p{\dimexpr 0.13\linewidth-2\tabcolsep}
p{\dimexpr 0.13\linewidth-2\tabcolsep}
p{\dimexpr 0.1\linewidth-2\tabcolsep}
p{\dimexpr 0.12\linewidth-2\tabcolsep}
p{\dimexpr 0.13\linewidth-2\tabcolsep}}
 & \multicolumn{3}{l}{NVIDIA GPU} & \multicolumn{2}{l}{Intel Xeon Phi} & Intel Xeon \\
 & K40 CUDA & K40 OpenCL & P100 CUDA & KNC Offload & KNC OpenCL & Haswell OpenCL \\
\hline
BusSpeedDownload & 10.5 GB/s & 10.56 GB/s & 32.23 GB/s & 6.6 GB/s & 6.8 GB/s & 12.4 GB/s \\
BusSpeedReadback & 10.5 GB/s & 10.56 GB/s & 34.00 GB/s & 6.7 GB/s & 6.8 GB/s & 12.5 GB/s \\
maxspflops & 3716 GFLOPS & 3658 GFLOPS & 10424 GFLOPS & \textcolor{color-4}{21581 GFLOPS} & \textcolor{color-4}{2314 GFLOPS} & 1647 GFLOPS \\
maxdpflops & 1412 GFLOPS & 1411 GFLOPS & 5315 GFLOPS & \textcolor{color-4}{16017 GFLOPS} & \textcolor{color-4}{2318 GFLOPS} & 884 GFLOPS \\
gmem\_readbw & 177 GB/s & 179 GB/s & 575.16 GB/s & 170 GB/s & 49.7 GB/s & 20.2 GB/s \\
gmem\_readbw\_strided & 18 GB/s & 20 GB/s & 99.15 GB/s & N/A & 35 GB/s & \textcolor{color-4}{156 GB/s} \\
gmem\_writebw & 175 GB/s & 188 GB/s & 436 GB/s & 72 GB/s & 41 GB/s & 13.6 GB/s \\
gmem\_writebw\_strided & 7 GB/s & 7 GB/s & 26.3 GB/s & N/A & 25 GB/s & \textcolor{color-4}{163 GB/s} \\
lmem\_readbw & 1168 GB/s & 1156 GB/s & 4239 GB/s & N/A & 442 GB/s & 238 GB/s \\
lmem\_writebw & 1194 GB/s & 1162 GB/s & 5488 GB/s & N/A & 477 GB/s & 295 GB/s \\
BFS & 49,236,500 Edges/s & 42,088,000 Edges/s & 91,935,100 Edges/s & N/A & 1,635,330 Edges/s & 14,225,600 Edges/s \\
FFT\_sp & 523 GFLOPS & 377 GFLOPS & 1472 GFLOPS & 135 GFLOPS & 71 GFLOPS & 80 GFLOPS \\
FFT\_dp & 262 GFLOPS & 61 GFLOPS & 733 GFLOPS & 69.5 GFLOPS & 31 GFLOPS & 55 GFLOPS \\
SGEMM & 2900-2990 GFLOPS & 694/761 GFLOPS & 8604-8720 GFLOPS & 640/645 GFLOPS & 179/217 GFLOPS & 419-554 GFLOPS \\
DGEMM & 1025-1083 GFLOPS & 411/433 GFLOPS & 3635-3785 GFLOPS & 179/190 GFLOPS & 76/100 GFLOPS & 189-196 GFLOPS \\
MD (SP) & 185 GFLOPS & 91 GFLOPS & 483 GFLOPS & 28 GFLOPS & 33 GFLOPS & 114 GFLOPS \\
MD5Hash & 3.38 GH/s & 3.36 GH/s & 15.77 GH/s & N/A & 1.7 GH/s & 1.29 GH/s \\
Reduction & 137 GB/s & 150 GB/s & 271 GB/s & 99 GB/s & 10 GB/s & 91 GB/s \\
Scan & 47 GB/s & 39 GB/s & 99.2 GB/s & 11 GB/s & 4.5 GB/s & 15 GB/s \\
Sort & 3.08 GB/s & 0.54 GB/s & 12.54 GB/s & N/A & 0.11 GB/s & 0.35 GB/s \\
Spmv & 4-23 GFLOPS & 3-17 GFLOPS & 23-65 GFLOPS & \textcolor{color-4}{1-17944 GFLOPS} & N/A & 1-10 GFLOPS \\
Stencil2D & 123 GFLOPS & 135 GFLOPS & 465 GFLOPS & 89 GFLOPS & 8.95 GFLOPS & 34 GFLOPS \\
Stencil2D\_dp & 57 GFLOPS & 67 GFLOPS & 258 GFLOPS & 16 GFLOPS & 7.92 GFLOPS & 30 GFLOPS \\
Triad & 13.5 GB/s & 9.9 GB/s & 43 GB/s & 5.76 GB/s & 5.57 GB/s & 8 GB/s \\
S3D (level2) & 94 GFLOPS & 91 GFLOPS & 294 GFLOPS & 109 GFLOPS & 18 GFLOPS & 27 GFLOPS \\
\end{tabularx}
\end{table}

In Table \ref{table:result:shoc}, the measurements marked in red are not relevant and should not be considered:

\begin{itemize}
\item KNC MaxFlops (both SP and DP): in this case the compiler optimizes away some of the computation (although it should not) \cite{ref-0034}.
\item KNC SpMV: this is a known bug that is currently being addressed \cite{ref-0035}.
\item Haswell \verb+gmem_readbw_strided+ and \verb+gmem_writebw_strided+: the strided read/write benchmarks do not make much sense on a CPU, since the data ends up cached in the large L3 cache; this is why high numbers appear only in the Haswell case (see the sketch after this list).
\end{itemize}
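To make the strided pattern concrete, the following minimal C sketch (ours; the array size, stride and timing are illustrative, not SHOC's actual parameters) times a contiguous read sweep against a strided sweep over the same data:

\begin{verbatim}
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N      (4 * 1024 * 1024) /* 4M doubles = 32 MiB, roughly a
                                    Haswell-EP L3 cache size */
#define STRIDE 16                /* 16 doubles = 2 cache lines */

static double seconds(void)
{
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int main(void)
{
  double *a = malloc(N * sizeof *a);
  double sum = 0.0, t;
  long   i, s;

  for (i = 0; i < N; i++) a[i] = 1.0;  /* touch pages, warm caches */

  t = seconds();                       /* contiguous read sweep */
  for (i = 0; i < N; i++) sum += a[i];
  printf("contiguous: %6.2f GB/s\n",
         N * sizeof *a / (seconds() - t) / 1e9);

  t = seconds();                       /* strided sweep over the
                                          same N elements */
  for (s = 0; s < STRIDE; s++)
    for (i = s; i < N; i += STRIDE) sum += a[i];
  printf("strided:    %6.2f GB/s\n",
         N * sizeof *a / (seconds() - t) / 1e9);

  printf("checksum: %f\n", sum);       /* defeat dead-code removal */
  free(a);
  return 0;
}
\end{verbatim}

On a GPU the strided sweep wastes most of each memory transaction, whereas on a Haswell CPU a working set of this size largely stays resident in the L3 cache, so the strided sweep runs at cache rather than DRAM speed; making \verb+N+ several times larger reverses the picture. This is the effect behind the red Haswell entries in Table \ref{table:result:shoc}.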

\subsection{SPECFEM3D\_GLOBE\label{ref-0155}}

Tests have been carried out on Ouessant (see Sect. \ref{sec:hardware}).

So far it has only been possible to run each test case on one fixed core count, so scaling curves are not available. Test case A ran on 4 KNL nodes and 4 P100 GPUs; test case B ran on 10 KNL nodes and 4 P100 GPUs. Results are shown in Table \ref{table:results:specfem}. Note that for test case B the two configurations differ in size, so the run times are not directly comparable.

\begin{table}
\centering
\caption{SPECFEM3D\_GLOBE results (run time in seconds)}
\begin{tabular}{ccc}
 & KNL & POWER8 + P100 \\
\hline\noalign{\smallskip}
Test case A & 66 & 105 \\
Test case B & 21.4 & 68 \\
\end{tabular}
\label{table:results:specfem}
\end{table}