
<!DOCTYPE html>
<html>
<head><meta charset="utf-8">
	<title></title>
</head>
<body>
<p>###<br />
###&nbsp;&nbsp; &nbsp;README - QCD Accelerator Benchmarksuite Part 2 &nbsp;<br />
###<br />
###&nbsp;&nbsp; 2017 -&nbsp; Jacob Finkenrath - CaSToRC - The Cyprus Institute&nbsp; (j.finkenrath@cyi.ac.cy)<br />
###</p>

<p>The QCD Accelerator Benchmark suite Part 2 consists of two kernels,<br />
the QUDA [1] and the QPhix [2] library. The QUDA library is based on CUDA and<br />
optimized for running on NVIDIA GPUs. The QPhix library consists of<br />
routines which are optimized to use Intel intrinsic functions of<br />
multiple vector lengths, including optimized routines for KNC and<br />
KNL. In both QUDA and QPhix, this benchmark uses the Conjugate<br />
Gradient solvers implemented within the libraries.</p>

<p>[1] R. Babich, M. Clark and B. Joo, &ldquo;Parallelizing the QUDA Library for Multi-GPU Calculations<br />
in Lattice Quantum Chromodynamics&rdquo;, SC 10 (Supercomputing 2010)</p>

<p>[2] B. Joo, D. D. Kalamkar, K. Vaidyanathan, M. Smelyanskiy, K. Pamnany, V. W. Lee, P. Dubey,<br />
W. Watson III, &ldquo;Lattice QCD on Intel Xeon Phi&rdquo;, International Supercomputing Conference (ISC&rsquo;13), 2013</p>

<p>###<br />
###&nbsp; Table of Contents<br />
###</p>

<p><br />
GPU - BENCHMARK SUITE (QUDA)<br />
1. Compile and Run the GPU-Benchmark Suite<br />
1.1 Compile<br />
1.2 Run<br />
1.2.1 Main-script: &quot;run_ana.sh&quot;<br />
1.2.2 Main-script: &quot;prepare_submit_job.sh&quot;<br />
1.2.3 Main-script: &quot;submit_job.sh.template&quot;<br />
1.3 Example Benchmark results</p>

<p>XEONPHI - BENCHMARK SUITE (QPHIX)<br />
2. Compile and Run the XeonPhi-Benchmark Suite<br />
2.1 Compile<br />
2.1.1 Example compilation on PRACE machines<br />
2.1.1.1 BSC - Marenostrum III Hybrid partitions<br />
2.1.1.2 CINES - Frioul<br />
2.2 Run<br />
2.2.1 Main-script: &quot;run_ana.sh&quot;<br />
2.2.2 Main-script: &quot;prepare_submit_job.sh&quot;<br />
2.2.3 Main-script: &quot;submit_job.sh.template&quot;<br />
2.3 Example Benchmark Results</p>

<p><br />
###<br />
###<br />
###&nbsp;&nbsp; GPU - BENCHMARK SUITE<br />
###<br />
###<br />
##<br />
## 1. Compile and Run the GPU-Benchmark Suite<br />
##</p>

<p>##<br />
## 1.1 Compile<br />
##</p>

<p>Download CMake and QUDA.</p>

<p>General information on how to build QUDA with cmake can be found at:<br />
&quot;https://github.com/lattice/quda/wiki/Building-QUDA-with-cmake&quot;. Here<br />
we just give a short overview:</p>

<p>Build Cmake: (./QCD_Accelerator_Benchmarksuite_Part2/GPUs/src/cmake-3.7.0.tar.gz)</p>

<p>The CMake source can be downloaded from:<br />
https://cmake.org/download/. This guide uses version 3.7.0. The build<br />
instructions can be found in the main directory in &quot;README.rst&quot;. Run<br />
the configure script &quot;./configure&quot;, then run &quot;gmake&quot; to compile.</p>

<p>Build Quda: (./QCD_Accelerator_Benchmarksuite_Part2/GPUs/src/quda.tar.gz)</p>

<p>Download QUDA, for example by using &quot;git clone<br />
https://github.com/lattice/quda.git&quot;.&nbsp; Create a build folder and run<br />
&quot;cmake&quot; inside it; the cmake executable should be under cmake/bin if you<br />
compiled cmake from source. Execute:</p>

<p>./$PATH2CMAKE/cmake $PATH2QUDA -DQUDA_GPU_ARCH=sm_XX -DQUDA_DIRAC_WILSON=ON -DQUDA_DIRAC_TWISTED_MASS=OFF<br />
-DQUDA_DIRAC_DOMAIN_WALL=OFF -DQUDA_HISQ_LINK=OFF -DQUDA_GAUGE_FORCE=OFF -DQUDA_HISQ_FORCE=OFF -DQUDA_MPI=ON</p>

<p>with</p>

<p>&nbsp;&nbsp; &nbsp;PATH2CMAKE=&lt;path to the cmake-executable&gt;<br />
&nbsp;&nbsp; &nbsp;PATH2QUDA=&lt;path to the home dir of QUDA&gt;</p>

<p>Set -DQUDA_GPU_ARCH=sm_XX to the GPU Architecture (sm_60 for Pascal, sm_35 for Kepler)</p>
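<p>The configure call above can be wrapped in a small helper so that only the GPU architecture changes between machines. The following is a minimal sketch and not part of the benchmark suite: the function name quda_cmake_cmd is hypothetical, the default paths are placeholders, and the helper only prints the command (a dry run) instead of executing cmake.</p>

<pre>
```shell
#!/bin/bash
# Hypothetical helper (not part of the suite): print the cmake command
# for a given GPU architecture without executing it.
PATH2CMAKE="${PATH2CMAKE:-cmake/bin}"   # placeholder default
PATH2QUDA="${PATH2QUDA:-../quda}"       # placeholder default

quda_cmake_cmd() {
    local arch="$1"   # e.g. sm_60 for Pascal, sm_35 for Kepler
    echo "$PATH2CMAKE/cmake $PATH2QUDA" \
         "-DQUDA_GPU_ARCH=$arch" \
         "-DQUDA_DIRAC_WILSON=ON" \
         "-DQUDA_DIRAC_TWISTED_MASS=OFF" \
         "-DQUDA_DIRAC_DOMAIN_WALL=OFF" \
         "-DQUDA_HISQ_LINK=OFF" \
         "-DQUDA_GAUGE_FORCE=OFF" \
         "-DQUDA_HISQ_FORCE=OFF" \
         "-DQUDA_MPI=ON"
}

# Dry run for a Pascal GPU.
quda_cmake_cmd sm_60
```
</pre>

<p>Printing the command first makes it easy to inspect the options before running the configuration inside the build folder.</p>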

<p>If cmake or the compilation fails, library paths and options can be<br />
set via the text user interface of cmake by using &quot;ccmake&quot;.&nbsp; Use<br />
&quot;./PATH2CMAKE/ccmake PATH2BUILD_DIR&quot; to see and edit the available<br />
options. After successfully configuring the build, run &quot;make&quot;.&nbsp; The<br />
QUDA executables needed for the benchmark, whose names begin with<br />
&quot;invert_&quot;, can then be found in the folder test/.</p>

<p>##<br />
##&nbsp;&nbsp; &nbsp;1.2 Run<br />
##</p>

<p><br />
The Accelerator QCD-Benchmarksuite Part 2 provides bash scripts<br />
located in the folder<br />
&quot;./QCD_Accelerator_Benchmarksuite_Part2/GPUs/scripts&quot; to set up the<br />
benchmark runs on the target machines. These bash scripts are:</p>

<p>run_ana.sh&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; :&nbsp;&nbsp; Main-script, sets up the benchmark mode, submits the jobs and analyses the results<br />
prepare_submit_job.sh&nbsp;&nbsp; :&nbsp;&nbsp; Generates the job-scripts<br />
submit_job.sh.template&nbsp; :&nbsp;&nbsp; Template for submit script</p>

<p>##<br />
## 1.2.1 Main-script: &quot;run_ana.sh&quot;<br />
##</p>

<p>The path to the executable has to be set by $PATH2EXE.&nbsp; Upon the first<br />
run, QUDA automatically tunes the GPU kernels by sweeping the number<br />
of threads per block. The optimal setup is saved in the folder<br />
pointed to by the environment variable &quot;QUDA_RESOURCE_PATH&quot;. You<br />
must set this variable to the folder where the tuning data should be<br />
saved, otherwise the tuning data will be lost and performance will be<br />
sub-optimal. Strong scaling or weak scaling can be chosen by<br />
using the variable sca_mode (=&quot;Strong&quot; or =&quot;Weak&quot;).&nbsp; The lattice sizes<br />
can be set by &quot;gx&quot; and &quot;gt&quot;.&nbsp; Choose mode=&quot;Run&quot; for run mode or<br />
mode=&quot;Analysis&quot; for extracting the GFLOPS. Note that the script<br />
assumes Slurm is used as the job scheduler. If not, change the line<br />
which includes the &quot;sbatch&quot; command accordingly.</p>
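<p>The settings described above might look as follows near the top of a run script. This is only a sketch under stated assumptions: all values and paths are placeholders, reading &quot;gx&quot; as the spatial and &quot;gt&quot; as the temporal extent is an assumption, the helper check_setting is hypothetical, and the actual layout of run_ana.sh may differ.</p>

<pre>
```shell
#!/bin/bash
# Sketch of the settings described above; all values are placeholders.
PATH2EXE="./quda/build/test"                 # assumed location of the invert_ executables
export QUDA_RESOURCE_PATH="$PWD/tunecache"   # folder in which QUDA keeps its tuning data
mkdir -p "$QUDA_RESOURCE_PATH"               # assumption: create the folder before the run

sca_mode="Strong"   # "Strong" or "Weak"
mode="Run"          # "Run" or "Analysis"
gx=32               # assumed: spatial lattice extent
gt=96               # assumed: temporal lattice extent

# Hypothetical helper: succeed only if value $1 equals one of the
# allowed values passed as the remaining arguments.
check_setting() {
    local value="$1"; shift
    local allowed
    for allowed in "$@"; do
        if [ "$value" = "$allowed" ]; then return 0; fi
    done
    return 1
}

if check_setting "$sca_mode" Strong Weak; then
    echo "scaling mode: $sca_mode, lattice: ${gx}^3 x ${gt}, mode: $mode"
else
    echo "unknown sca_mode: $sca_mode"
fi
```
</pre>

<p>Checking the mode strings up front avoids submitting jobs with a mistyped setting.</p>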

<p>##<br />
## 1.2.2 Main-script: &quot;prepare_submit_job.sh&quot;<br />
##</p>

<p>Add additional options if necessary.</p>

<p>##<br />
## 1.2.3 Main-script: &quot;submit_job.sh.template&quot;<br />
##</p>

<p>The submit-template will be edited by &quot;prepare_submit_job.sh&quot; to<br />
generate the final submit-script. The first lines (beginning with<br />
&quot;#SBATCH&quot;) depend on the queuing system of the target machine, which in<br />
this case is assumed to be Slurm. These should be changed in case of a<br />
different queuing system.</p>
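<p>How such a template could be turned into a job script is sketched below with sed. The placeholder names NODES_PLACEHOLDER and NAME_PLACEHOLDER, the choice of #SBATCH options and the function fill_template are assumptions made for illustration; the placeholders used by the actual &quot;submit_job.sh.template&quot; may differ.</p>

<pre>
```shell
#!/bin/bash
# Hypothetical sketch: fill placeholders in a Slurm template read from stdin.
fill_template() {
    local nodes="$1" name="$2"
    sed -e "s/NODES_PLACEHOLDER/$nodes/" \
        -e "s/NAME_PLACEHOLDER/$name/"
}

# A two-line stand-in for the template (the real one has more lines).
printf '%s\n' \
    '#SBATCH --nodes=NODES_PLACEHOLDER' \
    '#SBATCH --job-name=NAME_PLACEHOLDER' |
    fill_template 4 quda_strong
```
</pre>

<p>For a different queuing system only the stand-in header lines would change; the substitution step stays the same.</p>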


<p>##<br />
## 1.3 Example Benchmark results<br />
##</p>

<p>Shown here are the benchmark results on PizDaint, located at CSCS in Switzerland,<br />
and on the GPGPU partition of Cartesius at Surfsara in Amsterdam, the Netherlands. The runs were performed using the provided bash scripts. PizDaint has one Pascal GPU per node and two different test cases are shown:<br />
the &quot;Strong-Scaling&quot; mode with a random lattice configuration of size 32^3x96 and<br />
the &quot;Weak-Scaling&quot; mode with a configuration of local lattice size 48^3x24.<br />
The GPGPU nodes of Cartesius have two Kepler GPUs per node, and the &quot;Strong-Scaling&quot; test is shown for the cases<br />
that one card per node and two cards per node are used.<br />
The benchmarks use the Conjugate Gradient solver, which<br />
solves the linear equation D * x = b for the unknown solution &quot;x&quot;, based on the clover-improved Wilson Dirac operator<br />
&quot;D&quot; and a known right-hand side &quot;b&quot;.</p>

<p>---------------------<br />
&nbsp; PizDaint - Pascal&nbsp; P100<br />
---------------------<br />
Strong - Scaling:<br />
global lattice size (32x32x32x96)</p>

<p>sloppy-precision: single<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; precision: single</p>

<p>GPUs&nbsp;&nbsp;&nbsp;&nbsp; GFLOPS&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; sec<br />
1&nbsp;&nbsp;&nbsp; 786.520000 4.569600<br />
2&nbsp;&nbsp; 1522.410000 3.086040<br />
4&nbsp;&nbsp; 2476.900000 2.447180<br />
8&nbsp;&nbsp; 3426.020000 2.117580<br />
16&nbsp; 5091.330000 1.895790<br />
32&nbsp; 8234.310000 1.860760<br />
64&nbsp; 8276.480000 1.869230</p>

<p>sloppy-precision: double<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; precision: double</p>

<p>GPUs&nbsp;&nbsp;&nbsp;&nbsp; GFLOPS&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; sec<br />
1&nbsp;&nbsp;&nbsp; 385.965000 6.126730<br />
2&nbsp;&nbsp;&nbsp; 751.227000 3.846940<br />
4&nbsp;&nbsp; 1431.570000 2.774470<br />
8&nbsp;&nbsp; 1368.000000 2.367040<br />
16&nbsp; 2304.900000 2.071160<br />
32&nbsp; 4965.480000 2.095180<br />
64&nbsp; 2308.850000 2.005110</p>

<p><br />
Weak - Scaling:<br />
local lattice size (48x48x48x24)</p>

<p>sloppy-precision: single<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; precision: single</p>

<p>GPUs&nbsp;&nbsp;&nbsp;&nbsp; GFLOPS&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; sec<br />
1&nbsp;&nbsp;&nbsp;&nbsp; 765.967000 3.940280<br />
2&nbsp;&nbsp;&nbsp; 1472.980000 4.004630<br />
4&nbsp;&nbsp;&nbsp; 2865.600000 4.044360<br />
8&nbsp;&nbsp;&nbsp; 5421.270000 4.056410<br />
16&nbsp;&nbsp; 9373.760000 7.396590<br />
32&nbsp; 17995.100000 4.243390<br />
64&nbsp; 27219.800000 4.535410</p>

<p>sloppy-precision: double<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; precision: double</p>

<p>GPUs&nbsp;&nbsp;&nbsp; GFLOPS&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; sec<br />
&nbsp;1&nbsp;&nbsp; 376.611000 5.108900<br />
&nbsp;2&nbsp;&nbsp; 728.973000 5.190880<br />
&nbsp;4&nbsp; 1453.500000 5.144160<br />
&nbsp;8&nbsp; 2884.390000 5.207090<br />
16&nbsp; 5004.520000 5.362020<br />
32&nbsp; 8744.090000 5.623290<br />
64&nbsp; 14053.00000 5.910520</p>
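<p>A convenient way to read tables like these is the parallel efficiency based on the GFLOPS column, i.e. the GFLOPS measured on N GPUs divided by N times the single-GPU GFLOPS. The sketch below computes it with awk; the function name efficiency is hypothetical and not part of the provided scripts, and the inputs are taken from the single-precision PizDaint tables above.</p>

<pre>
```shell
#!/bin/bash
# Hypothetical helper: parallel efficiency in percent,
#   100 * GFLOPS(N) / (N * GFLOPS(1)).
efficiency() {
    local gflops_1="$1" n="$2" gflops_n="$3"
    awk -v g1="$gflops_1" -v n="$n" -v gn="$gflops_n" \
        'BEGIN { printf "%.1f\n", 100 * gn / (n * g1) }'
}

# Single precision, 64 GPUs on PizDaint:
efficiency 786.520 64 8276.48     # strong scaling
efficiency 765.967 64 27219.8     # weak scaling
```
</pre>

<p>With these inputs the two calls print 16.4 and 55.5, which reflects that the fixed 32^3x96 lattice leaves little work per GPU at 64 GPUs, while the weak-scaling case keeps the local volume constant.</p>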

<p><br />
---------------------<br />
&nbsp; SurfSara - Kepler&nbsp;&nbsp; K20m<br />
---------------------<br />
##<br />
## 1 GPU per Node<br />
##</p>

<p>Strong - Scaling:<br />
global lattice size (32x32x32x96)</p>

<p>sloppy-precision: single<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; precision: single</p>

<p>GPUs&nbsp;&nbsp;&nbsp; GFLOPS&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; sec<br />
1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 243.084000 4.030000<br />
2&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 478.179000 2.630000<br />
4&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 939.953000 2.250000<br />
8&nbsp;&nbsp;&nbsp;&nbsp; 1798.240000 1.570000<br />
16&nbsp;&nbsp;&nbsp; 3072.440000 1.730000<br />
32&nbsp;&nbsp;&nbsp; 4365.320000 1.310000</p>

<p>sloppy-precision: double<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; precision: double</p>

<p>GPUs&nbsp;&nbsp;&nbsp; GFLOPS&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; sec<br />
1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 119.786000 6.060000<br />
2&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 234.179000 3.290000<br />
4&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 463.594000 2.250000<br />
8&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 898.090000 1.960000<br />
16&nbsp;&nbsp;&nbsp; 1604.210000 1.480000<br />
32&nbsp;&nbsp;&nbsp; 2420.130000 1.630000</p>

<p>##<br />
## 2 GPUs per Node<br />
##</p>

<p>Strong - Scaling:<br />
global lattice size (32x32x32x96)</p>

<p>sloppy-precision: single<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; precision: single</p>

<p>GPUs&nbsp;&nbsp;&nbsp; GFLOPS&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; sec<br />
2&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 463.041000 2.720000<br />
4&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 896.707000 1.940000<br />
8&nbsp;&nbsp;&nbsp;&nbsp; 1672.080000 1.680000<br />
16&nbsp;&nbsp;&nbsp; 2518.240000 1.420000<br />
32&nbsp;&nbsp;&nbsp; 3800.970000 1.460000<br />
64&nbsp;&nbsp;&nbsp; 4505.440000 1.430000</p>

<p>sloppy-precision: double<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; precision: double</p>

<p>GPUs&nbsp;&nbsp;&nbsp; GFLOPS&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; sec<br />
2&nbsp;&nbsp;&nbsp;&nbsp; 229.579000 3.380000<br />
4&nbsp;&nbsp;&nbsp;&nbsp; 450.425000 2.280000<br />
8&nbsp;&nbsp;&nbsp;&nbsp; 863.117000 1.830000<br />
16&nbsp;&nbsp; 1348.760000 1.510000<br />
32&nbsp;&nbsp; 1842.560000 1.550000<br />
64&nbsp;&nbsp; 2645.590000 1.480000</p>

<p>###<br />
###<br />
###&nbsp;&nbsp; XEONPHI - BENCHMARK SUITE<br />
###<br />
###</p>

<p>##<br />
## 2. Compile and Run the XeonPhi-Benchmark Suite<br />
##</p>

<p>Unpack the provided source tar-file located in<br />
&quot;./QCD_Accelerator_Benchmarksuite_Part2/XeonPhi/src&quot; or clone the<br />
current GitHub versions of the code packages. For QMP:</p>

<p>&quot;git clone https://github.com/usqcd-software/qmp&quot;</p>

<p>and for QPhix</p>

<p>&quot;git clone https://github.com/JeffersonLab/qphix&quot;</p>

<p>Note that the AVX512 instructions, which are needed for an optimal run<br />
on KNLs, are not yet part of the main branch. The AVX512 instructions<br />
are available in the avx512 branch (&quot;git checkout avx512&quot;). The<br />
provided source file uses the avx512 branch (status as of 01/2017).</p>

<p>##<br />
## 2.1 Compile<br />
##</p>

<p>The QPhix library must be built upon QMP, a thin communication layer<br />
on top of MPI. Compile QMP first:</p>

<p>./configure --prefix=$QMP_INSTALL_DIR CC=mpiicc CFLAGS=&quot; -mmic/-xMIC-AVX512 -std=c99&quot; --with-qmp-comms-type=MPI --host=x86_64-linux-gnu --build=none-none-none</p>

<p>Create the install folder and set $QMP_INSTALL_DIR to point to it.&nbsp; In<br />
CFLAGS, use only one of the two architecture flags: &quot;-mmic&quot; when compiling for KNC or<br />
&quot;-xMIC-AVX512&quot; when compiling for KNL.&nbsp; Then use &quot;make&quot; to compile<br />
and &quot;make install&quot; to copy the necessary files to<br />
$QMP_INSTALL_DIR.</p>
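<p>Since the QMP configure line differs between the two targets only in the architecture flag, the choice can be sketched as a small case statement. The function name qmp_configure_cmd is hypothetical, the KNL flag follows the KNL examples in section 2.1.1.2, and the helper only prints the command rather than running it.</p>

<pre>
```shell
#!/bin/bash
# Hypothetical helper: pick the architecture flag for the target and
# print (not run) the QMP configure line.
qmp_configure_cmd() {
    local target="$1" arch_flag
    case "$target" in
        KNC) arch_flag="-mmic" ;;
        KNL) arch_flag="-xMIC-AVX512" ;;
        *)   echo "unknown target: $target"; return 1 ;;
    esac
    echo "./configure --prefix=\$QMP_INSTALL_DIR CC=mpiicc" \
         "CFLAGS=\"$arch_flag -std=c99\"" \
         "--with-qmp-comms-type=MPI" \
         "--host=x86_64-linux-gnu --build=none-none-none"
}

# Dry run for a KNL build.
qmp_configure_cmd KNL
```
</pre>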

<p>The QPhix executable can be compiled by using, for KNC:</p>

<p>./configure --enable-parallel-arch=parscalar --enable-proc=MIC --enable-soalen=8 --enable-clover --enable-openmp --enable-cean --enable-mm-malloc CXXFLAGS=&quot;-openmp -mmic -vec-report -restrict -mGLOB_default_function_attrs=\&quot;use_gather_scatter_hint=off\&quot; -g -O2 -finline-functions -fno-alias -std=c++0x&quot; CFLAGS=&quot;-mmic -vec-report -restrict -mGLOB_default_function_attrs=\&quot;use_gather_scatter_hint=off\&quot; -openmp -g -O2 -fno-alias -std=c99&quot; CXX=mpiicpc CC=mpiicc --host=x86_64-linux-gnu --build=none-none-none --with-qmp=$QMP_INSTALL_DIR</p>

<p>or for KNL:</p>

<p>./configure --enable-parallel-arch=parscalar --enable-proc=AVX512 --enable-soalen=8 --enable-clover --enable-openmp --enable-cean --enable-mm-malloc CXXFLAGS=&quot;-qopenmp -xMIC-AVX512 -g -O3 -std=c++14&quot; CFLAGS=&quot;-xMIC-AVX512 -qopenmp -O3 -std=c99&quot; CXX=mpiicpc CC=mpiicc --host=x86_64-linux-gnu --build=none-none-none --with-qmp=$QMP_INSTALL_DIR</p>

<p>Here the previously set variable QMP_INSTALL_DIR points to<br />
the folder in which the QMP library was installed. The executable<br />
&quot;time_clov_noqdp&quot; should appear in the &quot;./qphix/test&quot; folder. Note<br />
that the avx512 branch also tries to compile an additional executable which<br />
depends on the package QDP; this generates an error at the<br />
end of the compilation process.</p>
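<p>Because the build may end with an error on the QDP-dependent executable, the exit status of make alone does not tell whether the benchmark binary was built. A minimal sketch of a direct check follows; the function name have_benchmark_exe is hypothetical, and the folder is the one quoted above.</p>

<pre>
```shell
#!/bin/bash
# Hypothetical check: succeed if time_clov_noqdp exists and is
# executable inside the given test folder.
have_benchmark_exe() {
    local test_dir="$1"
    [ -x "$test_dir/time_clov_noqdp" ]
}

if have_benchmark_exe ./qphix/test; then
    echo "time_clov_noqdp found"
else
    echo "time_clov_noqdp missing, inspect the make output"
fi
```
</pre>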

<p>##<br />
## 2.1.1 Example compilation on PRACE machines<br />
##</p>

<p>In this subsection we provide example compilations on the PRACE machines<br />
which were used to develop the QCD Benchmarksuite 2.</p>

<p>##<br />
## 2.1.1.1 BSC - Marenostrum III Hybrid partitions<br />
##</p>

<p>The nodes of the hybrid partition of Marenostrum are equipped with KNC<br />
cards. First load the following modules:</p>

<p>module unload openmpi<br />
module load impi</p>

<p>and then set up the appropriate environment with:</p>

<p>source /opt/intel/impi/4.1.1.036/bin64/mpivars.sh<br />
source /opt/intel/2013.5.192/composer_xe_2013.5.192/bin/compilervars.sh intel64<br />
export I_MPI_MIC=enable<br />
export I_MPI_HYDRA_BOOTSTRAP=ssh</p>

<p>Configure and compile the QMP-library with:</p>

<p>./configure --prefix=$QMP_INSTALL_DIR CC=mpiicc CFLAGS=&quot;-mmic -std=c99&quot; --with-qmp-comms-type=MPI --host=x86_64-linux-gnu --build=none-none-none</p>

<p>make<br />
make install</p>

<p>Configure and compile QPhix with:</p>

<p>./configure --enable-parallel-arch=parscalar --enable-proc=MIC --enable-soalen=8 --enable-clover --enable-openmp --enable-cean --enable-mm-malloc CXXFLAGS=&quot;-openmp -mmic -vec-report -restrict -mGLOB_default_function_attrs=\&quot;use_gather_scatter_hint=off\&quot; -g -O2 -finline-functions -fno-alias -std=c++0x&quot; CFLAGS=&quot;-mmic -vec-report -restrict -mGLOB_default_function_attrs=\&quot;use_gather_scatter_hint=off\&quot; -openmp -g -O2 -fno-alias -std=c99&quot; CXX=mpiicpc CC=mpiicc --host=x86_64-linux-gnu --build=none-none-none --with-qmp=$QMP_INSTALL_DIR<br />
make</p>

<p>##<br />
## 2.1.1.2 CINES - Frioul<br />
##</p>

<p>On Frioul, a test cluster at CINES, the Benchmarksuite was tested on KNL cards.<br />
The steps are similar to those for Marenostrum above. First set up the appropriate environment with:</p>

<p>source /opt/software/intel/composer_xe_2015/bin/compilervars.sh intel64<br />
source /opt/software/intel/impi_5.0.3/bin64/mpivars.sh</p>

<p>Configure and compile QMP with:</p>

<p>./configure --prefix=$QMP_INSTALL_DIR CC=mpiicc CFLAGS=&quot;-xMIC-AVX512 -mGLOB_default_function_attrs=\&quot;use_gather_scatter_hint=off\&quot; -openmp -g -O2 -fno-alias -std=c99&quot; --with-qmp-comms-type=MPI --host=x86_64-linux-gnu --build=none-none-none</p>

<p>make<br />
make install</p>

<p>Configure and compile QPhix with:</p>

<p>./configure --enable-parallel-arch=parscalar --enable-proc=AVX512 --enable-soalen=8 --enable-clover --enable-openmp --enable-cean --enable-mm-malloc CXXFLAGS=&quot;-qopenmp -xMIC-AVX512 -g -O3 -std=c++14&quot; CFLAGS=&quot;-xMIC-AVX512 -qopenmp -O3 -std=c99&quot; CXX=mpiicpc CC=mpiicc --host=x86_64-linux-gnu --build=none-none-none --with-qmp=/home/finkenrath/benchmark/qmp/install</p>

<p>and</p>

<p>make</p>

<p>##<br />
##&nbsp;&nbsp; &nbsp;2.2 Run<br />
##</p>

<p><br />
The Accelerator QCD-Benchmarksuite Part 2 provides bash scripts to<br />
set up the benchmark runs on the target machines. These are:</p>

<p>run_ana.sh&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; :&nbsp;&nbsp; Main-script, sets up the benchmark mode, submits the jobs and analyses the results<br />
prepare_submit_job.sh&nbsp;&nbsp; :&nbsp;&nbsp; Generate the job-scripts<br />
submit_job.sh.template&nbsp; :&nbsp;&nbsp; Template for submit script</p>

<p>##<br />
## 2.2.1 Main-script: &quot;run_ana.sh&quot;<br />
##</p>

<p>The path to the executable has to be set by $PATH2EXE.&nbsp; Choose<br />
either strong scaling or weak scaling by setting the<br />
variable sca_mode (=&quot;Strong&quot; or =&quot;Weak&quot;). The lattice sizes can be set<br />
by &quot;gx&quot; and &quot;gt&quot;.&nbsp; Choose mode=&quot;Run&quot; for run mode or<br />
mode=&quot;Analysis&quot; for extracting the GFLOPS. Note that the script<br />
assumes Slurm is used as the job scheduler. If not, change the line<br />
which includes the &quot;sbatch&quot; command accordingly.</p>

<p>##<br />
## 2.2.2 Main-script: &quot;prepare_submit_job.sh&quot;<br />
##</p>

<p>Add additional options if necessary.</p>

<p>##<br />
## 2.2.3 Main-script: &quot;submit_job.sh.template&quot;<br />
##</p>

<p>The submit-template will be edited by &quot;prepare_submit_job.sh&quot; to<br />
generate the final submit-script. The first lines (beginning with<br />
&quot;#SBATCH&quot;) depend on the queuing system of the target machine, which<br />
in this case is assumed to be Slurm. These should be changed in case<br />
of a different queuing system.</p>

<p>##<br />
## 2.3 Example Benchmark Results<br />
##</p>

<p><br />
The benchmark results for the XeonPhi benchmark suite were obtained on<br />
Frioul, a test cluster at CINES, and the hybrid partition of MareNostrum III at BSC.<br />
Frioul has one KNL card per node while the hybrid partition of MareNostrum III is<br />
equipped with two KNCs per node. The data on Frioul were generated using<br />
the bash scripts provided by the QCD Accelerator Benchmarksuite Part 2<br />
for the two test cases &quot;Strong-Scaling&quot; with a lattice size<br />
of 32^3x96 and &quot;Weak-Scaling&quot; with a local lattice size of 48^3x24 per<br />
card. For MareNostrum, data for the &quot;Strong-Scaling&quot;<br />
mode on a 32^3x96 lattice are shown. The benchmark uses a random gauge configuration and the<br />
Conjugate Gradient solver to solve a linear equation involving the clover Wilson Dirac operator.</p>

<p>---------------------<br />
&nbsp; Frioul - KNLs<br />
---------------------<br />
Strong - Scaling:<br />
global lattice size (32x32x32x96)</p>

<p>precision: single</p>

<p>KNLs&nbsp;&nbsp;&nbsp;&nbsp; GFLOPS &nbsp;<br />
1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 340.75<br />
2&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 627.612<br />
4&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1111.13<br />
8&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1779.34<br />
16&nbsp;&nbsp;&nbsp;&nbsp; 2410.8</p>

<p>precision: double</p>

<p>KNLs&nbsp;&nbsp;&nbsp;&nbsp; GFLOPS&nbsp;&nbsp; &nbsp;<br />
1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 328.149<br />
2&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 616.467<br />
4&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1047.79<br />
8&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1616.37</p>

<p><br />
Weak - Scaling:<br />
local lattice size (48x48x48x24)</p>

<p>precision: single</p>

<p>KNLs&nbsp;&nbsp; GFLOPS &nbsp;<br />
1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 348.304<br />
2&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 616.697<br />
4&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1214.82<br />
8&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2425.45<br />
16&nbsp;&nbsp;&nbsp;&nbsp; 4404.63</p>

<p>precision: double</p>

<p>KNLs&nbsp;&nbsp; GFLOPS&nbsp;&nbsp; &nbsp;<br />
&nbsp;1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 172.303<br />
&nbsp;2&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 320.761<br />
&nbsp;4&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 629.79<br />
&nbsp;8&nbsp;&nbsp;&nbsp;&nbsp; 1228.77<br />
16&nbsp;&nbsp;&nbsp;&nbsp; 2310.63</p>

<p>---------------------<br />
&nbsp; MareNostrum III - KNCs<br />
---------------------</p>

<p>Strong - Scaling:<br />
global lattice size (32x32x32x96)</p>

<p>precision: single - 1 Card per Node</p>

<p>KNCs&nbsp; GFLOPS<br />
2&nbsp;&nbsp;&nbsp; 103.561<br />
4&nbsp;&nbsp;&nbsp; 200.159<br />
8&nbsp;&nbsp;&nbsp; 338.276<br />
16&nbsp;&nbsp; 534.369<br />
32&nbsp;&nbsp; 815.896</p>

<p>precision: single - 2 Cards per Node</p>

<p>KNCs&nbsp; GFLOPS<br />
4&nbsp;&nbsp;&nbsp; 118.995<br />
8&nbsp;&nbsp;&nbsp; 212.558<br />
16&nbsp;&nbsp; 368.196<br />
32&nbsp;&nbsp; 605.882<br />
64&nbsp;&nbsp; 847.566</p>

<p>&nbsp;</p>
</body>
</html>