README

###
###	README - QCD Accelerator Benchmarksuite Part 2  
###
###   2017 -  Jacob Finkenrath - CaSToRC - The Cyprus Institute  (j.finkenrath@cyi.ac.cy)
###

The QCD Accelerator Benchmark suite Part 2 consists of two kernels, based on
the QUDA library [1] and the QPhix library [2]. QUDA is based on CUDA and is optimized for running on NVIDIA GPUs (https://lattice.github.io/quda/). QPhix consists of routines which are optimized to use Intel intrinsic functions for multiple vector lengths, including optimized routines for KNC and KNL (http://jeffersonlab.github.io/qphix/).
The benchmark code uses the Conjugate Gradient benchmark functions provided by the libraries.


[1] R. Babich, M. Clark and B. Joo, “Parallelizing the QUDA Library for Multi-GPU Calculations
in Lattice Quantum Chromodynamics”, SC 10 (Supercomputing 2010)

[2] B. Joo, D. D. Kalamkar, K. Vaidyanathan, M. Smelyanskiy, K. Pamnany, V. W. Lee, P. Dubey, 
W. Watson III, “Lattice QCD on Intel Xeon Phi”, International Supercomputing Conference (ISC’13), 2013

###
###  Table of Contents
###


GPU - BENCHMARK SUITE (QUDA)
1. Compile and Run the GPU-Benchmark Suite
1.1 Compile 
1.2 Run
1.2.1 Main-script: "run_ana.sh"
1.2.2 Main-script: "prepare_submit_job.sh"
1.2.3 Main-script: "submit_job.sh.template"
1.3 Example Benchmark results

XEONPHI - BENCHMARK SUITE (QPHIX)
2. Compile and Run the XeonPhi-Benchmark Suite
2.1 Compile
2.1.1 Example compilation on PRACE machines
2.1.1.1 BSC - Marenostrum III Hybrid partitions 
2.1.1.2 CINES - Frioul
2.2 Run
2.2.1 Main-script: "run_ana.sh"
2.2.2 Main-script: "prepare_submit_job.sh"
2.2.3 Main-script: "submit_job.sh.template"
2.3 Example Benchmark Results


###
###
###   GPU - BENCHMARK SUITE
###
###
##
## 1. Compile and Run the GPU-Benchmark Suite
##


##
## 1.1 Compile 
##

Download Cmake and Quda

General information on how to build QUDA with CMake can be found at:
"https://github.com/lattice/quda/wiki/Building-QUDA-with-cmake"
Here we give only a short overview:

Build Cmake: (./QCD_Accelerator_Benchmarksuite_Part2/GPUs/src/cmake-3.7.0.tar.gz)

CMake can be downloaded as source from https://cmake.org/download/ .
This guide uses version cmake-3.7.0. The build instructions can be found
in README.rst in the main directory of the CMake source. Run the configure script "./configure",
then run "gmake".

Build Quda: (./QCD_Accelerator_Benchmarksuite_Part2/GPUs/src/quda.tar.gz)

Download QUDA, for example by using "git clone https://github.com/lattice/quda.git".
Create a build folder and, from inside it, run the "cmake" executable, which
is located in cmake/bin:

$PATH2CMAKE/cmake $PATH2QUDA -DQUDA_GPU_ARCH=sm_XX -DQUDA_DIRAC_WILSON=ON -DQUDA_DIRAC_TWISTED_MASS=OFF \
-DQUDA_DIRAC_DOMAIN_WALL=OFF -DQUDA_HISQ_LINK=OFF -DQUDA_GAUGE_FORCE=OFF -DQUDA_HISQ_FORCE=OFF -DQUDA_MPI=ON

with

	PATH2CMAKE = path to the directory containing the cmake executable
	PATH2QUDA  = path to the home directory of QUDA

Set -DQUDA_GPU_ARCH=sm_XX to the GPU architecture of the target machine (sm_60 for Pascal, sm_35 for Kepler).

If CMake or the compilation fails, library paths and options can be set with "ccmake", which is provided by CMake.
Use "$PATH2CMAKE/ccmake PATH2BUILD_DIR" to see and edit the available options.
CMake generates the Makefiles. Run them by using "make".
The required QUDA executable "invert_" can now be found in the folder /test.
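
The configure step above can be wrapped in a small helper so that the command
line can be inspected before it is run. The following is a minimal sketch and
not part of the benchmark suite: the function names "quda_gpu_arch" and
"quda_cmake_cmd" are hypothetical, and only the two GPU generations named in
this README are mapped. It prints the cmake call instead of executing it.

```shell
#!/bin/bash
# Sketch only: helper names are hypothetical, not part of QUDA or this suite.

# Map a GPU generation to the value expected by -DQUDA_GPU_ARCH.
# Only the two generations mentioned in this README are covered.
quda_gpu_arch() {
    case "$1" in
        pascal) echo "sm_60" ;;
        kepler) echo "sm_35" ;;
        *) echo "unknown GPU generation: $1" >&2; return 1 ;;
    esac
}

# Print (do not run) a cmake call for QUDA.
# $1 = directory containing cmake, $2 = QUDA source directory,
# $3 = GPU generation, remaining arguments = further -D options
quda_cmake_cmd() {
    local cmake_dir=$1 quda_src=$2 arch
    arch=$(quda_gpu_arch "$3") || return 1
    shift 3
    # Rebuild the argument list and print it as one space-separated line.
    set -- "$cmake_dir/cmake" "$quda_src" "-DQUDA_GPU_ARCH=$arch" "-DQUDA_MPI=ON" "$@"
    echo "$*"
}

quda_cmake_cmd /path/to/cmake/bin /path/to/quda pascal -DQUDA_DIRAC_WILSON=ON
```

The remaining -D options from the command above can be appended as further
arguments in the same way.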

##
##	1.2 Run
##


The Accelerator QCD-Benchmarksuite Part 2 provides bash scripts, located in the folder "./QCD_Accelerator_Benchmarksuite_Part2/GPUs/scripts", to set up the benchmark runs
on the target machines. These bash scripts are:

run_ana.sh              :   Main-script, sets up the benchmark mode and submits the jobs (analyses the results)
prepare_submit_job.sh   :   Generate the job-scripts
submit_job.sh.template  :   Template for submit script

##
## 1.2.1 Main-script: "run_ana.sh"
##

The path to the executable has to be set with $PATH2EXE .
QUDA automatically tunes the GPU kernels. The optimal setup is saved in
the folder given by the variable "QUDA_RESOURCE_PATH". Set it to the
folder where the tuning data should be saved.
The scaling mode can be chosen with the variable sca_mode (="Strong" for strong scaling
or ="Weak" for weak scaling).
The lattice sizes can be set with "gx" and "gt".
Choose mode="Run" to run the benchmark and mode="Analysis" to extract the GFLOPS.
Note that jobs are submitted here with "sbatch"; adapt this to the queueing system on
your target machine.
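
How the two scaling modes relate the lattice size to the number of GPUs can be
illustrated with a little arithmetic. The sketch below is an assumption about
what sca_mode implies and is not an excerpt from run_ana.sh: it treats "gx" as
the spatial extent and "gt" as the temporal extent, so that a lattice has
gx^3 * gt sites. In "Strong" mode this lattice is taken as the global lattice
and divided among the devices; in "Weak" mode it is taken as the local lattice
of each device. The helper names are hypothetical.

```shell
#!/bin/bash
# Sketch only: these helpers are hypothetical and not taken from run_ana.sh.

# Number of lattice sites for spatial extent gx ($1) and temporal extent gt ($2).
lattice_sites() {
    echo $(( $1 * $1 * $1 * $2 ))
}

# Sites handled by each device.
# $1 = sca_mode ("Strong" or "Weak"), $2 = gx, $3 = gt, $4 = number of devices
sites_per_device() {
    local total
    total=$(lattice_sites "$2" "$3")
    case "$1" in
        Strong) echo $(( total / $4 )) ;;   # fixed global lattice is split up
        Weak)   echo "$total" ;;            # fixed local lattice per device
        *) echo "unknown sca_mode: $1" >&2; return 1 ;;
    esac
}

sites_per_device Strong 32 96 8   # 32^3x96 global lattice on 8 devices
sites_per_device Weak 48 24 8     # 48^3x24 local lattice on each device
```

The two example calls print 393216 and 2654208: under strong scaling the work
per device shrinks as devices are added, while under weak scaling it stays
constant.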

##
## 1.2.2 Main-script: "prepare_submit_job.sh"
##

Add additional options if necessary.

##
## 1.2.3 Main-script: "submit_job.sh.template"
##

The submit template is edited by "prepare_submit_job.sh" to generate
the final submit script. The header should be adapted to the queueing system
of the target machine.
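
Template substitution of this kind is commonly done with sed. The sketch below
shows the idea only: the placeholder names "#NODES#" and "#EXE#" and the
function "fill_template" are hypothetical and may differ from what
prepare_submit_job.sh and submit_job.sh.template actually use.

```shell
#!/bin/bash
# Sketch only: placeholder names are hypothetical, check the real template.

# Read a template on stdin and write the filled-in script to stdout.
# $1 = number of nodes, $2 = path to the executable
# Note: "|" is used as the sed delimiter, so the values must not contain "|".
fill_template() {
    sed -e "s|#NODES#|$1|g" -e "s|#EXE#|$2|g"
}

# Example with an inline two-line template in Slurm style.
printf '%s\n' '#SBATCH --nodes=#NODES#' 'srun #EXE#' | fill_template 4 /path/to/exe
```

For the inline example this prints "#SBATCH --nodes=4" followed by
"srun /path/to/exe".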

## 
## 1.3 Example Benchmark results
##

The benchmark results shown here were obtained on PizDaint, located at CSCS in Switzerland,
and on the GPGPU partition of Cartesius at Surfsara in Amsterdam, the Netherlands. The runs were performed using the
provided bash scripts. PizDaint has one Pascal GPU per node and two different test cases are shown:
the "Strong-Scaling" mode with a random lattice configuration of size 32^3x96 and
the "Weak-Scaling" mode with a configuration of local lattice size 48^3x24.
The GPGPU nodes of Cartesius have two Kepler GPUs per node and the "Strong-Scaling" test is shown for the cases
that one card per node and two cards per node are used.
The benchmarks use the Conjugate Gradient solver, which
solves a linear equation, D * x = b, for the unknown solution "x", based on the clover-improved Wilson Dirac operator
"D" and a known right-hand side "b".

---------------------
  PizDaint - Pascal
---------------------
Strong - Scaling:
global lattice size (32x32x32x96)

sloppy-precision: single
       precision: single

GPUs     GFLOPS      sec
1    786.520000 4.569600
2   1522.410000 3.086040
4   2476.900000 2.447180
8   3426.020000 2.117580
16  5091.330000 1.895790
32  8234.310000 1.860760
64  8276.480000 1.869230

sloppy-precision: double
       precision: double

GPUs     GFLOPS      sec
1    385.965000 6.126730
2    751.227000 3.846940
4   1431.570000 2.774470
8   1368.000000 2.367040
16  2304.900000 2.071160
32  4965.480000 2.095180
64  2308.850000 2.005110


Weak - Scaling:
local lattice size (48x48x48x24)

sloppy-precision: single
       precision: single

GPUs     GFLOPS      sec
1     765.967000 3.940280
2    1472.980000 4.004630
4    2865.600000 4.044360
8    5421.270000 4.056410
16   9373.760000 7.396590
32  17995.100000 4.243390
64  27219.800000 4.535410

sloppy-precision: double
       precision: double

GPUs    GFLOPS      sec
 1   376.611000 5.108900
 2   728.973000 5.190880
 4  1453.500000 5.144160
 8  2884.390000 5.207090
16  5004.520000 5.362020
32  8744.090000 5.623290
64  14053.00000 5.910520 


---------------------
  SurfSara - Kepler
---------------------
##
## 1 GPU per Node
##

Strong - Scaling:
global lattice size (32x32x32x96)

sloppy-precision: single
       precision: single
GPUs    GFLOPS      sec
1      243.084000 4.030000 
2      478.179000 2.630000 
4      939.953000 2.250000 
8     1798.240000 1.570000 
16    3072.440000 1.730000 
32    4365.320000 1.310000

sloppy-precision: double
       precision: double

GPUs    GFLOPS      sec
1      119.786000 6.060000 
2      234.179000 3.290000 
4      463.594000 2.250000 
8      898.090000 1.960000 
16    1604.210000 1.480000 
32    2420.130000 1.630000

##
## 2 GPU per Node
##

Strong - Scaling:
global lattice size (32x32x32x96)

sloppy-precision: single
       precision: single

GPUs    GFLOPS      sec
2      463.041000 2.720000 
4      896.707000 1.940000 
8     1672.080000 1.680000 
16    2518.240000 1.420000 
32    3800.970000 1.460000 
64    4505.440000 1.430000

sloppy-precision: double
       precision: double

GPUs    GFLOPS      sec
2     229.579000 3.380000 
4     450.425000 2.280000 
8     863.117000 1.830000 
16   1348.760000 1.510000 
32   1842.560000 1.550000 
64   2645.590000 1.480000 
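
Tables like the ones above can be condensed into a parallel efficiency. The
sketch below is an illustration and not part of the suite's "Analysis" mode;
the function name "scaling_efficiency" is hypothetical. It reads rows of the
form "GPUs GFLOPS sec", takes the first data row as the baseline, and divides
the GFLOPS per GPU of each row by the GFLOPS per GPU of that baseline.

```shell
#!/bin/bash
# Sketch only: not part of the benchmark scripts.

# Read "GPUs GFLOPS [sec]" rows on stdin and print "GPUs efficiency".
# Rows whose first field is not an integer (e.g. the header) are skipped.
scaling_efficiency() {
    awk 'NF >= 2 && $1 ~ /^[0-9]+$/ {
        if (!seen) { base = $2 / $1; seen = 1 }   # per-GPU rate of first row
        printf "%d %.3f\n", $1, ($2 / $1) / base
    }'
}

# First rows of the PizDaint single-precision strong-scaling table.
scaling_efficiency <<'EOF'
GPUs     GFLOPS      sec
1    786.520000 4.569600
2   1522.410000 3.086040
4   2476.900000 2.447180
EOF
```

For these three rows the efficiencies are 1.000, 0.968 and 0.787. The same
function applies to the weak-scaling tables, since in both modes ideal scaling
means a constant GFLOPS rate per GPU.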

###
###
###   XEONPHI - BENCHMARK SUITE
###
###

##
## 2. Compile and Run the XeonPhi-Benchmark Suite
##

Unpack the provided source tar file located in "./QCD_Accelerator_Benchmarksuite_Part2/XeonPhi/src" or
clone the current GitHub repositories of the code
packages. For QMP:

"git clone https://github.com/usqcd-software/qmp"

and for QPhix:

"git clone https://github.com/JeffersonLab/qphix"

Note that the AVX512 instructions, which are needed for optimal performance on
KNLs, are not yet part of the main branch. The AVX512 instructions are available
in the avx512 branch ("git checkout avx512"). The provided
source file uses the avx512 branch (status 01/2017).

##
## 2.1 Compile
##

The QPhix library is based on QMP communication functions,
so QMP has to be set up first:

./configure --prefix=$QMP_INSTALL_DIR CC=mpiicc CFLAGS=" -mmic/-xMIC-AVX512 -std=c99" --with-qmp-comms-type=MPI --host=x86_64-linux-gnu --build=none-none-none

Create the install folder and set $QMP_INSTALL_DIR to point to it.
In CFLAGS, use the compiler flag "-mmic" when compiling for KNC's
and "-xMIC-AVX512" when compiling for KNL's.
Then use
"make"

and
"make install"

to compile QMP and install the necessary files in $QMP_INSTALL_DIR.
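
Since only the architecture flag differs between the two targets, the choice
can be captured in a small helper. This is a sketch, not part of the suite:
the function names "qmp_cflags" and "qmp_configure_cmd" are hypothetical, and
the KNL flag is written as "-xMIC-AVX512", the spelling used in the QPhix
configure lines of this README. The configure line is printed, not executed.

```shell
#!/bin/bash
# Sketch only: helper names are hypothetical.

# CFLAGS for the QMP build, depending on the Xeon Phi generation.
qmp_cflags() {
    case "$1" in
        knc) echo "-mmic -std=c99" ;;
        knl) echo "-xMIC-AVX512 -std=c99" ;;
        *) echo "unknown target: $1 (expected knc or knl)" >&2; return 1 ;;
    esac
}

# Print the QMP configure line.
# $1 = install prefix (QMP_INSTALL_DIR), $2 = target (knc or knl)
qmp_configure_cmd() {
    local flags
    flags=$(qmp_cflags "$2") || return 1
    echo "./configure --prefix=$1 CC=mpiicc CFLAGS=\"$flags\" --with-qmp-comms-type=MPI --host=x86_64-linux-gnu --build=none-none-none"
}

qmp_configure_cmd /path/to/qmp/install knl
```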

The QPhix executable can be configured and compiled by using,
for KNC's,

./configure --enable-parallel-arch=parscalar --enable-proc=MIC --enable-soalen=8 --enable-clover --enable-openmp --enable-cean --enable-mm-malloc CXXFLAGS="-openmp -mmic -vec-report -restrict -mGLOB_default_function_attrs=\"use_gather_scatter_hint=off\" -g -O2 -finline-functions -fno-alias -std=c++0x" CFLAGS="-mmic -vec-report -restrict -mGLOB_default_function_attrs=\"use_gather_scatter_hint=off\" -openmp -g  -O2 -fno-alias -std=c99" CXX=mpiicpc CC=mpiicc --host=x86_64-linux-gnu --build=none-none-none --with-qmp=$QMP_INSTALL_DIR

or for KNL's

./configure --enable-parallel-arch=parscalar --enable-proc=AVX512 --enable-soalen=8 --enable-clover --enable-openmp --enable-cean --enable-mm-malloc CXXFLAGS="-qopenmp -xMIC-AVX512 -g -O3 -std=c++14" CFLAGS="-xMIC-AVX512 -qopenmp -O3 -std=c99" CXX=mpiicpc CC=mpiicc --host=x86_64-linux-gnu --build=none-none-none --with-qmp=$QMP_INSTALL_DIR

Here QMP_INSTALL_DIR is the variable set above, which points to the install folder
of QMP. The executable "time_clov_noqdp" can now be found in the subfolder "./qphix/test".
Note that the avx512 branch also compiles additional executables which depend
on the package QDP (this generates an error at the end of the compilation process).

##
## 2.1.1 Example compilation on PRACE machines
##

This subsection provides some example compilations on PRACE machines
which were used to develop the QCD Benchmarksuite 2.

##
## 2.1.1.1 BSC - Marenostrum III Hybrid partitions 
##

The Hybrid partition of Marenostrum is equipped with KNC's.
First, the following modules were loaded

module unload openmpi
module load impi

and the necessary environment was set up with

source /opt/intel/impi/4.1.1.036/bin64/mpivars.sh
source /opt/intel/2013.5.192/composer_xe_2013.5.192/bin/compilervars.sh intel64
export I_MPI_MIC=enable
export I_MPI_HYDRA_BOOTSTRAP=ssh

The QMP-library was configured and compiled with

./configure --prefix=$QMP_INSTALL_DIR CC=mpiicc CFLAGS="-mmic -std=c99" --with-qmp-comms-type=MPI --host=x86_64-linux-gnu --build=none-none-none

make
make install

Now the package QPhix is configured and compiled with

./configure --enable-parallel-arch=parscalar --enable-proc=MIC --enable-soalen=8 --enable-clover --enable-openmp --enable-cean --enable-mm-malloc CXXFLAGS="-openmp -mmic -vec-report -restrict -mGLOB_default_function_attrs=\"use_gather_scatter_hint=off\" -g -O2 -finline-functions -fno-alias -std=c++0x" CFLAGS="-mmic -vec-report -restrict -mGLOB_default_function_attrs=\"use_gather_scatter_hint=off\" -openmp -g  -O2 -fno-alias -std=c99" CXX=mpiicpc CC=mpiicc --host=x86_64-linux-gnu --build=none-none-none --with-qmp=$QMP_INSTALL_DIR
make

##
## 2.1.1.2 CINES - Frioul
##

On a test cluster at the CINES site the Benchmarksuite was tested on KNL's.
The steps are similar to those at BSC. First the library paths are set with

source /opt/software/intel/composer_xe_2015/bin/compilervars.sh intel64
source /opt/software/intel/impi_5.0.3/bin64/mpivars.sh

QMP was configured and compiled by using:
  
./configure --prefix=$QMP_INSTALL_DIR CC=mpiicc CFLAGS="-xMIC-AVX512 -mGLOB_default_function_attrs=\"use_gather_scatter_hint=off\" -openmp -g  -O2 -fno-alias -std=c99"  --with-qmp-comms-type=MPI --host=x86_64-linux-gnu --build=none-none-none
make
make install
  
QPhix was configured and compiled by using
  
./configure --enable-parallel-arch=parscalar --enable-proc=AVX512 --enable-soalen=8 --enable-clover --enable-openmp --enable-cean --enable-mm-malloc CXXFLAGS="-qopenmp -xMIC-AVX512 -g -O3 -std=c++14" CFLAGS="-xMIC-AVX512 -qopenmp -O3 -std=c99" CXX=mpiicpc CC=mpiicc --host=x86_64-linux-gnu --build=none-none-none --with-qmp=/home/finkenrath/benchmark/qmp/install

and

make

##
##	2.2 Run
##


The Accelerator QCD-Benchmarksuite Part 2 provides bash scripts to set up the benchmark runs
on the target machines. These bash scripts are:

run_ana.sh              :   Main-script, sets up the benchmark mode and submits the jobs (analyses the results)
prepare_submit_job.sh   :   Generate the job-scripts
submit_job.sh.template  :   Template for submit script

##
## 2.2.1 Main-script: "run_ana.sh"
##

The path to the executable has to be set with $PATH2EXE .
The scaling mode can be chosen with the variable sca_mode (="Strong" for strong scaling
or ="Weak" for weak scaling).
The lattice sizes can be set with "gx" and "gt".
Choose mode="Run" to run the benchmark and mode="Analysis" to extract the GFLOPS.
Note that jobs are submitted with "sbatch"; adapt this to the queueing system on
your target machine.
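
When adapting the submission step to another machine, it can help to check
which submit command is available there. The following sketch is not part of
run_ana.sh; the function name "pick_submit_cmd" is hypothetical. It prints the
first command from its argument list that is found in PATH, here tried in the
order sbatch (Slurm), qsub (PBS or Grid Engine) and bsub (LSF).

```shell
#!/bin/bash
# Sketch only: not part of the benchmark scripts.

# Print the first command from the argument list that is found in PATH.
pick_submit_cmd() {
    local cmd
    for cmd in "$@"; do
        if command -v "$cmd" >/dev/null 2>&1; then
            echo "$cmd"
            return 0
        fi
    done
    echo "no submit command found among: $*" >&2
    return 1
}

# Fall back to an empty value if none of the candidates is installed.
submit=$(pick_submit_cmd sbatch qsub bsub 2>/dev/null) || submit=""
echo "submit command: ${submit:-none}"
```

The options in the header of the submit template differ between queueing
systems, so finding the command is only the first step of the adaptation.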

##
## 2.2.2 Main-script: "prepare_submit_job.sh"
##

Add additional options if necessary.

##
## 2.2.3 Main-script: "submit_job.sh.template"
##

The submit template is edited by "prepare_submit_job.sh" to generate
the final submit script. The header should be adapted to the queueing system
of the target machine.

##
## 2.3 Example Benchmark Results
##

The benchmark results for the XeonPhi benchmark suite were obtained on
Frioul, a test cluster at CINES, and on the hybrid partition of MareNostrum III at BSC.
Frioul has one KNL card per node while the hybrid partition of MareNostrum III is
equipped with two KNCs per node. The data on Frioul were generated by using
the bash scripts provided by the QCD-Accelerator Benchmarksuite Part 2
for the two test cases "Strong-Scaling" with a lattice size
of 32^3x96 and "Weak-Scaling" with a local lattice size of 48^3x24 per
card. For MareNostrum, data for the "Strong-Scaling"
mode on a 32^3x96 lattice are shown. The benchmark uses a random gauge configuration and the
Conjugate Gradient solver to solve a linear equation involving the clover Wilson Dirac operator.

---------------------
  Frioul - KNLs
---------------------
Strong - Scaling:
global lattice size (32x32x32x96)

precision: single

KNLs     GFLOPS  
1       340.75
2       627.612
4      1111.13
8      1779.34
16     2410.8

precision: double

KNLs     GFLOPS    
1      328.149
2      616.467
4      1047.79
8      1616.37

Weak - Scaling:
local lattice size (48x48x48x24)

precision: single

KNLs   GFLOPS  
1       348.304
2       616.697
4      1214.82
8      2425.45
16     4404.63
 
precision: double

KNLs   GFLOPS    
 1      172.303
 2      320.761
 4      629.79
 8     1228.77
16     2310.63

---------------------
  MareNostrum III - KNC's 
---------------------

Strong - Scaling:
global lattice size (32x32x32x96)

precision: single - 1 Card per Node

KNCs  GFLOPS
2    103.561
4    200.159
8    338.276
16   534.369
32   815.896

precision: single - 2 Cards per Node

KNCs  GFLOPS
4    118.995
8    212.558
16   368.196
32   605.882
64   847.566