# README - QCD UEABS Part 2

**2017 - Jacob Finkenrath - CaSToRC - The Cyprus Institute (j.finkenrath@cyi.ac.cy)**

The QCD Accelerator Benchmark suite Part 2 consists of two kernels: one based on the QUDA library [^quda] and one based on the QPhiX library [^qphix]. QUDA is based on CUDA and optimized for running on NVIDIA GPUs (https://lattice.github.io/quda/). The QPhiX library consists of routines that are optimized with Intel intrinsic functions for several vector lengths, including AVX512, and provides optimized routines for KNC and KNL (http://jeffersonlab.github.io/qphix/). The benchmark kernels consist of the Conjugate Gradient benchmark functions provided by these libraries.

[^quda]: R. Babich, M. A. Clark and B. Joó, "Parallelizing the QUDA Library for Multi-GPU Calculations in Lattice Quantum Chromodynamics", SC 2010.
[^qphix]: B. Joó, D. D. Kalamkar, K. Vaidyanathan, M. Smelyanskiy, K. Pamnany, V. W. Lee and P. Dubey, "Lattice QCD on Intel Xeon Phi Coprocessors", ISC 2013.

## Table of Contents

[TOC]

## GPU - Kernel

### 1. Compile and Run the GPU-Benchmark Suite

#### 1.1 Compile

Download CMake and QUDA. General information on how to build QUDA with CMake can be found at https://github.com/lattice/quda/wiki/Building-QUDA-with-cmake. Here we give only a short overview.

Build CMake (./QCD_Accelerator_Benchmarksuite_Part2/GPUs/src/cmake-3.7.0.tar.gz):
CMake can be downloaded from https://cmake.org/download/. In this guide the version cmake-3.7.0 is used. The build instructions can be found in README.rst in the main directory: run the provided `./configure` script and then run `gmake`.

Build QUDA (./QCD_Accelerator_Benchmarksuite_Part2/GPUs/src/quda.tar.gz):
Download QUDA, for example with `git clone https://github.com/lattice/quda.git`. Create a build folder. From within the build folder, run the `cmake` executable (located under `cmake/bin` of the CMake build):

``` shell
./$PATH2CMAKE/cmake $PATH2QUDA -DQUDA_GPU_ARCH=sm_XX -DQUDA_DIRAC_WILSON=ON -DQUDA_DIRAC_TWISTED_MASS=OFF -DQUDA_DIRAC_DOMAIN_WALL=OFF -DQUDA_HISQ_LINK=OFF -DQUDA_GAUGE_FORCE=OFF -DQUDA_HISQ_FORCE=OFF -DQUDA_MPI=ON
```

with

``` shell
PATH2CMAKE= path to the folder containing the cmake executable
PATH2QUDA=  path to the home directory of QUDA
```

Set `-DQUDA_GPU_ARCH=sm_XX` to the GPU architecture (`sm_60` for Pascal, `sm_35` for Kepler).

If CMake or the compilation fails, library paths and options can be adjusted with the CMake tool `ccmake`. Use `./$PATH2CMAKE/ccmake $PATH2BUILD_DIR` to view and edit the available options. CMake generates the Makefiles; run `make` to build. The required QUDA executable "invert_" can then be found in the folder /test.

#### 1.2 Run

The Accelerator QCD-Benchmarksuite Part 2 provides bash scripts, located in the folder ./QCD_Accelerator_Benchmarksuite_Part2/GPUs/scripts, to set up the benchmark runs on the target machines. These bash scripts are:

- `run_ana.sh` : main script, sets up the benchmark mode and submits the jobs (and analyses the results)
- `prepare_submit_job.sh` : generates the job scripts
- `submit_job.sh.template` : template for the submit script

##### 1.2.1 Main script: "run_ana.sh"

The path to the executable has to be set via `$PATH2EXE`. QUDA automatically tunes the GPU kernels; the optimal setup is saved in the folder declared by the variable `QUDA_RESOURCE_PATH`. Set it to the folder where the tuning data should be saved. Different scaling modes, from strong scaling to weak scaling, can be chosen via the variable `sca_mode` (="Strong" or ="Weak"). The lattice sizes can be set via `gx` and `gt`. Choose `mode="Run"` to run the benchmark and `mode="Analysis"` to extract the GFLOPS.
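To make the setup concrete, the following is a minimal sketch of how these variables might be set at the top of `run_ana.sh`; the variable names are those described above, while the paths and values are placeholders rather than defaults of the suite.

``` shell
# Sketch of typical settings in run_ana.sh (paths and values are placeholders)
PATH2EXE=$HOME/quda/build/tests             # hypothetical location of the QUDA executable
export QUDA_RESOURCE_PATH=$HOME/quda_tune   # folder where QUDA writes its kernel-tuning data
sca_mode="Strong"                           # "Strong" or "Weak" scaling
gx=32                                       # spatial lattice extent
gt=96                                       # temporal lattice extent
mode="Run"                                  # "Run" submits jobs, "Analysis" extracts the GFLOPS
```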
Note that the submission is done here by "sbatch"; match this to the queuing system on your target machine.

##### 1.2.2 Main script: "prepare_submit_job.sh"

Add additional options if necessary.

##### 1.2.3 Main script: "submit_job.sh.template"

The submit template is edited by `prepare_submit_job.sh` to generate the final submit script. The header should be matched to the queuing system of the target machine (a minimal sketch of such a header is given after the example results below).

#### 1.3 Example Benchmark Results

Shown here are the benchmark results on Piz Daint, located at CSCS in Switzerland, and on the GPGPU partition of Cartesius at SURFsara in Amsterdam, the Netherlands. The runs were performed with the provided bash scripts. Piz Daint has one Pascal GPU per node, and two different test cases are shown: a "Strong-Scaling" mode with a random lattice configuration of size 32x32x32x96 and a "Weak-Scaling" mode with a configuration of local lattice size 48x48x48x24. The GPGPU nodes of Cartesius have two Kepler GPUs per node, and the "Strong-Scaling" test is shown for one card per node and for two cards per node. The benchmarks use the Conjugate Gradient solver, which solves a linear equation, D * x = b, for the unknown solution "x", where "D" is the clover-improved Wilson Dirac operator and "b" a known right-hand side.

```
---------------------
PizDaint - Pascal
---------------------
Strong - Scaling: global lattice size (32x32x32x96)

sloppy-precision: single
precision: single
GPUs    GFLOPS          sec
1       786.520000      4.569600
2       1522.410000     3.086040
4       2476.900000     2.447180
8       3426.020000     2.117580
16      5091.330000     1.895790
32      8234.310000     1.860760
64      8276.480000     1.869230

sloppy-precision: double
precision: double
GPUs    GFLOPS          sec
1       385.965000      6.126730
2       751.227000      3.846940
4       1431.570000     2.774470
8       1368.000000     2.367040
16      2304.900000     2.071160
32      4965.480000     2.095180
64      2308.850000     2.005110

Weak - Scaling: local lattice size (48x48x48x24)

sloppy-precision: single
precision: single
GPUs    GFLOPS          sec
1       765.967000      3.940280
2       1472.980000     4.004630
4       2865.600000     4.044360
8       5421.270000     4.056410
16      9373.760000     7.396590
32      17995.100000    4.243390
64      27219.800000    4.535410

sloppy-precision: double
precision: double
GPUs    GFLOPS          sec
1       376.611000      5.108900
2       728.973000      5.190880
4       1453.500000     5.144160
8       2884.390000     5.207090
16      5004.520000     5.362020
32      8744.090000     5.623290
64      14053.00000     5.910520
```

```
---------------------
SurfSara - Kepler
---------------------
##
## 1 GPU per Node
##
Strong - Scaling: global lattice size (32x32x32x96)

sloppy-precision: single
precision: single
GPUs    GFLOPS          sec
1       243.084000      4.030000
2       478.179000      2.630000
4       939.953000      2.250000
8       1798.240000     1.570000
16      3072.440000     1.730000
32      4365.320000     1.310000

sloppy-precision: double
precision: double
GPUs    GFLOPS          sec
1       119.786000      6.060000
2       234.179000      3.290000
4       463.594000      2.250000
8       898.090000      1.960000
16      1604.210000     1.480000
32      2420.130000     1.630000

##
## 2 GPU per Node
##
Strong - Scaling: global lattice size (32x32x32x96)

sloppy-precision: single
precision: single
GPUs    GFLOPS          sec
2       463.041000      2.720000
4       896.707000      1.940000
8       1672.080000     1.680000
16      2518.240000     1.420000
32      3800.970000     1.460000
64      4505.440000     1.430000

sloppy-precision: double
precision: double
GPUs    GFLOPS          sec
2       229.579000      3.380000
4       450.425000      2.280000
8       863.117000      1.830000
16      1348.760000     1.510000
32      1842.560000     1.550000
64      2645.590000     1.480000
```
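As noted in Section 1.2.3, the header of `submit_job.sh.template` has to match the local queuing system. Since job submission in `run_ana.sh` uses `sbatch`, a SLURM system is assumed in the following minimal header sketch; the job name, node count, time limit and partition are placeholders and not values prescribed by the benchmark suite.

``` shell
#!/bin/bash -l
# Minimal SLURM header sketch for submit_job.sh.template (all values are placeholders)
#SBATCH --job-name=qcd_ueabs_p2     # hypothetical job name
#SBATCH --nodes=4                   # number of nodes for the chosen scaling point
#SBATCH --ntasks-per-node=1         # e.g. one MPI rank per GPU
#SBATCH --time=00:30:00             # wall-clock limit
#SBATCH --partition=gpu             # placeholder partition/queue name

# site-specific module loads and the benchmark command line are inserted
# into the template by prepare_submit_job.sh
```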
## x86 Kernel

### 2. Compile and Run the x86 Part

Unpack the provided source tar files located in `./QCD_Accelerator_Benchmarksuite_Part2/XeonPhi/src` or clone the current GitHub branches of the code packages, i.e. for QMP:

``` shell
git clone https://github.com/usqcd-software/qmp
```

and for QPhiX

``` shell
git clone https://github.com/JeffersonLab/qphix
```

Note that for running on Skylake chips it is recommended to use the develop branch of QPhiX, which needs additional packages such as QDP++ (status 04/2019).

#### 2.1 Compile

The QPhiX library is based on QMP communication functions, so QMP has to be set up first:

``` shell
./configure --prefix=$QMP_INSTALL_DIR CC=mpiicc CFLAGS=" -mmic/-xAVX512 -std=c99" --with-qmp-comms-type=MPI --host=x86_64-linux-gnu --build=none-none-none
```

Create the install folder and point `$QMP_INSTALL_DIR` to it. Use the compiler flag `-mmic` when compiling for KNCs and `-xAVX512` when compiling for KNLs. Then use

``` shell
make
make install
```

to compile and install the necessary files in `$QMP_INSTALL_DIR`.

The QPhiX executable can then be compiled. For KNCs use

``` shell
./configure --enable-parallel-arch=parscalar --enable-proc=MIC --enable-soalen=8 --enable-clover --enable-openmp --enable-cean --enable-mm-malloc CXXFLAGS="-openmp -mmic -vec-report -restrict -mGLOB_default_function_attrs=\"use_gather_scatter_hint=off\" -g -O2 -finline-functions -fno-alias -std=c++0x" CFLAGS="-mmic -vec-report -restrict -mGLOB_default_function_attrs=\"use_gather_scatter_hint=off\" -openmp -g -O2 -fno-alias -std=c99" CXX=mpiicpc CC=mpiicc --host=x86_64-linux-gnu --build=none-none-none --with-qmp=$QMP_INSTALL_DIR
```

or for KNLs

``` shell
./configure --enable-parallel-arch=parscalar --enable-proc=AVX512 --enable-soalen=8 --enable-clover --enable-openmp --enable-cean --enable-mm-malloc CXXFLAGS="-qopenmp -xMIC-AVX512 -g -O3 -std=c++14" CFLAGS="-xMIC-AVX512 -qopenmp -O3 -std=c99" CXX=mpiicpc CC=mpiicc --host=x86_64-linux-gnu --build=none-none-none --with-qmp=$QMP_INSTALL_DIR
```

where `QMP_INSTALL_DIR` is the QMP install folder from the previous step. The executable `time_clov_noqdp` can now be found in the subfolder `./qphix/test`.

Note that for the develop branch the package QDP++ has to be compiled as well. QDP++ can be configured with (here for a Skylake chip):

``` shell
./configure --with-qmp=$QMP_INSTALL_DIR --enable-parallel-arch=parscalar CC=mpiicc CFLAGS="-xCORE-AVX512 -mtune=skylake-avx512 -std=c99" CXX=mpiicpc CXXFLAGS="-axCORE-AVX512 -mtune=skylake-avx512 -std=c++14 -qopenmp" --enable-openmp --host=x86_64-linux-gnu --build=none-none-none --prefix=$QDPXX_INSTALL_DIR
```

The QPhiX executable can then be built with CMake:

``` shell
cmake -DQDPXX_DIR=$QDPXX_INSTALL_DIR -DQMP_DIR=$QMP_INSTALL_DIR -Disa=avx512 -Dparallel_arch=parscalar -Dhost_cxx=mpiicpc -Dhost_cxxflags="-std=c++17 -O3 -axCORE-AVX512 -mtune=skylake-avx512" -Dtm_clover=ON -Dtwisted_mass=ON -Dtesting=ON -DCMAKE_CXX_COMPILER=mpiicpc -DCMAKE_CXX_FLAGS="-std=c++17 -O3 -axCORE-AVX512 -mtune=skylake-avx512" -DCMAKE_C_COMPILER=mpiicc -DCMAKE_C_FLAGS="-std=c99 -O3 -axCORE-AVX512 -mtune=skylake-avx512" ..
```

The executable `time_clov_noqdp` can again be found in the subfolder `./qphix/test`.
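The CMake command above is intended to be run from a separate build directory, with the trailing `..` pointing back to the QPhiX source tree. A minimal sketch of the surrounding steps, with placeholder directory names, could look like this:

``` shell
# Out-of-tree CMake build of the QPhiX develop branch (directory names are placeholders)
cd qphix
mkdir -p build && cd build

# run the cmake command shown above from inside "build"; the trailing ".."
# refers to the QPhiX source tree one level up
# cmake -DQDPXX_DIR=$QDPXX_INSTALL_DIR -DQMP_DIR=$QMP_INSTALL_DIR ... ..

make -j 8    # compile with 8 parallel jobs (adjust to the build host)
```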
##### 2.1.1 Example compilations on PRACE machines

In this subsection we provide some example compilations on PRACE machines that were used to develop the QCD Benchmark suite Part 2.

###### 2.1.1.1 BSC - MareNostrum III hybrid partition

The hybrid partition of MareNostrum III is equipped with KNCs. First the following modules were loaded

``` shell
module unload openmpi
module load impi
```

and the necessary paths are set with

``` shell
source /opt/intel/impi/4.1.1.036/bin64/mpivars.sh
source /opt/intel/2013.5.192/composer_xe_2013.5.192/bin/compilervars.sh intel64
export I_MPI_MIC=enable
export I_MPI_HYDRA_BOOTSTRAP=ssh
```

The QMP library was configured and compiled with

``` shell
./configure --prefix=$QMP_INSTALL_DIR CC=mpiicc CFLAGS="-mmic -std=c99" --with-qmp-comms-type=MPI --host=x86_64-linux-gnu --build=none-none-none
make
make install
```

Then the QPhiX package is compiled with

``` shell
./configure --enable-parallel-arch=parscalar --enable-proc=MIC --enable-soalen=8 --enable-clover --enable-openmp --enable-cean --enable-mm-malloc CXXFLAGS="-openmp -mmic -vec-report -restrict -mGLOB_default_function_attrs=\"use_gather_scatter_hint=off\" -g -O2 -finline-functions -fno-alias -std=c++0x" CFLAGS="-mmic -vec-report -restrict -mGLOB_default_function_attrs=\"use_gather_scatter_hint=off\" -openmp -g -O2 -fno-alias -std=c99" CXX=mpiicpc CC=mpiicc --host=x86_64-linux-gnu --build=none-none-none --with-qmp=$QMP_INSTALL_DIR
make
```

###### 2.1.1.2 CINES - Frioul

The benchmark suite was tested on KNLs on a test cluster at CINES. The steps are similar to those at BSC. First the library paths are set with

``` shell
source /opt/software/intel/composer_xe_2015/bin/compilervars.sh intel64
source /opt/software/intel/impi_5.0.3/bin64/mpivars.sh
```

QMP was configured and compiled with

``` shell
./configure --prefix=$QMP_INSTALL_DIR CC=mpiicc CFLAGS="-xMIC-AVX512 -mGLOB_default_function_attrs=\"use_gather_scatter_hint=off\" -openmp -g -O2 -fno-alias -std=c99" --with-qmp-comms-type=MPI --host=x86_64-linux-gnu --build=none-none-none
make
make install
```

QPhiX was configured and compiled with

``` shell
./configure --enable-parallel-arch=parscalar --enable-proc=AVX512 --enable-soalen=8 --enable-clover --enable-openmp --enable-cean --enable-mm-malloc CXXFLAGS="-qopenmp -xMIC-AVX512 -g -O3 -std=c++14" CFLAGS="-xMIC-AVX512 -qopenmp -O3 -std=c99" CXX=mpiicpc CC=mpiicc --host=x86_64-linux-gnu --build=none-none-none --with-qmp=/home/finkenrath/benchmark/qmp/install
make
```

#### 2.2 Run

The Accelerator QCD-Benchmarksuite Part 2 provides bash scripts to set up the benchmark runs on the target machines. These bash scripts are:

- `run_ana.sh` : main script, sets up the benchmark mode and submits the jobs (and analyses the results)
- `prepare_submit_job.sh` : generates the job scripts
- `submit_job.sh.template` : template for the submit script

##### 2.2.1 Main script: "run_ana.sh"

The path to the executable has to be set via `$PATH2EXE`. Different scaling modes, from strong scaling to weak scaling, can be chosen via the variable `sca_mode` (="Strong" or ="Weak"). The lattice sizes can be set via `gx` and `gt`. Choose `mode="Run"` to run the benchmark and `mode="Analysis"` to extract the GFLOPS. Note that the submission is done by "sbatch"; match this to the queuing system on your target machine.

##### 2.2.2 Main script: "prepare_submit_job.sh"

Add additional options if necessary.

##### 2.2.3 Main script: "submit_job.sh.template"

The submit template is edited by `prepare_submit_job.sh` to generate the final submit script. The header should be matched to the queuing system of the target machine.
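For orientation, below is a minimal sketch of how a single benchmark process might be launched on one KNL node by the generated job script. The OpenMP settings are standard Intel environment variables; the placeholder `$BENCH_OPTS` stands for the `time_clov_noqdp` command-line options (lattice size, blocking, precision, MPI geometry) that are actually filled in by `run_ana.sh` and `prepare_submit_job.sh`.

``` shell
# Minimal single-node KNL launch sketch (all values are placeholders)
export OMP_NUM_THREADS=64                        # OpenMP threads per MPI rank
export KMP_AFFINITY=compact,granularity=thread   # Intel OpenMP thread pinning

# BENCH_OPTS is a placeholder for the time_clov_noqdp options generated by the scripts
mpirun -np 1 ./time_clov_noqdp $BENCH_OPTS
```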
#### 2.3 Example Benchmark Results

The benchmark results for the Xeon Phi part of the suite were obtained on Frioul, a test cluster at CINES, and on the hybrid partition of MareNostrum III at BSC. Frioul has one KNL card per node, while the hybrid partition of MareNostrum III is equipped with two KNCs per node. The data on Frioul were generated with the bash scripts provided by the QCD Accelerator Benchmark suite Part 2 for the two test cases "Strong-Scaling", with a lattice size of 32^3x96, and "Weak-Scaling", with a local lattice size of 48^3x24 per card. For MareNostrum, data for the "Strong-Scaling" mode on a 32^3x96 lattice are shown. The benchmark uses a random gauge configuration and the Conjugate Gradient solver to solve a linear equation involving the clover-improved Wilson Dirac operator.

```
---------------------
Frioul - KNLs
---------------------
Strong - Scaling: global lattice size (32x32x32x96)

precision: single
KNLs    GFLOPS
1       340.75
2       627.612
4       1111.13
8       1779.34
16      2410.8

precision: double
KNLs    GFLOPS
1       328.149
2       616.467
4       1047.79
8       1616.37

Weak - Scaling: local lattice size (48x48x48x24)

precision: single
KNLs    GFLOPS
1       348.304
2       616.697
4       1214.82
8       2425.45
16      4404.63

precision: double
KNLs    GFLOPS
1       172.303
2       320.761
4       629.79
8       1228.77
16      2310.63
```

```
---------------------
MareNostrum III - KNC's
---------------------
Strong - Scaling: global lattice size (32x32x32x96)

precision: single - 1 Cards per Node
KNCs    GFLOPS
2       103.561
4       200.159
8       338.276
16      534.369
32      815.896

precision: single - 2 Cards per Node
KNCs    GFLOPS
4       118.995
8       212.558
16      368.196
32      605.882
64      847.566
```