# README - QCD UEABS Part 2

**2021 - Jacob Finkenrath - CaSToRC - The Cyprus Institute (j.finkenrath@cyi.ac.cy)**

Part 2 of the QCD kernels of the Unified European Applications Benchmark Suite (UEABS), http://www.prace-ri.eu/ueabs/, was developed under the accelerator benchmark suite task within the 4th implementation phase of PRACE and has been part of the UEABS kernels since PRACE 5IP as UEABS QCD part 2.

Part 2 consists of two kernels: one based on QUDA [^quda] and one based on the QPhiX library [^qphix].

[^quda]: R. Babich, M. Clark and B. Joo, "Parallelizing the QUDA Library for Multi-GPU Calculations in Lattice Quantum Chromodynamics".

[^qphix]: B. Joo, D. D. Kalamkar, K. Vaidyanathan, M. Smelyanskiy, K. Pamnany, V. W. Lee, P. Dubey et al., "Lattice QCD on Intel Xeon Phi Coprocessors".

The QUDA library is based on CUDA and optimized for running on NVIDIA GPUs (https://lattice.github.io/quda/). Currently, a HIP version and a generic version are under development, which will make the library usable on AMD GPUs and, once ready, on CPU architectures such as ARM. The generic QUDA kernel might replace the computational kernel of QPhiX in the future. The QPhiX library consists of routines developed for the Intel Xeon Phi architecture and also runs on general x86 architectures. QPhiX is optimized to use Intel intrinsics for several vector lengths, including AVX-512 (http://jeffersonlab.github.io/qphix/). In general, the benchmark kernels apply the Conjugate Gradient solver to the Wilson Dirac operator, a four-dimensional stencil.

## Table of Contents

[TOC]

## GPU - Kernel

### 1. Compile and Run the GPU-Benchmark Suite

#### 1.1 Compile

Clone `quda` via

```shell
git clone https://github.com/lattice/quda.git
```

and build `quda` via

```shell
cd quda
mkdir build
cd build
# configure with CMake; set the GPU architecture as described below
cmake -DQUDA_GPU_ARCH=sm_60 ..
```

Set `-DQUDA_GPU_ARCH=sm_XX` to the GPU architecture (`sm_60` for Pascal, `sm_35` for Kepler). If CMake or the compilation fails, library paths and options can be set with the CMake-provided tool `ccmake`. Use `./PATH2CMAKE/ccmake PATH2BUILD_DIR` to edit and to see the available options. CMake generates the Makefiles; run them with `make`. The needed executable, `invert_test`, can then be found in the folder `tests`.

#### 1.2 Run

The Accelerator QCD-Benchmarksuite Part 2 provides bash scripts, located in the folder `./QCD_Accelerator_Benchmarksuite_Part2/GPUs/scripts`, to set up the benchmark runs on the target machines. These bash scripts are:

- `run_ana.sh` : Main script, sets up the benchmark mode and submits the jobs (and analyses the results)
- `prepare_submit_job.sh` : Generates the job scripts
- `submit_job.sh.template` : Template for the submit script

##### 1.2.1 Main script: "run_ana.sh"

The path to the executable has to be set via `$PATH2EXE`. QUDA automatically tunes the GPU kernels; the optimal setup is saved in the folder declared by the variable `QUDA_RESOURCE_PATH`. Set it to the folder where the tuning data should be saved. Different scaling modes can be chosen, from strong scaling to weak scaling, via the variable `sca_mode` (`="Strong"` or `="Weak"`). The lattice sizes can be set by `gx` and `gt`. Choose `mode="Run"` for the run mode and `mode="Analysis"` for extracting the GFLOPS. Note that the submission is done here via `sbatch`; match this to the queuing system on your target machine.

##### 1.2.2 Job-script generator: "prepare_submit_job.sh"

Add additional options if necessary.

##### 1.2.3 Submit template: "submit_job.sh.template"

The submit template is edited by `prepare_submit_job.sh` to generate the final submit script. The header should be matched to the queuing system of the target machine.
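As an illustration, a minimal SLURM-style header of the kind such a template may contain is sketched below. The job name, partition, resource values and tuning-cache path are placeholders and must be adapted to the target machine and scheduler.

```shell
#!/bin/bash
#SBATCH --job-name=qcd_gpu_bench      # placeholder job name
#SBATCH --nodes=1                     # adjusted per scaling step by prepare_submit_job.sh
#SBATCH --ntasks-per-node=1           # e.g. one MPI rank per GPU
#SBATCH --time=00:30:00
#SBATCH --partition=gpu               # placeholder: machine-specific partition name

# folder where QUDA stores its kernel-tuning data (see section 1.2.1)
export QUDA_RESOURCE_PATH=$PWD/tune_cache
```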
#### 1.3 Example Benchmark Results

Shown here are benchmark results from Piz Daint, located at CSCS in Switzerland, and from the GPGPU partition of Cartesius at SURFsara in Amsterdam, Netherlands. The runs were performed using the provided bash scripts. Piz Daint has one Pascal GPU per node, and two test cases are shown: a "Strong-Scaling" mode with a random lattice configuration of size 32x32x32x96 and a "Weak-Scaling" mode with a configuration of local lattice size 48x48x48x24. The GPGPU nodes of Cartesius have two Kepler GPUs per node, and the "Strong-Scaling" test is shown for one card per node and for two cards per node. The benchmarks use the Conjugate Gradient solver, which solves the linear equation D * x = b for the unknown solution x, with the clover-improved Wilson Dirac operator D and a known right-hand side b.

```
---------------------
PizDaint - Pascal
---------------------

Strong - Scaling: global lattice size (32x32x32x96)

sloppy-precision: single
precision:        single

GPUs    GFLOPS          sec
1       786.520000      4.569600
2       1522.410000     3.086040
4       2476.900000     2.447180
8       3426.020000     2.117580
16      5091.330000     1.895790
32      8234.310000     1.860760
64      8276.480000     1.869230

sloppy-precision: double
precision:        double

GPUs    GFLOPS          sec
1       385.965000      6.126730
2       751.227000      3.846940
4       1431.570000     2.774470
8       1368.000000     2.367040
16      2304.900000     2.071160
32      4965.480000     2.095180
64      2308.850000     2.005110

Weak - Scaling: local lattice size (48x48x48x24)

sloppy-precision: single
precision:        single

GPUs    GFLOPS          sec
1       765.967000      3.940280
2       1472.980000     4.004630
4       2865.600000     4.044360
8       5421.270000     4.056410
16      9373.760000     7.396590
32      17995.100000    4.243390
64      27219.800000    4.535410

sloppy-precision: double
precision:        double

GPUs    GFLOPS          sec
1       376.611000      5.108900
2       728.973000      5.190880
4       1453.500000     5.144160
8       2884.390000     5.207090
16      5004.520000     5.362020
32      8744.090000     5.623290
64      14053.00000     5.910520
```

```
---------------------
SurfSara - Kepler
---------------------

##
## 1 GPU per Node
##

Strong - Scaling: global lattice size (32x32x32x96)

sloppy-precision: single
precision:        single

GPUs    GFLOPS          sec
1       243.084000      4.030000
2       478.179000      2.630000
4       939.953000      2.250000
8       1798.240000     1.570000
16      3072.440000     1.730000
32      4365.320000     1.310000

sloppy-precision: double
precision:        double

GPUs    GFLOPS          sec
1       119.786000      6.060000
2       234.179000      3.290000
4       463.594000      2.250000
8       898.090000      1.960000
16      1604.210000     1.480000
32      2420.130000     1.630000

##
## 2 GPU per Node
##

Strong - Scaling: global lattice size (32x32x32x96)

sloppy-precision: single
precision:        single

GPUs    GFLOPS          sec
2       463.041000      2.720000
4       896.707000      1.940000
8       1672.080000     1.680000
16      2518.240000     1.420000
32      3800.970000     1.460000
64      4505.440000     1.430000

sloppy-precision: double
precision:        double

GPUs    GFLOPS          sec
2       229.579000      3.380000
4       450.425000      2.280000
8       863.117000      1.830000
16      1348.760000     1.510000
32      1842.560000     1.550000
64      2645.590000     1.480000
```

## Xeon(Phi) Kernel

### 2. Compile and Run the Xeon(Phi) Part

QPhiX currently requires additional third-party libraries, namely the USQCD libraries `qmp`, `qdpxx`, `qio`, `xpath_reader`, `c-lime` and the XML library `libxml2`. The USQCD libraries can be found under

```shell
https://github.com/usqcd-software
```

and `libxml2` under

```shell
http://xmlsoft.org/
```

while the repository of QPhiX is hosted by Jefferson Lab.
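For convenience, the listed USQCD repositories can be fetched in one pass. The following is only a sketch of where the sources live; the actual build order and the expected directory layout (e.g. cloning `qio` and `c-lime` into the `other_libs` subfolders) are described in section 2.1.

```shell
# Sketch only: clone the USQCD dependencies named above from the
# usqcd-software organisation. libxml2 is distributed separately
# (http://xmlsoft.org/). See section 2.1 for where qdpxx expects
# qio, xpath_reader and c-lime to be placed.
for repo in qmp qdpxx qio xpath_reader c-lime; do
    git clone "https://github.com/usqcd-software/${repo}.git"
done
```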
The QPhiX sources can be cloned via

```shell
git clone https://github.com/JeffersonLab/qphix
```

#### 2.1 Compile

The QPhiX library is based on the QMP communication functions, so QMP has to be set up first. Note that you might have to regenerate the configure script using `autoreconf` from Autotools. QMP can be configured using:

```shell
./configure --prefix=$QMP_INSTALL_DIR CC=mpiicc CFLAGS="-xHOST -std=c99" \
    --with-qmp-comms-type=MPI --host=x86_64-linux-gnu --build=none-none-none
```

Create the install folder and point `$QMP_INSTALL_DIR` to it. Then use

```shell
make
make install
```

to compile QMP and install the necessary files in `$QMP_INSTALL_DIR`.

The current master branch of QPhiX requires the package `qdp++` (`qdpxx`), which has sub-dependencies given by `qio`, `xpath_reader`, `c-lime` and `libxml2`. QDP++ can be configured using (here for a Skylake chip)

```shell
./configure --with-qmp=$QMP_INSTALL_DIR --enable-parallel-arch=parscalar \
    CC=mpiicc CFLAGS="-xCORE-AVX512 -mtune=skylake-avx512 -std=c99" \
    CXX=mpiicpc CXXFLAGS="-axCORE-AVX512 -mtune=skylake-avx512 -std=c++14 -qopenmp" \
    --enable-openmp --host=x86_64-linux-gnu --build=none-none-none \
    --prefix=$QDPXX_INSTALL_DIR --disable-filedb
```

The configure script searches for the additional USQCD libraries in the subfolder `./other_libs`. Clone the required libraries there, for example

```shell
git clone https://github.com/usqcd-software/qio.git
```

and reconfigure. Note that on systems like JSC's JUWELS, `libxml2` has to be compiled additionally; its path can be passed to the configuration of `qdpxx` via

```shell
--with-libxml2=$LIBXML2_INSTALL_DIR
```

Now the QPhiX benchmark kernels can be compiled via CMake. Create a build folder and run

```shell
mkdir build
cd build
cmake -DQDPXX_DIR=$QDPXX_INSTALL_DIR -DQMP_DIR=$QMP_INSTALL_DIR \
    -Disa=avx512 -Dparallel_arch=parscalar \
    -Dhost_cxx=mpiicpc -Dhost_cxxflags="-std=c++17 -O3 -axCORE-AVX512 -mtune=skylake-avx512" \
    -Dtm_clover=ON -Dtwisted_mass=ON -Dtesting=ON \
    -DCMAKE_CXX_COMPILER=mpiicpc -DCMAKE_CXX_FLAGS="-std=c++17 -O3 -axCORE-AVX512 -mtune=skylake-avx512" \
    -DCMAKE_C_COMPILER=mpiicc -DCMAKE_C_FLAGS="-std=c99 -O3 -axCORE-AVX512 -mtune=skylake-avx512" ..
```

The executable `time_clov_noqdp` can then be found in the subfolder `tests`. Note that in the current version the compilation of other test kernels can fail. If that is the case, directly compile the needed executable via

```shell
cd tests
make time_clov_noqdp
```

QPhiX was developed to utilize the computational potential of Intel's Xeon Phi architecture, which has been discontinued. Earlier versions of QPhiX for the Intel Xeon Phi architecture can be compiled without `qdpxx` using a configure-based build.
Namely, for KNCs:

```shell
./configure --enable-parallel-arch=parscalar --enable-proc=MIC --enable-soalen=8 \
    --enable-clover --enable-openmp --enable-cean --enable-mm-malloc \
    CXXFLAGS="-openmp -mmic -vec-report -restrict -mGLOB_default_function_attrs=\"use_gather_scatter_hint=off\" -g -O2 -finline-functions -fno-alias -std=c++0x" \
    CFLAGS="-mmic -vec-report -restrict -mGLOB_default_function_attrs=\"use_gather_scatter_hint=off\" -openmp -g -O2 -fno-alias -std=c99" \
    CXX=mpiicpc CC=mpiicc --host=x86_64-linux-gnu --build=none-none-none --with-qmp=$QMP_INSTALL_DIR
```

or for KNLs:

```shell
./configure --enable-parallel-arch=parscalar --enable-proc=AVX512 --enable-soalen=8 \
    --enable-clover --enable-openmp --enable-cean --enable-mm-malloc \
    CXXFLAGS="-qopenmp -xMIC-AVX512 -g -O3 -std=c++14" CFLAGS="-xMIC-AVX512 -qopenmp -O3 -std=c99" \
    CXX=mpiicpc CC=mpiicc --host=x86_64-linux-gnu --build=none-none-none --with-qmp=$QMP_INSTALL_DIR
```

using the variable `QMP_INSTALL_DIR` introduced above, which points to the install folder of QMP. The executable `time_clov_noqdp` can then be found in the subfolder `./qphix/tests`.

##### 2.1.1 Example compilations on PRACE machines

In this subsection we provide some example compilations on PRACE machines that were used to develop the QCD Benchmark Suite Part 2.

###### 2.1.1.1 JSC - JUWELS

JUWELS (Cluster Module) at the Juelich Supercomputing Centre is equipped with Intel Skylake chips, namely 2× Intel Xeon Platinum 8168 CPUs (2× 24 cores, 2.7 GHz) per compute node. The compilation was done using the following software setup (status 07/21):

```shell
ml Intel/2020.2.254-GCC-9.3.0 IntelMPI/2019.8.254 imkl/2020.4.304 Autotools CMake
```

(`ml` is a shortcut for `module load`). `qmp` was built via

```shell
git clone https://github.com/usqcd-software/qmp.git
cd qmp
autoreconf
./configure --prefix=${PWD}/install CC=mpiicc CFLAGS="-xHOST -std=c99 -O3" \
    --with-qmp-comms-type=MPI --host=x86_64-linux-gnu
make
make install
```

`qdpxx` requires `libxml2`, which was built via

```shell
git clone https://gitlab.gnome.org/GNOME/libxml2.git
cd libxml2
./autogen.sh
./configure --prefix=${PWD}/install
make
make install
```

For `qdpxx` the additionally required libraries were added to the subfolder `other_libs` by

```shell
git clone https://github.com/usqcd-software/qdpxx.git
cd qdpxx/other_libs
git clone https://github.com/usqcd-software/xpath_reader.git
cd xpath_reader/
autoreconf
cd ..
git clone https://github.com/usqcd-software/qio.git
cd qio
autoreconf
cd other_libs/
git clone https://github.com/usqcd-software/c-lime.git
# note that the path of c-lime is ./qdpxx/other_libs/qio/other_libs/c-lime
cd c-lime/
autoreconf
```

Now, `qdpxx` was compiled via

```shell
./configure --with-libxml2=${PWD}/../libxml2/install \
    --with-qmp=/p/project/cecy00/ecy00a/src_bench/qdpxx/../qmp/install \
    --enable-openmp --enable-parallel-arch=parscalar \
    CC=mpiicc CFLAGS="-xHOST -std=c99 -qopenmp" CXX=mpiicpc CXXFLAGS="-xHOST -std=c++11 -qopenmp" \
    --prefix=/p/project/cecy00/ecy00a/src_bench/qdpxx/install --disable-filedb
make
make install
```

Note that `configure` might require `xml2-config`, for which the path can be set via `export PATH+=:$PATH_2_LIBXML2`.

Finally, the computational kernels of QPhiX can be built via

```shell
git clone https://github.com/JeffersonLab/qphix.git
cd qphix/
mkdir build
cd build
cmake -DQDPXX_DIR=${PWD}/../../qdpxx/install -DQMP_DIR=${PWD}/../../qmp/install \
    -Disa=avx512 -Dparallel_arch=parscalar \
    -Dhost_cxx=icpc -Dhost_cxxflags="-std=c++17 -O3 -xHOST" \
    -Dtm_clover=ON -Dtwisted_mass=ON -Dtesting=ON \
    -DCMAKE_CXX_COMPILER=mpiicpc -DCMAKE_CXX_FLAGS="-std=c++17 -O3 -xHOST" \
    -DCMAKE_C_COMPILER=mpiicc -DCMAKE_C_FLAGS="-std=c99 -O3 -xHOST" \
    -DCMAKE_INSTALL_PREFIX=${PWD}/../install ..
cd tests
make time_clov_noqdp
```

###### 2.1.1.2 BSC - MareNostrum III Hybrid partition

The Hybrid partition of MareNostrum III was equipped with KNCs (status 2016). First, the following modules were loaded

```shell
module unload openmpi
module load impi
```

and the necessary links were set with

```shell
source /opt/intel/impi/4.1.1.036/bin64/mpivars.sh
source /opt/intel/2013.5.192/composer_xe_2013.5.192/bin/compilervars.sh intel64
export I_MPI_MIC=enable
export I_MPI_HYDRA_BOOTSTRAP=ssh
```

The QMP library was configured and compiled with

```shell
./configure --prefix=$QMP_INSTALL_DIR CC=mpiicc CFLAGS="-mmic -std=c99" \
    --with-qmp-comms-type=MPI --host=x86_64-linux-gnu --build=none-none-none
make
make install
```

Now the QPhiX package is compiled with

```shell
./configure --enable-parallel-arch=parscalar --enable-proc=MIC --enable-soalen=8 \
    --enable-clover --enable-openmp --enable-cean --enable-mm-malloc \
    CXXFLAGS="-openmp -mmic -vec-report -restrict -mGLOB_default_function_attrs=\"use_gather_scatter_hint=off\" -g -O2 -finline-functions -fno-alias -std=c++0x" \
    CFLAGS="-mmic -vec-report -restrict -mGLOB_default_function_attrs=\"use_gather_scatter_hint=off\" -openmp -g -O2 -fno-alias -std=c99" \
    CXX=mpiicpc CC=mpiicc --host=x86_64-linux-gnu --build=none-none-none --with-qmp=$QMP_INSTALL_DIR
make
```

###### 2.1.1.3 CINES - Frioul

On a test cluster at CINES, the benchmark suite was tested on KNLs (status 2018). The steps are similar to those on BSC.
First, the library paths are set with

```shell
source /opt/software/intel/composer_xe_2015/bin/compilervars.sh intel64
source /opt/software/intel/impi_5.0.3/bin64/mpivars.sh
```

QMP was compiled using:

```shell
./configure --prefix=$QMP_INSTALL_DIR \
    CC=mpiicc CFLAGS="-xMIC-AVX512 -mGLOB_default_function_attrs=\"use_gather_scatter_hint=off\" -openmp -g -O2 -fno-alias -std=c99" \
    --with-qmp-comms-type=MPI --host=x86_64-linux-gnu --build=none-none-none
make
make install
```

QPhiX was configured and compiled using

```shell
./configure --enable-parallel-arch=parscalar --enable-proc=AVX512 --enable-soalen=8 \
    --enable-clover --enable-openmp --enable-cean --enable-mm-malloc \
    CXXFLAGS="-qopenmp -xMIC-AVX512 -g -O3 -std=c++14" CFLAGS="-xMIC-AVX512 -qopenmp -O3 -std=c99" \
    CXX=mpiicpc CC=mpiicc --host=x86_64-linux-gnu --build=none-none-none \
    --with-qmp=/home/finkenrath/benchmark/qmp/install
make
```

#### 2.2 Run

The Accelerator QCD-Benchmarksuite Part 2 provides bash scripts to set up the benchmark runs on the target machines. These bash scripts are:

- `run_ana.sh` : Main script, sets up the benchmark mode and submits the jobs (and analyses the results)
- `prepare_submit_job.sh` : Generates the job scripts
- `submit_job.sh.template` : Template for the submit script

##### 2.2.1 Main script: "run_ana.sh"

The path to the executable has to be set via `$PATH2EXE`. Different scaling modes can be chosen, from strong scaling to weak scaling, via the variable `sca_mode` (`="Strong"` or `="Weak"`). The lattice sizes can be set by `gx` and `gt`. Choose `mode="Run"` for the run mode and `mode="Analysis"` for extracting the GFLOPS. Note that the submission is done via `sbatch`; match this to the queuing system on your target machine.

##### 2.2.2 Job-script generator: "prepare_submit_job.sh"

Add additional options if necessary.

##### 2.2.3 Submit template: "submit_job.sh.template"

The submit template is edited by `prepare_submit_job.sh` to generate the final submit script. The header should be matched to the queuing system of the target machine.

#### 2.3 Example Benchmark Results

The benchmark results for the Xeon Phi benchmark suite were obtained on Frioul, a test cluster at CINES, and on the Hybrid partition of MareNostrum III at BSC. Frioul has one KNL card per node, while the Hybrid partition of MareNostrum III is equipped with two KNCs per node. The data on Frioul were generated using the bash scripts provided by the QCD Accelerator Benchmark Suite Part 2 for the two test cases: "Strong-Scaling" with a lattice size of 32^3x96 and "Weak-Scaling" with a local lattice size of 48^3x24 per card. For the data generated on MareNostrum III, results for the "Strong-Scaling" mode on a 32^3x96 lattice are shown. The benchmark uses a random gauge configuration and the Conjugate Gradient solver to solve a linear equation involving the clover-improved Wilson Dirac operator.
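For orientation, the two test cases map onto the `run_ana.sh` variables described in section 2.2.1 roughly as follows; this is a sketch only, and the exact variable syntax may differ in the script.

```shell
# Strong scaling on a 32^3 x 96 global lattice (sketch, see section 2.2.1)
sca_mode="Strong"
gx=32
gt=96
mode="Run"        # rerun with mode="Analysis" to extract the GFLOPS

# Weak scaling with a 48^3 x 24 local lattice per card (sketch)
# sca_mode="Weak"; gx=48; gt=24
```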
```
---------------------
Frioul - KNLs
---------------------

Strong - Scaling: global lattice size (32x32x32x96)

precision: single

KNLs    GFLOPS
1       340.75
2       627.612
4       1111.13
8       1779.34
16      2410.8

precision: double

KNLs    GFLOPS
1       328.149
2       616.467
4       1047.79
8       1616.37

Weak - Scaling: local lattice size (48x48x48x24)

precision: single

KNLs    GFLOPS
1       348.304
2       616.697
4       1214.82
8       2425.45
16      4404.63

precision: double

KNLs    GFLOPS
1       172.303
2       320.761
4       629.79
8       1228.77
16      2310.63
```

```
---------------------
MareNostrum III - KNC's
---------------------

Strong - Scaling: global lattice size (32x32x32x96)

precision: single - 1 Cards per Node

KNCs    GFLOPS
2       103.561
4       200.159
8       338.276
16      534.369
32      815.896

precision: single - 2 Cards per Node

KNCs    GFLOPS
4       118.995
8       212.558
16      368.196
32      605.882
64      847.566
```